I have the money shot for my LCA presentation


Thanks to Eric Whitney’s benchmarking results, I have the money shot for my upcoming talk at LCA 2011 in Brisbane, which will be about how to improve scalability in the Linux kernel, using as a case study the series of scalability patches I developed during the 2.6.34, 2.6.35, and 2.6.36 cycles (and which went into the kernel during the subsequent merge windows).

These benchmarks were done on a 48-core AMD system (8 sockets, 6 cores/socket) using a 24 SAS-disk hardware RAID array.  This is the sort of system on which XFS has traditionally shone, and on which ext3 has traditionally not scaled very well.  We’re now within striking distance of XFS, and there are more improvements to ext4 which I have planned that should help its performance even further.   This is the kind of performance improvement that I’m totally psyched to see!

15 thoughts on “I have the money shot for my LCA presentation”

  1. Is there any possibility of these alterations improving the performance of ext3 and ext2 as well?

    Yes, I run some legacy systems.

  2. @1: oiaohm:

    It’s not likely that these improvements will be pushed into ext3 and ext2. The whole point of ext3 at this point is to keep things stable, because there are lots of companies who are deeply afraid of change. Performance improvement implies change; and change scares people. Even if we did port it to ext2/3, it would mean moving to a new kernel, and most users who run legacy systems fear using new kernels. (They’re probably using some ancient userspace like RHEL2 or some such, for which new kernels would be a real challenge.)

  3. What is the ‘patched’ EXT4 offering that the vanilla is not? Also, isn’t a journal one of the needs/benefits of a modern file system? If so, why is anyone excited that the ‘non-modern-feature’ version of this filesystem is getting close in performance to a modern file system? I guess some context would be nice on this story.

  4. Great job!
    On another note, I’ve heard you say you thought btrfs was the future?
    So I assume the reason you’re working on ext4 is b/c you’re at Google, and they want to migrate to it from ext2.
    Once you’re done with ext4/ Google, do you have plans to work on btrfs at all?

  5. Eagerly looking forward to it. But will it ever be possible to create and use ext4 partitions > 15TB?
    Or is it just assumed that’s what btrfs will be for?

    (As far as I can tell – which I must disclaim is not really very far at all – the kernel code should be in place for ext4 to handle it but the tools still have never had the work done to allow 16+TB partitions to be created and fsck’ed? Or am I mistaken about that?)

  6. Hi,

    Great work, are you planning on publishing a more detailed paper on this?

    By the way. The link to “2.6.35” is broken.

    Reply from Ted: Oops, thanks for pointing that out. I’ve fixed the link.

  7. @5: What is the ‘patched’ EXT4 offering that the vanilla is not? Also, isn’t a journal one of the needs/benefits of a modern file system? If so, why is anyone excited that the ‘non-modern-feature’ version of this filesystem is getting close in performance to a modern file system?

    The ext4 patch (which will be in 2.6.37) bypasses the buffer cache layer for buffered I/O submission, so that instead of submitting writes 4k at a time, which is what ext3 does, we submit potentially several megabytes at a time when we can. I’ll talk more about it at LCA (so come to Brisbane to hear more), but lockstat showed that the major bottleneck to scalability was the block I/O submission locks. When I fixed that, it was a major improvement.
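A toy model of why this matters (illustrative Python only, not the actual kernel code; all names here are made up): if every 4 KiB block takes the shared submission lock separately, lock traffic grows with the size of the write, while submitting multi-megabyte extents takes the lock a constant, much smaller number of times.

```python
import threading

BLOCK = 4096  # 4 KiB, ext3-style per-block submission unit

class SubmissionQueue:
    """Toy stand-in for the block I/O submission path and its shared lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.lock_acquisitions = 0
        self.bytes_submitted = 0

    def submit(self, nbytes):
        # Every call contends on the shared lock -- the kind of hot spot
        # that shows up in lockstat output.
        with self.lock:
            self.lock_acquisitions += 1
            self.bytes_submitted += nbytes

def write_per_block(q, total):
    """ext3-style: one submission (and one lock round-trip) per 4 KiB block."""
    for _ in range(total // BLOCK):
        q.submit(BLOCK)

def write_batched(q, total, extent=4 * 1024 * 1024):
    """Patched-ext4-style: submit multi-megabyte extents when possible."""
    off = 0
    while off < total:
        chunk = min(extent, total - off)
        q.submit(chunk)
        off += chunk

total = 64 * 1024 * 1024  # a 64 MiB buffered write
q1, q2 = SubmissionQueue(), SubmissionQueue()
write_per_block(q1, total)
write_batched(q2, total)
print(q1.lock_acquisitions)  # 16384 lock acquisitions
print(q2.lock_acquisitions)  # 16 lock acquisitions
```

Same bytes written either way; the batched path simply acquires the contended lock three orders of magnitude less often, which is where the scalability win comes from on a 48-core box.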

    As to the question of “why are we benchmarking in no-journal mode”, one of the major uses of ext4 at Google is to provide the object store for clustered file systems. Since the cluster file systems are replicating objects on multiple machines, the journal doesn’t provide functionality which is needed for this workload. It’s much faster to avoid the penalty of using the journal for the common case, and if the system crashes, (a) ext4 fsck’s pretty fast for 1T and 2T disks, and (b) while the system is rebooting and fsck’ing, the other N-1 servers can service requests for that object until the machine is back. And if the machine can’t recover the file system, or it takes too long for the machine to come back, the objects on the failed machine can simply be copied to some of the several hundred machines participating in the cluster file system cell, so that the replication factor N=3 (or whatever it is) for those objects is restored.
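For readers who want to try the configuration being benchmarked: the no-journal mode is a standard e2fsprogs feature flag, selectable at mkfs time or toggled afterwards. (This is a sketch; `/dev/sdX1` is a placeholder device name.)

```shell
# Create an ext4 filesystem with no journal (/dev/sdX1 is a placeholder)
mkfs.ext4 -O ^has_journal /dev/sdX1

# Or remove the journal from an existing, unmounted ext4 filesystem
tune2fs -O ^has_journal /dev/sdX1
e2fsck -f /dev/sdX1
```

Running without a journal only makes sense when, as described above, something else (here, cluster-level replication) is providing the durability guarantees.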

  8. @6: On another note, I’ve heard you say you thought btrfs was the future?
    So I assume the reason you’re working on ext4 is b/c you’re at Google, and they want to migrate to it from ext2. Once you’re done with ext4/ Google, do you have plans to work on btrfs at all?

    There are certain things that are just much easier to do with btrfs, because its developers could start from scratch with their file system layout. That being said, it looks like btrfs is optimized for certain specific use cases. As I mentioned in comment #8, a number of cluster file systems really don’t have any need for journaling or COW snapshots, and if you don’t need those features, it’s better if you don’t have to pay the overhead of having them. Ext4 is fairly unique amongst the modern file systems in that the journal can be optionally disabled.

    I also have a lot of personal interest in seeing how far ext4 can be pushed; Amir Goldstein has figured out a way to do COW snapshots for ext4. If the code proves to be clean, maintainable, and no-overhead when it is disabled, I might very well merge it. If nothing else, competition is healthy and it’s a great way to keep us all honest.

    Btrfs is really Chris Mason’s show, and he has a great team of developers working on it. I don’t currently have any plans to switch to working on it, and as long as there are people who are interested in using ext4, and working with me to develop it, I’ll probably be happy to continue to be a maintainer for ext4/e2fsprogs.

  9. @7: But will it ever be possible to create and use ext4 partitions > 15TB?

    Epicanis,

    The support for large block numbers is in the e2fsprogs git tree, but because this fall has been incredibly hectic (I was the program chair for the Kernel Summit and the Linux Plumbers Conference this week, among other things), I haven’t released the 1.42-WIP version of e2fsprogs that has the > 15TB support. Now that the conference is over, I have some tutorials which I am teaching at the LISA conference next week, followed by a vacation to Hawaii. But I will get a 1.42-WIP pre-release version of e2fsprogs out before the end of the year.

    If you want to try it sooner, the master branch of the e2fsprogs git tree has the > 15TB support. There’s one known bug where e2fsck won’t fix one type of file system corruption correctly, but I’ll try to get that fixed in the next week or two. (I have the patch; I just have to validate and test it, and recover from conference planning psychosis. But it’s coming soon, I promise! :-)
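As a sketch of what trying the master branch might look like (the repository path and the `64bit` feature flag are my understanding of the e2fsprogs conventions of that era, and a sparse image file stands in for a real > 16TB array):

```shell
# Build e2fsprogs from the master branch, which has the > 16TB support
git clone git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
cd e2fsprogs
./configure && make

# Create a sparse 17TB test image and format it; the 64bit feature is
# what allows block numbers beyond 2^32 (-F forces mke2fs to accept a
# regular file as the target)
truncate -s 17T /tmp/big.img
misc/mke2fs -F -t ext4 -O 64bit /tmp/big.img
```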

  10. So 11+ years of playing copy-cat to XFS to “almost” be equal? Not sure that is something to be excited about. I guess there is always the major argument that it’s easy to migrate the fs from ext2/3 to ext4, but aside from that, what other MAJOR feature does ext4 have over XFS?
    Given the many scaling challenges facing storage right now, both in terms of size (many TBytes) and IOPS (SSDs), where is ext4 headed to meet those challenges?

  11. @12: Russell,

    I don’t think of it as our trying to copy-cat XFS. XFS is what it is, and it’s a good file system. I have a lot of respect for its designers and the people who currently maintain it.

    Ext4 was always designed for the “common case Linux workloads/hardware”, and for a long time, 48 cores/CPUs and large RAID arrays were in the category of “exotic, expensive hardware”; indeed, for much of the ext2/3 development time, most of the ext2/3 developers didn’t even have access to such hardware. One of the main reasons why I am working on scalability to 32–64 cores is because such hardware will become commonly available Real Soon Now, and high-throughput devices such as PCIe-attached flash devices will start becoming available at reasonable prices soon as well.

    I could point at things that we do better, such as the fact that the file system format is simpler/easier to understand, the fact that I think we have better tools for low-level manipulation of the file system data structures, the fact that e2fsck has a pretty comprehensive regression test suite, the fact that ext2/3/4 has off-line shrink capability whereas XFS doesn’t, and certain benchmarks where we’ve tended to do better than XFS (e.g., the boxacle “mail server workload”), but in reality, I don’t see it as a win-lose competition. XFS does not have to lose for ext4 to win, and vice versa.
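The off-line shrink mentioned above uses the standard e2fsprogs tools; roughly, it looks like this (device and mount point names are placeholders):

```shell
# Off-line shrink of an ext2/3/4 filesystem (XFS has no shrink support).
# /dev/sdX1 and /mnt/data are placeholders; the fs must be unmounted.
umount /mnt/data
e2fsck -f /dev/sdX1        # resize2fs requires a clean fsck pass first
resize2fs /dev/sdX1 100G   # shrink the filesystem to 100 GiB
```

(Growing can be done on-line; it is only shrinking that requires the filesystem to be unmounted.)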

    I like hacking on file systems. I enjoy making them faster. I like the challenge of making a file system which will work well on everything from cell phones to large servers with RAID arrays. There are others who enjoy working on this adventure with me, and that’s great. The developers of XFS and btrfs are “competitors” the same way that Bruce Evans and I were “competitors” when Bruce and I worked on the FreeBSD and Linux serial drivers, respectively. We shared design ideas and benchmark results with each other, and in so doing, we each made our serial drivers faster, more robust, and lighter on CPU overhead than they had been before. Similarly, I talk to XFS and btrfs developers, and we help each other out. Chris Mason has even sent patches that help improve ext4, and I’ve helped to convince companies to support btrfs development.

    It does seem to me that much more of the rivalries come from the “fanboys” and “fangirls”, than it does from the developers.

  12. @14: Kubrick,

    That’s an interesting question. I’m not sure how JFS would have performed. I could ask Eric to add JFS to the comparison, but as I’ve said before, I’m not doing this to score win/lose points versus other file systems. And he does this as a service to the ext4 development community on his own time, so I don’t want to ask him to do more work. We can see if he’s interested, but it’s not high on my priority list. I’d much rather have him test out various scalability patches that I might come up with. Not all of them are as successful as this one. :-)

    Still, if you want some indication of how JFS would perform, these graphs give some hints:

    • 1 thread
    • 16 threads
    • 128 threads

        A couple of caveats here: (a) these measurements are from almost two years ago, using 2.6.27.9, and (b) the hardware configuration is quite different; the system used for those benchmarks was using LVM RAID 0. Still, JFS hasn’t had significant development over the last two years, and both XFS and ext4 have had a lot of improvements, so I think it’s pretty likely JFS would perform relatively poorly.
