Category Archives: Linux

Android will be using ext4 starting with Gingerbread

I received a trackback from Tim Bray’s Saving Data Safely post on the Android Developer’s blog to my Don’t fear the fsync! blog entry, so I guess the cat’s out of the bag.  Starting with Gingerbread, newer Android phones (starting with the Nexus S) will be using the ext4 file system.  Very cool!  So just as IBM used to promote Linux by saying that it was scalable enough to run on everything between watches and mainframes, I can now talk about ext4 as running in production on cell phones to Google data centers.

Google has a problem retaining great engineers? Bullcrap.

Once again, there’s been another story about how Google is having trouble retaining talent.   Despite all of Eric Schmidt’s attempts to tell folks that Google’s regretted attrition rate has not changed in seven years, this story just doesn’t seem to want to die.   (And those stories about Google paying $3.5 million and $7 million to keep engineers from defecting to Facebook?   As far as I know, total bull.  I bet it’s something made up by some Facebook recruiter who needed to explain how she let a live prospect get away.  🙂)

At least for me, the complete opposite is true.   There are very few companies where I can do the work that I want to do, and Google is one of them.   A startup is totally the wrong place for me.   Why?  Because if you talk to any venture capitalist, a startup has one and only one reason to exist: to prove that it has a scalable, viable business model.   Take one example: as Business Week described, while one now-famous diaper startup was still proving that its business model worked, it purchased its diapers at the local BJ’s and shipped them via FedEx.   Another startup, Chegg, proved its business model by drop-shipping text books to its first customers.  (The venture capitalist Mike Maples talked about this in a brilliant talk at the Founders Showcase; the Chegg example starts around 20:50 minutes in, but I’d recommend listening to the whole thing, since it’s such a great talk.)   You don’t negotiate volume discounts with textbook publishers, or build huge warehouses to hold all of the diapers that you’re going to buy, until you’ve proven that you have a business model that works.

I have the money shot for my LCA presentation

Thanks to Eric Whitney’s benchmarking results, I have my money shot for my upcoming 2011 LCA talk in Brisbane, which will be about how to improve scalability in the Linux kernel, using as a case study the series of ext4 scalability patches that were developed during the 2.6.34, 2.6.35, and 2.6.36 development cycles (and went into the kernel during the subsequent merge windows).

These benchmarks were done on a 48-core AMD system (8 sockets, 6 cores/socket) using a 24-disk SAS hardware RAID array.  This is the sort of system on which XFS has traditionally shined, and on which ext3 has traditionally not scaled very well.  We’re now within striking distance of XFS, and I have more ext4 improvements planned that should help its performance even further.   This is the kind of performance improvement that I’m totally psyched to see!

Don’t fear the fsync!

After reading the comments on my earlier post, Delayed allocation and the zero-length file problem, as well as some of the comments on the Slashdot story and on the Ubuntu bug, it’s become very clear to me that there are a lot of myths and misplaced concerns about fsync() and how best to use it.   I thought it would be appropriate to correct as many of these misunderstandings about fsync() as possible in one comprehensive blog posting.

As the Eat My Data presentation points out very clearly, the only way that POSIX allows for requesting that data written to a particular file descriptor be safely stored on stable storage is via the fsync() call.  Linux’s close(2) man page makes this point very clearly:

A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2).

Why don’t application programmers follow these sage words?  These three reasons are most often given as excuses:

  1. (Perceived) performance problems with fsync()
  2. The application only needs atomicity, but not durability
  3. The fsync() causing the hard drive to spin up unnecessarily in laptop_mode

Let’s examine each of these excuses one at a time, to see how valid they really are.

(Perceived) performance problems with fsync()

Most of the bad publicity around fsync() originated with the now-infamous problem with Firefox 3.0 that showed up about a year ago, in May 2008.   What happened with Firefox 3.0 was that the primary user interface thread called the SQLite library each time the user clicked on a link to go to a new page. The SQLite library called fsync(), which in ext3’s data=ordered mode caused a large latency that was visible to the user if a large file copy was happening in another process.

Nearly all of the reported delays were a few seconds, which would be expected; normally there isn’t that much dirty data that needs to be flushed out on a Linux system, even if it is very busy.   For example, consider a laptop downloading an .iso image from a local file server; if the laptop has exclusive use of a 100 megabit/second ethernet link, and the server has the .iso file in cache, or has a nice fast RAID array so it is not the bottleneck, then in the best case the laptop will be able to download data at a rate of 10-12 MB/second.  Assuming the default 5 second commit interval, that means that in the worst case there will be at most 60 megabytes which must be written out before the commit can proceed.  A reasonably modern 7200 rpm laptop drive can write between 60 and 70 MB/second.   (The Seagate Momentus 7200.4 laptop drive is reported to be able to deliver 85-104 MB/second, but I can’t find it for sale anywhere for love or money.)   In this example, an fsync() will trigger a commit and might need to take a second while the download is going on; perhaps half a second if you have a really fast 7200 rpm drive, and maybe 2-3 seconds if you have a slow 5400 rpm drive.

(Jump to Sidebar: What about those 30 second fsync reports?)

Obviously, you can create workloads that aren’t bottlenecked on the maximum ethernet download speed, or on the speed of reading from a local disk drive; for example, “dd if=/dev/zero of=big-zero-file” will create a very large number of dirty pages that must be written to the hard drive at the next commit or fsync() call. It’s important to remember, though, that fsync() doesn’t create any extra I/O (although it may remove some opportunities to optimize away double writes); fsync() just changes when the I/O gets done, and whether it gets done synchronously or asynchronously. If you create a large number of pages that need to be flushed to disk, sooner or later that will have a significant and unfortunate effect on your system’s performance.  Fsync() might make the cost more visible, but if the fsync() is done off the main UI thread, the fact that fsync() triggers a commit won’t actually disturb other processes doing normal I/O; in ext3 and ext4, we start a new transaction to take care of new file system operations while the committing transaction completes.

The final observation I’ll make is that part of the problem is that Firefox as an application wants to make a huge number of updates to state files, and was concerned about not losing that information even in the face of a crash.  Every application writer should be asking themselves whether this sort of thing is really necessary.   For example, doing some quick measurements using ext4, I determined that Firefox was responsible for 2.54 megabytes written to the disk for each web page visited by the user (and this doesn’t include writes to the Firefox cache; I symlinked the cache directory to a tmpfs directory mounted on /tmp to reduce the write load on my SSD).   So those 2.54 megabytes are just for Firefox’s cookie cache and the Places database that maintains its “Awesome bar”.  Is that really worth it?   If you visit 400 web pages in a day, that’s 1GB of writes to your SSD, and if you write more than 20GB/day, the Intel SSD will enable its “write endurance management feature”, which slows down the performance of the drive.   In light of that, exactly how important is it to update those darned SQLite databases after every web click?  What if Firefox saved a list of URLs that have been visited, and only updated the databases every 30 or 60 minutes instead?   Is it really that important that every last web page you browse be saved if the system crashes?  An fsync() call every 15, 30, or 60 minutes, done by a thread which doesn’t block the application’s UI, would never have been noticed and would not have started the firestorm in Firefox’s bugzilla #421482.   Very often, after a little thinking, a small change in the application is all that’s necessary to really optimize the application’s fsync() usage.

(Skip over the sidebar — if you’ve already read it).

Sidebar: What about those 30 second fsync reports?

If you read through the Firefox bugzilla entry, you’ll find reports of fsync delays of 30 seconds or more. That tale has grown in the retelling, and I’ve seen some hyperbolic claims of five-minute delays. Where did that come from? Well, if you look at those claims, you’ll find they were using a very read-heavy workload, and/or they were using the ionice command to set a real-time I/O priority. For example, something like “ionice -c 1 -n 0 tar cvf /dev/null big-directory”.

This will cause some significant delays, first of all because “ionice -c 1” causes the process to have a real-time I/O priority, such that any I/O requests issued by that process will be serviced before all others.   Secondly, even without the real-time I/O priority, the I/O scheduler naturally prioritizes reads as higher priority than writes because normally processes are waiting for reads to complete, but writes are normally asynchronous.

This is not at all a realistic workload, and it is even more laughable that some people thought this might be an accurate representation of the I/O workload of a kernel compile. These folks had never tried the experiment, or measured how much I/O goes on during a kernel compile. If you try it, you’ll find that a kernel compile sucks up a lot of CPU, and doesn’t actually do that much I/O. (In fact, that’s why an SSD only speeds up a kernel compile by about 20% or so, and that’s in the completely cold cache case. If the commonly used include files are already in the system’s page cache, the performance improvement from the SSD is much less.)

Jump back to reading Performance problems with fsync.

The atomicity not durability argument

One argument that has commonly been made in the various comment streams is that when replacing a file by writing a new file and then renaming “file.new” to “file”, most applications don’t need a guarantee that the new contents of the file are committed to stable store at a certain point in time; only that either the new or the old contents of the file will be present on the disk. So the argument is essentially that the sequence:

  • fd = open(“foo.new”, O_WRONLY);
  • write(fd, buf, bufsize);
  • fsync(fd);
  • close(fd);
  • rename(“foo.new”, “foo”);

… is too expensive, since it provides “atomicity and durability”, when in fact all the application needed was “atomicity” (i.e., either the new or the old contents of foo should be present after a crash), but not durability (i.e., the application doesn’t need the new version of foo committed right now, but rather at some convenient time in the future when the OS gets around to it).

This argument is flawed for two reasons. First of all, the sequence above provides exactly the desired “atomicity without durability”.   It doesn’t guarantee which version of the file will appear in the event of an unexpected crash; if the application needs a guarantee that the new version of the file will be present after a crash, it’s necessary to fsync the containing directory. Secondly, as we discussed above, fsync() really isn’t that expensive, even in the case of ext3 and data=ordered; remember, fsync() doesn’t create extra I/O’s, although it may introduce latency as the application waits for some of the pending I/O’s to complete. If the application doesn’t care about exactly when the new contents of the file will be committed to stable store, the simplest thing to do is to execute the above sequence (open-write-fsync-close-rename) in a separate, asynchronous thread. And if the complaint is that this is too complicated, it’s not hard to put this in a library. For example, there is currently discussion on the gtk-devel-list about adding the fsync() call to g_file_set_contents(). Maybe if someone asks nicely, the glib developers will add an asynchronous version of this function which runs g_file_set_contents() in a separate thread. Voila!

Avoiding hard drive spin-ups with laptop_mode

Finally, as Nathaniel Smith said in Comment #111 of my previous post:

The problem is that I don’t, really, want to turn off fsync’s, because I like my data. What I want to do is to spin up the drive as little as possible while maintaining data consistency. Really what I want is a knob that says “I’m willing to lose up to minutes of work, but no more”. We even have that knob (laptop mode and all that), but it only works in simple cases.

This is a reasonable concern, and the way to address it is to enhance laptop_mode in the Linux kernel. Bart Samwel, the author and maintainer of laptop_mode, actually discussed this idea with me last month at FOSDEM.  Laptop_mode already adjusts /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs based on the configuration parameter MAX_LOST_WORK_SECONDS, and it also adjusts the file system commit time to MAX_LOST_WORK_SECONDS (for ext3; it needs to be taught to do the same thing for ext4, which is a simple patch). All that is needed beyond that is a kernel patch to allow fsync() calls to be disabled while in laptop_mode. The kernel already knows that it is in laptop_mode, and when it notices that the disk has spun up, it syncs everything out to disk; once the energy has been spent spinning up the hard drive, we might as well write out everything in memory that needs to be written. Hence, a patch which allows fsync() calls to be disabled while in laptop_mode should do pretty much everything Nate has asked for. I need to check whether laptop_mode does this already, but if it doesn’t force a file system commit when it detects that the hard drive has spun up, it should obviously do that as well.

(In addition to having a way to globally disable fsync()’s, it may also be useful to have a way to selectively disable fsync()’s on a per-process basis, or on the flip side, to exempt some processes from a global fsync-disable flag. This may be useful if there are some system daemons that really do want to wake up the hard drive — and once the hard drive is spinning, naturally everything else that needs to be pushed out to stable store should be immediately written.)

With this relatively minor change to the kernel’s laptop_mode support, it should be possible to achieve the result that Nate desires, without needing to force applications to worry about this issue; applications should be able to simply use fsync() without fear.


As we’ve seen, the reasons most people think fsync() should be avoided really don’t hold water.   The fsync() call really is your friend, and it’s really not the villain that some have made it out to be. If used intelligently, it can provide your application with a portable way of assuring that your data has been safely written to stable store, without causing a user-visible latency in your application. The problem is getting people to not fear fsync(), understand fsync(), and then learning the techniques to use fsync() optimally.

So just as there has been a Don’t fear the penguin campaign, maybe we also need to have a “Don’t fear the fsync()” campaign.  All we need is a friendly mascot and logo for a “Don’t fear the fsync()” campaign. Anybody want to propose an image?  We can make some T-shirts, mugs, bumper stickers…

Delayed allocation and the zero-length file problem

A recent Ubuntu bug has gotten slashdotted, and has started raising a lot of questions about the safety of using ext4. I’ve actually been meaning to blog about this for a week or so, but between a bout of the stomach flu and a huge todo list at work, I simply haven’t had the time.

The essential “problem” is that ext4 implements something called delayed allocation. Delayed allocation isn’t new to Linux; XFS has had delayed allocation for years. Pretty much all modern file systems have delayed allocation: according to the Wikipedia Allocate-on-flush article this includes HFS+, Reiser4, and ZFS, and btrfs has this property as well. Delayed allocation is a major win for performance, both because it allows writes to be streamed more efficiently to disk, and because it can reduce file fragmentation, so that files can later be read more efficiently from disk.

This sounds like a good thing, right? It is, except for badly written applications that don’t use fsync() or fdatasync(). Application writers had gotten lazy, because ext3 by default has a commit interval of 5 seconds and uses a journalling mode called data=ordered. What does this mean? The journalling mode data=ordered means that before a commit takes place, any data blocks associated with inodes that are about to be committed in that transaction will be forced out to disk. This is primarily done for security reasons; if this were not done (which would be the case if the disk were mounted with the mount option data=writeback), then any newly allocated blocks might still contain previous data belonging to some other file or user, and after a crash, accessing that file might result in a user seeing uninitialized data that had previously belonged to another user (say, their e-mail or their p0rn), which would be a Bad Thing from a security perspective.

However, this had the side effect of essentially guaranteeing that anything that had been written would be on disk after 5 seconds. (This is somewhat modified if you are running on batteries and have enabled laptop mode, but we’ll ignore that for the purposes of this discussion.) Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data — even though POSIX never really made any such guarantee. This became especially noticeable on Ubuntu, which uses many proprietary, binary-only drivers, which caused some Ubuntu systems to become highly unreliable, especially for the Alpha releases of Ubuntu Jaunty, with the net result that some Ubuntu users have become used to their machines regularly crashing. (I use bleeding-edge kernels, and I don’t see the kind of unreliability that apparently at least some Ubuntu users are seeing, which came as quite a surprise to me.)

So what are the solutions to this? One thing is that the applications could simply be rewritten to properly use fsync() and fdatasync(). This is what is required by POSIX, if you want to be sure that data has gotten written to stable storage. Some folks have resisted this suggestion on two grounds: first, that it’s too hard to fix all of the applications out there, and second, that fsync() is too slow. The perception that fsync() is too slow was most recently caused by a problem with Firefox 3.0. As Mike Shaver put it:

On some rather common Linux configurations, especially using the ext3 filesystem in the “data=ordered” mode, calling fsync doesn’t just flush out the data for the file it’s called on, but rather on all the buffered data for that filesystem.

Fundamentally, the problem is caused by “data=ordered” mode.  It can be avoided by mounting the filesystem using “data=writeback”, or by using a filesystem that supports delayed allocation — such as ext4.  This is because if you have a small SQLite database which you are fsync()’ing, and in another process you are writing a large 2 megabyte file, the 2 megabyte file won’t be allocated right away, and so the fsync operation will not force the dirty blocks of that 2 megabyte file to disk; since the blocks haven’t been allocated yet, there is no security issue to worry about regarding the previous contents of newly allocated blocks if the system were to crash at that point.

Another solution is a set of patches to ext4 that have been queued for the 2.6.30 merge window.  These three patches (with git id’s bf1b69c0, f32b730a, and 8411e347) will cause any delayed-allocation blocks of a file to be allocated immediately when that file is replacing an existing file.   This gets done for files which were truncated using ftruncate() or opened via O_TRUNC when the file is closed, and when a file is renamed on top of an existing file.  This solves the most annoying set of problems, where an existing file gets rewritten and, thanks to the delayed allocation semantics, ends up replaced by a zero-length file.   However, it will not solve the problem for newly created files, of course, which will still have delayed allocation semantics.

Yet another solution would be to mount ext4 volumes with the nodelalloc mount option.   This will cause a significant performance hit, but apparently some Ubuntu users are happy using proprietary Nvidia drivers, even if it means that when they are done playing World of Goo, quitting the game causes the system to hang and they must hard-reset the system.   For those users, it may be that nodelalloc is the right solution for now — personally, I would consider that kind of system instability to be completely unacceptable, but I guess gamers have very different priorities than I do.

A final solution, which might not be that hard to implement, would be a new mount option, data=alloc-on-commit.    This would work much like data=ordered, with the additional constraint that all blocks with delayed allocation would be allocated and forced out to disk before a commit takes place.   This would probably give slightly better performance than mounting with nodelalloc, but it shares many of the disadvantages of nodelalloc, including making fsync() potentially very slow, because it would force all dirty blocks out to disk.

What’s the best path forward?   For now, I would recommend that Ubuntu gamers whose systems crash all the time, and who want to use ext4, use the nodelalloc mount option.   I haven’t quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they won’t have to worry about files getting lost as a result of delayed allocation.    Long term, application writers who are worried about files getting lost on an unclean shutdown really should use fsync().    Modern filesystems are all going to be using delayed allocation, because of its inherent performance benefits, and whether you think the future belongs to ZFS, or btrfs, or XFS, or ext4 — all of these filesystems use delayed allocation.

What do you think?   Do you think all of these filesystems have gotten things wrong, and delayed allocation is evil?   Should I try to implement a data=alloc-on-commit mount option for ext4?   Should we try to fix applications to properly use fsync() and fdatasync()?

SSD’s, Journaling, and noatime/relatime

On occasion, you will see the advice that the ext3 file system is not suitable for Solid State Disks (SSD’s) due to the extra writes caused by journaling — and so Linux users using SSD’s should use ext2 instead. However, is this folk wisdom actually true? This weekend, I decided to measure exactly what the write overhead of journaling is in actual practice.

For this experiment I used ext4, since I recently added a feature to track the amount of writes to the file system over its lifetime (to better gauge the wear and tear on an SSD). Ext4 also has the advantage that (starting in 2.6.29), it can support operations with and without a journal, allowing me to do a controlled experiment where I could manipulate only that one variable. The test workload I chose was a simple one:

  • Clone a git repository containing a linux source tree
  • Compile the linux source tree using make -j2
  • Remove the object files by running make clean

For the first test, I ran the test using no special mount options, and the only difference being the presence or absence of the has_journal feature. (That is, the first file system was created using mke2fs -t ext4 /dev/closure/testext4, while the second file system was created using mke2fs -t ext4 -O ^has_journal /dev/closure/testext4.)

Amount of data written (in megabytes) on an ext4 filesystem
Operation with journal w/o journal percent change
git clone 367.7 353.0 4.00%
make 231.1 203.4 12.0%
make clean 14.6 7.7 47.3%


What the results show is that metadata-heavy workloads, such as make clean, do result in almost twice the amount of data written to disk. This is to be expected, since all changes to metadata blocks are first written to the journal and the journal transaction committed before the metadata is written to its final location on disk. However, for more common workloads, where we are writing data as well as modifying filesystem metadata blocks, the difference is much smaller: 4% for the git clone, and 12% for the actual kernel compile.

The noatime mount option

Can we do better? Yes, if we mount the file system using the noatime mount option:

Amount of data written (in megabytes) on an ext4 filesystem mounted with noatime
Operation with journal w/o journal percent change
git clone 367.0 353.0 3.81%
make 207.6 199.4 3.95%
make clean 6.45 3.73 42.17%


This reduces the extra cost of the journal in the git clone and make steps to just under 4%. What this shows is that most of the extra metadata cost without the noatime mount option was caused by updates to the last-access times of kernel source files and directories.

The relatime mount option

There is a newer alternative to the noatime mount option, relatime. The relatime mount option updates the last access time of a file only if the last modified or last inode-changed time is newer than the last accessed time. This allows programs to determine whether a file has been read since it was last modified. The usual (actually, the only) example given of such an application is the mutt mail reader, which uses the last accessed time to determine if new mail has been delivered to Unix mail spool files. Unfortunately, relatime is not free. As you can see below, it has roughly double the overhead of noatime (but roughly half the overhead of using the standard Posix atime semantics):

Amount of data written (in megabytes) on an ext4 filesystem mounted with relatime
Operation with journal w/o journal percent change
git clone 366.6 353.0 3.71%
make 216.8 203.7 6.04%
make clean 13.34 6.97 45.75%


Personally, I don’t think relatime is worth it. There are other ways of working around the issue with mutt — for example, you can use Maildir-style mailboxes, or you can use mutt’s check_mbox_size option. If the goal is to reduce unnecessary disk writes, I would mount my file systems using noatime, and use other workarounds as necessary. Alternatively, you can use chattr +A to set the noatime flag on all files and directories, and then clear the flag for the Unix mbox files where you care about the atime updates. Since the noatime flag is inherited by default, you can get this behaviour by running chattr +A /mntpt right after the filesystem is first created and mounted; all files and directories created in that file system will then inherit the noatime flag.

Comparing ext3 and ext2 filesystems

Amount of data written (in megabytes) on an ext3 and ext2 filesystem
Operation ext3 ext2 percent change
git clone 374.6 357.2 4.64%
make 230.9 204.4 11.48%
make clean 14.56 6.54 55.08%


Finally, just to round things out, I tried the same experiment using the ext3 and ext2 file systems. The differences between these results and the ones involving ext4 are the result of the fact that ext2 does not have the directory index feature (aka htree support), and that both ext2 and ext3 do not have extents support, but rather use the less efficient indirect block scheme. The ext2 and ext3 allocators are also somewhat different from each other, and from ext4’s. Still, the results are substantially similar to the first set of Posix-compliant atime update numbers. (I didn’t bother to do noatime and relatime benchmark runs with ext2 and ext3, but I expect the results would be similar.)


So given all of this, where did the common folk wisdom that ext3 is not suitable for SSD’s come from? Some of it may have been from people worrying too much about extreme workloads such as “make clean”; but while doubling the write load sounds bad, going from 4MB to 7MB worth of writes isn’t much compared to the write load of actually doing the kernel compile or populating the kernel source tree. No, the problem was that first-generation SSD’s had a very bad problem with what has been called the “write amplification effect”, where a 4k write might cause a 128k region of the SSD to be erased and rewritten. In addition, in order to provide safety against system crashes, ext3 does more synchronous write operations — that is, ext3 waits for the write operation to complete before moving on — and this caused a very pronounced and noticeable stuttering effect which was fairly annoying to users. However, the next generation of SSD’s, such as Intel’s X25-M, have worked around the write amplification effect.

What else have we learned? First of all, for normal workloads that include data writes, the overhead from journaling is actually relatively small (between 4 and 12%, depending on the workload). Further, much of this overhead can be reduced by enabling the noatime option, with relatime providing some benefit as well; but ultimately, if the goal is to reduce your file system’s write load, especially where an SSD is involved, I would strongly recommend the use of noatime over relatime.

Fast ext4 fsck times, revisited

Last night I managed to finish up a rather satisfying improvement to ext4’s inode and block allocators. Ext4’s original allocator was actually a bit more simple-minded than ext3’s, in that it didn’t implement the Orlov algorithm to spread out top-level directories for better filesystem aging. It was also buggy in certain ways, where it would return ENOSPC even when there were still plenty of inodes in the file system.

So I had been working on extending ext3’s original Orlov allocator so it would work well with ext4. While I was at it, it occurred to me that one of the tricks I could play with ext4’s flex groups (which are higher-order collections of block groups) was to bias the block allocation algorithms such that the first block group in a flex group would be preferred for use by directories, and biased against data blocks for regular files. This meant that directory blocks would get clustered together, which cut the time needed for e2fsck pass 2 to roughly a third of what it had been:

Comparison of e2fsck times on a 32GB partition

                   ext4 old allocator               ext4 new allocator
Pass    real    user   system  MB read  MB/s     real   user   system  MB read  MB/s
1       6.69    4.06   0.90    82       12.25    6.70   3.63   1.58    82       12.23
2       13.34   2.30   3.78    133      9.97     4.24   1.27   2.46    133      31.36
3       0.02    0.01   0       1        63.85    0.01   0.01   0.01    1        82.69
4       0.28    0.27   0       0        0        0.23   0.22   0       0        0
5       2.60    2.31   0.03    1        0.38     2.42   2.15   0.07    1        0.41
Total   23.06   9.03   4.74    216      9.37     13.78  7.33   4.19    216      15.68

As you may recall from my previous observations on this blog, although we hadn’t been explicitly engineering for this, a file system consistency check on an ext4 file system tends to be a factor of 6-8 faster than the e2fsck times on an equivalent ext3 file system, mainly due to the elimination of indirect blocks and the uninit_bg feature reducing the amount of disk reads necessary in e2fsck’s pass 1. However, the ext4 layout optimizations didn’t do much for e2fsck’s pass 2. Well, the optimization of the block and inode allocators is complementary to the original ext4 fsck improvements, since it focuses on what we hadn’t optimized the first time around: e2fsck pass 2 times have been cut by a third, and the overall fsck time has been cut by 40%. Not too shabby!

Of course, we need to do more testing to make sure we haven’t caused other file system benchmarks to degrade, although I’m cautiously optimistic that this will end up being a net win. I suspect that some benchmarks will go up by a little, and others will go down a little, depending on how heavily the benchmark exercises directory operations versus sequential I/O patterns. If people want to test this new allocator, it is in the ext4 patch queue. If all goes well, I will hopefully be pushing it to Linus after 2.6.29 is released, at the next merge window.

* * *

For comparison’s sake, here is a comparison of the fsck time of the same collection of files and directories, comparing ext3 and the original ext4 block and inode allocator. The file system in question is a 32GB install of Ubuntu Jaunty, with a personal home directory, a rather large Maildir directory, some linux kernel trees, and an e2fsprogs tree. It’s basically the emergency environment I set up on my Netbook at FOSDEM.

In all cases the file systems were freshly copied from the original root directory using the command rsync -axH / /mnt. It's actually a bit surprising to me that ext3's pass 2 e2fsck time was that much better than under the old ext4 allocator. My previous experience has shown that the two are normally about the same, with a read throughput of around 9-10 MB/s in e2fsck's pass 2 for both ext3 file systems and ext4 file systems using the original inode/block allocators. Hence, I would have expected ext3's pass 2 time to have been 12-13 seconds, not 6; perhaps it was just the luck of the draw in terms of how things ended up getting allocated on disk. Overall, though, things look quite good for ext4 and fsck times!

Comparison of e2fsck times on a 32GB partition
(real/user/system are time in seconds; MB read and MB/s are I/O)

        ext3                                    ext4 old allocator
Pass    real    user   system  MB read  MB/s   real    user   system  MB read  MB/s
1       108.40  13.74  11.53   583      5.38   6.69    4.06   0.90    82       12.25
2       5.91    1.74   2.56    133      22.51  13.34   2.30   3.78    133      9.97
3       0.03    0.01   0       1        31.21  0.02    0.01   0       1        63.85
4       0.28    0.27   0       0        0      0.28    0.27   0       0        0
5       3.17    0.92   0.13    2        0.63   2.60    2.31   0.03    1        0.38
Total   118.15  16.75  14.25   718      6.08   23.06   9.03   4.74    216      9.37

Vital Statistics of the 32GB partition
312214 inodes used (14.89%)
263 non-contiguous files (0.1%)
198 non-contiguous directories (0.1%)
  # of inodes with ind/dind/tind blocks: 0/0/0
  Extent depth histogram: 292698/40
4388697 blocks used (52.32%)
0 bad blocks
1 large file
263549 regular files
28022 directories
5 character device files
1 block device file
5 fifos
615 links
20618 symbolic links (19450 fast symbolic links)
5 sockets
312820 files

Binary-only device drivers for Linux and the supportability matrix of doom

I came across the following from the ext3-users mailing list. The poor user was stuck on a never-updated RHEL 3 production server and running into kernel panic problems. He was advised to try updating to the latest kernel rpm from Red Hat, but he didn’t feel he could do that. In his words:

I’m caught between a rock and a hard place due to the EMC PowerPath binary only kernel crack. Which makes it painful to both me and my customers to regularly upgrade the kernel. Not to mention the EMC supportability matrix of doom.

That pretty much sums it all up right there.

The good news is that I’ve been told that dm-multipath is almost at the point where it has enough functionality to replace PowerPath. Of course, that version isn’t yet shipping in distributions, and I’m sure it needs more testing, but it’ll be good when enterprise users who need this functionality can move to a 100% fully open source storage stack.
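For the curious, dm-multipath is driven by /etc/multipath.conf; a minimal sketch might look like the following. (The blacklist pattern is illustrative, and the exact set of supported options varies by distribution and multipath-tools version, so treat this as a starting point rather than a working configuration.)

```shell
# /etc/multipath.conf (sketch)
defaults {
        user_friendly_names yes     # use names like mpath0 instead of raw WWIDs
}
blacklist {
        devnode "^sda$"             # don't multipath the local boot disk (example)
}
```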

About the only thing left to do is to work in a mention of the Frying Pan of Doom and the recipe for Quick After-Battle Triple Chocolate Cake into the mix.  🙂

Should Filesystems Be Optimized for SSD’s?

In one of the comments to my last blog entry, an anonymous commenter writes:

You seem to be taking a different perspective to linus on the “adapting to the the disk technology” front (Linus seems to against having to have the OS know about disk boundaries and having to do levelling itself)

That’s an interesting question, and I figure it’s worth its own top-level entry, as opposed to a reply in the comment stream.   One of the interesting design questions in any OS or Computer Architecture is where the abstraction boundaries should be drawn and which side of an abstraction boundary should various operations be pushed.   Linus’s arguments is that there a flash controller can do a better job of wear leveling, including detecting how “worn” a particular flash cell might be (for example, perhaps by looking at the charge levels at an analog level and knowing when the last time the cell was programmed), and so it doesn’t make sense to try to do wear leveling in a flash file system.   Some responsibilities of flash management, such as coalescing newly written blocks into erase blocks to avoid write amplification can be done either on the SSD or in the file system — for example, by using a log-structured file system, or some other copy-on-write file system, instead of a rewrite-in-place style file system, you can essentially solve the write amplification problem.   In some cases, it’s necessary let additional information leak across the abstraction — for example, the ATA TRIM command is a way for the file system to let the disk know that certain blocks no longer need to be used.   If too much information needs to be pushed across the abstraction, one way or another, then maybe we need to rethink whether the abstraction barrier is in the right place.

In addition, if the abstraction has been around for a long time, changing it also has costs, which have to be taken into account.   The 512-byte-sector LBA abstraction has been around a long time, and therefore dislodging it is difficult and costly.   For example, the same argument which says that because the underlying hardware details change between different generations of SSD, all of these details should be hidden in hardware, was also used to justify something that has been a complete commercial failure for years if not decades: Object Based Disks.

One of the arguments of OBD’s was that the hard drive has the best knowledge of how and where to store an contiguous stream of bytes, and so perhaps filesystems should not be trying to decide where on disk an inode should be stored, but instead tell the hard drive, “I have this object, which is 134 kilobytes long; please store it somewhere on the disk”.   At least in theory the HDD or SSD could handle all of the details of knowing the best place to store the object on the spinning magnetic media or flash media, taking into account how worn the flash is and automatically move the object around in the case of an SSD, and in the case of the HDD, the drive could know about (real) cylinder and track boundaries, and store the object in the most efficient way possible, since the drive has intimate knowledge about the low-level details of how data is stored on the disk.

This theory makes a huge amount of sense; there's only one problem.   Object Based Disks have been proposed in academia, and pushed by the advanced R&D shops of companies like Seagate, for over a decade, with absolutely nothing to show for it.   Why?   Two reasons have been proposed.  One is that OBD vendors were too greedy, and tried to charge too much money for OBD's.    The other is that the interface abstraction for OBD's was too different, so there wasn't enough software, or enough file systems and OS's, that could take advantage of them.

Both explanations undoubtedly contributed to the commercial failure of OBD’s, but the question is which is the bigger reason.   And the reason why it is particularly important here is because at least as far as Intel’s SSD strategy is concerned, its advantage is that (modulo implementation shortcomings such as the reported internal LBA remapping table fragmentation problem and the lack of ATA TRIM support) filesystems don’t need to change (much) in order to take advantage of the Intel SSD and get at least decent performance.

However, if the price delta is the stronger reason for its failure, then the X25-M may be in trouble.   Currently the 80GB Intel X25-M has a street price of $400, so it costs roughly $5 per gigabyte.   “Dumb” MLC SATA SSD's are available for roughly half the cost per gigabyte (64 GB for $164).   So what does the market look like 12-18 months from now?  If “dumb” SSD's are still available at 50% of the cost of “smart” SSD's, it would probably be worth it to make a copy-on-write style filesystem that attempts to do the wear leveling and write amplification reduction in software.   Sure, it's probably more efficient to do it in hardware, but a 2x price differential might cause people to settle for a cheaper solution even if it isn't the absolute best technical choice.   On the other hand, if prices drop significantly, and/or “dumb” SSD's completely disappear from the market, then time spent now optimizing for “dumb” SSD's will be completely wasted.

So for Linus to make the proclamation that it’s completely stupid to optimize for “dumb” SSD’s seems to be a bit premature.   Market externalities — for example, does Intel have patents that will prevent competing “smart” SSD’s from entering the market and thus forcing price drops? — could radically change the picture.  It’s not just a pure technological choice, which is what makes projections and prognostications difficult.

As another example, I don't know whether or not Intel will issue a firmware update that adds ATA TRIM support to the X25-M, or how long it will take before such SSD's become available.   Until ATA TRIM support arrives, it will be advantageous to add support in ext4 for a block allocator option that aggressively reuses blocks above all else, and avoids touching blocks that have never been allocated before, even if it causes more in-file-system fragmentation and deeper extent allocation trees.   The reason is that today, once a block has been used by the file system, the X25-M has absolutely no idea whether we still care about the contents of that block, or whether the block has since been released because the file was deleted.   However, if 20% of the SSD's blocks have never been written, the X25-M can use that 20% of the flash for better garbage collection and defragmentation algorithms.   If Intel never releases a firmware update adding ATA TRIM support, then I will have paid $400 out of my own pocket for an SSD that lacks this capability, and adding a block allocator which works around the limitations of the X25-M probably makes sense; and if it turns out to take two years before disks with ATA TRIM support show up, it will definitely make sense.  (Hard drive vendors have historically been S-L-O-W to finish standardizing new features and then letting such features enter the marketplace, so I'm not necessarily holding my breath; after all, the Linux block device layer and file systems have been ready to send ATA TRIM commands for about six months.  What's taking the ATA committees and SSD vendors so long? <grin>)

On the other hand, if Intel releases ATA TRIM support next month, then it might not be worth my effort to add such a mount option to ext4.   Or maybe Sandisk will soon make an ATA TRIM capable SSD which is otherwise competitive with Intel's, and I'll get a free sample, but it turns out another optimization on Sandisk SSD's will give me an extra 10% performance gain under some workloads.   Is it worth it in that case?   Hard to tell, unless I know whether such a tweak addresses an optimization problem which is fundamental, and whether or not such a tweak will be unnecessary, or perhaps actively unhelpful, in the next generation.    As long as SSD manufacturers force us to treat these devices as black boxes, a certain amount of cargo cult science may be forced upon us file system designers — or I guess I should say, in order to be more academically respectable, “we will be forced to rely more on empirical measurements leading to educated engineering estimations about what the SSD is doing inside the black box”. Heh.

Aligning filesystems to an SSD’s erase block size

I recently purchased a new toy, an Intel X25-M SSD, and when I was setting it up initially, I decided I wanted to make sure the file system was aligned on an erase block boundary.  This is generally considered to be a Very Good Thing to do for most SSD's available today, although there's some question about how important it really is for Intel SSD's — more on that in a moment.

It turns out this is much more difficult than you might first think — most of Linux's storage stack is not set up well to worry about alignment of partitions and logical volumes.  This is surprising, because such alignment is useful for many things other than just SSD's.  It is important if you are using any kind of hardware or software RAID, especially RAID 5, because writes done on stripe boundaries can avoid a read-modify-write overhead.  In addition, the hard drive industry is planning on moving to 4096-byte sectors instead of the way-too-small 512-byte sectors at some point in the future.   Linux's default partition geometry of 255 heads and 63 sectors/track means that there are 16065 (512 byte) sectors per cylinder.  The initial round of 4k sector disks will emulate 512-byte disks, but if the partitions are not 4k aligned, then the disk will end up doing a read/modify/write on two internal 4k sectors for each single 4k file system write, and that would be unfortunate.
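To make the arithmetic above concrete, here is a quick shell check showing that the default 255/63 geometry puts cylinder boundaries (and hence partition starts) off 4KiB alignment:

```shell
# Default Linux partition geometry: 255 heads x 63 sectors/track.
sectors_per_cyl=$((255 * 63))      # 16065 512-byte sectors per cylinder
# A 4KiB-aligned start must be a multiple of 8 sectors (8 x 512 = 4096 bytes).
remainder=$((sectors_per_cyl % 8))
echo "sectors/cylinder=$sectors_per_cyl  misaligned-by=$remainder sector(s)"
```

Since the remainder is 1, a partition starting on any cylinder boundary after the first lands one sector past a 4KiB boundary, triggering exactly the read-modify-write behavior described above.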

Vista has already started working around this problem, since it uses a default partitioning geometry of 240 heads and 63 sectors/track.   This results in a sectors-per-cylinder count which is divisible by 8, and so the partitions (with the exception of the first, which is still misaligned unless you play some additional tricks) are 4k aligned.    So this is one place where Vista is ahead of Linux….   Unfortunately, the default of 255 heads and 63 sectors is hard-coded in many places in the kernel, in the SCSI stack, and in various partitioning programs, so fixing this will require changes in many places.

However, with SSD’s (remember SSD’s?  This is a blog post about SSD’s…) you need to align partitions on at least 128k boundaries for maximum efficiency.   The best way to do this that I’ve found is to use 224 (32*7) heads and 56 (8*7) sectors/track.  This results in 12544 (or 256*49) sectors/cylinder, so that each cylinder is 49*128k.  You can do this by doing starting fdisk with the following options when first partitioning the SSD:

# fdisk -H 224 -S 56 /dev/sdb
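A quick sanity check of that geometry, assuming the 128KiB erase block size discussed above:

```shell
# 224 heads x 56 sectors/track, 512-byte sectors:
sectors_per_cyl=$((224 * 56))               # 12544 sectors per cylinder
bytes_per_cyl=$((sectors_per_cyl * 512))
erase_block=$((128 * 1024))                 # assumed 128KiB erase block
echo "cylinder = $bytes_per_cyl bytes = $((bytes_per_cyl / erase_block)) erase blocks"
```

Every cylinder is exactly 49 erase blocks, so every cylinder-aligned partition start is also erase-block aligned.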

The first partition will only be aligned on a 4k boundary, since in order to be compatible with MS-DOS, the first partition starts on track 1 instead of track 0.  I didn't worry too much about that, since I tend to use the first partition for /boot, which doesn't get modified much.   You can go into expert mode with fdisk and force the first partition to begin on a 128k alignment, but many Linux partition tools will complain about potential compatibility problems (obsolete warnings, since machines that would have had trouble booting from such partitions haven't been made in about ten years); I didn't need to do that, so I didn't worry about it.

So I created a 1 gigabyte /boot partition as /dev/sdb1, and allocated the rest of the SSD for use by LVM as /dev/sdb2. And that’s where I ran into my next problem. LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volume to be properly aligned you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:

# pvcreate --metadatasize 250k /dev/sdb2
Physical volume "/dev/sdb2" successfully created

Why 250k and not 256k? I can’t tell you — sometimes the LVM tools aren’t terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:

# pvs /dev/sdb2 -o+pe_start
PV         VG   Fmt  Attr PSize  PFree  1st PE
/dev/sdb2       lvm2 —   73.52G 73.52G 256.00K

If you use a metadata size of 256k, the first PE will be at 320k instead of 256k. There really ought to be a --pe-align option to pvcreate, which would be far more user-friendly, but we have to work with the tools that we have. Maybe in the next version of the LVM support tools….
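In the meantime, a scripted check is easy enough. Here's a sketch; the pe_start value is hard-coded from the pvs output above rather than parsed live, so adjust it to whatever your own `pvs -o+pe_start` reports:

```shell
# First-PE offset as reported by `pvs /dev/sdb2 -o+pe_start`, in KiB (assumed).
pe_start_kb=256
# The physical extents are erase-block aligned iff this is a multiple of 128KiB.
if [ $((pe_start_kb % 128)) -eq 0 ]; then
    status=aligned
else
    status=misaligned
fi
echo "first PE at ${pe_start_kb}k: $status"
```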

Once you do this, we're almost done. The last thing to do is to create the file system. As it turns out, if you are using ext4, there is a way to tell the file system that it should try to align files so they match up with the RAID stripe width. (These techniques can be used for RAID disks as well.) If your SSD has a 128k erase block size, and you are creating the file system with the default 4k block size, you just have to specify a stripe width when you create the file system, like so:

# mke2fs -t ext4 -E stripe-width=32,resize=500G /dev/ssd/root

(The resize=500G limits the number of blocks reserved for resizing this file system, so that the guaranteed size to which the file system can be grown via online resize is 500G. The default is 1000 times the initial file system size, which is often far too big to be reasonable. Realistically, the file system I am creating is going to be used for a desktop device, and I don't foresee needing to resize it beyond 500G, so this saves about 50 megabytes or so. Not a huge deal, but “waste not, want not”, as the saying goes.)
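The stripe-width value itself is just the erase block size divided by the file system block size, so for an SSD with a different (assumed) erase block size you can compute it the same way:

```shell
erase_block=$((128 * 1024))   # assumed 128KiB erase block size
fs_block=4096                 # mke2fs default block size
stripe_width=$((erase_block / fs_block))
echo "stripe-width=$stripe_width"   # value to pass to mke2fs -E stripe-width=
```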

With e2fsprogs 1.41.4, the journal will be 128k aligned, as will the start of the file system, and with the stripe-width specified, the ext4 allocator will try to align block writes to the stripe width where that makes sense. So this is as good as it gets without kernel changes to make the block and inode allocators more SSD aware, something which I hope to have a chance to look at.

* * *

All of this being said, it’s time to revisit this question — is all of this needed for a “smart”, “better by design” next-generation SSD such as Intel’s? Aligning your file system on an erase block boundary is critical on first generation SSD’s, but the Intel X25-M is supposed to have smarter algorithms that allow it to reduce the effect of write-amplification. The details are a little bit vague, but presumably there is a mapping table which maps sectors (at some internal sector size — we don’t know for sure whether it’s 512 bytes or some larger size) to individual erase blocks. If the file system sends a series of 4k writes for file system blocks 10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, 99, followed by a barrier operation, a traditional SSD might do read/modify/write on four 128k erase blocks — one covering the blocks 0-31, another for the blocks 32-63, and so on. However, the Intel SSD will simply write a single 128k block that indicates where the latest versions of blocks 10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, and 99 can be found.
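To see why the remapping matters, here's a small shell calculation of how many 128KiB erase blocks the example write pattern above touches (with 4KiB file system blocks, 32 of them fit in each erase block):

```shell
# File system blocks (4KiB each) written before the barrier, from the example:
blocks="10 12 13 32 33 34 35 64 65 66 67 96 97 98 99"
# Map each block to its 128KiB erase block (block / 32) and count the
# distinct erase blocks touched.
touched=$(for b in $blocks; do echo $((b / 32)); done | sort -un | wc -l)
touched=$((touched))   # normalize any whitespace padding from wc
echo "a traditional SSD read-modify-writes $touched erase blocks; the X25-M writes 1"
```

Fifteen scattered 4KiB writes hit four different erase blocks, so the naive design pays four read-modify-write cycles where the remapping design pays a single sequential 128KiB write.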

This technique tends to work very well.  However, over time the table will get terribly fragmented, and depending on whether the internal sector size is 512 bytes or 4k (or something in between), you can end up in a situation where all but one or two of the internal sectors in an erase block have been mapped away to other erase blocks, leading to fragmentation of the erase blocks. This is not just a theoretical problem; there are reports from the field that it happens relatively easily. For example, see Allyn Malventano's Long-term performance analysis of Intel Mainstream SSDs and Marc Prieur's report, which includes an official response from Intel regarding this phenomenon.  Laurent Gilson posted on the Linux-Thinkpad mailing list that when he tried using the X25-M to store commit journals for a database, after writing 170% of the capacity of the SSD the small writes caused write performance to go through the floor.   More troubling, Allyn Malventano indicated that if the drive is abused for too long with a mixture of small and large writes, it can get into a state where the performance degradation is permanent, and even a series of large writes apparently does not restore the drive's function — only an ATA SECURITY ERASE command, which completely resets the mapping table, seems to help.

So, what can be done to prevent this?   Allyn's review speculates that aligning writes to erase block boundaries can help.  I'm not 100% sure this is true, but without detailed knowledge of what is going on under the covers in Intel's SSD, we won't know for sure.  It certainly can't hurt, though, and there is a distinct possibility that the internal sector size is larger than 512 bytes, which means the default partitioning scheme of 255 heads/63 sectors is probably not a good idea.   (Even Vista has moved to a 240/63 scheme, which gives you 8k alignment of partitions; I prefer 224/56 partitioning, since the days when BIOS's used C/H/S I/O are long gone.)

The ext3 and ext4 file systems tend to defer metadata writes by pinning them until a transaction commit; this definitely helps, and ext4 allows you to configure an erase block boundary, which should also be helpful.  Enabling laptop mode will discourage writing to the disk except in large blocks, which probably helps significantly as well.   And avoiding fsync() in applications will also help, since a cache flush operation forces the SSD to write out an erase block even if it isn't completely filled.   Beyond that, clearly some experimentation will be needed.  My current thinking is to use a standard file system aging workload, and then perform an I/O benchmark to see if there has been any performance degradation; I can then vary various file system tuning parameters and algorithms, and confirm whether or not a heavy fsync workload makes the performance worse.

In the long term, hopefully Intel will release a firmware update which adds support for ATA TRIM/DISCARD commands, which will allow the file system to inform the SSD that various blocks have been deleted and no longer need to be preserved.   I suspect this will be a big help: if the SSD knows that certain sectors no longer need to be preserved, it can avoid copying them when trying to defragment itself.   Given how expensive the X25-M SSD's are, I hope that there will be a firmware update to support this, and that Intel won't leave its early adopters high and dry by only offering this functionality in newer models of the SSD.   If they were to do that, it would leave many of these early adopters, especially your humble writer (who paid for his SSD out of his own pocket), quite grumpy indeed.  Hopefully, though, it won't come to that.

Update: I’ve since penned a follow-up post “Should Filesystems Be Optimized for SSD’s?”