Thoughts by Ted

Musings about Open Source, Linux, and Life by Theodore Tso

The Transitive Grace Period Public Licence: good ideas come around…

I recently came across the Transitive Grace Period Public License (alternate link) by Zooko Wilcox-O’Hearn.  I fonud it interesting because it’s very similar — almost identical — to something I had first starting floating about ten years ago.  I called it the Temporary Proprietary License (TPL).  I’m sure this is a case of “great minds think alike”.  One things that I like about my write up is that I gave some of the rationale behind why this approach is a fruitful one:

A while ago, I was talking to Jim Gettys at the IETF meeting in Orlando, and the subject of software licensing issues came up, and he had a very interesting perspective to share about the X Consortium License, and what he viewed as bugs in the GPL.

His basic observation was this: Many companies made various improvements to the X code, which they would keep as proprietary and give them a temporary edge in the marketplace. However, since the X code base was continually evolving, over time it became less attractive to maintain these code forks, since it would mean that they would have to be continually merging their enhancements into the evolving code base. Also, in general, the advantage in having the proprietary new feature or speed enhancement typically degraded over time. After all, most companies are quite happy if it takes 18-24 months for their competitor to match a feature in their release.

So sometime later, the companies would very often donate their previously proprietary enhancement to the X consortium, which would then fold it into the public release of X. Jim Gettys’ complaint about the GPL was that by removing this ability for companies to recoup the investment needed to make major developmental improvements to Open Source code bases, companies might not have the incentive do this type of infrastructural improvements to GPL’ed projects.

Upon reflection, I think this is a very valid model. When Open Vision distributed the Kerberos Administration daemon to MIT, they wanted an 18 month sunset clause in the license which would prevent commercial competitors from turning around and bidding their work against them. My contract with Powerquest for doing the ext2 partition resizer had a similar clause which kept the resizing code proprietary until a suitable timeout period had occurred, at which point it would be released under the GPL. (Note that by releasing under the GPL, it prevents any of Partition Magic’s commercial competitors from including it in their proprietary products!)

For more, read the full proposal.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Technorati
  • Reddit
  • Slashdot
  • StumbleUpon

Proud to be a Googler

Although I obviously had nothing to do with Google’s decision vis-a-vis China, having only started working there for  a week, I was definitely glad to see it and it made me proud to be able to say that I work there.  Kudos to Google’s management team for having made (IMHO) the right decision.   Hopefully Yahoo and Microsoft will consider carefully what the ethical implications of their collusion and collaboration with the Chinese government’s attempt to control free speech and the human rights implications of the same.

I have my own opinion regarding the IETF’s decision to meet in Beijing, since as we’ve seen with the Search Engine industry’s attempt to accommodate the Chinese, engagement doesn’t necessarily always lead to openness and goodness.   All I can suggest is that those people who do decide to travel to that meeting be very careful about what sort of message they want to send with respect to China, as well as being very careful about protecting themselves against targetted information security attacks.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Technorati
  • Reddit
  • Slashdot
  • StumbleUpon

Why the Sony PRS-505/PRS-700 is a better choice than the Kindle

Amazon can reach in and randomly destroy the books on your Kindle remotely over Whispernet, without asking your permission first.  Well, technically, thanks to the terms and conditions that you have to agree to before you buy the Kindle, you gave them permission in advance.   Well, no thanks.    I think the Sony e-Reader is a much better choice as a result.

Yet another reason why people should Just Say No to DRM.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Technorati
  • Reddit
  • Slashdot
  • StumbleUpon

Don’t fear the fsync!

After reading the comments on my earlier post, Delayed allocation and the zero-length file problem as well as some of the comments on the Slashdot story as well as the Ubuntu bug, it’s become very clear to me that there are a lot of myths and misplaced concerns about fsync() and how best to use it.   I thought it would be appropriate to correct as many of these misunderstandings about fsync() in one comprehensive blog posting.

As the Eat My Data presentation points out very clearly, the only safe way according that POSIX allows for requesting data written to a particular file descriptor be safely stored on stable storage is via the fsync() call.  Linux’s close(2) man page makes this point very clearly:

A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2).

Why don’t application programmers follow these sage words?  These three reasons are most often given as excuses:

  1. (Perceived) performance problems with fsync()
  2. The application only needs atomicity, but not durability
  3. The fsync() causing the hard drive to spin up unnecessarily in laptop_mode

Let’s examine each of these excuses one at a time, to see how valid they really are.

(Perceived) performance problems with fsync()

Most of the bad publicity with fsync() originated with the now infamous problem with Firefox 3.0 that showed up about a year ago in May, 2008.   What happened with Firefox 3.0 was that the primary user interface thread called the sqllite library each time the user clicked on a link to go to a new page. The sqllite library called fsync(), which in ext3’s data=ordered mode, caused a large, visible latency which was visible to the user if there was a large file copy happening by another process.

Nearly all of the reported delays was a few seconds, which would be expected; normally there isn’t that much dirty data that needs to be flushed out on a Linux system, even if it is even very busy.   For example, consier the example of a laptop downloading an .iso image from a local file server; if the laptop has the exclusive link of a 100 megabit/second ethernet link, and the server has the .iso file in cache, or has a nice fast RAID array so it is not the bottleneck, then in the best case, the laptop will be able to download data at the rate of 10-12 MB/second.  Assuming the default 5 second commit interval, that means that in the worst case, there will be at most 60 megabytes which must be written out before the commit can proceed.  A reasonably modern 7200 rpm laptop drive can write between 60 and 70 MB/second.   (The Seagate Momentus 7200.4 laptop drive is reported to be able to deliver 85-104 MB/second, but I can’t find it for sale anywhere for love or money.)   In this example, an fsync() will trigger a commit and might need to take a second while the download is going on; perhaps half a second if you have a really fast 7200 rpm drive, and maybe 2-3 seconds if you have a slow 5400 rpm drive.

(Jump to Sidebar: What about those 30 second fsync reports?)

Obviously, you can create workloads that aren’t bottlenecked on the maximum ethernet download speed, or the speed of reading from a local disk drive; for example, “dd if=/dev/zero of=big-zero-file” will create a very large number of dirty pages that must be written to the hard drive at the next commit or fsync() call. It’s important to remember though, fsync() doesn’t create any extra I/O (although it may remove some optimization opportunities to avoid double writes); fsync() just pushes around when the I/O gets done, and whether it gets done synchronously or asynchronously. If you create a large number of pages that need to be flushed to disk, sooner or later it will have a significant and unfortunate effect on your system’s performance.  Fsync() might make things more visible, but if the fsync() is done off the main UI thread, the fact that fsync() triggers a commit won’t actually disturb other processes doing normal I/O; in ext3 and ext4, we start a new transaction to take care of new file system operations while the committing transction completes.

The final observation I’ll make is that part of the problem is that Firefox as an application wants to make a huge number of updates to state files and was concerned about not losing that information even in the face of a crash.  Every application writer should be asking themselves whether this sort of thing is really necessary.   For example, doing some quick measurements using ext4, I determined that Firefox was responsible for 2.54 megabytes written to the disk for each web page visited by the user (and this doesn’t include writes to the Firefox cache; I symlinked the cache directory to a tmpfs directory mounted on /tmp to reduce the write load to my SSD).   So these 2.54 megabytes is just for Firefox’s cookie cache and Places database to maintain its “Awesome bar”.  Is that really worth it?   If you visit 400 web pages in a day, that’s 1GB of writes to your SSD, and if you write more than 20GB/day, the Intel SSD will enable its “write endurance management feature” which slows down the performance of the drive.   In light of that, exactly how important is it to update those darned sqllite databases after every web click?  What if Firefox saved a list of URL’s that has been visited, and only updated every 30 or 60 minutes, instead?   Is it really that every last web page that you browse be saved if the system crashes?  An fsync() call every 15, 30, or 60 minutes, done by a thread which doesn’t block the application’s UI, would have never been noticed and would have not started the firestorm on Firefox’s bugzilla #421482.   Very often, after a little thinking, a small change in the application is all that’s necessary for to really optimize the application’s fsync() usage.

(Skip over the sidebar — if you’ve already read it).

Sidebar: What about those 30 second fsync reports?

If you read through the Firefox’s bugzilla entry, you’ll find reports of fsync delays of 30 seconds or more. That tale has grown in the retelling, and I’ve seen some hyperbolic claims of five minute delays. Where did that come from? Well, if you look that those claims, you’ll find they were using a very read-heavy workload, and/or they were using the ionice command to set a real-time I/O priority. For example, something like “ionice -c 1 -n 0 tar cvf /dev/null big-directory”.

This will cause some significant delays, first of all because “ionice -c 1″ causes the process to have a real-time I/O priority, such that any I/O requests issued by that process will be serviced before all others.   Secondly, even without the real-time I/O priority, the I/O scheduler naturally prioritizes reads as higher priority than writes because normally processes are waiting for reads to complete, but writes are normally asynchronous.

This is not at all realistic workload, and it is even more laughable that some people thought this might be an accurate representation of the I/O workload of a kernel compile. These folks had never tried the experiment, or measured how much I/O goes on during a kernel compile. If you try it, you’ll find that a kernel compile sucks up a lot of CPU, and doesn’t actually do that much I/O. (In fact, that’s why an SSD only speeds up a kernel compile by about 20% or so, and that’s in a completely cold cache case. If the commonly used include files are already in the system’s page cache, the performance improvement of the SSD is much less.)

Jump back to reading Performance problems with fsync.

The atomicity not durability argument

One argument that has commonly been made on the various comment streams is that when replacing a file by writing a new file and the renaming “file.new” to “file”, most applications don’t need a guarantee that new contents of the file are committed to stable store at a certain point in time; only that either the new or the old contents of the file will be present on the disk. So the argument is essentially that the sequence:

  • fd = open(”foo.new”, O_WRONLY);
  • write(fd, buf, bufsize);
  • fsync(fd);
  • close(fd);
  • rename(”foo.new”, “foo”);

… is too expensive, since it provides “atomicity and durability”, when in fact all the application needed was “atomicity” (i.e., either the new or the old contents of foo should be present after a crash), but not durability (i.e., the application doesn’t need to need the new version of foo now, but rather at some intermediate time in the future when it’s convenient for the OS).

This argument is flawed for two reasons. First of all, the squence above exactly provides desired “atomicity without durability”.   It doesn’t guarantee which version of the file will appear in the event of an unexpected crash; if the application needs a guarantee that the new version of the file will be present after a crash, it’s necessary to fsync the containing directory. Secondly, as we discussed above, fsync() really isn’t that expensive, even in the case of ext3′ and data=ordered; remember, fsync() doesn’t create extra I/O’s, although it may introduce latency as the application waits for some of the pending I/O’s to complete. If the application doesn’t care about exactly when the new contents of the file will be committed to stable store, the simplest thing to do is to execute the above sequence (open-write-fsync-close-rename) in a separate, asynchronous thread. And if the complaint is that this is too complicated, it’s not hard to put this in a library. For example, there is currently discussion on the gtk-devel-list on adding the fsync() call to g_file_set_contents(). Maybe if someone asks nicely, the glib developers will add an asynchronous version of this function which runs g_file_set_contents() in a separate thread. Voila!

Avoiding hard drive spin-ups with laptop_mode

Finally, as Nathaniel Smith said in Comment #111 of of my previous post:

The problem is that I don’t, really, want to turn off fsync’s, because I like my data. What I want to do is to spin up the drive as little as possible while maintaining data consistency. Really what I want is a knob that says “I’m willing to lose up to minutes of work, but no more”. We even have that knob (laptop mode and all that), but it only works in simple cases.

This is a reasonable concern, and the way to fix this is to enhance laptop_mode in the Linux kernel. Bart Samwel, the author and maintainer of laptop_mode, actually discussed this idea with me last month at FOSDEM.  Laptop_mode already adjusts /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs based on the configuration parameter MAX_LOST_WORK_SECONDS, and it also adjusts the file system commit time (for ext3; it needs to be taught to do the same thing for ext4, which is a simple patch) to MAX_LOST_WORK_SECONDS as well. All that is necessary is a kernel patch to allow laptop_mode to disable fsync() calls, since the kernel knows that it is in laptop_mode, and it notices that the disk has spun up, it will sync out everything to disk, since once the energy has been spent to spin up the hard drive, we might as well write everything in memory that needs to be written out right away. Hence, a patch which allows fsync() calls to be disabled while in laptop_mode should do pretty much everything Nate has asked. I need to check to see if laptop_mode does this already, but if it doesn’t force a file system commit when it detects that the hard drive has been spun up, it should obviously do this as well.

(In addition to having a way to globally disable fsync()’s, it may also be useful to have a way to selectively disable fsync()’s on a per-process basis, or on the flip side, exempt some process from a global fsync-disable flag. This may be useful if there are some system daemons that really do want to wake up the hard drive — and once the hard drive is spinning, naturally everything else that needs to pushed out to stable store should be immediately written.)

With this relatively minor change to the kernel’s support of laptop_mode, it should be possible to achieve the result that Nate desires, without needing force applications to worry about this issue; applications should be able to just simply use fsync() without fear.

Summary

As we’ve seen, the reasons most people think fsync() should be avoided really don’t hold water.   The fsync() call really is your friend, and it’s really not the villain that some have made it out to be. If used intelligently, it can provide your application with a portable way of assuring that your data has been safely written to stable store, without causing a user-visible latency in your application. The problem is getting people to not fear fsync(), understand fsync(), and then learning the techniques to use fsync() optimally.

So just as there has been a Don’t fear the penguin campaign, maybe we also need to have a “Don’t fear the fsync()” campaign.  All we need is a friendly mascot and logo for a “Don’t fear the fsync()” campaign. Anybody want to propose an image?  We can make some T-shirts, mugs, bumper stickers…

Share and Enjoy:
  • Digg
  • del.icio.us
  • Technorati
  • Reddit
  • Slashdot
  • StumbleUpon

Delayed allocation and the zero-length file problem

A recent Ubuntu bug has gotten slashdotted, and has started raising a lot of questions about the safety of using ext4. I’ve actually been meaning to blog about this for a week or so, but between a bout of the stomach flu and a huge todo list at work, I simply haven’t had the time.

The essential “problem” is that ext4 implements something called delayed allocation. Delayed allocation isn’t new to Linux; xfs has had delayed allocation for years. Pretty much all modern file systems have delayed allocation, according to the Wikipedia Allocate-on-flush article, this includes HFS+, Reiser4, and ZFS; btrfs has this property as well. Delayed allocation is a major win for performance, both because it allows writes to be streamed more efficiently to disk, and because it can reduce file fragmentation so that later on they can be read more efficiently from disk.

This sounds like a good thing, right? It is, except for badly written applications that don’t use fsync() or fdatasync(). Application writers had gotten lazy, because ext3 by default has a commit interval of 5 seconds, and and uses a journalling mode called data=ordered. What does this mean? The journalling mode data=ordered means that before the commit takes place, any data blocks are
associated with inodes that are about to be committed in that transaction will be forced out to disk. This is primarily done for security reasons; if this is not done (which would be the case if the disk is mounted with the mount option data=writeback), then any newly allocated blocks might still contain previous data belonging to some other file or user, and after a crash, accessing that file might result in a user seeing unitialized data that had previously belonged to another user (say, their e-mail or their p0rn), which would be a Bad Thing from a security perspective.

However, this had the side effect of essentially guaranteeing that anything that had been written was guaranteed to be on disk after 5 seconds. (This is somewhat modified if you are running on batteries
and have enabled laptop mode, but we’ll ignore that for the purposes of this discussion.) Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data — even though POSIX never really made any such guarantee. This become especially noticeable on Ubuntu, which uses many proprietary, binary-only drivers, which caused some Ubuntu systems to become highly unreliable, especially for Alpha releases of Ubuntu Jaunty, with the net result that some Ubuntu users have become used to their machines regularly crashing. (I use bleeding edge kernels, and I don’t see the kind of unreliability that apparently at least some Ubuntu users are seeing, which came as quite a surprise to me.)

So what are the solutions to this? One thing is that the applications could simply be rewritten to properly use fsync() and fdatasync(). This is what is required by POSIX, if you want to be sure that data has gotten written to stable storage. Some folks have resisted this suggestions, on two grounds; first, that it’s too hard to fix all of the applications out there, and second, that fsync() is too slow. This perception that fsync() is too slow was most recently caused by a problem with Firefox 3.0. As Mike Shaver put it:

On some rather common Linux configurations, especially using the ext3 filesystem in the “data=ordered” mode, calling fsync doesn’t just flush out the data for the file it’s called on, but rather on all the buffered data for that filesystem.

   Fundamentally, the problem is caused by “data=ordered” mode.  This problem can be avoided by mounting the filesystem using “data=writeback” or by using a filesystem that supports delayed allocation — such as ext4.  This is because if you have a small sqllite database which you are fsync(), and in another process you are writing a large 2 megabyte file, the 2 megabyte file won’t be be allocated right away, and so the fsync operation will not force the dirty blocks of that 2 megabyte file to disk; since the blocks haven’t been allocated yet, there is no security issue to worry about with the previous contents of newly allocated blocks if the system were to crash at that point.

Another solution is a set of patches to ext4 that has been queued for 2.6.30 merge window.  These three patches (with git id’s bf1b69c0, f32b730a, and 8411e347) will cause a file to have any delayed allocation blocks to be allocated immediately when a file is replaced.   This gets done for files which were truncated using ftruncate() or opened via O_TRUNC when the file is closed, and when a file is renamed on top of an existing file.  This solves the most annoying set of problems where an existing file gets rewritten, and thanks to the delayed allocation semantics, that existing file gets replaced with a zero-length file.   However, it will not solve the problem for newly created files, of course, which would have delayed allocation semantics.

Yet another solution would be to mount ext4 volumes with the nodelalloc mount option.   This will cause a significant performance hit, but apparently some Ubuntu users are happy using proprietary Nvidia drivers, even if it means that when they are done playing World of Goo, quitting the game causes the system to hang and they must hard-reset the system.   For those users, it may be that nodelalloc is the right solution for now — personally, I would consider that kind of system instability to be completely unacceptable, but I guess gamers have very different priorities than I do.

A final solution which might not be that hard to implement would be a new mount option, data=alloc-on-commit.    This would work much like data=ordered, with the additional constraint that all blocks that had delayed allocation would be allocated and forced out to disk before a commit takes place.   This would probably give slightly better performance compared to mounting with nodelalloc, but it shares many of the disadvantages of nodelalloc, including making fsync() to be potentially very slow because it would force all dirty blocks to be forced out to disk once again.

What’s the best path forward?   For now, what I would recommend to Ubuntu gamers whose systems crash all the time and who want to use ext4, to use the nodelalloc mount option.   I haven’t quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they won’t have to worry about files getting lost as a result of delayed allocation.    Long term, application writers who are worried about files getting lost on an unclena shutdown really should use fsync.    Modern filesystems are all going to be using delayed allocation, because of its inherent performance benefits, and whether you think the future belongs to ZFS, or btrfs, or XFS, or ext4 — all of these filesystems used delayed allocation.

What do you think?   Do you think all of these filesystems have gotten things wrong, and delayed allocation is evil?   Should I try to implement a data=alloc-on-commit mount option for ext4?   Should we try to fix applications to properly use fsync() and fdatasync()?

Share and Enjoy:
  • Digg
  • del.icio.us
  • Technorati
  • Reddit
  • Slashdot
  • StumbleUpon

SSD’s, Journaling, and noatime/relatime

On occasion, you will see the advice that the ext3 file system is not suitable for Solid State Disks (SSD’s) due to the extra writes caused by journaling — and so Linux users using SSD’s should use ext2 instead. However, is this folk wisdom actually true? This weekend, I decided to measure exactly what the write overhead of journaling actually is in actual practice.

For this experiment I used ext4, since I recently added a feature to track the amount of writes to the file system over its lifetime (to better gauge the wear and tear on an SSD). Ext4 also has the advantage that (starting in 2.6.29), it can support operations with and without a journal, allowing me to do a controlled experiment where I could manipulate only that one variable. The test workload I chose was a simple one:

  • Clone a git repository containing a linux source tree
  • Compile the linux source tree using make -j2
  • Remove the object files by running make clean

For the first test, I ran the test using no special mount options, and the only difference being the presence or absence of the has_journal feature. (That is, the first file system was created using mke2fs -t ext4 /dev/closure/testext4, while the second file system was created using mke2fs -t ext4 -O ^has_journal /dev/closure/testext4.)


Amount of data written (in megabytes) on an ext4 filesystem
Operation with journal w/o journal percent change
git clone 367.7 353.0 4.00%
make 231.1 203.4 12.0%
make clean 14.6 7.7 47.3%

 

What the results show is that metadata-heavy workloads, such as make clean, do result in almost twice the amount data written to disk. This is to be expected, since all changes to metadata blocks are first written to the journal and the journal transaction committed before the metadata is written to their final location on disk. However, for more common workloads where we are writing data as well as modifying filesystem metadata blocks, the difference is much smaller: 4% for the git clone, and 12% for the actual kernel compile.

The noatime mount option

Can we do better? Yes, if we mount the file system using the noatime mount option:


Amount of data written (in megabytes) on an ext4 filesystem mounted with noatime
Operation with journal w/o journal percent change
git clone 367.0 353.0 3.81%
make 207.6 199.4 3.95%
make clean 6.45 3.73 42.17%

 

This reduces the extra cost of the journal in the git clone and make steps to be just under 4%. What this shows is that most of the extra meta-data cost without the noatime mount option was caused by update to the last update time for kernel source files and directories.

The relatime mount option

There is a newer alternative to the noatime mount option, relatime. The relatime mount option updates the last access time of a file only if the last modified or last inode changed time is newer than the last accessed time. This allows programs to be able to determine whether a file has been read size it was last modified. The usual (actually, only) example that is given of such an application is the mutt mail-reader, which uses the last accessed time to determine if new mail has been delivered to Unix mail spool files. Unfortunately, relatime is not free. As you can see below, it has roughly double the overhead of noatime (but roughly half the overhead of using the standard Posix atime semantics):


Amount of data written (in megabytes) on an ext4 filesystem mounted with relatime
Operation with journal w/o journal percent change
git clone 366.6 353.0 3.71%
make 216.8 203.7 6.04%
make clean 13.34 6.97 45.75%

 

Personally, I don’t think relatime is worth it. There are other ways of working around the issue with mutt — for example, you can use Maildir-style mailboxes, or you can use mutt’s check_mbox_size option. If the goal is to reduce unnecessary disk writes, I would mount my file systems using noatime, and use other workarounds as necessary. Alternatively, you can use chattr +A to set the noatime flag on all files and directories where you don’t want noatime semantics, and then clear the flag for the Unix mbox files where you care about the atime updates. Since the noatime flag is inherited by default, you can get this behaviour by setting running chattr +A /mntpt right after the filesystem is first created and mounted; all files and directories created in that file system will have the noatime file inherited.

Comparing ext3 and ext2 filesystems


Amount of data written (in megabytes) on an ext3 and ext2 filesystem
Operation ext3 ext2 percent change
git clone 374.6 357.2 4.64%
make 230.9 204.4 11.48%
make clean 14.56 6.54 55.08%

 

Finally, just to round things out, I tried the same experiment using the ext3 and ext2 file systems. The difference between these results and the ones involving ext4 are the result of the fact that ext2 does not have the directory index feature (aka htree support), and both ext2 and ext3 do not have extents support, but rather use the less efficient indirect block scheme. The ext2 and ext3 allocators are also someone different from each other, and from ext4. Still, the results are substantially similar with the first set of Posix-compliant atime update numbers (I didn’t bother to do noatime and relatime benchmark runs with ext2 and ext3, but I expect the results would be similar.)

Conclusion

So given all of this, where did the common folk wisdom that ext3 was not suitable for SSD’s come from? Some of it may have been from people worrying too much about extreme workloads such as “make clean”; but while doubling the write load sounds bad, going from 4MB to 7MB worth of writes isn’t that much compared to the write load of actually doing the kernel compile or populating the kernel source tree. No, the problem was that first generation SSD’s had a very bad problem with what has been called the “write amplification effect”, where a 4k write might cause a 128k region of the SSD to be erased and rewritten. In addition in order to provide safety against system crashes, ext3 has more synchronous write operations — that is where ext3 waits for the write operation to be complete before moving on, and this caused a very pronounced and noticeable stuttering effect which was fairly annoying to users. However, the next generation of SSD’s, such as Intel’s X25-M SSD, have worked around the write amplification affect.

What else have we learned? First of all, for normal workloads that include data writes, the overhead from journaling is actually relatively small (between 4 and 12%, depending on the workload). Further, than much of this overhead can be reduced by enabling the noatime option, with relatime providing some benefit, but ultimately if the goal is to reduce your file system’s write load, especially where an SSD is involved, I would strongly recommend the use of noatime over relatime.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Technorati
  • Reddit
  • Slashdot
  • StumbleUpon

Fast ext4 fsck times, revisited

Last night I managed to finish up a rather satisfying improvement to ext4’s inode and block allocators. The ext4’s original allocator was actually a bit more simple-minded than ext3’s, in that it didn’t implement the Orlov algorithm to spread out top-level directories for better filesystem aging. It also was buggy in certain ways, where it would return ENOSPC even when there were still plenty of inodes in the file system.

So I had been working on extending ext3’s original Orlov allocator so it would work well with ext4. While I was at it, it occurred to me that one of the tricks I could play with ext4’s flex groups (which are higher-order collection of block groups), was to bias the block allocation algorithms such that the first block group in a flexgroup would be preferred for use by directories, and biased against data blocks for regular files. This meant that directory blocks would get clustered together, which cut a third off the time needed for e2fsck pass2:

Comparison of e2fsck times on an 32GB partition
Pass ext4 old allocator ext4 new allocator
time (s) I/O time (s) I/O
real user system MB read MB/s real user system MB read MB/s
1 6.69 4.06 0.90 82 12.25 6.70 3.63 1.58 82 12.23
2 13.34 2.30 3.78 133 9.97 4.24 1.27 2.46 133 31.36
3 0.02 0.01 0 1 63.85 0.01 0.01 0.01 1 82.69
4 0.28 0.27 0 0 0 0.23 0.22 0 0 0
5 2.60 2.31 0.03 1 0.38 2.42 2.15 0.07 1 0.41
Total 23.06 9.03 4.74 216 9.37 13.78 7.33 4.19 216 15.68


As you may recall from my previous observations on this blog, although we hadn’t been explicitly engineering for this, a file system consistency check on an ext4 file system tends to be a factor of 6-8 faster than the e2fsck times on an equivalent ext3 file system, mainly due to the elimination of indirect blocks and the uninit_bg feature reducing the amount of disk reads necessary in e2fsck’s pass 1. However, the ext4 layout optimizations didn’t do much for e2fsck’s pass 2. Well, the optimization of the block and inode allocators is complementary to the original ext4 fsck improvements, since it focuses on what we hadn’t optimized the first time around: e2fsck pass 2 times have been cut by a third, and the overall fsck time has been cut by 40%. Not too shabby!

Of course, we need to do more testing to make sure we haven’t caused other file system benchmarks to degrade, although I’m cautiously optimistic that this will end up being a net win. I suspect that some benchmarks will go up by a little, and others will go down a little, depending on how heavily the benchmark exercises directory operations versus sequential I/O patterns. If people want to test this new allocator, it is in the ext4 patch queue. If all goes well, I will hopefully be pushing it to Linus after 2.6.29 is released, at the next merge window.

horizontal separator

For comparison’s sake, here is a comparison of the fsck time of the same collection of files and directories, comparing ext3 and the original ext4 block and inode allocator. The file system in question is a 32GB install of Ubuntu Jaunty, with a personal home directory, a rather large Maildir directory, some linux kernel trees, and an e2fsprogs tree. It’s basically the emergency environment I set up on my Netbook at FOSDEM.

In all cases the file systems were freshly copied from the original root directory using the command rsync -axH / /mnt. It’s actually a bit surprising to me that ext3’s pass 2 e2fsck times was that much better than e2fsck time under the old ext4 allocator. My previous experience has shown that the two are normally about the same, with a write throughput of around 9-10 MB/s on for e2fsck’s pass 2 for both ext3 file systems and ext4 file systems with the original inode/block allocators. Hence, I would have expected ext3’s pass2 time to have been 12-13 seconds, and not 6. I’m not sure how that happened, unless it was the luck of draw in terms of how things ended up getting allocated on disk. So I’m not too sure what happened there, but overall things look quite good for ext4 and fsck times!

Comparison of e2fsck times on an 32GB partition
Pass ext3 ext4 old allocator
time (s) I/O time (s) I/O
real user system MB read MB/s real user system MB read MB/s
1 108.40 13.74 11.53 583 5.38 6.69 4.06 0.90 82 12.25
2 5.91 1.74 2.56 133 22.51 13.34 2.30 3.78 133 9.97
3 0.03 0.01 0 1 31.21 0.02 0.01 0 1 63.85
4 0.28 0.27 0 0 0 0.28 0.27 0 0 0
5 3.17 0.92 0.13 2 0.63 2.60 2.31 0.03 1 0.38
Total 118.15 16.75 14.25 718 6.08 23.06 9.03 4.74 216 9.37

Vital Statistics of the 32GB partition
312214 inodes used (14.89%)
263 non-contiguous files (0.1%)
198 non-contiguous directories (0.1%)
  # of inodes with ind/dind/tind blocks: 0/0/0
  Extent depth histogram: 292698/40
4388697 blocks used (52.32%)
0 bad blocks
1 large file
   
263549 regular files
28022 directories
5 character device files
1 block device file
5 fifos
615 links
20618 symbolic links (19450 fast symbolic links)
5 sockets
312820 files

Share and Enjoy:
  • Digg
  • del.icio.us
  • Technorati
  • Reddit
  • Slashdot
  • StumbleUpon

Binary-only device drivers for Linux and the supportability matrix of doom

I came across the following from the ext3-users mailing list. The poor user was stuck on a never-updated RHEL 3 production server and running into kernel panic problems. He was advised to try updating to the latest kernel rpm from Red Hat, but he didn’t feel he could do that. In his words:

I’m caught between a rock and a hard place due to the EMC PowerPath binary only kernel crack. Which makes it painful to both me and my customers to regularly upgrade the kernel. Not to mention the EMC supportability matrix of doom.

That pretty much sums it all up right there.

The good news is that I’ve been told that dm-multipath is almost at the point where it has enough functionality to replace PowerPath. Of course, that version isn’t yet shipping in distributions, and I’m sure it needs more testing, but it’ll be good when enterprise users who need this functionality can move to a 100% fully open source storage stack.

About the only thing left to do is to work in a mention of the Frying Pan of Doom and the recipe for Quick After-Battle Triple Chocolate Cake into the mix.  :-)

Share and Enjoy:
  • Digg
  • del.icio.us
  • Technorati
  • Reddit
  • Slashdot
  • StumbleUpon

Reflections on a complaint from a frustrated git user

Last week, Scott James Remnant posted a series of “Git Sucks” on his blog, starting with this one here, with follow up entries here and here.  His problem?  To quote Scott, “I want to put a branch I have somewhere so somebody else can get it.  That’s the whole point of distributed revision-control, collaboration.”   He thought this was a “mind-numbingly trivial” operation, and was frustrated when it wasn’t a one-line command in git.

Part of the problem here is that for most git workflows, most people don’t actually use “git push”.   That’s why it’s not covered in the git tutorial (this was a point of frustration for Scott).   In fact, in most large projects, the number of people need to use the “scm push” command is a very small percentage of the developer population, just as very few developers have commit privileges and are allowed to use the “svn commit” command in a project using Subversion.  When you have a centralized repository, only the privileged few will given commit privileges, for obvious security and quality control reasons.

Ah, but in a distributed SCM world, things are more democratic — anyone can have their own repository, and so everyone can type the commands “git commit” or “bzr commit”.   While this is true, the number of people who need to be able to publish their own branch is small.  After all, the overhead in setting up your own server just so people can “pull” changes from you is quite large; and if you are just getting started, and only need to submit one or two patches, or even a large series of patches, e-mail is a far more convenient route.  This is especially true in the early days of git’s development, before web sites such as git.or.cz, github, and gitorious made it much easier for people to publish their own git repository.   Even for a large series of changes, tools such as “git format-patch” and “git send-email” are very convenient for sending a patch series, and on the receiving side, the maintainer can use “git am” to apply a patch series sent via e-mail.

It turns out that from a maintainer’s point of view, reviewing patches via e-mail is often much more convenient.  Especially for developers who are just starting out with submitting patches to a project, it’s rare that a patch is of sufficiently high quality that it can be applied directly into the repository without needing fixups of one kind or another.   The patch might not have the right coding style compared to the surrounding code, or it might be fundamentally buggy because the patch submitter didn’t understand the code completely.   Indeed, more often than not, when someone submits a patch to me, it is more useful for indicating the location of the bug more than anything else, and I often have to completely rewrite the patch before it enters into the e2fsprogs mainline repository.    Given that, publishing a patch that will require modification in a public repository where it is ready to be pulled just doesn’t make sense for many entry-level patch submitters.   E-mail is in fact less work, and more appropriate for review purposes.

It is only when a mid-level to senior developer is trusted to create high quality patches that do not need review that publishing their branch in a pull-ready form really makes sense.   And that is fairly rare, and why it is not covered in most entry-level git documentation and tutorials.   Unfortunately, many people expect to see the command “scm push” in a distributed SCM, and since “git pull” is a commonly used command for beginning git users, they expect that they should use “git push” as well — not realizing that in a distributed SCM, “push” and “pull” are not symmetric operations.   Therefore, while most git users won’t need to use “git push”, git tutorials and other web pages which are attempting to introduce git to new users probably do need to do a better job explaining why most beginning participants in a project probably don’t need their own publically accessible repository that other people can pull from, and which they can push changes for publication.

There is one exception to this, of course, and this is a developer who wants to get started using git for a new project which he or she is starting and is the author/maintainer, or someone who is interested in converting their project to git.   And this is where bzr has an advantage over git, in that bzr is primarily funded by Canonical, which has a strong interest in pushing an on-line web service, Launchpad.   This makes it easier for bzr to have relatively simple recipes for sharing a bzr repository, since the user doesn’t need to have access to a server with a public IP address, or need to set up a web or bzr server; they can simply take advantage of Launchpad.

Of course, there are web sites which make it easy for people to publish their git repositories; earlier, I had mentioned git.or.cz, github, and gitorious.   Currently, the git documentation and tutorials don’t mention them since they aren’t formally affiliated with the git project (although they are used by many git users and developers and the maintainers of these sites have contributed a large amount of code and documentation to git).   This should change, I think.   Scott’s frustrations which kicked off his “git sucks” complaints would have been solved if the Git tutorial recommended that the easist ways for someone to publicly publish their repository is via one of these public web sites (although people who want to set up their own server certainly free to do so).

Most of these public repositories probably won’t have much reason to exist, but they don’t do much harm, and who knows?  While most of the repositories published at github and gitoriuous will be like the hundreds of thousands of abandoned projects on Sourceforge, one or two of the new projects which someone starts experimenting on at github or gitorious could turn out to be the next Ruby on Rails or Python or Linux.  And hopefully, they will allow more developers to be able to experiment with publishing commits on their own repositories, and lessen the frustrations of people like Scott who thought they needed their own repositories; whether or not a public repository is the best way for them to do what they need to do, at least this way they won’t get as frustrated about git.  :-)

Share and Enjoy:
  • Digg
  • del.icio.us
  • Technorati
  • Reddit
  • Slashdot
  • StumbleUpon

Should Filesystems Be Optimized for SSD’s?

In one of the comments to my last blog entry, an anonymous commenter writes:

You seem to be taking a different perspective to linus on the “adapting to the the disk technology” front (Linus seems to against having to have the OS know about disk boundaries and having to do levelling itself)

That’s an interesting question, and I figure it’s worth its own top-level entry, as opposed to a reply in the comment stream.   One of the interesting design questions in any OS or Computer Architecture is where the abstraction boundaries should be drawn and which side of an abstraction boundary should various operations be pushed.   Linus’s arguments is that there a flash controller can do a better job of wear leveling, including detecting how “worn” a particular flash cell might be (for example, perhaps by looking at the charge levels at an analog level and knowing when the last time the cell was programmed), and so it doesn’t make sense to try to do wear leveling in a flash file system.   Some responsibilities of flash management, such as coalescing newly written blocks into erase blocks to avoid write amplification can be done either on the SSD or in the file system — for example, by using a log-structured file system, or some other copy-on-write file system, instead of a rewrite-in-place style file system, you can essentially solve the write amplification problem.   In some cases, it’s necessary let additional information leak across the abstraction — for example, the ATA TRIM command is a way for the file system to let the disk know that certain blocks no longer need to be used.   If too much information needs to be pushed across the abstraction, one way or another, then maybe we need to rethink whether the abstraction barrier is in the right place.

In addition, if the abstraction has been around for a long time, changing it also has costs, which has to be taken into account.   The 512 byte sector LBA abstraction has been around long time, and therefore dislodging it is difficult and costly.   For example, the same argument which says that because the underlying hardware details are changing between different generations of SSD is all of these details should be hidden in hardware, was also used to justify something that has been a complete commercial failure for years if not decades: Object Based Disks.

One of the arguments of OBD’s was that the hard drive has the best knowledge of how and where to store an contiguous stream of bytes, and so perhaps filesystems should not be trying to decide where on disk an inode should be stored, but instead tell the hard drive, “I have this object, which is 134 kilobytes long; please store it somewhere on the disk”.   At least in theory the HDD or SSD could handle all of the details of knowing the best place to store the object on the spinning magnetic media or flash media, taking into account how worn the flash is and automatically move the object around in the case of an SSD, and in the case of the HDD, the drive could know about (real) cylinder and track boundaries, and store the object in the most efficient way possible, since the drive has intimate knowledge about the low-level details of how data is stored on the disk.

This theory makes a huge amount of sense; but there’s only one problem.   Object Based Disks have been proposed in academia and advanced R&D shops of companies like Seagate et.al. have been proposing them for over a decade, with absolutely nothing to show for it.   Why?   There have been two reasons proposed.  One is that OBD vendors were too greedy, and tried to charge too much money for OBD’s.    Another explanation is that the interface abstraction for OBD’s was too different, and so there wasn’t enough software or file systems or OS’s that could take advantage of OBD’s.

Both explanations undoubtedly contributed to the commercial failure of OBD’s, but the question is which is the bigger reason.   And the reason why it is particularly important here is because at least as far as Intel’s SSD strategy is concerned, its advantage is that (modulo implementation shortcomings such as the reported internal LBA remapping table fragmentation problem and the lack of ATA TRIM support) filesystems don’t need to change (much) in order to take advantage of the Intel SSD and get at least decent performance.

However, if the price delta is a stronger reason for its failure, then the X25-M may be in trouble.   Currently the 80GB Intel X25-M has a street price of $400, so it costs roughly $5 per gigabyte.   “Dumb” MLC SATA SSD’s are available for roughly half the cost/gigabyte (64 GB for $164).   So what does the market look like 12-18 months from now?  If “dumb” SSD’s are still available at 50% of the cost of “smart” SSD’s, it would probably be worth it to make a copy-on-write style filesystem that attempts to do the wear leveling and write amplification reduction in software.   Sure, it’s probably more efficient to do it in hardware, but a 2x price differential might cause people will settle for a cheaper solution even if isn’t the absolutely best technical choice.   On the hand, if prices drop significantly, and/or “dumb” SSD’s completely disappear from the market, then time spent now optimizing for “dumb” SSD’s will be completely wasted.

So for Linus to make the proclamation that it’s completely stupid to optimize for “dumb” SSD’s seems to be a bit premature.   Market externalities — for example, does Intel have patents that will prevent competing “smart” SSD’s from entering the market and thus forcing price drops? — could radically change the picture.  It’s not just a pure technological choice, which is what makes projections and prognostications difficult.

As another example, I don’t know whether or not Intel will issue a firmware update that adds ATA TRIM support to the X25-M, or how long it will take before such SSD’s become available.   Until ATA TRIM support becomes available, it will be advantageous to add support in ext4 for a block allocator option that aggressively reuses blocks above all else, and avoids using blocks that have never been allocated or used before, even if it causes more in-file system fragmentation and deeper extent allocation trees.   The reason for this is at the moment, once a block is used by the file system, at least today, the X25-M has absolutely no idea whether we still care about the contents of that block, or whether the block has since been released when the file was deleted.   However, if 20% of the SSD’s blocks have never been used, the X25-M can use 20% of the flash for better garbage collection and defragmentation algorithms.   And if Intel never releases a firmware update to add ATA TRIM support, then I will be out $400 out of my own pocket for an SSD that lacks this capability, and so adding a block allocator which works around limitations of the X25-M probably makes sense.   If it turns out that it takes two years before disks that have ATA TRIM support show up, then it will definitely make sense to add such an optimization. (Hard drive vendors have been historically S-L-O-W to finish standardizing new features and then letting such features enter the market place, so I’m not necessarily holding my breath; after all, the Linux block device layer and and file systems have been ready to send ATA TRIM support for about six months; what’s taking the ATA committees and SSD vendors so long? <grin>

On the other hand, if Intel releases ATA TRIM support next month, then it might not be worth my effort to add such a mount option to ext4.   Or maybe Sandisk will make an ATA TRIM capable SSD available soon, and which is otherwise competitive with Intel, and I get a free sample, but it turns out another optimization on Sandisk SSD’s will give me an extra 10% performance gain under some workloads.   Is it worth it in that case?   Hard to tell, unless I know whether such a tweak addresses an optimization problem which is fundamental, and whether or not such a tweak will either be unnecessary, or perhaps actively unhelpful in the next generation.    As long as SSD manufacturers force us treat these devices as black boxes, there may be a certain amount of cargo cult science which may be forced upon us file system designers — or I guess I should say, in order to be more academically respectable, “we will be forced to rely more on empirical measurements leading to educated engineering estimations about what the SSD is doing inside the black box”. Heh.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Technorati
  • Reddit
  • Slashdot
  • StumbleUpon