After reading the comments on my earlier post, Delayed allocation and the zero-length file problem as well as some of the comments on the Slashdot story as well as the Ubuntu bug, it’s become very clear to me that there are a lot of myths and misplaced concerns about fsync() and how best to use it. I thought it would be appropriate to correct as many of these misunderstandings about fsync() in one comprehensive blog posting.
As the Eat My Data presentation points out very clearly, the only safe way according that POSIX allows for requesting data written to a particular file descriptor be safely stored on stable storage is via the fsync() call. Linux’s close(2) man page makes this point very clearly:
A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2).
Why don’t application programmers follow these sage words? These three reasons are most often given as excuses:
- (Perceived) performance problems with fsync()
- The application only needs atomicity, but not durability
- The fsync() causing the hard drive to spin up unnecessarily in laptop_mode
Let’s examine each of these excuses one at a time, to see how valid they really are.
(Perceived) performance problems with fsync()
Most of the bad publicity with fsync() originated with the now infamous problem with Firefox 3.0 that showed up about a year ago in May, 2008. What happened with Firefox 3.0 was that the primary user interface thread called the sqllite library each time the user clicked on a link to go to a new page. The sqllite library called fsync(), which in ext3’s data=ordered mode, caused a large, visible latency which was visible to the user if there was a large file copy happening by another process.
Nearly all of the reported delays was a few seconds, which would be expected; normally there isn’t that much dirty data that needs to be flushed out on a Linux system, even if it is even very busy. For example, consier the example of a laptop downloading an .iso image from a local file server; if the laptop has the exclusive link of a 100 megabit/second ethernet link, and the server has the .iso file in cache, or has a nice fast RAID array so it is not the bottleneck, then in the best case, the laptop will be able to download data at the rate of 10-12 MB/second. Assuming the default 5 second commit interval, that means that in the worst case, there will be at most 60 megabytes which must be written out before the commit can proceed. A reasonably modern 7200 rpm laptop drive can write between 60 and 70 MB/second. (The Seagate Momentus 7200.4 laptop drive is reported to be able to deliver 85-104 MB/second, but I can’t find it for sale anywhere for love or money.) In this example, an fsync() will trigger a commit and might need to take a second while the download is going on; perhaps half a second if you have a really fast 7200 rpm drive, and maybe 2-3 seconds if you have a slow 5400 rpm drive.
Obviously, you can create workloads that aren’t bottlenecked on the maximum ethernet download speed, or the speed of reading from a local disk drive; for example, “dd if=/dev/zero of=big-zero-file” will create a very large number of dirty pages that must be written to the hard drive at the next commit or fsync() call. It’s important to remember though, fsync() doesn’t create any extra I/O (although it may remove some optimization opportunities to avoid double writes); fsync() just pushes around when the I/O gets done, and whether it gets done synchronously or asynchronously. If you create a large number of pages that need to be flushed to disk, sooner or later it will have a significant and unfortunate effect on your system’s performance. Fsync() might make things more visible, but if the fsync() is done off the main UI thread, the fact that fsync() triggers a commit won’t actually disturb other processes doing normal I/O; in ext3 and ext4, we start a new transaction to take care of new file system operations while the committing transction completes.
The final observation I’ll make is that part of the problem is that Firefox as an application wants to make a huge number of updates to state files and was concerned about not losing that information even in the face of a crash. Every application writer should be asking themselves whether this sort of thing is really necessary. For example, doing some quick measurements using ext4, I determined that Firefox was responsible for 2.54 megabytes written to the disk for each web page visited by the user (and this doesn’t include writes to the Firefox cache; I symlinked the cache directory to a tmpfs directory mounted on /tmp to reduce the write load to my SSD). So these 2.54 megabytes is just for Firefox’s cookie cache and Places database to maintain its “Awesome bar”. Is that really worth it? If you visit 400 web pages in a day, that’s 1GB of writes to your SSD, and if you write more than 20GB/day, the Intel SSD will enable its “write endurance management feature” which slows down the performance of the drive. In light of that, exactly how important is it to update those darned sqllite databases after every web click? What if Firefox saved a list of URL’s that has been visited, and only updated every 30 or 60 minutes, instead? Is it really that every last web page that you browse be saved if the system crashes? An fsync() call every 15, 30, or 60 minutes, done by a thread which doesn’t block the application’s UI, would have never been noticed and would have not started the firestorm on Firefox’s bugzilla #421482. Very often, after a little thinking, a small change in the application is all that’s necessary for to really optimize the application’s fsync() usage.
(Skip over the sidebar — if you’ve already read it).
If you read through the Firefox’s bugzilla entry, you’ll find reports of fsync delays of 30 seconds or more. That tale has grown in the retelling, and I’ve seen some hyperbolic claims of five minute delays. Where did that come from? Well, if you look that those claims, you’ll find they were using a very read-heavy workload, and/or they were using the ionice command to set a real-time I/O priority. For example, something like “ionice -c 1 -n 0 tar cvf /dev/null big-directory”.
This will cause some significant delays, first of all because “ionice -c 1″ causes the process to have a real-time I/O priority, such that any I/O requests issued by that process will be serviced before all others. Secondly, even without the real-time I/O priority, the I/O scheduler naturally prioritizes reads as higher priority than writes because normally processes are waiting for reads to complete, but writes are normally asynchronous.
This is not at all realistic workload, and it is even more laughable that some people thought this might be an accurate representation of the I/O workload of a kernel compile. These folks had never tried the experiment, or measured how much I/O goes on during a kernel compile. If you try it, you’ll find that a kernel compile sucks up a lot of CPU, and doesn’t actually do that much I/O. (In fact, that’s why an SSD only speeds up a kernel compile by about 20% or so, and that’s in a completely cold cache case. If the commonly used include files are already in the system’s page cache, the performance improvement of the SSD is much less.)
Jump back to reading Performance problems with fsync.
One argument that has commonly been made on the various comment streams is that when replacing a file by writing a new file and the renaming “file.new” to “file”, most applications don’t need a guarantee that new contents of the file are committed to stable store at a certain point in time; only that either the new or the old contents of the file will be present on the disk. So the argument is essentially that the sequence:
- fd = open(“foo.new”, O_WRONLY);
- write(fd, buf, bufsize);
- rename(“foo.new”, “foo”);
… is too expensive, since it provides “atomicity and durability”, when in fact all the application needed was “atomicity” (i.e., either the new or the old contents of foo should be present after a crash), but not durability (i.e., the application doesn’t need to need the new version of foo now, but rather at some intermediate time in the future when it’s convenient for the OS).
This argument is flawed for two reasons. First of all, the squence above exactly provides desired “atomicity without durability”. It doesn’t guarantee which version of the file will appear in the event of an unexpected crash; if the application needs a guarantee that the new version of the file will be present after a crash, it’s necessary to fsync the containing directory. Secondly, as we discussed above, fsync() really isn’t that expensive, even in the case of ext3′ and data=ordered; remember, fsync() doesn’t create extra I/O’s, although it may introduce latency as the application waits for some of the pending I/O’s to complete. If the application doesn’t care about exactly when the new contents of the file will be committed to stable store, the simplest thing to do is to execute the above sequence (open-write-fsync-close-rename) in a separate, asynchronous thread. And if the complaint is that this is too complicated, it’s not hard to put this in a library. For example, there is currently discussion on the gtk-devel-list on adding the fsync() call to g_file_set_contents(). Maybe if someone asks nicely, the glib developers will add an asynchronous version of this function which runs g_file_set_contents() in a separate thread. Voila!
The problem is that I don’t, really, want to turn off fsync’s, because I like my data. What I want to do is to spin up the drive as little as possible while maintaining data consistency. Really what I want is a knob that says “I’m willing to lose up to minutes of work, but no more”. We even have that knob (laptop mode and all that), but it only works in simple cases.
This is a reasonable concern, and the way to fix this is to enhance laptop_mode in the Linux kernel. Bart Samwel, the author and maintainer of laptop_mode, actually discussed this idea with me last month at FOSDEM. Laptop_mode already adjusts /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs based on the configuration parameter MAX_LOST_WORK_SECONDS, and it also adjusts the file system commit time (for ext3; it needs to be taught to do the same thing for ext4, which is a simple patch) to MAX_LOST_WORK_SECONDS as well. All that is necessary is a kernel patch to allow laptop_mode to disable fsync() calls, since the kernel knows that it is in laptop_mode, and it notices that the disk has spun up, it will sync out everything to disk, since once the energy has been spent to spin up the hard drive, we might as well write everything in memory that needs to be written out right away. Hence, a patch which allows fsync() calls to be disabled while in laptop_mode should do pretty much everything Nate has asked. I need to check to see if laptop_mode does this already, but if it doesn’t force a file system commit when it detects that the hard drive has been spun up, it should obviously do this as well.
(In addition to having a way to globally disable fsync()’s, it may also be useful to have a way to selectively disable fsync()’s on a per-process basis, or on the flip side, exempt some process from a global fsync-disable flag. This may be useful if there are some system daemons that really do want to wake up the hard drive — and once the hard drive is spinning, naturally everything else that needs to pushed out to stable store should be immediately written.)
With this relatively minor change to the kernel’s support of laptop_mode, it should be possible to achieve the result that Nate desires, without needing force applications to worry about this issue; applications should be able to just simply use fsync() without fear.
As we’ve seen, the reasons most people think fsync() should be avoided really don’t hold water. The fsync() call really is your friend, and it’s really not the villain that some have made it out to be. If used intelligently, it can provide your application with a portable way of assuring that your data has been safely written to stable store, without causing a user-visible latency in your application. The problem is getting people to not fear fsync(), understand fsync(), and then learning the techniques to use fsync() optimally.
So just as there has been a Don’t fear the penguin campaign, maybe we also need to have a “Don’t fear the fsync()” campaign. All we need is a friendly mascot and logo for a “Don’t fear the fsync()” campaign. Anybody want to propose an image? We can make some T-shirts, mugs, bumper stickers…