After reading the comments on my earlier post, Delayed allocation and the zero-length file problem, as well as some of the comments on the Slashdot story and the Ubuntu bug, it’s become very clear to me that there are a lot of myths and misplaced concerns about fsync() and how best to use it. I thought it would be appropriate to correct as many of these misunderstandings about fsync() as possible in one comprehensive blog posting.
As the Eat My Data presentation points out very clearly, the only safe way that POSIX provides for requesting that data written to a particular file descriptor be safely stored on stable storage is via the fsync() call. Linux’s close(2) man page makes this point very clearly:
A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2).
Why don’t application programmers follow these sage words? These three reasons are most often given as excuses:
- (Perceived) performance problems with fsync()
- The application only needs atomicity, but not durability
- The fsync() causing the hard drive to spin up unnecessarily in laptop_mode
Let’s examine each of these excuses one at a time, to see how valid they really are.
(Perceived) performance problems with fsync()
Most of the bad publicity with fsync() originated with the now infamous problem with Firefox 3.0 that showed up about a year ago in May, 2008. What happened with Firefox 3.0 was that the primary user interface thread called the sqlite library each time the user clicked on a link to go to a new page. The sqlite library called fsync(), which in ext3’s data=ordered mode caused a large latency that was visible to the user if another process was doing a large file copy at the same time.
Nearly all of the reported delays were a few seconds, which would be expected; normally there isn’t that much dirty data that needs to be flushed out on a Linux system, even if it is very busy. For example, consider a laptop downloading an .iso image from a local file server; if the laptop has the exclusive use of a 100 megabit/second ethernet link, and the server has the .iso file in cache, or has a nice fast RAID array so it is not the bottleneck, then in the best case, the laptop will be able to download data at the rate of 10-12 MB/second. Assuming the default 5 second commit interval, that means that in the worst case, there will be at most 60 megabytes which must be written out before the commit can proceed. A reasonably modern 7200 rpm laptop drive can write between 60 and 70 MB/second. (The Seagate Momentus 7200.4 laptop drive is reported to be able to deliver 85-104 MB/second, but I can’t find it for sale anywhere for love or money.) In this example, an fsync() will trigger a commit and might need to take a second while the download is going on; perhaps half a second if you have a really fast 7200 rpm drive, and maybe 2-3 seconds if you have a slow 5400 rpm drive.
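The arithmetic behind that estimate can be written down as a small back-of-the-envelope calculator. This is only a sketch: the function names are made up for illustration, the inputs are the figures quoted above (12 MB/second arriving, a 5 second commit interval), and it ignores seeks and any other I/O competing for the drive.

```c
#include <stdio.h>

/* Worst-case amount of dirty data (in MB) that can pile up between two
 * commits, given how fast data arrives and how often commits happen. */
static double dirty_megabytes(double inflow_mb_per_sec, double commit_interval_sec)
{
    return inflow_mb_per_sec * commit_interval_sec;
}

/* Rough time (in seconds) that an fsync()-triggered commit needs to write
 * that much data at the drive's sustained write rate. */
static double flush_seconds(double dirty_mb, double drive_mb_per_sec)
{
    return dirty_mb / drive_mb_per_sec;
}

/* Print the estimate for one drive speed, using the download example:
 * 12 MB/second of incoming data and a 5 second commit interval. */
static void print_estimate(double drive_mb_per_sec)
{
    double dirty = dirty_megabytes(12.0, 5.0);

    printf("%.0f MB dirty, about %.1f s at %.0f MB/s\n",
           dirty, flush_seconds(dirty, drive_mb_per_sec), drive_mb_per_sec);
}
```

Calling print_estimate(60.0) reproduces the "about a second" figure; a hypothetical drive that only sustains 30 MB/second would take about two seconds for the same 60 MB.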
(Jump to Sidebar: What about those 30 second fsync reports?)
Obviously, you can create workloads that aren’t bottlenecked on the maximum ethernet download speed, or the speed of reading from a local disk drive; for example, “dd if=/dev/zero of=big-zero-file” will create a very large number of dirty pages that must be written to the hard drive at the next commit or fsync() call. It’s important to remember, though, that fsync() doesn’t create any extra I/O (although it may remove some optimization opportunities to avoid double writes); fsync() just pushes around when the I/O gets done, and whether it gets done synchronously or asynchronously. If you create a large number of pages that need to be flushed to disk, sooner or later it will have a significant and unfortunate effect on your system’s performance. fsync() might make things more visible, but if the fsync() is done off the main UI thread, the fact that fsync() triggers a commit won’t actually disturb other processes doing normal I/O; in ext3 and ext4, we start a new transaction to take care of new file system operations while the committing transaction completes.
The final observation I’ll make is that part of the problem is that Firefox as an application wants to make a huge number of updates to state files and is concerned about not losing that information even in the face of a crash. Every application writer should be asking themselves whether this sort of thing is really necessary. For example, doing some quick measurements using ext4, I determined that Firefox was responsible for 2.54 megabytes written to the disk for each web page visited by the user (and this doesn’t include writes to the Firefox cache; I symlinked the cache directory to a tmpfs directory mounted on /tmp to reduce the write load to my SSD). So these 2.54 megabytes are just for Firefox’s cookie cache and the Places database that maintains its “Awesome bar”. Is that really worth it? If you visit 400 web pages in a day, that’s 1GB of writes to your SSD, and if you write more than 20GB/day, the Intel SSD will enable its “write endurance management feature”, which slows down the performance of the drive. In light of that, exactly how important is it to update those darned sqlite databases after every web click? What if Firefox saved a list of URLs that have been visited, and only updated the databases every 30 or 60 minutes instead? Is it really that important that every last web page that you browse be saved if the system crashes? An fsync() call every 15, 30, or 60 minutes, done by a thread which doesn’t block the application’s UI, would never have been noticed and would not have started the firestorm on Firefox’s bugzilla #421482. Very often, after a little thinking, a small change in the application is all that’s necessary to really optimize the application’s fsync() usage.
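As a rough illustration of "an fsync() every N minutes, done by a thread which doesn’t block the UI", here is a minimal pthreads sketch. Everything in it (struct flusher, flusher_start(), flusher_stop()) is a hypothetical helper invented for this example, not an existing library API or anything Firefox does; it assumes the application simply write()s to the file descriptor from its own threads and leaves all fsync() calls to the background thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: one background flusher for one file descriptor. */
struct flusher {
    int fd;
    unsigned interval_sec;   /* e.g. 15 * 60 for every 15 minutes */
    unsigned long syncs;     /* successful fsync() calls so far */
    bool stop;
    pthread_mutex_t lock;
    pthread_cond_t wake;
    pthread_t thread;
};

static void *flusher_main(void *arg)
{
    struct flusher *f = arg;

    pthread_mutex_lock(&f->lock);
    for (;;) {
        struct timespec deadline;
        int rc = 0;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += f->interval_sec;

        /* Sleep until the interval expires or flusher_stop() wakes us. */
        while (!f->stop && rc != ETIMEDOUT)
            rc = pthread_cond_timedwait(&f->wake, &f->lock, &deadline);
        bool stopping = f->stop;

        /* Do the slow part without holding the lock. */
        pthread_mutex_unlock(&f->lock);
        bool ok = (fsync(f->fd) == 0);
        pthread_mutex_lock(&f->lock);

        if (ok)
            f->syncs++;
        if (stopping)
            break;           /* that was the final fsync() */
    }
    pthread_mutex_unlock(&f->lock);
    return NULL;
}

/* Start the flusher; returns 0 on success or an error number. */
static int flusher_start(struct flusher *f, int fd, unsigned interval_sec)
{
    f->fd = fd;
    f->interval_sec = interval_sec;
    f->syncs = 0;
    f->stop = false;
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->wake, NULL);
    return pthread_create(&f->thread, NULL, flusher_main, f);
}

/* Ask for one last fsync() and wait for the thread to finish. */
static void flusher_stop(struct flusher *f)
{
    pthread_mutex_lock(&f->lock);
    f->stop = true;
    pthread_cond_signal(&f->wake);
    pthread_mutex_unlock(&f->lock);
    pthread_join(f->thread, NULL);
}
```

The UI thread never waits on the disk here; the cost is that up to interval_sec worth of writes can be lost in a crash, which is exactly the tradeoff the application writer has to decide on.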
(Skip over the sidebar — if you’ve already read it).
Sidebar: What about those 30 second fsync reports?
If you read through Firefox’s bugzilla entry, you’ll find reports of fsync delays of 30 seconds or more. That tale has grown in the retelling, and I’ve seen some hyperbolic claims of five minute delays. Where did that come from? Well, if you look at those claims, you’ll find they were using a very read-heavy workload, and/or they were using the ionice command to set a real-time I/O priority. For example, something like “ionice -c 1 -n 0 tar cvf /dev/null big-directory”.
This will cause some significant delays, first of all because “ionice -c 1” causes the process to have a real-time I/O priority, such that any I/O requests issued by that process will be serviced before all others. Secondly, even without the real-time I/O priority, the I/O scheduler naturally gives reads a higher priority than writes, because processes are normally waiting for reads to complete, while writes are normally asynchronous.
This is not at all a realistic workload, and it is even more laughable that some people thought this might be an accurate representation of the I/O workload of a kernel compile. These folks had never tried the experiment, or measured how much I/O goes on during a kernel compile. If you try it, you’ll find that a kernel compile sucks up a lot of CPU, and doesn’t actually do that much I/O. (In fact, that’s why an SSD only speeds up a kernel compile by about 20% or so, and that’s in a completely cold cache case. If the commonly used include files are already in the system’s page cache, the performance improvement of the SSD is much less.)
Jump back to reading Performance problems with fsync.
The atomicity not durability argument
One argument that has commonly been made on the various comment streams is that when replacing a file by writing a new file and then renaming “file.new” to “file”, most applications don’t need a guarantee that the new contents of the file are committed to stable store at a certain point in time; only that either the new or the old contents of the file will be present on the disk. So the argument is essentially that the sequence:
- fd = open("foo.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
- write(fd, buf, bufsize);
- fsync(fd);
- close(fd);
- rename("foo.new", "foo");
… is too expensive, since it provides “atomicity and durability”, when in fact all the application needed was “atomicity” (i.e., either the new or the old contents of foo should be present after a crash), but not durability (i.e., the application doesn’t need the new version of foo on disk now, but rather at some indeterminate time in the future when it’s convenient for the OS).
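Written out with error handling, the sequence above might look like the following C sketch. The function name replace_file() and its tmp_path parameter are assumptions made for this example (the temporary file must live in the same directory as the target so that rename() stays on one file system); it is one plausible way to spell the five steps, not a reference implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Give up on the temporary file, preserving the errno of the real failure. */
static int abandon(int fd, const char *tmp_path)
{
    int saved = errno;

    if (fd >= 0)
        close(fd);
    unlink(tmp_path);
    errno = saved;
    return -1;
}

/* Replace `path` with `len` bytes from `buf`, going through `tmp_path`.
 * Returns 0 on success, -1 on failure with errno set. */
static int replace_file(const char *path, const char *tmp_path,
                        const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until all is written. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return abandon(fd, tmp_path);
        }
        p += n;
        len -= (size_t)n;
    }

    /* The new contents must be on stable storage before the rename. */
    if (fsync(fd) != 0)
        return abandon(fd, tmp_path);
    if (close(fd) != 0)
        return abandon(-1, tmp_path);

    /* Atomically switch the name over to the new contents. */
    if (rename(tmp_path, path) != 0)
        return abandon(-1, tmp_path);
    return 0;
}
```

A caller would use it as replace_file("foo", "foo.new", buf, bufsize). After a crash, "foo" holds either the complete old contents or the complete new contents.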
This argument is flawed for two reasons. First of all, the sequence above provides exactly the desired “atomicity without durability”. It doesn’t guarantee which version of the file will appear in the event of an unexpected crash; if the application needs a guarantee that the new version of the file will be present after a crash, it’s necessary to fsync the containing directory. Secondly, as we discussed above, fsync() really isn’t that expensive, even in the case of ext3 and data=ordered; remember, fsync() doesn’t create extra I/O’s, although it may introduce latency as the application waits for some of the pending I/O’s to complete. If the application doesn’t care about exactly when the new contents of the file will be committed to stable store, the simplest thing to do is to execute the above sequence (open-write-fsync-close-rename) in a separate, asynchronous thread. And if the complaint is that this is too complicated, it’s not hard to put this in a library. For example, there is currently discussion on the gtk-devel-list on adding the fsync() call to g_file_set_contents(). Maybe if someone asks nicely, the glib developers will add an asynchronous version of this function which runs g_file_set_contents() in a separate thread. Voila!
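For the case where the new version really must survive a crash, the extra step of fsyncing the containing directory can be sketched as below. The helper name fsync_parent_dir() is made up for this example; it assumes a Linux-like system where a directory can be opened read-only and passed to fsync(), and it would be called right after a successful rename().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* fsync() the directory that contains `path`, so that a preceding rename()
 * into that directory is itself on stable storage.
 * Returns 0 on success, -1 on failure with errno set. */
static int fsync_parent_dir(const char *path)
{
    char *copy = strdup(path);      /* dirname() may modify its argument */
    int dirfd, rc, saved;

    if (copy == NULL)
        return -1;

    dirfd = open(dirname(copy), O_RDONLY);
    saved = errno;
    free(copy);
    if (dirfd < 0) {
        errno = saved;
        return -1;
    }

    rc = fsync(dirfd);
    saved = errno;
    close(dirfd);
    errno = saved;
    return rc;
}
```

Applications that only want atomicity can skip this call; it matters only when the caller has to know that the new name is durable before it proceeds.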
Avoiding hard drive spin-ups with laptop_mode
Finally, as Nathaniel Smith said in Comment #111 of my previous post:
The problem is that I don’t, really, want to turn off fsync’s, because I like my data. What I want to do is to spin up the drive as little as possible while maintaining data consistency. Really what I want is a knob that says “I’m willing to lose up to N minutes of work, but no more”. We even have that knob (laptop mode and all that), but it only works in simple cases.
This is a reasonable concern, and the way to fix this is to enhance laptop_mode in the Linux kernel. Bart Samwel, the author and maintainer of laptop_mode, actually discussed this idea with me last month at FOSDEM. Laptop_mode already adjusts /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs based on the configuration parameter MAX_LOST_WORK_SECONDS, and it also adjusts the file system commit time (for ext3; it needs to be taught to do the same thing for ext4, which is a simple patch) to MAX_LOST_WORK_SECONDS as well. All that is necessary is a kernel patch to allow laptop_mode to disable fsync() calls. Since the kernel knows that it is in laptop_mode, when it notices that the disk has spun up, it will sync out everything to disk; once the energy has been spent to spin up the hard drive, we might as well write out everything in memory that needs to be written right away. Hence, a patch which allows fsync() calls to be disabled while in laptop_mode should do pretty much everything Nate has asked. I need to check to see if laptop_mode does this already, but if it doesn’t force a file system commit when it detects that the hard drive has been spun up, it should obviously do this as well.
(In addition to having a way to globally disable fsync()’s, it may also be useful to have a way to selectively disable fsync()’s on a per-process basis, or on the flip side, exempt some process from a global fsync-disable flag. This may be useful if there are some system daemons that really do want to wake up the hard drive; once the hard drive is spinning, naturally everything else that needs to be pushed out to stable store should be immediately written.)
With this relatively minor change to the kernel’s support of laptop_mode, it should be possible to achieve the result that Nate desires, without needing to force applications to worry about this issue; applications should be able to simply use fsync() without fear.
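To see what those two knobs are currently set to, a small read-only C sketch like the following can help. The /proc paths are the ones named above; the helper names and the straight seconds-to-centiseconds conversion are assumptions for illustration, and the post does not show what values laptop_mode actually writes.

```c
#include <stdio.h>

/* The vm.dirty_* tunables are expressed in hundredths of a second.
 * Assumption: a MAX_LOST_WORK_SECONDS value maps to them by a plain
 * multiplication; the real laptop_mode scripts may do more than this. */
static long seconds_to_centisecs(long seconds)
{
    return seconds * 100;
}

/* Read one integer tunable, e.g. /proc/sys/vm/dirty_expire_centisecs.
 * Returns 0 on success, -1 if the file is missing or unreadable. */
static int read_tunable(const char *proc_path, long *value)
{
    FILE *fp = fopen(proc_path, "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%ld", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* Compare the current setting against a desired "lost work" budget. */
static void report_tunable(const char *proc_path, long max_lost_work_seconds)
{
    long current;

    if (read_tunable(proc_path, &current) != 0) {
        printf("%s: not readable on this system\n", proc_path);
        return;
    }
    printf("%s: currently %ld, budget would be %ld\n",
           proc_path, current, seconds_to_centisecs(max_lost_work_seconds));
}
```

For example, report_tunable("/proc/sys/vm/dirty_expire_centisecs", 600) compares the current expiry against a ten minute budget without changing anything.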
Summary
As we’ve seen, the reasons most people think fsync() should be avoided really don’t hold water. The fsync() call really is your friend, and it’s really not the villain that some have made it out to be. If used intelligently, it can provide your application with a portable way of assuring that your data has been safely written to stable store, without causing a user-visible latency in your application. The problem is getting people to not fear fsync(), to understand fsync(), and then to learn the techniques to use fsync() optimally.
So just as there has been a Don’t fear the penguin campaign, maybe we also need to have a “Don’t fear the fsync()” campaign. All we need is a friendly mascot and logo for a “Don’t fear the fsync()” campaign. Anybody want to propose an image? We can make some T-shirts, mugs, bumper stickers…
March 16th, 2009 at 12:58 am
Ted,
Thanks for writing this. Your argument against the “atomicity and not durability” complaint is basically that fsync is actually cheap and was only a problem with ext3. I respectfully disagree. Either your disks are much faster than mine or your filesystems are in a lot better shape.
I switched to ext4 a few months ago with a full backup, new fs, and restore (no ext3->ext4 conversion). Within the past few weeks, I’ve begun to notice delays again with firefox’s stupid awesome bar. My guess is that my filesystem is fairly fragmented and that that’s contributing to the problem. I don’t really have any way to verify though.
I hope that either online or even offline defrag will eventually solve my problems, but it’s my understanding that that feature isn’t finished yet. Either way, my point is that for me at least fsync (or maybe fragmentation) is still a significant problem. In an ideal world, fsync is really cheap and computers never crash. In the real world, that’s not true.
March 16th, 2009 at 1:10 am
Here’s a good logo:
http://www.trendir.com/archives/edda-eme-love-kitchen-sink.jpg
March 16th, 2009 at 1:16 am
Funny how your solution to the spin-up problem is to throw POSIX-compliance out the window and ditch fsync’s durability guarantee completely. There are applications out there that legitimately need to make sure that their data is on disk, even when running on battery.
March 16th, 2009 at 1:21 am
Thanks for the detailed post.
I’m a bit surprised that my post beginning “I don’t…want to turn off fsync’s” gets a reply “we can solve your problem by disabling fsync”… maybe I’m missing something.
Is the idea that 1) fsync is made a no-op, but 2) when the hard drive does spin up, we write out all dirty data *in a single atomic transaction*? That would make fsync still effective as a write barrier, since either we’re in “AC mode” and fsync works as usual, or we’re in “laptop mode” and the entire filesystem gets a strong consistency guarantee, so no-ops are a perfectly legal way to implement a write barrier.
But I don’t see how that’s possible without data journaling (maybe btrfs can do this more effectively?), and I *don’t* want laptop mode to mean “random large files might get totally trashed, even if only a tiny part was touched and the app was written correctly”… see my original comment.
March 16th, 2009 at 1:25 am
@1:I switched to ext4 a few months ago with a full backup, new fs, and restore (no ext3->ext4 conversion). Within the past few weeks, I’ve begun to notice delays again with firefox’s stupid awesome bar. My guess is that my filesystem is fairly fragmented and that that’s contributing to the problem. I don’t really have any way to verify though.
Btmorex,
What version of Firefox 3 are you running? Firefox made some changes to move the fsync() calls off the main UI thread. The latest versions of Firefox are not supposed to have these problems. Exactly what sort of delays are you seeing? All I can tell you is that using Firefox 3.0.7 from Ubuntu Jaunty, and using a Seagate Momentus 7200.3 drive, I’ve not noticed any problems. I don’t know what extensions you have enabled, and how many days worth of browsing history you are keeping; maybe there are some other firefox settings that are causing you to see more of an issue here?
March 16th, 2009 at 1:40 am
@3, @4: I’m a bit surprised that my post beginning “I don’t…want to turn off fsync’s” gets a reply “we can solve your problem by disabling fsync”… maybe I’m missing something.
Tom, Nate:
I’m not an anal-retentive POSIX purist. I don’t believe that ‘du’ should return the size of directory hierarchies in units of 512-byte sectors, and as I’ve said in the past, I’m a big fan of noatime. However, I do believe that people who invent file system semantics not guaranteed by POSIX, when Linux has over 60 file systems in its kernel, are playing with fire. So even as I’ve implemented various work-arounds, some of which have already been sucked into Ubuntu Jaunty and Fedora 11 beta kernels, and even as I’ve implemented a full, completely bug-for-bug compatible data=alloc_on_commit mode for ext4 that has all of the behavioural advantages (flush all data blocks on commit) and disadvantages (fsync triggers a commit, which flushes all data blocks on commit), I’m still going to call applications which depend on such behaviour broken, and call on them to use fsync().
What my position is on laptop_mode is that applications should use fsync() everywhere they need to make sure files are committed to stable store, not depend on individual file system semantics — possibly calling fsync() in a separate thread if there is a concern about latency issues. To mitigate against the effects of what this would do for laptop_mode, we can add a feature that allows us to optionally cause the kernel to ignore fsync()’s while laptop mode is enabled. Yes, this may potentially cause data to be lost while the laptop is running on laptop_mode; but laptop_mode is already monkeying with all sorts of kernel tuning parameters which delay when data gets flushed to disk, which increases the chances of data loss; when a system administrator enables laptop mode, that’s a tradeoff the system administrator has made — more battery life in exchange for a larger amount of work that might be lost in case the laptop crashes unexpectedly.
As far as system daemons that might want to force data to disk even while laptop_mode is enabled, Bart and I talked about that. My suggestion was to make a per-process flag which is inherited across fork() system calls which allowed fsync()’s to either be always enabled, always disabled, or disabled when a global “laptop_mode” flag is enabled. Ultimately, though, it’s up to the system administrator to decide how much risk he/she is willing to take. If the system administrator wants to globally disable fsync()’s, the system administrator should have the option to do that.
March 16th, 2009 at 1:43 am
The “use a separate thread to avoid the fsync() latency” argument seems flawed in at least two ways.
First, the implementation would need to be even more complex than described in the article. Consider what happens with a simple implementation that launches a thread to run the fsync/rename code if you ever update the same file twice. You have a race condition that will lead to data loss.
Second, it does not provide the desired semantics even for a single update. If you replace open+write+close+rename with “call complex library function that will start a thread calling rename after fsync” then in the latter case the file will still have the old contents after the call returns. If you do atomic_write(contents); read(file); then the read should always return the contents that were just written.
I think a reasonable implementation of an atomic write should fulfill the following conditions:
1) After the call returns any subsequent read will return the new contents unless there was a crash in between.
2) After a crash the contents may be either the new or a previous version, but not anything else.
3) The write operation should return before waiting for the physical disk to write anything if possible.
Doing fsync + rename in the main thread fails 3). Launching a separate thread to do fsync + rename fails 1).
March 16th, 2009 at 2:37 am
@6: I do believe that people who invent file system semantics not guaranteed by Posix, when Linux has over 60 file systems in its kernel, are playing with fire.
That’s fair, but this narrow piece of the problem (different filesystems having different capabilities) has a standard solution: add an fsync_me_lightly syscall that does something special on filesystems that support it, or falls back to fsync otherwise. (And libraries would of course fallback to calling fsync directly on less-capable non-Linux systems.)
Right now, on ext3 and ext4, atomic-rename-without-fsync is safe, and sqlite commit-with-fsync is also safe, and only the latter spins up my disk. But, in the future you advocate, where everyone calls fsync all the time, I have two choices: either atomic-rename will also spin up my disk and waste power+latency, *or* I get to play Russian roulette every time I touch a database. (And again, I’m not risking just the last few minutes of work — that would be fine. I’m risking the entire database.)
Either way it’s a regression. That’s why I fear the fsync.
All it would take to avoid this particular problem is providing an fsync_for_rename call that’s a no-op on ext3/ext4/btrfs. There may well be better solutions yet, but that’s a lower bound.
March 16th, 2009 at 2:42 am
The changes to Firefox that move the history fsync off the main thread are only in Firefox 3.5 (was going to be 3.1), due to the fact they had to change the history schema.
Your response to “The atomicity not durability argument” fails because returning an empty file is correct according to POSIX (and is exactly the bug that caused this flamewar). POSIX doesn’t give a way for applications to request this, instead they have to rely on filesystems implementing workarounds like you did for 2.6.30. Therefore applications are forced to fsync all the time to satisfy POSIX. You can’t claim “oh, data integrity isn’t guaranteed by POSIX if you don’t fsync” and then immediately say “applications don’t need to fsync if they want atomicity not durability” when that’s not specified by POSIX.
March 16th, 2009 at 3:24 am
Ted,
I think you misunderstood the “atomicity not durability argument” somewhat. What people claiming that behaviour want is this:
1. They want to keep doing just open(), write(), close(), rename().
2. If it so happens that the directory with the renamed file gets committed to the disk (pointing to the new file), they want the contents of file be committed before that for sure. They also want that no explicit commits of data are necessary for this to happen (i.e. they don’t want to burden the I/O with an explicit fsync() at all). In other words, they want that it should be all I/O or no I/O at all when it comes to write()/rename() combo.
So, for instance, they want to be able to run the open(), write(), close(), rename() sequence potentially hundreds of times without touching the platters of the disk once. If the file system comes to its normal commit time and there is a file to be renamed, they want the above “all I/O” option to happen.
I personally don’t agree with any of that (because this is not what POSIX requires), but I think this is what they want.
March 16th, 2009 at 3:30 am
tytso: it might be useful if you could make a simple one-paragraph post that states “since …effdc8, on 2009-02-24, using mktemp(); ...; rename(); for transactional updates now does what people expect and will not generate zero-length files”. This specific point seems to have got lost in the sea of Btrfs comparison and fsync(). (And if my understanding is wrong and this is not the case, note exactly which further patches need backporting by distros before release.)
March 16th, 2009 at 3:32 am
1)
“Fsync is slow” -> “Not really, but just don’t call fsync so much”
Hold on a sec. If Firefox only calls fsync() every 30 minutes, then at minute 29 we’re likely to have the original problem all over again where a crash will result in a 0-byte file. Unless you actually stop firefox from writing files at all during that 30 minutes, which is not what we want. Plus, how did Firefox pick that number? As an application developer, I have no idea what number to pick. As a user, I want to decide globally, not have to figure out how 10 applications picked their number and hope I can change it.
2) “I don’t need durability” -> “the simplest thing to do is to execute the above sequence in a separate, asynchronous thread”
That suggests an answer to the question of why few existing applications are written this way. If that’s the simplest way to do it, these interfaces are terrible. And a standard library like glibc really needs a helper function. (I’m not interested in linking PAM against GTK just to keep my contents of /etc/shadow)
3) “But fsync spins up my hard drive” -> “We can make it a no-op”
But I don’t want it to be a no-op. I want fsync to do its job. I just don’t want to _have_ to do that job when all I need is atomicity. Can’t we just invent a new command that actually gives us that?
feieio() — force all past operations on this inode to finish before new ones start?
Or, how about fsyncadvise(n) — if it is likely that future metadata updates can be written more than N seconds before the data, perform a fsync, otherwise, don’t bother.
On ext3 this does a sync if n < 5 or similar
On ext4 this does a sync if n < 60 or similar
In laptop-mode, it does NOT do a sync,
because the metadata and data will be written at the same time when the disk spins back up.
March 16th, 2009 at 3:37 am
What version of Firefox 3 are you running? Firefox made some changes to move the fsync() calls off the main UI thread. The latest versions of Firefox are not supposed to have these problems. Exactly what sort of delays are you seeing? All I can tell you is that using Firefox 3.0.7 from Ubuntu Jaunty, and using a Seagate Momentus 7200.3 drive, I’ve not noticed any problems. I don’t know what extensions you have enabled, and how many days worth of browsing history you are keeping; maybe there are some other firefox settings that are causing you to see more of an issue here?
I’m using Debian lenny’s 3.0.6. I’ll try 3.0.7 if I can install it without pulling in half of unstable.
The exact behavior I’m seeing is like so:
1.) Start with empty url bar
2.) type something like “slashdot”
3.) the results start showing up, but sometimes there is a fairly long pause (seconds) where the ui is frozen.
4.) tab to select first result and hit return. Sometimes there’s an even longer pause at this point before firefox starts trying to get the page.
It seems to be erratic and probably correlates pretty well with having some other kind of disk activity, but I haven’t been paying close attention.
None of the extensions I’m using should have any effect. They’re all pretty popular and hopefully wouldn’t touch the url bar in any way. (adblock, firebug, yslow)
Not sure about browsing history. ‘ls -lS | head’ in my profile directory shows:
-rw-r--r-- 1 avery avery 51585024 2009-03-16 02:51 urlclassifier3.sqlite
-rw-r--r-- 1 avery avery 37294080 2009-03-16 03:23 places.sqlite
-rw-r--r-- 1 avery avery 4270080 2008-07-10 17:10 urlclassifier2.sqlite
-rw------- 1 avery avery 2945722 2009-03-16 03:34 XUL.mfasl
-rw------- 1 avery avery 2215562 2009-02-26 11:08 XPC.mfasl
-rw-r--r-- 1 avery avery 997954 2008-07-10 17:36 history.dat
-rw-r--r-- 1 avery avery 256000 2009-03-16 03:16 formhistory.sqlite
-rw------- 1 avery avery 212992 2009-03-16 02:20 cert8.db
-rw-r--r-- 1 avery avery 184320 2009-03-16 03:24 cookies.sqlite
I have no idea if that’s large or small compared to most people, but this is my main computer and I don’t like reinstalling so maybe it is large.
March 16th, 2009 at 3:38 am
My comment on your previous post, which I think should be #149, is mostly in response to this post. (I do not like your proposal at all.)
March 16th, 2009 at 3:46 am
@10 Bojan:
So, for instance, they want to be able to run the open(), write(), close(), rename() sequence potentially hundreds of times without touching the platters of the disk once. If the file system comes to its normal commit time and there is a file to be renamed, they want the above “all I/O” option to happen.
Yes, that is precisely what I want.
Ted, consider the case of code that does (attempted) atomic replace via open/write/close/rename many times per second. It’s not just the UI folks here, it’s web developers and god knows whom else too. Adding an fsync on every call is a definite cost if a file is being updated hundreds of times per minute. What I want is to be able to “replace” that file with a new version whenever I want, and only have that stuff be written out to disk when bdflush/etc. comes along. I don’t much care which version I have at that point, as long as I have one valid, complete file.
I have some code written in perl on our backend webserver that does hundreds (at a minimum) of these file replace operations per second. Adding an fsync would add a serious performance drag, even on ext4, because we’re requiring an amount of real disk operations proportional to the number of file updates, instead of just 2 or 3 (for the metadata, inode, and file content itself) every 5 minutes. I’ll take the O(1) over the O(n) solution.
If we are peppering our code with fsync’s, even if it doesn’t hurt “that much”, we are violating the abstraction that says the kernel is supposed to take care of buffering, caching, and writing things out to disk in a sane way. Even if this use case seems a little bit insane to you, there are obviously a lot of us using it, for better or worse, and that’s why.
March 16th, 2009 at 3:49 am
[quote]Secondly, as we discussed above, fsync() really isn’t that expensive [...] [/quote]
Ted, I really don’t agree with that assertion. To really provide on-disk guarantees, fsync() needs to flush the on-media data as well in the presence of write-back caching. ext4 does this through blkdev_issue_flush(), which (on SATA drives, which I guess is what most people care about) entails a cache flush. So even if ext4 fsync in itself is loads better than ext3 w/ordered, if you have other writers it is still going to be very expensive.
March 16th, 2009 at 3:57 am
Thanks for writing this; I appreciate the detailed consideration of the issues involved. Nevertheless, I respectfully disagree on the atomicity-not-durability argument. First, yes, an fsync on both directories involved (though there’s usually just one) is required for full assurance of durability. However, since the issue at hand is avoiding the zero-length problem, one fsync is sufficient for expository purposes.
With that out of the way, let’s talk about rename. It should create a conceptual write barrier for the data blocks of the file involved. It’s not inventing a filesystem semantic out of thin air any more than writing a zero-length file is: POSIX doesn’t say much at all about what happens after a crash, and so this whole discussion is uncharted territory. It’d be perfectly fine for a POSIX system to overwrite all your files with pictures of donuts on an unclean shutdown. This is not an issue of standards conformance: it is a quality of implementation issue. The standard allowing you to do something terrible isn’t an excuse for that behavior. It’s like saying “yes, it’s perfectly fine that I live off of Crisco and tequila. The law allows it!”
Now, first of all: there’s a lot of historical precedent for rename writing data blocks before metadata: not only does ext3 do it, but many older filesystems too. Certainly, many programs are written under the assumption that my rename semantics hold: and these programs work fine (in fact, better) on a running system.
Second, your rename behavior will lead to bugs now and forever: open-write-close-rename will work just fine on a running system, and there’s a good chance it’ll appear to work even if the developer takes the unusual step of testing during a system crash. Because this sequence will seem to work just fine most of the time, plenty of programs will have hidden data-loss bugs. That’s not a world I want to live in.
Third, there’s the issue of API parsimony. Your semantics change rename from a hard-to-misuse API to one that’s very prone to misuse. See http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html (“How Do I Make This Hard to Misuse?”) and http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html (“What If I Don’t Actually Like My Users?”). On that scale, you’ve moved rename from a very good 7 (“the obvious use is (probably) the correct one”) to an appalling -5 (“do it right and it will sometimes break at runtime”).
On a running system, of course, a rename is atomic with respect to both the filename and its contents — otherwise it’d be useless. Under your semantics, however, you’ve effectively made rename without fsync a useless and dangerous, yet very conceptually tempting operation. Scolding application programmers to insert fsync calls will lead to confusion and frustration: fsync, as “make the data hit the disk now” doesn’t have anything conceptually to do with atomic replacement except as an arcane filesystem implementation detail. Anything that appears to work in the typical case, but that does something dangerous in special corner cases, is broken by design.
When an application performs a rename, the *INTENT* is to insert a write barrier for the data blocks of the file involved. When is this interpretation ever wrong? When is it ever useful to be able to tell the system to replace a file’s contents and its name, except when the system crashes, in which case you just want its name and some garbage? Nobody ever wants that.
But for the sake of argument, let’s bite our tongue and insert this fsync. The system can’t tell the difference between an fsync intended to ensure, say, message receipt, and an fsync that ensures after-crash consistency across a rename. Because we’re blocking and waiting for disk IO, application latency greatly increases (by up to three seconds, apparently). Users begin to complain. Now, the application developer has two choices: either implement the threaded solution you mention, or remove the fsync.
The threaded solution gives the correct behavior, but is horribly complicated, or requires libraries on which the developer might not want to depend, especially for a small operation. Look what you’ve done now: not only have you made the correct code non-obvious, but you’ve made the correct code under-performing as well. It’s absolutely ludicrous to expect every program that wants to correctly replace a file’s contents to spawn off a worker thread.
Thus, most application developers will just remove the fsync. (Or do the moral equivalent, as KDE has done, and provide a knob to turn the fsync back on.) Now we’ve created a deliberate rare data-loss bug because the correct code is far too complicated.
Now, this situation in itself would be bad. But add laptop_mode, and now we’ve made an API the very contemplation of which drives men to unspeakable acts. We’ve added fsync everywhere, and we find it’s causing problems: the disk spins up all the time, as it must in order to maintain fsync’s semantics. So the solution is to neuter the very fsync you’ve implored application developers to add? Because you, the one making fsync a no-op, know that most of these fsyncs are there to maintain data consistency, you can have laptop_mode trade durability for battery life.
But some fsyncs are there to ensure application-level durability. Imagine an SMTP server. So, you create an “fsync-really-means-fsync” inheritable process flag. If an application developer has an *important* fsync call, he’ll just set this process flag and call fsync. Now, since that flag is contrary to 20 years of established use and will be a footnote on a newish version of the fsync manual, most application developers won’t actually know about it. Oh, they’ll call fsync, and their programs will appear to work just fine, even after dutiful hard-reboot testing.
Except when someone using laptop_mode has an unexpected power failure. Now that user has lost, and it wasn’t his fault. (“How the hell wasn’t the message on the disk? fsync returned success. Must be a bad disk. [hours pass] Oh, what’s this laptop_mode? Changes fsync? @!#$%!@#%”) And before you say “caveat modulator:” you shouldn’t need to be an expert on the data retention needs of each of your programs to extract good battery life.
Now when an application developer needs to actually use the *real* fsync, he turns on this process flag. Except he’s also dutifully using fsync to ensure rename consistency, so he has to create plumbing to manage the state of the magic fsync flag across different parts of his program so only the fsyncs that need to be real fsyncs are real. Let’s imagine this program also runs arbitrary other programs: it then needs to unset the magic inheritable fsync flag before fork, otherwise programs that don’t really need it will be running with the real fsync. That’s a non-trivial amount of work.
Also, application developers everywhere need to add autoconf tests for the magic process flag. Older programs will actually be broken, through no fault of their own. It’s either that, or rewrite the initialization scripts for programs that need the real fsync. (And in that case, the program may very well run far more real fsyncs than needed.)
Now you’ve made *two* traditional, long-standing system calls, rename and fsync, act dangerously in certain hard-to-test boundary cases, with elaborate and arcane workarounds that are so counter-intuitive (“fsync almost always means fsync?”) that developers will almost certainly get it wrong, at least the first time. Correct behavior might as well be in a disused lavatory behind a “beware of the leopard” sign.
What’s the alternative? fsync_and_i_mean_it? You could create an fbarrier system call that applications would use to ensure data consistency while preserving fsync’s current role. fbarrier might come in handy in other contexts too. But of course that system call wouldn’t be portable. But wait… we’ve already established that when an application calls rename, it *means* to insert a write barrier. fbarrier might be useful, but we can also infer it from a rename call, and with perfect accuracy: when does an application *not* want this behavior on rename?
So, just make rename include an implicit call to a conceptual fbarrier. Existing applications work. Today. With no changes, or even a recompile. Applications that call fsync before a rename at least do no harm. rename remains an intuitive, powerful, and simple way for an application developer to express what he wants to do (instead of being a tasty-looking landmine). fsync doesn’t have to be treated specially in certain bizarre modes. And you don’t really lose any efficiency, because under your scheme, every correct application would have to call fsync anyway, and I bet fbarrier would be far less expensive than an outright fsync. (Or, if fsync really is cheap enough on a given filesystem, make fbarrier *be* fsync.)
How often do you get to improve performance *and* safety at the same time?
March 16th, 2009 at 4:06 am
I believe there is sometimes a legitimate need to update a file very often (up to many times a second), have the newest version visible to other processes, and have some recent version around in case of a system crash. For example, the application (but not the whole system) may be unstable, or it may be desirable for the user to start and stop it frequently without explicitly saving, so it autosaves very frequently (e.g. Tomboy). Most of the intermediate versions should then never be written to disk, but fsync() (even in a separate thread) currently forces this. Running fsync() in a separate thread also increases the latency for other processes to see the updated file. Granted, one may say that a POSIX-only environment cannot support such an application efficiently, but ext3 with data=ordered and ext4 with alloc-on-commit do support it, so we should direct users who need this feature to these filesystems (for performance reasons) and allow application writers to exploit it easily and safely.
I think what we need right now is a simple mechanism for userland to detect the data integrity guarantees of a filesystem, so that things like rename_after_fsync_if_necessary_for_data_atomicity() can be implemented in a library (like glib), which application writers can immediately use on any journalling filesystem safely and obtain reasonable performance. The user can also have some say on this, particularly for filesystems like fat and ext2 that are inherently non-crash-proof, and on whether barrier=0 is considered safe.
In the long term, maybe it would be good to clean up the sync* API and allow applications to express “please notify me when this write has reached stable storage, but I’m willing to wait for xx seconds so don’t be too eager and impact throughput/power usage/whatever”. This will probably be useful to things like Firefox’s history database. Asynchronous I/O support for sync operations may also be nice.
By the way, is it also necessary, in theory, to do an fsync() on the directory containing the files before renaming in your open/write/close/fsync/rename example? I’d think not, but I’m not sure. Is there any standard on the expected post-crash behavior of journalling filesystems?
March 16th, 2009 at 4:08 am
Calling fsync asynchronously from a separate thread is plain silly. The whole point of fsync is to synchronize things; now you’re telling us that applications that do not want to synchronize should call fsync asynchronously. That is madness.
To quote the rename(2) manpage:
If newpath already exists it will be atomically replaced (subject to a
few conditions; see ERRORS below), so that there is no point at which
another process attempting to access newpath will find it missing.
“At no point” includes “after a crash”. If that isn’t true then update the manpage at the very least. And a ‘newpath’ that turns out to be an empty file instead of the one just written is not the same file for all common sense meanings. So even if technically you’re right, it’s like pointing at a loophole in a contract.
rename(2) should act like a write barrier and all this mess is solved. Saying that applications should wait for IO when they don’t want to is stupid.
Reordering writes is an optimisation, if it messes things up if fsync isn’t called then don’t do it. Don’t shove the blame towards applications.
If you keep clutching to what POSIX promises then at least give us a new systemcall, fbarrier() or something, the thing that is actually missing. Fsync() is too slow and doesn’t do what we want. Period.
1) You say fsync can take seconds, how can you deny that it has performance problems? And “performance” here is latency, not throughput.
Side note: I could reproduce minute long sync delays with the anticipatory scheduler, a problem that as far as I know was never solved (last tested with 2.6.24):
http://bugzilla.kernel.org/show_bug.cgi?id=5900
2) Quote: “… is too expensive, since it provides “atomicity and durability”, when in fact all the application needed was “atomicity” (i.e., either the new or the old contents of foo should be present after a crash), but not durability (i.e., the application doesn’t need to need the new version of foo now, but rather at some intermediate time in the future when it’s convenient for the OS).”
You give two invalid counter arguments:
- ‘First of all, the sequence above exactly provides desired “atomicity without durability”.’
That it gives the desired behaviour is no argument that calling fsync() is needed, or that fsync() is not too expensive a way to achieve the wanted behaviour.
- Secondly, as we discussed above, fsync() really isn’t that expensive.
Yes it is, it has unpredictable huge latencies.
Summary: Don’t keep waving fsync() at us, give us fbarrier() or better,
do less aggressive write reordering. There’s no reason to separate metadata and data writes to the same file from each other that much.
Or if that’s asking for too much, give us a system-wide option to turn fsync() calls into write barriers. Then servers can have the default behaviour and laptops etc. don’t have to wait for their slow hd to spin up.
March 16th, 2009 at 4:22 am
When people used fsync in the past they were punished, and now they are suddenly punished again. Why are app devs at fault? They can’t guess which ‘feature’ fs devs will come up with next month.
Maybe you shouldn’t have declared ext4 stable at this point? Or you shouldn’t have changed so much without a complete rename and a big fat warning.
It is pretty damaging when a fs that zeroes files that are already on the media is called ‘stable’ (and that fits xfs the same as ext4) while a fs that doesn’t (reiser4) isn’t good enough for inclusion.
March 16th, 2009 at 5:26 am
> @14: Yes, that is precisely what I want.
I think we should have a rename2() call for that. Then, we can tell with 100% certainty that:
1. The existing API cannot be abused by assuming it does what it was never designed to do. We just tell people it’s a bug and that they should use rename2().
2. The new API does exactly the full atomic replace of the whole file on disk.
It is trivial to test in configure if rename2() exists and then have two code paths in the application: one with fsync()/rename(), the other with rename2().
March 16th, 2009 at 5:29 am
> @16: 2. The new API does exactly the full atomic replace of the whole file on disk.
And when I say this, I mean this in the context of what I said in comment #10. The platters of the disk are not touched _at_ _all_ until the normal FS commit time.
March 16th, 2009 at 5:32 am
In my opinion, the open/write/close/rename is amazingly common, and fsync isn’t as easy to do as other basic operations. For example, scripting languages may not provide a way to perform an fsync, and often do not define whether they do an fsync on close or similar operation.
Irrespective of whether an application is using fsync, is there any logical reason for an open/write/close/rename over a file with valid data in the old and new file to lose data after a system crash? This may not be directly related to the article, but it is a behavior that I don’t think any application developer would even imagine.
Also, just because fsync is (supposedly) cheap, doesn’t mean it should always be done. If the application developer knows that open/write/close doesn’t guarantee the data will hit the disk, they really might not care. Why do extra work if all the documentation says you don’t have to if you don’t care that much about durability?
March 16th, 2009 at 5:43 am
Your argument about atomicity without durability is absurd – we need to have atomicity without fsync() and don’t mind losing durability in the process. And what do you say? We can’t get that even with an fsync(), thanks – we’ll be using other filesystems then, those whose developers actually care about their users.
That is the height of the bar for a ‘modern’ filesystem. And stop whining about ‘but POSIX says I can do that’, that is lazy computing, get over it. If it is too much work to make new things work at least as good as the old things, maybe it is too much work because you’re doing it wrong. Listen to your users, or you’ll not have any soon enough.
March 16th, 2009 at 6:14 am
Thanks for the detailed post. It gives us users at least a chance to be heard.
But I must admit your way of thinking worries me. You say fsync() doesn’t generate more writes. That may be true in bytes written to disk, but it must be wrong in number of seeks, because delayed allocation is there to reduce the number of seeks (lower fragmentation, optionally also packing many small files into a large, contiguous chunk of data). Whether the latter is implemented in ext4 doesn’t matter; it *is* in other file systems like ReiserFS or btrfs. So if you have KDE4 or GNOME copying and replacing many small files, you want all those files to finally go into one large chunk and be written out in one go – with a single seek. A modern hard disk can write on the order of one MB in the time it needs for a single seek. So seeks are expensive and to be avoided. The difference between copying 100 small files and then spilling them out in one go and 100 fsync()s is very likely to be a factor of 100. And that matters, especially since the fragmentation caused by those many fsync()s will increase the seek time of read accesses to these 100 small files, as well.
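The seek argument above can be made concrete with a back-of-the-envelope model. This is only a rough sketch, not a measurement: the 10 ms seek time, the 100 MB/s transfer rate (about 1 MB per seek time, as the comment assumes) and the 4 KB file size are assumed figures, and the helper name `write_time_ms` is made up for illustration. It also ignores journal commits and cache flushes.

```python
def write_time_ms(num_seeks, total_bytes, seek_ms=10.0, bytes_per_ms=100_000.0):
    """Estimate write time as (seeks * seek time) + (bytes / transfer rate)."""
    return num_seeks * seek_ms + total_bytes / bytes_per_ms


FILES = 100          # number of small files being replaced (assumed)
FILE_SIZE = 4096     # bytes per file (assumed)

# All files allocated contiguously and written in one go: a single seek.
batched_ms = write_time_ms(1, FILES * FILE_SIZE)

# One fsync() per file, modelled here as one seek per file.
per_file_ms = write_time_ms(FILES, FILES * FILE_SIZE)

print(f"batched: {batched_ms:.1f} ms, per-file fsync: {per_file_ms:.1f} ms, "
      f"ratio: {per_file_ms / batched_ms:.0f}x")
```

Under these assumptions the seek term dominates, and the ratio approaches the number of files as the files get smaller.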
BTW: This is all a very old problem. POSIX actually defines an order of file system operations (the simplest one: they are in order), so please make sure this order is maintained when data goes to the disk. The problem has been discussed since the dawn of synchronous metadata update, which was advertised as “making UFS slightly slower, but a lot more crash-proof”, but it didn’t have that effect. See Anton Ertl’s old rant about it, read it, and please try hard to comprehend it:
http://www.complang.tuwien.ac.at/anton/sync-metadata-updates.html
And for the solution, just implement my suggestion in the last blog here – don’t reorder data and metadata updates, keep them in order, delay both of them, and make the whole delayed update process atomic. All other options in POSIX are there to sacrifice robustness for speed. But with (pseudo)synchronous metadata update, this is not the point. You sacrifice robustness for slowness. It is beyond my level of understanding why on earth somebody wants to do that.
Using fsync() in the application IMHO is a sort of premature optimization (here: for robustness), which is the root of all evil. As long as the application is fine with POSIX’s operation ordering, it should not perform any syncs.
March 16th, 2009 at 6:28 am
@Aigars: I don’t care who you may be. You are just a whiner. This guy devoted more than 15 years to the safety of your filesystems. Go write your own FS if you’re that smart. Or please use Reiser4.
March 16th, 2009 at 6:36 am
Hi Ted, first of all thanks for your well-written post.
Some remarks on it:
Regarding “Every application writer should be asking themselves whether this sort of thing is really necessary. … I determined that Firefox was responsible for 2.54 megabytes written to the disk for each web page visited by the user”:
Something I miss here is a mention of the fact that the huge amount of data written might be related to application features that are wanted by users, and that might be disabled in the application.
E.g. http://www.mozilla.com/en-US/firefox/features/ says about “Session Restore” things like: “If you’re in the middle of typing an email, you’ll pick up where you left off, even down to the last word you typed.” Using this feature (although in Seamonkey 2.0 Alpha) I can answer your question “Is that really worth it?” for this specific feature (Session Restore) for my personal use with “YES!”.
I do not know how much of the IO is caused by this specific feature (that is AFAIR enabled by default in Firefox), but you cannot blame the Firefox developers for having developed a very useful feature that results in more IO.
Which features are useful for a specific user is nothing an application developer can globally decide, he can only empower the user to enable/disable features to suit his needs.
It might be a problem if an average user finds somewhere on the web or in some forum the advice “use laptop_mode”, and that implements your suggested delayed fsync. And then the user complains that after a system crash the session he gets in Firefox is ancient…
And your “it may also be useful to have a way to selectively disable fsync()’s on a per-process basis” suggestion would IMHO definitely be a horribly bad thing – the day after you add this, people on forums and webpages will give the ill advice to use it for Firefox.
The times when most Linux users knew what they were doing are long gone, and the quality of advice people (who are the “system administrators” of their computers) find when searching for answers is not always high.
Regarding “All that is necessary is a kernel patch to allow laptop_mode to disable fsync() calls”:
I hope this will only have any effects on filesystems that can guarantee a semantics that avoids corruption if the computer crashes during the write to the disk.
E.g. if an application does an fsync(a); fsync(b), write(a,…); sequence, there are two obvious corruption scenarios if you cannot guarantee that either both a and b or neither of them gets updated.
March 16th, 2009 at 6:51 am
I’m afraid I have to post my original comment again.
But it’s not right either.
It assumes you have permission to write the .new file.
It assumes this file doesn’t exist already (or can be overwritten).
It uses an fsync, which may not be required (if atomicity but no durability is desired).
It doesn’t retain permissions of the old file.
If the target is a symlink, it gets replaced by a normal file.
If you do 3.f, there is a window where no file exists at all.
It’s too complex, so needs to be wrapped in library functions.
I think a concept like atomic updates (O_ATOMIC?) is needed. This would guarantee other apps and the disk (after a crash) either see the old file or the new file, but nothing else.
> First of all, the sequence above exactly provides desired “atomicity without durability”.
Sure, but it includes the performance hit of fsync… This argument was made exactly to avoid that hit.
> Secondly, as we discussed above, fsync() really isn’t that expensive,
I’d like to decide for myself whether it’s cheap enough, thank you very much.
And I decided it isn’t.
IMO the latency of blocking on a (disk) IO is unacceptable. Yes, it’s not much. But it’s still too much.
> Avoiding hard drive spin-ups with laptop_mode
So this would guarantee atomicity without durability?
March 16th, 2009 at 6:57 am
@18
> Your argument about atomicity without durability is absurd – we need to have atomicity without fsync() and don’t mind losing durability in the process. And what do you say? We can’t get that even with an fsync(), thanks – we’ll be using other filesystems then, those whose developers actually care about their users.
It’s not the FS’s fault. It’s an application issue. One that should be fixed by providing a better API. Switching FSs won’t fix it, it’ll merely work around the issue until next time.
March 16th, 2009 at 8:06 am
@7: First, the implementation would need to be even more complex than described in the article. Consider what happens with a simple implementation that launches a thread to run the fsync/rename code if you ever update the same file twice. You have a race condition that will lead to data loss.
Uau,
Yes, of course you would need a mutex to protect against this case.
If you replace open+write+close+rename with “call complex library function that will start a thread calling rename after fsync” then in the latter case the file will still have the old contents after the call returns. If you do atomic_write(contents); read(file); then the read should always return the contents that were just written.
I’m not sure why you want to read from a file that you had just written to; after you call g_set_file_contents(), you have the contents still in the memory buffer; why would you want to read from them again?
Your assertion that you do care which version of the data you get if the program reads from the file, but not after a crash, seems very strange to me. No database has semantics like that, since it would mean that it might export some results to the outside world, that would be visible to the outside world, based on data that might disappear if the system crashed. So even the people who want to push database-style semantics into the file system would probably think such a requirement is a little bit odd, I think.
Still, within the context of the application, the data can simply be cached in memory and the application is probably already accessing the information via the in-memory representation anyway.
March 16th, 2009 at 8:21 am
The problem isn’t a fear of fsync(). The problem is that using fsync() fails to provide a set of behaviour that is very useful to application developers.
1) The “I just want to guarantee correct state, I don’t care /which/ correct state” argument. Yes, this can be mimicked using fsync(), at the cost of your process then blocking for longer. Spinning it out to an asynchronous thread solves the problem of blocking, but instead provides an entirely different set of issues. How does my task know that the in-kernel representation of the data now matches what I think I’ve written? What happens if I fail an assertion and exit on the next function call, before the thread has had the opportunity to run? And in any case I’m now going to irritate ext3 users because I’m forcing I/O that really doesn’t need to happen at that point in time.
2) The “I don’t want my laptop disk to spin up” argument. fsync() has clearly defined behaviour. Applications may depend on this behaviour. We can’t disable fsync() just because we’re using laptop mode. It doesn’t help, anyway – POSIX still doesn’t require that the operations be carried out in order, so there’s still a window where the on-disk representation may be a renamed but empty file.
If this behaviour requires new API, then so be it. But fsync() isn’t the answer here, and the contortions which we’d have to go through to make it even approximate the correct answer should be a pretty strong indication of that.
March 16th, 2009 at 8:27 am
@8: That’s fair, but this narrow piece of the problem (different filesystems having different capabilities) has a standard solution: add an fsync_me_lightly syscall that does something special on filesystems that support it, or falls back to fsync otherwise. (And libraries would of course fall back to calling fsync directly on less-capable non-Linux systems.)
Nate,
I suppose we could add some call which requests that the data blocks for the inode in question be flushed out before any operation involving its metadata is committed to disk. I’m not sure fsync_on_rename() is the right name for it, but I’m sure we can come up with some valid name for it. However, it wouldn’t become usable for quite some time; application writers would have to wait for distributions to ship those kernels, and glibc would have to export such a new interface; and it would be a Linux-specific call that wouldn’t make sense anywhere else (although it could simply be replaced by an fsync()). I suspect it would also be a while before any other file systems implemented such a thing. So we could do this, but I think it’s useful to examine whether other options could work first, especially since many applications that you might use on your laptop are probably using fsync() already.
But, in the future you advocate, where everyone calls fsync all the time, I have two choices: either atomic-rename will also spin up my disk and waste power+latency, *or* I get to play Russian roulette every time I touch a database. (And again, I’m not risking just the last few minutes of work — that would be fine. I’m risking the entire database.)
I don’t think it’s that stark. As I said earlier, we can set things up so that the kernel still honors fsync() calls from some processes. In fact, what I outlined to Bart was a per-process flag which would be set by a pam_module, supplied by the laptop_mode tools, that would set that flag to the mode “ignore-fsyncs-while-on-battery”. That way only the desktop processes would ignore the fsync() call, while system daemons would still honor fsync().
I actually suspect we can do better than that, but what I’ve outlined above is extremely conservative, and not hard to implement. A more advanced way of preventing spinups with laptop_mode is to freeze the filesystem and the databases into a stable state, and then have the block device layer queue all writes until memory usage reaches a critical threshold or MAX_LOST_WORK_SECS have gone by, at which point we flush all of the writes to disk, and then once again freeze the database at a quiescent state and then freeze the filesystem. The point is there are ways we can implement battery-preserving techniques that won’t endanger databases.
Either way it’s a regression. That’s why I fear the fsync.
You don’t have to fear the fsync(). We may need to do a little work, but an enhanced laptop_mode really wouldn’t be hard to create. So please, let go of your fear.
March 16th, 2009 at 8:34 am
@9: Your response to “The atomicity not durability argument” fails because returning an empty file is correct according to POSIX (and is exactly the bug that caused this flamewar). POSIX doesn’t give a way for applications to request this, instead they have to rely on filesystems implementing workarounds like you did for 2.6.30. Therefore applications are forced to fsync all the time to satisfy POSIX. You can’t claim “oh, data integrity isn’t guaranteed by POSIX if you don’t fsync” and then immediately say “applications don’t need to fsync if they want atomicity not durability” when that’s not specified by POSIX.
James,
That wasn’t my argument. What I was saying is that number one, open-write-fsync-close-rename really only provides atomicity and not durability, because the containing directory isn’t fsync’ed. So in fact, the fsync() call isn’t actually doing that much more extra work than application writers have requested; the fsync() does move the work around some, and so it will flush the data blocks before returning, instead of at some time in the future when the journal commit takes place. But if the application doesn’t care about when the data blocks are pushed to disk, it could also simply do the open-write-fsync-close-rename sequence in a separate thread.
If you want a read() of that file after the open-write-fsync-close-rename sequence (but before a system crash) to return the new contents, and after the crash you don’t care whether you are getting the old or new contents, then yes, the fsync() paradigm won’t provide you those constraints, and in that case, if you really want the new contents to be read() back, it’s true that you can’t do the open-write-fsync-close-rename sequence in the background. I question how common those requirements really are, though, and whether we really need to optimize for such a case.
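The open-write-fsync-close-rename sequence discussed in this exchange might look roughly like the sketch below. It assumes a POSIX system; the function name `replace_file` and the `sync_dir` switch are made up for illustration. With `sync_dir=False` it follows the sequence as described (the containing directory is not fsync’ed); with `sync_dir=True` it adds the directory fsync that commenters mention as needed for full durability of the rename.

```python
import os
import tempfile


def replace_file(path, data, sync_dir=False):
    """Replace `path` with `data` using write-to-temp, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push it to the disk
        os.rename(tmp_path, path)    # swap the name over to the new data
    except BaseException:
        # Best-effort cleanup of the temporary file on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    if sync_dir:
        # Optionally fsync the directory so the rename itself is on disk.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Note that this sketch shares the limitations raised elsewhere in the thread: it does not preserve the old file’s permissions, and it replaces a symlink target with a regular file.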
March 16th, 2009 at 8:41 am
@10: I think you misunderstood the “atomicity not durability argument” somewhat. What people claiming that behaviour want is… If it so happens that the directory with the renamed file gets committed to the disk (pointing to the new file), they want the contents of the file to be committed before that for sure. They also want that no explicit commits of data are necessary for this to happen (i.e. they don’t want to burden the I/O with an explicit fsync() at all). In other words, they want it to be all I/O or no I/O at all when it comes to the write()/rename() combo.
Bojan,
Oh, I understand that. But the reality is that they also want that when rename() gets committed at the next commit interval (which will happen within the next five seconds, given the default settings) that the data blocks also be pushed out to disk so on a crash, they either get the old or the new blocks. So the I/O is going to happen, one way or another (they are in effect asking for an implicit asynchronous fsync() happening in the very near future after the rename() operation). What I’m saying is that application writers can get most of what they want today simply by putting the fsync() on a separate thread, or moving the fsync() off the main UI thread.
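Moving the fsync() off the main thread could be sketched as follows. This is an illustration under assumptions, not anyone’s actual implementation: the class name `BackgroundSaver` is hypothetical, and instead of the mutex mentioned earlier in the thread it uses a single worker thread draining a FIFO queue, which serializes repeated saves of the same file so that the last save requested is the last one renamed into place.

```python
import os
import queue
import tempfile
import threading


class BackgroundSaver:
    """Runs write/fsync/rename on one worker thread, in request order."""

    def __init__(self):
        self.errors = []                      # OSErrors seen by the worker
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def save(self, path, data):
        """Queue a save and return at once; the caller never waits on disk."""
        self._jobs.put((path, bytes(data)))

    def wait(self):
        """Block until every queued save has been processed."""
        self._jobs.join()

    def _run(self):
        while True:
            path, data = self._jobs.get()
            try:
                self._write(path, data)
            except OSError as exc:
                self.errors.append(exc)
            finally:
                self._jobs.task_done()

    @staticmethod
    def _write(path, data):
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # the blocking part, now off the UI thread
        os.rename(tmp_path, path)
```

As pointed out in the thread, a read of the file immediately after `save()` returns may still see the old contents, and a process that exits before the worker runs loses the queued save; `wait()` is there for callers that need to know the rename has happened.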
March 16th, 2009 at 8:50 am
@16
> ext4 does this through blkdev_issue_flush(), which (on sata drives, which I guess is what most people care about) entails a flush cache flush.
What about NCQ? The kernel should be able to tell when a write request has hit the disk, right? No need for a cache flush in that case.
March 16th, 2009 at 8:53 am
@12: Hold on a sec. If Firefox only calls fsync() every 30 minutes, then at minute 29 we’re likely to have the original problem all over again where a crash will result in a 0-byte file. Unless you actually stop firefox from writing files at all during that 30 minutes, which is not what we want. Plus, how did Firefox pick that number? As an application developer, I have no idea what number to pick. As a user, I want to decide globally, not have to figure out how 10 applications picked their number and hope I can change it.
Jim,
No, because what I’m saying is that firefox simply shouldn’t be updating the URL last visited database after every web click, burning through 2.5 megabytes of SSD write load for each web page visited, and burning through a full gigabyte of writes to my SSD (which for me are mostly useless writes) for every 400 web pages visited. It’s not that it shouldn’t call fsync() except for every 30 minutes; it shouldn’t try to update files on disk after every web click! What I suggested was that maybe it would be better (and gentler on SSD’s) if Firefox cached the URL’s that it visited in memory, and every so often (say, every 15, 30 or 60 minutes) that it update the sqlite database, with an fsync() at the end of that operation.
As for whether it should be every 15, 30, or 60 minutes — that can be configurable, using a tuning parameter. Mozilla already has hundreds of tuning parameters available on its About:Config page. This would just be one more. The reason why this number probably needs to be firefox specific is because for me, the “Awesome bar” really isn’t that important. If I lose a few hours worth of visited pages, it really is no skin off my nose, and I care a lot more about minimizing writes to my SSD. For other people, they might want to make this parameter be 60 seconds, or maybe even every 5 seconds, if every single web page they ever visit is critically important to be remembered after a system crash. Personally, I think those people are nuts, but it takes all kinds.
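The batching idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it appends to a plain text file rather than a sqlite database, and the class name, the `interval_seconds` parameter, and the injectable `clock` are all hypothetical.

```python
import os
import time


class BatchedHistory:
    """Keep visited URLs in memory; write them out at most once per interval."""

    def __init__(self, path, interval_seconds=30 * 60, clock=time.monotonic):
        self.path = path
        self.interval = interval_seconds   # the hypothetical tuning parameter
        self.clock = clock
        self.pending = []
        self.last_flush = clock()

    def visit(self, url):
        self.pending.append(url)           # no disk I/O on a click
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.pending:
            with open(self.path, "a", encoding="utf-8") as f:
                f.writelines(u + "\n" for u in self.pending)
                f.flush()
                os.fsync(f.fileno())       # one fsync() per batch
            self.pending.clear()
        self.last_flush = self.clock()
```

A user who values every visited page could set the interval to a few seconds; one who cares more about SSD wear could set it to an hour. Either way the number of fsync() calls is bounded by the interval rather than by the number of clicks.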
March 16th, 2009 at 8:53 am
@34:So the I/O is going to happen, one way or another (they are in effect asking for an implicit asynchronous fsync() happening in the very near future after the rename() operation)
Ted,
Exactly. What I want is a way to say “do a fsync() when you do the rename()” (or to have it happen without saying anything). The application can’t know when this rename() is going to occur, since it doesn’t know how the kernel parameters are tuned or whether the disks are up or down, so it would be a lot easier if we could just ask the filesystem to do it for us.
March 16th, 2009 at 8:54 am
@35
That is only true if you run with write through caching, which no SATA drive ships with. They are all using write back caching. For that case, the NCQ completion happens just like it does for non-NCQ: when the IO has hit the cache. For both NCQ and non-NCQ you can use a write command that forces disk access, but that is not a good fit for fsync since the IO may already be issued. So you have to do the flush, regardless of whether you use NCQ or not.
March 16th, 2009 at 8:55 am
No, using a separate thread is fail. You’re basically saying every application has to be threaded, which is ridiculous.
March 16th, 2009 at 8:56 am
@36:It’s not that it shouldn’t call fsync() except for every 30 minutes; it shouldn’t try to update files on disk after every web click!
I disagree.
There are plenty of applications that want their data in the filesystem to be updated very frequently, for the purposes of other applications in the running system that might want that information. Wanting to have the most up-to-date information in the filesystem available every click is NOT the same as saying we need it forced to disk at every click.
As for whether it should be every 15, 30, or 60 minutes, that can be configurable, using a tuning parameter.
My point was that having to configure this once-per-application is a user’s nightmare.
March 16th, 2009 at 9:00 am
What is your take on scripting languages, like shell scripts?
i=0
while true; do
    i=$((i+1))
    echo $i:`date` > log
    ./process_data
done
How do you propose that I ensure “log” will actually have some data in it, and not turn up empty? Call /bin/sync in the loop? Surely there is a less heavy-handed approach?
March 16th, 2009 at 9:06 am
@38
> For that case, the NCQ completion happens just like it does for non-NCQ: when the IO has hit the cache. For both NCQ and non-NCQ you can use a write command that forces disk access, but that is not a good fit for fsync since the IO may already be issued. So you have to do the flush, regardless of whether you use NCQ or not.
Why don’t you always use write-through requests if NCQ is enabled?
March 16th, 2009 at 9:08 am
@42
You may as well ask why people just don’t turn the write back caching off. Answer is the same: the write performance will suck. Except for custom enterprise SATA drives that have their own private firmware, nobody optimizes SATA drives for write through caching.
March 16th, 2009 at 9:16 am
@17: So, just make rename include an implicit call to a conceptual fbarrier. Existing applications work. Today.
Daniel,
Maybe you’ve missed this, but I’ve done this already. Ext4 has such a hack/kludge queued up for 2.6.30, and Ubuntu and Fedora have already backported it into their kernels for the distributions they plan to release in April/May, which are currently in beta testing.
there’s a lot of historical precedent for rename writing data blocks before metadata: not only does ext3 do it, but many older filesystems too. Certainly, many programs are written under the assumption that my rename semantics hold: and these programs work fine (in fact, better) on a running system.
What file systems did you have in mind? Historically, BSD FFS sync’ed out meta-data every 5 seconds, and data blocks every 30 seconds. I think you may be confusing the fact that many file systems, including BSD’s FFS and Reiserfs (from which people seem to be fond of quoting its design document) implemented an atomic rename operation, in terms of what happened to the directory entry, but BSD FFS at least never implemented anything like what you were describing as far as I know, and I’ve had people comment that they’ve seen reiserfs generate zero-length files on a crash. (Although perhaps that was due to some application doing an open-truncate-write-close operation.)
Second, your rename behavior will lead to bugs now and forever: open-write-close-rename will work just fine on a running system, and there’s a good chance it’ll appear to work even if the developer takes the unusual step of testing during a system crash. Because this sequence will seem to work just fine most of the time, plenty of programs will have hidden data-loss bugs. That’s not a world I want to live in.
I’m not sure what you are criticizing here. I was anguished about creating the implicit fbarrier that you requested, precisely for this reason. It creates a crutch that application programmers will continue to rely upon, which will not be true on other operating systems or for other file systems. So to the extent that users are continuing to omit the fsync(), these are hidden data-loss bugs. I ultimately decided I needed to do this in order to be bug-compatible with ext3’s data=ordered mode. But that’s what you asked for!
If you are instead referring to the “change” I’ve made to rename’s semantics, here I have to object strenuously. I didn’t make a change, because I don’t get to define the semantics of the POSIX interface. There are other file systems besides ext3 and ext4, and you or I don’t get to single-handedly decree that rename shall henceforth imply a barrier operation, and all file systems that don’t implement such a barrier operation are broken.
I did what I did because of the old Internet maxim: “be conservative in what you send, and liberal in what you receive”, and because as one of the ext3 developers, I feel partially responsible for this situation, where application programmers have developed a crutch on ext3’s data=ordered mode.
For now, though, I also feel the responsibility to call out to application programmers to do the right thing. Maybe in the future we can create some kind of fbarrier() call. But it doesn’t exist today; what exists today is fsync(), and I want people to understand fsync() and at least not have an irrational fear of it.
March 16th, 2009 at 9:18 am
@34
No, we’re really not asking for any actual IO to be done at all. All we want is consistency. All we want is that if and when the file system decides to actually do the IO operations corresponding to the write() and rename(), that it not re-order the rename() before the write(). We’re not asking for anything else. Nothing needs to be written now or even in the near future.
We’re not asking for total ordering on IO operations either. You can still optimise writes by re-ordering. It’s just this particular one where it is much more useful if the rename() is not re-ordered before the file content write()s. So we want write barriers in the obvious places, not fsync.
Delaying allocation is great. Delaying writes is great. Just please respect that one write barrier between the file content changes and the rename.
Is it technically difficult to delay these various IO operations while preserving a partial ordering?
March 16th, 2009 at 9:45 am
@44
> You may as well ask why people just don’t turn the write back caching off.
That would also affect non-NCQ requests. I didn’t know SATA drives were still that bad with NCQ.
@45
> There are other file systems besides ext3 and ext4, and you or I don’t get to single-handedly decree that rename shall henceforth imply a barrier operation,
Isn’t there something between the application and the FS where such a barrier can be inserted?
March 16th, 2009 at 10:03 am
Hello,
You said
“.., that’s 1GB of writes to your SSD, and if you write more than 20GB/day, the Intel SSD will enable its “write endurance management feature” which slows down the performance of the drive.”
Do you have any link or details on that? I have never heard about such limitations.
Sincerely,
Erwan
March 16th, 2009 at 10:12 am
@47:
We can add a g_atomic_replace() call to e.g. glib and get applications to use it (modifying existing glib functions to use fsync() isn’t too helpful as applications might be accidentally run with an older glib), but the kernel should first provide a reliable way for userland to know whether a fsync() is necessary in this case. For older kernels some kind of workaround should also be found, e.g. one that correctly avoids the fsync() when running on an ext3 data=journal/ordered filesystem.
March 16th, 2009 at 10:15 am
What I’m not getting here is why metadata is being committed before data at all.
You’re writing a journal entry claiming to have done something whilst knowing that it hasn’t really been done, which will inevitably cause data and metadata to be inconsistent at some point. Obviously this means that robustness is sacrificed.
At the same time, writing that journal entry has a cost, so doing this is more expensive than waiting until the data has been written (in the cases where delayed allocation is a win), or exactly as expensive at best (in the cases where it isn’t).
Since this behaviour *appears* to have only downsides, I really think it would help to have some explanation of why metadata commit isn’t delayed along with the corresponding data commit. There must be some upsides or else it wouldn’t be done, so what are they?
March 16th, 2009 at 10:34 am
Part of the problem with (mis)using fsync for cases where it’s not really needed is that other applications which really *do* need to block until the data is written risk being caught in the crossfire.
If Postgres loses fsync’ed data because someone enabled laptop_mode, and the laptop_mode people had to disable fsync to get firefox to run reasonably, then the entire database risks being silently corrupted. Exim risks losing emails it acknowledged receipt of, etc.
We’re all perfectly happy to load 30 minutes of browser history as long as firefox keeps running but we may not be willing to lose email or lose our entire database…
I’m a bit puzzled though. It seems like you’re implying fsync will schedule i/o on other unrelated files? Why? Isn’t that what sync(2) is for?
March 16th, 2009 at 10:34 am
er “lose 30 minutes of browser history”
March 16th, 2009 at 10:44 am
@Bernd (#26):
> BTW: This is all a very old problem. POSIX actually defines an order of file
> system operations (the simplest one: they are in order), so please make sure
> this order is maintained when data goes to the disk.
This is ridiculous — it would require that the kernel do no write reordering at all. If Application A said to write a block to File A, Application B said to write a block to File B 500us later, and then Application A said to write another block to File A, the kernel would have to instruct the disk head to seek from file A to file B and back to file A instead of doing both writes at file A before going to B (or vice versa). You’ll kill both throughput and latency this way. No operating system does this or should do this.
As far as I’ve heard, POSIX does not ever guarantee an order that anything is flushed to the physical storage medium. In particular, all bets are off in the event of a crash, except that if fsync() has returned then you’re guaranteed that the changes will be there (unless something else has touched the file since then).
March 16th, 2009 at 10:49 am
@45 tytso writes:
> I was anguished about creating the implicit fbarrier that you requested, precisely for this reason. It creates a crutch that application programmers will continue to rely upon, which will not be true on other operating systems or for other file systems.
I disagree. I think any filesystem that hopes to run on more than a very limited or specialized share of the systems out there will implement an implicit ordering constraint on write() followed by rename(), until such a time as libc is changed to do it implicitly to allow filesystems to ignore it (at which point most applications will still use it, and power applications will have the system calls necessary to do the right thing while avoiding it).
The crutch will probably always be there because application developers expect the filesystem to behave rationally. The specification may allow fs implementations to act in irrational ways, but that doesn’t mean they should. It is irrational for the write() and rename() operations to be reordered, POSIX or any other spec be damned. (And if you prefer a less loaded word than “irrational”, how about “blows a giant hole in the abstraction barrier around the file system.”)
March 16th, 2009 at 10:50 am
This discussion makes me wonder: Isn’t it about time the PC platform gets some obligatory megabytes (or even gigabytes) of battery-backed memory?
This would fix this problem (dirt cheap fsync) but also improve filesystem journaling performance and the NFS protocol’s commit on close semantics. It could also help laptop mode and even reduce the wear of SSDs.
March 16th, 2009 at 10:56 am
Ted,
> Oh, I understand that. But the reality is that they also want that when
> rename() gets committed at the next commit interval (which will happen within
> the next five seconds, given the default settings) that the data blocks also
> be pushed out to disk so on a crash, they either get the old or the new
> blocks. So the I/O is going to happen, one way or another (they are in effect
> asking for an implicit asynchronous fsync() happening in the very near future
> after the rename() operation).
That’s not what they’re asking for, though. What they’re asking for is for neither the data *nor* the metadata to be flushed to disk until it’s convenient. You’ve said that it’s not possible to flush metadata out of order; why not? I would be very interested to hear more details about this.
Surely it would be conceptually possible to write out something to a log somewhere saying “if the machine was shut down uncleanly, undo this rename operation before mounting”. Then when the data is written, remove that from the log, or add something after it reversing it, or whatever. Or just don’t write the original rename to the log until the data is flushed. What problems, exactly, would this cause?
I can envision oddness that would be caused by a sequence like two things incrementing link count and then “set link count to 5” being flushed before “set link count to 4”, but it seems like it might be possible to avoid that kind of issue with some care. For instance, if the second link count increment gets flushed first, rewrite it to say “set link count to 4” and you’ll have no consistency problems. Are there trickier issues than this that arise when reordering metadata writes? Is there some reason that it’s really impossible, or is it just tricky to get right? Might it be possible to reorder only a few types of metadata operations, like just renames?
I would like to commend you on your diligence and patience in responding to random users’ questions, by the way. It’s really nice to see this level of involvement with the community on the part of kernel developers. You’d almost never see this sort of thing with really high-profile proprietary software.
March 16th, 2009 at 11:29 am
@31:
> I’m not sure why you want to read from a file that you had just written to
Because you don’t implement a cache for all files you may access. Because you don’t read it yourself but signal another process to read it. There are more possible reasons. You can work around these – but the complexity is going far beyond “just call a library function for atomic writes”.
> Your assertion that you do care which version of the data you get if the program reads from the file, but not after a crash, seems very strange to me.
What’s strange about that? It’s not that you “don’t care”; you always prefer the latest version. But in case of a crash you’re willing to accept some damage as long as it is limited – you can’t expect a crash to be completely harmless anyway. That does not mean you’d accept such damage (getting old contents) as the normal behavior of a system.
March 16th, 2009 at 11:46 am
@45: The fear of fsync isn’t irrational. As many others have said, it’s simply not practical to use either the threaded or unthreaded solution in many cases. It’s a non-solution. Your “hack” makes ext4 usable again by restoring sane behavior. I commend you for that. What I disagree with is your contention it’s a hack, and that fsync should be inserted anyway. It’s the wrong verb.
First, you didn’t address my point that POSIX doesn’t actually specify what happens on a system crash. ext2 is POSIX-compliant, after all, and so is the rename-leaves-little-mounds-of-garbage behavior. POSIX allowing a behavior is not the same as POSIX demanding that behavior. You *did* change the rename behavior for users on typical systems: you made it less robust than the old behavior, despite both being allowable by POSIX. It’s not as if you changed rename from a non-compliant one to one that follows POSIX: *both* kinds of rename are allowed by POSIX, but one is terrible.
Second, there are filesystems that don’t create a write barrier for rename. They are broken. Their brokenness is not an excuse for ext4 to be broken. Other filesystems have generally managed to get data-before-rename working, whether that’s been through timers or an explicit write barrier. It doesn’t matter: the *conceptual* write barrier worked.
Also, your argument that ext4’s semantics actually do software developers a favor doesn’t hold water. The data-loss case is so rare that it probably won’t be tested. It’s not as if ext4 issues a warning kernel message when it sees a rename without an fsync. It just does the expected, intuitive thing most of the time, and occasionally blows up and loses data. This isn’t educational: it’s dangerous.
I agree that an explicit fbarrier would be great. But until then, adding a barrier on rename is a very good substitute that reflects the application developer’s intention. rename without that barrier is just a horrible API.
March 16th, 2009 at 12:30 pm
Ted – regardless of topic, this is a fantastic blog. Please keep writing!
This description of Firefox 3.0’s disk usage may explain some strange performance problems I have encountered on Windows in my environment, where I redirect the “Application Data” (XP)/”AppData” (Vista) folder to a shared folder that’s in the client-side cache (Offline Files). When the share goes offline, Firefox stops responding to user input until the cache finishes the online-to-offline transition. If network-layer problems delay the transition (e.g., the Network Location Awareness service and the Dfs client fight over whether the domain/share is online), Firefox will remain unresponsive for quite some time. I suspect that the Firefox developers don’t distinguish between the “Roaming” and “Local” versions of the AppData folder, which I completely understand given how things work under Unix as far as per-user settings go. Even so, I would argue that on Windows, given their potential sizes, the browser history and web site cache should not roam with the user.
March 16th, 2009 at 3:06 pm
As a result, I understand:
1. fsync() should flush file data to storage, with no delayed flush or caching.
2. If you really need to save (critical) data, call fsync() before close (for example, an MTA receiving email, or saving an OO document); but if a file is closed without a sync, the system can cache it as long as needed (as with config files).
3. It would be a good thing if the system wrote data before metadata, to avoid bad files even when a truncate is made (caching the metadata changes for the truncate as long as possible, until the file is closed or memory pressure requires flushing), but this is not strictly required. If it does not do that, then we get zero-length files on crashes.
4. It would be a good thing to flush the file before rename, to avoid losing data that was not yet written to storage.
3 & 4 are good points because they were normal practice on older systems and are safe for user data from a common-sense point of view, IMO.
I’d be happy to use ext4 where 3 & 4 are available, even as options in fstab.
March 16th, 2009 at 3:17 pm
@42: What is your take on scripting languages, like shell scripts?
Jim,
Personally, what I would do is use the logger(1) interface, and route the log information through syslog, which already has very powerful ways of dispatching log messages to appropriate log files, some of which can be fsync()’ed if necessary, and some of which might not be. If this is for an application that doesn’t have the ability to modify /etc/syslogd.conf (which is a root-owned system configuration file), then I’d use a helper C program or a helper perl script that called fsync, or implement the script in perl or python instead (both of which support access to fsync from scripts).
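As one possible shape for such a helper, here is a small Python sketch that opens an already-written file and calls fsync() on it, so a shell script can write a file, sync just that file, and then rename it. The script name `fsync_file.py` used in the note after the code is hypothetical, and the approach assumes Linux-like behavior, where fsync() on a read-only descriptor flushes the file's dirty data regardless of which process wrote it.

```python
import os
import sys


def fsync_path(path):
    """Open an existing file and fsync() it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Each command-line argument is treated as a file to sync.
    for name in sys.argv[1:]:
        fsync_path(name)
```

From a shell loop this might be used as `echo "$i:$(date)" > log.new; ./fsync_file.py log.new; mv -f log.new log`, which is lighter than calling /bin/sync because only the one file is forced out.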
March 16th, 2009 at 3:29 pm
@28: Regarding “Every application writer should be asking themselves whether this sort of thing is really necessary. … I determined that Firefox was responsible for 2.54 megabytes written to the disk for each web page visited by the user”… Something I miss here is a mention of the fact that the huge amount of data written might be related to application features that are wanted by users, and that might be disabled in the application. e.g. http://www.mozilla.com/en-US/firefox/features/ says about “Session Restore” things like: “If you’re in the middle of typing an email, you’ll pick up where you left off, even down to the last word you typed.” Using this feature (although in Seamonkey 2.0 Alpha) I can answer your question “Is that really worth it?” for this specific feature (Session Restore) for my personal use with “YES!”.
Adrian,
Well, I didn’t do any typing into forms while measuring the 2.54 megabytes/link. In the ext4 patches queued for 2.6.30, in addition to the replace-via-rename/truncate workarounds, I’ve also implemented support for measuring the amount of I/O submitted to the file system, via /sys/fs/ext4//lifetime_write_kbytes and /sys/fs/ext4/ /session_write_kbytes. This feature was added so people could keep track of how much wear the system was imposing on the SSD. I’m hoping that just as with powertop, once application programmers start realizing how much impact they are putting on their SSD’s, they will have a natural incentive to optimize their applications to avoid writes or batch updates every 15 or 30 minutes in an attempt to reduce the wear their application is placing on users’ SSD’s.
So all I did was take the before and after values of session_write_kbytes, while the system was otherwise not doing anything at all, and clicked on a single link. What I noted was that apparently Firefox was still calling fsync(), since /sys/fs/ext4//delayed_allocation_blocks was 0, and that session_write_kbytes had increased by 2604. IMHO, this is a bit much just clicking on a link. And this was with my cache directory symlinked off to /tmp, to avoid unduly stressing out my SSD; if the cache directory was still in my firefox directory, the write load would have obviously been higher.
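The before-and-after measurement is easy to script. The Python sketch below assumes the counter file holds a single integer in kilobytes; the function names are made up for illustration, and the path to the counter is left as a parameter because it depends on the device.

```python
def read_kbytes(counter_path):
    """Read a single integer (in kilobytes) from a sysfs-style counter file."""
    with open(counter_path) as f:
        return int(f.read().strip())


def kbytes_written_during(counter_path, action):
    """Return how much the counter grew while `action()` ran."""
    before = read_kbytes(counter_path)
    action()
    return read_kbytes(counter_path) - before


def pages_per_gigabyte(kbytes_per_page):
    """How many page visits add up to 1 GiB of writes at this rate."""
    return (1024 * 1024) / kbytes_per_page
```

With the figure reported here, 2604 kilobytes is 2604 / 1024, or about 2.54 megabytes per click, and `pages_per_gigabyte(2604)` comes out a little over 400, which matches the "gigabyte for every 400 web pages" estimate.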
March 16th, 2009 at 3:36 pm
@48: You said “.. if you write more than 20GB/day, the Intel SSD will enable its “write endurance management feature” which slows down the performance of the drive.” Do you have any link or details on that ? I never heard about such limitations
Erwan, please see section 3.5.4 of the X25-M data sheet. We don’t know exactly what the X25-M SSD does when the “write endurance management feature” is enabled. Presumably it must affect the performance of the drive negatively in some way, or it would be on all the time, instead of only using it when the drive notices that it has been writing “too much”. I’ve asked, but the people at Intel that I’ve talked to aren’t willing to say anything on that point. I assume it may be part of the X25-M’s “secret sauce”.
March 16th, 2009 at 3:37 pm
I think all filesystems should have an implicit write barrier with a rename (at least between the file data writes and all metadata updates). Things should get better, not worse. Can’t this be done on a vfs level, with minimal support from filesystems?
March 16th, 2009 at 4:01 pm
> @35: they are in effect asking for an implicit asynchronous fsync() happening in the very near future after the rename() operation
Yes, something like that. Except that they want it to be done with the rest of the FS commit operation in one hit, so that it is not the application that is scheduling this “touching of the platters”, but rather the FS. In other words, they don’t want many different applications (which are completely unaware of each other) going: commit, commit, commit. They would like all of that to be queued up as one big commit.
March 16th, 2009 at 4:03 pm
@42
Jim, even on ext3 that code can leave log in an empty or partially written state if the system crashes at the wrong time – $cmd >log translates into open(O_TRUNC)/write/close. Which means if your open() happens but the write does not due to system crash, etc. your data goes kaput. Worse yet another script on the system can try to read between the open and write calls and get an empty file, so it’ll bite you even without a system crash. If your script was more like:
i=0
while true; do
    i=$((i+1))
    echo $i:`date` > log.new
    mv -f log.new log
    ./process_data
done
Then you’d have a closer parallel to the open/write/close/rename method. And no, there’s no f*sync’ing involved, according to my strace, so that script too would be vulnerable to inopportune crashes.
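The window between the truncating open() and the write() can be made visible with a few lines of Python. This sketch mimics what a `> log` redirection does at the system call level; the function name is an assumption for the example.

```python
import os


def size_seen_during_redirect(path, data):
    """Mimic `cmd > path`: open with O_TRUNC, then write, then close.

    Returns the file size another process would observe between the
    open() and the write().
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        seen = os.stat(path).st_size     # already 0: old contents are gone
        os.write(fd, data)
    finally:
        os.close(fd)
    return seen
```

The returned size is zero even when the file previously held data, which is why a concurrent reader, or a crash at that moment, can find an empty file without any file system misbehavior.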
The prospect of having to add a call to sync to ensure atomicity in a shell script is not a good one… And shell scripts are used for such matters, see here:
http://www.mail-archive.com/debian-devel@lists.debian.org/msg250509.html
March 16th, 2009 at 4:19 pm
@50: What I’m not getting here is why metadata is being committed before data at all. You’re writing a journal entry claiming to have done something whilst knowing that it hasn’t really been done, which will inevitably cause data and metadata to be inconsistent at some point. Obviously this means that robustness is sacrificed.
Aneurin, that’s not what is going on here at all. The original purpose of all file system journals is to avoid needing to run an expensive (and long duration) file system consistency checker after an unclean shut down (i.e., a crash). That is all the file system journal is intended to do. File systems are optimized very differently from databases. Databases have transactions that can be committed or rolled back if the database or the application decides to abort a transaction. In contrast, file systems do not support the concept of rollback or undo logs, and one of the reasons for this is that, in order to get very high performance levels, file systems usually combine 5-30 seconds worth of file system operations into a single transaction commit. This gives us much better performance, since the transaction overhead has been amortized across potentially hundreds of file system operations, but it comes at the cost of not being able to roll back an individual file system operation, or being able to delay a single file system operation without delaying all of the file system operations that have been combined into that single transaction.
At the same time, writing that journal entry has a cost, so doing this is more expensive than waiting until the data has been written (in the cases where delayed allocation is a win), or exactly as expensive at best (in the cases where it isn’t).
It’s not a matter of waiting until the data has been written. Delayed allocation, as the name suggests, is not about merely delaying the writing of the data, but also delaying the allocation of the data; that is, the physical block numbers indicating where the data should be written don’t get determined for as long as possible.
There are many really good reasons for this; if we delay the allocation as long as possible, we might never need to assign it a location on disk; for example, if the file is deleted before we need to allocate it, we might never need to decide where on disk it would have been located. And deciding where on disk the file blocks should be written is not just a matter of making the choice, but actually recording that choice on stable storage; this means updating the block allocation bitmaps, the i_size and i_blocks fields in the inode, and so on — and of course, all of these meta-data updates have to be journalled.
The second reason why we try to avoid doing the block allocation for as long as possible is that it allows us to make a better choice when we finally make the block allocation, and it helps avoid file system fragmentation.
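Both reasons can be seen in a deliberately tiny toy model, written here in Python. It is not how ext4 is implemented; the class, its fields, and the 4096-byte block size are all assumptions chosen only to show why deferring the decision helps.

```python
class ToyDelayedAllocFS:
    """Toy model: writes are buffered; blocks are assigned only at flush()."""

    def __init__(self):
        self.dirty = {}         # name -> buffered bytes with no blocks yet
        self.allocated = {}     # name -> (first_block, block_count)
        self.next_free = 0
        self.allocations = 0    # how many allocation decisions were made

    def write(self, name, data):
        self.dirty[name] = self.dirty.get(name, b"") + data

    def unlink(self, name):
        self.dirty.pop(name, None)      # never allocated, nothing to journal
        self.allocated.pop(name, None)

    def flush(self, block_size=4096):
        for name, data in self.dirty.items():
            count = -(-len(data) // block_size)     # ceiling division
            self.allocated[name] = (self.next_free, count)
            self.next_free += count
            self.allocations += 1
        self.dirty.clear()
```

In this model a temporary file that is written and unlinked before the flush never costs an allocation, and a file built up by several small writes gets one contiguous extent chosen with its final size known, instead of one piecemeal decision per write.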
Since this behaviour *appears* to have only downsides, I really think it would help to have some explanation of why metadata commit isn’t delayed along with the corresponding data commit. There must be some upsides or else it wouldn’t be done, so what are they?
We can’t delay the meta-data commit because it has been batched with potentially a hundred or more other file system operations, for performance reasons. The data commit we simply can’t do because we don’t know where on disk the data will be placed. We can accelerate the data commit by accelerating when we make the allocation decision, which is the basis of the “alloc-on-rename-if-we-replace-another-file” patch. So what happens with the workaround patch is that we allocate the data blocks if rename() notices that a destination inode has been overwritten; then at the next commit, ext4’s data=ordered mode forces the data blocks out, so that when we commit the metadata update, the data blocks have been committed already.
Hopefully this answers your questions, and helps you understand what’s going on, and why we are doing the things we are doing. If you examine how many transactions per second a typical database system can achieve when its transaction log, table space, etc., are all on a single hard drive, and then compare that to the number of file system operations per second we can do on that same hard drive (and which users expect us to be able to do), that might explain a few things to you. File systems are not databases and databases are not file systems.
March 16th, 2009 at 4:30 pm
@51: I’m a bit puzzled though. It seems like you’re implying fsync will schedule i/o on other unrelated files? Why? Isn’t that what sync(2) is for?
Gerg, fsync() on ext3 with data=ordered mode will force a transaction commit, and since (as described in comment #67 above) all file systems that do journalling tend to batch multiple file system operations into a single commit, that means that other inodes will get committed alongside the inode which is being fsync()’ed, and data=ordered mode requires us to flush to disk all inodes involved with the commit.
So this basically gave us a two-for-one whammy. Not only was ext3’s data=ordered mode what allowed application programmers to omit the fsync in an open-write-fsync-close-rename sequence (since when the open-write-close-rename sequence is committed in the transaction, the dirty blocks are written out as an implied fsync()); in addition, ext3’s data=ordered mode also flushes out other, uninvolved inodes’ dirty blocks on a commit, effectively turning fsync() into a sync(), and discouraging application programmers from using fsync() because it could take a long time to return.
Ext4 doesn’t have this problem thanks to delayed allocation, but delayed allocation causes what people complain about as the “zero-length files” problem, which is very similar to the problems reported against XFS.
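For readers who have not seen it spelled out, the open-write-fsync-close-rename sequence mentioned above might look roughly like the following. This is a minimal sketch in Python (whose os module wraps the same system calls), not code from any particular application; the function name atomic_replace and the “.new” temporary suffix are assumptions made for illustration.

```python
import os

def atomic_replace(path, data):
    """Replace `path` with `data` via open-write-fsync-close-rename.

    `atomic_replace` and the ".new" suffix are hypothetical names
    chosen for this sketch.
    """
    tmp_path = path + ".new"
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                      # os.write may write only part of the buffer
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # raises OSError if the flush fails
    finally:
        os.close(fd)
    # Only after the data has been flushed do we replace the old file.
    os.rename(tmp_path, path)
```

Omitting the os.fsync() call gives the open-write-close-rename variant that ext3’s data=ordered mode happened to make safe in practice.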
March 16th, 2009 at 4:36 pm
but people hate that xfs ‘feature’, xfs users were ridiculed for it by ext3 users and still you ‘implemented’ something similar. Shouldn’t that give you something to think about? Combined with the fact that there are other fast fs that simply don’t have the same problem?
Why not send a patch and mark ext4 broken until there is a real fix that prevents files on the media from being nuked?
March 16th, 2009 at 4:45 pm
You say:
“In this example, an fsync() will trigger a commit and might need to take a second while the download is going on; perhaps half a second if you have a really fast 7200 rpm drive, and maybe 2-3 seconds if you have a slow 5400 rpm drive.”
For a user-interactive task (that is, any GUI app), even a half-second of latency is unacceptable. An entire second is gross, and 2-3 seconds makes you (as the GUI app developer) look incompetent.
March 16th, 2009 at 4:49 pm
@58: you didn’t address my point that POSIX doesn’t actually specify what happens on a system crash. ext2 is POSIX-compliant, after all. The rename-leaves-little-mounds-of-garbage behavior. POSIX allowing a behavior is not the same as POSIX demanding that behavior.
Daniel, you are correct that POSIX does not specify what happens on a system crash. What ext2, ext3, and ext4 did is all POSIX compliant. So yes, POSIX does not require anything of file systems, unless fsync() is called. However, the flip side of this is that from the perspective of the application, if the application is going to be safely portable, such that it will perform correctly on all POSIX-compliant systems, the application must assume the worst case, and not assume that the OS will do anything other than what POSIX requires it to do.
Second, there are filesystems that don’t create a write barrier for rename. They are broken. Their brokenness is not an excuse for ext4 to be broken. Other filesystems have generally managed to get data-before-rename working, whether that’s been through timers or an explicit write barrier. It doesn’t matter: the *conceptual* write barrier worked.
Broken by whose definition? Yours? As I’ve said, neither you nor I nor anyone else has the ability to change the required semantics for POSIX on our own. So it is certainly not broken by POSIX’s definition. If by this you mean that you will only use file systems that meet your personal criteria of “goodness”, that’s fine. It’s guaranteed employment for ext3/ext4 developers, anyway.
But if the goal is to create application programs that will work on any POSIX-compliant system, including MacOS X, Solaris, and other POSIX-compliant environments, it’s best that applications not assume that all the world’s a Linux system, or all the world’s an ext3 or ext4 filesystem. You may or may not be old enough to remember Henry Spencer’s Ten Commandments for C Programmers, in particular the 10th commandment:
When you say that it’s fair game for application programmers to make assumptions that file systems will behave in “sane” ways (sane by your lights, anyway) that’s really not that different from people who had made other non-portable assumptions in the past, and religiously asserted that all systems that didn’t match with their assumptions were insane.
March 16th, 2009 at 5:23 pm
If what I care about is atomic replacement after crash recovery, and I must not assume that the OS will do anything other than what POSIX is required to do, then I might as well omit the fsync in open-write-fsync-close-rename, because POSIX does not require that that sequence gives atomic replacement after crash recovery.
I think POSIX is a red herring here.
March 16th, 2009 at 5:39 pm
@70: There are two questions to consider: 1) should applications assume rename acts as a write barrier, and 2) should filesystems actually implement that. Your reply addresses point 1: yes, for maximum portability, fsync. First, however, not all applications need to be maximally portable. There’s nothing wrong with taking advantage of system-specific functionality.
However, this whole discussion is about question 2, which is independent of question 1. It’s perfectly fine to implement features that aren’t portable, and perfectly fine for applications to use them. Other systems then copy the features in order to support programs that use them, and eventually the feature becomes a de facto standard. For better or for worse, that’s how progress is made.
Broken by whose definition? Yours? As I’ve said, neither you nor I nor anyone else has the ability to change the required semantics for POSIX on our own. So it is certainly not broken by POSIX’s definition.
That’s a very postmodern viewpoint. Some things really can be objectively better than others. As I’ve argued at length, a rename that acts as a write barrier is better for everyone than one that doesn’t. You’re still hiding behind a standard. POSIX just mandates a minimum level of brokenness, just as the FDA mandates a maximum number of maggots per pound of cheese. Improvement can be made.
The “Ten Commandments for C Programmers” is an interesting document, actually. Consider minimum significant identifier length. In practice, we stopped needing to worry about identifiers being unique in the first six characters a long time before C99 got around to formalizing what was by then a de facto standard. Honestly, in moderate doses, assuming all the world’s a VAX can drive improvement of systems, for better values of VAX.
Data-before-rename is clearly one of these cases. It’s far better for application developers than the alternative. With data-before-rename working, a call to rename without fsync ornamentation makes sense again. “It’s not portable” is not a good reason to eschew that feature when you know the systems you do support work just fine.
One sin programmers used to VAXen would commit would be to assume that *(char*)0 == 0. Not copying that particular behavior came with large architectural benefits, however, so the breakage was worth it. Not so with rename — flush-on-rename is strictly better than forcing application developers to fsync. In the worst case, rename could become fsync, followed by the rename. In the best case, the filesystem can do something far more sophisticated and elegant by creating a write barrier. The program becomes simpler and shorter, and thus most likely less buggy. Code becomes more intuitive and more robust. Everyone wins.
Instead of application developers being told to insert fsync, developers of other modern filesystems should be told to make rename work in a sane way.
On another note: since you’ve agreed to put the rename “hack” in ext4, why not extend it to cover all renames? It’d be safer for more cases — say, downloading a file into foo.tmp, then renaming it to ‘foo’ when done. And most of the time, rename is either called on a file with no dirty blocks, or one that’s going to need to be written out soon anyway.
March 16th, 2009 at 5:54 pm
modern 7200rpm laptop hard drive… some people like me would rather have a slow SSD (cheap, little power consumption) instead:
http://suihkulokki.blogspot.com/2008/11/pimp-my-x40.html
This makes the point of spinups moot. But the write performance is really awful, clearly less than 5MB/s. This made the Firefox UI extremely frustrating (no matter which ext3 journalling option). Switching ~/.mozilla to tmpfs makes it actually really usable. I do not want to see more fsyncs freezing my UI. But I really want to never ever experience seeing an existing file being replaced by a 0-length file. Transactions, dammit.
March 16th, 2009 at 8:30 pm
@Ted#33:
However, it wouldn’t become usable for quite some time; application writers would have to wait for distributions to ship those kernels, and glibc would have to export such a new interface
True, and it’s good to be conservative when talking about new kernel APIs. But if the will is there then it’d be plausible to get something workable into the next-but-one round of distro releases (Fedora 12 and Karmic Koala). And even if programmers start following your advice now, we won’t start seeing fsync-happy apps shipping until then anyway.
In fact, what I outlined to Bart was a per-process flag which would be set by a pam_module, supplied by the laptop_mode tools that would set that flag to the mode “ignore-fsyncs-while-on-battery”. That way only the desktop processes would ignore the fsync() call, while system daemons would still honor fsync().
But I run interactive programs that touch databases, and whose data I care about. So to preserve the current level of correctness we’d have to audit all of userspace to figure out which apps use fsync only for atomic rename, and which apps use it when really needed, and make sure everything stayed in sync as apps were upgraded and… it… that’s… aieeeee
I know it’s a hard problem, but please don’t punt it to userspace. That just makes it harder.
(This will probably be my last message in this debate for now, so I wanted to mention how impressed I’ve been at the time and effort you’ve been putting into this issue, and the calm replies you’ve managed to the unending stream of confusion and sometimes abuse. It’s very much appreciated.)
March 16th, 2009 at 8:54 pm
> @73: It’s perfectly fine to implement features that aren’t portable, and perfectly fine for applications to use them. Other systems then copy the features in order to support programs that use them, and eventually the feature becomes a de facto standard. For better or for worse, that’s how progress is made.
Please read the select manual page on your Linux system. Hint: timeout. The history of Unix is littered with examples like this. Do we really need another one?
March 16th, 2009 at 9:10 pm
@76: So, is your argument that we should standardize first and implement later? I hear the W3C is having great luck with that approach.
March 16th, 2009 at 9:18 pm
> @77: So, is your argument that we should standardize first and implement later? I hear the W3C is having great luck with that approach.
Never you mind W3C. What is important here is that we already _have_ a standard that’s been in use for many, many years and it behaves the way it does (which, BTW is useful in some cases). Anyhow, condoning the broken application behaviour is absolutely the wrong thing to do.
I’m not saying we standardise first. I said we should implement another API first (AFAIK, Ted is a kernel guy – he can do things like this). I tentatively called it rename2() (mostly because I’m not very imaginative).
Then, when our Linux programs switch properly to using it and take full advantage of it (and have alternative code paths for the old fsync()/rename()), others may see value in it and implement it. Then, POSIX folks may say “hey, this is useful, let’s make it optional” and later “hey, this is really useful, let’s make it required”. OK?
March 17th, 2009 at 12:25 am
I have one of the new Seagate Momentus 7200.4 drives. I got it about a month ago, and yes, it was extremely hard to find. hdparm reports ~95MB/s and it seems quite fast, but I haven’t done any other benchmarks.
March 17th, 2009 at 1:37 am
The obvious thing to take away here is that the design of most filesystems is decades behind the design of the best databases. If synchronous durability is not required, there is no reason why any actual disk I/O has to take place at all. The fact that it does in order to guarantee atomic replacement is a major design weakness in contemporary filesystems.
Any decent database can durably commit groups of transactions to a single disk every 10 ms or so. Databases commit transactions independently. Any good database has an option to commit transactions asynchronously, (i.e. atomically but not necessarily durably).
The claim is made that filesystems will be slower if such design elements are adopted. That is exactly wrong. Filesystems will be faster, precisely because the operations we care about do not require synchronous commits, let alone group commits for every pending operation on the system. The whole problem is assuming that the only solution to the problem is the slowest one possible.
Doesn’t anyone think something is wrong with the picture where any decent database can *synchronously* commit 100 serially dependent transactions per second even if other long transactions are still pending, and a filesystem is doing a good job if it manages one serially dependent transaction (i.e. a synchronous update to the same file) every three seconds under similar conditions?
In other words, contemporary filesystems are great under two conditions: (1) you don’t care about your data, or (2) you are willing to make databases look like speed demons.
March 17th, 2009 at 2:34 am
> @79: The claim is made that filesystems will be slower if such design elements are adopted. That is exactly wrong. Filesystems will be faster, precisely because the operations we care about do not require synchronous commits, let alone group commits for every pending operation on the system.
Let’s assume that every rename is done in order, as it’s being suggested here. If it so happens that someone just renamed a relatively big temporary file and if it also happens that the directory must be committed to disk (because someone ran fsync on it or because kernel decided to evict it from cache), the whole renamed _file_ will be committed before the directory, potentially causing I/O many times larger than that of the directory – I/O that otherwise would not happen, because the file would be removed a bit later on. In other words, out of order rename _can_ be useful.
> The whole problem is assuming that the only solution to the problem is the slowest one possible.
There is also another problem. It is the fact that people think that the only solution is to have fully ordered rename or do fsync all the time. That is not true.
This problem was discovered with small configuration files. This can be solved easily, by taking a _backup_ of such a file at an opportune moment (i.e. at start of the application and otherwise very, very rarely). In other words:
1. open and read file ~/.kde/foo/bar/baz
2. fd = open("~/.kde/foo/bar/baz~.new", O_WRONLY|O_TRUNC|O_CREAT)
3. write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
4. fsync(fd) — and check the error return from the fsync
5. close(fd)
6. rename("~/.kde/foo/bar/baz~.new", "~/.kde/foo/bar/baz~")
7. fsync the directory where the backup file lives (i.e. ~/.kde/foo/bar)
So, after this, we are 100% sure that we have at least one _good_ copy of the precious file. We can then carry on all day with the usual open()/write()/close()/rename() (which won’t guarantee anything, but it also won’t use fsync all the time).
If it so happens that we crash, our ~/.kde/foo/bar/baz will be busted (empty, corrupt etc. – application will be able to tell), in which case we just rename our backup file into it and continue (we can also create a new backup file at this point).
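The backup scheme in steps 1-7, plus the recovery step, can be sketched as below. This is only an illustration under stated assumptions: it is written in Python rather than C, it assumes a Linux-like system where a directory can be opened read-only and passed to fsync, and the helper names (fsync_dir, save_backup, restore_if_busted) and the application-supplied is_valid callback are hypothetical.

```python
import os

def _write_all(fd, data):
    """Write every byte of `data`; os.write may write only part of it."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]

def fsync_dir(dirname):
    """Step 7: flush the directory entry itself (assumes a POSIX system)."""
    fd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def save_backup(path):
    """Steps 1-7: leave a durable copy of `path` at `path + '~'`."""
    with open(path, "rb") as f:                       # step 1
        data = f.read()
    tmp = path + "~.new"
    fd = os.open(tmp, os.O_WRONLY | os.O_TRUNC | os.O_CREAT, 0o600)  # step 2
    try:
        _write_all(fd, data)                          # step 3
        os.fsync(fd)                                  # step 4: raises OSError on failure
    finally:
        os.close(fd)                                  # step 5
    os.rename(tmp, path + "~")                        # step 6
    fsync_dir(os.path.dirname(path) or ".")           # step 7

def restore_if_busted(path, is_valid):
    """After a crash: if `path` is missing or fails the application's own
    validity check, rename the backup into place and make a fresh backup.
    Returns True if a restore happened."""
    try:
        with open(path, "rb") as f:
            if is_valid(f.read()):
                return False
    except FileNotFoundError:
        pass
    os.rename(path + "~", path)
    save_backup(path)
    return True
```

An application would call save_backup() once at startup and restore_if_busted() before reading its configuration; the day-to-day saves in between can stay fsync-free, as the comment describes.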
March 17th, 2009 at 3:35 am
@80: Your first example assumes a mental picture of the way a filesystem is implemented that isn’t necessary. Databases issue hundreds of much more complex transactions a second without synchronously touching anything except the journal. That means indexes, btrees, meta-data of all sorts, you name it. There is no reason why a well designed filesystem can’t do the same.
Your second example is appreciated, but it is a lot of complexity that most applications don’t implement, shouldn’t really have to implement, and which burdens the system in a way that properly implemented atomic renames do not.
A properly implemented atomic rename requires no synchronous disk I/O at all. Nor does it create or require synchronous I/O dependencies on any other operation in the same directory, nor on whether any of the data has been written out yet, nor on whether any of the meta data has been written out, nor on any other synchronous I/O operation such as an fsync on another file, in the same directory or otherwise.
The secret to this magic is called “metadata undo”. You can read and write metadata updates to the disk whenever you feel like it, as long as you have the ability to roll back meta data updates after a system crash. It is not rocket science. Databases have been doing this for decades. They do it because it makes them *faster*.
Index block update this, b-tree block update that – thousands of times a second, never synchronously written to disk, just lazily at some checkpoint every few minutes or so, all because they have meta data undo capability.
March 17th, 2009 at 4:10 am
Why can’t you, instead of forcing all apps to be fixed, just implement “ordered writes”? That system commands are executed in the order in which they are called? That’s a very reasonable expectation. If I write something, close the file, and then do a rename, it’s reasonable that the file will be written before the rename happens. (Even if all that happens a bit later, due to caching.) *Not* doing so is entirely unreasonable, violates a fundamental assumption in procedural programming (sequential execution), and screws the logic of applications. And bugs as we saw here are the result of exactly such “screwing the logic of applications”. It’s not that the logic of the application is wrong, it’s your execution of it.
It doesn’t matter what POSIX says – if POSIX allows unreasonable things, don’t do unreasonable things. Even if the government law allows you to run around in the office saying “blub blabla BUUUHHUUUUU” (literally), it doesn’t mean that you should do it or that it’s a good idea.
Why are you so stubborn about this?
March 17th, 2009 at 4:13 am
And, FWIW, I maintain that fsync() means “write this to disk now”, not “do the previous things before the following things”. I believe you argue for fsync() merely to not have to change your own code.
March 17th, 2009 at 4:18 am
> @81:
So, if someone fsyncs the directory and that goes to disk in one form or another (by some magical database thing you may have) _and_ that directory says that there should be a new file there (just renamed, not the old one), how exactly are you going to have the data there without _actually_ writing it to disk? Remember, the user said the new directory must be _valid_ on disk and upon return from fsync, and you _must_ guarantee what you said – both the data of the file and the directory are on the platters – there is no more undo, rollback or anything of the sort possible.
March 17th, 2009 at 5:08 am
@63: Thanks for this very interesting info. Reading some other specs, it looks like the “Extreme” edition doesn’t “feature” that behavior. They just report:
3.5.4 Write Endurance
The drive supports 1 petabyte of lifetime random writes.
March 17th, 2009 at 6:31 am
Can anybody implement the third scenario (with the call to fsync()) in *POSIX* shell? In PHP? In JavaScript? In BASIC?
How do you call fsync() when fsync() is disabled, e.g. in laptop mode or on MacOSX? How do you do that in POSIX shell?
Why does ext3 perform better without calls to fsync() than ext4 with calls to fsync(), while providing better protection of user data?
March 17th, 2009 at 7:05 am
@67, 68
Thanks for the explanation given in these two posts. They were helpful and informative, and your time spent is appreciated, especially as you are being borderline flamed all the while.
March 17th, 2009 at 8:32 am
@82: The secret to this magic is called “metadata undo”. You can read and write metadata updates to the disk whenever you feel like it, as long as you have the ability to roll back meta data updates after a system crash. It is not rocket science. Databases have been doing this for decades. They do it because it makes them *faster*.
Mark, you can only do this as long as the undo log records are written before the metadata blocks are written to disk. Unfortunately, hard drives don’t have real barrier operations, they only have flush-everything-to-disk operations. So there is no way to be sure the undo log is safely on disk unless you actually issue a flush-to-spinning-platter command (basically, a synchronous write), and so an undo log is actually slower, not faster, than what most file systems do, which is to rely only on a redo log.
And of course you have to do some synchronous writes, even with an undo log (or a redo log; most databases use both an undo and a redo log). Anyone who tells you they can create a transactional system without synchronous writes is selling you something.
What ext3 does is to push data out to the redo log asynchronously, and not write the data to the final location on disk; when a commit takes place, we wait for all of the asynchronous writes to the journal to finish, and synchronously write a commit block to the journal. Only then do we let the meta data writes happen to their normal location on disk. We allow transactions to overlap; while the committing transaction is being finalized, new file system operations can be made to the “current” transaction.
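The ordering described here (journal writes first, then a synchronous commit record, and only then writes to the final location) can be illustrated with a toy redo-only journal. To be clear, this is a simplified sketch and not ext3’s actual journal format or code: the class name ToyRedoJournal, the JSON record layout, and the apply_block callback are all inventions for illustration, and it does not model overlapping transactions.

```python
import json
import os

class ToyRedoJournal:
    """Toy redo-only journal; NOT ext3's real on-disk format."""

    def __init__(self, journal_path):
        self.fd = os.open(journal_path,
                          os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.pending = []          # updates in the current transaction

    def log_update(self, block_no, payload):
        # Asynchronous as far as the disk is concerned: no fsync here.
        record = json.dumps({"block": block_no, "data": payload}) + "\n"
        os.write(self.fd, record.encode())
        self.pending.append((block_no, payload))

    def commit(self, apply_block):
        os.fsync(self.fd)                          # wait for the journal writes
        os.write(self.fd, b'{"commit": true}\n')   # the commit record
        os.fsync(self.fd)                          # written synchronously
        committed, self.pending = self.pending, []
        for block_no, payload in committed:
            apply_block(block_no, payload)         # only now touch the final location
        return len(committed)

def replay(journal_path, apply_block):
    """Crash recovery: redo committed transactions, drop the unfinished tail."""
    batch = []
    with open(journal_path) as f:
        for line in f:
            try:
                rec = json.loads(line)
            except ValueError:
                break                              # torn final record
            if rec.get("commit"):
                for block_no, payload in batch:
                    apply_block(block_no, payload)
                batch = []
            else:
                batch.append((rec["block"], rec["data"]))
```

Because there is no undo information, a transaction without a commit record is simply ignored on replay, which is the “no rollback” tradeoff mentioned in the next paragraph.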
This allows the file system to be able to operate at raw disk speeds in circumstances where fsync’s are not required, and even a few fsync operations won’t slow the file system down much. Obviously, a very heavy fsync load will slow down a file system, but it turns out that by only using a redo log, a file system can be much more efficient at pushing data to the hard drive than databases. This comes at a cost of not being able to support transaction rollbacks, and having a much more primitive transaction system than databases, but that’s the tradeoff.
About ten years ago, Oracle tried to persuade people that you could build an application-level file system on top of an Oracle DB that could be used to build ftp and http servers. This was a ridiculous notion, and it was quickly laughed down. File systems and databases are optimized for very different use cases; when you try to use a file system as a database, or a database as a file system, you very quickly run into either performance problems or functionality gaps.
March 17th, 2009 at 8:42 am
@84: I believe you argue for fsync() merely to not have to change your own code.
Ben, obviously you haven’t been paying attention. I had already written the patches queued up for 2.6.30 which provided workarounds for broken applications, even before this huge discussion blew up on Slashdot and the Ubuntu Launchpad bug. So if people want to write applications that only work well on ext3 and ext4, that would actually be great job security for ext3/4 file system developers. So you might think that I should be encouraging application programmers to continue with their sloppy programming techniques. After all, that’s the sort of thinking that allowed Bill Gates to make his billions by taking advantage of application programmer lock-in. Why haven’t I done that? Because it would be the wrong thing to do.
March 17th, 2009 at 8:57 am
@22, 78: Bojan,
I’m not sure rename2() is the right API. It may very well be that the better API is some kind of fbarrier(fd) call, which indicates that the data contained in the indicated file should be pushed to stable store at some point in the future, no later than the next commit of the file’s metadata. This could be simulated by the kernel as fsync(), and application programs would have no idea when the data would be committed to stable store, just that it wouldn’t be deferred past the next metadata operation which involved the inode in question.
In any case, there is a Linux Filesystem and Storage workshop in San Francisco (hosted by the Linux Foundation, the organization for which I am the CTO) coming up in a few weeks, and I’ve already contacted the program chair of that workshop, and we will be discussing this topic at that event. You can be sure that we will be discussing requests from application programmers about their expectations about implicit guarantees for rename(), as well as proposed new interfaces that might provide new semantics that might be useful for application programmers.
Anything we do, of course, will be Linux specific, and at least initially, won’t be implemented on any other operating system out there (and yes, there are still a few BSD’ers hiding in foxholes running FreeBSD and OpenBSD, and I hear there’s this BSD OS called MacOS that is kinda popular, not to mention some OS called Solaris that claims to have an open source development process). But at least we may be able to come up with something that isn’t ext3 or ext4 specific. Because while I might love it if everyone only used Linux and ext3/ext4 file systems, I do believe there is a bigger world out there, and I’ve always been a big believer in heterogeneous systems.
March 17th, 2009 at 10:33 am
> @91:
Thanks for all your hard work. I’m looking forward to meeting ext4 in F-11!
March 17th, 2009 at 11:08 am
>> in order to get very high performance levels, file systems usually combine 5-30 seconds worth of file system operations into a single transaction commit.
Ted, thanks for continuing the dialogue here. It’s been educational. Thanks for putting the rename ‘kludge’ in, but I do think it’s absolutely necessary. I will give you my use case.
I use the write-close-rename all the time with configuration files that I don’t care if I lose the last N seconds of change to. This is easy and thread safe and has worked for years on ext3. At the same time, I have a daemon that is continually writing out large log files. We also run on cheap IDE hardware due to cost pressures. It seems that fsyncs will force your above described ‘mega transaction’ to complete, which involves seeking all over the disk to our other dirty log files. If I made multiple changes to a conf file during a 5-30 second ext4 flush interval, fsync will cause more seeks than not using it, which will wear out our disks.
You are right that the FS is not a database; for certain things I do not care about instant durability. So we can:
1. have the FS support write barriers on rename or other new api
2. do writes to a user space cache daemon that only flushes when necessary
3. make every app much more complicated and cache its own data
#1 already works.
March 17th, 2009 at 11:12 am
@91: Thank you for the consideration. I’ll look forward to hearing about the outcome of the conference. Obviously, I’m in the rename()-should-be-a-write-barrier camp (for reasons of compatibility, API parsimony, simple shell access, making rename hard to misuse, and so on), and I’d still challenge anyone to come up with a real-world scenario where rename without a write barrier is what’s actually wanted.
That said, fbarrier() would be immensely powerful, especially if it could apply to all writes to a file or its metadata, and not simply placing a metadata-data write barrier. Ideally, we have an fbarrier that works like that, and rename becomes an atomic sequence of fbarrier-link-unlink.
Again, thanks for taking the time to have this discussion. If the comments I (or anyone) have submitted come across as “borderline flames”, it’s not a personal issue. I just feel that a robust filesystem is one of the most important attributes for an operating system to have.
March 17th, 2009 at 11:28 am
@91:
I think it is very difficult to get application developers and users to accept a potentially large performance hit on a widely used platform (ext3 with data=ordered) when the sole benefit there is portability. Since the unsafety of open/write/close/rename is neither a new problem (quite a few filesystems are like this) nor immediately harmful (ext3 data=ordered or ext4 with some workaround will probably remain the default in the near future), and anyway users cannot rely on corrected versions of major applications being installed for quite some time, I think we application developers might as well sit back and wait a few months, until we have the necessary kernel interfaces to do the right thing on all filesystems and with no performance hit on “safe” ones.
Of course, if applications begin to do fsync() only on “unsafe” filesystems, these filesystems might well become less competitive in some benchmarks. But people can then turn on/off alloc-on-commit or your current without worrying about data integrity; they simply get better or worse performance for different workloads.
March 17th, 2009 at 11:56 am
@93
I apologise for my choice of words there. In fact, the comments on this blog entry weren’t so much what I was thinking of as the general discussion around this storm in a teacup. For example, more than a few comments on LWN have been, let’s say, *rather heated* – which can’t be much good for a developer’s morale, especially given that the issue in question was already solved.
March 17th, 2009 at 12:30 pm
@94: I’d still challenge anyone to come up with a real-world scenario where rename without a write barrier is what’s actually wanted.
Daniel,
Here’s one example. You’re downloading a file from the web, which Firefox does by creating the file “foo.iso”, initially as a zero-length file; it then creates the file “foo.iso.part” where the data is actually downloaded, and once the download completes, Firefox renames “foo.iso.part” to “foo.iso”. The file foo.iso isn’t precious; in the case of a system crash, we can always download it again from the web. With the rename kludge, though, we are now treating this file as precious, so that we are doing the equivalent of an implied fsync() when the rename commits.
You can (and have) made the argument that it’s more important that config files which are replaced never be lost. But there are also lots of files, such as all derived files created in a build tree when a developer runs the command “./configure ; make” which are not precious. In the unlikely event that the system crashes, the effort required to regenerate the object files is no big deal, and if the developer isn’t sure, the effort of running “make clean; make” is minimal.
I think it’s fair to say that traditional POSIX fsync() semantics essentially assumed that (1) systems actually didn’t crash that often, and (2) files which are precious should be fsync()’ed. However, for files that are not precious, that can be easily verified using “rpm -V” and reinstalled easily enough using “apt-get install --reinstall” (I’m taking no sides on the packaging debate here; the last thing we need is an rpm vs. dpkg flame war!), why should we pay the cost of treating these files as precious when systems rarely crash and, if they do, we can easily reconstruct/recompile/reinstall them?
All-in-all, it’s all about expectations, and it’s clear that Linux’s relative stability (until unfortunately the advent of unstable proprietary binary-only video drivers, which are far more unstable than the open source drivers), and application programmers who have gotten lazy because of ext3’s properties (which by the way are also responsible for ext3’s performance problems; I always laugh when developers of new file systems compare themselves favourably to ext3 as if that were something to be proud of, since ext3’s performance numbers are actually a pretty low bar), have gotten us into a situation where there are expectations that could very well inhibit future file system innovation and performance improvements. We’re not going to get out of this corner that we’ve painted ourselves into very easily, and that too will be a big part of the discussions at the Linux Filesystem and Storage Summit.
March 17th, 2009 at 12:56 pm
@97: Thank you for your reply. I appreciate your consideration of the issues.
I don’t see, however, how a write barrier in that scenario would be detrimental overall. A write barrier on rename doesn’t make the file “precious”, in principle. The total amount of IO for the downloaded-file scenario will remain the same either way — and if the file is large, much of the file’s data will have already been flushed by the time the rename rolls around. And even if you don’t flush the remaining data blocks before the rename, you’ll have to flush them soon thereafter anyway. It’s not as if the rename synchronously flushes the data blocks, increasing application latency. The filesystem can still schedule the rename-flush combination at a convenient time.
The compilation example also doesn’t get hurt by a write barrier — all the data blocks for those intermediate files will be flushed anyway, and in principle, a filesystem could commit neither the data blocks nor the rename record if the file is unlinked before a commit timer expires. In principle (I have no idea how difficult this would be to implement), the filesystem could write all the rename records for a given commit interval first, then all the data blocks. That’d reduce seeking and still preserve the ordering constraints, right?
Please, if I’m wrong, tell me why. But I don’t see how there’s any significant IO cost to rename being a write barrier, in principle. The way ext4 implements that guarantee — through immediate allocation (your patch is actually very simple) — might incur some penalty, but that penalty is an implementation artifact and not a result of an inherent complexity of the barrier. Soft-updates for UFS, for example, manages to keep a dependency graph of all metadata updates without much of an overall performance penalty.
Also, preciousness isn’t a hard-and-fast criterion: configuration files can be regenerated, sure, but the system might not boot if certain ones are damaged. A file downloaded over a slow link could certainly be considered precious by a user, though it, too, could technically be regenerated.
Even if some case could be found for rename without a write barrier (say it bottlenecked something, somewhere), I believe the default should still be the operation that’s both safe and intuitive, and that if any new API is added, that new API should override the safeties, as it were; consider glibc’s _unlocked variants of the standard C IO functions. It’d be easy enough to say, “call this function if you know what you’re doing and want the maximum possible performance, knowing these caveats”.
March 17th, 2009 at 2:16 pm
With all the criticism, I just want to say that I thought your post was great and addressed the criticism just fine.
With your laptop mode idea of “disable” fsyncs, I don’t know about “disabling” them, but I think delaying them until convenient would be a good idea so the atomicity would still be there.
March 17th, 2009 at 2:53 pm
@89: you can only do this as long as the undo log records are written before the metadata blocks are written to disk
Absolutely, Ted. However, with an undo log if the only requirement is atomicity (not durability) then *nothing* needs to be synchronously written to disk. Not the data, not the meta data, nothing.
The undo records only need to be written prior to the time rename returns if one needs the rename to have guaranteed durability prior to proceeding further.
March 17th, 2009 at 2:58 pm
I also understand why people want to say INSIDE their app: “fsync/write whenever the OS wants, as long as it’s before this rename”, instead of saying “fsync now, then rename whenever the OS wants”.
March 17th, 2009 at 3:46 pm
@98: A write barrier is an implementation detail. An atomic rename can be performed on a properly designed filesystem without one.
The downside of making renames an automatic write barrier on filesystems that require such things is that it dramatically slows down a non-trivial class of applications that do lots of rename replacements. Rsync is typical.
Right now rsync comes with an --fsync option that fsyncs every file prior to rename replacement. If rename always fsyncs, then not only will that option no longer have any effect, the unnecessarily slow behavior will become mandatory. The option having no visible effect wouldn’t be a problem, of course, if there weren’t a very severe performance penalty to be paid.
That is why we need two things: (1) Filesystems that have the capability to do low overhead (non fsync-ing) atomic renames, and (2) A kernel API or fcntl option that allows an application like rsync to find out whether it needs to issue an fsync to make a rename atomic or not.
March 17th, 2009 at 3:59 pm
@101 — what is this about rsync having a -fsync option?
“grep -r fsync” on the extracted 3.0.5 source, the nightly from March 13th, and a freshly downloaded copy from git all yield no results.
March 17th, 2009 at 4:04 pm
@98: The downside of making renames an automatic write barrier on filesystems that require such things is that it dramatically slows down a non-trivial class of applications that do lots of rename replacements. Rsync is typical.
What makes you think it would do that? It’s not as if with every ordered rename, rsync will wait for data to hit the disk. All the rename does is tell the filesystem, whenever it gets around to flushing its buffers, to flush the data blocks before the rename record. It doesn’t command the filesystem to flush either immediately. That’s what “write barrier” means! It’s not synonymous with a full synchronous flush; in that case, we’d call it fsync! These data blocks need to be written anyway, and an ordered rename doesn’t change that at all.
There is no “very severe performance penalty.” In fact, there’s not even a “performance penalty”. rsync will run in exactly the same time it did before, but now when the filesystem flushes dirty blocks to disk, it’ll do it in a slightly different order. That’s all.
How many times do I have to say the same thing? This is not making rename include an automatic fsync. This is something far more subtle that has no real-world performance impact.
March 17th, 2009 at 4:05 pm
Err, @102, dammit.
March 17th, 2009 at 4:23 pm
>>With the rename kludge, though, we are now treating this file as precious, so that we are doing the equivalent of an implied fsync() when the rename commits.
I thought you already have an fsync every 5-30 seconds. All we want to do is make the rename commit *after* the data does. Put the rename commit in the next 5 second interval if necessary.
March 17th, 2009 at 4:23 pm
Would the sequence open, write, fdatasync, close, rename, work as well as open, write, fsync, close, rename? I believe it would, however I think that it would not result in any performance increase on ext3 data=ordered, because the new file length would still have to be written into the metadata which would require all metadata changed before and with it all data to be written to the disk.
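For readers following along, the sequence being asked about can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper name `atomic_replace` and the `.tmp` suffix are invented for the example, and the `use_fdatasync` flag merely switches between the two variants in the question. It says nothing about which variant is faster on a given filesystem.

```python
import os

def atomic_replace(path, data, use_fdatasync=False):
    """Write data to a temporary file, flush it, then rename it over path."""
    tmp_path = path + ".tmp"  # hypothetical naming convention for this sketch
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        if use_fdatasync and hasattr(os, "fdatasync"):
            os.fdatasync(fd)  # flush data (and the metadata needed to read it)
        else:
            os.fsync(fd)      # flush data and all file metadata
    finally:
        os.close(fd)
    # rename() atomically replaces the old name as seen by running processes.
    os.rename(tmp_path, path)
```

If the write or flush fails, the temporary file is left behind and the original file is untouched; a real application would probably want to clean it up.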
March 17th, 2009 at 4:31 pm
@106: That’s exactly what he’s doing: note the words “when the rename commits”. rename will not synchronously update anything. That’d be silly.
March 17th, 2009 at 5:05 pm
> @104: All the rename does is tell the filesystem, whenever it gets around to flushing its buffers, to flush the data blocks before the rename record.
Could you please read the fsync manual page. It is perfectly OK for a user to fsync the directory by itself. That is to say, the directory and the file are two different entities that both users (and therefore the kernel) are allowed to commit to disk whenever desired and in whichever order.
With ordered rename, the fsync on the directory means committing all data in the file linked from it as well. And it means that for the kernel too.
So neither the user nor the kernel is any longer free to reorder these things as they see fit. As a result, every time a directory fsync happens, whether explicit or decided by the kernel, the whole renamed file will need to be committed first.
So, if another process/thread (or kernel) needs/wants to commit the directory _now_ and you just renamed a big file in it, your file will be committed first and _synchronously_, because your new requirement on rename is that _every_ rename is completely ordered.
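As a point of reference, syncing a directory on its own looks something like the sketch below. The function name `fsync_directory` is made up for the example. Whether the call also forces out the data of files recently renamed into that directory depends on the filesystem, which is exactly what is being argued about here.

```python
import os

def fsync_directory(dir_path):
    """Ask the kernel to flush a directory's entries to stable storage."""
    # Opening a directory read-only and calling fsync() on it works on Linux;
    # other platforms may behave differently.
    fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```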
Why is it so difficult to wait for the new API? Or have another solution for configuration files (such as backup files)?
The only thing overloading existing rename is going to do is make some programs crap on file systems that don’t have this (and it has been shown plenty of times that they don’t have to have it). Sure, Ted already fixed it for you in ext4 – that’s done. I think we all understand what the issue is, new stuff is forthcoming – let’s use it when it comes.
March 17th, 2009 at 5:23 pm
@109: Bojan, in your world, every application that did a rename() would need to fsync the file-to-be-renamed’s data blocks. So in your scenario, the large file’s data blocks will be fsynced anyway regardless of what the other process does. The only time it’d make a difference is when the large file would have been unlinked immediately after the rename-fsync of the other file. That’s unlikely. Directories tend to contain many of the same kind of file. You’re not going to have Firefox stick a huge downloaded file into the same directory exim is sticking its mail spool.
Can you give me a real-world scenario when this would actually be a problem?
A safe, ordered rename needs to be the default, and the “new api” can deal with high-risk renames.
Furthermore, even if you were right, and an ordered rename did cause a slight performance decrease, rename should still have safe, ordered semantics by default, and new APIs should be added to unsafe operations. The problem with any new API that purports to make rename safe is that rename itself is still unusable. Did you read my post above? It’s a good thing that the rename API is intuitively safe and correct in its obvious use. You know damn well that rename by itself works fine on a running system, and that application developers aren’t going to test catastrophic failure scenarios.
March 17th, 2009 at 6:12 pm
The open, write, fsync, close, rename sequence bothers me. This sequence destroys any metadata (permissions, ownership, etc) attached to the file, as well as messing with any hard links to the file.
That seems a bit defective if it is the recommended method when all an application wants to do is update the contents of a file.
March 17th, 2009 at 6:23 pm
>@110: Can you give me a real-world scenario when this would actually be a problem?
Remember the foo.iso that Ted was downloading? Well, if I edit a.txt (10 bytes, no more) in the same directory with a reliable editor, when I save that file, the directory will be fsync()-ed too. Hence, I’ll have to wait for Ted’s file to be committed (because it was just renamed) before I see my prompt back.
I didn’t tell the system to do it, neither did Ted, so why is it doing it? Are we back to lock-up-on-fsync disaster from ext3 now?
> rename should still have safe, ordered semantics by default
When the spec doesn’t support your view, just imagine what you want it to be. That’s always good
March 17th, 2009 at 6:23 pm
@110. Exactly. Is the end game the ext4 devs want that every single app has to call fsync on every file before rename? I mean *who* *wants* 0 byte files? Isn’t every f’king app calling fsync going to generate a hell of a lot of disk seeking, wearing out disks sooner than batching updates? I don’t see there being a system wide, holistic design here for this fsync all the time theory. It is starting to scream second system effect – they don’t understand why we actually like ext3, instead they call us idiots who can’t develop apps right.
March 17th, 2009 at 6:27 pm
> @112:
People call fsync() for legitimate reasons. For instance, to save files (WOW!).
Just because they do that, doesn’t mean that the system should commit renamed files for which nobody cares if they end up being zero length (and yes, there are such files; and no, systems don’t normally crash all the time, so it’s not like it happens every 5 minutes).
> instead they call us idiots who can’t develop apps right
Nobody is calling anyone idiots. The manual pages are, however, crystal clear. Applications have bugs. Things like this happen.
March 17th, 2009 at 6:37 pm
@113: Bojan, your reasoning is circular. In fact, you could play frisbee with your argument. It’s right because that’s what the documentation says, and the document documents what’s right, right? You’re aware that things can change, right?
As for your download example: that’s how it works today on ext3! Your reliable editor will call fsync on foo.txt, which will in turn flush all outstanding data blocks, even those from foo.iso. Yet saving a file in an editor doesn’t cause an undue delay.
Second, with an ordered rename, there are two cases: First, the browser is still dumping stuff into foo.iso.tmp and hasn’t renamed it yet. In that case, saving foo.txt has no effect on foo.iso.tmp because it hasn’t been renamed.
Second case, the browser has renamed foo.iso.tmp to foo.iso, but hasn’t flushed the last of its data blocks yet. Then we’re back to the ext3 scenario with the directory fsync — but at that point, most of foo.iso will already have been written to disk anyway, so the additional information written out to foo.iso will cause minimal delay. This is the worst case, and we still get the same performance we have on ext3 today. In far more common cases, performance is much better.
More importantly, in all cases, robustness has improved. So where’s the downside?
March 17th, 2009 at 6:44 pm
@111:
Good example. I think it sounds easier for distros to work around than the ff problem, though. Perhaps the ’sync this directory’ that reliable editors need could be improved with a rename_sync(), which ensures that the new file will overwrite the old file and only this particular metadata change (and of course the new files contents) need to be flushed to disk.
March 17th, 2009 at 7:06 pm
> @114: You’re aware that things can change, right?
You have not been paying attention. The documentation is the only _objective_ thing programmers can rely on. If they do not, programs don’t work. Haven’t you seen that by now?
Changing an existing API to do unspecified things is _dangerous_, because it eventually causes people to lose data somewhere else (see: XFS, which _wasn’t_ the fault of FS developers). And let’s be clear – it is the apps that are not behaving in line with spec. Not the FS. No amount of “imagining” can change that.
There is nothing circular about my argument. We should write the apps to the spec, because that is the only objective measure of their compliance.
On the other hand, every post of yours starts with “I can imagine the world this way” or some such. It isn’t like that – get over it.
> As for your download example: that’s how it works today on ext3!
Which is the reason why we had the FF v. ext3 ordered mode fiasco.
Sure, Ted fixed application breakage by a few workarounds in ext4, because he is a practical man. But, if we keep insisting on everything that ext3 does as being good, then why do we have ext4 again? Just run ext3.
> most of foo.iso will already have been written to disk anyway, so the additional information written out to foo.iso will cause minimal delay.
Says who? It could be all in core, waiting on delayed allocation.
> More importantly, in all cases, robustness has improved. So where’s the downside?
The downside is, as I have explained to you about a dozen times by now, that it encourages people to write non-portable and broken programs. They think they should be doing this, when in fact it is not correct. And, it also reduces the flexibility of the kernel and the FS to do commits in an unordered fashion, as clearly permitted by the spec.
BTW, machines don’t crash every 5 minutes.
PS. If you said that ext4 must implement ordered mode, like ext3 does, because if it doesn’t, it would be a regression (being the next in line and all), then I may even say: OK, makes sense – provide people the compatibility mode. But, ext3 also has writeback mode, which does not work as ordered _and_ there are a gazillion other FS implementations that don’t have ordered mode _and_ the spec doesn’t mandate it. So, ext4 is free to do what it does.
PPS. The new API for what you want is forthcoming. Just be patient and fix the apps in the meantime. They are broken.
March 17th, 2009 at 7:22 pm
@116: we’re not talking about what programs should do. We’re talking about what filesystems should do. Once the filesystem does the right thing, the documentation can change. Then, correct programs that rely on the documentation can change. After that, the standard can change, and even the programmers most worried about adhering to the standard can use the new functionality.
Furthermore, every API does unspecified things — the specification can’t possibly describe everything that happens. The specification dictates minimum requirements. If an application calls fsync on a file that’s going to be renamed with an ordered rename, that’s fine. It’s no worse off than it was before. All the requirements of the POSIX rename are met.
What’s circular in your argument is supposing that simply because a portable program (which assumes nothing but POSIX) can’t use a particular piece of functionality, that piece of functionality shouldn’t be implemented because it’ll encourage programmers to write programs that aren’t portable to all POSIX systems.
Not every program needs to be portable to all POSIX systems. That’s how progress is made. Get over it yourself.
And we’re not talking about everything ext3 doing being good. That’s a strawman. We’re talking about just one particular guarantee it made: rename consistency.
Furthermore, POSIX doesn’t actually guarantee anything in a crash. What part of that don’t you understand? A program that depends on an fsynced file to be there after a crash is just as guilty of depending on non-POSIX functionality as a program that depends on rename having ordered semantics. Both programs depend on non-POSIX functionality.
Whether a given program should call fsync depends on where it’s being used. Random programs should not be calling fsync unless they legitimately need durability. fsync will cause problems elsewhere. Programs that rename without fsync are not broken.
(Also, core buffers have a finite size, and commit intervals have a finite duration. There’s no way you’re going to have gigabytes of dirty blocks sitting there waiting to be synced. Even with laptop_mode, in which these constants are expanded, the main problem is spinning up the disk. The editor’s fsync will cause the disk to spin up regardless, so you might as well write the ISO out too.)
March 17th, 2009 at 7:25 pm
@97: Ted T., I agree fbarrier(fd) would be a definite improvement for portable applications. However, I don’t think any particular filesystem should treat its users’ data so cavalierly by default, and if that means adding meta-data undo to avoid the data/meta-data ordering dependencies, that is what every modern filesystem should do.
@103: Apparently rsync used to have a --fsync option (or at least one was proposed in the 2.6.1 timeframe), but doesn’t anymore. Even more reason to make rename reliable by default.
@104: Daniel C., with regard to performance penalties, you are right in the general case of course – a write barrier is not nearly as bad as a synchronous fsync operation or the equivalent. The severe performance penalty I was trying to refer to was that incurred by portable applications having to issue fsync calls in this case.
I think you are absolutely right that all self respecting filesystems should provide ordered renames by default, even if their implementation of ordered renames is decidedly sub-optimal. However, there needs to be some way for a (relatively) portable application to know that an fsync is not required in this case (or some alternative system call). Otherwise there will be severe performance penalties for all portable applications that need ordered but not necessarily durable renames, no matter how smart the filesystem is.
March 17th, 2009 at 7:25 pm
> PPS. The new API for what you want is forthcoming.
I’ll believe it when I see it
> Just be patient and fix the apps in the meantime. They are broken.
And then go back and fix the apps again when the new API arrives? That sounds like a lot of duplicate work.
PS. People that claim that you should be as anal as possible about the spec in order to teach others a lesson always crack me up.
March 17th, 2009 at 7:30 pm
> @117: Not every program needs to be portable to all POSIX systems. That’s how progress is made.
Is that why these wonderful programs created zero length files on perfectly good file systems? Because they were making “progress”. OK…
New definition of progress in computing: “We shall lose your data!”. Pretty catchy, eh?
> Furthermore, POSIX doesn’t actually guarantee anything in a crash.
Never you mind reading the fsync or close manual pages. It is not important that people are told to save their data to disk. No, please ignore it.
> Programs that rename without fsync are not broken.
Correct. I read it in POSIX.by-Daniel
March 17th, 2009 at 7:31 pm
@118: You’re right, of course. Portable applications need to be able to discover how rename works, at least until a sane rename becomes ubiquitous. The ideal mechanism to discover this information seems to be pathconf/fpathconf. It’ll even work on a filesystem-by-filesystem basis.
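To make the pathconf idea concrete, here is a rough Python sketch of what such a per-filesystem query might look like from an application. Note the assumptions: `PC_NAME_MAX` is a real, standard pathconf name, but `PC_ORDERED_RENAME` is purely hypothetical and is not defined by POSIX or any kernel; the helper name `query_path_property` is also invented for the example.

```python
import os

def query_path_property(path, name):
    """Return the pathconf value for name, or None if this system lacks it."""
    # os.pathconf_names lists the configuration names this platform knows.
    if name not in os.pathconf_names:
        return None
    try:
        return os.pathconf(path, name)
    except OSError:
        return None

# A real property: maximum filename length on the filesystem holding ".".
name_max = query_path_property(".", "PC_NAME_MAX")

# A hypothetical property (does not exist today): an application could fall
# back to calling fsync() before rename() whenever the answer is unknown.
ordered = query_path_property(".", "PC_ORDERED_RENAME")
needs_fsync_before_rename = not ordered
```

Because an unknown name simply yields `None`, the conservative path (fsync before rename) is taken on every system that does not advertise the guarantee.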
March 17th, 2009 at 7:51 pm
> @119: That sounds like a lot of duplicate work.
Which could be a good thing. If you are getting paid to do all that work
> People that claim that you should be as anal as possible about the spec in order to teach others a lesson always crack me up.
People that have bugs in their programs and want to blame others for it – well you don’t want to know what I think…
March 17th, 2009 at 8:03 pm
@120: There’s a fantastic t-shirt for you. But before you wear it, make sure to scratch out “democracy” and replace it with “POSIX”.
@122: Parable of the Broken Window
March 17th, 2009 at 8:27 pm
@111: That’s true, but a quality application that cares about that sort of thing will make sure the temporary file’s ownership and permissions match the original before actually performing the rename. That’s not always possible (consider a world-writable file owned by someone else), but fortunately, most of the time you need atomicity, you’re able to actually match permissions.
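A minimal sketch of that approach in Python might look like the following. The helper name `replace_preserving_mode` and the `.tmp` suffix are assumptions for illustration; the sketch copies the permission bits, tries to copy ownership, and does nothing about hard links, which still end up pointing at the old inode.

```python
import os
import stat

def replace_preserving_mode(path, data):
    """Replace path via rename, matching the original's mode and owner."""
    original = os.stat(path)
    tmp_path = path + ".tmp"  # hypothetical naming convention for this sketch
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Copy the permission bits from the file being replaced.
        os.fchmod(fd, stat.S_IMODE(original.st_mode))
        try:
            os.fchown(fd, original.st_uid, original.st_gid)
        except PermissionError:
            # Not always possible, e.g. a file owned by someone else.
            pass
        os.fsync(fd)
    finally:
        os.close(fd)
    os.rename(tmp_path, path)
```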
I agree it’s a problem in general though. If we’re dreaming, I’d love to see an atomic replacement system call: let’s call it sys_swap_contents(int fd_a, int fd_b). It’d atomically swap the contents of fd_a with the contents of fd_b. Filesystems could implement this fairly easily by munging block pointers, I imagine. fd_a and fd_b would have to be open for writing, and be on the same filesystem of course.
(A special kind of link won’t work because the permissions of the files referred to by fd_a and fd_b would still be different: the system call would only affect content, and permissions are stored in the inode. That is, two filenames that are hard links to the same file cannot have different permissions. Unfortunately.)
March 17th, 2009 at 8:40 pm
Ted,
I do not recognise your description of the FireFox fsync bug – I got 30 second delays on plain installs of Ubuntu/Fedora with no special modifications or workloads.
March 17th, 2009 at 8:42 pm
@98: I don’t see, however, how a write barrier in that scenario would be detrimental overall. A write barrier on rename doesn’t make the file “precious”, in principle. The total amount of IO for the downloaded-file scenario will remain the same either way — and if the file is large, much of the file’s data will have already been flushed by the time the rename rolls around. And even if you don’t flush the remaining data blocks before the rename, you’ll have to flush them soon thereafter anyway. It’s not as if the rename synchronously flushes the data blocks, increasing application latency. The file system can still schedule the rename-flush combination at a convenient time.
Daniel,
Well, the system can’t arbitrarily schedule the rename-flush at any time; remember POSIX does enforce ordering (as long as the system doesn’t crash). So once the rename is visible we can’t “schedule it for later”. And we’ve talked about the problem of entangled commits before. So like it or not, the rename-flush (or more accurately, flush-rename) will have an impact; it will interfere with other scheduled writes, including other fsync()’s, which some applications might actually need. So it’s incorrect to say that the rename-flush is “free”. Depending on whatever else is going on, it might not be noticeable; but then again, depending on what is going on in the system, the fsync() might be free as well. All of your arguments saying that all the write barrier does is move writes around apply just as much to the fsync().
The compilation example also doesn’t get hurt by a write barrier — all the data blocks for those intermediate files will be flushed anyway, and in principle, a filesystem could commit neither the data blocks nor the rename record if the file is unlinked before a commit timer expires.
Except the commit timer is 5 seconds, and normally with delayed allocation the writes get dribbled out over 60-120 seconds. That gives you a much bigger window for the files to get unlinked without their needing to be written to disk. (In fact, in order to reduce the write load on my SSD, I actually turn up the dirty expiration time much higher, to exaggerate this effect even more.)
In principle (I have no idea how difficult this would be to implement), the filesystem could write all the rename records for a given commit interval first, then all the data blocks. That’d reduce seeking and still preserve the ordering constraints, right?
In principle you can write the journal records and the data blocks in any order you want, as long as you wait for them all to complete before you write the commit record, synchronously. Once the commit record is written, only then can you start updating the on-disk metadata blocks. This could potentially reduce seeking somewhat, yes. Whether it’s enough to be noticeable, I’m not sure.
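The ordering constraint described above can be illustrated with a toy model. This is not how any real journal is implemented; it is a sketch in which `disk` is just a Python list recording the order of events, and the function and event names are invented for the example.

```python
def commit_transaction(disk, journal_records, data_blocks):
    """Toy model of one commit; 'disk' records the order of events."""
    # Journal records and data blocks may be issued in any order
    # (here: rename records first, then data, to reduce seeking).
    for item in list(journal_records) + list(data_blocks):
        disk.append(("write", item))
    # Wait for all of the above to complete before going further.
    disk.append(("flush",))
    # The commit record is written synchronously.
    disk.append(("commit_record",))
    disk.append(("flush",))
    # Only now may the on-disk metadata blocks be updated.
    for record in journal_records:
        disk.append(("metadata", record))
    return disk

events = commit_transaction([], ["rename foo.part -> foo"], ["blk1", "blk2"])
```

The point of the model is only that the commit record sits between two waits: everything it depends on must be on disk before it, and no metadata update may precede it.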
March 17th, 2009 at 8:57 pm
@99: With your laptop mode idea of “disable” fsyncs, I don’t know about “disabling” them, but I think delaying them until convenient would be a good idea so the atomicity would still be there.
Ian,
Actually, once laptop_mode notices that the disk is spun up, it will force out all of the dirty blocks, so it’s the equivalent of delaying them. The theory is that once the disk is spun up, you might as well force everything out to disk.
March 17th, 2009 at 8:58 pm
@126:
Thanks for taking the time to explain some of the details.
Well, the system can’t arbitrarily schedule the rename-flush at any time; remember POSIX does enforce ordering (as long as the system doesn’t crash). So once the rename is visible we can’t “schedule it for later”.
I think we may be talking past each other here. I don’t see why a rename that’s visible to a running system necessarily needs to be committed to disk. POSIX’s ordering semantics apply only to the filesystem as seen by processes actually running on the system, and not to the on-disk image itself. Of course, from the point of view of processes running on the system, the rename happens immediately, and the file’s contents are visible immediately. What bearing does that have on the underlying disk image?
Depending on whatever else is going on, [flush-rename] might not be noticeable; but then again, depending on what is going on in the system, the fsync() might be free as well. All of your arguments about saying that all the write barrier does is move writes around applies just as much to the fsync().
Even if that were the case, at least the semantics and safety of an “ordered rename” would be superior for a host of reasons — applications wouldn’t need to use threads to hide the latency, for example.
But I don’t see how your statement is accurate in the first place. Correct me if I’m wrong, but I don’t think fsync just sleeps until the next time the filesystem was going to commit anyway and returns when that commit is done. Instead it schedules a disk flush immediately and blocks until this triggered flush finishes.
On the other hand, a pending rename can wait until the next time the filesystem would commit (in the absence of the rename) and just write the rename record after all the other work is finished. You mention that writes can be dribbled out over 60-120 seconds: why can’t the rename record be written after all that’s done? If I’m wrong here, I’d love to know why.
March 17th, 2009 at 9:11 pm
@100: “you can only do this as long as the undo log records are written before the metadata blocks are written to disk.” Absolutely, Ted. However, with an undo log if the only requirement is atomicity (not durability) then *nothing* needs to be synchronously written to disk. Not the data, not the meta data, nothing.
Mark, you’re forgetting the part where I observed that there’s no way to guarantee the undo logs have actually been written to the disk (so that it’s safe to write the metadata blocks) without actually doing the synchronous write; basically you have to do a flush-to-iron-oxide command, which is morally equivalent to doing a synchronous write. Basically hard drives don’t have a “you can write these blocks only after those other blocks have been written out” request. (SCSI TCQ had this facility, but it was an apparent commercial failure, and newer disks have NCQ, which doesn’t allow these kinds of ordering constraints.)
And so without a synchronous, “flush the data to disk and don’t return until everything that’s been sent has been written to iron oxide” command, it’s not safe to write the metadata commands. Hence, if you are using an undo log, you have to use a synchronous wait command, which is morally and from a performance point of view equivalent to a synchronous write, between the phase where you are writing undo logs and the phase where you can write the corresponding metadata blocks.
This simply can’t be helped, whether or not you need atomicity or durability or both. It’s simply in the nature of the undo log.
March 17th, 2009 at 9:12 pm
My vague understanding of the zero length files problem suggests that the only change necessary is when replaying the journal after an unclean shutdown.
In that circumstance, if a file rename occurs, and the rename is replacing an existing file, and if foo.new is not known to be correct on the disk (i.e. the journal shows that it has been modified since the last sync), then do not perform the rename, otherwise continue as normal.
March 17th, 2009 at 9:29 pm
@129: newer disks have NCQ that don’t allow you these kind of ordering constraints.
They don’t?
March 17th, 2009 at 9:58 pm
@129: “there’s no way to guarantee the undo logs have actually been written to the disk — so that it’s safe to write the metadata blocks — without actually doing the synchronous write; basically you have to do a flush-to-iron-oxide command, which is morally equivalent to doing a synchronous write”
Absolutely – when you get around to writing metadata you must write the pertinent undo information somewhere first. But atomic semantics in general do not require either write to be synchronous with the rename operation. That is just a weakness of a particular design.
Some databases have separate undo buffers that get written to disk during large transactions. However, for performance reasons, they don’t write these undo buffers to disk under normal operation.
Instead they record undo information in the redo log, and on recovery restore the logical state of the undo buffers the same way they recover any other meta data.
So if you want atomic, but not necessarily durable commit semantics, they make the appropriate changes to various data and meta-data buffers in memory, with corresponding changes to the undo buffers in memory, and return immediately. The redo information, including the undo state redo information, gets physically written to disk later.
No synchronous disk write is required until someone asks for a durable commit. And then only to the redo log. All other I/O is asynchronous, and typically occurs only under memory pressure or at the time of the next checkpoint. And the checkpoint, it is worth noting, does not delay or stop any ongoing transactions. It is a background process in the pure sense of the term.
March 17th, 2009 at 10:20 pm
@113: Is the end game the ext4 devs want that every single app has to call fsync on every file before rename? I mean *who* *wants* 0 byte files? Isn’t every f’king app calling fsync going to generate a hell of a lot of disk seeking, wearing out disks sooner than batching updates? I don’t see there being a system wide, holistic design here for this fsync all the time theory. It is starting to scream second system effect – they don’t understand why we actually like ext3, instead they call us idiots who can’t develop apps right.
Karl, only for files which are precious. And I, as a laptop user, also want you to be parsimonious about when you write to disk in the first place. Frequent writes, whether or not you use fsync(), burn battery power, and for SSDs, burn drive life. If your application needs to commit data to stable storage more frequently than once or twice every couple of minutes, there is something seriously wrong; ideally, you should be able to cut down your writes to once every 15-30 minutes. If you think the position of your window is so important that you need to write its location to disk every second, and if KDE applications need to rewrite hundreds of very small desktop files on desktop startup (when at best you should be *reading* from config files, not writing to them), and if Firefox 3 needs to write 2.5 megabytes (that’s two million, five hundred thousand bytes of data!!!) for every single URL visit just for their “awesome bar”, and needs to push it out to disk on every single click, waking up the laptop hard drive, then there is something seriously, seriously wrong with application programs today.
We can try to provide better APIs which are “cheaper” than fsync() such as the hypothetical fbarrier(), but seriously, that’s just going to be nibbling around the edges when applications are being as profligate with file write operations as they are today; we’re talking about orders of magnitude of improvement that could be made by more intelligent applications.
And in the meantime, you can get about 99% of the net result of fbarrier() today via this:
void fbarrier(int fd)
{
    pid_t pid = vfork();
    if (pid > 0) return;
    fsync(fd);
    if (pid == 0) _exit(0);
}
Yeah, there will be issues with threaded programs, where we would need to use pthread_create() and pthread_join() instead of vfork() and _exit(), but the point is that doing something with fsync() that won’t impose application latencies really isn’t hard; just throw it in a library, and you’ll barely even notice that it’s there afterwards; and for many applications, you can probably get 10x or 100x the performance improvement simply by *thinking* and trying to avoid needless writes and calculating the cost of doing writes to the filesystem in the first place. Is it really necessary to wake up the disk to update the sqlite database after every single click on the web browser? I don’t think so….
March 17th, 2009 at 10:34 pm
@134: Err, vfork blocks the parent until the “child” exits. (And besides, calling anything other than exec or _exit in the child is undefined.) Your code is exactly equivalent to calling fsync directly. If you mean to use threads or fork, your fbarrier is racy: either 1) the fbarrier child would need to include the rename operation, in which case other processes won’t see the new file until the fsync completes, or 2) you rename before the fsync returns, in which case you still get a zero-length file if the system reboots at an inopportune time.
And even if you could somehow make this fbarrier work in userland, it’d still be a ton of complexity that shouldn’t be needed for simply ensuring atomic replace (which works perfectly on a running system) behaves reasonably after a crash. Furthermore, fsync still pushes up the IO schedule, causes additional seeks, and so on. fsync is a non-solution.
Also, Firefox’s disk usage might be profligate for an SSD, true. The commit interval should be tunable. But as a user of a conventional hard drive, I think the amount of IO is certainly reasonable for what I get in return, and the filesystem should be able to handle that amount of IO (which is still far less than what the drive can deliver) without breaking a sweat. It’s really no more than a moderately busy mailserver would see, after all. Are you claiming exim should only write messages every 30 seconds?
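For reference, option (1) from this comment, where the helper performs the rename itself once fsync() has returned, might be sketched with fork() as follows. The names `replace_async` and `replace_wait` are made up for this sketch. It keeps the caller from blocking, but as noted above, other processes do not see the new file until the fsync completes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start a helper process that flushes fd (an open descriptor for the
 * fully written temporary file) and only then renames tmp_path over
 * dst_path. Returns the helper's pid, or -1 if fork() failed. */
pid_t replace_async(int fd, const char *tmp_path, const char *dst_path)
{
    assert(fd >= 0 && tmp_path != NULL && dst_path != NULL);

    pid_t pid = fork();
    if (pid != 0)
        return pid;             /* parent, or fork failure (-1) */

    /* Child: the inherited descriptor refers to the same file, so
     * fsync() here flushes the data the parent wrote. */
    if (fsync(fd) != 0 || rename(tmp_path, dst_path) != 0)
        _exit(1);
    _exit(0);
}

/* Reap the helper; returns 0 if both fsync() and rename() succeeded. */
int replace_wait(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The application still has to call something like `replace_wait()` eventually, both to reap the child and to learn whether the replacement actually happened.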
March 17th, 2009 at 10:41 pm
@129: On the other hand, a pending rename can wait until the next time the filesystem would commit (in the absence of the rename) and just write the rename record after all the other work is finished. You mention that writes can be dribbled out over 60-120 seconds: why can’t the rename record be written after all that’s done? If I’m wrong here, I’d love to know why.
Daniel,
Well, you can lengthen the commit interval from 5 seconds to something like 180 seconds. That will help, but it doesn’t really solve the problem. File writes which take place at the beginning of the commit interval will have dribbled out to disk by the time the transaction closes, yes. But file writes that take place right before the commit is scheduled to close won’t have had a chance to dribble out. Because of the entangled writes problem, we can’t just hold off the rename transaction. Sooner or later we need to commit the whole shebang, or we will run out of memory; and if the application for some ungodly reason decides that it needs to rewrite files in ~/.kde or ~/.gnome or ~/.firefox every 10-15 seconds without any letup, there will never be a quiescent period when all of the data blocks have been dribbled out to disk.
March 17th, 2009 at 10:47 pm
@133,
Mark, past a certain point, all I can say is, “If you’re so smart, why don’t you try to make a file system that way?” Maybe everyone who has created file systems in the last 30 years of file system history are idiots, and you know better than all of us. OK, why don’t you show us how it’s done? And afterwards, I’ll benchmark your file system against ext4 using a standard suite of file system benchmarks, and we’ll see how you do.
I’ve already explained why an undo log isn’t as simple to implement as you think it is, but maybe I’m wrong. In that case show me the code. Implement a Linux file system that way, and we’ll see how it compares against ext4 for speed.
March 17th, 2009 at 11:13 pm
> @129: I don’t see why a rename that’s visible to a running system necessarily needs to be committed to disk.
Because someone ran an fsync() on the directory, perhaps? Like an editor, for instance, committing a 10 byte file and its corresponding directory entry.
The semantics of fsync() on a directory are the same as on the normal file. Which is:
“fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) where that file resides.”
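As an illustration of the editor case mentioned above, here is a minimal sketch that commits a small file and then its directory entry. The helper name `create_durably` is hypothetical; the calls themselves (open with O_DIRECTORY, openat, fsync) are standard POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Create dir_path/name with the given contents, then flush both the
 * file and the directory that holds its entry. Returns 0 or -1. */
int create_durably(const char *dir_path, const char *name,
                   const void *buf, size_t len)
{
    int rc = -1;

    assert(dir_path != NULL && name != NULL);

    int dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        if (write(fd, buf, len) == (ssize_t)len &&
            fsync(fd) == 0 &&    /* the file's data and inode */
            fsync(dfd) == 0)     /* the directory entry */
            rc = 0;
        close(fd);
    }
    close(dfd);
    return rc;
}
```

A call such as `create_durably("/home/user", "notes.txt", "0123456789", 10)` would correspond to the 10 byte file in the example; the path is only a placeholder.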
March 17th, 2009 at 11:20 pm
@138: I’m talking about the case when that doesn’t happen. We’ve already discussed the directory commit problem.
March 17th, 2009 at 11:37 pm
@132:
Daniel, unfortunately, FUA isn’t really suitable for a barrier operation. It is a flag that you can use on a write request which causes it to “jump the queue” and causes the write to be immediately written to disk, bypassing any kind of reordering that the hard drive might have done for better performance. But it doesn’t allow us to tag a set of writes as happening “before barrier #12” and another set of writes as happening “after barrier #12 and before barrier #13” where the disk drive is free to reorder write requests, so long as writes don’t cross the barriers that we have set up. SCSI TCQ had an optional ordering mode that supported this, but very few products actually supported it, and nothing since has had it.
The Linux block driver folks discussed it in great detail about two years ago, and decided that for the sorts of things the Linux kernel needed, FUA really didn’t do enough to be worth battling the many hard drives that simply implemented FUA wrongly, and SATA controllers which silently dropped the FUA bit without letting the device driver know, etc. Maybe FUA has stabilized enough that we should experiment with it more now in 2009, but it seemed that the SATA experts thought it would merely allow us to implement a flush-style barrier using a single SATA command instead of two SATA commands, which they didn’t think would be a big benefit in the grand scheme of things. I’ll ask the storage folks if anything has changed in the last two years at the Linux Filesystem and Storage Workshop.
March 17th, 2009 at 11:40 pm
@140: Thank you for the informative reply. It’s amazing how close a feature can come to usefulness without ever actually getting there.
March 18th, 2009 at 12:24 am
@137: Ted, your point is well taken of course. I may do just that.
Of course any benchmark has to compare apples to apples. It is similar to the old debate between MySQL and PostgreSQL. The MySQL line used to be: Transactions? We don’t need no stinking transactions. No redo logs, no journals, no undo. MySQL could handle certain workloads very efficiently. However, even completely deprived of the features of modern databases, MySQL couldn’t compete under load with better designed, more full featured alternatives. At some point, algorithmic suitability beats all other considerations.
March 18th, 2009 at 12:25 am
We have already made the mistake of having a single rename() call for two different purposes. The non-ordered version is conceptually simpler, but most applications would want the ordered version since very few can decide that their data is non-precious, and furthermore can handle random corruptions (on data=writeback filesystems at least) caused by a crash. I agree with Bojan that this cannot be satisfactorily fixed without introducing new API, be it fbarrier(), rename2(), some pathconf-based solution, or whatever: even if an updated kernel does ordered rename() for all filesystems by default, portable applications cannot depend on it being so and have to do an extraneous fsync() anyway.
Let’s not make the same mistake with fsync(), by advocating that applications call it sometimes for durability and sometimes for atomicity only. Applications that do not call fsync() in open/write/close/rename right now are comparable to those that do not work correctly on NFS. They are certainly not portable enough, but this is not an urgent issue, and a simple warning in the rename(2) manpage (“On certain otherwise journalled filesystems, rename() is not crash-proof for atomic replacement unless you do a fsync() first. Most applications would want to use rename2() instead.”) is enough. Just let them use the new API when it is available (and have a library to do the necessary fallbacks) and ask them to do nothing before then (users cannot depend on all applications being fixed anyway). Otherwise, when too many applications call fsync() with vague intentions, sometimes wrapped in threading, performance optimization in the kernel would become a nightmare, similar to but much worse than the case if we guarantee that rename() is always ordered.
March 18th, 2009 at 12:41 am
@143: no, there is only one rename. It always does the same thing on any POSIX machine so long as that machine doesn’t crash. The two implementation alternatives under discussion differ only in their behavior on a system crash. It’s absolutely ridiculous to expect application authors to care about that small distinction when that’s really something that should be handled by the filesystem. It’s the filesystem’s job to mimic the state of a running system after an unexpected crash. That’s why we use journaling filesystems in the first place.
Like I said upthread, “preciousness” is a useless criterion for determining what kind of operation should be performed. Instead, one needs to consider whether your file operation needs to be atomic, durable, or both, and choose a set of system calls based on that.
And as ext3 (and now, ext4) demonstrate, you can achieve good performance even with an ordered rename. It’s not the filesystem’s job to punish applications (and thus users) for not adhering to POSIX. For a problem like this, in which the boundary condition is difficult to test and normal rename works fine on a non-crashing system, you know damn well most people are going to use rename and be done with it. If there’s an additional API, that one needs to be the unsafe variant. (Although I maintain that in all real-world scenarios, you won’t see a performance boost by using the unordered variant. Compare the patched and unpatched ext4 under the same workload and show me I’m wrong.)
As for portability: like I said, you know damn well most people are going to use a bare rename, just as they’ve been doing for years. However, for authors who want to cover all their bases, checking pathconf for rename behavior ought to be trivial. I maintain, though, that any journaling filesystem that doesn’t have an ordered rename is a poor choice for any but the most carefully-controlled use.
As for NFS: those difficulties are well-known and are less tractable than just making rename ordered (as Tso has done, thankfully, in ext4).
March 18th, 2009 at 1:01 am
> @144: The two implementation alternatives under discussion differ only in their behavior on a system crash.
Not true. The two implementations differ in the order of commits to disk. Ordered rename will always commit data of the file first, then the directory. Unordered rename will commit in any order. There are good reasons for having out of order commits, as Ted explained many times here.
If you are so worried about what happens on crashes, why are we then allowing delayed allocation at all? Shouldn’t we then say that when the new file gets created in the directory, _any_ data should be committed to disk _before_ the directory? That would preserve a picture “as close as possible” to the system before the crash.
March 18th, 2009 at 8:01 am
@142: Of course any benchmark has to compare apples to apples. It is similar to the old debate between MySQL and PostgreSQL. The MySQL line used to be: Transactions? We don’t need no stinking transactions. No redo logs, no journals, no undo.
Mark, sure, but there are lots of uses of file systems that don’t require or don’t wish to assume that every single file system operation is transactional. Again, I’ll point at Oracle’s attempt to try to convince the market to use an enterprise Oracle DB as a replacement application-level file store for ftp and http servers. It didn’t perform all that well.
So if a file system is implemented using two logs (such that over the long term metadata has to get written to three locations: the undo log, the redo log, and the final location on disk), I have a feeling you may find it challenging to get performance levels to the point where it will be competitive on standard file system benchmarks, such as postmark and SpecFS. And I’m curious to see how it is able to do on tests such as kernel compiles (where I’ve pointed out the output of compile jobs is not precious; after a crash, it’s no big deal to do a make clean; make).
The other thing you may want to consider is that one of the things that file systems are used for is to provide an implementation layer for databases. (Yes, databases perform best on raw devices; but users have consistently resisted using raw devices instead of file systems for management reasons, so in practice raw devices for databases are only used for benchmarking/marketing purposes.) So your file system had better be able to get out of the way when it gets used by postgresql or mysql (or Oracle or DB2) for its database files. And it might be singularly unfortunate if all of your transactional machinery got in the way of applications using sqlite.
You may also have fun with two processes that want to modify an inode at the same time (those two processes might be two processes from a database); in SQL-land, if two processes want to modify a row at the same time, transaction-level locking means one of them gets to wait; for file systems, if you enforce row-level write locking so you get the transactional semantics that most file system users don’t want, watch your performance go down the tubes. And if you don’t do that, welcome to the joys of entangled operations.
Your file system should also be prepared to pass POSIX compliance test suites, including all of the requirements for when mtime and ctime are spec’ed to be updated; atime updates need to be optional, since some government customers will mandate it, although most everyone else will want it off. If you do want to break other POSIX requirements (which are pretty relaxed for file systems as it is), you had better be prepared to justify why you need to do so.
You may also then find it entertaining to discover all of the programming restrictions that come with implementing file systems or databases in kernel-space, where memory can’t be swapped and (temporary) memory allocation failures have to be taken into account because they can happen at any time, as compared to user-space; and application writers and users get cranky if the kernel tries to use and lock down too much memory for things like undo and redo logs (funny thing, application writers and system administrators seem to think that the OS is supposed to be designed to run programs, and the OS is supposed to leave memory for user-space programs to use; humph).
So please, go ahead and try. Maybe you’ll find something that all other file system developers including some really bright engineers at Sun Microsystems, have all missed. Or maybe you’ll find why the Eat My Data presentation has said, “Databases are not file systems, and file systems are not databases.” In any case, welcome to my world.
March 18th, 2009 at 9:33 am
Ted, so basically your answer is that we can only use the page cache for read, not for write. All complexity has to be punted back to user space to cache data for writes for any file we don’t want to have suddenly corrupted.
Doesn’t make much sense. There is a page cache, we damn well should be able to use it for writes and not waste memory.
I read this whole thread and am not convinced by your fsync is not bad arguments especially when you ignore seek overhead. So now your position is ‘well just don’t overwrite files that often. How dare you actually use the filesystem and page cache’. No thanks, I’ll just stick with ext3. I give up.
March 18th, 2009 at 9:37 am
Ted,
first of all thanks for acknowledging that a fix was needed. Yes, Posix doesn’t require it. But then Posix doesn’t require journaling either. A journaled filesystem must live up to the expectation that if the system crashes, I only lose what I did since the last sync. If it doesn’t guarantee that, it may be a good Posix filesystem, but it’s a broken journaled filesystem.
Your patch does already fix the issue, but if it can be done more efficiently (i.e., without requiring an immediate fsync), why not? Unless there are technical reasons why this would be too difficult to implement. Or you’re simply being overworked and can’t find time to do it at the moment.
March 18th, 2009 at 11:06 am
Ted, thanks for the alloc_on_commit option.
http://thread.gmane.org/gmane.comp.file-systems.ext4/12179
March 18th, 2009 at 4:53 pm
@147: Ted, A full blown transactional filesystem is far beyond what I am suggesting. The idea here is twofold:
(1) implement enough metadata undo such that the atomic properties of rename that are guaranteed while the system is running are preserved following an unclean shutdown. As others have commented, that is very much in the spirit of the POSIX specification, not some radical departure.
(2) use metadata undo to eliminate most user visible, cross-process sync dependencies, such that no fsync ever waits longer than the time required to durably commit the data in the file itself, and (~)one redo block.
As I said, high performance databases do not make durable writes to an undo log except in unusual cases. Instead they include undo information in the redo log, and use that to restore durable undo state if they have any.
As a consequence, supporting meta-data undo would not require a separate durable undo log. The only reason why databases have durable undo state is because they support large transactions that can include an effectively unbounded amount of both data and meta data.
Filesystems have no need to log data undo and redo information. That slows them down by a factor of at least two. The only undo information required is meta data undo, which is minimal by comparison. The reason why metadata undo is so valuable is that it removes all user visible dependencies on when the actual metadata blocks (or blocks in any unrelated file) are actually written.
The meta-data undo information indeed does get written to the redo log prior to the filesystem undertaking physical metadata updates. However, the actual metadata block writes can occur in an arbitrary order after that.
Committing a durable meta-data transaction never requires that a meta-data update has completed, or the next checkpoint has passed. Checkpoints can be every five seconds or every five minutes. The only thing that committing a meta-data transaction requires is that the meta-data undo and redo information has made it durably to the redo log.
In the important case we are concerned about, an ordered, but not necessarily durable rename operation, the filesystem can make the appropriate entries in memory and wait an arbitrary time interval to force the redo log to disk.
So when I read about the enormous gyrations that contemporary filesystems go through to update their meta-data blocks in order, I cannot help but think that most of these problems would be avoided if they just recorded meta-data undo information in their redo logs. Unlike a database, separate durable undo state is unnecessary. The additional I/O cost is minimal. The reduction in client latency (for fsync operations in particular) is large, extremely large in some cases.
I don’t know enough about ext4’s internal structures to know whether this can be done without changes to the on disk meta-data format. I suspect, however, that this could be done with no format changes other than additional meta-data undo records added to the journal / redo log.
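The idea of carrying undo information inside the redo log can be illustrated with a generic, textbook-style undo/redo recovery pass. This is a toy in-memory model under loud assumptions: the record layout, `NBLOCKS`, and `recover` are invented for the sketch and say nothing about ext4’s journal format or any particular database.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 8   /* toy "metadata blocks", each holding one int */

enum rec_type { REC_UPDATE, REC_COMMIT };

/* One record in the single (redo) log. */
struct log_rec {
    enum rec_type type;
    int txn;        /* transaction id */
    int block;      /* metadata block index (REC_UPDATE only) */
    int old_val;    /* undo image: value before the update */
    int new_val;    /* redo image: value after the update */
};

/* Did a commit record for txn reach the log? */
static int committed(const struct log_rec *log, size_t n, int txn)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].type == REC_COMMIT && log[i].txn == txn)
            return 1;
    return 0;
}

/* Repair blocks[] after a crash using the n records that reached the log.
 * The blocks themselves may have been written in any order, or not at all. */
void recover(int blocks[NBLOCKS], const struct log_rec *log, size_t n)
{
    /* Redo pass: replay every logged update in log order. */
    for (size_t i = 0; i < n; i++) {
        if (log[i].type != REC_UPDATE)
            continue;
        assert(log[i].block >= 0 && log[i].block < NBLOCKS);
        blocks[log[i].block] = log[i].new_val;
    }

    /* Undo pass: roll back updates of transactions that never
     * committed, newest first, using the undo images in the log. */
    for (size_t i = n; i-- > 0; )
        if (log[i].type == REC_UPDATE && !committed(log, n, log[i].txn))
            blocks[log[i].block] = log[i].old_val;
}
```

In this model the only write that has to be durable before a commit is acknowledged is the log itself, which is the property the comment is arguing for.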
March 19th, 2009 at 3:45 am
> This argument is flawed for two reasons.
Your anti-argument is also flawed for 4 reasons.
First, let’s correctly re-define the original statement: “The application only needs atomicity, but not SEMI-durability” (choose the word you like instead of “semi-”).
You, instead, suggest fsync for “atomicity + SEMI-durability”. So you were fighting the wrong part of the original statement, which is off-topic (durability or semi-durability).
Second, “fsync() really isn’t that expensive…. remember, fsync() doesn’t create extra I/O’s”
It does create EXTRA LATENCY, so it does create problems, which you can’t mask with convenient examples of internet downloads, or by implying that other workloads are stupid.
Third, suggesting an extra thread is a bad idea. You suggest lots of work across thousands of applications, including extra pollution of library dependencies, just to keep your original bad ext4 logic.
Fourth, in any case thousands of older and now slowly-developed applications will never be fixed, as crash bugs are not obvious, and those bugs will silently live forever. So Linux and ext4, not necessarily in that order, will be perceived as much more crash-buggy than today.
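The latency point in the second objection is something anyone can measure on their own workload rather than argue about. A minimal sketch, assuming a hypothetical helper named `time_fsync` and using only standard POSIX calls:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Write len bytes to path and return the number of seconds spent
 * inside fsync(), or -1.0 on error. */
double time_fsync(const char *path, const void *buf, size_t len)
{
    struct timespec t0, t1;

    assert(path != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1.0;
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1.0;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (rc != 0)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

The result depends heavily on the filesystem, the mount options, and what other processes are writing at the time, so a single run proves little; the interesting comparison is an idle system against one with a large copy in progress.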
March 19th, 2009 at 8:08 am
@148: Ted, so basically your answer is that we can only use the page cache for read, not for write. All complexity has to be punted back to user space to cache data for writes for any file we don’t want to have suddenly corrupted.
Karl,
You can use the page cache for writes. It’s the fact that we are using the page cache for writes which was the cause of this problem in the first place. If you are worried about making sure updates aren’t lost after a crash, we have to flush things out when the commit happens; and that’s what the patches do in the case of replace-via-truncate and replace-via-rename. For those two cases, which was what was causing zero-length files, we will force the data blocks out before the commit in order to avoid that particular problem. It basically gives ext4 the same properties as ext3; my point has always been that other file systems aren’t going to give you these semantics, necessarily, and so if you care about portability it’s not wise to depend on these workarounds.
March 19th, 2009 at 8:15 am
@153: It basically gives ext4 the same properties as ext3
No, it doesn’t – data is now flushed at rename() time, not at the commit interval. The distinction is important.
March 19th, 2009 at 8:22 am
@149: Your patch does already fix the issue, but if it can be done more efficiently (i.e., without requiring an immediate fsync), why not? Unless there are technical reasons why this would be too difficult to implement. Or you’re simply being overworked and can’t find time to do it at the moment.
Onan, I don’t think you understand; with ***ext3***, we were doing an implied fsync() of all dirty files before the commit was allowed to proceed. This is what was causing ext3’s fsync() to be slow in the first place; it was a side effect of data=ordered, which required that all newly allocated blocks had to be forced out to disk. We could have optimized ext3 slightly in that we didn’t need to flush blocks that were dirty, but which were not newly allocated (i.e., blocks written by mysql or berkdb via a random-access write), but in practice, most of the really large writes that caused the delay came from large data writes that pushed very large numbers of files out to disk.
The other thing that we could have done, with ext3, would be to bump up the priority of writes caused by commits above that of read requests, so that synchronous writes don’t get starved by reads. This is actually something we’ve done with ext4 already, by bumping up the I/O priority of write requests issued by kjournald. This should fix the “30 second fsync” delays that had been reported against ext3, which were caused by write starvation when there were lots of reads going on.
What we did with ext4 was to introduce a heuristic so that on rename() and close() operations, we would forcibly allocate the blocks in the case where the replace-via-truncate (if the file descriptor has been truncated) or replace-via-rename (if rename overwrites the target inode) pattern is detected, such that when we do a commit, we wait for the data blocks to get flushed to disk. This is effectively an implied fsync() for those files, yes; but that’s what ext3 was doing, except ext3 was doing an implied fsync() for all dirty files before allowing the commit to proceed.
There are some things I can do to improve the patches, yes; at the moment we are not only forcing the allocation in the case of replace-via-rename and replace-via-truncate, but we are also starting the I/O right away. This is good, because the writes start getting done asynchronously, instead of synchronously blocking the commit (which would slow down synchronous commits triggered by fsync). However, the trade-off is that it wakes up the hard drive right away upon seeing the rename() or close() in the replace-via-rename or replace-via-truncate cases, which is not so great in laptop mode. Implementing a scheme so that in laptop mode, we don’t wake up the disk right away, but delay the writes all the way until the commit, is something I won’t have time to do before the 2.6.30 merge window; however, this shortcoming has been amply documented in code comments, and we probably need to do some other things to improve laptop mode anyway, such as selectively being able to ignore fsync()’s.
But in terms of whether it’s better to do it without doing a fsync(), I think people don’t realize that that’s effectively what ext3 was doing, and in order to get bug-for-bug compatibility with ext3’s data=ordered mode, effectively that’s what we’re adding back to ext4.
March 19th, 2009 at 8:44 am
@150:Ted, thanks for the alloc_on_commit option.
http://thread.gmane.org/gmane.comp.file-systems.ext4/12179
Ben, as it turns out, we’re running into some problems implementing alloc_on_commit. See the comments that have come in on that thread since then. I’m not sure how important it will be to implement alloc_on_commit, given that the already back-ported patches in Fedora and Ubuntu solve the zero-length file problem for applications that do replace-via-truncate or replace-via-rename without calling fsync(). What alloc_on_commit does is force allocations (and then the data=ordered mode machinery which is still in ext4 forces an implied fsync before the commit, just like ext3 did before), for all files, not just for replace-via-rename and replace-via-truncate. This becomes an issue for newly written files that are considered “precious” by the user, but where the application did not call fsync().
With new files, if the crash had happened 5 seconds earlier, the newly written file would never have gotten written in the first place, and so it’s much less of an issue for newly written files. So alloc_on_commit was more for people to feel good and get exactly ext3’s behaviour with respect to implied fsyncs before commits, while still getting most of ext4’s benefits such as extents, fast fsck, and (some) delayed allocation. But users who care about this can use the nodelalloc mount option, which removes delayed allocation from the picture altogether; it also has the property that it provides exactly ext3’s data=ordered behaviour in a bug-for-bug compatible fashion, but because there is absolutely no delayed allocation (as opposed to delaying the allocation until the transaction commit), the files won’t be as optimally laid out on disk. This will increase file fragmentation, which in turn will somewhat degrade ext4’s fast fsck times. Of course, data=alloc_on_commit would have degraded ext4’s file fragmentation and fast fsck times as well; just not as much.
So basically, people have two choices; once they have the patches queued for 2.6.30, and which have been backported into the alpha/beta kernels for the upcoming Fedora and Ubuntu kernels already, we will have ext3-compatible “force the data blocks out before the commit” for file contents which are replaced via truncation or rename. This should prevent loss of existing data, and the recommended approach for both ext3 and ext4 is replace-via-rename, since even under ext3, replace-via-truncate did have a race window.
A second choice is that people who are really worried about this can use the mount option “nodelalloc”, which is in ext4 today. This will provide behaviour for newly allocated blocks/files which is exactly the same as ext3’s data=ordered mode today, at the cost of some of ext4’s benefits. There will still be other benefits derived from using ext4, but for people who are paranoid about their systems crashing enough such that this is an issue, and who use applications that are careless about using fsync(), it’s an option which is available to users today.
At some point we may try to solve the data=alloc_on_commit issue, but the problems we’ve run into are fairly difficult to resolve, and it may be that the two alternatives above are enough for most users. But at the very least, at this point it looks highly unlikely we’ll be able to resolve the data=alloc_on_commit implementation issues before the 2.6.30 merge window closes.
March 19th, 2009 at 9:00 am
@151: Ted, A full blown transactional filesystem is far beyond what I am suggesting. The idea here is twofold: (1) implement enough metadata undo such that the atomic properties of rename that are guaranteed while the system is running are preserved following an unclean shutdown. As others have commented, that is very much in the spirit of the POSIX specification, not some radical departure.
Mark,
I think we’ll have to agree to disagree about the “spirit” of the POSIX specification. I believe the spirit of the POSIX specification is to give freedom to file system designers to make file systems that are as fast as possible, and which can approach the speed of writing to raw disks; historically, at the time when POSIX was written, this was considered the ideal for file systems. File systems were expected to provide value-added services, but file systems that got in the way of the hard drive meant that system administrators who spent $$$ on expensive RAID arrays didn’t get the benefits of their expensive storage subsystems, and that made the sysadmins cranky. File system designers also got an earful from database designers about file systems that got in the way of the fastest possible database benchmarks (since those pesky users wanted the management advantages of putting databases on top of filesystems, and refused to use raw devices). Also, historically, file systems of that era really did sync data blocks far less frequently than metadata blocks. This is why there were warnings that close(2) didn’t flush data to disk, and why POSIX essentially told application writers, if you care about data surviving a crash, the only portable thing you can rely upon is fsync(). So as far as I’m concerned, looking at both the legislative intent of the POSIX “founding fathers” and what was common practice at the time, it’s pretty clear what the “spirit of POSIX” when it was first drafted really was.
Of course, times change. Linux is being used on desktops, where the environment is much less controlled, and application writers are writing hundreds of files whenever the desktop starts up (for whatever reason, G*d only knows), and Firefox writers seem to think it’s a good idea to force out to disk 2.5 megabytes of data each time you click on a link (just so that we can remember what URLs had been visited right after a crash for Firefox’s “awesome bar”). Back in the day, 2.5 megabytes was a very large amount of disk writes, and so application writers would only write out such information every 15-30 minutes, since keeping track of which URLs you had visited really wasn’t that critical. Apparently, desktop users also spend hours and hours carefully positioning the location and size of windows, and get really cranky when that information gets lost, so there is also a fundamental disagreement over what counts as “precious files representing real work” and what is merely nice to save. So that’s fine, we can try to provide such semantics, but people need to understand that they will be sacrificing performance when they do; that’s why some of these things will be mount options, since Linux is also being used in the same old traditional server environments supporting enterprise databases.
In any case, this comment is getting long, so I’ll start a separate comment to discuss the technical issues.
March 19th, 2009 at 9:32 am
@151: Ted, A full blown transactional filesystem is far beyond what I am suggesting. The idea here is twofold: (1) implement enough metadata undo such that the atomic properties of rename that are guaranteed while the system is running are preserved following an unclean shutdown.
Mark,
So as I understand it, your proposal is to write the metadata undo information into the redo log, right? OK, what does that imply? Until the data blocks are written to disk, the metadata undo information has to be kept around. But at least for ext3/ext4 the redo log is limited in size; it’s only 128 megs or so, and because we implement physical block journalling (that is, we write the contents of the metadata block after the transaction commits into the log itself), the redo log wraps fairly frequently, which triggers what we call a checkpoint operation. If the data blocks haven’t been written yet, either the undo log information needs to be retained and copied as a part of the checkpoint operation, or the data blocks have to be forced out to disk when the checkpoint takes place. This is not a problem per se, but it does add more complexity than you probably had first considered.
The second problem, which is harder to work around, is what happens if the rename’ed inode is further modified? For example, consider this sequence of file system operations:
* fd = open("FileA.new")
* write(fd)
* close(fd)
* rename("FileA.new", "FileA");
* …
* link("FileA", "FileB");
Now we can’t just back out the rename operation; we have to do something about the subsequent link operation as well. You’ve said that you don’t care about external synchronization; for example, if the server sends a message to another machine, “ok, I’ve written the new data to FileA”, and then the server crashes, and since the data has been delayed, we have to back out the rename() operation, the fact that “FileA” still contains the old data is enough for you; the fact that another machine is under the mistaken impression that the new data was saved, when the data it sent to the server has in fact been lost, is not a problem. (Presumably, if this was important, the server should have used fsync().) But in the case I’ve just shown above, what has happened is that another program, maybe another shell script, has done further file system operations, and so you will need to track dependencies at a sufficiently high level of abstraction that you can back out the link() file system operation as well. (And there are other dependencies that might cause a problem; link() was just one I chose for the sake of argument. The file might be renamed another time, for example, and there might be other higher level semantic dependencies that simply couldn’t be detected by a file system, where the application starts making changes to other files on the assumption that the rename operation won’t be backed out.)
So you might want to think about how you would handle all of these issues, and indeed what you might decide not to handle; for example, maybe if there’s an attempt to link() or rename() such an inode, that’s the point at which you will force an implied fsync() so you don’t have to track the dependency. That still won’t save you from higher order dependencies that might leave an svn working directory inconsistent, for example.
Also consider that while the rename operation can still be “backed out”, the data blocks containing the original file’s contents can’t be reused. What happens if the file system attempts to reallocate those blocks for use by another file? If you’re not going to store data blocks in your undo log, you’ll have to prevent those blocks from being reused. Ext3 and ext4 have machinery to do this, but the machinery assumes we only need to do this until the next commit. If you are going to do rolling commits where you want to allow the data blocks to be delayed for long periods of time, then this machinery becomes much more complicated.
It’s also the case that when the data blocks finally do get written out, you then have to write something into the redo log “revoking” the undo logs, which is the point at which the rename() operation finally is committed. But until this is actually forced to disk as part of the next commit transaction, the rename() isn’t really committed.
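To make the dependency concrete, here is a minimal Python sketch that replays the sequence above in a scratch directory. The function name `rename_then_link` and the file contents are made up for illustration; the point is only that after the link() both names refer to the same new inode, so undoing the rename alone would still leave “FileB” pointing at it.

```python
import os


def rename_then_link(directory):
    """Replay the sequence above; return the inode numbers of FileA and FileB."""
    new_path = os.path.join(directory, "FileA.new")
    a_path = os.path.join(directory, "FileA")
    b_path = os.path.join(directory, "FileB")

    # An existing file holding the old contents.
    with open(a_path, "w") as f:
        f.write("old contents\n")

    # Write the replacement and rename it into place, without any fsync().
    with open(new_path, "w") as f:
        f.write("new contents\n")
    os.rename(new_path, a_path)

    # A later operation, possibly from another program, that depends on the rename.
    os.link(a_path, b_path)

    return os.stat(a_path).st_ino, os.stat(b_path).st_ino
```

The two returned inode numbers are equal, which is exactly the state a metadata-undo scheme would have to untangle after a crash.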
And of course, you would also have to add all of this machinery for truncate, but truncate is even more complicated since sometimes the file isn’t truncated down to zero, but you might have an application that starts with a 120k file, truncates it down to 48k, then writes another 32k which must be allocated, and then it might seek to the 4k offset and write a 4k block. Do you provide undo semantics for the truncate case? If you’re not going to write undo records for data blocks, then if you try to handle this, you might have a file which is only partially modified. So what do you do with truncates? Not handle them at all? Only handle the case where the file is truncated down to zero?
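For reference, the partial-truncate pattern described above looks roughly like this in Python. This is only a sketch: it assumes the 32k write is appended at the end of the truncated file (the description does not say where it lands), and the fill bytes are arbitrary.

```python
import os

KIB = 1024


def partial_truncate_sequence(path):
    """Run the 120k -> 48k -> +32k -> overwrite-at-4k sequence; return final size."""
    # Start with a 120k file.
    with open(path, "wb") as f:
        f.write(b"A" * (120 * KIB))

    with open(path, "r+b") as f:
        f.truncate(48 * KIB)         # truncate down to 48k, not to zero
        f.seek(0, os.SEEK_END)
        f.write(b"B" * (32 * KIB))   # 32k of newly allocated blocks (assumed appended)
        f.seek(4 * KIB)
        f.write(b"C" * (4 * KIB))    # overwrite one 4k block of surviving old data

    return os.path.getsize(path)
```

The result mixes surviving old blocks, an overwritten old block, and newly allocated blocks in one file, which is why an undo scheme that does not log data blocks could leave it only partially modified.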
At the end of the day, this is a huge amount of work, just to allow for applications that don’t want to use fsync(). If you implement it and it’s clean, and I could manage to convince myself that all of this complexity is maintainable and bug-free, I’d certainly consider including such a patch into ext4. It’s a pretty huge stretch to convince me that it’s really worth my time implementing such a feature. I don’t think it would really speed up the file system any, so it’s really more of an “accommodate sloppy application writers who don’t care about portability beyond ext3/ext4” kind of thing.
If you are really serious about wanting to implement an extension to ext4 that does all this, we can certainly talk. I suspect my replace-on-truncate and replace-on-rename workaround patches are probably enough for most users.
March 19th, 2009 at 9:45 am
@152: In any case thousands of older and now slowly-developed applications will never be fixed, as crash bugs are not obvious, and those bugs will silently live forever. So Linux and ext4, not necessarily in that order, will be perceived as much more crash-buggy than today.
szh,
This is why I implemented the replace-via-truncate and replace-via-rename patches. I’m painfully aware that lots of applications as currently written are broken, and they won’t be fixed overnight.
However, I’m a big enough believer in heterogeneous systems that I don’t want to pull a Microsoft and encourage application writers to keep doing things that will lock them into Linux and ext4. The fact that replacing files using rename() without using fsync() is safe against crashes is true of ext3, and with the replace-via-rename patch, it’s true of ext4, but it’s not true for all other file systems, and it’s not true for all other operating systems.
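For application writers who do want the portable behaviour, the replace-via-rename pattern with an explicit fsync() can be sketched as follows in Python. The helper name `replace_file_durably` and the temporary-file prefix are illustrative, and the final directory fsync is an extra step not discussed above; it is included on the assumption that the rename itself should also be durable on POSIX systems.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` (bytes) using write, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's userspace buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to put the data on stable storage
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

    # Assumption: also fsync the directory so the new directory entry is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that `tempfile.mkstemp` creates the file with owner-only permissions, so a real application would probably want to copy the old file’s mode before renaming.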
Now, we have an upcoming Linux Storage and Filesystem workshop, and I can confer with other Linux file system implementers; one of the things that we could decide is that as a Linux policy matter, we should cause replace-via-rename to either imply an fsync or some kind of marker which causes the data blocks to be allocated when the metadata associated with the rename takes place. One of the other things we could decide is that we shouldn’t pay that cost for all renames (since not all renames involve precious files) and so there should be a new API that requests something like a “lightweight fsync”; in that case, some file systems might continue to provide an implied asynchronous fsync for rename, but some file systems might not. But at least then we could provide an official answer which applies for all file systems in Linux, and we can issue a recommendation about the right way to do things.
One of the factors about whether we pursue the second will be the fact that the application writers vastly outnumber us, and they tend to either be not very careful, or slow to update their applications, and so on. Weighing against this is that we would be imposing extra overhead for all file systems and for all workloads, even for renames() that involve files that aren’t particularly precious. Also weighing against this is that fsync() really isn’t that slow on non-ext3 file systems — but that since ext3 is the default, this has influenced application writers in the past year to avoid fsync() like the plague, ever since the Firefox 3.0 debacle.
So there is a much bigger picture here. There’s what ext4 should do (and what ext4 has already done); there’s what Linux should do as a whole; and there’s what application writers should do if they care about portability to non-Linux systems such as Solaris and MacOS, and other legacy Unix systems. It seems like people are mostly annoyed that I dare to give advice to application writers about the third question; but hey, if you want to write applications that only work well on Linux and ext3/ext4, don’t let me stop you.
March 19th, 2009 at 10:04 am
@154: @153: “It basically gives ext4 the same properties as ext3” No, it doesn’t – data is now flushed at rename() time, not at the commit interval. The distinction is important.
Matthew,
With respect to laptop_mode users, that’s correct for now. We start the I/O write at rename() time, although we don’t wait for the I/O to complete. That’s an implementation issue with the replace-via-* patches, not a fundamental architectural issue. To quote from the comments that I’ve inserted in the code:
So it solves the problem with respect to people using unstable video drivers and crazy desktop applications/frameworks that write hundreds of files at desktop startup time. I will eventually replace this with code that only does the allocation, without starting the I/O; I just knew I wouldn’t have time to get this code stable in time for the 2.6.30 merge window, and I wanted to have patches that Ubuntu and Fedora could pick up right away.
I’ll note that other file systems will probably work around this for now via an implied fsync(). That’s what Chris Mason did with btrfs’s patch, and how XFS implemented their replace-via-truncate workaround (they forced an fsync if the file descriptor had previously been truncated). If we implement an fbarrier() operation, I don’t know how many file systems will take a short cut and implement fbarrier as either an fsync() or an async fsync() which blocks the next metadata commit until the fsync completes, and how many file systems will add the extra machinery to delay the I/O until the last possible moment, at the journal commit. I do intend to fix up things to accommodate laptop users who don’t have SSD’s, but it just wasn’t going to get done before the 2.6.30 merge window. (Besides, I have an SSD in my laptop now; (in a Monty Python fake French accent) “it’s very nice”.)
For now, for laptop users w/o SSD’s, if this becomes an issue in terms of the hard drive waking up too often, you can either disable the replace-via-rename and replace-via-truncate workaround patch via the auto_da_alloc=0 mount option or, if you are also worried about your system crashing randomly and losing files that aren’t fsync’ed, use the “nodelalloc” mount option.
Life is full of trade-offs; and I want to provide better ones for users, but I do have limits in terms of only having 24 hours in a day. (And believe it or not, as much as people seem to love to flame me, most people in the file system development community are actually far less sympathetic to application writers than I have been; certainly that’s been true for everyone that I’ve talked to so far; I get asked why I’m being so accommodating, and whether what I’ve done is bad since it acts as an enabler for bad behaviour, much like people who cover up for alcoholics. So it’s hard to find other people to create improvements for what we really do consider to be sloppy application programmers, and I don’t get to work on ext4 full-time.) As folks in the open source world are fond of saying, (high quality) patches are always appreciated….
March 19th, 2009 at 10:05 am
@159: As has been explained several times, it doesn’t matter whether fsync() is fast or slow or guarantees full data consistency or is merely a hint to the OS or whatever. fsync() provides guarantees that many applications don’t need, and suggesting that these applications use fsync() is suggesting that computers suck more power and provide a worse user experience.
Before making any decisions about the behaviour of Linux in the long term, please spend some time talking to application developers about what they want filesystems to provide (and what they explicitly don’t want filesystems to do) – for whatever reason, that doesn’t seem to have happened in the ext4 case and it’s now clear that we have bodies of people with vastly different expectations of what desirable behaviour is.
I don’t think anyone’s upset about you giving advice to application writers on how they should code if they want portability to Solaris or MacOS, but I think you underestimate the extent to which these operating systems are irrelevant to the majority of application developers. People aren’t going to hold off on using useful functionality simply because it’s not present on operating systems that account for a tiny percentage of their users. However, they are going to be upset if filesystem developers are unwilling to provide the behaviour they want – especially when sticking with ext3 and getting that behaviour is an option.
March 19th, 2009 at 10:06 am
> and there’s what application writers should do if they care about portability to non-Linux systems such as Solaris and MacOS, and other legacy Unix systems. It seems like people are mostly annoyed that I dare to give advice to application writers about the third question; but hey, if you want to write applications that only work well on Linux and ext3/ext4, don’t let me stop you.
fsync() is disabled on MacOSX; you need to call fcntl(F_FULLFSYNC) instead. IMHO, fsync() will be disabled on Linux soon too, when it becomes too popular.
BTW: can you (or anybody else) show such a portable application written in POSIX shell? Is it hard to answer?
March 19th, 2009 at 12:18 pm
@158: So as I understand it, your proposal is to write the metadata undo information into the redo log, right?
Yes. As far as checkpoints are concerned, I understand you have to write the necessary data and meta data blocks before reusing a redo log segment. You don’t want to write all outstanding meta-data blocks of course, that would cause the filesystem to stall temporarily, just the sufficiently “old” ones – often referred to as an “incremental” checkpoint.
The second problem, which is harder to work around, is what happens if the rename’ed inode is further modified?, etc…
At some point, you sync the data blocks to disk before proceeding with the next modification. Where that point is is to some extent an implementation issue. Some uses could benefit from handling a considerable series of atomic but not necessarily durable rename replacements on the same file.
If there are external dependencies that people care about, the application indeed needs to use fsync or something like it. “mv”, “cp”, and “ln” in particular should come with a --commit option that makes the operations durable. They should be atomic (by whatever means is most efficient) by default. For portability, that probably means calling something like fbarrier wherever it is available, and fsync otherwise.
From a filesystem point of view, the filesystem should do an implicit fbarrier regardless, so that it never does a non-atomic rename by default. Of course many existing filesystems won’t, hence the need for an fbarrier system call.
I think metadata undo is a net long term win for filesystem design, but if newer filesystems make all renames atomic using some sort of write barrier, and an fbarrier system call is added so that applications are not forced to call fsync to get atomic replacement semantics, that would more than solve the immediate problem.
March 19th, 2009 at 2:49 pm
Ted, your feelings on the Firefox issue are remarkably different to mine, which is interesting.
Firefox crashes. A lot. And it bugs out at times and starts eating a megabyte per second or so until it’s consumed all available memory and gets killed. I restart my computer rarely so the majority of the time when I start Firefox it has to perform crash recovery. When it does so, I want it to come back as near as possible to where I left off.
From this point of view, I *want* it to save its state whenever it does more-or-less anything. 2.5MB does sound like a lot, and perhaps that could be brought down considerably, but I *always* want it to be storing something.
On the other hand, my system as a whole, even running Nvidia’s drivers, basically never crashes (last time it did, I tracked it down to bad RAM), so I only really worry about data loss when the power goes.
Given this, I don’t want Firefox to fsync more than once every half hour, say. In the occasional case that power is lost 29 minutes since the last fsync, and nothing else has forced a flush to disk, I’m willing to lose the last 29 minutes of history. I’m not willing to lose the entire history.
Firefox fsyncing constantly will harm performance, not to the extent of ext3, but it’s inevitable if it saves changes constantly. However, I want it to use the filesystem rather than its own in-memory cache, because the filesystem is still there when the application breaks. Obviously I have no influence on the development of Firefox so whatever I want isn’t going to happen, but this is just to point out that I do want the application to hand off responsibility for the data, without having to ensure that it’s been physically written to disk. (And this behaviour could be achieved using pre-patch ext4, if the application crash recovery is made slightly more complex, by writing to alternating files and allowing for the fact that the new one is either the one you want, in case of an application crash, or likely garbage/zero-length, in case of a system crash. Maybe I should suggest it.)
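The alternating-files idea in the parenthesis above could be sketched like this in Python. Everything here is hypothetical: the slot names `state.0`/`state.1`, the sequence number, and the CRC header are assumptions added so that recovery can tell a good checkpoint from a zero-length or garbage one.

```python
import json
import os
import zlib


def save_state(directory, seq, state):
    """Write checkpoint number `seq` to slot seq % 2, without calling fsync()."""
    payload = json.dumps({"seq": seq, "state": state}).encode()
    header = b"%08x\n" % zlib.crc32(payload)  # checksum line, then the payload
    with open(os.path.join(directory, "state.%d" % (seq % 2)), "wb") as f:
        f.write(header + payload)


def load_state(directory):
    """Return (seq, state) from the newest slot that passes its checksum, or None."""
    best = None
    for slot in (0, 1):
        try:
            with open(os.path.join(directory, "state.%d" % slot), "rb") as f:
                header, _, payload = f.read().partition(b"\n")
            if int(header, 16) != zlib.crc32(payload):
                continue  # checksum mismatch: treat the slot as garbage
            record = json.loads(payload)
        except (OSError, ValueError):
            continue  # missing, zero-length, or unparsable slot
        if best is None or record["seq"] > best[0]:
            best = (record["seq"], record["state"])
    return best
```

After an application crash the newest slot is intact and wins; after a system crash the newest slot may be empty or corrupt, and the loader falls back to the older one.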
March 19th, 2009 at 3:24 pm
Thanks for the detailed update of the two mount option patches.
I would also add that I develop vertical appliances, and every year we have more and more options in that respect since almost everything supports Linux now. What matters to us is not having RMAs due to FS corruption like we had with ext2 (yes, our product goes back that far), and also having maximum throughput performance of the disk.
Furthermore, portability in the future for us may be using virtual appliances. What does POSIX or ext4 guarantee for an fsync inside a virtual machine? Ha! That may be a fine can of worms. I noticed there was documentation about block device requirements for FS’s, but nothing about virtual machines.
March 19th, 2009 at 4:54 pm
@162 Apple did not ‘disable’ fsync. The man page simply has a warning that the disk controller may have a write cache which will not get written to disk if power is removed. The ‘fcntl’ call is used to guarantee write ordering, but there is always a chance that you don’t reach the fcntl() call before power to the drive is removed.
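For what it is worth, an application that wants the strongest flush the platform offers can probe for F_FULLFSYNC at run time. The Python sketch below uses the `fcntl.F_FULLFSYNC` constant, which Python exposes only on platforms that define it (macOS), and falls back to a plain fsync() elsewhere; the helper name `full_sync` is made up.

```python
import fcntl
import os


def full_sync(fd):
    """Flush `fd` using F_FULLFSYNC where the platform defines it, else fsync()."""
    if hasattr(fcntl, "F_FULLFSYNC"):
        # macOS: also asks the drive to flush its own write cache.
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        os.fsync(fd)
```

Whether the drive honours the request is still up to the hardware; this only selects the strongest call the operating system provides.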
March 19th, 2009 at 6:50 pm
The level of discussion generated by the fsync() comments is amazing. Two big articles on slashdot, this site, plus a very long bug report in launchpad. In total this is a huge amount of activity.
I was going to comment on how much I hate the idea of lots of fsync()s, and the reasons why.
However, doesn’t this huge level of discussion tell us something in itself?
March 19th, 2009 at 7:13 pm
@167: If Apple does not issue force unit access or cache flush commands to directly connected SATA and SCSI drives on every fsync, they most certainly have degraded fsync in a manner that is contrary to the intent of the POSIX standard:
If a drive controller has an independent battery backed up cache, that is certainly an adequate reason to claim that a write has made it to persistent storage. Otherwise, any power failure can corrupt the data of anyone and anything that relies on fsync to do what it was intended to do.
Of course it might be nice if operating systems supported an fcntl(fd,F_PARTIAL_SYNC) operation that does what Apple’s fsync does.
March 19th, 2009 at 7:42 pm
Ted, what mount option do you recommend for a server which runs an MTA such as postfix or qmail, which do the right thing wrt POSIX?
The implicit flush on rename seems to penalise well-written apps.
March 19th, 2009 at 8:55 pm
Ted, I have a new proposal that is more compatible with the infrastructure of existing filesystems like ext4. For maximum flexibility, this proposal suggests the filesystem keep multiple inodes when necessary that correspond to one committed revision of a file, and one or more uncommitted revisions, where prior revisions are discarded as soon as any succeeding revision commits.
For simplicity, the number of uncommitted revisions supported can be reduced to one. All revisions except the last are invisible to the user, unless a read only user file handle still refers to one of the prior ones, or the system crashes.
Whenever a rename is done over an existing file, determine the last durable revision of that file (usually the one the directory entry points to) by transitively following an in core rename chain, increase the reference count on that inode, and add an entry to an undo log. If the earliest existing revision is uncommitted, commit that revision to disk prior to completing the rename.
However, do not physically write to the undo log when committing a transaction. Instead write an entry to the redo log, so the filesystem can recover the undo log the same way it recovers other meta data. In the event of a redo log check point, force any dirty undo log entries to disk.
Schedule data for the new revision to be written to disk in the usual manner, after the standard user configurable time delay. When the replacement version completely makes it to disk by being fsync-ed or completely written out, or is deleted, or written to, unlink the old revision you are holding on to and write a completion entry in the undo log.
On recovery, use the redo log to recover the undo log, then use the undo log to build a list of directory entries that need to be reverted to the last durable version. Swap the inodes out, clear the undo log, and you are done. When recovery is finished all unwritten out rename replacements will be restored to the last durable version.
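As a toy model of that recovery step (not real file system code), the revert logic might look like the Python sketch below. The in-memory `directory` dict, the tuple layout of the undo log, and the `durable_inodes` set standing in for completion entries are all assumptions made for illustration.

```python
def recover(directory, undo_log, durable_inodes):
    """Revert directory entries whose replacement inode never became durable.

    directory:      dict mapping name -> inode number, as found after redo-log replay
    undo_log:       list of (name, last_durable_ino, new_ino) tuples
    durable_inodes: set of inode numbers whose data fully reached disk
                    (a stand-in for the completion entries in the proposal)
    """
    for name, last_durable_ino, new_ino in undo_log:
        if directory.get(name) == new_ino and new_ino not in durable_inodes:
            directory[name] = last_durable_ino  # swap the old inode back in
    undo_log.clear()  # recovery done: clear the undo log
    return directory
```

This ignores everything that makes the real problem hard (block reuse, chained renames, links), but it shows the shape of the final pass over the undo log.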
What do you think about that?
March 19th, 2009 at 10:46 pm
As people have repeatedly stated – there’s a HUGE gap between “I absolutely want this data on the disk and you need to write it out before I go on” and “I’m just checkpointing semi-unimportant state – keep the most recent copy around at your leisure, but if I generate a new one before you write the last out, just toss it”.
SSD, laptops, lots of things like the kernel caching, so don’t force app developers to thrash the disk when they don’t want to. Especially if you’re going to ignore the writes they DO want to happen in response!
Now, what would it take to implement rename_on_disk_after_commit()? Because that’s the behavior that’s wanted. The only guarantee it implies is that in the event of a crash, the filename will point either to the old data, or to the new file IF it was completely flushed to disk and is intact. To eliminate edge cases, if there is no entry a on the disk it will perform an explicit fsync of b. It will allow the in-ram file A to differ from the on disk file A.
The most important use case is
open(a); write(a); fsync(a); close(a);
****
open(b);write(b);close(b); rename2(b, a);
open(c);write(c);close(c); rename2(c, a);
…
If this happens between dirty writeouts, b will never touch the disk at all, because it was overwritten by c and not referenced by anything. In fact, d, e and f could also exist transiently with no disk activity whatsoever.
Of course, since this is exactly the semantics people seem to want with rename() now…
Right now, the only way I can see to achieve this with POSIX semantics is
write state.a; fsync(state.a); write state.b; write state.c; write state.d; write state.e; fsync state.e; unlink state.a; unlink state.b ; unlink state.c; unlink state.d ; write state.f ….
That seems rather suboptimal to put it mildly. In firefox’s use case, that’d mean 2.5mb of disk use for every click, guaranteed, that’s deleted every 15 minutes. That’s pretty damned ugly.
I guess you could “optimize” that by using the rename repeatedly on the non-fsynced state file, but still, that’s getting way too deep into writing a filesystem for every application.
March 19th, 2009 at 11:29 pm
@162: BTW: can you (or anybody else) show such a portable application written in POSIX shell? Is it hard to answer?
Volodymyr,
The way you would access fsync() in a shell script is by calling some executable which implements fsync, just as you search for a string using regular expressions by calling out to an external program, “grep”. I wouldn’t recommend shell scripts for “serious” programming efforts; but if you really want to use shell for most of your work, and you need to reliably push a file to disk such that it survives a system crash, I’d either have the shell script call out to perl or python, or if it’s a really quick and dirty effort where I don’t care that much about performance, I might just use a sledgehammer and call /bin/sync.
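A helper of the kind a shell script could call out to might look like the following Python sketch. The function names are made up, and opening the file read-only before calling fsync() is an assumption that holds on Linux but is not something POSIX spells out for every platform.

```python
import os
import sys


def fsync_path(path):
    """Open `path` read-only and fsync() it so its data reaches stable storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def main(argv):
    """fsync every path named on the command line; return a shell exit status."""
    for path in argv[1:]:
        fsync_path(path)
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Saved under some name of your choosing, a script would run it on the temporary file after writing it and before the mv that renames it into place.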
March 19th, 2009 at 11:33 pm
@169: Of course it might be nice if operating systems supported an fcntl(fd,F_PARTIAL_SYNC) operation that does what Apple’s fsync does.
Actually, ext3 by default doesn’t force a cache flush after an fsync, either. In order to enable this, you need to mount ext3 with the mount option “barrier=1”. Ext4 does this by default, but ext3 doesn’t for historical reasons, and Andrew Morton was reluctant to change the default because of the performance hit, so he nixed the patch that would have made “barrier=1” the default in ext3, just as it is for ext4.
March 19th, 2009 at 11:58 pm
@173: sync just calls sync(), and POSIX doesn’t guarantee that sync() will block until the pages are written. So sync doesn’t provide the guarantees we want.
The way you would access fsync() in a shell script is by calling some executable which implements fsync, just as you search for a string using regular expressions by calling out to an external program, “grep”
That’s pretty disingenuous. POSIX specifies grep, but it doesn’t specify an fsync command. The only way to write a shell script that won’t potentially corrupt data is to rely on functionality above and beyond what POSIX provides. This could be in the form of an fsync command, or perl or python – or alternatively it could be in the form of a filesystem that provides guarantees that POSIX doesn’t strictly require of it. Either way, you’re obliged to rely on functionality that’s not strictly portable.
March 20th, 2009 at 8:50 am
@171: Ted, I have a new proposal that is more compatible with the infrastructure of existing filesystems like ext4….
Mark,
It’s certainly easier to implement. If I were going to implement it, I’d simplify it even more by keeping the undo log only in memory, and restrict it so that it would only be used for files where the rename would result in the original file going away (that is, where i_links_count is one). This simplification would mean that there is only one link to the file, which means the directory entry can be identified using merely the directory inode number and the old inode number, and the undo log can be a fixed 12 byte record, consisting of the directory inode number, the old inode number, and the new inode number. All of these records could then be stored as a special journal block type that would always be flushed out at a commit, and normally, if there is a small number of them, could be stored in the commit block itself.
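To make the “fixed 12 byte record” concrete, here is a rough sketch of what such an undo entry might look like. The struct and field names are made up for illustration and are not taken from the ext4 sources; the only assumption is that each of the three inode numbers fits in 32 bits, which is what gets you to 12 bytes.

```c
#include <stdint.h>

/* Hypothetical undo record for a rename that replaced an existing file.
 * Names are illustrative only; nothing here comes from the ext4 code. */
struct rename_undo_rec {
    uint32_t dir_ino;   /* inode number of the directory holding the entry */
    uint32_t old_ino;   /* inode the name pointed to before the rename */
    uint32_t new_ino;   /* inode the name points to after the rename */
};

/* Build one record; since the old file has a link count of one, these
 * three numbers are enough to identify the directory entry to restore. */
struct rename_undo_rec make_undo_rec(uint32_t dir_ino, uint32_t old_ino,
                                     uint32_t new_ino)
{
    struct rename_undo_rec r = { dir_ino, old_ino, new_ino };
    return r;
}
```

Because every record has the same size, a handful of them could be packed back to back into a journal block without any per-record length field.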
This scheme only allows us to avoid needing to forcibly allocate when files are being replaced via a rename, and it does so by introducing some interesting semantics, since it would only back out the rename, and not necessarily any other changes in the file system. (For example, if you rename(src, target), and then do a chown(target, 1); chmod(target, 0755); the chown and chmod would also get discarded. And in the case of rename(src, target); link(target, target2); or rename(src, target); rename(target, target2); life gets even more interesting.) Yes, we can fix these cases by forcing an allocation the moment target gets further modified, but that means that each of the system calls where this might be an issue would require searching the undo log to see if the inode requires special handling.
Is it worth doing? Perhaps; it does allow for allocations to be delayed past a rename, which as I’ve mentioned is good in a number of cases; for example, firefox downloading a file will create a zero-length file named foo.iso, and then download the file as foo.iso.part, and then rename foo.iso.part to foo.iso. Avoiding implied sync of that file at the next commit after the rename keeps the commit time low, and if the downloaded file is going to be deleted very shortly afterwards, we might avoid needing to write the file at all.
I consider it higher priority to rewrite ext4_alloc_da_blocks() to avoid a flush() operation, since there are also lots of broken applications that are doing replace-via-truncate (i.e., open, truncate, write, close). And your solution wouldn’t help those applications, whereas the replace-via-truncate workaround patch does, but at the cost of waking up the hard drive on the close(), as opposed to delaying it until the commit. (Which could be a big deal if laptop_mode has lengthened the commit interval to 30 seconds or more.)
Still, if someone wants to take a crack at implementing it, let me know. It would be great to have more talented ext4 developers join the team, and (high quality) patches are always welcome. (Low quality patches are welcome too, but they might not get acted upon right away.)
March 20th, 2009 at 9:04 am
@170: Ted, What mount option do you recommend for a server which runs an MTA such as postfix or qmail which do the right thing wrt POSIX
Yusuf,
Well, postfix and qmail probably won’t trigger the replace-via-truncate or replace-via-rename workaround patches. But if you want to be very careful, in 2.6.30 you can mount the file system with the “auto_da_alloc=0” option, which will disable both of these workaround patches. 2.6.29 doesn’t have this mount option, but it also doesn’t have these workaround patches, so you’re OK. I don’t think Fedora and Ubuntu grabbed the patch which implemented the “auto_da_alloc=0” mount option, so if you really want it, complain to the relevant distributions; but for postfix and qmail, it really shouldn’t matter.
The replace-via-truncate workaround patch only gets triggered if the file descriptor is truncated down to zero, either via being opened with the flag O_TRUNC, or via the ftruncate(fd, 0) system call, and the replace-via-rename patch is only triggered if a file is renamed in such a way that an existing target file gets unlinked (i.e., rename(src, target) where target exists). I don’t believe qmail or postfix will trigger either of these cases.
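For readers who have not seen the two patterns side by side, here is a minimal sketch in C. The function names and the caller-supplied temporary path are assumptions made for the example, and error handling is kept deliberately simple (for instance, EINTR is not retried). The first function is the truncate pattern that the workaround is aimed at; the second is the rename pattern with the explicit fsync() added, so it does not depend on the workaround at all.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the whole buffer, continuing after short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Replace-via-truncate: the old contents are discarded by O_TRUNC before
 * the new data is written, so a crash in between can leave an empty file. */
int replace_via_truncate(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int rc = write_all(fd, buf, len);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Replace-via-rename with fsync(): the new data is forced out before the
 * rename makes it visible under the final name. */
int replace_via_rename(const char *path, const char *tmp_path,
                       const char *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return rename(tmp_path, path);
}
```

If the new name itself must survive a crash, the containing directory would also need an fsync(); that step is left out here to keep the sketch short.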
March 20th, 2009 at 12:54 pm
I plan to install Mandriva-2009.1 (spring) when it becomes available with the optional EXT4 file system. Since kernel-2.6.30 isn’t available in this distribution, would it be wise to just mount my EXT4 partitions with the “-o nodelalloc” option until I can replace the Mandriva kernel with kernel-2.6.30?
I think the “allocate on replace – via truncate” & the “allocate on replace -via rename” patches scheduled for kernel-2.6.30-rc1 are just what a home user like me will need since I also plan on using KDE4 in Mandriva.
I realize that the (-o nodelalloc ) option will reduce EXT4 performance but I can regain some of that back when I upgrade to patched kernel-2.6.30. I will also have the option to mount EXT4 with the (-o auto_da_alloc=0) option to take full advantage of EXT4 performance whenever I am satisfied Mandriva-2009.1 is stable & I have a clone backup at hand.
March 20th, 2009 at 1:29 pm
I should really RTFS, but is it possible to decide, at fsync() time, that this is a small file and we’d like to keep deferring block allocation, so sync it to the journal? I.e. switching to data=journal for this file? (And could it even be done in advance of block allocation, or is the journal physically indexed?)
Then you’re just adding the data to the journal ahead of the rename record, and not adding any more random I/Os or forcing early block allocation.
You’d have to have an escape for files too big to fit in the journal, but if the typical .gnome files were below the threshold, it would perform quite well.
It might help a lot of fsync() users with small lock files, mail status files, etc.
Even if you had to do the block allocation, deciding to fsync small files to the journal could get you some of that paradoxical data=journal speedup by reducing the random I/O.
March 20th, 2009 at 8:55 pm
This is getting long, and my two comments are a bit outside of the big sync issue. Personally I am with those who say that if you want to be sure, you have to take the sure path and not cross fingers hoping lower levels go beyond the minimum. But anyway I would like to mention them.
First, I wonder why apps just do not do rename(foo, foo.old) then open(foo). They would always have one or two valid versions. Of course, this means handling the case of foo containing wrong data, in which case it must be removed and the app must try again with foo.old, and it also means wasting disk space. But if the data is valuable, it sounds like a good compromise.
Second, for some tasks like saving config, I wonder why not just open files in append mode, and dump new settings. For config load, process all the settings up to the last valid mark, overwriting settings as new values are found. This assumes things are small, self contained, and checkpoint marks are possible. It could even be text based, long.key.for.something=value lines and comment lines with a given string (fixed, date, fixed+date) as checkpoints. When the file reaches a reasonable size (some K), proceed to do the paranoid replacement with a compacted version. IIRC this log approach would even allow multiple processes to share the same conf file.
Or am I totally wrong? Sometimes I think I am dumb, but then I hit things like unchecked writes and reconsider my dumbness and go read the docs or try to talk with someone who knows, to clear up my doubts.
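The load side of the append-only idea above might look roughly like the sketch below. It assumes a text log of key=value lines, with any line starting with a fixed marker string acting as the checkpoint; the marker, the function name, and the 255-byte value limit are all invented for the example. An assignment only counts once a checkpoint line follows it, and a final line with no newline (a torn write) is ignored.

```c
#include <stdio.h>
#include <string.h>

#define CHECKPOINT_MARK "#checkpoint"   /* hypothetical marker string */

/* Look up `key` in an append-only log of "key=value" lines.
 * Returns 1 and fills `out` with the last value that was followed by a
 * checkpoint line; returns 0 (leaving `out` untouched) if there is none. */
int lookup_committed(const char *log, const char *key, char *out, size_t outsz)
{
    char pending[256] = "";
    int have_pending = 0, have_committed = 0;
    size_t keylen = strlen(key);

    while (*log != '\0') {
        const char *eol = strchr(log, '\n');
        if (eol == NULL)
            break;                      /* incomplete last line: skip it */
        size_t linelen = (size_t)(eol - log);

        if (strncmp(log, CHECKPOINT_MARK, strlen(CHECKPOINT_MARK)) == 0) {
            /* Everything seen so far is now considered valid. */
            if (have_pending) {
                snprintf(out, outsz, "%s", pending);
                have_committed = 1;
            }
        } else if (linelen > keylen && strncmp(log, key, keylen) == 0 &&
                   log[keylen] == '=') {
            /* A newer value overwrites the older pending one. */
            size_t vlen = linelen - keylen - 1;
            if (vlen >= sizeof pending)
                vlen = sizeof pending - 1;
            memcpy(pending, log + keylen + 1, vlen);
            pending[vlen] = '\0';
            have_pending = 1;
        }
        log = eol + 1;
    }
    return have_committed;
}
```

A real loader would collect all keys in one pass rather than scanning once per key; the single-key form just keeps the checkpoint rule easy to see.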
March 20th, 2009 at 10:32 pm
@176: Ted, the only concern I have about keeping the undo log strictly in memory (apart from the redo log entries) is avoiding a stall at a redo checkpoint due to the need to commit certain files to disk before wrapping. If that can be avoided with some sort of scheduling mechanism, then no problem. The i_nlink_count restriction looks fine too.
Discarding altered security permissions of discarded version is the right behavior for security reasons. The permissions should go with the data, which is awfully convenient in this case.
I don’t think any searching of a global undo log should be necessary. Instead, one should keep the inode in memory when any undo operation is pending, and either embed or attach the pertinent undo information to the in memory inode structure.
In the case of chained renames, the pertinent in-memory inode structures should be transitively linked together in a rename chain. When the second rename occurs, the filesystem should look at the modification times of the two original entries in the chain, and decide whether any intermediate revisions were renamed in a time range that makes them worth finishing (i.e. an intermediate revision may be about to finish write out, while the start of write out for the latest revision may be thirty seconds away). Of course as soon as the data of any intermediate revision is ready to commit, the prior revision is discarded.
The discard rules should allow there to be at most three viable revisions in the rename chain at any time, one durable version, one version that has writeout in progress, and one revision with writeout scheduled in the future. Any other intermediate versions should be able to be immediately discarded, a circumstance that might arise if some application does rename replacements of a file faster than each revision would otherwise be written out (i.e. more often than every thirty seconds or so).
March 20th, 2009 at 10:52 pm
@180: First, I wonder why apps just do not do rename(foo, foo.old) then open(foo).
Applications that do that reliably currently have to write(foo.tmp), sync(foo), rename(foo, foo.old), rename(foo.tmp, foo). As long as foo is sufficiently old, no unusual overhead. The problem is all the clutter from the old versions, that you can only efficiently get rid of with some sort of background thread that calls sync(foo), remove(foo.old) a minute or so later, after writeout for foo is likely to have completed. You also have to have code to tell whether the current version you have of foo is any good or not.
March 20th, 2009 at 11:45 pm
@182: Butler, the first idea I talked about is “foo.old stays”. Please notice the “always” near “one or two valid versions”, no need of threads to remove anything. That is why I am asking, the approach is about keeping two copies of the file, not the one mentioned over and over that tries to keep only one (even if for a limited time, two exist), or the wrong ones that assume systems have instant save and fsync is just to make the work harder.
To clarify: rename(foo, foo.old), open(foo), write(foo), close(foo). From disc point of view, we have: nothing reached disk (nothing changed), data on disk but not metadata (old config still via foo), metadata but not data (app will have to resort to foo.old), both on disk (new config in place, and we have a backup with previous value if saves are only done on value changes, and not always). Or at least if I remember correctly how systems worked, it was some time ago. The question is if this will work or I forgot a vital detail, which I would be pretty happy to know about.
March 21st, 2009 at 1:57 am
@183: Romero, if the application can reliably tell that foo is a bad version (e.g. foo has an internal checksum or something), then it can always rely on foo.old in a pinch. Of course all readers of foo need the same handling, especially if they might try to access foo while you are rewriting it.
The only other thing you need is an fsync on foo prior to renaming it as foo.old. Otherwise you have no guarantee that the old version you have is any good. This is not a problem (no disk i/o required) as long as the old version is actually old, which is usually the case. The performance problem in the procedure discussed earlier in this thread is due to an fsync on the new version, which requires waiting until the new version makes it to disk.
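A minimal sketch of that procedure, assuming Linux-like behaviour where fsync() is accepted on a descriptor opened read-only (that is an assumption, not something POSIX promises everywhere). The function name is made up, and the new version is deliberately not fsync()ed, which is the whole point of the scheme: readers must be able to detect a bad foo and fall back to foo.old.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Save a new version of `path`, keeping the previous one as `old_path`.
 * Returns 0 on success, -1 on error. */
int save_keeping_old(const char *path, const char *old_path,
                     const char *buf, size_t len)
{
    /* Make sure the current version is on disk before it becomes the
     * backup; this costs no I/O if it was written out long ago. */
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        int rc = fsync(fd);
        close(fd);
        if (rc < 0)
            return -1;
        if (rename(path, old_path) < 0)
            return -1;
    } else if (errno != ENOENT) {
        return -1;
    }

    /* Write the new version under the original name, without fsync(). */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            close(fd);
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return close(fd);
}
```

Between the rename and the create there is a short window where foo does not exist at all, so every reader needs the same fall-back-to-foo.old handling mentioned above.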
March 21st, 2009 at 4:07 pm
@185: So fsyncs are still required, fine, but unless the (re)write operations are constant, the fsyncs will not slow anything down or wake up the disk; nice to hear. Then it seems an improvement over the other method in terms of program and system responsiveness.
No comments on the append method? Now that you mention shared files, I found one: compacting will require exclusive lock so no other process writes in the big file that is being replaced with a smaller version.
March 21st, 2009 at 4:31 pm
@178
Ubuntu and Red Hat have both backported these patches against 2.6.29 and will be shipping their next OS releases with them. Are you sure Mandriva isn’t using these patches as well? I’d be surprised if any OS geared at desktop use didn’t include them by default…
Good luck!
March 21st, 2009 at 9:01 pm
POSIX fsync(2) really doesn’t do what application programmers are looking for. App programmers want something that works reliably across as many platforms as possible, the POSIX man page is quite explicit:
http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html
The rationale even says, “If _POSIX_SYNCHRONIZED_IO is not defined * * *It is explicitly intended that a null implementation is permitted.” In fact what is required is fdatasync(2), but that’s an optional (SIO) function. (Full synchronized IO is obviously way over the top here.)
In reality, however, the app programming technique being described is used primarily to ensure integrity of a set of files on a *running* system in the presence of multi-processing. The fundamental technique is:
1) locked write of a temporary file (foo.tmp). The lock guarantees only one program updates the file at once.
2) delete the file (foo) and rename foo.tmp as foo. At this point either foo.tmp or foo will always exist and be a complete file. An app can always find a complete file to open.
The paradigm is totally portable, but the implementation depends on the OS. On UNIX and Win32 (not just POSIX) it’s trivial. E.g. (all UNIX):
newfd = open("foo.tmp", O_WRONLY | O_CREAT, 0444);
… do the work …
close(newfd);
unlink("foo");
link("foo.tmp", "foo");
unlink("foo.tmp");
So far as I know that works *everywhere*. rename() (or ReplaceFile on Windows2000+) avoids the requirement for an app opening “foo” to spin on ENOENT.
This approach did *not* work over a crash before journalled file systems. After a crash the file system was toast. If the kernel was changing a directory (rename/link/unlink) when the crash happened then all bets were off for everything in the directory.
It’s just a happy accident that the techniques which ensure integrity of the file tree on a running system will (obviously) also work if the file system stores a single sequential journal.
I can’t see any way that app programmers will start introducing extra, file system specific, calls – fsync or whatever – just to avoid data loss on crash on one or two file systems. App programmers weren’t prepared to do this in the old pre-journal days and users didn’t demand this level of reliability over a system crash. In any case, app programmers have an easy response – if you have problems with ext4 and system crashes either stop the system crashing or stop using ext4!
Still the point is moot if the rename fix works in all the possible implementations of the above paradigm. If the problem goes away app programmers aren’t going to start adding fsync calls just in case the fix gets removed, or even because it’s ‘better’ according to some FS developers.
March 23rd, 2009 at 6:16 am
Ted,
Thanks for your post, it was interesting to read. I have one question though: How cheap can fsync() be for a newly created file?
Does a POSIX-compliant implementation of fsync() really have to write to the disk before returning, or can it mark the new inode with a special flag, which basically says to flush the data associated with this inode before writing the inode itself to the disk? It seems to me that the second implementation is possible, because unless the inode is written to the disk, there is no way to access the data in the case of a crash. So, the open-write-fsync-close-rename sequence can be performed without touching the hard drive at all, and only bdflush (or fsync on the directory that contains this new file) will perform the actual writing of data to the disk.
March 23rd, 2009 at 7:35 pm
How do I disable fsync?
Background:
I have a netbook, which translates to “I have a broken SSD”; in practice every 20-30 seconds the machine will lock up with the disk light on. If I disable fsync, the filesystem will only wait for the broken SSD firmware to finish whatever it’s doing /in the background/, but applications will never lock up, because fsync returns immediately.
Right?
Or is there an even better hack?
March 23rd, 2009 at 9:04 pm
Hearing the complaining about fsync latency when used in a straightforward manner in a main thread, I started thinking up an async fsync library using pthreads (but hiding that from the API), even some skeleton writing.
After being at this for a while, I realized that oh yeah, silly geezer me, nowadays we have the AIO interfaces. Namely, aio_fsync(). Which does pretty much just what I was doing (just not bunching up several fsync calls if wanted). And perhaps having a bit more generic = klunkier API. Some minimal wrapping could perhaps make it more friendly for casual users.
Aand then the fun hit me when I tried to find out how well-supported this actually is on Linux. Best I can tell, the AIO APIs work only for O_DIRECT stuff, and there’s some other impedance-matching necessary, and not much news on the PAIOL project recently, trying to offer a POSIX API over the kernel stuff. I found notions that earlier, glibc/librt actually did go the thread route (slower but works), but that nowadays it uses the Linux AIO syscalls which don’t allow for as much. Or not. It’s really unclear putting things together from possibly obsolete manpages and the old google hits on this matter… (Hmh, apparently my Ubuntu glibc source package includes an rt/aio_fsync.c which is a stub, and a sysdeps/pthread/aio_fsync.c which is a pthreads based implementation. I wonder what’s actually being used.)
I’ll probably try out if my glibc’s aio_fsync at least claims to work in the simplest of test situations, but checking if it actually does anything, I shan’t bother to find out how to do that anymore. Meanwhile, does anyone have a proper whole picture on the matter?
March 23rd, 2009 at 10:16 pm
Okay, so, I tried (on my Ubuntu Intrepid with libc6-2.8~20080505-0ubuntu9 and its librt.so.1 -> librt-2.8.90.so) a simple aio_fsync experiment (couldn’t bother with callbacks or signal notification, just polling). My dumb and yes, otherwise buggy test program (I didn’t check non-aio related errors and all that) is at http://mjr.iki.fi/software/aiotest.c
It does seem that the posix thread code is used at least for this aio call, and it does seem to work – at least it took a while for the queries to aio_error to return something else than EINPROGRESS, and error codes indicate that all went well. Also, I recalled that GNU shuns man pages (sigh) so went to glibc documentation, which sure enough tells me that the ops can be thread-based or use the kernel’s facilities if available, but the doc I found doesn’t get into more specifics. I shan’t peruse the source or perform experiments to see if they actually use the (limited) kernel facilities for some operations or just thread everything. I suspect it just threads everything to provide the POSIX semantics. Which is just fine for this purpose.
I still don’t have or claim to have a full picture of Linux/glibc AIO, and finding any solid information has been rather a pain (feel free to further enlighten me), but the upshot is:
You don’t have to code thread code yourself to get an async fsync on GNU/Linux systems. It does take some initialization code (not too much, see my sample, but recall what I said about a light wrapper perhaps being nice). You can poll the interface, order a signal to be delivered upon completion, or even give a callback function (the callback will be executed in another thread, which must be taken into account, but the thread creation is pretty transparent) so you are able to do arbitrary finishing touches that want to wait for the operation to complete.
Like, you know, scheduling that close() and rename() to be automatically called after the aio_fsync() has gone through. (Maybe a light wrapper that did the whole aio_fsync/close(/rename) shebang could be nice too… I might even consider that, and perhaps some other light utility wrapping, but no promises, I’m notoriously unmotivated and lacking follow-through.)
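For anyone who wants to try the polling variant described above, a minimal sketch follows. It uses only the POSIX AIO calls named in the comment (aio_fsync, aio_error, plus aio_return to collect the result); the two wrapper function names and the 1 ms polling interval are my own choices for the example. On older glibc versions you may need to link with -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Queue an asynchronous fsync on fd. The control block must stay valid
 * until wait_async_fsync() has returned. Returns 0 if queued, -1 if not. */
int start_async_fsync(int fd, struct aiocb *cb)
{
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_sigevent.sigev_notify = SIGEV_NONE;   /* no signal, no callback */
    return aio_fsync(O_SYNC, cb);
}

/* Poll until the queued fsync finishes. Returns 0 on success, or -1 with
 * errno set to the error the fsync itself reported. */
int wait_async_fsync(struct aiocb *cb)
{
    struct timespec nap = { 0, 1000000 };         /* 1 ms between polls */
    int err;

    while ((err = aio_error(cb)) == EINPROGRESS)
        nanosleep(&nap, NULL);
    if (err != 0) {
        if (err > 0)
            errno = err;
        return -1;
    }
    return (int)aio_return(cb);
}
```

In a real program the caller would do other work between the two calls instead of sleeping in a loop, and would only close() and rename() the file once the wait has reported success.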
(All this may be self-evident to many of you, and I only started to talk about this because nobody else did even considering the direct relevance, and calls for such functionality. Hopefully this has been not a complete waste of time and somebody actually learned something. Well, I did, anyway.)
March 23rd, 2009 at 10:52 pm
Oh yeah, I neglected to mention that yes, the created thread indeed called fsync() which was what took its while, as monitored by strace -f. (Sorry for spamming the blog with consecutive comments.)
March 24th, 2009 at 2:18 pm
Someone post a way to disable fsync, already. If not for fixing broken SSDs or emulating SAN performance on UPSed computers, then just to allow people to benchmark it
April 4th, 2009 at 12:46 am
It’s sad to see the disconnect between kernel developers and application developers.
The application developer says “I want to do foo, but it’s slow. Make it fast.”
The kernel developer says “If I make foo fast, bar will be very slow, and can cause other applications that do foo to be EXTREMELY slow some of the time.”
The application developer says “I don’t care about them. I care about me. Make foo fast!”
The kernel developer says “You *DO* care. To the system, you’re just as likely to be the ‘other application’ whose foo is slow”
At which point the application developer says “duwhat? How can that be? Mine is the only process of any importance running, because this system is a DESKTOP”.
The kernel developer points out that ps shows other important processes running, and that the kernel isn’t just for desktops anyway.
The application developer claims that the user doesn’t care about any of those other processes and that the kernel SHOULD be just for the desktop as only boring people do all that nasty old server stuff.
Of course I’m probably slandering many fine application developers, but as I watch the bloat in both the kernel and userspace, I find myself too frustrated to care.
For those who doubt that the bloat is enough to matter, here’s an experiment: try installing a modern distribution on a 100 MHz / 8 MB / 100 MB system. For that matter, even try building a decent x11 system using LFS or something. You’ll have to stick to ancient xlib-only applications, since anything else will drag in more dependencies than you have rootfs space…
Now, as a comparison, try installing a 10 year old distribution on a modern “hot desktop” box. Assuming you can get past hardware assumptions and changes in standards, I suspect you’ll have one of the fastest user experiences of your life. No, it doesn’t “do much”, and it’s certainly not as pretty, but I use my computer to get work (and play) done, not to have fun with skins.
It’s not that I like tiny systems; it’s that if the code is written efficiently enough to run on them, it NEVER EVER runs slow on a more modern system.
How does this apply to the fsync() discussion? Well, try to run all these modern wonderapps on an old ext2 filesystem and see how much data loss there is. It’s not the fault of ext2; it was used for a VERY long time and plenty of apps demonstrated that it could be used without these kinds of data loss, and without being pathetically slow (even, relatively speaking, on a 386.)
Bottom line. If the older program gets work done in time X without data loss, and the newer program takes 10X yet loses data “because fsync would make things slow”, I’ve got to wonder why I should move to the new program.
April 5th, 2009 at 1:22 am
A couple of other methods not mentioned are available for storing data in an atomic manner.
1) use binary random access files and write one block at a time keeping the file metadata constant (ie. it doesn’t get bigger, like RRD)
2) create new larger individual files and store the names using the method of 1, this bypasses most of the atomicity issues if there are checksums of the individual files.
A single block write will always be atomic.
Files either have correct checksum or you revert to an older version.
April 18th, 2009 at 5:51 am
Ted,
I read this blog post (and the eat my data presentation) and I learned something new. As a developer I was not even aware of this problem, so it has been worth reading.
However with the fsync() you only solved the problem for C developers. Since I am a C++ and Java developer, I’m in the situation where I know the problem but not the solution.
So I’m asking you to post the correct way to implement the open-write-fsync-close-rename semantics in other languages, especially C++ and Java.
I think this will be useful to many developers.
April 22nd, 2009 at 11:09 am
I did some tests of MySQL’s insert performance on ZFS and ran into a big performance problem; I’m not sure what the cause is.
Environment
2 Intel X5560 (8 core), 12GB RAM, 7 SLC SSDs (Intel).
A Java client runs 8 threads concurrently inserting into one InnoDB table:
~600 qps when sync_binlog=1 & innodb_flush_log_at_trx_commit=1
~600 qps when sync_binlog=10 & innodb_flush_log_at_trx_commit=1
~600 qps when sync_binlog=0 & innodb_flush_log_at_trx_commit=1
~900 qps when sync_binlog=1 & innodb_flush_log_at_trx_commit=0
~5500 qps when sync_binlog=10 & innodb_flush_log_at_trx_commit=0
~15000 qps when sync_binlog=0 & innodb_flush_log_at_trx_commit=0
~800 qps when sync_binlog=1 & innodb_flush_log_at_trx_commit=2
~4500 qps when sync_binlog=10 & innodb_flush_log_at_trx_commit=2
~13000 qps when sync_binlog=0 & innodb_flush_log_at_trx_commit=2
1 ssd as 1 zpool
~350 qps when sync_binlog=1 & innodb_flush_log_at_trx_commit=1
~400 qps when sync_binlog=10 & innodb_flush_log_at_trx_commit=1
~400 qps when sync_binlog=0 & innodb_flush_log_at_trx_commit=1
~900 qps when sync_binlog=1 & innodb_flush_log_at_trx_commit=0
~5300 qps when sync_binlog=10 & innodb_flush_log_at_trx_commit=0
~15000 qps when sync_binlog=0 & innodb_flush_log_at_trx_commit=0
~750 qps when sync_binlog=1 & innodb_flush_log_at_trx_commit=2
~4500 qps when sync_binlog=10 & innodb_flush_log_at_trx_commit=2
~13000 qps when sync_binlog=0 & innodb_flush_log_at_trx_commit=2
1 ssd as ufs:
~1500 qps when sync_binlog=1 & innodb_flush_log_at_trx_commit=1
~2100 qps when sync_binlog=10 & innodb_flush_log_at_trx_commit=1
~2100 qps when sync_binlog=0 & innodb_flush_log_at_trx_commit=1
~4000 qps when sync_binlog=1 & innodb_flush_log_at_trx_commit=0
~13500 qps when sync_binlog=10 & innodb_flush_log_at_trx_commit=0
~19500 qps when sync_binlog=0 & innodb_flush_log_at_trx_commit=0
~4000 qps when sync_binlog=1 & innodb_flush_log_at_trx_commit=2
~13000 qps when sync_binlog=10 & innodb_flush_log_at_trx_commit=2
~18000 qps when sync_binlog=0 & innodb_flush_log_at_trx_commit=2
I can’t believe this result!
When sync_binlog=1 & innodb_flush_log_at_trx_commit=1, qps is too…
And I collected some stats while qps < 1000:
[root@ssd /data/mysqldata3]#truss -c -p 13968
^C
syscall          seconds   calls  errors
read                .649   90816   30265
write               .770   57157
open                .000       4
close               .000       4
time                .368   83358
lseek               .000      66
fdsync             2.250   80699
fcntl               .268   60530
lwp_park            .210   28842
lwp_unpark          .198   28842
yield               .000      47
pread               .025     250
pwrite              .857   53880
pollsys             .005     603
                --------  ------    ----
sys totals:        5.605  485098   30265
usr time:          5.519
elapsed:          61.520
P.S.: 13968 is the mysqld process’s PID.
[root@ssd /data/mysqldata3]#vmstat 1
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s1 s2 s3 s5 in sy cs us sy id
0 0 0 575836 1388816 1 18 0 0 0 0 1 3 17 17 17 1226 4185 2008 1 0 99
0 0 0 211652 5620432 21 99 0 0 0 0 0 0 461 470 473 4207 8834 16311 1 1 99
0 0 0 211496 5620260 0 26 0 0 0 0 0 0 469 468 467 4168 8789 16328 1 1 99
0 0 0 211496 5620164 0 28 0 0 0 0 0 0 504 504 505 4471 9443 17606 1 1 98
0 0 0 211496 5620100 0 20 0 0 0 0 0 0 504 505 505 4527 9525 17618 1 1 99
0 0 0 211496 5620004 0 20 0 0 0 0 0 0 507 506 505 4491 9494 17630 1 1 98
0 0 0 211496 5619940 0 12 0 0 0 0 0 0 507 508 509 4512 9497 17743 1 1 98
0 0 0 211496 5619876 0 24 0 0 0 0 0 0 504 502 503 4370 9486 17650 1 1 98
0 0 0 211488 5619804 0 12 0 0 0 0 0 0 508 509 508 4341 9636 17853 0 1 99
^C
[root@ssd /data/mysqldata3]#zpool iostat data 1
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
data 141G 37.9G 4 51 144K 3.15M
data 141G 37.9G 1 1.50K 11.9K 6.06M
data 141G 37.9G 0 1.37K 0 5.48M
data 141G 37.9G 0 1.49K 0 5.98M
data 141G 37.9G 214 1.45K 5.22M 7.27M
data 141G 37.9G 0 1.37K 0 5.48M
data 141G 37.9G 0 1.39K 0 5.58M
data 141G 37.9G 0 1.48K 0 5.92M
data 141G 37.9G 51 1.50K 1.98M 6.06M
data 141G 37.9G 0 2.09K 0 23.7M
data 141G 37.9G 0 1.38K 0 5.52M
data 141G 37.9G 0 1.37K 7.92K 6.09M
ZFS Conf detail:
[root@ssd /]#zpool status
pool: data
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
data ONLINE 0 0 0
c0t4d0 ONLINE 0 0 0
c0t5d0 ONLINE 0 0 0
c0t7d0 ONLINE 0 0 0
c0t8d0 ONLINE 0 0 0
c0t9d0 ONLINE 0 0 0
c0t10d0 ONLINE 0 0 0
errors: No known data errors
pool: rpool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c0t3d0s0 ONLINE 0 0 0
errors: No known data errors
April 29th, 2009 at 5:40 am
Where does this leave the shell script that wants to ensure “date >foo” has really hit the disc? There’s no fsync(1) or fdatasync(1), only sync(8) which seems a mite heavy-handed.
“exec >foo && date && fsync && >&-” would work if fsync(1) used fd 1 by default. Except it doesn’t cover the “syncing the containing directory” case. It would seem a shame if shell scripts are now a poor cousin.
April 29th, 2009 at 5:07 pm
@199: Ralph,
It means that a shell script programmer needs to run a helper program, that calls fsync(2), just like it needs to do anything that isn’t implemented by the shell itself. FreeBSD has a fsync(1) command-line program which is in /usr/bin/fsync. Linux doesn’t, but it certainly wouldn’t be hard to write such a program. You could also easily call out to perl or python, both of which have support for fsync.
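As a rough idea of how little code such a helper needs, here is a sketch of its core. The function name is made up, and it assumes (as is the case on Linux) that fsync() works on a descriptor opened read-only, and that a directory can be opened the same way so that the containing directory can be synced too.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Open the named file (or directory) and fsync it.
 * Returns 0 on success, -1 on error with errno set by the failing call. */
int fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

Wrapping this in a main() that loops over its arguments and exits non-zero on the first failure would give a shell script something to call after “date >foo”, passing both foo and its directory.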
April 29th, 2009 at 9:53 pm
The idea of calling fsync(2) *after* the original file descriptor has been closed only works if “_POSIX_SYNCHRONIZED_IO is defined” (to quote from the opengroup man page: http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html).
Otherwise (!_POSIX_SYNCHRONIZED_IO) the operation:
fsync group-but-not-me-readable
The file created on success cannot be opened for either read or write by the script.
While this may seem obscure it is an extreme variant of the standard shell script multi-process locking technique. In that technique the strict sequencing of file creation (of a lock file) and file writing (of to-be-updated file) is essential.
The shell script programmer can do the classic sync;sync;sync trick, but in my experience these scripts do not. I suspect shell programmers and C programmers alike have always assumed implicitly that close(2) behaves like an ANSI-C sequence point with respect to other open(2) and close(2) operations. This may be wrong, but it is a dangerous assumption to violate particularly as the violation is only detectable under a hard system crash.
April 29th, 2009 at 9:58 pm
Entertaining. The Blog Code deleted a middle part of my previous post. After “Otherwise (!_POSIX_SYNCHRONIZED_IO) the operation:” should have been a line about executing an fsync on a read-only file “foo”. Unfortunately I expressed this with shell input redirection – presumably the character in question is not permitted… I need to rewrite the whole thing, the previous comment makes no sense as is.
April 30th, 2009 at 6:56 am
> It means that a shell script programmer needs to run a helper program, that calls fsync(2), just like it needs to do anything that isn’t implemented by the shell itself.
I know a much easier and cheaper way to fix all these problems in all scripts and/or programs written in any language by any programmer.
April 30th, 2009 at 12:24 pm
@201, @202: John,
The blog supports a limited HTML subset. So if you want to use an angle bracket, you need to quote it, HTML-style, i.e., “&gt;” and “&lt;”.
All modern systems support fsync, and so they define _POSIX_SYNCHRONIZED_IO. The “fsync() only works if _POSIX_SYNCHRONIZED_IO” statement in the standard was necessary two decades ago, when there were ancient Unix systems that didn’t support fsync(). The same is true for being able to stop a program using control-Z; POSIX job control was not guaranteed, and on ancient AT&T System V Unix systems, control-Z didn’t work because neither POSIX nor BSD job control existed on those systems. But to say that programmers can’t count on fsync() because it’s an optional part of the specification is to misunderstand the historical context. Once upon a time it was optional; these days, no modern system would stand a chance in the marketplace if it didn’t implement that “optional” part of the spec. And indeed, all modern systems do support fsync().
Also, ANSI-C doesn’t define open() and close(). It does define fopen() and fclose(), but it is unspecified what happens if the system crashes. All talk of “sequence points” only makes sense in the context of the fact that standard I/O is buffered, and until a stdio FILE handle is flushed out via fflush() or fclose(), another process won’t see it. But ANSI-C is very careful not to say what happens if the system crashes. The concept of what happens on a file handle flush and what happens when the system crashes are quite different.
April 30th, 2009 at 7:27 pm
Indeed I would expect all the systems that support ext4 to also support fsync(2), and to define _POSIX_FSYNC, but I was talking about support for _POSIX_SYNCHRONIZED_IO, which, if implemented, modifies the POSIX fsync behavior. The functionality is optional; according to the Open Group man page it was added in Issue 6, which is dated 2004, whereas fsync was added in Issue 3.
As for the comparison with _POSIX_JOB_CONTROL, well, to quote the document “_POSIX_JOB_CONTROL shall have a value greater than zero” – somewhat different from “the system may support one or more options…”
Still, it may be reasonable to say that every implementation that supports ext4 also supports SIO – I don’t know. SIO makes the whole discussion moot because an app written to use SIO will be using open(foo, O_DSYNC/O_SYNC/O_RSYNC) which, I believe, obviates any need to use fsync(2).
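As a rough illustration of the synchronized I/O style described here, the sketch below opens a file with O_DSYNC, so each write() is meant to return only once the data has reached the device. The helper name write_synchronously is hypothetical. It does not sync the containing directory, and whether this style removes every need for fsync(2) is the belief stated above, not something the sketch demonstrates.

```python
import os


def write_synchronously(path, data):
    """Write bytes through a descriptor opened with O_DSYNC (illustrative)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC
    fd = os.open(path, flags, 0o644)
    try:
        # os.write() may write fewer bytes than requested, so loop.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```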
My point about sequence points was intended to clarify by analogy – I naively assumed everyone understood the role of sequence points in the C language. Well, I was wrong… They’re nothing to do with the library. They exist in the language to allow compiler implementers to arbitrarily order operations *between* sequence points without invalidating compiler users’ expectations that operations happen in strict order.
The deleted part of my post would have made this clearer. I raised the point that the suggested shell sequence of an fsync on a new file descriptor rather than the one originally used to write the data is not the same, both because of the potential absence of _POSIX_SYNCHRONIZED_IO and because it may be impossible to open a file descriptor on the file in question.
For example:
umask 757
echo "secret" >foo
fsync <foo
fails because “foo” is not readable by the shell script. Yes, I know that there are ways round this because the script has its fingers on the fildes and need not release this, but consider the example where the file is created within a subprocess.
The problem is complicated by the variety of techniques shell and application programmers use to effect inter-process synchronization through the file system. Another example is a semaphore file – the file might disappear before the process that creates it can synchronize it. I think you will see that this is more serious because there is now some unsynchronizable data in the kernel surrounded by synchronizable but still reorderable directory updates.
Having said all that, though, I think these questions are moot – app and shell programmers aren’t going to rewrite all their code because of ext4. Anyway, the scenario is that of a hard system crash in the middle of a critical operation. As the Open Group says, validating behavior in these cases is almost impossible. The chance of anyone changing anything is minimal; system administrators will simply be expected to clear up the mess as they always have.
May 20th, 2009 at 8:53 pm
> Nearly all of the reported delays was a few seconds, which would be expected; normally there isn’t that much dirty data that needs to be flushed out on a Linux system, even if it is even very busy
> fsync() will trigger a commit and might need to take a second while the download is going on
Wait: are you saying that delaying for a second to flush a couple disk blocks isn’t a long time? For that little data, a second is an eternity.
> The atomicity not durability argument
> This argument is flawed for two reasons. First of all, the squence above exactly provides desired “atomicity without durability”.
This is a strawman. You gave an incorrect sequence of code, explained how incorrect it was, and then concluded that the argument was flawed. There’s nothing wrong with the argument, just your code. A correct atomic-rename in Linux includes an fsync() on the directory.
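For reference, the sequence being argued about can be sketched in Python as below, including the directory fsync() mentioned here. The function name atomic_replace and the ".new" suffix are illustrative assumptions, not taken from the post.

```python
import os


def atomic_replace(path, data):
    """Replace path's contents via write-to-temp, fsync, rename (a sketch)."""
    tmp = path + ".new"  # hypothetical temporary name
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            view = view[os.write(fd, view):]
        os.fsync(fd)  # new contents are on disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)  # readers see either the old file or the new one
    # Flush the directory so the rename itself survives a crash.
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Dropping the two os.fsync() calls leaves the atomicity-without-durability variant under discussion, whose behavior after a crash depends on the file system.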
> Secondly, as we discussed above, fsync() really isn’t that expensive,
It’s very expensive, even ignoring full-second delays. Compare sqlite performance with and without fsync: without fsync you can do a thousand transactions a second or more, but with safe synchronization you’re lucky to do 80. It both completely kills write buffering, and completely serializes the application with disk access.
I’m baffled that anyone would argue against the atomicity-without-durability “argument”. It’s an important, obvious case.
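A rough way to reproduce this kind of comparison is sketched below with Python's sqlite3 module, toggling SQLite's PRAGMA synchronous setting. The function name and the default count are assumptions, and the numbers depend entirely on the disk and file system; on a tmpfs temporary directory the two settings will look nearly identical.

```python
import os
import sqlite3
import tempfile
import time


def time_inserts(synchronous, count=200):
    """Return the seconds taken to commit `count` one-row transactions."""
    with tempfile.TemporaryDirectory() as d:
        conn = sqlite3.connect(os.path.join(d, "bench.db"))
        try:
            # "OFF" skips the sync calls; "FULL" waits for them on commit.
            conn.execute("PRAGMA synchronous = " + synchronous)
            conn.execute("CREATE TABLE t (n INTEGER)")
            conn.commit()
            start = time.perf_counter()
            for i in range(count):
                conn.execute("INSERT INTO t VALUES (?)", (i,))
                conn.commit()  # one transaction per row
            return time.perf_counter() - start
        finally:
            conn.close()
```

Calling time_inserts("OFF") and time_inserts("FULL") on the same machine gives the two figures to compare.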
May 30th, 2009 at 2:56 pm
Just because you’ve downloaded a file once doesn’t mean it can be re-downloaded. I just got a tarball of Nine Inch Nails songs from their website through a one-time download link. That tarball /is/ precious. So now… what is a good example of a call to rename() that shouldn’t have an implicit barrier?
June 5th, 2009 at 5:08 am
Hi Ted, I am often concerned with the problem of unnecessarily spinning up hard drives in laptop_mode, since under some circumstances firefox is doing crazy FS commits out of nowhere: my vmstat shows 64K/s of ‘bo’ with an untouched firefox session, whose death makes the world quiet.
But I (to some extent) don’t want to lose the commit guarantee.
So my question is: if this “ignore fsync” thing is to be implemented in laptop_mode, how would it change the semantics? I mean when the kernel spits out all data that’s been chunked for the past, say 6, minutes, will the fsyncs still constrain the order of writes?
BTW is the fsync on ext3 fixed? Rumors had it that it syncs the entire fs instead of the file descriptor (I know that’s still in line with POSIX specs though)
June 7th, 2009 at 8:42 pm
Ted,
I want to compliment you on the patience, open-mindedness, and tact that you’ve shown here regarding this issue. Next time I want to show someone how developers should interact with their community, particularly in the face of heated disagreement, I’ll point to this blog. I can be very patient and polite, but I’m not sure I have anywhere near the patience you have shown with this issue, as the discussions and flames have dragged on and on.
Thank you for being the fantastic Linux contributor and positive role model that you have been all these years!
June 9th, 2009 at 11:57 pm
> Wait: are you saying that delaying for a second to flush a couple disk blocks isn’t a long time? For that little data, a second is an eternity.
It’s not just “a couple of disk blocks”. Depending on the filesystem, the device, etc, it could easily involve physically spinning up disks and flushing hardware IO buffers, touching parent-directory inodes, etc.
I’m somewhat perturbed by your sqlite performance comment. Either you’re intending to use sqlite for persistent data (in which case you need it flushed so that a power outage or yanked drive cable doesn’t corrupt the data) or else if you’re just using it to use sql/relational semantics to manage data, why not have sqlite use an in-memory database instead?
A generally safe rule (and not just in linux): All bets are off when you’re using buffered I/O, except that your data is generally consistent after a buffer/io flush.
If someone writes a filesystem with different behavior, it will generally underperform for *someone’s* usage profile AND applications written expecting its behavior will behave very oddly elsewhere. For example, it is surprising how badly many modern GUI applications behave on an ext2 filesystem, or for that matter a filesystem with sector size != 512 bytes (or when other common-but-not-universal-truth assumptions are violated)
June 10th, 2009 at 2:00 am
> Either you’re intending to use sqlite for persistent data (in which case you need it flushed so that a power outage or yanked drive cable doesn’t corrupt the data)
No, you don’t need it flushed. All you need is a guarantee that certain operations will be committed to disk in a particular order. The only reason many applications flush to disk to accomplish this is because that’s the only means available to do so. Write barriers are one approach to get ordering without blocking.
June 11th, 2009 at 10:16 pm
As others have pointed out: The firefox issue on its own is rather dumb. The forced fsyncs aren’t so slow on ext4, but they spin up disks even in laptop mode, and firefox itself is still laggy.
An easy workaround is to rsync a “safe” ~/.mozilla to /dev/shm (or other tmpfs) and run it from there. If firefox exits gracefully, rsync it back. If not, then you revert to the previous safe version. (I have a script, but won’t post it here; the details are obvious to anyone with a little scripting experience.)
I wish the firefox developers would design their data I/O in a more robust and less braindead way overall. This sort of technique (work in memory, revert if bad) is obvious on its face.
June 18th, 2009 at 6:51 am
@ads, I agree Firefox developers need to re-think, but I’m not sure how your suggestion of /dev/shm helps. If Firefox crashes I want to get it back just as it was when it crashed, including those half a dozen new tabs I’ve opened that I wouldn’t be able to re-find again easily from this morning’s email/RSS reading. The reason Firefox is making lots of effort to keep saving the current state is that users, including me, would find it very annoying if it restored itself as it was half an hour ago.
June 18th, 2009 at 8:20 am
@Ralph,
For any system (firefox or otherwise), if you say “we MUST preserve the complete state every few seconds”, then we are back to fsync every few seconds, and the whole argument runs back to the beginning w.r.t. laptop-mode, slow fsyncs on certain systems, et cetera. I personally can’t imagine how a few tabs can be so important, but to each his own.
Maybe an eventual solution is to migrate /home to a log-based FS (say, nilfs2, now in 2.6.30) which internally does all this “snapshotting” as part of normal operations. Alternatively, one could write a userland library which does this for configuration files. Really, this means “open config files in append mode, and use a re-playable format”.
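A minimal sketch of the “append mode plus re-playable format” idea, assuming one JSON record per line; the function names and record layout are illustrative, not an existing library. Without an fsync() the newest records may still be lost in a crash, but earlier ones are never rewritten, and a torn final record is simply ignored on replay.

```python
import json


def append_setting(log_path, key, value):
    """Append one change record; earlier records are never rewritten."""
    record = json.dumps({"key": key, "value": value}) + "\n"
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(record)


def replay(log_path):
    """Rebuild the current settings by replaying the log from the start."""
    settings = {}
    try:
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                try:
                    record = json.loads(line)
                    key, value = record["key"], record["value"]
                except (ValueError, KeyError, TypeError):
                    break  # torn or malformed record: stop replaying here
                settings[key] = value  # later records override earlier ones
    except FileNotFoundError:
        pass  # no log yet means no settings
    return settings
```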
June 18th, 2009 at 8:46 am
The problem with Firefox is that it accesses the disk _constantly_. I’m not as concerned about the disk activity when you’re actually DOING something. But Firefox accesses the disk even when you’re doing NOTHING. This prevents my Mac from sleeping, for instance.
June 18th, 2009 at 11:38 am
@ads, I think Firefox needs to distinguish between protecting the user from Firefox crashing, and the OS crashing. The former can be common depending on version, plugins, etc. The latter is a lot more unusual with Linux. As long as FF has handed the data to the OS then I don’t mind if it doesn’t reach the platters for a while; the OS can keep it in RAM for a bit if it, and the user, prefers.
June 20th, 2009 at 8:27 pm
Having a stable OS doesn’t mean your laptop battery won’t fail, or that the dog won’t yank the plug.
There’s a similar issue using Vim on a busy system. Writing a file can block the editor for several seconds, because it–correctly–uses a usual safe write sequence (a different one, since it’s overwriting the whole file). With write barriers, it could get safe writes without blocking, so I wouldn’t have to wait for several seconds, breaking my train of thought, as Vim freezes up on fsync.
> I wish the firefox developers would design their data I/O in a more robust and less braindead way overall
So now SQLite is braindead and unrobust. Right.
June 21st, 2009 at 6:39 am
> So now SQLite is braindead and unrobust. Right.
I’d argue that SQLite is (when correctly configured) quite smart and robust. However, I’d also argue that it is not the correct tool for the job the Firefox developers had in mind. It is, however, quite simple to use and that apparently makes it the best choice for the task.
Part of the problem here is that everyone seems to assume that there is only one valid kind of problem, and that only the solutions that trade other things to maximize capability for *that* kind of problem are valid ones.
Another problem is that plenty of app developers seem willing to assume that the product will always run on Linux (by which they really mean “Always run on a linux box using a given filesystem, configured in a given way”). Many of them also seem to assume that “Linux” (see above) should return the favor by becoming perfectly optimal for their task at the expense of every other problemspace that might potentially ALSO be using the Linux kernel and libraries.
It is frustrating to watch someone take a perfectly good electric drill, use it as a hammer, complain vociferously that it’s awkward and far too complex for the job, and proceed to re-engineer it permanently into a bad hammer. Some of us occasionally need to drill holes or install drywall, and would really have liked to use the drill as a drill.
June 21st, 2009 at 1:08 pm
I’ve used SQLite extensively. It’s absolutely a correct–and in my experience, the best–tool for this task.
Adding an fbarrier()-like API would not inherently make anything less suitable for any other task. It might in practice, because implementing it may be difficult and cause internal design changes that could have adverse effects; but there’s nothing *inherent* about it that would do that.
Right now, we have a drill, and the hammer hasn’t been invented. The only means we have available to bang nails (safely write files) is to hit them with a drill (call fsync). There are no hammers (fbarrier).
June 22nd, 2009 at 10:58 am
Consider the case of NFS, or SSHFS, or any random not-an-ssd flash filesystem. Or ext2 on an older kernel (there are still environments that prefer kernel 2.0.current because of resource limits or new regressions). No matter what nice tools you have on a preferred filesystem, you cannot assume that your application is running in such an environment.
I accept that an fbarrier() API would be convenient and would substantially improve things for many usage profiles. But what is the fallback strategy when running in an environment where the fbarrier() API is missing, ineffective, or where the API knows it cannot provide the expected guarantees? Merely having fbarrier execute flush and sync type operations will result in worse behavior, because the app writer just blindly called fbarrier assuming that it would do the right thing.
I generally find that hiding complexity from the developer is a bad thing. It leads to him being unaware of what the machine is actually doing, and often leads to wildly inaccurate assumptions about performance and safety.
From #16 above (I think)
> If we are peppering our code with fsync’s, even if it doesn’t hurt “that much”, we are violating the abstraction that says the kernel is supposed to take care of buffering, caching, and writing things out to disk in a sane way.
The problem is that the kernel cannot know what each developer means by “a sane way”, and kernel behavior that is correct for one usage case is totally wrong for another. The behavior that is correct for a high load webserver is almost certainly wrong for a critical log path, and neither behavior is really quite right for a responsive medium-importance gui app. Adapting the kernel and libs to assume any one of these jobs is going to break more things than it fixes. This suggests to me that the expected kernel abstraction has gotten a little too abstract.
Perhaps what is truly needed is to document correct recipes for all the different intended-behavior cases then make sure the kernel doesn’t suddenly regress any of the expected behaviors. It seems to me that a lot of people are counting on behavior that may or may not be in but that certainly are NOT in ext2. When something other than their favorite assumed fs behaves differently, they go around crying “Bug!”. I agree that there is a bug, but perhaps we disagree as to which code contains it
June 22nd, 2009 at 6:25 pm
Linux isn’t a lowest-common-denominator kernel that doesn’t do anything not already possible on other platforms. Developers do the best they can manage on each platform; that’s just part of porting.
(If the kernel can’t implement fbarrier() for a particular scenario, it should return an error. In practice, it would probably need to take an array of fds, and it should be possible to tell in advance whether fbarrier() will work with a particular set of FDs, to select an appropriate writing strategy. Anyhow, we’re not here to design an API that will probably never be implemented, but none of this is very difficult to define sensibly.)
November 7th, 2009 at 7:18 am
@6: Regarding “Ubuntu Jaunty and Firefox 11 beta kernels”:
I knew Firefox was getting bloated, but that’s a bit excessive…