Don’t fear the fsync!

After reading the comments on my earlier post, Delayed allocation and the zero-length file problem, as well as the comments on the Slashdot story and on the Ubuntu bug, it’s become very clear to me that there are a lot of myths and misplaced concerns about fsync() and how best to use it.   I thought it would be appropriate to correct as many of these misunderstandings about fsync() as possible in one comprehensive blog posting.

As the Eat My Data presentation points out very clearly, the only safe way that POSIX allows for requesting that data written to a particular file descriptor be safely stored on stable storage is via the fsync() call.  Linux’s close(2) man page makes this point very clearly:

A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2).
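As a concrete illustration of that advice, here is a minimal sketch in Python (os.fsync() is a thin wrapper around fsync(2), so the pattern is identical in C; the helper name is mine):

```python
import os

def durable_write(path, data):
    """Write data to path and make sure it reaches stable storage
    before returning -- the pattern the close(2) man page describes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)   # flush the file's data to the device
    finally:
        os.close(fd)   # close() alone would NOT guarantee durability
```
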

Why don’t application programmers follow these sage words?  These three reasons are most often given as excuses:

  1. (Perceived) performance problems with fsync()
  2. The application only needs atomicity, but not durability
  3. The fsync() causing the hard drive to spin up unnecessarily in laptop_mode

Let’s examine each of these excuses one at a time, to see how valid they really are.

(Perceived) performance problems with fsync()

Most of the bad publicity around fsync() originated with the now-infamous problem with Firefox 3.0 that showed up about a year ago, in May 2008.   What happened with Firefox 3.0 was that the primary user interface thread called the sqlite library each time the user clicked on a link to go to a new page. The sqlite library called fsync(), which in ext3’s data=ordered mode caused a large latency, visible to the user whenever a large file copy was happening in another process.

Nearly all of the reported delays were a few seconds, which would be expected; normally there isn’t that much dirty data that needs to be flushed out on a Linux system, even when it is very busy.   For example, consider a laptop downloading an .iso image from a local file server; if the laptop has exclusive use of a 100 megabit/second ethernet link, and the server has the .iso file in cache, or has a nice fast RAID array so it is not the bottleneck, then in the best case, the laptop will be able to download data at the rate of 10-12 MB/second.  Assuming the default 5 second commit interval, that means that in the worst case, there will be at most 60 megabytes which must be written out before the commit can proceed.  A reasonably modern 7200 rpm laptop drive can write between 60 and 70 MB/second.   (The Seagate Momentus 7200.4 laptop drive is reported to be able to deliver 85-104 MB/second, but I can’t find it for sale anywhere for love or money.)   In this example, an fsync() will trigger a commit and might need to take a second while the download is going on; perhaps half a second if you have a really fast 7200 rpm drive, and maybe 2-3 seconds if you have a slow 5400 rpm drive.
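The arithmetic above can be sanity-checked in a few lines (the rates are the assumed figures from the text, not measurements):

```python
# Back-of-envelope check of the worst-case commit latency described above.
download_rate_mb_s = 12   # ~100 Mbit/s ethernet, best case
commit_interval_s = 5     # default ext3/ext4 journal commit interval
drive_write_mb_s = 65     # typical modern 7200 rpm laptop drive

dirty_mb = download_rate_mb_s * commit_interval_s  # at most 60 MB dirty
flush_seconds = dirty_mb / drive_write_mb_s        # roughly a second

print(dirty_mb, round(flush_seconds, 2))
```
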

(Jump to Sidebar: What about those 30 second fsync reports?)

Obviously, you can create workloads that aren’t bottlenecked on the maximum ethernet download speed, or the speed of reading from a local disk drive; for example, “dd if=/dev/zero of=big-zero-file” will create a very large number of dirty pages that must be written to the hard drive at the next commit or fsync() call. It’s important to remember, though, that fsync() doesn’t create any extra I/O (although it may remove some optimization opportunities to avoid double writes); fsync() just pushes around when the I/O gets done, and whether it gets done synchronously or asynchronously. If you create a large number of pages that need to be flushed to disk, sooner or later that will have a significant and unfortunate effect on your system’s performance.  Fsync() might make things more visible, but if the fsync() is done off the main UI thread, the fact that fsync() triggers a commit won’t actually disturb other processes doing normal I/O; in ext3 and ext4, we start a new transaction to take care of new file system operations while the committing transaction completes.

The final observation I’ll make is that part of the problem is that Firefox as an application wants to make a huge number of updates to state files, and was concerned about not losing that information even in the face of a crash.  Every application writer should be asking themselves whether this sort of thing is really necessary.   For example, doing some quick measurements using ext4, I determined that Firefox was responsible for 2.54 megabytes written to the disk for each web page visited by the user (and this doesn’t include writes to the Firefox cache; I symlinked the cache directory to a tmpfs directory mounted on /tmp to reduce the write load on my SSD).   Those 2.54 megabytes are just for Firefox’s cookie cache and the Places database that maintains its “Awesome bar”.  Is that really worth it?   If you visit 400 web pages in a day, that’s 1GB of writes to your SSD, and if you write more than 20GB/day, the Intel SSD will enable its “write endurance management feature”, which slows down the performance of the drive.   In light of that, exactly how important is it to update those darned sqlite databases after every web click?  What if Firefox saved a list of the URLs that have been visited, and only updated the databases every 30 or 60 minutes instead?   Is it really critical that every last web page you browse be saved if the system crashes?  An fsync() call every 15, 30, or 60 minutes, done by a thread which doesn’t block the application’s UI, would never have been noticed and would not have started the firestorm of Firefox’s bugzilla #421482.   Very often, after a little thinking, a small change in the application is all that’s necessary to really optimize the application’s fsync() usage.
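The batch-and-flush idea might look something like the following sketch; the class name, interval, and append-a-log-file format are all hypothetical illustrations of mine, not anything Firefox actually does:

```python
import os, threading

class PeriodicFlusher:
    """Hypothetical sketch: collect state updates in memory and push
    them to disk with one fsync() per batch, from a background thread,
    so the UI thread never does any I/O."""
    def __init__(self, path, interval_s=1800):
        self.path = path
        self.interval_s = interval_s
        self.pending = []
        self.lock = threading.Lock()
        self.timer = None

    def record(self, url):
        """Called from the UI thread; no I/O happens here."""
        with self.lock:
            self.pending.append(url)
            if self.timer is None:
                self.timer = threading.Timer(self.interval_s, self.flush)
                self.timer.daemon = True
                self.timer.start()

    def flush(self):
        """Runs off the UI thread, at most once per interval."""
        with self.lock:
            urls, self.pending = self.pending, []
            self.timer = None
        if not urls:
            return
        fd = os.open(self.path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
        try:
            os.write(fd, "".join(u + "\n" for u in urls).encode())
            os.fsync(fd)   # one fsync per batch, not one per click
        finally:
            os.close(fd)
```
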

(Skip over the sidebar — if you’ve already read it).

Sidebar: What about those 30 second fsync reports?

If you read through the Firefox bugzilla entry, you’ll find reports of fsync delays of 30 seconds or more. That tale has grown in the retelling, and I’ve seen some hyperbolic claims of five minute delays. Where did that come from? Well, if you look at those claims, you’ll find they were using a very read-heavy workload, and/or they were using the ionice command to set a real-time I/O priority. For example, something like “ionice -c 1 -n 0 tar cvf /dev/null big-directory”.

This will cause some significant delays, first of all because “ionice -c 1” gives the process a real-time I/O priority, such that any I/O requests issued by that process will be serviced before all others.   Secondly, even without the real-time I/O priority, the I/O scheduler naturally prioritizes reads over writes, because processes are normally waiting for reads to complete, while writes are normally asynchronous.

This is not at all a realistic workload, and it is even more laughable that some people thought this might be an accurate representation of the I/O workload of a kernel compile. These folks had never tried the experiment, or measured how much I/O goes on during a kernel compile. If you try it, you’ll find that a kernel compile sucks up a lot of CPU, and doesn’t actually do that much I/O. (In fact, that’s why an SSD only speeds up a kernel compile by about 20% or so, and that’s in the completely cold cache case. If the commonly used include files are already in the system’s page cache, the performance improvement from the SSD is much less.)

Jump back to reading Performance problems with fsync.

The atomicity not durability argument

One argument that has commonly been made in the various comment streams is that when replacing a file by writing a new file and then renaming it over the old one, most applications don’t need a guarantee that the new contents of the file are committed to stable store at a particular point in time; only that either the new or the old contents of the file will be present on the disk. So the argument is essentially that the sequence:

  • fd = open(“foo.new”, O_WRONLY|O_CREAT|O_TRUNC, mode);
  • write(fd, buf, bufsize);
  • fsync(fd);
  • close(fd);
  • rename(“foo.new”, “foo”);

… is too expensive, since it provides “atomicity and durability”, when in fact all the application needed was “atomicity” (i.e., either the new or the old contents of foo should be present after a crash), but not durability (i.e., the application doesn’t need the new version of foo committed right now, but rather at some convenient time in the future when the OS gets around to it).

This argument is flawed for two reasons. First of all, the sequence above provides exactly the desired “atomicity without durability”.   It doesn’t guarantee which version of the file will appear in the event of an unexpected crash; if the application needs a guarantee that the new version of the file will be present after a crash, it’s necessary to fsync the containing directory. Secondly, as we discussed above, fsync() really isn’t that expensive, even in the case of ext3 and data=ordered; remember, fsync() doesn’t create extra I/Os, although it may introduce latency as the application waits for some of the pending I/Os to complete. If the application doesn’t care about exactly when the new contents of the file will be committed to stable store, the simplest thing to do is to execute the above sequence (open-write-fsync-close-rename) in a separate, asynchronous thread. And if the complaint is that this is too complicated, it’s not hard to put this in a library. For example, there is currently discussion on the gtk-devel-list about adding an fsync() call to g_file_set_contents(). Maybe if someone asks nicely, the glib developers will add an asynchronous version of this function which runs g_file_set_contents() in a separate thread. Voila!
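Putting the pieces of this section together, here is a sketch of the full recipe, again in Python for brevity (os.fsync() and os.rename() map directly onto the C calls); the trailing directory fsync is the optional step that makes the rename itself durable:

```python
import os

def atomic_replace(path, data):
    """Write-new-then-rename, following the sequence discussed above.
    Sketch only: error handling is minimal.  Drop the final directory
    fsync if atomicity alone (old contents OR new contents after a
    crash) is all you need."""
    tmp = path + ".new"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # new data is on disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomic: readers see old file or new file
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)       # make the rename itself durable
    finally:
        os.close(dirfd)
```

Running the whole thing in a worker thread, as suggested above, keeps even this fully durable version off the application’s UI latency path.
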

Avoiding hard drive spin-ups with laptop_mode

Finally, as Nathaniel Smith said in Comment #111 of my previous post:

The problem is that I don’t, really, want to turn off fsync’s, because I like my data. What I want to do is to spin up the drive as little as possible while maintaining data consistency. Really what I want is a knob that says “I’m willing to lose up to minutes of work, but no more”. We even have that knob (laptop mode and all that), but it only works in simple cases.

This is a reasonable concern, and the way to fix this is to enhance laptop_mode in the Linux kernel. Bart Samwel, the author and maintainer of laptop_mode, actually discussed this idea with me last month at FOSDEM.  Laptop_mode already adjusts /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs based on the configuration parameter MAX_LOST_WORK_SECONDS, and it also adjusts the file system commit time (for ext3; it needs to be taught to do the same thing for ext4, which is a simple patch) to MAX_LOST_WORK_SECONDS as well. All that is necessary is a kernel patch allowing laptop_mode to disable fsync() calls; since the kernel knows that it is in laptop_mode, when it notices that the disk has spun up it can sync out everything to disk, because once the energy has been spent to spin up the hard drive, we might as well write out everything in memory that needs to be written right away. Hence, a patch which allows fsync() calls to be disabled while in laptop_mode should do pretty much everything Nate has asked for. I need to check whether laptop_mode does this already, but if it doesn’t force a file system commit when it detects that the hard drive has spun up, it should obviously do this as well.
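For illustration, the knobs involved look roughly like this. The values shown are what a hypothetical MAX_LOST_WORK_SECONDS=600 (ten minutes) would translate into, and the device and mount point are placeholders; the real laptop_mode scripts compute and apply these for you:

```shell
# Illustrative values only -- do not paste blindly.
echo 60000 > /proc/sys/vm/dirty_expire_centisecs     # dirty pages may age up to 600s
echo 60000 > /proc/sys/vm/dirty_writeback_centisecs  # wake the writeback threads that rarely
mount -o remount,commit=600 /dev/sda2 /home          # ext3 journal commit interval, in seconds
```
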

(In addition to having a way to globally disable fsync()’s, it may also be useful to have a way to selectively disable fsync()’s on a per-process basis, or on the flip side, exempt some processes from a global fsync-disable flag. This may be useful if there are some system daemons that really do want to wake up the hard drive — and once the hard drive is spinning, naturally everything else that needs to be pushed out to stable store should be immediately written.)

With this relatively minor change to the kernel’s support of laptop_mode, it should be possible to achieve the result that Nate desires, without needing to force applications to worry about this issue; applications should be able to simply use fsync() without fear.


As we’ve seen, the reasons most people think fsync() should be avoided really don’t hold water.   The fsync() call really is your friend, and it’s really not the villain that some have made it out to be. If used intelligently, it can provide your application with a portable way of assuring that your data has been safely written to stable store, without causing a user-visible latency in your application. The problem is getting people to stop fearing fsync(), to understand it, and then to learn the techniques for using it optimally.

So just as there has been a Don’t fear the penguin campaign, maybe we also need a “Don’t fear the fsync()” campaign.  All we need is a friendly mascot and logo. Anybody want to propose an image?  We can make some T-shirts, mugs, bumper stickers…

234 thoughts on “Don’t fear the fsync!”

  1. The idea of calling fsync(2) *after* the original file descriptor has been closed only works if _POSIX_SYNCHRONIZED_IO is defined (to quote from the Open Group man page):

    Otherwise (!_POSIX_SYNCHRONIZED_IO) the operation:

    fsync group-but-not-me-readable

    The file created on success cannot be opened for either read or write by the script.

    While this may seem obscure it is an extreme variant of the standard shell script multi-process locking technique. In that technique the strict sequencing of file creation (of a lock file) and file writing (of to-be-updated file) is essential.

    The shell script programmer can do the classic sync;sync;sync trick, but in my experience these scripts do not. I suspect shell programmers and C programmers alike have always assumed implicitly that close(2) behaves like an ANSI-C sequence point with respect to other open(2) and close(2) operations. This may be wrong, but it is a dangerous assumption to violate particularly as the violation is only detectable under a hard system crash.

  2. Entertaining. The Blog Code deleted a middle part of my previous post. After “Otherwise (!_POSIX_SYNCHRONIZED_IO) the operation:” should have been a line about executing an fsync on a read-only file “foo”. Unfortunately I expressed this with shell input redirection – presumably the character in question is not permitted… I need to rewrite the whole thing, the previous comment makes no sense as is.

  3. > It means that a shell script programmer needs to run a helper program, that calls fsync(2), just like it needs to do anything that isn’t implemented by the shell itself.

    I know much easier and cheaper way to fix all these problems in all scripts and/or programs written in any language by any programmer.

  4. @201, @202: John,

    The blog comments support a limited HTML subset. So if you want to use an angle bracket, you need to quote it, HTML-style, i.e., “&gt;” and “&lt;”.

    All modern systems support fsync, and so they define _POSIX_SYNCHRONIZED_IO. The “fsync() only works if _POSIX_SYNCHRONIZED_IO” statement in the standard was necessary two decades ago, when there were ancient Unix systems that didn’t support fsync(). The same is true for being able to stop a program using control-Z; POSIX job control was not guaranteed, and on ancient AT&T System V unix systems, control-Z didn’t work because neither POSIX nor BSD job control existed on those systems. But to say that programmers can’t count on fsync() because it’s an optional part of the specification is to misunderstand the historical context. Once upon a time it was optional; these days, no modern system would stand a chance in the marketplace if it didn’t implement that “optional” part of the spec. And indeed, all modern systems do support fsync().

    Also, ANSI-C doesn’t define open() and close(). It does define fopen() and fclose(), but it is unspecified what happens if the system crashes. All talk of “sequence points” only makes sense in the context of the fact that standard I/O is buffered, and until a stdio FILE handle is flushed out via fflush() or fclose(), another process won’t see it. But ANSI-C is very careful not to say what happens if the system crashes. The concept of what happens on a file handle flush and what happens when the system crashes are quite different.

  5. Indeed I would expect all the systems that support ext4 to also support fsync(2), and to define _POSIX_FSYNC, but I was talking about support for _POSIX_SYNCHRONIZED_IO, which, if implemented, modifies the POSIX fsync behavior. The functionality is optional, according to the Open Group man page it was added in issue 6 which is dated 2004, whereas fsync was added in issue 3.

    As for the comparison with _POSIX_JOB_CONTROL, well, to quote the document “_POSIX_JOB_CONTROL shall have a value greater than zero” – somewhat different from “the system may support one or more options…”

    Still, it may be reasonable to say that every implementation that supports ext4 also supports SIO – I don’t know. SIO makes the whole discussion moot because an app written to use SIO will be using open() with O_DSYNC/O_SYNC/O_RSYNC, which, I believe, obviates any need to use fsync(2).

    My point about sequence points was intended to clarify by analogy – I naively assumed everyone understood the role of sequence points in the C language. Well, I was wrong… They have nothing to do with the library. They exist in the language to allow compiler implementers to arbitrarily order operations *between* sequence points without invalidating compiler users’ expectations that operations happen in strict order.

    The deleted part of my post would have made this clearer. I raised the point that the suggested shell sequence of an fsync on a new file descriptor rather than the one originally used to write the data is not the same, both because of the potential absence of _POSIX_SYNCHRONIZED_IO and because it may be impossible to open a file descriptor on the file in question.

    For example:

    umask 757
    echo “secret” >foo
    fsync <foo

    fails because “foo” is not readable by the shell script. Yes, I know that there are ways round this because the script has its fingers on the fildes and need not release this, but consider the example where the file is created within a subprocess.

    The problem is complicated by the variety of techniques shell and application programmers use to effect inter-process synchronization through the file system. Another example is a semaphore file – the file might disappear before the process that creates it can synchronize it. I think you will see that this is more serious, because there is now some unsynchronizable data in the kernel surrounded by synchronizable but still reorderable directory updates.

    Having said all that, though, I think these questions are moot – app and shell programmers aren’t going to rewrite all their code because of ext4. Anyway, the scenario is that of a hard system crash in the middle of a critical operation. As the Open Group says, validating behavior in these cases is almost impossible. The chance of anyone changing anything is minimal; system administrators will simply be expected to clear up the mess as they always have.

  6. > Nearly all of the reported delays was a few seconds, which would be expected; normally there isn’t that much dirty data that needs to be flushed out on a Linux system, even if it is even very busy
    > fsync() will trigger a commit and might need to take a second while the download is going on

    Wait: are you saying that delaying for a second to flush a couple disk blocks isn’t a long time? For that little data, a second is an eternity.

    > The atomicity not durability argument

    > This argument is flawed for two reasons. First of all, the squence above exactly provides desired “atomicity without durability”.

    This is a strawman. You gave an incorrect sequence of code, explained how incorrect it was, and then concluded that the argument was flawed. There’s nothing wrong with the argument, just your code. A correct atomic-rename in Linux includes an fsync() on the directory.

    > Secondly, as we discussed above, fsync() really isn’t that expensive,

    It’s very expensive, even ignoring full-second delays. Compare sqlite performance with and without fsync: without fsync you can do a thousand transactions a second or more, but with safe synchronization you’re lucky to do 80. It both completely kills write buffering, and completely serializes the application with disk access.

    I’m baffled that anyone would argue against the atomicity-without-durability “argument”. It’s an important, obvious case.

    You’re downloading a file from the web, which Firefox does by creating the file “foo.iso”, initially as a zero-length file; it then creates the file “foo.iso.part”, where the file is actually downloaded, and then, once the download completes, Firefox renames “foo.iso.part” to “foo.iso”. The file foo.iso isn’t precious; in the case of a system crash, we can always download it again from the web.

    Just because you’ve downloaded a file once doesn’t mean it can be re-downloaded. I just got a tarball of Nine Inch Nails songs from their website through a one-time download link. That tarball /is/ precious. So now… what is a good example of a call to rename() that shouldn’t have an implicit barrier?

  8. Hi Ted, I am often concerned with the problem of unnecessarily spinning up hard drives in laptop_mode, since under some circumstances firefox is doing crazy FS commits out of nowhere: my vmstat shows 64K/s of ‘bo’ with an untouched firefox session, whose death makes the world quiet.

    But I (to some extent) don’t want to lose the commit guarantee.
    So my question is: if this “ignore fsync” thing is to be implemented in laptop_mode, how would it change the semantics? I mean, when the kernel spits out all the data that’s been chunked for the past, say, 6 minutes, will the fsyncs still constrain the order of writes?

    BTW is the fsync on ext3 fixed? Rumors had it that it syncs the entire fs instead of the file descriptor (I know that’s still in line with POSIX specs though)

  9. Ted,

    I want to compliment you on the patience, open-mindedness, and tact that you’ve shown here regarding this issue. Next time I want to show someone how developers should interact with their community, particularly in the face of heated disagreement, I’ll point to this blog. I can be very patient and polite, but I’m not sure I have anywhere near the patience you have shown with this issue, as the discussions and flames have dragged on and on.

    Thank you for being the fantastic Linux contributor and positive role model that you have been all these years!

  10. > Wait: are you saying that delaying for a second to flush a couple disk blocks isn’t a long time? For that little data, a second is an eternity.

    It’s not just “a couple of disk blocks”. Depending on the filesystem, the device, etc, it could easily involve physically spinning up disks and flushing hardware IO buffers, touching parent-directory inodes, etc.

    I’m somewhat perturbed by your sqlite performance comment. Either you’re intending to use sqlite for persistent data (in which case you need it flushed so that a power outage or yanked drive cable doesn’t corrupt the data) or else if you’re just using it to use sql/relational semantics to manage data, why not have sqlite use an in-memory database instead?

    A generally safe rule (and not just in linux): All bets are off when you’re using buffered I/O, except that your data is generally consistent after a buffer/io flush.

    If someone writes a filesystem with different behavior, it will generally underperform for *someone’s* usage profile AND applications written expecting its behavior will behave very oddly elsewhere. For example, it is surprising how badly many modern GUI applications behave on an ext2 filesystem, or for that matter a filesystem with sector size != 512 bytes (or when other common-but-not-universal-truth assumptions are violated)

  11. > Either you’re intending to use sqlite for persistent data (in which case you need it flushed so that a power outage or yanked drive cable doesn’t corrupt the data)

    No, you don’t need it flushed. All you need is a guarantee that certain operations will be committed to disk in a particular order. The only reason many applications flush to disk to accomplish this is because that’s the only means available to do so. Write barriers are one approach to get ordering without blocking.

  12. As others have pointed out: The firefox issue on its own is rather dumb. The forced fsyncs aren’t so slow on ext4, but they spin up disks even in laptop mode, and firefox itself is still laggy.

    An easy work around is to rsync a “safe” ~/.mozilla to /dev/shm (or other tmpfs) and run it from there. If firefox exits gracefully, rsync it back. If not, then you revert to the previous safe version. (I have a script, but won’t post it here; the details are obvious to anyone with a little scripting experience.)

    I wish the firefox developers would design their data I/O in a more robust and less braindead way overall. This sort of technique (work in memory, revert if bad) is obvious on its face.

  13. @ads, I agree Firefox developers need to re-think, but I’m not sure how your suggestion of /dev/shm helps. If Firefox crashes I want to get it back just as it was when it crashed, including those half a dozen new tabs I’ve opened that I wouldn’t be able to re-find again easily from this morning’s email/RSS reading. The reason Firefox is making lots of effort to keep saving the current state is that users, including me, would find it very annoying if it restored itself as it was half an hour ago.

  14. @Ralph,
    For any system (firefox or otherwise), if you say “we MUST preserve the complete state every few seconds”, then we are back to fsync every few seconds, and the whole argument runs back to the beginning w.r.t. laptop-mode, slow fsyncs on certain systems, et cetera. I personally can’t imagine how a few tabs can be so important, but to each his own.

    Maybe an eventual solution is to migrate /home to a log-based FS (say, nilfs2, now in 2.6.30) which internally does all this “snapshotting” as part of normal operations. Alternatively, one could write a userland library which does this for configuration files. Really, this means “open config files in append mode, and use a re-playable format”.

  15. The problem with Firefox is that it accesses the disk _constantly_. I’m not as concerned about the disk activity when you’re actually DOING something. But Firefox accesses the disk even when you’re doing NOTHING. This prevents my Mac from sleeping, for instance.

  16. @ads, I think Firefox needs to distinguish between protecting the user from Firefox crashing, and the OS crashing. The former can be common depending on version, plugins, etc. The latter a lot more unusual with Linux. As long as FF has handed the data to the OS then I don’t mind if it doesn’t reach the platters for a while; the OS can keep it in RAM for a bit if it, and the user, prefers.

  17. Having a stable OS doesn’t mean your laptop battery won’t fail, or that the dog won’t yank the plug.

    There’s a similar issue using Vim on a busy system. Writing a file can block the editor for several seconds, because it–correctly–uses a usual safe write sequence (a different one, since it’s overwriting the whole file). With write barriers, it could get safe writes without blocking, so I wouldn’t have to wait for several seconds, breaking my train of thought, as Vim freezes up on fsync.

    > I wish the firefox developers would design their data I/O in a more robust and less braindead way overall

    So now SQLite is braindead and unrobust. Right.

  18. > So now SQLite is braindead and unrobust. Right.

    I’d argue that SQLite is (when correctly configured) quite smart and robust. However, I’d also argue that it is not the correct tool for the job the Firefox developers had in mind. It is, however, quite simple to use and that apparently makes it the best choice for the task.

    Part of the problem here is that everyone seems to assume that there is only one valid kind of problem, and that only the solutions that trade other things to maximize capability for *that* kind of problem are valid ones.

    Another problem is that plenty of app developers seem willing to assume that the product will always run on Linux (by which they really mean “Always run on a linux box using a given filesystem, configured in a given way”). Many of them also seem to assume that “Linux” (see above) should return the favor by becoming perfectly optimal for their task at the expense of every other problemspace that might potentially ALSO be using the Linux kernel and libraries.

    It is frustrating to watch someone take a perfectly good electric drill, use it as a hammer, complain vociferously that it’s awkward and far too complex for the job, and proceed to re-engineer it permanently into a bad hammer. Some of us occasionally need to drill holes or install drywall, and would really have liked to use the drill as a drill.

  19. I’ve used SQLite extensively. It’s absolutely a correct–and in my experience, the best–tool for this task.

    Adding an fbarrier()-like API would not inherently make anything less suitable for any other task. It might in practice, because implementing it may be difficult and cause internal design changes that could have adverse effects; but there’s nothing *inherent* about it that would do that.

    Right now, we have a drill, and the hammer hasn’t been invented. The only means we have available to bang nails (safely write files) is to hit them with a drill (call fsync). There are no hammers (fbarrier).

  20. Consider the case of NFS, or SSHFS, or any random not-an-ssd flash filesystem. Or ext2 on an older kernel (there are still environments that prefer kernel 2.0.current because of resource limits or new regressions). No matter what nice tools you have on a preferred filesystem, you cannot assume that your application is running in such an environment.

    I accept that an fbarrier() api would be convenient and would substantially improve things for many usage profiles. But, what is the fallback strategy when running in an environment where the fbarrier() api is missing, ineffective, or where the api knows it cannot provide the expected guarantees? Merely having the fbarrier execute flush and sync type operations will result in worse behavior, because the app writer just blindly called fbarrier considering that it would do the right thing.

    I generally find that hiding complexity from the developer is a bad thing. It leads to him being unaware of what the machine is actually doing, and often leads to wildly inaccurate assumptions about performance and safety.

    From #16 above (I think)
    > If we are peppering our code with fsync’s, even if it doesn’t hurt “that much”, we are violating the abstraction that says the kernel is supposed to take care of buffering, caching, and writing things out to disk in a sane way.

    The problem is that the kernel cannot know what each developer means by “a sane way”, and kernel behavior that is correct for one usage case is totally wrong for another. The behavior that is correct for a high load webserver is almost certainly wrong for a critical log path, and neither behavior is really quite right for a responsive medium-importance gui app. Adapting the kernel and libs to assume any one of these jobs is going to break more things than it fixes. This suggests to me that the expected kernel abstraction has gotten a little too abstract.

    Perhaps what is truly needed is to document correct recipes for all the different intended-behavior cases, and then make sure the kernel doesn’t suddenly regress any of the expected behaviors. It seems to me that a lot of people are counting on behaviors that may or may not be guaranteed anywhere, but that certainly are NOT guaranteed by ext2. When a filesystem other than their assumed favorite behaves differently, they go around crying “Bug!”. I agree that there is a bug, but perhaps we disagree as to which code contains it 😉
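    One such recipe, for the replace-a-file-by-rename case the post itself discusses, might look like the sketch below (a minimal illustration; the function name is mine, and fsync() of the containing directory, which some applications also need for durability of the rename itself, is omitted):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write new contents to a temporary file, fsync() it, then rename()
 * it over the old name.  On any failure the original file is left
 * untouched.  Returns 0 on success, -1 on error. */
int atomic_replace(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp))
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t)len ||  /* write new contents */
        fsync(fd) != 0) {                        /* force them to disk */
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0 ||
        rename(tmp, path) != 0) {                /* atomic replacement */
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

    The point of the recipe is ordering: because the data is on stable storage before the rename, a crash leaves either the complete old file or the complete new file, never a zero-length one.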

  21. Linux isn’t a lowest-common-denominator kernel that doesn’t do anything not already possible on other platforms. Developers do the best they can manage on each platform; that’s just part of porting.

    (If the kernel can’t implement fbarrier() for a particular scenario, it should return an error. In practice, it would probably need to take an array of fds, and it should be possible to tell in advance whether fbarrier() will work with a particular set of FDs, to select an appropriate writing strategy. Anyhow, we’re not here to design an API that will probably never be implemented, but none of this is very difficult to define sensibly.)

  22. @6: Regarding “Ubuntu Jaunty and Firefox 11 beta kernels”:

    I knew Firefox was getting bloated, but that’s a bit excessive… 😛

  23. I know the thread is less than fresh; even so, it looks like the right place for me to make the suggestion that’s been reasserting itself every time I browse for an SSD:

    Firefox is not the only drive hog; I’ve heard that the GNOME desktop, particularly at startup, is quite a hog too. Then there are files such as .xsession-errors, which I’ve seen grow to 120GB+, particularly when reading KDE’s full message list.

    The way it looks to me, therefore, is that it’s time to recognize that the SSD is a new hardware paradigm, leading to a need to rethink the Filesystem Hierarchy Standard and the way mount works. What I’m thinking is that it should be possible to mount two devices at one place; for instance, /home could link to a hard drive partition *as well as* the SSD.

    Then it would be up to apps / the user [in the case of /home] / the distro / sudo [more generally] to give the filesystem a hint as to whether a cache is a ‘heavy write’ cache needing a hard drive location rather than the SSD, provided there is one.

    The actual distinction could be managed as a filesystem bit flag — which looks simplest, provided all potential filesystems support it — or alternatively in the filesystem hierarchy itself, as, for instance, two root mountpoints /heavy and /regular — plus symlinks such as /regular/home/.mozilla to /heavy/home/.mozilla — that were somehow usually transparent and that the kernel would perhaps manage; /tmp could serve as a third alternative for ‘very heavy’.

    Would people even think that would be possible?

  24. Mark,

    On the system level, this is the distinction between /usr (static) and /var (changes at runtime).
    For more detailed purposes, this has existed for some time as “mount --bind” in simple cases or using unionfs in complex cases.

  25. Hmmm… the fsync-on-replace case is actually not my use case. My use case is the fsync-on-create-to-mark-complete case. And I could swear that not only NTFS but also ext3 was vulnerable to 0-byte files without fsync(). Did that change?

    @222: actually, NFS is a pretty good filesystem for file-name-based transactions, since it has commit-on-close semantics. If you close a written file it has to be stable, so you can rename it safely.

    FileOutputStream fos = new FileOutputStream(tmp);
    BufferedOutputStream bos = new BufferedOutputStream(fos);
    try {
        // ... write the data ...
    } finally {
        bos.close();
    }

    @192: there is actually HornetQ, a JMS implementation which uses a lazy sleep on AIO. However, I haven’t looked at the implementation details yet; they claim a performance gain (they have the side-effect problem that the transaction has to be stable before they can acknowledge the message externally).
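    For the fsync-on-create-to-mark-complete case mentioned above, a sketch of the pattern might look like this (a minimal illustration; the function name is mine, and the directory fsync() that some filesystems need to make the marker’s creation itself durable is omitted):

```c
#include <fcntl.h>
#include <unistd.h>

/* Write and fsync() the data file first; only then create and
 * fsync() an empty marker file.  A reader that sees the marker can
 * trust that the data file is complete and on stable storage. */
int write_with_marker(const char *data_path, const char *marker_path,
                      const char *data, size_t len)
{
    int fd = open(data_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* The marker appears only after the data is durable. */
    int mfd = open(marker_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (mfd < 0)
        return -1;
    int rc = fsync(mfd);
    close(mfd);
    return rc;
}
```

    On a filesystem with commit-on-close semantics such as NFS, the explicit fsync() calls here would be redundant but harmless.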


  26. I realize this is an old thread, but I’ll throw my 1/2 cent in.

    For unimportant data (data that doesn’t need to reach the disk at all costs), just use sync() (that’s a real function, right? There should be a non-forced sync for such data, imho).
    For important data (data that needs to be written to disk at all costs), use fsync().

    What’s wrong with this?

Comments are closed.