A recent Ubuntu bug has gotten slashdotted, and has started raising a lot of questions about the safety of using ext4. I’ve actually been meaning to blog about this for a week or so, but between a bout of the stomach flu and a huge todo list at work, I simply haven’t had the time.
The essential “problem” is that ext4 implements something called delayed allocation. Delayed allocation isn’t new to Linux; xfs has had delayed allocation for years. Pretty much all modern file systems have delayed allocation; according to the Wikipedia Allocate-on-flush article, this includes HFS+, Reiser4, and ZFS, and btrfs has this property as well. Delayed allocation is a major win for performance, both because it allows writes to be streamed more efficiently to disk, and because it can reduce file fragmentation so that files can later be read more efficiently from disk.
This sounds like a good thing, right? It is, except for badly written applications that don’t use fsync() or fdatasync(). Application writers had gotten lazy, because ext3 by default has a commit interval of 5 seconds, and uses a journalling mode called data=ordered. What does this mean? The journalling mode data=ordered means that before the commit takes place, any data blocks associated with inodes that are about to be committed in that transaction will be forced out to disk. This is primarily done for security reasons; if this is not done (which would be the case if the disk is mounted with the mount option data=writeback), then any newly allocated blocks might still contain previous data belonging to some other file or user, and after a crash, accessing that file might result in a user seeing uninitialized data that had previously belonged to another user (say, their e-mail or their p0rn), which would be a Bad Thing from a security perspective.
However, this had the side effect of essentially guaranteeing that anything that had been written would be on disk after 5 seconds. (This is somewhat modified if you are running on batteries and have enabled laptop mode, but we’ll ignore that for the purposes of this discussion.) Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data, even though POSIX never really made any such guarantee. This became especially noticeable on Ubuntu, which uses many proprietary, binary-only drivers, which caused some Ubuntu systems to become highly unreliable, especially for Alpha releases of Ubuntu Jaunty, with the net result that some Ubuntu users have become used to their machines regularly crashing. (I use bleeding edge kernels, and I don’t see the kind of unreliability that apparently at least some Ubuntu users are seeing, which came as quite a surprise to me.)
So what are the solutions to this? One thing is that the applications could simply be rewritten to properly use fsync() and fdatasync(). This is what is required by POSIX, if you want to be sure that data has gotten written to stable storage. Some folks have resisted this suggestion, on two grounds: first, that it’s too hard to fix all of the applications out there, and second, that fsync() is too slow. This perception that fsync() is too slow was most recently caused by a problem with Firefox 3.0. As Mike Shaver put it:
On some rather common Linux configurations, especially using the ext3 filesystem in the “data=ordered” mode, calling fsync doesn’t just flush out the data for the file it’s called on, but rather on all the buffered data for that filesystem.
Fundamentally, the problem is caused by “data=ordered” mode. This problem can be avoided by mounting the filesystem using “data=writeback” or by using a filesystem that supports delayed allocation, such as ext4. This is because if you have a small sqlite database which you are fsync()’ing, and in another process you are writing a large 2 megabyte file, the 2 megabyte file won’t be allocated right away, and so the fsync operation will not force the dirty blocks of that 2 megabyte file to disk; since the blocks haven’t been allocated yet, there is no security issue to worry about with the previous contents of newly allocated blocks if the system were to crash at that point.
Another solution is a set of patches to ext4 that has been queued for the 2.6.30 merge window. These three patches (with git id’s bf1b69c0, f32b730a, and 8411e347) will cause any delayed allocation blocks belonging to a file to be allocated immediately when that file replaces an existing one. This gets done when the file is closed, for files which were truncated using ftruncate() or opened via O_TRUNC, and when a file is renamed on top of an existing file. This solves the most annoying set of problems where an existing file gets rewritten, and thanks to the delayed allocation semantics, that existing file gets replaced with a zero-length file. However, it will not solve the problem for newly created files, of course, which would have delayed allocation semantics.
Yet another solution would be to mount ext4 volumes with the nodelalloc mount option. This will cause a significant performance hit, but apparently some Ubuntu users are happy using proprietary Nvidia drivers, even if it means that when they are done playing World of Goo, quitting the game causes the system to hang and they must hard-reset the system. For those users, it may be that nodelalloc is the right solution for now — personally, I would consider that kind of system instability to be completely unacceptable, but I guess gamers have very different priorities than I do.
A final solution which might not be that hard to implement would be a new mount option, data=alloc-on-commit. This would work much like data=ordered, with the additional constraint that all blocks that had delayed allocation would be allocated and forced out to disk before a commit takes place. This would probably give slightly better performance compared to mounting with nodelalloc, but it shares many of the disadvantages of nodelalloc, including making fsync() potentially very slow, because it would once again force all dirty blocks out to disk.
What’s the best path forward? For now, what I would recommend to Ubuntu gamers whose systems crash all the time and who want to use ext4 is to use the nodelalloc mount option. I haven’t quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they won’t have to worry about files getting lost as a result of delayed allocation. Long term, application writers who are worried about files getting lost on an unclean shutdown really should use fsync. Modern filesystems are all going to be using delayed allocation, because of its inherent performance benefits, and whether you think the future belongs to ZFS, or btrfs, or XFS, or ext4, all of these filesystems use delayed allocation.
What do you think? Do you think all of these filesystems have gotten things wrong, and delayed allocation is evil? Should I try to implement a data=alloc-on-commit mount option for ext4? Should we try to fix applications to properly use fsync() and fdatasync()?
March 12th, 2009 at 2:28 pm
But the problem is that ext3 performs very poorly if applications use fsync a lot; the ordering constraints can mean that if you fsync one file, then large numbers of other files may need to be committed to disk as well. Firefox 3 initially used fsync the way you recommend, and they were forced to take it out because of the apparent lockups it caused on Linux systems.
So ext3 didn’t just allow application developers to get away with not doing fsync; it punishes them if they use it.
March 12th, 2009 at 2:29 pm
The XFS & KDE zero-length config file problem has been known since the dawn of time. I think you should have known better and warned users before marking it as stable enough to be used daily.
Now the best solution would be to fix userspace, but note that calling fsync() on every config change is not sane; it will slow things down considerably.
March 12th, 2009 at 2:33 pm
One point that I saw raised in the Ubuntu bug discussion, is about the Posix API not providing *what application authors really want to do* (I think POSIX doesn’t provide it).
So we want to do this:
Take file “.settings”, write out the file with new data (derived from the original data).
As you, Ted, explained, this is implemented by
1. Open file
2. Truncate file
3. Write to file
The problem is when the filesystem gets stuck at state 2!!
What if application authors could say this
1. Open file for replacement
2. Overwrite with this content
Where result of step number two would be either A) No change written to disk (operation failed) B) New content of file written to disk (overwrite)
If we would have a more atomic API like that, application programmers could simply say what they mean; that they have to do 3 1/2 workarounds to get this done safely, tells me that POSIX/whatever is a pretty leaky abstraction!
I see it as if your patches to the kernel actually push it in this direction, which is sensible. Let’s not hope that the heuristics lead to systems that almost always work but behave strangely with corner-cases.
March 12th, 2009 at 2:35 pm
well, I’m just a simple country sysadmin, but it seems to me that you’ve gone out of your way to accommodate the app writers by providing a workaround for this that mimics ext3 behaviour. That will give them time to fix the real problem: if (as you point out in the bug report) emacs, vi and OpenOffice (!) can all do The Right Thing(tm), then what’s stopping other folks?
Also: thank you *very* much for your time in explaining all this, here and in the bug report. I’m no kernel programmer, but you’ve gone a long way toward educating me on filesystem performance/design/tradeoffs.
March 12th, 2009 at 2:41 pm
@1: Joe,
Agreed, ext3’s default data=ordered mode did punish developers who used fsync(); although the problem really only showed up with Firefox 3.0 because it was using fsync() so frequently, and in places where the latency of the fsync() became very visible in the user interface. This got amplified into a meme which said, “fsync() is always painful; we should avoid fsync() at all costs”, which really isn’t true.
For example, emacs issues an fsync() after saving a file; it turns out that on various networked file systems, such as AFS, issuing the fsync() and checking its error returns is the only reliable way of detecting quota overflow situations, and so emacs has, for over ten years, always called fsync() as part of its file save operations. Do you hear people complaining about emacs having performance problems as a result of its call to fsync()? Not hardly! But then again, it’s not calling fsync() after each click to visit a new web page in a browser, either. So this assumption that fsync() is always slow and painful is really overblown.
March 12th, 2009 at 2:56 pm
Sounds to me like there needs to be a function fsync_if_appropriate() that will call fsync if it’s on a filesystem with delayed allocation, but be a no-op if it’s on ext3 with data=ordered or some other system that’s going to commit every few seconds regardless.
March 12th, 2009 at 2:59 pm
@3: Ulrik,
The sequence:
1. Open file
2. Truncate file
3. Write to file
is always going to be unsafe. Application writers should never do this. The POSIX equivalent is:
1. Open file “foo” for reading and read the contents into memory
2. Modify the contents of file “foo” in memory
3. Write file “foo.new” with the new contents of file “foo”
4. Call fsync() on the file “foo.new” and close the file handle for “foo.new”
5. Rename “foo.new” to “foo”, overwriting “foo” in the process
It’s easy enough to create a library function that does this, if this is too complicated for application programmers to remember.
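As a rough illustration of such a library function, here is a minimal Python sketch of the five steps above (Python's os module wraps the same POSIX calls). The function name replace_file_via_rename and the ".new" suffix are made up for this example, and it assumes a POSIX system where rename() atomically replaces an existing file on the same filesystem.

```python
import os


def replace_file_via_rename(path, data):
    """Replace `path` with `data` using write-new, fsync, rename.

    Hypothetical helper for illustration only; `data` is a bytes object.
    """
    tmp_path = path + ".new"  # same directory, so the rename stays on one filesystem
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:          # os.write() may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                  # step 4: force "foo.new" out before the rename
    finally:
        os.close(fd)
    os.rename(tmp_path, path)         # step 5: rename "foo.new" over "foo"
```

A caller would read the old contents, modify them in memory, and then call replace_file_via_rename("foo", new_contents). Error handling here is deliberately thin; a real library would probably also remove the temporary file if the write fails.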
“I see it as if your patches to the kernel actually push it in this direction, which is sensible. Let’s not hope that the heuristics lead to systems that almost always work but behave strangely with corner-cases.”
Unfortunately, applications that use the sequence 1) Open file, 2) truncate file, 3) write file, are always going to be unsafe. They were unsafe with ext3, and they will continue to be unsafe with ext4 with my patches. The fundamental problem is that ext3’s data=ordered mode is a heuristic which works “almost always”, and it perpetuated bad habits for application writers. So the patches for ext4 make “replace file via truncate” no more unsafe than it was under ext3, but it’s never going to be 100% safe. The right way to do this is to write the new contents to “foo.new”, and then use rename; call that “replace file via rename”.
The patches in question implicitly cause an fsync() operation when the kernel detects a “replace via truncate” or “replace via rename” operation taking place. The latter is there mainly to support applications which omit the fsync() because of application programmers who have been infected by the “fsync() is expensive and must be avoided at all costs” meme.
So the patches may perpetuate the “replace via truncate” pattern by making things no worse than under ext3; but given that there seems to be a large number of broken applications out there, I don’t see that I have much choice. What do you think?
March 12th, 2009 at 3:05 pm
No matter what “fixes” you decide to implement in ext4, please do not make either of them the default, or at least make a way to disable them per filesystem (that is, once created with these tricks disabled, have these filesystems work with complete delayed allocation no matter where they are mounted).
This really isn’t a filesystem problem, so please make the gamers have to fiddle with special mount options, not everyone else.
March 12th, 2009 at 3:11 pm
I’m surprised by the real pragmatism shown in the Ubuntu bug discussion, and I understand that this almost has to be implemented if we want ext4 to be a viable upgrade from ext3 (My impression of ext3: never fails and crashes are harmless. But I only admin my laptop..).
Application writers doing the wrong thing and/or the POSIX library being leaky can be understood like this instead:
Application writers doing this wrong are using APIs too low-level for them, and should use something with the correct “write to temp then swap files” algorithm (for example using their desktop’s dotfiles-writing library or similar).
March 12th, 2009 at 3:18 pm
Hrmmm….
Me thinks the correct solution is to stop using shit drivers that crash your system all the time.
And, please god, don’t give people the misconception that they should be using fsync() all the f-ing time. You’ll give people the impression that OMG sync() is UNSAFE and therefore we should use Fsync() to write out our thousands of small configuration files.
That was one of the more irritating things about the Firefox fsync() debacle. To me it wasn’t so horrible that sqlite was doing something stupid or that ext3 caused the performance problem to get worse… it was the Firefox developers’ attitude that their stupid little records keeping in my URL bar was so god-damn important that it needed to run fsync a thousand times a second. _That_data_is_not_that_important_. If the file system crashes and the URL records are borked then that is something that, frankly, I am not going to give a crap about. I’d much rather have the disk being used to save the data I am currently working on and actually matters.
So encourage people to use sync() correctly according to the actual specifications and not to go on a witch hunt and replace every instance of ‘sync()’ with ‘fsync()’ in their applications and scripting libraries. There is a reason both exist. Learn it, Use it, be happy. Then when the file system munches your data you actually have something _real_ to be pissed off about.
Application Developers: Putting the ASS in assumption since 196x.
March 12th, 2009 at 3:32 pm
IMO, using fsync/fdatasync all the time is a bad idea, except for applications that overwrite files in place (databases).
A better idea would be an implicit barrier around rename.
It would either cause the changes to be logged in the journal or allocated/flushed, depending on the amount of uncommitted data.
March 12th, 2009 at 4:09 pm
In many cases, you don’t want the fsync() on rename (whether explicit or implicit). The behaviour you want is that after a crash you will see either the old file or the new file, but not an empty or corrupt file.
fsync() should only be for occasions where you want to promise someone that the new file has hit the disk.
ISTR that Linus has never been a fan of “what we do is allowed by the standard” as a justification for an implementation choice that causes problems in practice.
March 12th, 2009 at 4:30 pm
My sole complaint about the sync issue is the inability to provide more granular targeting.
Recently, I’d have given my left eyeball for the ability to:
1. sync on a specific set of files only, while other large temp writes were ongoing.
2. sync ONLY a specific mountpoint
3. drop caches+buffers ONLY on a specific mountpoint.
March 12th, 2009 at 4:33 pm
What is the argument against a simple idea such as performing the delayed allocation at file close() time, regardless of O_TRUNC etc.? That would be similar to NFS’s model.
March 12th, 2009 at 4:38 pm
@8: No matter what “fixes” you decide to implement in ext4, please do not make either of them the default, or at least make a way to disable them per filesystem (that is, once created with these tricks disabled, have these filesystems work with complete delayed allocation no matter if where they are mounted).
George,
There will be a mount option to disable these “fixes”, via the mount option “auto_da_alloc=0”. The reason I made this the default is that most of the time, when replacing files, you really do want these guarantees, and if the application is sane and is already doing the fsync() (i.e., emacs, vi, OpenOffice, etc), it doesn’t hurt. But if you don’t want this heuristic enabled, you can disable it via a mount option in /etc/fstab.
I suppose I could add a way of specifying this mount option in the superblock, but I would think it’s easy enough for people who care about this sort of thing to just add the mount option to /etc/fstab.
March 12th, 2009 at 4:43 pm
@10: And, please god, don’t give people the misconception that they should be using fsync() all the f-ing time. You’ll give people the impression that OMG sync() is UNSAFE and there for we should use Fsync() to write out our thousands of small configuration files.
Nate:
Actually, sync() is effectively equivalent to fsync() on all dirty files on all filesystems in the system. So I don’t think anyone has ever said that sync() is unsafe; in fact, sync() is a very, very, large hammer that will definitely work, but which could in practice take up far more time and disk write bandwidth than the targeted use of fsync() only on the files that you need.
With ext3’s “data=ordered” mode, fsync() on a file is currently effectively equivalent to fsync’ing all dirty files on that filesystem; but that’s something which application writers shouldn’t depend upon, either.
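To make the difference in scope concrete, here is a small Python sketch contrasting the two calls. The function names save_and_fsync and save_and_sync are invented for this example; os.fsync() and os.sync() are the standard library's wrappers around the corresponding system calls on Unix-like systems.

```python
import os


def save_and_fsync(path, data):
    """Write one file and flush only that file to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())    # targeted: ask the kernel to flush this one file


def save_and_sync(path, data):
    """Write one file, then flush all dirty data on all filesystems."""
    with open(path, "wb") as f:
        f.write(data)
    os.sync()                   # the very large hammer: system-wide flush
```

Both leave the file's data flushed; the second may also wait for unrelated dirty data belonging to every other process on the machine.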
March 12th, 2009 at 4:50 pm
I understand why doing open(), truncate(), write() and close() will always be unsafe, but being ignorant about filesystems I don’t quite get the basic reason why the effect of truncate() gets to disk so much faster than that of write(). If the application takes e.g. 500 milliseconds to execute the whole sequence and the commit interval is 5 seconds, shouldn’t it be quite unlikely that the buffers are written to disk exactly when the application is between truncate() and write()? Is the problem people are now reporting more likely than that?
March 12th, 2009 at 5:17 pm
tytso,
So shouldn’t fsync() on a fs with delayed allocation be much less expensive than on ext3 w/ data=ordered?
It seems like this is really a plain old backward-compatibility issue. Let people get away with not fsync’ing files properly because it works with current filesystems, or accept that to properly handle correct posix file operations requires a new generation of filesystems that don’t sync everything when you fsync() a single file handle.
And realizing that these new filesystems will probably be required in the next 5 years to handle SSD’s takeover, the choice seems pretty clear.
Is there a reason I’m missing that the rename/truncate must occur immediately and can’t be delayed along with the allocation of the blocks?
March 12th, 2009 at 5:21 pm
I’m glad that you heard user concerns. The safest option must be the default and only the people that absolutely must push every last bit of performance out of their file system should fiddle with the mount options. Not only gamers are affected by this bug – as a system administrator if I am forced to choose between ext4 (a filesystem that gives me 5% faster disk operations, but will truncate a bunch of important files in the once-per-year chance event that my server loses power) and ext3 (slower, but more reliable), I will ALWAYS choose the ext3. Predictable losses are better than unpredictable losses and computer time losses are better than human time losses.
While it is understood that read, truncate, write is not atomic, the truncate operation should hit the disk in the same write cycle as the write operation. If I call truncate and then immediately call write, it should be very uncommon to have a significant time difference between truncate and write hitting the disk. If you are delaying writes, then delay truncates as well and block them together with subsequent writes.
Even the proposed read A, write to B, replace A with B apparently results in 0 length files in current ext4. That is a bug, please admit that and move on.
March 12th, 2009 at 5:24 pm
I think the driver thing is a red herring. File systems like ext4 are supposed to be robust. I would think that part of that testing is repeatedly resetting the system, thousands of times.
If application writers are seriously truncating files and then rewriting them, I do place the blame 100% on application writers. But if metadata is journaled and not data, then I have this question:
1. Open new file
– data write out stops here –
2. Write data
3. Rename file
– Journal stops here –
– system loses power –
When the journal gets replayed, maybe the config file is an empty file? Unless of course the application writer calls fsync between 2 and 3.
March 12th, 2009 at 5:30 pm
It would be nice if the filesystem would inform the user at boot of what files had their contents lost in the crash. This information could be made available by writing an entry in the journal about written-but-unallocated blocks. Then at boot time, the journal could be examined for recently written-but-unallocated blocks that never made it to disk.
March 12th, 2009 at 5:39 pm
Ext3 had the 5-second sync guarantee…but if your system locked-up before the 5-second sync, it didn’t zero files, right? I’ve never heard about anyone getting zeroed files with ext3, even if you increased the sync period to 30 seconds or more…by the way, to get back to ext3 behaviour, isn’t it easier to add a 5-second sync (including the delalloc data) to ext4?
What *scares* me about ext4, from what i’ve heard, is the possibility of having a file zeroed just because an application happened to write to it without using fsync. I don’t mind if I lose the data that it’s in the memory in the last 50 seconds, but losing the old version of the file is a no-go. I don’t mind getting the old or the new version, or even getting a mix of the old and new version, I care about not getting zeros in any case.
Call me paranoid, but when I read this, I’ve opened a terminal and I’ve written the following command: “while [ 1 ]; do sleep 5; sync; done” and left it running in the background…
March 12th, 2009 at 5:52 pm
Having a stable release shouldn’t create so many problem:)
Last time i personally had a hard lock or a shutdown was like one in 3-4 months. So delayed allocation is a great thing. Having a UPS is good only when you got many blackouts. And a hard lock, i haven’t experienced any when having just the Nvidia blob. Actual Ubuntu is an Alpha release, they shouldn’t complain that ext4 is making these files like that if they use unstable OS.
March 12th, 2009 at 5:54 pm
“I suppose I could add a way of specifying this mount option in the superblock, but I would think it’s easy enough for people who care about this sort of thing to just add the mount option to /etc/fstab.”
The problem with /etc/fstab is that it is local to a machine. It will not carry over to other machines when I take my USB-based ext4-formatted hard drive to another computer. If you could indeed specify this in the superblock, it would be wonderful – that way everyone gets to pick what they believe to be the best default for each filesystem.
I disagree with Aigars Mahinovs: If an application wants to be safe, it can call fsync() as soon as it’s done writing. As long as the filesystem guarantees safety after fsync() to the data written before the fsync(), it should be free to be as fast as it can.
I thank you for the detailed information and all of your effort!
March 12th, 2009 at 6:00 pm
[...] – those who are curious about a detailed description of how the truncated files come about can find one on Ts’o’s blog (courtesy of [...]
March 12th, 2009 at 6:01 pm
When I wrote “zeroed” in #19 I meant “zero-length”.
Let me explain my post again…could you confirm whether ext3 can get a zero-length file in some cases, or is it impossible? That’s all I care about (I don’t care about losing data that has not been synced for a while). If ext3 is the only one that can guarantee that, I guess I’ll have to migrate back to ext3 until similar behaviour can be found in ext4!
March 12th, 2009 at 6:21 pm
fdatasync() being blocking is problematic — if my program needs to sync 100 files, I’ve gotta either create 100 threads or call them one after the other and live with the disk latency. Is some kind of non-blocking API in the works?
March 12th, 2009 at 6:57 pm
@19 Diego:
If you are that paranoid why not remount your filesystem with the sync option so everything is always being written in order?
March 12th, 2009 at 6:58 pm
@26: Let me explain again my post…could you confirm if Ext3 can get a zero-length file in some cases, or it’s impossible?
Diego,
It can happen with ext3 if the application uses “replace via truncate”, i.e., fd = open("foo", O_WRONLY|O_TRUNC), write(fd, buf, buflen), close(fd). It’s a hard race to hit, but if the open takes place just before an ext3 transaction commit, and the write system call happens right after the transaction commit, and then the system crashes before another commit can take place, you will have a zero-length file.
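The window in question can be made visible with a short Python sketch. The between_steps hook is a hypothetical stand-in for "the transaction commit and the crash land here"; it only shows that the file is observably zero-length between the two calls, not what actually reaches the disk, which depends on commit timing as described above.

```python
import os


def replace_via_truncate(path, data, between_steps=None):
    """The UNSAFE pattern: truncate the existing file, then write the new contents."""
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)  # the old contents are discarded here
    try:
        if between_steps is not None:
            between_steps()   # hypothetical hook: a crash at this point leaves an empty file
        os.write(fd, data)    # short payload assumed, so no partial-write loop
    finally:
        os.close(fd)
```

Passing a hook that records os.path.getsize(path) will report 0, which is the state a reader (or a crash recovery) can end up seeing.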
I’m more sympathetic to application writers who have the mistaken impression that “replace via rename” is safe without the fsync(), and the patches queued for 2.6.30 will cause ext4 to force the blocks to be allocated on the rename (and on close() for the “replace-via-truncate” case), and then forced out to disk on the transaction commit. I do want people to understand though that there are no guarantees that any other file system will have these sorts of heuristics. If you want your application to be portably safe, you have to use fsync().
March 12th, 2009 at 7:05 pm
I think what is needed is a FaultyFS. FaultyFS adheres to POSIX but does _everything_ it can to damage your data in undefined POSIX areas (possibly only power loss related though). FaultyFS has a commit interval measured in hours. FaultyFS reorders everything in the worst way imaginable that is allowed. FaultyFS fills files with Ks before writing their contents (if POSIX allows this) etc. FaultyFS would support only 10 snapshots of the filesystem over the past second (assuming that the snapshots differed) as a bandaid to allow developers to see what would have happened if a crash had occurred at those points. This bandaid would be thrown away once someone worked out how to make the live filesystem always show the corruption of a poorly rewritten app. Devs would test their apps with FaultyFS to see how quickly they would break. FaultyFS wouldn’t be a guarantee but it would be an aid to achieving robustness. FaultyFS would also have a super slow fsync to punish apps fsyncing too much and trying to beat it.
March 12th, 2009 at 7:10 pm
@27: fdatasync() being blocking is problematic — if my program needs to sync 100 files, I’ve gotta either create 100 threads or call them one after the other and live with the disk latency. Is some kind of non-blocking API in the works?
Brian,
I guess the first question I would ask is, “why do you really need to be updating 100 files at once”? If the answer is the application is using hundreds of small configuration files, are they all getting modified? Or is the application just being lazy and rewriting all 100 files whether they need to be rewritten or not?
Assuming there is a good reason to force out all 100 files (for example, in the case of package installation using dpkg or rpm), the best way to do this right now is to use the sync() system call. This is a bit of overkill since it will sync all files on all filesystems, but it’s not like you need to be installing packages several times a second. If you need to be updating 100’s of files that frequently, the first question I would ask is “Why? Is it really necessary? What about SSD’s?”
That being said, no there isn’t an asynchronous API for requesting that a file be flushed sooner than necessary. It wouldn’t be that hard to add one via the fadvise() system call, which would probably be the way I’d do this, if people really needed it.
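For the batch case, one option is to issue all of the writes first and only then flush, either with a single sync() or with one fdatasync() per file. The Python sketch below shows both; write_batch and its use_sync flag are names invented for this example, os.fdatasync() is assumed to be available (it is on Linux), and the files are written in place rather than via rename, so this fits the "installing new files" case rather than replacing existing ones.

```python
import os


def write_batch(files, use_sync=False):
    """Write every file in `files` (a dict of path -> bytes), then flush them.

    Hypothetical helper for illustration only.
    """
    handles = []
    try:
        for path, data in files.items():
            f = open(path, "wb")
            handles.append(f)
            f.write(data)
            f.flush()                     # hand the data to the kernel, no disk wait yet
        if use_sync:
            os.sync()                     # one system-wide flush for the whole batch
        else:
            for f in handles:
                os.fdatasync(f.fileno())  # still blocking, one call per file
    finally:
        for f in handles:
            f.close()
```

Either way the calls still block; this just avoids interleaving each write with its own wait for the disk.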
March 12th, 2009 at 7:16 pm
@18: Charlie,
“So shouldn’t fsync() on a fs with delayed allocation be much less expensive than on ext3 w/ data=ordered?”
Yup; fsync() was expensive only on ext3, because of data=ordered mode. Ext4 has a much cheaper fsync() cost; it’s one of the benefits of delayed allocation.
“It seems like this is really a plain old backward-compatibility issue. Let people get away with not fsync’ing files properly because it works with current filesystems, or accept that to properly handle correct posix file operations requires a new generation of filesystems that don’t sync everything when you fsync() a single file handle.”
Exactly. Except replace “current filesystems” with “ext3 (only)”. and “new generation of filesystems” with “all other filesystems”…
“Is there a reason I’m missing that the rename/truncate must occur immediately and can’t be delayed along with the allocation of the blocks?”
The problem is “entangled writes”. When you modify various file system meta-data blocks to implement a truncate operation, or create a new file or allocate new blocks to a file, very often more than one operation will need to make changes to a particular block — for example, a block or inode allocation bitmap. So you can’t arbitrarily delay one meta-data operation until later. You can delay all meta-data operations by lengthening the commit interval, but it’s very much an all-or-nothing sort of thing.
March 12th, 2009 at 9:10 pm
To say delayed allocation is the problem is a sneaky way out. I used reiser4, it has delayed allocation. When the system locked up I’d get old versions of files, never 0 byte files. Reiser4 treated file system operations like transactions, and completed the whole transactions and those depending on it, or none of it at all. There were no zeroed files. This argument is just a way of saying: Please use ext4, we only do this because we have to and everyone else will have to as well. That’s just not correct. Delayed allocation is not the sole cause of this.
The other side not presented in this article, but which has feet, is the POSIX argument. In the original bug report the developers claim that POSIX is not clear on what the state the files should be in if a system crashes before any fsync has taken place.
In the example given, the developer says this can lead to a 0 byte file:
2.a) open and read file ~/.kde/foo/bar/baz
2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
2.d) close(fd)
2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
A new file is (2.b) opened. Then (2.c) data is written to it. Then (2.d) that file is closed. And finally (2.e) that file replaces the original. Because POSIX doesn’t say the data is ordered, it’s ok for the file system to perform 2.e on disk (the rename) before step 2.c is done on disk (putting information into the file). So you can get a 0-byte file if the filesystem is uncleanly unmounted (i.e., after a crash).
Let’s not get confused, there is a design choice to do this here. It’s not necessary for things to happen this way, but it is “correct” according to POSIX. The choice is up to the developers; if they can’t think of a way around it without sacrificing performance then that’s how it will be in ext4.
People need to stop shifting the blame. This is how they decided to do it. If you don’t like it there are other choices. Edit: Although everyone is throwing it around, I don’t believe this is a problem with ZFS either, only XFS.
March 12th, 2009 at 10:51 pm
@33: Danny,
It’s interesting that you didn’t see this problem with reiser4; I’m guessing though that it didn’t really treat each filesystem operation as its own transaction, since that would have terrible performance. It probably did the equivalent of forcing an fsync() on close, assuming that a close operation should be treated as a transaction commit. That’s a guess, but it’s likely.
I did mention the fact fsync() and fdatasync() are the only ways that POSIX guarantees data can be safely written to disk; maybe you didn’t read my blog post more closely? In any case, we can and will put in hacks that force the file data to be written out in the case of “replace-via-truncate” and “replace-via-rename”, but you will be trading off performance for robustness in the face of sloppy applications and unreliable systems.
So we can let users have a choice; if their applications are good about using fsync() where necessary and/or their system is robust and rarely crashes, they can use the more aggressive forms of delayed allocation. If they have a really unreliable system and they use many applications that don’t use fsync(), they can disable delayed allocation, or we can implement a data=alloc-on-commit mode that will sacrifice performance. There basically is no free lunch here; the question is what is the right balance given a particular system’s reliability and application workload.
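To see which choice a given machine is currently making, one option is to look at the mount options. The sketch below is only an illustration: it assumes a Linux system where ext4 reports the `nodelalloc` option in `/proc/mounts` when delayed allocation has been turned off, and the function name `ext4_delalloc_status` is made up for this example.

```python
def ext4_delalloc_status(mounts_text: str) -> dict:
    """Map each ext4 mount point to True if delayed allocation looks enabled."""
    status = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts format: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        mount_point, options = fields[1], fields[3].split(",")
        # Assumption: "nodelalloc" in the options means delayed allocation is off.
        status[mount_point] = "nodelalloc" not in options
    return status
```

On a live system the input would come from reading `/proc/mounts`; passing the text in as a string keeps the parsing easy to check on its own.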
March 12th, 2009 at 11:17 pm
There are some comments and questions here that you might be interested in reading:
http://lwn.net/SubscriberLink/322823/102fd841e29f7637/
The comment there at http://lwn.net/Articles/322865/
explains my concerns pretty well. What do you think?
March 12th, 2009 at 11:32 pm
… But it’s not just ext3 that works with the open/write/close/rename idiom, is it? What about non-linux filesystems (ufs comes to mind — I’d be very surprised if the sync-less version of this idiom doesn’t feature heavily in bsd sources)? What about good ol’ non-journaling filesystems (where metadata and data aren’t written with vastly different priority)?
It seems to me that what you’re really arguing for is increased FS performance at a cost of extra application complexity. Which is an argument worth making, but it strikes me as disingenuous to claim it is due to programmer laziness as a result of ext3 or anything else. This behavior is the very reason more people don’t use XFS, and for good reason.
March 12th, 2009 at 11:40 pm
Great post. I liked the explanations a lot. Thank you.
Delayed allocation is a good thing. To be honest, I have always thought that modern (contemporary) file systems implemented it. Only after I’ve read that it’s an “advanced” feature, I’ve realised the cruel reality of ext3
A data=alloc-on-commit mount option for ext4 would be nice and I suppose it shouldn’t be very hard to implement. By the way, do you happen to know if the command line sync really works or not, including on ext4 file systems? Also, setting the dirty_writeback_centisecs & Co. parameters to low values couldn’t be a possible workaround for this “problem”?
Of course, applications should be fixed to properly use fsync and fdatasync (where needed). I think that this is one of the benefits of the bazaar/open source model where things evolve and get fixed rapidly.
March 13th, 2009 at 12:09 am
I just want to point out that this presents a very serious problem for data security when using interpreted languages like perl, where access to fsync/fdatasync is not available without a non-standard module. (what’s worse, that extra module requires access to a C compiler – it cannot be implemented in pure perl, which raises the bar substantially)
And just in case you thought maybe perl did some f*sync calls under the covers at close time – it doesn’t. You can verify yourself by examining the output of:
strace perl -le 'open(my $f, ">foo"); print $f "hello\n"; close($f)'
… So if I’m a perl developer who wants to make sure my app maintains data security when used on ext4, what am I to do?
March 13th, 2009 at 12:34 am
@35: Jim,
Your argument seems to be that when doing a “replace via rename” (open, write, fsync, close, rename), that the fsync should be unnecessary since “obviously” the rename should imply atomicity without durability (i.e., that you’ll either get the old file or the new file, but not something in between). Furthermore, you claim that adding the fsync() is insane because of a “drastic and unnecessary performance penalty”.
My response would be this; first of all, while it is sane to want atomicity without durability, that’s not what (open, write, close, rename) does. It may have had that effect under ext3, but that was an accident. Linux supports many filesystems, of which ext3 is just one. And there are other POSIX systems out there, so if the goal is to write portable code, it’s bad practice to make filesystem-specific assumptions. For example, maybe the code will need to run on a shared-block cluster file system, like GFS2 or OCFS2? Or maybe someone will want to run the application with their home directory on AFS?
Secondly, fsync() does not necessarily impose a drastic performance penalty. It does with ext3’s data=ordered mode, which is highly unfortunate, since not only did data=ordered mode train application programmers to think that fsync() wasn’t necessary, but it also trained application programmers into thinking that fsync() is expensive. It doesn’t need to be expensive, and with ext4 and delayed allocation, it isn’t.
This does leave the question that what works best for ext4 and other newer, more modern file systems, and what works well for ext3 is different. And yes, that’s unfortunate. I’m not sure there is a good solution for that, other than some way of giving a hint to applications whether the filesystem in question has “implied fsync semantics” (which is what ext3 data=ordered mode is really all about) or not.
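For reference, the “replace via rename” sequence with the fsync() included can be sketched as follows. This is a minimal illustration rather than a canonical implementation: the helper name `replace_file_durably` is invented here, and the final fsync() on the containing directory is an extra step assumed for this sketch, not something the discussion above asks for.

```python
import os
import tempfile


def replace_file_durably(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that rename()
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # drain Python's userspace buffer
            os.fsync(tmp.fileno())   # ask the kernel to push the data to disk
        os.rename(tmp_path, path)    # switch the name over to the new file
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Assumed extra step: fsync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    except OSError:
        pass  # some filesystems refuse fsync() on a directory
    finally:
        os.close(dir_fd)
```

Note that `tempfile.mkstemp` creates the file with owner-only permissions, so a real application would probably want to copy the mode of the file being replaced.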
March 13th, 2009 at 12:39 am
@36: What about non-linux filesystems (ufs comes to mind — I’d be very surprised if the sync-less version of this idiom doesn’t feature heavily in bsd sources)? What about good ol’ non-journaling filesystems (where metadata and data aren’t written with vastly different priority)?
Somejack,
Actually, for historical BSD systems, such as ufs, traditionally metadata was written out on a 5 second timer, and data was written out on a 30 second timer. So data and metadata were historically always written out with vastly different priorities. This is actually where the default 5 second commit interval from ext3, and the 30 second dirty_expire_centisecs default values came from.
It’s only ext3’s data=ordered which is unusual, in that data=ordered had the side effect of making the 30 second dirty_expire_centisecs largely irrelevant (at least for ext3 filesystems using the default data=ordered mode). So this is an old issue, and it’s why older, more portable programs (like emacs and vim) tend to use fsync() — because it’s necessary on most other filesystems.
March 13th, 2009 at 12:44 am
Updating a file doesn’t need a flush (fsync()), it never did. What it needs is a memory barrier to prevent reordering across renames. There’s no way to spell that though, and even if it was available the performance effect should be negligible, so you might as well do it every time you close a file.
ext3’s data=ordered did that. It was a killer feature, not just of ext3, but of Linux in general. For your typical desktop user it far exceeded any other advantage of using ext3 over ext2.
March 13th, 2009 at 12:46 am
@37: By the way, do you happen to know if the command line sync really works or not, including on ext4 file systems?
Yes, it really works. The sync() system call, which is what the command line sync command calls, has the same effect as calling fdatasync() on all dirty inodes. Also queued for 2.6.30 are patches so you can monitor the number of delayed allocation blocks, via /sys/fs/ext4//delayed-allocation-blocks, so it’s easy to keep an eye on these things.
Also, setting the dirty_writeback_centisecs & Co. parameters to low values couldn’t be a possible workaround for this “problem”?
Along with adjusting the journal commit interval upwards? Yes, possibly. Like everything else, this will come with certain performance trade-offs.
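Before tuning anything, it helps to know what the writeback timers are currently set to. The following is a small sketch under the assumption of a Linux system that exposes `dirty_writeback_centisecs` and `dirty_expire_centisecs` under `/proc/sys/vm`; the function name and the `sysctl_dir` parameter are made up for this example.

```python
import os

VM_SYSCTL_DIR = "/proc/sys/vm"  # usual Linux location of the vm tunables


def read_writeback_timers(sysctl_dir: str = VM_SYSCTL_DIR) -> dict:
    """Return the writeback timers in seconds; None where a value is unreadable."""
    timers = {}
    for name in ("dirty_writeback_centisecs", "dirty_expire_centisecs"):
        try:
            with open(os.path.join(sysctl_dir, name)) as f:
                # The kernel reports hundredths of a second.
                timers[name] = int(f.read().strip()) / 100.0
        except (OSError, ValueError):
            timers[name] = None
    return timers
```

Calling `read_writeback_timers()` with no arguments reads the live values. From the same script, `os.sync()` makes the same request to the kernel as the command line `sync`.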
March 13th, 2009 at 12:52 am
@38: I just want to point out that this presents a very serious problem for data security when using interpreted languages like perl, where access to fsync/fdatasync is not available without a non-standard module. (what’s worse, that extra module requires access to a C compiler – it cannot be implemented in pure perl, which raises the bar substantially)
skibrianski,
Well, there is a CPAN module called File::Sync which has already implemented access to fsync(). I don’t know if you consider CPAN to be a location for “non-standard modules”, but it doesn’t look like it’s that hard to arrange to have it be installed.
March 13th, 2009 at 1:13 am
@41: Adam,
The problem is that it was not an intentional “feature”, but an implementation side effect of data=ordered, and data=ordered had the additional misfeature that it made fsync()’s much more expensive than they needed to be. Desktop users may have latched onto this as a “feature” they can’t live without, so maybe we’ll have to add a data=alloc-on-close mount option that emulates both the upsides and the downsides of ext3’s data=ordered mode. But it will come at a cost; there’s no such thing as a free lunch, after all.
March 13th, 2009 at 4:25 am
Thanks for writing all of this.
Myself, I’ve often forgotten to perform the fsync before the close and rename, then sometimes remembered to add it later (even though I couldn’t produce the bug in testing on ext3; now I know why).
I’m a little leery of your new workaround; it seems that other filesystems are going to have to implement the same workaround in order to be reliable with applications that run reliably on ext4.
This whole thing is the worst kind of API problem: an API that requires an easy-to-forget step in order to achieve reliability under circumstances that never happen, coupled with an implementation that severely penalizes some applications that use that step, while not in fact requiring it for reliability. But it’s a few years too late to do anything about that.
March 13th, 2009 at 6:47 am
Is it true that using ‘data=journal’ would also eliminate this problem (although impose a performance penalty)?
March 13th, 2009 at 8:22 am
Let me ask a follow-up, despite no answer having been offered to @14 above.
If indeed the ext4 attitude is that “if the program cares about its output files, it should use fsync on them”, and thus programs will be deemed well-behaved if they add a *sync to all their file operations, what happens to the system performance over all as more programs become “well-behaved”?
In other words, does ext4 performance only shine while there are enough “badly behaved” programs that do not use *sync? So once more of them behave “well”, so there is lots of *sync traffic, would ext4 punish them by becoming overall as slow as ext3 et al. ?
March 13th, 2009 at 8:40 am
@skibrianski
Perl has fsync() in core, it’s just well hidden.
From “perldoc POSIX”:
fsync Use method “IO::Handle::sync()” instead.
So the way to go is:
use IO::File; my $fh = IO::File->new('foo', 'w'); print $fh "hello\n"; $fh->flush; $fh->sync; $fh->close;
March 13th, 2009 at 9:38 am
@Frank,
Sorry, I missed answering your question in #14 earlier.
What is the argument against a simple idea such as performing the delayed allocation at file close() time, regardless of O_TRUNC etc.? That would be similar to NFS’s model.
First of all, that’s not NFS’s model; NFS v2/v3 requires that each write RPC call not return until the data has hit stable storage. So in fact it’s a stronger requirement than alloc on close. We could do alloc-on-close, but most of the time applications either write the entire file all at once, in something well less than the commit interval, or they do random access writes to a file that already exists, so we might as well simply do alloc-on-commit, which when combined with the already-existing data=ordered machinery, means that the data blocks get flushed out on the commit. In the first case (where applications write the entire file out in well less than the commit window) doing it at commit time actually delays the allocation by a small amount (maybe more if the commit interval has been lengthened by laptop mode), and in the second case, there aren’t many block allocations going on at all. The downside is if you are copying very large files, and you have lots of memory, and the commit window is relatively small compared to the time that it takes to copy a multi-gigabyte file, alloc-on-commit might end up doing the allocations sooner. OTOH, alloc-on-commit exactly mirrors ext3’s behaviour, and there seem to be people who are clamouring for such a thing. We could do both as options, I suppose, but then maybe we are adding too many knobs. <shrug> There are always tradeoffs, no matter what we do.
If indeed the ext4 attitude is that “if the program cares about its output files, it should use fsync on them”, and thus programs will be deemed well-behaved if they add a *sync to all their file operations, what happens to the system performance over all as more programs become “well-behaved”?
Well, we’ve implemented some of these workarounds (more than XFS has already, and once I implement alloc-on-commit, we’ll have something exactly like ext3, for those people who want it). So I wouldn’t call it “ext4’s attitude”, but rather, “if you want to write a portable and safe application that works everywhere”.
As I’ve said already, because of delayed allocation, fsync() for ext4 is not the performance problem that it was under ext3. Ext3’s performance issue with fsync() comes from data=ordered, since that was what required flushing to disk a 2GB file being written to disk just because Firefox 3.0 was trying to fsync() a 30k SQLite database. Furthermore, it was a highly visible problem because Firefox was doing it all the time (every time you clicked on a link, to update its URL visited database), and if there was a big write happening in the background, it introduced a highly visible latency in the UI. In OpenOffice, for example, there is no UI delay at all, since the write happens in a separate thread, and even with emacs, you only type ^X^S every so often, and it happens at a point in the workflow where if the editor has to delay for even a second, it’s not as big of a deal as “every time you click on a link” in a browser. In any case, as I’ve said before, it’s only ext3 that has this data=ordered mode, and with all other filesystems, including ext4, fsync() is not the big performance monster people seem to perceive it to be. Yes, it will take a small amount of time to write the file data to disk; but if you want robustness you have to do that, by definition. For small files, it’s really no big deal to fsync() them on any filesystem other than ext3.
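Anyone who wants to check the small-file claim on their own system can time it directly. The sketch below is a rough measurement aid, not a benchmark: the function name and parameters are invented for this example, the 30 KiB default merely echoes the database size mentioned above, and if the scratch directory lives on tmpfs the fsync() will be close to free and tell you nothing about a disk-backed filesystem.

```python
import os
import tempfile
import time


def time_small_fsync(size: int = 30 * 1024, directory=None) -> float:
    """Write `size` bytes to a scratch file and return seconds spent in fsync()."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"x" * size)
        start = time.perf_counter()
        os.fsync(fd)                 # wait until the kernel reports the data is on disk
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
```

Pointing `directory` at a path on the filesystem of interest, once while the system is idle and once while a large background copy is running, gives a rough feel for the latency being discussed.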
Now, of course, once I implement alloc-on-commit for ext4, ext4 will start behaving exactly like ext3, including fsync() getting expensive again. And alloc-on-close will be similar, since all allocated files will have to be flushed out to disk on the journal commit triggered by the fsync(). In the case of the multi-gigabyte file that takes longer than 5 seconds to write, fsync()’s before the close won’t be affected, but the moment the close happens, an fsync() before the next commit will incur a really huge latency.
So why isn’t this a problem for delayed allocation? With delayed allocation, we don’t allocate the blocks until the file is explicitly fsync’ed, which means the application has told us it really, really wants this particular file pushed out to disk, and it’s important enough that it’s willing to wait for it to happen and for it to get any I/O errors from the writeout. If the file isn’t fsync’ed, then once the pages have been dirty for longer than the dirty expiration timer, the VM subsystem will gradually stream them out, allocating the blocks and then triggering the writeout. This happens in the background, not all at once, and not connected to a journal commit. If a commit gets triggered while a background write is happening, it will need to wait for the blocks allocated by the background write to be written out before the commit can proceed, but since this is happening gradually, it won’t add significant delay to a synchronous commit triggered by an fsync() call.
March 13th, 2009 at 9:51 am
@45: I’m a little leery of your new workaround; it seems that other filesystems are going to have to implement the same workaround in order to be reliable with applications that run reliably on ext4.
Kragen,
Yeah, my current thinking is that we’ll do the automatic alloc-on-replace-via-truncate (which was taken from XFS, and rewards seriously broken applications), and the automatic-alloc-on-replace-via-rename (which XFS doesn’t do, but rewards applications that at least tried a little to do the right thing) by default, since most of the time, when you are replacing a file, it tends to be a small file (i.e., config files, registry files, source files, etc.). And if the applications do the fsync(), the automatic alloc-on-replace doesn’t hurt anything.
The much more extreme alloc-on-commit, which exactly mirrors ext3’s features and misfeatures, will be something that people will have to enable explicitly. That’s basically there because of a whole bunch of Ubuntu users who are loudly clamoring for Canonical not to ship ext4 “because ext4 is fundamentally broken”. I disagree with them, but apparently at least some Ubuntu users seem to live in a world where playing “World of Goo” using crappy proprietary binary drivers is so important they are willing to live with Windows-levels of “blue screens of death” several times a day, and where their Ubuntu systems are crashing with alarming regularity.
This whole thing is the worst kind of API problem: an API that requires an easy-to-forget step in order to achieve reliability under circumstances that never happen, coupled with an implementation that severely penalizes some applications that use that step, while not in fact requiring it for reliability. But it’s a few years too late to do anything about that.
I agree, whole-heartedly. The interaction of fsync() and ext3’s data=ordered mode (and the hypothetical ext4’s alloc-on-commit, although I’ve already completely mapped out how to implement it) is highly unfortunate. But it’s too late to change, and there are a lot of broken applications out there.
One of the commenters on the Launchpad bug claims that svn apparently isn’t fsync()’ing certain new files that it creates after an “svn update”, and so an svn working directory can become useless if you do an “svn update” and the system immediately crashes. Now, my systems aren’t crashing all the time, but for all of the clamour that I’m hearing on the Ubuntu bug, it seems that for a certain class of Ubuntu users, their machines are crashing left and right, at the slightest provocation. It’s really for those users that I’m planning on implementing alloc-on-commit, even though it’s going to perpetuate this API problem. It’ll be up to Canonical whether they want to enable this by default, of course. Maybe something smart where it gets enabled on early alpha kernels, and/or when binary proprietary drivers are in use, especially the proprietary video drivers that seem to be the most problematic? I dunno. That’s not my problem.
March 13th, 2009 at 10:02 am
@46: Is it true that using ‘data=journal’ would also eliminate this problem (although impose a performance penalty)?
Sam,
Actually, no, it won’t, at least not until I implement alloc-on-commit. That’s because data=journal only journals data buffers once they are written; but with delayed allocation blocks, the data blocks aren’t written to disk right away. Without a final location on disk assigned, we can’t journal them.
Admittedly, data=journal isn’t very useful without alloc-on-commit. For some workloads data=journal can be faster (since the writes go to the journal you don’t have to seek as much as data=ordered), but most people use data=journal as a cheaper variant of the ’sync’ mount option.
March 13th, 2009 at 10:28 am
Wouldn’t it be much easier to “fix” one filesystem than to fix thousands of apps out there?
March 13th, 2009 at 10:46 am
@52: John,
Only if you don’t mind those applications only being safe to use on one filesystem. I suppose it is one way for ext3/4 engineers to have guaranteed employment for the indefinite future.
March 13th, 2009 at 11:14 am
A few points:
1.) The best solution is to fix all the applications.
2.) The posix API is really lacking here. There should be an api for read/modify/replace that is easy to use correctly. open #1->read->close->open #2->write->close->fsync->rename is ridiculous for such a common operation.
3.) I think the blame on nvidia is misplaced. Even if nvidia is responsible for a huge number of crashes (anecdote: never crashes for me), linux and the power grid are not *that* reliable. Furthermore, there aren’t any alternatives to nvidia for fast (rules out intel), mostly reliable (rules out ati) 3d performance.
March 13th, 2009 at 11:21 am
Your analysis does not reflect what I am seeing. I installed Ubuntu about 3 times. I am an applications programmer but not an experienced system admin. I want to develop open source Linux applications in my spare time. Each time I install Ubuntu (8.10), I start installing and updating development software. Maybe sometimes some things fail. Maybe sometimes we have a brief power failure overnight while I am away from the computer. I never knew how come, but three times I wound up with Ubuntu with no way that I could figure out to get root privileges. Root password or sudoers list or something got mucked up three times in about three weeks and I resorted to a reinstall.
Seems to be a lack of whole-system engineering here. Systems guys and apps guys pointing fingers. As a user, I don’t care whose fault it is. How about the file system gives the file creator a way to flag a file as critical (extended attributes or something), and those critical files will get special handling from the file system to prevent 0-length and similar crashes, and the rest of the filesystem can do whatever it does to make things efficient?
March 13th, 2009 at 11:31 am
Hi Ted,
Thanks for blogging. IMHO it is anal to point out that POSIX makes no guarantees to crashing systems; of course it doesn’t (and it’s a point in the Unix haters FAQ for long: Unix file systems are fast, but not robust). So what kind of promise should a file system make that is not just fast, but also robust? Simple: Preserve a consistent state of the FS in case of a crash. POSIX consistency semantics is easy, because all file system operations are performed in order in POSIX. If you have a sequence write, close, rename, the rename happens clearly after the write.
So the problem is not from delayed allocation. Losing five seconds of work, or even a minute is not the problem. The problem comes from reordering metadata and data writes. The fsync() discussion is completely misleading, because that’s IMHO only there for people who roll their own file system in a file (aka database), and really want a raw device, and know that Unix file systems are not robust, but their database should be.
What you should do is to delay metadata changes so that they are only committed *when* the data operations that have been issued prior to the metadata operations have been committed, too. Best, keep them in memory as well. So when you come to your sync point where the delayed allocation is actually performed, do the following (in this order):
* allocate all inodes and blocks.
* Write all metadata to the journal as “uncommitted”.
* Write barrier (make sure it will not reorder between journal writes and the following writes)
* Write metadata and data, in whatever order you or the disk likes.
* deallocate all inodes and blocks freed¹.
* Write barrier again.
* Write the commit block which makes the metadata changes final.
* Final write barrier.
This makes the whole set of transactions performed in the last 5 seconds (or whatever period you choose) atomic. That’s what the users want, nothing else. If the sync-to-disk fails, the commit block is not written, and the previous state can be restored by rolling back all changes. The only risk here is that in-place writes of data can be reordered, but in-place writes should come from applications that know about fsync, or wait for btrfs, which is log-structured, and that’s the “right” solution to the problem.
While you perform these sync steps, all writes and metadata changes that happen in parallel will be delayed to the next sync point.
¹) This may cause problems in case of a full hard disk. The way out is: Do not overcommit disk space, sync when you appear to run out of disk space. Allocation has immediate effects (though the actual allocation is delayed), file removal is delayed.
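The all-or-nothing behaviour that the commit block is meant to provide in the steps above can be illustrated with a toy in-memory model. This is only a sketch of the idea as described in this comment: the class `ToyJournal` and its methods are invented for the illustration, there are no real disks or write barriers involved, and it says nothing about how ext4 or its journal is actually implemented.

```python
class ToyJournal:
    """In-memory model: a batch of changes becomes visible only once committed."""

    def __init__(self):
        self.disk = {}        # committed state: file name -> contents
        self.pending = None   # batch logged as "uncommitted", or None

    def write_uncommitted(self, changes: dict) -> None:
        # Everything before the commit block: log the batch, publish nothing.
        self.pending = dict(changes)

    def commit(self) -> None:
        # The commit block: from here on the whole batch is final.
        if self.pending is None:
            return
        for name, contents in self.pending.items():
            if contents is None:
                self.disk.pop(name, None)   # a removed name
            else:
                self.disk[name] = contents
        self.pending = None

    def crash_and_recover(self) -> None:
        # No commit block was written, so recovery discards the whole batch.
        self.pending = None
```

With `baz` holding old contents, logging a batch that replaces it and then crashing leaves the old contents in place; committing the same batch publishes the new contents. In this model there is no intermediate state in which `baz` is empty.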
March 13th, 2009 at 11:31 am
Hi Ted,
Firstly, many thanks for your work on ext4. As to your question about the way forward, my vote is that the “data=alloc-on-commit” mount option is a good idea. I would very much like to see this implemented — there will always be misbehaving apps, and it would be reassuring that there is another layer of “protection”.
While the “data=alloc-on-commit” mount option may give a performance reduction, the big win in terms of ext4’s shorter fsck times would still be present.
March 13th, 2009 at 11:42 am
So basically you’re saying that “data=ordered” on ext3 is a horrible idea because it encourages broken behavior and penalizes correct behavior (the use of fsync()). So what you really need to do is fix ext3 to remove the penalty.
A solution to this conundrum would be to either add a kernel patch that spits out loud warning messages when mounting a partition with “data=ordered” or to make the kernel ignore “data=ordered” on ext3 or to make the kernel ignore fsync() when data=ordered is used on ext3. (Unless there’s another more creative solution to making fsync with data=ordered fast.) Of these, I think I’d go for number 2.
March 13th, 2009 at 11:55 am
While I think your technical comments are very sound I found your pokes at the Ubuntu distro to be somewhat pointless. This only serves to enforce in-fighting within the community. I’d have expected more from someone in your position.
March 13th, 2009 at 12:08 pm
I think that it should be a safe assumption that applications shouldn’t be expected to be aware of the properties of the filesystems they’re writing to. If they do things “The Right Way”, it should always be safe, and always be optimal, regardless of the filesystem. Is there really no way of fixing the sub-optimality of fsync() on ext3/data=ordered? If not, then it would suggest that there needs to be a new fsync() call that applications can use and which does the right thing (or even “nearly the right thing, tweakable by the sysadmin”) according to the filesystem of the mount point in question.
March 13th, 2009 at 12:20 pm
Would it be really gross and/or unhelpful to have a ‘max-fsyncs-per-path-per-min’ (or ‘max-fsyncs-per-fd-per-min’) mount option for ext3 to make fsync with data=ordered less expensive, allowing apps like Firefox to use fsync without any more penalty than the admin is prepared to bear?
March 13th, 2009 at 1:58 pm
What is the correlation between gaming on Ubuntu and ext4? The driver should not be writing any data to disk during runtime – that should be handled by userspace applications, correct?
If you check nvidia’s Linux forums, all of their crashes are due to their new driver series which have their own share of problems. Ext4 is arbitrary and completely unrelated to the abundance of stability bugs they have not fixed yet.
March 13th, 2009 at 2:48 pm
The best solution is to fix applications. Fix them so they don’t crash. Fix them to use filesystem commands correctly. For most applications, it shouldn’t affect their (or the whole system’s) performance to be fixed in one way so that they work approximately equally well no matter what filesystem was used, since many applications do not need to constantly save data, or could be rewritten that way.
In some cases, the fix is not so easy though, because of ext3’s combination of quirks AND popularity: applications have to run on it well, but have to take into account the poor way it was handling fsync (the example is what happened in firefox). In other words, in rare cases, you may want your application to run one way while interacting with ext4, and other future FSs, and a slightly different way while working under ext3. Is that possible?
There needs to be better quality control, testing, and interaction between Linux distros that provide a suite of applications, and the application developers themselves. What is really happening is distros pushing to use ext4 just because it was “better” than ext3, without any idea of what the applications they package actually would do, let alone how well they were written in their interactions with the fs. And thankfully it’s blown up early this time.
It is good to provide options on ext4, like Ts’o is doing. But distros really need to do their part: if they want users to use ext4, and future newer filesystems, by default, they need to quality control the applications they sponsor so that they work for a filesystem running at its most optimal, rather than with quirks, with the idea of eventually shipping such a quirk-free fs.
March 13th, 2009 at 2:51 pm
@56: IMHO it is anal to point out that POSIX makes no guarantees to crashing systems; of course it doesn’t (and it’s a point in the Unix haters FAQ for long: Unix file systems are fast, but not robust). So what kind of promise should a file system make that is not just fast, but also robust? Simple: Preserve a consistent state of the FS in case of a crash. POSIX consistency semantics is easy, because all file system operations are performed in order in POSIX. If you have a sequence write, close, rename, the rename happens clearly after the write.
Bernd,
That is simple, but it won’t be fast. CPU’s have been using techniques such as out-of-order execution for performance speedups. POSIX says that file system operations need to appear as if they are performed in order, but not that they have to be committed to stable store in strict order. If you want completely strict ordering, things will be slow. As a very extreme (low-level) example, the block device elevator algorithm reorders writes so that they reduce seeks on the disk; so instead of writing (or reading) blocks 42, 6000, 43, 10050, 6001, 6002, 44, 10051, etc., the block elevator sorts the reads (or writes) to avoid seeks: 42, 43, 44, 6000, 6001, 6002, 10050, 10051. Makes common sense, no? But by definition, we are reordering operations!
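The effect of that reordering is easy to put a number on. The sketch below is a deliberately simplified model: it treats seek cost as the distance between block numbers and uses a plain ascending sort as the “elevator”, which is an assumption made for illustration and not how any real I/O scheduler is implemented. Both function names are invented for this example.

```python
def total_seek_distance(blocks, start=0):
    """Sum of head movements (in blocks) when servicing requests in the given order."""
    distance, position = 0, start
    for block in blocks:
        distance += abs(block - position)
        position = block
    return distance


def elevator_order(blocks):
    """Simplest possible elevator: service the requests in ascending block order."""
    return sorted(blocks)


requests = [42, 6000, 43, 10050, 6001, 6002, 44, 10051]
print("arrival order:", total_seek_distance(requests))
print("sorted order: ", total_seek_distance(elevator_order(requests)))
```

Servicing the requests in arrival order moves the head back and forth across the disk several times, while the sorted order sweeps across it once.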
So the real question is not whether strict, simple ordering is required, but at what level and how much reordering is allowed. POSIX takes a fairly relaxed view on these things, probably because traditional Unix has never had these guarantees, and because many Unix systems simply don’t crash that often! Server class machines generally tend to have UPS’s; laptops have batteries that have the same effect of protecting against power failures. I have cheap UPS’s on what few desktop machines I have left, mainly to protect against power spikes and brownouts. And certainly I came from an era when Linux users were proud to announce uptimes measured in years.
There’s an old saying in computer science; you optimize for the common case. If the common case is that the system stays up, and you have fsync to enforce consistency points where it really matters, that is a perfectly valid solution. I agree with the observation a commenter has made that this is really an overall systems engineering problem. The OS-level system specification said one thing, but the implementation made it almost safe to ignore the spec, and applications started getting flabby and relying on ext3’s data=ordered. And if certain systems with proprietary binary drivers are constantly crashing, left and right, at the slightest provocation, then they very clearly have a very different common case than I am used to, so we need to adapt for them.
So I agree the file system has some measure of blame in this; and it’s one of the reasons why I’m willing to implement multiple methods of giving users a transition path back to the proper use of fsync(). But I also want to call out those applications that have gotten lazy, and encourage them to make changes as well. In the long term, assuming some approximation of strict consistency, even if it’s ext3’s level of consistency, which was by no means perfect, is going to hold Linux file system performance back. So I view an allocate-on-commit mount option as a temporary fix, not a permanent one, assuming we can get most application writers to Do The Right Thing. I’m an idealist, I know…
March 13th, 2009 at 2:56 pm
I wonder about the same thing Matthew Woodcraft said in #12, is there a way in ext4 to request rename but only care about atomicity, without the workaround patches? I guess in other words, what’s the POSIX-correct way of doing it?
March 13th, 2009 at 3:03 pm
So basically you’re saying that “data=ordered” on ext3 is a horrible idea because it encourages broken behavior and penalizes correct behavior (the use of fsync()). So what you really need to do is fix ext3 to remove the penalty.
Ken,
Well, you could consider ext4 a “fix” for this. You can take an existing ext3 file system, without any modifications or conversions, and mount it using the ext4 file system driver. Just change ext3 to ext4 in /etc/fstab, or for the root filesystem add rootfstype=ext4 to the boot command line, and your file system will be mounted using ext4 instead of ext3. So you can take a file system, try it out using ext4, and as long as you don’t enable any of the ext4-specific features using the tune2fs program, you can always back out and switch back to ext3 if you want.
You won’t get most of the new features of ext4, such as extents, for example, but you will get delayed allocation, and avoid the fsync() “performance problem” (which is really a latency problem). Of course, if you have a system where you don’t trust it not to crash at inconvenient times, and a lot of existing applications that aren’t using fsync() today, we will need to have a transition plan; and that’s what some of the patches queued for 2.6.30 and the alloc-on-commit idea are all about.
March 13th, 2009 at 3:05 pm
@ted: You are getting academical. I’m turning to think that if ext4 can’t provide a reliable upgrade for ext3, it should not be ext4, and you should not develop it that way.
It should be optimized for the common case. A common case is that the user expects the computer to be as fast as possible and no faster, keeping track of the data. When the computer fails to keep track of the data, it doesn’t matter how fast it does it for the common user.
Modern computers, devices everywhere are not the Unix of past. Unix is on my ipod (hacked it yesterday) well that’s not linux, but hopefully the next successful device like that will carry linux.. Does your cell phone have a UPS? Does it ever run out of battery? We want devices everywhere. And reading on laptops on the train.. using ext4 everywhere..
Producing the patches mentioned before, great pragmatism has been shown. I think that is important for the future of the linux desktop.
(Linux non-desktop can play with other filesystems, other settings)
March 13th, 2009 at 3:17 pm
@61: Would it be really gross and/or unhelpful to have a ‘max-fsyncs-per-path-per-min’ (or ‘max-fsyncs-per-fd-per-min’) mount option for ext3 to make fsync with data=ordered less expensive, allowing apps like Firefox to use fsync without any more penalty than the admin is prepared to bear?
Hmm…. that’s an interesting idea, but the problem is what might be appropriate for one file (say, firefox’s sqlite database) might not be appropriate for another file (say, a mysql or postgresql database file). What I would probably think is more appropriate would be a per-file flag “not_important”, which limited the number of fsync’s per minute, or better yet, changed fsync()’s behaviour on that file to return right away, instead of waiting for the commit to complete. After all, most people don’t really care that much if firefox loses track of the last few URL’s that it visited, but they would care if a file system compromised postgresql’s ACID properties.
March 13th, 2009 at 3:23 pm
I must say there is a disconnect here in terms of what actual application developers can do. I understand all the arguments on both why delayed allocation is good and how it’s allowed by POSIX.
However, I am the author of a library that does an “atomic replace file”. How do you know what files are important enough to require an fsync? This is generally not possible, even to the app that calls the function. So the only solution is to always fsync, which is of course a) slow on ext3, b) going to make all file saves sync-on-close anyway, so many advantages of delayed allocation are lost. (Not all though, we can still e.g. coalesce writes so that we can allocate contiguous regions.)
And this is not a theoretical question I have either. We need to decide how to actually handle this in glib. See the thread at:
http://mail.gnome.org/archives/gtk-devel-list/2009-March/msg00082.html
March 13th, 2009 at 3:40 pm
@49
As I’ve said already, because of delayed allocation, fsync() for ext4 is not the performance problem that it was under ext3. Ext3’s performance issue with fsync() comes from data=ordered, since that was what required flushing to disk a 2GB file being written to disk just because Firefox 3.0 was trying to fsync() a 30k sqlite database.
It’s still not crystal clear to me whether or not there would be a problem for ext4 performance (in the sense of reducing it to ext3 levels) if henceforth all applications were modified to follow all their close()/etc. operations with an fsync().
If the answer is “yes”, then we have a problem (in that you are pressing applications to be changed, and yet overall this would simply get them back to the performance & safety they already enjoy (?) with ext3.)
March 13th, 2009 at 4:01 pm
@64: What is the correlation between gaming on Ubuntu and ext4? The driver should not be writing any data to disk during runtime – that should be handled by userspace applications, correct?
damentz,
The correlation seems to be that Ubuntu users, especially those involved with gaming, seem to prize performance so badly that they are willing to use very marginal proprietary device drivers. At least, using a very unscientific sample of the people who were complaining on the Ubuntu Launchpad bug, many of the people who were complaining were talking about how their video drivers were causing their system to crash, or when they exited some game such as “World of Goo” (I didn’t know what that was until I looked it up) the system would lock up and they would have to reboot it.
You’re right that the game itself shouldn’t be doing any I/O, but if there was any activity happening in parallel with the game, that could result in lost files (why some GNOME or KDE application would be rewriting its dotfiles when the user is playing a game is a good question). In other cases, I suspect what was going on was that since they were using the unstable (beta?) driver, it was crashing even though they weren’t playing a game at the time.
Later on in the bug series, it’s clear that many of the people who were commenting weren’t necessarily gamers, but were simply opining on the state of affairs; the person who reported that svn wasn’t using fsync() and that pulling the plug right after running “svn update” wasn’t a gamer as far as I can tell, for example.
The bottom line is that the loss of file data requires three components: (1) hardware problems, environmental issues (unreliable power/no UPS), or unstable device drivers leading to crashes, (2) applications that don’t call fsync() and engage in various unsafe ways of writing files, and (3) a filesystem that implements delayed allocation without workarounds that degrade performance in the interest of providing better reliability in presence of (1) and (2). Gamers seem to be willing to tolerate (1) with respect to choosing video cards and bleeding-edge device drivers in order to get the very best Doom frames-per-second rating. I’m not all that good at first-person shooter games, and couldn’t tell the difference between 24 fps and 200 fps; but some people do seem to care about such things.
And Ubuntu, to its credit, has drawn in many Windows refugees who have been trained by the Windows blue-screen-of-death to be tolerant of systems randomly crashing much more frequently than Linux users who came out of more of a traditional Unix background. For better or for worse, that’s a cultural difference that is real.
March 13th, 2009 at 4:24 pm
Ted, obviously you have not played a game in a while. With a reasonably modern game (say from last 3-4 years) the open source drivers might provide 2-3 frames per second aka a slideshow, while proprietary drivers would provide 60+ frames per second aka a playable game. Same can be said about 1080p video playback. In many real-life cases the open source drivers are simply unusable due to the huge performance gap on the same hardware.
I don’t care about POSIX, nor has anyone else really cared about POSIX in the last 20 years. Software is designed to serve its users best. In case of a file system, applications are users. Including proprietary applications. If there is an established use pattern in which a new filesystem is losing data while the previous filesystem did not lose data, that is a bug and a regression in the new filesystem, regardless of what a decades old document says.
If you were writing ext4 in the 70s, that might have been ok, but nowadays file systems are held to a higher standard.
March 13th, 2009 at 4:48 pm
Ted: you’ve spent a lot of your energy talking about this, thanks. Despite some people suggesting that this will turn them off from ext4, I wonder if anyone remembers what it was like BEFORE journaled filesystems and you had a system crash: there was the possibility of lost data, silently corrupted files, and slower fsck times than today. I, for one, welcome our new delayed allocation and next-generation filesystem overlords.
If some of my files get zeroed out because my system crashed after “World of Goo” then my first instinct isn’t to get mad – it’s to hit Alt-Sysrq-s (to sync my filesystems) if the system has not yet rebooted. Would that be a total-hack remedy that would at least alleviate the symptoms if someone persists in using a configuration disposed to crashiness?
It’s good in a way that despite the “customer/user is always right” attitude (that is a consequence of increased adoption of linux) of some users both here and on the ubuntu bug, that this issue has gotten some light. You’ve done a good job on educating people on the truth of the issue, so what if you single out Ubuntu users who play “World of Goo” – there’s many of them on the bug and only one of you.
This brings up an interesting question: will future disks or chipsets have fast non-volatile (possibly flash-based) caches that could alleviate this problem?
I know drives already have caches on the order of tens of MB, but I’m talking larger, say a couple of GB. Such a non-volatile cache could give the benefits of having an on-disk journal, but also allowing staging of disk commits for further optimization.
March 13th, 2009 at 6:17 pm
@71: It’s still not crystal clear to me whether or not there would be a problem for ext4 performance (in the sense of reducing it to ext3 levels) if henceforth all applications were modified to follow all their close()/etc. operations with an fsync().
Frank,
So if applications were modified to precede every close() statement with an fsync(), ext4’s read and write performance would be fine. Keep in mind, fsync() doesn’t create (much) extra work for the disk drive; it just pushes it around. The “performance problem” with fsync() and Firefox 3.0 was a latency problem in Firefox 3.0’s UI responsiveness. With ext4 there will still be disk activity required when you do a fsync(), but it will be significantly less. And if lots of applications are calling fsync(), the average amount of disk activity required by each fsync() will go down, not up. So for normal desktop applications, if we modify them all to call fsync(), under ext4, it will still be better than ext3, since whether or not we sync the files doesn’t negate all of the other ext4 improvements (such as extent-based files).
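A minimal Python sketch of "fsync() before close()", for illustration only. The helper name `write_new_file` is made up and the temporary directory is just for the demo; `os.fsync()` is a thin wrapper around the system call being discussed:

```python
import os
import tempfile


def write_new_file(path, data):
    """Create path, write data, and fsync() before close() so I/O errors surface here."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # write() may be partial; loop until done
            view = view[written:]
        os.fsync(fd)                       # raises OSError if the data could not be stored
    finally:
        os.close(fd)


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "notes.txt")
write_new_file(demo_path, b"saved before close()\n")
```

The fsync() adds a synchronous wait at this point in the program; it does not add much total work for the disk, which is the distinction between latency and throughput made above.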
Now, when you say all applications, if you mean all, including the /bin/cp that is copying a 100GB video stream file, then if /bin/cp fsync()’s that video stream file, obviously that will take a while. If you want to make sure that file is safely stored before you fire up that “World of Goo” application that is likely to crash your system, then sure, manually type sync before you start “World of Goo”. Any scheme where you copy that much data is going to stress a filesystem and cause latency hiccups, whether it’s ext3’s data=ordered mode, or an explicit or implicit fsync operation. The best thing to do is to allow delayed allocation and the VM subsystem to stream out the file writes, gradually. But if you want the humongous file safely on disk before you do anything else, no matter whether the application calls fsync() explicitly, or the filesystem implicitly forces it out to disk, it’s going to cost. TANSTAAFL.
March 13th, 2009 at 6:34 pm
I think the best way to address the issue is to implement alloc-on-commit. Stable systems that need the performance badly (servers) would simply not use that mount option. The option would also support the “Don’t break userspace!” imperative.
March 13th, 2009 at 6:47 pm
[...] Ted Tso has written an in-depth blog entry about the recent “ext4” defect which he (and many other people) see as defects in the user software, not the kernel. [...]
March 13th, 2009 at 7:38 pm
What should really be fixed is ext3 flushing out the lot when asked to fsync just one file. How did that remain in the kernel for so long?
Everything that you said made sense to me. Apps have bugs and have to be fixed. Good apps already work correctly.
March 13th, 2009 at 7:39 pm
Ted, I checked dpkg’s source, and while it fsyncs after writing at least some(?) of its database files in /var/lib/dpkg, it does *not* fsync after renaming package files into place.
I suspect that rpm is similar, although I’ve only glanced through its source.
So, I suppose one could get lucky and lose ld.so or libc this way..
PS to @aigars: World of Goo performs acceptably well with free intel drivers. Where acceptably == well enough to have fun and finish all levels.
March 13th, 2009 at 9:01 pm
@71: However, as a author of a library that does an “atomic replace file”. How do you know what files are important enough to require an fsync? This is generally not possible, even to the app that calls the function.
Alexander,
I can easily see that the library can’t know; the application has to tell the library. And granted, in some cases, the application might not know (for example) whether or not the last position and size of the window is “critical information”. (I don’t, but apparently some people spend vast amounts of time exactly configuring the sizes and positioning of each of their windows in their desktops. I’ve definitely come to realize that part of what is going on here is a severe cultural disconnect.)
But an occasional fsync() is not the end of the world, even for ext3’s data=ordered mode. Again, please remember the primary problem with Firefox 3.0 was the latency that the fsync() call incurred; the fsync() call really doesn’t cause that much extra work for a filesystem, it just pushes things around, and adds latency by adding a synchronous wait.
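To sketch what "the application tells the library" could look like, here is a hedged Python example. It is not glib's API: the function name `atomic_replace` and the `durable` parameter are invented for illustration, and the file names are demo values.

```python
import os
import tempfile


def atomic_replace(path, data, durable=True):
    """Replace path with data via write-to-temp plus rename().

    durable=True adds the fsync() calls; durable=False relies on whatever
    the filesystem happens to do after a crash.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()                  # push Python's buffer into the kernel
            if durable:
                os.fsync(tmp.fileno())   # new contents on disk before the rename
        os.rename(tmp_path, path)        # atomic within a single POSIX filesystem
    except BaseException:
        try:
            os.unlink(tmp_path)          # do not leave the temp file behind
        except FileNotFoundError:
            pass
        raise
    if durable:
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)             # make the rename itself durable
        finally:
            os.close(dir_fd)


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "window-geometry.conf")
atomic_replace(demo_path, b"800x600+0+0\n", durable=False)     # caller says: not critical
atomic_replace(demo_path, b"1024x768+10+10\n", durable=True)   # caller says: keep this
```

The design choice here is simply that the caller, not the library, decides whether a given save is worth a synchronous wait. Note that `mkstemp()` creates the file with mode 0600, so a real library would also have to deal with permissions and ownership.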
So even if the application is doing something really antisocial, such as rewriting a dot file containing the window size and position 30 times a second as the mouse is moving the window around, and calls into a library which calls fsync() 30 times a second, this isn’t really going to be a huge performance problem, especially if the window is getting moved around in a separate thread from the one which is continuously calling fsync(). It won’t be good for SSD owners who care about needless writes trashing their SSD’s lifespan, and such a really stupid application design will waste disk write bandwidth and will probably slow down other applications, ext3 or ext4, fsync() or no fsync(); but it might not even be user visible.
So just as there was a “Don’t fear the penguin” campaign, I think there should be a “don’t fear the fsync()” campaign. What happened with Firefox 3.0, from what I understand, is that the thread which handled web navigation called into the sqlite library after every click to go to a new page. The sqlite library called fsync(), which in ext3’s data=ordered mode, incurred a massive latency if there was a large file copy happening by another process. If Firefox had called the sqlite library in another thread, there wouldn’t have been an issue. If Firefox had indicated to sqlite that keeping track of the fact that this was the user’s 32,157th visit to the Slashdot web site isn’t that important, maybe sqlite could have omitted the fsync() in that case. (This would be easier if Firefox segregated “important” information, such as user preferences, that doesn’t change that often, from “less important” information, such as whether the user has visited Slashdot 32,157 times versus 32,158 times.) The bottom line is that the occasional fsync() really isn’t that much of a problem; what is problematic is frequent calls to fsync() made by threads where the wait causes user visible latencies.
BTW, just as an aside, if sqlite had used fdatasync(), and managed growth of space of the database file in chunks (ideally, using fallocate(), which is how more sophisticated databases do things), then it wouldn’t have run into problems even using ext3’s data=ordered mode.
Anyway, hopefully this helps how you think about things for glib. Part of it really will be giving the applications a chance to weigh in, and if you are really worried, you can always kick off a thread and run fsync() in a thread. (BTW, note that fsync() may be one of the few ways that the kernel can signal an I/O error back to the program, so if you want to be notified about a file write not going well, you do want to pay attention to fsync()’s error return. One of the reasons why we don’t have an asynchronous fsync() is concerns about the fact that any I/O errors would get thrown away; of course, if the application never bothers to call fsync(), it wouldn’t hear about an error return anyway, so I’m not sure that’s such a huge issue.)
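A rough Python sketch of that idea, assuming a Linux system where `os.posix_fallocate()` and `os.fdatasync()` are available. The 1 MiB chunk size and the helper names are arbitrary choices for illustration, not how any particular database does it:

```python
import os
import tempfile

CHUNK = 1024 * 1024  # grow the file 1 MiB at a time (arbitrary)


def ensure_space(fd, needed_size):
    """Preallocate whole chunks so that ordinary writes do not change the file size."""
    current = os.fstat(fd).st_size
    if needed_size > current:
        new_size = -(-needed_size // CHUNK) * CHUNK      # round up to a chunk boundary
        os.posix_fallocate(fd, current, new_size - current)
        os.fsync(fd)   # the size change is metadata: pay for a full fsync() once per chunk


def write_record(fd, offset, record):
    """Write inside the preallocated region, then flush only the data."""
    ensure_space(fd, offset + len(record))
    os.pwrite(fd, record, offset)   # a real implementation would handle short writes
    os.fdatasync(fd)                # no size or allocation change to commit in the common case


db_fd = os.open(os.path.join(tempfile.mkdtemp(), "places.db"),
                os.O_RDWR | os.O_CREAT, 0o600)
write_record(db_fd, 0, b"slashdot.org visits=32157\n")
db_size = os.fstat(db_fd).st_size
stored = os.pread(db_fd, 12, 0)
os.close(db_fd)
```

Because the blocks and the file size are settled once per chunk, most calls to `fdatasync()` only have to push out the data pages that were just written.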
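One hedged way to picture "run fsync() in a thread" while still hearing about I/O errors, sketched in Python. The names `sync_pool` and `fsync_in_background` are hypothetical; the point is only that the error, if any, comes back through the returned future instead of being thrown away:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# A single worker plays the role of a dedicated "fsync thread".
sync_pool = ThreadPoolExecutor(max_workers=1)


def fsync_in_background(fd):
    """Queue an fsync() on fd; the returned Future carries any OSError."""
    return sync_pool.submit(os.fsync, fd)


fd, demo_path = tempfile.mkstemp()
os.write(fd, b"window position: 10,10\n")

future = fsync_in_background(fd)   # the UI thread would return to its event loop here
error = future.exception()         # later: blocks until done; None means fsync() succeeded
os.close(fd)                       # close only after the background fsync() has finished
sync_pool.shutdown(wait=True)
```

The file descriptor has to stay open until the background fsync() completes, which is the main bookkeeping cost of this approach.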
March 13th, 2009 at 9:10 pm
Aigars (#74) you’re stretching it. The “established use pattern” of the proprietary software you’re talking about is *crashing*.
March 13th, 2009 at 9:49 pm
Ted,
I am running Xubuntu Jaunty Alpha 5 (updating at least twice a day) and have no problems with ext4. I love it just the way it is. The performance difference between ext3 and ext4 is definitely noticeable. If I want to game, I use my windows partition. The problem is not with ext4, but with applications that need to be re-written and graphics card chip makers who are unwilling to write compatible Linux drivers for their hardware in a timely fashion – if at all.
March 13th, 2009 at 9:59 pm
In comment #49, Ted says:
First of all, that’s not NFS’s model; NFS v2/v3 requires that each write RPC call not return until the data has hit stable storage. So in fact it’s a stronger requirement than alloc on close.
This statement is misleading.
Firstly, it accurately describes the NFSv2 semantics, but no sane person deliberately uses NFSv2 anymore so the statement is of no help in the real world.
The NFSv3 semantics are more flexible. The v3 WRITE RPC adds a flag which allows the client to say whether it wants the data on stable storage before the RPC returns, i.e. whether to do the old slow thing that was the only way with NFSv2. This flag is hardly ever used by clients (O_SYNC or a “sync” mount enable it, as does O_DIRECT in most circumstances). Instead clients will typically send a bunch of WRITE calls with data, and then a COMMIT call which does the actual forcing of data to server-side stable storage. This is significantly faster than the NFSv2 model.
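The difference between the two models can be pictured with a toy in-memory model in Python. This is not NFS code and `ToyNfsServer` is an invented name; it only tracks which writes would survive a server crash:

```python
class ToyNfsServer:
    """Toy model only: which written data is on stable storage at a given moment."""

    def __init__(self):
        self.cache = {}    # offset -> data held in server memory
        self.stable = {}   # offset -> data on stable storage

    def write(self, offset, data, stable=False):
        if stable:
            self.stable[offset] = data   # NFSv2-style: on disk before the reply
        else:
            self.cache[offset] = data    # unstable WRITE: reply without waiting for disk

    def commit(self):
        self.stable.update(self.cache)   # COMMIT: force cached writes to stable storage
        self.cache.clear()

    def crash(self):
        self.cache.clear()               # a real client keeps its copy until COMMIT succeeds


server = ToyNfsServer()
for i in range(3):
    server.write(i * 4096, b"x" * 4096)  # three unstable WRITEs
before_commit = len(server.stable)       # nothing is guaranteed on disk yet
server.commit()
after_commit = len(server.stable)        # all three are now stable
```

One COMMIT covers a whole batch of WRITEs, which is where the speedup over a synchronous wait per WRITE comes from.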
NFSv4 behaves like NFSv3 in this regard, but adds a further feature called file delegations which complicate the picture even further.
Now to actually answer Frank’s question. But first some background.
The NFS protocol doesn’t know about any block-level behaviour like allocation, it works entirely on files. When the allocation occurs is a server-side implementation detail entirely. Having the data on stable storage is indeed a stronger requirement than forcing allocation, but the server could choose to do the allocation at any time from the start of unstable WRITE RPC to the end of the COMMIT RPC, which can be a window of several seconds.
NFS however does keep data in the client which has been written by applications on the client but not yet sent to the server. If the lifetime of the application is short and the file is small, this could include the entire data of the file.
NFS clients practice a behaviour known as CTO or “Close-To-Open consistency”, which is a very weak form of inter-client file cache consistency. This means that when the application close()s the last fd the client will perform the equivalent of an fsync(), i.e. issue WRITEs to the server for any dirty data remaining in the client and a COMMIT to force that data to stable storage on the server. In other words, when close() returns to the app, the data is safe on the server. This is the behaviour on close() that Frank refers to above. Note that this is a much tighter constraint than POSIX requires.
I can think of two reasons why the ext4 folks probably don’t want to introduce that semantic.
Firstly (as Ted is discovering) application writers are lazy. Despite the clear advice in the close(2) man page, most application developers do *not* call fsync() and do *not* check the error code from close(). With NFS, this can lead to really bad drama if an error like ENOSPC occurs during the writing of dirty data on close(). In the worst case, the entire file’s data can be lost without the application even noticing that something went wrong. I have no idea what the fallout would be for such an application programming error on ext4.
Secondly, one of the stated design reasons for delayed allocation on XFS was for /tmp files, i.e. small files which are created, written, read, and unlinked in rapid sequence. The idea was that such a file would never require allocation or indeed any disk IO at all. You may consider this moot given tmpfs; but the fact remains that not every file needs to be safe on the disk as soon as it’s written and we have fsync() to allow application programmers to tell us which files those are.
March 13th, 2009 at 10:33 pm
@ted: I’m really excited to hear that ext4 fixes the old “fsync and sync are the same” issue with ext3. But for day-to-day desktop use I will continue to hate fsync (and I’ve disabled emacs’s fsync-on-save!), because I work on a laptop with laptop-mode, and fsync forces a disk spin up. Having my editor freeze waiting for the disk to spin up every time I hit save is just unusable.
While it’s POSIXly correct that apps should call fsync a lot more than they do, and ext4 will even make this viable latency-wise on systems where power usage isn’t a priority, we may need to expose something like ‘fdatasync_at_next_metadata_writeout’ to avoid a future where every app is calling fsync all the time, and my disk never gets to spin down.
…Or maybe app writers will just continue to ignore fsync until SSDs arrive and make this irrelevant (at least assuming SSDs don’t see much power/latency benefits from batching writes).
March 13th, 2009 at 11:01 pm
@ted#81: I’m not sure things work out the way you describe regarding “less important information, such as whether the user has visited Slashdot 32,157 times versus 32,158 times”… obviously if you omit fsync and all you lose is a counter increment then no-one will care. OTOH, if you lose the entire database of all sites you’ve visited, then that’s much worse. I don’t see how sqlite can guarantee its database will be readable *at all* after a crash without calling fsync.
(IIUC, sqlite’s transactions work as: create a lookaside rollback file; copy out all the database file pages that will be modified to the rollback file; fsync the rollback file; mutate the database file in place; fsync the database file; unlink the rollback file; fsync the directory to flush the unlink. If we don’t care about durability, then the fsyncs are overkill, but still the rollback file must hit the media before any writes to the database file, and the database file writes must hit the media before the rollback file metadata update. Without these ordering guarantees, your database file may end up an arbitrary mixture of different half-applied transactions, right?)
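The sequence described above can be sketched in Python as follows. This is a toy that follows the steps as listed in the comment, not sqlite's actual file format or code; `commit`, `fsync_dir`, and the 4096-byte page size are illustrative assumptions, and crash recovery (replaying the rollback file) is left out:

```python
import os
import tempfile

PAGE = 4096  # toy page size


def fsync_dir(path):
    """fsync() the directory containing path, so directory changes reach the disk."""
    fd = os.open(os.path.dirname(path), os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def commit(db_path, changes):
    """Apply {page_number: new_page_bytes} to db_path using a rollback file."""
    journal_path = db_path + "-journal"
    db = os.open(db_path, os.O_RDWR)
    try:
        # Create the rollback file and copy out every page about to be modified.
        journal = os.open(journal_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            for page_no in sorted(changes):
                old_page = os.pread(db, PAGE, page_no * PAGE)
                os.write(journal, page_no.to_bytes(8, "big") + old_page)
            os.fsync(journal)      # old contents must be on disk before we overwrite them
        finally:
            os.close(journal)
        # Mutate the database file in place.
        for page_no, new_page in changes.items():
            os.pwrite(db, new_page, page_no * PAGE)
        os.fsync(db)               # new contents on disk before the rollback file goes away
    finally:
        os.close(db)
    os.unlink(journal_path)        # removing the rollback file completes the transaction
    fsync_dir(journal_path)        # flush the unlink


db_path = os.path.join(tempfile.mkdtemp(), "toy.db")
with open(db_path, "wb") as f:
    f.write(b"A" * PAGE + b"B" * PAGE)
commit(db_path, {1: b"C" * PAGE})
```

Each fsync() here acts as an ordering barrier between two steps; dropping them gives up not only durability but also the ordering that keeps the file readable after a crash, which is the concern raised above.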
March 13th, 2009 at 11:05 pm
@79: What should really be fixed is ext3 flushing out the lot when asked to fsync just one file. How did that remain in the kernel for so long?
Flushing all dirty files is a consequence of data=ordered mode, and the issue of “entangled writes”. Consider that there can be up to 32 inodes in a single 4k block, and that multiple file system operations can be happening in parallel on that file system, that might modify that same inode table block.
So suppose one file system operation is creating a new inode; so it touches the inode table block, but it also touches the inode allocation bitmap, and a directory block to update a directory entry, and if the directory needs to be extended, a block allocation bitmap block as well. Another file system operation might be deleting a file, which might also touch the same inode table block, and also touch the same block allocation bitmap block, and perhaps a different directory entry block. And a third filesystem operation might be appending to a third inode in the same inode table block.
OK, now let’s assume that a program calls fsync() on that third inode. The problem is that when we write out the inode table block, we also are writing out the file system operations for the first and second inode. Even if we can filter those out, we also have to worry about changes to block allocation bitmap that may have been involved with other file system operations. This is what is called the entangled writes problem. When we fsync() the third inode, we have to synchronize all modified inodes. Now, if we are only doing this for metadata blocks, this isn’t a major overhead. However, in data=ordered mode, we need to write out all of the data blocks as well. If we didn’t, then one of the newly allocated blocks which hadn’t yet been initialized might end up revealing previously written data which could be a security exposure.
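The arithmetic behind "up to 32 inodes in a single 4k block" can be made concrete with a small Python sketch. It assumes 128-byte on-disk inodes and a single flat inode table indexed from zero, which ignores block groups; the function names are made up:

```python
BLOCK_SIZE = 4096
INODE_SIZE = 128                              # assumed on-disk inode size
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE   # 32


def inode_table_block(index):
    """Which inode table block holds the inode at this zero-based table index."""
    return index // INODES_PER_BLOCK


def neighbours(index):
    """All table indexes stored in the same block as index."""
    first = inode_table_block(index) * INODES_PER_BLOCK
    return range(first, first + INODES_PER_BLOCK)


# Three unrelated operations: a create, a delete, and an append.
created, deleted, appended = 64, 70, 95
shared_block = {inode_table_block(i) for i in (created, deleted, appended)}
print(shared_block)  # {2}: writing out this one block carries all three changes
```

An fsync() on the appended inode has to write block 2 of the table, and that block also contains the half-finished create and delete.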
One of the fundamental differences between a file system and a database is that a database will prevent one transaction from reading or modifying a table or a row (depending on whether it is doing table or row locking) that an earlier transaction has modified, until the earlier transaction is committed. File systems don’t allow for transactions to be aborted, and they allow for interleaved transactions for efficiency’s sake.
In any case, data=ordered basically means that for security reasons, we want to push out to disk all data blocks belonging to all inodes to be committed before the commit, and the entangled writes issue is what causes us to need to commit all modified inodes when we want to commit a single inode.
March 13th, 2009 at 11:30 pm
@87: In any case, data=ordered basically means that for security reasons, we want to push out to disk all data blocks belonging to all inodes to be committed before the commit
Yeah, data=ordered mode seems like a really bad idea now. It turned a perfectly good call (fsync) into something that’s completely unusable (i.e. the infamous FF bug).
March 14th, 2009 at 12:14 am
Flushing all dirty files is a consequence of data=ordered mode, and the issue of “entangled writes”. Consider that there can be up to 32 inodes in a single 4k block, and that multiple file system operations can be happening in parallel on that file system, that might modify that same inode table block.
To what extent could some effort be directed at solving this single problem? Not that I find firefox 3’s fsync-happy behavior a positive role model, but what if ext3 was taught to “disentangle” writes on the go by tactically sacrificing some disk space (un-sharing those 4k pages — sort of how we try to avoid sharing cross-cpu data on single cache lines)?
March 14th, 2009 at 3:02 am
The expected use patterns are open-read-truncate-write-close or openA-readA-closeA-openB-writeB-closeB-replaceBA . And the expectation is that in case of a power loss, crash, cable disconnect or any other error in the end we would have either the new file contents or the old file contents.
ext4 violates that – if your power goes or a crash happens within X seconds from either of the above operations (with X being pretty high, like 30+) you will definitely get a 0 length file and ALL data from that file (both old and new) will be lost.
That is not acceptable for a modern filesystem, given that the above usage patterns are commonplace. Maintaining integrity of the data in the files is the job of a filesystem.
And trying to change thousands of applications to call an extra fsync to avoid losing data is not a solution, because: 1) it is a needless complication of the API, 2) it is much easier to fix this in one place (the filesystem) than thousands of separate places, 3) if applications would start doing that, their performance will be severely degraded in the most common filesystem currently – ext3 (or they would have to become severely more complicated and use multiple threads).
March 14th, 2009 at 3:04 am
Alloc-on-commit seems like a bit of a sledgehammer to me. I have an alternative suggestion. There are lots of applications that don’t need guaranteed (i.e. synchronous) durability, but which benefit greatly from expedited durability, especially after a file has been closed.
In most cases, if a large file is being written and the system crashes, it doesn’t matter if the file is not recoverable. But once a file is closed, it should become recoverable as soon as possible.
On the filesystem level, the asynchronous allocation and write of blocks for files that have been closed should start at a configurable delay that is different (i.e. much smaller) than the delay for the asynchronous allocation and write of blocks for files that have not been closed.
It would also be very helpful to have a call comparable to posix_fadvise that advises the system to expedite or delay making outstanding writes on an open file durable. A typical use would be applications that generate durability sensitive log files.
If expedite durability was specified, a filesystem might make writes to the file durable every few seconds, if delay durability was specified, a filesystem might choose an extra long delay. By default, a filesystem might do something in between. The durability scheduling advice should be specifiable independently for pre-file-close and post-file-close operation.
Something like:
sys_fadvise(fd, FADV_WRITE_DURABLE_EXPEDITE);
sys_fadvise(fd, FADV_CLOSE_DURABLE_EXPEDITE);
sys_fadvise(fd, FADV_WRITE_DURABLE_NORMAL);
sys_fadvise(fd, FADV_CLOSE_DURABLE_NORMAL);
sys_fadvise(fd, FADV_WRITE_DURABLE_DELAY);
sys_fadvise(fd, FADV_CLOSE_DURABLE_DELAY);
where the filesystem typically expedites a normal close more than a normal write.
March 14th, 2009 at 3:10 am
and 4) fsync() would be inappropriate in most of these cases, because the applications mostly do not really care that the changes are saved to disk. As you said, it is not really important whether the fact that the user visited Slashdot for the 32546th time is saved; however, that does not mean that the whole visited pages database is unimportant and can be nuked at will.
March 14th, 2009 at 3:43 am
@89: 1) it is a needless complication of the API,
Could you please go and read the manual page for close() and then see that the API is already set to be like this. On close(), there are no guarantees that what you wrote will be on disk. You _must_ call fsync() to get that.
The underlying problem is that ext3 ordered mode screwed fsync(), so programmers stopped using it, because it could effectively lock the machine up. That’s exactly how XFS got blamed for a lot of “data loss”, which was in fact sloppy programming.
Ask yourself this: why is it that emacs carefully and properly does all this?
@91: 4) fsync() would be inappropriate in most of these cases, because the applications mostly do not really care that the changes are saved to disk.
Well, if you don’t care that it’s not on disk, then what is the problem? Well, the problem is that you cannot have it both ways. If you want to replace a configuration file with the new one, you obviously care that at least _one_ of the files (either old or new) is there and contains something. In order to get that, you _must_ call fsync() on the new file before renaming, otherwise there will be no guarantee that your file will not be empty. ext4 works exactly as permitted by documentation.
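A minimal sketch of that write-fsync-rename sequence in C, for illustration only. The helper name replace_file and the ".new" suffix are assumptions made for this example, not taken from any application discussed here, and the sketch ignores permissions and symlinks.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `buf` so that,
 * after a crash, `path` holds either the old or the new contents.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int replace_file(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be short or fail with ENOSPC; loop until all is written. */
    const char *p = buf;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        len -= (size_t)w;
    }

    /* Force the new data to disk *before* the rename makes it visible. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }

    /* rename() atomically switches the name over to the new file. */
    if (rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

If any step fails, the old file is left untouched and the temporary file is removed, so the caller can report the error and retry.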
March 14th, 2009 at 7:03 am
@70: the problem is what might be appropriate for one file (say, firefox’s sqlite database) might not be appropriate for another file (say, mysql or postgresql database file). What I would probably think is more appropriate would be a per-file flag “not_important”, which limits the number of fsync’s per minute, or better yet, changes fsync()’s behaviour on that file to return right away, instead of waiting for the commit to complete. After all, most people don’t really care that much if firefox loses track of the last few URL’s that it visited, but they would care if a file system compromised postgresql’s ACID properties.
Absolutely; I figured a mount option would be the best compromise between administrative granularity and preventing unexpected behaviour (after all, SQL databases usually end up under /var/lib rather than /home, and everyone else does what I do and doesn’t just mount a huge / filesystem, right?!)
If a per-file flag is achievable, I guess that’s better. How will that work for newly created files (i.e. open-write-fsync-close-rename), though? That said, I do like the idea of making lazy-fsync the exception rather than the rule, though, which would encourage applications to use fsync sensibly.
March 14th, 2009 at 8:43 am
@84: Greg,
I stand corrected about how NFSv3 doesn’t require synchronous writes. Clearly my knowledge of NFS is somewhat out of date. I knew NFSv4 was different, but I didn’t realize that NFSv3 had also provided a fix to this issue. Close-to-open commit semantics also seems to post-date the NFSv3 RFC’s, and seems to be something you can only find discussed on various powerpoint decks and one or two NetApp authored papers. You didn’t say this explicitly, but Close-to-Open is needed to allow for synchronization between different NFS clients, which is its raison d’être.
We actually go to a fair amount of effort to return ENOSPC errors on the write() system call, not waiting until the fsync() or close(). However, like all local disk file systems, if you want to be notified about local I/O errors, you have to use fsync() and check its error return. Close() doesn’t imply flushing the file out to disk — the close(2) man page warns of this, as others have already pointed out.
One other reason why I’m not particularly fond of enforcing an fsync on close semantics is that it will seriously hurt performance as well as SSD lifetime for certain workloads, and not just /tmp. Consider “quilt push/pop -a”, or “git rebase --interactive”, or a series of “git merge” commands. All of these commands modify the source tree, with source files getting modified multiple times before the “quilt push -a” or “git rebase --interactive” completes. There really is no point to force each intermediate version of the file to disk, especially in the case of the SSD, when it will be modified very shortly afterwards. Even with ext3’s data=ordered mode, it won’t push files out to the disk if they end up getting re-modified and/or deleted before the next commit.
March 14th, 2009 at 9:05 am
@Theodore Ts’o
> 3.a) open and read file ~/.kde/foo/bar/baz
> 3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
> 3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
> 3.d) fsync(fd) — and check the error return from the fsync
> 3.e) close(fd)
> 3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") (this is optional)
> 3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
> (3) is the ***only*** thing which is guaranteed not to lose data.
But it’s not right either.
It assumes you have permission to write the .new file.
It assumes this file doesn’t exist already (or can be overwritten).
It uses an fsync, which may not be required (if atomicity but no durability is desired).
It doesn’t retain permissions of the old file.
If the target is a symlink, it gets replaced by a normal file.
If you do 3.f, there is a window where no file exists at all.
It’s too complex, so needs to be wrapped in library functions.
I think a concept like atomic updates (O_ATOMIC?) is needed. This would guarantee other apps and the disk (after a crash) either see the old file or the new file, but nothing else.
March 14th, 2009 at 9:13 am
@85: But for day-to-day desktop use I will continue to hate fsync (and I’ve disabled emacs’s fsync-on-save!), because I work on a laptop with laptop-mode, and fsync forces a disk spin up. Having my editor freeze waiting for the disk to spin up every time I hit save is just unusable.
Nate,
I guess I don’t run on batteries that often (mainly because LiON’s only have 200 charge/discharge cycles), so I religiously try to run on AC mains whenever possible. Also, I tend to not enable very aggressive APM mode for disks, because laptop drives are only rated for 600,000 head load/unload cycles. A few years ago, when I enabled the most aggressive battery management mode, I noticed with smartctl after 4 or 5 months that the drive had already gone through 300k load/unload cycles. I disabled the aggressive power management mode on that drive very shortly afterwards!
I can respect wanting to turn off fsync()’s on laptops if you want to avoid hard drive spinups, and if it’s a decision made by the system administrator or laptop owner, and the laptop owner is confident that their system isn’t going to crash if you breathe on it wrong, why not? Also, if it is something which is manually enabled by the laptop owner, they can’t then whine on a Launchpad bug entry about how file systems are supposed to compensate for badly written applications and unstable device drivers. It’s a decision they made.
It’s probably something that we would want to enable only when the laptop is running on batteries, and only if the user is willing to accept the risks involved. If they are using some extreme gaming laptop with ultra-unstable drivers, this would obviously be a Really Bad Idea to disable fsync()’s. In fact, the file system on such a system should probably have an implicit global sync at each commit, just like ext3 does today with data=ordered.
March 14th, 2009 at 9:59 am
@95 Ted says:
Close-to-open commit semantics also seems to post-date the NFSv3 RFC’s, and seems to be something you can only find discussed on various powerpoint decks and one or two NetApp authored papers.
It’s one of several important implementation details not mentioned in the RFCs. Most are documented in “NFS Illustrated”, aka the Gospel According to Brent. CTO is covered in section 8.14.3.
CTO is, not to mince words, a nasty hack which should not be necessary for local filesystems.
We actually go to a fair amount of effort to return ENOSPC errors on the write() system call, not waiting until the fsync() or close().
This is an excellent thing to do, and I wish the NFS protocol were capable of supporting it.
One other reason why I’m not particularly fond of enforcing an fsync on close semantics is that it will seriously hurt performance as well as SSD lifetime for certain workloads, and not just /tmp.
I agree entirely.
Just a thought: perhaps you could have an inheritable per-file “I’m a config file so please sync me on close” flag which is set on /etc/, ~/.kde/ and ~/.gnome2/ and inherited as new files get created in those directories.
March 14th, 2009 at 10:23 am
@86: I’m not sure things work out the way you describe regarding “less important information, such as whether the user has visited Slashdot 32,157 times versus 32,158 times”… obviously if you omit fsync and all you lose is a counter increment then no-one will care. OTOH, if you lose the entire database of all sites you’ve visited, then that’s much worse. I don’t see how sqlite can guarantee its database will be readable *at all* after a crash without calling fsync.
Nate,
Yeah, there are two problems here. One is that if applications put important and unimportant information in the same database, and the unimportant stuff needs frequent updates, you have to fsync() or fdatasync() all of the information.
IIUC, sqlite’s transactions work as: create a lookaside rollback file; copy out all the database file pages that will be modified to the rollback file; fsync the rollback file; mutate the database file in place; fsync the database file; unlink the rollback file; fsync the directory to flush the unlink.
Yeah, if that’s how sqlite works, then it’s going to be hard to avoid requiring metadata changes that will (a) require the use of fsync() not fdatasync(), and (b) mean that on ext3, sqlite is going to incur potentially serious latency problems if data=ordered mode is enabled. If it didn’t create and destroy the log file, and instead kept internal information about whether the log file was valid or not given the version of the database file, and if the log file and the database file were mutated in place and flushed using fdatasync(), it could be made much more efficient, fast, and safe for ext3. Disabling fsync()’s may have made the performance problem for sqlite go away, but at the cost of losing sqlite’s ACID properties; there are better ways of fixing this problem, but they require making some possibly non-trivial changes to sqlite.
If we don’t care about durability, then the fsyncs are overkill, but still the rollback file must hit the media before any writes to the database file, and the database file writes must hit the media before the rollback file metadata update. Without these ordering guarantees, your database file may end up an arbitrary mixture of different half-applied transactions, right?
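The rollback-file sequence quoted above (create the rollback file, copy out the pages, fsync it, mutate the database in place, fsync it, unlink the rollback file, fsync the directory) can be sketched in C roughly as follows. This only illustrates the ordering as described in this thread; it is not SQLite's actual code. The names update_page, sync_dir and DB_PAGE, the single-page transaction, and the trivial rollback-file layout (an offset followed by the old page) are all assumptions made for the example.

```c
#include <fcntl.h>
#include <unistd.h>

enum { DB_PAGE = 4096 };   /* assumed page size for this sketch */

/* fsync a directory so that a create or unlink inside it reaches the disk. */
static int sync_dir(const char *dir)
{
    int fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* Overwrite one page of `db_path` in place, protected by a rollback file.
 * Returns 0 on success, -1 on error. */
int update_page(const char *dir, const char *db_path, const char *journal_path,
                off_t offset, const char new_page[DB_PAGE])
{
    char old_page[DB_PAGE];
    int rc = -1;
    int jfd = -1;
    int db = open(db_path, O_RDWR);
    if (db < 0)
        return -1;

    /* 1. Create the rollback file and copy out the page we will modify. */
    if (pread(db, old_page, DB_PAGE, offset) != DB_PAGE)
        goto out;
    jfd = open(journal_path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (jfd < 0)
        goto out;
    if (write(jfd, &offset, sizeof offset) != (ssize_t)sizeof offset)
        goto out;
    if (write(jfd, old_page, DB_PAGE) != DB_PAGE)
        goto out;

    /* 2. The rollback file must be on disk before the database is touched. */
    if (fsync(jfd) < 0)
        goto out;

    /* 3. Mutate the database file in place, then make that durable. */
    if (pwrite(db, new_page, DB_PAGE, offset) != DB_PAGE)
        goto out;
    if (fsync(db) < 0)
        goto out;

    /* 4. Commit: remove the rollback file and flush the unlink itself. */
    if (unlink(journal_path) < 0)
        goto out;
    if (sync_dir(dir) < 0)
        goto out;
    rc = 0;

out:
    if (jfd >= 0)
        close(jfd);
    close(db);
    return rc;
}
```

Each transaction here costs three fsync() calls plus a file create and an unlink, which is the metadata traffic the reply above is concerned about under ext3's data=ordered mode.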
March 14th, 2009 at 10:26 am
@89: To what extent could some effort be directed at solving this single problem? Not that I find firefox 3’s fsync-happy behavior a positive role model, but what if ext3 was taught to “disentangle” writes on the go by tactically sacrificing some disk space (un-sharing those 4k pages — sort of how we try to avoid sharing cross-cpu data on single cache lines)?
Frank,
You can’t really do that without making major substantive changes to ext3, and it’s really not as easy as you might think. You could put each inode on its own 4k page (although this will waste a lot of disk space), but what about the fact that a directory block holds entries referring to many different inodes; do you put each directory entry on its own 4k page? What about the block and inode allocation bitmaps? No, it’s really not practical.
Besides, we already have a solution to this problem which is completely backwards compatible to ext3, that doesn’t require massive changes to ext3’s file system format; it’s called ext4, and delayed allocation.
March 14th, 2009 at 10:41 am
@90:The expected use patterns are open-read-truncate-write-close or openA-readA-closeA-openB-writeB-closeB-replaceBA . And the expectation is that in case of a power loss, crash, cable disconnect or any other error in the end we would have either the new file contents or the old file contents.
Aigars,
If you really want that, in 2.6.30, ext4 will give you that in the case of openA-readA-closeA-openB-writeB-closeB-renameBA. And Ubuntu has backported those patches into the Jaunty kernel. Happy? Linux has over 60 different filesystems, and these are not reasonable expectations for Unix-class file systems, unless you call fsync() before closeB. It has been documented for a long time, and it has been in the standards spec for a long time, and that’s just the way it is. If you don’t like it, you had better not use any other file systems. If you are an application author, and you want your applications to be portable and safe on other systems, then you should put in the fsync(); it’s as simple as that.
As far as open-read-truncate-write-close, this has always, always, always been unsafe, even under ext3. If you get unlucky and crash between the truncate and the write, there’s really nothing the file system can do. You can’t control when ext3 will choose to place its commits, and in some cases, ext3 doesn’t have a choice; if there is no more journal space available, it might have to place a commit after the truncate, and if the system crashes then, there is really nothing you can do. As the Eat My Data presentation has pointed out, file systems are not databases and databases are not file systems. Ext3 can’t abort a transaction due to a lack of journal space and force an application to retry a transaction like SQL can, and the Unix/POSIX api doesn’t allow you to surround open-read-truncate-write-close with a SQL BEGIN TRANSACTION and an SQL END TRANSACTION. If you have those expectations, as the Man in Black said in Princess Bride, “Prepare to be disappointed”. Sooner or later you will lose, no matter what file system you are using. And you know what? If you want to switch to another file system, please, do so. You have unrealistic expectations that no one can satisfy, so you might as well complain at someone else when they also can’t do the impossible. Meanwhile, I’ll try to do what I can to meet even badly considered expectations, even as I try to wean people off them, but if you want me to do the impossible, sorry, I just don’t have any magic fairy dust.
And remember, “don’t fear the fsync”.
March 14th, 2009 at 10:51 am
Oh, by the way, the other problem with open-read-truncate-write-close is that if the new contents is bigger than the old, you could end up with an incompletely written file, when write() returns ENOSPC. At that point, you’ve lost the old contents of the file, and have a corrupted (partially-written) version of the new contents.
Fundamentally, open-read-truncate-write-close is broken. People who use it should have their programmer’s license revoked; but unfortunately, there are a lot of incompetent application programmers out there, which is why we do try to accommodate broken application programs to some extent, but there are limits to what we can do….
March 14th, 2009 at 11:06 am
@96: Olaf,
You can work around a lot of the problems with open-write-close-fsync-rename, such as copying over the permissions, etc., and if you are worried about the window where the file doesn’t exist, then you can replace the optional “rename(foo, foo~)” step with “unlink(foo~); link(foo, foo~)” instead. Note that it’s optional because it’s not necessarily required that we keep an old copy of the file around. Following symlinks can be handled, too.
Yes, it’s complicated, but people can provide libraries which do this. The standard Unix design rule still applies; if it can be done in user space, and there aren’t sufficient reasons to justify putting it in the kernel, it’s better to do it in a user space library. I believe glibc has such a library function, for example.
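As a rough illustration of two of those workarounds, here is a sketch in C of copying the old file's permission bits onto the new file and of keeping a backup with unlink()/link() instead of rename(). The helper names open_new_like and keep_backup are hypothetical. The sketch handles only the mode bits and does not attempt ownership, ACLs, extended attributes or symlink resolution.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create `new_path` for writing with the same permission bits as the
 * existing `old_path` (0600 if there is no old file yet).
 * Returns an open file descriptor, or -1 on error. */
int open_new_like(const char *old_path, const char *new_path)
{
    struct stat st;
    mode_t mode = 0600;

    if (stat(old_path, &st) == 0)
        mode = st.st_mode & 07777;
    else if (errno != ENOENT)
        return -1;

    int fd = open(new_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Unlike the mode passed to open(), fchmod() is not filtered by umask. */
    if (fchmod(fd, mode) < 0) {
        close(fd);
        unlink(new_path);
        return -1;
    }
    return fd;
}

/* Keep the current `path` reachable as `backup` (e.g. "foo~") without ever
 * removing `path` itself: drop any stale backup, then hard-link to it.
 * A missing `path` is not treated as an error. Returns 0 or -1. */
int keep_backup(const char *path, const char *backup)
{
    if (unlink(backup) < 0 && errno != ENOENT)
        return -1;
    if (link(path, backup) < 0 && errno != ENOENT)
        return -1;
    return 0;
}
```

A caller would write and fsync() the descriptor returned by open_new_like(), close it, call keep_backup() if a backup is wanted, and only then rename() the new file over the old name.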
As far as not liking fsync() because it provides guarantees which perhaps the application doesn’t need — remember, “don’t fear the fsync()”. It really doesn’t have to be that bad. And it turns out that enforcing a required set of ordering semantics everywhere else, even when it may not be needed, will end up costing more performance system-wide than simply calling fsync() where it is needed.
The important thing to consider is that not all files are precious. For example, object files in a build tree aren’t precious; they can be easily rebuilt. If you force an implicit ordering everywhere, you will slow down the system where it is not really necessary, because the system can’t figure out which ordering is important, and where ordering might not matter and reordering writes might make a significant performance difference.
March 14th, 2009 at 11:40 am
@103:
I prefer solutions over work arounds.
Copying permissions etc is also easier said than done I guess. Don’t forget ACLs. What about other file attributes?
And the other arguments?
Do you happen to know the name of that glibc function?
> It really doesn’t have to be that bad.
I like the zero-cost principle, don’t pay for what you don’t use.
> And it turns out that enforcing a required set of ordering semantics everywhere else,
What semantics are you referring to? The ones I proposed for O_ATOMIC?
Aren’t those nearly equivalent to the manual solution?
March 14th, 2009 at 2:45 pm
I would note that despite all Ted’s trash-talking of Ubuntu and its users, it appears that it was Ubuntu testers who uncovered and reported this data integrity problem in ext4. And it was that distro which was also first to supply a safe ext4, having applied the patches to work around the problem some 5 weeks ago. If it was Fedora which had done these things, I suspect they would have gotten praise. But this is Ubuntu, so they get derided for it.
Furthermore, I was around for ext3 at this stage of its life-cycle. And let me tell you, Stephen Tweedie would *never* have taken such a cavalier, finger pointing stance regarding this very significant data-munching problem had it occurred in ext3 back then. Where along the line did extX developers forget that data integrity is more important than speed?
March 14th, 2009 at 3:19 pm
> As far as open-read-truncate-write-close, this has always, always, always
> been unsafe, even under ext3. If you get unlucky and crash between the
> truncate and the write, there’s really nothing the file system can do.
Emm, I’ve not looked at the corresponding source code, but as far as I know the described data loss (0 length files after a crash) will happen with ext3 in only one highly unlikely scenario:
* if the ext3 decides to commit changes to disk in the exact moment between truncate and write operations and a crash happens before the next commit;
That is a very, very tiny window of opportunity for the first part of the condition and a rare event for the second part, thus this problem (while entirely possible) is sufficiently rare to not present a significant issue in normal use.
However, in current ext4 the first part of the condition is always true, so the issue can occur on almost every crash, which makes it a significant problem.
In my opinion this is one of the cases where perfect is the enemy of good. Not only is a more perfectly POSIX-compliant filesystem the enemy of good-but-imperfect applications, but those applications could become enemies of each other: if the proposed application-level solution (fsync) became commonplace, it would greatly increase the chances of hitting the race condition on ext3. Applications trying to fix their behaviour would actually increase the chances of other, unrelated applications losing data.
However, I am happy with the fixes coming in 2.6.30 as far as I understand them, thank you.
March 14th, 2009 at 4:06 pm
Hi Ted,
Thanks for spending so much time answering these concerns.
I have a few more questions:
1) For filesystems without the alloc-on-commit-for-renames workaround, is there a POSIX way to request equivalent behavior from the application when doing open-write-close-rename? Fsync is too strict for my needs — I don’t want to have to spin up the disk and write the data immediately, I just want to ensure that either the old or new data will be found on disk after a crash and recovery. Some way to insert a barrier that attempts to ensure the write will happen before the metadata update, at least at the filesystem level? (Clearly it cannot be perfect due to block write scheduling, etc, but where the window of potential error would be reduced from 30+ seconds to, say, a few hundred milliseconds?)
2) After a crash, is it possible for the journal recovery to see what happened? And maybe even fix it? If it could tell that the metadata change occurred but the corresponding data had not been written, maybe the metadata could be rolled back? Wishful thinking, I’m sure. But at the very least, can fsck tell that a file with delayed allocation was never actually allocated? Losing data is bad, but not knowing immediately that it was lost is even worse. I’ve had that happen with reiserfs, where I thought I was OK after a crash, only to find a month later that some important file had been filled with NULs, and it was too late to easily replace the contents. That alone made me switch back to ext2 at the time.
3) That “Eat My Data” presentation says that a POSIX-compliant fsync() function is empty when _POSIX_SYNCHRONIZED_IO is not defined. And apparently MacOSX’s fsync() is broken too. Are there any libraries that you’re aware of that can make life less painful for application developers?
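One small piece of this can be sketched in C: a wrapper that uses the F_FULLFSYNC fcntl on Mac OS X, where plain fsync() may return before the drive has flushed its write cache, and falls back to fsync() elsewhere. The name durable_sync is hypothetical, and the wrapper cannot help on a system where _POSIX_SYNCHRONIZED_IO is not defined and fsync() is permitted to do nothing.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical wrapper: push the file's data as far towards stable storage
 * as the platform allows. Returns 0 on success, -1 on error. */
int durable_sync(int fd)
{
#ifdef F_FULLFSYNC
    /* Mac OS X: ask the drive to flush its own write cache as well. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Not every file system supports it; fall back to plain fsync(). */
#endif
    return fsync(fd);
}
```

Application code would call durable_sync(fd) wherever it would otherwise call fsync(fd), and check the return value in the same way.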
March 14th, 2009 at 5:22 pm
@105: I would note that despite all Ted’s trash-talking of Ubuntu and its users, it appears that it was Ubuntu testers who uncovered and reported this data integrity problem in ext4. And it was that distro which was also first to supply a safe ext4, having applied the patches to work around the problem some 5 weeks ago. If it was Fedora which had done these things, I suspect they would have gotten praise. But this is Ubuntu, so they get derided for it.
Actually, Fedora was the first distribution that shipped a testing version of ext4 (and Red Hat Enterprise Linux 5.1 shipped the first preview release of ext4 in an Enterprise Linux release), and Fedora users have supplied plenty of very useful bug reports. In addition, Red Hat has an engineer, Eric Sandeen, that has been very diligently responding to many bug reports and supplying bug fixes. Red Hat also funded Val Aurora (previously Val Henson) to work on the 64-bit block number support in e2fsprogs.
Fedora 11 Beta also has the workaround patches for the replace-on-truncate and replace-on-rename cases, so both Fedora 11 and Ubuntu Jaunty (both of which are currently in alpha/beta release status) will have these workarounds. Fedora 11 is scheduled to release in May, and Ubuntu Jaunty in April, so I suppose you can call Ubuntu as releasing these patches “first” (assuming both sides stick to their announced schedules), but I think it’s only fair to point out that Red Hat and Fedora have contributed a huge amount to ext4; far more than Ubuntu ever has.
I do appreciate Ubuntu users’ bug reports (although again, I have to point out Fedora users have actually contributed far more to ext4’s stability). To be honest, though, I do wish Ubuntu worked harder on kernel stability, and worked harder on persuading its partners to provide usable open source drivers, and failing that, at least being a bit more competent with their proprietary drivers! I understand that sometimes people need to resort to using proprietary drivers (and believe it or not, I’m generally considered a moderate on binary drivers; most kernel developers are much less tolerant of them than I am), but the fact that people seem to accept drivers that crash as often as some of these drivers that Ubuntu is willing to ship, even in alpha releases, is a little scary to me. How did these drivers survive the QA process from the proprietary video card vendor?
March 14th, 2009 at 6:00 pm
Hi, Ted!
First of all, thanks for taking the time to answer these questions.
As an application developer, I’m mostly interested in doing these things right.
So, I am really curious how would you reply to the concerns from Olaf@104.
I think this thread should indicate the right way for application developers to do it, not just complain about many (most?) applications getting it wrong.
Thanks again!
March 14th, 2009 at 6:54 pm
"""
Actually, Fedora was the first distribution that shipped a testing version of ext4
"""
That’s not what I said. Sure, Fedora was first to ship ext4 in their development branch. But Fedora’s testing missed this data integrity regression. Ubuntu testers found it, and reported it, and were, I believe, first to apply the work-around patches to their development branch. It’s irrelevant how much help Fedora has provided in the past. In this case, Ubuntu users pointed out a significant flaw, and instead of appreciating it, you seem to be taking offence and looking around for anyone you can to blame it on: application programmers… and the messenger.
As a current user of ext4, I’m disappointed that the response was not to gracefully accept responsibility, thank the parties for reporting the regression, and fix it… without all the drama, denial, and finger pointing.
In fact, the defensive attitude you’ve taken actually *decreases* my confidence in ext4 a bit more than just the regression alone would have.
Don’t get me wrong. I appreciate the work that you guys do. But your response to this seems totally out of character for an extX dev based upon the extreme dedication to data integrity issues that I have come to expect from the team.
This is the last I will speak of Ubuntu, Fedora, et al., because there was never any reason that the distro wars needed to make an appearance in what should be a discussion of a technical problem. But considering your initial post, I feel that I would have been remiss not to say something.
March 14th, 2009 at 10:25 pm
@97: The problem is that I don’t, really, want to turn off fsync’s, because I like my data. What I *want* to do is to spin up the drive as little as possible *while* maintaining data consistency. Really what I want is a knob that says “I’m willing to lose up to minutes of work, but no more”. We even have that knob (laptop mode and all that), but it only works in simple cases.
The problem is these cases where there are app-level ordering constraints between different writes (atomic rename, sqlite commit, etc.). The user considers the data in the newly created atomically-renamed file to be old work, but the filesystem doesn’t know that. So to maintain the no-more-than-n-minutes-lost invariant, the app has to tell the filesystem somehow, and the only POSIX way to do it is by calling fsync.
When the only way to guarantee I won’t lose more than n minutes of work is to use fsync, I do want apps to use fsync. But since fsync adds latency and hurts power-saving, I want those situations to be very rare, and I don’t see how userspace can provide this feature without more help from the kernel. Does that make sense?
March 14th, 2009 at 11:32 pm
@96: But it’s not right either.
It assumes you have permission to write the .new file.
It assumes this file doesn’t exist already (or can be overwritten).
It uses an fsync, which may not be required (if atomicity but no durability is desired).
It doesn’t retain permissions of the old file.
If the target is a symlink, it gets replaced by a normal file.
If you do 3.f, there is a window where no file exists at all.
It’s too complex, so needs to be wrapped in library functions.
Where do I start?
1. Well, you gotta have permission to do anything with the file system, no?
2. Look, this is from emacs. It is an _example_ that deals with a certain name space. Use secure temporary file names to get what you want for other files.
3. You do not understand that rename(2) only guarantees the atomicity of file paths, not the data inside the files. If you do not believe me, read the manual page. There is no situation in which you want durability of _data_ in your file and don’t want to call fsync. The atomicity you think rename(2) has does not exist.
4. Permissions again?
5. Wow, it’s a symlink. So? Nobody said programming was going to be easy.
6. Again, emacs does it to get a backup file. You would not do that if you were replacing a configuration file. That’s why even in this example, it is marked as _optional_.
7. Well, yeah. People do libraries all the time.
March 14th, 2009 at 11:36 pm
Ted,
Speaking of fixing regressions, given that the only _real_ regression here is the fact that fsync on ext3 in ordered mode may lock up your system for a few seconds, I think the real fix sequence for this whole problem should be:
1. By default make ext3 ordered mode have fsync as a no-op. People that want current broken behaviour could specify a mount option to get it.
2. Tell folks that they _must_ use fsync in order to commit their data.
3. Once a critical mass of applications has achieved the above, remove all hacks from ext4 (i.e. the ones destined for 2.6.30), XFS etc.
4. Retire ext3.
March 14th, 2009 at 11:46 pm
The more I read about this the more I’m convinced that there’s an underlying issue that isn’t addressed. The problem is that there’s no way for application programmers to specify ordering (and only ordering) for file operations. Application programmers want to say “wait for the write to complete before doing the rename, but I don’t actually care when those happen”, but there’s no way to do that directly. Ordering is a side effect of fsync, but it doesn’t actually have the same semantics and most of the problems with fsync seem to grow out of that.
March 15th, 2009 at 2:08 am
There is no fundamental problem with filesystems implementing better average recovery behavior than POSIX requires. It is just that portable applications shouldn’t depend on that being the case. For a filesystem user, anything that the filesystem can do behind the scenes to lose less data than it is allowed to lose, without ruining performance, is a good thing.
Suppose that a filesystem wanted to make sequences such as “open; write; close; rename” atomic, but not durable, as much as possible. No guarantee, just best effort.
The way to do that would be to store (journaled) meta-data undo information in addition to the regular meta-data information on the disk. Then after a system crash, the filesystem would first apply all the meta data redo information from the journal, including redo of the meta-data undo information.
After that the file system would check to see which meta-data updates were dependent on data block writes that did not complete, and use the undo information to restore the corresponding meta data to its original state.
That would make all rename after write operations atomic, but not durable. After recovery a consistent prior version of the file would be intact. There would be minimal performance impact, in fact much less than any scheme that forces data blocks to hit the disk before the meta data update does. The only requirement is a way to bring the original meta data back if the data write never completes.
It is worth noting that many databases now implement this functionality for special applications – i.e. you can do a fast commit that doesn’t guarantee that your transaction will be durable. No wait for anything to physically hit the disk. What the database will guarantee is that if that transaction occurs, it will occur completely or not at all.
That seems like a worthwhile compromise for many filesystem applications, mostly because a very large class of applications (untarring a directory tree for example) will not do per data file fsyncs due to the performance or complexity implications. 10,000 asynchronous fsyncs, for example? A filesystem that carried meta data undo information could (non-portably) ensure that the files that did get preserved after a recovery were not the wrong size, or filled with garbage, for example, by implementing a policy of undoing the meta data entry creation for any file create and write operation that did not both close and physically make it to the disk.
March 15th, 2009 at 3:15 am
> @115: There is no fundamental problem with filesystems implementing better average recovery behavior than POSIX requires.
It is actually a good thing that this “problem” hit such a popular file system as ext4. We have finally discovered how applications have been written incorrectly for years.
Overloading the defined behaviour with guarantees that don’t exist in the API is dangerous. It is exactly what caused the problem – applications relied on this behaviour, which was by no means portable. Coupled with the broken fsync() call in ext3, which was causing performance problems, it meant that even those who knew what to do had to give up on it (i.e. the FF bug).
A better thing to do would be to create a new API for this and submit it for standardisation to Open Group.
March 15th, 2009 at 4:48 am
Uberbad application? You have it. Or rather millions of people have it. P2P client.
What this beast does? It opens hundreds of files (fallocate is good here). Then for each file it creates ANOTHER small file where “status of main file” is kept. “Status of file” is mostly list of peers – sometimes you must wait for days to find even one, but if there are a lot of peers they are changing IPs (dynamic IPs are evil, but they are fact of life), going in and out of scope every few minutes so you need to keep track of all that by saving state every few minutes. Further: we have few worker threads (usually fewer worker threads than files, but more than one to use SMP – we need a lot of CPU power here to compute checksums). Oh and we are actually downloading files, right (typically megabytes per second all over 100 or so files – think about all seeks required) – thus “sync” is not very good solution either.
Here you go. Real (not just realistic, but real) application which does it all: state of 100 files is updated every minute or so on very busy disk AND it’s done for crash-resistance so we REALLY want to have some consistent state after crash and reboot (state for 5 minutes ago is Ok, zero files are disaster). How can I write such an application to make it usable on both ext3 and ext4? Should every such application include complex logic with separate threads and such to make “state drops” safe and nonblocking? Looks like an overkill to me. “Atomic rename” was enough to make it all robust on ext3…
March 15th, 2009 at 5:23 am
A friend of mine had zero length /etc/shadow after running passwd and resetting the machine. It seems that even PAM does not get writing files “right” and misses call to fsync(). See function unix_update_shadow():
http://pam.cvs.sourceforge.net/viewvc/pam/Linux-PAM/modules/pam_unix/passverify.c?revision=1.10&view=markup
This looks pretty bad… I really hope distributions will backport your patches to earlier kernels with ext4 support built in!
March 15th, 2009 at 6:24 am
I’ve seen repeated assertions that open-write-fsync-close-rename is blessed by POSIX as the method for ensuring atomicity after crash recovery. I’ve checked the standard and this turns out not to be the case.
POSIX has no wording to require implementations to implement rename() using a single disk write. An implementation that could leave the renamed-over file empty or missing after crash recovery would nonetheless be compliant.
Any case that open-write-fsync-close-rename is the right way to avoid zero-length files after crash recovery must be based on tradition, not standards.
March 15th, 2009 at 7:37 am
[quote]Another solution is a set of patches to ext4 that has been queued for 2.6.30 merge window. These three patches (with git id’s bf1b69c0, f32b730a, and 8411e347) will cause a file to have any delayed allocation blocks to be allocated immediately when a file is replaced. This gets done for files which were truncated using ftruncate() or opened via O_TRUNC when the file is closed, and when a file is renamed on top of an existing file. This solves the most annoying set of problems where an existing file gets rewritten, and thanks to the delayed allocation semantics, that existing file gets replaced with a zero-length file.[/quote]
Wouldn’t it be better to use a solution like the safe links used in reiserfs?
[quote]Whenever there is a truncate/unlink on a file, Reiserfs creates a safe
link for the same and deletes the same once the operation is complete.
If the machine crashes before committing the operation, whenever the fs
is mounted next time, the fs will look for the saved links ( easy to
find out, since they have special key) and commit the operation that was
unfinished. [/quote]
March 15th, 2009 at 7:52 am
@112 Bojan Says:
3. You do not understand that rename(2) only guarantees the atomicity of file paths, not the data inside the files. If you do not believe me, read the manual page. There is no situation in which you want durability of _data_ in your file and don’t want to call fsync. Atomicity you think rename(2) has does not exist.
Either you’re reading a different man page than I am, or you are interpreting it differently from me. Here’s what man 2 rename on my machine says:
If newpath already exists it will be atomically replaced (subject to a few conditions; see ERRORS below), so that there is no point at which another process attempting to access newpath will find it missing.
I’d argue that “truncated to empty” sounds a whole lot like missing. If file truncation was permissible behavior, you sure as heck would see a Big Honkin’ Disclaimer ™ there, and linux man pages tend to be pretty good about copping to even perceived errors.
I agree with the others who’ve said I shouldn’t have to tell the kernel to sync my data if I don’t care when it gets to disk, just that it goes to disk in a way that is consistent with what the running image sees. Doing otherwise violates all sorts of abstractions.
That said, a different mechanism for doing atomic replaces wouldn’t be the end of the world… But whatever it is, it really should avoid an explicit fsync, because there’s no reason I should have to think about disk scheduling to do an atomic replace – in most cases I just want to make sure the write and relink occur in order.
It is a bit troubling to see that the maintainer of my FS doesn’t see eye to eye about what the ideal should be, but the patch ted made is a good compromise. I just don’t want anyone to think we should keep running in the “always use fsync/fdatasync before close” direction. It’s not an issue of overhead, it’s an issue of mixing abstractions.
March 15th, 2009 at 8:19 am
The second quote from linuxjournal should probably be interpreted this way, I think.
[quote]Whenever there is a truncate/unlink on a file, Reiserfs creates a safe
link for the same and deletes the same once the operation is complete.
If the machine crashes before committing the operation, whenever the fs
is mounted next time, the fs will look for the saved links ( easy to
find out, since they have special key) and rollback the operation that was
unfinished. [/quote]
March 15th, 2009 at 12:03 pm
I think that the following idea, or some variant of it, might be a good compromise. I assume that the kernel has an idea whether a particular block has been written back or not (I admit that I have not played with the Linux vfs bits much). The kernel could – optionally – keep a log of such transactions, both data and metadata, in a file on the filesystem, or a pre-allocated set of blocks on a swap device, making sure that transactions are either written to the filesystem or to the log after a certain maximum period of time. Once transactions hit the filesystem they can be removed from the log if they were written there. This could be disabled or not compiled in for those who do not need it. It would also work for all filesystems, not just ext4, and it would not require maintainers of filesystems – such as ext4 – to pollute their own code to achieve this result.
Please feel free to tell me why this is a very bad idea, or to ignore it totally if it is too bad to merit comment
March 15th, 2009 at 1:23 pm
I really wish everyone would knock it off with this notion that all open source software is perfect and never crashes. I’ve had all kinds of hard system lockups with *open source* video drivers. Am I to be punished with lost data too?
https://bugzilla.redhat.com/show_bug.cgi?id=441665
https://bugzilla.redhat.com/show_bug.cgi?id=474977
Besides, you can’t patch acts of god. Power cables get kicked out, fuses blow, laptop batteries go bad, power supplies light on fire, motherboards light on fire, UPSs fail, lightning strikes, power transformers explode knocking out power to the entire neighborhood. EVERY one of those has happened to me over the years.
The most reliable systems in the world get that way by *expecting* failure and have mechanisms in place to mitigate it. You develop reliable software by EXPECTING failure and dealing with it, not by writing perfect code.
http://lwn.net/Articles/191059/
http://en.wikipedia.org/wiki/Fault-tolerant_design
http://www.fastcompany.com/magazine/06/writestuff.html
(And yes that goes for user space too, truncating and overwriting is stupid and doomed to fail.)
March 15th, 2009 at 2:06 pm
===
@3: Ulrik,
The sequence:
1. Open file
2. Truncate file
3. Write to file
is always going to be unsafe. Application writers should never do this. The POSIX equivalent is:
1. Open file “foo” for reading and read the contents into memory
2. Modify the contents of file “foo” in memory
3. Write file “foo.new” with the new contents of file “foo”
4. Call fsync() on the file “foo.new” and close the file handle for “foo.new”
5. Rename “foo.new” to “foo”, overwriting “foo” in the process
It’s easy enough to create a library function that does this, if this is too complicated for application programmers to remember.
===
Isn’t this inefficient (or wrong) ?
At #3, if you have a big file and are low on space, how would you accommodate it?
March 15th, 2009 at 3:23 pm
“That’s not what I said. Sure, Fedora was first to ship ext4 in their development branch. But Fedora’s testing missed this data integrity regression. Ubuntu testers found it, and reported it, and were, I believe, first to apply the work-around patches to their development branch. It’s irrelevant how much help Fedora has provided in the past.”
You are wrong on all these counts.
March 15th, 2009 at 5:21 pm
The arguments above provide a convincing case for calling fsync() before close() when writing a file. This presents application writers with a choice of two paths:
1) close() without fsync() to gain the performance advantages of delayed allocation, but risk leaving zero length files on a system crash or power loss.
or 2) fsync() before close() for safety, but lose the performance advantages of delayed allocation and force extra disk i/o.
What applications on a general purpose desktop or development workstation should take path 1? At the moment my list contains only Nautilus thumbnail generation and the Firefox cache.
March 15th, 2009 at 6:17 pm
@116: The idea that filesystems shouldn’t make the slightest effort to recover more than POSIX requires in the event of a crash is insane. POSIX doesn’t require that the file operations initiated by non-fsync using applications ever actually hit the disk.
By default, rsync does not fsync every file. The --fsync option can make the transfer of a directory tree hundreds of times slower. So should we go out of our way to make sure that the output of rsync in its default mode is *always* lost if the system crashes, even if the rsync completed weeks ago?
After all, we could keep a journalled log of all files that had never been fsynced by an application. Then on “recovery”, we could make sure that all those files had been deleted. That’ll teach em, right?
March 15th, 2009 at 6:30 pm
> @121: so that there is no point at which another process attempting to access newpath will find it missing
This refers to processes running on the machine _now_. If contents of the directory (i.e. rename) gets committed to disk _before_ the data, the kernel makes sure these processes _still_ see one and the same thing. This satisfies POSIX just fine.
March 15th, 2009 at 6:32 pm
Ted, I have a more immediately practical suggestion. Add a mount option that causes the equivalent of a sync_file_range(fd, 0, file_size, SYNC_FILE_RANGE_WRITE) whenever the last file handle open for write on a file is closed. As others may not be aware, that will schedule any unflushed blocks for immediate asynchronous write-out.
This does not guarantee anything. It just makes it more likely that recently created files will survive a crash. This is particularly useful for applications like rsync that have good reasons not to fsync every file prior to close.
Without an option like this, the first thing I would do with any system running ext4 (even on a data center UPS) would be to turn dirty_expire_centisecs down to the five second range. That sort of defeats the benefit of delayed allocation for files that have not been closed yet.
March 15th, 2009 at 6:40 pm
> @127: So should we go out of our way to make sure that the output of rsync in its default mode is *always* lost if the system crashes, even if the rsync completed weeks ago?
The file system will decide, in its own good time, when the contents of rsync will be committed to disk. Just like it decides when to commit the rename and when to commit the write into the file, unless fsync is called.
So, to answer your question, no, we should not make sure that’s the case, because that’s not asked by the standard. We can implement it in some way that we find reasonable. But we also don’t _need_ to make sure that something that isn’t specified always happens.
In other words, the requirement is not there, hence it is left up to implementation. Applications relying on this implementation may be in trouble, as we’ve seen.
PS. Better run sync after that big rsync then, if it’s important to you
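A minimal sketch of that suggestion for bulk writers such as a tree copy: write every file without a per-file fsync, then request one system-wide flush at the end. The helper name `write_many_then_sync` is hypothetical. `os.sync()` is Python's wrapper around sync(2); on Linux it waits for the write-out to finish, but POSIX only requires that the writes be scheduled, so treat this as best effort rather than a guarantee.

```python
import os


def write_many_then_sync(directory, files):
    """Write each name -> bytes entry without fsync, then flush once."""
    for name, data in files.items():
        with open(os.path.join(directory, name), "wb") as f:
            f.write(data)  # no per-file fsync: data may sit in the page cache
    # One flush for the whole batch, comparable to running sync(1) afterwards.
    os.sync()
```

The trade-off is that a crash in the middle of the batch can still leave some files empty or partial; only the state after the final flush is being asked for.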
March 15th, 2009 at 6:46 pm
Matthew Garrett put the reasonable analysis of this issue, with included pointers to the solutions, in the right words.
March 15th, 2009 at 6:47 pm
@112
> 1. Well, you gotta have permission to anything with the file system, no?
No. Permission to write a.txt does not imply permission to write any other file.
> 2. Look, this is from emacs. It is an _example_ that deals with a certain name spaces. Use secure temporary file names to get what you want for other files.
I’m not saying it can’t be done right. I’m only saying the example has flaws.
> 3. You do not understand that rename(2) only guarantees the atomicity of file paths, not the data inside the files.
Why do you think I don’t understand that?
> 4. Permissions again?
Yep. Your point being?
March 15th, 2009 at 6:49 pm
@125
> Isn’t this inefficient (or wrong) ?
No and no.
> At #3, if you have a big file and are low on space, how would you accommodate it?
You can’t. If you don’t have space, you can’t provide this kind of safety.
March 15th, 2009 at 7:04 pm
@130: “The file system will decide, in its own good time, when the contents of rsync will be committed to disk”
Exactly. It is up to the implementation to do what it thinks is best. I am saying that it is *advisable* to start committing the contents of closed files to disk in short order. That is completely different than guaranteeing, for example, that a file is committed when close(2) returns (which POSIX does not require, and which would be a performance disaster).
The thing is, POSIX does not require how unreliable your filesystem *must* be in the event of a crash, rather it implies how unreliable it *may* be. There are things that can be done that make it more reliable than POSIX requires (which isn’t much) without incurring a significant performance penalty.
Any application which is updating one or two files should fsync before close. It is the ones that generate hundreds of files (cp -R, tar, rsync) that cannot afford to fsync every one that benefit the most from filesystems going beyond the call of duty. Going beyond the call of duty where possible and convenient is not the same as guaranteeing anything. Programmers should write to guarantees. Everyone else benefits from better than average (if non-guaranteed) recovery behavior.
March 15th, 2009 at 7:54 pm
> @132: Why do you think I don’t understand that?
Because you say:
> It uses an fsync, which may not be required (if atomicity but no durability is desired).
If you want to see data that you wrote inside your file on disk and _now_, you have to call fsync. Atomicity of rename applies _only_ to the file names and kernel is free to reorder its commit _before_ the data commit (unless you fsync), because even such a rename will meet what POSIX requires – which is that any process running on the system sees the same thing and the file is never missing. The data of the file can still be in the buffers and _all_ processes will see it – that is how the kernel works and what the spec requires.
When kernel makes your rename durable by chance (which it is allowed to do), if _you_ didn’t make the _data_ inside the file durable beforehand, you may end up with empty file after the crash.
In other words, the atomicity you are referring to (where one file gets replaced _completely_ by another on disk, content and all), is the job of the application. Hence, fsync – _required_ fsync!
Rename manual page talks about the view of currently running processes and it doesn’t specify how exactly this should be implemented. ext4 meets this just fine (even without patches destined for 2.6.30).
> Yep. Your point being?
My point being, of course you need to have correct permissions to write files where required. Nobody said programming was going to be easy. This has nothing to do with committing things to disk. It is just another one of hundred things that programmers need to get right.
PS. Yes, I got bitten by this thing in my own code. I was also renaming without committing first. Guilty as charged. At least I’m willing to admit I was wrong.
March 15th, 2009 at 7:59 pm
> @135: Programmers should write to guarantees. Everyone else benefits from better than average (if non-guaranteed) recovery behavior.
Absolutely agree.
March 15th, 2009 at 8:13 pm
#136:
> Because you say:
I didn’t mean it’s not required in that example. I meant that durability may not be required by the programmer, although atomicity is.
> My point being, of course you need to have correct permissions to write files where required.
So to safely write file A, I suddenly also need permission to write another file B?
That’s not logical.
> PS. Yes, I got bitten by this thing in my own code. I was also renaming without committing first. Guilty as charged. At least I’m willing to admit I was wrong.
Maybe you could post your code then, so others can learn from it.
For example, how did you implement retaining permissions, creation time and attributes?
March 15th, 2009 at 8:39 pm
> @136: So to safely write file A, I suddenly also need permission to write another file B? That’s not logical.
Eh, what does logic have to do with this? The spec is what it is. To atomically do this, you need another file. That’s how it works.
Just because we’d all like it to be some other way, doesn’t mean it is.
> Maybe you could post your code then, so others can learn from it.
I didn’t say I fixed it. I just said I was using the exact same problem pattern.
Off hand though, I would think it would have to do with calls to stat(), utime[s](), chmod() and friends. Most times, however, permissions and ownership will be just right if you are using reasonable file system setup. Other times, you will have to explicitly set things right.
But, if you want to see how it should be done, Google it. For example, Gnome folks are already working on it:
http://mail.gnome.org/archives/gtk-devel-list/2009-March/msg00082.html
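As a rough illustration of the stat()/chmod() approach mentioned above, here is a sketch that copies the permission bits and, where allowed, the ownership of the old file onto the replacement before the rename. The function name `replace_preserving_metadata` is hypothetical, and this is not the code from the linked GNOME thread. Creation time cannot be set through the POSIX interface, and extended attributes are not handled here.

```python
import os
import stat
import tempfile


def replace_preserving_metadata(path, data):
    """Replace `path` with `data` (bytes), keeping its mode and owner."""
    old = os.stat(path)  # metadata of the file about to be replaced
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # created with mode 0600
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fchmod(tmp.fileno(), stat.S_IMODE(old.st_mode))
            try:
                os.fchown(tmp.fileno(), old.st_uid, old.st_gid)
            except PermissionError:
                pass  # unprivileged processes usually cannot change ownership
            os.fsync(tmp.fileno())  # data and metadata to disk before rename
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that this needs write permission on the containing directory, which is exactly the objection raised earlier in the thread.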
March 15th, 2009 at 8:46 pm
@130 Mark Butler says:
Add a mount option that causes the equivalent of a sync_file_range(fd, 0, file_size, SYNC_FILE_RANGE_WRITE) whenever the last file handle open for write on a file is closed. As others may not be aware, that will schedule any unflushed blocks for immediate asynchronous write-out.
Please be aware that this will have a significant effect on write performance over NFS. The current NFS server code doesn’t keep a file open between RPCs from the client, so you get the equivalent of the last file handle open for write being closed on every single WRITE RPC, i.e. every 32K to 1M of data.
March 15th, 2009 at 8:48 pm
> @136: I didn’t mean it’s not required in that example. I meant that durability may not be required by the programmer, although atomicity is.
You have to define _which_ atomicity is actually required here.
If you are referring to atomicity of a pure rename, that’s already there. If your directory entry with the renamed file gets evicted from the cache, it will be written to disk in such a way that at no point any running process will be missing the file. Also, all processes will see the same content of the file, although this content may not be on disk yet.
If you are referring to atomicity of replacing the file with another file in its entirety and expecting this atomicity to persist across a crash (which is not what POSIX specifies in any way), then the only proper way to do this now is to fsync the data first and then rename.
It sucks, yes. But, that’s what it is.
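For reference, the fsync-then-rename sequence discussed throughout this thread can be sketched as follows. The function name `atomic_replace` and the `.tmp-` prefix are hypothetical choices for illustration. The sketch does not fsync the containing directory, so the rename itself may not yet be on disk when the function returns; what it aims for is that after a crash the name refers to either the complete old contents or the complete new contents.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the contents of `path` with `data` (bytes)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that the rename
    # stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to write the data out
        os.replace(tmp_path, path)  # rename(2) over the old name
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```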
March 15th, 2009 at 11:20 pm
@Ted
A couple of points:
1. reiser4
According to their transaction design document,
http://lwn.net/2001/1108/a/reiser4-transaction.php3
reiser4 gets it exactly right (assuming this is what was implemented): Offer apps an interface to define atomic and consistent transactions, and deduce implicit ordering constraints for apps using the posix interface. Smart strategies for choosing between data journaling and copy-on-write. This is good stuff.
This was 2001. Fast forward to 2009. What’s the situation we’re in? The same crappy API from the 70s and people denying that there even is a problem. Sad.
2. don’t fear the fsync
You claim that fsync is not expensive, and application-writers shouldn’t be afraid to use it. Let’s assume for a moment that we live in a parallel universe where ext3, which makes fsync almost prohibitively expensive, isn’t the most widely used linux-filesystem. There would still be a real cost to using fsync, especially for laptop users: It will cause frequent spin-ups, i.e. high latencies, increased power usage and increased wear and tear. We all remember the high-load-cycle-count debacle.
3. separating important from unimportant data
This is a false dichotomy. In the case of browsing history for example, it is true that the exact number of times that I’ve visited a particular website doesn’t really matter, but the aggregate data, i.e. my complete history is pretty important, as a matter of fact, it’s more important to me than, say, my bookmarks.
4. file systems are not databases
It is clear that the posix file system interface is unsuitable for using the file system as a database, but it should be equally obvious that this is the direction we need to move into. There is a real demand for atomic (but not necessarily durable) updates of files, for example. The ideal situation would be that apps never had to use (the equivalent of) fsync unless they need to be sure the data is written NOW.
5. vim
You mentioned vim as a positive example, but I doubt you’ve actually looked at the source. Here’s what the vim author has to say (src/fileio.c):
/* On many journalling file systems there is a bug that causes both the
* original and the backup file to be lost when halting the system right
* after writing the file. That’s because only the meta-data is
* journalled. Syncing the file slows down the system, but assures it has
* been written to disk and we don’t lose it.
That’s right, he thinks that this behavior is a *bug*, and I tend to agree.
March 15th, 2009 at 11:57 pm
[...] reading the comments on my earlier post, Delayed allocation and the zero-length file problem as well as some of the comments on the Slashdot story as well as the Ubuntu bug, it’s become [...]
March 16th, 2009 at 12:31 am
@107: For filesystems without the alloc-on-commit-for-renames workaround, is there a POSIX way to request equivalent behavior from the application when doing open-write-close-rename? Fsync is too strict for my needs — I don’t want to have to spin up the disk and write the data immediately, I just want to ensure that either the old or new data will be found on disk after a crash and recovery.
Jim,
Please see my new blog post which should hopefully answer this question in detail (over 2200 words!).
Some way to insert a barrier that attempts to ensure the write will happen before the metadata update, at least at the filesystem level?
There is no way to do this using the POSIX interface, and in fact, most hard drives don’t even support it at the low level. As a result, in the kernel, a barrier operation in reality causes a flush. I know that you’re willing to schedule something that doesn’t take block scheduling, et al., into account, but even if Linux were to introduce some non-standard way of specifying these semantics, most of Linux’s file systems wouldn’t support it and at least initially, it would be the equivalent of an fsync(), just as today, the barrier interface in the kernel really is a device flush command. It might be interesting to add such an interface to the Linux VFS, but (1) no other OS would have it, and (2) for a long time, it would be functionally equivalent to fsync().
After a crash, is it possible for the journal recovery to see what happened?
No, because we don’t record it; I don’t know of any file system implementation of delayed allocation (reiser4, ZFS, btrfs, XFS, etc.) which attempts to record this sort of thing.
If it could tell that the metadata change occurred but the corresponding data had not been written, maybe the metadata could be rolled back? Wishful thinking, I’m sure.
Yep, wishful thinking. Recording this kind of rollback information would double the overhead of the journal; and again, I don’t know of any file system that tries to do this. Again, databases are not file systems, and file systems are not databases.
For precious files, you really need to use fsync(); it’s the only way to be sure.
That “Eat My Data” presentation says that a POSIX-compliant fsync() function is empty when _POSIX_SYNCHRONIZED_IO is not defined. And apparently MacOSX’s fsync() is broken too. Are there any libraries that you’re aware of that can make life less painful for application developers?
If _POSIX_SYNCHRONIZED_IO is not defined, that’s basically a way for the OS to say that “fsync() is not defined”. Fortunately there are relatively few OS’s for which this is the case. If you do come across such a system, there is absolutely nothing you can do. As far as MacOSX’s fsync(), fsync() does partially work; it just sends the information to the disk, but it doesn’t ask the disk to write the data into the spinning platters. There is an F_FULLFSYNC fcntl() which handles this. (Actually, ext3 doesn’t do this either, unless you use the mount option “barrier=1”. Ext4 enables barriers by default, and you can disable them using the mount option “barrier=0”, but Andrew Morton has refused to enable barriers by default on ext3 because of the performance hit. In truth, the window is relatively small, although Chris Mason has demonstrated how to force ext3 file system corruption with a sample workload simulation script combined with a power failure. We haven’t been able to convince Andrew to take the patch, though.)
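A small sketch of how an application might paper over the Mac OS X difference: use the full-flush fcntl command where the platform exposes it (Python's `fcntl` module names it `F_FULLFSYNC` on macOS) and fall back to plain fsync() everywhere else. The helper name `full_sync` is hypothetical, and the fallback on error is an assumption about what a cautious caller would want, since not every filesystem accepts the command.

```python
import fcntl
import os


def full_sync(fd):
    """Flush `fd` as far toward stable storage as the platform lets us ask."""
    if hasattr(fcntl, "F_FULLFSYNC"):
        try:
            # macOS: also ask the drive to flush its own write cache.
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
            return
        except OSError:
            pass  # command not supported here; fall through to fsync()
    # Elsewhere, fsync() is the strongest portable request available.
    os.fsync(fd)
```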
March 16th, 2009 at 12:39 am
@109: So, I am really curious how would you reply to the concerns from Olaf@104. I think this thread should indicate the right way for application developers to do it, not just complain about many (most?) applications getting it wrong.
Andre,
Take a look at my blog post, Don’t fear the fsync!, I think it should answer most of those concerns.
March 16th, 2009 at 12:44 am
@117: Uberbad application? You have it. Or rather millions of people have it. P2P client. What this beast does? It opens hundreds of files (fallocate is good here). Then for each file it creates ANOTHER small file where “status of main file” is kept. “Status of file” is mostly list of peers – sometimes you must wait for days to find even one, but if there are a lot of peers they are changing IPs (dynamic IPs are evil, but they are fact of life), going in and out of scope every few minutes so you need to keep track of all that by saving state every few minutes.
This isn’t a problem. Calling fsync() every few minutes is not a big problem; if you are concerned about the time that fsync() takes, use a separate thread for the open-write-fsync-close-rename sequence. See my new blog post, Don’t fear the fsync, for an extended discussion why this is the right way to handle this situation.
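One way the separate-thread approach might look, as a sketch only: a single background worker takes (path, data) jobs from a queue and runs the open-write-fsync-close-rename sequence for each, so the threads doing network and checksum work never block on fsync(). The class name `StateSaver` and its methods are hypothetical, not taken from any real client.

```python
import os
import queue
import tempfile
import threading


def _write_fsync_rename(path, data):
    """open-write-fsync-close-rename for one state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)


class StateSaver:
    """Runs state-file saves on one background thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, path, data):
        """Queue a save and return immediately."""
        self._jobs.put((path, data))

    def close(self):
        """Finish the queued saves, then stop the worker."""
        self._jobs.put(None)
        self._worker.join()

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel from close()
                return
            path, data = job
            try:
                _write_fsync_rename(path, data)
            except OSError as exc:
                print(f"state save failed for {path}: {exc}")
```

A caller would create one `StateSaver`, call `submit()` every few minutes per download, and call `close()` on shutdown. Jobs are processed in order, so the last submitted state for a path is the one that ends up on disk.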
The big thing for bittorrent clients is to make sure they use fallocate() to preallocate blocks for the files they are downloading, since the blocks arrive from their peers in essentially random order.
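A sketch of that preallocation step, using `os.posix_fallocate()` (the portable wrapper, rather than the Linux-specific fallocate() call itself) where Python provides it. The helper name `preallocate` is hypothetical. The fallback to `ftruncate()` is an assumption for platforms or filesystems without preallocation support; it only sets the file size and may leave a sparse file, which does not give the same layout benefit.

```python
import os


def preallocate(path, size):
    """Create `path` if needed and reserve `size` bytes for it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Reserve blocks up front so pieces arriving in random order
            # land in space that is already allocated.
            os.posix_fallocate(fd, 0, size)
        except (AttributeError, OSError):
            # No preallocation available: fall back to setting the size.
            os.ftruncate(fd, size)
    finally:
        os.close(fd)
```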
March 16th, 2009 at 3:15 am
> Fundamentally, open-read-truncate-write-close is broken. People who
> use it should have their programmer’s license revoked; but
> unfortunately, there are a lot of incompetent application programmers
> out there, which is why we do try to accommodate broken application
> programs to some extent, but there are limits to what we can do….
The Emacs Manual, section 22.3.2.3 Copying vs. Renaming, explains that
Emacs sometimes uses rename() and other times uses the
open-truncate-write-fsync-close sequence, for reasons that sound
entirely reasonable to me. Maybe you think the Emacs programmers are
incompetent and should have their programmer’s license revoked, but
their reasons for doing things that way seem pretty sound to me, and I
don’t know of a better alternative in the cases where Emacs does that.
(It does normally make a backup copy of the old file first, though.)
I think Matthew J. Garrett’s comments in his blog post (which I think
everyone on this thread should read:
http://mjg59.livejournal.com/108257.html ) are relevant. I summarize,
hopefully correctly: ext3 provided the ability to do file updates that
were atomic but not durable; atomicity is necessary, because it allows
the disk to always be in some consistent state after a crash, and
durability is too expensive much of the time. Garrett writes, “I said
we could fix up applications fairly easily. But to do that, we need an
interface that lets us do the right thing. The behaviour application
writers want is one which ext4 doesn’t appear to provide. Can that be
fixed, please?”
We seem to have discovered empirically that this functionality ext3
provided by accident is extremely useful. We shouldn’t let the fact
that ext3 provided it by accident blind us to its usefulness. Instead
we should create a way to request atomic file updates explicitly, make
it a no-op on ext3fs, and then applications can fall back to fsync()
if it’s not available on whatever filesystem they’re running on.
I am pretty sure that providing atomic-but-not-durable file updates
does not *necessarily* have to make the filesystem slow in general (as
Bernd’s proposal would, according to Ted’s earlier comment) nor to
make fsync() painfully slow (as the accidental implementation of this
feature on ext3fs does). I don’t want to get into detailed filesystem
design here, since I’m sure it would take me several tries to get it
right, but I am pretty sure this is a possible feature. Does the new
btrfs have it? An anonymous commenter on MJG’s blog claims it does.
I’ve become accustomed to my computer not losing data due to crashes
or power loss. My open-source Intel 3-D drivers hang the machine
sometimes (yes, I’m running Ubuntu), my laptops run out of battery
sometimes, sometimes I use unreliable hardware that crashes
spontaneously, I have power outages about once a month (and no UPS).
And I had a machine at 365 Main on 2007-07-24, when the generators
failed to fire up properly (see http://365main.com/status_update.html
for details). Ext3 has cut my rate of data loss to almost nothing.
At a job a few years back, we shipped network management server
appliances to clients and took responsibility for maintaining them.
Sometimes the clients wouldn’t plug them into UPSes; if that meant we
lost data about their network when they had a power outage (because a
Postgres transaction log file was missing after recovery, so Postgres
wouldn’t start), we had unhappy clients, and we lost engineering hours
doing customer support. We stopped losing data when we switched from
ext2 to ext3. We probably would have changed operating systems if
there hadn’t been something like ext3.
The kinds of data loss people are talking about here — /etc/shadow
from PAM, libc and kernel from dpkg — are pretty horrifying.
March 16th, 2009 at 3:22 am
I regret that I had not read your next post before posting my comment, Ted; reading it now.
March 16th, 2009 at 3:37 am
I’ve read your next post. I don’t think it successfully addresses the point.
I am really unhappy with your proposal to turn off the functionality of fsync() in laptop_mode, except (optionally) for certain processes. I don’t think the right solution to “we need atomic, but not durable, file changes” is “force applications to request atomic and durable file changes, then break the system call that provides durability.” I like to have durability when I need it. I just don’t want to pay for durability every time I write a file.
And I’ve gotten 30-second fsync() waits in Firefox and Emacs when I’m doing heavy disk I/O, without fooling around with ionice. Most of the time I’d be perfectly happy to remove fsync() from Emacs saving, but not if I’m going to lose work in a crash.
March 16th, 2009 at 4:04 am
Guys,
First, I don’t really know how transactions are done in filesystems, but I have some ideas about how they could behave.
The simplest idea is to use data=journal for all truncated files, and the mode selected in the mount options for the others.
The second idea is:
1. When truncating, don’t return the file’s allocated disk space to free space until the file is closed or its data is flushed. That way the log can record the operation as a “soft truncate” while the old data is still safely on disk, so the “soft truncate” can be undone on error and the old data restored.
2. Return the allocated disk space of a “soft truncated” file when the file has to be flushed to disk (by some preallocation function) or on an fsync call.
I think that should be enough to prevent zero-sized config files and data loss with delayed allocation, shouldn’t it?
March 16th, 2009 at 4:08 am
@139
> Eh, what does logic have to do with this? The spec is what it is. To atomically do this, you need another file. That’s how it works.
Of course, that’s why I brought it up as a flaw.
> Just because we’d all like it to be some other way, doesn’t mean it is.
Doesn’t mean we can’t work on getting where we want to get.
> I didn’t say I fixed it. I just said I was using the exact same problem pattern.
Ah, so that’s another flawed fix?
> Off hand though, I would think it would have to do with calls to stat(), utime[s](), chmod() and friends. Most times, however, permissions and ownership will be just right if you are using reasonable file system setup. Other times, you will have to explicitly set things right.
Most times isn’t good enough. And by just right, I assume you mean the defaults (umask etc)? Again, no good.
> http://mail.gnome.org/archives/gtk-devel-list/2009-March/msg00082.html
I’ll have a look.
March 16th, 2009 at 4:10 am
@141
> You have to define _which_ atomicity is actually required here.
Both. And yes, I know fsync is currently required in that case.
March 16th, 2009 at 9:56 am
@147: The Emacs Manual, section 22.3.2.3 Copying vs. Renaming, explains that Emacs sometimes uses rename() and other times uses the open-truncate-write-fsync-close sequence, for reasons that sound entirely reasonable to me.
Kragen,
That’s the exception that proves the rule; and if a backup copy of the file has been made first via copying the file, and calling fsync() on the backup copy, it is mostly safe, although on a crash the system administrator will still need to manually rename the backup copy of the file over the partially written new version of the file. In some specialized situations, this might be required, yes, and that section of the emacs manual describes some of these situations. But it also says that backup-by-copying (which implies the dangerous open-truncate-write-close sequence) is not the default.
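To make the “backup copy first, with fsync()” variant concrete, here is a hedged Python sketch. It is not how Emacs implements backup-by-copying; the function name and the "~" suffix are assumptions chosen for illustration. It overwrites the file in place, so the inode (and with it ownership and hard links) is preserved, at the cost of a window in which only the backup is intact.

```python
import os
import shutil


def overwrite_in_place_with_backup(path, data, backup_suffix="~"):
    """Overwrite `path` in place after forcing a backup copy to disk.

    Returns the path of the backup copy.
    """
    backup_path = path + backup_suffix

    # Step 1: copy the current contents aside and fsync() the copy, so the
    # old version is safely on disk before the original is touched.
    with open(path, "rb") as src, open(backup_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())

    # Step 2: the risky part. Opening with "wb" truncates the existing
    # file; a crash from here until the fsync() below can leave it empty
    # or partially written, which is when the backup has to be renamed
    # back into place by hand.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    return backup_path
```

After a crash in step 2, recovery is the manual step described above: rename the backup over the damaged file.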
Maybe I need to repeat myself. I’ve already provided a kludge which adds the implicit barrier to rename, and Ubuntu and Fedora have already backported those patches to their kernels. However, application programmers who expect this to work on other operating systems or other file systems are setting themselves up for failure. So please stop saying that ext4 doesn’t provide this behaviour. I added it because I know there are a lot of broken application programs out there. And I want people to understand that they are broken, because otherwise this is going to bite users on other file systems, and on other operating systems. But if you only care about your applications being safe on Linux and ext3/ext4, be my guest…
March 16th, 2009 at 1:11 pm
@143 Greg Banks: This would naturally be a filesystem / mount specific option.
March 16th, 2009 at 1:22 pm
Ted Ts’o writes: “I’ve already provided a kludge which adds the implicit barrier to rename, and Ubuntu and Fedora have already backported those patches to their kernels. However, application programmers who expect this to work on other operating systems or other file systems are setting themselves up for failure. So please stop saying that ext4 doesn’t provide this behaviour. ”
I’m sorry, I guess I misunderstood. I thought you had added an implicit flush to rename, not just an implicit barrier, and that the new behavior also brought back the downside of ext3’s approach, which is that fsync() can require flushing a lot of data from unrelated files, so you were recommending that people leave it turned off.
Also, I suggested there should be an application-level way to ask the kernel to provide atomicity without requesting durability, rather than hoping the application is running on a filesystem that happens to provide it. Then non-broken application programs can fall back to fsync() when they are guaranteeing the user durability.
March 17th, 2009 at 7:27 pm
[...] about filesystems with delayed allocation, such as ZFS, btrfs, XFS, or ext4. To work correctly, these applications rely on ext3’s data=ordered mode (enabled by default), and on its automatic interval for carrying out write operations [...]
March 18th, 2009 at 1:12 am
I’ve thought of another scenario where the current ext3 data=ordered behavior may be more desirable than your patched ext4.
Consider a wget-like program downloading a file without using fsync() or rename() at all. If I understand correctly, in ext3 data=ordered mode, any updates to file size will be done only after the data is safely on disk, so after a crash we can simply resume the download. In your patched ext4, delay-allocated blocks may be written after file size updates. Will the allocations and writes always be done in file order, even when the writes are due to memory pressure? If not, we may see holes in the file in case of a crash, and verification will be necessary before a resumption.
This also applies to logs; a truncated log might sometimes be more acceptable than one with holes.
I’m not asking you to correct this behavior, but maybe it is necessary to warn the users upgrading from ext3 if this is true.
March 18th, 2009 at 7:36 am
@157,
In the absence of any instructions to the contrary (i.e., sync_file_range), the writes should normally get done in the order that they were written out, oldest writes first, which would be file order if you are downloading the file using a program like wget; memory pressure shouldn’t change that. Detecting holes in the file is not hard, though; it’s simply a matter of comparing i_blocks and i_size, and that would be a good thing for a wget-like program to do before resuming.
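From user space, i_blocks and i_size show up as st_blocks and st_size in stat(). The following Python sketch shows the comparison; the function names are made up for this example, and it assumes st_blocks is counted in 512-byte units, as it is on Linux. Treat the result as a heuristic: file systems with compression or inline data can report block counts that do not map directly onto the file size.

```python
import os


def allocated_bytes(path):
    """Bytes of disk space actually allocated to the file."""
    # st_blocks is reported in 512-byte units, independent of the
    # file system's own block size.
    return os.stat(path).st_blocks * 512


def looks_sparse(path):
    """True if fewer bytes are allocated than the file size claims."""
    st = os.stat(path)
    return st.st_blocks * 512 < st.st_size
```

A wget-like program could call looks_sparse() on a partial download and fall back to verifying or restarting it when the check returns True.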
Ext3 doesn’t use barriers by default, so actually with ext3 a crash right after a commit could result in unwritten data blocks if the hard drive is reordering writes. Also, remember that PC class machines don’t have power fail interrupts, and memory can start returning garbage data before the DMA engine and hard drive stop functioning, so trusting data written right before a power failure is a really bad idea. I really wouldn’t trust resuming a download after a reboot, just on general principles, unless it was double-checked with a checksum; for example, using a program like rsync. UPS’s really aren’t that expensive, and brownouts can damage computer equipment. It’s a Really Good Idea to get a UPS, and set it up so that when the UPS battery starts getting low, it sends a notification to your system so it can do a controlled shutdown. For all that people seem to like to spend lots of time optimizing their systems so they will do the right thing when they crash, let me gently suggest that a tiny amount of effort trying to reduce the probability that the system will crash in the first place might not be such a bad idea.
March 18th, 2009 at 8:39 am
@158 Do you have any real-life example to back up those claims? I’ve never seen that happen or even heard about someone who had that happen to them. That in itself means that such an event has a very small probability. I would like to have high probability problems fixed first, such as ext4 losing data.
And FYI, in developing countries a UPS often costs more than the second hand computers most people are using. Buying extra hardware to compensate for bugs in your software is not a valid answer, sorry.
March 18th, 2009 at 1:31 pm
I read through this discussion. I guess I am still at a loss on whether this is a bug or not. In the past I have used XFS, and a lockup or power failure wiped files that were open for writing. EXT3 is not prone to this.
Does EXT4 work like XFS? I am trying the latest Alpha of Ubuntu 9.04. Our server uses samba, VMware v1, mysql and postgres. If an unexpected failure/reboot occurs, am I looking at empty files rather than simply an earlier revision? I understand the DMA issue etc. on failing power. What is not cool is files being trashed a la XFS rather than reverting to an earlier revision.
March 19th, 2009 at 4:16 pm
[...] If you want to learn more about these issues I recommend you read both articles by Theodore Ts’o, “Delayed allocation and the zero-length file problem” and “Don’t fear the fsync!”, and also Alexander Larsson’s “ext4 vs [...]
March 20th, 2009 at 8:27 pm
Semi-timely notes on filesystems…
There was a semi-large brouhaha on the next generation Linux filesystem ext4 losing data in some situations. Note that if you were already using ext4 for something important, it’s your fault, since one should avoid production use of a brand-spanking new…
March 24th, 2009 at 12:34 am
( sorry if this was proposed in comment # 46-162, my brain fried )
There is one thing about this *whole ecosystem* problem:
For the problem to be fixed, it is necessary for the problem apps to be identified & fixed.
Could someone write a small util that could be run on packages ( either compiled, or source — I’m not a coder, so I don’t know which ), so people could discover which apps/packages do it wrong, and all together get lobbying the devs for portable/safe fixes?
If the output said something like:
processing src/foo
no calls to fsync()
.. this relies on the non-standard ext3 data=ordered filesystem!
non-portable!
or
many calls to fsync(), included nested loop calls
this appears to call too many times to work on data=ordered fs’s
or
uses fsync(), and not in nested loops: looks good (:
I’m thinking
informative, helpful, and contributory
as the mode, and it’ll be less hated by those subject to its influence, if you see what I mean…
As I say, though, such a util would help change the game from
“it’ll never be safe to do it either way: too many apps depend on data=ordered”
to
“with 10 000 people testing the apps *they* rely on, influence/pressure’s building against those apps that make wrong assumptions about underlying fs’s,
and in 2 years we can simply cut out all the remaining broken apps”
( PS: how come there isn’t a bunch of utils for coders to run on their projects to see if they’ve done things The Right Way?
Test First *prevents* incompetencies like the one our entire ecology’s in now, eh? )
and finally, *thanks* for explaining, not just this, but endlessly, everything, to include us in understanding of the infrastructure that our lives stand-on.
( :
March 25th, 2009 at 12:16 am
#163: for backward compatibility, there are now options that disable delayed allocation of rename replacements at the filesystem level on ext4 and I believe XFS. There are a lot of applications that do not call fsync for performance reasons and probably never will (although they really ought to come with options to do so). If they want to reduce the window of vulnerability, on Linux they can call sync_file_range() to schedule immediate writeout.
Hopefully in the not too distant future there will be an fbarrier(fd) system call that is more generally useful (i.e. write data before metadata, but don’t make me wait). Most of the problematic applications will not get fixed until there is. fsync is fast enough for saving a document. Not for just about anything else without adding threads. That is why we need fbarrier.
March 25th, 2009 at 7:16 am
“There was a semi-large brouhaha on the next generation Linux filesystem ext4 losing data in some situations. Note that if you were already using ext4 for something important, it’s your fault, since one should avoid production use of a brand-spanking new…”
For one, major Linux distributions already include, or will in the very near future include, ext4 as an option. It’s in the mainline kernel. Right now I am testing in VMware. The concerns are valid.
March 27th, 2009 at 12:15 pm
[...] (rather than just use tune2fs) then you get additional improvements. You do want to check the data loss issue though. Fortunately the defaults set in /etc/mke2fs.conf are up to date – setting the new features [...]
March 28th, 2009 at 6:56 pm
Ted,
Thanks for the insight in this. I’ve written several low-level file IO systems myself, although none anywhere near the complexity as you have, and just wanted to point a few things out to your readers:
1. It is impossible to guarantee no data loss in all situations
2. There is always a tradeoff between performance and safety
Personally I feel that applications should be doing the right thing, and not forcing the OS to conform to bugs in applications. However, that is not acceptable from a consumer point of view, who just want their application to work. Witness the megabytes of patches for Microsoft Office bugs in both the Windows and Mac OS system code.
One thing I have always found I have to write is a simple way to replace a file safely. It would have been good for POSIX to have a slightly higher-level API where you could read a file, write a file, or replace a file, so that these operations could always be done correctly inside the POSIX API.
Also, on a modern OS you should always be doing file I/O operations in a separate thread from your UI, which should be in its own thread. This is the actual bug in Firefox 3, not the use of fsync (although that in itself was highly suboptimal). If you want your UI to be responsive you have to make sure that it doesn’t get held up by I/O; Firefox can still hang if another thread causes a 100MB video to be written to disk just before Firefox needs to write 35 bytes of visit data. All their solution did was make the problem less apparent; it did nothing to actually solve it.
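One way to keep the UI thread away from fsync() is a dedicated writer thread fed by a queue. The sketch below is a hypothetical Python helper (the class name BackgroundWriter and its methods are invented for this example, and it has nothing to do with Firefox’s actual code); error reporting is left out to keep it short.

```python
import os
import queue
import threading


class BackgroundWriter:
    """Hypothetical helper: one worker thread performs all durable appends."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, path, data):
        """Called from the UI thread; queues the write and returns at once."""
        self._jobs.put((path, data))

    def _run(self):
        while True:
            job = self._jobs.get()
            try:
                if job is None:      # sentinel queued by close()
                    return
                path, data = job
                with open(path, "ab") as f:
                    f.write(data)
                    f.flush()
                    # This call may block for a long time under heavy
                    # disk I/O, but only the worker thread waits for it.
                    os.fsync(f.fileno())
            finally:
                self._jobs.task_done()

    def close(self):
        """Finish the queued writes, then stop the worker thread."""
        self._jobs.put(None)
        self._thread.join()
```

The UI thread only ever pays for a queue insertion; a slow fsync() delays later writes, not the interface.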
It’s like treating the symptoms of a disease; if you make the pain from the patient’s cancer go away then the patient will feel better, but they’re still going to have cancer unless you treat the cancer itself.
Unfortunately one thing I have learned is that modern software development is by and large one of apathy and programming for the engineer and not the best interests of the consumer. Less focus on programming to the latest and greatest trend and/or whatever is easier for the engineer and more focus on writing small, clean code that conforms to system standards would really help in every aspect of our daily computing.
March 28th, 2009 at 7:28 pm
Follow up;
I’ve noticed a lot of people commenting on the “always call fsync” problem, but that is not how I read what is being said; I read it as “if you want to make sure your data is on disk, call fsync”.
So I’m not sure why so many people are upset over this; you only need to call fsync if you want to ensure that some data is written to the physical device (this can never be 100%, but only the intent) in case of unexpected failure.
It’s a very simple concept in my mind; if I want to ensure data is on the disk I need to specifically tell the system to ensure that happens, otherwise I just let the system do what is best for the current situation, which could mean that my data NEVER gets written to disk (and would make perfect sense for instance if I write a temp file, it was read by another process then deleted). The system can never know about what data is really important, that’s application specific.
March 30th, 2009 at 3:38 am
Follow up to #168:
Lane, I have one issue to take with your comment about fsync. Although I have never done a proper study of this, I have a strong suspicion that when a process writes data to the disk, the normal case is that it wishes the data to reach the disk. I would even be willing to hazard a guess that in the normal case, the process would be willing to accept a certain performance penalty in return for the assurance that the data has reached the disk. In theory, this would call for special APIs to open a file for which the process does not consider it critical to flush data immediately, although I realise of course that since the process writing data only pays part of the price when it is fsynced (and many may not care) there would not be a high motivation to use those APIs.
March 30th, 2009 at 4:43 am
[...] http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/ [...]
April 2nd, 2009 at 2:01 pm
I use kernel 2.6.29 and ext4 with default mount options. Please, clarify the following point.
After an unexpected system crash some files may contain user users’ data. True or false?
April 2nd, 2009 at 7:08 pm
#168, #169 – Yes, there are two levels of data security – “I want this data on disk ASAP!” and “I want this data on disk, when you can”. In the first case you need to call fsync() and that is ok. Most data, however is in the second category, where fsync() is an overkill and it is not really fatal that in case of a system crash, the latest version of the file is lost, because it was not written to the disk yet. The bug was that ext4 destroyed data that was already on disk immediately (destroyed current version of the file) and then waited for 30+ seconds before fixing it by writing the new data.
#171 – ‘other users’ data’, I assume you mean? False. The affected files will just be 0 length, until the patches in 2.6.30. After that there should also be either new or old file contents, with no assigned but unallocated blocks from other user files in the middle.
April 3rd, 2009 at 3:48 am
#172 – Thanks for the reply. Yes, I meant “other users’ data”.
Does the behaviour targeted for 2.6.30 have more chances of fragmenting a file system?
April 5th, 2009 at 12:57 pm
[...] over the last few weeks on http://www.planet.debian.org, Theodore Ts’o’s battles with reports about data loss from the ext4 filesystem. When I first read it, I thought something [...]
April 6th, 2009 at 9:32 am
[...] weeks, there have been quite a few debates about delayed allocation in modern filesystems. It was all triggered by the move to make ext4 the default, the replacement [...]
April 17th, 2009 at 12:17 am
I think that the problem is the journal being serial instead of parallel. Two transactions that don’t step on each other should be reorderable and combinable at will, particularly if neither of them depend on each other. A prime example would be two separate transactions, each containing blocks belonging to completely unrelated files, so it’s unreasonable for them to be lumped together in the same commit IMO.
Let us suppose that you have two (or more) completely different files getting fsync’ed, while background flushing is going on too.
What IMHO would be appropriate is
1. The two fsync calls cause the “rush flush order” buffers involved to get put as far to the front of the line as can be, but when competing with other “fsync rush orders”, should be freely reorderable.
2. If any “fsync rush orders” are pending, then background writes, though they should stay in the pending list, defer to any fsync-initiated writes if there is contention for the device. So background does its merry, merry trickle out, but fsync-triggered writes get VIP treatment. Background writes have to wait their turn until the fsync writes drain, including any background writes that got promoted to fsync priority by an fsync call. Though, as with priority inversions, if any of the fsync writes depend on proper flushing of a background write, then the VIP treatment gets cascaded and the backgrounders inherit the fsync’s rush status.
3. (If they don’t already) inodes should keep track of any of their blocks that happen to be dirty.
4. Any buffers that can be written in parallel (so perhaps the fs should handle more of the elevator logic?) should be permitted to be parallelized as far as the device can.
5. Pointed-to blocks are flushed before the pointers to them, like one link at a time in a chain. But you may have more than one chain. Perhaps this could be implemented as block dependencies, where block X isn’t written until all of the blocks it depends on are confirmed written, but block X doesn’t have to wait around for block Y if X and Y don’t depend on each other.
I think these principles should be introduced in the zen of filesystems:
Unrelated transactions should be freely reorderable at will to provide block devices (AND filesystems that are scheduling the writes!) maximum opportunity to reorder and coalesce for efficiency sake.
Filesystems should be more involved in making decisions about what blocks get flushed when, instead of letting the block device’s I/O scheduler make all the decisions. Filesystems know best what blocks need flushed first, both in cases of integrity as well as cases of
Journal commits should be done in reorderable parallelism if they have no need to be done in a certain order. Case in point…data blocks that are in unrelated files should be kept separate yet freely
Make transactions/commits smaller, so that an important fsync doesn’t get mashed together with a bazillion other pending writes and wind up stalling the application that only cared about that one fsync’ed flush
April 17th, 2009 at 12:42 am
Also, russ:
rename doesn’t touch the inode being renamed, all it does is fiddle with the directory inodes that have the old and new names.
What you have here is a dirty inode getting a name change.
Directory tree consistency and file data consistency are IMO separate issues.
April 17th, 2009 at 1:38 am
Perhaps filesystems should provide a vm callable hook that basically orders the FS to flush out a specific block, with the default behavior being to simply let the VM flush it itself.
April 17th, 2009 at 7:17 am
@176 Raymond,
It turns out that it is extremely difficult to determine whether or not two individual file system operations “step on each other”. It’s of course possible, but only if you keep a huge amount of accounting machinery around. For example, suppose you are unpacking a tar file with thousands of small files; creating each of those files will involve touching a block bitmap allocation block, and so they must be considered to have “stepped on each other”. Other file system metadata blocks that can involve multiple files include the inode allocation blocks, directory blocks, the inode table — and ultimately the block group descriptor summary accounting blocks.
That last means any change in the number of free blocks, inodes, or directories in 128 adjacent block groups will be summarized in the block group descriptor block — so in practice, two updates to two different and unrelated files have a very high probability of “stepping on each other”.
So basically, what you propose is possible, but (a) it’s a huge amount of work, (b) it becomes a maintenance nightmare, because it’s hugely complex and fewer people will be able to understand the dependency machinery (not as bad as trying to implement soft updates, but still pretty bad), and (c) it won’t work all that well because ultimately there are a huge number of dependencies due to the fact that most block and inode allocations will ultimately end up touching the same blocks.
BSD soft updates tried to make this situation better, but they did so by keeping even _more_ accounting data, so you could partially rollback transactions you didn’t want to commit yet. But that added even more complexity to the whole mess, which meant very few people could maintain it. It’s one of the reasons why BSD FFS didn’t get ACL and extended attribute support for quite a while; it had to wait for Kirk McKusick to implement it.
Ultimately, though, needing to commit extra metadata blocks to the journal generally isn’t the problem; the problem is needing to flush out the data blocks in data=ordered mode. And there are simpler solutions. One is to simply use data=writeback; another is to use ext4’s delayed allocation. Yet another is a solution Chris Mason has been working on, which is a data=guarded mode for ext3, which will avoid extending i_size until the blocks hit disk. All of these solutions solve the fsync-is-slow problem quite well.
April 17th, 2009 at 7:36 am
@178: Perhaps filesystems should provide a vm callable hook that basically orders the FS to flush out a specific block, with the default behavior being to simply let the VM flush it itself.
Raymond,
Linux has such a thing already. See the man page for sync_file_range. One caveat, though; this system call doesn’t do anything about file metadata; it only operates on the data blocks. It was intended for use by databases, which generally operate on fully allocated files.
April 26th, 2009 at 5:47 am
Ted, you have the patience of a saint – especially in not taking the heads off those who persist in calling ext4’s behavior a ‘bug’, but as well in tolerating and responding politely to all the other inanities in both this and your other thread.
How quickly people forget. The rename-after-update behavior that’s caused all the ruckus here was bog-standard in virtually all Linux/Unix file systems until relatively recently, when people like you started offering new approaches which at least provided the option of more civilized behavior than what POSIX permits.
Civilized behavior which many non-Unix file systems have offered for well over three decades, often based on careful-replacement strategies. But rather than clamoring for such protection all this time the Unix crowd has instead loudly proclaimed how much better their approach was due to its allegedly superior performance…
So reminding people that to ensure portability (even within the meager breadth of Linux) applications really do need to avoid dependence on behavior not guaranteed by POSIX is entirely appropriate – and that any application that ignores this advice is, by definition, broken. “But – but – but…” they cry. ‘But’, schmut: broken is broken.
Still, it was nice of people like you to provide an alternative that (at least as a configuration option) helped cover for those brain-damaged apps (plus avoided most of the pain of fsck) – not that you seem to be getting much credit for that, let alone for providing it in a manner that made upgrading from ext2 completely painless (funny how that constrained you – well, perhaps Stephen originally – from doing a lot of things that a from-scratch implementation might have allowed).
Instead, in come the complaints because in some pathological cases (like the Firefox one) performance took a hit. Some people might say that this was what Firefox deserved for requiring a full-fledged database to manage its modest data and then casually ignoring what the consequences of this might be, but no, let’s place all the blame on the file system.
Now you’ve fixed that in ext4 and, as usual, no good deed goes unpunished – because the very behavior that caused performance complaints in ext3 turned out to have bailed some broken applications out of the holes they would have fallen into had they been running under many other Linux file systems. So people are demanding ad hoc fixes like linking file data updates to renames.
Ignoring for the moment their confusion about whether they want actual atomicity (which would require you to implement full-fledged transactions in ext4) or just post-crash ordering guarantees (which might only require that you flush out dirty file data prior to a rename operation on that file – oh, wait: they don’t want that either because they’re afraid of the potential disk activity), let’s instead ask the question, “Why would anyone consider such a solution to be anything but an incredibly ugly and unjustifiable hack?” Some people seem to be claiming that *all* metadata updates should by definition flush all data updates before becoming persistent. Really? Would this apply to a change in file permissions as well? If the argument is that a change in the path to the file is somehow more closely tied to its content than its access control is, would it then apply to renaming a directory a couple of levels higher in the directory hierarchy, not just to renaming the file itself? Or let’s modify an existing file, then copy it to a new file: does that imply that you need to propagate the changes to the original file on disk before they go to the new file on disk to keep everything consistent after a crash with respect to what many might consider logical operation ordering?
Anyone who has any concept of a file as an object, or of cleanliness in abstraction, would have to say “No.” At best, they should be asking for either the current behavior guaranteed by POSIX (which doesn’t couple changes between otherwise independent operations) or for complete post-crash ordering that mirrors run-time ordering (but doesn’t guarantee durability – i.e., reflects run-time ordering up to the point where the crash interrupted persistence and nothing thereafter): this is of course far beyond what POSIX guarantees, but is at least a conceptually clean extension of those guarantees that applications might find useful.
But I doubt that ext4 could provide that with anything like acceptable performance (though I can imagine a couple of from-scratch approaches that just might be able to – assuming that people would find that useful even though no other file system supported it). Hey, while we’re at it, atomic individual writes and optional multi-operation transactions would be nice, too.
Still, if you’ve been able to implement the rename-specific hack without much additional code complexity (which could translate to maintenance problems down the road) and without becoming actively nauseated, my hat’s off to you (especially given that a lot of contributors here are still managing to find fault even after having been given this).
Incidentally, I agree with those who deplore ignoring fsync for laptops – or in any other situation, for that matter: if an application issues an fsync it’s with the expectation that it will have executed by the time it returns, and no one – not you, not a system admin, not a user – is likely to know whether that expectation may be important enough that it should not be ignored. Rather than ignoring fsyncs, moving toward a situation where they’re not needed as often seems like the proper manner in which to evolve, at least as long as intelligently scheduling persistence ordering and persistence itself continues to have performance benefits. If, for example, most applications depended not on fsync but on some configurable system parameter like dirty_x_centisecs (which could be set according to the likelihood of a power interruption or system crash – e.g., larger for laptops and systems with a UPS, smaller for systems with buggy drivers) to keep their on-disk data from becoming too out of date then eventually users might get out of the habit of issuing explicit ’save’ operations and most fsyncs might be avoidable, especially if they weren’t required for post-crash ordering guarantees. Note that this would at least significantly ameliorate your objections to Firefox’s behavior (even on laptops: there’d still be some SSD churn if it wrote promiscuous amounts of data even if writing that data was significantly delayed to allow reasonable spin-down time, but I expect SSDs will be able to deal with that before too long if only by virtue of size increases and improved wear-leveling algorithms).
And while write entanglement may well be a serious obstacle to changing ext4’s behavior, with a from-scratch implementation it could be handled with logical REDO/UNDO mechanisms operating on fields rather than entire blocks.
We may differ in how disks should be managed, since with a competent file system and contemporary disks (e.g., those that don’t depend upon on-disk write-back caching in order to achieve decent streaming write performance as some older ATA disks did) I don’t think the on-disk write-back cache should be enabled, eliminating the issue of full-disk flushes (FUA helps some in this area, but still leaves situations in which you really need to know when some non-FUA data has made it to the platters, which requires a full-disk flush if the write-back cache is enabled). NCQing still allows some flexibility in request ordering at the disk among those operations which don’t require specific ordering plus some prioritization of important (e.g., log) writes, and this does require that the file system take responsibility for scheduling writes in an intelligent manner (which may be difficult to achieve using Linux’s normal unified cache mechanisms – I’m not intimately acquainted with that). But perhaps I’ve misinterpreted some of your comments in this area, and one must also consider (hopefully rare) situations in which disk use other than the main filesystem’s may want the disk’s write-back cache enabled (in which case the filesystem must fall back upon flush operations when non-FUA writes must be forced to disk).
In any event, thanks for having had the patience throughout these interminable threads to try to educate those who most need it. I’ve learned things myself as well.
- bill
April 28th, 2009 at 12:23 am
#181:
I think you are too harsh on the application programmers, who just have different needs that they hope the filesystem can provide. Some just want their data to be safe in case of a crash and do not need too much performance; they can simply use a journalling filesystem and do a fsync() whenever the filesystem requires it for data integrity. Some want fast and safe atomic file replacement, which leads to the ext3 ordered mode which requires no fsync() and no immediate (before the update is visible to other processes on a running system) disk activity for this. Some want both fast atomic file replacements and fsync() with acceptable performance, which is not easy to achieve with current filesystems; the hacks introduced to ext3 and ext4, though ugly, seem to be the best compromise.
So there are three essentially orthogonal issues here:
1) The filesystem needs to be capable, given sufficiently expressive APIs, to provide good performance and data integrity at the same time. This is the problem solved by the recent ext3 and ext4 changes.
2) The API must be expressive enough for application programmers to describe the behavior they want, both for a running system and after a crash, and preferably portable among filesystems and OSes and hard to use wrong. The need for expressiveness leads to the proposal of calls such as fbarrier(), and the “hard to use wrong” requirement prefers a safe rename() even without fsync()/fbarrier() (as post-crash behavior is very hard to test). The conceptually simple and clean API that you want is good as a low-level one, but the high-level API used directly by applications must have better portability and fewer hard-to-test traps.
3) Application programmers must make the best use of whatever API is available. If fsync() before rename() is unnecessary for a majority of users and causes performance issues for many of them, then, depending on the application’s priorities, it might be reasonable to omit the fsync() and thus sacrifice portability for performance.
April 28th, 2009 at 1:59 am
The problem with application programmers is that they’re often blind to all needs save their own (and don’t even understand what their actual needs, vs. mere conveniences, are).
Most of the garbage that they’re asking for here is ridiculous given that they’re asking it of only one of many file systems that they can reasonably assume their application will be run on. And in particular, asking for enhancements to the *standard* rename function in ext4 – when those enhancements almost certainly won’t be present in other Linux file systems on which their application may run – is just plain wrong.
There is a common existing mechanism that should work on all Linux file systems for updating a file in place with guarantees that either the old or the new contents will be found after a crash: it’s to fsync the new file before closing it and then rename the new file to the old file. Applications that omit the fsync and still expect the same result are simply broken – end of discussion. Application developers who complain about this situation are lazy and spoiled.
Why? Because Linux file systems *already are* “capable … to provide good performance and data integrity at the same time”, with no API extensions whatsoever. One obvious way to achieve the kind of result that people here seem to be seeking is to ping-pong updates between two files *without* renaming, using a generation counter in the file to decide which is more recent if both exist (and have been written out completely, which can be verified with a checksum in the same sector as the generation counter – though this last is probably unnecessary for any file less than one disk sector in size). You could even do this within a single file with two sections, though then you’d have to split the sections at a suitable boundary to make sure that a single update operation never hit more than one of them (which could wind up being an implementation-dependent requirement).
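The two-file ping-pong scheme described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the slot names (`.0` and `.1` suffixes), the header layout, and the `load`/`store` helpers are invented for the example and are not part of any filesystem or library. The generation counter and a CRC32 sit together in a small header at the start of each slot, and each write still ends with an fsync, since nothing is guaranteed to reach the disk without one.

```python
import os
import struct
import zlib

# Header: generation (8 bytes), payload length (4 bytes), CRC32 (4 bytes).
HEADER = struct.Struct(">QII")


def _slot_paths(base: str):
    # Hypothetical naming: two sibling files that take turns holding the newest copy.
    return (base + ".0", base + ".1")


def _read_slot(path: str):
    """Return (generation, payload) if the slot is complete and valid, else None."""
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except FileNotFoundError:
        return None
    if len(raw) < HEADER.size:
        return None
    generation, length, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        return None  # the write never finished
    if zlib.crc32(struct.pack(">QI", generation, length) + payload) != crc:
        return None  # torn or corrupted write
    return generation, payload


def load(base: str):
    """Return (generation, payload) from the newest valid slot, or None."""
    valid = [s for s in map(_read_slot, _slot_paths(base)) if s is not None]
    return max(valid, key=lambda s: s[0]) if valid else None


def store(base: str, payload: bytes) -> int:
    """Write payload to the slot that does not hold the newest valid copy."""
    slots = [_read_slot(p) for p in _slot_paths(base)]
    generations = [s[0] if s else -1 for s in slots]
    # Overwrite the older (or missing/invalid) slot so the newest copy survives a crash.
    target = generations.index(min(generations))
    generation = max(generations) + 1
    body = struct.pack(">QI", generation, len(payload)) + payload
    record = HEADER.pack(generation, len(payload), zlib.crc32(body)) + payload
    with open(_slot_paths(base)[target], "wb") as f:
        f.write(record)
        f.flush()             # push Python's buffer to the kernel
        os.fsync(f.fileno())  # force the slot to stable storage
    return generation
```

A reader calls `load(base)` and ignores any slot whose checksum fails, so an update interrupted by a crash simply leaves the previous generation in effect.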
But that would still leave a problem that hasn’t been very well explored here: without fsyncing, nothing is *ever* guaranteed to be on disk, so you *can’t* avoid whatever overhead fsync may entail and get anything dependable after a crash. If you omit the fsync in the de facto standard rename-based approach you don’t know that the new copy ever made it to disk, hence a crash can still give you a zero-length file (which may be either the old version or the new version). Or a partially-written file if it has more than a single block.
Ted originally framed this as a delayed-allocation issue, but if I understand things correctly it’s really just the same old lazy-write behavior that virtually all Unix file systems have had: if you don’t force something to disk explicitly, there are no guarantees about what you’ll get after a crash. All that delayed allocation does is increase the length of time that dirty data tends to remain in memory before being written to disk, thus increasing the *probability* that a broken application’s latent faults will rise up to bite the user.
The proper sequence that Ted outlined (create new file/fsync it/close it/rename to old file) guarantees that after a crash, regardless of when that crash may have occurred, any file with the ‘old file’ name will always have been written out completely. There’s really no way to avoid fsync and guarantee this: without fsync, none, some, or all the file’s modified blocks may be on the disk. Having the standard rename function imply an fsync or be delayed until one has effectively occurred is inappropriate for the reasons noted initially above (unless *all* Linux file systems support this new behavior) – and Ted is entirely correct in characterizing any such patch to ext4 as being a temporary kludge provided only out of the goodness of his heart to help make up for the incompetence of application developers until they’ve had time to fix their mistakes. And creating a new rename function the only purpose of which would be to delay the rename until all the new data had migrated to disk without having been forced there does not strike me as a reasonable request: it wouldn’t fix a single current broken application because none of them use this hypothetical new rename, and it’s way too special-purpose a facility to ask a file system to support given that a very reasonable solution (using fsync) already exists.
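As a rough sketch of the sequence above (create new file, fsync it, close it, rename it over the old name), assuming Python on a POSIX system. The helper name `replace_file_atomically` and the `.tmp-` prefix are made up for illustration:

```python
import os
import tempfile


def replace_file_atomically(path: str, data: bytes) -> None:
    """Replace `path` so that a crash leaves either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the new file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's userspace buffer to the kernel
            os.fsync(f.fileno())  # force the new file's data to stable storage
        # The file is closed here; os.replace() is rename(2) on POSIX systems.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that `mkstemp` creates the new file with restrictive permissions, so a real tool would probably also copy the old file's mode and ownership before the rename.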
Explicit use of an ‘fbarrier’ is an interesting thought until you start to think about how it would work – even just in the current situation under discussion. Would it hold up execution of the following rename operation (freezing the application thread that had issued it) until the dirty data had migrated to disk of its own accord? If not, is the desired behavior that the rename complete in memory (and be visible to other accessors) but not be allowed to migrate to disk until the changed data had done so? Would this apply to all subsequent file operations in the thread – or to all subsequent operations on that one file in the thread?
All this really does appear to be confused developers proposing solutions in search of a problem. Ted is correct in saying not to fear fsync – save possibly in ext3 with data=ordered, but then ext3’s behavior is what people should be complaining about, not ext4’s.
- bill
April 28th, 2009 at 6:26 am
Well, I agree that application programmers have to make the best use of the available APIs. If they don’t care too much about performance (i.e. they don’t update hundreds of small files all the time), they should call fsync() before rename(). If they need better performance while retaining crash safety and portability across other journalling filesystems, they may try to figure out whether a fsync() is necessary on the specific filesystem, or do something more complicated as you have described.
But this does not mean that unordered rename() and fsync() form a good API for atomic file replacement. It makes such a common operation either slow or complicated (and not transparent to consumers of the file), when most filesystems themselves require no such tradeoff. fsync() may be faster on some filesystems than others, but it can never be very fast: as the kernel cannot know the application’s true intentions, it must always do a disk I/O immediately. This is not good enough for some users (e.g. laptop users) and some applications. As filesystems such as ext3 data=ordered or ext3/4 with the hack can already do atomic file replacements much more efficiently, it is reasonable to ask for a better API that exposes such functionality.
A safe rename() (crash-proof without fsync()) is a high-level solution to this. It is difficult to implement efficiently in user-space only, so it naturally belongs in the kernel where filesystem-specific issues can be observed. Your concerns about portability and legacy applications are easy to solve: simply define a raw_rename() call that does not order the data block updates and the renaming itself, an ordered_rename() call that does, and make plain rename() do an ordered_rename() in order to accommodate legacy applications. New applications can call ordered_rename() on systems with it, and fall back to fsync()/rename() on those without.
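The fallback logic proposed above might look roughly like this in application code. Neither `ordered_rename()` nor `raw_rename()` exists in any kernel or library; the `ordered_rename` parameter below is a hypothetical stand-in for the proposed call, and only the fsync()/rename() branch uses real system calls.

```python
import os


def replace_with_ordering(tmp_path: str, final_path: str, ordered_rename=None) -> str:
    """Rename tmp_path over final_path with the data ordered before the rename.

    `ordered_rename` stands in for the proposed call, which no real system
    provides; pass None to take the portable fsync()/rename() fallback.
    Returns the name of the strategy that was used.
    """
    if ordered_rename is not None:
        # Hypothetical path: the filesystem orders data blocks before the rename.
        ordered_rename(tmp_path, final_path)
        return "ordered_rename"
    # Portable path: force the new file's data out, then rename.
    fd = os.open(tmp_path, os.O_WRONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    os.replace(tmp_path, final_path)  # rename(2) on POSIX systems
    return "fsync+rename"
```

On today's systems every caller would pass `None`, which makes the cost of the proposal visible: the application has to carry both paths until the new call is available everywhere it runs.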
fbarrier() should be a relatively low-level API for extremely performance-conscious people, mainly for databases rather than atomic file replacements. I agree that it is difficult to define well, such that it suits both the design of most filesystems and the need of applications.
April 28th, 2009 at 2:05 pm
Your comments, like so many others both here and in the other thread, reflect the fact that people simply aren’t listening to people like Ted who actually understand the situation. So try harder this time.
1. Even ext3 isn’t safe without the fsync in the common rename-to-update case *unless* it’s operating with data=ordered. So in the situations which you postulate above where performance is supposedly so important that the cost of the fsync would be prohibitive, guess what? Data=ordered may well not be used and ext3 will be just as risky as ext4 was before Ted decided that a kludge was preferable to putting up with so much continued whining.
2. You claim that “most filesystems themselves require no such tradeoff”. Please be specific about which ones you are talking about that can omit an fsync in the rename-to-update sequence with zero risk in all their permitted operating modes (or even just the common operating modes).
3. You imply that using fsync in that sequence makes it ’slow’. Please quantify that, since Ted obviously disagrees with you (and I suspect knows a great deal more about this area than you do). Don’t talk about how much using fsync may slow down ext3 – that’s not the issue here (in part because if ext4 is about to become the standard on Linux most applications that are affected by ext3’s performance have already been written and already either use fsync or don’t).
4. You state above that “filesystems such as ext3 data=ordered or ext3/4 with the hack can already do atomic file replacements much more efficiently.” Ted has already made it clear that in the case of ext3 this is poppycock: the reason that ext3’s behavior with data=ordered is safe is because when the 5-second timer forces out the rename update it also as a side effect forces out *every other dirty update in the filesystem*, which in most filesystems has a great deal *more* overhead than a normal fsync has. As for ext4, please explain, in detail, just how ext4 could have made the broken operation sequence lacking fsync safe without incurring about the same amount of overhead (little though that may be) that an explicit fsync would incur (don’t appeal to the special case of laptops here: the fact that without an fsync the request sequence may be able to be delayed is only an issue for spun-down conventional disk drives, and anyone serious about battery life already has or very soon will have the option of using SSDs to solve that problem even in the presence of fsyncs).
5. *After* you feel that you have successfully dismissed all the above issues you will be in a position to defend the quantitative value of defining new general renaming functions that most Linux (let alone Unix) file systems may well never support.
6. Please explain how changing the defined semantics of the existing rename function *in the code of every existing Linux (let alone Unix) filesystem* to make omission of the fsync safe in the broken operation sequence which you favor qualifies as an ‘easy’ solution to this (non-)problem. Because that’s what redefining the semantics of the existing rename implies (unless you believe that creating a situation in which most existing filesystems would have to be considered broken is reasonable).
7. Please respond to my earlier challenge to specify exactly how you want this fbarrier to work, rather than wave your hands and allege its desirability as if it were some sort of magic solution to problems which you haven’t defined very well either. Then at least take a stab at defining how it could be implemented in all Linux filesystems without truly massive effort.
- bill
April 28th, 2009 at 2:37 pm
While it is wise to not feed a troll, there are many of you Bill, so I will show my unwise side here.
The behavior of filesystems in case of a power loss or system crash is not defined in POSIX. There can be filesystems that erase all your data on every crash and they would still be POSIX-compliant. POSIX is the lowest common denominator and we in the modern Linux operating system can do better, much better than that. All the filesystems that have been affected by this bug and that are in Linux, have been patched to fix it.
1. ext3 in data=ordered mode is and has been the default for most distros for many years; it is safe to assume that most Linux installations out there use this very filesystem in this very mode.
2. With ext3 in data=ordered mode it is perfectly safe to omit fsync. Moreover, even if fsync was used with the buggy version of ext4, it was not safe. You had to fsync before and after rename and then fsync the containing folder as well. That is one retarded way to interpret the API.
3. fsync is slow, expensive and does not match what the applications actually want to say. fsync wakes up hard drives, increases wear of SSDs, cuts through write caches on both the system and the drive and causes all kinds of chaos, because it is a priority operation – it MUST happen now. In the situations being looked at the applications do not care to save the data now; they do not even really care to save the new version of the data. All the applications care about is to save either the new version or the old one. Not to get an empty file or an amalgamation of old and new.
4. The point is about the order of data and metadata operations. While data operations were delayed, metadata hit the disk almost instantly, resulting in broken links on disk. That was the whole essence of this bug – the filesystem was broken for most of the time. The basics of programming with pointers is to make the new target data first and only then change pointer to it.
How rename behaves over a crash has never been defined, so Tso assumed that it would be perfectly fine if you lost all your data if your system crashed 30-60 seconds after a rename. Users disagreed. So now the rename is defined as being atomic even over a crash. Either the rename is complete (both data and metadata) or neither data nor metadata are changed.
It is better to write smart filesystems and simpler applications than to worry about how our application code would work on old, obsolete, proprietary systems that are often retarded in very many different ways.
Also, I do assume that filesystems are written by people with better understanding of data security and safe coding practices than most applications, so I would rather depend on filesystem developers making damn sure my data is secure, regardless of what crazy moves the applications do.
April 28th, 2009 at 3:26 pm
Making a filesystem an order of magnitude more reliable in certain common cases is well worth doing. If someone prefers fast and dangerous, they can turn the option off. As I have outlined above, there are straightforward methods to making rename replacement fast and safe.
Least common denominatorism is a disease. fsync works, but few use it for the same reason that POSIX doesn’t require such behavior by default – it has a major performance impact. Anyone ever wonder why “mv” doesn’t call fsync? It doesn’t even provide it as an option (which it should). What about “cp”?
Has anyone stopped to consider why rsync doesn’t call fsync? It ought to provide it as an option, but the basic truth is rsync of a typical directory tree would be at least 10 times slower on a fast connection if fsync was called for each file prior to doing a rename replacement. In some circumstances it could slow down by a factor of hundreds. Data journalling the whole filesystem only slows things down by a factor of two or so.
Calling fsync even on a filesystem that is designed for fast fsync performance (most aren’t) means a delay of 50-100 ms for a tiny file under nearly ideal conditions. On many filesystems under load fsync performance (on a small file) is thought to be acceptable if it completes in 3 seconds(!).
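Latency figures like these depend heavily on the filesystem, the device, and the load, so they are best measured on the system in question. A minimal timing sketch, assuming Python on a POSIX system (the function name and default parameters are illustrative, not taken from the discussion):

```python
import os
import tempfile
import time


def time_small_fsyncs(directory: str, count: int = 20, size: int = 512):
    """Return per-file fsync latencies in milliseconds for `count` tiny files."""
    latencies = []
    payload = b"x" * size
    for _ in range(count):
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # only the fsync itself is timed
            latencies.append((time.perf_counter() - start) * 1000.0)
        finally:
            os.close(fd)
            os.unlink(path)
    return latencies
```

Running it against a directory on the filesystem of interest, both idle and under load, gives numbers that can be compared with the claims made on either side of this thread.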
Any time you have hundreds of files, especially ones that need to be read synchronously by multiple processes, fsync is out of the question. Async fsync based rename replacement only works if there aren’t any other readers who need to get the current configuration, not what it was two or three seconds ago.
Suppose an extra thread was added to rsync so that it could fsync each file? Would that help? Hardly. There is no vectorized fsync, so you would have to add dozens of threads that fsync in parallel to get rsync performance with per file fsync anywhere close to rsync without it (assuming a directory tree with small files).
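The "dozens of threads that fsync in parallel" idea can be sketched with a thread pool. This is an assumption-laden illustration in Python rather than anything rsync does; `fsync_path`, `fsync_many`, and the default of 16 workers are invented for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def fsync_path(path: str) -> str:
    """Open an existing file and force its data to stable storage."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    return path


def fsync_many(paths, workers: int = 16):
    """fsync every path using a pool of threads; returns the paths in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fsync_path, paths))
```

Whether this gets anywhere near the throughput of skipping fsync altogether depends on the filesystem and the device, and the sketch makes no claim about that.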
In other words, fsync is a useless API for a large class of applications that could benefit from greater reliability. They won’t call fsync anyway because fsync is overkill, even on the fastest fsyncing filesystem, so it makes sense for filesystem developers to engage in a modicum of effort to make their filesystem more recoverable from crashes. POSIX does not say “thou shalt make thy filesystem as unreliable as possible”. Thousands of applications will never call fsync, except perhaps as an option. Suppose a “--fsync” option was added to mv, cp, scp, and rsync. How many shell scripts would be modified to use it? Approximately none.
How many shell scripts or utility programs even know when they are called in a context where they should run fast and unreliable or slow and safe? Approximately none of them. The obvious answer is to make rename replacements without fsync fast and reliable (using rename undo), since very few applications are going to call fsync anyway. They don’t even know when they should. Slow to a crawl is not a reasonable default.
April 29th, 2009 at 12:24 am
re 186:
Ah, the traditional recourse of the incompetent: when you don’t know what you’re talking about, just call someone who does know and who disagrees with you a troll. I’ll be charitable and assume that you’re responding not to entry 185 but to some earlier one, since you failed to address a single one of 185’s specific challenges (though you really should do so now, rather than continue with these fatuous generalities and outright falsehoods – a style of discourse reminiscent of certain elements of Russian academia in the Bad Old Days when politics played so much more important a role than merit in advancement).
To take your inanities in order:
1. “The behavior of filesystems in case of a power loss or system crash is not defined in POSIX.” Actually, it is defined – you just don’t care for the definition. The definition is that all data which has been fsynced (absent cases where fsync is defined by the installation to be a no-op, which is allowed but clearly intended only for unusual situations that may justify this) is guaranteed to be on disk, while any data that has not been fsynced may not be. That, in fact, is the *purpose* of fsync.
2. “There can be filesystems that erase all your data on every crash and they would still be POSIX-compliant” is incorrect according to the explanatory material accompanying the definition of fsync (see, for example, http://www.opengroup.org/onlinepubs/9699919799/functions/fsync.html ): if fsynced data could be arbitrarily erased as you suggest, fsync would not serve its stated purpose.
3. If by “we in the modern Linux operating system” you mean that you’re a Linux OS contributor, I’m afraid that this is more a reflection on the lack of qualifications required of such contributors than anything else. If you’re just putting on airs, you might consider how pompous and disingenuous that makes you look.
4. “All the filesystems that have been affected by this bug and that are in Linux, have been patched to fix it.” Ignoring the fact that this behavior *IS NOT A BUG*, do you have any idea how many filesystems exist in Linux? Ext2 is probably only *the most common* example of a filesystem which has *not* been patched to kludge around this specific kind of application bug. And while XFS received at least two changes that (in some ways just as a side-effect) decreased the *likelihood* of encountering this situation, it’s not clear that they *completely* closed the hole that makes broken applications pay the price of their incompetence (e.g., see http://www.mail-archive.com/debian-amd64@lists.debian.org/msg23001.html and http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=ba87ea699ebd9dd577bf055ebc4a98200e337542 while noting in a response to http://sandeen.net/wordpress/?p=17 that Valerie Aurora Henson – a name with significant credibility in Linux filesystem development circles – agrees with the explanation). There seems to be some disagreement about whether JFS and Reiser3 can leave applications with this opportunity to shoot their users in the foot, but I’ve spent as much time researching this as I care to right now – so I’ll leave it to you to provide credible citations if you believe that they don’t.
(Ted himself has waffled over this issue, incidentally – see http://linuxmafia.com/faq/Filesystems/reiserfs.html in which he describes ext3’s handling of this as a feature compared with XFS’s and by implication what had until his recent kludge been ext4’s. That discussion also includes an erroneous assumption about the perils of logical logging, insofar as it equates it with unsafe in-place updates to disk when in fact implementations that use logical logging should simply protect such in-place updates when they finally do occur with a quick copy of the entire block to the log before it’s updated in place so that it can be replayed in such cases. Well, nobody’s perfect…)
5. Even if it *were* “safe to assume that most Linux installations out there use this very filesystem in this very mode” (ext3 with data=ordered) that *still* wouldn’t justify simply ignoring those that do not by changing the existing semantics of rename without changing those other filesystems (and ext3’s behavior when data=writeback) as well. Ted’s generosity in creating an ext4 kludge in anticipation of that filesystem’s likely popularity (and the prevalence of broken applications) may not be misplaced, but anything beyond that would definitely be.
6. “[E]ven if fsync was used with the buggy version of ext4, it was not safe. You had to fsync before and after rename and then fsync the containing folder as well.” In ext4 without the recent kludge fsyncing the new file before renaming it made the operation completely safe in the manner that people have been requesting: after a crash, they got either the old version or the new version, never a zero-length file (or one populated wholly with NULLs, which I have a bit less tolerance for). The only thing that fsyncing the directory does is decrease the window to almost (but not quite) zero during which a crash might return the old version – and fsyncing after the rename as you suggest would be *completely* redundant. This level of incompetence on your part makes your use of the word ‘retarded’ in your description exceptionally ironic.
7. I’ve already addressed your babble about the slowness of fsync in 185, point 3 (and a bit at the end of point 4), so I won’t bother to repeat that here. Address those considerations in detail, or kindly shut up in this area.
8. “The point is about the order of data and metadata operations. While data operations were delayed, metadata hit the disk almost instantly, resulting in broken links on disk.” You are, yet again, seriously confused. This discussion has nothing to do with broken links, only the fact that people are asking for a new kind of ordering guarantee to bail incompetent applications out of the mess their incompetent writers created. If you populate a file without fsyncing it, the data may not get to disk for some not-always-predictable amount of time. If you then do something else to the file where you would like to know that the previous updates have all made it to disk, it’s incumbent on you to perform the fsync or live with the consequences. What XFS seems to have fixed is something that legitimately *could* be termed a structural problem – the fact that a file’s EOF marker might be advanced on disk before the update that had caused this made it to disk, resulting in bogus (rather than just absent) data. That’s why I said above that I have somewhat less tolerance for returning NULLs (data the application *never* wrote) than for returning a zero-length file (i.e., a legitimate state of the file before any of the data that the application wrote made it to disk during the normal course of write-back processing uninfluenced by use of fsync) – and why I believe that delayed-allocation implementations such as ext4’s and XFS’s really ought to keep the old data around and reachable on disk until the new data has made it there (you actually alluded to this at one point above – one of the few sensible statements that you’ve made). 
But there’s nothing particularly special about rename (save that some broken applications use it with erroneous expectations) that doesn’t equally apply to things like chmod, link, etc.: all are metadata operations on the file *object* that are relatively independent of its *content*, whereas metadata relating to the actual *mapping* of the content *are* closely related and should be synchronized appropriately with it (though for more tangentially-connected metadata like mtime less rigorous synchronization could well be justified for performance reasons – e.g., just setting mtime to the recovery time after a crash on any file opened for write access that had not been closed).
9. “How rename behaves over a crash has never been defined.” Yes, it has: the *operation* is atomic (see http://www.opengroup.org/onlinepubs/9699919799/functions/rename.html ), but it has no intrinsic connection to existing file content.
10. “Tso assumed that it would be perfectly fine if you lost all your data if your system crashed 30-60 seconds after a rename.” Ted assumed nothing: unlike you (and apparently a good many application developers out there), he *knew* what was required (and what was reasonable) and what was not.
11. Your comment about making your data “secure, regardless of what crazy moves the applications do” ignores the fact that many of the filesystems under discussion here have options that allow the user to ensure precisely that. The fact that there *have to be* noticeable performance compromises to make this happen appears to be beyond your ability to comprehend.
re: 187
Your post wasn’t *all* that much later than 185 so again I’ll be charitable and assume that you’re responding to some earlier one, since you failed to address a single one of 185’s specific challenges (though you really should do so now, rather than continue with the kind of unsupported generalities that you’ve been spouting):
1. “Anyone ever wonder why “mv” doesn’t call fsync? It doesn’t even provide it as an option (which it should). What about “cp”? Has anyone stopped to consider why rsync doesn’t call fsync?” Well, duh: for bulk operations it’s far more reasonable to force data to disk at the operation’s end than at every step along the way, not only because it’s far more efficient but because doing so for every individual file really doesn’t buy you much if the operation fails later (you go on at some length later about this straw man of using fsyncs during bulk operations so consider this a response to that verbiage as well). A better question might be why they don’t call *sync* after they’ve completed: even though sync needn’t complete before returning, it at least minimizes the time during which the operation might be left incomplete if a crash occurred. Another reasonable question (if you don’t consider sync to be an adequate solution) is what kind of interface would be appropriate to allow such a bulk activity to verify completion before returning to the user – and it’s clearly not the kind of simple ordering guarantee within a single file that people have been touting but something more like a synchronous sync call (at least in the general case where many files and directories may have been updated).
2. “Calling fsync even on a filesystem that is designed for fast fsync performance (most aren’t) means a delay of 50-100 ms for a tiny file under nearly ideal conditions.” Clearly you’ve never designed a journaling file system, where fsyncing a tiny file (or even a not-so-tiny one) can require as little as a single journal entry (containing the file data and the inode/bitmap/directory updating information) and can thus execute in at most about one log disk revolution (4 – 8 milliseconds with contemporary disks if the disk is dedicated to the log, not all that much more at least on average even if a seek is required) while the updates to the other filesystem structures can be deferred for later write-back at the system’s convenience (when the disk heads are in a good position, or the target disk is idle, or other updates to the same blocks have occurred and can be combined into a single write, or many such writes can be combined when quasi-log-structured copy-on-write approaches are used: the only real limit is how old you’re willing to let relevant log entries become before forcing out the dirty data that depends upon them, since eventually that can compromise restart performance). 
In cases of bulk operations many such log entries can be written in a single log disk access (called a ‘group commit’ in database terminology) at speeds approaching the disk’s streaming sequential bandwidth – though this may require minor application cooperation (e.g., you don’t want to stall the application waiting for each strictly serial operation/fsync pair to complete: you could issue the operation/fsync pairs asynchronously in parallel where this is supported – not exactly something Unix has traditionally been good at though it’s finally getting better, or issue the pairs in many threads each of which can afford to stall (about the same approach but with a more Unixy idiom), or depending on the filesystem implementation simply issue a batch of operations and then the corresponding batch of fsyncs, the first fsync of which tends to force all the lazily-written log entries to disk and the rest of which then execute at RAM speeds). Without having examined the code I suspect that any or all of XFS, ZFS (I know, not a Linux filesystem), Reiser4, perhaps Reiser3, and even good old JFS may do this to one degree or another, even though ext3/4 probably doesn’t.
3. “Async fsync based renamed replacement only works if there aren’t any other readers who need to get the current configuration not what it was two or three seconds ago.” You’re obviously confused: the fsync in the sequence that Ted has described applies not to the file being read by some hypothetical parallel accessors but to the file that’s about to replace that file. After the fsync the new version is on disk, and then the subsequent rename atomically replaces the old version. A system crash could then leave the old version accessible on restart if the rename operation never made it to disk, but that’s a different issue than the one you describe (since no ‘other readers’ are still executing) and is just one of many problems that simply can’t be addressed by ordering guarantees but require real synchronous writes (that don’t complete and unlock their target data until they’re safely on the disk).
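The fsync-then-rename sequence discussed in point 3 can be sketched as follows. This is a minimal illustration in Python, not code from any of the applications under discussion; the function name `atomic_replace` and the `.tmp-` prefix are assumptions made for the example. The closing directory fsync is an extra step that makes the rename itself durable, which the ordering argument above does not strictly require.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace `path` with `data` so a crash leaves the old or the new version."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that rename()
    # never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # userspace buffer -> kernel page cache
            os.fsync(tmp.fileno())   # page cache -> disk, before the rename
        os.rename(tmp_path, path)    # atomically swap the new version in
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Optional: fsync the directory so the rename itself has reached the disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that `mkstemp` creates the file with restrictive permissions, so a real application would probably copy the mode of the file being replaced before renaming.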
I’m afraid that the balance of your post is too general, unfounded, and incoherent to allow any sensible response. If you want to be taken seriously you’ll really need to arrange your thoughts a lot better into clear problem statements (which can then be assessed for credibility) with clearly-proposed solutions (which can then be assessed for value and feasibility).
- bill
April 29th, 2009 at 10:51 am
Application programmers are the clients of the filesystem programmers. A filesystem is nothing more than a variable storage library. If a new filesystem tries to make the work of an application programmer harder than the last big thing, then the new filesystem will not be used. You might enjoy writing obscure super-strictly POSIX compliant filesystems, but if you want to write something that will be as popular as ext3, you must make the work of application programmers easier, not harder.
In short – if a new filesystem causes data loss in a typical usage scenario, then it will not be used.
It is useless to argue how right or wrong that usage scenario might be. If it is common enough, it is a de facto standard. In the grand scheme of things it takes much less work to fix ALL filesystems in the world than to fix all applications (even if there was a fix without drawbacks).
As a home exercise please write all the patches required to make unpacking and compiling a Linux kernel fully fsync-safe, compare that patch size to the ext4 kludge patch and compare actual performance.
April 29th, 2009 at 11:17 am
I exaggerate slightly – sysadmins and end users decide what filesystem to use. However, for that target audience this event has been even worse. While application programmers might weigh the merits of the code and would consider new APIs to improve app to filesystem cooperation, end users will not. For a system administrator this whole discussion is a huge flashing red sign saying – there is data loss in a common usage scenario and they don’t even consider that a bug. What other data loss bugs are there in the code that they don’t consider bugs? Unless there is repentance and never-ending claims that user data is sacrosanct, one more such problem will destroy user trust in ext4. It will die a quiet death in loneliness and obscurity regardless of all the cool technical stuff it can do.
I don’t want that. I want ext4 to be great. And part of that is getting all of application programmers, system administrators and end users on board and trusting the decision process of the ext4 developers.
April 29th, 2009 at 1:09 pm
Re: 189 & 190
Since you have nothing but the same repetitive and fatuous generalities to offer despite clear demands for specifics, my conversation with you is over. But I’d be happy to talk with anyone who *is* willing to step up to the plate and attempt to address the multiple areas in which you’re so confused about those specifics: there’s always the possibility that something actually useful might result from such a substantive discussion.
- bill
April 29th, 2009 at 2:19 pm
@190 Aigars,
You seem to be forgetting that the first thing I did was to program in workarounds for the applications that were rewriting files in an incorrect way that happened to work well with ext3. Only then did I tell people that (a) there was a workaround, and (b) application programmers should really do things the right way. And for that, my competence, my judgement, and even my paternity were questioned.
I do care about users’ data, and it was for that reason that ext4 was even more aggressive than XFS or btrfs in terms of implementing workarounds for buggy applications. But I also think that it’s best that application programmers do a better job; and that means considering whether rewriting hundreds of files each time the application starts up, even though none of them had changed, or writing megabytes of data to the disk each time a user clicks on a link to visit a new URL, is really necessary. And it’s also about using fsync() where it is necessary if you care about portability of your application. (After all, even if you use ext3 or ext4 on your local filesystem, what if you are using NFS for some remote file systems?)
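One way to avoid rewriting files that have not changed is to compare the existing contents first and skip the write entirely when nothing differs. The following Python sketch is only an illustration of that idea; the name `write_if_changed` and the `.new` suffix are assumptions made for the example, not anything taken from a real application.

```python
import os


def write_if_changed(path, data):
    """Rewrite `path` only if its contents differ from `data`.

    Returns True if the file was rewritten, False if it was left alone.
    """
    try:
        with open(path, "rb") as existing:
            if existing.read() == data:
                return False       # unchanged: no write, no fsync, no rename
    except FileNotFoundError:
        pass                       # no previous version, so fall through and write

    tmp_path = path + ".new"       # hypothetical temporary name
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())     # new contents reach disk before the rename
    os.rename(tmp_path, path)
    return True
```

An application that calls this for each of its configuration files at startup pays for an fsync only on the files that actually changed.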
So this was not a matter of filesystem developers not willing to make changes; we made changes on our end. We just also think that application programmers should do their part in terms of making for a completely robust system. Filesystem authors can’t always compensate for incompetent application programmers…
April 29th, 2009 at 2:56 pm
@tytso , I appreciate your work very much and I am always excited to read what new interesting ideas you have added to ext4. I am also glad that after initial confusion and outcry the fixes have been implemented.
I must point out that you did suggest that all applications use fsync when replacing files, despite that being too strong a request in most cases.
Obviously, there is little that a filesystem could do if the application requests lots of configuration file rewrite operations. But on the other hand “writing megabytes of data to the disk each time a user clicks on a link to visit a new URL” is required to provide to end users features that these users demand, and with computer power being so abundant, it should not be a problem to provide such access with sufficient reliability. And you proved that it is very possible.
A lot of applications do quite a lot of pretty stupid things. Sometimes those are needed to satisfy user demand, sometimes it is just an artifact of making application programming as easy as possible and sometimes it is fully avoidable. While it could be possible to improve some user space libraries and a few key applications, a general purpose filesystem needs to be able to support a lot of stupidity thrown at it with least amount of surprise to the user.
Generic suggestions such as ‘use fsync!’ will fall on deaf ears, but it could be possible to implement some specific changes in key places (GTK, QT, Firefox, libc) to improve ext4 performance, provided that the changes will not degrade performance in corner cases (laptops, SSDs, ext3 users, …).
April 30th, 2009 at 6:05 am
re. 185:
Below I use ordered_rename() for the rename operation guaranteeing (on a journalling filesystem) that either file be accessible after a crash, and raw_rename() for the raw operation on directory metadata that may not provide such a guarantee.
1. ext3 data=ordered should perform decently for a workload with a huge number of atomic file replacements (and maybe other operations for which the ordering characteristics of data=ordered are useful) and very few other fsync() calls. After all, with the exception of atomic file replacements, heavy users of synchronous writes are mostly databases, which usually do not make much use of filesystem-level metadata anyway. Users with a more fsync()-heavy workload should of course use a different filesystem or mode.
2. I meant that an efficient ordered_rename() is *in principle* possible on most filesystems without changing their basic designs (e.g. on-disk formats), as this is just a question of recording the data blocks that must be flushed before a metadata block can be written to the disk (or committed to the journal). Here “efficient” means that the latency of the call, as viewed by the calling process and other processes accessing the file on a running system, is not limited by disk latency, unlike fsync/raw_rename. Of course, the ordering has a cost on the performance of a concurrent I/O workload, especially on a fsync()-heavy one if it has to wait for the metadata update being delayed, but fsync/raw_rename is no better in this respect. Whether the current *implementation* can record such ordering requirements efficiently without too much code changes is a side issue. At least, ext3/ext4 with the rename hack can support this, and ext3 data=ordered has supported this for a long time if fsync performance is not an issue.
3. See my definition of “efficient” above. fsync/raw_rename is indeed quite suboptimal in terms of latency.
4. See above. ordered_rename() on ext4 with the hack incurs about the same cost as fsync/raw_rename() to a concurrent fsync()-heavy workload, but has a much lower latency as viewed by processes on the running system.
5. As ext3 and ext4 and possibly other filesystems already supports ordered_rename() much more efficiently than a fsync/raw_rename sequence, it is of course desirable to expose this functionality via a system call. User-space applications and libraries are not in a position to deal with filesystem-specific issues, nor are they currently able to express “I want this new file to be visible to other running processes immediately; the recording of the file on permanent storage can be delayed for a while, but when it happens it must happen with such an order”. Therefore, even though they have various means to reduce the latency when doing many atomic replacements, such as threading or keeping multiple versions, these methods are complicated and non-transparent and should be rightfully called workarounds. In general, I think an API should just avoid forcing the user to choose between speed, correctness and simplicity, unless such a tradeoff is unavoidable due to hardware limitations.
6. ordered_rename() is already available on ext3 and ext4 with the hack. In the short term we can just make it imply a fsync/raw_rename on other filesystems without specific support. This should not be very difficult, and will give a more user-friendly API that is also more efficient on ext3 and ext4. On systems without ordered_rename(), a wrapper library can fall back to fsync/raw_rename.
7. What we need is a way to express the order in which changes should be committed to disk, without affecting the order or latency viewed by the calling process and other processes accessing the file on a running system. I think ordered_rename() solves most of the problems, but a lower-level system call like fbarrier() may also be worthwhile if it is sufficiently useful.
fbarrier() can be defined as fbarrier(before_fd, after_fd), which ensures that all previous changes to before_fd hit permanent storage before any subsequent (metadata or data) change to after_fd is written to disk. This does not have to cause the subsequent calls to block, but only makes their changes stay in dirty pages for a while longer. The ordering guarantee applies only to changes made through before_fd or after_fd; in this way applications can specify more fine-grained ordering requirements by having multiple fd’s to the same file (or directory) with which the changes can be tagged. To avoid most difficulties due to the “before” and “after” changes sharing the same page, before_fd and after_fd can be the same fd or otherwise refer to the same file only if the file is a regular file or a block device, the changes are data-only (i.e. the accessed portions of the file have been allocated, and mtime/atime updates are disabled), and they affect different blocks and pages. Otherwise, unless the kernel can maintain multiple versions of a page, an “after” change affecting the same page as a “before” change may be forced to immediately commit the “before” change to disk first.
It can be used to do atomic file replacement as an alternative to ordered_rename(): dir_fd = open("/containing_directory"); file_fd = openat(dir_fd, "file.new"); write(file_fd, …); fbarrier(file_fd, dir_fd); raw_renameat(dir_fd, "file.new", dir_fd, "file"); close(file_fd); close(dir_fd); Here raw_renameat() is viewed as a change to the containing directory. None of these calls have to block, yet the writes and the rename are committed to disk in order. If a large number of files are replaced in this way, the OS only needs to flush all the file data to disk, issue one write barrier on the block device, and then flush the updated directory metadata to disk (or write the commit record in case of metadata journalling). This is similar to ext3’s data=ordered mode and is very efficient for this workload.
Another use is on the transaction log of databases, when most commits are asynchronous (a crash may cause the loss of the latest commits even though data integrity must be preserved; this is suitable to e.g. Firefox’s history). Let log_fd and commit_fd be two file descriptors for the transaction log, and db_fd be a fd on the database file itself, for each transaction we can do: write(log_fd, data); fbarrier(log_fd, commit_fd); write(commit_fd, commit_record); fbarrier(commit_fd, db_fd); write(db_fd, data); This expresses the ordering requirements exactly, leaving the OS free to group the commits and thus improve throughput. Of course the DBMS can also group asynchronous commits itself, but I think the above approach is cleaner. To allow reuse of journal space and support synchronous commits efficiently (fsync/fdatasync suffices if such efficiency is not a concern), we also need fbarrier() to return, when requested, a handle that can be used to wait until the “before” changes are committed to disk. This is a bit similar to mechanisms in asynchronous I/O, but for disk files I think such synchronous-to-other-processes and asynchronous-to-the-disk calls are more useful.
Full support for fbarrier() requires extensive changes to the kernel. For example, the ordering requirements must be recorded for each dirty data page and specified for each dirty metadata page by the filesystem. However, a fallback to fsync/fdatasync on before_fd is always possible (maybe there should be an open-time flag specifying that mtime/atime/etc. are unimportant and fsync can then behave like fdatasync()). It also seems to be easy to implement efficiently in ext3 data=journal mode, or in ext3 data=ordered mode if after_fd is a directory (and thus all its changes are recorded in the journal and the dirty pages of the file associated with before_fd can be linked to this transaction).
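The fallback mentioned above (treating fbarrier() as an fsync on before_fd) can be sketched in user space. To be clear about what is assumed: `fbarrier` and `ordered_rename` are the hypothetical calls proposed in this comment and exist in no kernel or library. The Python below merely emulates them with real calls (`os.fsync`, `os.rename` with directory file descriptors), so it blocks where a true fbarrier() would not.

```python
import os


def fbarrier(before_fd, after_fd):
    """Hypothetical fbarrier(), emulated by the conservative fallback.

    A real implementation would only record an ordering constraint and
    return at once; fsync on before_fd is slower but gives the same order.
    """
    os.fsync(before_fd)


def ordered_rename(dir_path, tmp_name, final_name, data):
    """Write `data` to tmp_name, then rename it over final_name, in that order."""
    dir_fd = os.open(dir_path, os.O_RDONLY)
    try:
        file_fd = os.open(tmp_name, os.O_WRONLY | os.O_CREAT | os.O_TRUNC,
                          0o644, dir_fd=dir_fd)
        try:
            view = memoryview(data)
            while len(view) > 0:               # os.write may write partially
                written = os.write(file_fd, view)
                view = view[written:]
            fbarrier(file_fd, dir_fd)          # data before directory change
        finally:
            os.close(file_fd)
        # The rename is the change to the containing directory.
        os.rename(tmp_name, final_name, src_dir_fd=dir_fd, dst_dir_fd=dir_fd)
    finally:
        os.close(dir_fd)
```

A wrapper library along these lines could later switch to a native call on filesystems that provide one, without any change to its callers.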
re. 188 on your reply to #186:
8. Yes, raw_rename() makes sense as a low-level operation on directories, but it does not solve the problem of atomic file replacement very well.
re. 188 on your reply to #187:
1. Of course, when atomically replacing a large number of files without ordered_rename() being available, it is better to group all the sync/fsync’s rather than fsync-ing after writing each small file. However, this can complicate application programming as the complexities of sync/fsync plus rename can no longer be hidden in a library (or a utility called by a shell script), and the application must constantly be aware which files have been fsync’d and which have been renamed. ordered_rename() makes things simple.
If the atomicity requirement applies to a large number of files as a whole, rather than individually, we can create the new version of the files and update a pointer (e.g. a symlink) to the current version. The ordering requirements can be expressed with fbarrier() above, although a full sync isn’t usually much worse. However, file-level atomicity is sufficient in many cases, as crashes are rare and files are often fairly independent, and as long as some version of every file is preserved, consistency among files is not that important in many applications.
2, 3. I don’t think there is anything a filesystem can do to make simplistic applications calling write/fsync/rename on many files sequentially run fast, as at the time of fsync() the kernel has not seen any later writes to include into the transaction. Unfortunately, most applications will probably be like this, as the asynchronous or threaded approaches cannot be wrapped into a library as easily and transparently: the caller often has to explicitly wait until the fsync is completed and the rename executed (not necessarily committed to disk), since only then can other processes (e.g. a email program reading a file recently saved by a word processor) see the updated version.
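The pointer-update idea from point 1 above (write a complete new version of a set of files, then switch a symlink to it) might look like the following Python sketch. The directory layout, the `current` link name and the function names are assumptions made for illustration, and plain fsync calls stand in for the proposed fbarrier().

```python
import os


def _fsync_dir(path):
    """fsync a directory so that its entries are on disk."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def publish_version(root, version_name, files):
    """Write `files` (name -> bytes) to root/version_name, then repoint root/current."""
    version_dir = os.path.join(root, version_name)
    os.mkdir(version_dir)
    for name, data in files.items():
        with open(os.path.join(version_dir, name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
    _fsync_dir(version_dir)          # the new files' directory entries

    # Build the new symlink under a temporary name, then rename it over
    # "current": readers see either the old set or the new set, never a mix.
    tmp_link = os.path.join(root, "current.tmp")
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)          # leftover from an interrupted run
    os.symlink(version_name, tmp_link)
    os.rename(tmp_link, os.path.join(root, "current"))
    _fsync_dir(root)                 # make the switch itself durable
```

Old version directories can be removed once no reader is still using them; that cleanup is left out here.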
April 30th, 2009 at 12:10 pm
@194: r1644,
We’ll see if anyone actually implements fbarrier(). When I brought it up at the Linux Storage and Filesystem workshop, there wasn’t much enthusiasm, mostly because there was much scepticism that application programmers would actually use the new interface — and by the time they did, new filesystems (including potentially fixes to ext3) would exist that make fsync() fast enough that application programmers that are willing to change their code to be portable should just use fsync().
The reality is that most applications are highly unlikely to change, so we’ll need to have replace-via-truncate and replace-via-rename workarounds for the indefinite future. And for most filesystems, fbarrier() will either be an implied fsync() or a no-op (in the case of ext3 data=ordered mode) anyway. Linus changed the default ext3 mode to be data=writeback with replace-via-truncate and replace-via-rename workarounds in 2.6.30, and we may have a replacement data=guarded mode that will avoid seeing uninitialized data on crash, but still have most of the performance penalties as data=writeback. So once the assumption that fsync() is expensive goes away, there is some potential gain for a hypothetical fbarrier() system call, but most application writers’ refusal to use fsync() was more about laziness and/or ignorance rather than any kind of principled stand or technical judgement, so there wasn’t a whole lot of faith anyone would actually use something as complicated as your proposed ordered_rename() composed out of fbarrier() as a low-level operation. It’s certainly possible, but given all the bellyaching about how difficult it was to use fsync(), and the fact that you can’t use it from shell scripts, etc., there just wasn’t much faith in application programmers’ ability to use something like fbarrier().
It might get implemented at some point in the future, but at the moment it’s pretty low-priority.
April 30th, 2009 at 1:00 pm
@tytso: Can we at least get a pathconf(_PC_RENAME_FLUSHES_CONTENT)? As long as fsync forces a disk spin-up, there will be cases where fsync is dispreferred, and that would at least make it *possible* for apps to do the right thing without the design and support overhead of adding a new syscall.
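How an application might use such a query is sketched below in Python. `_PC_RENAME_FLUSHES_CONTENT` is only the name proposed in this comment; it is not a real pathconf variable, so the lookup currently fails and the code always takes the fsync path. The helper names are likewise assumptions made for the example.

```python
import os


def rename_flushes_content(path):
    """Return True only if the filesystem advertises the proposed guarantee."""
    try:
        # Hypothetical name: Python raises ValueError for pathconf names it
        # does not know, which is what happens with this one today.
        return os.pathconf(path, "PC_RENAME_FLUSHES_CONTENT") > 0
    except (ValueError, OSError):
        return False


def replace_file(tmp_path, final_path):
    """Rename tmp_path over final_path, fsyncing first unless it is unnecessary."""
    directory = os.path.dirname(os.path.abspath(final_path))
    if not rename_flushes_content(directory):
        # No promise from the filesystem: force the data out ourselves.
        # (Linux accepts fsync on a descriptor opened read-only.)
        fd = os.open(tmp_path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    os.rename(tmp_path, final_path)
```

The point of the design is that the application stays correct everywhere and merely skips the fsync, and the disk spin-up, where the filesystem says it is safe to do so.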
April 30th, 2009 at 2:22 pm
re: 194
Thanks for attempting to rise to the challenge of being specific, but I’m afraid that some of your thinking may still be somewhat fuzzy. To take your points by number:
1. “ext3 data=ordered should perform decently for a workload with a huge number of atomic file replacements (and maybe other operations for which the ordering characteristics of data=ordered are useful) and very few other fsync() calls.” That depends upon how much other write activity is occurring that could benefit from being deferred (e.g., to combine multiple modifications to the same blocks into a single disk write), since (as I understand things) somewhat unintentionally ext3 with data=ordered by default writes out *all* dirty data and metadata every 5 seconds, whereas with data=writeback it writes out dirty metadata every 5 seconds but lets dirty data sit for 30 seconds. I’m also assuming that despite your phrasing (”few OTHER fsync() calls”) you’re actually talking about using ext3 with data=ordered in the erroneous fashion that *omits* the fsync calls during purported atomic file replacement, since if they’re included there’s at least one synchronous disk write for every one of them (unlikely to be an issue for individual file replacements but possibly an issue for “a huge number of them” where using sync might be a more reasonable solution if the bulk replacement is repeatable should it be interrupted by a crash).
2. “I meant that an efficient ordered_rename() is *in principle* possible on most filesystems without changing their basic designs.” In ext3 (and ext4 when allocation delay is prohibited in this situation as the new patches do), as I just noted above coupling on-disk data updates to on-disk metadata updates effectively couples the dirty data update timer to the dirty metadata update timer for the relevant files – and there’s likely a good reason why the two were defined separately. More troubling, though, is your casual assumption that the fact that a new facility may *in principle* be achievable on most (or even all) existing Linux/Unix filesystems without changing their basic designs means anything significant – when in fact the likelihood of all such existing Linux/Unix filesystems being modified for any reason less crucial than a truly catastrophic POSIX design flaw is indistinguishable from zero, and if they’re not so modified then any competent application developer must design the application to be safe in whatever environments it may reasonably be run.
3. “See my definition of “efficient” above. fsync/raw_rename is indeed quite suboptimal in terms of latency.” That’s still not a *quantitative* observation, just a better-qualified one – so I’ll quantify it for you. For the kind of smallish config file often used as an example here the difference is between execution at memory speeds and a single disk access on the order of 10 milliseconds – a difference not perceivable by an interactive user. For (e.g.) a large text file being edited more than one disk access might be required – but in that case the replacement is often in response to an explicit ’save’ request by the user, in which case fsync is required anyway. For bulk operations that are repeatable if interrupted a bulk sync at the end is more appropriate – not only more efficient, but offering a better guarantee that the bulk changes have actually all become persistent than just waiting until they may eventually migrate to disk if not interrupted by a crash. So I’ll frame my question differently: in what common instances will using fsync rather than your hypothetical ordered_rename make any difference that the end-user will perceive as significant (please don’t confuse this issue with the question of whether broken applications should be bailed out of their problems: that’s a completely different issue covered elsewhere)?
4. See above for why this quantitative difference in latency is likely of no real importance.
5. “As ext3 and ext4 and possibly other filesystems already supports ordered_rename() much more efficiently than a fsync/raw_rename sequence, it is of course desirable to expose this functionality via a system call.” Poppycock. First, because of the highly-questionable quantitative importance of your ‘much more efficiently’ assertion. Second, because of the competing cost of complicating the system API to support a feature which many filesystems will likely never implement (and which applications therefore will not be able to depend upon being present, complicating their own code as well if they decide to use it conditionally). This is a performance optimization in search of any real problem, and as such it would be questionable even if implementing it required no effort whatsoever. One of Unix’s (and perhaps even more so Linux’s) traditional alleged strengths has been simplicity, which has been achieved by demanding that added complexity justify itself: if you want a system with more bells and whistles than any normal application will ever use, try VMS (and I say this as one who actually *likes* VMS).
6. Are you completely ignorant of Linux’s VFS layer and what implications that layering has for your proposed ‘wrapper’ approach to changing the semantics of the *existing* rename operation in a manner transparent both to applications and to the underlying file systems? Or did you simply misunderstand point 6 in 185?
7. “What we need is a way to express the order in which changes should be committed to disk, without affecting the order or latency viewed by the calling process and other processes accessing the file on a running system.” I really, really do understand what you *want*: the point is that you have yet to come anywhere near justifying any *need* for it. “I think ordered_rename() solves most of the problems” strikes me as ridiculous when taken in the context of the generality of your preceding sentence: ordered_rename solves only one special case – a case of relatively little significance (if it weren’t for the application bugs in this area it would never have come up at all, so it’s clearly not a performance issue per se for rename-style replacements and only discussion about how to deal with those application bugs has brought up performance as a side topic). I do appreciate your attempt to describe how fbarrier would work and after a casual read see the main problem (yet again) as justifying the effort. “Of course the DBMS can also group asynchronous commits itself, but I think the above approach is cleaner” reflects naivety about how closely a database needs to control its log’s propagation to disk (as well as about how little interest database implementors have in writing conditional code to leverage minor features on some systems that they will need to roll themselves on others). Your concluding paragraph in this point also reflects naivety about the costs both of interface complexity and of implementation effort.
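The latency question raised in point 3 above is easy to measure rather than argue about. The following Python sketch times a small-file replacement with and without the fsync; the function names, the file name and the repetition count are assumptions made for the example, and the results depend entirely on the filesystem, mount options and hardware, so no figures are claimed here.

```python
import os
import tempfile
import time


def time_replace(directory, data, use_fsync):
    """Time one write + rename replacement, optionally with an fsync before the rename."""
    path = os.path.join(directory, "config")
    tmp_path = path + ".new"
    start = time.perf_counter()
    with open(tmp_path, "wb") as f:
        f.write(data)
        if use_fsync:
            f.flush()
            os.fsync(f.fileno())
    os.rename(tmp_path, path)
    return time.perf_counter() - start


def compare(data=b"x" * 4096, repeats=20):
    """Return the median seconds per replacement for both variants."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        for use_fsync in (False, True):
            samples = sorted(time_replace(d, data, use_fsync) for _ in range(repeats))
            results["fsync" if use_fsync else "no fsync"] = samples[len(samples) // 2]
    return results
```

Calling `compare()` on the filesystem of interest gives a direct answer to whether the difference would be perceivable by an interactive user.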
Concerning your comment on my reply to 186:
8. “Yes, raw_rename() makes sense as a low-level operation on directories, but it does not solve the problem of atomic file replacement very well.” At the risk of being as repetitive as so much of this discussion has been, neither you nor anyone else has demonstrated that it doesn’t “solve the problem of atomic replacement” more than adequately when combined with fsync in the appropriate location. In fact, that combination solves that specific problem *better* than a new rename variety would in the sense that it preserves the logical distinction between data updates and unrelated metadata updates that has been a feature of the defined Unix interface just about forever. The value of a conceptually simple interface should not be underestimated.
Concerning your comments on my reply to 187:
1. “Ordered_rename” doesn’t help *at all* when performing this kind of bulk operation, because what you want is some guarantee that the *entire sequence* has completed before a crash occurs (and in the unlikely event that a crash occurs earlier you can just repeat that entire sequence). Even if you *did* use ‘ordered_rename’ you’d *still* want a sync (or something equivalent) at the sequence’s end to make sure that everything actually made it to disk (that’s what’s important in this case, not mere ordering).
2 (your comment had no relevance to 3). You’re confused: my comments regarding bulk operations were about the kinds of bulk operations referred to in 187, not to bulk file rename-style replacements (which don’t strike me as very likely to occur, especially with the kind of concerns you raised about concurrent access by other programs). However, the kind of batching of updates first, then the associated batch of fsyncs, that I described could achieve the kind of streaming on-disk update performance (prior to the batched renames in the rename-style replacement that you provided as an example) that you’re seeking with an appropriate filesystem implementation and with neither overt asynchrony nor parallelism in the application: since you seem to consider such filesystem-specific performance behavior to be acceptable as long as it’s available somewhere for applications that require it, perhaps that would satisfy you.
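The batching described above (issue all the updates first, then the corresponding batch of fsyncs, then the renames) needs neither threads nor asynchronous I/O in the application. A minimal Python sketch follows; the function name and the `.new` suffix are assumptions made for the example, and whether the later fsyncs really return quickly depends on the filesystem implementation.

```python
import os


def replace_many(directory, files):
    """Replace each file in `files` (name -> bytes) using batched fsyncs."""
    open_fds = []
    try:
        # Phase 1: write every new version without waiting for the disk.
        for name, data in files.items():
            fd = os.open(os.path.join(directory, name + ".new"),
                         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            open_fds.append(fd)
            view = memoryview(data)
            while len(view) > 0:               # os.write may write partially
                written = os.write(fd, view)
                view = view[written:]
        # Phase 2: the batch of fsyncs. On some filesystems the first one
        # flushes most of the pending work and the rest return quickly.
        for fd in open_fds:
            os.fsync(fd)
    finally:
        for fd in open_fds:
            os.close(fd)
    # Phase 3: every new version is on disk, so the renames are safe.
    for name in files:
        final_path = os.path.join(directory, name)
        os.rename(final_path + ".new", final_path)
    # Phase 4: make the renames durable before reporting completion.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This keeps one descriptor open per file, so a very large batch would have to be split into chunks that fit the process's descriptor limit; calling `os.sync()` at the end is a broader alternative to the directory fsync.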
- bill
May 1st, 2009 at 1:01 pm
I’m recently reminded that, when moving a file across filesystems, in order to avoid losing the file upon a crash, a write barrier or fsync on the new copy is always necessary before unlinking the old one, whatever the filesystems may be. (Although mv does not currently do this; see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=262274 and http://savannah.gnu.org/support/?106582) As atomic file replacement on one filesystem similarly involves writing a new version of a file and unlinking an old version, I now find it quite reasonable to ask application programmers doing so to be aware of the possible necessity of fsync/write barriers. I still think efficient implementations of ordered_rename() (a new name helps to make the meaning clear) and maybe fbarrier() are nice things to have, as their low costs encourage application programmers who are aware of this issue to do the right thing, when many of them may be reluctant to use fsync() due to perceived or real slowness. However, Ted’s concerns in #195 are also quite valid.
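As a rough illustration of the cross-filesystem case, a move that forces the new copy to disk before unlinking the old one might look like the sketch below. `durable_move` is a hypothetical name; the directory fsync is an extra precaution beyond what the comment asks for and assumes a POSIX system where a directory can be opened read-only and fsynced. File metadata (ownership, permissions, timestamps) is not copied here.

```python
import os
import shutil


def durable_move(src, dst):
    """Copy src to dst, force the copy to disk, and only then unlink src."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()              # Python buffer -> kernel
        os.fsync(fout.fileno())   # kernel -> disk, for the new copy's data

    # Assumption: also make the new directory entry durable (POSIX only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    # The old copy is removed only after the new one is safely on disk.
    os.unlink(src)
```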
@197: Thank you for your valuable (albeit harsh) input. I think I understand most of your points and you understand mine, and the rest is mostly a matter of opinion, so I won’t argue any more.
May 19th, 2009 at 11:07 pm
when’s delalloc gonna hit data=journal?
June 13th, 2009 at 9:57 pm
[...] ext4 was marked as release, Ubuntu enabled it in the Jaunty beta, and the alarm went off because under certain conditions this file system produced files of length 0 (meaning [...]
September 3rd, 2009 at 1:51 pm
Hi, and thank you for this informative blog. Could you please explain why fsync() does not flush just the data for the current process, rather than all buffers for the file system? Thanks.
October 19th, 2009 at 8:41 am
I suspect some people can’t see the forest for the trees.
How to write a file system:
First: get the data on the disc quickly and safely.
Second: Worry about what makes the system faster/storage-efficient afterwards. And if it “could” impact on the first – just don’t do it.
The only thing that counts is secure data. Everything else is irrelevant and has to simply be tolerated. If it’s too slow – tough luck – learn patience.
Any and all write caches should be disabled/removed.
File systems are too important to mess about playing complicated tricks with.
October 24th, 2009 at 7:43 pm
@Azumi123,
If that’s what you want, you can mount the filesystem using -o sync. Performance will be terribly, terribly slow. Worse yet, it still won’t necessarily help you because application writers take shortcuts. For example, if the application writer truncates the file down to zero, and then rewrites the file with the new contents, because the application writer wants to preserve the file ACL without actually saving and restoring the ACL entries, and the system crashes right after the file is truncated, there’s really not much you can do.
So I agree with you that performance is the #1 priority, assuming that the application writer is doing things correctly. If the application writer does something stupid, such as truncating the file down to zero, there are real limits to what you can do to protect against application writer stupidity. We are in fact adding some heuristics to try to protect the data in the face of application writer malpractice, but at that point, I think we do need to trade off performance versus trying to protect against application writer stupidity. After all, as the old saying goes, the problem with trying to create a fool-proof solution is that fools are so ingenious….
October 24th, 2009 at 8:01 pm
re: 202
Perhaps you’re confused about who owns the computer. If it’s in fact yours, you’re free to have it work any way you want it to, including the way that you described.
Of course, simply getting data to disk is nowhere nearly sufficient to guarantee its utility in all cases: you often have to do things like group updates appropriately for atomicity, for example – though I’ve only written a couple of file systems that did this because facilities that far beyond the industry’s least common denominator tend not to get used all that much.
The reality of the situation is that many people (though perhaps not you) get annoyed at ‘tolerating’ the kinds of performance constraints that you describe when there’s absolutely no benefit to be gained from them (because just as some situations call for far more care in handling data than you suggested, others require far less: it depends on the details of how the application uses its storage). Since the file system cannot by definition satisfy the entire range of such needs with a single approach, good designs offer a variety of approaches from which each application can choose what best suits its needs.
Another reality, unfortunately, is that many people can see only what *they* want from a file system rather than the full range of needs that it should satisfy. Perhaps if they actually wrote one and had the opportunity to experience feedback from a few thousand vocal users they’d develop more appreciation for this concept.
- bill
October 29th, 2009 at 2:26 pm
One note from a usability angle, and (shudder) from a Linux user who appreciates something MicroSoft got right:
Linux got many things right: security, stability well before XP, and quite a lot else. Linux excelled at being technically correct, at least compared to MicroSoft Windows.
Meanwhile, MicroSoft got more or less one thing right in Windows: usability. Where Linux was secure, open, and so on, Microsoft knew the value of being something that would work for people, and that people without computer science backgrounds could figure out. Apple understood the importance of user-centeredness too, even if they didn’t make the best business decisions. The 90% market share achieved by MicroSoft is because however many things they got wrong, however badly they bungled stability, security, and so on and so forth, they sold people a way that they could figure out how to use their computers. Only recently has Linux caught up with this way of putting users at the center.
The basic argument for ext4 is that it is more correct compared to a precise reading of specifications. If that causes large-scale practical instability for users who failed to exercise the due diligence of only using programs whose source may contain open; write; close; rename; without including an fsync, then this is not a problem with the file system. It’s a more correct read on the spec, so it’s an improvement to the filesystem, and if there are consequences, that’s Not Our Problem.
I wince at saying this, but I’d like to see developers think a little more like MicroSoft here.
October 30th, 2009 at 3:55 am
@205: Jonathon,
Maybe this wasn’t made clear enough, but the first thing that I did, before writing this blog article, was to create heuristics for ext4 that worked around broken application behavior. That is, ext4 tries to determine if applications are trying to update files in a dangerous way (i.e., update-via-truncate and update-via-rename without using fsync), and it will force an implied file system flush to avoid data loss most of the time. Unfortunately, if an application truncates a file, and then the system crashes before the application gets around to writing the new data, there’s not much that can be done at the file system level.
But I did first work around application programmer stupidity, and then called on application programmers to be, well, less stupid. That was because I knew application programmers outnumbered file system developers by several orders of magnitude. So with all due respect, I was thinking from a user-centric point of view; the first thing I did was to try to avoid as much data loss as possible without application programmer assistance.
One advantage Microsoft has, that Linux kernel programmers don’t have, is the Windows logo compatibility program. If there is some really stupid thing that Microsoft wants to prohibit, they can add a requirement to the Windows application logo compatibility program, and software companies won’t be able to put the Windows logo on their software packages unless they conform to all of the requirements of the Windows logo program. We don’t have that big stick to beat over the heads of application programmers, so all we can try to do is persuade application programmers to do better, via blog posts such as this one.
October 30th, 2009 at 11:55 am
Jonathan, I’m afraid that you, like so many others here, Just Don’t Get It.
This is not a discussion about correctness and specs: it’s very much a discussion about usability – the ability to use a file system to satisfy a wide variety of needs for a wide range of applications written to support a wide range of users.
If users got to use the file system directly rather than predominantly through intermediary applications then the file system *might* be able to provide default behavior that the majority would find appropriate. Instead, a wide variety of applications using the file system in a wide variety of ways are what the users see, and the file system cannot, by definition, serve all these applications (and their users) well with a single approach even if one assumes that all users want the same things: only applications can do that, because only they understand the ways they’re using the file system and what implications this has for the user experience.
Perhaps if you understood both the Linux and Windows file systems better you’d be less inclined to hold up the latter as some sort of paragon of usability even in the face of the kind of application incompetence being discussed here. Like nearly all modern file systems Windows defers most on-disk updates in the absence of application instructions to the contrary. For example, when a user clicks on ‘Save’ and the application issues a standard file system Write request nothing goes to disk for some period of time: only if the application recognizes that ‘Save’ means that the user wants the data to move to disk Right Now (just in case that ominous thunder outside presages a power outage) and explicitly flushes the data to disk immediately after issuing its normal Writes does the user get the behavior desired.
There’s another layer involved as well, since desktop systems are typically configured to enable the disk’s own internal write-back cache (as usual, for performance reasons: users do get annoyed at slow computers far more frequently than they get annoyed because they’ve lost some data, after all). So when the file system gets instructions from an application to force data to disk it in turn must tell the disk to force it to the platters (which competent file systems of course do).
Windows systems typically ship with the disk’s write-back cache enabled, because that’s what users seem to want. And Windows file systems don’t subvert that facility without explicit instructions from the application (or to protect their own internal consistency). So if applications fail to tell the Windows file system what to do, it will in most cases just as happily leave their data subject to loss should an interruption occur as ext4 will when its applications do the same.
Ted has very thoughtfully back-stopped broken applications in this one specific area, perhaps because it has relatively little performance down-side, perhaps because he feels some responsibility for having set false (though completely undocumented) expectations in ext3, perhaps because it was relatively easy to do. Don’t make the mistake of thinking that such back-stopping for application incompetence should (let alone could) be applied across the board.
- bill
November 8th, 2009 at 10:02 pm
“This gets done for files which were truncated using ftruncate() or opened via O_TRUNC when the file is closed, and when a file is renamed on top of an existing file.”
Seriously, Ted. You complain about people relying on nice side effects of ext3 and now you’re adding yet another one to ext4. I understand the motivation to make replacing rename()s safe, because the method used suggests that the user wanted consistency. But users who open files with O_TRUNC don’t care about consistency, they want to delete data. Please revert that part of the patch, or make that not a default.
November 8th, 2009 at 11:04 pm
@208: Marc,
Unfortunately, there are very many application programmers that attempt to update an existing file’s contents by opening it with O_TRUNC. I have argued that those application programs are broken, but the problem is that the application programmers are “aggressively ignorant”, and they outnumber those of us who are file system programmers. When we try to tell them that no, the right way to update a file is to create a new file, write the contents to the new file, fsync the new file, and then use rename() to rename the new file on top of the old file, they question our competence; they even question our paternity.
One could argue that they don’t care about consistency, but the problem is they do it anyway, and then when you combine it with users who use these broken applications, and then use Ubuntu systems with broken proprietary video drivers which crash the entire system whenever you breathe on them wrong (or when you exit certain 3D graphical games) — unfortunately the users don’t blame the application programmers, they blame the file system developers.
So unfortunately, because application programmers aren’t willing to even acknowledge that their programming styles are broken, we have to include that as a workaround. You can strace programs such as certain editors, and find that they really do update existing files (including files that might be considered precious, such as someone’s Ph.D. thesis or C source code) by opening the file with O_TRUNC. And if you crash between the time the file is truncated and the time that the data blocks are safely written to disk, you’ll lose data. And unfortunately, we have ample evidence that (a) users don’t blame the application programmers, and (b) there are a significant number of application programmers which refuse to fix their programs.
November 9th, 2009 at 12:03 am
Actually, I do think that “write to new file, then rename to old file” is the ONLY sane way of doing things. Truncate-rewrite is asking for trouble even if the system itself is solid as a rock.
After all, things can go plenty awry even without the system crashing. Suppose that the app hits a permission problem, or runs out of disk space, or somehow manages to segfault at an inopportune moment? (some bugs have made that happen in some of my favorite programs). With the write-rename method, those troubles pounce the application BEFORE the original data is destroyed or replaced.
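The difference can be seen without crashing anything. The sketch below is illustrative only: `save_via_rename`, `save_via_truncate`, and `failing_chunks` are hypothetical names, and the simulated `OSError` stands in for a full disk, a permission problem, or a bug striking halfway through the write.

```python
import os
import tempfile


def save_via_rename(path, chunks):
    """Write chunks to a temporary file, fsync it, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:      # a failure here never touches `path`
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)         # the old contents vanish only here
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)            # discard the half-written temp file
        raise


def save_via_truncate(path, chunks):
    """Truncate-rewrite: the old contents are gone as soon as open() returns."""
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)


def failing_chunks():
    """Hypothetical data source that fails partway through."""
    yield b"new "
    raise OSError("simulated failure mid-write")
```

Calling `save_via_rename(path, failing_chunks())` raises the error and leaves the original file as it was; calling `save_via_truncate` with the same input raises the same error but leaves only whatever fragment had been written so far.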
So, in my opinion, workarounds designed to accommodate inherently broken apps (i.e., truncate-rewriters) should get less priority than those to accommodate proper apps (write fsync rename fsync-on-dir). A way of “encouraging” application writers to stop being fsync shy.
Though I do agree that fsync should NEVER punish an app more than needed. Waiting for your own file to hit the disk is quite expected, but getting ambushed with a massive cascading writeout of everyone else’s files is a violation of the “principle of least surprise”, so to speak.
my two cents
November 9th, 2009 at 1:45 am
@209:
I don’t usually want to side with people who write code that is incorrect, but are you really blaming application developers and proprietary drivers???
I actually don’t blame the filesystem per se, but the steps required to read a file, change something, and write it back are absolutely ridiculous. Maybe it should be fixed in glibc or something, but I’m not surprised at all that people screw it up. Let’s see:
1.) read file
2.) make changes in mem
3.) create new file
4.) modify acls/permissions on new file so they match old file
5.) write new file
6.) fsync new file (oh wait it fsyncs everything in reality… huge pause)
7.) rename new file -> old file
8.) fsync containing directory
9.) ok, now show that the file has been saved
it should be:
1.) read file
2.) make changes in mem
3.) atomic_replace(old_file, mem)
but instead of adding that to some library people just blame application developers.
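For what it's worth, such a helper is short to write at the library level. The following is a minimal sketch of the `atomic_replace(old_file, mem)` idea, not an existing library function. It assumes a POSIX system, takes the new contents as bytes, and copies only the permission bits of the old file; ownership, ACLs, and extended attributes are deliberately left out and would need extra work.

```python
import os
import shutil
import tempfile


def atomic_replace(path, data):
    """Hypothetical helper: replace path's contents with data (bytes)."""
    path = os.path.abspath(path)
    directory = os.path.dirname(path)

    # Step 3: create the new file in the same directory (same filesystem).
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            # Step 4: match the old file's permission bits (not its ACLs).
            if os.path.exists(path):
                shutil.copymode(path, tmp)
            f.write(data)             # step 5: write the new file
            f.flush()
            os.fsync(f.fileno())      # step 6: fsync the new file
        os.replace(tmp, path)         # step 7: rename new file -> old file
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

    # Step 8: fsync the containing directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    # Step 9: only now is it reasonable to report the file as saved.
```

Usage is then the three-step version: read the file, change it in memory, and call `atomic_replace(path, new_bytes)`.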
Also, I’m not a fan of proprietary drivers, but nvidia has consistently had the best drivers (proprietary or open source) available for Linux of any manufacturer. Blaming nvidia is definitely a red herring. You’re essentially saying “computers should never crash and if we didn’t have proprietary drivers they wouldn’t!!!”
November 9th, 2009 at 2:42 am
Of course we blame application developers: they’re the ones incorrectly using the tools they have to work with, rather than seeking some other line of work because they don’t understand how to use those tools correctly (or are simply too lazy to, given that the rules in this area have been very clear for decades in Unix environments).
Would easier-to-use tools be nice? Perhaps – though extra layers add extra overheads and interface detail (even though perhaps making specific tasks easier). Would easier tools be nice for just one of many file systems in use on Linux? That’s a lot less clear – but (as you suggest) a library approach, with appropriate specialization for each such file system, might handle that.
Libraries are, of course, application rather than system code, so anyone can write one (and system developers tend to leave that up to others: they’ve got enough problems of their own to handle). If the library is sufficiently successful it might even become a standard, with the advantage of being portable across many different environments.
In this particular case you’re asking for a specialized kind of transaction, something eminently achievable at application (or library) level. Transactions have not traditionally fallen within the scope of file system responsibilities on *any* common platform, which may help explain why there’s a bit of resistance to being told that any individual file system is at fault for not providing them. The atomicity of the rename operation itself is transactional in nature, but only within a single action – and that is what allows the traditional sequence used to update a file atomically to be as concise as it is.
I suspect that the reason that no one has developed the kind of library function which you seem to be advocating is that this particular situation covers only one small part of what applications require to be robust in the face of unexpected interruptions – and hence addresses so small a part of the post-interruption clean-up they must perform that it’s not worth special-casing. I suspect the main reason for this tempest in a teapot is that application developers found a convenient (though undocumented and unintentional) short-cut with ext3 which they unwisely assumed would exist in perpetuity, and that their unsuspecting users are now looking for a single scapegoat because that’s less intellectually challenging than understanding why the file system really has good reason to work the way it traditionally has.
That last would certainly be consistent with the general collapse of analytical competence in the U.S. during this decade, and I see little evidence that the technical community has in some way remained immune to that (much as it would be comforting to believe otherwise).
- bill
November 17th, 2009 at 5:47 am
[...] weeks, there has been quite a lot of debate about delayed allocation in modern file systems. It was all triggered by the switch to ext4 as the default, the replacement [...]