Delayed allocation and the zero-length file problem
A recent Ubuntu bug has gotten slashdotted, and has started raising a lot of questions about the safety of using ext4. I’ve actually been meaning to blog about this for a week or so, but between a bout of the stomach flu and a huge todo list at work, I simply haven’t had the time.
The essential “problem” is that ext4 implements something called delayed allocation. Delayed allocation isn’t new to Linux; xfs has had delayed allocation for years. Pretty much all modern file systems have delayed allocation; according to the Wikipedia Allocate-on-flush article, this includes HFS+, Reiser4, and ZFS, and btrfs has this property as well. Delayed allocation is a major win for performance, both because it allows writes to be streamed more efficiently to disk, and because it can reduce file fragmentation, so that files can later be read more efficiently from disk.
This sounds like a good thing, right? It is, except for badly written applications that don’t use fsync() or fdatasync(). Application writers had gotten lazy, because ext3 by default has a commit interval of 5 seconds, and uses a journalling mode called data=ordered. What does this mean? The journalling mode data=ordered means that before the commit takes place, any data blocks associated with inodes that are about to be committed in that transaction will be forced out to disk. This is primarily done for security reasons; if this is not done (which would be the case if the disk is mounted with the mount option data=writeback), then any newly allocated blocks might still contain previous data belonging to some other file or user, and after a crash, accessing that file might result in a user seeing uninitialized data that had previously belonged to another user (say, their e-mail or their p0rn), which would be a Bad Thing from a security perspective.
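For concreteness, these journalling modes are just ordinary ext3 mount options; the device and mount point in the following /etc/fstab lines are only examples:

    /dev/sda1   /   ext3   defaults,data=ordered    0  1
    /dev/sda1   /   ext3   defaults,data=writeback  0  1

(You would of course only have one such line, and since data=ordered is the default for ext3, it normally doesn’t need to be spelled out at all.)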
However, this had the side effect of essentially guaranteeing that anything that had been written would be on disk within 5 seconds or so. (This is somewhat modified if you are running on batteries and have enabled laptop mode, but we’ll ignore that for the purposes of this discussion.) Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data — even though POSIX never really made any such guarantee. This became especially noticeable on Ubuntu, which uses many proprietary, binary-only drivers; these caused some Ubuntu systems to become highly unreliable, especially for Alpha releases of Ubuntu Jaunty, with the net result that some Ubuntu users have become used to their machines regularly crashing. (I use bleeding edge kernels, and I don’t see the kind of unreliability that apparently at least some Ubuntu users are seeing, which came as quite a surprise to me.)
So what are the solutions to this? One thing is that the applications could simply be rewritten to properly use fsync() and fdatasync(). This is what is required by POSIX, if you want to be sure that data has gotten written to stable storage. Some folks have resisted this suggestion, on two grounds: first, that it’s too hard to fix all of the applications out there, and second, that fsync() is too slow. This perception that fsync() is too slow was most recently caused by a problem with Firefox 3.0. As Mike Shaver put it:
On some rather common Linux configurations, especially using the ext3 filesystem in the “data=ordered” mode, calling fsync doesn’t just flush out the data for the file it’s called on, but rather on all the buffered data for that filesystem.
Fundamentally, the problem is caused by “data=ordered” mode. This problem can be avoided by mounting the filesystem using “data=writeback” or by using a filesystem that supports delayed allocation — such as ext4. This is because if you have a small sqlite database which you are fsync()’ing, and in another process you are writing a large 2 megabyte file, the 2 megabyte file won’t be allocated right away, and so the fsync operation will not force the dirty blocks of that 2 megabyte file to disk; since the blocks haven’t been allocated yet, there is no security issue to worry about with the previous contents of newly allocated blocks if the system were to crash at that point.
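To make “properly use fsync() and fdatasync()” concrete, here is roughly what the durable-write pattern looks like at the application level. This is a minimal sketch; the filename is illustrative and the error handling is abbreviated:

    /* Minimal sketch: writing data and making sure it reaches stable
     * storage, as POSIX requires.  The filename is illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *buf = "important application state\n";
        int fd = open("appdata.db", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, buf, strlen(buf)) < 0) {
            perror("write");
            return 1;
        }
        /* fdatasync() flushes the file's data (plus the metadata needed
         * to find it again); fsync() also flushes things like timestamps.
         * Without one of these calls, the data may still be sitting only
         * in the page cache when the system goes down. */
        if (fdatasync(fd) < 0) {
            perror("fdatasync");
            return 1;
        }
        close(fd);
        return 0;
    }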
Another solution is a set of patches to ext4 that has been queued for the 2.6.30 merge window. These three patches (with git id’s bf1b69c0, f32b730a, and 8411e347) will cause a file’s delayed allocation blocks to be allocated immediately when that file replaces an existing file. This gets done when a file which was truncated using ftruncate() or opened via O_TRUNC is closed, and when a file is renamed on top of an existing file. This solves the most annoying set of problems, where an existing file gets rewritten and, thanks to the delayed allocation semantics, that existing file gets replaced with a zero-length file. However, it will not solve the problem for newly created files, of course, which would still have delayed allocation semantics.
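For application writers who don’t want to depend on those workarounds, the portable way to replace a file is to write the new contents to a temporary file, fsync() it, and only then rename() it over the old name. Here is a minimal sketch of that pattern; the filenames are illustrative and the error handling is abbreviated:

    /* Minimal sketch: safely replacing an existing file by writing a
     * temporary file, fsync()'ing it, and then rename()'ing it into
     * place.  The filenames are illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int replace_file(const char *path, const char *tmppath,
                            const char *data, size_t len)
    {
        int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t) len) {
            close(fd);
            return -1;
        }
        /* Force the new contents to stable storage *before* the rename;
         * otherwise a crash could leave behind a zero-length file. */
        if (fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        if (close(fd) < 0)
            return -1;
        /* rename() is atomic: a reader sees either the old file or the
         * complete new one, never a partially written file. */
        return rename(tmppath, path);
    }

    int main(void)
    {
        const char *data = "new configuration contents\n";

        if (replace_file("config.txt", "config.txt.tmp",
                         data, strlen(data)) < 0) {
            perror("replace_file");
            return 1;
        }
        return 0;
    }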
Yet another solution would be to mount ext4 volumes with the nodelalloc mount option. This will cause a significant performance hit, but apparently some Ubuntu users are happy using proprietary Nvidia drivers, even if it means that when they are done playing World of Goo, quitting the game causes the system to hang and they must hard-reset the system. For those users, it may be that nodelalloc is the right solution for now — personally, I would consider that kind of system instability to be completely unacceptable, but I guess gamers have very different priorities than I do.
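If you do want to go that route, nodelalloc is just another ext4 mount option; an /etc/fstab entry (the device and mount point here are only examples) would look something like:

    /dev/sda1   /   ext4   defaults,nodelalloc   0  1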
A final solution which might not be that hard to implement would be a new mount option, data=alloc-on-commit. This would work much like data=ordered, with the additional constraint that all blocks that had delayed allocation would be allocated and forced out to disk before a commit takes place. This would probably give slightly better performance compared to mounting with nodelalloc, but it shares many of the disadvantages of nodelalloc, including making fsync() potentially very slow, because it would once again force all dirty blocks out to disk.
What’s the best path forward? For now, what I would recommend to Ubuntu gamers whose systems crash all the time and who want to use ext4 is to use the nodelalloc mount option. I haven’t quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they won’t have to worry about files getting lost as a result of delayed allocation. Long term, application writers who are worried about files getting lost on an unclean shutdown really should use fsync(). Modern filesystems are all going to be using delayed allocation, because of its inherent performance benefits, and whether you think the future belongs to ZFS, or btrfs, or XFS, or ext4 — all of these filesystems use delayed allocation.
What do you think? Do you think all of these filesystems have gotten things wrong, and delayed allocation is evil? Should I try to implement a data=alloc-on-commit mount option for ext4? Should we try to fix applications to properly use fsync() and fdatasync()?