Fast ext4 fsck times, revisited

Last night I managed to finish up a rather satisfying improvement to ext4’s inode and block allocators. Ext4’s original allocator was actually a bit more simple-minded than ext3’s, in that it didn’t implement the Orlov algorithm to spread out top-level directories for better filesystem aging. It was also buggy in certain ways; for example, it could return ENOSPC even when there were still plenty of inodes in the file system.

So I had been working on extending ext3’s original Orlov allocator so it would work well with ext4. While I was at it, it occurred to me that one trick I could play with ext4’s flex groups (which are higher-order collections of block groups) was to bias the block allocation algorithms so that the first block group in a flex group is preferred for directory blocks and avoided for the data blocks of regular files. This means that directory blocks get clustered together, which cut a third off the time needed for e2fsck’s pass 2:

Comparison of e2fsck times on a 32GB partition

                  ext4 old allocator             ext4 new allocator
             time (s)          I/O            time (s)          I/O
  Pass    real  user   sys  MB read  MB/s  real  user   sys  MB read  MB/s
  1       6.69  4.06  0.90     82   12.25  6.70  3.63  1.58     82   12.23
  2      13.34  2.30  3.78    133    9.97  4.24  1.27  2.46    133   31.36
  3       0.02  0.01  0         1   63.85  0.01  0.01  0.01      1   82.69
  4       0.28  0.27  0         0    0     0.23  0.22  0         0    0
  5       2.60  2.31  0.03      1    0.38  2.42  2.15  0.07      1    0.41
  Total  23.06  9.03  4.74    216    9.37 13.78  7.33  4.19    216   15.68

As you may recall from my previous observations on this blog, although we hadn’t been explicitly engineering for this, a file system consistency check on an ext4 file system tends to be a factor of 6-8 faster than on an equivalent ext3 file system, mainly because the elimination of indirect blocks and the uninit_bg feature reduce the amount of disk reading needed in e2fsck’s pass 1. However, those layout optimizations didn’t do much for e2fsck’s pass 2. The new block and inode allocators are complementary to the original ext4 fsck improvements, since they focus on what we hadn’t optimized the first time around: e2fsck pass 2 times have been cut by a third, and the overall fsck time has been cut by 40%. Not too shabby!

Of course, we need to do more testing to make sure we haven’t caused other file system benchmarks to degrade, although I’m cautiously optimistic that this will end up being a net win. I suspect that some benchmarks will go up by a little, and others will go down a little, depending on how heavily the benchmark exercises directory operations versus sequential I/O patterns. If people want to test this new allocator, it is in the ext4 patch queue. If all goes well, I will hopefully be pushing it to Linus after 2.6.29 is released, at the next merge window.

* * *

For comparison’s sake, here are the fsck times for the same collection of files and directories under ext3 and under the original ext4 block and inode allocator. The file system in question is a 32GB install of Ubuntu Jaunty, with a personal home directory, a rather large Maildir directory, some Linux kernel trees, and an e2fsprogs tree. It’s basically the emergency environment I set up on my Netbook at FOSDEM.

In all cases the file systems were freshly copied from the original root directory using the command rsync -axH / /mnt. It’s actually a bit surprising to me that ext3’s pass 2 e2fsck time was that much better than the pass 2 time under the old ext4 allocator. My previous experience has shown that the two are normally about the same, with a read throughput of around 9-10 MB/s in e2fsck’s pass 2 for both ext3 file systems and ext4 file systems using the original inode/block allocators. Hence, I would have expected ext3’s pass 2 time to have been 12-13 seconds, not 6. I’m not sure how that happened, unless it was the luck of the draw in terms of how things ended up getting allocated on disk. But overall, things look quite good for ext4 and fsck times!
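For anyone who wants to reproduce this kind of measurement, the procedure can be sketched roughly as follows. The device name /dev/sdb1 is hypothetical, and since mkfs destroys whatever is on it, the sketch refuses to run without an explicit opt-in:

```shell
#!/bin/sh
# Rough sketch of the benchmark procedure; /dev/sdb1 is a hypothetical
# scratch partition -- mkfs.ext4 destroys everything on it.
DEV=/dev/sdb1
MNT=/mnt

run_fsck_bench() {
    mkfs.ext4 "$DEV"                    # fresh file system
    mount "$DEV" "$MNT"
    rsync -axH / "$MNT"                 # -x: stay on one file system,
                                        # -H: preserve hard links
    umount "$MNT"
    echo 3 > /proc/sys/vm/drop_caches   # cold caches for honest timings
    e2fsck -fnvtt "$DEV"                # -f: force check, -n: read-only,
                                        # -tt: per-pass time and I/O stats
}

# Refuse to run unless the device exists and you opt in explicitly.
[ -b "$DEV" ] && [ "${I_KNOW_THIS_DESTROYS_DATA:-no}" = yes ] && run_fsck_bench
```

The -tt option to e2fsck is what produces the per-pass real/user/system times and read statistics of the kind shown in the tables.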

Comparison of e2fsck times on a 32GB partition

                        ext3                    ext4 old allocator
             time (s)           I/O            time (s)          I/O
  Pass    real   user    sys  MB read  MB/s   real  user   sys  MB read  MB/s
  1      108.40  13.74  11.53    583   5.38   6.69  4.06  0.90     82   12.25
  2        5.91   1.74   2.56    133  22.51  13.34  2.30  3.78    133    9.97
  3        0.03   0.01   0         1  31.21   0.02  0.01  0         1   63.85
  4        0.28   0.27   0         0   0      0.28  0.27  0         0    0
  5        3.17   0.92   0.13      2   0.63   2.60  2.31  0.03      1    0.38
  Total  118.15  16.75  14.25    718   6.08  23.06  9.03  4.74    216    9.37

Vital Statistics of the 32GB partition
312214 inodes used (14.89%)
263 non-contiguous files (0.1%)
198 non-contiguous directories (0.1%)
  # of inodes with ind/dind/tind blocks: 0/0/0
  Extent depth histogram: 292698/40
4388697 blocks used (52.32%)
0 bad blocks
1 large file
263549 regular files
28022 directories
5 character device files
1 block device file
5 fifos
615 links
20618 symbolic links (19450 fast symbolic links)
5 sockets
312820 files

10 thoughts on “Fast ext4 fsck times, revisited”

  1. I just read your post, and I want to congratulate you on your great work.
    I have a little question, and I should let you know that I’m pretty new at these things.
    If this new allocator is merged soon and I use ext4 on my partitions, then after I upgrade my kernel, will I have to do anything to use the new allocator instead of the old one?

  2. @2: Ionut,

    Unless some major problems are found with it, it will probably be merged at the next merge window (i.e., after 2.6.29 is released). At this point my plans are to make it the default allocator, so no, you won’t have to do anything special once you are booting a kernel that has the new allocator merged.

    Of course, to get the most value out of the allocator, you’ll need to do a backup/reformat/restore pass, so that the directory blocks are concentrated together, etc. But it shouldn’t do any harm to use the new allocator on an existing ext4 or ext3 filesystem; you just won’t see all of the benefits of the new allocator.
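    The backup/reformat/restore pass itself can be sketched along these lines (the device and backup paths are made up for illustration, and since mkfs wipes the old file system, the sketch requires an explicit opt-in):

```shell
#!/bin/sh
# Illustrative sketch only; the device and backup paths are hypothetical.
DEV=/dev/sda2
MNT=/mnt
BACKUP=/backup/fs-backup.tar

reallocate_fs() {
    mount "$DEV" "$MNT"
    tar -C "$MNT" -cf "$BACKUP" .   # back up the current contents
    umount "$MNT"
    mkfs.ext4 "$DEV"                # reformat (wipes the old layout)
    mount "$DEV" "$MNT"
    tar -C "$MNT" -xpf "$BACKUP"    # restore; a kernel with the new
                                    # allocator now clusters directory blocks
    umount "$MNT"
}

# Refuse to run unless the device exists and you opt in explicitly.
[ -b "$DEV" ] && [ "${I_KNOW_THIS_DESTROYS_DATA:-no}" = yes ] && reallocate_fs
```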

  3. (Darn it I was too slow in my request and the cloud of uncertainty has been spread far and wide. Sorry Ted. Feel free to delete this and the previous comment)

  4. For what it’s worth, JkDefrag[1], a defragmenter for Windows, groups all directories at the start of the disk. Apparently, the start of the disk is (a bit to significantly) faster than the end, and since directories are by far the most accessed files, it makes sense to put ’em there. I suppose a similar approach might be a good idea for ext4, if you’re going to group ’em all anyway.


  5. I still have some doubts…. this means a 32GB partition needs more than 30
    seconds of fsck at boot. I have 250GB of space in my laptop, thus roughly 8 times the fsck time = 4 minutes.
    Now just imagine what happens when I need to turn on my PC for a conference or some other critical and urgent task, and I have to wait 4 minutes for fsck plus another 30-40 seconds of boot….

  6. mat,

    First of all, with the new allocator, it was taking about 14 seconds to fsck the 32GB partition in question, not 30 seconds. Secondly, how the fsck time scales to a 250GB partition very much depends on the file system; it’s a function of the number of inodes, the average size of the files, the average size of the directories, and so on. So it won’t necessarily scale linearly; if you use the 250GB file system to store a large number of large video files, for example, the fsck will take much less time than if you use it to store a gargantuan number of very small files in a very large number of directories.

    Third of all, with a laptop, I recommend using suspend-to-ram as much as possible; if you’re going to be using your laptop within a few hours, or if you can keep it plugged in, why shut it down? Just use suspend-to-ram, and that will tend to reduce the number of mounts, which in turn will reduce the number of times fsck might kick in.

    Finally, I strongly recommend that you use LVM, and then periodically run a script (perhaps out of cron) which creates a snapshot, checks the snapshot, and, if the file system is consistent, updates the last-checked time on the base file system. This will eliminate the need to run fsck at boot time, and in practice it means that your file system gets checked for corruption induced by hardware errors and the like much more regularly.

  7. I like your “finally” answer, but I have no idea how to do that…. Is there an example of how to:

    1) create a snapshot
    2) check the snapshot
    3) if everything went fine, update the last checked time on the base filesystem

    I’ve never found any script like this on the web….. You have a laptop with ext4, so
    if you can, please publish yours……
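For what it’s worth, the three steps above can be sketched along these lines. The volume group and logical volume names are hypothetical, and a production version should handle lvcreate failures and snapshot overflow more carefully:

```shell
#!/bin/sh
# Hypothetical LVM names -- adjust for your own system.
VG=laptop-vg
LV=root
SNAP="${LV}-fsck-snap"

check_fs_snapshot() {
    # 1) create a snapshot of the live file system
    lvcreate -s -L 500M -n "$SNAP" "/dev/$VG/$LV" || return 1
    # 2) check the snapshot; safe while the real fs stays mounted
    if e2fsck -fy "/dev/$VG/$SNAP"; then
        # 3) clean: reset the mount count and the last-checked time on
        #    the base file system so the boot-time fsck gets skipped
        tune2fs -C 0 -T now "/dev/$VG/$LV"
    fi
    lvremove -f "/dev/$VG/$SNAP"
}

# Only attempt this if the logical volume actually exists.
[ -e "/dev/$VG/$LV" ] && check_fs_snapshot
```

Running something like this weekly out of cron keeps the file system regularly checked without ever holding up a boot.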
