Fast ext4 fsck times

This wasn’t one of the things we were explicitly engineering for when we were designing the features that would go into ext4, but one pleasant surprise has been how much more quickly ext4 filesystems can be checked. Ric Wheeler reported fsck times over ten times better than ext3, using filesystems generated with what was admittedly a very artificial/synthetic benchmark. During the past six weeks, though, I’ve been using ext4 on my laptop, and I’ve seen very similar results.

This past week, while at LinuxWorld, I’ve been wowing people with the following demonstration: using an LVM snapshot, I ran e2fsck on the root filesystem of my laptop. On a 128 gigabyte filesystem, on a laptop drive, this is what people who saw my demo watched scroll by:

e2fsck 1.41.0 (10-Jul-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 3440k/12060k (3311k/130k), time: 17.82/ 5.52/ 1.11
Pass 1: I/O read: 233MB, write: 0MB, rate: 13.08MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 3440k/13476k (3311k/130k), time: 41.47/ 2.16/ 3.30
Pass 2: I/O read: 274MB, write: 0MB, rate: 6.61MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 3440k/14504k (3311k/130k), time: 59.88/ 7.75/ 4.42
Pass 3: Memory used: 3440k/13476k (3311k/130k), time:  0.04/ 0.02/ 0.01
Pass 3: I/O read: 1MB, write: 0MB, rate: 27.38MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 3440k/6848k (3310k/131k), time:  0.25/ 0.24/ 0.00
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Pass 5: Memory used: 3440k/5820k (3310k/131k), time:  3.13/ 1.85/ 0.10
Pass 5: I/O read: 5MB, write: 0MB, rate: 1.60MB/s

  779726 inodes used (9.30%)
       1 non-contiguous inode (0.0%)
         # of inodes with ind/dind/tind blocks: 719/712/712
22706429 blocks used (67.67%)
       0 bad blocks
       4 large files

  673584 regular files
   58903 directories
    1304 character device files
    4575 block device files
      11 fifos
    1818 links
   41336 symbolic links (32871 fast symbolic links)
       4 sockets
--------
  781535 files
Memory used: 3440k/5820k (3376k/65k), time: 63.35/ 9.86/ 4.54
I/O read: 511MB, write: 1MB, rate: 8.07MB/s

How does this compare against ext3? To answer that, I copied my entire ext4 file system to an equivalently sized partition formatted for use with ext3. This comparison is a little unfair, since the ext4 file system has six weeks of aging on it, whereas the ext3 filesystem was a fresh copy, so its directories are a bit more optimized. That probably explains the slightly better pass 2 times for the ext3 file system. Still, it was no contest: the ext4 file system was almost seven times faster to check using e2fsck than the ext3 file system. Fsck on an ext4 filesystem is fast!

Comparison of e2fsck times on a 128GB partition

                      ext3                                 ext4
       ---- time (s) ----    ---- I/O ----   ---- time (s) ----    ---- I/O ----
Pass     real  user system   MB read   MB/s    real  user system   MB read   MB/s
1      382.63 18.06  14.99      2376   6.21   17.82  5.52   1.11       233  13.08
2       31.76  1.76   2.13       303   9.54   41.47  2.16   3.30       274   6.61
3        0.03  0.01   0.00         1  31.00    0.04  0.02   0.01         1  27.38
4        0.20  0.20   0.00         0   0.00    0.25  0.24   0.00         0   0.00
5        9.86  1.26   0.22         5   0.51    3.13  1.85   0.10         5   1.60
Total  424.81 21.36  17.34      2685   6.32   63.35  9.86   4.54       511   8.07

25 thoughts on “Fast ext4 fsck times”

  1. Thanks for the report! I’m curious about the “1 non-contiguous inode (0.0%)” line though. Was it just proper allocation at file creation, with no data removed from the fs during the 6 weeks of “aging”, or does it mean that online defragmentation is working for you?

  2. Vladimir,

    Actually, the 1 non-contiguous inode is a bug in e2fsck; it wasn’t properly accounting for extent-based files. I just fixed that, and the number I got was 2.2%, which seemed high. I did some poking about, and it looks like your question caused me to (a) fix a bug in e2fsck, and (b) uncover a bug in the delayed allocation code. (Thanks!)

    It looks like under some circumstances, the multiblock allocator isn’t doing the right thing with the first block. I tested this by creating a fresh ext4 filesystem and then populating it via a single-threaded tar process, copying /usr/bin and /usr/lib to the new filesystem. I then ran e2fsck with some instrumentation to look for fragmented files, and I found this:

    Inode 122: (0):58399, (1-3):43703-43705
    Inode 124: (0):58400, (1):43707
    Inode 127: (0):58401, (1-7):43709-43715
    Inode 128: (0):58402, (1-2):43716-43717
    Inode 129: (0):58403, (1-3):43718-43720
    Inode 133: (0):58404, (1-5):43722-43726
    Inode 135: (0):58405, (1):43728
    Inode 136: (0):58406, (1-3):43729-43731
    Inode 141: (0-1):58407-58408, (2-6):43734-43738
    Inode 143: (0):58409, (1):43740
    Inode 144: (0):58410, (1-5):43741-43745
    Inode 146: (0):58411, (1):43746
    

    Bad, very bad. Fortunately, this should be fairly easy to fix, now that we know about it….

  3. Hey there,

    can you try to do it again on xfs as well? I heard that the guys sped up xfs_repair a lot!

    Kind regards,
    Dennis

  4. Ted, will the ext4 online defragmenter ever be usable? Or is it just unsubstantiated hype, with no such code ever going to be merged and become safe to use?

  5. re: comment #4, defrag code exists (it’s not unsubstantiated hype) and it will likely be usable some day, but it’s not the highest priority at this point, especially with the ext4 allocator now working so much better than ext3’s.

  6. Hey great article…thanks….
    can anyone please provide me some to-do lists for ext4???

    Regards,
    Fur

  7. Fur, what do you mean by “to-do lists”? If you mean what’s left to do in terms of development, it’s mainly fixing bugs and doing tuning and performance tests, at least on the kernel side. On the e2fsprogs side there is still work to add support for > 32-bit block numbers, so that people can actually take advantage of ext4’s ability to create filesystems > 16TB.

    If you mean what you need to do in order to start using ext4, please see the ext4 howto page, which can be found here: http://ext4.wiki.kernel.org/index.php/Ext4_Howto

  8. Re: online defrag

    Maybe I’m unnecessarily pessimistic about how much extents and the preallocation changes are going to help, but I’d think this should be quite high priority. I think it’s just a continuation of the thinking that ext* file systems are resistant to fragmentation. As far as I can tell, that’s a long-standing myth, probably bolstered by the fact that there hasn’t been any solution other than a full backup and restore.

    Here’s a real example on my computer (note that I didn’t have to do anything special to get this to happen):

    dd if=vid1.avi of=/dev/null
    1433703+1 records in
    1433703+1 records out
    734056262 bytes (734 MB) copied, 12.4157 s, 59.1 MB/s

    That looks good for a fairly modern SATA drive.

    filefrag vid1.avi
    vid1.avi: 37 extents found, perfection would be 6 extents

    Not perfect, but reasonable. Let’s look at another file:

    dd if=vid2.avi of=/dev/null
    1433864+1 records in
    1433864+1 records out
    734138408 bytes (734 MB) copied, 218.334 s, 3.4 MB/s

    Woah! My disk that can read at 60MB/s is brought to its knees.

    filefrag vid2.avi
    vid2.avi: 45449 extents found, perfection would be 6 extents

    And that explains it.

  9. btmorex,

    The basis for saying that ext* (and other filesystems using the BSD fast filesystem design of cylinder groups) are fragmentation resistant is definitely sound, especially when compared to FAT filesystems. However, the key word in that statement is resistant. You can get into plenty of situations where ext* filesystems will not perform so well, especially if the filesystem is mostly full and has been around for a very long time. This is true for essentially every single filesystem out there.

    Ext4 has two new features that make it even more fragmentation resistant: delayed allocation and preallocation. The former works automatically; the latter requires the application to tell the filesystem up front how big the file will be (i.e., MythTV uses fallocate() to give the filesystem a hint that it is about to record a 30-minute TV show, so the filesystem can attempt to preallocate that much space up front, in an efficient way). These will definitely help, especially if you start with a fresh filesystem (i.e., backup and restore to a new ext4 partition, instead of just upgrading the filesystem from ext3 to ext4). However, if you severely abuse the filesystem (run it at 90-100% full for long periods of time, with lots of small files being added and deleted, and then expect to fill the remaining space with a 750MB video stream and have it be contiguous), no filesystem is going to be able to give you that kind of allocation guarantee. Even the (proprietary) real-time XFS extensions worked by massively over-allocating extra disk space that could only be used for large files, and which could not be used for small files. It is really not realistic to mix small and large files and then expect zero fragmentation.

    As far as raising the priority of the on-line defragmentation patches, it is next on the list, but please keep in mind that fixing bugs, so that users who are using ext4 don’t suffer data loss, is going to be higher priority, and a number of the ext4 developers are doing this as volunteers. If you’d like to try it and report bugs and comments, the code is available, and I can point you at it. The kernel patches for on-line defrag are among the ext4 patches, and the user-space portion is available in the ext4 patch queue. The code has not been reviewed yet, but if you’d like to be an early tester, it is supposed to be fully functional.

  10. Thanks for your reply.

    On resistance to fragmentation:

    I agree that compared to FAT filesystems, ext3 performs pretty well. However, it’s not very hard to find situations where ext3 doesn’t perform well at all. The problem is once you’ve run into one of those situations there is no easy way to fix it (apart from a full backup and restore).

    For example, a common situation that I find myself in is downloading multiple large files with Firefox simultaneously to ~/downloads. If I understand ext3’s block allocation strategy correctly, and that’s a big if, the preferred block group for those new files is going to be the same one that holds the ~/downloads inode. Invariably, if the files are large enough, they’re going to end up competing for blocks, and consequently both are going to end up heavily fragmented. Now, if I just had two heavily fragmented files, that’s not so bad. If I delete one, though, I now have one heavily fragmented file and a horribly messed-up block group that’s going to be used the next time I download something. I couldn’t find any documentation on what strategy is used for picking a new block group when the preferred one is full, but my assumption is that the algorithm is deterministic. In other words, if I’m downloading two ISOs simultaneously, my guess is that they’re going to fragment not just one block group, but probably multiple block groups.

    Another common situation is bittorrent. Admittedly, bittorrent is probably a filesystem’s worst enemy, but then again it’s a pretty popular distribution mechanism.

    These are fairly common situations and they don’t require a nearly full filesystem or a really old filesystem to happen. My disk is at 1/3 usage and about four months old (since my last restore) and already most of my home directory is heavily fragmented. I don’t really think that my use cases are particularly abusive or rare either.

    On delayed and pre allocation:

    I’m not entirely confident that these are going to make a huge difference, at least in the near future. Delayed allocation doesn’t really affect the bittorrent usage pattern and only slightly helps in the “multiple files competing for blocks” situation. If I understand it correctly, the fragments are just going to be larger, depending on how long the actual allocation is delayed.

    Preallocation would solve most of these problems, but honestly, that’s probably years away, given that countless user-space programs need to be changed. Perhaps in the coming year the most important programs (cp, tar, etc.) will be updated, but it’s still going to take a while for all of this to get into distribution releases.

    On defrag/ext4 testing:

    I don’t mean to disparage your work or anything. I’m actually very excited about ext4 and pretty eager to test it out. Right now, I only have one computer and I’m not really willing to migrate that yet, but in a month or two I’ll have my laptop back and I’ll probably test it on that. Hopefully, a good ext4 kernel is in debian unstable by then.

    I made my original comment because I think a lot of people downplay how easy it is to end up with a pretty fragmented filesystem. FAT may be many times worse, but Microsoft has included a defragmentation program since at least Windows 9x. So, even if it takes ext3 ten times as long to become fragmented, there’s no solution once it does.

  11. Btmorex,

    Delayed allocation works very well for small files, and for large files, if the fragments are large enough, it really doesn’t matter. If I have an ISO image file which is broken into 16 extents, then each “fragment” will be around 48 megabytes. The average seek time for a Seagate Momentus laptop drive is 13 milliseconds; server disks average around 8-10ms. So compared to the time it takes to read 750 megabytes of ISO image, we might have to seek an extra 16 times, which will cost about a fifth of a second (16 x 13ms is roughly 0.2s). This is literally less than a blink of an eye (which according to Wikipedia takes 300-400ms). If you have a file which has tens of thousands of fragments, that’s a different story, yes.

    As far as preallocation is concerned, it’s a one-line patch, and there are only a few high-value programs that would be important to get right — namely, bittorrent clients and Firefox for downloading. In both cases, the program knows the size of the file it is downloading, so all it has to do is add the line “fallocate(fd, 0, 0, length);”. This is not rocket science. And for people who only know how to download from distributions, sure, it might take six months before Ubuntu picks up the updated program — and once Lenny unfreezes, fixes will enter Debian testing quickly enough. But this is all about Open Source, remember? People helping themselves, people having the freedom to download source and modify it, and recompile it themselves, right?
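
    For illustration, here is a minimal sketch of that hint in a downloader (the file and length arguments are hypothetical; it assumes a glibc recent enough to expose the fallocate() wrapper, with posix_fallocate() as the portable fallback):

    #define _GNU_SOURCE             /* for fallocate(2) */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s file length\n", argv[0]);
            return 1;
        }
        off_t length = atoll(argv[2]);      /* expected final size in bytes,
                                               e.g. from a Content-Length header */
        int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* mode 0 allocates the blocks and extends i_size in one shot */
        if (fallocate(fd, 0, 0, length) < 0)
            perror("fallocate");            /* not fatal; fall back to plain writes */

        /* ... then write the downloaded data into the preallocated space ... */
        return 0;
    }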

    I don’t optimize for people who are too lazy to follow a more rapidly changing distribution (either Ubuntu, or Fedora, or Debian testing). And realistically, even if I had infinite amounts of free time to work on the defrag tool, it’s not like the defrag tool would come any faster for people who use Debian Stable, and are willing to settle for slow release cycles.

  12. @tytso:

    You’re right, no one *has* to work on it. And yeah, if all the planets align then we can minimise fragmentation – for a time.

    Pity that when you *have* got bad fragmentation, you’re stuffed.

    I agree that it’s completely up to the developer to work on it if they want. But in terms of its importance, I think you underestimate it.

    I completely agree with Btmorex. At the moment, if you get bad fragmentation then you’re basically stuffed. There are a lot of less educated ‘n00bs’ out there that aren’t going to know what to do – or even why their system is running so slow.

    It’s up to the ext4 guys to decide if they would like to cater for 90% of the user demographic.

  13. @16: Tshepang,

    Unfortunately, the fast fsck times won’t show up if you convert the filesystem from ext3. Sorry, I should have made this clear.

    New files will benefit from extents and delayed allocation, so you’ll see run-time performance improvements by moving to ext4. However, the improvements in fsck time stem from two things: the changes in how the bitmaps and inode tables are laid out (which require a fresh mkfs -t ext4; a conversion from ext3 won’t give you that), and the use of extents instead of indirect blocks. Taking the huge amount of pass 1 time spent seeking to indirect blocks out of the picture requires that all of your files larger than 48k (i.e., bigger than the twelve direct blocks, assuming a 4k blocksize) be rewritten using extents. We do have code that will be available in the future to “migrate” files using indirect blocks to extents, but the result still won’t be as good as a filesystem which is freshly made with ext4.

    Finally, not all of the instructions on converting ext3 filesystems tell you to run tune2fs -O uninit_bg /dev/XXX; e2fsck /dev/XXX. This will also reduce the fsck time, but the first fsck run is really annoying, since you have to individually answer “Y” to all of the questions about setting the group checksums. After that, e2fsck will be able to skip inode table blocks that are completely empty during pass 1. In the long run I need to improve e2fsprogs to make this step not quite so annoying, but even then, you’ll get the best performance (and the best fragmentation resistance, and so on) by doing a backup, recreating the filesystem using mke2fs -t ext4, and then doing a restore.

    I’ll also note, just as fair warning, that there’s an additional block allocation and layout change which I am planning, which should improve ext4’s fsck times and fragmentation resistance. Essentially it’s a change in where ext4 decides to allocate blocks for directories, segregating them from blocks for regular inodes. As before, it’s fully backwards compatible, but to see the full benefit you will need to do a backup, re-mkfs, and restore. The change I’m contemplating will probably be only a tiny incremental improvement (if it’s 5% better, I’ll be ecstatic), so it’s probably not a reason to justify a full rebuild, but I thought I would mention it. Since it won’t get done in time for the 2.6.29 merge window, it won’t see fruition in a stable release for six months (i.e., when 2.6.30 is released). So it’s probably not worth waiting for, since the improvement between the ext3 layout and the current ext4 layout and block allocation algorithms is already a factor of 7 or so, plus or minus.

  14. Hi Ted,

    I’m finally trying out ext4, and especially seeing how fragmentation compares with ext3. I’m running into some output that I don’t understand from filefrag:

    # filefrag vid.avi
    vid.avi: 2 extents found
    # ls -l vid.avi
    -rw-r--r-- 1 avery avery 733243392 2008-08-12 02:01 vid.avi

    Shouldn’t the minimum number of extents be 6 based on 128MB extents?

    I found this bug which is similar and supposed to be fixed:
    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=458306

    I thought I would post here before filing another one to make sure I’m not missing something obvious.

  15. The filefrag program is a bit misleading. When it uses the word “extents”, what it means is “contiguous ranges of blocks”. Keep in mind that filefrag predates ext4 by something like 5-6 years or more, so its use of “extents” is well before ext4 came on the scene. Similarly, if you look inside the source code for resize2fs, it uses “extents” to mean its own internal way of tracking a contiguous range of blocks used by a particular inode. But again, this predates ext4, so it has nothing to do with how ext4 happens to encode extents.
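
    To make “contiguous ranges of blocks” concrete, here is a rough sketch of how such a count can be computed with the FIBMAP ioctl (this is not filefrag’s actual source; it assumes a 4k blocksize, where the real tool asks the kernel via FIGETBSZ, and FIBMAP requires root):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <linux/fs.h>           /* FIBMAP */

    int main(int argc, char **argv)
    {
        struct stat st;

        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
            perror(argv[1]);
            return 1;
        }
        long blksize = 4096;        /* assumed filesystem block size */
        long nblocks = (st.st_size + blksize - 1) / blksize;
        long extents = 0, last = -2;

        for (long i = 0; i < nblocks; i++) {
            int blk = i;            /* in: logical block, out: physical block */
            if (ioctl(fd, FIBMAP, &blk) < 0) {
                perror("FIBMAP");   /* needs root privileges */
                return 1;
            }
            if (blk == 0)           /* unmapped (a hole): no new extent */
                continue;
            if (blk != last + 1)    /* discontinuity starts a new run */
                extents++;
            last = blk;
        }
        printf("%s: %ld extents found\n", argv[1], extents);
        close(fd);
        return 0;
    }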

    So if you were to look at vid.avi using a tool such as tst_extents, you would no doubt see a series of 128 meg extents, since the maximum amount of space that can be encoded in a single ext4 extent structure is indeed 128 megabytes (assuming a 4k blocksize). But ext4 will try very hard to keep block allocations contiguous, even across ext4 extent encodings, such that if extent #2 ended with block N, extent #3 will begin with block N+1 if at all possible. Apparently, in your vid.avi file, this was true for all but one case, so filefrag found two contiguous block ranges, which it reported as “2 extents”.
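
    (The arithmetic behind the minimum of 6: a single ext4 extent can map at most 32,768 blocks, which at 4k per block is 134,217,728 bytes, so your 733,243,392-byte file needs at least ceil(733243392 / 134217728) = 6 extent records, even when it is perfectly contiguous on disk.)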

    I hope that clarifies things!

  17. Thanks, that explains it.

    I think I’m using tst_extents correctly:
    1.) open /path/to/fs
    2.) inode /path/to/vid.avi
    3.) root… gives

    (Left 0)
    extent: lblk 0–179014, len 179015, pblk 66879488, flags: (none)

    4.) n… gives

    [deleted because wordpress snips it, important part is (Left 6)]

    So, if I keep hitting ‘n’ for this particular file I find 7 extents and indeed there is a break in there which would explain filefrag reporting 2. I have a couple questions though:

    1.) Is the root an extent too (which would make 8 total)? Or is it some kind of directory to the other ones? Why does the root report “Left 0”? If I use the procedure above (open->inode->root->n) and just parse “(Left X)”, is the number of extents used always X+1, or are there special cases?

    2.) Is there a way to use tst_extents not interactively other than feeding commands through stdin?

    Thanks,
    Avery

  18. Avery,

    Note that tst_extents was intended as a debugging tool, not as an officially supported interface, so it may very well change without warning. It’s basically an interface to the extent functions in libext2fs which I used when I was debugging them. I would not recommend trying to create a program based on interacting with tst_extents and parsing its output. You’re much better off writing a C program that links with libext2fs directly, or using SWIG to wrap it for Python or Perl.
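
    Something along these lines (a minimal sketch, not a supported example; it assumes an unmounted filesystem and that you already know the inode number, which debugfs or ext2fs_namei() can give you):

    #include <stdio.h>
    #include <stdlib.h>
    #include <et/com_err.h>
    #include <ext2fs/ext2fs.h>

    int main(int argc, char **argv)
    {
        ext2_filsys fs;
        ext2_extent_handle_t handle;
        struct ext2fs_extent extent;
        errcode_t err;

        if (argc != 3) {
            fprintf(stderr, "usage: %s device inode-number\n", argv[0]);
            exit(1);
        }
        ext2_ino_t ino = strtoul(argv[2], NULL, 0);

        /* open the (unmounted) filesystem read-only */
        err = ext2fs_open(argv[1], 0, 0, 0, unix_io_manager, &fs);
        if (err) {
            com_err(argv[0], err, "while opening %s", argv[1]);
            exit(1);
        }
        err = ext2fs_extent_open(fs, ino, &handle);
        if (err) {
            com_err(argv[0], err, "while opening extent tree");
            exit(1);
        }
        /* walk the whole tree; print only the leaf entries, which hold
         * the actual logical-to-physical block mappings */
        for (err = ext2fs_extent_get(handle, EXT2_EXTENT_ROOT, &extent);
             err == 0;
             err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT, &extent)) {
            if (extent.e_flags & EXT2_EXTENT_FLAGS_LEAF)
                printf("lblk %llu-%llu -> pblk %llu (len %u)\n",
                       (unsigned long long) extent.e_lblk,
                       (unsigned long long) (extent.e_lblk + extent.e_len - 1),
                       (unsigned long long) extent.e_pblk,
                       extent.e_len);
        }
        ext2fs_extent_free(handle);
        ext2fs_close(fs);
        return 0;
    }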

    The extent information is stored in a tree structure, where the root node of the tree is located in the inode and holds at most 4 entries. Each entry is either a leaf entry, which maps a range of the file’s blocks directly, or a pointer to an index block that can hold 340 entries, assuming a 4k blocksize (an extent entry and the node header are each 12 bytes, so a 4k block holds (4096 - 12) / 12 = 340 entries). In practice, ext4’s anti-fragmentation allocation algorithms are good enough that most of the time, the 4 entries in the inode are more than enough to cover most normally sized files; for big files, or if things are fragmented, a single external block suffices. Very rarely (as in, at the moment I have a single such file on my filesystem) more than 4 external blocks are needed, at which point a depth-two extent tree might be needed. That is usually not because the filesystem’s free space is fragmented, but because the file is sparse, such that the discontinuities are in the logical block numbers, not the physical block numbers.

    Finally, note that the “all” command to tst_extents will walk the entire extent tree, which is far more convenient than entering the ‘n’ command over and over.

  19. @17,

    Thanks for the info (that was quite a mouthful), and thanks for your mighty contributions to FLOSS. When do you estimate (i.e., with what kernel release) ext4 will be considered ready for production systems?

  20. For some reason, I missed the announcement that ext4 was declared stable at Christmas, and I wasn’t aware that 2.6.28 was released on that date. That’s a bit earlier than I thought. Thanks for the work…
