Last night I managed to finish up a rather satisfying improvement to ext4’s inode and block allocators. The original ext4 allocator was actually a bit more simple-minded than ext3’s, in that it didn’t implement the Orlov algorithm to spread out top-level directories for better filesystem aging. It was also buggy in certain cases, returning ENOSPC even when there were still plenty of inodes in the file system.
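
For those who aren’t familiar with it, the core idea of the Orlov algorithm is to scatter top-level directories across the disk’s block groups, on the theory that they hold unrelated subtrees, while keeping nested directories near their parents. Here is a toy userspace sketch of that heuristic; find_group_orlov(), the group_stats structure, and the above-average-free-space test are my own simplifications for illustration, not the actual kernel code:

    #include <stdlib.h>

    /* Toy model of per-group free-space accounting; in the real file
     * system these counts come from the on-disk group descriptors. */
    #define NGROUPS 8

    struct group_stats {
        unsigned free_inodes;
        unsigned free_blocks;
    };

    static struct group_stats groups[NGROUPS];

    static int find_group_orlov(int parent_group, int is_toplevel)
    {
        unsigned avg_inodes = 0, avg_blocks = 0;
        int g;

        for (g = 0; g < NGROUPS; g++) {
            avg_inodes += groups[g].free_inodes;
            avg_blocks += groups[g].free_blocks;
        }
        avg_inodes /= NGROUPS;
        avg_blocks /= NGROUPS;

        if (is_toplevel) {
            /* Spread top-level directories: start the search at a
             * pseudo-random group and take the first one with
             * above-average free inodes and free blocks. */
            int start = rand() % NGROUPS;
            for (g = 0; g < NGROUPS; g++) {
                struct group_stats *gs = &groups[(start + g) % NGROUPS];
                if (gs->free_inodes >= avg_inodes &&
                    gs->free_blocks >= avg_blocks)
                    return (start + g) % NGROUPS;
            }
        }

        /* Nested directories stay near their parent when possible. */
        for (g = 0; g < NGROUPS; g++) {
            int candidate = (parent_group + g) % NGROUPS;
            if (groups[candidate].free_inodes > 0)
                return candidate;
        }
        return -1;  /* out of inodes: the caller reports ENOSPC */
    }

The real ext3/ext4 code layers more heuristics on top of this (directory-to-inode ratios, a notion of “debt”, and so on), but spreading the top-level directories is the piece that matters for aging.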

So I had been working on extending ext3’s original Orlov allocator so it would work well with ext4. While I was at it, it occurred to me that one of the tricks I could play with ext4’s flex groups (which are higher-order collections of block groups) was to bias the block allocation algorithms so that the first block group in a flex group would be preferred for directory blocks and biased against the data blocks of regular files. This meant that directory blocks would get clustered together, which cut a third off the time needed for e2fsck pass 2 (a simplified sketch of the bias follows the table):

Comparison of e2fsck times on a 32GB partition
(real/user/system times in seconds)

        --------- ext4 old allocator ---------    --------- ext4 new allocator ---------
Pass      real   user  system  MB read   MB/s      real   user  system  MB read   MB/s
1         6.69   4.06    0.90       82  12.25      6.70   3.63    1.58       82  12.23
2        13.34   2.30    3.78      133   9.97      4.24   1.27    2.46      133  31.36
3         0.02   0.01       0        1  63.85      0.01   0.01    0.01        1  82.69
4         0.28   0.27       0        0      0      0.23   0.22       0        0      0
5         2.60   2.31    0.03        1   0.38      2.42   2.15    0.07        1   0.41
Total    23.06   9.03    4.74      216   9.37     13.78   7.33    4.19      216  15.68
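
To give a flavor of the bias (the actual allocator changes are more involved), here is a minimal sketch; GROUPS_PER_FLEX and goal_group() are names I made up for illustration, and 16 groups per flex group is just an assumed value, since the real size is chosen at mkfs time:

    #define GROUPS_PER_FLEX 16  /* assumed here; chosen at mkfs time in real life */

    /* Pick the block group to start allocating from.  Directory blocks
     * are steered toward the first block group of the flex group, and
     * regular-file data is steered away from it, so a flex group's
     * directory blocks end up clustered together on disk. */
    static int goal_group(int inode_group, int is_directory)
    {
        int flex_first = inode_group - (inode_group % GROUPS_PER_FLEX);

        if (is_directory)
            return flex_first;      /* cluster directory blocks here */
        if (inode_group == flex_first)
            return flex_first + 1;  /* push file data past the first group */
        return inode_group;         /* otherwise allocate near the inode */
    }

Packing a flex group’s directory blocks into one contiguous region is what lets e2fsck’s pass 2 read them with far fewer seeks, which shows up in the table as the jump from 9.97 MB/s to 31.36 MB/s.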

As you may recall from my previous observations on this blog, although we hadn’t been explicitly engineering for this, a file system consistency check on an ext4 file system tends to be a factor of 6-8 faster than on an equivalent ext3 file system, mainly because the elimination of indirect blocks and the uninit_bg feature reduce the amount of disk reading needed in e2fsck’s pass 1. However, those layout optimizations didn’t do much for e2fsck’s pass 2. The new block and inode allocators are thus complementary to the original ext4 fsck improvements, since they focus on what we hadn’t optimized the first time around: e2fsck pass 2 times have been cut by a third, and the overall fsck time has been cut by 40%. Not too shabby!

Of course, we need to do more testing to make sure we haven’t caused other file system benchmarks to degrade, although I’m cautiously optimistic that this will end up being a net win. I suspect that some benchmarks will go up by a little, and others will go down a little, depending on how heavily the benchmark exercises directory operations versus sequential I/O patterns. If people want to test this new allocator, it is in the ext4 patch queue. If all goes well, I will hopefully be pushing it to Linus after 2.6.29 is released, at the next merge window.

* * *

For comparison’s sake, here is the fsck time for the same collection of files and directories under ext3 and under the original ext4 block and inode allocator. The file system in question is a 32GB install of Ubuntu Jaunty, with a personal home directory, a rather large Maildir directory, some Linux kernel trees, and an e2fsprogs tree. It’s basically the emergency environment I set up on my netbook at FOSDEM.

In all cases the file systems were freshly copied from the original root directory using the command rsync -axH / /mnt. It’s actually a bit surprising to me that ext3’s pass 2 e2fsck time was that much better than the pass 2 time under the old ext4 allocator. My previous experience has shown that the two are normally about the same, with a read throughput of around 9-10 MB/s in e2fsck’s pass 2 for both ext3 file systems and ext4 file systems using the original inode/block allocators. Hence, I would have expected ext3’s pass 2 time to have been 12-13 seconds (133 MB of directory blocks at roughly 10 MB/s), and not 6. I’m not sure how that happened, unless it was the luck of the draw in terms of how things ended up getting allocated on disk. But overall, things look quite good for ext4 and fsck times!

Comparison of e2fsck times on a 32GB partition
(real/user/system times in seconds)

        ---------------- ext3 ----------------    --------- ext4 old allocator ---------
Pass      real   user  system  MB read   MB/s      real   user  system  MB read   MB/s
1       108.40  13.74   11.53      583   5.38      6.69   4.06    0.90       82  12.25
2         5.91   1.74    2.56      133  22.51     13.34   2.30    3.78      133   9.97
3         0.03   0.01       0        1  31.21      0.02   0.01       0        1  63.85
4         0.28   0.27       0        0      0      0.28   0.27       0        0      0
5         3.17   0.92    0.13        2   0.63      2.60   2.31    0.03        1   0.38
Total   118.15  16.75   14.25      718   6.08     23.06   9.03    4.74      216   9.37

Vital Statistics of the 32GB partition

  312214 inodes used (14.89%)
     263 non-contiguous files (0.1%)
     198 non-contiguous directories (0.1%)
         # of inodes with ind/dind/tind blocks: 0/0/0
         Extent depth histogram: 292698/40
 4388697 blocks used (52.32%)
       0 bad blocks
       1 large file

  263549 regular files
   28022 directories
       5 character device files
       1 block device file
       5 fifos
     615 links
   20618 symbolic links (19450 fast symbolic links)
       5 sockets
  312820 files