This wasn’t one of the things we were explicitly engineering for when we were designing the features that would go into ext4, but one pleasant surprise has been how much more quickly ext4 filesystems can be checked. Ric Wheeler reported some really good fsck times that were over ten times better than ext3, using filesystems generated by what was admittedly a very artificial/synthetic benchmark. During the past six weeks, though, I’ve been using ext4 on my laptop, and I’ve seen very similar results.
This past week, while at LinuxWorld, I’ve been wowing people with the following demonstration. Using an LVM snapshot, I ran e2fsck on the root filesystem on my laptop. So using a 128 gigabyte filesystem, on a laptop drive, this is what people who got to see my demo saw:
e2fsck 1.41.0 (10-Jul-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 3440k/12060k (3311k/130k), time: 17.82/ 5.52/ 1.11
Pass 1: I/O read: 233MB, write: 0MB, rate: 13.08MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 3440k/13476k (3311k/130k), time: 41.47/ 2.16/ 3.30
Pass 2: I/O read: 274MB, write: 0MB, rate: 6.61MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 3440k/14504k (3311k/130k), time: 59.88/ 7.75/ 4.42
Pass 3: Memory used: 3440k/13476k (3311k/130k), time: 0.04/ 0.02/ 0.01
Pass 3: I/O read: 1MB, write: 0MB, rate: 27.38MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 3440k/6848k (3310k/131k), time: 0.25/ 0.24/ 0.00
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Pass 5: Memory used: 3440k/5820k (3310k/131k), time: 3.13/ 1.85/ 0.10
Pass 5: I/O read: 5MB, write: 0MB, rate: 1.60MB/s
779726 inodes used (9.30%)
1 non-contiguous inode (0.0%)
# of inodes with ind/dind/tind blocks: 719/712/712
22706429 blocks used (67.67%)
0 bad blocks
4 large files
673584 regular files
58903 directories
1304 character device files
4575 block device files
11 fifos
1818 links
41336 symbolic links (32871 fast symbolic links)
4 sockets
--------
781535 files
Memory used: 3440k/5820k (3376k/65k), time: 63.35/ 9.86/ 4.54
I/O read: 511MB, write: 1MB, rate: 8.07MB/s
How does this compare against ext3? To answer that, I copied my entire ext4 file system to an equivalently sized partition formatted for use with ext3. This comparison is a little unfair since the ext4 file system has six weeks of aging on it, whereas the ext3 filesystem was a fresh copy, so its directories are a bit more optimized. That probably explains the slightly better times in pass 2 for the ext3 file system. Still, it was no contest; the ext4 file system was almost seven times faster to check using e2fsck compared to the ext3 file system. Fsck on an ext4 filesystem is fast!
| Pass | ext3 real (s) | ext3 user (s) | ext3 system (s) | ext3 MB read | ext3 MB/s | ext4 real (s) | ext4 user (s) | ext4 system (s) | ext4 MB read | ext4 MB/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 382.63 | 18.06 | 14.99 | 2376 | 6.21 | 17.82 | 5.52 | 1.11 | 233 | 13.08 |
| 2 | 31.76 | 1.76 | 2.13 | 303 | 9.54 | 41.47 | 2.16 | 3.3 | 274 | 6.61 |
| 3 | 0.03 | 0.01 | 0 | 1 | 31 | 0.04 | 0.02 | 0.01 | 1 | 27.38 |
| 4 | 0.2 | 0.2 | 0 | 0 | 0 | 0.25 | 0.24 | 0 | 0 | 0 |
| 5 | 9.86 | 1.26 | 0.22 | 5 | 0.51 | 3.13 | 1.85 | 0.1 | 5 | 1.6 |
| Total | 424.81 | 21.36 | 17.34 | 2685 | 6.32 | 63.35 | 9.86 | 4.54 | 511 | 8.07 |
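As a rough check on the "almost seven times" figure, the ratios can be recomputed from the real-time columns of the table. This is only a sketch: the dictionary and function names are made up for illustration, and the numbers are copied straight from the table.

```python
# Real (wall-clock) seconds per e2fsck pass, copied from the table above.
ext3_real = {"1": 382.63, "2": 31.76, "3": 0.03, "4": 0.20, "5": 9.86, "Total": 424.81}
ext4_real = {"1": 17.82, "2": 41.47, "3": 0.04, "4": 0.25, "5": 3.13, "Total": 63.35}


def speedup(label):
    """Return how many times faster ext4 was than ext3 for one table row."""
    return ext3_real[label] / ext4_real[label]


# Passes 3 and 4 are skipped: they take a fraction of a second either way.
for label in ("1", "2", "5", "Total"):
    print(f"pass {label}: ext4 is {speedup(label):.2f}x as fast as ext3")
```

Nearly all of the gain comes from pass 1 (roughly 21 times faster), while pass 2 is actually somewhat slower on the aged ext4 filesystem, as noted above.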




August 9th, 2008 at 2:09 am
“OMG” Jawdropping!
August 10th, 2008 at 5:43 am
Thanks for the report! I’m curious about the “1 non-contiguous inode (0.0%)” line though. Was it just proper allocation on file creation, with no data removed from the fs during the 6 weeks of “aging”, or does it mean that online defragmentation is working for you?
August 10th, 2008 at 1:08 pm
Vladimir,
Actually, the 1 non-contiguous inode is a bug in e2fsck; it wasn’t properly accounting for extent-based files. I just fixed that, and the number I got was 2.2%, which seemed high. I just did some poking about, and it looks like your question caused me to (a) fix a bug in e2fsck, and (b) uncover a bug in the delayed allocation code. (Thanks!)
It looks like under some circumstances, the multiblock allocator isn’t doing the right thing with the first block. I tested this by creating a fresh ext4 filesystem, and then populating it via a single-threaded tar process, copying /usr/bin and /usr/lib to that new filesystem. I then ran e2fsck with some instrumentation to look for fragmented files, and I found this:
Bad, very bad. Fortunately, this should be fairly easy to fix, now that we know about it….
August 10th, 2008 at 3:02 pm
Hey there,
can you try to do it again on xfs as well? I heard that the guys sped up xfs_repair a lot!
Kind regards,
Dennis
August 11th, 2008 at 1:27 am
Ted, will the ext4 online defragmenter ever be usable? Or is it just unsubstantiated hype, and no such code is ever going to be merged and become safe to use?
August 11th, 2008 at 8:26 am
Nice article, thank you.
August 11th, 2008 at 11:51 am
re: comment #4, defrag code exists (it’s not unsubstantiated hype) and it will likely be usable some day but it’s not the highest priority at this point, especially with the ext4 allocator now working so much better than ext3’s.
August 17th, 2008 at 7:07 am
Hey great article…thanks….
can anyone please provide me some to-do lists for ext4???
Regards,
Fur
August 17th, 2008 at 8:29 am
Fur, what do you mean by “to-do lists”? If you mean what’s left to do in terms of development, it’s mainly fixing bugs, and doing tuning and performance tests, at least kernel side. On the e2fsprogs side there is still work to add support for > 32-bit block numbers, so that people can actually take advantage of ext4’s ability to create filesystems > 16TB.
If you mean what you need to do in order to start using ext4, please see the ext4 howto page, which can be found here: http://ext4.wiki.kernel.org/index.php/Ext4_Howto
September 1st, 2008 at 8:04 am
For those folks who are wondering about the small file fragmentation problem, thanks to Aneesh Kumar, the problem has been fixed, and the fix will be in 2.6.27. (Basically, anyone wanting to use ext4 should either use 2.6.26 with the 2.6.26-ext4-7 patchset or at least 2.6.27-rc5.)
September 1st, 2008 at 8:39 am
Re: online defrag
Maybe I’m unnecessarily pessimistic about how much extents and the pre allocation changes are going to help, but I’d think this should be quite high priority. I think it’s just a continuation of the thinking that ext* file systems are resistant to fragmentation. As far as I can tell, that’s just a long standing myth probably bolstered because there hasn’t been any solution other than a full backup and restore.
Here’s a real example on my computer (note that I didn’t have to do anything special to get this to happen):
dd if=vid1.avi of=/dev/null
1433703+1 records in
1433703+1 records out
734056262 bytes (734 MB) copied, 12.4157 s, 59.1 MB/s
That looks good for a fairly modern SATA drive.
filefrag vid1.avi
vid1.avi: 37 extents found, perfection would be 6 extents
Not perfect, but reasonable. Let’s look at another file:
dd if=vid2.avi of=/dev/null
1433864+1 records in
1433864+1 records out
734138408 bytes (734 MB) copied, 218.334 s, 3.4 MB/s
Woah! My disk that can read at 60MB/s is brought to its knees.
filefrag vid2.avi
vid2.avi: 45449 extents found, perfection would be 6 extents
And that explains it.
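To put those two filefrag results on the same scale, here is a small sketch that turns them into an average extent size. The function name is made up for illustration; the byte counts and extent counts are the ones dd and filefrag printed above.

```python
def average_extent_bytes(file_size_bytes, extent_count):
    """Average number of bytes per extent for a file."""
    return file_size_bytes / extent_count


# Sizes from the dd output, extent counts from the filefrag output.
vid1 = average_extent_bytes(734056262, 37)
vid2 = average_extent_bytes(734138408, 45449)

print(f"vid1.avi: {vid1 / 1e6:.1f} MB per extent")
print(f"vid2.avi: {vid2 / 1e3:.1f} kB per extent")
```

The first file averages about 20 MB per extent, the second about 16 kB, so reading the second file means repositioning the disk head after every few blocks.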
September 1st, 2008 at 9:19 am
btmorex,
The basis for saying that ext* (and filesystems using the BSD fast filesystem basic design of cylinder groups) are fragmentation resistant is definitely sound, especially when compared to FAT filesystems. However, the key word in that statement is resistant. You can get into plenty of situations where ext* filesystems will not perform so well, especially if the filesystem is mostly full and has been around for a very long time. This is true for essentially every single filesystem out there.
Ext4 has two new features that make it even more fragmentation resistant: delayed allocation and preallocation. The former works automatically; the latter requires the application to tell the filesystem up front how big the file will be (e.g., mythTV uses fallocate() to hint to the filesystem that it will be recording a 30 minute TV show, so the filesystem can attempt to preallocate that much space up front, in an efficient way). These will definitely help, especially if you start with a fresh filesystem (i.e., backup and restore to a new ext4 partition, instead of just upgrading the filesystem from ext3 to ext4). However, if you severely abuse the filesystem (run it at 90-100% full for long periods of time, with lots of small files being added and deleted, and then expect to fill the remaining space with a 750 meg video stream and have it be contiguous), no filesystem is going to be able to give you that kind of allocation guarantee. Even the (proprietary) real-time XFS extensions worked by massive allocation of extra disk space that could only be used for large files, and which could not be used for small files. It is really not realistic to try mixing small and large files, and then expect zero fragmentation.
As far as raising the priority of the on-line defragmentation patches, it is next on the list, but please keep in mind that fixing bugs so that users who are using ext4 don’t suffer data loss is going to be higher priority, and a number of the ext4 developers are doing this as volunteers. If you’d like to try it and report bugs and comments, the code is available, and I can point you at it. The kernel patches for on-line defrag are the ext4 patches, and the user-space portion is available in the ext4 patch queue. The code has not been reviewed yet, but if you’d like to be an early tester, it is supposed to be fully functional.
September 1st, 2008 at 1:33 pm
Thanks for your reply.
On resistance to fragmentation:
I agree that compared to FAT filesystems, ext3 performs pretty well. However, it’s not very hard to find situations where ext3 doesn’t perform well at all. The problem is once you’ve run into one of those situations there is no easy way to fix it (apart from a full backup and restore).
For example, a common situation that I find myself in is downloading multiple large files with firefox simultaneously to ~/downloads. If I understand ext3’s block allocation strategy correctly, and that’s a big if, the preferred block group for those new files is going to be the same one that has ~/downloads inode. Invariably, if the files are large enough, they’re going to end up competing for blocks and consequently both are going to end up heavily fragmented. Now, if I just had two heavily fragmented files, that’s not so bad. If I delete one though, I now have one heavily fragmented file and a horribly messed up block group that’s going to be used the next time I download something. I couldn’t find any documentation on what strategy is used for picking a new block group if the preferred one is full, but my assumption is that the algorithm is deterministic. In other words, if I’m downloading two iso’s simultaneously, my guess is that they’re not only going to fragment one block group, but probably multiple block groups.
Another common situation is bittorrent. Admittedly, bittorrent is probably a filesystem’s worst enemy, but then again it’s a pretty popular distribution mechanism.
These are fairly common situations and they don’t require a nearly full filesystem or a really old filesystem to happen. My disk is at 1/3 usage and about four months old (since my last restore) and already most of my home directory is heavily fragmented. I don’t really think that my use cases are particularly abusive or rare either.
On delayed and pre allocation:
I’m not entirely confident that these are going to make a huge difference, at least in the near future. Delayed allocation doesn’t really affect the bittorrent usage pattern and only slightly helps in the “multiple files competing for blocks” situation. If I understand it correctly, the fragments are just going to be larger, dependent on how long actual allocation is delayed.
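The "fragments are just going to be larger" point can be illustrated with a toy model. This is not the ext3 or ext4 allocator, just a hypothetical sketch in which two files take turns grabbing contiguous runs of blocks from one shared free region; `batch` stands in for how many blocks accumulate before allocation actually happens.

```python
from itertools import groupby


def interleaved_layout(blocks_per_file, batch):
    """Toy model: files A and B alternately take `batch` contiguous blocks."""
    layout = []
    remaining = {"A": blocks_per_file, "B": blocks_per_file}
    while any(remaining.values()):
        for name in ("A", "B"):
            take = min(batch, remaining[name])
            layout.extend([name] * take)
            remaining[name] -= take
    return layout


def extents_per_file(layout, name):
    """Count the runs of consecutive blocks that belong to `name`."""
    return sum(1 for owner, _ in groupby(layout) if owner == name)


# Allocating one block at a time versus 256 blocks at a time.
for batch in (1, 256):
    layout = interleaved_layout(1024, batch)
    print(batch, extents_per_file(layout, "A"))
```

In this model a 1024-block file ends up in 1024 pieces when blocks are handed out one at a time, and in 4 pieces when they are handed out 256 at a time: the interleaving is still there, but each fragment is much larger.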
Pre allocation would solve most of these problems, but honestly, that’s probably years away given the fact that countless user space programs need to be changed. Perhaps in the coming year, the most important programs (cp, tar, etc.) will be updated, but it’s still going to take a while for all of this stuff to get into distribution releases.
On defrag/ext4 testing:
I don’t mean to disparage your work or anything. I’m actually very excited about ext4 and pretty eager to test it out. Right now, I only have one computer and I’m not really willing to migrate that yet, but in a month or two I’ll have my laptop back and I’ll probably test it on that. Hopefully, a good ext4 kernel is in debian unstable by then.
I made my original comment because I think a lot of people downplay how easy it is to end up with a pretty fragmented filesystem. FAT may be many times worse, but Microsoft has included a defragmentation program since at least Windows 9x. So, even if it takes ext3 ten times as long to become fragmented, there’s no solution once it does.
September 1st, 2008 at 3:01 pm
Btmorex,
Delayed allocation works very well for small files, and for large files, if the fragments are large enough, it really doesn’t matter. If I have an ISO image file which is broken into 16 extents, then each “fragment” will be around 48 megabytes. The average seek time for a Seagate Momentus laptop drive is 13 milliseconds; server disks average around 8-10ms. So compared to the time it takes to read 750 megabytes of ISO image, we might have to seek an extra 16 times, which will cost about a fifth of a second. This is literally less than a blink of an eye (which according to wikipedia takes 300-400 ms). If you have a file which has tens of thousands of fragments, that’s a different story, yes.
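The seek arithmetic can be spelled out as a crude estimate that charges one average seek per extent. The function name is made up for illustration, and real drives will deviate from this since neighbouring extents often need much shorter seeks.

```python
def extra_seek_ms(extent_count, avg_seek_ms):
    """Crude estimate: one average-length seek for every extent."""
    return extent_count * avg_seek_ms


laptop_ms = extra_seek_ms(16, 13)      # 16 extents on the laptop drive
server_ms = extra_seek_ms(16, 10)      # 16 extents on a 10 ms server disk
fragmented_ms = extra_seek_ms(45449, 13)  # the 45449-extent file from the earlier comment

print(laptop_ms, server_ms, fragmented_ms / 1000)
```

Sixteen seeks at 13 ms come to 208 ms, well under the blink of an eye. For the 45449-extent file the same crude model gives several hundred seconds, the same order of magnitude as the 218 seconds dd actually took.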
As far as preallocation is concerned, it’s a single line patch, and there are only a few high-value programs that would be important to get right: namely, bittorrent clients and firefox for downloading. In both cases, the programs know the size of the file they are downloading, and so all they have to do is add the line “fallocate(fd, 0, 0, length);” to the program. This is not rocket science. And for people who only know how to download from distributions, sure, it might take six months before Ubuntu picks up the updated program, and once Lenny unfreezes, fixes will enter Debian testing quickly enough. But this is all about Open Source, remember? People helping themselves, people having the freedom to download source and modify it, and recompile it themselves, right?
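For a sense of how small the change is, here is a minimal sketch in Python. Note the swap: Python's standard library exposes `os.posix_fallocate()` (the portable posix_fallocate(3) call, available on Linux but not on every Unix), not the Linux-specific fallocate(2) quoted above. The `preallocate` helper and the 4 MiB size are made up for illustration; a real downloader would take the size from something like a Content-Length header or torrent metadata.

```python
import os
import tempfile


def preallocate(fd, length):
    """Ask the filesystem to reserve `length` bytes for the file up front."""
    # posix_fallocate(fd, offset, len): reserve space starting at offset 0.
    os.posix_fallocate(fd, 0, length)


# Hypothetical download whose final size is known before any data arrives.
expected_size = 4 * 1024 * 1024

with tempfile.NamedTemporaryFile() as download:
    preallocate(download.fileno(), expected_size)
    allocated_size = os.fstat(download.fileno()).st_size
    print(allocated_size)
```

After the call the file already has its full length, so the filesystem has had the chance to pick the blocks in one go rather than piecemeal as data trickles in.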
I don’t optimize for people who are too lazy to follow a more rapidly changing distribution (either Ubuntu, or Fedora, or Debian testing). And realistically, even if I had infinite amounts of free time to work on the defrag tool, it’s not like the defrag tool would come any faster for people who use Debian Stable, and are willing to settle for slow release cycles.
October 15th, 2008 at 9:46 pm
@tytso:
You’re right, no one *has* to work on it. And yeah, if all the planets align then we can minimise fragmentation - for a time.
Pity that when you *have* got bad fragmentation that you’re stuffed.
I agree that it’s completely up to the developer to work on it if they want. But in terms of its importance, I think you underestimate it.
I completely agree with Btmorex. At the moment, if you get bad fragmentation then you’re basically stuffed. There are a lot of less educated ‘n00bs’ out there that aren’t going to know what to do - or even why their system is running so slow.
It’s up to the ext4 guys to decide if they would like to cater for 90% of the user demographic.