    Categories: Filesystems, Linux, SSD

SSD’s, Journaling, and noatime/relatime

On occasion, you will see the advice that the ext3 file system is not suitable for Solid State Disks (SSD's) due to the extra writes caused by journaling, and that Linux users using SSD's should therefore use ext2 instead. But is this folk wisdom actually true? This weekend, I decided to measure exactly what the write overhead of journaling is in practice.

For this experiment I used ext4, since I recently added a feature to track the amount of data written to the file system over its lifetime (to better gauge the wear and tear on an SSD). Ext4 also has the advantage that, starting in 2.6.29, it can operate both with and without a journal, which allowed me to run a controlled experiment manipulating only that one variable. The test workload I chose was a simple one:

  • Clone a git repository containing a linux source tree
  • Compile the linux source tree using make -j2
  • Remove the object files by running make clean

For the first test, I used no special mount options, with the only difference between the two runs being the presence or absence of the has_journal feature. (That is, the first file system was created using mke2fs -t ext4 /dev/closure/testext4, while the second was created using mke2fs -t ext4 -O ^has_journal /dev/closure/testext4.)
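For reference, here is a sketch of how the write totals can be read back. This assumes an e2fsprogs recent enough for dumpe2fs to print the superblock's lifetime write counter:

    # Read the lifetime write counter from the superblock before and
    # after each workload step; the difference between the two
    # readings is the amount of data written by that step:
    dumpe2fs -h /dev/closure/testext4 | grep -i 'lifetime writes'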

Amount of data written (in megabytes) on an ext4 filesystem

    Operation     with journal    w/o journal    percent change
    git clone        367.7           353.0            4.00%
    make             231.1           203.4            12.0%
    make clean        14.6             7.7            47.3%

What the results show is that metadata-heavy workloads, such as make clean, do result in almost twice the amount of data written to disk. This is to be expected, since all changes to metadata blocks are first written to the journal, and the journal transaction must commit before the metadata can be written to its final location on disk. However, for more common workloads where we are writing data as well as modifying filesystem metadata blocks, the difference is much smaller: 4% for the git clone, and 12% for the actual kernel compile.

The noatime mount option

Can we do better? Yes, if we mount the file system using the noatime mount option:

Amount of data written (in megabytes) on an ext4 filesystem mounted with noatime

    Operation     with journal    w/o journal    percent change
    git clone        367.0           353.0            3.81%
    make             207.6           199.4            3.95%
    make clean         6.45            3.73           42.17%

This reduces the extra cost of the journal in the git clone and make steps to just under 4%. What this shows is that most of the extra metadata cost without the noatime mount option was caused by updates to the last access time for kernel source files and directories.
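For completeness, a minimal sketch of such a mount; the device is the test volume from above, and the mount point is hypothetical:

    # Mount with atime updates disabled (mount point hypothetical):
    mount -o noatime /dev/closure/testext4 /mnt/test

    # Or persistently, via an /etc/fstab line such as:
    # /dev/closure/testext4  /mnt/test  ext4  noatime  0  2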

The relatime mount option

There is a newer alternative to the noatime mount option, relatime. The relatime mount option updates the last access time of a file only if the last modified or last inode changed time is newer than the last accessed time; this still allows programs to determine whether a file has been read since it was last modified. The usual (actually, only) example given of such an application is the mutt mail reader, which uses the last accessed time to determine whether new mail has been delivered to Unix mail spool files. Unfortunately, relatime is not free. As you can see below (a small sketch of the relatime rule follows the table), it has roughly double the overhead of noatime, but roughly half the overhead of using the standard Posix atime semantics:

Amount of data written (in megabytes) on an ext4 filesystem mounted with relatime

    Operation     with journal    w/o journal    percent change
    git clone        366.6           353.0            3.71%
    make             216.8           203.7            6.04%
    make clean        13.34            6.97           45.75%
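Here is the promised sketch of the relatime rule, using stat(1) on a hypothetical file; it shows only the basic rule described above:

    f=/path/to/somefile                # hypothetical file
    atime=$(stat -c %X "$f")           # last access time (seconds since epoch)
    mtime=$(stat -c %Y "$f")           # last modification time
    ctime=$(stat -c %Z "$f")           # last inode change time

    # Under relatime, reading the file updates the on-disk atime only
    # if the mtime or ctime is at least as new as the current atime:
    if [ "$mtime" -ge "$atime" ] || [ "$ctime" -ge "$atime" ]; then
        echo "a read would trigger an atime update"
    fi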


Personally, I don’t think relatime is worth it. There are other ways of working around the issue with mutt; for example, you can use Maildir-style mailboxes, or you can use mutt’s check_mbox_size option. If the goal is to reduce unnecessary disk writes, I would mount my file systems using noatime and use other workarounds as necessary. Alternatively, you can use chattr +A to set the no-atime flag on all files and directories where you don’t need atime updates, and then clear the flag for the Unix mbox files where you do care about them. Since the flag is inherited by default, you can get this behaviour by running chattr +A /mntpt right after the filesystem is first created and mounted; all files and directories subsequently created in that file system will inherit the no-atime flag.
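A sketch of that approach; the mbox path is hypothetical:

    # Right after the file system is created and mounted, set the
    # no-atime-updates attribute on the root directory; files and
    # directories created underneath will inherit it:
    chattr +A /mntpt

    # Then clear the flag on the mbox files where atime matters:
    chattr -A /var/mail/$USER          # hypothetical mbox location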

Comparing ext3 and ext2 filesystems

Amount of data written (in megabytes) on an ext3 and ext2 filesystem

    Operation        ext3            ext2       percent change
    git clone        374.6           357.2            4.64%
    make             230.9           204.4           11.48%
    make clean        14.56            6.54           55.08%

Finally, just to round things out, I tried the same experiment using the ext3 and ext2 file systems. The differences between these results and the ext4 numbers stem from the fact that ext2 does not have the directory index feature (aka htree support), and that neither ext2 nor ext3 has extents support; both use the less efficient indirect block scheme. The ext2 and ext3 allocators are also somewhat different from each other, and from ext4's. Still, the results are substantially similar to the first set of Posix-compliant atime update numbers. (I didn't bother to do noatime and relatime benchmark runs with ext2 and ext3, but I expect the results would be similar.)
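A sketch of how the same experiment might be set up for ext3 and ext2; the device names are hypothetical variants of the test volume, and since ext2 and ext3 lack the lifetime write counter, one way to measure writes is the sectors-written column of /proc/diskstats:

    # Create the comparison file systems (hypothetical device names):
    mke2fs -t ext3 /dev/closure/testext3
    mke2fs -t ext2 /dev/closure/testext2

    # Sample sectors written (field 10 of /proc/diskstats) before and
    # after each workload; multiply by 512 to get bytes. The device
    # name here is hypothetical:
    awk '$3 == "dm-0" { print $10 * 512 }' /proc/diskstats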

Conclusion

So given all of this, where did the common folk wisdom that ext3 was not suitable for SSD's come from? Some of it may have been from people worrying too much about extreme workloads such as "make clean"; but while doubling the write load sounds bad, going from 4MB to 7MB worth of writes isn't much compared to the write load of actually doing the kernel compile or populating the kernel source tree. No, the problem was that first-generation SSD's had a very bad problem with what has been called the "write amplification effect", where a 4k write might cause a 128k region of the SSD to be erased and rewritten. In addition, in order to provide safety against system crashes, ext3 issues more synchronous write operations (that is, writes where ext3 waits for the operation to complete before moving on), and on those drives this caused a very pronounced and noticeable stuttering effect which was fairly annoying to users. However, the next generation of SSD's, such as Intel's X25-M, have worked around the write amplification effect.

What else have we learned? First of all, for normal workloads that include data writes, the overhead from journaling is actually relatively small: between 4% and 12%, depending on the workload. Furthermore, much of this overhead can be eliminated by enabling the noatime option. Relatime provides some benefit, but if the goal is to reduce your file system's write load, especially where an SSD is involved, I would strongly recommend using noatime over relatime.

Comments

  • hehe, yes, I know the X25 is fast. I have been following it, the OCZ Vertex series, and some other pretenders.

    The SuperTalent FusionIO PCIE 8x card blows away all the SSD competition. It's not really an SSD - it's more like a NAND memory expansion with up to 2TB of address space. It beats the X25m by a factor of 25 or more on random 4k block writes (off the top of my head.)

    Having no Windows platform to run iometer on, the Linux binary having no command-line options to directly run tests (that I can find), and the documentation for the latest version of IoMeter currently dumping Perl code instead of running the script on the IoMeter website, I'm afraid I'm not going to get any mileage out of that app...

    http://www.iometer.org/cgi-bin/jump.cgi?URL=http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/iometer/iometer/Docs/Iometer.pdf?rev=HEAD

  • @68: Wil,

    It wasn't clear to me that you understood that the X25-M is fast, if you're comparing it to an SD card. :-(

    The Fusion IOXtreme drive is fast, yes, but it's also about 4 times the cost of an X25-M drive, and you can't boot off of one, and it won't fit in a laptop. But sure, if you have a PCIE slot, and you can afford it, and you need its speed, then it's a good choice.

    The trick with using Iometer under Linux is apparently to put the options in the configuration file, but it's not at all well documented, and I don't have time to try to figure out the config file from the sources --- but iometer is open source, so hopefully someone will get around to fixing it up and documenting it better for operation under Linux.

  • That X25-M SSDSA2MH160G2R5 part is looking pretty inviting. There's an OCZ part that's a little cheaper, faster, smaller.

    Really, if you have a laptop in the $1000+ range, what can anyone add to it that will improve the overall performance as much for the money as putting one of the above SSDs into it?

    I'm sure the manufacturers know this, but it's hard to sell 1/10th the storage for the same money, regardless of the performance.

    There's still the question of reliability. Both Intel and OCZ state "1.5M hours MTBF", then offer a 1 or 2 year warranty. Hello? 1.5M hours is 171 years! If there were a shred of honesty behind those statements, they would back them with lifetime replacement warranties. So take both with a sack of salt.

    I'll leave checking OCZ's track record on SSD parts as an exercise for the reader. Spending that much money without doing research is something I would never encourage anyone to do. BTW, that was a subtle hint to check OCZ's track record before buying their parts...

  • @69: I was comparing the Intel Z-P230 to an SD card. They have very similar performance characteristics; in fact, a high-end SD card outperforms the Intel SSD in all but max linear read, and even then by <10%. Pretty sad.

  • @69: Wil,

    Sorry, I missed the Z-P230 reference. I must have been reading your post too quickly, and I probably mistook it for an Atom CPU part number.... My bad. Yeah, as far as I can tell that Intel SSD was something cheap cheap cheap that was intended for sale directly to netbook manufacturers. It has since disappeared without a trace.

    Note, BTW, that if price is an issue and you don't need that much capacity (how much do you need for a netbook or even a laptop if most of your software is in the cloud? I need 80GB because I'm a developer; if I nuked all of my source trees I could probably live with 40GB), you can also get a 40GB Kingston SSDNow V SSD which uses the Intel controller, but with half the flash channels. You have to be careful, though; the 64GB and 128GB Kingston SSDNow V use the JMicron controller. (V stands for value, which can sometimes also be another way of saying the buyer needs to be careful. :-)

    With OCZ, yes, you have to be careful. They produce a large number of devices at different price points, and clearly at different levels of quality.

  • I should add that if you have a larger laptop, such as a Lenovo Thinkpad T series, you can use two drives: a 40GB or 80GB SSD where you have your OS and your home directory, and a 500GB 5400rpm hard drive where you have your build directories, music, images, etc. That's what I do these days... the source tree is on the SSD for speed, but the build trees where the compiler deposits the object files are on the 500GB disk. That way I get the best of both worlds: I use the SSD for frequently accessed files or files where fast access will improve the "feel" of my system, and I use the hard drive for bulk storage and for writes which, if slower (such as writing object files), won't hinder the overall speed of my system.

    This is also very easy to do for a desktop, of course --- although these days I generally use my laptop instead of a desktop.

  • Hi Ted,
    How do you use the ext4 information about the lifetime write in practice?
    I've seen your kernel commit:

    commit afc32f7ee9febc020c73da61402351d4c90437f3
    Author: Theodore Ts'o
    Date: Sat Feb 28 19:39:58 2009 -0500

    ext4: Track lifetime disk writes

    Add a new superblock value which tracks the lifetime amount of writes
    to the filesystem. This is useful in estimating the amount of wear on
    solid state drives (SSD's) caused by writes to the filesystem.

    but I don't know how I can actually see that information? Does it appear
    in the output of some tool made for ext4?

  • Ted,

    I notice that all of the benchmarks do writes. However, I wonder what might happen if you included one that was read-only, such as a recursive word count.
    Something like "find /fs -type f | xargs wc".

    There you would see the atime/relatime factor really show up. But by how much? I would guess A Lot. I think atime is useless (mutt bedamned) personally, but by how much?

  • Another program that depends on atime is tmpwatch. Took me a while to figure out why /var/tmp/nginx/* kept disappearing on me. Turns out noatime + tmpwatch guarantees issues for more than just mutt.