
SSD’s, Journaling, and noatime/relatime

On occasion, you will see the advice that the ext3 file system is not suitable for Solid State Disks (SSD’s) due to the extra writes caused by journaling — and so Linux users using SSD’s should use ext2 instead. However, is this folk wisdom actually true? This weekend, I decided to measure exactly what the write overhead of journaling is in actual practice.

For this experiment I used ext4, since I recently added a feature to track the amount of writes to the file system over its lifetime (to better gauge the wear and tear on an SSD). Ext4 also has the advantage that (starting in 2.6.29) it supports operation both with and without a journal, which allowed me to do a controlled experiment where I could manipulate only that one variable. The test workload I chose was a simple one:

  • Clone a git repository containing a linux source tree
  • Compile the linux source tree using make -j2
  • Remove the object files by running make clean

For the first test, I ran the test using no special mount options, and the only difference being the presence or absence of the has_journal feature. (That is, the first file system was created using mke2fs -t ext4 /dev/closure/testext4, while the second file system was created using mke2fs -t ext4 -O ^has_journal /dev/closure/testext4.)
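For anyone who wants to reproduce this, here is a rough sketch of a single run as a shell script. The sysfs counter it reads is the write-tracking statistic mentioned above; its exact name and location (/sys/fs/ext4/<blockdev>/session_write_kbytes below) depend on the kernel version, and the device, mount point, and repository paths are only placeholders:

#!/bin/sh
# Sketch of one test run; device, mount point, and repository paths are placeholders.
DEV=dm-2                                   # block device backing /dev/closure/testext4
COUNTER=/sys/fs/ext4/$DEV/session_write_kbytes

mount /dev/closure/testext4 /mnt
sync; before=$(cat $COUNTER)

git clone ~/src/linux-2.6 /mnt/linux-2.6            # step 1: clone the kernel tree
sync; clone=$(cat $COUNTER)

(cd /mnt/linux-2.6 && make defconfig && make -j2)   # step 2: compile
sync; compile=$(cat $COUNTER)

(cd /mnt/linux-2.6 && make clean)                   # step 3: remove the object files
sync; clean=$(cat $COUNTER)

echo "clone: $((clone - before)) KB written"
echo "make:  $((compile - clone)) KB written"
echo "clean: $((clean - compile)) KB written"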


Amount of data written (in megabytes) on an ext4 filesystem

Operation      with journal    w/o journal    percent change
git clone             367.7          353.0             4.00%
make                  231.1          203.4             12.0%
make clean             14.6            7.7             47.3%


What the results show is that metadata-heavy workloads, such as make clean, do result in almost twice the amount of data written to disk. This is to be expected, since all changes to metadata blocks are first written to the journal and the journal transaction committed before the metadata is written to its final location on disk. However, for more common workloads where we are writing data as well as modifying filesystem metadata blocks, the difference is much smaller: 4% for the git clone, and 12% for the actual kernel compile.

The noatime mount option

Can we do better? Yes, if we mount the file system using the noatime mount option:


Amount of data written (in megabytes) on an ext4 filesystem mounted with noatime

Operation      with journal    w/o journal    percent change
git clone             367.0          353.0             3.81%
make                  207.6          199.4             3.95%
make clean             6.45           3.73            42.17%


This reduces the extra cost of the journal in the git clone and make steps to just under 4%. What this shows is that most of the extra metadata cost without the noatime mount option was caused by updates to the last access time (atime) of the kernel source files and directories.
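For completeness, noatime is just a mount option; it can be enabled at mount time or made persistent in /etc/fstab. A short sketch, with the device and mount point as placeholders:

# Remount the test file system without atime updates
mount -o remount,noatime /dev/closure/testext4 /mnt
# Or make it permanent with an /etc/fstab entry along these lines:
#   /dev/closure/testext4   /mnt   ext4   defaults,noatime   0   2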

The relatime mount option

There is a newer alternative to the noatime mount option, relatime. The relatime mount option updates the last access time of a file only if the last modified or last inode changed time is newer than the last accessed time. This allows programs to determine whether a file has been read since it was last modified. The usual (actually, only) example that is given of such an application is the mutt mail-reader, which uses the last accessed time to determine if new mail has been delivered to Unix mail spool files. Unfortunately, relatime is not free. As you can see below, it has roughly double the overhead of noatime (but roughly half the overhead of using the standard Posix atime semantics):


Amount of data written (in megabytes) on an ext4 filesystem mounted with relatime

Operation      with journal    w/o journal    percent change
git clone             366.6          353.0             3.71%
make                  216.8          203.7             6.04%
make clean            13.34           6.97            45.75%


Personally, I don’t think relatime is worth it. There are other ways of working around the issue with mutt — for example, you can use Maildir-style mailboxes, or you can use mutt’s check_mbox_size option. If the goal is to reduce unnecessary disk writes, I would mount my file systems using noatime, and use other workarounds as necessary. Alternatively, you can use chattr +A to set the noatime flag on all files and directories where you don’t need atime updates, and then clear the flag on the Unix mbox files where you do care about atime. Since the noatime flag is inherited by default, you can get this behaviour by running chattr +A /mntpt right after the filesystem is first created and mounted; all files and directories subsequently created in that file system will then inherit the noatime flag.
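A sketch of that chattr-based approach, with the mount point and mail spool path as placeholders:

# Mark the top-level directory right after the file system is created and mounted;
# new files and directories will inherit the flag.
chattr +A /mntpt
# Clear the flag on the mbox files whose atime mutt actually cares about.
chattr -A /mntpt/var/mail/*
# Verify which inodes carry the 'A' (no atime updates) attribute.
lsattr -d /mntpt /mntpt/var/mail/*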

Comparing ext3 and ext2 filesystems


Amount of data written (in megabytes) on an ext3 and ext2 filesystem

Operation          ext3       ext2    percent change
git clone         374.6      357.2             4.64%
make              230.9      204.4            11.48%
make clean        14.56       6.54            55.08%


Finally, just to round things out, I tried the same experiment using the ext3 and ext2 file systems. The differences between these results and the ones involving ext4 stem from the fact that ext2 does not have the directory index feature (aka htree support), and that both ext2 and ext3 lack extents support, using the less efficient indirect block scheme instead. The ext2 and ext3 allocators are also somewhat different from each other, and from ext4’s. Still, the results are substantially similar to the first set of Posix-compliant atime update numbers. (I didn’t bother to do noatime and relatime benchmark runs with ext2 and ext3, but I expect the results would be similar.)
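Incidentally, the feature differences being discussed here are recorded in the superblock and are easy to inspect; dumpe2fs will show whether flags such as has_journal, dir_index, and extent are enabled on a given file system (the device name below is the one from my earlier test setup):

# List the feature flags recorded in the superblock
dumpe2fs -h /dev/closure/testext4 | grep -i features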

Conclusion

So given all of this, where did the common folk wisdom that ext3 was not suitable for SSD’s come from? Some of it may have been from people worrying too much about extreme workloads such as “make clean”; but while doubling the write load sounds bad, going from 4MB to 7MB worth of writes isn’t that much compared to the write load of actually doing the kernel compile or populating the kernel source tree. No, the problem was that first-generation SSD’s had a very bad problem with what has been called the “write amplification effect”, where a 4k write might cause a 128k region of the SSD to be erased and rewritten. In addition, in order to provide safety against system crashes, ext3 does more synchronous write operations (that is, it waits for those writes to complete before moving on), and on those early SSD’s this caused a very pronounced and noticeable stuttering effect which was fairly annoying to users. However, the next generation of SSD’s, such as Intel’s X25-M, have worked around the write amplification effect.

What else have we learned? First of all, for normal workloads that include data writes, the overhead from journaling is actually relatively small (between 4 and 12%, depending on the workload). Further, much of this overhead can be reduced by enabling the noatime option, with relatime providing some, but less, benefit. Ultimately, if the goal is to reduce your file system’s write load, especially where an SSD is involved, I would strongly recommend the use of noatime over relatime.

Should Filesystems Be Optimized for SSD’s?

In one of the comments to my last blog entry, an anonymous commenter writes:

You seem to be taking a different perspective to linus on the “adapting to the the disk technology” front (Linus seems to against having to have the OS know about disk boundaries and having to do levelling itself)

That’s an interesting question, and I figure it’s worth its own top-level entry, as opposed to a reply in the comment stream. One of the interesting design questions in any OS or computer architecture is where the abstraction boundaries should be drawn, and to which side of an abstraction boundary various operations should be pushed. Linus’s argument is that a flash controller can do a better job of wear leveling, including detecting how “worn” a particular flash cell might be (for example, perhaps by looking at the charge levels at an analog level and knowing when the cell was last programmed), and so it doesn’t make sense to try to do wear leveling in a flash file system. Some responsibilities of flash management, such as coalescing newly written blocks into erase blocks to avoid write amplification, can be done either on the SSD or in the file system — for example, by using a log-structured file system, or some other copy-on-write file system, instead of a rewrite-in-place style file system, you can essentially solve the write amplification problem. In some cases, it’s necessary to let additional information leak across the abstraction — for example, the ATA TRIM command is a way for the file system to let the disk know that certain blocks no longer need to be used. If too much information needs to be pushed across the abstraction, one way or another, then maybe we need to rethink whether the abstraction barrier is in the right place.

In addition, if the abstraction has been around for a long time, changing it also has costs which have to be taken into account. The 512-byte sector LBA abstraction has been around a long time, and therefore dislodging it is difficult and costly. For example, the same argument which says that, because the underlying hardware details change between different generations of SSD, all of these details should be hidden in hardware, was also used to justify something that has been a complete commercial failure for years if not decades: Object Based Disks.

One of the arguments for OBD’s was that the hard drive has the best knowledge of how and where to store a contiguous stream of bytes, and so perhaps filesystems should not be trying to decide where on disk an inode should be stored, but should instead tell the hard drive, “I have this object, which is 134 kilobytes long; please store it somewhere on the disk.” At least in theory, the HDD or SSD could handle all of the details of knowing the best place to store the object on the spinning magnetic media or flash media: in the case of an SSD it could take into account how worn the flash is and automatically move the object around, and in the case of an HDD the drive could know about (real) cylinder and track boundaries and store the object in the most efficient way possible, since the drive has intimate knowledge about the low-level details of how data is stored on the disk.

This theory makes a huge amount of sense; but there’s only one problem. Object Based Disks have been proposed in academia, and promoted by the advanced R&D shops of companies like Seagate, for over a decade, with absolutely nothing to show for it. Why? There have been two reasons proposed. One is that OBD vendors were too greedy, and tried to charge too much money for OBD’s. Another explanation is that the interface abstraction for OBD’s was too different, and so there wasn’t enough software, in the form of file systems or OS’s, that could take advantage of OBD’s.

Both explanations undoubtedly contributed to the commercial failure of OBD’s, but the question is which was the bigger reason. That question is particularly important here because, at least as far as Intel’s SSD strategy is concerned, its advantage is that (modulo implementation shortcomings such as the reported internal LBA remapping table fragmentation problem and the lack of ATA TRIM support) filesystems don’t need to change (much) in order to take advantage of the Intel SSD and get at least decent performance.

However, if the price delta is the stronger reason for the failure, then the X25-M may be in trouble. Currently the 80GB Intel X25-M has a street price of $400, so it costs roughly $5 per gigabyte. “Dumb” MLC SATA SSD’s are available for roughly half the cost per gigabyte (64 GB for $164). So what does the market look like 12-18 months from now? If “dumb” SSD’s are still available at 50% of the cost of “smart” SSD’s, it would probably be worth it to make a copy-on-write style filesystem that attempts to do the wear leveling and write amplification reduction in software. Sure, it’s probably more efficient to do it in hardware, but a 2x price differential might cause people to settle for a cheaper solution even if it isn’t the absolute best technical choice. On the other hand, if prices drop significantly, and/or “dumb” SSD’s completely disappear from the market, then time spent now optimizing for “dumb” SSD’s will be completely wasted.

So for Linus to make the proclamation that it’s completely stupid to optimize for “dumb” SSD’s seems to be a bit premature.   Market externalities — for example, does Intel have patents that will prevent competing “smart” SSD’s from entering the market and thus forcing price drops? — could radically change the picture.  It’s not just a pure technological choice, which is what makes projections and prognostications difficult.

As another example, I don’t know whether or not Intel will issue a firmware update that adds ATA TRIM support to the X25-M, or how long it will take before such SSD’s become available. Until ATA TRIM support becomes available, it will be advantageous to add support in ext4 for a block allocator option that aggressively reuses blocks above all else, and avoids using blocks that have never been allocated or used before, even if it causes more in-file-system fragmentation and deeper extent allocation trees. The reason for this is that, at least today, once a block has been used by the file system, the X25-M has absolutely no idea whether we still care about the contents of that block, or whether the block has since been released because the file was deleted. However, if 20% of the SSD’s blocks have never been used, the X25-M can use 20% of the flash for better garbage collection and defragmentation algorithms. And if Intel never releases a firmware update to add ATA TRIM support, then I will be out $400 of my own money for an SSD that lacks this capability, and so adding a block allocator which works around the limitations of the X25-M probably makes sense. If it turns out that it takes two years before disks that have ATA TRIM support show up, then it will definitely make sense to add such an optimization. (Hard drive vendors have been historically S-L-O-W to finish standardizing new features and then letting such features enter the market place, so I’m not necessarily holding my breath; after all, the Linux block device layer and file systems have been ready to send ATA TRIM commands for about six months; what’s taking the ATA committees and SSD vendors so long? <grin>)

On the other hand, if Intel releases ATA TRIM support next month, then it might not be worth my effort to add such a mount option to ext4. Or maybe Sandisk will soon make an ATA TRIM capable SSD available which is otherwise competitive with Intel’s, and I get a free sample, but it turns out another optimization on Sandisk SSD’s will give me an extra 10% performance gain under some workloads. Is it worth it in that case? Hard to tell, unless I know whether such a tweak addresses an optimization problem which is fundamental, and whether such a tweak will be unnecessary, or perhaps actively unhelpful, in the next generation. As long as SSD manufacturers force us to treat these devices as black boxes, there may be a certain amount of cargo cult science which will be forced upon us file system designers — or I guess I should say, in order to be more academically respectable, “we will be forced to rely more on empirical measurements leading to educated engineering estimations about what the SSD is doing inside the black box”. Heh.

Aligning filesystems to an SSD’s erase block size

I recently purchased a new toy, an Intel X25-M SSD, and when I was setting it up initially, I decided I wanted to make sure the file system was aligned on an erase block boundary. This is generally considered to be a Very Good Thing to do for most SSD’s available today, although there’s some question about how important this really is for Intel SSD’s — more on that in a moment.

It turns out this is much more difficult than you might first think — most of Linux’s storage stack is not set up well to worry about alignment of partitions and logical volumes.  This is surprising, because it’s useful for many things other than just SSD’s.  This kind of alignment is important if you are using any kind of hardware or software RAID, for example, especially RAID 5, because if writes are done on stripe boundaries, it can avoid a read-modify-write overhead.  In addition, the hard drive industry is planning on moving to 4096 byte sectors instead of the way-too-small 512 byte sectors at some point in the future.   Linux’s default partition geometry of 255 heads and 63 sectors/track means that there are 16065 (512 byte) sectors per cylinder.  The initial round of 4k sector disks will emulate 512 byte disks, but if the partitions are not 4k aligned, then the disk will end up doing a read/modify/write on two internal 4k sectors for each singleton 4k file system write, and that would be unfortunate.

Vista has already started working around this problem, since it uses a default partitioning geometry of 240 heads and 63 sectors/track. This results in a cylinder size which is divisible by 8 sectors, and so the partitions (with the exception of the first, which is still misaligned unless you play some additional tricks) are 4k aligned. So this is one place where Vista is ahead of Linux…. Unfortunately, the default of 255 heads and 63 sectors is hard-coded in many places in the kernel, in the SCSI stack, and in various partitioning programs, so fixing this will require changes in many places.

However, with SSD’s (remember SSD’s? This is a blog post about SSD’s…) you need to align partitions on at least 128k boundaries for maximum efficiency. The best way to do this that I’ve found is to use 224 (32*7) heads and 56 (8*7) sectors/track. This results in 12544 (or 256*49) sectors/cylinder, so that each cylinder is 49*128k. You can do this by starting fdisk with the following options when first partitioning the SSD:

# fdisk -H 224 -S 56 /dev/sdb

The first partition will only be aligned on a 4k boundary, since in order to be compatible with MS-DOS, the first partition starts on track 1 instead of track 0; but I didn’t worry too much about that, since I tend to use the first partition for /boot, which tends not to get modified much. You can go into expert mode with fdisk and force the partition to begin on a 128k alignment, but many Linux partition tools will complain about potential compatibility problems (these warnings are obsolete, since the systems that would have had trouble booting from such a layout haven’t been made in about ten years). I didn’t need to do that, so I didn’t worry about it.
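One way to double-check the result is to look at the starting sector that sysfs exports for each partition; since these are 512-byte sectors, a starting sector that is a multiple of 256 is 128k aligned (the device names below are from my setup, adjust as needed):

# Starting sector of the LVM partition, in 512-byte units
cat /sys/block/sdb/sdb2/start
# A 128k-aligned partition starts on a multiple of 256 sectors, so this should print 0
echo $(( $(cat /sys/block/sdb/sdb2/start) % 256 ))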

So I created a 1 gigabyte /boot partition as /dev/sdb1, and allocated the rest of the SSD for use by LVM as /dev/sdb2. And that’s where I ran into my next problem. LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volume to be properly aligned you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:

# pvcreate --metadatasize 250k /dev/sdb2
Physical volume "/dev/sdb2" successfully created

Why 250k and not 256k? I can’t tell you — sometimes the LVM tools aren’t terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:

# pvs /dev/sdb2 -o+pe_start
PV         VG   Fmt  Attr PSize  PFree  1st PE
/dev/sdb2       lvm2 --   73.52G 73.52G 256.00K

If you use a metadata size of 256k, the first PE will be at 320k instead of 256k. There really ought to be a --pe-align option to pvcreate, which would be far more user-friendly, but we have to work with the tools that we have. Maybe in the next version of the LVM support tools….
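The remaining LVM steps are the usual ones; for completeness they look something like the following, where the volume group and logical volume names are simply the ones that produce the /dev/ssd/root device used below, and the size is arbitrary:

# Create a volume group named "ssd" on the aligned physical volume
vgcreate ssd /dev/sdb2
# Carve out a logical volume for the root file system
lvcreate -L 60G -n root ssd

Since the first physical extent starts at 256k and the default extent size is 4MB, each logical volume carved out of this volume group will itself begin on a 128k boundary.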

Once you do this, we’re almost done. The last thing to do is to create the file system. As it turns out, if you are using ext4, there is a way to tell the file system that it should try to align files so they match up with the RAID stripe width. (These techniques can be used for RAID disks as well.) If your SSD has a 128k erase block size, and you are creating the file system with the default 4k block size, you just have to specify a stripe width of 32 blocks when you create the file system, like so:

# mke2fs -t ext4 -E stripe-width=32,resize=500G /dev/ssd/root

(The resize=500G option limits the number of blocks reserved for future resizing, so that the file system is guaranteed to be growable via online resize up to 500GB. The default is to reserve enough space to grow to 1000 times the initial file system size, which is often far too big to be reasonable. Realistically, the file system I am creating is going to be used for a desktop machine, and I don’t foresee needing to resize it beyond 500GB, so this saves about 50 megabytes or so. Not a huge deal, but “waste not, want not”, as the saying goes.)

With e2fsprogs 1.41.4, the journal will be 128k aligned, as will the start of the file system, and with the stripe-width specified, the ext4 allocator will try to align block writes to the stripe width where that makes sense. So this is as good as it gets without kernel changes to make the block and inode allocators more SSD aware, something which I hope to have a chance to look at.
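If you want to double-check what actually ended up in the superblock, the stripe width can be read back with tune2fs; the field name below is how recent versions of e2fsprogs print it:

# Show the stripe width recorded in the superblock, in file system blocks
tune2fs -l /dev/ssd/root | grep -i "stripe width"
# (should report a RAID stripe width of 32 for the file system created above)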


All of this being said, it’s time to revisit this question — is all of this needed for a “smart”, “better by design” next-generation SSD such as Intel’s? Aligning your file system on an erase block boundary is critical on first generation SSD’s, but the Intel X25-M is supposed to have smarter algorithms that allow it to reduce the effect of write-amplification. The details are a little bit vague, but presumably there is a mapping table which maps sectors (at some internal sector size — we don’t know for sure whether it’s 512 bytes or some larger size) to individual erase blocks. If the file system sends a series of 4k writes for file system blocks 10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, 99, followed by a barrier operation, a traditional SSD might do read/modify/write on four 128k erase blocks — one covering the blocks 0-31, another for the blocks 32-63, and so on. However, the Intel SSD will simply write a single 128k block that indicates where the latest versions of blocks 10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, and 99 can be found.

This technique tends to work very well. However, over time the mapping table will get terribly fragmented, and depending on whether the internal sector size is 512 bytes or 4k (or something in between), you can end up in a situation where all but one or two of the internal sectors in an erase block have been mapped away to other erase blocks, leading to fragmentation of the erase blocks. This is not just a theoretical problem; there are reports from the field that this happens relatively easily. For example, see Allyn Malventano’s Long-term performance analysis of Intel Mainstream SSDs and Marc Prieur’s report from BeHardware.com, which includes an official response from Intel regarding this phenomenon. Laurent Gilson posted on the Linux-Thinkpad mailing list that when he tried using the X25-M to store commit journals for a database, after writing 170% of the capacity of the SSD the small writes caused write performance to go through the floor. More troubling, Allyn Malventano indicated that if the drive is abused for too long with a mixture of small and large writes, it can get into a state where the performance degradation is permanent, and even a series of large writes apparently does not restore the drive’s function — only an ATA SECURITY ERASE command to completely reset the mapping table seems to help.

So, what can be done to prevent this? Allyn’s review speculates that aligning writes to erase block boundaries can help — I’m not 100% sure this is true, but without detailed knowledge of what is going on under the covers in Intel’s SSD, we won’t know for sure. It certainly can’t hurt, though, and there is a distinct possibility that the internal sector size is larger than 512 bytes, which means the default partitioning scheme of 255 heads/63 sectors is probably not a good idea. (Even Vista has moved to a 240/63 scheme, which gives you 8k alignment of partitions; I prefer 224/56 partitioning, since the days when BIOS’s used C/H/S I/O are long gone.)

The ext3 and ext4 file systems tend to defer metadata writes by pinning them until a transaction commit; this definitely helps, and ext4 additionally lets you tell it about the erase block size (via the stripe width discussed above), which should also be helpful. Enabling laptop mode will discourage writing to the disk except in large bursts, which probably helps significantly as well. And avoiding fsync() in applications will also be helpful, since a cache flush operation will force the SSD to write out an erase block even if it isn’t completely filled. Beyond that, clearly some experimentation will be needed. My current thinking is to use a standard file system aging workload, and then perform an I/O benchmark to see whether there has been any performance degradation. I can then vary various file system tuning parameters and algorithms, and confirm whether or not a heavy fsync workload makes the performance worse.
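For reference, laptop mode and the associated writeback knobs are ordinary sysctls; the values below are illustrative rather than recommendations:

# Enable laptop mode so that dirty data is flushed in larger, less frequent bursts
echo 5 > /proc/sys/vm/laptop_mode
# Let dirty pages age longer before background writeback kicks in (units are centiseconds)
echo 6000 > /proc/sys/vm/dirty_writeback_centisecs
echo 6000 > /proc/sys/vm/dirty_expire_centisecs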

In the long term, hopefully Intel will release a firmware update which adds support for the ATA TRIM/DISCARD commands, which will allow the file system to inform the SSD that various blocks have been deleted and no longer need to be preserved by the SSD. I suspect this will be a big help: if the SSD knows that certain sectors no longer need to be preserved, it can avoid copying them when trying to defragment the SSD. Given how expensive the X25-M SSD’s are, I hope that there will be a firmware update to support this, and that Intel won’t leave its early adopters high and dry by only offering this functionality in newer models of the SSD. If they were to do that, it would leave many of these early adopters, especially your humble writer (who paid for his SSD out of his own pocket), quite grumpy indeed. Hopefully, though, it won’t come to that.
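When and if such firmware does appear, checking for the capability is straightforward: a reasonably recent hdparm decodes the relevant bit from the drive’s IDENTIFY data, although the exact wording of the output may vary between hdparm versions:

# Query the drive's IDENTIFY data and look for the TRIM capability
hdparm -I /dev/sda | grep -i trim
# A TRIM-capable drive reports a line along the lines of "Data Set Management TRIM supported"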

Update: I’ve since penned a follow-up post “Should Filesystems Be Optimized for SSD’s?”