Aligning filesystems to an SSD’s erase block size

I recently purchased a new toy, an Intel X25-M SSD, and when I was setting it up initially, I decided I wanted to make sure the file system was aligned on an erase block boundary.  This is a generally considered to be a Very Good Thing to do for most SSD’s available today, although there’s some question about how important this really is for Intel SSD’s — more on that in a moment.

It turns out this is much more difficult than you might first think — most of Linux’s storage stack is not set up well to worry about alignment of partitions and logical volumes.  This is surprising, because it’s useful for many things other than just SSD’s.  This kind of alignment is important if you are using any kind of hardware or software RAID, for example, especially RAID 5, because if writes are done on stripe boundaries, it can avoid a read-modify-write overhead.  In addition, the hard drive industry is planning on moving to 4096 byte sectors instead of the way-too-small 512 byte sectors at some point in the future.   Linux’s default partition geometry of 255 heads and 63 sectors/track means that there are 16065 (512 byte) sectors per cylinder.  The initial round of 4k sector disks will emulate 512 byte disks, but if the partitions are not 4k aligned, then the disk will end up doing a read/modify/write on two internal 4k sectors for each singleton 4k file system write, and that would be unfortunate.

Vista has already started working around this problem, since it uses a default partitioning geometry of 240 heads and 63 sectors/track.   This results in a cylinder boundary which is divisible by 8, and so the partitions (with the exception of the first, which is still misaligned unless you play some additional tricks) are 4k aligned.    So this is one place where Vista is ahead of Linux….   unfortunately the default 255 heads and 63 sectors is hard coded in many places in the kernel, in the SCSI stack, and in various partitioning programs; so fixing this will require changes in many places.

However, with SSD’s (remember SSD’s?  This is a blog post about SSD’s…) you need to align partitions on at least 128k boundaries for maximum efficiency.   The best way to do this that I’ve found is to use 224 (32*7) heads and 56 (8*7) sectors/track.  This results in 12544 (or 256*49) sectors/cylinder, so that each cylinder is 49*128k.  You can do this by doing starting fdisk with the following options when first partitioning the SSD:

# fdisk -H 224 -S 56 /dev/sdb

The first partition will only be aligned on a 4k boundary, since in order to be compatible with MS-DOS, the first partition starts on track 1 instead of track 0, but I didn’t worry too much about that since I tend to use the first partition for /boot, which tends not to get modified much.   You can go into expert mode with fdisk and force the partition to begin on an 128k alignment, but many Linux partition tools will complain about potential compatibility problems (which are obsolete warnings, since the systems that would have booting systems with these issues haven’t been made in about ten years), but I didn’t needed to do that, so I didn’t worry about it.

So I created a 1 gigabyte /boot partition as /dev/sdb1, and allocated the rest of the SSD for use by LVM as /dev/sdb2. And that’s where I ran into my next problem. LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volume to be properly aligned you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:

# pvcreate –metadatasize 250k /dev/sdb2

Physical volume “/dev/sdb2” successfully created

Why 250k and not 256k? I can’t tell you — sometimes the LVM tools aren’t terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:

# pvs /dev/sdb2 -o+pe_start

PV         VG   Fmt  Attr PSize  PFree  1st PE

/dev/sdb2       lvm2 —   73.52G 73.52G 256.00K

If you use a metadata size of 256k, the first PE will be at 320k instead of 256k. There really ought to be an –pe-align option to pvcreate, which would be far more user-friendly, but, we have to work with the tools that we have. Maybe in the next version of the LVM support tools….

Once you do this, we’re almost done. The last thing to do is to create the file system. As it turns out, if you are using ext4, there is a way to tell the file system that it should try to align files so they match up with the RAID stripe width. (These techniques can be used for RAID disks as well). If your SSD has an 128k erase block size, and you are creating the file system with the default 4k block size, you just have to specify a strip width when you create the file system, like so:

# mke2fs -t ext4 -E stripe-width=32,resize=500G /dev/ssd/root

(The resize=500G limits the number of blocks reserved for resizing this file system so that the guaranteed number size that the file system can be grown via online resize is 500G. The default is 1000 times the initial file system size, which is often far too big to be reasonable. Realistically, the file system I am creating is going to be used for a desktop device, and I don’t foresee needing to resize it beyond 500G, so this saves about a 50 megabytes or so. Not a huge deal, but “waste not, want not”, as the saying goes.)

With e2fsprogs 1.41.4, the journal will be 128k aligned, as will the start of the file system, and with the stripe-width specified, the ext4 allocator will try to align block writes to the stripe width where that makes sense. So this is as good as it gets without kernel changes to make the block and inode allocators more SSD aware, something which I hope to have a chance to look at.

All of this being said, it’s time to revisit this question — is all of this needed for a “smart”, “better by design” next-generation SSD such as Intel’s? Aligning your file system on an erase block boundary is critical on first generation SSD’s, but the Intel X25-M is supposed to have smarter algorithms that allow it to reduce the effect of write-amplification. The details are a little bit vague, but presumably there is a mapping table which maps sectors (at some internal sector size — we don’t know for sure whether it’s 512 bytes or some larger size) to individual erase blocks. If the file system sends a series of 4k writes for file system blocks 10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, 99, followed by a barrier operation, a traditional SSD might do read/modify/write on four 128k erase blocks — one covering the blocks 0-31, another for the blocks 32-63, and so on. However, the Intel SSD will simply write a single 128k block that indicates where the latest versions of blocks 10, 12, 13, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, and 99 can be found.

This technique tends to work very well.  However, over time, the table will get terribly fragmented, and depending on whether the internal block sector size is 512 or 4k (or something in between), there could be a situation where all but one or two of the internal sectors on the disks have been mapped away to other erase blocks, leading to fragmentation of the erase blocks. This is not just a theoretical problem; there are reports from the field that this happens relatively easy. For example, see Allyn Malventano’s Long-term performance analysis of Intel Mainstream SSDs and Marc Prieur’s report from which includes an official response from Intel regarding this phenomenon.  Laurent Gilson posted on the Linux-Thinkpad mailing list that when he tried using the X25-M to store commit journals for a database, that after writing 170% of the capacity of an Intel SSD, the small writes caused the write performance to go through the floor.   More troubling, Allyn Malventano indicated that if the drive is abused for too long with a mixture of small and large writes, it can get into a state where the performance degredation is permanent, and even a series of large writes apparently does not restore the drive’s function — only an ATA SECURITY ERASE command to completely reset the mapping table seems to help.

So, what can be done to prevent this?   Allyn’s review speculates that aligning writes to erase write boundaries can help — I’m not 100% sure this is true, but without detailed knowledge of what is going on under the covers in Intel’s SSD, we won’t know for sure.  It certainly can’t hurt, though, and there is a distinct possibility that the internal sector size is larger than 512 bytes, which means the default partitioning scheme of 255 heads/63 sectors is probably not a good idea.   (Even Vista has moved to a 240/63 scheme, which gives you 8k alignment of partitions; I prefer 224/56 partitioning, since the days when BIOS’s used C/H/S I/O are long gone.)

The Ext3 and Ext4 file system tend to defer meta-data writes by pinning them until a transaction commit; this definitely helps, and ext4 allows you to configure an erase block boundary, which should also be helpful.  Enabling laptop mode will discourage writing to the disk except in large blocks, which probably helps significantly as well.   And avoiding fsync() in applications will also be helpful, since a cache flush operation will force the SSD to write to an erase block even if it isn’t completely filled.   Beyond that, clearly some experimentation will be needed.  My current thinking is to use a standard file system aging workload, and then performing an I/O benchmark to see if there has been any performance degradation.  I can then vary various file system tuning parameters and algorithms, confirm whether or not a heavy fsync workload makes the performance worse.

In the long term, hopefully Intel will release a firmware update which adds support for ATA TRIM/DISCARD commands, which will allow the file system to inform the SSD that various blocks have been deleted and no longer need to be preserved by the SSD.   I suspect this will be a big help, if the SSD knows that certain sectors no longer need to be preserved, it can avoid copying them when trying to defragment the SSD.   Given how expensive the X25-M SSD’s are, I hope that there will be a firmware update to support this, and that Intel won’t leave its early adopters high and dry by only offering this functionality in newer models of the SSD.   If they were to do that, it would leave many of these early adopters, especially your humble writer (who paid for his SSD out of his own pocket), to be quite grumpy indeed.  Hopefully, though, it won’t come to that.

Update: I’ve since penned a follow-up post “Should Filesystems Be Optimized for SSD’s?”