Should Filesystems Be Optimized for SSD’s?

In one of the comments to my last blog entry, an anonymous commenter writes:

You seem to be taking a different perspective to linus on the “adapting to the the disk technology” front (Linus seems to against having to have the OS know about disk boundaries and having to do levelling itself)

That’s an interesting question, and I figure it’s worth its own top-level entry, as opposed to a reply in the comment stream. One of the interesting design questions in any OS or computer architecture is where the abstraction boundaries should be drawn, and to which side of an abstraction boundary various operations should be pushed. Linus’s argument is that a flash controller can do a better job of wear leveling, including detecting how “worn” a particular flash cell might be (for example, perhaps by looking at the charge levels at an analog level and knowing when the cell was last programmed), and so it doesn’t make sense to try to do wear leveling in a flash file system. Some responsibilities of flash management, such as coalescing newly written blocks into erase blocks to avoid write amplification, can be done either on the SSD or in the file system. For example, by using a log-structured file system, or some other copy-on-write file system, instead of a rewrite-in-place style file system, you can essentially solve the write amplification problem. In some cases, it’s necessary to let additional information leak across the abstraction; for example, the ATA TRIM command is a way for the file system to let the disk know that certain blocks are no longer in use. If too much information needs to be pushed across the abstraction, one way or another, then maybe we need to rethink whether the abstraction barrier is in the right place.
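To make the write amplification point concrete, here is a toy model in Python. It is only a sketch: the 64-page erase block is an assumed geometry, the function names are made up for illustration, and the log-structured case deliberately ignores the cost of cleaning old log segments, which a real file system or SSD would have to pay.

```python
# Toy model, not a description of any real SSD: the geometry below is assumed.
PAGES_PER_ERASE_BLOCK = 64

def in_place_pages_programmed(page_numbers):
    """Rewrite-in-place on raw flash: every erase block that contains an
    updated page is erased and reprogrammed in full."""
    touched_blocks = {p // PAGES_PER_ERASE_BLOCK for p in page_numbers}
    return len(touched_blocks) * PAGES_PER_ERASE_BLOCK

def log_structured_pages_programmed(page_numbers):
    """Log-structured: updated pages are appended one after another to the
    log, so only the new pages are programmed (cleaning cost is ignored)."""
    return len(page_numbers)

def write_amplification(pages_programmed, page_numbers):
    """Physical pages programmed per logical page written."""
    return pages_programmed / len(page_numbers)

scattered = [0, 64, 128, 192]  # four updates, each in a different erase block
print(write_amplification(in_place_pages_programmed(scattered), scattered))        # 64.0
print(write_amplification(log_structured_pages_programmed(scattered), scattered))  # 1.0
```

In this model, scattered small updates are the worst case for rewrite-in-place, while the log-structured layout programs only the pages that were actually written.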

In addition, if the abstraction has been around for a long time, changing it also has costs, which have to be taken into account. The 512-byte sector LBA abstraction has been around for a long time, and therefore dislodging it is difficult and costly. For example, the argument that all of these details should be hidden in hardware, because the underlying hardware details change between generations of SSD, was also used to justify something that has been a complete commercial failure for years, if not decades: Object Based Disks.

One of the arguments for OBD’s was that the hard drive has the best knowledge of how and where to store a contiguous stream of bytes, and so perhaps filesystems should not be trying to decide where on disk an inode should be stored, but should instead tell the hard drive, “I have this object, which is 134 kilobytes long; please store it somewhere on the disk”. At least in theory, the HDD or SSD could handle all of the details of knowing the best place to store the object on the spinning magnetic media or flash media. In the case of an SSD, it could take into account how worn the flash is and automatically move the object around; in the case of an HDD, the drive could know about (real) cylinder and track boundaries and store the object in the most efficient way possible, since the drive has intimate knowledge about the low-level details of how data is stored on the disk.
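The difference between the two abstractions can be sketched as two tiny in-memory classes. This is purely illustrative Python: `BlockDevice` and `ObjectDevice` and their methods are hypothetical names, not the interface of any real OBD standard or driver.

```python
class BlockDevice:
    """LBA-style interface: the file system chooses the sector numbers."""
    SECTOR_SIZE = 512

    def __init__(self, num_sectors):
        self._sectors = [bytes(self.SECTOR_SIZE)] * num_sectors

    def write_sector(self, lba, data):
        if len(data) != self.SECTOR_SIZE:
            raise ValueError("data must be exactly one sector")
        self._sectors[lba] = bytes(data)

    def read_sector(self, lba):
        return self._sectors[lba]


class ObjectDevice:
    """Object-style interface: the caller hands over bytes and gets back an
    opaque ID; where the bytes live is entirely the device's business."""

    def __init__(self):
        self._objects = {}
        self._next_id = 1

    def create(self, data):
        object_id = self._next_id
        self._next_id += 1
        self._objects[object_id] = bytes(data)
        return object_id

    def read(self, object_id):
        return self._objects[object_id]

    def delete(self, object_id):
        # The device learns immediately that this space is free again.
        del self._objects[object_id]


dev = ObjectDevice()
oid = dev.create(b"x" * (134 * 1024))  # "I have this object, 134 kilobytes long"
print(len(dev.read(oid)))              # 137216
```

Note that in the object-style sketch a delete goes through the device, so the device always knows which space is free; with the block interface that knowledge stays inside the file system unless something like TRIM passes it along.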

This theory makes a huge amount of sense; but there’s only one problem. Object Based Disks have been proposed in academia, and by the advanced R&D shops of companies like Seagate, for over a decade, with absolutely nothing to show for it. Why? Two reasons have been proposed. One is that OBD vendors were too greedy, and tried to charge too much money for OBD’s. Another explanation is that the interface abstraction for OBD’s was too different, and so there weren’t enough applications, file systems, or OS’s that could take advantage of OBD’s.

Both explanations undoubtedly contributed to the commercial failure of OBD’s, but the question is which is the bigger reason. This matters here because, at least as far as Intel’s SSD strategy is concerned, the advantage is that (modulo implementation shortcomings such as the reported internal LBA remapping table fragmentation problem and the lack of ATA TRIM support) filesystems don’t need to change (much) in order to take advantage of the Intel SSD and get at least decent performance.

However, if the price delta was the stronger reason for the failure of OBD’s, then the X25-M may be in trouble. Currently the 80GB Intel X25-M has a street price of $400, so it costs roughly $5 per gigabyte. “Dumb” MLC SATA SSD’s are available for roughly half the cost per gigabyte (64 GB for $164). So what does the market look like 12-18 months from now? If “dumb” SSD’s are still available at 50% of the cost of “smart” SSD’s, it would probably be worth it to make a copy-on-write style filesystem that attempts to do the wear leveling and write amplification reduction in software. Sure, it’s probably more efficient to do it in hardware, but a 2x price differential might cause people to settle for a cheaper solution even if it isn’t the absolute best technical choice. On the other hand, if prices drop significantly, and/or “dumb” SSD’s completely disappear from the market, then time spent now optimizing for “dumb” SSD’s will be completely wasted.
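For anyone who wants to check the arithmetic, here it is spelled out in Python using the two street prices quoted above (the function name is just for illustration):

```python
def cost_per_gb(price_dollars, capacity_gb):
    """Street price divided by capacity, in dollars per gigabyte."""
    return price_dollars / capacity_gb

smart = cost_per_gb(400, 80)  # Intel X25-M
dumb = cost_per_gb(164, 64)   # "dumb" MLC SATA SSD

print(smart)                    # 5.0
print(dumb)                     # 2.5625
print(round(smart / dumb, 2))   # 1.95
```

So “roughly half” holds up: the X25-M costs a little under twice as much per gigabyte.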

So for Linus to make the proclamation that it’s completely stupid to optimize for “dumb” SSD’s seems to be a bit premature.   Market externalities — for example, does Intel have patents that will prevent competing “smart” SSD’s from entering the market and thus forcing price drops? — could radically change the picture.  It’s not just a pure technological choice, which is what makes projections and prognostications difficult.

As another example, I don’t know whether or not Intel will issue a firmware update that adds ATA TRIM support to the X25-M, or how long it will take before such SSD’s become available. Until ATA TRIM support becomes available, it will be advantageous to add support in ext4 for a block allocator option that aggressively reuses blocks above all else, and avoids using blocks that have never been allocated or used before, even if it causes more in-file system fragmentation and deeper extent allocation trees. The reason is that today, once a block has been used by the file system, the X25-M has absolutely no idea whether we still care about the contents of that block, or whether the block has since been released because the file was deleted. However, if 20% of the SSD’s blocks have never been used, the X25-M can use that 20% of the flash for better garbage collection and defragmentation algorithms. And if Intel never releases a firmware update to add ATA TRIM support, then I will be out $400 of my own money for an SSD that lacks this capability, and so adding a block allocator which works around limitations of the X25-M probably makes sense. If it turns out that it takes two years before disks that have ATA TRIM support show up, then it will definitely make sense to add such an optimization. (Hard drive vendors have historically been S-L-O-W to finish standardizing new features and then letting such features enter the marketplace, so I’m not necessarily holding my breath; after all, the Linux block device layer and file systems have been ready to send ATA TRIM commands for about six months. What’s taking the ATA committees and SSD vendors so long?)
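The allocation policy I have in mind can be sketched in a few lines of Python. To be clear, this is a toy and not ext4 code: the class and method names are hypothetical, and it tracks single blocks rather than extents or block groups.

```python
import heapq

class ReuseFirstAllocator:
    """Toy allocator: always prefer a block that has been written before,
    and only dip into never-used blocks when nothing else is free."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.next_fresh = 0   # blocks numbered >= next_fresh were never used
        self.freed = []       # min-heap of freed, previously used blocks
        self.in_use = set()

    def allocate(self):
        if self.freed:
            block = heapq.heappop(self.freed)   # reuse above all else
        elif self.next_fresh < self.total_blocks:
            block = self.next_fresh             # last resort: a fresh block
            self.next_fresh += 1
        else:
            raise RuntimeError("no free blocks")
        self.in_use.add(block)
        return block

    def free(self, block):
        if block not in self.in_use:
            raise ValueError("block is not allocated")
        self.in_use.remove(block)
        heapq.heappush(self.freed, block)

    def never_used_fraction(self):
        """Share of the device the SSD has never seen a write to."""
        return (self.total_blocks - self.next_fresh) / self.total_blocks


alloc = ReuseFirstAllocator(10)
blocks = [alloc.allocate() for _ in range(4)]   # [0, 1, 2, 3]
alloc.free(1)
alloc.free(2)
print(alloc.allocate(), alloc.allocate(), alloc.allocate())  # 1 2 4
print(alloc.never_used_fraction())                           # 0.5
```

The point of the policy is the last number: as long as freed blocks are recycled first, the set of blocks the SSD has ever seen written grows as slowly as possible, at the price of more fragmentation inside the file system.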

On the other hand, if Intel releases ATA TRIM support next month, then it might not be worth my effort to add such a mount option to ext4. Or maybe Sandisk will soon make available an ATA TRIM capable SSD which is otherwise competitive with Intel’s, and I get a free sample, but it turns out another optimization on Sandisk SSD’s will give me an extra 10% performance gain under some workloads. Is it worth it in that case? Hard to tell, unless I know whether such a tweak addresses an optimization problem which is fundamental, and whether or not such a tweak will be unnecessary, or perhaps actively unhelpful, in the next generation. As long as SSD manufacturers force us to treat these devices as black boxes, there may be a certain amount of cargo cult science forced upon us file system designers; or I guess I should say, in order to be more academically respectable, “we will be forced to rely more on empirical measurements leading to educated engineering estimations about what the SSD is doing inside the black box”. Heh.