    Categories: Filesystems, Linux, SSD

Should Filesystems Be Optimized for SSDs?

In one of the comments to my last blog entry, an anonymous commenter writes:

You seem to be taking a different perspective to Linus on the “adapting to the disk technology” front (Linus seems to be against having the OS know about disk boundaries and having to do levelling itself)

That’s an interesting question, and I figure it’s worth its own top-level entry, as opposed to a reply in the comment stream.  One of the interesting design questions in any OS or computer architecture is where the abstraction boundaries should be drawn, and to which side of an abstraction boundary various operations should be pushed.  Linus’s argument is that a flash controller can do a better job of wear leveling, including detecting how “worn” a particular flash cell might be (for example, perhaps by looking at the charge levels at an analog level and knowing when the cell was last programmed), and so it doesn’t make sense to try to do wear leveling in a flash file system.  Some responsibilities of flash management, such as coalescing newly written blocks into erase blocks to avoid write amplification, can be done either on the SSD or in the file system — for example, by using a log-structured file system, or some other copy-on-write file system, instead of a rewrite-in-place style file system, you can essentially solve the write amplification problem.  In some cases, it’s necessary to let additional information leak across the abstraction — for example, the ATA TRIM command is a way for the file system to let the disk know that certain blocks no longer need to be preserved.  If too much information needs to be pushed across the abstraction, one way or another, then maybe we need to rethink whether the abstraction barrier is in the right place.
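
To make the write-amplification point concrete, here is a rough back-of-the-envelope sketch. The 128 KB erase-block and 4 KB page sizes are illustrative assumptions, not the geometry of any particular SSD: rewriting one page in place on a naive flash device forces the whole erase block to be read, erased, and reprogrammed, while a log-structured or copy-on-write layout just appends the new page and reclaims stale copies later.

```python
# Illustrative flash geometry only; real SSDs vary.
ERASE_BLOCK = 128 * 1024   # assumed erase-block size in bytes
PAGE = 4 * 1024            # assumed flash page size in bytes

def rewrite_in_place_cost(pages_updated):
    # A naive device must read-modify-write a whole erase block per page update.
    return pages_updated * ERASE_BLOCK

def log_structured_cost(pages_updated):
    # A copy-on-write layout appends each updated page; old copies are
    # garbage-collected later, so only the new data is written up front.
    return pages_updated * PAGE

updates = 100
naive = rewrite_in_place_cost(updates)
cow = log_structured_cost(updates)
print(naive // cow)   # write amplification factor: 32
```

In other words, with these (made-up) numbers the in-place scheme writes 32 bytes of flash for every byte the filesystem changed, which is exactly the overhead a copy-on-write layout sidesteps.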

In addition, if the abstraction has been around for a long time, changing it also has costs, which have to be taken into account.  The 512-byte-sector LBA abstraction has been around a long time, and therefore dislodging it is difficult and costly.  For example, the same argument which says that because the underlying hardware details change between different generations of SSD, all of these details should be hidden in hardware, was also used to justify something that has been a complete commercial failure for years if not decades: Object Based Disks.

One of the arguments for OBDs was that the hard drive has the best knowledge of how and where to store a contiguous stream of bytes, and so perhaps filesystems should not be trying to decide where on disk an inode should be stored, but should instead tell the hard drive, “I have this object, which is 134 kilobytes long; please store it somewhere on the disk.”  At least in theory, the HDD or SSD could handle all of the details of knowing the best place to store the object on the spinning magnetic media or flash media: in the case of an SSD, taking into account how worn the flash is and automatically moving the object around; and in the case of an HDD, knowing about (real) cylinder and track boundaries and storing the object in the most efficient way possible, since the drive has intimate knowledge of the low-level details of how data is stored on the disk.

This theory makes a huge amount of sense; there’s only one problem.  Object Based Disks have been proposed in academia and in the advanced R&D shops of companies like Seagate for over a decade, with absolutely nothing to show for it.  Why?  Two explanations have been proposed.  One is that OBD vendors were too greedy, and tried to charge too much money for OBDs.  The other is that the interface abstraction for OBDs was too different, and so there wasn’t enough software, or enough file systems or OSes, that could take advantage of OBDs.

Both explanations undoubtedly contributed to the commercial failure of OBDs, but the question is which is the bigger reason.  It matters here because, at least as far as Intel’s SSD strategy is concerned, its advantage is that (modulo implementation shortcomings such as the reported internal LBA remapping table fragmentation problem and the lack of ATA TRIM support) filesystems don’t need to change (much) in order to take advantage of the Intel SSD and get at least decent performance.

However, if the price delta is the stronger reason for the failure, then the X25-M may be in trouble.  Currently the 80GB Intel X25-M has a street price of $400, so it costs roughly $5 per gigabyte.  “Dumb” MLC SATA SSDs are available for roughly half the cost per gigabyte (64 GB for $164).  So what does the market look like 12-18 months from now?  If “dumb” SSDs are still available at 50% of the cost of “smart” SSDs, it would probably be worth it to make a copy-on-write style filesystem that attempts to do the wear leveling and write amplification reduction in software.  Sure, it’s probably more efficient to do it in hardware, but a 2x price differential might cause people to settle for a cheaper solution even if it isn’t the absolutely best technical choice.  On the other hand, if prices drop significantly, and/or “dumb” SSDs completely disappear from the market, then time spent now optimizing for “dumb” SSDs will be completely wasted.
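
Working out the arithmetic behind those street prices (the 2009 prices quoted above, used purely for illustration):

```python
# Street prices quoted in the post; illustrative arithmetic only.
intel_price, intel_gb = 400.0, 80   # Intel X25-M, the "smart" SSD
dumb_price, dumb_gb = 164.0, 64     # generic "dumb" MLC SATA SSD

intel_per_gb = intel_price / intel_gb        # 5.00 $/GB
dumb_per_gb = dumb_price / dumb_gb           # ~2.56 $/GB
print(round(dumb_per_gb / intel_per_gb, 2))  # ~0.51, i.e. roughly half the cost
```

So the “dumb” drive really does come in at about 51% of the smart drive’s cost per gigabyte, which is the 2x differential the argument above turns on.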

So for Linus to make the proclamation that it’s completely stupid to optimize for “dumb” SSD’s seems to be a bit premature.   Market externalities — for example, does Intel have patents that will prevent competing “smart” SSD’s from entering the market and thus forcing price drops? — could radically change the picture.  It’s not just a pure technological choice, which is what makes projections and prognostications difficult.

As another example, I don’t know whether or not Intel will issue a firmware update that adds ATA TRIM support to the X25-M, or how long it will take before such SSDs become available.  Until ATA TRIM support becomes available, it will be advantageous to add support in ext4 for a block allocator option that aggressively reuses blocks above all else, and avoids using blocks that have never been allocated or used before, even if it causes more in-file-system fragmentation and deeper extent allocation trees.  The reason for this is that today, once a block has been used by the file system, the X25-M has absolutely no idea whether we still care about the contents of that block, or whether the block has since been released because the file was deleted.  However, if 20% of the SSD’s blocks have never been used, the X25-M can use 20% of the flash for better garbage collection and defragmentation algorithms.  And if Intel never releases a firmware update to add ATA TRIM support, then I will be out $400 of my own pocket for an SSD that lacks this capability, and so adding a block allocator which works around the limitations of the X25-M probably makes sense.  If it turns out that it takes two years before disks that have ATA TRIM support show up, then it will definitely make sense to add such an optimization.  (Hard drive vendors have been historically S-L-O-W to finish standardizing new features and then let such features enter the marketplace, so I’m not necessarily holding my breath; after all, the Linux block device layer and file systems have been ready to send ATA TRIM commands for about six months; what’s taking the ATA committees and SSD vendors so long? <grin>)
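
The allocation policy described above can be sketched in a few lines.  This is a toy model, not ext4’s actual allocator (the class name and structure are made up for illustration): the idea is simply to recycle blocks the filesystem has already dirtied before ever touching a never-written LBA, so the drive retains a large pool of provably-unused flash for its internal garbage collection.

```python
class FreedFirstAllocator:
    """Toy allocator: recycle freed blocks before touching virgin LBAs.

    Hypothetical sketch of the policy discussed in the post; ext4's
    real (mballoc) allocator is far more sophisticated.
    """
    def __init__(self, total_blocks):
        self.freed = []                          # used once, then released
        self.virgin = list(range(total_blocks))  # never written, in LBA order

    def alloc(self):
        # Aggressively reuse freed blocks so never-used LBAs stay untouched,
        # letting a TRIM-less SSD treat them as spare area for GC.
        if self.freed:
            return self.freed.pop()
        return self.virgin.pop(0)

    def free(self, block):
        self.freed.append(block)

a = FreedFirstAllocator(total_blocks=8)
b0 = a.alloc()   # 0: nothing freed yet, so take the first virgin LBA
a.free(b0)
b1 = a.alloc()   # 0 again: the freed block is recycled instead of LBA 1
```

The trade-off mentioned in the text shows up directly: because the allocator keeps handing back the same small set of LBAs, files end up more fragmented on the logical side, in exchange for leaving the untouched 20% of the device free for the drive’s own defragmentation.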

On the other hand, if Intel releases ATA TRIM support next month, then it might not be worth my effort to add such a mount option to ext4.  Or maybe SanDisk will make an ATA TRIM capable SSD available soon, one which is otherwise competitive with Intel, and I get a free sample, but it turns out another optimization on SanDisk SSDs will give me an extra 10% performance gain under some workloads.  Is it worth it in that case?  Hard to tell, unless I know whether such a tweak addresses an optimization problem which is fundamental, and whether such a tweak will be unnecessary, or perhaps actively unhelpful, in the next generation.  As long as SSD manufacturers force us to treat these devices as black boxes, a certain amount of cargo cult science may be forced upon us file system designers — or I guess I should say, in order to be more academically respectable, “we will be forced to rely more on empirical measurements leading to educated engineering estimations about what the SSD is doing inside the black box.”  Heh.

tytso:

Comments (24)

  • "On the other hand, if Intel releases ATA TRIM support next month, then it might not be worth my effort to add such a mount option to ext4."
    does this mean that TRIM is working NOW with ext4, or is a mount option required?

    SSDs with the Indilinx Barefoot controller (OCZ Vertex, Supertalent Ultradrive, both available) should have TRIM support.
    An OCZ moderator says that the next firmware (next week or month) has TRIM support, but it is not working with Linux.
    http://www.ocztechnologyforum.com/forum/showpost.php?p=351938&postcount=35
    A German Supertalent distributor says that the Ultradrives with current firmware have TRIM support.

    Is there an easy way to check if TRIM is supported by the SSD (e.g. hdparm, dmesg) and if the kernel and filesystem are using it?

  • Hi!

    While waiting for the TRIM option, would it be of any use to fill the deleted sectors with 0xFF? Would the SSD consider it as erased or not?

    Easy to try; when the disk is getting slower, just fill all free space with a file full of 0xFF. Maybe it works.

  • I'd like to know how the TRIM command will be passed through RAID controllers, such as the Intel Matrix many people use.

    I have 2 160GB Gen2 Intel X25-M's and I've been trying to figure out the best way to stripe and subsequently format them for a dual boot Windows 7 / Linux Mint system.

    People seem to be testing using low level tools and saying 128K stripe sizes are best because that is the Intel native size, but it's been hard finding guidance on what block size to use for the filesystem itself.

    My first reaction would be to say a 256K block size would split the data evenly, but that seems terribly wasteful on devices with such limited capacity.

    Does anyone have any ideas, for regular desktop usage, games, and dual booting from an Intel Matrix setup, what a suggested stripe/filesystem size would be? One possibly for maximum speed, and another for a reasonable speed/space compromise?

  • Intel just added TRIM to the latest firmware update for X25-M drives.

    The update won't work on the earliest drives :-(

    For my planned use (backup spooling), kernel TRIM support would be a huge win, as the files I'm working with are approx 10 GB apiece.

  • > And if Intel never releases a firmware update to add ATA TRIM support, then
    > I will be out $400 out of my own pocket for an SSD that lacks this capability,
    > and so adding a block allocator which works around limitations of the X25-M
    > probably makes sense. If it turns out that it takes two years before disks
    > that have ATA TRIM support show up, then it will definitely make sense to
    > add such an optimization. (Hard drive vendors have been historically
    > S-L-O-W to finish standardizing new features and then letting such features
    > enter the market place, so I’m not necessarily holding my breath; after all,
    > the Linux block device layer and file systems have been ready to send
    > ATA TRIM support for about six months; what’s taking the ATA committees
    > and SSD vendors so long?

    Is this indeed working with TRIM-enabled SSDs today? I read somewhere in the OCZ forums, in a post by the hdparm author, that issues showed up when running on real hardware, causing it to be disabled for now.

    > On the other hand, if Intel releases ATA TRIM support next month, then it
    > might not be worth my effort to add such a mount option to ext4.

    So, Ted, will we see such optimization now that Intel has officially left its early adopters out in the cold?

    Also, how much of this as well as your previous posts on the subject also applies to USB thumb drives? Any (pun not intended) thumb rules to follow?

    But hey, Kingston is now shipping 40 GB SSDs with second-generation Intel internals for around $100 or so. They don't ship with TRIM support, but a new firmware is in the works. Getting mine on Monday :-)

  • I personally expect SSDs to become dumb in the future, as soon as they stop being pricey products made in USA and start being mass products made in China. The manufacturers will start putting the necessary "intelligence" into a windows driver (for almost no cost per unit) instead of into costly on-device silicon. At least, this is what happened to other devices: WLAN controllers (cf. early Prism chips with the ones used today), Laser printers (GDI), you name it.

    One question keeps spinning in my head: why SCSI for semiconductor memory at all? These memories should (ideally) come closer to DRAM, not disk.
    Look at all the memories in the market which are close to manufacturing (PRAM, FeRAM, ...); where are we heading?
    On the question of whether the FS should change: why only the filesystem, and shouldn't the database change too?

  • FYI gparted can align partitions on megabyte boundaries (in the GUI version at least).

    In general, partitioning tools suck at partition alignment, though.
