SSD’s, Journaling, and noatime/relatime

On occasion, you will see the advice that the ext3 file system is not suitable for Solid State Disks (SSD’s) due to the extra writes caused by journaling — and so Linux users using SSD’s should use ext2 instead. However, is this folk wisdom actually true? This weekend, I decided to measure exactly what the write overhead of journaling is in actual practice.

For this experiment I used ext4, since I recently added a feature to track the amount of data written to the file system over its lifetime (to better gauge the wear and tear on an SSD). Ext4 also has the advantage that, starting in 2.6.29, it supports operation both with and without a journal, allowing me to do a controlled experiment where I could manipulate only that one variable. The test workload I chose was a simple one:

  • Clone a git repository containing a linux source tree
  • Compile the linux source tree using make -j2
  • Remove the object files by running make clean

For the first test, I ran the workload using no special mount options, with the only difference being the presence or absence of the has_journal feature. (That is, the first file system was created using mke2fs -t ext4 /dev/closure/testext4, while the second file system was created using mke2fs -t ext4 -O ^has_journal /dev/closure/testext4.)
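
For anyone who wants to reproduce this kind of measurement, here is a rough sketch of the procedure, not the exact script I used; /mnt/test and the path to the git repository are placeholders, and it assumes a kernel and e2fsprogs recent enough that dumpe2fs reports the lifetime write counter.

  # create the filesystem (add -O ^has_journal for the no-journal case)
  mke2fs -t ext4 /dev/closure/testext4
  mount /dev/closure/testext4 /mnt/test

  # run one step of the workload, for example:
  (cd /mnt/test && git clone /path/to/linux-2.6.git)

  # unmount so the superblock is written out, then read the write counter
  umount /mnt/test
  dumpe2fs -h /dev/closure/testext4 | grep -i 'lifetime writes'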

Amount of data written (in megabytes) on an ext4 filesystem

  Operation     with journal   w/o journal   percent change
  git clone        367.7          353.0           4.00%
  make             231.1          203.4           12.0%
  make clean        14.6            7.7           47.3%

What the results show is that metadata-heavy workloads, such as make clean, do result in almost twice the amount of data written to disk. This is to be expected, since all changes to metadata blocks are first written to the journal and the journal transaction committed before the metadata is written to its final location on disk. However, for more common workloads where we are writing data as well as modifying filesystem metadata blocks, the difference is much smaller: 4% for the git clone, and 12% for the actual kernel compile.

The noatime mount option

Can we do better? Yes, if we mount the file system using the noatime mount option:

Amount of data written (in megabytes) on an ext4 filesystem mounted with noatime

  Operation     with journal   w/o journal   percent change
  git clone        367.0          353.0           3.81%
  make             207.6          199.4           3.95%
  make clean         6.45           3.73          42.17%

This reduces the extra cost of the journal in the git clone and make steps to just under 4%. What this shows is that most of the extra metadata cost without the noatime mount option was caused by updates to the last access time for kernel source files and directories.
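
If you want to try this yourself, noatime is a standard mount option; a minimal example, with the device and mount point as placeholders:

  # one-off, for an already-mounted filesystem:
  mount -o remount,noatime /home

  # or persistently, via the options field in /etc/fstab:
  /dev/sda2  /home  ext4  defaults,noatime  0  2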

The relatime mount option

There is a newer alternative to the noatime mount option, relatime. The relatime mount option updates the last access time of a file only if the last modified or last inode changed time is newer than the last accessed time. This allows programs to determine whether a file has been read since it was last modified. The usual (actually, only) example that is given of such an application is the mutt mail-reader, which uses the last accessed time to determine if new mail has been delivered to Unix mail spool files. Unfortunately, relatime is not free. As you can see below, it has roughly double the overhead of noatime (but roughly half the overhead of using the standard Posix atime semantics):

Amount of data written (in megabytes) on an ext4 filesystem mounted with relatime

  Operation     with journal   w/o journal   percent change
  git clone        366.6          353.0           3.71%
  make             216.8          203.7           6.04%
  make clean        13.34           6.97          45.75%

Personally, I don’t think relatime is worth it. There are other ways of working around the issue with mutt — for example, you can use Maildir-style mailboxes, or you can use mutt’s check_mbox_size option. If the goal is to reduce unnecessary disk writes, I would mount my file systems using noatime, and use other workarounds as necessary. Alternatively, you can use chattr +A to set the noatime flag on all files and directories where you don’t want atime updates, and then clear the flag on the Unix mbox files where you do care about atime updates. Since the noatime flag is inherited by default, you can get this behaviour by running chattr +A /mntpt right after the filesystem is first created and mounted; all files and directories subsequently created in that file system will inherit the noatime flag.
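
Here is a quick sketch of that chattr approach; /mntpt and the mbox path are just placeholders:

  # right after the filesystem is created and mounted, set the inheritable
  # "no atime updates" flag on the root directory:
  chattr +A /mntpt

  # (for a filesystem that already has files on it, apply it recursively instead)
  chattr -R +A /mntpt

  # then clear the flag on the mbox file where you do want atime updates:
  chattr -A /mntpt/var/mail/username

  # lsattr shows the 'A' attribute, so you can check what is set:
  lsattr -d /mntpt /mntpt/var/mail/username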

Comparing ext3 and ext2 filesystems

Amount of data written (in megabytes) on an ext3 and ext2 filesystem

  Operation        ext3          ext2        percent change
  git clone        374.6         357.2           4.64%
  make             230.9         204.4           11.48%
  make clean        14.56          6.54          55.08%

Finally, just to round things out, I tried the same experiment using the ext3 and ext2 file systems. The difference between these results and the ones involving ext4 are the result of the fact that ext2 does not have the directory index feature (aka htree support), and both ext2 and ext3 do not have extents support, but rather use the less efficient indirect block scheme. The ext2 and ext3 allocators are also somewhat different from each other, and from ext4. Still, the results are substantially similar to the first set of Posix-compliant atime update numbers. (I didn’t bother to do noatime and relatime benchmark runs with ext2 and ext3, but I expect the results would be similar.)

Conclusion

So given all of this, where did the common folk wisdom that ext3 was not suitable for SSD’s come from? Some of it may have been from people worrying too much about extreme workloads such as “make clean”; but while doubling the write load sounds bad, going from 4MB to 7MB worth of writes isn’t that much compared to the write load of actually doing the kernel compile or populating the kernel source tree. No, the problem was that first generation SSD’s had a very bad problem with what has been called the “write amplification effect”, where a 4k write might cause a 128k region of the SSD to be erased and rewritten. In addition, in order to provide safety against system crashes, ext3 does more synchronous write operations (that is, operations where ext3 waits for the write to complete before moving on), and this caused a very pronounced and noticeable stuttering effect which was fairly annoying to users. However, the next generation of SSD’s, such as Intel’s X25-M SSD, have worked around the write amplification effect.

What else have we learned? First of all, for normal workloads that include data writes, the overhead from journaling is actually relatively small (between 4 and 12%, depending on the workload). Further, much of this overhead can be reduced by enabling the noatime mount option, with relatime providing some benefit as well. Ultimately, though, if the goal is to reduce your file system’s write load, especially where an SSD is involved, I would strongly recommend the use of noatime over relatime.

81 thoughts on “SSD’s, Journaling, and noatime/relatime”

  1. I’d thought that the reason to avoid ext3 on SSDs, at least most of the ones available today, was not the total number of writes but rather the repeated writes to the same place on the disk (that is, the journal), which might blow out primitive wear-leveling schemes and result in those blocks becoming unreliable. (I don’t know quite where I got this notion from, though, and it certainly wasn’t anywhere authoritative.)

  2. @1: I’d thought that the reason to avoid ext3 on SSDs, at least most of the ones available today, was not the total number of writes but rather the repeated writes to the same place on the disk (that is, the journal)

    Norman,

    Actually, even the most primitive SSD’s and Flash drives have to get this right, because the Windows FAT filesystem is constantly updating the same locations on disk (namely the File Allocation Table), which is in a fixed location on disk. So although there’s not a lot we can count on in terms of the quality of flash drives’ wear leveling, it’s very likely they get that right, since otherwise their reliability on basic FAT filesystems, which are used in essentially every single digital camera on the market, would be pretty bad.

  3. Is it possible for ext4 to add an allocator that keeps track of the last write location and (when possible) allocates blocks from there on for SSDs? btrfs is doing something similar, since there are benchmarks showing that SSDs have better performance on sequential writes. But btrfs is still years away. It would be useful for ext4 to add some optimizations for SSDs.

  4. Is there a way to specify that a filesystem should always be mounted with noatime, even if the option is not given on mount? I would love to be able to mark my USB stick this way, so that no matter where I plug it in, it will use noatime.

  5. @5: Is there a way to specify that a filesystem should always be mounted with noatime, even if the option is not given on mount?

    There isn’t a way to do this as a mount option, but the easiest thing to do is to set the noatime flag on the file system’s root directory when it’s freshly created, or to set the noatime flag for all files and directories, using the chattr command: “chattr -R +A /mntpt“, where you should replace /mntpt with the mount point of your thumbdrive.

    The reason why this works is because the noatime flag is inherited, so all new files and directories created in a directory that has the noatime flag set will also have the noatime flag set. And if all of the files and directories in use on the file system have the noatime flag set, it’s functionally equivalent to mounting the filesystem with the noatime mount option.
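
    A quick way to check that the inheritance is working on your stick; /media/usb is a placeholder for wherever it gets mounted:

      chattr +A /media/usb          # set the flag on the root directory of the stick
      touch /media/usb/newfile      # anything created afterwards...
      lsattr /media/usb/newfile     # ...should show the 'A' (no atime updates) attribute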

  6. Thanks for the “chattr” tip. I used it on my directories and it printed “Operation not supported while reading flags on …” for every symlink. Maybe that output should only be printed in verbose mode?

  7. There is a big big mistake made about SSDs in this article:

    all flash memory (that includes SSDs) is limited in how many times you can access the memory for writing.

    An access does not have a direct relation with the unit measured here: the megabyte.

    the relation between those two units is complex and is not linear.

    sometimes 10 MB can be 1 access and other times it can be 100 accesses …….

    The only thing to compare well is to measure the hits that the system made for writing ……. not how many megabytes it wrote.

    In our case, the relationship takes these factors into consideration:
    – how the partition is made and how it is working
    – how the system manages the partition

    basically ext2 is better suited than ext3 ……… because ext3 does extra writes for the journal (it is not important how many megabytes it wrote, it has made a minimum of one write cycle)

    ext4 might be a good exit because I’ve read somewhere that this filesystem will have a new cache option especially designed for SSDs ……… basically that will do fewer write cycles because it will access the flash only once when its cache is nearly full.

  8. I’ve been reading this blog for a while hoping to gain some more insight into how to deal with SSDs in general, since many Arch Linux users have been looking to increase their everyday performance over using HDDs. I found one thread some time ago that suggested using reiserfs due to its fragmentation properties, and another that suggested adding elevator=noop to the kernel boot parameters in the grub boot menu. Your thoughts have given me much to think about.

  9. @9: Judicator,

    Actually, if you kept on reading all the way to the conclusion, you would have noted that I talked about the write amplification effect, and how with newer SSD’s, such as the Intel X25-M, this is much less of a factor — it has a write amplification factor averaging around 1.1, with a wear leveling overhead of 1.4, compared to older SSD’s that had a write amplification effect of 20 or more.

    I believe you are also incorrect when you say that “The only thing to compare well is to measure the hits that the system made for writing”. In fact, ext3 tends to pin writes until the transaction commit timer goes off, at which point the data blocks get flushed out, and then the journal blocks, and finally the metadata blocks. The real issue is that older SSD’s did their wear leveling in 128k erase block chunks, and so if you had writes which were scattered across the disk, a single 4k update in an erase block region caused the entire 128k erase block to be rewritten. The X25-M keeps track of disk block sectors at a smaller granularity than the 128k erase block segments, so the fact that writes are scattered across the disk doesn’t cause a massive write amplification effect.

  10. @10: Soul_Est,

    I’m not that convinced that elevator=noop is the best idea for SSD’s, since combining writes is critical for SSD’s, and I’m not sure the noop elevator will be sufficiently aggressive at combining write requests. I have a feeling the deadline scheduler may be a better choice, but I haven’t had a chance to benchmark it yet.
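
    For anyone who wants to experiment, the I/O scheduler can be switched per device at runtime, without rebooting; sda here is a placeholder for your SSD’s device name:

      cat /sys/block/sda/queue/scheduler         # shows e.g.: noop deadline [cfq]
      echo deadline > /sys/block/sda/queue/scheduler

    (The elevator= boot parameter sets the default for all block devices instead.)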

  11. @2:

    My experience is rather different. On “generation 0” SSDs the anecdotal comments say the wear levelling is group cyclic and worth very little. The SSD in my own EeePC has now developed bad parts (and that was using an ext2 filesystem mounted with noatime). Other blogs have commented on this too: (Val Aurora has something here: http://valhenson.livejournal.com/25228.html?thread=108940 and davej warns against putting swap on the EeePC “gen 0” SSDs here http://kernelslacker.livejournal.com/132087.html (it’s a pity the comments have gone on davej’s blog – they were very good))

    Now as it happens on my EeePC I used to use ext3 but have switched to ext2 (remember – gen 0 SSD). The biggest difference was in the latency of writes – this SSD has very slow writes. Booting has become a few seconds faster. With ext3, using firefox 3 (which is fsync happy) causes HUGE and very painful delays. With ext2 this is noticeably less (but the stalls are still there, just not as long). fsck goes quite quickly with ext2, but for some reason even when everything is OK the periodic startup fsck will force a reboot (which is painful). It’s also interesting to see that Ubuntu will never do a startup fsck on battery even if the FS was not properly unmounted…

    I’ve also tried a few different io schedulers. The EeePC Xandros distro ships with a command line option to use deadline. I have also used cfq and noop (noop didn’t seem noticeably better than deadline). My hope is that cfq is worth it due to being able to have IO priorities. I have even twiddled the rotational flag that has appeared in 2.6.29 but sadly I don’t have benchmarks to be able to tell if it made a difference (I’m too worried that benchmarking this machine is going to decay the SSD further).

    Incidentally I used to use ext3 on an SD card in the EeePC. Now THAT was interesting… After a period of time (seemingly not more than a few weeks) I would be pretty much assured the filesystem would develop self-destroying corruption (where an fsck would go and delete everything in sight). Since using just the FAT32 partition on the SD card I have not seen any further corruption (although since I stopped booting from it I have taken to write-protecting the card whenever possible).

  12. The really big problem is that SSD manufacturers don’t feel obligated to tell you what sort of wear leveling algorithm they are using. There is a huge difference between those drives that do:

    • No wear levelling at all — rare, since these drives die very quickly
    • Those that do dynamic (or what you call “group cyclic”) wear levelling
    • Those that do static wear levelling (this is where blocks that contain data that doesn’t move are also periodically moved around so writes can be distributed to those flash cells)
    • Those that do sub-erase block allocation and wear levelling to reduce write amplification effects

    The Intel SSD is the only one on the market which does the last, although rumor has it that a competitor will be showing up on the market within the next 30 days that will have similar capabilities. I can tell you that with the Intel X25-M SSD, which I now have installed as the primary disk in my laptop, I don’t see any stuttering and performance has been very agreeably fast. Also note that the fsync() issue in Firefox 3 was fixed by FF 3.0.1 (it may have been fixed in FF 3.0 final; I’m not 100% sure). So if you were seeing the problem, my guess is that your distro picked a pre-release FF 3 and didn’t bother to upgrade to a newer Firefox.

    Finally, if you’re seeing filesystem corruption which required fsck to fix stuff that then required a reboot, my guess is that there is something really bad going on. You mentioned an SD card, and there may have been some issues with the SD card getting jostled or the contacts not being secure that caused the data corruption. Even the crappy SD cards had wear levelling logic that noticed when a cell started going bad, and would stop using that flash cell. So that may have been more of a mechanical issue causing data loss, not a fundamental flash card problem — but that being said, there are many laptops that have SD card slots where I would not use them for regular data use, but only for pulling data off a card used by a digital camera — since that’s what they were probably primarily designed for.

    I’ve had other people complain about certain notebooks where the SD card stuck out slightly, and when it was jostled, it would get disconnected and the filesystem would get horribly corrupted since (a) they weren’t using ext3, and (b) the filesystem was being written at the time when the SD card was nudged. About the only thing I can tell them is the response to the old joke, “Doctor, doctor, it hurts when I do that….”

  13. @14:
    Re wear levelling:
    So true. If only they would say! I’d be willing to pay a little more to have something that isn’t going to go bad in less than a year. However that kind of goes against your “even the most primitive SSD” statement that you made earlier. Surely you can’t get more primordial than a gen 0 SSD? : ) Now you’ve said it I’m wondering if the stock EeePC SSDs really have no wear levelling at all *shudder*. No, that’s too painful to even think about so I’m going to stop that thought there…

    I wish I could afford an Intel SSD but I can’t and they don’t fit in EeePCs anyway. You speak of a utopia I cannot reach…

    Alas the fsync issue was NOT fixed in Firefox 3.0.1. It was lessened slightly but as soon as sqlite starts writing after you’ve got a few links in your history you will really feel it (I can only suggest using an EeePC with an existing firefox 3 profile and you will see just how bad the SSD write speed is). Just for the record I have Firefox 3.0.6 on this machine and this is with the google bad site thing turned off. If you know where to look you will find that this bug is alive and well – https://bugzilla.mozilla.org/show_bug.cgi?id=442967 .

    As for your last point – hehe! Well half of my SD Card filesystem was always ext3 the other half vfat and the vfat was (seemingly) always OK (but it was never the root fs). Do both a and b have to be present for the corruption to manifest or is b alone enough?

  14. @12 (tytso)
    Thanks for replying. I had no idea whether what I had read was best for SSD performance since, unfortunately, I don’t own a good SSD (or notebook for that matter). I’ll relay what you posted to those in the Arch Linux forum as I believe many Archers need to know this. Thanks again.

  15. Isn’t relatime supposed to help mostly on read heavy workloads? Can you try what the results for e.g. “grep -r Theodore” inside of the kernel tree are, especially when running with cold and hot cache (e.g. running make twice, touching maybe .config inbetween, or with no changes at all)?

  16. @17: Isn’t relatime supposed to help mostly on read heavy workloads?

    Stefan,

    Yes, but noatime helps even more. On a laptop using mutt to read a local mailbox stored in Maildir format, each time we read a new mail message we will wake up the disk if relatime is used. With noatime, the disk won’t get spun up for each new message, which is a huge win. For other workloads the differences will be less; on average, noatime helps about twice as much as relatime.

  17. Rumour has it that folk wisdom is hard to change. I think the majority of SSDs out there have been shipped with netbooks, and only a few of them contain a usable SSD, therefore they need a lot of tweaking. I’ve been through a couple of installs in the last two weeks to try and get the most out of my system (AA1) and yes, IMHO ext2 feels the fastest out of the box, with no options set, but xfs is close to that. Ext3 and reiserfs slowed it down, before using mount options. If we set aside the wear-out issue (I don’t care about it, it will take years before that SSD gets unusable and then it can be replaced at a fair price) I think that most hints and votes for ext2 come out of fear of supporting a technology that was not ready for the mass market. Wish I could get my hands on some of the new “real” SSDs, just like your Intel, for comparisons, but until they show up in affordable laptops for the common user I think this wisdom is here to stay…
    However, it is only natural for data storage in general to become less safe each time you turn on your box; if it happens to users using new technology I can only recommend backing up every day. I had no issues with data loss on SD cards with any file system; on some occasions they needed to be re-formatted to gain back their capacity. My conclusion: it doesn’t matter which fs you use, as long as you use your head as well.
    Andy

  18. very nice technical article

    I run opensuse 11.0 i386 on Corsair Flash Voyager 8G and Corsair Flash Voyager GT 16G USB sticks. Boot up time is very slow with XFS.

    I only use noatime, nodiratime and tweak the kernel delayed write to 120s.

    Opensuse will be reinstalled and new optimisations will be tried:

    1. /tmp and /var/tmp run on tmpfs.
    2. JFS with mount option – nointegrity (disable journaling)
    OR 3. JFS / XFS with an external logging device. (16M SD card from a canon DC) Is 16M enough for the journal log?
    4. Disable Firefox disk cache

    Have you any other suggestions to improve performance for a USB stick running linux?

  19. Recently I put quite some effort to try and make an acer aspire one 110L work as I liked it to work.
    in a few words, now it is running ubuntu 9.04 (daily) on an ext2 fs and I also added most of the tweaks I found in the arch wiki guide, fedora guide and ubuntu guide. These include adding the noatime and nodiratime mount options and the noop elevator. I remember that these changes made it noticeably faster (boot time, starting up programs, as well as decreased stuttering, which was MUCH of a problem). Now it runs gnome as well as compiz beautifully! After reading this post I am curious to try ext4 (which I tried on a “normal” laptop and I was quite excited about) on the aspire one, possibly disabling journaling.

  20. hi everyone!
    I just finished reading this interesting article, thanks!

    just a couple of questions.
    I have an EeePC 901 and I’m very interested in trying Ext4 without journaling and with the noatime option, but I don’t know how to configure Ext4 without a journal. Is it the mount option writeback??

    and if it’s that option, isn’t it dangerous? I mean, here for example it talks about this option: http://www.mjmwired.net/kernel/Documentation/filesystems/ext4.txt#313
    it says:
    “Data Mode
    =========
    There are 3 different data modes:

    * writeback mode
    In data=writeback mode, ext4 does not journal data at all. This mode provides a similar level of journaling as that of XFS, JFS, and ReiserFS in its default mode – metadata journaling. A crash+recovery can cause incorrect data to appear in files which were written shortly before the crash. This mode will typically provide the best ext4 performance.”

    now my question is: in some cases the ReiserFS fs is a bit dangerous for the ssd, I don’t know why, but is EXT4 with the writeback option also dangerous like ReiserFS?

    and is there anybody who can tell me if my SSD on the 901 belongs to the “first generation SSD’s”? O.o

    I’m going crazy with these paranoias 🙁 what can I do? I really want to try the EXT4 system.
    thanks to whoever can answer. And sorry for my English.

  21. @22 Fabrico,

    If you want to use ext4 without the journal, all you need to do is to format the file system like this: “mke2fs -t ext4 -O ^has_journal /dev/sdXX”, where you need to replace /dev/sdXX with the appropriate block device.

    Most of the netbooks with SSD’s that have reached the market to date have very cheapo, “crapola” (to use Linus’s technical term) SSD’s. This means a couple of things. First of all, they tend to have relatively bad wear leveling algorithms, so reducing unnecessary writes is key to making them last longer. Secondly, they tend to have really lousy small random write performance, which is going to impact pretty much all file systems (basically, if you think you’re going to be able to do anything more than simple word processing/spreadsheets and web browsing, you’re kidding yourself).

    Ext3 is especially problematic because the journal means that metadata gets written twice to disk, and in a write pattern that might exacerbate wear-levelling. Worse yet, because of data=ordered mode, ext3 does a lot of synchronous writes, which will be painful because these SSDs have slow write speeds to begin with, and then you combine that with the slow small random write performance, and life gets really bad.

    The SSD vendors like to blame the OS, but in reality, they are the ones who are claiming that they have devices that are going to replace HDD’s, so it’s really their responsibility to create non-crap drives. As of this writing, the only drives that seem to meet that requirement are the Intel X25-M and the OCZ Vertex SSD’s. (See the Anandtech article,
    “The SSD Anthology: Understanding SSD’s and New Drives from OCZ”
    for more details. For an example of SSD vendors trying to shift the blame away from their own crappy products, see this article from gizmodo.com.)

    So yeah, for netbooks with SSD’s that are running Linux, I would recommend the use of ext4 without the journal. This will give you the advantages of ext4’s delayed allocation, and the reduced metadata advantage of using extents versus indirect blocks will definitely help. You will need to fsck your system after a crash, but with an SSD reads are fast, and these filesystems are small enough that it shouldn’t be a major issue. Make sure you are using 2.6.30 or newer, so that you get the replace-via-rename and replace-via-truncate heuristics to work around applications that think they are too good for fsync().
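
    To make that concrete, a minimal sketch; /dev/sdXX and the fstab entry are placeholders, and you would normally do this from an installer or live environment:

      # create the filesystem with no journal:
      mke2fs -t ext4 -O ^has_journal /dev/sdXX

      # and in /etc/fstab, mount it with noatime to cut down on writes further:
      /dev/sdXX   /   ext4   defaults,noatime   0   1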

  22. @23 Theodore:
    Thanks to you! I’m very happy because now I have confirmation of my suspicions. You know there are a lot of discordant opinions and suggestions about the Ext4 file system, I was going mad. O.o
    So thanks for the answer. 🙂

    I’ve just done some research, and I realize that the .28 kernel doesn’t recognize an EXT4 fs without a journal.
    So I need the .29 or .30 kernel like you suggest.
    But here is the problem. How can I create the partition without a journal with the command that you suggest, if I then install for example Ubuntu and get the .28 kernel?

    I was thinking of an alternative:
    install Ubuntu jaunty.
    then install or compile the kernel.29 or .30 and then use:
    “tune2fs -t ext4 -O ^has_journal /dev/sdXX”
    but here I found this comment that says it causes problems:
    http://ubuntuforums.org/showpost.php?p=7077528&postcount=29
    and someone answered it’s needed to ” have to e2fsck manually before rebooting to remove the journal..”

    In conclusion, I’ve read all the man pages for tune2fs, mke2fs and e2fsck. So I was thinking of doing something like this:

    -Install Ubuntu on the EeePC, so by default I will get the .28 kernel
    -then install or compile the .29, .30 or newer kernel like you suggest.
    -then from a live CD: ” tune2fs -t ext4 -O ^has_journal /dev/sdXX ”
    -after that: ” e2fsck /dev/sdXX ”
    -and then reboot.
    -to check everything I could do this: ” sudo dumpe2fs -h /dev/sdXX ”

    Is there anything wrong with this procedure? What do you think?

    Thanks for all this help! I appreciate that. 🙂
    Maybe if everything goes fine I could write a guide in Italian and also in Spanish, obviously giving credit to you.

  23. Memory cards can be bought/replaced at not very expensive prices.

    Can anybody throw light on the flash memories built into phones? They cannot be replaced. Should I avoid installing software on the internal memory of the phone (s60 and windows mobile phones)?

    Also there is the Nokia internet tablet N800/N810 which runs Linux and has 2GB internal memory. Should I avoid using the internal memory on such a device for fear of damaging the internal memory chip, which cannot be replaced at all?

  24. @24: Fabricio,

    What I would do is similar to your suggested approach. Using Ubuntu 9.04, you should be able to install using standard ext4 (with the journal). Once the install is completed, you can then install the new kernel, and then remove the journal. (A filesystem originally created as ext4 is more efficient than one that was originally ext3 and then later converted to ext4.)

  25. @27: Abhisek,

    That’s actually one of the older Anandtech articles, dating from September 2008. The best one to look at (IMHO) is “The SSD Anthology: Understanding SSDs and New Drives from OCZ”, from March, 2009:

    http://www.anandtech.com/storage/showdoc.aspx?i=3531

    Also of interest is a follow-up article, “The SSD Update: Vertex Gets Faster, New Indilinx Drives and Intel/MacBook Problems Resolved”, written two weeks later:

    http://www.anandtech.com/storage/showdoc.aspx?i=3535

  26. @26Theodore:
    thank you! I’ll try this optimization now, and then I will let you know if everything goes OK!

  27. Hi again Theodore, bad news.. 🙁

    I wasn’t able to use the partition without a journal. I installed Ubuntu Jaunty with ext4, then put in the .29.1 kernel, rebooted from the live version, used tune2fs, after that forced e2fsck and then rebooted again, but before the login there’s a message which says that there’s no partition:
    “Alert! /dev/disk/by-uuid/[numbers] does not exist. Dropping to a shell!”

    I found the solution by putting the journal back into the partition. Why is that? maybe the kernel?
    I read an interesting discussion: http://bbs.archlinux.org/viewtopic.php?pid=546525#p546525

    that says something about this problem:
    “Still take this advice, do not put auto as fs_type in fstab for that filesystem.
    As of 2.6.29 the kernel is not able to recognize an ext4 superblock without journal as an ext4 type for auto fs_type purposes.”

    but I don’t understand what he means by auto fs_type…

    thanks again and sorry.

  28. /dev/disk/by-uuid/* is something which the vol_id library and udev generate, and it’s something broken that I haven’t been able to convince distributions (especially Ubuntu) to drop. Specifically, it looks like vol_id isn’t able to find filesystems that are installed on a whole disk; the blkid library has no problems with this. So that’s probably a bug you’ll have to report to Ubuntu. Tell them to stop using vol_id, and that using the fixed path /dev/disk/by-uuid/* is especially broken.

    It has nothing to do with the kernel. The user on the archlinux forum who thinks this is a kernel issue is completely mistaken. The issue is whether or not various userspace helper libraries, such as vol_id (boo, hiss!) or blkid (which I maintain), understand that ext4 supports filesystems without a journal, and that a filesystem that has extents, but not a journal, should be mounted as ext4. The blkid library gets this right. I can’t speak to what other autodetection libraries, such as vol_id, do. So you are better off explicitly listing in /etc/fstab what filesystem driver should be used rather than putting a filesystem type of “auto” in /etc/fstab.
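
    Concretely, that means an /etc/fstab line along these lines, with the UUID and mount point as placeholders and the filesystem type spelled out:

      # instead of:
      UUID=<your-fs-uuid>   /   auto   defaults   0   1
      # explicitly name the filesystem driver:
      UUID=<your-fs-uuid>   /   ext4   defaults   0   1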

    Also, what the author suggested in the archlinux forum isn’t quite right. Just adding a new stanza in the /etc/mke2fs.conf file:

    [fs_types]
        ext4ssd = {
            features = extents,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
            inode_size = 256
        }

    … won’t do anything unless you actually specify the filesystem type on the command line to mke2fs, like this: “mke2fs -T ext4ssd”. Actually the better thing to do is something like this:

    [fs_types]
        ssd = {
            features = ^has_journal
        }

    and then use the mke2fs command “mke2fs -t ext4 -T ssd /dev/XXX”. The filesystem usage class “ssd” will override the parameters set by ext4.

  29. So is it impossible to use, for example, “tune2fs -O ^has_journal /dev/sda1” because of that bug with /dev/disk/by-uuid/* ?

    I don’t understand one thing: can I use Ext4 without a journal on Ubuntu 9.04 with a .29 kernel, or is it impossible because Ubuntu doesn’t use blkid?

    maybe it’s possible to replace vol_id with blkid? Is it a difficult thing to do?

    My problem is that I don’t know what could happen if I create an ext4 fs w/o journal and then install on it Ubuntu 9.04, which has the .28 kernel.. would it work?

    So the question is: to remove the journal, is it possible to tune the file system (tune2fs), or is the only way to make one (mke2fs) without a journal? And if that is the only way, how is it possible to resolve the kernel problem?
    Sorry if I didn’t understand well what you explained to me.

  30. to 32: blkid is the program Ubuntu uses to determine which partition holds the root filesystem. I have tried it on my eee701 (xubuntu 9.04) and if you do tune2fs -O ^has_journal, blkid fails. If you enable the journal again, it works as expected.

    I will report this bug. It has nothing to do with the kernel. Just wait until it’s fixed.

  31. Nice test, but you missed one important thing. The biggest problem of modern SSDs is IMHO not wear-out, but internal fragmentation caused by wear-levelling. Wear levelling basically remaps some blocks to others, causing internal, invisible fragmentation. When you then do a sequential read, it is therefore more or less a random read, which is much slower. The problem is much worse for writing, since almost all SSDs have very, very slow random writes. Internal fragmentation seems to be related to small random writes, such as a DB or journal. Sequential writes of large data don’t seem to cause much internal fragmentation. google ssd fragmentation for more info.

  32. @34: Petrik,

    Internal fragmentation doesn’t actually cause that much performance slowdown for sequential reads, since unlike hard drives, seeks are very cheap for SSD’s, and so random reads are quite fast for SSD’s — much faster than HDD’s. Yes, there is a tiny slowdown that you’ll see when an SSD gets more fragmented as far as reads are concerned, but that’s really not a major factor in terms of SSD performance. The much bigger issue from a performance point of view is painfully slow random *writes*, and that has nothing to do with internal fragmentation, and everything to do with the incompetent design of certain SSD controllers. The Intel X25-M has none of these problems, and the OCZ Vertex with the Indilinx controller is starting to figure out how to deal with these problems. So it is not inherent with SSD’s; just with certain incompetently designed SSD controllers.

    Issues around performance, and the “stuttering” effect of very bad random writes, are much more visible, but the lifetime of SSD’s is a much more important issue in my humble opinion. After all, we know most people don’t do backups as often as they should, so if SSD’s end up dying unexpectedly, a lot of people will end up losing their data, and that would be bad.

    Of course, readers of this blog all faithfully do regular backups, right? 🙂

  33. @33, Martin Hinner :
    thanks! How can I know when the bug will be repaired? Can you link it?

    35 tytso: ooh suuuuuuure of coourse I always do regular backups… =P

  34. Hi,

    I noticed strange file corruption using ext4 without a journal, on two different laptops (a Dell Vostro 1700, Ubuntu Jaunty 64 bits, custom 2.6.29.2 kernel, and an AAO, Arch Linux 32 bits, custom 2.6.29.3 kernel). On the Dell laptop the symptom was: when I install the nvidia kernel module with nvidia-installer, all is fine, sync, reboot, ok. Do an apt-get update+upgrade, reboot, and the nvidia module becomes corrupted, with the content of some package description! I thought it was nvidia’s fault, but today on my other laptop (Aspire One SSD, Arch Linux, i915 with kms) it was the /usr/lib/locale/locale-archive file that destroyed itself when I rebooted properly (and issued “sync” before).

    All those problems don’t appear when I use ext4 WITH a journal. Do you have any idea??

    Regards,

    Thibault

  35. Thibault,

    Can you try replicating it under controlled circumstances? If so, can you leave detailed information at http://bugzilla.kernel.org? That’s probably the best place to try to track down the problem.

    I would suggest trying to replicate this with a separate scratch partition (so you don’t have to worry about constantly reinstalling your system), and seeing if you can create a regular reproduction case. It would be interesting to see if mounting and unmounting the filesystem is enough, or whether you need to reboot — and whether rebooting while the filesystem is mounted, and/or rebooting after remounting the filesystem read-only makes a difference.
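
    Something along these lines is the sort of reproduction loop I have in mind; /dev/sdXX and the mount point are placeholders, and this is only a rough sketch:

      mke2fs -t ext4 -O ^has_journal /dev/sdXX
      mount /dev/sdXX /mnt/scratch
      cp -a /usr/share/doc /mnt/scratch/testdata          # any convenient pile of files
      find /mnt/scratch/testdata -type f -exec md5sum {} + > /tmp/before.md5
      sync
      umount /mnt/scratch
      # ... or reboot here instead of just unmounting ...
      mount /dev/sdXX /mnt/scratch
      md5sum -c /tmp/before.md5 | grep -v ': OK$'         # any mismatch points at corruption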

    Thanks!!

  36. Hi Ted, you seem to be one of the few writing who know much about these issues. Hope you can go a little off topic.

    Why would anyone want to use ext2? I mean, I know what they say, of course, but for most use can the speed and wear superiority over ext3 or ext4 really trump the fact that ext2 file systems are far more likely to be unrepairable after a crash or power failure (the latter with netbooks I imagine happening more often).

    But I see you recommend ext4 *without* journaling. Would adding journaling (or using standard ext3 with journaling) be so bad?

    Now a little off topic: you say the 1st generation were crappy wrt wear levelling. You cite the pricey Intel drive as an example of the next generation. But what about, for example, the STEC in the Dell Mini 9? I was considering getting one of these. On their web site STEC claims that their wear levelling is such that the youngest block is used on every write. From the point of view of wear-levelling, that seems totally optimal (although as a trade off no doubt it slows it down, with the Flash Translation Layer having to handle all this!)

    Another important issue: I have heard many of these drives get very bad IOPS numbers on smallish random writes. Is that something to be concerned about for a purchaser of a Dell Mini?

  37. @36
    Fabricio,
    My system seems ok after I changed all references from UUID=something to the real devices (/dev/sdXn). So after doing tune2fs -O ^has_journal /dev/sdXn, edit /etc/fstab and /boot/grub/menu.lst and change everything to /dev/sdXn. The final tip is to tell the initrd what type your root partition is; you can do this by adding the kernel parameter rootfstype=ext4 to menu.lst.

    If you cannot boot anyway (drop-to-shell symptom), you can boot by editing the grub command line before boot (press the e key) [add/change rootfstype=ext4 and root=/dev/sdXn]; after a semi-successful boot, edit /etc/fstab and /boot/grub/menu.lst and reboot. Everything should be ok now.

  38. G’Day, Ted.

    My question is not related directly to the article. But I think the answer is interesting to many people who use hdd cryptography in Linux (dm-crypt).

    As you know, dm-crypt (like most if not all hdd software cryptography solutions) works with data on the fly, i.e. data encryption-decryption is done in RAM and the data is written to disk encrypted.

    What ext3/ext4 filesystem features should be disabled/enabled to get the best results when using dm-crypt?
    That is, which features that may be useful on a normal hdd partition are useless on an encrypted partition (e.g., produce unneeded overhead for the hardware and/or OS), and vice versa?

    For example I found on the net the following opinion (http://www.saout.de/tikiwiki/tiki-index.php?page=EncryptedDevice):
    “Use your favourite options, filesystem type etc (I use ext3) or just copy my options. Note: you do *not* want journalling, or else writes will have to be encrypted twice (once to journal and once when committing journal to final resting place).”

    Other encryption tutorials suggest ext3 features like those which are used during filesystem creation on normal partitions (http://en.opensuse.org/Encrypted_Root_File_System_with_SUSE_HOWTO , https://help.ubuntu.com/community/EncryptedFilesystemLVMHowto etc.).

    p.s.:
    Thanks for the noatime/relatime options description. I came across some articles on the net about noatime vs relatime but none of them described the issue thoroughly, like you did.

  39. @40, Iuri Diniz:

    hey! thank you very much for the suggestion. At the moment I don’t have too much time to try it, but I’ll do it!
    I will write here what happens on my machine. Thanks to you and Tytso for your work!!

  40. Great!!! it works.. or so it seems.
    After all the tests, the command “sudo dumpe2fs -h /dev/sda1” gives me this:

    dumpe2fs 1.41.4 (27-Jan-2009)
    Filesystem volume name:
    Last mounted on:
    Filesystem UUID: 36458922-0fbf-4758-abf4-3dc1f8bae3e7
    Filesystem magic number: 0xEF53
    Filesystem revision #: 1 (dynamic)
    Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
    Filesystem flags: signed_directory_hash
    Default mount options: journal_data_writeback
    Filesystem state: not clean
    Errors behavior: Continue
    Filesystem OS type: Linux
    Inode count: 245280
    Block count: 979957
    Reserved block count: 48997
    Free blocks: 220843
    Free inodes: 84902
    First block: 0
    Block size: 4096
    Fragment size: 4096
    Reserved GDT blocks: 239
    Blocks per group: 32768
    Fragments per group: 32768
    Inodes per group: 8176
    Inode blocks per group: 511
    Flex block group size: 16
    Filesystem created: Fri May 1 18:54:35 2009
    Last mount time: Wed Jul 1 16:02:05 2009
    Last write time: Wed Jul 1 16:07:48 2009
    Mount count: 3
    Maximum mount count: 20
    Last checked: Wed Jul 1 15:33:10 2009
    Check interval: 15552000 (6 months)
    Next check after: Mon Dec 28 14:33:10 2009
    Reserved blocks uid: 0 (user root)
    Reserved blocks gid: 0 (group root)
    First inode: 11
    Inode size: 256
    Required extra isize: 28
    Desired extra isize: 28
    Default directory hash: half_md4
    Directory Hash Seed: ab7ab185-ad7e-4479-b5e3-e67ee2ca736c
    Journal backup: inode blocks

    so “Filesystem features” doesn’t have the option “has_journal” and there’s no journal size at the end. This means there’s no journal on that partition, am I right?
    tell me please so I’ll be at ease 🙂
    thanks to all.

  41. @43: Fabrico,

    Yes, if the filesystem features line printed by dumpe2fs does not include “has_journal”, the filesystem does not have a journal.
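
    As a quick check you can also just filter the output; /dev/sdXX is a placeholder:

      dumpe2fs -h /dev/sdXX | grep 'Filesystem features'
      # if "has_journal" does not appear in that line, there is no journal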

  42. @41: Yuriy,

    Using cryptography doesn’t really change whether or not you want a particular filesystem feature or not. Journalling does require writing metadata blocks twice, yes: once in the journal and once in the final location on disk. On the other hand, without journalling, you have to run e2fsck on your filesystem after a system crash, and manual system administrator action may be required to repair the filesystem. This is true regardless of whether you are using dm-crypt or not.

    Using ext4 will reduce the filesystem overhead by reducing the number of metadata blocks needed for large files (without extents the filesystem has to use many more indirect blocks for large files), but that’s an advantage you have whether you are using dm-crypt or not. The flip side is that some distributions don’t fully support ext4 yet. Fedora 11 seems to be doing quite well with ext4; Ubuntu 9.04 not so much, because they used an older kernel and I suspect screwed up one of their patch backports. So with Ubuntu, if you want to use ext4 you really need to use the 2.6.30 mainline kernel with 9.04, or to use the pre-release Alpha snapshots of Ubuntu 9.10. Again, all of this is true whether you are using dm-crypt or not.

  43. hi theodore! thanks for your help and support.
    I successfully applied Ext4 without a journal and here’s the Italian guide:
    http://www.uielinux.org/guide-e-tutorial/2-configurazione/188-ext4-senza-journaling-ottimo-per-dischi-ssd.html

    I also translated some of your opinions and included the link to this blog.

    PS:
    is this bug fixed now??
    https://bugs.launchpad.net/bugs/197311
    “Changed in util-linux (Ubuntu):
    status: Confirmed → Fix Released ”
    what does it mean? That it is now possible to take away the journal without that problem, so we don’t have to change the UUIDs to /dev/sdXX?
