On occasion, you will see the advice that the ext3 file system is not suitable for Solid State Disks (SSD’s) due to the extra writes caused by journaling, and so Linux users using SSD’s should use ext2 instead. However, is this folk wisdom actually true? This weekend, I decided to measure exactly what the write overhead of journaling is in practice.
For this experiment I used ext4, since I recently added a feature to track the amount of writes to the file system over its lifetime (to better gauge the wear and tear on an SSD). Ext4 also has the advantage that (starting in 2.6.29), it can support operations with and without a journal, allowing me to do a controlled experiment where I could manipulate only that one variable. The test workload I chose was a simple one:
- Clone a git repository containing a linux source tree
- Compile the linux source tree using make -j2
- Remove the object files by running make clean
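As a rough sketch of how such a measurement could be scripted: on kernels that expose ext4's write counter through sysfs, the lifetime total can be read from `/sys/fs/ext4/<device>/lifetime_write_kbytes`. The helper names below are hypothetical and `sda1` is only an example device; this is not the script used for the numbers in this post.

```python
import os
from pathlib import Path


def lifetime_write_kbytes(device, sysfs_root="/sys/fs/ext4"):
    """Return the lifetime write counter (in kilobytes) for an ext4 device such as "sda1"."""
    counter = Path(sysfs_root) / device / "lifetime_write_kbytes"
    return int(counter.read_text().strip())


def megabytes_written(before_kb, after_kb):
    """Convert the difference between two counter readings into megabytes."""
    return (after_kb - before_kb) / 1024.0


def measure_workload(device, workload, sysfs_root="/sys/fs/ext4"):
    """Run workload() and return how many megabytes it caused to be written.

    Assumption: nothing else is writing to the file system during the run.
    """
    before = lifetime_write_kbytes(device, sysfs_root)
    workload()
    os.sync()  # flush dirty data so the counter reflects the whole workload
    after = lifetime_write_kbytes(device, sysfs_root)
    return megabytes_written(before, after)
```

Each of the three steps above (git clone, make, make clean) would be passed in as its own `workload` callable, giving one figure per step.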
For the first test, I ran the test using no special mount options, and the only difference being the presence or absence of the has_journal feature. (That is, the first file system was created using mke2fs -t ext4 /dev/closure/testext4, while the second file system was created using mke2fs -t ext4 -O ^has_journal /dev/closure/testext4.)
| Operation | with journal (MB written) | w/o journal (MB written) | percent change |
| git clone | 367.7 | 353.0 | 4.00% |
| make | 231.1 | 203.4 | 12.0% |
| make clean | 14.6 | 7.7 | 47.3% |
What the results show is that metadata-heavy workloads, such as make clean, do result in almost twice the amount of data written to disk. This is to be expected, since all changes to metadata blocks are first written to the journal and the journal transaction committed before the metadata is written to its final location on disk. However, for more common workloads where we are writing data as well as modifying filesystem metadata blocks, the difference is much smaller: 4% for the git clone, and 12% for the actual kernel compile.
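For readers who want to check the arithmetic: the "percent change" column is consistent with taking the extra writes as a fraction of the with-journal figure. A minimal sketch (the function name is hypothetical):

```python
def journal_overhead_percent(with_journal_mb, without_journal_mb):
    """Extra data written because of the journal, as a percentage of the with-journal total."""
    return 100.0 * (with_journal_mb - without_journal_mb) / with_journal_mb


# Figures from the first table (megabytes written).
results = {
    "git clone": (367.7, 353.0),
    "make": (231.1, 203.4),
    "make clean": (14.6, 7.7),
}

for operation, (with_journal, without_journal) in results.items():
    overhead = journal_overhead_percent(with_journal, without_journal)
    print(f"{operation}: {overhead:.1f}%")
```

Note that "almost twice the amount of data" for make clean refers to the ratio 14.6 / 7.7, while the percentage column expresses the same gap relative to the larger number.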
The noatime mount option
Can we do better? Yes, if we mount the file system using the noatime mount option:
| Operation | with journal (MB written) | w/o journal (MB written) | percent change |
| git clone | 367.0 | 353.0 | 3.81% |
| make | 207.6 | 199.4 | 3.95% |
| make clean | 6.45 | 3.73 | 42.17% |
This reduces the extra cost of the journal in the git clone and make steps to just under 4%. What this shows is that most of the extra metadata cost without the noatime mount option was caused by updates to the last access time for kernel source files and directories.
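To see which access-time policy each mounted file system is actually using, one option is to look at the options column of `/proc/mounts`. The sketch below only classifies option strings; the function names are hypothetical, and "atime" here simply means that neither noatime nor relatime was listed.

```python
def atime_mode(mount_options):
    """Classify a comma-separated mount option string by its atime policy."""
    options = set(mount_options.split(","))
    if "noatime" in options:
        return "noatime"
    if "relatime" in options:
        return "relatime"
    return "atime"  # neither flag listed


def atime_modes(mounts_text):
    """Map each mount point in /proc/mounts-style text to its atime policy."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4:
            modes[fields[1]] = atime_mode(fields[3])
    return modes
```

On a live system the input would come from reading `/proc/mounts`, for example `atime_modes(open("/proc/mounts").read())`.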
The relatime mount option
There is a newer alternative to the noatime mount option, relatime. The relatime mount option updates the last access time of a file only if the last modified or last inode changed time is newer than the last accessed time. This allows programs to determine whether a file has been read since it was last modified. The usual (actually, only) example that is given of such an application is the mutt mail-reader, which uses the last accessed time to determine if new mail has been delivered to Unix mail spool files. Unfortunately, relatime is not free. As you can see below, it has roughly double the overhead of noatime (but roughly half the overhead of using the standard Posix atime semantics):
| Operation | with journal (MB written) | w/o journal (MB written) | percent change |
| git clone | 366.6 | 353.0 | 3.71% |
| make | 216.8 | 203.7 | 6.04% |
| make clean | 13.34 | 6.97 | 47.75% |
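The relatime rule described above can be written down in a few lines. This is a simplified model of the rule as stated in this post, not the kernel's implementation (which may treat edge cases such as equal timestamps differently); the function names are hypothetical.

```python
import os


def relatime_should_update(atime, mtime, ctime):
    """Return True if a read should update atime under the relatime rule described above."""
    return mtime > atime or ctime > atime


def read_since_modified(atime, mtime):
    """The question a program like mutt asks: was the file read after its last modification?"""
    return atime > mtime


def path_needs_atime_update(path):
    """Apply the rule to a real file's current timestamps."""
    st = os.stat(path)
    return relatime_should_update(st.st_atime, st.st_mtime, st.st_ctime)
```

The point of the rule is that the first read after a modification still costs one inode write, which keeps `read_since_modified` accurate, while later reads of an unchanged file cost nothing.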
Personally, I don’t think relatime is worth it. There are other ways of working around the issue with mutt; for example, you can use Maildir-style mailboxes, or you can use mutt’s check_mbox_size option. If the goal is to reduce unnecessary disk writes, I would mount my file systems using noatime, and use other workarounds as necessary. Alternatively, you can use chattr +A to set the noatime flag on all files and directories where you want noatime semantics, and then clear the flag for the Unix mbox files where you care about the atime updates. Since the noatime flag is inherited by default, you can get this behaviour by running chattr +A /mntpt right after the filesystem is first created and mounted; all files and directories created in that file system will inherit the noatime flag.
Comparing ext3 and ext2 filesystems
| Operation | ext3 (MB written) | ext2 (MB written) | percent change |
| git clone | 374.6 | 357.2 | 4.64% |
| make | 230.9 | 204.4 | 11.48% |
| make clean | 14.56 | 6.54 | 55.08% |
Finally, just to round things out, I tried the same experiment using the ext3 and ext2 file systems. The differences between these results and the ones involving ext4 are the result of the fact that ext2 does not have the directory index feature (aka htree support), and both ext2 and ext3 do not have extents support, but rather use the less efficient indirect block scheme. The ext2 and ext3 allocators are also somewhat different from each other, and from ext4. Still, the results are substantially similar to the first set of Posix-compliant atime update numbers. (I didn’t bother to do noatime and relatime benchmark runs with ext2 and ext3, but I expect the results would be similar.)
Conclusion
So given all of this, where did the common folk wisdom that ext3 was not suitable for SSD’s come from? Some of it may have been from people worrying too much about extreme workloads such as “make clean”; but while doubling the write load sounds bad, going from 4MB to 7MB worth of writes isn’t that much compared to the write load of actually doing the kernel compile or populating the kernel source tree. No, the problem was that first generation SSD’s had a very bad problem with what has been called the “write amplification effect”, where a 4k write might cause a 128k region of the SSD to be erased and rewritten. In addition, in order to provide safety against system crashes, ext3 has more synchronous write operations (that is, operations where ext3 waits for the write to complete before moving on), and this caused a very pronounced and noticeable stuttering effect which was fairly annoying to users. However, the next generation of SSD’s, such as Intel’s X25-M SSD, have worked around the write amplification effect.
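To make the “write amplification effect” concrete, here is a deliberately naive model. It assumes, as described above for first-generation drives, that any write touching a 128k erase block causes that whole block to be rewritten, and that writes to the same block are batched into a single rewrite. The function names and the batching assumption are illustrative only and do not describe any real drive's firmware.

```python
def erase_blocks_touched(write_offsets, write_size=4096, erase_block=128 * 1024):
    """Return the set of erase-block indices touched by writes at the given byte offsets."""
    touched = set()
    for offset in write_offsets:
        first = offset // erase_block
        last = (offset + write_size - 1) // erase_block
        touched.update(range(first, last + 1))
    return touched


def naive_write_amplification(write_offsets, write_size=4096, erase_block=128 * 1024):
    """Ratio of bytes rewritten on flash to bytes the host asked to write.

    write_offsets is a list of byte offsets, one per write of write_size bytes.
    """
    if not write_offsets:
        raise ValueError("need at least one write")
    logical_bytes = len(write_offsets) * write_size
    blocks = erase_blocks_touched(write_offsets, write_size, erase_block)
    physical_bytes = len(blocks) * erase_block
    return physical_bytes / logical_bytes
```

Under this model a lone 4k write has an amplification of 32, while 32 sequential 4k writes that fill one erase block have an amplification of 1. That is why scattered small writes were so much worse on these drives than the raw megabyte counts suggest.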
What else have we learned? First of all, for normal workloads that include data writes, the overhead from journaling is actually relatively small (between 4 and 12%, depending on the workload). Further, much of this overhead can be reduced by enabling the noatime option, with relatime providing some benefit, but ultimately if the goal is to reduce your file system’s write load, especially where an SSD is involved, I would strongly recommend the use of noatime over relatime.
March 2nd, 2009 at 12:29 am
I’d thought that the reason to avoid ext3 on SSDs, at least most of the ones available today, was not the total number of writes but rather the repeated writes to the same place on the disk (that is, the journal), which might blow out primitive wear-leveling schemes and result in those blocks becoming unreliable. (I don’t know quite where I got this notion from, though, and it certainly wasn’t anywhere authoritative.)
March 2nd, 2009 at 1:15 am
@1: I’d thought that the reason to avoid ext3 on SSDs, at least most of the ones available today, was not the total number of writes but rather the repeated writes to the same place on the disk (that is, the journal)
Norman,
Actually, even the most primitive SSD’s and Flash drives have to get this right, because the Windows FAT filesystem is constantly updating the same locations on disk (namely the File Allocation Table, which is in a fixed location on disk). So although there’s not a lot we can count on in terms of the quality of flash drives’ wear leveling, it’s very likely they get that right, since otherwise their reliability on basic FAT filesystems, which are used in essentially every single digital camera on the market, would be pretty bad.
March 2nd, 2009 at 2:56 am
Is it possible for ext4 to add an allocator that keeps track of the last write place and (when possible) allocate blocks from there on for SSD? btrfs is doing something similar for there are benchmarks showing that SSDs have better performance on sequential writes. But btrfs is still years away. It would be useful for ext4 to add some optimization for SSD.
March 2nd, 2009 at 3:12 am
[...] SSD Write Amplification: http://www.extremetech.com/article2/0,2845,2329594,00.asp [...]
March 2nd, 2009 at 5:30 am
Is there a way to specify that a filesystem should always be mounted with noatime, even if the option is not given on mount? I would love to be able to mark my USB stick this way, so that no matter where I plug it in, it will use noatime.
March 2nd, 2009 at 9:31 am
@5: Is there a way to specify that a filesystem should always be mounted with noatime, even if the option is not given on mount?
There isn’t a way to do this as a mount option, but the easiest thing to do is to set the noatime flag on the file system’s root directory when it’s freshly created, or to set the noatime flag for all files and directories, using the chattr command: “chattr -R +A /mntpt”, where you should replace /mntpt with the mount point of your thumbdrive.
The reason why this works is because the noatime flag is inherited, so all new files and directories created in a directory that has the noatime flag set will also have the noatime flag set. And if all of the files and directories in use on the file system have the noatime flag set, it’s functionally equivalent to mounting the filesystem with the noatime mount option.
March 2nd, 2009 at 12:43 pm
Thanks for the “chattr” tip. I used it for my directories and it printed “Operation not supported while reading flags on …” for every symlink. Maybe that output should only be printed when in verbose mode?
March 2nd, 2009 at 3:10 pm
[...] “SSD’s, Journaling, and noatime/relatime”: a comparison of the performance of the ext3 and ext4 file systems on an SSD, and an assessment of the impact of journaling and of mounting in noatime/relatime modes. The tests measured the time taken to clone a git tree and to build the Linux kernel. In the tests ext3 noticeably loses to ext4: git clone ran 4.64% faster on ext4, make 11.48% faster, and make clean 55.08% faster. [...]
March 2nd, 2009 at 5:55 pm
There is a big, big mistake made about SSDs in this article:
all flash memory (that includes SSDs) is limited in how many times you can access the memory for writing.
An access does not have a direct relation with the unit measured here: the megabyte.
The relation between those two units is complex and is not linear.
Sometimes 10 MB can be 1 access and other times it can be 100 accesses.
The only thing to compare well is to measure the hits that the system made for writing, not how many megabytes it wrote.
In our case, the relationship takes these factors into consideration:
- how the partition is made and how it is working
- how the system manages the partition
Basically ext2 is better suited than ext3, because ext3 does extra writes for the journal (it is not important how many megabytes it wrote; it has made a minimum of one write cycle).
ext4 might be a good way out, because I’ve read somewhere that this filesystem will have a new cache option especially designed for SSDs; basically that will do fewer write cycles because it will access the disk only once, when its cache is nearly full.
March 4th, 2009 at 2:49 pm
I’ve been reading this blog for a while hoping to gain some more insight into how to deal with SSDs in general, since many Arch Linux users have been looking to increase their everyday performance over using HDDs. I found one thread some time ago that suggested using reiserfs due to its fragmentation properties and another that suggested adding elevator=noop to the kernel boot parameters in the grub boot menu. Your thoughts have given me much to think about.
March 4th, 2009 at 6:10 pm
@9: Judicator,
Actually, if you kept on reading all the way to the conclusion, you would have noted that I talked about the write amplification effect, and how with newer SSD’s, such as the Intel X25-M, this is much less of a factor: it has a write amplification factor averaging around 1.1, with a wear leveling overhead of 1.4, compared to older SSD’s that had a write amplification factor of 20 or more.
I believe you are also incorrect when you say that “The only thing to compare well is to measure the hits that the system made for writing”. In fact, ext3 tends to pin writes until the transaction commit timer goes off, at which point the data blocks get flushed out, and then the journal blocks, and finally the metadata blocks. The real issue is that older SSD’s did their wear leveling in 128k erase block chunks, and so if you had writes which are scattered across the disk, a single 4k update in an erase block region caused the entire 128k erase block to be rewritten. The X25-M keeps track of disk block sectors at a smaller granularity than the 128k erase block segments, so the fact that writes are scattered across the disk doesn’t cause a massive write amplification effect.
March 4th, 2009 at 6:31 pm
@10: Soul_Est,
I’m not that convinced that elevator=noop is the best idea for SSD’s, since combining writes is critical for SSD’s, and I’m not sure the noop elevator will be sufficiently aggressive at combining write requests. I have a feeling the deadline scheduler may be a better choice, but I haven’t had a chance to benchmark it yet.
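For anyone wanting to check which elevator a disk is currently using before experimenting: the kernel lists the available schedulers in `/sys/block/<device>/queue/scheduler`, with the active one in square brackets. A small sketch of parsing that file (the function names are hypothetical and `sda` is only an example device):

```python
from pathlib import Path


def parse_schedulers(text):
    """Parse a line such as "noop deadline [cfq]" into (active, available)."""
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            active = token[1:-1]
            available.append(active)
        else:
            available.append(token)
    return active, available


def current_scheduler(device, sysfs_root="/sys/block"):
    """Read and parse the scheduler file for a block device such as "sda"."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())
```

This only reports the setting; whether noop, deadline, or cfq is actually better on a given SSD still has to be benchmarked.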
March 6th, 2009 at 1:00 pm
@2:
My experience is rather different. On “generation 0” SSDs the anecdotal comments say the wear levelling is group cyclic and worth very little. The SSD in my own EeePC has now developed bad parts (and that was using an ext2 filesystem mounted with noatime). Other blogs have commented on this too: Val Aurora has something here: http://valhenson.livejournal.com/25228.html?thread=108940 and davej warns against putting swap on the EeePC “gen 0” SSDs here: http://kernelslacker.livejournal.com/132087.html (it’s a pity the comments have gone from davej’s blog; they were very good).
Now as it happens on my EeePC I used to use ext3 but have switched to ext2 (remember: gen 0 SSD). The biggest difference was in the latency of writes; this SSD has very slow writes. Booting has become a few seconds faster. With ext3, using firefox 3 (which is fsync happy) causes HUGE and very painful delays. With ext2 this is noticeably less (the stalls are still there, just not as long). fsck goes quite quickly with ext2 but for some reason even when everything is OK the periodic startup fsck will force a reboot (which is painful). It’s also interesting to see that Ubuntu will never do a startup fsck on battery even if the FS was not properly unmounted…
I’ve also tried a few different io schedulers. The EeePC Xandros distro ships with a command line option to use deadline. I have also used cfq and noop (noop didn’t seem noticeably better than deadline). My hope is that cfq is worth it due to being able to have IO priorities. I have even twiddled the rotational flag that has appeared in 2.6.29 but sadly I don’t have benchmarks to be able to tell if it made a difference (I’m too worried that benchmarking this machine is going to decay the SSD further).
Incidentally I used to use ext3 on an SD card in the EeePC. Now THAT was interesting… After a period of time (seemingly not more than a few weeks) I would be pretty much assured the filesystem would develop self-destroying corruption (where an fsck would go and delete everything in sight). Since using just the FAT32 partition on the SD card I have not developed any further corruption (although since I stopped booting from it I have taken to write protecting the card whenever possible).
March 6th, 2009 at 2:13 pm
The really big problem is that SSD manufacturers don’t feel obligated to tell you what sort of wear leveling algorithm they are using. There is a huge difference between those drives that do:
The Intel SSD is the only one on the market which does the last, although rumor has it that a competitor will be showing up on the market within the next 30 days that will have similar capabilities. I can tell you that with the Intel X25-M SSD, which I now have installed as the primary disk in my laptop, I don’t see any stuttering and performance has been very agreeably fast. Also note that the fsync() issue in Firefox 3 was fixed by FF 3.0.1 (it may have been fixed in FF 3.0 final; I’m not 100% sure). So if you were seeing the problem, my guess is that your distro picked a pre-release FF 3 and didn’t bother to upgrade to a newer Firefox.
Finally, if you’re seeing filesystem corruption which required fsck to fix stuff that then required a reboot, my guess is there is something really bad going on. Since you mentioned an SD card, my guess is that there may have been some issues with the SD card getting jostled or the contacts not being secure that caused the data corruption. Even the crappy SD cards had wear levelling logic that noticed when a cell started going bad, and would stop using that flash cell. So that may have been more of a mechanical issue causing data loss, not a fundamental flash card issue. That being said, there are many laptops that have SD cards where I would not use them for regular data use, but only for pulling data off a card used by a digital camera, since that’s what they were probably primarily designed for.
I’ve had other people complain about certain notebooks where the SD card stuck out slightly, and when it was jostled, it would get disconnected and the filesystem would get horribly corrupted since (a) they weren’t using ext3, and (b) the filesystem was being written at the time when the SD card was nudged. About the only thing I can tell them is the response to the old joke, “Doctor, doctor, it hurts when I do that….”
March 6th, 2009 at 2:44 pm
@14:
Re wear levelling:
So true. If only they would say! I’d be willing to pay a little more to have something that isn’t going to go bad in less than a year. However that kind of goes against your “even the most primitive SSD” statement that you made earlier. Surely you can’t get more primordial than gen 0 SSD? : ) Now you’ve said it I’m wondering if the stock EeePC SSDs really have no wear levelling at all *shudder*. No that’s too painful to even think about so I’m going to stop that thought there…
I wish I could afford an Intel SSD but I can’t and they don’t fit in EeePCs anyway. You speak of a utopia I cannot reach…
Alas the fsync issue was NOT fixed in Firefox 3.0.1. It was lessened slightly but as soon as sqlite starts writing after you’ve got a few links in your history you will really feel it (I can only suggest using an EeePC with an existing firefox 3 profile and you will see just how bad the SSD write speed is). Just for the record I have Firefox 3.0.6 on this machine and this is with the google bad site thing turned off. If you know where to look you will find that this bug is alive and well – https://bugzilla.mozilla.org/show_bug.cgi?id=442967 .
As for your last point – hehe! Well half of my SD Card filesystem was always ext3 the other half vfat and the vfat was (seemingly) always OK (but it was never the root fs). Do both a and b have to be present for the corruption to manifest or is b alone enough?
March 7th, 2009 at 11:56 pm
@12 (tytso)
Thanks for replying. I had no idea whether what I read was best for SSD performance since, unfortunately, I don’t own a good SSD (or notebook for that matter). I’ll relay what you posted to those in the Arch Linux forum as I believe many Archers need to know this. Thanks again.
March 10th, 2009 at 3:58 pm
Isn’t relatime supposed to help mostly on read heavy workloads? Can you try what the results for e.g. “grep -r Theodore” inside of the kernel tree are, especially when running with cold and hot cache (e.g. running make twice, touching maybe .config inbetween, or with no changes at all)?
March 12th, 2009 at 9:42 am
@17: Isn’t relatime supposed to help mostly on read heavy workloads?
Stefan,
Yes, but noatime helps even more. On a laptop using mutt to read a local mailbox stored in Maildir format, each time we read a new mail message, we will wake up the disk if relatime is used. With noatime, the disk won’t get spun up for each new message, which is a huge win. For other workloads the difference will be less; on average, noatime helps about twice as much as relatime.
March 17th, 2009 at 3:02 am
Rumours have it, that the folks wisdom is hard to change. I think the majority of SSDs out there have been shipped with netbooks, and there are only a few, that consist of a usable SSD, therefore they need a lot of tweaking. Been through a couple of installs the last two weeks to try and get the most out of my system (AA1) and yes, IMHO ext2 feels the fastest out of the box, with no options set, but xfs is close to that. Ext3 and reiserfs have slowed it down, before using mount-options. If we take aside the wear-out issue (i don’t care about it, it will need years before that SSD gets unusable and then can be replaced at a fair price) i think that most hints and votes regarding the usage of ext2 comes out of fear, to grant support for a technique that was not ready for mass-market. Wish i could have my hands on some of the new “real” SSDs, just like your Intel for comparisons, but until they show up in affordable laptops for the common user i think this wisdom is here to stay…
However, it is only natural for data-storage in general to become unsafer each time you turn on your box, if it happens to users using new technology i can only recommend to back up every day. Had no issues with data-loss on SD-cards with any File-system, on some occasions they needed to be re-formated to gain back their capacity. My conclusion: it doesn’t matter which fs you do use, as long as you use your head as well.
Andy
March 19th, 2009 at 1:32 pm
very nice technical article
I experience opensuse 11.0 i386 on Corsair Flash Voyager 8G and Corsair Flash Voyager GT 16G USB stick. Boot up time is very slow at XFS.
I only do noatime, nodiratime and tweaking kernel delay write to 120s.
Opensuse will be plan to reinstall and new optimisation will be tried,
1. /tmp and /var/tmp run on tmpfs.
2. JFS with mount option – nointegrity (disable journaling)
OR 3. JFS / XFS using an external logging device (16M SD card from a Canon DC). Is 16M enough for the journal log?
4. Disable Firefox Disk Cache
Have any other suggestion to improve performance for USB stick to run linux?
March 31st, 2009 at 5:39 am
Recently I put quite some effort to try and make an acer aspire one 110L work as I liked it to work.
In a few words, it is now running Ubuntu 9.04 (daily) on an ext2 fs, and I also added most of the tweaks I found in the Arch wiki guide, Fedora guide and Ubuntu guide. These include enabling both nodiratime and noatime and adding the noop elevator. I remember that these changes made it noticeably faster (boot time, starting up programs, as well as decreased stuttering, which was MUCH of a problem). Now it runs gnome as well as compiz beautifully! After reading this post I am curious to try ext4 (which I tried on a “normal” laptop and was quite excited about) on the aspire one, possibly disabling journaling.
April 26th, 2009 at 7:07 am
hi everyone!
I just finished reading this interesting article, thanks!
just a couple of questions.
I have an EeePC 901 and I’m very interested in trying ext4 without the journal and with the noatime option, but I don’t know how to configure ext4 without a journal. Is it the mount option writeback?
And if it is that option, isn’t it dangerous? For example, this page talks about the option: http://www.mjmwired.net/kernel/Documentation/filesystems/ext4.txt#313
It says:
“Data Mode
=========
There are 3 different data modes:
* writeback mode
In data=writeback mode, ext4 does not journal data at all. This mode provides a similar level of journaling as that of XFS, JFS, and ReiserFS in its default mode – metadata journaling. A crash+recovery can cause incorrect data to appear in files which were written shortly before the crash. This mode will typically provide the best ext4 performance.”
Now my question is: in some cases ReiserFS is a bit dangerous for the SSD, I don’t know why, but is ext4 with the writeback option also dangerous like ReiserFS?
And is there anybody who can tell me if the SSD in my 901 belongs to the “first generation SSD’s”? O.o
I’m going crazy with these paranoias.
What can I do? I really want to try the ext4 file system.
Thanks to whoever can answer. And sorry for my English.
April 27th, 2009 at 8:37 pm
@22 Fabricio,
If you want to use ext4 without the journal, all you need to do is to format the file system like this: “mke2fs -t ext4 -O ^has_journal /dev/sdXX”, where you need to replace /dev/sdXX with the appropriate device.
Most of the netbooks with SSD’s that have reached the market to date have very cheapo, “crapola” (to use Linus’s technical term) SSD’s. This means a couple of things. First of all, they tend to have relatively bad wear leveling algorithms, so reducing unnecessary writes is key to making them last longer. Secondly, they tend to have really lousy small random write performance, which is going to impact pretty much all file systems (basically, if you think you’re going to be able to do anything more than simple word processing/spreadsheets and web browsing, you’re kidding yourself).
Ext3 is especially problematic because the journal means that metadata gets written twice to disk, and in a write pattern that might exacerbate wear-levelling. Worse yet, because of data=ordered mode, ext3 does a lot of synchronous writes, which will be painful because these SSD have slow write speeds to begin with, and then you combine that with the slow small random write performance, and life gets really bad.
The SSD vendors like to blame the OS, but in reality, they are the ones who are claiming that they have devices that are going to replace HDD’s, so it’s really their responsibility to create non-crap drives. As of this writing, the only drives that seem to meet that requirement are the Intel X25-M and the OCZ Vertex SSD’s. (See the Anandtech article, “The SSD Anthology: Understanding SSD’s and New Drives from OCZ” for more details. For an example of SSD vendors trying to shift the blame away from their own crappy products, see this article from gizmodo.com.)
So yeah, for netbooks with SSD’s that are running Linux, I would recommend the use of ext4 without the journal. This will give you the advantages of ext4’s delayed allocation, and the reduced metadata advantage of using extents versus indirect blocks will definitely help. You will need to fsck your system after a crash, but with an SSD reads are fast, and these filesystems are small enough that it shouldn’t be a major issue. Make sure you are using 2.6.30 or newer, so that you get the replace-via-rename and replace-via-truncate heuristics to work around applications that think they are too good for fsync().
April 28th, 2009 at 4:40 pm
@23 Theodore:
Thanks to you! I’m very happy because now I have got the confirmations of my suspects. You know there are a lot of discording opinions and suggestions about the file system Ext4, I was getting mad. O.o
So thanks for the answer.
I’ve just make some researches, but I realize that kernel.28 hasn’t recognize the fs EXT4 without journal.
So I need the kernel.29 or 30 like you suggest.
But here is the problem. How can I create the partition without journal with the command that you suggest if then I install for example Ubuntu and have the kernel.28?
I was thinking in alternative:
install Ubuntu jaunty.
then install or compile the kernel.29 or .30 and then use:
“tune2fs -t ext4 -O ^has_journal /dev/sdXX”
but here I found this comment that says it comports problems:
http://ubuntuforums.org/showpost.php?p=7077528&postcount=29
and someone answered it’s needed to ” have to e2fsck manually before rebooting to remove the journal..”
In conclusion, Ive read all the “man” of tune2fs, mke2fs and e2fsck. So I was thinking in doing something like this:
-Install Ubuntu on the EeePc, so on default I will found the kernel.28
-then install or compile the kernel.29, .30 or newer like you suggest.
-then from a live: ” tune2fs -t ext4 -O ^has_journal /dev/sdXX ”
-after that: ” e2fsck /dev/sdXX ”
-and then reboot.
-for control everything I could do this: ” sudo dumpe2fs -h /dev/sdXX ”
It’s anything wrong on this procedure? What do you think?
Thanks for all this help! I appreciate that.
Maybe if everything goes fine I could do a guide in Italian and also in Spanish, obviously saying thank to you.
April 29th, 2009 at 2:42 am
Memory cards can be bought/replaced at not very expensive prices.
Can anybody throw light on flash memories built into the phone.They cannot be replaced.Should I avoid installing software on internal memory of the phone(s60 and windows mobile phones).
Also there is this Nokia internet tablet N800/N810 which has Linux and has 2GB internal memory.Should i avoid using internal memory on such a device for fear of damaging the internal memory chip which cannot be replace at all?
April 30th, 2009 at 1:14 am
@24: Fabricio,
What I would do is similar to your suggested approach. Using Ubuntu 9.04, you should be able to install using standard ext4 (with the journal). Once the install is completed, you can then install the new kernel, and then remove the journal. (A filesystem originally created as ext4 is more efficient than one that was originally ext3 and then later converted to ext4.)
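After removing the journal (for instance with the “tune2fs -O ^has_journal” form used elsewhere in this thread, followed by a forced fsck), one way to confirm the result is to look at the “Filesystem features:” line printed by “dumpe2fs -h”. The sketch below parses that header; the function names are hypothetical, and reading a real device normally requires root.

```python
import subprocess


def filesystem_features(dumpe2fs_header):
    """Extract the feature names from the text printed by `dumpe2fs -h`."""
    for line in dumpe2fs_header.splitlines():
        if line.startswith("Filesystem features:"):
            return set(line.split(":", 1)[1].split())
    return set()


def has_journal(dumpe2fs_header):
    """Return True if the has_journal feature is still listed."""
    return "has_journal" in filesystem_features(dumpe2fs_header)


def read_header(device):
    """Run `dumpe2fs -h` on a device such as /dev/sdXX and return its output."""
    result = subprocess.run(
        ["dumpe2fs", "-h", device],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a converted file system, `has_journal(read_header("/dev/sdXX"))` should return False while features such as `extent` remain listed.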
May 1st, 2009 at 2:34 am
I think the following link is very very informative.Read ALL the pages.
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=1
May 1st, 2009 at 9:24 am
@27: Abhisek,
That’s actually one of the older Anandtech articles, dating from September 2008. The best one to look at (IMHO) is “The SSD Anthology: Understanding SSDs and New Drives from OCZ”, from March, 2009:
http://www.anandtech.com/storage/showdoc.aspx?i=3531
Also of interest is a follow-up article, “The SSD Update: Vertex Gets Faster, New Indilinx Drives and Intel/MacBook Problems Resolved”, written two weeks later:
http://www.anandtech.com/storage/showdoc.aspx?i=3535
May 2nd, 2009 at 1:41 pm
@26Theodore:
thank you! I’ll try now this optimization, then I will make you know if everything goes ok!
May 2nd, 2009 at 2:40 pm
Hi again Theodore, bad news..
I wasn’t able to use the partition without a journal. I installed Ubuntu Jaunty with ext4, then put in kernel .29.1, rebooted from the live version, used tune2fs, after that forced e2fsck and then rebooted again, but before the login there’s a message which says that there’s no partition:
“Alert! /dev/disk/by-uuid/[numbers] does not exist. Dropping to a shell!”
I found the solution putting the journal again inside the partitions. Why is that? maybe the kernel?
I read an interesting discusion: http://bbs.archlinux.org/viewtopic.php?pid=546525#p546525
that says anything about this problem:
“Still take this advice, do not put auto as fs_type in fstab for that filesystem.
As of 2.6.29 the kernel is not able to recognize an ext4 superblock without journal as an ext4 type for auto fs_type purposes.”
but I don’t understand what he mean with auto fs_type…
thanks again and sorry.
May 2nd, 2009 at 4:30 pm
/dev/disk/by-uuid/* is something which the vol_id library and udev generate, and it’s something broken that I haven’t been able to convince distributions (especially Ubuntu) to drop. Specifically, it looks like vol_id isn’t able to find filesystems that are installed on a whole-disk; the blkid library has no problems with this. So that’s probably a bug you’ll have to report to Ubuntu. Tell them to stop using vol_id, and that using the fixed path /dev/disk/by-uuid/* is especially broken.
It has nothing to do with the kernel. The user on the archlinux forum who thinks this is a kernel issue is completely mistaken. The issue is whether or not various userspace helper libraries, such as vol_id (boo, hiss!) or blkid (which I maintain), understand that ext4 supports filesystems without a journal, and that a filesystem that has extents, but not a journal, should be mounted as ext4. The blkid library gets this right. I can’t speak to what other autodetection libraries, such as vol_id, do. So you are better off explicitly listing in /etc/fstab which filesystem driver should be used rather than putting a filesystem type of “auto” in /etc/fstab.
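To find the entries this advice applies to, a short script can list every /etc/fstab line whose filesystem type is “auto”. This is only an illustrative sketch with a hypothetical function name; it assumes the usual whitespace-separated fstab layout (device, mount point, type, options, dump, pass).

```python
def fstab_auto_entries(fstab_text):
    """Return (device, mount point) pairs whose filesystem type is "auto"."""
    entries = []
    for line in fstab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        fields = stripped.split()
        if len(fields) >= 3 and fields[2] == "auto":
            entries.append((fields[0], fields[1]))
    return entries
```

Each entry it reports would then be edited by hand to name the driver explicitly, for example replacing “auto” with “ext4” for a journal-less ext4 file system.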
Also, what the author suggested in the archlinux forum isn’t quite right. Just adding a new stanza in the /etc/mke2fs.conf file:
[fs_types]
ext4ssd = {
features = extents,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
inode_size = 256
}
… won’t do anything unless you actually specify the filesystem type on the command line to mke2fs, like this: “mke2fs -T ext4ssd”. Actually the better thing to do is something like this:
[fs_types]
ssd = {
features = ^has_journal
}
and then use the mke2fs command “mke2fs -t ext4 -T ssd /dev/XXX”. The filesystem usage class “ssd” will override the parameters set by ext4.
May 3rd, 2009 at 6:41 am
So it’s impossible to use, for example, “tune2fs -O ^has_journal /dev/sda1” because of that bug with /dev/disk/by-uuid/*?
I don’t understand one thing: can I use ext4 without a journal on Ubuntu 9.04 with a 2.6.29 kernel, or is it impossible because Ubuntu doesn’t use blkid?
Maybe it’s possible to replace vol_id with blkid? Is that difficult to do?
My problem is that I don’t know what would happen if I create an ext4 filesystem without a journal and then install Ubuntu 9.04, which has the 2.6.28 kernel, on it. Would it work?
So the question is: to remove the journal, is it possible to tune the existing filesystem (tune2fs), or is the only way to create a new one (mke2fs) without a journal? And if that is the only way, how can the kernel problem be resolved?
Sorry if I didn’t understand well what you explained to me.
May 5th, 2009 at 2:32 pm
to 32: blkid is the program Ubuntu uses to determine which partition holds the root filesystem. I have tried it on my eee701 (xubuntu 9.04), and if you run tune2fs -O ^has_journal, blkid fails. If you enable the journal again, it works as expected.
I will report this bug. It has nothing to do with kernel. Just wait until it’s fixed.
May 9th, 2009 at 7:35 pm
Nice test, but you missed one important thing. The biggest problem of modern SSDs is IMHO not wear-out, but internal fragmentation caused by wear-levelling. Wear levelling basically remaps some blocks to others, causing internal, invisible fragmentation. When you then do a sequential read, it is therefore more or less a random read, which is much slower. This problem is much worse with writing, since almost all SSDs have very, very slow random writes. Internal fragmentation seems to be related to small random writes, such as a DB or journal. Sequential writes of large data don’t seem to cause much internal fragmentation. Google “ssd fragmentation” for more info.
May 10th, 2009 at 3:36 am
@34: Petrik,
Internal fragmentation doesn’t actually cause that much performance slowdown for sequential reads, since unlike hard drives, seeks are very cheap for SSD’s, and so random reads are quite fast for SSD’s, much faster than HDD’s. Yes, there is a tiny slowdown that you’ll see when an SSD gets more fragmented as far as reads are concerned, but that’s really not a major factor in terms of SSD performance. The much bigger issue from a performance point of view is painfully slow random *writes*, and that has not so much to do with internal fragmentation as it does with the incompetent design of certain SSD controllers. The Intel X25-M has none of these problems, and the OCZ Vertex with the Indilinx controller is starting to figure out how to deal with these problems. So it is not inherent with SSD’s; just with certain incompetently designed SSD controllers.
Issues around performance, and the “stuttering” effect of very bad random writes, are much more visible, but the lifetime of SSD’s is a much more important issue in my humble opinion. After all, we know most people don’t do backups as often as they should, so if SSD’s end up dying unexpectedly, a lot of people will end up losing their data, and that would be bad.
Of course, readers of this blog all faithfully do regular backups, right?
May 10th, 2009 at 6:20 am
@33, Martin Hinner :
thanks! How can I know when the bug will be repaired? Can you link it?
35 tytso: ooh suuuuuuure of coourse I always do regular backups… =P
May 10th, 2009 at 8:21 am
Hi,
I noticed strange file corruption using ext4 without a journal on two different laptops (a Dell Vostro 1700 with Ubuntu Jaunty 64-bit and a custom 2.6.29.2 kernel, and an AAO with 32-bit Arch Linux and a custom 2.6.29.3 kernel). On the Dell laptop the symptom was this: I install the nvidia kernel module with nvidia-installer, all is fine, sync, reboot, OK. Then I do an apt-get update and upgrade, reboot, and the nvidia module is corrupted, containing the content of some package description! I thought it was nvidia’s fault, but today on my other laptop (Aspire One SSD, Arch Linux, i915 with KMS) it was the /usr/lib/locale/locale-archive file that destroyed itself when I rebooted properly (issuing “sync” beforehand).
None of these problems appear when I use ext4 WITH a journal. Do you have any idea?
Regards,
Thibault
May 10th, 2009 at 12:04 pm
Thibault,
Can you try replicating it under controlled circumstances? If so, can you leave detailed information at http://bugzilla.kernel.org? That’s probably the best place to try to track down the problem.
I would suggest trying to replicate this with a separate scratch partition (so you don’t have to worry about constantly reinstalling your system), and seeing if you can create a regular reproduction case. It would be interesting to see if mounting and unmounting the filesystem is enough, or whether you need to reboot — and whether rebooting while the filesystem is mounted, and/or rebooting after remounting the filesystem read-only makes a difference.
Thanks!!
May 21st, 2009 at 2:28 am
Hi Ted, you seem to be one of the few writing who know much about these issues. Hope you can go a little off topic.
Why would anyone want to use ext2? I mean, I know what they say, of course, but for most uses can the speed and wear superiority over ext3 or ext4 really trump the fact that ext2 file systems are far more likely to be unrepairable after a crash or power failure (the latter, I imagine, happening more often with netbooks)?
But I see you recommend ext4 *without* journaling. Would adding journaling (or using standard ext3 with journaling) be so bad?
Now a little off topic: you say the 1st generation were crappy wrt wear levelling. You cite the pricey Intel drive as an example of the next generation. But what about, for example, the STEC in the Dell Mini 9? I was considering getting one of these. On their web site STEC claims that their wear levelling is such that the youngest block is used on every write. From the point of view of wear-levelling, that seems totally optimal (although as a trade-off it no doubt slows things down, with the Flash Translation Layer having to handle all this!)
Another important issue: I have heard many of these drives get very bad numbers of IOPS in smallish random writes. Is that something to be concerned about for a purchaser of a Dell Mini?
June 20th, 2009 at 9:15 am
@36
Fabricio,
My system seems OK after I changed all references from UUID=something to the real devices (/dev/sdXn). So after running tune2fs -O ^has_journal /dev/sdXn, edit /etc/fstab and /boot/grub/menu.lst and change everything to /dev/sdXn. The final tip is to tell the initrd what type your root partition is; you can do this by adding the kernel parameter rootfstype=ext4 to menu.lst.
If you cannot boot anyway (drop to shell symptom), you can boot by editing the grub command line before boot (press e key)[add/change rootfstype=ext4 and root=/dev/sdXn], after a semi-successful boot, edit /etc/fstab, /boot/grub/menu.lst and reboot. Everything should be ok now.
June 29th, 2009 at 2:52 pm
G’Day, Ted.
My question is not related directly to the article. But i think the answer is interesting to many people who use hdd cryptography in Linux (dm-crypt).
As you know dm-crypt (like most if not all hdd software cryptography solutions) works with data on the fly, i.e. data encryption-decryption is done in RAM and written to disk encrypted.
What ext3/ext4 filesystem features should be disabled/enabled to get best results when using dm-crypt?
Which features may be useful on a normal hdd partition but are useless on an encrypted partition (e.g., produce unneeded overhead for the hardware and/or OS), and vice versa?
For example i found in the net following opinion (http://www.saout.de/tikiwiki/tiki-index.php?page=EncryptedDevice):
“Use your favourite options, filesystem type etc (I use ext3) or just copy my options. Note: you do *not* want journalling, or else writes will have to be encrypted twice (once to journal and once when committing journal to final resting place).”
Other encryption tutorials suggest ext3 features like those which are used during filesystem creation on normal partitions (http://en.opensuse.org/Encrypted_Root_File_System_with_SUSE_HOWTO , https://help.ubuntu.com/community/EncryptedFilesystemLVMHowto etc.).
p.s.:
Thanks for the description of the noatime/relatime options. I have come across some articles on the net about noatime vs relatime, but none of them described the issue as thoroughly as you did.
June 30th, 2009 at 8:45 am
@40, Iuri Diniz:
hey! thank you very much for the suggestion. At the moment I don’t have much time to try it, but I will!
I will write here what happens on my machine. Thanks to you and Tytso for your work!!
July 1st, 2009 at 10:29 am
Great!!! It works... or so it seems.
After all the tests, the command “sudo dumpe2fs -h /dev/sda1” gives me this:
dumpe2fs 1.41.4 (27-Jan-2009)
Filesystem volume name:
Last mounted on:
Filesystem UUID: 36458922-0fbf-4758-abf4-3dc1f8bae3e7
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: journal_data_writeback
Filesystem state: not clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 245280
Block count: 979957
Reserved block count: 48997
Free blocks: 220843
Free inodes: 84902
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 239
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8176
Inode blocks per group: 511
Flex block group size: 16
Filesystem created: Fri May 1 18:54:35 2009
Last mount time: Wed Jul 1 16:02:05 2009
Last write time: Wed Jul 1 16:07:48 2009
Mount count: 3
Maximum mount count: 20
Last checked: Wed Jul 1 15:33:10 2009
Check interval: 15552000 (6 months)
Next check after: Mon Dec 28 14:33:10 2009
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Default directory hash: half_md4
Directory Hash Seed: ab7ab185-ad7e-4479-b5e3-e67ee2ca736c
Journal backup: inode blocks
So “Filesystem features” doesn’t have the “has_journal” option and there’s no journal size at the end. This means there’s no journal for that partition? Am I right?
Please tell me so I can rest easy.
thanks to all.
July 2nd, 2009 at 2:09 pm
@43: Fabrico,
Yes, if the filesystem features line printed by dumpe2fs does not include “has_journal”, the filesystem does not have a journal.
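The check described above can be scripted. The following is a minimal sketch, assuming the text printed by `dumpe2fs -h` has already been captured as a string; the helper names `filesystem_features` and `has_journal` are hypothetical and not part of e2fsprogs.

```python
def filesystem_features(dumpe2fs_text):
    """Return the set of feature names from captured `dumpe2fs -h` output."""
    for line in dumpe2fs_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            return set(value.split())
    return set()  # no features line found


def has_journal(dumpe2fs_text):
    """True if the features line lists has_journal."""
    return "has_journal" in filesystem_features(dumpe2fs_text)


# A shortened excerpt of the output quoted in the comment above.
sample = """\
Filesystem magic number: 0xEF53
Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Inode size: 256
"""
print(has_journal(sample))
```

Matching on whole feature names, rather than searching the raw text for the substring, avoids being fooled by other lines that mention the journal, such as "Journal backup".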
July 2nd, 2009 at 2:15 pm
@41: Yuriy,
Using cryptography doesn’t really change whether or not you want a particular filesystem feature or not. Journalling does require writing metadata blocks twice, yes: once in the journal and once in the final location on disk. On the other hand, without journalling, you have to run e2fsck on your filesystem after a system crash, and manual system administrator action may be required to repair the filesystem. This is true regardless of whether you are using dm-crypt or not.
Using ext4 will reduce the filesystem overhead by reducing the number of metadata blocks needed for large files (without extents the filesystem has to use many more indirect blocks for large files), but that’s an advantage you have whether you are using dm-crypt or not. The flip side is that some distributions don’t fully support ext4 yet. Fedora 11 seems to be doing quite well with ext4; Ubuntu 9.04 not so much, because they used an older kernel and I suspect screwed up one of their patch backports. So with Ubuntu, if you want to use ext4 you really need to use the 2.6.30 mainline kernel with 9.04, or to use the pre-release Alpha snapshots of Ubuntu 9.10. Again, all of this is true whether you are using dm-crypt or not.
July 10th, 2009 at 6:52 pm
[...] access at all any more. On my system I have seen no disadvantages so far, and the father of the idea has not found any either; he even ran benchmarks. Apparently ext2 or ext4 with [...]
July 16th, 2009 at 3:28 pm
[...] an IBM employee who actively works on the development of the Linux kernel. In the post “SSD’s, Journaling, and noatime/relatime” he gets into a very interesting discussion about how ext4 works, mainly on [...]
July 17th, 2009 at 1:18 pm
hi theodore! thanks for your help and support.
I successfully applied ext4 without a journal, and here’s the Italian guide:
http://www.uielinux.org/guide-e-tutorial/2-configurazione/188-ext4-senza-journaling-ottimo-per-dischi-ssd.html
I also translated some of your opinions and included a link to this blog.
PS:
Is this bug fixed now?
https://bugs.launchpad.net/bugs/197311
“Changed in util-linux (Ubuntu):
status: Confirmed → Fix Released ”
What does it mean? That it is now possible to remove the journal without that problem, so we don’t have to change the UUIDs to /dev/sdXX?
July 19th, 2009 at 9:30 pm
@48
Well, I also have a Portuguese guide here: http://blog.igdium.com/2009/06/ubuntu-904-alinhando-o-sistema-de.html
July 23rd, 2009 at 8:34 am
[...] A performance comparison of ext2, ext3, ext4 (with and without journals) [...]
August 13th, 2009 at 8:29 am
hey, very nice entry!
i am using a lenovo thinkpad t400s with a 128gb toshiba ssd and ubuntu 9.04. as filesystem, i use ext4 with noatime instead of relatime, and with a journal. should i rather use this ssd without a journal? thanks!
October 10th, 2009 at 9:33 am
[...] During installation I chose a custom partitioning: /dev/mmcblk0 /home ext4 7948 MB /dev/sda1 / ext4 6497 MB /dev/sda2 swap 1571 MB. So /home is on an SDHC card. Swap is just slightly larger than memory, so that suspend-to-disk can be done later if desired. For the reason why I chose ext4 on the SSD instead of ext2, see http://thunk.org/tytso/blog/2009/03/01/ssds-journaling-and-noatimerelatime. [...]
November 18th, 2009 at 11:20 pm
Thank you for all your educational posts on block devices and filesystems.
As you, and probably a lot of other people, including myself, prefer noatime over relatime, is there any chance this could be implemented as an option to write in the super block? I often use live systems on CD’s or USB sticks, and it would be very nice to only have to specify it once per filesystem and then ensure it will always be mounted with that option. Same goes for any mount option, I guess.
December 17th, 2009 at 12:04 am
@gutman
I have basically the same system. Do you have any more information on the quality/performance of this SSD? Are you still happy with ext4 with a journal, or have you found something better? Thanks; a reply would be appreciated. Can anyone else comment on these SSDs? Thanks.
December 31st, 2009 at 6:47 am
@14-ish
With kernels 2.6.27 through 2.6.31 and an Acer AspireOne AOA 101, I have found that the geometry of the block device is configured differently after a resume from RAM, which causes massive filesystem corruption with all EXT variants. The symptoms are in the kernel messages – after resume I get numerous out of range errors accessing the device, and on next FSCK I lose just about every file that has been touched since resume, as well as random corruption of other files.
Mostly because of my tiny SSD, I keep my /home on a 16G SD card in the recessed SD slot on the left. Needless to say I don’t use suspend to RAM on this netbook any more. Resume from RAM is about 14 seconds anyhow. Booting only takes twice that long.
I’m not sure how to report the block device configuration problem, and I’m not really looking forward to corrupting my /home again to replicate the error to create a bug report, lol. If it does happen again, I’ll try to be a better community member and get that bug report in.
Next time any of you are reporting corruption of an SD card’s filesystem ask yourself: have I suspended to RAM? Have I got an EXT filesystem?
December 31st, 2009 at 9:23 am
@55 Some relevant details:
In the first paragraph the block device I’m referring to is /dev/mmcblk0, which is a SanDisk 16G card in a JMicron controller with PCI ID 197b:2381
mmc0: SDHCI controller on PCI [0000:01:00.0] using ADMA
sdhci-pci 0000:01:00.2: SDHCI controller found [197b:2381] (rev 0)
sdhci-pci 0000:01:00.2: PCI INT A -> GSI 16 (level, low) -> IRQ 16
sdhci-pci 0000:01:00.2: Refusing to bind to secondary interface.
sdhci-pci 0000:01:00.2: PCI INT A disabled
[I'm guessing that explains the high CPU use when writing to this device]
mmc0: new SDHC card at address 0007
mmcblk0: mmc0:0007 SD16G 15.3 GiB
mmcblk0: p1
jmb38x_ms 0000:01:00.3: PCI INT A -> GSI 16 (level, low) -> IRQ 16
jmb38x_ms 0000:01:00.3: setting latency timer to 64
[This contradicts the interrupt rejection above]
EXT2-fs warning (device mmcblk0p1): ext2_fill_super: mounting ext3 filesystem as ext2
[expect this is because I turned the journal off]
pciehp 0000:00:1c.0:pcie04: Device 0000:01:00.0 already exists at 0000:01:00, cannot hot-add
pciehp 0000:00:1c.0:pcie04: Cannot add device at 0000:01:00
pciehp 0000:00:1c.0:pcie04: service driver pciehp loaded
pciehp 0000:00:1c.1:pcie04: Bypassing BIOS check for pciehp use on 0000:00:1c.1
pciehp 0000:00:1c.1:pcie04: HPC vendor_id 8086 device_id 27d2 ss_vid 0 ss_did 0
[not sure why pciehp is trying to reconfigure all my PCI devices at this late stage]
December 31st, 2009 at 5:27 pm
Hi Ted,
After sleeping on it, I am not sold on the stats given in the tables. These are for workloads a normal user would not likely see in their day-to-day use. In reality writes will be more sparse over time, resulting in many more syncs per write, which in turn will prevent a lot of write combining.
How bad can bad be?
For each of these I deleted the file; synced, and re-ran the test 20 times then took the lowest time (trying to find the hardware limit):
$ time bash -c 'dd if=/dev/zero of=nominal_case bs=4096 count=1; sync'
reveals a ‘real’ time of 92ms
$ time bash -c 'dd if=/dev/zero of=nominal_case bs=409600 count=1; sync'
reveals a ‘real’ time of 687ms
Running sync a few thousand times reveals that it requires 58ms to run, on average. Subtracting that from the above numbers I get:
Create a 4K file with a single 4K write: 34ms
Create a 400k file with a single 400k write: 629ms = 6.29ms per block
Now for the worst-case scenario: create a 4k file with 4096 single-char writes:
$ time bash -c 'for (( a=0 ; a<4096 ; a++ )); do echo -n x >>pathological_case; sync; done'
reveals a ‘real’ time of 5m35.320s on an idle machine.
Subtracting the sync times I get 96168ms. That’s a 2828:1 performance ratio for the single-block file. For the 100 block file, it’s 15289:1.
This is not hard to predict. For every write to the file the mtime in the directory will change, the file data will change. If journaling is on, that will change too. Depending on the filesystem, the block bitmaps may change as well, so we can expect to see that if 1 char is written per sync period we may well see on the order of 4 blocks written = 16K:1 write amplification, which in turn explains much of the performance penalty seen in this test.
Even the ideal X25 hardware will suffer from this, unless it has some internal heuristic which recognises the incremental writes and takes advantage of some flash devices’ ability to write very small blocks, even a byte at a time, and also assuming that the filesystem does not prevent it from doing this by allocating a new block for every write, as some of the journaling filesystems can/do.
So again, I don’t believe that the tables presented are indicative of normal use.
The reality is probably somewhere around 3x the overhead suggested, with journaling plus atime resulting in perhaps more than double the write volume, and journaling alone contributing perhaps more like 30-50% write load for lighter use such as browsing, which I think most people do a lot more of than compiling…
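The ratios quoted in this comment can be reproduced from its own timings. The sketch below only redoes that arithmetic; the function name `interface_amplification` is hypothetical, and the figure of four blocks touched per one-byte write is the commenter's estimate, not a measurement.

```python
# Timings quoted in the comment above, in milliseconds.
SMALL_WRITE_MS = 92 - 58              # one 4K write, minus the average sync cost
LARGE_WRITE_MS = 687 - 58             # one 400K write, minus the average sync cost
PER_BLOCK_MS = LARGE_WRITE_MS / 100   # 400K is one hundred 4K blocks
PATHOLOGICAL_MS = 96168               # 4096 one-byte writes, sync time already removed


def interface_amplification(payload_bytes, blocks_touched, block_size=4096):
    """Bytes crossing the block-device interface per byte of payload."""
    return blocks_touched * block_size / payload_bytes


ratio_single_block = PATHOLOGICAL_MS / SMALL_WRITE_MS
ratio_per_block = PATHOLOGICAL_MS / PER_BLOCK_MS

print(round(ratio_single_block), round(ratio_per_block))
print(interface_amplification(payload_bytes=1, blocks_touched=4))
```

The last line corresponds to the "16K:1" figure: one byte of payload per sync period against four 4096-byte blocks written.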
December 31st, 2009 at 7:08 pm
Wil,
The X-25 hardware will demonstrate minimal write amplification since it has an indirection layer which maps 512 byte sectors into partially written flash blocks. Intel doesn’t say what size erase blocks it is using, but let’s say that it’s 64k. That means there are 128 sectors in each erase block. Each time the system writes a 512 byte sector, the new contents of that sector will be written into a partially written erase block, and then the indirection layer will be updated so that the sector that had contained the previous contents of that sector is marked as no longer in use. Eventually, when an erase block consists of completely superseded sectors, it is erased and then made available for new contents. If the X-25 comes close to completely exhausting available unwritten sectors, then the X-25 controller will pick a flash block that contains the largest number of previously used sectors, and copy the still-in-use sectors so that the flash block can be erased. In effect, it’s a garbage collection pass.
As a result, it doesn’t matter whether the writes are contiguous or sparsely separated. Writing non-contiguous sectors will have some overhead, since it results in more garbage collections, and copying still-in-use sectors is extra overhead. Still, X-25’s write amplification factor is claimed by Intel to be 1.1, as compared to a factor of 20 found in most naively implemented flash devices. (This includes most USB thumb drives, SD cards, and most older SSD’s. There are a few newer SSD’s that are competently implemented, but there are many SSD’s which are complete crap, and after the X-25, any SSD which isn’t implementing this kind of flash translation layer is complete crap.)
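The indirection and garbage-collection scheme described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Intel's design: the 128-sector erase block follows the 64k guess in the comment, while the number of blocks, the amount of spare area, the greedy victim choice, and the uniformly random workload are all assumptions made for illustration, and `write_amplification` is a hypothetical name.

```python
import random


def write_amplification(pattern, n_blocks=64, sectors_per_block=128,
                        spare_blocks=8, host_writes=100_000, seed=0):
    """Toy log-structured FTL: flash sector writes / host sector writes."""
    rng = random.Random(seed)
    n_logical = (n_blocks - spare_blocks) * sectors_per_block
    location = {}                             # logical sector -> erase block holding it
    live = [set() for _ in range(n_blocks)]   # still-in-use sectors per erase block
    used = [0] * n_blocks                     # sectors programmed since last erase
    free = list(range(1, n_blocks))           # erased blocks ready for new contents
    state = {"active": 0, "flash_writes": 0}

    def program(lsn):
        # Append the sector to the partially written block, then update the map.
        if used[state["active"]] == sectors_per_block:
            state["active"] = free.pop()
        old = location.get(lsn)
        if old is not None:
            live[old].discard(lsn)            # the previous copy is superseded
        block = state["active"]
        live[block].add(lsn)
        used[block] += 1
        location[lsn] = block
        state["flash_writes"] += 1

    def garbage_collect():
        # Greedy choice: the full block with the fewest still-in-use sectors.
        full = [b for b in range(n_blocks)
                if b != state["active"] and used[b] == sectors_per_block]
        victim = min(full, key=lambda b: len(live[b]))
        for lsn in list(live[victim]):
            program(lsn)                      # copy survivors into the active block
        used[victim] = 0                      # erase the victim
        free.append(victim)

    for i in range(host_writes):
        while len(free) < 2:                  # keep one erased block in reserve
            garbage_collect()
        if pattern == "sequential":
            lsn = i % n_logical
        else:
            lsn = rng.randrange(n_logical)
        program(lsn)
    return state["flash_writes"] / host_writes


print(write_amplification("sequential"), write_amplification("random"))
```

In this model a sequential overwrite pattern always finds a fully superseded block to erase, so nothing needs copying, while uniformly random overwrites force some copying of still-in-use sectors. How far above 1 the random case lands depends on the assumed spare area and workload, so the toy should not be read as a prediction for any real drive.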
As far as your results are concerned, the time needed to do certain operations can be an indication of write amplification, yes. But note that most of the time, applications are not constantly forcing blocks to the disk using the sync command. If you do, it’s true that it will force writes to the allocation bitmaps, to the journal, etc. But if you aren’t forcing a sync after every single 4k write, then write combining of multiple updates to the file system metadata and the journal is very likely to occur. Most desktop workloads don’t look like mail server workloads, which tend to force a sync after typically writing two files (the qf* and df* files if you’re sendmail, or the *-D and *-H files if you are using mutt, etc).
December 31st, 2009 at 7:15 pm
@55: Wil,
I doubt the problem is due to a different hard drive geometry, since the Linux kernel doesn’t care about the hard drive geometry. The fdisk program cares about it, but only because the bootloaders care about HD geometry if they are using the oldest BIOS interfaces.
It does sound like the SDHCI controller isn’t getting its state properly saved before the system is suspended, such that some or all reads and writes aren’t being accepted afterwards. This is going to cause problems no matter what file system you are using. I can’t really help you debug this; I’d suggest sending a note to LKML. Maybe Rafael Wysocki (who does a lot with suspend/resume). Or you might send a note to the linux-mmc@vger mailing list. According to the MAINTAINERS file, the sdhci driver is orphaned, which means there is no active maintainer, but maybe someone on the LKML or the linux-mmc list will be able to help you.
Good luck!
December 31st, 2009 at 8:10 pm
@58 Sorry, I should have been more specific. I mean data : write bandwidth amplification, not hardware write amplification after the fact due to rewriting entire eraseblocks for a 512-byte disk block flush. Again, to be clear, I was talking about the ratio found when comparing a single byte written to a file followed by a sync, to the number of bytes passing the interface to the block device. This ratio is inherent in the way the filesystem uses blocks to represent storage, and doesn’t even have anything to do with SSDs.
When files are being updated slowly, ie not at Bonnie-like speeds, then the system’s natural sync rate applies. If I only modify a small number of files in that period then the write combining done between each flush can only merge that limited number of changes. Ie if I write 1000 files over the course of 1 hour, that will average to about 1 file every 3 seconds. This cannot be handled as efficiently as if all 1000 are written in 15 seconds.
I hope this clarifies what I’m talking about. Real-world usage is far more sparse in the time domain than your git-clone; make; make clean example, and that is why I’m skeptical of your numbers when applied to an average user.
December 31st, 2009 at 9:14 pm
Aha, someone has figured out the Acer AspireOne SD card problem. This probably applies to a lot of other SD controllers as well.
http://en.gentoo-wiki.com/wiki/Acer_Aspire_One_A110L#SD_Cards_and_suspend
Long story short, the kernel unmounts and remounts the device during suspend/resume. EXT2 and EXT3 fail to unmount, the kernel increments the device number (ie mmcblk2p1 instead of mmcblk1p1) and therefore you end up writing to a nonexistent device.
2 workarounds include : use UNSAFE_RESUME kernel option for the MMC driver, which will prevent the kernel from messing with the device, or use LVM, which will redirect the mount seamlessly, and add lovely things like dm_crypt, snapshotting, etc.
December 31st, 2009 at 11:21 pm
@60:
Well, I still don’t believe that most users have workloads that use a huge number of sync’s. And very often people will have workloads where a number of files are written in groups (i.e., when they type “make”, or when updating the software on their laptop to fix a security bug in firefox or GNOME). If the user is editing a document in a word processor, they won’t be writing 1000 files an hour. Maybe they hit the save button once every five minutes, but even if they hit the save button once a minute, that’s still only 60 files an hour.
So what workload do you think users would have where they will both (a) be writing 1000 files a minute, and (b) be spreading those writes evenly over the hour, so there is no write-combining, or where they will be writing files slowly? In fact, the vast majority of files are written all at once, and are not appended to slowly. The only exceptions to this are log files and mail spool files. But those are really the exceptions that prove the rule….
January 1st, 2010 at 8:32 am
@62: The 1000 files in a minute is the ideal circumstance, like your make example. That allows lots of write combining, even preventing some temporary files from ever making it to disk…
It’s pretty easy to list examples of slow updates. A couple dozen RSS feeds open in an RSS aggregator (these creep in increments of around 300 bytes). A bittorrent client keeping a client list for a number of active torrents. A web developer with their browser set to refresh a page every 15 seconds. 2 chat clients logging all friend messages and status changes. The same web developer running their own Apache with PHP, editing their script and debugging bits here and there in a relatively constant stream. Many of these programs keeping their own logs and sending notices to the system message log…. it’s really not that difficult to generate a write every 3 seconds.
I still think you’re largely right, but I still think it’s going to be a lot worse than your tables, which are representative of the ideal circumstances for which the filesystems and VFS are tuned – a heavy workload.
Probably the only way to really prove this would be to have counters in the VFS, FS, and block layers which track the total number of block writes generated by each of:
1) data changes
2) journal changes
3) metadata changes
… and reveal them as counters in /proc for a given mount so that benchmarks can be run on various scenarios, and so that the write bandwidth of the journal can be compared directly to the volume of input data.
This sounds pretty similar to what you started out the whole article with.
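For a coarse version of such a benchmark, ext4 publishes per-filesystem totals in sysfs: to my understanding, `/sys/fs/ext4/<device>/lifetime_write_kbytes` is the lifetime counter mentioned at the start of the article. It does not split the total into data, journal, and metadata writes as proposed above. The sketch below reads that counter before and after a workload; the helper names and the device name `sda1` in the usage note are hypothetical.

```python
from pathlib import Path


def read_kbytes(dev, name="lifetime_write_kbytes", root="/sys/fs/ext4"):
    """Read one ext4 sysfs write counter (in kilobytes) for a device name."""
    return int((Path(root) / dev / name).read_text().strip())


def kbytes_written_during(workload, dev, root="/sys/fs/ext4"):
    """Run workload() and return how far the lifetime counter advanced."""
    before = read_kbytes(dev, root=root)
    workload()
    after = read_kbytes(dev, root=root)
    return after - before
```

A call such as `kbytes_written_during(run_benchmark, "sda1")`, with `run_benchmark` supplied by the caller and ending in a sync, would give one number per scenario. Comparing a journaled and a journal-less filesystem under the same slow-trickle workload would then test the claim made in this comment.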
PS – running without a journal on something as flaky as an SD card, I do keep regular backups. I use rsync on an hourly cron job to an external disk (when present) and duplicity nightly to a remote server, so I feel pretty comfortable.
PPS, thanks for all your pointers re MMC / SDHCI, which put me on the track of a good solution – using LVM to manage the removable media. I wish it came configured out of the box like that. Once I get it tweaked to my tastes I will pass it on as a suggestion to UBUNTU/MID team and Debian. Very much appreciated.
January 2nd, 2010 at 8:41 am
@54, ZNiP
Yes, I am still happy with this solution! I am using ext4 with noatime and journal on the toshiba ssd (THNS128GG4BAAA).
Small write-benchmark with ubuntu 9.10/64bit:
root@t400s:/home/user# dd if=/dev/zero of=test.dat bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 23.9574 s, 219 MB/s
January 2nd, 2010 at 4:17 pm
That’s write-through caching performance, not real sustainable write performance.
To find the real sustainable sequential write speed, try this:
dd if=/dev/zero of=_filler bs=1M count=10 oflag=sync; rm -f _filler
My AOA SSD’s performance (Z-P230, model SSDPAMM0008G1) is roughly 5.4M/s, a little over half what Intel claim(ed) it is (38M/s read, 10M/s write). The drive is at 88% capacity, so that’s really not too bad. SSD drives tend to slow down as they fill up and get older, for a variety of reasons. Intel has pulled all specs for this device from their site. I’m guessing they’re not especially proud of it.
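The same kind of measurement can be sketched without dd. The following assumes a Unix-like system where `os.O_SYNC` is available, and uses it to make each write wait for the device, which is what dd's `oflag=sync` requests; the function name `sync_write_mb_per_s` and its defaults are hypothetical.

```python
import os
import time


def sync_write_mb_per_s(path, total_mb=10, chunk_mb=1):
    """Write total_mb of zeros with O_SYNC and return the rate in MB/s."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    chunks = total_mb // chunk_mb
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    start = time.perf_counter()
    try:
        for _ in range(chunks):
            os.write(fd, chunk)       # returns only once the data is on the device
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    os.remove(path)                   # clean up, like the rm -f above
    return chunks * chunk_mb / elapsed
```

Calling `sync_write_mb_per_s("_filler")` from a directory on the filesystem under test gives a figure comparable to the dd command above. A single short run is noisy, so repeating it a few times is advisable.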
January 2nd, 2010 at 5:37 pm
@65:
Actually, Intel’s X25-M Product Manual claims a sustained write performance of up to 70MB/s, but that’s not the important figure. The much more interesting number, which most SSD’s, including your SD flash card, can’t match is the random 4k write benchmark using iometer. Traditionally flash devices, including most SD cards, were used in cameras, where you tended to write nice big images, and so you could get away with a relatively primitive filesystem such as FAT. But a Linux/Unix system (or even Windows or MacOS) will be writing small files and big files, and will tend to have a much more random workload that includes many more single-block writes.
Using an iometer queue depth of 3, with 4k random writes, the X25-M can sustain a write bandwidth of 54.5 MB/s. In contrast, a crappy JMicron JMF602B-based SSD can do maybe 21 k/s, with an average latency of 500ms and a worst-case latency of 2 seconds. That means the system could take up to 2 seconds to write a random 4k block!!! This is what really matters if you’re trying to engineer for performance. (The X25-M has an average random 4k write latency of 0.089ms, and a worst case write latency of around 100ms.) 3 orders of magnitude difference when measuring 4k random write latency is what separates the big boys from the also-rans. Sequential write speed matters only if your work load is a digital camera taking pictures, or something equivalent.
Try running iometer with a 4k random write size and a queue depth of 3, and see what you get on your SD card, and report back if you dare.
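The "3 orders of magnitude" remark can be checked against the latencies quoted in this comment. The sketch below only redoes that arithmetic with the numbers as given; the constant names are made up for illustration.

```python
import math

# 4k random write latencies quoted above, in milliseconds.
X25M_AVG_MS = 0.089
JMICRON_AVG_MS = 500
X25M_WORST_MS = 100
JMICRON_WORST_MS = 2000

avg_ratio = JMICRON_AVG_MS / X25M_AVG_MS        # average-latency gap
orders_of_magnitude = math.log10(avg_ratio)     # same gap on a log10 scale
worst_ratio = JMICRON_WORST_MS / X25M_WORST_MS  # worst-case gap

print(round(avg_ratio), round(orders_of_magnitude, 2), worst_ratio)
```

On these figures the average-latency gap is between three and four orders of magnitude, while the worst-case gap is much smaller.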
January 2nd, 2010 at 5:37 pm
Oops, the link to the X25-M product manual got filtered out. Here it is:
http://download.intel.com/support/ssdc/hpssd/x25m/sb/x18mx25msatassdproductmanual34nm322296.pdf
January 2nd, 2010 at 6:52 pm
hehe, yes, I know the X25 is fast. I have been following it, the OCZ Vertex series, and some other pretenders.
The SuperTalent FusionIO PCIE 8x card blows away all the SSD competition. It’s not really an SSD – it’s more like a NAND memory expansion with up to 2TB of address space. It beats the X25-M by a factor of 25 or more on random 4k block writes (off the top of my head).
I have no Windows platform to run iometer on, the Linux binary has no command-line options to directly run tests (that I can find), and the documentation link for the latest version on the IoMeter website currently dumps Perl code instead of running the script. I’m afraid I’m not going to get any mileage out of that app…
http://www.iometer.org/cgi-bin/jump.cgi?URL=http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/iometer/iometer/Docs/Iometer.pdf?rev=HEAD
January 2nd, 2010 at 11:07 pm
@68: Wil,
It wasn’t clear to me you understood that the X25-M is fast, if you’re comparing it to a SD card.
The Fusion IOXtreme drive is fast, yes, but it’s also about 4 times the cost of an X25-M drive, and you can’t boot off of one, and it won’t fit in a laptop. But sure, if you have a PCIE slot, and you can afford it, and you need its speed, then it’s a good choice.
The trick with using Iometer under Linux is apparently to put the options in the configuration file, but it’s not at all well documented, and I don’t have time to try to figure out the config file from the sources. But iometer is open source, so hopefully someone will get around to fixing it up and documenting it better for operation under Linux.
January 3rd, 2010 at 3:06 am
That X25-M SSDSA2MH160G2R5 part is looking pretty inviting. There’s an OCZ part that’s a little cheaper, faster, smaller.
Really, if you have a laptop in the $1000+ range, what can anyone add to it that will improve the overall performance as much for the money as putting one of the above SSDs into it?
I’m sure the manufacturers know this, but it’s hard to sell 1/10th the storage for the same money, regardless of the performance.
There’s still the question of reliability. Both Intel and OCZ state “1.5M hours MTBF” then offer a 1 or 2 year warranty. Hello? 1.5M hours is 171 years! If there’s a shred of honesty behind their statements then they would back them with lifetime unrated replacement warranties. So take both with a sack of salt.
I’ll leave checking OCZ’s track record on SSD parts as an exercise for the reader. Spending that much money without doing research is something I would never encourage anyone to do. BTW, that was a subtle hint to check OCZ’s track record before buying their parts…
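The "1.5M hours is 171 years" arithmetic above can be checked with a few lines of Python. The second calculation is an extra assumption on my part: it treats MTBF as a statistical rate under a constant-failure-rate (exponential) model, which the vendors' spec sheets quoted here do not state, and converts it into a rough annualized failure rate.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

mtbf_hours = 1.5e6  # the "1.5M hours MTBF" figure quoted above

# Naive reading: MTBF as the lifetime of a single drive.
mtbf_years = mtbf_hours / HOURS_PER_YEAR
print(f"{mtbf_years:.0f} years")

# Assumed model (not from the vendors): constant failure rate, so the
# chance that a given drive fails within one year of continuous use is
# 1 - exp(-hours_per_year / MTBF).
annual_failure_rate = 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)
print(f"{annual_failure_rate:.2%} per year")
```

Under that assumed model, the same MTBF figure works out to a bit under 0.6% of drives failing per year, which is a very different statement from "one drive lasts 171 years".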
January 3rd, 2010 at 3:11 am
@69: I was comparing the Intel Z-P230 to an SD card. They have very similar performance characteristics, in fact a high-end SD card outperforms the Intel SSD in all but max linear read, and even then by <10%. Pretty sad.
January 3rd, 2010 at 8:42 am
@69: Wil,
Sorry, I missed the Z-P230 reference. I must have been reading your post too quickly, and I probably mistook it for an Atom CPU part number…. My bad. Yeah, as far as I can tell that Intel SSD was something cheap cheap cheap that was intended for sale directly to netbook manufacturers. It has since disappeared without a trace.
Note BTW, that if price is an issue and you don’t need that much capacity (how much do you need for a netbook or even a laptop if most of your software is on the cloud? I need 80GB because I’m a developer; if I nuked all of the source trees I could probably live with 40GB) you can also get a 40GB Kingston SSDNow V SSD which uses the Intel controller, but with half the flash channels. You have to be careful though; the 64G and 128G Kingston SSDNow V use the JMicron controller. (V stands for value, which can sometimes also be another way of saying, buyer needs to be careful.)
With OCZ, yes, you have to be careful. They produce a large number of devices at different price points, and clearly at different levels of quality.
January 3rd, 2010 at 8:46 am
I should add that if you have a larger laptop, such as a Lenovo Thinkpad T series, you can use two drives: a 40GB or 80GB SSD where you have your OS and your home directory, and a 500GB 5400rpm hard drive where you have your build directories, music, images, etc. That’s what I do these days… the source tree is on the SSD for speed, but the build trees where the compiler deposits the object files are on a 500GB disk. That way I get the best of both worlds. I use the SSD for frequently accessed files or files where fast access will improve the “feel” of my system, and I use the hard drive for bulk storage and for writes which, even if slower (such as writing object files), won’t hinder the overall speed of my system.
This is also very easy to do for a desktop, of course — although these days I generally use my laptop instead of a desktop.
January 15th, 2010 at 6:52 am
Hi Ted,
How do you use the ext4 information about the lifetime write in practice?
I’ve seen your kernel commit:
commit afc32f7ee9febc020c73da61402351d4c90437f3
Author: Theodore Ts’o
Date: Sat Feb 28 19:39:58 2009 -0500
ext4: Track lifetime disk writes
Add a new superblock value which tracks the lifetime amount of writes
to the filesystem. This is useful in estimating the amount of wear on
solid state drives (SSD’s) caused by writes to the filesystem.
but I don’t know how I can actually see that information? Does it appear
in the output of some tool made for ext4?
January 24th, 2010 at 12:41 pm
Ah, I’ve just found the answer to my question in #74!
dumpe2fs /dev/sda2 |grep Lifetime
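If you want to log that value over time instead of eyeballing the grep output, a minimal Python sketch could look like the following. The helper names are hypothetical, the sample text is illustrative rather than real output from any system, and the exact field formatting may differ between e2fsprogs versions. Running dumpe2fs normally requires root.

```python
import re
import subprocess


def parse_lifetime_writes(dumpe2fs_text):
    """Return the value of the 'Lifetime writes' field, or None if absent."""
    match = re.search(r"^Lifetime writes:\s*(.+)$", dumpe2fs_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def read_lifetime_writes(device):
    """Run 'dumpe2fs -h' (superblock info only) and extract the field."""
    result = subprocess.run(
        ["dumpe2fs", "-h", device],
        capture_output=True, text=True, check=True,
    )
    return parse_lifetime_writes(result.stdout)


# Illustrative sample only; not taken from a real filesystem.
sample = "Filesystem volume name:   <none>\nLifetime writes:          13 GB\n"
print(parse_lifetime_writes(sample))
```

Calling `read_lifetime_writes("/dev/sda2")` before and after a workload gives a coarse idea of how much that workload wrote, in the same spirit as the measurements in the post.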
February 12th, 2010 at 7:36 am
[...] 12 February 2010: the following results were achieved with the ext3 file system, using the noatime option. Contrary to my expectations it’s not better, but sucks [...]