SSD’s, Journaling, and noatime/relatime

March 2, 2009
Filesystems
Linux
SSD

On occasion, you will see the advice that the ext3 file system is not suitable for Solid State Disks (SSD’s) because of the extra writes caused by journaling, and that Linux users with SSD’s should therefore use ext2 instead. But is this folk wisdom actually true? This weekend, I decided to measure what the write overhead of journaling really is in practice.

For this experiment I used ext4, since I recently added a feature to track the amount of data written to the file system over its lifetime (to better gauge the wear and tear on an SSD). Ext4 also has the advantage that, starting in 2.6.29, it can operate with or without a journal, which let me run a controlled experiment in which only that one variable changed. The test workload I chose was a simple one:

Clone a git repository containing a linux source tree
Compile the linux source tree using make -j2
Remove the object files by running make clean
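
One way to collect per-step numbers for a workload like this is to sample the file system's lifetime write counter before and after each command. The sketch below is only an illustration, not the script used for these measurements: it assumes the kernel exposes the counter as /sys/fs/ext4/<device>/lifetime_write_kbytes, and the device name and paths in the usage comment are placeholders.

```python
import os
import subprocess
from pathlib import Path


def read_lifetime_kbytes(counter_path):
    """Return the lifetime write counter (in kilobytes) stored in counter_path."""
    return int(Path(counter_path).read_text().strip())


def megabytes_written(counter_path, command, cwd=None):
    """Run command and return how many megabytes the counter grew by."""
    before = read_lifetime_kbytes(counter_path)
    subprocess.run(command, cwd=cwd, check=True)
    os.sync()  # flush dirty data so delayed writes are included in the count
    after = read_lifetime_kbytes(counter_path)
    return (after - before) / 1024.0


# Hypothetical usage; "dm-3" and the paths are placeholders, not from the post:
#   counter = "/sys/fs/ext4/dm-3/lifetime_write_kbytes"
#   megabytes_written(counter, ["make", "-j2"], cwd="/mnt/test/linux")
```

Taking the counter path as a parameter keeps the helper independent of any particular device and makes it easy to try against an ordinary file first.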

For the first test, I used no special mount options; the only difference between the two runs was the presence or absence of the has_journal feature. (That is, the first file system was created using mke2fs -t ext4 /dev/closure/testext4, while the second file system was created using mke2fs -t ext4 -O ^has_journal /dev/closure/testext4.)

<table border="1" cellspacing="1" cellpadding="2">
<caption>Amount of data written (in megabytes) on an ext4 filesystem with default mount options</caption>
<tr>
  <td align="center">
    Operation
  </td>
  
  <td align="center">
    with journal
  </td>
  
  <td align="center">
    w/o journal
  </td>
  
  <td align="center">
    percent change
  </td>
</tr>

<tr>
  <td>
    git clone
  </td>
  
  <td align="right">
    367.7
  </td>
  
  <td align="right">
    353.0
  </td>
  
  <td align="center">
    4.00%
  </td>
</tr>

<tr>
  <td>
    make
  </td>
  
  <td align="right">
    231.1
  </td>
  
  <td align="right">
    203.4
  </td>
  
  <td align="center">
    12.0%
  </td>
</tr>

<tr>
  <td>
    make clean
  </td>
  
  <td align="right">
    14.6
  </td>
  
  <td align="right">
    7.7
  </td>
  
  <td align="center">
    47.3%
  </td>
</tr>
</table>

  <table border="1" cellspacing="1" cellpadding="2">
    <caption>Amount of data written (in megabytes) on an ext4 filesystem mounted with noatime</caption> <colgroup align="left"> </colgroup> <colgroup align="right"> </colgroup> <colgroup align="right"> </colgroup> <colgroup align="right"></colgroup> <tr>
      <td align="center">
        Operation
      </td>
      
      <td align="center">
        with journal
      </td>
      
      <td align="center">
        w/o journal
      </td>
      
      <td align="center">
        percent change
      </td>
    </tr>
    
    <tr>
      <td>
        git clone
      </td>
      
      <td align="right">
        367.0
      </td>
      
      <td align="right">
        353.0
      </td>
      
      <td align="center">
        3.81%
      </td>
    </tr>
    
    <tr>
      <td>
        make
      </td>
      
      <td align="right">
        207.6
      </td>
      
      <td align="right">
        199.4
      </td>
      
      <td align="center">
        3.95%
      </td>
    </tr>
    
    <tr>
      <td>
        make clean
      </td>
      
      <td align="right">
        6.45
      </td>
      
      <td align="right">
        3.73
      </td>
      
      <td align="center">
        42.17%
      </td>
    </tr>
  </table>
  
  <p>
    </center>
  </p>
  
  <p>
    &nbsp;
  </p>
  
  <p>
    Mounting with noatime reduces the extra cost of the journal in the <tt>git clone</tt> and <tt>make</tt> steps to just under 4%. What this shows is that most of the extra metadata cost without the noatime mount option was caused by updates to the last access time of kernel source files and directories.
  </p>
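<p>
The "percent change" column appears to be the extra data written with a journal, expressed as a share of the with-journal total. That reading is inferred from the numbers rather than stated anywhere in the post; under that assumption, the noatime figures can be reproduced with a short sketch:
</p>

```python
def journal_overhead_percent(with_journal_mb, without_journal_mb):
    """Share of the journaled write volume that is attributable to the journal."""
    return (with_journal_mb - without_journal_mb) / with_journal_mb * 100.0


# Values copied from the noatime table (megabytes written).
noatime_results = {
    "git clone": (367.0, 353.0),
    "make": (207.6, 199.4),
    "make clean": (6.45, 3.73),
}

for operation, (with_journal, without_journal) in noatime_results.items():
    overhead = journal_overhead_percent(with_journal, without_journal)
    print(f"{operation:<10} {overhead:6.2f}%")
```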
  
  <h2>
    The relatime mount option
  </h2>
  
  <p>
    There is a newer alternative to the noatime mount option, <b>relatime</b>. The relatime mount option updates the last access time of a file only if the last modified time or the last inode change time is newer than the last access time. This allows programs to determine whether a file has been read since it was last modified. The usual (actually, only) example that is given of such an application is the mutt mail-reader, which uses the last access time to determine if new mail has been delivered to Unix mail spool files. Unfortunately, relatime is not free. As you can see below, it has roughly double the overhead of noatime (but roughly half the overhead of using the standard POSIX atime semantics):
  </p>
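<p>
The relatime rule, and the check a mail reader can build on top of it, can be sketched in a few lines. This is an illustration of the rule as described here, not the kernel's code; the function names are made up for the example, and how the kernel treats exactly equal timestamps is not covered.
</p>

```python
import os


def relatime_would_update(atime, mtime, ctime):
    """Return True if a read should refresh atime under the rule described above."""
    return mtime > atime or ctime > atime


def file_read_since_modified(path):
    """What a mail reader can infer: atime newer than mtime means 'already read'."""
    st = os.stat(path)
    return st.st_atime > st.st_mtime
```

<p>
Under noatime the second check stops being meaningful, because the access time is never refreshed after the file is read.
</p>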
  
  <p>
    <center>
      </p> 
      
      <table border="1" cellspacing="1" cellpadding="2">
        <caption>Amount of data written (in megabytes) on an ext4 filesystem mounted with relatime</caption> <colgroup align="left"> </colgroup> <colgroup align="right"> </colgroup> <colgroup align="right"> </colgroup> <colgroup align="right"></colgroup> <tr>
          <td align="center">
            Operation
          </td>
          
          <td align="center">
            with journal
          </td>
          
          <td align="center">
            w/o journal
          </td>
          
          <td align="center">
            percent change
          </td>
        </tr>
        
        <tr>
          <td>
            git clone
          </td>
          
          <td align="right">
            366.6
          </td>
          
          <td align="right">
            353.0
          </td>
          
          <td align="center">
            3.71%
          </td>
        </tr>
        
        <tr>
          <td>
            make
          </td>
          
          <td align="right">
            216.8
          </td>
          
          <td align="right">
            203.7
          </td>
          
          <td align="center">
            6.04%
          </td>
        </tr>
        
        <tr>
          <td>
            make clean
          </td>
          
          <td align="right">
            13.34
          </td>
          
          <td align="right">
            6.97
          </td>
          
          <td align="center">
            47.75%
          </td>
        </tr>
      </table>
      
      <p>
        </center>
      </p>
      
      <p>
        &nbsp;
      </p>
      
      <p>
        Personally, I don&#8217;t think relatime is worth it. There are other ways of working around the issue with mutt: for example, you can use Maildir-style mailboxes, or you can use mutt&#8217;s check_mbox_size option. If the goal is to reduce unnecessary disk writes, I would mount my file systems using noatime, and use other workarounds as necessary. Alternatively, you can use <tt>chattr +A</tt> to set the noatime flag on all files and directories where you don&#8217;t want atime updates, and then clear the flag for the Unix mbox files where you do care about them. Since the noatime flag is inherited by default, you can get this behaviour by running <tt>chattr +A /mntpt</tt> right after the filesystem is first created and mounted; all files and directories created in that file system will then inherit the noatime flag.
      </p>
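<p>
Before changing anything, it can help to see which atime policy each mounted file system is actually using. The sketch below parses text in the format of <tt>/proc/mounts</tt>; the function names are invented for this example, and it assumes the options appear literally as <tt>noatime</tt> or <tt>relatime</tt> in the fourth field.
</p>

```python
def atime_policy(mount_options):
    """Classify a comma-separated mount option string by its atime behaviour."""
    options = set(mount_options.split(","))
    if "noatime" in options:
        return "noatime"
    if "relatime" in options:
        return "relatime"
    return "atime"


def atime_policies(mounts_text):
    """Map each mount point in /proc/mounts-style text to its atime policy."""
    policies = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            # Fields: device, mount point, filesystem type, options, ...
            policies[fields[1]] = atime_policy(fields[3])
    return policies


# Possible usage on a Linux system:
#   with open("/proc/mounts") as mounts:
#       print(atime_policies(mounts.read()))
```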
      
      <h2>
        Comparing ext3 and ext2 filesystems
      </h2>
      
      <p>
        <center>
          </p> 
          
          <table border="1" cellspacing="1" cellpadding="2">
            <caption>Amount of data written (in megabytes) on an ext3 and ext2 filesystem</caption> <colgroup align="left"> </colgroup> <colgroup align="right"> </colgroup> <colgroup align="right"> </colgroup> <colgroup align="right"></colgroup> <tr>
              <td align="center">
                Operation
              </td>
              
              <td align="center">
                ext3
              </td>
              
              <td align="center">
                ext2
              </td>
              
              <td align="center">
                percent change
              </td>
            </tr>
            
            <tr>
              <td>
                git clone
              </td>
              
              <td align="right">
                374.6
              </td>
              
              <td align="right">
                357.2
              </td>
              
              <td align="center">
                4.64%
              </td>
            </tr>
            
            <tr>
              <td>
                make
              </td>
              
              <td align="right">
                230.9
              </td>
              
              <td align="right">
                204.4
              </td>
              
              <td align="center">
                11.48%
              </td>
            </tr>
            
            <tr>
              <td>
                make clean
              </td>
              
              <td align="right">
                14.56
              </td>
              
              <td align="right">
                6.54
              </td>
              
              <td align="center">
                55.08%
              </td>
            </tr>
          </table>
          
          <p>
            </center>
          </p>
          
          <p>
            &nbsp;
          </p>
          
          <p>
            Finally, just to round things out, I tried the same experiment using the ext3 and ext2 file systems. The differences between these results and the ones involving ext4 arise because ext2 does not have the directory index feature (aka htree support), and neither ext2 nor ext3 has extents support; both use the less efficient indirect block scheme instead. The ext2 and ext3 allocators are also somewhat different from each other, and from ext4. Still, the results are substantially similar to the first set of numbers, which used POSIX-compliant atime updates. (I didn&#8217;t bother to do noatime and relatime benchmark runs with ext2 and ext3, but I expect the results would be similar.)
          </p>
          
          <h2>
            Conclusion
          </h2>
          
          <p>
            So given all of this, where did the common folk wisdom that ext3 was not suitable for SSD&#8217;s come from? Some of it may have been from people worrying too much about extreme workloads such as &#8220;make clean&#8221;; but while doubling the write load sounds bad, going from 4MB to 7MB worth of writes isn&#8217;t that much compared to the write load of actually doing the kernel compile or populating the kernel source tree. No, the problem was that first generation SSD&#8217;s had a very bad problem with what has been called the &#8220;write amplification effect&#8221;, where a 4k write might cause a 128k region of the SSD to be erased and rewritten. In addition, in order to provide safety against system crashes, ext3 performs more synchronous write operations (that is, operations where ext3 waits for the write to complete before moving on), and this caused a very pronounced and noticeable stuttering effect which was fairly annoying to users. However, the next generation of SSD&#8217;s, such as Intel&#8217;s X25-M SSD, <a title=" Write Amplification: Intel's Secret Sauce" href="http://www.extremetech.com/article2/0,2845,2329594,00.asp" target="_blank">have worked around the write amplification effect</a>.
          </p>
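<p>
To put a number on the write amplification example: if every 4k write forces a whole 128k region to be erased and rewritten, the flash absorbs 32 times as much data as the host asked to write. The helper below is just that arithmetic, with a name chosen for illustration; real drives will fall somewhere below this worst case.
</p>

```python
def worst_case_write_amplification(write_bytes, erase_region_bytes):
    """Bytes physically rewritten per byte the host asked to write."""
    return erase_region_bytes / write_bytes


# The 4k write / 128k erase region example from the paragraph above.
print(worst_case_write_amplification(4 * 1024, 128 * 1024))  # prints 32.0
```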
          
          <p>
            What else have we learned? First of all, for normal workloads that include data writes, the overhead from journaling is actually relatively small (between 4 and 12%, depending on the workload). Further, much of this overhead can be reduced by enabling the noatime option. The relatime option provides some benefit, but if the goal is to reduce your file system&#8217;s write load, especially where an SSD is involved, I would strongly recommend noatime over relatime.
          </p>