<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Wanted: Incremental Backup Solutions that Use a Database</title>
	<atom:link href="http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/feed/" rel="self" type="application/rss+xml" />
	<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/</link>
	<description>Musings about Open Source, Linux, and Life by Theodore Tso</description>
	<lastBuildDate>Mon, 22 Feb 2010 22:39:59 -0500</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: tytso</title>
		<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/comment-page-2/#comment-2806</link>
		<dc:creator>tytso</dc:creator>
		<pubDate>Fri, 04 Dec 2009 03:22:21 +0000</pubDate>
		<guid isPermaLink="false">http://thunk.org/tytso/blog/?p=208#comment-2806</guid>
		<description>The file system is a database, sure; but it&#039;s not a very good general-purpose database.   It&#039;s optimized for the workloads that are typically experienced by file systems, which are quite different from those that might be seen by most relational databases, for example.

There is a desire in computer science, just as in physics, for the &quot;grand unified theory&quot;; so you will see people argue that there should be one product that could solve all problems efficiently; whether that is a file system, or a relational database, or a key/value database.   It used to be, for example, that people thought no matter what your problem was, the answer was a relational database.  Oracle even tried to convince people that an Oracle database could be the basis of a general purpose file system.   That idea died quickly once people discovered how awful Oracle was at being a file system.   Similarly, we are now seeing non-relational databases pop up in Amazon, Google, and many other distributed systems because it turns out relational databases really suck at scaling out.

So I really get my dander up when people say, &quot;the file system is essentially a database&quot;.   I suppose it is, in the sense that any computer program can be transformed into a Turing Machine.   But that doesn&#039;t mean that it is an efficient or sane thing to do for a production system....</description>
		<content:encoded><![CDATA[<p>The file system is a database, sure; but it&#8217;s not a very good general-purpose database.   It&#8217;s optimized for the workloads that are typically experienced by file systems, which are quite different from those that might be seen by most relational databases, for example.</p>
<p>There is a desire in computer science, just as in physics, for the &#8220;grand unified theory&#8221;; so you will see people argue that there should be one product that could solve all problems efficiently; whether that is a file system, or a relational database, or a key/value database.   It used to be, for example, that people thought no matter what your problem was, the answer was a relational database.  Oracle even tried to convince people that an Oracle database could be the basis of a general purpose file system.   That idea died quickly once people discovered how awful Oracle was at being a file system.   Similarly, we are now seeing non-relational databases pop up in Amazon, Google, and many other distributed systems because it turns out relational databases really suck at scaling out.</p>
<p>So I really get my dander up when people say, &#8220;the file system is essentially a database&#8221;.   I suppose it is, in the sense that any computer program can be transformed into a Turing Machine.   But that doesn&#8217;t mean that it is an efficient or sane thing to do for a production system&#8230;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Henri</title>
		<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/comment-page-2/#comment-2805</link>
		<dc:creator>Henri</dc:creator>
		<pubDate>Thu, 03 Dec 2009 23:56:28 +0000</pubDate>
		<guid isPermaLink="false">http://thunk.org/tytso/blog/?p=208#comment-2805</guid>
		<description>The file system is essentially a database. Personally, when restoring a file I like the flexibility of tools available for copying files from a file system.

Hard-linking backup solutions are not a silver bullet. However, this approach has some advantages.

One possibility is that you could limit the history (number of snapshots) the hard-linking backup solution generates, and then use a backup tool to push files to tape from the latest archive. The advantage of such an approach is the ability to restore from disk if you do not need to go back very far. If a file is no longer available on disk, you could order in the tapes to go back further.

Finally, you may want to look through the list of projects listed on the LBackup about page.</description>
		<content:encoded><![CDATA[<p>The file system is essentially a database. Personally, when restoring a file I like the flexibility of tools available for copying files from a file system.</p>
<p>Hard-linking backup solutions are not a silver bullet. However, this approach has some advantages.</p>
<p>One possibility is that you could limit the history (number of snapshots) the hard-linking backup solution generates, and then use a backup tool to push files to tape from the latest archive. The advantage of such an approach is the ability to restore from disk if you do not need to go back very far. If a file is no longer available on disk, you could order in the tapes to go back further.</p>
<p>Finally, you may want to look through the list of projects listed on the LBackup about page.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Georg Sauthoff</title>
		<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/comment-page-2/#comment-2718</link>
		<dc:creator>Georg Sauthoff</dc:creator>
		<pubDate>Sun, 13 Sep 2009 21:48:58 +0000</pubDate>
		<guid isPermaLink="false">http://thunk.org/tytso/blog/?p=208#comment-2718</guid>
		<description>I use dar (http://dar.linux.free.fr/) for incremental backups to disk. Tested restore, too.

Deleted files are recorded and not restored[1] (I don&#039;t understand how other incremental backup tools fail at this point). Dar doesn&#039;t use an rsync-like algorithm to detect changed files; it just looks at the time. It writes its own archive format.

[1] by default</description>
		<content:encoded><![CDATA[<p>I use dar (<a href="http://dar.linux.free.fr/" rel="nofollow">http://dar.linux.free.fr/</a>) for incremental backups to disk. Tested restore, too.</p>
<p>Deleted files are recorded and not restored[1] (I don&#8217;t understand how other incremental backup tools fail at this point). Dar doesn&#8217;t use an rsync-like algorithm to detect changed files; it just looks at the time. It writes its own archive format.</p>
<p>[1] by default</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: eolo999</title>
		<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/comment-page-2/#comment-2485</link>
		<dc:creator>eolo999</dc:creator>
		<pubDate>Tue, 14 Apr 2009 22:01:33 +0000</pubDate>
		<guid isPermaLink="false">http://thunk.org/tytso/blog/?p=208#comment-2485</guid>
		<description>@tytso

Hi, it would be nice to have all these comments/suggestions summarized in a new blog post... of course with your conclusions.

great blog</description>
		<content:encoded><![CDATA[<p>@tytso</p>
<p>Hi, it would be nice to have all these comments/suggestions summarized in a new blog post&#8230; of course with your conclusions.</p>
<p>great blog</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: hugh</title>
		<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/comment-page-2/#comment-2388</link>
		<dc:creator>hugh</dc:creator>
		<pubDate>Mon, 23 Mar 2009 16:35:19 +0000</pubDate>
		<guid isPermaLink="false">http://thunk.org/tytso/blog/?p=208#comment-2388</guid>
		<description>This isn&#039;t what tytso is looking for, but since some folks have mentioned git and keeping important personal stuff, I&#039;ll throw out flashbake (http://bitbucketlabs.net/flashbake/). Flashbake is aimed at not losing work in progress. (Disclaimer: One of my friends runs this project.)</description>
		<content:encoded><![CDATA[<p>This isn&#8217;t what tytso is looking for, but since some folks have mentioned git and keeping important personal stuff, I&#8217;ll throw out flashbake (<a href="http://bitbucketlabs.net/flashbake/" rel="nofollow">http://bitbucketlabs.net/flashbake/</a>). Flashbake is aimed at not losing work in progress. (Disclaimer: One of my friends runs this project.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Henrik Nordvik</title>
		<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/comment-page-2/#comment-1954</link>
		<dc:creator>Henrik Nordvik</dc:creator>
		<pubDate>Sat, 07 Mar 2009 00:38:53 +0000</pubDate>
		<guid isPermaLink="false">http://thunk.org/tytso/blog/?p=208#comment-1954</guid>
		<description>I&#039;m currently trying out Gibak. It de-duplicates and compresses really nicely, and it&#039;s also basically a shell script around a git repository, so you can use regular git commands if you want. It is very cheap on space and fast to transmit.</description>
		<content:encoded><![CDATA[<p>I&#8217;m currently trying out Gibak. It de-duplicates and compresses really nicely, and it&#8217;s also basically a shell script around a git repository, so you can use regular git commands if you want. It is very cheap on space and fast to transmit.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon</title>
		<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/comment-page-2/#comment-1935</link>
		<dc:creator>Jon</dc:creator>
		<pubDate>Tue, 03 Mar 2009 13:46:39 +0000</pubDate>
		<guid isPermaLink="false">http://thunk.org/tytso/blog/?p=208#comment-1935</guid>
		<description>Hello again!

One particular usage scenario for which I am considering a solution with de-duplication (whereas I&#039;m using rdiff-backup everywhere else) is backing up a digital photo archive. The vast majority of the files should not change at all, and if they have changed, it&#039;s probably due to bit errors or corruption somewhere in the hardware/software chain. De-duplication would hopefully catch this (and back up the modified/subtly corrupted photo to a different hash). Whether or not it would report it adequately I don&#039;t know. (It&#039;s also possible some dumb photo management software could corrupt my photos, e.g. a tool which does lossy JPEG rotate).

Whilst the files would not change, they may well move around a lot as I try different management schemes. I currently have a scheme whereby I file them at disc/YYYY/MM/DD/photo, where disc is an arbitrary separator that helps me split the files into DVD-R manageable chunks. A naive backup system (such as, I think, unfortunately, rdiff-backup) could consume an enormous amount of disk space if it re-represented a chunk of files that moved from one location to another.</description>
		<content:encoded><![CDATA[<p>Hello again!</p>
<p>One particular usage scenario for which I am considering a solution with de-duplication (whereas I&#8217;m using rdiff-backup everywhere else) is backing up a digital photo archive. The vast majority of the files should not change at all, and if they have changed, it&#8217;s probably due to bit errors or corruption somewhere in the hardware/software chain. De-duplication would hopefully catch this (and back up the modified/subtly corrupted photo to a different hash). Whether or not it would report it adequately I don&#8217;t know. (It&#8217;s also possible some dumb photo management software could corrupt my photos, e.g. a tool which does lossy JPEG rotate).</p>
<p>Whilst the files would not change, they may well move around a lot as I try different management schemes. I currently have a scheme whereby I file them at disc/YYYY/MM/DD/photo, where disc is an arbitrary separator that helps me split the files into DVD-R manageable chunks. A naive backup system (such as, I think, unfortunately, rdiff-backup) could consume an enormous amount of disk space if it re-represented a chunk of files that moved from one location to another.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan Langille</title>
		<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/comment-page-2/#comment-1795</link>
		<dc:creator>Dan Langille</dc:creator>
		<pubDate>Sat, 21 Feb 2009 17:02:25 +0000</pubDate>
		<guid isPermaLink="false">http://thunk.org/tytso/blog/?p=208#comment-1795</guid>
		<description>&lt;i&gt;The idea of *depending* on a database for my backups strikes me as foolhardy. It sure wouldn’t work for archival purposes — y’mean I need to get a copy of this 30-year old database program running before I can even read the 30-year-old backups from the machine I had at the time? No thanks!&lt;/i&gt;

With respect to Bacula, you can extract data from the tapes without the database.  See bextract, bls, etc.

The Catalog is stored in the database to make certain tasks easier.  For example, what backups do I have of /etc/openvpn.conf between last month and today?  By no means do you NEED the Catalog.  It is a very convenient tool.</description>
		<content:encoded><![CDATA[<p><i>The idea of *depending* on a database for my backups strikes me as foolhardy. It sure wouldn’t work for archival purposes — y’mean I need to get a copy of this 30-year old database program running before I can even read the 30-year-old backups from the machine I had at the time? No thanks!</i></p>
<p>With respect to Bacula, you can extract data from the tapes without the database.  See bextract, bls, etc.</p>
<p>The Catalog is stored in the database to make certain tasks easier.  For example, what backups do I have of /etc/openvpn.conf between last month and today?  By no means do you NEED the Catalog.  It is a very convenient tool.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John Gilmore</title>
		<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/comment-page-2/#comment-1789</link>
		<dc:creator>John Gilmore</dc:creator>
		<pubDate>Sat, 21 Feb 2009 02:58:39 +0000</pubDate>
		<guid isPermaLink="false">http://thunk.org/tytso/blog/?p=208#comment-1789</guid>
		<description>I do disk-to-disk backups with &quot;dump&quot; and &quot;restore&quot; and a small set of shell scripts.  Being raised in the mainframe era, I do a &quot;full dump&quot; periodically, do &quot;incremental dumps&quot; nightly in cron, and do a &quot;middump&quot; whenever the incrementals get so big that they&#039;re eating too much disk space.  I tune the dump script for each filesystem to compress it or not; and can queue the compression until after the dumps, to get fast snapshots of filesystems.

I go back in and manually remove the incrementals that are less useful; e.g. for each month, I keep the incrementals for the 1st, 11th, and 21st day.  More recent months keep more daily incrementals.  This lets me tune up the space allocation while still giving me easy restores (max 3 passes: full, mid, and latest incremental) and access to many recent versions of any file.

This has the great advantage that I can remove the backup disk and stash it in a safe place (i.e. where no computer can write on it!  When did the write-protect switches/jumpers disappear?), then insert an empty drive for many months of subsequent incrementals.  Ultimately I must retain the drive(s) containing fulldumps and middumps until I&#039;ve recycled all the incdumps that depend on them.  But I&#039;m free to discard any incdump (or drive full of incdumps), and can discard any middump or fulldump that has had a subsequent fulldump (unless I want it for archival purposes, which I often do).

What this lacks is:  a database of what&#039;s backed up where (only needed when you want to restore and don&#039;t know which incrementals it might be on), and automation of restores.  You need to be comfortable editing shell scripts as well.  It doesn&#039;t automatically manage its disk space consumption.  It doesn&#039;t feed the cat.  I have to use &quot;tar&quot; to back up filesystems that dump doesn&#039;t understand (like MSDOS and Windows stuff).

The idea of *depending* on a database for my backups strikes me as foolhardy.  It sure wouldn&#039;t work for archival purposes -- y&#039;mean I need to get a copy of this 30-year old database program running before I can even read the 30-year-old backups from the machine I had at the time?  No thanks!

I have done plenty of restores from these backups, and I trust &#039;em.  (Except just this week I was restoring a filesystem with many directories containing a hundred thousand files each -- a set of MH mail folders full of spam.  The older copy of restore that I was running burned CPU time forever, without writing to the disk, because it kept and searched a singly linked list of all files in a directory.  I upgraded to the latest version, which uses hash buckets if you ask nicely, fixed a few things, and it&#039;s doing fine at restoring my filesystem.)

By the way:  Every backup disk I make contains a &quot;tools&quot; directory that has both binaries and source code for every tool used to make the backups.  And before I remove a drive to keep it in offline safe storage, I remove the journal so that anything later that can read ext2 can read it.  (If there was a more popular filesystem that could do the job, I&#039;d switch to it.)</description>
		<content:encoded><![CDATA[<p>I do disk-to-disk backups with &#8220;dump&#8221; and &#8220;restore&#8221; and a small set of shell scripts.  Being raised in the mainframe era, I do a &#8220;full dump&#8221; periodically, do &#8220;incremental dumps&#8221; nightly in cron, and do a &#8220;middump&#8221; whenever the incrementals get so big that they&#8217;re eating too much disk space.  I tune the dump script for each filesystem to compress it or not; and can queue the compression until after the dumps, to get fast snapshots of filesystems.</p>
<p>I go back in and manually remove the incrementals that are less useful; e.g. for each month, I keep the incrementals for the 1st, 11th, and 21st day.  More recent months keep more daily incrementals.  This lets me tune up the space allocation while still giving me easy restores (max 3 passes: full, mid, and latest incremental) and access to many recent versions of any file.</p>
<p>This has the great advantage that I can remove the backup disk and stash it in a safe place (i.e. where no computer can write on it!  When did the write-protect switches/jumpers disappear?), then insert an empty drive for many months of subsequent incrementals.  Ultimately I must retain the drive(s) containing fulldumps and middumps until I&#8217;ve recycled all the incdumps that depend on them.  But I&#8217;m free to discard any incdump (or drive full of incdumps), and can discard any middump or fulldump that has had a subsequent fulldump (unless I want it for archival purposes, which I often do).</p>
<p>What this lacks is:  a database of what&#8217;s backed up where (only needed when you want to restore and don&#8217;t know which incrementals it might be on), and automation of restores.  You need to be comfortable editing shell scripts as well.  It doesn&#8217;t automatically manage its disk space consumption.  It doesn&#8217;t feed the cat.  I have to use &#8220;tar&#8221; to back up filesystems that dump doesn&#8217;t understand (like MSDOS and Windows stuff).</p>
<p>The idea of *depending* on a database for my backups strikes me as foolhardy.  It sure wouldn&#8217;t work for archival purposes &#8212; y&#8217;mean I need to get a copy of this 30-year old database program running before I can even read the 30-year-old backups from the machine I had at the time?  No thanks!</p>
<p>I have done plenty of restores from these backups, and I trust &#8217;em.  (Except just this week I was restoring a filesystem with many directories containing a hundred thousand files each &#8212; a set of MH mail folders full of spam.  The older copy of restore that I was running burned CPU time forever, without writing to the disk, because it kept and searched a singly linked list of all files in a directory.  I upgraded to the latest version, which uses hash buckets if you ask nicely, fixed a few things, and it&#8217;s doing fine at restoring my filesystem.)</p>
<p>By the way:  Every backup disk I make contains a &#8220;tools&#8221; directory that has both binaries and source code for every tool used to make the backups.  And before I remove a drive to keep it in offline safe storage, I remove the journal so that anything later that can read ext2 can read it.  (If there was a more popular filesystem that could do the job, I&#8217;d switch to it.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon Dowland</title>
		<link>http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/comment-page-2/#comment-1760</link>
		<dc:creator>Jon Dowland</dc:creator>
		<pubDate>Sat, 14 Feb 2009 10:27:40 +0000</pubDate>
		<guid isPermaLink="false">http://thunk.org/tytso/blog/?p=208#comment-1760</guid>
		<description>Someone mentioned archfs earlier, a fuse-powered filesystem frontend for rdiff-backup directories. I found a stale Debian ITP and have taken it on board.</description>
		<content:encoded><![CDATA[<p>Someone mentioned archfs earlier, a fuse-powered filesystem frontend for rdiff-backup directories. I found a stale Debian ITP and have taken it on board.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
