Wanted: Incremental Backup Solutions that Use a Database

Dear Lazyweb,

I’m looking for recommendations for Open Source backup solutions which track incremental backups using a database, and which do not use hard-link directories. Someone gave me a suggestion for an OSS backup program at UDS, but it’s slipped my memory; so I’m fairly sure that at least one such OSS backup solution exists, but I don’t know its name. Can some folks give me some suggestions? Thanks!


There are a number of very popular Open Source backup solutions that use a very clever hack: using hard-link trees to maintain incremental backups. The advantage of such schemes is that they are very easy to code up, and they allow you to easily browse incremental backups by simply using “cd” and “ls”.

The disadvantage of such schemes is that they create a very large number of directory blocks, all of which must be validated by an fsck operation. As I’ve discussed previously, this causes e2fsck to consume a vast amount of memory; sometimes more than can be supported by 32-bit systems. Another problem, which has recently been brought home to me, is how much time it can take to fsck such file systems.

This shouldn’t have come as a surprise. Replicating the directory hierarchy for each incremental backup is perhaps the most inefficient way you could come up with to store information about an incremental backup. The filenames are replicated in each directory hierarchy, and even if an entire subtree hasn’t changed, the directories associated with that subtree must be replicated for each snapshot. As a result, each incremental snapshot results in a large number of additional directory blocks which must be tracked by the filesystem and checked by fsck. For example, in one very extreme case, a user reported to me that their backup filesystem contained 88 million inodes, of which 77 million were directories. Even if we assume that every directory was only a single block long, that still means that during e2fsck’s pass 2 processing, 77 million times 4k, or 308 gigabytes, worth of directory blocks must be read into memory. Worse yet, these 308 GB of directory blocks are scattered all across the filesystem, which means simply reading all of the directory blocks so they can be validated will take a very, very long time indeed.
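A quick sanity check of that arithmetic (the figure above treats “4k” as an even 4,000 bytes; with ext3/4’s actual 4 KiB block size it comes out slightly larger):

```python
# Back-of-the-envelope: directory data e2fsck pass 2 must read,
# assuming (optimistically) one block per directory.
dir_inodes = 77_000_000
block_size = 4 * 1024          # bytes; the ext3/4 default block size

total_bytes = dir_inodes * block_size
print(f"{total_bytes / 10**9:.0f} GB of directory blocks")  # 315 GB
```

And that is the best case; any directory longer than one block only makes the pass-2 read volume larger.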

The right way to implement tracking of incremental backups is to use a database, since it can much more efficiently store and organize the information about which file is located where for each incremental snapshot. If the user wants to browse an incremental snapshot via “cd” and “ls”, this could be done via a synthetic FUSE filesystem. There’s a reason why all of the industrial-strength, enterprise-class backup systems use real databases; it’s the same reason why enterprise-class databases use their own data files, rather than trying to store relational tables in a filesystem, even if the filesystem supports b-tree directories and efficient storage of small files. Purpose-written and optimized solutions can be far more efficient than general-purpose tools.
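As a sketch of what such a catalog might look like (the table and column names here are hypothetical, not taken from any particular tool), a minimal SQLite schema can record each file’s metadata once per snapshot, with no directory hierarchy to replicate:

```python
import sqlite3

# Hypothetical minimal catalog: one row per (snapshot, path),
# pointing at file contents stored by checksum.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE snapshots (
    id      INTEGER PRIMARY KEY,
    taken   TEXT NOT NULL            -- ISO-8601 timestamp
);
CREATE TABLE contents (
    checksum TEXT PRIMARY KEY,       -- e.g. SHA-256 of the file data
    size     INTEGER NOT NULL
);
CREATE TABLE entries (
    snapshot INTEGER REFERENCES snapshots(id),
    path     TEXT NOT NULL,          -- full path; no per-snapshot dir tree
    mode     INTEGER,
    mtime    INTEGER,
    checksum TEXT REFERENCES contents(checksum),
    PRIMARY KEY (snapshot, path)
);
""")

# "What did /etc/fstab look like in snapshot 42?" is one indexed lookup.
conn.execute("INSERT INTO snapshots VALUES (42, '2009-01-01T03:00:00')")
conn.execute("INSERT INTO contents VALUES ('abc123', 512)")
conn.execute("INSERT INTO entries VALUES (42, '/etc/fstab', 420, 0, 'abc123')")
row = conn.execute(
    "SELECT checksum FROM entries WHERE snapshot = 42 AND path = '/etc/fstab'"
).fetchone()
print(row[0])  # abc123
```

A FUSE front end for browsing would then just translate readdir/stat calls into queries against the `entries` table for the selected snapshot.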

So, can anyone recommend OSS backup solutions which track incremental backups using some kind of database? It doesn’t really matter whether it’s MySQL, PostgreSQL, or SQLite, so long as it’s not using/abusing a filesystem’s hard links to create directory trees for each snapshot. For bonus points, there would also be a way to browse a particular incremental snapshot via a FUSE interface. What suggestions can you give me, so I can pass them on to ext3/4 users who are running into problems with backup solutions such as Amanda, BackupPC, dirvish, etc.?

Update: A feature which a number of these hard-link tools do get right is de-duplication; that is, they will create hard links between multiple files that have the same contents, even if they are located in different directories with different names (or they have been renamed or their directory structure reorganized since they were originally backed up), and even if the two files are from different clients. This basically means the database needs to include a checksum of each file, and a way to look up whether the contents of that file have already been backed up, perhaps under a different name. Unfortunately, it looks like many of the tape-based tools, such as Bacula, assume that tape is cheap, so they don’t have these sorts of de-duplication features. Does anyone know of non-hard-link backup tools that do de-dup and which are also Open Source?
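The checksum-based lookup itself is straightforward; a minimal sketch (using hashlib, with a plain dict standing in for the database’s checksum index):

```python
import hashlib, os, tempfile

def file_checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def backup_file(path, seen):
    """Store a file only if its contents are new; `seen` plays the
    role of the database's checksum index."""
    digest = file_checksum(path)
    if digest in seen:
        return "dup"           # same bytes already backed up under another name
    seen[digest] = path        # a real tool would copy the file data here
    return "new"

# Two files with identical contents but different names: the second
# is detected as a duplicate even though its path differs.
seen, results = {}, []
with tempfile.TemporaryDirectory() as d:
    for name in ("a.jpg", "renamed.jpg"):
        p = os.path.join(d, name)
        with open(p, "wb") as f:
            f.write(b"same photo bytes")
        results.append((name, backup_file(p, seen)))
print(results)  # [('a.jpg', 'new'), ('renamed.jpg', 'dup')]
```

Because the lookup is keyed on content rather than pathname, renames, reorganizations, and identical files from different clients all hit the same index entry.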

63 thoughts on “Wanted: Incremental Backup Solutions that Use a Database”

  1. I do disk-to-disk backups with “dump” and “restore” and a small set of shell scripts. Being raised in the mainframe era, I do a “full dump” periodically, do “incremental dumps” nightly in cron, and do a “middump” whenever the incrementals get so big that they’re eating too much disk space. I tune the dump script for each filesystem to compress it or not, and can queue the compression until after the dumps, to get fast snapshots of filesystems.

    I go back in and manually remove the incrementals that are less useful; e.g. for each month, I keep the incrementals for the 1st, 11th, and 21st day. More recent months keep more daily incrementals. This lets me tune up the space allocation while still giving me easy restores (max 3 passes: full, mid, and latest incremental) and access to many recent versions of any file.

    This has the great advantage that I can remove the backup disk and stash it in a safe place (i.e. where no computer can write on it! When did the write-protect switches/jumpers disappear?), then insert an empty drive for many months of subsequent incrementals. Ultimately I must retain the drive(s) containing fulldumps and middumps until I’ve recycled all the incdumps that depend on them. But I’m free to discard any incdump (or drive full of incdumps), and can discard any middump or fulldump that has had a subsequent fulldump (unless I want it for archival purposes, which I often do).

    What this lacks is: a database of what’s backed up where (only needed when you want to restore and don’t know which incrementals it might be on), and automation of restores. You need to be comfortable editing shell scripts as well. It doesn’t automatically manage its disk space consumption. It doesn’t feed the cat. I have to use “tar” to back up filesystems that dump doesn’t understand (like MSDOS and Windows stuff).

    The idea of *depending* on a database for my backups strikes me as foolhardy. It sure wouldn’t work for archival purposes — y’mean I need to get a copy of this 30-year old database program running before I can even read the 30-year-old backups from the machine I had at the time? No thanks!

    I have done plenty of restores from these backups, and I trust ’em. (Except just this week I was restoring a filesystem with many directories containing a hundred thousand files each — a set of MH mail folders full of spam. The older copy of restore that I was running burned CPU time forever, without writing to the disk, because it kept and searched a singly linked list of all files in a directory. I upgraded to the latest version, which uses hash buckets if you ask nicely, fixed a few things, and it’s doing fine at restoring my filesystem.)

    By the way: Every backup disk I make contains a “tools” directory that has both binaries and source code for every tool used to make the backups. And before I remove a drive to keep it in offline safe storage, I remove the journal so that anything later that can read ext2 can read it. (If there was a more popular filesystem that could do the job, I’d switch to it.)
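The pruning rule this commenter describes (keep everything recent; for older months keep only the 1st, 11th, and 21st) can be sketched as a small predicate; the dates and the two-month window here are hypothetical choices, not the commenter’s actual script:

```python
from datetime import date

def keep_incremental(d, today=date(2009, 2, 15), recent_months=2):
    """Hypothetical version of the pruning rule above: keep every
    incremental from recent months; for older months keep only the
    incrementals from the 1st, 11th, and 21st."""
    months_old = (today.year - d.year) * 12 + (today.month - d.month)
    return months_old < recent_months or d.day in (1, 11, 21)

# For a month well in the past, only three incrementals survive.
kept = [n for n in (1, 5, 11, 15, 21) if keep_incremental(date(2008, 10, n))]
print(kept)  # [1, 11, 21]
```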

  2. The idea of *depending* on a database for my backups strikes me as foolhardy. It sure wouldn’t work for archival purposes — y’mean I need to get a copy of this 30-year old database program running before I can even read the 30-year-old backups from the machine I had at the time? No thanks!

    With respect to Bacula, you can extract data from the tapes without the database. See bextract, bls, etc.

    The Catalog is stored in the database to make certain tasks easier. For example: what backups do I have of /etc/openvpn.conf between last month and today? By no means do you NEED the Catalog. It is a very convenient tool.

  3. Hello again!

    One particular usage scenario for which I am considering a de-duplicating solution (whereas I’m using rdiff-backup everywhere else) is backing up a digital photo archive. The vast majority of the files should not change at all, and if they have changed, it’s probably due to bit errors or corruption somewhere in the hardware/software chain. De-duplication would hopefully catch this (and back up the modified/subtly corrupted photo under a different hash). Whether or not it would report it adequately I don’t know. (It’s also possible some dumb photo management software could corrupt my photos, e.g. a tool which does lossy JPEG rotation.)

    Whilst the files would not change, they may well move around a lot as I try different management schemes. I currently have a scheme whereby I file them at disc/YYYY/MM/DD/photo, where disc is an arbitrary separator that helps me split the files into DVD-R-manageable chunks. A naive backup system (such as, I think, unfortunately, rdiff-backup) could consume an enormous amount of disk space if it re-represented a chunk of files that moved from one location to another.

  4. I’m currently trying out Gibak. It de-duplicates and compresses really nicely, and it’s also basically a shell script around a git repository, so you can use regular git commands if you want. Very cheap on space, and fast at transmitting.

  5. This isn’t what tytso is looking for, but since some folks have mentioned git and keeping important personal stuff, I’ll throw out flashbake (http://bitbucketlabs.net/flashbake/). Flashbake is aimed at not losing work in progress. (Disclaimer: One of my friends runs this project.)

  6. @tytso

    Hi, it would be nice to have all these comments/suggestions summarized in a new blog post… of course with your conclusions.

    great blog

  7. I use dar (http://dar.linux.free.fr/) for incremental backups to disk. Tested restore, too.

    Deleted files are recorded and not restored[1] (I don’t understand how other incremental backup tools fail at this point). Dar doesn’t use an rsync-like algorithm to detect changed files; it just looks at the timestamps. It writes its own archive format.

    [1] by default

  8. The file system is essentially a database. Personally, when restoring a file I like the flexibility of tools available for copying files from a file system.

    Hard-linking backup solutions are not a silver bullet. However, this approach has some advantages.

    One possibility is that you could limit the history (number of snapshots) the hard-linking backup solution generates, and then use a backup tool to push files to tape from the latest archive. The advantage of such an approach is the ability to restore from disk if you do not need to go back very far. If it is no longer available on disk, then you could order in the tapes to go back further.

    Finally, you may want to look through the list of projects listed on the LBackup about page.

  9. The file system is a database, sure; but it’s not a very good general-purpose database. It’s optimized for the workloads that are typically experienced by file systems, which are quite different from those that might be seen by most relational databases, for example.

    There is a desire in computer science, just as in physics, for the “grand unified theory”; so you will see people argue that there should be one product that could solve all problems efficiently, whether that is a file system, or a relational database, or a key/value database. It used to be, for example, that people thought no matter what your problem was, the answer was a relational database. Oracle even tried to convince people that an Oracle database could be the basis of a general-purpose file system. That idea died quickly once people discovered how awful Oracle was at being a file system. Similarly, we are now seeing non-relational databases pop up at Amazon, Google, and many other distributed systems, because it turns out relational databases really suck at scaling out.

    So I really get my dander up when people say, “the file system is essentially a database”. I suppose it is, in the sense that any computer program can be transformed into a Turing Machine. But that doesn’t mean that it is an efficient or sane thing to do for a production system….

  10. Currently investigating Lessfs http://www.lessfs.com/ and NILFS http://www.nilfs.org/ with a view to FUSE-ing lessfs on to NILFS and seeing how that combo behaves. (Out of curiosity as Data Domain/EMC has a de-duping log-structured FS under the hood).
    Lessfs stores various metadata in a database but it may not be of the sort you require.

  11. Here’s the problem you have:
    “The disadvantage of [backup solutions that use a very clever hack of using hard link trees to maintain incremental backups] is that they create a very large number of directory blocks which must be validated by an fsck operation. As I’ve discussed previously, this causes e2fsck to consume a vast amount of memory; sometimes more than can be supported by 32-bit systems. Another problem, which has recently been brought home to me, is how much time it can take to fsck such file systems.”

    Why do you think that the correct response to this problem is to use some database backend? If the *only* problem is that e2fsck is too slow, then perhaps you should consider some other filesystem for your backup filesystem? Not all filesystems have problems with a high number of directories (performance-wise).

  12. @62: Mikko.

    It’s not a performance problem, it’s a memory utilization problem. And it only applies to fsck, and it’s going to be true for pretty much any file system. The problem is that if you are trying to make sure the link count (i.e., the value returned by stat(2) in the st_nlink field) is accurate, you need to count all of the directory entries that reference a particular inode. This means keeping an in-memory array so you can, well, keep the count. Pretty much any Unix file system has this requirement, since it’s enforced by POSIX, and it’s needed so you know when the last directory has removed its link to the inode, so that the inode and the blocks associated with it can be released. If the link count is too low, you might release the inode early, and that would result in data loss.

    Now there are optimizations you can use, such as only allocating memory for the reference counts of files that are referenced by more than one directory entry (i.e., they have at least one additional hard link beyond the entry created when the file was first created). E2fsck does that. But if you use a backup system which utilizes huge numbers of hard links, then you still need lots of memory; adding more memory is one solution. And if you don’t have enough memory, you can use an on-disk database to deal with the fact that you can’t store all of the refcounts in memory; e2fsck has that option too.

    Or you could decide that the “clever hack” of using hard link trees to maintain incremental backups has costs which are too great for the benefit they bring, and in fact using a real database would be cheaper past a certain scaling point. Which is basically what I tried to argue, perhaps not clearly enough.
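The memory optimization described in that last comment can be sketched as a two-tier counter (a simplification of e2fsck’s actual icount abstraction, which uses a bitmap plus a sorted array rather than Python containers):

```python
class ICount:
    """Sketch of e2fsck's icount idea: most inodes are referenced by
    exactly one directory entry, so track those with a cheap set
    (e2fsck uses a bitmap: one bit per inode); only multiply-linked
    inodes pay for a full count entry."""
    def __init__(self):
        self.seen_once = set()
        self.multi = {}          # inode -> refcount, only when >= 2

    def increment(self, ino):
        if ino in self.multi:
            self.multi[ino] += 1
        elif ino in self.seen_once:
            self.seen_once.discard(ino)
            self.multi[ino] = 2  # promote to a real counter entry
        else:
            self.seen_once.add(ino)

    def fetch(self, ino):
        if ino in self.multi:
            return self.multi[ino]
        return 1 if ino in self.seen_once else 0

ic = ICount()
for ino in [10, 11, 11, 11, 12]:      # inode 11 has three links
    ic.increment(ino)
print(ic.fetch(11), len(ic.multi))    # 3 1 -- only inode 11 costs a dict entry
```

With a hard-link-tree backup scheme, nearly every inode ends up in the expensive tier, which is exactly why fsck memory use blows up on such filesystems.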
