Wanted: Incremental Backup Solutions that Use a Database

January 12, 2009
Filesystems
OSS

Dear Lazyweb,

I’m looking for recommendations for Open Source backup solutions which track incremental backups using a database, and which do not use hard link directories. Someone suggested an OSS backup program to me at UDS, but its name has slipped my memory; so I’m fairly sure that at least one such OSS backup solution exists, but I don’t know what it is called. Can some folks give me some suggestions? Thanks!

There are a number of very popular Open Source backup solutions that use the very clever hack of hard link trees to maintain incremental backups. The advantage of such schemes is that they are very easy to code up, and they allow you to easily browse incremental backups by simply using “cd” and “ls”.

The disadvantage of such schemes is that they create a very large number of directory blocks which must be validated by an fsck operation. As I’ve discussed previously, this causes e2fsck to consume a vast amount of memory; sometimes more than can be supported by 32-bit systems. Another problem, which has recently been brought home to me, is how much time it can take to fsck such file systems.

This shouldn’t have come as a surprise. Replicating the directory hierarchy for each incremental backup is perhaps the most inefficient way you could come up with for storing information for an incremental backup. The filenames are replicated in each directory hierarchy, and even if an entire subtree hasn’t changed, the directories associated with that subtree must be replicated for each snapshot. As a result, each incremental snapshot results in a large number of additional directory blocks which must be tracked by the filesystem and checked by fsck. For example, in one very extreme case, a user reported to me that their backup filesystem contained 88 million inodes, of which 77 million were directories. Even if we assume that every directory was only a single block long, that still means that 77 million times 4k, or roughly 308 gigabytes, worth of directory blocks must be read into memory during e2fsck’s pass 2 processing. Worse yet, these 308 GB of directory blocks are scattered all across the filesystem, which means that simply reading all of the directory blocks so they can be validated will take a very, very long time indeed.
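To make the arithmetic concrete, here is a small back-of-the-envelope sketch in Python. The directory count and the one-4k-block-per-directory lower bound come from the report above; the function name and the figure of 100 random reads per second are assumptions I picked purely for illustration, not measurements of any particular disk.

```python
# Rough estimate of how much directory data e2fsck's pass 2 has to read.
# The directory count and 4k block size are from the text; reads_per_sec
# is a made-up illustrative figure, not a measurement.

def directory_read_estimate(num_dirs, block_size=4096, reads_per_sec=100):
    """Lower bound on directory data, assuming one block per directory."""
    total_bytes = num_dirs * block_size
    return {
        "kib": total_bytes // 1024,       # 77 million * 4 KiB
        "gib": total_bytes / 2**30,       # binary gigabytes
        "gb": total_bytes / 10**9,        # decimal gigabytes
        # Pessimistic case: every single block costs one random read.
        "days_if_fully_random": num_dirs / reads_per_sec / 86400,
    }

est = directory_read_estimate(77_000_000)
print(f"{est['kib']:,} KiB of directory blocks")
print(f"about {est['gib']:.0f} GiB, or {est['gb']:.0f} GB in decimal units")
print(f"about {est['days_if_fully_random']:.1f} days at 100 fully random reads/sec")
```

The 308 figure is 77 million times 4 KiB expressed in millions of KiB; depending on whether you count in binary or decimal units, that is somewhere between roughly 294 GiB and 315 GB. The "days" number is only an upper bound under the stated assumption; a real fsck run can do better whenever reads can be ordered or merged, but the order of magnitude shows why the scattering hurts.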

The real right way to track incremental backups is to use a database, since it can much more efficiently store and organize the information of what file is located where, for each incremental snapshot. If the user wants to browse an incremental snapshot via “cd” and “ls”, this could be done via a synthetic FUSE filesystem. There’s a reason why all of the industrial-strength, enterprise-class backup systems use real databases; it’s the same reason why enterprise-class databases use their own data files, and do not try to store relational tables in a filesystem, even if the filesystem supports b-tree directories and efficient storage of small files. Purpose-written and optimized solutions can be far more efficient than general-purpose tools.
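As a rough illustration of what I mean, here is a minimal sketch of such a catalog using Python’s built-in sqlite3 module. The schema, the table names, and the functions (open_catalog, record_snapshot, list_snapshot) are all hypothetical, invented for this example rather than taken from any existing backup tool. The point is that a file which has not changed between snapshots costs no new rows at all, whereas the hard link scheme replicates its whole directory path every time.

```python
import sqlite3

# Hypothetical catalog: each row says "this version of this path was
# live from snapshot added_in until snapshot removed_in (NULL = still live)".
SCHEMA = """
CREATE TABLE snapshots (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE entries (
    path       TEXT NOT NULL,
    checksum   TEXT NOT NULL,
    added_in   INTEGER NOT NULL REFERENCES snapshots(id),
    removed_in INTEGER REFERENCES snapshots(id)
);
CREATE INDEX entries_by_path ON entries(path);
"""

def open_catalog(db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn

def record_snapshot(conn, label, current):
    """Record a snapshot; `current` maps path -> checksum of its contents."""
    with conn:  # one transaction per snapshot
        snap = conn.execute(
            "INSERT INTO snapshots (label) VALUES (?)", (label,)
        ).lastrowid
        live = dict(conn.execute(
            "SELECT path, checksum FROM entries WHERE removed_in IS NULL"))
        # Close out versions that were deleted or changed.
        for path, checksum in live.items():
            if current.get(path) != checksum:
                conn.execute(
                    "UPDATE entries SET removed_in = ? "
                    "WHERE path = ? AND removed_in IS NULL", (snap, path))
        # Add rows only for new or changed files.
        for path, checksum in current.items():
            if live.get(path) != checksum:
                conn.execute(
                    "INSERT INTO entries (path, checksum, added_in) "
                    "VALUES (?, ?, ?)", (path, checksum, snap))
    return snap

def list_snapshot(conn, snap):
    """Return path -> checksum as the tree looked at snapshot `snap`."""
    return dict(conn.execute(
        "SELECT path, checksum FROM entries "
        "WHERE added_in <= ? AND (removed_in IS NULL OR removed_in > ?)",
        (snap, snap)))
```

Recording a second snapshot in which only one file changed adds exactly one row and updates one other; list_snapshot can still reconstruct either point in time, which is the lookup a FUSE front end would need to answer “ls” in a given snapshot.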

So, can anyone recommend OSS backup solutions which track incremental backups using some kind of database? It doesn’t really matter whether it’s MySQL, PostgreSQL, or SQLite, so long as it’s not using/abusing a file system’s hard links to create directory trees for each snapshot. For bonus points, there would be a way to browse a particular incremental snapshot via a FUSE interface. What suggestions can you give me, so I can pass them on to ext3/4 users who are running into problems with backup solutions such as Amanda, BackupPC, dirvish, etc.?

Update: A feature which a number of these hard-link tools do get right is de-duplication; that is, they will create hard links between multiple files that have the same contents, even if they are located in different directories with different names (or they have been renamed or their directory structure reorganized since they were originally backed up) and even if the two files are from different clients. This basically means the database needs to include a checksum of each file, and a way to look up whether the contents of that file have already been backed up, perhaps under a different name. Unfortunately, it looks like many of the tape-based tools, such as Bacula, assume that tape is cheap, so they don’t have these sorts of de-duplication features. Does anyone know of non-hard-link backup tools that do de-dup and which are also Open Source?
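The checksum lookup described above can be sketched in a few lines of Python. Everything here (the table layout, open_store, backup_file, and the choice of SHA-256) is an assumption made for the sake of the example, not how any particular tool does it; in particular, a real tool would presumably keep file contents outside the database and store only the index there.

```python
import hashlib
import sqlite3

def open_store():
    # Hypothetical layout: file contents are keyed by checksum, and the
    # per-client file names merely point at a checksum.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE blobs (checksum TEXT PRIMARY KEY, data BLOB NOT NULL);
        CREATE TABLE files (
            client   TEXT NOT NULL,
            path     TEXT NOT NULL,
            checksum TEXT NOT NULL REFERENCES blobs(checksum),
            PRIMARY KEY (client, path)
        );
    """)
    return conn

def backup_file(conn, client, path, data):
    """Back up one file; return True only if its contents were new."""
    checksum = hashlib.sha256(data).hexdigest()
    with conn:
        # The de-dup step: have these exact contents been stored before,
        # under any name and from any client?
        seen = conn.execute(
            "SELECT 1 FROM blobs WHERE checksum = ?", (checksum,)
        ).fetchone() is not None
        if not seen:
            conn.execute(
                "INSERT INTO blobs (checksum, data) VALUES (?, ?)",
                (checksum, data))
        conn.execute(
            "INSERT OR REPLACE INTO files (client, path, checksum) "
            "VALUES (?, ?, ?)", (client, path, checksum))
    return not seen

store = open_store()
print(backup_file(store, "laptop", "/home/a/report.pdf", b"same bytes"))
print(backup_file(store, "server", "/srv/docs/copy.pdf", b"same bytes"))
```

The first call stores the contents and the second only records another name for them, so two clients with the same file under different names end up sharing a single stored copy, which is the same effect the hard link tools get without needing a directory tree per snapshot.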