bup - towards the perfect backup

Since I discovered it, I've been in love with the concept behind bup.

bup appeals to my sense of efficiency in taking backups: a backup should store the absolute minimum amount of data, so I can keep the most versions, which in turn frees me to use whatever level of redundancy I deem appropriate for my backup media.

But more than that, the underlying technology of bup is ripe with possibility: the basic premise of a backup tool gives rise to the possibility of a sync tool, a deduplicated home directory tool, distributed repositories, archives and more.

how it works

A more complete explanation can be found on the main GitHub repository, but essentially bup applies rsync's rolling checksum (literally, the same algorithm) to determine file differences, and then backs up only those differences - somewhat like rsnapshot.

Unlike rsnapshot, however, bup then deduplicates the chunks produced this way using SHA-1 hashes, and stores the results in the git packfile format.

This is both very fast (rsnapshot, conversely, is quite slow) and very redundant - the Git tooling is able to read and understand a bup repository as just a Git repository with a specific commit structure (you can run gitk --all in a .bup directory to inspect it).
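To make that concrete, here's a minimal sketch of a local backup session (the paths and branch name are illustrative); because the repository is plain git underneath, ordinary git tooling can read the result:

```bash
export BUP_DIR=~/.bup          # where the repository lives (bup's default)
bup init                       # creates an ordinary git repository under $BUP_DIR
bup index /home/me/documents   # record filesystem state to detect changes quickly
bup save -n documents /home/me/documents   # chunk, deduplicate, store as packfiles

# The result is readable with standard git tools:
git --git-dir="$BUP_DIR" log documents     # each save shows up as a commit
```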

why it's space efficient

bup's archive format and rolling checksum make it very space efficient. bup can correctly deduplicate data that undergoes insertions, deletions, shifts and copies. bup deduplicates across your entire backup set, meaning the same file uploaded 50 times is stored only once - in fact, it will only be transferred across the network once.
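A quick way to see the whole-set deduplication in action (a sketch; the file names are illustrative, and $BUP_DIR is the repository from above):

```bash
cp big-dataset.iso duplicate.iso   # a byte-identical second copy
bup index . && bup save -n demo .
du -sh "$BUP_DIR"                  # grows by roughly metadata only, not a second copy's worth
```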

For comparison, I recently moved 180 GB of ZFS snapshots of the same dataset, undergoing various daily changes, into a bup archive and successfully compacted it down to 50 GB. I suspect I could have gotten it smaller if I'd unpacked some of the archive files that had been created in that backup set.

That is a dataset which was already deduplicated via copy-on-write semantics (it was not using ZFS deduplication, because you should basically never use ZFS deduplication).

why it's fast

Git is well known for being bad at handling large binary files - it was designed to handle patches of source code, and makes assumptions to that effect. bup steps around this problem because it uses only the Git packfile and index formats to store data: where Git is slow, bup implements its own packfile writers and index readers to make looking up data in Git structures fast.

bup also uses some other tricks to do this: it combines indexes into midx files to speed up lookups, and builds Bloom filters so that objects which are definitely not in the repository can be rejected quickly (a Bloom filter is a fast, hash-based data structure which tells you that something is either 'probably in the data set' or definitely not).
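Both structures are maintained with ordinary bup commands; for example (output and timing will vary):

```bash
bup midx -a   # merge any new .idx files into combined .midx indexes
bup bloom     # create or update the bloom filter for fast negative lookups
```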

using bup for Windows backups

bup is a Unix/Linux-oriented tool, but in practice I've so far applied it most usefully to some Windows servers.

Running under Cygwin on Windows, bup is far superior to the built-in Windows backup system for file-based backups. It's best to combine it with the vscsc tool, which allows taking one-time VSS snapshots so the backup doesn't capture inconsistent state.

You can see a Gist here of my current favorite script for this type of thing - this bash script needs to be invoked from a scheduled task which runs a batch file like this.
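For illustration, the batch file side can be as small as this sketch (the Cygwin install path and script name are assumptions; adjust them to your layout):

```bat
@echo off
REM Invoked by Windows Task Scheduler; hands off to the Cygwin bash script.
REM C:\cygwin and /usr/local/bin/backup.sh are illustrative paths.
C:\cygwin\bin\bash.exe --login -c "/usr/local/bin/backup.sh"
```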

If you want to use this script on Cygwin then you need to install the mail utility for sending email, as well as rsync and bup.

This script is reasonably complicated, but it is designed to be robust against failures in a sensible way - if running bup somehow fails, it falls back to making tar archives, giving us an opportunity to fix a broken backup set.
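The core of that fallback logic looks something like this (a simplified sketch; variable names, paths and addresses are illustrative, and error handling is trimmed):

```bash
if bup index "$SOURCE" && bup save -n "$NAME" "$SOURCE"; then
    echo "bup backup succeeded" | mail -s "backup OK" admin@example.com
else
    # bup failed - capture a plain tar archive so today's backup isn't lost,
    # and flag the backup set for repair.
    tar -czf "/backups/fallback-$(date +%Y%m%d).tar.gz" "$SOURCE"
    echo "bup failed, tar fallback taken" | mail -s "backup DEGRADED" admin@example.com
fi
```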

This script will work for backing up to your own remote server today. But it was developed to work around limitations which can be fixed - and which I have fixed - and so the bup of tomorrow will not have them.

towards the perfect backup

The script above was developed for a client, and the rsync-first stage was designed to ensure that the most recent backup would always be directly readable from a Windows Samba share, without requiring the command line.
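That stage is just a plain rsync mirror run ahead of bup, along these lines (flags and paths are illustrative):

```bash
# Keep a directly browsable copy of the newest backup on a Samba-shared path,
# then let bup handle the versioned, deduplicated history.
rsync -a --delete /cygdrive/z/snapshot/ /backups/latest/
```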

It was also designed to work around a flaw in bup's indexing step which makes it difficult to use with the variable paths produced by the vscsc tool under Cygwin. Although bup will work just fine, it will insist on re-hashing the entire backup set every time - which is slow. This can be worked around by symlinking the backup path in Cygwin beforehand, but since we needed a readable backup set anyway, it was just as quick to use rsync in this instance.

But it doesn't have to be this way. I've submitted several patches against bup which are also available in my personal development repository of bup on GitHub.

The indexing problem is fixed via index-grafts: modifying the bup-index to support representing the logical structure as it is intended to appear in the bup repository, rather than the literal on-disk path structure. This allows the index to work as intended without any games on the filesystem, hashing only modified or updated files.
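As a sketch of how that looks in use (the --graft option on bup index comes from my patches and mirrors the --graft option that bup save already has; the exact paths are illustrative):

```bash
# Index the volatile VSS snapshot path, but record it under the stable
# logical path it should have in the repository, so unchanged files are
# not re-hashed on the next run.
bup index --graft "/cygdrive/z/snap-1234=/c" /cygdrive/z/snap-1234
bup save -n winserver /c
```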

The need for a directly accessible version of the backup is solved via a few other patches. We can modify bup's virtual-filesystem layer to support a dynamic view of the repository fairly easily, and add WebDAV support to the bup-web command (the dynamic-vfs and bup-webdav branches).

With these changes, a bup repository can now be mounted directly as a Windows mapped network drive via Explorer's WebDAV client, and files can be opened and copied directly from the share. Any version of a backup set is then trivially accessible, and importantly, we can simply start bup-web as a Cygwin service and leave it running.
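In practice that is two steps (host name, port and drive letter are illustrative):

```bash
# On the backup server, under Cygwin: serve the repository. With the
# bup-webdav branch, the same endpoint also speaks WebDAV.
bup web 0.0.0.0:8080

# On a Windows client (from cmd.exe), map it as a network drive via the
# built-in WebClient service:
#   net use Z: http://backupserver:8080/
```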

Hopefully these patches will be incorporated into mainline bup soon (they are awaiting review).

so should I use it?

Even with the things I've had to fix, the answer is absolutely. bup is by far the best backup tool I've encountered lately. For a basic Linux system it will work great, for manual backups it will work great, and with a little scripting it will work great for automatic backups under Windows and Linux.

The brave can try out the cutting-edge branch on my GitHub account to test the fixes in this blog post, and if you do, posting to [bup-list@googlegroups.com](https://groups.google.com/forum/#!forum/bup-list) with any problems, successes or code reviews would help a lot.