Blog Post #8, In Which I List Several Backup Tools

Nov 09, 2010 03:18


Last time, I talked about the importance of backups. Today, I will survey some backup tools.
tar

The venerable UNIX utility. "tar" stands for "Tape ARchive". My understanding is that, in the old days, you would tar up your files every night and write the archive to a magnetic tape drive, tape being one of the most reliable storage methods (if stored properly). Nowadays, we have large, cheap external disk drives that are almost as good (if not better). CD-Rs and DVD-Rs are not really worth considering, as they are too small, too slow, and degrade after just a few years.

Although tape drives are disappearing, tarballs are still in widespread use. Indeed, compressed tarballs are the most popular archive format on unix-like systems.

You can still use hand-rolled tar archives for backups, but we have developed much better tools over the years. Let's have a look-see.
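For reference, hand-rolling a compressed tarball is only a few lines; here's a minimal sketch using Python's standard-library tarfile module (the paths are hypothetical, and there's no rotation or error handling):

    # A hand-rolled tarball backup -- a minimal sketch, not a complete
    # backup solution.
    import tarfile
    from datetime import date

    source = "/home/alice/documents"                     # hypothetical
    archive = "/mnt/backup/docs-%s.tar.gz" % date.today()

    with tarfile.open(archive, "w:gz") as tar:  # "w:gz" = write, gzip-compressed
        tar.add(source)                         # recurses into the directory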
rsync

rsync synchronizes files to a (possibly) remote location, using an extremely clever protocol designed to minimize bandwidth. rsync was written by Andrew Tridgell, and the rsync protocol was the subject of his PhD thesis.

rsync is extremely popular; many linux backup solutions are based on rsync (either using the program directly, or just using the protocol).

Let's try to understand the magic of rsync.

You have some file on your computer (the sender) that you want to send to some remote computer (the receiver). Since you are connected over a network, and networks are slow, you want to use as little bandwidth as possible. Fortunately, there is already an outdated copy of the file on the remote computer. The obvious solution is: rather than send the whole file, send a diff of the two files! The only problem is that to generate the diff you need to be able to compare the two files, which means you would have to transfer the old file to your computer, compare the two, generate the diff, and send that. Unless the old file is much smaller than the new one, this wouldn't save bandwidth at all!

So we have to be a bit cleverer. We need some way to compare two files while knowing the full contents of only one of them.

Here is a sketch of what rsync does: 1
  1. The receiver splits the file into chunks of N bytes. It computes the hash of every chunk and sends the hashes to the sender.
  2. The sender computes the same N-byte chunk hashes, but it does so at every byte offset in the file.
  3. The sender compares the hashes to compute the diff, which it then sends to the receiver.
  4. The receiver uses the diff to patch the file. Yay!

So if the file is "abcdefgh" and N is 4, the receiver would compute the hashes of "abcd" and "efgh" and send those to us, and we, the sender, would compute the hashes of "abcd", "bcde", "cdef", "defg", and "efgh"!

If one of our hashes matches a hash we received from the receiver, then that means that whatever chunk it corresponds to is the same in both versions of the file, so we can tell the receiver, "copy the chunk with such-and-such hash into this location in the file".

The data that the sender sends to the receiver is any new data (chunks which didn't match a received hash), plus a list of hashes to copy into the new file.

Since the hashes are much smaller than the original chunks, this uses less bandwidth than sending the whole file. 2

If you're wondering why the sender has to compute so many hashes, imagine if it didn't, and instead computed the same number of hashes as the receiver. Now consider what happens when you insert a byte into the file. The receiver's hashes and the sender's hashes will get out of sync, and past the inserted byte, none of the hashes will match. You'd wind up treating the whole file after the inserted byte as new, and sending it all to the receiver. By computing hashes at every byte offset, we are able to catch chunks which have merely shifted position.
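To make this concrete, here is a toy Python version of the whole exchange. It's a sketch of the idea, not the real algorithm: actual rsync uses a cheap rolling checksum so that hashing at every offset is affordable, plus a stronger hash to confirm matches (see footnote 1 for the gory details); here MD5 stands in for everything.

    import hashlib

    N = 4  # toy block size; real rsync uses blocks of several hundred bytes

    def md5(chunk):
        return hashlib.md5(chunk).hexdigest()

    def block_hashes(data):
        # Receiver: hash each non-overlapping N-byte block of the old copy.
        return {md5(data[i:i+N]): i // N for i in range(0, len(data), N)}

    def make_delta(new, old_hashes):
        # Sender: hash an N-byte window at EVERY byte offset. A match becomes
        # ("copy", block_number); unmatched bytes are sent as literal data.
        delta, literal, i = [], b"", 0
        while i < len(new):
            block_num = old_hashes.get(md5(new[i:i+N]))
            if block_num is not None:
                if literal:
                    delta.append(("data", literal))
                    literal = b""
                delta.append(("copy", block_num))
                i += N                 # back in sync: skip past the matched block
            else:
                literal += new[i:i+1]  # no match at this offset; slide one byte
                i += 1
        if literal:
            delta.append(("data", literal))
        return delta

    def apply_delta(old, delta):
        # Receiver: rebuild the new file from the old copy plus the delta.
        out = b""
        for kind, value in delta:
            out += old[value*N:(value+1)*N] if kind == "copy" else value
        return out

    old = b"abcdefgh"    # the receiver's outdated copy
    new = b"abcdXefgh"   # the sender's version: one byte inserted mid-file

    delta = make_delta(new, block_hashes(old))
    print(delta)   # [('copy', 0), ('data', b'X'), ('copy', 1)]
    assert apply_delta(old, delta) == new

Notice that the inserted byte costs only one literal byte in the delta: the per-offset scan lets the sender lock back on to "efgh" immediately after the insertion.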

While rsync is extremely useful, it doesn't do incremental backups, only mirroring. It also requires that, for remote transfers, the remote end be running rsync.

The three programs listed next all build on rsync.
rdiff-backup

rdiff-backup is like rsync, except it also does incremental backups by storing reverse diffs, along with some other nice features.

As with rsync, the remote end must be running rdiff-backup.
Duplicity

Duplicity was designed to do remote backups without server support. It uses the rsync algorithm, but it precomputes the file hashes and stores them on the server alongside the backups. When it comes time to sync to the remote end, it grabs the hashes and goes from there.

This is useful for backing up to, say, Amazon S3 or other cloud storage services, where you can't run an rsync server.

It can do incremental backups and also supports encryption.
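To make the serverless trick concrete, here's a hedged sketch of the idea in Python. This is not Duplicity's actual code or file format; it reuses the toy block_hashes() and make_delta() functions from the rsync sketch above, with a plain dict standing in for a dumb remote store (think S3: it can only get and put blobs).

    remote = {}  # stand-in for a storage backend that can only get/put blobs

    def first_backup(data):
        remote["full.bak"] = data
        remote["full.sig"] = block_hashes(data)  # signature lives server-side

    def incremental_backup(new_data, n):
        sig = remote["full.sig"]  # fetch only the small signature, not the backup
        remote["incr-%d.delta" % n] = make_delta(new_data, sig)  # computed locally
        remote["incr-%d.sig" % n] = block_hashes(new_data)
        # (a real tool would chain increments against the latest signature,
        # not always the original full backup)

    first_backup(b"abcdefgh")
    incremental_backup(b"abcdXefgh", 1)
    print(sorted(remote))  # ['full.bak', 'full.sig', 'incr-1.delta', 'incr-1.sig']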
rsnapshot

rsnapshot is a wrapper for rsync that does incremental backups, but instead of storing diffs like rdiff-backup and duplicity, it does full snapshots. It achieves efficient storage of these snapshots through clever use of hard links. Basically, if a file hasn't changed since the last snapshot, then instead of storing another copy it simply links to the old file, using effectively 0 disk space. From the point of view of the user, you get the convenience of snapshots with the efficiency of incremental diffs.
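Here's a minimal sketch of the hard-link trick in Python (rsnapshot itself is a Perl wrapper around rsync, and rsync's --link-dest option does the same job; this toy version only handles a flat directory of regular files):

    import filecmp, os, shutil

    def snapshot(source, prev_snap, new_snap):
        os.makedirs(new_snap)
        for name in os.listdir(source):
            src = os.path.join(source, name)
            old = os.path.join(prev_snap, name)
            new = os.path.join(new_snap, name)
            if os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)       # unchanged: hard link, no extra space
            else:
                shutil.copy2(src, new)  # new or changed: store a real copy

Every snapshot directory looks like a full copy, but a file that never changes exists on disk exactly once, no matter how many snapshots link to it.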
tarsnap

"Online backups for the truly paranoid."

Tarsnap is two things: a paid internet backup service, and an open-source backup program. As the name suggests, it uses tar, not rsync, although it does do incremental backups.

Everything in tarsnap is insanely encrypted, so you trade the problem of backing up your data for the problem of just backing up your keys. *grin*

Even if it's not your primary backup facility, tarsnap seems worth considering for secondary backups.

(These are just a few of the programs that I found mentioned most often when I was looking for backup solutions. The ArchWiki has a rather comprehensive list of more Backup Programs, once again demonstrating that Arch has the Best Wiki Ever.)

So which of these do I use? I use rsnapshot to sync to an external USB hard drive 4 times a day. I wrote a trivial wrapper script which mounts the drive before running rsnapshot, and unmounts it afterwards. I back up the entire disk, using the instructions in Full System Backup with rsync (adapted for rsnapshot). The backup won't be directly bootable, of course, because the files are in a snapshot directory rather than the root of the drive; but theoretically, if my hard drive fails I should be able to cp -ar everything to a new disk and it will be bootable (no re-installing linux or anything). I haven't tested this, though.
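For what it's worth, the wrapper is roughly this (a sketch, not my actual script; the mount point is hypothetical and assumed to have an /etc/fstab entry, and "hourly" has to match an interval defined in rsnapshot.conf):

    #!/usr/bin/env python
    import subprocess

    MOUNT_POINT = "/mnt/backup"  # hypothetical; needs an /etc/fstab entry

    subprocess.check_call(["mount", MOUNT_POINT])
    try:
        subprocess.check_call(["rsnapshot", "hourly"])   # take the snapshot
    finally:
        subprocess.check_call(["umount", MOUNT_POINT])   # unmount even on failure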

I also don't have any off-site backups. Tarsnap looks pretty cool.

More on the importance of backups:

[You] should think of “long term storage” as a string of short-medium term solutions that are replaced every few years.

- http://diveintomark.org/archives/2006/05/08/backup#comment-6405

The fundamental truth about backups is redundancy. If you want to protect something, make as many copies of it as you possibly can 3. (Then the problem becomes synchronization.) The best 4 backup solution is to put it on the internet. As long as enough people find it worthwhile, you'll never be able to lose it.

Oh, if you're running Windows, you're on your own. Sorry.
Appendix: RAID

RAID stands for Redundant Array of Inexpensive Disks, and is definitely not a replacement for real backups.

RAID guards against one kind of hardware failure. There's lots of failure modes that it doesn't guard against.
  • File corruption
  • Human error (deleting files by mistake)
  • Catastrophic damage (someone dumps water onto the server)
  • Viruses
  • Software bugs that wipe out data ...

- http://serverfault.com/questions/2888/why-is-raid-not-a-backup

P.S., RAID is a waste of your goddamned time and money. Is your personal computer a high-availability server with hot-swappable drives? No? Then you don't need RAID, you just need backups.

- http://jwz.livejournal.com/801607.html?thread=15375431#t15375431


  1. Of course, it's actually a bit more complicated than that. For the gory details, you can read the thesis, Efficient Algorithms for Sorting and Synchronization [PDF]. Or you can be a wimp and read the Wikipedia page.  

  2. As explained in the paper, the rsync algorithm can also be used as a decent compression algorithm. The hashes allow it to exploit redundancy in a file over distances much larger than the typical compression algorithm's window size.  

  3. This is how DNA works and it's why some genes have survived for millions of years.  

  4. For some value of "best". Not valid in Canada. Additional restrictions may apply.  

nablopomo2010, quote, backups, geeky, looooooong, software
