It's been a while since I described the way I do backups -- in fact,
the only
public document I could find on the subject was written in 2006, and
things have changed a great deal since then. I believe there have been a
few mentions in Dreamwidth and elsewhere, but in this calamitous year it
seems prudent to do it again. Especially since I'm starting to feel
mortal, and starting to think that some day one of my kids is going to
have to grovel through the whole mess and try to make sense of it.
(Whether they'll find anything worth keeping or even worth the trouble of
looking is, of course, an open question.)
My home file server, a small Linux box called Nova, is backed up by simply
copying (almost -- see below) its entire disk to an external hard drive
every night. (It's done using rsync, which is efficient
because it skips over everything that hasn't been changed since the last
copy.) When the disk crashes (it's almost always the internal disk,
because the external mirror is idle most of the time) I can swap in the
external drive (and have, several times), make it bootable, order a new
drive for the mirror, and I'm done. Or, more likely, buy a new pair of
drives that are twice as big for half the price, copy everything, and
archive the better of the old drives. Update it occasionally.
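For the record, that nightly mirror amounts to little more than a single
rsync invocation along these lines -- a minimal sketch rather than the
actual script; the mount point and the list of excluded directories are
assumptions:

    # Mirror the root filesystem onto the external drive (assumed to be
    # mounted at /mnt/mirror).  -a preserves permissions and timestamps,
    # -H preserves hard links, -x stays on one filesystem, and --delete
    # keeps the mirror an exact copy.
    rsync -aHx --delete \
          --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/tmp \
          / /mnt/mirror/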
That's not very interesting, but it's not the whole story. I used to make
incremental backups -- instead of the mirror drive being an exact copy of
the main one, it was a sequence of snapshots (like Apple's Time Machine, for
example). There were some problems with that, including the fact that,
because of the way the snapshots were made (using cp -l to copy
directories but leave hard links to the files that hadn't changed), it
took more space than it needed to, and made the backup disk very
difficult -- not to mention slow -- to copy if it started flaking out.
There are ways of getting around those problems now, but I don't need
them.
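For anyone curious, the hard-link snapshot trick looks roughly like this;
the dates and paths are made up, not taken from my old script:

    # Start the new snapshot as a tree of hard links to yesterday's
    # (cp -a -l copies the directory structure but links the files),
    # then let rsync replace only the files that actually changed.
    cp -al /backups/2020-11-07 /backups/2020-11-08
    rsync -a --delete /home/ /backups/2020-11-08/home/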
The classic solution is to keep copies offsite. But I can do better than
that because I already have a web host, and I have Git. I need to back up
a little.
I noticed that almost everything I was backing up fell into one of three
categories:
- Files I keep under version control.
- Files (mostly large ones, like audio recordings) that never change
after they've been created -- recordings of past concerts, my
collection of ripped CDs, the masters for my CD, and so on. I
accumulate more of them as time goes by, but most of the old
ones stick around.
- Files I can reconstruct, or that are purely ephemeral -- my browser
cache, build products like PDFs, executable code, downloaded install
CDs, and of course the entire OS, which I can re-install any time I need to
in under an hour.
Git's biggest advantage for both version control and backups is that it's
distributed -- each working directory has its own repository, and you can
have shared repositories as well. In effect, every repository is a
backup. In my case the shared repositories are in the cloud on
Dreamhost, my web host. There are
working trees on Nova (the file server) and on one or more laptops. A few
of the more interesting ones have public copies on GitLab and/or GitHub as
well. So that takes care of Group 1.
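Setting up one of those shared repositories isn't anything exotic --
something along these lines, with the host name and paths as placeholders
rather than the real ones:

    # Create a bare repository on the web host and use it as the working
    # tree's origin; every push is then an offsite backup.
    ssh user@example.dreamhost.com 'git init --bare git/project.git'
    cd ~/work/project
    git remote add origin user@example.dreamhost.com:git/project.git
    git push -u origin master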
The main reason for using incremental backup or version control is so that
you can go back to earlier versions of something if it gets messed up.
But the files in Group 2 don't change; they just accumulate.
So I put all of the files in Group 2 -- the big ones -- into
the same directory tree as the Git working trees; the only difference is
that they don't have an associated Git repo. I keep thinking I should set
up
git-annex to manage
them, but it doesn't seem necessary. The workflow is very similar to the
Git workflow: add something (typically on a laptop), then push it to a
shared server. The Rsync commands are in a Makefile, so I don't have to
remember them: I just make rsync. (Rsync doesn't copy
anything that is already at the destination and hasn't changed since the
previous run, and by default it ignores files on the destination that
don't have corresponding source files. So I don't have to have a
complete copy of my concert recordings (for example) on my
laptop, just the one I just made.)
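What the rsync target runs is roughly the following -- a sketch, with the
server name and directory layout assumed:

    # Push new recordings up to the file server.  Without --delete,
    # rsync leaves alone anything that exists only on the destination,
    # so the laptop never needs a complete copy of the archive.
    rsync -av Concerts/ nova:archive/Concerts/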
That leaves Group 3 -- the files that don't have to be backed up because
they can be reconstructed from version-controlled sources. All of my
working trees include a Makefile -- in most cases it's a link to
MakeStuff/Makefile --
that builds and installs whatever that tree needs. Programs, web pages,
songbooks, what have you. Initial setup of a new machine is done by a
package called
Honu (Hawaiian for the green sea turtle), which I described a little over a
year ago in "Sable and the turtles: laptop configuration made easy."
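The Makefile link itself is about as simple as it sounds -- something like
this, assuming MakeStuff is checked out next to the working tree:

    # Share the common Makefile by symlinking it into a working tree,
    # then build whatever the tree needs.
    ln -s ../MakeStuff/Makefile Makefile
    make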
The end result is that "backups" are basically a side-effect of the way I
normally work, with frequent small commits that are pushed almost
immediately to a shared repo on Dreamhost. The workflow for large files,
especially recording projects, is similar, working on my laptop and
backing up with Rsync to the file server as I go along. When things are
ready, they go up to the web host. Make targets push and
rsync simplify the process. Going in the opposite direction,
the
pull-all command updates everything from the shared repos.
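The pull-all operation doesn't have to be anything fancier than a loop over
the working trees; a sketch, assuming they all live under one directory:

    # Update every git working tree under ~/work from its shared repo.
    for d in ~/work/*/; do
        if [ -d "$d/.git" ]; then
            (cd "$d" && git pull --ff-only)
        fi
    done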
Your mileage may vary.
Another fine post from
The Computer Curmudgeon (also at
computer-curmudgeon.com).
Donation buttons in
profile.
[Crossposted from
mdlbear.dreamwidth.org, where it has
comments. You can comment here,
or there with openID, but wouldn't you really rather be on Dreamwidth?]