Blog Post #9, In Which I Give A Slightly Higher-Level Overview Of Backups: magikos

magikos

Nov 10, 2010 01:29

Number 9... number 9... number 9...

Yesterday i talked about backups, and listed some backup tools. Today i would like to take a slightly higher-level look at backup systems.

The two technologies which form the basis of backup systems are archives and mirroring.

Archives. An archive is a file which contains other files; it is a sort of miniature filesystem. Archives store not only the contents of files, but also metadata such as the name, the last modified time, the owner, the permissions, and so on.

Archives are handy because they bind a group of files into a single unit. This is particularly useful for making snapshots. Archives can also be useful for backing up files to foreign filesystems.

You are probably already familiar with archives. You have probably even used zip or tar at some point to roll your own backups.
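
To make the "miniature filesystem" idea a bit more concrete, here is a small sketch using Python's standard tarfile module. It packs one file into an archive and then reads back the metadata stored alongside the contents. The file name notes.txt, the archive name backup.tar, and the timestamp are made up for illustration.

```python
import os
import tarfile
import tempfile

def archive_listing(archive_path):
    """Return (name, size, mode, mtime) for each member of a tar archive."""
    with tarfile.open(archive_path, "r") as tar:
        return [(m.name, m.size, m.mode, m.mtime) for m in tar.getmembers()]

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file to back up.
    src = os.path.join(tmp, "notes.txt")
    with open(src, "w") as f:
        f.write("hello backups\n")
    os.utime(src, (1289352540, 1289352540))  # give it a known modification time

    # Bind the file into a single archive.
    archive = os.path.join(tmp, "backup.tar")
    with tarfile.open(archive, "w") as tar:
        tar.add(src, arcname="notes.txt")

    # The archive remembers more than just the bytes of the file.
    listing = archive_listing(archive)
    for name, size, mode, mtime in listing:
        print(name, size, oct(mode), mtime)
```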

Mirroring. A mirror is a complete copy of something: a directory, a file, a website, a disk, whatever. The UNIX cp command can be used for mirroring files, though it is not particularly efficient where backups are concerned, because by default it copies every file each time, changed or not. For efficient remote mirroring, rsync is a better tool (as discussed yesterday).
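
Here is a rough sketch of what "efficient" means for a mirror: skip any file whose size and modification time already match the copy. This only mimics the general idea of rsync's quick check and is in no way how rsync is implemented; the function names mirror and _looks_same are hypothetical, and symlinks and empty directories are ignored to keep it short.

```python
import os
import shutil

def _looks_same(s, d):
    """Cheap comparison: same size and same modification time (to the second)."""
    if not os.path.exists(d):
        return False
    a, b = os.stat(s), os.stat(d)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)

def mirror(src, dst):
    """Make dst a copy of src, copying only files that look changed.

    Returns the sorted list of relative paths that were actually copied.
    """
    copied = []
    for root, dirs, files in os.walk(src):
        rel_root = os.path.relpath(root, src)
        target_root = os.path.normpath(os.path.join(dst, rel_root))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(target_root, name)
            if not _looks_same(s, d):
                shutil.copy2(s, d)  # copy2 also copies the modification time
                copied.append(os.path.normpath(os.path.join(rel_root, name)))
    # A mirror reflects deletions too: drop files that are gone from the source.
    for root, dirs, files in os.walk(dst):
        rel_root = os.path.relpath(root, dst)
        for name in files:
            if not os.path.exists(os.path.join(src, rel_root, name)):
                os.remove(os.path.join(root, name))
    return sorted(copied)
```

Running mirror twice in a row on an unchanged directory copies everything the first time and nothing the second time, which is the whole point.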

dd is a partition mirroring tool, in the same vein as cp. RAID does disk mirroring. (RAID is special because it is transparently handled by the operating system or the hardware. We'll discuss RAID later.)

In my mind, the term "mirror" is associated with websites. In this usage, a mirror is a copy of the website, hosted on another server. Website mirrors are used for load balancing, to improve connection speeds, and yes, for redundancy.

The Interactive Fiction Archive is one example of a mirrored website. Most Linux distributions, such as Debian, Ubuntu, and Arch, also have a large number of mirrors.

There are two ways to approach backups:

Snapshots. Snapshots are simple: each backup is a complete copy (a snapshot) of your data at some point in time. A snapshot is basically the same as a mirror, except "mirror" implies current and "snapshot" implies past.

The downside to snapshots is that they take up a lot of space.

Tools such as partclone and clonezilla can be used to snapshot entire disk partitions. Personally, i have found this to be a rather fiddly backup solution. Although restoration of the entire partition is made simple, it is difficult to retrieve a single file by itself.

Incremental Backups. An improvement on snapshots: instead of storing a complete copy of the data for each backup, you store only the differences from one point in time to the next.

The downside is that restoring a file takes some work, because you have to retrieve the base file and then apply a series of diffs to bring it to the desired state.

Incremental backups go in either of two directions: forwards or reverse. For forwards, you store one full backup of a previous point in time, and subsequent backups store only the changes necessary to bring the full backup up to the current state.

For reverse, there is one complete copy of the data, which is always a snapshot of the most recent backup. Previous backups are stored as reverse diffs; that is, they store how to change the current state to the previous one. Reverse diffs are a good fit for backups, because recent data is more likely to be needed than older data.
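
The two directions can be sketched in a few lines of Python. The delta format below (a list of "copy" and "insert" steps built with difflib from the standard library) is invented for this example and is not what any real backup tool stores; the shopping-list "file" is equally made up.

```python
import difflib

def make_delta(old, new):
    """Describe how to turn the list of lines `old` into `new`.

    Unchanged runs become ("copy", start, end) references into `old`;
    changed or added lines are stored in full as ("insert", lines).
    """
    delta = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("insert", new[j1:j2]))
        # "delete" needs no entry: those lines are simply never copied
    return delta

def apply_delta(old, delta):
    """Rebuild the newer (or older) version from `old` and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

# Three successive states of one file, oldest first.
versions = [
    ["milk", "eggs"],
    ["milk", "eggs", "bread"],
    ["milk", "bread", "tea"],
]

# Forwards: the oldest version in full, plus a delta to each later one.
forward_base = versions[0]
forward_deltas = [make_delta(a, b) for a, b in zip(versions, versions[1:])]

# Reverse: the newest version in full, plus a delta back to each earlier one.
reverse_base = versions[-1]
reverse_deltas = [make_delta(b, a) for a, b in zip(versions, versions[1:])]

def restore_forward(n):
    """Rebuild version n by replaying deltas from the oldest version."""
    state = forward_base
    for delta in forward_deltas[:n]:
        state = apply_delta(state, delta)
    return state

def restore_reverse(n):
    """Rebuild version n by walking backwards from the newest version."""
    state = reverse_base
    for delta in reversed(reverse_deltas[n:]):
        state = apply_delta(state, delta)
    return state
```

Note which restore is cheap in each scheme: restore_forward needs every delta to reach the newest version, while restore_reverse needs none.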

Incremental backups don't have to use diffs; they can store complete files instead.

Here is an example incremental backup system using tape drives: Every week, use tar to do a full backup to a tape drive. Every day, do another backup with tar, to another tape drive, but use the --newer option to store only files which have changed since the previous day. [1]

The tape drive example goes forwards and stores complete files.
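
The daily step of that scheme can be approximated in Python with the standard tarfile module. This is only a sketch of the idea behind tar's --newer option, comparing modification times against a cutoff; the function name incremental_tar is hypothetical, and the archive is assumed to live outside the directory being backed up.

```python
import os
import tarfile

def incremental_tar(src_dir, archive_path, since):
    """Archive only the files under src_dir modified after `since`.

    `since` is a Unix timestamp, for example the time of the previous
    backup. Passing 0 gives a full backup. Returns the sorted list of
    archived paths, relative to src_dir.
    """
    added = []
    with tarfile.open(archive_path, "w") as tar:
        for root, dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                if os.stat(path).st_mtime > since:
                    rel = os.path.relpath(path, src_dir)
                    tar.add(path, arcname=rel)  # whole file, not a diff
                    added.append(rel)
    return sorted(added)
```

To restore, you would unpack the weekly full archive first and then each daily archive in order, letting newer copies overwrite older ones.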

rdiff-backup uses reverse diffs.

rsnapshot is a complete-file incremental backup tool that pretends to be a snapshot tool. Or perhaps it is a snapshot tool which acts like an incremental backup tool.
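
The trick behind that double life is hard links: a file that has not changed since the previous snapshot is linked rather than copied, so every snapshot directory looks complete while unchanged files take up space only once. A minimal sketch of the idea follows. It is not rsnapshot's actual code (rsnapshot drives rsync to do the work), the function names are hypothetical, and it assumes a filesystem that supports hard links.

```python
import os
import shutil

def _unchanged(a, b):
    """Cheap comparison: same size and same modification time (to the second)."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)

def snapshot(src, prev, new):
    """Create snapshot `new` of `src`, hard-linking files unchanged since `prev`.

    `prev` is the previous snapshot directory, or None for the first one.
    Returns the number of files that had to be copied rather than linked.
    """
    copied = 0
    for root, dirs, files in os.walk(src):
        rel_root = os.path.relpath(root, src)
        os.makedirs(os.path.join(new, rel_root), exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            n = os.path.join(new, rel_root, name)
            p = os.path.join(prev, rel_root, name) if prev else None
            if p and os.path.isfile(p) and _unchanged(s, p):
                os.link(p, n)       # same data on disk, just a second name
            else:
                shutil.copy2(s, n)  # new or changed: store a complete copy
                copied += 1
    return copied
```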

There are two important variables in any backup system: the backup interval and the retention rate. [2]

The backup interval is the amount of time between two successive backups.

The retention rate is how long you keep old backups.

The backup interval controls how much data you can potentially lose. If your hard drive crashes or your laptop is stolen, Murphy's Law (or rather, Finagle's Law) ensures that it will always happen right before your next backup. So if your backup interval is 1 month, you are looking at a month's worth of data loss.

In an ideal universe, the perfect backup system would have a backup interval of 0 and a retention rate of ∞; that is, you could retrieve the state of the data at any point in history, to any degree of precision. In the real world, this plan would require copious amounts of disk space, not to mention the processing and bandwidth overhead.

In the real world, a backup interval of 1 day is usually fine, and the retention rate should be as large as you have the disk space for, with 1 month as the minimum.

It is possible to strike a compromise between the snapshot and incremental backup strategies. For example, incremental backups with an interval of 1 day and a retention rate of 1 year, combined with snapshots with an interval of 1 year and an infinite retention rate.
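
That compromise boils down to a rule for deciding which old backups to throw away. Here is one way it might look in Python. The choice of 1 January as the day of the yearly snapshot, and the function names keep and prune, are assumptions made up for this sketch.

```python
from datetime import date, timedelta

def keep(backup_day, today, retention=timedelta(days=365)):
    """Decide whether the backup taken on backup_day should still be kept.

    Assumed policy: daily backups are kept for `retention`, and the backup
    taken on 1 January counts as the yearly snapshot, which is kept forever.
    """
    is_yearly_snapshot = backup_day.month == 1 and backup_day.day == 1
    return is_yearly_snapshot or (today - backup_day) <= retention

def prune(backup_days, today):
    """Return only the backup dates that the policy says to keep."""
    return [d for d in backup_days if keep(d, today)]
```

With this rule the number of backups on disk stays at roughly a year's worth of dailies plus one extra snapshot per year, however long the system runs.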

For a backup system to be useful, it needs to have a decent retention rate. Now we can see why RAID is not a backup solution: the retention rate is 0! RAID does not preserve any state beyond the current one. It also has a backup interval of effectively 0, which means that any deletions or corruptions will instantly be mirrored to the other drives in the array.

It is my hope that you now know more about backups than you possibly care to. For my part, i've now written more about backups than i could possibly know.

[1] http://tldp.org/LDP/sag/html/simple-backups.html
[2] Yes, i made those up.

nablopomo2010, linux, backups, geeky, software