Nobody expects the Spanish Inquisition

An ancient Italian proverb says:

Il backup è quella cosa che andava fatta prima.

A rough translation could be: a backup is the thing that should have been done beforehand.

The concept of backup is often misunderstood. Some people wrongly believe that taking a local backup is sufficient to ensure the recovery of the data.

In the real world a valid backup exists only if three conditions are met.

  1. You run the backup
  2. You copy the backup to a separate machine (possibly off site)
  3. You periodically test the restore from the backup

In the previous post we saw how, with a little configuration, we can achieve points 1 and 2.

In this post we'll see how to schedule periodic backups and test the restore.

As a closing remark I'll tell a database horror story I originally heard at the It's bigger on the inside mini conference.

Scheduling a backup strategy

The thing I love most about pgbackrest is that the backup repository is self contained. You don't need anything else to run a restore.

pgbackrest supports three kinds of backup: full, differential and incremental.

Shamelessly copied from the pgbackrest online manual:

  A backup is a consistent copy of a database cluster that can be restored to recover from a hardware failure, to perform Point-In-Time Recovery, or to bring up a new standby.

  Full Backup: pgBackRest copies the entire contents of the database cluster to the backup server. The first backup of the database cluster is always a Full Backup. pgBackRest is always able to restore a full backup directly. The full backup does not depend on any files outside of the full backup for consistency.

  Differential Backup: pgBackRest copies only those database cluster files that have changed since the last full backup. pgBackRest restores a differential backup by copying all of the files in the chosen differential backup and the appropriate unchanged files from the previous full backup. The advantage of a differential backup is that it requires less disk space than a full backup, however, the differential backup and the full backup must both be valid to restore the differential backup.

  Incremental Backup: pgBackRest copies only those database cluster files that have changed since the last backup (which can be another incremental backup, a differential backup, or a full backup). As an incremental backup only includes those files changed since the prior backup, they are generally much smaller than full or differential backups. As with the differential backup, the incremental backup depends on other backups to be valid to restore the incremental backup. Since the incremental backup includes only those files since the last backup, all prior incremental backups back to the prior differential, the prior differential backup, and the prior full backup must all be valid to perform a restore of the incremental backup. If no differential backup exists then all prior incremental backups back to the prior full backup, which must exist, and the full backup itself must be valid to restore the incremental backup.

So, in short, the first time we run a backup for a stanza pgbackrest executes a full backup. All subsequent backups are incremental unless we specify that we want a full (or differential) backup.
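
For example, using the same command style as the cron entries later in this post, the backup type can be requested explicitly with the --type option (diff and incr are the other accepted values):

pgbackrest backup --stanza db01 --type full    # force a full backup
pgbackrest backup --stanza db01 --type diff    # differential against the last full
pgbackrest backup --stanza db01                # default type incr: incremental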

A reasonable strategy for normal workloads could be to have a full backup every week followed by daily incremental backups.

In the previous post we configured pgbackrest with retention-full=1. This means that only one full backup at a time will be kept in the backup repository.

However, when a new full backup is executed, the previous full backup is expired only after the new one is marked as valid. Therefore, with retention-full=1 the repository must have sufficient disk space to hold at least two full backups, plus the extra space required for the WAL files archived between the full backups.

Incremental backups are expired only when the full or differential backup they depend on is expired.
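
As a reminder, the retention setting lives in the pgbackrest configuration on the backup server. A minimal sketch, assuming the layout used in the previous post (only the retention line matters here):

# /etc/pgbackrest.conf on the backup server
[global]
retention-full=1

[db01]
# stanza options configured in the previous post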

Configuring cron on the backup server to get a weekly full backup and daily incrementals requires minimal effort.

# weekly full backup, executed at midnight on Saturday
00 00 * * 6 pgbackrest backup --stanza db01 --type full
# daily incremental, executed at midnight every day except Saturday
00 00 * * 0-5 pgbackrest backup --stanza db01

Please note that we don't need to wait for Saturday to get the first full backup. Whatever backup type is requested, pgbackrest will perform a full backup anyway if it runs against an empty stanza.

With the full and an incremental backup executed, the output of pgbackrest info should look like this.

postgres@backupsrv:~$ pgbackrest info
stanza: db01
    status: ok

    db (current)
        wal archive min/max (10-1): 00000001000000000000000E / 000000010000000000000010

        full backup: 20180209-184318F
            timestamp start/stop: 2018-02-09 18:43:18 / 2018-02-09 18:43:38
            wal start/stop: 00000001000000000000000E / 00000001000000000000000E
            database size: 22.3MB, backup size: 22.3MB
            repository size: 2.6MB, repository backup size: 2.6MB

        incr backup: 20180209-184318F_20180209-184348I
            timestamp start/stop: 2018-02-09 18:43:48 / 2018-02-09 18:43:53
            wal start/stop: 000000010000000000000010 / 000000010000000000000010
            database size: 22.3MB, backup size: 8.2KB
            repository size: 2.6MB, repository backup size: 470B
            backup reference list: 20180209-184318F

Configuring the restore server

For the restore test we should use a secondary machine, as we want to simulate a disaster recovery scenario; we wouldn't restore the lost database on the backup server itself.

For simplicity we’ll clone the VM db01 that we used in the previous post into a new machine called db02.

Before cloning it is a very good idea to set the cluster's start mode to manual by changing the value in the file /etc/postgresql/10/main/start.conf from auto to manual.

This prevents the database from starting at boot and messing up the existing backups in the pgBackRest repository.
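
One way to make that change from the shell, assuming the file contains the single word auto as in a stock Debian install:

# run on db01 before cloning, as root
sed -i 's/^auto$/manual/' /etc/postgresql/10/main/start.conf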

After cloning the system we must set up the network and the hostname.

# /etc/network/interfaces
#db02
allow-hotplug eth0
iface eth0 inet static
    address 192.168.1.23
    netmask 255.255.255.0
    gateway 192.168.1.1

# /etc/hostname
db02

In our experiment regenerating the ssh keys is not necessary but, in a real world environment, it is imperative in order to distinguish the machines' identities.
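
For the record, on a Debian system regenerating the host keys of the clone boils down to the following (an aside; we skip it in this throwaway lab):

# run on db02, as root
rm /etc/ssh/ssh_host_*
dpkg-reconfigure openssh-server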

Do not forget to add the alias to your local hosts file.

After making the changes you can either reboot the VM or manually restart the network and set the hostname.
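
Something along these lines should work without a reboot, assuming a systemd based Debian like the one used in this series:

# run on db02, as root
hostnamectl set-hostname db02
systemctl restart networking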

Restoring from the backup

We are now ready to test if our backup is restorable.

As there is a pre existing, stopped PostgreSQL cluster on the clone, we need to drop it and recreate it.
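
With the Debian packaging this is a two step operation. A sketch, assuming the version and cluster name used so far in this series:

# run on db02, as root
pg_dropcluster --stop 10 main
pg_createcluster 10 main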

As the VM has been cloned, the /etc/pgbackrest.conf file doesn't need any changes.

pgbackrest has a super useful switch, --delta, which tells the restore process to perform a delta restore. This way any files that are no longer needed are removed from the data area and only the missing or changed files are restored, maximising the restore speed.

The restore command auto generates a recovery.conf in the PostgreSQL data area so that the cluster comes up and runs the recovery using the archived WALs.

When the last WAL has been replayed, recovery ends and the cluster switches to a new timeline.
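
The restore itself, run as the postgres user on db02, boils down to something like this (a sketch; the stanza name and the Debian wrapper scripts match the rest of this series):

postgres@db02:~$ pgbackrest restore --stanza db01 --delta
postgres@db02:~$ pg_ctlcluster 10 main start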

postgres@db02:~$ pgbackrest info
stanza: db01
    status: ok

    db (current)
        wal archive min/max (10-1): 00000001000000000000000E / 000000020000000000000014

        full backup: 20180209-184318F
            timestamp start/stop: 2018-02-09 18:43:18 / 2018-02-09 18:43:38
            wal start/stop: 00000001000000000000000E / 00000001000000000000000E
            database size: 22.3MB, backup size: 22.3MB
            repository size: 2.6MB, repository backup size: 2.6MB

        incr backup: 20180209-184318F_20180209-184348I
            timestamp start/stop: 2018-02-09 18:43:48 / 2018-02-09 18:43:53
            wal start/stop: 000000010000000000000010 / 000000010000000000000010
            database size: 22.3MB, backup size: 8.2KB
            repository size: 2.6MB, repository backup size: 470B
            backup reference list: 20180209-184318F
postgres@db02:~$

The archive max value clearly shows how the timeline has switched from 1 to 2.

Automating a test restore

Wrapping everything in a simple shell script is not complicated.

The script should do the following actions; a minimal sketch follows the list.

  • stop the local cluster
  • perform a delta restore as shown in the asciicast
  • start the cluster
  • wait until the cluster ends the recovery; pg_isready is very useful here
  • send an email/slack message/pagerduty alert with the test results
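
Here is a minimal sketch of such a script. The stanza name, the cluster version and the notification command are assumptions to adapt to your environment.

#!/bin/bash
# restore test sketch for db02; adjust stanza, cluster version and notification to taste
STANZA=db01
RESULT=KO

pg_ctlcluster 10 main stop                          # stop the local cluster
if pgbackrest restore --stanza ${STANZA} --delta    # delta restore from the repository
then
    pg_ctlcluster 10 main start                     # start the cluster, recovery begins
    until pg_isready -q                             # wait until the cluster accepts connections
    do
        sleep 5
    done
    RESULT=OK
fi

# always send the outcome, OK or KO (replace with your mail/slack/pagerduty notification)
echo "restore test for ${STANZA}: ${RESULT} on $(hostname) at $(date)" | mail -s "restore test ${RESULT}" dba@example.com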

The last point is important. Despite the increased noise, it is much better to always receive a message with the OK/KO outcome and the report.

Assuming that no message means success can mask the fact that the restore test is failing (see the next section).

A database horror story

In January 2016 I had the pleasure of attending a mini conference on PostgreSQL held in Italy. The last talk I listened to was a true database horror story.

A lonely PostgreSQL cluster was up and running on a Linux box. The system had worked efficiently for at least a year without needing a restart.

The cluster was configured in archive mode and a daily script saved the data area to a different machine using a simple tar command. The monitoring showed no issues. The data area was consuming storage and not changing too much over time.

Everything worked smoothly until the day of the hardware upgrade. As a prolonged downtime was not a problem, the sysadmins decided to use a simple cold copy strategy to move the data area to the new machine.

After running pg_ctl -m fast stop it became clear that something was going terribly wrong.

The disk usage dropped to zero and a check of the data area showed no files.

The sysadmins decided to use the backup as they were pretty sure that the data area was saved every day.

However, even the multiple backup copies they had were empty. At the end of the story, the cluster was lost.

After the talk the speaker started an interesting discussion with the audience to figure out what had happened.

The solution to the mystery lies in the way the filesystem works on Linux. If a process holds a file open and the file is removed by another process, the inode remains in use until the holding process releases it.

It is very likely that somebody wiped out the data area while the postgres process kept the inodes in use. The daily backup script, which was never tested, saved only an empty directory. As soon as the cluster was stopped, the inodes were released and the deleted data files were gone for good.
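
Incidentally, this is exactly the kind of situation a quick check could have revealed before stopping the cluster: on Linux, lsof can list files that are open but already unlinked. A hedged example, to be adapted to the actual data area path:

# open files that have already been deleted (link count 0), filtered on the postgres processes
lsof +L1 | grep postgres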

Three villains conspired together to cause the disaster.

  1. The person who configured the backup and didn't set up a validation test
  2. The person who wiped out the data area
  3. The person who didn't check the data area status before stopping the cluster

Wrap up

I want to finish this post by quoting Christophe Pettus: Hardware is cheap. Data is expensive.

pgbackrest is a powerful ally to ensure the data is saved safely. It is up to us to verify that the data is really saved and can be restored.

Take care of your data and remember,