Checkpoints and wals, fantastic beasts (and where to find them)

Back in the days when I was an Oracle DBA I had to troubleshoot a strange behaviour on an Oracle installation. Every now and then the system stopped accepting writes, without any apparent reason.

This behaviour looked odd to everybody except my team, which addressed the issue immediately. The problem was caused by a suboptimal configuration of the Oracle instance.

That memory led me to write this post. I will try to explain how durability works in PostgreSQL and how Oracle’s implementation of the same concept was the root cause of the issue I faced.

Fasten your seatbelts, we are about to depart for the wonderful land of ACID.

A drop in the ACID

ACID compliant databases implement the four ACID properties: Atomicity, Consistency, Isolation and Durability.

We’ll focus only on Durability, whose implementation must be taken into account in order to use our database system at its best.

The definition of durability is copied shamelessly from the Wikipedia ACID entry. The emphasis is mine.

The durability property ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against power loss, transactions (or their effects) must be recorded in a non-volatile memory.

Durability is enforced when the committed transactions, or their effects, are recorded on durable storage. Writing the data directly to the data files would cause a considerable amount of I/O and slow down the entire system, so modern databases log the committed transactions instead. The combination of the logs and the data files guarantees that, in case of power loss, crashes or errors, the database can recover the state it had immediately before the loss of service.
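
A quick way to see this principle at work in PostgreSQL is to compare the position of the write ahead log (more on it in the next section) before and after a committed write. The sketch below assumes PostgreSQL 10 or later; the table durability_demo is made up for the example.

-- note the WAL insert position, write and commit a row, then check again:
-- the position advances because every change, commit record included,
-- is written to the WAL before anything touches the data files
SELECT pg_current_wal_insert_lsn();

CREATE TABLE durability_demo (id integer);  -- hypothetical demo table
BEGIN;
INSERT INTO durability_demo VALUES (1);
COMMIT;

SELECT pg_current_wal_insert_lsn();  -- higher than the first value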

WAL’s Den

In PostgreSQL the log where the transactions are saved is called the Write Ahead Log, or WAL. When a data page is modified in memory the change is logged to the WAL first. A page which is not yet written to its data file is called dirty. The time required for a dirty page to be consolidated on disk is variable, as the flush happens only under specific circumstances (a way to observe them is sketched after the list).

  • When the buffer holding the page is required by the buffer manager
  • When the background writer writes down the page
  • When a checkpoint occurs
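
The cumulative counters in pg_stat_bgwriter give a rough idea of how often each of the three paths above is taken. This is just a sketch against PostgreSQL 10, the version used in this post.

-- buffers_checkpoint: pages flushed by the checkpoints
-- buffers_clean:      pages flushed by the background writer
-- buffers_backend:    pages flushed by the backends claiming the buffer
SELECT checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,
       buffers_clean,
       buffers_backend
FROM pg_stat_bgwriter;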

In PostgreSQL the WAL files are fixed size segments of 16 MB. The management of those files is completely automatic and they must never be touched by hand, unless you want to corrupt your data.
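
From version 10 the function pg_ls_waldir() lets a superuser have a look at those segments without going anywhere near the directory. A small sketch:

-- each segment should report a size of 16777216 bytes (16 MB)
SELECT name, size
FROM pg_ls_waldir()
ORDER BY name
LIMIT 5;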

The checkpoint, in particular, is a common procedure which can be found in virtually any database technology. When a checkpoint starts, the process scans the shared buffer and writes down all the dirty pages found during the scan. The WAL position of the checkpoint is then saved into the file $PGDATA/global/pg_control. This way the system knows which WAL files are no longer required for crash recovery and can recycle or delete them.
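
From version 9.6 the checkpoint information stored in pg_control can also be read with plain SQL through pg_control_checkpoint(). A sketch:

-- checkpoint_lsn: location of the last checkpoint record
-- redo_lsn:       position a crash recovery would start from
-- redo_wal_file:  oldest WAL segment still required for the recovery
SELECT checkpoint_lsn,
       redo_lsn,
       redo_wal_file
FROM pg_control_checkpoint();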

The pg_control file is used by the postgres process during startup to determine whether the cluster requires a recovery or not. If it does, the last checkpoint’s redo location is used as the starting point for the recovery. At the end of the recovery, and before accepting connections, the system executes a checkpoint in order to flush all the replayed changes to disk.

Checkpoints occurring too frequently result in unnecessary I/O. Checkpoints happening too seldom result in a longer recovery time in case of a crash. Tuning the checkpoint frequency is a compromise between two dimensions, space and time.

The parameter checkpoint_timeout expresses the maximum time between two automatic checkpoints.

PostgreSQL up to version 9.4 had the parameter checkpoint_segments, which expressed the number of WAL segments generated by new transactions before a checkpoint is triggered. This parameter also controlled the minimum number of log files kept in the WAL directory. As this caused some confusion, from version 9.5 the parameter has been replaced by max_wal_size. The online documentation is very clear about it.

max_wal_size is the maximum size to let the WAL grow to between automatic WAL checkpoints. This is a soft limit; WAL size can exceed max_wal_size under special circumstances, like under heavy load, a failing archive_command, or a high wal_keep_segments setting. The default is 1 GB. Increasing this parameter can increase the amount of time needed for crash recovery.
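
Both knobs can be inspected and changed with plain SQL. The values below are just an example, not a recommendation; they must be tuned against the actual workload.

-- current values
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'max_wal_size');

-- hypothetical new values, effective after a configuration reload
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '2GB';
SELECT pg_reload_conf();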

PostgreSQL’s write ahead logging implementation allows checkpoints and running transactions to coexist with minimal interference. The WAL segments are written sequentially and their names never repeat, so overwriting records in existing WAL segments is never required. However this brilliant approach has a weak spot which, if ignored, can result in a database crash.
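
The segment name is in fact derived from the timeline and the WAL position, so it keeps growing instead of wrapping around. The mapping can be checked with pg_walfile_name(), as in this sketch:

-- returns the current WAL location and the name of the segment holding it;
-- the name encodes the timeline and the position, so it never repeats
SELECT pg_current_wal_lsn(),
       pg_walfile_name(pg_current_wal_lsn());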

A thousand ways to crash (and possibly corrupt) PostgreSQL

At the FOSDEM PGDay 2017 Ilya Kosmodemiansky presented a very interesting talk about the worst practices in PostgreSQL. One of the methods to crash your PostgreSQL database was to let the WAL area run out of space.

And this apparently difficult task is indeed very simple to achieve.

The first step towards the disaster is to open a database session and start a read write transaction, leaving it idle in transaction.

postgres@db01:/pg_data/10/demo$ psql -p 5433
psql (10.2 (Debian 10.2-1.pgdg90+1))
Type "help" for help.

postgres=# begin;
BEGIN
postgres=# create table dd();
CREATE TABLE
postgres=#
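
Sessions left idle in transaction like this one can be spotted with a query on pg_stat_activity. This is just a sketch; in a real environment idle_in_transaction_session_timeout, available from version 9.6, is a better safety net.

-- long running idle in transaction sessions, oldest first
SELECT pid,
       usename,
       now() - xact_start AS xact_age,
       state,
       query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;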

Then we open a second session and set up a regular, committed data write. psql’s \watch facility is perfect for the purpose.

postgres=# create table test as select * from pg_attribute;
SELECT 2579
postgres=# insert into test select * from test;\watch 1
INSERT 0 2579
Fri 02 Mar 2018 06:42:25 GMT (every 1s)
INSERT 0 5158
Fri 02 Mar 2018 06:42:26 GMT (every 1s)
INSERT 0 10316
Fri 02 Mar 2018 06:42:27 GMT (every 1s)
INSERT 0 20632
Fri 02 Mar 2018 06:42:28 GMT (every 1s)
INSERT 0 41264
Fri 02 Mar 2018 06:42:29 GMT (every 1s)
INSERT 0 82528
Fri 02 Mar 2018 06:42:31 GMT (every 1s)
INSERT 0 165056
Fri 02 Mar 2018 06:42:33 GMT (every 1s)
INSERT 0 330112
Fri 02 Mar 2018 06:42:36 GMT (every 1s)
INSERT 0 660224
Fri 02 Mar 2018 06:42:40 GMT (every 1s)
INSERT 0 1320448
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Because the first session is holding an open transaction, the checkpoint cannot advance the recovery WAL position and recycle or delete the old WAL files. However the second session keeps writing new data, and this requires the creation of new WAL segments because the existing files are still needed in case of a crash recovery. In this example the WAL location was made very small on purpose, just 500 MB. When the partition fills up, this is what happens.

2018-03-02 06:44:39.971 GMT [3849] PANIC:  could not write to file "pg_wal/xlogtemp.3849": No space left on device
2018-03-02 06:44:39.971 GMT [3849] CONTEXT:  writing block 25637 of relation base/12994/16387
2018-03-02 06:44:39.972 GMT [3554] LOG:  server process (PID 3849) was terminated by signal 6: Aborted
2018-03-02 06:44:39.972 GMT [3554] DETAIL:  Failed process was running: autovacuum: VACUUM ANALYZE public.test
2018-03-02 06:44:39.972 GMT [3554] LOG:  terminating any other active server processes
2018-03-02 06:44:39.973 GMT [3580] postgres@postgres WARNING:  terminating connection because of crash of another server process
2018-03-02 06:44:39.973 GMT [3580] postgres@postgres DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-03-02 06:44:39.973 GMT [3580] postgres@postgres HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2018-03-02 06:44:39.982 GMT [3559] WARNING:  terminating connection because of crash of another server process
2018-03-02 06:44:39.982 GMT [3559] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-03-02 06:44:39.982 GMT [3559] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2018-03-02 06:44:39.985 GMT [3612] postgres@postgres WARNING:  terminating connection because of crash of another server process
2018-03-02 06:44:39.985 GMT [3612] postgres@postgres DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-03-02 06:44:39.985 GMT [3612] postgres@postgres HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2018-03-02 06:44:39.999 GMT [3996] postgres@postgres FATAL:  the database system is in recovery mode
2018-03-02 06:44:40.004 GMT [3554] LOG:  all server processes terminated; reinitializing
2018-03-02 06:44:40.238 GMT [3997] LOG:  database system was interrupted; last known up at 2018-03-02 06:44:06 GMT
2018-03-02 06:44:40.947 GMT [3997] LOG:  database system was not properly shut down; automatic recovery in progress
2018-03-02 06:44:40.951 GMT [3997] LOG:  redo starts at 0/19400200
2018-03-02 06:44:51.324 GMT [3997] LOG:  redo done at 0/3BFFFEA8
2018-03-02 06:44:51.325 GMT [3997] LOG:  last completed transaction was at log time 2018-03-02 06:44:15.364711+00
2018-03-02 06:44:55.038 GMT [3997] FATAL:  could not write to file "pg_wal/xlogtemp.3997": No space left on device
2018-03-02 06:44:55.620 GMT [3554] LOG:  startup process (PID 3997) exited with exit code 1
2018-03-02 06:44:55.620 GMT [3554] LOG:  aborting startup due to startup process failure
2018-03-02 06:44:55.632 GMT [3554] LOG:  database system is shut down

This scenario is quite nasty. The database cannot recover because pg_wal is full, so after the first attempt the startup process gives up and leaves the database system in a shut down state. The only way to get the database up again is to give more space to the pg_wal location.
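
To avoid getting to that point it helps to keep an eye on the WAL area; from version 10 its size can be checked with plain SQL, as in this sketch.

-- number of WAL segments and total size of the WAL area
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();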

In all versions before 10 the directory where the WAL files are stored is named pg_xlog. This name has caused confusion, with some users deleting the directory contents in the wrong assumption that the files inside were not critical.

Had our example run on version 9.6, the WAL location would have been named pg_xlog. An inexperienced user could think that the files inside the directory are not important and remove them. By doing so, the database system would be corrupted.

The Oracle way

The Oracle implementation of this mechanism is similar as far as the logging is concerned: the data changes are logged in the Redo Log Files. However the management of those files is left to the user, who decides their names, sizes, number and location, with (optional but strongly suggested) multiplexing.

A redo log group which is actively written is in the CURRENT status. When the log is full the DBMS switches the writes to the next inactive log group available. The former CURRENT log group is then marked ACTIVE and a checkpoint is started in order to write down the blocks whose changes are logged in the ACTIVE group. When the checkpoint completes, the redo log group is marked INACTIVE and is ready to be used again. The DBMS writes to the redo log groups in a circular way.
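
The status of each group can be checked from the data dictionary, provided the session has access to v$log. A sketch:

-- CURRENT:  the group being written
-- ACTIVE:   still required by the checkpoint
-- INACTIVE: ready to be reused
SELECT group#,
       thread#,
       bytes/1024/1024 AS size_mb,
       status
FROM v$log
ORDER BY group#;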

If the level of writes is too high it can happen that no INACTIVE redo group is available yet. In that case the DBMS pauses the writes until the checkpoint frees at least one redo log group. So the odd behaviour I mentioned at the beginning of this post wasn’t odd at all, and the solution was very simple indeed: increasing the size and the number of redo log groups completely solved the problem.
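
Adding bigger groups is a plain DDL operation; the group numbers, paths and sizes below are purely hypothetical and must be adapted to the actual system.

-- add two extra multiplexed redo log groups of 1 GB each (hypothetical paths)
ALTER DATABASE ADD LOGFILE GROUP 4
  ('/u01/oradata/ORCL/redo04a.log', '/u02/oradata/ORCL/redo04b.log') SIZE 1G;
ALTER DATABASE ADD LOGFILE GROUP 5
  ('/u01/oradata/ORCL/redo05a.log', '/u02/oradata/ORCL/redo05b.log') SIZE 1G;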

Wrap up

Checkpoints and WALs are fantastic beasts. They protect our data by implementing durability. However it’s very important to understand how they work, in order to avoid the risk of disruption or data loss.

Door to nowhere, copyright Federico Campoli