Regenerated

With PostgreSQL 12, generated columns are now supported natively. Up to PostgreSQL 11 it was possible to emulate generated columns using a trigger.

In this post we’ll see how to configure a generated column via a trigger and natively, and then we’ll compare the performance of both strategies.

Generated columns

A generated column can be stored or virtual. A stored column calculates the generated value from other columns of the table and stores it during INSERT and UPDATE operations. A virtual column is computed on the fly when the data is retrieved from the table. Both implementations have pros and cons; however, PostgreSQL 12 supports only STORED generated columns.

Some limitations and considerations apply.

  • The generation expression can only use immutable functions
  • Subqueries or any reference to anything other than the current row are forbidden
  • A generation expression cannot reference another generated column
  • A generation expression cannot reference a system column, except for the tableoid
  • A generated column cannot have a column default or an identity definition
  • A generated column cannot be part of a partition key
  • Foreign tables can have generated columns
  • Generated columns can’t be written explicitly by INSERT or UPDATE; however, it’s possible to specify the keyword DEFAULT in the INSERT or UPDATE column list (see the example after this list)
  • Access privileges for generated columns follow the normal rules of the column permissions
  • Generated columns are updated after the BEFORE trigger is processed, therefore it is not allowed to access generated columns in BEFORE triggers
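
As a quick illustration of the write rules, here is a minimal sketch using a hypothetical table tab_demo; the error shown is the kind of message PostgreSQL 12 emits, and the exact wording may vary between versions.

CREATE TABLE tab_demo
(
	price NUMERIC,
	qty INTEGER,
	total NUMERIC GENERATED ALWAYS AS (price * qty) STORED
);

-- writing the keyword DEFAULT into the generated column is allowed
INSERT INTO tab_demo (price, qty, total) VALUES (9.99, 3, DEFAULT);

-- writing any other value is rejected
INSERT INTO tab_demo (price, qty, total) VALUES (9.99, 3, 100);
-- ERROR:  cannot insert a non-DEFAULT value into column "total"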

The trigger approach

Until PostgreSQL 11 generated columns were only possible using a trigger. Triggers have been part of the PostgreSQL architecture for a very long time. They provide a convenient way to automatically run functions written in a procedural language, which may update data elsewhere or manipulate the row image within the trigger’s call.

In our example let’s consider a table with three fields.

CREATE TABLE tab_trigger
(
	username CHARACTER VARYING(20),
	userpass CHARACTER VARYING(20),
	userhash CHARACTER VARYING(50)
);

When a row is inserted with username and userpass we want the field userhash to store the md5 hash of the concatenated strings username and userpass.

To do so we first need to write the function that will be executed by the trigger when fired.

CREATE OR REPLACE FUNCTION fn_gen_hash()
RETURNS TRIGGER AS
$BODY$
	BEGIN
		-- reject any attempt to write userhash directly
		IF (NEW.userhash IS NOT NULL AND TG_OP = 'INSERT')
		THEN
			RAISE EXCEPTION 'cannot insert into the column "userhash". It''s a generated column';
		-- IS DISTINCT FROM is a NULL safe comparison, unlike <>
		ELSIF (TG_OP = 'UPDATE' AND NEW.userhash IS DISTINCT FROM OLD.userhash)
		THEN
			RAISE EXCEPTION 'cannot update the column "userhash". It''s a generated column';
		END IF;
		-- compute the stored value and return the modified row image
		NEW.userhash = md5(NEW.username || NEW.userpass);
		RETURN NEW;
	END;
$BODY$
LANGUAGE plpgsql;

Unlike a regular user-defined function, a trigger function returns the type TRIGGER. Within the function, specific variables are available along with the images of the row that fired the trigger.

In order to replicate the native generated columns we need to prevent direct changes to userhash.

To do so we run different checks according to the operation that fired the trigger, using the information stored in the variable TG_OP.

When the trigger is fired by an INSERT operation we need to ensure that NEW.userhash IS NULL. If the trigger is fired by an UPDATE operation we need to ensure that NEW.userhash is equal to OLD.userhash.

If any of those checks fail then the function must raise an exception.

In the rest of the function we calculate the md5 hash using the concatenated string of NEW.username and NEW.userpass and the value is assigned to NEW.userhash. The function then returns the NEW row image.

After defining the function we can now create the table’s trigger which will fire BEFORE any INSERT or UPDATE for EACH ROW.

CREATE TRIGGER trg_gen_hash
	BEFORE INSERT OR UPDATE
	ON tab_trigger
	FOR EACH ROW
	EXECUTE FUNCTION fn_gen_hash();

This asciicast shows the trigger in action:
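
For readers following along in text rather than watching the recording, a minimal session along these lines (the values are purely illustrative) shows the behaviour enforced by fn_gen_hash:

-- userhash is computed automatically on insert
INSERT INTO tab_trigger (username, userpass) VALUES ('alice', 'secret');
-- userhash now contains md5('alicesecret')

-- direct writes raise the exceptions defined in the function
INSERT INTO tab_trigger (username, userpass, userhash) VALUES ('bob', 'secret', 'tampered');
-- ERROR:  cannot insert into the column "userhash". It's a generated column

UPDATE tab_trigger SET userhash = 'tampered' WHERE username = 'alice';
-- ERROR:  cannot update the column "userhash". It's a generated column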

The native approach

Declarative generated columns are supported from PostgreSQL 12 onward. The current implementation supports only STORED generated columns. The syntax for creating a generated column is:

GENERATED ALWAYS AS ( generation_expr ) STORED

Adding this clause to the column definition creates the column as a generated column. When configured as generated the column doesn’t accept direct writes. The result of the specified expression is returned when read.

The keyword STORED means that the column is computed on write and then stored on disk.

The generation expression can refer to other columns in the table, as long as they are not themselves generated columns. Any functions and operators used in the generation expression must be immutable. Referencing columns of other tables is not allowed.
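
For example, a generation expression based on a non-immutable function such as now() is rejected at table creation time. This is a minimal sketch with a hypothetical table tab_bad; the exact error wording may differ between versions.

CREATE TABLE tab_bad
(
	-- now() is not immutable, so this definition fails
	created_label TEXT GENERATED ALWAYS AS (now()::TEXT) STORED
);
-- ERROR:  generation expression is not immutable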

Therefore a table with a generated column can be created with the following command.

CREATE TABLE tab_gen
(
	username CHARACTER VARYING(20),
	userpass CHARACTER VARYING(20),
	userhash CHARACTER VARYING(50) GENERATED ALWAYS AS (md5(username||userpass)) STORED
);

This asciicast shows the generated column in action and compares the results with the data stored in tab_trigger.
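
Again, as a text-only sketch of what the recording shows (with illustrative values), the generated column produces the same hash the trigger computes in tab_trigger:

INSERT INTO tab_gen (username, userpass) VALUES ('alice', 'secret');

SELECT userhash FROM tab_gen WHERE username = 'alice';
-- returns the same md5('alicesecret') value stored by the trigger in tab_trigger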

Performance

Triggers in PostgreSQL have a bad reputation for poor performance, caused by the trigger architecture requiring a call to a function written in a procedural language. As procedural languages are implemented with shared libraries, this may cause CPU churn.

The native implementation of generated columns improves performance because the functionality sits in the core.

With a simple test we can check how fast a generated column is compared to the trigger implementation.

For the sake of completeness we’ll also create a table with no generated columns, in order to measure the overhead of a generated column, whether native or trigger based.

CREATE TABLE tab_nogen
(
	username CHARACTER VARYING(20),
	userpass CHARACTER VARYING(20),
	userhash CHARACTER VARYING(50)
);

Our test measures the speed of a bulk insert, using the function generate_series, followed by a bulk update and a bulk delete on the three tables.

The PostgreSQL 12 used for the test is compiled from source on Slackware Linux 14.2. The cluster’s shared_buffers is left at the default value of 128MB; the tables, even when full of data, are small enough to fit in the shared buffers. The CPU is an AMD Ryzen 5 2600 six-core processor, the installed memory is 16GB of DDR4 3000MHz, and the disk subsystem is a 5400 RPM rotating disk with a LUKS encrypted filesystem on LVM. The filesystem in use is EXT4.

Before starting our test we’ll truncate the tables tab_gen and tab_trigger in order to start with clean relations. In psql we’ll enable \timing in order to get each query’s execution time.

db_test=# TRUNCATE TABLE tab_gen ;
TRUNCATE TABLE
db_test=# TRUNCATE TABLE tab_trigger ;
TRUNCATE TABLE
db_test=# \timing
Timing is on.

INSERT

The insert of 1,000,000 rows in tab_trigger takes roughly 6.5 seconds.

db_test=# INSERT INTO tab_trigger (username,userpass) SELECT 'a','b' FROM generate_series(1,1000000);
INSERT 0 1000000
Time: 6439.785 ms (00:06.440)

The same insert in tab_gen takes roughly 4 seconds.

db_test=# INSERT INTO tab_gen (username,userpass) SELECT 'a','b' FROM generate_series(1,1000000);
INSERT 0 1000000
Time: 4112.265 ms (00:04.112)

The insert in the table with the trigger is noticeably slower than the insert into the table with the native generated column.

Unsurprisingly the insert into the table without generated columns is the fastest of the three.

db_test=# INSERT INTO tab_nogen (username,userpass) SELECT 'a','b' FROM generate_series(1,1000000);
INSERT 0 1000000
Time: 2432.060 ms (00:02.432)

Even with better performance, the native column generation has some impact on the inserts. The table with the trigger gives the worst performance compared to the insert into a table without generated columns.

UPDATE

This time we’ll run a bulk update on the rows previously added to the three tables.

db_test=# UPDATE tab_trigger SET username='c',userpass='d';
UPDATE 1000000
Time: 13304.145 ms (00:13.304)
db_test=# UPDATE tab_gen SET username='c',userpass='d';
UPDATE 1000000
Time: 8926.579 ms (00:08.927)

Despite the update’s slower execution, which is likely caused by MVCC (each updated row produces a new row version), the overhead imposed by the trigger relative to the native implementation seems less pronounced than it was for the insert.

db_test=# UPDATE tab_nogen SET username='c',userpass='d';
UPDATE 1000000
Time: 5835.877 ms (00:05.836)

Comparing with the update on the table without a generated column, we find again that the native column generation adds an overhead, which is nevertheless smaller than that of the trigger based setup.

DELETE

Surprisingly, the DELETE is impacted by an overhead both when using the trigger and with the native generated columns.

db_test=# DELETE FROM tab_gen;
DELETE 1000000
Time: 4557.193 ms (00:04.557)
db_test=# DELETE FROM tab_trigger;
DELETE 1000000
Time: 4711.798 ms (00:04.712)
db_test=# DELETE FROM tab_nogen;
DELETE 1000000
Time: 1882.149 ms (00:01.882)
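
Putting the measured timings side by side makes the pattern clearer:

Operation    tab_nogen      tab_gen        tab_trigger
INSERT       2432.060 ms    4112.265 ms    6439.785 ms
UPDATE       5835.877 ms    8926.579 ms    13304.145 ms
DELETE       1882.149 ms    4557.193 ms    4711.798 ms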

Wrap up

PostgreSQL continues to add new features that may improve the performance and the maintainability of the data model. Generated columns are a step forward in filling the gap with other RDBMS. As with any new feature, it’s necessary to understand how it works and what the cost/benefit ratio is.

In the case of generated columns there are some limitations which may add some difficulties when migrating from the trigger approach. However, the performance benefits and the simplification brought by the native generated columns are sufficient reasons to justify the effort of migrating away from triggers.

Thanks for reading.

Trust me I’m the Doctor, copyright Federico Campoli