In this course we will take a deep look at how PostgreSQL works.

Day 1 - Indexing

Installation

  1. Install PostgreSQL from source
  2. Enable debugging
  3. Add the contrib package
  4. Switch to the postgres user: sudo -u postgres -i
  5. Create a cluster: initdb -E UTF8 --locale=en_US.utf8
  6. Connect to it with: psql -h 127.0.0.1 -p 5432 -U postgres -d postgres

psql

The standard postgres client to interact with your cluster.

| Command | Explanation |
| :--- | :--- |
| \? | List psql commands |
| \h | Help on SQL commands |
| \l | List databases |
| \d test | Describe the test table |
| TABLE test | Runs a SELECT * on the given table |
| \dt+ | List tables with extra details, including their size |
| \timing | Toggle timing of query execution |
| \x | Toggle expanded output |
| SET enable_seqscan = off; | Disable sequential scans |
| ANALYZE | Refresh the planner statistics for better estimates |
| \set | List the client-side (psql) variables |
| \e | Edit the query buffer in a text editor |
| \do @> | Show information about an operator |
| \du | List roles |
| \dt | List tables (relations) |
| SHOW data_directory; | Display the path to the PostgreSQL data directory |
| \dn | List schemas |
| \sf count_rows | Show the definition of a function |
| \ef count_rows | Edit a function |
| \conninfo | Show info about the current connection |
| SELECT pg_size_pretty(pg_total_relation_size('tablename')); | Get the total size of a table in human-readable form |

Indexes

btree indexes

This is the most common index type, used with simple data types that are orderable.

/* Create a new database */
CREATE DATABASE course;

/* Create a new table */
CREATE TABLE test (id bigint GENERATED ALWAYS AS IDENTITY NOT NULL, name text NOT NULL);

/* Insert a lot of data: repeat the second statement to double the row count each time */
INSERT INTO test (name) VALUES ('max'), ('ore');
INSERT INTO test (name) SELECT name FROM test;

/* Create an index */
CREATE INDEX test_id_idx ON test (id);

/* Analyze what pg does */
postgres=# EXPLAIN SELECT * FROM test WHERE id = 42;
 Index Scan using test_id_idx on test  (cost=0.43..8.45 rows=1 width=12)
   Index Cond: (id = 42)

These numbers are estimates generated by the PostgreSQL planner.

| Data | Explanation |
| :--- | :--- |
| cost=0.43 | Startup cost: estimated cost before the first row can be returned |
| cost=8.45 | Total cost: estimated cost to return all rows (in arbitrary planner units, not time) |
| rows=1 | Estimated number of rows returned |
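To compare these estimates with reality, you can execute the query for real; EXPLAIN (ANALYZE) adds the actual time and actual row count next to the estimates (a minimal sketch; timings depend on your machine):

EXPLAIN (ANALYZE) SELECT * FROM test WHERE id = 42;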

More

| Index type | SQL | How/When? |
| :--- | :--- | :--- |
| Conditional (partial) index | CREATE INDEX ON test (name) WHERE name NOT IN ('max', 'ore'); | When queries always add the same extra constraint |
| Multicolumn indexes | CREATE INDEX ON test (id, name); | Pick columns compared with an equality operator, and put them first |
| Expressions | CREATE INDEX ON test (upper(name)); | Index the result of an expression used in queries |
| Extensions | CREATE EXTENSION citext; | Case-insensitive text comparisons |
| Operator classes | CREATE INDEX ON test (name text_pattern_ops); | Useful when an index is needed for LIKE with a prefix (e.g. 'max%') |
| Exclusion constraints | ALTER TABLE reservation ADD EXCLUDE USING gist (res WITH &&); | Prevent overlapping values, e.g. reservation ranges |

gin indexes

Use GIN indexes for composite data types (jsonb, arrays, ...).

CREATE INDEX ON json USING gin(j);

The more complex the JSON documents get, the harder it is to index them efficiently.
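For context, a minimal sketch of the kind of table the index above assumes (the table name json and column j come from the example; the sample data and query are assumptions):

CREATE TABLE json (id bigint GENERATED ALWAYS AS IDENTITY, j jsonb);
INSERT INTO json (j) VALUES ('{"name": "max", "tags": ["a", "b"]}');

/* The containment operator @> can use the GIN index created above */
EXPLAIN SELECT * FROM json WHERE j @> '{"name": "max"}';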

gist indexes

Use GiST indexes for geometric and range data types (e.g. point '(10,20)', tsrange).

CREATE TABLE reservation (id bigint PRIMARY KEY, name text NOT NULL, res tsrange NOT NULL);
CREATE INDEX on reservation USING gist (res);
EXPLAIN SELECT * FROM reservation WHERE res && '[2020-04-24 09:00:00,2020-04-24 12:00:00]';

Regular expressions

When working with regular expressions or similarity searches, you should use a GIN index with the gin_trgm_ops operator class.

CREATE EXTENSION pg_trgm;

CREATE INDEX ON test USING gin (name gin_trgm_ops);

EXPLAIN SELECT * FROM test WHERE name % 'maxime';

EXPLAIN SELECT * FROM test WHERE name ~ 'ma(x)';

SELECT show_trgm('maxime');
{"  m"," ma",axi,ime,max,"me ",xim}

Tips

  • Always use bigint by default
  • Use text if your application does not enforce any length limit
  • Never use a text column to store a date or a number
  • Always test what happens with NULL values
  • IS NULL is the only way to test for NULL values (see the example after this list)
  • Whenever you create a foreign key, also create an index on it
  • You don’t want data redundancy in your application
  • You should always filter data in SQL, not in the application
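A quick illustration of the NULL behavior mentioned above: any comparison with NULL yields NULL, so only IS NULL works.

SELECT NULL = NULL;    -- returns NULL, not true
SELECT NULL IS NULL;   -- returns true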

Sometimes you have to change the query to make it efficient.

Day 2 - Transactions and locking; Functions / JSON / Partitioning

A transaction either completes entirely or not at all.

By default pg runs in autocommit mode, which means each SQL statement runs in its own transaction.

  • A: Atomicity
  • C: Consistency
  • I: Isolation
  • D: Durability

Any error that happens within a transaction aborts the whole transaction.

/* COMPLETE A TRANSACTION */
BEGIN;
COMMIT;

/* ROLLBACK A TRANSACTION */
BEGIN;
ROLLBACK;

We can prevent this behavior by using a SAVEPOINT. It creates a subtransaction, so an error does not abort the whole transaction. Savepoints have a performance cost, so be careful with them.

BEGIN;
SAVEPOINT a;
SELECT 1/0;
ERROR:  division by zero
SELECT 42;
ERROR:  current transaction is aborted, commands ignored until end of transaction block
ROLLBACK TO SAVEPOINT a;
/* After rolling back to the savepoint, the transaction can continue */
SELECT 42;
COMMIT;

By default, waiting for a lock never times out; you can change that with:

  • SET lock_timeout = '3s';
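A minimal sketch of what happens when the timeout expires, assuming another session already holds a conflicting lock on a row of the account table used just below:

SET lock_timeout = '3s';
UPDATE account SET amount = amount + 1 WHERE id = 2;
ERROR:  canceling statement due to lock timeout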

Pessimistic locking

The row is always locked.

If conflicts happen often, prefer the pessimistic locking strategy.

/* FOR UPDATE locks the selected rows in every table referenced by the query */
SELECT amount FROM account WHERE id = 2 FOR UPDATE;

/* Lock only the rows coming from the account table */
SELECT amount FROM account WHERE id = 2 FOR UPDATE OF account;

/* Do not wait and throw an error */
SELECT amount FROM account WHERE id = 2 FOR UPDATE NOWAIT;
ERROR:  could not obtain lock on row in relation "account"

/* Skip */
SELECT amount FROM account WHERE id = 2 FOR UPDATE SKIP LOCKED;

Optimistic locking

We don’t lock the row right away; if a conflict is detected, we have to run the transaction again.

If conflicts are rare, use the optimistic locking strategy.

BEGIN ISOLATION LEVEL REPEATABLE READ;
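A minimal sketch of the conflict you then have to handle, assuming two concurrent sessions work on the account table used above:

/* Session 1 */
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE account SET amount = amount - 100 WHERE id = 2;

/* Session 2, before session 1 commits */
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE account SET amount = amount - 50 WHERE id = 2;   -- blocks, waiting for session 1

/* Session 1 */
COMMIT;

/* Session 2 now fails and must retry the whole transaction */
ERROR:  could not serialize access due to concurrent update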

Debugging

/* How pg persists the data internally */
SELECT ctid, xmin, xmax, * FROM account;
/* ctid = where the row is physically located in the database */

/* In a second session: TRUNCATE takes an ACCESS EXCLUSIVE lock; keep the transaction open */
BEGIN;
TRUNCATE account;

/* Show current client sessions; blocked ones have a wait_event set */
SELECT * FROM pg_stat_activity WHERE datname IS NOT NULL;

SELECT pg_backend_pid();
SELECT * FROM pg_locks WHERE pid = 3419;

/* Cancel the running query of a given backend (use pg_terminate_backend() to kill the whole session) */
SELECT pg_cancel_backend(3419);

/* Terminate sessions that stay idle inside a transaction for more than 2s */
SET idle_in_transaction_session_timeout = '2s';
SHOW statement_timeout;

Deadlock

It’s okay to have the occasional deadlock; applications can live with them. Adjust your design and implement retry logic around them.
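A minimal sketch of how a deadlock arises: two sessions update the same two rows of the account table in opposite order.

/* Session 1 */
BEGIN;
UPDATE account SET amount = amount + 1 WHERE id = 1;

/* Session 2 */
BEGIN;
UPDATE account SET amount = amount + 1 WHERE id = 2;
UPDATE account SET amount = amount + 1 WHERE id = 1;   -- blocks, waiting for session 1

/* Session 1 */
UPDATE account SET amount = amount + 1 WHERE id = 2;   -- now both wait for each other
ERROR:  deadlock detected

PostgreSQL detects the cycle and aborts one of the two transactions; the application should simply retry it.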

Functions

You can create functions in pg. By default they are executable by everybody.

CREATE FUNCTION doubleme(i integer) RETURNS integer LANGUAGE sql AS 'SELECT i * 2';
SELECT doubleme(21);

CREATE FUNCTION another (IN i integer, OUT x integer, OUT y integer) RETURNS record LANGUAGE sql AS 'SELECT i, i+5';
SELECT another(2);

CREATE FUNCTION tabf (IN i integer, OUT x integer, OUT y integer) RETURNS SETOF record LANGUAGE sql AS 'SELECT i, j FROM generate_series(1, i) AS j';
SELECT * FROM tabf(5);

CREATE FUNCTION hello() RETURNS text LANGUAGE sql AS 'SELECT ''hello world''';
CREATE OR REPLACE FUNCTION hello() RETURNS text LANGUAGE sql AS $$SELECT 'hello world'$$;

CREATE FUNCTION addone(integer) RETURNS integer LANGUAGE sql AS 'SELECT $1 + 1';
EXPLAIN (VERBOSE) SELECT addone(id::integer) FROM test;

Tips

  • Locking documentation
  • Locks affect concurrency
  • Readers never block writers and writers never block readers
  • REPEATABLE READ ensures a consistent snapshot of the data for the whole transaction

Short transactions are the key to database happiness.

Day 3 - The PostgreSQL I/O system; Memory management

Triggers & Views

A trigger executes a function that returns the special type trigger.

Triggers only run on data modification. They fire immediately as part of the triggering statement. Constraint triggers can be delayed until the end of the transaction with the DEFERRED option.

Triggers can fire BEFORE, AFTER, or (on views) INSTEAD OF the operation.

  • tstzrange: a range of timestamp with time zone values

A SQL view is just a stored SQL statement. Here are some good use cases:

  • Showing a subset of the data
  • As a security tool
  • As an interface / abstraction layer

Don’t design a view as a table!

Let’s see an example with a trigger and a view.

CREATE TRIGGER

CREATE EXTENSION btree_gist;

/* Unique constraint, valid and overlap */
CREATE TABLE realdata (
   id bigint GENERATED BY DEFAULT AS IDENTITY NOT NULL,
   value text,
   valid tstzrange DEFAULT tstzrange(current_timestamp, NULL) NOT NULL,
   EXCLUDE USING gist (valid WITH &&, id WITH =)
);

CREATE VIEW data AS SELECT id, value FROM realdata WHERE valid @> current_timestamp;

CREATE OR REPLACE FUNCTION data_trig() RETURNS trigger LANGUAGE plpgsql
AS $$
BEGIN
   CASE TG_OP
      WHEN 'INSERT' THEN
         IF NEW.id IS NULL THEN
            INSERT INTO realdata(value) VALUES (NEW.value);
         ELSE
            INSERT INTO realdata(id, value) VALUES (NEW.id, NEW.value);
         END IF;
         RETURN NEW;
      WHEN 'DELETE' THEN
         UPDATE realdata
            SET valid = tstzrange(lower(valid), current_timestamp)
            WHERE id = OLD.id AND valid @> TIMESTAMPTZ 'infinity';
         RETURN OLD;
      WHEN 'UPDATE' THEN
         UPDATE realdata
            SET valid = tstzrange(lower(valid), current_timestamp)
            WHERE id = OLD.id AND valid @> TIMESTAMPTZ 'infinity';
         INSERT INTO realdata(id, value) VALUES (NEW.id, NEW.value);
         RETURN NEW;
   END CASE;
END;
$$;

CREATE TRIGGER data_trig INSTEAD OF INSERT OR UPDATE OR DELETE ON data FOR EACH ROW EXECUTE PROCEDURE data_trig();

INSERT INTO data(value) VALUES ('first');
UPDATE data SET value = 'changed' WHERE id = 1;
UPDATE data SET id = 2  WHERE id = 1;
DELETE FROM data;
INSERT INTO data(value) VALUES ('second');
TABLE realdata;

EXPLAIN SELECT id, value FROM realdata WHERE valid @> TIMESTAMPTZ '2020-04-29 17:18:37';

/* Good for security but can hurt performance */
ALTER VIEW data SET (security_barrier = true);

postgres tracks dependencies between objects: we can’t drop our realdata table because our data view depends on it (unless we use DROP ... CASCADE).

We can rename our realdata table to something else and the view will keep working, because postgres references objects by their OID under the hood.
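A minimal sketch of that behavior (the new table name is made up):

ALTER TABLE realdata RENAME TO realdata_archive;
/* The view keeps working; its definition now shows the new table name */
\d+ data
TABLE data;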

Timezones

Timestamps with time zone are stored as UTC and displayed in the client's timezone.

You either choose to deal with them at the application level or the database level.

CREATE TABLE aside(x timestamp with time zone);

INSERT INTO aside VALUES('2020-01-01 00:00:00');

TABLE aside;

SHOW timezone;

SET timezone = 'Europe/Vienna';

TABLE aside;

PL/pgSQL: the procedural language built on top of SQL.
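A minimal sketch of a PL/pgSQL function (the function name and logic are made up) showing the DECLARE block and a loop:

CREATE OR REPLACE FUNCTION sum_to(n integer) RETURNS bigint LANGUAGE plpgsql
AS $$
DECLARE
   total bigint := 0;
BEGIN
   FOR i IN 1..n LOOP
      total := total + i;
   END LOOP;
   RETURN total;
END;
$$;

SELECT sum_to(100);   -- 5050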

Performance

When reading, every read goes through the shared buffers: pages read from disk are first copied into the shared buffers and served from there.
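You can observe this with the BUFFERS option of EXPLAIN: "shared hit" means the page was already in the shared buffers, "read" means it had to be fetched from disk first (a minimal sketch; the counts depend on your data and cache state):

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test WHERE id = 42;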

SET enable_seqscan = off only applies to the current session.

The base directory (inside the data directory) is where all table data is stored.

postgresql.conf contains all the global settings.

ALTER SYSTEM SET enable_seqscan = off;

You can increase shared_buffers if you want, but as a rule of thumb no more than about 8 GB.

Increasing work_mem makes big queries (sorts, hash joins) perform much better.

Rule of thumb: RAM >= shared_buffers + work_mem * max_connections
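A minimal sketch of inspecting and changing these settings (the values are only examples):

SHOW shared_buffers;
SHOW work_mem;
SHOW max_connections;

/* work_mem can be changed with a reload */
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();

/* shared_buffers only takes effect after a server restart */
ALTER SYSTEM SET shared_buffers = '2GB';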

Day 4 - Security

Users

You should never use a SUPERUSER in your application!

postgres does not really have users: everything is a ROLE (a user is just a role with the LOGIN attribute). An object is anything created with CREATE, and each object has an owner. Only the owner (or a superuser) may alter or drop an object.

pg_hba.conf is the config file for user authentication.

You cannot revoke a permission at the user level if it has been granted at the group level.
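A minimal sketch of that rule, using the joe/logistics roles and the account table from the examples in this section:

GRANT SELECT ON account TO logistics;   -- granted at the group level
REVOKE SELECT ON account FROM joe;      -- no effect: joe still inherits SELECT through logistics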

CREATE USER

/* CREATE USER joe; is equivalent to */
CREATE ROLE joe LOGIN;

/* This makes sure the password does not leak in clear text (e.g. in the logs or shell history) */
\password

CREATE ROLE logistics;
GRANT logistics TO joe;
\du
\dt

CREATE ROLE read_only NOLOGIN;

/* Grant SELECT on all existing tables in myapp to read_only */
SELECT format('GRANT SELECT ON myapp.%I TO read_only;', table_name) FROM information_schema.tables WHERE table_schema = 'myapp' \gexec
/* Or, more simply */
GRANT SELECT ON ALL TABLES IN SCHEMA myapp TO read_only;

/* Make sure all future new tables gets read_only permissions */
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA myapp GRANT SELECT ON TABLES TO read_only;

Schema

Prefer to create your own schema for your app.

If you are connected to one database, you cannot access objects in another database.

Best practice is to set the search_path so you don’t have to prefix the schema every time.

CREATE SCHEMA myapp;

/* Prevent ordinary users from creating objects in the public schema */
REVOKE CREATE ON SCHEMA public FROM public;

/* Add privilege to a specific group */
GRANT USAGE ON SCHEMA myapp TO logistics;
\dn+ myapp

/* Create the new table in the myapp schema */
CREATE TABLE myapp.new (x inet);

SET search_path = myapp, public;
SELECT * FROM new;

Table

/* Grant permission at the role level */
GRANT SELECT ON account TO logistics;

Columns

You can grant privileges on individual columns, but never rely on it in application code!

REVOKE SELECT ON account FROM logistics;

/* Grant permission on specific columns only */
GRANT SELECT (id, name) ON account TO logistics;

Encryption

The earlier (i.e. the closer to the application) you encrypt, the better, because then there is no way for the database itself to decrypt the data.

CREATE EXTENSION pgcrypto;
\dx+ pgcrypto
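A minimal sketch of two common pgcrypto uses: salted password hashing with crypt()/gen_salt() and symmetric encryption with pgp_sym_encrypt() (the password and passphrase are made up):

/* Store a salted hash instead of the password itself */
SELECT crypt('my-password', gen_salt('bf'));

/* Verify a password: re-hash it using the stored hash as the salt */
SELECT stored = crypt('my-password', stored) AS password_ok
FROM (SELECT crypt('my-password', gen_salt('bf')) AS stored) s;

/* Symmetric encryption and decryption with a passphrase */
SELECT pgp_sym_decrypt(pgp_sym_encrypt('top secret', 'passphrase'), 'passphrase');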

SQL injection

  • Do not construct SQL statements yourself!
  • Use prepared statements / parameterized queries
CREATE OR REPLACE FUNCTION public.count_rows(tablename text)
 RETURNS bigint
 LANGUAGE plpgsql
AS $function$
DECLARE
  sql text;
  res bigint;
BEGIN
  sql := 'SELECT count(*) FROM ' || tablename;
   -- RAISE NOTICE '%', sql;   -- uncomment to see the generated SQL
  EXECUTE sql INTO res;
  RETURN res;
END;
$function$;

/* Injection: counts the superusers instead of an application table */
SELECT count_rows('pg_user WHERE usesuper');

/* The fix is to use quote_ident() */
sql := 'SELECT count(*) FROM ' || quote_ident(tablename);
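An equivalent sketch using format() with the %I specifier, which quotes identifiers the same way quote_ident() does:

sql := format('SELECT count(*) FROM %I', tablename);
EXECUTE sql INTO res;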

Partitioning

A partitioned table looks like a single table, but behind the scenes there are many. The partition key determines which partition each row goes to.

Prefer PostgreSQL 12 or later for better partitioning performance. For range partitions the lower bound is inclusive and the upper bound is exclusive.

A good use case is that you can cheaply and easily get rid of old data by dropping or detaching a partition.

There are 3 techniques existing today:

  • List: enumerate discrete values of the partition key
  • Range: split by ranges, e.g. of a timestamp
  • Hash: arbitrary split across a fixed number of partitions

Don’t create a default partition, you never know what could happen in the future!
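The range technique is shown below; for comparison, a minimal sketch of list and hash partitioning (table and partition names are made up):

/* List partitioning */
CREATE TABLE orders (id bigint NOT NULL, country text NOT NULL) PARTITION BY LIST (country);
CREATE TABLE orders_dach PARTITION OF orders FOR VALUES IN ('DE', 'AT', 'CH');
CREATE TABLE orders_fr PARTITION OF orders FOR VALUES IN ('FR');

/* Hash partitioning */
CREATE TABLE sessions (id bigint NOT NULL, data text) PARTITION BY HASH (id);
CREATE TABLE sessions_0 PARTITION OF sessions FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE sessions_1 PARTITION OF sessions FOR VALUES WITH (MODULUS 2, REMAINDER 1);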

CREATE TABLE part (id bigint NOT NULL, createdat timestamp with time zone NOT NULL, data text) PARTITION BY RANGE (createdat);

CREATE TABLE part_2020 PARTITION OF part FOR VALUES FROM ('2020-01-01 00:00:00') TO ('2021-01-01 00:00:00');
CREATE TABLE part_2019 PARTITION OF part FOR VALUES FROM ('2019-01-01 00:00:00') TO ('2020-01-01 00:00:00');
CREATE TABLE part_2018 PARTITION OF part FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2019-01-01 00:00:00');

INSERT INTO part VALUES (1, current_timestamp, 'something');

EXPLAIN SELECT * FROM part;
                             QUERY PLAN
--------------------------------------------------------------------
 Append  (cost=0.00..62.10 rows=3210 width=48)
   ->  Seq Scan on part_2020  (cost=0.00..20.70 rows=1070 width=48)
   ->  Seq Scan on part_2019  (cost=0.00..20.70 rows=1070 width=48)
   ->  Seq Scan on part_2018  (cost=0.00..20.70 rows=1070 width=48)
(4 rows)

/* Create an index on the partitioned table; it is created on every partition */
CREATE INDEX ON part(createdat);
ANALYZE;
EXPLAIN SELECT * FROM part WHERE createdat BETWEEN '2019-01-01' AND '2020-01-04';


/* You can attach and detach partitions (part_2021 must already exist) */
ALTER TABLE part ATTACH PARTITION part_2021 FOR VALUES FROM ('2021-01-01 00:00:00') TO ('2022-01-01 00:00:00');

/* The primary key should include the partition key */
ALTER TABLE part ADD PRIMARY KEY (id, createdat);

If you want to implement sharding have a look at Pl/Proxy.

Day 5 - Execution plans / Internal optimization; Ruby on Rails ActiveRecord with PostgreSQL (RAW SQL)

Exporting data

/* \copy runs on the client, so clientfile is written on the client machine */
\copy part_2020 TO 'clientfile' WITH (FORMAT csv)

Performance

Enable pg_stat_statements to find problems in your queries. It observes all queries executed.

# Run a 30-second benchmark with 10 client connections
pgbench -c 10 -T 30 course
SHOW log_min_duration_statement;

CREATE EXTENSION pg_stat_statements;
\d pg_stat_statements

/* Loading the extension's library requires a server restart */
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
SHOW shared_preload_libraries;

SELECT * FROM pg_settings WHERE name LIKE 'pg_stat_statements.%';
ALTER SYSTEM SET pg_stat_statements.track = 'all';
SELECT pg_reload_conf();

SELECT calls, total_time, query FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;

SELECT relname, seq_scan, seq_tup_read, seq_tup_read::float8 / seq_scan AS tup_per_scan FROM pg_stat_user_tables WHERE seq_scan > 0 ORDER BY tup_per_scan DESC LIMIT 10;

\d pg_stat_user_indexes
SELECT relname, indexrelname, pg_total_relation_size(indexrelid) FROM pg_stat_user_indexes WHERE idx_scan = 0 ORDER BY 3 DESC;

Rails with postgres

Array

add_column :users, :emails, :string, array: true, default: []
add_index :users, :emails, using: 'gin'

# Batch insert
User.insert_all(users)
User.where("emails @> '{[email protected]}'").explain

JSON / JSONB

If you want to index the column, use the jsonb datatype (not json).

add_column :users, :settings, :jsonb

Daterange

create_table :events do |t|
   t.daterange :duration
   t.timestamps
end

Event.create(duration: Date.new(2014, 2, 11)..Date.new(2014, 2, 12))
event = Event.first
event.duration
Event.where("duration @> ?::date", Date.new(2014, 2, 12))
event = Event.select("lower(duration) AS starts_at").select("upper(duration) AS ends_at").first
event.starts_at
event.ends_at

UUID

A good use case is security: random UUIDs make it hard for an attacker to guess or enumerate record URLs.

enable_extension 'pgcrypto'

create_table :people, id: :uuid do |t|
  t.string :name
  t.timestamps
end

# Set up automatic uuids for future models
Rails.application.config.generators do |g|
  g.orm :active_record, primary_key_type: :uuid
end

Transactions

create_table :accounts do |t|
   t.string :name
   t.decimal :amount, precision: 10, scale: 2
   t.timestamps
end

class Account < ApplicationRecord
  def withdraw(quantity)
    Account.transaction do
      self.amount -= quantity
      save!
    end
  end

  def deposit(quantity)
    Account.transaction do
      self.amount += quantity
      save!
    end
  end
end

Scopes

Your scopes should do one thing at a time.

scope :active, -> { where(active: true) }

scope :ordered, -> { order('LOWER(name)') }

RAW SQL

query = <<~SQL
  SELECT SUM(CASE WHEN status = 0 THEN 1 ELSE 0 END)
  FROM PROJECTS
SQL

ActiveRecord::Base.connection.execute(query)

Day 6 - Vacuum / Backup / Recovery

Vacuuming

Vacuuming proceeds in 3 steps:

  1. Scan the table for dead rows to remove
  2. Scan the indexes and remove entries referencing those rows
  3. Remove the dead rows from the table
CREATE TABLE vactest (id integer);
/* generate_series returns a table */
INSERT INTO vactest SELECT * FROM generate_series(1, 100000);
DELETE FROM vactest WHERE id % 2 = 0;

VACUUM (VERBOSE) vactest;

/* get information about autovacuum for our given table */
SELECT * FROM pg_stat_user_tables WHERE relname = 'vactest';

CREATE EXTENSION pgstattuple;
SELECT * FROM pgstattuple('vactest');
SELECT * FROM pgstattuple_approx('vactest');

Long-running transactions are bad because they prevent VACUUM from doing its job.
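A quick way to spot long-running transactions that may be holding VACUUM back (a minimal sketch):

SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY duration DESC;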

Autovacuum cleans up the garbage in your database in the background without disturbing your work.

When is autovacuum executed? When #dead_rows > threshold (50) + scale_factor (0.2) * #rows. For example, a table with 1,000,000 rows gets vacuumed once it has more than 50 + 0.2 * 1,000,000 = 200,050 dead rows.

If autovacuum runs too often or too slowly, you can reduce the cost_delay or increase the cost_limit.

/* Change the config only for the test table */
ALTER TABLE test SET (autovacuum_vacuum_cost_delay = 1);

ALTER TABLE test SET (autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 1000000);

Backup

pg_dump is a great tool for backups; however, it only captures a single point in time.

# dump a given database
$ pg_dump course > course.sql
$ psql -c 'CREATE DATABASE copy'
$ psql copy < course.sql

# directory format: one file per table inside course.dir
$ pg_dump -F d -f course.dir course

# dump only cluster-wide global objects (roles, tablespaces)
$ pg_dumpall -g

# backup in the custom (binary) format, which is smaller
$ pg_dump -F c -f course.dmp course
$ pg_restore -d copy course.dmp
# write the restore SQL to stdout; -t restricts it to the accounts table
$ pg_restore -f - course.dmp
$ pg_restore -f - -t accounts course.dmp
  • In memory, data lives in the shared buffers as pages
  • On disk, the unit is the block (8 kB)

Backup & Recovery

postgres ships a dedicated client tool, pg_basebackup, for online (physical) backups based on WAL segments.

Set archive_mode = on to enable WAL archiving for backups.

# Archive
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/walarchive/%f'

$ systemctl restart postgresql
$ vi /var/log/postgresql/postgresql-13-main.log

$ pg_basebackup -c fast -D backup
$ ls backup
$ ls 13/main
$ less -RS backup/backup_manifest
$ cat backup/backup_label
$ systemctl restart postgresql


# Recovery
restore_command = 'cp /var/lib/postgresql/walarchive/%f %p'
recovery_target_time = '2021-09-01 18:25:00'
recovery_target_action = 'promote'

$ touch 13/main/recovery.signal
$ systemctl restart postgresql
/* Get information about the archiver */
SELECT * FROM pg_stat_archiver \gx

Day 7 - Replication / High availability

Replication

Replication is basically recovery which never stops. It’s a physical copy of the primary server that is kept synchronised.

If you kill your primary or standby server, nothing should go wrong as long as archiving and recovery are set up.

$ pg_ctlcluster 13 standby stop
$ rm -rf 13/standby

$ pg_basebackup -D 13/standby

# Standby Servers
primary_conninfo = 'host=/var/run/postgresql port=5432 user=postgres'
primary_slot_name = 'slot'

$ touch 13/standby/standby.signal
$ systemctl start postgresql@13-standby
$ less /var/log/postgresql/postgresql-13-standby.log

## Replication without archive enabled
wal_keep_size = '1GB'

Let’s monitor how our replication is working.

SELECT * FROM pg_stat_replication \gx

SELECT pg_current_wal_lsn();
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) FROM pg_stat_replication;

/* Re-runs the previous query repeatedly (every 2 seconds by default; \watch 1 for every second) */
\watch
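If the primary fails, the standby can be promoted to become the new primary. A minimal sketch (pg_promote() exists since PostgreSQL 12; pg_ctlcluster is the Debian/Ubuntu wrapper used earlier):

/* On the standby */
SELECT pg_is_in_recovery();   -- true while it is still a standby
SELECT pg_promote();

# or, from the shell
$ pg_ctlcluster 13 standby promote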