PostgreSQL Database Administration
Volume 1
Basic concepts
Federico Campoli
• Attribution — You must give appropriate credit, provide a link to the license, and
indicate if changes were made. You may do so in any reasonable manner, but not in
any way that suggests the licensor endorses you or your use.
• NonCommercial — You may not use the material for commercial purposes.
• ShareAlike — If you remix, transform, or build upon the material, you must distribute
your contributions under the same license as the original.
• No additional restrictions — You may not apply legal terms or technological measures
that legally restrict others from doing anything the license permits.
Copyright
Preface
When I first came up with the idea of writing a PostgreSQL DBA book, my intention was to
publish it commercially. I shortly changed my mind, as I became aware of how unique a
book for database administrators would be. I therefore decided to keep this book free, and
the same will happen for the next books in this series. I hope this will spread knowledge
of PostgreSQL and become a useful reference.
Just a quick word of advice before you start reading. I apologise in advance for my English;
I'm not a native speaker and it's very likely the book is full of typos and bad grammar.
However, if you want to help me clean up and review the text, please fork the GitHub
repository where I share the LaTeX sources: https://github.com/the4thdoctor/pgdba_books.
Intended audience
Database administrators, System administrators, Developers
Book structure
This book assumes the reader knows how to perform basic user operations such as connecting
to the database and creating tables.
The book covers the basic aspects of database administration from installation to cluster
management.
A couple of chapters are dedicated to the logical and physical structure, in order to show
both sides of the coin. The triplet of maintenance, backup and restore completes the picture;
not exhaustive, but good enough to start getting "hands on" with the product. The final
chapter is dedicated to developers. The various sections give advice that can seem quite
obvious, but it's better to repeat things than to have dangerous mistakes when building an
application.
Version and platform
This book is based on PostgreSQL version 9.3 running on Debian GNU Linux 7. References
to older versions or different platforms are explicitly specified.
Thanks
A big thank you to Craig Barnes for the priceless work on the book review.
The beautiful cover was made by Chiaretta e Bon.
Contents
1 PostgreSQL at a glance
1.1 Long time ago in a galaxy far far away...
1.2 Features
1.2.1 ACID compliant
1.2.2 MVCC
1.2.3 Write ahead logging
1.2.4 Point in time recovery
1.2.5 Standby server and high availability
1.2.6 Streaming replication
1.2.7 Procedural languages
1.2.8 Partitioning
1.2.9 Cost based optimiser
1.2.10 Multi platform support
1.2.11 Tablespaces
1.2.12 Triggers
1.2.13 Views
1.2.14 Constraint enforcement
1.2.15 Extension system
1.2.16 Federated
2 Database installation
2.1 Install from source
2.2 Packaged install
3 Install structure
3.1 The core binaries
3.1.1 postgres
3.1.2 pg_ctl
3.1.3 initdb
3.1.4 psql
3.1.5 pg_dump
3.1.6 pg_restore
3.1.7 pg_controldata
3.1.8 pg_resetxlog
3.2 Wrappers and contributed modules
3.2.1 The create and drop utilities
3.2.2 clusterdb
3.2.3 reindexdb
3.2.4 vacuumdb
3.2.5 vacuumlo
3.3 Debian specific utilities
3.3.1 pg_createcluster
3.3.2 pg_dropcluster
3.3.3 pg_lsclusters
3.3.4 pg_ctlcluster
5 The logical layout
5.1 The connection
5.2 Databases
5.3 Tables
5.3.1 Logged tables
5.3.2 Unlogged tables
5.3.3 Temporary tables
5.3.4 Foreign tables
5.4 Table inheritance
5.5 Indices
5.5.1 b-tree
5.5.2 hash
5.5.3 GiST
5.5.4 GIN
5.6 Views
5.7 Tablespaces
5.8 Transactions
5.8.1 Snapshot exports
6 Data integrity
6.1 Primary keys
6.2 Unique keys
6.3 Foreign keys
6.4 Check constraints
6.5 Not null
8 Maintenance
8.1 VACUUM
8.1.1 vacuum_freeze_table_age
8.1.2 vacuum_freeze_min_age
8.1.3 vacuum_multixact_freeze_table_age
8.1.4 vacuum_multixact_freeze_min_age
8.1.5 vacuum_defer_cleanup_age
8.1.6 vacuum_cost_delay
8.1.7 vacuum_cost_limit
8.1.8 vacuum_cost_page_hit
8.1.9 vacuum_cost_page_miss
8.1.10 vacuum_cost_page_dirty
8.2 ANALYZE
8.3 REINDEX
8.4 VACUUM FULL and CLUSTER
8.5 The autovacuum
8.5.1 autovacuum
8.5.2 autovacuum_max_workers
8.5.3 autovacuum_naptime
8.5.4 autovacuum_vacuum_scale_factor
8.5.5 autovacuum_vacuum_threshold
8.5.6 autovacuum_analyze_scale_factor
8.5.7 autovacuum_analyze_threshold
8.5.8 autovacuum_freeze_max_age
8.5.9 autovacuum_multixact_freeze_max_age
8.5.10 autovacuum_vacuum_cost_delay
8.5.11 autovacuum_vacuum_cost_limit
9 Backup
9.1 pg_dump at a glance
9.1.1 Connection options
9.1.2 General options
9.1.3 Output options
9.2 Performance tips
9.2.1 Avoid remote backups
9.2.2 Skip replicated tables
9.2.3 Check for slow cpu cores
9.2.4 Check for the available locks
9.3 pg_dump under the bonnet
9.4 pg_dumpall
9.5 Backup validation
10 Restore
10.1 The plain format
10.2 The binary formats
10.3 Restore performances
10.3.1 shared_buffers
10.3.2 wal_level
10.3.3 fsync
10.3.4 checkpoint_segments, checkpoint_timeout
10.3.5 autovacuum
10.3.6 max_connections
10.3.7 port and listen_addresses
10.3.8 maintenance_work_mem
C Contacts
Chapter 1
PostgreSQL at a glance
PostgreSQL is a first class product with enterprise class features. This chapter is a general
overview of the product, with a short section dedicated to the database's history.
..........................
Bruce Momjian, Thomas Lockhart and Vadim Mikheev replied to this email; they became the
very first PostgreSQL Global Development Team.
Today, after almost 20 years and millions of lines of code, PostgreSQL is a robust and
reliable relational database: the most advanced open source database. The slogan speaks
truth indeed.
1.2 Features
Each new major release adds features to an already rich set. What follows is a small excerpt
of the capabilities of the latest PostgreSQL version.
1.2.2 MVCC
PostgreSQL ensures atomicity, consistency and isolation via MVCC. The acronym stands
for Multi Version Concurrency Control. The mechanism is incredibly efficient: it offers a
great level of concurrency, keeping the transactions' snapshots isolated and consistent. There
is one single disadvantage in the implementation. We'll see in detail in 7.6 how MVCC works
and the reason why there's no such thing as an update in PostgreSQL.
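As a small sketch of what MVCC exposes to the user, the hidden system column xmin shows the transaction which created each row version; the table and database names here are hypothetical.

```shell
psql mvcc_test <<'SQL'
CREATE TABLE t_mvcc (id integer, value text);
INSERT INTO t_mvcc VALUES (1, 'first');
-- xmin is the xid of the transaction which created this row version
SELECT xmin, id, value FROM t_mvcc;
UPDATE t_mvcc SET value = 'second' WHERE id = 1;
-- the update created a brand new row version: xmin has changed
SELECT xmin, id, value FROM t_mvcc;
SQL
```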
1.2.4 Point in time recovery
When PostgreSQL switches to a new WAL segment, this can be a brand new file or a recycled
one. If the old WAL segments are archived in a safe location, it's possible to take a copy of
the physical data files while the database is running. The hot copy, alongside the archived
WAL segments, has all the information necessary and sufficient to recover the database to a
consistent state. By default the recovery terminates when all the archived WAL files are
replayed; however, it's possible to stop the recovery at a given point in time.
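A hedged sketch of the settings involved: the archive location and the target timestamp are placeholders. The WAL archiving is enabled in postgresql.conf on the running cluster, while the recovery is driven by a recovery.conf placed in the restored data area.

```
# postgresql.conf on the master
wal_level = archive
archive_mode = on
archive_command = 'cp %p /mnt/archive/%f'
```

```
# recovery.conf in the restored hot copy
restore_command = 'cp /mnt/archive/%f %p'
recovery_target_time = '2014-03-23 18:00:00'
```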
1.2.8 Partitioning
Although the partitioning implementation in PostgreSQL is still very basic, it's not complicated
to build an efficient partitioned structure using table inheritance.
Unfortunately, because the physical storage is distinct for each partition, it is not possible
to have a global primary key for the partitioned structure. Foreign keys can be emulated to
some extent using triggers.
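As a brief sketch of the technique (table, function and database names are hypothetical): the child table carries a CHECK constraint matching its slice of data and a trigger on the parent routes the rows.

```shell
psql part_test <<'SQL'
CREATE TABLE t_sales (sale_date date, amount numeric);
CREATE TABLE t_sales_2014 (
    CHECK (sale_date >= '2014-01-01' AND sale_date < '2015-01-01')
) INHERITS (t_sales);
CREATE OR REPLACE FUNCTION fn_sales_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.sale_date >= '2014-01-01' AND NEW.sale_date < '2015-01-01' THEN
        INSERT INTO t_sales_2014 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for date %', NEW.sale_date;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_sales_insert BEFORE INSERT ON t_sales
    FOR EACH ROW EXECUTE PROCEDURE fn_sales_insert();
SQL
```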
1.2.10 Multi platform support
PostgreSQL supports almost any unix flavour and, from version 8.0, runs natively on Windows.
1.2.11 Tablespaces
The tablespace support permits a fine grained distribution of the data files across filesystems.
In 5.7 and 7.5 we’ll see how to take advantage of this powerful feature.
1.2.12 Triggers
Triggers are well supported on tables and views. A basic implementation of event
triggers is also present. Triggers can completely emulate the updatable views feature.
1.2.13 Views
Read only views are well consolidated in PostgreSQL. Version 9.3 introduced basic
support for materialised and updatable views. Materialised views have no incremental
refresh, and complex views, like views joining two or more tables, are not updatable.
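A minimal sketch of the 9.3 materialised view support; the view and database names are hypothetical, and the refresh recomputes the whole view since there is no incremental refresh.

```shell
psql demo <<'SQL'
CREATE MATERIALIZED VIEW mat_relations AS
    SELECT relkind, count(*) AS relations FROM pg_class GROUP BY relkind;
SELECT * FROM mat_relations;
-- the refresh rebuilds the entire content of the view
REFRESH MATERIALIZED VIEW mat_relations;
SQL
```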
1.2.16 Federated
From PostgreSQL 9.1 it is possible to have foreign tables pointing to external data sources.
PostgreSQL 9.3 also introduced writable foreign tables and the PostgreSQL foreign data
wrapper, postgres_fdw.
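A hedged sketch of postgres_fdw in action; the host, databases and credentials are placeholders.

```shell
psql local_db <<'SQL'
CREATE EXTENSION postgres_fdw;
CREATE SERVER src_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'src_db');
CREATE USER MAPPING FOR CURRENT_USER SERVER src_srv
    OPTIONS (user 'src_user', password 'src_password');
CREATE FOREIGN TABLE f_accounts (id integer, name text)
    SERVER src_srv OPTIONS (table_name 'accounts');
-- from 9.3 the foreign table is also writable
INSERT INTO f_accounts VALUES (1, 'alice');
SQL
```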
Chapter 2
Database installation
In this chapter we will see how to install PostgreSQL on Debian GNU/Linux. We'll take a
look at two procedures: compiling from source and using the packages shipped by the pgdg
apt repository.
There are advantages and disadvantages to both procedures. Compiling from source offers
fine grained control over all aspects of the binaries' configuration, carries no risk of unwanted
restarts when upgrading, and makes it possible to install and upgrade the binaries with
normal privileges.
The packaged install is easier to manage when deploying the new binaries on the server,
in particular if there is a large number of installations to manage. The binary packages are
released shortly after a new update is released. Because the frequency of the minor releases
is not fixed, it can happen that bugs affecting production stay in place for months. For
example, the bug causing the standby server to crash when the master found invalid pages
during a conventional vacuum was fixed almost immediately; unfortunately the release with
the fix appeared only after five months.
• apt-get install build-essential libreadline6-dev zlib1g-dev
Please note the second step will require entering a new user password. Unless this is a
personal test, it's better to avoid obvious passwords like postgres.
In order to build the binaries we must download and extract the PostgreSQL’s source
tarball.
mkdir ~/download
cd ~/download
wget http://ftp.postgresql.org/pub/source/v9.3.5/postgresql-9.3.5.tar.bz2
tar xfj postgresql-9.3.5.tar.bz2
cd postgresql-9.3.5
Using the configure script's option --prefix we'll point the install directory to a writable
location. We can also use a directory named after the major version's numbering. This
will allow us to have different PostgreSQL versions installed without problems.
mkdir -p /home/postgres/bin/9.3
./configure --prefix=/home/postgres/bin/9.3
The configure script will check all the dependencies and, if there's no error, will generate
the makefiles. Then we can start the build simply by running the command make. The
time required for compiling is variable and depends on the system's power. If you have a
multicore processor the make -j option can improve the build time significantly. When the
build is complete it's a good idea to run the regression tests, which are designed to find any
regression or malfunction before the binaries are installed.
make
<very verbose output>
make check
make install
• Debian 7.0 (wheezy)
• Debian unstable (sid)
• Ubuntu 10.04 (lucid)
• Ubuntu 12.04 (precise)
• Ubuntu 13.10 (saucy)
• Ubuntu 14.04 (trusty)
All the installation steps require root privileges, via sudo or by acquiring a root login via
su. Before configuring the repository it's a good idea to import the GPG key for the
package validation.
In a root shell simply run
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
When the key is imported, create a file named pgdg.list in the directory /etc/apt/sources.list.d/
and add the following row.

deb http://apt.postgresql.org/pub/repos/apt/ wheezy-pgdg main
The distribution's codename can be found using the command lsb_release -c, e.g.
thedoctor@tardis:~$ lsb_release -c
Codename: wheezy
After the repository configuration the installation is completed with two simple commands.
apt-get update
apt-get install postgresql-9.3 postgresql-contrib-9.3 postgresql-client-9.3
Be aware that this method, as an automated installation task, creates a new database cluster
in the default directory /var/lib/postgresql.
Chapter 3
Install structure
Depending on the installation method, the install structure is set up in a single directory or
in multiple folders.
The install from source creates, inside the target directory, four subfolders: bin, include,
lib and share.
The packaged install puts the binaries and the libraries in the folder /usr/lib/postgresql,
organised by major version. For example the 9.3 install will put the binaries into
/usr/lib/postgresql/9.3/bin and the libraries in /usr/lib/postgresql/9.3/lib. The extensions and
contributed modules are installed into the folder /usr/share/postgresql with the same structure.
The directory /usr/bin contains the Debian specific utilities and the symbolic link psql,
pointing to the file /usr/share/postgresql-common/pg_wrapper. This file is a perl script
which calls the PostgreSQL client, reading the version, the cluster and the default database
from the file ~/.postgresqlrc or from /etc/postgresql-common/user_clusters.
3.1.1 postgres
This is PostgreSQL's main process. The program can be started directly or using the
pg_ctl utility. The second method is to be preferred as it offers a simpler way to control the
postgres process. The direct execution is unavoidable when the database won't start because
of an old XID approaching the wraparound failure. In this case the cluster can only start
in single user mode to perform a cluster wide vacuum. For historical reasons there's also a
symbolic link named postmaster pointing to the postgres executable.
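A sketch of the single user mode invocation, using the Debian paths adopted in this book; the vacuum must then be repeated for each database of the cluster.

```shell
# single user mode: no other connections are possible
/usr/lib/postgresql/9.3/bin/postgres --single -D /var/lib/postgresql/9.3/main template1
# at the backend> prompt:
#   VACUUM FREEZE;
```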
3.1.2 pg_ctl
This utility is the simplest way to manage a PostgreSQL instance. The program reads the
postgres pid from the cluster's data area and, by sending operating system signals, manages
the start, the stop and the reload of the process. It's also possible to send kill signals to the
running instance. The supported actions include start, stop, restart, reload, status, promote
and kill.
3.1.3 initdb
initdb is the binary which initialises the PostgreSQL data area. The directory to initialise
must be empty. Various options can be specified on the command line, like the character
encoding or the collation order.
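A sketch of a typical initialisation, using the Debian paths adopted in this book; the encoding and locale values are just examples.

```shell
/usr/lib/postgresql/9.3/bin/initdb -D /var/lib/postgresql/tempdata \
    --encoding=UTF8 --locale=en_GB.UTF-8
```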
3.1.4 psql
This is the PostgreSQL command line client. The client looks very essential; however, it is
one of the most flexible tools available to interact with the server and the only choice when
working on the command line.
3.1.5 pg_dump
This is the binary dedicated to the backup. It can produce consistent backups in various
formats. Its usage is described in chapter 9.
20
3.1.6 pg_restore
This program is used to restore a database, reading a binary dump in the custom or directory
format. It's able to run the restore in multiple jobs in order to speed up the process. Its
usage is described in chapter 10.
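As a hedged sketch of the two programs working together (the database names are hypothetical), a custom format dump can be restored with multiple parallel jobs:

```shell
# consistent backup in the custom format
pg_dump -Fc -f db_test.dmp db_test
# restore into an existing empty database, using four parallel jobs
pg_restore -j 4 -d db_test_restore db_test.dmp
```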
3.1.7 pg_controldata
This program can query the cluster's control file, where PostgreSQL stores critical information
for the cluster's activity and reliability.
3.1.8 pg_resetxlog
If a WAL file becomes corrupted the cluster cannot perform a crash recovery, which leads to
an unstartable cluster in case of a system crash. In this catastrophic scenario there's still a
possibility to start the cluster. Using pg_resetxlog the cluster is cleared of any WAL file, the
control file is initialised from scratch and the transaction count is restarted.
This tabula rasa has a cost indeed: the cluster loses any reference between the transactions'
progression and the data files. The physical integrity is lost and any attempt to run
queries which write data will result in corruption.
After running pg_resetxlog the database must start without user access,
the entire content must be dumped, the data directory must be dropped and recreated
from scratch using initdb, and then the dump file can be restored using psql or pg_restore.
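The procedure can be sketched as follows; the paths are examples only and each step is explained in the backup and restore chapters.

```shell
# last resort only: reset the WAL and reinitialise the control file
/usr/lib/postgresql/9.3/bin/pg_resetxlog /var/lib/postgresql/tempdata
# start the cluster without user access, then dump the entire content
pg_dumpall -f full_dump.sql
# stop the cluster, recreate the data area and restore the dump
/usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/tempdata stop
rm -rf /var/lib/postgresql/tempdata
/usr/lib/postgresql/9.3/bin/initdb -D /var/lib/postgresql/tempdata
/usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/tempdata start
psql -f full_dump.sql postgres
```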
3.2.2 clusterdb
This program performs a database wide cluster on the tables with clustered indices. The
binary can also run on a single table specified on the command line. In 8.4 we'll take a look
at CLUSTER and VACUUM FULL.
3.2.3 reindexdb
The command does a database wide reindex. It's possible to run the command just on a table
or an index by passing the relation's name on the command line. In 8.3 we'll take a good look
at the index management.
3.2.4 vacuumdb
This binary is a wrapper for the VACUUM SQL command. This is the most important
maintenance task and shouldn't be ignored. The program performs a database wide VACUUM
if executed without a target relation. Alongside a common vacuum it's possible to have
the usage statistics updated at the same time.
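A couple of hedged examples; the database and table names are hypothetical.

```shell
# database wide VACUUM with the statistics update, on every database
vacuumdb --analyze --all
# vacuum a single table in a given database
vacuumdb --analyze --table t_data db_test
```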
3.2.5 vacuumlo
This binary removes the orphaned large objects from the pg_largeobject system table.
pg_largeobject is used to store binary objects bigger than the 1 GB limit imposed by
the bytea data type. Up to version 9.2 the limit for a large object was 2 GB; from
version 9.3 the limit was increased to 4 TB.
3.3.1 pg_createcluster
This script adds a new PostgreSQL cluster with the given major version, if installed, and
the given name. The script puts all the configurations in /etc/postgresql. Each major
version has a dedicated directory, under which there is a group of directories with the
cluster's specific configuration files. If not specified, the data directory is created in the folder
/var/lib/postgresql. It's also possible to specify options for initdb.
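A minimal sketch of the Debian workflow, assuming the 9.3 packages are installed; the cluster name main is just an example.

```shell
pg_createcluster 9.3 main      # creates the configuration and the data area
pg_lsclusters                  # lists the clusters with status, port and paths
pg_ctlcluster 9.3 main start   # starts the new cluster
```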
3.3.2 pg_dropcluster
The program deletes a PostgreSQL cluster previously created with pg_createcluster. The
program will not drop a running cluster. If the dropped cluster has any tablespaces, those
must be removed manually after the drop, as the program doesn't follow the symbolic links.
3.3.3 pg_lsclusters
Lists the clusters created with pg_createcluster.
3.3.4 pg_ctlcluster
The program manages the cluster in a similar way to pg_ctl. Before version 9.2 this
wrapper had a dangerous shutdown behaviour: the script did not offer a flexible way to
provide the shutdown mode. More information about the shutdown sequence is in 4.3.
Without any option pg_ctlcluster performs a smart shutdown. The --force option tells the
script to try a fast shutdown; unfortunately, if the database doesn't shut down in a
reasonable time, the script performs an immediate shutdown and, after another short wait,
if the instance is still up, sends a kill -9 to the postgres process. Because this kind of action
can result in data loss, it should be performed manually by the DBA. It's better to avoid
shutting down via pg_ctlcluster.
Chapter 4
For example, using the initdb shipped with the debian archive requires the following
commands.
The database cluster will be initialized with locale "en_GB.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
/usr/lib/postgresql/9.3/bin/postgres -D /var/lib/postgresql/tempdata
or
/usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/tempdata -l logfile start
PostgreSQL 9.3 introduces data page checksums, used for detecting data page corruption.
This great feature can be enabled only when initialising the data area with initdb and is
cluster wide. The extra overhead caused by the checksums is something to consider, because
the only way to disable data checksums is a dump and reload on a fresh data area.
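For example, a data area with checksums enabled can only be created at initialisation time:

```shell
/usr/lib/postgresql/9.3/bin/initdb -D /var/lib/postgresql/tempdata --data-checksums
```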
After initialising the data directory, initdb emits a message with the commands to start
the database cluster. The first form is useful for debugging and development purposes
because it starts the database directly from the command line, with the output displayed
on the terminal.
postgres@tardis:~/tempdata$ /usr/lib/postgresql/9.3/bin/postgres -D /var/lib/postgresql/tempdata
LOG: database system was shut down at 2014-03-23 18:52:07 UTC
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
Starting the cluster with pg_ctl is very simple. The program accepts the data area as a
parameter or via the environment variable PGDATA; it's also required to provide the
command to execute. The start command, for example, is used to start the cluster in multi
user mode.
postgres@tardis:~/tempdata$ /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/tempdata -l logfile start
server starting
Omitting the -l option will display the alerts and warnings on the terminal.
The command stop will end the cluster’s activity.
postgres@tardis:~$ /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/tempdata -l logfile stop
waiting for server to shut down.... done
server stopped
FATAL: could not create shared memory segment: Cannot allocate memory
DETAIL: Failed system call was shmget(key=X, size=XXXXXX, XXXXX).
HINT: This error usually means that PostgreSQL’s request for a shared memory
segment exceeded available memory or swap space, or exceeded your kernel’s
SHMALL parameter. You can either reduce the request size or reconfigure the
kernel with larger SHMALL. To reduce the request size (currently XXXXX bytes),
reduce PostgreSQL’s shared memory usage, perhaps by reducing shared_buffers or
max_connections.
The kernel parameter governing this limit is SHMMAX, the maximum size of a shared
memory segment. The value is measured in bytes and must be bigger than the shared_buffers
parameter. Another parameter which needs adjustment is SHMALL. This value sets the total
amount of shared memory available and on linux is usually measured in pages; unless the
kernel is configured to allow huge pages, the page size is 4096 bytes. The value should cover
the same amount of memory as SHMMAX, expressed in pages. Changing those parameters
requires root privileges. It's a good measure to leave a small extra headroom over the needed
memory instead of setting the exact required value.
For example, setting the shared buffer to 1 GB requires SHMMAX to be at least 1073741824.
The value 1258291200 (1200 MB) is a reasonable setting; the corresponding SHMALL is
307200 pages. The value SHMMNI is the system wide maximum number of shared memory
segments, and is safe to set to 4096.
kernel.shmmax = 1258291200
kernel.shmall = 307200
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 658576
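The arithmetic behind those values can be checked with shell arithmetic; the 4096 bytes page size is an assumption, the actual value is returned by getconf PAGE_SIZE.

```shell
# derive SHMALL (in pages) from a 1200 MB SHMMAX, assuming 4 KiB pages
shmmax=$((1200 * 1024 * 1024))   # SHMMAX in bytes
page_size=4096                   # check with: getconf PAGE_SIZE
shmall=$((shmmax / page_size))   # SHMALL is measured in pages
echo "kernel.shmmax = $shmmax"   # prints kernel.shmmax = 1258291200
echo "kernel.shmall = $shmall"   # prints kernel.shmall = 307200
```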
When the memory is allocated, the postmaster reads the pg_control file to check if the
instance requires recovery. The pg_control file is used to store the location of the last
checkpoint and the last known status of the instance.
If the instance is in a dirty state, because of a crash or an unclean shutdown, the startup
process reads the last checkpoint location and replays the blocks from the corresponding WAL
segment in the pg_xlog directory. Any corruption of the WAL files needed for the recovery,
or of the pg_control file itself, results in an unstartable instance.
When the recovery is complete, or if the cluster's state is clean, the postgres process
completes the startup and sets the cluster in production state.
4.3 The shutdown sequence
The PostgreSQL process enters the shutdown status when a specific OS signal is received.
The signal can be sent via the os kill command or using the program pg_ctl.
As seen in 3.1.2, pg_ctl accepts the -m switch when the command is stop. The -m switch is
used to specify the shutdown mode; if it is omitted, it defaults to smart, which corresponds
to the SIGTERM signal. With the smart shutdown the cluster stops accepting new connections
and waits for all backends to quit.
When the shutdown mode is set to fast, pg_ctl sends the SIGINT signal to the postgres
main process. Like the smart shutdown, the cluster does not accept new connections, but it
also terminates the existing backends; any open transaction is rolled back.
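The two clean modes can be requested explicitly; the data area path is just an example.

```shell
# smart: wait for the backends to disconnect (default)
pg_ctl -D /var/lib/postgresql/tempdata -m smart stop
# fast: roll back the open transactions and disconnect the backends
pg_ctl -D /var/lib/postgresql/tempdata -m fast stop
```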
When the smart and the fast shutdowns are complete they leave the cluster in a clean state.
This is because, when the postgres process initiates the final part of the shutdown, it starts
a last checkpoint which consolidates any dirty block on the disk. Before quitting, the postgres
process saves the latest checkpoint's location to the pg_control file and marks the cluster as
clean.
The checkpoint can slow down the entire shutdown sequence; in particular, if the shared buffer
is big and contains many dirty blocks, the checkpoint can run for a very long time. Also, if
at shutdown time another checkpoint is running, the postgres process will wait for that
checkpoint to complete before starting the final one.
Enabling the checkpoint logging in the configuration gives us some visibility on what the
cluster is actually doing. The GUC parameter governing this setting is log_checkpoints.
If the cluster doesn’t stop, there is a shutdown mode which leaves the cluster in dirty
state. The immiediate shutdown. The equivalent signal is the SIGQUIT and it causes the
main process alongside with the backends to quit immediately without the checkpoint.
The subsequent start will require a crash recovery. The recovery is usually harmless, with one important exception: if the cluster contains unlogged tables, those relations are recreated empty during the recovery and all the data they contained is lost.
A final word about the SIGKILL signal, the dreaded kill -9. It can happen that the cluster does not stop even with the immediate mode. In this case the last resort is SIGKILL. Because this signal cannot be trapped in any way, resources like the shared memory and the inter-process semaphores stay in place after killing the server. This will very likely affect the start of a fresh instance. Please refer to your sysadmin to find out the best way to clean up the memory after a SIGKILL.
4.4 The processes
Alongside the postgres process there are a number of accessory processes. With a running 9.3 cluster, ps shows at least six postgres processes.
4.5 The memory
Externally, PostgreSQL's memory structure is very simple to understand: alongside a single shared segment there are the per-user memory areas. Behind the scenes things are quite complex and beyond the scope of this book.
The memory segment is formatted in pages like the data files. When a new backend is forked from the main process, it is attached to the shared buffer. Because the shared buffer is usually a fraction of the cluster's size, a simple but very efficient mechanism keeps the blocks in memory using a combination of LRU and MRU. Since version 8.3 there is also a protection mechanism against page eviction from memory in the case of IO-intensive operations.
Any data operation is performed by loading the data pages into the shared buffer. Alongside the benefits of the memory cache there is the enforcement of the data consistency at any time.
In particular, if any backend crashes, PostgreSQL resets all the existing connections to protect the shared buffer from potential corruption.
A correct size for this memory can improve the performance of any memory-intensive operation, like sorting. It's very important to set this value to a reasonable size in order to avoid any risk of out-of-memory errors or unwanted swapping.
4.5.4 The temporary memory
The temporary memory is set using the parameter temp_buffers. Its main usage is storing the temporary tables. If a table doesn't fit in the allocated memory then the relation is saved on disk. It's possible to change the temp_buffers value for the current session, but only before creating a temporary table.
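For example, a session could raise the value before materialising its first temporary relation; the table name and the size here are arbitrary:

```sql
SET temp_buffers = '64MB';
-- works because no temporary table exists yet in this session
CREATE TEMPORARY TABLE t_tmp (i_id integer, v_value text);
```

Issuing the SET after the CREATE would fail with an error, because the session's temporary buffers are already in use.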
4.6.1 base
This directory does what the name suggests: it holds the database files. For each database in the cluster there is a dedicated subdirectory in base, named after the database's object id.
A new installation shows only three subdirectories in the base folder. Two of them are the template databases, template0 and template1. The third is the postgres database. In 5 there is more information about the logical structure.
Each database directory contains many files with numerical names. They are the physical database files: tables, indices etc.
The relation's file name is initially set to the relation's object id. Because there are operations that can change the file name (e.g. VACUUM FULL, REINDEX), PostgreSQL tracks the file name in a dedicated pg_class field, the relfilenode. In 7 there is more information about the physical data file structure.
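A quick way to see the difference between the object id and the relfilenode is querying pg_class; the table name here is hypothetical:

```sql
SELECT oid, relname, relfilenode
FROM pg_class
WHERE relname = 't_data';
```

After a VACUUM FULL or a REINDEX the relfilenode changes while the oid stays the same.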
4.6.2 global
The global directory holds all the shared relations. Alongside the data files there is a small file, just one data page, called pg_control. This file is vital for the cluster's activity. If there is any corruption on the control file the cluster cannot start.
4.6.3 pg xlog
This is probably the most important and critical directory in the data area. The directory
holds the write ahead logs, also known as WAL files. Each segment is by default 16 MB and
is used to store the records for the pages changed in the shared buffer. The write first on on
this durable storage ensures the cluster’s crash recovery. In the event of a crash the WAL
are replayed when the startup begins from the last checkpoint location read from control
file.Because this directory is heavily written, putting it on a dedicated device improves the
performance.
4.6.4 pg_clog
This directory contains the status of the committed transactions, stored in many files, each one as big as a data page. The directory does not store the status of the transactions executed with the SERIALIZABLE isolation. The directory is managed by PostgreSQL. The number of files is controlled by the two parameters autovacuum_freeze_max_age and vacuum_freeze_table_age. They control the "event horizon" of the oldest frozen transaction id, and pg_clog must store the commit status accordingly.
4.6.5 pg_serial
This directory is similar to pg_clog, except the commit statuses are only for the transactions executed with the SERIALIZABLE isolation level.
4.6.6 pg_multixact
The directory stores the statuses of the multi-transactions. They are used, in general, for the row share locks.
4.6.7 pg_notify
The directory is used to store the LISTEN/NOTIFY operations.
4.6.8 pg_snapshots
This directory stores the exported transaction snapshots. From version 9.2 PostgreSQL can export a consistent snapshot to the other sessions. More details about the snapshots are in 5.8.1.
4.6.10 pg_stat
This directory contains the files saved permanently by the statistics subsystem, to keep them persistent between the restarts.
4.6.11 pg_subtrans
In this folder there are the subtransaction statuses.
4.6.12 pg_twophase
This is where PostgreSQL saves the two-phase commit data. This feature allows a transaction to become independent from the backend status. If the backend disconnects, for example in a network outage, the transaction does not roll back; it waits for another backend to pick it up and complete the commit.
4.6.13 pg_tblspc
In this folder there are the symbolic links to the tablespace locations. In 5.7 and 7.5 there is more information about them.
Chapter 5
In this chapter we'll take a look at the PostgreSQL logical layout. We'll start with the connection process. Then we'll see the logical relations like tables, indices and views. The chapter will end with the tablespaces and the MVCC.
The first connection stage is the check using the host based authentication. The cluster scans the pg_hba.conf file searching for a match with the connection's parameters. Those are, for example, the client's host, the user etc. The host based authentication file is usually saved inside the data area alongside the configuration file postgresql.conf. The pg_hba.conf is read from the top to the bottom and the first row matching the client's parameters is used to determine the authentication method. If PostgreSQL reaches the end of the file without a match the connection is refused.
The possible values for each pg_hba.conf column are the following.
Type: local, host, hostssl, hostnossl
Database: a database name or *
User: a user name or *
Address: an ip address with a network mask, or a host name
Method: trust, reject, md5, password, gss, sspi, krb5, ident, peer, pam, ldap, radius, cert
The column type specifies if the connection is local or host. The former is used when the connection is made using a socket, the latter when the connection uses the network. It's also possible to specify whether the host connection should be secure or plain using hostssl and hostnossl.
The Database and User columns are used to match specific databases and users.
The column address makes sense only if the connection type is host, hostssl or hostnossl. The value can be an ip address plus the network mask. It's also possible to specify a hostname. There is full support for ipv4 and ipv6.
The pg_hba.conf's last column is the authentication method for the matched row: the action to perform after the match is done. PostgreSQL supports many methods, ranging from the plain password challenge to kerberos.
• trust: The connection is authorised without any further action. It's quite useful if the password is lost. Use it with caution.
• peer: The connection is authorised if the OS user matches the database user. It’s useful
for the local connections.
• password: The connection is established if the connection's user and password match the values stored in the pg_shadow system table. This method sends the password in clear text; it should be used only on trusted networks.
• md5: This method is similar to password. It offers better security, encoding the passwords with the md5 algorithm. Because md5 is deterministic, a pseudo random subroutine prevents the same string from being sent twice over the network.
• reject: The connection is rejected. This method is very useful to keep the sessions out of the database, e.g. during maintenance requiring single user mode.
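A few hypothetical pg_hba.conf entries tie the concepts together; remember that the first matching row wins:

```
# TYPE  DATABASE  USER      ADDRESS          METHOD
local   all       postgres                   peer
host    db_test   app_user  192.168.1.0/24   md5
host    all       all       0.0.0.0/0        reject
```

With these rules the postgres user connects via the local socket without a password, app_user authenticates with md5 from the internal network only, and everything else is rejected.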
When the connection is established, the postgres main process forks a new backend process attached to the shared buffer. The fork is expensive and this makes the connection a potential bottleneck. Opening many new connections can degrade the operating system performance and eventually produce zombie processes. Keeping the connections permanently established may seem a reasonable fix; unfortunately this approach has a couple of unpleasant side effects.
Changing any connection-related parameter, like max_connections, requires a cluster restart. For this reason planning the resources is absolutely vital. For each connection present in max_connections the cluster allocates 400 bytes of shared memory. For each connection established the cluster allocates a per-user memory area whose size is determined by the parameter work_mem.
For example, let's consider a cluster with the shared buffer set to 512 MB and work_mem set to 100 MB. Setting max_connections to only 500 potentially requires 49 GB of total memory if all the connections are in use. Because work_mem affects the performance, its value should be determined carefully. Being a per-user memory, any change to work_mem does not require a cluster restart, just a reload.
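The figure above follows from max_connections × work_mem: 500 × 100 MB is roughly 49 GB in the worst case. Because work_mem is a per-session setting, it can also be raised just for a single memory hungry statement; the value and the table name are examples only:

```sql
SET work_mem = '200MB';  -- affects the current session only
SELECT * FROM t_data ORDER BY t_content;
```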
In this kind of situation a connection pooler can be a good solution. The sophisticated pgpool or the lightweight pgbouncer can help to boost the connection performance.
By default a freshly initialised data area listens only on localhost. The GUC parameter governing this aspect is listen_addresses. In order to have the cluster accepting connections from the rest of the network, the value should be changed to the correct listening addresses, specified as comma separated values. It's also possible to set it to the wildcard *.
Changing the parameters max_connections and listen_addresses requires a cluster restart.
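A minimal postgresql.conf fragment for a cluster accepting remote connections could look like this; the address is an example to adapt:

```
listen_addresses = 'localhost,192.168.1.10'  # or '*' for all the interfaces
max_connections = 100
```

Both parameters are applied only at the cluster start.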
5.2 Databases
Unlike other DBMS, a PostgreSQL connection requires the database name in the connection string. Sometimes this can be omitted in psql, when the information is supplied in another way.
When omitted, psql checks whether the environment variable $PGDATABASE is set. If $PGDATABASE is missing then psql defaults the database name to the connection's username. This leads to confusing error messages. For example, if we have a username named test but no database named test, the connection will fail even with the correct credentials.
psql: FATAL: database "test" does not exist
This error appears because pg_hba.conf allows the connection for any database, even a non existent one. The connection is then terminated when the backend asks to connect to the database named test, which does not exist.
This is very common for new users. The solution is incredibly simple, because in a PostgreSQL cluster there are at least three databases: passing the name template1 as last parameter will do the trick.
When the connection is established we can query the system table pg_database to get the cluster's database list.
template1=> SELECT datname FROM pg_database;
  datname
-----------
 template1
 template0
 postgres
(3 rows)
Database administrators coming from other DBMS can be confused by the postgres database. This database has nothing special: its creation was added in the version 8.4 because it was useful to have it. You can just ignore it or use it for testing purposes. Dropping the postgres database does not corrupt the cluster. However, because this database is often used by third party tools, before dropping it check if it is in use in any way.
The databases template0 and template1, like the name suggests, are the template databases. A template database is used to build new database copies via the physical file copy.
When initdb initialises the data area, the database template1 is populated with the correct references to the WAL records, the system views and the procedural language PL/pgSQL. When this is done, the databases template0 and postgres are created from the template1 database.
The database template0 doesn't allow connections. Its main usage is to rebuild the database template1 if it gets corrupted, or to create databases with a character encoding/ctype different from the cluster wide settings.
postgres=# CREATE DATABASE db_test WITH ENCODING 'UTF8' LC_CTYPE 'en_US.UTF-8';
ERROR:  new LC_CTYPE (en_US.UTF-8) is incompatible with the LC_CTYPE of the template database (en_GB.UTF-8)
HINT:  Use the same LC_CTYPE as in the template database, or use template0 as template.
postgres=# CREATE DATABASE db_test WITH ENCODING 'UTF8' LC_CTYPE 'en_US.UTF-8' TEMPLATE template0;
CREATE DATABASE
postgres=#
If the template is omitted, the CREATE DATABASE statement uses template1 by default.
A database can be renamed or dropped with the ALTER DATABASE and DROP DATABASE statements. Those operations require exclusive access to the affected database: if there are established connections, the drop or rename will fail.
postgres=# ALTER DATABASE db_test RENAME TO db_to_drop;
ALTER DATABASE
5.3 Tables
In our top down approach to PostgreSQL's logical model, the next step is the relation. In the PostgreSQL jargon a relation is an object which carries the data, or the way to retrieve the data. A relation can have a physical counterpart or be purely logical. We'll take a look in particular at three of them, starting with the tables.
A table is the fundamental storage unit for the data. PostgreSQL implements many kinds of tables with different levels of durability. A table is created using the SQL command CREATE TABLE. The data is stored into a table without any specific order. Because of the MVCC implementation, a row update can change the row's physical position. For more information look at 7.6. PostgreSQL implements three kinds of tables.
5.3.3 Temporary tables
A temporary table is a relation which lives in the backend's local memory. When the connection ends the table is dropped. Those tables can have the same name across the sessions because they are completely isolated. If the amount of data stored into the table is smaller than the temp_buffers value, the table fits in memory with a great speed advantage. Otherwise the database creates a temporary relation on disk. The parameter temp_buffers can be altered for the session, but only before the first temporary table is created.
There are many different foreign data wrappers available, even for very exotic data sources. From the version 9.3 the postgres_fdw becomes available and the foreign tables are writable. The postgres_fdw implementation is similar to the old dblink module, with a more efficient performance management and the connection caching.
The table inheritance is a logical relationship between a parent table and one or more
child tables. The child table inherits the parent’s attribute structure but not the physical
storage.
CREATE TABLE t_child_01
()
INHERITS (t_parent)
;
db_test=# \d t_parent
            Table "public.t_parent"
  Column   |          Type          | Modifiers
-----------+------------------------+-----------
 i_id_data | integer                |
 v_data    | character varying(300) |
Number of child tables: 1 (Use \d+ to list them.)

db_test=# \d t_child_01
           Table "public.t_child_01"
  Column   |          Type          | Modifiers
-----------+------------------------+-----------
 i_id_data | integer                |
 v_data    | character varying(300) |
Inherits: t_parent
The inheritance is usually defined at creation time. It's possible to enforce the inheritance between two existing tables with the ALTER TABLE ... INHERIT command. The two tables' structure must be identical.
Because the physical storage is not shared, the unique constraints aren't globally enforced on the inheritance tree. This prevents the creation of any global foreign key. Using the table inheritance, combined with the constraint exclusion and the triggers/rules, is a partial workaround for the table partitioning.
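As a minimal sketch of the workaround, assuming the t_parent table shown above, a child table can carry a CHECK constraint over its slice of the key space; the child name and the bounds are invented:

```sql
CREATE TABLE t_child_02
(
CHECK ( i_id_data BETWEEN 1001 AND 2000 )
)
INHERITS (t_parent)
;

-- with the constraint exclusion enabled the planner skips the children
-- whose CHECK cannot match the WHERE clause
SET constraint_exclusion = 'partition';
SELECT * FROM t_parent WHERE i_id_data = 1500;
```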
5.5 Indices
An index is a structured relation. The indexed entries are used to access the tuples stored in
the tables. The index entries are the actual data with a pointer to the corresponding table’s
pages.
It's important to bear in mind that the indices add overhead to the write operations. Creating an index does not guarantee its usage. The cost based optimiser, for example, can simply consider the index access more expensive than a sequential access. The stats system views, like pg_stat_all_indexes, store the usage counters for the indices.
For example, this simple query finds all the indices in the public schema with the index scan counter at zero.
SELECT
    schemaname,
    relname,
    indexrelname,
    idx_scan
FROM
    pg_stat_all_indexes
WHERE
    schemaname = 'public'
    AND idx_scan = 0
;
Having an efficient maintenance plan can sensibly improve the database performance. Take a look at 8 for more information.
PostgreSQL implements many kinds of indices. The keyword USING specifies the index type at create time.
CREATE INDEX idx_test ON t_test USING hash (t_contents);
5.5.1 b-tree
The general purpose B-tree index implements Lehman and Yao's high-concurrency B-tree management algorithm. The B-tree can handle equality and range queries, returning ordered data. The indexed values are stored into the index pages with the pointers to the table's pages. Because the index is a relation and is not TOASTable, the max length for an indexed key is 1/3 of the page size. More information about TOAST is in 7.4.
5.5.2 hash
The hash indices can handle only equality and are not WAL logged. Their changes are not replayed if the crash recovery occurs, and they do not propagate to the standby servers.
5.5.3 GiST
The GiST indices are the Generalised Search Tree. The GiST is a collection of indexing strategies organised under a common infrastructure. They can implement arbitrary indexing schemes like B-trees, R-trees or others. The default installation comes with operator classes working on two dimensional geometric data and for the nearest-neighbour searches. The GiST indices do not perform an exact match: the false positives are removed with a second recheck on the table's data.
5.5.4 GIN
The GIN indices are the Generalised Inverted Indices. This kind of index is optimised for
indexing the composite data types, arrays and vectors like the full text search elements. This
is the only index supported by the range types. The GIN are exact indices, when scanned
the returned set doesn’t require recheck.
5.6 Views
A view is a relation composed by a name and a query definition. This permits a faster access to complex SQL. When a view is created the query is validated, and all the objects involved are translated to their binary representation. All the wildcards are expanded to the corresponding field list.
A simple example will help us to better understand this important concept. Let's create a table populated using the function generate_series(). We'll then create a view with a simple SELECT * from the original table.
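The exact creation statements are not listed here; a minimal sketch consistent with the catalog outputs shown in this section could be:

```sql
CREATE TABLE t_data
(
i_id serial,
t_content text
)
;

INSERT INTO t_data (t_content)
SELECT 'record #' || i
FROM generate_series(1,200) AS i;

CREATE VIEW v_data AS SELECT * FROM t_data;
```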
We can select from the view and from the table with a SELECT and get the same data. The view's definition in pg_views shows no wildcard though.
db_test=# \x
db_test=# SELECT * FROM pg_views WHERE viewname = 'v_data';
-[ RECORD 1 ]----------------------
schemaname | public
viewname   | v_data
viewowner  | postgres
definition | SELECT t_data.i_id,
           |     t_data.t_content
           |    FROM t_data;
If we add a new field to the table t_data, this will not apply to the view.
ALTER TABLE t_data ADD COLUMN d_date date NOT NULL DEFAULT now()::date;
Using the statement CREATE OR REPLACE VIEW we can put the view in sync with
the table’s structure.
CREATE OR REPLACE VIEW v_data
AS
SELECT
*
FROM
t_data ;
Using the wildcards in the queries is a bad practice for many reasons. The potentially outdated match between the physical and the logical relations is one of those.
The way PostgreSQL implements the views guarantees they never invalidate when the referred objects are renamed.
If new attributes need to be added to the view, the CREATE OR REPLACE statement can be used, but only if the fields are appended. If a table is referred by a view, the drop is not possible. It's still possible to drop a table with all the associated views using the clause CASCADE. This is a dangerous practice though: the dependencies can be very complicated and a careless drop can result in a regression. The best approach is to check for the dependencies using the table pg_depend.
Storing complex SQL inside the database avoids the overhead caused by the round trip between the client and the server. A view can be joined with other tables or views. This practice is generally bad though, because the planner can be confused by mixing different queries and can generate inefficient execution plans.
A good way to spot a view when writing a query is to use a naming convention. For example, adding a v in the view names and a t in the table names will help the database developer to avoid mixing logical and physical objects when writing SQL. Look at 11 for more information.
From the version 9.3 PostgreSQL supports the updatable views. This feature is limited to the simple views only. A view is defined simple when all of the following is true.
• Does have exactly one entry in its FROM list, which must be a table or another updatable view.
• Does not contain WITH, DISTINCT, GROUP BY, HAVING, LIMIT, or OFFSET clauses at the top level.
• Does not contain set operations (UNION, INTERSECT or EXCEPT) at the top level.
• All columns in the view’s select list must be simple references to columns of the un-
derlying relation. They cannot be expressions, literals or functions. System columns
cannot be referenced, either.
• Columns of the underlying relation do not appear more than once in the view’s select
list.
• Does not have the security barrier property.
A complex view can still become updatable using the triggers or the rules.
Another feature introduced with the 9.3 is the materialised views. They act like a physical snapshot of the saved SQL. The view's data can be refreshed with the statement REFRESH MATERIALIZED VIEW.
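A short hypothetical example, reusing the t_data table from the previous section; the view name is invented:

```sql
CREATE MATERIALIZED VIEW mv_data
AS
SELECT i_id, t_content FROM t_data;

-- re-executes the stored query and replaces the saved rows
REFRESH MATERIALIZED VIEW mv_data;
```

Unlike a plain view, the materialised view does not reflect the table's changes until the next refresh.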
5.7 Tablespaces
A tablespace is a logical name pointing to a physical location. This feature was introduced with the release 8.0 and its implementation did not change much since then. From the version 9.2 the new function pg_tablespace_location(tablespace_oid) offers the dynamic resolution of the physical tablespace location, making the dba's life easier.
When a new physical relation is created without a tablespace indication, the value defaults to the parameter default_tablespace. If this parameter is not set then the relation's tablespace is set to the database's default tablespace. In a freshly initialised cluster there are initially two tablespaces. One is pg_default, which points to the path $PGDATA/base. The second is pg_global, which is reserved for the cluster's shared objects and whose physical path is $PGDATA/global.
Creating a new tablespace is very simple. The physical location must be previously created and the OS user running the postgres process must be able to write into it. Let's create, for example, a tablespace pointing to the folder /var/lib/postgresql/pg_tbs/ts_test. Our new tablespace will be named ts_test.
CREATE TABLESPACE ts_test
OWNER postgres
LOCATION '/var/lib/postgresql/pg_tbs/ts_test';
Only superusers can create tablespaces. The clause OWNER is optional; if it is omitted, the tablespace's owner defaults to the user issuing the command. The tablespaces are cluster wide and are listed in the pg_tablespace system table.
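For example, combining pg_tablespace with the resolution function mentioned above lists every tablespace with its physical location:

```sql
SELECT spcname, pg_tablespace_location(oid)
FROM pg_tablespace;
```

The built in pg_default and pg_global report an empty location because they live inside the data area.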
The clause TABLESPACE followed by the tablespace name will create the new relation
into the specified tablespace.
CREATE TABLE t_ts_test
(
    i_id serial,
    v_value text
)
TABLESPACE ts_test;
A relation can be moved from one tablespace to another using the ALTER command. The following command moves the table t_ts_test from the tablespace ts_test to pg_default.
ALTER TABLE t_ts_test SET TABLESPACE pg_default;
The move is transaction safe but requires an access exclusive lock on the affected relation. The lock prevents accessing the relation's data for the time required by the move. If the relation has a significant size, this could result in a prolonged period where the table's data is not accessible. The exclusive lock also conflicts with any running pg_dump, which prevents any tablespace change.
A tablespace can be removed with the DROP TABLESPACE command, but it must be empty before the drop. There's no CASCADE clause for the DROP TABLESPACE command.
postgres=# DROP TABLESPACE ts_test;
ERROR:  tablespace "ts_test" is not empty
A careful design using the tablespaces, for example putting tables and indices on different devices, can sensibly improve the cluster's performance.
In 7.5 we'll take a look at how PostgreSQL implements the tablespaces from the physical point of view.
5.8 Transactions
PostgreSQL implements the atomicity, the consistency and the isolation with the MVCC, the Multi Version Concurrency Control, which offers high efficiency in the concurrent user access.
The MVCC logic is somewhat simple. When a transaction starts a write operation it gets a transaction id, called XID, a 32 bit quantity. The XID value is used to determine the transaction's visibility, its relative position in an arbitrary timeline. All the committed transactions with XID smaller than the current XID are considered in the past and therefore visible. All the transactions with XID bigger than the current XID are in the future and therefore invisible.
The check is made at tuple level using two system fields, xmin and xmax. When a new tuple is created, xmin is set to the transaction's xid. This field is also referred to as the insert transaction id. When a tuple is deleted, the xmax value is updated to the delete's xid. The xmax field is also known as the delete transaction id. The tuple is not physically removed, in order to ensure the read consistency for any transaction in the tuple's past. The tuples with xmax not set are live tuples; the tuples with xmax set are dead tuples. In this model there is no field dedicated to the update, which is an insert combined with a delete. The update's transaction id is used both for the new tuple's xmin and the old tuple's xmax.
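The two fields are hidden system columns and must be listed explicitly in the select list; t_data is just an example table:

```sql
SELECT xmin, xmax, i_id FROM t_data LIMIT 3;
```

A live tuple shows xmax at 0; a tuple deleted by a still open transaction shows that transaction's xid in xmax.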
The dead tuples are removed by VACUUM when no longer required by existing transactions. For the tuple's detailed description check 7.3.
Alongside xmin and xmax there are cmin and cmax, whose data type is the command id, CID. Those fields store the internal transaction's command ids, in order to avoid a command being executed on the same tuple multiple times. One practical effect of those fields is to solve the database's Halloween Problem, described at http://en.wikipedia.org/wiki/Halloween_Problem.
The SQL standard defines four levels of transaction isolation. Each level allows or denies the following transaction anomalies.
• dirty read: a transaction can access the data written by a concurrent, not yet committed, transaction.
• non repeatable read: a transaction repeats a previous read and finds the data changed by another transaction which has committed since the initial read.
• phantom read: a transaction re-executes a previous query and finds a different set of rows with the same search condition, because the results were changed by another committed transaction.
The table 5.2 shows the transaction isolation levels and which anomalies are possible within each of them. PostgreSQL supports read committed as its minimum isolation level. Setting the isolation level to read uncommitted does not cause an error; however, the system silently adjusts the level to read committed.
The isolation level can be set per session with the command SET TRANSACTION ISO-
LATION LEVEL.
SET TRANSACTION ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ
COMMITTED | READ UNCOMMITTED };
It's also possible to change the default isolation level cluster wide, changing the GUC parameter default_transaction_isolation.
An example will help us to explain the concept better. We'll use the table created in 5.6. The first thing to do is to connect to the cluster and start a new transaction with at least the REPEATABLE READ isolation level. Then the function pg_export_snapshot() is used to get the snapshot's identifier.
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
postgres=# SELECT pg_export_snapshot();
 pg_export_snapshot
--------------------
 00001369-1
(1 row)
Connecting with another backend, let's remove all the rows from the table t_data.
postgres=# DELETE FROM t_data;
DELETE 200
postgres=# SELECT count(*) FROM t_data;
 count
-------
     0
(1 row)
After importing the snapshot 00001369-1 the rows are back in place.
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
postgres=# SET TRANSACTION SNAPSHOT '00001369-1';
SET
postgres=# SELECT count(*) FROM t_data;
 count
-------
   200
(1 row)
It's important to use at least REPEATABLE READ as the isolation level. Using READ COMMITTED for the export does not generate an error; however, the snapshot is discarded immediately because READ COMMITTED takes a new snapshot for each command.
Chapter 6
Data integrity
There is only one thing worse than losing the database: when the data is rubbish. In this chapter we'll have a brief look at the constraints available in PostgreSQL and how they can be used to preserve the data integrity.
A constraint, like the name suggests, enforces one or more restrictions over the table's data. When the restriction enforces the data only on the relation where the constraint is defined, the constraint is local. When the constraint validates the local data using the data in a different relation, the constraint is foreign.
The constraints can be defined as table or column constraints. The table constraints are defined at table level, just after the field list. A column constraint is defined in the field's definition, after the data type.
When a constraint is created, the enforcement applies immediately and the table's data is validated against the constraint. If any validation error occurs, the creation aborts. However, the foreign keys and the check constraints can be created without the initial validation using the clause NOT VALID. This clause tells PostgreSQL to skip the validation of the existing data, improving the creation's speed.
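A hypothetical example of the two step approach; the table and constraint names are invented:

```sql
-- the key is enforced for the new rows only
ALTER TABLE t_child
ADD CONSTRAINT fk_t_child_t_parent
FOREIGN KEY (i_id_parent)
REFERENCES t_parent (i_id_parent)
NOT VALID;

-- the existing rows can be validated later
ALTER TABLE t_child VALIDATE CONSTRAINT fk_t_child_t_parent;
```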
A primary key can be defined with the table or the column constraint syntax.
-- PRIMARY KEY AS TABLE CONSTRAINT
CREATE TABLE t_table_cons
(
    i_id serial,
    v_data character varying(255),
    CONSTRAINT pk_t_table_cons PRIMARY KEY (i_id)
)
;
With the table constraint syntax it’s possible to specify the constraint name.
The previous example shows the most common primary key implementation. The constraint
is defined over a serial field. The serial type is a shortcut for integer NOT NULL
with the default value generated by an automatically created sequence. The sequence's upper
limit is 9,223,372,036,854,775,807, but the integer's upper limit is just 2,147,483,647. On
tables with a high rate of new key values, bigserial should be used instead of serial.
Changing the field's type later is still possible but unfortunately this requires a complete
table rewrite.
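For example, a minimal sketch of promoting the key of the previous example table to bigint; on a large table this statement holds an exclusive lock while the whole relation is rewritten:

```sql
-- promote the key from integer to bigint;
-- this rewrites the entire table and blocks concurrent access
ALTER TABLE t_table_cons
  ALTER COLUMN i_id SET DATA TYPE bigint;
```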
The primary keys can also be natural keys, with the field's values meaningful in the
real world. For example a table storing the cities could use the field v_city as primary key
instead of the surrogate key i_id_city.
-- PRIMARY NATURAL KEY
CREATE TABLE t_cities
(
  v_city character varying (255),
  CONSTRAINT pk_t_cities PRIMARY KEY (v_city)
)
;
This results in a more compact table with the key values already indexed.
However, because the same city can appear in many addresses, it is more efficient to store
the city name in a separate table, keyed by a surrogate i_id_city, and set a relation to the
address table.
CREATE TABLE t_addresses
(
  i_id_address serial,
  v_address character varying (255),
  i_id_city integer NOT NULL,
  CONSTRAINT pk_t_addresses PRIMARY KEY (i_id_address)
)
;
The main problem with this structure is the consistency between the two tables. Without
constraints there is no validation of the city identifier. Invalid values will break the join
between the tables. The same happens if, for any reason, the city identifier in the table
t_cities is changed.
Enforcing the relation with a foreign key solves both problems.
ALTER TABLE t_addresses
  ADD CONSTRAINT fk_t_addr_to_t_city
  FOREIGN KEY (i_id_city)
  REFERENCES t_cities (i_id_city)
;
The foreign key works in two ways. When a row with an invalid i_id_city hits the table
t_addresses, the key is violated and the insert aborts. Deleting or updating a row of the
table t_cities which is still referenced in the table t_addresses violates the key as well.
The enforcement is performed using triggers. For this reason, when performing a data only
dump/restore, the foreign keys will not allow the restore of some tables. The option
--disable-triggers allows the restore on an existing schema to succeed. For more information
on this topic check chapters 9 and 10.
The many options available with the foreign keys give us great flexibility. The
referenced table can drive different actions on the referencing data using the two event
clauses ON DELETE and ON UPDATE. Each event requires an action to perform when fired. By
default this is NO ACTION, which checks the constraint only at the end of the transaction;
this is useful with the deferred keys. Two other actions are RESTRICT, which does not
allow the deferring, and CASCADE, which cascades the action to the referencing rows; SET
NULL and SET DEFAULT are also available.
For example, let's create a foreign key which restricts the delete, without deferring, and
cascades the updates.
ALTER TABLE t_addresses
  ADD CONSTRAINT fk_t_addr_to_t_city
  FOREIGN KEY (i_id_city)
  REFERENCES t_cities (i_id_city)
  ON UPDATE CASCADE ON DELETE RESTRICT
;
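The NO ACTION check at the transaction's end can be combined with a deferrable key. The following is just a sketch, assuming t_cities has a surrogate key i_id_city and no other immediate foreign key is defined on the table:

```sql
ALTER TABLE t_addresses
  ADD CONSTRAINT fk_t_addr_to_t_city_def
  FOREIGN KEY (i_id_city)
  REFERENCES t_cities (i_id_city)
  DEFERRABLE INITIALLY DEFERRED;

-- inside the transaction the check is postponed until the COMMIT
BEGIN;
INSERT INTO t_addresses (v_address, i_id_city)
  VALUES ('1 Main Road', 100);    -- the city does not exist yet
INSERT INTO t_cities (i_id_city, v_city)
  VALUES (100, 'Lewes');          -- now the reference is satisfied
COMMIT;                           -- the key is checked here
```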
Another useful clause, available only for the foreign keys and the checks, is NOT VALID.
Creating a constraint with NOT VALID tells PostgreSQL that the data is already validated by
the database developer. The initial check is skipped and the constraint's creation is
instantaneous. The constraint is then enforced only on the new data. The invalid constraint
can be validated later with the command VALIDATE CONSTRAINT.
postgres=# ALTER TABLE t_addresses
  ADD CONSTRAINT fk_t_addr_to_t_city
  FOREIGN KEY (i_id_city)
  REFERENCES t_cities (i_id_city)
  ON UPDATE CASCADE ON DELETE RESTRICT
  NOT VALID
;
ALTER TABLE
postgres=# ALTER TABLE t_addresses VALIDATE CONSTRAINT fk_t_addr_to_t_city;
ALTER TABLE
6.4 Check constraints
A check constraint is satisfied if its condition returns true or NULL. This behaviour can
produce unpredictable results if not fully understood. An example will help to clarify the
behaviour. Let's add a CHECK constraint on the t_addresses table in order to refuse zero
length addresses. An insert with just the city succeeds without any violation though.
postgres=# ALTER TABLE t_addresses
  ADD CONSTRAINT chk_t_addr_city_exists
  CHECK (length(v_address) > 0)
;
ALTER TABLE
postgres=# INSERT INTO t_cities (v_city) VALUES ('Brighton') RETURNING i_id_city;
 i_id_city
-----------
         2
(1 row)

postgres=# INSERT INTO t_addresses (i_id_city) VALUES (2);
INSERT 0 1
This is possible because the field v_address does not have a default value, so it defaults to
NULL when not listed in the insert. The check constraint is correctly violated if, for example,
we try to update v_address to the empty string.
postgres=# UPDATE t_addresses SET v_address = '';
ERROR:  new row for relation "t_addresses" violates check constraint "chk_t_addr_city_exists"
DETAIL:  Failing row contains (3, , 2).
Changing the default value of the v_address field to the empty string makes the check
constraint work as expected.
postgres=# ALTER TABLE t_addresses ALTER COLUMN v_address SET DEFAULT '';
ALTER TABLE
postgres=# INSERT INTO t_addresses (i_id_city) VALUES (2);
ERROR:  new row for relation "t_addresses" violates check constraint "chk_t_addr_city_exists"
DETAIL:  Failing row contains (4, , 2).
Please note the existing rows are not affected by the default value change.
6.5 Not null
The NULL value is strange. When a NULL value is stored, the resulting field entry is an empty
object without any type or meaning, which doesn't consume physical space. Unless
specified otherwise, a newly defined field accepts NULL values.
When dealing with NULL it's important to remember that NULL acts like the mathematical
zero in a multiplication: when any element of an expression is NULL, the entire
expression becomes NULL.
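A few quick examples show the propagation; the values are arbitrary:

```sql
SELECT 1 + NULL;           -- NULL, not 1
SELECT 'abc' || NULL;      -- NULL, not 'abc'
SELECT NULL = NULL;        -- NULL, not true
SELECT coalesce(NULL, 0);  -- 0, coalesce replaces NULL with a fallback
```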
As seen before, fields with NULL values are allowed in the unique constraints, whereas
the primary key does not allow NULL values. NOT NULL is a column constraint
which forbids the presence of NULL values.
Actually, a field with both NOT NULL and a unique constraint defined is exactly what the
PRIMARY KEY enforces.
For example, to add the NOT NULL constraint to the field v_address in the
t_addresses table, the command is simply:
postgres=# ALTER TABLE t_addresses ALTER COLUMN v_address SET NOT NULL;
ERROR:  column "v_address" contains null values
In this case the ALTER fails because the column v_address contains the NULL values from the
example seen in 6.4. The fix is quick and easy.
postgres=# UPDATE t_addresses
  SET v_address = 'EMPTY'
  WHERE v_address IS NULL;
UPDATE 1
postgres=# ALTER TABLE t_addresses ALTER COLUMN v_address SET NOT NULL;
ALTER TABLE
Adding a new NULLable column is instantaneous: PostgreSQL simply adds the new
attribute to the system catalogue and manages the new tuple structure, treating the missing
field as empty space. When the NOT NULL constraint must be enforced, adding a new field
requires a DEFAULT value as well. This is an operation to consider carefully when dealing with
large data sets, because the table is completely rewritten and this requires an exclusive lock
on the affected relation. A better way to proceed is to add a NULLable field first. Afterwards
the new field is set with the expected default value. Then a table update fixes the
NULL values without long exclusive locks. When everything is done, finally, the NOT NULL can
be enforced on the new field.
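The safer procedure can be sketched as follows; the column v_country and its default are hypothetical:

```sql
-- step 1: add the column as NULLable, an instantaneous catalogue change
ALTER TABLE t_addresses ADD COLUMN v_country character varying (255);

-- step 2: set the default, affecting only the future rows
ALTER TABLE t_addresses ALTER COLUMN v_country SET DEFAULT 'UNKNOWN';

-- step 3: backfill the existing rows, possibly in batches
UPDATE t_addresses SET v_country = 'UNKNOWN' WHERE v_country IS NULL;

-- step 4: enforce the constraint once no NULL values remain
ALTER TABLE t_addresses ALTER COLUMN v_country SET NOT NULL;
```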
Chapter 7
The physical layout
After looking at the logical structure we'll now dig into PostgreSQL's physical structure.
We'll start from the top layer, the data area, looking first at the data files and how they are
organised. Then we'll move inside them, where the data pages and the fundamental storage
unit, the tuples, are stored. A section is dedicated to the TOAST tables. The chapter ends
with the physical aspects of the tablespaces and the MVCC.
7.1 Data files
The data files are organised in multiple segments, each one of 1 GB, numbered with a
suffix; the first segment created has no suffix, however. Alongside the main data files there
are some additional forks used by PostgreSQL for tracking the data visibility and the free
space.
7.1.2 Visibility map
The table's data files are paired with a visibility map file, whose suffix is _vm, where
PostgreSQL tracks the data pages having all the tuples visible to the active transactions.
This fork is also used for running the index only scans.
7.1.4 pg_class
When connecting to a database, all the relations inside it are listed in the pg_class system
table. The field relfilenode stores the relation's filename. The system field oid, which is
hidden when selecting with the wildcard *, is just the relation's object identifier and should
not be used for the physical mapping.
However, PostgreSQL has many useful functions which retrieve the information using
the relation's OID. For example the function pg_total_relation_size(regclass) returns the
disk space used by the table, including the additional forks, the eventual TOAST table and
the indices. The function returns the size in bytes. Another function, pg_size_pretty(bigint),
returns a human readable format for better reading.
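A minimal sketch combining the two functions; the table name is just an example:

```sql
SELECT
  relname,
  pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM
  pg_class
WHERE
  relname = 't_addresses';
```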
7.2 Pages
Each data file is a collection of elements called pages. The default size for a data page is 8 kB.
The page size can be changed only by recompiling the sources with a different configuration
and re-initialising the data area. The table's pages are also known as heap pages. The index
pages have almost the same structure as the heap pages, except for the special space allocated
at the page's bottom. The figure 7.1 shows an index page structure. The special space is used
to store the information needed by the relation's structure. For example a B-tree index puts
in the special space the pointers to the pages below in the B-tree structure.
A data page starts with a header of 24 bytes. After the header come the item pointers,
usually 4 bytes each. The item pointers form an array of pairs, composed of the offset and
the length of each item, pointing to the physical tuples in the page's bottom.
The page header holds the information for the page’s generic space management as shown
in figure 7.2.
Figure 7.2: Page header
• pd_lsn identifies the xlog record of the last page change. The buffer manager uses the
LSN for enforcing the WAL mechanism: a dirty buffer is not flushed to disk until the
xlog has been flushed at least as far as the page's LSN.
• pd_checksum stores the page's checksum, if enabled.
• pd_flags is used to store the page's various flags.
• pd_upper is the offset to the end of the free space.
The pd_checksum field replaces the pd_tli field, present in the page header until PostgreSQL
9.2, which was used to track the xlog records across the timeline id.
The page's checksum is a new 9.3 feature which can detect page corruption. It can
be enabled only when the data area is initialised with initdb.
The offset fields pd_lower, pd_upper and the optional pd_special are 2 bytes long, limiting
the max page size to 32 kB.
The field for the page version was introduced with PostgreSQL 7.3. Table 7.2 shows the
page version number for the major versions.
7.3 Tuples
The tuples are the fundamental storage unit in PostgreSQL. They are organised as arrays of
items whose kind is initially unknown, the datum. Each tuple has a fixed header of 23 bytes,
as shown in the figure 7.3.
Figure 7.3: Tuple structure
The fields t_xmin and t_xmax are used to track the tuple's visibility, as seen in 7.6. The
field t_cid is a "virtual" field, used for both cmin and cmax.
The field t_xvac is used by VACUUM when moving the rows; according to the source
code's comments in src/include/access/htup_details.h this field is used only by the old style
VACUUM FULL.
The field t_ctid is the tuple's physical location identifier. It is composed of a pair of integers
representing the page number and the tuple's index within the page. When a new tuple is
created, t_ctid is set to the tuple's own location. When the tuple is updated, this value changes
to the location of the new tuple's version. This field is used in pair with t_xmax to check if the
tuple is the last version. The two infomask fields are used to store various flags, like the presence
of the tuple's OID or of NULL values. The last field t_hoff is used to set the offset
to the actual tuple's data. This value is usually zero if the table doesn't have NULLable
fields and is created WITHOUT OIDS. If the tuple has the OID and/or NULLable fields,
the object identifier and a NULL bitmap are stored immediately after the tuple's header. The
bitmap, if present, begins just after the tuple's header and consumes enough bytes to have one
bit per data column. The OID, if present, is stored after the bitmap and consumes 4 bytes.
The tuple's data is a stream of composite data described by the composite model stored in
the system catalogue.
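The visibility fields described above can be inspected directly, because ctid, xmin and xmax are hidden system columns; a quick sketch on one of the previous example tables:

```sql
-- the system columns appear only when selected explicitly
SELECT ctid, xmin, xmax, i_id_address
FROM t_addresses;
```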
7.4 TOAST
The oversize attribute storage technique is the PostgreSQL implementation for storing the
data which overflows the page size. PostgreSQL does not allow the tuples to span multiple
pages. However it is possible to store large amounts of data, which is compressed or split into
multiple rows of an external TOAST table. The mechanism is completely transparent from
the user's point of view.
The storage model treats the fixed length types, like the integers, and the variable length
types, like text, in different ways. The fixed length types, which cannot produce large data,
are not processed through the TOAST routines. The variable length types are TOASTable if
the first 32-bit word of any stored value contains the total length of the value in bytes
(including itself).
The kind of TOAST is stored in the first two bits1 of the varlena length word. When
both bits are zero the attribute is an unTOASTed data type, and the remaining bits store
the datum size in bytes, including the length word.
If the first bit is set the value has only a single-byte header instead of the four byte
one, and the remaining bits store the total datum size in bytes, including the length
byte. There is a special case when those remaining bits are all zero: the value is then a
pointer to out of line data, stored in a separate TOAST table whose structure is
shown in figure 7.4.
Finally, whatever the first bit, if the second bit is set the corresponding datum is
compressed and must be decompressed before use.
Because the TOAST usurps the first two bits of the varlena length word, it limits the max
stored size to 1 GB (2^30 - 1 bytes).
1 On the big-endian architectures those are the high-order bits; on the little-endian ones those are the
low-order bits
Figure 7.4: Toast table structure
The toast table is composed of three fields. The chunk_id is an OID used to store the
chunk identifiers. The chunk_seq is an integer which stores the chunk order. The chunk_data
is a bytea field containing the actual data converted to a binary string.
The chunk size is normally 2 kB and is controlled at compile time by the symbol
TOAST_MAX_CHUNK_SIZE. The TOAST code is triggered by the value
TOAST_TUPLE_THRESHOLD, also 2 kB by default: when the tuple's size is bigger than
TOAST_TUPLE_THRESHOLD the TOAST routines are activated.
The value TOAST_TUPLE_TARGET, default 2 kB, governs the compression's behaviour:
PostgreSQL compresses the datum to achieve a final size smaller than
TOAST_TUPLE_TARGET; otherwise the out of line storage is used.
TOAST offers four different storage strategies. The strategy can be changed per column
using the ALTER TABLE ... SET STORAGE statement.
• PLAIN prevents either compression or out-of-line storage; it's the only storage available
for the fixed length data types.
• EXTENDED allows both compression and out-of-line storage. It is the default for most
TOAST-able data types. Compression will be attempted first, then out-of-line storage
if the row is still too big.
• EXTERNAL allows out-of-line storage but not compression.
• MAIN allows compression but not out-of-line storage. However the out-of-line storage
is still performed as a last resort.
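For example, a sketch of moving a column to the out of line storage without compression; the choice of column is arbitrary:

```sql
-- EXTERNAL avoids the compression, which can speed up
-- substring operations on big text values
ALTER TABLE t_addresses
  ALTER COLUMN v_address SET STORAGE EXTERNAL;
```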
The out of line storage has the advantage of keeping the stored data out of the row
versioning: if the TOAST data is not affected by the update there will be no dead row for
the TOAST data. That's possible because the varlena is a mere pointer to the chunks, and a
new row version affects only the pointer, leaving the TOAST data unchanged.
The TOAST tables are stored, like all the other relations, in the pg_class table; the
associated table can be found using a self join on the field reltoastrelid.
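The self join can be sketched like this, using one of the previous example tables:

```sql
-- find the TOAST table associated with a given relation
SELECT
  t.relname AS table_name,
  tst.relname AS toast_table_name
FROM
  pg_class t
  JOIN pg_class tst ON tst.oid = t.reltoastrelid
WHERE
  t.relname = 't_addresses';
```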
7.5 Tablespaces
PostgreSQL implements the tablespaces using the symbolic links. Inside the directory
$PGDATA/pg_tblspc there are the links to the physical locations, each named after the
tablespace's OID. Therefore the tablespaces are available only on the systems with symbolic
link support.
Before the version 9.0 the tablespace symbolic link pointed directly to the referenced
directory. This was a race condition when upgrading in place, because the location could
clash with the upgraded cluster. From the version 9.0 the tablespace creates a sub directory
in the tablespace location, named after the major version and the system catalogue version
number.
postgres@tardis:~$ ls -l /var/lib/postgresql/pg_tbs/ts_test
total 0
drwx------ 2 postgres postgres 6 Jun 9 13:01 PG_9.3_201306121
The sub directory's name is a combination of the capital letters PG, followed by the major
version truncated to the first two numbers and the catalogue version number stored in the
control file.
PG_{MAJOR_VERSION}_{CATALOGUE_VERSION_NUMBER}
Inside the container directory the data files are organised in the same way as in the base
directory, as seen in 4.6.
Moving a tablespace to another physical location is not complicated, but the cluster needs
to be shut down. With the cluster stopped, the container directory can be safely copied to
the new location. The receiving directory must have the same permissions as the origin's.
The symbolic link must then be recreated to point to the new physical location. At the cluster's
start the change is automatically resolved through the symbolic link.
Until PostgreSQL 9.1 the tablespace location was stored in the field spclocation of the
system table pg_tablespace. From the version 9.2 the spclocation field is removed and the
tablespace's location is resolved on the fly using the function pg_tablespace_location(tablespace_oid).
This function can be used to query the system catalogue about the tablespaces. In this
simple example the query returns the tablespace's location resolved from the OID.
postgres=#
SELECT
  pg_tablespace_location(oid),
  spcname
FROM
  pg_tablespace
;
       pg_tablespace_location        |  spcname
-------------------------------------+------------
                                     | pg_default
                                     | pg_global
 /var/lib/postgresql/pg_tbs/ts_test  | ts_test
(3 rows)
Because the function pg_tablespace_location returns the empty string for the system
tablespaces, a better approach is to combine a CASE construct with the function
current_setting and build the absolute path for the system tablespaces.
postgres=# SELECT current_setting('data_directory');
       current_setting
------------------------------
 /var/lib/postgresql/9.3/main
(1 row)
postgres=#
SELECT
  CASE
    WHEN pg_tablespace_location(oid) = ''
      AND spcname = 'pg_default'
    THEN current_setting('data_directory') || '/base/'
    WHEN pg_tablespace_location(oid) = ''
      AND spcname = 'pg_global'
    THEN current_setting('data_directory') || '/global/'
    ELSE pg_tablespace_location(oid)
  END AS spclocation,
  spcname
FROM
  pg_tablespace;
              spclocation             |  spcname
--------------------------------------+------------
 /var/lib/postgresql/9.3/main/base/   | pg_default
 /var/lib/postgresql/9.3/main/global/ | pg_global
 /var/lib/postgresql/pg_tbs/ts_test   | ts_test
(3 rows)
Another useful function, pg_tablespace_databases(tablespace_oid), helps us to find
the databases with relations on a certain tablespace.
The following example uses this function, again with a CASE construct, to find the
databases having objects on a specific tablespace, in our example the ts_test created in 5.7.
db_test=#
SELECT
  datname,
  spcname,
  CASE
    WHEN pg_tablespace_location(tbsoid) = ''
      AND spcname = 'pg_default'
    THEN current_setting('data_directory') || '/base/'
    WHEN pg_tablespace_location(tbsoid) = ''
      AND spcname = 'pg_global'
    THEN current_setting('data_directory') || '/global/'
    ELSE pg_tablespace_location(tbsoid)
  END AS spclocation
FROM
  pg_database dat,
  (
    SELECT
      oid AS tbsoid,
      pg_tablespace_databases(oid) AS datoid,
      spcname
    FROM
      pg_tablespace WHERE spcname = 'ts_test'
  ) tbs
WHERE
  dat.oid = tbs.datoid
;
 datname | spcname |            spclocation
---------+---------+------------------------------------
 db_test | ts_test | /var/lib/postgresql/pg_tbs/ts_test
(1 row)
7.6 MVCC
The multiversion concurrency control is used in PostgreSQL to implement the transactional
model seen in 5.8.
At the logical level this is completely transparent to the user: the new row versions become
visible after the commit, in accordance with the transaction isolation level.
At the physical level, each new row version stores the insert's XID in the t_xmin field,
which the internal semantics use to determine the row's visibility.
Because the XID is a 32 bit quantity, it wraps around at 4 billion. When this happens, in
theory, all the tuples should suddenly disappear, because they switch from the current XID's
past to its future: the well known XID wraparound failure. In the old PostgreSQL versions
this was a serious problem, which forced the administrators to dump/reload the entire cluster
into a freshly initialised data area every 4 billion transactions.
PostgreSQL 7.2 introduced a new comparison method for the XID, the modulo-2^32
arithmetic, along with a special XID, the FrozenXID2, assumed as always in the
past. With the new comparison method, for any arbitrary XID there exist 2 billion transactions
in the future and 2 billion transactions in the past.
When the tuple's t_xmin becomes old, the periodic VACUUM freezes the ageing
tuple, changing its t_xmin to the FrozenXID, always in the past. The pg_class and the
pg_database tables have dedicated fields to track the age of the oldest XID. The
values stored there have little meaning if not processed through the function age(),
which shows the number of transactions between the current XID and the value stored in the
system catalogue.
The following query returns all the databases with the corresponding datfrozenxid and the
XID's age.
2 The FrozenXID's value is 2. The docs of PostgreSQL 7.2 also mention the BootstrapXID, whose value is 1
postgres=#
SELECT
  datname,
  age(datfrozenxid),
  datfrozenxid
FROM
  pg_database;
   datname   | age  | datfrozenxid
-------------+------+--------------
 template1   | 4211 |          679
 template0   | 4211 |          679
 postgres    | 4211 |          679
 db_test     | 4211 |          679
When a tuple's age is more than 2 billion transactions, the tuple simply disappears from the
cluster. Before the version 8.0 there was no alert or protection against the XID wraparound
failure. Since then a passive mechanism was introduced, which emits messages in the activity
log when the age of datfrozenxid is within ten million transactions of the wraparound point.
Such a message is quite serious and should not be ignored.
In this case the autovacuum daemon acts like a watchdog and starts vacuuming the tables
with ageing tuples even if autovacuum is turned off in the cluster. There is another, quite
radical, protection: if for some reason one of the databases' datfrozenxid gets within one
million transactions of the wraparound point, the cluster shuts down and refuses to
start again. The only option in this case is to run the postgres process in the single-user
backend and execute the VACUUM on the affected relations.
The Debian package's configuration is quite odd, putting the configuration files in
/etc/postgresql instead of the data area. The following example is the standalone backend's
invocation for Debian's packaged default cluster main.
postgres@tardis:~/tempdata$ /usr/lib/postgresql/9.3/bin/postgres \
    --single -D /var/lib/postgresql/9.3/main \
    --config-file=/etc/postgresql/9.3/main/postgresql.conf
The database interface in single user mode does not have the sophisticated features of a
client like psql. Anyway, with a little knowledge of SQL, it's possible to find the database(s)
causing the shutdown and fix them.
1: datname (typeid = 19, len = 64, typmod = -1, byval = f)
2: age (typeid = 23, len = 4, typmod = -1, byval = t)
----
1: datname = "template1" (typeid = 19, len = 64, typmod = -1, byval = f)
2: age = "2146435072" (typeid = 23, len = 4, typmod = -1, byval = t)
----
1: datname = "template0" (typeid = 19, len = 64, typmod = -1, byval = f)
2: age = "10" (typeid = 23, len = 4, typmod = -1, byval = t)
----
1: datname = "postgres" (typeid = 19, len = 64, typmod = -1, byval = f)
2: age = "10" (typeid = 23, len = 4, typmod = -1, byval = t)
----
The age function shows how old the last XID not yet frozen is. In our example the
template1 database has an age of 2146435072, one million transactions from the wraparound.
We can then exit the backend with CTRL+D and restart it in single user mode,
specifying the database name. A VACUUM will get rid of the problematic XID.
postgres@tardis:~/tempdata$ /usr/lib/postgresql/9.3/bin/postgres \
    --single -D /var/lib/postgresql/9.3/main \
    --config-file=/etc/postgresql/9.3/main/postgresql.conf \
    template1
This procedure must be repeated for any database with a very old XID.
Because of the new row generation at update time, updates can lead to unnecessary table
and index bloat. With the Heap Only Tuples (HOT) strategy, PostgreSQL can limit the
unavoidable bloat caused by the updates. HOT's main goal is to keep the new row versions
inside the same page.
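A common way to give HOT more room is lowering the table's fillfactor, leaving free space in every page for the new row versions; the value below is only an example, not a recommendation:

```sql
-- keep 30% of each heap page free for the updated row versions
ALTER TABLE t_addresses SET (fillfactor = 70);

-- the setting affects only the future writes;
-- VACUUM FULL or CLUSTER rewrite the pages with the new fillfactor
```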
The MVCC is something to consider at design time. Ignoring the way PostgreSQL manages
the physical tuples can result in data bloat and, in general, in poor performance.
Chapter 8
Maintenance
The database maintenance is crucial for keeping the data access efficient. Building
a proper maintenance plan is almost as important as having a good disaster recovery plan.
As seen in 7.6, an update generates a new tuple's version rather than updating the affected field.
The new tuple is stored in the next available free space, in the same page or in a different one.
Frequent updates move the tuples across the data pages many times, leaving behind a
trail of dead tuples. Those tuples, although consuming physical space, are no
longer visible to the new transactions, and this results in the table bloat. The indices make
things more complicated. When a new tuple's version is stored in a different page, the index
entry needs an update to point to the new page. The index's ordered structure makes it more
difficult to find free space, resulting in a higher rate of new pages added to the relation and
consequent bloat.
The following sections will explore the tools available for the relation’s maintenance.
8.1 VACUUM
VACUUM is a PostgreSQL specific command which reclaims the dead tuples' space.
When executed without a target table, the command processes all the tables in the database. A
regular VACUUM has some beneficial effects.
• It removes the dead tuples and updates the free space map.
• It updates the visibility map, improving the index only scans.
• It freezes the tuples with ageing XIDs, preventing the XID wraparound.
The optional ANALYZE clause also gathers the runtime statistics on the processed table.
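As a sketch, a few typical invocations on one of the example tables:

```sql
-- vacuum a single table and refresh its planner statistics
VACUUM ANALYZE t_addresses;

-- verbose run over the entire database
VACUUM VERBOSE;

-- freeze the ageing tuples regardless of their age
VACUUM FREEZE t_addresses;
```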
When run, VACUUM clears the space used by the dead rows, making room for the inserts
and updates inside the data files. The data files are not shrunk, except when there is contiguous
free space at the table's end. In this case VACUUM runs a truncate scan, which can fail if
there is a conflicting lock with the database activity. The truncate scan works
only on the table's data files. The general approach of VACUUM is to have the minimum
impact on the cluster's activity. However, because the pages are rewritten, VACUUM
can increase the I/O activity.
The index pages are scanned as well and the dead tuples are cleared there too. The VACUUM
performance on the indices is influenced by the maintenance_work_mem setting. If the table
does not have indices, VACUUM runs the cleanup reading the pages sequentially. If there
is any index, VACUUM stores in the maintenance work memory the tuples' references
for the subsequent index cleanup. If the memory is not sufficient to fit all the tuples,
VACUUM stops the sequential read to execute a partial cleanup of the indices and free
the maintenance_work_mem.
The maintenance_work_mem can have a sensible impact on the VACUUM performance on
large tables. For example, let's build a simple table with 1 million rows.
postgres=# CREATE TABLE t_vacuum
(
  i_id serial,
  ts_value timestamp with time zone DEFAULT clock_timestamp(),
  t_value text,
  CONSTRAINT pk_t_vacuum PRIMARY KEY (i_id)
)
;
CREATE TABLE
postgres=# CREATE INDEX idx_ts_value ON t_vacuum (ts_value);
CREATE INDEX
In order to have a static environment we'll disable the table's autovacuum. We'll also increase
the session's verbosity to display what's happening during the VACUUM run.
We then execute a complete table rewrite by running an UPDATE without a WHERE
condition. This will create 1 million dead rows.
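The setup described above can be sketched like this; the UPDATE expression is arbitrary:

```sql
-- disable autovacuum for this table only
ALTER TABLE t_vacuum SET (autovacuum_enabled = false);

-- rewrite every row, generating one dead tuple per live row
UPDATE t_vacuum SET t_value = md5(clock_timestamp()::text);
```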
Before running VACUUM we'll change the maintenance_work_mem to a small value.
We'll also enable the query timing.
postgres=# SET maintenance_work_mem = '2MB';
SET
postgres=# SET client_min_messages = 'debug';
SET
postgres=# \timing
Timing is on.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
DEBUG:  index "idx_ts_value" now contains 1000000 row versions in 8237 pages
DETAIL:  999930 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
DEBUG:  "t_vacuum": found 1000000 removable, 1000000 nonremovable row versions
in 20619 out of 20619 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 43 unused item pointers.
0 pages are entirely empty.
CPU 0.53s/2.05u sec elapsed 12.34 sec.
DEBUG:  vacuuming "pg_toast.pg_toast_51919"
DEBUG:  index "pg_toast_51919_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
DEBUG:  "pg_toast_51919": found 0 removable, 0 nonremovable row versions in 0
out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 12377.436 ms
postgres=#
VACUUM stores in the maintenance_work_mem an array of pointers to the removed
dead tuples, which is used for the index cleanup. With a small maintenance_work_mem the
array can consume the entire memory, causing VACUUM to pause the table scan for a partial
index cleanup; the table scan then resumes. Increasing the maintenance_work_mem to 20 MB
lets the index scan run without pauses, which improves the VACUUM speed.
postgres =# SET m a i n t e n a n c e _ w o r k _ m e m = ’ 20 MB ’;
SET
73
CPU 0.00s/0.00u sec elapsed 0.00 sec.
DEBUG:  "t_vacuum": found 1000000 removable, 1000000 nonremovable row versions
in 20619 out of 20619 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 100 unused item pointers.
0 pages are entirely empty.
CPU 0.56s/1.39u sec elapsed 9.61 sec.
DEBUG:  vacuuming "pg_toast.pg_toast_51919"
DEBUG:  index "pg_toast_51919_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
DEBUG:  "pg_toast_51919": found 0 removable, 0 nonremovable row versions in 0
out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 9646.112 ms
postgres=# SET maintenance_work_mem = '20MB';
SET
postgres=# \timing
Timing is on.
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 1581.384 ms
Before proceeding let's put back the primary key and the index on the relation. We'll
need them later.
postgres=# ALTER TABLE t_vacuum ADD CONSTRAINT pk_t_vacuum PRIMARY KEY (i_id);
DEBUG:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "pk_t_vacuum"
for table "t_vacuum"
DEBUG:  building index "pk_t_vacuum" on table "t_vacuum"
ALTER TABLE
Time: 1357.689 ms
The table in the example starts with a size of 806 MB. After the update the table's
size doubles, and the VACUUM run does not shrink it. This happens because there is no
contiguous free space at the end of the table: the new row versions generated by the update
are stored in the table's last pages. A second UPDATE followed by another VACUUM could
truncate the table, if all the dead rows were located at the table's end. However, VACUUM's
main goal is to keep the table's size stable rather than to shrink it.
It's also possible to run VACUUM with the FREEZE clause. In this case VACUUM
freezes all the tuples regardless of their age. The command is equivalent to running VACUUM
with vacuum_freeze_min_age set to zero.
95% of the autovacuum_freeze_max_age, reducing the chance of an anti-wraparound
autovacuum.
8.1.8 vacuum cost page hit
The parameter sets the arbitrary cost of vacuuming one buffer found in the shared buffer
cache. It represents the cost of locking the buffer, looking it up in the shared hash table and
scanning the page's content. The default value is one.
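The page costs feed PostgreSQL's cost-based vacuum delay: each page visit adds its cost to a counter and, once the counter exceeds vacuum_cost_limit, the worker naps for vacuum_cost_delay. The sketch below mirrors the documented default costs and limit; the server's internal bookkeeping is more elaborate than this, so take it as a simplified illustration only.

```python
# Simplified sketch of the cost-based vacuum accounting.
VACUUM_COST_PAGE_HIT = 1    # buffer found in the shared buffer cache
VACUUM_COST_PAGE_MISS = 10  # buffer read from disk
VACUUM_COST_LIMIT = 200     # accumulated cost that triggers a nap

def count_naps(page_costs):
    """Count how many delay naps a sequence of page visits triggers."""
    balance, naps = 0, 0
    for cost in page_costs:
        balance += cost
        if balance >= VACUUM_COST_LIMIT:
            naps += 1      # here the real vacuum sleeps vacuum_cost_delay
            balance = 0
    return naps

# 400 cached pages at cost 1 each cross the limit twice:
print(count_naps([VACUUM_COST_PAGE_HIT] * 400))  # -> 2
# 19 disk reads at cost 10 stay under the limit:
print(count_naps([VACUUM_COST_PAGE_MISS] * 19))  # -> 0
```

The higher per-page cost of a disk read is why a mostly uncached table makes a throttled vacuum nap much earlier.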
8.2 ANALYZE
The PostgreSQL query optimiser builds the query execution plan using the cost estimates
from the internal runtime statistics. Each step in the execution plan gets an arbitrary cost
used to compute the plan's total cost. The execution plan with the lowest estimated cost is
then sent to the query executor. Keeping the runtime statistics up to date helps the cluster
to build efficient plans.
The command ANALYZE gathers the relation's runtime statistics. When executed, it reads
the data, builds up the statistics and stores them into the pg_statistic system table. The
command accepts the optional clause VERBOSE to increase verbosity, alongside an optional
target table and an optional column list. If ANALYZE is launched with no parameters it scans
all the tables in the database. Specifying just the table name will cause ANALYZE to process
all the table's columns.
When working on large tables ANALYZE runs a sample read of the table. The GUC
parameter default_statistics_target determines the amount of entries read by the sample. The
default value is 100. Increasing the value gives the planner better estimates, in particular
for columns with irregularly distributed data. This accuracy has a cost, though: ANALYZE
spends a longer time gathering and building the statistics, and they require more space in
pg_statistic.
The following example shows how default_statistics_target can affect the estimates.
We'll reuse the table created in 8.1. This is the result of ANALYZE VERBOSE with the
default statistics target.
A page is dirty when its modifications are not yet written to the relation's data file.
postgres=# SET default_statistics_target = 100;
SET
postgres=# ANALYZE VERBOSE t_vacuum;
INFO:  analyzing "public.t_vacuum"
INFO:  "t_vacuum": scanned 30000 of 103093 pages, containing 2909979 live rows
and 0 dead rows;
30000 rows in sample, 9999985 estimated total rows
ANALYZE
The table has 10 million rows, but ANALYZE reads only 2,909,979 of them, about 30%
of the effective live tuples, and extrapolates the total to 9,999,985 estimated rows.
Now we'll run ANALYZE with default_statistics_target set to its maximum allowed value,
10000.
postgres=# SET default_statistics_target = 10000;
SET
postgres=# ANALYZE VERBOSE t_vacuum;
INFO:  analyzing "public.t_vacuum"
INFO:  "t_vacuum": scanned 103093 of 103093 pages, containing 10000000 live
rows and 0 dead rows;
3000000 rows in sample, 10000000 estimated total rows
ANALYZE
This time the table's live tuples are correctly estimated at 10 million.
The table pg_statistic is not intended for human reading. The statistics are translated
into a human-readable format by the view pg_stats.
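The arithmetic behind those two ANALYZE runs can be sketched outside the database. This is a hedged illustration in Python, assuming the commonly cited rule that the sample holds 300 × default_statistics_target rows, and using a simple linear extrapolation; ANALYZE's own estimator is more refined, which is why it printed 9,999,985 rather than the value computed below.

```python
# Sample size and row-count extrapolation behind ANALYZE, sketched.
def sample_rows(statistics_target):
    """Assumed sample size: 300 rows per statistics target unit."""
    return 300 * statistics_target

def estimated_total(live_rows_in_sampled_pages, pages_scanned, total_pages):
    """Extrapolate the sampled per-page density over the whole relation."""
    rows_per_page = live_rows_in_sampled_pages / pages_scanned
    return round(rows_per_page * total_pages)

print(sample_rows(100))    # -> 30000, as in the first ANALYZE run
print(sample_rows(10000))  # -> 3000000, as in the second run
# Extrapolating the first sample lands close to the real 10 million:
print(estimated_total(2909979, 30000, 103093))
```

The takeaway is that even a 30% page sample extrapolates to within a fraction of a percent of the true row count on evenly distributed data; irregular distributions are where a larger target pays off.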
The rule of thumb when dealing with poorly performing queries is to check whether the
statistics are recent and accurate. The information is stored in the view pg_stat_all_tables.
For example, this query gets, for a given table, the last execution of the manual and the
automatic VACUUM alongside the last manual and automatic ANALYZE.
postgres=# \x
Expanded display is on.
postgres=# SELECT
schemaname,
relname,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze
FROM
pg_stat_all_tables
WHERE
relname = 't_vacuum'
;
-[ RECORD 1 ]----+------------------------------
schemaname       | public
The subset views pg_stat_user_tables and pg_stat_sys_tables are useful to search respectively the user
tables and the system tables.
relname          | t_vacuum
last_vacuum      |
last_autovacuum  |
last_analyze     | 2014-06-17 18:48:56.359709+00
last_autoanalyze |
postgres=#
The statistics target is a per-column setting, allowing a fine-grained tuning of the
ANALYZE command.
The default_statistics_target can be changed for the current session only using the SET
command. The cluster-wide value is changed via the parameter in the postgresql.conf file.
8.3 REINDEX
The general purpose B-tree index stores the indexed value alongside the pointer to the
tuple's heap page. The index pages are organised in the form of a balanced tree, linked to
each other using the page's special space seen in 7.1. As long as the heap tuple remains in the
same page the index entry doesn't need an update. The HOT strategy tries to achieve this goal
by keeping the heap tuples in the same page. When a new tuple version is stored in a different
heap page, the index entry needs to reflect the change. By default the index pages have a
percentage of space reserved for the updates: a hardcoded 30% for the non-leaf pages
and 10% for the leaf pages. The latter can be changed by adjusting the index's fillfactor.
VACUUM is less effective on indices because of their ordered nature. Even when the dead
tuples are converted to free space, that space is reusable only if the new entry is compatible
with its B-tree position. The empty pages can be recycled, but this requires at least two
VACUUM runs. When an index page becomes empty it is marked as deleted by VACUUM but
not immediately recycled: the page is first stamped with the next XID and therefore becomes
invisible. Only a second VACUUM will clear the deleted pages, returning the free space to the
relation. This behaviour is intentional, because there might be running scans which still need
to access the page; the second VACUUM is the safest way to recycle the page only when it is
no longer required.
Therefore the indices are affected by data bloat more than the tables. Alongside a bigger
disk space allocation, the bloat generally results in bad index performance. A periodical
reindex is the best way to keep the indices in good shape.
Unlike VACUUM, REINDEX has a noticeable impact on the cluster's activity.
To ensure the data is read consistently, REINDEX takes a lock on the table preventing the
table's writes. The reads are also blocked for the queries which are using the index.
A B-tree index build requires the data to be sorted. PostgreSQL comes with a handy
GUC parameter to track the sort, trace_sort, which requires the verbosity set to DEBUG.
The following example is the output of the primary key's reindex on the test table created
in 8.1.
postgres=# SET trace_sort = on;
SET
postgres=# SET client_min_messages = 'debug';
SET
postgres=# \timing
Timing is on.
The reindex performs a data sort, but maintenance_work_mem cannot hold the table's
data, so PostgreSQL starts a disk sort in order to build up the index. The rule PostgreSQL
follows to choose between a sort on disk and a sort in memory is simple: if the maintenance
work memory is exhausted during the table scan, a sort on disk is used. That's the reason
why increasing the maintenance_work_mem can improve the reindex. Determining the correct
value for this parameter is quite tricky.
This is the reindex run using 1 GB of maintenance_work_mem.
postgres=# \timing
Timing is on.
postgres=# SET maintenance_work_mem = '1GB';
SET
Time: 0.193 ms
postgres=# REINDEX INDEX pk_t_vacuum;
DEBUG:  building index "pk_t_vacuum" on table "t_vacuum"
LOG:  begin index sort: unique = t, workMem = 1048576, randomAccess = f
LOG:  begin index sort: unique = f, workMem = 1024, randomAccess = f
LOG:  internal sort ended, 25 KB used: CPU 0.45s/2.02u sec elapsed 2.47 sec
LOG:  performsort starting: CPU 0.45s/2.02u sec elapsed 2.47 sec
LOG:  performsort done: CPU 0.45s/4.36u sec elapsed 4.81 sec
LOG:  internal sort ended, 705717 KB used: CPU 0.66s/4.74u sec elapsed 6.85 sec
REINDEX
Time: 6964.196 ms
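Since picking the right maintenance_work_mem is tricky, a back-of-the-envelope estimate can at least bound it. The sketch below multiplies the row count by an assumed per-tuple sort footprint; the 24-byte overhead figure is an illustrative guess for this example, not a documented constant, so verify against trace_sort output as done above.

```python
# Rough sizing heuristic for an in-memory index sort:
# rows * (key bytes + assumed per-tuple sort bookkeeping).
ASSUMED_OVERHEAD_BYTES = 24  # hypothetical per-tuple overhead

def sort_mem_estimate_mb(rows, key_bytes):
    """Very rough memory estimate, in MB, for sorting an index build."""
    total = rows * (key_bytes + ASSUMED_OVERHEAD_BYTES)
    return total / (1024 * 1024)

# Ten million 4-byte integer keys land in the hundreds of megabytes:
print(round(sort_mem_estimate_mb(10_000_000, 4)))  # -> 267
```

Comparing such an estimate with the "internal sort ended, ... KB used" line emitted by trace_sort is a practical way to calibrate the overhead figure for a specific installation.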
After the sort the reindex creates a new index file from the sorted data, which replaces
the old one in the system catalogue's pg_class.relfilenode. When the reindex's transaction
commits, the old file node is deleted. The sequence can be emulated by creating a new index
with a different name; the old index can then be dropped safely and the new one renamed to
the old name. This approach has the advantage of not blocking the table's reads using the
old index.
From version 8.2 PostgreSQL supports the CREATE INDEX CONCURRENTLY
statement, which doesn't block the cluster's activity. With this method the index creation
adds a new invalid index to the system catalogue, then starts a table scan to build the dirty
index. A second table scan is then executed to fix the invalid index entries and, after a final
validation, the index becomes valid.
The concurrent index build has indeed some caveats and limitations.
• Any problem with the table scan will fail the command, leaving an invalid index in
place. This relation is not used for the reads but adds an extra overhead to the inserts
and updates.
• When building a unique index concurrently, the uniqueness is enforced from the start
of the second table scan. Some transactions could then report a uniqueness violation
before the index becomes available. If the build fails during the second table scan, the
invalid index will keep enforcing the uniqueness regardless of its status.
• Regular index builds can run in parallel on the same table; concurrent index builds
cannot.
• Concurrent index builds cannot run within a transaction block.
The primary keys and unique constraints can be swapped like the indices, using a different
approach. Since version 9.1 PostgreSQL supports the ALTER TABLE table_name ADD
table_constraint USING INDEX statement. Combining a DROP CONSTRAINT with this
command makes it possible to swap the constraint's index without losing the uniqueness
enforcement.
postgres=# CREATE UNIQUE INDEX pk_t_vacuum_new
ON t_vacuum USING BTREE (i_id);
CREATE INDEX
postgres=# ALTER TABLE t_vacuum
DROP CONSTRAINT pk_t_vacuum,
ADD CONSTRAINT pk_t_vacuum_new PRIMARY KEY
USING INDEX pk_t_vacuum_new
;
ALTER TABLE
postgres=# ALTER INDEX pk_t_vacuum_new
RENAME TO pk_t_vacuum;
ALTER INDEX
The example uses a regular index build, and therefore blocks the writes. It's also possible
to build the new index concurrently.
This method cannot be used, though, if any foreign key references the local key.
postgres=# CREATE TABLE t_vac_foreign
(
i_foreign serial,
i_id integer NOT NULL,
t_value text
)
;
CREATE TABLE
postgres=# ALTER TABLE t_vac_foreign
ADD CONSTRAINT fk_t_vac_foreign_t_vacuum_i_id
FOREIGN KEY (i_id)
REFERENCES t_vacuum (i_id)
ON DELETE CASCADE
ON UPDATE RESTRICT;
ALTER TABLE
The command CLUSTER can be quite confusing. Its purpose is to rebuild a completely
new table with the tuples in the same order as the clustered index, which is set using the
command ALTER TABLE table_name CLUSTER ON index_name.
For example, this is the verbose output of the CLUSTER command for the table created
in 8.1. The table is clustered on the timestamp field's index.
postgres=# SET trace_sort = 'on';
SET
postgres=# SET client_min_messages = 'debug';
SET
postgres=# ALTER TABLE t_vacuum CLUSTER ON idx_ts_value;
ALTER TABLE
postgres=# CLUSTER t_vacuum;
DEBUG:  building index "pg_toast_51949_index" on table "pg_toast_51949"
LOG:  begin index sort: unique = t, workMem = 16384, randomAccess = f
LOG:  begin index sort: unique = f, workMem = 1024, randomAccess = f
LOG:  internal sort ended, 25 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 25 KB used: CPU 0.00s/0.00u sec elapsed 0.06 sec
LOG:  begin tuple sort: nkeys = 1, workMem = 16384, randomAccess = f
DEBUG:  clustering "public.t_vacuum" using sequential scan and sort
LOG:  switching to external sort with 59 tapes: CPU 0.02s/0.02u sec elapsed
0.05 sec
LOG:  performsort starting: CPU 0.10s/0.71u sec elapsed 0.81 sec
LOG:  finished writing run 1 to tape 0: CPU 0.11s/0.75u sec elapsed 0.86 sec
LOG:  finished writing final run 2 to tape 1: CPU 0.11s/0.75u sec elapsed 0.86
sec
LOG:  performsort done (except 2-way final merge): CPU 0.11s/0.76u sec elapsed
0.87 sec
LOG:  external sort ended, 10141 disk blocks used: CPU 0.22s/1.01u sec elapsed
1.23 sec
DEBUG:  "t_vacuum": found 0 removable, 1000000 nonremovable row versions in
20619 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 0.24s/1.02u sec elapsed 1.84 sec.
DEBUG:  building index "pk_t_vacuum" on table "t_vacuum"
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  switching to external sort with 59 tapes: CPU 0.01s/0.07u sec elapsed
0.09 sec
LOG:  performsort starting: CPU 0.04s/0.74u sec elapsed 0.78 sec
LOG:  finished writing final run 1 to tape 0: CPU 0.04s/0.88u sec elapsed 0.92
sec
LOG:  performsort done: CPU 0.04s/0.88u sec elapsed 0.92 sec
LOG:  external sort ended, 2445 disk blocks used: CPU 0.07s/0.96u sec elapsed
1.23 sec
DEBUG:  building index "idx_ts_value" on table "t_vacuum"
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  switching to external sort with 59 tapes: CPU 0.00s/0.07u sec elapsed
0.08 sec
LOG:  performsort starting: CPU 0.02s/0.74u sec elapsed 0.76 sec
LOG:  finished writing final run 1 to tape 0: CPU 0.02s/0.88u sec elapsed 0.91
sec
LOG:  performsort done: CPU 0.02s/0.88u sec elapsed 0.91 sec
LOG:  external sort ended, 2445 disk blocks used: CPU 0.04s/0.98u sec elapsed
1.21 sec
DEBUG:  drop auto-cascades to type pg_temp_51919
DEBUG:  drop auto-cascades to type pg_temp_51919[]
DEBUG:  drop auto-cascades to toast table pg_toast.pg_toast_51949
DEBUG:  drop auto-cascades to index pg_toast.pg_toast_51949_index
DEBUG:  drop auto-cascades to type pg_toast.pg_toast_51949
CLUSTER
postgres=#
CLUSTER has different strategies to order the data. In this example the chosen strategy
is the sequential scan and sort. The tuples are stored in a new file node which is then
assigned to the relation's relfilenode. Before completing the operation the indices are
reindexed. When the CLUSTER is done the old file node is removed from disk. The process
is quite invasive though: because the relation is literally rebuilt from scratch, it requires an
exclusive access lock which blocks both the reads and the writes. The storage is another
critical point: there should be enough space to keep the old relation's data files together with
the new files, the indices and the eventual sort on disk.
VACUUM FULL and CLUSTER have some beneficial effects on the disk storage, as the
space is returned to the operating system, and they improve the index performance thanks
to the implicit reindex.
The blocking nature of those commands has an unavoidable impact on the cluster's
activity. Unlike conventional VACUUM, CLUSTER and VACUUM FULL should run when
the cluster is not in use or during a maintenance window. CLUSTER and VACUUM FULL
do not fix the XID wraparound failure.
As a rule of thumb, in order to minimise the database's downtime, CLUSTER and
VACUUM FULL should be used only for extraordinary maintenance and only if the disk
space is critical.
track_counts = 'on'.
8.5.1 autovacuum
This parameter is used to enable or disable the autovacuum daemon. Changing the setting
requires a reload of the cluster's configuration.
8.5.7 autovacuum analyze threshold
This parameter sets the extra threshold of updated or deleted tuples to add to the value
computed from autovacuum_analyze_scale_factor. The resulting value is used to trigger an
automatic ANALYZE. The default is 50 tuples. The setting can be overridden for individual
tables by changing their storage parameters. For example, a table with 10 million rows and
both autovacuum_analyze_scale_factor and autovacuum_analyze_threshold set to their
default values will trigger an automatic ANALYZE when 1,000,050 tuples have been updated
or deleted.
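The trigger condition just described is plain arithmetic, sketched below with the documented defaults (scale factor 0.1, threshold 50):

```python
# Automatic ANALYZE trigger condition:
# changed tuples >= threshold + scale_factor * reltuples
def autoanalyze_trigger(reltuples,
                        scale_factor=0.1,   # autovacuum_analyze_scale_factor
                        threshold=50):      # autovacuum_analyze_threshold
    """Tuples changed before an automatic ANALYZE is triggered."""
    return int(threshold + scale_factor * reltuples)

# A 10 million row table fires the automatic ANALYZE after
# 1,000,050 updated or deleted tuples:
print(autoanalyze_trigger(10_000_000))  # -> 1000050
```

The same arithmetic shows why the scale factor dominates on big tables: the fixed threshold of 50 is negligible next to 10% of 10 million rows.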
Chapter 9
Backup
The hardware is subject to faults. In particular, if the storage is lost the entire data
infrastructure becomes inaccessible, sometimes for good. Human errors, like a wrong DELETE
or a table DROP, can also happen. A solid backup strategy is the best protection against
these problems and many more. This chapter covers the logical backup with pg_dump.
Usage:
pg_dump [OPTION]... [DBNAME]
General options:
-f, --file=FILENAME output file or directory name
-F, --format=c|d|t|p output file format (custom, directory, tar,
plain text (default))
-j, --jobs=NUM use this many parallel jobs to dump
-v, --verbose verbose mode
-V, --version output version information, then exit
-Z, --compress=0-9 compression level for compressed formats
--lock-wait-timeout=TIMEOUT fail after waiting TIMEOUT for a table lock
-?, --help show this help, then exit
Options controlling the output content:
-a, --data-only dump only the data, not the schema
-b, --blobs include large objects in dump
-c, --clean clean (drop) database objects before recreating
-C, --create include commands to create database in dump
-E, --encoding=ENCODING dump the data in encoding ENCODING
-n, --schema=SCHEMA dump the named schema(s) only
-N, --exclude-schema=SCHEMA do NOT dump the named schema(s)
-o, --oids include OIDs in dump
-O, --no-owner skip restoration of object ownership in
plain-text format
-s, --schema-only dump only the schema, no data
-S, --superuser=NAME superuser user name to use in plain-text format
-t, --table=TABLE dump the named table(s) only
-T, --exclude-table=TABLE do NOT dump the named table(s)
-x, --no-privileges do not dump privileges (grant/revoke)
--binary-upgrade for use by upgrade utilities only
--column-inserts dump data as INSERT commands with column names
--disable-dollar-quoting disable dollar quoting, use SQL standard quoting
--disable-triggers disable triggers during data-only restore
--exclude-table-data=TABLE do NOT dump data for the named table(s)
--inserts dump data as INSERT commands, rather than COPY
--no-security-labels do not dump security label assignments
--no-synchronized-snapshots do not use synchronized snapshots in parallel jobs
--no-tablespaces do not dump tablespace assignments
--no-unlogged-table-data do not dump unlogged table data
--quote-all-identifiers quote all identifiers, even if not key words
--section=SECTION dump named section (pre-data, data, or post-data)
--serializable-deferrable wait until the dump can run without anomalies
--use-set-session-authorization
use SET SESSION AUTHORIZATION commands instead of
ALTER OWNER commands to set ownership
Connection options:
-d, --dbname=DBNAME database to dump
-h, --host=HOSTNAME database server host or socket directory
-p, --port=PORT database server port number
-U, --username=NAME connect as specified database user
-w, --no-password never prompt for password
-W, --password force password prompt (should happen automatically)
--role=ROLENAME do SET ROLE before dump
9.1.1 Connection options
The connection options are used to specify the way the program connects to the cluster.
All the options are straightforward except for the password. Usually the PostgreSQL clients
don't accept a plain password as parameter. However it is still possible to connect without
typing the password, using the environment variable PGPASSWORD or the password file.
Using the variable PGPASSWORD is considered insecure and shouldn't be used if untrusted
users have access to the server. The password file is a text file saved in the user's home
directory as .pgpass. The file must be readable only by its owner, otherwise the client will
refuse to read it.
Each line specifies a connection in a fixed format.
hostname:port:database:username:password
The following example specifies the password for the connection to the host tardis, port
5432, database db_test and user usr_test.
tardis:5432:db_test:usr_test:testpwd
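The fixed colon-separated format can be read mechanically. Below is a minimal, hedged sketch of such a reader; real clients also honour the wildcard "*" and backslash escapes in the fields, which this toy parser ignores.

```python
# Minimal reader for the fixed .pgpass line format:
# hostname:port:database:username:password
def parse_pgpass_line(line):
    """Split one .pgpass line into its five fields (no escape handling)."""
    hostname, port, database, username, password = line.strip().split(":")
    return {"hostname": hostname, "port": int(port),
            "database": database, "username": username,
            "password": password}

entry = parse_pgpass_line("tardis:5432:db_test:usr_test:testpwd")
print(entry["password"])  # -> testpwd
```

Because the password sits in that file in clear text, the owner-only permission check mentioned above is not optional: the client refuses a world-readable .pgpass.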
The switch -F specifies the backup format and requires a second option telling pg_dump
which format to use. The allowed formats are c, d, t and p: respectively custom, directory,
tar and plain.
If the parameter is omitted pg_dump uses the plain text format, which is not compressed
and is suitable for direct load using psql.
The custom and the directory formats are the most versatile backup formats. They offer
compression and flexibility at restore time; both support the parallel and selective restore.
The directory format stores the schema dump in a toc file. Each table's content is then
saved in a compressed file inside the target directory specified with the -f switch. From
version 9.3 this format supports the parallel dump functionality.
The tar format stores the dump in the well-known tape archive format. This format is
compatible with the directory format, does not compress the data and has a limit of 8 GB
for each individual table.
The -j option specifies the number of jobs to run in parallel when dumping the data. This
feature is available from version 9.3 and uses the transaction snapshot export to keep the
dump consistent across the multiple export jobs. The switch is usable only with the directory
format and only against PostgreSQL 9.2 and later.
The option -Z specifies the compression level for the compressed formats. The default is
5 resulting in a dumped archive from 5 to 8 times smaller than the original database.
The option --lock-wait-timeout sets the number of milliseconds to wait for a table lock,
after which the dump fails. It is useful to prevent the program from waiting forever for a
table lock, but can result in failed backups if the value is too low.
The -a option sets the data-only export. Separating schema and data has some effects at
restore time, in particular on performance. We'll see in detail in 10 how to build an efficient
two-phase restore.
The -b option exports the large objects. This is the default setting, except when the -n
switch is used; in that case -b is required to export the large objects.
The options -c and -C are meaningful only for the plain output format. They respectively
add the DROP and CREATE commands before the object's DDL. For the archive formats
the same options exist in pg_restore.
The -E switch specifies the character encoding for the archive. If not set, the database's
encoding is used.
The -n switch dumps only the named schema. It's possible to specify multiple -n switches
to select several schemas, or to use wildcards. However, despite pg_dump's efforts to resolve
all the dependencies, something could be missing: there is no guarantee the resulting archive
can be successfully restored.
The -N switch does the opposite of the -n switch: it excludes the named database schema
from the backup. The switch accepts wildcards and it's possible to specify multiple schemas
with multiple -N switches. When both -n and -N are given, the behaviour is to dump just
the schemas matching at least one -n switch but no -N switch.
The -o switch dumps the object identifiers (OIDs) for every table row. This option should
be used only if the OIDs are part of the design.
The -O switch has effect only on plain text exports and skips the statements setting object
ownership.
The -s switch dumps only the database schema.
The -S switch is meaningful only for plain text exports: it specifies the superuser to use
for disabling and re-enabling the triggers when the export is performed with the option
--disable-triggers.
The -t switch dumps only the named table. It's possible to specify multiple tables using
wildcards or by giving the -t switch several times.
The -T switch skips the named table in the dump. It's possible to exclude multiple tables
using wildcards or by giving the -T switch several times.
The -x switch does not save the GRANT/REVOKE commands setting the privileges.
The switch --binary-upgrade is used only by the in-place upgrade utility pg_upgrade and
is not intended for general usage.
The switch --inserts dumps the data as INSERT commands instead of COPY. The restore
with this option is very slow because each statement is parsed, planned and executed
individually.
The switch --column-inserts exports the data as INSERT commands with all the column
names specified.
The switch --disable-dollar-quoting disables the dollar quoting for function bodies and
uses the standard SQL quoting instead.
The switch --disable-triggers saves the statements disabling the triggers before the data
load and re-enabling them after the data is restored. Disabling the triggers ensures the
foreign keys will not cause errors during the data load. This switch has effect only for the
plain text export.
The switch --exclude-table-data=TABLE skips the data dump for the named table. The
same rules as for -t and -T apply to this switch.
The switch --no-security-labels doesn't include the security label assignments in the dump
file.
The switch --no-synchronized-snapshots allows a parallel export against pre-9.2 databases.
Because the snapshot export feature is missing, the database must not change state until all
the exporting jobs are connected.
The switch --no-unlogged-table-data does not export the data of the unlogged relations.
The switch --quote-all-identifiers causes all the identifiers to be enclosed in double quotes.
The switch --section specifies one of the three export sections. The first section is pre-data,
which saves the definitions of the tables, the views and the functions. The second section is
data, which saves the tables' contents. The third section is post-data, which saves the
constraints, the indices and the eventual GRANT/REVOKE commands. This switch applies
only to the plain format.
The switch --serializable-deferrable uses a serializable transaction for the dump, to ensure
the database state is consistent. The dump execution waits for a point in the transaction
stream without anomalies, avoiding the risk of serialization failures. The option is of no use
for a backup intended only for disaster recovery; it should be used only when the dump is to
be reloaded into a read-only database which needs a consistent state compatible with the
origin database.
The switch --use-set-session-authorization sets the objects' ownership using the command
SET SESSION AUTHORIZATION instead of ALTER OWNER. SET SESSION
AUTHORIZATION requires superuser privileges whereas ALTER OWNER doesn't.
A far better approach is to save the database dump locally and then copy the entire dump
file using a remote copy program like rsync or scp.
9.2.3 Check for slow cpu cores
PostgreSQL is not multi-threaded: each backend is attached to just one CPU core. When
pg_dump starts it opens one backend on the cluster, which is used to export the database
objects; the pg_dump process receives the data output from the backend, saving it in the
chosen format. The single core's speed is therefore critical to avoid a bottleneck. The recently
introduced parallel export, implemented with the snapshot exports, can sensibly improve
pg_dump's performance.
The default settings allow just 6400 lock slots. This value is generally fine. However, if
the database has a complex schema with hundreds of relations, the backup can exhaust the
available slots and fail with an out-of-memory error. Adjusting the parameters involved in
the lock slot computation resolves the problem, but this requires a cluster restart.
From version 9.3 pg_dump also supports the parallel dump, using the feature seen in
5.8.1. The snapshot export is also supported in version 9.2, which offers this improvement
over the previous versions as well. However, the option --no-synchronized-snapshots tells
pg_dump not to issue a snapshot export. This allows a parallel backup from the versions
without the snapshot exports. In order to keep the exported data consistent, the database
should stop the write operations for the time required for all the export processes to connect.
The parallel dump is available only with the directory format. The pg_restore program,
from version 9.3, can perform a parallel restore with the directory format as well.
9.4 pg dumpall
pg_dumpall does not have all of pg_dump's options. The program basically dumps all the
cluster's databases in plain format.
However, pg_dumpall is very useful because of the switch --globals-only. With this option
pg_dumpall saves the global object definitions in plain text. This includes the tablespace
definitions and the users, which are saved with their passwords.
The following example shows the program's execution and the contents of the output file.
postgres@tardis:~/dmp$ pg_dumpall --globals-only -f main_globals.sql
postgres@tardis:~/dmp$ cat main_globals.sql
--
-- PostgreSQL database cluster dump
--
SET default_transaction_read_only = off;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
--
-- Roles
--
--
-- PostgreSQL database cluster dump complete
--
postgres@tardis:~/dmp
9.5 Backup validation
There's little advantage in having a backup if it is not valid. The corruption can happen
at various levels and, unfortunately, when the problem is detected it is too late.
Obviously this doesn't give us the certainty that the backup can be restored. It's important
then to run a periodical restore test. The strategy to use is determined by the amount of
data, the time required for the restore and the backup schedule.
For general purpose databases, whose size is measurable in hundreds of gigabytes, the
restore can complete in a few hours and a continuous test is feasible. For VLDBs, whose size
is measured in terabytes, the restore can take more than one day, in particular if there are
big indices requiring expensive sorts on disk for the build. In this case a test on a weekly
basis is more feasible.
Chapter 10
Restore
There's little advantage in saving the data if the restore is not possible. In this chapter we'll take a look at the fastest, and possibly the safest, way to restore the saved dump.
The program used for the restore is determined by the dump format. We'll first take a look at the restore using the plain format, then the custom and the directory formats. Finally we'll see how to improve the restore performance with a temporary sacrifice of the cluster's reliability.
This format has a few advantages. For example it's possible to edit the statements using a common text editor, provided of course the dump is reasonably small. Even loading a file with vim becomes a stressful experience when its size is measured in gigabytes though.
The data contents are saved using the COPY command, the choice with the best performance at restore time. It's also possible to save the data contents using inserts, but the restore is then very slow because each statement has to be parsed, planned and executed.
If the backup saves the schema and the data in two separate files, extra care is required at dump time if there are triggers and foreign keys in the database schema. The data only backup should include the switch --disable-triggers, which emits the DISABLE TRIGGER statements before the data load and the ENABLE TRIGGER statements after the data is restored.
The following example shows a dump/reload session using separate schema and data dump files.
Let's create a new database with a simple data structure: two tables storing a city and the address, and a foreign key between them enforcing the referential integrity.
postgres =# CREATE DATABASE db_addr ;
CREATE DATABASE
postgres =# \ c db_addr
You are now connected to database " db_addr " as user " postgres " .
db_addr =# CREATE TABLE t_address
(
i_id_addr serial ,
i_id_city integer NOT NULL ,
t_addr text ,
CONSTRAINT pk_id_address PRIMARY KEY ( i_id_addr )
)
;
CREATE TABLE
db_addr =# CREATE TABLE t_city
(
i_id_city serial ,
v_city character varying (255) ,
v_postcode character varying (20) ,
CONSTRAINT pk_i_id_city PRIMARY KEY ( i_id_city )
)
;
CREATE TABLE
db_addr =# ALTER TABLE t_address ADD
CONSTRAINT fk_t_city_i_id_city FOREIGN KEY ( i_id_city )
REFERENCES t_city ( i_id_city )
ON DELETE CASCADE
ON UPDATE RESTRICT ;
ALTER TABLE
i_id_city
-----------
3
(1 row )
i_id_city ,
t_addr
)
VALUES
(
3,
’4 , malvern road ’
)
RETURNING i_id_addr
;
i_id_addr
-----------
1
(1 row )
We'll now dump the schema and the data in two separate plain files. Please note we are not using the --disable-triggers switch.
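The two dumps can be taken with something like the following; the file names match those used in the restore session later on:

```
postgres@tardis:~/dmp$ pg_dump --schema-only -f db_addr.schema.sql db_addr
postgres@tardis:~/dmp$ pg_dump --data-only -f db_addr.data.sql db_addr
```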
Looking at the schema dump, it's quite obvious what it does. All the DDL statements are saved in the correct order to restore the same database structure.
The data is then saved by pg_dump in the correct order to have the referential integrity guaranteed. In our very simple example the table t_city is dumped before the table t_address. This way the data will not violate the foreign key. In a complex scenario where multiple foreign keys refer to the same table, the correct order is not guaranteed. Let's run the same dump with the option --disable-triggers.
The COPY statements in this case are enclosed by two extra statements which disable and then re-enable the triggers. The foreign keys and all the user defined triggers will not fire during the data restore, ensuring the data will be safely stored and improving the speed.
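With --disable-triggers the data file wraps each table's COPY in statements like the following; this is a sketch, the exact output can differ between versions:

```
ALTER TABLE t_address DISABLE TRIGGER ALL;

COPY t_address (i_id_addr, i_id_city, t_addr) FROM stdin;
1	3	4 , malvern road
\.

ALTER TABLE t_address ENABLE TRIGGER ALL;
```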
Let's then create a new database where we'll restore the dump, starting from the saved schema.
postgres=# CREATE DATABASE db_addr_restore ;
CREATE DATABASE
postgres=# \c db_addr_restore
You are now connected to database " db_addr_restore " as user " postgres " .
db_addr_restore=# \i db_addr . schema . sql
SET
...
SET
CREATE EXTENSION
COMMENT
SET
SET
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
ALTER SEQUENCE
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
ALTER SEQUENCE
ALTER TABLE
...
ALTER TABLE
REVOKE
REVOKE
GRANT
GRANT
db_addr_restore=# \i db_addr . data . sql
SET
...
SET
ALTER TABLE
...
ALTER TABLE
setval
--------
1
(1 row )
setval
--------
3
(1 row )
db_addr_restore=# \d
List of relations
 Schema |          Name           |   Type   |  Owner
--------+-------------------------+----------+----------
 public | t_address               | table    | postgres
 public | t_address_i_id_addr_seq | sequence | postgres
 public | t_city                  | table    | postgres
 public | t_city_i_id_city_seq    | sequence | postgres
(4 rows )
10.2 The binary formats
The three binary formats supported by pg_dump are the custom, the directory and the tar format. The first two support the selective access when restoring and the parallel execution. Those features make them the best choice for a flexible and reliable backup. Before version 9.3 the only format supporting the parallel restore was the custom format. The latest version extended this functionality to the directory format which, combined with the parallel dump, massively improves the recovery performance on big amounts of data. The tar format, with its limitations, is suitable for saving only small amounts of data.
The custom format is a binary archive with a table of contents which can address the data saved inside the archive. The directory format is composed of a toc.dat file, where the schema is stored, alongside the references to the compressed files where the tables' contents are saved. For each table there is a gzipped file mapped inside the TOC. Each file contains the commands, COPY or inserts, for reloading the data into the specific table.
The restore from the binary formats happens via the pg_restore program, which has almost the same switches as pg_dump, as seen in 9.1. This is pg_restore's help output.
Usage:
pg_restore [OPTION]... [FILE]
General options:
-d, --dbname=NAME connect to database name
-f, --file=FILENAME output file name
-F, --format=c|d|t backup file format (should be automatic)
-l, --list print summarized TOC of the archive
-v, --verbose verbose mode
-V, --version output version information, then exit
-?, --help show this help, then exit
-O, --no-owner skip restoration of object ownership
-P, --function=NAME(args) restore named function
-s, --schema-only restore only the schema, no data
-S, --superuser=NAME superuser user name to use for disabling triggers
-t, --table=NAME restore named table(s)
-T, --trigger=NAME restore named trigger
-x, --no-privileges skip restoration of access privileges (grant/revoke)
-1, --single-transaction restore as a single transaction
--disable-triggers disable triggers during data-only restore
--no-data-for-failed-tables do not restore data of tables that could not be
created
--no-security-labels do not restore security labels
--no-tablespaces do not restore tablespace assignments
--section=SECTION restore named section (pre-data, data, or post-data)
--use-set-session-authorization
use SET SESSION AUTHORIZATION commands instead of
ALTER OWNER commands to set ownership
Connection options:
-h, --host=HOSTNAME database server host or socket directory
-p, --port=PORT database server port number
-U, --username=NAME connect as specified database user
-w, --no-password never prompt for password
-W, --password force password prompt (should happen automatically)
--role=ROLENAME do SET ROLE before restore
pg_restore requires a file to process and an optional database connection. If the latter is omitted, the output is sent to the standard output. However, the switch -f specifies a file where to send the output instead of the standard output. This is very useful if we just want to check that the original dump file can be read, for example by redirecting the output to /dev/null.
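A quick validation pass could then look like this, assuming a custom format archive named db_addr.dmp:

```
postgres@tardis:~/dmp$ pg_restore db_addr.dmp -f /dev/null
```

If pg_restore exits without errors, the archive is at least readable; it says nothing about whether the data itself is sound.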
The speed of a restore from the custom or directory format, using a database connection, can be massively improved on a multi core system with the -j switch, which specifies the number of parallel jobs to run when restoring the data and the post-data sections.
As said before, PostgreSQL does not support multithreading. Therefore each parallel job will use just one CPU, working over a list of objects to restore determined when pg_restore is started.
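For example, a four-job parallel restore could be run like this; the archive and database names are the same assumptions as before:

```
postgres@tardis:~/dmp$ pg_restore -j 4 -d db_addr_restore db_addr.dmp
```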
The switch --section works the same way as for pg_dump, controlling the section of the archived data to restore. The custom and directory formats have these sections.
• pre-data This section contains the schema definitions without the keys, indices and triggers.
• data This section contains the tables' data contents.
• post-data This section contains the objects enforcing the data integrity, alongside the triggers and the indices.
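A section-wise restore then becomes a three step sequence; a sketch using the same assumed archive and target database as above:

```
postgres@tardis:~/dmp$ pg_restore --section=pre-data -d db_addr_restore db_addr.dmp
postgres@tardis:~/dmp$ pg_restore --section=data -d db_addr_restore db_addr.dmp
postgres@tardis:~/dmp$ pg_restore --section=post-data -d db_addr_restore db_addr.dmp
```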
The switch -C is used to create the target database before starting the restore. The connection also requires a generic database to connect to, in order to create the database listed in the archive.
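For example, connecting to the template1 database in order to create and restore the database stored in an assumed archive db_addr.dmp:

```
postgres@tardis:~/dmp$ pg_restore -C -d template1 db_addr.dmp
```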
The following example shows the restore of the database created in 10.1 using the custom format, first restoring the schema and then the data.
pg_dump: finding the columns and types of table "t_address"
pg_dump: finding default expressions of table "t_address"
pg_dump: finding the columns and types of table "t_city"
pg_dump: finding default expressions of table "t_city"
pg_dump: flagging inherited columns in subtables
pg_dump: reading indexes
pg_dump: reading indexes for table "t_address"
pg_dump: reading indexes for table "t_city"
pg_dump: reading constraints
pg_dump: reading foreign key constraints for table "t_address"
pg_dump: reading foreign key constraints for table "t_city"
pg_dump: reading triggers
pg_dump: reading triggers for table "t_address"
pg_dump: reading triggers for table "t_city"
pg_dump: reading rewrite rules
pg_dump: reading large objects
pg_dump: reading dependency data
pg_dump: saving encoding = UTF8
pg_dump: saving standard_conforming_strings = on
pg_dump: saving database definition
pg_dump: dumping contents of table t_address
pg_dump: dumping contents of table t_city
pg_restore: creating CONSTRAINT pk_id_address
pg_restore: creating FK CONSTRAINT fk_t_city_i_id_city
pg_restore: setting owner and privileges for DATABASE db_addr
pg_restore: setting owner and privileges for SCHEMA public
pg_restore: setting owner and privileges for COMMENT SCHEMA public
pg_restore: setting owner and privileges for ACL public
pg_restore: setting owner and privileges for EXTENSION plpgsql
pg_restore: setting owner and privileges for COMMENT EXTENSION plpgsql
pg_restore: setting owner and privileges for TABLE t_address
pg_restore: setting owner and privileges for SEQUENCE t_address_i_id_addr_seq
pg_restore: setting owner and privileges for SEQUENCE OWNED BY t_address_i_id_addr_seq
pg_restore: setting owner and privileges for TABLE t_city
pg_restore: setting owner and privileges for SEQUENCE t_city_i_id_city_seq
pg_restore: setting owner and privileges for SEQUENCE OWNED BY t_city_i_id_city_seq
pg_restore: setting owner and privileges for DEFAULT i_id_addr
pg_restore: setting owner and privileges for DEFAULT i_id_city
pg_restore: setting owner and privileges for CONSTRAINT pk_i_id_city
pg_restore: setting owner and privileges for CONSTRAINT pk_id_address
pg_restore: setting owner and privileges for FK CONSTRAINT fk_t_city_i_id_city
The dump file is specified as the last parameter. The -d switch tells pg_restore in which database to restore the archive. Because we are connecting locally with the postgres OS user, there is no need to specify the authentication parameters.
Then we are ready to load the data. We'll disable again the triggers in order to avoid potential data load failures, as seen in 10.1.
The problem with this approach is the presence of the indices when loading the data, which is a massive bottleneck. Using the --section switches instead of the separate schema and data reloads improves the restore performance.
The data section will then load the data contents as fast as possible.
pg_restore: setting owner and privileges for SEQUENCE SET t_address_i_id_addr_seq
pg_restore: setting owner and privileges for TABLE DATA t_city
pg_restore: setting owner and privileges for SEQUENCE SET t_city_i_id_city_seq
Finally, the post-data section will create the constraints and the indices over the existing data.
The bottleneck is the post-data section, which requires CPU intensive operations with random disk access. In large databases this section can require more time than the entire data section, even if the objects built by the post-data section are smaller. The parallel restore can improve the speed but sometimes it is not enough.
The postgresql.conf file can be tweaked in order to improve the restore's speed dramatically, up to 40% compared to the production configuration. This is possible because the restore configuration disables some settings used by PostgreSQL to guarantee the durability. The emergency configuration must be swapped back with the production settings as soon as the restore is complete, to avoid further data loss. What follows assumes the production database is lost and the restore is reading from a custom format backup.
10.3.2 wal_level
The parameter wal_level sets the amount of redo records to store in the WAL segments. By default the value is minimal, which is used for standalone clusters. If the cluster feeds a standby server, or there is a point in time recovery setup, the wal_level must be at least archive or hot_standby. If there is a PITR or a standby server available for the recovery, stop reading this book and act immediately: restoring from a physical backup is several times faster than a logical restore. Otherwise, if the standby or PITR snapshot is lost as well, before starting the reload the wal_level must be set to minimal to reduce the WAL generation rate.
10.3.3 fsync
Turning off fsync can improve the restore's speed. Unless there is a battery backed disk cache, turning off this parameter in production is not safe and can lead to data loss in case of power failure.
10.3.5 autovacuum
Turning off the autovacuum avoids having the tables vacuumed and analysed while they are being restored, reducing unnecessary IO.
to 10 GB. For bigger amounts of memory the reduction should be 10%. This leaves out of the estimate the memory consumed by the operating system and the other processes. From the remaining memory the shared buffer's memory should then be subtracted. Finally, the remaining value must be divided by max_connections.
Let's consider a system with 26 GB of RAM. If we set the shared buffer to 2 GB and allow 10 max connections, the maintenance_work_mem will be 2.14 GB.
26 - 10% = 23.4
23.4 - 2 = 21.4
21.4 / 10 = 2.14
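Putting the emergency settings of this chapter together, the restore-time postgresql.conf fragment could look like the following sketch; everything must be reverted to the production values as soon as the restore is complete:

```
wal_level = minimal
fsync = off
autovacuum = off
maintenance_work_mem = 2GB    # from the estimate above, 2.14 GB rounded down
```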
Chapter 11
Mastering SQL is a slow and difficult process and requires some sort of empathy with the DBMS. Asking the database administrator for advice when building any design is a good way to get a better understanding of how the database thinks. Having a few words with the DBA is a good idea in any case though.
expected. Sometimes the database itself is considered just storage, a big mistake.
The database design is complex, important and far too delicate to delegate to a dumb automatic tool. For example, using a generic abstraction layer will build access methods that will almost certainly ignore PostgreSQL's peculiar update strategy, resulting in bloat and generally poor performance.
It doesn't matter if the database is simple or the project is small. Nobody knows how successful a new idea could be. A robust design will make the project scale properly.
Trying to read such queries is a nightmare. Often more time is spent reformatting the queries than doing the performance tuning. The following guidelines are a good reference for writing decent SQL and sparing the DBA a massive headache.
Object               Prefix
Table                t
View                 v
Btree Index          idx_bt
GiST Index           idx_gst
GIN Index            idx_gin
Unique index         u_idx
Primary key          pk
Foreign key          fk
Check                chk
Unique key           uk
Type                 ty
Sql function         fn_sql
PlPgsql function     fn_plpg
PlPython function    fn_plpy
PlPerl function      fn_plpr
Trigger              trg
Rule                 rul
A similar approach can be used for the column names, making the data type immediately
recognisable.
Type                 Prefix
Character            c
Character varying    v
Integer              i
Text                 t
Bytea                by
Numeric              n
Timestamp            ts
Date                 d
Double precision     dp
Hstore               hs
Custom data type     ty
• using lowercase keywords makes it difficult to distinguish them from the identifiers
• the wildcard * masks which fields are really needed; returning all the fields consumes more bandwidth than required and prevents the index only scans
• meaningless aliases like a and b obscure the query's logic
• without proper indentation the query logic cannot be understood
Despite the existence of tools capable of prettifying such queries, their usage doesn't solve the root problem. Writing decently formatted SQL helps to create a mental map of what the query should do, removing as well the confusion when building the SQL.
The following rules should be kept in mind constantly when writing SQL.
• Specify the join conditions explicitly in order to make the query's logic clear
• Adopt meaningful aliases
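As a sketch of these rules applied to the example schema from chapter 10, a readable query could look like this; the postcode value is hypothetical:

```sql
-- explicit join, meaningful aliases, uppercase keywords, one field per line
SELECT
    cty.v_city,
    adr.t_addr
FROM t_city cty
    INNER JOIN t_address adr
        ON adr.i_id_city = cty.i_id_city
WHERE cty.v_postcode = 'WR5 2LA'  -- hypothetical postcode, for illustration only
;
```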
Despite their strange reputation, database experts are an incredible resource for building efficient designs. Nowadays it is very simple to set up a PostgreSQL cluster. Even with the default configuration the system is so efficient that under normal load it doesn't show any problem. This could look like a fantastic feature but actually it is a really bad thing: any mistake at design level is hidden, and when the problem appears it may be too late to fix it.
This final advice is probably the most important of the entire chapter. If you have a DBA, don't be shy. Ask for any suggestion, even if the solution seems obvious or if the design seems simple. The database layer is a universe full of pitfalls where a small mistake can result in a very big problem.
Of course if there's no DBA, that's bad. Never sail without a compass. Never start a database project without an expert's advice; somebody to look after the most important part of the business, the foundations.
Appendix A
The PostgreSQL version number is composed of three integers. The first number is the generational version. Currently the value is 9. This number changes when there is a substantial generational gap with the previous version. For example the version 9.0 started its life as 8.5; the change of generation was decided later.
The second number is the major version. The value starts from zero and increases by one for each release along the generation. Because each major version differs internally from the others, the data area is not compatible between them. Usually a new major version is released on a yearly basis.
The third number is the minor version. Usually a new minor release appears when a sufficient number of bug fixes are merged into the codebase. Upgrading a minor version usually requires just the binary upgrade and the cluster restart. However, it is good practice to check the release notes for any extra action required.
The PostgreSQL project aims to fully support a major release for five years. The policy is applied on a best-effort basis.
Appendix B
PostgreSQL 9.4
The new PostgreSQL major version, 9.4, was released on the 18th of December 2014. Alongside the new developer oriented features, this release introduces several enhancements making the DBA's life easier.
B.4 Replication Slots
With the replication slots the master becomes aware of the slaves' replication status. A master with replication slots allocated does not remove the WAL segments until they have been received by all the standbys. Also, the master does not remove the rows which could cause a recovery conflict, even when the standby is disconnected.
B.6 pg_prewarm
This additional module loads the relation's data into the shared buffer after a restart. This allows the cluster to reach its full efficiency quickly.
Appendix C
Contacts
• Email: [email protected]
• Twitter: @4thdoctor_scarf
• Blog: http://www.pgdba.co.uk
List of Figures
List of Tables
5.1 pg_hba.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 SQL Transaction isolation levels . . . . . . . . . . . . . . . . . . . . . . . . . 47
Index
Item pointers, 57
kernel resources, 27
log shipping, 14
log switch, 13
Logged tables, 38
Logical layout, 34
Maintenance, 70
maintenance work memory, 30
Marc G. Fournier, 12
Multi Version Concurrency Control, 46
multixact ID, 76
multixact ID, wraparound failure, 76
MVCC, 46, 55, 66
non repeatable read, 46
Page checksum, 59
Page version, 59
partitioning, 14
password file, 89
pg_class, 56
pg_controldata, 21
pg_createcluster, 22
pg_ctl, 20
pg_ctlcluster, 23
pg_dropcluster, 23
pg_dump, 20, 87
pg_dump, connection options, 89
pg_dump, general options, 89
pg_dump, internals, 93
pg_dump, output formats, 89
pg_dump, output options, 90
pg_dump, parallel export, 89
pg_dump, performance tips, 92
pg_dumpall, 94
pg_dumpall, global objects, 94
pg_lscluster, 23
pg_resetxlog, 21
pg_restore, 21
pg_statistics, table, 77
pg_stats, view, 78
pg_tablespace, 64
pgdg apt repository, 16
PGPASSWORD, 89
phantom read, 46
Physical layout, 55
pitr, 14
platform, 15
point in time recovery, 14
postgres, 20
postgres, single user mode, 68
psql, 20
reindexdb, 22
serial, 50
shared buffer, 30
shared buffers, 30
shutdown sequence, 28
standby server, 14
streaming replication, 14
t_cid, 60
t_xmax, 60
t_xmin, 60
t_xvac, 60
Table inheritance, 39
Tables, 38
tablespace, 44
tablespaces, logical, 44
tablespaces, physical, 63
template database, 37
template0 database, 37
template1 database, 37
Temporary tables, 39
Thomas Lockhart, 12
TOAST, 61
TOAST, out of line storage, 62
TOAST, storage strategies, 62
trace_sort, 80
transactions, 46
transactions, isolation levels, 46
transactions, snapshot export, 47
Tuples, 59
Unlogged tables, 38
VACUUM, 22, 67, 70
VACUUM FREEZE, 75
VACUUM FULL, 82
VACUUM, cost based delay, 76
vacuumdb, 22
vacuumlo, 22
Vadim Mikheev, 12
varlena, 61
view, materialised, 44
view, updatable, 44
views, 42
Visibility map, 56
wal, 13
wal files, 31
wal_writer_delay, 29
warm standby, 14
work memory, 30
write ahead log, 13
XID, 46
XID wraparound failure, 20, 66, 70
XID wraparound failure, fix, 68