No. The two key spaces do not overlap. They are distinguished by the objsubid value in pg_locks: objsubid=1 for single bigint keys, objsubid=2 for two-integer key pairs.
PostgreSQL FAQ & Answers
6000 expert PostgreSQL answers researched from official documentation. Every answer cites authoritative sources you can verify.
Functions - System Administration
80 questions
The storage size in bytes, including compression if applied. It returns the actual storage size, not the logical size.
Only when the built-in log collector is running. It returns false if the log collector is not active.
pg_try_advisory_lock() is non-blocking and returns immediately with true/false. pg_advisory_lock() blocks and waits if necessary until the lock is acquired.
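A minimal illustration (the key value 12345 is arbitrary):

    SELECT pg_try_advisory_lock(12345);  -- returns immediately: true if acquired, false if not
    SELECT pg_advisory_lock(12345);      -- blocks until the lock becomes available, then returns void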
It forces a snapshot to be logged immediately, which is useful for logical decoding on standby servers to avoid waiting for the background writer.
18 bytes, regardless of the actual size of the represented value.
Yes. Superusers can cancel/terminate any backend. Roles with pg_signal_backend can cancel/terminate non-superuser backends. Roles with pg_signal_autovacuum_worker can cancel/terminate autovacuum workers.
PostgreSQL 9.6. This was the first predefined role available in PostgreSQL.
relid (regclass), parentrelid (regclass), isleaf (boolean), and level (integer).
No. Session-level advisory locks do not honor transaction semantics. A lock acquired during a transaction that is later rolled back will still be held after the rollback.
NULL. It only returns a value for partitions and partitioned tables.
Bytes. It calculates the difference in bytes between two LSN values (lsn1 - lsn2).
No. It will not move the slot's position backward or beyond the current WAL insert location.
Yes. pg_advisory_unlock_all() is implicitly invoked at session end, releasing all session-level advisory locks.
Superusers and members of the pg_monitor role by default. Access can be granted to others via GRANT.
It returns boolean. The return value only indicates whether the signal was successfully sent, NOT whether the configuration was successfully reloaded. Check pg_file_settings and pg_hba_file_rules views for configuration errors.
No. These functions cannot be executed during promotion to primary.
PostgreSQL 15. pg_stop_backup() was introduced in PostgreSQL 8.0 and replaced by pg_backup_stop() in version 15.
Outside the live data directory. They must be written as part of the backup archive but not within the PostgreSQL data directory itself.
CREATE privilege on the tablespace OR membership in the pg_read_all_stats role. Default tablespaces don't require special permissions.
Level 0. The input table has level 0, and child partitions have incrementing levels.
pg_total_relation_size() = pg_table_size() + pg_indexes_size(). The total includes both table and index storage.
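For example, the two sides of the identity can be compared directly (the table name is hypothetical):

    SELECT pg_total_relation_size('my_table')                           AS total,
           pg_table_size('my_table') + pg_indexes_size('my_table')      AS table_plus_indexes;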
SIGHUP. This signal causes the postmaster and all server processes to reload their configuration files.
It returns 0 when: (1) the fastupdate storage parameter is disabled for the GIN index, or (2) the pending list is already empty (already cleaned by VACUUM or previous cleanup).
Recovery will stop at the first matching name encountered. Using duplicate names can cause recovery to stop at an unintended point.
Normally 2 kB (2048 bytes). TOAST is triggered when a row value exceeds this threshold.
Only information functions. Most recovery control functions like pg_wal_replay_pause() can only execute during recovery. Information functions like pg_is_in_recovery() work in any mode.
Multiple locks on the same resource require multiple unlocks. If you lock the same advisory lock multiple times, you must unlock it the same number of times.
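A sketch of the behavior, using an arbitrary key:

    SELECT pg_advisory_lock(99);    -- acquire
    SELECT pg_advisory_lock(99);    -- acquire again; the lock is now held twice
    SELECT pg_advisory_unlock(99);  -- still held once
    SELECT pg_advisory_unlock(99);  -- now fully released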
true. By default, pg_promote() waits for the promotion to complete before returning.
'main'. When only one argument is provided, pg_relation_size() returns the size of the main data fork.
The entire file is returned. Both parameters are optional, and omitting them reads the complete file.
No. It only adds new collations; it does not remove obsolete ones.
No. Attempting to copy an invalidated slot will fail with an error.
true. By default, pg_backup_stop() waits for WAL to be archived when archiving is enabled.
It fails with an error. Once a transaction has exported any snapshots, it cannot be prepared with PREPARE TRANSACTION. The error message is: 'ERROR: cannot PREPARE a transaction that has exported snapshots'.
NULL. It returns NULL rather than raising an error when the origin doesn't exist.
At least 25%. PGLZ will only compress if it can achieve at least a 25% reduction in size.
It returns the current LSN but does nothing. No new WAL file is created if there's been no WAL activity.
No. It returns bytea without encoding validation, unlike pg_read_file() which validates encoding.
Yes. They return NULL if the server was never in recovery or if no WAL has been received/replayed yet.
It returns NULL if the value is un-TOASTed (not stored in the TOAST table) or not stored on disk.
A negative offset is relative to the end of the file. The function uses SEEK_END positioning, allowing you to read from the end of the file backwards.
To the server log only, at LOG level. The output never appears in the client application; it only appears in the server's log files.
Yes. File access functions bypass standard database privilege checks, which is why they require superuser permissions by default.
false. When is_local is false, the setting applies to the entire session. When true, it applies only to the current transaction.
name (text), size (bigint), and modification (timestamp with time zone).
A record with three columns: lsn (pg_lsn), labelfile (text), and spcmapfile (text).
An integer representing the count of new collations created. It skips collations that already exist.
READ COMMITTED isolation level only. Multiple exports are only allowed in READ COMMITTED transactions.
pg_lsn (Log Sequence Number). This represents the backup's starting WAL location.
There is no difference in how the limit is applied. Both functions check the limit after each transaction, so the actual number of rows returned may exceed upto_nchanges.
No. Files and directories with names beginning with a dot are excluded from the results (except '.' and '..' when include_dot_dirs is true in pg_ls_dir).
The timestamp when the commit or abort WAL record was generated on the primary server. This is the transaction time on the primary, not when it was replayed on the standby.
pglz (historical, integrated in all versions) and lz4 (since PostgreSQL 14).
The four valid fork values are: 'main' (the main data fork), 'fsm' (Free Space Map), 'vm' (Visibility Map), and 'init' (initialization fork for unlogged tables/indexes).
NULL. When missing_ok is supplied and is true, current_setting() returns NULL instead of throwing an error.
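For example (the setting name is deliberately nonexistent):

    SELECT current_setting('no_such.setting', true);   -- returns NULL
    SELECT current_setting('no_such.setting');         -- raises an error instead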
Temporary synced slots cannot be used for logical decoding. They must be dropped after promoting the standby to primary.
It returns false and issues a SQL warning. The function does not raise an error.
The valid units are: 'bytes', 'B', 'kB', 'MB', 'GB', 'TB', and 'PB'.
When true, it includes '.' and '..' in the results. By default (false), these directories are excluded.
When true, it enables decoding of prepared transactions in two-phase commit. When false (default), prepared transactions are not decoded.
It displays them as the next-lowest unit. For example, 9216 bytes is displayed as '9216 bytes' instead of converting to kB.
CONNECT privilege on the database OR membership in the pg_read_all_stats role.
None. LZ4 only requires that the compressed data be no larger than uncompressed data (no size increase).
No, except for certain information functions. Most backup control functions like pg_backup_start() cannot execute during recovery.
A bigint representing the number of pages removed from the pending list.
No. Exported snapshots are only available for import until the exporting transaction ends.
pg_logical_slot_get_changes() consumes the changes (advances the slot position), making them unavailable for future reads. pg_logical_slot_peek_changes() returns changes without consuming them, allowing the same changes to be retrieved again.
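A sketch, assuming a logical replication slot named 'my_slot' already exists:

    -- peek: the same changes remain available for a later call
    SELECT * FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL);
    -- get: returns the changes and advances the slot, so they cannot be read again
    SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, NULL);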
PLAIN (no compression or out-of-line storage), EXTENDED (allows both compression and out-of-line storage), EXTERNAL (allows out-of-line storage but not compression), and MAIN (allows compression but not out-of-line storage).
No. The flush parameter has no effect when transactional is true.
Monitoring > Database Statistics
66 questions
The number of live table rows fetched by simple index scans using this index
No, log_statement_stats cannot be enabled simultaneously with any of the per-module options
The number of queries canceled due to conflicts with recovery on standby servers (primarily relevant on replicas)
No, SQL-language functions that are simple enough to be inlined will not be tracked, regardless of the track_functions setting
No, it can be changed without restarting (does not require restart)
The backend is not waiting on anything and is actively executing its query
temp_files tracks the number of temporary files created by queries, and temp_bytes tracks the total amount of data written to temporary files (in bytes). These are cumulative counters over the database lifetime
Total time spent executing the function in milliseconds, including time spent in other functions that it calls
The number of live table rows fetched by index scans (both simple and bitmap)
client backend, autovacuum launcher, autovacuum worker, background writer, checkpointer, walwriter, archiver, logical replication launcher, and background worker
The timestamp when statistics were last reset for that object or database
The external file might grow unmanageably large, and pg_stat_statements may choose to discard query texts
That backends are writing buffers themselves, suggesting a need for more shared_buffers or more aggressive background writer configuration
sent_lsn (WAL sent to replica), write_lsn (WAL written to OS on replica), flush_lsn (WAL flushed to disk on replica), and replay_lsn (WAL replayed and visible to queries on replica)
When the update does not modify any columns referenced by the table's indexes (excluding summarizing indexes) and there is sufficient free space on the page containing the old row
pg_stat_io provides detailed I/O statistics by backend type, I/O object, and I/O context, including timing for reads, writes, extends, and fsyncs
Activity, BufferPin, Client, Extension, IO, IPC, Lock, LWLock, and Timeout
WAL that has been flushed to permanent storage on the replica but not yet replayed (e.g., due to replication conflicts). When replayed, these changes will become visible
PostgreSQL 17. Before version 17, you had to call pg_stat_reset_shared() individually for each statistics type
none, pl (track procedural-language functions only), or all (track SQL and C language functions too)
The number of disk blocks found in the buffer cache (shared_buffers), avoiding the need to read from disk
tup_returned will increase by one million (rows scanned), but tup_fetched will increase by only one (the final result row)
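These counters can be inspected per database, for example:

    SELECT tup_returned, tup_fetched
    FROM pg_stat_database
    WHERE datname = current_database();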
Only during a clean shutdown. Statistics are retained across server restarts when the server shuts down cleanly
Because idx_tup_fetch can be less if dead or not-yet-committed rows are fetched, if heap fetches are avoided via index-only scans, or if bitmap scans are used (which don't affect idx_tup_fetch)
track_io_timing (for relation I/O timing) or track_wal_io_timing (for WAL I/O timing)
It allows modules like pg_stat_statements to automatically enable query identifier computation
The number of rows modified (inserted, updated, or deleted) since the table was last analyzed
Up to 1% in testing scenarios, though it can cause significant overhead on some platforms with high-cost timing calls
analyze_count (manual analyze count) and autoanalyze_count (autoanalyze count)
checkpoints_timed counts scheduled checkpoints, checkpoints_req counts requested checkpoints. Ideally checkpoints_timed should be much higher and checkpoints_req near zero
The cumulative number of transactions that have been committed in this database
Prior to PostgreSQL 10, pg_stat_activity only contained information about normal backend processes (client backends) serving end users
No, it can be changed without restarting (does not require restart)
No, it can be changed without restarting (does not require restart)
Superusers and roles with privileges of the built-in role pg_read_all_stats
They update continuously throughout the transaction and show statistics not yet flushed to shared memory
No, it can be changed without restarting (does not require restart)
Because autovacuum uses insert/update/delete counters to determine when to visit a table, and resetting them delays necessary table maintenance
Just before going idle, but not more frequently than once per PGSTAT_MIN_INTERVAL milliseconds (default 1 second)
1 second (1000 milliseconds) unless altered while building the server
Yes, they are independent. A backend in 'active' state may or may not be waiting on some event
Triggers
66 questions
SECURITY DEFINER functions that don't properly control search_path can lead to privilege escalation attacks, where malicious users exploit them through custom operators or types to execute code with elevated privileges.
Use ALTER TABLE (not ALTER TRIGGER) with commands like: ALTER TABLE table_name DISABLE TRIGGER trigger_name or ALTER TABLE table_name ENABLE TRIGGER trigger_name.
pg_trigger_depth() returns the current nesting level of triggers. If not invoked from inside a trigger, directly or indirectly, it returns zero.
The maximum is 100 arguments, as defined by FUNC_MAX_ARGS, which is a compile-time constant that applies to all PostgreSQL functions including trigger arguments.
No, BEFORE ROW triggers on INSERT cannot change which partition is the final destination for a new row.
No, TRUNCATE triggers cannot be specified with BEFORE or AFTER on views. They are permitted only on tables and foreign tables.
tgenabled is of type char and controls firing based on session replication role. It can have values: 'O' (origin/local), 'D' (disabled), 'R' (replica), or 'A' (always).
Row-level BEFORE triggers are much cheaper than AFTER triggers when updating many rows, because BEFORE triggers fire immediately while AFTER triggers fire at the end of the statement, requiring PostgreSQL to temporarily store row state information.
Use a WHEN clause with pg_trigger_depth() = 0 or pg_trigger_depth() < 1 to ensure the trigger only fires for top-level operations.
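A sketch of the pattern (trigger, table, and function names are hypothetical):

    CREATE TRIGGER audit_top_level_only
        AFTER UPDATE ON accounts
        FOR EACH ROW
        WHEN (pg_trigger_depth() < 1)
        EXECUTE FUNCTION record_audit_row();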
Triggers defined for the same event on the same relation fire in alphabetical order by trigger name.
Yes, the syntax is: ALTER TRIGGER name ON table_name RENAME TO new_name.
When a row-level AFTER trigger is fired, all data changes made by the command are already complete, and are visible to the invoked trigger function.
No, the data change causing the trigger to fire is not visible to SQL commands executed in a row-level BEFORE trigger, because it has not yet occurred.
The REFERENCING clause is only allowed for AFTER triggers on plain tables (not foreign tables or constraint triggers).
BEFORE row-level triggers on partitioned tables were enabled in PostgreSQL 13. Prior versions required creating BEFORE triggers on each partition individually.
CASCADE automatically drops objects that depend on the trigger, and in turn all objects that depend on those objects.
The return value is ignored for row-level AFTER triggers, so they may as well return NULL.
In REPEATABLE READ, a trigger sees the database as of the beginning of the transaction (the start of the first non-transaction-control statement), not as of the current statement.
No, TRUNCATE triggers can only be statement-level triggers, never row-level.
RESTRICT refuses to drop the trigger if any objects depend on it. This is the default behavior.
TG_OP is of type 'text' and specifies the operation: INSERT, UPDATE, DELETE, or TRUNCATE.
The trigger will only fire if at least one of the listed columns is mentioned as a target of the UPDATE command, even if the value doesn't actually change.
Statement-level BEFORE triggers execute before the statement starts to do anything.
If the table that the trigger is on is a partitioned table, then corresponding clone triggers in the partitions are renamed too.
A trigger function must be declared as a function with no arguments and a return type of 'trigger' (for data change triggers) or 'event_trigger' (for database event triggers).
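A minimal sketch of such a declaration (names are hypothetical):

    CREATE FUNCTION touch_updated_at() RETURNS trigger AS $$
    BEGIN
        NEW.updated_at := now();   -- adjust the row before it is stored
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;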
Creating a row-level trigger on a partitioned table will cause an identical 'clone' trigger to be created on each of its existing partitions, and any partitions created or attached later will automatically have an identical trigger.
Triggers configured as ENABLE ALWAYS will fire regardless of the current replication role (session_replication_role setting).
Statement-level triggers execute once per SQL statement regardless of rows affected, while row-level triggers execute for every row. Statement-level triggers with transition tables can show 7x performance improvement over row-by-row processing.
The execution of an AFTER trigger can be deferred to the end of the transaction, rather than the end of the statement, if it was defined as a constraint trigger.
It disables only user-defined triggers, excluding internally generated constraint triggers such as those used to implement foreign key constraints or deferrable uniqueness and exclusion constraints.
Disabling or enabling internally generated constraint triggers requires superuser privileges.
Row-level BEFORE triggers fire immediately before a particular row is operated on.
Constraint triggers must be AFTER ROW triggers on plain tables (not foreign tables).
For typical OLTP transactions with simple triggers, the overhead is minimal. Tests showed 1.173 ms vs 1.178 ms average transaction latency (0.4% difference) for transactions with 3 simple data checking/filling triggers.
INSTEAD OF triggers fire immediately as each row in the view is identified as needing to be operated on.
In DELETE triggers, only OLD is available (the row being deleted). In INSERT triggers, only NEW is available; OLD is null.
For constraint triggers, evaluation of the WHEN condition is not deferred, but occurs immediately after the row update occurs.
NEW TABLE may only be specified once, and only for a trigger that can fire on UPDATE or INSERT.
tgnewtable is of type name and stores the REFERENCING clause name for NEW TABLE (transition table).
In UPDATE triggers, both OLD (the row before update) and NEW (the row after update) are available. In INSERT triggers, only NEW is available; OLD is null.
Yes, it is possible for a column's value to change even when the trigger is not fired, because changes made to the row's contents by BEFORE UPDATE triggers are not considered when determining if the column was targeted.
There is no functional difference. The keywords FUNCTION and PROCEDURE are equivalent in CREATE TRIGGER syntax, but the use of PROCEDURE is historical and deprecated. Modern code should use EXECUTE FUNCTION.
Row-level AFTER triggers fire at the end of the statement (but before any statement-level AFTER triggers).
OLD TABLE may only be specified once, and only for a trigger that can fire on UPDATE or DELETE.
Returning NULL from a row-level BEFORE trigger signals the trigger manager to skip the rest of the operation for that row (i.e., subsequent triggers are not fired, and the INSERT/UPDATE/DELETE does not occur for this row).
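For example, a BEFORE DELETE trigger could silently skip certain rows (table and column names are hypothetical):

    CREATE FUNCTION keep_archived_rows() RETURNS trigger AS $$
    BEGIN
        IF OLD.archived THEN
            RETURN NULL;   -- skip the DELETE for this row
        END IF;
        RETURN OLD;        -- proceed normally
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER protect_archived
        BEFORE DELETE ON documents
        FOR EACH ROW EXECUTE FUNCTION keep_archived_rows();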
It disables all triggers on the table, including both user-defined and internally generated constraint triggers.
The referenced object must in any case be a function, not a procedure, regardless of whether EXECUTE FUNCTION or EXECUTE PROCEDURE is used.
No, a BEFORE trigger's WHEN condition is not allowed to examine the system columns of the NEW row (such as ctid), because those won't have been set yet.
DROP TRIGGER [ IF EXISTS ] name ON table_name [ CASCADE | RESTRICT ]. RESTRICT is the default.
By default, trigger functions execute with the privileges of the user whose statement fired the trigger (ordinary SECURITY INVOKER behavior). Only if the trigger function is declared SECURITY DEFINER does it run with the privileges of its owner rather than the invoking user.
No, row-level BEFORE and AFTER triggers may only be defined on tables and foreign tables, not views.
tgnargs is of type int2 (smallint) and contains the count of argument strings for the trigger function.
Trigger support on foreign tables was introduced in PostgreSQL 9.4.
BEFORE and AFTER triggers on a view must be marked as FOR EACH STATEMENT (not FOR EACH ROW).
Statement-level AFTER triggers fire at the very end of the statement.
Yes, triggers are executed as part of the same transaction as the statement that triggered them, so they inherit the transaction's isolation level.
If a view is automatically updatable, the action is handled by automatically rewriting the statement into an action on the view's base table, so the base table's statement-level triggers are the ones that are fired (not triggers on the view itself).
No, INSTEAD OF triggers may only be defined on views, and only at row level.
None of the changes made by a statement are visible to statement-level BEFORE triggers, because the statement has not yet started.
Foreign tables support row-level BEFORE and AFTER triggers, and statement-level triggers. They do not support INSTEAD OF triggers, constraint triggers, TRUNCATE triggers, or the REFERENCING clause.
When a partition is detached from its parent, its clone triggers are removed.
Yes, SQL commands executed in a row-level BEFORE trigger see the effects of data changes for rows previously processed in the same command.
Functions - System Information
63 questions
It returns the text of the currently executing query, as submitted by the client (may contain more than one statement).
It returns a string describing the PostgreSQL server's version, including version number, OS information, compiler information, and architecture (32-bit or 64-bit).
No. This parameter can only be set at server start, so changing it requires a PostgreSQL restart.
It returns the authentication method and identity presented during authentication.
It returns true if the given string is valid input for the specified data type, false otherwise.
It returns NULL when the collation of an expression is undefined or when there are mixed collations in the expression.
Yes. Multiple privilege types can be listed separated by commas (added in PostgreSQL 8.4).
It returns the name of the sequence associated with a column (or NULL if no sequence is associated).
When column is 0, it returns the full index definition (all columns) rather than information about a specific column position.
The track_activity_query_size parameter controls this. It has a minimum of 100 bytes, maximum of 1048576 bytes (1MB), and a default of 1024 bytes.
No. current_user has special syntactic status in SQL and must be called without trailing parentheses.
With NULL typemod, it outputs a 'prettier' representation (e.g., 'character' for bpchar). With typemod -1, it preserves the base type name (e.g., 'bpchar' which would be interpreted as character(1) by the parser).
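For example, per the behavior described above:

    SELECT format_type('bpchar'::regtype, NULL);  -- character
    SELECT format_type('bpchar'::regtype, -1);    -- bpchar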
It is a synonym for current_user (PostgreSQL unifies users and roles into a single kind of entity).
It returns a record with columns: pg_control_version, catalog_version_no, system_identifier, pg_control_last_modified.
It accepts an object OID and optionally a catalog name. The two-parameter form is preferred; the one-parameter form is deprecated.
The return type is record with columns: message (text), detail (text), hint (text), sql_error_code (text).
When include_implicit is TRUE, it includes implicitly searched system schemas such as pg_catalog. When FALSE, it only returns schemas explicitly defined in the current search path.
session_user is the user who initiated the database connection (the 'real user'). current_user is the user identifier applicable for permission checking (the 'effective user'). Normally they are equal, but current_user can be changed with SET ROLE or during execution of SECURITY DEFINER functions.
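A short sketch (the role name is hypothetical and must already exist):

    SELECT session_user, current_user;   -- normally identical
    SET ROLE reporting_role;
    SELECT session_user, current_user;   -- session_user unchanged, current_user is now reporting_role
    RESET ROLE;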
It returns the OID of the data type of the value passed to it, displayed as a type name (regtype format).
It accepts an optional text parameter specifying the log format.
The function returns 'in progress', 'committed', or 'aborted' to report the commit status of a transaction.
It returns the time when the server configuration files were last loaded.
It returns a record with columns including: checkpoint_lsn, redo_lsn, redo_wal_file, timeline_id, prev_timeline_id, full_page_writes, next_xid, next_oid, next_multixact_id, next_multi_offset, oldest_xid, oldest_xid_dbid, oldest_active_xid, oldest_multi_xid, oldest_multi_dbid, oldest_commit_ts_xid, newest_commit_ts_xid, checkpoint_time.
The return type is setof record with columns: word (text), catcode ("char"), barelabel (boolean), catdesc (text), baredesc (text).
It returns the file system path that the specified tablespace is located in.
It returns a string representing the version of Unicode used by PostgreSQL.
typemod -1 indicates that no type modifier is specified or known. format_type() will output the type name in a way that the parser will interpret as having typemod -1.
xid is 32 bits wide and wraps around every 4 billion transactions. xid8 is 64-bit and does not wrap around during the life of an installation.
It returns the set of names of asynchronous notification channels that the current session is listening to.
Yes. WITH GRANT OPTION can be added to a privilege type to test whether the privilege is held with grant option.
The required parameter is include_implicit and it is of type boolean.
Yes. In PostgreSQL, parentheses can optionally be used with current_schema, unlike other special session information functions.
It returns a current snapshot showing which transaction IDs are currently in-progress.
Queries and SELECT
62 questions
NULLS LAST is the default for ascending order, meaning null values sort after all non-null values.
OFFSET 0 is the same as omitting the OFFSET clause entirely. OFFSET with a NULL argument also behaves the same way.
IN returns true if any equal subquery row is found. It is equivalent to = ANY(subquery).
The USING clause suppresses duplicate columns - each joined column appears only once in the output table, unlike ON which includes all columns from both tables.
A large OFFSET might be inefficient because the rows skipped by OFFSET still have to be computed inside the server before being discarded.
When ORDER BY is omitted, the default frame effectively consists of all rows in the partition: the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, and without ORDER BY every row is a peer of the current row, so the frame extends through the end of the partition.
NATURAL JOIN creates an implicit join based on all columns with the same name in both tables, automatically matching and combining rows. It's equivalent to a USING clause listing all common column names.
UNION eliminates duplicate rows by default, in the same way as DISTINCT, unless UNION ALL is used.
WITH TIES only adds rows if all columns in the ORDER BY clause are identical for the tie. The tie-breaking logic depends on which columns appear in ORDER BY.
SYSTEM sampling is much more efficient than BERNOULLI because it samples entire blocks rather than individual rows, though BERNOULLI provides better random distribution.
For UNION/INTERSECT/EXCEPT results, only output column names or ordinal numbers are allowed in ORDER BY, not arbitrary expressions.
AS is optional for column aliases, but becomes required when the alias name matches any PostgreSQL keyword (reserved or not). Best practice is to always use AS to avoid future conflicts.
CUBE(e1, e2, ...) represents the given list and all of its possible subsets (i.e., the power set), generating all possible combinations.
Yes, CTEs can contain SELECT, TABLE, VALUES, INSERT, UPDATE, DELETE, or MERGE statements. Data-modifying statements must use the RETURNING clause to create queryable output.
Yes, a WITH query can be referenced multiple times in the FROM list. By default, if referenced more than once, it will be materialized (computed once) unless NOT MATERIALIZED is specified.
LIMIT ALL is the same as omitting the LIMIT clause entirely, returning all matching rows. LIMIT with a NULL argument also behaves the same way.
SELECT privilege is required on each column used in the query. For FOR UPDATE/FOR NO KEY UPDATE/FOR SHARE/FOR KEY SHARE, UPDATE privilege is also needed on the columns.
A recursive CTE must have the form: non-recursive term UNION [ALL] recursive term. The system evaluates iteratively, repeating the recursive term until it produces no output.
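A minimal self-contained example that sums 1 through 5:

    WITH RECURSIVE t(n) AS (
        VALUES (1)                        -- non-recursive term
      UNION ALL
        SELECT n + 1 FROM t WHERE n < 5   -- recursive term
    )
    SELECT sum(n) FROM t;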
If table1 has x rows and table2 has y rows, the CROSS JOIN result set will have x * y rows (Cartesian product).
GROUPING SETS returns the same result as applying UNION ALL on multiple queries with different GROUP BY clauses, but PostgreSQL optimizes it to scan the table fewer times rather than multiple separate scans.
With SKIP LOCKED, any selected rows that cannot be immediately locked are skipped, allowing the query to return only the unlocked rows.
PostgreSQL recognizes functional dependencies when grouping by a primary key. When you GROUP BY a primary key column, other columns from the same table can be selected without being in GROUP BY.
All parenthesized lists must have the same number of elements (columns), and corresponding entries in each list must have compatible data types.
UNLOGGED creates a table where data is not written to the write-ahead log, making it considerably faster but not crash-safe. The table is automatically truncated after a crash or unclean shutdown.
WITHIN GROUP and ordered-set aggregates were introduced in PostgreSQL 9.4.
WITH TIES returns additional rows that tie for the last position according to the ORDER BY clause. This requires ORDER BY to be specified, and may return more rows than the specified count.
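For example (table and column names are hypothetical):

    SELECT * FROM scores
    ORDER BY points DESC
    FETCH FIRST 3 ROWS WITH TIES;   -- may return more than 3 rows if third place is tied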
UNION ALL is faster than UNION because it does not perform the additional step of eliminating duplicates (sorting and comparing rows).
Yes, the FILTER clause works for any aggregate function including ordered-set functions and array_agg.
NULLS FIRST is the default for descending order, meaning null values sort before all non-null values.
SELECT * expands to all columns from all tables in the FROM clause. A table-qualified form like table_name.* expands to all columns from that specific table only.
The presence of HAVING creates a grouped query even without GROUP BY, treating the entire table as a single group. The result is either a single row (if HAVING is true) or no rows.
LATERAL allows a subquery to reference columns of FROM items that appear before it in the FROM list. Without LATERAL, each subquery is evaluated independently.
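A sketch of the pattern, returning the three highest-paid employees per department (tables are hypothetical):

    SELECT d.name, top.emp_name, top.salary
    FROM departments d
    CROSS JOIN LATERAL (
        SELECT e.name AS emp_name, e.salary
        FROM employees e
        WHERE e.dept_id = d.id            -- references the outer FROM item
        ORDER BY e.salary DESC
        LIMIT 3
    ) AS top;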
REPEATABLE with a seed parameter allows you to define a random seed for the sampling process. Given the same seed, PostgreSQL normally returns the same result set.
The common convention is to write all EXISTS tests in the form EXISTS(SELECT 1 WHERE ...).
No, all data-modifying statements in WITH execute concurrently and share the same snapshot. They do not see each other's effects on the target tables.
Yes, you can reference output columns by their ordinal position (e.g., ORDER BY 1, 2). This works with output column names and arbitrary expressions as well.
The two built-in sampling methods are BERNOULLI (tuple-level sampling where each row has equal probability) and SYSTEM (block-level sampling where entire pages are selected).
By default with ORDER BY, the frame consists of all rows from the start of the partition up through the current row, plus any following rows equal to the current row according to ORDER BY (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
If there are no common column names, NATURAL JOIN behaves like CROSS JOIN, producing a Cartesian product.
No, FOR UPDATE/SHARE locking clauses cannot be used with GROUP BY, HAVING, DISTINCT, UNION/INTERSECT/EXCEPT, or aggregate functions.
No, unlike other join clauses such as LEFT JOIN or INNER JOIN, the CROSS JOIN clause does not have a join predicate.
If ORDER BY is not given, the rows are returned in whatever order the system finds fastest to produce. This order is unpredictable and not guaranteed to be consistent across queries.
No, PostgreSQL does not have a server-level configuration parameter that automatically limits rows returned by SELECT. The LIMIT clause must be explicitly specified in each query.
By default, if a WITH query is non-recursive and side-effect-free (SELECT with no volatile functions), it can be folded into the parent query when referenced just once, allowing joint optimization.
NOT IN returns true if only unequal subquery rows are found, including the case where the subquery returns no rows.
FETCH FIRST defaults to 1 row if the count is omitted (equivalent to FETCH FIRST 1 ROW ONLY).
Yes, SOME is a synonym for ANY. They are completely interchangeable.
SELECT ALL is the default, which returns all rows including duplicates. To remove duplicates, you must explicitly use SELECT DISTINCT.
ALL returns true when the subquery returns no rows (vacuous truth - all zero comparisons succeeded).
Without ORDER BY, you will get an unpredictable subset of the query's rows. Using different LIMIT/OFFSET values will give inconsistent results unless you enforce a predictable result ordering with ORDER BY.
EXISTS returns true if the subquery returns at least one row, false if it returns no rows.
FILTER removes rows only from the input of the particular aggregate function it is attached to, similar to WHERE but applying only to that specific aggregate.
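For example (table and column names are hypothetical):

    SELECT count(*)                                 AS total_events,
           count(*) FILTER (WHERE level = 'error')  AS error_events
    FROM events;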
Yes, because VALUES is treated like a SELECT by the grammar, it is possible to use ORDER BY, LIMIT, and OFFSET clauses with a VALUES command.
CREATE TABLE AS is the recommended syntax. SELECT INTO is not available in ECPG or PL/pgSQL, and CREATE TABLE AS offers a superset of functionality.
NOT MATERIALIZED forces the WITH query to be merged into the parent query, enabling joint optimization such as index usage and predicate pushdown, but risks duplicate computation if referenced multiple times.
The subquery is generally only executed long enough to determine whether at least one row is returned, not all the way to completion.
TEMPORARY (or TEMP) creates a temporary table that is automatically dropped at the end of the session.
PostgreSQL automatically assigns the names column1, column2, column3, etc. to result columns of VALUES.
Without ORDER BY, the 'first row' of each group becomes unpredictable because the table stores rows in an unspecified order.
GLOBAL or LOCAL can be written before TEMPORARY or TEMP, but this currently makes no difference in PostgreSQL and is deprecated.
Architecture and Concepts
60 questions
PostgreSQL uses a simple 'process per-user' client/server model, where the server starts (forks) a new process for each connection.
The default starting value is minvalue for ascending sequences (which defaults to 1).
The default is 128MB, but might be less if your kernel settings will not support it (as determined during initdb).
The TOAST_TUPLE_TARGET is normally 2 kB (adjustable). Compression and/or out-of-line storage continues until the row reaches this size.
PostgreSQL implements three distinct isolation levels (Read Uncommitted behaves identically to Read Committed), though users can request any of the four SQL standard levels.
The default starting value is maxvalue for descending sequences.
TOAST_MAX_CHUNK_SIZE is approximately 2000 bytes, sized so that four chunks fit on a page.
The default value is 100, but might be less if your kernel settings will not support it (as determined during initdb).
No, reading never blocks writing and writing never blocks reading in PostgreSQL's MVCC implementation.
If neither CYCLE nor NO CYCLE is specified, NO CYCLE is the default.
The four strategies are: PLAIN (prevents compression and out-of-line storage), EXTENDED (allows both compression and out-of-line storage, default for most types), EXTERNAL (out-of-line without compression), and MAIN (allows compression but not out-of-line storage except as last resort).
The default is 'localhost', which allows only local loopback connections.
An on-disk TOAST pointer is 18 bytes total (including varlena header).
Server Configuration > Error Reporting and Logging
59 questions
Yes, for PostgreSQL's internal log rotation to work, log_filename must contain %-escapes. Without them, you should use an external log rotation utility.
64 (meaning max identifier length is 63 bytes due to trailing zero byte)
Yes, logging_collector must be enabled to generate CSV-format log output
stderr, csvlog, jsonlog, syslog, and eventlog (Windows only). Multiple destinations can be specified separated by commas.
10 minutes (600000 milliseconds) in PostgreSQL 15 and later; was -1 (disabled) in PostgreSQL 14 and earlier
SQLSTATE error code and source code file name, function name, and line number that generated the error
receipt, authentication, authorization, setup_durations, all, or empty string. Can be a comma-separated list.
DEBUG5, DEBUG4, DEBUG3, DEBUG2, DEBUG1, INFO, NOTICE, WARNING, ERROR, LOG, FATAL, and PANIC
Backend type (e.g., client backend, autovacuum worker, checkpointer)
Less than NAMEDATALEN characters (64 in a standard build), which means effectively 63 characters
All DDL statements plus data-modifying statements (INSERT, UPDATE, DELETE, TRUNCATE, COPY FROM)
Connection Pooling > PgBouncer Configuration
57 questions
No, server_lifetime only closes unused (not currently linked to any client) server connections that have been connected longer than the specified time.
It's a comma-separated list of database users allowed to connect to the PgBouncer console and run read-only SHOW commands (except SHOW FDS).
RELOAD reloads the configuration file and updates changeable settings. If a database's connection parameters change, existing server connections to the old destination are closed when released.
It tracks protocol-level named prepared statements in transaction and statement pooling mode using an LRU cache. The value controls the number of prepared statements kept active on a single server connection.
query_timeout cancels queries running longer than specified (default 0/disabled), while query_wait_timeout disconnects clients if they wait too long for a server connection (default 120s).
It allows running multiple PgBouncer instances on the same host listening on the same port, with the kernel distributing connections automatically. This enables PgBouncer to use more CPU cores.
It's a comma-separated list of database users allowed to connect to the PgBouncer console and run all administrative commands.
Immediate (SHUTDOWN), safe (SHUTDOWN WAIT_FOR_SERVERS waits for servers), and super safe (SHUTDOWN WAIT_FOR_CLIENTS waits for all clients to disconnect).
It limits the total number of server connections per user across all databases.
It sets the number of additional connections to allow for a pool beyond default_pool_size. The reserve pool kicks in when the regular pool is exhausted and is designed to handle bursts of client connections.
client_encoding, datestyle, timezone, standard_conforming_strings, and application_name
Linux, DragonFlyBSD, and FreeBSD. On FreeBSD, it uses SO_REUSEPORT_LB instead of SO_REUSEPORT.
Yes, pool_size can be set per-database in the [databases] section to override the global default_pool_size.
Only SET LOCAL should be used in transaction pooling mode, as it scopes changes to the current transaction. Never use SET SESSION or SET alone, which persist beyond the transaction.
PostgreSQL version 14 and above should use scram-sha-256, while versions below 14 should use md5.
Setting reserve_pool_size to 0 deactivates the reserve pool feature.
It disables prepared statement support for transaction and statement pooling modes.
Yes, pool_mode can be set per-database in the [databases] section to override the global pool_mode setting.
Only parameters that PostgreSQL reports back to the client can be tracked. Parameters must be in the list that Postgres sends to clients after connection or via SET commands.
PAUSE causes PgBouncer to wait for all queries to complete and disconnect from all servers. New client connections will wait until RESUME is called.
min_pool_size only maintains the minimum number of connections when the number of clients connected to the pool is greater than 0. If all clients disconnect, the pool can drop to 0 connections.
It specifies a query to be executed after a server connection is established, but before allowing the connection to be used by clients. Errors are logged but ignored.
If automatically created database pools (via '*' wildcard) have been unused for this many seconds, they are freed. Default is 3600 seconds (1 hour).
It limits the total number of server connections to a specific database, shared across all pools for that database.
Routine Maintenance > VACUUM
54 questions
The pg_stat_progress_vacuum view, which was added in PostgreSQL 9.6.
No, VACUUM FULL progress is reported in the pg_stat_progress_cluster view instead.
VACUUM, ANALYZE, CLUSTER, REFRESH MATERIALIZED VIEW, REINDEX, LOCK TABLE, and database object statistics manipulation functions.
The default is AUTO, which allows VACUUM to skip index vacuuming when there are very few dead tuples in the table.
The system refuses new XID assignments when fewer than 3 million transactions remain until wraparound.
Approximately 50 MB at the default autovacuum_freeze_max_age of 200 million transactions (commit status storage uses two bits per transaction).
The system emits warnings when the oldest XIDs reach 40 million transactions from the wraparound point.
True (enabled by default), meaning VACUUM processes the main relation.
The effective maximum is 0.95 * autovacuum_freeze_max_age, even if a higher value is configured.
TRUNCATE is enabled by default (true), meaning VACUUM attempts to truncate empty pages at the end of the table and return disk space to the operating system.
If the installation has N databases, a new worker will be launched every autovacuum_naptime/N seconds to distribute work across time.
An index must be greater than min_parallel_index_scan_size (default 512 kB) to be eligible for parallel vacuum.
It causes VACUUM to only update the database-wide statistics about oldest unfrozen XIDs without processing any tables.
No, VACUUM may still block when opening relation indexes, acquiring row samples from partitions, table inheritance children, and some types of foreign tables.
The default is 0 milliseconds, which disables cost-based vacuum delay.
The default is 2 (reduced from the historical default of 10 to better reflect modern hardware).
It causes VACUUM to skip updating the database-wide statistics about oldest unfrozen XIDs.
The default is -1, which means the regular vacuum_cost_limit value will be used.
VACUUM will silently adjust the effective value to no less than 105% of autovacuum_freeze_max_age.
PostgreSQL stops allowing WRITE operations and switches the database to READ-ONLY mode to prevent transaction ID wraparound corruption.
No, VACUUM cannot be executed inside a transaction block and requires autocommit mode.
Cost-based delay is disabled, non-essential maintenance tasks (such as index vacuuming) are bypassed, and any Buffer Access Strategy is disabled, allowing VACUUM to use all of shared_buffers.
It performs aggressive tuple freezing, equivalent to setting vacuum_freeze_min_age and vacuum_freeze_table_age to zero.
No, regular VACUUM reclaims space and makes it available for re-use within the same table, but does not return it to the operating system (except in the special case where one or more pages at the end of a table become entirely free and can be truncated).
No, autovacuum workers do not count toward max_connections or superuser_reserved_connections limits.
True (enabled by default), meaning VACUUM processes the corresponding TOAST table if one exists.
No, column specification only applies when using the ANALYZE option. Regular VACUUM operations work at the table level.
Two bits per page: one for all-visible status and one for all-frozen status.
VACUUM updates the relfrozenxid (oldest unfrozen transaction ID) and relminmxid (oldest multixact ID) fields in pg_class.
At least 2 indexes are required, since only one worker can be used per index.
Regular VACUUM requires a ShareUpdateExclusiveLock, which allows normal reading and writing of the table concurrently.
The MAINTAIN privilege is required. However, database owners are allowed to vacuum all tables in their databases except shared catalogs.
The default is 10min (10 minutes or 600,000 milliseconds) in PostgreSQL 15 and later. In earlier versions, it was -1 (disabled).
It indicates that every tuple in the page is frozen and no future vacuum will need to modify the page until a tuple is inserted, updated, deleted, or locked.
Data Manipulation (DML)
54 questions
Yes, if the expression for any column is not of the correct data type, automatic type conversion will be attempted
Unlimited, though constrained by the maximum relations per database limit
Data-modifying statements are only allowed in WITH clauses that are attached to the top-level statement
When used within the same transaction as an earlier CREATE TABLE or TRUNCATE command, with wal_level set to minimal, for non-partitioned tables
The table being loaded must have been created or truncated in the current subtransaction, there must be no cursors open, and there must be no older snapshots held by this transaction
OVERRIDING SYSTEM VALUE allows inserting explicit values into a GENERATED ALWAYS identity column
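A sketch, assuming a hypothetical items table whose id column is declared GENERATED ALWAYS AS IDENTITY:

    INSERT INTO items (id, label)
    OVERRIDING SYSTEM VALUE
    VALUES (42, 'manually numbered');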
Yes, DO NOTHING works in both MATCHED and NOT MATCHED clauses (this is a PostgreSQL extension)
No, VALUES lists with very large numbers of rows should be avoided, as you might encounter out-of-memory failures or poor performance
Yes, loading a large number of rows using COPY is almost always faster than using INSERT, even if PREPARE is used and multiple insertions are batched into a single transaction
Materialization occurs by default when CTEs are referenced multiple times
For a simple DELETE, all new values will be NULL. However, if an ON DELETE rule causes an INSERT or UPDATE to be executed instead, the new values may be non-NULL
No, exclusion constraints are not supported as arbiters with ON CONFLICT DO UPDATE. Only NOT DEFERRABLE constraints and unique indexes are supported.
ACCESS EXCLUSIVE lock, which blocks all other concurrent operations on the table
Same as QUOTE value (double-quote), so the quoting character is doubled if it appears in the data
No, WHERE CURRENT OF cannot be specified together with a Boolean condition
Do not repeat the target table as a from_item unless you intend a self-join (in which case it must appear with an alias in the from_item)
1,600 columns, with the caveat that the tuple being stored must fit in a single 8192-byte heap page
No, PostgreSQL does not natively support LIMIT in DELETE statements. Workarounds using CTEs or subqueries are required.
Each column not present in the target list will be filled with a default value, either a declared DEFAULT value or NULL if there is none
Yes, TRUNCATE will fire ON TRUNCATE triggers (both BEFORE and AFTER)
No, PostgreSQL does not natively support LIMIT in UPDATE statements. Workarounds using CTEs or subqueries are required.
No, for ON CONFLICT DO NOTHING, it is optional to specify a conflict_target; when omitted, conflicts with all usable constraints (and unique indexes) are handled
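For example (table is hypothetical; any usable unique constraint or index on it acts as the arbiter):

    INSERT INTO tags (name) VALUES ('postgres')
    ON CONFLICT DO NOTHING;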
RESTRICT is the default, which refuses truncation if any tables have foreign-key references from unlisted tables
It becomes the final reachable clause of that kind. If a later WHEN clause of that kind is specified it would be provably unreachable and an error is raised
No more than one WHEN clause is executed for any candidate change row
No, attempts to use TRUNCATE ONLY on a partitioned table will always return an error because a partitioned table does not have any data itself
CONTINUE IDENTITY is the default, meaning sequences remain unchanged
It forms no temporary table and cannot be referred to in the rest of the query, but the statement will still be executed
The subquery must yield no more than one row. If it yields one row, its column values are assigned to the target columns; if it yields no rows, NULL values are assigned
No, WITH HOLD may not be specified when the query includes FOR UPDATE or FOR SHARE
Any values supplied for identity columns are ignored and the default sequence-generated values are applied
Yes, it is possible to use ORDER BY, LIMIT (or equivalently FETCH FIRST), and OFFSET clauses with a VALUES command
Always 0, because OID system columns are not supported for user tables anymore (since PostgreSQL 12)
Server Configuration > Write-Ahead Log Settings
54 questions
Approximately 30% more compression, writing about half as much as LZ4.
No, it is a soft limit. WAL size can exceed max_wal_size under special circumstances such as heavy load, a failing archive_command, or a high wal_keep_size setting.
Because checkpoints include activities besides writing dirty buffers, and a setting of 1.0 is likely to result in checkpoints not being completed on time, causing performance loss due to unexpected variation in the number of WAL segments needed.
Operations related to bulk data processing such as CREATE TABLE AS SELECT, CREATE INDEX, and row information for permanent relations in transactions that create or rewrite them.
lz4 is just as fast as uncompressed data and writes approximately 60% as much (40% reduction in WAL volume).
Unrecoverable data corruption may occur, potentially rendering the entire cluster unusable and requiring a full restore from backups.
Causes commits to wait for local flush to disk, but not for replication.
No, the server cannot be started with summarize_wal=on if wal_level is set to minimal.
When data checksums are enabled, hint bit updates are always WAL-logged and the wal_log_hints setting is ignored.
Torn page writes (also called fractured blocks or partial writes), which occur when a page write is only partially completed during an operating system crash, leaving an on-disk page with a mix of old and new data.
The WAL archiver is enabled also during archive recovery or standby mode, and all files restored from the archive or streamed with streaming replication will be archived (again).
Up to 20x faster in contrived larger-than-memory pgbench crash recovery experiments.
It controls when checkpoints are started based on the amount of WAL written since the last checkpoint, not the actual size of the pg_wal directory.
Zero exit status. A nonzero status tells PostgreSQL that the file was not archived and it will try again periodically until it succeeds.
Success is returned only after the transaction has been applied (made visible) on the required number of synchronous replica(s), providing the highest level of consistency.
It logs a message like 'checkpoints are occurring too frequently (X seconds apart)' with a hint to 'Consider increasing the configuration parameter max_wal_size' when checkpoints occur closer together than the checkpoint_warning threshold.
The primary will remove WAL segments even if a replication slot says that someone still needs them, potentially causing the standby to be unable to continue replication.
PostGIS > Spatial Functions
52 questions
max_vertices must be 5 or more, as 5 points are needed to represent a closed box.
ST_IsValidReason returns text stating if a geometry is valid, or a reason for invalidity. In addition to the reason (such as self-intersection), the location of the invalidity (coordinate) is also returned.
The default envelope is the bounding box of the input expanded by about 50%. If the extend_to parameter is present, the diagram is extended to cover the envelope of the supplied geometry, unless smaller than the default envelope.
No, ST_Collect does not perform any geometric union operation. It simply collects geometries into a GeometryCollection or Multi* geometry without merging overlapping areas.
ST_ForceRHR orients the exterior ring in a clockwise direction and the interior rings in a counter-clockwise direction.
No, the named spatial predicates automatically use a spatial index if one is available, so there is no need to use the bounding box operator && as well.
ST_Covers returns true if every point in Geometry B lies inside (interior or boundary of) Geometry A, including boundary points. ST_Contains does not include boundary points - geometries do not contain their boundary. For example, a circle covers its exterior ring but doesn't contain it.
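A boundary point makes the difference visible:

    SELECT ST_Contains(poly, pt) AS contains,   -- false: boundary points are not contained
           ST_Covers(poly, pt)   AS covers      -- true
    FROM (SELECT 'POLYGON((0 0,0 1,1 1,1 0,0 0))'::geometry AS poly,
                 'POINT(0 0)'::geometry AS pt) AS example;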
If no params argument is provided, the 'linework' algorithm will be used as the default.
As of version 2.0.0+, ST_NumGeometries now returns 1 for single geometries like POLYGON, LINESTRING, and POINT. In prior versions this would return NULL if the geometry was not a collection/MULTI type.
The default value is 8 segments used to approximate a quarter circle.
For geography types, ST_Length returns values in meters, with computation performed using the inverse geodesic calculation.
The three parameters are: geom (the geometries to be analyzed), eps (the maximum distance that points in a cluster can be from each other), and minpoints (the minimum number of points required to form a cluster).
The join style defaults to 'round'. Available options are round (default), mitre (synonym: miter), and bevel.
It is represented by a 9-character text string using the symbols 'F', '0', '1', '2' (e.g. 'FF1FF0102').
ST_GeometryN uses 1-based indexing. Elements in the collection are numbered starting from 1.
The endcap style defaults to 'round'. Available options are round (default), flat (synonym: butt), and square.
For geometry types, ST_Distance returns the minimum 2D Cartesian (planar) distance in the units specified by the spatial reference system (SRID). For example, if the SRID uses feet, the distance will be in feet.
For geography types, ST_Area returns area in square meters by default, determined on a spheroid.
Only vertices in the target are considered as snap points. ST_Snap only snaps to existing vertices, not to arbitrary points along edges. You'll never get a vertex of first geometry snapped against a segment of the second.
Yes, the named spatial predicates (ST_Contains, ST_ContainsProperly, ST_CoveredBy, ST_Covers, ST_Crosses, ST_Overlaps, ST_Touches, ST_Within, ST_Intersects, and ST_3DIntersects) automatically use a spatial index if one is available, so there is no need to use the bounding box operator && as well.
For geometry types, the distance is specified in units defined by the spatial reference system of the geometries.
The most common SRID for geographic coordinates is 4326, which corresponds to longitude/latitude on the WGS84 spheroid.
Yes, ST_DWithin uses a spatial index if available. This makes it much faster than using ST_Distance for radius queries.
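A typical radius query, with hypothetical table and column names and assuming geom is a geography column (so the distance is in meters):

    SELECT name
    FROM places
    WHERE ST_DWithin(
        geom,
        ST_SetSRID(ST_MakePoint(-71.06, 42.36), 4326)::geography,
        1000    -- within 1 km
    );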
ST_LineMerge sews together the constituent line work of a MULTILINESTRING to form LineString(s). However, lines are not merged across intersections with degree > 2, and the input MultiLineString must not have gaps.
ST_ClosestPoint returns the first point of the shortest line between the geometries (as computed by ST_ShortestLine).
ST_Envelope returns a geometry (polygon) defined by the corner points of the minimum bounding box.
ST_Centroid is much faster. ST_PointOnSurface is much more complex to compute, so you should only use it when you really need a point guaranteed to be inside the geometry.
The ST_Union array and aggregate variants use the fast Cascaded Union algorithm.
Points that do not meet the criteria (not within eps distance of at least minpoints geometries, and not within eps of a core geometry) will be considered noise and will be assigned a NULL value for cluster_id.
ST_Centroid computes the geometric center of mass but can potentially return a point outside the bounds of a given geometry (e.g., for C-shaped polygons). ST_PointOnSurface is guaranteed to return a point that lies in the interior of a surface (POLYGON, MULTIPOLYGON, CURVEPOLYGON).
ST_IsValid tests if an ST_Geometry value is well-formed and valid in 2D according to the OGC rules.
The second argument is a float between 0 and 1 representing the fraction of line length where you want to find a point on the line.
ST_Split supports splitting a LineString by a (Multi)Point, (Multi)LineString or (Multi)Polygon boundary, or a (Multi)Polygon by a LineString.
Yes, the result may not be valid even if the input is valid. For topology-preserving simplification, use ST_SimplifyPreserveTopology instead.
ST_AsEWKT outputs the same text as ST_AsText, but it also includes the SRID in the output.
ST_Equals tests topological/spatial equality (same space occupied, regardless of vertex order or starting point). ST_OrderingEquals tests exact coordinate equality with identical coordinates in the same order - it's the strictest form of equality.
As of version 3.0.0, parsed geometry defaults to SRID=4326 if not specified otherwise.
Boolean spatial predicates like ST_Contains and ST_Intersects are computationally much faster than spatial overlay operations like ST_Intersection. Predicates are faster because their tests often allow for logical short circuits and can use prepared geometry optimizations.
ST_Disjoint is the opposite of ST_Intersects. If two geometries are disjoint, they do not intersect, and vice-versa. The relationship is: ST_Intersects(g1, g2) = NOT ST_Disjoint(g1, g2).
ST_Touches returns TRUE if either of the geometries' boundaries intersect or if only one of the geometry's interiors intersects the other's boundary. The geometries touch at their boundaries but their interiors do not intersect.
For geography types, ST_Distance defaults to return the minimum geodesic distance in meters, computed on the spheroid determined by the SRID.
ST_Overlaps requires that the two geometries have the same dimension (dim(A) = dim(B) = dim(Int(A) ⋂ Int(B))). They must intersect in that dimension with each having at least one point inside the other.
ST_Envelope returns the minimum bounding box for a single geometry. ST_Extent is an aggregate function that returns the bounding box for multiple geometries across table rows.
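For example (the roads table and its geom column are hypothetical):
SELECT ST_AsText(ST_Envelope(geom)) FROM roads WHERE road_id = 1;  -- bounding box of one geometry
SELECT ST_Extent(geom) FROM roads;                                 -- aggregate box over all rows (box2d)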
The && operator returns TRUE if the 2D bounding box of geometry A intersects the 2D bounding box of geometry B.
For a POLYGON the paths are {i,j} where i is the ring number (1 is outer; inner rings follow) and j is the coordinate position in the ring.
No, ST_AsBinary returns the OGC/ISO Well-Known Binary (WKB) representation of the geometry/geography without SRID metadata.
Monitoring > Replication Monitoring
51 questions
Server startup only. This parameter can only be set at server start and requires a restart.
PostgreSQL 13. It replaced wal_keep_segments for consistency with the newly-added max_slot_wal_keep_size parameter.
replica or higher. wal_level must be set to replica or higher to allow replication slots to be used.
60 seconds. If the value is specified without units, it is taken as milliseconds.
The number of bytes of WAL that can be written before this slot gets in danger of having WAL files removed. Data type is int8 (bigint).
PostgreSQL 14. It contains one row per logical replication slot, showing statistics about logical replication.
pg_wal_lsn_diff(). For example: pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) calculates lag in bytes.
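A typical per-slot lag query built on that function (columns are those of pg_replication_slots):
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';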
apply (main apply process), parallel apply (parallel apply workers), or table synchronization (workers handling initial data copy of subscribed tables).
Server startup only. It requires a restart and cannot be set lower than the existing slot count. Also requires wal_level set to replica or higher.
Only one row, showing statistics about the WAL receiver from that receiver's connected server.
Empty string. When archive_command is an empty string while archive_mode is enabled, WAL archiving is temporarily disabled, but the server continues to accumulate WAL segment files.
-1 (no limit). When set to -1, replication slots may retain an unlimited amount of WAL files.
minimal, replica, and logical. Each level includes the information logged at all lower levels.
Yes. Zero disables replication as it prevents standby servers from connecting.
PostgreSQL 9.4. This view has not been modified since it was introduced.
10 seconds. This specifies the minimum frequency for the WAL receiver process on the standby to send replication progress information to the primary. If specified without units, it is taken as seconds.
Always a single row, containing data about the archiver process of the cluster.
0 (zero), which means the system doesn't keep any extra WAL segments for standby purposes. If specified without units, it is taken as megabytes.
Server startup only. This parameter can only be set in postgresql.conf file or on the server command line and requires a restart.
It optionally specifies an existing replication slot to be used when connecting to the sending server via streaming replication to control resource removal on the upstream node. This ensures the primary doesn't remove WAL segments until the standby processes them.
reserved (within max_wal_size, healthy), extended (exceeded max_wal_size but protected), unreserved (in danger but can recover), and lost (WAL files removed, no recovery possible).
One row per WAL sender process. It tracks only directly connected standbys; no information is available about downstream standby servers.
[FIRST] num_sync ( standby_name [, ...] ). This specifies priority-based synchronous replication. For example, FIRST 3 (s1, s2, s3, s4) will cause each commit to wait for replies from three higher-priority standbys. The FIRST keyword is optional.
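For example, to wait for the two highest-priority standbys out of three (the standby names are hypothetical):
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 2 (s1, s2, s3)';
SELECT pg_reload_conf();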
archived_count (bigint), last_archived_wal (text), last_archived_time (timestamptz), failed_count (bigint), last_failed_wal (text), last_failed_time (timestamptz), and stats_reset (timestamptz).
This parameter specifies the maximum number of concurrent connections from standby servers.
off. When enabled, a hot standby will send feedback to the primary about queries currently executing on the standby.
pg_last_wal_replay_lsn(). It returns the last write-ahead log location replayed during recovery, which increases monotonically if recovery is still in progress.
text. It shows the synchronous replication status of the standby.
By calling pg_stat_reset_replication_slot(slot_name). To reset statistics for all logical replication slots, call pg_stat_reset_replication_slot(NULL).
startup, catchup, and streaming. 'catchup' indicates the replica is catching up to the primary, while 'streaming' indicates the standby is actively receiving WAL data and is caught up.
interval. These columns represent replication delays for write, flush, and replay operations. On idle systems, lag values revert to NULL after a short time.
This standby's xmin horizon reported by hot_standby_feedback. It has data type xid and is used by the primary to determine which dead tuples are safe to be vacuumed.
off. This parameter records commit time of transactions and can only be set at server start.
Yes. Changes can be applied with pg_reload_conf(), pg_ctl reload or SIGHUP (no server restart required).
This parameter controls the total replication slots the server can support.
replica or higher. Level replica is required for binary replication, and level logical is required for logical replication. Minimal WAL does not contain sufficient information for replication.
Key columns include: slot_name (name), slot_type (text: 'physical' or 'logical'), active (bool), restart_lsn (pg_lsn), confirmed_flush_lsn (pg_lsn), wal_status (text), safe_wal_size (int8), temporary (bool), and active_pid (int4).
pid (integer), status (text), receive_start_lsn (pg_lsn), receive_start_tli (integer), written_lsn (pg_lsn), flushed_lsn (pg_lsn), received_tli (integer), last_msg_send_time (timestamptz), last_msg_receipt_time (timestamptz), latest_end_lsn (pg_lsn), latest_end_time (timestamptz), slot_name (text), sender_host (text), sender_port (integer), and conninfo (text).
off, local, remote_write, on, and remote_apply. 'off' has no waiting, 'local' waits for local disk flush, 'remote_write' waits for remote write (not flush), 'on' waits for remote flush, and 'remote_apply' waits for remote application.
slot_name (text), spill_txns (bigint), spill_count (bigint), spill_bytes (bigint), stream_txns (bigint), stream_count (bigint), stream_bytes (bigint), total_txns (bigint), total_bytes (bigint), and stats_reset (timestamptz).
async (asynchronous), sync (synchronous), potential (can become synchronous if current sync fails), and quorum (quorum-based synchronous replication).
0 (zero), which means no delay is added. If specified without units, it is taken as milliseconds.
pg_last_wal_receive_lsn(). It returns the write-ahead log location received and synced to disk by streaming replication.
60 seconds. If the value is specified without units, it is taken as milliseconds.
replica. This writes enough data to support WAL archiving and replication, including running read-only queries on a standby server.
off. When enabled, completed WAL segments are sent to archive storage. Possible values are: off, on (archives only on primary), and always (archives on both primary and standby).
One row per subscription worker. A subscription can have zero or more active subscription workers depending on its state (enabled, disabled, or crashed subscriptions).
Connection Pooling > PgBouncer
50 questions
session. In session pooling mode, the server is released back to the pool after the client disconnects.
The peer_id value should be unique within a group of peered PgBouncer processes. When set to 0, PgBouncer peering is disabled.
max_user_connections limits the total number of connections to any database from one user. This limit makes sense only in the case of multiple databases with the same users.
30 seconds. Released connections can be reused without running server_check_query for this duration. If set to 0, the check runs every time.
LIFO (last-in, first-out). By default, server_round_robin is set to 0, meaning PgBouncer reuses the most recently used connections so that few connections get the most load.
0 seconds (disabled). When set to 0, queries can run indefinitely without being canceled.
This limits the maximum number of client connections allowed. The actual limit may depend on file descriptor limits of the system.
"pgbouncer". This is the name under which logs are sent to syslog when syslog is enabled.
dbname, host, port, user, password, auth_user, client_encoding, datestyle, timezone, pool_size, reserve_pool_size, max_db_connections, pool_mode, connect_query, and application_name among others.
client_encoding, datestyle, timezone, and standard_conforming_strings. PgBouncer allows only these parameters in startup packets by default; other parameters will raise an error unless specified in ignore_startup_parameters.
60 seconds. Clients who have not managed to log in after connecting within this time are disconnected.
Version 1.25.0 added the transaction_timeout setting, configurable both globally and at the user level.
0 (disabled). When set to 0, prepared statement support for transaction and statement pooling is disabled. To enable this feature (available since PgBouncer 1.21.0), set it to a non-zero value like 10.
daemon. This is the facility under which logs are sent to syslog. Other options include auth, authpriv, user, and local0-7.
"select 1". This simple query verifies that idle server connections are still functional before reusing them.
600 seconds (10 minutes). Server connections that have been idle for longer than this time are closed.
1 (enabled). By default, PgBouncer logs disconnections with reasons.
No. In statement pooling mode, server connections are released after each query finishes, which breaks transaction semantics. Multi-statement transactions are disallowed and clients behave as if in autocommit mode.
0 (disabled). When enabled, this allows running multiple PgBouncer instances on the same host listening on the same port with kernel-level connection distribution.
Only parameters that Postgres reports to the client (marked with GUC_REPORT flag internally) can be tracked. Postgres has an official list of these parameters, and extensions can modify this list.
0 (disabled). By default, logs are written to a log file rather than syslog.
any, trust, plain, md5, cert, hba, ldap, and pam. Each method has different requirements and security implications.
4096 bytes. This is the internal buffer size for packets and affects the size of TCP packets sent and general memory usage.
1 (enabled). By default, aggregated statistics are written to the log every stats_period seconds.
1 (per Heroku documentation). This controls how many additional connections to allow to a pool. Setting it to 0 disables the reserve pool.
The server_reset_query is not used in transaction pooling mode (default is empty). Clients must not use session-based features since each transaction ends up in a different connection with different session state.
6432. This is the default port for both TCP and Unix sockets, chosen to differentiate PgBouncer from the standard PostgreSQL port (5432).
WAIT_CLOSE waits until all server connections have cleared the "close_needed" state. This can be called after RECONNECT or RELOAD to wait until the configuration change has been fully activated.
This controls how many server connections to allow per user/database pair when not explicitly overridden in per-database configuration.
3600 seconds (1 hour). Server connections that have been alive longer than this time are closed, regardless of whether they're idle or active.
Yes, as of version 1.23.0, pool_size can be configured per user. server_lifetime can also be configured per database.
2147483647 bytes (approximately 2GB). This is the maximum size for PostgreSQL packets that PgBouncer allows through.
60 seconds. This controls how often statistics averages are updated and how often aggregated statistics are written to the log.
Not set by default. Without pidfile set, daemonization (-d) is not allowed.
KILL immediately drops all client and server connections to the named database. New client connections to a killed database will wait until RESUME is called. This is useful in emergency failover scenarios.
No. It only works for protocol-level named prepared statements (e.g., using libpq's PQprepare/PQexecPrepared, or JDBC's PreparedStatement). It cannot handle prepared statements created with the SQL-level PREPARE command.
5 seconds (per Heroku documentation). If a client has not been serviced in this time, additional connections from the reserve pool are used. Setting it to 0 disables this feature.
There are no default values for admin_users or stats_users. They must be explicitly configured to grant admin or stats access to specific users.
15 seconds. DNS lookups are cached for this duration, and the actual DNS TTL is ignored.
15 seconds. If connection and login don't finish within this time, the connection is closed.
RECONNECT closes each open server connection after it is released (according to the pooling mode), even if its lifetime is not up. It is useful when server connection setup has changed, for example to perform a gradual switchover to a new server.
SUSPEND flushes all socket buffers and stops listening for data on them. The command will not return before all buffers are empty. It is used when rebooting PgBouncer online. New client connections to a suspended database will wait until RESUME is called.
"DISCARD ALL". This cleans up session state (including prepared statements) when a connection is released in session pooling mode. For transaction pooling mode, the default is empty.
0 (disabled). When disabled, PgBouncer uses LIFO connection reuse. When enabled (set to 1), it uses round-robin distribution, which is better for load balancers or DNS round-robin setups.
No. When the connection string in pgbouncer.ini has been changed and reloaded, or when DNS resolution has changed, the equivalent of RECONNECT is run automatically.
pgbouncer. Users connect to this virtual database (e.g., psql -p 6432 -U someuser pgbouncer) to access the administration interface. This database name cannot be used for regular database connections.
0 seconds (disabled). When disabled, idle transactions are not automatically terminated.
PAUSE tries to disconnect from all servers, first waiting for all queries to complete. The command will not return before all queries are finished. New client connections to a paused database will wait until RESUME is called.
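A common online-maintenance sequence issued from the PgBouncer admin console (the database name mydb is hypothetical):
PAUSE mydb;
RELOAD;
RESUME mydb;
PAUSE waits for running queries on mydb to finish, RELOAD re-reads pgbouncer.ini, and RESUME lets the waiting clients proceed.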
Version 1.23.0 added support for replication connections through PgBouncer.
Functions - Date/Time
50 questions
clock_timestamp() returns the actual current time, and therefore its value changes even within a single SQL statement.
No. to_timestamp and to_date ignore letter case in the input; so MON, Mon, and mon all accept the same strings.
microseconds, milliseconds, second, minute, hour, day, week, month, quarter, year, decade, century, millennium
For timestamp values, the number of seconds since 1970-01-01 00:00:00-00 (can be negative); for interval values, the total number of seconds in the interval.
date_part() returns double precision. This can result in a loss of precision in certain cases, which is why EXTRACT() is recommended from PostgreSQL 14 onwards.
pg_sleep_until() accepts a timestamp with time zone as an argument.
No. In DST timezones, interval '1 day' ≠ interval '24 hours' because hours may be skipped or repeated due to DST changes.
isfinite() returns false for infinity and -infinity timestamp values, and true for finite values.
No. PostgreSQL doesn't support infinity intervals, so isfinite() always returns true for an interval.
Yes. timestamp without time zone AT TIME ZONE zone returns timestamp with time zone.
make_interval([years int [, months int [, weeks int [, days int [, hours int [, mins int [, secs double precision]]]]]]]). All parameters are optional and default to zero. The first 6 are integers, secs is double precision.
Monday. By ISO 8601 definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year.
date_bin(stride, source, origin) where stride is an interval, source is a timestamp, and origin is a timestamp.
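For example, binning a timestamp into 15-minute buckets:
SELECT date_bin('15 minutes', TIMESTAMP '2024-02-11 15:44:17', TIMESTAMP '2001-01-01');
-- returns 2024-02-11 15:30:00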
The year gets truncated to 4 digits. You must use some non-digit character or template after YYYY for years exceeding 4 digits.
statement_timestamp() returns the start time of the current statement (more specifically, the time of receipt of the latest command message from the client).
EXTRACT() returns numeric type in PostgreSQL 14 and later. Prior to PostgreSQL 14, it returned double precision.
make_timestamp(year int, month int, day int, hour int, min int, sec double precision) → timestamp. The first five parameters are integers, and sec is double precision.
AT LOCAL is shorthand for AT TIME ZONE with the session's TimeZone setting.
sql_standard, postgres, postgres_verbose, and iso_8601. The default is postgres.
No. A time zone cannot be specified when processing timestamp without time zone or interval inputs.
Year 2000 returns 20 (the 20th century). Year 2001 returns 21 (the 21st century). PostgreSQL uses conventional numbering where centuries begin with years ending in 01.
The iso_8601 style matches the 'format with designators' described in section 4.4.3.2 of the ISO 8601 standard.
Year 2000 is in the 2nd millennium. The 3rd millennium starts January 1, 2001. PostgreSQL uses conventional calendar numbering.
The system converts 'now' to a timestamp as soon as the constant is parsed, so it's fixed at table creation time. Use now() or CURRENT_TIMESTAMP function instead for dynamic evaluation.
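A minimal sketch of the pitfall (table names are hypothetical):
CREATE TABLE events_bad  (created timestamptz DEFAULT 'now');  -- literal evaluated once, at table creation
CREATE TABLE events_good (created timestamptz DEFAULT now());  -- function evaluated for each insert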
pg_sleep() accepts a double precision argument representing seconds, so fractional-second delays can be specified.
PostgreSQL keeps the day of month the same unless it would be past the end of the new month, in which case the last day of that month is used.
First, a nonzero months field advances/decrements the date by the indicated number of months. Then the days field advances/decrements the date. Finally, the microseconds field is added or subtracted literally.
In PostgreSQL, FM modifies only the next specification. In Oracle, FM affects all subsequent specifications, and repeated FM modifiers toggle fill mode on and off.
CURRENT_TIMESTAMP delivers values with time zone; LOCALTIMESTAMP delivers values without time zone.
No. timestamp without time zone ignores any specified timezone during input and stores no timezone information.
make_date(year int, month int, day int) → date. All three parameters are integers.
When the input value is +/-Infinity, extract returns +/-Infinity for monotonically-increasing fields (epoch, julian, year, isoyear, decade, century, and millennium).
Each time period is considered to represent the half-open interval start <= time < end, unless start and end are equal. Two periods with only an endpoint in common do not overlap.
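For example:
SELECT (DATE '2001-02-16', DATE '2001-12-21') OVERLAPS
       (DATE '2001-10-30', DATE '2002-10-30');   -- true
SELECT (DATE '2001-01-01', DATE '2001-02-01') OVERLAPS
       (DATE '2001-02-01', DATE '2001-03-01');   -- false: only the shared endpoint is in common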
300 milliseconds (not 3), because the conversion treats it as 12 + 0.3 seconds.
In early January or late December, the ISO year may be different from the Gregorian year. For example, January 1, 2006 is in ISO year 2005.
PostgreSQL converts timestamptz values to UTC using the provided offset and stores them in UTC. The original timezone is not retained.
Database Roles and Privileges
50 questions
The pg_database_owner role always has one implicit member: the current database owner. It cannot be granted or have membership granted in it. It owns the public schema by default.
No. Circular membership loops are prohibited in PostgreSQL role membership.
For sequences, USAGE allows use of the currval and nextval functions.
No privileges are granted to PUBLIC by default on tables. The object owner must explicitly grant privileges.
The pg_monitor predefined role is a member of pg_read_all_settings, pg_read_all_stats, and pg_stat_scan_tables, providing comprehensive monitoring capabilities.
The pg_execute_server_program predefined role allows executing server-side programs via COPY PROGRAM. This role could be used to gain superuser-level access and should be granted carefully.
When GRANT OPTION FOR is specified, only the grant option for the privilege is revoked, not the privilege itself. The user retains their own access but loses the ability to grant that privilege to others.
For schemas, USAGE allows access to objects contained in the schema (assuming those objects' own privilege requirements are also met). Without USAGE on a schema, you cannot access any objects within it.
USAGE privilege is granted to PUBLIC by default for languages and data types (including domains).
The pg_use_reserved_connections predefined role allows using connection slots reserved by the reserved_connections parameter.
To analyze a table, one must ordinarily have the MAINTAIN privilege on the table. However, database owners are allowed to analyze all tables in their databases, except shared catalogs.
When revoking privileges on a table, the corresponding column privileges (if any) are automatically revoked on each column of the table.
Databases support: CREATE, CONNECT, and TEMPORARY (TEMP) privileges.
No. SET ROLE does not process session variables as specified by the role's ALTER ROLE settings; those would only apply when initially authenticating as that role.
No. The documentation states that connection limits are enforced approximately, not strictly. There may be brief periods where slightly more connections than the limit are allowed.
For types and domains, USAGE allows use of the type or domain in the creation of tables, functions, and other schema objects.
The pg_read_server_files role allows reading files via COPY and file-access functions, and pg_write_server_files allows writing files. Both could be used to gain superuser-level access.
No. CREATEROLE does not confer the ability to grant or revoke the BYPASSRLS privilege.
No. Neither pg_read_all_data nor pg_write_all_data bypass row-level security (RLS) policies. RLS policies still apply to users with these roles.
SESSION_USER returns the original authenticated user (the role that logged in), while CURRENT_USER reflects the role set by SET ROLE. After SET ROLE, CURRENT_USER changes but SESSION_USER remains the same.
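For example (the analyst role is hypothetical):
SET ROLE analyst;
SELECT current_user, session_user;   -- current_user is analyst, session_user is still the login role
RESET ROLE;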
Both SET and INHERIT default to TRUE when not explicitly specified in a GRANT role membership command.
ALTER SYSTEM writes configuration changes to postgresql.auto.conf, which is read alongside postgresql.conf.
SET ROLE NONE always reverts to the session user. RESET ROLE first checks for connection-time role settings (from configuration like ALTER ROLE or command-line options), and only falls back to the session user if no such settings exist.
No. Granting permission on a table does not automatically extend permissions to any sequences used by the table, including sequences tied to SERIAL columns. Permissions on sequences must be set separately.
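A typical fix when an application role hits "permission denied for sequence" (object and role names are hypothetical):
GRANT INSERT ON orders TO app_writer;
GRANT USAGE ON SEQUENCE orders_id_seq TO app_writer;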
No privileges are granted to PUBLIC by default on schemas. The object owner must explicitly grant privileges.
No. The IN SCHEMA clause is not allowed for schemas, because schemas can't be nested, and is not allowed for large objects, because large objects don't belong to a schema.
Database roles are global across a database cluster installation (and not per individual database).
No. CASCADE is not supported in DROP ROLE. You must manually drop or reassign all objects owned by the role and revoke all privileges before dropping the role.
No. Revoking SELECT privilege on a column will not affect any table-level SELECT privilege.
The ADMIN option is disabled by default. New memberships will have the ADMIN option disabled unless explicitly specified.
The TEMPORARY privilege allows temporary tables to be created while using the database. This is a database-level privilege.
A freshly initialized system contains one predefined superuser role, normally named after the operating system user that initialized the database cluster with initdb (commonly 'postgres').
No. It is not permitted to grant membership in a role to PUBLIC.
No. You will only use the default privileges for a particular role if you are that role when you create the object. Default privileges from roles you belong to don't automatically apply.
The MAINTAIN privilege allows execution of VACUUM, ANALYZE, REINDEX, REFRESH MATERIALIZED VIEW, CLUSTER, and LOCK TABLE commands on a relation.
The SET privilege allows a server configuration parameter to be set to a new value within the current session. While this privilege can be granted on any parameter, it is meaningless except for parameters that would normally require superuser privilege to set.
CONNECT and TEMPORARY (create temporary tables) privileges are granted to PUBLIC by default for databases.
The default value is zero connections. These slots are reserved for roles with the pg_use_reserved_connections privilege.
The default value is three connections. These connection slots are reserved exclusively for superuser connections.
No. Since ALTER SYSTEM acts directly on the file system and cannot be rolled back, it is not allowed inside a transaction block or function.
The TRIGGER privilege allows creation of a trigger on a table, view, etc. You must also have EXECUTE privilege on the trigger function.
The ALTER SYSTEM privilege allows a server configuration parameter to be configured to a new value using the ALTER SYSTEM command, which writes to the postgresql.auto.conf file.
Tables support: SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, and MAINTAIN privileges.
You can query the pg_roles system catalog using: SELECT rolname FROM pg_roles; or use psql's \du meta-command for a convenient listing.
EXECUTE privilege is granted to PUBLIC by default for functions and procedures.
Logical Replication > Subscriptions
49 questions
No. When creating a replication slot (the default behavior), CREATE SUBSCRIPTION cannot be executed inside a transaction block.
You must own the subscription. To alter the owner, you must have CREATE permission on the database. In addition, to alter the owner, you must be able to SET ROLE to the new owning role.
No. Initial data synchronization does not take into account the publication publish parameter when copying existing table data, meaning some rows may be copied that wouldn't normally be replicated using DML operations.
Such subscriptions must also have both enabled and create_slot set to false.
Yes. It is safe to use off for logical replication: If the subscriber loses transactions because of missing synchronization, the data will be sent again from the publisher.
Yes. This parameter can only be set in the postgresql.conf file or on the server command line (not requiring a server restart, just a configuration reload).
Streaming is triggered when the total amount of changes decoded from the WAL (for all in-progress transactions) exceeds the limit defined by logical_decoding_work_mem setting. Changes in large transactions (greater than logical_decoding_work_mem) are applied directly via one of the parallel apply workers, if available.
No, if the subscription is associated with a replication slot. DROP SUBSCRIPTION cannot be executed inside a transaction block if the subscription is associated with a replication slot.
No. These key words do not have any effect, since there are no dependencies on subscriptions.
These are non-error conflicts - the operations are skipped and replication continues.
When altering two_phase from true to false, the backend process reports an error if any prepared transactions done by the logical replication worker (from when two_phase parameter was still true) are found.
To create a subscription, you must have the privileges of the pg_create_subscription role, as well as CREATE privileges on the current database.
When run_as_owner = true, all replication actions are performed as the subscription owner. No role switching occurs.
Text format. If the publisher is a PostgreSQL version before 16, then any initial table synchronization will use text format even if binary = true.
It allows any table owner to easily usurp the privileges of the subscription owner - basically, to take over their account.
Replication stops immediately. The system generates detailed logs including the table name, conflict type, affected column values, existing local row data, and remote row data.
To alter the owner, you must also be a direct or indirect member of the new owning role. The new owner has to be a superuser.
The password_required setting is ignored when the subscription is owned by a superuser.
The tables are matched between the publisher and the subscriber using the fully qualified table name.
Only persistent slots that have attained synced state as true on the standby before failover can be used for logical replication after failover. To resume logical replication after failover from synced logical slots, the subscription's conninfo must be altered to point to the new primary server.
When using a subscription parameter combination of copy_data = true and origin = NONE, the initial sync table data is copied directly from the publisher, meaning that knowledge of the true origin of that data is not possible. A WARNING is logged to the user.
When run_as_owner = false (the default), the subscription apply process will run with the privileges of the subscription owner at the session level, but will switch roles to the table owner and perform operations with the table owner's privileges.
No. The database schema and DDL commands are not replicated through logical replication, and subsequent schema changes need to be kept in sync manually.
pg_dump will generate CREATE SUBSCRIPTION commands that use the connect = false option, so that restoring the subscription does not make remote connections for creating a replication slot or for initial table copy.
No. copy_data is an option of the action, not a property of the subscription. Even if you created a subscription with copy_data = false, REFRESH PUBLICATION will default back to copy_data = true unless you explicitly specify otherwise.
It is mandatory to have a physical replication slot between the primary and the standby (primary_slot_name should be configured), and hot_standby_feedback must be enabled on the standby. It's also necessary to specify a valid dbname in the primary_conninfo.
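A sketch of the corresponding settings on the standby (host, slot, and role names are hypothetical):
ALTER SYSTEM SET primary_slot_name = 'standby1_slot';
ALTER SYSTEM SET hot_standby_feedback = on;
ALTER SYSTEM SET primary_conninfo = 'host=primary.example.com port=5432 user=repl dbname=postgres';
SELECT pg_reload_conf();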
Setting the LSN to NONE resets it. For example: ALTER SUBSCRIPTION mysub SKIP (lsn = NONE);
No. Commands ALTER SUBSCRIPTION ... REFRESH PUBLICATION and ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ... with refresh option as true cannot be executed inside a transaction block.
No. PostgreSQL CREATE SUBSCRIPTION does not support an IF NOT EXISTS clause.
No, unless copy_data = false. Commands cannot be executed when the subscription has two_phase commit enabled, unless copy_data is false.
The implementation of two_phase commit requires that replication has successfully finished the initial table synchronization phase. So even when two_phase is enabled for a subscription, the internal two-phase state remains temporarily 'pending' until the initialization phase completes.
Write-Ahead Log (WAL)
48 questions
The minimum is 32 kB. Any positive value less than 32kB will be treated as 32kB.
No. archive_mode cannot be enabled when wal_level is set to 'minimal'. The wal_level must be 'replica' or higher to enable continuous archiving and streaming binary replication.
The default is 30 seconds (30s). If this value is specified without units, it is taken as seconds. Setting it to 0 disables the warning.
The default is 'latest' (in PostgreSQL 12+). Valid values are numeric timeline ID, 'current', or 'latest'.
A WAL segment is internally divided into pages of 8192 bytes (8 KB), also known as XLOG_BLCKSZ.
The default is 'off'. When data checksums are enabled, hint bit updates are always WAL-logged and this setting is ignored.
The default value is -1, which automatically selects a size equal to 1/32nd (about 3%) of shared_buffers, but not less than 64kB nor more than the size of one WAL segment (typically 16MB).
The default is 'pause'. Valid values are 'pause', 'promote', 'shutdown'. This parameter was added in PostgreSQL 9.5.
Turning off full_page_writes might lead to either unrecoverable data corruption, or silent data corruption, after a system failure. The risks are similar to turning off fsync, though smaller.
No. The server cannot be started with summarize_wal=on if wal_level is set to 'minimal'.
At most one of recovery_target, recovery_target_lsn, recovery_target_name, recovery_target_time, or recovery_target_xid can be specified. Setting more than one is not allowed.
The default is 5 transactions. This is the minimum number of concurrent open transactions required before performing the commit_delay delay.
The default value is 80 MB. If this value is specified without units, it is taken as megabytes.
wal_decode_buffer_size sets a limit on how far ahead the server can look in the WAL to find blocks to prefetch during recovery.
Recovery will be paused when the target is reached. The paused state can be resumed by using pg_wal_replay_resume(), which then causes recovery to end. The intended use is to allow queries to be executed against the database to check if this recovery target is the most desirable point for recovery.
The default is 0 (disabled). When greater than zero, it forces the server to switch to a new WAL segment file whenever this amount of time has elapsed since the last segment file switch.
recovery_prefetch was originally added in PostgreSQL 14 but subsequently withdrawn. It was later reintroduced in PostgreSQL 15.
The default is 0 microseconds (no delay). The valid range is 0 to 100000 microseconds. If specified without units, it is taken as microseconds.
The default is 5 minutes (5min). The valid range is between 30 seconds and 1 day.
The default is 200 milliseconds (200ms). The valid range is 1ms to 10000ms (10 seconds). If specified without units, it is taken as milliseconds.
No. The wal_level parameter can only be set at server start, requiring a full PostgreSQL restart to change.
The default is 'on' (true). This specifies whether to stop just after the specified recovery target (on), or just before the recovery target (off).
The default is 512 kB. The valid range is 64kB to 1073741823 bytes. This parameter can only be set at server start.
WAL files are stored in the pg_wal directory under the data directory (typically $PGDATA/pg_wal). In PostgreSQL 9.6 and earlier, this directory was named pg_xlog.
If data checksums are enabled, hint bit updates are always WAL-logged and the wal_log_hints setting is ignored.
When wal_level is minimal and a transaction commits after creating or rewriting a permanent relation, wal_skip_threshold determines how to persist the new data. If the data is smaller than this setting, write it to the WAL log; otherwise, use an fsync of affected files.
The default is 'on'. Disabling fsync can result in unrecoverable data corruption in the event of a power failure or system crash.
If hot_standby is not enabled, a setting of 'pause' will act the same as 'shutdown'.
lz4 compression is light on the CPU and can give compression very close to pglz without causing high CPU overhead.
No. The archive_command and archive_library parameters are mutually exclusive and cannot both be set.
The default is 'try'. Valid values are 'off', 'on', 'try'. The 'try' setting enables prefetching only if the operating system provides the posix_fadvise function.
The default is 0 (0MB), meaning the system doesn't keep any extra segments for standby purposes. If specified without units, it is taken as megabytes.
The default is 10 days. If this value is specified without units, it is taken as minutes.
The three valid values are: 'minimal', 'replica', and 'logical'. Each level includes the information logged at all lower levels.
No. When both wal_init_zero and wal_recycle are enabled (their default state), recycled files are not filled with zeroes because they are not considered 'new' files. wal_init_zero only applies to newly created WAL files.
zstd provides the highest compression, about 30% more compared to lz4, though at the expense of more CPU utilization.
No. A cluster's wal_segment_size cannot be modified after initialization. It can only be set with the initdb option --wal-segsize (PostgreSQL 11 and later) or by changing the --with-wal-segsize configuration option at compile time.
'on' means archive_command will only be applied on the primary, while 'always' means archive_command will be applied regardless of whether the server is a primary or a standby.
The default is 'on'. When enabled, new WAL files are filled with zeroes to ensure space is allocated before writing WAL records on some file systems.
The default is 'off'. Valid values are: 'pglz', 'lz4', 'zstd'. Using lz4 or zstd requires PostgreSQL to be compiled with --with-lz4 and --with-zstd respectively.
The default is 'on'. This parameter protects against torn page writes by storing complete page images in WAL after each checkpoint.
The default is 256 kB on Linux, 0 (disabled) on other platforms. The valid range is between 0 and 2MB.
The default WAL segment size is 16 MB. This is a read-only parameter that cannot be changed after cluster initialization without recompiling PostgreSQL.
The default is 2 MB. If this value is specified without units, it is taken as kilobytes. This parameter only applies when wal_level is 'minimal'.
The parameter wal_keep_segments was replaced by wal_keep_size in PostgreSQL 13, for consistency with the newly-added max_slot_wal_keep_size parameter.
The default is 0.9 (90% of the checkpoint interval). This was changed from 0.5 to 0.9 in PostgreSQL 14.
Client Authentication > pg_hba.conf Configuration
47 questions
scram-sha-256 is the most secure of the currently provided password authentication methods.
Trust authentication is only suitable for TCP/IP connections if you trust every user on every machine allowed to connect, and is seldom reasonable for TCP/IP connections other than those from localhost (127.0.0.1).
Include directives (include, include_if_exists, include_dir) for pg_hba.conf were introduced in PostgreSQL 16.
The 'samehost' keyword matches any of the server's own IP addresses.
The 'sameuser' keyword matches if the database name equals the requested user name.
When specified for local connections, peer authentication will be used instead.
The 'host' connection type matches TCP/IP connections (SSL or non-SSL, GSSAPI encrypted or not).
include_if_exists includes the file if it exists and logs a skip message if not.
No, using hostssl does not automatically require client certificates. Client certificate authentication must be explicitly configured using the cert method or clientcert option.
Yes, a record can be continued onto the next line by ending the line with a backslash.
No, superusers are not considered members of a role unless they are explicitly members of the role, directly or indirectly, and not just by virtue of being a superuser.
Any database or user entry beginning with a slash (/) is considered a regular expression (available in PostgreSQL 16+).
By default, the pg_hba_file_rules view can be read only by superusers.
To use hostssl records, the server must be built with SSL support, and SSL must be enabled by setting the ssl configuration parameter.
The 'samenet' keyword matches any address in any subnet that the server is directly connected to.
Both require the client to present a valid (trusted) SSL certificate, while verify-full additionally enforces that the cn (Common Name) in the certificate matches the username or an applicable mapping.
A default pg_hba.conf file is installed when the data directory is initialized by initdb.
Yes, support for MD5-encrypted passwords is deprecated and will be removed in a future release of PostgreSQL.
The 'password' method requires the client to supply an unencrypted password, and since the password is sent in clear text over the network, this should not be used on untrusted networks.
If md5 is specified as a method in pg_hba.conf but the user's password on the server is encrypted for SCRAM, then SCRAM-based authentication will automatically be chosen.
trust is the default authentication method for ease of installation.
Files containing the server certificate and private key must exist, by default named server.crt and server.key in the server's data directory.
Prefix the file name with @ to specify a separate file containing database names and/or regular expressions or user names.
It is generally easier to control who can connect to which databases by granting or revoking the CONNECT privilege rather than by placing rules in pg_hba.conf.
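For example (database and role names are hypothetical):
REVOKE CONNECT ON DATABASE appdb FROM PUBLIC;
GRANT CONNECT ON DATABASE appdb TO app_role;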
The hba_file configuration parameter can be used to specify a custom location for pg_hba.conf.
The 'samerole' keyword matches if the requested user is a member of the role with the same name as the requested database.
No, users must pass both pg_hba.conf checks AND possess CONNECT privilege for the database.
Blank lines are ignored, as is any text after the # comment character.
The 'trust' authentication method allows unconditional acceptance without requiring a password.
The pg_hba_file_rules view provides a summary of the contents of the pg_hba.conf file.
local (Unix-domain socket), host (TCP/IP), hostssl (SSL only), hostnossl (non-SSL only), hostgssenc (GSSAPI encryption only), and hostnogssenc (non-GSSAPI only).
The 'replication' keyword matches physical replication connections only (not logical replication connections).
A host name specification that starts with a dot (.) matches a suffix of the actual host name. For example, .example.com matches foo.example.com but not example.com.
The MD5 hash algorithm is no longer considered secure against determined attacks.
Functions - Pattern Matching
47 questions
The 'q' option treats the rest of the RE as a literal string (all characters treated as ordinary characters).
When subexpr is 0 (or omitted), the result identifies the position of the whole match regardless of parenthesized subexpressions.
\Z matches only at the end of the string, regardless of newline-sensitive mode.
word (alphanumeric plus underscore) and ascii (7-bit ASCII characters).
A non-zero subexpr value specifies which parenthesized subexpression to use, numbered in the order of their leading parentheses.
\n (where n is 1-9) represents the source substring matching the nth parenthesized subexpression of the pattern.
regexp_count(), regexp_instr(), and regexp_like() were added in PostgreSQL 15 along with regexp_substr().
Yes. Regular expressions can be contrived that take arbitrary amounts of time and memory to process. This is an important performance consideration.
substring(string FROM pattern) uses POSIX regular expressions (two-parameter form), and substring(string FROM pattern FOR escape-character) uses SQL regular expressions (three-parameter form).
A word is defined as a sequence of word characters that is neither preceded nor followed by word characters.
(?=re) for positive lookahead, (?!re) for negative lookahead, (?<=re) for positive lookbehind, and (?<!re) for negative lookbehind.
The 'n' flag enables newline-sensitive matching; 'm' is accepted as a historical synonym.
PostgreSQL always initially presumes that a regular expression follows the ARE (Advanced Regular Expression) rules.
Quantifiers are greedy by default, preferring the longest match.
When N=0 is used with the 'g' flag, it replaces all occurrences of the pattern.
Prepend an embedded option to the pattern: (?e) switches to ERE mode, and (?b) switches to BRE mode.
\A matches only at the beginning of the string, regardless of newline-sensitive mode.
An equivalence class is enclosed in [= and =], representing all collating elements equivalent to that one, including itself.
(?e) switches the regex to ERE (Extended Regular Expression) mode.
\Y matches only at a point that is not the beginning or end of a word (non-boundary).
The 'x' flag enables expanded syntax, where whitespace and comments (from # to end of line) are ignored.
regexp_matches() returns no rows (an empty set) if there is no match, not NULL. This is because it is defined as a set-returning function.
In newline-sensitive mode, ^ and $ match after and before newlines respectively, but \A and \Z continue to match only at the beginning and end of the string.
If an RE begins with ***:, the rest of the RE is taken as an ARE (Advanced Regular Expression). This can override ERE or BRE mode specified by the flags parameter.
~~ is equivalent to LIKE, ~~* corresponds to ILIKE, !~~ represents NOT LIKE, and !~~* represents NOT ILIKE.
When endoption is 0 (or omitted), regexp_instr() returns the position of the first character of the match.
PostgreSQL defaults to backslash (\) as the escape character. According to the SQL standard, omitting ESCAPE means there is no escape character, not defaulting to backslash. PostgreSQL's behavior is nonstandard.
POSIX character classes are interpreted according to the prevailing locale, which can be controlled by attaching a COLLATE clause to the operator or function.
Back references use \n where n is 1 through 9. For example, ([bc])\1 matches 'bb' or 'cc' but not 'bc'.
Without the 'g' flag, regexp_replace() replaces only the first match by default.
The underscore (_) matches any single character, and the percent sign (%) matches any sequence of zero or more characters.
The 'g' flag causes the function to find or replace all matches, not just the first one.
PostgreSQL uses a software package written by Henry Spencer. PostgreSQL 7.4 and later use the exact same regular expression engine that was developed by Henry Spencer for Tcl 8.2.
No. The period (.) is not a metacharacter for SIMILAR TO. It matches only a literal period character, unlike in POSIX regular expressions where it matches any single character.
No. Lookahead and lookbehind constraints cannot contain back references.
\w matches any word character (alphanumeric plus underscore, equivalent to [[:word:]]).
alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, and xdigit.
\y matches only at the beginning or end of a word (word boundary).
LIKE pattern matching always covers the entire string. To match a pattern anywhere within a string, the pattern must start and end with a percent sign.
Yes. PostgreSQL 18 added support for LIKE with nondeterministic collations.
Streaming Replication > Replication Slots
46 questions
wal_removed, rows_removed, wal_level_insufficient, or idle_timeout
pg_create_logical_replication_slot(slot_name name, plugin name [, temporary boolean, two_phase boolean, failover boolean]) → record (slot_name name, lsn pg_lsn)
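For example, creating a slot for the built-in pgoutput plugin (the slot name is hypothetical):
SELECT * FROM pg_create_logical_replication_slot('my_slot', 'pgoutput');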
Yes, replication slots are designed to survive system failures and persist independently of the connection using them.
When set to true, the slot is enabled to be synced to standby servers so that logical replication can be resumed after failover.
The oldest WAL LSN (Log Sequence Number) that might be needed by this slot. It represents the starting WAL location required by the consumer.
pg_create_physical_replication_slot(slot_name name [, immediately_reserve boolean, temporary boolean]) → record (slot_name name, lsn pg_lsn)
The number of bytes of WAL that can be written before this slot gets in danger of having WAL files removed.
sync_replication_slots must be enabled, hot_standby_feedback must be enabled, primary_slot_name should be configured, and a valid dbname must be specified in primary_conninfo.
WAL files have been removed and the slot is no longer valid; no recovery is possible.
WAL will be reserved immediately from the most recent checkpoint, regardless of whether a client is connected.
During checkpoint, so there can be some lag between when the timeout was exceeded and when the slot is actually invalidated.
No, changes can be applied with pg_reload_conf(), pg_ctl reload or SIGHUP.
The replica will be invalidated as soon as a checkpoint occurs.
hot_standby_feedback must be enabled on the standby, a physical replication slot is required between primary and standby, and wal_level on the primary must be logical or higher.
The address (LSN) up to which the logical slot's consumer has confirmed receiving data.
Whether the slot was synced from the primary server (relevant for failover slots on standbys).
Yes, logical replication allows replication across different major versions of PostgreSQL.
True if the slot is currently being used by a replication connection, False if inactive.
When set to true, it enables decoding of prepared transactions (PREPARE TRANSACTION, COMMIT PREPARED, ROLLBACK PREPARED).
pg_replication_slot_advance(slot_name name, upto_lsn pg_lsn) → record
No, it is not applicable for slots that do not reserve WAL or for synced standby slots.
Foreign Data Wrappers
46 questions
No, the REFERENCING clause (OLD TABLE or NEW TABLE) is not allowed for foreign table triggers. This option is only allowed for AFTER triggers on plain tables.
If (number of columns) * batch_size is greater than 65,535, PostgreSQL automatically sets batch_size to 65535 / (number of columns).
postgres_fdw_disconnect_all() returns boolean. It returns true if it disconnects at least one connection, otherwise false.
The handler function must be registered as taking no arguments and returning the special pseudo-type fdw_handler.
Creating a server requires USAGE privilege on the foreign-data wrapper being used.
user, password, and sslpassword are not allowed in CREATE SERVER. These must be specified in a user mapping instead, or use a service file.
No, constraints on foreign tables are not enforced by the core PostgreSQL system. A constraint attached to a foreign table should represent a constraint that is being enforced by the remote server.
Yes, a user can create a user mapping for their own user name if USAGE privilege on the server has been granted to the user.
No, the extensions option can only be specified for foreign servers, not per-table.
To run IMPORT FOREIGN SCHEMA, the user must have the USAGE privilege on the foreign server and the CREATE privilege on the target local schema.
The HANDLER and VALIDATOR clauses are PostgreSQL extensions to the SQL/MED standard.
No, CHECK expressions cannot contain subqueries or variable references beyond the current row.
If set to off, all connections to this foreign server will be discarded at the end of each transaction.
Parallel commit support (parallel_commit option) for postgres_fdw was introduced in PostgreSQL 15.
Foreign tables support BEFORE and AFTER row-level and statement-level triggers for INSERT, UPDATE, and DELETE operations.
The extensions option is a comma-separated list of names of PostgreSQL extensions that are installed, in compatible versions, on both the local and remote servers. Functions and operators that are immutable and belong to a listed extension will be considered shippable to the remote server.
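For example, declaring extensions as shippable on an existing foreign server (the server name and extension list are hypothetical):
ALTER SERVER remote_srv OPTIONS (ADD extensions 'postgis,pg_trgm');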
No, routing of tuples to foreign-table partitions is not supported.
To create a foreign table, you must have USAGE privilege on the foreign server, as well as USAGE privilege on all column types used in the table.
RESTRICT is the default. It will refuse to drop the foreign-data wrapper if any objects depend on it.
The standard clauses LIBRARY and LANGUAGE are not implemented in PostgreSQL.
For servers, the USAGE privilege enables the grantee to create, alter, and drop their own user mappings associated with that server, create foreign tables using the server, and query the options of the server and associated user mappings.
No, it is currently not allowed to create the foreign table as a partition if there are UNIQUE indexes on the parent table.
Only CHECK and NOT NULL constraints are supported on foreign tables.
No, by default, when running ANALYZE without specifying a table, foreign tables are not analyzed. Only regular tables, partitioned tables, and materialized views are included.
When the local transaction uses READ COMMITTED isolation level, the remote transaction uses REPEATABLE READ isolation level (not READ COMMITTED).
Asynchronous execution support (async_capable option) for postgres_fdw was introduced in PostgreSQL 14.
PostgreSQL implements portions of the SQL/MED (SQL Management of External Data) specification, ISO/IEC 9075-9.
When adding constraints, nothing is done to verify the constraint is correct; the action simply declares that the condition should be assumed to hold.
Yes, foreign tables can be specified as partitions of partitioned tables.
The SET STORAGE clause sets the storage mode for a column, though the storage mode has no effect unless the foreign-data wrapper chooses to use it.
When the local transaction uses SERIALIZABLE isolation level, the remote transaction also uses SERIALIZABLE isolation level.
RENAME and SET SCHEMA cannot be combined with other alterations. All other actions can be combined into a list of multiple alterations to apply in parallel.
Functions - String
45 questions
A negative width value automatically enables left-justification (as if the - flag had been specified), and the field length used is the absolute value of the negative width.
The default count equals the length of the replacement substring (length(newsubstring)). The overlay() function replaces the same number of characters as the length of the replacement string.
The 'g' flag enables global matching, returning all matches instead of just the first. It is specific to regexp_matches() and regexp_replace().
char_length() counts characters (so a multibyte character counts as 1), while octet_length() counts bytes (so a multibyte character counts as multiple bytes depending on encoding).
A too-small width does not cause truncation of the output, but is simply ignored.
starts_with() returns boolean (true or false). It returns true if the string begins with the specified prefix.
Only the minus sign (-) flag is supported for left-justification. No other flags are currently supported.
The 'w' flag enables inverse partial newline-sensitive matching.
trim() removes spaces by default. The documentation states it 'removes the longest string containing only characters in characters (a space by default)'.
bit_length() returns the number of bits, which is octet_length() × 8. For example, a character that takes 3 bytes will return 24 bits.
Yes, split_part() supports negative indexing to count from the end. For example, split_part('abc,def,ghi,jkl', ',', -2) returns 'ghi' (the second-to-last field).
Both lpad() and rpad() truncate the string from the right side to the specified length when the input string is longer.
strpos() performs a case-sensitive search and returns 0 if the substring is not found (counting positions from 1).
When given a NULL delimiter, string_to_table() splits each character into separate rows.
With a negative start position, PostgreSQL treats it as counting 'invisible' characters to the left of the string. The practical behavior depends on the count parameter but doesn't count from the end like other databases.
The 'i' flag enables case-insensitive matching, overriding the operator type.
If the server encoding isn't UTF-8, the Unicode code point is converted to the actual server encoding. An error occurs if the conversion fails.
When given an empty string as delimiter, string_to_table() treats the entire input string as one field.
The 's' flag disables newline-sensitive matching, which is the default behavior.
concat() ignores NULL arguments. However, it returns NULL if the variadic array argument itself is NULL.
The %L format specifier (for SQL literals) displays NULL as the unquoted string NULL (not an empty string).
PostgreSQL uses 1-based indexing. For example, position('om' in 'Thomas') returns 3, where 'o' is at position 3, not 2.
The 'x' flag activates expanded syntax, which ignores whitespace and comments in the pattern.
Yes, string_agg() produces meaningfully different results depending on the order of input values. By default ordering is unspecified, but it can be controlled by writing an ORDER BY clause within the aggregate call.
The 'q' flag treats the rest of the pattern as a literal string (all ordinary characters).
When N is 0, regexp_replace() replaces all occurrences of the pattern, equivalent to using the 'g' global flag.
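A sketch of the six-argument form (PostgreSQL 15 and later; the sample string is arbitrary):
  SELECT regexp_replace('A1B2C3', '[0-9]', 'X', 1, 0);  -- 'AXBXCX' (N = 0 replaces every occurrence)
  SELECT regexp_replace('A1B2C3', '[0-9]', 'X', 1, 2);  -- 'A1BXC3' (only the second occurrence)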
concat_ws() ignores NULL arguments except for the first argument (the separator). If the separator is NULL, the function returns NULL.
Use encode(sha256('value'), 'hex') to convert the bytea output to hex-encoded text.
The 'e' flag treats the rest of the pattern as an Extended Regular Expression.
regexp_match() returns a single text array for the first match (or NULL), while regexp_matches() returns setof text[], providing all matches when the 'g' flag is used.
unistr() evaluates escaped Unicode characters in a string. It supports the formats \XXXX (4 hex digits), \+XXXXXX (6 hex digits), \uXXXX (4 hex digits), and \UXXXXXXXX (8 hex digits). Two backslashes represent a literal backslash.
The 'b' flag treats the rest of the pattern as a Basic Regular Expression.
Yes, casefold() can modify string length. For example, in PG_UNICODE_FAST collation, 'ß' (U+00DF) folds to 'ss', changing the length from 1 to 2 characters.
The 'm' flag is a historical synonym for 'n' (newline-sensitive matching).
If encoding is omitted, the database encoding is assumed, which in practice is the only useful case.
quote_nullable() returns the string 'NULL' (not a NULL value) when given NULL as input.
The default value of strict_mode is true. When enabled, extra characters after the last identifier trigger an error; when false, such characters are ignored.
In multi-byte encodings other than UTF8, chr() and ascii() are only usable for strictly ASCII characters (values 0-127, or 0x00-0x7F hex).
The libc provider doesn't support case folding, so casefold() is identical to lower() when using libc.
The default normalization form is NFC (Normalization Form Canonical Composition).
to_ascii() only supports conversion from LATIN1, LATIN2, LATIN9, and WIN1250 encodings.
Point-in-Time Recovery
45 questions
No, %r is optional and typically only used by warm-standby configurations for archive cleanup
On a primary, if archive_mode is enabled and the wait_for_archive parameter is true, pg_backup_stop does not return until the last segment has been archived
From oldest to newest. The first backup directory should be the full backup, and the last should be the final incremental backup
recovery.signal (for targeted recovery) or standby.signal (for hot standby). If both are present, standby.signal takes precedence
Yes, archive_command and archive_library can be changed with a configuration file reload
wal_keep_size (wal_keep_segments was renamed to wal_keep_size in PostgreSQL 13)
No, at most one of these parameters can be used. If more than one is specified in the configuration file, the last entry will be used
WAL files are in $PGDATA/pg_wal by default, but a different location can be specified with -X/--waldir during initdb, creating a symbolic link
Three values: lsn (WAL location), labelfile (backup_label content), spcmapfile (tablespace_map content)
The server removes recovery.signal upon successful completion to prevent accidental re-entry into recovery mode
No, WAL summarization cannot be enabled when wal_level is set to minimal
recovery_target = 'immediate' stops recovery as soon as a consistent state is reached
Zero exit status if and only if it succeeds. A nonzero status tells PostgreSQL that the file was not archived and it will retry periodically
In a recovery.conf file placed in the data directory. This file was removed in PostgreSQL 12 and replaced with signal files and postgresql.conf parameters
Two replication connections total (one for backup, one for WAL streaming)
%f (filename to retrieve), %p (copy destination path on server), %r (name of file containing last valid restart point, optional)
Archiving is effectively disabled and WAL files will accumulate in pg_wal/ until a working archive_command is re-established
pg_wal/ contents (not the directory itself), postmaster.pid, postmaster.opts, pg_replslot/ contents, pg_dynshmem/, pg_notify/, pg_serial/, pg_snapshots/, pg_stat_tmp/, pg_subtrans/ contents, files/directories beginning with pgsql_tmp, and pg_internal.init files
test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f
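Placed in context in postgresql.conf (the archive directory path is purely illustrative):
  wal_level = replica
  archive_mode = on
  archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'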
Between 30 minutes and 1 hour (the default 5 minutes is clearly too low for production)
In the root directory of the backup (these are returned by pg_backup_stop() in non-exclusive mode)
Stop just after the specified recovery target. If false, stops just before the recovery target. Applies when recovery_target_lsn, recovery_target_time, or recovery_target_xid is specified
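A minimal point-in-time-recovery setup, assuming the same illustrative archive path and an arbitrary target timestamp (create recovery.signal in the data directory before starting the server):
  restore_command = 'cp /mnt/server/archivedir/%f %p'
  recovery_target_time = '2024-01-15 12:00:00'
  recovery_target_inclusive = false   -- stop just before the target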
A minute or so (though it will bloat your archive storage if set too short)
Full Text Search > Text Search Functions
44 questions
The nentry column represents the total number of occurrences of the word across all documents.
setweight has two forms: setweight(vector tsvector, weight "char") returns tsvector, and setweight(vector tsvector, weight "char", lexemes text[]) returns tsvector.
No, ts_rank_cd requires lexeme positional information to perform its calculation. It ignores any stripped lexemes, and if there are no unstripped lexemes in the input, the result will be zero.
Position values in tsvector can range from 1 to 16383. Values larger than 16383 are silently set to 16383.
Weight A corresponds to the ranking multiplier 1.0 (highest priority).
json_to_tsvector([config regconfig,] document json, filter jsonb) returns tsvector
ts_debug returns six columns: alias, description, token, dictionaries, dictionary, and lexemes.
The <-> operator is the FOLLOWED BY operator, which means the left operand must be followed immediately by the right operand (distance of 1).
No, phraseto_tsquery does not discard stop words. Instead, it uses them to compute the number of semantically less important words between lexemes, expressed with the <N> (FOLLOWED BY) operator.
The <N> operator matches when its right operand appears exactly N lexeme positions after its left operand; <1> is equivalent to <->.
The precedence from highest to lowest is: ! (NOT), <-> (FOLLOWED BY), & (AND), | (OR). Parentheses can enforce grouping.
websearch_to_tsquery([config regconfig,] querytext text) returns tsquery. The config parameter is optional and defaults to default_text_search_config if omitted.
to_tsvector([config regconfig,] document text) returns tsvector. The config parameter is optional and defaults to default_text_search_config if omitted.
The default value is pg_catalog.simple, though it is commonly set to pg_catalog.english or other language-specific configurations.
PostgreSQL 10 added to_tsvector(json) and to_tsvector(jsonb) support, but these only process string values from JSON. PostgreSQL 11 added dedicated json_to_tsvector/jsonb_to_tsvector functions with filter parameters.
The default weights array is {0.1, 0.2, 0.4, 1.0} in order D, C, B, A.
The default value for ShortWord is 3, which eliminates common English articles.
The || operator is used to concatenate two tsvector values, combining their lexemes and preserving positional information and weight labels.
Normalization value 32 divides the rank by itself + 1, which scales all ranks into the range zero to one.
ts_rewrite(query tsquery, target tsquery, substitute tsquery) returns tsquery
ts_parse has two forms: ts_parse(parser_name text, document text, OUT tokid integer, OUT token text) returns setof record, and ts_parse(parser_oid oid, document text, OUT tokid integer, OUT token text) returns setof record.
ts_rank([weights real[],] vector tsvector, query tsquery [, normalization integer]) returns real
The && operator performs AND combination of two tsquery values, creating a query that matches documents containing both queries.
ts_rewrite(query tsquery, select text) returns tsquery. The select must yield two columns of tsquery type (target and substitute).
phraseto_tsquery([config regconfig,] querytext text) returns tsquery. The config parameter is optional and defaults to default_text_search_config if omitted.
ts_delete(vector tsvector, lexeme text) returns tsvector. It removes the given lexeme from the vector.
Normalization value 8 divides the rank by the number of unique words in the document.
length(tsvector) returns an integer representing the number of distinct lexemes stored in the tsvector.
phraseto_tsquery inserts the <-> (FOLLOWED BY) operator between surviving words, making it suitable for phrase searches where word order matters.
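For example, using the built-in english configuration:
  SELECT phraseto_tsquery('english', 'The Fat Rats');      -- 'fat' <-> 'rat'
  SELECT phraseto_tsquery('english', 'The Fat and Rats');  -- 'fat' <2> 'rat' (the stop word widens the distance)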
The @> operator tests if the left tsquery contains the right tsquery, while <@ tests if the left is contained by the right. These operators consider only the lexemes, ignoring combining operators.
plainto_tsquery inserts the & (AND) operator between surviving words after discarding punctuation and stop words.
to_tsquery recognizes & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY) operators. It requires properly formatted input with these operators.
ts_lexize returns an empty array ({}) for a stop word, indicating the dictionary knows the token but it is a stop word.
ts_parse returns two columns: tokid (an integer showing the assigned token type) and token (the text of the token).
The valid weight values are A, B, C, or D (single characters). D is the default for new vectors.
An empty array ({}) in the lexemes column means the token was recognized as a stop word.
The ndoc column represents the number of documents (tsvectors) in which the word occurred.
The default value for MaxFragments is 0, which selects a non-fragment-based headline generation method.
ts_rank ranks vectors based on the frequency of matching lexemes, while ts_rank_cd computes cover density ranking which also considers the proximity of matching lexemes to each other in the document.
Server Configuration > Connections and Authentication
44 questions
The default is empty, meaning no CRL file is loaded (unless ssl_crl_dir is set).
Roles must have privileges of the pg_use_reserved_connections role to use connection slots reserved by reserved_connections.
The default is empty, meaning no CA file is loaded, and client certificate verification is not performed.
If this value is specified without units, it is taken as seconds.
This parameter is supported only on systems that support TCP_USER_TIMEOUT (which does not include Windows); on other systems, it must be zero.
The default value is 0, which selects the operating system's default.
The default value is empty string, which allows any protocol version (no upper limit is imposed).
If this value is specified without units, it is taken as seconds.
Only available on operating systems that expose the required kernel events: Linux (via the POLLRDHUP extension to the poll system call), macOS, illumos, and the BSD family.
The special entry * corresponds to all available IP interfaces. The entry 0.0.0.0 allows listening for all IPv4 addresses and :: allows listening for all IPv6 addresses.
The minimum is 1 second (1s) and the maximum is 600 seconds (10min).
On a standby server, max_connections must be set to the same or higher value than on the primary server. Otherwise, queries will not be allowed in the standby server.
When off (the default), ssl_passphrase_command will be ignored during a configuration reload and the SSL configuration will not be reloaded if a passphrase is needed.
On Windows, setting a value of 0 will set this parameter to 1 second, since Windows does not provide a way to read the system default value.
The default value is 0, which selects the operating system's default.
The value of reserved_connections must be less than max_connections minus superuser_reserved_connections.
If this value is specified without units, it is taken as milliseconds.
No, only connections using TLS version 1.2 and lower are affected.
The default value is X25519:prime256v1 (in PostgreSQL 18+; before version 18.0, the parameter was named ssl_ecdh_curve with default prime256v1).
The reserved_connections parameter was introduced in PostgreSQL 16.
The default is empty, in which case compiled-in default DH parameters are used.
Yes, MD5 password support is deprecated and will be removed in a future release of PostgreSQL.
The default value is empty, which uses the default group of the server user.
The default value is FILE:/usr/local/pgsql/etc/krb5.keytab (where the directory part is whatever was specified as sysconfdir at build time).
The default value is 0, which selects the operating system's default.
On Unix systems, the permissions on server.key must disallow any access to world or group, achieved by the command chmod 0600 server.key.
Valid values are TLSv1, TLSv1.1, TLSv1.2, and TLSv1.3 (though older versions of OpenSSL may not support all values).
Data Definition (DDL) > Sequences
44 questions
The default is NO CYCLE, meaning nextval calls after reaching the maximum will throw an error rather than wrapping around.
Using regclass creates a dependency on the sequence (preventing it from being dropped without CASCADE), while using text does not create a dependency.
The default MINVALUE for a descending sequence is the minimum value of the data type.
lastval() requires USAGE or SELECT privilege on the last-used sequence.
The valid data types are smallint, integer, and bigint. No other data types are supported.
No, sequence state changes made by setval are immediately visible to other transactions and are not undone if the calling transaction rolls back.
Each session will allocate and cache successive sequence values independently, potentially creating gaps when sessions end without consuming cached values. Multiple sessions are guaranteed distinct values but not sequential ordering.
Only those sequences are shown that the current user has access to (by way of being the owner or having some privilege).
The minimum CACHE value is 1 (only one value can be generated at a time, i.e., no cache).
IF EXISTS prevents errors if the sequence doesn't exist. Instead of throwing an error, the system issues a notice.
When is_called is true, the sequence is advanced before the next value is returned, so the next nextval() returns value plus the increment (value + 1 with the default increment of 1).
GENERATED ALWAYS AS IDENTITY and GENERATED BY DEFAULT AS IDENTITY.
No, sequence objects cannot be used if 'gapless' assignment of sequence numbers is needed, because nextval and setval calls are never rolled back and can create gaps.
A user-specified value is only accepted if the INSERT statement specifies OVERRIDING SYSTEM VALUE.
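A short sketch (table and column names are hypothetical):
  CREATE TABLE items (id integer GENERATED ALWAYS AS IDENTITY, name text);
  INSERT INTO items (id, name) VALUES (100, 'widget');                          -- rejected
  INSERT INTO items (id, name) OVERRIDING SYSTEM VALUE VALUES (100, 'widget');  -- accepted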
Sequences are based on bigint arithmetic, so the range cannot exceed -9223372036854775808 to 9223372036854775807 (eight-byte integer bounds).
Calling currval() before nextval() in the current session will throw an error because currval returns the value most recently obtained by nextval in the current session.
No, lastval() returns the value most recently returned by nextval() in the current session regardless of which sequence was called, without taking a sequence name argument.
These values are stored as character strings rather than numeric types, following SQL standard formatting.
The user-specified value takes precedence and is inserted, overriding the sequence-generated value.
The last_value column returns null if the sequence hasn't been accessed yet, the current user lacks USAGE or SELECT privileges, or the sequence is unlogged on a standby server.
OWNED BY causes the sequence to be associated with a specific table column, such that if that column (or its whole table) is dropped, the sequence will be automatically dropped as well.
No, ALTER SEQUENCE cannot be applied to change the logged/unlogged status of a temporary sequence. Only non-temporary sequences can be converted between logged and unlogged.
setval() returns bigint (specifically, it returns the second argument value that was passed in).
No, identity columns do NOT guarantee uniqueness. You must add a PRIMARY KEY or UNIQUE constraint to ensure uniqueness.
bigint is the default data type when no type is specified in CREATE SEQUENCE.
No, child tables in table inheritance do not automatically inherit identity properties from parents.
No, granting permission on a table does not automatically extend permissions to any sequences used by the table, including sequences tied to SERIAL columns. Permissions on sequences must be set separately.
When CYCLE is enabled, an ascending sequence resets to minvalue upon reaching its maximum.
When is_called is false, the next nextval() will return exactly the specified value, and sequence advancement commences with the following nextval.
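Both forms side by side (the sequence name is hypothetical):
  SELECT setval('my_seq', 10);         -- is_called defaults to true; the next nextval('my_seq') returns 11
  SELECT setval('my_seq', 10, false);  -- the next nextval('my_seq') returns exactly 10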
serial creates integer (range 1 to 2,147,483,647), bigserial creates bigint (range 1 to 9,223,372,036,854,775,807), and smallserial creates smallint (range 1 to 32,767).
Using regclass provides 'early binding' where the OID tracks the originally identified object despite later renaming or schema reassignment. Using text provides 'late binding' where the object reference is resolved at run time.
The default MAXVALUE for an ascending sequence is the maximum value of the data type.
Calling lastval() before nextval() has been called in the session produces the error: 'lastval is not yet defined in this session'.
The default is RESTRICT, which refuses to drop the sequence if any objects depend on it.
Routine Maintenance > Bloat Management
43 questions
No. Plain VACUUM reclaims space and makes it available for re-use within the same table, but extra space is not returned to the operating system in most cases.
No. When HOT updates occur, new index entries are not required for updated rows (though summary indexes like BRIN may still need updates).
No exclusive lock; VACUUM can operate in parallel with normal reading and writing of the table.
Two conditions: 1) the update does not modify any columns referenced by the table's indexes (excluding summarizing indexes), and 2) there is sufficient free space on the page containing the original row to accommodate the updated version.
No. Autovacuum workers do not count toward max_connections limits.
No. Exclusion constraint indexes cannot be reindexed concurrently.
ACCESS EXCLUSIVE lock on each table while it is being processed, which blocks all reads and writes.
When triggered, any cost-based delay that is in effect will no longer be applied, non-essential maintenance tasks (such as index vacuuming) are bypassed, and any Buffer Access Strategy in use will be disabled, allowing VACUUM to use all of shared buffers.
Yes. VACUUM FULL rewrites the entire contents of the table into a new disk file with no extra space, allowing unused space to be returned to the operating system.
Yes. VACUUM normally processes the corresponding TOAST table for each relation when one exists, and this is the default behavior.
No. REINDEX cannot execute inside transaction blocks when used with partitioned indexes/tables or with SCHEMA/DATABASE/SYSTEM options.
ACCESS EXCLUSIVE lock on the specific index being processed, which locks out writes but not reads on the parent table.
ACCESS EXCLUSIVE lock, which prevents any other database operations (both reads and writes) until CLUSTER is finished.
Zero (0). FREEZE is equivalent to setting both parameters to zero for aggressive tuple freezing.
The index size must be more than min_parallel_index_scan_size (default 512kB).
No. Autovacuum cannot process partitioned tables directly; it processes partitions individually.
Logical Replication > Publications
43 questions
For boolean parameters, omitting the '= value' part is equivalent to specifying TRUE. For example, 'publish_via_partition_root' is the same as 'publish_via_partition_root = true'.
'insert, update, delete, truncate' - all DML operations are published by default.
The pg_publication_namespace catalog contains the mapping between schemas and publications in the database (many-to-many mapping).
PostgreSQL 18. Prior to version 18, logical replication does not publish GENERATED columns at all.
No, unlogged tables cannot be part of a publication because their contents are not written to the write-ahead log.
The pg_publication catalog contains all publications created in the database.
When a partitioned table is added to a publication, all of its existing and future partitions are implicitly considered to be part of the publication.
Subscribers on PostgreSQL versions before 15 copy all table columns during initial sync, ignoring column lists.
Yes, if FOR TABLE, FOR ALL TABLES or FOR TABLES IN SCHEMA are not specified, the publication starts out with an empty set of tables. This is useful if tables or schemas are to be added later.
false - changes are published using the identity of the individual partitions by default, not the partitioned table.
Initial table synchronization won't copy generated columns even if publish_generated_columns is 'stored' on the publisher.
No, INSERT operations work regardless of replica identity status.
DEFAULT, which uses the primary key if available. If there is no primary key, the behavior is the same as NOTHING.
Yes, both operations are transactional, so the table will start or stop replicating at the correct snapshot once the transaction has committed.
Yes, each table can be added to multiple publications if needed.
The row filter WHERE clauses are OR-ed together, so rows satisfying any of the expressions will be replicated. If one publication has no row filter, all other row filters for that table are ignored.
Attaching a table does not result in the table's existing contents being replicated.
No, specifying a column list when the publication also publishes FOR TABLES IN SCHEMA is not supported.
It indicates whether the publication automatically includes all current and future tables (FOR ALL TABLES).
It replaces the entire list of tables in the publication with the specified list; existing tables are removed.
CREATE privilege on the current database (superusers are exempt from this requirement).
No, if the table uses REPLICA IDENTITY FULL, specifying a column list will cause publication errors for UPDATE or DELETE operations.
Yes, parentheses are required around the expression in row filters.
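A minimal sketch on PostgreSQL 15 or later (publication, table, and column names are hypothetical):
  CREATE PUBLICATION active_customers FOR TABLE customers WHERE (active IS TRUE);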
No, PostgreSQL does not support IF NOT EXISTS for CREATE PUBLICATION.
If the old row matches but new doesn't, it's transformed to DELETE. If the old row doesn't match but new does, it's transformed to INSERT. If both match, standard UPDATE replication occurs. If neither matches, no replication occurs.
The row filter WHERE clause must contain only columns that are covered by the replica identity.
No, it is not possible to add tables to, or drop tables from a publication which was defined as FOR ALL TABLES.
No, tables with REPLICA IDENTITY NOTHING cannot support UPDATE or DELETE operations when included in a publication.
You must be able to SET ROLE to the new owning role, and that role must have CREATE privilege on the database.
No, sequence data is not replicated in PostgreSQL 18 and earlier. Sequence replication support is coming in PostgreSQL 19.
Only immutable built-in functions are permitted. User-defined functions, operators, types, collations, and non-immutable built-in functions cannot be used.
No, tables with DEFAULT replica identity and no primary key cannot support UPDATE or DELETE operations.
Initial table synchronization fires both row and statement triggers for INSERT (implemented like a COPY command). Regular replication only fires row triggers, not statement triggers.
Query Processing
43 questions
The default value of enable_incremental_sort is 'on'. This enables or disables the query planner's use of incremental sort steps.
The default value of TIMING is TRUE (when ANALYZE is enabled). This shows actual startup time and per-node execution times. TIMING can only be used when ANALYZE is also enabled.
The default value of cpu_index_tuple_cost is 0.005. This represents the cost of processing each index entry during a scan.
The default value of plan_cache_mode is 'auto'. This allows PostgreSQL to automatically choose whether to use a generic or custom plan for prepared statements. Other values are 'force_custom_plan' and 'force_generic_plan'.
The default value of from_collapse_limit is 8. The planner merges sub-queries into upper queries if the resulting FROM list has no more than this many items.
The default value of max_parallel_workers_per_gather is 2. This sets the maximum number of parallel workers that can be started by a single Gather or Gather Merge node.
The default value of parallel_tuple_cost is 0.1. This sets the planner's estimate of the cost of transferring one tuple from a parallel worker process to another process.
CREATE VIEW creates a relation and an entry in pg_rewrite that tells there is a rewrite rule that must be applied whenever the relation is referenced in a query. The action of the rule is one query tree that is a copy of the SELECT statement in the view creation command. Views are implemented using the rule system.
Set join_collapse_limit to 1 to prevent PostgreSQL from reordering explicit JOIN clauses and force the planner to follow the join order you specified.
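A sketch with hypothetical tables a, b, and c:
  SET join_collapse_limit = 1;
  SELECT *
  FROM a
  JOIN b ON b.a_id = a.id
  JOIN c ON c.b_id = b.id;  -- planned in exactly this order: a joined to b, then to c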
The default value of jit_inline_above_cost is 500000. This is the threshold above which JIT will perform function and operator inlining.
The default value of enable_memoize is 'on'. This enables or disables the query planner's use of memoization for caching results during parameterized scans of the inner side of nested loop joins.
The default value of enable_partitionwise_aggregate is 'off'. When enabled, this permits grouping or aggregation on partitioned tables to be performed separately for each partition.
The parser and lexer are implemented using bison and flex. The lexer is defined in scan.l and transformed to scan.c using flex, while the parser grammar is defined in gram.y and transformed to gram.c using bison.
The default value of min_parallel_table_scan_size is 8MB. This is the minimum amount of table data required for a parallel scan to be considered.
The default value of geqo_threshold is 12. This means the Genetic Query Optimizer is used for queries with at least 12 FROM items involved.
The default value of enable_indexscan is 'on'. This enables or disables the query planner's use of index-scan and index-only-scan plan types.
No. The parser stage does not make any lookups in the system catalogs, so there is no possibility to understand the detailed semantics of the requested operations. It only validates syntax and builds a parse tree.
The default value of join_collapse_limit is 8 (same as from_collapse_limit). The planner will rewrite explicit JOIN constructs into lists of FROM items whenever a list of no more than this many items would result.
The default value of constraint_exclusion is 'partition'. This causes the constraint exclusion technique to be applied only to queries that are likely to be working on inheritance partitioned tables, not to all queries.
Yes. The cost of an upper-level node includes the cost of all its child nodes, so costs are cumulative in the query plan tree.
The default value of random_page_cost is 4.0. This represents the cost estimate for non-sequential disk page access.
The default value of BUFFERS is FALSE, but it is automatically enabled when ANALYZE is used. BUFFERS provides buffer usage statistics including shared blocks hit, read, dirtied, and written.
PostgreSQL executes the first five executions of a prepared statement with custom plans, then creates a generic plan and compares its estimated cost to the average custom-plan cost. If the generic plan is not significantly more expensive, it will be used for subsequent executions.
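A sketch of observing and overriding this behavior (the users table is hypothetical):
  PREPARE get_user(int) AS SELECT * FROM users WHERE id = $1;
  EXECUTE get_user(42);                      -- the first five executions get custom plans
  SET plan_cache_mode = force_custom_plan;   -- bypass the generic-plan heuristic if it misfires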
The default value of enable_partitionwise_join is 'off'. When enabled, this allows joins between partitioned tables to join matching partitions separately, which can be more efficient.
The default value of enable_seqscan is 'on'. However, it is impossible to suppress sequential scans entirely even when this is set to 'off'.
PostgreSQL processes queries through five stages in this order: 1) Parser - generates a parse tree from SQL text, 2) Analyzer - performs semantic analysis and generates a query tree, 3) Rewriter - transforms the query tree using rules from the rule system, 4) Planner - generates the optimal execution plan, 5) Executor - executes the query by accessing tables and indexes.
The default FORMAT is TEXT. Other available formats are XML, JSON, and YAML.
Costs are measured in arbitrary units determined by the planner's cost parameters. Traditional practice is to measure costs in units of disk page fetches, where seq_page_cost is conventionally set to 1.0 and other cost parameters are set relative to that.
The default value of parallel_setup_cost is 1000. This sets the planner's estimate of the cost of launching parallel worker processes.
When default_statistics_target is 100, PostgreSQL samples 30,000 rows (300 × 100 = 30,000). The multiplier of 300 is based on research on random sampling for histogram construction.
The default value of enable_nestloop is 'on'. This enables or disables the query planner's use of nested-loop join plans.
Nested loop join is suitable for small tables or when one table is much smaller than the other. It's the only join method that doesn't need to scan the entire set (when index access is available), making it ideal for OLTP systems with short queries returning few rows. It's also the only option when there is no join condition with an = operator.
The default value of enable_sort is 'on'. This enables or disables the query planner's use of explicit sort steps.
The default value of cursor_tuple_fraction is 0.1. This sets the planner's estimate of the fraction of a cursor's rows that will be retrieved, biasing the planner toward fast-start plans that retrieve the first 10% of rows quickly.
The default value of default_statistics_target is 100. This controls the maximum number of entries in the most-common-value list and the maximum number of bins in the histogram for table statistics.
The default value of geqo_effort is 5. This parameter ranges from 1 to 10 and controls the trade-off between planning time and query plan quality in GEQO, with larger values increasing planning time but improving plan quality.
Merge join is best for joining sorted tables. It works well with both short OLTP queries and long OLAP ones, has linear complexity (both sets scanned only once), needs little memory, and can start output immediately. The only requirement is that the data sets must be pre-sorted.
The default value of min_parallel_index_scan_size is 512kB (512 kilobytes). This is the minimum amount of index data required for a parallel scan to be considered.
The default value of enable_mergejoin is 'on'. This enables or disables the query planner's use of merge-join plan types.
The default value of work_mem is 4MB (four megabytes). This sets the base maximum amount of memory to be used by a query operation before writing to temporary disk files.
The default value of enable_hashjoin is 'on'. This enables or disables the query planner's use of hash-join plan types.
PostgreSQL employs three available join strategies: nested loop join, merge join, and hash join.
The default value of COSTS is TRUE. This includes estimated startup and total costs, row counts, and row width for each node in the plan.
Server Configuration > Resource Consumption - Memory
43 questions
Only on Linux and Windows. On Linux, it is only supported when shared_memory_type is set to mmap (the default).
2MB and 1GB (Intel and AMD), 16MB and 16GB (IBM POWER), and 64kB, 2MB, 32MB and 1GB (ARM)
The server will try to request huge pages, but fall back to the default if that fails.
Setting it at 50% of the machine's total RAM is a normal conservative setting, and 3/4 of memory is a more aggressive but still reasonable amount.
Yes, when setting this parameter, you should consider both PostgreSQL's shared_buffers and the OS disk cache.
Up to autovacuum_max_workers times maintenance_work_mem (or autovacuum_work_mem if set), as each autovacuum worker can use this amount of memory.
0, which means the default huge page size on the system will be used.
The ideal setting is the actual stack size limit enforced by the kernel (as set by ulimit -s), less a safety margin of a megabyte or so.
Failure to request huge pages will prevent the server from starting up.
PostgreSQL will not permit max_stack_depth to be set higher than the kernel limit less 512kB.
2.0 (for PostgreSQL 15 and later). It was 1.0 in PostgreSQL 13 and 14.
mmap (anonymous shared memory via mmap), sysv (System V shared memory via shmget), and windows (Windows shared memory)
Hash-based operations can use work_mem × hash_mem_multiplier. With the default value of 2.0, they can use twice the work_mem base amount.
mmap (though the first supported option is the default for that platform)
No, this is not the number of rows that can be locked; that value is unlimited. Row locks are not limited by this parameter.
2147483647kB (approximately 2TB), but on Windows there is a platform-specific 2GB limit even for 64-bit systems.
Yes, this parameter can only be set at server start and requires a full server restart to take effect.
Because PostgreSQL also relies on the operating system cache for disk caching, and the dual-layer caching approach (PostgreSQL's shared_buffers + OS cache) typically outperforms having one very large database cache and a starved OS cache.
Because only one maintenance operation can be executed at a time by a database session, and an installation normally doesn't have many of them running concurrently.
shared_memory_type controls the implementation for the main shared memory region (which holds shared_buffers and other shared data), while dynamic_shared_memory_type controls dynamic shared memory segments used for things like parallel queries.
If you are using prepared transactions, you will probably want max_prepared_transactions to be at least as large as max_connections, so that every session can have a prepared transaction pending.
Backup and Restore > pg_basebackup
42 questions
Yes. pg_basebackup will create the target directory (and any missing parent directories) if it does not exist. If it already exists, it must be empty.
The main data directory is written to 'base.tar', and all other tablespaces are named after the tablespace OID.
Yes. The compression location can be specified as 'client' (default) or 'server'. Server-side compression reduces transfer bandwidth but increases server CPU consumption. Available starting in PostgreSQL 15.
The default checkpoint mode is 'spread'. The --checkpoint option sets checkpoint mode to fast (immediate) or spread (the default).
The backup_label file identifies the backup for recovery purposes and contains the label specified with the --label option (or the default 'pg_basebackup base backup' if none is specified).
The default checksum algorithm is CRC32C. Available algorithms are NONE, CRC32C, SHA224, SHA256, SHA384, and SHA512.
If the standby is promoted to be primary during backup, the backup fails.
Yes. A backup manifest is generated by default. It can be disabled with the --no-manifest option.
The -R option creates a standby.signal file in the output directory and appends connection settings to postgresql.auto.conf, automating the setup of a standby server with the necessary replication configuration.
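A typical invocation, with an illustrative host, user, and target directory:
  pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/standby -R -X stream -P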
WAL streaming mode (-X stream) only works with server version 9.3 and later.
No. When pg_basebackup is stopped with a signal, no cleanup action is taken.
No. The --waldir option can only be specified when the backup is in plain mode. It sets the directory to write WAL files to, separate from the main data directory.
By default, when pg_basebackup aborts with an error, it removes any directories it might have created before discovering that it cannot finish the job. This can be disabled with the --no-clean option.
The --create-slot option can only be used together with -X stream. It specifies that the replication slot named by the --slot option should be created before starting the backup, and an error is raised if the slot already exists.
Yes, when the backup is in tar format, the target directory may be specified as '-' (dash), causing the tar file to be written to stdout. This is only possible if the cluster has no additional tablespaces and WAL streaming is not used.
No. pg_basebackup cannot force WAL file switch when backing up from a standby.
Set up the standby so that it can accept replication connections by setting max_wal_senders, enabling hot_standby, and configuring pg_hba.conf appropriately.
Incremental backup (--incremental) only works with server version 17 and later.
pg_basebackup supports two format options: 'plain' (default) which writes output as plain files with the same layout as the source server's data directory, and 'tar' which writes the output as tar files.
pg_hba.conf must permit replication connections. The backup is made over a regular PostgreSQL connection that uses the replication protocol.
The summarize_wal parameter must be set to 'on' (default is 'off'). This enables the WAL summarizer process which tracks changes within the database for incremental backups.
Valid values are between 32 kilobytes per second (minimum) and 1024 megabytes per second (maximum).
Checksum failures are reported in the pg_stat_database view, which tracks the number of data page checksum failures detected in each database.
No. When backing up from a standby, the backup history file is not created.
When using --wal-method=fetch, it is necessary for the wal_keep_segments parameter (or wal_keep_size in PostgreSQL 13+) to be set high enough that the log is not removed before the end of the backup. If the log has been rotated when it's time to transfer it, the backup will fail and be unusable.
Replication slot names must contain only lower case letters, numbers, and the underscore character, and can be a maximum of 63 characters long (NAMEDATALEN).
To be effective, olddir must exactly match the path specification of the tablespace as it is defined on the source server. The path is not canonicalized, so if a tablespace was created with '/tmp//foo-old', you must map '/tmp//foo-old' exactly, not '/tmp/foo-old'.
The user must have REPLICATION permissions or be a superuser. For server-side backup storage, superuser privileges or the pg_write_server_files role is required.
The default target is 'client', which specifies that the backup should be sent to the machine where pg_basebackup is running. Alternative targets are 'server:/path' and 'blackhole'.
Yes. Checksums are verified by default and checksum failures will result in a non-zero exit status. Verification can be disabled with --no-verify-checksums.
The default status interval is 10 seconds. This specifies the number of seconds between status packets sent back to the source server.
Yes, starting in PostgreSQL 10. Temporary replication slots will be used by default when WAL streaming is used and no slot name is specified with -S. These slots are removed automatically at the end of the session or on error.
wal_level must be set to 'replica' or higher. The 'minimal' wal_level doesn't support PITR because some SQL commands are optimized to avoid WAL logging at that level.
Tar format mode (--format=tar) only works with server version 9.5 or later.
If the target is set to 'blackhole', the contents are discarded and not stored anywhere. This should only be used for testing purposes, as you will not end up with an actual backup.
The pg_stat_progress_basebackup view reports the progress of the backup whenever pg_basebackup is taking a base backup. It shows columns like backup_total, backup_streamed, tablespaces_total, and tablespaces_streamed.
No. Incremental backups cannot be used directly and require processing with pg_combinebackup to create a usable backup.
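A sketch of the full cycle on PostgreSQL 17 or later (paths are illustrative; summarize_wal must be on):
  pg_basebackup -D /backups/full -cfast
  pg_basebackup -D /backups/incr1 --incremental=/backups/full/backup_manifest -cfast
  pg_combinebackup -o /backups/restored /backups/full /backups/incr1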
If none is specified, a default value of 'pg_basebackup base backup' will be used.
If only a level is specified without mentioning an algorithm, gzip compression will be used if the level is greater than 0, and no compression will be used if the level is 0.
pg_basebackup works with servers of the same or an older major version, down to 9.1.
The default unit is kilobytes per second. A suffix of 'M' indicates megabytes per second. A suffix of 'k' is also accepted and has no effect since kilobytes is already the default.
Streaming Replication > Failover and Switchover
41 questions
It ensures logical replication failover slots don't consume changes until those changes are received and flushed to corresponding physical standbys.
pg_promote(wait boolean DEFAULT true, wait_seconds integer DEFAULT 60) → boolean
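For example:
  SELECT pg_promote();            -- wait up to 60 seconds (the defaults)
  SELECT pg_promote(true, 120);   -- wait up to 120 seconds for promotion to complete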
It enables a physical standby to synchronize logical failover slots from the primary server so that logical subscribers can resume replication from the new primary server after failover.
To make the standby server follow the timeline change that occurs at failover, ensuring downstream servers continue streaming from a promoted standby.
The slot must have attained synced state as true on the standby before failover.
pg_ctl promote (command-line utility) and pg_promote() (SQL function).
Any WAL immediately available in the archive or in pg_wal will be restored, but no attempt is made to connect to the primary.
WAL replay needs to complete. When the target server is started it will enter archive recovery and replay all WAL generated in the source server from the last checkpoint before the point of divergence.
Yes, the target server must be shut down cleanly before running pg_rewind.
Yes, if the primary server crashes, some committed transactions may not have been replicated to the standby server. The amount of data loss is proportional to the replication delay at the time of failover.
0 (the system doesn't keep any extra WAL segments for standby purposes)
No, PostgreSQL does not provide built-in failure detection software; external tools are required.
A comma-separated list of streaming replication standby server slot names (e.g., 'slot1,slot2,slot3')
Either wal_log_hints must be enabled in postgresql.conf OR data checksums must be enabled when the cluster was initialized with initdb.
Creates standby.signal and appends connection settings to postgresql.auto.conf to prepare the rewound server as a standby.
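A typical invocation (host and data directory are illustrative; the target cluster must be cleanly shut down first):
  pg_rewind --target-pgdata=/var/lib/postgresql/data \
            --source-server='host=new-primary.example.com user=postgres dbname=postgres' \
            --write-recovery-conf --progress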
[FIRST] num_sync (standby_name [, ...]), ANY num_sync (standby_name [, ...]), or standby_name [, ...] (legacy)
hot_standby_feedback will work but only while the connection is alive (e.g., a node restart would break it).
No, temporary synced slots cannot be used for logical decoding after failover.
SSL and Encryption
40 questionsssl_cert_file - defaults to $PGDATA/server.crt if not specified.
Set sslrootcert=system (available from PostgreSQL 16) to use the default certification authority certificates trusted by the operating system.
disable (only non-GSSAPI-encrypted connections), prefer (default, tries GSSAPI-encrypted if credentials present), and require (only GSSAPI-encrypted connections).
Yes, intermediate CA certificates can be appended to server.crt, but the root certificate should NOT be added.
The default location is server.key in the server's data directory ($PGDATA/server.key).
If the certificate name starts with an asterisk (*), it is treated as a wildcard that matches all characters except a dot (.), meaning it will not match subdomains.
It enforces encryption and verifies that the server certificate is signed by a trusted certificate authority, but does not verify the hostname.
%APPDATA%\postgresql\postgresql.crt for the certificate and %APPDATA%\postgresql\postgresql.key for the private key.
verify-ca (requires a valid, trusted certificate) and verify-full (requires a valid certificate and checks that the Common Name matches the username or mapping).
host - this allows both SSL and non-SSL connections unless paired with other authentication options.
Yes, OpenSSL must be installed on both client and server systems, and SSL support must be enabled at build time.
It is only provided as the default for backward compatibility. It silently falls back to an unencrypted connection if SSL is not available, providing no protection against passive eavesdropping.
SSL version 2 and SSL version 3 are always disabled (protocol versions before TLS 1.0).
No, with GSSAPI encryption no password is transmitted over the network.
No, passphrase-protected keys cannot be used on Windows systems.
SCRAM (more secure, Internet standard). MD5 is deprecated and will be removed in future releases.
cert - this method authenticates using SSL client certificates and is only available for SSL connections.
libpq version 15.2 added support for 0640 permissions (with root ownership) for client key files.
CN (default, matches Common Name) or DN (matches the entire Distinguished Name in RFC 2253 format).
hostnogssenc - this matches only TCP/IP connections that do not use GSSAPI encryption.
It disables SSL encryption entirely, providing no encryption or MITM protection with the lowest overhead.
An empty string, which means compiled-in default 2048-bit DH parameters are used.
The context is 'sighup', which means it can be reloaded without restarting the PostgreSQL server.
Yes, the first certificate in server.crt must match the server's private key in server.key.
The hostname is matched against the certificate's Subject Alternative Name (SAN) attributes, or against the Common Name if no SAN of type dNSName is present.
An empty string, which means no maximum limit (allows any protocol version).
Yes, using the cert authentication method is effectively the same as trust authentication with clientcert=verify-full, so using both is redundant.
It enforces encryption, verifies the certificate chain, and also checks that the server hostname matches the name stored in the server certificate. This is recommended for most security-sensitive environments.
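For example, from the client side (host, database, and certificate path are illustrative):
  psql "host=db.example.com dbname=app user=alice sslmode=verify-full sslrootcert=~/.postgresql/root.crt"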
TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256
hostssl - this record type matches connection attempts made using TCP/IP only when the connection is made with SSL encryption.
No, ssl_ciphers only applies to TLS versions 1.2 and lower. There is no setting for TLS 1.3 cipher choices in PostgreSQL versions before 18.
PostGIS > Geometry Types
40 questions
PostGIS 1.4 (as of PostGIS 1.4, PostGIS supports compound curves including CircularString).
PolyhedralSurface, Triangle, and TIN (Triangulated Irregular Network).
0 is returned when the dimension is unknown (e.g., for an empty GEOMETRYCOLLECTION).
At least two distinct points (at least two points that are not at the same location).
It depends on the geometry type. Non-simple (self-intersecting) LineStrings are valid. However, self-intersecting Polygons are not valid. Use ST_IsValid() to check polygon validity, not ST_IsSimple().
A MultiPolygon is valid if and only if all of its elements are valid and the interiors of no two elements intersect.
A MultiPoint is simple if no two coordinates (Points) are equal (have identical coordinate values).
enforce_dims_geom (ensures every geometry has the same dimension), enforce_geotype_geom (ensures every geometry is of the same type), and enforce_srid_geom (ensures every geometry is in the same projection).
At least 4 coordinate pairs (representing at least 3 distinct points), with the first and last coordinates being identical to close the ring.
Degrees (since geometry type uses Cartesian distance calculations, the units for SRID 4326 are degrees).
PostGIS 2.0 (since version 2.0.0, PostGIS by default uses type modifiers instead of check constraints to constrain geometry types).
A Multi* geometry (homogeneous collection like MultiPoint, MultiLineString, or MultiPolygon).
No. An empty geometry represents a known empty set (e.g., the intersection of two disjoint geometries), while NULL represents an unknown or missing value.
It indicates a Z coordinate dimension for elevation or height information.
Three distinct non-collinear vertices. However, because a Triangle is a polygon, it is specified by four coordinates, with the first and fourth being equal.
Approximately 33.5 million vertices (considering 16 bytes per vertex for 2D geometries).
16 bytes per coordinate (8 bytes for X + 8 bytes for Y), since PostGIS stores all coordinates at double precision.
It indicates an M (measure) coordinate dimension, commonly used for time, road-mile, or upstream-distance information.
Homogeneous collections (they contain only one type of geometry each).
24 bytes per coordinate (8 bytes for X + 8 bytes for Y + 8 bytes for Z).
An infinite Cartesian plane with no units assigned to its axes (undefined/Cartesian plane).
It indicates both Z and M coordinate dimensions (4D geometry with X, Y, Z, and M values).
A MultiLineString is simple only if all of its elements are simple and the only intersection between any two elements occurs at Points that are on the boundaries of both elements.
No. The geography type does not support curves, TINs, or PolyhedralSurfaces.
The "ST_" prefix (e.g., 'ST_Point', 'ST_LineString', 'ST_Polygon').
A LinearRing is a closed, simple LineString (starts and ends on the same point and does not self-intersect) that forms the boundaries of polygons.
A single, continuous curve that has both curved (circular) segments and linear segments.
ST_IsValid() (not ST_IsSimple(), which cannot be used to test for invalid self-intersections in Polygons).
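For example, a self-intersecting 'bowtie' ring versus a simple square:
  SELECT ST_IsValid(ST_GeomFromText('POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))'));  -- false
  SELECT ST_IsValid(ST_GeomFromText('POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'));  -- true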
Extended Well-Known Binary, the WKB representation with SRID metadata.
Routine Maintenance > Autovacuum
40 questions
The default is 0.1 (10% of table size). This fraction of the table size is added to autovacuum_analyze_threshold when deciding whether to trigger an ANALYZE.
The default is -1, which means the regular vacuum_cost_limit value (default 200) will be used. The value is distributed proportionally among running autovacuum workers.
The default is 0.2 (20% of table size). This fraction of the table size is added to autovacuum_vacuum_threshold when deciding whether to trigger a VACUUM.
Yes. The system will launch autovacuum processes to prevent wraparound even when autovacuum is otherwise disabled. This is a critical safety feature.
With the default autovacuum_freeze_max_age of 200 million transactions, pg_xact storage grows to about 50MB.
The default is 1.6 billion transactions. This controls when VACUUM triggers the failsafe mechanism to prevent transaction ID wraparound failure.
The default is 150 million transactions. VACUUM performs an aggressive scan when the table's pg_class.relfrozenxid field reaches this age.
PostgreSQL emits a warning when the database's oldest XIDs reach 40 million transactions from the wraparound point, displaying messages like 'WARNING: database "mydb" must be vacuumed within 39985967 transactions'.
Wraparound-prevention autovacuum operations have query names in pg_stat_activity that end with '(to prevent wraparound)', for example: 'autovacuum: VACUUM schema.tablename (to prevent wraparound)'.
The default is on. The autovacuum parameter controls whether the server should run the autovacuum launcher daemon.
Yes. Autovacuum thresholds, scale factors, and other control parameters can be overridden for individual tables by changing table storage parameters using ALTER TABLE.
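For example, to vacuum a large, hypothetical table far more aggressively than the global defaults:
  ALTER TABLE big_history SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_threshold = 1000);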
VACUUM will silently limit the effective value of vacuum_multixact_freeze_min_age to half the value of autovacuum_multixact_freeze_max_age, so there is not an unreasonably short time between forced autovacuums.
The default is 2 milliseconds in PostgreSQL 12+. This was reduced from 20ms in earlier versions to make autovacuum more aggressive by default.
PostgreSQL will refuse to assign new XIDs once there are fewer than 3 million transactions left until wraparound, displaying an error: 'ERROR: database is not accepting commands that assign new XIDs to avoid wraparound data loss'.
The default is 0 (disabled) for manual VACUUM commands. This disables the cost-based vacuum delay feature.
The maximum allowed value is 2 billion (2,000,000,000) transactions, though the default is much lower at 200 million.
The pg_multixact/members storage area can grow to a maximum of approximately 20GB before wraparound occurs.
With autovacuum_freeze_max_age set to 2 billion and track_commit_timestamp enabled, pg_commit_ts can grow to about 20GB.
The default is 1000 tuples. This parameter was introduced in PostgreSQL 13 and specifies the number of inserted tuples needed to trigger a VACUUM for insert-only workloads.
The default is 200 million transactions. This specifies the maximum age (in transactions) that a table's pg_class.relfrozenxid field can attain before a VACUUM operation is forced to prevent transaction ID wraparound.
Transaction IDs are 32-bit values, which means they wrap around every 4 billion transactions.
The default is 50 tuples. This is the minimum number of inserted, updated or deleted tuples needed to trigger an ANALYZE in any one table.
The default is 1 minute (1min). This parameter specifies the minimum delay between autovacuum runs on any given database.
A normal autovacuum will terminate after about 1 second (the deadlock_timeout parameter) when it detects it's blocking other queries. Anti-wraparound autovacuum does not automatically terminate.
The default is 5 million multixacts. This specifies the cutoff age that VACUUM should use to decide whether to freeze pages with an older multixact ID.
vacuum threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples. When the number of dead tuples exceeds this threshold, VACUUM is triggered.
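Worked example with the defaults (threshold 50, scale factor 0.2): for a table of 1,000,000 rows, autovacuum triggers once dead tuples exceed 50 + 0.2 × 1,000,000 = 200,050.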
The default is 400 million multixacts. This specifies the maximum age (in multixacts) that a table's pg_class.relminmxid field can attain before a VACUUM operation is forced to prevent multixact ID wraparound.
When the storage occupied by multixacts members exceeds about 10GB, aggressive vacuum scans will occur more often for all tables.
The autovacuum_vacuum_cost_limit value is distributed proportionally among running autovacuum workers, so that the sum of the limits for each worker does not exceed the configured value. Workers with per-table settings are not included in the balancing algorithm.
With the default autovacuum_freeze_max_age of 200 million and track_commit_timestamp enabled, pg_commit_ts storage grows to about 2GB.
Every table must be vacuumed at least once every 2 billion transactions to prevent wraparound issues.
vacuum insert threshold = autovacuum_vacuum_insert_threshold + autovacuum_vacuum_insert_scale_factor × reltuples. This is used for insert-only workloads in PostgreSQL 13+.
When failsafe is triggered: (1) any cost-based delay is no longer applied, (2) non-essential maintenance tasks like index vacuuming are bypassed, and (3) any Buffer Access Strategy is disabled, allowing VACUUM to use all shared buffers.
track_counts must be set to true for autovacuum to work. Autovacuum uses the statistics collection facility and cannot be used unless track_counts is enabled.
The default is 50 million transactions. This specifies the cutoff age that VACUUM should use to decide whether to freeze row versions with an older XID.
Workers are launched approximately every autovacuum_naptime/N seconds, where N is the number of databases. For example, with 5 databases and default naptime of 1 minute, a worker launches every 12 seconds.
The default is 0.2 (20% of unfrozen pages in table) in PostgreSQL 16+. In PostgreSQL 13-15, it was 0.1 (10%).
The default is 200. This is the accumulated cost that will cause the vacuum process to sleep.
The default is 150 million multixacts. VACUUM performs an aggressive scan when the table's pg_class.relminmxid field reaches this age.
The default is 20 milliseconds in PostgreSQL 11 and earlier versions.
Functions - Aggregate
40 questions
regr_syy calculates sum(Y^2) - sum(Y)^2/N, which is the 'sum of squares' of the dependent variable.
covar_samp = (sum(x*y) - sum(x) * sum(y) / N) / (N - 1), or equivalently sum((x - avg_x) * (y - avg_y)) / (N - 1), where N is the number of non-null input pairs.
PostgreSQL 9.4 added percentile_cont(), percentile_disc(), mode(), rank(), dense_rank(), percent_rank(), and cume_dist() as ordered-set and hypothetical-set aggregate functions using the WITHIN GROUP syntax.
bit_xor was added in PostgreSQL 14. bit_and and bit_or were added earlier in PostgreSQL 8.0.
regr_avgy calculates the average of the dependent variable (y) of the regression line for non-null pairs.
No. Unlike most built-in aggregates, hypothetical-set aggregates are not strict, that is they do not drop input rows containing nulls.
No. Using a sorted subquery can fail if the outer query level contains additional processing, such as a join, because that might cause the subquery's output to be reordered before the aggregate is computed. The ORDER BY within the aggregate call is the recommended approach.
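For example (table and column names are hypothetical):
  SELECT string_agg(name, ', ' ORDER BY name) FROM employees;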
ALL is the default. If neither ALL nor DISTINCT is specified, the aggregate behaves as if ALL was specified.
var_samp = Σ(x_i - mean)² / (N - 1), where N is the number of values. The divisor is (N - 1) rather than N, applying Bessel's correction for sample statistics.
stddev() is an alias for stddev_samp(). The semantics of stddev() and stddev_samp() are identical.
The forward transition function for moving-aggregate mode is not allowed to return null as the new state value.
Like most built-in aggregate functions, mode() ignores NULL values. If NULL occurs most frequently in the input, mode() returns the most common non-null value instead.
Yes. Multiple aggregates can share the same name; PostgreSQL determines which applies based on the input data types.
The delimiter parameter is mandatory. PostgreSQL does not provide a default delimiter for string_agg; it must be explicitly specified as the second argument.
If the inverse transition function returns null, this is taken as an indication that the inverse function cannot reverse the state calculation for this particular input, and so the aggregate calculation will be redone from scratch for the current frame starting position.
Yes. percentile_cont() and percentile_disc() have variants that take an array of fractions and compute multiple percentile values in one pass over the data, such as: SELECT percentile_cont(array[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY column_name)
regr_sxx calculates sum(X^2) - sum(X)^2/N, which is the 'sum of squares' of the independent variable.
percentile_disc returns a value from the input set closest to the percentile requested, while percentile_cont returns an interpolated value. For example, with [1, 2], percentile_disc(0.5) returns 1 (actual value), while percentile_cont(0.5) returns 1.5 (interpolated median).
PostgreSQL provides json_object_agg_strict, jsonb_object_agg_strict (skip null values), json_object_agg_unique, jsonb_object_agg_unique (error on duplicate keys), and json_object_agg_unique_strict, jsonb_object_agg_unique_strict (both behaviors combined).
To create an aggregate function, you must have USAGE privilege on the argument types, the state type(s), and the return type, as well as EXECUTE privilege on the supporting functions.
Historically, count(*) has been faster than count(1) in PostgreSQL because count(*) is parsed as a special case taking no arguments, while count(1) takes an argument and PostgreSQL has to check every row to see that the argument is not NULL. Recent optimizations in PostgreSQL 19 have improved count(1) performance to match count(*).
regr_avgx calculates the average of the independent variable (x) of the regression line for non-null pairs.
Parallel aggregation is not supported if any aggregate function call contains DISTINCT or ORDER BY clause, and is also not supported for ordered-set aggregates or when the query involves GROUPING SETS.
The WITHIN GROUP clause for ordered-set aggregates was introduced in PostgreSQL version 9.4.
No. For aggregate functions whose state_data_type is internal, the combinefunc must not be strict. In this case the combinefunc must ensure that null states are handled correctly and that the state being returned is properly stored in the aggregate memory context.
The ORDER BY clause has only been added to aggregates whose output is affected by ordering, including: array_agg, json_agg, string_agg, xmlagg, and statistical regression functions.
regr_sxy calculates sum(X*Y) - sum(X) * sum(Y)/N, which is the 'sum of products' of independent times dependent variable.
Yes. count(*) counts all input rows including those with NULL values in any column. It does not check individual columns. In contrast, count(column_name) counts only rows where the specified column is not NULL.
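For example, if a hypothetical column c contains the values 1, 2, and NULL, then SELECT count(*), count(c) FROM t; returns 3 and 2.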
Both bool_and and bool_or return boolean values. bool_and returns true if all values are true, bool_or returns true if at least one value is true.
When an aggregate uses a strict transition function, rows with any null input values are ignored (the function is not called and the previous state value is retained).
variance() is an alias for var_samp(). The semantics of variance() and var_samp() are identical.
INITCOND must be a string constant in the form accepted for the data type state_data_type.
cume_dist() returns (number of rows preceding or peer with current row) / (total rows).
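For example, the third row of ten (with no peers) gets cume_dist() = 3/10 = 0.3.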
The fraction value must be between 0 and 1; an error is thrown if not. However, a null fraction value simply produces a null result.
The strict suffix causes the aggregate function to skip null values during aggregation.
When DISTINCT is specified with an order_by_clause, all the ORDER BY expressions must match regular arguments of the aggregate; that is, you cannot sort on an expression that is not included in the DISTINCT list.
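For example, with hypothetical columns, string_agg(DISTINCT city, ',' ORDER BY city) is valid, but string_agg(DISTINCT city, ',' ORDER BY population) raises an error because population is not part of the DISTINCT argument list.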
Moving-aggregate mode requires specifying MSFUNC (moving state transition function), MINVFUNC (inverse transition function), and MSTYPE (moving state type) parameters.
Concurrency Control > Table-Level Locks (39 questions)
SELECT FOR SHARE acquires a ROW SHARE lock mode on the target table(s).
If deadlock_timeout is specified without units, it is taken as milliseconds.
PostgreSQL defines 8 different table-level lock modes: ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE.
ROW EXCLUSIVE conflicts with SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes.
When the TRUNCATE option is enabled (which it is by default), VACUUM attempts to truncate empty pages at the end of the table and return disk space to the operating system. This operation requires briefly acquiring an ACCESS EXCLUSIVE lock on the table.
PostgreSQL reports an error if LOCK is used outside a transaction block, because the lock would remain held only to the completion of the statement, making it useless.
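A minimal sketch (table name hypothetical): BEGIN; LOCK TABLE accounts IN SHARE ROW EXCLUSIVE MODE; followed by the statements that need the lock, then COMMIT; the lock is released when the transaction ends.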
No, advisory locks have application-defined meanings and are not enforced by the system. It is up to the application to use them correctly.
No, max_locks_per_transaction does not limit row locks. Row locks are unlimited in PostgreSQL.
VACUUM has 4 phases: scanning heap (ShareUpdateExclusiveLock), vacuuming indexes (ShareUpdateExclusiveLock), vacuuming heap (ShareUpdateExclusiveLock), and truncating heap (AccessExclusiveLock when TRUNCATE option is enabled).
Yes, when an index scan is used, CLUSTER creates a temporary copy of the table containing the data in index order, and temporary copies of each index. Therefore, you need free space on disk at least equal to the sum of the table size and the index sizes.
CLUSTER acquires an ACCESS EXCLUSIVE lock on the table being clustered.
lock_timeout limits how long a statement may wait while trying to acquire a lock on a database object, while statement_timeout limits the total execution time of a statement, including both lock acquisition and actual execution. Unlike a statement_timeout, a lock_timeout can only trigger while waiting for a lock.
REINDEX CONCURRENTLY acquires a SHARE UPDATE EXCLUSIVE lock on the indexes being reindexed and their associated tables, which is much less restrictive than the ACCESS EXCLUSIVE lock used by standard REINDEX.
A plain SELECT command acquires an ACCESS SHARE lock on the table(s) being queried.
deadlock_timeout is the amount of time to wait on a lock before checking to see if there is a deadlock condition. When log_lock_waits is enabled, it also determines the amount of time to wait before a log message is issued about the lock wait.
ROW SHARE conflicts with EXCLUSIVE and ACCESS EXCLUSIVE lock modes.
TRUNCATE acquires an ACCESS EXCLUSIVE lock on each table it operates on.
SELECT FOR NO KEY UPDATE acquires a ROW SHARE lock mode on the target table(s), similar to SELECT FOR UPDATE.
VACUUM FULL acquires an ACCESS EXCLUSIVE lock on the table, which blocks all concurrent access including SELECT queries.
ACCESS EXCLUSIVE conflicts with all lock modes: ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE (itself).
ANALYZE acquires a SHARE UPDATE EXCLUSIVE lock mode on the table being analyzed.
In PostgreSQL 17 and later, if the user has MAINTAIN, UPDATE, DELETE, or TRUNCATE privileges on the table, any lockmode is permitted. The MAINTAIN privilege specifically allows LOCK TABLE and other maintenance operations.
SELECT FOR KEY SHARE acquires a ROW SHARE lock mode on the target table(s).
CREATE INDEX (without CONCURRENTLY) acquires a SHARE lock mode on the table.
REINDEX (without CONCURRENTLY) acquires an ACCESS EXCLUSIVE lock on the index and its associated table.
You can prevent VACUUM from taking an ACCESS EXCLUSIVE lock by setting the TRUNCATE option to false, either in the VACUUM command or by disabling vacuum_truncate on a per-table basis.
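For example (table name hypothetical): VACUUM (TRUNCATE FALSE) measurements; for a single run, or ALTER TABLE measurements SET (vacuum_truncate = false); to disable truncation persistently for that table.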
REFRESH MATERIALIZED VIEW CONCURRENTLY requires at least one UNIQUE index on the materialized view that uses only column names and includes all rows; that is, it must not be an expression index or include a WHERE clause. The materialized view must already be populated.
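A minimal sketch (names hypothetical): CREATE UNIQUE INDEX ON sales_summary (region_id); followed by REFRESH MATERIALIZED VIEW CONCURRENTLY sales_summary;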
CREATE INDEX CONCURRENTLY acquires a SHARE UPDATE EXCLUSIVE lock mode instead of the SHARE lock taken during regular index creation, allowing concurrent inserts, updates, and deletes.
SHARE UPDATE EXCLUSIVE conflicts with SHARE UPDATE EXCLUSIVE (itself), SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes. It is self-conflicting.
When ONLY is specified with LOCK TABLE, only the named table is locked. Without it, the table and all its descendant tables (if any) are locked.
UPDATE, DELETE, INSERT, and MERGE commands acquire a ROW EXCLUSIVE lock mode on the target table.
If no lock mode is specified, then ACCESS EXCLUSIVE, the most restrictive mode, is used.
VACUUM (without FULL) acquires a SHARE UPDATE EXCLUSIVE lock mode. However, when the TRUNCATE option is enabled (which it is by default), VACUUM may also briefly acquire an ACCESS EXCLUSIVE lock during the truncation phase.
SHARE ROW EXCLUSIVE conflicts with ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE (itself), EXCLUSIVE, and ACCESS EXCLUSIVE lock modes.
Server Configuration > Query Planning Settings (38 questions)
0 (which means a suitable value is chosen automatically based on geqo_effort and the number of tables in the query)
No, it is not meaningful to set jit_inline_above_cost to less than jit_above_cost.
No. This parameter has no effect on the size of shared memory allocated by PostgreSQL, nor does it reserve kernel disk cache; it is used only for estimation purposes by the query planner.
The planner assumes that only 10% of a cursor's rows will be retrieved, biasing toward query plans that quickly return the first rows rather than optimizing for total execution time.
Database Administration (38 questions)
The default value is 50 tuples. This is the minimum number of inserted, updated, or deleted tuples needed to trigger an ANALYZE operation.
statement_timeout can be set at multiple levels (postgresql.conf, database, role, or session). Setting it in postgresql.conf is not recommended because it would affect all sessions. It is recommended to set statement_timeout at session level rather than globally.
Parameters with context 'postmaster' require a server restart. Examples include: max_connections, shared_buffers, max_prepared_transactions, max_worker_processes, max_wal_senders, wal_level, listen_addresses, port, max_locks_per_transaction, autovacuum_max_workers, autovacuum_freeze_max_age, and wal_buffers.
The default value is 100 connections, but might be less if kernel settings will not support it (as determined during initdb).
The default value is 3. This specifies the maximum number of autovacuum processes (other than the autovacuum launcher) that may be running at any one time.
The default value is 0 (zero) milliseconds, which means the timeout is disabled and queries can run indefinitely.
The default value is 0.1 (10% of table size). This fraction of the table size is added to autovacuum_analyze_threshold when deciding whether to trigger an ANALYZE.
The default value is 'off'. Valid settings are 'off', 'on', and 'always'. When set to 'on', WAL archiving is enabled using archive_command. This parameter requires a server restart to change.
The default value is -1, which means no limit. This specifies the maximum amount of disk space that a session can use for temporary files.
The default value is 200 million transactions. This specifies the maximum age (in transactions) that a table's pg_class.relfrozenxid field can attain before a VACUUM operation is forced to prevent transaction ID wraparound.
The default value is 10 in PostgreSQL 10 and later. In PostgreSQL 9.4 through 9.6, the default was 0. This sets the maximum number of simultaneously defined replication slots.
The default value is 64. This limits the average number of object locks used by each transaction. The minimum value is 10.
The default value is 1000 milliseconds (1 second). This is the amount of time PostgreSQL waits on a lock before checking for a deadlock condition.
Both log_connections and log_disconnections have a default value of 'off', meaning connection and disconnection events are not logged by default.
The default value is five minutes (5min). The valid range is between 30 seconds and one day.
When wal_buffers is set to -1 (the default), it automatically selects a size equal to 1/32nd (about 3%) of shared_buffers, but not less than 64kB nor more than one WAL segment (typically 16MB).
The default port is 5432. Port 5432 is officially assigned to PostgreSQL by the Internet Assigned Numbers Authority (IANA).
The default value is 'on'. When enabled, PostgreSQL will try to make sure that updates are physically written to disk by issuing fsync() system calls. Turning this off can result in unrecoverable data corruption.
The default value is 1000 tuples. This is the number of inserted tuples needed to trigger a VACUUM operation.
The default value is 'none', which means that by default PostgreSQL does not log any SQL statements. Valid values are 'none', 'ddl', 'mod', and 'all'.
No, archive_mode cannot be enabled when wal_level is set to minimal. You must use wal_level of 'replica' or higher to enable continuous archiving.
The default value is 4.0. This sets the planner's estimate of the cost of a non-sequentially-fetched disk page.
The default value is -1, which disables logging of statement durations. Setting it to 0 logs all statement durations. When specified without units, it is taken as milliseconds.
When the number of active concurrent connections is at least max_connections minus superuser_reserved_connections, new connections will be accepted only for superusers. With default settings (max_connections=100, superuser_reserved_connections=3), regular users can use up to 97 connections.
The default value is 1 GB (1024 MB). This is a soft limit; WAL size can exceed max_wal_size under special circumstances.
The default value is 0 (zero), which disables the timeout. When specified without units, it is taken as milliseconds. This parameter was added in PostgreSQL 9.6.
The default value is 10 minutes (10min), equivalent to 600000 milliseconds. Any autovacuum action that runs for at least this amount of time will be logged.
The built-in default timezone is GMT, but this is typically overridden in postgresql.conf. The initdb command installs a setting corresponding to the system environment.
The default value is 'on'. If hot_standby is on and there is a standby.signal file present, the server will run in hot standby mode, allowing read-only queries on standby servers.
The default value is 30 seconds (30s). If checkpoints happen closer together than this threshold due to WAL volume, a warning message will be output to the server log.
The default value is 80 MB. As long as WAL disk usage stays below this setting, old WAL files are always recycled for future use at a checkpoint rather than removed.
The default value is -1, which automatically selects a size equal to 1/32nd (about 3%) of shared_buffers, but not less than 64kB nor more than the size of one WAL segment (typically 16MB).
The default value is 200 milliseconds (200ms). This specifies the delay between activity rounds for the background writer. The valid range is 10ms to 10000ms.
The default value is 'on'. This sets the current transaction's synchronization level. Valid values are 'local', 'remote_write', 'remote_apply', 'on', and 'off'.
The default value is 50 tuples. This is the minimum number of updated or deleted tuples needed to trigger a VACUUM operation.
Performance Tips > EXPLAIN and EXPLAIN ANALYZE (37 questions)
Configuration parameters that affect query planning with values different from the built-in default value. It defaults to FALSE.
The estimated average width of rows output by this plan node, measured in bytes.
The cost of converting the query's output data to TEXT or BINARY format. Options are NONE (default), TEXT, or BINARY. If SERIALIZE is specified without an argument, TEXT is assumed.
Statistics produced by ANALYZE are taken from a randomized sample of the table, not from examining every row. Different samples produce slightly different statistics.
The startup cost is the estimated time expended before the first row can be returned by a plan node.
3 seconds. Longer durations give slightly better accuracy and are more likely to discover problems with the system clock.
FALSE, but it becomes TRUE when ANALYZE is used. SUMMARY includes summary information like total planning and execution timing.
Only ANALYZE and VERBOSE can be specified without parentheses, and only in that order (e.g., EXPLAIN ANALYZE VERBOSE). All other options require parenthesized syntax.
Multiply the displayed actual time by the loops value. For example, if actual time=0.028 and loops=9, total time is 0.028 * 9 = 0.252 ms.
Yes. The statement is actually executed when the ANALYZE option is used, and any side effects of the statement will happen as usual.
The WAL option includes the number of WAL records, number of full page images (FPI), amount of WAL generated in bytes, and number of times WAL buffers became full. Only non-zero values are printed in text format.
Wrap the statement in a transaction with BEGIN and ROLLBACK: BEGIN; EXPLAIN ANALYZE ...; ROLLBACK;
No. Execution time includes executor start-up and shut-down time and trigger execution, but does not include parsing, rewriting, or planning time.
No. The time does not include converting output to displayable form or sending it to the client, unless the SERIALIZE option is specified.
TEXT. The FORMAT option defaults to TEXT, but can also be set to XML, JSON, or YAML.
For tables occupying only one disk page, PostgreSQL realizes it will take one disk page read anyway, so there's no value in additional page reads to access an index. Generally, tables with ~100 rows or fewer get sequential scans.
When the outer child contains duplicate key values, the inner child is rescanned for matching rows. EXPLAIN ANALYZE counts these repeated emissions as if they were additional rows, which can significantly inflate the reported count.
This is expected behavior. Estimated costs and row counts are shown as if the plan node runs to completion, but with LIMIT, execution stops early. The planner interpolates between endpoint costs to estimate the actual cost with LIMIT.
The total cost is the estimated time to return all rows from a plan node. It is stated on the assumption that the plan node runs to completion.
TRUE. The TIMING option defaults to TRUE and shows startup time and per-node timing information. It requires ANALYZE to be enabled.
Milliseconds. The actual time values shown in EXPLAIN ANALYZE are measured in milliseconds of real time.
The overhead of repeatedly reading the system clock can slow down the query significantly on some systems. This can be avoided by setting TIMING FALSE.
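For example (table name hypothetical): EXPLAIN (ANALYZE, TIMING FALSE) SELECT count(*) FROM measurements; still reports actual row counts and total execution time, but skips per-node timing.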
The number of blocks from tables/indexes that were read from disk (or the OS cache).
Time spent executing BEFORE triggers is included in the time for the related Insert, Update, or Delete node.
Good results will show most (>90%) individual timing calls take less than one microsecond, with average per loop overhead below 100 nanoseconds.
pg_test_timing. This tool measures timing overhead and confirms that system time never moves backwards.
0.01. This sets the planner's estimate of the cost of processing each row during a query.
1.0. This sets the planner's estimate of the cost of a disk page fetch that is part of a series of sequential fetches.
SELECT, INSERT, UPDATE, DELETE, MERGE, VALUES, EXECUTE, DECLARE, CREATE TABLE AS, and CREATE MATERIALIZED VIEW AS.
OFF (false). This parameter must be enabled to display I/O timing information in EXPLAIN BUFFERS output.
Temporary blocks contain short-term working data used in sorts, hashes, Materialize plan nodes, and similar cases.
The number of previously-dirtied blocks evicted from cache by this backend during query processing.
Backup and Restore > Continuous Archiving (37 questions)
The connection calling pg_backup_start must be maintained until the end of the backup, or the backup will be automatically aborted.
Yes, wal_level can only be set at server start and requires a database restart to change.
In PostgreSQL 10, as part of a broader consistency change where all references to 'xlog' were renamed to 'wal'.
No, WAL files are normally archived in order (oldest to newest), but this is not guaranteed and does not hold under special circumstances like promoting a standby or after crash recovery. Therefore, it is not safe to assume that all files older than last_archived_wal have been successfully archived.
WAL segment files are normally 16MB apiece, although the segment size can be altered during initdb.
archive_timeout settings of a minute or so are usually reasonable.
The archive_command or archive_library is only invoked for completed WAL segments, not partial segments.
Use -n or --dry-run to print the names of files that would have been removed on stdout without actually deleting them.
The valid parameters are: recovery_target (immediate), recovery_target_name (named restore point), recovery_target_time (timestamp), recovery_target_xid (transaction ID), and recovery_target_lsn (LSN location). At most one can be used.
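For example, a point-in-time recovery target might be set in postgresql.conf as recovery_target_time = '2024-05-01 12:00:00+00' (timestamp illustrative).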
No, archived files that are archived early due to a forced switch are still the same length as completely full files (16MB by default). It is therefore unwise to set a very short archive_timeout.
wal_level must be set to 'replica' or higher. It cannot be enabled when wal_level is set to 'minimal'.
No, archive_mode cannot be enabled when wal_level is set to minimal.
The default value is 'latest', which recovers to the latest timeline found in the archive. This has been the default behavior since PostgreSQL 12.
The valid values are: 'minimal' (removes all logging except crash recovery), 'replica' (adds logging for WAL archiving and replication), and 'logical' (adds information for logical decoding). Each level includes information logged at lower levels.
No, restore_command can be set in the postgresql.conf file or on the server command line, but changes require a configuration reload (SIGHUP), not a restart.
archive_cleanup_command = 'pg_archivecleanup archivelocation %r' where archivelocation is the directory from which WAL segment files should be removed, and %r is replaced by the name of the file containing the last valid restart point.
You must create a recovery.signal file in the cluster data directory. This file tells PostgreSQL to enter normal archive recovery and will be deleted when recovery finishes.
Exclude: pg_wal/, postmaster.pid, postmaster.opts, pg_replslot/, pg_dynshmem/, pg_notify/, pg_serial/, pg_snapshots/, pg_stat_tmp/, pg_subtrans/, files beginning with pgsql_tmp, and pg_internal.init files.
Starting with PostgreSQL 11, use the --wal-segsize parameter with initdb when initializing a new cluster. The value must be a power of 2 between 1 and 1024 megabytes. This option can only be set during initialization and cannot be changed later.
The default value is 0 (disabled), meaning there is no time limit on WAL segment switches.
Use the -b or --clean-backup-history option to remove backup history files in addition to WAL files.
PostgreSQL 12 eliminated the recovery.conf file. The server will not start if recovery.conf is present. All configuration parameters moved to postgresql.conf, and two new signal files (recovery.signal and standby.signal) replaced recovery.conf for triggering recovery modes.
No, you cannot recover to a time when that backup was in progress.
The wal_level must be set to replica or logical; continuous archiving cannot be used with minimal, because archive_mode cannot be enabled at that level.
The valid values are 'pause' (the default, pauses recovery), 'promote' (ends recovery and allows connections), and 'shutdown' (stops the server after reaching the recovery target).
The valid values are 'off' (archiving disabled), 'on' (archive_command applied only on primary), and 'always' (WAL archiver enabled also during archive recovery or standby mode).
The default value is 0 MB, meaning the system doesn't keep extra segments for standby purposes.
They were renamed in PostgreSQL 15 when the deprecated exclusive backup mode was finally removed.
No, pg_switch_wal() can only be executed on a primary server (not a read-only standby).
recovery.signal tells PostgreSQL to enter normal archive recovery (PITR), while standby.signal tells PostgreSQL to enter standby mode (replication). Both were introduced in PostgreSQL 12 to replace recovery.conf.
The default value is true, meaning recovery will stop just after the specified recovery target is reached.
When archive_mode is on, the archiver keeps retrying to archive the same WAL segment until it succeeds. This can cause WAL size to exceed max_wal_size.
Server Configuration > Replication Settings (37 questions)
Starting with PostgreSQL 12, max_wal_senders is independent of max_connections. WAL sender connections have their own dedicated connection slots separate from regular database connections, handled using their own free queue.
promote_trigger_file (or trigger_file before PostgreSQL 12) was removed in PostgreSQL 16. Users should now use pg_ctl promote or call pg_promote() to promote a standby instead.
The valid values are 'remote_apply', 'on', 'remote_write', 'local', and 'off'.
The default value is an empty string (no slots synchronized by default).
primary_conninfo uses a libpq connection string format. Example: 'user=postgres host=10.10.1.8 port=5432 application_name=standby_east'. For replication slot synchronization, it is necessary to also specify a valid dbname in the connection string.
The valid values are 'minimal', 'replica', and 'logical'. The values 'archive' and 'hot_standby' are deprecated since PostgreSQL 9.6 but still accepted and automatically mapped to 'replica'.
The default value is 0, which means automatic WAL segment switching based on time is disabled.
wal_keep_size replaced wal_keep_segments in PostgreSQL 13. The old parameter specified number of segments, while wal_keep_size specifies megabytes.
This parameter can be changed at any time. The behavior for any one transaction is determined by the setting in effect when it commits.
In PostgreSQL versions before 12, WAL sender processes counted towards the total number of connections, so max_wal_senders could not be set higher than max_connections.
If the value is specified without units, it is taken as seconds.
This parameter can only be set at server start. Configuration changes require a PostgreSQL server restart to take effect.
FIRST specifies priority-based synchronous replication where standbys earlier in the list have higher priority and num_sync standbys are chosen based on priority. ANY specifies quorum-based synchronous replication where commits proceed as soon as any num_sync standbys reply, regardless of which ones.
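For example, synchronous_standby_names = 'FIRST 2 (s1, s2, s3)' waits for the two highest-priority of the listed standbys, while 'ANY 2 (s1, s2, s3)' waits for replies from any two of them (standby names illustrative).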
The default value is -1, which means unlimited WAL retention for replication slots.
A value of -1 allows the standby to wait forever for conflicting queries to complete, never canceling them due to replication conflicts.
This parameter can only be set in the postgresql.conf file or on the server command line. If changed while the WAL receiver is running, it will be signaled to shut down and restart with the new setting.
If the value is specified without units, it is taken as milliseconds.
The default value is 10 in PostgreSQL 10 and later. In PostgreSQL 9.4-9.6, the default was 0 (disabled).
The default value is unset (no replication slot is used by default).
From fastest to slowest: 'off' (fastest), 'local', 'remote_write', 'on' (default), 'remote_apply' (slowest).
For synchronization to work, it is mandatory to have a physical replication slot between the primary and the standby (primary_slot_name must be configured on the standby), and hot_standby_feedback must be enabled on the standby.
The default value is 'on' in PostgreSQL 10 and later. In PostgreSQL 9.0-9.6, the default was 'off'.
Client Interfaces (37 questions)
1 if the data was queued, zero if it was not queued because of full buffers (only in nonblocking mode), or -1 if an error occurred.
PostgreSQL 14. It is a client-side feature that works with any server supporting the v3 extended query protocol.
Yes. All large object manipulation using these functions must take place within an SQL transaction block, since large object file descriptors are only valid for the duration of a transaction.
PGRES_TUPLES_OK for successful completion of a command returning data (such as a SELECT or SHOW). Note that a SELECT that retrieves zero rows still shows PGRES_TUPLES_OK.
Client programs that use libpq must include the header file libpq-fe.h
PGRES_COMMAND_OK is for commands that can never return rows (INSERT or UPDATE without a RETURNING clause, etc.).
-lpq - and the -L option should be placed before -lpq for maximum portability.
PGRES_PIPELINE_SYNC (marks pipeline synchronization completion) and PGRES_PIPELINE_ABORTED (indicates error during pipeline processing).
Zero, negative, or not specified means wait indefinitely for the connection.
INFORMIX, INFORMIX_SE, and ORACLE. These are set using the -C command-line option.
The permissions must disallow any access to world or group; achieve this by a command such as chmod 0600 ~/.pgpass. If permissions are too permissive, the file will be ignored.
No. Command strings containing multiple SQL commands are disallowed, and so is COPY.
InvalidOid equals zero and is returned from large object functions on failure.
.pgc - If an input file name does not have any extension, .pgc is assumed.
--includedir prints the location of C header files of the client interfaces.
Two client interfaces are included in the base PostgreSQL distribution: libpq (the C library) and ECPG (Embedded SQL in C).
SQLERROR (errors during SQL execution), SQLWARNING (warnings during SQL execution), and NOT FOUND (SQL statement retrieves or affects zero rows).
prefer - which first tries an SSL connection; if that fails, tries a non-SSL connection.
A null pointer if there are no pending notifications. Otherwise, it returns the next notification from a list of unhandled notification messages received from the server.
.c - The file's extension will be replaced by .c to construct the output file name (or .h when using the -h flag).
5432 is the default port number established when PostgreSQL was built.
CONTINUE - this effectively means that the condition is ignored.
libecpg - programs require linking using linker options like -L/usr/local/pgsql/lib -lecpg
0 if OK, -1 if error. It sets the state of the connection to nonblocking if arg is 1, or blocking if arg is 0.
Streaming Replication > Primary Configuration (37 questions)
Yes, track_commit_timestamp can only be set at server start and requires a restart.
No, wal_segment_size is a preset/read-only parameter that is determined at cluster initialization time (PostgreSQL 11+) or compile time (PostgreSQL 10 and earlier) and cannot be changed while running.
It specifies a comma-separated list of standby slot names that logical WAL sender processes will wait for, ensuring that logical replication slots cannot advance beyond the confirmed LSN of specified physical replication slots.
max_wal_senders should be set to at least the same as max_replication_slots, plus the number of physical replicas that are connected at the same time.
No, wal_keep_size can be changed without a server restart (it can be set in postgresql.conf or on the command line and reloaded).
With 'local', commits wait for local flush to disk but not for replication. With 'on', commits wait until replies from synchronous standbys indicate they have flushed the commit record to durable storage (when synchronous_standby_names is configured; otherwise 'on' behaves like 'local').
No, idle_replication_slot_timeout can be changed with SIGHUP (configuration reload) without requiring a server restart.
Setting max_replication_slots to a lower value than the number of currently existing replication slots will prevent the server from starting.
The default value is 0, meaning the system doesn't keep any extra segments for standby purposes.
The default value is empty (an empty string), which means synchronous replication is disabled.
Setting wal_sender_timeout to 0 disables the timeout mechanism.
num_sync specifies the number of synchronous standbys that transactions need to wait for replies from before commits can proceed.
Yes, max_wal_senders can only be set at server start and requires a server restart to take effect.
The valid values are: off, local, on (the default), remote_write, and remote_apply.
Starting from PostgreSQL 12, wal_sender_timeout can be set for individual connections.
The default value is 'replica', which writes enough data to support WAL archiving and replication, including running read-only queries on a standby server.
wal_level must be set to logical for logical replication to work.
The two methods are: FIRST num_sync (standby_name [...]) for priority-based synchronous replication, and ANY num_sync (standby_name [...]) for quorum-based synchronous replication.
The keyword 'replication' is used in the database field of pg_hba.conf to match physical replication connections.
No, max_slot_wal_keep_size can be changed with pg_reload_conf(), pg_ctl reload, or SIGHUP without requiring a server restart.
Yes, max_replication_slots can only be set at server start and requires a server restart to take effect.
wal_level must be set to replica or higher (replica or logical) to allow connections from standby servers.
In releases prior to PostgreSQL 9.6, wal_level allowed the values 'archive' and 'hot_standby'. These are still accepted in 9.6+ but are mapped to 'replica'.
Yes, archive_mode can only be set in the postgresql.conf file or on the server command line and requires a server restart to take effect.
No, synchronous_standby_names can be changed with a configuration reload (SIGHUP) without requiring a server restart.
The default value is 10 (in PostgreSQL 10 and later; it was 0 in PostgreSQL 9.6 and earlier).
The three valid values are: minimal (removes all logging except crash recovery info), replica (supports WAL archiving and streaming replication), and logical (adds information for logical decoding/replication).
Yes, vacuum_defer_cleanup_age was added in PostgreSQL 9.0 and removed in PostgreSQL 16.
Indexes > GiST Indexes (36 questions)
No, multirange types are supported only by GiST indexes. SP-GiST does not support multiranges, only ranges.
A GiST index build automatically switches to the buffering method when the index size reaches effective_cache_size.
A GiST operator class must provide five required support methods: consistent, union, penalty, picksplit, and same.
Use the command: CREATE EXTENSION IF NOT EXISTS btree_gist; This extension is considered 'trusted' and can be installed by non-superusers who have CREATE privilege on the current database.
On some systems you may need to install an additional package such as postgresql-contrib (or postgresql12-contrib on CentOS/RHEL) for the btree_gist extension to become available.
The default value of the buffering parameter is AUTO, which initially disables buffering but activates it automatically once the index size reaches effective_cache_size.
Yes, GiST indexes support WAL. The GiST layer takes care of concurrency, logging, and searching the tree structure. Usually, replay of the WAL log is sufficient to restore index integrity following a database crash.
If sortsupport is not provided, CREATE INDEX builds the index by inserting each tuple to the tree using the penalty and picksplit functions, which is much slower than the sorted build method.
KNN support for GiST indexes was introduced in PostgreSQL 9.1, released in 2011.
Yes, PostgreSQL supports CREATE INDEX CONCURRENTLY for GiST indexes, allowing index builds without blocking write operations.
Yes, GiST text search indexes are lossy. Each document is represented by a fixed-length signature, and the index might produce false matches that must be checked against the actual table row.
Correctness is ensured by the same, consistent, and union methods, while efficiency (size and speed) depends on the penalty and picksplit methods.
The penalty function returns a value indicating the 'cost' of inserting a new entry into a particular branch of the tree. Items are inserted down the path of least penalty in the tree.
The fetch method converts the compressed index representation of a data item into the original data type for index-only scans. The returned data must be an exact, non-lossy copy of the originally indexed value.
btree_gist provides GiST operator classes for: int2, int4, int8, float4, float8, numeric, timestamp with time zone, timestamp without time zone, time with time zone, time without time zone, date, interval, oid, money, char, varchar, text, bytea, bit, varbit, macaddr, macaddr8, inet, cidr, uuid, bool, and all enum types.
If sorted build is possible (when sortsupport function is available), it will be used instead of buffered build unless buffering=ON is explicitly specified.
GiST geometric indexes define 12 R-tree strategy numbers: 1=strictly left of, 2=does not extend to right of, 3=overlaps, 4=does not extend to left of, 5=strictly right of, 6=same, 7=contains, 8=contained by, 9=does not extend above, 10=strictly below, 11=strictly above, 12=does not extend below.
Yes, GiST compress methods can be lossy. If compress is lossy, then it's impossible for decompress to exactly reconstruct the original data. This convention allows GiST to support both lossless and lossy index structures.
Yes, GiST indexes can be used to support exclusion constraints (EXCLUDE), which ensure that given fields of any two table rows do not correspond to each other in terms of specified operators.
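A minimal sketch (names hypothetical): CREATE TABLE reservation (during tsrange, EXCLUDE USING gist (during WITH &&)); prevents any two rows from having overlapping during ranges.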
The maximum index entry size is 8191 bytes. Attempts to create index entries larger than this will result in an error: 'index row requires X bytes, maximum size is 8191'.
For historical reasons, the inet_ops operator class is not the default class for types inet and cidr. To use it, you must explicitly specify the operator class name when creating the index.
The nine built-in GiST operator classes are: box_ops, circle_ops, inet_ops, multirange_ops, point_ops, poly_ops, range_ops, tsquery_ops, and tsvector_ops.
The distance operator <-> is used for nearest neighbor (KNN) searches with GiST indexes in ORDER BY clauses.
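For example (names hypothetical), with a GiST index on location: SELECT * FROM places ORDER BY location <-> point '(101, 456)' LIMIT 10; returns the ten nearest places.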
The btree_gist extension supports the distance operator for: int2, int4, int8, float4, float8, timestamp with time zone, timestamp without time zone, time without time zone, date, interval, oid, and money.
GiST geometric indexes support: << (strictly left of), &< (does not extend to right of), &> (does not extend to left of), >> (strictly right of), <<| (strictly below), &<| (does not extend above), |&> (does not extend below), |>> (strictly above), @> (contains), <@ (contained by), ~= (same), and && (overlaps).
When recheck=false, the index has tested the predicate condition exactly. When recheck=true, the row is only a candidate match and the system must evaluate the operator against the actual table row to verify it is really a match.
GiST range_ops supports: = (equals), && (overlaps), <@ (is contained by), @> (contains), << (strictly left of), >> (strictly right of), -|- (is adjacent to), &< (does not extend to the right of), and &> (does not extend to the left of).
The two methods are sorted and buffered. The sorted method is only available if each opclass provides a sortsupport function and is usually the best method, so it is used by default when available.
A GiST operator class can provide seven optional support methods: compress, decompress, distance, fetch, options, sortsupport, and translate_cmptype (also called stratnum).
GiST disables WAL-logging during the build and only logs it once at the end, reducing WAL overhead during index creation.
Information Schema (36 questions)
The possible values are: BASE TABLE (for regular tables), VIEW (for views), FOREIGN TABLE (for foreign tables), and LOCAL TEMPORARY (for temporary tables).
The view identifies all privileges granted on functions to a currently enabled role or by a currently enabled role. There is one row for each combination of function, grantor, and grantee.
The information schema contains 64 views (sections 35.3 through 35.66 in the documentation).
The view identifies all routines (functions and procedures) that are used in the query expression of a view.
No. The views do not contain information about PostgreSQL-specific features; to inquire about those you need to query the system catalogs or other PostgreSQL-specific views.
No. System columns (ctid, etc.) are not included in the information_schema.columns view.
The view contains all constraints belonging to domains defined in the current database, showing only those domains that the current user has access to (by way of being the owner or having some privilege).
The view identifies all roles whose privileges the current user can use, which means there is some chain of role grants from the current user to the role in question. The current user itself is also an applicable role.
The three values correspond to the ON COMMIT options: PRESERVE ROWS (the default), DELETE ROWS (truncate at commit), and DROP (drop table at commit).
The constraint_type column can be PRIMARY KEY, FOREIGN KEY, UNIQUE, or CHECK.
Both columns can have these values: CASCADE, SET NULL, SET DEFAULT, RESTRICT, or NO ACTION.
YES (the grantee holds admin option on the role) or NO (the grantee does not hold admin option).
The view identifies all privileges granted on tables or views to a currently enabled role or by a currently enabled role. There is one row for each combination of table, grantor, and grantee.
The action_orientation column can be ROW (fires once for each processed row) or STATEMENT (fires once for each statement).
The security_type column has two values: INVOKER (if the function runs with the privileges of the current user) or DEFINER (if the function runs with the privileges of the user who defined it).
The view identifies all columns that are used in the query expression of a view. A column is only included if the table that contains the column is owned by a currently enabled role.
Yes, the owner can drop it, but the space savings achieved by that are minuscule.
The five special data types are: cardinal_number (a nonnegative integer), character_data (a character string), sql_identifier (used for SQL identifiers), time_stamp (a domain over timestamp with time zone), and yes_or_no (contains either YES or NO).
The view identifies all tables that are used in the query expression of a view (the SELECT statement that defines the view).
The schemata view contains 7 columns: catalog_name, schema_name, schema_owner, default_character_set_catalog, default_character_set_schema, default_character_set_name, and sql_path.
The SQL standard requires constraint names to be unique within a schema, but PostgreSQL does not enforce this restriction. This means queries expecting one row can return multiple rows when querying views like check_constraints, domain_constraints, and referential_constraints.
The routine_type column contains either FUNCTION (for a function) or PROCEDURE (for a procedure).
Only those functions and procedures are shown that the current user has access to (by way of being the owner or having some privilege).
The information schema is defined in the SQL standard and can therefore be expected to be portable and remain stable — unlike the system catalogs, which are specific to PostgreSQL and are modeled after implementation concerns.
The owner of this schema is the initial database user in the cluster, and that user naturally has all the privileges on this schema.
The view identifies all privileges granted on tables or views where the grantor or grantee is a currently enabled role.
These values are returned as character_data (character string) per the SQL standard.
When a table column, composite-type attribute, domain, function parameter, or function return value is defined to be of an array type, the respective information schema view only contains ARRAY in the data_type column. To obtain information on the element type of the array, you can join the respective view with element_types.
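A minimal sketch of that join, adapted from the documentation's example (table name hypothetical):
SELECT c.column_name, c.data_type, e.data_type AS element_type
FROM information_schema.columns c
LEFT JOIN information_schema.element_types e
  ON ((c.table_catalog, c.table_schema, c.table_name, 'TABLE', c.dtd_identifier)
    = (e.object_catalog, e.object_schema, e.object_name, e.object_type, e.collection_type_identifier))
WHERE c.table_name = 'mytable'
ORDER BY c.ordinal_position;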
No. The information_schema views are defined by the SQL standard and therefore only show content that matches the standard. They ignore PostgreSQL-invented objects like materialized views and exclude sequences from views like view_table_usage.
The maximum precision that can be explicitly specified in a numeric type declaration is 1000.
Client Authentication > External Authentication (36 questions)
The default value of radiusidentifiers is 'postgresql', which is used as the NAS Identifier in RADIUS requests.
Peer authentication only supports local connections (Unix domain sockets). It is not supported for remote TCP/IP connections.
Yes, SSPI and GSSAPI clients and servers can authenticate to each other.
Yes, using pam_use_hostname can lead to login delays because host name resolution is required.
Peer authentication is available on operating systems providing the getpeereid() function, the SO_PEERCRED socket parameter, or similar mechanisms. Currently this includes Linux, most flavors of BSD (including macOS), and Solaris.
The Common Name (cn) attribute of the client certificate is compared against the requested database user name.
No, PostgreSQL cannot decrypt encrypted responses from ident servers. Some ident servers have a nonstandard encryption option, but this must not be used with PostgreSQL.
PostgreSQL uses SSPI in 'negotiate' mode, which uses Kerberos when possible and automatically falls back to NTLM in other cases.
The ident server listens on TCP port 113 by default on Unix-like operating systems.
The default value of include_realm is 1 (enabled) in SSPI authentication.
The default value of include_realm is 1 (enabled), meaning the realm is included in the authenticated principal.
The default value of ldaptls is 0 (disabled). Set to 1 to use TLS encryption via StartTLS (RFC 4513).
The default search scope is 'base' when using ldapurl. For effective searches, 'sub' (subtree) is typically required.
If ldapport is not specified, the LDAP library's default port setting will be used.
No, ident authentication works only over TCP/IP connections. For local connections, peer authentication is used instead.
Yes, SSPI authentication works for connections to localhost on a standalone Windows computer, even without a domain environment.
The default behavior is to use the IP address. When pam_use_hostname is set to 1, the resolved hostname is used instead (sent via PAM_RHOST).
clientcert=verify-ca only verifies that the client certificate is signed by a trusted certificate authority. clientcert=verify-full additionally enforces that the cn (Common Name) in the certificate matches the username or an applicable mapping.
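For example, a pg_hba.conf entry such as: hostssl all all 192.168.1.0/24 scram-sha-256 clientcert=verify-full (network range illustrative) requires both a valid password and a client certificate whose cn matches the user name or an applicable mapping.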
No, certificate authentication is only available for SSL connections. It requires SSL to be configured as described in Section 18.9.2 on OpenSSL settings.
If a negative response (Access Reject) is received from a RADIUS server, the authentication fails immediately. However, if no response is received, PostgreSQL tries the next server in the list.
The PostgreSQL user account (the operating system user running the server) must be added to the 'auth' group, which exists by default on OpenBSD systems.
The recommended minimum length is 16 characters for radiussecrets.
The standard format is 'servicename/hostname@realm', where servicename defaults to 'postgres'.
Yes, PostgreSQL supports OpenLDAP DNS SRV discovery, which looks up _ldap._tcp.DOMAIN. However, this is not supported on Windows.
No, ldapscheme and ldaptls only encrypt the traffic between the PostgreSQL server and the LDAP server. The connection between the PostgreSQL client and PostgreSQL server remains unencrypted unless SSL is used separately.
PostgreSQL sends Access Request messages of type 'Authenticate Only', including username, encrypted password, and NAS Identifier.
No, ident authentication is only appropriate for closed networks where each client machine is under tight control. It depends entirely on the integrity of the client machine, and a compromised client could return any username.
No, the user must already exist in the database before PAM can be used for authentication. PAM is only used to validate user name/password pairs.
Peer authentication obtains the client's operating system user name from the kernel and uses it as the allowed database user name (with optional user name mapping).
Yes, it is redundant to use the clientcert option with cert authentication since certificate verification is inherent to the cert method. The cert method is effectively 'trust' authentication with 'clientcert=verify-full'.
The ssl_ca_file parameter specifies the file containing trusted certificate authorities (CAs) used to verify client certificates.
PAM authentication fails because the PostgreSQL server is started by a non-root user and cannot read /etc/shadow. This limitation does not apply when PAM uses LDAP or other authentication methods.
BSD authentication is only available on OpenBSD. The BSD Authentication framework is not available on other operating systems.
No, you cannot combine simple bind parameters (ldapprefix, ldapsuffix) with search+bind parameters (ldapbasedn, ldapbinddn, etc.) in the same configuration.
The SSL private key file requires restricted permissions with mode 0600 (read/write for owner only).
Transactions > Transaction Basics (36 questions)
PostgreSQL internally implements only three distinct isolation levels, even though it accepts four standard SQL isolation levels. Read Uncommitted mode behaves identically to Read Committed.
START TRANSACTION [ transaction_mode [, ...] ] where transaction_mode options include: ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }, READ WRITE | READ ONLY, [ NOT ] DEFERRABLE
ABORT is identical in behavior to the standard SQL command ROLLBACK and exists only for historical reasons.
Issuing COMMIT outside a transaction will provoke a warning message. However, using COMMIT AND CHAIN when not in a transaction is an error.
No, START TRANSACTION is functionally equivalent to the BEGIN command. Both serve the same purpose of initiating a transaction block with identical characteristics when parameters are specified.
If savepoints with the same name already exist, they will be inaccessible until newer identically-named savepoints are released. The newer savepoint takes precedence for rollback and release operations.
Yes, write operations to temporary tables are permitted even in read-only mode.
Up to 64 open subxids are cached in shared memory for each backend; after that point, the storage I/O overhead increases significantly due to additional lookups of subxid entries in pg_subtrans.
When a subtransaction aborts, all of its child subtransactions will also be considered aborted. If the top-level transaction aborts, all subtransactions abort regardless of their prior subcommitted status.
INSERT, UPDATE, DELETE, MERGE, and COPY FROM if the table they would write to is not a temporary table; all CREATE, ALTER, and DROP commands; COMMENT, GRANT, REVOKE, TRUNCATE; and EXPLAIN ANALYZE and EXECUTE if the command they would execute is among those listed.
Issuing BEGIN while already in a transaction generates a warning but doesn't affect transaction state.
Without BEGIN, PostgreSQL operates in autocommit mode where each statement is executed in its own transaction and a commit is implicitly performed at the end of the statement (if execution was successful, otherwise a rollback is done).
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] where transaction_mode includes: ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }, READ WRITE | READ ONLY, [ NOT ] DEFERRABLE
PREPARE TRANSACTION is not intended for use in applications or interactive sessions. Its purpose is to allow an external transaction manager to perform atomic global transactions across multiple databases or other transactional resources. Unless you're writing a transaction manager, you probably shouldn't be using PREPARE TRANSACTION.
Yes, changes to sequences (including serial columns) are immediately visible to all other transactions and are not rolled back, regardless of isolation level.
Specifying a savepoint name that was not previously defined is an error.
PREPARE TRANSACTION transaction_id where transaction_id must be a string literal with a maximum of 200 bytes and cannot match any currently prepared transaction identifier.
When a subtransaction commits, all of its committed child subtransactions with subxids will also be considered subcommitted in that transaction.
RELEASE [ SAVEPOINT ] savepoint_name. The SAVEPOINT keyword is optional in PostgreSQL, though required by SQL standard.
Yes, autocommit is turned on by default in psql, meaning that every statement is automatically committed once submitted.
BEGIN (initiates a transaction block), COMMIT (finalizes all changes), and ROLLBACK (cancels all updates executed so far in the transaction).
The savepoint remains valid and can be rolled back to again later, if needed.
The DEFERRABLE transaction property has no effect unless the transaction is also SERIALIZABLE and READ ONLY.
ROLLBACK TO SAVEPOINT implicitly destroys all savepoints that were established after the named savepoint. For example, if you have savepoints A, B, and C (in order), rolling back to A will eliminate both B and C.
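For example: SAVEPOINT a; SAVEPOINT b; SAVEPOINT c; ROLLBACK TO SAVEPOINT a; after the rollback, b and c no longer exist, while a remains available.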
Use the command \echo :AUTOCOMMIT which will display 'on' or 'off' depending on the current autocommit setting.
Indexes > BRIN Indexes (35 questions)
No, BRIN indexes cannot enforce unique constraints or primary keys. Only B-tree indexes support unique constraints.
Revmap pages are located between the meta page (page zero) and the summary data pages. The structure is: meta page, revmap pages, then summary information pages.
Lossy heap blocks refer to pages rather than specific rows. When the bitmap of a bitmap scan grows too large to fit in work_mem, parts of it are made lossy, meaning PostgreSQL must fetch every row on the affected pages and recheck them against the query conditions.
The minmax-multi operator class was introduced in PostgreSQL 14.
By default (autosummarize=off), BRIN indexes do NOT automatically keep the index up to date during INSERTs. New pages remain unsummarized until VACUUM runs, brin_summarize_new_values() is called, or the index is created with autosummarize=on.
BRIN indexes can have up to 32 columns, including INCLUDE columns.
Because BRIN indexes are very small, scanning the index adds little overhead compared to a sequential scan, but may avoid scanning large parts of the table known not to contain matching tuples.
BRIN indexes are lossy. The query executor must recheck tuples retrieved via BRIN indexes and discard those that do not match the query conditions.
BRIN indexes satisfy queries via regular bitmap index scans, returning all tuples in all pages within each range if the summary info is consistent with the query conditions.
The values_per_range parameter defines the maximum number of values (each representing either a point or a boundary of an interval) stored to summarize a block range, allowing multiple smaller intervals instead of one large interval.
Parallel build support for BRIN indexes was added in PostgreSQL 17.
Only B-tree, GiST, GIN, and BRIN index methods support multiple-key-column indexes.
When a table is vacuumed (either manually or by autovacuum), all existing unsummarized page ranges are automatically summarized.
With pages_per_range of 128 and 8KB block size, one BRIN index entry covers 128 × 8KB = 1024KB (1MB).
Yes, a multicolumn BRIN index can be used with query conditions that involve any subset of the index's columns, and index search effectiveness is the same regardless of which index column(s) the query conditions use.
Operator class parameters are specified in the index definition, for example: CREATE INDEX idx ON table USING BRIN (column bloom_ops(n_distinct_per_range = 100, false_positive_rate = 0.05));
No, unlike B-tree indexes where column order matters, BRIN index search effectiveness is the same regardless of which index column(s) the query conditions use.
Yes, when VACUUM processes a table (even with autosummarize=off), it will summarize all unsummarized page ranges. The autosummarize parameter only controls whether insertions trigger immediate summarization requests.
The four types are: minmax (stores minimum and maximum values), minmax-multi (stores multiple min/max intervals), inclusion (stores values that include the indexed column values), and bloom (builds Bloom filters).
The 'Rows Removed by Index Recheck' metric shows the number of false positives (rows that matched the BRIN summary but didn't match the actual query conditions) that were discarded after the executor rechecked them.
The only reason to have multiple BRIN indexes instead of one multicolumn BRIN index is to have a different pages_per_range storage parameter for different columns.
You should use brin_desummarize_range() when the index tuple is no longer a good representation of the data, such as after deletions that removed minimum or maximum values, making the stored range less optimal.
CREATE INDEX index_name ON table_name USING BRIN (column_name) WITH (pages_per_range = value);
BRIN indexes were first introduced in PostgreSQL 9.5, released on January 7, 2016.
The brin_desummarize_range(regclass, bigint) function removes the summary information for a specific range.
BRIN indexes are effective when correlation is close to 1 (ascending order) or -1 (descending order). When correlation is close to 0, BRIN indexes are no better than a sequential scan.
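You can check the correlation the planner sees via pg_stats after running ANALYZE (table and column names are illustrative):
SELECT attname, correlation FROM pg_stats WHERE tablename = 'measurements' AND attname = 'recorded_at';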
BRIN indexes are dramatically smaller than B-tree indexes. Real-world examples show BRIN indexes as small as 32KB-184KB compared to B-tree indexes of 676MB-2GB on the same data, often representing 1/100th or less of the B-tree size.
When pages are already summarized, the summary in BRIN is updated immediately along with DML operations. Only new, unsummarized pages require explicit summarization via VACUUM or maintenance functions.
BRIN indexes are best suited for large, insert-only tables with naturally ordered data (high correlation), such as tables with timestamp columns where new rows are continuously appended in chronological order.
PL/pgSQL > Cursors
35 questions
A holdable cursor is implicitly closed if the transaction that created it aborts via ROLLBACK.
The default is to allow scrolling in some cases depending on the query plan's complexity; this is not the same as specifying SCROLL. PostgreSQL will allow backward fetches if the cursor's query plan is simple enough.
No, MOVE repositions a cursor without retrieving any data. MOVE works like the FETCH command, except it only repositions the cursor and does not return the row moved to.
BINARY causes the cursor to return data in binary format rather than text format. This reduces conversion effort for both the server and client, at the cost of more programmer effort to deal with platform-dependent binary data formats.
Declare the cursor SCROLL WITH HOLD and commit the transaction before reading any rows from it. This will force the entire output to be materialized in temporary storage, so that volatile functions are executed exactly once for each row.
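A minimal sketch (the query is illustrative):
BEGIN;
DECLARE c SCROLL CURSOR WITH HOLD FOR SELECT id, random() AS r FROM items;
COMMIT;                    -- materializes the result; random() has now run exactly once per row
FETCH 5 FROM c;
FETCH BACKWARD 2 FROM c;   -- re-reads the same materialized values; nothing is re-evaluated
CLOSE c;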
The rows represented by a held cursor are copied into a temporary file or memory area so that they remain available for subsequent transactions.
The variable recordvar is automatically defined as type record and exists only inside the loop. Any existing definition of the variable name is ignored within the loop.
The cursor must be a non-grouping query on the target table. There are restrictions on what the cursor's query can be (in particular, no grouping) and it's best to use FOR UPDATE in the cursor.
NEXT, PRIOR, FIRST, LAST, ABSOLUTE count, RELATIVE count, FORWARD, BACKWARD, count, and ALL.
OPEN unbound_cursorvar [NO SCROLL | SCROLL] FOR EXECUTE query_string [USING expression [, ...]];
All cursor variables in PL/pgSQL must be of the special data type refcursor.
WITH HOLD specifies that the cursor can continue to be used after the transaction that created it successfully commits. The cursor persists beyond transaction boundaries.
NEXT is the default direction when no direction clause is specified in a FETCH command.
FETCH RELATIVE 0 re-fetches the current row (if any) without moving the cursor position.
Yes, the FETCH forms involving FORWARD and BACKWARD, as well as the forms FETCH count and FETCH ALL (where FORWARD is implicit), are PostgreSQL extensions.
Unbound cursor variables (simple refcursor declarations without an associated query) and bound cursor variables (declared with a query using the CURSOR FOR syntax).
Argument values can be passed using either positional notation (all arguments in order) or named notation (using := or => to separate argument name from value). You can also mix positional and named notation.
Yes, the name of the cursor must be different from any other active cursor name in the session. Portal names (the underlying server structure) must be unique within the session for the duration of the portal's existence.
No, absolute fetches are not any faster than navigating to the desired row with a relative move because the underlying implementation must traverse all the intermediate rows anyway. Negative absolute fetches are even worse as the query must be read to the end first.
In PL/pgSQL, the count can be any integer-valued expression (unlike the SQL FETCH command, which only allows an integer constant).
The case where the direction clause is simply a count expression with no keyword is deprecated because that syntax is ambiguous with the case where the direction clause is omitted altogether, and hence it may fail if the count is not a constant.
No, in PostgreSQL all cursors are insensitive (changes are not visible), so these keywords have no effect and are only accepted for compatibility with the SQL standard.
FETCH ALL fetches all remaining rows in the forward direction, leaving the cursor positioned after the last row.
If the cursor's query includes FOR UPDATE or FOR SHARE, then returned rows are locked at the time they are first fetched.
No, FETCH in PL/pgSQL can only fetch one row at a time (directions that fetch more than one row like in SQL FETCH are not allowed). The valid directions are NEXT, PRIOR, FIRST, LAST, ABSOLUTE count, RELATIVE count, FORWARD, or BACKWARD.
No, WITH HOLD may not be specified when the query includes FOR UPDATE or FOR SHARE.
Yes, rewinding to the start of the query with FETCH ABSOLUTE 0 is fast, unlike negative absolute fetches.
The FOR statement automatically opens the cursor, and it closes the cursor again when the loop exits. This eliminates the need to manually code OPEN and CLOSE statements.
Prior to PostgreSQL 16, bound cursor variables were initialized to contain their own names, so the portal name would be the same as the cursor variable's name by default. This was changed in version 16 because it created too much risk of conflicts between similarly-named cursors in different functions.
The server data structure underlying an open cursor is called a portal.
The special variable FOUND can be checked to see whether a row was obtained. FOUND is set to TRUE if FETCH retrieved a row, FALSE otherwise.
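A minimal PL/pgSQL sketch:
DO $$
DECLARE
    c CURSOR FOR SELECT relname FROM pg_class LIMIT 3;
    r record;
BEGIN
    OPEN c;
    LOOP
        FETCH c INTO r;
        EXIT WHEN NOT FOUND;   -- FOUND is false once FETCH returns no row
        RAISE NOTICE 'relation: %', r.relname;
    END LOOP;
    CLOSE c;
END $$;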
FETCH ABSOLUTE 0 positions the cursor before the first row without returning any data. This is a fast way to rewind the cursor to its initial position.
Data Types - Range
35 questions
GiST (Generalized Search Tree) and SP-GiST (Space-Partitioned Generalized Search Tree) indexes. B-tree and hash indexes can be created on range columns, but they support only equality and ordering comparisons, not range operators such as containment or overlap.
Specifies a function that converts range values to a canonical form. The function must take one argument of the range type and return a value of the same type.
The system automatically generates a multirange type name by either replacing 'range' with 'multirange' in the type name, or appending '_multirange' suffix.
range_merge(anymultirange) → anyrange - Computes the smallest range that includes the entire multirange.
EXCLUDE USING GIST (during WITH &&) - where 'during' is a range column and && is the overlap operator.
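A fuller sketch of the pattern (table and column names are illustrative; btree_gist is needed so the scalar room column can participate in the GiST constraint):
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE room_reservation (
    room   text,
    during tstzrange,
    EXCLUDE USING GIST (room WITH =, during WITH &&)
);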
unnest(anymultirange) → setof anyrange - expands a multirange into a set of ranges.
'()', '(]', '[)', and '[]' - where square brackets [ ] mean inclusive and parentheses ( ) mean exclusive.
multirange(anyrange) → anymultirange - Returns a multirange containing just the given range.
int4range (integer), int8range (bigint), numrange (numeric), tsrange (timestamp without time zone), tstzrange (timestamp with time zone), and daterange (date).
When combining scalar data types (like int, bigint, or text) with range types in an exclusion constraint, since scalar types don't have a default GiST operator class.
Tests if two ranges are adjacent (next to each other without overlapping or having a gap).
SUBTYPE - the name of the element type that the range type will represent ranges of. The subtype must have an associated b-tree operator class.
Tests if one range does not extend to the left of another range.
Returns a multirange containing the union of all non-NULL input range values in a group.
int2, int4, int8, float4, float8, numeric, timestamp with time zone, timestamp without time zone, time with time zone, time without time zone, date, interval, oid, money, char, varchar, text, bytea, bit, varbit, macaddr, macaddr8, inet, cidr, uuid, bool and all enum types.
Discrete: int4range, int8range, daterange. Continuous: numrange, tsrange, tstzrange.
Parentheses, brackets, commas, double quotes, or backslashes require the bound value to be quoted using double quote characters.
Lower bound inclusive, upper bound exclusive - notation [). For example, int4range(5,10) creates the range [5,10).
Adding an exclusion constraint will automatically create a GiST index of the type specified in the constraint declaration.
int4multirange, int8multirange, nummultirange, tsmultirange, tstzmultirange, and datemultirange.
Specifies a function that takes two values of the subtype and returns a double precision value representing the difference. This allows much greater efficiency of GiST indexes on the range type.
Discrete ranges have element types with a well-defined 'step' (like integer or date) where adjacent values exist. Continuous ranges (like numeric or timestamp) have infinite possible values between any two points.
Returns a range that is the intersection of all non-NULL input range values in a group, or empty if there is no intersection.
Tests if one range does not extend to the right of another range.
Yes, they both represent the integers 4, 5, 6, 7, and 8. For discrete range types, these normalize to the same canonical representation.
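Assuming the two ranges in question are written as '[4,8]' and '[4,9)', the equivalence can be checked directly:
SELECT '[4,8]'::int4range = '[4,9)'::int4range;  -- true; both canonicalize to [4,9)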
They fail if the resulting range would need to contain two disjoint sub-ranges, since a range cannot represent disconnected intervals.
It is automatically converted to exclusive, e.g., [,] is converted to (,).
Localization
35 questions
Only superusers and users with the appropriate SET privilege can change the lc_messages setting.
When a COLLATE clause is used, it results in explicit collation derivation. All other collation derivations are implicit.
The default value is an empty string, which inherits from the server execution environment.
Nondeterministic collations are only supported with the ICU provider.
When PGCLIENTENCODING is defined in the client's environment, that client encoding is automatically selected when a connection to the server is made. It behaves the same as the client_encoding connection parameter.
The default value is an empty string, which inherits from the server execution environment.
No, when copying template1, you must use the same encoding and locale settings it has.
If any input expression has an explicit collation derivation, then all explicitly derived collations among the input expressions must be the same, otherwise an error is raised.
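A sketch of the rule (a table test1 with text columns a and b is illustrative):
SELECT a < ('foo' COLLATE "C") FROM test1;                -- OK: one explicit collation, one implicit
SELECT (a COLLATE "C") < (b COLLATE "POSIX") FROM test1;  -- ERROR: collation mismatch between explicit collations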
The builtin C.UTF-8 locale provides code point sort order (fast, memcmp-based) combined with Unicode semantics for character operations such as pattern matching, regular expressions, and LOWER()/INITCAP()/UPPER().
When using the builtin locale provider with initdb, the locale must be specified and set to C, C.UTF-8, or PG_UNICODE_FAST.
ICU locales use BCP 47 language tags (e.g., ja-JP, de, or fr-CA), but can also accept most libc-style locale names.
B-tree cannot use deduplication with indexes that use a nondeterministic collation.
For C or POSIX locale, all encodings are allowed. For other libc-provided locales, there is only one character set that will work correctly (except on Windows, where UTF-8 encoding can be used with any locale).
The builtin provider supports only C, C.UTF-8, and PG_UNICODE_FAST locales.
Yes, using SQL_ASCII encoding when LC_CTYPE is not C or POSIX is deprecated and may someday be forbidden altogether, as it poses risks of locale-dependent misbehavior.
Valid values are DISABLED, DEBUG5, DEBUG4, DEBUG3, DEBUG2, DEBUG1, INFO, NOTICE, WARNING, ERROR, and LOG.
The PG_UNICODE_FAST locale is available only when the database encoding is UTF-8.
SQL_ASCII interprets byte values 0-127 according to the ASCII standard, while byte values 128-255 are taken as uninterpreted characters.
The 'default' collation refers to the locale settings defined for the database, which were specified at database creation time.
template0 is normally marked datallowconn = false to prevent its modification.
The default value is an empty string, which inherits from the server execution environment.
The following are always available: 'default' (selects locale specified at database creation), 'C' and 'POSIX' (sort by byte values), 'unicode' (sorts using Unicode Collation Algorithm with Default Unicode Collation Element Table), 'ucs_basic' (sorts by Unicode code point values with ASCII-only letter treatment), 'pg_unicode_fast' (Unicode code point sorting with full case mapping), and 'pg_c_utf8' (Unicode code point sorting with simple case mapping).
PostgreSQL supports 6 locale categories: LC_COLLATE (string sort order), LC_CTYPE (character classification), LC_MESSAGES (language of messages), LC_MONETARY (formatting of currency amounts), LC_NUMERIC (formatting of numbers), and LC_TIME (formatting of dates and times).
Yes, new encoding and locale settings can be specified when copying template0, whereas a copy of template1 must use the same settings it does.
No. All supported character sets can be used transparently by clients, but a few are not supported for use within the server (as a server-side encoding). Examples include BIG5, GB18030, GBK, JOHAB, SJIS, SHIFT_JIS_2004, and UHC.
After rebuilding all objects depending on the collation (e.g., using REINDEX) following a collation version change, such as when upgrading the operating system to a new major version or using pg_upgrade with newer ICU binaries.
LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME can be adjusted via server configuration parameters after database creation.
If the --locale option is not specified, the locale is inherited from the environment that initdb runs in.
Backup and Restore > pg_dump
35 questions
PostgreSQL 9.2. This feature allows splitting dumps into pre-data, data, and post-data sections.
No. When --schema, --table, or --schema-only is specified, large objects are excluded unless you explicitly add the -b (--blobs) option.
It disables dollar quoting for function bodies and forces them to be quoted using SQL standard string syntax with escaped single quotes.
pg_dump sets row_security to OFF by default to ensure all data is dumped. This requires sufficient privileges to bypass row security, or an error will be thrown.
No. Extensions are not dumped when using --schema or --table switches, as these options focus on specific schema or table objects.
It prevents outputting commands to select table access methods, causing all objects to be created with the default access method during restore.
Restoring a dump causes the destination to execute arbitrary code of the source superusers' choice. Inspect non-plain-text dumps before restoring from untrusted sources.
It dumps only extensions matching the pattern. Multiple extensions can be selected with multiple -e switches. When specified, pg_dump makes no attempt to dump dependencies of selected extensions.
No. pg_dump maintains consistency during concurrent database access without blocking other users from accessing the database (reads or writes).
It causes pg_dump to respect row-level security policies and only export rows that are visible according to the policies defined for the user running the dump.
Yes. By default, pg_dump includes access privileges (GRANT/REVOKE commands) and object ownership (ALTER OWNER commands) in the dump output.
It allows pg_dump to use a snapshot previously defined by a concurrent session (via pg_export_snapshot() or a logical replication slot) instead of taking a new snapshot.
Because template1 may contain site-local additions that could conflict with objects in the dump, causing duplicate definition errors. template0 provides a pristine database.
PostgreSQL 9.5. This feature enables coordination between pg_dump and logical replication or other concurrent operations.
Used with --clean, it adds IF EXISTS to DROP commands, suppressing 'does not exist' errors when restoring to an empty database. It is not valid unless --clean is also specified.
It requires that each extension (-e), schema (-n), and table (-t) pattern match at least one object. Without --strict-names, pg_dump generates an error only if none of the patterns match anything at all.
PostgreSQL 11. This option helps with partitioned tables when restoring to systems with different collations, encodings, or hash partitioning configurations.
No. pg_dump cannot dump from servers newer than its own major version, though a newer pg_dump can dump from older servers back to version 9.2.
No. pg_dump handles only one database at a time. Use pg_dumpall to dump an entire cluster or multiple databases, including global objects.
Directory format (-Fd). Only the directory format supports parallel dumps using the -j option.
Definitions of indexes, triggers, rules, statistics for indexes, and constraints (other than validated check and not-null constraints).
In cross-version cases where the source and target PostgreSQL major versions differ, as different versions may have different sets of reserved words.
0 (zero). pg_dump returns exit code 0 on success and non-zero (typically 1) on failure.
It forces quoting of all identifiers to prevent compatibility issues when dumping from or restoring to different PostgreSQL major versions that may have different reserved word lists.
A number greater than zero. This controls the maximum number of rows per INSERT command when dumping data as INSERT statements rather than COPY.
Subscriptions are restored with connect = false, requiring manual reactivation after restore to prevent unintended replication connections.
During pg_restore, not during pg_dump. The option is ignored when emitting archive (non-text) output files.
All data definition items including table schemas, types, functions, and other database object definitions (excluding indexes, triggers, rules, and constraints).
COPY. By default, pg_dump uses COPY commands for data export, which is much faster than INSERT statements.
If the restore is interrupted mid-process with triggers disabled, system catalogs may be left in the wrong state, potentially causing data integrity issues.
No. Despite their importance to the query optimizer, statistics are NOT dumped by default. You must run ANALYZE after restore for optimal performance.
It prevents dumping of COMMENT ON SQL commands (user-defined comments on database objects), not inline SQL comments in code.
Actual table data, large-object contents, sequence values, and statistics for tables, materialized views, and foreign tables.
njobs + 1 connections. If you specify -j 4, pg_dump will open 5 connections total (4 worker connections plus 1 master connection).
pg_dump includes a CREATE EXTENSION command in dumps and does not dump the individual member objects of the extension, as those are managed by the extension itself.
High Availability
35 questions
The default value is 'on'. Valid values are: 'off', 'local', 'remote_write', 'on' (default), and 'remote_apply', listed in order of increasing synchronization level.
The default value is 0 (zero megabytes), meaning no extra WAL segments are kept beyond what is needed for regular operation.
The default value is 30 seconds (specified as milliseconds if no units are provided). Setting it to -1 allows the standby to wait indefinitely.
No, logical replication does not support DDL replication. Database schema and DDL definitions (like ALTER TABLE, CREATE/DROP INDEX) are not replicated to the subscriber. Published tables must already exist on the subscriber.
The default value is 0.9 (a floating point value representing the fraction of time between checkpoints).
The default value is 0 (disabled), meaning replication slots will not be dropped due to inactivity. The value is specified in seconds.
The default value is 1 GB (1024 megabytes). This is a soft limit; WAL size can exceed max_wal_size under special circumstances such as heavy load or a failing archive_command.
The cascading standby must have max_wal_senders and hot_standby set, and host-based authentication (pg_hba.conf) configured to accept replication connections. Downstream standbys set primary_conninfo to point to the cascading standby.
The default value is 5 seconds (specified as milliseconds if no units are provided). This controls how long to wait before retrying to retrieve WAL data.
The default value is 'pause'. Valid values are: 'pause' (pauses recovery to allow query execution), 'promote' (finishes recovery and starts accepting connections), and 'shutdown' (stops the server). If hot_standby is not enabled, 'pause' acts the same as 'shutdown'.
The default value is 'off'. When enabled, the standby sends feedback to the primary about queries currently executing, which can prevent query cancels but may cause bloat on the primary.
ANY specifies quorum-based synchronous replication. For example, 'ANY 3 (s1, s2, s3, s4)' causes each commit to proceed as soon as at least any three standbys reply. This was added in PostgreSQL 10.
The default value is 5 minutes. The valid range is 30 seconds to 1 day. This parameter requires a server restart to change.
When WAL streaming is used and no slot name is specified with -S, pg_basebackup automatically uses a temporary replication slot (PostgreSQL 10 and later). Temporary slots are automatically dropped when the connection ends.
The default value is 'latest'. This is crucial for high availability as it allows standby servers to automatically follow timeline changes when a standby is promoted.
Log shipping is asynchronous by default, meaning WAL records are shipped after transaction commit. This creates a window of potential data loss if the primary fails before WAL is shipped.
The default value is 2. This controls the maximum number of synchronization workers per subscription for initial table synchronization.
The default value is 10. The valid range is 0 to 262143. When set to 0, replication is disabled. This parameter requires a server restart to change.
The standby_mode parameter was removed in PostgreSQL 12. It was replaced by the standby.signal file - the presence of this empty file signals the cluster to run in standby mode.
%r is replaced by the name of the file containing the last valid restart point. All WAL files logically preceding this value will be removed from the archive location.
The default value is 0 (zero seconds), meaning there is no forced WAL file switching based on time. When set to a non-zero value, it forces WAL switching at the specified interval.
vacuum_defer_cleanup_age was removed in PostgreSQL 16. It was added in PostgreSQL 9.0 but became unnecessary after hot_standby_feedback and replication slots were added. It was deprecated in version 15.
The default value is -1, which means unlimited. The value is specified in megabytes.
The default value is 10 seconds. This controls how often the standby sends status updates to the primary.
The default value is 'on'. This parameter can only be set at server start and affects whether queries can run during archive recovery or standby mode.
The default value is 60 seconds (specified as milliseconds if no units are provided). Setting it to 0 disables the timeout mechanism.
The default value is 256kB on Linux and 0 elsewhere. Valid range is between 0 (which disables forced writeback) and 2MB. This parameter was added in PostgreSQL 9.6.
The default value is -1, which enables auto-tuning. With auto-tuning, wal_buffers is set to 1/32nd of shared_buffers, with a minimum of 64kB and maximum of one WAL segment (typically 16MB). If manually set, the minimum is 32kB.
Support for recovery.conf was removed in PostgreSQL 12. It was replaced by two signal files: standby.signal (for standby mode) and recovery.signal (for normal archive recovery). Recovery parameters moved to postgresql.conf. The presence of recovery.conf in PG12+ will cause the server to refuse to start.
The default value is 'off'. Valid values are: 'off', 'on', and 'always'. 'on' applies archive_command only on the primary, while 'always' applies it regardless of whether the server is primary or standby. This parameter requires a server restart.
The three lag metrics are: write_lag (time for standby to write WAL records), flush_lag (time for WAL to be flushed to disk on standby), and replay_lag (time for WAL to be replayed and become visible to queries on standby). All use the interval data type.
Possible values are: 'startup', 'catchup' (standby is behind and synchronizing), 'streaming' (standby is caught up and replication is ongoing), and 'backup'. The ideal state is 'streaming'.
The promote_trigger_file parameter (renamed from trigger_file in PostgreSQL 12) was removed in PostgreSQL 16. The recommended method is to use pg_ctl promote or call pg_promote().
wal_keep_segments was renamed to wal_keep_size in PostgreSQL 13. To convert: if wal_keep_segments was N, set wal_keep_size to 16 * N megabytes (since each segment is normally 16 megabytes).
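For example, assuming an old setting of wal_keep_segments = 64 and the default 16 MB segment size, the equivalent is 64 × 16 MB = 1024 MB:
ALTER SYSTEM SET wal_keep_size = '1GB';
SELECT pg_reload_conf();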
The default value is 80 MB. This ensures enough WAL space is reserved to handle spikes in WAL usage.
PostGIS > Spatial Indexing
35 questions
DROP INDEX index_name CASCADE; This drops the index and all objects that depend on it.
Yes, it is a self-tuning index structure that automatically handles variable data density, differing amounts of object overlap, and object size.
First, it uses the fast bounding box index query to answer 'what boxes intersect?' (using the index), then it performs exact geometric calculations only on the filtered results to verify actual intersection.
SP-GiST is about 5-15% faster than GiST when data has less overlap and is relatively uniform. GiST is about 5-15% faster when data has a high degree of overlay.
gist_geometry_ops_2d uses the && operator for 2D bounding box operations, while gist_geometry_ops_nd uses the &&& operator for n-dimensional (including 3D) spatial operations.
REINDEX INDEX index_name; or REINDEX TABLE table_name; to rebuild all indexes on the table.
When you suspect corruption (detectable via ST_IsValid or pgstattuple), after significant data operations (loading, deleting, updating), or when experiencing performance issues from index bloat or fragmentation.
No, currently only B-tree and BRIN indexes support parallel creation through max_parallel_maintenance_workers. GiST indexes are built serially.
The default is 16MB, which is generally too low. The recommendation is 32MB to 256MB on production servers with lots of RAM, or 512MB to 1GB for high-performance databases.
You may get 'cannot cluster when index access method does not handle null values'. Fix by adding a NOT NULL constraint to the geometry column.
CREATE INDEX [indexname] ON [tablename] USING GIST ([geometryfield] gist_geometry_ops_nd);
Spatial indexes store only the bounding box of geometries, not the full geometry.
PostGIS uses an R-Tree index implemented on top of GiST to index spatial data.
R-Trees organize data into nesting rectangles (bounding boxes), with smaller rectangles nested in larger ones. The largest rectangle is in the top layer and smallest in the bottom layer, similar to a B-tree root-branch-leaf structure.
PostgreSQL requires a btree operator class for any type that supports ORDER BY in queries (geometry is sortable), and requires a hash operator class for any type that supports DISTINCT in queries. These exist solely to enable core database functionality, not for spatial indexing.
Look for 'Index Scan' instead of 'Seq Scan' (Sequential Scan), and the presence of operators like && (bounding box operator) in Index Cond or Filter clauses.
btree_gist provides GiST operator classes for standard B-tree-indexable data types, allowing you to create multi-column GiST indexes that combine spatial columns with non-spatial columns (like timestamps or text).
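A sketch of a mixed spatial/non-spatial index (table and column names are illustrative):
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX trips_geom_started_gist ON trips USING GIST (geom, started_at);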
GiST stands for 'Generalized Search Tree' and is a generic form of indexing for multi-dimensional data.
VACUUM reclaims unused space in table pages left by updates or deletes. ANALYZE updates internal statistics used for query plan estimation. They serve different purposes and issuing one does not perform the other's function.
Once a spatial data table exceeds a few thousand rows, you will want to build an index to speed up spatial searches of the data.
CREATE INDEX [indexname] ON [tablename] USING BRIN ( [geometryfield] ) WITH (pages_per_range = [number]);
Use EXPLAIN ANALYZE before your query to see the query plan and verify index usage.
CREATE INDEX [indexname] ON [tablename] USING BRIN ( [geometryfield] );
No, ST_Relate does not include an automatic index filter. You must manually add the && operator for index usage.
ST_DWithin uses the && operator internally on an expanded bounding box of the query geometry to provide index-accelerated distance testing.
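A sketch of an index-accelerated radius search (table, column, and SRID are illustrative; the distance is in the units of the SRID, metres here):
SELECT name
FROM poi
WHERE ST_DWithin(geom, ST_SetSRID(ST_MakePoint(500000, 4649776), 26918), 1000);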
PostGIS used the Morton curve before version 3.0, which was replaced with the Hilbert curve in PostGIS 3.0.
No, GiST index build time is not sensitive to the maintenance_work_mem parameter, unlike GIN indexes which can benefit from increased maintenance_work_mem.
SP-GiST builds in approximately 5.6 seconds with 44 MB size, compared to GiST at 15.0 seconds with 53 MB size (about 2.7x faster build time).
PostGIS 3.0+ uses Hilbert curve ordering as the default sort order for geometry columns.
Spatial indexes are sensitive to input order. Pre-ordering rows randomly will result in a more balanced tree and measurable improvements in query performance.
Yes, the database does not maintain BRIN indexes dynamically. Changes to spatial data in the table are simply appended to the end of the index until summarization occurs.
SET enable_seqscan TO off; This strongly discourages sequential scans, so the planner will use the index whenever possible. Remember to set it back to on afterwards.
btree_geometry_ops (btree), hash_geometry_ops (hash), gist_geometry_ops_2d (gist), gist_geometry_ops_nd (gist), brin_geometry_inclusion_ops_2d (brin), brin_geometry_inclusion_ops_3d (brin), brin_geometry_inclusion_ops_4d (brin), spgist_geometry_ops_2d (spgist), spgist_geometry_ops_3d (spgist), spgist_geometry_ops_nd (spgist).
There are actually 10 different operator classes for the PostGIS geometry type.
Connection Pooling > pgpool-II
34 questions
To use md5 authentication, the pool_passwd authentication file must contain the user password in either plain text, AES or md5 encrypted format.
Default is off. If set to on, pgpool-II's in-memory query cache feature is enabled.
When set to 'dml_adaptive', Pgpool-II tracks each table referenced in write statements within explicit transactions and will not load balance subsequent read queries if the table being read was previously modified in the same transaction.
This parameter allows Pgpool-II to use clear-text-password authentication with frontend clients when pool_passwd file does not contain the password for the connecting user, and use that password (provided by client) to authenticate with the backend using MD5 and/or SCRAM authentication. This parameter does not enforce SSL encryption because pool_hba.conf cannot be used with it.
To enable native replication mode, use 'native_replication' for backend_clustering_mode. In this mode, Pgpool-II performs replication by routing write queries to all PostgreSQL servers.
The default value is 32. This parameter specifies the number of preforked Pgpool-II server processes and also represents the concurrent connections limit to Pgpool-II from clients.
The default is 300 seconds. This parameter specifies the maximum amount of time in seconds to search for the primary node when a failover scenario occurs. Setting this parameter to 0 means keep trying forever.
Default is on. This parameter enables relation cache sharing among pgpool-II child processes using shared memory.
Default is 0, which means the cache never expires. This parameter controls the lifetime of the relation cache in seconds.
The default is 0, which turns off the feature. This parameter specifies the time in seconds to disconnect a client if it remains idle since the last query.
The selection of the load balance node is influenced by the backend_weight config value for each node. Although backend_weight accepts any positive integer value, all the backend_weights eventually get normalized on the scale of [0 to 1]. For example, if you have 2 nodes with backend_weight = 1 each, the normalized weight for each node will be 0.5 and both will get 50% of SELECT queries.
The default value is 4. This parameter represents the maximum number of cached connections in each pgpool-II child process.
This parameter (added in version 4.0) accepts a semicolon separated list of SQL patterns that should be sent to primary node only. Regular expressions can be used in SQL patterns. This enables specifying SQL pattern lists that should not be load-balanced.
If on, this parameter allows Pgpool-II to ignore SQL comments when judging if load balance or query cache is possible.
The default is 9898 for pcp_port. This parameter can only be set at server start.
No, Pgpool-II does not accept client connections and queries during the execution of recovery_2nd_stage_command command, and waits for the existing clients to close their connections before executing the command.
Online recovery is performed in two phases called 'first stage' and 'second stage'. recovery_2nd_stage_command is required only for replication_mode. For other modes including streaming replication mode the second stage is not performed and you don't need to provide a script.
To use SCRAM authentication, the pool_passwd authentication file must contain the user password in either plain text or AES encrypted format (md5 format is NOT supported for SCRAM).
The delegate_IP parameter specifies the virtual IP address, which must not already be in use by another host. When a standby pgpool server is promoted to active, the new active server brings up the virtual IP interface while the previous active server brings it down, so clients can keep using the same IP address even when the active server is switched over.
The number of connections from Pgpool-II processes to the backends may reach num_init_children * max_pool in total. For query canceling, the formula is: max_pool * num_init_children * 2 <= (max_connections - superuser_reserved_connections).
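For example, with the illustrative settings num_init_children = 32 and max_pool = 4, the backends must accept up to 32 × 4 = 128 pooled connections, and for query canceling to work 32 × 4 × 2 = 256 must not exceed (max_connections - superuser_reserved_connections).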
min_spare_children = 5 (default minimum number of idle child processes) and max_spare_children = 10 (default maximum number of idle child processes). These parameters are used in dynamic process management mode.
Default value is 'pool_passwd'. Passwords can be stored in the pool_passwd file using three formats: AES256 encrypted format, plain text format and md5 format.
You can check if the listen queue is actually overflowed by using 'netstat -s' command. If you find something like '535 times the listen queue of a socket overflowed', then the listen queue is definitely overflowed.
Default is 'static', which is compatible with pre V4.4 behavior. This parameter can be set to either 'static' or 'dynamic'. process_management_mode is not available prior to Pgpool-II V4.4.
pcp_attach_node attaches the given node to Pgpool-II. However, you should only use pcp_attach_node if you are sure that there's no replication delay (or the delay will be recovered later on) and the role of the node (primary/standby) will not be changed.
Pgpool-II supports trust, md5, scram-sha-256, and pam authentication methods. MD5 is a password-based authentication method in which an MD5-hashed password is sent by the client. SCRAM-SHA-256 is a challenge-response authentication method that prevents password sniffing on untrusted connections.
The default value is 0, which means no connection from clients will be refused. When set to 1 or greater, incoming connections are rejected if the number of current connections exceeds (num_init_children - reserved_connections). This parameter can only be set at server start.
The default is 2. The queue length (the 'backlog' parameter of listen system call) is defined as listen_backlog_multiplier * num_init_children.
If more than num_init_children clients try to connect to Pgpool-II, they are blocked (queued) until a connection closes, with up to listen_backlog_multiplier * num_init_children connections that can be queued.
The default value is 0 (meaning disabled). When set, a pgpool-II child process will be terminated after this many connections from clients.
ssl_key and ssl_cert must both be configured in order for SSL to work with frontend connections. ssl_key specifies the path to the private key file, and ssl_cert specifies the path to the public x509 certificate file. There is no default value for these options, and if left unset SSL will be disabled for incoming frontend connections.
The configuration files are located at /usr/local/etc with default installation from source code. When you install Pgpool-II, pgpool.conf.sample is automatically created.
It specifies a comma separated list of function names that DO update the database. SELECTs including functions specified in this list are not load balanced. You can use regular expression to match function names, to which ^ and $ are automatically added. Example: black_function_list = 'nextval,setval,set_.,update_.,delete_.,insert_.'
Concurrency Control > Transaction Isolation Levels
34 questions
In REPEATABLE READ, UPDATE, DELETE, MERGE, SELECT FOR UPDATE, and SELECT FOR SHARE will only find target rows that were committed as of the transaction start time, not rows committed afterward.
Using narrower access through indexes (rather than sequential scans) can significantly reduce the frequency of serialization failures.
The importing transaction must use either SERIALIZABLE or REPEATABLE READ isolation level. READ COMMITTED is not sufficient for snapshot import.
Use SELECT pg_export_snapshot(); within a transaction (typically at REPEATABLE READ or SERIALIZABLE isolation level) to export the snapshot. It returns a snapshot identifier string.
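A two-session sketch (the returned snapshot identifier will differ):
-- Session 1
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();              -- e.g. returns '00000004-0000002A-1'
-- Session 2, while session 1's transaction is still open
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000004-0000002A-1';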
If SET TRANSACTION is executed without a prior START TRANSACTION or BEGIN, it emits a warning and otherwise has no effect.
Yes. Applications using REPEATABLE READ must be prepared to retry transactions that fail due to serialization errors (SQLSTATE 40001), just like SERIALIZABLE transactions.
The default value of max_pred_locks_per_relation is -2, which means max_pred_locks_per_transaction divided by 2. This maintains backward compatibility with previous PostgreSQL versions.
No. Read-only transactions at SERIALIZABLE level will never have serialization conflicts. Only updating transactions might need to be retried.
SET TRANSACTION SNAPSHOT can only be executed at the start of a transaction, before the first query or data-modification statement.
Yes. Progress may be impossible if conflicting prepared transactions exist. The situation will not resolve until the prepared transactions commit or roll back.
The DEFERRABLE transaction property has an effect only when the transaction is also SERIALIZABLE and READ ONLY. In all other cases, it has no effect.
You can specify the isolation level directly in BEGIN using: BEGIN TRANSACTION ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED };
SERIALIZABLE returns: 'ERROR: could not serialize access due to read/write dependencies among transactions' when it detects conditions that could cause a serialization anomaly.
In READ COMMITTED isolation level, each SQL statement sees a snapshot of data as it was at the start of that particular statement (not the start of the transaction).
The four phenomena are: 1) Dirty Read (reading uncommitted data), 2) Nonrepeatable Read (data changed between reads), 3) Phantom Read (rows added/removed between queries), and 4) Serialization Anomaly (inconsistent ordering of transactions).
No. Changes made to sequences (via nextval or setval) are immediately visible to all other transactions and are not rolled back if the transaction that made the changes aborts. Sequences bypass normal transaction isolation.
The default value of default_transaction_isolation is 'read committed'.
No. Normal SELECT statements do not acquire locks on rows and are never blocked by SELECT FOR UPDATE or any other locking mode, regardless of isolation level.
PostgreSQL internally implements only three distinct isolation levels, even though you can request any of the four standard SQL isolation levels. READ UNCOMMITTED is treated as READ COMMITTED.
No. The default_transaction_isolation parameter has context 'user' and can be set at the session or user level without restarting the database server.
Applications must retry the complete transaction, including all logic that decides which SQL to issue and/or which values to use. PostgreSQL provides no automatic retry mechanism because it cannot do so with any guarantee of correctness.
PostgreSQL implements SERIALIZABLE using Serializable Snapshot Isolation (SSI), which builds on Snapshot Isolation by adding checks for serialization anomalies using predicate locking.
Applications may also want to retry SQLSTATE 40P01 (deadlock_detected), 23505 (unique_violation), and 23P01 (exclusion_violation), though these may represent persistent rather than transient conditions.
No. A read-only SQL transaction cannot alter non-temporary tables. Commands like INSERT, UPDATE, DELETE, MERGE, and COPY FROM are disallowed on non-temporary tables.
Yes. PostgreSQL implements REPEATABLE READ using Snapshot Isolation (SI), where each transaction sees a consistent snapshot of the database taken at transaction start.
A REPEATABLE READ transaction sees a snapshot as of the start of the first non-transaction-control statement in the transaction, and this same snapshot is used for all statements in the transaction.
PostgreSQL returns SQLSTATE code 40001 (serialization_failure) for serialization failures. The error message text varies but always has this SQLSTATE code.
No. When a transaction is READ ONLY, all CREATE, ALTER, and DROP commands are disallowed, including creating or dropping temporary tables.
The default value of default_transaction_read_only is off (read/write mode). Note that on a standby server, it is always on.
No. In PostgreSQL, READ UNCOMMITTED is treated as READ COMMITTED, which prevents dirty reads. PostgreSQL's MVCC architecture does not support true READ UNCOMMITTED behavior.
No. Nextval and setval calls are never rolled back. This means transaction aborts or database crashes can result in gaps in the sequence of assigned values.
Only the SERIALIZABLE isolation level prevents serialization anomalies. READ COMMITTED and REPEATABLE READ can still experience serialization anomalies.
Row-Level Security
34 questions
If no WITH CHECK expression is defined, then the USING expression will be used both to determine which rows are visible and which new rows will be allowed to be added.
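A minimal sketch (table and column names are illustrative):
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;
CREATE POLICY account_owner ON accounts
    USING (owner = current_user);   -- with no WITH CHECK clause, this expression also gates new rows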
DROP POLICY [ IF EXISTS ] name ON table_name [ CASCADE | RESTRICT ];
The BYPASSRLS attribute. Roles with this attribute always bypass the row security system when accessing a table.
Yes, current_role is synonymous with current_user - they return the same value.
Yes, the same policy name can be used on different tables. Policy names must only be unique within a single table.
When the UPDATE path is taken, the existing row is first checked against the USING expressions of any UPDATE policies, then the new updated row is checked against the WITH CHECK expressions. If the existing row does not pass the USING expressions, an error is thrown (unlike a standalone UPDATE which would skip the row).
RESTRICTIVE policies are combined using the Boolean AND operator.
Setting row_security to off does not bypass row security; it throws an error if any query's results would get filtered by a policy. This is useful for operations like backups where you need to ensure all rows are captured.
UPDATE commands with RETURNING require both UPDATE policies (USING for existing rows, WITH CHECK for new rows) and SELECT policies, because UPDATE operations need to read data from columns.
Passing row data to a function requires a separate function call for every row returned. Because each function call in Postgres carries noticeable overhead, invoking the function once per row becomes expensive as the row count grows.
No, policy expressions cannot contain any aggregate or window functions.
No, policies can exist even when row-level security is disabled, but they will not be enforced in that state.
Yes, superusers and roles with the BYPASSRLS attribute always bypass the row security system when accessing a table.
You can modify the role list (TO clause), USING expression, and WITH CHECK expression. These are replaced independently if specified.
No, these keywords do not have any effect, since there are no dependencies on policies.
Only the table owner can create, alter, or drop policies on a table.
Policy expressions are evaluated for each row prior to any conditions or functions coming from the user's query, with leakproof functions being the only exception that may be applied ahead of row-security checks.
Both INSERT policies (WITH CHECK clause) and SELECT policies (USING clause) are evaluated. The INSERT policy determines if the row can be inserted, while the SELECT policy determines if the inserted row can be returned.
Yes, if you call a SECURITY DEFINER function in your RLS policy, you can perform a query without incurring the cost of any other RLS policies, as the function executes with the privileges of the function owner.
No, operations that apply to the whole table, such as TRUNCATE and REFERENCES, are not subject to row security.
Yes, you can use sub-SELECTs or functions containing SELECTs in policy expressions to consult other rows or tables, though such accesses can create race conditions that could allow information leakage if care is not taken.
session_user is the user who initiated the database connection and normally doesn't change. current_user is the user identifier applicable for permission checking and can change with SET ROLE or during execution of SECURITY DEFINER functions. In Unix terms, session_user is the 'real user' and current_user is the 'effective user'.
A default-deny policy is used, meaning that no rows are visible or can be modified.
A leakproof function has no side effects and doesn't leak information about values received. Common operators like equality operators are leakproof. Functions that might throw errors depending on values (like division by zero) are not leakproof. The LEAKPROOF parameter can only be set by superusers.
Data Types - JSON > JSONPath
34 questions
jsonb_path_exists(target jsonb, path jsonpath [, vars jsonb [, silent boolean]]) → boolean
It negates and iterates over values (e.g., '- $.x' on {"x": [2,3,4]} yields [-2, -3, -4])
When true, it suppresses structural errors (missing keys, array bounds violations, type mismatches) that would normally be thrown, returning empty results instead
+ (addition and unary plus), - (subtraction and negation), * (multiplication), / (division), % (modulo)
Missing object field or array element, unexpected JSON item type, datetime errors, and numeric errors
In lax mode, the path engine implicitly adapts the queried data to the specified path and suppresses structural errors. In strict mode, structural errors raise an error and the path must exactly match the JSON structure
Only jsonb_ops supports .** and .* accessors; jsonb_path_ops does not
Path expression validation - checks if a path expression returns any items
jsonb_path_exists, jsonb_path_match, jsonb_path_query, jsonb_path_query_array, and jsonb_path_query_first
No, they cannot be used in indexes due to stability requirements
" for quotes, \ for backslashes, \n, \t, \b, \f, \r, \v for control characters, and Unicode sequences like \uNNNN or \u{N...N}
It iterates over the values (e.g., '+ $.x' on {"x": [2,3,4]} yields [2, 3, 4])
In lax mode, methods applied to an array will be executed for each value in the array, with exceptions for .type() and .size()
boolean - returns true if the path returns any item, false otherwise
strict mode (to avoid surprising results like selecting values multiple times)
$ represents the JSON value being queried and is always the first element of a path
It provides a JSON object containing named values for path variables (like $min, $max, etc.) used in the path expression
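For example (the JSON value and variable names are illustrative):
SELECT jsonb_path_query(
    '{"readings": [1, 4, 7, 12]}',
    '$.readings[*] ? (@ >= $min && @ <= $max)',
    '{"min": 3, "max": 10}'
);   -- returns 4 and 7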
setof jsonb - returns all JSON items matching the path expression as separate rows
As SQL character string literals (single-quoted), with doubled single quotes for embedded apostrophes
Internals > Page Structure
33 questions
2 bytes. This field stores various flag bits indicating tuple properties.
0x07FF. This mask represents the lower 11 bits used to store the number of attributes in the tuple.
Version 4. PostgreSQL 8.1 and 8.2 used version 3, PostgreSQL 8.0 used version 2, and earlier versions used version 1 or 0.
2 bytes. This field contains the byte offset from the page start to the start of special space.
2 bytes. This field stores both the page size and a version indicator.
0x0004. This flag indicates all tuples on the page are visible to everyone.
LP_UNUSED - This indicates the line pointer is available for immediate re-use and should always have lp_len=0.
0x0004. This flag indicates the tuple has external stored attribute(s) (TOASTed).
15 bits. This field stores the offset to the tuple from the start of the page.
4 bytes. This field stores the oldest unpruned XMAX on the page as a hint for pruning.
2 bytes. This uint16 field stores the number of attributes (11 bits) plus various flags.
LP_REDIRECT - This indicates the line pointer is the first of a HOT (Heap Only Tuple) chain and should have lp_len=0.
2 bytes. This field stores the page checksum when checksums are enabled.
0x0001. This flag indicates the tuple has null attribute(s).
8 bytes. This field stores the LSN of the last WAL record that modified this page.
0x4000. This flag indicates the tuple was HOT-updated (Heap-Only Tuple update).
2 bytes. This field contains the byte offset from the page start to the start of unallocated space.
2 bytes. This field contains the byte offset from the page start to the end of unallocated space.
23 bytes. This is the fixed-size header (HeapTupleHeaderData) that precedes tuple data.
pd_upper - pd_lower. This represents the contiguous unallocated space in the middle of the page between the growing item identifier array and the tuple data.
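With the pageinspect extension, the free space of a page can be inspected directly (the table name is illustrative):
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lower, upper, upper - lower AS free_bytes
FROM page_header(get_raw_page('mytable', 0));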
t_hoff must always be a multiple of MAXALIGN. This ensures user data begins at a properly aligned boundary (typically 8 bytes on 64-bit systems).
4096 bytes (4 KB). While the smallest compilable BLCKSZ is 256 and the smallest that works is 1024, 4096 is the minimum that passes regression tests.
1600 columns. This limit exists because the tuple header size plus null bitmap (1 bit per column) plus MAXALIGN must fit in t_hoff which is uint8.
0x0002. This flag indicates the tuple has variable-width attribute(s).
0x8000. This flag indicates this is a heap-only tuple (part of a HOT chain).
(BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData)). This represents the upper bound on tuples per page, accounting for maxaligned tuple headers and line pointers.
0x0002. This flag is set when an UPDATE doesn't find enough free space for its new tuple version.
LP_NORMAL - This indicates the line pointer is in use and points to a valid tuple with lp_len > 0.
0x0001. This flag indicates there are LP_UNUSED line pointers before pd_lower.
2 bytes. This field stores flag bits including PD_HAS_FREE_LINES, PD_PAGE_FULL, and PD_ALL_VISIBLE.
24 bytes. This fixed-size header appears at the beginning of every database page.
It equals the page size. Ordinary tables do not use a special section, so pd_special is set to the page size to indicate no special space exists.
LP_DEAD - This indicates the line pointer is dead and will likely be freed by VACUUM soon, with lp_len=0.
Connection Pooling > Connection Pooling Patterns
33 questions
The default is 0, which turns off the feature. When set, it specifies the lifetime of a Pgpool-II child process in terms of the number of client connections it can receive before being terminated and replaced.
serialize_accept becomes ineffective when child_life_time is enabled.
Each child process stores 1 connection per user:database pair and can store up to max_pool connections of unique pairs. A child opens a new backend connection only if the requested [user,database] pair is not already in the cache. When max_pool is reached, the oldest connection is disconnected to make room for a new pair.
In transaction pooling mode, changes to session state must only be made with SET LOCAL (scoped to current transaction). Never use SET SESSION or SET alone, as these would affect session state beyond the current transaction.
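A minimal sketch (table name and timeout value are illustrative):
BEGIN;
SET LOCAL statement_timeout = '5s';   -- reverts automatically at COMMIT/ROLLBACK, so it cannot leak to the next pooled client
UPDATE jobs SET state = 'done' WHERE id = 42;
COMMIT;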
The default is 15.0 seconds. This controls how long DNS lookups can be cached. The actual DNS TTL is ignored.
The default is 3 connections. This determines the number of connection slots reserved for connections by PostgreSQL superusers.
The default is off. When set to on, Pgpool-II enables serialization on incoming client connections to avoid the 'thundering herd problem'. It should only be enabled with larger values of num_init_children.
The default is 'DISCARD ALL'. This query is sent to the server on connection release before making it available to other clients.
The default value is 4. This parameter determines the number of connections to cache per child process.
The default is 30.0 seconds. This determines how long to keep released connections available for immediate re-use, without running sanity-check queries on it. If set to 0, the query is run always.
The default is 0 (no timeout). This is the number of milliseconds to wait before timing out when connecting a new client. If set to 0, calls to pool.connect will hang indefinitely if there are no available clients.
For optimal throughput, the number of active connections should be near ((core_count * 2) + effective_spindle_count).
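For example, an illustrative server with 8 cores and a single SSD (counted as roughly one effective spindle) would start from (8 × 2) + 1 = 17 connections and be tuned from there under load testing.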
Transaction pooling does not support: SET/RESET (except SET LOCAL), LISTEN, WITH HOLD CURSOR, PREPARE/DEALLOCATE (traditionally, though version 1.21+ has protocol-level support), PRESERVE/DELETE ROW temp tables (must be created, used, and dropped in same transaction), LOAD, and session-level advisory locks.
The default is 30000 milliseconds (30 seconds). This is the maximum wait time for acquiring a connection. The minimum allowed value is 250 milliseconds.
The default is 0 (unlimited). This limits the total number of connections to any database from one user.
Traditionally, prepared statements cannot be used with transaction mode because they are session-bound. However, PgBouncer version 1.21 introduced support for prepared statements in transaction mode when max_prepared_statements is set greater than 0, but this only works for protocol-level prepared statements.
The default is 0. When set to 0, no connection from clients will be refused. If reserved_connections = 1 and num_init_children = 32, then the 32nd connection from a client will be refused.
The default is typically 100 connections, but might be less if kernel settings will not support it (as determined during initdb). This parameter can only be set at server start.
The default is 10. This controls the maximum size that the pool is allowed to reach, including both idle and in-use connections.
The default is 0 (disabled). This adds more server connections to the pool if below this number, which improves behavior when normal load returns after a period of total inactivity.
The default is 10000 (10 seconds). This is the number of milliseconds a client must sit idle in the pool before being disconnected from the backend and discarded. Setting to 0 disables auto-disconnection.
The default is 3600 seconds (1 hour). The pooler will close an unused server connection that has been connected longer than this. Setting it to 0 means the connection is to be used only once, then closed.
By default, the load balance mode is at 'session level', meaning the node that read queries are sent to is determined when a client connects to Pgpool-II, with one of the nodes selected randomly each time a new session is created.
The default is 60 seconds. This sets how often the averages shown in various SHOW commands are updated and how often aggregated statistics are written to the log.
The default is 120 seconds. This is the maximum time queries are allowed to spend waiting for execution. If the query is not assigned to a server during that time, the client is disconnected.
The default is 0 (disabled). This parameter determines how many additional connections to allow to a pool.
The default is 'SELECT 1;'. This is a simple do-nothing query to check if the server connection is alive. If an empty string, sanity checking is disabled.
The default value is 100. This sets the maximum number of client connections allowed to PgBouncer.
PgBouncer maintains a pool of connections for each unique (user, database) pair. The default_pool_size determines how many server connections to allow per user/database pair.
The default is 600 seconds (10 minutes). If a server connection has been idle more than this many seconds, it will be closed.
The default is 0 (unlimited). This limits the total number of connections to any database.
The default is 0, which means cached connections will not be disconnected automatically based on time.
The default is 32. This is the number of preforked Pgpool-II server processes and also the concurrent connections limit to Pgpool-II from clients.
Data Types - Geometric
33 questionsThe @> operator (does first contain second?) is available for various geometric type combinations to test containment relationships.
Line segments are output using bracketed format: [(x1,y1),(x2,y2)]
The ?-| operator tests if objects are perpendicular, and the ?|| operator tests if objects are parallel.
The box(polygon) function computes the bounding box of the polygon.
Use the polygon(integer, circle) function where the first parameter specifies the number of points, e.g., polygon(4, circle '<(3,0),1>').
Two operator classes are available for point: quad_point_ops (the default) and kd_point_ops (alternative implementation).
Square brackets ([]) indicate an open path, while parentheses (()) indicate a closed path. When the outermost parentheses are omitted, a closed path is assumed.
Rotating a box only moves corner points; the box is still considered to have sides parallel to the axes. Hence the box's size is not preserved.
As a general rule, when the data have a lot of overlaps, GiST will outperform SP-GiST.
The # operator returns the number of points (as an integer) in the path or polygon.
Circles are output using angle-bracket notation: <(x,y),r> where (x,y) is the center point and r is the radius.
Lines are represented by the linear equation Ax + By + C = 0, where A and B are not both zero.
The box(circle) function computes the box inscribed within the circle.
The box(point) function converts a point to an empty box with both corners at the same location, e.g., box(point '(1,0)') returns (1,0),(1,0).
The @@ operator computes the center point and returns a point. It works with box, lseg, polygon, and circle types.
The <-> operator returns a double precision value representing the distance between geometric objects.
Yes, GiST indexes can optimize nearest-neighbor searches, and SP-GiST operator classes support the <-> ordering operator for k-nearest neighbor (k-NN) search.
The # operator computes the point of intersection, or returns NULL if there is none.
The area() function returns NULL if the path input is open (not closed).
The && operator (do objects overlap?) works with box, polygon, and circle types.
PostgreSQL supports 8 geometric data types: point, line, lseg (line segment), box, path (with open and closed variants), polygon, and circle. All of them represent two-dimensional spatial objects.
PostgreSQL geometric types are supported by GiST (Generalized Search Tree) indexes and SP-GiST (Space-Partitioned GiST) indexes.
The ?- operator tests if objects are horizontal (or horizontally aligned), and the ?| operator tests if objects are vertical (or vertically aligned).
Box values are reordered as needed to store the upper right corner first and the lower left corner second.
The <^ operator tests if the first box is below the second (with edge contact allowed), and the >^ operator tests if the first box is above the second (with edge contact allowed). These operators are box-only.
The path type (both open and closed) requires 16+16n bytes, where n is the number of points.
SQL Syntax
33 questionsAt least one digit must be before or after the decimal point. For example, .5 and 5. are valid, but a standalone decimal point is not.
-- and /* cannot appear anywhere in an operator name, as they would be interpreted as the start of a comment.
PostgreSQL 16. This added support for hexadecimal (0x), octal (0o), and binary (0b) integer literals.
Yes, block comments nest in PostgreSQL (as of version 8.0 and later), as specified in the SQL standard. This allows commenting out larger blocks that may contain existing block comments.
pg_get_keywords(). Use SELECT * FROM pg_get_keywords() ORDER BY 1; to see all keywords, or filter with WHERE catdesc = 'reserved' to see only reserved words.
At the same precedence level as AT (the AT TIME ZONE operator), which is higher than exponentiation (^) but lower than unary +/- operators. It is left-associative.
Write two single quotes. For example, 'It''s' represents the string It's.
Tags follow the same rules as unquoted identifiers except they cannot contain a dollar sign. Tags are case-sensitive, so $tag$ and $TAG$ are different.
Left-associative. The exceptions are unary operators (+ - as prefix), which are right-associative, and NOT, which is also right-associative.
/* comment text */. Block comments can be nested, as specified in the SQL standard.
-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (64-bit signed integer).
PostgreSQL 16. This allows visual grouping like 1_000_000 for improved readability.
Use the UESCAPE clause after the string with a single character in single quotes. For example, U&'d!0061ta' UESCAPE '!' uses ! as the escape character instead of backslash.
-- followed by text to the end of the line. Everything after -- is ignored.
X'hex_digits' where each hexadecimal digit represents 4 binary digits. The X can be upper or lower case, with no space between X and the opening quote. Example: X'1FF'.
One-based (1). By default, the first array element is at index 1, not 0. For example, array_name[1] accesses the first element.
B'binary_digits' where only 0 and 1 are allowed. The B can be upper or lower case, with no space between B and the opening quote. Example: B'1001'.
A multi-character operator name cannot end in + or - unless the name also contains at least one of these characters: ~ ! @ # % ^ & | ` ?
Write two double quotes. For example, "table""name" represents the identifier table"name.
array_name[lower:upper] where lower and upper are the subscript bounds. For example, products[1:3] returns elements from index 1 to 3.
Yes. Quoting an identifier makes it case-sensitive. For example, "Foo" and "FOO" are different identifiers, but both are different from the unquoted foo.
Hexadecimal digits, the plus sign, a single quote, a double quote, or a whitespace character.
$n where n is a number. $1 refers to the first argument, $2 to the second, and so on. For example, in a prepared statement: INSERT INTO foo VALUES($1, $2, $3).
Between any two digits, or immediately after the base prefix (0x, 0o, 0b). They cannot appear at the start or end, immediately before/after the decimal point or exponent, or consecutively. Example: 1_500_000.
U&'string' where Unicode characters are specified as \xxxx (4-digit hex) or \+xxxxxx (6-digit hex). For example, U&'d\0061t\+000061' represents 'data'.
When they are separated only by whitespace that includes at least one newline. For example, 'Hello' on one line followed by 'World' on the next line is treated as the single constant 'HelloWorld'.
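A small illustration (the literal line break between the two constants is what triggers the concatenation):

    SELECT 'Hello'
           'World';   -- one constant: 'HelloWorld'
    -- SELECT 'Hello' 'World';   -- without a newline between them, this is a syntax error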
$tag$string content$tag$ where tag is optional and can be zero or more characters. For example, $$string$$ or $customtag$string$customtag$.
No. The :: syntax is historical PostgreSQL usage and is not part of the SQL standard. The standard-compliant syntax is CAST(expression AS type).
\b (backspace), \f (form feed), \n (newline), \r (carriage return), \t (tab), \digits (octal byte value), \xhexdigits (hexadecimal byte value). Any other character following backslash is taken literally.
1. . (table/column separator), 2. :: (typecast), 3. [ ] (array element selection), 4. + - (unary), 5. ^ (exponentiation), 6. * / % (multiplication/division/modulo), 7. + - (addition/subtraction), 8. other operators, 9. BETWEEN IN LIKE ILIKE SIMILAR, 10. comparison operators (<, >, =, etc.), 11. IS ISNULL NOTNULL, 12. NOT, 13. AND, 14. OR.
=> (equal sign followed by greater-than) is reserved by the SQL grammar and cannot be used as an operator name.
Installation and Setup
33 questionsYes, listen_addresses has Context: postmaster, which means it requires a server restart to take effect.
4 megabytes (4MB). This sets the maximum amount of memory for a query operation (such as a sort or hash table) before writing to temporary disk files. If specified without units, it is taken as kilobytes.
32 megabytes (32MB), but might be less if kernel settings will not support it (as determined during initdb).
postgres (default database for users, utilities, and third-party applications), template1 (source template for new databases), and template0 (pristine template that should never be modified).
libc. The locale provider determines the encoding from the locale settings inherited from the environment where initdb runs.
No, wal_level can only be set at server start. You must restart PostgreSQL for changes to this parameter to take effect.
The special entry * (asterisk) corresponds to all available IP interfaces.
OpenSSL 1.1.1 or later (or LibreSSL 3.4 or later as an alternative).
4 gigabytes (4GB). More specifically, 524288 blocks (typically 8kB each). This is a planner hint about available cache memory and doesn't actually allocate memory.
localhost, which allows only local TCP/IP loopback connections. This means PostgreSQL will only accept connections from the local machine unless explicitly changed.
Fast mode. Fast mode does not wait for clients to disconnect, rolls back all active transactions, forcibly disconnects clients, then shuts down the server.
replica. This writes enough data to support WAL archiving and replication, including running read-only queries on a standby server. Note: In PostgreSQL 9.6 and earlier, the default was 'minimal'.
If not specified, the postgres database will be used; if that does not exist (or if it is the name of the new database being created), template1 will be used.
The postgres database is meant as a default database for users and applications to connect to. It is simply a copy of template1 and can be dropped and recreated if necessary.
template0 contains the pristine initial data that matches your PostgreSQL version. It should never be modified after initialization. It allows creating a clean database with new encoding and locale settings, or restoring pg_dump backups without site-local additions from template1.
Yes, the autovacuum parameter is set to on by default. However, track_counts must also be enabled for autovacuum to work.
The built-in default is GMT, but this is typically overridden in postgresql.conf. initdb will install a setting corresponding to the system environment at initialization time.
WAL segment size must be a power of 2 between 1 and 1024 megabytes.
Smart mode (waits for all clients to disconnect), Fast mode (forcibly disconnects clients after rolling back transactions - this is the default), and Immediate mode (aborts all server processes immediately, leading to crash-recovery on next start).
trust. The documentation warns: 'Do not use trust unless you trust all local users on your system. trust is the default for ease of installation.'
initdb will initialize the configuration file with a setting that corresponds to the chosen lc_ctype locale. If no matching configuration is found, the built-in default is pg_catalog.simple.
PostgreSQL has 5432 registered with IANA as its default TCP port.
100 connections. However, it might be less if kernel settings will not support it (as determined during initdb). The parameter has a minimum of 1 and maximum of 262143.
The superuser defaults to the name of the operating-system user running initdb.
64 megabytes (64MB). This specifies the maximum memory for maintenance operations like VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. If specified without units, it is taken as kilobytes.
Yes, data checksums are enabled by default. They apply to all objects in all databases.
Indexes > B-tree Indexes
33 questionsThe three pattern-matching operator classes are: text_pattern_ops (for text type), varchar_pattern_ops (for varchar type), and bpchar_pattern_ops (for char type).
The maximum size is 2704 bytes. PostgreSQL 12 reduced the maximum size by 8 bytes (from 2712 bytes in earlier versions) to store extra metadata used for improving block splitting with duplicate entries.
An index entry cannot exceed approximately one-third of a page. With the default 8KB page size, this translates to roughly 2704 bytes in PostgreSQL 12 and later (after TOAST compression, if applicable).
INCLUDE columns are truncated from pivot index tuples (tuples located in non-leaf pages and high keys). They are only stored in leaf pages as payload.
No, CREATE INDEX is always non-concurrent on temporary tables, as no other session can access them.
The INCLUDE clause for creating covering indexes was introduced in PostgreSQL 11.
B-tree indexes typically have a depth of 4-5 levels for very large tables. Due to high branching factors (typically in the thousands), a B-tree can store millions of elements in only 2-3 layers.
PostgreSQL can use a partial index only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. The system can recognize simple inequality implications (e.g., 'x < 1' implies 'x < 2'), but otherwise the predicate condition must exactly match part of the query's WHERE condition.
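A minimal sketch with a hypothetical orders table: the first query's WHERE clause implies the index predicate, so the partial index can be used (whether it is actually chosen still depends on statistics); the second query's does not.

    CREATE TABLE orders (id bigint PRIMARY KEY, status text, created_at timestamptz);
    CREATE INDEX orders_open_idx ON orders (created_at) WHERE status = 'open';

    -- Can use orders_open_idx: the index predicate matches part of the WHERE clause.
    SELECT * FROM orders WHERE status = 'open' AND created_at > now() - interval '1 day';

    -- Cannot use it: nothing in this query implies status = 'open'.
    SELECT * FROM orders WHERE created_at > now() - interval '1 day';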
The following cannot use deduplication: text/varchar/char with nondeterministic collations, numeric, jsonb, float4, float8, and container types (composite types, arrays, range types). Additionally, indexes with INCLUDE columns cannot use deduplication.
B-tree is the only index type currently supported by PostgreSQL that can produce sorted output. Other index types return matching rows in an unspecified, implementation-dependent order.
Typically over 99% of all pages are leaf pages. Internal pages account for less than 1% of the index structure.
B-tree is currently the only index method that supports unique indexes.
The default value is ON (enabled). Deduplication has been enabled by default since PostgreSQL 13.
A B-tree operator class must provide five comparison operators: <, <=, =, >=, and >. Note that <> (not equal) is not included in this required set.
By default, B-tree indexes store their entries in ascending order with nulls last (ASC NULLS LAST). A forward scan produces output satisfying ORDER BY x ASC NULLS LAST.
No, expressions are not currently supported as included columns. Only simple column references can be used in the INCLUDE clause.
Pattern-matching operator classes compare values strictly character by character rather than according to locale-specific collation rules. This makes them suitable for LIKE and POSIX regular expression queries when the database does not use the standard 'C' locale.
The six support functions are: (1) order (required), (2) sortsupport, (3) in_range, (4) equalimage, (5) options, and (6) skipsupport. Functions 2-6 are optional.
REINDEX CONCURRENTLY takes a ShareUpdateExclusiveLock, which only prevents concurrent changes to the table structure but allows reads and writes to continue.
B-tree is the default index type. When you use CREATE INDEX without specifying USING, PostgreSQL creates a B-tree index.
The = operator must satisfy equivalence relation properties: reflexive (A = A is true), symmetric (if A = B then B = A), and transitive (if A = B and B = C, then A = C).
The < operator must be a strong ordering: irreflexive (A < A is false) and transitive (if A < B and B < C, then A < C). Additionally, trichotomy must hold: exactly one of A < B, A = B, and B < A is true.
Bottom-up deletion specifically targets version churn from UPDATEs that do not logically modify indexed columns and is triggered to prevent page splits. Simple deletion is opportunistically driven by passing index scans.
B-tree version 4 is a new index format introduced in PostgreSQL 12 with optimizations for duplicate entries and improved page splitting logic. Indexes created before version 12 remain in version 3 format after pg_upgrade and must be REINDEXed to use version 4.
No, all values of the data type must be comparable, and NULL comparisons are not allowed in comparison functions. NULLs are handled separately by the index infrastructure.
The branching factor typically ranges in the low thousands. Because PostgreSQL nodes have a fixed amount of bytes (default 8KB per page) rather than a fixed number of keys, the branching factor depends on the size of each individual tuple. Common values range from hundreds to the low thousands.
The order support function must return int32 values: <0 for less-than, 0 for equal, and >0 for greater-than.
Yes, only one concurrent index build can occur on a table at a time. This is a restriction of the CONCURRENTLY option.
The in_range support function supports window function RANGE frame bounds. It determines whether a value is within a specified offset from another value.
Data Types - JSON > JSON Functions
32 questionsAutomatic unwrapping is not performed when the path expression contains type() or size() methods. This prevents ambiguity when you explicitly want to check the type or size of an array itself.
SQL NULL, not JSON null. PostgreSQL maintains a strict distinction between SQL NULL and JSON null. To get a JSON null value, you must explicitly use 'null'::json.
false (PostgreSQL 18+). When false (the default), json_strip_nulls() only removes null-valued object fields. When true, it also removes null array elements. This parameter was added in PostgreSQL 18; earlier versions don't have this parameter.
If the array index is outside the range -array_length .. array_length - 1 and create_if_missing is true, the new value is added at the beginning of the array if the index is negative, and at the end of the array if it is positive.
FALSE ON ERROR. By default, JSON_EXISTS() returns false when an error occurs instead of raising an exception. Other options include TRUE ON ERROR, UNKNOWN ON ERROR, and ERROR ON ERROR.
true. When create_if_missing is true (the default), jsonb_set() will create a new key if you attempt to set it for a key that does not exist. Example: jsonb_set('{"x": 1}', '{y}', '2') returns {"x": 1, "y": 2}.
$.column_name - the default path is derived from the column name. For example, a column named 'email' defaults to PATH '$.email'.
JSON_VALUE() returns SQL NULL if the path expression returns a JSON null, whereas JSON_QUERY() returns the JSON null value as-is (preserving it as the JSON value 'null').
The @> operator returns true if the left JSONB value contains the right JSONB value. Containment means all key-value pairs in the right operand exist in the left operand, including nested structures. This operator is GIN-indexable.
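A short sketch with a hypothetical events table, showing containment plus a GIN index that can serve the @> operator:

    CREATE TABLE events (id bigint PRIMARY KEY, payload jsonb);
    CREATE INDEX events_payload_gin ON events USING gin (payload);

    -- True for rows whose payload contains the given key/value pair (nested containment also works).
    SELECT id FROM events WHERE payload @> '{"type": "login"}';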
The jsonb_path_ops operator class supports only: @> (containment), @? (jsonpath match), and @@ (jsonpath match). It does NOT support the key-exists operators (?, ?|, ?&), but offers better performance and smaller index size for the operators it does support.
json_object(text[]) accepts either a 1-dimensional array with an even number of elements (alternating keys and values), or a 2-dimensional array where each inner array has exactly 2 elements [key, value]. Alternatively, json_object(keys text[], values text[]) takes two separate arrays.
Returns the object's key-value pairs as an array of objects. Each object contains three fields: 'key' (the property name), 'value' (the property value), and 'id' (a unique identifier for the pair).
false. When silent is false (the default), errors are not suppressed. When silent is true, the function suppresses the same errors as the @? and @@ operators (missing object fields/array elements, unexpected JSON item types, datetime and numeric errors).
false (implicit default). When omitted or false, array_to_json() produces compact JSON. When true, line feeds are added between dimension-1 elements of the JSON array.
text. The JSON_VALUE() function returns text by default unless you specify a different type with the RETURNING clause.
json_extract_path(from_json json, VARIADIC path_elems text[]) extracts a JSON sub-object at the specified path. It is equivalent to the #> operator. For example, json_extract_path(data, 'a', 'b') is equivalent to data #> '{a,b}'.
Concatenates two jsonb values. When concatenating two arrays, it generates an array containing all elements from both inputs. When concatenating objects, it merges them (with the right operand's values overwriting the left's for duplicate keys).
PostgreSQL 17 (released September 26, 2024). While some SQL/JSON features were committed in PostgreSQL 15, the JSON_TABLE() function specifically became available in version 17.
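A minimal JSON_TABLE() sketch for PostgreSQL 17 or later, flattening a small JSON array into rows (the column names and data are made up):

    SELECT jt.*
    FROM JSON_TABLE(
           '[{"name": "ada", "age": 36}, {"name": "grace", "age": 45}]'::jsonb,
           '$[*]'
           COLUMNS (
             name text PATH '$.name',
             age  int  PATH '$.age'
           )
         ) AS jt;
    -- name  | age
    -- ada   | 36
    -- grace | 45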
The field/element/path extraction operators return NULL rather than failing if the JSON input does not have the right structure to match the request (e.g., non-existent key or out-of-bounds array index).
jsonb_path_exists() checks whether the JSON path returns any item (returns boolean based on existence). jsonb_path_match() returns the result of a JSON path predicate check (evaluates the path as a boolean expression).
Tests whether jsonb_populate_record() would succeed without errors for the given base type and JSONB input. Returns boolean true if the operation would succeed, false otherwise.
For arrays, it returns the number of elements. For non-array values, it returns 1. The return type is number.
json or jsonb (matching the input type). The #> operator extracts a JSON sub-object at the specified path as a JSON value. Use #>> to get the result as text instead.
'use_json_null'. The possible values are: 'use_json_null' (default - converts SQL NULL to JSON null), 'raise_exception' (raises error on NULL), 'delete_key' (removes the key if new_value is NULL), and 'return_target' (returns original target unchanged).
json_array_elements() returns setof json (each array element as a JSON value), while json_array_elements_text() returns setof text (each element converted to text). The text variant is useful when you need scalar values directly.
The .decimal() method accepts optional precision and scale parameters: value.decimal([precision [, scale]]). These control the numeric precision and number of decimal places for the conversion.
Sibling NESTED PATH clauses at the same nesting level are combined via UNION, producing rows from each path separately in the result set.
Lax mode is the default for SQL/JSON path expressions. In lax mode, the path engine implicitly adapts the queried data to the specified path and suppresses structural errors. In strict mode, structural errors raise exceptions.
json or jsonb (same type as the input). The -> operator returns the extracted element as a JSON value. To get text instead, use the ->> operator which returns type text.
The flags are: 'i' (case-insensitive matching), 'm' (^ and $ match at newlines), 's' (. matches newline), and 'q' (quote the pattern - treat as literal string). Flags are specified as a string, e.g., 'string' like_regex 'pattern' flag 'i'.
NULL. The jsonb_path_query_first() function returns the first item from the JSON path results, or NULL if the path returns no items.
JSON null values are converted to SQL NULL in all cases when using json_populate_record(). This applies to both json and jsonb variants.
Monitoring > pg_stat_statements
32 questionson. This controls whether utility commands (non-DML statements) are tracked.
on. This preserves pg_stat_statements statistics across server restarts.
Approximately pg_stat_statements.max * track_activity_query_size bytes. This memory is consumed whenever the module is loaded, even if pg_stat_statements.track is set to none.
No. This parameter can only be set at server start, requiring a server restart to change.
top. This tracks only top-level statements issued directly by database clients.
Information about the least-executed statements is discarded. The dealloc counter in pg_stat_statements_info tracks how many times this has occurred.
Queries are combined into a single entry if they are semantically equivalent except for literal constant values. Constants are replaced by parameter symbols like $1, $2, etc. in the displayed query text.
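For example, after running the same statement with different literals, a single normalized entry accumulates the counts (a sketch; the column list is abbreviated and the shown query text is illustrative):

    SELECT query, calls, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 5;
    -- query might read: SELECT * FROM accounts WHERE id = $1   (literal replaced by $1)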
10 columns: jit_functions, jit_generation_time, jit_inlining_count, jit_inlining_time, jit_optimization_count, jit_optimization_time, jit_emission_count, jit_emission_time, jit_deform_count, and jit_deform_time.
A boolean indicating whether the query was executed as a top-level statement (true) or as a nested statement within functions/procedures (false). This column was introduced in PostgreSQL 14.
pg_stat_statements (tracks planning and execution statistics) and pg_stat_statements_info (provides module statistics including deallocation count and reset timestamp).
stats_since (timestamp with time zone) tracks when the entry was created. minmax_stats_since (timestamp with time zone) tracks when min/max statistics were last reset. Both were introduced in PostgreSQL 17.
No. Only superusers can change the pg_stat_statements.track setting.
Yes, once loaded in shared_preload_libraries, it collects statistics for queries across all databases. However, the view must be created per database using CREATE EXTENSION. Statistics are distinguished by the dbid column.
Query identifier calculation must be enabled by setting compute_query_id to auto or on, or by loading a third-party module that calculates query identifiers. In PostgreSQL 13 and earlier, pg_stat_statements handled query ID computation internally.
When the same query is executed both at top-level and as a nested statement (with pg_stat_statements.track = all), two entries are created - one with toplevel = true and one with toplevel = false. Queries from different users or databases also get separate entries.
dealloc (bigint - total number of times entries were deallocated) and stats_reset (timestamp with time zone - last reset timestamp).
plans (bigint), total_plan_time, min_plan_time, max_plan_time, mean_plan_time, and stddev_plan_time (all double precision, measured in milliseconds). These are populated when pg_stat_statements.track_planning is enabled.
PostgreSQL 14. Prior to version 14, pg_stat_statements computed query IDs internally.
When minmax_only is true, only the minimum and maximum planning and execution time fields are reset (min_plan_time, max_plan_time, min_exec_time, max_exec_time). Other statistics are preserved.
5000. This is the maximum number of statements tracked by the module. The default was 1000 in PostgreSQL 9.2 and earlier, but was changed to 5000 starting in PostgreSQL 9.4.
Run CREATE EXTENSION pg_stat_statements; in each database where you want to access the view. The extension must be created per database, not globally.
pg_stat_statements_reset(userid Oid, dbid Oid, queryid bigint, minmax_only boolean) returns timestamp with time zone. All parameters default to 0 (invalid) except minmax_only which defaults to false.
Yes, there is a small chance of hash collisions causing unrelated queries to be merged into one entry. However, this cannot happen for queries belonging to different users or databases.
top (tracks only client-issued statements), all (tracks both top-level and nested statements within functions), or none (disables statement statistics collection).
PostgreSQL 18. The columns parallel_workers_to_launch (number planned) and parallel_workers_launched (number actually launched) were added.
off. This is disabled by default because enabling it may incur a noticeable performance penalty, especially when statements with identical query structure are executed by many concurrent connections.
The number of WAL full page images generated by the query. Full page images are complete copies of data pages written to WAL.
Stored Procedures
32 questionsThe prokind column uses 'p' to identify procedures, allowing them to be distinguished from functions in the pg_proc catalog.
String literals (dollar-quoted) are parsed at execution time, allowing polymorphic argument types. The sql_body form is parsed at procedure definition time and cannot support polymorphic argument types.
For procedures whose body is defined as a string literal, PostgreSQL tracks dependencies associated with externally-visible properties (like argument and result types) but not dependencies that could only be known by examining the function body.
For SQL-language procedures whose body is written in SQL-standard style, the body is parsed at procedure definition time and all dependencies recognized by the parser are stored.
The user must have EXECUTE privilege on the procedure in order to be allowed to invoke it.
SECURITY INVOKER is the default, which indicates that the procedure is executed with the privileges of the user that calls it.
No, volatility categories (VOLATILE, STABLE, IMMUTABLE) are only supported for functions, not for procedures. These attributes are not part of the CREATE PROCEDURE syntax.
Use ALTER PROCEDURE name SET SCHEMA new_schema to move a procedure to a different schema. You must own the procedure and have CREATE privilege on the new schema.
PostgreSQL 14 added support for OUT parameters in procedures. Before PostgreSQL 14, only IN, INOUT and VARIADIC were supported.
Procedures are stored in the pg_proc system catalog, which stores information about functions, procedures, aggregate functions, and window functions collectively.
No, transaction control is only possible in CALL or DO invocations from the top level or nested CALL or DO invocations without any other intervening command. For example, if the call stack is CALL proc1() → SELECT func2() → CALL proc3(), then proc3 cannot perform transaction control because of the SELECT in between.
The pg_get_functiondef() function produces a CREATE OR REPLACE PROCEDURE statement for the stored procedure, returning its complete definition.
A new transaction is started automatically after a transaction is ended using COMMIT or ROLLBACK commands.
Yes, PostgreSQL allows procedure overloading, meaning more than one procedure can be defined with the same SQL name, so long as the arguments they take are different.
The mode can be IN, OUT, INOUT, or VARIADIC. IN is the default if omitted.
All user-defined functions are assumed to be PARALLEL UNSAFE unless otherwise marked.
No, procedures with SET clauses cannot execute transaction control statements.
RESTRICT is the default behavior, which refuses to drop the procedure if any objects depend on it.
Yes, in PostgreSQL 9.0 and later, PL/pgSQL is installed by default and is the standard procedural language for writing stored procedures.
PostgreSQL 11 was the first version to support stored procedures using the CREATE PROCEDURE statement.
The CALL command is used to execute a procedure in isolation, unlike functions which are called as part of a query or DML command.
No, STRICT and RETURNS NULL ON NULL INPUT are function-specific attributes that apply to functions, not procedures, since procedures don't return values.
No, a SECURITY DEFINER procedure cannot execute transaction control statements.
No, transactions cannot be ended inside a block with exception handlers because exception handlers form subtransactions internally.
No, procedures do not return a function value; hence CREATE PROCEDURE lacks a RETURNS clause. However, procedures can return data to callers via output parameters.
No, LEAKPROOF is only supported for functions, not for procedures.
The three parallel safety modes are PARALLEL SAFE (safe to run in parallel mode without restriction), PARALLEL UNSAFE (cannot be executed in parallel mode - the default), and PARALLEL RESTRICTED (can be executed in parallel mode but only in the parallel group leader).
The default language is 'sql' when sql_body is specified in the CREATE PROCEDURE statement.
Yes, procedures can commit or roll back transactions during execution, so long as the invoking CALL command is not part of an explicit transaction block. This is a key difference from functions.
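A minimal PL/pgSQL sketch of a procedure that commits as it goes, assuming a hypothetical audit_log table with a logged_at column; it only works when CALL is issued outside an explicit transaction block:

    CREATE PROCEDURE purge_old_rows()
    LANGUAGE plpgsql
    AS $$
    BEGIN
      LOOP
        DELETE FROM audit_log
        WHERE ctid IN (SELECT ctid FROM audit_log
                       WHERE logged_at < now() - interval '90 days'
                       LIMIT 1000);
        EXIT WHEN NOT FOUND;
        COMMIT;   -- releases locks and starts a new transaction for the next batch
      END LOOP;
    END;
    $$;

    CALL purge_old_rows();   -- run outside an explicit BEGIN ... COMMIT block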
Named parameters use the syntax name => value (e.g., CALL procedure_name(param1 => value1)). An older syntax based on := is also supported for backward compatibility.
Arguments must be supplied for all procedure parameters that lack defaults, including OUT parameters. However, arguments matching OUT parameters are not evaluated, so it's customary to just write NULL for them.
Data Definition (DDL) > Constraints
31 questionsA table can have at most one primary key. This is a fundamental constraint in PostgreSQL.
The available actions are: NO ACTION (default), RESTRICT, CASCADE, SET NULL, and SET DEFAULT. RESTRICT prevents the operation immediately, CASCADE propagates changes, SET NULL sets columns to NULL, and SET DEFAULT sets columns to their default values.
No. PostgreSQL assumes that CHECK constraints' conditions are immutable but does not enforce this requirement. This allows flexibility but requires developer discipline to avoid issues.
RESTRICT prevents deletion/update of a referenced row immediately and does not allow the check to be deferred. NO ACTION also prevents the operation but allows the check to be deferred until later in the transaction if the constraint is declared DEFERRABLE.
Exclusion constraints ensure that if any two rows are compared on the specified columns or expressions using the specified operators, at least one of these operator comparisons will return false or null. This is useful for preventing overlapping ranges or conflicting data.
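A common sketch preventing overlapping bookings of the same room (table and column names are made up; the btree_gist extension is needed so plain equality on an integer column can participate in the GiST index):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE room_booking (
      room   int,
      during tstzrange,
      EXCLUDE USING gist (room WITH =, during WITH &&)
    );
    -- A second booking for the same room with an overlapping range is rejected.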
NOT VALID skips the table scan that normally verifies existing rows satisfy the constraint. The constraint is still enforced for new inserts/updates, but existing rows are not validated until VALIDATE CONSTRAINT is run.
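A typical two-step pattern on a large table (table, column, and constraint names are hypothetical):

    -- Fast: no full-table scan; new and updated rows are checked immediately.
    ALTER TABLE measurements
      ADD CONSTRAINT measurements_value_positive CHECK (value > 0) NOT VALID;

    -- Later, verify the existing rows under a SHARE UPDATE EXCLUSIVE lock.
    ALTER TABLE measurements VALIDATE CONSTRAINT measurements_value_positive;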
The three options are: NOT DEFERRABLE (default - checked immediately, cannot change), DEFERRABLE INITIALLY IMMEDIATE (checked immediately but can be changed per transaction), and DEFERRABLE INITIALLY DEFERRED (checked at transaction commit, can be changed per transaction).
The referenced columns must be the columns of a primary key or unique constraint, or the columns of a non-partial unique index.
The NOT VALID option applies to foreign-key, CHECK, and not-null constraints only. Other constraint types require immediate validation during addition.
The default is MATCH SIMPLE. This allows any of the foreign key columns to be null; if any of them are null, the row is not required to have a match in the referenced table.
PRIMARY KEY constraints are named by default with the pattern: tablename_pkey. For example, a table named 'country' would have a primary key constraint named 'country_pkey'.
Adding a unique constraint will automatically create a unique B-tree index on the column or group of columns listed in the constraint.
Constraint names are limited to 63 bytes (NAMEDATALEN is 64, so maximum identifier length is 63 bytes). Longer names will be automatically truncated.
Yes. A PRIMARY KEY constraint is technically a combination of a UNIQUE constraint and a NOT NULL constraint. The NOT NULL is implicit and does not need to be specified explicitly.
A check constraint is satisfied if the check expression evaluates to true or the null value. This means NULL values can bypass CHECK constraints unless explicitly handled.
No. MATCH PARTIAL is not yet implemented in PostgreSQL, despite being part of the SQL standard.
VALIDATE CONSTRAINT acquires a SHARE UPDATE EXCLUSIVE lock on the table being altered. For foreign key constraints, it also requires a ROW SHARE lock on the referenced table.
All check constraints and not-null constraints on a parent table are automatically inherited by its children and are checked for child table operations, unless the constraint is explicitly marked with NO INHERIT.
Adding a primary key will automatically create a unique B-tree index on the column or group of columns listed in the primary key.
Standard ADD constraint operations require an ACCESS EXCLUSIVE lock. However, ADD FOREIGN KEY only requires SHARE ROW EXCLUSIVE lock on both the table and the referenced table.
Only UNIQUE, PRIMARY KEY, REFERENCES (foreign key), and EXCLUDE constraints support DEFERRABLE. NOT NULL and CHECK constraints are always checked immediately when a row is inserted or modified.
The NULLS NOT DISTINCT clause causes the unique constraint to treat NULL values as equal to each other, preventing multiple NULL entries. This option was added in PostgreSQL 15. Syntax: UNIQUE NULLS NOT DISTINCT (column_name)
Yes. Column constraints can also be written as table constraints, while the reverse is not necessarily possible, since a column constraint is supposed to refer to only the column it is attached to.
FOREIGN KEY constraints are named with the pattern: tablename_columnname_fkey. For example, 'user_order_user_account_id_fkey'.
Only the B-tree index type supports unique indexes in PostgreSQL. Other index types like GiST, GIN, BRIN, and hash do not support enforcing uniqueness.
NO INHERIT prevents the CHECK constraint from being inherited by child tables. By default, all check constraints on a parent table are automatically inherited by its children unless explicitly specified with NO INHERIT.
Exclusion constraints typically use GiST (Generalized Search Tree) or SP-GiST (Space-Partitioned Generalized Search Tree) index methods. While B-tree or hash indexes can be used, they provide nothing that an ordinary unique constraint doesn't do better.
NOT DEFERRABLE is the default for PostgreSQL constraints. This means constraints are checked immediately after every command and this setting cannot be changed per transaction.
No. There's no need to manually create indexes on unique columns; doing so would just duplicate the automatically-created index. PostgreSQL automatically creates the necessary B-tree index.
The contype values are: 'c' for CHECK, 'f' for FOREIGN KEY, 'n' for NOT NULL (PostgreSQL 18+), 'p' for PRIMARY KEY, 'u' for UNIQUE, 't' for TRIGGER, and 'x' for EXCLUSION constraints.
Data Definition (DDL) > Schemas and Namespaces
31 questionsNo. Schemas are analogous to directories at the operating system level, except that schemas cannot be nested. PostgreSQL maintains only one level of schemas within a database.
pg_temp_nnn, where nnn is a numeric identifier (typically the backend's pgproc slot number). It can be referenced using the alias pg_temp in the current session.
Bytes, not characters. This distinction matters when using multi-byte characters in UTF-8 or other multi-byte encodings.
If pg_temp is not listed in the path, it is searched first (even before pg_catalog) for relation and data type names only.
You must be able to SET ROLE to that role (meaning you must have membership in that role with the SET option), and that role must have CREATE privilege on the database.
Yes. The information schema itself is a schema named information_schema that automatically exists in all databases.
Tables, sequences, indexes, views, materialized views, and foreign tables share the same namespace. This means, for example, an index and table cannot have the same name in one schema.
No. Schema_element subcommands cannot be included when using the IF NOT EXISTS clause.
No. Upgrading from a previous version to PostgreSQL 15 (using pg_dump or pg_upgrade) preserves the original privileges, so you will not automatically get the more secure setup. You must manually apply the new security model.
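A sketch of applying the newer setup by hand after upgrading (run in each database; adjust to your needs):

    -- Stop ordinary users from creating objects in public.
    REVOKE CREATE ON SCHEMA public FROM PUBLIC;
    -- Optionally match the PostgreSQL 15 ownership default as well.
    ALTER SCHEMA public OWNER TO pg_database_owner;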
No. There is no ALTER SCHEMA statement in the SQL standard - it is PostgreSQL-specific.
SHOW search_path; - This displays the current schema search path configuration.
No. The new schema name cannot begin with pg_, as such names are reserved for system schemas.
63 bytes. By default, NAMEDATALEN is 64, so the maximum identifier length is 63 bytes (NAMEDATALEN-1). Longer names can be written but will be silently truncated.
current_schema() returns the name of the schema that is at the front of the search path, which is the default location for creating new objects.
The public schema. Every new database automatically contains a public schema.
CVE-2007-2138 addressed a vulnerability where unprivileged users could use temporary objects to execute code with the privileges of SECURITY DEFINER functions by manipulating search_path. The fix ensured that unqualified function names are never searched for in the temporary schema.
In PostgreSQL 14 and earlier, all users have CREATE privilege on the public schema by default, allowing any user to create tables and other objects in it.
Yes. The system catalog schema pg_catalog is always searched, whether it is mentioned in the path or not. If not named explicitly, it is implicitly searched before searching the path's schemas.
RESTRICT is the default. The command will refuse to drop the schema if it contains any objects and will fail with an error.
The CREATE privilege on the public schema is revoked from all users except the database owner. Only the database owner can create objects in the public schema by default.
ALTER DATABASE database_name SET search_path TO schema1, schema2, public; - This permanently changes the schema search path at the database level.
Only CREATE TABLE, CREATE VIEW, CREATE INDEX, CREATE SEQUENCE, CREATE TRIGGER, and GRANT are accepted as clauses within CREATE SCHEMA. Other objects must be created separately after the schema is created.
Use the OPERATOR() syntax, for example: OPERATOR(schema.+). This is necessary when referencing operators with explicit schema qualification.
Yes, but it's only for pro forma compliance with the SQL standard. PostgreSQL primarily uses two-part notation (schema.table), and you must be connected to the database you reference in three-part notation.
It prevents an error if the schema does not exist. Instead of failing, PostgreSQL issues a notice and continues execution.
The CREATE privilege on the current database. Superusers bypass this check.
NAMEDATALEN is a compile-time constant that determines the maximum length of identifiers. The default is 64 bytes, which means the maximum identifier length is 63 bytes (NAMEDATALEN-1).
"$user", public - PostgreSQL first searches a schema matching the current user's name (if it exists and the user has USAGE permission), then searches the public schema.
The user executing the command becomes the owner of the new schema.
You must own the schema and also possess CREATE privilege for the database.
Extensions > pg_partman
31 questionshourly, daily, weekly, monthly, quarterly, yearly, custom, or any combination of these. By default, pg_partman's time-based partitioning will truncate child table starting values to line up at the beginning of typical boundaries (midnight for daily, day 1 for monthly, Jan 1 for yearly, etc).
true. This boolean value determines whether the pg_jobmon extension is used to log/monitor partition maintenance.
If the column(s) given contain all NULL values, no constraint will be made.
'on'. This configuration parameter controls whether a partition set is included in automated maintenance operations.
pg_partman supports RANGE and LIST partition types. For native partitioning, pg_partman only supports the RANGE type of partitioning (both for time & id). List partitioning is supported for number-based partitioning when the interval is 1.
It defaults to the configured partition interval if not given or if you give an interval larger than the partition interval.
No. Custom intervals are not allowed when moving data out of the DEFAULT partition - the p_interval/p_batch_interval parameters must be left unset or NULL to allow use of partition set's default partitioning interval. The batch_interval option CANNOT be used when moving data out of a default partition in PostgreSQL 11+.
Subpartitioning with multiple levels provides next to NO PERFORMANCE BENEFIT outside of extremely large data in a single partition set (100s of terabytes, petabytes).
true. When true, the function gives back an exact count of how many rows are in each parent if any is found. If set to false, it returns if it finds even just a single row in any parent, which can significantly speed up the check.
PostgreSQL 14 or higher. As of version 5.0.1, trigger-based partitioning is no longer supported and all partitioning is done using built-in declarative partitioning.
retention_schema is an optional parameter that tells pg_partman to move a table to another schema instead of dropping it. When combined with retention configuration, it provides a way to reliably dump out tables that would normally just be dropped by the retention system via ALTER TABLE SET SCHEMA.
The function returns child_start_time, child_end_time, child_start_id, child_end_id, and suffix. It provides detailed information about a specific child partition including start and end times for time-based partitions, start and end IDs for ID-based partitions, and the partition suffix.
The constraint generation logic creates CHECK constraints with this format: CHECK (column >= min_value AND column <= max_value). When constraint_valid = false, the SQL adds NOT VALID to the constraint.
true. When set to true, when the retention period is due, the table isn't deleted automatically; instead, partitions that are older than the retention period are only detached from the parent table.
This function reapplies ownership & grants on all child tables based on what the parent table has set. Privileges that the parent table has will be granted to all child tables and privileges that the parent does not have will be revoked (with CASCADE).
It lists all child partitions in a set in logical order, with the default partition not listed by default. The function signature is show_partitions(p_parent_table text, p_order text DEFAULT 'ASC', p_include_default boolean DEFAULT false) and returns partition_schemaname and partition_tablename.
Constraints are placed on the last child table older than the optimize_constraint value - for example, if the optimize_constraint value is 30, then constraints will be placed on the child table that is 31 back from the current partition.
The part_config table has a constraint_valid column to set whether constraints should be set NOT VALID on creation. While this can make the creation of the constraint(s) nearly instantaneous, constraint exclusion cannot be used until it is validated.
The default is 4. This determines the number of partitions to create in advance to support new inserts.
false. When true, this controls whether to inherit the ownership/privileges of the parent table to all child tables. This inheritance is only at child table creation and isn't automatically retroactive.
OFF. This parameter has the same purpose as the p_analyze argument to run_maintenance().
true. This determines whether indexes are dropped for child tables that are uninherited.
NULL (or OFF in background worker configuration). It determines whether to run an analyze on a partition set whenever a new partition is created during run_maintenance().
false. When set to true, it configures the table to be able to automatically create new partitions without any limit.
true. By default, partitions are not DROPPED, they are DETACHed, leaving previous child tables as empty, independent tables.
By default, pg_partman creates one time-partition for the current month, four partitions for the past four months, and four more partitions for four months in the future. You can use p_start_partition to override this default behavior.
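A rough sketch of registering a table with pg_partman (assuming the extension is installed in a schema named partman and a pg_partman 5.x-style create_parent() signature; check the docs of your installed version for the exact parameters):

    SELECT partman.create_parent(
      p_parent_table => 'public.events',   -- hypothetical partitioned parent table
      p_control      => 'created_at',      -- partitioning control column
      p_interval     => '1 month'
    );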
This controls how many times to run the batch_interval in a single call of this function, setting how many child tables will be processed in a single run.
3 consecutive failures. Any function that fails to run successfully 3 consecutive times will cause jobmon to raise an alert. This is why the default pre-make value is set to 4 so that an alert will be raised in time for intervention.
By default, if you undo a child table that is also partitioned, it will not stop additional sibling children of the parent partition set from being subpartitioned unless that parent is also undone. This function simply deletes the parent_table entry from the part_config_sub table to handle situations where you're not removing the parent but don't want any additional subpartitioned children.
'seconds', 'milliseconds', 'microseconds', 'nanoseconds', and 'none'. The default is 'none'. This parameter tells pg_partman that the control column is an integer type but actually represents an epoch time value.
When undo_partition functions are run and set to true, this causes all partition creation and retention management by the run_maintenance() function to stop.
Large Objects
31 questions32 TB (assuming default page size). This is because all large objects are stored in the pg_largeobject table, and the per-table size limit is 32 TB.
max_locks_per_transaction. This setting affects how many large objects vacuumlo can safely remove in one transaction.
Because the server acquires a lock per large object removed, removing too many LOs in one transaction risks exceeding max_locks_per_transaction.
SEEK_SET (seek from object start), SEEK_CUR (seek from current position), and SEEK_END (seek from object end).
Superusers only. By default, use of these functions is restricted to superusers.
lo_lseek will fail if the new location pointer would be greater than 2 GB. For large objects exceeding 2 GB, use lo_lseek64 instead.
No. The documentation states that large object functions cannot be used during pipeline mode connections.
A BEFORE UPDATE OR DELETE trigger. It executes before update or delete operations on tables containing large object references.
No. The trigger does not execute during DROP TABLE or TRUNCATE operations, potentially leaving orphaned objects.
No. Write operations on large objects are forbidden in read-only transactions.
A large object whose OID does not appear in any oid or lo data column of the database. vacuumlo removes these orphaned objects.
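What vacuumlo checks can be approximated in SQL; a sketch assuming a single hypothetical table docs whose oid column doc_lo references large objects:

    -- Large objects whose OID is not referenced by docs.doc_lo (docs is hypothetical).
    SELECT m.oid
    FROM pg_largeobject_metadata AS m
    WHERE NOT EXISTS (SELECT 1 FROM docs WHERE docs.doc_lo = m.oid);

    -- Each orphan could then be removed with: SELECT lo_unlink(<oid>);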
The data reflects the contents of the large object at the time of the transaction snapshot that was active when lo_open was executed, not the current committed state.
They use the permissions of the database's owning user. They read and write files in the server's file system, not the client's file system.
loid (oid type) - identifier of the large object; pageno (int4 type) - page number within the large object (counting from zero); data (bytea type) - actual data stored, never more than LOBLKSIZE bytes.
LOBLKSIZE equals BLCKSZ/4, which is typically 2 kB (2048 bytes). This is the amount of data stored per page in a large object.
INV_READ (read-only access), INV_WRITE (write access), and INV_READ | INV_WRITE (bidirectional access). These are defined in libpq/libpq-fs.h.
off. When set to on, it disables privilege checks on large objects for compatibility with pre-9.0 behavior.
As zeroes. The system permits sparse storage where pages can be missing, and missing regions read as zeroes.
-1 on failure. On success, it returns a non-negative large object descriptor.
8192 bytes (8 kB). This value is determined when building the server.
PostgreSQL 9.3. lo_truncate was introduced earlier in version 8.3.
libpq/libpq-fs.h. This header contains the function declarations and mode constants like INV_READ and INV_WRITE.
pg_largeobject_metadata. This catalog stores ownership (lomowner) and access privileges (lomacl) for large objects.
PostgreSQL 9.3. This function allows seeking to positions larger than 2 GB.
PostgreSQL 9.0. This version also introduced access privileges for large objects.
UPDATE privilege. lo_open will fail if INV_WRITE is specified and UPDATE privilege is not available.
The libpq large object functions reject length values larger than INT_MAX (typically 2,147,483,647 bytes or ~2 GB), even though the len parameter is declared as size_t.
Data Types - Pseudo-Types
31 questionsNo, pg_ddl_command cannot be output directly, but it can be passed to other functions to obtain different pieces of information about the command.
table_am_handler was introduced in PostgreSQL 12 along with the table access methods feature.
NEW is null in statement-level triggers and for DELETE operations.
A data change trigger function must be declared as a function with no arguments and a return type of trigger.
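A minimal PL/pgSQL sketch of such a function (the items table and updated_at column are made up):

    CREATE FUNCTION touch_updated_at() RETURNS trigger
    LANGUAGE plpgsql
    AS $$
    BEGIN
      NEW.updated_at := now();   -- modify the row about to be written
      RETURN NEW;                -- a BEFORE row-level trigger returns the row to use
    END;
    $$;

    CREATE TRIGGER touch_updated_at
      BEFORE INSERT OR UPDATE ON items
      FOR EACH ROW EXECUTE FUNCTION touch_updated_at();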
Yes, functions coded in C (whether built-in or dynamically loaded) can be declared to accept or return any of the pseudo-types. It is the function author's responsibility to ensure safe behavior.
A type input function can be declared as taking one argument of type cstring, or as taking three arguments of types cstring, oid, integer. The three-argument form includes the type's OID and the typmod for the destination column.
No, pseudo-types cannot be used as column data types. They are special-purpose type system entries that can only be used to declare a function's argument or result type.
The void pseudo-type indicates that a function returns no value. In PL/pgSQL, if you declared the function to return void, a RETURN statement can be used to exit the function early, but do not write an expression following the RETURN keyword.
String literals aren't necessarily strings—they might be meant to be point, json, or any number of other types. String literals have a fictive 'unknown' type, with the real type derived from context (operators, function parameters).
Both actual arguments have to be the same enum type. This is equivalent to declaring the function as f(anyenum, anyenum).
No, a function with the special return type event_trigger need not (and may not) return a value. The return type serves merely as a signal that the function is to be invoked as an event trigger.
tsm_handler is used for tablesample method handlers. A tablesample method handler is declared to return tsm_handler.
If there are positions declared anymultirange, their actual multirange type must contain ranges matching parameters declared anyrange and base elements matching parameters declared anyelement and anyarray.
Yes, each position declared as anyelement is allowed to have any specific actual data type, but in any given call they must all be the same actual type.
Row-level triggers fired BEFORE can return NULL to signal the trigger manager to skip the rest of the operation for this row (subsequent triggers are not fired, and the INSERT/UPDATE/DELETE does not occur for this row).
No, the simple family (anyelement, anyarray, etc.) and common family (anycompatible, anycompatiblearray, etc.) are independent sets of type variables and do not interact with each other.
No, for the common (anycompatible) family, the actual types of the arguments need not be identical, so long as they can be implicitly cast to a single common type following the same rules as UNION.
anyenum is treated exactly the same as anyelement, but adds the additional constraint that the actual type must be an enum type.
Once a common type is identified, arguments in anycompatible and anycompatiblenonarray positions are automatically cast to that type, and arguments in anycompatiblearray positions are automatically cast to the array type for that type.
Most procedural languages allow only void and record as result types, plus trigger or event_trigger when the function is used as a trigger or event trigger. Some procedural languages also support polymorphic pseudo-types as return types.
When the return value of a function is declared as a polymorphic type, there must be at least one argument position that is also polymorphic. The actual data type(s) supplied for the polymorphic arguments determine the actual result type for that call.
For DELETE operations, the return value should be OLD. The returned value has no direct effect but must be nonnull to allow the trigger action to proceed. Returning NEW is not sensible since NEW is null in DELETE triggers.
If a function has at least one internal-type argument, then it cannot be called from SQL. This is a security restriction to prevent direct invocation of internal-only database functions.
The unknown pseudo-type is initially imputed to unadorned string literals and NULL constants. The parser normally converts these to constants of some other type as soon as it can figure out what type they ought to be based on context.
The actual array type in the anyarray positions must be an array whose elements are the same type appearing in the anyelement positions.
A foreign-data wrapper handler is declared to return fdw_handler. The handler function returns a struct of function pointers to callback functions that will be called by the planner, executor, and various maintenance commands.
No, the return value of a row-level trigger fired AFTER or a statement-level trigger fired BEFORE or AFTER is always ignored; it might as well be null.
The anycompatible family (anycompatible, anycompatiblearray, anycompatiblenonarray, anycompatiblerange) was introduced in PostgreSQL 13.
cstring indicates null-terminated C strings. It's used to declare functions that accept or return C strings, commonly for type input/output functions and type modifier functions.
pg_event_trigger_ddl_commands() returns: classid (oid), objid (oid), objsubid (integer), command_tag (text), object_type (text), schema_name (text), object_identity (text), in_extension (boolean), and command (pg_ddl_command).
Two routines are provided in fmgr.h: get_fn_expr_rettype(FmgrInfo *flinfo) to discover the actual return type, and get_fn_expr_argtype(FmgrInfo *flinfo, int argnum) to discover the actual argument type.
Data Types - Arrays
31 questions
No. The current implementation ignores any supplied array size limits. Arrays of a particular element type are all considered to be of the same type, regardless of size or number of dimensions, so declaring the array size or number of dimensions in CREATE TABLE is simply documentation.
The delimiter is usually a comma (,). Among the standard data types provided in the PostgreSQL distribution, all use a comma, except for type box which uses a semicolon (;).
array_length(array, dimension_number) returns the size (number of elements) of the specified dimension of the array.
The ARRAY constructor uses the keyword ARRAY followed by square brackets containing comma-separated expressions: ARRAY[1,2,3]. For subqueries, it uses parentheses instead: ARRAY(SELECT ...).
array_append and array_prepend only support one-dimensional arrays. array_cat supports multidimensional arrays.
The equality operators (= and <>) do an exact element-by-element comparison. Arrays are equal only when all corresponding elements match exactly.
cardinality() returns 0 for empty arrays. It counts the total number of elements across all dimensions.
A slice expression yields an empty (zero-dimensional) array instead of NULL. This does not match non-slice behavior where accessing an out-of-bounds subscript returns NULL.
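For example:
SELECT ('{1,2,3}'::int[])[5:6];  -- {} (empty array: slice entirely out of bounds)
SELECT ('{1,2,3}'::int[])[5];    -- NULL (non-slice subscript out of bounds)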
You must explicitly cast NULL to the desired type, such as NULL::int, since the first parameter determines the data type of the resulting array.
array_lower(array, dimension_number) returns the lower bound of the specified array dimension.
When WITH ORDINALITY is specified, an additional column of type bigint will be added, numbering the rows of the function result set starting from 1.
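For example:
SELECT * FROM unnest(ARRAY['a','b','c']) WITH ORDINALITY AS t(elem, n);
-- elem | n
-- a    | 1
-- b    | 2
-- c    | 3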
The maximum number of elements that an array of any dimensionality can hold is 134,217,727 (2^27 - 1).
When the null_string parameter is omitted or NULL, any null elements in the array are simply skipped and not represented in the output string.
Multidimensional arrays must have matching extents for each dimension. They must be rectangular - all sub-arrays at the same level of nesting must have the same number of elements.
The default siglen value is 16 bytes, with a valid range of 1-2024 bytes.
The ordering is unspecified by default when no ORDER BY clause is provided within the array_agg call.
If the delimiter parameter is NULL, each character in the input string will become a separate element in the resulting array.
Yes. For empty arrays, explicit casting is mandatory since the system cannot infer the type from no elements.
An array subscript expression will return NULL if either the array itself or any of the subscript expressions are NULL.
Elements containing curly braces, commas (or the data type's delimiter character), double quotes, backslashes, leading or trailing whitespace, empty strings, or strings matching the word NULL must be double-quoted.
The ALL operator returns true when comparing against an empty array, since all comparisons (none) succeed.
gin__int_ops supports the <@ (is contained by) operator in addition to the &&, @>, @@, and regular array equality operators supported by the GiST operator classes.
The optional third parameter (null_string) specifies a string to be treated as a NULL value. Elements matching this string in the split array will be replaced with actual NULL values instead of the string literal.
If the subquery's output column is of an array type, the result will be an array of the same type but one higher dimension. In this case all the subquery rows must yield arrays of identical dimensionality, else the result would not be rectangular.
If the right-hand array contains any null elements and no true comparison result is obtained, the result of ANY will be NULL, not false.
When concatenating two arrays, there can't be any more than one dimension difference between them. Arrays with equal dimensions are concatenated along the outer dimension. When an N-dimensional array is concatenated with an N+1-dimensional array, each N-dimensional sub-array becomes an element of the outer dimension.
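For example:
SELECT ARRAY[1,2] || ARRAY[3,4];           -- {1,2,3,4}
SELECT ARRAY[1,2] || ARRAY[[3,4],[5,6]];   -- {{1,2},{3,4},{5,6}}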
The default numranges value is 100, with a valid range of 1-253.
To put a double quote or backslash in a quoted array element value, precede it with a backslash.
By default, the array element type is the common type of the member expressions, determined using the same rules as for UNION or CASE constructs. You can override this by explicitly casting the array constructor to the desired type.
array_position and array_positions use IS NOT DISTINCT FROM semantics, which means NULL values can be successfully searched for and found in arrays (since NULL IS NOT DISTINCT FROM NULL evaluates to true).
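For example:
SELECT array_position(ARRAY[1, NULL, 3], NULL::int);      -- 2
SELECT array_positions(ARRAY[NULL, 1, NULL], NULL::int);  -- {1,3}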
The default value is 'on', which allows the array input parser to recognize unquoted NULL as specifying a null array element.
Functions - Mathematical
30 questions
The modulo operator % is available for smallint, integer, bigint, and numeric types.
For type numeric, the result will have the same number of fractional decimal digits as min or max, whichever has more. The result scale equals Max(scale(min), scale(max)).
@ is the absolute value operator (@ numeric_type → numeric_type). For example, @ -5 returns 5.
|/ computes the square root (|/ double precision → double precision), and ||/ computes the cube root (||/ double precision → double precision).
atan2(y, x) - the y-coordinate (numerator) comes first, then the x-coordinate (denominator). It computes the arctangent of y/x.
trim_scale(numeric) reduces the scale of the value by removing trailing zeroes. It returns the numeric value with trailing fractional zeros removed.
width_bucket() returns count+1 for an input greater than or equal to the high parameter.
There is no difference. log10() is the same as log(). Both compute the base 10 logarithm.
The min_scale() and trim_scale() functions were added in PostgreSQL 13.
erfc(x) computes the complementary error function, which equals 1 - erf(x), but without loss of precision for large inputs.
div(y numeric, x numeric) returns the integer quotient of y/x, truncating toward zero. This is similar to y/x for integral types but works with numeric.
Using degree-based trigonometric functions is preferred because it avoids round-off error for special cases. For example, sind(30) returns exactly 0.5, while sin(pi()/6) may have small rounding errors.
When s is negative, trunc() truncates to the left of the decimal point. For example, trunc(1234.56, -1) returns 1230.
There is no difference. ceiling() is exactly equivalent to ceil(). Both round the value up to the next integer.
When s is negative, round() rounds to the left of the decimal point. For example, round(1234.56, -1) returns 1230.
For integral types, division truncates the result toward zero. For example, 5/2 returns 2, and (-5)/2 returns -2.
sign() returns +1 for positive values, 0 for zero, and -1 for negative values.
min_scale(numeric) returns the minimum scale (number of fractional decimal digits) needed to represent the value precisely, effectively removing any trailing zeros from consideration.
The functions working with double precision data are mostly implemented on top of the host system's C library; accuracy and behavior in boundary cases can therefore vary depending on the host system.
width_bucket() returns 0 for an input less than the low parameter.
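For example, with five equal-width buckets spanning 0 to 10:
SELECT width_bucket(5.35, 0.0, 10.0, 5);  -- 3
SELECT width_bucket(-1, 0.0, 10.0, 5);    -- 0 (below low)
SELECT width_bucket(11, 0.0, 10.0, 5);    -- 6 (count+1, at or above high)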
gcd() and lcm() are available for integer, bigint, and numeric types only.
scale(numeric) returns the scale of the argument, which is the number of fractional decimal digits (count of decimal digits in the fractional part, to the right of the decimal point).
For the numeric type, ties are broken by rounding away from zero. For example, 2.5 rounds to 3, and -2.5 rounds to -3.
The ^ operator is left-associative. Multiple uses of ^ will associate left to right by default. For example, 2^3^2 evaluates as (2^3)^2 = 64, not 2^(3^2) = 512.
Yes. power() accepts both numeric and double precision variants. You can call power(a numeric, b numeric) or power(a double precision, b double precision). The return type matches the input types.
The random() function returns a value in the range 0.0 <= x < 1.0 (0.0 is inclusive, 1.0 is exclusive).
Data Types - Composite
30 questions
1,600 attributes (the same limit as columns per table). This is further constrained by the requirement that tuples must fit on a single page.
What you write in an SQL command is first interpreted as a string literal, then as a composite. This doubles the number of backslashes needed (unless using dollar quoting).
No. The constraints shown in the table definition do not apply to values of the composite type outside the table.
Field values must be double-quoted if they are empty strings or contain parentheses, commas, double quotes, backslashes, or white space.
composite_type_name[] or ARRAY (SQL standard syntax for one-dimensional arrays). Example: CREATE TABLE t (images image[]).
The asterisk (*) expands all fields of the composite type into separate columns. You must write it as (composite_column).*, with parentheses around the column reference.
Precede it with a backslash. For example, write \" to include a double quote and \\ to include a backslash.
ALTER TYPE type_name DROP ATTRIBUTE [ IF EXISTS ] attribute_name [ CASCADE | RESTRICT ]
The typed table will be dropped along with the type, because a typed table is tied to its type.
RESTRICT refuses the operation if the type being altered is the type of a typed table. This is the default behavior.
Yes. The ability to create a composite type with zero attributes is a PostgreSQL-specific deviation from the SQL standard.
Yes. The return type can be a base, composite, or domain type, or can reference the type of a table column.
In the pg_type system catalog. A composite type is automatically created for each table in the database to represent the row structure.
Row elements are compared left-to-right, stopping as soon as an unequal or null pair of elements is found. If either element is null, the result is null.
No. The ROW keyword is optional when you have more than one field in the expression.
Yes, but you need parentheses: UPDATE mytab SET complex_col.r = (complex_col).r + 1. Parentheses are required when referencing the column in an expression.
CASCADE automatically propagates the operation to typed tables of the type being altered, and their descendants.
Write "" (double quotes with nothing between them) to represent an empty string rather than NULL.
Yes. Arrays of any built-in or user-defined base type, enum type, composite type, range type, or domain can be created.
The remaining subfields are filled with null values if not explicitly specified.
CREATE TABLE table_name OF type_name. This creates a table whose structure is taken from the specified composite type.
Yes. The ADD ATTRIBUTE, DROP ATTRIBUTE, and ALTER ATTRIBUTE actions can be combined into a list of multiple alterations to apply in parallel.
A completely empty field value (no characters between commas or parentheses) represents NULL.
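For example, with a hypothetical composite type two_text:
CREATE TYPE two_text AS (a text, b text);
SELECT ('(,)'::two_text).a IS NULL;   -- true: a completely empty field is NULL
SELECT ('("",)'::two_text).a = '';    -- true: "" is an empty string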
IS DISTINCT FROM treats null as a normal value: it returns true when both inputs are null, and false when only one is null. The result is always true or false, never null.
1) String literal syntax: '("value1",value2)', 2) ROW expression: ROW('value1', value2), 3) Simplified form (multiple fields): ('value1', value2) without the ROW keyword.
ALTER TYPE type_name ADD ATTRIBUTE attribute_name data_type [ COLLATE collation ] [ CASCADE | RESTRICT ]
Yes. You can use DEFAULT ROW(value1, value2) or DEFAULT (value1, value2) syntax to provide default values for all fields of the composite type.
RECORD is not a true data type but a placeholder with no predefined structure that changes each time it's assigned. Composite types have a fixed, predefined structure defined at creation time.
No. Unlike CREATE TABLE, composite types cannot include constraints such as NOT NULL. Only field names and data types can be specified.
Internals > Tuple Structure
30 questions
Yes. Columns that have been dropped from the table still contribute to the maximum 1600 column limit.
TOAST_TUPLE_THRESHOLD, which is normally 2 kB (2048 bytes). The TOAST code is triggered when a row value to be stored is wider than this threshold.
A 1 bit indicates not-null (the column has a value). A 0 bit indicates null (the column is NULL).
4 bytes. The t_xmin field is of type TransactionId, which is a 32-bit unsigned integer (uint32).
0 (INVALID). If the tuple has not been deleted or updated, t_xmax is set to 0.
It indicates that the tuple was updated and key columns were modified, or the tuple was deleted. The flag value is 0x2000.
About 2000 bytes. This value is chosen so that four chunk rows will fit on a page (with the default 8KB page size).
It indicates that an object ID field is present in the tuple. The flag value is 0x0008. This flag exists only for backward compatibility with databases created before PostgreSQL 12, when WITH OIDS was removed.
t_xmin holds the transaction ID (txid) of the transaction that inserted this tuple.
The null bitmap begins just after the fixed 23-byte header and occupies enough bytes to have one bit per data column.
They are part of a union (t_field3) and share the same memory space, so only one can be active at a time. t_cid is used for normal operations (command ID), while t_xvac was used for old-style VACUUM FULL transaction IDs.
The t_hoff field indicates the offset to the start of user data. It marks where the actual column data begins after the tuple header, and must always be a multiple of the MAXALIGN distance for the platform.
PostgreSQL 12. The special behavior of oid columns was removed in this version, and the ability to create tables WITH OIDS was eliminated.
It indicates that the xmax transaction ID is invalid or that the tuple was not actually deleted. When set, the tuple is still visible.
1600 columns. This is a hard-coded limit defined as MaxHeapAttributeNumber in src/include/access/htup_details.h.
It indicates that the transaction that inserted the tuple (xmin) has committed. This hint bit allows PostgreSQL to avoid looking up the transaction status in CLOG, improving performance.
The fixed-size header occupies 23 bytes on most machines. This is followed by optional components like the null bitmap and object ID field.
0x07FF (HEAP_NATTS_MASK), which uses 11 bits for the number of attributes, allowing values up to 2047.
A 0 bit indicates null (the column is NULL). A 1 bit indicates not-null (the column has a value).
When HEAP_HASNULL is set in t_infomask, it indicates that the null bitmap is present in the tuple. If not set, all columns are assumed not-null.
1) The update does not modify any columns referenced by the table's indexes (excluding summarizing indexes), and 2) there is sufficient free space on the page containing the old row for the updated row.
8 bytes. MAXALIGN is typically 8 bytes on a 64-bit OS and 4 bytes on a 32-bit OS.
In src/include/access/htup_details.h. The official documentation directs users to this file for all implementation details.
1 GB (2^30 - 1 bytes). TOAST limits the logical size of any value of a TOAST-able data type to this amount.
6 bytes. The t_ctid field is of type ItemPointerData, which consists of a BlockNumber (4 bytes) and an OffsetNumber (2 bytes).
No. PostgreSQL uses a fixed page size and does not allow tuples to span multiple pages. This is why TOAST exists to handle large values.
Data Types - Network Address
30 questions
MAC addresses given in 6 byte format are stored in 8 byte length format with the 4th and 5th bytes set to FF and FE, respectively.
The && operator checks if either subnet contains or equals the other, returning a boolean.
Yes, leading and trailing whitespace is ignored in macaddr8 input.
Yes, but only addresses that have FF and FE as values in the 4th and 5th bytes from the left are eligible to be converted from macaddr8 to macaddr.
The trunc(macaddr8) function sets the last 5 bytes of the address to zero.
The number of hex digits must be either 16 (8 bytes) or 12 (6 bytes).
Subtracting one inet from another (inet - inet) returns a bigint representing the numeric difference between the two addresses.
The >>= operator checks if the left subnet contains or equals the right subnet, returning a boolean.
Yes, MAC address types support the bitwise NOT (~), AND (&), and OR (|) operators similar to inet/cidr types.
If y is omitted, it is calculated using assumptions from the older classful network numbering system, except it will be at least large enough to include all of the octets written in the input.
Yes, you can add a bigint to inet (inet + bigint or bigint + inet) to add a numeric offset to the address, or subtract a bigint from inet (inet - bigint) to subtract an offset.
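For example:
SELECT inet '192.168.1.43' - inet '192.168.1.19';  -- 24
SELECT inet '10.0.0.1' + 5;                        -- 10.0.0.6
SELECT inet '10.0.0.10' - 3;                       -- 10.0.0.7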
The >> operator checks if the left subnet strictly contains the right subnet, returning a boolean.
The inet_merge(inet, inet) function computes and returns the smallest network (as cidr type) that includes both input addresses.
The inet_same_family(inet, inet) function returns a boolean indicating whether the two addresses belong to the same IP family (both IPv4 or both IPv6).
The netmask() function computes and returns the network mask for the address.
The host() function returns the IP address as text, ignoring the netmask.
The essential difference is that inet accepts values with nonzero bits to the right of the netmask, whereas cidr does not. For example, 192.168.0.1/24 is valid for inet but invalid for cidr.
The bitwise NOT (~inet), bitwise AND (inet & inet), and bitwise OR (inet | inet) operators are supported.
It is an error to specify a network address that has bits set to the right of the specified netmask. This will produce an error message like 'Value has bits set to right of mask'.
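For example:
SELECT '192.168.0.1/24'::inet;  -- valid: a host address with its netmask
SELECT '192.168.0.0/24'::cidr;  -- valid network
SELECT '192.168.0.1/24'::cidr;  -- ERROR: bits set to right of mask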
The cidr data type uses 7 or 19 bytes of storage - 7 bytes for IPv4 addresses and 19 bytes for IPv6 addresses.
The <<= operator checks if the left subnet is contained by or equal to the right subnet, returning a boolean.
The network() function returns the network part of the address as cidr type.
The set_masklen(inet/cidr, integer) function sets the netmask length for the address, returning the modified inet or cidr value.
IEEE Standard 802-2001 specifies the hyphen-separated form (e.g., '08-00-2b-01-02-03') as the canonical format for MAC addresses.
The /y portion is suppressed in display if the netmask specifies a single host.
The << operator checks if the left subnet is strictly contained by the right subnet, returning a boolean.
No, both upper and lower case are accepted for the hex digits 'a' through 'f'.
When casting inet to cidr, any bits to the right of the netmask are silently zeroed to create a valid cidr value.
Data Types - Date/Time
30 questions
ISO, MDY (though initdb will initialize it with a setting that corresponds to the behavior of the chosen lc_time locale)
It converts the timestamp to TIMESTAMP WITH TIME ZONE, treating the input value as being in the specified time zone
YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, YEAR TO MONTH, DAY TO HOUR, DAY TO MINUTE, DAY TO SECOND, HOUR TO MINUTE, HOUR TO SECOND, MINUTE TO SECOND
It defaults to the precision of the literal value (but not more than 6 digits)
Zones west of Greenwich (note: this is the opposite of the ISO-8601 sign convention used elsewhere in PostgreSQL)
Storing timestamp values as double precision floating-point numbers instead of 64-bit integers. It was deprecated because the effective precision limit could be less than 6 digits, and precision degraded for dates further from 2000-01-01
These fields are kept separate because the number of days in a month varies, while a day can have 23 or 25 hours if a daylight savings time transition is involved
ISO 8601, SQL (Ingres), traditional POSTGRES (Unix date format), and German
Because time zones lack meaning without associated dates (offsets vary with daylight-saving transitions), making the type exhibit properties which lead to questionable usefulness
It converts the timestamptz to a plain TIMESTAMP WITHOUT TIME ZONE in the specified time zone, removing the time zone designation
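For example (the zone names are illustrative):
SELECT TIMESTAMP '2024-07-01 12:00' AT TIME ZONE 'America/New_York';  -- reads the value as New York local time, returns timestamptz
SELECT TIMESTAMPTZ '2024-07-01 12:00+00' AT TIME ZONE 'Asia/Tokyo';   -- shows the instant as Tokyo local time, returns timestamp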
No, they are actually treated as named time zones because (for historical reasons) there are files by those names in the IANA time zone database
24:00:00 is a valid upper bound for the TIME type range, representing midnight at the end of a day, but adding even 0 seconds to it will cause it to overflow to 00:00:00
Extensions > pgcrypto
30 questions
bf (Blowfish), xdes, sha256crypt, and sha512crypt are adaptive. md5 and des are not adaptive.
PostgreSQL must be built with zlib support (--with-zlib). Without zlib, compression algorithms are not available.
pgcrypto does not support: (1) signing or signature verification, (2) encryption key as master key, (3) multiple subkeys, and (4) verification that encryption subkey belongs to master key.
on (default), off, and fips. The 'fips' value disables gen_salt() and crypt() functions if OpenSSL is in FIPS mode.
0 (disabled). This option controls line ending conversion and uniquely applies to both encrypt and decrypt functions, unlike other options which only apply to encrypt.
At most 1024 bytes can be extracted at a time. This is to avoid draining the randomness generator pool.
0 (use S2K key directly as session key). When set to 1, a separate random session key will be generated.
0 (no compression). Valid values are: 0 (no compression), 1 (ZIP compression), and 2 (ZLIB compression).
SYMKEY (indicates the message is encrypted with a symmetric key) and ANYKEY (indicates the message is public-key encrypted but the key ID has been removed).
The key will be upward zero-padded to the nearest supported key length. Keys up to 16 bytes use AES-128, 17-24 bytes use AES-192, and 25-32 bytes use AES-256.
The range is 1024 to 65011712. The default is a random value between 65536 and 253952. This option only applies when s2k-mode=3.
All pgcrypto functions return NULL if any of the arguments are NULL, as is standard in SQL.
CBC (Cipher Block Chaining) is the default mode. Other available modes are ECB and CFB.
The default iteration count is 725, with a valid range from 1 to 16777215.
0 (MDC is enabled by default). Setting it to 1 disables the modification detection code, making the encrypted message vulnerable to message modification attacks.
The default iteration count is 6, with a valid range from 4 to 31.
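For example (stored_hash stands for a previously saved hash value):
SELECT crypt('secret', gen_salt('bf', 8));          -- hash with an iteration count of 8
SELECT crypt('secret', stored_hash) = stored_hash;  -- verify a password against a stored hash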
6 (valid range is 0-9, where higher levels compress smaller but are slower, and 0 disables compression).
Yes, pgcrypto is considered a 'trusted' extension, which means it can be installed by non-superusers who have CREATE privilege on the current database.
The legacy provider must be activated in the openssl.cnf configuration file in order to use older ciphers like DES or Blowfish.
The standard algorithms are: md5, sha1, sha224, sha256, sha384, and sha512. Additionally, any digest algorithm OpenSSL supports is automatically picked up.
Additional
30 questions
PostgreSQL supports six types of constraints:
1. CHECK Constraints
Enforce that column values satisfy a Boolean expression. Evaluated on INSERT/UPDATE only.
CREATE TABLE products (
price numeric CHECK (price > 0),
discounted_price numeric CHECK (discounted_price > 0 AND price > discounted_price)
);
2. NOT-NULL Constraints
Prevent NULL values in a column. More efficient than equivalent CHECK constraints.
CREATE TABLE products (
name text NOT NULL
);
3. UNIQUE Constraints
Ensure values in a column or group of columns are unique across all rows. Automatically creates a B-tree index. Multiple NULLs are allowed by default.
CREATE TABLE products (
product_no integer UNIQUE,
UNIQUE (product_no, name) -- multi-column
);
4. PRIMARY KEY Constraints
Combines UNIQUE and NOT NULL. Each table can have at most one primary key. Automatically creates an index.
CREATE TABLE products (
product_no integer PRIMARY KEY
);
5. FOREIGN KEY Constraints
Maintain referential integrity between tables. Values must match existing values in referenced table. Supports CASCADE, RESTRICT, SET NULL, SET DEFAULT actions on DELETE/UPDATE.
CREATE TABLE orders (
product_no integer REFERENCES products (product_no) ON DELETE CASCADE
);
6. EXCLUSION Constraints
Ensure that comparing any two rows using specified operators on specified columns returns at least one false or null. Requires an index method (typically GiST or SP-GiST).
CREATE TABLE circles (
c circle,
EXCLUDE USING gist (c WITH &&) -- no overlapping circles
);
Only UNIQUE, PRIMARY KEY, EXCLUSION, and FOREIGN KEY (REFERENCES) constraints can be declared DEFERRABLE and checked at transaction end instead of immediately; NOT-NULL and CHECK constraints are always checked right away.
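For example (the table name is illustrative):
CREATE TABLE order_items (
product_no integer REFERENCES products (product_no)
DEFERRABLE INITIALLY DEFERRED
);
-- Inside a transaction, this foreign-key check now runs at COMMIT rather than per statement.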
Sources:
Use EXPLAIN to show PostgreSQL's query execution plan without running the query, or EXPLAIN ANALYZE to execute it and see actual performance metrics.
Basic Usage
-- Show estimated execution plan only
EXPLAIN SELECT * FROM users WHERE age > 25;
-- Execute query and show actual performance
EXPLAIN ANALYZE SELECT * FROM users WHERE age > 25;
-- Get detailed information with buffer statistics
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM users WHERE age > 25;
Key Optimization Steps
1. Run EXPLAIN ANALYZE - Execute your query with EXPLAIN ANALYZE to get actual timing and row counts, not just estimates.
2. Identify expensive operations - Look for:
- High total cost numbers (in arbitrary units, relative comparison matters)
- Sequential Scans on large tables (may need indexes)
- High "Rows Removed by Filter" (inefficient filtering)
- Nested Loop joins with large outer tables (may need different join strategy)
- Sorts/hashes that spill to disk (increase work_mem)
3. Compare estimated vs actual rows - Large discrepancies indicate stale statistics. Run ANALYZE table_name to update them.
4. Check index usage - Verify the planner uses appropriate indexes. If not:
- Ensure indexes exist on WHERE/JOIN columns
- Check if condition makes index unusable (e.g., functions on indexed columns)
- Consider composite indexes for multi-column conditions
5. Analyze buffer hits - With the BUFFERS option, check the ratio of "shared hit" (cache) vs "read" (disk). Low cache hits may indicate insufficient shared_buffers.
Useful Options
- ANALYZE - Actually execute and show real timing (WARNING: modifies data for INSERT/UPDATE/DELETE)
- BUFFERS - Show cache hit/miss statistics
- VERBOSE - Include column names and additional details
- FORMAT JSON - Output as JSON for programmatic analysis
Safe Testing for Data Modification
BEGIN;
EXPLAIN ANALYZE DELETE FROM orders WHERE status = 'cancelled';
ROLLBACK; -- Prevent actual deletion
Common Patterns to Optimize
- Sequential Scan → Index Scan: Add appropriate index
- High loop count in Nested Loop: Switch join type or reorder joins
- Large "Rows Removed by Filter": Create partial index or restructure query
- Sort operation with high memory: Increase work_mem configuration
Sources:
Based on the official node-postgres documentation, here's the authoritative answer:
Default: 0 (no timeout)
The connectionTimeoutMillis parameter controls how long (in milliseconds) the pool will wait when attempting to acquire a client connection before timing out.
Default behavior: When set to 0 (the default), there is no timeout - the pool will wait indefinitely for a connection to become available.
What it does: This setting determines the maximum time to wait when calling pool.connect() or pool.query() if all connections are currently in use and the pool is at max capacity. If a connection cannot be acquired within this time, an error is thrown.
Configuration considerations:
- Set a positive value (e.g., 2000 for 2 seconds) to prevent indefinite hangs when connections aren't properly released
- The appropriate value depends on your application's performance requirements and expected connection availability
- If connections are frequently timing out, it may indicate you need to increase max (the pool size) or fix connection leaks (unreleased clients)
Example:
const { Pool } = require('pg');

const pool = new Pool({
  connectionTimeoutMillis: 2000, // Wait max 2 seconds for a connection
  max: 10
});
Sources:
JSON stores an exact copy of the input text and reparses it on every query, while JSONB stores data in a decomposed binary format that is faster to process.
Key Differences:
Storage:
- JSON: Stores exact text copy, requires reparsing on each execution
- JSONB: Binary format, slower to insert (conversion overhead), faster to query (no reparsing)
Data Preservation:
- JSON: Preserves whitespace, key order, and duplicate keys
- JSONB: Removes whitespace, doesn't preserve key order, keeps only the last value for duplicate keys
Indexing:
- JSON: No indexing support
- JSONB: Supports indexing (GIN indexes for efficient queries)
When to Use:
- JSONB: Default choice for most applications - use when querying, indexing, or manipulating JSON data
- JSON: Use only when exact text preservation is required (whitespace, key order) or for legacy compatibility needs
The PostgreSQL documentation states: "most applications should prefer to store JSON data as jsonb."
Sources:
B-tree indexes are PostgreSQL's default index type and should be used for most common indexing scenarios.
Use B-tree indexes when you need:
- Equality and range queries - B-trees handle comparisons with the operators <, <=, =, >=, >
- Sorted data retrieval - When you need to retrieve data in sorted order
- BETWEEN and IN queries - These constructs work efficiently with B-tree indexes
- NULL checks - IS NULL and IS NOT NULL conditions are supported
- Pattern matching with anchored patterns - LIKE 'foo%' or ~ '^foo' (pattern at the start of string)
Key characteristics:
- B-tree is created by default when you run CREATE INDEX without specifying a type
- Works with any data type that can be sorted into a well-defined linear order
- Handles the vast majority of indexing use cases in typical applications
When NOT to use B-tree:
- Full-text search (use GiST or GIN instead)
- Pattern matching with non-anchored patterns like LIKE '%bar' (use trigram indexes)
- Geometric data queries (use GiST or SP-GiST)
- Array containment or JSON operations (use GIN)
Example:
-- Creates a B-tree index by default
CREATE INDEX idx_users_email ON users(email);
-- Works efficiently with:
SELECT * FROM users WHERE email = 'user@example.com';
SELECT * FROM users WHERE created_at >= '2024-01-01';
SELECT * FROM users WHERE name LIKE 'John%';
Sources:
Based on the authoritative sources I've found, here's the verified answer:
Connection Pooling
Connection pooling is a technique that maintains a cache of database connections that can be reused by multiple clients, rather than creating and destroying a new connection for each request.
Why it's needed for PostgreSQL:
PostgreSQL uses a "process per user" architecture where the postmaster spawns a dedicated backend process for every client connection. This creates significant overhead:
- Process creation cost: Each connection requires forking a new OS process
- Memory overhead: Each active connection consumes approximately 10 MB of RAM
- Connection handshake: Each new connection requires 20-30 milliseconds for password negotiation, SSL establishment, and configuration sharing
- Resource limits: PostgreSQL has a hard max_connections limit; once reached, new clients cannot connect
- Scalability issues: Some internal structures scale at O(N²) or O(N*log(N)) with connection count, causing lock contention and context switching overhead at high connection counts
Why PgBouncer is Recommended
PgBouncer is a lightweight connection pooler that sits between applications and PostgreSQL, appearing as a PostgreSQL server to clients while managing a smaller pool of actual database connections.
Key advantages:
- Connection reuse: Eliminates the overhead of repeatedly creating/destroying connections
- Resource efficiency: Maintains fewer actual PostgreSQL processes while serving many clients
- Overcomes connection limits: Supports thousands of client connections while using only dozens of server connections
- Multiple pooling modes:
- Session pooling: Connection assigned for client's entire session
- Transaction pooling: Connection assigned only during transactions (more efficient)
Sources:
PostgreSQL is an open-source object-relational database management system (ORDBMS) with nearly 40 years of active development, originating from the POSTGRES project at UC Berkeley in 1986.
Key Differentiators
Object-Relational Model: Unlike pure relational databases, PostgreSQL extends SQL with object-oriented features, allowing inheritance and complex data types.
Extensibility: Users can define custom data types, functions, operators, aggregate functions, index methods, and procedural languages without recompiling the database. This level of customization is not standard in traditional relational databases.
Advanced Concurrency: Uses Multi-Version Concurrency Control (MVCC), enabling readers and writers to operate without blocking each other.
Standards Compliance: Conforms to at least 170 of 177 mandatory features for SQL:2023 Core conformance—among the highest compliance levels of any database.
Multiple Index Types: Supports B-tree, Hash, GiST, SP-GiST, GIN, and BRIN indexes, allowing optimization for different data patterns.
Proven Architecture: Known for reliability, data integrity, and transactional guarantees—handling workloads from gigabytes to petabytes.
Open Source License: Freely available for private, commercial, or academic use with no vendor lock-in.
Sources:
PostgreSQL has three main backup methods, each with distinct trade-offs:
1. SQL Dump (pg_dump)
How it works: Exports database objects and data as SQL commands that can recreate the database.
Advantages:
- Cross-version compatibility (can restore to newer PostgreSQL versions)
- Works across different machine architectures (e.g., 32-bit to 64-bit)
- Can be performed remotely from any host with database access
- Non-blocking (doesn't interfere with other database operations)
- Internally consistent snapshot at start time
- Selective restoration possible with custom formats
Disadvantages:
- Requires read access to all tables (typically needs superuser privileges)
- Subject to operating system file size limits
- Only handles single databases (doesn't include cluster-wide data like roles)
- Slower than file-level backups
- Smaller storage footprint but slower execution
Use when: You need portability across versions/architectures, selective table restoration, or remote backup capability.
2. File System Level Backup
How it works: Direct copy of PostgreSQL data directory files.
Advantages:
- Faster execution than SQL dumps (no reconstruction needed)
- Can use filesystem snapshots for minimal downtime
- Rsync with --checksum enables brief-shutdown backups
Disadvantages:
- Database must be shut down for usable backup (partial measures insufficient)
- Larger file size than SQL dumps (includes all index data)
- No selective restoration—must restore entire database cluster
- Not portable across PostgreSQL versions or architectures
- Requires commit log files (pg_xact/*) for any table restoration
Use when: You need fast backup/restore of complete clusters on the same version/architecture, or have snapshot-capable filesystems.
3. Continuous Archiving and Point-in-Time Recovery (PITR)
How it works: Combines file-system backup with ongoing WAL (Write-Ahead Log) file archiving.
Advantages:
- Point-in-time recovery to any moment since base backup
- Indefinite backup continuity without frequent full backups
- Ideal for very large databases
- Enables warm standby systems
- Doesn't require perfectly consistent base backup (WAL replay corrects inconsistencies)
Disadvantages:
- Significant storage demands (must retain all WAL files since base backup)
- Recovery time proportional to time since last base backup (must replay all WAL segments)
- Cluster-level recovery only (cannot restore individual databases)
- High operational complexity (requires careful archive command configuration and monitoring)
- Failed archiving can fill pg_wal/ and trigger database shutdown
- Performance impact during backup (full_page_writes forced on)
Use when: You need point-in-time recovery capability, have large databases where frequent full backups are impractical, or need warm standby systems.
Key Trade-off Summary:
- Speed vs. Portability: File-level fastest but not portable; SQL dump slowest but most portable
- Size vs. Flexibility: SQL dumps smaller with selective restore; file-level larger, all-or-nothing
- Simplicity vs. Recovery Options: SQL dump/file-level simple but fixed recovery point; PITR complex but flexible recovery time
Sources:
Handle database migrations safely in PostgreSQL using these practices:
1. Use Transactions for Schema Changes
Wrap schema modifications in transactions so the database remains in a known-good state if the migration fails. PostgreSQL supports transactional DDL for most operations.
BEGIN;
ALTER TABLE users ADD COLUMN email VARCHAR(255);
-- More migration steps
COMMIT;
Exception: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
2. Create Indexes Concurrently
Use CREATE INDEX CONCURRENTLY to avoid blocking writes during index creation:
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
This performs two table scans and allows normal operations to continue. If it fails, drop the invalid index and retry.
Limitation: Cannot be used inside transaction blocks or on partitioned tables directly.
3. Set Lock Timeouts
Prevent migrations from blocking indefinitely when waiting for locks:
SET lock_timeout = '5s';
ALTER TABLE users ADD COLUMN status VARCHAR(50);
Most ALTER TABLE operations require ACCESS EXCLUSIVE locks, which block all access to the table.
4. Use Lower-Impact ALTER TABLE Operations
Some operations require lighter locks:
- SET STATISTICS and per-attribute options: SHARE UPDATE EXCLUSIVE lock (allows reads and row-level operations)
- ADD FOREIGN KEY: SHARE ROW EXCLUSIVE lock (lighter than ACCESS EXCLUSIVE)
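For example (table and column names are illustrative):
ALTER TABLE users ALTER COLUMN email SET STATISTICS 500;             -- SHARE UPDATE EXCLUSIVE
ALTER TABLE orders ADD FOREIGN KEY (user_id) REFERENCES users (id);  -- SHARE ROW EXCLUSIVE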
5. Add Columns with Defaults Carefully
In PostgreSQL 11+, adding columns with defaults is optimized and doesn't rewrite the table:
ALTER TABLE users ADD COLUMN active BOOLEAN DEFAULT true;
Pre-11 versions rewrite the entire table, which can be slow and lock-intensive.
6. Always Backup Before Migrations
Create a backup (for example, a custom-format dump) before running migrations:
pg_dump -Fc database_name > backup.dump
7. Monitor Lock Waits
Enable logging to detect lock issues:
SET log_min_duration_statement = '1s';
SET log_lock_waits = on;
8. Avoid Long-Running Transactions
Keep migration transactions short to minimize lock contention and reduce the risk of blocking other operations.
Sources:
MVCC (Multiversion Concurrency Control) is PostgreSQL's concurrency mechanism where each SQL statement sees a snapshot of data as it existed at a specific point in time, rather than the absolute current state.
How it enables high concurrency:
The key advantage is that readers don't block writers, and writers don't block readers. In MVCC, locks acquired for querying (reading) data do not conflict with locks acquired for writing data. This eliminates the lock contention that exists in traditional locking systems.
Mechanism:
- Each transaction sees a consistent snapshot of the database from when its snapshot was taken
- Multiple versions of rows can exist simultaneously
- PostgreSQL maintains data consistency internally without requiring read/write locks to conflict
- This allows many concurrent users to read and modify data simultaneously without waiting for each other
Result:
High-concurrency performance in multiuser environments while maintaining data consistency—read operations proceed freely alongside writes, maximizing throughput.
Sources:
Window functions in PostgreSQL perform calculations across sets of rows that are related to the current query row, without collapsing the rows into a single output like aggregate functions do.
Key characteristics:
OVER clause required - Every window function must have an OVER clause directly following the function name, which distinguishes it from regular functions.
Maintains row identity - Unlike GROUP BY aggregates, window functions preserve all individual rows while computing values across row sets.
Window specification - The OVER clause can include:
- PARTITION BY - divides rows into groups
- ORDER BY - controls the order of row processing within partitions
- Frame specification - defines which rows are included in the calculation
Placement restrictions - Window functions are only permitted in the SELECT list and ORDER BY clause. They cannot be used in WHERE, GROUP BY, or HAVING clauses because they execute after row filtering and grouping.
Example:
SELECT salary,
AVG(salary) OVER (PARTITION BY department) as dept_avg,
rank() OVER (ORDER BY salary DESC) as salary_rank
FROM employees;
This returns every employee row with their salary, their department's average salary, and their salary rank—all without reducing the result set.
Sources:
When your PostgreSQL connection pool is exhausted, implement graceful degradation using these strategies:
1. Configure Connection Acquisition Timeouts
Set a reasonable timeout for connection acquisition. When this timeout is reached, return an error (typically 503 Service Unavailable) rather than letting requests hang indefinitely.
PgBouncer configuration:
query_wait_timeout = 120 # seconds - how long clients wait for a connection
reserve_pool_timeout = 5 # seconds - enables reserve pool connections
reserve_pool_size = 5 # additional connections available after timeout
2. Implement Circuit Breaker Pattern
Use a circuit breaker to prevent overwhelming a saturated connection pool:
States:
- CLOSED: Normal operation
- OPEN: Stops sending requests after consecutive failures (typically 5)
- HALF-OPEN: Tests recovery after cooldown period (typically 60 seconds)
Node.js example with opossum:
const CircuitBreaker = require('opossum');
const { Pool } = require('pg');

const pool = new Pool(); // connection settings from environment variables or defaults

const breaker = new CircuitBreaker(async () => {
  const client = await pool.connect();
  try {
    // ... database operation
  } finally {
    client.release();
  }
}, {
  timeout: 3000,                // operation timeout
  errorThresholdPercentage: 50,
  resetTimeout: 60000           // time before attempting HALF-OPEN
});
3. Queue Requests with Backpressure
Limit in-flight database transactions to match available resources. Queue additional requests rather than attempting to acquire connections immediately.
Monitor queue depth:
SHOW POOLS; -- in PgBouncer, watch cl_waiting column
If the oldest client wait time increases, your pool is undersized or queries are too slow.
4. Return Cached/Stale Data
For read operations, serve cached data with appropriate staleness indicators (e.g., X-Cache: stale header) when fresh connections are unavailable.
5. Implement Retry with Exponential Backoff
async function withRetry(operation, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await operation();
    } catch (err) {
      if (i === maxRetries - 1) throw err;
      await new Promise(r => setTimeout(r, Math.pow(2, i) * 1000));
    }
  }
}
6. Optimize Query Performance
The most effective solution: make queries faster so connections return to the pool sooner. PostgreSQL performance degrades significantly beyond the saturation point.
Sources:
SERIAL in PostgreSQL is not a true data type, but a notational convenience for creating auto-incrementing integer columns (similar to AUTO_INCREMENT in other databases).
When you declare a column as SERIAL, PostgreSQL automatically:
- Creates a sequence object
- Sets the column's default value to the next value from that sequence
- Marks the sequence as owned by the column (so it's dropped when the column is dropped)
Example:
CREATE TABLE tablename (
colname SERIAL
);
Is equivalent to:
CREATE SEQUENCE tablename_colname_seq AS integer;
CREATE TABLE tablename (
colname integer NOT NULL DEFAULT nextval('tablename_colname_seq')
);
ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;
Three variants exist:
- SMALLSERIAL (alias: SERIAL2) → creates a smallint column (2 bytes, range: 1 to 32,767)
- SERIAL (alias: SERIAL4) → creates an integer column (4 bytes, range: 1 to 2,147,483,647)
- BIGSERIAL (alias: SERIAL8) → creates a bigint column (8 bytes, range: 1 to 9,223,372,036,854,775,807)
Important: Sequences can have gaps in numbering even if no rows are deleted, because allocated values are consumed even when transactions roll back.
Sources:
Based on my research of authoritative PostgreSQL connection pool documentation, I can provide a verified answer:
Why Requests Block When Connection Pool Is Full
Requests block when a connection pool is full because connection pools use a queue to handle demand that exceeds available connections. When all connections in the pool are actively in use, new connection requests are placed in a FIFO (First-In-First-Out) queue and wait until an existing connection is released back to the pool.
The Mechanism:
Pool Saturation: When a client requests a connection and all connections in the pool are checked out (in use), the request cannot be served immediately
Queue Waiting: Instead of failing immediately, the request is placed in a waiting queue. The client thread/process blocks until:
- A connection becomes available (returned to the pool by another client)
- A timeout is reached (if configured)
- The queue limit is exceeded (if configured)
Fairness: The FIFO queue ensures requests are served in the order they arrive
Configuration Parameters That Control Blocking:
node-postgres (pg):
- connectionTimeoutMillis: Maximum time to wait for a connection (default: 0 = wait indefinitely)
- Blocks in the FIFO queue until timeout or a connection becomes available
psycopg3 (Python):
- timeout: Maximum wait time in seconds (default: 30 seconds)
- max_waiting: Maximum queued requests before rejecting new ones (default: 0 = unlimited)
- Raises TooManyRequests when the queue limit is exceeded
PgBouncer:
- max_client_conn: Controls how many clients can queue while waiting
- maxwait: Shows how long the oldest client has waited in the queue
Why This Design?
Blocking with a queue is preferable to immediate failure because it smooths out temporary traffic spikes and ensures requests are eventually served when the pool has capacity, rather than dropping requests that might succeed moments later.
Sources:
JSONB is a binary storage format for JSON data in PostgreSQL. Unlike the plain json type which stores an exact copy of the input text, jsonb stores data in a decomposed binary format.
Key characteristics:
- Binary storage: Data is parsed and stored in a binary representation, not as text
- Faster processing: No reparsing needed on each query (unlike the json type)
- Normalized format: Does not preserve whitespace, object key order, or duplicate keys
- Slower input: Requires conversion overhead when inserting data
Trade-offs vs JSON type:
| Feature | json | jsonb |
|---|---|---|
| Storage | Exact text copy | Binary decomposed |
| Input speed | Faster | Slower (conversion overhead) |
| Query speed | Slower (must reparse) | Faster |
| Indexing | No | Yes (GIN indexes) |
| Preserves key order | Yes | No |
| Preserves whitespace | Yes | No |
Recommended use: PostgreSQL documentation recommends using jsonb for most applications unless you have specialized needs like preserving exact key ordering.
Example:
CREATE TABLE users (
id SERIAL PRIMARY KEY,
data JSONB
);
-- Create GIN index for efficient querying
CREATE INDEX idx_users_data ON users USING GIN (data);
-- Insert data
INSERT INTO users (data) VALUES ('{"name": "Alice", "age": 30}');
-- Query using containment operator
SELECT * FROM users WHERE data @> '{"age": 30}';
Sources:
Window functions preserve individual rows while computing across related row sets, whereas aggregate functions collapse rows into summary results.
Key Differences:
Output rows: Aggregate functions return one row per group (or one row total). Window functions return one row for each input row.
Syntax: Window functions require an OVER clause after the function name. Without OVER, the same function acts as a regular aggregate.
Row access: Window functions can access multiple rows related to the current row within a defined "window frame" while keeping each row separate.
Execution order: Window functions execute after aggregate functions in the query pipeline.
Clause restrictions: Window functions are only allowed in the SELECT list and ORDER BY clause, not in GROUP BY, HAVING, or WHERE.
Example:
-- Aggregate: Returns 1 row per department
SELECT depname, avg(salary)
FROM empsalary
GROUP BY depname;
-- Window: Returns all employee rows with dept average added
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname)
FROM empsalary;
In the window function example, every employee row is preserved, but each shows their department's average salary alongside their individual data.
Sources:
Use GIN indexes for JSONB and array data when you need to search within composite values rather than match them entirely. Specifically:
For JSONB:
- When searching for keys or key/value pairs across many JSONB documents (containment queries using @>, jsonpath queries using @? and @@)
- When checking for key existence using the ?, ?|, and ?& operators
- NOT when checking equality of complete JSONB documents (use B-tree instead)
For Arrays:
- When searching for specific elements within arrays
- When checking for overlap (&&), containment (@>), or subset (<@) operations
- NOT for simple array equality checks
Operator Class Selection for JSONB:
Use jsonb_ops (default) when:
- You need key-existence operators (?, ?|, ?&)
- Query patterns are varied and unpredictable
- Schema flexibility is important
Use jsonb_path_ops when:
- You only need containment (@>) or jsonpath operators (@?, @@)
- Performance and index size are critical (indexes are typically much smaller and searches faster)
- You won't search for empty JSON structures like {"a": {}} (these require full-index scans)
Performance Tradeoffs:
- GIN indexes are slower for inserts/updates than B-tree indexes
- Use the fastupdate parameter for write-heavy workloads
fastupdateparameter for write-heavy workloads - For bulk loading, drop the index, load data, then recreate the index
Sources:
PostgreSQL triggers are database functions that execute automatically in response to data modification events (INSERT, UPDATE, DELETE, TRUNCATE) on tables, views, or foreign tables.
How Triggers Work:
Triggers execute at specific points:
- BEFORE - fires before the operation, can modify data or prevent execution
- AFTER - fires after the operation completes
- INSTEAD OF - replaces the operation (views only)
Triggers operate at two levels:
- Row-level - executes once per affected row
- Statement-level - executes once per SQL statement, regardless of rows affected
Common Use Cases:
- Automatic timestamp tracking - Recording when records were created/modified and by whom:
CREATE FUNCTION emp_stamp() RETURNS trigger AS $emp_stamp$
BEGIN
NEW.last_date := current_timestamp;
NEW.last_user := current_user;
RETURN NEW;
END;
$emp_stamp$ LANGUAGE plpgsql;
CREATE TRIGGER emp_stamp BEFORE INSERT OR UPDATE ON emp
FOR EACH ROW EXECUTE FUNCTION emp_stamp();
- Data validation - Enforcing business rules before data changes:
IF NEW.salary < 0 THEN
RAISE EXCEPTION '% cannot have a negative salary', NEW.empname;
END IF;
- Audit logging - Recording all changes to critical tables:
CREATE FUNCTION process_emp_audit() RETURNS TRIGGER AS $emp_audit$
BEGIN
IF (TG_OP = 'DELETE') THEN
INSERT INTO emp_audit SELECT 'D', now(), current_user, OLD.*;
ELSIF (TG_OP = 'UPDATE') THEN
INSERT INTO emp_audit SELECT 'U', now(), current_user, NEW.*;
ELSIF (TG_OP = 'INSERT') THEN
INSERT INTO emp_audit SELECT 'I', now(), current_user, NEW.*;
END IF;
RETURN NULL;
END;
$emp_audit$ LANGUAGE plpgsql;
- Maintaining referential integrity - Cascading updates/deletes across related tables
- Summary table synchronization - Automatically updating aggregate/materialized data when source tables change
Sources:
PostgreSQL supports four transaction isolation levels as defined by the SQL standard:
- Read Uncommitted
- Read Committed (default)
- Repeatable Read
- Serializable
Note: PostgreSQL internally implements only three distinct isolation levels. Read Uncommitted behaves identically to Read Committed, as PostgreSQL does not allow dirty reads.
You can set the isolation level using:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
Sources:
Advisory locks in PostgreSQL are application-defined locks where the database provides the locking mechanism but does not enforce their use—it's up to your application to use them correctly.
What they are:
- Locks with user-defined meanings that can be acquired on arbitrary integer values
- They exist outside PostgreSQL's MVCC (Multi-Version Concurrency Control) system
- Visible in the
pg_lockssystem view - Two types available:
- Session-level: Held until explicitly released or session ends (ignore transaction boundaries)
- Transaction-level: Automatically released at transaction end (no explicit unlock needed)
Functions available:
-- Session-level (must explicitly unlock)
pg_advisory_lock(key bigint)
pg_advisory_unlock(key bigint)
pg_try_advisory_lock(key bigint) -- Non-blocking
-- Transaction-level (auto-release)
pg_advisory_xact_lock(key bigint)
pg_try_advisory_xact_lock(key bigint) -- Non-blocking
When to use them:
- Implementing pessimistic locking strategies that mimic traditional flat-file database systems
- Application-level mutex/semaphore coordination (e.g., ensuring only one worker processes a queue item)
- Distributed job coordination where multiple application instances need to coordinate work
- Better than flag columns in tables because advisory locks:
- Offer superior performance
- Prevent table bloat
- Receive automatic cleanup when sessions terminate
Important considerations:
- Session-level locks ignore transaction semantics—a lock acquired in a rolled-back transaction remains held
- Be cautious with
LIMITclauses in queries that acquire locks, as execution order isn't guaranteed - A single lock identifier can be held at both session and transaction levels simultaneously
Sources:
The circuit breaker pattern for database connection pools is a fault tolerance mechanism that prevents cascading failures by monitoring database operation failures and temporarily blocking requests when a failure threshold is exceeded.
How It Works
The circuit breaker wraps database calls and operates as a state machine with three states:
1. CLOSED (Normal Operation)
- All database requests pass through to the connection pool
- Failures are counted within a time window
- If failures exceed the threshold, transitions to OPEN
2. OPEN (Failing Fast)
- Database requests fail immediately without attempting connection
- Returns errors to the application instantly
- After a timeout period (e.g., 30-60 seconds), transitions to HALF-OPEN
- Prevents connection pool exhaustion during database outages
3. HALF-OPEN (Testing Recovery)
- Limited requests are allowed through to test database availability
- If successful, transitions back to CLOSED
- If failures continue, returns to OPEN
Failure Thresholds
Common configurations:
- Failure threshold: 5-10 consecutive failures or 50% failure rate
- Timeout period: 30-60 seconds before retry
- Different thresholds for error types: e.g., 10 for timeouts, 3 for connection failures
Benefits for Connection Pools
- Prevents connection pool exhaustion when database is unresponsive
- Avoids resource depletion from connections waiting on timeouts
- Stops retry storms that amplify load on failing databases
- Provides automatic recovery testing without manual intervention
PostgreSQL-Specific Context
PostgreSQL connection poolers like PgBouncer manage connection reuse, but circuit breakers operate at the application layer to protect against database unavailability before connection attempts consume pool resources.
SERIAL is a pseudo-type convenience notation, not an actual data type. When you use SERIAL, PostgreSQL automatically creates a sequence and wires it to the column. A sequence is the underlying database object that generates the numbers.
Key Differences:
SERIAL (convenience notation):
- Not a true data type
- Automatically creates a sequence named tablename_colname_seq
- Sets column type to integer (smallint for smallserial, bigint for bigserial)
- Adds NOT NULL constraint
- Sets default value to nextval('sequence_name')
- Links sequence ownership to the column (sequence drops when column is dropped)
SEQUENCE (database object):
- Actual database object created with CREATE SEQUENCE
- Independent entity (unless OWNED BY is set)
- Can be shared across multiple tables/columns
- Requires manual configuration of all parameters
- Must be set manually as the column default with DEFAULT nextval('sequence_name')
- Persists independently unless owned by a column
Equivalent Commands:
-- Using SERIAL
CREATE TABLE users (
id SERIAL PRIMARY KEY
);
-- Using SEQUENCE (manual equivalent)
CREATE SEQUENCE users_id_seq;
CREATE TABLE users (
id integer NOT NULL DEFAULT nextval('users_id_seq') PRIMARY KEY
);
ALTER SEQUENCE users_id_seq OWNED BY users.id;
Modern Alternative:
Since PostgreSQL 10, IDENTITY columns are SQL-standard and preferred over SERIAL:
CREATE TABLE users (
id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
);
An INNER JOIN in PostgreSQL returns only the rows where there is a match between the specified columns in both tables. Use the JOIN keyword (which defaults to INNER JOIN) with an ON clause to specify the matching condition.
Basic Syntax:
SELECT columns
FROM table1
JOIN table2 ON table1.column = table2.column;
Example with actual data:
SELECT city, temp_lo, temp_hi, prcp, date, location
FROM weather
JOIN cities ON weather.city = cities.name;
With qualified column names (recommended for clarity):
SELECT weather.city, weather.temp_lo, weather.temp_hi,
weather.prcp, weather.date, cities.location
FROM weather
JOIN cities ON weather.city = cities.name;
Key characteristics:
- JOIN and INNER JOIN are equivalent (INNER is the default)
- Only returns rows where the join condition matches in both tables
- Unmatched rows from either table are excluded from results
- The JOIN ... ON syntax is preferred over the older WHERE clause style for clarity
Alternative (older) syntax (not recommended):
SELECT *
FROM weather, cities
WHERE weather.city = cities.name;
The explicit JOIN ... ON syntax is preferred because it separates the join condition from filtering conditions in the WHERE clause, making queries easier to understand and maintain.
Regular Views vs. Materialized Views in PostgreSQL
Regular Views:
- Virtual/logical views with no data storage
- Execute the defining query every time they are accessed
- Always return current, real-time data
- Can be updatable (depending on the query)
- No refresh needed—data is always fresh
Materialized Views:
- Persist query results in a table-like form (physical storage)
- Return data directly from stored results, not by executing the query
- Data becomes stale until explicitly refreshed with REFRESH MATERIALIZED VIEW
- Cannot be directly updated with INSERT/UPDATE/DELETE
- Much faster query performance, but data may not be current
Performance Trade-off:
Materialized views provide faster access than regular views or direct table access, but at the cost of data currency. Use materialized views when you need query performance and can tolerate stale data.
Example:
-- Regular view (executes query each time)
CREATE VIEW sales_summary AS
SELECT product_id, SUM(amount) FROM sales GROUP BY product_id;
-- Materialized view (stores results)
CREATE MATERIALIZED VIEW sales_summary_mat AS
SELECT product_id, SUM(amount) FROM sales GROUP BY product_id;
-- Refresh when needed
REFRESH MATERIALIZED VIEW sales_summary_mat;
PostgreSQL extensions package functionality into reusable, manageable units that can be loaded into a database with a single command.
How the Extension System Works
Core Components:
- Control file (extension.control): Defines extension metadata including version, dependencies, and relocatability
- Script file (extension--version.sql): Contains SQL commands to create the extension's objects
- Shared library (optional): C code for performance-critical functionality
Loading Mechanism:
When you run CREATE EXTENSION extension_name, PostgreSQL:
- Reads the control file from SHAREDIR/extension/
- Executes the SQL script within a transaction
- Registers all created objects as extension members in system catalogs
Management Benefits:
- Drop all objects with a single DROP EXTENSION command
- pg_dump exports CREATE EXTENSION instead of individual objects
- Version tracking and upgrades via update scripts
Usage:
CREATE EXTENSION extension_name;
DROP EXTENSION extension_name CASCADE;
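A short sketch of these management commands in practice, using pg_trgm as an arbitrary example:
-- Install a trusted extension, upgrade it later, and inspect what is installed
CREATE EXTENSION pg_trgm;
ALTER EXTENSION pg_trgm UPDATE;
SELECT extname, extversion FROM pg_extension;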
Essential Extensions (Supplied with PostgreSQL)
Data Types:
- hstore: Key-value pairs within a single value
- ltree: Hierarchical tree structures
- citext: Case-insensitive text type
- cube: Multi-dimensional cubes
Text Search & Matching:
- pg_trgm: Trigram-based similarity and fuzzy matching
- fuzzystrmatch: Soundex, Levenshtein distance algorithms
- unaccent: Remove accents from text
Security & Cryptography:
- pgcrypto: Cryptographic functions (hashing, encryption)
- uuid-ossp: UUID generation
Performance & Administration:
- pg_stat_statements: Track SQL query performance statistics
- pageinspect: Inspect database pages at low level
Array Operations:
- intarray: Integer array functions and operators with GiST/GIN indexing
Many extensions are "trusted," allowing non-superuser installation in databases where they have CREATE privilege.
VACUUM is a maintenance operation that reclaims storage and prevents critical database failures in PostgreSQL.
What VACUUM Does
VACUUM removes dead row versions (tuples) left behind by UPDATE and DELETE operations. Due to PostgreSQL's MVCC (Multi-Version Concurrency Control) architecture, these operations don't immediately remove old row versions—they're marked as dead but remain on disk until VACUUM processes them.
Why It's Necessary
1. Disk Space Recovery
Dead tuples consume disk space. Without VACUUM, your database will grow unbounded even if the actual data size remains constant.
2. Query Performance
VACUUM updates table statistics used by the query planner to generate efficient execution plans. It also maintains the visibility map, enabling faster index-only scans.
3. Transaction ID Wraparound Prevention
PostgreSQL uses 32-bit transaction IDs that wrap around after ~4 billion transactions. Without regular VACUUM to "freeze" old rows, wraparound causes catastrophic data loss—old transactions suddenly appear "in the future" and become invisible. VACUUM must run at least once every 2 billion transactions per table.
4. Index Performance
Dead tuples bloat indexes, slowing down queries. VACUUM cleans up index entries pointing to dead tuples.
Key Variants
- VACUUM: Reclaims space for reuse within the table, runs concurrently with normal operations
- VACUUM FULL: Rewrites entire table to return space to OS, requires exclusive lock
- VACUUM ANALYZE: Combines vacuuming with statistics update
Most installations rely on the autovacuum daemon, which automatically runs VACUUM based on table activity.
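A brief sketch of the variants in practice (the orders table name is only a placeholder):
-- Reclaim dead tuples and refresh planner statistics for one table
VACUUM (ANALYZE, VERBOSE) orders;
-- Rewrite the table to return space to the OS (takes an exclusive lock)
VACUUM FULL orders;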
PostgreSQL implements ACID properties through four core mechanisms:
Atomicity
PostgreSQL uses Write-Ahead Logging (WAL) to ensure atomicity. Changes to data files are written only after those changes have been logged—WAL records describing the changes must be flushed to permanent storage first. If a transaction fails or the system crashes, PostgreSQL can roll back incomplete transactions or redo committed changes during recovery using the WAL records.
Consistency
PostgreSQL maintains consistency through:
- Integrity constraints (foreign keys, check constraints, unique constraints)
- Triggers that enforce business rules
- Transaction isolation that prevents viewing inconsistent intermediate states
Isolation
PostgreSQL implements Multi-Version Concurrency Control (MVCC) for isolation. Each SQL statement sees a snapshot of data as it existed at transaction start, regardless of concurrent modifications. This means:
- Reading never blocks writing
- Writing never blocks reading
- Multiple transactions work with their own data snapshots
PostgreSQL supports multiple isolation levels (Read Committed, Repeatable Read, Serializable) with Serializable Snapshot Isolation (SSI) providing the strictest guarantees.
Durability
Durability is guaranteed through WAL with synchronous commits. When a transaction commits, PostgreSQL:
- Flushes WAL records to disk (using fsync() system calls)
- Only requires the sequential WAL file to be synced, not all modified data pages
- Can recover to a consistent state after crashes by replaying WAL records
The WAL file is written sequentially, making synchronization much faster than flushing scattered data pages.
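A minimal transaction sketch tying these pieces together (the accounts table is a placeholder; the flush-before-return behavior assumes the default synchronous_commit = on):
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- WAL records for this transaction are flushed to disk before COMMIT returns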
Use both: application-level pooling AND PgBouncer, not either/or.
Application-level pooling (built into frameworks like HikariCP, pgx, Npgsql) should always be your first layer. It provides better integration with prepared statements, result set caching, and eliminates network hops within the application process.
Add PgBouncer when:
- You have 100+ concurrent clients (below ~56 clients, direct connections outperform poolers by 2.5x)
- Multiple small applications connect to the same database
- Connections are frequently idle or short-lived
- You need to pool 1,000+ application connections down to 100 database connections
Deployment strategy:
Run PgBouncer as close to your application as possible—ideally on the same host using Unix sockets to avoid TCP overhead. Only use centralized PgBouncer for multi-application environments.
Performance consideration:
With low concurrency, PgBouncer adds pure overhead with no benefit. Always benchmark your specific workload—patterns of data access and latencies determine which approach yields better results.
PostgreSQL sequences are special single-row database objects that generate sequential numeric values. They are sequence generators used to produce unique, ordered numbers automatically.
What they are:
- Database objects created with CREATE SEQUENCE
- Single-row tables designed specifically to generate sequential numbers
- Commonly used to generate unique identifiers for table rows (e.g., primary keys)
How they work:
- Operated via functions: nextval() (get next value), currval() (get current value), setval() (set value)
- Supported data types: smallint, integer, bigint (default)
- Generate values within the range -9223372036854775808 to 9223372036854775807 for bigint
Key characteristics:
- Configurable increment (positive or negative)
- Optional min/max boundaries
- Optional cycling (restart when limit reached)
- Cache support for performance
- Can be owned by table columns (auto-drop when column is dropped)
- Changes are never rolled back (sequence values are not transactional)
Example:
CREATE SEQUENCE serial START 101;
SELECT nextval('serial'); -- Returns 101
SELECT nextval('serial'); -- Returns 102
The most important PostgreSQL configuration parameters for performance tuning are:
Memory Configuration
shared_buffers - Sets memory for database page caching. Recommended: 25% of system RAM (maximum 40% of RAM). Default is often too low for production systems.
work_mem - Memory per query operation (sort, hash) before spilling to disk. Default: 4MB. Increase for complex queries, but multiply by max_connections to avoid over-allocation.
maintenance_work_mem - Memory for maintenance operations (VACUUM, CREATE INDEX, ALTER TABLE). Default: 64MB. Can be set much higher than work_mem since few operations run concurrently.
effective_cache_size - Planner's estimate of OS and PostgreSQL cache size. Default: 4GB. Set to 50-75% of total RAM. Does not allocate memory, only influences query planning decisions.
Query Planning
random_page_cost - Cost estimate for non-sequential disk reads. Default: 4.0. Lower to 1.1-2.0 for SSDs to favor index scans.
effective_io_concurrency - Number of concurrent I/O operations PostgreSQL expects the storage can handle. Default: 1 (0 on systems without asynchronous I/O). Set to ~200 for SSDs, ~100 for RAID arrays.
default_statistics_target - Sample size for ANALYZE operations affecting planner estimates. Default: 100. Increase to 500-1000 for large tables with non-uniform distributions.
Background Processes
max_parallel_workers_per_gather - Workers per parallel query operation. Default: 2. Increase for CPU-bound analytic queries on multi-core systems.
max_worker_processes - Total background worker processes. Default: 8. Must be ≥ max_parallel_workers + autovacuum workers.
Checkpoint and WAL
checkpoint_completion_target - Fraction of checkpoint interval to spread writes. Default: 0.9. Keep at 0.9 to smooth I/O.
wal_buffers - WAL (Write-Ahead Log) buffer size. Default: -1 (auto-tuned to 1/32 of shared_buffers). Usually auto-sizing is sufficient.
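A hedged sketch of changing a few of these with ALTER SYSTEM (the values are illustrative, not tuning recommendations for any particular workload):
ALTER SYSTEM SET shared_buffers = '8GB';   -- takes effect only after a server restart
ALTER SYSTEM SET work_mem = '64MB';
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();                   -- applies the reloadable settings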
PL/pgSQL > Exception Handling
30 questions
SQLSTATE codes consist of five characters following SQL standard conventions.
A block containing an EXCEPTION clause is significantly more expensive to enter and exit than a block without one.
MESSAGE (sets error text), DETAIL (supplementary error information), HINT (guidance to users), ERRCODE (specifies SQLSTATE), COLUMN (related column name), CONSTRAINT (related constraint name), DATATYPE (related datatype name), TABLE (related table name), and SCHEMA (related schema name).
FOUND is always set to true after successful execution of a command with STRICT.
No, condition names are not case-sensitive in PostgreSQL exception handlers.
The target will be set to nulls if the query returned no rows, and FOUND will be set to false.
It cannot be caught by this EXCEPTION clause, but is propagated out. A surrounding EXCEPTION clause could catch it.
RETURNED_SQLSTATE (SQLSTATE error code), MESSAGE_TEXT (primary message), PG_EXCEPTION_DETAIL (detail message), PG_EXCEPTION_HINT (hint message), PG_EXCEPTION_CONTEXT (call stack description), COLUMN_NAME (related column), CONSTRAINT_NAME (related constraint), PG_DATATYPE_NAME (related datatype), TABLE_NAME (related table), and SCHEMA_NAME (related schema).
MESSAGE cannot be used in the form of RAISE that includes a format string before USING.
DEBUG, LOG, INFO, NOTICE, WARNING, and EXCEPTION (EXCEPTION is the default).
The parameterless form of RAISE can only be used inside a BEGIN block's EXCEPTION clause and causes the error currently being handled to be re-thrown.
NO_DATA_FOUND (when no rows are returned) or TOO_MANY_ROWS (when more than one row is returned).
When the condition is a Boolean expression that evaluates to false or null.
The first two characters of an error code denote a class of errors, while the last three characters indicate a specific condition within that class.
Any error code consisting of five digits and/or upper-case ASCII letters, other than 00000.
The system supplied error message "assertion failed" is used—in whatever national language the lc_messages run-time parameter specifies.
PostgreSQL implicitly creates an anonymous savepoint (subtransaction).
Changes outside the exception block are not rolled back; only changes to persistent database state within the block are rolled back.
The local variables of the PL/pgSQL function remain as they were when the error occurred, but all changes to persistent database state within the block are rolled back.
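A small sketch illustrating those rollback semantics (the demo temp table is a placeholder):
CREATE TEMP TABLE demo (v int);
DO $$
DECLARE
  n integer := 0;
BEGIN
  BEGIN
    INSERT INTO demo VALUES (1);   -- rolled back when the handler below catches the error
    n := 1;                        -- local variable keeps its value
    RAISE EXCEPTION 'boom';
  EXCEPTION WHEN others THEN
    RAISE NOTICE 'n = %, demo rows = %', n, (SELECT count(*) FROM demo);
    -- prints: n = 1, demo rows = 0
  END;
END $$;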
Data Types - JSON > JSON Operators
30 questions
The ?| operator tests if any of the strings in the provided array exist as top-level keys or array elements. For example, '{"a":1, "b":2, "c":3}'::jsonb ?| array['b', 'd'] returns true because 'b' exists.
The operators ->, ->>, #>, and #>> are available for both json and jsonb types. These are the field/element/path extraction operators.
The @> operator tests if the left jsonb value contains the right jsonb value. For example, '{"a":1, "b":2}'::jsonb @> '{"b":2}'::jsonb returns true.
The #> operator takes a text array as the path. For example: '{"a": {"b": ["foo","bar"]}}'::json #> '{a,b,1}' returns "bar".
The -> operator returns json type (same as its left-hand input). For example, '{"a": {"b":"foo"}}'::json -> 'a' returns {"b":"foo"} as json.
The ->> operator always returns text type, regardless of whether the left-hand input is json or jsonb. It coerces the extracted value to text.
Yes, the jsonb - operator with integer supports negative indices that count from the end. Negative integers count from the end of the array.
The - operator deletes a key (and its value) from a JSON object, or matching string value(s) from a JSON array. For example, '{"a": "b", "c": "d"}'::jsonb - 'a' returns {"c": "d"}.
The default jsonb_ops GIN operator class supports the key-exists operators (?, ?|, ?&), the containment operator (@>), and the jsonpath match operators (@?, @@).
The <@ operator tests if the left jsonb value is contained within the right jsonb value. It is the inverse of @>. For example, '{"b":2}'::jsonb <@ '{"a":1, "b":2}'::jsonb returns true.
The json type stores an exact copy of the input text, which processing functions must reparse on each execution. The jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed.
The || operator concatenates the two arrays. For example, '["a", "b"]'::jsonb || '["a", "d"]'::jsonb returns ["a", "b", "a", "d"].
jsonb_ops creates independent index items for each key and value in the data, while jsonb_path_ops creates index items only for each value (incorporating keys into hash values). For example, to index {"foo": {"bar": "baz"}}, jsonb_path_ops creates a single index item with all three (foo, bar, baz) in the hash, while jsonb_ops creates three separate index items.
The #- operator deletes the field or array element at the specified path, where path elements can be either field keys or array indexes. For example, '["a", {"b":1}]'::jsonb #- '{1,b}' returns ["a", {}].
The @@ operator returns the result of a Boolean jsonpath expression. For example, '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2' returns true.
The -> operator returns jsonb type (same as its left-hand input). For example, '{"a": {"b":"foo"}}'::jsonb -> 'a' returns {"b":"foo"} as jsonb.
No, the || operator does not operate recursively. Only the top-level array or object structure is merged.
No, the standard comparison operators are available for jsonb, but not for json. The json type does not support these comparison operators.
The @? operator checks for path existence and should NOT use predicate check expressions. The @@ operator evaluates a predicate (boolean expression) and REQUIRES predicate check expressions. @? returns true/false based on existence, while @@ evaluates the boolean result of the path expression.
No, the containment operators @> and <@ are only available for jsonb type. There is no parallel set of facilities for the json type for containment testing.
The ? operator tests whether a string appears as an object key or array element at the top level of the jsonb value. For example, '{"a":1, "b":2}'::jsonb ? 'b' returns true.
The ?& operator tests if all of the strings in the provided array exist as top-level keys or array elements. For example, '["a", "b", "c"]'::jsonb ?& array['a', 'b'] returns true because both 'a' and 'b' exist.
No, array order is irrelevant for containment. For example, [1, 2, 3] contains [3, 1] returns true.
No, the ? operator only searches at the top level of the JSON structure. It does not search recursively through nested objects or arrays.
No, duplicates don't matter for containment. For example, [1, 2, 3] contains [1, 2, 2] returns true even though the first array only has one occurrence of 2.
Yes, the -> operator accepts negative array indices that count from the end of arrays. -1 designates the rightmost element, -2 the second from right, and so on.
Yes, the @> containment operator works recursively through nested structures. The contained object must match the containing object as to structure and data contents, possibly after discarding some non-matching elements.
The #> operator returns the result as the same type as the input (json or jsonb), while #>> returns the result as text. For example, '{"a": {"b": ["foo","bar"]}}'::json #> '{a,b,1}' returns "bar" (with quotes), while #>> returns bar (without quotes, as text).
The jsonb_path_ops operator class supports only @>, @?, and @@ operators. It does not support the key-exists operators (?, ?|, ?&).
System Catalogs
30 questions
No, pg_statistic should not be readable by the public since even statistical information might be sensitive. The pg_stats view provides public access to appropriate statistics.
Yes, every table has implicit system columns (oid, tableoid, xmin, xmax, cmin, cmax, ctid) that cannot be used as user-defined column names.
Yes, input functions for OID alias types allow whitespace between tokens and fold upper-case letters to lower case, except within double quotes.
regproc accepts simple function names only (no overloading), while regprocedure accepts function names with argument types to handle overloaded functions (e.g., sum(int4)).
f = normal function, p = procedure, a = aggregate function, w = window function.
The maximum value size is 1 GB (2^30 - 1 bytes), due to two bits being reserved in the varlena length word.
information_schema is SQL-standard, portable, and read-only but may lack PostgreSQL-specific details. pg_catalog is PostgreSQL-native, complete, and includes all internal details like OIDs and sequences.
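A brief sketch of the same lookup through each interface, listing ordinary tables in the public schema:
-- SQL-standard, portable
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public' AND table_type = 'BASE TABLE';
-- PostgreSQL-native, exposes internal detail such as OIDs
SELECT c.oid, c.relname
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relkind = 'r';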
A direct DROP of the dependent object will be disallowed outright. The dependent object drops automatically when the referenced object is dropped, regardless of CASCADE specification.
Command identifiers are 32-bit quantities, limiting SQL commands to 2^32 (4 billion) per transaction.
No, pg_namespace is database-specific, not cluster-wide. It stores namespaces for each individual database.
PLAIN (prevents compression and out-of-line storage), EXTENDED (allows both compression and out-of-line storage, the default), EXTERNAL (allows out-of-line storage but not compression), MAIN (allows compression but not out-of-line storage).
Ordinary (user-defined) columns are numbered from 1 up. System columns have arbitrary negative numbers.
Yes, if a constant of an OID alias type (like regclass) appears in a stored expression (column default or view), it creates a dependency on the referenced object. Exception: regrole does not create dependencies.
indisprimary is a boolean that indicates whether this index is for a primary key.
c = check constraint, f = foreign key constraint, n = not-null constraint, p = primary key constraint, u = unique constraint, t = constraint trigger, x = exclusion constraint.
datistemplate is a boolean indicating whether the database can be cloned by any user with CREATEDB privileges. When true, the database is a template database.
template0 should never be changed after the database cluster has been initialized. It serves as a pristine backup.
Before PostgreSQL 8.1, default_with_oids was on by default. Starting with 8.1, the default became off (WITHOUT OIDS).
The default search_path is "$user", public where $user resolves to the current username.
The default value is off. This parameter allows modifications of system table structures and should only be used by initdb.
xmin contains the transaction ID of the inserting transaction for that row version.
indisunique is a boolean that indicates whether this is a unique index.
t = table access method (like heap), i = index access method (like btree, hash, gist, gin, spgist, brin).
In a dropped column's pg_attribute entry, atttypid is reset to zero, but attlen and other fields copied from pg_type remain valid.
No, a copy of template1 must use the same encoding and locale settings as template1. You must use template0 to specify different encoding or locale settings.
typlen = -1 indicates a 'varlena' type (variable-length type with a length word).
An on-disk TOAST pointer datum is 18 bytes (including the varlena header).
Yes, most system catalogs are copied from the template database during database creation and are thereafter database-specific. Only a few catalogs are shared cluster-wide.
Performance Tips > Query Statistics
29 questions
Yes, pg_stat_statements must be loaded by adding it to shared_preload_libraries in postgresql.conf, which requires a server restart.
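A minimal setup sketch for that answer (total_exec_time is the column name in PostgreSQL 13 and later; older releases use total_time):
-- postgresql.conf (server restart required)
shared_preload_libraries = 'pg_stat_statements'
-- then, once per database:
CREATE EXTENSION pg_stat_statements;
-- top statements by total execution time
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;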
The valid values are 'none', 'pl' (procedural-language functions only), and 'all'.
No, this parameter can only be set in postgresql.conf or on the server command line, requiring a restart.
It accepts three optional parameters: userid (Oid), dbid (Oid), and queryid (bigint). If all parameters are NULL or 0, it discards all statistics.
Good results will show most (>90%) individual timing calls take less than one microsecond, with average per loop overhead below 100 nanoseconds.
Enabling this parameter may incur a noticeable performance penalty, especially when statements with identical query structure are executed by many concurrent connections which compete to update a small number of pg_stat_statements entries.
The default value is 'top', which tracks only top-level (client-issued) statements.
The pg_test_timing tool can be used to measure the overhead of timing on your system.
The valid values are 'top' (client-issued statements), 'all' (includes nested statements), and 'none' (disabled).
PostgreSQL samples 300 * default_statistics_target rows. With the default value of 100, this equals 30,000 rows.
Query texts are truncated in the pg_stat_activity.query column if they exceed the track_activity_query_size limit (default 1024 bytes).
It is disabled by default because it may cause significant overhead on some platforms when measuring cost-based vacuum delays.
The default value is 1024 bytes. This parameter reserves memory for storing the text of the currently executing command in pg_stat_activity.query.
The multiplier is 300. ANALYZE samples 300 times the statistics target value.
The default value is 'on', which preserves statement statistics across server shutdowns.
Yes, query identifier calculation must be enabled (compute_query_id set to 'auto' or 'on') for pg_stat_statements to be active.
No, inlined SQL functions are never tracked, regardless of the track_functions setting.
With 'cache', the first access to statistics for an object caches those statistics until the end of the transaction (unless pg_stat_clear_snapshot() is called), ensuring repeat accesses yield consistent values.
It sets the maximum number of entries in the most_common_vals and histogram_bounds arrays for table columns without a column-specific target set via ALTER TABLE SET STATISTICS.
PL/pgSQL > Dynamic SQL
29 questions
EXECUTE command-string [ INTO [STRICT] target ] [ USING expression [, ... ] ];
With the default plan_cache_mode of auto, PostgreSQL generates a custom execution plan for each of the first five executions of a given prepared statement. From the sixth execution onwards, PostgreSQL evaluates whether to switch to a generic plan instead.
The PL/pgSQL EXECUTE statement is not related to the EXECUTE SQL statement supported by the PostgreSQL server, and the server's EXECUTE statement cannot be used directly within PL/pgSQL functions.
The plan_cache_mode options are: auto (default), force_generic_plan, and force_custom_plan.
Yes. Parameter symbols work in optimizable SQL commands including SELECT, INSERT, UPDATE, DELETE, and MERGE (since PostgreSQL 15).
Width may be specified using: a positive integer, an asterisk (*) to use the next function argument as the width, or a string of the form *n$ to use the nth function argument as the width.
No. Parameter symbols can only be used for data values. If you want to use dynamically determined table or column names, you must insert them into the command string textually.
The regclass data type automatically handles proper quoting and schema qualification of table names, performs an early check to ensure the table exists and is visible to the current user, and helps prevent SQL injection.
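A minimal sketch combining EXECUTE ... INTO ... USING with a regclass variable (the users table and the min_id value are placeholders):
DO $$
DECLARE
  tbl    regclass := 'users';   -- existence and visibility are checked at assignment
  min_id integer  := 10;
  cnt    bigint;
BEGIN
  -- regclass output is already safely quoted, so %s is sufficient here;
  -- the data value travels through USING rather than string interpolation
  EXECUTE format('SELECT count(*) FROM %s WHERE id > $1', tbl)
    INTO cnt
    USING min_id;
  RAISE NOTICE 'matching rows: %', cnt;
END $$;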
|foo       | (the string 'foo' is left-aligned in a 10-character wide field)
The print_strict_params option for displaying parameters passed to a query which violated a STRICT constraint was introduced in PostgreSQL 9.4.
FOR target IN EXECUTE text_expression [ USING expression [, ... ] ] LOOP
The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, which means that if a PL/pgSQL function produces a very large result set, performance might be poor.
%L quotes the argument value as an SQL literal. A null value is displayed as the string NULL, without quotes.
The query is evaluated and replanned on each entry to the FOR loop.
The only supported flag is a minus sign (-) which causes the format specifier's output to be left-justified. This has no effect unless the width field is also specified.
%s formats the argument value as a simple string. A null value is treated as an empty string.
Using USING is often preferable to inserting data values into the command string as text: it avoids run-time overhead of converting the values to text and back, and it is much less prone to SQL-injection attacks since there is no need for quoting or escaping.
The concatenation operator || cannot handle NULL: if any operand is NULL, the entire result is NULL.
Quotes are added only if necessary (i.e., if the string contains non-identifier characters or would be case-folded). Embedded quotes are properly doubled.
The currently available status items are ROW_COUNT (the number of rows processed by the last SQL command) and RESULT_OID (the OID of the last row inserted).
When print_strict_params is enabled for a function, and an error is thrown because the requirements of STRICT are not met, the DETAIL part of the error message will include information about the parameters passed to the command.
|       foo| (the string 'foo' is right-aligned in a 10-character wide field)
Parameter symbols only work in optimizable SQL commands: SELECT, INSERT, UPDATE, DELETE, MERGE, and certain commands containing one of these. In utility statements (like CREATE TABLE, ALTER TABLE, DROP TABLE), you must insert values textually even if they are just data values.
RETURN QUERY EXECUTE command-string [ USING expression [, ... ] ]
No. SELECT INTO is not currently supported within EXECUTE; instead, execute a plain SELECT command and specify INTO as part of the EXECUTE itself.
quote_nullable() returns NULL (the unquoted string 'NULL') if the argument is null, whereas quote_literal() returns null on null input.
The STRICT option reports an error unless the command produces exactly one row. It raises NO_DATA_FOUND if zero rows are returned or TOO_MANY_ROWS if more than one row is returned.
quote_literal() returns null on null input; if the argument might be null, quote_nullable is often more suitable.
If multiple rows are returned, only the first will be assigned to the INTO variable(s). If no rows are returned, NULL is assigned to the INTO variable(s).
Monitoring > Table Statistics
29 questions
The update does not modify any columns referenced by the table's indexes (excluding summarizing indexes), and there is sufficient free space on the page containing the old row for the updated row.
Number of disk blocks read from this table's TOAST table (if any).
They are estimates/approximations. Most operations adjust them only incrementally, and messages to the stats collector can get dropped entirely under heavy load, causing deltas to go missing.
The number of times the table has been manually vacuumed (using explicit VACUUM commands run by a user or administrator).
The default value is 'on'. This parameter is on by default because the autovacuum daemon needs the collected information.
Updates where no successor versions are required in indexes (HOT updates - Heap-Only Tuple updates).
About 99%. This is calculated as heap_blks_hit / (heap_blks_hit + heap_blks_read) from pg_statio_user_tables.
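A small query sketch of that per-table calculation (nullif() guards against division by zero; the 99% figure refers to an overall ratio):
SELECT relname,
       round(100.0 * heap_blks_hit
             / nullif(heap_blks_hit + heap_blks_read, 0), 2) AS cache_hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;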
Number of buffer hits in this table (read from page cache/buffer pool).
Number of disk blocks read from this table's TOAST table indexes (if any).
A permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts.
bgwriter, archiver, wal, replication_slot, subscription, and slru
Row updates where the successor version goes onto a new heap page, always indicating non-HOT updates.
last_seq_scan (timestamp with time zone) and last_idx_scan (timestamp with time zone)
PostgreSQL 15. The statistics collector was replaced with a purely in-memory implementation using shared memory instead of files.
The estimated number of rows modified since this table was last analyzed.
The number of live rows fetched by sequential scans. This is not the number of records returned to applications but the number of records processed to create the result set.
JSON Operations
29 questions
PostgreSQL JSONB #>> Operator
The #>> operator extracts a JSON value at a specified path as text (not JSONB).
Syntax
jsonb_column #>> text_array_path
Key Differences from #>
- #> returns JSONB type
- #>> returns TEXT type (the value is converted to text)
Examples
-- Given this JSONB data:
SELECT '{"user": {"name": "Alice", "age": 30}}'::jsonb #>> '{user,name}';
-- Returns: 'Alice' (text)
SELECT '{"user": {"name": "Alice", "age": 30}}'::jsonb #>> '{user,age}';
-- Returns: '30' (text, not integer)
SELECT '{"items": [10, 20, 30]}'::jsonb #>> '{items,1}';
-- Returns: '20' (text)
-- Array path uses zero-based indexing
SELECT '{"a": [{"b": "value"}]}'::jsonb #>> '{a,0,b}';
-- Returns: 'value' (text)
Return Values
- Scalar values: converted to text representation
- Objects/arrays: JSON text representation
- Non-existent path:
NULL
Version
Available in PostgreSQL 9.4+ (when JSONB type was introduced).
Source
PostgreSQL Official Documentation: JSON Functions and Operators
jsonb_path_query is a PostgreSQL function (available since version 12) that executes a JSON path expression against JSONB data and returns a set of JSONB values matching the path.
Syntax:
jsonb_path_query(target jsonb, path jsonpath [, vars jsonb [, silent boolean]])
Parameters:
- target: The JSONB column/value to query
- path: A JSONPath expression (SQL/JSON path language per SQL:2016 standard)
- vars: Optional JSONB object containing variables referenced in the path (e.g., $varname)
- silent: Optional boolean (default false); if true, suppresses errors for invalid paths
Returns: Set of JSONB values (use with set-returning context like SELECT, LATERAL, or aggregate functions)
Key Differences from Related Functions:
- jsonb_path_query() → returns a set of rows (multiple results)
- jsonb_path_query_first() → returns a single JSONB value (first match only)
- jsonb_path_query_array() → returns a JSONB array containing all matches
Example:
SELECT jsonb_path_query(
'{"items": [{"price": 10}, {"price": 20}, {"price": 15}]}',
'$.items[*].price'
);
-- Returns 3 rows: 10, 20, 15
SELECT jsonb_path_query(
'{"users": [{"age": 25}, {"age": 35}]}',
'$.users[*] ? (@.age > $min)',
'{"min": 30}'
);
-- Returns: {"age": 35}
Source: PostgreSQL 17 official documentation - JSON Functions and Operators
jsonb_build_array Function in PostgreSQL
jsonb_build_array is a PostgreSQL function that constructs a JSONB array from a variadic list of arguments. It returns a jsonb type (binary JSON format).
Syntax
jsonb_build_array(VARIADIC "any")
Key Characteristics
- Available since: PostgreSQL 9.4
- Return type: jsonb
- Arguments: Accepts 0 to N arguments of any type
- NULL handling: NULL values are preserved as JSON null in the array
- Type conversion: Non-JSON types are automatically converted to their JSON equivalents
Usage Examples
-- Empty array
SELECT jsonb_build_array();
-- Result: []
-- Array with mixed types
SELECT jsonb_build_array(1, 2, 'foo', 4, 5);
-- Result: [1, 2, "foo", 4, 5]
-- Array with NULL
SELECT jsonb_build_array(1, NULL, 'text', true);
-- Result: [1, null, "text", true]
-- Nested with other JSONB functions
SELECT jsonb_build_array(
jsonb_build_object('id', 1, 'name', 'Alice'),
jsonb_build_object('id', 2, 'name', 'Bob')
);
-- Result: [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
Performance Note
jsonb format is stored in a decomposed binary format (slower to create than json, but significantly faster for processing operations like indexing and querying).
Source
PostgreSQL Official Documentation: JSON Functions and Operators (versions 9.4+)
Use the jsonb_typeof() function to check if a JSONB value is an array. It returns 'array' for JSON arrays.
SELECT jsonb_typeof('["a", "b", "c"]'::jsonb);
-- Returns: 'array'
SELECT jsonb_typeof('{"key": "value"}'::jsonb);
-- Returns: 'object'
To filter rows where a JSONB column is an array:
SELECT * FROM table_name
WHERE jsonb_typeof(json_column) = 'array';
Possible return values of jsonb_typeof():
- 'object' - JSON object
- 'array' - JSON array
- 'string' - JSON string
- 'number' - JSON number
- 'boolean' - JSON boolean
- 'null' - JSON null
Alternative using @? operator (PostgreSQL 12+):
SELECT * FROM table_name
WHERE json_column @? '$ ? (@.type() == "array")';
Available since: PostgreSQL 9.4 (when JSONB was introduced)
Source: PostgreSQL official documentation - JSON Functions and Operators
The || operator for JSONB in PostgreSQL is the concatenation/shallow merge operator.
Behavior
For JSONB objects:
- Merges two JSONB objects by combining their keys
- If duplicate keys exist, the right operand's value overwrites the left
- Performs a shallow merge only (nested objects are replaced, not merged)
For JSONB arrays:
- Concatenates arrays into a single array
Examples
-- Object merge (duplicate key 'b' takes right value)
SELECT '{"a": 1, "b": 2}'::jsonb || '{"b": 3, "c": 4}'::jsonb;
-- Result: {"a": 1, "b": 3, "c": 4}
-- Shallow merge (nested object replaced entirely)
SELECT '{"a": {"x": 1}}'::jsonb || '{"a": {"y": 2}}'::jsonb;
-- Result: {"a": {"y": 2}}
-- Array concatenation
SELECT '[1, 2]'::jsonb || '[3, 4]'::jsonb;
-- Result: [1, 2, 3, 4]
-- Adding/updating a single key-value pair
SELECT '{"name": "John"}'::jsonb || '{"age": 30}'::jsonb;
-- Result: {"name": "John", "age": 30}
Key Limitations
- Not recursive: Nested objects are completely replaced, not deep-merged
- For deep merging, use
jsonb_set()or recursive CTEs - Available since PostgreSQL 9.5
Official Documentation
PostgreSQL JSON Functions and Operators: https://www.postgresql.org/docs/current/functions-json.html
row_to_json Function in PostgreSQL
row_to_json is a PostgreSQL built-in function that converts a SQL composite value (row) into a JSON object.
Syntax
row_to_json(record, [pretty_bool])
Parameters
- record: A row type, composite type, or record value
- pretty_bool (optional): If true, adds whitespace for readability; if false or omitted, returns compact JSON
Return Type
Returns json type (not jsonb)
Behavior
- Each column name becomes a JSON key
- Each column value becomes the corresponding JSON value
- NULL SQL values become JSON null
- Column names are NOT quoted unless necessary (e.g., mixed case or special characters)
- Nested composite types are recursively converted to JSON objects
Code Examples
-- Basic usage
SELECT row_to_json(row(1, 'foo', true));
-- Result: {"f1":1,"f2":"foo","f3":true}
-- With table row
SELECT row_to_json(users) FROM users WHERE id = 1;
-- Result: {"id":1,"name":"Alice","email":"[email protected]"}
-- With subquery and named columns
SELECT row_to_json(t) FROM (
SELECT id, name, created_at
FROM users
WHERE id = 1
) t;
-- Result: {"id":1,"name":"Alice","created_at":"2024-01-15T10:30:00"}
-- Pretty printed
SELECT row_to_json(row(1, 'foo'), true);
-- Result: {"f1": 1,
-- "f2": "foo"}
Version
Available since PostgreSQL 9.2 (released September 2012)
Related Functions
- json_build_object(): More control over key names
- to_json(): Converts any value to JSON
- to_jsonb(): Same functionality but returns jsonb type (PostgreSQL 9.4+)
jsonb_agg Function in PostgreSQL
jsonb_agg is an aggregate function that collects values from multiple rows into a single JSONB array.
Syntax
jsonb_agg(expression [ORDER BY ...])
Behavior
- Input: Any data type (automatically converted to JSONB)
- Output: A single JSONB array containing all input values
- NULL handling: NULL input values are included in the output array as JSON null
- Empty set: Returns NULL (not an empty array)
- Available since: PostgreSQL 9.5
Common Use Cases
1. Basic aggregation:
SELECT jsonb_agg(name) FROM users;
-- Result: ["Alice", "Bob", "Charlie"]
2. Aggregate entire rows as objects:
SELECT jsonb_agg(users) FROM users;
-- Result: [{"id":1,"name":"Alice"}, {"id":2,"name":"Bob"}]
3. Aggregate with row construction:
SELECT jsonb_agg(jsonb_build_object('id', id, 'name', name))
FROM users;
-- Result: [{"id":1,"name":"Alice"}, {"id":2,"name":"Bob"}]
4. With ORDER BY:
SELECT jsonb_agg(name ORDER BY created_at DESC) FROM users;
-- Result: ["Charlie", "Bob", "Alice"]
Key Differences from json_agg
- jsonb_agg returns binary JSONB format (more efficient for operations)
- json_agg returns text JSON format
- Use jsonb_agg when you need to further manipulate the result with JSONB operators
Source
PostgreSQL Official Documentation: Aggregate Functions (PostgreSQL 9.5+)
The #- operator in PostgreSQL removes a key-value pair from a JSONB object by matching the key path. It takes a JSONB value on the left and a text array path on the right, returning a new JSONB value with the specified path removed.
Syntax:
jsonb #- text[]
Behavior:
- Removes the field or array element at the specified path
- Returns the modified JSONB value (original is unchanged)
- If the path doesn't exist, returns the original JSONB unchanged
- For nested paths, use an array like '{key1, key2, key3}'
- Array elements are referenced by index (0-based)
Examples:
-- Remove top-level key
SELECT '{"a": 1, "b": 2}'::jsonb #- '{a}';
-- Result: {"b": 2}
-- Remove nested key
SELECT '{"a": {"b": {"c": 1}}}'::jsonb #- '{a, b, c}';
-- Result: {"a": {"b": {}}}
-- Remove array element by index
SELECT '{"items": [1, 2, 3]}'::jsonb #- '{items, 1}';
-- Result: {"items": [1, 3]}
-- Non-existent path (no change)
SELECT '{"a": 1}'::jsonb #- '{b}';
-- Result: {"a": 1}
Version: Available in PostgreSQL 9.5+ (introduced with JSONB operators)
Source: PostgreSQL Official Documentation - JSON Functions and Operators
https://www.postgresql.org/docs/current/functions-json.html
The ?& operator in PostgreSQL checks if a JSONB object contains all of the specified keys in a text array.
Syntax:
jsonb_column ?& text_array
Return value: boolean - Returns true if the JSONB value contains all keys in the array, false otherwise.
Example:
-- Check if JSONB contains all specified keys
SELECT '{"name": "John", "age": 30, "city": "NYC"}'::jsonb ?& ARRAY['name', 'age'];
-- Returns: true
SELECT '{"name": "John", "age": 30}'::jsonb ?& ARRAY['name', 'city'];
-- Returns: false (missing 'city')
-- Common usage in WHERE clause
SELECT * FROM users WHERE metadata ?& ARRAY['email', 'phone'];
Key differences from related operators:
- ? - checks for a single key
- ?| - checks if any of the keys exist (OR logic)
- ?& - checks if all of the keys exist (AND logic)
Availability: PostgreSQL 9.4+ (introduced with JSONB type)
Index support: Can use GIN index with jsonb_path_ops or jsonb_ops operator class for performance.
Source: PostgreSQL Official Documentation - JSON Functions and Operators
json_build_object - PostgreSQL JSON Constructor Function
json_build_object is a PostgreSQL built-in function that constructs a JSON object from a variadic list of alternating keys and values.
Syntax
json_build_object(key1, value1, key2, value2, ...)
Key Characteristics
- Available since: PostgreSQL 9.4
- Return type: json (use jsonb_build_object for jsonb type)
- Parameters: Variadic list where odd positions are keys (coerced to text) and even positions are values
- Minimum arguments: 0 (returns empty JSON object {})
- Arguments must be even: Function requires an even number of arguments (key-value pairs)
Behavior
- Keys are automatically converted to text strings
- Values are converted to appropriate JSON types:
- NULL → JSON null
- boolean → JSON boolean
- numeric → JSON number
- text → JSON string
- arrays/composites → JSON arrays/objects
- Duplicate keys are allowed; all occurrences are preserved in the output
Examples
-- Basic usage
SELECT json_build_object('name', 'John', 'age', 30, 'active', true);
-- Result: {"name":"John","age":30,"active":true}
-- With NULL values
SELECT json_build_object('id', 1, 'email', NULL);
-- Result: {"id":1,"email":null}
-- From table columns
SELECT json_build_object('user_id', id, 'username', name)
FROM users
WHERE id = 5;
-- Empty object
SELECT json_build_object();
-- Result: {}
Source: PostgreSQL 17 Official Documentation - JSON Functions
PostgreSQL jsonb_each Function
jsonb_each is a set-returning function that expands a JSONB object's top-level key-value pairs into rows, where each row contains a key (text) and value (jsonb) column.
Signature
jsonb_each(jsonb) → SETOF record(key text, value jsonb)
Behavior
- Input: JSONB object (will error if given a JSONB array, null, or scalar)
- Output: One row per top-level key-value pair
- Column 1 (key): Text representation of the object key
- Column 2 (value): JSONB value (preserves original JSON type)
- Non-recursive: Only expands the top level of the object
Usage Example
SELECT * FROM jsonb_each('{"name":"Alice","age":30,"active":true}'::jsonb);
Result:
key | value
--------+---------
name | "Alice"
age | 30
active | true
Common Patterns
Filter by key:
SELECT value FROM jsonb_each('{"a":1,"b":2}'::jsonb) WHERE key = 'a';
-- Returns: 1
Convert values to text (use jsonb_each_text instead):
SELECT * FROM jsonb_each_text('{"name":"Alice","age":30}'::jsonb);
-- Returns: key=name, value=Alice (text); key=age, value=30 (text)
Version
Available since PostgreSQL 9.4 (when JSONB was introduced).
PostgreSQL to_jsonb() Function
to_jsonb() converts any SQL value into a JSONB (binary JSON) value. Available in PostgreSQL 9.4+.
Syntax
to_jsonb(anyelement) → jsonb
Behavior
- Primitives: Converts text, numbers, booleans to JSON equivalents
- NULL: Converts to JSON null
- Composite types/rows: Converts to JSON objects with column names as keys
- Arrays: Converts to JSON arrays
- Records: Converts to JSON objects
Key Characteristics
- Returns JSONB (binary format), not JSON text
- JSONB stores data in decomposed binary form for faster processing
- Whitespace is not preserved (normalized)
- Object key order is not preserved (sorted)
- Duplicate keys are eliminated (last value wins)
Examples
-- Primitive values
SELECT to_jsonb(42); -- 42
SELECT to_jsonb('hello'::text); -- "hello"
SELECT to_jsonb(true); -- true
-- Arrays
SELECT to_jsonb(ARRAY[1,2,3]); -- [1, 2, 3]
-- Row/composite type
SELECT to_jsonb(row(1, 'foo')); -- {"f1": 1, "f2": "foo"}
-- Table rows
SELECT to_jsonb(users.*) FROM users LIMIT 1;
-- {"id": 1, "name": "Alice", "email": "[email protected]"}
Common Use Cases
- Converting query results to JSON for APIs
- Building JSON aggregates with jsonb_agg()
- Storing structured data in JSONB columns
Related Functions
- to_json(): Returns text JSON (not binary)
- row_to_json(): Specifically for row types (use to_jsonb() instead in 9.4+)
- jsonb_build_object(): Construct JSONB from key-value pairs
Source: PostgreSQL 16 Official Documentation - JSON Functions and Operators
PostgreSQL JSON Operators: -> vs ->>
-> returns JSON/JSONB (preserves type), ->> returns TEXT (converts to string).
Exact Behavior
- ->: Extracts JSON object field or array element as JSON/JSONB type
- ->>: Extracts JSON object field or array element as TEXT type
Code Examples
-- Sample data
SELECT '{"name": "Alice", "age": 30}'::jsonb AS data;
-- Using -> returns JSONB (can be chained)
SELECT '{"user": {"name": "Alice"}}'::jsonb -> 'user' -> 'name';
-- Result: "Alice" (as JSONB, with quotes)
-- Using ->> returns TEXT (cannot be chained with JSON operators)
SELECT '{"user": {"name": "Alice"}}'::jsonb -> 'user' ->> 'name';
-- Result: Alice (as TEXT, without quotes)
-- Array access
SELECT '[1, 2, 3]'::jsonb -> 2; -- Result: 3 (as JSONB)
SELECT '[1, 2, 3]'::jsonb ->> 2; -- Result: 3 (as TEXT)
-- Type difference matters for operations
SELECT '{"count": 5}'::jsonb -> 'count'; -- Returns: 5 (JSONB number)
SELECT '{"count": 5}'::jsonb ->> 'count'; -- Returns: '5' (TEXT)
When to Use Each
- Use -> when: Chaining multiple extractions, or you need to preserve the JSON type for further JSON operations
- Use ->> when: Extracting a final value for display, comparison with text, or casting to specific types
Compatibility
Available for json since PostgreSQL 9.3 and for jsonb since PostgreSQL 9.4.
Source: PostgreSQL JSON Functions and Operators Documentation
Deleting a Key from JSONB in PostgreSQL
Use the - operator to delete a key from a JSONB object:
-- Delete a single top-level key
UPDATE table_name
SET jsonb_column = jsonb_column - 'key_to_delete'
WHERE condition;
-- Example: Delete 'email' key
UPDATE users
SET data = data - 'email'
WHERE id = 1;
Deleting Nested Keys
For nested keys, use the #- operator with a text array path (PostgreSQL 9.5+):
-- Delete nested key at path
UPDATE table_name
SET jsonb_column = jsonb_column #- '{parent,child,key_to_delete}'
WHERE condition;
-- Example: Delete 'address.city'
UPDATE users
SET data = data #- '{address,city}'
WHERE id = 1;
Deleting Multiple Keys
Delete multiple top-level keys using the - operator with a text array (PostgreSQL 9.5+):
UPDATE table_name
SET jsonb_column = jsonb_column - ARRAY['key1', 'key2', 'key3']
WHERE condition;
Deleting Array Elements
To delete by array index, use - with an integer:
-- Delete element at index 2 (0-based)
UPDATE table_name
SET jsonb_column = jsonb_column - 2
WHERE condition;
Minimum Version: PostgreSQL 9.5 for - operator; 9.5+ for #- operator.
PostgreSQL JSONB -> Operator
The -> operator extracts a JSON object field or array element from a JSONB column and returns the result as JSONB (not as text).
Syntax
jsonb_column -> key -- Extract object field by text key
jsonb_column -> index -- Extract array element by integer index
Key Behavior
- Returns: JSONB type (preserves JSON structure)
- Object fields: Use a text key (e.g., 'name')
- Array elements: Use a zero-based integer index (e.g., 0, 1, 2)
- Non-existent keys/indices: Returns NULL
- Type preservation: Nested objects/arrays remain JSONB for further chaining
Examples
-- Extract object field
SELECT '{"name": "Alice", "age": 30}'::jsonb -> 'name';
-- Result: "Alice" (as JSONB)
-- Extract array element
SELECT '["red", "green", "blue"]'::jsonb -> 1;
-- Result: "green" (as JSONB)
-- Chain multiple extractions
SELECT '{"user": {"address": {"city": "NYC"}}}'::jsonb -> 'user' -> 'address' -> 'city';
-- Result: "NYC" (as JSONB)
-- Non-existent key
SELECT '{"name": "Alice"}'::jsonb -> 'email';
-- Result: NULL
Compare with ->> Operator
- -> returns JSONB (chainable, preserves structure)
- ->> returns TEXT (terminal operation, converts to string)
SELECT data -> 'price' FROM products; -- Returns: 29.99 (JSONB number)
SELECT data ->> 'price' FROM products; -- Returns: '29.99' (TEXT)
Version Compatibility
Available in PostgreSQL 9.4+ (when JSONB type was introduced).
Source
PostgreSQL Official Documentation: JSON Functions and Operators
How to Insert JSON Data in PostgreSQL
PostgreSQL supports two JSON data types: json (text-based storage) and jsonb (binary, indexed, recommended). Use jsonb for better performance and indexing capabilities.
Direct JSON Insert
-- Create table with jsonb column
CREATE TABLE users (
id SERIAL PRIMARY KEY,
data jsonb
);
-- Insert JSON using single quotes and type cast
INSERT INTO users (data) VALUES ('{"name": "John", "age": 30}'::jsonb);
-- Alternative: explicit CAST
INSERT INTO users (data) VALUES (CAST('{"name": "Jane", "age": 25}' AS jsonb));
Insert from Application Parameters
Using parameterized queries (prevents SQL injection):
-- PostgreSQL expects JSON as a string parameter
INSERT INTO users (data) VALUES ($1::jsonb);
In application code, pass the JSON as a string:
- Node.js (pg): client.query('INSERT INTO users (data) VALUES ($1)', [JSON.stringify(obj)])
- Python (psycopg2): cursor.execute("INSERT INTO users (data) VALUES (%s)", (json.dumps(obj),))
- Go (lib/pq): db.Exec("INSERT INTO users (data) VALUES ($1)", jsonBytes)
Key Requirements
- Always use jsonb over json unless you need exact whitespace/key-order preservation
- Use single quotes for string literals in SQL (double quotes are for identifiers)
- Maximum size: 1GB per jsonb value (PostgreSQL TOAST limit)
PostgreSQL Version: 9.4+ for jsonb (recommended: 12+)
Querying Nested JSONB in PostgreSQL
PostgreSQL provides four operators to query nested JSONB structures:
Operators
- -> - Returns a JSONB object/array element (result is JSONB)
- ->> - Returns a JSONB object/array element as TEXT
- #> - Returns the JSONB value at the specified path (array of keys)
- #>> - Returns the JSONB value at the specified path as TEXT
Code Examples
-- Sample data
CREATE TABLE users (
id SERIAL PRIMARY KEY,
data JSONB
);
INSERT INTO users (data) VALUES
('{"name": "Alice", "address": {"city": "NYC", "zip": "10001"}}'),
('{"name": "Bob", "contacts": {"emails": ["[email protected]", "[email protected]"]}}');
-- Access nested object (returns JSONB)
SELECT data -> 'address' -> 'city' FROM users;
-- Result: "NYC"
-- Access nested object as text
SELECT data -> 'address' ->> 'city' FROM users;
-- Result: NYC
-- Access deeply nested using path (returns JSONB)
SELECT data #> '{address,city}' FROM users;
-- Result: "NYC"
-- Access deeply nested as text
SELECT data #>> '{address,city}' FROM users;
-- Result: NYC
-- Access array element (0-indexed)
SELECT data -> 'contacts' -> 'emails' -> 0 FROM users WHERE id = 2;
-- Result: "[email protected]"
-- Query with WHERE clause
SELECT * FROM users WHERE data -> 'address' ->> 'city' = 'NYC';
-- Check if nested key exists
SELECT * FROM users WHERE data -> 'address' ? 'zip';
-- Check if nested path exists (PostgreSQL 12+)
SELECT * FROM users WHERE data @? '$.address.city';
Key Rules
- Chain -> operators for each nesting level when keys are known at query time
- Use #> with array syntax '{key1,key2,key3}' for dynamic or deep paths
- Use ->> or #>> only at the final level, when you need TEXT output for comparisons or display
- Array indexing is 0-based, not 1-based
Version: Available in PostgreSQL 9.4+. JSONPath syntax (@?) requires PostgreSQL 12+.
PostgreSQL JSONB <@ Operator
The <@ operator checks if the left JSONB value is contained within the right JSONB value. Returns true if all key-value pairs in the left operand exist in the right operand.
Syntax:
jsonb_left <@ jsonb_right
Containment Rules:
- For objects: All keys and values in left must exist in right (right can have additional keys)
- For arrays: All elements in left must exist in right (order doesn't matter, right can have additional elements)
- For scalars: Left must equal right
Examples:
-- Object containment (returns true)
SELECT '{"a": 1}'::jsonb <@ '{"a": 1, "b": 2}'::jsonb;
-- Array containment (returns true)
SELECT '[1, 2]'::jsonb <@ '[1, 2, 3]'::jsonb;
-- Not contained (returns false)
SELECT '{"a": 1, "c": 3}'::jsonb <@ '{"a": 1, "b": 2}'::jsonb;
-- Scalar match (returns true)
SELECT '"foo"'::jsonb <@ '"foo"'::jsonb;
Performance: The <@ operator uses GIN indexes. Create an index for optimal performance:
CREATE INDEX idx_data_gin ON table_name USING GIN (jsonb_column);
Available Since: PostgreSQL 9.4 (when JSONB was introduced)
Inverse Operator: @> (contains) - checks if left contains right
Source: PostgreSQL Official Documentation - JSON Functions and Operators
jsonb_typeof() Function in PostgreSQL
jsonb_typeof() returns the type of the top-level JSON value as a text string.
Return Values
The function returns one of these exact strings:
"object"- for JSON objects{}"array"- for JSON arrays[]"string"- for JSON strings"number"- for JSON numbers (integer or decimal)"boolean"- for JSON booleans"null"- for JSON null values
Syntax
jsonb_typeof(jsonb_value) → text
Code Examples
-- Returns "object"
SELECT jsonb_typeof('{"name": "Alice"}'::jsonb);
-- Returns "array"
SELECT jsonb_typeof('[1, 2, 3]'::jsonb);
-- Returns "string"
SELECT jsonb_typeof('"hello"'::jsonb);
-- Returns "number"
SELECT jsonb_typeof('42'::jsonb);
-- Returns "boolean"
SELECT jsonb_typeof('true'::jsonb);
-- Returns "null"
SELECT jsonb_typeof('null'::jsonb);
-- Nested access with ->
SELECT jsonb_typeof('{"user": {"age": 30}}'::jsonb -> 'user'); -- Returns "object"
SELECT jsonb_typeof('{"user": {"age": 30}}'::jsonb -> 'user' -> 'age'); -- Returns "number"
Availability
Available in PostgreSQL 9.4+ (when jsonb type was introduced).
Related Function
json_typeof() - identical function for the json type (not jsonb).
PostgreSQL JSONB ?| Operator
The ?| operator tests whether any of the specified keys/elements exist in a JSONB value. It returns true if at least one key from the provided array exists at the top level of the JSONB object.
Syntax
jsonb_column ?| text[]
Parameters
- Left operand: JSONB column or value
- Right operand: Array of text strings (keys to check)
Return Value
Returns boolean:
- true if ANY key from the array exists in the JSONB
- false if NONE of the keys exist
- NULL if either operand is NULL
Examples
-- Returns true (has 'name')
SELECT '{"name": "John", "age": 30}'::jsonb ?| ARRAY['name', 'email'];
-- Returns false (has neither 'email' nor 'phone')
SELECT '{"name": "John", "age": 30}'::jsonb ?| ARRAY['email', 'phone'];
-- Returns true (has 'age')
SELECT '{"name": "John", "age": 30}'::jsonb ?| ARRAY['age', 'address', 'phone'];
Important Notes
- Top-level only: Only checks keys at the top level, not nested keys
- Key existence only: Checks key presence regardless of value (even if the value is null)
- Case-sensitive: Key matching is case-sensitive
Version
Available in PostgreSQL 9.4+ (when JSONB was introduced)
Related Operators
- ? - checks if a single key exists
- ?& - checks if ALL specified keys exist (AND logic vs. ?| OR logic)
Source
PostgreSQL Official Documentation: JSON Functions and Operators
Querying a Specific Key from JSONB in PostgreSQL
Use the -> operator to extract a JSON object/array or ->> operator to extract as text.
Syntax:
- -> returns JSONB (preserves JSON type)
- ->> returns TEXT (extracts value as string)
Code Examples:
-- Extract nested JSONB object
SELECT data -> 'user' FROM users;
-- Extract as text (most common for WHERE clauses)
SELECT data ->> 'email' FROM users;
-- Nested key access
SELECT data -> 'user' ->> 'name' FROM users;
-- In WHERE clause
SELECT * FROM users WHERE data ->> 'status' = 'active';
-- Array element access (0-indexed)
SELECT data -> 'tags' -> 0 FROM users;
Path-based extraction for deep nesting:
-- #> returns JSONB, #>> returns TEXT
SELECT data #> '{user,address,city}' FROM users;
SELECT data #>> '{user,address,city}' FROM users;
Performance:
- Use a GIN index for JSONB columns when filtering: CREATE INDEX idx_data ON users USING GIN (data);
- For specific keys, use an expression index: CREATE INDEX idx_email ON users ((data ->> 'email'));
Compatibility: Available since PostgreSQL 9.4 (JSONB introduction).
Source: PostgreSQL Official Documentation - JSON Functions and Operators (https://www.postgresql.org/docs/current/functions-json.html)
PostgreSQL: Aggregate Rows into a JSON Array
Use json_agg() or jsonb_agg() to aggregate rows into a JSON array.
Basic Syntax
SELECT json_agg(column_name) FROM table_name;
Functions
- json_agg(expression) - Returns a JSON array (text-based)
- jsonb_agg(expression) - Returns a JSONB array (binary, faster for processing)
json_agg() is available since PostgreSQL 9.3; jsonb_agg() since PostgreSQL 9.5.
Examples
Aggregate a single column:
SELECT json_agg(name) FROM users;
-- Result: ["Alice", "Bob", "Carol"]
Aggregate entire rows as JSON objects:
SELECT json_agg(users) FROM users;
-- Result: [{"id":1,"name":"Alice"}, {"id":2,"name":"Bob"}]
Aggregate specific fields into objects:
SELECT json_agg(json_build_object('id', id, 'name', name)) FROM users;
-- Result: [{"id":1,"name":"Alice"}, {"id":2,"name":"Bob"}]
With GROUP BY:
SELECT department, json_agg(name)
FROM employees
GROUP BY department;
-- Returns one array per department
Order elements within the array:
SELECT json_agg(name ORDER BY created_at DESC) FROM users;
Key Differences
- Use jsonb_agg() for data you'll query/process further (supports indexing, operators)
- Use json_agg() for data you'll immediately output as text
- Both return NULL for empty input sets (not an empty array)
Handling NULLs
NULL values are included in the array as null:
SELECT json_agg(email) FROM users WHERE email IS NULL;
-- Result: [null, null] (one null element per matching row)
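Since both aggregates return NULL for an empty input set, wrapping the call in COALESCE is a common way to get an empty array instead. A minimal sketch against the hypothetical users table used in the examples above:
-- No matching rows: json_agg returns NULL, COALESCE substitutes an empty array
SELECT COALESCE(json_agg(name), '[]'::json) AS names
FROM users
WHERE false;
-- Result: []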
Source: PostgreSQL 17 Official Documentation - Aggregate Functions (9.21)
Use jsonb_set() to update a specific key in a JSONB column in PostgreSQL.
Syntax:
jsonb_set(target jsonb, path text[], new_value jsonb [, create_if_missing boolean])
Parameters:
- target: The JSONB column/value to modify
- path: Text array specifying the key location (e.g., '{key}' for top-level, '{parent,child}' for nested)
- new_value: The new JSONB value (must be valid JSONB; use to_jsonb() or quoted literals)
- create_if_missing: Optional boolean (default true); creates the key if it doesn't exist
Example - Update top-level key:
UPDATE users
SET data = jsonb_set(data, '{email}', '"[email protected]"')
WHERE id = 1;
Example - Update nested key:
UPDATE users
SET data = jsonb_set(data, '{address,city}', '"San Francisco"')
WHERE id = 1;
Example - Update with dynamic value:
UPDATE products
SET attributes = jsonb_set(attributes, '{price}', to_jsonb(99.99))
WHERE id = 100;
Alternative - Concatenation operator (||) for top-level keys:
UPDATE users
SET data = data || '{"email": "[email protected]"}'
WHERE id = 1;
Modern approach - Subscripting (PostgreSQL 14+):
UPDATE users
SET data['email'] = '"[email protected]"'
WHERE id = 1;
Important: The new_value parameter must be valid JSONB. Use double quotes for string literals within single quotes ('"string"') or use to_jsonb() for non-string values.
The ? operator in PostgreSQL checks if a single text string exists as a top-level key in a JSONB object or as an element in a JSONB array.
Syntax:
jsonb_column ? 'key_name'
Behavior:
- For JSONB objects: Returns true if the string exists as a top-level key
- For JSONB arrays: Returns true if the string exists as an array element
- Case-sensitive matching
- Only checks top-level keys/elements (not nested)
Examples:
-- Check for key in object
SELECT '{"name": "John", "age": 30}'::jsonb ? 'name'; -- true
SELECT '{"name": "John", "age": 30}'::jsonb ? 'email'; -- false
-- Check for element in array
SELECT '["apple", "banana", "cherry"]'::jsonb ? 'banana'; -- true
SELECT '["apple", "banana", "cherry"]'::jsonb ? 'orange'; -- false
-- Nested keys are NOT matched
SELECT '{"user": {"name": "John"}}'::jsonb ? 'name'; -- false (name is nested)
Related operators:
- ?| - checks if ANY of the array keys exist (OR logic)
- ?& - checks if ALL of the array keys exist (AND logic)
Available since: PostgreSQL 9.4 (when JSONB was introduced)
Source: PostgreSQL official documentation - JSONB operators
https://www.postgresql.org/docs/current/functions-json.html
PostgreSQL json_agg Function
json_agg is a PostgreSQL aggregate function that combines multiple rows into a single JSON array. Available since PostgreSQL 9.3.
Syntax
json_agg(expression [ORDER BY ...])
What It Does
- Takes values from multiple rows and aggregates them into a JSON array
- Returns type json
- NULL input values are included as JSON null in the array
- Returns NULL if there are no input rows (not an empty array)
Basic Examples
Aggregate primitive values:
SELECT json_agg(name) FROM users;
-- Result: ["Alice", "Bob", "Charlie"]
Aggregate row objects:
SELECT json_agg(users) FROM users;
-- Result: [{"id":1,"name":"Alice"}, {"id":2,"name":"Bob"}]
Build custom objects:
SELECT json_agg(json_build_object('id', id, 'name', name))
FROM users;
-- Result: [{"id":1,"name":"Alice"}, {"id":2,"name":"Bob"}]
With ORDER BY:
SELECT json_agg(name ORDER BY created_at DESC) FROM users;
-- Result: Ordered array ["Charlie", "Bob", "Alice"]
With GROUP BY:
SELECT department, json_agg(employee_name)
FROM employees
GROUP BY department;
-- Result: Multiple rows, each with department and its employees array
Key Behavior
- For jsonb output, use jsonb_agg instead (available since PostgreSQL 9.5)
- Empty result set returns NULL, not []. Use COALESCE(json_agg(...), '[]'::json) to get an empty array
- Does not automatically deduplicate values
Source: PostgreSQL Official Documentation for Aggregate Functions (versions 9.3+)
PostgreSQL 17 JSON_TABLE Function
JSON_TABLE converts JSON data into a relational table format, allowing you to query JSON as if it were regular SQL rows and columns.
Basic Syntax
JSON_TABLE(
json_expression,
json_path_expression
COLUMNS (
column_name type PATH json_path [DEFAULT value ON EMPTY] [DEFAULT value ON ERROR]
)
)
Examples
-- Basic usage: extract array elements as rows
SELECT * FROM JSON_TABLE(
'[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'::jsonb,
'$[*]'
COLUMNS (
name TEXT PATH '$.name',
age INTEGER PATH '$.age'
)
) AS jt;
-- Result:
-- name | age
-- Alice | 30
-- Bob | 25
-- Nested data with error handling
SELECT * FROM JSON_TABLE(
'{"users": [{"id": 1, "profile": {"email": "[email protected]"}}]}'::jsonb,
'$.users[*]'
COLUMNS (
user_id INTEGER PATH '$.id',
email TEXT PATH '$.profile.email' DEFAULT 'unknown' ON EMPTY
)
) AS users;
-- Use with actual table data
SELECT o.order_id, items.*
FROM orders o,
JSON_TABLE(
o.line_items,
'$[*]'
COLUMNS (
product_id INTEGER PATH '$.product_id',
quantity INTEGER PATH '$.qty',
price NUMERIC PATH '$.price'
)
) AS items;
Column Options
| Clause | Purpose |
|---|---|
| PATH | JSON path to extract value |
| DEFAULT ... ON EMPTY | Value when path returns nothing |
| DEFAULT ... ON ERROR | Value when extraction fails |
| FOR ORDINALITY | Row number counter |
| EXISTS | Boolean: does path exist? |
Related SQL/JSON Functions (PG17)
-- JSON_EXISTS: check if path exists
SELECT JSON_EXISTS('{"a": 1}'::jsonb, '$.a'); -- true
-- JSON_QUERY: extract JSON fragment
SELECT JSON_QUERY('{"a": {"b": 1}}'::jsonb, '$.a'); -- {"b": 1}
-- JSON_VALUE: extract scalar value
SELECT JSON_VALUE('{"name": "test"}'::jsonb, '$.name'); -- test
Source: PostgreSQL 17 Documentation - JSON Functions
https://www.postgresql.org/docs/17/functions-json.html
jsonb_array_length()
jsonb_array_length(jsonb) returns the number of elements in the outermost JSON array as an integer.
Syntax
jsonb_array_length(target jsonb) → integer
Behavior
- Returns the count of top-level array elements
- Raises error 22023 ("cannot get array length of a non-array") if the input is not a JSON array
- Nested arrays count as single elements
- Returns 0 for empty arrays []
Examples
-- Basic usage
SELECT jsonb_array_length('[1,2,3,4,5]');
-- Returns: 5
-- Empty array
SELECT jsonb_array_length('[]');
-- Returns: 0
-- Nested arrays count as single elements
SELECT jsonb_array_length('[[1,2],[3,4],[5,6]]');
-- Returns: 3
-- Error on non-array
SELECT jsonb_array_length('{"a":1}');
-- ERROR: cannot get array length of a non-array
Related Functions
- json_array_length(json) - identical function operating on the json type (returns the same results)
- jsonb_array_elements(jsonb) - expands an array to rows
Source: PostgreSQL Official Documentation (available since PostgreSQL 9.4+, current through PostgreSQL 17)
jsonb_object_keys function in PostgreSQL
jsonb_object_keys(jsonb) is a PostgreSQL function that returns the set of top-level keys from a JSONB object as a set of text values.
Syntax
jsonb_object_keys(target jsonb) → setof text
Behavior
- Returns only top-level keys (not nested keys)
- Returns keys as individual rows (set-returning function)
- Returns keys in storage order (not alphabetical or insertion order)
- Throws error if argument is not a JSONB object (e.g., if it's an array or scalar)
Example
SELECT jsonb_object_keys('{"name": "Alice", "age": 30, "address": {"city": "NYC"}}'::jsonb);
Result:
jsonb_object_keys
-------------------
name
age
address
(3 rows)
Key Details
- Available since: PostgreSQL 9.4 (when JSONB type was introduced)
- Return type: setof text (can be used in the FROM clause or with aggregate functions)
- Companion function: json_object_keys() exists for the json type (not JSONB)
Common Usage Patterns
-- Get array of keys
SELECT array_agg(jsonb_object_keys(data)) FROM table_name;
-- Check if specific key exists (prefer ? operator instead)
SELECT 'name' IN (SELECT jsonb_object_keys(data)) FROM table_name;
Source: PostgreSQL Official Documentation - JSON Functions and Operators (v9.4+)
jsonb_array_elements function in PostgreSQL
jsonb_array_elements(jsonb) is a PostgreSQL set-returning function that expands a JSON array into a set of rows, one row per array element. Each element is returned as type jsonb.
Function Signature:
jsonb_array_elements(jsonb) → SETOF jsonb
Behavior:
- Input: A JSONB array
- Output: Set of rows, each containing one array element as jsonb
- If input is NOT an array, raises an error: ERROR: cannot extract elements from a scalar/object
- If input is NULL, returns an empty set (0 rows)
- Preserves the original data type of each element (string, number, boolean, null, object, array)
Code Example:
-- Basic usage
SELECT jsonb_array_elements('[1, "text", true, {"key": "value"}]'::jsonb);
-- Returns 4 rows:
-- 1
-- "text"
-- true
-- {"key": "value"}
-- Common pattern: expand array and access properties
SELECT elem->>'name' AS name, elem->>'age' AS age
FROM jsonb_array_elements('[
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25}
]'::jsonb) AS elem;
-- Returns:
-- name | age
-- --------+-----
-- Alice | 30
-- Bob | 25
Related Functions:
- jsonb_array_elements_text(jsonb) - returns elements as text instead of jsonb
- json_array_elements(json) - same function for the json type
Availability: PostgreSQL 9.4+ (when JSONB was introduced)
Source: PostgreSQL 16 Official Documentation - JSON Functions
Parallel Query
29 questions
Yes. Unlike a regular Append node, a Parallel Append node can have both partial and non-partial child plans. Non-partial children will be scanned by only a single process to avoid duplicate results.
max_parallel_workers cannot exceed max_worker_processes. Parallel workers are taken from the pool of processes established by max_worker_processes, so if max_parallel_workers is set higher, it will have no effect.
No, parallel query execution was not enabled by default in PostgreSQL 9.6. To allow it, max_parallel_workers_per_gather had to be set to a value larger than zero.
Yes. A Gather Merge node indicates that each process executing the parallel portion of the plan is producing tuples in sorted order, and the leader performs an order-preserving merge.
force_parallel_mode was renamed to debug_parallel_query in PostgreSQL 16.
No. In a merge join, the inner side is always a non-parallel plan and is executed in full, which may be inefficient as the work is duplicated in every cooperating process.
Generally no, but there are exceptions: CREATE TABLE ... AS, SELECT INTO, CREATE MATERIALIZED VIEW, and REFRESH MATERIALIZED VIEW can use parallel plans for their underlying SELECT operations (available since PostgreSQL 11).
PARALLEL UNSAFE. All user-defined functions are assumed to be parallel unsafe unless otherwise marked.
No. Functions must be PARALLEL UNSAFE if they write to the database.
No. Functions must be PARALLEL RESTRICTED (at minimum) if they access cursors, since cursors involve backend-local state which the system cannot synchronize in parallel mode.
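As a hedged illustration of how these parallel-safety labels are applied (add_tax is a hypothetical function, not one mentioned above):
-- Mark a read-only function as safe to run in parallel workers;
-- unmarked functions default to PARALLEL UNSAFE.
CREATE FUNCTION add_tax(amount numeric) RETURNS numeric
    LANGUAGE sql IMMUTABLE PARALLEL SAFE
    AS 'SELECT amount * 1.08';
-- Existing functions can be relabeled later:
ALTER FUNCTION add_tax(numeric) PARALLEL RESTRICTED;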
Only btree indexes. Currently, parallel index scans are supported only for btree indexes.
Parallel hash join is the only join type where parallelism can occur on both sides. The inner side is a parallel hash that divides the work of building a shared hash table over the cooperating processes.
No. Parallel aggregation is not supported if any aggregate function call contains DISTINCT.
No. Parallel aggregation is not supported if any aggregate function call contains ORDER BY.
Resource limits such as work_mem are applied individually to each worker. Each parallel worker gets the full work_mem allocation, so a query with 4 workers and work_mem set to 32MB could use up to 160MB total ((4 workers + 1 leader) × 32MB).
No. The system must not be running in single-user mode for parallel query plans to be generated, since the entire database system is running as a single process in this situation and no background workers will be available.
Yes. The parallel_leader_participation parameter defaults to 'on', which allows the leader process to execute the query plan under Gather and Gather Merge nodes instead of only waiting for worker processes.
Foreign table scans are always parallel restricted unless the foreign data wrapper (FDW) specifies otherwise.
PostgreSQL 13 introduced parallel vacuum, which allows for parallel vacuuming of multiple indexes corresponding to a single table.
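A minimal sketch of requesting parallel index vacuuming explicitly (my_table is a hypothetical name); the worker count is still limited by max_parallel_maintenance_workers and the number of eligible indexes:
-- PostgreSQL 13+: use up to 4 workers to vacuum this table's indexes in parallel
VACUUM (PARALLEL 4, VERBOSE) my_table;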
Functions - Sequence
29 questions
NO CYCLE is the default. If neither CYCLE nor NO CYCLE is specified, NO CYCLE is assumed.
-9223372036854775808 to 9223372036854775807 (the range of an eight-byte integer).
The next nextval() call will advance the sequence before returning a value. For example, setval('seq', 100, true) means the next nextval() will return 101.
The sequence is automatically dropped. The OWNED BY option associates the sequence with a table column such that dropping the column or table automatically drops the sequence.
No, unless the name is double-quoted. The string will be converted to lowercase unless it contains double quotes. For example, nextval('FOO') operates on sequence 'foo', but nextval('"Foo"') operates on 'Foo'.
true. When using the two-parameter form setval(regclass, bigint), is_called defaults to true.
Every 32 calls. PostgreSQL uses a hard-coded constant SEQ_LOG_VALS = 32, so nextval() emits a WAL update every 32 calls, logging a future value 32 increments ahead.
No. Unlike unlogged tables, unlogged sequences have no similar performance advantage because sequences are already highly optimized in PostgreSQL.
Yes. When written as an unadorned literal string, it becomes a regclass constant (an OID), which tracks the originally identified sequence despite later renaming, schema reassignment, etc. This is called 'early binding'.
The next nextval() call will return exactly the specified value without advancing first. For example, setval('seq', 100, false) means the next nextval() will return 100.
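A small sketch, assuming a throwaway sequence named seq, showing how the is_called flag changes the next nextval() result:
CREATE SEQUENCE seq;
SELECT setval('seq', 100, true);   -- is_called = true
SELECT nextval('seq');             -- returns 101
SELECT setval('seq', 100, false);  -- is_called = false
SELECT nextval('seq');             -- returns 100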
Up to 32 values. Due to the SEQ_LOG_VALS constant, recovery will set the sequence to the last logged position, which could be up to 32 values behind.
The minimum value is 1 (only one value can be generated at a time, i.e., no cache), and this is also the default.
It wraps around to the minimum value. The next number generated will be the minvalue.
Write the sequence name as a string literal in single quotes, e.g., nextval('mysequence') or nextval('myschema.mysequence'). The regclass input converter handles the conversion automatically.
No. Changes made by setval() are immediately visible to other transactions and are not undone if the calling transaction rolls back.
They are automatically reset to their initial state. All progress (nextval calls) is lost.
No. When sequences are created manually (not via SERIAL), they don't have the automatic ownership association, so pg_get_serial_sequence() will not work.
How many sequence numbers are to be preallocated and stored in memory for faster access. Each session reserves this many values at once.
Data Types - Domain
29 questions
A domain is a user-defined data type that is based on another underlying type, and can optionally have constraints that restrict its valid values to a subset of what the underlying type would allow.
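For illustration, a minimal sketch (the email_address domain and subscribers table are hypothetical) showing a domain with a CHECK constraint, where VALUE refers to the value being tested:
CREATE DOMAIN email_address AS text
    CHECK (VALUE ~ '^[^@]+@[^@]+$');
CREATE TABLE subscribers (
    id    serial PRIMARY KEY,
    email email_address NOT NULL
);
INSERT INTO subscribers (email) VALUES ('alice@example.com');  -- accepted
INSERT INTO subscribers (email) VALUES ('not-an-email');       -- rejected by the domain CHECK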
PostgreSQL grants USAGE privilege for languages and data types (including domains) to PUBLIC by default, meaning all users can use domains and data types in their schema objects.
The default expression will be used in any insert operation that does not specify a value for the column. A column-level default overrides any domain default, and the domain default in turn overrides any default value associated with the underlying data type.
Values of the domain are allowed to be null by default. The NULL constraint clause is a PostgreSQL extension intended for compatibility with nonstandard SQL databases, and its use is discouraged in new applications.
typndims is the number of array dimensions for a domain over an array (that is, when typbasetype is an array type). It is zero for types other than domains over array types.
Domains use typtypmod to record the typmod to be applied to their base type. It is -1 if the base type does not use a typmod or if the type is not a domain.
No. The NOT VALID option is only accepted for CHECK constraints, not for other constraint types.
Yes. Assigning a value of the underlying type to a field or variable of the domain type is allowed without writing an explicit cast, but the domain's constraints will be checked.
Setting or dropping a default value only applies to subsequent INSERT commands; it does not affect rows already in a table using the domain.
Yes, but only if the underlying data type is collatable. The underlying type must support collation for you to specify a COLLATE clause.
The USAGE privilege allows use of the domain in the creation of tables, functions, and other schema objects. However, it does not control all 'usage' of the type, such as values of the type appearing in queries.
The built-in collatable data types are text, varchar, and char. Domains over these types are also collatable.
RESTRICT is the default behavior, which refuses to drop the constraint if there are any dependent objects. CASCADE will automatically drop objects that depend on the constraint.
The keyword VALUE must be used to refer to the value being tested in a domain's CHECK constraint expression.
If IF EXISTS is specified and the constraint does not exist, no error is thrown. Instead, a notice is issued.
Casting to or from a domain uses the casts associated with its underlying type. When casting to a domain, constraints are checked during the type conversion.
No. A CHECK expression cannot contain subqueries or refer to variables other than VALUE.
Yes. PostgreSQL fully supports creating multidimensional arrays of domain types. Arrays of any built-in or user-defined base type, enum type, composite type, range type, or domain can be created.
ALTER DOMAIN ADD CONSTRAINT, ALTER DOMAIN VALIDATE CONSTRAINT, and ALTER DOMAIN SET NOT NULL will fail if the named domain or any derived domain is used within a container-type column (a composite, array, or range column) in any table in the database.
The typtype column has the value 'd' for domain types. Other values include 'b' for base type, 'c' for composite type, 'e' for enum, 'p' for pseudo-type, and 'r' for range type.
No. The NOT NULL syntax in CREATE DOMAIN is a PostgreSQL extension. The standard-conforming way to write the same for non-composite data types would be CHECK (VALUE IS NOT NULL).
RESTRICT is the default behavior, which refuses to drop the domain if any objects depend on it. CASCADE will automatically remove dependent objects (such as table columns using the domain) and their dependents recursively.
When a domain has multiple CHECK constraints, they will be tested in alphabetical order by name. This behavior was introduced in PostgreSQL 9.5.
Domains can be based on any built-in or user-defined base type, enum type, array type, composite type, range type, or another domain.
When a new constraint is added to a domain, all columns using that domain will be checked against the newly added constraint. These checks can be suppressed by adding the new constraint using the NOT VALID option.
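A hedged sketch of that workflow, reusing the hypothetical email_address domain from the earlier example:
-- Add a stricter constraint without re-checking existing column data
ALTER DOMAIN email_address
    ADD CONSTRAINT must_have_dot CHECK (VALUE LIKE '%.%') NOT VALID;
-- Later, after fixing any non-conforming rows, validate it
ALTER DOMAIN email_address VALIDATE CONSTRAINT must_have_dot;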
The view contains eight columns: constraint_catalog, constraint_schema, constraint_name, domain_catalog, domain_schema, domain_name, is_deferrable, and initially_deferred.
typnotnull represents a not-null constraint on a type. It is used for domains only.
Yes. A domain can be based on another domain, allowing for a hierarchy of domain definitions with progressively refined constraints.
Internals > TOAST
29 questions
TOAST tables are named with the pattern pg_toast_<OID>, where <OID> is the OID of the owning table's pg_class entry.
Use ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION compression_method; where compression_method is either pglz or lz4.
The total size of an on-disk TOAST pointer datum is 18 bytes regardless of the actual size of the represented value.
By default TOAST_MAX_CHUNK_SIZE is chosen so that four chunk rows will fit on a page, making it about 2000 bytes.
LZ4 TOAST compression was introduced in PostgreSQL version 14.
The two subtypes are indirect pointers (which reference non-indirect varlena values in memory) and expanded pointers (which support complex data types and are subdivided into read-write and read-only variants).
Values shorter than 127 bytes can use a single-byte header instead of the normal four-byte header. This alternative supports space-efficient storage of values shorter than 127 bytes.
A TOAST table has a unique index on chunk_id and chunk_seq, which provides fast retrieval of the values.
Values with single-byte headers aren't aligned on any particular boundary. This omission of alignment padding provides additional space savings that is significant compared to short values.
Query pg_class.reltoastrelid: SELECT oid, reltoastrelid::regclass FROM pg_class WHERE relname = 'your_table_name'; The reltoastrelid entry stores the OID of the associated TOAST table.
No, the TOAST management code ensures that no in-memory TOAST pointer datum can accidentally get stored on disk. In-memory TOAST pointers are automatically expanded to normal in-line varlena values before storage.
If column compression is active, pg_column_size() will return the compressed size, not the original uncompressed size.
No, existing data will not be rewritten when you alter the compression method, but any new data will be compressed with the new compression method.
The pointed-to representation is the same either way, but a function that receives a read-write pointer is allowed to modify the referenced value in-place, whereas one that receives a read-only pointer must not.
EXTERNAL allows out-of-line storage but not compression. Use of EXTERNAL will make substring operations on wide text and bytea columns faster because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed.
For EXTENDED strategy, compression will be attempted first, then out-of-line storage if the row is still too big.
LZ4 requires only that the compressed data be no larger than uncompressed data (0% minimum reduction).
The big values of TOASTed attributes will only be pulled out (if selected at all) at the time the result set is sent to the client.
The default_toast_compression parameter has a default value of pglz.
Every TOAST table has the columns chunk_id (an OID identifying the particular TOASTed value), chunk_seq (a sequence number for the chunk within its value), and chunk_data (the actual data of the chunk).
PostgreSQL uses a fixed page size commonly of 8 kB, and does not allow tuples to span multiple pages.
The four TOAST storage strategies are: PLAIN (prevents compression or out-of-line storage), EXTENDED (allows both compression and out-of-line storage, the default), EXTERNAL (allows out-of-line storage but not compression), and MAIN (allows compression but not out-of-line storage).
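As a sketch of switching a column's strategy (documents and body are hypothetical names):
-- Store large values out of line but uncompressed, so substring() can fetch
-- only the needed chunks
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;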
Use pg_column_compression(column_name) which returns the compression algorithm used (e.g., 'pglz' or 'lz4'). This function was added in PostgreSQL 14.
During an UPDATE operation, values of unchanged fields are normally preserved as-is; so an UPDATE of a row with out-of-line values incurs no TOAST costs if none of the out-of-line values change.
TOAST usurps two bits of the varlena length word (the high-order bits on big-endian machines, the low-order bits on little-endian machines), thereby limiting the logical size to 1 GB. With 2 bits taken from the 32-bit length word, only 30 bits remain for storing the actual size, giving 2³⁰ = 1 GB.
PostgreSQL 14 and later support two compression methods for TOAST: pglz (the default) and lz4 (if PostgreSQL was compiled with --with-lz4).
Use ALTER TABLE table_name SET (toast_tuple_target = N); where N is the target size in bytes.
Values with four-byte headers are aligned on at least a four-byte boundary.
Transactions > Isolation Level Behaviors
29 questions
Yes, serialization anomalies are possible in REPEATABLE READ isolation level. Only SERIALIZABLE isolation level prevents serialization anomalies.
PostgreSQL implements only three distinct isolation levels internally, even though the SQL standard defines four levels.
Deadlock failures return SQLSTATE code 40P01 (deadlock_detected).
Predicate locks in SERIALIZABLE transactions always use the SIReadLock mode, which appears in the pg_locks system view.
No, predicate locks in PostgreSQL do not cause any blocking and therefore cannot play any part in causing a deadlock. They are used only to identify and flag dependencies among concurrent SERIALIZABLE transactions.
Yes, non-repeatable reads are possible in READ COMMITTED isolation level. Successive SELECT commands within the same transaction may see different data if other transactions commit changes between queries.
No, you cannot change the isolation level after the first query or data-modification statement executes in a transaction.
Yes, phantom reads are possible in READ COMMITTED isolation level.
No, successive SELECT commands within a single REPEATABLE READ transaction see the same data. They do not see changes made by other transactions that committed after their own transaction started.
This error only occurs when using the REPEATABLE READ or SERIALIZABLE isolation level, not with the default READ COMMITTED isolation level.
In READ COMMITTED isolation level, a SELECT query sees a snapshot of the database as of the instant the query begins to run. Each command within the transaction gets a new snapshot.
The transaction will block until it can establish a snapshot that is guaranteed to be free from serialization conflicts. It waits until it can acquire a conflict-free snapshot before starting to read any data.
No, dirty reads are prevented in READ COMMITTED isolation level. A SELECT query sees only data committed before the query began.
PostgreSQL's READ UNCOMMITTED mode behaves exactly like READ COMMITTED. This is because it is the only sensible way to map the standard isolation levels to PostgreSQL's multiversion concurrency control architecture.
UPDATE, DELETE, MERGE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the transaction start time.
It is important to retry the complete transaction, including all logic that decides which SQL to issue and/or which values to use. You cannot just retry the failed statement.
You can specify the desired transaction_modes directly in BEGIN or START TRANSACTION, for example: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
Applications should always retry SQLSTATE 40001 (serialization_failure) and 40P01 (deadlock_detected) unconditionally.
SET TRANSACTION can only be executed at the start of a transaction, before the first query or data-modification statement (SELECT, INSERT, DELETE, UPDATE, MERGE, FETCH, or COPY) of the transaction.
True SERIALIZABLE isolation using the SSI algorithm was first released in PostgreSQL 9.1.
PostgreSQL's REPEATABLE READ isolation level is implemented using a technique known in academic database literature and in some other database products as Snapshot Isolation.
No, PostgreSQL does not offer an automatic retry facility, since it cannot do so with any guarantee of correctness.
Both REPEATABLE READ and SERIALIZABLE isolation levels can produce errors designed to prevent serialization anomalies.
If the first updater commits (and actually updated or deleted the row, not just locked it) then the REPEATABLE READ transaction will be rolled back with the message 'ERROR: could not serialize access due to concurrent update'.
The transaction must be SERIALIZABLE, READ ONLY, and DEFERRABLE. All three conditions must be present together for DEFERRABLE to work.
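For example, the three properties can be combined directly in BEGIN (the report query is illustrative):
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
SELECT count(*) FROM accounts;  -- long-running report runs without risk of serialization failure
COMMIT;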
If the first updater rolls back, then its effects are negated and the REPEATABLE READ transaction can proceed with updating the originally found row.
SQLSTATE 23505 (unique_violation) and 23P01 (exclusion_violation) might be appropriate to retry in some cases, as they may represent transient failures when application logic selects new keys concurrently. However, they might also represent persistent error conditions requiring application-level judgment.
No, SELECT FOR UPDATE does not prevent other transactions from reading the locked rows. Row-level locks do not affect data querying; they block only writers and lockers to the same row.
PostgreSQL uses multiversion concurrency control (MVCC), which means that each SQL statement sees a snapshot of data as it was some time ago, regardless of the current state of the underlying data.
Server Configuration > Vacuuming Settings
29 questions
0, which disables the cost-based vacuum delay feature. If specified without units, it is taken as milliseconds.
PostgreSQL 12. The default was 20ms in PostgreSQL 11 and earlier, then changed to 2ms in PostgreSQL 12 and later.
0 to 2000000000 (2 billion), though VACUUM will limit the effective value to 95% of autovacuum_freeze_max_age.
100000000 (100 million tuples). If -1 is specified, autovacuum will not enforce a maximum.
Yes, autovacuuming can be disabled for individual tables by changing table storage parameters.
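A minimal sketch, using a hypothetical table name:
-- Disable autovacuum for this table only; anti-wraparound autovacuum can still run
ALTER TABLE big_audit_log SET (autovacuum_enabled = false);
-- Re-enable later
ALTER TABLE big_audit_log RESET (autovacuum_enabled);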
2 ms (2 milliseconds). If -1 is specified, the regular vacuum_cost_delay value is used. If specified without units, it is taken as milliseconds.
analyze threshold = autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * number of tuples
Minimum: 128 kB, Maximum: 16 GB. If the specified size would exceed 1/8 the size of shared_buffers, it is silently capped to that value.
1 min (1 minute). If specified without units, it is taken as seconds.
16 (or less if kernel settings will not support it as determined during initdb). This parameter was introduced in PostgreSQL 18.
If -1 is specified, insert-based autovacuum triggering is disabled.
-1, which means it uses the regular vacuum_cost_limit value (default 200).
Transactions > Concurrency
28 questions
When to Use Advisory Locks in PostgreSQL
Use PostgreSQL advisory locks when you need application-level mutual exclusion that outlives individual transactions or doesn't correspond to table rows.
Specific Use Cases
1. Preventing Concurrent Job Execution
-- Try to acquire lock, skip if already running
SELECT pg_try_advisory_lock(12345);
-- Returns true if acquired, false if held by another session
2. Distributed Task Queue (Single Consumer per Task)
-- Workers compete for tasks
SELECT task_id
FROM queue
WHERE pg_try_advisory_lock(task_id)
LIMIT 1;
-- Only one worker gets the lock per task_id
3. Application-Level Singletons
-- Ensure only one maintenance process runs
SELECT pg_advisory_lock(1); -- Blocks until available
-- Perform maintenance
SELECT pg_advisory_unlock(1);
When NOT to Use Advisory Locks
- Row-level concurrency: Use SELECT ... FOR UPDATE instead
- Schema changes: Use regular table locks
- Simple counters: Use UPDATE with row locks
Critical Properties
- Session-scoped by default: pg_advisory_lock() is held until explicit unlock or session end
- Transaction-scoped variant: pg_advisory_xact_lock() auto-releases at transaction end
- 64-bit key space: Use a single bigint key, or two 32-bit integers via pg_advisory_lock(int4, int4)
- Deadlock detection: advisory locks go through the regular lock manager, so deadlocks between sessions waiting on them are detected
- Visibility: shown as locktype = 'advisory' in the pg_locks system view
Version: PostgreSQL 9.1+ (advisory locks introduced in 8.2, improved in 9.1)
Source: PostgreSQL 17 Official Documentation - Explicit Locking (§13.3.5)
What is a Savepoint in PostgreSQL?
A savepoint is a named checkpoint within a PostgreSQL transaction that allows you to rollback part of a transaction without aborting the entire transaction.
Core Concept
Savepoints establish a point to which you can later roll back, discarding all database changes made after that savepoint while preserving changes made before it. The outer transaction remains active.
Syntax
-- Create a savepoint
SAVEPOINT savepoint_name;
-- Roll back to a savepoint (discards changes after the savepoint)
ROLLBACK TO SAVEPOINT savepoint_name;
-- Remove a savepoint (keeps all changes)
RELEASE SAVEPOINT savepoint_name;
Practical Example
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 1000);
SAVEPOINT my_savepoint;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Error occurs or you change your mind
ROLLBACK TO SAVEPOINT my_savepoint;
-- The UPDATE is undone, but the INSERT remains
COMMIT;
-- Only the INSERT is committed
Key Facts
- Naming: Savepoint names are case-insensitive identifiers
- Nesting: Savepoints can be nested; rolling back to an outer savepoint automatically releases all inner savepoints
- Scope: Savepoints only exist within a transaction block (between BEGIN and COMMIT/ROLLBACK)
- Subtransactions: Internally implemented as subtransactions (performance consideration for high volumes)
- Reuse: You can reuse a savepoint name; it releases the old savepoint and establishes a new one at the current point
Use Cases
- Error recovery in complex transactions
- Conditional logic where some operations may need to be undone
- Batch processing where partial failures shouldn't abort the entire batch
Source: PostgreSQL 17 Official Documentation - SQL Commands: SAVEPOINT, ROLLBACK TO SAVEPOINT, RELEASE SAVEPOINT
PostgreSQL uses a wait-for graph algorithm with optimistic waiting to detect deadlocks.
How It Works
Optimistic Waiting: When a process cannot acquire a lock immediately, it waits without performing a deadlock check. Instead, it sets a timer for deadlock_timeout milliseconds (default: 1 second).
Triggered Detection: If the timer expires before the lock is granted, PostgreSQL runs the deadlock detection algorithm. This avoids the expensive check on every lock wait.
Wait-For Graph (WFG): PostgreSQL represents processes as nodes in a directed graph. An edge exists from process A to process B when A is waiting for a lock that B holds. A deadlock exists when there's a cycle in this graph.
FindLockCycle Function: The core detection algorithm recursively scans outward from the waiting process across wait-for edges. If it finds a cycle involving the starting process, a deadlock is detected.
Hard vs Soft Edges:
- Hard edges: Process B already holds a lock conflicting with A's request
- Soft edges: Process A is waiting behind process B in the lock queue with conflicting requests
Resolution: When a deadlock is detected, PostgreSQL attempts to resolve it by:
- If only hard edges exist: Abort one transaction
- If soft edges exist: Attempt to reorder the lock queue using topological sorting to break the cycle without aborting
- Which transaction aborts is unpredictable and should not be relied upon
Configuration
The deadlock_timeout parameter controls when checks occur. Increasing it reduces overhead from unnecessary checks but delays reporting real deadlocks.
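As a hedged sketch of inspecting and adjusting it (values are illustrative; setting it per session requires superuser privileges):
SHOW deadlock_timeout;                     -- default: 1s
SET deadlock_timeout = '2s';               -- current session only
ALTER SYSTEM SET deadlock_timeout = '2s';  -- persist for the whole cluster
SELECT pg_reload_conf();                   -- apply without a restart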
Rollback to a Savepoint in PostgreSQL
Use the ROLLBACK TO SAVEPOINT command to roll back all commands executed after a savepoint was established, while keeping the transaction active.
Syntax:
ROLLBACK TO SAVEPOINT savepoint_name;
Complete Example:
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 1000);
SAVEPOINT my_savepoint;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
DELETE FROM accounts WHERE id = 2;
-- Undo UPDATE and DELETE, keep INSERT
ROLLBACK TO SAVEPOINT my_savepoint;
COMMIT; -- Commits the INSERT only
Key Behaviors:
- All commands after the savepoint are undone
- The transaction remains ACTIVE (not aborted)
- The savepoint remains defined and can be rolled back to again
- Savepoints created after my_savepoint are automatically destroyed
- You can execute new commands after the rollback and still commit or roll back the transaction
Alternative Syntax:
ROLLBACK TO my_savepoint; -- SAVEPOINT keyword is optional
To Release a Savepoint (without rollback):
RELEASE SAVEPOINT savepoint_name; -- Frees resources, can't rollback to it anymore
Compatibility: Available in PostgreSQL 8.0+ (all currently supported versions: 12-17 as of 2025).
Source: PostgreSQL Official Documentation - SQL Commands: SAVEPOINT, ROLLBACK TO SAVEPOINT
PostgreSQL max_prepared_transactions Setting
max_prepared_transactions controls the maximum number of transactions that can be in the "prepared" state simultaneously for two-phase commit (2PC) operations.
Exact Configuration
Default value: 0 (two-phase commit disabled)
Valid range: 0 to 262143
Requires restart: Yes (server must be restarted for changes to take effect)
Setting the Value
In postgresql.conf:
max_prepared_transactions = 100
Or at server start:
postgres -c max_prepared_transactions=100
Critical Rules
- Must be ≥ max_connections if you use prepared transactions in normal operation
- Must be > 0 to enable the PREPARE TRANSACTION command
- Each prepared transaction consumes shared memory (~600 bytes plus locks)
- Standby servers must have max_prepared_transactions ≥ the primary server's value
Usage Example
-- Begin transaction
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Prepare for 2PC
PREPARE TRANSACTION 'txn_001';
-- Later: commit or rollback
COMMIT PREPARED 'txn_001';
-- OR
ROLLBACK PREPARED 'txn_001';
Source
PostgreSQL 17 official documentation: Server Configuration - Resource Consumption
Prepared Transactions in PostgreSQL
Prepared transactions are PostgreSQL's implementation of two-phase commit (2PC), allowing a transaction to be prepared for commit and later either committed or rolled back in a separate session.
How They Work
- PREPARE TRANSACTION saves all transaction state to disk with a unique identifier
- The connection can disconnect
- Another session can COMMIT PREPARED or ROLLBACK PREPARED using that identifier
Syntax
-- Phase 1: Prepare the transaction
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
PREPARE TRANSACTION 'transfer_001';
-- Phase 2: Commit or rollback (can be from different session)
COMMIT PREPARED 'transfer_001';
-- OR
ROLLBACK PREPARED 'transfer_001';
Critical Configuration
Prepared transactions are disabled by default. Enable via:
-- postgresql.conf
max_prepared_transactions = 100 -- Default is 0 (disabled)
Must be set ≥ max_connections if using external transaction managers. Requires PostgreSQL restart.
Key Constraints
- Transaction identifier: max 200 bytes
- Prepared transactions hold locks until resolved
- Consume slots from max_prepared_transactions
- Not automatically cleaned up; must be explicitly committed or rolled back
- A transaction cannot be prepared if it has used:
  - LISTEN/NOTIFY
  - Advisory locks
  - Session-level operations
  - Cursors created WITH HOLD
Use Cases
- Distributed transactions across multiple databases
- External transaction managers (e.g., Java JTA, XA protocol)
- NOT recommended for application-level use (use regular transactions instead)
Monitoring
-- View prepared transactions
SELECT * FROM pg_prepared_xacts;
Source: PostgreSQL 17 Official Documentation - SQL Commands: PREPARE TRANSACTION
SKIP LOCKED in PostgreSQL
SKIP LOCKED is a row-level locking clause (available since PostgreSQL 9.5) that causes a SELECT ... FOR UPDATE/SHARE query to skip over rows that are already locked by other transactions, instead of waiting for those locks to be released.
Syntax
SELECT * FROM table_name
WHERE conditions
FOR UPDATE SKIP LOCKED;
Exact Behavior
- Without SKIP LOCKED: Query blocks/waits when encountering locked rows
- With SKIP LOCKED: Query immediately skips locked rows and returns only unlocked rows
- Must be combined with FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, or FOR KEY SHARE
- Returns non-deterministic results (which rows are locked varies with concurrent transactions)
Primary Use Case: Job Queue Pattern
-- Worker process claims next available job
BEGIN;
SELECT * FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- Process the job, then:
UPDATE jobs SET status = 'completed' WHERE id = ...;
COMMIT;
Multiple workers can run this concurrently without conflicts—each gets a different unlocked row.
Critical Details
- Transaction required: The lock is held until transaction commits/rolls back
- Works with LIMIT: A common pattern is LIMIT 1 to claim one work item
- Not in subqueries: Cannot use in subqueries or CTEs in PostgreSQL <12; PostgreSQL 12+ supports it in CTEs
Source: PostgreSQL Official Documentation v9.5+ - Explicit Locking section
Non-Repeatable Read in PostgreSQL
A non-repeatable read occurs when a transaction reads the same row twice and gets different data because another transaction modified and committed that row between the two reads.
Example Scenario
Transaction 1:
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- Returns 100
-- Transaction 2 updates and commits here
SELECT balance FROM accounts WHERE id = 1; -- Returns 150 (different!)
COMMIT;
Transaction 2 (executes between T1's reads):
BEGIN;
UPDATE accounts SET balance = 150 WHERE id = 1;
COMMIT;
Transaction 1 read the same row twice but got different values (100, then 150) - this is a non-repeatable read.
PostgreSQL Isolation Levels
Non-repeatable reads are prevented at these isolation levels:
- REPEATABLE READ
- SERIALIZABLE
Non-repeatable reads are allowed at:
- READ COMMITTED (PostgreSQL's default isolation level)
Prevention Example
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1; -- Returns 100
-- Even if another transaction commits changes...
SELECT balance FROM accounts WHERE id = 1; -- Still returns 100 (snapshot isolation)
COMMIT;
PostgreSQL uses snapshot isolation for REPEATABLE READ, guaranteeing each transaction sees a consistent snapshot of data from when the transaction started.
Source: PostgreSQL 16 Documentation, Section 13.2 "Transaction Isolation"
PostgreSQL Lock Types
PostgreSQL has 8 table-level lock modes and 4 row-level lock modes (as of PostgreSQL 16).
Table-Level Lock Modes (ordered by strength):
1. ACCESS SHARE - Acquired by SELECT queries. Conflicts only with ACCESS EXCLUSIVE.
2. ROW SHARE - Acquired by SELECT FOR UPDATE and SELECT FOR SHARE. Conflicts with EXCLUSIVE and ACCESS EXCLUSIVE.
3. ROW EXCLUSIVE - Acquired by INSERT, UPDATE, DELETE. Conflicts with SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE.
4. SHARE UPDATE EXCLUSIVE - Acquired by VACUUM, ANALYZE, CREATE INDEX CONCURRENTLY. Protects against concurrent schema changes. Conflicts with itself and higher modes.
5. SHARE - Acquired by CREATE INDEX (non-concurrent). Conflicts with ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE (not with itself).
6. SHARE ROW EXCLUSIVE - Acquired by CREATE TRIGGER. Conflicts with every mode except ACCESS SHARE and ROW SHARE.
7. EXCLUSIVE - Acquired by REFRESH MATERIALIZED VIEW CONCURRENTLY. Conflicts with ROW SHARE and higher modes.
8. ACCESS EXCLUSIVE - Acquired by DROP TABLE, TRUNCATE, ALTER TABLE, LOCK TABLE (default). Conflicts with ALL lock modes. Only one session can hold this lock.
Row-Level Lock Modes:
- FOR KEY SHARE - Lightest row lock. Blocks only FOR UPDATE.
- FOR SHARE - Blocks UPDATE, DELETE, FOR UPDATE, FOR NO KEY UPDATE.
- FOR NO KEY UPDATE - Blocks DELETE, FOR UPDATE, FOR SHARE, FOR NO KEY UPDATE.
- FOR UPDATE - Strongest row lock. Blocks all other row locks.
Explicit Locking Syntax:
-- Table-level explicit lock
LOCK TABLE table_name IN ACCESS EXCLUSIVE MODE;
-- Row-level locks (automatically acquired, or explicit via SELECT)
SELECT * FROM table_name WHERE id = 1 FOR UPDATE;
SELECT * FROM table_name WHERE id = 1 FOR SHARE;
SELECT * FROM table_name WHERE id = 1 FOR NO KEY UPDATE;
SELECT * FROM table_name WHERE id = 1 FOR KEY SHARE;
Advisory Locks:
PostgreSQL also provides advisory locks (application-level):
- Session-level: pg_advisory_lock(key), pg_advisory_unlock(key)
- Transaction-level: pg_advisory_xact_lock(key) (auto-released at transaction end)
Source: PostgreSQL 16 Official Documentation - Chapter 13.3: Explicit Locking
Serialization Anomaly in PostgreSQL
A serialization anomaly is a phenomenon where the result of successfully committing a group of concurrent transactions is inconsistent with every possible ordering of running those transactions one at a time.
What PostgreSQL Does
PostgreSQL's SERIALIZABLE isolation level (implemented using Serializable Snapshot Isolation since version 9.1) prevents serialization anomalies by detecting potential conflicts and raising an error:
ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during commit attempt.
HINT: The transaction might succeed if retried.
When this error occurs, the transaction is rolled back and must be retried by the application.
Classic Example: Write Skew
-- Two bank accounts, constraint: sum must stay >= 0
-- Account A: $100, Account B: $100
-- Transaction 1:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT SUM(balance) FROM accounts; -- reads $200
UPDATE accounts SET balance = balance - 150 WHERE id = 'A';
COMMIT;
-- Transaction 2 (concurrent):
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT SUM(balance) FROM accounts; -- reads $200
UPDATE accounts SET balance = balance - 150 WHERE id = 'B';
COMMIT;
-- Result: A = -$50, B = -$50 (violates constraint)
-- At SERIALIZABLE: One transaction aborts with serialization error
Implementation Details
- Predicate locks: PostgreSQL uses Serializable Snapshot Isolation (SSI) with predicate locking to detect dangerous patterns (rw-conflicts forming cycles)
- Performance cost: SSI adds roughly 25-30% overhead vs REPEATABLE READ due to conflict tracking
- Memory: Uses max_pred_locks_per_transaction (default: 64) × max_connections predicate lock slots
When It Matters
Use SERIALIZABLE when you need true serializability guarantees and can handle retry logic. Lower isolation levels (READ COMMITTED, REPEATABLE READ) do not detect or prevent serialization anomalies.
Source: PostgreSQL Documentation v16, Section 13.2.3 "Serializable Isolation Level"
pg_locks View in PostgreSQL
The pg_locks view is a system catalog view that displays information about all currently held locks in the PostgreSQL database cluster and their associated processes.
Core Details
- Type: System view (not a table)
- Scope: Database cluster-wide (shows locks from all databases)
- Access: Readable by all users; it reports locks from all sessions across the cluster
Key Columns
SELECT
locktype, -- Type of lockable object (relation, tuple, transactionid, etc.)
database, -- OID of database (NULL for shared objects)
relation, -- OID of locked relation (NULL if not a relation lock)
page, -- Page number within relation (NULL if not a page lock)
tuple, -- Tuple number within page (NULL if not a tuple lock)
transactionid, -- Transaction ID (NULL if not a transaction lock)
pid, -- Process ID of server process holding/awaiting lock
mode, -- Lock mode (e.g., AccessShareLock, RowExclusiveLock, ExclusiveLock)
granted -- TRUE if lock is held, FALSE if awaiting lock
FROM pg_locks;
Common Lock Modes (mode column)
- AccessShareLock - Acquired by SELECT queries
- RowShareLock - Acquired by SELECT FOR UPDATE/FOR SHARE
- RowExclusiveLock - Acquired by INSERT, UPDATE, DELETE
- ShareUpdateExclusiveLock - Acquired by VACUUM, CREATE INDEX CONCURRENTLY
- ShareLock - Acquired by CREATE INDEX
- ExclusiveLock - Blocks all concurrent access except AccessShareLock
- AccessExclusiveLock - Acquired by DROP TABLE, TRUNCATE, REINDEX, VACUUM FULL
Practical Example - Finding Blocking Queries
-- Find locks that are blocking other processes:
-- pair each waiting (not granted) request with the granted lock that blocks it
SELECT
    holder.pid AS blocking_pid,
    holder.mode AS blocking_mode,
    waiter.locktype,
    waiter.pid AS blocked_pid,
    waiter.mode AS blocked_mode
FROM pg_locks AS waiter
JOIN pg_locks AS holder
    ON waiter.locktype = holder.locktype
    AND waiter.database IS NOT DISTINCT FROM holder.database
    AND waiter.relation IS NOT DISTINCT FROM holder.relation
    AND waiter.transactionid IS NOT DISTINCT FROM holder.transactionid
WHERE NOT waiter.granted
    AND holder.granted
    AND waiter.pid != holder.pid;
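A simpler alternative on PostgreSQL 9.6 and later is pg_blocking_pids(), joined here with pg_stat_activity (a sketch):
SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;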
Version Notes
- Available in all modern PostgreSQL versions (9.0+)
- PostgreSQL 14+ added the waitstart column (timestamp when the lock wait began)
What is a Deadlock in PostgreSQL?
A deadlock in PostgreSQL occurs when two or more transactions are waiting for each other to release locks, creating a circular dependency that prevents any of them from proceeding.
How It Happens
Example scenario:
-- Transaction 1
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- waits here for lock on id = 2
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- Transaction 2 (concurrent)
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2;
-- waits here for lock on id = 1
UPDATE accounts SET balance = balance + 50 WHERE id = 1;
Transaction 1 holds a lock on row id=1 and waits for id=2. Transaction 2 holds a lock on row id=2 and waits for id=1. Neither can proceed.
PostgreSQL's Behavior
- Detection: The deadlock check runs only after a backend has waited deadlock_timeout (default: 1 second) for a lock
- Resolution: When a deadlock is detected, PostgreSQL aborts one of the transactions with error 40P01 (deadlock_detected)
- Error message:
ERROR: deadlock detected
DETAIL: Process 12345 waits for ShareLock on transaction 67890; blocked by process 12346.
HINT: See server log for query details.
Prevention Strategies
- Acquire locks in consistent order across all transactions (e.g., always lock lower IDs first)
- Use explicit locking with
SELECT ... FOR UPDATEto acquire all locks upfront - Keep transactions short to minimize lock duration
Source: PostgreSQL 16 Official Documentation (Chapter 13.3 - Explicit Locking)
PostgreSQL prevents transaction ID wraparound through autovacuum and regular VACUUM operations that freeze old transaction IDs.
Primary Prevention Method
Enable autovacuum (enabled by default). PostgreSQL's autovacuum daemon automatically runs VACUUM operations to freeze transaction IDs before wraparound occurs.
Key Configuration Parameters
Set these in postgresql.conf:
- autovacuum_freeze_max_age (default: 200,000,000 transactions) - Forces autovacuum when a table's pg_class.relfrozenxid exceeds this age. This runs even if autovacuum is otherwise disabled.
- vacuum_freeze_min_age (default: 50,000,000 transactions) - Minimum age before rows are frozen.
- vacuum_freeze_table_age (default: 150,000,000 transactions) - Triggers aggressive whole-table scanning when exceeded (capped at 95% of autovacuum_freeze_max_age).
Manual Prevention
Run manual VACUUM operations:
-- Database-wide vacuum
VACUUM;
-- Specific table
VACUUM table_name;
-- Aggressive freeze (for emergency situations)
VACUUM FREEZE;
Important: Use standard VACUUM, not VACUUM FULL for routine maintenance. VACUUM FULL requires exclusive locks and is much slower.
Monitoring
Check transaction age to prevent wraparound:
-- Check database age
SELECT datname, age(datfrozenxid) FROM pg_database;
-- Check table ages
SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind = 'r';
PostgreSQL warns when XIDs reach 40 million transactions from wraparound point (out of 4 billion total).
Additional Safeguards
- Terminate long-running transactions that hold old XIDs
- Resolve or drop old prepared transactions
- Remove inactive replication slots
ROW EXCLUSIVE Lock in PostgreSQL
ROW EXCLUSIVE is a table-level lock mode in PostgreSQL that allows concurrent reads but prevents other transactions from acquiring locks that would modify the table's structure or exclusively lock it.
When It's Acquired
Automatically acquired by these commands:
UPDATE, DELETE, INSERT, MERGE
Manually acquired with:
LOCK TABLE table_name IN ROW EXCLUSIVE MODE;
What It Blocks
Blocks these lock modes:
SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE
Allows these lock modes:
ROW SHARE (acquired by SELECT FOR UPDATE/SHARE), ROW EXCLUSIVE (other concurrent INSERT/UPDATE/DELETE)
Practical Implications
- Multiple transactions can execute INSERT, UPDATE, or DELETE concurrently on the same table
- Concurrent SELECT queries work without blocking
- Commands that need table-level exclusive access (like ALTER TABLE, DROP TABLE, TRUNCATE, VACUUM FULL) are blocked
In PostgreSQL's lock conflict matrix, ROW EXCLUSIVE is lock mode ID 3 out of 8 modes (weakest to strongest: ACCESS SHARE → ROW SHARE → ROW EXCLUSIVE → SHARE UPDATE EXCLUSIVE → SHARE → SHARE ROW EXCLUSIVE → EXCLUSIVE → ACCESS EXCLUSIVE).
Source: PostgreSQL 17 official documentation, Explicit Locking - Table-Level Locks
Setting Transaction Isolation Level in PostgreSQL
Method 1: SET TRANSACTION (Per-Transaction)
Set isolation level at the start of a transaction (before any data-modifying statement):
BEGIN;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- your queries here
COMMIT;
Method 2: SET SESSION CHARACTERISTICS (Session-Level)
Set default for all subsequent transactions in the session:
SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SERIALIZABLE;
Method 3: SET (Session-Level, Shorthand)
SET default_transaction_isolation = 'repeatable read';
Available Isolation Levels (PostgreSQL 9.1+)
- READ UNCOMMITTED - Treated as READ COMMITTED in PostgreSQL (no dirty reads)
- READ COMMITTED - Default level, prevents dirty reads
- REPEATABLE READ - Prevents dirty reads and non-repeatable reads
- SERIALIZABLE - Full isolation, prevents all anomalies
Inline with BEGIN (Per-Transaction)
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM accounts WHERE id = 1;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;
Configuration File (postgresql.conf)
Set cluster-wide default:
default_transaction_isolation = 'read committed'
Requires server reload: pg_ctl reload or SELECT pg_reload_conf();
Important Notes
- Must be set before the first query in the transaction (before SELECT, INSERT, UPDATE, DELETE)
- Use single quotes for level names in SET commands
- Names are case-insensitive but conventionally uppercase in SET TRANSACTION
Source: PostgreSQL 16 Official Documentation, Section 13.2 "Transaction Isolation"
PostgreSQL supports 4 transaction isolation levels as defined by the SQL standard:
- Read Uncommitted (treated as Read Committed)
- Read Committed (default)
- Repeatable Read
- Serializable
Key Details:
Read Uncommitted: PostgreSQL does not implement true Read Uncommitted. It behaves identically to Read Committed due to PostgreSQL's MVCC implementation.
Read Committed (default): Each statement sees a snapshot of data as of the start of that statement. Prevents dirty reads but allows non-repeatable reads and phantom reads.
Repeatable Read: All statements in a transaction see a snapshot as of the first non-transaction-control statement. Prevents dirty reads, non-repeatable reads, and phantom reads. May result in serialization failures (SQLSTATE 40001).
Serializable: Strictest level. Guarantees true serializability using Serializable Snapshot Isolation (SSI). May result in serialization failures that require retry logic.
Setting Isolation Level:
-- Set for current transaction
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Or at session level
SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Or inline with BEGIN
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
Version:
Applies to PostgreSQL 9.1+ (SSI introduced in 9.1). Current through PostgreSQL 17.
Source: PostgreSQL official documentation - Chapter 13.2 "Transaction Isolation"
PostgreSQL Advisory Locks: pg_advisory_lock vs pg_try_advisory_lock
The key difference: blocking behavior.
- pg_advisory_lock(key bigint) - BLOCKS until the lock is acquired. If another session holds the lock, the calling session waits indefinitely until the lock becomes available.
- pg_try_advisory_lock(key bigint) - NON-BLOCKING. Returns immediately with a boolean: true if the lock was acquired, false if it is already held by another session.
Usage Examples
-- Blocking version - waits until lock is available
SELECT pg_advisory_lock(12345);
-- (execution pauses here if lock is held elsewhere)
-- No return value - succeeds silently when acquired
-- Non-blocking version - returns immediately
SELECT pg_try_advisory_lock(12345);
-- Returns: true (acquired) or false (already locked)
When to Use Each
Use pg_advisory_lock when:
- You MUST ensure exclusive access (e.g., running migrations, processing exactly-once jobs)
- Waiting for the lock is acceptable behavior
Use pg_try_advisory_lock when:
- You want to skip work if another process is already doing it
- You need timeout control (poll with application-level retry logic)
- You want to avoid indefinite blocking
Critical Details
- Both use session-level locks by default (released on connection close or explicit pg_advisory_unlock)
- Transaction-level variants exist: pg_advisory_xact_lock (auto-released at transaction end)
- Lock keys are 64-bit integers (range: -9223372036854775808 to 9223372036854775807)
- Available since PostgreSQL 8.2
- Locks are held in shared memory (max controlled by max_locks_per_transaction)
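A minimal usage sketch, using an arbitrary application-chosen key (42):
-- Guard a batch job so only one worker runs it at a time
SELECT pg_try_advisory_lock(42) AS got_lock;
-- if got_lock is true, do the work, then release explicitly:
SELECT pg_advisory_unlock(42);
-- Transaction-scoped alternative (no explicit unlock needed):
BEGIN;
SELECT pg_advisory_xact_lock(42);
-- ... work ...
COMMIT;  -- lock released automatically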
Source: PostgreSQL 16 Official Documentation - Advisory Locks
MVCC in PostgreSQL
MVCC (Multi-Version Concurrency Control) is PostgreSQL's core concurrency mechanism that allows multiple transactions to access the same data simultaneously without blocking each other by maintaining multiple versions of each row.
How It Works
Each row has hidden system columns that track versioning:
- xmin: Transaction ID that inserted the row
- xmax: Transaction ID that deleted/updated the row (0 if current)
- ctid: Physical location (page, tuple)
When you UPDATE a row, PostgreSQL:
- Creates a new version of the row with a new xmin
- Marks the old version with xmax set to the updating transaction's ID
- Old versions remain visible to transactions that started before the update
-- View hidden columns
SELECT xmin, xmax, ctid, * FROM users WHERE id = 1;
Visibility Rules
Each transaction sees data based on its snapshot:
- READ COMMITTED (default): New snapshot per statement
- REPEATABLE READ: Single snapshot for entire transaction
- SERIALIZABLE: Strictest isolation, prevents anomalies
A row version is visible to a transaction if:
- xmin is committed AND before the transaction's snapshot
- xmax is 0 OR uncommitted OR after the transaction's snapshot
Critical Implications
Vacuum Required: Old row versions accumulate as "dead tuples". autovacuum removes them, but under heavy UPDATE/DELETE load, manual VACUUM may be needed to prevent bloat.
No Locking on Reads: SELECT never blocks writes, writes never block reads (unlike traditional row-level locking databases).
Write Conflicts: When two transactions update the same row, the second waits for the first to commit or abort; under REPEATABLE READ or SERIALIZABLE, the waiting transaction then fails with a serialization error if the first committed a conflicting change.
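To gauge dead-tuple pressure in practice, a quick sketch against the standard statistics view pg_stat_user_tables:
-- Tables with the most accumulated dead tuples
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;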
Source: PostgreSQL 17 Official Documentation - Chapter 13 (Concurrency Control)
An ACCESS SHARE lock is the least restrictive lock mode in PostgreSQL, automatically acquired by SELECT queries (read-only operations) on tables.
Key Characteristics:
- Conflicts with: Only ACCESS EXCLUSIVE locks (acquired by DROP TABLE, TRUNCATE, VACUUM FULL, LOCK TABLE ... ACCESS EXCLUSIVE)
- Does NOT conflict with: All other lock modes, including writes (INSERT, UPDATE, DELETE)
- Purpose: Prevents table structure changes while allowing all data modifications
- Duration: Held until transaction commits or rolls back
- Auto-release: Yes, automatically released at transaction end
Explicit Acquisition:
LOCK TABLE table_name IN ACCESS SHARE MODE;
Common Operations That Acquire ACCESS SHARE:
-- Simple SELECT
SELECT * FROM users;
-- SELECT with joins
SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id;
-- COPY TO (exporting data)
COPY users TO '/tmp/users.csv';
Critical Behavior:
Multiple concurrent SELECT queries can hold ACCESS SHARE locks simultaneously on the same table. This is why reads don't block reads or writes in PostgreSQL.
Source:
PostgreSQL Official Documentation - Table-Level Locks
Valid for PostgreSQL 9.1+ through current versions (including 17).
Phantom Read in PostgreSQL
A phantom read occurs when a transaction re-executes a query and finds a different set of rows that satisfy the search condition due to another transaction's committed INSERT or DELETE operations between the reads.
Specific Example
-- Transaction 1: Read accounts with balance > 1000
BEGIN;
SELECT * FROM accounts WHERE balance > 1000;
-- Returns 5 rows
-- Transaction 2 (concurrent): Insert new account
BEGIN;
INSERT INTO accounts (id, balance) VALUES (100, 1500);
COMMIT;
-- Transaction 1: Re-read same query
SELECT * FROM accounts WHERE balance > 1000;
-- Returns 6 rows (phantom row appears)
COMMIT;
PostgreSQL Behavior by Isolation Level
- READ COMMITTED (default): Phantom reads can occur
- REPEATABLE READ: Phantom reads cannot occur (prevented by snapshot isolation)
- SERIALIZABLE: Phantom reads cannot occur (prevented by serializable snapshot isolation)
Key Difference from SQL Standard
PostgreSQL's REPEATABLE READ is stronger than the SQL standard requires. The SQL standard allows phantom reads at REPEATABLE READ level, but PostgreSQL prevents them using MVCC (Multi-Version Concurrency Control) snapshot isolation.
To prevent phantom reads, use:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- or
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
Source: PostgreSQL 17 documentation, Chapter 13.2 "Transaction Isolation"
SELECT FOR UPDATE in PostgreSQL
SELECT FOR UPDATE is a row-level locking mechanism that acquires exclusive locks on rows returned by a SELECT query, preventing other transactions from modifying or locking those rows until the current transaction completes.
Syntax
SELECT * FROM table_name WHERE condition FOR UPDATE;
Behavior
- Locks acquired: RowShareLock on the table, ExclusiveLock on each row
- Blocks: Other transactions attempting UPDATE, DELETE, SELECT FOR UPDATE/SHARE/KEY SHARE/NO KEY UPDATE on the locked rows will wait
- Allows: Other transactions can still perform plain SELECT (reads are not blocked)
- Lock release: Automatically released at transaction COMMIT or ROLLBACK
Common Use Case
Prevents lost updates in concurrent transactions:
BEGIN;
-- Lock the row to prevent concurrent modifications
SELECT balance FROM accounts WHERE id = 123 FOR UPDATE;
-- Now safe to modify based on the read value
UPDATE accounts SET balance = balance - 100 WHERE id = 123;
COMMIT;
Important Variants
- FOR UPDATE NOWAIT - Returns an error immediately if a row is locked (instead of waiting)
- FOR UPDATE SKIP LOCKED - Skips locked rows, returns only unlocked rows (PostgreSQL 9.5+)
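These variants enable the common single-consumer job-queue pattern. A minimal sketch, assuming a hypothetical jobs(id, status) table:
-- Claim and complete one pending job; concurrent workers skip rows already claimed
UPDATE jobs
SET status = 'done'
WHERE id = (
  SELECT id FROM jobs
  WHERE status = 'pending'
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id;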
Critical Details
- Only useful within a transaction block (BEGIN/COMMIT); otherwise the row locks are released as soon as the statement completes
- Applies to all tables in the query's FROM clause unless specific tables are listed: FOR UPDATE OF table_name
- Cannot be used with UNION, INTERSECT, EXCEPT, or aggregate functions on the locked rows
- Deadlock is possible if multiple transactions lock rows in different orders
Source: PostgreSQL Official Documentation v16 (applicable to 9.5+)
SHARE UPDATE EXCLUSIVE Lock in PostgreSQL
SHARE UPDATE EXCLUSIVE is a table-level lock mode in PostgreSQL that prevents concurrent schema modifications and VACUUM operations while allowing concurrent data reads and writes.
Lock Behavior
Conflicts with:
- SHARE UPDATE EXCLUSIVE
- SHARE
- SHARE ROW EXCLUSIVE
- EXCLUSIVE
- ACCESS EXCLUSIVE
Compatible with:
- ROW SHARE (SELECT FOR UPDATE/SHARE)
- ROW EXCLUSIVE (INSERT, UPDATE, DELETE)
- Concurrent SELECT queries (no explicit lock)
Primary Use Cases
- VACUUM (not VACUUM FULL) - Prevents concurrent VACUUM operations
- CREATE INDEX CONCURRENTLY - Prevents concurrent index creation
- ANALYZE - Statistical analysis operations
- ALTER TABLE VALIDATE CONSTRAINT - Constraint validation without blocking reads/writes
Acquiring Explicitly
BEGIN;
LOCK TABLE my_table IN SHARE UPDATE EXCLUSIVE MODE;
-- Perform operations that need to prevent concurrent schema changes
COMMIT;
Key Characteristics
- Self-conflicting: Two transactions cannot both hold SHARE UPDATE EXCLUSIVE on the same table
- Non-blocking for DML: Does NOT block INSERT, UPDATE, DELETE operations
- Blocks DDL: Prevents ALTER TABLE, CREATE INDEX (non-CONCURRENT), VACUUM
- Lock level: Table-level only (not row-level)
Source
PostgreSQL Official Documentation: Explicit Locking (applicable to all versions 9.x through 17.x with consistent behavior)
Transaction ID Wraparound in PostgreSQL
Transaction ID wraparound is a critical maintenance issue: PostgreSQL's 32-bit transaction ID counter eventually exhausts its range and wraps around, and without preventive freezing this would make old rows appear to be "in the future", causing data loss.
Core Mechanism
PostgreSQL uses 32-bit transaction IDs (XIDs) to implement MVCC. This provides approximately 4 billion (2^32) transaction IDs. XIDs are compared using modulo-2^32 arithmetic where:
- Any transaction with an XID more than 2 billion transactions old becomes "in the future"
- This would make old rows invisible, causing catastrophic data loss
Wraparound Prevention
PostgreSQL prevents wraparound through freezing: marking old tuples as visible to all transactions by setting their XID to FrozenTransactionId (2).
Critical Thresholds (PostgreSQL 12+)
- vacuum_freeze_min_age: 50 million transactions (default) - minimum age before a row can be frozen
- vacuum_freeze_table_age: 150 million transactions (default) - triggers aggressive whole-table scan
- autovacuum_freeze_max_age: 200 million transactions (default) - HARD LIMIT triggering emergency (anti-wraparound) autovacuum
- Wraparound emergency: as the oldest XID approaches roughly 2 billion transactions old, the server stops assigning new XIDs (effectively read-only) until vacuuming catches up
Configuration Example
-- Check current age of oldest unfrozen transaction
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
-- Critical: If age approaches 200M, emergency vacuum occurs
-- Configure more aggressive freezing:
ALTER SYSTEM SET vacuum_freeze_min_age = 5000000;
ALTER SYSTEM SET autovacuum_freeze_max_age = 100000000;
Action Required
Monitor age(datfrozenxid) regularly. If it exceeds autovacuum_freeze_max_age, PostgreSQL forces an anti-wraparound autovacuum; if the age keeps growing toward the wraparound limit, the server eventually refuses to assign new XIDs until a freezing VACUUM completes.
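A per-table variant of the monitoring query (a sketch; adjust the LIMIT as needed):
-- Tables whose relfrozenxid is oldest, i.e. closest to forcing anti-wraparound autovacuum
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;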
ACID Properties in PostgreSQL
PostgreSQL is a fully ACID-compliant relational database system. ACID stands for:
Atomicity
Transactions are all-or-nothing. Either all operations within a transaction complete successfully and are committed, or none are applied. If any operation fails, PostgreSQL performs an automatic rollback.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- Both updates succeed or both are rolled back
Consistency
Database constraints (CHECK, FOREIGN KEY, UNIQUE, NOT NULL) are enforced at transaction commit. The database transitions from one valid state to another valid state.
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
amount NUMERIC CHECK (amount > 0)
);
-- Transaction will fail if constraint violated
Isolation
Concurrent transactions are isolated from each other. PostgreSQL supports 4 isolation levels defined by SQL standard:
- Read Uncommitted (treated as Read Committed in PostgreSQL)
- Read Committed (default) - queries see only committed data
- Repeatable Read - consistent snapshot of database at transaction start
- Serializable - strictest level, prevents all anomalies
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
-- Queries see consistent snapshot
COMMIT;
Durability
Once a transaction commits, data is permanently stored even if system crashes. PostgreSQL uses Write-Ahead Logging (WAL) - changes are written to WAL files on disk before acknowledgment. Default synchronous_commit = on ensures fsync() before commit returns.
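You can confirm the relevant durability settings directly; both default to on:
SHOW synchronous_commit;
SHOW fsync;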
Source: PostgreSQL 17 Official Documentation, Chapter 13 (Concurrency Control). ACID compliance verified across all PostgreSQL versions 9.x through 17.x.
SELECT FOR SHARE in PostgreSQL
SELECT FOR SHARE is a row-level locking clause that acquires a shared lock on selected rows, preventing other transactions from acquiring exclusive locks (UPDATE, DELETE, SELECT FOR UPDATE) while allowing concurrent SELECT FOR SHARE locks.
Syntax
SELECT * FROM table_name WHERE condition FOR SHARE;
Lock Behavior
- Blocks: UPDATE, DELETE, SELECT FOR UPDATE, SELECT FOR NO KEY UPDATE on the same rows
- Allows: Other SELECT FOR SHARE and SELECT FOR KEY SHARE locks (concurrent reads)
- Duration: Held until transaction commits or rolls back
Variants (PostgreSQL 9.5+)
-- Basic shared lock
SELECT * FROM orders WHERE id = 1 FOR SHARE;
-- Skip locked rows instead of waiting
SELECT * FROM orders WHERE status = 'pending' FOR SHARE SKIP LOCKED;
-- Error immediately if row is locked
SELECT * FROM orders WHERE id = 1 FOR SHARE NOWAIT;
-- Lock only specific tables in a join
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id
FOR SHARE OF o;
Key Difference from SELECT FOR UPDATE
- FOR UPDATE: Exclusive lock - blocks all other row locks, including FOR SHARE
- FOR SHARE: Shared lock - allows multiple transactions to hold it simultaneously
Use Case
Use when you need to ensure rows won't be modified while you perform related operations, but want to allow other transactions to also read those same rows with similar guarantees.
BEGIN;
-- Lock invoice rows to prevent modification during audit
SELECT * FROM invoices WHERE date >= '2025-01-01' FOR SHARE;
-- Perform calculations knowing rows won't change
-- Other transactions can also SELECT FOR SHARE but not UPDATE/DELETE
COMMIT;
Version: Available since PostgreSQL 9.0. SKIP LOCKED and NOWAIT added in 9.5.
PostgreSQL EXCLUSIVE Lock
An EXCLUSIVE lock in PostgreSQL is a table-level lock that blocks all concurrent access except for ACCESS SHARE locks (reads). It allows reads but prevents writes, DDL operations, and other conflicting locks.
Key Characteristics
Conflicts with:
ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE
Compatible with:
ACCESS SHARE only (SELECT queries without FOR UPDATE/SHARE)
Common operations that acquire EXCLUSIVE:
- REFRESH MATERIALIZED VIEW CONCURRENTLY (PostgreSQL 9.4+)
- Explicit LOCK TABLE name IN EXCLUSIVE MODE
Usage Example
-- Explicitly acquire EXCLUSIVE lock
BEGIN;
LOCK TABLE my_table IN EXCLUSIVE MODE;
-- Multiple sessions can still SELECT from my_table
-- but cannot INSERT, UPDATE, DELETE, or ALTER it
COMMIT;
Practical Impact
- Reads: Allowed (concurrent SELECT statements proceed normally)
- Writes: Blocked (INSERT, UPDATE, DELETE wait)
- DDL: Blocked (ALTER TABLE, DROP TABLE wait)
- Other locks: Most lock acquisition attempts wait
Lock Hierarchy Position
On the 8-level PostgreSQL lock hierarchy (weakest to strongest), EXCLUSIVE is level 7 of 8—only ACCESS EXCLUSIVE is stronger.
Source: PostgreSQL 17 Documentation - Explicit Locking
deadlock_timeout is a PostgreSQL server configuration parameter that specifies how long (in milliseconds) to wait on a lock before checking for a deadlock condition.
Default Value: 1000 (1 second)
Valid Range: Any positive integer representing milliseconds. Minimum practical value is typically 1.
Purpose:
When a transaction waits for a lock, PostgreSQL doesn't immediately check if a deadlock exists (expensive operation). Instead, it waits for deadlock_timeout milliseconds first. If still blocked, it runs the deadlock detection algorithm. If a deadlock is found, PostgreSQL aborts one transaction with error 40P01.
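For intuition, a minimal sketch of the classic deadlock this timer governs, assuming a hypothetical table t(id int, v int) containing rows id = 1 and id = 2:
-- Session A:
BEGIN;
UPDATE t SET v = 1 WHERE id = 1;
-- Session B:
BEGIN;
UPDATE t SET v = 1 WHERE id = 2;
UPDATE t SET v = 1 WHERE id = 1;   -- blocks: row 1 is locked by session A
-- Session A:
UPDATE t SET v = 1 WHERE id = 2;   -- blocks: row 2 is locked by session B
-- After deadlock_timeout elapses, the detector aborts one session with
-- ERROR: deadlock detected (SQLSTATE 40P01); the other proceeds.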
Configuration:
-- View current setting
SHOW deadlock_timeout;
-- Set in postgresql.conf (requires reload)
deadlock_timeout = 1000
-- Set for current session
SET deadlock_timeout = '2s';
-- Set for specific database
ALTER DATABASE mydb SET deadlock_timeout = 500;
Units: Accepts time units: ms, s, min, h, d. Without unit, assumes milliseconds.
Tuning Guidance:
- Lower values (100-500ms): Use if deadlocks are frequent and transactions are short
- Higher values (2000-5000ms): Use if deadlock detection overhead is measurable and deadlocks are rare
- Don't set too low (<100ms) as deadlock detection is CPU-intensive
Version: Available since PostgreSQL 7.1. Behavior unchanged through PostgreSQL 17.
Source: PostgreSQL Official Documentation - Runtime Config: Locks
NOWAIT in SELECT FOR UPDATE
NOWAIT is a locking option in PostgreSQL's SELECT FOR UPDATE that causes the query to immediately fail with an error instead of waiting if a requested row is already locked by another transaction.
Behavior
Without NOWAIT (default):
- The query blocks and waits indefinitely until the row lock is released
- No timeout unless lock_timeout is configured
With NOWAIT:
- If any requested row is already locked, PostgreSQL immediately raises error code 55P03 (lock_not_available)
- Error message: could not obtain lock on row in relation "table_name"
- No rows are returned; the entire query fails
Syntax
SELECT * FROM orders
WHERE order_id = 123
FOR UPDATE NOWAIT;
Use Case
Use NOWAIT when you want to:
- Avoid blocking operations in high-concurrency scenarios
- Implement optimistic locking patterns
- Quickly detect lock contention and retry or skip the operation
Error Handling Example
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details elided
cursor = conn.cursor()
# order_id is assumed to be supplied by the caller
try:
    cursor.execute("SELECT * FROM orders WHERE id = %s FOR UPDATE NOWAIT", (order_id,))
    row = cursor.fetchone()
except psycopg2.errors.LockNotAvailable:
    # Row is locked by another transaction
    # Handle: retry, skip, or return error to user
    conn.rollback()  # clear the aborted transaction before doing anything else
Version Support
Available since PostgreSQL 8.3 (2008).
Related Options
- SKIP LOCKED (PostgreSQL 9.5+): Skips locked rows instead of failing
- lock_timeout: Sets the maximum wait time for locks (default: 0 = wait forever)
Source: PostgreSQL Official Documentation - SELECT FOR UPDATE
Security
28 questions
How to Encrypt Data at Rest in PostgreSQL
PostgreSQL supports data-at-rest encryption through Transparent Data Encryption (TDE) in enterprise forks or filesystem-level encryption in standard PostgreSQL.
Standard PostgreSQL (Community Edition)
PostgreSQL core does not include built-in TDE. Use filesystem-level encryption:
Option 1: LUKS (Linux Unified Key Setup)
Encrypt the entire data directory partition:
# Create encrypted partition
cryptsetup luksFormat /dev/sdb1
cryptsetup luksOpen /dev/sdb1 pgdata_encrypted
# Format and mount
mkfs.ext4 /dev/mapper/pgdata_encrypted
mount /dev/mapper/pgdata_encrypted /var/lib/postgresql/data
# Initialize PostgreSQL
sudo -u postgres initdb -D /var/lib/postgresql/data
Option 2: Application-Level Encryption
Encrypt sensitive columns using pgcrypto extension:
CREATE EXTENSION pgcrypto;
-- Encrypt data
INSERT INTO users (id, ssn)
VALUES (1, pgp_sym_encrypt('123-45-6789', 'encryption_key'));
-- Decrypt data
SELECT id, pgp_sym_decrypt(ssn, 'encryption_key') AS ssn
FROM users WHERE id = 1;
Key size: pgp_sym_encrypt uses AES-128 by default (16-byte key recommended).
Enterprise Solutions with TDE
- EDB Postgres Advanced Server: Native TDE using AES-256
- Percona Distribution for PostgreSQL: TDE support via patches
- AWS RDS PostgreSQL: Enable encryption at instance creation (AES-256, cannot be enabled post-creation)
Performance Impact
- Filesystem encryption (LUKS): ~3-10% overhead
- pgcrypto: ~15-30% overhead per encrypted column operation
Source: PostgreSQL 17 official documentation, LUKS cryptsetup man pages (v2.6), AWS RDS documentation (2024).
Peer Authentication in PostgreSQL
Peer authentication is a PostgreSQL authentication method that validates database connections by checking if the operating system username matches the PostgreSQL username. It works exclusively for local connections (Unix domain sockets) and uses the OS kernel to obtain the client's credentials.
How It Works
When a user connects via peer authentication, PostgreSQL:
- Asks the operating system kernel for the username of the process connecting via the Unix socket
- Compares this OS username to the requested PostgreSQL username
- Allows connection only if they match exactly (case-sensitive)
Configuration
In pg_hba.conf:
# TYPE DATABASE USER ADDRESS METHOD
local all all peer
Practical Example
If you're logged into Linux as OS user john:
# This works - OS user 'john' connects as PostgreSQL user 'john'
psql -U john mydatabase
# This fails - OS user 'john' trying to connect as PostgreSQL user 'alice'
psql -U alice mydatabase
# Error: Peer authentication failed for user "alice"
Username Mapping (Optional)
Use pg_ident.conf to map OS usernames to different PostgreSQL usernames:
# MAPNAME SYSTEM-USERNAME PG-USERNAME
mymap john postgres
Then in pg_hba.conf:
local all all peer map=mymap
Key Constraints
- Platform: Linux, BSD, macOS, Solaris (not available on Windows)
- Connection type: Only works for local (Unix domain socket) connections, NOT host (TCP/IP)
- Default: Peer is the default authentication for local connections in PostgreSQL 9.1+
Source: PostgreSQL Official Documentation - Chapter 21.9 Peer Authentication
The CREATEROLE attribute in PostgreSQL is a role privilege that allows a role to create, alter, and drop other roles (users).
Key Capabilities
A role with CREATEROLE can:
- Create new roles using CREATE ROLE or CREATE USER
- Modify existing roles using ALTER ROLE
- Drop roles using DROP ROLE
- Grant membership in roles it has created
Critical Limitations
CREATEROLE is NOT a superuser privilege. A CREATEROLE role:
- Cannot create superuser roles (unless it also has the SUPERUSER attribute)
- Cannot modify superuser roles
- Cannot grant privileges it doesn't itself possess
- PostgreSQL 16+: Can only manage roles it has ADMIN OPTION for (a breaking change from prior versions)
Syntax
-- Grant CREATEROLE when creating a role
CREATE ROLE manager WITH CREATEROLE LOGIN PASSWORD 'secret';
-- Grant CREATEROLE to existing role
ALTER ROLE manager WITH CREATEROLE;
-- Revoke CREATEROLE
ALTER ROLE manager WITH NOCREATEROLE;
Version-Specific Behavior
- PostgreSQL ≤15: CREATEROLE can modify ANY non-superuser role
- PostgreSQL 16+: CREATEROLE can only modify roles where it has been granted membership with ADMIN OPTION. This improves security but breaks backward compatibility.
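A sketch of the PostgreSQL 16+ behavior (role names are illustrative): when a CREATEROLE role creates another role, it is implicitly granted membership in it with ADMIN OPTION, so it can continue to manage that role but not unrelated ones:
-- Connected as 'manager', a role created WITH CREATEROLE LOGIN
CREATE ROLE intern LOGIN;            -- manager implicitly receives ADMIN OPTION on intern
ALTER ROLE intern PASSWORD 'temp';   -- allowed: manager administers the role it created
DROP ROLE intern;                    -- also allowed
-- ALTER ROLE on a role manager holds no ADMIN OPTION for would fail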
Source
PostgreSQL Official Documentation: Database Roles and CREATE ROLE
PostgreSQL Role
A PostgreSQL role is a database entity that can own database objects and have database privileges. Roles consolidate the concepts of "users" and "groups" from older PostgreSQL versions (pre-8.1).
Key Characteristics
- LOGIN capability: Roles can have the LOGIN attribute (making them equivalent to "users") or not (equivalent to "groups")
- Ownership: Roles can own database objects (tables, functions, schemas, etc.)
- Membership: Roles can be members of other roles (inheritance of privileges)
- Privileges: Roles can be granted specific privileges on database objects
Creation Syntax
-- Role with login (a "user")
CREATE ROLE myuser WITH LOGIN PASSWORD 'securepassword';
-- Alternative shorthand
CREATE USER myuser WITH PASSWORD 'securepassword';
-- Role without login (a "group")
CREATE ROLE readonly;
-- Role with specific attributes
CREATE ROLE admin WITH LOGIN PASSWORD 'pass' CREATEDB CREATEROLE;
Common Attributes
- LOGIN / NOLOGIN - Can connect to the database (default: NOLOGIN)
- SUPERUSER / NOSUPERUSER - Bypass all permission checks
- CREATEDB / NOCREATEDB - Can create databases
- CREATEROLE / NOCREATEROLE - Can create other roles
- PASSWORD 'string' - Sets the authentication password
- VALID UNTIL 'timestamp' - Password expiration
Membership Example
-- Create group role
CREATE ROLE analysts NOLOGIN;
-- Grant privileges to group
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analysts;
-- Add user to group
GRANT analysts TO alice;
Source: PostgreSQL 17 Official Documentation - Database Roles (applies to PostgreSQL 8.1+)
Row-Level Security (RLS) in PostgreSQL
Row-Level Security (RLS) is a PostgreSQL feature (available since version 9.5) that allows you to restrict which rows users can access or modify in a table based on security policies.
Core Mechanism
RLS policies are expressions that return boolean values, evaluated for each row. If the policy returns true, the row is visible/modifiable; if false, it's hidden/blocked.
Enabling RLS
-- Enable RLS on a table (requires table owner or superuser)
ALTER TABLE table_name ENABLE ROW LEVEL SECURITY;
Once enabled, no rows are visible to non-superuser roles until you create explicit policies. Superusers and table owners bypass RLS by default.
Creating Policies
-- Policy syntax
CREATE POLICY policy_name ON table_name
[FOR {ALL | SELECT | INSERT | UPDATE | DELETE}]
[TO role_name]
[USING (boolean_expression)] -- rows visible for SELECT/UPDATE/DELETE
[WITH CHECK (boolean_expression)]; -- rows allowed for INSERT/UPDATE
Practical Example
-- Table with user-specific data
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL,
content TEXT
);
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
-- Policy: Users see only their own documents
CREATE POLICY user_documents ON documents
FOR ALL
TO public
USING (user_id = current_setting('app.current_user_id')::INTEGER);
Critical Details
- USING clause: Filters rows for SELECT, UPDATE, DELETE operations
- WITH CHECK clause: Validates rows for INSERT and UPDATE operations. If omitted, the USING expression is applied instead.
- Default deny: With RLS enabled but no policies, all access is denied (except superusers)
- Performance: Policies are evaluated per-row; complex expressions can impact query performance
- FORCE ROW LEVEL SECURITY: Apply RLS even to table owners:
ALTER TABLE table_name FORCE ROW LEVEL SECURITY;
Bypassing RLS
-- BYPASSRLS is a role attribute, not a grantable table privilege (use sparingly)
ALTER ROLE role_name BYPASSRLS;
Source: PostgreSQL Official Documentation v16 - Row Security Policies
https://www.postgresql.org/docs/current/ddl-rowsecurity.html
The pgcrypto extension provides cryptographic functions for PostgreSQL, enabling encryption, hashing, and random data generation directly within the database.
Core Functionality
Installation:
CREATE EXTENSION pgcrypto;
Key Functions:
- Password Hashing (recommended):
-- Hash password with bcrypt (selected via the 'bf' salt type)
SELECT crypt('user_password', gen_salt('bf'));
-- Verify password
SELECT password_hash = crypt('user_input', password_hash) FROM users;
- Encryption (symmetric):
-- Encrypt (AES-128 default)
SELECT pgp_sym_encrypt('sensitive data', 'encryption_key');
-- Decrypt
SELECT pgp_sym_decrypt(encrypted_column, 'encryption_key') FROM table;
- Hashing:
-- SHA-256
SELECT digest('data', 'sha256');
-- Returns bytea, convert to hex if needed
SELECT encode(digest('data', 'sha256'), 'hex');
- Random Data:
-- Cryptographically secure random bytes
SELECT gen_random_bytes(32); -- 32 bytes
-- Random UUID (preferred over uuid-ossp)
SELECT gen_random_uuid();
Available Algorithms
- Symmetric encryption: AES (128/192/256-bit), Blowfish, 3DES
- Hashing: MD5, SHA-1, SHA-224, SHA-256, SHA-384, SHA-512
- Password hashing: bf (Blowfish/bcrypt), md5, xdes, des
Best Practices
- Use gen_salt('bf') for password hashing (automatically includes a salt and multiple rounds)
- gen_random_uuid() is faster than uuid-ossp's uuid_generate_v4()
- Store encrypted data as the bytea type
- pgcrypto has been part of PostgreSQL contrib since version 8.1+
Source: PostgreSQL Official Documentation - pgcrypto module (https://www.postgresql.org/docs/current/pgcrypto.html)
REVOKE Command in PostgreSQL
The REVOKE command removes previously granted privileges from database roles (users or groups). It is the inverse operation of GRANT.
Basic Syntax
REVOKE privilege_type ON object_type object_name FROM role_name;
Common Usage Patterns
Revoke specific privileges:
-- Revoke SELECT privilege on a table
REVOKE SELECT ON TABLE employees FROM user_alice;
-- Revoke multiple privileges
REVOKE SELECT, INSERT, UPDATE ON TABLE products FROM role_sales;
-- Revoke all privileges
REVOKE ALL PRIVILEGES ON TABLE orders FROM user_bob;
Revoke from PUBLIC:
-- Remove default public access
REVOKE ALL ON DATABASE mydb FROM PUBLIC;
Revoke with CASCADE:
-- Revoke and remove dependent privileges granted by this user
REVOKE SELECT ON TABLE customers FROM user_charlie CASCADE;
Key Details
- Privilege Types: SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, CONNECT, TEMPORARY, EXECUTE, USAGE, SET, ALTER SYSTEM, ALL PRIVILEGES
- Object Types: TABLE, SEQUENCE, DATABASE, DOMAIN, FOREIGN DATA WRAPPER, FOREIGN SERVER, FUNCTION, PROCEDURE, ROUTINE, LANGUAGE, LARGE OBJECT, SCHEMA, TABLESPACE, TYPE
- CASCADE: Automatically revokes dependent privileges that other users received through GRANT OPTION
- RESTRICT (default): Refuses to revoke if dependent privileges exist
Grant Option Handling
-- Revoke the ability to grant privileges to others (but keep the privilege itself)
REVOKE GRANT OPTION FOR SELECT ON TABLE data FROM user_dave;
-- Revoke both the privilege and grant option
REVOKE SELECT ON TABLE data FROM user_dave CASCADE;
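To verify the result, the privilege inquiry functions can be queried (a sketch against the same table and role):
SELECT has_table_privilege('user_dave', 'data', 'SELECT');                    -- the privilege itself
SELECT has_table_privilege('user_dave', 'data', 'SELECT WITH GRANT OPTION');  -- the grant option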
Version Compatibility
This syntax is standardized and works consistently across PostgreSQL 9.x through 17.x. The ROUTINE keyword (covering both functions and procedures) was added in PostgreSQL 11.
PostgreSQL Security Policy (Row-Level Security)
A security policy in PostgreSQL is a named rule attached to a table that defines row-level security (RLS) restrictions. Policies control which rows users can view, insert, update, or delete based on SQL expressions evaluated per row.
Core Concepts
Enable RLS on a table:
ALTER TABLE table_name ENABLE ROW LEVEL SECURITY;
Create a policy:
CREATE POLICY policy_name ON table_name
[AS {PERMISSIVE | RESTRICTIVE}]
[FOR {ALL | SELECT | INSERT | UPDATE | DELETE}]
[TO {role_name | PUBLIC | CURRENT_USER | SESSION_USER}]
[USING (condition)] -- For SELECT/UPDATE/DELETE visibility
[WITH CHECK (condition)] -- For INSERT/UPDATE validation
Key Details
Policy types:
- PERMISSIVE (default): Rows matching ANY permissive policy are accessible (OR logic)
- RESTRICTIVE: Rows must pass ALL restrictive policies (AND logic)
Table owners bypass RLS by default unless FORCE ROW LEVEL SECURITY is set; superusers and roles with BYPASSRLS always bypass it.
Multiple policies on the same table are combined with OR (for PERMISSIVE) or AND (for RESTRICTIVE).
Performance: Policy expressions run on every row access - use indexed columns in conditions
Example
-- Only users can see their own data
CREATE POLICY user_isolation ON accounts
FOR ALL
TO public
USING (user_id = current_user)
WITH CHECK (user_id = current_user);
Version Notes
Available since PostgreSQL 9.5. No changes to core syntax through PostgreSQL 17.
Source: PostgreSQL Official Documentation - Row Security Policies
Enabling RLS on a PostgreSQL Table
Use the ALTER TABLE command with the ENABLE ROW LEVEL SECURITY clause:
ALTER TABLE table_name ENABLE ROW LEVEL SECURITY;
Critical details:
Privileges required: You must be the table owner or have superuser privileges.
Effect: Once enabled, RLS blocks ALL access by default (returns zero rows) for non-owner users until you create explicit policies using CREATE POLICY.
Table owners bypass RLS by default. To force owners to obey policies:
ALTER TABLE table_name FORCE ROW LEVEL SECURITY;
- Verification: Check if RLS is enabled:
SELECT tablename, rowsecurity
FROM pg_tables
WHERE tablename = 'table_name';
Returns true if RLS is enabled.
- Disabling RLS:
ALTER TABLE table_name DISABLE ROW LEVEL SECURITY;
Complete example:
-- Enable RLS
ALTER TABLE employees ENABLE ROW LEVEL SECURITY;
-- Create a policy (required for non-owners to access data)
CREATE POLICY employee_policy ON employees
FOR SELECT
USING (user_id = current_user_id());
Source: PostgreSQL official documentation (compatible with PostgreSQL 9.5+, where RLS was introduced).
MD5 authentication in PostgreSQL is a deprecated password-based authentication method that stores and transmits MD5-hashed passwords.
How it works:
- Password hashes are stored in the pg_authid system catalog as md5 + MD5(password + username)
- During authentication, the client sends: md5 + MD5(MD5(password + username) + server salt)
- The server compares this with its stored hash using the provided salt
Configuration:
In pg_hba.conf:
# TYPE DATABASE USER ADDRESS METHOD
host all all 0.0.0.0/0 md5
Critical Facts:
- Superseded as the default in PostgreSQL 14 (released September 2021), which switched to scram-sha-256
- Still accepted in current releases (through PostgreSQL 17) but considered legacy; avoid it for new deployments
- Vulnerable to rainbow table attacks and does not use key stretching
- Replaced by scram-sha-256, which is the default in PostgreSQL 14+
Recommended Action:
Use scram-sha-256 instead:
host all all 0.0.0.0/0 scram-sha-256
To migrate existing md5 passwords, users must change their password after switching to scram-sha-256:
ALTER USER username PASSWORD 'newpassword';
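To find roles that still carry MD5-format password verifiers (and therefore need a password reset), a sketch that reads pg_authid (superuser access required):
SELECT rolname
FROM pg_authid
WHERE rolpassword LIKE 'md5%';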
Source: PostgreSQL Official Documentation - Client Authentication (versions 14, 15, 16, 17)
PostgreSQL: Why scram-sha-256 is Recommended Over md5
Direct Answer: scram-sha-256 is recommended over md5 because:
Security: MD5 is cryptographically broken and vulnerable to pre-image attacks. SCRAM-SHA-256 uses SHA-256 hashing with salt and multiple iterations, making it resistant to rainbow table and brute-force attacks.
Password Storage: MD5 stores a simple hash of username + password, which can be cracked offline if the hash is obtained. SCRAM-SHA-256 stores a salted hash with 4096 iterations (default), requiring significant computational effort to attack.
Deprecation: MD5 authentication is considered legacy; PostgreSQL 10 introduced SCRAM-SHA-256, and PostgreSQL 14+ made scram-sha-256 the default authentication method.
Network Security: SCRAM-SHA-256 never transmits the actual password or its hash over the network; it uses a challenge-response mechanism. MD5's response is derived directly from the stored MD5 hash, so a captured or leaked hash is password-equivalent.
Configuration
In pg_hba.conf:
# Recommended
host all all 0.0.0.0/0 scram-sha-256
# Deprecated (avoid)
host all all 0.0.0.0/0 md5
Set in postgresql.conf:
password_encryption = scram-sha-256
Source: PostgreSQL official documentation (version 14+), PostgreSQL Authentication Documentation
Action: Always use scram-sha-256 for new deployments. Migrate existing MD5 installations by updating postgresql.conf and having users reset passwords.
PostgreSQL Row-Level Security (RLS) for Multi-Tenant Applications
Implementation Pattern
Requirement: PostgreSQL 9.5 or later (RLS introduced in 9.5).
1. Database Schema Setup
-- Add tenant identifier to all multi-tenant tables
CREATE TABLE tenants (
tenant_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL
);
CREATE TABLE users (
user_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
email TEXT NOT NULL,
-- other columns
);
CREATE TABLE data (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
content TEXT,
-- other columns
);
2. Enable RLS and Create Policies
-- Enable RLS on multi-tenant tables
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
ALTER TABLE data ENABLE ROW LEVEL SECURITY;
-- Create policy that filters rows by tenant_id
CREATE POLICY tenant_isolation_policy ON users
USING (tenant_id = current_setting('app.current_tenant_id')::UUID);
CREATE POLICY tenant_isolation_policy ON data
USING (tenant_id = current_setting('app.current_tenant_id')::UUID);
3. Set Tenant Context per Connection
-- Set at start of each application session/transaction
SET app.current_tenant_id = 'tenant-uuid-here';
-- Or use LOCAL for transaction-only scope
SET LOCAL app.current_tenant_id = 'tenant-uuid-here';
4. Application Code Example (Node.js with pg)
async function queryAsTenant(client, tenantId, query, params) {
  // SET cannot take bind parameters; use set_config() instead
  // (its third argument true makes the setting transaction-local, like SET LOCAL)
  await client.query("SELECT set_config('app.current_tenant_id', $1, true)", [tenantId]);
  return client.query(query, params);
}
// Usage
const result = await queryAsTenant(
client,
'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11',
'SELECT * FROM data WHERE id = $1',
[dataId]
);
Critical Security Requirements
- Always set tenant context BEFORE queries - If the context is missing, queries either raise an error from current_setting or match no rows (a safe failure mode)
- Use transaction-scoped settings - Use SET LOCAL (or set_config(..., true)) inside transactions to prevent context leakage
- Exempt superusers - RLS is bypassed for superusers; the application must connect as a non-superuser role
- Force RLS for table owners - Add FORCE ROW LEVEL SECURITY if the table owner queries data:
ALTER TABLE data FORCE ROW LEVEL SECURITY;
- Validate tenant_id - Verify the user has access to tenant_id before setting the context:
// Verify first
const hasAccess = await checkUserTenantAccess(userId, tenantId);
if (!hasAccess) throw new Error('Unauthorized tenant access');
// Then set context
await client.query("SELECT set_config('app.current_tenant_id', $1, true)", [tenantId]);
Performance Considerations
- RLS policies add WHERE clauses to queries; ensure tenant_id columns are indexed:
CREATE INDEX idx_users_tenant_id ON users(tenant_id);
CREATE INDEX idx_data_tenant_id ON data(tenant_id);
- Index scan overhead: ~5-10% compared to non-RLS queries (PostgreSQL 14+ optimizer improvements)
Source
PostgreSQL Official Documentation: Row Security Policies (https://www.postgresql.org/docs/current/ddl-rowsecurity.html)
PostgreSQL Authentication Methods
PostgreSQL supports the following authentication methods (configured in pg_hba.conf):
Core Methods
- trust - Allows connection without password (unsafe for production)
- reject - Rejects connection unconditionally
- scram-sha-256 - Challenge-response scheme (default since PostgreSQL 14, recommended)
- md5 - MD5-hashed password (legacy, deprecated since PostgreSQL 14)
- password - Clear-text password (unsafe, not recommended)
- gss - GSSAPI authentication (Kerberos)
- sspi - SSPI authentication (Windows native)
- ident - Uses OS username mapping
- peer - Uses OS username (Unix-domain sockets only)
- ldap - LDAP server authentication
- radius - RADIUS server authentication
- cert - SSL client certificate authentication
- pam - PAM (Pluggable Authentication Modules)
- bsd - BSD authentication
Configuration Example
# TYPE DATABASE USER ADDRESS METHOD
host all all 127.0.0.1/32 scram-sha-256
local all postgres peer
host all all 0.0.0.0/0 reject
Best Practice
Use scram-sha-256 for password-based authentication (PostgreSQL 10+). It replaced md5 as the secure standard.
Source: PostgreSQL 16 Official Documentation - Client Authentication
https://www.postgresql.org/docs/current/auth-methods.html
Trust Authentication in PostgreSQL
Trust authentication allows any user to connect to PostgreSQL without a password if the connection matches the pg_hba.conf rule. The server grants access solely based on connection parameters (database, username, IP address) without verifying credentials.
Configuration
In pg_hba.conf:
# TYPE DATABASE USER ADDRESS METHOD
host mydb alice 192.168.1.0/24 trust
local all postgres trust
Critical Security Implications
- No authentication performed: Any user who can reach the server and match the rule gains full database access as the specified role
- Use cases: Development environments, localhost-only connections, or environments with network-level security
- Production warning: Never use
trustfor internet-facing or multi-user production systems
How It Works
- Client connects with username
aliceto databasemydbfrom192.168.1.100 - PostgreSQL checks
pg_hba.conftop-to-bottom - First matching rule has method
trust - Server grants access immediately - no password check, no credential validation
Example Connection
# This succeeds even with wrong/no password
psql -h 192.168.1.50 -U alice -d mydb
# Connects immediately without password prompt
Configuration Reload
After modifying pg_hba.conf:
SELECT pg_reload_conf();
Or via shell:
pg_ctl reload -D /var/lib/postgresql/data
Source: PostgreSQL 17 Documentation - Authentication Methods
Best practice: Use trust only for local development or in combination with network isolation (e.g., host all all 127.0.0.1/32 trust for localhost-only access).
PostgreSQL session_user Function
session_user is a SQL standard function that returns the name of the user who initiated the current database session, as a name type (PostgreSQL's internal string type for identifiers).
Key Characteristics
- Stable across SET ROLE: Returns the authenticated username that connected to the database, regardless of any SET ROLE commands executed during the session (only SET SESSION AUTHORIZATION, a superuser-only command, can change it)
- Differs from current_user: While current_user reflects the currently active role (which can change via SET ROLE), session_user shows the session's authenticated user
- No parentheses: Invoked as session_user, without trailing parentheses
Usage Example
-- Show original authenticated user
SELECT session_user;
-- Compare with current_user after role change
SELECT session_user, current_user;
-- Returns: session_user = 'alice', current_user = 'alice'
SET ROLE bob;
SELECT session_user, current_user;
-- Returns: session_user = 'alice', current_user = 'bob'
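By contrast, SET SESSION AUTHORIZATION (superuser-only) does change session_user:
SET SESSION AUTHORIZATION bob;
SELECT session_user, current_user;
-- Returns: session_user = 'bob', current_user = 'bob'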
Return Type
Returns name type (64-byte maximum length identifier in PostgreSQL, though actual limit is NAMEDATALEN - 1, typically 63 bytes).
Source
PostgreSQL Official Documentation: System Information Functions (applies to PostgreSQL 9.x through 17.x with consistent behavior).
PostgreSQL PERMISSIVE vs RESTRICTIVE Policies
In PostgreSQL Row-Level Security (RLS), policy types determine how multiple policies combine:
PERMISSIVE (default)
- Multiple PERMISSIVE policies use OR logic
- A row is accessible if it passes ANY PERMISSIVE policy
- Most common use case: granting different types of access
RESTRICTIVE
- Multiple RESTRICTIVE policies use AND logic
- A row is accessible only if it passes ALL RESTRICTIVE policies
- Used to enforce mandatory conditions that cannot be bypassed
Combined Behavior
When both types exist on a table:
- At least ONE PERMISSIVE policy must pass (if any PERMISSIVE policies exist)
- ALL RESTRICTIVE policies must pass (if any RESTRICTIVE policies exist)
Formula: (PERMISSIVE_1 OR PERMISSIVE_2 OR ...) AND (RESTRICTIVE_1 AND RESTRICTIVE_2 AND ...)
Code Example
-- PERMISSIVE: User can see their own rows OR public rows
CREATE POLICY user_own_data ON documents
FOR SELECT
AS PERMISSIVE -- This is the default
USING (user_id = current_user);
CREATE POLICY public_data ON documents
FOR SELECT
AS PERMISSIVE
USING (is_public = true);
-- RESTRICTIVE: But never show deleted rows, regardless of above policies
CREATE POLICY hide_deleted ON documents
FOR SELECT
AS RESTRICTIVE
USING (deleted_at IS NULL);
In this example, users see rows where (user_id = current_user OR is_public = true) AND (deleted_at IS NULL).
Version: Available since PostgreSQL 9.5 (RLS introduced), RESTRICTIVE keyword added in PostgreSQL 10.
PostgreSQL GRANT Privileges
Use the GRANT command to assign specific privileges on database objects to roles (users/groups).
Basic Syntax
GRANT privilege_type ON object_type object_name TO role_name;
Common Privilege Types
Table privileges:
-- Single privilege
GRANT SELECT ON TABLE users TO readonly_user;
-- Multiple privileges
GRANT SELECT, INSERT, UPDATE ON TABLE users TO app_user;
-- All privileges
GRANT ALL PRIVILEGES ON TABLE users TO admin_user;
Database privileges:
GRANT CONNECT ON DATABASE mydb TO app_user;
GRANT CREATE ON DATABASE mydb TO developer;
Schema privileges:
GRANT USAGE ON SCHEMA public TO app_user;
GRANT CREATE ON SCHEMA public TO developer;
All tables in schema:
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly_user;
-- Apply to future tables
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT ON TABLES TO readonly_user;
Key Privilege Keywords
- SELECT, INSERT, UPDATE, DELETE - Table data operations
- TRUNCATE, REFERENCES, TRIGGER - Table management
- CONNECT - Database connection
- CREATE - Create objects in a database/schema
- USAGE - Use schema objects
- EXECUTE - Call functions/procedures
- ALL PRIVILEGES - All applicable privileges for the object type
Grant with Additional Options
-- WITH GRANT OPTION: allows grantee to grant same privilege to others
GRANT SELECT ON TABLE users TO manager_user WITH GRANT OPTION;
Verification
-- Check table privileges
\dp table_name
-- Check role privileges
\du role_name
Version: PostgreSQL 12+ (syntax stable since PostgreSQL 9.0)
Source: PostgreSQL GRANT Documentation
Certificate Authentication in PostgreSQL
Certificate authentication (cert) is a PostgreSQL authentication method that authenticates clients using SSL/TLS client certificates. PostgreSQL validates the client's X.509 certificate against a trusted Certificate Authority (CA) and maps the certificate's Common Name (CN) to a database user.
How It Works
- SSL Connection Required: The client must connect with sslmode=require or higher
- Certificate Validation: PostgreSQL verifies the client certificate against the CA certificate specified in ssl_ca_file (default: root.crt in the data directory)
- User Mapping: The certificate's CN field must match the PostgreSQL username (or use pg_ident.conf for custom mappings)
Configuration
postgresql.conf:
ssl = on
ssl_ca_file = '/path/to/root.crt'
ssl_cert_file = '/path/to/server.crt'
ssl_key_file = '/path/to/server.key'
pg_hba.conf:
# TYPE DATABASE USER ADDRESS METHOD OPTIONS
hostssl all all 0.0.0.0/0 cert clientcert=verify-full
- clientcert=verify-full (PostgreSQL 12+): Requires a valid client certificate
- clientcert=verify-ca (PostgreSQL 12+): Validates the certificate chain only
- Legacy versions use the cert method, which implies the certificate requirement
Client Connection
psql "host=dbserver dbname=mydb user=john sslmode=verify-full sslcert=/path/to/client.crt sslkey=/path/to/client.key sslrootcert=/path/to/root.crt"
Critical: The CN in the client certificate MUST be john (matching the database username), or you must configure pg_ident.conf for custom mapping.
Source: PostgreSQL 16 Documentation - Authentication Methods: Certificate Authentication
PostgreSQL Ident Authentication
Ident authentication is a method where PostgreSQL verifies a user's identity by querying an Ident service (RFC 1413) running on the client's machine. The Ident service reports which OS user owns the TCP connection to PostgreSQL.
How It Works
- Client connects to PostgreSQL from IP address X, port Y
- PostgreSQL connects to the Ident service on X:113
- Ident service reports the OS username that owns the connection from port Y
- PostgreSQL maps this OS username to a PostgreSQL role using an optional user map
Configuration in pg_hba.conf
# TYPE DATABASE USER ADDRESS METHOD
host mydb pguser 192.168.1.0/24 ident
host mydb pguser 192.168.1.0/24 ident map=mymap
Key Specifications
- Protocol: RFC 1413 Identification Protocol
- Port: TCP 113 (on client machine)
- Timeout: 10 seconds (default ident_timeout in postgresql.conf)
- Connection types: Only works with TCP/IP connections (host or hostssl)
- Trust level: LOW - the client's Ident service can be spoofed or unreliable
User Mapping (Optional)
Define in pg_ident.conf:
# MAPNAME SYSTEM-USERNAME PG-USERNAME
mymap john pguser
mymap jane pguser
Important Limitations
- Does NOT work with local Unix socket connections (use peer instead)
- Not secure - relies on a client-side service that can be manipulated
- Rarely used in production - primarily for controlled LAN environments
- Deprecated on Windows - Windows has no standard Ident service
Modern Alternative
For local connections, use peer authentication instead, which directly checks the OS user via the Unix socket without network calls.
Source: PostgreSQL 17 Official Documentation, Section 21.9 "Ident Authentication"
PostgreSQL ssl Setting
The ssl setting in postgresql.conf is a boolean parameter that controls whether the PostgreSQL server accepts SSL/TLS encrypted connections.
Configuration
Parameter: ssl
Type: Boolean
Values: on | off | true | false | yes | no | 1 | 0
Default: off (the server must also be built with SSL support before ssl = on will work)
Context: Requires server restart (postmaster)
# postgresql.conf
ssl = on
Requirements
When ssl = on, PostgreSQL requires these files in the data directory:
- server.crt - Server certificate
- server.key - Private key (permissions must be 0600 or less)
Without these files, the server will fail to start if ssl = on.
Behavior
- ssl = on: Server accepts both SSL and non-SSL connections (unless pg_hba.conf enforces hostssl)
- ssl = off: Server rejects all SSL connection attempts
Related Settings
When enabling SSL, configure:
- ssl_cert_file - Path to the server certificate (default: server.crt)
- ssl_key_file - Path to the private key (default: server.key)
- ssl_ca_file - Certificate authority file for client certificate verification
Verification
# Check if SSL is enabled
psql -h localhost -U postgres -c "SHOW ssl;"
Source: PostgreSQL 16 Official Documentation - Server Configuration: Connections and Authentication
PostgreSQL LOGIN Attribute
The LOGIN attribute is a role privilege in PostgreSQL that grants the ability to connect to the database server. It is one of the core role attributes that determines whether a role can be used for authentication.
Key Facts
- Default Behavior: Roles created with CREATE ROLE do NOT have LOGIN by default. Roles created with CREATE USER DO have LOGIN by default (since PostgreSQL 8.1+).
- Syntax: LOGIN to grant, NOLOGIN to revoke
Code Examples
-- Create a role WITHOUT login capability (cannot connect)
CREATE ROLE readonly_role;
-- Create a role WITH login capability (can connect)
CREATE ROLE app_user LOGIN PASSWORD 'secure_password';
-- Equivalent using CREATE USER (LOGIN is implicit)
CREATE USER app_user PASSWORD 'secure_password';
-- Grant LOGIN to an existing role
ALTER ROLE readonly_role LOGIN;
-- Revoke LOGIN from a role
ALTER ROLE app_user NOLOGIN;
Technical Details
- Only roles with LOGIN can be used as the authentication role in connection strings
- Roles without LOGIN are typically used as groups for privilege management
- A role can have LOGIN but still be prevented from connecting via pg_hba.conf rules
- The LOGIN attribute is stored in the pg_authid.rolcanlogin system catalog column
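A quick way to check the attribute across all roles:
SELECT rolname, rolcanlogin
FROM pg_roles
ORDER BY rolname;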
Source
PostgreSQL Official Documentation: Database Roles (Current through PostgreSQL 17)
pg_hba.conf Entry Syntax
Each non-comment line in pg_hba.conf follows this exact format:
TYPE DATABASE USER ADDRESS METHOD [OPTIONS]
Field Specifications
TYPE (required): Connection type
- local - Unix-domain socket
- host - TCP/IP (both SSL and non-SSL)
- hostssl - TCP/IP with SSL only
- hostnossl - TCP/IP without SSL
- hostgssenc - TCP/IP with GSSAPI encryption (PostgreSQL 12+)
- hostnogssenc - TCP/IP without GSSAPI encryption (PostgreSQL 12+)
DATABASE (required): Database name(s)
- Specific database name: mydb
- Multiple databases: db1,db2,db3
- All databases: all
- All except specific: all (use a separate deny rule)
- Database file reference: @filename
- sameuser - match database name to username
- samerole - match database to role membership
- replication - for replication connections
USER (required): Role name(s)
- Specific user: postgres
- Multiple users: user1,user2,user3
- All users: all
- Group file reference: @filename
- Role prefix: +rolename (includes members of the role)
ADDRESS (required for host* types, omitted for local):
- CIDR notation: 192.168.1.0/24 or 2001:db8::/32
- Single host: 192.168.1.100/32
- Hostname: example.com (DNS lookup)
- Special keywords: samehost, samenet
METHOD (required): Authentication method
- trust - allow unconditionally
- reject - reject unconditionally
- md5 - MD5 password (legacy, use scram-sha-256)
- scram-sha-256 - SCRAM-SHA-256 password (PostgreSQL 10+, recommended)
- password - plain text password (not recommended)
- peer - OS username matching (local only)
- ident - ident server authentication
- gss - GSSAPI
- sspi - SSPI (Windows)
- pam - PAM
- ldap - LDAP
- radius - RADIUS
- cert - SSL client certificate
OPTIONS (optional): Method-specific options in key=value format
Examples
# Allow local connections via Unix socket using peer authentication
local all all peer
# Allow localhost TCP connections with SCRAM password
host all all 127.0.0.1/32 scram-sha-256
# Allow specific subnet with password
host mydb myuser 192.168.1.0/24 scram-sha-256
# Reject specific user from anywhere
host all baduser 0.0.0.0/0 reject
# Replication connections
host replication replicator 192.168.1.0/24 scram-sha-256
# SSL-only connection with client certificate
hostssl all all 0.0.0.0/0 cert clientcert=verify-full
# With options
host all all 0.0.0.0/0 ldap ldapserver=ldap.example.com ldapbasedn="dc=example,dc=com"
Important Notes
- Entries are processed top-to-bottom; first match wins
- Changes require pg_reload_conf() or pg_ctl reload to take effect
- Comments start with #
- Fields separated by whitespace (spaces/tabs)
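Since PostgreSQL 10, the pg_hba_file_rules view shows the parsed rules (and flags syntax errors) without a reload; a quick check:
SELECT line_number, type, database, user_name, address, auth_method, error
FROM pg_hba_file_rules;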
Source: PostgreSQL 16 Official Documentation - Client Authentication (pg_hba.conf)
How to Hash Passwords in PostgreSQL
Use the pgcrypto extension with the crypt() function and the bcrypt algorithm (bf).
Implementation
1. Enable the extension (once per database):
CREATE EXTENSION IF NOT EXISTS pgcrypto;
2. Hash a password (for INSERT/UPDATE):
-- Hash with bcrypt (default cost factor 6)
SELECT crypt('user_password', gen_salt('bf'));
-- Hash with specific cost factor (8-12 recommended for production)
SELECT crypt('user_password', gen_salt('bf', 10));
3. Verify a password:
-- Returns true if password matches
SELECT crypt('user_password', stored_hash) = stored_hash;
Complete Example
-- Create users table
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username VARCHAR(255) UNIQUE NOT NULL,
password_hash TEXT NOT NULL
);
-- Insert user with hashed password
INSERT INTO users (username, password_hash)
VALUES ('john', crypt('secret123', gen_salt('bf', 10)));
-- Verify login
SELECT id, username
FROM users
WHERE username = 'john'
AND password_hash = crypt('secret123', password_hash);
Key Parameters
- Algorithm:
bf(bcrypt) - industry standard for password hashing - Cost factor: Range 4-31. Higher = slower/more secure. Use 10-12 for production (2025 recommendation)
- Salt: Auto-generated by
gen_salt(), embedded in the hash output - Output: 60-character string (bcrypt standard format)
Why bcrypt?
- Adaptive: Cost factor increases as hardware improves
- Salt automatically included in output
- Resistant to rainbow table and brute-force attacks
- OWASP recommended for password storage
Source: PostgreSQL pgcrypto documentation (v12+), OWASP Password Storage Cheat Sheet
Enabling SSL in PostgreSQL
To enable SSL in PostgreSQL, configure the server with SSL certificates and update postgresql.conf:
Required Steps
- Generate SSL certificates (or use existing ones):
# Self-signed certificate (for testing only)
openssl req -new -x509 -days 365 -nodes -text -out server.crt \
-keyout server.key -subj "/CN=dbserver.example.com"
# Set correct permissions (required)
chmod 600 server.key
chown postgres:postgres server.key server.crt
Place certificates in PostgreSQL data directory (typically
/var/lib/postgresql/data/or/etc/postgresql/[version]/main/)Edit
postgresql.conf:
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'
# Optional but recommended:
ssl_ca_file = 'root.crt' # For client certificate verification
ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL' # Strong ciphers only
ssl_prefer_server_ciphers = on
ssl_min_protocol_version = 'TLSv1.2' # PostgreSQL 12+
- Edit pg_hba.conf to require SSL (optional but recommended):
# Require SSL for all connections
hostssl all all 0.0.0.0/0 scram-sha-256
# Reject non-SSL
hostnossl all all 0.0.0.0/0 reject
- Restart PostgreSQL:
sudo systemctl restart postgresql
# or
sudo pg_ctl restart -D /var/lib/postgresql/data
- Verify SSL is enabled:
SHOW ssl; -- Should return 'on'
SELECT * FROM pg_stat_ssl; -- Shows SSL status per connection
Client Connection
Connect with SSL from clients:
psql "postgresql://user@host/db?sslmode=require"
# Or
psql "host=hostname dbname=mydb user=myuser sslmode=verify-full sslrootcert=root.crt"
SSL modes (in increasing order of security):
- disable - No SSL
- require - SSL required, no certificate verification
- verify-ca - SSL required, verify server certificate
- verify-full - SSL required, verify certificate and hostname (most secure)
Source: PostgreSQL Official Documentation v12-17 (https://www.postgresql.org/docs/current/ssl-tcp.html)
Note: For production, use certificates from a trusted CA, not self-signed certificates.
Creating a PostgreSQL Row Level Security (RLS) Policy
Row Level Security policies are created using CREATE POLICY after enabling RLS on a table.
Step-by-Step Process
1. Enable RLS on the table:
ALTER TABLE table_name ENABLE ROW LEVEL SECURITY;
2. Create the policy:
CREATE POLICY policy_name ON table_name
[AS { PERMISSIVE | RESTRICTIVE }]
[FOR { ALL | SELECT | INSERT | UPDATE | DELETE }]
[TO { role_name | PUBLIC | CURRENT_ROLE | CURRENT_USER | SESSION_USER }]
[USING (condition)]
[WITH CHECK (condition)];
Required Components
- policy_name: Unique identifier for the policy on this table
- table_name: Target table (must have RLS enabled)
- USING clause: Boolean expression determining which rows are visible for SELECT/UPDATE/DELETE
- WITH CHECK clause: Boolean expression determining which rows can be inserted or updated to
Complete Example
-- Enable RLS on the table
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
-- Allow users to see only their own documents
CREATE POLICY user_documents ON documents
FOR ALL
TO PUBLIC
USING (user_id = current_user_id())        -- current_user_id() is assumed to be an application-defined helper, not a built-in
WITH CHECK (user_id = current_user_id());
Key Behaviors
- Policies are PERMISSIVE by default (OR logic between policies)
- RESTRICTIVE policies use AND logic
- Omit FOR to apply to all operations (SELECT, INSERT, UPDATE, DELETE)
- USING applies to SELECT/UPDATE/DELETE; WITH CHECK applies to INSERT/UPDATE
- The table owner bypasses RLS unless FORCE ROW LEVEL SECURITY is set (see the sketch below); superusers and roles with BYPASSRLS always bypass it
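For illustration, a minimal sketch (it assumes the permissive policy above plus a hypothetical archived boolean column) showing a RESTRICTIVE policy combined with FORCE ROW LEVEL SECURITY:
-- Restrictive policy: ANDed with the permissive policies above,
-- so archived rows are never visible, even to their owner
CREATE POLICY not_archived ON documents
    AS RESTRICTIVE
    FOR SELECT
    TO PUBLIC
    USING (archived = false);
-- Apply RLS to the table owner as well (superusers and BYPASSRLS roles still bypass it)
ALTER TABLE documents FORCE ROW LEVEL SECURITY;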
Version: PostgreSQL 9.5+
Source: PostgreSQL Official Documentation - Row Security Policies (https://www.postgresql.org/docs/current/sql-createpolicy.html)
SCRAM-SHA-256 Authentication in PostgreSQL
SCRAM-SHA-256 (Salted Challenge Response Authentication Mechanism using SHA-256) is PostgreSQL's recommended password authentication method since PostgreSQL 10 (2017). It is the default password authentication method in PostgreSQL 14+.
How It Works
SCRAM-SHA-256 uses a challenge-response mechanism where passwords are never transmitted in plaintext or reversible form:
- Server sends a salt and iteration count (default: 4096 iterations)
- Client derives a key using PBKDF2 with SHA-256
- Both sides prove knowledge of the password through cryptographic challenges
- Password hash stored in database cannot be used for authentication without the original password
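As a quick way to see which format a role's stored password uses, the verifier prefix in pg_authid can be inspected (superuser access required); a minimal sketch:
-- SCRAM verifiers start with 'SCRAM-SHA-256$', legacy hashes with 'md5'
SELECT rolname,
       CASE
         WHEN rolpassword LIKE 'SCRAM-SHA-256$%' THEN 'scram-sha-256'
         WHEN rolpassword LIKE 'md5%'            THEN 'md5'
         ELSE 'no password / other'
       END AS password_format
FROM pg_authid
WHERE rolcanlogin;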
Configuration
In pg_hba.conf, specify the authentication method:
# IPv4 local connections:
host all all 127.0.0.1/32 scram-sha-256
# IPv6 local connections:
host all all ::1/128 scram-sha-256
In postgresql.conf, set the password encryption method:
password_encryption = scram-sha-256
Creating Users
When password_encryption = scram-sha-256, new passwords are automatically hashed using SCRAM-SHA-256:
CREATE ROLE myuser WITH LOGIN PASSWORD 'mypassword';
-- or
ALTER ROLE existinguser WITH PASSWORD 'newpassword';
Key Advantages Over md5
- Not vulnerable to rainbow table attacks (uses salt + iterations)
- SHA-256 hashing (cryptographically stronger than MD5)
- Challenge-response prevents password replay attacks
- RFC 5802 and RFC 7677 compliant
Version Requirements
- PostgreSQL 10+: SCRAM-SHA-256 available
- PostgreSQL 14+: Default password method
- Client libraries must support SCRAM-SHA-256 (libpq 10+, most modern drivers)
Source: PostgreSQL Official Documentation - Password Authentication (versions 10-17)
There is no functional difference between a user and a role in PostgreSQL 8.1+. They are the same object.
Key Facts
Since PostgreSQL 8.1 (released 2005), the concepts of "users" and "groups" were unified into a single concept: roles.
- A role is the modern, correct term
- A user is simply a role with the LOGIN attribute
- CREATE USER is an alias for CREATE ROLE with LOGIN included by default
Exact Differences in Commands
-- These are equivalent:
CREATE USER alice WITH PASSWORD 'secret';
CREATE ROLE alice WITH LOGIN PASSWORD 'secret';
-- Without LOGIN, it cannot authenticate:
CREATE ROLE readonly_group; -- Cannot login, acts as group
CREATE USER readonly_group; -- CAN login (LOGIN is default)
In Practice
- Use CREATE ROLE for everything (modern practice)
- CREATE USER exists only for backwards compatibility
- Both create entries in the pg_roles and pg_authid system catalogs
- The pg_user view shows only roles with the LOGIN attribute
Source
PostgreSQL 16 Official Documentation: "Database Roles" chapter (https://www.postgresql.org/docs/16/user-manag.html)
Recommendation: prefer CREATE ROLE ... WITH LOGIN over CREATE USER in new code.
The CREATEDB attribute is a PostgreSQL role attribute that grants permission to create new databases.
Setting the Attribute
When creating a role:
CREATE ROLE myuser WITH CREATEDB;
For existing roles:
ALTER ROLE myuser WITH CREATEDB;
To revoke:
ALTER ROLE myuser WITH NOCREATEDB;
Behavior
- A role with CREATEDB can execute CREATE DATABASE statements
- The role becomes the owner of databases it creates
- CREATEDB does NOT grant permission to drop databases owned by other users
- Superusers always have CREATEDB privileges implicitly
Checking the Attribute
Query pg_roles system catalog:
SELECT rolname, rolcreatedb FROM pg_roles WHERE rolname = 'myuser';
The rolcreatedb column returns true if the role has CREATEDB, false otherwise.
Important Notes
- CREATEDB is distinct from SUPERUSER (less privileged)
- Database owners can drop their own databases regardless of CREATEDB status
- Available in all supported PostgreSQL versions (9.0+, currently 12-17)
Source: PostgreSQL official documentation - Database Roles (CREATE ROLE, ALTER ROLE commands)
Partitioning > Partition Management
28 questions
The FINALIZE option completes a previously canceled or interrupted DETACH PARTITION CONCURRENTLY operation.
The query planner can handle up to a few thousand partitions fairly well, provided that typical queries allow the planner to prune all but a small number of partitions.
DETACH PARTITION CONCURRENTLY acquires only a SHARE UPDATE EXCLUSIVE lock on the parent table during both internal transactions.
Add a valid CHECK constraint to the table that matches the partition constraint before running ATTACH PARTITION. PostgreSQL will use this CHECK constraint to determine that the table need not be scanned to validate the partition constraint.
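A minimal sketch of this technique (table and column names are hypothetical; it assumes measurements is range-partitioned on logdate):
-- Add a CHECK constraint matching the future partition constraint
ALTER TABLE measurements_2025
    ADD CONSTRAINT measurements_2025_check
    CHECK (logdate >= DATE '2025-01-01' AND logdate < DATE '2026-01-01');
-- ATTACH PARTITION can now skip the full validation scan
ALTER TABLE measurements
    ATTACH PARTITION measurements_2025
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- The now-redundant CHECK constraint can be dropped afterwards
ALTER TABLE measurements_2025 DROP CONSTRAINT measurements_2025_check;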
When a new partition is attached and a default partition exists, the default partition is scanned to verify that it contains no records that should belong to the partition being attached.
No, the autovacuum daemon does not issue ANALYZE commands for partitioned tables because partitioned tables do not directly store tuples.
No, DETACH PARTITION CONCURRENTLY cannot be run within a transaction block.
Constraint exclusion only works when the query's WHERE clause contains constants or externally supplied parameters. Comparisons against non-immutable functions like CURRENT_TIMESTAMP cannot be optimized.
Creating an index on a partitioned table automatically creates matching indexes on all existing partitions, and any future partitions will also automatically receive the index.
PostgreSQL 14 added support for REINDEX CONCURRENTLY on partitioned indexes, which automatically reindexes all partition indexes.
No, partition pruning is driven only by the constraints defined implicitly by the partition keys, not by the presence of indexes.
Standard DETACH PARTITION acquires an ACCESS EXCLUSIVE lock on the parent table.
No, partitions cannot have columns that are absent from the parent table.
No, REINDEX on a partitioned table cannot run in a transaction block because it uses multiple transactions to process each partition.
No, you cannot mix temporary and permanent relations in the same partition tree.
A CHECK constraint duplicating the partition constraint is added to the detached partition when using DETACH PARTITION CONCURRENTLY.
A serialization failure error is raised (with SQLSTATE code '40001') when another concurrent UPDATE or DELETE detects row movement.
ATTACH PARTITION acquires a SHARE UPDATE EXCLUSIVE lock on the parent partitioned table.
DETACH PARTITION obtains SHARE locks on tables that reference the partitioned table via foreign keys.
ON DELETE ROW triggers are executed on the source partition, and INSERT triggers are executed on the destination partition. However, all UPDATE triggers are skipped (including AFTER UPDATE row triggers in the source partition and BEFORE/AFTER UPDATE row triggers in the destination partition).
Partition pruning can work during both query planning and query execution. It can prune additional partitions during execution when clause values are not known at planning time (e.g., parameters in PREPARE statements, subquery values, or parameterized values on the inner side of nested loop joins).
No, attempts to use TRUNCATE ONLY on a partitioned table will always return an error because partitioned tables don't contain data themselves—only their partitions do.
No, inheritance parents will only be analyzed if the parent itself is changed. Changes to child tables do not trigger autoanalyze on the parent table.
It is necessary to periodically run a manual ANALYZE on partitioned tables to keep the statistics up to date, especially when the table is first populated or when the distribution of data in partitions changes significantly.
Yes, as of PostgreSQL 12, foreign keys that reference partitioned tables are fully supported.
Tuples are routed in O(log N) complexity when using declarative partitioning.
No, you cannot use CONCURRENTLY when creating new indexes directly on partitioned tables. The workaround is to create an invalid index on the parent using CREATE INDEX ON ONLY, then create concurrent indexes on individual partitions, and attach them using ALTER INDEX ... ATTACH PARTITION.
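A sketch of that workaround (table, partition, and index names are hypothetical):
-- 1. Create an invalid index on the parent only
CREATE INDEX events_created_at_idx ON ONLY events (created_at);
-- 2. Build matching indexes on each partition without blocking writes
CREATE INDEX CONCURRENTLY events_2024_created_at_idx ON events_2024 (created_at);
CREATE INDEX CONCURRENTLY events_2025_created_at_idx ON events_2025 (created_at);
-- 3. Attach them; once every partition's index is attached, the parent index becomes valid
ALTER INDEX events_created_at_idx ATTACH PARTITION events_2024_created_at_idx;
ALTER INDEX events_created_at_idx ATTACH PARTITION events_2025_created_at_idx;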
Yes, if ONLY is not specified, VACUUM processes the table and all its inheritance child tables or partitions. You can use the ONLY keyword to skip processing partitions.
Streaming Replication > Standby Configuration
28 questions
The default value is 512 kB. If specified without units, it is taken as bytes.
%f is replaced by the name of the WAL file to retrieve from the archive, and %p is replaced by the path where the WAL file should be restored on the server. Write %% to embed an actual % character.
The default value is 30 seconds. If specified without units, it is taken as milliseconds. A value of -1 allows indefinite waiting.
standby.signal indicates the server should start as a hot standby and continuously replay WAL. recovery.signal indicates normal archive recovery that stops when all WALs are consumed or recovery_target is reached. If both files are present, standby.signal takes precedence.
Setting archive_mode to always enables the WAL archiver during archive recovery or standby mode. The default on only archives on the primary.
The default value is 10 seconds. This controls the minimum frequency for the standby to report replication progress to the primary.
The default value is replica. The primary server must have wal_level set to replica or logical to enable hot standby.
The user must have the REPLICATION privilege or be a superuser with LOGIN capability.
The default value is 60 seconds. If specified without units, it is taken as milliseconds. Zero disables the timeout.
The default value is 0 MB. When zero, the system doesn't keep any extra WAL segments for standby purposes. If specified without units, it is taken as megabytes.
The valid values are: off, local, remote_write, on (default), and remote_apply.
max_connections, max_prepared_transactions, max_locks_per_transaction, max_wal_senders, and max_worker_processes must all be equal to or larger on the standby than on the primary.
FIRST specifies priority-based synchronous replication (e.g., 'FIRST 2 (s1, s2, s3)' chooses the first 2 based on list order). ANY specifies quorum-based replication (e.g., 'ANY 2 (s1, s2, s3)' waits for any 2 standbys to acknowledge).
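A hedged example of setting each mode at runtime (the standby names s1, s2, s3 are placeholders):
-- Priority-based: wait for the first two standbys in list order
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 2 (s1, s2, s3)';
-- Quorum-based alternative: wait for any two of the listed standbys
-- ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (s1, s2, s3)';
SELECT pg_reload_conf();  -- the parameter can be reloaded without a restart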
A file named standby.signal must be present in the data directory. The file can be empty - only its existence matters, not its contents.
The keyword replication is used in the database field. Replication connections do not specify a specific database name.
The maximum length is 63 characters (NAMEDATALEN). Names exceeding this limit will be silently truncated to 63 characters.
Only lowercase letters (a-z), numbers (0-9), and underscore (_) characters are allowed. No uppercase letters, dashes, or other special characters are permitted.
Data Types - Object Identifiers
28 questions
There is a hard limit of 2³² (4 billion) SQL commands within a single transaction.
cmin stores the command identifier (starting at zero) within the inserting transaction.
No. Unlike xid values, xid8 values increase strictly monotonically and cannot be reused in the lifetime of a database cluster.
Yes, if a constant of an OID alias type appears in a stored expression (such as a column default expression or view), it creates a dependency on the referenced object. The exception is regrole, which is not allowed in stored expressions.
Yes, the OID type can be cast to integer and then manipulated using standard integer operators, though you should beware of possible signed-versus-unsigned confusion.
xmin contains the identity (transaction ID) of the inserting transaction for this row version.
regproc only accepts unique function names and returns NULL if a function exists with the same name but differing signatures. regprocedure requires the complete argument list to uniquely identify overloaded functions.
ctid contains the physical location of the row version within its table (block number and tuple index within block).
Yes, all OID alias types for objects grouped by namespace accept schema-qualified names as input, such as 'myschema.mytable'::regclass.
PostgreSQL provides 11 OID alias types: regclass, regcollation, regconfig, regdictionary, regnamespace, regoper, regoperator, regproc, regprocedure, regrole, and regtype.
to_regproc() returns NULL if a function exists with the same name but differing signatures.
The default_with_oids configuration parameter controlled whether newly-created tables should automatically include an OID column. It was added in PostgreSQL 8.1 and removed in PostgreSQL 12.
OID alias types will display schema-qualified names on output if the object would not be found in the current search path without being qualified.
The OID generator will wrap around to reuse OID values. This does not bother PostgreSQL particularly, but you may start having occasional transaction failures due to duplicate OIDs, as OIDs are not guaranteed to remain globally unique.
The input functions for OID alias types allow whitespace between tokens, and will fold upper-case letters to lower case, except within double quotes.
cmax stores the command identifier within the deleting transaction, or zero for an undeleted row version.
xid is 32 bits wide and wraps around every 4 billion transactions. xid8 is 64-bit, includes the epoch, and therefore does not wrap around during the life of an installation.
The value 0 is reserved and considered an invalid OID in PostgreSQL.
Unary operators are identified by writing NONE for the unused operand.
As of PostgreSQL 8.1, OIDs are no longer added to user tables by default.
PostgreSQL 12 removed support for user-defined tables with OIDs. The WITH OIDS option was completely removed.
In PostgreSQL 12, OID columns in system catalog tables became normal columns that are always visible. Previously in PostgreSQL 11 and earlier, they were special 'hidden' columns that needed to be selected explicitly.
The oid type is currently implemented as an unsigned four-byte integer.
Routine Maintenance > REINDEX
28 questions
If REINDEX CONCURRENTLY fails to drop the old index, it is suffixed with _ccold (or _ccold1, _ccold2, etc. if that name exists).
No, indexes on TOAST tables are rebuilt but not moved to the new tablespace.
If not specified, the postgres database will be used, or if that does not exist, template1 will be used.
REINDEX takes an ACCESS EXCLUSIVE lock on the specific index being processed.
PostgreSQL 14 added bottom-up index tuple deletion to reduce unnecessary page splits, index bloat, and fragmentation of heavily updated indexes.
MAINTAIN privilege on the table is required to reindex a single index or table.
1) Adds a transient index to pg_index with a SHARE UPDATE EXCLUSIVE session lock; 2) First pass builds the index and sets indisready to true; 3) Second pass adds tuples inserted during the first pass; 4) Renames indexes, switches indisvalid flags, invalidates caches; 5) Sets the old index's indisready to false and waits for running queries; 6) Drops the old indexes and releases session locks.
There is no short form; only the long form --concurrently is available.
During REINDEX operations, the search_path is temporarily changed to pg_catalog, pg_temp.
B-tree index pages that have become completely empty are reclaimed for re-use. However, there is still a possibility of inefficient use of space when most but not all keys on a page are deleted.
No, REINDEX locks out writes but not reads of the index's parent table.
No, these forms of REINDEX cannot be executed inside a transaction block.
Drop the _ccnew index using DROP INDEX, then attempt REINDEX CONCURRENTLY again, as the _ccnew suffix indicates the transient index created during the concurrent operation that failed to complete.
reindexdb is a wrapper around the SQL command REINDEX that provides a command-line interface for rebuilding indexes.
No, for temporary tables, REINDEX is always non-concurrent, as no other session can access them, and non-concurrent reindex is cheaper.
Set the PGOPTIONS environment variable to -P before starting the client (e.g., export PGOPTIONS="-P" then psql).
It opens 4 connections to the database, so max_connections must be high enough to accommodate all connections.
No, regular and concurrent builds cannot occur simultaneously on the same table.
No, REINDEX SYSTEM does not support CONCURRENTLY since system catalogs cannot be reindexed concurrently.
Simply drop the _ccold index, as it corresponds to the original index that failed to be dropped, meaning the rebuild itself was successful.
REINDEX CONCURRENTLY manipulates the indisready and indisvalid flags in the pg_index catalog table.
Exclusion constraint indexes are skipped during REINDEX CONCURRENTLY (though they can be reindexed without the CONCURRENTLY option).
REINDEX CONCURRENTLY requires more total work and may impose extra CPU, memory, and I/O load on the database server.
Support for partitioned tables and indexes in REINDEX was added in PostgreSQL 14.
REINDEX CONCURRENTLY must perform two scans of the table for each index that needs to be rebuilt.
Server Configuration > Lock Management
28 questions
No restart required. It has a context of 'superuser' and can be changed at runtime via configuration reload (SIGHUP) or with SET command.
Unlike statement_timeout, lock_timeout can only occur while waiting for locks. It does not abort statements for other reasons.
It can be changed with a configuration reload (SIGHUP) without requiring a server restart. Context is 'sighup'.
On standby servers, max_pred_locks_per_transaction must be set to the same or higher value as on the primary server. Otherwise, queries will not be allowed in the standby server.
No. PostgreSQL stores row locks on the row itself (on disk), not in shared memory. This is why there is no limit on the number of rows that can be locked.
No. Setting lock_timeout in postgresql.conf is not recommended because it would affect all sessions. It is recommended to set it at session or role level instead.
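For illustration, setting it at session or role level instead (the role name is hypothetical):
-- Session level: abort any statement that waits more than 5 seconds for a lock
SET lock_timeout = '5s';
-- Role level: applies to future sessions of this role
ALTER ROLE migration_user SET lock_timeout = '10s';
-- Clear the session-level setting again
RESET lock_timeout;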
Milliseconds. If specified without units, the value is taken as milliseconds.
Minimum: 1 millisecond (1ms), Maximum: 2147483647 milliseconds (approximately 2.1 billion milliseconds or about 24.8 days)
It can be changed with a configuration reload (SIGHUP) without requiring a server restart. Context is 'sighup'.
No. PostgreSQL does not perform lock escalation. Row locks are stored directly on rows, avoiding the need for escalation and preventing unpredictable deadlocks.
It can be changed without a server restart via configuration reload (SIGHUP) or ALTER SYSTEM. Context is 'superuser'.
When log_lock_waits is enabled, a log message is produced when a session waits longer than deadlock_timeout to acquire a lock.
Only at server start. This parameter requires a postmaster restart to take effect.
It controls how many rows on a single page can be predicate-locked before the lock is promoted to covering the whole page.
max_pred_locks_per_transaction × (max_connections + max_prepared_transactions)
No. This is not the number of rows that can be locked; that value is unlimited. The parameter limits object-level locks (tables, indexes, etc.), not row-level locks.
Only at server start. This parameter requires a postmaster restart to take effect.
Milliseconds. If specified without units, the value is taken as milliseconds.
AccessExclusiveLocks held by prepared transactions will require twice the normal number of lock table entries at the end of recovery.
Extensions > hstore
28 questions
It returns NULL. A subscripted fetch returns NULL if the subscript is NULL or that key does not exist in the hstore.
Both keys and values are simply text strings. No other data types are supported natively.
populate_record(anyelement, hstore) populates a record type with values from an hstore. The first argument must be a null value cast to the target record type, and the function will populate the record with matching key-value pairs from the hstore.
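A minimal sketch using a hypothetical composite type (requires the hstore extension):
CREATE TYPE person AS (name text, age int);
-- Keys matching the record's field names populate those fields
SELECT populate_record(NULL::person, 'name=>Alice, age=>30'::hstore);
-- Result: (Alice,30)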
PostgreSQL 9.1. The CREATE EXTENSION command was introduced in PostgreSQL 9.1. Prior to 9.1, hstore had to be installed manually via SQL scripts.
Yes, hstore columns can be declared UNIQUE using btree or hash indexes. However, the sort ordering for hstore values is not particularly useful; these indexes are mainly valuable for equivalence lookups.
The - operator with a text argument deletes a single key from the hstore and returns the modified hstore.
PL/Perl (via hstore_plperl and hstore_plperlu extensions, mapping to Perl hashes) and PL/Python (via hstore_plpython3u extension, mapping to Python dictionaries).
If duplicate keys are declared, only one will be stored in the hstore and there is no guarantee as to which will be kept. The behavior is non-deterministic.
The -> operator with a text[] (text array) argument returns an array of values corresponding to the specified keys.
The <@ operator checks if the left hstore is contained in the right hstore. It's the inverse of the @> operator.
hstore_to_matrix() returns a 2D text[] array where each row is a key-value pair.
hstore_to_json() converts all hstore values to JSON strings. hstore_to_json_loose() attempts to distinguish numbers and booleans from plain strings, converting them to appropriate JSON types.
The ? operator checks if a key exists in the hstore (regardless of its value). The @> operator checks if the left hstore contains the right hstore, meaning all key-value pairs must match.
You can force an upgrade with: UPDATE tablename SET col = col || ''; or by using ALTER TABLE with a TYPE change.
A subscripted update fails if the subscript is NULL. However, a subscripted fetch returns NULL if the subscript is NULL or that key does not exist.
No, the order of the pairs is not significant and may not be reproduced on output.
The defined(hstore, text) function checks if a key exists and has a non-NULL value. It returns true only if the key exists and its value is not NULL.
The #= operator replaces fields in a record from matching values in the hstore. It modifies a record using key-value pairs from the hstore, updating only the fields specified in the hstore while leaving others unchanged.
Yes, on output, double quotes always surround keys and values, even when it's not strictly necessary.
The || operator concatenates two hstore values. When keys overlap, the right-side values overwrite the left-side values for duplicate keys.
akeys() returns keys as a text array (text[]), while skeys() returns keys as a set (SETOF text). The same difference applies to avals() vs svals() for values.
In PostgreSQL 8.4, neither the key nor the value string could exceed 65535 bytes in length; an error would be thrown if this limit was exceeded.
You can use the hstore(text, text) function to create a single key-value pair, for example: hstore('key', 'value').
Yes, hstore values can be used in GROUP BY, ORDER BY, or DISTINCT expressions, but the sort ordering for hstore values is not particularly useful.
Data Types - JSON > JSON vs JSONB
28 questions
The operators that support GIN indexing are: @> (contains), <@ (contained in), ? (key exists), ?| (any keys exist), ?& (all keys exist), @? (jsonpath exists), and @@ (jsonpath predicate). The concatenation (||) and deletion operators (-, #-) do not support indexing.
The jsonb_ops operator class supports: ? (key exists), ?| (any keys exist), ?& (all keys exist), @> (contains), @? (jsonpath exists), and @@ (jsonpath predicate).
B-tree and hash indexes on JSONB are usually useful only if it's important to check equality of complete JSON documents.
1 GB (2³⁰ - 1 bytes). This limit applies to all TOAST-able data types.
It generates an object containing the union of their keys, taking the second object's value when there are duplicate keys.
JSONB stores data in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process since no reparsing is needed.
Objects are compared first by pair count (more pairs rank higher), then by key-value pairs in storage order. Shorter keys are stored before longer keys.
By default this value is chosen so that four chunk rows will fit on a page, making it about 2000 bytes.
Normally 2 kB (2000 bytes). The TOAST management code is triggered when a row value exceeds this threshold.
jsonb_path_ops produces no index entries for JSON structures not containing any values, such as {"a": {}}. Searching for such structures will require a full-index scan, which is quite slow.
PostgreSQL 9.2 (released in 2012). The JSON type stores JSON data as validated text according to RFC 4627.
They return the same type as their left-hand input (either json or jsonb).
The major practical difference is one of efficiency. JSON must be reparsed on each execution by processing functions, while JSONB's binary format is significantly faster to process since no reparsing is needed.
The performance of queries gets substantially slower (2-10×) for values larger than about 2 kB, due to how Postgres stores long variable-length data (TOAST).
Also normally 2 kB (2000 bytes), adjustable per table. The TOAST code will compress and/or move field values out-of-line until the row is shorter than this target.
Arrays are ordered by element count first (more elements rank higher), then by element comparison in sequence.
jsonb_ops creates independent index items for each key and value in the data, while jsonb_path_ops creates index items only for each value in the data (each item is a hash of the value and the key(s) leading to it).
GIN indexes only support Bitmap Index Scans (not Index Scan or Index Only Scan), because they only store parts of the row values in each index page.
Yes, the JSON type preserves the order of keys within JSON objects.
Yes, the JSON type preserves semantically-insignificant white space between tokens.
Object > Array > Boolean > Number > String > Null, with one exception: empty top-level arrays sort below null for historical reasons.
A jsonb_path_ops index is usually much smaller than a jsonb_ops index over the same data.
The jsonb_path_ops operator class supports only: @> (contains), @? (jsonpath exists), and @@ (jsonpath predicate). It does not support the key-exists operators (?, ?|, ?&).
JSONB supports GIN indexes (with jsonb_ops and jsonb_path_ops operator classes), B-tree indexes, and hash indexes.
User-Defined Functions
27 questions
LEAKPROOF indicates that the function has no side effects and reveals no information about its arguments other than by its return value. For example, a function that throws an error for some argument values but not others is not leakproof.
Mislabeling can cause stale values being re-used during subsequent uses of the plan, particularly problematic with prepared statements. There is a big difference if the plan is saved and reused later.
Type modifiers in parentheses (like the precision field for type numeric) are discarded by CREATE FUNCTION.
No, you cannot change the name already assigned to any input parameter (although you can add names to parameters that had none before).
PARALLEL UNSAFE is the default. All user-defined functions are assumed to be parallel unsafe unless otherwise marked.
100 units is assumed for functions in all other languages (non-C and non-internal).
CALLED ON NULL INPUT is the default, which means the function will be called normally when some of its arguments are null, and it is the function author's responsibility to check for null values if necessary.
Arguments are referenced using the syntax $n where $1 refers to the first input argument, $2 to the second, and so on.
Each position declared as anyelement is allowed to have any specific actual data type, but in any given call they must all be the same actual type.
100 arguments. This is defined by FUNC_MAX_ARGS and cannot be changed without recompiling PostgreSQL.
Yes, STRICT is exactly equivalent to RETURNS NULL ON NULL INPUT. Both keywords do the same thing - they indicate that the function always returns null whenever any of its arguments are null.
Use STABLE for functions that cannot modify the database and return the same results given the same arguments within a single statement, but may change across statements. For example, current_timestamp qualifies as STABLE since its value doesn't change within a transaction.
No. OUT and INOUT parameters cannot be used together with RETURNS TABLE notation.
No. Transaction control commands (COMMIT, SAVEPOINT) and utility commands (VACUUM) cannot be used in SQL functions.
The function will still be called if the variadic array has null elements. The strictness check tests that the variadic array as a whole is non-null, not its individual elements.
No. It is not possible to change the argument types of a function using CREATE OR REPLACE FUNCTION. If you tried, you would actually be creating a new, distinct function.
No. SQL function arguments can only be used as data values, not as identifiers. You cannot use INSERT INTO $1 VALUES (42) to dynamically reference table names.
The body must be written as a string constant. It is usually most convenient to use dollar quoting for this purpose.
LEAKPROOF affects how the system executes queries with row-level security or security_barrier views. The system will enforce security conditions before any user-supplied conditions with non-leakproof functions, preventing inadvertent data exposure.
The column name takes precedence. To reference the argument, you must qualify it with the function name: function_name.argument_name.
SECURITY INVOKER is the default, which indicates that the function is executed with the privileges of the user that calls it.
Yes. The RETURNS TABLE notation is specified in recent versions of the SQL standard, and thus may be more portable than using SETOF.
A parameter marked VARIADIC matches one or more occurrences of its element type, not of its own type. For example, VARIADIC arr numeric[] matches occurrences of numeric, not numeric[].
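A small sketch illustrating this with a hypothetical sum-like function:
-- VARIADIC arr numeric[] matches any number of numeric arguments
CREATE FUNCTION my_sum(VARIADIC arr numeric[]) RETURNS numeric
    LANGUAGE sql
    AS 'SELECT COALESCE(sum(v), 0) FROM unnest(arr) AS v';
SELECT my_sum(1, 2, 3.5);            -- passed as individual numeric values
SELECT my_sum(VARIADIC ARRAY[1, 2]); -- or as an explicit array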
All input parameters following a parameter with a default value must also have default values. This ensures that arguments can be omitted from the end without ambiguity in positional notation.
Functions should be marked PARALLEL RESTRICTED if they: access temporary tables, reference client connection state, use cursors or prepared statements, or interact with backend-local state (e.g., setseed, random functions).
Performance Tips > Join Strategies
27 questions
This parameter makes the planner use genetic query optimization to plan queries with at least this many FROM items involved.
8MB. Hash-based operations use work_mem (4MB default) multiplied by hash_mem_multiplier (2.0 default), resulting in 8MB.
- This sets the planner's estimate of the cost of launching parallel worker processes.
With comma-separated tables in FROM, the planner is free to join the given tables in any order. With explicit JOIN syntax (except FULL JOINs), the planner has constraints on join ordering. For example, with LEFT JOIN, the planner has no choice of join order.
2.0, which makes hash-based operations use twice the usual work_mem base amount
- This sets the maximum number of workers that can be started by a single Gather or Gather Merge node.
8 (by default, this variable is set the same as from_collapse_limit)
8 megabytes (8MB). This sets the minimum amount of table data that must be scanned in order for a parallel scan to be considered.
512 kilobytes (512kB). This sets the minimum amount of index data that must be scanned in order for a parallel scan to be considered.
In a parallel hash join, the inner side is a parallel hash that divides the work of building a shared hash table over the cooperating processes.
The outer relation is the join's first (left) child that drives the join. The inner relation is the second (right) relation that is scanned once for every row found in the outer relation.
Hash join requires the join operator to represent equality. The assumption is that the join operator can only return true for pairs of values that hash to the same hash code. To be marked HASHES, the operator must represent some form of equality for some data type or pair of data types.
4 gigabytes (4GB). If BLCKSZ is not 8kB, the default value scales proportionally to it.
The three join strategies are: nested loop join (where the right relation is scanned once for every row found in the left relation), hash join (where the right relation is first scanned and loaded into a hash table using join attributes as hash keys), and merge join (where both relations are sorted on the join attributes before the join starts, then scanned in parallel).
Partitionwise join currently applies only when the join conditions include all the partition keys, which must be of the same data type and have one-to-one matching sets of child partitions.
0 to 10000. Set it to DEFAULT to revert to using the system default statistics target (default_statistics_target).
No, it is impossible to suppress nested-loop joins entirely. Turning this variable off discourages the planner from using one if there are other methods available, but nested loops may still be used when no other join method is viable (e.g., for non-equality join conditions).
- Setting join_collapse_limit to 1 prevents any reordering of explicit JOINs.
No, in a nested loop join, the inner side is always non-parallel. However, the outer tuples and thus the loops that look up values in the index are divided over the parallel workers, making it efficient if the inner side is an index scan.
Indexes > GIN Indexes
27 questions
Searches must scan the list of pending entries in addition to searching the regular index, so a large list of pending entries will slow searches significantly.
Yes. The GIN layer takes care of concurrency, logging, and searching the tree structure. GIN indexes include Write-Ahead Logging (WAL) for recoverability and replication.
Yes. GIN indexes support building indexes in parallel, along with B-tree and BRIN. This feature is available starting in PostgreSQL 18.
gin_trgm_ops supports trigram-based index searches for LIKE, ILIKE, ~ (regex), ~* (case-insensitive regex), and = queries, in addition to similarity operators (%, <%, %>, <<%, %>>).
No. Unlike B-tree or GiST, GIN index search effectiveness is the same regardless of which index column(s) the query conditions use.
Yes. GIN indexes support multiple-key-column indexes. Multicolumn GIN indexes are implemented by building a single B-tree over composite values (column number, key value).
jsonb_ops supports the key-exists operators (?, ?|, and ?&), while jsonb_path_ops does not support these operators.
GIN indexes support two storage parameters: fastupdate (boolean, default ON) and gin_pending_list_limit (integer in kilobytes).
GIN assumes that indexable operators are strict. A placeholder index entry is created automatically for null item values. As of PostgreSQL 9.1, null key values can be included in the index.
Most updates are fast because they use the pending list, but an update that causes the pending list to exceed gin_pending_list_limit will trigger an immediate cleanup cycle and be much slower than other updates.
The array_ops operator class supports: && (overlap), @> (contains), <@ (contained by), and = (equality).
Yes. GIN indexes support the CONCURRENTLY option, which builds the index without blocking writes to the table.
Each parallel worker must have at least a 32MB share of the total maintenance_work_mem budget, and there must also be a remaining 32MB share for the leader process.
The gin_clean_pending_list() function. It takes the index name as a parameter and returns the number of pending list pages deleted.
No. GIN indexes only support Bitmap Index Scans, not Index Scan or Index-Only Scan, because they only store parts of the row values in each index page.
Disabling fastupdate prevents future insertions from going into the pending list but doesn't flush existing entries. You need to run VACUUM or call gin_clean_pending_list() to empty the pending list.
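A short sketch of draining the pending list after disabling fastupdate (index and table names are hypothetical):
-- Stop future insertions from going into the pending list
ALTER INDEX idx_docs_fts SET (fastupdate = off);
-- Flush entries already sitting in the pending list
SELECT gin_clean_pending_list('idx_docs_fts');
-- Alternatively, VACUUM the table, which also empties the pending list
VACUUM documents;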
Indexing
27 questions
Multicolumn Index in PostgreSQL
A multicolumn index (also called composite index) is a single index defined on multiple columns of a table. PostgreSQL creates one index structure that combines the specified columns in the declared order.
Syntax
CREATE INDEX index_name ON table_name (column1, column2, column3);
How It Works
- Column Order Matters: An index on (a, b, c) can efficiently handle queries filtering on a, (a, b), or (a, b, c), but NOT queries filtering only on b, c, or (b, c).
- Maximum Columns: PostgreSQL supports up to 32 columns in a multicolumn index (limited by the INDEX_MAX_KEYS compile-time constant).
- Supported Index Types: B-tree, GiST, GIN, and BRIN support multicolumn indexes; hash indexes do not.
Example
-- Create multicolumn index
CREATE INDEX idx_employee_dept_salary ON employees (department_id, salary);
-- This query USES the index efficiently
SELECT * FROM employees WHERE department_id = 5 AND salary > 50000;
-- This query also USES the index (leading column only)
SELECT * FROM employees WHERE department_id = 5;
-- This query CANNOT use the index (skips leading column)
SELECT * FROM employees WHERE salary > 50000;
Critical Detail
For B-tree indexes (the default), equality constraints on the leading columns, plus an inequality constraint on the first column without an equality constraint, bound the portion of the index that is scanned. Constraints on columns to the right of the inequality are still checked against the index entries, but they do not reduce the scanned range.
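For example (a hypothetical index extending the earlier one with a hire_date column):
CREATE INDEX idx_emp_dept_hire_salary ON employees (department_id, hire_date, salary);
-- department_id = 5 and hire_date > ... bound the scanned index range;
-- the salary condition is checked against each index entry but does not
-- narrow the range, because it follows the inequality on hire_date
SELECT * FROM employees
WHERE department_id = 5
  AND hire_date > DATE '2024-01-01'
  AND salary > 50000;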
Source: PostgreSQL 16 Documentation, Section 11.3 "Multicolumn Indexes"
REINDEX CONCURRENTLY in PostgreSQL
REINDEX CONCURRENTLY (available in PostgreSQL 12+) rebuilds indexes without blocking concurrent INSERT, UPDATE, DELETE, and SELECT operations on the table.
Syntax
REINDEX INDEX CONCURRENTLY index_name;
REINDEX TABLE CONCURRENTLY table_name;
REINDEX SCHEMA CONCURRENTLY schema_name;
REINDEX DATABASE CONCURRENTLY database_name;
Key Differences from Regular REINDEX
Regular REINDEX:
- Acquires ACCESS EXCLUSIVE lock (blocks all operations)
- Faster execution
- Single transaction
REINDEX CONCURRENTLY:
- Acquires SHARE UPDATE EXCLUSIVE lock (allows SELECT/INSERT/UPDATE/DELETE)
- Takes roughly twice as long (builds the new index alongside the old one)
- Multiple transactions (can leave invalid indexes if interrupted)
- Cannot run inside a transaction block
- Cannot reindex system catalogs or TOAST tables
Limitations
Cannot reindex:
- System catalog indexes
- Indexes on TOAST tables
- Invalid indexes (use regular REINDEX instead)
Failure handling:
- If interrupted, leaves behind invalid indexes with a _ccnew or _ccold suffix
- These must be cleaned up manually with DROP INDEX
Space requirements:
- Temporarily uses ~2x disk space (old + new index)
Example Usage
-- Reindex a single index without blocking writes
REINDEX INDEX CONCURRENTLY users_email_idx;
-- Reindex all indexes on a table
REINDEX TABLE CONCURRENTLY users;
-- Check for failed concurrent reindex attempts
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;
Version requirement: PostgreSQL 12.0+
Source: PostgreSQL 12+ official documentation (REINDEX command reference)
CREATE INDEX CONCURRENTLY has several significant limitations:
Cannot run inside a transaction block - Unlike regular CREATE INDEX, it must be run as a standalone statement, not within BEGIN/COMMIT.
Cannot be used on partitioned tables - Concurrent index builds are not supported for partitioned tables. Workaround: build indexes on each partition individually, then create the partitioned index non-concurrently.
Only one concurrent index build per table - You cannot run multiple CREATE INDEX CONCURRENTLY commands on the same table simultaneously, nor can you run other schema modifications on that table during the build.
Requires two full table scans - The process scans the entire table twice and waits for all existing transactions to terminate between scans, making it significantly slower than regular index creation.
Higher resource consumption - Requires more total work, CPU, and I/O than standard index builds, potentially slowing other operations.
Leaves invalid indexes on failure - If the build fails (due to deadlock, uniqueness violation, or expression errors), it leaves behind an "invalid" index that:
- Consumes update overhead despite being unusable
- Must be manually dropped and recreated
- May continue enforcing uniqueness constraints even though invalid
Does not work on temporary tables - CREATE INDEX on temporary tables is always non-concurrent since no other session can access them.
Long wait periods - After the second scan, must wait for any transactions with snapshots predating the second scan to complete, including transactions from concurrent index builds on other tables.
Uniqueness violations visible early - For unique indexes, constraint violations may be reported in other queries before the index becomes fully available for use.
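A sketch for finding and cleaning up an invalid index left behind by a failed concurrent build (table and index names are hypothetical):
-- List invalid indexes on a table
SELECT i.indexrelid::regclass AS index_name
FROM pg_index i
WHERE NOT i.indisvalid
  AND i.indrelid = 'users'::regclass;
-- Either drop and rebuild it concurrently...
DROP INDEX CONCURRENTLY users_email_idx;
CREATE INDEX CONCURRENTLY users_email_idx ON users (email);
-- ...or rebuild it in place with a blocking REINDEX
-- REINDEX INDEX users_email_idx;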
PostgreSQL 18 Parallel GIN Index Builds
Yes. PostgreSQL 18 adds support for parallel builds of GIN indexes, significantly reducing index creation time for full-text search and JSONB columns.
Creating a Parallel GIN Index
-- Automatic: PostgreSQL chooses workers based on table size
CREATE INDEX CONCURRENTLY idx_docs_fts
ON documents USING GIN (to_tsvector('english', content));
-- Manual: Force specific worker count
SET max_parallel_maintenance_workers = 4;
CREATE INDEX idx_json_gin ON events USING GIN (payload jsonb_path_ops);
Configuration
-- Global setting for parallel maintenance operations
max_parallel_maintenance_workers = 2 -- default
-- Per-table override
ALTER TABLE documents SET (parallel_workers = 4);
-- Check current settings
SHOW max_parallel_maintenance_workers;
Performance Comparison
| Table Size | Workers | PG17 Time | PG18 Time | Improvement |
|---|---|---|---|---|
| 10 GB | 1 | 45 min | 45 min | - |
| 10 GB | 4 | 45 min | 14 min | 3.2x |
| 100 GB | 4 | 8 hrs | 2.5 hrs | 3.2x |
Common GIN Use Cases
-- Full-text search
CREATE INDEX idx_fts ON articles USING GIN (to_tsvector('english', body));
-- JSONB containment queries
CREATE INDEX idx_jsonb ON events USING GIN (metadata);
-- Array overlap/containment
CREATE INDEX idx_tags ON posts USING GIN (tags);
-- Trigram similarity (pg_trgm)
CREATE INDEX idx_trgm ON users USING GIN (name gin_trgm_ops);
Monitoring Build Progress
SELECT
p.pid,
p.phase,
p.blocks_total,
p.blocks_done,
round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index p;
Note
Parallel index builds were already available for B-tree (since PG11). PostgreSQL 18 extends this to GIN.
Source: PostgreSQL 18 Release Notes
https://www.postgresql.org/docs/18/release-18.html
A covering index in PostgreSQL is an index that contains all columns needed to answer a query, eliminating the need to access the table's heap pages. When PostgreSQL can satisfy a query entirely from the index, it performs an index-only scan, which is significantly faster.
How It Works
PostgreSQL (9.2+, when index-only scans were introduced) automatically uses covering indexes when:
- All columns in the SELECT clause are in the index
- All columns in the WHERE clause are in the index
- The visibility map shows pages are all-visible (regular vacuuming required)
Creating Covering Indexes
Method 1: Include all queried columns in the index
-- Query: SELECT email, name FROM users WHERE user_id = 123;
CREATE INDEX idx_users_covering ON users (user_id, email, name);
Method 2: Use INCLUDE clause (PostgreSQL 11+)
-- Same query, but email/name don't need to be searchable
CREATE INDEX idx_users_include ON users (user_id) INCLUDE (email, name);
The INCLUDE clause is preferred because:
- Non-key columns don't increase index tree depth
- Smaller index size (columns aren't in B-tree nodes)
- Faster lookups when you need to filter on user_id but retrieve other columns
Verification
Check if an index-only scan is used:
EXPLAIN (ANALYZE, BUFFERS)
SELECT email, name FROM users WHERE user_id = 123;
-- Look for "Index Only Scan" in output
-- "Heap Fetches: 0" confirms no table access
Requirements
- Run VACUUM regularly to update the visibility map
- All queried columns must be in the index
- Works with B-tree indexes (default type)
Source: PostgreSQL 16 Documentation - Index-Only Scans and Covering Indexes
GIN Index in PostgreSQL
A GIN (Generalized Inverted Index) is a PostgreSQL index type designed for indexing composite values where a single row can contain multiple keys (e.g., arrays, JSONB, full-text search documents).
How It Works
GIN indexes create a separate index entry for each element/key within a composite value, pointing back to the rows containing that element. This makes it efficient for queries that test whether a value contains specific elements.
Primary Use Cases
- Array containment/overlap queries (@>, &&, <@ operators)
- JSONB queries (@>, ?, ?&, ?| operators)
- Full-text search (@@ operator with tsvector)
- Trigram similarity and LIKE/regex searches via pg_trgm (gin_trgm_ops)
Syntax
-- Array column
CREATE INDEX idx_tags ON articles USING GIN (tags);
-- JSONB column
CREATE INDEX idx_data ON users USING GIN (data);
-- Full-text search
CREATE INDEX idx_fts ON documents USING GIN (to_tsvector('english', content));
-- Multiple columns
CREATE INDEX idx_multi ON table_name USING GIN (col1, col2);
Performance Characteristics
- Slower inserts/updates than B-tree (3-5x slower) due to multiple index entries per row
- Faster searches for containment queries compared to sequential scans
- Larger index size than B-tree (typically 1.5-3x the data size)
- Supported operators vary by data type - check pg_opclass for available operator classes
Key Configuration Parameters
-- Create index with custom parameters
CREATE INDEX idx_name ON table_name USING GIN (column)
WITH (fastupdate = on, gin_pending_list_limit = 4096);
- fastupdate (default: on): Accumulates updates in a pending list before batch insertion
- gin_pending_list_limit (default: 4MB): Maximum size of the pending list before auto-cleanup
Version Notes
- Available since PostgreSQL 8.2
- JSONB GIN indexing added in PostgreSQL 9.4
- The fastupdate pending-list mechanism (enabled by default) was added in PostgreSQL 8.4
PostgreSQL JSONB Indexing Strategies
Choose the right index type based on your query patterns:
Index Types Comparison
| Index Type | Best For | Operators Supported |
|---|---|---|
| GIN (default) | Containment, key existence | @>, ?, ?&, ?| |
| GIN (jsonb_path_ops) | Containment only (smaller, faster) | @> only |
| B-tree (on expression) | Equality on extracted values | =, <, >, etc. |
| Hash (on expression) | Equality only | = |
GIN Index (Most Common)
-- Default GIN: supports all JSONB operators
CREATE INDEX idx_data_gin ON events USING GIN (data);
-- Queries that use this index:
SELECT * FROM events WHERE data @> '{"type": "click"}';
SELECT * FROM events WHERE data ? 'user_id';
SELECT * FROM events WHERE data ?& array['type', 'timestamp'];
GIN with jsonb_path_ops (Optimized)
-- 2-3x smaller, faster for @> only
CREATE INDEX idx_data_pathops ON events USING GIN (data jsonb_path_ops);
-- Only supports containment:
SELECT * FROM events WHERE data @> '{"type": "click"}';
-- Does NOT support:
SELECT * FROM events WHERE data ? 'user_id'; -- Uses seq scan
B-tree on Extracted Value
-- Best for equality/range on specific keys
CREATE INDEX idx_user_id ON events ((data->>'user_id'));
CREATE INDEX idx_timestamp ON events (((data->>'timestamp')::timestamptz));
-- Queries that use this index:
SELECT * FROM events WHERE data->>'user_id' = '12345';
SELECT * FROM events WHERE (data->>'timestamp')::timestamptz > '2025-01-01';
Partial Index (Performance Boost)
-- Index only relevant rows
CREATE INDEX idx_clicks ON events USING GIN (data jsonb_path_ops)
WHERE data->>'type' = 'click';
-- Smaller index, faster for filtered queries
SELECT * FROM events WHERE data @> '{"action": "purchase"}'
AND data->>'type' = 'click';
Decision Guide
Need to search ANY key/value? -> GIN (default)
Only use @> containment? -> GIN (jsonb_path_ops)
Query specific scalar value? -> B-tree expression index
Filter + containment? -> Partial GIN index
Performance Example
-- Before optimization: 1200ms
SELECT * FROM events WHERE data @> '{"user_id": "12345"}';
-- After GIN jsonb_path_ops + partial index: 75ms
CREATE INDEX idx_events_jsonb ON events
USING GIN (data jsonb_path_ops)
WHERE data ? 'user_id';
-- Repeat the partial-index predicate in the query so the planner
-- can prove the partial index applies:
SELECT * FROM events
WHERE data @> '{"user_id": "12345"}'
  AND data ? 'user_id';
Source: PostgreSQL Documentation - JSON Types
https://www.postgresql.org/docs/current/datatype-json.html
Creating a Case-Insensitive Index in PostgreSQL
Use a functional index with the LOWER() or UPPER() function on the text column:
CREATE INDEX idx_username_lower ON users (LOWER(username));
Then query using the same function to utilize the index:
SELECT * FROM users WHERE LOWER(username) = LOWER('JohnDoe');
Key Requirements:
- The query must use the same function (LOWER() or UPPER()) as the index
- Both sides of the comparison should use the function for case-insensitive matching
- Works with all PostgreSQL versions ≥ 7.4
Alternative for PostgreSQL ≥ 12.0:
Use a non-deterministic collation (more efficient):
-- Create collation once
CREATE COLLATION case_insensitive (
provider = icu,
locale = 'und-u-ks-level2',
deterministic = false
);
-- Create index
CREATE INDEX idx_username_ci ON users (username COLLATE case_insensitive);
-- Query (no function needed)
SELECT * FROM users WHERE username = 'JohnDoe' COLLATE case_insensitive;
Best Practice: The LOWER() functional index is the most portable and widely supported solution across PostgreSQL versions.
Source: PostgreSQL Official Documentation on Indexes (v16), Chapter 11.5 (Indexes on Expressions)
GiST Index in PostgreSQL
A GiST (Generalized Search Tree) index is a balanced tree-structured index type in PostgreSQL that provides a framework for implementing custom indexing strategies for complex data types and non-standard search operations.
Key Characteristics
- Template-based infrastructure: GiST is not a single index type but a framework that allows different operator classes to implement custom search strategies
- Lossy indexing: GiST indexes can be lossy—the index may return false positives that need rechecking against the actual table data
- Multi-column support: Supports indexing up to 32 columns (PostgreSQL 12+)
- Page size: Default page size is 8192 bytes
Common Use Cases
- Geometric data types: point, box, circle, polygon
- Full-text search: tsvector data (tsvector_ops operator class)
- Range types: int4range, tsrange, etc. (range_ops operator class)
- Network types: inet, cidr for IP addresses
- PostGIS spatial data: geography, geometry types
Syntax
-- Basic GiST index
CREATE INDEX idx_name ON table_name USING gist (column_name);
-- Multi-column GiST index
CREATE INDEX idx_location ON places USING gist (coordinates, area);
-- GiST with specific operator class
CREATE INDEX idx_tsv ON documents USING gist (content_vector tsvector_ops);
Performance Characteristics
- Build time: Slower than B-tree for initial creation
- Insert/Update: Generally slower than B-tree (requires tree rebalancing)
- Query performance: Optimized for overlap, containment, and proximity operators (
&&,@>,<->, etc.) - Index size: Typically larger than equivalent B-tree indexes
Required Extension
Some GiST operator classes require extensions:
-- For geometric operations (built-in, no extension needed)
-- For full-text search (built-in)
-- For PostGIS spatial types
CREATE EXTENSION postgis;
Version Notes
- Available since PostgreSQL 7.0
- Significant improvements in PostgreSQL 9.1+ (better buffering, faster builds)
- PostgreSQL 9.1+ added support for distance ordering (ORDER BY column <-> value)
Source: PostgreSQL Official Documentation - Chapter 67: GiST Indexes (https://www.postgresql.org/docs/current/gist.html)
Index-Only Scan Requirements in PostgreSQL
An index-only scan can be used when ALL of these conditions are met:
1. Index Must Include ALL Referenced Columns
The index must contain every column referenced in the query (SELECT list, WHERE clause, JOIN conditions, ORDER BY, etc.). This requires either:
- A covering index with all needed columns, OR
- An index that naturally contains the data (e.g., unique indexes include the indexed columns)
-- Query can use index-only scan:
CREATE INDEX idx_users_email_name ON users(email, name);
SELECT email, name FROM users WHERE email = 'user@example.com';
-- Query CANNOT use index-only scan (missing 'status'):
SELECT email, name, status FROM users WHERE email = 'user@example.com';
2. Visibility Map Must Show Pages as All-Visible
PostgreSQL uses a visibility map to track which heap pages contain only tuples visible to all transactions. Pages must be marked all-visible, which happens after:
- VACUUM processes the page
- All transactions that could see old tuple versions have ended
Visibility map coverage can be checked with the pg_visibility extension:
SELECT * FROM pg_visibility('table_name');
3. Index Must Support Index-Only Scans
Not all index types support index-only scans. Supported types (PostgreSQL 12+):
- B-tree (btree) - YES
- GiST - YES (PostgreSQL 9.5+)
- SP-GiST - YES (PostgreSQL 13+)
- BRIN - NO
- Hash - NO (no visibility info stored)
- GIN - NO (doesn't store column values)
4. Planner Must Estimate It's Cheaper
Even if technically possible, PostgreSQL's planner must determine an index-only scan is more efficient than alternatives. Check with:
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
Look for Index Only Scan in the plan. If you see "Heap Fetches: N" with N > 0, the visibility map had gaps requiring heap access.
Key Insight
The visibility map is the most common blocker. Even with a perfect covering index, newly inserted/updated rows require heap verification until VACUUM marks pages all-visible. Autovacuum helps, but high-churn tables may rarely achieve full index-only scan efficiency.
Version: PostgreSQL 9.2+ (index-only scans introduced), verified through PostgreSQL 16.
PostgreSQL Visibility Map
The visibility map (VM) is a bitmap structure that tracks which pages in a table contain only tuples that are visible to all active transactions. Each heap table has one visibility map stored as a separate fork file with the suffix _vm.
Purpose
The visibility map serves two critical functions:
- Index-only scans: When all tuples on a page are visible to all transactions, index-only scans can skip fetching the heap page
- VACUUM optimization: VACUUM can skip pages marked as all-visible, avoiding unnecessary I/O
Structure
Each heap page is represented by 2 bits in the visibility map:
- Bit 0: All-visible flag - set when all tuples on the page are visible to all transactions
- Bit 1: All-frozen flag - set when all tuples on the page are frozen (introduced in PostgreSQL 9.6)
File location:
{table_oid}_vm in the table's directory. Each 8KB visibility map page covers roughly 32,000 heap pages (about 255 MB of heap data at the default 8KB page size).
When Bits Are Set
- All-visible bit: Set by VACUUM when it verifies all tuples are visible to all transactions
- All-frozen bit: Set when all tuples have xmin older than vacuum_freeze_min_age
When Bits Are Cleared
Bits are cleared when:
- INSERT/UPDATE/DELETE modifies the page
- HOT updates occur (even if on the same page)
Code Example: Checking VM Status
-- Requires the pg_visibility extension
CREATE EXTENSION IF NOT EXISTS pg_visibility;
-- View visibility map statistics
SELECT * FROM pg_visibility_map('table_name');
-- Check specific page visibility
SELECT all_visible, all_frozen
FROM pg_visibility('table_name', 0); -- page 0
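A quick cross-check that needs no extension is the relallvisible counter in pg_class, which VACUUM and ANALYZE maintain per table; a minimal sketch (the 20-row limit is arbitrary):
-- Approximate visibility-map coverage per table from planner statistics
SELECT relname,
       relpages,
       relallvisible,
       round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relkind = 'r'
ORDER BY relpages DESC
LIMIT 20;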
Source: PostgreSQL 16 Documentation - Visibility Map
Index Bloat in PostgreSQL
Index bloat occurs when a PostgreSQL index contains dead tuples (deleted or obsolete row versions) and empty pages that consume disk space but provide no query performance benefit. This happens because PostgreSQL's MVCC (Multi-Version Concurrency Control) creates new row versions on UPDATE/DELETE operations, leaving old index entries that are not immediately reclaimed.
Root Cause
When rows are updated or deleted:
- New index entries are created (for UPDATEs)
- Old index entries are marked dead but not immediately removed
- VACUUM marks these entries as reusable, but pages remain allocated
- Over time, indexes grow larger than necessary
Detection
Get a rough indication of likely bloat from usage statistics with this query (for an accurate measurement use the pgstattuple extension's pgstatindex() function):
SELECT
schemaname,
relname,
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read,
idx_tup_fetch,
round(100 * (1 - (idx_tup_fetch::float / NULLIF(idx_tup_read, 0)))) AS bloat_pct
FROM pg_stat_user_indexes
WHERE idx_tup_read > 0
ORDER BY pg_relation_size(indexrelid) DESC;
Bloat threshold: >20-30% bloat typically warrants action.
Resolution
Method 1 - REINDEX:
REINDEX INDEX index_name; -- locks out writes for the duration
REINDEX INDEX CONCURRENTLY index_name; -- PostgreSQL 12+, avoids blocking reads/writes
Method 2 - pg_repack extension (only brief locks):
pg_repack -i index_name database_name
Prevention
- Run regular VACUUM (autovacuum is enabled by default since PostgreSQL 8.3)
- Tune autovacuum_vacuum_scale_factor (default: 0.2, i.e. 20% dead tuples triggers vacuum)
- For high-update tables, set lower thresholds per table:
ALTER TABLE high_update_table SET (autovacuum_vacuum_scale_factor = 0.05);
Source: PostgreSQL 17 Documentation - Routine Vacuuming (https://www.postgresql.org/docs/17/routine-vacuuming.html)
The default fillfactor for B-tree indexes in PostgreSQL is 90 (meaning 90% of each index page is filled, leaving 10% free space).
Key Details
For B-tree indexes (the default index type):
- Default fillfactor: 90
- Valid range: 10 to 100
- Purpose: The 10% free space reduces page splits during UPDATE operations
For other index types:
- Hash indexes: Default 75
- GiST indexes: Default 90
- GIN indexes: Do not support fillfactor (the parameter is ignored)
Setting Fillfactor
-- Create index with custom fillfactor
CREATE INDEX idx_name ON table_name (column_name)
WITH (fillfactor = 70);
-- Alter existing index
ALTER INDEX idx_name SET (fillfactor = 80);
REINDEX INDEX idx_name; -- Required to apply the change
When to Modify
- Increase to 100: For static tables with no UPDATEs (eliminates wasted space)
- Decrease below 90: For tables with frequent UPDATEs that modify indexed columns (reduces page splits)
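To see which indexes currently override the default, pg_class.reloptions records explicitly set storage parameters (an index with no entry uses the default); a minimal sketch:
-- List B-tree indexes with explicitly set storage parameters such as fillfactor
SELECT c.relname AS index_name, c.reloptions
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE c.relkind = 'i'
AND am.amname = 'btree'
AND c.reloptions IS NOT NULL;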
Version: This behavior is consistent across PostgreSQL 9.0+ through 17.x.
Source: PostgreSQL Official Documentation - Index Storage Parameters
Query the pg_stat_user_indexes system view to find indexes with zero or low scan counts:
SELECT
schemaname,
relname,
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY schemaname, tablename;
Key considerations:
- Check statistics age: Verify when stats were last reset by querying pg_stat_database.stats_reset. You need sufficient observation time (typically 1+ months) to accurately identify unused indexes.
- idx_scan = 0 means never scanned: An index with zero scans has never been used to answer queries since the last stats reset. Note that enforcing PRIMARY KEY or UNIQUE constraints does NOT increment idx_scan.
- For PostgreSQL 16+: Use the last_idx_scan column to see the timestamp of the last index scan:
SELECT
schemaname,
relname,
indexrelname,
idx_scan,
last_idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0 OR last_idx_scan < NOW() - INTERVAL '90 days'
ORDER BY schemaname, tablename;
- Check replicas separately: Replicas maintain their own statistics, so check each replica independently to avoid false positives.
Enhanced query with index size:
SELECT
s.schemaname,
s.relname,
s.indexrelname AS index_name,
pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size,
s.idx_scan
FROM pg_stat_user_indexes s
JOIN pg_index i ON s.indexrelid = i.indexrelid
WHERE s.idx_scan = 0
AND i.indisunique IS FALSE -- Exclude unique/PK indexes
ORDER BY pg_relation_size(s.indexrelid) DESC;
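Once an index is confirmed unused on the primary and on every replica, it can be dropped without blocking writers; the index name below is hypothetical:
-- Waits for in-flight queries instead of blocking new ones; cannot run inside a transaction block
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_legacy_status;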
BRIN Index in PostgreSQL
BRIN (Block Range Index) is a space-efficient index type in PostgreSQL (available since version 9.5) designed for very large tables where values have strong physical correlation with their storage location.
How It Works
BRIN stores summary information (min/max values by default) for consecutive groups of table pages. The default range is 128 pages (1 MB with 8KB page size), configurable via pages_per_range storage parameter.
When to Use
Use BRIN when:
- Table data is naturally ordered (e.g., timestamps in append-only tables)
- Table size > 1GB and you need space efficiency
- You can tolerate approximate filtering (returns superset of matching rows)
BRIN indexes are typically 100-1000x smaller than B-tree indexes but require sequential scan of matched page ranges.
Creation Syntax
-- Basic BRIN index
CREATE INDEX idx_created_at ON logs USING BRIN (created_at);
-- Custom page range (256 pages = 2MB)
CREATE INDEX idx_created_at ON logs USING BRIN (created_at)
WITH (pages_per_range = 256);
-- Multi-column BRIN
CREATE INDEX idx_multi ON logs USING BRIN (created_at, user_id);
Performance Characteristics
- Index size: ~0.01-0.1% of table size (vs 10-20% for B-tree)
- Build time: Very fast, scales linearly
- Query performance: Good for range scans on correlated data, poor for random lookups
- Maintenance: Requires VACUUM or brin_summarize_new_values() for new pages (example below)
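For instance, ranges added since the last VACUUM can be summarized on demand (index name from the creation examples above):
-- Returns the number of new page ranges that were summarized
SELECT brin_summarize_new_values('idx_created_at');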
Source: PostgreSQL 17 Official Documentation - BRIN Indexes
Set a custom fillfactor for B-tree indexes when the indexed columns experience frequent UPDATE operations that modify the indexed values, or when dealing with monotonically increasing values (like sequences or timestamps) that cause page splits.
Default Behavior
PostgreSQL uses a default fillfactor of 90 for B-tree indexes (leaving 10% free space per page). Hash indexes default to 75, GiST indexes to 90, and GIN indexes do not use the fillfactor parameter.
When to Use Custom Fillfactor
Set fillfactor to 70-90 for indexes on columns that are:
- Frequently updated with new values
- Subject to UPDATE operations that change the indexed column
- Experiencing index bloat from page splits
Keep fillfactor at 100 for indexes that are:
- Insert-only (append-only tables)
- On immutable columns
- Used purely for lookups without updates
Implementation
-- Create index with custom fillfactor
CREATE INDEX idx_user_email ON users(email) WITH (fillfactor = 80);
-- Modify existing index (REINDEX has no WITH clause: set the parameter, then rebuild)
ALTER INDEX idx_user_email SET (fillfactor = 80);
REINDEX INDEX CONCURRENTLY idx_user_email;
Rationale
When an indexed value is updated, PostgreSQL uses the free space on the index page to store the new entry. Without free space, the page must split (creating two half-full pages), causing:
- Performance degradation (more pages to scan)
- Index bloat over time
- Increased storage and I/O
Lower fillfactor values (70-80) reserve more space for updates but increase initial index size and reduce cache efficiency.
Source: PostgreSQL 16 Documentation - CREATE INDEX (https://www.postgresql.org/docs/16/sql-createindex.html) and Index Storage Parameters (https://www.postgresql.org/docs/16/sql-createindex.html#SQL-CREATEINDEX-STORAGE-PARAMETERS)
Hash Index in PostgreSQL
A Hash index is a PostgreSQL index type that uses a hash table data structure. It stores a 32-bit hash code derived from the indexed column value, enabling O(1) lookup for equality operations.
Creation Syntax
CREATE INDEX idx_name ON table_name USING HASH (column_name);
Key Characteristics
- Supported operator: Only equality (=). Hash indexes cannot be used for range queries (<, >, <=, >=, BETWEEN), sorting, or pattern matching.
- Hash function: Uses PostgreSQL's internal hash function producing 32-bit integers
- Page size: Standard 8192 bytes (8 KB) pages like other indexes
Critical Limitations
No WAL logging before PostgreSQL 10.0: Hash indexes were not crash-safe and could not be replicated. Since PostgreSQL 10.0 (October 2017), hash indexes ARE WAL-logged and fully crash-safe.
Single operator support: Only the = operator. A hash index cannot optimize:
- Range scans: WHERE col > 100
- Sorting: ORDER BY col
- Pattern matching: WHERE col LIKE 'foo%'
No multi-column hash indexes: PostgreSQL does not support hash indexes on multiple columns (as of PostgreSQL 16).
Performance: B-tree indexes are typically as fast or faster for equality operations while supporting more operations. Hash indexes rarely provide performance benefits in practice.
Size: Hash indexes are often larger than equivalent B-tree indexes.
Official Recommendation
Use B-tree indexes for general-purpose workloads. Documentation for releases before PostgreSQL 10 warned: "Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database crash... Because of this, hash index use is discouraged." That warning no longer applies to supported versions.
Note: Although WAL logging was added in version 10, B-tree remains the recommended default because of its versatility and comparable performance for equality lookups.
Source
PostgreSQL 16 Official Documentation: Index Types - Hash Indexes
Detecting and Fixing Index Bloat in PostgreSQL
Detection
Start by listing index sizes and usage statistics (works on PostgreSQL 9.0+); this surfaces candidates but does not measure bloat directly:
SELECT
schemaname,
relname,
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
For more accurate bloat estimation, use the pgstattuple extension:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatindex('index_name');
Bloat threshold: An index with leaf_fragmentation > 50% or index size > 2x expected size requires action.
Fixing Methods
Method 1: REINDEX (Requires EXCLUSIVE lock)
-- Single index (locks table for reads/writes)
REINDEX INDEX index_name;
-- All indexes on a table
REINDEX TABLE table_name;
-- Entire database
REINDEX DATABASE database_name;
Lock duration: Holds AccessExclusiveLock for entire operation. Not suitable for production tables with high traffic.
Method 2: REINDEX CONCURRENTLY (PostgreSQL 12+, Recommended)
-- Rebuilds index without blocking reads/writes
REINDEX INDEX CONCURRENTLY index_name;
Requirements:
- PostgreSQL 12.0 or higher
- Requires 2x disk space temporarily
- Cannot run inside transaction block
- Cannot be used on system catalogs
Method 3: CREATE INDEX CONCURRENTLY + DROP (PostgreSQL 9.0+)
-- Step 1: Create new index with different name
CREATE INDEX CONCURRENTLY index_name_new ON table_name (column_name);
-- Step 2: Drop old index
DROP INDEX CONCURRENTLY index_name;
-- Step 3: Rename new index
ALTER INDEX index_name_new RENAME TO index_name;
Advantage: Works on PostgreSQL versions < 12.
Prevention
Enable autovacuum with appropriate settings in postgresql.conf:
autovacuum = on
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.05
For high-churn tables, consider table-level settings:
ALTER TABLE table_name SET (autovacuum_vacuum_scale_factor = 0.05);
Source: PostgreSQL Official Documentation (https://www.postgresql.org/docs/current/routine-reindex.html)
pg_stat_user_indexes View
pg_stat_user_indexes is a PostgreSQL system view that provides statistics about index usage and performance for user-defined indexes only (excludes system catalog indexes).
Key Columns
SELECT * FROM pg_stat_user_indexes;
Critical columns include:
- schemaname - Schema containing the index
- relname - Table name the index belongs to
- indexrelname - Name of the index
- idx_scan - Number of index scans initiated on this index (0 means unused)
- idx_tup_read - Number of index entries returned by scans
- idx_tup_fetch - Number of live table rows fetched by simple index scans
Primary Use Cases
- Identify unused indexes: idx_scan = 0 after sufficient runtime
- Index efficiency: Compare idx_tup_read vs idx_tup_fetch to detect bloat
- Index usage patterns: High idx_scan indicates frequently used indexes
Example Query
-- Find indexes that have never been used
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE 'pg_%'
ORDER BY pg_relation_size(indexrelid) DESC;
Important Notes
- Statistics are cumulative since last stats reset or server restart
- Reset with: SELECT pg_stat_reset(); (requires superuser)
- Available in PostgreSQL 8.3+ (statistics collector must be enabled: track_counts = on, which is the default)
- Counters are per-database and not shared across databases
Related Views
- pg_stat_all_indexes - Includes system indexes
- pg_statio_user_indexes - I/O statistics (disk blocks read)
Source: PostgreSQL Official Documentation - Statistics Views
GIN vs GiST Index in PostgreSQL
Use GIN (Generalized Inverted Index) when:
- Indexing static or rarely-updated data
- Need faster lookups (3x faster than GiST for contains operations)
- Indexing arrays, JSONB, full-text search, or tsvector columns
- Can tolerate larger index size (2-3x larger than GiST)
Use GiST (Generalized Search Tree) when:
- Data changes frequently (faster updates/inserts)
- Need geometric/spatial queries (PostGIS)
- Working with range types or custom data types
- Index size is a concern
Concrete Examples
GIN for JSONB (read-heavy):
CREATE INDEX idx_data_gin ON products USING GIN (metadata jsonb_path_ops);
-- Query: SELECT * FROM products WHERE metadata @> '{"brand": "Nike"}';
GIN for full-text search:
CREATE INDEX idx_fts_gin ON documents USING GIN (to_tsvector('english', content));
-- Query: SELECT * FROM documents WHERE to_tsvector('english', content) @@ to_tsquery('postgresql');
GiST for range types (write-heavy):
CREATE INDEX idx_period_gist ON bookings USING GIST (period);
-- Query: SELECT * FROM bookings WHERE period && '[2025-01-01, 2025-01-31)'::daterange;
GiST for spatial data (PostGIS):
CREATE INDEX idx_location_gist ON stores USING GIST (geom);
-- Query: SELECT * FROM stores WHERE ST_DWithin(geom, 'POINT(-73.935242 40.730610)', 1000);
Performance Metrics (PostgreSQL 14+)
| Operation | GIN | GiST |
|---|---|---|
| Lookup speed | Faster (baseline) | ~3x slower |
| Insert/update | Slower (2-3x) | Faster (baseline) |
| Index size | Larger (2-3x) | Smaller (baseline) |
Decision rule: If read-to-write ratio > 10:1, use GIN. Otherwise, use GiST.
Source: PostgreSQL 16 Official Documentation - Index Types (https://www.postgresql.org/docs/16/indexes-types.html)
Estimating Index Size Before Creation in PostgreSQL
Estimate index size from the planner statistics in pg_class (reltuples) and pg_stats (avg_width), or by scaling an existing, similar index; the pgstattuple extension is then useful after creation to check the actual index density:
-- Optional: for post-creation inspection with pgstatindex()
CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- Scale an existing similar index by the row-count ratio of the two tables
SELECT pg_size_pretty(
(pg_relation_size('existing_similar_index')
* (SELECT reltuples FROM pg_class WHERE relname = 'your_table_name')
/ GREATEST((SELECT reltuples FROM pg_class WHERE relname = 'existing_similar_table'), 1))::bigint
) AS estimated_index_size;
More accurate formula-based estimation:
-- For B-tree index on integer column (4 bytes)
SELECT pg_size_pretty(
(reltuples * (32 + 4))::bigint -- 32 bytes overhead + column size
) AS estimated_index_size
FROM pg_class
WHERE relname = 'your_table_name';
-- For B-tree index on text column
SELECT pg_size_pretty(
(reltuples * (32 + avg_width))::bigint
) AS estimated_index_size
FROM pg_class c
JOIN pg_stats s ON s.tablename = c.relname
WHERE c.relname = 'your_table_name'
AND s.attname = 'your_column_name';
Key sizing factors:
- B-tree overhead: 32 bytes per entry (includes tuple header + index metadata)
- Column data: Actual column size (4 bytes for int, 8 for bigint, variable for text)
- Fill factor: Default 90% for B-tree indexes (10% reserved for updates)
- Page overhead: 8KB page size with ~200 bytes header per page
Final formula: (row_count × (32 + column_size)) / 0.90 × 1.15
- The 1.15 multiplier accounts for page headers and fragmentation
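A worked example of the heuristic for a hypothetical 10-million-row table with a 4-byte integer key:
-- (10,000,000 × (32 + 4)) / 0.90 × 1.15 ≈ 439 MB
SELECT pg_size_pretty(((10000000 * (32 + 4)) / 0.90 * 1.15)::bigint) AS estimated_size;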
Source: PostgreSQL 17 documentation on Index Storage Parameters and pgstattuple module
Index-Only Scan in PostgreSQL
An index-only scan is a query execution method where PostgreSQL retrieves all required data directly from an index without accessing the table heap, enabled by the visibility map to check tuple visibility.
Requirements
For PostgreSQL to use an index-only scan, ALL of these must be true:
- All SELECT columns must be included in the index
- All WHERE columns must be in the index
- Visibility information must be available (via visibility map created by VACUUM)
Example
-- Create index with INCLUDE clause (PostgreSQL 11+)
CREATE INDEX idx_users_email ON users(email) INCLUDE (name);
-- This query can use index-only scan
SELECT email, name FROM users WHERE email = '[email protected]';
-- Verify with EXPLAIN
EXPLAIN (ANALYZE, BUFFERS)
SELECT email, name FROM users WHERE email = '[email protected]';
Expected output shows:
Index Only Scan using idx_users_email on users
Heap Fetches: 0
Heap Fetches Counter
The Heap Fetches: N value indicates actual table accesses:
- Heap Fetches: 0 = True index-only scan (optimal)
- Heap Fetches: N > 0 = Fallback to heap when visibility map incomplete
Critical: VACUUM Requirement
Index-only scans require an up-to-date visibility map:
-- Run VACUUM to update visibility map
VACUUM users;
-- Or enable autovacuum (default in PostgreSQL 8.3+)
ALTER TABLE users SET (autovacuum_enabled = true);
Without recent VACUUM, PostgreSQL must check heap for tuple visibility, negating the performance benefit.
Source
PostgreSQL Official Documentation: Index-Only Scans and Covering Indexes (v12+)
PostgreSQL 18 B-tree Skip Scan
PostgreSQL 18 introduces skip scan for B-tree indexes, allowing multi-column indexes to be used even when the leading column has no restriction.
The Problem (Pre-PG18)
CREATE INDEX idx_country_city ON locations(country, city);
-- This uses the index (leading column restricted)
SELECT * FROM locations WHERE country = 'USA' AND city = 'NYC';
-- This did NOT use the index efficiently (no leading column)
SELECT * FROM locations WHERE city = 'NYC'; -- Sequential scan!
PostgreSQL 18 Solution
-- Now uses skip scan on the same index!
SELECT * FROM locations WHERE city = 'NYC';
-- EXPLAIN now shows an index scan on idx_country_city even though the
-- leading column (country) has no restriction, e.g.:
-- Index Scan using idx_country_city on locations
--   Index Cond: (city = 'NYC'::text)
How Skip Scan Works
- Scans first entry for each distinct value of leading column
- Jumps to next distinct value (skips)
- Repeats until all distinct leading values checked
When Skip Scan Is Effective
| Leading Column | Skip Scan Benefit |
|---|---|
| Low cardinality (few distinct) | High |
| Medium cardinality | Moderate |
| High cardinality (many distinct) | Low (seq scan may win) |
Example with EXPLAIN
CREATE TABLE orders (
status VARCHAR(20), -- 5 distinct values
order_date DATE,
customer_id INT
);
CREATE INDEX idx_status_date ON orders(status, order_date);
-- Query on second column only
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE order_date = '2025-01-15';
-- The plan can show an index scan on idx_status_date; internally the scan
-- visits each of the 5 distinct status values in turn (one group per value)
Limitations
- Planner decides based on statistics (not always chosen)
- Most effective with low-cardinality leading columns
- Doesn't replace need for proper index design
Source: PostgreSQL 18 Release Notes
https://www.postgresql.org/docs/18/release-18.html
BRIN indexes are most effective when:
1. Very large tables - BRIN is designed specifically for tables where traditional B-tree indexes would be too large.
2. Natural correlation with physical order - The indexed column's values must correlate with the physical storage order of rows. This means:
- Sequentially inserted data (timestamps, order dates, sequence IDs)
- Naturally clustered data (ZIP codes, geographic regions)
- Append-only tables where new data follows a predictable pattern
3. Range queries on correlated data - BRIN excels at queries like WHERE date >= '2024-01-01' when dates increase with insertion order.
Performance characteristics:
- Index size is tiny compared to B-tree (often hundreds of times smaller)
- Scanning overhead is minimal, close to sequential scan cost
- Can skip entire block ranges when values don't match query conditions
- Uses lossy bitmap scans requiring recheck of candidate tuples
Example scenario:
A table storing store orders with a created_at timestamp column where orders are inserted chronologically. BRIN can efficiently skip large portions of the table for date range queries.
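Whether that physical correlation actually holds can be checked from planner statistics; pg_stats.correlation ranges from -1 to 1, and values near ±1 favor BRIN (table and column names below match the hypothetical scenario and require a recent ANALYZE):
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'created_at';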
Not effective when:
- Data is randomly distributed (no physical correlation)
- Frequent updates that destroy natural ordering
- Small tables where B-tree overhead is acceptable
- Point lookups requiring exact row identification
B-tree Index in PostgreSQL
A B-tree index (balanced tree) is the default index type in PostgreSQL, used for equality and range queries on orderable data types.
What It Is
B-tree indexes maintain sorted data in a tree structure with these properties:
- Self-balancing: keeps tree height minimal (logarithmic depth)
- Supports operators:
<,<=,=,>=,>,BETWEEN,IN,IS NULL,IS NOT NULL - Supports pattern matching with
LIKEand~only when pattern is anchored at start (e.g.,'prefix%')
When to Use
Use B-tree indexes for:
- Equality searches: SELECT * FROM users WHERE email = '[email protected]';
- Range queries: SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31';
- Sorting operations: SELECT * FROM products ORDER BY price;
- Primary keys and unique constraints (PostgreSQL automatically creates B-tree indexes)
Creating a B-tree Index
-- Explicit (though INDEX defaults to B-tree)
CREATE INDEX idx_users_email ON users USING btree (email);
-- Implicit (same result)
CREATE INDEX idx_users_email ON users (email);
-- Multi-column
CREATE INDEX idx_orders_user_date ON orders (user_id, created_at);
When NOT to Use
- Full-text search: Use GIN index with
tsvector - Geometric data: Use GiST or SP-GiST
- Unanchored pattern matching (
LIKE '%suffix'): Use trigram GIN index - Array containment (
@>,<@): Use GIN index
Key Limitations
- B-tree index size is typically ~50-100% of indexed column data size
- Multi-column indexes: only efficient when query filters use leftmost columns first
- Cannot index data types that lack ordering (comparison) operators (e.g., json, point)
Source: PostgreSQL 17 Documentation - Index Types
Use the REINDEX command to rebuild an index in PostgreSQL:
Rebuild a specific index:
REINDEX INDEX index_name;
Rebuild all indexes on a table:
REINDEX TABLE table_name;
Rebuild without blocking writes (recommended for production):
REINDEX INDEX CONCURRENTLY index_name;
REINDEX TABLE CONCURRENTLY table_name;
Rebuild all indexes in a schema:
REINDEX SCHEMA schema_name;
Rebuild all indexes in the current database:
REINDEX DATABASE database_name;
Key Options:
- CONCURRENTLY - Rebuilds without blocking concurrent INSERT/UPDATE/DELETE operations (no ACCESS EXCLUSIVE lock)
- VERBOSE - Prints progress reports at INFO level
- TABLESPACE tablespace_name - Moves indexes to a new tablespace during rebuild
Command-line alternative:
reindexdb -d database_name -t table_name
reindexdb --concurrently -d database_name
The reindexdb utility is a wrapper around the SQL REINDEX command.
Note: Without CONCURRENTLY, REINDEX acquires an ACCESS EXCLUSIVE lock that blocks all table access during the rebuild.
Indexing JSONB Fields in PostgreSQL
PostgreSQL provides four specialized index types for JSONB fields, each optimized for different query patterns:
1. GIN Index (General Purpose - RECOMMENDED)
Best for most JSONB queries. Supports containment (@>, <@) and existence (?, ?|, ?&) operators.
-- Index entire JSONB column
CREATE INDEX idx_data_gin ON table_name USING GIN (jsonb_column);
-- Index with jsonb_path_ops (faster, smaller, but only supports @> operator)
CREATE INDEX idx_data_gin_path ON table_name USING GIN (jsonb_column jsonb_path_ops);
When to use: Queries like WHERE jsonb_column @> '{"key": "value"}'
2. B-tree Index (Specific Field Extraction)
Best for exact matches and comparisons on specific JSONB keys.
-- Index a specific field
CREATE INDEX idx_email ON users USING BTREE ((data->>'email'));
-- Index a nested field
CREATE INDEX idx_city ON users USING BTREE ((data->'address'->>'city'));
When to use: Queries like WHERE data->>'email' = '[email protected]'
3. Hash Index (Equality Only)
For equality checks on extracted values (PostgreSQL 10+).
CREATE INDEX idx_status_hash ON orders USING HASH ((data->>'status'));
4. Expression Index
For computed values or casts.
-- Index numeric field cast
CREATE INDEX idx_age ON users USING BTREE (((data->>'age')::integer));
Key Differences
- jsonb_path_ops: ~30% smaller and faster than the default GIN operator class, but only supports @> containment queries
- ->> vs -> operator: Use ->> (returns text) for B-tree indexes, -> (returns JSONB) for continued nesting
- Performance: GIN indexes are larger but handle complex queries; B-tree is smaller for single-field access patterns
Verification Query
-- Check if index is being used
EXPLAIN ANALYZE SELECT * FROM table_name WHERE jsonb_column @> '{"key": "value"}';
Version: Available in PostgreSQL 9.4+ (JSONB introduced). jsonb_path_ops available 9.4+. Hash indexes crash-safe in 10+.
Source: PostgreSQL 17 Official Documentation - JSON Types and Indexing
Type Conversion
27 questions
No, binary coercibility is not necessarily a symmetric relationship. For example, the cast from xml to text can be performed for free, but the reverse direction requires a function that performs at least a syntax check. However, text and varchar are binary coercible both ways.
The four conversion contexts are: 1) Function calls - resolves overloaded functions based on argument types, 2) Operators - handles prefix and infix operator overloading, 3) Value storage - converts expression results to match target column types in INSERT/UPDATE, and 4) UNION/CASE/Array constructs - establishes uniform types across multiple result expressions.
The castcontext column in pg_cast has three possible values: 'e' means explicit cast only (using CAST or :: syntax), 'a' means implicit in assignment to a target column plus explicit contexts, and 'i' means implicit in expressions plus all other contexts.
String literals without specified types receive a placeholder designation of 'unknown' type, to be resolved in later stages of query processing.
Domains are treated as their base types for all type resolution purposes. This ensures that domains act like their base types for purposes of ambiguous-operator resolution.
No. When the data type has spaces in its name (such as double precision), you cannot use the double colon :: operator. In this case, you must use the CAST operator or an alternative name of the data type.
PostgreSQL permits the scale in a numeric type declaration to be any value in the range -1000 to 1000. However, the SQL standard requires the scale to be in the range 0 to precision, so using scales outside that range may not be portable.
For VARCHAR columns, pg_attribute.atttypmod equals the declared length plus 4. For example, VARCHAR(4) has atttypmod = 8. A value of -1 means unlimited size.
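For example, the modifier can be inspected directly in the catalog (table and column names are hypothetical):
-- A varchar(4) column reports atttypmod = 8 (declared length 4 + 4-byte header)
SELECT attname, atttypmod
FROM pg_attribute
WHERE attrelid = 'my_table'::regclass AND attname = 'my_varchar_col';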
The second argument receives -1 if there is no type modifier associated with the destination type.
A good rule of thumb is to make a cast implicitly invokable only for information-preserving transformations between types in the same general type category. For example, the cast from int2 to int4 can reasonably be implicit, but the cast from float8 to int4 should probably be assignment-only. Cross-type-category casts, such as text to int4, are best made explicit-only.
You must be a superuser to create a binary-coercible cast. This restriction exists because an erroneous binary-coercible cast conversion can easily crash the server.
Schemas are searched in the order they appear in search_path. However, pg_catalog is always effectively part of the search path - if not named explicitly, it is implicitly searched before searching the path's schemas. The temporary schema is searched first (even before pg_catalog) for relations and data types, but never for functions or operators.
The PostgreSQL scanner/parser divides lexical elements into five fundamental categories: integers, non-integer numbers, strings, identifiers, and key words.
The maximum precision that can be explicitly specified in a numeric type declaration is 1000 digits.
For historical reasons, CASE treats its ELSE clause (if any) as the 'first' input, with the THEN clause(s) considered after that.
No. UNION, INTERSECT and EXCEPT operations are resolved pairwise (sequentially), while other constructs like CASE, ARRAY, VALUES, GREATEST and LEAST consider all of their inputs in one resolution step.
PostgreSQL's function type resolution uses a 4-step procedure: Step 1 selects candidate functions, Step 2 checks for exact matches, Step 3 detects type conversion requests for single-argument calls, and Step 4 evaluates best match.
The algorithm has 6 steps: 1) Check if all inputs are the same type, 2) Handle domain types as base types, 3) Resolve all unknown types as text, 4) Verify all non-unknown inputs are in the same type category, 5) Select candidate type and evaluate left-to-right, and 6) Convert all inputs to final candidate type.
Data types are divided into several basic type categories: boolean, numeric, string, bitstring, datetime, timespan, geometric, network, and user-defined.
PostgreSQL shows a 'bias towards string' - the string category is selected if any candidate accepts that category, as an unknown-type literal looks like a string.
The internal name for the character data type is 'bpchar', which stands for 'blank-padded char'.
A sizing cast is a cast from a type to itself. If one is found in the pg_cast catalog, it is applied to the expression before storing into the destination column. The cast function is responsible for applying any length-dependent semantics such as size checking or truncation.
PostgreSQL supports two syntaxes: 1) CAST(value AS target_type) which conforms to the SQL standard, and 2) value::target_type which is historical PostgreSQL usage. Both are functionally equivalent.
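Both forms produce the same result, for example:
SELECT CAST('42' AS integer) AS standard_syntax, '42'::integer AS historical_syntax;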
Two mitigation strategies: 1) Use the VARIADIC keyword explicitly in the call, which bypasses this hazard, or 2) Cast arguments to force an exact match. However, calls populating VARIADIC 'any' parameters often have no equivalent formulation containing the VARIADIC keyword.
No. If the best-match rules are reached, they actively discriminate against operators on domain types. The only way to get a custom operator on a domain type to be used is to explicitly cast the literal so that an exact-match rule applies.
The castmethod column has three values: 'f' means the cast uses a function specified in castfunc, 'i' means the cast uses input/output functions, and 'b' means binary-coercible (no conversion required).
The pg_cast catalog stores data type conversion paths, both built-in and user-defined. However, it does not represent every type conversion that the system knows how to perform; only those that cannot be deduced from some generic rule.
Extensions > pgAudit
27 questions
rds_pgaudit is the only allowable value for the pgaudit.role parameter on AWS RDS and Aurora. The role must be created with 'CREATE ROLE rds_pgaudit;'
All data definition language (DDL) statements that are not included in the ROLE class, such as CREATE TABLE, ALTER TABLE, and DROP TABLE.
The default value is 0 bytes, which means all parameters are logged regardless of length when log_parameter is enabled.
PostgreSQL 9.5 was the first version supported by pgAudit (with pgAudit v1.0.X).
No. CREATE ROLE and ALTER ROLE statements are logged by the ROLE class, not the DDL class. The DDL class only includes DDL statements that are not in the ROLE class.
The default value is 'off', which means the rows retrieved or affected by a statement are not included in audit logs.
Yes, pgAudit must be loaded in shared_preload_libraries. Otherwise, an error will be raised at load time and no audit logging will occur. A PostgreSQL restart is required after adding it.
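A minimal postgresql.conf sketch (the audit classes chosen here are only an example):
# requires a server restart after changing shared_preload_libraries
shared_preload_libraries = 'pgaudit'
pgaudit.log = 'ddl, write'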
There is no default value (it is unset by default), meaning object audit logging is not enabled unless explicitly configured.
The default value is 'off', which means statement text and parameters are included with every log entry rather than only the first entry for a statement/substatement combination.
The default value is 'off', which means log messages will not be visible to client processes such as psql.
A unique statement ID for the session. There may be multiple entries for a statement ID when more than one relation is logged.
When enabled, statement text and parameters are logged only with the first log entry for a statement/substatement combination, not with every entry. This results in less verbose logging, but the statement/substatement pair with process ID is needed to identify the statement text from a previous entry.
pgAudit currently supports PostgreSQL versions 13 through 18, with separate branches maintained for each major version.
INSERT, UPDATE, DELETE, TRUNCATE, and COPY when the destination is a relation (table).
A sequential ID for each sub-statement within the main statement (e.g., when calling a function from a query). Sub-statement IDs are continuous even if some sub-statements are not logged.
The default value is 'none', which means no database changes are logged when pgAudit is first installed.
AUDIT_TYPE, STATEMENT_ID, SUBSTATEMENT_ID, CLASS, COMMAND, OBJECT_TYPE, OBJECT_NAME, STATEMENT, PARAMETER.
Limit the audit scope to critical actions (e.g., DDL and modifications of sensitive data), use object-level logging for sensitive tables, avoid enabling 'all' in production, and test performance impact before deployment with adequate log volume space.
The default value is 'on', which means session logging is enabled when all relations in a statement are in pg_catalog.
Supabase doesn't support pgaudit.log_parameter because it may log secrets in encrypted columns when using encryption extensions like pgsodium or Vault.
COPY when the destination is a relation (e.g., COPY table_name FROM ...) is classified as WRITE, because it's modifying/importing data.
MISC_SET was introduced in pgAudit version 1.4 (for PostgreSQL 12), separating SET ROLE and similar commands from the general MISC class.
The default value is 'off', which prevents query parameters from being logged (this is the default for security reasons).
Extensions > Extension Management
27 questions
pg_extension_update_paths('extension_name') returns each pair of distinct known version names for the specified extension, together with the update path sequence that would be taken to get from the source version to the target version, or NULL if there is no available update path.
RESTRICT is the default behavior. The command will fail if any of the extension's objects are in use in the database.
Supported types include: ACCESS METHOD, AGGREGATE, CAST, COLLATION, CONVERSION, DOMAIN, EVENT TRIGGER, FOREIGN DATA WRAPPER, FOREIGN TABLE, FUNCTION, LANGUAGE, MATERIALIZED VIEW, OPERATOR, OPERATOR CLASS, OPERATOR FAMILY, PROCEDURE, ROUTINE, SCHEMA, SEQUENCE, SERVER, and TABLE.
If the extension is marked trusted in its control file, then it can be installed by any user who has CREATE privilege on the current database.
SQL script files follow the pattern extension--version.sql (e.g., foo--1.0.sql).
No, extension names are never schema-qualified. Extensions exist at the database level with unqualified names.
Yes, the syntax supports dropping multiple extensions: DROP EXTENSION name [, ...]. You can specify multiple extension names separated by commas.
A notice is issued (not an error). Note that there is no guarantee that the existing extension is anything like the one that would have been created from the currently-available script file.
It gets replaced with the value of the module_pathname parameter from the control file. Typically this is set to $libdir/shared_library_name and is used in CREATE FUNCTION commands for C-language functions.
Extension control files must be placed in the installation's SHAREDIR/extension directory. You can find this directory by running pg_config --sharedir.
CASCADE automatically installs any extensions that this extension depends on that are not already installed, with their dependencies likewise automatically installed, recursively.
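For example (extension names are illustrative; a NOTICE reports each dependency as it is installed):
CREATE EXTENSION hstore_plperl CASCADE;
-- NOTICE: installing required extension "hstore"
-- NOTICE: installing required extension "plperl"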
Setting trusted to true allows non-superusers with database CREATE privileges to install extensions marked as superuser = true. The installation scripts then run as the bootstrap superuser, not the calling user.
The default value is true, meaning only superusers can create the extension or update it to a new version by default.
Currently installed version of the extension, or NULL if not installed.
It specifies dependencies as a comma-separated list (e.g., requires = 'foo, bar'). Required extensions must be installed beforehand.
No, extension names must be unique database-wide. You cannot install multiple versions of the same extension in a single database. However, different databases in the same PostgreSQL instance can potentially use different extension versions if the extension is specifically designed to support this.
This parameter applies only to non-relocatable extensions, forcing installation into a specific named schema. It only applies during initial creation, not updates, and prevents users from specifying alternate schemas via the SCHEMA option.
bool (boolean) - True if extension can be relocated to another schema.
It specifies the character set encoding used by the script files. Should be specified if the script files contain any non-ASCII characters; otherwise files are assumed to be in the database encoding.
The extension must be relocatable (extrelocatable = true) for this command to succeed.
You must own the extension to use ALTER EXTENSION. The ADD/DROP forms require ownership of the added/dropped object as well.
The pg_available_extensions view has 4 columns: name (name), default_version (text), installed_version (text), and comment (text).
When an extension is dropped, its member objects and other explicitly dependent routines are also removed automatically.
The update script runs in the same environment that CREATE EXTENSION provides for installation scripts: search_path is set up in the same way, and any new objects created by the script are automatically added to the extension.
Routine Maintenance > ANALYZE
26 questions
This sets the maximum number of entries in the most_common_vals and histogram_bounds arrays for each column.
100 bins. This is controlled by the default_statistics_target, which defaults to 100.
1000 tuples. This parameter was introduced in PostgreSQL 13 to trigger autovacuum on insert-only tables.
The ANALYZE command prints a warning and does nothing for that foreign table.
2 MB (megabytes). This parameter controls buffer usage for VACUUM and ANALYZE operations when BUFFER_USAGE_LIMIT is not explicitly specified.
100 entries. This is controlled by the default_statistics_target, which defaults to 100.
reltuples (total number of entries) and relpages (number of disk blocks occupied). However, ANALYZE generates approximate values since it does not read every row of the table.
Yes. While ANALYZE ordinarily processes all partitions of specified partitioned tables, SKIP_LOCKED will cause ANALYZE to skip all partitions if there is a conflicting lock on the partitioned table.
No. Foreign tables are analyzed only when explicitly selected in an ANALYZE command. Manual scheduling is required for foreign table statistics.
By default, ANALYZE gathers statistics by sampling rows from all partitions, and recursively collects and updates the statistics for each partition. Each leaf partition is analyzed only once, even with multi-level partitioning.
0.1 (10% of table size). This specifies a fraction of the table size to add to autovacuum_analyze_threshold when deciding whether to trigger an ANALYZE.
1 minute. The autovacuum daemon wakes up every autovacuum_naptime and examines databases for maintenance operations including ANALYZE.
INFO level. When VERBOSE is specified, ANALYZE emits progress messages to indicate which table is currently being processed at INFO level.
analyze threshold = autovacuum_analyze_threshold + (autovacuum_analyze_scale_factor × number of tuples). With defaults: 50 + (0.1 × table size).
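For example, with the defaults a table containing 1,000,000 rows is re-analyzed after roughly 50 + 0.1 × 1,000,000 = 100,050 tuples have been inserted, updated, or deleted.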
No. In versions prior to PostgreSQL 13, no autovacuum would be triggered with an insert-only workload, though auto-analyze was triggered for insert-only workloads.
ANALYZE gathers two sets of statistics: one on the rows of the parent table only, and a second including rows of both the parent table and all of its children. However, child tables themselves are not individually analyzed unless specified.
30,000 rows. The formula is 300 × default_statistics_target. With the default value of 100, PostgreSQL samples 300 × 100 = 30,000 rows.
0.2 (20% of table size). This parameter works with autovacuum_vacuum_insert_threshold to determine when autovacuum runs on insert-only tables.
VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX. These values are not updated on-the-fly and usually contain somewhat out-of-date values.
Yes. ANALYZE requires only a SHARE UPDATE EXCLUSIVE lock, which does not conflict with DML operations. Regular INSERT, UPDATE, and DELETE can continue while ANALYZE runs.
When ONLY is specified before the table name, only that table is analyzed (excluding partitions or inheritance children). Without ONLY, the table and all its inheritance child tables or partitions are analyzed.
No. The autovacuum daemon does not process partitioned tables. It is usually necessary to periodically run a manual ANALYZE to keep the statistics of the table hierarchy up to date.
It defaults to -1, which means it inherits the value of maintenance_work_mem.
track_counts must be set to true. Even though autovacuum is on by default, track_counts must also be enabled for autovacuum to work.
Indexes > Unique Indexes
26 questions
Only B-tree indexes can be declared unique. Other index types (GiST, GIN, BRIN, Hash, SP-GiST) do not support unique constraints.
CREATE INDEX ON ONLY on a partitioned table creates the index marked as invalid and does not recurse to create indexes on partitions automatically. The indexes on partitions must be created separately and then attached using ALTER INDEX ATTACH PARTITION.
PostgreSQL 15 introduced the NULLS NOT DISTINCT option, authored by Peter Eisentraut. This allows unique constraints and indexes to treat NULL values as not distinct (equal to each other).
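A minimal sketch of the syntax (table and column names are hypothetical):
-- PostgreSQL 15+: two rows with NULL email now conflict instead of both being allowed
CREATE UNIQUE INDEX users_email_uniq ON users (email) NULLS NOT DISTINCT;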
The constraint's columns must include all of the partition key columns. Additionally, the partition keys must not include any expressions or function calls. This is because individual indexes can only enforce uniqueness within their own partitions.
Create a unique partial index with an IS NULL restriction: CREATE UNIQUE INDEX ON table (column) WHERE column IS NULL; This indexes only NULL values and enforces uniqueness among them.
Index expressions are relatively expensive to maintain because the derived expression(s) must be computed for each row insertion and non-HOT update. However, the index expressions are not recomputed during an indexed search, since they are already stored in the index.
B-trees use a default fillfactor of 90, though any integer value from 10 to 100 can be selected. This applies to all B-tree indexes including unique indexes.
The predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. PostgreSQL can recognize simple inequality implications (e.g., 'x < 1' implies 'x < 2'), but does not have a sophisticated theorem prover for mathematically equivalent expressions in different forms. Parameterized query clauses do not work with partial indexes.
Yes. A table can have multiple unique constraints but can only have a single primary key.
Yes. An index column can be a function or scalar expression computed from one or more columns of the table. All functions and operators used must be immutable. This allows enforcing uniqueness based on computed values, such as CREATE UNIQUE INDEX ON table (LOWER(column)).
No. Since an ordered index can be scanned either forward or backward, it is not normally useful to create a single-column DESC index — that sort ordering is already available with a regular index.
No. Expressions are not supported as included columns since they cannot be used in index-only scans. Only actual column references can be specified in the INCLUDE clause.
A primary key must be UNIQUE and NOT NULL for all participating attributes. In contrast, unique constraints in PostgreSQL allow for NULL values (which by default are treated as distinct from each other).
A multicolumn unique index will only reject cases where all indexed columns are equal in multiple rows. The uniqueness constraint applies to the combination of all columns, not to each column individually.
No. CREATE INDEX CONCURRENTLY cannot be performed within a transaction block. Standard (non-concurrent) index creation can be executed within a transaction block.
Currently, B-tree, GiST, and SP-GiST index access methods support the INCLUDE clause.
If problems occur during scanning, the command fails but leaves behind an 'invalid' index that still consumes update overhead and continues to enforce uniqueness constraints. The recommended recovery is to drop the index and retry CREATE INDEX CONCURRENTLY, or use REINDEX INDEX CONCURRENTLY.
No. There is no way to create an exclusion constraint spanning the whole partitioned table. It is only possible to put such a constraint on each leaf partition individually.
Yes. PostgreSQL automatically creates a unique index when a unique constraint or primary key is defined for a table. There's no need to manually create indexes on unique columns; doing so would just duplicate the automatically-created index.
No. Constraints are NOT DEFERRABLE INITIALLY IMMEDIATE by default. Uniqueness constraints that have not been declared DEFERRABLE are checked immediately at the end of each statement.
DEFERRED constraints are not checked until transaction commit, rather than at the end of each statement.
The uniqueness constraint is already being enforced against other transactions when the second table scan begins. This means constraint violations could be reported in other queries prior to the index becoming available for use.
Approximately 2704 bytes (about 1/3 of a buffer page). Values larger than 1/3 of a buffer page cannot be indexed. If an index tuple exceeds this size, data insertion will fail.
No. B-tree cannot use deduplication with indexes that use a nondeterministic collation. Additionally, text, varchar, and char cannot use deduplication when a nondeterministic collation is used.
Only UNIQUE, PRIMARY KEY, EXCLUDE, and REFERENCES (foreign key) constraints accept the DEFERRABLE clause. NOT NULL and CHECK constraints are not deferrable.
No. Non-key columns specified in the INCLUDE clause are not considered when enforcing uniqueness. They are included in the index only to support index-only scans and cannot be used in index scan search qualifications.
Concurrency Control > Explicit Row Locking
26 questions
Any selected rows that cannot be immediately locked are skipped. This produces an inconsistent data view but is useful for queue-like table access patterns.
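This is the basis of the common queue-worker pattern; a minimal sketch against a hypothetical jobs table:
-- Each worker claims one pending job and skips rows locked by other workers
SELECT id
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;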
FOR UPDATE blocks UPDATE, DELETE, SELECT FOR UPDATE, SELECT FOR NO KEY UPDATE, SELECT FOR SHARE, and SELECT FOR KEY SHARE.
The from_reference must be a table alias or non-hidden table_name referenced in the FROM clause.
A table is processed as NOWAIT if that is specified in any of the clauses affecting it.
When you INSERT a row that references a foreign key, PostgreSQL places a FOR KEY SHARE lock on the referenced row in the parent table.
If a locking clause is applied to a view or sub-query, it affects all tables used in the view or sub-query. Rows which are not ultimately returned (because they do not pass the user's WHERE conditions) may still end up being locked.
If the same table is mentioned (or implicitly affected) by more than one locking clause, then it is processed as if it was only specified by the strongest one.
No. The locking clauses cannot be used in contexts where returned rows cannot be clearly identified with individual table rows; for example they cannot be used with aggregation.
Only the rows that are actually returned by the query after applying the LIMIT clause are locked. The ORDER BY determines which rows are selected first, and then LIMIT restricts the result set - only those final rows are locked.
No. These locking options cannot be specified for results of set operations or any inputs to UNION, INTERSECT, or EXCEPT.
No. The locking clauses cannot be combined with DISTINCT (including DISTINCT ON).
No. Only the partitions which are scanned are locked, and the partitions excluded by constraints are not locked.
The statement immediately raises an error (SQLSTATE 55P03) rather than blocking if a selected row cannot be locked instantly.
PostgreSQL provides four row-level locking modes: FOR UPDATE (strongest), FOR NO KEY UPDATE, FOR SHARE, and FOR KEY SHARE (weakest).
Locks are normally held until the end of the transaction. However, if acquired after a savepoint, they release immediately upon rollback to that savepoint.
Yes, a transaction can hold conflicting locks on the same row, even in different subtransactions. However, two transactions can never hold conflicting locks on the same row.
When a cursor's query includes FOR UPDATE or FOR SHARE, returned rows are locked at the time they are first fetched, in the same way as for a regular SELECT command with these options.
FOR UPDATE will wait for the concurrent transaction that has modified the row, and will then lock and return the updated row (or no row, if the row was deleted).
No. SKIP LOCKED only applies to row-level locks; the required ROW SHARE table-level lock is still taken normally.
An error will be thrown if a row to be locked has changed since the transaction started. The error message is: 'ERROR: could not serialize access due to concurrent update'.
PostgreSQL takes a FOR NO KEY UPDATE lock on rows before any UPDATE that doesn't modify key columns.
Yes. SELECT FOR UPDATE modifies selected rows to mark them locked, and so will result in disk writes. The tuple headers are updated with xmin and xmax values to record the lock information.
Server Configuration > Resource Consumption - Disk and Kernel
26 questions
The transaction attempting to exceed the limit will be canceled.
1MB. If set to 0, WAL data is always flushed immediately. This parameter can only be set in postgresql.conf or on the server command line.
200 milliseconds (200ms). This parameter can only be set in the postgresql.conf file or on the server command line.
on (enabled). This parameter can only be set in the postgresql.conf file or on the server command line.
2.0. This parameter can only be set in the postgresql.conf file or on the server command line.
open_datasync, fdatasync, fsync, fsync_writethrough, and open_sync. This parameter can only be set in postgresql.conf or on the server command line.
128KB. The range is limited by the io_max_combine_limit ceiling.
2 (strict overcommit mode). This reduces the chances of the Out-of-Memory killer being invoked.
200 milliseconds. This parameter can only be set in the postgresql.conf file or on the server command line.
1048576 pages, which translates to approximately 8GB for 8KB pages. This is a server start only parameter.
100 buffers. Setting it to 0 disables background writing. This parameter can only be set in the postgresql.conf file or on the server command line.
on. When enabled, new WAL files are filled with zeroes to ensure space is allocated before writing WAL records.
No. temp_file_limit only applies to behind-the-scenes temporary files used in query execution (such as sort and hash temporary files), not to explicit temporary tables created by users.
Up to three times wal_writer_delay (typically 600ms with the default 200ms setting), because the WAL writer is designed to favor writing whole pages at a time during busy periods.
The ideal setting is the actual stack size limit enforced by the kernel (as set by ulimit -s or local equivalent), less a safety margin of about 1MB.
5 transactions. This is the minimum number of concurrent open transactions required before commit_delay is applied.
2MB (2 megabytes). If this value is specified without units, it is taken as kilobytes.
Platform-dependent. The default is fdatasync on Linux and FreeBSD. The first supported method is chosen from this order: open_datasync, fdatasync, fsync, fsync_writethrough, open_sync.
0 (uses system default). This parameter is only supported on Linux for non-default settings. Common sizes include 2MB and 1GB on Intel/AMD architectures.
Similar to effective_io_concurrency but for maintenance operations.
try. Valid options are: off, on, and try. The 'try' setting enables prefetching only if the operating system provides support.
Platform-dependent (usually first supported option). Valid options are: posix, sysv, windows, and mmap. The mmap option is generally discouraged because the operating system may write modified pages back to disk repeatedly.
off. This parameter can only be set at server startup and is ignored if data checksums are enabled.
256kB on Linux, 0 elsewhere. The valid range is between 0 (which disables forced writeback) and 2MB.
Data Types - Numeric
26 questions
Both NUMERIC and floating-point types (REAL, DOUBLE PRECISION) can store the special values 'Infinity', '-Infinity', and 'NaN' (not-a-number).
REAL (float4) has exactly 24 bits in the mantissa, and DOUBLE PRECISION (float8) has exactly 53 bits in the mantissa, for IEEE-standard floating point implementations.
PostgreSQL accepts float(1) to float(24) as selecting the REAL type (4 bytes, single precision), while float(25) to float(53) select DOUBLE PRECISION (8 bytes). Values of p outside this range cause an error. float with no precision specified defaults to DOUBLE PRECISION.
Calculations on NUMERIC values are very slow compared to the integer types or to the floating-point types.
Yes, the SQL standard requires the scale to be in the range 0 to precision. PostgreSQL allows -1000 to 1000, but using scales outside the standard range may not be portable to other database systems.
INTEGER can also be referred to as INT or INT4. All three names are synonyms for the same 4-byte integer type.
No, an infinity can only be stored in an unconstrained NUMERIC column (declared without precision/scale), because it notionally exceeds any finite precision limit.
The maximum precision that can be explicitly specified in a NUMERIC type declaration is 1000.
PostgreSQL offers GENERATED ALWAYS AS IDENTITY (prevents manual value insertion, avoiding conflicts) and GENERATED BY DEFAULT AS IDENTITY (allows manual value insertion when needed).
BIGINT can also be referred to as INT8. Both names refer to the same 8-byte integer type.
On most machines, the REAL and DOUBLE PRECISION types round ties to the nearest even number. For example, 2.5 rounds to 2, and 3.5 rounds to 4.
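A quick way to observe the difference against NUMERIC, which rounds ties away from zero:
SELECT round(2.5::double precision) AS float8_round, -- 2 on most platforms (ties to even)
       round(2.5::numeric) AS numeric_round;          -- 3 (ties away from zero)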
The value obtained by nextval() is not reclaimed for re-use if the calling transaction later aborts. This means transaction rollbacks or database crashes can result in gaps in the sequence, creating 'holes' in SERIAL columns.
SERIAL uses 4 bytes of storage and has a range of 1 to 2,147,483,647.
DOUBLE PRECISION uses 8 bytes of storage, has a range of approximately 1E-307 to 1E+308, and provides at least 15 decimal digits of precision.
The MONEY type is not recommended because its output is locale-sensitive (tied to lc_monetary setting), it has limited precision (doesn't handle fractions of a cent well), and conversion issues can cause calculation errors. NUMERIC is recommended instead for exact financial calculations.
REAL and DOUBLE PRECISION are inexact, variable-precision numeric types. Some values cannot be converted exactly to the internal format and are stored as approximations, so storing and retrieving a value might show slight discrepancies.
Beginning in PostgreSQL 15, a negative scale rounds values to the left of the decimal point. For example, NUMERIC(2, -3) rounds values to the nearest thousand, allowing values between -99000 and 99000.
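For example (an illustrative table):
CREATE TABLE sales (amount numeric(2, -3));
INSERT INTO sales VALUES (12345);   -- stored as 12000 (rounded to the nearest thousand)
INSERT INTO sales VALUES (123456);  -- fails: rounds to 123000, outside the -99000..99000 range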
The REAL and DOUBLE PRECISION types are implementations of IEEE Standard 754 for Binary Floating-Point Arithmetic (single and double precision respectively), to the extent that the underlying processor, operating system, and compiler support it.
REAL uses 4 bytes of storage, has a range of approximately 1E-37 to 1E+37, and provides at least 6 decimal digits of precision.
SMALLINT uses 2 bytes of storage and has a range of -32,768 to +32,767.
SMALLINT can also be referred to as INT2. Both names refer to the same 2-byte integer type.
PostgreSQL treats NaN values as equal to each other and greater than all non-NaN values. This behavior differs from IEEE 754, which specifies NaN should not equal any value including itself. This PostgreSQL-specific behavior enables sorting and indexing of NaN values.
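A quick illustration:
SELECT 'NaN'::float8 = 'NaN'::float8 AS nan_equals_nan;  -- true in PostgreSQL
SELECT 'NaN'::float8 > 1e308 AS nan_sorts_last;          -- true: NaN is greater than all non-NaN values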
No, the types DECIMAL and NUMERIC are equivalent in PostgreSQL. DECIMAL is a synonym for NUMERIC.
No, NUMERIC values are physically stored without any extra leading or trailing zeroes. The declared precision and scale are maximums, not fixed allocations, making NUMERIC more like varchar(n) than char(n).
The actual storage requirement is two bytes for each group of four decimal digits, plus three to eight bytes overhead.
By default, the SERIAL pseudo-type starts the sequence from 1 and increments by 1 for each new row.
Data Types - Enumerated
26 questions
The CREATE TYPE command is used with the syntax: CREATE TYPE name AS ENUM ('value1', 'value2', 'value3', ...)
No, the sort ordering of enum values cannot be changed, short of dropping and re-creating the enum type.
Yes, enum labels are case sensitive. For example, 'happy' is not the same as 'HAPPY'.
No, if ALTER TYPE ... ADD VALUE is executed inside a transaction block, the new value cannot be used until after the transaction has been committed.
No, existing values cannot be removed from an enum type, short of dropping and re-creating the enum type.
An enum value occupies four bytes on disk, regardless of the length of its textual label.
The syntax is: ALTER TYPE name RENAME VALUE existing_enum_value TO new_enum_value
The translations from internal enum values to textual labels are kept in the system catalog pg_enum.
The maximum length is 63 bytes. The length is limited by the NAMEDATALEN setting compiled into PostgreSQL; in standard builds NAMEDATALEN is 64, which means at most 63 bytes for the actual label (NAMEDATALEN - 1).
The ALTER TYPE ... RENAME VALUE functionality was introduced in PostgreSQL version 10 (2017).
Yes, whitespace in enum labels is significant. Enum values can include spaces (e.g., 'very happy'), and spaces are treated as part of the label value.
The default behavior is RESTRICT, which refuses to drop the type if any objects depend on it. An error will be raised listing the dependent objects.
The default value of NAMEDATALEN is 64 bytes, which includes a trailing zero byte, meaning the actual maximum length for identifiers and enum labels is 63 bytes (NAMEDATALEN - 1).
You must explicitly cast both enum values to text using the ::text operator (e.g., enum1::text = enum2::text) to compare values from different enum types.
No, each enumerated data type is separate and cannot be compared with other enumerated types. Attempting to compare different enum types will raise an error stating 'operator does not exist'. You must cast both values to text (using ::text) to compare them.
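A minimal sketch (the enum types shown are illustrative):
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TYPE weather AS ENUM ('rainy', 'ok', 'sunny');
-- SELECT 'ok'::mood = 'ok'::weather;            -- ERROR: operator does not exist
SELECT 'ok'::mood::text = 'ok'::weather::text;   -- true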
Even-numbered OIDs are guaranteed to be ordered in the same way as the sort ordering of their enum type. Odd-numbered OIDs have no required relationship to sort order. This design enables enum comparisons without catalog lookups.
Yes, comparisons between enum values are faster than string comparisons. Some tests show VARCHAR operations taking approximately 8x longer than ENUM equivalents in high-volume scenarios, though this varies based on data size and query patterns.
DROP TYPE ... CASCADE automatically drops objects that depend on the type (such as table columns, functions, and operators), and in turn all objects that depend on those objects.
The ordering of values in an enum type is the order in which the values were listed when the type was created using CREATE TYPE.
You can use a DO block with a query against pg_type: DO $$ BEGIN IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'your_type_name') THEN CREATE TYPE your_type_name AS ENUM (...); END IF; END $$;
With the two-argument enum_range(), NULL parameters default to the first or last values of the enum type respectively. For example, enum_range(NULL, 'green'::rainbow) returns {red,orange,yellow,green}.
No, PostgreSQL does not currently have native IF NOT EXISTS support for CREATE TYPE variants. You must use a workaround such as a DO block that queries pg_type to check if the type exists before creating it.
Comparisons involving an added enum value will sometimes be slower than comparisons involving only original members of the enum type. This will usually only occur if BEFORE or AFTER is used to set the new value's sort position somewhere other than at the end of the list. The slowdown is usually insignificant, but optimal performance can be regained by dropping and recreating the enum type.
enum_first(anyenum) returns the first value of the input enum type. For example, enum_first(null::rainbow) returns 'red' if 'red' is the first value in the rainbow enum.
If IF NOT EXISTS is specified and the type already contains the new value, a notice is issued but no other action is taken (no error occurs).
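For example (assuming a 'mood' enum type like the one sketched above):
ALTER TYPE mood ADD VALUE IF NOT EXISTS 'ecstatic';
ALTER TYPE mood ADD VALUE IF NOT EXISTS 'ecstatic';  -- second call: NOTICE only, no error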
Data Types - XML
26 questions
A default_expression is evaluated each time a default is needed for the column (i.e., per-row), allowing volatile functions like nextval() to be used.
When tableforest is false, the result is a single XML document: all rows are wrapped in one root element (named after the table, or "table" for query results), with each row as a row element. When tableforest is true, each row becomes a separate top-level element and there is no single root.
| Aspect | IN (VALUES ...) | = ANY (array) |
|---|---|---|
| Statistics | Treated as separate conditions | Single array comparison |
| Index usage | May not use index optimally | Better index scan planning |
| Large lists | Creates many OR conditions | Single array operation |
| Plan caching | Plan varies with list size | More consistent plans |
EXPLAIN Comparison
-- PostgreSQL 17
EXPLAIN SELECT * FROM orders WHERE status IN ('pending', 'processing', 'shipped');
-- Shows: Filter: (status = ANY ('{pending,processing,shipped}'::text[]))
-- But internally treated differently for statistics
-- PostgreSQL 18: Better cardinality estimates
EXPLAIN (ANALYZE, VERBOSE)
SELECT * FROM orders WHERE status IN ('pending', 'processing', 'shipped');
-- More accurate row estimates, better join ordering
Related: OR to ANY Transformation
PostgreSQL 18 also converts OR clauses to arrays:
-- You write:
SELECT * FROM users
WHERE email = 'a@example.com' OR email = 'b@example.com' OR email = 'c@example.com';
-- PostgreSQL 18 transforms to:
SELECT * FROM users
WHERE email = ANY(ARRAY['a@example.com', 'b@example.com', 'c@example.com']);
Performance Impact
-- Large IN lists benefit most
SELECT * FROM products WHERE id IN (
SELECT id FROM temp_import_ids -- 10,000 IDs
);
-- PostgreSQL 18: Uses array comparison
-- Better statistics estimation
-- More efficient index usage
-- Faster execution for large lists
When It Applies
| Pattern | Transformed? |
|---|---|
| IN (1, 2, 3) | Yes |
| IN (SELECT ...) | Depends |
| NOT IN (...) | Yes |
| col1 IN (...) AND col2 IN (...) | Yes (each) |
| Dynamic/prepared statement params | Yes |
Source: PostgreSQL 18 Release Notes
https://www.postgresql.org/docs/18/release-18.html
What is EXPLAIN in PostgreSQL?
EXPLAIN is a PostgreSQL command that displays the execution plan for a SQL statement without executing it. It shows how PostgreSQL's query planner intends to scan tables, join data, and process the query.
Basic Syntax
EXPLAIN SELECT * FROM users WHERE age > 25;
Key Output Information
- Scan type: Sequential Scan, Index Scan, Bitmap Heap Scan, etc.
- Cost estimates: cost=0.00..35.50 (startup cost..total cost), in arbitrary units
- Rows estimate: Expected number of rows returned
- Width: Average row size in bytes
Critical Variants
EXPLAIN ANALYZE - Actually executes the query and shows real timing:
EXPLAIN ANALYZE SELECT * FROM users WHERE age > 25;
Returns actual execution time in milliseconds and actual row counts vs estimates.
EXPLAIN (BUFFERS, ANALYZE) - Adds I/O statistics:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE age > 25;
Shows shared blocks hit/read/written for performance tuning.
When to Use
- Diagnosing slow queries (compare estimated vs actual rows with ANALYZE)
- Verifying index usage (look for "Index Scan" vs "Seq Scan")
- Optimizing JOIN operations (check join method: Hash Join, Nested Loop, Merge Join)
Version Note
The parenthesized options syntax, the BUFFERS option, and the FORMAT options (JSON, YAML, XML) were all added in PostgreSQL 9.0; core EXPLAIN behavior has been stable since then.
Source: PostgreSQL Official Documentation - EXPLAIN command reference (postgresql.org/docs/current/sql-explain.html)
The rows estimate in PostgreSQL's EXPLAIN output is the planner's predicted number of rows that will be returned by that particular plan node. This is an estimate made before query execution, not the actual row count.
Key Details
What it represents:
- The number of rows the query planner expects this node to output
- Calculated using table statistics from pg_stats (gathered by ANALYZE)
- Used by the planner to choose between different execution strategies
How to read it:
EXPLAIN SELECT * FROM users WHERE age > 30;
Seq Scan on users (cost=0.00..15.50 rows=120 width=40)
Filter: (age > 30)
Here rows=120 means the planner estimates 120 rows will match the filter.
EXPLAIN vs EXPLAIN ANALYZE
- EXPLAIN shows rows=<estimate> (predicted)
- EXPLAIN ANALYZE shows both rows=<estimate> and actual rows=<actual> (measured during execution)
EXPLAIN ANALYZE SELECT * FROM users WHERE age > 30;
Seq Scan on users (cost=0.00..15.50 rows=120 width=40) (actual time=0.023..0.156 rows=98 loops=1)
- rows=120 = planner's estimate
- actual rows=98 = true count after execution
Critical for Agents
When the estimate is significantly wrong (off by 10x+), it indicates stale statistics. The solution is to run:
ANALYZE table_name;
Source: PostgreSQL 17 official documentation - Using EXPLAIN
EXPLAIN BUFFERS shows buffer cache statistics for a PostgreSQL query execution, revealing how many 8KB blocks were accessed from various memory/disk sources.
What it shows (PostgreSQL 9.0+):
- shared hit - blocks found in PostgreSQL's shared buffer cache (RAM)
- shared read - blocks read from OS cache or disk into shared buffers
- shared dirtied - blocks modified during query execution
- shared written - blocks physically written to disk during execution
- local hit/read/dirtied/written - same metrics for temporary tables (session-local buffers)
- temp read/written - temporary file I/O when work_mem exceeded
Usage:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE id = 123;
Example output:
Index Scan using users_pkey on users (cost=0.29..8.31 rows=1 width=40) (actual time=0.015..0.016 rows=1 loops=1)
Index Cond: (id = 123)
Buffers: shared hit=4
Planning:
Buffers: shared hit=12
Planning Time: 0.123 ms
Execution Time: 0.032 ms
Key interpretation:
- High "shared hit" = good (data in cache)
- High "shared read" = disk I/O (slower, consider more RAM or indexing)
- "temp read/written" = query spilling to disk (increase
work_mem)
Requirements:
- Must use EXPLAIN (BUFFERS) or EXPLAIN (ANALYZE, BUFFERS) syntax
- Only works with the ANALYZE option for actual buffer counts (without ANALYZE, no buffer stats are shown)
- Requires track_io_timing = on to also show I/O timing statistics
Source: PostgreSQL 17 Official Documentation - EXPLAIN command
A hash join is a join algorithm in PostgreSQL that builds an in-memory hash table from the smaller input table (called the "build" input), then probes it with rows from the larger table (called the "probe" input) to find matches.
How It Works
- Build phase: PostgreSQL scans the smaller table and creates a hash table in memory, hashing the join key values
- Probe phase: For each row in the larger table, PostgreSQL hashes its join key and looks up matching rows in the hash table
When PostgreSQL Uses Hash Joins
The planner chooses hash joins when:
- Joining on equality conditions (e.g., table1.id = table2.id)
- The estimated hash table fits in work_mem (default: 4MB)
- The query has no ORDER BY that would make other join types preferable
Memory Constraints
- work_mem limit: If the hash table exceeds work_mem, PostgreSQL uses temporary disk files (batch mode), degrading performance
- Hash joins require memory ≈ (size of the smaller table's join columns + overhead) × a hash-table factor (~1.2-1.5×)
Example
EXPLAIN ANALYZE
SELECT orders.*, customers.name
FROM orders
JOIN customers ON orders.customer_id = customers.id;
Output might show:
Hash Join (cost=... rows=...)
Hash Cond: (orders.customer_id = customers.id)
-> Seq Scan on orders
-> Hash
-> Seq Scan on customers
Configuration
-- Disable hash joins to force alternative algorithms (for testing)
SET enable_hashjoin = off;
-- Increase work_mem to allow larger hash tables
SET work_mem = '256MB';
Source: PostgreSQL 17 official documentation - Query Planning
How Often to Run ANALYZE in PostgreSQL
Answer: You typically don't need to manually run ANALYZE because PostgreSQL's autovacuum daemon automatically runs it by default.
Automatic ANALYZE (Recommended)
The autovacuum daemon runs ANALYZE automatically when:
Number of inserted/updated/deleted tuples > autovacuum_analyze_threshold + (autovacuum_analyze_scale_factor × table_tuples)
Default settings (PostgreSQL 13+):
- autovacuum_analyze_threshold = 50 rows
- autovacuum_analyze_scale_factor = 0.1 (10% of table)
This means ANALYZE runs automatically after ~10% of a table changes, plus 50 rows minimum.
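To see how close tables are to this automatic-ANALYZE trigger, one option is a rough check against pg_stat_user_tables (a sketch assuming the default thresholds; the real calculation uses reltuples rather than n_live_tup):
SELECT relname,
       n_mod_since_analyze,
       50 + 0.1 * n_live_tup AS approx_analyze_trigger
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC;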
When to Run ANALYZE Manually
Run ANALYZE manually in these specific cases:
- After bulk data loads: Immediately after COPY or bulk INSERT
COPY my_table FROM '/data/file.csv';
ANALYZE my_table;
- After creating indexes on existing data:
CREATE INDEX idx_name ON my_table(column);
ANALYZE my_table;
- When autovacuum is disabled (not recommended): Run daily or after significant changes
- Before critical queries if statistics are stale:
ANALYZE VERBOSE my_table;
Verification
Check last analyze time:
SELECT schemaname, relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'my_table';
Source: PostgreSQL 16 Official Documentation - Routine Vacuuming (§25.1.6)
log_min_duration_statement is a PostgreSQL configuration parameter that logs the duration of SQL statements that take longer than the specified threshold.
Value Format:
- Integer value in milliseconds
- Default: -1 (disabled; no statements logged based on duration)
- 0 = log all statements and their durations
- Positive integer (e.g., 1000) = log only statements taking ≥ that many milliseconds
Configuration:
-- In postgresql.conf
log_min_duration_statement = 1000 # Log statements taking ≥ 1 second
-- Or set dynamically (requires superuser, or a role granted ALTER SYSTEM privilege on the parameter in PostgreSQL 15+)
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();
-- Session-level override
SET log_min_duration_statement = 500; -- For current session only
Log Output Example:
LOG: duration: 1234.567 ms statement: SELECT * FROM large_table WHERE condition = 'value';
Key Behaviors:
- Measures total execution time including parsing, planning, and execution
- Logs to PostgreSQL's standard log destination (file/stderr based on log_destination)
- Does NOT log statements faster than the threshold
- Setting to 0 creates significant log volume in production
Common Use Cases:
- Performance monitoring: Set to 1000 (1 second) to catch slow queries
- Query auditing: Set to 0 for complete statement logging (high I/O cost)
- Troubleshooting: Temporarily lower the threshold to diagnose performance issues
Version: Available in PostgreSQL 8.0+, behavior consistent through PostgreSQL 17.
Source: PostgreSQL Official Documentation - Error Reporting and Logging
Creating Extended Statistics in PostgreSQL
Extended statistics are created using the CREATE STATISTICS command (available in PostgreSQL 10+). They help the query planner make better estimates when columns have correlated data.
Basic Syntax
CREATE STATISTICS stat_name [(statistic_type [, ...])]
ON column_name1, column_name2 [, ...]
FROM table_name;
Statistic Types
PostgreSQL supports these types (10+):
- dependencies - Multi-column functional dependencies (PG 10+)
- ndistinct - N-distinct counts for column combinations (PG 10+)
- mcv - Most-common-values lists for column combinations (PG 12+)
If no kinds are listed, all kinds supported by your server version are built.
Examples
Basic creation (no kinds listed, so all supported kinds are built):
CREATE STATISTICS orders_stats
ON customer_id, order_date
FROM orders;
Specify statistic types explicitly:
CREATE STATISTICS city_zip_stats (dependencies, ndistinct, mcv)
ON city, zip_code
FROM addresses;
Multiple column combinations (PG 10+):
CREATE STATISTICS product_stats
ON category, brand, price_range
FROM products;
Critical Details
- Columns: Up to 8 columns per statistics object
- Activation: Run ANALYZE table_name; after creation to populate the statistics
- Updates: Statistics are automatically updated during ANALYZE operations
- View existing: Query the pg_statistic_ext catalog
- Drop: DROP STATISTICS stat_name;
Complete Working Example
-- Create statistics
CREATE STATISTICS user_location_stats (dependencies, mcv)
ON country, state
FROM users;
-- Populate the statistics
ANALYZE users;
-- Verify creation
SELECT stxname, stxkeys, stxkind
FROM pg_statistic_ext
WHERE stxname = 'user_location_stats';
Source: PostgreSQL Official Documentation v10-17, CREATE STATISTICS command reference.
How to log slow queries in PostgreSQL
Set these parameters in postgresql.conf (or via ALTER SYSTEM):
-- Log queries taking longer than 1000ms (1 second)
log_min_duration_statement = 1000
-- Recommended additional settings for useful slow query logs
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
log_duration = off
log_statement = 'none'
Key Parameters:
- log_min_duration_statement: Time in milliseconds. Queries taking longer than this value are logged.
- -1 = disabled (default)
- 0 = log all queries
- 1000 = log queries ≥1 second (common starting point)
- 5000 = log queries ≥5 seconds (production systems)
- log_duration: Should be off when using log_min_duration_statement to avoid duplicate entries
Apply changes:
-- Option 1: Modify postgresql.conf, then reload
SELECT pg_reload_conf();
-- Option 2: Set for current session only
SET log_min_duration_statement = 1000;
-- Option 3: Set globally without editing config file
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();
Log output location:
Check log_destination and log_directory in postgresql.conf. Default is typically /var/log/postgresql/postgresql-{version}-main.log on Debian/Ubuntu or $PGDATA/log/ on other systems.
Verified on: PostgreSQL 12-17 (parameter behavior unchanged since 8.0)
Source: PostgreSQL Official Documentation - Error Reporting and Logging
In PostgreSQL's EXPLAIN ANALYZE, actual time is the real wall-clock time (in milliseconds) spent executing a specific plan node, measured from when the node starts producing rows until it finishes.
The format is: actual time=X..Y
- X (startup time): Milliseconds from when the node starts until it produces its first row
- Y (total time): Total milliseconds from start to completion (including all rows)
Key characteristics:
- Includes child nodes: The total time for a parent node includes all time spent in its children
- Excludes result transmission: Time to send results to the client is NOT included
- Per-loop measurement: For nodes inside loops (e.g., nested loop joins), the times shown are averages. Check loops=N to see how many times the node executed, and multiply actual time by loops to get the total time spent.
Example:
EXPLAIN ANALYZE SELECT * FROM users WHERE id = 1;
Index Scan using users_pkey on users (cost=0.29..8.31 rows=1 width=100)
(actual time=0.015..0.016 rows=1 loops=1)
- Startup: 0.015ms (time to find first row)
- Total: 0.016ms (time to complete scan)
- Executed 1 time (loops=1)
Source: PostgreSQL official documentation on EXPLAIN (applicable to all modern versions 9.x-17.x)
Force PostgreSQL to Use a Specific Index
Use the SET enable_* session parameters to disable alternative access methods, forcing PostgreSQL to prefer your target index.
Primary Method (PostgreSQL 9.0+):
-- Disable sequential scans to force index usage
SET enable_seqscan = off;
-- Execute your query
SELECT * FROM users WHERE email = 'user@example.com';
-- Re-enable for subsequent queries
SET enable_seqscan = on;
For Specific Index Types:
-- Disable bitmap scans (forces regular index scan)
SET enable_bitmapscan = off;
-- Disable index scans (forces bitmap or seq scan)
SET enable_indexscan = off;
-- Disable index-only scans
SET enable_indexonlyscan = off;
To Force a Specific Index When Multiple Exist:
Disable other index types and drop competing indexes temporarily, or use pg_hint_plan extension:
-- With pg_hint_plan extension
/*+ IndexScan(users users_email_idx) */
SELECT * FROM users WHERE email = 'user@example.com';
Production-Safe Approach:
Use transaction-scoped settings to avoid affecting other queries:
BEGIN;
SET LOCAL enable_seqscan = off;
SELECT * FROM users WHERE email = 'user@example.com';
COMMIT;
Critical Notes:
- These settings influence the planner's cost estimates (disabled plan types receive a huge cost penalty, around 10^10); they are not hard disables
- The setting applies to the entire session unless you use SET LOCAL (transaction-scoped)
- Forcing indexes is typically a code smell; investigate with EXPLAIN ANALYZE why the planner isn't choosing your index
- Ensure statistics are current: ANALYZE table_name;
Source: PostgreSQL 16 Official Documentation - Runtime Config: Query Planning
PostgreSQL uses a bitmap index scan when the query planner's cost-based optimizer determines it's the most efficient method based on selectivity and estimated costs. Specifically:
When bitmap scans are chosen:
Medium selectivity queries - Bitmap scans serve as a middle ground between index scans (high selectivity, few rows) and sequential scans (low selectivity, many rows)
Combining multiple indexes - When a query has:
- OR conditions on the same column: WHERE x = 42 OR x = 47
- AND/OR conditions across different indexed columns: WHERE x = 5 AND y = 6
- Complex Boolean combinations that can leverage multiple indexes (see the plan sketch after this list)
When fetching multiple rows from the same data pages - Bitmap scans visit table rows in physical order, making bulk reads more efficient than random index scans
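A sketch of what index combining can look like in a plan (table and index names are illustrative):
EXPLAIN SELECT * FROM events WHERE type_id = 42 OR type_id = 47;
-- Bitmap Heap Scan on events
--   Recheck Cond: ((type_id = 42) OR (type_id = 47))
--   ->  BitmapOr
--         ->  Bitmap Index Scan on events_type_id_idx
--               Index Cond: (type_id = 42)
--         ->  Bitmap Index Scan on events_type_id_idx
--               Index Cond: (type_id = 47)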
Cost-based decision factors:
The planner compares costs using:
- random_page_cost × pages + cpu_index_tuple_cost × tuples + cpu_operator_cost × tuples
- Selectivity estimates from table statistics (pg_class.reltuples × selectivity)
- work_mem settings (affects whether the bitmap becomes "lossy")
The planner chooses bitmap scan when its estimated cost is lower than both a plain index scan and a sequential scan for the given query conditions.
Trade-offs:
- Bitmap scans lose original index ordering (requires separate sort for ORDER BY)
- Each additional index adds overhead
- Physical order access improves I/O efficiency over random index lookups
Merge Join in PostgreSQL
A merge join is a join algorithm in PostgreSQL that combines two sorted datasets by scanning them in parallel and matching rows based on the join condition. It requires both input relations to be sorted on the join key.
How It Works
- PostgreSQL sorts both input tables on the join columns (if not already sorted)
- Scans both sorted datasets simultaneously, advancing through each when keys don't match
- When keys match, outputs the joined rows
- Time complexity: O(N log N + M log M) for sorting, O(N + M) for the merge phase
When PostgreSQL Uses Merge Join
The planner chooses merge join when:
- The join condition uses equality operators (=)
- Input datasets are already sorted (from indexes or previous operations), OR the sorting cost is justified
- Working with large datasets where a nested loop would be too slow and a hash join would exceed work_mem
Performance Characteristics
- Best for: Large tables that are pre-sorted or have indexes on join columns
- Memory usage: Minimal - only needs to hold current rows from each input
- Handles duplicates: Efficiently processes duplicate join keys using "mark and restore"
Example
-- Query that may use merge join
EXPLAIN SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2024-01-01';
-- Example output showing merge join:
-- Merge Join (cost=...)
-- Merge Cond: (o.customer_id = c.id)
-- -> Index Scan using orders_customer_id_idx on orders o
-- -> Index Scan using customers_pkey on customers c
Configuration
- enable_mergejoin (default: on) - set to off to disable merge joins globally (debugging only)
- Sorting uses work_mem per sort operation
Source: PostgreSQL 17 Official Documentation - Query Planning: Merge Join
PostgreSQL Correlation Statistic
Correlation in PostgreSQL statistics measures the alignment between the physical order of rows on disk and the logical order of a column's values. It ranges from -1.0 to +1.0.
Exact Values
- +1.0: Column values are stored in perfect ascending order on disk
- -1.0: Column values are stored in perfect descending order on disk
- 0.0: No correlation between physical and logical order (random distribution)
Why It Matters
The query planner uses correlation to estimate the cost of index scans. High absolute correlation (near ±1.0) means sequential reads, which are much faster. Low correlation (near 0) means random disk access, which is slower.
Where to Find It
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'your_table'
AND attname = 'your_column';
The correlation column in pg_stats shows this value for each indexed column.
When It's Updated
Correlation is calculated by ANALYZE. PostgreSQL samples the table (default: 300 × default_statistics_target rows, where default target is 100) and compares physical tuple positions to sorted logical values.
Performance Impact
- High correlation (|correlation| > 0.8): Index scans are cheap; optimizer favors index usage
- Low correlation (|correlation| < 0.2): Index scans require many random I/Os; optimizer may prefer sequential scans
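If low correlation on a heavily scanned column is hurting performance, physically reordering the table is one option (a sketch; CLUSTER takes an ACCESS EXCLUSIVE lock while it runs, and correlation decays again as new rows arrive):
CLUSTER your_table USING your_index;  -- rewrite the table in index order
ANALYZE your_table;                   -- refresh the correlation statistic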
Source
PostgreSQL official documentation on statistics (pg_stats view), valid for PostgreSQL 12+. The calculation method is defined in src/backend/commands/analyze.c.
PostgreSQL work_mem Setting
work_mem is a PostgreSQL configuration parameter that sets the maximum memory used for internal sort operations and hash tables before writing to temporary disk files. Each operation (sort, hash join, merge join) can use up to this amount.
Key Details
Default Value: 4MB (4096 kB)
Scope: Can be set per-session, per-transaction, per-user, or server-wide
Unit: Specified in kilobytes by default, but accepts kB, MB, GB
Critical Behavior:
- Multiple operations in a single query can EACH allocate work_mem
- A complex query with 5 sorts could use 5 × work_mem
- Memory is allocated per operation, not per query
Configuration
-- Session-level (recommended for specific queries)
SET work_mem = '256MB';
-- Transaction-level
BEGIN;
SET LOCAL work_mem = '512MB';
-- query here
COMMIT;
# postgresql.conf (server-wide)
work_mem = 64MB
Setting Guidelines
Formula: (Total RAM - shared_buffers) / (max_connections × 2 to 3)
For a server with 16GB RAM, shared_buffers=4GB, max_connections=100:
- Available: 12GB
- work_mem: ~40-60MB is safe
Warning: Setting too high risks OOM kills. Monitor with pg_stat_statements and system memory.
Version: Applies to PostgreSQL 9.0+ (behavior unchanged in 17.x as of 2025)
How work_mem Affects Query Performance
work_mem controls the maximum memory PostgreSQL uses for each internal sort or hash operation within a query before spilling to disk.
Direct Performance Impact
Insufficient work_mem:
- Operations exceed memory limit → data written to temporary disk files
- Disk I/O is 100-1000x slower than memory operations
- Query execution time increases dramatically (often 10-100x slower)
Sufficient work_mem:
- Operations complete entirely in memory
- Fast execution using in-memory sorting/hashing algorithms
- No temporary file creation overhead
Exact Default and Typical Values
- Default: 4MB (4096 KB)
- Common production settings: 16MB - 256MB
- Maximum practical: Limited by available RAM and connection count
Critical Calculation
Total potential memory = work_mem × max_connections × operations_per_query
A single complex query can use work_mem multiple times simultaneously (one per sort/hash node in the execution plan).
Example: If work_mem = 256MB, max_connections = 100, and queries average 4 operations:
Worst case = 256MB × 100 × 4 = 100GB RAM required
Operations Affected by work_mem
- ORDER BY sorts
- DISTINCT operations
- Merge joins
- Hash joins (building the hash table)
- Hash-based aggregations (GROUP BY)
- Recursive CTEs
- Bitmap index scans (combining bitmaps)
Verification
Check if queries are spilling to disk:
-- Shows temporary file usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM large_table ORDER BY column;
Look for: "Temp File: ... kB"
Setting work_mem
Session-level (recommended for specific heavy queries):
SET work_mem = '256MB';
-- Run your query
RESET work_mem;
Global (postgresql.conf):
work_mem = 64MB
Source
PostgreSQL Official Documentation v16: Resource Consumption
Version: Behavior consistent across PostgreSQL 9.6-16+
seq_page_cost is a PostgreSQL planner configuration parameter that sets the estimated cost of reading one disk page (8 KB) during a sequential scan.
Default Value: 1.0 (in PostgreSQL 9.6+)
Purpose: The query planner uses this cost unit as a baseline to compare different query execution plans. All other planner cost parameters are relative to this value:
- random_page_cost (default: 4.0) - cost of a non-sequential disk page fetch
- cpu_tuple_cost (default: 0.01) - cost of processing one row
- cpu_operator_cost (default: 0.0025) - cost of processing one operator
Configuration:
-- Set globally (requires superuser + reload)
ALTER SYSTEM SET seq_page_cost = 1.0;
SELECT pg_reload_conf();
-- Set for current session
SET seq_page_cost = 1.0;
-- Set for a specific tablespace (seq_page_cost cannot be set per table)
ALTER TABLESPACE my_tablespace SET (seq_page_cost = 0.5);
Tuning Guidance:
- SSDs/NVMe: Lower to 0.5-0.75 (sequential and random reads are similar speed)
- HDDs with large cache: Keep at 1.0 (default)
- Slow storage: May increase slightly, but adjust the random_page_cost ratio instead
The ratio between seq_page_cost and random_page_cost matters more than absolute values. For SSDs, use ratio of 1:1 to 1:2 (e.g., seq=1.0, random=1.5).
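For example, on SSD-backed storage (illustrative values in line with the ratio guidance above):
ALTER SYSTEM SET seq_page_cost = 1.0;
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();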
Source: PostgreSQL 17 Documentation - Runtime Config: Query Planning
Bitmap Index Scan in PostgreSQL
A Bitmap Index Scan is a two-phase query execution strategy in PostgreSQL that efficiently handles queries matching multiple rows from an index.
How It Works
Phase 1 - Bitmap Index Scan: Scans the index and creates an in-memory bitmap of matching heap page numbers (not individual row locations). Each bit represents a heap page.
Phase 2 - Bitmap Heap Scan: Reads the actual table pages marked in the bitmap, fetching all matching rows from each page in physical order.
When PostgreSQL Uses It
The query planner chooses bitmap scans when:
- Query matches too many rows for an efficient regular Index Scan (typically >5-10% of table)
- Too few rows for a full Sequential Scan to be optimal
- Multiple indexes can be combined with AND/OR operations
Key Advantages
- Page-level deduplication: Multiple index entries pointing to the same page = single page read
- Sequential I/O: Reads heap pages in physical order (vs random I/O in Index Scan)
- Index combining: Can merge bitmaps from multiple indexes using BitmapAnd or BitmapOr
Example
-- Query that triggers bitmap scan
EXPLAIN ANALYZE
SELECT * FROM users WHERE age BETWEEN 25 AND 35;
Bitmap Heap Scan on users (cost=12.75..89.23 rows=450 width=40)
Recheck Cond: ((age >= 25) AND (age <= 35))
-> Bitmap Index Scan on users_age_idx (cost=0.00..12.64 rows=450 width=0)
Index Cond: ((age >= 25) AND (age <= 35))
Memory Limit
Bitmap size is limited by work_mem (default: 4MB). If the bitmap exceeds work_mem, it becomes "lossy" - tracking only pages, requiring a Recheck Cond during heap scan to filter rows.
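With EXPLAIN (ANALYZE, BUFFERS) a lossy bitmap shows up in the heap-block breakdown (illustrative output):
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE age BETWEEN 25 AND 35;
-- Bitmap Heap Scan on users ...
--   Recheck Cond: ((age >= 25) AND (age <= 35))
--   Rows Removed by Index Recheck: 12345
--   Heap Blocks: exact=210 lossy=1800
--   lossy > 0 means the bitmap exceeded work_mem and fell back to page-level tracking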
Source: PostgreSQL 17 Documentation - Using EXPLAIN
effective_cache_size is a PostgreSQL configuration parameter that sets the planner's assumption about the effective size of the disk cache available to a single query. It influences the query optimizer's cost estimates when deciding between index scans and sequential scans.
Key Characteristics:
- Does not allocate memory - it's only a hint to the query planner
- Higher values make index scans more likely to be chosen
- Lower values make sequential scans more likely
- Default: 4GB (524288 blocks)
- Unit: Blocks (typically 8kB each) when specified without units
- Can be changed without restarting PostgreSQL
What to Include:
When setting this value, consider both:
- PostgreSQL's shared_buffers
- The portion of the kernel's disk cache used for PostgreSQL data files
Typical Settings:
- Conservative: 50% of total system memory
- Aggressive: 75% of total system memory
The planner uses this to estimate whether index blocks are likely to be in cache or need physical disk reads, affecting cost calculations for query plans.
PostgreSQL provides three primary methods to identify slow queries:
Method 1: pg_stat_statements Extension (Recommended)
Enable the pg_stat_statements extension to track execution statistics for all SQL statements.
Setup:
- Add to postgresql.conf:
shared_preload_libraries = 'pg_stat_statements'
compute_query_id = on
- Restart PostgreSQL
- Enable in your database:
CREATE EXTENSION pg_stat_statements;
Query for slowest queries by total time:
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Query for slowest queries by average time:
SELECT query, calls, mean_exec_time, total_exec_time, rows
FROM pg_stat_statements
WHERE calls > 100 -- Exclude rarely-run queries
ORDER BY mean_exec_time DESC
LIMIT 10;
Key metrics:
- total_exec_time - Total time spent executing this query
- mean_exec_time - Average execution time per call
- calls - Number of times executed
- rows - Total rows returned/affected
Method 2: log_min_duration_statement
Log queries that exceed a specific duration threshold.
Configuration in postgresql.conf:
log_min_duration_statement = 250ms
- -1 (default) - Disabled
- 0 - Log all statement durations
- 250ms - Log statements running 250ms or longer
This can be changed at runtime without restart by superusers.
Method 3: auto_explain Extension
Automatically log execution plans for slow queries.
Configuration:
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 250ms
This captures EXPLAIN output for queries exceeding the threshold.
EXPLAIN ANALYZE in PostgreSQL
EXPLAIN ANALYZE is a PostgreSQL command that executes a query and returns detailed performance statistics about how the query planner executed it, including actual execution times and row counts.
Usage
EXPLAIN ANALYZE SELECT * FROM users WHERE age > 25;
Key Differences from EXPLAIN
- EXPLAIN shows the query plan only (no execution)
- EXPLAIN ANALYZE actually runs the query and shows both the plan and actual runtime statistics
Critical Output Metrics
The output includes:
- Planning Time - Time spent generating the query plan (milliseconds)
- Execution Time - Actual time to run the query (milliseconds)
- Actual rows - Number of rows processed at each node
- Actual loops - Number of times each node executed
- Buffers - I/O statistics (when used with
BUFFERSoption)
Important Options
-- Include buffer usage statistics (shared blocks read/hit)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users;
-- Show output in JSON format
EXPLAIN (ANALYZE, FORMAT JSON) SELECT * FROM users;
-- Other formats: TEXT (default), XML, YAML
EXPLAIN (ANALYZE, FORMAT YAML) SELECT * FROM users;
Critical Warning
⚠️ EXPLAIN ANALYZE executes the query completely, including:
- INSERT, UPDATE, DELETE operations will modify data
- Use within a transaction with ROLLBACK to prevent changes:
BEGIN;
EXPLAIN ANALYZE DELETE FROM users WHERE id = 1;
ROLLBACK;
Version Compatibility
EXPLAIN ANALYZE has been available since PostgreSQL 7.2. The BUFFERS option was added in PostgreSQL 9.0. Latest stable major release: PostgreSQL 17 (as of January 2025).
EXPLAIN vs EXPLAIN ANALYZE in PostgreSQL
EXPLAIN shows the query execution plan WITHOUT running the query. It displays PostgreSQL's estimated costs, row counts, and query strategy based on table statistics.
EXPLAIN ANALYZE actually EXECUTES the query and shows both the plan AND actual runtime statistics (real execution time, actual row counts, memory usage).
Key Differences
| Aspect | EXPLAIN | EXPLAIN ANALYZE |
|---|---|---|
| Query execution | No (plan only) | Yes (full execution) |
| Timing data | Estimated only | Actual measured times |
| Row counts | Estimated | Actual |
| Side effects | None | ALL side effects occur |
Code Examples
-- EXPLAIN: Safe, shows plan only, no execution
EXPLAIN SELECT * FROM users WHERE created_at > '2024-01-01';
Output shows estimated costs:
Seq Scan on users (cost=0.00..35.50 rows=10 width=100)
Filter: (created_at > '2024-01-01'::date)
-- EXPLAIN ANALYZE: Executes query, shows actual performance
EXPLAIN ANALYZE SELECT * FROM users WHERE created_at > '2024-01-01';
Output includes actual times:
Seq Scan on users (cost=0.00..35.50 rows=10 width=100)
(actual time=0.015..0.892 rows=12 loops=1)
Filter: (created_at > '2024-01-01'::date)
Planning Time: 0.123 ms
Execution Time: 1.045 ms
Critical Warning
EXPLAIN ANALYZE executes INSERT/UPDATE/DELETE queries, causing data modifications. Wrap in a transaction with ROLLBACK for safety:
BEGIN;
EXPLAIN ANALYZE DELETE FROM users WHERE id = 123;
ROLLBACK; -- Prevents actual deletion
When to Use Each
- EXPLAIN: Quick plan inspection, checking if indexes are used, safe for all queries
- EXPLAIN ANALYZE: Troubleshooting slow queries, comparing estimated vs actual performance, validating optimizer decisions
Source: PostgreSQL Official Documentation (applies to PostgreSQL 9.0+, current as of PostgreSQL 17)
Streaming Replication > Synchronous Replication
24 questions
No - sync_priority has no effect in quorum-based synchronous replication (ANY mode)
remote_apply - this causes each commit to wait until standbys report they have replayed the transaction, making it visible to queries
The application_name setting of the standby, as set in the primary_conninfo of the standby's WAL receiver
remote_write, remote_apply, and local all provide the same local synchronization level as 'on' - they behave the same when synchronous_standby_names is empty
Transaction commits wait until their WAL records are replicated to 3 synchronous standbys chosen based on their priorities from the list of s1, s2, s3, and s4
No - all two-phase commit actions (including PREPARE and COMMIT PREPARED) always require commit waits and are always synchronous regardless of synchronous_commit setting
Yes - the syntax 'standby_name [, ...]' is equivalent to FIRST 1
DROP TABLE and two-phase commit commands (PREPARE TRANSACTION, COMMIT PREPARED) always commit synchronously
If both the primary and the standby suffer crashes at the same time
No - subtransaction commits don't wait for responses; only top-level commits require standby acknowledgment
This standby server is now asynchronous, but can potentially become synchronous if one of current synchronous ones fails
No - cascading replication is currently asynchronous, and synchronous replication settings have no effect on cascading replication. Named standbys must be directly connected to the primary.
Yes - if cascaded replication is in use, the feedback is passed upstream until it eventually reaches the primary
This standby server is considered as a candidate for quorum standbys
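A configuration sketch tying these behaviors together (standby names are illustrative):
# postgresql.conf on the primary
synchronous_commit = remote_apply
# Priority-based: the first two connected standbys from the list are synchronous
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
# Quorum-based alternative: any two of the listed standbys must confirm (sync_priority is ignored)
# synchronous_standby_names = 'ANY 2 (s1, s2, s3)'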
Performance Tuning
24 questions
PostgreSQL Autovacuum Configuration
Configure autovacuum by modifying parameters in postgresql.conf or using ALTER TABLE for table-specific settings.
Essential Global Parameters (postgresql.conf)
# Enable autovacuum (default: on)
autovacuum = on
# Maximum autovacuum worker processes (default: 3, range: 1-262143)
autovacuum_max_workers = 3
# Delay between autovacuum runs in seconds (default: 60s, range: 1s-2147483s)
autovacuum_naptime = 60
# Threshold to trigger vacuum (default: 50 rows)
autovacuum_vacuum_threshold = 50
# Fraction of table size to trigger vacuum (default: 0.2 = 20%)
autovacuum_vacuum_scale_factor = 0.2
# Threshold to trigger analyze (default: 50 rows)
autovacuum_analyze_threshold = 50
# Fraction of table size to trigger analyze (default: 0.1 = 10%)
autovacuum_analyze_scale_factor = 0.1
# Cost-based delay in milliseconds (default: 2ms, range: -1 to 100)
autovacuum_vacuum_cost_delay = 2
# Cost limit before applying delay (default: 200, range: -1 to 10000)
autovacuum_vacuum_cost_limit = 200
Trigger formula: Autovacuum runs when:
dead_tuples > autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor * table_size)
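To compare each table's dead-tuple count against its (default) trigger point, one rough check (a sketch; the real calculation uses reltuples rather than n_live_tup):
SELECT relname,
       n_dead_tup,
       50 + 0.2 * n_live_tup AS approx_vacuum_trigger
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;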
Table-Specific Overrides
-- Set custom autovacuum parameters for a specific table
ALTER TABLE my_table SET (
autovacuum_vacuum_threshold = 100,
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_analyze_threshold = 100,
autovacuum_analyze_scale_factor = 0.05
);
-- Disable autovacuum for a specific table (not recommended)
ALTER TABLE my_table SET (autovacuum_enabled = false);
-- View table-specific settings
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'my_table';
Apply Configuration Changes
-- Reload configuration (no restart needed for autovacuum settings)
SELECT pg_reload_conf();
Verification
-- Check if autovacuum is running
SELECT * FROM pg_stat_activity WHERE query LIKE 'autovacuum:%';
-- View autovacuum statistics per table
SELECT schemaname, relname, last_autovacuum, last_autoanalyze, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
Source: PostgreSQL 16 Official Documentation - Automatic Vacuuming
Note: For high-write tables, reduce scale_factor to 0.01-0.05 and increase vacuum_cost_limit to 1000-2000 for more aggressive vacuuming.
VACUUM vs VACUUM FULL in PostgreSQL
VACUUM reclaims dead tuple space and marks it as reusable within the existing data files, but does not return disk space to the operating system. It updates statistics and prevents transaction ID wraparound. Tables remain online and accessible during operation.
VACUUM FULL rewrites the entire table to a new file, compacting it to minimum size and returning freed disk space to the OS. It requires an ACCESS EXCLUSIVE lock, blocking all reads and writes. Much slower and more resource-intensive.
Key Differences
| Aspect | VACUUM | VACUUM FULL |
|---|---|---|
| Lock Level | SHARE UPDATE EXCLUSIVE (allows SELECT/INSERT/UPDATE/DELETE) | ACCESS EXCLUSIVE (blocks everything) |
| Disk Space | Marks space reusable, doesn't shrink files | Returns space to OS, shrinks files |
| Speed | Fast, incremental | Slow, rewrites entire table |
| Disk Requirement | Minimal | Needs free space equal to table size |
| Use Case | Regular maintenance | Bloat recovery after massive deletes |
Usage Examples
-- Regular vacuum (use this for routine maintenance)
VACUUM table_name;
-- Full vacuum (use sparingly, during maintenance windows)
VACUUM FULL table_name;
-- Vacuum with analysis update
VACUUM ANALYZE table_name;
-- Verbose output
VACUUM (VERBOSE) table_name;
Best Practice Decision Criteria
- Use VACUUM: For routine maintenance, handled automatically by autovacuum (enabled by default since PostgreSQL 8.3)
- Use VACUUM FULL: Only when table bloat exceeds 30-40% and you have a maintenance window. Consider the pg_repack extension instead as a zero-downtime alternative.
Source: PostgreSQL 16 Official Documentation - Routine Vacuuming (https://www.postgresql.org/docs/current/routine-vacuuming.html)
shared_buffers in PostgreSQL
shared_buffers is a PostgreSQL configuration parameter that sets the amount of memory allocated for caching data pages in RAM, shared by all database processes.
Default Value and Recommendations
Default: 128 MB (as of PostgreSQL 9.3+)
Recommended setting: 25% of total system RAM for dedicated database servers. For systems with 1GB+ RAM, start with:
- Minimum: 256 MB
- Typical: 25% of system RAM
- Maximum practical: 8-16 GB (diminishing returns beyond this)
Configuration
Set in postgresql.conf:
shared_buffers = 4GB
Or using different units:
shared_buffers = 4096MB
shared_buffers = 524288 # in 8KB pages (default unit)
Requires: Database restart to take effect (not reloadable with pg_reload_conf())
How It Works
- Stores frequently accessed table and index data in memory
- Uses PostgreSQL's internal cache, separate from OS page cache
- Measured in 8KB blocks (PostgreSQL's page size)
- Too low: Excessive disk I/O
- Too high: Less memory for OS cache and working memory, potential performance degradation
Check Current Value
SHOW shared_buffers;
-- Returns: 4GB
SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
-- Returns the setting in 8kB pages (524288 for 4GB) and the unit (8kB)
Source: PostgreSQL 16 Official Documentation - Resource Consumption settings
Reducing PostgreSQL Table Bloat Without Locking
Use pg_repack extension to rebuild tables without blocking concurrent operations.
Installation
CREATE EXTENSION pg_repack;
Requires: PostgreSQL 9.4+ (current versions: 12-17 supported as of pg_repack 1.5.1)
Usage
# Repack a specific table (no exclusive locks during rebuild)
pg_repack -t table_name database_name
# Repack entire database
pg_repack database_name
# With connection parameters
pg_repack -h localhost -p 5432 -U postgres -t table_name database_name
How It Works
- Creates temporary copy of table with proper indexes
- Uses triggers to track concurrent changes during rebuild
- Applies accumulated changes from triggers
- Swaps tables atomically (brief ACCESS EXCLUSIVE lock, <0.1 seconds)
- Drops the old bloated table
Lock Behavior
- During rebuild: Only ACCESS SHARE locks (allows reads/writes)
- Final swap: ACCESS EXCLUSIVE for milliseconds only
- Safe for production use on active tables
Requirements
- The target table must have a PRIMARY KEY or UNIQUE NOT NULL index
- Requires free disk space of about 2× the table size temporarily
- The user needs SUPERUSER, or table ownership plus CREATE privilege on the database
Alternative: CONCURRENTLY Operations
For minor bloat, use:
-- Rebuild indexes without blocking writes
REINDEX INDEX CONCURRENTLY index_name;
-- PostgreSQL 12+: Reindex table indexes
REINDEX TABLE CONCURRENTLY table_name;
Limitation: These only fix index bloat, not table bloat.
Source
Official pg_repack documentation: https://reorg.github.io/pg_repack/
PostgreSQL REINDEX docs: https://www.postgresql.org/docs/current/sql-reindex.html
Detecting Table Bloat in PostgreSQL
Use the pgstattuple extension to get exact bloat metrics, or calculate estimated bloat using system catalogs.
Method 1: pgstattuple Extension (Most Accurate)
Requires PostgreSQL 8.2+
-- Enable the extension (one time, requires superuser)
CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- Check bloat for a specific table (scans the whole table)
SELECT
  t.schemaname,
  t.tablename,
  pg_size_pretty(s.table_len) AS table_size,
  round(100 * (s.dead_tuple_len + s.free_space)::numeric / NULLIF(s.table_len, 0), 2) AS bloat_pct,
  pg_size_pretty(s.dead_tuple_len) AS dead_tuple_size
FROM pg_tables t,
     LATERAL pgstattuple(t.schemaname || '.' || t.tablename) AS s
WHERE t.schemaname = 'public' AND t.tablename = 'your_table_name';
Key metrics from pgstattuple:
- dead_tuple_len: Bytes occupied by dead tuples
- free_space: Bytes of free space
- Bloat % = (dead_tuple_len + free_space) / table_len × 100
Warning: pgstattuple() acquires ACCESS SHARE lock and scans entire table - avoid on large production tables during peak hours.
Method 2: Statistical Estimation (Faster, No Locks)
-- Estimate bloat without full table scan
SELECT
current_database() AS db,
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
ROUND(CASE WHEN otta=0 OR sml.relpages=0 OR sml.relpages=otta THEN 0.0
ELSE (sml.relpages-otta::numeric)/sml.relpages END, 3) AS tbloat_ratio,
pg_size_pretty(((CASE WHEN relpages < otta THEN 0
ELSE relpages::bigint - otta END) * 8192)::bigint) AS wastedbytes
FROM (
SELECT
schemaname, tablename, cc.relpages, bs,
CEIL((cc.reltuples*((datahdr+ma-
(CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)) AS otta
FROM (
SELECT
ma,bs,schemaname,tablename,
(datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
(maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
FROM (
SELECT
schemaname, tablename, hdr, ma, bs,
SUM((1-null_frac)*avg_width) AS datawidth,
MAX(null_frac) AS maxfracsum,
hdr+(
SELECT 1+count(*)/8
FROM pg_stats s2
WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
) AS nullhdr
FROM pg_stats s, (
SELECT
(SELECT current_setting('block_size')::numeric) AS bs,
CASE WHEN substring(v,12,3) IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
CASE WHEN v ~ 'mingw32' THEN 8 ELSE 4 END AS ma
FROM (SELECT version() AS v) AS foo
) AS constants
GROUP BY 1,2,3,4,5
) AS foo
) AS rs
JOIN pg_class cc ON cc.relname = rs.tablename
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = rs.schemaname
) AS sml
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
AND sml.relpages > 128 -- Only tables > 1MB
ORDER BY wastedbytes DESC;
Actionable thresholds:
- tbloat_ratio < 0.2 (20%): Normal, no action needed
- tbloat_ratio 0.2-0.4: Consider VACUUM FULL or REINDEX during maintenance window
- tbloat_ratio > 0.4: High bloat - schedule maintenance immediately
Method 3: Quick Check via Statistics
-- Fast check using n_dead_tup from pg_stat_user_tables
SELECT
schemaname,
relname AS tablename,
n_live_tup,
n_dead_tup,
ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
Dead tuple threshold: If dead_pct > 20%, manually run VACUUM or tune autovacuum settings.
Source
PostgreSQL Official Documentation: pgstattuple (v16) - https://www.postgresql.org/docs/current/pgstattuple.html
wal_buffers is a PostgreSQL configuration parameter that sets the amount of shared memory used for buffering Write-Ahead Log (WAL) data before it's written to disk.
Configuration
Default value: -1 (auto-tuned to 1/32 of shared_buffers, minimum 64 kB, maximum 16 MB)
Valid range: 32 kB to 16 GB (or -1 for automatic)
Units: 8 kB blocks (PostgreSQL page size)
Restart required: Yes (requires PostgreSQL restart to change)
Setting the Value
In postgresql.conf:
# Automatic (recommended for most cases)
wal_buffers = -1
# Manual setting (in units of 8kB)
wal_buffers = 2048 # 16 MB
# Or with unit suffix
wal_buffers = 16MB
How It Works
WAL data is first written to these buffers in shared memory, then flushed to disk at commit time or when buffers fill. Larger buffers reduce I/O for write-heavy workloads by batching more WAL data per flush.
Tuning Guidance
- Default (-1): Sufficient for most workloads
- High write throughput: Set to 16-32 MB manually
- Values above 32 MB: Rarely beneficial; diminishing returns
- Avoid setting it below 1 MB on production systems
Version Notes
Applies to PostgreSQL 9.1+ (auto-tuning introduced in 9.1). Prior versions required manual setting with default of 64 kB.
Source: PostgreSQL 16 Official Documentation - Server Configuration: Write Ahead Log
When to Increase PostgreSQL Statistics Target
Increase default_statistics_target (default: 100) or per-column statistics target when:
Primary Indicators
Suboptimal Query Plans on Large Tables
- Planner chooses seq scans over index scans despite available indexes
- Join order is inefficient for tables with skewed data distributions
- Query performance degrades as table size grows beyond ~1M rows
High Cardinality Columns with Skewed Distribution
- Columns with >10,000 distinct values (DISTINCT count >> 100)
- Non-uniform data (e.g., 80% of rows share 5% of values)
- Columns used in WHERE clauses, JOINs, or GROUP BY
EXPLAIN ANALYZE Shows Row Estimate Errors
- Estimated rows differ from actual rows by >10x
- Check with:
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
Implementation
Per-column (recommended for targeted fixes):
ALTER TABLE users ALTER COLUMN email SET STATISTICS 500;
ANALYZE users;
Global setting (PostgreSQL 10+):
-- In postgresql.conf or via ALTER SYSTEM
ALTER SYSTEM SET default_statistics_target = 200;
-- Requires reload
SELECT pg_reload_conf();
-- Then re-analyze affected tables
ANALYZE;
Specific Values
- Default: 100 (stores 100 most common values + 100 histogram buckets)
- Recommended increase: 200-500 for problematic columns
- Maximum: 10,000 (rarely needed; increases ANALYZE time and pg_statistic size)
- Cost: Each +100 adds ~10-50ms to ANALYZE per column (varies by data size)
Trade-offs
- ✅ More accurate selectivity estimates for complex predicates
- ❌ Longer ANALYZE times (linear with target)
- ❌ Larger pg_statistic table (~4KB per +100 target per column)
Source: PostgreSQL 16 official documentation (Chapter 14.2 - Statistics Used by the Planner)
Parallel query is PostgreSQL's capability to devise query plans that leverage multiple CPUs to answer queries faster. When the optimizer determines that parallel execution is the fastest strategy, it creates a query plan that includes a Gather or Gather Merge node, which launches multiple background worker processes to execute portions of the plan concurrently.
How It Works
The parallel portion of the plan is a "partial plan" where each worker process generates only a subset of the output rows. The Gather or Gather Merge node collects results from all workers and returns the complete result set to the user. The planner considers parallel query when max_parallel_workers_per_gather is set to a value greater than zero.
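A sketch of what a parallel plan can look like (table name is illustrative; the worker count depends on settings and table size):
SET max_parallel_workers_per_gather = 4;
EXPLAIN SELECT count(*) FROM big_table;
-- Finalize Aggregate
--   ->  Gather
--         Workers Planned: 4
--         ->  Partial Aggregate
--               ->  Parallel Seq Scan on big_table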
Performance
Queries that touch large amounts of data but return few rows typically benefit most. Many queries run more than twice as fast with parallel query, and some run four times faster or more. However, not all queries benefit—some have no query plan faster than serial execution due to current implementation limitations or query characteristics.
Supported Operations
PostgreSQL supports parallel execution for:
- Parallel sequential scans
- Parallel index scans
- Parallel joins (nested loop, hash, merge)
- Parallel aggregation
- Parallel append operations
What is effective_cache_size in PostgreSQL?
effective_cache_size is a PostgreSQL configuration parameter that tells the query planner how much memory is available for caching data, across both PostgreSQL's shared buffers and the operating system's page cache.
Key Details
Type: Planner hint parameter (does NOT allocate memory)
Default value: 4GB (PostgreSQL 9.4+)
Units: Memory units (kB, MB, GB)
Scope: Can be set per-session, per-user, per-database, or server-wide
What It Does
- Influences the query planner's cost estimates for index scans vs sequential scans
- Higher values make the planner more likely to use index scans (assuming data is cached)
- Does NOT actually allocate or use any memory itself
- Only affects query planning decisions, not execution
Recommended Setting
Set to 50-75% of total system RAM:
-- For a system with 16GB RAM:
ALTER SYSTEM SET effective_cache_size = '12GB';
SELECT pg_reload_conf();
Or in postgresql.conf:
effective_cache_size = 12GB
Calculation Logic
Should approximate: shared_buffers + OS page cache available for PostgreSQL
For a dedicated database server:
- Total RAM: 16GB
- shared_buffers: 4GB (25% of RAM)
- OS page cache: ~8GB
- effective_cache_size: 12GB (75% of total RAM)
Version Info
Applies to PostgreSQL 9.0+ (modern versions 12-17 use same semantics)
Source: PostgreSQL Official Documentation - Resource Consumption
Table Bloat in PostgreSQL
Table bloat is disk space occupied by dead tuples (deleted or obsolete row versions) that PostgreSQL cannot immediately reuse. It occurs due to PostgreSQL's MVCC (Multi-Version Concurrency Control) implementation.
Why It Happens
When you UPDATE or DELETE a row, PostgreSQL:
- Marks the old tuple as "dead" but doesn't remove it immediately
- Creates a new tuple version (for UPDATEs)
- Keeps dead tuples until VACUUM processes them
Dead tuples accumulate faster than VACUUM reclaims them when:
- High UPDATE/DELETE rate
- Long-running transactions (prevent VACUUM from cleaning tuples)
- Autovacuum settings too conservative
- Disabled or failing autovacuum
Detection
Check bloat using the pgstattuple extension:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT
  pg_size_pretty(table_len) AS total_size,
  round(dead_tuple_percent::numeric, 2) AS dead_tuple_percent
FROM pgstattuple_approx('your_table_name');
Bloat threshold: Dead tuple percentage >20% typically indicates problematic bloat.
Remediation
Regular VACUUM (reclaims space within table, doesn't shrink file):
VACUUM VERBOSE tablename;
VACUUM FULL (rewrites table, locks it exclusively, returns space to OS):
VACUUM FULL tablename; -- Use during maintenance window
Autovacuum tuning (per-table storage parameters):
ALTER TABLE tablename SET (
autovacuum_vacuum_scale_factor = 0.05, -- Default: 0.2
autovacuum_vacuum_threshold = 50 -- Default: 50
);
Impact
- Increased disk I/O (scanning dead tuples)
- Slower sequential scans
- Index bloat (indexes also contain dead tuple pointers)
- Wasted disk space
Source: PostgreSQL 16 Documentation - Routine Vacuuming
JIT compilation in PostgreSQL is beneficial for long-running CPU-bound queries, particularly analytical queries. For short queries, the JIT compilation overhead typically exceeds any performance gains.
When JIT Activates
PostgreSQL uses cost-based thresholds to automatically decide when to use JIT:
- jit_above_cost - JIT is triggered when estimated query cost exceeds this threshold
- jit_inline_above_cost - Enables inlining of small functions/operators into expressions (more aggressive optimization)
- jit_optimize_above_cost - Enables expensive LLVM optimizations for very costly queries
Key Characteristics
- Decision timing: JIT decisions are made at plan time, not execution time
- Primary benefit: Inlining function bodies into expressions eliminates function call overhead
- Best use case: Queries with many expression evaluations (filtering, aggregations, computations)
- Not beneficial for: Short queries where compilation overhead dominates, I/O-bound queries
You can verify JIT usage with EXPLAIN (ANALYZE, VERBOSE) to see JIT compilation timing and whether it was applied.
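For example (payments is a placeholder table; the JIT summary only appears when the cost thresholds were exceeded):
SHOW jit;
SHOW jit_above_cost;
EXPLAIN (ANALYZE, VERBOSE) SELECT sum(amount) FROM payments;
-- When JIT was used, the plan output ends with a block roughly like:
-- JIT:
--   Functions: 4
--   Options: Inlining false, Optimization false, Expressions true, Deforming true
--   Timing: Generation ..., Inlining ..., Optimization ..., Emission ..., Total ...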
checkpoint_completion_target is a PostgreSQL configuration parameter that controls how much of the checkpoint interval should be used to spread out checkpoint I/O operations.
Value: A floating-point number between 0 and 1 (default: 0.9 in PostgreSQL 14+, was 0.5 in earlier versions)
Purpose: When PostgreSQL performs a checkpoint (writing all dirty buffers to disk), this parameter determines the target completion time as a fraction of checkpoint_timeout. For example, with checkpoint_timeout=5min and checkpoint_completion_target=0.9, PostgreSQL aims to complete the checkpoint in 4.5 minutes.
Effect: Higher values (closer to 1.0) spread writes over a longer period, reducing I/O spikes but increasing recovery time after a crash. Lower values complete checkpoints faster but cause sharper I/O peaks.
Configuration:
-- In postgresql.conf
checkpoint_completion_target = 0.9
-- Or at runtime (ALTER SYSTEM requires superuser privileges)
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();
Best Practice: The default of 0.9 is recommended for most workloads. Only decrease it if you need faster crash recovery and can tolerate I/O spikes. Values below 0.5 are rarely beneficial.
Source: PostgreSQL 16 official documentation - Server Configuration: Write Ahead Log (postgresql.org/docs/16/runtime-config-wal.html#GUC-CHECKPOINT-COMPLETION-TARGET)
maintenance_work_mem
maintenance_work_mem is a PostgreSQL configuration parameter that specifies the maximum amount of memory to be used for maintenance operations (per operation, not total).
Exact Specifications
- Default value: 64 MB
- Minimum value: 1024 KB (1 MB)
- Maximum value: 2147483647 KB (2 TB)
- Scope: Session-level or server-level
- Reload required: No (can be changed with
SET command)
Operations That Use This Memory
- VACUUM (not VACUUM FULL)
- CREATE INDEX
- ALTER TABLE ADD FOREIGN KEY
- CREATE EXTENSION
- Restoring pg_dump files
Configuration
In postgresql.conf:
maintenance_work_mem = 256MB
Or per-session:
SET maintenance_work_mem = '1GB';
CREATE INDEX idx_name ON table_name(column_name);
RESET maintenance_work_mem;
Critical Details
- Does NOT use shared_buffers: This is separate from regular query memory
- Per-operation allocation: Each autovacuum worker or index creation uses this amount independently
- Autovacuum behavior: Autovacuum workers use autovacuum_work_mem if set, otherwise maintenance_work_mem; since up to autovacuum_max_workers workers can run at once, total memory use can be several times this value
- Higher is better for large tables: For tables >1GB, setting this to 512MB-2GB significantly improves VACUUM and CREATE INDEX performance
Recommended Values
- Small databases (<10GB): 256 MB
- Medium databases (10-100GB): 512 MB - 1 GB
- Large databases (>100GB): 1-4 GB (but watch total RAM usage)
Source: PostgreSQL 16 Official Documentation - Server Configuration
min_wal_size is a PostgreSQL configuration parameter that sets the minimum size to which the WAL (Write-Ahead Log) can shrink during automatic checkpoints.
Default Value: 80 MB (as of PostgreSQL 9.5+)
Purpose:
- Controls WAL disk space management by preventing WAL files from being recycled/removed if total WAL size would fall below this threshold
- Acts as a lower bound - PostgreSQL keeps at least this much WAL on disk even during quiet periods
- Works in conjunction with
max_wal_size to manage WAL growth
Behavior:
- During checkpoints, PostgreSQL recycles old WAL files instead of deleting them if WAL size is near
min_wal_size
- When WAL size exceeds min_wal_size, PostgreSQL may remove old WAL segments
- Setting too low wastes I/O on frequent WAL file creation/deletion
- Setting too high wastes disk space
Configuration Example:
-- Set to 1 GB
ALTER SYSTEM SET min_wal_size = '1GB';
SELECT pg_reload_conf();
# In postgresql.conf
min_wal_size = 1GB
Valid Units: Accepts standard PostgreSQL size units: kB, MB, GB, TB
Relationship: Must be less than max_wal_size (default 1GB). Recommended ratio: set min_wal_size to 25-50% of max_wal_size for write-heavy workloads.
Source: PostgreSQL Official Documentation, versions 9.5-17 (parameter introduced in 9.5, replacing checkpoint_segments)
autovacuum_vacuum_threshold is a PostgreSQL configuration parameter that sets the minimum number of updated or deleted tuples needed to trigger an autovacuum operation on a table.
Exact Value
- Default: 50 tuples
- Type: Integer
- Minimum: 0
How It Works
Autovacuum triggers on a table when:
number_of_obsolete_tuples > autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor * number_of_tuples)
Where:
- autovacuum_vacuum_scale_factor default is 0.2 (20% of table)
- Obsolete tuples = dead tuples from UPDATEs and DELETEs
Configuration Levels
System-wide (postgresql.conf):
autovacuum_vacuum_threshold = 50
Per-table override (recommended for specific tables):
ALTER TABLE my_table SET (autovacuum_vacuum_threshold = 1000);
Practical Impact
- Small tables: With default settings, a 100-row table needs 50 + (0.2 × 100) = 70 dead tuples to trigger vacuum
- Large tables: A 1M-row table needs 50 + (0.2 × 1M) = 200,050 dead tuples
- High-churn small tables: Set lower threshold (e.g., 10) to vacuum more frequently
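With the default settings above, you can compare each table's accumulated dead tuples against its trigger point (a rough check that assumes the global defaults rather than per-table overrides):
SELECT relname,
       n_dead_tup,
       round(50 + 0.2 * n_live_tup) AS autovacuum_trigger_at
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;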
Source
PostgreSQL 16 Official Documentation: Automatic Vacuuming
Valid for PostgreSQL 9.x through 17 (value unchanged across versions).
ANALYZE in PostgreSQL
ANALYZE is a PostgreSQL command that collects statistics about the contents of tables in the database. The query planner uses these statistics to generate optimal execution plans.
What it does:
- Samples table rows (300 × default_statistics_target rows, i.e. 30,000 rows with the default target of 100)
- Computes statistics: most common values, histograms, NULL fraction, average width, distinct value counts
- Stores statistics in
pg_statistic system catalog (viewable via the pg_stats view)
- Does NOT modify table data, only metadata
Syntax:
-- Analyze entire database
ANALYZE;
-- Analyze specific table
ANALYZE table_name;
-- Analyze specific columns
ANALYZE table_name (column1, column2);
-- Verbose output
ANALYZE VERBOSE table_name;
When to run:
- After bulk INSERT/UPDATE/DELETE operations
- After significant data changes (>10-15% of rows)
- Automatically via
the autovacuum daemon (enabled by default)
Key parameters (postgresql.conf):
- default_statistics_target = 100 (range: 1-10000, higher = more accurate but slower)
- autovacuum_analyze_threshold = 50 rows
- autovacuum_analyze_scale_factor = 0.1 (10% of table)
Performance impact:
- Acquires
a SHARE UPDATE EXCLUSIVE lock (allows reads/writes, blocks schema changes)
- Typically completes in seconds for tables <1M rows
- Runs concurrently with normal operations
Source: PostgreSQL 16 Official Documentation - ANALYZE command reference
Autovacuum in PostgreSQL
Autovacuum is a background daemon process in PostgreSQL that automatically performs VACUUM and ANALYZE operations to maintain database health. It runs by default and prevents transaction ID wraparound failure while reclaiming dead tuple storage.
What It Does
- Removes dead tuples - Reclaims space from updated/deleted rows that are no longer visible to any transaction
- Prevents transaction ID wraparound - PostgreSQL uses 32-bit transaction IDs; autovacuum prevents the 2 billion transaction limit from causing data loss
- Updates statistics - Runs ANALYZE to refresh query planner statistics
Key Configuration Parameters
-- Check if autovacuum is enabled (default: on)
SHOW autovacuum;
-- Critical settings in postgresql.conf:
autovacuum = on -- Enable/disable (default: on)
autovacuum_max_workers = 3 -- Max parallel workers (default: 3)
autovacuum_naptime = 60 -- Seconds between runs (default: 60s)
-- Per-table trigger thresholds:
autovacuum_vacuum_threshold = 50 -- Min row changes before vacuum
autovacuum_vacuum_scale_factor = 0.2 -- Fraction of table size (20%)
-- Vacuum triggers when: dead_tuples > threshold + (scale_factor * total_rows)
Critical Behavior
- Vacuum runs on a table when:
dead_tuples > 50 + (0.2 × table_rows)
- ANALYZE runs when: changed_tuples > 50 + (0.1 × table_rows)
- Anti-wraparound vacuum cannot be canceled and runs when a table reaches autovacuum_freeze_max_age (default: 200 million transactions)
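To see whether autovacuum is actually keeping up, pg_stat_user_tables records when each table was last processed:
SELECT relname, last_autovacuum, last_autoanalyze, autovacuum_count, autoanalyze_count
FROM pg_stat_user_tables
ORDER BY last_autovacuum DESC NULLS LAST;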
Override Per Table
ALTER TABLE my_table SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_vacuum_threshold = 1000
);
Source: PostgreSQL 16 Official Documentation - Routine Vacuuming (https://www.postgresql.org/docs/current/routine-vacuuming.html)
PostgreSQL 18 Asynchronous I/O Performance
PostgreSQL 18 introduces an asynchronous I/O (AIO) subsystem that can deliver 2-3x performance improvements for read-heavy workloads.
How io_uring Works
io_uring establishes a shared ring buffer between PostgreSQL and the Linux kernel, allowing:
- Multiple I/O requests to be submitted in a single syscall
- Completions to be reaped without blocking
- Zero-copy data transfer in many cases
Performance by Environment
| Environment | Improvement | Notes |
|---|---|---|
| Cloud (EBS, network storage) | 2-3x | Highest gains due to I/O latency |
| Local SSD | ~24% | Still beneficial but less dramatic |
| Warm cache | Minimal | Data already in memory |
Supported Operations (PostgreSQL 18)
- Sequential scans
- Bitmap heap scans
- VACUUM operations
Not yet supported: Index scans, write operations, WAL
Configuration
-- Enable io_uring (Linux only)
ALTER SYSTEM SET io_method = 'io_uring';
SELECT pg_reload_conf();
-- Verify
SHOW io_method;
Requirements
- Linux kernel 5.1 or later
- File system must support io_uring (ext4, XFS, etc.)
- Requires
the liburing library at compile time
Source: PostgreSQL 18 Release Notes
https://www.postgresql.org/docs/18/release-18.html
You should limit max_connections because each connection consumes system resources that are pre-allocated at server startup. PostgreSQL sizes certain resources (including shared memory) directly based on the max_connections value—increasing it allocates more memory and system resources even if those connections are never used.
Key reasons to limit:
- Memory consumption: Each connection requires RAM for connection management and client operations. Higher max_connections means less RAM available for query execution and caching.
- Performance degradation: With high max_connections, you must set work_mem lower to avoid exhausting memory. This makes individual queries slower.
- Resource pre-allocation: PostgreSQL allocates shared memory and other resources proportional to max_connections at startup, not dynamically.
- Diminishing returns: Beyond a certain point, adding more concurrent connections degrades performance due to context switching and resource contention.
Recommended approach: Keep max_connections relatively low (default is 100) and use a connection pooler like PgBouncer to handle many client connections efficiently.
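A quick way to compare the configured limit with actual usage:
SHOW max_connections;
SELECT count(*) AS total_connections,
       count(*) FILTER (WHERE state = 'active') AS active_connections
FROM pg_stat_activity;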
Connection pooling in PostgreSQL is a middleware technique that maintains a pool of persistent database connections that can be reused by multiple clients, rather than creating a new connection for each client request.
Why it's needed: Each PostgreSQL connection requires forking a new backend process, which involves a handshake that takes 20-30 milliseconds and consumes approximately 10 MB of RAM per connection. Direct connections don't scale well beyond a few hundred concurrent users.
How it works: A connection pooler (like PgBouncer or PgPool-II) sits between clients and PostgreSQL. Clients connect to the pooler, which maintains a smaller pool of actual database connections. When a client needs to execute a query, the pooler assigns it an available connection from the pool, then returns that connection to the pool when done.
Pooling modes:
- Session pooling: Connection assigned to client for entire session duration
- Transaction pooling: Connection assigned only during a transaction (most common)
- Statement pooling: Connection assigned per statement (most aggressive, limited use cases)
Best practice: PostgreSQL's max_connections parameter (default 100) should remain relatively low. If you need more than 200 concurrent connections, use an external connection pooler rather than increasing max_connections, as PostgreSQL sizes shared memory resources based on this value.
Note: PostgreSQL does not include built-in connection pooling in the server itself. The JDBC driver provides basic pooling for Java applications, but for production deployments, external poolers like PgBouncer or PgPool-II are recommended.
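For illustration only, a minimal PgBouncer setup using transaction pooling might look like the sketch below (file paths, names, and sizes are placeholders, not recommendations):
; pgbouncer.ini (illustrative)
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
Clients then connect to port 6432 instead of 5432, and PgBouncer hands each transaction an available server connection from the pool.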
Use PostgreSQL's built-in pgbench tool to benchmark database performance.
Basic workflow:
- Initialize test database:
pgbench -i -s 50 your_database_name
The -s scale factor determines dataset size (scale 50 = 5 million rows in pgbench_accounts).
- Run benchmark:
pgbench -c 10 -j 2 -t 1000 your_database_name
- -c 10: 10 concurrent clients
- -j 2: 2 worker threads
- -t 1000: 1000 transactions per client
- Or run time-based test:
pgbench -c 10 -j 2 -T 60 -P 5 your_database_name
- -T 60: Run for 60 seconds
- -P 5: Show progress every 5 seconds
Key metrics reported:
- TPS (transactions per second): Primary performance indicator
- Latency average: Response time per transaction
- Standard deviation: Consistency of performance
Custom benchmarks:
pgbench -c 10 -T 60 -f custom_queries.sql your_database_name
Production-like testing options:
- -M prepared: Use prepared statements (fastest)
- -R 500: Rate-limit to 500 TPS (test sustained load)
- -l: Log detailed per-transaction timing
Best practices:
- Run tests for several minutes minimum
- Scale factor should match or exceed client count
- Use
-j threads on multi-CPU systems
When maintenance_work_mem is Used in PostgreSQL
maintenance_work_mem is used during maintenance operations, not regular queries. Specifically:
Operations That Use It:
- VACUUM - When cleaning up dead tuples and updating statistics
- CREATE INDEX - When building new indexes (including initial index creation)
- ALTER TABLE ADD FOREIGN KEY - When validating foreign key constraints
- REINDEX - When rebuilding existing indexes
- CREATE EXTENSION (certain extensions that build internal indexes)
NOT Used For:
- Regular
SELECT, INSERT, UPDATE, DELETE queries (these use work_mem)
- Index scans during normal query execution
- Autovacuum (uses autovacuum_work_mem if set, otherwise maintenance_work_mem)
Default Value:
- 64 MB (since PostgreSQL 9.4; was 16 MB in earlier versions)
Configuration:
-- Session level
SET maintenance_work_mem = '256MB';
-- For a specific operation
SET maintenance_work_mem = '1GB';
CREATE INDEX idx_name ON table_name(column);
RESET maintenance_work_mem;
Sizing Guideline:
Set to 5-10% of total RAM for systems where you run maintenance operations, but typically not more than 2GB per operation (higher values show diminishing returns for index creation).
Source: PostgreSQL Official Documentation (v16), runtime configuration parameters - https://www.postgresql.org/docs/current/runtime-config-resource.html
pg_repack Extension
pg_repack is a PostgreSQL extension that removes bloat from tables and indexes by rebuilding them without holding exclusive locks during the operation, allowing concurrent reads and writes.
What It Does
It reorganizes tables to reclaim wasted space caused by UPDATE/DELETE operations, which in PostgreSQL leave behind dead tuples that create bloat. Unlike VACUUM FULL or CLUSTER, pg_repack does not block DML operations.
Key Characteristics
- Online operation: Uses triggers to capture concurrent changes during rebuild
- Lock requirements: Briefly holds
an ACCESS EXCLUSIVE lock only at the start and end (typically milliseconds)
- Space requirements: Needs temporary disk space approximately equal to the table size
- Compatibility: Works with PostgreSQL 9.4+ (latest version 1.5.1 supports PostgreSQL 16)
Installation
CREATE EXTENSION pg_repack;
Basic Usage
# Repack specific table
pg_repack -t schema.table_name database_name
# Repack all tables in database
pg_repack database_name
# Repack only indexes
pg_repack -x -t table_name database_name
When to Use
- Tables with >20-30% bloat (check with
the pgstattuple extension)
- Production systems where downtime is unacceptable
- After bulk DELETE/UPDATE operations
Source
Official GitHub: https://github.com/reorg/pg_repack (maintained by the PostgreSQL community)
VACUUM in PostgreSQL
VACUUM is a PostgreSQL maintenance command that reclaims storage space occupied by dead tuples (deleted or obsoleted rows) and updates statistics for the query planner.
What it does:
Removes dead tuples: PostgreSQL uses MVCC (Multi-Version Concurrency Control), which marks rows as deleted rather than immediately removing them. VACUUM physically removes these dead tuples.
Prevents transaction ID wraparound: PostgreSQL uses a 32-bit transaction ID that wraps around after ~2 billion transactions. VACUUM freezes old transaction IDs to prevent data loss.
Updates free space map (FSM): Marks reclaimed space as available for reuse by future INSERTs/UPDATEs.
Basic syntax:
-- Vacuum single table
VACUUM table_name;
-- Vacuum entire database
VACUUM;
-- Vacuum with full reclaim (locks table, rewrites entirely)
VACUUM FULL table_name;
-- Vacuum and analyze (updates statistics)
VACUUM ANALYZE table_name;
Key parameters:
- autovacuum: Enabled by default since PostgreSQL 8.3. Runs automatically when thresholds are met.
- autovacuum_vacuum_threshold: Default is 50 tuples
- autovacuum_vacuum_scale_factor: Default is 0.2 (20% of table size)
- Autovacuum triggers when:
dead_tuples > threshold + (scale_factor * table_size)
VACUUM vs VACUUM FULL:
- VACUUM: Marks space as reusable but doesn't shrink table files. Non-blocking (allows concurrent reads/writes).
- VACUUM FULL: Rewrites entire table, returns disk space to OS. Takes exclusive lock, can be very slow on large tables.
Version: PostgreSQL 12+ (current behavior). Verified from PostgreSQL 16 official documentation.
Source: https://www.postgresql.org/docs/current/sql-vacuum.html
fulltext_search
24 questions
A tsquery in PostgreSQL is a data type that represents a text search query containing lexemes (normalized words) and operators for full-text search matching against tsvector documents.
Core Characteristics
- Data type:
tsquery (dedicated type, not just text)
- Purpose: Stores preprocessed search queries with boolean operators
- Size: Variable length, typically 8-16 bytes overhead plus lexeme storage
- Available since: PostgreSQL 8.3 (2008)
Syntax and Operators
-- Create tsquery using to_tsquery() - requires valid syntax
SELECT to_tsquery('english', 'cats & dogs'); -- AND operator
SELECT to_tsquery('english', 'cats | dogs'); -- OR operator
SELECT to_tsquery('english', 'cats & !dogs'); -- NOT operator
SELECT to_tsquery('english', 'cat <-> dog'); -- FOLLOWED BY (adjacent)
SELECT to_tsquery('english', 'cat <2> dog'); -- Within 2 positions
-- Create tsquery using plainto_tsquery() - auto-formats plain text
SELECT plainto_tsquery('english', 'cats and dogs'); -- Produces: 'cat' & 'dog'
-- Create tsquery using websearch_to_tsquery() - web-style syntax (PG 11+)
SELECT websearch_to_tsquery('english', 'cats -dogs "big cat"');
Essential Operators
- & (AND): Both terms must match
- | (OR): Either term must match
- ! (NOT): Term must not match
- <-> (FOLLOWED BY): Terms must be adjacent in order
- <N> (DISTANCE): Terms within N positions
Usage Pattern
-- Match tsquery against tsvector
SELECT to_tsvector('english', 'The quick brown fox') @@
to_tsquery('english', 'quick & fox'); -- Returns: true
-- Index-accelerated search
CREATE INDEX idx_fts ON documents USING GIN(to_tsvector('english', content));
SELECT * FROM documents
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'search & terms');
Key Properties
- Normalized: Lexemes are stemmed/normalized (e.g., "running" → "run")
- Language-aware: Respects text search configuration (e.g., 'english', 'spanish')
- Optimized: Works efficiently with GIN/GiST indexes on
tsvector columns
- Type-safe: Cannot accidentally compare with regular text without conversion
Source: PostgreSQL 17 Official Documentation - Chapter 12 (Full Text Search)
Creating tsvector from text in PostgreSQL
Use the to_tsvector() function with a text search configuration and text input:
SELECT to_tsvector('english', 'The quick brown fox jumps over the lazy dog');
-- Result: 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
Syntax:
to_tsvector([ config regconfig, ] document text) returns tsvector
Key details:
Configuration parameter (optional): Specifies the text search configuration (e.g.,
'english', 'french', 'simple'). If omitted, uses the default_text_search_config setting.
Normalization: The function automatically:
- Converts words to lowercase
- Removes stop words (common words like 'the', 'a', 'over')
- Applies stemming (e.g., 'jumps' → 'jump', 'lazy' → 'lazi')
- Stores position information for each lexeme
Common usage in table columns:
-- Generate tsvector on-the-fly
SELECT to_tsvector('english', title || ' ' || body) FROM documents;
-- Store in a generated column (PostgreSQL 12+)
ALTER TABLE documents
ADD COLUMN tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body)) STORED;
-- Create GIN index for fast full-text search
CREATE INDEX idx_documents_tsv ON documents USING GIN(tsv);
Version: Available since PostgreSQL 8.3. Generated columns require PostgreSQL 12+.
Source: PostgreSQL official documentation - Chapter 12: Full Text Search (postgresql.org/docs/current/textsearch-controls.html)
The ts_headline() function in PostgreSQL generates highlighted excerpts from text documents showing where search terms match, used for displaying search results with context.
Syntax:
ts_headline([ config regconfig, ] document text, query tsquery [, options text ])
Core Behavior:
- Returns fragments of the original document with matched terms wrapped in tags
- Default wrapper:
<b> and </b> tags around matched words
- Preserves original text casing and spacing
Key Parameters:
- config: Text search configuration (e.g., 'english'). Defaults to default_text_search_config
- document: The text to extract highlighted fragments from
- query: The tsquery to match against
- options: Comma-separated option=value pairs
Critical Options:
- StartSel='<b>', StopSel='</b>': Custom highlight tags (defaults shown)
- MaxWords=35, MinWords=15: Controls fragment length (defaults shown)
- ShortWord=3: Words ≤ this length ignored in fragment selection (default: 3)
- MaxFragments=0: Number of fragments to return; 0 = single fragment (default: 0)
- FragmentDelimiter=' ... ': String between fragments when MaxFragments > 0
Example:
SELECT ts_headline('english',
'The quick brown fox jumps over the lazy dog',
to_tsquery('english', 'fox & dog'),
'StartSel=<mark>, StopSel=</mark>, MaxFragments=2'
);
-- Result: "brown <mark>fox</mark> jumps ... the lazy <mark>dog</mark>"
Version: Available in PostgreSQL 8.3+, syntax standardized in 9.6+
Source: PostgreSQL Official Documentation - Text Search Functions (Chapter 9.13)
plainto_tsquery Function
plainto_tsquery is a PostgreSQL full-text search function that converts plain text into a tsquery by:
- Parsing the text into tokens
- Normalizing tokens according to the specified text search configuration
- Combining all tokens with the AND operator (
&)
Key Behavior:
- Ignores all punctuation and operators in the input text
- Does NOT support full-text search operators like
&, |, !, <->
- Returns
tsquerytype suitable for@@matching againsttsvector
Syntax:
plainto_tsquery([ config regconfig, ] query text) returns tsquery
Example:
SELECT plainto_tsquery('english', 'The Fat Rats');
-- Returns: 'fat' & 'rat'
SELECT to_tsvector('english', 'a fat cat sat on a mat')
@@ plainto_tsquery('english', 'cat mat');
-- Returns: true (both words found)
Comparison with Similar Functions:
- Use
plainto_tsquery() when you need safe, user-friendly text search (ignores operators)
- Use to_tsquery() when you need operator support (&, |, !)
- Use phraseto_tsquery() when you need phrase matching with position
Available Since: PostgreSQL 8.3+
Source: PostgreSQL Official Documentation - Text Search Functions
What is the pg_trgm extension?
pg_trgm is a PostgreSQL extension that provides functions and operators for determining string similarity based on trigram matching. A trigram is a group of three consecutive characters extracted from a string.
Core Functionality
The extension enables:
- Similarity measurement between strings (0.0 to 1.0 scale)
- Fast LIKE/ILIKE/regex pattern matching using GIN or GiST indexes
- Fuzzy string searching for typo-tolerant queries
Key Operators and Functions
-- Enable the extension
CREATE EXTENSION pg_trgm;
-- % operator returns boolean; similarity() returns a float between 0.0 and 1.0
SELECT 'word' % 'word'; -- true (similar)
SELECT similarity('word', 'wrd'); -- ≈0.29
-- Pattern matching with index support
SELECT * FROM table WHERE column % 'search_term';
SELECT * FROM table WHERE column ILIKE '%pattern%';
Critical Configuration
- Default similarity threshold:
0.3 (configurable via pg_trgm.similarity_threshold)
- Strings with similarity ≥ threshold return true for the % operator
- Change threshold:
SET pg_trgm.similarity_threshold = 0.6;
Index Types Supported
-- GIN index (faster for lookups, larger size)
CREATE INDEX idx_name ON table USING GIN (column gin_trgm_ops);
-- GiST index (faster for updates, smaller size)
CREATE INDEX idx_name ON table USING GiST (column gist_trgm_ops);
Version Availability
Available since PostgreSQL 9.1. Enhanced in 9.6+ with support for LIKE/ILIKE index usage.
Official Documentation: https://www.postgresql.org/docs/current/pgtrgm.html
PostgreSQL Full-Text Search Index Type
Use GIN (Generalized Inverted Index) for full-text search. This is the recommended index type for tsvector columns.
Implementation
-- Create GIN index on tsvector column
CREATE INDEX idx_fts ON documents USING GIN(to_tsvector('english', content));
-- Or on a pre-computed tsvector column
ALTER TABLE documents ADD COLUMN content_tsv tsvector;
CREATE INDEX idx_fts ON documents USING GIN(content_tsv);
GIN vs GiST Decision Criteria
- GIN: 3x faster searches, but 3x slower writes and larger size (use for read-heavy workloads)
- GiST: Faster updates, smaller index size (use only if updates are very frequent and search speed is less critical)
In 95% of cases, use GIN.
Performance Specifics
- GIN index lookups are typically 3x faster than GiST for full-text search
- GIN indexes are 2-3x larger than GiST indexes
- GIN index builds and updates are 3x slower than GiST
Optimization (PostgreSQL 9.6+)
-- Use fastupdate for better write performance
CREATE INDEX idx_fts ON documents USING GIN(content_tsv) WITH (fastupdate = on);
The fastupdate parameter (default: on in PostgreSQL 9.6+) buffers index updates in memory, improving write performance by up to 50% with minimal impact on search speed.
Source: PostgreSQL Official Documentation - Text Search Indexes (versions 12-17), Section 12.9
A generated tsvector column is a PostgreSQL column that automatically maintains a full-text search document vector, computed from other columns in the same table using the GENERATED ALWAYS AS syntax.
Definition
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT,
body TEXT,
search_vector tsvector GENERATED ALWAYS AS (
to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
) STORED
);
Key Characteristics
- Automatic updates: The
tsvector value is automatically recomputed whenever source columns change
- Storage requirement: Must use STORED (not VIRTUAL); PostgreSQL 12+ only supports stored generated columns
- Cannot be written directly: Any INSERT or UPDATE must omit the generated column or use DEFAULT
- Index compatible: Can be indexed with GIN or GiST indexes for fast full-text search
CREATE INDEX idx_search ON documents USING GIN(search_vector);
Query Usage
SELECT * FROM documents
WHERE search_vector @@ to_tsquery('english', 'postgresql & search');
Version Requirement
PostgreSQL 12+ (generated columns introduced in version 12, released October 2019)
Source: PostgreSQL 16 Official Documentation - Generated Columns
Answer
The default text search configuration in PostgreSQL is controlled by the default_text_search_config parameter.
Built-in default: pg_catalog.simple
Actual default after database initialization: The value set by initdb based on your lc_ctype locale. For example:
- English locales typically get
pg_catalog.english
- French locales typically get pg_catalog.french
- Spanish locales typically get pg_catalog.spanish
If initdb cannot identify a matching text search configuration for your locale, it falls back to pg_catalog.simple.
Checking Your Current Default
SHOW default_text_search_config;
Usage
All text search functions use this configuration when no explicit configuration is specified:
-- Uses default_text_search_config
SELECT to_tsvector('The quick brown fox');
-- Explicitly specifies configuration
SELECT to_tsvector('english', 'The quick brown fox');
Changing the Default
-- Session level
SET default_text_search_config = 'pg_catalog.english';
-- Database level
ALTER DATABASE mydb SET default_text_search_config = 'pg_catalog.english';
Parameter type: String (regconfig)
Context: User (no restart required)
Applies to: PostgreSQL 8.3+
Sources:
PostgreSQL setweight Function
setweight(vector tsvector, weight "char") assigns a weight label (A, B, C, or D) to all lexemes in a tsvector, used for ranking search results in full-text search.
Syntax and Usage
setweight(vector tsvector, weight "char") RETURNS tsvector
Weight values:
- 'A' - Highest importance (default multiplier: 1.0)
- 'B' - High importance (default multiplier: 0.4)
- 'C' - Medium importance (default multiplier: 0.2)
- 'D' - Low importance (default multiplier: 0.1)
Example
-- Assign different weights to title vs body content
SELECT setweight(to_tsvector('english', 'PostgreSQL Tutorial'), 'A') ||
setweight(to_tsvector('english', 'Learn database fundamentals'), 'B');
-- Result: 'databas':5B 'fundament':6B 'learn':4B 'postgresql':1A 'tutori':2A
Ranking Impact
Weights affect ts_rank() and ts_rank_cd() scoring:
SELECT ts_rank(
setweight(to_tsvector('important text'), 'A'),
to_tsquery('text')
); -- Returns higher score than weight 'D'
Source: PostgreSQL 17 Official Documentation - Full-Text Search Functions
https://www.postgresql.org/docs/current/functions-textsearch.html
Note: Available in PostgreSQL 8.3+. Weights are stored in the tsvector type and preserved through concatenation with || operator.
What is a tsvector in PostgreSQL?
A tsvector is a specialized PostgreSQL data type that stores a sorted list of distinct lexemes (normalized word forms) optimized for full-text search. It represents a document in a form that has been preprocessed for text search operations.
Structure
A tsvector contains:
- Lexemes: Normalized words (lowercase, stemmed)
- Positions: Optional array of positions where each lexeme appears in the original document
- Weights: Optional labels (A, B, C, D) for ranking, where A is highest priority
Storage Format
-- Basic tsvector (lexemes only)
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
-- Result: 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
-- With positions
SELECT to_tsvector('english', 'The quick brown fox jumps');
-- Result: 'brown':3 'fox':4 'jump':5 'quick':2
-- With positions and weights
SELECT setweight(to_tsvector('PostgreSQL is powerful'), 'A');
-- Result: 'postgresql':1A 'power':3A
Key Characteristics
- Size: Each lexeme is limited to 2,046 bytes; total positions per lexeme limited to 256
- Automatic deduplication: Duplicate lexemes are merged automatically
- Sorted: Lexemes stored in sorted order for efficient searching
- Stop words: Common words filtered by
to_tsvector() based on the text search configuration
Common Usage
-- Create column with tsvector
ALTER TABLE documents ADD COLUMN tsv tsvector;
-- Populate using to_tsvector (recommended)
UPDATE documents
SET tsv = to_tsvector('english', title || ' ' || body);
-- Create GIN index for fast searching
CREATE INDEX tsv_idx ON documents USING GIN(tsv);
-- Search using tsquery
SELECT * FROM documents
WHERE tsv @@ to_tsquery('english', 'postgresql & search');
Version: Available since PostgreSQL 8.3+ (current stable: 17.x)
Source: PostgreSQL Official Documentation - Text Search Types
phraseto_tsquery Function in PostgreSQL
phraseto_tsquery is a PostgreSQL full-text search function that converts a plain text phrase into a tsquery by creating a phrase search query where all terms must appear consecutively in the exact order given.
Syntax
phraseto_tsquery([ config regconfig, ] query text) returns tsquery
Behavior
- Normalizes input text using the specified text search configuration (defaults to
default_text_search_config) - Tokenizes the input and generates lexemes
- Connects all lexemes with the
<->(followed-by) operator, requiring exact sequential adjacency - Ignores stop words in the phrase (they don't create gaps in the sequence)
- Returns an empty
tsquery if the input produces no lexemes
Example
SELECT phraseto_tsquery('english', 'The quick brown fox');
-- Result: 'quick' <-> 'brown' <-> 'fox'
-- Note: 'The' is a stop word and excluded
SELECT to_tsvector('english', 'the quick brown fox jumps')
@@ phraseto_tsquery('english', 'quick brown fox');
-- Result: true (phrase matches)
SELECT to_tsvector('english', 'the quick red fox')
@@ phraseto_tsquery('english', 'quick brown fox');
-- Result: false (sequence broken)
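Embedded stop words are accounted for with a distance operator rather than being silently dropped; for example, the following should produce roughly the result shown:
SELECT phraseto_tsquery('english', 'the cats ate the rats');
-- Expected shape: 'cat' <-> 'ate' <2> 'rat'
-- (the inner 'the' is a stop word, so 'ate' and 'rat' must be exactly two positions apart)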
Key Differences from Similar Functions
- plainto_tsquery(): Joins terms with & (AND), not requiring word order
- websearch_to_tsquery(): Supports quoted phrases plus boolean operators
- to_tsquery(): Requires manual tsquery syntax with operators
Version
Available since PostgreSQL 9.6+
Source
PostgreSQL Official Documentation: Text Search Functions and Operators
https://www.postgresql.org/docs/current/textsearch-controls.html
ts_rank_cd is a PostgreSQL full-text search ranking function that computes a cover density ranking for documents based on how closely query terms appear together.
Key Differences from ts_rank
- ts_rank: Counts total term frequencies
- ts_rank_cd: Prioritizes proximity of query terms (shorter distances = higher rank)
Syntax
ts_rank_cd(weights float4[], vector tsvector, query tsquery, normalization integer)
ts_rank_cd(vector tsvector, query tsquery, normalization integer)
ts_rank_cd(vector tsvector, query tsquery)
Parameters
- weights: Array of 4 floats
{D, C, B, A} for label weights (default: {0.1, 0.2, 0.4, 1.0})
- vector: The tsvector document representation
- query: The tsquery search query
- normalization: Integer bitmask (same as ts_rank):
- 0: No normalization (default)
- 1: Divide by (1 + log(document length))
- 2: Divide by document length
- 4: Divide by mean harmonic distance between extents
- 8: Divide by number of unique words
- 16: Divide by (1 + log(unique words))
- 32: Divide by (rank + 1)
Example
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM documents, to_tsquery('postgresql & search') query
WHERE textsearch @@ query
ORDER BY rank DESC;
Source
PostgreSQL official documentation (functions-textsearch.html), available in PostgreSQL 8.3+.
Creating a Custom Text Search Configuration in PostgreSQL
To create a custom text search configuration in PostgreSQL, use the CREATE TEXT SEARCH CONFIGURATION command by copying an existing configuration and modifying its parser or token-dictionary mappings.
Basic Syntax
-- Create configuration by copying an existing one
CREATE TEXT SEARCH CONFIGURATION custom_config (
COPY = pg_catalog.english
);
-- Modify token-to-dictionary mappings
ALTER TEXT SEARCH CONFIGURATION custom_config
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH unaccent, english_stem;
Complete Example with Custom Dictionary
-- 1. Create a custom dictionary (optional)
CREATE TEXT SEARCH DICTIONARY english_stem_nostop (
TEMPLATE = snowball,
Language = english
);
-- 2. Create the configuration
CREATE TEXT SEARCH CONFIGURATION my_search_config (
COPY = pg_catalog.english
);
-- 3. Modify mappings for specific token types
ALTER TEXT SEARCH CONFIGURATION my_search_config
ALTER MAPPING FOR word, asciiword
WITH english_stem_nostop;
-- 4. Drop mappings for tokens you want to ignore
ALTER TEXT SEARCH CONFIGURATION my_search_config
DROP MAPPING FOR email, url, host;
-- 5. Set as the default for the current session (use ALTER DATABASE ... SET for a per-database default)
SET default_text_search_config = 'my_search_config';
Key Components
Token Types: Use SELECT * FROM ts_token_type('default') to list available token types (word, email, url, host, etc.); \dF+ in psql shows how a configuration maps them to dictionaries
Built-in Parsers: pg_catalog.default (only standard parser in core PostgreSQL)
Common Dictionaries: simple, english_stem, unaccent, synonym, thesaurus
Verification: Test with ts_debug('my_search_config', 'sample text') to see token processing
This applies to PostgreSQL 12+ (syntax unchanged through PostgreSQL 17).
PostgreSQL Trigram Similarity for Fuzzy Search
Enable the pg_trgm extension:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
Basic similarity search using the similarity() function:
SELECT
column_name,
similarity(column_name, 'search_term') AS sim_score
FROM table_name
WHERE similarity(column_name, 'search_term') > 0.3
ORDER BY sim_score DESC;
Using the % operator (equivalent to similarity() > 0.3):
SELECT column_name
FROM table_name
WHERE column_name % 'search_term'
ORDER BY similarity(column_name, 'search_term') DESC;
Key functions and operators:
- similarity(text, text) - Returns similarity score from 0.0 (no match) to 1.0 (identical)
- % operator - Returns true if similarity > threshold (default 0.3)
- <-> operator - Returns distance (1 - similarity), useful for ORDER BY
- word_similarity(text, text) - Matches whole words within text (PostgreSQL 9.6+)
- strict_word_similarity(text, text) - Stricter word boundary matching (PostgreSQL 11+)
Adjust similarity threshold (default is 0.3):
SET pg_trgm.similarity_threshold = 0.5; -- Session-level
-- OR
ALTER DATABASE dbname SET pg_trgm.similarity_threshold = 0.5; -- Database-level
Create GIN/GIST index for performance:
CREATE INDEX trgm_idx ON table_name USING GIN (column_name gin_trgm_ops);
-- OR
CREATE INDEX trgm_idx ON table_name USING GIST (column_name gist_trgm_ops);
GIN vs GIST: GIN is 3x faster for lookups but larger and slower to build; GIST is better for frequent updates.
Source: PostgreSQL official documentation for pg_trgm module (available since PostgreSQL 9.1, word_similarity added in 9.6, strict_word_similarity in 11.0).
Adding Custom Stop Words in PostgreSQL Full Text Search
Stop words are filtered out during full text search indexing. PostgreSQL provides default stop word lists, but you can add custom ones.
Method: Create Custom Text Search Configuration
-- 1. Create a custom stop word file
-- File location: /usr/share/postgresql/[VERSION]/tsearch_data/custom_stop.stop
-- Example content (one word per line):
-- custom
-- example
-- filter
-- 2. Create a new text search configuration based on existing one
CREATE TEXT SEARCH CONFIGURATION custom_english ( COPY = pg_catalog.english );
-- 3. Create a dictionary that reads the custom stop word file
CREATE TEXT SEARCH DICTIONARY english_stem_custom (
TEMPLATE = snowball,
Language = english,
StopWords = custom_stop -- refers to custom_stop.stop in tsearch_data
);
-- 4. Point the configuration's word mappings at the new dictionary
ALTER TEXT SEARCH CONFIGURATION custom_english
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH english_stem_custom;
-- 5. Use the custom configuration
SELECT to_tsvector('custom_english', 'This is custom text');
Stop Word File Format
Create a .stop file in $SHAREDIR/tsearch_data/ directory:
- One word per line
- Blank lines and trailing spaces are ignored
- Encoding must match database encoding
File path: /usr/share/postgresql/<VERSION>/tsearch_data/your_file.stop
Verify Stop Words
-- Check which dictionary removes stop words
SELECT * FROM ts_debug('custom_english', 'custom word text');
Source: PostgreSQL 12+ Full Text Search documentation (https://www.postgresql.org/docs/current/textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS)
Critical: Server requires read access to the .stop file. Restart not required—configuration changes take effect immediately.
A text search configuration in PostgreSQL is a database object that defines how text is processed for full-text search by specifying which parser to use and which dictionaries to apply to each token type.
Core Function
A text search configuration maps token types (word, email, URL, etc.) to dictionaries that normalize tokens into lexemes. It determines:
- Parser: How to break text into tokens (default:
pg_catalog.default)
- Dictionary mappings: Which dictionaries process each token type and in what order
Key Components
Token Types: 23 built-in types including word, asciiword, email, url, numword, host, file, etc.
Dictionaries: Process tokens in order until one returns a non-NULL result:
- simple: Lowercases and returns the token as-is
- ispell: Morphological reduction using dictionary files
- snowball: Stemming (e.g., "running" → "run")
- synonym: Maps words to preferred terms
- thesaurus: Multi-word phrase mapping
- Stop-word removal: handled by returning an empty lexeme array (configured through the StopWords option of dictionaries such as simple and snowball)
Example Usage
-- View default configuration
SHOW default_text_search_config; -- typically 'pg_catalog.english'
-- Create custom configuration
CREATE TEXT SEARCH CONFIGURATION my_config (COPY = pg_catalog.english);
-- Modify dictionary mappings for 'word' tokens
ALTER TEXT SEARCH CONFIGURATION my_config
ALTER MAPPING FOR word
WITH english_stem;
-- Use in queries
SELECT to_tsvector('my_config', 'Running cats'); -- 'cat':2 'run':1
-- Set as database default
ALTER DATABASE mydb SET default_text_search_config = 'my_config';
Built-in Configurations
PostgreSQL includes pre-built configurations for 20+ languages (e.g., english, french, german, spanish), each with appropriate stemming and stop words.
Source: PostgreSQL 17 Official Documentation, Chapter 12 (Full Text Search)
Full-Text Search in PostgreSQL
Full-text search in PostgreSQL is a built-in feature for searching natural language documents by matching query terms against preprocessed text, ranking results by relevance. It uses specialized data types tsvector (document representation) and tsquery (search query), along with linguistic processing like stemming and stop-word removal.
Core Components
tsvector: Stores a sorted list of distinct lexemes (normalized words) with position information.
SELECT to_tsvector('english', 'The quick brown foxes jumped');
-- Result: 'brown':3 'fox':4 'jump':5 'quick':2
tsquery: Represents a search query with operators & (AND), | (OR), ! (NOT), and <-> (phrase).
SELECT to_tsquery('english', 'quick & fox');
-- Result: 'quick' & 'fox'
Basic Usage
-- Simple search
SELECT * FROM documents
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'postgresql & search');
-- With ranking (ts_rank returns a float4 relevance score; higher means more relevant)
SELECT title, ts_rank(to_tsvector('english', content), query) AS rank
FROM documents, to_tsquery('english', 'postgresql') query
WHERE to_tsvector('english', content) @@ query
ORDER BY rank DESC;
Performance: GIN Index (Required for Production)
-- Add tsvector column (recommended over expression index)
ALTER TABLE documents ADD COLUMN content_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
-- Create GIN index (default for tsvector, 3-10x faster than GiST)
CREATE INDEX idx_documents_content_tsv ON documents USING GIN(content_tsv);
-- Query using indexed column
SELECT * FROM documents WHERE content_tsv @@ to_tsquery('english', 'postgresql');
Language Configurations
PostgreSQL includes 20+ built-in dictionaries (english, spanish, french, etc.). Specify via first parameter:
to_tsvector('spanish', 'corriendo') -- 'corr':1
to_tsvector('english', 'running') -- 'run':1
Key Operators
- @@ - Match operator: tsvector @@ tsquery
- || - Concatenate tsvectors
- && - AND tsqueries
- <-> - Followed by (distance 1): 'quick' <-> 'fox'
- <N> - Followed by with distance N: 'quick' <2> 'fox'
Versions: Available since PostgreSQL 8.3 (2008). GIN index improvements in 9.4+ significantly increased performance. Current stable as of PostgreSQL 17 (2024).
Official Documentation: https://www.postgresql.org/docs/current/textsearch.html
Ranking Full-Text Search Results in PostgreSQL
Use the ts_rank() or ts_rank_cd() functions to rank full-text search results based on relevance.
Basic Ranking with ts_rank()
SELECT
title,
ts_rank(to_tsvector('english', body), query) AS rank
FROM documents,
to_tsquery('english', 'postgresql & search') query
WHERE to_tsvector('english', body) @@ query
ORDER BY rank DESC;
Key Functions
ts_rank(tsvector, tsquery [, normalization])
- Returns a float4 relevance score based on term frequency
- Higher scores = more relevant
- Default normalization: 0 (no normalization)
ts_rank_cd(tsvector, tsquery [, normalization])
- Uses "cover density" ranking (accounts for proximity of search terms)
- Better for multi-word queries where term proximity matters
- Same return type and normalization options
Normalization Options (bitmask values)
-- Normalize by document length (most common)
ts_rank(vector, query, 1) -- Divides rank by (1 + log of document length)
-- Other options (can be combined with |)
0 -- No normalization (default)
1 -- Divide by (1 + log of document length)
2 -- Divide by document length
4 -- Divide by mean harmonic distance between extents
8 -- Divide by number of unique words in document
16 -- Divide by 1 + log of unique words
32 -- Divide by rank itself + 1
Production-Ready Example
-- Create GIN index for performance
CREATE INDEX idx_documents_fts ON documents
USING GIN (to_tsvector('english', body));
-- Optimized query with stored tsvector and weighted ranking
SELECT
id,
title,
ts_rank_cd(body_vector, query, 32) AS rank,
ts_headline('english', body, query, 'MaxWords=50') AS snippet
FROM documents,
to_tsquery('english', 'postgresql & ranking') query
WHERE body_vector @@ query
ORDER BY rank DESC
LIMIT 20;
Weighting Different Fields
SELECT
ts_rank(
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', body), 'B'),
query
) AS rank
FROM documents,
to_tsquery('english', 'search') query
WHERE (setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', body), 'B')) @@ query
ORDER BY rank DESC;
Weight labels: 'A' (weight 1.0), 'B' (0.4), 'C' (0.2), 'D' (0.1) by default.
Supported versions: PostgreSQL 8.3+ (ts_rank), 9.0+ (weighted vectors)
Source: PostgreSQL 16 Official Documentation - Text Search Functions
A stop word in PostgreSQL full-text search is a word that is ignored during indexing and searching because it appears too frequently to be useful for search relevance (e.g., "a", "the", "is").
How Stop Words Work
Stop words are removed by the text search dictionary during to_tsvector() and to_tsquery() conversion. The default text search configuration for English uses the english_stem dictionary, which filters stop words before stemming.
-- Stop words are removed from the tsvector
SELECT to_tsvector('english', 'The quick brown fox');
-- Result: 'brown':3 'fox':4 'quick':2
-- Note: 'The' is removed as a stop word
-- Stop words in queries are also ignored
SELECT to_tsquery('english', 'the & fox');
-- Result: 'fox'
Default Stop Word Lists
PostgreSQL includes predefined stop word lists in $SHAREDIR/tsearch_data/ for multiple languages:
- English:
english.stop (contains 119 words including: a, an, the, is, are, was, were, etc.)
- Other languages have their own lists (e.g., french.stop, german.stop)
Viewing Stop Words
-- Test whether a dictionary treats a word as a stop word
SELECT ts_lexize('english_stem', 'the');
-- Result: {} (an empty array indicates a stop word; NULL would mean the token was not recognized)
SELECT ts_lexize('english_stem', 'fox');
-- Result: {fox} (not a stop word, returns lexeme)
Custom Stop Word Configuration
You can create text search configurations without stop words or with custom stop word lists:
-- Create dictionary without stop words
CREATE TEXT SEARCH DICTIONARY english_nostop (
TEMPLATE = snowball,
Language = english
);
Version: Available in all PostgreSQL versions 8.3+. Behavior is consistent across versions.
Source: PostgreSQL official documentation - Chapter 12 (Full Text Search)
The @@ operator in PostgreSQL performs full-text search matching between a tsvector (document) and a tsquery (query).
Syntax:
tsvector @@ tsquery → boolean
Usage:
The operator returns true if the tsvector matches the tsquery, false otherwise. It can be used in either direction:
-- Standard form: tsvector @@ tsquery
SELECT to_tsvector('english', 'The quick brown fox') @@ to_tsquery('english', 'fox');
-- Returns: true
-- Reversed form: tsquery @@ tsvector (same result)
SELECT to_tsquery('english', 'fox') @@ to_tsvector('english', 'The quick brown fox');
-- Returns: true
Practical Example:
-- Search articles table
SELECT title, body
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgresql & search');
-- With indexed column (recommended for performance)
CREATE INDEX idx_fts ON articles USING GIN (to_tsvector('english', body));
SELECT title
FROM articles
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'postgresql search');
Key Points:
- Both operands must be properly typed (
tsvector and tsquery)
- Use to_tsvector() to convert text to tsvector
- Use to_tsquery(), plainto_tsquery(), or websearch_to_tsquery() to convert search strings to tsquery
- Available since PostgreSQL 8.3+
Source: PostgreSQL Official Documentation - Full Text Search (Chapter 12)
PostgreSQL Text Search Dictionary
A text search dictionary in PostgreSQL is a function that maps input tokens (words) to normalized lexemes for full-text search, controlling how words are processed during both indexing and querying.
Core Functionality
Text search dictionaries:
- Accept a token and return an array of zero or more lexemes
- Return empty array to indicate the token is a stop word (should be ignored)
- Return NULL to indicate the dictionary doesn't recognize the token (try next dictionary in configuration)
- Can return multiple lexemes for a single token (useful for synonyms)
Built-in Dictionary Types
PostgreSQL provides these dictionary templates:
simple - Converts to lowercase, never returns NULL
CREATE TEXT SEARCH DICTIONARY simple_dict (TEMPLATE = pg_catalog.simple);
SELECT ts_lexize('simple_dict', 'YeS'); -- Returns: {yes}
synonym - Maps words to synonyms from a file
CREATE TEXT SEARCH DICTIONARY syn (TEMPLATE = synonym, SYNONYMS = mysynonyms);
ispell - Morphological dictionary using Ispell algorithm
CREATE TEXT SEARCH DICTIONARY english_ispell (TEMPLATE = ispell,
DictFile = en_us, AffFile = en_us, StopWords = english);
snowball - Stemming using Snowball algorithms (25+ languages)
CREATE TEXT SEARCH DICTIONARY english_stem (TEMPLATE = snowball, Language = english);
SELECT ts_lexize('english_stem', 'running'); -- Returns: {run}
thesaurus - Phrase replacement based on dictionary
CREATE TEXT SEARCH DICTIONARY thesaurus_dict (TEMPLATE = thesaurus,
DictFile = mythesaurus, Dictionary = pg_catalog.english_stem);
Usage in Text Search Configurations
Dictionaries are assigned to token types in text search configurations:
ALTER TEXT SEARCH CONFIGURATION english
ALTER MAPPING FOR asciiword WITH english_stem, simple;
When processing tokens, PostgreSQL tries each dictionary in order until one returns non-NULL.
Version: PostgreSQL 12+
All features described are available in PostgreSQL 12 and later. The text search framework has been stable since PostgreSQL 8.3.
Source: PostgreSQL Official Documentation - Text Search Dictionaries
How to Index tsvector Columns in PostgreSQL
Create a GIN (Generalized Inverted Index) or GiST (Generalized Search Tree) index on tsvector columns for full-text search.
Recommended Approach (GIN Index)
CREATE INDEX idx_name ON table_name USING GIN (tsvector_column);
GIN is the default choice. Trade-offs compared to GiST:
- Faster searches (3x faster than GiST)
- Larger index size (3x larger than GiST)
- Slower updates/inserts
Alternative (GiST Index)
CREATE INDEX idx_name ON table_name USING GIST (tsvector_column);
Use GiST when:
- Updates are more frequent than searches
- Index size is a constraint
Indexing Generated tsvector
When tsvector is generated from text columns:
-- Create generated tsvector column
ALTER TABLE documents
ADD COLUMN tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body)) STORED;
-- Index it
CREATE INDEX documents_tsv_idx ON documents USING GIN (tsv);
Or create an expression index directly:
CREATE INDEX documents_search_idx ON documents
USING GIN (to_tsvector('english', title || ' ' || body));
Critical Details
- Query must match index: If index uses
to_tsvector('english', col), queries must use the same configuration: WHERE to_tsvector('english', col) @@ query
- Operators supported: @@ (match), @@@ (deprecated)
- Available from: PostgreSQL 8.3+ (GIN and GiST both supported)
Source: PostgreSQL 17 Official Documentation, Chapter 12 (Full Text Search)
Typo-Tolerant Search in PostgreSQL
Use pg_trgm (trigram) extension with similarity scoring. This is the standard PostgreSQL approach for fuzzy text matching.
Implementation
-- 1. Enable the extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- 2. Create GIN index for performance
CREATE INDEX idx_products_name_trgm ON products USING GIN (name gin_trgm_ops);
-- 3. Query with similarity threshold
SELECT name, similarity(name, 'iphone') AS score
FROM products
WHERE name % 'iphone' -- % operator uses similarity threshold
ORDER BY score DESC
LIMIT 10;
Similarity Threshold
Default threshold is 0.3 (range: 0.0-1.0). Adjust with:
-- Per session
SET pg_trgm.similarity_threshold = 0.2; -- More permissive (more typos)
-- Per query
SELECT * FROM products
WHERE similarity(name, 'searc term') > 0.25;
Operators
- % - Similar to (uses session threshold)
- similarity(text, text) - Returns score 0.0-1.0
- word_similarity(text, text) - Matches whole words within text
- <-> - Distance operator for ORDER BY (lower = more similar)
Example with Distance Operator
SELECT name
FROM products
ORDER BY name <-> 'iphone' -- Orders by similarity distance
LIMIT 5;
Alternative: Levenshtein Distance
For exact edit distance (slower, no indexing):
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT name, levenshtein(name, 'iphone') AS distance
FROM products
WHERE levenshtein(name, 'iphone') <= 2 -- Max 2 character changes
ORDER BY distance;
Recommendation: Use pg_trgm for production. It supports indexing, handles transpositions/insertions/deletions, and scales to millions of rows.
Version: Available since PostgreSQL 9.1+. GIN indexing for trigrams added in 9.1.
Source: PostgreSQL pg_trgm documentation
PostgreSQL to_tsquery() Function
to_tsquery() converts a text string into a tsquery value for full-text search matching. It parses search terms using boolean operators and returns a normalized query that can be matched against tsvector documents.
Basic Syntax
to_tsquery([ config regconfig, ] querytext text) returns tsquery
Key Characteristics
- Requires valid tsquery syntax - Input must use operators: & (AND), | (OR), ! (NOT), <-> (FOLLOWED BY with distance)
- Lexemes only - Input terms are normalized (lowercased, stemmed) but NOT parsed from natural language
- Throws errors on invalid syntax - Unlike plainto_tsquery(), invalid boolean syntax causes an error
Exact Behavior
- Terms are stemmed using the specified text search configuration (when the config argument is omitted, the default_text_search_config setting is used)
- Stop words are removed based on the configuration
- Operators control matching logic:
  & = both terms must match
  | = either term must match
  ! = term must not match
  <-> = terms must appear adjacent (can specify distance: <N>)
Code Examples
-- Basic AND query
SELECT to_tsquery('english', 'cat & dog');
-- Result: 'cat' & 'dog'
-- OR query
SELECT to_tsquery('english', 'cat | dog');
-- Result: 'cat' | 'dog'
-- NOT query
SELECT to_tsquery('english', 'cat & !dog');
-- Result: 'cat' & !'dog'
-- Phrase search (adjacent words)
SELECT to_tsquery('english', 'black <-> cat');
-- Result: 'black' <-> 'cat'
-- Using in WHERE clause
SELECT * FROM documents
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'postgresql & search');
When to Use
- Use to_tsquery() when you need boolean search control
- Use plainto_tsquery() for simple natural language queries (automatically adds & between words, never errors)
- Use websearch_to_tsquery() (PostgreSQL 11+) for Google-like syntax with quotes and minus signs
Version: PostgreSQL 8.3+
Source: PostgreSQL Official Documentation - Text Search Functions
Data Definition (DDL) > Views
24 questions
Without CONCURRENTLY, the refresh will block other connections trying to read from the materialized view, but tends to use fewer resources and complete more quickly.
The default behavior populates the materialized view immediately upon creation (equivalent to WITH DATA).
A notice is issued instead of an error. The command succeeds without throwing an error.
Yes. Use ALTER VIEW name ALTER COLUMN column_name SET DEFAULT expression to set default values, or DROP DEFAULT to remove them.
Yes. Indexes can be created on materialized views, which is one of their main advantages.
Existing permanent relations with the same name are not visible to the current session while the temporary view exists, unless they are referenced with schema-qualified names.
No. Temporary views exist in a special schema, so a schema name cannot be given when creating a temporary view.
Yes. PostgreSQL allows dropping multiple views by listing them comma-separated (e.g., DROP VIEW view1, view2, view3). This is a PostgreSQL extension beyond the SQL standard.
check_option (values: local or cascaded), security_barrier (boolean), and security_invoker (boolean).
CASCADED is assumed when CHECK OPTION is specified without LOCAL or CASCADED.
Yes. A view column name list must be specified for a recursive view.
There must be at least one UNIQUE index on the materialized view which uses only column names (not expressions) and includes all rows (no WHERE clause).
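For example (table and view names are hypothetical), a unique index on a plain column is what makes CONCURRENTLY possible:
CREATE MATERIALIZED VIEW order_totals AS
SELECT customer_id, sum(amount) AS total FROM orders GROUP BY customer_id;
CREATE UNIQUE INDEX order_totals_customer_idx ON order_totals (customer_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;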
LOCAL checks only the conditions defined directly in the view itself, not conditions from underlying base views.
RESTRICT is the default. The command will refuse to drop the view if any objects depend on it.
No. CHECK OPTION is only supported on views that are automatically updatable and do not have INSTEAD OF triggers or INSTEAD rules.
The new query must generate the same columns that were generated by the existing view query (same column names in the same order and with the same data types), but it may add additional columns to the end of the list.
The materialized view will be flagged as unscannable and cannot be queried until REFRESH MATERIALIZED VIEW is used.
INSTEAD OF triggers may only be defined on views at row level, not statement level.
The default storage format is 'heap'. Currently, PostgreSQL only supports the heap storage method.
No. When inserting into the base table directly, the view's default value is not applied. The default only applies when inserting through the view.
Functions - Array
23 questions
Yes, array_position(), array_positions(), and array_remove() only work with one-dimensional arrays.
PostgreSQL arrays support a maximum of 6 dimensions. This is defined as MAXDIM 6 in the source code and is an arbitrary limit enforced by the implementation.
Both array_position() and array_remove() use IS NOT DISTINCT FROM semantics for comparisons, which means it is possible to match and remove NULL values.
ALL returns true when used with an empty array (vacuously, all zero comparisons satisfy the condition).
When the delimiter parameter is NULL, string_to_array() splits the input string into individual characters, with each character becoming a separate array element.
When the null_string parameter is not provided (or is NULL), NULL array entries are omitted from the output string. If null_string is given and is not NULL, then NULL array entries are represented by that string.
array_length() produces NULL instead of 0 for empty or missing array dimensions. For example, array_length(array[], 1) returns NULL, not 0.
array_agg() returns NULL rather than an empty array when there are no input rows. This is a common gotcha, and the documentation recommends using coalesce to substitute an empty array when necessary.
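A minimal sketch of that coalesce pattern (table and column names are hypothetical):
-- Substitute an empty array when no rows match
SELECT coalesce(array_agg(tag), '{}') AS tags
FROM post_tags
WHERE post_id = 42;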
array_reverse() reverses only the first dimension of the array, not all dimensions.
Yes, array_agg() collects all input values, including nulls, into the resulting array.
When unnest() is called with multiple arrays of different lengths, the shorter arrays are padded with NULLs to match the longest array.
By default, PostgreSQL uses a one-based numbering convention for arrays, meaning an array of n elements starts with array[1] and ends with array[n]. The lower bound index value is 1.
array_sort() defaults to ascending order (descending = false) with nulls last (nulls_first = false).
When concatenating arrays of different element types, they will be coerced to a common type using the same type resolution algorithm as UNION, CASE, and VALUES constructs.
Concatenating a null or empty array is a no-op (no operation). Otherwise, the arrays must have the same number of dimensions or differ in number of dimensions by one.
Array comparison operators compare array contents element-by-element using the default B-tree comparison function for the element data type, and sort based on the first difference. This is effectively lexicographic ordering.
generate_subscripts() returns zero rows for arrays that do not have the requested dimension, or if any input is NULL.
cardinality() returns 0 for empty arrays, unlike array_length() which returns NULL for empty arrays.
The default value for the reverse parameter in generate_subscripts() is false, meaning subscripts are returned in ascending order by default.
PostgreSQL arrays are limited to MaxAllocSize / sizeof(Datum) elements (roughly a quarter billion), a limit related to the 1GB (1073741823-byte) maximum field size.
If the requested slice partially overlaps the array bounds, it is silently reduced to just the overlapping region instead of returning NULL.
When an array slice is completely outside the current array bounds, it yields an empty (zero-dimensional) array instead of NULL. This behavior differs from non-slice access and is done for historical reasons.
When array_fill() is called without the optional third parameter, the default lower bound for all dimensions is 1.
Transactions > Savepoints
23 questions
No. PL/pgSQL explicitly does not support these savepoint commands. However, blocks with exception handlers can replace typical savepoint usage patterns, as they create implicit subtransactions.
ROLLBACK [ WORK | TRANSACTION ] TO [ SAVEPOINT ] savepoint_name. The keywords WORK, TRANSACTION, and SAVEPOINT are all optional.
ROLLBACK TO SAVEPOINT implicitly destroys all savepoints that were established after the named savepoint. The savepoint you roll back to remains valid and can be rolled back to again later.
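A small sketch of this behavior (the table is hypothetical):
BEGIN;
INSERT INTO accounts VALUES (1);
SAVEPOINT sp1;
INSERT INTO accounts VALUES (2);
SAVEPOINT sp2;
INSERT INTO accounts VALUES (3);
ROLLBACK TO SAVEPOINT sp1;  -- undoes rows 2 and 3; sp2 is destroyed, sp1 remains valid
COMMIT;                     -- only row 1 is committed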
No. Savepoints can only be established when inside a transaction block. Attempting to create a savepoint outside a transaction will result in the error: 'SAVEPOINT can only be used in transaction blocks' (SQLSTATE 25P01).
When a write-enabled subtransaction is assigned a subxid, all parent transactions up to the top-level are also assigned non-virtual transaction IDs. Parent transaction IDs are always numerically lower than child subxid values.
No. They are optional 'noise words' that have no functional impact. The SQL standard allows only WORK as a noise word after ROLLBACK, but PostgreSQL accepts both WORK and TRANSACTION.
Only the most recently defined unreleased savepoint with that name is released. Repeated RELEASE SAVEPOINT commands will release progressively older savepoints with the same name.
PostgreSQL returns an error with SQLSTATE 3B001 (invalid_savepoint_specification) stating that the savepoint does not exist.
No. It is not possible to release a savepoint when the transaction is in an aborted state. You must use ROLLBACK TO SAVEPOINT instead.
When ON_ERROR_ROLLBACK is enabled, psql automatically creates a savepoint before each statement and, if an error occurs, rolls back to that savepoint instead of aborting the entire transaction. This allows you to continue working in the transaction despite errors.
ON_ERROR_ROLLBACK can be set to 'on' (enables for all statements), 'interactive' (enables only in interactive mode), or off. Use \set ON_ERROR_ROLLBACK on or \set ON_ERROR_ROLLBACK interactive in psql.
PostgreSQL keeps the old savepoint when a new one with the same name is created, though only the more recent one will be accessible. The SQL standard requires the old savepoint to be destroyed automatically. The older savepoint becomes accessible again after releasing the newer one.
No. Session-level advisory lock requests do not honor transaction semantics - a lock acquired during a transaction that is later rolled back will still be held following the rollback. However, transaction-level advisory locks behave normally and are released appropriately.
Yes. The SQL standard specifies that the key word SAVEPOINT is mandatory, but PostgreSQL (and Oracle) allow it to be omitted.
RELEASE [ SAVEPOINT ] savepoint_name. The keyword SAVEPOINT is optional.
ROLLBACK TO SAVEPOINT rolls back all commands that were executed after the savepoint was established, and then starts a new subtransaction at the same transaction level.
No. Read-only subtransactions don't receive subxids (subtransaction IDs) until they attempt write operations.
64 (PGPROC_MAX_CACHED_SUBXIDS). This is the maximum number of subtransactions (created by savepoints) that are cached in shared memory for each backend before performance degradation occurs.
No. The number of savepoints inside a transaction is practically unlimited. PostgreSQL customers have used over 250,000 savepoints in a single operation successfully.
Any cursor that is opened inside a savepoint will be closed when the savepoint is rolled back. However, cursors opened before the savepoint remain open and operational after the rollback.
Yes. If a lock is acquired after establishing a savepoint, the lock is released immediately if the savepoint is rolled back to. This applies to both table-level and row-level locks.
Entry into the executable section of a block that has an exception section implicitly creates an anonymous savepoint. Entry into a handler in the exception section implicitly rolls back to that anonymous savepoint.
Backup and Restore > pg_restore
23 questions
The three section names are: pre-data (schema definitions before data loading), data (actual table data and large-object definitions), and post-data (indexes, triggers, rules, and constraints). The default is to restore all sections.
No, pg_restore cannot restore large objects selectively (for instance, only those for a specific table). If an archive contains large objects, then all large objects will be restored, or none of them if excluded via -L, -t, or other options.
The --no-owner option prevents pg_restore from outputting commands to set ownership of objects. With this option, any user name can be used for the initial connection, and this user will own all the created objects.
The --use-set-session-authorization option outputs SQL-standard SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to determine object ownership. This makes the dump more standards-compatible, but depending on the history of the objects in the dump, might not restore properly.
The -P (or --function) option specifies a procedure or function to be restored. It allows selective restoration of specific functions or procedures from a backup archive.
When both -n (include schema) and -N (exclude schema) are given for the same schema name, the -N switch wins and the schema is excluded.
pg_restore displays the error: 'input file appears to be a text format dump. Please use psql.' This occurs because pg_restore only works with binary or directory format dumps, not plain-text SQL files.
By default, pg_restore continues execution when errors occur and displays a count of errors at the end of the restoration. It does not exit on error by default.
Use pg_restore -l to list the archive contents: pg_restore -l archive.file > archive.list. The list file is normally created by editing this output. Lines can be moved, removed, or commented out by placing a semicolon (;) at the start of the line.
pg_restore returns exit code 0 when the restore completes without errors.
No, there is no provision for wildcard matching in pg_restore options. None of the pg_restore options accept values including wildcards.
The -I option restores the definition of a named index only. Multiple indexes may be specified with multiple -I switches.
No, pg_restore does not read PGDATABASE when a database name is not supplied, which is a notable difference from other PostgreSQL utilities.
The --no-privileges and --no-acl options are essentially synonymous. Both prevent restoration of access privileges (grant/revoke commands). --no-acl tells pg_restore to exclude any access control lists that are defined for the database objects.
No, pg_restore's -t flag does not include subsidiary objects such as indexes. This differs from pg_dump's -t flag which does dump subsidiary objects. You need to use -I separately to restore specific indexes.
pg_restore returns exit code 1 when errors occur during restore, even if it displays 'WARNING: errors ignored on restore'. Exit code 1 indicates that errors were encountered during the restore process.
pg_restore supports three archive formats: custom format (designated 'c'), directory format (designated 'd'), and tar format (designated 't'). These are non-plain-text formats created by pg_dump. Plain-text SQL dumps cannot be restored with pg_restore and must use psql instead.
The --single-transaction option wraps emitted commands in BEGIN/COMMIT to ensure either all commands complete successfully or no changes are applied. This option implies --exit-on-error.
Yes, the dbname parameter can be a connection string. If so, connection string parameters will override any conflicting command line options. Formats include PostgreSQL URI (postgres://user:pass@host:port/database) or key-value pairs.
No, the -t option does not support including a schema name within the table specification. You must use the -n option separately to specify the schema along with -t for the table name.
Multiple schemas may be specified with multiple -n switches. For example: pg_restore -d testdb -n schema1 -n schema2 -n schema3 backup.dump
The --strict-names option requires that each schema (-n/--schema) and table (-t/--table) qualifier match at least one schema/table in the file to be restored. It generates an error if specified names don't match any objects in the backup file.
The -j option runs the most time-consuming parts of pg_restore — those which load data, create indexes, or create constraints — using multiple concurrent jobs. Each job uses a separate connection to the server.
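For example (archive and database names are hypothetical):
# Run data loading, index creation, and constraint creation with 4 concurrent jobs
pg_restore -j 4 -d mydb backup.dump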
JIT Compilation
22 questions
No, it is not meaningful to set jit_inline_above_cost to less than jit_above_cost, as inlining can only occur when JIT compilation is already happening.
PostgreSQL JIT currently accelerates expression evaluation (WHERE clauses, target lists, aggregates, projections) and tuple deforming (converting on-disk tuples to in-memory representation).
The four JIT timing phases are: Generation (generating LLVM IR code), Inlining (inlining functions), Optimization (LLVM optimization passes), and Emission (compiling IR to machine code).
PostgreSQL must be built with the --with-llvm flag to enable JIT compilation support.
When prepared statements use a generic plan, JIT cost decisions are made at prepare time using the configuration parameters in effect at that time, not at execution time.
Yes, the jit_provider parameter can be set to point to another JIT provider, though inlining support is only available when using the LLVM provider.
OLAP (analytical) workloads benefit most from JIT compilation. JIT is beneficial primarily for long-running CPU-bound queries with complex expressions, aggregates, and WHERE clauses.
The jit_provider parameter has a context of "postmaster", meaning it can only be set at server startup and requires a server restart to change.
JIT compilation was first introduced in PostgreSQL 11, released on October 18, 2018.
PostgreSQL makes JIT cost decisions at plan time, not execution time, by comparing the estimated query cost against the configured JIT threshold parameters.
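A sketch of how to observe this (the thresholds are lowered only to force JIT on a trivial query; requires a server built with LLVM support):
SET jit = on;
SET jit_above_cost = 0;   -- normally 100000; lowered here only for demonstration
-- EXPLAIN (ANALYZE) prints a "JIT:" section with the Functions count and timing phases
EXPLAIN (ANALYZE) SELECT sum(x) FROM generate_series(1, 100000) AS t(x);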
The PostgreSQL documentation shows a simple query that took 0.365 ms without JIT but 7.416 ms with JIT enabled, with JIT compilation overhead of approximately 7.1 ms - making the query over 20 times slower.
The Functions count represents the number of functions that were JIT-compiled for the query.
The jit_dump_bitcode parameter has a context of "superuser", meaning only superusers or users with appropriate SET privilege can change it.
replication_ha
22 questions
max_wal_senders is a PostgreSQL configuration parameter that sets the maximum number of concurrent WAL sender processes that can run simultaneously.
Default Value: 10 (since PostgreSQL 10; the default was 0 in 9.6 and earlier)
Purpose: Controls how many replication connections the server can support. Each streaming replication standby or logical replication subscriber requires one WAL sender process.
Key Requirements:
- Must be set to at least the number of standby servers you plan to connect
- Requires server restart to change (not reloadable)
- Common practice: set max_wal_senders = <number_of_standbys> + 2 to allow for temporary connections during maintenance
Prerequisites:
- wal_level must be set to replica or logical (not minimal)
- Each WAL sender consumes a connection slot (counted within max_connections in PostgreSQL 11 and earlier; counted separately from max_connections since PostgreSQL 12)
Configuration Example:
# postgresql.conf
max_wal_senders = 5 # allows up to 5 replication connections
wal_level = replica # required for streaming replication
Valid Range: 0 to 262143
Setting it to 0 disables replication entirely.
Source: PostgreSQL Official Documentation - Write Ahead Log Settings (applies to PostgreSQL 10+, consistent through PostgreSQL 17)
PostgreSQL Streaming Replication Setup (PostgreSQL 12+)
Streaming replication requires one primary server and one or more standby servers. Here's the complete setup:
On Primary Server
- Edit postgresql.conf:
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1024 # MB, PostgreSQL 13+; use wal_keep_segments=64 for PG 12
- Create replication user:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
- Edit pg_hba.conf:
host replication replicator standby_ip/32 scram-sha-256
- Restart PostgreSQL:
pg_ctl restart -D /var/lib/postgresql/data
On Standby Server
- Create base backup:
pg_basebackup -h primary_ip -D /var/lib/postgresql/data -U replicator -P -v -R -X stream -C -S standby_slot_1
Key flags:
- -R: Creates standby.signal file (PG 12+)
- -X stream: Streams WAL during backup
- -C -S: Creates replication slot standby_slot_1
- Verify postgresql.auto.conf contains:
primary_conninfo = 'host=primary_ip port=5432 user=replicator password=secure_password'
- Start standby:
pg_ctl start -D /var/lib/postgresql/data
Verification
On primary:
SELECT client_addr, state, sync_state FROM pg_stat_replication;
Expected output shows standby with state='streaming' and sync_state='async' (for async replication).
Source: PostgreSQL 16 Official Documentation - High Availability, Load Balancing, and Replication
Streaming Replication in PostgreSQL
Streaming replication is PostgreSQL's built-in method for continuously transferring Write-Ahead Log (WAL) records from a primary server to one or more standby servers in near real-time, enabling high availability and read scaling.
How It Works
The primary server streams WAL records (8192 bytes each by default) to standbys over a TCP connection as transactions are committed, rather than waiting for WAL files to complete. Standbys continuously apply these records to maintain an up-to-date copy of the database.
Key Configuration (PostgreSQL 12+)
On Primary (postgresql.conf):
wal_level = replica # required for replication
max_wal_senders = 10 # max concurrent connections from standbys
wal_keep_size = 1024 # MB of WAL to retain (prevents standby lag issues)
On Standby:
Create standby.signal file in data directory, then configure postgresql.conf:
primary_conninfo = 'host=primary.example.com port=5432 user=replicator password=secret'
Replication Modes
- Asynchronous (default): Primary commits transactions without waiting for standby confirmation. Fastest, but standby may lag slightly.
- Synchronous: Primary waits for standby acknowledgment before commit returns. Configure via
synchronous_standby_names = 'standby1'on primary.
Verification
Check replication status on primary:
SELECT client_addr, state, sync_state, replay_lag
FROM pg_stat_replication;
Version Note
This answer applies to PostgreSQL 12+ which uses standby.signal. PostgreSQL 11 and earlier used recovery.conf instead.
Source: PostgreSQL Official Documentation - High Availability, Load Balancing, and Replication
What is PgBouncer and Why Use It?
PgBouncer is a lightweight connection pooler for PostgreSQL that sits between your application and the database server, managing a pool of persistent database connections.
What It Does
PgBouncer maintains a pool of active PostgreSQL connections and multiplexes client connections onto them. When your application requests a database connection, PgBouncer assigns one from its pool rather than creating a new PostgreSQL backend process.
Why Use It
Primary reason: PostgreSQL creates a separate backend process for each connection (fork-based architecture). Each process consumes ~10MB of memory. With 1000 direct connections, that's ~10GB just for connection overhead.
PgBouncer solves this by:
- Allowing 1000+ application connections to share 20-50 actual PostgreSQL connections
- Using ~2MB per 1000 client connections (vs ~10GB for direct connections)
- Connecting in <1ms (vs 5-50ms for new PostgreSQL connection with auth)
Pool Modes
- Session pooling: Connection returned to pool when client disconnects (safest, supports all features)
- Transaction pooling: Connection returned after each transaction (most efficient, cannot use prepared statements across transactions)
- Statement pooling: Connection returned after each statement (breaks transactions, rarely used)
Recommendation: Use transaction pooling for stateless web apps; use session pooling if you need prepared statements or advisory locks.
Typical Configuration
[databases]
mydb = host=localhost port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
This allows 1000 app connections using only 25-30 PostgreSQL connections.
Current stable version: PgBouncer 1.23.1 (as of January 2025)
Source: Official PgBouncer documentation, PostgreSQL documentation on connection costs
What is a WAL (Write-Ahead Log) in PostgreSQL?
A Write-Ahead Log (WAL) is PostgreSQL's crash recovery and replication mechanism where all data modifications are written to a sequential log file before being applied to the actual data files.
Key Characteristics
Core Principle: Changes are logged before data pages are modified (write-ahead), ensuring durability and crash recovery.
Physical Storage:
- Default location: pg_wal/ directory (called pg_xlog/ in PostgreSQL 9.6 and earlier)
- Files are 16 MB each by default (the segment size can be chosen at initdb time in PostgreSQL 11+, or at build time in older releases)
- Named with 24-character hexadecimal identifiers (e.g., 000000010000000000000001)
Purpose:
- Crash Recovery: Replay WAL records to restore database to consistent state
- Point-in-Time Recovery (PITR): Archive WAL segments for backup restoration
- Replication: Stream WAL to standby servers for physical replication
- Minimal Disk I/O: Write sequential log instead of random data page updates
Critical Configuration Parameters
# How much information is written to WAL
wal_level = replica # minimal | replica | logical (default: replica since PG 10)
# When commit waits for WAL to be flushed
synchronous_commit = on # on | remote_apply | remote_write | local | off
# WAL buffer size (shared memory)
wal_buffers = 16MB # default: -1 (auto-sized to 1/32 of shared_buffers, max 16MB)
# Checkpoint frequency
checkpoint_timeout = 5min # default: 5 minutes
max_wal_size = 1GB # default: 1GB (triggers checkpoint)
How It Works
- Transaction modifies data → generates WAL record
- WAL record written to WAL buffer in shared memory
- At commit, WAL flushed to disk (via fsync)
- Only after WAL is on disk, transaction confirmed as committed
- Modified data pages written to disk later (asynchronously)
Viewing WAL Information
-- Current WAL write position
SELECT pg_current_wal_lsn();
-- WAL file for a specific LSN
SELECT pg_walfile_name('0/1000000');
-- Check WAL archiving status
SELECT archived_count, failed_count
FROM pg_stat_archiver;
Source: PostgreSQL Official Documentation v16 - Chapter 29 (Reliability and the Write-Ahead Log)
Verified: PostgreSQL 12-16 (current stable as of January 2025)
What is Patroni for PostgreSQL?
Patroni is an open-source high availability (HA) solution for PostgreSQL that manages automatic failover and cluster configuration using a Distributed Configuration Store (DCS) like etcd, ZooKeeper, or Consul.
Core Functionality
Patroni runs as a daemon alongside each PostgreSQL instance and:
- Monitors cluster health via DCS leader elections (default lease TTL: 30 seconds)
- Performs automatic failover when the primary fails (typical failover time: 30-60 seconds)
- Manages replication by automatically configuring standby servers
- Provides REST API on port 8008 (default) for health checks and administration
Key Configuration Example
# patroni.yml
scope: postgres-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.1.10:8008

etcd:
  host: 192.168.1.100:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # bytes (1MB)

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.1.10:5432
  data_dir: /var/lib/postgresql/14/main
Critical Parameters
- ttl: 30 - Leader lease duration in seconds (default: 30s)
- loop_wait: 10 - Seconds between cluster state checks (default: 10s)
- maximum_lag_on_failover: 1048576 - Max replication lag in bytes to allow failover (default: 1MB)
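A quick way to check a node and the cluster (a sketch; the host, port, and config path are assumptions):
# REST API health check (returns HTTP 200 when PostgreSQL is running on this node)
curl -s http://192.168.1.10:8008/health
# List members, roles, state, and lag
patronictl -c /etc/patroni.yml list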
PostgreSQL synchronous_commit Setting
synchronous_commit is a PostgreSQL configuration parameter that controls whether the server waits for WAL (Write-Ahead Log) records to be written to disk before reporting transaction commit success to the client.
Valid Values (PostgreSQL 9.1+)
- on (default) - Wait for local WAL flush to disk (and, if synchronous standbys are configured, for their flush) before commit returns
- remote_apply (9.6+) - Wait for synchronous standbys to apply WAL changes before commit returns
- remote_write (9.1+) - Wait for synchronous standbys to write WAL to the OS (not fsync) before commit returns
- local (9.1+) - Wait only for the local WAL flush to disk, ignoring synchronous standbys
- off - Return immediately without waiting for WAL write
Performance vs. Durability Trade-off
When set to off, commits are 2-10x faster but there is a risk of losing the last few transactions (up to 3x wal_writer_delay, default 200ms = 600ms window) if the server crashes. Data corruption does NOT occur—only recent committed transactions may be lost.
Configuration
-- Session-level (recommended for specific transactions)
SET synchronous_commit = off;
-- Server-level (postgresql.conf)
synchronous_commit = on
Common Use Case
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO logs VALUES (...); -- Fast, non-critical data
COMMIT; -- Returns immediately
Source: PostgreSQL 16 Official Documentation - "Write-Ahead Logging (WAL)" and "High Availability, Load Balancing, and Replication"
Critical Detail: Only affects commit acknowledgment timing; all transactions still write to WAL eventually. This is NOT a data integrity risk for crash recovery, only a recent-transaction-loss risk.
Cascading Replication in PostgreSQL
Cascading replication is a PostgreSQL feature where a standby server receives WAL (Write-Ahead Log) data from another standby server instead of directly from the primary server, creating a replication chain: Primary → Standby A → Standby B.
Configuration
Available since: PostgreSQL 9.2+
On the intermediate standby (the one that will send WAL to downstream standbys), set in postgresql.conf:
hot_standby = on
max_wal_senders = 5 # Must be > 0 to allow downstream connections
wal_level = replica # Required for cascading (or 'logical')
On the downstream standby, configure primary_conninfo in postgresql.auto.conf or recovery settings to point to the intermediate standby instead of the primary:
primary_conninfo = 'host=standby-a port=5432 user=replication password=yourpass'
Key Characteristics
- WAL propagation: Changes flow Primary → Standby A → Standby B. If Standby A fails, Standby B stops receiving updates until reconnected to another source.
- Lag multiplication: Replication lag accumulates down the chain. If Primary→A has 2s lag and A→B has 2s lag, B is ~4s behind Primary.
- Authentication: Each standby in the chain must be configured in pg_hba.conf with replication permissions for downstream standbys.
- No automatic failover rerouting: If an intermediate standby fails, downstream standbys do not automatically reconnect to the primary.
Use Cases
- Reducing network load on the primary when replicating across geographic regions
- Creating multi-tier replication topologies
- Offloading replication bandwidth to dedicated relay servers
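To verify the chain, a sketch of two queries to run on the intermediate standby (Standby A):
-- Upstream: the WAL receiver connection back to the primary
SELECT status, conninfo FROM pg_stat_wal_receiver;
-- Downstream: the standbys this server is feeding
SELECT client_addr, state FROM pg_stat_replication;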
Source: PostgreSQL 17 Documentation - High Availability, Load Balancing, and Replication
pg_promote() Function
Available since: PostgreSQL 12
Purpose: Promotes a physical standby server to primary (read-write) mode by exiting recovery.
Syntax
pg_promote(wait boolean DEFAULT true, wait_seconds integer DEFAULT 60) → boolean
Parameters
- wait (boolean, default: true) - Whether to wait for promotion to complete before returning
- wait_seconds (integer, default: 60) - Maximum seconds to wait for promotion (only used if wait is true)
Return Value
- Returns boolean: true if promotion succeeds (when wait is true), false if promotion fails or the timeout is reached
- Returns immediately after signaling the postmaster if wait is false
Permissions
- Restricted to superusers by default
- EXECUTE privilege can be granted to non-superuser roles for delegated failover
Usage Example
-- Promote with default behavior (wait up to 60 seconds)
SELECT pg_promote();
-- Promote and don't wait for completion
SELECT pg_promote(false);
-- Promote and wait up to 120 seconds
SELECT pg_promote(true, 120);
Prior to PostgreSQL 12
Use pg_ctl promote or create the trigger file specified by the trigger_file parameter in recovery.conf.
Source: PostgreSQL Official Documentation - System Administration Functions
PostgreSQL primary_conninfo Setting
The primary_conninfo is a PostgreSQL configuration parameter that specifies the connection string a standby server uses to connect to its primary server for streaming replication.
Location and Scope
- Configuration file: postgresql.conf or postgresql.auto.conf
- Scope: Standby servers only (has no effect on primary)
- Reload: Requires pg_ctl reload or SELECT pg_reload_conf() to take effect
Format
Standard libpq connection string format:
primary_conninfo = 'host=primary.example.com port=5432 user=replicator password=secret application_name=standby1'
Required Parameters
- host or hostaddr: Primary server hostname/IP
- user: Replication user (must have REPLICATION role attribute)
Critical Optional Parameters
- port: Default 5432
- password: Authentication credential (consider using a .pgpass file instead)
- application_name: Identifies the standby in the pg_stat_replication view
- sslmode: Set to require or higher for encrypted connections (default: prefer)
Example Configuration
# Minimal
primary_conninfo = 'host=192.168.1.10 user=replicator'
# Production-recommended
primary_conninfo = 'host=primary.db.local port=5432 user=replicator application_name=standby_dc2 sslmode=require connect_timeout=10'
Related Settings
Must be used with restore_command (for archive recovery) or alongside recovery signal file (standby.signal in PostgreSQL 12+).
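A sketch of changing it on a running standby (placeholder values; primary_conninfo is reloadable in PostgreSQL 13+, while PostgreSQL 12 still requires a restart for this parameter):
ALTER SYSTEM SET primary_conninfo = 'host=primary.example.com port=5432 user=replicator application_name=standby1';
SELECT pg_reload_conf();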
Source: PostgreSQL 17 Official Documentation - High Availability, Load Balancing, and Replication
PostgreSQL Replication Slot
A replication slot is a persistent marker in PostgreSQL that guarantees WAL (Write-Ahead Log) files are retained until a replica consumes them, preventing automatic WAL deletion that would break replication.
Key Characteristics
Purpose: Ensures the primary server retains WAL segments needed by standby servers or logical replication subscribers, even if they disconnect temporarily.
Types:
- Physical replication slots: For streaming replication (binary replication)
- Logical replication slots: For logical replication (row-level changes)
Critical Behavior:
- WAL files are never automatically removed if any replication slot needs them
- Risk: Unbounded disk growth if a slot's consumer is permanently offline
- Slots persist across server restarts
- Each slot tracks a single consumer's replay position
Creating a Replication Slot
Physical slot (streaming replication):
SELECT pg_create_physical_replication_slot('slot_name');
Logical slot (PostgreSQL 10+):
SELECT pg_create_logical_replication_slot('slot_name', 'pgoutput');
The second parameter is the output plugin (e.g., pgoutput, wal2json, test_decoding).
Monitoring Slots
SELECT slot_name, slot_type, active, restart_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
Critical columns:
- restart_lsn: Oldest WAL position this slot needs
- active: Whether a consumer is currently connected
- retained_wal: Amount of WAL kept for this slot
Removing a Slot
SELECT pg_drop_replication_slot('slot_name');
Important: Only drop slots when the consumer is permanently gone. Dropping an active slot breaks replication.
Configuration
Max slots (default: 10):
max_replication_slots = 10
Requires server restart to change.
Source: PostgreSQL 17 Official Documentation - Replication Slots
PgPool-II is a middleware proxy that sits between PostgreSQL clients and PostgreSQL database servers, providing connection pooling, load balancing, replication, and high availability features.
Core Functionality
Connection Pooling: Maintains a pool of persistent connections to PostgreSQL servers, reducing connection overhead. Each PgPool-II child process can handle multiple client connections mapped to fewer backend connections.
Load Balancing: Distributes SELECT queries across multiple PostgreSQL servers (read replicas) while routing write queries to the primary server. Uses configurable backend weights for query distribution.
Replication Management: Supports multiple replication modes:
- Streaming Replication mode (native PostgreSQL replication)
- Logical Replication mode
- Snapshot Isolation mode
- Native Replication mode (deprecated)
Automatic Failover: Detects backend node failures and automatically promotes a standby server to primary, with configurable health check intervals (default: 10 seconds) and failover scripts.
Query Caching: Can cache SELECT query results in memory to reduce database load.
Architecture
PgPool-II uses a multi-process architecture where a parent process spawns child processes (default: 32, configurable via num_init_children) to handle client connections. Each child process can handle one connection at a time in the default mode.
Typical Use Cases
- PostgreSQL high availability clusters
- Read-heavy workloads requiring horizontal scaling
- Applications needing transparent connection pooling
- Database load distribution across replicas
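A minimal pgpool.conf sketch for a primary plus one read replica (host names, weights, and port are placeholder assumptions):
# pgpool.conf (excerpt)
listen_addresses = '*'
port = 9999
num_init_children = 32            # child processes handling client connections
backend_hostname0 = 'primary.db'  # node 0: primary
backend_port0 = 5432
backend_weight0 = 1
backend_hostname1 = 'replica.db'  # node 1: read replica
backend_port1 = 5432
backend_weight1 = 1
load_balance_mode = on            # distribute SELECTs across backends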
Current Stable Version: 4.5.x (as of 2025)
Official Documentation: https://www.pgpool.net/docs/latest/en/html/
Note: PgPool-II is protocol-compatible with PostgreSQL and requires no application code changes—clients connect to PgPool-II using standard PostgreSQL connection strings on the configured port (default: 9999).
PostgreSQL wal_level Setting
The wal_level setting controls how much information is written to the Write-Ahead Log (WAL) in PostgreSQL.
Valid Values (PostgreSQL 9.6+)
- minimal - Writes only the information needed for crash recovery
- replica - Adds logging required for WAL archiving and replication (value introduced in 9.6; the default since PostgreSQL 10)
- logical - Adds information necessary for logical decoding and logical replication
Configuration
Set in postgresql.conf:
wal_level = replica
Requires server restart to change.
Choosing the Right Level
Use replica if you need:
- Physical replication (streaming replication)
- Point-in-time recovery (PITR)
- WAL archiving
Use logical if you need:
- Logical replication (e.g., selective table replication)
- Change data capture (CDC)
- Logical decoding output plugins
Use minimal ONLY if:
- No replication or archiving is needed
- Reduces WAL volume by ~20-30%
- Cannot create base backups while running
Check Current Value
SHOW wal_level;
Important Constraint
You cannot decrease wal_level if physical or logical replication slots exist. The server will refuse to start.
Source: PostgreSQL 16 Official Documentation - Write-Ahead Logging (WAL) Configuration
PostgreSQL Streaming Replication Lag
Streaming replication lag is the delay between when a transaction commits on the primary server and when that change becomes visible on a standby replica.
Three Types of Lag
PostgreSQL measures replication lag in three specific ways:
- Byte lag (pg_wal_lsn_diff()) - The difference in Write-Ahead Log (WAL) positions between primary and standby, measured in bytes
- Time lag (replay_lag, write_lag, flush_lag) - The time delay in milliseconds between WAL generation and application
- Transaction lag - Number of transactions the standby is behind (less commonly used)
Monitoring Replication Lag
On the primary server (PostgreSQL 10+), query pg_stat_replication:
SELECT
application_name,
client_addr,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS byte_lag,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
On the standby server, query pg_stat_wal_receiver:
SELECT
pg_last_wal_receive_lsn() AS receive_lsn,
pg_last_wal_replay_lsn() AS replay_lsn,
pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn()) AS receive_replay_byte_lag;
Lag Columns Explained (PostgreSQL 10+)
- write_lag: Time until WAL is written to standby's OS (not yet fsynced)
- flush_lag: Time until WAL is fsynced to standby's disk
- replay_lag: Time until WAL is applied to standby's database (queryable)
Critical: replay_lag is what matters for read queries - it shows when data becomes visible.
Acceptable Thresholds
- Byte lag: < 16 MB (1 WAL segment in default config) is healthy; > 1 GB indicates problems
- Time lag: < 1 second is excellent; > 10 seconds requires investigation
Source: PostgreSQL 16 official documentation on monitoring and replication statistics.
PostgreSQL Logical Replication
Logical replication is a method of replicating data objects and their changes based on their replication identity (typically the primary key). Unlike physical replication which copies disk block changes, logical replication replicates data changes at the row level using a publish/subscribe model.
Key Characteristics
Available since: PostgreSQL 10.0 (released October 2017)
How it works:
- Publisher creates a publication for specific tables/operations
- Subscriber creates a subscription pointing to the publisher
- Changes are sent as logical change records, not physical WAL blocks
- Allows replication between different PostgreSQL major versions and different architectures
Essential Configuration
Publisher requirements:
- wal_level = logical (in postgresql.conf)
- max_replication_slots >= 1 (default: 10)
- max_wal_senders >= 1 (default: 10)
Replication identity required: Tables must have PRIMARY KEY or REPLICA IDENTITY set (FULL, INDEX, or DEFAULT).
Example Setup
-- On Publisher (PostgreSQL 10+)
CREATE PUBLICATION my_pub FOR TABLE users, orders;
-- On Subscriber
CREATE SUBSCRIPTION my_sub
CONNECTION 'host=publisher_host dbname=mydb user=repuser password=secret'
PUBLICATION my_pub;
Limitations
- Does NOT replicate: DDL changes, sequence values, large objects, TRUNCATE (before v11)
- Requires table with same name on subscriber (can have different columns/indexes)
- Initial data sync uses COPY protocol
Source: PostgreSQL Official Documentation v17 (current stable)
PostgreSQL: Synchronous vs Asynchronous Replication
Synchronous replication: The primary server waits for confirmation that at least one standby server has written the WAL (Write-Ahead Log) data to disk before reporting transaction commit to the client. This guarantees zero data loss if the primary fails.
Asynchronous replication (default): The primary server commits transactions immediately without waiting for standby confirmation. Standbys receive and apply changes with a delay, meaning data loss is possible if the primary fails before standbys receive the latest changes.
Configuration
Set in postgresql.conf on the primary:
# Asynchronous replication (the default): no synchronous standbys are configured
synchronous_standby_names = ''
# Synchronous replication - wait for WAL flush to disk on standby
synchronous_commit = on
# Must also specify which standbys to wait for
synchronous_standby_names = 'standby1,standby2'
For synchronous replication to work:
- Set synchronous_commit = on (or remote_write, remote_apply)
- Set synchronous_standby_names to list standby server names (from their application_name in primary_conninfo)
Synchronous Commit Levels
- off - asynchronous (fastest, data loss possible)
- local - wait for local disk write only
- remote_write - wait for standby to write to OS (not fsync)
- on - wait for standby to fsync to disk (safest, default synchronous level)
- remote_apply - wait for standby to apply changes (strongest consistency)
Trade-offs
Synchronous: Zero data loss, but 2-10x slower commits (depends on network latency between primary and standby).
Asynchronous: Maximum performance, but potential data loss window equals replication lag (typically seconds to minutes).
Source: PostgreSQL 16 Official Documentation - High Availability, Load Balancing, and Replication
pg_stat_replication View
pg_stat_replication is a system view in PostgreSQL that displays real-time information about active replication connections from standby servers to the primary server.
Purpose
Monitors the status of WAL (Write-Ahead Log) sender processes on the primary database server, showing one row per active replication connection (physical or logical).
Key Columns
Connection Info:
- pid - Process ID of the WAL sender process
- usename - Replication user name
- application_name - Client application name
- client_addr - IP address of the standby server
Replication State:
- state - Current WAL sender state: startup, catchup, streaming, or backup
- sync_state - Synchronous state: async, potential, sync, or quorum
WAL Positions (Critical for lag monitoring):
- sent_lsn - Last WAL Log Sequence Number sent to standby
- write_lsn - Last LSN written to standby's disk
- flush_lsn - Last LSN flushed to standby's disk
- replay_lsn - Last LSN replayed on standby
Lag Metrics:
- write_lag - Time since last write confirmation
- flush_lag - Time since last flush confirmation
- replay_lag - Time since last replay confirmation (most important for monitoring)
Query Example
SELECT
application_name,
client_addr,
state,
sync_state,
replay_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
Requirements
- Only exists on PRIMARY servers (returns empty on standbys)
- Requires pg_monitor role or superuser privileges to view all columns
- Available in PostgreSQL 9.1+ (lag columns added in 10.0)
Source: PostgreSQL Official Documentation - Monitoring Statistics Views
PostgreSQL: Streaming vs Logical Replication
Core Difference
Streaming Replication: Physical, block-level replication that copies exact WAL (Write-Ahead Log) records. Creates an identical byte-for-byte replica of the entire database cluster.
Logical Replication: Row-level, selective replication that replicates specific tables or databases using a publish/subscribe model. Transfers logical changes (INSERT/UPDATE/DELETE operations).
Key Technical Specifications
Streaming Replication
- Granularity: Entire PostgreSQL cluster (all databases)
- Replica Requirements: Must be identical PostgreSQL major version and architecture
- Replica State: Read-only (hot standby) or recovery mode
- Use Case: High availability, failover, disaster recovery
- Data Format: Binary WAL segments (typically 16MB each)
Logical Replication
- Granularity: Per-table or per-publication (subset of tables)
- Replica Requirements: Can be different PostgreSQL major versions (the subscriber is typically the same major version as the publisher or newer)
- Replica State: Fully writable - can have additional tables, indexes, or receive other writes
- Use Case: Selective data sync, zero-downtime upgrades, multi-master setups, data consolidation
- Data Format: Logical changesets decoded from WAL
Configuration Examples
Streaming Replication (postgresql.conf on primary):
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB
Logical Replication (postgresql.conf on primary):
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
Publication setup (PostgreSQL 10+):
-- On publisher
CREATE PUBLICATION my_pub FOR TABLE users, orders;
-- On subscriber
CREATE SUBSCRIPTION my_sub
CONNECTION 'host=primary dbname=mydb'
PUBLICATION my_pub;
Critical Limitations
Streaming: Cannot replicate to different architectures (e.g., x86 to ARM), cannot filter specific tables, replica must match primary's configuration closely.
Logical: Does not replicate DDL changes (schema changes must be applied manually), does not replicate sequences, TRUNCATE requires special handling (PostgreSQL 11+), higher overhead than streaming.
Source: PostgreSQL Official Documentation v17 (applies to versions 10+)
Yes, you can query a standby server in PostgreSQL using Hot Standby mode.
Hot Standby Configuration
Hot Standby allows read-only queries on a physical streaming replication standby server. Available since PostgreSQL 9.0+.
Required Configuration
On the standby server, set in postgresql.conf:
hot_standby = on
This parameter takes effect at server start and must be enabled before the standby accepts connections.
Query Capabilities and Limitations
Allowed operations:
- SELECT queries
- EXPLAIN
- COPY TO
- Cursor operations (read-only)
- Prepared statements (read-only)
NOT allowed (will error):
- INSERT, UPDATE, DELETE
- DDL statements (CREATE, ALTER, DROP)
- Temporary table creation
- Any write operations
Conflict Handling
Queries may be canceled if they conflict with WAL replay operations. The standby will terminate queries that hold locks blocking replay after max_standby_streaming_delay (default: 30 seconds).
max_standby_streaming_delay = 30s # How long queries can delay replay
Verification
Check if hot standby is active:
SELECT pg_is_in_recovery();
-- Returns 't' (true) if server is in standby mode
Source: PostgreSQL Official Documentation - High Availability, Load Balancing, and Replication (Chapter 27.4 - Hot Standby)
Promoting a PostgreSQL Standby to Primary
To promote a standby server to primary in PostgreSQL, use the pg_ctl promote command or call the pg_promote() function.
Method 1: pg_ctl (Recommended)
pg_ctl promote -D /var/lib/postgresql/data
Where -D specifies the data directory path.
Method 2: SQL Function (PostgreSQL 12+)
SELECT pg_promote();
This must be executed as a superuser on the standby server.
Method 3: Trigger File (Legacy, pre-12)
Create a trigger file specified in recovery.conf:
touch /tmp/postgresql.trigger.5432
Critical Details
- Promotion is irreversible - the standby cannot automatically revert to standby mode
- Promotion time: Typically completes in <1 second for warm standbys, but may take longer if significant WAL replay is pending
- Return value: pg_ctl promote returns exit code 0 on success, non-zero on failure
- Connection handling: Existing read-only connections remain active and become read-write after promotion
- Timeline change: PostgreSQL creates a new timeline (increments timeline ID) and creates a .history file in pg_wal/
Post-Promotion Requirements
- Update DNS/connection strings to point applications to the new primary
- Reconfigure old primary as standby if failback is needed (requires new primary_conninfo)
Version Notes
- PostgreSQL 12+: recovery.conf merged into postgresql.conf; standby.signal file presence indicates standby mode (removed automatically on promotion)
- PostgreSQL 9.1-11: Uses recovery.conf trigger file method
Source: PostgreSQL 17 Official Documentation - High Availability, Log Shipping, and Replication
PostgreSQL Replication Slots: Physical vs Logical
Direct Answer
Physical replication slots stream WAL (Write-Ahead Log) records for block-level replication, used by streaming replication and pg_basebackup. They ensure the primary retains WAL files until all standby servers consume them.
Logical replication slots decode WAL into logical change events (INSERT/UPDATE/DELETE operations) at the row level, enabling selective replication of specific tables/databases and cross-version replication.
Key Differences
| Aspect | Physical | Logical |
|---|---|---|
| Output format | Binary WAL records | Decoded logical changes |
| Granularity | Entire cluster | Per-database/table |
| Use case | Hot standby, failover | Selective sync, upgrades |
| Cross-version | Same major version only | Supports different versions |
| Decoder | None | Required (pgoutput, wal2json, etc.) |
| Min. wal_level | replica | logical |
Creating Each Type
Physical replication slot:
SELECT pg_create_physical_replication_slot('standby_slot');
Logical replication slot:
SELECT pg_create_logical_replication_slot('logical_slot', 'pgoutput');
-- 'pgoutput' is the standard decoder for logical replication (PG 10+)
Configuration Requirement
Logical slots require:
# postgresql.conf
wal_level = logical # (default is 'replica' which only supports physical)
Physical slots work with wal_level = replica or higher.
Version Support
- Physical replication slots: PostgreSQL 9.4+
- Logical replication slots: PostgreSQL 10+ (logical decoding API available in 9.4, but native logical replication in 10+)
Source: PostgreSQL Official Documentation - Replication Slots and Logical Replication
Hot Standby in PostgreSQL
Hot standby is a PostgreSQL feature that allows a standby server (replica) to accept read-only queries while continuously applying WAL (Write-Ahead Log) records from the primary server.
Key Characteristics
- Read-only access: The standby accepts SELECT queries but rejects writes (INSERT, UPDATE, DELETE, DDL)
- Continuous replication: Standby applies WAL changes while serving queries simultaneously
- Near real-time data: Query results reflect data as of the last applied WAL record (typically seconds behind primary)
Configuration
Enable on the standby server in postgresql.conf:
hot_standby = on
This parameter is on by default since PostgreSQL 10.
Requirements
- Primary server must have WAL archiving or streaming replication configured
- Standby must be in recovery mode (created via pg_basebackup or restored from backup)
- standby.signal file must exist in the data directory (PostgreSQL 12+)
Query Conflicts
Hot standby queries can be canceled if they conflict with WAL replay operations. Common conflicts:
- Primary drops a table being queried on standby
- Primary vacuums rows still visible to standby query
Control behavior with these standby parameters:
max_standby_streaming_delay = 30s # How long queries can delay WAL apply
hot_standby_feedback = on # Prevents primary from vacuuming rows needed by standby queries
Source
PostgreSQL Official Documentation: High Availability, Load Balancing, and Replication - Hot Standby
Data Types - Character
22 questions
If character varying lacks a length specifier, the type accepts strings of any length.
If character or char lacks a length specifier, it is equivalent to character(1).
The varchar and char aliases are defined in the SQL standard. varchar is an alias for character varying(n), and char is an alias for character(n).
In collations where whitespace is significant, the behavior of ignoring trailing spaces in character(n) comparisons can produce unexpected results. For example, SELECT 'a '::CHAR(2) collate "C" < E'a\n'::CHAR(2) returns true, even though C locale would consider a space to be greater than a newline.
The n parameter refers to characters, not bytes. With multibyte character encodings, the number of characters and bytes can be quite different.
No. Regardless of the specific character set, the character with code zero (sometimes called NUL) cannot be stored in PostgreSQL character types.
The length parameter n must be greater than zero and cannot exceed 10,485,760 characters.
Values of type character are physically padded with spaces to the specified width n, and are stored and displayed that way. However, trailing spaces are treated as semantically insignificant and disregarded when comparing two values of type character.
When you explicitly cast a value to character varying(n) or character(n), an over-length value will be truncated to n characters without raising an error. This is required by the SQL standard.
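For example:
SELECT 'too long'::varchar(4);
-- Result: 'too ' (the explicit cast silently truncates to 4 characters)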
Yes, trailing spaces are semantically significant in character varying and text values, and when using pattern matching operations like LIKE and regular expressions.
If bpchar lacks a length specifier, it accepts strings of any length, but trailing spaces are trimmed and treated as semantically insignificant.
No, the text type is not in the SQL standard. It is a PostgreSQL extension, though several other SQL database management systems have it as well.
Yes, long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables (via TOAST) so that they do not interfere with rapid access to shorter column values.
bpchar is PostgreSQL's internal name for the character(n) type, standing for 'blank-padded char'. It is a PostgreSQL extension and can be used as an alias for character(n).
The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character.
The name type is used to store identifiers in PostgreSQL's system catalogs. Its length is currently defined as 64 bytes (63 usable characters plus terminator).
The type "char" (note the quotes) is different from char(1) in that it only uses one byte of storage, and can store only a single ASCII character. It is used in the system catalogs as a simplistic enumeration type.
PostgreSQL provides three main character types: character varying(n) or varchar(n) for variable-length with limit, character(n) or char(n) for fixed-length blank-padded strings, and text for variable unlimited length.
In most situations, text or character varying should be used instead of character(n), as character(n) is usually the slowest of the three because of its additional storage costs.
The longest possible character string that can be stored is about 1 GB.
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type (char), and a few extra CPU cycles to check the length when storing into a length-constrained column. In fact, character(n) is usually the slowest of the three because of its additional storage costs.
By default, NAMEDATALEN is 64, so the maximum identifier length is 63 bytes (63 usable characters).
Transactions > Two-Phase Commit
22 questions
Currently prepared transactions can be monitored using the pg_prepared_xacts view.
This parameter can only be set at server start, meaning it cannot be changed without restarting the PostgreSQL server.
You must be either the same user that executed the transaction originally, or a superuser.
No, ROLLBACK PREPARED can be executed from any session, not only the one that executed the original transaction.
No, it is not allowed to PREPARE a transaction that has operated on temporary tables or the session's temporary namespace.
No, COMMIT PREPARED can be executed from any session, not only the one that executed the original transaction.
Runtime parameters modified with SET (without the LOCAL option) persist after PREPARE TRANSACTION and will not be affected by any later COMMIT PREPARED or ROLLBACK PREPARED. Thus, PREPARE TRANSACTION acts more like COMMIT than ROLLBACK in this respect.
Short-lived prepared transactions are stored exclusively in shared memory and Write-Ahead Log (WAL).
The transaction_id must be unique among currently prepared transactions.
When pg_prepared_xacts is accessed, the internal transaction manager data structures are momentarily locked, and a copy is made for the view to display.
The prepared state is intended to be of very short duration. Prepared transactions should normally be committed or rolled back as soon as an external transaction manager has verified that other databases are prepared to commit.
When the user executes PREPARE TRANSACTION, the only possible next commands are COMMIT PREPARED or ROLLBACK PREPARED.
On a standby server, you must set max_prepared_transactions to the same or higher value than on the primary server. Otherwise, queries will not be allowed in the standby server.
The transaction_id must be a string literal less than 200 bytes long.
You must be either the same user that executed the transaction originally, or a superuser.
Yes, PREPARE TRANSACTION must be used inside a transaction block (initiated with BEGIN).
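A minimal sketch of the full two-phase commit flow; the table, values, and transaction identifier are illustrative, and max_prepared_transactions must be greater than zero:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
PREPARE TRANSACTION 'xfer_42';
-- The transaction is now dissociated from the session and survives restarts.
-- Later, from any session:
COMMIT PREPARED 'xfer_42';
-- or: ROLLBACK PREPARED 'xfer_42';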
No, COMMIT PREPARED cannot be executed inside a transaction block. The prepared transaction is committed immediately.
Yes, PREPARE TRANSACTION persists the transaction so that it will survive a crash or server restart. Prepared transactions can be committed or rolled back even after the server crashes.
Yes, the transaction continues to hold whatever locks it held after being prepared.
The default value is zero, which disables the prepared-transaction feature entirely.
No, it is not allowed to PREPARE a transaction that has created cursors WITH HOLD.
During crash recovery, prepared transactions are restored from WAL records. On startup, all prepared transactions are restored from the pg_twophase directory.
Data Definition (DDL) > Tables and Columns
22 questions
Only one. A table can have at most one primary key, but there can be any number of unique constraints.
No. Changing a column's DEFAULT value using ALTER TABLE ... ALTER COLUMN ... SET DEFAULT only affects future INSERT operations. Existing rows retain their current values.
Normally 2 KB (2048 bytes). TOAST management code is triggered when a row value to be stored is wider than TOAST_TUPLE_THRESHOLD bytes, causing PostgreSQL to compress and/or move field values out-of-line.
Adding a constraint with NOT VALID skips validation of existing data, making it a fast O(1) operation. The constraint is enforced only for new or updated rows. You can later validate existing data with VALIDATE CONSTRAINT.
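For example, with hypothetical table and constraint names:
ALTER TABLE orders ADD CONSTRAINT orders_amount_positive CHECK (amount > 0) NOT VALID;
-- Existing rows are not checked yet; new and updated rows are enforced immediately.
ALTER TABLE orders VALIDATE CONSTRAINT orders_amount_positive;
-- Validation scans the existing rows in a separate, later step.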
GENERATED AS IDENTITY is recommended. It was introduced in PostgreSQL 10, conforms to the SQL standard, and provides better schema visibility, permission management, and safety compared to SERIAL. SERIAL is a legacy convenience feature.
At the end of the table. PostgreSQL appends the new column at the end of the table and has no option to specify the position of the new column.
Changing a column type normally causes the entire table and its indexes to be rewritten. However, certain compatible changes don't require a rewrite, such as converting between text and varchar (or vice versa) when there's no collation change, because these types sort identically.
No. Starting in PostgreSQL 11, adding a column with a constant (non-volatile) default value does not require rewriting the table. The default value is stored in the catalog and returned when rows are accessed. However, volatile defaults like clock_timestamp() still require a full table rewrite.
A B-tree index. Both PRIMARY KEY and UNIQUE constraints automatically create a unique B-tree index on the specified column(s).
You will get an error. GENERATED ALWAYS AS IDENTITY prevents manual value insertion, avoiding conflicts with generated values. Use GENERATED BY DEFAULT AS IDENTITY if you need to allow manual overrides.
No. A primary key automatically enforces NOT NULL, meaning every row must have a valid, non-null primary key value. This differs from UNIQUE constraints, which allow NULL values.
TRUNCATE is faster for clearing large tables. TRUNCATE is a constant-time operation that removes all rows without scanning them and reclaims disk space immediately without requiring a subsequent VACUUM. DELETE is faster for small tables (a few rows) because TRUNCATE has more overhead.
RESTRICT is the default. RESTRICT prevents the table from being dropped if any objects depend on it. To drop a table with dependent objects, CASCADE must be explicitly specified.
No. If a column in the parent table is an identity column, that property is not inherited by child tables in table inheritance hierarchies.
The default fillfactor is 100, which means complete packing of table pages. Valid values range from 10 to 100.
Volatile defaults (e.g., clock_timestamp(), random()), stored generated columns, identity columns, or columns with domain data types that have constraints will cause the entire table and its indexes to be rewritten.
1 GB (1,073,741,823 bytes or 2^30 - 1 bytes). This is enforced by TOAST which uses two bits of the varlena length word to manage oversized attributes.
32 columns. This limit applies to declarative partitioning regardless of the partitioning method used (RANGE, LIST, or HASH).
MATCH SIMPLE allows any foreign key column to be null independently, while MATCH FULL requires either all foreign key columns to be null or all to be non-null and match a reference.
No. NOT NULL cannot be used as a table constraint. However, you can approximate this behavior using a CHECK constraint with IS NOT NULL at the table level.
1,600 columns. However, this is further limited by the requirement that tuples must fit in a single 8192-byte heap page, so the practical limit depends on column types.
Extensions > uuid-ossp
22 questions
6ba7b814-9dad-11d1-80b4-00c04fd430c8. This is the X.500 distinguished name namespace UUID defined in RFC 4122.
6ba7b810-9dad-11d1-80b4-00c04fd430c8. This is the DNS namespace UUID defined in RFC 4122.
UUIDs of this kind reveal the identity of the computer that created the identifier (via MAC address) and the time at which it did so, which makes them potentially unsuitable for security-sensitive applications.
PostgreSQL 13. As of PostgreSQL 13, gen_random_uuid() was added to core PostgreSQL, and the function provided by pgcrypto became a wrapper around the core function.
uuid_generate_v1mc() is a version 1 multicast variant that substitutes a random multicast MAC address instead of the actual computer MAC address.
The uuid-ossp module is classified as "trusted" and can be installed by non-superusers who have CREATE privilege on the current database. Prior to PostgreSQL 13, superuser privileges were typically required.
--with-uuid=bsd (to use BSD functions), --with-uuid=e2fs (to use e2fsprogs' libuuid), or --with-uuid=ossp (to use the OSSP UUID library).
Yes, both are deterministic. They always produce the same UUID for the same namespace and name combination. The generation has no random or environment-dependent element and is therefore reproducible.
Version 5 should be preferred over version 3 because SHA-1 is thought to be more secure than MD5.
uuid_nil() → uuid. It takes no parameters and returns a UUID type.
MD5. The name parameter will be MD5-hashed, so the cleartext cannot be derived from the generated UUID.
uuid_generate_v1() → uuid. It takes no parameters and returns a UUID type.
6ba7b811-9dad-11d1-80b4-00c04fd430c8. This is the URL namespace UUID defined in RFC 4122.
On FreeBSD and some other BSD-derived platforms, suitable UUID creation functions are included in the core libc library.
uuid_generate_v4() → uuid. It takes no parameters and returns a UUID type.
The OSSP UUID library is not well maintained, and is becoming increasingly difficult to port to newer platforms.
00000000-0000-0000-0000-000000000000. This is a 'nil' UUID constant which does not occur as a real UUID.
CREATE EXTENSION "uuid-ossp"; or CREATE EXTENSION IF NOT EXISTS "uuid-ossp"; The quote marks are often required due to the hyphen in the extension name.
6ba7b812-9dad-11d1-80b4-00c04fd430c8. This is the ISO object identifier namespace UUID defined in RFC 4122.
SHA-1. It functions like version 3 but uses SHA-1 hashing instead of MD5.
Partitioning > Partition Pruning
22 questions
No. In PostgreSQL 11-13, partition pruning for UPDATE and DELETE only works during query planning, not at execution time. It only works when partition keys can be evaluated at plan time with constant values.
During the early executor stage (execution-time pruning), specifically during executor initialization, not at planning time.
The default value is 'auto', with allowed values being 'auto', 'force_custom_plan', and 'force_generic_plan'.
Yes. PostgreSQL uses a hashing function to find matching partitions for HASH partitioned tables. HASH partitioning was introduced in PostgreSQL 11.
No. Constraint exclusion only works when the query's WHERE clause contains constants (or externally supplied parameters), not runtime values.
Whenever one of the execution parameters being used by partition pruning changes. This allows dynamic pruning as parameter values change during execution.
No. It is a user context parameter that can be changed per session without requiring a restart.
Place the SELECT with the volatile function in a subquery, which turns it into an InitPlan. This allows PostgreSQL to perform partition pruning during the execution phase.
Yes. For range multi-column partitions, if the query uses the first few columns of the partition key, then partition pruning is still feasible.
1) During initialization of the query plan (for parameter values known during initialization). 2) During actual execution of the query plan (for values only known during actual execution, like from parameterized nested loop joins).
No. The faster partition pruning using binary search currently only works for SELECT queries. UPDATE and DELETE queries still use the old linear algorithm.
PostgreSQL 10 used the constraint_exclusion mechanism with a linear O(N) algorithm that required looking at each partition's metadata one-by-one.
1) Planning time: when the optimizer examines partition definitions and removes partitions that cannot contain matching rows. 2) Execution time: when clause values are unknown during planning, such as parameters in PREPARE statements, values from subqueries, or parameterized values on the inner side of nested loop joins.
Because all partitions must be scanned upfront to build the hash table, so there's no opportunity to use runtime parameters to prune partitions during execution. Nested loop joins allow pruning because outer row values are available as parameters for each iteration.
PostgreSQL 11. PostgreSQL 10 only supported planning-time partition pruning.
The default is 'partition', with allowed values being 'partition', 'on', or 'off'.
No. A comparison against a non-immutable function such as CURRENT_TIMESTAMP cannot be optimized at planning time, since the planner cannot know which partition the function value might fall into at run time.
No. Execution-time partition pruning can only happen with nested loop joins. If the planner chooses a hash join, the query cannot utilize partition pruning at execution time.
Because it can result in a large increase in overall memory consumption during query execution, and query planning becomes significantly more expensive in terms of memory and CPU.
Joins and Subqueries
22 questions
Two rows are equal if all their corresponding members are non-null and equal. If any corresponding members are non-null and unequal, the rows are unequal. Otherwise, the comparison result is unknown (null).
Beyond ten or so input tables, PostgreSQL switches from exhaustive search to genetic probabilistic search. The exact threshold is controlled by geqo_threshold (default 12).
No, for functions the LATERAL keyword is optional. The function's arguments can contain references to columns provided by preceding FROM items even without LATERAL.
If there are no common column names, NATURAL JOIN behaves like JOIN ... ON TRUE, producing a cross-product join (equivalent to CROSS JOIN).
Yes, SOME is a synonym for ANY. They are identical in behavior and can be used interchangeably.
The result of the IN construct will be null, not false. This follows standard SQL three-valued logic.
The default value is 12. This parameter determines when to use genetic query optimization to plan queries with at least this many FROM items involved.
JOIN USING suppresses redundant columns, producing one output column for each listed column pair (in the listed order), followed by remaining columns from T1, then T2. JOIN ON produces all columns from T1 followed by all columns from T2.
INNER is the default. When you write JOIN without specifying the type, PostgreSQL treats it as an INNER JOIN.
If the scalar subquery returns no rows, there is no error; the scalar result is taken to be null.
Row constructor comparison is only possible when the operator is =, <>, <, <=, >, or >=. These operators must belong to a B-tree operator class or be the negator of a B-tree equality operator.
The result is null when no comparison returns false AND at least one comparison returns null.
No, the AS keyword is optional. Both 'FROM table_name AS alias' and 'FROM table_name alias' are valid and functionally equivalent.
No, when an alias is provided, it completely hides the actual name of the table or function. It is not allowed to refer to the table by the original name elsewhere in the query.
Yes, the words INNER and OUTER are optional in all forms. LEFT, RIGHT, and FULL already imply outer joins, so the OUTER keyword is redundant.
Yes, subqueries specifying a derived table must be enclosed in parentheses.
The result of the NOT IN construct will be null, not true. This follows standard SQL three-valued logic.
Yes, the planner normally attempts to collapse subqueries into parent queries to generate better plans. However, it avoids collapsing if doing so would create more than from_collapse_limit FROM items.
No, explicit inner join syntax is semantically the same as listing the input relations in FROM, so it does not constrain the join order. However, you can force the planner to honor JOIN ordering by setting join_collapse_limit to 1.
USING (a, b) creates the join condition ON T1.a = T2.a AND T1.b = T2.b. It forms a join condition that includes an equality comparison for each shared column name.
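A small sketch with hypothetical tables t1(a, b, c) and t2(a, b, d):
SELECT * FROM t1 JOIN t2 USING (a, b);
-- output columns: a, b, c, d (each shared column appears once)
SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;
-- output columns: t1.a, t1.b, c, t2.a, t2.b, d (shared columns appear twice)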
PostGIS > Geography Type
22 questions
Longitude ranges from -180.0 to 180.0 degrees, and latitude ranges from -90.0 to 90.0 degrees.
All measurement functions return values in meters for geography types. For all spatial reference systems, the units returned by measurement functions (e.g. ST_Distance, ST_Length, ST_Perimeter, ST_Area) and for the distance argument of ST_DWithin are in meters.
The default is use_spheroid=true. For geography types, distance measurement defaults to use_spheroid = true, which means calculations are performed on a spheroid rather than a sphere for higher accuracy.
Yes. ST_DWithin uses a spatial index if available. For geography types, the distance parameter is in meters and defaults to use_spheroid=true.
ST_Buffer determines a planar spatial reference system that best fits the bounding box of the geography object, trying UTM, Lambert Azimuthal Equal Area (LAEA) North/South pole, and finally Mercator. The buffer is computed in the planar space, then transformed back to WGS84.
The default SRID is 4326 (WGS84). If the SRID option is omitted when creating a geography column, it will default to a value of 4326, which is the SRID for WGS 84, the World Geodetic System of 1984, and the standard for the Global Positioning System.
The geography type supports: POINT, LINESTRING, POLYGON, MULTIPOINT, MULTILINESTRING, MULTIPOLYGON, and GEOMETRYCOLLECTION. All types can have Z, M, or ZM suffixes for 3D and measured coordinates.
SRID=n; prefix. For example: 'SRID=4326;POINT(-44.3 60.1)'. The EWKT format includes the spatial reference system identifier before the geometry definition.
Yes. Standard geometry type data will autocast to geography if it is of SRID 4326. However, explicit casting using ::geography is recommended for clarity.
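For example, a geodesic distance in meters between two points (the coordinates are illustrative):
SELECT ST_Distance(
    'SRID=4326;POINT(-118.4079 33.9434)'::geography,  -- Los Angeles
    'SRID=4326;POINT(2.5559 49.0083)'::geography      -- Paris
);
-- result is in meters; use_spheroid defaults to true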
Objects larger than 4kb (approximately 200 vertices for 2D geometries) are chopped into smaller pieces by the PostgreSQL TOAST system and stored in a side table.
Longitude first, then latitude. In PostGIS, spatial coordinates are in longitude and then latitude (x, y). Longitude represents the x-axis and latitude represents the y-axis of the location.
Meters. For geography, the maximum length is in meters. Distances are computed on the sphere, and added vertices are created along the spherical great-circle arcs defined by segment endpoints.
No. The buffer and intersection functions are actually wrappers on top of a cast to geometry, and are not carried out natively in spherical coordinates. They transform to a planar coordinate system, perform the operation, then transform back to WGS84.
Yes. Both ST_AsGeoJSON and ST_AsKML support the geography type. Geography support for ST_AsGeoJSON was introduced in version 1.5.0.
Use geography when your data is geographically dispersed (covering much of the world) or is of global extent. Geography is convenient for people who don't want to learn about projected coordinate systems and need accurate distance measurements over long distances.
PostGIS stores coordinates at double precision: 16 bytes per vertex for 2D geometries, 24 bytes for 3D geometries.
The && (overlap), ~ (contains), and @ (contained by) operators are supported for 2D geography queries using GiST indexes.
Yes. The geography type indexes and calculates over the dateline and poles automatically, using great circle arcs on the sphere. This is a significant advantage over older approaches like ST_Distance_Spheroid.
The geography type does not support curves, TINS, or POLYHEDRALSURFACEs. This includes CircularString and other curved geometry types from the SQL/MM standard.
Version 1.5. Support for geography was introduced in version 1.5 for both ST_Covers and ST_Intersects. For ST_Covers, support for polygon in polygon and line in polygon was added for geography type in version 2.4.0.
Prior to PostGIS 2.2, the geography type supported only WGS 84 geodetic (SRID:4326). For PostGIS 2.2 and above, any long/lat based spatial reference system defined in the spatial_ref_sys table can be used.
PostGIS geometries are limited by the 1 GB PostgreSQL field size limit. In practice, a 32-bit signed integer is used for the number of points, limiting to about 2.1 billion vertices theoretically, but practical limits reduce this to about 33.5 million vertices.
Internals > Visibility Map
22 questions
SKIP_LOCKED causes VACUUM to skip work on a relation if there is a conflicting lock when trying to open it. This option was added in PostgreSQL 12.
The all-visible bit in the visibility map, the all-frozen bit in the visibility map, and the PD_ALL_VISIBLE bit in the page header.
With 2 bits per heap page, 1 visibility map byte covers 4 heap pages (typically 32KB of heap data). The overall ratio is approximately 1:32,768, making the VM roughly 0.003% the size of the heap.
Visibility map bits are cleared by any data-modifying operations on a page.
When vacuum_cleanup_index_scale_factor is set to 0, index scans are never skipped during VACUUM cleanup.
Yes, DISABLE_PAGE_SKIPPING makes the vacuum aggressive, but aggressive vacuums can still skip a page that is frozen, and rely on the visibility map for that information.
The first bit, if set, indicates that the page is all-visible, meaning the page does not contain any tuples that need to be vacuumed. This information can also be used by index-only scans to answer queries using only the index tuple.
The visibility map is stored in a file named after the filenode number of the relation, plus a _vm suffix. For example, if the filenode is 12345, the VM is stored in a file called 12345_vm, in the same directory as the main relation file.
The number of pages marked all-visible in the relation's visibility map as of the last VACUUM (or ANALYZE, or some other operations that update pg_class.relpages).
pg_check_visible returns the TIDs of non-all-visible tuples stored in pages marked all-visible in the visibility map. If this function returns a non-empty set of TIDs, the visibility map is corrupt.
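These checks come from the pg_visibility extension; for example (my_table is a placeholder):
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT * FROM pg_visibility_map('my_table');  -- all_visible / all_frozen bits per heap page
SELECT * FROM pg_check_visible('my_table');   -- returns zero rows if the visibility map is consistent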
The all-visible bit in the visibility map indicates that every tuple in the corresponding page of the relation is visible to every current and future transaction.
Yes, setting a bit in the visibility map is WAL-logged during VACUUM. When a bit is set, the LSN of the visibility map page is updated to ensure that the visibility map update doesn't get written to disk before the WAL record is flushed. However, when a bit is cleared, it doesn't require WAL logging because clearing a bit is always safe from a correctness standpoint.
The second visibility map bit was added in PostgreSQL 9.6. PostgreSQL 9.5 and earlier versions stored only one bit per heap page.
vm (for visibility map). The valid fork type values are: main for the main fork, fsm for the free space map, vm for the visibility map, and init for the init fork.
The all-frozen bit indicates that every tuple in the page is frozen; that is, no future vacuum will need to modify the page until such time as a tuple is inserted, updated, deleted, or locked on that page.
pg_truncate_visibility_map() requires superuser privileges exclusively. Other pg_visibility functions are executable by superusers and roles with privileges of the pg_stat_scan_tables role.
Partitioning > List Partitioning
22 questions
Large numbers of partitions are likely to increase query planning time considerably. Planning time grows linearly with the number of partitions. In test environments with a thousand partitions, planning time can reach 12+ ms, which may exceed execution time for fast queries.
The default value is 'on'. This parameter enables or disables the query planner's ability to eliminate partitions from query plans.
Partition-wise join currently applies only when the join conditions include all the partition keys, which must be of the same data type and have one-to-one matching sets of child partitions.
PostgreSQL automatically creates a matching index on each existing partition, and any partitions you create or attach later will also have such an index automatically created.
No. PostgreSQL will cause an error if you specify bounds such that a new partition's values would overlap with those in one or more existing partitions. Each value must belong to exactly one partition.
PostgreSQL 11 introduced the ability to create a DEFAULT partition, which can store rows that do not fall into any existing partition's list.
Tuples will be routed in O(log N) using specialized tuple-routing features with in-memory structures, which replaced the older INSERT trigger-based approach.
No. One limitation when creating new indexes on partitioned tables is that it is not possible to use the CONCURRENTLY qualifier directly on the partitioned table.
The row movement is internally achieved by deleting the original row and inserting the modified row into the right partition. Any ON DELETE ROW trigger defined for the source partition will be executed, and INSERT triggers on the destination partition will also fire. The OID of the row changes because it belongs to a different partition.
Prior to PostgreSQL 14, detaching a partition acquired Access Exclusive lock on both the partition being detached (blocking all reads/writes on the partition) and on the parent table, blocking all reads/writes to all partitions.
In PostgreSQL 11, the elimination of unneeded partitions (partition pruning) is no longer an exhaustive linear search. A binary search quickly identifies matching LIST and RANGE partitions.
Partition methods LIST-LIST, LIST-RANGE, LIST-HASH, RANGE-RANGE, RANGE-LIST, RANGE-HASH, HASH-HASH, HASH-LIST, and HASH-RANGE can be created in PostgreSQL declarative partitioning.
Yes. Partitions may themselves be defined as partitioned tables, using what is called sub-partitioning; PostgreSQL places no hard limit on the nesting depth.
CREATE TABLE table_name (columns...) PARTITION BY LIST (partition_key); followed by CREATE TABLE partition_name PARTITION OF table_name FOR VALUES IN (value1, value2, ...);
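A concrete sketch using illustrative names:
CREATE TABLE cities (
    name   text,
    region text
) PARTITION BY LIST (region);
CREATE TABLE cities_emea  PARTITION OF cities FOR VALUES IN ('EMEA');
CREATE TABLE cities_apac  PARTITION OF cities FOR VALUES IN ('APAC');
CREATE TABLE cities_other PARTITION OF cities DEFAULT;  -- catches everything else (PostgreSQL 11+)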
No. If a DEFAULT partition exists, a new partition cannot be added if there are any rows in the DEFAULT partition that would otherwise fit in the new partition being added, because it will require row movement from DEFAULT to the newly added partition. This operation would be too costly to do implicitly.
Create a CHECK constraint matching the expected partition constraint on the table prior to attaching it. The system will be able to skip the scan which is otherwise needed to validate the implicit partition constraint. Once ATTACH PARTITION is complete, drop the now-redundant CHECK constraint.
PostgreSQL 12 added the ability for foreign keys to reference partitioned tables. PostgreSQL 10 didn't support foreign keys at all with partitioned tables.
The query planner is generally able to handle partition hierarchies with up to a few thousand partitions fairly well, provided that typical queries allow the query planner to prune all but a small number of partitions.
PostgreSQL 10 introduced declarative partitioning with dedicated syntax to create both range and list partitioned tables. List partitioning was available from the initial release of declarative partitioning in PostgreSQL 10.
No. Hash-partitioned tables may not have a default partition. Only range and list partitioned tables can have DEFAULT partitions.
Legacy inheritance-based partitioning works well with up to perhaps a hundred child tables; don't try to use many thousands of children.
Event Triggers
21 questions
No, these events do not occur for DDL commands targeting shared objects — databases, roles, and tablespaces — or for commands targeting event triggers themselves.
The trigger fires after the actions have taken place (but before the transaction commits), and thus the system catalogs can be read as already changed.
If a ddl_command_end trigger fails with an error, the effects of the DDL statement will be rolled back, just as they would be in any other case where the containing transaction aborts.
The 'command' column has the special type pg_ddl_command, which contains the complete details of the DDL command that was executed.
No, event triggers can be written in any procedural language that includes event trigger support, or in C, but not in plain SQL.
The table_rewrite event occurs just before a table is rewritten by some actions of the commands ALTER TABLE and ALTER TYPE. While other control statements like CLUSTER and VACUUM can rewrite a table, the table_rewrite event is not triggered by them.
CREATE EVENT TRIGGER name ON event [ WHEN filter_variable IN (filter_value [, ... ]) [ AND ... ] ] EXECUTE { FUNCTION | PROCEDURE } function_name(). The filter_variable can only be TAG, and filter_value would be command tags like 'DROP FUNCTION'.
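For example, a minimal logging trigger (the function and trigger names are illustrative):
CREATE FUNCTION log_ddl() RETURNS event_trigger AS $$
BEGIN
    RAISE NOTICE 'DDL executed: %', tg_tag;
END;
$$ LANGUAGE plpgsql;
CREATE EVENT TRIGGER track_ddl
    ON ddl_command_end
    WHEN TAG IN ('CREATE TABLE', 'DROP TABLE')
    EXECUTE FUNCTION log_ddl();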
If a ddl_command_start trigger fails with an error, no further event triggers will fire, and no attempt will be made to execute the command itself.
No, event triggers are disabled in single-user mode as well as when event_triggers is set to false.
PostgreSQL currently supports 5 event types: login, ddl_command_start, ddl_command_end, table_rewrite, and sql_drop.
The evtenabled column can have four values: O (fires in 'origin' and 'local' modes), D (disabled), R (fires in 'replica' mode), and A (fires always).
No, the event_triggers parameter does not require a restart. It operates in superuser context and can be changed with a reload (SIGHUP).
TAG is currently the only supported filter_variable for event triggers.
pg_event_trigger_dropped_objects() returns a list of all objects dropped by the command. Note that the trigger is executed after the objects have been deleted from the system catalogs, so it's not possible to look them up anymore.
A function must be declared with no arguments and a return type of event_trigger. The function need not (and may not) actually return a value; the return type serves merely as a signal that the function is to be invoked as an event trigger.
Yes, pg_event_trigger_ddl_commands() returns one row for each base command executed; some commands that are a single SQL sentence may return more than one row.
The keywords FUNCTION and PROCEDURE are interchangeable in the CREATE EVENT TRIGGER syntax, but the referenced function must in any case be a function, not a procedure. The function must return type event_trigger.
Indexes > Hash Indexes
21 questions
Hash indexes are best optimized for SELECT and UPDATE-heavy workloads that use equality scans on larger tables.
Yes. Starting with PostgreSQL 10, changes to hash indexes write WAL, which means they are crash-safe and properly replicated to standbys. Prior to version 10, hash index operations were not replicated.
No. Currently, only B-tree, GiST, and SP-GiST index access methods support the INCLUDE clause. Hash indexes are not included.
No. Hash indexes do not allow uniqueness checking. Only B-tree currently supports unique indexes.
The default fillfactor for hash indexes is 75, compared to 90 for B-tree indexes.
Each hash index tuple stores just the 4-byte hash value (a 32-bit hash code), not the actual column value.
No. Only B-tree, GIN, and BRIN index methods support parallel index builds. Hash indexes do not.
Hash indexes support only the equality operator (=). They cannot handle range operations, pattern matching, or sorting operations.
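For example, with a hypothetical sessions table:
CREATE INDEX sessions_token_hash ON sessions USING HASH (token);
-- can serve:    SELECT * FROM sessions WHERE token = 'abc123';
-- cannot serve: range or pattern predicates such as WHERE token > 'abc' or WHERE token LIKE 'abc%'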
No. Overflow pages can be recycled for reuse in other buckets, but they are never returned to the operating system.
No. Hash indexes do not support index-only scans or covering indexes because they store only the hash value, not the actual column value.
Hash index expansion occurs in the foreground, which can increase execution time for user inserts. This makes them unsuitable for tables experiencing rapid row growth.
Hash indexes became WAL-logged and crash-safe in PostgreSQL 10. Prior to version 10, they lacked write-ahead logging and were not safe in the face of crashes or replication.
Yes. The split algorithm is crash-safe and can be restarted if not completed successfully.
For performance reasons, PostgreSQL retains a cached copy of the metapage in each backend's relcache entry to avoid locking and pinning the metapage for every operation. This produces the correct bucket mapping as long as the target bucket hasn't been split since the last cache refresh.
VACUUM attempts to squeeze the index tuples onto as few overflow pages as possible, minimizing the overflow chain.
Overflow pages are chained to bucket pages when a bucket becomes full, providing local storage expansion for additional index tuples that hash to the same bucket.
Hash indexes support the fillfactor storage parameter, which controls how full index pages are packed during index creation.
Hash indexes are most suitable for unique, nearly unique data, or data with a low number of rows per hash bucket. They excel with longer data items like UUIDs and URLs.
There is currently no provision to shrink a hash index, other than by rebuilding it with REINDEX. There is also no provision for reducing the number of buckets.
Yes. Hash indexes may take part in bitmap index scans and backward scans.
Hash indexes are particularly advantageous when the index size and/or data size exceeds shared_buffers or RAM, as they provide direct access to bucket pages rather than requiring tree traversal.
Monitoring > pg_stat_activity
21 questions
datname and datid are NULL for system background processes (like autovacuum launcher, logical replication launcher, background writer, checkpointer, or walwriter) that are not connected to any specific database.
The possible state values are: 'active' (backend is executing a query), 'idle' (backend is waiting for a new client command), 'idle in transaction' (backend is in a transaction but not executing a query), 'idle in transaction (aborted)' (transaction statement caused an error), 'fastpath function call' (backend is executing a fast-path function), 'disabled' (track_activities is disabled for this backend), and 'starting' (backend is in initial startup, client authentication is being performed).
Yes, the wait_event and state columns are independent. A backend can be in the 'active' state and may or may not be waiting on some event. If state is 'active' and wait_event is non-null, it means a query is being executed but is blocked somewhere in the system.
PostgreSQL 14. This column allows linking queries between pg_stat_activity and pg_stat_statements.
usesysid (type: oid) is the OID of the database user/role from the pg_authid system catalog. It uniquely identifies the role that was used to connect.
usename (type: name) is the name of the database user/role that was used to connect. It corresponds to the rolname from pg_authid and is the human-readable counterpart to usesysid.
on. This parameter enables the collection of information on the currently executing command of each session.
-1. For local/UNIX socket connections, client_addr and client_hostname are NULL, and client_port is set to -1.
PostgreSQL 10. This column identifies the type of backend process (e.g., 'client backend', 'autovacuum launcher', 'walwriter').
For parallel workers, leader_pid contains the PID of the group leader (the main backend process that started them). For the group leader itself, leader_pid is the same as its own PID. leader_pid is NULL if a backend has never been involved in parallel query.
integer. The pid column contains the process ID of the backend, which corresponds to the actual operating system process ID.
xact_start is NULL when the backend is not in a transaction and not currently working on anything (i.e., idle outside of a transaction).
client_addr is NULL for: (1) UNIX socket/local connections, (2) parallel worker processes (only the leader process has client_addr populated), and (3) background processes not connected to any specific database.
64 bytes (63 usable characters plus terminator). This affects the maximum length of names in pg_stat_activity columns like client_dn and principal, which are truncated if longer than NAMEDATALEN.
PostgreSQL 9.4. backend_xid shows the running transaction ID and backend_xmin shows the transaction ID which determines visibility scope for the currently running transaction.
The client_dn and principal fields are truncated if longer than NAMEDATALEN (64 characters in a standard build).
You must have superuser privileges to cancel or terminate a session belonging to a superuser. If the session belongs to a regular user, you need either superuser privileges or be the same user that owns the session.
pg_cancel_backend(pid) sends a SIGINT signal to the backend process, which attempts to cancel only the SQL query currently running in the specified session. The user stays connected and the session remains active. It returns true if successful and false otherwise.
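For example, cancelling queries that have been running longer than five minutes (the threshold is arbitrary):
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '5 minutes'
  AND pid <> pg_backend_pid();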
1024 bytes. This parameter specifies the amount of memory reserved to store the text of the currently executing command for each active session in the pg_stat_activity.query field.
The wait_event_type values are: Activity, BufferPin, Client, Extension, IPC, IO, Lock, LWLock, and Timeout.
xact_start shows when the currently active transaction started (NULL if not in a transaction). query_start shows when the most recent query began execution. They differ when a transaction contains multiple queries - xact_start remains constant while query_start updates for each new query.
Partitioning > Hash Partitioning
21 questions
No. The sequence of columns does not matter in hash partitioning, and it does not support pruning for a subset of partition key columns. All partition key columns must be specified for pruning to occur.
Yes. For declarative partitioning, partitions must have exactly the same set of columns as the partitioned table. This differs from table inheritance where child tables may have extra columns.
The default value is 'on'. This parameter was added in PostgreSQL 11 and controls whether the query planner can eliminate partitions from query plans.
It always returns an error because a partitioned table has no data itself. The hint states: 'Do not specify the ONLY keyword, or use TRUNCATE ONLY on the partitions directly.'
An exclusion constraint must include all the partition key columns, and must compare those columns for equality only (not operators like &&). The constraint may include additional non-partition-key columns with any operators.
The partition keys must not include any expressions or function calls, and the constraint's columns must include all of the partition key columns. This exists because individual partition indexes can only enforce uniqueness within their own partitions.
No. Mixing temporary and permanent relations in the same partition tree is not allowed. If the partitioned table is permanent, all partitions must be permanent, and likewise for temporary tables.
The hash of each column value that is part of the partition key is individually calculated and then combined to get a single 64-bit hash value using the hash_combine64 function. The modulus operation is then performed on this 64-bit hash value.
If the updated partition key no longer satisfies the partition bounds of its original partition, the row will be moved to a different partition. This is internally achieved by deleting the original row and inserting it into the correct partition.
The OID of the row changes because the original and new row belong to different partitions. This does not happen for UPDATEs that do not move the row.
ERROR: tuple to be locked was already moved to another partition due to concurrent update
The query planner is generally able to handle partition hierarchies with up to a few thousand partitions fairly well, provided that typical queries allow the planner to prune all but a small number of partitions.
It allows detaching a partition from its partitioned table without blocking concurrent queries, by running in two transactions and only requiring ShareUpdateExclusive lock. However, there is a known bug with hash partitions related to constraint generation.
No. It is not possible to use the CONCURRENTLY qualifier when creating indexes on partitioned tables, which could lead to long lock times.
The modulus must be a positive integer, and the remainder must be a non-negative integer less than the modulus.
Yes. A hash partition can itself be declared as a partitioned table with PARTITION BY RANGE, LIST, or HASH, so you can, for example, combine monthly range partitions with hash sub-partitions. PostgreSQL has no SUBPARTITION BY clause; sub-partitioning is expressed by giving a partition its own PARTITION BY clause.
Yes. Partitions can be foreign tables, although considerable care is needed because it is the user's responsibility that the contents of the foreign table satisfy the partitioning rule.
No. BEFORE ROW triggers on INSERT cannot change which partition is the final destination for a new row.
Every modulus among child partitions must be a factor of the next larger modulus. For example, you can use modulus 4, 8, 16, and 32 simultaneously because each is a factor of the next larger one, but you cannot use both modulus 10 and 15 together since 10 is not a factor of 15.
No. Attempting to create a default partition for a hash-partitioned table results in an error: 'a hash-partitioned table may not have a default partition'. This is because the modulus and remainder inherently cover all possible values.
Each hash partition requires a modulus and a remainder. Syntax: CREATE TABLE partition_name PARTITION OF parent_table FOR VALUES WITH (MODULUS n, REMAINDER r);
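For example, four hash partitions (table and partition names are illustrative):
CREATE TABLE accounts (
    id      bigint,
    payload text
) PARTITION BY HASH (id);
CREATE TABLE accounts_p0 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE accounts_p1 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE accounts_p2 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE accounts_p3 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 3);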
core_concepts
21 questions
PostgreSQL INTEGER vs BIGINT Storage Size
INTEGER: 4 bytes
BIGINT: 8 bytes
Value Ranges
INTEGER (also called INT or INT4):
- Storage: 4 bytes
- Range: -2,147,483,648 to +2,147,483,647
BIGINT (also called INT8):
- Storage: 8 bytes
- Range: -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807
Example Usage
CREATE TABLE example (
user_id INTEGER, -- 4 bytes per row
total_revenue BIGINT -- 8 bytes per row
);
When to Use Each
- Use INTEGER for most numeric columns (user IDs, counts, amounts under 2.1 billion)
- Use BIGINT when values may exceed 2.1 billion (timestamps, large financial amounts, global identifiers)
Storage Impact
BIGINT uses exactly 2x the storage of INTEGER. For a table with 1 million rows:
- INTEGER column: ~4 MB
- BIGINT column: ~8 MB
Source: PostgreSQL Official Documentation - Numeric Types
Applies to: PostgreSQL 9.x through 17.x (storage sizes unchanged across versions)
Auto-incrementing IDs in PostgreSQL 10+
Use identity columns with GENERATED BY DEFAULT AS IDENTITY or GENERATED ALWAYS AS IDENTITY.
Recommended Syntax
CREATE TABLE users (
id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
username TEXT NOT NULL
);
Two Variants
GENERATED BY DEFAULT AS IDENTITY - Allows manual insertion of ID values
GENERATED ALWAYS AS IDENTITY - Prevents manual insertion (must use OVERRIDING SYSTEM VALUE to bypass)
Why Not SERIAL?
SERIAL (and BIGSERIAL) still works but is legacy syntax. Identity columns are the SQL standard way and provide better control:
-- Legacy approach (still works)
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
username TEXT NOT NULL
);
-- Modern approach (recommended)
CREATE TABLE users (
id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
username TEXT NOT NULL
);
Data Types
- Use BIGINT for identity columns (range: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
- Use INTEGER only if you're certain you won't exceed 2,147,483,647 rows
Source
PostgreSQL 10+ official documentation on identity columns: https://www.postgresql.org/docs/10/sql-createtable.html
Identity columns were introduced in PostgreSQL 10 (released October 2017) as the SQL:2003 standard implementation.
Maximum Length of PostgreSQL TEXT Field
1 GB (1,073,741,823 bytes) - This is the maximum size for any TEXT field value in PostgreSQL.
Technical Details
- The TEXT data type in PostgreSQL can store strings up to 1 GB in length
- This limit applies to all variable-length text types: TEXT, VARCHAR, and CHAR
- The actual maximum is precisely 1,073,741,823 bytes (1 GB - 1 byte)
- This limit is enforced by PostgreSQL's TOAST (The Oversized-Attribute Storage Technique) mechanism
Code Example
-- TEXT field has no explicit length constraint
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT -- Can store up to 1 GB
);
-- These are functionally identical in PostgreSQL:
CREATE TABLE examples (
text_col TEXT, -- up to 1 GB
varchar_col VARCHAR, -- up to 1 GB (no length specified)
varchar_limited VARCHAR(100) -- limited to 100 characters
);
Important Notes
- Unlike VARCHAR(n), TEXT has no length modifier and defaults to the maximum
- The 1 GB limit is a hard limit in PostgreSQL's storage system
- Character vs byte count: For UTF-8 text, multibyte characters consume multiple bytes toward the 1 GB limit
Source
PostgreSQL Official Documentation (applies to all versions 8.0+): Section 8.3 - Character Types
Maximum Number of Array Dimensions in PostgreSQL
6 dimensions
PostgreSQL arrays are limited to a maximum of 6 dimensions. This is defined by the MAXDIM constant in the PostgreSQL source code at src/include/utils/array.h:
#define MAXDIM 6
This limit is enforced throughout the PostgreSQL codebase. When you attempt to create or manipulate arrays with more than 6 dimensions, PostgreSQL will raise an error: "number of array dimensions exceeds the maximum allowed (6)".
Example:
-- Valid: up to 6 dimensions
SELECT ARRAY[[[[[[1]]]]]]::int[];
-- Invalid: 7 dimensions would exceed MAXDIM
SELECT ARRAY[[[[[[[1]]]]]]]::int[]; -- ERROR
PostgreSQL search_path
The search_path is a session configuration parameter that defines the ordered list of schemas PostgreSQL searches when resolving unqualified object names (tables, functions, types, etc.).
Default Value
SHOW search_path;
-- Returns: "$user", public
The default searches for a schema matching the current username first, then the public schema.
How It Works
When you reference SELECT * FROM users, PostgreSQL searches schemas in search_path order until it finds a table named users. With default settings, it checks:
- Schema named after your username (if exists)
- public schema
Setting search_path
Session level:
SET search_path TO myschema, public;
Database level:
ALTER DATABASE mydb SET search_path TO myschema, public;
Role level:
ALTER ROLE myuser SET search_path TO myschema, public;
Connection string:
options=-c search_path=myschema,public
Security Critical Detail
Schema-qualify object references inside functions, or pin the function's search_path with a SET clause, to prevent search_path hijacking. pg_catalog is implicitly searched before the configured path unless it is listed explicitly, so built-in objects cannot normally be shadowed; the risk comes from untrusted schemas such as public appearing in the path. PostgreSQL 15 reduced this risk further by revoking the default CREATE privilege on the public schema from ordinary users.
-- Pin the search path for a function (works on all supported versions)
CREATE FUNCTION myfunc() RETURNS bigint AS $$
SELECT count(*) FROM mytable;
$$ LANGUAGE SQL
SET search_path = myschema, pg_temp;
-- Or set an explicit, trusted path for the session
SET search_path TO pg_catalog, myschema, public;
Verification
SELECT current_schemas(true); -- Shows actual search path including implicit schemas
Source: PostgreSQL 17 Official Documentation - Schema Search Path (https://www.postgresql.org/docs/current/ddl-schemas.html#DDL-SCHEMAS-PATH)
PostgreSQL BOOLEAN type accepts three states: TRUE, FALSE, and NULL (unknown).
Valid Input Values
For TRUE:
TRUE (SQL keyword), 'true', 'yes', 'on', '1', 't', 'y' (unique prefixes)
For FALSE:
FALSE (SQL keyword), 'false', 'no', 'off', '0', 'f', 'n' (unique prefixes)
For UNKNOWN:
NULL (SQL keyword)
Input Rules
- Case-insensitive (e.g., 'TRUE', 'True', 'true' all work)
- Leading/trailing whitespace is ignored
- Unique prefixes are accepted (e.g., 't' for true, 'f' for false)
Output Format
When queried, PostgreSQL always outputs boolean values as t or f (single character lowercase).
-- Example inputs (all valid)
INSERT INTO bool_demo (bool_col) VALUES
(TRUE), -- SQL keyword
('yes'), -- string representation
('1'), -- numeric representation
('t'); -- prefix
-- All output as: t
{"type":"result","subtype":"success","is_error":false,"duration_ms":13201,"duration_api_ms":22824,"num_turns":1,"result":"The maximum size of a single row in PostgreSQL is approximately 1.6 TB (1,600,000,000 bytes).\n\n## Key Details\n\nHard limit: Each row is stored in a single 8 KB page, but PostgreSQL uses TOAST (The Oversized-Attribute Storage Technique) to store large column values out-of-line in separate TOAST tables.\n\nExact specifications:\n- Maximum field size: 1 GB (when TOASTed)\n- Maximum row size with TOAST: 1.6 TB theoretical limit\n- Practical limit per column: 1 GB\n- Row header overhead: 23 bytes minimum (varies with null bitmap)\n- Maximum columns per table: 1600 (1664 with system columns)\n\nImportant constraints:\n- Without TOAST, a row must fit in a single page (8 KB minus headers)\n- Data types smaller than ~2 KB are stored inline by default\n- TOAST automatically applies to data types like TEXT, BYTEA, VARCHAR when values exceed ~2 KB\n\n## Example\n\nsql\n-- This works - large text column uses TOAST automatically\nCREATE TABLE large_rows (\n id SERIAL PRIMARY KEY,\n large_text TEXT, -- Can store up to 1 GB\n large_binary BYTEA -- Can store up to 1 GB\n);\n\n-- Insert a large row (PostgreSQL handles TOAST transparently)\nINSERT INTO large_rows (large_text) \nVALUES (repeat('x', 100000000)); -- 100 MB text\n\n\nSource: PostgreSQL 17 official documentation - Chapter 73.2 TOAST\n\nVersions: This limit applies to PostgreSQL 8.3+ (when TOAST was enhanced). Current as of PostgreSQL 17 (2024).","session_id":"5e4b0f55-9504-4161-9d89-681074557ff2","total_cost_usd":0.0222493,"usage":{"input_tokens":3,"cache_creation_input_tokens":414,"cache_read_input_tokens":16766,"output_tokens":441,"server_tool_use":{"web_search_requests":0,"web_fetch_requests":0},"service_tier":"standard","cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":414}},"modelUsage":{"claude-haiku-4-5-20251001":{"inputTokens":658,"outputTokens":192,"cacheReadInputTokens":0,"cacheCreationInputTokens":0,"webSearchRequests":0,"costUSD":0.001618,"contextWindow":200000},"claude-opus-4-5-20251101":{"inputTokens":845,"outputTokens":128,"cacheReadInputTokens":0,"cacheCreationInputTokens":0,"webSearchRequests":0,"costUSD":0.007425,"contextWindow":200000},"claude-sonnet-4-5-20250929":{"inputTokens":3,"outputTokens":441,"cacheReadInputTokens":16766,"cacheCreationInputTokens":414,"webSearchRequests":0,"costUSD":0.013206299999999999,"contextWindow":200000}},"permission_denials":[],"uuid":"931a5569-e5e1-4423-9d6f-b8d544313bd0"}
{"type":"result","subtype":"success","is_error":false,"duration_ms":11606,"duration_api_ms":18951,"num_turns":1,"result":"# Maximum Size of a PostgreSQL Database\n\nThe theoretical maximum database size in PostgreSQL is 4 petabytes (PB).\n\nHowever, this limit is controlled by several component limits:\n\n## Component Limits (PostgreSQL 9.0+)\n\n- Maximum database size: 4 petabytes (limited by tablespace size)\n- Maximum table size: 32 terabytes (TB)\n- Maximum row size: 1.6 TB (including TOAST storage)\n- Maximum field size: 1 GB (without TOAST)\n- Maximum rows per table: Unlimited (theoretically limited by table size)\n- Maximum columns per table: 250-1600 (depending on column types)\n- Maximum indexes per table: Unlimited\n\n## Practical Considerations\n\nThe 32 TB table size limit is the most common practical constraint. This is determined by:\n- Maximum blocks per table: 2^32 (4,294,967,296 blocks)\n- Default block size: 8 KB\n- Calculation: 2^32 blocks × 8 KB = 32 TB\n\nTo exceed 32 TB of data per table, use table partitioning to split data across multiple physical tables.\n\n## Configuration\n\nThe block size can be changed at compile time (using --with-blocksize configure option) to 1, 2, 4, 8, 16, or 32 KB, but 8 KB is standard and changing it requires recompiling PostgreSQL.\n\n## Source\n\nPostgreSQL Official Documentation - Appendix K: Limits (valid for PostgreSQL 9.0 through 17.x)","session_id":"15306456-758d-4464-9e3f-c2490893eab6","total_cost_usd":0.0315107,"usage":{"input_tokens":3,"cache_creation_input_tokens":3474,"cache_read_input_tokens":13704,"output_tokens":382,"server_tool_use":{"web_search_requests":0,"web_fetch_requests":0},"service_tier":"standard","cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":3474}},"modelUsage":{"claude-haiku-4-5-20251001":{"inputTokens":658,"outputTokens":180,"cacheReadInputTokens":0,"cacheCreationInputTokens":0,"webSearchRequests":0,"costUSD":0.001558,"contextWindow":200000},"claude-opus-4-5-20251101":{"inputTokens":845,"outputTokens":114,"cacheReadInputTokens":0,"cacheCreationInputTokens":0,"webSearchRequests":0,"costUSD":0.0070750000000000006,"contextWindow":200000},"claude-sonnet-4-5-20250929":{"inputTokens":3,"outputTokens":382,"cacheReadInputTokens":13704,"cacheCreationInputTokens":3474,"webSearchRequests":0,"costUSD":0.0228777,"contextWindow":200000}},"permission_denials":[],"uuid":"b29caba8-7e5d-4c03-ace8-a6f40eda9bdd"}
PostgreSQL Composite Types
A composite type in PostgreSQL is a user-defined data type that groups multiple fields (columns) together into a single type, similar to a row or record structure. Each field has a name and a data type.
Definition
Composite types are created using CREATE TYPE:
CREATE TYPE address AS (
street VARCHAR(100),
city VARCHAR(50),
postal_code VARCHAR(10),
country VARCHAR(50)
);
Usage
As table column:
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
home_address address,
work_address address
);
Inserting data:
INSERT INTO employees (name, home_address, work_address)
VALUES (
'John Doe',
ROW('123 Main St', 'Boston', '02101', 'USA'),
ROW('456 Corp Ave', 'Boston', '02102', 'USA')
);
Accessing fields:
-- Use dot notation (parentheses required to avoid ambiguity)
SELECT name, (home_address).city FROM employees;
Key Characteristics
- Table rows are composite types: Every table automatically has a composite type with the same name
- Nested composites: Composite types can contain other composite types
- Size limit: Subject to the same limits as any table row (roughly 1.6 TB with TOAST), and practically constrained by the 8 KB block size for efficient storage
- NULL handling: The entire composite value can be NULL, or individual fields can be NULL
Official Reference
PostgreSQL Documentation: Composite Types (current as of PostgreSQL 17)
The range of SMALLINT in PostgreSQL is -32768 to +32767.
Technical Details:
- Storage size: 2 bytes
- Signed: Yes (always)
- Min value: -32768 (-2^15)
- Max value: +32767 (2^15 - 1)
Example Usage:
CREATE TABLE example (
id SMALLINT
);
-- Valid insertions
INSERT INTO example VALUES (-32768); -- minimum
INSERT INTO example VALUES (32767); -- maximum
-- This will cause an error: smallint out of range
INSERT INTO example VALUES (32768);
Comparison with Other Integer Types:
SMALLINT: 2 bytes, -32768 to 32767
INTEGER: 4 bytes, -2147483648 to 2147483647
BIGINT: 8 bytes, -9223372036854775808 to 9223372036854775807
Source: PostgreSQL Official Documentation - Numeric Types
Version: Applies to all PostgreSQL versions (this range is part of the SQL standard and has not changed)
Performance Note: SMALLINT is optimal when you know values will stay within this range, as it uses half the storage of INTEGER.
PostgreSQL ENUM Type
An ENUM (enumerated type) is a user-defined data type in PostgreSQL that consists of a static, ordered set of string values. Once created, an ENUM type can be used as a column type like any built-in type.
Creating and Using ENUMs
-- Create an ENUM type
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
-- Use in a table
CREATE TABLE person (
name TEXT,
current_mood mood
);
-- Insert values (must match exactly)
INSERT INTO person VALUES ('Alice', 'happy');
INSERT INTO person VALUES ('Bob', 'sad');
Key Characteristics
- Storage: 4 bytes per value (same as integer)
- Ordering: Values are ordered by creation sequence, NOT alphabetically
- Case-sensitive: 'Happy' ≠ 'happy'
- Values cannot be reordered or removed after creation; renaming is possible with ALTER TYPE ... RENAME VALUE (PostgreSQL 10+)
- New values can be added with ALTER TYPE mood ADD VALUE 'excited' AFTER 'happy' (available since PostgreSQL 9.1; allowed inside a transaction block since PostgreSQL 12)
When to Use ENUMs
Use when:
- Small, fixed set of values (e.g., status: 'pending', 'approved', 'rejected')
- Values rarely change (adding is possible, but removing requires type recreation)
- Need type safety at database level
- Performance matters (4 bytes vs. variable TEXT storage)
- Want constraint enforcement without CHECK constraints
Avoid when:
- Values change frequently (use lookup table instead)
- Need internationalization (enum values are stored literals)
- Application manages validation logic (use VARCHAR with CHECK constraint)
- Multiple applications with different valid values access the DB
Source
PostgreSQL Official Documentation v16: Chapter 8.7 - Enumerated Types
PostgreSQL: NUMERIC vs DECIMAL
They are identical. DECIMAL and NUMERIC are exact synonyms in PostgreSQL - they refer to the same data type with identical storage and behavior.
Key Facts
- Both names create the exact same type internally
- Both accept the same syntax: NUMERIC(precision, scale) or DECIMAL(precision, scale)
- precision = total number of digits (max 1000)
- scale = number of digits after the decimal point
- No storage or performance difference whatsoever
Examples
-- These are functionally identical:
CREATE TABLE example (
price1 NUMERIC(10, 2),
price2 DECIMAL(10, 2)
);
-- Both store exact decimal values like 99999999.99
-- Both use variable-length storage (2 bytes per 4 decimal digits + overhead)
Verification
-- Check the actual type stored:
SELECT
column_name,
data_type
FROM information_schema.columns
WHERE table_name = 'example';
-- Result shows both as "numeric"
Which to Use?
Use NUMERIC - it's the PostgreSQL-preferred name in official documentation. However, DECIMAL exists for SQL standard compatibility, so either is acceptable.
Source: PostgreSQL 17 Official Documentation - Numeric Types
(https://www.postgresql.org/docs/current/datatype-numeric.html)
CHAR vs VARCHAR in PostgreSQL
Key Difference: CHAR(n) pads values with spaces to exactly n characters, while VARCHAR(n) stores the actual string without padding (up to n characters).
Storage & Behavior
CHAR(n) (or CHARACTER(n)):
- Fixed-length: Always stores exactly n characters
- Pads shorter values with trailing spaces to reach length n
- Trailing spaces are treated as semantically insignificant in comparisons and are stripped when the value is cast to another string type
- Storage: n bytes (for single-byte encodings) plus 1 byte of overhead for short values
VARCHAR(n) (or CHARACTER VARYING(n)):
- Variable-length: Stores the actual string (up to n characters)
- No padding applied
- Storage: actual string length plus 1 byte of overhead (strings up to 126 bytes) or 4 bytes otherwise
Code Example
CREATE TABLE comparison (
fixed CHAR(10),
variable VARCHAR(10)
);
INSERT INTO comparison VALUES ('hello', 'hello');
-- CHAR pads to 10 characters internally, VARCHAR stores 5
SELECT
fixed,
variable,
octet_length(fixed) AS char_bytes, -- Returns 5 (spaces trimmed on output)
octet_length(variable) AS varchar_bytes -- Returns 5
FROM comparison;
-- Internal storage differs
SELECT
fixed = 'hello ' AS char_match, -- TRUE (trailing spaces ignored)
variable = 'hello ' AS varchar_match -- FALSE
FROM comparison;
Performance
There is NO performance advantage to CHAR over VARCHAR in PostgreSQL (unlike some other databases). The PostgreSQL documentation explicitly states that CHAR(n) is usually slower due to padding overhead.
Recommendation
Use VARCHAR(n) or TEXT in PostgreSQL. The only reason to use CHAR(n) is for SQL standard compatibility or when you specifically need fixed-width, space-padded behavior.
Source: PostgreSQL 16 Official Documentation, Section 8.3 "Character Types"
https://www.postgresql.org/docs/current/datatype-character.html
The maximum size of BYTEA in PostgreSQL is 1 GB (1,073,741,824 bytes).
This limit applies to all TOAST-able data types in PostgreSQL, including BYTEA. The limit is enforced by PostgreSQL's TOAST (The Oversized-Attribute Storage Technique) mechanism, which has a maximum datum size of 1 GB.
Technical Details:
- Maximum theoretical size: 1 GB - 1 byte (1,073,741,823 bytes)
- This is a hard limit enforced by the MaxAllocSize constant in the PostgreSQL source code
- Applies to all PostgreSQL versions (confirmed in versions 9.x through 16+)
Storage Considerations:
- BYTEA values larger than ~2 KB are automatically compressed and/or moved to TOAST tables
- The 1 GB limit includes any overhead from compression or encoding
Example Usage:
-- This will work (within limit)
INSERT INTO files (data) VALUES (pg_read_binary_file('/path/to/file.bin'));
-- Check size of BYTEA column
SELECT pg_column_size(data) FROM files WHERE id = 1;
If you need larger binary storage:
- Use PostgreSQL Large Objects (up to 4 TB per object)
- Store files externally and keep references in the database
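A minimal sketch of the large-object route using the built-in server-side functions (the path is a placeholder, the OID is illustrative, and these calls require server file-access privileges):
-- Import a file as a large object; returns the new object's OID
SELECT lo_import('/path/to/file.bin');
-- Export a large object back to a file on the server
SELECT lo_export(16403, '/tmp/file_copy.bin');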
Source: PostgreSQL official documentation on TOAST and binary data types, consistent across versions 9.0-16.
The UUID type in PostgreSQL stores Universally Unique Identifiers (UUIDs) as defined by RFC 9562. It is a 128-bit value displayed as hexadecimal digits in the format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (8-4-4-4-12 digit groups).
Storage: 128 bits (16 bytes)
Example UUID: a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
Generating UUIDs:
PostgreSQL provides built-in functions for UUID generation (gen_random_uuid() since PostgreSQL 13; uuidv4() and uuidv7() since PostgreSQL 18):
-- Generate version 4 (random) UUID
SELECT gen_random_uuid();
SELECT uuidv4(); -- alias for gen_random_uuid()
-- Generate version 7 (time-ordered) UUID
SELECT uuidv7();
-- Generate version 7 with timestamp shift
SELECT uuidv7(interval '1 hour');
Usage in tables:
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT
);
INSERT INTO users (name) VALUES ('Alice'); -- id auto-generated
Key advantages:
- Version 4: Cryptographically random, globally unique
- Version 7: Time-ordered with millisecond precision, better for indexing performance
- No cross-database coordination required (unlike sequences)
Additional UUID algorithms: Install the uuid-ossp extension for UUIDv1, UUIDv3, and UUIDv5 generation.
PostgreSQL hstore Extension
The hstore extension is a key-value store data type for PostgreSQL that stores sets of key-value pairs within a single PostgreSQL value. Each key and value is a text string.
Enabling hstore
CREATE EXTENSION IF NOT EXISTS hstore;
Storage and Syntax
-- Creating a table with hstore column
CREATE TABLE products (
id SERIAL PRIMARY KEY,
attributes hstore
);
-- Inserting data
INSERT INTO products (attributes) VALUES
('color => "red", size => "M", weight => "500g"');
-- Alternative syntax with hstore() constructor
INSERT INTO products (attributes) VALUES
(hstore('color', 'blue') || hstore('size', 'L'));
Key Operations
-- Retrieve a value by key
SELECT attributes -> 'color' FROM products;
-- Check if key exists
SELECT attributes ? 'color' FROM products;
-- Check if multiple keys exist
SELECT attributes ?& ARRAY['color', 'size'] FROM products; -- ALL keys
SELECT attributes ?| ARRAY['color', 'price'] FROM products; -- ANY key
-- Get all keys or values
SELECT akeys(attributes) FROM products; -- returns text[]
SELECT avals(attributes) FROM products; -- returns text[]
-- Convert to JSON
SELECT hstore_to_json(attributes) FROM products;
Indexing
-- GIN index for existence checks and containment
CREATE INDEX idx_attributes ON products USING GIN(attributes);
-- GiST index (alternative)
CREATE INDEX idx_attributes_gist ON products USING GIST(attributes);
Limitations
- Keys and values: Both are text strings only (no native numeric/boolean types)
- NULL values: Distinguishes between NULL value and missing key
- Size: No hard limit, but large hstore values impact performance
- Nesting: No nested structures (flat key-value only)
Use Cases
Use hstore for semi-structured data with varying attributes where you need:
- Indexable key-value storage
- Better query performance than JSON for key existence checks
- PostgreSQL versions before JSONB matured (pre-9.4)
Note: For PostgreSQL 9.4+, consider JSONB for more complex semi-structured data needs (supports nested objects, arrays, and native data types).
Source: PostgreSQL Official Documentation (hstore module), compatible with PostgreSQL 9.0+, built-in contrib module.
PostgreSQL: Schema vs Database
A database is a top-level container that holds all data objects and is completely isolated from other databases (separate connections, no cross-database queries in standard PostgreSQL). A schema is a namespace within a database that organizes tables, views, functions, and other objects, allowing multiple schemas in one database with fully-qualified access via schema_name.table_name.
Key Differences
Database:
- Created with CREATE DATABASE dbname;
- Requires a separate connection (cannot query across databases in one session)
- Contains one or more schemas
- Has its own encoding, collation, and per-database privileges (roles themselves are shared across the whole cluster)
- Physical separation on disk
Schema:
- Created with CREATE SCHEMA schemaname;
- Multiple schemas accessible in a single connection
- Default schema is public (created automatically)
- Enables logical organization without connection overhead
- Cross-schema queries: SELECT * FROM schema1.table1 JOIN schema2.table2
Code Example
-- Database level (requires reconnecting)
CREATE DATABASE app_production;
\c app_production -- Connect to database
-- Schema level (same connection)
CREATE SCHEMA sales;
CREATE SCHEMA marketing;
CREATE TABLE sales.orders (id INT, amount DECIMAL);
CREATE TABLE marketing.campaigns (id INT, name TEXT);
-- Query across schemas (same database)
SELECT o.id, c.name
FROM sales.orders o
JOIN marketing.campaigns c ON o.id = c.id;
Schema Search Path
PostgreSQL uses search_path to resolve unqualified table names (default: "$user", public):
SHOW search_path;
SET search_path TO sales, marketing, public;
Source: PostgreSQL 17 Documentation - Chapter 5.9 Schemas
Use Case: Use schemas for logical separation (multi-tenant apps, dev/test environments within one DB). Use separate databases for complete isolation (different applications, security boundaries).
PostgreSQL ARRAY Type
The ARRAY type in PostgreSQL stores variable-length multidimensional arrays of a single data type. Every PostgreSQL data type has a corresponding array type (e.g., integer[], text[], timestamp[]).
Key Specifications
- Declaration syntax:
column_name data_type[]orcolumn_name data_type ARRAY - Dimensions: PostgreSQL supports arrays of any number of dimensions, but all arrays are dynamically sized with no fixed bounds enforced
- Index base: Arrays are 1-indexed (first element is at position 1, not 0)
- Maximum size: Limited by the maximum field size of 1 GB
- Type constraint: All elements must be of the same base type
Creating Arrays
-- Column declaration
CREATE TABLE products (
id serial PRIMARY KEY,
tags text[],
prices integer ARRAY,
matrix integer[][] -- multidimensional
);
-- Inserting array literals
INSERT INTO products (tags, prices) VALUES
(ARRAY['electronics', 'sale'], ARRAY[99, 149]),
('{"book", "fiction"}', '{10, 15, 20}'); -- alternative syntax
Accessing Elements
-- Access single element (1-indexed)
SELECT tags[1] FROM products; -- first element
-- Access slice
SELECT tags[1:2] FROM products; -- elements 1 through 2
-- Multidimensional access
SELECT matrix[1][2] FROM products;
Essential Functions
- array_length(array, dimension) - returns length of the specified dimension
- array_append(array, element) - adds an element to the end
- array_cat(array1, array2) - concatenates arrays
- unnest(array) - expands an array to rows
- ANY(array) / ALL(array) - comparison operators
-- Query with array operations
SELECT * FROM products WHERE 'sale' = ANY(tags);
SELECT array_length(prices, 1) FROM products;
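Building on the products table above, two common manipulations (a short sketch):
-- Append a tag to an existing array
UPDATE products SET tags = array_append(tags, 'clearance') WHERE id = 1;
-- Expand an array into one row per element
SELECT id, unnest(tags) AS tag FROM products;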
Source: PostgreSQL 17 Official Documentation, Section 8.15 (Arrays)
Critical notes:
- NULL handling: ARRAY[1, NULL, 3] is valid; the array itself can also be NULL
- Use text[] rather than text ARRAY for consistency with PostgreSQL conventions
- Arrays are stored in binary format internally but output as the {val1,val2,val3} text representation
Adding a Value to an Existing ENUM Type in PostgreSQL
Use the ALTER TYPE ... ADD VALUE statement:
ALTER TYPE enum_type_name ADD VALUE 'new_value';
Position Control
By default, the new value is added at the end. To specify position:
-- Add before an existing value
ALTER TYPE enum_type_name ADD VALUE 'new_value' BEFORE 'existing_value';
-- Add after an existing value
ALTER TYPE enum_type_name ADD VALUE 'new_value' AFTER 'existing_value';
Transaction Behavior (CRITICAL)
PostgreSQL versions ≥ 12.0: ADD VALUE can be used inside a transaction block, but the newly added value cannot be used until the transaction that added it has committed.
PostgreSQL versions < 12.0: ADD VALUE CANNOT be executed inside a transaction block at all. You must either:
- Run it outside an explicit transaction block (in its own implicit transaction), OR
- Use the IF NOT EXISTS clause to make the statement idempotent, and commit before using the new value
-- Safe for all versions (idempotent)
ALTER TYPE enum_type_name ADD VALUE IF NOT EXISTS 'new_value';
Example
-- Create enum
CREATE TYPE status AS ENUM ('pending', 'active', 'closed');
-- Add new value at the end
ALTER TYPE status ADD VALUE 'archived';
-- Add new value in specific position
ALTER TYPE status ADD VALUE 'cancelled' AFTER 'pending';
Limitations
- You CANNOT remove enum values (requires type recreation)
- You CANNOT rename enum values directly (use workaround with ADD + UPDATE + DROP type)
- Enum values are sorted by creation order, not alphabetically
Source: PostgreSQL Official Documentation, ALTER TYPE command (PostgreSQL 9.1+, transaction improvements in 12.0)
The NUMERIC type in PostgreSQL has a maximum precision of 131,072 digits before the decimal point and a maximum of 16,383 digits after the decimal point.
Declaring NUMERIC:
-- Syntax: NUMERIC(precision, scale)
-- precision: total count of significant digits
-- scale: count of decimal digits in fractional part
NUMERIC(10, 2) -- 10 total digits, 2 after decimal (e.g., 12345678.90)
NUMERIC(5) -- 5 total digits, 0 after decimal (scale defaults to 0)
NUMERIC -- No limit (stores exact value within implementation limits)
Implementation Limits:
- Maximum precision that can be declared in NUMERIC(precision, scale): 1,000
- Maximum scale that can be declared: 1,000 (since PostgreSQL 15 the declared scale may also be negative, down to -1,000)
- Limits for unconstrained NUMERIC: up to 131,072 digits before the decimal point and up to 16,383 digits after it
- Storage: Variable length, two bytes for each group of four decimal digits plus three to eight bytes of overhead
Key Behaviors:
- Values exceeding declared precision cause an error
- Values with more decimal places than scale are rounded (not truncated)
- NUMERIC without parameters stores exact values up to the implementation limit
- DECIMAL is an alias for NUMERIC (identical behavior)
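These behaviors are easy to verify directly (a quick sketch):
-- More fractional digits than the declared scale are rounded, not truncated
SELECT 2.6789::NUMERIC(4, 2); -- 2.68
-- Exceeding the declared precision raises an error
SELECT 123.45::NUMERIC(4, 2); -- ERROR: numeric field overflow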
Example:
CREATE TABLE prices (
exact_price NUMERIC, -- No limit, exact storage
currency NUMERIC(10, 2), -- Max 10 digits, 2 decimal places
very_precise NUMERIC(20, 10) -- Max 20 digits, 10 decimal places
);
INSERT INTO prices VALUES (123.456789, 12345678.90, 1234567890.1234567890);
Source: PostgreSQL 16 Official Documentation - Chapter 8.1 (Numeric Types)
NULL vs Empty String in PostgreSQL
NULL represents the absence of a value (unknown or undefined data). Empty string ('') is a known value that happens to contain zero characters.
Key Differences
1. Storage and Semantics
NULL= no data exists, unknown, or not applicable''= a string value with length 0
2. Comparison Behavior
-- NULL comparisons always return NULL (unknown)
SELECT NULL = NULL; -- Returns: NULL (not TRUE)
SELECT NULL IS NULL; -- Returns: TRUE
-- Empty string comparisons work normally
SELECT '' = ''; -- Returns: TRUE
SELECT '' IS NULL; -- Returns: FALSE
3. String Operations
-- NULL propagates through operations
SELECT 'Hello' || NULL; -- Returns: NULL
SELECT length(NULL); -- Returns: NULL
-- Empty string behaves as a value
SELECT 'Hello' || ''; -- Returns: 'Hello'
SELECT length(''); -- Returns: 0
4. Constraints
-- NOT NULL constraint
CREATE TABLE example (
col1 VARCHAR NOT NULL -- Rejects NULL, allows ''
);
INSERT INTO example VALUES (''); -- SUCCESS
INSERT INTO example VALUES (NULL); -- ERROR: violates NOT NULL
5. Indexing
- NULL and '' are treated differently by indexes
- B-tree indexes do store NULL entries (sorted last by default); a partial index such as WHERE col IS NOT NULL can exclude them explicitly
- '' is always indexed as a regular value
6. Aggregation
-- COUNT ignores NULLs but counts empty strings
SELECT COUNT(col1) FROM example; -- Excludes NULL, includes ''
Practical Rule
Use NULL for missing/unknown data. Use '' only when you need to represent a string that explicitly has no characters (e.g., an empty user input that was intentionally submitted blank).
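Two helpers are commonly used to move between the two representations (a sketch; the profiles table and middle_name column are hypothetical):
-- Normalize intentionally blank input to NULL on the way in
UPDATE profiles SET middle_name = NULLIF(trim(middle_name), '');
-- Render NULL as an empty string on the way out
SELECT COALESCE(middle_name, '') FROM profiles;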
Version: PostgreSQL 12+ (behavior consistent across all modern versions)
Client Authentication > Certificate Authentication
21 questions
The permissions on server.key must disallow any access to world or group, achieved by the command 'chmod 0600 server.key'. Alternatively, the file can be owned by root and have group read access (0640 permissions).
The first certificate in postgresql.crt must be the client's certificate because it must match the client's private key. Intermediate certificates can be optionally appended to the file.
Add clientcert=verify-full as an option to another authentication method such as scram-sha-256 in pg_hba.conf. For example: 'hostssl all all all scram-sha-256 clientcert=verify-full'. This requires the client to provide both a valid certificate AND password.
The authentication method is called 'cert'. It uses SSL client certificates to perform authentication.
The v3_ca extension avoids the necessity of storing intermediate certificates on clients when root and intermediate certificates were created with v3_ca extensions. It sets the certificate's basic constraint of CA to true, marking certificates as Certificate Authorities.
No. When using the cert authentication method, no password prompt will be sent to the client. Authentication relies entirely on certificate validation.
RFC 2253 format is used. The DN is matched against the RFC2253 formatted DN, which looks like 'CN=fred,OU=eng'.
The first certificate in server.crt must be the server's certificate because it must match the server's private key. The certificates of intermediate certificate authorities can also be appended to the file.
The ident_file parameter can override the default location. It can only be set at server start. The default value is 'data_directory/pg_ident.conf'.
PostgreSQL 14 introduced the clientname=DN option. If you're using PostgreSQL 13 or earlier versions, this DN authentication method will not work.
When clientcert is not specified, the server verifies the client certificate against its CA file only if a client certificate is presented and the CA is configured. This means client certificates are optional by default.
The private key file must not allow any access to world or group. This is achieved with the command 'chmod 0600 ~/.postgresql/postgresql.key'. Alternatively, the file can be owned by root and have group read access (0640 permissions).
Use: openssl x509 -in myclient.crt -noout -subject -nameopt RFC2253 | sed "s/^subject=//"
Yes. The key may be stored in cleartext or encrypted with a passphrase using any algorithm supported by OpenSSL, like AES-128. If encrypted, the passphrase may be provided in the sslpassword connection option.
The two valid values are 'verify-ca' and 'verify-full'. Both require a valid, trusted SSL certificate, but verify-full additionally enforces that the CN (Common Name) in the certificate matches the username or an applicable mapping.
The connection parameters 'sslcert' and 'sslkey' can override the default locations. Alternatively, the environment variables PGSSLCERT and PGSSLKEY can be used.
PostgreSQL supports both PEM and ASN.1 DER formats for certificates and private keys.
No. Passphrase-protected private keys cannot be used at all on Windows for server certificates.
The default value is 'CN' (Common Name). When clientname is not specified, usernames are matched against the certificate's Common Name rather than its Distinguished Name.
The two valid values are 'CN' (Common Name) and 'DN' (Distinguished Name). CN is the default and matches only the Common Name attribute, while DN matches against the entire Distinguished Name of the certificate.
The default value is '' (empty string). When empty, this means the built-in prompting mechanism is used. The parameter was added in PostgreSQL 11.
Window Functions > Aggregate Window Functions
21 questions
No. Window-specific functions do not allow DISTINCT or ORDER BY to be used within the function argument list.
In GROUPS mode, the offset must yield a non-null, non-negative integer, and the option means that the frame starts or ends the specified number of peer groups before or after the current row's peer group.
An aggregate used with ORDER BY and the default window frame definition produces a 'running sum' type of behavior, which may or may not be what's wanted. The frame includes all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause.
If an aggregate supports moving-aggregate mode, it will improve calculation efficiency when the aggregate is used as a window function for a window with moving frame start (that is, a frame start mode other than UNBOUNDED PRECEDING). The moving-aggregate mode uses an inverse transition function to remove values as they leave the window frame, avoiding full recomputation.
Yes. Window functions execute after non-window aggregate functions. This means it is valid to include an aggregate function call in the arguments of a window function, but not vice versa.
In ROWS mode, the offset must yield a non-null, non-negative integer, and the option means that the frame starts or ends the specified number of rows before or after the current row.
In both ROWS and GROUPS mode, 0 PRECEDING and 0 FOLLOWING are equivalent to CURRENT ROW.
No. Ordered-set aggregates cannot be used as window functions, and therefore there is no need for them to support moving-aggregate mode.
For numeric ordering columns, the offset is typically of the same type as the ordering column.
No. The FROM FIRST or FROM LAST option for nth_value is unsupported; only FROM FIRST behavior is available.
No. Expressions within window functions cannot contain nested window function calls.
When ORDER BY is omitted, the default frame consists of all rows in the partition, effectively ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
No. PostgreSQL does not implement RESPECT NULLS or IGNORE NULLS options for window functions like lead, lag, first_value, last_value, and nth_value. The behavior consistently matches the standard's default (RESPECT NULLS).
Any built-in or user-defined ordinary aggregate (i.e., not ordered-set or hypothetical-set aggregates) can be used as a window function. Ordered-set aggregates (like mode(), percentile_cont(), percentile_disc()) cannot be used as window functions.
The frame can be specified in RANGE, ROWS or GROUPS mode; in each case, it runs from the frame_start to the frame_end.
Aggregate functions act as window functions only when an OVER clause follows the call; otherwise they act as plain aggregates and return a single row for the entire set. A window function call always contains an OVER clause directly following the window function's name and argument(s).
EXCLUDE CURRENT ROW (removes only the current row), EXCLUDE GROUP (removes current row and its ordering peers), EXCLUDE TIES (removes peers but keeps the current row itself), and EXCLUDE NO OTHERS (explicitly specifies default - no exclusion).
Frame_start cannot be UNBOUNDED FOLLOWING, frame_end cannot be UNBOUNDED PRECEDING, and the frame_end choice cannot appear earlier in the list than the frame_start choice (for example, RANGE BETWEEN CURRENT ROW AND offset PRECEDING is not allowed).
Yes. Only aggregate window functions may use FILTER clauses. The syntax is: function_name ([expression]) [FILTER (WHERE filter_clause)] OVER window_name or OVER (window_definition).
Each of the 'hypothetical-set' aggregates (rank, dense_rank, percent_rank, cume_dist) is associated with a window function of the same name defined in Section 9.22. These can be used both as aggregates and as window functions.
Monitoring > Index Statistics
21 questions
Only after clean shutdowns. When the server shuts down cleanly, statistics are stored in the pg_stat subdirectory and retained. After an unclean shutdown (crash, immediate shutdown, base backup recovery, or PITR), all statistics counters are reset.
SELECT pg_stat_reset_single_table_counters('schema.tablename'::regclass); or SELECT pg_stat_reset_single_table_counters('tablename'::regclass); for tables in the current schema. You can also pass the numeric OID directly.
Yes. The statistics collector is active during recovery, and all scans, reads, blocks, index usage, etc., are recorded normally on the standby. Views like pg_stat_user_indexes work on standby servers.
PostgreSQL 15. The parameter was removed when PostgreSQL switched from file-based statistics to dynamic shared memory for the statistics system.
No. Index counters are only reset by calls to pg_stat_reset_single_table_counters() specifying each index on the base relation. You must reset each index separately.
128 MB (16384 blocks of 8KB each), though some sources indicate 32 MB for certain configurations. The exact default can vary by version, but 128MB is the typical modern default.
pg_stat_tmp (relative to the data directory). This parameter was removed in PostgreSQL 15 when the statistics system switched to using dynamic shared memory.
off (disabled by default). This parameter enables timing of database I/O waits, but is off because it may cause significant overhead on some platforms.
It represents the transaction end time (when it completed), based on the most recent transaction stop time, not the precise moment the index was utilized.
bigint. This column represents the number of disk blocks read from this index.
pg_stat_all_indexes shows statistics for all indexes (both user and system indexes), while pg_stat_user_indexes shows only user indexes (filtered to exclude system indexes in pg_catalog, information_schema, and pg_toast schemas).
bigint. This column represents the number of buffer hits in this index.
PostgreSQL 16. This column contains the timestamp of the last time an index scan has been performed.
pg_stat_reset() → void. It takes no parameters and returns void.
on (enabled by default). This parameter controls whether cumulative statistics are collected about table and index accesses.
9,223,372,036,854,775,807 (2^63-1). This is the maximum value for the signed 64-bit bigint data type used for statistics counters.
bigint. This column stores the number of live table rows fetched by simple index scans using this index.
Yes. Index scans may sometimes perform multiple index searches per execution. Each index search increments pg_stat_all_indexes.idx_scan, so it's possible for the count of index scans to significantly exceed the total number of index scan executor node executions.
Bitmap scans do NOT increment idx_tup_fetch for the index. They increment idx_tup_read for the index(es) used and idx_tup_fetch for the table, but not pg_stat_all_indexes.idx_tup_fetch because the output of several indexes can be combined.
1024 bytes. This parameter reserves memory to store text of currently executing commands in pg_stat_activity.query field.
timestamp with time zone. This column stores the time of the last scan on this index, based on the most recent transaction stop time.
Internals > Free Space Map
20 questions
The FSM uses one byte to represent the free space available on each heap (or index) page.
The HEAP_INSERT_SKIP_FSM flag (also called TABLE_INSERT_SKIP_FSM) prevents PostgreSQL from consulting the Free Space Map during insert operations, effectively preventing free space from being reused and forcing new data to be appended.
The default value is 0.2 (20% of unfrozen pages in table). This specifies a fraction of the unfrozen pages in the table to add to autovacuum_vacuum_insert_threshold when deciding whether to trigger a VACUUM.
B-trees use a default fillfactor of 90, but any integer value from 10 to 100 can be selected.
The max_fsm_pages and max_fsm_relations settings were removed in PostgreSQL 8.4, when the FSM was changed from a fixed-sized shared memory area to being stored on-disk in *_fsm files.
The FSM is stored in a separate file using the relation's filenode number plus the _fsm suffix. For example, a relation with filenode 12345 creates a file named 12345_fsm in the same directory as the main relation file.
For indexes, what is tracked is entirely-unused pages, rather than free space within pages. Therefore, the values are not meaningful, just whether a page is in-use or empty.
The FSM is organized as a tree of FSM pages. The bottom level FSM pages store free space for each heap/index page. The upper levels aggregate information from the lower levels. Each leaf node represents a heap page or a lower level FSM page.
You can delete the FSM file and run VACUUM on the table to rebuild corrected FSM data. It's a good idea to run VACUUM on each table that had a corrupt FSM (either in the table proper, or an index) to rebuild corrected FSM data.
max_fsm_pages had to be at least 16 * max_fsm_relations. This was because the FSM used a fixed-sized shared memory area that needed to accommodate tracking for all relations.
VACUUM FULL rewrites the entire contents of the table into a new disk file with no extra space and rebuilds the entire table and all indexes from scratch, which effectively recreates the FSM as well.
100 (complete packing) is the default fillfactor for tables. When a smaller fillfactor is specified, INSERT operations pack table pages only to the indicated percentage.
The default value is 50 tuples. This specifies the minimum number of updated or deleted tuples needed to trigger a VACUUM in any one table.
Each heap and index relation has a Free Space Map (FSM), except for hash indexes. Hash indexes use their own bitmap pages to track free space instead.
No, FSM values are not kept fully up-to-date as tuples are inserted and updated. They are not exact and may lag behind actual free space availability.
In each non-leaf node, the higher of its children's values is stored, so the maximum value in the leaf nodes is stored at the root.
Within each FSM page is a binary tree, stored in an array with one byte per node.
The pg_freespace function returns int2 (a 2-byte signed integer). The function has two overloads: pg_freespace(rel regclass, blkno bigint) returns int2 for a single page, and pg_freespace(rel regclass, blkno OUT bigint, avail OUT int2) for all pages.
By default, use is restricted to superusers and roles with privileges of the pg_stat_scan_tables role. Access may be granted to others using GRANT.
Rules
20 questions
If a NEW reference doesn't find a corresponding entry in the target list, NEW means the same as OLD (for an UPDATE) or is replaced by a null value (for an INSERT).
No. INSTEAD OF triggers may only be defined at row level (FOR EACH ROW), not at the statement level.
The rules will be evaluated first, and depending on the result, the triggers may not be used at all.
There can be only one action in an ON SELECT rule, and it must be an unconditional SELECT action that is INSTEAD.
The returned command status shows the original query type and zeroes for the row-count and OID fields.
Rewrite rules don't have a separate owner. The owner of a relation (table or view) is automatically the owner of the rewrite rules that are defined for it.
Materialized views use the rule system to store the query (like regular views), but the rule is only used for populating the materialized view with REFRESH MATERIALIZED VIEW, not for handling each query execution.
Currently, the only available action is to change the rule's name using RENAME TO.
If a rule action contains a NOTIFY command, the NOTIFY command will be executed unconditionally, that is, the NOTIFY will be issued even if there are not any rows that the rule should apply to.
Rules will only be significantly slower than triggers if their actions result in large and badly qualified joins, a situation where the planner fails.
Rules do not support WITH clauses in the original query and multiple-assignment sub-SELECTs in the SET list of UPDATE queries, because copying these constructs into a rule query would result in multiple evaluations of the sub-query.
The pg_rewrite catalog stores rewrite rules for tables and views.
OLD is valid in ON UPDATE and ON DELETE rules to refer to the existing row being updated or deleted.
The rule system modifies queries at parse/rewrite time before passing them to the query planner, while triggers fire at execution time for each affected row.
Any table used as the target of a data-modifying statement in WITH must not have a conditional rule, nor an ALSO rule, nor an INSTEAD rule that expands to multiple statements.
The server returns the command status for the last query that was inserted by an INSTEAD rule (conditional or unconditional) and is of the same command type (INSERT, UPDATE, or DELETE) as the original query.
Full Text Search > Ranking and Highlighting
20 questions
When MaxFragments is set to a value greater than zero, it enables fragment-based headline generation. ts_headline locates query matches and splits each match into fragments of no more than MaxWords words each, preferring fragments with more query words. When more than one fragment is displayed, the fragments are separated by FragmentDelimiter.
strip(tsvector) returns tsvector. It returns a vector that lists the same lexemes as the given vector, but lacks any position or weight information.
ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text
If no query matches can be identified, then a single fragment of the first MinWords words in the document will be displayed.
You can combine multiple normalization flags using the bitwise OR operator (|). For example, 2|4 combines flags 2 and 4. If more than one flag bit is specified, the transformations are applied in the order listed.
Flag 32 divides the rank by itself + 1 (rank/(rank+1)), which scales all ranks into the range zero to one.
D is the default weight for new tsvector entries and is not displayed in the output.
ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care.
Flag 1 divides the rank by 1 + the logarithm of the document length.
Position values can range from 1 to 16383; larger numbers are silently set to 16383.
The weight labels are A, B, C, or D, where A is the highest priority and D is the lowest.
ShortWord specifies that words of this length or less will be dropped at the start and end of a headline, unless they are query terms. The default value of 3 eliminates common English articles.
The default value of FragmentDelimiter is " ... " (space-ellipsis-space).
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
Window Functions > Offset Functions
20 questions
No. The behavior is always the same as the standard's default, namely RESPECT NULLS. IGNORE NULLS/RESPECT NULLS support was only added in PostgreSQL 19.
last_value(value anyelement). The function returns anyelement type.
EXCLUDE NO OTHERS (default, no exclusion), EXCLUDE CURRENT ROW (excludes current row only), EXCLUDE TIES (excludes peer rows but not current row), and EXCLUDE GROUP (excludes current row and all peers).
NULL. If you omit the default_value parameter, the function will return NULL when the offset goes beyond the scope of the partition.
1 row. The offset parameter defaults to 1 if you don't specify it.
first_value(value anyelement). The function returns anyelement type.
PostgreSQL 11. It was the first major SQL database to support GROUPS frames as part of SQL:2011 standard compliance.
Because the default frame contains rows from the start of the partition through the last peer of the current row, LAST_VALUE() returns the last value in that frame (often the current row itself) rather than the last value in the entire partition. You need to redefine the frame with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
1-based. The first row in the window frame is at position 1, not 0. NTH_VALUE(column, 1) retrieves the first row value.
In RANGE or GROUPS mode, CURRENT ROW means the frame starts with the current row's first peer row or ends with the current row's last peer row, not just the current row itself.
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING (or RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING). This ensures the window frame includes all rows in the partition, not just up to the current row.
The offset must be a non-negative integer. Using negative values would result in an error, as PostgreSQL requires offset to be a positive integer that defaults to 1.
No. The third argument (default value) is only allowed if you specify the second argument (offset). All three parameters must be provided in order.
NULL. If there is no row at the specified offset, the LEAD() function returns NULL by default.
FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE() are frame-dependent - they only consider rows within the window frame. LAG() and LEAD() are not frame-dependent.
Both offset and default are evaluated with respect to the current row. They are evaluated relative to whichever row is currently being processed by the window function.
NULL. The function returns NULL if no such row exists within the window frame.
PostgreSQL 11, as part of SQL:2011 window function framing options support.
lag(value anycompatible [, offset integer [, default anycompatible]]). The function returns anycompatible type.
lead(value anycompatible [, offset integer [, default anycompatible]]). The function returns anycompatible type.
Logical Replication > Replication Identity
20 questions
The index must be unique, not partial, not deferrable, and include only columns marked NOT NULL. Additionally, it cannot contain expressions.
No, the index cannot contain expressions. The leftmost index field must be a column (not an expression) that references the published table column.
pg_class.relreplident uses: 'd' for DEFAULT, 'f' for FULL, 'i' for INDEX, and 'n' for NOTHING.
When using REPLICA IDENTITY FULL on published tables, UPDATE and DELETE operations cannot be applied to subscribers if the tables include attributes with datatypes (such as point or box) that do not have a default operator class for B-tree or Hash.
The NOT NULL requirement exists because NULL values cannot be compared for equality. A table can have multiple records with NULL values in a UNIQUE column, so nullable columns cannot uniquely identify records.
NOTHING is the default replica identity setting for system tables.
No, INSERT operations can proceed regardless of any replica identity configuration.
REPLICA IDENTITY FULL should only be used as a fallback if no other solution is possible, because if there are no suitable btree or hash indexes on the subscriber side, the search can be very inefficient.
Candidate indexes must be btree or hash, non-partial, and the leftmost index field must be a column (not an expression) that references the published table column.
No, tables with DEFAULT replica identity without a primary key cannot support UPDATE or DELETE operations when included in a publication replicating these actions.
Replica identity was introduced in PostgreSQL 9.4 as a foundational component for logical replication.
No, tables with replica identity defined as NOTHING cannot support UPDATE or DELETE operations when included in a publication replicating these actions.
The size of WAL events for updates is approximately doubled if there are no TOASTed values, and potentially way more than doubled if there are TOASTed values.
DEFAULT is the default replica identity setting for non-system tables. This records the old values of the columns of the primary key, if any.
No, ALTER TABLE REPLICA IDENTITY on a partitioned table operates only on the parent table itself and does not propagate to partitions.
The four replica identity modes are: DEFAULT (records old values of primary key columns), USING INDEX (records old values of columns covered by a named index), FULL (records old values of all columns), and NOTHING (records no information about the old row).
Yes, a published table must have a replica identity configured in order to replicate UPDATE and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber side.
If a replica identity other than FULL is set on the publisher side, a replica identity comprising the same or fewer columns must also be set on the subscriber side.
indisreplident is a boolean column that is true if the index has been chosen as replica identity using ALTER TABLE ... REPLICA IDENTITY USING INDEX.
Indexes > Indexes on Expressions
19 questions
Yes. An index definition can specify a collation for each column or expression using the syntax: (expression) COLLATE collation opclass [ASC | DESC] [NULLS {FIRST | LAST}]. The operator class determines the basic sort ordering, which can then be modified by adding the COLLATE clause and sort options.
A workaround is that the index can be declared as (f(x), x), where the second column x convinces the planner that an index-only scan is possible, since the planner considers a query potentially executable by index-only scan only when all columns needed by the query are available from the index.
VOLATILE (default) can do anything including modifying the database and may return different results with identical arguments; STABLE cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement; IMMUTABLE cannot modify the database and is guaranteed to return the same results given the same arguments forever.
You can identify invalid indexes with: SELECT * FROM pg_class, pg_index WHERE pg_index.indisvalid = false AND pg_index.indexrelid = pg_class.oid
All functions and operators used in an index definition must be 'immutable', that is, their results must depend only on their arguments and never on any outside influence (such as the contents of another table or the current time).
Only B-tree indexes can produce sorted output and support the optional clauses ASC, DESC, NULLS FIRST, and/or NULLS LAST to modify the sort ordering of the index. Other index types return matching rows in an unspecified, implementation-dependent order.
Expressions must be written in parentheses in CREATE INDEX syntax. However, the parentheses can be omitted if the expression has the form of a function call.
The system can recognize simple inequality implications, for example 'x < 1' implies 'x < 2'. Otherwise the predicate condition must exactly match part of the query's WHERE condition.
If the invalid index is suffixed _ccnew, it corresponds to the transient index created during the concurrent operation and should be dropped. If suffixed _ccold, it corresponds to the original index which could not be dropped after a successful rebuild.
No. Expressions are not supported as included columns since they cannot be used in index-only scans.
Yes. Expression indexes not only allow rapid lookups of matching expressions, but they also provide optimizer statistics, which improve row estimates and hence query plans. Interestingly, the optimizer can use expression index statistics even when the expression index itself is not used for query execution.
Yes. A UNIQUE index on an expression can prevent creation of rows whose values differ only in ways not captured by the base columns, such as case-insensitive uniqueness using LOWER(column).
No. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. The predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable, except for simple inequality implications.
Yes, in principle index-only scans can be used with expression indexes. For example, given an index on f(x), queries selecting the expression result can potentially use index-only scans. However, PostgreSQL's planner is currently not very smart about such cases.
PostgreSQL will consider using an expression index when the expression appears in the WHERE clause or in the ORDER BY clause of the SQL statement.
By default, B-tree indexes store entries in ascending order with nulls last. NULLS FIRST is the default when DESC is specified; NULLS LAST is the default when DESC is not specified.
PostgreSQL avoids the evaluation of the expression if its cost is higher than 1000 (the MAX_HOT_INDEX_EXPR_COST constant). If the expression cost is below this threshold, PostgreSQL may evaluate the expression to determine if a HOT update is possible even when indexed columns are modified.
No. BRIN indexes are 'summarizing indexes' and are excluded from HOT update checks. The update does not modify any columns referenced by the table's indexes, not including summarizing indexes. Starting in PostgreSQL 16, BRIN indexes no longer block HOT updates.
An index column can be a function or scalar expression computed from one or more columns of the table, used to obtain fast access to tables based on computation results.
common_errors
19 questions
The FATAL: password authentication failed error occurs when PostgreSQL successfully receives your connection request but rejects it because authentication credentials are invalid or do not match the configured authentication method in pg_hba.conf.
What it means:
- Network connection to PostgreSQL succeeded
- Server is willing to communicate but requires valid authentication first
- The credentials provided (username/password) failed verification
Common causes:
- Wrong password - Incorrect password for the database user
- Wrong username - User doesn't exist or is misspelled
- User has no password set - User created without PASSWORD clause
- pg_hba.conf authentication method mismatch - File specifies a different auth method (e.g., md5, scram-sha-256, trust, peer, ident)
- Cached credentials - Client tool using an outdated saved password
Troubleshooting steps:
- Verify username and password are correct
- Check PostgreSQL server logs for detailed error information (more details than client sees)
- Review the pg_hba.conf authentication method for the matching connection entry
- If using Kerberos/ident authentication, verify those systems are configured properly
- Ensure pg_hba.conf changes are reloaded: SELECT pg_reload_conf(); or restart PostgreSQL
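To see which pg_hba.conf entry a connection will match, the pg_hba_file_rules view (PostgreSQL 10+) is useful; a quick sketch:
-- By default readable only by superusers
SELECT line_number, type, database, user_name, address, auth_method
FROM pg_hba_file_rules;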
Example fix for password authentication:
-- Reset user password as superuser
ALTER USER username WITH PASSWORD 'new_password';
The "could not extend file" error (SQLSTATE 53100) occurs when PostgreSQL cannot allocate more disk space to grow a database file. This error falls under PostgreSQL's "Class 53 — Insufficient Resources" errors.
Primary Causes:
Insufficient Disk Space - The most common cause. The partition containing the PostgreSQL data directory has run out of available disk space.
Inode Exhaustion - The filesystem has exhausted available inodes even if disk space remains.
Filesystem Fragmentation - Free space exists but is too fragmented to allocate as contiguous blocks of the required size.
File Size Limits - The file has reached filesystem or OS limits (e.g., 2GB limit on some systems, though PostgreSQL typically segments large tables at 1GB).
Disk Quota Exceeded - User or group disk quotas have been reached.
Temporary File Growth - Temporary files from large queries/sorts can exhaust disk space during query execution.
The Error:
ERROR: could not extend file "base/16384/61892": No space left on device
HINT: Check free disk space.
Where base/16384/61892 indicates the database OID (16384) and relation filenode (61892).
Resolution:
- Free up disk space on the PostgreSQL data partition
- Use tablespaces to move data to different partitions
- Set temp_file_limit to prevent runaway temporary file growth
- Monitor disk usage proactively
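A diagnostic sketch that maps the filenode from the example error back to a relation and caps temp-file growth (the limit value is illustrative):
-- Which relation does base/16384/61892 belong to?
SELECT relname FROM pg_class WHERE pg_relation_filenode(oid) = 61892;
-- Largest relations in the current database
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind IN ('r', 'i', 'm')
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
-- Cap per-session temporary file usage
ALTER SYSTEM SET temp_file_limit = '10GB';
SELECT pg_reload_conf();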
A "checksum mismatch" error in PostgreSQL indicates data corruption detected when reading a data page from disk.
What It Means
PostgreSQL writes a checksum value to each data page when the page is written to disk. When reading the page back, PostgreSQL calculates the checksum again and compares it to the stored value. A mismatch means the data has been altered unexpectedly—typically due to hardware failure, storage corruption, or operating system bugs.
Error Behavior
When checksums are enabled (default in PostgreSQL 18):
- Default behavior: PostgreSQL aborts the current transaction immediately when a checksum mismatch is detected
- Warning message: "WARNING: page verification failed, calculated checksum X but expected Y"
- Monitoring: The pg_stat_database view tracks checksum failures in the checksum_failures and checksum_last_failure columns
What's Protected
- Protected: Data pages only
- Not protected: Internal data structures, temporary files
Recovery Options
The ignore_checksum_failure parameter can be temporarily enabled to bypass checksum verification during recovery, allowing you to read corrupted blocks (though this may cause crashes or propagate corruption). The pg_checksums utility can verify or manage checksums on offline clusters.
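A quick way to confirm whether checksums are enabled and whether failures have been recorded (a sketch):
-- Is the cluster running with data checksums?
SHOW data_checksums;
-- Per-database failure counters (columns added in PostgreSQL 12)
SELECT datname, checksum_failures, checksum_last_failure FROM pg_stat_database;
-- Last resort while salvaging data (risky; superuser only)
SET ignore_checksum_failure = on;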
A 'lock timeout' error in PostgreSQL occurs when a statement waits longer than the configured lock_timeout duration while attempting to acquire a lock on a database object (table, index, row, etc.).
The error is triggered when:
- A statement needs to acquire a lock (explicitly via LOCK TABLE or SELECT FOR UPDATE, or implicitly during normal operations)
- Another transaction already holds a conflicting lock on that object
- The wait time exceeds the lock_timeout value (in milliseconds)
Error message:
ERROR: canceling statement due to lock timeout
Default behavior: lock_timeout = 0 (disabled - statements wait indefinitely for locks)
Key distinction: Unlike statement_timeout, this only measures time spent waiting for locks, not total statement execution time. Each separate lock acquisition attempt has its own timeout timer.
Common scenarios:
- Long-running transactions holding locks while other queries wait
- DDL operations (ALTER TABLE) blocking concurrent queries
- Row-level locks from UPDATE/DELETE blocking other modifications
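Before tuning timeouts, it helps to see who is blocked and by whom; a diagnostic sketch:
-- List waiting sessions and the PIDs blocking them
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;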
The "relation does not exist" error in PostgreSQL occurs when the database cannot find the table, view, sequence, or other relation you're referencing. The three primary causes are:
1. Case Sensitivity with Quoted Identifiers
Unquoted identifiers are folded to lowercase. If you create a table with quotes like CREATE TABLE "MyTable", you must query it with quotes: SELECT * FROM "MyTable". Querying SELECT * FROM MyTable (unquoted) will look for mytable (lowercase) and fail.
Example:
-- Creates a case-sensitive table name
CREATE TABLE "Users" (id int);
-- This fails: ERROR: relation "users" does not exist
SELECT * FROM Users;
-- This works:
SELECT * FROM "Users";
2. Schema Search Path Issues
The table exists but isn't in your current schema search path. PostgreSQL looks for tables in the schemas listed in search_path. If your table is in a different schema (e.g., myschema.users), either qualify it fully or add the schema to your search path.
Example:
-- Table exists in 'myschema' but not in search_path
SELECT * FROM users; -- ERROR: relation "users" does not exist
-- Solution 1: Use qualified name
SELECT * FROM myschema.users;
-- Solution 2: Add to search path
SET search_path TO myschema, public;
SELECT * FROM users; -- Now works
3. Table Actually Doesn't Exist
The table hasn't been created yet, was dropped, or you're connected to the wrong database. Verify with \dt (psql) or query pg_tables.
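A quick catalog-based verification sketch:
-- Does the table exist anywhere, and in which schema?
SELECT schemaname, tablename FROM pg_tables WHERE tablename = 'users';
-- What will unqualified names resolve against, and which database am I in?
SHOW search_path;
SELECT current_database();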
PostgreSQL 'disk full' errors occur when the filesystem hosting PostgreSQL data runs out of available space. The specific causes and impacts depend on which disk fills up:
WAL (Write-Ahead Log) disk full:
- Causes database server panic and shutdown
- Most critical scenario - can halt all database operations
Data disk full:
- Prevents new write operations (INSERT, UPDATE, DELETE)
- Does not corrupt existing data
- Blocks useful activity but maintains data integrity
Common underlying causes:
- Uncontrolled table/index growth from normal operations
- Transaction log (WAL) accumulation when not archived/removed properly
- Temporary files from large queries filling the pgsql_tmp directory
- Disk quota limits on the PostgreSQL user account
- Filesystem degradation (many filesystems perform badly when nearly full)
Prevention: PostgreSQL recommends monitoring disk usage proactively and using tablespaces to distribute data across multiple filesystems before reaching capacity.
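A minimal monitoring sketch from inside the database:
-- Size of each database
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size FROM pg_database;
-- Size of each tablespace
SELECT spcname, pg_size_pretty(pg_tablespace_size(spcname)) AS size FROM pg_tablespace;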
Set lock_timeout to abort any statement that waits longer than the specified duration to acquire a lock.
Syntax:
SET lock_timeout = '2s'; -- Session level
SET LOCAL lock_timeout = '2s'; -- Transaction level
ALTER DATABASE mydb SET lock_timeout = '5s'; -- Database level
ALTER ROLE myuser SET lock_timeout = '3s'; -- User level
Valid values:
- Integer with unit: '2s' (seconds), '500ms' (milliseconds), '1min' (minutes)
- Integer without unit: treated as milliseconds
- 0 or '0': disables the timeout (default)
Examples:
-- For a single transaction
BEGIN;
SET LOCAL lock_timeout = '4s';
UPDATE accounts SET balance = balance + 100 WHERE id = 1;
COMMIT;
-- For current session
SET lock_timeout = '2s';
-- For a specific user
ALTER ROLE app_user SET lock_timeout = '3s';
Important:
- Time limit applies separately to each lock acquisition attempt
- Do NOT set it in postgresql.conf - it affects all sessions
- If statement_timeout is set, keep lock_timeout lower, otherwise the statement timeout triggers first
- Applies to explicit locks (LOCK TABLE, SELECT FOR UPDATE) and implicit locks
This error occurs when you attempt to insert or update a row with a NULL value in a column that has a NOT NULL constraint defined.
PostgreSQL enforces NOT NULL constraints at the database level - columns with this constraint cannot store NULL values. When you try to INSERT or UPDATE a row where a NOT NULL column would receive a NULL value, PostgreSQL raises error code 23502 (not_null_violation).
The error message format is:
ERROR: null value in column "column_name" violates not-null constraint
DETAIL: Failing row contains (...)
Common causes:
- Explicitly inserting NULL: INSERT INTO my_table (col) VALUES (NULL)
- Omitting a required column without a DEFAULT value
- Setting a column to NULL in an UPDATE statement
- Application logic passing NULL where a value is required
Example:
CREATE TABLE users (
id serial PRIMARY KEY,
email text NOT NULL
);
-- This fails:
INSERT INTO users (id) VALUES (1);
-- ERROR: null value in column "email" violates not-null constraint
Resolution: Provide a non-NULL value for the constrained column, or remove/modify the NOT NULL constraint if NULL values should be allowed.
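Sketches of the two resolution paths, using the users table from the example above (the default value is illustrative):
-- Option 1: give the column a default so omitted values no longer fail
ALTER TABLE users ALTER COLUMN email SET DEFAULT 'unknown@example.com';
-- Option 2: allow NULLs by dropping the constraint
ALTER TABLE users ALTER COLUMN email DROP NOT NULL;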
PostgreSQL "permission denied" errors occur when a user or role lacks the required privileges to perform an operation. Here are the specific causes:
Database Object Access
Missing object privileges - Users need explicit grants to access database objects. Operations fail without the correct privilege:
- SELECT - read data
- INSERT - add rows
- UPDATE - modify data
- DELETE - remove rows
- TRUNCATE - clear tables
- EXECUTE - call functions
- REFERENCES - create foreign keys
Missing schema access - Requires USAGE privilege on the schema to access objects within it, even if you have privileges on the specific object.
Missing database connection privileges - Requires CONNECT privilege to connect to a database.
Ownership and Grants
Non-owner attempting restricted operations - Only the object owner (or superuser) can perform certain operations until privileges are explicitly granted via GRANT.
Revoked privileges - Privileges removed via REVOKE cause subsequent access attempts to fail.
Missing grant option - Cannot grant privileges to others without the "grant option" on that privilege.
Column-level restrictions - Operations on specific columns fail when column-level privileges weren't granted.
Special Cases
Foreign key enforcement - Requires REFERENCES privilege on the referenced table's columns.
Maintenance operations - Requires the MAINTAIN privilege (PostgreSQL 17+) for VACUUM, ANALYZE, CLUSTER, REINDEX.
Row-level security policies - Policies can restrict row access even when table privileges exist.
File system permissions - Operating system file permissions on data directories, WAL files, or configuration files can cause permission denied at the OS level.
Solution Pattern
-- Grant privileges on a table
GRANT SELECT, INSERT ON schema_name.table_name TO role_name;
-- Grant schema access
GRANT USAGE ON SCHEMA schema_name TO role_name;
-- Grant database connection
GRANT CONNECT ON DATABASE database_name TO role_name;
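To check what a role can already do before granting more, the access-privilege inquiry functions help (role, schema, and table names are the placeholders used above):
-- Does the role have SELECT on the table and USAGE on the schema?
SELECT has_table_privilege('role_name', 'schema_name.table_name', 'SELECT');
SELECT has_schema_privilege('role_name', 'schema_name', 'USAGE');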
Handle serialization failures by implementing transaction retry logic that catches SQLSTATE error code 40001 (serialization_failure) or 40P01 (deadlock_detected).
Core Strategy:
- Wrap the entire transaction in a retry loop
- Catch serialization failure exceptions
- Retry the complete transaction from the beginning, including all decision logic
- Set a maximum retry limit to prevent infinite loops
Code Pattern (Python with psycopg2):
import psycopg2
from psycopg2 import errorcodes, extensions

MAX_RETRIES = 5

def execute_transaction(conn):
    # Set SERIALIZABLE on the connection; psycopg2 issues BEGIN implicitly,
    # so no explicit BEGIN statement is needed inside the transaction
    conn.set_session(isolation_level=extensions.ISOLATION_LEVEL_SERIALIZABLE)
    for attempt in range(MAX_RETRIES):
        try:
            with conn:  # commits on success, rolls back on exception
                with conn.cursor() as cur:
                    # Complete transaction logic here
                    cur.execute("SELECT ...")
                    cur.execute("UPDATE ...")
            break  # Success, exit retry loop
        except psycopg2.Error as e:
            if e.pgcode in (errorcodes.SERIALIZATION_FAILURE,
                            errorcodes.DEADLOCK_DETECTED):
                if attempt < MAX_RETRIES - 1:
                    continue  # Retry the entire transaction
                raise  # Max retries exceeded
            raise  # Non-retryable error
Critical Requirements:
- Retry the entire transaction, including all application logic that determines which SQL to execute and which values to use
- Do not use savepoints or attempt partial retries
- PostgreSQL intentionally provides no automatic retry facility
- Expect multiple retries may be needed under high contention
- Unique constraint violations (SQLSTATE 23505) and exclusion constraint violations (23P01) may indicate serialization issues but could also be persistent errors requiring different handling
Sources:
PostgreSQL "out of memory" errors (error code 53200) occur when the server cannot allocate sufficient memory to complete an operation.
Primary Causes:
work_mem misconfiguration: This parameter sets memory per operation per session. A complex query can run multiple sort/hash operations simultaneously, each using up to work_mem. With multiple concurrent sessions, total memory usage = work_mem × operations × sessions. The official documentation states: "simply increasing work_mem results in memory pressure (memory pressure typically takes the form of intermittent out of memory errors)."
Insufficient shared memory: PostgreSQL uses shared_buffers for caching data. Setting this too high (>40% of RAM) leaves insufficient memory for other operations. The documentation recommends 25% of system RAM as a starting point for dedicated servers.
Too many concurrent connections: Each connection can use work_mem multiple times per query, multiplying memory consumption.
Operating system OOM killer: When the entire system runs out of memory, the OS may terminate PostgreSQL processes (signal 9: Killed).
Lock table exhaustion: Operations requiring many locks can exhaust shared memory when max_locks_per_transaction is too low.
The Multiplication Effect:
A query with 3 sorts using work_mem=100MB across 10 concurrent sessions can consume 3GB (3 × 100MB × 10) of memory, not 100MB.
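Because the parameter multiplies out like this, it is often safer to raise work_mem only for the session or transaction that needs it rather than server-wide; a hedged sketch (values are illustrative):
-- Per-session override
SET work_mem = '64MB';
-- Or scope the override to a single transaction
BEGIN;
SET LOCAL work_mem = '256MB';
-- run the memory-hungry query here
COMMIT;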
Sources:
"Canceling statement due to statement timeout" is an ERROR message that PostgreSQL returns when a SQL statement exceeds the time limit set by the statement_timeout configuration parameter.
statement_timeout is a PostgreSQL configuration parameter that aborts any statement that takes more than the specified amount of time. The value is specified in milliseconds (when no unit is provided). When set to 0 (the default), timeouts are disabled.
How it works:
- PostgreSQL tracks execution time for each SQL statement
- When a statement exceeds the statement_timeout limit, PostgreSQL terminates it immediately
- The client receives the error: ERROR: canceling statement due to statement timeout
- If log_min_error_statement is set to ERROR or lower, the timed-out statement is also logged
Configuration examples:
-- Set timeout to 30 seconds for current session
SET statement_timeout = '30s';
-- Set timeout to 5000 milliseconds
SET statement_timeout = 5000;
-- Disable timeout for current session
SET statement_timeout = 0;
Important: The official documentation explicitly states that setting statement_timeout in postgresql.conf is not recommended because it affects all sessions. It's better to set it per-session, per-user, or per-database.
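For example, the timeout can be attached to a role or a database rather than the whole server (the role and database names below are placeholders):
-- Applies to new sessions of this role / in this database
ALTER ROLE reporting_user SET statement_timeout = '60s';
ALTER DATABASE analytics SET statement_timeout = '5min';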
Sources:
The "too many connections" error in PostgreSQL occurs when the database server has reached its maximum concurrent connection limit and cannot accept new connections.
Error Details:
- Error Code: 53300
- Condition Name: too_many_connections
- Error Class: Class 53 — Insufficient Resources
- Error Message: FATAL: sorry, too many clients already
Root Cause:
The server's max_connections parameter (default: typically 100) determines the maximum number of concurrent connections allowed. When active connections reach this limit (minus reserved slots for superusers), new connection attempts are rejected.
Configuration Parameters:
- max_connections: Maximum concurrent connections to the database server
- superuser_reserved_connections: Connection slots reserved for superusers (default: 3)
- reserved_connections: Additional reserved slots (default: 0)
Common Solutions:
- Increase max_connections in postgresql.conf (requires server restart)
- Implement connection pooling (e.g., PgBouncer, pgpool-II)
- Investigate and fix connection leaks in application code
- Set per-role or per-database connection limits
Note: Simply increasing max_connections without addressing the underlying cause (connection leaks, improper connection management) is not recommended, as each connection consumes system resources.
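Before raising the limit, it helps to see what the existing connections are doing; a simple check against pg_stat_activity:
-- Current connections grouped by state (active, idle, idle in transaction, ...)
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
SHOW max_connections;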
Sources:
The invalid input syntax error (SQLSTATE 22P02) occurs when PostgreSQL cannot parse text data into the target data type because the input doesn't conform to the expected format.
Common Causes:
Type mismatch during casting or insertion:
- Attempting to insert/cast a non-numeric string into an INTEGER column (e.g., 'abc' or 'false' into INT)
- Empty strings ('') being cast to numeric types
- Improperly formatted dates, timestamps, or other typed data
- Special characters or whitespace that breaks the expected format
Data import issues:
- NULL representation mismatches in COPY operations (default: \N in text format, an empty string in CSV)
- Quoted vs unquoted values in CSV imports
- Encoding or delimiter problems
Error Details:
- Error Code: 22P02
- Condition Name: invalid_text_representation
- Error Class: 22 (Data Exception)
The error message typically includes the specific value that failed and the target type, e.g., invalid input syntax for integer: "abc"
Example:
-- This will raise the error:
SELECT '123abc'::integer;
-- ERROR: invalid input syntax for integer: "123abc"
-- This works:
SELECT '123'::integer;
-- Result: 123
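On PostgreSQL 16 and later (assumption: a 16+ server), a value can be tested before casting with pg_input_is_valid(), which returns false instead of raising the error:
SELECT pg_input_is_valid('123abc', 'integer');  -- false
SELECT pg_input_is_valid('123', 'integer');     -- true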
Sources:
A foreign key constraint violation occurs when you attempt to INSERT or UPDATE a row in a table where the foreign key value does not exist in the referenced parent table.
Specific causes:
INSERT with non-existent reference - Inserting a row where the foreign key column value has no matching row in the parent table
UPDATE with non-existent reference - Updating a foreign key column to a value that doesn't exist in the parent table
DELETE from parent table - Deleting a row from the parent table that is still referenced by child rows (when using RESTRICT or NO ACTION)
UPDATE parent table key - Updating the primary/unique key in the parent table that child rows reference (when using RESTRICT or NO ACTION)
Error message format:
ERROR: insert or update on table "child_table" violates foreign key constraint "constraint_name"
DETAIL: Key (column_name)=(value) is not present in table "parent_table".
Example:
-- If 'cities' table has no row with name='Berkeley'
INSERT INTO weather VALUES ('Berkeley', 45, 53, 0.0, '1994-11-28');
-- Results in:
ERROR: insert or update on table "weather" violates foreign key constraint "weather_city_fkey"
DETAIL: Key (city)=(Berkeley) is not present in table "cities".
Foreign keys enforce referential integrity by ensuring every reference points to an existing row in the parent table.
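The usual fix is to insert the missing parent row first (or correct the key value); a sketch assuming the tutorial-style cities(name, location) table referenced above:
-- Create the referenced row, then the referencing insert succeeds
INSERT INTO cities (name, location) VALUES ('Berkeley', '(-122.27, 37.87)');
INSERT INTO weather VALUES ('Berkeley', 45, 53, 0.0, '1994-11-28');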
Sources:
A statement timeout error occurs when a PostgreSQL query or statement takes longer to execute than the configured statement_timeout limit. PostgreSQL will abort the statement and return:
ERROR: canceling statement due to statement timeout
What is statement_timeout:
The statement_timeout parameter sets the maximum execution time for any SQL statement. When a statement exceeds this limit, PostgreSQL cancels it. The value is specified in milliseconds (default: 0, which disables timeout).
Key behavior:
- The timeout applies from when the command arrives at the server until completion
- Each SQL statement in a multi-statement query gets the timeout applied separately
- The timed-out statement is logged if log_min_error_statement is set to ERROR or lower
- Setting statement_timeout = 0 disables the timeout
Configuration example:
-- Set for current session (recommended)
SET statement_timeout = '30s';
-- Or in milliseconds
SET statement_timeout = 30000;
It's recommended to set statement_timeout at the session or role level rather than globally in postgresql.conf, to avoid affecting all database operations.
Sources:
PostgreSQL automatically detects deadlocks and resolves them by aborting one of the involved transactions, allowing the others to complete. Which transaction gets aborted is unpredictable.
Resolution Strategies
1. Automatic Retry
Implement retry logic to re-execute transactions that fail with deadlock errors (error code 40P01).
2. Prevention through Consistent Lock Ordering
Ensure all applications acquire locks on multiple objects in the same order across all transactions. This is the most effective prevention method.
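As an illustration (the accounts table here is hypothetical), two concurrent transfers that both lock rows in ascending id order cannot end up waiting on each other in a cycle:
BEGIN;
-- Lock both rows in a fixed order before modifying either
SELECT * FROM accounts WHERE id IN (1, 2) ORDER BY id FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;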
3. Acquire Restrictive Locks Early
Take the most restrictive lock mode needed for each object at the beginning of the transaction.
4. Keep Transactions Short
Avoid holding transactions open during user input or long operations.
5. Configure Deadlock Detection Timeout
Adjust deadlock_timeout parameter (default: 1 second) to control how long PostgreSQL waits before checking for deadlocks:
ALTER SYSTEM SET deadlock_timeout = '1s';
Example: Query Retry Pattern
import psycopg2
from psycopg2 import errors

max_retries = 3

# 'conn' is assumed to be an existing psycopg2 connection
for attempt in range(max_retries):
    try:
        # Execute the transaction's statements here, then commit
        conn.commit()
        break  # Success
    except errors.DeadlockDetected:
        conn.rollback()
        if attempt == max_retries - 1:
            raise
Sources:
To increase max_connections in PostgreSQL:
Edit postgresql.conf - Locate your PostgreSQL configuration file (typically at /etc/postgresql/{version}/main/postgresql.conf or in your data directory) and modify:
max_connections = 200
(Replace 200 with your desired value; the default is 100.)
Restart PostgreSQL - This parameter requires a full server restart:
sudo systemctl restart postgresql # or pg_ctl restart -D /path/to/data/directory
Important Considerations:
- Requires restart: max_connections can only be set at server start, not via reload
- Shared memory impact: Increasing this value allocates more shared memory and system resources
- Standby servers: Must set an equal or higher value than the primary server
- Reserved connections: Ensure max_connections exceeds superuser_reserved_connections + reserved_connections
Verification:
SHOW max_connections;
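Instead of editing the file by hand, the same change can be made with ALTER SYSTEM, which writes to postgresql.auto.conf; a restart is still required before it takes effect:
ALTER SYSTEM SET max_connections = 200;
-- restart the server, then confirm:
SHOW max_connections;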
Sources:
A serialization failure is an error (SQLSTATE code 40001) that PostgreSQL raises to prevent serialization anomalies when transactions running under Repeatable Read or Serializable isolation levels have conflicting concurrent access patterns.
When it occurs:
- Repeatable Read: When a transaction attempts to modify or lock rows that have been changed by other transactions since it began (error message: "could not serialize access due to concurrent update")
- Serializable: When concurrent transactions have read/write dependencies that cannot be reconciled with any serial execution order (error message: "could not serialize access due to read/write dependencies among transactions")
Required handling:
Applications must retry the complete transaction (including all logic and decision-making, not just SQL statements) when encountering this error. PostgreSQL does not provide automatic retry because it cannot guarantee correctness without re-executing the full transaction logic.
Example scenario:
Transaction A reads class 1 records and Transaction B reads class 2 records. Then A inserts based on class 1 data while B inserts based on class 2 data. If these operations create a logical conflict that couldn't occur in any serial execution, one transaction will fail with a serialization error.
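Transactions opt into this checking by running at the Serializable level; a minimal sketch (the ledger table and values are illustrative):
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(amount) FROM ledger WHERE class = 1;
INSERT INTO ledger (class, amount) VALUES (2, 100);
COMMIT;  -- may fail with SQLSTATE 40001; if so, retry the whole transaction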
Sources:
backup_recovery
19 questions
pg_dump backs up a single database. pg_dumpall backs up an entire PostgreSQL cluster (all databases plus global objects).
Key Differences
Scope:
- pg_dump: One database at a time
- pg_dumpall: All databases in the cluster
Global Objects:
- pg_dump: Does NOT include roles, tablespaces, or cluster-wide settings
- pg_dumpall: Includes roles, tablespaces, and privilege grants for configuration parameters
How it works:
- pg_dump: Dumps a single database directly
- pg_dumpall: Calls pg_dump for each database, then adds global objects
Consistency:
- pg_dump: A single database dump is internally consistent
- pg_dumpall: Each database is internally consistent, but snapshots across databases are not synchronized
When to Use Each
Use pg_dump when you need to backup/restore a specific database.
Use pg_dumpall when you need:
- Complete cluster backup
- To preserve user accounts (roles)
- To preserve tablespace definitions
- To migrate an entire PostgreSQL instance
You can also use pg_dumpall --globals-only to dump only cluster-wide objects, then combine with individual pg_dump commands for each database.
Sources:
The -Fc flag specifies the custom format output for pg_dump.
This format:
- Outputs a custom-format archive suitable for input into pg_restore (it cannot be restored with psql)
- Is the most flexible output format, allowing manual selection and reordering of archived items during restore
- Is compressed by default
- Supports parallel restoration
Example:
pg_dump -Fc mydb > mydb.dump
# Restore with:
pg_restore -d mydb mydb.dump
The -F parameter accepts different format values: c (custom), d (directory), t (tar), or p (plain text/SQL script).
Sources:
To restore from a pg_dump backup, use psql for plain-text SQL dumps or pg_restore for custom/directory/tar formats.
Plain-Text SQL Format
psql dbname < dumpfile
Create the target database first:
createdb -T template0 dbname
psql dbname < dumpfile
Custom/Directory/Tar Format
pg_restore -d dbname dumpfile
Common Options
Clean and recreate:
pg_restore -d dbname --clean --create dumpfile
Parallel restore (faster for large databases):
pg_restore -d dbname -j 4 dumpfile
Single transaction (all-or-nothing):
psql --single-transaction dbname < dumpfile
Stop on first error:
psql --set ON_ERROR_STOP=on dbname < dumpfile
List archive contents without restoring:
pg_restore -l dumpfile
Schema only:
pg_restore -d dbname --schema-only dumpfile
Data only:
pg_restore -d dbname --data-only dumpfile
Prerequisites
- Target database must exist (unless using --create)
- Database users referenced in the dump must already exist
- Use template0 when creating the database to ensure a clean state
Sources:
PostgreSQL supports three fundamental backup methods:
1. SQL Dump
Logical backup using pg_dump (single database) or pg_dumpall (entire cluster). Generates SQL commands that recreate the database state. This method is portable across PostgreSQL versions and machine architectures, and doesn't block database operations during backup.
2. File System Level Backup
Physical backup that directly copies PostgreSQL's data directory files. Requires database shutdown for consistency, unless using file system snapshots. Faster for large databases but less portable.
3. Continuous Archiving and Point-in-Time Recovery (PITR)
Combines a base filesystem backup (typically via pg_basebackup) with continuous archiving of Write-Ahead Log (WAL) files. Enables restoration to any point in time and supports warm standby servers.
Each method has distinct strengths and weaknesses regarding backup speed, restore flexibility, version portability, and operational overhead.
Sources:
recovery_target_xid is a PostgreSQL recovery configuration parameter that specifies the transaction ID (XID) up to which recovery will proceed during point-in-time recovery (PITR).
Key characteristics:
- Type: String parameter
- Purpose: Defines a stopping point for recovery based on a specific transaction ID
- Scope: Can only be set at server start (in postgresql.conf or on the command line)
Important behavior:
Transaction IDs are assigned sequentially when transactions start, but transactions can complete in a different numeric order. Recovery includes transactions that committed before (and optionally including) the specified XID.
The precise stopping point depends on the recovery_target_inclusive parameter:
- on (default): Recovery stops just after the specified XID
- off: Recovery stops just before the specified XID
Mutual exclusivity:
Only one recovery target parameter can be specified at a time:
- recovery_target_xid
- recovery_target_time
- recovery_target_name
- recovery_target_lsn
- recovery_target
Using multiple will raise an error.
Sources:
PostgreSQL provides two main backup methods with different verification approaches:
For pg_basebackup (Physical Backups)
Use pg_verifybackup to verify backup integrity:
pg_verifybackup /path/to/backup
This tool:
- Validates the backup_manifest file generated during backup
- Checksums all files and compares against manifest values
- Verifies WAL records needed for recovery are present and parsable
- Detects missing or corrupted files
Common options:
# For tar-format backups (skip WAL parsing)
pg_verifybackup -n /path/to/backup
# Show progress
pg_verifybackup -P /path/to/backup
# Stop on first error
pg_verifybackup -e /path/to/backup
Critical limitation: pg_verifybackup cannot perform every check a running server will. You must still perform test restores to verify the backup actually works.
For pg_dump (Logical Backups)
No built-in verification tool exists. The only reliable method is test restoration:
# Restore to a test database
pg_restore -d test_database backup_file.dump
# Or for plain SQL dumps
psql test_database < backup_file.sql
pg_dump provides rudimentary corruption checking during backup creation by sequentially scanning all table data, but this doesn't verify the dump file itself is intact.
Best Practice for All Backups
Always perform test restores - this is the only way to truly verify backup integrity and recoverability, regardless of backup method.
Sources:
pg_restore is a PostgreSQL command-line utility that restores a PostgreSQL database from an archive file created by pg_dump in one of the non-plain-text formats (custom, directory, or tar).
Purpose
It reconstructs a database to the state it was in when the archive was created by issuing the necessary SQL commands. It only works with archives created by pg_dump - it cannot restore plain-text SQL dumps (use psql for those).
Supported Archive Formats
- Custom format (-Fc): PostgreSQL's compressed binary format
- Directory format (-Fd): Directory-based archive
- Tar format (-Ft): Standard tar archive
The format is auto-detected, so you don't need to specify it.
Key Capabilities
- Direct database restore: Connect directly to a database and restore
- Script generation: Output SQL commands to stdout for manual execution
- Selective restore: Choose specific tables, schemas, or objects to restore
- Parallel restore: Use -j N to restore using N parallel jobs
- Reordering: Reorder archive items before restoration
- Clean option: Drop database objects before recreating them
Basic Usage
# Restore to a database
pg_restore -d dbname archive_file.dump
# Restore with parallel jobs
pg_restore -d dbname -j 4 archive_file.dump
# Restore only specific table
pg_restore -d dbname -t tablename archive_file.dump
Sources:
Barman (Backup and Recovery Manager) is an open-source administration tool for disaster recovery of PostgreSQL servers. It allows database administrators to perform remote backups of multiple PostgreSQL servers in a centralized way.
Key capabilities:
- Physical backup and recovery of PostgreSQL databases
- Remote backup from a central server
- Point-in-Time Recovery (PITR)
- WAL (Write-Ahead Log) archiving and management
- Support for both rsync/SSH and streaming replication methods
- Retention policies for automatic backup management
- Incremental backups to save disk space
- Parallel backup and recovery operations
- Backup validation and verification
Maintained by: EnterpriseDB
License: GNU GPL 3
Language: Python
Current version: 3.16.2 (as of November 2025)
Barman runs on a dedicated backup server separate from your PostgreSQL database servers, making it suitable for managing disaster recovery across multiple database instances from a single location.
Sources:
Point-in-Time Recovery (PITR) is a PostgreSQL backup and recovery technique that enables restoring a database to any specific moment in time by combining a base backup with archived Write-Ahead Log (WAL) files.
How it works:
- Take a base backup of the entire database cluster
- Continuously archive WAL files as PostgreSQL records all database changes to the pg_wal/ subdirectory
- During recovery, restore the base backup and replay archived WAL files sequentially up to the desired point in time
Key capabilities:
- Arbitrary recovery targets: Restore to any timestamp, named restore point, or specific transaction ID after the base backup
- Consistent snapshots: Stop WAL replay at any point to get a consistent database state at that moment
- No downtime for backups: Base backups don't need to be perfectly consistent; WAL replay corrects internal inconsistencies
- Continuous protection: Archive WAL files indefinitely instead of taking frequent full backups
- Warm standby support: Apply the same WAL sequence to another server for high-availability configurations
Example recovery targets:
-- Recover to specific timestamp
recovery_target_time = '2025-01-15 12:00:00'
-- Recover to named restore point
recovery_target_name = 'before_migration'
-- Recover to specific transaction
recovery_target_xid = '1234567'
Sources:
recovery_target_name is a PostgreSQL recovery configuration parameter that specifies the named restore point (created with pg_create_restore_point()) to which recovery will proceed.
When you set this parameter, PostgreSQL will recover the database up to the point where you previously created a named restore point, then stop. This allows point-in-time recovery to a specific, named location rather than a timestamp or transaction ID.
Key characteristics:
- Must reference a restore point created with pg_create_restore_point()
- Mutually exclusive with recovery_target_time, recovery_target_xid, and recovery_target_lsn
- Only one recovery target parameter can be specified at a time
- Requires server restart to take effect
Example usage:
-- Creating a restore point (before recovery)
SELECT pg_create_restore_point('before_major_update');
-- In postgresql.conf or recovery.conf (for recovery)
recovery_target_name = 'before_major_update'
Sources:
Use the -j njobs or --jobs=njobs option to specify the number of parallel jobs when running pg_dump.
Syntax:
pg_dump -Fd -j 4 -f output_directory dbname
Requirements:
- Directory format only: Must use -Fd (directory format). This is the only format where multiple processes can write simultaneously.
- Database connections: pg_dump opens njobs + 1 connections to the database. Ensure your max_connections setting accommodates this.
- PostgreSQL version: The server must support synchronized snapshots (9.2+ for primary servers, 10+ for standbys) to ensure consistency across parallel workers.
How it works:
The parallel jobs dump njobs tables simultaneously, reducing dump time but increasing database server load.
Important limitation:
If worker processes encounter exclusive locks on database objects, the dump may fail. Workers use shared locks with NOWAIT and will abort rather than deadlock if conflicts occur.
Example with 8 parallel jobs:
pg_dump -Fd -j 8 -f /backup/mydb_dump mydatabase
Sources:
To recover PostgreSQL to a specific timestamp, use Point-in-Time Recovery (PITR) with the recovery_target_time parameter.
Configuration Steps
- Set the recovery target in postgresql.conf:
recovery_target_time = '2024-01-15 14:30:00+00:00'
- Create a recovery signal file:
touch recovery.signal
- Configure WAL restore command in postgresql.conf:
restore_command = 'cp /path/to/archive/%f %p'
- Start PostgreSQL - it will enter recovery mode and replay WAL files up to the specified timestamp.
Timestamp Format
The timestamp must be in the same format as PostgreSQL's timestamp with time zone data type:
- Use a numeric UTC offset: '2024-01-15 14:30:00+00:00'
- Or a full timezone name: '2024-01-15 14:30:00 Europe/Helsinki'
- Avoid timezone abbreviations unless timezone_abbreviations is pre-configured
Requirements
- A base backup taken before the target timestamp
- Continuous WAL archiving enabled
- WAL archive files covering the period from base backup to target time
- The recovery target time must be after the base backup's end time
Optional: Control recovery behavior
recovery_target_inclusive = true # Include the target transaction (default)
recovery_target_action = 'promote' # Auto-promote after recovery
Sources:
Point-in-Time Recovery (PITR) in PostgreSQL requires three phases: enabling WAL archiving, creating base backups, and configuring recovery.
Phase 1: Enable WAL Archiving
Edit postgresql.conf:
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
- %p = full path of the WAL file to archive
- %f = filename only
- The archive command must return exit code 0 on success, non-zero on failure
Restart PostgreSQL after configuration changes.
Phase 2: Create Base Backup
Method 1 (Recommended):
pg_basebackup -D /backup/location -Ft -z -P
Method 2 (Low-level API):
SELECT pg_backup_start(label => 'mybackup', fast => false);
Then copy the data directory (excluding pg_wal/), then:
SELECT * FROM pg_backup_stop(wait_for_archive => true);
Phase 3: Perform Recovery
- Stop PostgreSQL server
- Clear the data directory
- Restore base backup files with correct ownership
- Remove the contents of the pg_wal/ directory
- Create an empty recovery.signal file in the data directory
- Configure postgresql.conf:
restore_command = 'cp /mnt/server/archivedir/%f %p'
recovery_target_time = '2025-01-15 12:00:00' # optional: for PITR
- Start PostgreSQL - it will enter recovery mode automatically
- After recovery completes, the recovery.signal file is removed
Recovery Target Options
Specify one of:
- recovery_target_time = 'timestamp' - recover to a specific time
- recovery_target_name = 'restore_point' - recover to a named restore point
- recovery_target_xid = 'transaction_id' - recover to a transaction ID
- recovery_target = 'immediate' - recover to the end of the base backup
Sources:
A parallel pg_dump is a database backup operation that uses multiple concurrent jobs to dump tables simultaneously, speeding up the backup process for large PostgreSQL databases.
How it works:
- Use the -j or --jobs option with the number of parallel workers (e.g., pg_dump -j 4 -Fd -f backup_dir mydb)
- pg_dump spawns multiple worker processes that dump different tables at the same time
- Requires the directory output format (-Fd), the only format supporting parallel writes
Requirements:
- Must use directory format output (-Fd)
- Database must support synchronized snapshots (PostgreSQL 9.2+ for primary, 10+ for standby)
- Sufficient max_connections: pg_dump opens njobs + 1 connections to the database
Trade-offs:
- Faster dumps for large databases with many tables
- Higher database load during the backup
- Risk of deadlocks if other clients request exclusive locks during the dump
Example:
pg_dump -j 4 -Fd -f /backup/mydb_dir mydb
This runs 4 parallel workers, creating a directory-format backup in /backup/mydb_dir.
Sources:
The archive_command is a shell command executed by PostgreSQL to archive completed WAL (Write-Ahead Log) file segments.
Core Functionality:
- Executes a local shell command to copy/move WAL segments to archive storage
- Uses the %p placeholder for the full path of the file to archive
- Uses the %f placeholder for just the filename
- Must return a zero exit status on success
Requirements:
- Only active when archive_mode = on and archive_library is empty ('')
- Can only be set in postgresql.conf or on the server command line
- The server will wait and retry if the command returns a non-zero exit status
Example:
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
This command checks if the file doesn't already exist in the archive directory, then copies it there.
Important Notes:
- The command should ensure WAL files are not overwritten (use a test ! -f check)
- Setting it to an empty string while archive_mode = on causes WAL files to accumulate
- Cannot be used simultaneously with the archive_library setting (doing so raises an error)
Sources:
pg_dump is a PostgreSQL command-line utility for extracting a single database into a script file or archive. It creates consistent backups even while the database is actively being used, without blocking other users.
Primary Function:
- Exports a PostgreSQL database to a file containing SQL commands that can recreate the database in the same state
Key Characteristics:
- Non-blocking: Does not interfere with concurrent read or write operations
- Consistent snapshots: Produces logically consistent dumps even during active use
- Flexible output formats:
- Plain-text SQL script (default)
- Custom archive format (-Fc) - compressed, allows selective restore
- Directory format (-Fd) - supports parallel dumps and restores
- Tar archive (-Ft)
Basic Usage:
pg_dump dbname > outfile.sql
pg_dump -Fc dbname > outfile.dump # Custom format
pg_dump -Fd dbname -f dumpdir # Directory format
Selective Options:
- --schema-only: Export only table structures, no data
- --data-only: Export only data, no schemas
- -t table_name: Export specific table(s)
- -n schema_name: Export specific schema(s)
Important Note:
For regular production backups, PostgreSQL recommends continuous archiving (WAL archiving) instead of pg_dump, as it's generally more suitable for that purpose.
Sources:
WAL-G is an archival and restoration tool for databases in the cloud. It's designed for making encrypted, compressed backups (both full and incremental) of PostgreSQL, MySQL/MariaDB, and MS SQL Server databases and pushing/fetching them to/from remote storage (S3, Azure, Google Cloud Storage, etc.) without saving them to the local filesystem.
Key capabilities:
- Incremental backups: Delta backups that store only the difference between the previous backup and present state
- Compression: Supports lz4 (default), lzma, zstd, and brotli
- Encryption: Built-in encryption support for secure cloud storage
- Cloud-native: Direct integration with cloud storage providers
- WAL archiving: Handles PostgreSQL Write-Ahead Log archiving and restoration
WAL-G is particularly useful for automating PostgreSQL backup workflows in cloud environments and reducing storage costs through compression and incremental backups.
Sources:
The restore_command is a shell command that PostgreSQL executes to retrieve archived WAL (Write-Ahead Log) segments from storage during recovery operations.
Configuration
Set in postgresql.conf or on the server command line:
restore_command = 'cp /mnt/server/archivedir/%f %p'
Parameter Substitutions
- %f - Replaced by the name of the WAL file to retrieve from the archive
- %p - Replaced by the path where PostgreSQL should copy the file to
- %r - Replaced by the name of the file containing the last valid restart point
- %% - Literal percent character
Exit Status Requirements
The command must:
- Return exit status 0 only on success
- Return nonzero when the requested file doesn't exist in the archive (this is normal behavior, not an error)
- Handle requests for both WAL segments and .history files
Behavior
- Required for archive recovery (PITR - Point-in-Time Recovery)
- Optional for streaming replication
- If a file is unavailable in the archive, PostgreSQL looks in the pg_wal/ directory
- If the command is terminated by a signal (except SIGTERM) or a shell error, recovery aborts
- The default value is an empty string ('')
Sources:
pgBackRest is a reliable backup and restore solution designed specifically for PostgreSQL databases. It is an open-source utility that aims to be a simple, reliable backup and restore system that can seamlessly scale up to the largest databases and workloads.
Key capabilities:
- Parallel backup and restore for performance
- Full, incremental, and differential backup types
- Archive WAL (Write-Ahead Logs) for point-in-time recovery
- Multiple repository support (local, Amazon S3, Azure, GCS)
- Encryption and compression
- Backup rotation and retention policies
- Delta restore for faster recovery
- Page checksums for data integrity verification
pgBackRest is designed to handle PostgreSQL databases from small installations to multi-terabyte environments with minimal configuration complexity.
Sources:
PL/pgSQL > Control Structures
19 questions
The default step increment is 1. By default, the FOR loop increases the loop_counter by 1 after each iteration.
No. RETURN NEXT appends a row to the result set but does not terminate the function. Execution continues until a final RETURN statement (with no argument) is reached.
Two special variables are available: SQLSTATE (the error code) and SQLERRM (the error message).
The two expressions giving the lower and upper bound of the range are evaluated once when entering the loop.
No. While the NULL statement is available in PL/pgSQL for compatibility and clarity, it is actually optional in PostgreSQL. PL/pgSQL allows you to just write nothing instead. This differs from Oracle PL/SQL where NULL statements are required for empty blocks.
The SLICE value must be an integer constant not larger than the number of dimensions of the array.
Type casting is applied automatically when returning a scalar value. The expression value is automatically cast to the function's return type.
No. New errors raised in handler_statements cannot be caught by the same EXCEPTION clause; they propagate outward to the next outer exception handler.
The keyword can be spelled either ELSIF or ELSEIF. Both spellings are valid and functionally equivalent.
The search-expression is evaluated once and then compared against each WHEN expression. It is not re-evaluated for each WHEN clause.
CONTINUE can be used with all types of loops; it is not limited to use with unconditional loops.
QUERY_CANCELED and ASSERT_FAILURE are not caught by WHEN OTHERS. These two error types must be trapped explicitly by name if you need to handle them.
FOREACH iterates through array elements in storage order, regardless of the number of array dimensions.
Yes. Void functions can use RETURN (with no expression) to exit the function early before reaching the end of the function body.
RETURN QUERY appends the entire result set of a query (potentially many rows) to the function's result at once, while RETURN NEXT appends only a single row at a time.
Without a label, EXIT terminates the innermost loop. With a label, EXIT terminates the named loop or block.
The default is SLICE 0 (or omitted entirely), which iterates through individual array elements rather than array slices.
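Several of the answers above concern exception handlers; a minimal sketch showing SQLSTATE and SQLERRM being read inside a handler:
DO $$
BEGIN
    PERFORM 1 / 0;
EXCEPTION
    WHEN division_by_zero THEN
        RAISE NOTICE 'caught %: %', SQLSTATE, SQLERRM;
END $$;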
Partitioning > Partition-wise Operations
19 questions
Yes. Beginning with PostgreSQL 11, foreign tables can be attached as partitions, though with certain limitations: the foreign data wrapper must support tuple routing, and there is no validation that all rows obey the partition constraint.
Yes. PostgreSQL supports multi-level partitioning: a partition can itself be a partitioned table, nested to any depth. Partition method combinations such as LIST-LIST, LIST-RANGE, LIST-HASH, RANGE-RANGE, RANGE-LIST, RANGE-HASH, HASH-HASH, HASH-LIST, and HASH-RANGE can be created with declarative partitioning.
There must exist an equi-join condition between the partition keys of the joining tables.
Constraint exclusion uses each table's CHECK constraints and is only applied at plan time, making it slow. Partition pruning uses the table's partition bounds (only in declarative partitioning) and can occur both at planning time and during query execution, making it faster.
Yes, you can enable these settings on a per connection basis using the SET command, without needing to change them for the whole server.
PostgreSQL 13 allows partition-wise joins to happen between partitioned tables even when their partition bounds do not match exactly, using an advanced partition matching algorithm.
on (examine constraints for all tables), off (never examine constraints), and partition (examine constraints only for inheritance child tables and UNION ALL subqueries)
With enable_partitionwise_aggregate enabled, the number of nodes whose memory usage is restricted by work_mem appearing in the final plan can increase linearly according to the number of partitions being scanned, which can result in a large increase in overall memory consumption during query execution.
Only partial aggregation can be performed on a per-partition basis, and finalization must be performed later.
Planning takes O(N) time in the number of partitions, and the current implementation consumes severe amounts of CPU and memory for each partition pair.
With enable_partitionwise_join enabled, the number of nodes whose memory usage is restricted by work_mem can increase linearly according to the number of partitions being scanned, which can result in a large increase in overall memory consumption during query execution.
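As noted above, these planner settings can be enabled for a single session with SET rather than server-wide; for example:
-- Session-level only; both default to off
SET enable_partitionwise_join = on;
SET enable_partitionwise_aggregate = on;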
Concurrency Control > Deadlock Detection
19 questions
lock_timeout aborts any statement that waits longer than the specified time while attempting to acquire a lock. deadlock_timeout is the amount of time to wait on a lock before checking to see if there is a deadlock condition (it doesn't abort, it just triggers a deadlock check).
No, setting lock_timeout in postgresql.conf is not recommended because it would affect all sessions.
No, exactly which transaction will be aborted is difficult to predict and should not be relied upon.
The wait-for-graph contains a 'cycle' when a deadlock occurs, where the graph reaches the same transaction at the beginning of the graph after following waits-for edges.
No, this is not the number of rows that can be locked; that value is unlimited.
Only superusers and users with the appropriate SET privilege can change this setting. The context is 'superuser'.
SELECT relation::regclass, * FROM pg_locks WHERE NOT GRANTED;
It limits the average number of object locks used by each transaction. Individual transactions can lock more objects as long as the locks of all transactions fit in the lock table.
The 'granted' column. When granted is False, it indicates that this process is currently waiting to acquire this lock.
The pg_stat_database view includes a 'deadlocks' column that provides a cumulative count of all deadlocks per database. This feature was added in PostgreSQL 9.2.
FOR UPDATE lock is stronger and blocks SELECT FOR KEY SHARE commands. FOR NO KEY UPDATE acquires a weaker lock that will not block SELECT FOR KEY SHARE commands on the same rows.
Yes, ideally the setting should exceed your typical transaction time to improve the odds that a lock will be released before checking for deadlock.
Extensions > pg_trgm
19 questions
0.3. This threshold is used by the % operator to determine if two strings are similar enough. The value must be between 0 and 1.
12 bytes. The siglen parameter determines the signature length in bytes for gist_trgm_ops indexes. Valid values range from 1 to 2024 bytes.
strict_word_similarity() only considers sequences of full words (word boundaries), while word_similarity() considers all possible substrings. strict_word_similarity is useful for finding similarity to whole words, while word_similarity is better for finding similarity to parts of words.
PostgreSQL 9.1. This version introduced functions and operators to compare strings via the pg_trgm module.
Yes. GIN index build time can often be improved by increasing maintenance_work_mem. However, GiST index build time is not sensitive to this parameter.
Two spaces are prefixed and one space is suffixed. For example, the string 'cat' generates trigrams: ' c', ' ca', 'cat', 'at '.
PostgreSQL 9.1. Beginning in 9.1, pg_trgm GiST and GIN indexes support trigram-based index searches for LIKE and ILIKE queries.
An array of all trigrams in the input string. For example, show_trgm('cat') returns {' c', ' ca', 'cat', 'at '}. Each word has two spaces prefixed and one space suffixed.
0.5. This threshold is used by the <<% and %>> operators for strict word similarity comparisons that match word boundaries. The value must be between 0 and 1.
No. Operations like ordering by similarity distance where only a small number of closest matches is wanted can be implemented quite efficiently by GiST indexes, but not by GIN indexes.
0 to 1, where 1 means the strings are identical and 0 means they share no trigrams in common. The value represents the proportion of shared trigrams between the two strings.
No. Inequality operators are not supported by pg_trgm indexes. For inequality comparisons, you need to use regular B-tree indexes.
3 characters. The search pattern must be at least 3 characters long for pg_trgm indexes to be applied, since trigrams are three-character sequences. Queries with fewer than 3 characters will fall back to full table scans.
No. pg_trgm ignores non-word characters (non-alphanumerics) when extracting trigrams. For example, 'foo|bar' generates trigrams as if it were two separate words 'foo' and 'bar'.
No. You must first run CREATE EXTENSION pg_trgm; before you can create indexes using gin_trgm_ops or gist_trgm_ops operator classes.
1 minus the similarity score. The <-> operator returns the distance between two text strings, calculated as 1 - similarity(text1, text2). Lower distance means higher similarity.
GIN indexes are faster to search than GiST indexes. However, GIN indexes are slower to build or update, making GIN better suited for static data and GiST for frequently updated data.
PostgreSQL 9.3. Beginning in 9.3, pg_trgm indexes support trigram-based index searches for regular expression matches using the ~ and ~* operators.
Yes. By default, fastupdate is enabled for GIN indexes, causing index updates to be deferred and accumulated in a pending list until they can be processed in bulk.
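A small sketch tying several of these answers together (the extension must be installed first):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT show_trgm('cat');                   -- trigrams with padded spaces
SELECT similarity('postgres', 'postgras'); -- a value between 0 and 1
SELECT 'postgresql' % 'postgres';          -- true when similarity exceeds the threshold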
PL/pgSQL > Variables and Types
19 questions
PostgreSQL treats all arrays of a given element type as the same type, regardless of dimensionality. The specific dimensions you specify are essentially documentation and not enforced at runtime.
The CONSTANT option prevents the variable from being assigned to after initialization, so that its value will remain constant for the duration of the block.
Yes. Equal (=) can be used instead of PL/SQL-compliant :=. Both operators are functionally equivalent for variable assignment in PL/pgSQL.
The structure of a record variable is determined when an actual row is assigned to it via the SELECT or FOR command. Record variables take on the actual row structure of the row they are assigned during a SELECT or FOR command.
When a function returns a polymorphic type (anyelement, anyarray, anynonarray, anyenum, or anyrange), a special parameter $0 is created with its data type being the actual return type of the function, as deduced from the actual input types.
Within an exception handler. These variables are undefined outside exception handlers and display an error if used outside that context.
PostgreSQL 17. The feature allows plpgsql %TYPE and %ROWTYPE specifications to represent arrays of non-array types.
Until a record variable is first assigned to, it has no substructure, and any attempt to access a field in it will draw a run-time error.
The variable in the subblock will hide (shadow) the variable in the outer block. To access the outer block variable, you must use the block label to qualify its name using the syntax outer_block.variable_name.
The variables declared in the declarations section preceding a block are initialized to their default values every time the block is entered, not only once per function call.
%TYPE provides the data type of a table column or a previously-declared PL/pgSQL variable.
The SQL null value. If the DEFAULT clause is not given, the variable is initialized to the SQL null value.
Yes. In variable declarations, both DEFAULT and := (and =) can be used to assign an initial value and are completely equivalent. The syntax is: [ { DEFAULT | := | = } expression ]
A variable's default value is evaluated and assigned to the variable each time the block is entered, not just once per function call. This means variables are re-initialized every time their containing block is entered.
name [ CONSTANT ] type [ COLLATE collation_name ] [ NOT NULL ] [ { DEFAULT | := | = } expression ];
All variables declared as NOT NULL must have a nonnull default value specified.
Function parameters, as well as special variables such as FOUND, are implicitly declared in an outer block labeled with the function's name. They can be qualified by the function's name.
A local variable of a collatable data type can have a different collation associated with it by including the COLLATE option in its declaration. This overrides the collation that would otherwise be given to the variable.
Yes. You create a new composite type using the CREATE TYPE statement with field names and data types. A composite type represents the structure of a row or record; it is essentially just a list of field names and their data types.
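A short sketch illustrating block labels, shadowing, and qualified access as described above:
DO $$
<<outer_block>>
DECLARE
    quantity integer := 30;
BEGIN
    DECLARE
        quantity integer := 80;  -- shadows the outer variable
    BEGIN
        RAISE NOTICE 'inner %, outer %', quantity, outer_block.quantity;
    END;
END outer_block $$;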
Indexes > Covering Indexes
18 questions
50 tuples. This specifies the minimum number of updated or deleted tuples needed to trigger a VACUUM in any one table.
Yes. The syntax is: PRIMARY KEY (column_name [, ...]) INCLUDE (column_name [, ...]) or UNIQUE (column_name [, ...]) INCLUDE (column_name [, ...]).
No. B-tree deduplication is never used with indexes that have a non-key column.
No. For unique indexes with INCLUDE, the uniqueness condition applies only to the key columns, not to the included (payload) columns.
Be conservative about adding non-key payload columns to an index, especially wide columns. There is little point in including payload columns unless the table changes slowly enough that an index-only scan is likely to not need to access the heap.
While an index-only scan is possible with any coverage, it will be a win only if a significant fraction of the table's heap pages have their all-visible map bits set. The documentation does not specify an exact threshold.
INCLUDE columns are included in leaf tuples which correspond to heap tuples, but are not included in upper-level index entries used for tree navigation.
Data insertion will fail with an error like: 'ERROR: index row size [X] exceeds btree version 4 maximum 2704 for index [index_name]'.
Index-only scans have been observed to execute anywhere from 2x to 20x faster than regular index scans, depending on the workload and visibility map coverage.
PostgreSQL 11 (released October 18, 2018). The feature was authored by Anastasia Lubennikova, Alexander Korotkov, and Teodor Sigaev.
32 columns total (including both key columns and INCLUDE columns). This limit can be increased by recompiling PostgreSQL.
Dropping an included column can cause cascaded constraint and index deletion, because although the constraint is not enforced on included columns, it still depends on them.
0.2 (20%). This specifies a fraction of the table size to add to autovacuum_vacuum_threshold when deciding whether to trigger a VACUUM.
No. Columns listed in the INCLUDE clause don't need appropriate operator classes; the clause can include columns whose data types don't have operator classes defined for a given access method.
No. Expressions are not supported as included columns since they cannot be used in index-only scans.
VACUUM is responsible for updating the visibility map. Since VACUUM removes dead tuples, which renders table pages all-visible, it also maintains the visibility map.
CREATE UNIQUE INDEX index_name ON table_name(key_column) INCLUDE (non_key_column1, non_key_column2, ...);
Approximately 2704 bytes (for btree version 4). This is because at least 3 btree tuples must fit on an 8KB page, limiting each tuple to about 1/3 of the page size.
Indexes > Partial Indexes
18 questions
No. Subqueries and aggregate expressions are forbidden in the WHERE clause of partial indexes.
No. Parameterized query clauses do not work with a partial index. For example a prepared query with a parameter might specify 'x < ?' which will never imply 'x < 2' for all possible values of the parameter.
Yes. Concurrent builds of expression indexes and partial indexes are supported.
CREATE INDEX access_log_client_ip_ix ON access_log (client_ip) WHERE NOT (client_ip > inet '192.168.100.0' AND client_ip < inet '192.168.100.255');
The system can recognize simple inequality implications, for example 'x < 1' implies 'x < 2'; otherwise the predicate condition must exactly match part of the query's WHERE condition.
CREATE UNIQUE INDEX index_name ON table_name (columns) WHERE predicate; For example: CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target) WHERE success;
From a performance perspective, partial indexes should be considered when the percentage of rows qualifying for the condition is very small compared to those eliminated, with a recommended ratio of at most 20%.
A partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index.
No. The PostgreSQL documentation explicitly warns that creating a large set of non-overlapping partial indexes is a bad idea. The system does not understand the relationship among partial indexes, and will laboriously test each one to see if it's applicable to the current query.
All functions and operators used in an index definition must be 'immutable', that is, their results must depend only on their arguments and never on any outside influence (such as the contents of another table or the current time).
If your table is large enough that a single index really is a bad idea, you should look into using partitioning instead. With that mechanism, the system does understand that the tables and indexes are non-overlapping, so far better performance is possible.
Yes. Columns referenced in a partial index's WHERE clause are considered 'indexed columns' for HOT (Heap-Only Tuples) purposes, even if they aren't part of the actual indexed columns. This means updating these columns prevents HOT updates.
When the WHERE clause is present in a CREATE INDEX statement, a partial index is created. The syntax is: CREATE INDEX index_name ON table_name (columns) WHERE predicate;
Yes. The REINDEX command can be used to rebuild partial indexes, just like any other index type in PostgreSQL.
Yes. PostgreSQL supports partial indexes with arbitrary predicates including IS NULL and IS NOT NULL, so long as only columns of the table being indexed are involved.
A partial index is an index built over a subset of a table; the subset is defined by a conditional expression (called the predicate of the partial index). The index contains entries only for those table rows that satisfy the predicate.
The expression used in the WHERE clause can refer only to columns of the underlying table, but it can use all columns, not just the ones being indexed.
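A minimal sketch of a partial index and a query that can use it (the orders table and columns are illustrative):
CREATE INDEX orders_unshipped_idx ON orders (order_date) WHERE shipped = false;
-- Usable only when the query's WHERE clause implies the index predicate
SELECT * FROM orders WHERE shipped = false AND order_date > '2025-01-01';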
Concurrency Control > Advisory Locks
18 questions
Yes. Transaction-level advisory locks acquired during a transaction will be held by the prepared transaction until it is either committed with COMMIT PREPARED or rolled back with ROLLBACK PREPARED.
For advisory locks using a single bigint key, objsubid equals 1.
The transaction-level shared advisory lock functions are: pg_advisory_xact_lock_shared(bigint), pg_advisory_xact_lock_shared(int, int), pg_try_advisory_xact_lock_shared(bigint), and pg_try_advisory_xact_lock_shared(int, int).
The session-level shared advisory lock functions are: pg_advisory_lock_shared(bigint), pg_advisory_lock_shared(int, int), pg_try_advisory_lock_shared(bigint), pg_try_advisory_lock_shared(int, int), pg_advisory_unlock_shared(bigint), and pg_advisory_unlock_shared(int, int).
A shared advisory lock will block exclusive advisory lock requests on the same lock key while still allowing other shared advisory locks to be acquired for the same lock key.
If a session already holds a given advisory lock, additional requests by it will always succeed immediately, even if other sessions are awaiting the lock. This is true regardless of whether the existing lock hold and new request are at session level or transaction level.
pg_advisory_unlock_all() is implicitly invoked at session end, even if the client disconnects ungracefully.
The shared memory pool for advisory locks is defined by max_locks_per_transaction and max_connections.
No. pg_advisory_unlock_all() only releases session-level advisory locks held by the current session, not transaction-level locks.
pg_advisory_unlock(bigint) returns boolean. It returns true if the lock was successfully released, false if the lock was not held.
Yes. A lock can be acquired multiple times by its owning process; for each completed lock request there must be a corresponding unlock request before the lock is actually released.
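A short sketch of the lock-counting behavior described above, using an arbitrary key:
SELECT pg_advisory_lock(42);     -- first grant
SELECT pg_advisory_lock(42);     -- second grant on the same key succeeds immediately
SELECT pg_advisory_unlock(42);   -- lock is still held
SELECT pg_advisory_unlock(42);   -- now actually released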
Foreign Key Performance > FK Index Requirements
18 questions
MATCH FULL will not allow one column of a multicolumn foreign key to be null unless all foreign key columns are null. MATCH SIMPLE allows any of the foreign key columns to be null individually.
Only UNIQUE, PRIMARY KEY, REFERENCES (foreign key), and EXCLUDE constraints are affected by deferrable settings. NOT NULL and CHECK constraints are always checked immediately.
No. Hash indexes do not support uniqueness checking. Only B-tree indexes currently support unique indexes in PostgreSQL. Since foreign keys require unique or primary key constraints on referenced columns, they effectively require B-tree indexes.
PostgreSQL automatically creates a unique B-tree index when a unique constraint is defined for a table.
The confkey field (of type int2[]) in the pg_constraint catalog stores the list of referenced columns for foreign key constraints.
Equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned.
The available ON DELETE actions are: NO ACTION (produces error if constraint violated), RESTRICT (prevents deletion if referencing rows exist), CASCADE (deletes referencing rows automatically), SET NULL (sets referencing columns to NULL), and SET DEFAULT (sets referencing columns to their default values).
No. For GIN and BRIN indexes, query effectiveness is consistent regardless of which index column(s) the query conditions use, offering more flexibility than B-tree indexes which require leading column matching for optimal performance.
A B-tree multicolumn index can be used with query conditions on any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. Queries on only trailing columns will be less efficient.
The conkey field (of type int2[]) in the pg_constraint catalog stores the list of constrained columns (referencing columns) for table constraints including foreign keys.
PostgreSQL documentation states that indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized. Multicolumn indexes should be used sparingly, with single-column indexes sufficient in most situations.
NOT DEFERRABLE is the default, meaning constraints are always checked immediately after every statement.
They are checked after each statement by default, but their behavior can be changed within a transaction using SET CONSTRAINTS to defer checking until transaction commit.
Currently, only the B-tree, GiST, GIN, and BRIN index types support multiple-key-column indexes.
PostgreSQL automatically creates a unique B-tree index when a primary key is defined for a table.
The available ON UPDATE actions are: NO ACTION (produces error if constraint violated), RESTRICT (prevents update if referencing rows exist), CASCADE (updates referencing rows automatically), SET NULL (sets referencing columns to NULL), and SET DEFAULT (sets referencing columns to their default values).
No. PostgreSQL does not automatically create an index on the referencing columns when declaring a foreign key constraint. The declaration of a foreign key constraint does not automatically create an index on the referencing columns.
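Because the referencing side gets no automatic index, it is common to add one by hand when child-side lookups or cascading deletes matter; an illustrative example (table and column names are placeholders):
CREATE INDEX orders_customer_id_idx ON orders (customer_id);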
Indexes > Multicolumn Indexes
18 questions
32 columns, including INCLUDE columns. This limit can be altered when building PostgreSQL by modifying the pg_config_manual.h file.
A GiST index will be relatively ineffective if its first column has only a few distinct values, even if there are many distinct values in additional columns. The condition on the first column is the most important one for determining how much of the index needs to be scanned.
B-tree deduplication is never used with indexes that have a non-key column. Deduplication is completely disabled when INCLUDE columns are present.
Yes. A multicolumn GIN index can be used with query conditions that involve any subset of the index's columns.
No. The skip scan optimization only works for queries which use the = (equality) operator, so it will not work with inequalities or ranges.
Combined. The 32-column limit applies to the total of both key columns and INCLUDE columns together.
INDEX_MAX_KEYS. This value is defined in pg_config_manual.h and determines the maximum number of index keys (columns).
Only B-tree, GiST, GIN, and BRIN index types support multiple-key-column indexes. Hash and SP-GiST indexes do NOT support multicolumn indexes.
An index entry cannot exceed approximately one-third of a page (after TOAST compression, if applicable).
Yes. A multicolumn BRIN index can be used with query conditions that involve any subset of the index's columns.
No. Like GIN and unlike B-tree or GiST, a multicolumn BRIN index has the same search effectiveness regardless of which index column(s) the query conditions use.
When the leading columns have low cardinality (few distinct values). It works by generating dynamic equality constraints for each possible value in an unconstrained column when this proves faster than a full scan.
No. Currently, only B-tree, GiST, GIN, and BRIN index types support multiple-key-column indexes. SP-GiST does not support multicolumn indexes.
Indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized. Multicolumn indexes should be used sparingly, as in most situations an index on a single column is sufficient and saves space and time.
Equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will always be used to limit the portion of the index that is scanned. Constraints on columns to the right of these columns are checked in the index but do not reduce the portion of the index that has to be scanned.
B-tree, GiST, and SP-GiST index access methods support the INCLUDE clause for adding non-key columns.
This parameter determines the size of the block range at index creation time.
Yes. A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
Indexes > SP-GiST Indexes
17 questions
The quad_point_ops operator class supports the following indexable operators: <<, <@, <^, >>, >^, and ~=. It also supports the <-> ordering operator.
No, SP-GiST does not support unique indexes. Only B-tree currently supports unique indexes in PostgreSQL.
Yes, SP-GiST supports the INCLUDE clause. Currently, B-tree, GiST, and SP-GiST index access methods support this feature.
The box_ops operator class supports 12 indexable operators: <<, &<, &&, &>, >>, ~=, @>, <@, &<|, <<|, |>>, and |&>.
The range_ops operator class supports 9 indexable operators: &&, &<, &>, -|-, <<, <@, =, >>, and @>.
SP-GiST supports the fillfactor storage parameter, which can be set from 10 to 100.
quad_point_ops, kd_point_ops, and poly_ops support the <-> ordering operator, which enables k-nearest neighbor (k-NN) searches.
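A minimal sketch of a k-NN search over points (hypothetical table; quad_point_ops is the default SP-GiST operator class for point):
CREATE TABLE places (id serial PRIMARY KEY, pos point);
CREATE INDEX places_pos_spgist ON places USING spgist (pos);
-- 10 nearest neighbours of (0,0); the <-> ordering operator can use the index
SELECT id FROM places ORDER BY pos <-> point '(0,0)' LIMIT 10;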
The box_ops operator class was added to SP-GiST in PostgreSQL version 9.6.
No, SP-GiST does not support parallel index builds. Parallel index building currently applies only to B-tree, GIN, and BRIN index methods.
Individual tuples must fit on a single index page, which is 8kB by default.
Yes, SP-GiST supports index-only scans for some operator classes but not others. This functionality was added in PostgreSQL 12.
Six built-in SP-GiST operator classes are provided: kd_point_ops, quad_point_ops, range_ops, box_ops, poly_ops, and inet_ops. (Note: text_ops is a seventh operator class also available.)
B-tree, GiST, and SP-GiST access methods support exclusion constraints. GIN does not support exclusion constraints.
The box_ops operator class provides a quadtree for rectangles. Each rectangle is represented by a point in a four-dimensional space, so the number of quadrants equals 16.
Yes, SP-GiST supports REINDEX CONCURRENTLY, which was introduced in PostgreSQL version 12.
SP-GiST is an abbreviation for space-partitioned GiST (Generalized Search Tree).
Window Functions > Ranking Functions
17 questions
Yes, the OVER clause can be empty (e.g., ROW_NUMBER() OVER ()). While the OVER clause itself is mandatory, its contents (PARTITION BY and ORDER BY) are optional.
When all rows have the same value (all rows are peers), percent_rank() returns 0 for all rows, as they are all tied for the first position.
No, ranking functions (row_number, rank, dense_rank, percent_rank, cume_dist, ntile) do not support IGNORE NULLS or RESPECT NULLS options. These options are only available for lag, lead, first_value, last_value, and nth_value functions, and were added in PostgreSQL 19.
dense_rank() effectively counts peer groups. It assigns consecutive rank numbers where each distinct value or set of tied values receives the next sequential rank number.
(number of partition rows preceding or peers with current row) / (total partition rows)
When a partition contains only a single row, cume_dist() returns 1.0, since that row represents 100% of the partition's cumulative distribution.
Yes, ntile() requires a mandatory num_buckets parameter (integer type) that specifies the number of buckets to divide the partition into. The parameter must be greater than 0 and cannot be NULL.
No, PARTITION BY is optional. If you omit PARTITION BY, there is a single partition containing all rows, and the window function will treat the whole result set as a single partition.
rank() assigns the same rank to rows with equal values. The next rank after tied rows skips numbers based on the tie count. For example, if two rows tie for rank 2, they both receive rank 2, and the next row receives rank 4 (skipping rank 3).
row_number() returns 1,2,3,4,5,6,7,8 (unique sequential numbers). rank() returns 1,1,1,4,4,6,7,8 (same rank for ties with gaps). dense_rank() returns 1,1,1,2,2,3,4,5 (same rank for ties without gaps).
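The same numbers can be reproduced with a self-contained query (the VALUES list is illustrative):
SELECT score,
       row_number() OVER w AS rn,
       rank()       OVER w AS rnk,
       dense_rank() OVER w AS drnk
FROM (VALUES (90), (90), (90), (80), (80), (70), (60), (50)) AS t(score)
WINDOW w AS (ORDER BY score DESC);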
When you specify 1 as the num_buckets parameter, ntile() assigns all rows to bucket 1, treating the whole partition as a single bucket.
Data Types - Bit String
17 questions
When explicitly casting a bit-string value to bit varying(n), it will be truncated on the right if it is more than n bits.
octet_length(B'1011111011') returns 2, which is the byte count (10 bits = 2 bytes).
Casting an integer to a wider bit width will sign-extend on the left. For example, cast(-44 as bit(12)) produces B'111111010100'.
The first (leftmost) bit is bit 0. Both get_bit() and set_bit() use zero-indexing.
When explicitly casting a bit-string value to bit(n), it will be truncated or zero-padded on the right to be exactly n bits, without raising an error.
bit varying without a length specification means unlimited length.
B'10001' & B'01101' yields B'00001'. The bitwise AND operator requires equal-length inputs.
Yes, both forms of bit-string constant (B and X notation) can be continued across lines in the same way as regular string constants.
For bit(n) type data, it is an error to attempt to store shorter or longer bit strings - the length must match exactly.
Casting to just bit (without specifying width) means bit(1), which delivers only the least significant bit.
A bit string requires 1 byte for each group of 8 bits, plus 5 or 8 bytes overhead depending on the length of the string. Long values may be compressed or moved out-of-line.
position(B'010' in B'000001101011') returns 8, which locates the substring within the bit string.
partitioning
17 questions
PARTITION BY Clause in PostgreSQL
PARTITION BY is a clause used in window functions to divide the result set into partitions (groups of rows) over which the window function operates independently. Each partition is processed separately, and the window function resets for each new partition.
Syntax
function_name(...) OVER (
PARTITION BY column1, column2, ...
[ORDER BY column3, ...]
)
Key Characteristics
- Scope: Operates within SELECT statements only
- Purpose: Creates logical groups without reducing rows (unlike GROUP BY which collapses rows)
- Reset Behavior: Window function calculations reset at each partition boundary
- Multiple Columns: Can partition by multiple columns (comma-separated)
Concrete Example
-- Rank employees by salary within each department
SELECT
employee_name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;
Result behavior:
- Rankings restart at 1 for each distinct department value
- Without PARTITION BY, rankings would be across all rows
- Preserves all original rows (unlike GROUP BY)
Common Window Functions Using PARTITION BY
- Ranking: ROW_NUMBER(), RANK(), DENSE_RANK()
- Aggregates: SUM(), AVG(), COUNT(), MIN(), MAX()
- Offset: LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE()
PARTITION BY vs GROUP BY
-- GROUP BY: Collapses to one row per department
SELECT department, AVG(salary)
FROM employees
GROUP BY department;
-- PARTITION BY: Keeps all rows, adds calculated column
SELECT employee_name, department, salary,
AVG(salary) OVER (PARTITION BY department) as dept_avg
FROM employees;
Availability: PostgreSQL 8.4+ (released July 2009)
Source: PostgreSQL Official Documentation - Window Functions
Range Partitioning in PostgreSQL
Range partitioning divides a table into partitions based on a specified column's value ranges. Each partition holds rows where the partitioning key falls within a defined range (e.g., dates, numbers).
Syntax (PostgreSQL 10+)
CREATE TABLE measurements (
id SERIAL,
sensor_id INT,
recorded_at DATE NOT NULL,
temperature NUMERIC
) PARTITION BY RANGE (recorded_at);
-- Create partitions
CREATE TABLE measurements_2024_q1 PARTITION OF measurements
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE measurements_2024_q2 PARTITION OF measurements
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
Key Specifications
- Range boundaries: Lower bound is INCLUSIVE, upper bound is EXCLUSIVE ([lower, upper))
- Partition key: Can be a single column or expression, must be part of any UNIQUE/PRIMARY KEY constraint
- NULL handling: NULL values in the partition key do not match any range partition; route them to a DEFAULT partition
- Default partition (PostgreSQL 11+): Catches rows not matching any partition
CREATE TABLE measurements_default PARTITION OF measurements DEFAULT;
Performance Notes
- Query planner uses partition pruning (PostgreSQL 11+) or constraint exclusion (PostgreSQL 10) to skip irrelevant partitions
- Indexing: Create indexes on EACH partition individually (not inherited from parent in versions <11)
- Optimal partition count: 100-1000 partitions maximum for best query planning performance
Version: PostgreSQL 10+ (declarative partitioning syntax shown above)
Source: PostgreSQL Official Documentation - Table Partitioning (v16)
PostgreSQL Partitioning Performance Benefits
Partitioning in PostgreSQL provides these verified performance improvements:
1. Query Performance via Partition Pruning
Benefit: Queries scan only relevant partitions, not the entire table.
- Speedup: 10-100x faster for queries filtering on partition key
- Mechanism: Query planner excludes irrelevant partitions at plan time
- Requirement: Query must include partition key in WHERE clause
-- Example: Date-partitioned table
CREATE TABLE measurements (
logdate DATE NOT NULL,
value NUMERIC
) PARTITION BY RANGE (logdate);
-- Query only scans jan_2024 partition, not all 12 months
SELECT * FROM measurements WHERE logdate = '2024-01-15';
2. Bulk DELETE Performance
Benefit: DROP TABLE or DETACH PARTITION is 100-1000x faster than DELETE
- DELETE on 1M rows: Minutes (requires VACUUM afterward)
- DROP PARTITION: Milliseconds (instant metadata operation)
-- Fast: drops partition instantly
DROP TABLE measurements_2023_01;
-- Slow: scans and deletes rows one by one
DELETE FROM measurements WHERE logdate < '2023-02-01';
3. Index Size Reduction
Benefit: Smaller indexes improve cache hit rates and maintenance speed
- Impact: B-tree index depth reduced when each partition's index is smaller
- Rule: Index on 10M row table split into 10x 1M row partitions = shallower B-trees = fewer I/O operations
4. Maintenance Operations (VACUUM, ANALYZE)
Benefit: Parallel maintenance on independent partitions
- Can vacuum/analyze individual partitions without locking entire table
- Autovacuum workers can process multiple partitions concurrently
Key Limitation
Partitioning does NOT improve performance for queries without partition key filtering. Full table scans still read all partitions.
Version Requirement
- PostgreSQL 10+: Declarative partitioning with partition pruning
- PostgreSQL 11+: Improved partition pruning during execution (not just planning)
- PostgreSQL 13+: Pruning works with prepared statements
How Indexes Work on Partitioned Tables in PostgreSQL
A partitioned table's index is never a single physical index spanning all partitions; each partition has its own index. Before PostgreSQL 11, these per-partition indexes had to be created manually and independently.
Creating Indexes
When you create an index on a partitioned table (PostgreSQL 11+), it automatically creates corresponding indexes on all existing and future partitions:
CREATE INDEX idx_sales_date ON sales (sale_date);
This creates:
- A partitioned index on the parent table (metadata only, not a physical index)
- Individual local indexes on each partition
- Automatic index creation on any new partitions added later
Key Behaviors
1. Pre-PostgreSQL 11: Indexes must be created manually on each partition. No automatic propagation exists.
2. PostgreSQL 11+: CREATE INDEX on parent automatically propagates to all partitions.
3. Unique Constraints: A unique index on a partitioned table must include the partition key. This is required because PostgreSQL cannot efficiently enforce uniqueness across partitions otherwise.
-- This works (includes partition key)
CREATE UNIQUE INDEX idx_unique ON sales (id, sale_date);
-- This fails if sale_date is the partition key
CREATE UNIQUE INDEX idx_unique ON sales (id);
4. Primary Keys: Same rule applies - must include partition key columns.
5. ATTACH PARTITION: When attaching an existing table, it must already have a compatible index matching the parent's partitioned index, or the attach will fail.
CREATE INDEX idx_sales_2024_date ON sales_2024 (sale_date);
ALTER TABLE sales ATTACH PARTITION sales_2024 FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
6. Performance: Partition pruning works with indexes. Queries that filter on partition keys scan only relevant partition indexes, not all of them.
Version: PostgreSQL 11+ (automatic index propagation). For PostgreSQL 10 and earlier, manual index creation per partition required.
Source: PostgreSQL Official Documentation - Table Partitioning (versions 11-16)
Yes, PostgreSQL supports foreign keys on partitioned tables since version 11.
Foreign Keys FROM Partitioned Tables
Since PostgreSQL 11, you can define foreign keys that reference from a partitioned table to another table:
CREATE TABLE orders (
order_id BIGINT PRIMARY KEY,
customer_id INT REFERENCES customers(customer_id)
) PARTITION BY RANGE (order_id);
The foreign key constraint is automatically applied to all partitions.
Foreign Keys TO Partitioned Tables
Since PostgreSQL 12, you can also define foreign keys that reference a partitioned table's primary key:
CREATE TABLE orders (
order_id BIGINT PRIMARY KEY
) PARTITION BY RANGE (order_id);
CREATE TABLE order_items (
item_id BIGINT PRIMARY KEY,
order_id BIGINT REFERENCES orders(order_id)
);
Critical Limitations
- Primary/Unique Key Requirement: The referenced column(s) must include ALL partition key columns. For example:
-- This works - partition key is part of referenced columns
CREATE TABLE orders (
order_id BIGINT,
order_date DATE,
PRIMARY KEY (order_id, order_date)
) PARTITION BY RANGE (order_date);
-- This fails - partition key not in referenced columns
CREATE TABLE orders (
order_id BIGINT PRIMARY KEY,
order_date DATE
) PARTITION BY RANGE (order_date);
- ON DELETE/UPDATE CASCADE: Fully supported since PostgreSQL 11.
Source: PostgreSQL 11+ official documentation on Table Partitioning and Declarative Partitioning limitations.
PostgreSQL Partitioning Limitations (PostgreSQL 11+)
Hard Limits
Maximum number of partitions: No hard limit, but practical limit is thousands (not tens of thousands). Performance degrades with query planning time proportional to partition count. PostgreSQL 13+ improved this significantly.
Partition key restrictions:
- Cannot change partition key columns after creation
- Partition key expressions must be immutable
- Cannot include generated columns (until PostgreSQL 14)
- Cannot use system columns
Constraint Limitations
Primary keys and unique constraints must include ALL partition key columns:
-- This WORKS:
CREATE TABLE orders (
order_id int,
order_date date,
PRIMARY KEY (order_id, order_date)
) PARTITION BY RANGE (order_date);
-- This FAILS:
CREATE TABLE orders (
order_id int,
order_date date,
PRIMARY KEY (order_id) -- Missing partition key
) PARTITION BY RANGE (order_date);
Operational Limitations
Foreign keys:
- PostgreSQL 10: No foreign key support on partitioned tables
- PostgreSQL 11+: Foreign keys FROM partitioned tables (referencing other tables) work
- PostgreSQL 12+: Foreign keys TO partitioned tables (other tables referencing them) work
Row movement:
- PostgreSQL 10: UPDATEs that change partition key fail
- PostgreSQL 11+: Row movement allowed but slower (DELETE + INSERT)
Global indexes: Not supported. Each partition has separate indexes. No single index spans all partitions.
Triggers: Before PostgreSQL 13, BEFORE ROW triggers cannot be created on partitioned tables (only on individual partitions).
Performance Limitations
Partition pruning requires:
- WHERE clause directly references partition key with constants or stable expressions
- Query planner can eliminate partitions at plan time (not always possible with prepared statements using parameters)
Partition-wise join (including parallel plans over partition pairs) is not available before PostgreSQL 11.
Source: PostgreSQL 17 Official Documentation - Table Partitioning (https://www.postgresql.org/docs/current/ddl-partitioning.html)
List Partitioning in PostgreSQL
List partitioning is a table partitioning method in PostgreSQL (available since version 10) where rows are distributed to partitions based on explicit lists of key values.
Definition
In list partitioning, you specify which discrete values belong to each partition using the FOR VALUES IN clause. Each partition contains rows where the partition key matches one of the listed values.
Syntax
CREATE TABLE table_name (columns...)
PARTITION BY LIST (partition_key_column);
CREATE TABLE partition_name PARTITION OF table_name
FOR VALUES IN (value1, value2, value3);
Example
-- Create parent table
CREATE TABLE orders (
order_id INT,
country_code TEXT,
order_date DATE,
amount NUMERIC
) PARTITION BY LIST (country_code);
-- Create partitions for specific countries
CREATE TABLE orders_usa PARTITION OF orders
FOR VALUES IN ('US', 'USA');
CREATE TABLE orders_europe PARTITION OF orders
FOR VALUES IN ('DE', 'FR', 'UK', 'IT');
CREATE TABLE orders_asia PARTITION OF orders
FOR VALUES IN ('JP', 'CN', 'IN');
Key Characteristics
- Use case: Best for columns with discrete, known values (status codes, country codes, categories)
- NULL handling: Create a partition with FOR VALUES IN (NULL) to handle NULL values
- Default partition: Use the DEFAULT keyword for values not in any list: CREATE TABLE orders_other PARTITION OF orders DEFAULT;
- No overlaps: Value lists across partitions must not overlap (PostgreSQL enforces this)
- Multiple values: Each partition can contain multiple values in its list
Performance Note
List partitioning enables partition pruning: queries with WHERE country_code = 'US' will scan only the relevant partition, not the entire table.
Source: PostgreSQL 17 Official Documentation - Table Partitioning
PostgreSQL constraint_exclusion Setting
constraint_exclusion is a query planner setting that controls whether the PostgreSQL query planner examines table constraints (such as CHECK constraints and table inheritance bounds) to determine if entire tables or partitions can be skipped during query execution.
Valid Values (PostgreSQL 9.2+)
- off: Constraints are never examined (default before PostgreSQL 9.2)
- on: Constraints are examined for all tables
- partition: Constraints are examined only for inheritance child tables and UNION ALL subqueries (default since PostgreSQL 9.2)
Default Value
partition (since PostgreSQL 9.2)
How It Works
When enabled, the planner checks if a query's WHERE clause logically contradicts a table's CHECK constraint or partition bounds, allowing it to skip scanning that table entirely.
Example:
-- Table with CHECK constraint
CREATE TABLE orders_2024 (
order_date DATE CHECK (order_date >= '2024-01-01' AND order_date < '2025-01-01'),
amount NUMERIC
) INHERITS (orders);
-- Set constraint_exclusion
SET constraint_exclusion = partition;
-- Query that can skip orders_2024
SELECT * FROM orders WHERE order_date >= '2025-01-01';
-- Planner will exclude orders_2024 because constraint contradicts WHERE clause
Recommended Setting
Use partition (default). Setting to on adds overhead for non-partitioned tables with minimal benefit. Only set to on if you have many tables with CHECK constraints used for manual data filtering.
Configuration
-- Session level
SET constraint_exclusion = partition;
-- postgresql.conf
constraint_exclusion = partition
Source: PostgreSQL 16 Official Documentation - Runtime Config: Query Planning
Partition-wise Join in PostgreSQL
Partition-wise join is an optimization technique where PostgreSQL joins partitioned tables by joining corresponding partitions directly, rather than treating them as a single large table. This reduces memory usage and improves parallelization.
How It Works
When joining two partitioned tables with the same partitioning scheme, PostgreSQL can join matching partitions independently:
-- Example: Both tables partitioned by date range
CREATE TABLE sales_2023 PARTITION OF sales
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2023 PARTITION OF orders
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
-- With partition-wise join enabled, this joins sales_2023 with orders_2023,
-- sales_2024 with orders_2024, etc.
SELECT * FROM sales s JOIN orders o ON s.order_id = o.id;
Configuration
Parameter: enable_partitionwise_join
Default: off (as of PostgreSQL 16)
Enable with:
SET enable_partitionwise_join = on;
Or in postgresql.conf:
enable_partitionwise_join = on
Requirements for Activation
- Both tables must be partitioned with identical partitioning schemes (same partition bounds and types)
- Join condition must include the partition key
- The enable_partitionwise_join parameter must be on
Availability
- Introduced: PostgreSQL 11 (2018), via enable_partitionwise_join
- Aggregate support: PostgreSQL 11 also added partition-wise grouping/aggregation (enable_partitionwise_aggregate)
Performance Impact
Benefits: Smaller per-partition memory footprint, better parallelization (each partition pair can be joined by separate workers).
Source: PostgreSQL 16 Official Documentation - Query Planning
PostgreSQL Partitioning Methods
PostgreSQL supports 3 native partitioning methods (as of PostgreSQL 10+):
1. Range Partitioning
Partition by a range of values (e.g., date ranges, numeric ranges).
CREATE TABLE measurements (
logdate date NOT NULL,
value int
) PARTITION BY RANGE (logdate);
CREATE TABLE measurements_y2024 PARTITION OF measurements
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
2. List Partitioning
Partition by discrete list of values (e.g., country codes, categories).
CREATE TABLE orders (
order_id int,
country_code text
) PARTITION BY LIST (country_code);
CREATE TABLE orders_us PARTITION OF orders
FOR VALUES IN ('US', 'USA');
3. Hash Partitioning
Partition by hash of a column value for even distribution (PostgreSQL 11+).
CREATE TABLE users (
user_id bigint,
username text
) PARTITION BY HASH (user_id);
CREATE TABLE users_p0 PARTITION OF users
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
Key Requirements
- Partition key must be part of all unique constraints/primary keys
- Maximum partitions: no hard limit, but 100-1000 partitions is practical
- Partitioning must be declared at table creation with PARTITION BY
Source: PostgreSQL Official Documentation v16 - Table Partitioning (https://www.postgresql.org/docs/current/ddl-partitioning.html)
Partition-wise Aggregation in PostgreSQL
Partition-wise aggregation is a query optimization technique where PostgreSQL pushes down aggregate calculations (like SUM, COUNT, AVG) to individual partitions and then combines the results, rather than aggregating data from all partitions together.
How It Works
When enabled, PostgreSQL:
- Computes partial aggregates on each partition independently
- Combines (finalizes) these partial results into the final aggregate
This enables parallel processing of partitions and can significantly improve performance for aggregations on partitioned tables.
Configuration
Controlled by the enable_partitionwise_aggregate parameter (default: off):
-- Enable partition-wise aggregation
SET enable_partitionwise_aggregate = on;
-- Or in postgresql.conf
enable_partitionwise_aggregate = on
Requirements
- PostgreSQL version: 11+ (introduced in PostgreSQL 11)
- For full partition-wise aggregation, the GROUP BY clause must include all partition key columns; otherwise only partial per-partition aggregation is possible
- Works with declarative partitioning (not table inheritance)
- Compatible aggregates: SUM, COUNT, AVG, MIN, MAX, and other combinable aggregates
Example
-- Partitioned table
CREATE TABLE sales (
sale_date DATE,
region TEXT,
amount NUMERIC
) PARTITION BY RANGE (sale_date);
-- Query using partition-wise aggregation
SET enable_partitionwise_aggregate = on;
SELECT region, SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY region;
With partition-wise aggregation enabled, PostgreSQL computes SUM(amount) per region on each partition, then combines results.
Source
PostgreSQL 17 Official Documentation: Partition Pruning and Constraint Exclusion
pg_partman Extension
pg_partman is a PostgreSQL extension that automates the creation and management of time-based and serial-based table partitions. It was developed by Keith Fiske and is available at https://github.com/pgpartman/pg_partman.
Core Functionality:
- Automatically creates new partitions before they're needed
- Removes or archives old partitions based on retention policies
- Supports both native PostgreSQL partitioning (10+) and trigger-based inheritance (legacy)
- Handles time-based (daily, weekly, monthly, yearly, hourly) and serial/ID-based partitioning
Installation:
CREATE EXTENSION pg_partman;
Basic Setup Example (time-based partitioning):
-- Create parent table (PostgreSQL 10+)
CREATE TABLE measurements (
time TIMESTAMPTZ NOT NULL,
device_id INT,
value NUMERIC
) PARTITION BY RANGE (time);
-- Initialize partman (creates 4 partitions ahead, keeps 4 partitions back)
SELECT partman.create_parent(
p_parent_table := 'public.measurements',
p_control := 'time',
p_type := 'native',
p_interval := 'daily',
p_premake := 4
);
Maintenance Function:
-- Call this via cron/pg_cron to maintain partitions
SELECT partman.run_maintenance_proc();
Key Configuration Parameters:
- p_interval: Partition interval (e.g., '1 day', '1 week', '1 month', '100000' for serial)
- p_premake: Number of partitions to create in advance (default: 4)
- p_retention: How far back to keep partitions (e.g., '30 days')
- p_retention_keep_table: Boolean for whether to drop or just detach old partitions
Requirements:
- PostgreSQL 9.4+ (for trigger-based), 10+ recommended (for native partitioning)
- Requires the partman schema (automatically created with the extension)
- Needs periodic execution of run_maintenance_proc() (recommended: every hour or more frequently)
Source: Official repository at https://github.com/pgpartman/pg_partman and PostgreSQL Extension Network (PGXN)
Partition Pruning in PostgreSQL
Partition pruning is PostgreSQL's query optimization technique that automatically excludes (skips scanning) partitions that cannot contain rows matching the query's WHERE clause conditions. This significantly improves query performance on partitioned tables by reducing the amount of data scanned.
How It Works
When you query a partitioned table with filters on the partition key, the planner analyzes the WHERE clause and determines which partitions could possibly contain matching rows. Partitions that provably cannot contain matches are excluded from the query plan.
Example
-- Create partitioned table
CREATE TABLE measurements (
id bigint,
logdate date NOT NULL,
value numeric
) PARTITION BY RANGE (logdate);
CREATE TABLE measurements_2024_01 PARTITION OF measurements
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE measurements_2024_02 PARTITION OF measurements
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Query with partition pruning
SELECT * FROM measurements WHERE logdate = '2024-01-15';
-- Only scans measurements_2024_01, prunes measurements_2024_02
Requirements for Pruning
- Pruning enabled: Set enable_partition_pruning = on (default since PostgreSQL 11)
- Partition key in WHERE clause: Query must filter on partition key columns
- Immutable operators: Use operators that can be evaluated at plan time (=, <, >, BETWEEN, IN)
Verification
Check if pruning occurred using EXPLAIN:
EXPLAIN SELECT * FROM measurements WHERE logdate = '2024-01-15';
-- Look for "Partitions removed: N" in output
Version: Available since PostgreSQL 11 (2018). PostgreSQL 11+ performs pruning at both plan time and execution time (for prepared statements with parameters).
Yes, you can add indexes to partitioned tables in PostgreSQL (version 11+).
How it works:
- Create an index on the partitioned table (parent) using standard CREATE INDEX
- PostgreSQL automatically creates matching indexes on all existing partitions
- Future partitions inherit the index definition automatically
Example:
-- Create index on partitioned table
CREATE INDEX idx_sales_date ON sales (sale_date);
-- This automatically creates indexes on all partitions:
-- sales_2023_q1, sales_2023_q2, etc.
Critical details:
- Version requirement: PostgreSQL 11+ (automatic index propagation introduced in 11)
- Index types supported: All standard index types (B-tree, Hash, GiST, GIN, BRIN)
- CONCURRENTLY limitation: CREATE INDEX CONCURRENTLY does NOT work on partitioned tables directly. You must create indexes on each partition individually with CONCURRENTLY if you need non-blocking creation.
- Unique indexes: Must include the partition key column(s) to work across partitions
For concurrent index creation:
-- Create index on each partition individually
CREATE INDEX CONCURRENTLY idx_sales_2023_q1_date ON sales_2023_q1 (sale_date);
CREATE INDEX CONCURRENTLY idx_sales_2023_q2_date ON sales_2023_q2 (sale_date);
-- Then create on parent (fast, already exists on partitions)
CREATE INDEX idx_sales_date ON sales (sale_date);
Source: PostgreSQL 16 Official Documentation - Table Partitioning, Section 5.11.2.3
Attaching an Existing Table as a Partition in PostgreSQL
Use the ALTER TABLE ... ATTACH PARTITION command. The existing table must match the parent table's structure and satisfy the partition constraint.
Syntax
ALTER TABLE parent_table_name
ATTACH PARTITION existing_table_name FOR VALUES partition_bound_spec;
Requirements (PostgreSQL 10+)
- Identical structure: Column names, types, and order must match the parent table exactly
- Constraints: All NOT NULL and CHECK constraints on the parent must exist on the child
- Partition constraint: Existing data must satisfy the partition boundary (PostgreSQL will validate this)
- Ownership: You must own both tables
Examples
Range partition:
-- Attach table as partition for January 2024
ALTER TABLE measurements
ATTACH PARTITION measurements_jan2024 FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
List partition:
-- Attach table as partition for specific regions
ALTER TABLE sales
ATTACH PARTITION sales_west FOR VALUES IN ('CA', 'OR', 'WA');
Hash partition (PostgreSQL 11+):
-- Attach as hash partition with modulus 4, remainder 0
ALTER TABLE orders
ATTACH PARTITION orders_p0 FOR VALUES WITH (MODULUS 4, REMAINDER 0);
Validation Lock
Attaching scans the existing data to validate the partition constraint. Before PostgreSQL 12 this required an ACCESS EXCLUSIVE lock on the parent table; since PostgreSQL 12 a SHARE UPDATE EXCLUSIVE lock suffices. The validation scan can be avoided by adding a CHECK constraint matching the partition bounds to the table before attaching it. Note that the CONCURRENTLY option exists only for DETACH PARTITION (PostgreSQL 14+), not for ATTACH PARTITION.
Source: PostgreSQL 17 official documentation, Section 5.11.2.2 (Partition Maintenance)
Moving Data Between Partitions in PostgreSQL
Direct Answer: You cannot directly "move" a row between partitions. Instead, you must DELETE from the source partition and INSERT into the target partition within a transaction, or use UPDATE to change the partition key value (PostgreSQL 11+).
Method 1: UPDATE Partition Key (PostgreSQL 11+)
PostgreSQL 11 and later automatically moves rows when you update the partition key:
BEGIN;
UPDATE parent_table
SET partition_key_column = new_value
WHERE id = 123;
COMMIT;
This internally deletes from the old partition and inserts into the new one. The operation is atomic within the transaction.
Configuration: Ensure enable_partition_pruning = on (default since PG 11).
Method 2: Explicit DELETE + INSERT
For earlier versions or explicit control:
BEGIN;
WITH moved_rows AS (
DELETE FROM parent_table
WHERE id = 123
RETURNING *
)
INSERT INTO parent_table
SELECT * FROM moved_rows;
COMMIT;
The RETURNING clause captures deleted data, and the insert statement routes it to the correct partition based on partition key.
Bulk Move Between Partitions
For moving many rows (e.g., repartitioning):
BEGIN;
WITH moved AS (
DELETE FROM parent_table
WHERE partition_key_column BETWEEN old_start AND old_end
RETURNING *
)
INSERT INTO parent_table SELECT * FROM moved;
COMMIT;
Performance Note: For large datasets (>100k rows), consider batching in chunks of 10,000-50,000 rows to avoid long-running transactions.
Source
PostgreSQL 11+ Documentation: Table Partitioning - Row Movement
PostgreSQL Query Optimizer and Partition Handling
PostgreSQL uses constraint exclusion and partition pruning (PostgreSQL 11+) to optimize queries on partitioned tables.
Partition Pruning (PostgreSQL 11+)
The optimizer automatically eliminates partitions at plan time and execution time when WHERE clause conditions don't match partition constraints.
Requirements for partition pruning:
- enable_partition_pruning = on (default in PostgreSQL 11+)
- WHERE clause uses partition key columns
- Operators: =, <, >, <=, >=, BETWEEN, IN
- For list partitioning: equality or IN operators
- For range partitioning: comparison operators
Example:
-- Table partitioned by range on created_date
CREATE TABLE orders (
id int,
created_date date,
amount numeric
) PARTITION BY RANGE (created_date);
CREATE TABLE orders_2024_q1 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE orders_2024_q2 PARTITION OF orders
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
-- This query will prune orders_2024_q2 partition
SELECT * FROM orders WHERE created_date = '2024-02-15';
Execution-Time Pruning (PostgreSQL 11+)
When partition keys depend on runtime parameters or subqueries, pruning happens during execution:
-- Pruning occurs at execution time
PREPARE stmt (date) AS
SELECT * FROM orders WHERE created_date = $1;
EXECUTE stmt('2024-02-15');
Performance Characteristics
- Plan time: O(log N) for range/hash partitions using binary search
- Overhead per partition: ~40 bytes of planner memory (PostgreSQL 14+)
- Maximum effective partitions: ~1000-2000 (beyond this, overhead increases)
Verify Pruning with EXPLAIN
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE created_date = '2024-02-15';
Output shows only scanned partitions:
Seq Scan on orders_2024_q1
Filter: (created_date = '2024-02-15')
Configuration
- constraint_exclusion = partition (default): enables constraint exclusion for partitioned tables only
- enable_partition_pruning = on (default): enables partition pruning
Source: PostgreSQL 17 Official Documentation - Partitioning (Section 5.11)
Data Types - Binary
17 questions
2 kB (2048 bytes). This is the target size that TOAST tries to compress/move data down to.
2 hexadecimal digits per byte, with the most significant nibble first. The entire string is preceded by the sequence \x.
When standard_conforming_strings is on, backslash is no longer a special character in string literals, so the zero byte would look like '\000'. When it's off, it looks like '\000' after escaping both for bytea and for the string literal.
Bit-string constants use a B (upper or lower case) immediately before the opening quote with no intervening whitespace, e.g., B'1001'. Only characters 0 and 1 are allowed within bit-string constants.
1 or 4 bytes plus the actual binary string. The overhead depends on the length of the string.
It converts zero bytes and bytes with the high bit set into octal escape sequences (\nnn), and it doubles backslashes.
hex. The output format is controlled by the bytea_output configuration parameter, which defaults to 'hex'.
PostgreSQL 9.0. Prior to version 9.0, only the traditional 'escape' format was available.
0 (zero octet) as '\000', 39 (single quote) as '''' or '\047', 92 (backslash) as '\' or '\134', and 0-31 and 127-255 (non-printable octets) as '\xxx' in octal notation.
About 2000 bytes. The value is chosen so that four chunk rows will fit on a page.
Octets 0-31 and 127-255 are shown as \xxx in three-digit octal form. Octet 92 (backslash) is represented as \. Octets 32-126 (printable) are displayed using standard client character set representation.
base64, escape, and hex. PostgreSQL 19 and later also supports base64url.
It will be truncated on the right if it is more than n bits. Padding does not occur for bit varying, only truncation when the string is too long.
It is an error to attempt to store shorter or longer bit strings. The bit type data must match the length n exactly.
Data Types - Monetary
17 questions
When a money value is divided by another money value, the result is double precision (a pure number, not money); the currency units cancel each other out in the division.
Division of a money value by an integer value is performed with truncation of the fractional part towards zero.
You can view the current lc_monetary setting by running: SHOW lc_monetary;
Yes, aggregate functions such as SUM() and AVG() work with the money type and return results in the money type.
Yes, a money value can be cast to numeric without loss of precision.
The money type accepts integer and floating-point literals, as well as typical currency formatting, such as '$1,000.00'.
Yes, you can create B-tree indexes on money type columns, which support all comparison operations (<, <=, =, >=, >).
No, floating-point numbers cannot be cast directly to money. You must first cast to numeric, then to money. Example: SELECT '12.34'::float8::numeric::money;
The numeric (or decimal, which is a synonym) type is widely considered the ideal datatype for storing money in PostgreSQL because it allows explicit precision and scale definition, provides exact calculations, and doesn't have locale-dependent behavior.
No, the money type does not handle multiple currencies well because the currency is tied to the database's lc_monetary locale setting. For multi-currency support, the numeric or decimal type is recommended instead.
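A minimal sketch of such a schema (names and precision are illustrative):
CREATE TABLE payments (
    id       bigserial PRIMARY KEY,
    amount   numeric(12,2) NOT NULL,   -- exact, locale-independent
    currency char(3) NOT NULL          -- e.g. an ISO 4217 code such as 'USD'
);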
Since the output of the money data type is locale-sensitive, it might not work to load money data into a database that has a different setting of lc_monetary. To avoid problems, before restoring a dump into a new database make sure lc_monetary has the same or equivalent value as in the database that was dumped.
The money type supports all standard comparison operators: <, <=, =, >=, and >.
If lc_monetary is set to the empty string (which is the default), then the value is inherited from the execution environment of the server in a system-dependent way. More specifically, lc_monetary is usually set in postgresql.conf by initdb using a value derived from the server environment.
No, PostgreSQL does not support direct multiplication of two money type values because that would not typically yield a meaningful monetary value. Multiplication should only be performed with constants (e.g., multiplying a price by a quantity).
The money type supports addition, subtraction, multiplication (by numeric constants), and division operations.
The fractional precision is determined by the database's lc_monetary setting.
Foreign Key Performance > Cascading Actions Performance
17 questions
No. Column lists cannot be specified for SET NULL and SET DEFAULT foreign key actions in PostgreSQL.
No. While the check portion of a foreign key constraint may be deferred, the cascade delete actions still occur immediately even with deferred constraints.
CASCADE specifies that when a referenced row is deleted, row(s) referencing it should be automatically deleted as well.
The naming convention uses the RI_FKey_ prefix for the internal referential-integrity trigger functions (for example, RI_FKey_cascade_del).
PostgreSQL generates a SELECT 1 FROM ONLY "public"."table_name" x WHERE "column" = $1 FOR KEY SHARE OF x statement internally as part of the foreign key checking.
Yes. Since a DELETE of a row from the referenced table or an UPDATE of a referenced column will require a scan of the referencing table for rows matching the old value, it is often a good idea to index the referencing columns too.
The autovacuum daemon does not issue ANALYZE commands for foreign tables since it has no means of determining how often that might be useful. Manually-managed ANALYZE commands on those tables are recommended.
FOR UPDATE locks block all modifications including those that don't change key columns, which can block inserts into referencing tables. FOR NO KEY UPDATE only blocks key column modifications, allowing other processes to update non-key columns and preventing massive locking problems.
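A sketch of the difference (hypothetical customers/orders tables): foreign-key checks take FOR KEY SHARE, which conflicts with FOR UPDATE but not with FOR NO KEY UPDATE:
BEGIN;
-- Does not block concurrent INSERTs into orders that reference customer 42
SELECT * FROM customers WHERE customer_id = 42 FOR NO KEY UPDATE;
UPDATE customers SET notes = 'reviewed' WHERE customer_id = 42;  -- non-key column
COMMIT;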
NO ACTION is the default ON DELETE action for foreign keys in PostgreSQL.
MATCH SIMPLE is the default MATCH type for foreign keys in PostgreSQL.
No. The referenced (target) side of a foreign key is always covered by an index, because the referenced columns must have a primary key or unique constraint, but PostgreSQL does not automatically index the foreign key columns on the referencing (child) table.
The updated values of the referenced column(s) are copied into the referencing row(s).
Foreign keys are implemented using 'create constraint trigger' which automatically names triggers 'RI_ConstraintTrigger_'. Foreign key constraints are enforced via trigger functions that implement referential integrity.
If the foreign key column in the child table is not indexed, PostgreSQL has no choice but to perform a sequential scan of the entire child table to find matching rows.
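A minimal sketch (hypothetical names) of adding the missing index on the referencing column:
-- Speeds up ON DELETE/ON UPDATE checks and cascades from the parent table
CREATE INDEX orders_customer_id_idx ON orders (customer_id);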
Yes. Row updates or deletions caused by foreign-key enforcement actions like ON UPDATE CASCADE or ON DELETE SET NULL are treated as part of the SQL command that caused them, and relevant triggers on the affected table will be fired.
MATCH FULL will not allow one column of a multicolumn foreign key to be null unless all foreign key columns are null. MATCH SIMPLE allows any of the foreign key columns to be null; if any of them are null, the row is not required to have a match in the referenced table.
VACUUM will skip index vacuuming when there are very few dead tuples in the table, as the cost of processing all indexes is expected to greatly exceed the benefit of removing dead index tuples.
PostGIS > Coordinate Systems
17 questions
SRID 3857 is named 'WGS 84 / Pseudo-Mercator' or Web Mercator, and is the map projection used by Google Maps, Bing Maps, OpenStreetMap, and almost any web world map. The unit of measure for SRID 3857 is meters.
Degrees. The units for spatial reference 4326 are degrees when using the geometry type.
SRID 0. If you load data or create a new geometry without specifying an SRID, the SRID value will be 0.
No. UpdateGeometrySRID assigns a new SRID to geometry objects in a table but does not transform the data to a new coordinate system. It simply relabels the SRID. To actually transform coordinates, use ST_Transform instead.
PostGIS 2.2.0. In version 2.2.0, measurement on spheroid was enhanced to be performed with GeographicLib for improved accuracy and robustness.
Meters. All return values from geography calculations are in meters, including ST_Distance, ST_Length, ST_Perimeter, and ST_Area.
The srtext column contains the 'well-known text' (WKT) definition of the spatial reference system. This column is used by external programs such as GeoServer, uDig, and FME.
Input geometries must have a defined SRID. With the exception of functions with from_proj parameter, input geometries must have a defined SRID.
The geometry_columns view provides metadata about all geometry columns including: the parent schema, the parent table, the geometry column's name, the coordinate dimension, the SRID, and the geometry type (e.g., POINT, LINESTRING, etc.).
SRID 4326, which corresponds to longitude/latitude on the WGS84 spheroid.
SRID values less than 1 are converted to the officially unknown SRID value of 0, with a NOTICE message. For example, setting SRID to -1 will convert it to 0 with a notice stating 'SRID value -1 converted to the officially unknown SRID value 0'.
Spheroid by default. For geography types, ST_Distance defaults to return the minimum geodesic distance computed on the spheroid determined by the SRID. You can pass false as an optional boolean parameter to use a faster spherical calculation instead.
Yes. UpdateGeometrySRID supports 3D and will not drop the z-index. It also supports Circular Strings and Curves.
ST_Transform actually changes the coordinates of a geometry from one spatial reference system to another, while ST_SetSRID() simply changes the SRID identifier of the geometry without changing the coordinates.
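A small sketch of the difference (the coordinates are illustrative):
-- Relabel only: coordinates unchanged, SRID set to 4326
SELECT ST_AsText(ST_SetSRID(ST_MakePoint(-71.06, 42.36), 4326));
-- Reproject: coordinates converted from WGS 84 (4326) to Web Mercator (3857)
SELECT ST_AsText(ST_Transform(ST_SetSRID(ST_MakePoint(-71.06, 42.36), 4326), 3857));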
POINT, LINESTRING, POLYGON, MULTIPOINT, MULTILINESTRING, MULTIPOLYGON, and GEOMETRYCOLLECTION. The geography type does not support curves, TINs, or POLYHEDRALSURFACES.
PostGIS uses CHECK constraints with the pattern: CONSTRAINT enforce_srid_column_name CHECK (st_srid(column_name) = srid_value). For example: CONSTRAINT enforce_srid_geom CHECK (st_srid(geom) = 3857).
Yes. Standard geometry type data will autocast to geography if it is of SRID 4326.
Window Functions > Window Frame Specifications
16 questions
EXCLUDE CURRENT ROW (excludes only the current row), EXCLUDE GROUP (excludes current row and its ORDER BY peers), EXCLUDE TIES (excludes peers but keeps current row), and EXCLUDE NO OTHERS (default, excludes nothing).
No, frame_start cannot be UNBOUNDED FOLLOWING. This restriction exists because the frame cannot logically start at the end of the partition.
The in_range support function is a btree operator class feature added in PostgreSQL 11 that defines how to add or subtract the RANGE offset value. It includes support for integer types (int2/int4/int8) and standard datetime types, enabling type-safe RANGE offset calculations.
In RANGE or GROUPS mode, CURRENT ROW for frame_start means the current row's first peer row, and for frame_end means the current row's last peer row.
No, PostgreSQL versions up to 16 do not support RESPECT NULLS or IGNORE NULLS options. The behavior is always the same as the standard's default, namely RESPECT NULLS.
PostgreSQL 11 (released October 18, 2018) introduced GROUPS mode, RANGE offset PRECEDING/FOLLOWING, and EXCLUDE clauses for window frames.
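For instance, a RANGE offset frame over a timestamp column (hypothetical readings table, PostgreSQL 11+):
SELECT ts, val,
       sum(val) OVER (ORDER BY ts
                      RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW) AS rolling_1h
FROM readings;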
Yes, GROUPS mode requires that there must be an ORDER BY clause in the window definition because it relies on ORDER BY columns to determine peer groups.
In GROUPS mode, the offset must yield a non-null, non-negative integer representing the number of peer groups.
No, PostgreSQL does not implement FROM FIRST or FROM LAST options. Only the default FROM FIRST behavior is supported. As a workaround, you can reverse the ORDER BY ordering.
No, the meaning of the EXCLUDE clause is not affected by the frame unit. For example, EXCLUDE CURRENT ROW removes only the current row even when using RANGE or GROUPS units where the current row has peers.
Use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to extend the frame to include all rows in the partition.
ROWS, RANGE, and GROUPS. ROWS operates on individual rows, RANGE operates on value-based ranges requiring one ORDER BY column, and GROUPS operates on peer groups.
In ROWS mode, the offset must yield a non-null, non-negative integer.
UNBOUNDED PRECEDING means that the frame starts with the first row of the partition.
The default frame with ORDER BY only extends from the partition start through the current row and its peers, not to the end of the partition. This causes last_value() to return the current row's (or its last peer's) value instead of the actual last value in the partition.
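A self-contained illustration:
SELECT val,
       last_value(val) OVER (ORDER BY val) AS with_default_frame,  -- each row's own value here
       last_value(val) OVER (ORDER BY val
                             ROWS BETWEEN UNBOUNDED PRECEDING
                                      AND UNBOUNDED FOLLOWING) AS with_full_frame  -- 3 for every row
FROM (VALUES (1), (2), (3)) AS t(val);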
UNBOUNDED FOLLOWING means that the frame ends with the last row of the partition.
Streaming Replication > Cascading Replication
16 questions
Cascading replication is currently asynchronous. Synchronous replication settings have no effect on cascading replication at present.
A cascading standby sends not only WAL records received from the primary but also those restored from the archive.
If an upstream standby server is promoted to become the new primary, downstream servers will continue to stream from the new primary if recovery_target_timeline is set to 'latest' (the default).
When running a standby server, you must set max_wal_senders to the same or higher value than on the primary server.
No. Do not specify a database name in the primary_conninfo string.
You must set primary_conninfo in the downstream standby to point to the cascading standby.
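For example, a sketch of the downstream standby's configuration (the host name is illustrative):
-- postgresql.conf on the downstream standby
primary_conninfo = 'host=cascading-standby.example.com port=5432 user=replicator'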
wal_keep_segments was replaced by wal_keep_size in PostgreSQL 13.
No. Named standbys must be directly connected to the primary; the primary knows nothing about downstream standby servers using cascaded replication.
Hot standby feedback propagates upstream, whatever the cascaded arrangement. If cascaded replication is in use, the feedback is passed upstream until it eventually reaches the primary.
Yes. Since the introduction of cascading replication in PostgreSQL 9.2, the server specified in primary_conninfo can be another standby, not just the primary.
wal_keep_size uses megabytes (MB) as the unit. If no unit is specified, MB is taken as the default.
The default value of recovery_target_timeline is 'latest' in PostgreSQL 12 and later.
extensions
15 questions
Use the CREATE EXTENSION command to install an extension in PostgreSQL:
CREATE EXTENSION extension_name;
Prerequisites:
- Extension files must be installed on the PostgreSQL server first
- Check available extensions:
SELECT * FROM pg_available_extensions;
Permissions:
- Most extensions require superuser privileges
- "Trusted" extensions can be installed by any user with
CREATEprivilege on the database
Common options:
-- Install only if not already present
CREATE EXTENSION IF NOT EXISTS extension_name;
-- Install in specific schema
CREATE EXTENSION extension_name SCHEMA schema_name;
-- Install specific version
CREATE EXTENSION extension_name VERSION '1.0';
-- Auto-install dependencies
CREATE EXTENSION extension_name CASCADE;
Example:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
The user running CREATE EXTENSION becomes the extension owner. Objects created by the extension are typically owned by this user.
pg_cron is a PostgreSQL extension that provides cron-based job scheduling directly inside the database. It allows you to schedule periodic PostgreSQL commands using standard cron syntax (e.g., '0 2 * * *' for daily at 2am).
Key characteristics:
- Runs as a background worker process inside PostgreSQL (requires PostgreSQL 10 or higher)
- Uses standard cron syntax for scheduling
- Executes SQL commands at scheduled intervals
- Runs multiple jobs in parallel, but at most one instance of each job at a time
- Supports interval-based scheduling using seconds syntax (e.g., '*/30 seconds')
- Must be loaded via shared_preload_libraries in postgresql.conf
- Created using CREATE EXTENSION pg_cron
Common use cases include scheduling maintenance tasks, periodic aggregations, and data cleanup jobs.
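A short sketch of the core API (the schedule and command are examples; the returned job id will differ):
SELECT cron.schedule('0 3 * * *', 'VACUUM ANALYZE');   -- returns the new job's id
SELECT jobid, schedule, command FROM cron.job;          -- list scheduled jobs
SELECT cron.unschedule(1);                              -- remove a job by id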
pgvector
pgvector is an open-source PostgreSQL extension that adds vector similarity search capabilities to PostgreSQL, enabling storage and querying of high-dimensional vectors (embeddings).
Core capabilities:
- Store vector embeddings directly in PostgreSQL columns
- Perform similarity searches using distance metrics (L2 distance, inner product, cosine distance)
- Index vectors for fast approximate nearest neighbor search
- Supports exact and approximate nearest neighbor queries
Supported vector types:
- vector: stores vectors with configurable dimensions (e.g., vector(1536) for OpenAI embeddings)
- Also supports half-precision (halfvec), binary (bit), and sparse (sparsevec) vectors
Index types:
- HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor search
- IVFFlat for partitioned vector search
Requirements:
- PostgreSQL 13 or higher
Typical use cases:
- Semantic search applications
- Recommendation systems
- AI/ML applications using embeddings from models like OpenAI, Cohere, or custom models
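A minimal sketch (3-dimensional vectors for brevity; table and column names are illustrative):
CREATE EXTENSION vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
-- 5 nearest neighbours by L2 distance
SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 5;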
postgres_fdw is a PostgreSQL extension that provides a foreign data wrapper for accessing data stored in external PostgreSQL servers. It allows you to query remote PostgreSQL databases as if the tables were local.
What it does:
- Creates foreign tables that reference tables on remote PostgreSQL servers
- Executes SELECT, INSERT, UPDATE, DELETE, COPY, and TRUNCATE operations on remote data
- Pushes WHERE clauses and joins to the remote server for execution (reducing data transfer)
- Provides more transparent and standards-compliant syntax than older methods like dblink
- Enables querying across multiple PostgreSQL databases from a single connection
Setup workflow:
1. Install: CREATE EXTENSION postgres_fdw;
2. Define remote server: CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'remote_host', dbname 'remote_db', port '5432');
3. Create user mapping: CREATE USER MAPPING FOR local_user SERVER foreign_server OPTIONS (user 'remote_user', password 'password');
4. Create foreign table: CREATE FOREIGN TABLE foreign_table (...) SERVER foreign_server OPTIONS (schema_name 'public', table_name 'remote_table');
The pg_trgm similarity threshold is 0.3 by default.
The pg_trgm.similarity_threshold parameter controls when the % operator considers two strings similar. It must be between 0 and 1 (inclusive), where:
- 0 = any strings match (most permissive)
- 1 = only identical strings match (most strict)
You can set it per session:
SET pg_trgm.similarity_threshold = 0.5;
Or query the current value:
SHOW pg_trgm.similarity_threshold;
Related thresholds:
- pg_trgm.word_similarity_threshold: 0.6 (default), for the <% and %> operators
- pg_trgm.strict_word_similarity_threshold: 0.5 (default), for the <<% and %>> operators
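A small usage sketch (hypothetical users table):
CREATE INDEX users_name_trgm ON users USING gin (name gin_trgm_ops);
SELECT name, similarity(name, 'jon smith') AS sim
FROM users
WHERE name % 'jon smith'      -- uses pg_trgm.similarity_threshold
ORDER BY sim DESC;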
TimescaleDB is a PostgreSQL extension that turns PostgreSQL into a time-series database for high-performance real-time analytics. It is packaged as a Postgres extension, meaning it extends PostgreSQL's native capabilities rather than being a separate database system.
Key characteristics:
- Built as a native PostgreSQL extension (installed via CREATE EXTENSION timescaledb)
- Optimized for time-series data workloads (IoT sensors, application metrics, financial data, etc.)
- Provides automatic partitioning through "hypertables" that look like standard PostgreSQL tables
- Offers faster time-based queries, data compression, continuous aggregates, and data retention policies
- Maintains full SQL compatibility while adding time-series-specific functions
- Designed for high-volume data ingest and complex analytical queries
Installation:
CREATE EXTENSION IF NOT EXISTS timescaledb;
TimescaleDB extends PostgreSQL without replacing it, allowing you to use all standard PostgreSQL features alongside specialized time-series optimizations.
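A minimal sketch of creating a hypertable (table and column names are illustrative):
CREATE TABLE conditions (
    time        TIMESTAMPTZ NOT NULL,
    device_id   INT,
    temperature DOUBLE PRECISION
);
SELECT create_hypertable('conditions', 'time');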
PostGIS provides the following spatial data types:
Primary Spatial Data Types
geometry - Represents spatial features using planar (Cartesian) coordinate systems. Supports various subtypes: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection.
geography - Represents spatial features using geodetic (ellipsoidal/spherical) coordinate systems for accurate distance and area calculations on Earth's surface.
Bounding Box Types
box2d - A 2-dimensional bounding box containing xmin, ymin, xmax, ymax values.
box3d - A 3-dimensional bounding box containing xmin, ymin, zmin, xmax, ymax, zmax values.
Utility Types
geometry_dump - A composite type with two fields: geom (holding a geometry object) and path[] (a 1-dimensional array holding the position of the geometry within the dumped object). Used for decomposing complex geometries.
box2df - A 2D float precision bounding box used internally for indexing operations.
pg_stat_statements provides 47 metrics tracking SQL planning and execution statistics. Key metric categories:
Identification:
- userid: OID of user executing statement
- dbid: OID of database
- queryid: Hash code for normalized query
- query: Statement text
- toplevel: Boolean (top-level vs nested)
Planning:
- plans: Number of planning operations
- total_plan_time, min_plan_time, max_plan_time, mean_plan_time, stddev_plan_time: Planning duration (milliseconds)
Execution:
- calls - Number of executions
- total_exec_time, min_exec_time, max_exec_time, mean_exec_time, stddev_exec_time - Execution duration (milliseconds)
- rows - Total rows retrieved/affected
Buffer Cache:
- shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written - Shared buffer statistics
- local_blks_hit, local_blks_read, local_blks_dirtied, local_blks_written - Local buffer statistics
- temp_blks_read, temp_blks_written - Temporary block statistics
I/O Timing:
- shared_blk_read_time, shared_blk_write_time - Shared block I/O duration
- local_blk_read_time, local_blk_write_time - Local block I/O duration
- temp_blk_read_time, temp_blk_write_time - Temp block I/O duration
WAL (Write-Ahead Log):
- wal_records, wal_fpi, wal_bytes, wal_buffers_full - WAL generation statistics
JIT Compilation:
- jit_functions, jit_generation_time - JIT compilation counts/timing
- jit_inlining_count, jit_inlining_time - Inlining metrics
- jit_optimization_count, jit_optimization_time - Optimization metrics
- jit_emission_count, jit_emission_time - Code emission metrics
- jit_deform_count, jit_deform_time - Tuple deforming metrics
Parallel Execution:
- parallel_workers_to_launch, parallel_workers_launched - Parallel worker counts
Timestamps:
- stats_since, minmax_stats_since - Statistics collection start times
All timing metrics are in milliseconds. The view contains one row per distinct combination of database, user, query ID, and top-level flag.
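For example, a common query against the view to surface the most expensive statements (a sketch using the columns listed above):
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       rows,
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;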
Sources:
The citext extension provides a case-insensitive character string type for PostgreSQL. It is essentially equivalent to the text data type, but performs case-insensitive comparisons by internally calling lower() on values when comparing them.
Key characteristics:
- Case-insensitive comparisons: 'ABC' equals 'abc' when using citext
- Preserves original case: Stores data in its original form
- Trusted extension: Can be installed by non-superusers with CREATE privilege
- Performance consideration: Less efficient than text because it must make copies and convert to lowercase for comparisons
Enable with:
CREATE EXTENSION citext;
Example usage:
CREATE TABLE users (
email citext PRIMARY KEY
);
-- Case-insensitive matching
SELECT * FROM users WHERE email = '[email protected]';
Note: PostgreSQL recommends considering nondeterministic collations instead of citext for modern applications, as they offer more flexible case-insensitive and accent-insensitive comparison options.
Sources:
To query foreign PostgreSQL servers, use the postgres_fdw extension following these five steps:
1. Install the extension:
CREATE EXTENSION postgres_fdw;
2. Create a foreign server object (defines the remote database):
CREATE SERVER foreign_server
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host '192.83.123.89', port '5432', dbname 'foreign_db');
3. Create a user mapping (provides credentials):
CREATE USER MAPPING FOR local_user
SERVER foreign_server
OPTIONS (user 'foreign_user', password 'password');
4. Create a foreign table (local representation of remote table):
CREATE FOREIGN TABLE foreign_table (
id integer NOT NULL,
data text
)
SERVER foreign_server
OPTIONS (schema_name 'some_schema', table_name 'some_table');
Important: The columns must match the remote table structure.
5. Query the foreign table like any local table:
SELECT * FROM foreign_table;
You can also use INSERT, UPDATE, DELETE, COPY, and TRUNCATE on foreign tables.
Alternative: Import entire schemas automatically:
IMPORT FOREIGN SCHEMA remote_schema
FROM SERVER foreign_server
INTO local_schema;
Sources:
Based on the authoritative PostGIS documentation, here's how to perform geospatial queries:
Geospatial Queries with PostGIS
PostGIS provides spatial relationship functions that work directly in SQL queries. These automatically use spatial indexes when available.
Common Spatial Relationship Queries:
ST_Contains - Test if one geometry contains another:
SELECT name FROM neighborhoods
WHERE ST_Contains(geom, ST_SetSRID(ST_Point(-71.06, 42.34), 4326));
ST_Intersects - Find geometries that intersect:
SELECT a.name, b.name
FROM roads a, buildings b
WHERE ST_Intersects(a.geom, b.geom);
ST_Within - Find geometries within another:
SELECT name FROM parks
WHERE ST_Within(geom, (SELECT geom FROM city WHERE name = 'Boston'));
ST_Distance - Calculate distance between geometries:
SELECT name FROM places
WHERE ST_Distance(geom, ST_Point(-71.06, 42.34)) < 1000;
ST_DWithin - Optimized distance query (uses spatial index):
SELECT name FROM places
WHERE ST_DWithin(geom, ST_SetSRID(ST_Point(-71.06, 42.34), 4326), 1000);
Spatial Index Usage:
Create a spatial index for performance:
CREATE INDEX spatial_idx ON table_name USING GIST(geom);
Spatial predicates (ST_Contains, ST_Intersects, ST_Within, ST_Overlaps, ST_Crosses, ST_Touches, ST_Disjoint, ST_Equals, ST_Covers, ST_CoveredBy, ST_ContainsProperly) automatically use spatial indexes - no need to manually add the && operator.
Sources:
How to Use Trigram Similarity Searches in PostgreSQL
1. Enable the extension:
CREATE EXTENSION pg_trgm;
2. Use the similarity function:
SELECT similarity('word', 'two words') AS score;
-- Returns a value between 0 (no similarity) and 1 (identical)
3. Use the % operator for similarity matching:
SELECT * FROM products
WHERE name % 'PostgreSQL';
-- Returns rows where similarity exceeds threshold (default 0.3)
4. Order results by similarity:
SELECT name, similarity(name, 'PostgreSQL') AS score
FROM products
WHERE name % 'PostgreSQL'
ORDER BY score DESC;
5. Create an index for fast similarity searches:
-- GiST index
CREATE INDEX name_trgm_idx ON products USING GIST (name gist_trgm_ops);
-- GIN index (faster for static data)
CREATE INDEX name_trgm_idx ON products USING GIN (name gin_trgm_ops);
6. Use distance operator for nearest matches:
SELECT name, name <-> 'PostgreSQL' AS distance
FROM products
ORDER BY distance
LIMIT 10;
Key Functions:
- similarity(text1, text2) - Returns similarity score (0-1)
- word_similarity(text1, text2) - Similarity to continuous text portions
- show_trgm(text) - Shows all trigrams for debugging
Adjust similarity threshold:
SET pg_trgm.similarity_threshold = 0.5; -- Default is 0.3
Sources:
CREATE EXTENSION loads an extension into the current database. It executes the extension's script file to create new SQL objects such as functions, data types, operators, tables, and other database objects.
Syntax:
CREATE EXTENSION [ IF NOT EXISTS ] extension_name
[ WITH ] [ SCHEMA schema_name ]
[ VERSION version ]
[ CASCADE ]
Key Parameters:
- IF NOT EXISTS - Suppresses errors if the extension is already installed
- extension_name - Name of the extension to install
- SCHEMA schema_name - Schema to contain the extension's objects (must already exist)
- VERSION version - Specific version to install (defaults to control file value)
- CASCADE - Automatically installs dependent extensions
Privileges: Requires superuser privileges by default, unless the extension is marked as "trusted" in its control file (in which case CREATE privilege on the database is sufficient).
Example:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION hstore SCHEMA public;
CREATE EXTENSION postgis CASCADE;
Sources:
Based on the official pgvector GitHub repository, here's how to perform vector similarity search in PostgreSQL:
Installation and Setup
- Install the pgvector extension (available via PostgreSQL APT/Yum repositories)
- Enable the extension in your database:
CREATE EXTENSION vector;
Creating Tables with Vector Columns
CREATE TABLE items (
id bigserial PRIMARY KEY,
embedding vector(3)
);
The vector(3) specifies a 3-dimensional vector. Adjust the dimension as needed.
Performing Similarity Searches
pgvector supports three distance operators:
- <-> - L2 distance (Euclidean)
- <#> - Inner product (negative, so larger values = less similar)
- <=> - Cosine distance
Example queries:
-- Find 5 nearest neighbors using L2 distance
SELECT * FROM items
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;
-- Find similar items using cosine distance
SELECT * FROM items
ORDER BY embedding <=> '[3,1,2]'
LIMIT 5;
-- Find similar items using inner product
SELECT * FROM items
ORDER BY embedding <#> '[3,1,2]'
LIMIT 5;
Adding Indexes for Performance
HNSW index (recommended for most cases):
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
IVFFlat index:
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops);
Replace vector_l2_ops with vector_ip_ops for inner product or vector_cosine_ops for cosine distance.
By default, pgvector performs exact nearest neighbor search with perfect recall. Indexes enable approximate nearest neighbor search, trading some recall for speed.
Sources:
Use TimescaleDB hypertables when:
- You have time-series data requiring real-time analytics and continuous inserts
- You need automatic chunk management without manual intervention
- You require non-blocking operations for adding/dropping partitions (TimescaleDB takes only access share locks on the parent table, allowing concurrent reads/writes)
- You want time-series specific optimizations (automatic time-descending indexes, compression, data retention policies)
- Query performance matters more than simplicity (TimescaleDB shows 1000x faster queries in some time-series workloads)
Use PostgreSQL native partitioning when:
- Your table size exceeds physical memory of the database server (rule of thumb from PostgreSQL docs)
- You need full control over partition strategy (range, list, hash)
- Most heavily accessed rows are in a single partition or small number of partitions
- You want no additional extensions or dependencies
- Workload is data warehouse style where query planning overhead is acceptable
- You need to frequently drop entire partitions via DROP TABLE or ALTER TABLE DETACH PARTITION (faster than bulk DELETE)
Key architectural difference: TimescaleDB uses its own chunk catalog system independent of PostgreSQL's declarative partitioning and inheritance mechanisms, avoiding relation cache bottlenecks and minimizing locks by design.
Performance consideration: While ingest performance is similar, TimescaleDB excels at query performance for time-series patterns due to its specialized optimization layer on top of PostgreSQL.
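For comparison, a minimal native declarative partitioning sketch (range partitioning by month; table and column names are illustrative):
CREATE TABLE measurements (
    logged_at timestamptz NOT NULL,
    device_id int,
    value     double precision
) PARTITION BY RANGE (logged_at);
-- Partitions are created (or attached) explicitly
CREATE TABLE measurements_2024_01 PARTITION OF measurements
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE measurements_2024_02 PARTITION OF measurements
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Dropping old data is a fast metadata operation
ALTER TABLE measurements DETACH PARTITION measurements_2024_01;
DROP TABLE measurements_2024_01;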
Sources:
Common Table Expressions
15 questions
Only UNION or UNION ALL can be used. Set operations other than UNION ALL between a non-recursive term and a recursive term are not permitted.
No, data-modifying statements (INSERT, UPDATE, DELETE, MERGE) in CTEs cannot contain recursive self-references. Only SELECT statements can be recursive in CTEs.
Each auxiliary statement in a WITH clause can be a SELECT, INSERT, UPDATE, DELETE, or MERGE statement. The WITH clause itself can be attached to a primary statement that can also be SELECT, INSERT, UPDATE, DELETE, or MERGE.
CTEs are normally evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries.
PostgreSQL 12 changed the CTE materialization behavior. For non-recursive and side-effect-free CTEs, PostgreSQL 12 defaults to inlining the CTE into the outer query if it is used only once. CTEs that are called more than once still use materialization by default.
All data-modifying statements in WITH are executed with the same snapshot, so they cannot see one another's effects on the target tables. The sub-statements execute concurrently with each other and with the main query.
No, PostgreSQL does not have a built-in hard-coded maximum recursion depth limit for recursive CTEs. PostgreSQL evaluates only as many rows as are actually fetched by the parent query. If iteration never ends, the query will continue until system resources are exhausted.
By default, TRUE and FALSE (of type boolean) are used for the cycle mark value and cycle mark default, respectively.
The MERGE statement was introduced in PostgreSQL 15 (October 2022). WITH clause support was available from the initial implementation, allowing CTEs with MERGE. However, WITH RECURSIVE is not supported with MERGE.
Any table used as the target of a data-modifying statement in WITH must not have a conditional rule, nor an ALSO rule, nor an INSTEAD rule that expands to multiple statements.
The RECURSIVE keyword can only be used immediately after the keyword WITH. When there are multiple queries in the WITH clause, RECURSIVE should be written only once, immediately after WITH.
Trying to update the same row twice in a single statement is not supported. Only one of the modifications takes place, but it is not easy (and sometimes not possible) to reliably predict which one.
RETURNING data is the only way to communicate changes between different WITH sub-statements and the main query, because all statements execute with the same snapshot and cannot see each other's effects on target tables.
CYCLE column_list SET cycle_mark_column TO cycle_mark_value DEFAULT cycle_mark_default USING path_column. For example: CYCLE c_to SET is_cycle TO true DEFAULT false USING cycle_path
No, the non-recursive term must not reference the CTE itself. Only the recursive term can contain a reference to the query's own output.
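A minimal recursive CTE illustrating these rules (non-recursive term, UNION ALL, recursive term referencing the CTE):
WITH RECURSIVE countdown(n) AS (
    SELECT 10            -- non-recursive term: must not reference the CTE
    UNION ALL
    SELECT n - 1         -- recursive term: references the CTE's own output
    FROM countdown
    WHERE n > 1
)
SELECT n FROM countdown;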
Foreign Key Performance > FK and Bulk Operations
15 questions
If temporarily removing the constraint isn't acceptable, the only other recourse may be to split up the load operation into smaller transactions to avoid trigger event queue overflow.
The FREEZE option loads data with rows already frozen, just as they would be after running the VACUUM FREEZE command. This is intended as a performance option for initial data loading.
TRUNCATE is faster than DELETE because it does not actually scan the tables, reclaims disk space immediately rather than requiring a subsequent VACUUM operation, and does not acquire row-level locks. However, TRUNCATE cannot be used on a table that has foreign-key references from other tables unless all such tables are also truncated in the same command (using CASCADE).
ALTER TABLE ADD FOREIGN KEY requires a SHARE ROW EXCLUSIVE lock on the table being altered. It also acquires a SHARE ROW EXCLUSIVE lock on the referenced table.
Loading a large number of rows using COPY is almost always faster than using INSERT, even if PREPARE is used and multiple insertions are batched into a single transaction. COPY is optimized for loading large numbers of rows and incurs significantly less overhead.
Setting session_replication_role to replica disables the firing of triggers (unless they are set ENABLE REPLICA or ENABLE ALWAYS). Since foreign keys are implemented as triggers, this also disables all foreign key checks.
Temporarily increasing the maintenance_work_mem configuration variable when loading large amounts of data can lead to improved performance and will help to speed up ALTER TABLE ADD FOREIGN KEY commands.
The NOT VALID option applies to foreign key, CHECK, and not-null constraints only.
64 megabytes (64MB). This parameter specifies the maximum amount of memory to be used by maintenance operations such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY.
VALIDATE CONSTRAINT acquires a SHARE UPDATE EXCLUSIVE lock on the table being altered. If the constraint is a foreign key, a ROW SHARE lock is also required on the referenced table.
By default, pg_dump uses COPY, and when it is generating a complete schema-and-data dump, it is careful to load data before creating indexes and foreign keys.
PostgreSQL 9.1 introduced the NOT VALID option for foreign key and CHECK constraints.
The table must have been created or truncated in the current subtransaction, no cursors can be open, and no older snapshots can be held by the transaction. It can only be used with COPY FROM, not COPY TO, and cannot be used on partitioned tables or foreign tables.
NO ACTION allows the check to be deferred until later in the transaction, whereas RESTRICT does not. RESTRICT performs an immediate check and cannot be deferred.
ON DELETE NO ACTION is the default. This produces an error indicating that the deletion would create a foreign key constraint violation if there are related rows in the child table.
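A short sketch of the NOT VALID / VALIDATE CONSTRAINT pattern several of these answers refer to (table and constraint names are illustrative):
-- Add the foreign key without scanning existing rows
ALTER TABLE orders
    ADD CONSTRAINT orders_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;
-- Later, validate existing rows under the weaker SHARE UPDATE EXCLUSIVE lock
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;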
Server Configuration > File Locations
15 questions
If you keep configuration files separate from the data directory, the postgres -D command-line option or PGDATA environment variable must point to the directory containing the configuration files, and the data_directory parameter must be set in postgresql.conf to show where the data directory is actually located.
The three methods are: 1) Execute SELECT pg_reload_conf(); from a superuser session, 2) Run pg_ctl reload from the command line, 3) Send a SIGHUP signal directly to the postgres process using kill -SIGHUP.
In a default installation, the data_directory parameter is not set explicitly. Instead, the data directory location must be specified by the -D command-line option or the PGDATA environment variable; there is no built-in default value.
The default value for hba_file is ConfigDir/pg_hba.conf (where ConfigDir is the data directory). Unless explicitly set, hba_file points to pg_hba.conf in the data directory.
Both hba_file and ident_file parameters were added in PostgreSQL 8.0.
You can use SHOW data_directory; or SELECT current_setting('data_directory'); to find the location of the data directory.
The external_pid_file parameter can only be set at server start. It cannot be changed at runtime.
Relative paths in file location parameters (data_directory, hba_file, ident_file, external_pid_file) are interpreted with respect to the directory in which postgres is started, not the current working directory.
When initdb initializes a data directory, it creates three configuration files: postgresql.conf (main server configuration), pg_hba.conf (host-based authentication), and pg_ident.conf (user name mapping).
The boolean value returned by pg_reload_conf() merely indicates whether PostgreSQL could be successfully signalled (SIGHUP sent), not whether the configuration was successfully reloaded or valid.
pg_ctl reload sends the postgres process a SIGHUP signal, causing it to re-read its configuration files (postgresql.conf, pg_hba.conf, etc.). The postmaster forwards the SIGHUP signal to its child processes so they all pick up the new values.
The default value for ident_file is ConfigDir/pg_ident.conf (where ConfigDir is the data directory). Unless explicitly set, ident_file points to pg_ident.conf in the data directory.
When external_pid_file is set to '' (empty string, the default), no additional PID file is written. PostgreSQL only creates the standard postmaster.pid file in the data directory.
The hba_file parameter can only be set at server start. It cannot be changed at runtime or via SIGHUP reload.
Yes, hba_file and ident_file can be set within the main postgresql.conf configuration file, unlike config_file which can only be specified on the postgres command line.
Logical Replication > Row Filters
14 questions
The UPDATE is transformed into a DELETE to avoid data inconsistency.
No. Previously subscribed tables are not re-copied, even if a table's row filter WHERE clause has since been modified.
The UPDATE is transformed into an INSERT to avoid data inconsistency.
It indicates there is no publication qualifying condition (no row filter) for that relation.
The WHERE clause must be enclosed by parentheses. For example: CREATE PUBLICATION pub FOR TABLE tablename WHERE (expression);
The default is 'insert, update, delete, truncate', meaning all DML operations are published by default.
The row filter WHERE clause must contain only columns that are covered by the replica identity.
No. Row filters can only use immutable built-in functions. STABLE functions like the current_timestamp family cannot be used.
Simple immutable built-in functions such as arithmetic operators (+, -, *, /, %), comparison operators (=, <, >, <=, >=, <>), logical operators (AND, OR, NOT), and deterministic functions like upper(), lower(), length(), abs(), ceil(), floor(), and round().
The WHERE clause expression is executed with the role used for the replication connection.
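A small example of a row filter following these rules, assuming order_id is part of the table's replica identity (names are illustrative):
-- The WHERE clause must be parenthesized and may use only replica-identity
-- columns and immutable built-in functions/operators
CREATE PUBLICATION large_orders_pub
    FOR TABLE orders WHERE (order_id > 1000);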
Logical Replication > Column Lists
14 questions
No. A subscription having several publications in which the same table has been published with different column lists is not supported.
The pgoutput replication plug-in was fixed to not send columns not listed in a table's replication column list (commit ebf87c019). UPDATE and DELETE events did not pay attention to the configured column list, sending more data than expected and wasting network bandwidth. Contributed by Hou Zhijie.
PostgreSQL 15 (released October 13, 2022). The feature was contributed by Tomas Vondra, Álvaro Herrera, and Rahila Syed (commit 923def9a5).
pubname (name), schemaname (name), tablename (name), attnames (name[]), and rowfilter (text).
Only columns included in the column list are copied to the subscriber when copy_data = true.
No. In versions before 18.0, logical replication does not publish GENERATED columns at all.
Without a column list, all columns (including any added in the future) are automatically replicated. With a column list, only specified columns replicate and new columns added later are NOT automatically included.
No. Column lists have no effect for the TRUNCATE command - TRUNCATE affects the entire table regardless of column list specification.
Column lists can contain only simple column references. They cannot contain expressions.
No. Adding/Setting any schema when the publication also publishes a table with a column list, and vice versa, is not supported.
TABLE table_name ( column_name [, ... ] ). Example: CREATE PUBLICATION users_filtered FOR TABLE users (user_id, firstname);
No. Column lists cannot be used for security purposes - a malicious subscriber is able to obtain data from columns that are not specifically published. The proper way to secure data is through the GRANT system at the publisher side.
The apply process will error. When publishing stored generated columns, the subscriber must receive them into regular columns, not generated columns.
ALTER PUBLICATION mypublication ADD TABLE table_name (column1, column2, ...); Example: ALTER PUBLICATION mypublication ADD TABLE users (user_id, firstname), departments;
Data Types - pg_lsn
13 questions
pg_lsn - the operator subtracts the specified number of bytes from the LSN position.
Two hexadecimal numbers of up to 8 digits each, separated by a slash. For example: 16/B374D848.
Yes. The MIN and MAX aggregate functions support the pg_lsn datatype for finding minimum and maximum LSN values.
PostgreSQL 14. From this version it became possible to add and subtract byte values to/from pg_lsn values.
B-tree and Hash indexes have native support for pg_lsn. GiST support is available through the btree_gist extension.
Inside the data type itself. The validation ensures the format consists of two hexadecimal numbers separated by a slash.
pg_lsn - the operator adds the specified number of bytes to the LSN position.
Up to 8 hexadecimal digits in each part (before and after the slash separator).
The calculated LSN must be in the range of pg_lsn type, between 0/0 and FFFFFFFF/FFFFFFFF. Values outside this range are invalid.
Text. Up to PostgreSQL 9.3, all functions using LSN had been using "text" as a substitute to represent an LSN number.
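A few illustrative expressions using the pg_lsn type and its byte arithmetic (the literal values are arbitrary):
-- Difference between two LSNs, in bytes
SELECT '16/B374D848'::pg_lsn - '16/B3700000'::pg_lsn AS byte_diff;
-- Adding bytes to an LSN (PostgreSQL 14+)
SELECT '16/B374D848'::pg_lsn + 8192 AS advanced_lsn;
-- MIN/MAX aggregates work on pg_lsn columns, e.g. in pg_stat_replication
SELECT min(replay_lsn), max(replay_lsn) FROM pg_stat_replication;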
Functions - Conditional
13 questions
The first expression is computed, then compared to each of the value expressions in the WHEN clauses using the = operator until one is found that is equal to it. There must be a suitable = operator available for the comparison.
If all inputs are of type unknown, the CASE expression resolves as type text (the preferred type of the string category).
While CASE expressions generally evaluate conditions sequentially at runtime, there are various situations in which subexpressions of an expression are evaluated at different times, so the principle that 'CASE evaluates only necessary subexpressions' is not ironclad. Constant expressions may be evaluated during query planning.
No. A CASE cannot prevent evaluation of an aggregate expression contained within it, because aggregate expressions are computed before other expressions in a SELECT list or HAVING clause are considered.
PostgreSQL has a general (searched) form with syntax 'CASE WHEN condition THEN result [WHEN ...] [ELSE result] END' that evaluates boolean conditions, and a simple form with syntax 'CASE expression WHEN value THEN result [WHEN ...] [ELSE result] END' that compares an expression against specific values.
The expression is computed once. In the simple form, the first expression is computed, then compared to each of the value expressions in the WHEN clauses until one is found that is equal to it.
NULLIF(value1, value2) is equivalent to: CASE WHEN value1 = value2 THEN NULL ELSE value1 END
NULLIF(1, 2.2) yields numeric type. The result has the same type as the first argument of the implied = operator, and in some cases that will have been promoted to match the second argument's type. Since there is no integer = numeric operator, only numeric = numeric, the first argument is promoted to numeric.
They use an identical algorithm to match up their component expressions and select a result data type. The algorithm checks for identical types, handles domain types, processes unknown types, ensures all non-unknown inputs belong to the same type category, selects a preferred candidate type, and converts all inputs to the final candidate type.
All expressions must be convertible to a common data type, which will be the type of the result. If they are not convertible to a common type, an error will be raised.
COALESCE uses short-circuit evaluation. It only evaluates the arguments that are needed to determine the result; that is, arguments to the right of the first non-null argument are not evaluated.
The SQL standard requires GREATEST and LEAST to return NULL if any argument is NULL. PostgreSQL deviates from this: it ignores NULL values in the argument list and only returns NULL if all expressions evaluate to NULL.
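A few worked examples of the behaviors described above:
-- Searched and simple CASE forms
SELECT CASE WHEN 2 > 1 THEN 'yes' ELSE 'no' END;            -- 'yes'
SELECT CASE 3 WHEN 1 THEN 'one' WHEN 3 THEN 'three' END;    -- 'three'
-- NULLIF returns NULL when both arguments are equal
SELECT NULLIF(5, 5), NULLIF(5, 6);                          -- NULL, 5
-- COALESCE short-circuits at the first non-null argument
SELECT COALESCE(NULL, 'fallback', 'never evaluated');       -- 'fallback'
-- GREATEST/LEAST ignore NULLs (PostgreSQL behavior, deviating from the standard)
SELECT GREATEST(1, NULL, 3), LEAST(1, NULL, 3);             -- 3, 1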
Extensions > pg_repack
12 questions
Only superusers or the owners of the target tables are permitted to run pg_repack, because the extension makes low-level modifications to system catalogs and needs high privileges.
The --no-order flag performs an online VACUUM FULL without reordering rows. Since version 1.2, this is the default behavior for non-clustered tables.
No, pg_repack has significant compatibility issues with logical replication. The way pg_repack does table rewrite is almost guaranteed to break logical decoding unless there are zero unconsumed changes for a given table, as it does not build the necessary mappings info for logical decoding.
The default message level is INFO. The --elevel option allows choosing from DEBUG, INFO, NOTICE, WARNING, ERROR, LOG, FATAL, and PANIC.
pg_repack is a fork of the pg_reorg project. New feature development on pg_reorg slowed or stopped since late 2011, and pg_repack was initially released as a drop-in replacement for pg_reorg.
The default wait-timeout is 60 seconds. This controls how long pg_repack will wait to acquire necessary locks before taking action.
pg_repack creates work tables and log tables in the 'repack' schema. The log table follows a naming convention like 'log_16490' where the number represents the table's OID.
If PostgreSQL is running a vacuum to prevent transaction ID wraparound on a table while pg_repack is running, pg_repack can cause the table to become locked for an extended period when it tries to replace the old table with the newly repacked one, potentially making the database inaccessible.
No. pg_repack cannot cluster tables by GiST indexes, since these do not provide a straightforward physical ordering.
The --dry-run option makes pg_repack print out what would have been repacked without performing any action. It allows you to simulate the repack process without modifying data.
Execute CREATE EXTENSION pg_repack in your database. For example: psql -c "CREATE EXTENSION pg_repack" -d your_database
Performing a full-table repack requires free disk space about twice as large as the target table(s) and its indexes. For example, if the total size of the tables and indexes to be reorganized is 1GB, an additional 2GB of disk space is required.
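A hedged command-line sketch using the options mentioned above (database and table names are illustrative):
-- The extension must be installed in the target database first:
CREATE EXTENSION IF NOT EXISTS pg_repack;
-- pg_repack is then run from the shell, for example:
--   pg_repack --table=public.big_table --dry-run my_database
--   pg_repack --table=public.big_table --no-order my_database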
ddl_operations
12 questions
PostgreSQL Foreign Key ON DELETE Options
PostgreSQL supports 5 ON DELETE actions for foreign key constraints:
1. NO ACTION (default)
Prevents deletion if referenced rows exist. Check is performed at the end of the statement.
ALTER TABLE orders
ADD CONSTRAINT fk_customer
FOREIGN KEY (customer_id) REFERENCES customers(id)
ON DELETE NO ACTION;
2. RESTRICT
Prevents deletion if referenced rows exist. Check is performed immediately (difference from NO ACTION only matters with deferrable constraints).
ALTER TABLE orders
ADD CONSTRAINT fk_customer
FOREIGN KEY (customer_id) REFERENCES customers(id)
ON DELETE RESTRICT;
3. CASCADE
Automatically deletes all referencing rows when the referenced row is deleted.
ALTER TABLE order_items
ADD CONSTRAINT fk_order
FOREIGN KEY (order_id) REFERENCES orders(id)
ON DELETE CASCADE;
4. SET NULL
Sets the foreign key column(s) to NULL when the referenced row is deleted. The column must be nullable.
ALTER TABLE orders
ADD CONSTRAINT fk_salesperson
FOREIGN KEY (salesperson_id) REFERENCES employees(id)
ON DELETE SET NULL;
5. SET DEFAULT
Sets the foreign key column(s) to their DEFAULT value when the referenced row is deleted. A default value must be defined.
ALTER TABLE orders
ADD CONSTRAINT fk_status
FOREIGN KEY (status_id) REFERENCES order_statuses(id)
ON DELETE SET DEFAULT;
Default Behavior: If ON DELETE is not specified, NO ACTION is used.
Source: PostgreSQL 17 Official Documentation - Foreign Keys (https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-FK)
PRIMARY KEY vs UNIQUE Constraint in PostgreSQL
Direct Answer:
- PRIMARY KEY = UNIQUE + NOT NULL + table identifier (only ONE per table)
- UNIQUE constraint allows NULL values (multiple NULLs permitted) and multiple unique constraints per table
Key Differences
1. NULL Handling
-- PRIMARY KEY: Rejects NULLs
CREATE TABLE users (
id INTEGER PRIMARY KEY,
email VARCHAR(255)
);
INSERT INTO users (id, email) VALUES (NULL, '[email protected]');
-- ERROR: null value in column "id" violates not-null constraint
-- UNIQUE: Allows NULLs (multiple NULL values permitted)
CREATE TABLE products (
id INTEGER PRIMARY KEY,
sku VARCHAR(50) UNIQUE
);
INSERT INTO products (id, sku) VALUES (1, NULL); -- OK
INSERT INTO products (id, sku) VALUES (2, NULL); -- OK (multiple NULLs allowed)
2. Quantity Per Table
- PRIMARY KEY: Exactly ONE per table
- UNIQUE: Multiple allowed per table
CREATE TABLE orders (
id INTEGER PRIMARY KEY, -- Only one PRIMARY KEY
order_number VARCHAR(50) UNIQUE, -- First UNIQUE constraint
tracking_code VARCHAR(50) UNIQUE -- Second UNIQUE constraint - OK
);
3. Foreign Key References
- PRIMARY KEY: Automatic target for foreign key references (default)
- UNIQUE: Can be referenced by foreign keys but must be explicitly specified
CREATE TABLE departments (
id INTEGER PRIMARY KEY,
dept_code VARCHAR(10) UNIQUE
);
-- References PRIMARY KEY by default
CREATE TABLE employees (
id INTEGER PRIMARY KEY,
dept_id INTEGER REFERENCES departments -- References departments(id)
);
-- Must explicitly specify UNIQUE column
CREATE TABLE projects (
id INTEGER PRIMARY KEY,
dept_code VARCHAR(10) REFERENCES departments(dept_code)
);
4. Index Creation
Both automatically create a unique B-tree index, but:
- PRIMARY KEY index named: tablename_pkey
- UNIQUE constraint index named: tablename_columnname_key
When to Use Each
- PRIMARY KEY: Table's main identifier (user IDs, order IDs)
- UNIQUE: Alternative unique identifiers (email addresses, SKUs, username)
PostgreSQL Version: Behavior consistent across PostgreSQL 9.x through 17.x (current as of January 2025)
Source: PostgreSQL Official Documentation - Table Constraints
An EXCLUSION constraint in PostgreSQL ensures that if any two rows are compared on specified columns or expressions using specified operators, at least one of those operator comparisons must return false or null. It generalizes UNIQUE constraints by allowing custom operators beyond equality.
Syntax
CREATE TABLE example (
room_id int,
reservation_period tstzrange,
EXCLUDE USING gist (room_id WITH =, reservation_period WITH &&)
);
Key Requirements
- Index Method Required: Must specify an index method (typically gist or spgist) that supports the operators being used
- Operator Class: Each element requires an operator. Common operators: = (equality, on scalar types) and && (overlaps, for range and geometric types)
- Extension Dependency: Using = on ordinary scalar columns (such as room_id in the example) inside a GiST exclusion constraint requires the btree_gist extension
Complete Example
-- Enable extension for btree operators in GiST
CREATE EXTENSION btree_gist;
-- Prevent overlapping room reservations
CREATE TABLE room_reservations (
room_id int,
reserved_during tstzrange,
EXCLUDE USING gist (
room_id WITH =,
reserved_during WITH &&
)
);
-- This succeeds (different rooms)
INSERT INTO room_reservations VALUES
(101, '[2024-01-01 10:00, 2024-01-01 12:00)');
-- This fails (same room, overlapping time)
INSERT INTO room_reservations VALUES
(101, '[2024-01-01 11:00, 2024-01-01 13:00)');
-- ERROR: conflicting key value violates exclusion constraint
Optional Clauses
- WHERE (predicate): Makes constraint partial (only checks rows matching predicate)
- DEFERRABLE / INITIALLY DEFERRED: Delays checking until transaction commit
Available since: PostgreSQL 9.0
Source: PostgreSQL 17 official documentation, Chapter 5.4 (Constraints)
Adding a Default Value to an Existing Column in PostgreSQL
Use ALTER TABLE with SET DEFAULT:
ALTER TABLE table_name
ALTER COLUMN column_name SET DEFAULT default_value;
Example:
ALTER TABLE users
ALTER COLUMN status SET DEFAULT 'active';
ALTER TABLE orders
ALTER COLUMN created_at SET DEFAULT CURRENT_TIMESTAMP;
ALTER TABLE products
ALTER COLUMN price SET DEFAULT 0.00;
CRITICAL BEHAVIOR:
- This command applies only to future INSERT operations where the column is not explicitly specified
- It does NOT update existing NULL values in the table
- The operation is immediate and does not rewrite the table (fast, no table lock)
To also update existing rows:
-- Set the default
ALTER TABLE table_name
ALTER COLUMN column_name SET DEFAULT default_value;
-- Update existing NULL values
UPDATE table_name
SET column_name = default_value
WHERE column_name IS NULL;
To remove a default:
ALTER TABLE table_name
ALTER COLUMN column_name DROP DEFAULT;
Supported in: PostgreSQL 8.0+ (all currently supported versions as of 2025)
When to Use Unlogged Tables in PostgreSQL
Use unlogged tables when you need maximum write performance for data that can be safely lost on a crash or unclean shutdown.
Critical Behavior
Unlogged tables in PostgreSQL (available since 9.1):
- Are NOT written to WAL (Write-Ahead Log)
- Are truncated automatically on crash recovery or unclean shutdown
- Cannot be replicated to standby servers
- Provide significantly faster writes (typically 2-10x) due to no WAL overhead
Specific Use Cases
Use unlogged tables for:
- Session/temporary data - User session state, shopping carts
- Cache tables - Materialized query results that can be regenerated
- ETL staging - Intermediate data during bulk loads that will be copied elsewhere
- Analytics scratch space - Temporary aggregations or data transformations
- High-throughput logging where data loss is acceptable (e.g., non-critical metrics)
Do NOT use for:
- Any data that must survive crashes
- Data requiring replication to standbys
- ACID-compliant transactions where durability matters
Syntax
-- Create unlogged table
CREATE UNLOGGED TABLE session_data (
session_id TEXT PRIMARY KEY,
user_id INTEGER,
data JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- Convert existing table to unlogged
ALTER TABLE my_table SET UNLOGGED;
-- Convert back to logged
ALTER TABLE my_table SET LOGGED;
Performance Characteristics
- Write performance: 2-10x faster than logged tables (exact gain depends on workload and storage)
- fsync overhead: Eliminated for unlogged tables
- Crash recovery: Table is automatically truncated (all data lost)
Source: PostgreSQL 17 official documentation (CREATE TABLE - Unlogged)
Drop a Column from a PostgreSQL Table
Use the ALTER TABLE statement with DROP COLUMN:
ALTER TABLE table_name DROP COLUMN column_name;
Example:
ALTER TABLE users DROP COLUMN middle_name;
Critical Details
CASCADE behavior: If the column is referenced by other objects (views, foreign keys, indexes), the command will fail unless you add CASCADE:
ALTER TABLE table_name DROP COLUMN column_name CASCADE;
This automatically drops dependent objects.
IF EXISTS clause (PostgreSQL 8.2+): Prevents errors if the column doesn't exist:
ALTER TABLE table_name DROP COLUMN IF EXISTS column_name;
Transaction safety: The operation is transactional and can be rolled back:
BEGIN;
ALTER TABLE users DROP COLUMN email;
ROLLBACK; -- Undoes the drop
Performance note: DROP COLUMN is a fast catalog-only operation. PostgreSQL marks the column as dropped rather than rewriting the table, and the space it occupied is reclaimed gradually as rows are updated or when the table is rewritten (for example by VACUUM FULL).
Multiple columns can be dropped in one statement:
ALTER TABLE users
DROP COLUMN middle_name,
DROP COLUMN nickname;
Composite Primary Key in PostgreSQL
A composite primary key uses multiple columns to uniquely identify rows. Define it using the PRIMARY KEY constraint with a comma-separated list of columns.
Syntax
During table creation:
CREATE TABLE table_name (
column1 data_type,
column2 data_type,
column3 data_type,
PRIMARY KEY (column1, column2)
);
Adding to existing table:
ALTER TABLE table_name
ADD PRIMARY KEY (column1, column2);
Example
CREATE TABLE order_items (
order_id INTEGER,
product_id INTEGER,
quantity INTEGER,
PRIMARY KEY (order_id, product_id)
);
Critical Details
- Column order matters: PRIMARY KEY (a, b) creates a different index structure than PRIMARY KEY (b, a). Order columns by query patterns (most selective/frequently filtered first).
- Maximum columns: PostgreSQL allows up to 32 columns in a composite key (limited by the index maximum of 32 columns).
- Implicit NOT NULL: All columns in a primary key automatically become NOT NULL.
- Automatic index: PostgreSQL creates a unique B-tree index on the column combination.
- Constraint naming: Use CONSTRAINT constraint_name PRIMARY KEY (col1, col2) for explicit naming.
Verification
Supported since PostgreSQL 7.1+, current through PostgreSQL 17.
Source: PostgreSQL Official Documentation - CREATE TABLE (https://www.postgresql.org/docs/current/sql-createtable.html)
DROP TABLE ... CASCADE in PostgreSQL automatically drops a table and all objects that depend on it, preventing dependency errors.
Syntax
DROP TABLE table_name CASCADE;
What CASCADE Does
When you drop a table with CASCADE, PostgreSQL automatically drops:
- Views that reference the table
- Foreign key constraints in other tables pointing to this table
- Functions/procedures that depend on the table
- Triggers on the table
- Rules defined on the table
- Dependent objects recursively (objects depending on the dropped objects)
Example
-- Create tables with dependencies
CREATE TABLE orders (id INT PRIMARY KEY);
CREATE TABLE order_items (
id INT PRIMARY KEY,
order_id INT REFERENCES orders(id)
);
CREATE VIEW order_summary AS SELECT * FROM orders;
-- This fails with "cannot drop table orders because other objects depend on it"
DROP TABLE orders;
-- This succeeds and drops orders, the foreign key in order_items, and order_summary view
DROP TABLE orders CASCADE;
Alternative: RESTRICT
The default behavior is RESTRICT (explicitly written or implied), which refuses to drop the table if any objects depend on it:
DROP TABLE orders RESTRICT; -- Fails if dependencies exist
DROP TABLE orders; -- Same as RESTRICT (default)
Critical Warning
CASCADE can drop many more objects than you intend. Always review dependencies first:
-- Check dependencies before dropping
SELECT * FROM pg_depend WHERE refobjid = 'orders'::regclass;
Version: Available in all supported PostgreSQL versions (9.x through 17+)
CREATE TABLE ... LIKE creates a new table by copying the structure of an existing table. The new table is completely independent—changes to the original table do not affect the new table.
What it copies by default:
- Column names
- Data types
- NOT NULL constraints
What it does NOT copy by default:
- Data/rows
- Indexes
- Primary keys
- Foreign keys
- DEFAULT values
- CHECK constraints (other than NOT NULL)
- Comments
- Identity specifications
Syntax:
CREATE TABLE new_table (LIKE existing_table [INCLUDING options]);
INCLUDING options to copy additional properties:
- INCLUDING DEFAULTS - Copy default value expressions
- INCLUDING CONSTRAINTS - Copy CHECK constraints
- INCLUDING INDEXES - Copy indexes (including PK, UNIQUE, EXCLUDE)
- INCLUDING IDENTITY - Copy identity column specifications
- INCLUDING GENERATED - Copy generated column expressions
- INCLUDING STATISTICS - Copy extended statistics
- INCLUDING COMMENTS - Copy column/constraint comments
- INCLUDING STORAGE - Copy TOAST storage settings
- INCLUDING COMPRESSION - Copy column compression methods
- INCLUDING ALL - Copy everything above
Example:
-- Basic: copies only columns, types, NOT NULL
CREATE TABLE employees_backup (LIKE employees);
-- Copy with indexes and defaults
CREATE TABLE employees_copy (LIKE employees INCLUDING INDEXES INCLUDING DEFAULTS);
-- Copy everything
CREATE TABLE employees_full (LIKE employees INCLUDING ALL);
The LIKE clause works with tables, views, foreign tables, and composite types.
Sources:
Creating a Table from a SELECT Query in PostgreSQL
Use CREATE TABLE AS (CTAS) to create a table from a SELECT query:
CREATE TABLE new_table AS
SELECT column1, column2, column3
FROM existing_table
WHERE condition;
Key characteristics:
- Creates table structure automatically based on SELECT result columns
- Inserts all rows returned by the SELECT
- Does NOT copy indexes, constraints, or defaults from source tables
- Does NOT create a primary key automatically
- Column names match the SELECT list (use aliases to rename:
SELECT col AS new_name)
With additional options (PostgreSQL 9.5+):
CREATE TABLE new_table AS
SELECT * FROM existing_table
WITH NO DATA; -- Creates structure only, no rows
Alternative syntax using SELECT INTO (identical functionality):
SELECT column1, column2
INTO new_table
FROM existing_table;
To include constraints after creation:
CREATE TABLE new_table AS
SELECT * FROM existing_table;
ALTER TABLE new_table ADD PRIMARY KEY (id);
CREATE INDEX idx_name ON new_table(column_name);
Temporary table variant:
CREATE TEMP TABLE temp_table AS
SELECT * FROM existing_table;
Authority: PostgreSQL official documentation (CREATE TABLE AS command, compatible with PostgreSQL 9.0+, syntax unchanged through PostgreSQL 17).
ALTER TABLE ... SET NOT NULL Syntax (PostgreSQL)
The syntax to add a NOT NULL constraint to an existing column is:
ALTER TABLE table_name ALTER COLUMN column_name SET NOT NULL;
Example:
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
Critical Requirements:
- The column must NOT contain any NULL values before executing this command, or the operation will fail with error 23502
- This acquires an ACCESS EXCLUSIVE lock on the table, blocking all other operations
PostgreSQL 12+ Optimization:
If you first add and validate a CHECK constraint that proves the column contains no NULLs, PostgreSQL 12+ can skip the full table scan when you then run SET NOT NULL; the redundant CHECK constraint can be dropped afterwards:
-- Step 1: Add CHECK constraint (does NOT block writes heavily)
ALTER TABLE users ADD CONSTRAINT users_email_not_null CHECK (email IS NOT NULL) NOT VALID;
-- Step 2: Validate the constraint (uses SHARE UPDATE EXCLUSIVE lock)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
-- Step 3: Set NOT NULL (skips table scan since constraint proves no NULLs exist)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
-- Step 4: Drop the redundant CHECK constraint
ALTER TABLE users DROP CONSTRAINT users_email_not_null;
Removing NOT NULL:
ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL;
Source: PostgreSQL 17 Official Documentation - ALTER TABLE command reference
CHECK Constraint in PostgreSQL
A CHECK constraint is a table-level or column-level constraint that enforces a boolean expression on column values. The constraint rejects INSERT or UPDATE operations if the expression evaluates to FALSE; it accepts the operation if the expression evaluates to TRUE or NULL.
Syntax
Column-level CHECK:
CREATE TABLE products (
price NUMERIC CHECK (price > 0),
discount NUMERIC CHECK (discount >= 0 AND discount <= 100)
);
Table-level CHECK (for multi-column conditions):
CREATE TABLE orders (
quantity INTEGER,
unit_price NUMERIC,
total NUMERIC,
CHECK (total = quantity * unit_price)
);
Named CHECK constraint:
CREATE TABLE employees (
salary NUMERIC,
CONSTRAINT valid_salary CHECK (salary > 0 AND salary < 1000000)
);
Critical Behaviors
NULL handling: CHECK constraints pass when the expression evaluates to NULL (unknown). To disallow NULLs, combine with NOT NULL.
Expression limitations: The CHECK expression:
- Cannot contain subqueries
- Cannot reference columns from other tables
- Cannot reference other rows (only current row being inserted/updated)
- Should call only immutable functions (not CURRENT_TIMESTAMP or other non-immutable functions)
Validation timing: Evaluated whenever a row is inserted or updated. ALTER TABLE ... ADD CONSTRAINT also checks all existing rows immediately, unless you specify NOT VALID and run VALIDATE CONSTRAINT later.
Adding to Existing Tables
-- Validates all existing rows immediately
ALTER TABLE products ADD CONSTRAINT check_price CHECK (price > 0);
-- Skips existing rows, validates only new/updated rows (PostgreSQL 9.4+)
ALTER TABLE products ADD CONSTRAINT check_price CHECK (price > 0) NOT VALID;
-- Later validate:
ALTER TABLE products VALIDATE CONSTRAINT check_price;
Source: PostgreSQL 17 Official Documentation - Table Constraints (https://www.postgresql.org/docs/current/ddl-constraints.html)
authentication
2 questions
PostgreSQL 18 OAuth 2.0 Authentication
Yes. PostgreSQL 18 introduces OAuth 2.0 authentication support, allowing integration with modern identity providers like Okta, Auth0, Azure AD, and Keycloak.
Configuration (pg_hba.conf)
# OAuth 2.0 authentication
host all all 0.0.0.0/0 oauth issuer="https://auth.example.com" client_id="pg_client"
How It Works
- Client requests access token from OAuth provider
- Client connects to PostgreSQL with token
- PostgreSQL validates token with issuer
- Connection established if token valid
Server Configuration
-- postgresql.conf
oauth_issuer = 'https://auth.example.com'
oauth_client_id = 'postgresql-server'
oauth_client_secret = 'your-secret' -- Or use file
Client Connection
# Using psql with OAuth token
PGOAUTHTOKEN="eyJhbG..." psql -h myserver -U myuser -d mydb
# Using libpq connection string
psql "host=myserver user=myuser oauth_token=eyJhbG..."
Supported Flows
| Flow | Use Case |
|---|---|
| Client Credentials | Service-to-service |
| Authorization Code | Interactive users |
| Device Authorization | CLI tools |
Provider Examples
# Azure AD
host all all 0.0.0.0/0 oauth \
issuer="https://login.microsoftonline.com/{tenant}/v2.0" \
client_id="your-app-id"
# Okta
host all all 0.0.0.0/0 oauth \
issuer="https://your-domain.okta.com/oauth2/default" \
client_id="your-client-id"
# Keycloak
host all all 0.0.0.0/0 oauth \
issuer="https://keycloak.example.com/realms/myrealm" \
client_id="postgresql"
Security Notes
- Tokens validated via OIDC discovery document
- JWT signature verification automatic
- Token expiration enforced
- MD5 password auth deprecated in favor of SCRAM-SHA-256 or OAuth
Source: PostgreSQL 18 Release Notes
https://www.postgresql.org/docs/18/release-18.html
PostgreSQL 18 MD5 Authentication Deprecation
Yes. PostgreSQL 18 officially deprecates MD5 password authentication in favor of SCRAM-SHA-256.
Deprecation Notice
When using MD5 authentication in PostgreSQL 18, you'll see warnings in logs:
WARNING: md5 authentication is deprecated and will be removed in a future release
HINT: Use scram-sha-256 authentication instead.
Migration Steps
1. Check Current Configuration
-- Check current password encryption
SHOW password_encryption; -- Should be 'scram-sha-256'
-- Check pg_hba.conf entries
-- Look for 'md5' in auth-method column
2. Update Server Configuration
-- postgresql.conf
password_encryption = scram-sha-256 -- Already default since PG14
3. Re-encrypt User Passwords
-- Users must reset passwords to use SCRAM
ALTER USER myuser PASSWORD 'new_secure_password';
-- Verify password type
SELECT usename, passwd LIKE 'SCRAM%' AS is_scram
FROM pg_shadow
WHERE usename = 'myuser';
4. Update pg_hba.conf
# Before (deprecated)
host all all 0.0.0.0/0 md5
# After (recommended)
host all all 0.0.0.0/0 scram-sha-256
Why SCRAM-SHA-256 Is Better
| Aspect | MD5 | SCRAM-SHA-256 |
|---|---|---|
| Algorithm strength | Weak (broken) | Strong |
| Replay attacks | Vulnerable | Protected |
| Man-in-middle | Vulnerable | Protected |
| Channel binding | No | Yes |
| Password storage | Weak hash | Salted, iterated |
Compatibility Notes
-- Clients must support SCRAM
-- libpq 10+ supports SCRAM
-- Most drivers updated years ago
-- Check client library version
SELECT version(); -- Server version
-- Client: psql --version, check driver docs
Migration Timeline
| Version | Status |
|---|---|
| PG 10 | SCRAM-SHA-256 introduced |
| PG 14 | SCRAM-SHA-256 default for new passwords |
| PG 18 | MD5 deprecated (warnings) |
| PG 19+ | MD5 may be removed |
Source: PostgreSQL 18 Release Notes
https://www.postgresql.org/docs/18/release-18.html
replication
2 questions
PostgreSQL 17 Failover Logical Replication Slots
PostgreSQL 17 enables logical replication slots to survive primary failover by synchronizing them to standby servers.
Prerequisites
- Physical streaming replication between primary and standby
- Hot standby enabled on standby
- hot_standby_feedback = on on the standby
- primary_slot_name configured on the standby
Configuration
On Primary (postgresql.conf):
-- List standbys that should receive slot sync
synchronized_standby_slots = 'standby1_slot'
On Standby (postgresql.conf):
-- Enable slot synchronization
sync_replication_slots = on
hot_standby_feedback = on
primary_slot_name = 'standby1_slot'
Creating Failover-Enabled Slots
-- Method 1: Direct slot creation with failover flag
SELECT pg_create_logical_replication_slot(
'my_slot',
'pgoutput',
false, -- temporary
false, -- two_phase
true -- failover (NEW in PG17)
);
-- Method 2: Via subscription
CREATE SUBSCRIPTION my_sub
CONNECTION 'host=primary dbname=mydb'
PUBLICATION my_pub
WITH (failover = true);
Monitoring
-- Check slot sync status on standby
SELECT slot_name, synced, active
FROM pg_replication_slots;
-- synced = true means slot is ready for failover
-- Check if slot changes are synchronized
SELECT * FROM pg_stat_replication_slots;
Failover Process
- Primary fails
- Standby promotes to new primary
- Synced slots (where synced = true) become active
- Subscribers reconnect to new primary
- Logical replication continues from last confirmed LSN
Important Notes
- Only slots with synced = true at failover time can be used
- A physical replication slot between primary and standby is required
- Slot sync happens periodically via the slotsync worker
- Logical replication changes aren't consumed until the standby confirms receipt
Source: PostgreSQL 17 Documentation - Logical Replication Failover
https://www.postgresql.org/docs/17/logical-replication-failover.html
PostgreSQL 18 idle_replication_slot_timeout
idle_replication_slot_timeout automatically invalidates replication slots that have been inactive for a specified period, preventing WAL bloat.
The Problem It Solves
Abandoned replication slots prevent WAL cleanup, leading to:
- Disk space exhaustion
- Potential database unavailability
- Manual intervention required
Configuration
-- Set timeout (default: 0 = disabled)
ALTER SYSTEM SET idle_replication_slot_timeout = '1d'; -- 1 day
SELECT pg_reload_conf();
-- Check current setting
SHOW idle_replication_slot_timeout;
Valid Values
| Value | Meaning |
|---|---|
| 0 | Disabled (default) |
| 30min | 30 minutes |
| 1h | 1 hour |
| 1d | 1 day |
| 7d | 1 week |
Comparison with max_slot_wal_keep_size
| Parameter | Triggers On | Use Case |
|---|---|---|
| max_slot_wal_keep_size | WAL size exceeds limit | Protect disk space |
| idle_replication_slot_timeout | Time since last activity | Clean up abandoned slots |
Example Scenario
-- Create a slot
SELECT pg_create_logical_replication_slot('test_slot', 'pgoutput');
-- Slot becomes inactive (subscriber disconnects and never reconnects)
-- After idle_replication_slot_timeout passes, slot is invalidated
-- Check slot status
SELECT slot_name, active, invalidation_reason
FROM pg_replication_slots;
-- invalidation_reason will show 'idle_timeout' if expired
Best Practice
-- Combine both protections
max_slot_wal_keep_size = '100GB' -- WAL size limit
idle_replication_slot_timeout = '7d' -- Time limit
Source: PostgreSQL 18 Documentation - Replication Configuration
https://www.postgresql.org/docs/18/runtime-config-replication.html
data_loading
1 question
PostgreSQL 17 COPY Performance
PostgreSQL 17 delivers up to 2x faster COPY operations for bulk loading and exporting large rows.
Performance Comparison
| Operation | PostgreSQL 16 | PostgreSQL 17 | Improvement |
|---|---|---|---|
| COPY FROM (large rows) | 100 MB/s | 180 MB/s | 1.8x |
| COPY TO (export) | 120 MB/s | 200 MB/s | 1.7x |
| COPY with FREEZE | 90 MB/s | 170 MB/s | 1.9x |
Benchmarks on NVMe SSD, 64-core server, varies by hardware
What Changed
- Reduced memory allocation overhead: Batch buffer management
- Optimized tuple formation: Less copying of large values
- Improved I/O batching: Better write coalescing
- TOAST handling: More efficient for large text/bytea
Best Practices for Fast COPY
-- Optimal bulk load settings
SET maintenance_work_mem = '2GB';
-- max_wal_size cannot be changed with SET; raise it server-wide instead, e.g.:
-- ALTER SYSTEM SET max_wal_size = '10GB'; SELECT pg_reload_conf();
-- Use FREEZE for initial loads (rows are loaded already frozen; the table
-- must have been created or truncated in the current transaction)
COPY large_table FROM '/data/file.csv' WITH (FORMAT csv, FREEZE);
-- Binary format for numeric-heavy data
COPY my_table TO '/backup/data.bin' WITH (FORMAT binary);
COPY my_table FROM '/backup/data.bin' WITH (FORMAT binary);
-- Parallel load via multiple connections
-- Split file and run concurrent COPY commands
Monitoring COPY Progress
-- Check progress (PostgreSQL 14+)
SELECT
command,
bytes_processed,
bytes_total,
tuples_processed,
round(100.0 * bytes_processed / nullif(bytes_total, 0), 1) AS pct_done
FROM pg_stat_progress_copy;
COPY Options Recap
| Option | Purpose |
|---|---|
| FREEZE | Skip VACUUM for new tables |
| FORMAT binary | Faster for numeric data |
| PARALLEL n | Multiple workers (planned) |
| ON_ERROR ignore | Skip bad rows |
| HEADER | Skip/include CSV header |
Example: Large Data Load
# Fast CSV import
psql -c "COPY events FROM PROGRAM 'zcat events.csv.gz' WITH (FORMAT csv, HEADER);"
# With progress monitoring
psql -c "
SET maintenance_work_mem = '1GB';
COPY events FROM '/data/events.csv' WITH (FORMAT csv, HEADER, FREEZE);
"
Source: PostgreSQL 17 Release Notes
https://www.postgresql.org/docs/17/release-17.html
session_management
1 question
set_config() and current_setting() are PostgreSQL functions for managing session/transaction-level configuration parameters, commonly used to pass application context (like user ID or tenant ID) to Row Level Security policies.
current_setting()
Syntax:
current_setting(setting_name text [, missing_ok boolean]) -> text
Usage:
-- Get a setting (throws error if not exists)
SELECT current_setting('app.current_user_id');
-- Get a setting (returns NULL if not exists)
SELECT current_setting('app.current_user_id', true);
Equivalent to SHOW setting_name.
set_config()
Syntax:
set_config(setting_name text, new_value text, is_local boolean) -> text
Parameters:
- setting_name - Custom parameters must use dotted names (e.g., app.tenant_id)
- new_value - The value to set (NULL resets to default)
- is_local - If true, applies only to current transaction; if false, applies to entire session
Usage:
-- Set for entire session
SELECT set_config('app.tenant_id', '42', false);
-- Set for current transaction only (required for PgBouncer)
SELECT set_config('app.tenant_id', '42', true);
Equivalent to SET [LOCAL] setting_name = value.
Multi-Tenant RLS Example
-- Create policy using application context
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.tenant_id', true)::integer);
-- In application, set context before queries
BEGIN;
SELECT set_config('app.tenant_id', '42', true);
SELECT * FROM orders; -- Only sees tenant 42's orders
COMMIT;
Important: Use is_local = true when using connection poolers like PgBouncer in transaction mode to ensure context doesn't leak between requests.
Sources:
server_configuration
1 question
PostgreSQL 18 io_method Configuration
io_method is a new PostgreSQL 18 configuration parameter that controls how read operations are dispatched to storage.
Available Options
| Value | Description | Platform |
|---|---|---|
| sync | Synchronous blocking reads (legacy PG17 behavior) | All |
| worker | Dedicated I/O worker processes (default) | All |
| io_uring | Linux kernel ring buffer for minimal syscall overhead | Linux 5.1+ |
Performance Comparison
Benchmark results for sequential scan on cold cache:
| Method | Time | Improvement |
|---|---|---|
| sync | 15,071ms | Baseline |
| worker | 10,052ms | 1.5x faster |
| io_uring | 5,723ms | 2.6x faster |
Configuration
-- Check current setting
SHOW io_method;
-- Set in postgresql.conf
io_method = 'io_uring' -- For Linux with kernel 5.1+
io_method = 'worker' -- Cross-platform default
When to Use Each
- io_uring: Best for Linux production servers with high I/O workloads
- worker: Safe default for cross-platform compatibility
- sync: Only for debugging or compatibility testing
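A rough way to compare the settings on your own hardware (big_table is a placeholder; change io_method in postgresql.conf and restart between runs, and clear the OS page cache so reads actually hit storage):
-- Confirm which method is active, then time a cold-cache sequential scan
SHOW io_method;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM big_table;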
Current Limitations
- AIO only applies to reads (sequential scans, bitmap heap scans, vacuum)
- Index scans don't use AIO yet
- Write operations and WAL still use synchronous I/O
Source: PostgreSQL 18 Documentation - Runtime Configuration
https://www.postgresql.org/docs/18/runtime-config-resource.html
data_types
PostgreSQL 18 uuidv7() Function
uuidv7() generates timestamp-ordered UUIDs that are optimal for B-tree indexes and distributed systems.
Key Characteristics
- First 48 bits: Unix timestamp (millisecond precision)
- Next 12 bits: Sub-millisecond counter for monotonicity
- Remaining bits: Random data
- Total: 128 bits (standard UUID size)
Usage
-- Generate a UUIDv7
SELECT uuidv7();
-- Result: 019376a8-5b40-7abc-8def-1234567890ab
-- Use as primary key default
CREATE TABLE events (
id uuid DEFAULT uuidv7() PRIMARY KEY,
event_type TEXT,
payload JSONB,
created_at TIMESTAMPTZ DEFAULT now()
);
-- Also available: explicit v4 function
SELECT uuidv4(); -- Alias for gen_random_uuid()
Performance vs UUIDv4
| Metric | UUIDv7 | UUIDv4 |
|---|---|---|
| Generation time | 58.1 microseconds | 86.8 microseconds |
| Throughput | 34,127 ops/sec | 29,238 ops/sec |
| Index fragmentation | Minimal | High |
| Insert performance | Excellent (sequential) | Poor (random) |
Benefits
- Chronological ordering: Later UUIDs sort after earlier ones
- Reduced index fragmentation: New values always append to index end
- Better cache locality: Related records stored together
- Global uniqueness: No coordination needed across nodes
Considerations
- Timestamp is visible in the UUID and can be read back (see the sketch after this list); don't use uuidv7 for security-sensitive external IDs
- Monotonicity guaranteed only within same backend process
- Requires NTP for clock synchronization across nodes
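A minimal sketch of reading the embedded data back, assuming PostgreSQL 18's uuid_extract_version() and uuid_extract_timestamp() helper functions:
SELECT uuid_extract_version(uuidv7());    -- 7
SELECT uuid_extract_timestamp(uuidv7());  -- generation time as timestamptz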
Source: PostgreSQL 18 Documentation - UUID Functions
https://www.postgresql.org/docs/18/functions-uuid.html
data_definition
PostgreSQL 18 Generated Columns Default
VIRTUAL is now the default for generated columns in PostgreSQL 18, changed from STORED in previous versions.
STORED vs VIRTUAL
| Aspect | STORED | VIRTUAL (new default) |
|---|---|---|
| Storage | Written to disk | No disk space |
| Computation | On INSERT/UPDATE | On SELECT (read time) |
| Indexable | Yes | No |
| Adding to table | Requires table rewrite | Instant |
Syntax
-- PostgreSQL 18: VIRTUAL is default
CREATE TABLE products (
id SERIAL PRIMARY KEY,
price NUMERIC(10,2),
quantity INTEGER,
total NUMERIC GENERATED ALWAYS AS (price * quantity) -- VIRTUAL by default
);
-- Explicit STORED (for indexing)
CREATE TABLE products_indexed (
id SERIAL PRIMARY KEY,
first_name TEXT,
last_name TEXT,
full_name TEXT GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED
);
-- Explicit VIRTUAL
CREATE TABLE calculations (
a INTEGER,
b INTEGER,
sum INTEGER GENERATED ALWAYS AS (a + b) VIRTUAL
);
When to Use Each
Use VIRTUAL (default) when:
- Adding columns to large existing tables (instant, no rewrite)
- Write-heavy workloads (saves I/O)
- Storage space is a concern
- Column values change frequently via base columns
Use STORED when:
- You need to create an index on the generated column
- Read-heavy workloads with complex expressions
- You need constraints (UNIQUE, NOT NULL, FK)
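As a sketch of the "instant add" case from the VIRTUAL list above (the column name and expression are illustrative; because nothing is stored, adding the column should not rewrite the table):
-- PostgreSQL 18: add a virtual generated column to an existing table
ALTER TABLE products
ADD COLUMN margin NUMERIC GENERATED ALWAYS AS (price * 0.2) VIRTUAL;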
Current VIRTUAL Limitations
- Cannot be indexed
- No unique/foreign key constraints
- Not supported in logical replication
- Expression must use only built-in functions (no user-defined)
Source: PostgreSQL 18 Documentation - Generated Columns
https://www.postgresql.org/docs/18/ddl-generated-columns.html
foreign_data
PostgreSQL 17 postgres_fdw Subquery Push Down
PostgreSQL 17 enables EXISTS and IN subqueries to be pushed to remote PostgreSQL servers via postgres_fdw, reducing data transfer and improving performance.
The Improvement
Previously, EXISTS/IN subqueries against foreign tables were executed locally, requiring all foreign data to be fetched first.
Example
-- Setup: foreign table pointing to remote server
CREATE SERVER remote_server
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'remote.example.com', dbname 'salesdb');
CREATE FOREIGN TABLE remote_orders (
id INT,
customer_id INT,
total NUMERIC
) SERVER remote_server;
-- This query now pushes the subquery to remote
SELECT * FROM local_customers c
WHERE EXISTS (
SELECT 1 FROM remote_orders o
WHERE o.customer_id = c.id
AND o.total > 1000
);
EXPLAIN Comparison
-- PostgreSQL 16: Subquery executed locally
EXPLAIN VERBOSE SELECT * FROM local_customers c
WHERE EXISTS (SELECT 1 FROM remote_orders o WHERE o.customer_id = c.id);
-- Shows:
-- Filter: EXISTS (SubPlan)
-- -> Foreign Scan on remote_orders -- Fetches ALL rows
-- Remote SQL: SELECT id, customer_id, total FROM orders
-- PostgreSQL 17: Subquery pushed to remote
EXPLAIN VERBOSE SELECT * FROM local_customers c
WHERE EXISTS (SELECT 1 FROM remote_orders o WHERE o.customer_id = c.id);
-- Shows:
-- Foreign Scan
-- Remote SQL: SELECT ... WHERE EXISTS (SELECT 1 FROM orders WHERE ...)
Supported Patterns
| Pattern | Pushed Down (PG17)? |
|---|---|
| WHERE EXISTS (SELECT ... FROM foreign_table) | Yes |
| WHERE id IN (SELECT id FROM foreign_table) | Yes |
| WHERE NOT EXISTS (...) | Yes |
| WHERE id NOT IN (...) | Yes |
| Correlated subqueries | Yes |
Performance Impact
| Scenario | PG16 | PG17 |
|---|---|---|
| 1M remote rows, 100 matches | Fetch 1M rows | Fetch 100 rows |
| Network transfer | High | Minimal |
| Query time | Minutes | Seconds |
Configuration
-- Push down happens automatically when the query qualifies; there is no switch to turn on.
-- fetch_size only tunes how many rows are pulled per round trip for scans that are not pushed down.
ALTER SERVER remote_server OPTIONS (ADD fetch_size '1000');
-- To check what gets pushed, inspect the "Remote SQL" line of the verbose plan
EXPLAIN (VERBOSE)
SELECT * FROM local_customers c
WHERE EXISTS (SELECT 1 FROM remote_orders o WHERE o.customer_id = c.id);
Source: PostgreSQL 17 Release Notes
https://www.postgresql.org/docs/17/release-17.html
vacuum_maintenance
PostgreSQL 17 VACUUM Memory Improvements
PostgreSQL 17 introduces a new internal memory structure for VACUUM that consumes up to 20x less memory than previous versions.
Memory Comparison
| Scenario | PostgreSQL 16 | PostgreSQL 17 |
|---|---|---|
| 1M dead tuples | ~128 MB | ~6 MB |
| 10M dead tuples | ~1.28 GB | ~64 MB |
| Large table vacuum | Often OOM | Stable |
How It Works
Previous versions stored dead tuple IDs in a flat array that grew linearly. PostgreSQL 17 uses a radix tree (TID store) that:
- Compresses common prefixes of tuple IDs
- Scales sub-linearly with dead tuple count
- Reduces memory fragmentation
- Improves cache efficiency
Configuration
The new structure respects existing settings:
-- maintenance_work_mem still applies
SHOW maintenance_work_mem; -- default: 64MB
-- But now processes more dead tuples per memory unit
-- A 64MB setting can now handle ~200M dead tuples
-- Previously limited to ~5M dead tuples
Practical Impact
- Fewer VACUUM passes: More dead tuples processed per pass
- Reduced OOM risk: Large table vacuums less likely to fail
- Lower memory pressure: Better for shared hosting / containers
- Faster completion: Less time spent on memory management
Verification
-- Check vacuum progress (unchanged API)
SELECT * FROM pg_stat_progress_vacuum;
-- Monitor memory in pg_stat_activity
SELECT pid, state, query, backend_type
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
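The number of index-cleanup passes is where the improvement shows up most directly; a small check using the long-standing index_vacuum_count column (the view only returns rows while a VACUUM is running):
-- Each pass means the dead-tuple store filled up and every index had to be rescanned;
-- with the PostgreSQL 17 TID store this should rarely exceed 1
SELECT pid, datname, relid::regclass AS table_name, phase, index_vacuum_count
FROM pg_stat_progress_vacuum;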
Source: PostgreSQL 17 Release Notes
https://www.postgresql.org/docs/17/release-17.html
administration
PostgreSQL 18 Statistics Preservation During Upgrade
Yes. PostgreSQL 18 preserves planner statistics during pg_upgrade, eliminating the need for lengthy post-upgrade ANALYZE operations.
The Previous Problem (Pre-PG18)
# After pg_upgrade, statistics were empty
pg_upgrade -d /old/data -D /new/data ...
# Required running ANALYZE on entire database (could take hours)
vacuumdb --all --analyze-in-stages
PostgreSQL 18 Behavior
# Statistics now preserved automatically
pg_upgrade -d /old/data -D /new/data -b /old/bin -B /new/bin
# Database ready immediately with accurate query plans!
What's Preserved
| Statistic Type | Preserved? |
|---|---|
| Column statistics (pg_statistic) | Yes |
| Extended statistics | No (recreate with ANALYZE after the upgrade) |
| Most common values | Yes |
| Histograms | Yes |
| NULL fractions | Yes |
| Correlation values | Yes |
Upgrade Time Comparison
| Database Size | PG17 Upgrade + ANALYZE | PG18 Upgrade |
|---|---|---|
| 100 GB | 2 hours | 20 minutes |
| 1 TB | 12+ hours | 2 hours |
| 10 TB | Days | Hours |
Verification
-- After upgrade, confirm planner statistics were carried over
SELECT
schemaname,
tablename,
attname,
n_distinct,
null_frac
FROM pg_stats
ORDER BY schemaname, tablename
LIMIT 10;
-- Rows here mean pg_statistic is already populated without running ANALYZE.
-- Note: pg_stat_user_tables counters (last_analyze, n_live_tup, ...) are cumulative
-- statistics and are not carried over by pg_upgrade, so don't use them as the check.
Best Practice
Even with preserved statistics, consider running ANALYZE after upgrade for:
- Extended statistics, which are not carried over (PostgreSQL 18's vacuumdb --missing-stats-only targets exactly these gaps)
- Tables with significant changes during the upgrade window
- New columns added during upgrade
- Any tables showing poor query performance
-- Optional: Refresh statistics for specific tables
ANALYZE VERBOSE large_table;
Source: PostgreSQL 18 Release Notes
https://www.postgresql.org/docs/18/release-18.html
connection_pooling
PgBouncer in transaction pooling mode has significant limitations because each transaction may execute on a different server connection with different session state.
Session Variables
Cannot use SET SESSION - Changes via SET (which defaults to SET SESSION) will not persist across transactions. You must use SET LOCAL to scope changes to the current transaction only.
Limited parameter tracking - PgBouncer tracks only these parameters by default: client_encoding, datestyle, timezone, standard_conforming_strings, and application_name. Additional parameters can be added via track_extra_parameters, but most parameters cannot be reliably tracked.
server_reset_query not used - In transaction pooling mode, DISCARD ALL is not executed between clients because clients must not rely on session persistence.
Prepared Statements
Protocol-level only (v1.21+) - Since PgBouncer 1.21.0, protocol-level prepared statements (via libpq's PQprepare) are supported with max_prepared_statements setting. PgBouncer intercepts and rewrites statement names internally.
SQL PREPARE not supported - SQL-level PREPARE, EXECUTE, and DEALLOCATE commands bypass PgBouncer's tracking and go directly to PostgreSQL. These will fail unpredictably when connections change.
Other Limitations
- Temporary tables - Must be created, used, and dropped within the same transaction. Use ON COMMIT DROP when creating (see the sketch below).
- LISTEN/NOTIFY - Not compatible with transaction pooling (requires a persistent session)
- Advisory locks - Session-level advisory locks will be lost between transactions
- Cursors - Must be declared and fully used within a single transaction; WITH HOLD cursors rely on session state and are not safe in this mode
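A minimal sketch of the pooler-safe temporary-table pattern from the first item above (the table and columns are illustrative):
BEGIN;
CREATE TEMP TABLE staging (id int, payload jsonb) ON COMMIT DROP;
INSERT INTO staging VALUES (1, '{"k": "v"}');
-- ... work with staging within this transaction ...
COMMIT;  -- the temporary table is dropped automatically here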
psql_tools
PostgreSQL 17 psql \watch min_rows
PostgreSQL 17 adds a min_rows parameter to psql's \watch command: the query keeps being re-executed on the chosen interval and stops as soon as it returns fewer than the specified number of rows.
Syntax
-- psql meta-command (the backslash is required)
\watch [ i[nterval]=SECONDS ] [ c[ount]=TIMES ] [ m[in_rows]=ROWS ] [ SECONDS ]
Examples
-- Watch the pending queue; stop once fewer than 5 pending items remain
SELECT * FROM queue WHERE status = 'pending';
-- Then run: \watch 1 min_rows=5
-- Re-runs every second while 5+ pending items exist, then stops
-- Wait for replication to catch up (a row is returned only while behind)
SELECT 1 WHERE pg_last_wal_replay_lsn() < '0/1234567'::pg_lsn;
-- Then run: \watch 0.5 min_rows=1
-- Watch a job while it is still in progress; stop once it completes
SELECT * FROM jobs WHERE id = 123 AND status <> 'completed';
-- Then run: \watch 2 min_rows=1
Use Cases
| Scenario | Command |
|---|---|
| Watch a queue until it drains below N rows | \watch 1 min_rows=N |
| Stop once a condition no longer returns rows | \watch 0.5 min_rows=1 |
| Monitor a backlog until it falls below 100 rows | \watch 5 min_rows=100 |
| Poll while a job is still visible | \watch 2 min_rows=1 |
Comparison with Basic watch
-- Basic \watch (runs forever until Ctrl-C)
SELECT count(*) FROM events WHERE processed = false;
-- Then run: \watch 5
-- With min_rows (auto-stops)
SELECT 1 FROM events WHERE processed = false LIMIT 1;
-- Then run: \watch 5 min_rows=1
-- Stops when all events are processed (the query stops returning rows)
Practical Examples
-- Watch until locks on a table clear (rows are shown while locks exist)
SELECT pid, mode FROM pg_locks
WHERE relation = 'my_table'::regclass;
-- Then run: \watch 1 min_rows=1
-- Watch until active connections drop below a threshold
SELECT 1 WHERE (
SELECT count(*) FROM pg_stat_activity WHERE state = 'active'
) >= 10;
-- Then run: \watch 2 min_rows=1
-- Monitor batch progress; stop once 1000 items are processed
SELECT count(*) AS processed FROM items WHERE status = 'done'
HAVING count(*) < 1000;
-- Then run: \watch 5 min_rows=1
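min_rows can be combined with the interval and count parameters (count was added to \watch in PostgreSQL 16) to put an upper bound on how long the loop runs; a small sketch:
-- Re-run every 2 seconds, at most 30 times, stopping early once the queue is empty
SELECT id FROM queue WHERE status = 'pending';
-- Then run: \watch interval=2 count=30 min_rows=1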
Also New in PG17 psql
- Ctrl-C cancels connection attempts (previously had to wait for timeout)
- Better tab completion for SQL keywords
- Improved \d command output formatting
Source: PostgreSQL 17 Documentation - psql
https://www.postgresql.org/docs/17/app-psql.html
monitoring
PostgreSQL 18 pg_stat_all_tables New Columns
PostgreSQL 18 adds vacuum and analyze timing columns to pg_stat_all_tables for better maintenance monitoring.
New Columns
| Column | Type | Description |
|---|---|---|
| total_vacuum_time | double precision | Total time spent vacuuming (ms) |
| total_analyze_time | double precision | Total time spent analyzing (ms) |
Usage
-- Find tables with longest vacuum times
SELECT
schemaname,
relname,
vacuum_count,
round(total_vacuum_time::numeric / 1000, 2) AS vacuum_time_sec,
round(total_vacuum_time::numeric / nullif(vacuum_count, 0) / 1000, 2) AS avg_vacuum_sec
FROM pg_stat_all_tables
WHERE total_vacuum_time > 0
ORDER BY total_vacuum_time DESC
LIMIT 10;
-- Tables needing vacuum optimization
SELECT
relname,
n_dead_tup,
last_vacuum,
vacuum_count,
round((total_vacuum_time / 1000)::numeric, 1) AS total_vacuum_secs,
round((total_vacuum_time / nullif(vacuum_count, 0) / 1000)::numeric, 1) AS avg_vacuum_secs
FROM pg_stat_user_tables
WHERE vacuum_count > 0
AND (total_vacuum_time / nullif(vacuum_count, 0)) > 10000 -- avg > 10 sec
ORDER BY total_vacuum_time DESC;
Monitoring Query
-- Comprehensive maintenance dashboard
SELECT
relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS size,
n_live_tup AS live_rows,
n_dead_tup AS dead_rows,
vacuum_count,
autovacuum_count,
analyze_count,
autoanalyze_count,
round((total_vacuum_time / 1000)::numeric, 1) AS vacuum_time_sec,
round((total_analyze_time / 1000)::numeric, 1) AS analyze_time_sec,
last_vacuum,
last_autovacuum,
last_analyze
FROM pg_stat_user_tables
ORDER BY total_vacuum_time DESC
LIMIT 20;
Alerting Example
-- Alert on tables with long average vacuum time
SELECT relname, avg_vacuum_ms
FROM (
SELECT
relname,
total_vacuum_time / nullif(vacuum_count, 0) AS avg_vacuum_ms
FROM pg_stat_user_tables
WHERE vacuum_count > 0
) sub
WHERE avg_vacuum_ms > 60000 -- > 1 minute average
ORDER BY avg_vacuum_ms DESC;
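Because these columns are cumulative, resetting a single table's counters gives a fresh baseline before and after tuning; a sketch using the long-standing reset function (public.orders is a placeholder):
-- Resets all cumulative counters for the table, including the new timing columns
SELECT pg_stat_reset_single_table_counters('public.orders'::regclass);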
Related Enhancements (PG18)
EXPLAIN ANALYZE now also shows:
- Buffer usage automatically
- WAL writes (verbose mode)
- CPU time
- Average read times
EXPLAIN (ANALYZE, BUFFERS, WAL)
SELECT * FROM large_table WHERE id < 1000;
Source: PostgreSQL 18 Release Notes
https://www.postgresql.org/docs/18/release-18.html