PostgreSQL Internals Notes Compilation
c. pg_stat_statements — for queries that are not logged (those that do not exceed the
log_min_duration_statement threshold).
d. lock - pg_locks , pg_stat_activity. Blocked/blocking session pairs (fragment completed along the lines of the community lock-monitoring query):
SELECT blocked_activity.pid AS blocked_pid , blocking_activity.pid AS blocking_pid
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
13. Autovacuum Tuning:
Transaction IDs are 4 bytes: 2 to the power 32 -> ~4 billion values; wraparound danger starts after about 2 billion (2^31) transactions past the oldest unfrozen XID.
1. Age of vacuum process.
2. Age of datfrozenxid of each database.
3. Age of relfrozenxid of each table in a database.
4. Run VACUUM VERBOSE - after setting maintenance_work_mem or autovacuum_work_mem.
5. For large tables : (i) decrease autovacuum_vacuum_cost_delay (ii) Adjust
autovacuum_vacuum_scale_factor (decimal) to an extremely low value for large tables.
6. When autovacuum falls behind across all tables: (i) increase autovacuum_max_workers
(default is 3) (ii) increase autovacuum_vacuum_cost_limit proportionately, e.g. to 1500 (default is 200), since the limit is shared among the active workers.
7. Log autovacuum details with log_autovacuum_min_duration.
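The wraparound-age checks (steps 2-3) and the per-table scale-factor override (step 5) can be sketched in SQL; the table name big_table is illustrative only:

```sql
-- Step 2: age of datfrozenxid for each database
SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;

-- Step 3: age of relfrozenxid for each table in the current database
SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY 2 DESC LIMIT 20;

-- Step 5(ii): per-table override of the scale factor for a large table
ALTER TABLE big_table SET (autovacuum_vacuum_scale_factor = 0.001);
```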
Architecture of Autovacuum worker processes:
- Started by the Postmaster upon request; the Autovacuum Launcher process leaves the details for the
Postmaster in shared memory (the usual means of communication between processes are (a) shared
memory (b) signals/messages).
- Upon completing the vacuum of a table, the autovacuum worker process will get terminated.
However, Autovacuum Launcher will launch another one if there are more tables to be vacuumed.
14. Storage Details: Nutshell [ Page Layout - 5 entities , Row Layout - 4 entities — Page Layout ->
Page Header (8 entities), Row Layout -> Tuple Header (8 entities)]
Storage Page Layout -> [PIFIS]
1. Page Header : 24 bytes
2. Item Identifiers (line pointers) : 4 bytes each (offset, length) , POINTERS to items (actual data)
3. Free Space
4. Item Space - Actual Data
5. Special Space - Index and other Access information to Actual Data.
Page Header: 24 bytes [lcf - lus - pp ]
1. pd_lsn - Log Sequence Number - Next byte after the last WAL change( to this page)
2. pd_checksum - page checksum
3. pd_flags - flag bits
4. pd_lower - offset to start of free space
5. pd_upper - offset to end of free space
6. pd_special - offset to special space.
7. pd_pagesize_version - pagesize and layout version number
8. pd_prune_xid - Oldest Unpruned XMAX on the page
[ UNPRUNED XMAX: unpruned means the row's XMAX is still newer than the oldest XID that any currently
running transaction's snapshot could need, so some snapshot may still require visibility of the row version and it was not removed.
Hence, a pruned page means that no row on the page has an XMAX that is newer than the XID of any currently
running transaction.
]
Row layout:
1. Tuple Header
2. Optional null bitmap
- when there are nulls in the row
- present when the TUPLE HEADER's t_infomask has the HEAP_HASNULL bit on.
- contains as many bits as the number of columns - bit 1 - value present (not null) , bit 0 - null.
3. Optional object id - is set only if HEAP_HASOID bit is set in t_infomask.
4. User data
Tuple Header: 23 bytes [ mmc - xc - iih ] mohendra mohan choudhury - indian institute of hardware
- xc
1. t_xmin — inserting transaction id
2. t_xmax — deleting transaction id
3. t_cid — command id
4. t_xvac — transaction id for VACUUM operation - moving a row version
5. t_ctid — current tuple id (block number, then row offset within the block)
6. t_infomask2 — number of attributes, plus flag bits
7. t_infomask — Attributes indicated by flag bits
8. t_hoff — offset to user data
15. Data Retrieval [ SELECT ]
- each attribute is traversed
- determined if it's null from Optional null bitmap.
- for variable length columns, the length of the column is determined from struct varlena .
- flag bits in struct varlena indicate:
(a) whether the value is compressed
(b) whether the column is TOASTed (stored out of line)
3rd June 2021
——————
Parallel processes
16. max_parallel_workers_per_gather (default = 2) - parallel processes per Gather node
max_worker_processes (default = 8) - total background worker processes in the system.
max_parallel_workers (default = 8) - total parallel worker processes in the system
Gather / Gather Merge nodes collect the partial results; the loops count in the execution plan reflects the number of parallel processes.
Operations:
Parallel Scan - Sequential , Index and Bitmap Heap
Parallel Join - Nested loop ( only outer table[ or driving table] blocks divided among the Parallel
Worker Backend Processes) , the inner table is accessed through index lookup and the access is not
parallel.
Merge join - the inner table is not accessed in parallel
Hash Join - Can be (i) hash join (ii) parallel hash join with prefix ‘parallel’ (from
Postgresql 10 and above)
With the ‘parallel’ prefix - the build-side (inner) table scan is parallelized as well as
the probe-side (outer) table scan.
A single shared hash table is built cooperatively from the build-side table and is
shared by all the cooperating processes , which probe the built (hash) table after picking up
tuples from the probe-side table.
Without the ‘parallel’ prefix - the build-side scan is not parallelized - each worker
process builds its own copy of the hash table from the build-side table . The probe-side table scan is parallelized.
Parallel Aggregate: (i) Partial Aggregate Node: each backend worker process aggregates its share of the rows.
(ii) Finalize Aggregate Node: The leader re-aggregates the results.
A DISTINCT or ORDER BY clause inside an aggregate call prevents the optimizer from using Parallel
Aggregate.
Parallel Append: for UNION ALL / Partitioned Table Scan
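A minimal sketch of observing these nodes, assuming a populated table t; the plan shape (Gather, Partial/Finalize Aggregate, Parallel Seq Scan) depends on table size and cost settings:

```sql
SET max_parallel_workers_per_gather = 4;   -- per-Gather worker cap
EXPLAIN (ANALYZE, VERBOSE)
SELECT count(*) FROM t;
-- A parallel plan typically reads:
--   Finalize Aggregate -> Gather -> Partial Aggregate -> Parallel Seq Scan on t
-- and "loops" on the parallel nodes reflects the number of processes involved.
```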
5th June 2021
————————
17. Architecture:
- Startup of the PostgreSQL server: the postmaster allocates shared memory and starts the background processes.
PROCESSES
===========
- Postmaster: starts the other BACKGROUND processes: CAAWWS L -> Checkpointer ,
Autovacuum Launcher , Archiver , WAL Writer , DB Writer , Statistics Collector and Logical
Replication Launcher.
- Postmaster: Receives connection requests on default port 5432 and starts a BACKEND process for
every client request.
[ Postgresql does not have a BACKEND process that can serve multiple clients like the SHARED
SERVER process like in Oracle.
Hence, in PostgreSQL , all BACKEND processes are like DEDICATED SERVER processes like in
Oracle.]
- Postmaster: Starts other BACKEND processes like (i) Parallel Worker Processes (ii) Autovacuum
worker processes.
MEMORY
===============
SHARED:
- Shared_buffer : database pages containing data
- WAL_buffer : buffer containing WAL entries
- CLOG buffer : transaction details containing the state of a transaction (i) in progress (ii) committed (iii)
aborted.
LOCAL:
- temp_buffer: holds data related to temporary tables.
- work_mem: holds data related to GODUC operations (group by/order by /distinct/union)
- maintenance_work_mem : holds temporary data for maintenance operations like statistics collection ,
vacuum operation , reindex
ADDITIONAL SHARED:
- A - Access control Locks like light-weight locks, semaphores & shared/exclusive locks.
- B - Background process - Autovacuum , Checkpoint.
- T - Transaction related data like save-point and two-phase commit.
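The sizes behind the memory areas above can be read from pg_settings; a sketch:

```sql
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'wal_buffers',
               'temp_buffers', 'work_mem', 'maintenance_work_mem');
```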
5th June 2021
——————-
18. Buffer Manager
1. Buffer Table : Consists of
- (i) Hash buckets , (ii) Hash Functions and (iii) Hash Slots in each bucket
- Hash Slot consists of Data entry - (i) Buffer_id from buffer pool (ii) Hash of Buffer
Tag (from Buffer descriptor)
2. Buffer Pool : Consists of
(i) Buffer pages
(ii) Array of buffer id’s as indexes to the buffer pages.
and
3. Buffer Descriptor Layer: Consists of a collection of buffer descriptors.
Each buffer descriptor is a structure of type BufferDesc which consists of elements: BBC FF UR
(i) Buffer Tag : a combination of 5 details to identify a storage Page for PostgreSQL - (i) Database
(ii) Tablespace (iii) relation (iv) block_number (v) fork - freespace/visibility/data - 1,2,0
(ii) Buffer_id : which is the index of the Buffer Slot in the Buffer Pool
(iii) content lock bit & io_in_progress lock bit : Bit to indicate changing of content in buffer pool for the
particular buffer slot as well as for IO progress details.
(iv) refcount : incremented by 1 while the block is being worked on ; once done , it is
decremented.
(v) usage_count: incremented by 1 each time the buffer is accessed.
(vi) Flag bits: (a) valid (b) io_in_progress (c) dirty : set when relevant
(vii) FreeNext : relevant for a buffer descriptor that is on the freelist .
Three states of Buffer descriptor:
1. Empty (usagecount = 0 , refcount = 0)
2. Pinned (usagecount > 0 , refcount > 0)
3. Unpinned (usagecount > 0 , refcount = 0 )
Page replacement Algorithm:
- Clock Sweep : move circularly through the buffers —> pick up a buffer
—> check that it is unpinned (refcount = 0)
—> check if its usage_count = 0 (if not, decrease it by 1 and move on)
—> repeat the steps till a buffer with (refcount = 0 and usage_count = 0) is found.
- Buffer Manager Locks
1. Buffer Table lock - BufMappingLock - Exclusive: for changes to hash slots
Shared: for lookup/reading of hash slots
2. Buffer Descriptor lock:
content_lock : Shared : When the data in the corresponding page (tuple) needs to be looked
up.
Exclusive: When the data in the corresponding page(tuple) needs to be changed.
io_progress_lock : when the corresponding buffer page needs to be retrieved from the storage.
spinlock : when values in the BufferDesc struct need to be changed , e.g. the
dirty/valid/io_in_progress bits.
- Buffer Manager Working to Read a page: [ KEY: 1. The Buffer Table is updated first —then—> the Buffer pool
is updated
2. Locks on the Buffer pool are obtained indirectly by
referencing bits and details in the Buffer Descriptor ]
1. The page to be read is already in a Buffer slot: Create the Buffer Tag —> Get shared BufMappingLock —>
find the buffer_id in the Buffer Table (traverse the hash bucket and get the hash slot) —> Release
BufMappingLock —> read the buffer slot.
2. The page to be read is not in a Buffer slot but there is an empty slot to read it into: (steps
of 1) —> obtain a Free Buffer from the Freelist —> Exclusive BufMappingLock (Buffer Table) —> Load up
the Hash slot -> Load up the Buffer pool (io_in_progress bit..) -> release BufMappingLock -> read the buffer
slot
3. The page to be read is not in a Buffer slot and there is no empty slot to read it into:
(steps of 1) —> ‘clock sweep’ page replacement algorithm -> if Dirty bit = 1 (—> log flush ,
io_in_progress bit = 1 , write the old page out) -> BufMappingLock (old & new) -> Hash Table update (both old and new) ->
Load up the Buffer Pool -> release BufMappingLock -> read the buffer slot.
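The buffer pool state described above (usage_count, dirty flag) can be observed with the pg_buffercache extension; a sketch:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Buffers, dirty count and average usage_count per relation
SELECT c.relname,
       count(*)                                   AS buffers,
       sum(CASE WHEN b.isdirty THEN 1 ELSE 0 END) AS dirty,
       round(avg(b.usagecount), 2)                AS avg_usage
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```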
15th August 2021
=============
Aspects of WAL Data:
1. Logical/Physical Structure
2. Internal Structure
3. Writing Data / Write process
4. Managing Segments
Aspects of Database Recovery:
1. Checkpoint
2. Commit
3. Recovery (controlfile)
4. Archive WAL segments
Purpose of Checkpoint : (a) Recovery Preparation (b) cleanup of dirty pages in shared_buffers
a. Redo Point
b. Restore Rules.
1. Full Page Write Rule : the first time a block is touched after a checkpoint (including the very
first touch), the entire page is written to WAL.
2. Full Restore of a block: Rule: if the record to restore is a full-page image (backup block) , it is restored irrespective
of the LSN.
Otherwise , if the record's LSN > the pd_lsn of the page in
shared_buffers - then it is replayed.
c. Full Page Write & Backup Block:
xlog entry (Ins/Del/Upd & commit) with LSN (a unique location in the Xlog) -> WAL memory -> WAL
segment ; REDO point = start of the most recent checkpoint (where recovery starts) -> the Checkpointer
writes an XLOG record (which holds the REDO point)
1. First Insert into TabA (since the last checkpoint) -> shared_buffer TabA page entry (INSERT) ->
Page/Block pd_lsn changed (say lsn_0 to lsn_1) -> Entire Page into WAL buffer -> COMMIT ->
BACKEND process flushes WAL buffer to Segment
-> XLog record COMMIT.
2. Second insert into TabA (no new checkpoint since the last one) -> shared_buffer TabA page
entry (INSERT) -> Page/Block pd_lsn changed (say lsn_1 to lsn_2) -> Xlog entry with header into
WAL buffer -> COMMIT -> BACKEND process flushes WAL buffer to Segment
xlog recovery (normal) : Load the page of TabA into shared_buffer -> compare LSN(shared_buffer
page) with LSN(xlog record) -> if LSN(xlog record) is greater than LSN(shared_buffer page) , replay the xlog
record from the data portion.
xlog recovery (FPW) : Load the page of TabA into shared_buffer -> the xlog page is a backup block -> No LSN
comparison -> the shared_buffer page is overwritten.
d. Checkpoint Sequence: xlog entry -> clog buffer -> Dirty buffer cleanup -> control file updated
e. Checkpoint Frequency depends on:
1. max_wal_size
2. checkpoint_timeout
3. checkpoint_segments (pre-9.5 ; replaced by max_wal_size)
4. checkpoint_completion_target
f. Logical & Physical Structure of XLOG entry.
1. Can be addressed by 8 Bytes - 2^64 ≈ 16 exabytes
2. Too big for a single transaction log file -> broken down into WAL segments - 16MB each.
3. WAL contains:
i. timeline ID - 4 bytes (indicates the number of times the DB has been recovered)
DB created by initdb - timeline 1
Used for PITR
ii. LSN - location in WAL - 64-bit integer - pointer type XLogRecPtr - externally represented as 2
hexadecimal numbers of up to 8 digits each separated by a slash - compared against pd_lsn - differences computed with pg_wal_lsn_diff().
iii. Naming - 24 hex digits: {8-digit TimelineID}{16-digit segment No.}, e.g. 0000000100000000000000FF ;
the low 8 digits run up to 000000FF (older releases skipped FF, giving 255 segments per logical log file).
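The LSN representation and the WAL file naming can be checked with built-in functions; a sketch:

```sql
SELECT pg_current_wal_lsn();                         -- two hex parts, e.g. 0/16B3C50
SELECT pg_walfile_name(pg_current_wal_lsn());        -- 24-hex-digit segment name
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'); -- distance in bytes
```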
g. Internal layout:
Xlog (logical log file, ~4GB) -> many WAL segments (16MB) -> many pages (8K) (header page , normal pages).
Header Page -> Header -> XLogLongPageHeaderData struct (which contains the XLogPageHeaderData
struct plus unsigned integers: system identifier , segment size , block size)
+ Data -> Xlog record entries
Subsequent page -> Header -> XLogPageHeaderData -> (1) Magic , (2) Info , (3) Timeline
, (4) this page’s xlog address & (5) xlp_rem_len - length of the xlog record continued from the earlier page.
-> Data -> xlog records.
xlog record : Header Section -> Struct XLogRecord - LRCX , LIP
[xl_tot_len , xl_rmid , xl_crc, xl_xid , xl_len , xl_info, xl_prev ]
xl_rmid - resource manager id
- THIRST -> transaction, heap only , index , replication , sequence and tablespace
T - transaction - commits ->
resource manager (xl_rmid) : RM_XACT & information (xl_info) : XLOG_XACT_COMMIT
db recovery -> RM_XACT record replayed by xact_redo_commit() based on xl_info
H - heap only -> RM_HEAP
(resource manager) & XLOG_HEAP_INSERT (information) / XLOG_HEAP_DELETE (information)
/ XLOG_HEAP_UPDATE (information)
https://coupang-my.sharepoint.com/personal/rajorshi_coupang_com/Documents/backup_macbook_
2/Notes Cooked/Postgresql/Notes/ACID-Read Consistency.docx
- VACUUM (freezing) marks the ‘xmin’ with a value (FrozenTransactionId) that every
transaction treats as being in the distant past, hence rendering the row visible to all transactions.
e. Tuple header — pageinspect extension - select * from
heap_page_items(get_raw_page('<schema_name.object_name>' , <page_number>));
Page Header — pageinspect extension - select * from
page_header(get_raw_page('<schema_name.object_name>' , <page_number>));
Optional Null bitmap - present when the row has at least one null ,
one bit per column: 1 bit -> value present (not null) , 0 bit -> null.
This happens only when the
HEAP_HASNULL bit of t_infomask is set.
The number of columns (and hence the number of bitmap bits) is stored in t_infomask2.
Optional object id - HEAP_HASOID bit is set in t_infomask.
- present just before user data starting from t_hoff
Userdata
ctid - current tuple id - pair of block number and row number.
cid (field 3 in the tuple header) - command id - the DML sequence number, within a transaction,
of the command that created this row.
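These hidden system columns can be selected directly; a sketch against an arbitrary table t:

```sql
-- ctid = (block number, row offset); xmin/xmax/cmin/cmax come from the tuple header
SELECT ctid, xmin, xmax, cmin, cmax, *
FROM t
LIMIT 5;
```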
11: What are the steps that Primary Server executes when there is replication involved?
Answer:
1. Primary: BACKEND process - LSN -> LSN_1 XLogInsert()
LSN_1 -> LSN_2 finish_xact_command() , XLogInsert()
WAL buffer flush.
2. Primary: WAL Sender sends entries (from disk) -> WAL Receiver (standby)
3. Primary : SyncRepWaitForLSN() - Primary waits for an ACK response from the standby
4. Standby: write() - WAL Segment (buffer) —> ACK1 sent to Primary
5. Standby: fsync() - WAL Segment (physical storage/file) —> ACK2 sent to Primary
6. Standby: Startup process replay WAL entries stored in WAL segments of Standby.
7. Primary : Upon receipt of ACK1 (when synchronous_commit = ‘remote_write’) / ACK2 (when
synchronous_commit = ‘on’) , the backend will complete SyncRepWaitForLSN() , releasing the latch , and the
session will now be able to process transactions again.
12. What are the ‘SYNC_STATE’ states that a Standby can have in the primary?
Answer:
1. SYNC - the first standby in the CSV list of parameter SYNCHRONOUS_STANDBY_NAMES
2. POTENTIAL - the rest of the standbys in the CSV list of the parameter mentioned above.
3. ASYNC - a standby that is not mentioned in the SYNCHRONOUS_STANDBY_NAMES list.
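The sync_state of each standby is visible on the primary in pg_stat_replication; a sketch:

```sql
SELECT application_name, client_addr, state,
       sent_lsn, replay_lsn, sync_priority, sync_state
FROM pg_stat_replication;
```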
13. What happens if the STANDBY which is SYNC sends ACK later than the STANDBY that is
potential?
Answer:
Primary will only consider ACK from the standby that is SYNC.
14. How is a Standby failure detected?
Answer:
Parameter WAL_SENDER_TIMEOUT (default 60 seconds) determines that.
If the Standby does not post a heartbeat in this time , the primary terminates the WALSENDER
process that is meant to send STREAM to the corresponding WALRECEIVER of the standby that is
not responding.
15. What happens when the SYNC standby fails?
Answer:
The potential STANDBY (with the priority next to the failing SYNC standby) will be promoted to
SYNC.
28th December 2021 Backup and Point In Time Recovery
================
1. Base Backup
- pg_start_backup() - (i) Initiate FPW mode
(ii) checkpoints the database
(iii) Switches WAL segment logs
(iv) creates a BACKUP Label file under ‘data’ directory.
- pg_stop_backup() - (i) resets to Non-FPW mode
(ii) writes a backup END XLOG entry
(iii) WAL file switch
(iv) creates the backup history file (in pg_wal/pg_xlog)
(v) deletes the backup label file (after it is included in the backup set - pg_wal/pg_xlog)
- Backup Label file contains:
1. BACKUP checkpoint information -
start of backup,
location of checkpoint in xlog
2. Backup label name
- Backup History file contains:
1. Contents of backup label file.
2. Timestamp of pg_end_backup() execution.
- The backup label file itself is simply named ‘backup_label’ in the data directory.
- Sample backup history file name: {wal segment} . { offset value at the time of backup start } .
backup
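A sketch of the low-level backup API as it existed up to PostgreSQL 14 (the functions were renamed pg_backup_start()/pg_backup_stop() in PostgreSQL 15); the label ‘nightly’ is illustrative:

```sql
SELECT pg_start_backup('nightly', false, false);  -- label, fast, exclusive
-- ... copy the data directory with an OS-level tool ...
SELECT * FROM pg_stop_backup(false, true);        -- exclusive, wait_for_archive
```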
2. Point In Time Recovery [B-backup file 5R -redo,recovery,restore,recovery,recovery T-timeline
history ]
a. Backup label file - contains [recovery start checkpoint & start backup time]
b. Redo point - obtained from Backup label file , location of the start of recovery in the
WAL log.
c. restore_command - command to copy backup files to restore location restore_command =
'cp /mnt/server/archivedir/%f %p'
d. recovery_target_time - end time of recovery [ empty will result in recovery till the logs are
available] . Example: recovery_target_time = "2021-12-18 12:05 GMT".
e. recovery.conf - contains (c) & (d) [ restore_command & recovery_target_time ]
f. recovery.signal - PG 12+ : empty file in the $PGDATA directory to trigger a restore and recovery [
there is no recovery.conf any more ; its contents move into postgresql.conf ]
g. backup history file - the contents of the Backup label file are copied into it in the pg_wal directory ; if the
archive_command parameter is set , the backup history file is also copied over to the archive
destination.
Additional Parameters [ name (default) - description ]:
recovery_end_command - shell command that will be executed once at the end of recovery.
recovery_min_apply_delay (0) - minimum delay for applying changes during recovery.
recovery_target - set to "immediate" to end recovery as soon as a consistent state is reached.
recovery_target_action (pause) - action to perform upon reaching the recovery target.
recovery_target_inclusive (on) - whether to include or exclude transactions with the recovery target.
recovery_target_lsn - LSN of the write-ahead log location up to which recovery will proceed.
recovery_target_name - named restore point up to which recovery will proceed.
recovery_target_timeline (latest) - specifies the timeline to recover into.
recovery_target_xid - transaction ID up to which recovery will proceed.
STEPS:
- recovery.conf / ( postgresql.conf + recovery.signal) — will contain two parameters:
1. restore_command
2. recovery_target_time
- PostgreSQL upon start will find either RECOVERY.CONF or RECOVERY.SIGNAL and get into
recovery mode.
- restore all the DATAFILEs and WAL log segments to their respective locations.
- Backup Checkpoint - from - Backup Label File through function read_backup_label()
- REDO POINT from backup checkpoint. [ Start WAL location - backup label file]
- For each WAL entry from the WAL logs retrieved from the ARCHIVE location into a temporary
location under $PGDATA , the timestamp is compared with RECOVERY_TARGET_TIME , if it is set.
- Once a WAL log has been processed and applied to the PostgreSQL cluster , it is deleted to
preserve space.
3. TIMELINE and TIMELINE HISTORY
TimeLineID: 4 byte unsigned int - starting from 1.
- For each recovery , the timeline is incremented by 1.
TimeLine History: “8 digit new TimeLineId”.history
Details that it holds:
1. timeLineID
2. LSN - of WAL switch
3. Reason - why timeline changed.
Example:
TimeLineID - 1
TimeLine - ‘00000001’
Sequence: Database Incident that requires recovery at time T1 —>
Restore Backup taken at T1 - x minutes —>
Recover Data till time T1 [ recovery.conf/postgresql.conf will have parameters
RECOVERY_TARGET_TIME & RESTORE_COMMAND] —>
Completion of recovery ->
Increase TimeLineID from 1 —> 2
——
If the Cluster is recovered using two archive files:
a. ‘000000010000000000000009’
b. ‘00000001000000000000000A’
- The freshly recovered database will get timeLineID 2 , and a new file
corresponding to the last WAL log file applied , with the increased timeLineID:
‘00000002000000000000000A’.
Second PITR on the same database Sequence:
Database Incident that requires recovery at time T1 + y —>
Restore Backup taken at (T1 + y) - x —>
[ Note: beyond first recovery ,mandatory parameter that needs to be mentioned is
RECOVERY_TARGET_TIMELINE, so now the parameters would be:
RECOVERY_TARGET_TIME
RESTORE_COMMAND
RECOVERY_TARGET_TIMELINE
Example:
restore_command = ‘cp /mnt/server/archivedir/%f %p’
recovery_target_time = ’2021-12-18 12:15:00 GMT’
recovery_target_timeline = 2
]
Redo Point information - from - backup label file
Recovery target time - from - recovery.conf/postgresql.conf
WAL logs from the archive location with TimeLine marked as 2.—>
TimeLine increased from 2 to 3 —>
WAL log created with TimeLined id 3.
If the Cluster is recovered using two archive files:
a. ‘000000010000000000000009’
b. ‘00000002000000000000000A’
- The freshly recovered database will get a timeLineID 3 and new file
corresponding to the last WAL log file applied with an increased timeLineID.
‘00000003000000000000000A’.