Postgres' IO
Architecture, Tuning, Problems
Andres Freund
PostgreSQL Developer & Committer
Citus Data – [Link] - @citusdata
[Link]
pgbench -M prepared -c 32 -j 32
standard settings
[Chart: TPS and latency (ms) over time in seconds]
Memory Architecture
[Diagram: process-local vs. shared memory]
● Process-local memory (per backend): sorting, plans, temporary tables, transaction state, bitmap scans, …
● Shared memory: buffer cache, locking information, …
● Processes: postmaster, user connection backends, checkpointer, WAL writer, background writer, autovacuum launcher
Shared Buffers
[Diagram: Buffer Mapping Hashtable pointing to buffer descriptors 0–4; each descriptor holds a TAG, LOCK, FLAGS and usage CNT, and references an 8 KB data page]
Reading Data
[Diagram: a read probes the Buffer Mapping Hashtable; on a miss the 8 KB page is read from storage via open()/read(), passing through the OS page cache into a shared buffer]
Clock-Sweep
[Diagram: clock hand sweeping a ring of buffers 0–6; each pass decrements a buffer's usage count (CNT), e.g. from 4 towards 0, and a buffer with CNT 0 can be replaced]
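A toy illustration of the sweep, as plain bash rather than anything from the Postgres source: each pass over a buffer with a non-zero usage count decrements it, and the first buffer found at zero is the eviction victim.

counts=(4 3 0 2 1 5 0)   # usage counts of buffers 0–6
hand=0
while true; do
  if [ "${counts[hand]}" -eq 0 ]; then
    echo "evict buffer $hand"
    break
  fi
  counts[hand]=$((counts[hand] - 1))     # buffer gets a second chance, at a cost
  hand=$(( (hand + 1) % ${#counts[@]} ))
done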
Writing Data Out
[Diagram: a dirty 8 KB shared buffer is written out via open()/write(), through the OS page cache, to storage]
Recovery & Checkpoints
[Diagram: timeline with periodic CHECKPOINTs; after a restart, recovery replays WAL from the last checkpoint]
Checkpoints
1) Remember current position in WAL
2) Do some boring things
3) Write out all dirty buffers
4) Fsync all files modified since the last checkpoint
5) Write checkpoint WAL record, pg_control etc.
6) Remove old WAL
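The checkpoint's position is recorded in pg_control; one quick way to look at it is pg_controldata (a sketch, assuming $PGDATA points at the data directory):

pg_controldata $PGDATA | grep -i checkpoint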
Triggering Checkpoints
● checkpoint_timeout = 5min
– LOG: checkpoint starting: time
● checkpoint_segments = 3 / max_wal_size = 1GB
– LOG: checkpoint starting: xlog
– LOG: checkpoints are occurring too frequently (2 seconds apart)
● shutdown
– LOG: checkpoint starting: shutdown immediate
● manually (CHECKPOINT;)
– LOG: checkpoint starting: immediate force wait
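To see which of these triggers is firing in practice, turn on checkpoint logging and watch the server log (a sketch; the log path assumes the logging collector's default pg_log directory):

psql -c "ALTER SYSTEM SET log_checkpoints = on;"
psql -c "SELECT pg_reload_conf();"
tail -f $PGDATA/pg_log/postgresql-*.log | grep 'checkpoint starting'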
Spreading Checkpoints
● checkpoint_completion_target = 0.5
● estimation based on
– checkpoint_timeout
– checkpoint_segments/max_wal_size
● Writes are spread over completion_target * (the timeout or segments until the next checkpoint), as in the example below
● Try to keep pace
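For example, with checkpoint_timeout = 5min and checkpoint_completion_target = 0.5, a time-triggered checkpoint spreads its writes over roughly 0.5 * 5min = 2.5 minutes, then sits idle until the next checkpoint is due.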
pgbench -M prepared -c 32 -j 32
standard settings
[Chart: TPS and latency (ms) over time in seconds]
pgbench -M prepared -c 32 -j 32
shared_buffers = 16GB
[Chart: TPS and latency (ms) over time in seconds]
pgbench -M prepared -c 32 -j 32
shared_buffers = 16GB, max_wal_size = 100GB
[Chart: TPS and latency (ms) over time in seconds]
Dirty Data
[Chart: dirty bytes and writeback activity over time in seconds]
OS Dirty Data Tuning
● dirty_writeback_centisecs => lower
– how often the kernel checks for writeback work
● dirty_bytes/dirty_ratio => lower
– threshold at which writing processes start to block
● dirty_background_bytes => lower
– threshold at which background writeback starts
● Increases random writes!
● Often lowers total throughput, but improves latency (see the sketch below)
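A hedged Linux starting point; the values are illustrative rather than recommendations, so measure on your own hardware. Note that setting the *_bytes variants disables the corresponding *_ratio settings:

sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64 MB
sysctl -w vm.dirty_bytes=536870912             # block writers at 512 MB of dirty data
sysctl -w vm.dirty_writeback_centisecs=100     # check for writeback work every second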
pgbench -M prepared -c 32 -j 32
shared_buffers = 16GB, max_wal_size = 100GB, OS tuning (no dirty)
[Chart: TPS and latency (ms) over time in seconds]
pgbench -M prepared -c 32 -j 32
shared_buffers = 16GB, max_wal_size = 100GB, target = 0.9; OS tuning (no dirty)
[Chart: TPS and latency (ms) over time in seconds]
Shared Buffers Tuning
● Leave memory for queries / other work
● Hot data fits into shared_buffers => increase s_b
● Bulk writes in a workload bigger than shared_buffers => benchmark with a decreased s_b
● Large shared buffers => enable huge pages (sketch below)
● Frequent relation DROP/REINDEX => decrease s_b
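A minimal sketch for a machine mostly dedicated to Postgres; the 16GB figure is illustrative, and huge_pages additionally needs vm.nr_hugepages configured on Linux:

psql -c "ALTER SYSTEM SET shared_buffers = '16GB';"
psql -c "ALTER SYSTEM SET huge_pages = 'try';"
pg_ctl restart -D $PGDATA   # both settings only take effect after a restart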
WAL Tuning
● Checkpoints should be triggered by time!
– high enough checkpoint_segments/max_wal_size
– Monitor!
● Except maybe at night, during batch runs or such
● Consider recovery time → less frequent checkpoints mean crash recovery takes longer
● Consider full page writes → more frequent checkpoints mean much, much more WAL
● A separate pg_xlog volume can help a lot! (see the sketch below)
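A sketch of both knobs; sizes and paths are placeholders, and moving pg_xlog (renamed pg_wal in version 10) must happen while the cluster is stopped:

psql -c "ALTER SYSTEM SET max_wal_size = '100GB';"
psql -c "ALTER SYSTEM SET checkpoint_timeout = '30min';"
pg_ctl stop -D $PGDATA
mv $PGDATA/pg_xlog /fastdisk/pg_xlog   # /fastdisk is a placeholder mount point
ln -s /fastdisk/pg_xlog $PGDATA/pg_xlog
pg_ctl start -D $PGDATA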
WAL Writer
● Writes WAL on behalf of backends
● Important for synchronous_commit = off
● Otherwise boring
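If losing the last few hundred milliseconds of commits on a crash is acceptable (the database stays consistent, nothing is corrupted), it can be switched cluster-wide; a sketch:

psql -c "ALTER SYSTEM SET synchronous_commit = off;"
psql -c "SELECT pg_reload_conf();"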
Clock-Sweep
[Diagram: the clock-sweep ring over buffers 0–6 again]
Background Writer
● Writes dirty buffers before backends have to
● Not very good at it
● All random writes
● Defaults write at most ~4 MB/s
● bgwriter_delay → lower, wakes up more often
● bgwriter_lru_maxpages → increase, writes more at once (see the sketch below)
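The 4 MB/s figure follows from the defaults: 100 pages * 8 KB * (1000 ms / 200 ms) rounds per second = 4 MB/s. A sketch of raising the cap (values illustrative):

psql -c "ALTER SYSTEM SET bgwriter_delay = '100ms';"
psql -c "ALTER SYSTEM SET bgwriter_lru_maxpages = 400;"   # 400 * 8 KB * 10/s = 32 MB/s
psql -c "SELECT pg_reload_conf();"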
Autovacuum
● Limited read/write rate – too low by default
– ~4 MB/s
● Cost calculated with
– vacuum_cost_page_hit = 1
– vacuum_cost_page_miss = 10
– vacuum_cost_page_dirty = 20
● Limited by
– {autovacuum_,}vacuum_cost_limit = 200
– autovacuum_vacuum_cost_delay = 20ms
– vacuum_cost_delay = 0 (manual VACUUM is unthrottled)
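The ~4 MB/s write figure follows from the defaults: 200 cost units per 20 ms is 10,000 units/s; at 20 units per dirtied page that is 500 pages/s * 8 KB ≈ 4 MB/s. A sketch of loosening the throttle (the value is illustrative):

psql -c "ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';"   # ~10x the default budget
psql -c "SELECT pg_reload_conf();"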
Problem – Dirty Buffers in Kernel
● Massive latency spikes, up to hundreds of seconds
● Force flushing using sync_file_range() or msync()
– Decreases jitter
– Increases randomness
● Sort checkpointed buffers
– Decreases randomness
– Increases throughput
● In 9.6 for some OSs (see the sketch below)
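In 9.6 this is exposed as the *_flush_after settings; a sketch making the Linux defaults explicit (backend_flush_after defaults to 0, i.e. off):

psql -c "ALTER SYSTEM SET checkpoint_flush_after = '256kB';"
psql -c "ALTER SYSTEM SET bgwriter_flush_after = '512kB';"
psql -c "SELECT pg_reload_conf();"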
Problem – Hashtable
● Can't efficiently search for the next buffer
– need to sort for checkpoints
– can't write-combine to reduce the total number of writes
– can't efficiently drop relation/… buffers
● Expensive lookups
– cache- / pipeline-inefficient data structure
– some locking issues: improved in 9.5, 9.6
● Possible Solution: Radix Tree
● Hopefully 9.7
Problem - Cache Replacement
Scales Badly
● Single Lock for Clock Sweep!
– fixed in 9.5
● Every Backend performs Clock Sweep
– potentially 9.7?
● The algorithm is fundamentally expensive
– Uh, oh.
Problem - Cache Replacement
Replaces Badly
● Usage count of 5 (max) is reached very quickly
– Often all buffers have usage count 5 / 0
● Increasing the max usage count increases cost; the worst case is essentially O(NBuffers * max_usagecount)
● Hard to solve
Problem: Kernel Page Cache
● Double buffering decreases effective memory utilization
● Use O_DIRECT?
– Requires lots of performance work on our side
– Considerably faster in some scenarios
– Less Adaptive
– Very OS specific