PostgreSQL Sharding and
HA: Theory and Practice
Aleksander Alekseev
A few words about me
● I live in Moscow, Russia;
● Developing software since 2007;
● Contributing to PostgreSQL since 2015;
● Work at Postgres Professional;
● Interests: OSS, functional programming,
electronics, SDR, distributed systems, blogging,
podcasting;
In this talk
● A brief introduction to PostgreSQL replication (physical & logical);
● Solutions for HA / failover;
● Solutions for sharding;
● Q&A section :)
Target audience
● You believe that replication is something very complicated;
● You think that the only way to scale is to scale horizontally;
● You’ve never configured physical and/or logical replication in PostgreSQL;
● You don’t know how to configure auto-failover / HA;
● You would like to know what’s new in recent releases of PostgreSQL;
● You are looking for an idea for a project =).
What is not in this talk
● A boring retelling of the documentation;
● For the interested listeners there will be links to additional materials;
Disclaimer
In this talk I will mention a lot of databases,
extensions, etc. It doesn’t mean that I’m an expert in
all of them.
Replication
[Diagram: Leader (Master) -> Replica (Follower)]
What for?
● Load balancing
○ OLTP: writing to the leader, reading from replicas;
○ OLAP: analytical queries on a separate replica;
○ Taking backups from a separate replica;
● Failover / High Availability
○ Failover can be manual or automatic
● Delayed replication
● Replication doesn’t replace backups!
Streaming (or physical) replication
● In essence, it represents a transfer of the WAL over the network;
● Asynchronous
○ Fast, but the recent data can be lost;
● Synchronous
○ Slower (though not by much within one datacenter) but more reliable. It’s better to have at least
two replicas, so that losing one of them doesn't block commits on the leader;
● Also: cascading replication (I had to mention it on some slide).
Fun facts!
Streaming replication:
● Doesn’t work between servers with different architectures;
● Doesn’t work between different versions of PostgreSQL [1];
● May not work between different operating systems / compilers [2];
● Also, transactions may become visible on the leader and the replica in a
different order;
[1]: According to https://simply.name/ru/upgrading-postgres-to-9.4.html a typical
downtime during the upgrade is a few minutes.
[2]: google://sizeof long compilers
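Footnote [2] presumably refers to the fact that the size of the C type long depends on the platform ABI: it is 4 bytes on 64-bit Windows (LLP64) and 8 bytes on 64-bit Linux (LP64). Below is a minimal sketch in Python that reports what the local platform uses. The function name c_type_sizes is made up for this example.

```python
import ctypes


def c_type_sizes():
    """Return the sizes (in bytes) of a few C types for the local platform ABI."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "long long": ctypes.sizeof(ctypes.c_longlong),
        "void *": ctypes.sizeof(ctypes.c_void_p),
    }


if __name__ == "__main__":
    # The output differs between platforms, which is exactly the point.
    for name, size in c_type_sizes().items():
        print(f"sizeof({name}) = {size}")
```

Running this on the leader's and the replica's hosts and comparing the output is a quick sanity check, not a guarantee that physical replication between them will work.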
Logical replication
● Out-of-the-box starting from PostgreSQL 10;
● Previous approaches: Slony, Londiste, pglogical;
○ I personally would not recommend them, unless you are already using one of these.
Credits: logical replication
Yet another type of replication? Why?
● To replicate only part of the data;
● To upgrade without a downtime;
● To create temporary tables on replicas;
● To write any other data on replicas;
● To write to the replicated tables;
● One replica can replicate data from multiple leaders;
● In theory, you can build a multimaster*;
● Other scenarios where physical replication doesn’t work well for some reason.
* but it will be complicated and ugly.
Fun facts!
● Replicated tables may differ on the leader and the replica;
● The order of the columns may differ;
● The replica may have additional NULLable columns;
● The leader can’t have more columns than the replica, even if values in these
columns are always NULL.
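The rules on this slide can be written down as a small check. This is a simplified model, not PostgreSQL code: it assumes columns are matched by name, and the function name and the input shapes are invented for illustration.

```python
def check_subscriber_table(publisher_columns, subscriber_columns):
    """Return a list of problems; an empty list means the tables look compatible.

    publisher_columns:  iterable of column names on the leader.
    subscriber_columns: dict mapping column name -> True if the column is NULLable.
    Column order is deliberately ignored, since it may differ.
    """
    problems = []
    publisher = list(publisher_columns)

    # Every column of the leader must exist on the replica.
    missing = [name for name in publisher if name not in subscriber_columns]
    if missing:
        problems.append("missing on replica: " + ", ".join(missing))

    # Extra columns on the replica are fine as long as they are NULLable.
    for name, nullable in subscriber_columns.items():
        if name not in publisher and not nullable:
            problems.append(f"extra replica column {name} must be NULLable")

    return problems


if __name__ == "__main__":
    print(check_subscriber_table(["k", "v"], {"v": False, "k": False, "note": True}))
    print(check_subscriber_table(["k", "v", "extra"], {"k": False, "v": False}))
```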
Limitations of the logical replication
● All replicated tables should have a primary key (more precisely, a replica identity, which UPDATE and DELETE need);
● DDL, TRUNCATE & sequences are not replicated (as of PostgreSQL 10; TRUNCATE is replicated starting from PostgreSQL 11);
● Triggers are not executed in some cases [1].
[1]: https://postgr.es/m/20171009141341.GA16999@e733.localdomain
synchronous_commit
● synchronous_commit = off
○ Asynchronous writing to the WAL, part of the recent changes can be lost;
○ Unlike fsync = off, it can’t cause database inconsistency;
● synchronous_commit = on
○ Synchronous writing to the WAL, both the leader’s and the replica’s;
● synchronous_commit = remote_write
○ Ditto, but without fsync() on replicas;
● synchronous_commit = local
○ Synchronous writing to the WAL on the leader only;
● synchronous_commit = remote_apply ( >= 9.6 )
○ Same as 'on' but also waits until the changes are applied to the data on replicas;
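One way to keep the five levels apart is to list what a COMMIT waits for at each of them. The table below is a simplified model with invented step names, not a description of the server's internals. It also assumes that synchronous_standby_names is set, because without a synchronous standby the remote levels behave like 'local'.

```python
# What COMMIT waits for at each level, from the weakest guarantee to the strongest.
WAITS_FOR = {
    "off": [],
    "local": ["leader_flush"],
    "remote_write": ["leader_flush", "replica_write"],
    "on": ["leader_flush", "replica_write", "replica_flush"],
    "remote_apply": ["leader_flush", "replica_write", "replica_flush", "replica_apply"],
}


def commit_returns(level, completed):
    """True if COMMIT may return to the client once the `completed` steps are done."""
    return all(step in completed for step in WAITS_FOR[level])


if __name__ == "__main__":
    done = {"leader_flush", "replica_write"}
    for level in WAITS_FOR:
        print(f"{level:13} -> commit returns: {commit_returns(level, done)}")
```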
Fun fact!
● synchronous_commit can be changed not only in postgresql.conf,
but also in the session using SET command.
synchronous_standby_names
● synchronous_standby_names = '*'
○ Wait for an 'ack' from any one replica;
● synchronous_standby_names = 'ANY 2 (node1, node2, node3)'
○ Quorum commit: wait for an 'ack' from any two of the three listed replicas;
○ PostgreSQL >= 10;
● Other possible values [1] IMHO are not as interesting;
[1]: https://www.postgresql.org/docs/current/static/runtime-config-replication.html
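The quorum rule can be sketched in a few lines. This toy model handles only the ANY form shown on the slide. The helper names parse_any and quorum_reached are hypothetical, and the real server of course tracks WAL positions rather than plain acks.

```python
import re


def parse_any(setting):
    """Parse 'ANY n (a, b, c)' into (n, [names]). Other syntaxes are not handled."""
    match = re.fullmatch(r"\s*ANY\s+(\d+)\s*\(([^)]*)\)\s*", setting, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unsupported setting: {setting!r}")
    names = [name.strip() for name in match.group(2).split(",") if name.strip()]
    return int(match.group(1)), names


def quorum_reached(setting, acked):
    """True once enough of the listed standbys have acknowledged the commit."""
    needed, names = parse_any(setting)
    # Acks from standbys that are not listed in the setting do not count.
    return len(set(names) & set(acked)) >= needed


if __name__ == "__main__":
    setting = "ANY 2 (node1, node2, node3)"
    print(quorum_reached(setting, ["node3"]))
    print(quorum_reached(setting, ["node3", "node1"]))
```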
Credits: logical decoding (PostgreSQL 9.4)
Logical decoding
$ pg_recvlogical --slot=myslot --dbname=eax --username=eax \
    --create-slot --plugin=test_decoding
$ pg_recvlogical --slot=myslot --dbname=eax --username=eax --start -f -
Logical decoding: output
BEGIN 560
COMMIT 560
BEGIN 561
table public.test: INSERT: k[text]:'aaa' v[text]:'bbb'
COMMIT 561
Logical decoding & JSON
$ pg_recvlogical --slot=myslot --dbname=eax --username=eax \
    --create-slot --plugin=wal2json
$ pg_recvlogical --slot=myslot --dbname=eax --username=eax --start -f - | jq
Logical decoding & JSON: output
HA / Failover
● Manual
○ Used by many companies in practice;
○ OK if you have a moderate number of database servers (e.g. ~10);
○ See next two slides about modern hardware;
● Automatic
○ If your company is as big as Google :)
A few words about hardware: RAM
● You can put up to 3 TB of RAM in a single physical server these days;
● AWS instance x1.32xlarge (128 vCPU, 1952 GB RAM, 2 x 1920 GB SSD)
costs $9,603 a month [1];
● AWS has also announced new instances with 4-16 TB of RAM [2][3].
[1]: https://aws.amazon.com/ec2/pricing/on-demand/
[2]: https://aws.amazon.com/ec2/instance-types/x1e/
[3]: https://www.theregister.co.uk/2017/05/16/aws_ram_cram/
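To put the x1.32xlarge price in perspective, here is the arithmetic for the cost of one GB of RAM per month and the cost of the instance per year. The input numbers are the ones from this slide (2017 prices); the function name is made up.

```python
MONTHLY_PRICE_USD = 9603   # x1.32xlarge, per month, from the slide
RAM_GB = 1952              # RAM of the same instance, from the slide


def price_per_gb_month(monthly_price_usd, ram_gb):
    """Monthly price of the instance divided by its RAM size."""
    return monthly_price_usd / ram_gb


if __name__ == "__main__":
    print(f"USD per GB of RAM per month: {price_per_gb_month(MONTHLY_PRICE_USD, RAM_GB):.2f}")
    print(f"USD per year: {MONTHLY_PRICE_USD * 12}")
```

That comes to roughly 4.92 USD per GB of RAM per month, or 115,236 USD per year for the whole instance.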
A few words about hardware: hard drives
● You can buy a 1 TB SSD for ~$300 [1];
● You can put up to 900 TB of data in a single physical server these days;
● Next year: up to 1.5 PB.
[1]: Samsung MZ-75E1T0BW, https://amazon.com/dp/B00OBRFFAS
Manual failover howto
● Configure metrics and alerts using Nagios / Zabbix / Datadog / … ;
● Check them, make sure everything works;
● When something breaks:
○ Wake up in the night;
○ Figure out what’s going on;
○ Fix it (e.g. promote a replica);
● Since there are not many servers it will happen like once a year, so it’s OK;
● In many regards it’s more reliable than automatic failover;
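For the "promote a replica" step it helps to know which replica has replayed the most WAL. PostgreSQL prints WAL positions (LSNs) as two hexadecimal numbers separated by a slash, so they can be compared after conversion to integers. The sketch below assumes you have already collected the replay positions, e.g. with pg_last_wal_replay_lsn() on PostgreSQL 10 or later; the replica names and values are invented.

```python
def parse_lsn(lsn):
    """Convert an LSN such as '16/B374D848' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced_replica(replay_lsns):
    """Return the name of the replica with the highest replayed LSN.

    replay_lsns: dict mapping replica name -> LSN string.
    """
    return max(replay_lsns, key=lambda name: parse_lsn(replay_lsns[name]))


if __name__ == "__main__":
    # Hypothetical positions collected from two replicas.
    positions = {"replica1": "0/FFFFFFFF", "replica2": "1/0"}
    print(most_advanced_replica(positions))
```

Comparing the strings directly would give the wrong answer here, which is why the conversion matters. The chosen replica can then be promoted, for instance with pg_ctl promote.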
Automatic HA / failover
● Repmgr
● Patroni
● Stolon
● Postgres-XL
● Postgres-X2, formerly Postgres-XC (abandoned?)
● Multimaster (part of Postgres Pro Enterprise)
● ???
Stolon
● Developed since 2015 by Sorint.lab;
● Written in Go;
● Relies on Consul or etcd for service discovery;
● Supports integration with Kubernetes;
● Very easy to configure and maintain;
● Handles crashes and netsplits correctly;
Stolon: how does it work?
Fun facts!
● Stolon routes both reads and writes to the leader. There is a workaround [1];
● It uses Consul or etcd only as a key-value store. In particular, it doesn’t rely
on DNS support in Consul and other features.
[1]: https://github.com/sorintlab/stolon/issues/132
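To illustrate why a plain key-value store is enough for coordination, here is a toy leader lease built on a compare-and-swap operation, which both Consul and etcd provide in some form. This is NOT how Stolon is implemented. The KV class is an in-memory stand-in, and all names in it are invented for the sketch.

```python
class KV:
    """In-memory stand-in for a key-value store with compare-and-swap."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def cas(self, key, expected, new):
        """Set key to `new` only if its current value equals `expected`."""
        if self._data.get(key) != expected:
            return False
        self._data[key] = new
        return True


def try_acquire(kv, node, now, ttl):
    """Try to take or renew the leader lease; return True on success."""
    current = kv.get("leader")
    free = current is None or current["expires"] <= now
    if free or current["holder"] == node:
        # CAS guarantees that only one of several concurrent contenders wins.
        return kv.cas("leader", current, {"holder": node, "expires": now + ttl})
    return False


if __name__ == "__main__":
    kv = KV()
    print(try_acquire(kv, "node_a", now=0, ttl=10))   # node_a takes the lease
    print(try_acquire(kv, "node_b", now=5, ttl=10))   # lease is still held
    print(try_acquire(kv, "node_b", now=10, ttl=10))  # lease has expired
```

A real deployment also has to deal with clock drift and with fencing the old leader, which this sketch ignores.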
Postgres Pro Multimaster
● Looks like a regular RDBMS to the user;
● Is a part of Postgres Pro Enterprise;
● Based on the paper “Clock-SI: Snapshot Isolation for Partitioned Data Stores
Using Loosely Synchronized Clocks” [*];
● Developers: Konstantin Knizhnik, Stas Kelvich, Arseny Sher;
[*]: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/samehe-clocksi.srds2013.pdf
The Multimaster team
Existing solutions for sharding
● Manual sharding
● Citus
● Greenplum
● pg_shardman (part of Postgres Pro Enterprise)
● ???
Callout on this slide: AFAIK designed mostly for analytics
Manual sharding
● Used in practice by many companies;
● It’s OK if you don’t have (many) distributed transactions;
● Rebalancing is quite simple with logical replication;
● For distributed transactions you can use a Percolator-like approach [1];
○ Provides snapshot isolation only, so the write skew anomaly is possible. On the other hand,
in some RDBMSs this is the best you can get [2].
● Or even something simpler (e.g. based on a log of idempotent operations);
[1]: http://rystsov.info/2012/09/01/cas.html
[2]: https://github.com/ept/hermitage/
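A common way to do manual sharding is to hash each key into a fixed number of logical buckets and keep a separate map from buckets to physical shards. Rebalancing then means moving one bucket at a time (e.g. copying its rows with logical replication) and updating the map. The sketch below is one possible layout, not a recommendation from the talk; the bucket count, shard names and function names are all assumptions.

```python
import hashlib

N_BUCKETS = 1024  # hypothetical number of logical buckets, fixed forever


def bucket_for(key):
    """Map a key to a logical bucket using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_BUCKETS


def build_bucket_map(shards):
    """Spread the buckets over the shards round-robin."""
    return {bucket: shards[bucket % len(shards)] for bucket in range(N_BUCKETS)}


def shard_for(key, bucket_map):
    """Find the shard that currently owns the key."""
    return bucket_map[bucket_for(key)]


def move_bucket(bucket_map, bucket, new_shard):
    """Return a new map with one bucket reassigned; data must be copied first."""
    updated = dict(bucket_map)
    updated[bucket] = new_shard
    return updated


if __name__ == "__main__":
    bucket_map = build_bucket_map(["shard_a", "shard_b", "shard_c"])
    key = "user:42"
    print(key, "->", shard_for(key, bucket_map))
    bucket_map = move_bucket(bucket_map, bucket_for(key), "shard_d")
    print(key, "->", shard_for(key, bucket_map))
```

Python's built-in hash() is avoided on purpose: for strings it is randomized per process, so different application servers would disagree about the owner of a key.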
pg_shardman
● Developed by the Postgres Pro Multimaster team;
● Is a part of Postgres Pro Enterprise;
● Supports replication factor > 1 (which is not true for some alternatives);
● It is currently in beta;
● Please contact info@postgrespro.com and ask for a trial;
● Note that Postgres Pro Enterprise is free for educational and non-commercial
use;
Not quite PostgreSQL
● Amazon Aurora
● CockroachDB
Amazon Aurora
● ACID with transparent failover, sharding and distributed transactions;
● Announced in 2014;
● Available only in the cloud;
● Is compatible with MySQL and PostgreSQL [1] on the protocol level;
● There is a paper [2];
[1]: since Nov 2016 https://news.ycombinator.com/item?id=13072861
[2]: http://www.allthingsdistributed.com/files/p1041-verbitski.pdf
CockroachDB
● ACID with transparent failover, sharding and distributed transactions;
● Announced in 2014, written in Go, developed by ex-Google employees;
● Free and open source software;
● Is compatible with PostgreSQL on the protocol level;
● Passes Jepsen [1];
● Based on the Spanner paper [2];
[1]: https://www.cockroachlabs.com/blog/cockroachdb-beta-passes-jepsen-testing/
[2]: https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
Links
● https://www.postgresql.org/docs/10/static/index.html
● https://github.com/sorintlab/stolon/
● https://github.com/eulerto/wal2json
● https://github.com/posix4e/jsoncdc
● https://github.com/citusdata/citus
● http://greenplum.org/
● https://postgrespro.com/products/postgrespro/enterprise
● https://aws.amazon.com/rds/aurora/
● https://www.cockroachlabs.com/
See you in Russia!
Thank you for your attention!
● a.alekseev@postgrespro.ru
● https://afiskon.github.io/
● https://postgrespro.com/
● https://github.com/postgrespro/

PostgreSQL Sharding and HA: Theory and Practice (PGConf.ASIA 2017)
