Distributed Fun with etcd
And the consensus problem
DistSys Riyadh Meetup
Abdulaziz AlMalki @almalki_am
Agenda
โ— The consensus problem
โ— Paxos and raft
โ— What is etcd?
โ— etcd use cases
โ— etcd as a kv store
โ— etcd consistency guarantees
โ— etcd failure modes
โ— Leader election
โ— Distributed locks
Agenda
โ— Distributed cluster configuration
โ— Service discovery
โ— How kubernetes uses etcd
โ— Demo:
โ—‹ PostgreSQL leader election with patroni and etcd
โ—‹ Using etcd and confd for dynamic pull based cluster reconfiguration
The consensus problem
What is consensus?
Getting a group of processes to agree on a value
Properties:
โ— Termination: eventually, every non-faulty process decides some value
โ— Agreement: all processes select the same value
โ— Integrity: a process decides only once
โ— Validity: The value must have proposed by some process
The consensus problem
Reaching an agreement (consensus) is an important step in many distributed
computing problems:
โ— synchronizing replicated state machines and making sure all replicas have the
same (consistent) view of system state.
โ— electing a leader
โ— mutual exclusion (distributed locks)
โ— managing group membership/failure detection
โ— deciding to commit or abort for distributed transactions
But...
There's always a but.
Is it possible to achieve consensus in distributed systems?
It depends...
Distributed System Models
Synchronous model
โ— messages are received within a known bounded time
โ— drift of each process local clock has a known bound
โ— Each step in a process has a known bound
โ— e.g supercomputer
Asynchronous model
โ— no bounds on message transmission delays
โ— arbitrary drift rate of local clocks
โ— no bounds on process execution
โ— e.g The Internet
Back to consensus
Is it possible to achieve consensus in distributed systems?
Yes & No
Yes in the synchronous model
No in the asynchronous model (no deterministic protocol is guaranteed to terminate if even one process may fail)
Why?
FLP Proof
Impossibility of distributed consensus with one faulty process (1985)
Fischer, Lynch and Paterson
https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf
Result:
“We show that every protocol for this problem has the possibility of nontermination,
even with only one faulty process. By way of contrast, solutions are known for the
synchronous case, the "Byzantine Generals" problem.”
Paxos
Leslie Lamport discovered the algorithm in the late 1980s
Used by Google Chubby
Guarantees safety, but not liveness
● Safety: the agreement property, always guaranteed
● Liveness: the termination property, not guaranteed
Eventual liveness: terminates in practice once the system behaves synchronously for long enough
Hard to understand and implement!
Raft
Reliable, Replicated, Redundant, And Fault-Tolerant
(was supposed to be named Redundo)
https://groups.google.com/forum/#!topic/raft-dev/95rZqptGpmU
Developed by Diego Ongaro and John Ousterhout at Stanford University
Designed to be easy to understand
Published in 2014: https://raft.github.io/raft.pdf
More information and related research: https://raft.github.io/
Demo
The Secret Lives of Data (an interactive demo that explains how Raft works)
http://thesecretlivesofdata.com/raft/
RaftScope: a Raft cluster running in your browser that you can interact with to see
Raft in action
https://raft.github.io/raftscope/
etcd playground
http://play.etcd.io/play
etcd
etcd is a distributed key-value store that provides a reliable way to store data
across a cluster of machines.
Kubernetes uses etcd as its backend for service discovery and for storing
cluster state and configuration.
Cloud Foundry uses etcd to store cluster state and configuration and as a global
lock service.
etcd
etcd is written in Go and uses the Raft consensus algorithm to manage a
highly-available replicated log.
https://github.com/etcd-io/etcd
Production-grade
The name comes from the Unix "/etc" directory plus "d" for distributed systems
Originally developed for CoreOS to get automatic, zero-downtime Linux kernel
updates using Locksmith, which implements a distributed semaphore over etcd to
ensure that only a subset of a cluster is rebooting at any given time.
etcd use cases
Best suited to storing metadata and configuration, for example to coordinate
processes
Handles a few gigabytes of data with consistent ordering
etcd replicates all data within a single consistent replication group; there is no sharding
etcd provides distributed coordination primitives such as event watches, leases,
elections, and distributed shared locks out of the box.
etcd as a kv store
The API is exposed as gRPC services:
● KV - Creates, updates, fetches, and deletes key-value pairs.
● Watch - Monitors changes to keys.
● Lease - Primitives for consuming client keep-alive messages.
Demo
etcdctl
https://github.com/etcd-io/etcd/blob/master/etcdctl/README.md
Interacting with etcd
https://github.com/etcd-io/etcd/blob/master/Documentation/dev-guide/interacting_v3.md
etcd consistency guarantees
โ— Atomicity
โ—‹ All API requests are atomic; an operation either completes entirely or not at all.
โ—‹ For watch requests, all events generated by one operation will be in one watch response.
โ— Consistency
โ—‹ sequential consistency: a client reads the same events in the same order
โ—‹ etcd does not ensure linearizability for watch operations
โ—‹ etcd ensures linearizability for all other operations by default
โ—‹ For lower latencies and higher throughput, use serializable, may access stale data with respect
to quorum
โ— Isolation
โ—‹ etcd ensures serializable isolation
โ— Durability
โ—‹ Any completed operations are durable
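The difference between the default linearizable read and a serializable read can be sketched with a toy model of one lagging member. The `Member` class and both read functions are hypothetical simplifications that only loosely model the idea; real etcd confirms the commit index through the leader and quorum rather than being handed it.

```python
class Member:
    """One cluster member with its own copy of the log applied up to some index."""

    def __init__(self, name):
        self.name = name
        self.applied_index = 0
        self.state = {}

    def apply(self, log, up_to):
        # Apply any log entries this member has not applied yet.
        for index in range(self.applied_index, up_to):
            key, value = log[index]
            self.state[key] = value
        self.applied_index = max(self.applied_index, up_to)


log = [("color", "red"), ("color", "green"), ("color", "blue")]
commit_index = len(log)        # assume all three entries are committed by a quorum

follower = Member("follower")
follower.apply(log, up_to=1)   # this follower has only applied the first entry


def serializable_read(member, key):
    # Served from local state without consulting the quorum: fast, possibly stale.
    return member.state.get(key)


def linearizable_read(member, key, log, commit_index):
    # Catch up to the cluster-wide commit index before answering.
    member.apply(log, up_to=commit_index)
    return member.state.get(key)


stale = serializable_read(follower, "color")
fresh = linearizable_read(follower, "color", log, commit_index)
print(stale, fresh)  # red blue
```

The serializable read is cheaper because it skips the catch-up step, at the price of possibly returning an older committed value.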
etcd failure modes
Minor follower failure
● with fewer than half of the members failing, etcd continues running
● clients should automatically reconnect to other operating members
Leader failure
● the etcd cluster automatically elects a new leader
● electing a new leader takes about one election timeout
● write requests sent during the election are queued until a new leader is elected
● writes already sent to the old leader but not yet committed may be lost
etcd failure modes
Majority failure
● the etcd cluster fails and cannot accept more writes
● the cluster recovers once a majority of members become available again
Network partition
● behaves like either a minor follower failure or a leader failure, depending on
  which side of the partition the leader ends up on
Leader election
https://github.com/etcd-io/etcd/blob/v3.2.17/Documentation/dev-guide/api_concurrency_reference_v3.md
Distributed locks
https://github.com/etcd-io/etcd/blob/v3.2.17/Documentation/dev-guide/api_concurrency_reference_v3.md
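The general idea behind a lease-backed lock can be sketched with a toy in-memory model. `ToyLockService` and its methods are hypothetical and are not the etcd concurrency API linked above; the sketch only assumes that a lock is a key created if absent and tied to a lease that expires when its owner stops sending keep-alives.

```python
class ToyLockService:
    """In-memory model of a lease-backed lock; not the etcd concurrency API."""

    def __init__(self):
        self.holders = {}  # lock name -> (owner, lease expiry time)

    def _expire(self, now):
        expired = [name for name, (_, expiry) in self.holders.items() if expiry <= now]
        for name in expired:
            del self.holders[name]

    def acquire(self, name, owner, ttl, now):
        """Take the lock only if nobody holds it (create-if-absent)."""
        self._expire(now)
        if name in self.holders:
            return False
        self.holders[name] = (owner, now + ttl)
        return True

    def keep_alive(self, name, owner, ttl, now):
        """Extend the lease; fails if the lock was lost in the meantime."""
        self._expire(now)
        if self.holders.get(name, (None, 0))[0] != owner:
            return False
        self.holders[name] = (owner, now + ttl)
        return True

    def release(self, name, owner):
        if self.holders.get(name, (None, 0))[0] == owner:
            del self.holders[name]


locks = ToyLockService()
a_got_it = locks.acquire("/locks/db", "node-a", ttl=5, now=0)
b_blocked = locks.acquire("/locks/db", "node-b", ttl=5, now=1)
# node-a crashes and stops sending keep-alives; its lease runs out at t=5.
b_takes_over = locks.acquire("/locks/db", "node-b", ttl=5, now=6)
a_too_late = locks.keep_alive("/locks/db", "node-a", ttl=5, now=7)
print(a_got_it, b_blocked, b_takes_over, a_too_late)  # True False True False
```

Leader election can be viewed the same way: whoever currently holds the key is the leader, and a crashed leader is replaced once its lease expires.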
Distributed cluster configuration
Use etcd as a central configuration store
โ— all consumers have immediate access to configuration data
โ— etcd makes it easy for applications to watch for changes
โ— reduces the time between a configuration change and propagation of that
change throughout the infrastructure
โ— failed nodes get latest config immediately after recovery
(Pushing config files to servers lacks all of the above)
Service Discovery
Services register/heartbeat/deregister themselves
Clients (or load balancers) watch etcd for endpoints and use them to connect
e.g.
/services/<service_name>/<instance_id> = <instance_address>
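The register/heartbeat/deregister cycle over this key layout can be sketched with a toy registry. `ToyRegistry`, `service_key`, and the addresses are illustrative assumptions; a real deployment would store the keys in etcd under a lease instead of a local dictionary.

```python
def service_key(service_name, instance_id):
    """Key layout from the slide: /services/<service_name>/<instance_id>."""
    return f"/services/{service_name}/{instance_id}"


class ToyRegistry:
    """In-memory model of TTL-based service registration; not the etcd API."""

    def __init__(self):
        self.entries = {}  # key -> (address, expiry time)

    def register(self, service_name, instance_id, address, ttl, now):
        """Also serves as the heartbeat: re-registering pushes the expiry forward."""
        self.entries[service_key(service_name, instance_id)] = (address, now + ttl)

    def deregister(self, service_name, instance_id):
        self.entries.pop(service_key(service_name, instance_id), None)

    def endpoints(self, service_name, now):
        """Prefix query: every live instance under /services/<service_name>/."""
        prefix = f"/services/{service_name}/"
        return sorted(
            address
            for key, (address, expiry) in self.entries.items()
            if key.startswith(prefix) and expiry > now
        )


registry = ToyRegistry()
registry.register("web", "1", "10.0.0.1:80", ttl=10, now=0)
registry.register("web", "2", "10.0.0.2:80", ttl=10, now=0)
registry.register("api", "1", "10.0.1.1:8080", ttl=10, now=0)

registry.register("web", "1", "10.0.0.1:80", ttl=10, now=8)  # heartbeat from web/1 only
live_web = registry.endpoints("web", now=12)
print(live_web)  # ['10.0.0.1:80']
```

Instance web/2 missed its heartbeat, so it drops out of the endpoint list without anyone having to deregister it explicitly.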
How Kubernetes uses etcd
● Kubernetes stores data, state, and metadata in etcd
● All access to etcd goes through the apiserver
● Kubernetes stores both the ideal (desired) state and the actual state.
● The Kubernetes control loop (kube-controller-manager) watches these states of the
  cluster through the apiserver and, if the two states have diverged, it makes
  changes to reconcile them.
● Clusters using etcd3 preserve changes from the last 5 minutes by default, so a
  watch can resume from a recent resourceVersion:
GET /api/v1/namespaces/test/pods?watch=1&resourceVersion=10245
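A request path like the one above can be assembled as in the sketch below. The `watch_url` helper is hypothetical and only builds the string; it does not contact an apiserver.

```python
from urllib.parse import urlencode


def watch_url(namespace, resource, resource_version):
    """Build the path and query string for a watch starting at a given version."""
    query = urlencode({"watch": 1, "resourceVersion": resource_version})
    return f"/api/v1/namespaces/{namespace}/{resource}?{query}"


url = watch_url("test", "pods", 10245)
print(url)  # /api/v1/namespaces/test/pods?watch=1&resourceVersion=10245
```

A client that remembers the last resourceVersion it saw can reconnect with that value and pick up where it left off, as long as the version still falls inside the retained history window.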
How Kubernetes uses etcd
[Figure: Create Pod flow. Source: heptio.com]
Patroni
Patroni: A Template for PostgreSQL HA with ZooKeeper, etcd or Consul
https://github.com/zalando/patroni
https://github.com/zalando/patroni/blob/master/patroni/dcs/etcd.py
Patroni originated as a fork of Governor, the project from Compose
https://github.com/helm/charts/tree/master/incubator/patroni
HA PostgreSQL Clusters with Docker
https://github.com/zalando/spilo
Confd
Manage local application configuration files using templates and data from etcd
http://www.confd.io/
● Syncs configuration files by polling etcd and processing template resources.
● Reloads applications to pick up new config file changes.
References and further reading
A Brief Tour of FLP Impossibility
https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/
Distributed Systems, Failures, and Consensus
https://www2.cs.duke.edu/courses/fall07/cps212/consensus.pdf
Consensus
https://www.cs.rutgers.edu/~pxk/417/notes/content/consensus.html
References and further reading
etcd github
https://github.com/etcd-io/etcd
etcd Concurrency primitives
https://github.com/etcd-io/etcd/tree/master/clientv3/concurrency
Consistency Models
https://jepsen.io/consistency
https://aphyr.com/posts/313-strong-consistency-models
References and further reading
Cloud Computing Concepts, Part 1 & 2
https://www.coursera.org/learn/cloud-computing/
https://www.coursera.org/learn/cloud-computing-2
Distributed Consensus
https://homepage.cs.uiowa.edu/~ghosh/16612.week11.pdf
How to Build a Highly Available System Using Consensus
https://www.microsoft.com/en-us/research/publication/how-to-build-a-highly-available-system-using-consensus/
References and further reading
In Search of an Understandable Consensus Algorithm
https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro
Tech Talk - Raft, In Search of an Understandable Consensus Algorithm by Diego
Ongaro
https://www.youtube.com/watch?v=LAqyTyNUYSY&feature=youtu.be
The Raft Consensus Algorithm
https://raft.github.io/
References and further reading
State machine replication
https://en.wikipedia.org/wiki/State_machine_replication
Kube-controller-manager
https://kubernetes.io/docs/concepts/overview/components/
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
go-config: a dynamic config framework
https://github.com/micro/go-config
