Cassandra Architecture
Distributed Peer-to-Peer
● There is no leader/follower.
● Each node is aware of the keys held by the other nodes and coordinates with those nodes to fetch the data.
● Depending on the replication factor & consistency level, the coordinator talks to one or more nodes before returning the response to the client.
● Every table defines a partition key (see the sketch after this list).
● Data is distributed across the various nodes in the cluster using a hash of the partition key, via the consistent hashing algorithm.
● Partitions are replicated across multiple nodes to prevent a single point of failure.
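To make the partition key and replication factor concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver); the contact point, keyspace, data-center name, table, and columns are all illustrative assumptions, not anything the slides define.

```python
# Minimal sketch with the DataStax Python driver (pip install cassandra-driver).
# Contact point, keyspace, DC name, table, and columns are illustrative only.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # hypothetical contact point
session = cluster.connect()

# RF is set at the keyspace level: here 3 copies in an assumed data center "dc1".
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

# Every table defines a partition key: rows are placed on nodes by the hash of
# country_id, and ordered within the partition by the clustering column city.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.city_population (
        country_id int,
        city       text,
        population int,
        PRIMARY KEY ((country_id), city)
    )
""")
```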
Replication & Consistency
Replication keeps copies of the data across multiple nodes within/across the DCs.
Replication Factor (RF) denotes the number of copies.
RF is set at the keyspace level.
Snitch: a strategy to identify the DC and rack a node belongs to. This identity can be shared manually across all nodes or propagated via gossip.
The coordinator is aware of the RF for each keyspace and coordinates writes up to that factor to the various nodes within/across DCs.
Hinted Handoff: while a replica node is down, the coordinator delays the transmission to that node by persisting the data locally as a hint. It retransmits the hint once the replica node is back online. Cassandra configuration sets how long such hints are retained.
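A rough sketch of the hinted-handoff idea described above, not Cassandra's actual implementation: the coordinator buffers mutations for a downed replica and replays them once the replica is reported up again. The class, the hint window, and the replica interface are assumptions.

```python
import time

HINT_WINDOW_SECONDS = 3 * 60 * 60     # assumed window for keeping hints

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas      # objects exposing .is_up and .apply(mutation)
        self.hints = {}               # replica -> list of (timestamp, mutation)

    def write(self, mutation):
        for replica in self.replicas:
            if replica.is_up:
                replica.apply(mutation)
            else:
                # Persist the mutation locally instead of failing the write.
                self.hints.setdefault(replica, []).append((time.time(), mutation))

    def replay_hints(self):
        # Called once a replica is seen alive again (e.g., learned via gossip).
        for replica, pending in list(self.hints.items()):
            if not replica.is_up:
                continue
            for ts, mutation in pending:
                if time.time() - ts <= HINT_WINDOW_SECONDS:
                    replica.apply(mutation)   # hints older than the window are dropped
            del self.hints[replica]
```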
Consistency is the number of replica nodes that must acknowledge a read/write before it is considered successful.
Consistency can be set for both reads and writes.
Consistency levels (CL) range from low to high (ONE, LOCAL_QUORUM, QUORUM, ALL).
CL is a trade-off between consistency and availability.
Read Repair: the coordinator performs a read repair on some/all of the replicas that hold trailing versions. Depending on the CL, this can be done asynchronously during a read request.
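To make the CL trade-off concrete, here is a hedged sketch using the DataStax Python driver: QUORUM waits for a majority of the RF replicas (RF // 2 + 1) to acknowledge, while ONE favours availability and latency. The contact point and the keyspace/table from the earlier sketch are assumptions.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])      # assumed contact point
session = cluster.connect("demo")     # keyspace from the earlier sketch

rf = 3
quorum = rf // 2 + 1                  # RF=3 -> 2 replicas must acknowledge

# Write at QUORUM: the coordinator waits for a majority of replicas.
insert = SimpleStatement(
    "INSERT INTO city_population (country_id, city, population) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (23, "USA", 4))

# Read at ONE: faster and more available, but may see a trailing version;
# read repair can reconcile the replicas afterwards.
select = SimpleStatement(
    "SELECT population FROM city_population WHERE country_id = %s AND city = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(select, (23, "USA")).one()
```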
Gossip
Each node stores info about itself and every other node in its knowledge base.
Each node initiates a gossip round every second with 2 or 3 other nodes to share its knowledge base.
Knowledge Base:
Each node increments its own heartbeat version every second.
When a node receives gossip from another node, it checks each node's heartbeat version and updates its entry if the received version is newer.
Optimization to reduce message bandwidth during gossiping (a sketch follows the example entry below):
Gossip is initiated with a SYN to the receiving node.
SYN: just a digest, no AppState included.
The receiving node ACKs back to the sender.
ACK: a digest for the versions where the receiver is trailing (requesting details) and detailed state (including AppState) for the versions where it is leading.
The sender updates its trailing versions and ACKs back with the detailed info for the trailing versions requested by the other end.
Knowledge Base example entry:
EndPt State: <IP of a node>
  HeartBeat State:
    Generation: 10
    Version: 34
  Application State:
    Status: Normal/Removed/Arrived…
    DataCenter:
    Rack:
    Load:
    Severity:
    …
EndPt State: <IP of a node>…
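A toy sketch of the three-message digest exchange described above (SYN, ACK, ACK2) over a simplified knowledge base; the EndpointState fields and the comparison on (generation, version) are assumptions, not Cassandra's wire format.

```python
from dataclasses import dataclass, field

@dataclass
class EndpointState:
    generation: int
    version: int
    app_state: dict = field(default_factory=dict)   # Status, DataCenter, Rack, Load, ...

def digest(kb):
    # SYN payload: only (generation, version) per endpoint, no AppState.
    return {ep: (st.generation, st.version) for ep, st in kb.items()}

def gossip_round(sender_kb, receiver_kb):
    # SYN: the sender ships just its digest.
    syn = digest(sender_kb)

    # ACK: the receiver sends full state where it is leading and a digest
    # requesting details where it is trailing.
    ack_full, ack_request = {}, []
    for ep, (gen, ver) in syn.items():
        mine = receiver_kb.get(ep)
        if mine is None or (mine.generation, mine.version) < (gen, ver):
            ack_request.append(ep)          # receiver is trailing: ask for details
        elif (mine.generation, mine.version) > (gen, ver):
            ack_full[ep] = mine             # receiver is leading: send AppState
    for ep, st in receiver_kb.items():      # endpoints the sender did not mention
        if ep not in syn:
            ack_full[ep] = st

    # ACK2: the sender applies the fresher entries, then returns the requested details.
    sender_kb.update(ack_full)
    ack2 = {ep: sender_kb[ep] for ep in ack_request if ep in sender_kb}
    receiver_kb.update(ack2)
```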
Write Path
The client writes to both the commit log and the memtable. In the event of a node failure, the memtable can be reconstructed from the commit log.
The commit log is append-only and does not maintain any order.
The memtable is partitioned by partition key and ordered by clustering columns.
Eventually the memtable grows too large and is flushed to disk as an SSTable. SSTables are immutable, so each flush creates a new SSTable file.
An SSTable holds rows grouped by partition.
Compaction is the process of merging numerous SSTable files into one; it relies on the timestamp of each row to resolve duplicates (see the compaction sketch below).
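A simplified sketch of the write path as described above: an append-only commit log, a memtable keyed by partition key and sorted by clustering column, and a flush that produces an immutable, sorted SSTable. Class and field names are assumptions for illustration.

```python
import time

class WritePath:
    def __init__(self, flush_threshold=4):
        self.commit_log = []     # append-only, in arrival order
        self.memtable = {}       # partition key -> {clustering key: (value, timestamp)}
        self.flush_threshold = flush_threshold
        self.sstables = []       # each flush adds a new immutable SSTable

    def write(self, partition_key, clustering_key, value):
        now = time.time()
        self.commit_log.append((partition_key, clustering_key, value, now))  # durable append
        self.memtable.setdefault(partition_key, {})[clustering_key] = (value, now)
        if sum(len(rows) for rows in self.memtable.values()) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted, immutable "SSTable".
        sstable = {
            pk: dict(sorted(rows.items()))                   # ordered by clustering column
            for pk, rows in sorted(self.memtable.items())    # grouped by partition key
        }
        self.sstables.append(sstable)
        self.memtable.clear()    # the corresponding commit-log segment could now be recycled
```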
[Diagram: write path. The memtable in memory is flushed to SSTable files on disk; compaction merges several SSTables (with duplicate rows such as (23, USA)) into one, keeping the latest version of each row. Diagram labels: Memory, Disk, Flushing, Compaction, SSTable, Replica Node, Coordinator, Bloom Filters.]
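To illustrate the last-write-wins merge that compaction performs, here is a toy sketch: rows are keyed by (partition key, clustering value), and when the same key appears in several SSTables the row with the latest timestamp survives. The tuple layout loosely mirrors the diagram's example rows but is otherwise an assumption.

```python
def compact(sstables):
    """Merge many SSTables into one, resolving duplicates by timestamp."""
    merged = {}
    for sstable in sstables:                       # e.g. the output of several flushes
        for (pk, ck), (value, ts) in sstable.items():
            current = merged.get((pk, ck))
            if current is None or ts > current[1]: # last write wins
                merged[(pk, ck)] = (value, ts)
    return merged

# Toy data loosely based on the diagram: (partition, clustering) -> (value, write time)
sstable_1 = {(23, "USA"): (4, 100), (23, "Mexico"): (7, 101), (55, "Korea"): (9, 102)}
sstable_2 = {(23, "USA"): (8, 200), (55, "China"): (20, 201)}
sstable_3 = {(23, "USA"): (5, 150), (55, "China"): (40, 250)}

compacted = compact([sstable_1, sstable_2, sstable_3])
# (23, "USA") -> (8, 200) and (55, "China") -> (40, 250): the latest timestamps win.
```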
Read Path
[Diagram: read path from the client through the coordinator to a replica node. Memtable in memory; Bloom Filters, Key Cache (LRU), Summary Index, Partition Index, and SSTables on disk, with flushing and compaction as before.]
Order of search during a read:
The coordinator node calls one of the replica nodes for the requested partition key.
The replica node first looks in the memtable. If the key is not found, it follows the path below until the key is found.
Bloom filters help determine two things: the key definitely does not exist in the SSTable, or the key may exist in the SSTable.
Key Cache: an LRU cache keyed by partition key whose value is the offset of the partition in the SSTable file.
Summary Index: a range-based index over the keys in the partition index and their offsets.
Partition Index: an indexed lookup from the partition key to the offset of the partition in the SSTable file.
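A hedged sketch of that lookup order on a single replica: memtable first, then per-SSTable bloom filter, key cache, summary index, and partition index before reading the data file. The replica/sstable objects and their methods are hypothetical stand-ins, not Cassandra's on-disk structures.

```python
def read(replica, partition_key):
    # 1. Memtable: the most recent, still-in-memory writes.
    if partition_key in replica.memtable:
        return replica.memtable[partition_key]

    for sstable in replica.sstables:               # typically newest to oldest
        # 2. Bloom filter: may say "definitely absent" and skip this file entirely.
        if not sstable.bloom_filter.might_contain(partition_key):
            continue
        # 3. Key cache (LRU): partition key -> offset of the partition in the data file.
        offset = sstable.key_cache.get(partition_key)
        if offset is None:
            # 4. Summary index: range-based, narrows the search to a region of the partition index.
            index_range = sstable.summary_index.range_for(partition_key)
            # 5. Partition index: exact key -> offset in the SSTable data file.
            offset = sstable.partition_index.lookup(partition_key, index_range)
            if offset is None:
                continue                           # bloom filter false positive
            sstable.key_cache[partition_key] = offset
        # 6. Read the partition from the data file at the found offset.
        return sstable.read_at(offset)
    return None                                    # key not present on this replica
```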
References:
● https://academy.datastax.com
● https://www.youtube.com/watch?v=s1xc1HVsRk0&list=PLalrWAGybpB-L1PGA-NfFu2uiWHEsdscD&index=1
● https://www.toptal.com/big-data/consistent-hashing
● https://www.baeldung.com/cassandra-data-modeling
Consistent Hashing
Given a set of key/value pairs, hashing is a strategy to spread the pairs as evenly as possible, so that we can fetch them in almost constant time by their key.
Consistent hashing is one such hashing strategy for spreading keys in a distributed environment.
The hashes of the keys are notionally spread on a ring. The position a key takes on the ring can be anywhere between 0 and 360, based on the hash of the key (usually a mod of the hash).
The stores/servers that host these keys are also given positions on the ring (e.g., A, B, C…).
A key is stored on the first server found while traversing the ring anti-clockwise from the key's position.
E.g., key Steve @ 352.3 finds server C @ 81.7.
If we maintain a sorted list of servers and their positions, a quick binary search points us to the server holding the key, eliminating the need to query all servers.
Keys can be replicated on succeeding servers to avoid a single point of failure (SPOF).
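A minimal sketch of that lookup, assuming a 0-360 ring, a stand-in hash, and the anti-clockwise rule from the slide; the server positions are chosen so the slide's Steve/C example works out.

```python
import bisect
import hashlib

def position(key, ring_size=360.0):
    """Map a key to a ring position using a stand-in hash (a mod of the hash)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % ring_size

# Sorted list of (position, server) pairs; positions are illustrative, chosen so
# the slide's example holds (key Steve @ 352.3 finds server C @ 81.7).
ring = sorted([(10.0, "A"), (50.0, "B"), (81.7, "C")])
positions = [p for p, _ in ring]

def lookup(key_position):
    """Return the first server found travelling anti-clockwise (decreasing position)."""
    i = bisect.bisect_right(positions, key_position) - 1
    return ring[i][1]          # i == -1 wraps past 0 back to the top of the ring

assert lookup(352.3) == "C"    # the slide's example
assert lookup(45.0) == "A"     # between A @ 10 and B @ 50, anti-clockwise finds A
```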
Consistent Hashing
Although the keys are spread over several servers, the distribution may not be even, due to the uneven clustering of keys in the real world (names starting with a certain letter may be more common).
In such scenarios, to relieve the load on an individual server, we define virtual servers: we give the same server multiple positions on the ring, simulating multiple instances of that server spread across the ring.
With reference to the picture here, the refined sorted list of servers now holds virtual instances a1, a2, b2, c3, etc., thereby distributing the load on C to B and A as well.
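Extending the previous sketch, each physical server can be placed on the ring several times as virtual nodes; the number of virtual nodes per server and labels such as a1, b2, c3 are assumptions for illustration.

```python
import bisect
import hashlib

def position(label, ring_size=360.0):
    return int(hashlib.md5(label.encode()).hexdigest(), 16) % ring_size

def build_ring(servers, vnodes_per_server=4):
    """Place each server at several positions (virtual nodes) to smooth the distribution."""
    ring = []
    for server in servers:
        for i in range(1, vnodes_per_server + 1):
            ring.append((position(f"{server}{i}"), server))   # e.g. "a1", "a2", ...
    return sorted(ring)

ring = build_ring(["a", "b", "c"])
positions = [p for p, _ in ring]

def lookup(key):
    """Same anti-clockwise rule as before, but over the virtual-node positions."""
    i = bisect.bisect_right(positions, position(key)) - 1
    return ring[i][1]   # i == -1 wraps to ring[-1], the top of the ring

owner = lookup("Steve")   # resolves to the physical server behind the nearest virtual node
```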
Bloom Filters
A bloom filter is a probabilistic data structure used to determine whether an element is present in a set or not.
It consists of an array of n bits and a collection of independent hash functions, each of which returns a number between 0 and n-1, identifying one of the n bits.
Writes:
A key is run through the collection of hash functions, and each resulting bit is flipped on to mark the element's presence.
Reads:
A key is run through the collection of hash functions. Only if all of the resulting bits are on can we say the key MAY be present in the underlying set; if even one of them is off, we can GUARANTEE that the key is not present.
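A small sketch of that scheme, assuming k stand-in hash functions derived by salting MD5; real implementations use faster hashes and sizes tuned to the expected false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.bits = [False] * n_bits
        self.n_hashes = n_hashes

    def _positions(self, key):
        # k independent-ish hash functions simulated by salting a single hash.
        for i in range(self.n_hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True          # flip on each resulting bit

    def might_contain(self, key):
        # All bits on  -> the key MAY be present (possible false positive).
        # Any bit off  -> the key is GUARANTEED absent.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("Steve")
assert bf.might_contain("Steve") is True
```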
