An Introduction to Apache Cassandra
Aaron Ploetz
What is Cassandra? 
● Cassandra is a non-relational, partitioned row store. 
● Rows are organized into column families (tables) with a 
required primary key. 
● Data is distributed across multiple master-less nodes in 
an application-transparent manner. 
● DataStax oversees the development of the Apache 
Cassandra open-source project, provides support to 
companies using Cassandra, and provides an enterprise-ready 
version of Cassandra.
$ whoami 
Aaron Ploetz 
@APloetz 
● Lead Database Engineer 
● B.S.-MCS UW-Whitewater 
● M.S.-SED Regis University 
● Using Cassandra since version 0.8 
● Contributor to the Cassandra tag on StackOverflow 
● Contributor to the Apache Cassandra project 
● 2014/15 DataStax MVP for Apache Cassandra
(short) History of Cassandra and 
DataStax 
● Developed at Facebook, open sourced in 2008. 
● Design influenced by Google BigTable and Amazon Dynamo. 
● Graduated to Apache “Top-Level Project” status in Feb 2010. 
● DataStax founded by Jonathan Ellis and Matt Pfeil in late 2010, 
offering enterprise Cassandra support. 
● Secured $190 million in VC funding. 
● Started with eight people, now employs more than 350. 
● 400+ Customers, including 25 of the Fortune 100.
Key Features 
● Current release is Cassandra 2.1 (Sept 10). 
● Distributed, decentralized storage; no SPOF. 
● Scalable. 
● High-availability, Fault-tolerance. 
● Tunable Consistency. 
● High-performance. 
● Data center awareness.
Distributed, Decentralized Storage 
[Diagram: nodes in two data centers, DC1 and DC2] 
● Peer-to-peer, master-less replication. 
● Any node can handle a read or write operation. 
● Supports local read/write ops via “logical” data centers. 
● Gossip protocol allows nodes to be aware of each other. 
● Snitch ensures that data is replicated appropriately.
Scalability 
● Cassandra allows you to easily add nodes to scale your 
application back-end. 
● Benchmark from 2011: 
– 48 node cluster could handle 174,373 writes/sec. 
– 288 node cluster could handle 1,099,837 writes/sec. 
– Indicates that Cassandra scales linearly. 
● Throughput of N nodes = T. 
● Throughput of Nx2 nodes = Tx2.
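The linear-scaling claim can be checked with quick arithmetic on the two benchmark figures above. This is a minimal sketch in Python; the variable names are illustrative, and only the node counts and write rates come from the slide.

```python
# Benchmark figures quoted on the slide (2011).
small_nodes, small_writes = 48, 174_373
large_nodes, large_writes = 288, 1_099_837

# Per-node write throughput for each cluster size.
per_node_small = small_writes / small_nodes
per_node_large = large_writes / large_nodes

# Growth factors: how much the cluster grew vs. how much throughput grew.
node_factor = large_nodes / small_nodes
write_factor = large_writes / small_writes

print(f"writes/sec per node: {per_node_small:.0f} vs {per_node_large:.0f}")
print(f"nodes grew {node_factor:.1f}x, throughput grew {write_factor:.1f}x")
```

Six times the nodes delivered roughly 6.3 times the writes per second, and per-node throughput stayed in the same range, which is what linear scaling predicts.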
High Availability 
[Diagram: data centers DC1 and DC2, with failed nodes marked X] 
● Cassandra was designed under the premise that 
hardware failures can and do occur. 
● Gossip protocol keeps live nodes informed of failures. 
● Cassandra 2.0.2 implemented Rapid Read Protection, which 
redirects read operations to live nodes.
Tunable Consistency 
● Cassandra allows you to alter your consistency level on a 
per-operation basis. 
● Also allows configuration for data center locality. 
● Consistency spectrum: ALL (strong consistency), QUORUM, 
ONE (high availability / eventual consistency). 
● Quorum == (replication factor / 2) + 1, using integer division.
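The quorum formula above can be sketched in Python. This assumes the count being halved is the number of replicas for the data (the replication factor) and that the division is integer division; the function name `quorum` is illustrative, not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read or write."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # Integer division: a quorum is a strict majority of the replicas.
    return replication_factor // 2 + 1


for rf in (1, 2, 3, 5, 6):
    print(f"RF={rf}: quorum={quorum(rf)}")
```

With three replicas a quorum is 2, so one replica can be down and QUORUM operations still succeed. With six replicas in total a quorum is 4.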
Eventual Consistency != Hopeful 
Consistency 
● Netflix experiment on consistency: 
– Created two data centers with a C* 1.1.7 cluster of 48 nodes in 
each data center. 
– Wrote 1,000,000 records at CL1 (consistency level ONE) in one data center. 
– Read the same 1,000,000 records at CL1 in the other data center. 
– All records read successfully! 
– “Eventually consistent does not mean a day, minute or 
even a second from now… in most cases, it is 
milliseconds!” - Christos Kalantzis
High Performance 
● Cassandra is optimized from the ground up for 
performance: 
[Figure not reproduced; source: DataStax.com]
High Performance 
● All disk writes are sequential, append-only 
operations. 
● No reading before writing. 
● Cassandra is optimized for threading on multi-core, 
multi-processor machines.
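The "append-only, no reading before writing" idea can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not Cassandra's storage engine: the `log` list, the `write` and `read` functions, and the key format are all assumptions made for illustration. It only shows that a write never looks up the old value, and that the newest timestamp wins when the value is read.

```python
import time
from typing import Optional

# Toy append-only log: each write is appended, never updated in place.
log: list[tuple[str, str, int]] = []  # (key, value, timestamp in microseconds)


def write(key: str, value: str, timestamp: Optional[int] = None) -> None:
    """Append a new version of the value without reading the old one."""
    if timestamp is None:
        timestamp = time.time_ns() // 1_000
    log.append((key, value, timestamp))


def read(key: str) -> Optional[str]:
    """Resolve the current value at read time: the newest timestamp wins."""
    versions = [(ts, value) for k, value, ts in log if k == key]
    return max(versions)[1] if versions else None


write("Mal:phone", "111-555-1234", timestamp=1)
write("Mal:phone", "111-555-0000", timestamp=2)
print(read("Mal:phone"))  # 111-555-0000
```

Both versions stay in the log; the cost of picking the winner is paid on the read path rather than the write path.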
Potential Drawbacks? 
● Some use cases are not appropriate (transient data 
or delete-heavy patterns). 
● Developer learning curve: CQL != SQL 
● Simple queries only. No JOINs or sub-queries. 
● Optimal performance is achieved through de-normalization 
and query-based data modeling.
Cassandra moves beyond disco-era 
data modeling 
●Everything MUST be normalized!!! 
●Redundant data == “bad” 
●Relational Database theory originated when disk space was expensive. In 
1975 some vendors were selling disk space at $11k per MB. 
●By 1980 prices “dropped” so that you could finally buy 1GB of storage for 
under $1 Million. 
●Today I can buy a 1TB disk for $60.
Cassandra Storage Structures 
● Keyspace == Database (in the RDBMS world) 
CREATE KEYSPACE products WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'RFD': '2', 'MKE': '4'};
● Column Family == Table 
CREATE TABLE hierarchy (
    category text,
    subcategory text,
    classification text,
    skumap map<uuid, text>,
    PRIMARY KEY (category, subcategory, classification));
Cassandra Primary Keys 
● Primary Keys are unique. 
● Single Primary Key: 
PRIMARY KEY (keyColumn) 
● Composite Primary Key: 
PRIMARY KEY (myPartitionKey, my1stClusteringKey, 
my2ndClusteringKey) 
● Composite Partitioning Key: 
PRIMARY KEY ((my1stPartitionKey, my2ndPartitionKey), 
myClusteringKey)
Cassandra Secondary Indexes 
● Does allow secondary indexes. 
CREATE INDEX myIndex ON myTable(myNonKeyColumn) 
● Designed for query convenience, not for performance. 
● Does not perform well on high-cardinality columns, because you filter a 
huge volume of records for a small number of results. Extremely low 
cardinality is also not a good idea (ex: customer address [state == good, 
phone == bad, gender == bad]). 
● Works best on a table having many rows that contain the indexed value; 
middle-of-the-road cardinality.
Serenity “crew” 
● Create a table to store data for the crew of “Serenity” from “Firefly.” 
CREATE TABLE crew (
    crewname TEXT,
    firstname TEXT,
    lastname TEXT,
    phone TEXT,
    PRIMARY KEY (crewname));

 crewname | firstname | lastname | phone
----------+-----------+----------+--------------
 Mal      | Malcolm   | Reynolds | 111-555-1234
 Jayne    | Jayne     | Cobb     | 111-555-3464
 Sheppard | Derial    | Book     | 111-555-2349
 Simon    | Simon     | Tam      | 111-555-8899
Serenity “crew” under the hood 
RowKey:Mal 
=> (column=, value=, timestamp=1374546754299000) 
=> (column=firstname, value=Malcolm, timestamp=1374546754299000) 
=> (column=lastname, value=Reynolds, timestamp=1374546754299000) 
=> (column=phone, value=111-555-1234, timestamp=1374546754299000) 
RowKey:Jayne 
=> (column=, value=, timestamp=1374546757815000) 
=> (column=firstname, value=Jayne, timestamp=1374546757815000) 
=> (column=lastname, value=Cobb, timestamp=1374546757815000) 
=> (column=phone, value=111-555-3464, timestamp=1374546757815000)
Serenity “crewbyphone” 
● To solve the problem of being able to query crew members by phone: 
CREATE TABLE crewbyphone (
    crewname TEXT,
    firstname TEXT,
    lastname TEXT,
    phone TEXT,
    PRIMARY KEY (phone, crewname));

 crewname | firstname | lastname  | phone
----------+-----------+-----------+--------------
 Mal      | Malcolm   | Reynolds  | 111-555-1234
 Wash     | Hoban     | Washburne | 111-555-1212
 Zoey     | Zoey      | Washburne | 111-555-1212
 Jayne    | Jayne     | Cobb      | 111-555-3464
Serenity “crewbyphone” under the hood 
RowKey:111-555-1234 
=> (column=Mal, value=, timestamp=1374546754299000) 
=> (column=Mal:firstname, value=Malcolm, timestamp=...) 
=> (column=Mal:lastname, value=Reynolds, timestamp=...) 
RowKey:111-555-1212 
=> (column=Wash, value=, timestamp=1374546754299000) 
=> (column=Wash:firstname, value=Hoban, timestamp=...) 
=> (column=Wash:lastname, value=Washburne, timestamp=...) 
=> (column=Zoey, value=, timestamp=1374546754299000) 
=> (column=Zoey:firstname, value=Zoey, timestamp=...) 
=> (column=Zoey:lastname, value=Washburne, timestamp=...)
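The layout above can be modeled as a nested map: partition key, then clustering key, then the remaining columns. The Python sketch below is only an illustration of that shape under simplifying assumptions (one in-memory dictionary, no timestamps, no replication); the names `crewbyphone`, `insert`, and `by_phone` are hypothetical helpers, not driver or CQL functions.

```python
from collections import defaultdict

# partition key (phone) -> clustering key (crewname) -> remaining columns
crewbyphone: dict[str, dict[str, dict[str, str]]] = defaultdict(dict)


def insert(phone: str, crewname: str, firstname: str, lastname: str) -> None:
    """Store one CQL row inside its partition, keyed by its clustering column."""
    crewbyphone[phone][crewname] = {"firstname": firstname, "lastname": lastname}


def by_phone(phone: str) -> list[tuple[str, str, str]]:
    """Return the rows of one partition, ordered by clustering key."""
    partition = crewbyphone.get(phone, {})
    return [
        (name, cols["firstname"], cols["lastname"])
        for name, cols in sorted(partition.items())
    ]


insert("111-555-1234", "Mal", "Malcolm", "Reynolds")
insert("111-555-1212", "Zoey", "Zoey", "Washburne")
insert("111-555-1212", "Wash", "Hoban", "Washburne")

print(by_phone("111-555-1212"))
# [('Wash', 'Hoban', 'Washburne'), ('Zoey', 'Zoey', 'Washburne')]
```

Both crew members who share a phone number land in the same partition, so a lookup by phone touches one partition and returns its rows sorted by crewname, regardless of insertion order.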
Who Uses Cassandra?
Who else Uses Cassandra?
Cassandra Large Deployments 
● 100+ nodes, 250TB of data, cluster sizes vary from 6 to 32 
nodes. 
● 2,500+ nodes, 420TB of data, 4 DCs, handles 1 trillion 
operations per day. 
● 75,000+ nodes, 10s of PB of data, largest cluster 1000+ nodes.
Additional Reading 
● Amazon Dynamo paper 
● Facebook Cassandra paper 
● Harvest, Yield, and Scalable, Tolerant Systems - Brewer, Fox, 1999 
● DataStax grabs $106M to achieve big-dog status in database country 
● http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html 
● http://planetcassandra.org/blog/a-netflix-experiment-eventual-consistency-hopeful-consistency-
● DataStax Documentation 
● KillrVideo.com
Getting Started 
● Community site: http://planetcassandra.org 
● http://datastax.com 
● DataStax community edition: 
http://planetcassandra.org/cassandra 
● DataStax startup program: 
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/
● Apache Cassandra project site: 
http://cassandra.apache.org/
Questions?
Demo
Want to work at AccuLynx? 
We're hiring! 
http://careers.stackoverflow.com/company/acculynx
Ad

More Related Content

What's hot (20)

HandsOn ProxySQL Tutorial - PLSC18
HandsOn ProxySQL Tutorial - PLSC18HandsOn ProxySQL Tutorial - PLSC18
HandsOn ProxySQL Tutorial - PLSC18
Derek Downey
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
Michelle Darling
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012
Jay Patel
 
Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandra
Nguyen Quang
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Manuel Correa
 
Deploying PostgreSQL on Kubernetes
Deploying PostgreSQL on KubernetesDeploying PostgreSQL on Kubernetes
Deploying PostgreSQL on Kubernetes
Jimmy Angelakos
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
alexbaranau
 
Intro to Cassandra
Intro to CassandraIntro to Cassandra
Intro to Cassandra
DataStax Academy
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Mike Dirolf
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
DataStax Academy
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Jeff Holoman
 
Cassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthCassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per month
daveconnors
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka Streams
Guozhang Wang
 
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know Snowflake
Knoldus Inc.
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Arnab Mitra
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
DataWorks Summit/Hadoop Summit
 
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Sandesh Rao
 
Presentation of Apache Cassandra
Presentation of Apache Cassandra Presentation of Apache Cassandra
Presentation of Apache Cassandra
Nikiforos Botis
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
HandsOn ProxySQL Tutorial - PLSC18
HandsOn ProxySQL Tutorial - PLSC18HandsOn ProxySQL Tutorial - PLSC18
HandsOn ProxySQL Tutorial - PLSC18
Derek Downey
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012
Jay Patel
 
Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandra
Nguyen Quang
 
Deploying PostgreSQL on Kubernetes
Deploying PostgreSQL on KubernetesDeploying PostgreSQL on Kubernetes
Deploying PostgreSQL on Kubernetes
Jimmy Angelakos
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Mike Dirolf
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
DataStax Academy
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Jeff Holoman
 
Cassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthCassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per month
daveconnors
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka Streams
Guozhang Wang
 
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know Snowflake
Knoldus Inc.
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Arnab Mitra
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
DataWorks Summit/Hadoop Summit
 
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Sandesh Rao
 
Presentation of Apache Cassandra
Presentation of Apache Cassandra Presentation of Apache Cassandra
Presentation of Apache Cassandra
Nikiforos Botis
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 

Similar to Intro to cassandra (20)

Trivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan Ott
Trivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan OttTrivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan Ott
Trivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan Ott
Trivadis
 
Cassandra
CassandraCassandra
Cassandra
Carbo Kuo
 
On Rails with Apache Cassandra
On Rails with Apache CassandraOn Rails with Apache Cassandra
On Rails with Apache Cassandra
Stu Hood
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
Nathan Milford
 
Deep Dive into Cassandra
Deep Dive into CassandraDeep Dive into Cassandra
Deep Dive into Cassandra
Brent Theisen
 
Use Your MySQL Knowledge to Become an Instant Cassandra Guru
Use Your MySQL Knowledge to Become an Instant Cassandra GuruUse Your MySQL Knowledge to Become an Instant Cassandra Guru
Use Your MySQL Knowledge to Become an Instant Cassandra Guru
Tim Callaghan
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
Stu Hood
 
Appache Cassandra
Appache Cassandra  Appache Cassandra
Appache Cassandra
nehabsairam
 
BigData Developers MeetUp
BigData Developers MeetUpBigData Developers MeetUp
BigData Developers MeetUp
Christian Johannsen
 
Avoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdfAvoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdf
Cédrick Lunven
 
Performance Testing: Scylla vs. Cassandra vs. Datastax
Performance Testing: Scylla vs. Cassandra vs. DatastaxPerformance Testing: Scylla vs. Cassandra vs. Datastax
Performance Testing: Scylla vs. Cassandra vs. Datastax
ScyllaDB
 
Pythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra ClusterPythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra Cluster
DataStax Academy
 
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
DataStax
 
MySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion QueriesMySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion Queries
Bernd Ocklin
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Stavros Kontopoulos
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Pollfish
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
DataStax
 
Micro-batching: High-performance writes
Micro-batching: High-performance writesMicro-batching: High-performance writes
Micro-batching: High-performance writes
Instaclustr
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
Matthias Niehoff
 
Cassandra
CassandraCassandra
Cassandra
Upaang Saxena
 
Trivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan Ott
Trivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan OttTrivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan Ott
Trivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan Ott
Trivadis
 
On Rails with Apache Cassandra
On Rails with Apache CassandraOn Rails with Apache Cassandra
On Rails with Apache Cassandra
Stu Hood
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
Nathan Milford
 
Deep Dive into Cassandra
Deep Dive into CassandraDeep Dive into Cassandra
Deep Dive into Cassandra
Brent Theisen
 
Use Your MySQL Knowledge to Become an Instant Cassandra Guru
Use Your MySQL Knowledge to Become an Instant Cassandra GuruUse Your MySQL Knowledge to Become an Instant Cassandra Guru
Use Your MySQL Knowledge to Become an Instant Cassandra Guru
Tim Callaghan
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
Stu Hood
 
Appache Cassandra
Appache Cassandra  Appache Cassandra
Appache Cassandra
nehabsairam
 
Avoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdfAvoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdf
Cédrick Lunven
 
Performance Testing: Scylla vs. Cassandra vs. Datastax
Performance Testing: Scylla vs. Cassandra vs. DatastaxPerformance Testing: Scylla vs. Cassandra vs. Datastax
Performance Testing: Scylla vs. Cassandra vs. Datastax
ScyllaDB
 
Pythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra ClusterPythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra Cluster
DataStax Academy
 
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
DataStax
 
MySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion QueriesMySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion Queries
Bernd Ocklin
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Pollfish
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
DataStax
 
Micro-batching: High-performance writes
Micro-batching: High-performance writesMicro-batching: High-performance writes
Micro-batching: High-performance writes
Instaclustr
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
Matthias Niehoff
 
Ad

Recently uploaded (20)

How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Dele Amefo
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Download Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With LatestDownload Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With Latest
tahirabibi60507
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Dele Amefo
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Intro to cassandra

  • 1. An Introduction to Apache Cassandra Aaron Ploetz
  • 2. What is Cassandra? ● Cassandra is a non-relational, partitioned row store. ● Rows are organized into column families (tables) with a required primary key. ● Data is distributed across multiple master-less nodes in an application-transparent manner. ● DataStax oversees the development of the Apache Cassandra open-source project, provides support to companies using Cassandra, and provides an enterprise-ready version of Cassandra.
  • 3. $ whoami Aaron Ploetz @APloetz ● Lead Database Engineer ● B.S.-MCS UW-Whitewater ● M.S.-SED Regis University ● Using Cassandra since version 0.8 ● Contributor to the Cassandra tag on StackOverflow ● Contributor to the Apache Cassandra project ● 2014/15 DataStax MVP for Apache Cassandra
  • 4. (short) History of Cassandra and DataStax ● Developed at Facebook, open sourced in 2008. ● Design influenced by Google BigTable and Amazon Dynamo. ● Graduated to Apache “Top-Level Project” status in Feb 2010. ● DataStax founded by Jonathan Ellis and Matt Pfeil in late 2010, offering enterprise Cassandra support. ● Secured $190 million in VC funding. ● Started with eight people, now employs more than 350. ● 400+ customers, including 25 of the Fortune 100.
  • 5. Key Features ● Current release is Cassandra 2.1 (Sept 10). ● Distributed, decentralized storage; no SPOF. ● Scalable. ● High-availability, Fault-tolerance. ● Tunable Consistency. ● High-performance. ● Data center awareness.
  • 6. Distributed, Decentralized Storage DC1 DC2 ● Peer-to-peer, master-less replication. ● Any node can handle a read or write operation. ● Supports local read/write ops via “logical” data centers. ● Gossip protocol allows nodes to be aware of each other. ● Snitch ensures that data is replicated appropriately.
  • 7. Scalability ● Cassandra allows you to easily add nodes to scale your application back-end. ● Benchmark from 2011: – 48 node cluster could handle 174,373 writes/sec. – 288 node cluster could handle 1,099,837 writes/sec. – Indicates that Cassandra scales linearly. ● Throughput of N nodes = T. ● Throughput of Nx2 nodes = Tx2.
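The linear-scaling claim on the Scalability slide can be checked with a little arithmetic. The sketch below uses only the two benchmark figures quoted on the slide; the function and variable names are illustrative and do not come from the deck.

```python
# Benchmark figures quoted on the slide: cluster size -> writes/sec.
BENCHMARKS = {48: 174_373, 288: 1_099_837}


def per_node_throughput(nodes: int, writes_per_sec: int) -> float:
    """Average writes/sec handled by each node in the cluster."""
    return writes_per_sec / nodes


def scaling_ratios(benchmarks: dict[int, int]) -> tuple[float, float]:
    """Return (node growth factor, throughput growth factor) between
    the smallest and the largest cluster in the benchmark table."""
    small, large = min(benchmarks), max(benchmarks)
    return large / small, benchmarks[large] / benchmarks[small]


node_ratio, throughput_ratio = scaling_ratios(BENCHMARKS)

for nodes, writes in sorted(BENCHMARKS.items()):
    rate = per_node_throughput(nodes, writes)
    print(f"{nodes:>3} nodes: {rate:,.0f} writes/sec per node")
print(f"nodes grew {node_ratio:.1f}x, throughput grew {throughput_ratio:.2f}x")
```

With six times the nodes, throughput grows by roughly 6.3 times, so per-node throughput stays about constant, which is what "scales linearly" means here.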
  • 8. High Availability [diagram: DC1 and DC2, with X marks on nodes] ● Cassandra was designed under the premise that hardware failures can and do occur.
  • 9. High Availability [diagram: DC1 and DC2, with X marks on nodes] ● Cassandra was designed under the premise that hardware failures can and do occur.
  • 10. High Availability [diagram: DC1 and DC2, with X marks on nodes] ● Gossip protocol keeps live nodes informed of failures. ● Cassandra 2.0.2 implemented Rapid Read Protection, which redirects read operations to live nodes.
  • 11. Tunable Consistency ● Cassandra allows you to alter your consistency level on a per-operation basis. ● Also allows configuration for data center locality. ● Consistency levels range from ALL (strong consistency) through QUORUM to ONE (high availability / eventual consistency). ● Quorum == (replicas / 2) + 1, rounded down.
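The quorum formula on the Tunable Consistency slide can be worked through in a few lines. This is a minimal sketch that assumes the formula counts the replicas of a row (the replication factor) and uses integer division; the helper names are hypothetical, and only the three levels named on the slide are modeled.

```python
def quorum(replication_factor: int) -> int:
    """Quorum == (replicas / 2) + 1, using integer division."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond at a given consistency level."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def read_sees_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """A read overlaps the latest write when the replicas written plus the
    replicas read exceed the replication factor (R + W > RF)."""
    written = replicas_required(write_level, replication_factor)
    read = replicas_required(read_level, replication_factor)
    return written + read > replication_factor


for rf in (3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}")
print(read_sees_latest_write("QUORUM", "QUORUM", 3))  # overlapping replicas
print(read_sees_latest_write("ONE", "ONE", 3))        # eventual consistency
```

Writing and reading at QUORUM always overlap on at least one replica, while ONE/ONE trades that guarantee for availability and lower latency.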
  • 12. Eventual Consistency != Hopeful Consistency ● Netflix experiment on consistency: – Created two data centers with a C* 1.1.7 cluster of 48 nodes in each data center. – Wrote 1,000,000 records at CL1 in one data center. – Read the same 1,000,000 records at CL1 in the other data center. – All records read successfully! – “Eventually consistent does not mean a day, minute or even a second from now… in most cases, it is milliseconds!” - Christos Kalantzis
  • 13. High Performance ● Cassandra is optimized from the ground up for performance: Source: DataStax.com
  • 14. High Performance ● All disk writes are sequential, append-only operations. ● No reading before writing. ● Cassandra is optimized for threading with multi-core/ processor machines.
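The two write-path points on the High Performance slide (append-only writes, no read before write) can be illustrated with a toy model. This is only a sketch of the idea, not Cassandra's actual storage engine: the class name, the in-memory list standing in for the on-disk log, and the sequence counter standing in for timestamps are all assumptions.

```python
import itertools


class AppendOnlyStore:
    """Toy log-structured store: every write is an append, never an update."""

    def __init__(self):
        self._log = []                    # entries of (sequence, key, value)
        self._clock = itertools.count()   # stand-in for write timestamps

    def write(self, key, value):
        # No lookup of the existing value: the new version is appended.
        self._log.append((next(self._clock), key, value))

    def read(self, key):
        # The newest entry for a key wins, so scan from the end of the log.
        for _, logged_key, value in reversed(self._log):
            if logged_key == key:
                return value
        return None

    def __len__(self):
        return len(self._log)


store = AppendOnlyStore()
store.write("Mal", "111-555-1234")
store.write("Mal", "111-555-0000")   # an "update" is just another append
print(store.read("Mal"), len(store))
```

The cost of resolving versions moves to the read path, which is one reason the delete-heavy patterns listed among the drawbacks are a poor fit.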
  • 15. Potential Drawbacks? ● Some use cases are not appropriate (transient data or delete-heavy patterns). ● Developer learning curve: CQL != SQL. ● Simple queries only. No JOINs or sub-queries. ● Optimal performance is achieved through de-normalization and query-based data modeling.
  • 16. Cassandra moves beyond disco-era data modeling ● Everything MUST be normalized!!! ● Redundant data == “bad” ● Relational database theory originated when disk space was expensive. In 1975 some vendors were selling disk space at $11k per MB. ● By 1980 prices “dropped” so that you could finally buy 1GB of storage for under $1 million. ● Today I can buy a 1TB disk for $60.
  • 17. Cassandra Storage Structures
      ● Keyspace == Database (in the RDBMS world)
          CREATE KEYSPACE products WITH replication = {
            'class': 'NetworkTopologyStrategy', 'RFD': '2', 'MKE': '4'};
      ● Column Family == Table
          CREATE TABLE hierarchy (
            category text,
            subcategory text,
            classification text,
            skumap map<uuid, text>,
            PRIMARY KEY (category, subcategory, classification));
  • 18. Cassandra Primary Keys ● Primary Keys are unique. ● Single Primary Key: PRIMARY KEY (keyColumn) ● Composite Primary Key: PRIMARY KEY (myPartitionKey, my1stClusteringKey, my2ndClusteringKey) ● Composite Partitioning Key: PRIMARY KEY ((my1stPartitionKey, my2ndPartitionKey), myClusteringKey)
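The three primary key shapes on the Primary Keys slide differ only in how the first element is read: it is always the partition key, and a parenthesized first element is a composite partition key. The sketch below models a key definition as a Python tuple (a nested tuple standing in for the inner parentheses); the function name is hypothetical and this is not a CQL parser.

```python
def split_primary_key(primary_key):
    """Split a primary key definition into (partition keys, clustering keys).

    The first element is the partition key; if it is itself a tuple or list,
    it is a composite partition key. Everything after it is a clustering key.
    """
    if not primary_key:
        raise ValueError("a primary key needs at least one column")
    first, *rest = primary_key
    partition = list(first) if isinstance(first, (tuple, list)) else [first]
    return partition, list(rest)


examples = {
    "single": ("keyColumn",),
    "composite": ("myPartitionKey", "my1stClusteringKey", "my2ndClusteringKey"),
    "composite partitioning": (("my1stPartitionKey", "my2ndPartitionKey"),
                               "myClusteringKey"),
}
for label, definition in examples.items():
    partition, clustering = split_primary_key(definition)
    print(f"{label}: partition={partition} clustering={clustering}")
```

The partition key decides which nodes hold the row, while clustering keys order rows within that partition.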
  • 19. Cassandra Secondary Indexes ● Does allow secondary indexes. CREATE INDEX myIndex ON myTable(myNonKeyColumn) ● Designed for query convenience, not for performance. ● Does not perform well on high-cardinality columns, because you filter a huge volume of records for a small number of results. Extremely low cardinality is also not a good idea (ex: customer address [state == good, phone == bad, gender == bad]). ● Works best on a table having many rows that contain the indexed value; middle-of-the-road cardinality.
  • 20. Serenity “crew”
      ● Create a table to store data for the crew of “Serenity” from “Firefly.”
          CREATE TABLE crew (
            crewname TEXT,
            firstname TEXT,
            lastname TEXT,
            phone TEXT,
            PRIMARY KEY (crewname));

           crewname | firstname | lastname | phone
          ----------+-----------+----------+--------------
           Mal      | Malcolm   | Reynolds | 111-555-1234
           Jayne    | Jayne     | Cobb     | 111-555-3464
           Sheppard | Derial    | Book     | 111-555-2349
           Simon    | Simon     | Tam      | 111-555-8899
  • 21. Serenity “crew” under the hood
          RowKey:Mal
          => (column=, value=, timestamp=1374546754299000)
          => (column=firstname, value=Malcolm, timestamp=1374546754299000)
          => (column=lastname, value=Reynolds, timestamp=1374546754299000)
          => (column=phone, value=111-555-1234, timestamp=1374546754299000)
          RowKey:Jayne
          => (column=, value=, timestamp=1374546757815000)
          => (column=firstname, value=Jayne, timestamp=1374546757815000)
          => (column=lastname, value=Cobb, timestamp=1374546757815000)
          => (column=phone, value=111-555-3464, timestamp=1374546757815000)
  • 22. Serenity “crewbyphone”
      ● To solve the problem of being able to query crew members by phone:
          CREATE TABLE crewbyphone (
            crewname TEXT,
            firstname TEXT,
            lastname TEXT,
            phone TEXT,
            PRIMARY KEY (phone, crewname));

           crewname | firstname | lastname  | phone
          ----------+-----------+-----------+--------------
           Mal      | Malcolm   | Reynolds  | 111-555-1234
           Wash     | Hoban     | Washburne | 111-555-1212
           Zoey     | Zoey      | Washburne | 111-555-1212
           Jayne    | Jayne     | Cobb      | 111-555-3464
  • 23. Serenity “crewbyphone” under the hood
          RowKey:111-555-1234
          => (column=Mal, value=, timestamp=1374546754299000)
          => (column=Mal:firstname, value=Malcolm, timestamp=...
          => (column=Mal:lastname, value=Reynolds, timestamp=...
          RowKey:111-555-1212
          => (column=Wash, value=, timestamp=1374546754299000)
          => (column=Wash:firstname, value=Hoban, timestamp=...
          => (column=Wash:lastname, value=Washburne, timestamp=...
          => (column=Zoey, value=, timestamp=1374546754299000)
          => (column=Zoey:firstname, value=Zoey, timestamp=...
          => (column=Zoey:lastname, value=Washburne, timestamp=...
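The "under the hood" layout for crewbyphone can be modeled as a dictionary of partitions, each holding its rows ordered by clustering key. This is a rough sketch of the storage idea using the sample rows from the crewbyphone slide; the function names and the nested-dict representation are assumptions for illustration only.

```python
from collections import defaultdict

# Sample rows from the crewbyphone slide: (crewname, firstname, lastname, phone).
ROWS = [
    ("Mal", "Malcolm", "Reynolds", "111-555-1234"),
    ("Wash", "Hoban", "Washburne", "111-555-1212"),
    ("Zoey", "Zoey", "Washburne", "111-555-1212"),
    ("Jayne", "Jayne", "Cobb", "111-555-3464"),
]


def build_partitions(rows):
    """Group rows by partition key (phone), keyed inside by clustering key."""
    partitions = defaultdict(dict)
    for crewname, firstname, lastname, phone in rows:
        partitions[phone][crewname] = {
            "firstname": firstname,
            "lastname": lastname,
        }
    return partitions


def crew_by_phone(partitions, phone):
    """Return the rows of one partition, ordered by clustering key."""
    cells = partitions.get(phone, {})
    return [(name, cells[name]) for name in sorted(cells)]


partitions = build_partitions(ROWS)
for name, columns in crew_by_phone(partitions, "111-555-1212"):
    print(name, columns["firstname"], columns["lastname"])
```

Looking up one phone number touches a single partition, and crew members who share a number (Wash and Zoey) sit side by side in it, which is why the query-first table design makes this lookup cheap.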
  • 25. Who else Uses Cassandra?
  • 26. Cassandra Large Deployments ● 100+ nodes, 250TB of data, cluster sizes vary from 6 to 32 nodes. ● 2,500+ nodes, 420TB of data, 4 DCs, handles 1 trillion operations per day. ● 75,000+ nodes, 10s of PB of data, largest cluster 1000+ nodes.
  • 27. Additional Reading ● Amazon Dynamo paper ● Facebook Cassandra paper ● Harvest, Yield, and Scalable, Tolerant Systems - Brewer, Fox, 1999 ● DataStax grabs $106M to achieve big-dog status in database country ● http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html ● http://planetcassandra.org/blog/a-netflix-experiment-eventual-consistency-hopeful-consistency- ● DataStax Documentation ● KillrVideo.com
  • 28. Getting Started ● Community site: http://planetcassandra.org ● http://datastax.com ● DataStax community edition: http://planetcassandra.org/cassandra ● DataStax startup program: http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/ ● Apache Cassandra project site: http://cassandra.apache.org/
  • 30. Demo
  • 31. Want to work at AccuLynx? We're hiring! http://careers.stackoverflow.com/company/acculynx