SlideShare a Scribd company logo
introduction to cassandra
         eben hewitt


      september 29. 2010
         web 2.0 expo
         new york city
@ebenhewitt
• director, application architecture
 at a global corp


• focus on SOA, SaaS, Events


• i wrote this
agenda
•   context
•   features
•   data model
•   api
“nosql”  “big data”
•   mongodb
•   couchdb
•   tokyo cabinet
•   redis
•   riak
•   what about?
    – Poet, Lotus, Xindice
    – they’ve been around forever…
    – rdbms was once the new kid…
innovation at scale
• google bigtable (2006)
  – consistency model: strong
  – data model: sparse map
  – clones: hbase, hypertable
• amazon dynamo (2007)
  – O(1) dht
  – consistency model: client tune-able
  – clones: riak, voldemort



  cassandra ~= bigtable + dynamo
proven
• The Facebook stores 150TB of data on 150 nodes



                 web 2.0
• used at Twitter, Rackspace, Mahalo, Reddit,
  Cloudkick, Cisco, Digg, SimpleGeo, Ooyala, OpenX,
  others
cap theorem

• consistency
  – all clients have same view of data

• availability
  – writeable in the face of node failure

• partition tolerance
  – processing can continue in the face of network failure
    (crashed router, broken network)
daniel abadi: pacelc
write consistency
Level                        Description
ZERO                         Good luck with that
ANY                          1 replica (hints count)
ONE                          1 replica. read repair in bkgnd
QUORUM (DCQ for RackAware)   (N /2) + 1
ALL                          N = replication factor



                   read consistency
Level                        Description
ZERO                         Ummm…
ANY                          Try ONE instead
ONE                          1 replica
QUORUM (DCQ for RackAware)   Return most recent TS after (N /2) + 1
                             report
ALL                          N = replication factor
agenda
•   context
•   features
•   data model
•   api
cassandra properties
•   tuneably consistent
•   very fast writes
•   highly available
•   fault tolerant
•   linear, elastic scalability
•   decentralized/symmetric
•   ~12 client languages
    – Thrift RPC API
• ~automatic provisioning of new nodes
• 0(1) dht
• big data
write op
Staged Event-Driven Architecture
• A general-purpose framework for high
  concurrency & load conditioning
• Decomposes applications into stages
  separated by queues
• Adopt a structured approach to event-driven
  concurrency
instrumentation
data replication
partitioner smack-down
Random Preserving                Order Preserving
• system will use MD5(key) to    • key distribution determined
  distribute data across nodes     by token
• even distribution of keys      • lexicographical ordering
  from one CF across             • required for range queries
  ranges/nodes                      – scan over rows like cursor in
                                      index
                                 • can specify the token for
                                   this node to use
                                 • ‘scrabble’ distribution
agenda
•   context
•   features
•   data model
•   api
structure
keyspace
• ~= database
• typically one per application
• some settings are configurable only per
  keyspace
column family
• group records of similar kind
• not same kind, because CFs are sparse tables
• ex:
  – User
  – Address
  – Tweet
  – PointOfInterest
  – HotelRoom
think of cassandra as

 row-oriented
• each row is uniquely identifiable by key
• rows group columns and super columns
column family

key                 nickname=
      user=eben         The
123                  Situation


key                    icon=     n=
      user=alison
456                              42
json-like notation
User {
  123 : { email: alison@foo.com,
          icon:          },

    456 : { email: eben@bar.com,
            location: The Danger Zone}
}
0.6 example
$cassandra –f
$bin/cassandra-cli
cassandra> connect localhost/9160


cassandra> set Keyspace1.Standard1[‘eben’]
  [‘age’]=‘29’
cassandra> set Keyspace1.Standard1[‘eben’]
  [‘email’]=‘e@e.com’
cassandra> get Keyspace1.Standard1[‘eben'][‘age']
=> (column=6e616d65, value=39,
  timestamp=1282170655390000)
a column has 3 parts
1. name
  –   byte[]
  –   determines sort order
  –   used in queries
  –   indexed
1. value
  –   byte[]
  –   you don’t query on column values
1. timestamp
  –   long (clock)
  –   last write wins conflict resolution
column comparators
•   byte
•   utf8
•   long
•   timeuuid
•   lexicaluuid
•   <pluggable>
    – ex: lat/long
super column



super columns group columns under a common name
super column family
             <<SCF>>PointOfInterest

           <<SC>>Central          <<SC>>
               Park           Empire State Bldg
10017       desc=Fun to                  desc=Great
                           phone=212.
              walk in.                    view from
                            555.11212
                                         102nd floor!



            <<SC>>
85255       Phoenix
              Zoo
super column family
                     super column family

PointOfInterest {
  key: 85255 {                                           column
      Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
      Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
  }, //end phx
                  key
                        super column
    key: 10019 {                                              flexible schema
       Central Park { desc: Walk around. It's pretty.} ,
                                                   s
        Empire State Building { phone: 212-777-7777,
                                desc: Great view from 102nd floor. }
    } //end nyc
}
about super column families
• sub-column names in a SCF are not indexed
  – top level columns (SCF Name) are always indexed
• often used for denormalizing data from
  standard CFs
agenda
•   context
•   features
•   data model
•   api
slice predicate
• data structure describing columns to return
  – SliceRange
     •   start column name
     •   finish column name (can be empty to stop on count)
     •   reverse
     •   count (like LIMIT)
• get() : Column
    – get the Col or SC at given ColPath
                                                                 read api
      COSC cosc = client.get(key, path, CL);

• get_slice() : List<ColumnOrSuperColumn>
    – get Cols in one row, specified by SlicePredicate:
      List<ColumnOrSuperColumn> results =
        client.get_slice(key, parent, predicate, CL);

• multiget_slice() : Map<key, List<CoSC>>
    – get slices for list of keys, based on SlicePredicate
     Map<byte[],List<ColumnOrSuperColumn>> results =
        client.multiget_slice(rowKeys, parent, predicate, CL);

• get_range_slices() : List<KeySlice>
    – returns multiple Cols according to a range
    – range is startkey, endkey, starttoken, endtoken:
      List<KeySlice> slices = client.get_range_slices(
         parent, predicate, keyRange, CL);
client.insert(userKeyBytes, parent,     write        api
    new Column(“band".getBytes(UTF8),
    “Funkadelic".getBytes(), clock), CL);

batch_mutate
  – void batch_mutate(
    map<byte[], map<String, List<Mutation>>> , CL)
remove
  – void remove(byte[],
    ColumnPath column_path, Clock, CL)
//create param
                                                           batch_mutate
Map<byte[], Map<String, List<Mutation>>> mutationMap =
   new HashMap<byte[], Map<String, List<Mutation>>>();

//create Cols for Muts
Column nameCol = new Column("name".getBytes(UTF8),
“Funkadelic”.getBytes("UTF-8"), new Clock(System.nanoTime()););
Mutation nameMut = new Mutation();
nameMut.column_or_supercolumn = nameCosc; //also phone, etc

Map<String, List<Mutation>> muts = new HashMap<String, List<Mutation>>();
List<Mutation> cols = new ArrayList<Mutation>();
cols.add(nameMut);
cols.add(phoneMut);
muts.put(CF, cols);
//outer map key is a row key; inner map key is the CF name
mutationMap.put(rowKey.getBytes(), muts);
//send to server
client.batch_mutate(mutationMap, CL);
raw thrift: for masochists only

•   pycassa (python)
•   fauna (ruby)
•   hector (java)
•   pelops (java)
•   kundera (JPA)
•   hectorSharp (C#)
?
     what about…

    SELECT WHERE
        ORDER BY
         JOIN ON
           GROUP
rdbms: domain-based model
             what answers do I have?



cassandra: query-based model
             what questions do I have?
SELECT WHERE
               cassandra is an index factory
<<cf>>USER
Key: UserID
Cols: username, email, birth date, city, state

How to support this query?

SELECT * FROM User WHERE city = ‘Scottsdale’

Create a new CF called UserCity:

<<cf>>USERCITY
Key: city
Cols: IDs of the users in that city.
Also uses the Valueless Column pattern
SELECT WHERE pt 2
• Use an aggregate key
  state:city: { user1, user2}

• Get rows between AZ: & AZ;
  for all Arizona users

• Get rows between AZ:Scottsdale &
  AZ:Scottsdale1
  for all Scottsdale users
ORDER BY

Columns                   Rows
are sorted according to   are placed according to their Partitioner:
CompareWith or
CompareSubcolumnsWith     •Random: MD5 of key
                          •Order-Preserving: actual key


                          are sorted by key, regardless of partitioner
Scaling web applications with cassandra presentation
Scaling web applications with cassandra presentation
is cassandra a good fit?
• you need really fast writes   • your programmers can deal
• you need durability              –   documentation
• you have lots of data            –   complexity
                                   –   consistency model
   > GBs
                                   –   change
   >= three servers
                                   –   visibility tools
• your app is evolving
                                • your operations can deal
   – startup mode, fluid data
                                   – hardware considerations
     structure
                                   – can move data
• loose domain data
                                   – JMX monitoring
   – “points of interest”
thank you!
@ebenhewitt
Ad

More Related Content

What's hot (20)

Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Apache Cassandra Lesson: Data Modelling and CQL3
Apache Cassandra Lesson: Data Modelling and CQL3Apache Cassandra Lesson: Data Modelling and CQL3
Apache Cassandra Lesson: Data Modelling and CQL3
Markus Klems
 
Introduction to Apache Cassandra
Introduction to Apache CassandraIntroduction to Apache Cassandra
Introduction to Apache Cassandra
Luke Tillman
 
DB2 Native XML
DB2 Native XMLDB2 Native XML
DB2 Native XML
Amol Pujari
 
Sqlxml vs xquery
Sqlxml vs xquerySqlxml vs xquery
Sqlxml vs xquery
Amol Pujari
 
Terraform introduction
Terraform introductionTerraform introduction
Terraform introduction
Jason Vance
 
Apache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and PerformanceApache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and Performance
aaronmorton
 
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Nitin S
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
Michael Renner
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
Alejandro Fernandez
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Terraform modules restructured
Terraform modules restructuredTerraform modules restructured
Terraform modules restructured
Ami Mahloof
 
memcached Distributed Cache
memcached Distributed Cachememcached Distributed Cache
memcached Distributed Cache
Aniruddha Chakrabarti
 
SQL Server SQL Server
SQL Server SQL ServerSQL Server SQL Server
SQL Server SQL Server
webhostingguy
 
Introduce to Terraform
Introduce to TerraformIntroduce to Terraform
Introduce to Terraform
Samsung Electronics
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4
thelabdude
 
Dynomite Nosql
Dynomite NosqlDynomite Nosql
Dynomite Nosql
elliando dias
 
Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11
なおき きしだ
 
Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11
なおき きしだ
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
Shalin Shekhar Mangar
 
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Apache Cassandra Lesson: Data Modelling and CQL3
Apache Cassandra Lesson: Data Modelling and CQL3Apache Cassandra Lesson: Data Modelling and CQL3
Apache Cassandra Lesson: Data Modelling and CQL3
Markus Klems
 
Introduction to Apache Cassandra
Introduction to Apache CassandraIntroduction to Apache Cassandra
Introduction to Apache Cassandra
Luke Tillman
 
Sqlxml vs xquery
Sqlxml vs xquerySqlxml vs xquery
Sqlxml vs xquery
Amol Pujari
 
Terraform introduction
Terraform introductionTerraform introduction
Terraform introduction
Jason Vance
 
Apache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and PerformanceApache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and Performance
aaronmorton
 
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Nitin S
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
Michael Renner
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
Alejandro Fernandez
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Terraform modules restructured
Terraform modules restructuredTerraform modules restructured
Terraform modules restructured
Ami Mahloof
 
SQL Server SQL Server
SQL Server SQL ServerSQL Server SQL Server
SQL Server SQL Server
webhostingguy
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4
thelabdude
 
Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11
なおき きしだ
 
Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11
なおき きしだ
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
Shalin Shekhar Mangar
 

Viewers also liked (6)

Automated and Adaptive Infrastructure Monitoring using Chef, Nagios and Graphite
Automated and Adaptive Infrastructure Monitoring using Chef, Nagios and GraphiteAutomated and Adaptive Infrastructure Monitoring using Chef, Nagios and Graphite
Automated and Adaptive Infrastructure Monitoring using Chef, Nagios and Graphite
Pratima Singh
 
6 Dean Google
6 Dean Google6 Dean Google
6 Dean Google
Frank Cai
 
Pregel: A System for Large-Scale Graph Processing
Pregel: A System for Large-Scale Graph ProcessingPregel: A System for Large-Scale Graph Processing
Pregel: A System for Large-Scale Graph Processing
Chris Bunch
 
Writing and testing high frequency trading engines in java
Writing and testing high frequency trading engines in javaWriting and testing high frequency trading engines in java
Writing and testing high frequency trading engines in java
Peter Lawrey
 
Understanding Data Consistency in Apache Cassandra
Understanding Data Consistency in Apache CassandraUnderstanding Data Consistency in Apache Cassandra
Understanding Data Consistency in Apache Cassandra
DataStax
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
Venu Anuganti
 
Automated and Adaptive Infrastructure Monitoring using Chef, Nagios and Graphite
Automated and Adaptive Infrastructure Monitoring using Chef, Nagios and GraphiteAutomated and Adaptive Infrastructure Monitoring using Chef, Nagios and Graphite
Automated and Adaptive Infrastructure Monitoring using Chef, Nagios and Graphite
Pratima Singh
 
6 Dean Google
6 Dean Google6 Dean Google
6 Dean Google
Frank Cai
 
Pregel: A System for Large-Scale Graph Processing
Pregel: A System for Large-Scale Graph ProcessingPregel: A System for Large-Scale Graph Processing
Pregel: A System for Large-Scale Graph Processing
Chris Bunch
 
Writing and testing high frequency trading engines in java
Writing and testing high frequency trading engines in javaWriting and testing high frequency trading engines in java
Writing and testing high frequency trading engines in java
Peter Lawrey
 
Understanding Data Consistency in Apache Cassandra
Understanding Data Consistency in Apache CassandraUnderstanding Data Consistency in Apache Cassandra
Understanding Data Consistency in Apache Cassandra
DataStax
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
Venu Anuganti
 
Ad

Similar to Scaling web applications with cassandra presentation (20)

Scaling Web Applications with Cassandra Presentation (1).ppt
Scaling Web Applications with Cassandra Presentation (1).pptScaling Web Applications with Cassandra Presentation (1).ppt
Scaling Web Applications with Cassandra Presentation (1).ppt
veronica380506
 
Scaling Web Applications with Cassandra Presentation.ppt
Scaling Web Applications with Cassandra Presentation.pptScaling Web Applications with Cassandra Presentation.ppt
Scaling Web Applications with Cassandra Presentation.ppt
ssuserbad56d
 
Cassandra Overview
Cassandra OverviewCassandra Overview
Cassandra Overview
Sergey Titov, Ph.D.
 
Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011
Boris Yen
 
Talk About Apache Cassandra
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache Cassandra
Jacky Chu
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
Christian Johannsen
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka Meetup
Cliff Gilmore
 
Cassandra and Rails at LA NoSQL Meetup
Cassandra and Rails at LA NoSQL MeetupCassandra and Rails at LA NoSQL Meetup
Cassandra and Rails at LA NoSQL Meetup
Michael Wynholds
 
Cassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A ComparisonCassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A Comparison
shsedghi
 
Deep Dive into Cassandra
Deep Dive into CassandraDeep Dive into Cassandra
Deep Dive into Cassandra
Brent Theisen
 
NoSql Database
NoSql DatabaseNoSql Database
NoSql Database
Suresh Parmar
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
Evan Chan
 
Cassandra
CassandraCassandra
Cassandra
exsuns
 
Cassandra integrations
Cassandra integrationsCassandra integrations
Cassandra integrations
T Jake Luciani
 
Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Boris Yen
 
Göteborg Distributed: Eventual Consistency in Apache Cassandra
Göteborg Distributed: Eventual Consistency in Apache CassandraGöteborg Distributed: Eventual Consistency in Apache Cassandra
Göteborg Distributed: Eventual Consistency in Apache Cassandra
Jeremy Hanna
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
Patrick McFadin
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
Eric Evans
 
Ben Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectBen Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra Project
Morningstar Tech Talks
 
NOSQL and Cassandra
NOSQL and CassandraNOSQL and Cassandra
NOSQL and Cassandra
rantav
 
Scaling Web Applications with Cassandra Presentation (1).ppt
Scaling Web Applications with Cassandra Presentation (1).pptScaling Web Applications with Cassandra Presentation (1).ppt
Scaling Web Applications with Cassandra Presentation (1).ppt
veronica380506
 
Scaling Web Applications with Cassandra Presentation.ppt
Scaling Web Applications with Cassandra Presentation.pptScaling Web Applications with Cassandra Presentation.ppt
Scaling Web Applications with Cassandra Presentation.ppt
ssuserbad56d
 
Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011
Boris Yen
 
Talk About Apache Cassandra
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache Cassandra
Jacky Chu
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
Christian Johannsen
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka Meetup
Cliff Gilmore
 
Cassandra and Rails at LA NoSQL Meetup
Cassandra and Rails at LA NoSQL MeetupCassandra and Rails at LA NoSQL Meetup
Cassandra and Rails at LA NoSQL Meetup
Michael Wynholds
 
Cassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A ComparisonCassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A Comparison
shsedghi
 
Deep Dive into Cassandra
Deep Dive into CassandraDeep Dive into Cassandra
Deep Dive into Cassandra
Brent Theisen
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
Evan Chan
 
Cassandra
CassandraCassandra
Cassandra
exsuns
 
Cassandra integrations
Cassandra integrationsCassandra integrations
Cassandra integrations
T Jake Luciani
 
Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Boris Yen
 
Göteborg Distributed: Eventual Consistency in Apache Cassandra
Göteborg Distributed: Eventual Consistency in Apache CassandraGöteborg Distributed: Eventual Consistency in Apache Cassandra
Göteborg Distributed: Eventual Consistency in Apache Cassandra
Jeremy Hanna
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
Patrick McFadin
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
Eric Evans
 
Ben Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectBen Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra Project
Morningstar Tech Talks
 
NOSQL and Cassandra
NOSQL and CassandraNOSQL and Cassandra
NOSQL and Cassandra
rantav
 
Ad

More from Murat Çakal (8)

REST vs. SOAP
REST vs. SOAPREST vs. SOAP
REST vs. SOAP
Murat Çakal
 
Mongodb open source_high_performance_database
Mongodb open source_high_performance_databaseMongodb open source_high_performance_database
Mongodb open source_high_performance_database
Murat Çakal
 
Building web applications with mongo db presentation
Building web applications with mongo db presentationBuilding web applications with mongo db presentation
Building web applications with mongo db presentation
Murat Çakal
 
Wmware NoSQL
Wmware NoSQLWmware NoSQL
Wmware NoSQL
Murat Çakal
 
Trouble with nosql_dbs
Trouble with nosql_dbsTrouble with nosql_dbs
Trouble with nosql_dbs
Murat Çakal
 
NoSql databases
NoSql databasesNoSql databases
NoSql databases
Murat Çakal
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
Murat Çakal
 
No sql
No sqlNo sql
No sql
Murat Çakal
 
Mongodb open source_high_performance_database
Mongodb open source_high_performance_databaseMongodb open source_high_performance_database
Mongodb open source_high_performance_database
Murat Çakal
 
Building web applications with mongo db presentation
Building web applications with mongo db presentationBuilding web applications with mongo db presentation
Building web applications with mongo db presentation
Murat Çakal
 
Trouble with nosql_dbs
Trouble with nosql_dbsTrouble with nosql_dbs
Trouble with nosql_dbs
Murat Çakal
 

Recently uploaded (20)

Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Asthma presentación en inglés abril 2025 pdf
Asthma presentación en inglés abril 2025 pdfAsthma presentación en inglés abril 2025 pdf
Asthma presentación en inglés abril 2025 pdf
VanessaRaudez
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from AnywhereAutomation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Lynda Kane
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Hands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordDataHands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordData
Lynda Kane
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Leading AI Innovation As A Product Manager - Michael Jidael
Leading AI Innovation As A Product Manager - Michael JidaelLeading AI Innovation As A Product Manager - Michael Jidael
Leading AI Innovation As A Product Manager - Michael Jidael
Michael Jidael
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Asthma presentación en inglés abril 2025 pdf
Asthma presentación en inglés abril 2025 pdfAsthma presentación en inglés abril 2025 pdf
Asthma presentación en inglés abril 2025 pdf
VanessaRaudez
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from AnywhereAutomation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Lynda Kane
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Hands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordDataHands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordData
Lynda Kane
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Leading AI Innovation As A Product Manager - Michael Jidael
Leading AI Innovation As A Product Manager - Michael JidaelLeading AI Innovation As A Product Manager - Michael Jidael
Leading AI Innovation As A Product Manager - Michael Jidael
Michael Jidael
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 

Scaling web applications with cassandra presentation

  • 1. introduction to cassandra eben hewitt september 29. 2010 web 2.0 expo new york city
  • 2. @ebenhewitt • director, application architecture at a global corp • focus on SOA, SaaS, Events • i wrote this
  • 3. agenda • context • features • data model • api
  • 4. “nosql”  “big data” • mongodb • couchdb • tokyo cabinet • redis • riak • what about? – Poet, Lotus, Xindice – they’ve been around forever… – rdbms was once the new kid…
  • 5. innovation at scale • google bigtable (2006) – consistency model: strong – data model: sparse map – clones: hbase, hypertable • amazon dynamo (2007) – O(1) dht – consistency model: client tune-able – clones: riak, voldemort cassandra ~= bigtable + dynamo
  • 6. proven • The Facebook stores 150TB of data on 150 nodes web 2.0 • used at Twitter, Rackspace, Mahalo, Reddit, Cloudkick, Cisco, Digg, SimpleGeo, Ooyala, OpenX, others
  • 7. cap theorem • consistency – all clients have same view of data • availability – writeable in the face of node failure • partition tolerance – processing can continue in the face of network failure (crashed router, broken network)
  • 9. write consistency Level Description ZERO Good luck with that ANY 1 replica (hints count) ONE 1 replica. read repair in bkgnd QUORUM (DCQ for RackAware) (N /2) + 1 ALL N = replication factor read consistency Level Description ZERO Ummm… ANY Try ONE instead ONE 1 replica QUORUM (DCQ for RackAware) Return most recent TS after (N /2) + 1 report ALL N = replication factor
  • 10. agenda • context • features • data model • api
  • 11. cassandra properties • tuneably consistent • very fast writes • highly available • fault tolerant • linear, elastic scalability • decentralized/symmetric • ~12 client languages – Thrift RPC API • ~automatic provisioning of new nodes • 0(1) dht • big data
  • 13. Staged Event-Driven Architecture • A general-purpose framework for high concurrency & load conditioning • Decomposes applications into stages separated by queues • Adopt a structured approach to event-driven concurrency
  • 16. partitioner smack-down Random Preserving Order Preserving • system will use MD5(key) to • key distribution determined distribute data across nodes by token • even distribution of keys • lexicographical ordering from one CF across • required for range queries ranges/nodes – scan over rows like cursor in index • can specify the token for this node to use • ‘scrabble’ distribution
  • 17. agenda • context • features • data model • api
  • 19. keyspace • ~= database • typically one per application • some settings are configurable only per keyspace
  • 20. column family • group records of similar kind • not same kind, because CFs are sparse tables • ex: – User – Address – Tweet – PointOfInterest – HotelRoom
  • 21. think of cassandra as row-oriented • each row is uniquely identifiable by key • rows group columns and super columns
  • 22. column family key nickname= user=eben The 123 Situation key icon= n= user=alison 456 42
  • 23. json-like notation User { 123 : { email: [email protected], icon: }, 456 : { email: [email protected], location: The Danger Zone} }
  • 24. 0.6 example $cassandra –f $bin/cassandra-cli cassandra> connect localhost/9160 cassandra> set Keyspace1.Standard1[‘eben’] [‘age’]=‘29’ cassandra> set Keyspace1.Standard1[‘eben’] [‘email’]=‘[email protected]’ cassandra> get Keyspace1.Standard1[‘eben'][‘age'] => (column=6e616d65, value=39, timestamp=1282170655390000)
  • 25. a column has 3 parts 1. name – byte[] – determines sort order – used in queries – indexed 1. value – byte[] – you don’t query on column values 1. timestamp – long (clock) – last write wins conflict resolution
  • 26. column comparators • byte • utf8 • long • timeuuid • lexicaluuid • <pluggable> – ex: lat/long
  • 27. super column super columns group columns under a common name
  • 28. super column family <<SCF>>PointOfInterest <<SC>>Central <<SC>> Park Empire State Bldg 10017 desc=Fun to desc=Great phone=212. walk in. view from 555.11212 102nd floor! <<SC>> 85255 Phoenix Zoo
  • 29. super column family super column family PointOfInterest { key: 85255 { column Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. }, Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. }, }, //end phx key super column key: 10019 { flexible schema Central Park { desc: Walk around. It's pretty.} , s Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. } } //end nyc }
  • 30. about super column families • sub-column names in a SCF are not indexed – top level columns (SCF Name) are always indexed • often used for denormalizing data from standard CFs
  • 31. agenda • context • features • data model • api
  • 32. slice predicate • data structure describing columns to return – SliceRange • start column name • finish column name (can be empty to stop on count) • reverse • count (like LIMIT)
  • 33. • get() : Column – get the Col or SC at given ColPath read api COSC cosc = client.get(key, path, CL); • get_slice() : List<ColumnOrSuperColumn> – get Cols in one row, specified by SlicePredicate: List<ColumnOrSuperColumn> results = client.get_slice(key, parent, predicate, CL); • multiget_slice() : Map<key, List<CoSC>> – get slices for list of keys, based on SlicePredicate Map<byte[],List<ColumnOrSuperColumn>> results = client.multiget_slice(rowKeys, parent, predicate, CL); • get_range_slices() : List<KeySlice> – returns multiple Cols according to a range – range is startkey, endkey, starttoken, endtoken: List<KeySlice> slices = client.get_range_slices( parent, predicate, keyRange, CL);
  • 34. client.insert(userKeyBytes, parent, write api new Column(“band".getBytes(UTF8), “Funkadelic".getBytes(), clock), CL); batch_mutate – void batch_mutate( map<byte[], map<String, List<Mutation>>> , CL) remove – void remove(byte[], ColumnPath column_path, Clock, CL)
  • 35. //create param batch_mutate Map<byte[], Map<String, List<Mutation>>> mutationMap = new HashMap<byte[], Map<String, List<Mutation>>>(); //create Cols for Muts Column nameCol = new Column("name".getBytes(UTF8), “Funkadelic”.getBytes("UTF-8"), new Clock(System.nanoTime());); Mutation nameMut = new Mutation(); nameMut.column_or_supercolumn = nameCosc; //also phone, etc Map<String, List<Mutation>> muts = new HashMap<String, List<Mutation>>(); List<Mutation> cols = new ArrayList<Mutation>(); cols.add(nameMut); cols.add(phoneMut); muts.put(CF, cols); //outer map key is a row key; inner map key is the CF name mutationMap.put(rowKey.getBytes(), muts); //send to server client.batch_mutate(mutationMap, CL);
  • 36. raw thrift: for masochists only • pycassa (python) • fauna (ruby) • hector (java) • pelops (java) • kundera (JPA) • hectorSharp (C#)
  • 37. ? what about… SELECT WHERE ORDER BY JOIN ON GROUP
  • 38. rdbms: domain-based model what answers do I have? cassandra: query-based model what questions do I have?
  • 39. SELECT WHERE cassandra is an index factory <<cf>>USER Key: UserID Cols: username, email, birth date, city, state How to support this query? SELECT * FROM User WHERE city = ‘Scottsdale’ Create a new CF called UserCity: <<cf>>USERCITY Key: city Cols: IDs of the users in that city. Also uses the Valueless Column pattern
  • 40. SELECT WHERE pt 2 • Use an aggregate key state:city: { user1, user2} • Get rows between AZ: & AZ; for all Arizona users • Get rows between AZ:Scottsdale & AZ:Scottsdale1 for all Scottsdale users
  • 41. ORDER BY Columns Rows are sorted according to are placed according to their Partitioner: CompareWith or CompareSubcolumnsWith •Random: MD5 of key •Order-Preserving: actual key are sorted by key, regardless of partitioner
  • 44. is cassandra a good fit? • you need really fast writes • your programmers can deal • you need durability – documentation • you have lots of data – complexity – consistency model > GBs – change >= three servers – visibility tools • your app is evolving • your operations can deal – startup mode, fluid data – hardware considerations structure – can move data • loose domain data – JMX monitoring – “points of interest”