Basic Introduction to Cassandra, with Architecture and Strategies
Covering the big data challenge and what a NoSQL database is.
The Big Data Challenge
The Cassandra Solution
The CAP Theorem
The Architecture of Cassandra
The Data Partition and Replication
This document provides an agenda and introduction for a presentation on Apache Cassandra and DataStax Enterprise. The presentation covers an introduction to Cassandra and NoSQL, the CAP theorem, Apache Cassandra features and architecture including replication, consistency levels and failure handling. It also discusses the Cassandra Query Language, data modeling for time series data, and new features in DataStax Enterprise like Spark integration and secondary indexes on collections. The presentation concludes with recommendations for getting started with Cassandra in production environments.
Archaic database technologies just don't scale under the always-on, distributed demands of modern IoT, mobile, and web applications. We'll start this intro to Cassandra by discussing how its approach is different and why so many awesome companies have migrated from the cold clutches of the relational world into the warm embrace of peer-to-peer architecture. After this high-level opening discussion, we'll briefly unpack the following:
• Cassandra's internal architecture and distribution model
• Cassandra's Data Model
• Reads and Writes
This is a presentation on the popular NoSQL database Apache Cassandra, created by our team in the context of the module "Business Intelligence and Big Data Analysis".
Apache Cassandra is a free, distributed, open source, and highly scalable NoSQL database that is designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability, and tunable consistency. Cassandra's architecture allows it to spread data across a cluster of servers and replicate across multiple data centers for fault tolerance. It is used by many large companies for applications that require high performance, scalability, and availability.
Cassandra is a decentralized structured storage system that was initially developed at Facebook to power their inbox search. It is based on Amazon's Dynamo and Google's BigTable data models. Cassandra provides tunable consistency, high availability with no single points of failure, horizontal scalability and elasticity. It allows data to be distributed across multiple data centers and easily handles the addition or removal of nodes.
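The Dynamo-style partitioning described above can be sketched as a hash ring: each node owns a token, and a key's replicas are found by walking clockwise from the key's token. A minimal illustrative sketch in Python (the node names, the MD5 hash, and the one-token-per-node layout are all simplifying assumptions; Cassandra itself uses Murmur3 and many vnodes per node):

```python
import hashlib
from bisect import bisect_right

def token(key: str) -> int:
    # Hash a key onto the ring (MD5 here for brevity; Cassandra uses Murmur3).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # One token per node for simplicity; real clusters use many vnodes.
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, key: str):
        # Walk clockwise from the key's token, collecting distinct nodes.
        t = token(key)
        idx = bisect_right(self.tokens, (t, chr(0x10FFFF)))
        out = []
        for i in range(len(self.tokens)):
            node = self.tokens[(idx + i) % len(self.tokens)][1]
            if node not in out:
                out.append(node)
            if len(out) == self.rf:
                break
        return out
```

Because replica placement depends only on the hash ring, adding or removing a node moves only the keys adjacent to its token, which is what makes elastic scaling cheap.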
This document introduces Apache Cassandra, a distributed column-oriented NoSQL database. It discusses Cassandra's architecture, data model, query language (CQL), and how to install and run Cassandra. Key points covered include Cassandra's linear scalability, high availability and fault tolerance. The document also demonstrates how to use the nodetool utility and provides guidance on backing up and restoring Cassandra data.
This document provides an overview of the Cassandra NoSQL database. It begins with definitions of Cassandra and discusses its history and origins from projects like Bigtable and Dynamo. The document outlines Cassandra's architecture including its peer-to-peer distributed design, data partitioning, replication, and use of gossip protocols for cluster management. It provides examples of key features like tunable consistency levels and flexible schema design. Finally, it discusses companies that use Cassandra like Facebook and provides performance comparisons with MySQL.
Agenda
- What is NOSQL?
- Motivations for NOSQL?
- Brewer’s CAP Theorem
- Taxonomy of NOSQL databases
- Apache Cassandra
- Features
- Data Model
- Consistency
- Operations
- Cluster Membership
- What does NOSQL mean for RDBMS?
The document provides an overview of Apache Cassandra's architecture and design. It was created to address the needs of building reliable, high-performing, and always-available distributed databases. Cassandra is based on Dynamo and BigTable and uses a distributed hashing technique to partition and replicate data across nodes. It supports configurable replication across multiple data centers for high availability. Writes are sent to the local node and replicated to other nodes based on consistency level, while reads can be served from any replica.
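The consistency-level behavior described above comes down to simple arithmetic: QUORUM is a majority of the replication factor, and a write succeeds once that many replicas acknowledge. A hedged sketch (the level names mirror Cassandra's, but the functions are illustrative, not driver code):

```python
def quorum(replication_factor: int) -> int:
    # QUORUM requires a strict majority of the replicas to acknowledge.
    return replication_factor // 2 + 1

def write_succeeds(replication_factor: int, live_replicas: int,
                   consistency: str) -> bool:
    # Number of acknowledgements each consistency level demands.
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }[consistency]
    return live_replicas >= required
```

With a replication factor of 3, QUORUM needs 2 acknowledgements, so one replica can be down without blocking writes; ALL tolerates no failures, and ONE tolerates all but one.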
Cassandra is a distributed, column-oriented database that scales horizontally and is optimized for writes. It uses consistent hashing to distribute data across nodes and achieve high availability even when nodes join or leave the cluster. Cassandra offers flexible consistency options and tunable replication to balance availability and durability for read and write operations across the distributed database.
The document provides an introduction to Cassandra presented by Nick Bailey. It discusses key Cassandra concepts like cluster architecture, data modeling using CQL, and best practices. Examples are provided to illustrate how to model time-series data and denormalize schemas to support different queries. Tools for testing Cassandra implementations like CCM and client drivers are also mentioned.
This document introduces HBase, an open-source, non-relational, distributed database modeled after Google's BigTable. It describes what HBase is, how it can be used, and when it is applicable. Key points include that HBase stores data in columns and rows accessed by row keys, integrates with Hadoop for MapReduce jobs, and is well-suited for large datasets, fast random access, and write-heavy applications. Common use cases involve log analytics, real-time analytics, and message-centered systems.
1. Log structured merge trees store data in multiple levels with different storage speeds and costs, requiring data to periodically merge across levels.
2. This structure allows fast writes by storing new data in faster levels before merging to slower levels, and efficient reads by querying multiple levels and merging results.
3. The merging process involves loading, sorting, and rewriting levels to consolidate and propagate deletions and updates between levels.
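The three steps above can be modeled in miniature: merge the sorted runs, keep only the newest version of each key, and drop deletions once all levels are consolidated. A simplified Python sketch, assuming each run is a list of `(key, sequence, value)` tuples sorted by key and that `None` marks a tombstone (real LSM engines track far more metadata):

```python
import heapq

def merge_levels(levels):
    """Merge sorted (key, seq, value) runs from several levels, keeping the
    newest version of each key and dropping tombstones (value=None)."""
    merged = {}
    # heapq.merge yields the combined runs in sorted order; for each key,
    # the entry with the highest sequence number (newest write) wins.
    for key, seq, value in heapq.merge(*levels):
        if key not in merged or seq > merged[key][0]:
            merged[key] = (seq, value)
    # Tombstones can be discarded once every level has been merged.
    return [(k, v) for k, (s, v) in sorted(merged.items()) if v is not None]
```

This is why deletes in LSM-based stores are writes too: a tombstone must propagate down through the levels before the space is actually reclaimed.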
Cassandra is a distributed database designed to handle large amounts of data across commodity servers. It aims for high availability with no single points of failure. Data is distributed across nodes and replicated for redundancy. Cassandra uses a decentralized design with peer-to-peer communication and an eventually consistent model. It requires denormalized data models and queries to be defined prior to data structure.
NewSQL databases seek to provide the same scalable performance as NoSQL databases for online transaction processing workloads, while still maintaining the ACID guarantees of a traditional SQL database. NewSQL databases use new architectures like multi-version concurrency control and partition-level locking to allow for horizontal scaling and high availability without sacrificing consistency. They also provide highly optimized SQL engines to query data in a distributed environment.
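The multi-version concurrency control mentioned above can be reduced to a visibility rule: a transaction sees the newest version committed before it began, without taking read locks. A toy sketch of that rule (illustrative only, not any particular engine's implementation):

```python
def mvcc_read(versions, txn_start_ts):
    """versions: list of (commit_ts, value) for one row, oldest first.
    Returns the newest value committed before the transaction started."""
    visible = None
    for commit_ts, value in versions:
        if commit_ts <= txn_start_ts:
            visible = value  # still the newest visible version so far
        else:
            break  # later versions were committed after the txn began
    return visible
```

Readers never block writers and vice versa, which is one of the mechanisms NewSQL systems lean on to keep ACID semantics while scaling out.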
A Day in the Life of a ClickHouse Query: Webinar Slides (Altinity Ltd)
Why do queries run out of memory? How can I make my queries even faster? How should I size ClickHouse nodes for best cost-efficiency? The key to these questions and many others is knowing what happens inside ClickHouse when a query runs. This webinar is a gentle introduction to ClickHouse internals, focusing on topics that will help your applications run faster and more efficiently. We’ll discuss the basic flow of query execution, dig into how ClickHouse handles aggregation and joins, and show you how ClickHouse distributes processing within a single CPU as well as across many nodes in the network. After attending this webinar you’ll understand how to open up the black box and see what the parts are doing.
This presentation briefly describes key features of Apache Cassandra. It was held at the Apache Cassandra Meetup in Vienna in January 2014. You can access the meetup here: http://www.meetup.com/Vienna-Cassandra-Users/
MaxScale uses an asynchronous and multi-threaded architecture to route client queries to backend database servers. Each thread creates its own epoll instance to monitor file descriptors for I/O events, avoiding locking between threads. Listening sockets are added to a global epoll file descriptor that notifies threads when clients connect, allowing connections to be distributed evenly across threads. This architecture improves performance over the previous single epoll instance approach.
Presto on Apache Spark: A Tale of Two Computation Engines (Databricks)
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto's state-of-the-art low-latency evaluation with Spark's robust and fault-tolerant execution engine.
Apache Spark is an open-source cluster computing framework for large-scale data processing. It supports batch processing, real-time processing, streaming analytics, machine learning, interactive queries, and graph processing. Spark core provides distributed task dispatching and scheduling. It works by having a driver program that connects to a cluster manager to run tasks on executors in worker nodes. Spark also introduces Resilient Distributed Datasets (RDDs) that allow immutable, parallel data processing. Common RDD transformations include map, flatMap, groupByKey, and reduceByKey while common actions include reduce.
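The RDD transformations named above can be emulated in plain Python to show their shape. This is an illustrative sketch, not the PySpark API; `flat_map` and `reduce_by_key` are hypothetical stand-ins for Spark's `flatMap` and `reduceByKey`:

```python
from collections import defaultdict
from functools import reduce

def flat_map(f, data):
    # flatMap: apply f to each element and flatten the resulting sequences.
    return [y for x in data for y in f(x)]

def reduce_by_key(f, pairs):
    # reduceByKey: group values by key, then fold each group with f.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce(f, vs) for k, vs in groups.items()}

# The classic word count expressed with these primitives.
lines = ["to be or not", "to be"]
words = flat_map(str.split, lines)
counts = reduce_by_key(lambda a, b: a + b, [(w, 1) for w in words])
```

In real Spark the same pipeline is lazy and distributed: each transformation builds up a lineage graph, and work only runs when an action such as `reduce` or `collect` is called.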
Replication and Consistency in Cassandra... What Does it All Mean? (Christopher Bradford, DataStax)
Many users set the replication strategy on their keyspaces to NetworkTopologyStrategy and move on with modeling their data or developing the next big application. But what does that replication strategy really mean? Let's explore replication and consistency in Cassandra.
How are replicas chosen?
Where does node topology (location in a cluster) come into play?
What can I expect when nodes are down and I'm querying with a consistency level of LOCAL_QUORUM?
If a rack goes down can I still respond to quorum queries?
These questions may be simple to test, but have nuances that should be understood. This talk will dive into these topics in a visual and technical manner. Seasoned Cassandra veterans and new users alike stand to gain knowledge about these critical Cassandra components.
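The rack question above boils down to counting live replicas against a majority. A toy model, assuming each replica's rack is listed explicitly and that losing a rack takes down every replica on it (NetworkTopologyStrategy tries to place replicas on distinct racks precisely so this check passes):

```python
def quorum_available(replica_racks, down_racks):
    """replica_racks: the rack hosting each replica of a partition.
    Returns True if a QUORUM read/write can still be served."""
    live = sum(1 for rack in replica_racks if rack not in down_racks)
    needed = len(replica_racks) // 2 + 1  # majority of the replicas
    return live >= needed
```

With RF=3 spread across three racks, one rack can fail and quorum survives; if two replicas share a rack, losing that rack loses quorum, which is why rack-aware placement matters.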
About the Speaker
Christopher Bradford Solutions Architect, DataStax
High performance drives Christopher Bradford. He has worked across various industries including the federal government, higher education, social news syndication, low-latency HD video delivery, and usability research. Chris combines application engineering principles and systems administration experience to design and implement performant systems. He has architected applications and systems to create highly available, fault-tolerant, distributed services in myriad environments.
This document provides a reference architecture for deploying Pivotal Cloud Foundry (PCF) on Dell EMC VxRail Appliances. PCF is a cloud platform that allows developers to deploy and scale applications with no downtime. VxRail Appliances integrate compute and storage on a single appliance in a hyper-converged infrastructure. This solution provides developers with a ready-to-use environment to build and deploy cloud native applications, while leveraging existing VMware infrastructure investments. The document describes the key technologies involved, including PCF, VxRail, and VMware vSphere and vSAN. It also provides guidance on hardware and software requirements, management, storage configuration, networking, and sizing considerations for this solution.
MongoDB is an open-source, document-oriented database that provides high performance and horizontal scalability. It uses a document-model where data is organized in flexible, JSON-like documents rather than rigidly defined rows and tables. Documents can contain multiple types of nested objects and arrays. MongoDB is best suited for applications that need to store large amounts of unstructured or semi-structured data and benefit from horizontal scalability and high performance.
Storm is a distributed and fault-tolerant realtime computation system. It was created at BackType/Twitter to analyze tweets, links, and users on Twitter in realtime. Storm provides scalability, reliability, and ease of programming. It uses components like Zookeeper, ØMQ, and Thrift. A Storm topology defines the flow of data between spouts that read data and bolts that process data. Storm guarantees processing of all data through its reliability APIs and guarantees no data loss even during failures.
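The spout-and-bolt data flow described above can be mimicked with Python generators: a spout emits a stream, each bolt consumes and re-emits it. An illustrative sketch only; Storm's real API uses explicit topology builders and runs each component as a distributed task:

```python
def spout(lines):
    # A spout emits a stream of tuples into the topology.
    for line in lines:
        yield line

def split_bolt(stream):
    # A bolt consumes tuples, transforms them, and emits new tuples.
    for line in stream:
        yield from line.split()

def count_bolt(stream):
    # A terminal bolt aggregates the stream into word counts.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split bolt -> count bolt.
result = count_bolt(split_bolt(spout(["storm storm", "bolt"])))
```

In Storm, the wiring also carries acknowledgement metadata so that any tuple that fails downstream can be replayed from the spout, which is how the no-data-loss guarantee is implemented.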
Modeling Data and Queries for Wide Column NoSQL (ScyllaDB)
Discover how to model data for wide column databases such as ScyllaDB and Apache Cassandra. Contrast the difference from traditional RDBMS data modeling, going from a normalized “schema first” design to a denormalized “query first” design. Plus how to use advanced features like secondary indexes and materialized views to get the answers you need from the same base table.
The NoSQL Principles and Basic Application of the Cassandra Model (Rishikese MR)
The slides discuss various aspects of NoSQL and the Cassandra model, giving a complete picture of both topics and their relationship. They also cover the merits and demerits of each, along with their features and examples.
NoSQL databases provide an alternative to traditional relational databases that is well-suited for large datasets, high scalability needs, and flexible, changing schemas. NoSQL databases sacrifice strict consistency for greater scalability and availability. The document model is well-suited for semi-structured data and allows for embedding related data within documents. Key-value stores provide simple lookup of data by key but do not support complex queries. Graph databases effectively represent network-like connections between data elements.
NoSQL databases were developed to address the limitations of relational databases in handling massive, unstructured datasets. NoSQL databases sacrifice ACID properties like consistency in favor of scalability and availability. The CAP theorem states that only two of consistency, availability, and partition tolerance can be achieved at once. Common NoSQL database types include document stores, key-value stores, column-oriented stores, and graph databases. NoSQL is best suited for large datasets that don't require strict consistency or relational structures.
Apache Cassandra is a free and open source distributed database management system that is highly scalable and designed to manage large amounts of structured data. It provides high availability with no single point of failure. Cassandra uses a decentralized architecture and is optimized for scalability and availability without compromising performance. It distributes data across nodes and data centers and replicates data for fault tolerance.
Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across commodity servers with no single point of failure. It provides high availability and scales linearly as nodes are added. Cassandra uses a flexible column-oriented data model and supports dynamic schemas. Data is replicated across nodes for fault tolerance, with Cassandra ensuring eventual consistency.
This document provides an overview of NoSQL databases and summarizes key information about several NoSQL databases, including HBase, Redis, Cassandra, MongoDB, and Memcached. It discusses concepts like horizontal scalability, the CAP theorem, eventual consistency, and data models used by different NoSQL databases like key-value, document, columnar, and graph structures.
Cassandra is a distributed database designed to handle large amounts of structured data across commodity servers. It provides linear scalability, fault tolerance, and high availability. Cassandra's architecture is masterless with all nodes equal, allowing it to scale out easily. Data is replicated across multiple nodes according to the replication strategy and factor for redundancy. Cassandra supports flexible and dynamic data modeling and tunable consistency levels. It is commonly used for applications requiring high throughput and availability, such as social media, IoT, and retail.
The document provides an overview of column databases. It begins with a quick recap of different database types and then defines and discusses column databases and column-oriented databases. It explains that column databases store data by column rather than by row, allowing for faster access to specific columns of data. Examples of column databases discussed include Cassandra, HBase, and Vertica. The document then focuses on Cassandra, describing its data model using concepts like keyspaces and column families. It also explains Cassandra's database engine architecture featuring memtables, SSTables, and compaction. The document concludes by mentioning some large companies that use Cassandra in production systems.
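The memtable/SSTable write and read path described above can be sketched in a few lines. A toy model, assuming a dict-backed memtable flushed into immutable sorted runs once a size threshold is hit (real SSTables live on disk with bloom filters, indexes, and background compaction):

```python
import bisect

class Memtable:
    def __init__(self, flush_threshold=4):
        self.data = {}          # in-memory write buffer
        self.threshold = flush_threshold
        self.sstables = []      # immutable sorted runs, newest first

    def put(self, key, value):
        self.data[key] = value
        if len(self.data) >= self.threshold:
            self.flush()

    def flush(self):
        # SSTables are sorted by key, enabling binary search on reads.
        self.sstables.insert(0, sorted(self.data.items()))
        self.data = {}

    def get(self, key):
        # Read path: check the memtable first, then SSTables newest to oldest.
        if key in self.data:
            return self.data[key]
        for table in self.sstables:
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None
```

Writes are append-only and never touch old runs, which is why this design is so fast for write-heavy workloads; compaction later merges the accumulated SSTables back down.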
Big Data Storage Concepts, from the "Big Data Concepts, Technology and Architec..." (raghdooosh)
The document discusses big data storage concepts including cluster computing, distributed file systems, and different database types. It covers cluster structures like symmetric and asymmetric, distribution models like sharding and replication, and database types like relational, non-relational and NewSQL. Sharding partitions large datasets across multiple machines while replication stores duplicate copies of data to improve fault tolerance. Distributed file systems allow clients to access files stored across cluster nodes. Relational databases are schema-based while non-relational databases like NoSQL are schema-less and scale horizontally.
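The sharding-versus-replication distinction above can be shown in two small functions: sharding places each record on exactly one node, replication gives every node a full copy. An illustrative sketch, assuming hash-based shard assignment:

```python
def shard(records, n_shards):
    # Sharding: each (key, value) record lands on exactly one shard,
    # chosen by hashing its key.
    shards = [[] for _ in range(n_shards)]
    for key, value in records:
        shards[hash(key) % n_shards].append((key, value))
    return shards

def replicate(records, n_copies):
    # Replication: every node holds a full duplicate copy of the data.
    return [list(records) for _ in range(n_copies)]
```

Sharding buys capacity (each node stores only a fraction of the data) while replication buys fault tolerance (any copy can serve a read); most distributed databases, Cassandra included, combine both.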
Cassandra is a highly scalable, distributed NoSQL database that is designed to handle large amounts of data across commodity servers while providing high availability without single points of failure. It uses a peer-to-peer distributed system where each node acts as both a client and server, allowing it to remain operational as long as one node remains active. Cassandra's data model consists of keyspaces that contain tables with rows and columns. Data is replicated across multiple nodes for fault tolerance.
NoSQL databases provide flexible schemas, horizontal scalability, and eventual consistency. There are four main NoSQL data models: key-value, document, column family, and graph. Key-value databases store data as unstructured (key, value) pairs. Document databases store data as documents with a flexible schema. Column family databases organize data by columns within rows. Graph databases model data as nodes and relationships. Popular NoSQL databases include MongoDB, Cassandra, HBase, Redis, Neo4j, and Elasticsearch.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
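One common form of the "intelligent key design" mentioned above is key salting: prefixing sequential row keys (such as timestamps) with a deterministic salt so writes spread across regions instead of hotspotting the last one. A hedged sketch; HBase has no built-in salting, so the helper below is a hypothetical client-side convention:

```python
def salted_key(row_key: str, n_buckets: int = 8) -> str:
    # Derive a deterministic salt from the key bytes so the same key
    # always maps to the same bucket prefix.
    salt = sum(row_key.encode()) % n_buckets
    return f"{salt:02d}-{row_key}"
```

The trade-off is that range scans over the original key order now require n_buckets parallel scans, one per salt prefix, so salting suits write-heavy tables read by point lookup.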
From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for an upcoming second release of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
This document provides an overview of NoSQL databases. It discusses that NoSQL databases are non-relational and do not follow the RDBMS principles. It describes some of the main types of NoSQL databases including document stores, key-value stores, column-oriented stores, and graph databases. It also discusses how NoSQL databases are designed for massive scalability and do not guarantee ACID properties, instead following a BASE model ofBasically Available, Soft state, and Eventually Consistent.
8. Objective
• Schema free
• Easy replication
• Simple API
• Eventually consistent
• Can handle huge amounts of data
NoSQL Database
• Simplicity of design
• Horizontal scaling
• Finer control over availability
9. Relational Database vs. NoSQL
Relational Database
• Supports powerful query language
• It has a fixed schema
• Follows ACID (Atomicity, Consistency, Isolation, and Durability)
• Supports transactions
NoSQL Database
• Supports very simple query language
• No Fixed Schema
• It is only “eventually consistent”
• Does not support transactions
10. Other NoSQL Database
• Apache HBase - HBase is an open-source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop.
• MongoDB - MongoDB is a cross-platform, document-oriented database system that avoids the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster.
11. What is Apache Cassandra?
• Apache Cassandra™ is a free,
• Distributed,
• High-performance,
• Extremely scalable,
• Fault-tolerant (i.e. no single point of failure)
• post-relational database solution. Cassandra can serve both as a real-time data store (the "system of record") for online/transactional applications and as a read-intensive database for business intelligence systems.
12. Features of Cassandra
• Elastic scalability
• Always on architecture
• Fast linear-scale performance
• Flexible data storage
• Easy data distribution
• Transaction support
• Fast writes
13. History of Cassandra
• Cassandra was developed at Facebook for inbox search.
• It was open-sourced by Facebook in July 2008.
• Cassandra was accepted into Apache Incubator in March 2009.
• It was made an Apache top-level project since February 2010.
15. CAP Theorem
• A distributed system can provide only two of the following three guarantees:
• Availability
• Consistency
• Partition Tolerance
• Also known as Brewer's Theorem
16. Cassandra - AP
• Cassandra prioritizes Availability and Partition Tolerance
• Consistency is not guaranteed
• Tunable tradeoffs between latency and consistency
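The latency/consistency tradeoff on this slide is tunable per request via read and write consistency levels. A minimal sketch of the underlying quorum rule (the helper name is mine, not a driver API): with N replicas, reads of R replicas and writes of W replicas are guaranteed to overlap whenever R + W > N.

```python
# Tunable consistency in one rule: a read of R replicas is guaranteed to see
# the latest write of W replicas when R + W > N (the read and write sets
# must intersect). Function name is illustrative, not part of any driver.

def is_strongly_consistent(n_replicas: int, write_cl: int, read_cl: int) -> bool:
    """True if every read is guaranteed to observe the most recent write."""
    return read_cl + write_cl > n_replicas

# QUORUM writes + QUORUM reads with 3 replicas: 2 + 2 > 3 -> consistent
print(is_strongly_consistent(3, 2, 2))   # True
# ONE + ONE with 3 replicas: 1 + 1 <= 3 -> only eventually consistent
print(is_strongly_consistent(3, 1, 1))   # False
```

This is why QUORUM/QUORUM is a common middle ground: it trades some latency for read-your-writes behavior, while ONE/ONE maximizes availability and speed at the cost of consistency.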
17. Other Approaches - CP
• e.g. HBase
• Implements row locking for consistency
• HBase has a master/slave architecture and a single point of failure
• No Availability (A)
19. Architecture Overview
• Cassandra was designed with the understanding that
system/hardware failures can and do occur
• Peer-to-peer, distributed system
• All nodes the same
• Data partitioned among all nodes in the cluster
• Custom data replication to ensure fault tolerance
• Read/Write-anywhere design
20. Architecture Overview
• Each node communicates with the others through the Gossip protocol, which exchanges information across the cluster every second
• A commit log is used on each node to capture write activity; data durability is assured
• Data is also written to an in-memory structure (memtable) and then to disk once the memory structure is full (an SSTable)
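The write path on this slide (commit log, then memtable, then flush to an SSTable) can be sketched as a toy model. All class, attribute, and threshold names below are illustrative, not Cassandra internals; a real memtable flushes on memory size, not entry count.

```python
# Toy model of Cassandra's write path: append to the commit log for
# durability, update the in-memory memtable, and flush the memtable to an
# immutable "SSTable" once it fills up. Names and sizes are illustrative.

class ToyNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []            # durable, append-only (crash recovery)
        self.memtable = {}              # in-memory, sorted at flush time
        self.sstables = []              # immutable "on-disk" files
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Memtables are written out sorted by key, producing an SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

node = ToyNode()
node.write("a", 1)
node.write("b", 2)        # second write fills the memtable and triggers a flush
print(node.sstables)      # [{'a': 1, 'b': 2}]
```

Note the ordering: the commit log is appended before the memtable is touched, which is what lets a node replay the log and rebuild the memtable after a crash.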
21. Architecture Overview
• The schema used in Cassandra is modeled after Google Bigtable. It is a row-oriented, column structure
• A keyspace is akin to a database in the RDBMS world
• A column family is similar to an RDBMS table but is more flexible/dynamic
• A row in a column family is indexed by its key; other columns may be indexed as well
22. Components of Cassandra
• Node − It is the place where data is stored.
• Data center − It is a collection of related nodes.
• Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
• SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
• Bloom filter − These are quick, nondeterministic algorithms for testing whether an element is a member of a set; a special kind of cache. Bloom filters are consulted on every read.
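The Bloom filter component above answers "is this key definitely absent from this SSTable, or possibly present?" so reads can skip disk files entirely. A minimal sketch follows; the bit-array size, hash count, and use of MD5 are illustrative choices, not Cassandra's actual parameters.

```python
# A minimal Bloom filter: k hash functions set k bits per key. A membership
# test returning False is definitive ("never added"); True means "probably
# present" (false positives possible, false negatives impossible).
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive k positions by salting the key; MD5 here is for illustration.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))   # True
print(bf.might_contain("row-99"))   # almost certainly False
```

For an SSTable, a "False" answer means the file cannot contain the row, so the read path skips it without any disk I/O; only "True" answers trigger an actual lookup.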
25. Partition Process
• Data is transparently partitioned across the nodes
• Data sent to a node is hashed and routed to a partition based on that hash
• The data partitioning strategy is controlled via the partitioner option inside the cassandra.yaml file
• Once a cluster is initialized with a partitioner option, it cannot be changed without reloading all of the data in the cluster
26. Partitioning Strategies
• Random Partitioning
• This is the default and recommended strategy.
• Partitions data as evenly as possible across all nodes using an MD5 hash of every column family row key
• Ordered Partitioning
• Stores column family row keys in sorted order across all nodes in the cluster.
• Sequential writes can cause hot spots
• More administrative overhead to load balance the cluster
• Uneven load balancing for multiple column families
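Random partitioning as described above can be sketched in a few lines: hash the row key with MD5 to get a token, then route the key to the node owning that token's range on the ring. The node count, token spacing, and node names below are made up for illustration.

```python
# Sketch of MD5-based random partitioning: every row key hashes to a token,
# and each node owns a contiguous token range on the ring. Nodes and their
# tokens here are illustrative, not real cluster configuration.
import hashlib
from bisect import bisect_right

RING_SIZE = 2 ** 128

def token(row_key: str) -> int:
    # RandomPartitioner-style token: the MD5 digest interpreted as an integer.
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16)

# Four nodes with evenly spaced tokens around the ring.
NODE_TOKENS = [(i * RING_SIZE // 4, f"node{i}") for i in range(4)]

def owner(row_key: str) -> str:
    """Return the node whose token range contains this key's token."""
    t = token(row_key)
    tokens = [tok for tok, _ in NODE_TOKENS]
    # Last node whose token is <= t owns the range [token, next_token).
    idx = bisect_right(tokens, t) - 1
    return NODE_TOKENS[idx][1]

print(owner("user:alice"), owner("user:bob"))
```

Because MD5 scatters even sequential keys uniformly over the token space, writes spread evenly across nodes; this is exactly the property ordered partitioning gives up in exchange for range scans.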
27. Replication
• To ensure fault tolerance and no single point of failure, you can replicate one or more copies of every row across nodes in the cluster
• Replication is controlled by two parameters of a keyspace: the replication factor and the replication strategy
• The replication factor controls how many copies of a row are stored in the cluster
• The replication strategy controls how the data is replicated.
28. Replication Strategies
• Simple Strategy
• Places the original row on a node determined by the partitioner. Additional replica rows are placed on the next nodes clockwise in the ring.
• Network Topology Strategy
• Allows replication between different racks in a data center and/or between multiple data centers
• The original row is placed according to the partitioner. Additional replica rows in the same data center are then placed by walking the ring clockwise until a node in a different rack from the previous replica is found. If there is no such node, additional replicas are placed in the same rack.
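The Simple Strategy placement rule above reduces to a short ring walk: start at the node the partitioner picked, then move clockwise collecting distinct nodes until the replication factor is met. The ring contents and indices below are illustrative.

```python
# Simple Strategy in miniature: the partitioner chooses the primary node,
# and additional replicas go to the next distinct nodes clockwise around
# the ring. Ring and parameters here are made up for illustration.

def place_replicas(ring, primary_index, replication_factor):
    """Walk the ring clockwise from the primary, collecting RF distinct nodes."""
    replicas = []
    for step in range(len(ring)):
        node = ring[(primary_index + step) % len(ring)]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas

ring = ["node0", "node1", "node2", "node3"]
print(place_replicas(ring, primary_index=2, replication_factor=3))
# ['node2', 'node3', 'node0']
```

Network Topology Strategy follows the same clockwise walk but additionally skips nodes until one in a different rack is found, which is what spreads replicas across failure domains.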
Editor's Notes
#9: Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a type of NoSQL database. Let us first understand what a NoSQL database does.
NoSQL Database
A NoSQL database (sometimes called Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.
The primary objective of a NoSQL database is to have
simplicity of design,
horizontal scaling, and
finer control over availability.
NoSQL databases use different data structures compared to relational databases, which makes some operations faster in NoSQL. The suitability of a given NoSQL database depends on the problem it must solve.