IntroNoSQL Revised
INTRODUCTION TO
NOSQL DATABASES
Outline
• Background
• What is NOSQL?
• Who is using it?
• 3 major papers for NOSQL
• CAP theorem
• NOSQL categories
• Conclusion
• References
Background
• Relational databases → mainstay of business
• Web-based applications caused spikes
• explosion of social media sites (Facebook, Twitter) with large data
needs
• rise of cloud-based solutions such as Amazon S3 (Simple Storage
Service)
• Hooking an RDBMS to a web-based application becomes
troublesome
What is NOSQL?
• The Name:
• Stands for Not Only SQL
• The term NOSQL was introduced by Carlo Strozzi in 1998 to name
his file-based database
• It was reintroduced by Eric Evans in 2009 when an event was
organized to discuss open-source distributed databases
• Eric states that “… but the whole point of seeking alternatives is
that you need to solve a problem that relational databases are a
bad fit for. …”
What is NOSQL?
• Key features (advantages):
• non-relational
• don’t require schema
• data are replicated to multiple
nodes (so, identical & fault-tolerant)
and can be partitioned:
• down nodes easily replaced
• no single point of failure
• horizontally scalable
• cheap, easy to implement
(open-source)
• massive write performance
• fast key-value access
What is NOSQL?
• Disadvantages:
• Don’t fully support relational features
• no join, group by, order by operations (except within partitions)
• no referential integrity constraints across partitions
• No declarative query language (e.g., SQL) → more programming
• Relaxed ACID (see CAP theorem) → fewer guarantees
• No easy integration with other applications that support SQL
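The "more programming" cost can be illustrated with a minimal sketch of a join and a group-by done in application code. The collections `users` and `orders`, their fields, and both helper functions are hypothetical and not tied to any particular NOSQL product.

```python
# Hypothetical data: two collections as they might come back from separate lookups.
users = {"u1": {"name": "Ann"}, "u2": {"name": "Bob"}}
orders = [
    {"order_id": 1, "user_id": "u1", "total": 30},
    {"order_id": 2, "user_id": "u2", "total": 15},
    {"order_id": 3, "user_id": "u1", "total": 20},
]


def join_orders_with_users(orders, users):
    """Application-side equivalent of: orders JOIN users ON user_id."""
    return [
        {**order, "name": users[order["user_id"]]["name"]}
        for order in orders
        if order["user_id"] in users
    ]


def total_per_user(orders):
    """Application-side equivalent of: SUM(total) ... GROUP BY user_id."""
    totals = {}
    for order in orders:
        totals[order["user_id"]] = totals.get(order["user_id"], 0) + order["total"]
    return totals


joined = join_orders_with_users(orders, users)
totals = total_per_user(orders)
```

In SQL each helper would be a single declarative statement; here the application has to fetch both collections and combine them itself.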
CAP Theorem
• Consider three properties
of a distributed system (sharing data):
• Consistency:
• all copies have the same value
• Availability:
• reads and writes always succeed
• Partition-tolerance:
• system properties (consistency and/or availability) hold even when
network failures prevent some machines from communicating with
others
CAP Theorem
• Brewer’s CAP Theorem:
• For any system sharing data, it is “impossible” to guarantee
simultaneously all of these three properties
• You can have at most two of these three properties for any shared-
data system
• Very large systems will “partition” at some point:
• That leaves either C or A to choose from (traditional DBMS prefers
C over A and P )
• In almost all cases, you would choose A over C (except in specific
applications such as order processing)
CAP Theorem
[Figure: CAP triangle with corners Consistency, Availability, and Partition tolerance. Consistency: all clients always have the same view of the data.]
CAP Theorem
• Consistency
• 2 types of consistency:
1. Strong consistency – ACID (Atomicity, Consistency,
Isolation, Durability)
2. Weak consistency – BASE (Basically Available,
Soft state, Eventual consistency)
CAP Theorem
• ACID
• A DBMS is expected to support “ACID transactions,” processes that
provide:
• Atomicity: either the whole process is done or none of it is
• Consistency: only valid data are written
• Isolation: concurrent transactions do not interfere; each behaves as if it ran alone
• Durability: once committed, it stays that way
• CAP
• Consistency: all copies of the data on the cluster have the same value
• Availability: cluster always accepts reads and writes
• Partition tolerance: guaranteed properties are maintained even
when network failures prevent some machines from communicating
with others
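As an illustration of the atomicity property listed under ACID, here is a minimal sketch using Python's built-in `sqlite3` module. The `account` table, the names in it, and the `transfer` helper are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Overdrawing alice violates the CHECK constraint on the second update,
# so the credit to bob made by the first update is rolled back as well.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

The failed transfer leaves both balances untouched, which is the "whole process or none" behaviour described above.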
CAP Theorem
• A consistency model determines rules for visibility and
apparent order of updates
• Example:
• Row X is replicated on nodes M and N
• Client A writes row X to node N
• Some period of time t elapses
• Client B reads row X from node M
• Does client B see the write from client A?
• Consistency is a continuum with tradeoffs
• For NOSQL, the answer would be: “maybe”
• CAP theorem states: “strong consistency can't be achieved at the
same time as availability and partition-tolerance”
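The replicated-row scenario above can be simulated in a few lines. This is a toy sketch, assuming asynchronous replication modeled as a queue of undelivered messages; the `Replica` class and the `write` and `deliver_pending` helpers are hypothetical names for illustration.

```python
class Replica:
    """One node holding a copy of the data."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


node_m, node_n = Replica("M"), Replica("N")
node_m.rows["X"] = node_n.rows["X"] = "old"

pending = []  # replication messages not yet delivered: (target, key, value)


def write(node, peers, key, value):
    """Apply the write locally and queue it for the other replicas."""
    node.rows[key] = value
    for peer in peers:
        pending.append((peer, key, value))


def deliver_pending():
    """Simulate the replication traffic finally arriving."""
    while pending:
        target, key, value = pending.pop(0)
        target.rows[key] = value


write(node_n, [node_m], "X", "new")  # client A writes row X to node N
read_before = node_m.rows["X"]       # client B reads row X from node M too early
deliver_pending()                    # time t elapses; replication catches up
read_after = node_m.rows["X"]        # client B reads row X from node M again
```

Whether client B sees the write depends on whether t is longer than the replication delay, which is why the answer is "maybe".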
CAP Theorem
• Eventual consistency
• When no updates occur for a long period of time, eventually all
updates will propagate through the system and all the nodes will
be consistent
• Cloud computing
• ACID is hard to achieve; moreover, it is not always required, e.g., for
blogs, status updates, product listings, etc.
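One way replicas can converge once updates stop is a last-write-wins merge. This is a sketch of one possible technique, not a description of any specific product. It assumes every update carries a unique logical timestamp, and the `merge` and `anti_entropy_round` helpers and the sample data are hypothetical.

```python
def merge(local, remote):
    """Last-write-wins: per key, keep the (timestamp, value) pair that is newer."""
    merged = dict(local)
    for key, (ts, value) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged


def anti_entropy_round(replicas):
    """Let every replica merge in the state of all the others."""
    combined = {}
    for replica in replicas:
        combined = merge(combined, replica)
    return [merge(replica, combined) for replica in replicas]


# Each replica maps key -> (logical timestamp, value); the replicas start out diverged.
replicas = [
    {"status": (1, "hello")},
    {"status": (2, "hello, world")},
    {"status": (1, "hello"), "title": (3, "My blog")},
]

converged = anti_entropy_round(replicas)
```

After the round all three replicas hold the same data. Two different values with an equal timestamp would need an extra tie-breaking rule, which this sketch leaves out.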
CAP Theorem
[Figure: CAP triangle with corners Consistency, Availability, and Partition tolerance. Availability: each client can always read and write.]
CAP Theorem
[Figure: CAP triangle with corners Consistency, Availability, and Partition tolerance. Partition tolerance: the system can continue to operate in the presence of network partitions.]
NOSQL categories
1. Key-value
• Example: DynamoDB, Voldemort, Scalaris
2. Document-based
• Example: MongoDB, CouchDB
3. Column-based
• Example: BigTable, Cassandra, HBase
4. Graph-based
• Example: Neo4J, InfoGrid
• “No-schema” is a common characteristic of most NOSQL
storage systems
• Provide “flexible” data types
Key-value
• Focus on scaling to huge amounts of data
• Designed to handle massive load
• Based on Amazon’s Dynamo paper
• Data model: (global) collection of Key-value pairs
• Dynamo ring partitioning and replication
• Example: (DynamoDB)
• items having one or more attributes (name, value)
• An attribute can be single-valued or multi-valued, like a set.
• items are combined into a table
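Ring partitioning and replication can be sketched with a simplified consistent-hashing ring. This is an illustration under stated assumptions, not Dynamo's actual implementation: it places one point per node on the ring (real systems typically add virtual nodes), and the `HashRing` class, the `ring_position` helper, and the node names are hypothetical.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a string to a point on the ring (a large integer)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((ring_position(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def preference_list(self, key):
        """Nodes that store `key`: the first `replicas` nodes clockwise from it."""
        start = bisect.bisect_right(self.positions, ring_position(key)) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.preference_list("user:42")  # two distinct nodes hold this key
```

Because a key's position depends only on its hash, removing a node moves only the keys that node was responsible for; all other keys keep their primary owner.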
Key-value
• Basic API access:
• get(key): extract the value given a key
• put(key, value): create or update the value given its key
• delete(key): remove the key and its associated value
• execute(key, operation, parameters): invoke an operation on the
value (given its key), which is a special data structure (e.g., List, Set,
Map, etc.)
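The four operations above can be mimicked with an in-memory sketch. The `KeyValueStore` class is a hypothetical stand-in for illustration and not the client API of any real store. Here `execute` simply calls a method of the stored Python value by name.

```python
class KeyValueStore:
    """Toy in-memory store exposing get / put / delete / execute."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Return the value for `key`, or None if the key is absent."""
        return self._data.get(key)

    def put(self, key, value):
        """Create or update the value stored under `key`."""
        self._data[key] = value

    def delete(self, key):
        """Remove `key` and its value; a missing key is ignored."""
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        """Invoke a method of the stored value by name, e.g. 'append' on a list."""
        value = self._data[key]
        return getattr(value, operation)(*parameters)


store = KeyValueStore()
store.put("cart:7", ["apple"])
store.execute("cart:7", "append", "pear")  # mutate the list on the "server" side
cart = store.get("cart:7")
store.delete("cart:7")
missing = store.get("cart:7")
```

The point of `execute` is that the client sends a small operation instead of reading the whole value, changing it, and writing it back.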
Key-value
Pros:
• very fast
• very scalable (horizontally distributed to nodes based on key)
• simple data model
• eventual consistency
• fault-tolerance
Cons:
• Can’t model more complex data structures such as objects
Document-based
• Can model more complex objects
• Inspired by Lotus Notes
• Data model: collection of documents
• Document: JSON (JavaScript Object Notation, a data model of
key-value pairs that supports objects, records, structs, lists,
arrays, maps, dates, and Booleans, with nesting), XML, or other
semi-structured formats.
Document-based
• Example: (MongoDB) document
• {Name: "Jaroslav",
   Address: "Malostranske nám. 25, 118 00 Praha 1",
   Grandchildren: {Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1",
                   Otis: "3", Richard: "1"},
   Phones: ["123-456-7890", "234-567-8963"]
  }
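To show what nesting buys, the same document can be handled as a plain Python dictionary. This sketch does not use a MongoDB driver. The `get_path` helper and its dotted-path syntax are assumptions made for illustration.

```python
import json

document = {
    "Name": "Jaroslav",
    "Address": "Malostranske nám. 25, 118 00 Praha 1",
    "Grandchildren": {"Claire": "7", "Barbara": "6", "Magda": "3",
                      "Kirsten": "1", "Otis": "3", "Richard": "1"},
    "Phones": ["123-456-7890", "234-567-8963"],
}


def get_path(doc, path):
    """Follow a dotted path such as 'Grandchildren.Claire' or 'Phones.0'."""
    current = doc
    for part in path.split("."):
        # List elements are addressed by position, object fields by name.
        current = current[int(part)] if isinstance(current, list) else current[part]
    return current


claire_age = get_path(document, "Grandchildren.Claire")
first_phone = get_path(document, "Phones.0")

# The whole object serializes to JSON and back without a predefined schema.
round_trip = json.loads(json.dumps(document))
```

A key-value store would treat this whole object as one opaque value, whereas a document store can address the nested fields individually.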
Column-based
• Based on Google’s BigTable paper
• Like column oriented relational databases (store data in column order) but
with a twist
• Tables similar to RDBMS, but handle semi-structured data
• Data model:
• Collection of Column Families
• Column family = (key, value) where value = set of related columns (standard, super)
• indexed by row key, column key and timestamp
Column-based
• One column family can have variable
numbers of columns
• Cells within a column family are sorted “physically”
• Very sparse, most cells have null values
• Comparison: RDBMS vs column-based NOSQL
• Query on multiple tables
• RDBMS: must fetch data from several places on disk and glue together
• Column-based NOSQL: only fetch column families of those columns
that are required by a query (all columns in a column family are stored
together on the disk, so multiple rows can be retrieved in one read
operation → data locality)
Column-based
• Example: (Cassandra column family--timestamps
removed for simplicity)
UserProfile = {
  Cassandra = { emailAddress: "[email protected]", age: "20" }
  TerryCho = { emailAddress: "[email protected]", gender: "male" }
  Cath = { emailAddress: "[email protected]",
           age: "20", gender: "female", address: "Seoul" }
}
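The data model (cells indexed by row key, column key, and timestamp, with rows holding different columns) can be sketched as nested maps. The `ColumnFamily` class is a hypothetical illustration and not Cassandra's API, and the second address value and all timestamps are made up for the example.

```python
from collections import defaultdict


class ColumnFamily:
    """Sparse map: row key -> column key -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        """Store one version of a cell; older versions are kept."""
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column, timestamp=None):
        """Newest version at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row_key, {}).get(column, {})
        eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(eligible)] if eligible else None

    def columns(self, row_key):
        """Column keys present in this row, in sorted order."""
        return sorted(self._rows.get(row_key, {}))


profile = ColumnFamily()
profile.put("Cath", "age", "20", timestamp=1)
profile.put("Cath", "address", "Seoul", timestamp=1)
profile.put("Cath", "address", "Busan", timestamp=2)  # made-up later update
profile.put("TerryCho", "gender", "male", timestamp=1)

latest_address = profile.get("Cath", "address")
address_at_1 = profile.get("Cath", "address", timestamp=1)
```

Missing cells take no space: `TerryCho` simply has no `age` entry, which is how very sparse tables stay cheap.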
Graph-based
• Focus on modeling the structure of data (interconnectivity)
• Scales to the complexity of data
• Inspired by mathematical graph theory (G = (V, E), with vertices V and edges E)
• Data model:
• (Property Graph) nodes and edges
• Nodes may have properties (including ID)
• Edges may have labels or roles
• Key-value pairs on both
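A property graph as described above can be sketched in a few lines. The `PropertyGraph` class, the node IDs, the edge labels, and the properties are all hypothetical and not the API of Neo4J or InfoGrid.

```python
class PropertyGraph:
    """Nodes and labeled edges, each carrying key-value properties."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = []  # (source id, label, target id, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, label, target, **properties):
        self.edges.append((source, label, target, properties))

    def neighbors(self, node_id, label=None):
        """Targets of outgoing edges from node_id, optionally filtered by label."""
        return [
            target
            for source, edge_label, target, _ in self.edges
            if source == node_id and (label is None or edge_label == label)
        ]


graph = PropertyGraph()
graph.add_node("alice", name="Alice")
graph.add_node("bob", name="Bob")
graph.add_node("acme", name="Acme")
graph.add_edge("alice", "KNOWS", "bob", since=2010)
graph.add_edge("alice", "WORKS_AT", "acme", role="engineer")

friends = graph.neighbors("alice", label="KNOWS")
```

Queries follow edges from a starting node rather than joining tables, so this model suits highly interconnected data.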