Big Data
(NoSQL Databases)
Instructor: Thanh Binh Nguyen
September 1st, 2019
S3Lab
Smart Software System Laboratory
“Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
Background
● Relational databases have been the mainstay of business
● Web-based applications caused load spikes
● Explosion of social media sites (Facebook, Twitter) with large data needs
● Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
● Hooking an RDBMS to a web-based application becomes troublesome
Issues with Scaling up
● The best way to provide ACID and a rich query model is to keep the dataset on a single machine
● Scaling up (vertical scaling: making a “single” machine more powerful) has limits: eventually the dataset is just too big
● Scaling out (horizontal scaling: adding more smaller, cheaper servers) is a better choice
● Different approaches for horizontal scaling (multi-node database):
○ Master / Slave
○ Sharding (partitioning)
Scaling out RDBMS
Master / Slave
● All writes go to the master
● All reads are performed against the replicated slave databases
● Critical reads may be incorrect, as writes may not have been propagated down yet
● Large datasets can pose problems, as the master needs to duplicate data to the slaves
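A minimal sketch of the master/slave pattern, assuming one in-memory dictionary per node and an explicit `replicate()` call standing in for asynchronous propagation. The class and method names are hypothetical and exist only for illustration; real replication is done by the database, not by application code like this.

```python
import itertools


class MasterSlaveStore:
    """Toy master/slave store: writes go to the master, reads to slaves."""

    def __init__(self, n_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(n_slaves)]
        self._pending = []  # writes not yet propagated to the slaves
        self._next_slave = itertools.cycle(range(n_slaves))

    def write(self, key, value):
        # All writes are applied to the master first.
        self.master[key] = value
        self._pending.append((key, value))

    def read(self, key):
        # Reads are served round-robin by the slaves and may be stale.
        slave = self.slaves[next(self._next_slave)]
        return slave.get(key)

    def replicate(self):
        # The master duplicates every pending write to every slave.
        for key, value in self._pending:
            for slave in self.slaves:
                slave[key] = value
        self._pending.clear()


store = MasterSlaveStore()
store.write("balance", 100)
stale = store.read("balance")   # write not propagated yet
store.replicate()
fresh = store.read("balance")   # now visible on the slaves
print(stale, fresh)
```

The first read misses the write because it reaches a slave before propagation, which is the "critical reads may be incorrect" case.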
Scaling out RDBMS
Sharding (Partitioning)
● Scales well for both reads and writes
● Not transparent: the application needs to be partition-aware
● Can no longer have relationships/joins across partitions
● Loss of referential integrity across shards
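A small sketch of what "partition-aware" can mean in practice, assuming hash-based placement over a fixed number of shards. The hashing scheme, `shard_for`, and `ShardedStore` are assumptions made for this example, not the method of any particular database.

```python
import hashlib


def shard_for(key, n_shards):
    """Map a key to a shard index with a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


class ShardedStore:
    """Toy partition-aware client: it picks the shard for every operation."""

    def __init__(self, n_shards=3):
        self.shards = [{} for _ in range(n_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def scan(self, predicate):
        # A cross-shard query has to visit every shard and merge the results
        # in the application, since no single node can join across partitions.
        return sorted(k for shard in self.shards for k in shard if predicate(k))


store = ShardedStore()
for user in ["alice", "bob", "carol", "dave"]:
    store.put(user, {"name": user})

print(store.get("carol"))
print(store.scan(lambda k: k.startswith(("a", "b"))))
```

Single-key reads and writes touch one shard each, which is why both scale; anything spanning keys falls back on the application.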
Scaling out RDBMS
[Figure: Master / Slave]
Scaling out RDBMS
[Figure: Sharding (Partitioning)]
Scaling out RDBMS
Other approaches
● Multi-master replication
● INSERT only, no UPDATEs/DELETEs
● No JOINs, thereby reducing query time
○ This involves de-normalizing data
● In-memory databases
What is NoSQL?
● The name stands for “Not Only SQL”
● The term NoSQL was introduced by Carlo Strozzi in 1998 to name his file-based database
● It was reintroduced by Eric Evans when an event was organized to discuss open-source distributed databases
○ Eric states that “… but the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for. …”
What is NoSQL?
Key features (Advantages)
● non-relational
● don’t require a schema
● data are replicated to multiple nodes (so identical and fault-tolerant) and can be partitioned:
○ down nodes are easily replaced
○ no single point of failure
● horizontally scalable
What is NoSQL?
Key features (Advantages)
● cheap, easy to implement (open-source)
● massive write performance
● fast key-value access
What is NoSQL?
Disadvantages
● Don’t fully support relational features
○ no join, group by, order by operations (except within partitions)
○ no referential integrity constraints across partitions
● Relaxed ACID (see CAP theorem), so fewer guarantees
● No easy integration with other applications that support SQL
Who is using them?
3 major papers for NoSQL
● Three major papers were the “seeds” of the NoSQL movement:
○ BigTable (Google)
○ Dynamo (Amazon)
■ Ring partitioning and replication
■ Gossip protocol (discovery and error detection)
■ Distributed key-value data stores
■ Eventual consistency
○ CAP Theorem
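Ring partitioning with replication can be sketched roughly as follows. This is a simplified illustration, assuming one ring position per node and MD5 as the hash; `HashRing` and `ring_hash` are hypothetical names and the sketch omits details such as virtual nodes.

```python
import bisect
import hashlib


def ring_hash(value):
    """Position of a string on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy ring: each key is stored on the next `replicas` nodes clockwise."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def nodes_for(self, key):
        # Find the first node at or after the key's position, wrapping around.
        start = bisect.bisect_left(self.positions, ring_hash(key))
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replicas=2)
owners = ring.nodes_for("user:42")
print(owners)

# Removing one node only changes the placement of keys that node held.
full = HashRing(["node-a", "node-b", "node-c", "node-d"], replicas=1)
smaller = HashRing(["node-a", "node-b", "node-d"], replicas=1)
keys = [f"user:{i}" for i in range(100)]
moved = [k for k in keys if full.nodes_for(k) != smaller.nodes_for(k)]
print(all(full.nodes_for(k) == ["node-c"] for k in moved))
```

The second check illustrates why a down node is easy to replace: keys owned by the other nodes keep their placement.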
The perfect storm
● Large datasets, acceptance of alternatives, and dynamically-typed data have come together in a “perfect storm”
● Not a backlash against RDBMS
● SQL is a rich query language that cannot be rivaled by the current list of NoSQL offerings
CAP Theorem
● Consider three properties of a distributed system (sharing data):
○ Consistency:
■ Reads and writes are always executed atomically and are strictly consistent (linearizable). Put differently, all clients have the same view of the data at all times.
○ Availability:
■ Every non-failing node in the system can always accept read and write requests from clients and will eventually return a meaningful response, i.e. not an error message.
○ Partition-tolerance:
■ System properties (consistency and/or availability) hold even when network failures prevent some machines from communicating with others. A system can continue to operate in the presence of a network partition.
● Brewer’s CAP Theorem:
○ For any system sharing data, it is “impossible” to guarantee all three of these properties simultaneously
○ You can have at most two of these three properties for any shared-data system
○ Scaling out means partitioning, so P has to be tolerated. That leaves either C or A to choose from (a traditional DBMS prefers C over A and P)
○ In almost all cases, you would choose A over C (except in specific applications such as order processing)
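The choice between C and A during a partition can be illustrated with a toy replica. This is only a sketch under strong simplifications: the `"CP"` and `"AP"` mode labels, the `Replica` class, and the single stale value are assumptions made for the example.

```python
class Replica:
    """Toy replica that is either consistency-first (CP) or availability-first (AP)."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP" (labels assumed for this sketch)
        self.value = "v1"
        self.partitioned = False  # True when cut off from the other replicas

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Refuse to answer rather than risk returning stale data.
            raise RuntimeError("unavailable during partition")
        # AP replicas (and healthy CP replicas) answer from local state.
        return self.value


def simulate(mode):
    replica = Replica(mode)
    replica.partitioned = True   # network failure isolates this replica
    latest = "v2"                # a write accepted elsewhere, not seen here
    try:
        return replica.read() == latest, True    # (consistent, available)
    except RuntimeError:
        return True, False       # never returned stale data, but unavailable


print(simulate("CP"))  # gives up availability
print(simulate("AP"))  # gives up consistency
```

Each call returns a `(consistent, available)` pair; under a partition neither mode gets both.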
CAP Theorem
[Figure: Consistency]
CAP Theorem
Consistency
● There are 2 types of consistency:
○ Strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
○ Weak consistency – BASE (Basically Available, Soft-state, Eventual consistency)
CAP Theorem
Consistency
● A consistency model determines the rules for visibility and apparent order of updates
● Example:
○ Row X is replicated on nodes M and N
○ Client A writes row X to node N
○ Some period of time t elapses
○ Client B reads row X from node M
○ Does client B see the write from client A?
○ Consistency is a continuum with tradeoffs
○ For NoSQL, the answer would be: “maybe”
○ The CAP theorem states: “strong consistency can't be achieved at the same time as availability and partition-tolerance”
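The row X example can be played through with a simulated clock. The sketch below assumes a fixed propagation delay between replicas and applies replicated writes lazily at read time; `ReplicatedRow` and the `now` parameter are hypothetical, introduced only to make the elapsed time t explicit.

```python
class ReplicatedRow:
    """Row X on nodes M and N with a fixed propagation delay (assumed model)."""

    def __init__(self, delay):
        self.delay = delay
        self.nodes = {"M": "old", "N": "old"}
        self.in_flight = []  # (arrival_time, node, value)

    def write(self, node, value, now):
        self.nodes[node] = value
        # Schedule delivery of the write to every other replica.
        for other in self.nodes:
            if other != node:
                self.in_flight.append((now + self.delay, other, value))

    def read(self, node, now):
        # Apply every replicated write that has arrived by `now`.
        arrived = [m for m in self.in_flight if m[0] <= now]
        for message in sorted(arrived):
            self.nodes[message[1]] = message[2]
        self.in_flight = [m for m in self.in_flight if m[0] > now]
        return self.nodes[node]


row = ReplicatedRow(delay=5)
row.write("N", "new", now=0)   # client A writes row X to node N
early = row.read("M", now=2)   # client B reads from M after t = 2
late = row.read("M", now=6)    # client B reads from M after t = 6
print(early, late)
```

Whether client B sees the write depends on how t compares with the propagation delay, which is the "maybe" in the example.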
CAP Theorem
Consistency
● Cloud computing
○ ACID is hard to achieve; moreover, it is not always required, e.g. for blogs, status updates, product listings, etc.
NoSQL
● “No-schema” is a common characteristic of most NoSQL storage systems
● Provide “flexible” data types
● Query languages other than, or in addition to, SQL
● Distributed: horizontal scaling
● Less structured data
NoSQL Categories
Key-value
● Focus on scaling to huge amounts of data
● Designed to handle massive load
● Based on Amazon’s Dynamo paper
● Data model: (global) collection of key-value pairs
● Dynamo ring partitioning and replication
● Example: (DynamoDB)
○ Items have one or more attributes (name, value)
○ An attribute can be single-valued or multi-valued, like a set
○ Items are combined into a table
NoSQL Categories
Key-value
● Basic API access:
○ get(key): extract the value given a key
○ put(key, value): create or update the value given its key
○ delete(key): remove the key and its associated value
○ execute(key, operation, parameters): invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.)
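The four operations can be sketched as an in-memory Python class. This is an illustration of the interface only, assuming values are ordinary Python objects and that `execute` simply calls a method of the stored object by name; it is not the API of any specific key-value store.

```python
class KeyValueStore:
    """In-memory sketch of the four basic operations."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        # Invoke a method of the stored data structure by name,
        # e.g. "append" on a list or "add" on a set.
        return getattr(self._data[key], operation)(*parameters)


kv = KeyValueStore()
kv.put("cart:7", ["book"])
kv.execute("cart:7", "append", "pen")
kv.put("tags:7", {"new"})
kv.execute("tags:7", "add", "sale")
tags_before_delete = set(kv.get("tags:7"))
kv.delete("tags:7")
print(kv.get("cart:7"), kv.get("tags:7"))
```

The store never looks inside a value except through `execute`, which is what keeps the data model simple.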
NoSQL Categories
Key-value
● Pros:
○ very fast
○ very scalable (horizontally distributed to nodes based on key)
○ simple data model
○ eventual consistency
○ fault-tolerance
● Cons:
○ Can’t model more complex data structures such as objects
NoSQL Categories
Key-value
Name | Producer | Data model | Querying
SimpleDB | Amazon | set of couples (key, {attribute}), where attribute is a couple (name, value) | restricted SQL; select, delete, GetAttributes, and PutAttributes operations
Redis | Salvatore Sanfilippo | set of couples (key, value), where value is a simple typed value, list, ordered (according to ranking) or unordered set, or hash value | primitive operations for each value type
NoSQL Categories
[Figure: Key-value]
NoSQL Categories
Column-based
● Based on Google’s BigTable paper
● Like column-oriented relational databases (store data in column order) but with a twist
● Tables similar to an RDBMS, but able to handle semi-structured data
● Data model:
○ Collection of column families
○ Column family = (key, value), where value = set of related columns (standard, super)
○ Indexed by row key, column key, and timestamp
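The three-level index (row key, column key, timestamp) can be pictured as nested maps. The sketch below is an assumption-laden simplification: `ColumnFamily` is a hypothetical class, the sample names and values are made up, and real systems add sorting, persistence, and distribution on top.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy column family: row key -> column key -> {timestamp: value}."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        self.rows[row_key][column][timestamp] = value

    def get(self, row_key, column, timestamp=None):
        versions = self.rows.get(row_key, {}).get(column, {})
        # Keep only versions written at or before the requested timestamp.
        visible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(visible)] if visible else None

    def columns(self, row_key):
        # Rows are sparse: each row lists only the columns it actually has.
        return sorted(self.rows.get(row_key, {}))


profiles = ColumnFamily()
profiles.put("Cath", "address", "Busan", timestamp=1)   # made-up sample data
profiles.put("Cath", "address", "Seoul", timestamp=2)
profiles.put("TerryCho", "gender", "male", timestamp=1)

print(profiles.get("Cath", "address"))
print(profiles.get("Cath", "address", timestamp=1))
print(profiles.columns("TerryCho"))
```

A read without a timestamp returns the newest version, while a read with one returns the value as of that time.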
NoSQL Categories
[Figure: Column-based]
NoSQL Categories
[Figure: Keyspace ~ Schema, Column Family ~ Table]
NoSQL Categories
[Figure: Row structure]
NoSQL Categories
Column-based
● One column family can have a variable number of columns
● Cells within a column family are sorted “physically”
● Very sparse: most cells have null values
● Comparison: RDBMS vs column-based NoSQL
○ Query on multiple tables
■ RDBMS: must fetch data from several places on disk and glue it together
■ Column-based NoSQL: only fetches the column families of the columns required by a query (all columns in a column family are stored together on disk, so multiple rows can be retrieved in one read operation: data locality)
NoSQL Categories
[Figure: Column-based]
NoSQL Categories
Column-based
● Example: (Cassandra column family; timestamps removed for simplicity)
UserProfile = {
  Cassandra = {
    emailAddress: "[email protected]", age: "20"
  }
  TerryCho = {
    emailAddress: "[email protected]", gender: "male"
  }
  Cath = {
    emailAddress: "[email protected]",
    age: "20", gender: "female", address: "Seoul"
  }
}
NoSQL Categories
Column-based
Name | Producer | Data model | Querying
BigTable | Google | set of couples (key, {value}) | selection (by combination of row, column, and time stamp ranges)
HBase | Apache | groups of columns (a BigTable clone) | JRuby IRB-based shell (similar to SQL)
Hypertable | Hypertable | like BigTable | HQL (Hypertable Query Language)
Cassandra | Apache (originally Facebook) | columns, groups of columns corresponding to a key (supercolumns) | simple selections on key, range queries, column or columns ranges
PNUTS | Yahoo | (hashed or ordered) tables, typed arrays, flexible schema | selection and projection from a single table (retrieve an arbitrary single record by primary key, range queries, complex predicates, ordering, top-k)
NoSQL Categories
Document-based
● Can model more complex objects
● Inspired by Lotus Notes
● Data model: collection of documents
● Document: JSON (JavaScript Object Notation, a data model of key-value pairs that supports objects, records, structs, lists, arrays, maps, dates, and Booleans, with nesting), XML, or other semi-structured formats
NoSQL Categories
Document-based
● Example: (MongoDB) document
{
  Name: "Jaroslav",
  Address: "Malostranske nám. 25, 118 00 Praha 1",
  Grandchildren: {Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1"},
  Phones: ["123-456-7890", "234-567-8963"]
}
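To show how simple selections over such documents might work, here is a plain-Python imitation. It is not MongoDB's API: `matches` and `find` are hypothetical helpers, the dotted-path convention is an assumption borrowed for readability, and the second document is made up.

```python
def matches(document, query):
    """True if every field in `query` equals the same field in `document`.

    Dotted names such as "Grandchildren.Claire" descend into nested objects.
    """
    for path, expected in query.items():
        value = document
        for part in path.split("."):
            if not isinstance(value, dict) or part not in value:
                return False
            value = value[part]
        if value != expected:
            return False
    return True


def find(collection, query):
    return [doc for doc in collection if matches(doc, query)]


people = [
    {
        "Name": "Jaroslav",
        "Address": "Malostranske nám. 25, 118 00 Praha 1",
        "Grandchildren": {"Claire": "7", "Barbara": "6", "Magda": "3",
                          "Kirsten": "1", "Otis": "3", "Richard": "1"},
        "Phones": ["123-456-7890", "234-567-8963"],
    },
    {"Name": "Eva", "Phones": []},  # hypothetical second document
]

print([d["Name"] for d in find(people, {"Grandchildren.Claire": "7"})])
print([d["Name"] for d in find(people, {"Name": "Eva"})])
```

The two documents have different fields, yet both live in the same collection and can be queried the same way.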
NoSQL Categories
[Figure: Document-based]
NoSQL Categories
Document-based
Name | Producer | Data model | Querying
MongoDB | 10gen | object-structured documents stored in collections; each object has a primary key called ObjectId | manipulations with objects in collections (find object or objects via simple selections and logical expressions, delete, update)
Couchbase | Couchbase | document as a list of named (structured) items (JSON document) | by key and key range, views via JavaScript and MapReduce
NoSQL Categories
Graph-based
● Focus on modeling the structure of data (interconnectivity)
● A graph is composed of two elements: a node and a relationship
● Scales to the complexity of data
● Inspired by mathematical graph theory (G = (V, E))
● Data model:
○ (Property graph) nodes and edges
■ Nodes may have properties (including ID)
■ Edges may have labels or roles
○ Key-value pairs on both
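A property graph along these lines can be sketched with two plain containers. `PropertyGraph`, the `FOLLOWS` label, and the sample nodes are assumptions made for illustration; the sketch shows a single-step lookup and a recursive reachability query, not the interface of any graph database.

```python
from collections import deque


class PropertyGraph:
    """Toy property graph: nodes and edges both carry key-value pairs."""

    def __init__(self):
        self.nodes = {}   # node id -> properties
        self.edges = []   # (source id, label, target id, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, label, target, **properties):
        self.edges.append((source, label, target, properties))

    def neighbors(self, node_id, label):
        # Single-step query: follow edges with one label from one node.
        return [dst for src, lab, dst, _ in self.edges
                if src == node_id and lab == label]

    def reachable(self, start, label):
        # Recursive query: everything reachable by repeating the same step.
        seen, queue = set(), deque([start])
        while queue:
            for nxt in self.neighbors(queue.popleft(), label):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


g = PropertyGraph()
for name in ["ann", "ben", "cy"]:
    g.add_node(name, kind="person")
g.add_edge("ann", "FOLLOWS", "ben", since=2018)
g.add_edge("ben", "FOLLOWS", "cy", since=2019)

print(g.neighbors("ann", "FOLLOWS"))
print(sorted(g.reachable("ann", "FOLLOWS")))
```

The recursive query is the kind of traversal that would need repeated self-joins in a relational schema.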
NoSQL Categories
Graph-based
● Interfaces and query languages vary
● Single-step vs path expressions vs full recursion
● Example:
○ Neo4j, FlockDB, Pregel, InfoGrid …
NoSQL Categories
[Figure: Graph-based]
NoSQL Categories
[Figure: Comparison]
Conclusion
● NoSQL databases cover only a part of data-intensive cloud applications (mainly Web applications)
● Problems with cloud computing:
○ SaaS (Software as a Service, or on-demand software) applications require enterprise-level functionality, including ACID transactions, security, and other features associated with commercial RDBMS technology, i.e. NoSQL should not be the only option in the cloud
○ Hybrid solutions:
■ Voldemort with MySQL as one of its storage backends
■ Deal with NoSQL data as semi-structured data -> integrating RDBMS and NoSQL via SQL/XML
Conclusion
● Next generation of highly scalable and elastic RDBMS: NewSQL databases (from April 2011)
○ they are designed to scale out horizontally on shared-nothing machines,
○ they still provide ACID guarantees,
○ applications interact with the database primarily using SQL,
○ the system employs a lock-free concurrency control scheme to avoid user shut down,
○ the system provides higher performance than is available from traditional systems.
Q&A
Thank you for following along.
We hope to reach success together.