No SQL
Based on slides by
Mike Franklin and Jimmy Lin
Big Data (some old numbers)
• Facebook:
130TB/day: user logs
200-400TB/day: 83 million pictures
Many components:
• Storage systems
• Database systems
• Data mining and statistical algorithms
• Visualization
What is NoSQL?
• An emerging “movement” around non-relational
software for Big Data
• Roots are in the Google and Amazon homegrown
software stacks
[Figure: traditional DBMS stack next to the NoSQL software stack, layers as listed]
Traditional DBMS stack: Relational Operators; Access Methods; Buffer Management; Disk Space Management
NoSQL stack: Analytics Interface (Pig, Hive, …) and Imperative Lang (RoR, Java, Scala, …); Data Parallel Processing (MapReduce/Hadoop); Distributed Key/Value or Column Store (Cassandra, HBase, Redis, …); Scalable File System (GFS, HDFS, …)
NoSQL features
• Scalability is crucial!
load increased rapidly for many applications
• Large servers are expensive
• Sometimes there is no well-defined schema
Flavors of NoSQL
Key-Value Stores
Document Databases
Examples include: MongoDB, CouchDB, Terrastore
The Structure Spectrum
Example: MongoDB
{ "_id": ObjectId("4efa8d2b7d284dad101e4bc9"),
"Last Name": "Cousteau",
"First Name": "Jacques-Yves",
"Date of Birth": "06-1-1910" },
{ "_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
"Last Name": "PELLERIN",
"First Name": "Franck",
"Date of Birth": "09-19-1983",
"Address": "1 chemin des Loges",
"City": "VERSAILLES" }
Example Document Database: MongoDB
Key features include:
• JSON-style documents
• actually uses BSON (a binary encoding of JSON-like documents)
• replication for high availability
• auto-sharding for scalability
• document-based queries
• can create an index on any attribute
• for faster reads
MongoDB Terminology
relational term   <==>  MongoDB equivalent
----------------------------------------------------------
database          <==>  database
table             <==>  collection
row               <==>  document
attributes        <==>  fields (field-name:value pairs)
primary key       <==>  the _id field, which is the key associated with the document
JSON
• JSON is an alternative data model for
semi-structured data.
• JavaScript Object Notation
The _id Field
Every MongoDB document must have an _id field.
• its value must be unique within the collection
• acts as the primary key of the collection
• it is the key in the key/value pair
• If you create a document without an _id field:
• MongoDB adds the field for you
• assigns it a unique BSON ObjectId
• example from the MongoDB shell:
> db.test.save({ rating: "PG-13" })
> db.test.find()
{ "_id" : ObjectId("528bf38ce6d3df97b49a0569"), "rating" : "PG-13" }
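The behavior described above can be sketched in plain Python, with an in-memory list standing in for a collection. The `save` helper and the random 24-character hex id are assumptions made for illustration; MongoDB generates real ObjectId values with its own scheme.

```python
import secrets

def save(collection, document):
    """Append a copy of document to collection, adding an _id if it is missing."""
    doc = dict(document)
    if "_id" not in doc:
        # Stand-in for an ObjectId: 12 random bytes shown as 24 hex characters.
        doc["_id"] = secrets.token_hex(12)
    # The _id value must be unique within the collection.
    if any(existing["_id"] == doc["_id"] for existing in collection):
        raise ValueError("duplicate _id: %r" % (doc["_id"],))
    collection.append(doc)
    return doc["_id"]

test = []  # hypothetical in-memory "collection"
new_id = save(test, {"rating": "PG-13"})
print(test)
```

Because the id is random, the printed value differs on every run; only its shape (24 hex characters) is fixed.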
Data Modeling in MongoDB
Need to determine how to map
entities and relationships => collections of documents
• Could in theory give each type of entity:
• its own (flexibly formatted) type of document
• those documents would be stored in the same collection
• However, it can make sense to group different types
of entities together.
• create an aggregate containing data that tends
to be accessed together
Capturing Relationships in MongoDB
• Two options:
1. store references to other documents using their
_id values
2. embed documents within other documents
Example relationships
Consider the following example documents:
{
"_id":ObjectId("52ffc33cd85242f436000001"),
"name": "Tom Hanks",
"contact": "987654321",
"dob": "01-01-1991"
}
{
"_id":ObjectId("52ffc4a5d85242602e000000"),
"building": "22 A, Indiana Apt",
"pincode": 123456,
"city": "Los Angeles",
"state": "California"
}
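A minimal Python sketch of how a person document and an address document like these could be related, either by storing a reference to the address's _id or by embedding the address inside the person document. Plain dictionaries stand in for documents, strings stand in for ObjectId values, and the field names `address_ids` and `address` are hypothetical.

```python
user = {
    "_id": "52ffc33cd85242f436000001",
    "name": "Tom Hanks",
    "contact": "987654321",
    "dob": "01-01-1991",
}
address = {
    "_id": "52ffc4a5d85242602e000000",
    "building": "22 A, Indiana Apt",
    "pincode": 123456,
    "city": "Los Angeles",
    "state": "California",
}

# Reference: the user stores only the address's _id, so reading the
# address needs a second lookup in the (hypothetical) address collection.
addresses_by_id = {address["_id"]: address}
user_with_ref = dict(user, address_ids=[address["_id"]])
resolved = [addresses_by_id[i] for i in user_with_ref["address_ids"]]

# Embedding: the address is stored inside the user document, so one
# read returns everything that tends to be accessed together.
user_embedded = dict(user, address=[address])

print(resolved[0]["city"], user_embedded["address"][0]["city"])
```

References avoid duplicating the address when several documents share it; embedding trades that for a single read.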
Projection
• Specify the fields that you want in the output by setting them to
1 (setting a field to 0 hides it)
• Example:
>db.movies.find({},{"title":1,_id:0})
(will report the title but not the id)
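As a rough illustration of what an inclusion projection does, the following Python sketch applies a projection document to in-memory dictionaries. The `project` helper and the sample movies are hypothetical, and only the simple case shown above is modeled (fields set to 1, with _id kept unless it is set to 0).

```python
def project(doc, projection):
    """Keep only the fields set to 1; _id is kept unless it is set to 0."""
    wanted = {field for field, flag in projection.items() if flag == 1}
    if projection.get("_id", 1) != 0:
        wanted.add("_id")
    return {field: value for field, value in doc.items() if field in wanted}

# Hypothetical sample data.
movies = [
    {"_id": 1, "title": "Movie A", "year": 2000},
    {"_id": 2, "title": "Movie B", "year": 1999},
]

titles = [project(m, {"title": 1, "_id": 0}) for m in movies]
print(titles)
```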
Selection
• You can specify conditions on the corresponding attributes
in the first argument of find (the second argument is the projection):
>db.movies.find({ rating: "R", year: 2000 }, { name: 1,
runtime: 1 })
• Operators for other types of comparisons:
MongoDB SQL equivalent
$gt, $gte >, >=
$lt, $lte <, <=
$ne !=
Example: find the names of movies with earnings <= 200000
> db.movies.find({ earnings: { $lte: 200000 } }, { name: 1 })
• The logical operators $and, $or, and $nor take an array of conditions
and apply the logical operator across the conditions in the array.
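To make the semantics of selection documents concrete, here is a small Python sketch that evaluates them against in-memory dictionaries. The `matches` function and the sample movies are hypothetical, and the sketch simplifies real MongoDB behavior (for example, a missing field never satisfies an operator condition here).

```python
import operator

COMPARISONS = {
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$ne": operator.ne,
}

def matches(doc, query):
    """Return True if doc satisfies every condition in the selection document."""
    for key, cond in query.items():
        if key == "$and":
            ok = all(matches(doc, q) for q in cond)
        elif key == "$or":
            ok = any(matches(doc, q) for q in cond)
        elif key == "$nor":
            ok = not any(matches(doc, q) for q in cond)
        elif isinstance(cond, dict):
            # Operator conditions on one field, e.g. {"earnings": {"$lte": 200000}}.
            ok = key in doc and all(
                COMPARISONS[op](doc[key], value) for op, value in cond.items()
            )
        else:
            # Plain equality, e.g. {"rating": "R"}.
            ok = doc.get(key) == cond
        if not ok:
            return False
    return True

# Hypothetical sample data.
movies = [
    {"name": "A", "rating": "R", "year": 2000, "earnings": 150000},
    {"name": "B", "rating": "PG", "year": 2000, "earnings": 500000},
    {"name": "C", "rating": "R", "year": 1995, "earnings": 200000},
]

low_earners = [m["name"] for m in movies if matches(m, {"earnings": {"$lte": 200000}})]
pg_or_1995 = [m["name"] for m in movies
              if matches(m, {"$or": [{"rating": "PG"}, {"year": 1995}]})]
print(low_earners, pg_or_1995)
```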
Aggregation
• Recall the aggregate operators in SQL: AVG(), SUM(), etc.
More generally, aggregation involves computing a result
from a collection of data.
Simple Aggregations
• db.collection.count(<selection>)
returns the number of documents in the collection
that satisfy the specified selection document
Example: how many R-rated movies are shorter than 90 minutes?
>db.movies.count({ rating: "R", runtime: { $lt: 90 }})
• db.collection.distinct(<field>, <selection>)
returns an array with the distinct values of the specified field
in documents that satisfy the specified selection document
if the selection is omitted, you get all distinct values of that field
Example: which actors have been in one or more of the top 10 grossing movies?
>db.movies.distinct("actors.name", { earnings_rank: { $lte: 10 }})
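The two simple aggregations can be mimicked over an in-memory list as in the sketch below. The helper names and the sample movies are hypothetical, and selections are written as Python predicates rather than selection documents to keep the example short.

```python
# Hypothetical sample data.
movies = [
    {"title": "M1", "rating": "R", "runtime": 85,
     "actors": [{"name": "Ann"}, {"name": "Bob"}]},
    {"title": "M2", "rating": "R", "runtime": 120,
     "actors": [{"name": "Bob"}]},
    {"title": "M3", "rating": "PG", "runtime": 80,
     "actors": [{"name": "Cy"}]},
]

def count(collection, selection):
    """Number of documents for which selection(doc) is true."""
    return sum(1 for doc in collection if selection(doc))

def distinct_actor_names(collection, selection=lambda doc: True):
    """Sorted distinct actors.name values across the selected documents."""
    names = set()
    for doc in collection:
        if selection(doc):
            names.update(actor["name"] for actor in doc["actors"])
    return sorted(names)

short_r = count(movies, lambda d: d["rating"] == "R" and d["runtime"] < 90)
r_actors = distinct_actor_names(movies, lambda d: d["rating"] == "R")
all_actors = distinct_actor_names(movies)
print(short_r, r_actors, all_actors)
```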
Aggregation Pipeline
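• A very powerful way to write queries in MongoDB is to use
aggregation pipelines: a sequence of stages, each of which transforms
the documents passed to it by the previous stage.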
Aggregation Pipeline example
• Example for the zipcodes database:
{
"_id": "10280",
"city": "NEW YORK",
"state": "NY",
"pop": 5574,
"loc": [
-74.016323,
40.710537
]
}
> db.zipcodes.aggregate( [
{ $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
{ $match: { totalPop: { $gte: 10*1000*1000 } } }
])
Here we use $group to group documents by state, compute the sum of the
population, and output documents with _id and totalPop (_id holds the name of
the state). The next stage, $match, keeps the states that have at least
10M population and outputs the state and total population.
More here:
https://docs.mongodb.com/v3.0/tutorial/aggregation-zip-code-data-set/
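The two stages above can be traced by hand with a small in-memory sketch. The function names are hypothetical, and every document except the first is invented sample data rather than real zip code data.

```python
from collections import defaultdict

# First document is from the slide; the rest are invented for illustration.
zipcodes = [
    {"_id": "10280", "city": "NEW YORK", "state": "NY", "pop": 5574},
    {"_id": "ny-2", "city": "HYPOTHETICAL CITY 1", "state": "NY", "pop": 6000000},
    {"_id": "ny-3", "city": "HYPOTHETICAL CITY 2", "state": "NY", "pop": 5000000},
    {"_id": "co-1", "city": "EDGEWATER", "state": "CO", "pop": 13154},
    {"_id": "co-2", "city": "HYPOTHETICAL CITY 3", "state": "CO", "pop": 2000},
]

def group_total_pop_by_state(docs):
    """Like the $group stage: one output document per state with summed pop."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc["state"]] += doc["pop"]
    return [{"_id": state, "totalPop": pop} for state, pop in totals.items()]

def match_min_total(groups, minimum):
    """Like the $match stage: keep groups with totalPop >= minimum."""
    return [g for g in groups if g["totalPop"] >= minimum]

big_states = match_min_total(group_total_pop_by_state(zipcodes), 10 * 1000 * 1000)
print(big_states)
```

The order matters: filtering on totalPop is only possible after the grouping stage has produced that field.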
continued:
db.zipcodes.aggregate( [
{ $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
{ $match: { totalPop: { $gte: 10*1000*1000 } } }
])
In SQL:
SELECT state, SUM(pop) AS totalPop
FROM zipcodes
GROUP BY state
HAVING totalPop >= (10*1000*1000)
more examples:
db.zipcodes.aggregate( [
{ $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
{ $group: { _id: "$_id.state", avgCityPop: { $avg: "$pop" } } }
])
First we group by city and state, and for each group we compute
the population.
Then we group by state and compute the average city population.
A document after the first stage:
{
"_id" : {
"state" : "CO",
"city" : "EDGEWATER"
},
"pop" : 13154
}
A document in the final output:
{
"_id" : "MN",
"avgCityPop" : 5335
}
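A two-stage grouping of this kind can be sketched in Python as follows. The function name and the sample documents are hypothetical (the populations are chosen so that EDGEWATER sums to 13154 and the MN average is 5335, echoing the documents shown above).

```python
from collections import defaultdict

def avg_city_pop_by_state(docs):
    # Stage 1: sum pop over all zip codes of each (state, city) pair.
    city_pop = defaultdict(int)
    for doc in docs:
        city_pop[(doc["state"], doc["city"])] += doc["pop"]
    # Stage 2: average the per-city totals within each state.
    per_state = defaultdict(list)
    for (state, _city), pop in city_pop.items():
        per_state[state].append(pop)
    return {state: sum(pops) / len(pops) for state, pops in per_state.items()}

# Invented sample data: two zip codes for EDGEWATER, one for each other city.
zipcodes = [
    {"state": "CO", "city": "EDGEWATER", "pop": 10000},
    {"state": "CO", "city": "EDGEWATER", "pop": 3154},
    {"state": "CO", "city": "HYPOTHETICAL CITY", "pop": 20000},
    {"state": "MN", "city": "ANOTHER HYPOTHETICAL CITY", "pop": 5335},
]

averages = avg_city_pop_by_state(zipcodes)
print(averages)
```

Averaging directly over zip codes would give a different answer, which is why the first grouping by city is needed.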
Aggregation Pipeline example
orders collection:
{ c_id: "A123", amount: 500, status: "A" }
{ c_id: "A123", amount: 50, status: "A" }
{ c_id: "B132", amount: 200, status: "A" }
{ c_id: "A123", amount: 500, status: "D" }
After $match (status "A"):
{ c_id: "A123", amount: 500, status: "A" }
{ c_id: "A123", amount: 50, status: "A" }
{ c_id: "B132", amount: 200, status: "A" }
After $group (sum of amount per c_id):
{ _id: "A123", total: 550 }
{ _id: "B132", total: 200 }
db.orders.aggregate( [
{ $match: { status: "A" } },
{ $group: { _id: "$c_id", total: { $sum: "$amount" } } }
])
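The orders example can be checked by hand with a short Python sketch that filters on status and then sums amount per c_id. Plain dictionaries stand in for documents, and the variable names are chosen for illustration.

```python
from collections import defaultdict

orders = [
    {"c_id": "A123", "amount": 500, "status": "A"},
    {"c_id": "A123", "amount": 50, "status": "A"},
    {"c_id": "B132", "amount": 200, "status": "A"},
    {"c_id": "A123", "amount": 500, "status": "D"},
]

# Like the $match stage: keep only orders with status "A".
matched = [order for order in orders if order["status"] == "A"]

# Like the $group stage: one result per c_id with the summed amount.
totals = defaultdict(int)
for order in matched:
    totals[order["c_id"]] += order["amount"]
result = [{"_id": c_id, "total": total} for c_id, total in totals.items()]

print(result)
```

The order with status "D" is dropped before grouping, so it does not contribute to the total for A123.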
Other Structure Issues
• NoSQL: a) Tables are unnatural, b) “joins” are evil,
c) need to be able to “grep” my data
Fault Tolerance
• DBs: coarse-grained FT – if trouble, restart
transaction
Fewer, Better nodes, so failures are rare
Transactions allow you to kill a job and easily restart it
Cloud Computing Computation Models
Similar to SQL!!
Typical Large-Data Problem
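[Figure: MapReduce data flow]
map outputs: (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
Shuffle and Sort: aggregate values by keys
reduce inputs: a → [1, 5]; b → [2, 7]; c → [2, 3, 6, 8]
reduce outputs: (r1, s1) (r2, s2) (r3, s3)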
MapReduce
• Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
All values with the same key are sent to the same
reducer
• The execution framework handles everything else…
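A single-process Python sketch of this programming model is shown below, with word count as the user-supplied pair of functions. The `map_reduce` driver is a hypothetical stand-in for the execution framework; it passes each key to the reducer together with the list of all its values, and it ignores distribution, scheduling, and failures.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run map, then shuffle and sort, then reduce, all in one process."""
    # Map phase: each input (k, v) yields zero or more (k', v') pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle and sort: all values with the same key end up together.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: keys are visited in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def word_count_map(doc_id, text):
    return [(word, 1) for word in text.split()]

def word_count_reduce(word, counts):
    return [(word, sum(counts))]

# Hypothetical input: (document id, text) pairs.
records = [(0, "a b a"), (1, "c b")]
counts = map_reduce(records, word_count_map, word_count_reduce)
print(counts)
```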
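[Figure: MapReduce data flow with combiners and partitioners]
map outputs: (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
combine (one per mapper): (a, 1) (b, 2) | (c, 9) | (a, 5) (c, 2) | (b, 7) (c, 8)
partition (one per mapper): assigns each key to a reducer
reduce outputs: (r1, s1) (r2, s2) (r3, s3)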
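The combine and partition steps can be sketched in Python using the key/value pairs from the figure. Summation as the combine and reduce operation and the byte-sum partitioning scheme are assumptions chosen for illustration; the figure itself leaves the reduce function abstract.

```python
from collections import defaultdict

def combine(pairs):
    """Sum values per key within one mapper's output (a local mini-reduce)."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())

def partition(key, num_reducers):
    """Pick a reducer for a key deterministically (hypothetical scheme)."""
    return sum(key.encode("utf-8")) % num_reducers

# One list of (key, value) pairs per mapper, as in the figure.
mapper_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]
combined = [combine(pairs) for pairs in mapper_outputs]

num_reducers = 3
reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
for pairs in combined:
    for key, value in pairs:
        # Every pair with the same key is routed to the same reducer.
        reducer_inputs[partition(key, num_reducers)][key].append(value)

totals = {key: sum(values)
          for inputs in reducer_inputs
          for key, values in inputs.items()}
print(combined, totals)
```

Combining shrinks the second mapper's output from two pairs to one, which is the point: less intermediate data crosses the network.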
Two more details…
• Barrier between map and reduce phases
But we can begin copying intermediate data
earlier
• Keys arrive at each reducer in sorted order
No enforced ordering across reducers
MapReduce Overall Architecture
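[Figure: MapReduce overall architecture. The User Program (1) submits a job to the Master, which (2) schedules map tasks and (2) schedules reduce tasks on workers. Map workers (3) read the input splits (split 0 … split 4) and (4) write intermediate data locally; reduce workers (5) read that data remotely and (6) write the output files (output file 0, output file 1).]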
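[Figure: Compute Nodes and SAN]
[Figure: HDFS architecture. The Application calls the HDFS Client, which sends (file name, block id) to the HDFS namenode and receives (block id, block location). The namenode holds the file namespace (e.g. /foo/bar → block 3df2), sends instructions to the datanodes, and receives datanode state. The client sends (block id, byte range) to an HDFS datanode and receives block data; each datanode stores its blocks on a Linux file system.]
[Figure: a cluster of slave nodes]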
MapReduce/GFS Summary
• Simple, but powerful programming model
• Scales to handle petabyte+ workloads
Google: six hours and two minutes to sort 1PB (10
trillion 100-byte records) on 4,000 computers
Yahoo!: 16.25 hours to sort 1PB on 3,800 computers
• Incremental performance improvement
with more nodes
• Seamlessly handles failures, but possibly
with performance penalties