Webinar: Data Processing and Aggregation Options

Data Processing and Aggregation
Achille Brighton
Consulting Engineer, MongoDB

Exponential Data Growth
Billions of URLs indexed by Google
1200
1000
800
600
400
200
0
2000

2002

2004

2006

2008

For over a decade

Big Data == Custom Software

In the past few years
Open source software has
emerged enabling the rest of
us to handle Big Data

How MongoDB Meets Our Requirements
•  MongoDB is an operational database
•  MongoDB provides high performance for storage and

retrieval at large scale
•  MongoDB has a robust query interface permitting

intelligent operations
•  MongoDB is not a data processing engine, but provides

processing functionality

MongoDB data processing options

The “hello world” of
MapReduce is counting words
in a paragraph of text.
Let’s try something a little more
interesting…

What is the most popular pub name?

Open Street Map Data
#!/usr/bin/env python
# Data Source
# http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
import re
import sys
from imposm.parser import OSMParser
import pymongo
class Handler(object):
def nodes(self, nodes):
if not nodes:
return
docs = []
for node in nodes:
osm_id, doc, (lon, lat) = node
if "name" not in doc:
node_points[osm_id] = (lon, lat)
continue
doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&")
doc["_id"] = osm_id
doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
docs.append(doc)
collection.insert(docs)

Example Pub Data
{
"_id" : 451152,
"amenity" : "pub",
"name" : "The Dignity",
"addr:housenumber" : "363",
"addr:street" : "Regents Park Road",
"addr:city" : "London",
"addr:postcode" : "N3 1DH",
"toilets" : "yes",
"toilets:access" : "customers",
"location" : {
"type" : "Point",
"coordinates" : [-0.1945732, 51.6008172]
}
}

MongoDB MapReduce
• 

map
MongoDB

reduce
ﬁnalize

map

Map Function
MongoDB

> var map = function() {
emit(this.name, 1);

reduce

ﬁnalize

map

Reduce Function
MongoDB

> var reduce = function (key, values) {
var sum = 0;
values.forEach( function (val) {sum += val;} );
return sum;
}

reduce

ﬁnalize

Results
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "The Red Lion", "value" : 407 }
{ "_id" : "The Royal Oak", "value" : 328 }
{ "_id" : "The Crown", "value" : 242 }
{ "_id" : "The White Hart", "value" : 214 }
{ "_id" : "The White Horse", "value" : 200 }
{ "_id" : "The New Inn", "value" : 187 }
{ "_id" : "The Plough", "value" : 185 }
{ "_id" : "The Rose & Crown", "value" : 164 }
{ "_id" : "The Wheatsheaf", "value" : 147 }
{ "_id" : "The Swan", "value" : 140 }

Webinar: Data Processing and Aggregation Options

Pub Names in the Center of London
> db.pubs.mapReduce(map, reduce, { out: "pub_names",
query: {
location: {
$within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] }
}}
})
{
"result" : "pub_names",
"timeMillis" : 116,
"counts" : {
"input" : 643,
"emit" : 643,
"reduce" : 54,
"output" : 537
},
"ok" : 1,
}

Results
{
{
{
{
{
{
{
{
{
{

"_id"
"_id"
"_id"
"_id"
"_id"
"_id"
"_id"
"_id"
"_id"
"_id"

:
:
:
:
:
:
:
:
:
:

"All Bar One", "value" : 11 }
"The Slug & Lettuce", "value" : 7 }
"The Coach & Horses", "value" : 6 }
"The Green Man", "value" : 5 }
"The Kings Arms", "value" : 5 }
"The Red Lion", "value" : 5 }
"Corney & Barrow", "value" : 4 }
"O'Neills", "value" : 4 }
"Pitcher & Piano", "value" : 4 }
"The Crown", "value" : 4 }

MongoDB MapReduce
•  Real-time
•  Output directly to document or collection
•  Runs inside MongoDB on local data

− Adds load to your DB
− In Javascript – debugging can be a challenge
− Translating in and out of C++

•  Declared in JSON, executes in C++

Aggregation Framework
Data Processing in MongoDB

•  Flexible, functional, and simple


•  Flexible, functional, and simple
•  Plays nice with sharding


Pipeline
Piping command line operations

ps ax | grep mongod | head 1


Pipeline
Piping aggregation operations

$match | $group | $sort
Stream of documents

Result document


Pipeline Operators
•  $match

•  $sort

•  $project

•  $limit

•  $group

•  $skip

•  $unwind

•  $geoNear


$match
•  Filter documents
•  Uses existing query syntax
•  If using $geoNear it has to be ﬁrst in pipeline
•  $where is not supported

Matching Field Values
{
"_id" : 271421,
"amenity" : "pub",
"name" : "Sir Walter Tyrrell",
"location" : {
"type" : "Point",
"coordinates" : [
-1.6192422,
50.9131996
]
}
}

{ "$match": {
"name": "The Red Lion"
}}

{
"_id" : 271466,
"amenity" : "pub",
"name" : "The Red Lion",
"location" : {
"type" : "Point",
"coordinates" : [
-1.5494749,
50.7837119
]}

{
"_id" : 271466,
"amenity" : "pub",
"location" : {
"type" : "Point",
"coordinates" : [
-1.5494749,
50.7837119
]
}

}

$project
•  Reshape documents
•  Include, exclude or rename fields
•  Inject computed fields
•  Create sub-document fields

Including and Excluding Fields
{
"_id" : 271466,
"amenity" : "pub",
"location" : {
"type" : "Point",

{ “$project”: {
“_id”: 0,
“amenity”: 1,
“name”: 1,
}}

"coordinates" : [
-1.5494749,
50.7837119
]
}
}

{
“amenity” : “pub”,
“name” : “The Red Lion”
}

Reformatting Documents
{
"_id" : 271466,
"amenity" : "pub",
"location" : {
"type" : "Point",

{ “$project”: {
“_id”: 0,
“name”: 1,
“meta”: {
“type”: “$amenity”}
}}

"coordinates" : [
-1.5494749,
50.7837119
]
}
}

{
“name” : “The Red Lion”
“meta” : {
“type” : “pub”
}}

$group
•  Group documents by an ID
•  Field reference, object, constant
•  Other output ﬁelds are computed

$max, $min, $avg, $sum
$addToSet, $push $ﬁrst, $last
•  Processes all data in memory

Summating ﬁelds

}

{ $group: {
_id: "$language",
numTitles: { $sum: 1 },
sumPages: { $sum: "$pages" }
}}

{

{

{
title: "The Great Gatsby",
pages: 218,
language: "English"

title: "War and Peace",
pages: 1440,
language: "Russian”
}

}

{

_id: "Russian",
numTitles: 1,
sumPages: 1440

{
title: "Atlas Shrugged",
pages: 1088,
language: "English"

}

}

_id: "English",
numTitles: 2,
sumPages: 1306

Add To Set
{
pages: 218,
language: "English"

{ $group: {
_id: "$language",
titles: { $addToSet: "$title" }
}}

}

{
{
title: "War and Peace",
pages: 1440,
language: "Russian"

}
{

}
{
title: "Atlas Shrugged",
pages: 1088,
language: "English"
}

}

_id: "Russian",
titles: [ "War and Peace" ]

_id: "English",
titles: [
"Atlas Shrugged",
"The Great Gatsby"
]

Expanding Arrays
{ $unwind: "$subjects" }

{
ISBN: "9781857150193",
subjects: [
"Long Island",
"New York",
"1920s"
]

{

}
{

}

}
{

}

ISBN: "9781857150193",
subjects: "Long Island"
ISBN: "9781857150193",
subjects: "New York"
ISBN: "9781857150193",
subjects: "1920s"

Back to the pub!

• 

http://www.offwestend.com/index.php/theatres/pastshows/71

Popular Pub Names
>var popular_pub_names = [
{ $match : location:
{ $within: { $centerSphere:
[[-0.12, 51.516], 2 / 3959]}}}
},
{ $group :
{ _id: “$name”
value: {$sum: 1} }
},
{ $sort : {value: -1} },
{ $limit : 10 }

Results
> db.pubs.aggregate(popular_pub_names)
{
"result" : [
{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Green Man", "value" : 5 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "The Red Lion", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }
],
"ok" : 1
}

Aggregation Framework Beneﬁts
•  Real-time
•  Simple yet powerful interface
•  Runs inside MongoDB on local data

− Adds load to your DB
− Limited Operators
− Data output is limited

Analyzing MongoDB Data in
External Systems

MongoDB with Hadoop
• 

MongoDB

Hadoop MongoDB Connector
•  MongoDB or BSON ﬁles as input/output
•  Source data can be ﬁltered with queries
•  Hadoop Streaming support
–  For jobs written in Python, Ruby, Node.js

•  Supports Hadoop tools such as Pig and Hive

Map Pub Names in Python
from pymongo_hadoop import BSONMapper
def mapper(documents):
bounds = get_bounds() # ~2 mile polygon
for doc in documents:
geo = get_geo(doc["location"]) # Convert the geo type
if not geo:
continue
if bounds.intersects(geo):
yield {'_id': doc['name'], 'count': 1}
BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."

Reduce Pub Names in Python
from pymongo_hadoop import BSONReducer
def reducer(key, values):
_count = 0
for v in values:
_count += v['count']
return {'_id': key, 'value': _count}
BSONReducer(reducer)

Execute MapReduce
hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar
-mapper examples/pub/map.py
-reducer examples/pub/reduce.py
-mongo mongodb://127.0.0.1/demo.pubs
-outputURI mongodb://127.0.0.1/demo.pub_names

Popular Pub Names Nearby
{
{
{
{
{
{
{
{
{
{

"_id"
"_id"
"_id"
"_id"
"_id"
"_id"
"_id"
"_id"
"_id"
"_id"

:
:
:
:
:
:
:
:
:
:

"All Bar One", "value" : 11 }
"The Slug & Lettuce", "value" : 7 }
"The Coach & Horses", "value" : 6 }
"The Kings Arms", "value" : 5 }
"Corney & Barrow", "value" : 4 }
"O'Neills", "value" : 4 }
"Pitcher & Piano", "value" : 4 }
"The Crown", "value" : 4 }
"The George", "value" : 4 }
"The Green Man", "value" : 4 }

MongoDB with Hadoop
• 

MongoDB

warehouse

MongoDB with Hadoop
• 

ETL

MongoDB

Limitations
•  Batch processing
•  Requires synchronization between data store and

processor
•  Adds complexity to infrastructure

Advantages
•  Processing decoupled from data store
•  Parallel processing
•  Leverage existing infrastructure
•  Java has rich set of data processing libraries
–  And other languages if using Hadoop Streaming

Storm MongoDB connector
•  Spout for MongoDB oplog or capped collections
–  Filtering capabilities
–  Threaded and non-blocking

•  Output to new or existing documents
–  Insert/update bolt

Aggregating MongoDB’s
Data Processing Options

Data Processing with MongoDB
•  Process in MongoDB using Map/Reduce
•  Process in MongoDB using Aggregation Framework
•  Also: Storing pre-aggregated data
–  An exercise in schema design
•  Process outside MongoDB using Hadoop and other

external tools

References
•  Map Reduce docs
–  http://docs.mongodb.org/manual/core/map-reduce/
•  Aggregation Framework
–  Examples
http://docs.mongodb.org/manual/applications/aggregation
–  SQL Comparison
http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
•  Multi Threaded Map Reduce:

http://edgystuff.tumblr.com/post/54709368492/how-to-speedup-mongodb-map-reduce-by-20x

Thanks!
Achille Brighton
Consulting Engineer, MongoDB

Webinar: Data Processing and Aggregation Options

More Related Content

What's hot (20)

Viewers also liked (9)

Similar to Webinar: Data Processing and Aggregation Options (20)

More from MongoDB (20)

Recently uploaded (20)

Webinar: Data Processing and Aggregation Options