Data Processing and Aggregation
Achille Brighton
Consulting Engineer, MongoDB
Big Data
Exponential Data Growth
Billions of URLs indexed by Google
[Chart: y-axis 0 to 1200 (billions of URLs), x-axis 2000 to 2008]
For over a decade

Big Data == Custom Software
In the past few years
Open source software has
emerged enabling the rest of
us to handle Big Data
How MongoDB Meets Our Requirements
•  MongoDB is an operational database
•  MongoDB provides high performance for storage and retrieval at large scale
•  MongoDB has a robust query interface permitting intelligent operations
•  MongoDB is not a data processing engine, but provides processing functionality
MongoDB data processing options
Getting Example Data
The “hello world” of
MapReduce is counting words
in a paragraph of text.
Let’s try something a little more
interesting…
What is the most popular pub name?
Open Street Map Data
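#!/usr/bin/env python
# Data Source
# http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
import re
import sys
from imposm.parser import OSMParser
import pymongo

# Assumed setup (not shown on the slide): the target collection and a
# lookup table for nodes that have no name.
collection = pymongo.MongoClient().demo.pubs
node_points = {}

class Handler(object):
    def nodes(self, nodes):
        if not nodes:
            return
        docs = []
        for node in nodes:
            osm_id, doc, (lon, lat) = node
            if "name" not in doc:
                # Unnamed node: remember its position only
                node_points[osm_id] = (lon, lat)
                continue
            # NOTE: lstrip("The ") strips any leading 'T', 'h', 'e' and ' '
            # characters, not the literal prefix "The ".
            doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&")
            doc["_id"] = osm_id
            doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
            docs.append(doc)
        if docs:
            # Legacy PyMongo call; newer drivers use insert_many()
            collection.insert(docs)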
Example Pub Data
{
    "_id" : 451152,
    "amenity" : "pub",
    "name" : "The Dignity",
    "addr:housenumber" : "363",
    "addr:street" : "Regents Park Road",
    "addr:city" : "London",
    "addr:postcode" : "N3 1DH",
    "toilets" : "yes",
    "toilets:access" : "customers",
    "location" : {
        "type" : "Point",
        "coordinates" : [-0.1945732, 51.6008172]
    }
}
MongoDB MapReduce
[Diagram: map, reduce, and finalize stages in MongoDB]
Map Function
> var map = function() {
    emit(this.name, 1);
}
Reduce Function
> var reduce = function (key, values) {
    var sum = 0;
    values.forEach( function (val) {sum += val;} );
    return sum;
}
Results
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "The Red Lion", "value" : 407 }
{ "_id" : "The Royal Oak", "value" : 328 }
{ "_id" : "The Crown", "value" : 242 }
{ "_id" : "The White Hart", "value" : 214 }
{ "_id" : "The White Horse", "value" : 200 }
{ "_id" : "The New Inn", "value" : 187 }
{ "_id" : "The Plough", "value" : 185 }
{ "_id" : "The Rose & Crown", "value" : 164 }
{ "_id" : "The Wheatsheaf", "value" : 147 }
{ "_id" : "The Swan", "value" : 140 }
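To make the flow behind these counts concrete, here is a minimal pure-Python sketch of what the map and reduce functions compute. It does not talk to MongoDB: the `map_reduce` helper, the function names, and the three sample documents are illustrative assumptions, not the server's implementation.

```python
from collections import defaultdict

def map_reduce(docs, map_fn, reduce_fn):
    """Emulate map -> group by key -> reduce on an in-memory list."""
    emitted = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            emitted[key].append(value)
    # MongoDB skips reduce for keys that received a single value;
    # this sketch calls it for every key to keep the code short.
    return {key: reduce_fn(key, values) for key, values in emitted.items()}

def pub_map(doc):
    # Counterpart of emit(this.name, 1)
    yield doc["name"], 1

def pub_reduce(key, values):
    # Counterpart of summing the emitted 1s
    return sum(values)

# Hypothetical sample documents
sample_pubs = [
    {"name": "The Red Lion"},
    {"name": "The Crown"},
    {"name": "The Red Lion"},
]
pub_counts = map_reduce(sample_pubs, pub_map, pub_reduce)
print(pub_counts)
```

Because the reduce step only adds numbers, it gives the same answer however the emitted values are batched, which is what lets the server call it more than once per key.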
Webinar: Data Processing and Aggregation Options
Pub Names in the Center of London
> db.pubs.mapReduce(map, reduce, { out: "pub_names",
    query: {
        location: {
            $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] }
        }}
})
{
    "result" : "pub_names",
    "timeMillis" : 116,
    "counts" : {
        "input" : 643,
        "emit" : 643,
        "reduce" : 54,
        "output" : 537
    },
    "ok" : 1,
}
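The `2 / 3959` in the query is a unit conversion: `$centerSphere` takes its radius in radians, so a 2 mile radius is divided by the Earth's radius in miles (roughly 3959). A small sketch of that arithmetic follows; the helper name `miles_to_radians` is an assumption made for illustration.

```python
EARTH_RADIUS_MILES = 3959.0  # approximate radius, the value used in the query

def miles_to_radians(miles):
    """Convert a distance along the Earth's surface to an angle in radians."""
    return miles / EARTH_RADIUS_MILES

radius = miles_to_radians(2)
# Coordinates are [longitude, latitude], as in the query
center_sphere = {"$centerSphere": [[-0.12, 51.516], radius]}
print(center_sphere)
```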
Results
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Green Man", "value" : 5 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "The Red Lion", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }
Double Checking
MongoDB MapReduce
•  Real-time
•  Output directly to document or collection
•  Runs inside MongoDB on local data

− Adds load to your DB
− In JavaScript: debugging can be a challenge
− Translating in and out of C++
Aggregation Framework
Data Processing in MongoDB
•  Declared in JSON, executes in C++
•  Flexible, functional, and simple
•  Plays nice with sharding
Pipeline
Piping command line operations

ps ax | grep mongod | head -n 1

Pipeline
Piping aggregation operations

$match | $group | $sort
Stream of documents

Result document
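The shell-pipe analogy can be sketched in plain Python: each stage consumes a stream of documents and hands a new stream to the next one. This is an illustration only; the helpers `match`, `group_count`, and `sort_desc` and the sample documents are assumptions, not MongoDB's API.

```python
from collections import Counter

def match(docs, field, value):
    """Pass through only documents whose field equals value."""
    for doc in docs:
        if doc.get(field) == value:
            yield doc

def group_count(docs, field):
    """Emit one {_id, value} document per distinct field value."""
    counts = Counter(doc[field] for doc in docs)
    for key, n in counts.items():
        yield {"_id": key, "value": n}

def sort_desc(docs, field):
    """Order documents by field, largest first."""
    return sorted(docs, key=lambda d: d[field], reverse=True)

# Hypothetical sample documents
pubs = [
    {"name": "The Red Lion", "amenity": "pub"},
    {"name": "The Crown", "amenity": "pub"},
    {"name": "The Red Lion", "amenity": "pub"},
    {"name": "Corner Cafe", "amenity": "cafe"},
]

# Read inside-out: match | group | sort
result = sort_desc(group_count(match(pubs, "amenity", "pub"), "name"), "value")
print(result)
```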

Pipeline Operators
•  $match
•  $sort
•  $project
•  $limit
•  $group
•  $skip
•  $unwind
•  $geoNear
$match
•  Filter documents
•  Uses existing query syntax
•  If using $geoNear it has to be first in pipeline
•  $where is not supported
Matching Field Values
Input documents:
{
    "_id" : 271421,
    "amenity" : "pub",
    "name" : "Sir Walter Tyrrell",
    "location" : {
        "type" : "Point",
        "coordinates" : [
            -1.6192422,
            50.9131996
        ]
    }
}
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [
            -1.5494749,
            50.7837119
        ]
    }
}

Stage:
{ "$match": {
    "name": "The Red Lion"
}}

Output document:
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [
            -1.5494749,
            50.7837119
        ]
    }
}
$project
•  Reshape documents
•  Include, exclude or rename fields
•  Inject computed fields
•  Create sub-document fields
Including and Excluding Fields
Input document:
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [
            -1.5494749,
            50.7837119
        ]
    }
}

Stage:
{ "$project": {
    "_id": 0,
    "amenity": 1,
    "name": 1
}}

Output document:
{
    "amenity" : "pub",
    "name" : "The Red Lion"
}
Reformatting Documents
Input document:
{
    "_id" : 271466,
    "amenity" : "pub",
    "name" : "The Red Lion",
    "location" : {
        "type" : "Point",
        "coordinates" : [
            -1.5494749,
            50.7837119
        ]
    }
}

Stage:
{ "$project": {
    "_id": 0,
    "name": 1,
    "meta": {
        "type": "$amenity" }
}}

Output document:
{
    "name" : "The Red Lion",
    "meta" : {
        "type" : "pub"
    }
}
$group
•  Group documents by an ID
•  Field reference, object, constant
•  Other output fields are computed
–  $max, $min, $avg, $sum
–  $addToSet, $push
–  $first, $last
•  Processes all data in memory
Summing fields
Input documents:
{
    title: "The Great Gatsby",
    pages: 218,
    language: "English"
}
{
    title: "War and Peace",
    pages: 1440,
    language: "Russian"
}
{
    title: "Atlas Shrugged",
    pages: 1088,
    language: "English"
}

Stage:
{ $group: {
    _id: "$language",
    numTitles: { $sum: 1 },
    sumPages: { $sum: "$pages" }
}}

Output documents:
{
    _id: "Russian",
    numTitles: 1,
    sumPages: 1440
}
{
    _id: "English",
    numTitles: 2,
    sumPages: 1306
}
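As a cross-check on the arithmetic, the grouping can be emulated in a few lines of Python. This is a sketch of the idea rather than how the server executes `$group`; `group_by_language` is a hypothetical helper, and the three books are the ones from the slide.

```python
from collections import defaultdict

books = [
    {"title": "The Great Gatsby", "pages": 218, "language": "English"},
    {"title": "War and Peace", "pages": 1440, "language": "Russian"},
    {"title": "Atlas Shrugged", "pages": 1088, "language": "English"},
]

def group_by_language(docs):
    """Emulate $group with _id: "$language" and two $sum accumulators."""
    groups = defaultdict(lambda: {"numTitles": 0, "sumPages": 0})
    for doc in docs:
        acc = groups[doc["language"]]
        acc["numTitles"] += 1             # { $sum: 1 }
        acc["sumPages"] += doc["pages"]   # { $sum: "$pages" }
    return [{"_id": lang, **acc} for lang, acc in groups.items()]

grouped = group_by_language(books)
print(grouped)
```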
Add To Set
Input documents:
{
    title: "The Great Gatsby",
    pages: 218,
    language: "English"
}
{
    title: "War and Peace",
    pages: 1440,
    language: "Russian"
}
{
    title: "Atlas Shrugged",
    pages: 1088,
    language: "English"
}

Stage:
{ $group: {
    _id: "$language",
    titles: { $addToSet: "$title" }
}}

Output documents:
{
    _id: "Russian",
    titles: [ "War and Peace" ]
}
{
    _id: "English",
    titles: [
        "Atlas Shrugged",
        "The Great Gatsby"
    ]
}
Expanding Arrays
Input document:
{
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: [
        "Long Island",
        "New York",
        "1920s"
    ]
}

Stage:
{ $unwind: "$subjects" }

Output documents:
{
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: "Long Island"
}
{
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: "New York"
}
{
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: "1920s"
}
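The expansion can be pictured as a generator that copies the document once per array element. The sketch below is an emulation under a stated assumption: the field is a list. The helper name `unwind` is chosen here for illustration and is not a driver function.

```python
def unwind(docs, field):
    """Yield one copy of each document per element of its list-valued field.

    Documents where the field is missing or an empty list yield nothing.
    """
    for doc in docs:
        for item in doc.get(field, []):
            out = dict(doc)    # shallow copy so the input is not modified
            out[field] = item  # replace the array with a single element
            yield out

book = {
    "title": "The Great Gatsby",
    "ISBN": "9781857150193",
    "subjects": ["Long Island", "New York", "1920s"],
}
unwound = list(unwind([book], "subjects"))
for doc in unwound:
    print(doc)
```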
Back to the pub!

•  http://www.offwestend.com/index.php/theatres/pastshows/71
Popular Pub Names
> var popular_pub_names = [
    { $match : { location:
        { $within: { $centerSphere:
            [[-0.12, 51.516], 2 / 3959] } } }
    },
    { $group :
        { _id: "$name",
          value: { $sum: 1 } }
    },
    { $sort : { value: -1 } },
    { $limit : 10 }
]
Results
> db.pubs.aggregate(popular_pub_names)
{
    "result" : [
        { "_id" : "All Bar One", "value" : 11 },
        { "_id" : "The Slug & Lettuce", "value" : 7 },
        { "_id" : "The Coach & Horses", "value" : 6 },
        { "_id" : "The Green Man", "value" : 5 },
        { "_id" : "The Kings Arms", "value" : 5 },
        { "_id" : "The Red Lion", "value" : 5 },
        { "_id" : "Corney & Barrow", "value" : 4 },
        { "_id" : "O'Neills", "value" : 4 },
        { "_id" : "Pitcher & Piano", "value" : 4 },
        { "_id" : "The Crown", "value" : 4 }
    ],
    "ok" : 1
}
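For readers working from Python rather than the shell, the same pipeline can be written as a list of dictionaries and passed to PyMongo's `Collection.aggregate`. This is a sketch with two assumptions spelled out: it uses `$geoWithin`, the operator that replaced the deprecated `$within` shown on the slide, and the wrapper `top_pub_names` is a hypothetical helper that needs a running server to return anything.

```python
EARTH_RADIUS_MILES = 3959.0  # approximate, as on the slide

popular_pub_names = [
    # Pubs within roughly 2 miles of central London ([longitude, latitude])
    {"$match": {"location": {"$geoWithin": {
        "$centerSphere": [[-0.12, 51.516], 2 / EARTH_RADIUS_MILES]}}}},
    # Count documents per pub name
    {"$group": {"_id": "$name", "value": {"$sum": 1}}},
    # Most common names first
    {"$sort": {"value": -1}},
    {"$limit": 10},
]

def top_pub_names(collection):
    """Run the pipeline on a PyMongo collection and return the result documents."""
    return list(collection.aggregate(popular_pub_names))
```

With a server available, something like `top_pub_names(pymongo.MongoClient().demo.pubs)` would run it against the `demo.pubs` collection named in the Hadoop example.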
Aggregation Framework Benefits
•  Real-time
•  Simple yet powerful interface
•  Declared in JSON, executes in C++
•  Runs inside MongoDB on local data

− Adds load to your DB
− Limited Operators
− Data output is limited
Analyzing MongoDB Data in
External Systems
MongoDB with Hadoop
Hadoop MongoDB Connector
•  MongoDB or BSON files as input/output
•  Source data can be filtered with queries
•  Hadoop Streaming support
–  For jobs written in Python, Ruby, Node.js

•  Supports Hadoop tools such as Pig and Hive
Map Pub Names in Python
#!/usr/bin/env python
import sys
from pymongo_hadoop import BSONMapper

# get_bounds() and get_geo() are helper functions not shown on the slide

def mapper(documents):
    bounds = get_bounds()  # ~2 mile polygon
    for doc in documents:
        geo = get_geo(doc["location"])  # Convert the geo type
        if not geo:
            continue
        if bounds.intersects(geo):
            # Emit one count per pub inside the polygon
            yield {'_id': doc['name'], 'count': 1}

BSONMapper(mapper)
sys.stderr.write("Done Mapping.\n")
Reduce Pub Names in Python
#!/usr/bin/env python
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    # Sum the counts emitted by the mapper for this pub name
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'value': _count}

BSONReducer(reducer)
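Before submitting a streaming job it can help to run the map and reduce logic locally. The sketch below stands in for Hadoop's shuffle with a sort and `itertools.groupby`; it leaves out the geo filter and `pymongo_hadoop` entirely, and `run_locally` and the sample documents are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

def mapper(documents):
    # Same output shape as the streaming mapper, without the geo filter
    for doc in documents:
        yield {"_id": doc["name"], "count": 1}

def reducer(key, values):
    return {"_id": key, "value": sum(v["count"] for v in values)}

def run_locally(documents):
    """Stand-in for the shuffle phase: sort by key, then reduce each group."""
    mapped = sorted(mapper(documents), key=itemgetter("_id"))
    return [
        reducer(key, list(group))
        for key, group in groupby(mapped, key=itemgetter("_id"))
    ]

# Hypothetical sample documents
sample = [{"name": "The Crown"}, {"name": "All Bar One"}, {"name": "The Crown"}]
local_result = run_locally(sample)
print(local_result)
```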
Execute MapReduce
hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar \
    -mapper examples/pub/map.py \
    -reducer examples/pub/reduce.py \
    -mongo mongodb://127.0.0.1/demo.pubs \
    -outputURI mongodb://127.0.0.1/demo.pub_names
Popular Pub Names Nearby
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }
{ "_id" : "The George", "value" : 4 }
{ "_id" : "The Green Man", "value" : 4 }
MongoDB with Hadoop
[Diagram: MongoDB and a data warehouse]
MongoDB with Hadoop
[Diagram: MongoDB and ETL]
Limitations
•  Batch processing
•  Requires synchronization between data store and processor
•  Adds complexity to infrastructure
Advantages
•  Processing decoupled from data store
•  Parallel processing
•  Leverage existing infrastructure
•  Java has a rich set of data processing libraries
–  And other languages if using Hadoop Streaming
Storm
Storm MongoDB connector
•  Spout for MongoDB oplog or capped collections
–  Filtering capabilities
–  Threaded and non-blocking

•  Output to new or existing documents
–  Insert/update bolt
Aggregating MongoDB’s
Data Processing Options
Data Processing with MongoDB
•  Process in MongoDB using Map/Reduce
•  Process in MongoDB using Aggregation Framework
•  Also: Storing pre-aggregated data
–  An exercise in schema design
•  Process outside MongoDB using Hadoop and other external tools
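The pre-aggregation option can be illustrated with the pub example: instead of counting names at query time, a counter document per name is bumped on every insert. This is one possible schema, sketched under assumptions; `pub_name_counter_update` and `record_pub` are hypothetical helpers, and the two collection arguments are expected to behave like PyMongo collections (`insert_one`, `update_one`).

```python
def pub_name_counter_update(name):
    """Build the (filter, update) pair that increments the counter for one name."""
    return {"_id": name}, {"$inc": {"value": 1}}

def record_pub(pubs, pub_names, doc):
    """Insert a pub and keep the pre-aggregated count in step with it."""
    pubs.insert_one(doc)
    query, update = pub_name_counter_update(doc["name"])
    # upsert=True creates the counter document the first time a name is seen
    pub_names.update_one(query, update, upsert=True)
```

The trade-off is a second write per insert in exchange for reads that need no aggregation at all.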
External Tools
Questions?
References
•  Map Reduce docs
–  http://docs.mongodb.org/manual/core/map-reduce/
•  Aggregation Framework
–  Examples: http://docs.mongodb.org/manual/applications/aggregation
–  SQL Comparison: http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
•  Multi Threaded Map Reduce
–  http://edgystuff.tumblr.com/post/54709368492/how-to-speedup-mongodb-map-reduce-by-20x
Thanks!
Achille Brighton
Consulting Engineer, MongoDB
