Big Data
Analysis Patterns
TriHUG
6/27/2013
1
whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• banderson@maprtech.com
2
BIG DATA
3
4
Big Data is not new!
but the tools are.

5
The Good News in Big Data:

“Simple algorithms and lots of data
trump complex models”

Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems
6
The Challenge: So Many Solutions!
What solutions fit your business problem?
For example, do you need…
• Apache Hadoop?
• Apache Mahout?
• Storm?
• Apache Solr/Lucene?
• Apache HBase (or MapR M7)?
• Apache Drill (or Impala)?
• Titan?
• d3.js or Tableau?
• Node.js?
7
Ask a Different Question
It may be more useful to better define the problem by asking some
of these questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
• How fast is data arriving? (in bursts or continuously?)
• How are your data sources structured?
• Are queries made by sophisticated users?
• Are you looking for common patterns or outliers?
8
Picking the Best Solution
Your responses to these questions can help you better:
• define the problem
• recognize the analysis pattern to which it belongs
• guide the choice of solutions to try

But first, here’s a quick review of a few of the technologies you
might choose, and then we will focus on three of the questions as a
part of the landscape.
9
Apache Solr/Lucene
Solr/Lucene is a powerful search engine used for flexible, heavily
indexed queries over data such as
• Full text
• Geographical data
• Statistically weighted data

Solr is a small data tool that has flourished in a big data world

10
Apache Mahout
Mahout provides a library of scalable machine learning algorithms
useful for big data analysis based on Hadoop or other storage
systems.

Mahout algorithms are mainly used for
• Recommendation (collaborative filtering)
• Clustering
• Classification

Mahout can be used in conjunction with solutions such as Solr: you
might use Mahout to create a co-occurrence database that could
then be queried using a search tool such as Solr

11
Apache Drill


• Google Dremel clone
• Pluggable Query Languages
  – Starts with ANSI SQL 2003
  – Hive, Pig, Cascading, MongoQL, …
• Pluggable Storage Backends
  – Hadoop, HBase
  – MongoDB (BSON)
  – RDBMS?
• Bypasses MapReduce

12
Storm


• Realtime Stream Computation Engine
• Horizontal Scalability
• Guaranteed Data Processing
• Fault Tolerance
• Higher level abstraction over:
  – Message Queues
  – Worker Logic
• “The Hadoop of Realtime”

13
Titan


• Distributed Graph Database
• Property Graph
• Pluggable Backend Storage
  – HBase or M7
  – Cassandra
  – Berkeley DB
• Search Integrated
  – Solr/Lucene
  – Elasticsearch
• Faunus
  – Graph traversals on subset
  – In-memory
14
Using the Answers to Guide Your Choices
For simplicity, let’s focus on the first three questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?

15
Big Data Decision Tree
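[Figure: decision tree built from three questions, with outcomes lettered A through E.
 How big is your data? Branches: <10 GB, mid, >200 GB.
 What size queries? Branches: single element at a time, one pass over 100%, multiple passes over big chunks.
 Response time? Branches: <100s (human scale), throughput not response.
 Other node labels: Big storage, Streaming, and two "?" nodes.]
16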
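The three questions can be turned into a small bucketing helper. This is only a sketch: the thresholds (10 GB, 200 GB, 100 s) and branch names come from the slide, while the function names, parameters, and return shape are invented for illustration. It stops at recording the answers and does not try to map them onto the lettered outcomes.

```python
# Thresholds and branch names are taken from the slide; everything else
# (function names, parameters, return shape) is an illustrative assumption.
SMALL_GB = 10
LARGE_GB = 200
HUMAN_SCALE_S = 100

QUERY_SHAPES = (
    "single element at a time",
    "one pass over 100%",
    "multiple passes over big chunks",
)


def size_bucket(stored_gb):
    """Answer 'How big is your data?' with one of the slide's three branches."""
    if stored_gb < SMALL_GB:
        return "<10 GB"
    if stored_gb > LARGE_GB:
        return ">200 GB"
    return "mid"


def response_bucket(max_response_s):
    """Answer 'Response time?'.

    None (no latency target) or a target of 100 s or more is treated as a
    throughput-oriented workload.
    """
    if max_response_s is not None and max_response_s < HUMAN_SCALE_S:
        return "<100s (human scale)"
    return "throughput not response"


def answers(stored_gb, query_shape, max_response_s=None):
    """Collect the answers to the three questions for one workload."""
    if query_shape not in QUERY_SHAPES:
        raise ValueError(f"unknown query shape: {query_shape!r}")
    return {
        "How big is your data?": size_bucket(stored_gb),
        "What size queries?": query_shape,
        "Response time?": response_bucket(max_response_s),
    }


print(answers(5_000, "multiple passes over big chunks", max_response_s=30))
```

Writing the answers down this way makes it easier to compare workloads side by side before shortlisting tools.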
Use Cases
Company
• Data Shape
• Technique(s)
• Business Value


17
Business Value
18
Business Value
19
Telecommunications Giant

ETL Offload
20
Telecommunications

Data Shape
• Lots of Data
• Lots of Queries across Large Sets
• Throughput important

21
Telecommunications

Techniques
Analytics

ETL

22
Telecommunications

Techniques
ETL (Hadoop) + Analytics (Teradata)
23
Telecommunications

Business Value

24
Credit Card
Issuer

25
Credit Card Issuer

Data Shape
• Customer Purchase History (big)
• Merchant Designations
• Merchant Special Offers
• Throughput important
Recommendations
26
Search Abuse

Techniques
A Recommendation Engine with Mahout and Solr/Lucene

History matrix
One row per user
One column per thing
27
Techniques
Recommendation based on cooccurrence
Cooccurrence gives item-item mapping
One row and column per thing
28
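A toy version of the history-to-cooccurrence step might look like the following. It is a sketch under assumptions: the user names, items, and function names are made up, and raw co-purchase counts stand in for whatever similarity measure a Mahout job would actually be configured to use.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(histories):
    """Build an item-by-item table: counts[a][b] = users who have both a and b."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        # One user's history contributes 1 to every ordered pair of distinct items.
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(counts, user_items, top_n=3):
    """Score unseen items by how often they cooccur with the user's items."""
    scores = defaultdict(int)
    for item in user_items:
        for other, c in counts[item].items():
            if other not in user_items:
                scores[other] += c
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda k: (-scores[k], k))[:top_n]


# Hypothetical history matrix: one row per user, one column per thing.
histories = {
    "alice": ["coffee", "bagel", "juice"],
    "bob": ["coffee", "bagel"],
    "carol": ["coffee", "juice"],
    "dave": ["bagel", "tea"],
}

counts = cooccurrence(histories)
print(recommend(counts, {"coffee"}))
```

The nested dictionary plays the role of the square item-item matrix, with one row and one column per thing.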
Techniques
Cooccurrence matrix can also be implemented as a search index

29
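To illustrate the idea of serving cooccurrence through a search index, here is a minimal in-memory stand-in. It does not use Solr; the `indicators` field name, the sample items, and both functions are assumptions made for the sketch. Each item is treated as a document whose indicator field lists the items it cooccurs with, and a user's history becomes the query.

```python
from collections import defaultdict

# Hypothetical item documents; the indicator lists would come from an
# offline cooccurrence analysis.
item_docs = {
    "bagel": {"indicators": ["coffee", "tea"]},
    "juice": {"indicators": ["coffee"]},
    "tea": {"indicators": ["bagel"]},
    "coffee": {"indicators": ["bagel", "juice"]},
}


def build_index(docs):
    """Inverted index: indicator term -> set of item documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc["indicators"]:
            index[term].add(doc_id)
    return index


def search(index, query_terms, exclude=()):
    """Rank documents by the number of query terms they match."""
    hits = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            if doc_id not in exclude:
                hits[doc_id] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))


index = build_index(item_docs)
history = ["coffee", "tea"]
# Query with the user's history and drop items the user already has.
print(search(index, history, exclude=history))
```

A real search engine would add relevance weighting on top of the simple match count used here.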
Techniques
[Diagram: indexing path. Elements: Complete history, Cooccurrence (Mahout), Solr indexers (indexing), Item metadata, Index shards]

30

20 Hrs → 3 Hrs
Techniques
[Diagram: query path. Elements: User history, Web tier, Solr (search), Item metadata, Index shards. Annotation: 8 Hrs → 3 Min]

31
Techniques
[Diagram: Hadoop (Purchase History, Merchant Information, Merchant Offers, Recommendation Engine Results (Mahout)) → Export (4 hrs) → Import (4 hrs) → Presentation Data Store (DB2) → Apps]
32
Techniques
[Diagram: Hadoop (Purchase History, Merchant Information, Merchant Offers, Recommendation Engine Results (Mahout)) → Index Update (3 min) → Recommendation Search Index (Solr) → Apps]

33
Business Value

34
Waste & Recycling Leader

Idle Alerts
35
Data Shape
• Truck Geolocation Data
  – 20,000 trucks
  – 5 sec interval (arriving quickly)
• Landfill Geographic Boundaries


36
Techniques
[Diagram: Truck Geolocation Data feeds three paths:
 Realtime Stream Computation (Storm) → Immediate Alerts
 Hadoop Storage → Batch Computation (MapReduce) → Tax Reduction Reporting
 Shortest Path Graph Algorithm (Titan) → Route Optimization]

37
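The per-truck logic behind an idle alert could be sketched as below. This is not the Storm API and not the customer's implementation: the truck count and reporting interval come from the slides, while the idle radius, the idle duration, the class, and the coordinates are all assumptions, and the landfill boundaries are ignored for brevity.

```python
import math

TRUCKS = 20_000      # from the slide
INTERVAL_S = 5       # from the slide
rate = TRUCKS / INTERVAL_S
print(rate)          # average position readings per second across the fleet

IDLE_RADIUS_M = 15.0  # assumption: movement below this counts as standing still
IDLE_AFTER_S = 300    # assumption: alert after five minutes without movement


def distance_m(a, b):
    """Approximate ground distance in meters between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6_371_000 * math.hypot(x, y)


class IdleDetector:
    """Keeps per-truck state, as a stream-processing worker might."""

    def __init__(self):
        self.anchor = {}      # truck_id -> (timestamp, position) where it stopped
        self.alerted = set()  # trucks already alerted for the current stop

    def observe(self, truck_id, ts, pos):
        """Return True once per stop, when the truck has idled long enough."""
        start = self.anchor.get(truck_id)
        if start is None or distance_m(start[1], pos) > IDLE_RADIUS_M:
            # First reading, or the truck moved: restart the idle clock.
            self.anchor[truck_id] = (ts, pos)
            self.alerted.discard(truck_id)
            return False
        if ts - start[0] >= IDLE_AFTER_S and truck_id not in self.alerted:
            self.alerted.add(truck_id)
            return True
        return False


detector = IdleDetector()
# A hypothetical truck reporting the same position every 5 seconds.
alerts = [t for t in range(0, 400, 5)
          if detector.observe("truck-1", t, (35.78, -78.64))]
print(alerts)
```

Because the state is keyed by truck, the same logic can be partitioned across workers by truck ID.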
Business Value

38
Beverage Company

Social Engagement Application

39
Data Shape

• Tweets, FB Messages
• Person, Activity links
• Graph Traversal


40
Consumer Activity Graph
[Graph nodes: Wal*Mart.com, Ebay, Shopping.com, Sam’s, Ebay Motors, Dollar General, StubHub, CVS, Toys R Us]
41
Techniques
[Diagram elements: Social Activity Stream, Key/Value Store (MapR M7), Property Graph (Titan), Graph Traversal (Faunus)]
42
Business Value

43
Questions?

44
