SlideShare a Scribd company logo
Hadoop for Data Science
Donald Miner
Data Science MD
October 9, 2013
About Don
@donaldpminer
dminer@clearedgeit.com
I’ll talk about…
Intro to Hadoop (HDFS and MapReduce)
Some reasons why I think Hadoop is cool
(is this cliché yet?)
Step 1: Hadoop
Step 2: ????
Step 3: Data Science!
Some examples of data science work on Hadoop
What can Hadoop do to enable data science work?
Hadoop Distributed File System
HDFS
• Stores files in folders (that’s it)
– Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-2GB)
• 3 replicates of each block (better safe than sorry)
• Blocks are scattered all over the place
FILE BLOCKS
MapReduce
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers
Reducers (you code this, too)
Automatically Groups by the
mapper’s output key
Aggregate, count, statistics
Outputs to HDFS
Mappers (you code this)
Loads data from HDFS
Filter, transform, parse
Outputs (key, value) pairs
Hadoop Ecosystem
• Higher-level languages like Pig and Hive
• HDFS Data systems like HBase and Accumulo
• Close friends like ZooKeeper, Flume, Storm,
Cassandra, Avro
Mahout
• Mahout is a Machine
Library
• Has both MapReduce
and non-parallel
implementations of a
number of algorithms:
– Recommenders
– Clustering
– Classification
Cool Thing #1: Linear Scalability
• HDFS and MapReduce
scale linearly
• If you have twice as
many computers, jobs
run twice as fast
• If you have twice as
much data, jobs run
twice as slow
• If you have twice as
many computers, you
can store twice as much
data
DATA LOCALITY!!
Cool Thing #2: Schema on Read
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS
What implications does this have?
BEFORE:
ETL, schema design upfront,
tossing out original data,
comprehensive data study
Keep original data around!
Have multiple views of the same data!
Work with unstructured data sooner!
Store first, figure out what to do with it later!
WITH HADOOP:
Cool Thing #3: Transparent Parallelism
Network programming?
Inter-process communication?
Threading?
Distributed stuff?
With MapReduce, I DON’T CARE
Your solution
… I just have to be sure my solution fits into this tiny box
Fault tolerance?
Code deployment?
RPC?
Message passing?
Locking?
MapReduce
Framework
Data storage?
Scalability?
Data center fires?
Cool Thing #4: Unstructured Data
• Unstructured data:
media, text,
forms, log data
lumped structured data
• Query languages like SQL
and Pig assume some sort
of “structure”
• MapReduce is just Java:
You can do anything Java can
do in a Mapper or Reducer
The rest of the talk
• Four threads:
– Data exploration
– Classification
– NLP
– Recommender systems
I’m using these to illustrate some points
Exploration
• Hadoop is great at exploring data!
• I like to explore data in a couple ways:
– Filtering
– Sampling
– Summarization
– Evaluate cleanliness
• I like to spend 50% of my time
doing exploration
(but unfortunately it’s the
first thing to get cut)
Filtering
• Filtering is like a microscope:
I want to take a closer look at a subset
• In MapReduce, you do this in the mapper
• Identify nasty records you want to get rid of
• Examples:
– Only Baltimore data
– Remove gibberish
– Only 5 minutes
Sampling
• Hadoop isn’t the king of interactive analysis
• Sampling is a good way to grab a set of data
then work with it locally (Excel?)
• Pig has a handy SAMPLE keyword
• Types of sampling:
– Sample randomly across the entire data set
– Sub-graph extraction
– Filters (from the last slide)
Summarization
• Summarization is a bird’s-eye view
• MapReduce is good at summarization:
– Mappers extract the group-by keys
– Reducers do the aggregation
• I like to:
– Count number, get stdev, get average, get min/max of
records in several groups
– Count nulls in columns
(if applicable)
– Grab top-10 lists
Evaluating Cleanliness
• I’ve never been burned twice
• Things to check for:
– Fields that shouldn’t be null that are
– Duplicates (does unique records=records?)
– Dates (look for 1970; look at formats; time zones)
– Things that should be normalized
– Keys that are different because of trash
e.g. “ abc “ != “abc”
SQL/RDBMS =
What’s the point?
• Hadoop is really good at this stuff!
• You probably have a lot of data and a lot of it
is garbage!
• Take the time to do this and your further work
will be much easier
• It’s hard to tell what methods
you should use until you
explore your data
Classification
• Classification is taking feature vectors (derived from
your data), and then guessing some sort of label
– E.g.,
sunny, Saturday, summer -> play tennis
rainy, Wednesday, winter -> don’t play tennis
• Most classification algorithms aren’t easily
parallelizable or have good implementations
• You need a training set of true feature vectors and
labels… how often is your data labeled?
• I’ve found classification rather hard, except for when…
Overall Classification Workflow
EXPLORATION EXPERIMENTATION
WITH DIFFERENT METHODS
REFINING PROMISING
METHODS
The Model Training Workflow
FEATURE
EXTRACTION
MODEL
TRAINING USE MODEL
DATA FEATURE
VECTORS
MODEL OUTPUT
Data volumes in training
DATAVOLUME
DATA
I have a lot of data
Data volumes in training
DATAVOLUME
DATA
FEATURE
VECTORS
feature extraction
Is this result “big data”?
Examples:
- 10TB of network traffic distilled into 9K IP address FVs
- 10TB of medical records distilled into 50M patient FVs
- 10TB of documents distilled into 5TB of document FVs
Data volumes in training
DATAVOLUME
DATA
FEATURE
VECTORS
feature extraction Model
Training
MODEL
The model itself is usually pretty tiny
Data volumes in training
DATAVOLUME
DATA
FEATURE
VECTORS
feature extraction Model
Training
MODEL
Applying that model to all the
data is a big data problem!
Some hurdles
• Where do I run non-hadoop code?
• How do I host out results to the application?
• How do I use my model on streaming data?
• Automate performance measurement?
So what’s the point?
• Not all stages of the model training workflow
are Hadoop problems
• Use the right tool for the job in each phase
e.g., non-parallel model training in some cases
FEATURE
EXTRACTION
MODEL
TRAINING USE MODEL
DATA FEATURE
VECTORS
MODEL OUTPUT
Natural Language Processing
• A lot of classic tools in NLP are “embarrassingly
parallel” over an entire corpus since words split nicely.
– Stemming
– Lexical analysis
– Parsing
– Tokenization
– Normalization
– Removing stop words
– Spell check
Each of these apply to segments of text and
don’t have much to do with any other piece of
Text in the corpus.
Python, NLTK, and Pig
• Pig is a higher-level abstract over MapReduce
• NLTK is a popular natural language toolkit for Python
• Pig allows you to stream data through arbitrary
processes (including python scripts)
• You can use UDFs to wrap NLTK methods, but the need
to use Jython sucks
• Use Pig to move your data around, use a real package
to do the work on the records
postdata = STREAM data THROUGH `my_nltk_script.py`;
(I do the same thing with Scipy and Numpy)
OpenNLP and MapReduce
• OpenNLP is an Apache project is an NLP library
• “It supports the most common NLP tasks, such as
tokenization, sentence segmentation, part-of-
speech tagging, named entity extraction,
chunking, parsing, and coreference resolution.”
• Written in Java with reasonable APIs
• MapReduce is just Java, so you can link into just
about anything you want
• Use OpenNLP in the Mapper to enrich, normalize,
cleanse your data
So what’s the point?
• Hadoop can be used to glue together already
existing libraries
– You just have to figure out how to split the
problem up yourself
Recommender Systems
• Hadoop is good at recommender systems
– Recommender systems like a lot of data
– Systems want to make a lot of recommendations
• A number of methods available in Mahout
Collaborative Filtering:
Base recommendations on others
• Collaborative Filtering
is cool because it
doesn’t have to
understand the user or
the item… just the
relationships
• Relationships are easy
to extract, features and
labels not so much
What’s the point?
• Recommender systems parallelize and there is
a Hadoop library for it
• They use relationships, not features, so the
input data is easier to extract
• If you can fit your problem into the
recommendation framework, you can do
something interesting
Other stuff: Graphs
• Graphs are useful and a lot can be done with
Hadoop
• Check out Giraph
• Check out how Accumulo has been used to
store graphs (google: “Graph 500 Accumulo”)
• Stuff to do:
– Subgraph extraction
– Missing edge recommendation
– Cool visualizations
– Summarizing relationships
Other stuff: Clustering
• Provides interesting insight into group
• Some methods parallelize well
• Mahout has:
– Dirichlet process
– K-means
– Fuzzy K-means
Other stuff: R and Hadoop
• RHIPE and Rhadoop allow you to write
MapReduce jobs in R, instead of Java
• Can also use Hadoop streaming to use R
• This doesn’t magically parallelize all your R
code
• Useful to integrate into R more seamlessly
Wrap up
Hadoop can’t do everything and
you have to do the rest
THANKS!
dminer@clearedgeit.com
@donaldpminer
Ad

More Related Content

What's hot (19)

Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
Victor Sanchez Anguix
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Chris Baglieri
 
Geek camp
Geek campGeek camp
Geek camp
jdhok
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
Krishna Sankar
 
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaPig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Edureka!
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
stratapps
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
Sri Ambati
 
Faster Faster Faster! Datamarts with Hive at Yahoo
Faster Faster Faster! Datamarts with Hive at YahooFaster Faster Faster! Datamarts with Hive at Yahoo
Faster Faster Faster! Datamarts with Hive at Yahoo
Mithun Radhakrishnan
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
Lars Marius Garshol
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
Sean Murphy
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
Donald Miner
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
Jake Mannix
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
Travis Oliphant
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
Portland R User Group
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Lucidworks
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
Anqi Fu
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
daijy
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
Travis Oliphant
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
Victor Sanchez Anguix
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Chris Baglieri
 
Geek camp
Geek campGeek camp
Geek camp
jdhok
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
Krishna Sankar
 
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaPig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Edureka!
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
stratapps
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
Sri Ambati
 
Faster Faster Faster! Datamarts with Hive at Yahoo
Faster Faster Faster! Datamarts with Hive at YahooFaster Faster Faster! Datamarts with Hive at Yahoo
Faster Faster Faster! Datamarts with Hive at Yahoo
Mithun Radhakrishnan
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
Lars Marius Garshol
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
Donald Miner
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
Jake Mannix
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
Travis Oliphant
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Lucidworks
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
Anqi Fu
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
daijy
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
Travis Oliphant
 

Viewers also liked (20)

Tosta matrimonio anchoas del cantábrico y boquerones olmeda orígenes
Tosta matrimonio  anchoas del cantábrico y boquerones olmeda orígenesTosta matrimonio  anchoas del cantábrico y boquerones olmeda orígenes
Tosta matrimonio anchoas del cantábrico y boquerones olmeda orígenes
Olmeda Orígenes
 
OpSmile Art Auction Catalogue
OpSmile Art Auction CatalogueOpSmile Art Auction Catalogue
OpSmile Art Auction Catalogue
opsmile
 
lunch and learn presentation
lunch and learn presentationlunch and learn presentation
lunch and learn presentation
Apriele Minott
 
World Pangolin Day
World Pangolin DayWorld Pangolin Day
World Pangolin Day
Article 10 Integrated Marketing Ltd.
 
Wildlife: Grand Canyon National Park
Wildlife: Grand Canyon National ParkWildlife: Grand Canyon National Park
Wildlife: Grand Canyon National Park
Austin Gratham
 
Po co ci content marketing
Po co ci content marketingPo co ci content marketing
Po co ci content marketing
Grzegorz Miecznikowski
 
Linked In Pg Customer Contact Offering Base Mkt Ver Nov 091[1]
Linked In Pg Customer Contact Offering Base Mkt Ver Nov 091[1]Linked In Pg Customer Contact Offering Base Mkt Ver Nov 091[1]
Linked In Pg Customer Contact Offering Base Mkt Ver Nov 091[1]
dshurts
 
中正行銷課
中正行銷課中正行銷課
中正行銷課
bopomo
 
Microsoft surface
Microsoft surfaceMicrosoft surface
Microsoft surface
piyush khadse
 
Linked Ocean Data - Exploring connections between marine datasets in a Big Da...
Linked Ocean Data - Exploring connections between marine datasets in a Big Da...Linked Ocean Data - Exploring connections between marine datasets in a Big Da...
Linked Ocean Data - Exploring connections between marine datasets in a Big Da...
Adam Leadbetter
 
Abstraction Classes in Software Design
Abstraction Classes in Software DesignAbstraction Classes in Software Design
Abstraction Classes in Software Design
Vladimir Tsukur
 
Обучение_детей_младшего_возраста_программированию
Обучение_детей_младшего_возраста_программированиюОбучение_детей_младшего_возраста_программированию
Обучение_детей_младшего_возраста_программированию
Ivan Dementiev
 
Boletim (2)
Boletim (2)Boletim (2)
Boletim (2)
Gabriela Rodrigues
 
Patologías del Sistema Nervioso
Patologías del Sistema NerviosoPatologías del Sistema Nervioso
Patologías del Sistema Nervioso
Hector Martínez
 
Broadband for Business
Broadband for BusinessBroadband for Business
Broadband for Business
mnrem
 
Kabinetschef van Willy Claes was mol voor Amerikanen
Kabinetschef van Willy Claes was mol voor AmerikanenKabinetschef van Willy Claes was mol voor Amerikanen
Kabinetschef van Willy Claes was mol voor Amerikanen
Thierry Debels
 
Mobile AR, OOH and the Mirror World
Mobile AR, OOH and the Mirror WorldMobile AR, OOH and the Mirror World
Mobile AR, OOH and the Mirror World
Christopher Grayson
 
Why hotel direct bookers make better lovers - A Triptease + Tnooz slide prese...
Why hotel direct bookers make better lovers - A Triptease + Tnooz slide prese...Why hotel direct bookers make better lovers - A Triptease + Tnooz slide prese...
Why hotel direct bookers make better lovers - A Triptease + Tnooz slide prese...
tnooz
 
Daily Newsletter: 20th July, 2011
Daily Newsletter: 20th July, 2011Daily Newsletter: 20th July, 2011
Daily Newsletter: 20th July, 2011
Fullerton Securities
 
Research into Film Magazines
Research into Film MagazinesResearch into Film Magazines
Research into Film Magazines
meaglesham1
 
Tosta matrimonio anchoas del cantábrico y boquerones olmeda orígenes
Tosta matrimonio  anchoas del cantábrico y boquerones olmeda orígenesTosta matrimonio  anchoas del cantábrico y boquerones olmeda orígenes
Tosta matrimonio anchoas del cantábrico y boquerones olmeda orígenes
Olmeda Orígenes
 
OpSmile Art Auction Catalogue
OpSmile Art Auction CatalogueOpSmile Art Auction Catalogue
OpSmile Art Auction Catalogue
opsmile
 
lunch and learn presentation
lunch and learn presentationlunch and learn presentation
lunch and learn presentation
Apriele Minott
 
Wildlife: Grand Canyon National Park
Wildlife: Grand Canyon National ParkWildlife: Grand Canyon National Park
Wildlife: Grand Canyon National Park
Austin Gratham
 
Linked In Pg Customer Contact Offering Base Mkt Ver Nov 091[1]
Linked In Pg Customer Contact Offering Base Mkt Ver Nov 091[1]Linked In Pg Customer Contact Offering Base Mkt Ver Nov 091[1]
Linked In Pg Customer Contact Offering Base Mkt Ver Nov 091[1]
dshurts
 
中正行銷課
中正行銷課中正行銷課
中正行銷課
bopomo
 
Linked Ocean Data - Exploring connections between marine datasets in a Big Da...
Linked Ocean Data - Exploring connections between marine datasets in a Big Da...Linked Ocean Data - Exploring connections between marine datasets in a Big Da...
Linked Ocean Data - Exploring connections between marine datasets in a Big Da...
Adam Leadbetter
 
Abstraction Classes in Software Design
Abstraction Classes in Software DesignAbstraction Classes in Software Design
Abstraction Classes in Software Design
Vladimir Tsukur
 
Обучение_детей_младшего_возраста_программированию
Обучение_детей_младшего_возраста_программированиюОбучение_детей_младшего_возраста_программированию
Обучение_детей_младшего_возраста_программированию
Ivan Dementiev
 
Patologías del Sistema Nervioso
Patologías del Sistema NerviosoPatologías del Sistema Nervioso
Patologías del Sistema Nervioso
Hector Martínez
 
Broadband for Business
Broadband for BusinessBroadband for Business
Broadband for Business
mnrem
 
Kabinetschef van Willy Claes was mol voor Amerikanen
Kabinetschef van Willy Claes was mol voor AmerikanenKabinetschef van Willy Claes was mol voor Amerikanen
Kabinetschef van Willy Claes was mol voor Amerikanen
Thierry Debels
 
Mobile AR, OOH and the Mirror World
Mobile AR, OOH and the Mirror WorldMobile AR, OOH and the Mirror World
Mobile AR, OOH and the Mirror World
Christopher Grayson
 
Why hotel direct bookers make better lovers - A Triptease + Tnooz slide prese...
Why hotel direct bookers make better lovers - A Triptease + Tnooz slide prese...Why hotel direct bookers make better lovers - A Triptease + Tnooz slide prese...
Why hotel direct bookers make better lovers - A Triptease + Tnooz slide prese...
tnooz
 
Research into Film Magazines
Research into Film MagazinesResearch into Film Magazines
Research into Film Magazines
meaglesham1
 
Ad

Similar to Hadoop for Data Science (20)

Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding Failure
Gwen (Chen) Shapira
 
Hands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop EcosystemHands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop Ecosystem
Adaryl "Bob" Wakefield, MBA
 
Big data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup GroupBig data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup Group
Sri Kanajan
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Cloudera, Inc.
 
Bw tech hadoop
Bw tech hadoopBw tech hadoop
Bw tech hadoop
Mindgrub Technologies
 
BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data
Mindgrub Technologies
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
Ike Ellis
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"
Portland R User Group
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah Guido
Bitly
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
Taldor Group
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
tcloudcomputing-tw
 
Lecture 3 - Exploratory Data Analytics (EDA), a lecture in subject module Sta...
Lecture 3 - Exploratory Data Analytics (EDA), a lecture in subject module Sta...Lecture 3 - Exploratory Data Analytics (EDA), a lecture in subject module Sta...
Lecture 3 - Exploratory Data Analytics (EDA), a lecture in subject module Sta...
Maninda Edirisooriya
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
Csaba Toth
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
Russell Jurney
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
elephantscale
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
Adaryl "Bob" Wakefield, MBA
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
Geoff Hendrey
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
Geoff Hendrey
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding Failure
Gwen (Chen) Shapira
 
Big data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup GroupBig data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup Group
Sri Kanajan
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Cloudera, Inc.
 
BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data
Mindgrub Technologies
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
Ike Ellis
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"
Portland R User Group
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah Guido
Bitly
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
Taldor Group
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
tcloudcomputing-tw
 
Lecture 3 - Exploratory Data Analytics (EDA), a lecture in subject module Sta...
Lecture 3 - Exploratory Data Analytics (EDA), a lecture in subject module Sta...Lecture 3 - Exploratory Data Analytics (EDA), a lecture in subject module Sta...
Lecture 3 - Exploratory Data Analytics (EDA), a lecture in subject module Sta...
Maninda Edirisooriya
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
Csaba Toth
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
Russell Jurney
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
elephantscale
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
Geoff Hendrey
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
Geoff Hendrey
 
Ad

More from Donald Miner (6)

Machine Learning Vital Signs
Machine Learning Vital SignsMachine Learning Vital Signs
Machine Learning Vital Signs
Donald Miner
 
Survey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing DataSurvey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing Data
Donald Miner
 
An Introduction to Accumulo
An Introduction to AccumuloAn Introduction to Accumulo
An Introduction to Accumulo
Donald Miner
 
SQL on Accumulo
SQL on AccumuloSQL on Accumulo
SQL on Accumulo
Donald Miner
 
Data, The New Currency
Data, The New CurrencyData, The New Currency
Data, The New Currency
Donald Miner
 
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
Donald Miner
 
Machine Learning Vital Signs
Machine Learning Vital SignsMachine Learning Vital Signs
Machine Learning Vital Signs
Donald Miner
 
Survey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing DataSurvey of Accumulo Techniques for Indexing Data
Survey of Accumulo Techniques for Indexing Data
Donald Miner
 
An Introduction to Accumulo
An Introduction to AccumuloAn Introduction to Accumulo
An Introduction to Accumulo
Donald Miner
 
Data, The New Currency
Data, The New CurrencyData, The New Currency
Data, The New Currency
Donald Miner
 
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
Donald Miner
 

Recently uploaded (20)

2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 

Hadoop for Data Science

  • 1. Hadoop for Data Science Donald Miner Data Science MD October 9, 2013
  • 3. I’ll talk about… Intro to Hadoop (HDFS and MapReduce) Some reasons why I think Hadoop is cool (is this cliché yet?) Step 1: Hadoop Step 2: ???? Step 3: Data Science! Some examples of data science work on Hadoop What can Hadoop do to enable data science work?
  • 4. Hadoop Distributed File System HDFS • Stores files in folders (that’s it) – Nobody cares what’s in your files • Chunks large files into blocks (~64MB-2GB) • 3 replicates of each block (better safe than sorry) • Blocks are scattered all over the place FILE BLOCKS
  • 5. MapReduce • Analyzes raw data in HDFS where the data is • Jobs are split into Mappers and Reducers Reducers (you code this, too) Automatically Groups by the mapper’s output key Aggregate, count, statistics Outputs to HDFS Mappers (you code this) Loads data from HDFS Filter, transform, parse Outputs (key, value) pairs
  • 6. Hadoop Ecosystem • Higher-level languages like Pig and Hive • HDFS Data systems like HBase and Accumulo • Close friends like ZooKeeper, Flume, Storm, Cassandra, Avro
  • 7. Mahout • Mahout is a Machine Library • Has both MapReduce and non-parallel implementations of a number of algorithms: – Recommenders – Clustering – Classification
  • 8. Cool Thing #1: Linear Scalability • HDFS and MapReduce scale linearly • If you have twice as many computers, jobs run twice as fast • If you have twice as much data, jobs run twice as slow • If you have twice as many computers, you can store twice as much data DATA LOCALITY!!
  • 9. Cool Thing #2: Schema on Read LOAD DATA FIRST, ASK QUESTIONS LATER Data is parsed/interpreted as it is loaded out of HDFS What implications does this have? BEFORE: ETL, schema design upfront, tossing out original data, comprehensive data study Keep original data around! Have multiple views of the same data! Work with unstructured data sooner! Store first, figure out what to do with it later! WITH HADOOP:
  • 10. Cool Thing #3: Transparent Parallelism Network programming? Inter-process communication? Threading? Distributed stuff? With MapReduce, I DON’T CARE Your solution … I just have to be sure my solution fits into this tiny box Fault tolerance? Code deployment? RPC? Message passing? Locking? MapReduce Framework Data storage? Scalability? Data center fires?
  • 11. Cool Thing #4: Unstructured Data • Unstructured data: media, text, forms, log data lumped structured data • Query languages like SQL and Pig assume some sort of “structure” • MapReduce is just Java: You can do anything Java can do in a Mapper or Reducer
  • 12. The rest of the talk • Four threads: – Data exploration – Classification – NLP – Recommender systems I’m using these to illustrate some points
  • 13. Exploration • Hadoop is great at exploring data! • I like to explore data in a couple ways: – Filtering – Sampling – Summarization – Evaluate cleanliness • I like to spend 50% of my time doing exploration (but unfortunately it’s the first thing to get cut)
  • 14. Filtering • Filtering is like a microscope: I want to take a closer look at a subset • In MapReduce, you do this in the mapper • Identify nasty records you want to get rid of • Examples: – Only Baltimore data – Remove gibberish – Only 5 minutes
  • 15. Sampling • Hadoop isn’t the king of interactive analysis • Sampling is a good way to grab a set of data then work with it locally (Excel?) • Pig has a handy SAMPLE keyword • Types of sampling: – Sample randomly across the entire data set – Sub-graph extraction – Filters (from the last slide)
  • 16. Summarization • Summarization is a bird’s-eye view • MapReduce is good at summarization: – Mappers extract the group-by keys – Reducers do the aggregation • I like to: – Count number, get stdev, get average, get min/max of records in several groups – Count nulls in columns (if applicable) – Grab top-10 lists
  • 17. Evaluating Cleanliness • I’ve never been burned twice • Things to check for: – Fields that shouldn’t be null that are – Duplicates (does unique records=records?) – Dates (look for 1970; look at formats; time zones) – Things that should be normalized – Keys that are different because of trash e.g. “ abc “ != “abc” SQL/RDBMS =
  • 18. What’s the point? • Hadoop is really good at this stuff! • You probably have a lot of data and a lot of it is garbage! • Take the time to do this and your further work will be much easier • It’s hard to tell what methods you should use until you explore your data
  • 19. Classification • Classification is taking feature vectors (derived from your data), and then guessing some sort of label – E.g., sunny, Saturday, summer -> play tennis rainy, Wednesday, winter -> don’t play tennis • Most classification algorithms aren’t easily parallelizable or have good implementations • You need a training set of true feature vectors and labels… how often is your data labeled? • I’ve found classification rather hard, except for when…
  • 20. Overall Classification Workflow EXPLORATION EXPERIMENTATION WITH DIFFERENT METHODS REFINING PROMISING METHODS The Model Training Workflow FEATURE EXTRACTION MODEL TRAINING USE MODEL DATA FEATURE VECTORS MODEL OUTPUT
  • 21. Data volumes in training DATAVOLUME DATA I have a lot of data
  • 22. Data volumes in training DATAVOLUME DATA FEATURE VECTORS feature extraction Is this result “big data”? Examples: - 10TB of network traffic distilled into 9K IP address FVs - 10TB of medical records distilled into 50M patient FVs - 10TB of documents distilled into 5TB of document FVs
  • 23. Data volumes in training DATAVOLUME DATA FEATURE VECTORS feature extraction Model Training MODEL The model itself is usually pretty tiny
  • 24. Data volumes in training DATAVOLUME DATA FEATURE VECTORS feature extraction Model Training MODEL Applying that model to all the data is a big data problem!
  • 25. Some hurdles • Where do I run non-hadoop code? • How do I host out results to the application? • How do I use my model on streaming data? • Automate performance measurement?
  • 26. So what’s the point? • Not all stages of the model training workflow are Hadoop problems • Use the right tool for the job in each phase e.g., non-parallel model training in some cases FEATURE EXTRACTION MODEL TRAINING USE MODEL DATA FEATURE VECTORS MODEL OUTPUT
  • 27. Natural Language Processing • A lot of classic tools in NLP are “embarrassingly parallel” over an entire corpus since words split nicely. – Stemming – Lexical analysis – Parsing – Tokenization – Normalization – Removing stop words – Spell check Each of these apply to segments of text and don’t have much to do with any other piece of Text in the corpus.
  • 28. Python, NLTK, and Pig • Pig is a higher-level abstract over MapReduce • NLTK is a popular natural language toolkit for Python • Pig allows you to stream data through arbitrary processes (including python scripts) • You can use UDFs to wrap NLTK methods, but the need to use Jython sucks • Use Pig to move your data around, use a real package to do the work on the records postdata = STREAM data THROUGH `my_nltk_script.py`; (I do the same thing with Scipy and Numpy)
  • 29. OpenNLP and MapReduce • OpenNLP is an Apache project is an NLP library • “It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of- speech tagging, named entity extraction, chunking, parsing, and coreference resolution.” • Written in Java with reasonable APIs • MapReduce is just Java, so you can link into just about anything you want • Use OpenNLP in the Mapper to enrich, normalize, cleanse your data
  • 30. So what’s the point? • Hadoop can be used to glue together already existing libraries – You just have to figure out how to split the problem up yourself
  • 31. Recommender Systems • Hadoop is good at recommender systems – Recommender systems like a lot of data – Systems want to make a lot of recommendations • A number of methods available in Mahout
  • 32. Collaborative Filtering: Base recommendations on others • Collaborative Filtering is cool because it doesn’t have to understand the user or the item… just the relationships • Relationships are easy to extract, features and labels not so much
  • 33. What’s the point? • Recommender systems parallelize and there is a Hadoop library for it • They use relationships, not features, so the input data is easier to extract • If you can fit your problem into the recommendation framework, you can do something interesting
  • 34. Other stuff: Graphs • Graphs are useful and a lot can be done with Hadoop • Check out Giraph • Check out how Accumulo has been used to store graphs (google: “Graph 500 Accumulo”) • Stuff to do: – Subgraph extraction – Missing edge recommendation – Cool visualizations – Summarizing relationships
  • 35. Other stuff: Clustering • Provides interesting insight into group • Some methods parallelize well • Mahout has: – Dirichlet process – K-means – Fuzzy K-means
  • 36. Other stuff: R and Hadoop • RHIPE and Rhadoop allow you to write MapReduce jobs in R, instead of Java • Can also use Hadoop streaming to use R • This doesn’t magically parallelize all your R code • Useful to integrate into R more seamlessly
  • 37. Wrap up Hadoop can’t do everything and you have to do the rest

Editor's Notes

  • #2: Donald's talk will cover how to use native MapReduce in conjunction with Pig, including a detailed discussion of when users might be best served to use one or the other.