Data Ingestion with Spark
UII
Yogyakarta
About Me
Sofian Hadiwijaya
@sofianhw
me@sofianhw.com
Co Founder at Pinjam.co.id
Tech Advisor at Nodeflux.io
Software Innovator (IoT and AI) at Intel
Wikipedia big data
In information technology, big data is a loosely-defined
term used to describe data sets so large and
complex that they become awkward to work with
using on-hand database management tools.
Source: http://en.wikipedia.org/wiki/Big_data
How big is big?
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user data + 15 TB/day
• 2009: eBay has 6.5 PB user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day
That’s a lot of data
Credit: http://www.flickr.com/photos/19779889@N00/1367404058/
So what?
s/data/knowledge/g
No really, what do you do with it?
• User behavior analysis
• A/B test analysis
• Ad targeting
• Trending topics
• User and topic modeling
• Recommendations
• And more...
How to scale data?
Divide and Conquer
Parallel processing is complicated
• How do we assign tasks to workers?
• What if we have more tasks than slots?
• What happens when tasks fail?
• How do you handle distributed
synchronization?
Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/
Data storage is not trivial
• Data volumes are massive
• Reliably storing PBs of data is challenging
• Disk/hardware/network failures
• Probability of failure event increases with number of
machines
For example:
1000 hosts, each with 10 disks
each disk lasts 3 years
how many failures per day?
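The question above can be answered with a quick estimate. The sketch below assumes failures are spread evenly over each disk's lifetime and that a year has 365 days; both are simplifications that are not stated on the slide.

```python
# Back-of-the-envelope estimate for the example above.
# Assumption (not from the slides): failures are spread evenly over each
# disk's 3-year lifetime, and a year has 365 days.

HOSTS = 1000
DISKS_PER_HOST = 10
DISK_LIFETIME_YEARS = 3
DAYS_PER_YEAR = 365


def expected_failures_per_day(hosts, disks_per_host, lifetime_years):
    """Return the average number of disk failures per day across the cluster."""
    total_disks = hosts * disks_per_host
    lifetime_days = lifetime_years * DAYS_PER_YEAR
    return total_disks / lifetime_days


failures = expected_failures_per_day(HOSTS, DISKS_PER_HOST, DISK_LIFETIME_YEARS)
print(f"{failures:.1f} disk failures per day on average")
```

Under these assumptions the cluster loses roughly nine disks every day, which is why failure has to be treated as the normal case.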
Hadoop cluster
Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)
Hadoop
Hadoop provides
• Redundant, fault-tolerant data storage
• Parallel computation framework
• Job coordination
http://hadoop.apache.org
Joy
Credit: http://www.flickr.com/photos/spyndle/3480602438/
Hadoop origins
• Hadoop is an open-source implementation based
on GFS and MapReduce from Google
• Sanjay Ghemawat, Howard Gobioff, and Shun-
Tak Leung. (2003) The Google File System
• Jeffrey Dean and Sanjay Ghemawat. (2004)
MapReduce: Simplified Data Processing on Large
Clusters. OSDI 2004
Hadoop Stack
• Pig (Data Flow)
• Hive (SQL)
• Cascading (Java)
• MapReduce (Distributed Programming Framework)
• HBase (Columnar Database)
• HDFS (Hadoop Distributed File System)
HDFS
HDFS is...
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity
hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
HDFS - files and blocks
• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files
and blocks
• The SecondaryNameNode (SNN) periodically checkpoints the NN metadata (it is not a hot standby)
• DataNodes (DN) store and serve blocks
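As a rough illustration of the numbers above, the sketch below computes how many blocks and block replicas a file would occupy with the 64 MB block size and replication factor of 3 quoted on this slide. The function name `block_layout` is made up for the example and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64      # default block size quoted on the slide
REPLICATION = 3         # default replication factor quoted on the slide


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, stored block replicas, raw storage in MB)."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    total_replicas = num_blocks * replication
    # The last block only occupies as much space as the data it holds,
    # so raw storage is file size times replication, not blocks times 64 MB.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, total_replicas, raw_storage_mb


blocks, replicas, raw_mb = block_layout(1000)
print(blocks, replicas, raw_mb)
```

For a 1000 MB file this gives 16 blocks, 48 stored replicas and 3000 MB of raw storage, all of which the NameNode has to track as metadata.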
Replication
• Multiple copies of a block are stored
• Replication strategy:
• Copy #1 on another node on same rack
• Copy #2 on another node on different rack
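The strategy on this slide can be sketched as a small placement function. This is a toy model of the slide's description only: the real HDFS placement policy has varied between versions, and the node and rack names here are hypothetical.

```python
def place_replicas(writer, topology):
    """Pick nodes for the original block plus two copies.

    topology maps a rack name to the list of nodes on that rack.
    Raises StopIteration if no suitable node exists (toy behaviour).
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # Copy #1: another node on the same rack as the writer.
    same_rack = next(n for n in topology[local_rack] if n != writer)
    # Copy #2: a node on any other rack.
    remote = next(
        nodes[0] for rack, nodes in topology.items() if rack != local_rack and nodes
    )
    return [writer, same_rack, remote]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))
```

Spreading copies across racks means a single rack or switch failure cannot take out every replica of a block.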
HDFS - writes
[Diagram: a Client writes one block of a File to DataNodes (slave nodes) on Rack #1 and Rack #2, coordinated by the NameNode (master)]
Note: Write path for a single block shown. Client writes multiple blocks in parallel.
HDFS - reads
[Diagram: a Client asks the NameNode (master) where the blocks of a File live, then fetches block 1, block 2, ... block N from the DataNodes (slave nodes)]
Client reads multiple blocks in parallel and re-assembles into a file.
What about DataNode failures?
• DNs check in with the NN to report health
• Upon failure, the NN orders DNs to replicate under-replicated blocks
Credit: http://www.flickr.com/photos/18536761@N00/367661087/
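The failure handling above can be sketched as a toy model of the NameNode's bookkeeping. This is not HDFS code: the 30-second heartbeat timeout, the node names and the block IDs are all assumptions made for the example.

```python
# Toy model of the NameNode's view; not HDFS code.
TARGET_REPLICATION = 3
HEARTBEAT_TIMEOUT_S = 30  # hypothetical value chosen for the example


def live_datanodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_S):
    """DataNodes whose last check-in is recent enough to count as alive."""
    return {dn for dn, seen in last_heartbeat.items() if now - seen <= timeout}


def under_replicated(block_locations, live, target=TARGET_REPLICATION):
    """Map each block to the number of extra copies it needs on live nodes."""
    missing = {}
    for block, holders in block_locations.items():
        alive_copies = len(holders & live)
        if alive_copies < target:
            missing[block] = target - alive_copies
    return missing


last_heartbeat = {"dn1": 100, "dn2": 98, "dn3": 40, "dn4": 99}
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn1", "dn2", "dn4"},
}
live = live_datanodes(last_heartbeat, now=100)
print(under_replicated(block_locations, live))
```

Here dn3 has missed its check-in, so blk_1 is one copy short and would be scheduled for re-replication onto a live node.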
MapReduce
MapReduce is...
• A programming model for expressing distributed
computations at a massive scale
• An execution framework for organizing and
performing such computations
• An open-source implementation called Hadoop
Typical large-data problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
[Slide labels: Map (iterate, extract) and Reduce (aggregate, generate output)]
(Dean and Ghemawat, OSDI 2004)
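The five steps above can be sketched in a single process. This is a minimal, in-memory illustration of the programming model, not Hadoop code; the helper names (`map_phase`, `shuffle_and_sort`, `reduce_phase`) are made up for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return [reducer(key, values) for key, values in groups]


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(word, counts):
    return (word, sum(counts))


lines = ["big data is big", "data is knowledge"]
result = reduce_phase(shuffle_and_sort(map_phase(lines, word_mapper)), sum_reducer)
print(result)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network, but the contract between the phases is the same.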
MapReduce Flow
MapReduce architecture
[Diagram: a Client submits a Job to the JobTracker (master), which hands out Tasks to the TaskTrackers running on the slave nodes]
What about failed tasks?
• Tasks will fail
• JT will retry failed tasks up to N attempts
• After N failed attempts for a task, job fails
• Some tasks are slower than others
  • Speculative execution: the JT starts multiple attempts of the same task
  • The first attempt to complete wins; the others are killed
Credit: http://www.flickr.com/photos/phobia/2308371224/
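The retry rule above can be sketched as follows. This only illustrates the idea of "retry up to N attempts, then fail the job"; the value of N, the `JobFailed` exception and `flaky_task` are assumptions made for the example and are not the JobTracker's actual code.

```python
MAX_ATTEMPTS = 4  # "N" on the slide; the value here is an assumption


class JobFailed(Exception):
    """Raised when one task exhausts all of its attempts."""


def run_task(task, max_attempts=MAX_ATTEMPTS):
    """Run a task, retrying on failure; fail the job after max_attempts."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except Exception as error:  # any task error counts as a failed attempt
            last_error = error
    raise JobFailed(f"task failed {max_attempts} times: {last_error}")


def flaky_task(attempt):
    """Hypothetical task that fails twice and succeeds on the third attempt."""
    if attempt < 3:
        raise RuntimeError(f"attempt {attempt} lost its node")
    return f"done on attempt {attempt}"


print(run_task(flaky_task))
```

Retries are safe because tasks are deterministic functions of their input split, so a re-run produces the same output.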
MapReduce - Java API
• Mapper:
    void map(WritableComparable key,
             Writable value,
             OutputCollector output,
             Reporter reporter)
• Reducer:
    void reduce(WritableComparable key,
                Iterator values,
                OutputCollector output,
                Reporter reporter)
MapReduce - Java API
• Writable
  • Hadoop wrapper interface
  • Text, IntWritable, LongWritable, etc.
• WritableComparable
  • Writable classes implement WritableComparable
• OutputCollector
  • Class that collects keys and values
• Reporter
  • Reports progress, updates counters
• InputFormat
  • Reads data and provides InputSplits
  • Examples: TextInputFormat, KeyValueTextInputFormat
• OutputFormat
  • Writes data
  • Examples: TextOutputFormat, SequenceFileOutputFormat
MapReduce - Counters are...
• A distributed count of events during a job
• A way to indicate job metrics without logging
• Your friend
• Bad:
    System.out.println("Couldn't parse value");
• Good:
    reporter.incrCounter(BadParseEnum, 1L);
MapReduce - word count mapper
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
MapReduce - word count reducer
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key,
                     Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
MapReduce - word count main
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}
MapReduce - running a job
• To run word count, add files to HDFS and do:
$ bin/hadoop jar wordcount.jar
org.myorg.WordCount input_dir output_dir
MapReduce is good for...
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
MapReduce is ok for...
• Iterative jobs (e.g., graph algorithms)
  • Each iteration must read/write data to disk
  • IO and latency cost of an iteration is high
MapReduce is not good for...
• Jobs that need shared state/coordination
• Tasks are shared-nothing
• Shared-state requires scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
Hadoop combined architecture
[Diagram: each slave node runs a TaskTracker and a DataNode; the master runs the JobTracker and the NameNode; a backup node runs the SecondaryNameNode]
Hadoop Is Too Complicated
Welcome Spark
What is Spark?
Distributed data analytics engine, generalizing MapReduce
Core engine, with streaming, SQL, machine learning, and graph
processing modules
Most Active Big Data Project
Activity in last 30 days (as of June 1, 2014)
[Charts: patches, lines added, and lines removed for MapReduce, Storm, Yarn, and Spark]
Big Data Systems Today
• General batch processing: MapReduce
• Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Impala, S4, …
• Unified platform
Spark Core: RDDs
Distributed collection of objects
What’s cool about them?
• In-memory
• Built via parallel transformations
(map, filter, …)
• Automatically rebuilt on failure
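The "automatically rebuilt on failure" idea can be sketched with a toy class that remembers how each partition was derived. This is not Spark's API: `ToyRDD` is a hypothetical stand-in, it runs in one process, and it computes eagerly, whereas real RDDs are evaluated lazily.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a partitioned collection that remembers
    how it was derived, so a lost partition can be recomputed."""

    def __init__(self, partitions, parent=None, transform=None):
        self._partitions = partitions   # list of lists; None marks a lost one
        self._parent = parent           # ToyRDD this one was built from
        self._transform = transform     # function applied to each partition

    @classmethod
    def from_data(cls, data, num_partitions):
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(chunks)

    def _derive(self, transform):
        parts = [transform(self.partition(i)) for i in range(len(self._partitions))]
        return ToyRDD(parts, parent=self, transform=transform)

    def map(self, fn):
        return self._derive(lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return self._derive(lambda part: [x for x in part if pred(x)])

    def lose_partition(self, index):
        """Simulate a node failure that drops one in-memory partition."""
        self._partitions[index] = None

    def partition(self, index):
        """Return a partition, rebuilding it from the parent if it was lost."""
        if self._partitions[index] is None:
            if self._parent is None:
                raise RuntimeError("source data lost; nothing to rebuild from")
            self._partitions[index] = self._transform(self._parent.partition(index))
        return self._partitions[index]

    def collect(self):
        return [x for i in range(len(self._partitions)) for x in self.partition(i)]


numbers = ToyRDD.from_data(list(range(10)), num_partitions=2)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
evens_squared.lose_partition(0)          # pretend a worker died
print(sorted(evens_squared.collect()))   # the lost partition is recomputed
```

Only the lost partition is recomputed, by replaying the recorded transformation on the parent's matching partition; nothing had to be replicated up front.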
Data Sharing in MapReduce
[Diagram: iterative jobs do an HDFS read and an HDFS write around every iteration (iter. 1, iter. 2, ...); interactive queries (query 1, query 2, query 3, ...) each do their own HDFS read of the input to produce result 1, result 2, result 3, ...]
Slow due to replication, serialization, and disk IO
What We’d Like
[Diagram: the input is loaded once (one-time processing) into distributed memory, which then serves iter. 1, iter. 2, ... and query 1, query 2, query 3, ...]
10-100× faster than network and disk
A Unified Platform
• Spark SQL
• Spark Streaming (real-time)
• MLlib (machine learning)
• GraphX (graph)
• All built on Spark Core
Spark SQL
Unify tables with RDDs
Tables = Schema + Data = SchemaRDD
coolPants = sql("""
    SELECT pid, color
    FROM pants JOIN opinions
    WHERE opinions.coolness > 90""")
chosenPair = coolPants.filter(lambda row: row[1] == "green").take(1)
GraphX
Unifies graphs with RDDs of edges and vertices
MLlib
Vectors, Matrices = RDD[Vector]
Iterative computation
Spark Streaming
[Diagram: the input stream is cut along the time axis into a sequence of RDDs]
Express streams as a series of RDDs over time
val pantsers = spark.sequenceFile("hdfs:/pantsWearingUsers")
spark.twitterStream(...)
  .filter(t => t.text.contains("Hadoop"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(pantsers))
  .print()
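The "series of RDDs over time" idea can be sketched in plain Python as micro-batching: events are grouped into fixed-width time intervals and the same batch computation runs on each group. This is an illustration only, not the Spark Streaming API; the 2-second interval and the sample tweets are made up.

```python
from collections import defaultdict

BATCH_INTERVAL_S = 2  # hypothetical batch interval for the example


def to_micro_batches(events, interval=BATCH_INTERVAL_S):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Intervals that received no events are skipped in this toy version.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    return [batches[index] for index in sorted(batches)]


def process_stream(events, batch_fn, interval=BATCH_INTERVAL_S):
    """Apply the same batch computation to every micro-batch in time order."""
    return [batch_fn(batch) for batch in to_micro_batches(events, interval)]


def count_hadoop(batch):
    """Batch job: how many texts in this batch mention Hadoop?"""
    return sum("Hadoop" in text for text in batch)


tweets = [
    (0.1, "Hadoop is complicated"),
    (1.5, "more Hadoop"),
    (2.2, "lunch time"),
    (5.0, "Hadoop at scale"),
]
print(process_stream(tweets, count_hadoop))
```

Because each micro-batch is just a small batch job, the same code and the same fault-tolerance story apply to streaming and batch workloads.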
What it Means for Users
[Diagram, separate frameworks: HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → …]
[Diagram, Spark: HDFS → HDFS read → ETL → train → query, with interactive analysis]
Spark Cluster
Benefits of Unification
• No copying or ETLing data between systems
• Combine processing types in one program
• Code reuse
• One system to learn
• One system to maintain
Data Ingestion
Collection vs Ingestion
Data collection
• Happens where data originates
• “logging code”
• Batch vs. streaming
• Pull vs. push
Data Ingestion
• Receives data
• Sometimes coupled with storage
• Routing data
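The three responsibilities above (receive, store, route) can be sketched with a toy in-memory ingestion layer. Everything here is hypothetical: the `Ingestor` class, the sink names and the record shapes are made up for illustration and do not correspond to any particular ingestion tool.

```python
from collections import defaultdict


class Ingestor:
    """Toy ingestion layer: receives records and routes them to sinks."""

    def __init__(self):
        self._routes = []                 # (predicate, sink name) pairs
        self.sinks = defaultdict(list)    # in-memory stand-in for storage

    def add_route(self, predicate, sink_name):
        self._routes.append((predicate, sink_name))

    def receive(self, record):
        """Push-style entry point: store the record in the first matching sink."""
        for predicate, sink_name in self._routes:
            if predicate(record):
                self.sinks[sink_name].append(record)
                return sink_name
        self.sinks["unrouted"].append(record)
        return "unrouted"

    def pull_from(self, source):
        """Pull-style entry point: drain an iterable source in one batch."""
        return [self.receive(record) for record in source]


ingestor = Ingestor()
ingestor.add_route(lambda r: r.get("type") == "click", "clickstream")
ingestor.add_route(lambda r: r.get("type") == "error", "alerts")

ingestor.receive({"type": "click", "page": "/home"})
routed = ingestor.pull_from([{"type": "error", "msg": "timeout"}, {"type": "ping"}])
print(routed)
```

The collection side decides whether records are pushed one at a time or pulled in batches; the ingestion side only has to accept them and decide where they go.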
Thanks
Sofianhw
me@sofianhw.com
Pinjam.co.id
72
Ad

More Related Content

What's hot (20)

Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Data Con LA
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Spark Summit
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
Itai Yaffe
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
DataWorks Summit/Hadoop Summit
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Spark Summit
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
DataWorks Summit
 
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
David Chen
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
DataWorks Summit
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache Spark
Taras Matyashovsky
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
Spark Summit
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
Zhenxiao Luo
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Databricks
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
Databricks
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
gregchanan
 
Data & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureData & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architecture
Niels Naglé
 
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
TUMRA | Big Data Science - Gain a competitive advantage through Big Data & Data Science
 
Big data in Azure
Big data in AzureBig data in Azure
Big data in Azure
Venkatesh Narayanan
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Data Con LA
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
DataWorks Summit/Hadoop Summit
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Data Con LA
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Spark Summit
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
Itai Yaffe
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Spark Summit
 
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
David Chen
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
DataWorks Summit
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache Spark
Taras Matyashovsky
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
Spark Summit
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
Zhenxiao Luo
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Databricks
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
Databricks
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
gregchanan
 
Data & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureData & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architecture
Niels Naglé
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Data Con LA
 

Viewers also liked (20)

Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
Yönetim ve Organizasyon1
Yönetim ve Organizasyon1Yönetim ve Organizasyon1
Yönetim ve Organizasyon1
Nisantasi University
 
Spark machine learning & deep learning
Spark machine learning & deep learningSpark machine learning & deep learning
Spark machine learning & deep learning
hoondong kim
 
IoT Presentation - Unsri - Palembang
IoT Presentation - Unsri - PalembangIoT Presentation - Unsri - Palembang
IoT Presentation - Unsri - Palembang
Sofian Hadiwijaya
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
Slim Baltagi
 
IoT Platform with MQTT and Websocket
IoT Platform with MQTT and WebsocketIoT Platform with MQTT and Websocket
IoT Platform with MQTT and Websocket
Sofian Hadiwijaya
 
Why Apache Flink is better than Spark by Rubén Casado
Why Apache Flink is better than Spark by Rubén CasadoWhy Apache Flink is better than Spark by Rubén Casado
Why Apache Flink is better than Spark by Rubén Casado
Big Data Spain
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Databricks
 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics Platform
Amir Payberah
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Josef A. Habdank
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
Tensorflow - Intro (2017)
Tensorflow - Intro (2017)Tensorflow - Intro (2017)
Tensorflow - Intro (2017)
Alessio Tonioni
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
Databricks
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
Michael Noll
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
P. Taylor Goetz
 
Pentagono chiqui
Pentagono chiquiPentagono chiqui
Pentagono chiqui
tagorin
 
Iot presentation gunadarma
Iot presentation gunadarmaIot presentation gunadarma
Iot presentation gunadarma
Sofian Hadiwijaya
 
Iot presentation raharja
Iot presentation raharjaIot presentation raharja
Iot presentation raharja
Sofian Hadiwijaya
 
Javascript Basic RESTful
Javascript Basic RESTfulJavascript Basic RESTful
Javascript Basic RESTful
Sofian Hadiwijaya
 
Realtime traffic monitoring
Realtime traffic monitoringRealtime traffic monitoring
Realtime traffic monitoring
Sofian Hadiwijaya
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
Spark machine learning & deep learning
Spark machine learning & deep learningSpark machine learning & deep learning
Spark machine learning & deep learning
hoondong kim
 
IoT Presentation - Unsri - Palembang
IoT Presentation - Unsri - PalembangIoT Presentation - Unsri - Palembang
IoT Presentation - Unsri - Palembang
Sofian Hadiwijaya
 
IoT Platform with MQTT and Websocket
IoT Platform with MQTT and WebsocketIoT Platform with MQTT and Websocket
IoT Platform with MQTT and Websocket
Sofian Hadiwijaya
 
Why Apache Flink is better than Spark by Rubén Casado
Why Apache Flink is better than Spark by Rubén CasadoWhy Apache Flink is better than Spark by Rubén Casado
Why Apache Flink is better than Spark by Rubén Casado
Big Data Spain
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Databricks
 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics Platform
Amir Payberah
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Josef A. Habdank
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
Tensorflow - Intro (2017)
Tensorflow - Intro (2017)Tensorflow - Intro (2017)
Tensorflow - Intro (2017)
Alessio Tonioni
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
Databricks
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
Michael Noll
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
P. Taylor Goetz
 
Pentagono chiqui
Pentagono chiquiPentagono chiqui
Pentagono chiqui
tagorin
 
Ad

Similar to Intro to Big Data - Spark (20)

Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
Ahmed Ossama
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
Sandeep Singh
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
hansen3032
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
Anthony Hsu
 
Big Data
Big DataBig Data
Big Data
Mahesh Bmn
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
DataWorks Summit
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
Zohar Elkayam
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
chariorienit
 
Big data applications
Big data applicationsBig data applications
Big data applications
Juan Pablo Paz Grau, Ph.D., PMP
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
York University
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
ch adnan
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
Cloudera, Inc.
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
arslanhaneef
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
sonukumar379092
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
Ahmed Ossama
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
Sandeep Singh
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
hansen3032
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
Anthony Hsu
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
DataWorks Summit
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
Zohar Elkayam
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
ch adnan
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
Cloudera, Inc.
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Ad

Intro to Big Data - Spark

  • 1. Data Ingestion with Spark, UII Yogyakarta
  • 2. About Me: Sofian Hadiwijaya, @sofianhw, me@sofianhw.com. Co-Founder at Pinjam.co.id; Tech Advisor at Nodeflux.io; Software Innovator (IoT and AI) at Intel
  • 6. Wikipedia on big data: "In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools." Source: http://en.wikipedia.org/wiki/Big_data
  • 7. How big is big?
      • 2008: Google processes 20 PB a day
      • 2009: Facebook has 2.5 PB user data + 15 TB/day
      • 2009: eBay has 6.5 PB user data + 50 TB/day
      • 2011: Yahoo! has 180-200 PB of data
      • 2012: Facebook ingests 500 TB/day
  • 8. That’s a lot of data. Credit: http://www.flickr.com/photos/19779889@N00/1367404058/
  • 10. No really, what do you do with it?
      • User behavior analysis
      • AB test analysis
      • Ad targeting
      • Trending topics
      • User and topic modeling
      • Recommendations
      • And more...
  • 11. How to scale data?
  • 13. Parallel processing is complicated
      • How do we assign tasks to workers?
      • What if we have more tasks than slots?
      • What happens when tasks fail?
      • How do you handle distributed synchronization?
      Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/
  • 14. Data storage is not trivial
      • Data volumes are massive
      • Reliably storing PBs of data is challenging
      • Disk/hardware/network failures
      • The probability of a failure event increases with the number of machines
      For example: 1000 hosts, each with 10 disks; a disk lasts 3 years. How many failures per day?
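The question on this slide can be answered with a quick back-of-the-envelope calculation. The sketch below assumes, as a simplification that is not stated in the slides, that disk failures are spread evenly over the 3-year lifetime and that a year has 365 days.

```python
# Back-of-the-envelope estimate for the slide's question.
# Assumption (not in the slides): failures are spread evenly over a disk's lifetime.
HOSTS = 1000
DISKS_PER_HOST = 10
DISK_LIFETIME_YEARS = 3
DAYS_PER_YEAR = 365

total_disks = HOSTS * DISKS_PER_HOST                  # 10,000 disks in the cluster
lifetime_days = DISK_LIFETIME_YEARS * DAYS_PER_YEAR   # 1,095 days per disk
failures_per_day = total_disks / lifetime_days

print(f"{total_disks} disks, about {failures_per_day:.1f} failures per day")
# prints: 10000 disks, about 9.1 failures per day
```

Under these assumptions the cluster loses roughly nine disks every day, which is why the storage layer has to treat hardware failure as the normal case.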
  • 15. Hadoop cluster: a cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)
  • 17. Hadoop provides
      • Redundant, fault-tolerant data storage
      • Parallel computation framework
      • Job coordination
      http://hadoop.apache.org
  • 19. Hadoop origins
      • Hadoop is an open-source implementation based on GFS and MapReduce from Google
      • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System
      • Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
  • 20. Hadoop Stack (diagram)
      • HDFS (Hadoop Distributed File System)
      • MapReduce (Distributed Programming Framework)
      • HBase (Columnar Database)
      • Pig (Data Flow)
      • Hive (SQL)
      • Cascading (Java)
  • 22. HDFS is...
      • The Hadoop Distributed File System
      • A distributed file system
      • Redundant storage
      • Designed to reliably store data using commodity hardware
      • Designed to expect hardware failures
      • Intended for large files
      • Designed for batch inserts
  • 23. HDFS - files and blocks
      • Files are stored as a collection of blocks
      • Blocks are 64 MB chunks of a file (configurable)
      • Blocks are replicated on 3 nodes (configurable)
      • The NameNode (NN) manages metadata about files and blocks
      • The SecondaryNameNode (SNN) periodically checkpoints the NN metadata (it is a checkpoint helper, not a hot standby)
      • DataNodes (DN) store and serve blocks
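To make the block and replication numbers concrete, here is a minimal sketch of the arithmetic. The function name `hdfs_footprint` and the 1000 MB example file are illustrative assumptions; the 64 MB block size and replication factor of 3 are the defaults quoted on the slide.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide (configurable)
REPLICATION = 3      # default replication factor quoted on the slide (configurable)

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file.

    Hypothetical helper for illustration only; it is not part of any Hadoop API.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)  # last block may be partially filled
    replicas = blocks * replication                   # every block is copied `replication` times
    raw_mb = file_size_mb * replication               # a partial block only uses the bytes it holds
    return blocks, replicas, raw_mb

print(hdfs_footprint(1000))  # prints: (16, 48, 3000)
```

A 1000 MB file therefore becomes 16 blocks and 48 block replicas spread over the DataNodes, while the NameNode only tracks the metadata for them.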
  • 24. Replication
      • Multiple copies of a block are stored
      • Replication strategy:
          • Copy #1 on another node on the same rack
          • Copy #2 on another node on a different rack
  • 25. HDFS - writes (diagram: a Client with a File, the NameNode on the master, and DataNodes holding blocks on slave nodes in Rack #1 and Rack #2). Note: the write path for a single block is shown. The client writes multiple blocks in parallel.
  • 26. HDFS - reads (diagram: a Client, the NameNode on the master, and DataNodes holding block 1, block 2, ... block N on slave nodes). The client reads multiple blocks in parallel and re-assembles them into a file.
  • 27. What about DataNode failures?
      • DNs check in with the NN to report health
      • Upon failure, the NN orders DNs to replicate under-replicated blocks
      Credit: http://www.flickr.com/photos/18536761@N00/367661087/
  • 29. MapReduce is...
      • A programming model for expressing distributed computations at a massive scale
      • An execution framework for organizing and performing such computations
      • An open-source implementation called Hadoop
  • 30. Typical large-data problem
      • Iterate over a large number of records
      • Extract something of interest from each (Map)
      • Shuffle and sort intermediate results
      • Aggregate intermediate results (Reduce)
      • Generate final output
      (Dean and Ghemawat, OSDI 2004)
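The five steps above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not Hadoop code: the function names and the two input lines are assumptions made for the example, and nothing here runs in parallel.

```python
from collections import defaultdict

def map_phase(records):
    # Extract something of interest from each record: emit (word, 1) pairs.
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle_and_sort(pairs):
    # Group intermediate values by key, then order the groups by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Aggregate the values collected for each key.
    return [(key, sum(values)) for key, values in grouped]

records = ["big data is big", "spark is fast"]  # hypothetical input records
result = reduce_phase(shuffle_and_sort(map_phase(records)))
print(result)
# prints: [('big', 2), ('data', 1), ('fast', 1), ('is', 2), ('spark', 1)]
```

In a real cluster the map and reduce functions run as many tasks on different machines, and the framework performs the shuffle and sort between them.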
  • 33. What about failed tasks?
      • Tasks will fail
      • The JobTracker (JT) will retry a failed task up to N attempts
      • After N failed attempts for a task, the job fails
      • Some tasks are slower than others
      • Speculative execution: the JT starts multiple attempts of the same task
      • The first attempt to complete wins; the others are killed
      Credit: http://www.flickr.com/photos/phobia/2308371224/
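The retry rule can be sketched as a small loop. This is a simplified single-process illustration, not the JobTracker's actual logic; `run_with_retries` and `flaky_task` are hypothetical names, and speculative execution is left out.

```python
def run_with_retries(task, max_attempts):
    """Run task() until it succeeds; fail the job after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception:
            if attempt == max_attempts:
                # All N attempts failed, so the whole job is marked as failed.
                raise RuntimeError(f"job failed: task failed {max_attempts} times")

# Hypothetical flaky task: fails on the first two attempts, then succeeds.
calls = {"n": 0}

def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated task failure")
    return "done"

result, attempts = run_with_retries(flaky_task, max_attempts=4)
print(result, attempts)  # prints: done 3
```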
  • 34. MapReduce - Java API
      • Mapper: void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)
      • Reducer: void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
  • 35. MapReduce - Java API
      • Writable
          • Hadoop wrapper interface
          • Text, IntWritable, LongWritable, etc.
      • WritableComparable
          • Writable classes implement WritableComparable
      • OutputCollector
          • Class that collects keys and values
      • Reporter
          • Reports progress, updates counters
      • InputFormat
          • Reads data and provides InputSplits
          • Examples: TextInputFormat, KeyValueTextInputFormat
      • OutputFormat
          • Writes data
          • Examples: TextOutputFormat, SequenceFileOutputFormat
  • 36. MapReduce - Counters are...
      • A distributed count of events during a job
      • A way to indicate job metrics without logging
      • Your friend
      • Bad: System.out.println("Couldn't parse value");
      • Good: reporter.incrCounter(BadParseEnum, 1L);
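The idea behind counters can be illustrated without Hadoop: each task counts events locally instead of printing them, and the totals are merged for the whole job. The sketch below uses Python's `collections.Counter` as a stand-in; the counter names, `parse_task`, and the input splits are assumptions for the example and not part of the Hadoop API.

```python
from collections import Counter

def parse_task(lines):
    """Parse integers from lines, counting outcomes instead of printing them."""
    counters = Counter()
    values = []
    for line in lines:
        try:
            values.append(int(line))
            counters["PARSED"] += 1
        except ValueError:
            counters["BAD_PARSE"] += 1   # counted, not logged
    return values, counters

# Two hypothetical input splits, each handled by its own task.
splits = [["1", "2", "x"], ["oops", "4"]]
job_counters = Counter()
for split in splits:
    _, task_counters = parse_task(split)
    job_counters.update(task_counters)   # merge per-task counts into job totals

print(job_counters["PARSED"], job_counters["BAD_PARSE"])  # prints: 3 2
```

The job-level totals show how many records were bad without anyone having to search task logs on individual machines.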
  • 37. MapReduce - word count mapper
        // Nested in the WordCount class; uses the classic org.apache.hadoop.mapred API.
        public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
            // Split the input line into tokens and emit (word, 1) for each token.
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              output.collect(word, one);
            }
          }
        }
  • 38. MapReduce - word count reducer
        // Nested in the WordCount class; uses the classic org.apache.hadoop.mapred API.
        public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
            // Sum all counts emitted for this word and write (word, total).
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
  • 39. MapReduce - word count main
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");

          // Types of the job's output key/value pairs.
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);

          // The reducer doubles as a combiner because summing is associative.
          conf.setMapperClass(Map.class);
          conf.setCombinerClass(Reduce.class);
          conf.setReducerClass(Reduce.class);

          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);

          // args[0] is the input directory, args[1] the output directory.
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));

          JobClient.runJob(conf);
        }
  • 40. MapReduce - running a job
      • To run word count, add files to HDFS and do:
        $ bin/hadoop jar wordcount.jar org.myorg.WordCount input_dir output_dir
  • 41. MapReduce is good for...
      • Embarrassingly parallel algorithms
      • Summing, grouping, filtering, joining
      • Off-line batch jobs on massive data sets
      • Analyzing an entire large dataset
  • 42. MapReduce is ok for...
      • Iterative jobs (e.g., graph algorithms)
          • Each iteration must read/write data to disk
          • The IO and latency cost of an iteration is high
  • 43. MapReduce is not good for...
      • Jobs that need shared state/coordination
          • Tasks are shared-nothing
          • Shared state requires a scalable state store
      • Low-latency jobs
      • Jobs on small datasets
      • Finding individual records
  • 44. Hadoop combined architecture (diagram: the master runs the JobTracker and the NameNode; each slave node runs a TaskTracker and a DataNode; the SecondaryNameNode is shown as the backup)
  • 47. What is Spark?
      • Distributed data analytics engine, generalizing MapReduce
      • Core engine, with streaming, SQL, machine learning, and graph processing modules
  • 48. Most Active Big Data Project: activity in the last 30 days, as of June 1, 2014 (bar charts of Patches, Lines Added, and Lines Removed for MapReduce, Storm, Yarn, and Spark)
  • 49. Big Data Systems Today (diagram)
      • General batch processing: MapReduce
      • Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Impala, S4, …
      • Unified platform
  • 50. Spark Core: RDDs. Distributed collection of objects. What’s cool about them?
      • In-memory
      • Built via parallel transformations (map, filter, …)
      • Automatically rebuilt on failure
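The "automatically rebuilt on failure" property follows from each RDD remembering the transformations that produced it. The toy class below sketches that idea in plain Python. It is an illustration under strong simplifications and not Spark's implementation: `ToyRDD` and `lose_cache` are made-up names, and everything runs in one process with no partitioning or parallelism.

```python
class ToyRDD:
    """A toy stand-in for an RDD that records how to recompute itself (lineage)."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data from its parents
        self._cache = None        # in-memory copy, which may be lost

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data))

    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def collect(self):
        if self._cache is None:   # never computed, or lost: rebuild from lineage
            self._cache = self._compute()
        return self._cache

    def lose_cache(self):
        self._cache = None        # simulate losing the in-memory data

numbers = ToyRDD.from_list(range(10))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.lose_cache()
second = evens_squared.collect()  # recomputed from the recorded transformations
print(first, first == second)     # prints: [0, 4, 16, 36, 64] True
```

Because the data can always be recomputed from its recorded transformations, it does not need to be replicated in memory to survive a failure.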
  • 51. Data Sharing in MapReduce (diagram: iterative jobs do an HDFS read and an HDFS write around every iteration, iter. 1, iter. 2, ...; interactive queries, query 1, query 2, query 3, ..., each do their own HDFS read of the input to produce result 1, result 2, result 3). Slow due to replication, serialization, and disk IO.
  • 52. What We’d Like (diagram: after one-time processing, the input is held in distributed memory and shared across iterations, iter. 1, iter. 2, ..., and across queries, query 1, query 2, query 3, ...). 10-100× faster than network and disk.
  • 55. Spark SQL: unify tables with RDDs. Tables = Schema + Data = SchemaRDD
        coolPants = sql("""
          SELECT pid, color
          FROM pants JOIN opinions
          WHERE opinions.coolness > 90""")
        chosenPair = coolPants.filter(lambda row: row[1] == "green").take(1)
  • 56. GraphX Unifies graphs with RDDs of edges and vertices
  • 61. MLlib: Vectors, Matrices = RDD[Vector]; iterative computation
  • 63. Spark Streaming: express streams as a series of RDDs over time (diagram: a sequence of RDDs along a time axis)
        val pantsers = spark.sequenceFile("hdfs:/pantsWearingUsers")
        spark.twitterStream(...)
          .filter(t => t.text.contains("Hadoop"))
          .transform(tweets => tweets.map(t => (t.user, t)).join(pantsers))
          .print()
  • 64. What it Means for Users (diagram)
      • Separate frameworks: each stage (ETL, train, query, …) does its own HDFS read and HDFS write
      • Spark: a single HDFS read feeds ETL, train, and query, plus interactive analysis
  • 66. Benefits of Unification
      • No copying or ETLing data between systems
      • Combine processing types in one program
      • Code reuse
      • One system to learn
      • One system to maintain
  • 70. Data collection
      • Happens where data originates
      • “Logging code”
      • Batch vs. streaming
      • Pull vs. push
  • 71. Data Ingestion
      • Receives data
      • Sometimes coupled with storage
      • Routing data