Apache Spark 
Buenos Aires High Scalability 
Buenos Aires, Argentina, Dec 2014 
Fernando Rodriguez Olivera 
@frodriguez
Fernando Rodriguez Olivera 
Professor at Universidad Austral (Distributed Systems, Compiler 
Design, Operating Systems, …) 
Creator of mvnrepository.com 
Organizer at Buenos Aires High Scalability Group, Professor at 
nosqlessentials.com 
Twitter: @frodriguez
Apache Spark 
Apache Spark is a Fast and General Engine 
for Large-Scale data processing 
In-Memory computing primitives 
Support for Batch, Interactive, Iterative and 
Stream processing with a Unified API
Apache Spark 
Unified API for multiple kinds of processing 
Batch (high throughput) 
Interactive (low latency) 
Stream (continuous processing) 
Iterative (results used immediately)
Daytona Gray Sort 100TB Benchmark 
                      Data Size   Time     Nodes   Cores 
Hadoop MR (2013)      102.5 TB    72 min   2,100   50,400 physical 
Apache Spark (2014)   100 TB      23 min   206     6,592 virtualized 
3X faster using 10X fewer machines 
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Hadoop vs Spark for Iterative Processing 
Logistic regression in Hadoop and Spark 
source: https://spark.apache.org/
Hadoop MR Limits 
(Diagram: multiple Jobs communicating through Hadoop HDFS) 
MapReduce designed for Batch Processing: 
- Communication between jobs through FS 
- Fault-Tolerance (between jobs) by Persistence to FS 
- Memory not managed (relies on OS caches) 
Compensated for with: Storm, Samza, Giraph, Impala, Presto, etc
Apache Spark 
Apache Spark (Core), with Spark SQL, Spark Streaming, MLlib and GraphX on top 
Powered by Scala and Akka 
APIs for Java, Scala, Python
Resilient Distributed Datasets (RDD) 
RDD of Strings: "Hello World", …, "A New Line", …, "hello", "The End", … 
Immutable Collection of Objects 
Partitioned and Distributed 
Stored in Memory 
Partitions Recomputed on Failure
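A minimal sketch of those properties in code (assuming a SparkContext named spark, as on the Spark API slide; the sample data and partition count are illustrative): 
// an immutable collection, split into 4 partitions and distributed over the cluster 
val rdd = spark.parallelize(Seq("Hello World", "A New Line", "hello", "The End"), 4) 
rdd.partitions.length // 4 
// keep the partitions in memory; lost partitions are recomputed from lineage on failure 
rdd.cache()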
RDD Transformations and Actions 
RDD of Strings: "Hello World", …, "A New Line", …, "hello", "The End", … 
Compute Function (transformation), e.g. apply a function to count chars 
→ RDD of Ints: 11, …, 10, …, 5, …, 7, … (the new RDD depends on the RDD of Strings) 
Action, e.g. reduce the RDD of Ints to a single Int N 
RDD Implementation: Partitions, Compute Function, Dependencies, Preferred Compute Location (for each partition), Partitioner
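For reference, these five pieces surface directly in the RDD API; a sketch using the rdd value from the previous example: 
rdd.partitions                            // Array[Partition] 
rdd.dependencies                          // Seq[Dependency[_]], the "depends on" arrows 
rdd.preferredLocations(rdd.partitions(0)) // preferred compute location for one partition 
rdd.partitioner                           // Option[Partitioner]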
Spark API 
Scala: 
val spark = new SparkContext() 
val lines = spark.textFile("hdfs://docs/") // RDD[String] 
val nonEmpty = lines.filter(l => l.nonEmpty) // RDD[String] 
val count = nonEmpty.count 
Java 8: 
JavaSparkContext spark = new JavaSparkContext(); 
JavaRDD<String> lines = spark.textFile("hdfs://docs/"); 
JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0); 
long count = nonEmpty.count(); 
Python: 
spark = SparkContext() 
lines = spark.textFile("hdfs://docs/") 
nonEmpty = lines.filter(lambda line: len(line) > 0) 
count = nonEmpty.count()
RDD Operations 
Transformations: map(func), flatMap(func), filter(func), mapValues(func), groupByKey(), reduceByKey(func), … 
Actions: count(), collect(), reduce(func), take(N), takeOrdered(N), top(N), …
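A small sketch combining a few of these operations (the data is illustrative); transformations are lazy, actions trigger a job: 
val nums = spark.parallelize(1 to 10) 
val doubled = nums.map(_ * 2)          // transformation (lazy) 
val evens = doubled.filter(_ % 4 == 0) // transformation (lazy) 
evens.count()                          // action: runs the job 
evens.take(3)                          // action: Array(4, 8, 12)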
Text Processing Example 
Top Words by Frequency 
(Step by step)
Create RDD from External Data 
Apache Spark sits on the Hadoop FileSystem API, I/O Formats and Codecs, giving access to HDFS, S3, HBase, MongoDB, Cassandra, ElasticSearch, … 
Spark can read/write from any data source supported by Hadoop 
I/O via Hadoop is optional (e.g. the Cassandra connector bypasses Hadoop) 
// Step 1 - Create RDD from Hadoop Text File 
val docs = spark.textFile("/docs/")
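Since the Hadoop I/O layer is available, other input formats can be read the same way; a sketch with illustrative paths (wholeTextFiles and sequenceFile are standard SparkContext methods): 
// read a directory of small text files as (path, content) pairs 
val pages = spark.wholeTextFiles("hdfs://docs-raw/") 
// read a Hadoop SequenceFile of Text keys and values 
val pairs = spark.sequenceFile[String, String]("hdfs://docs-seq/") 
// write results back through the Hadoop output format support 
pairs.saveAsTextFile("hdfs://docs-out/")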
Function map 
RDD[String]: Hello World, A New Line, hello, ..., The end 
.map(line => line.toLowerCase)  =  .map(_.toLowerCase) 
→ RDD[String]: hello world, a new line, hello, ..., the end 
// Step 2 - Convert lines to lower case 
val lower = docs.map(line => line.toLowerCase)
Functions map and flatMap 
RDD[String]: hello world, a new line, hello, ..., the end 
.map(_.split("\\s+")) → RDD[Array[String]]: [hello, world], [a, new, line], [hello], ..., [the, end] 
.flatten → RDD[String]: hello, world, a, new, line, ... 
.flatMap(line => line.split("\\s+")) does both steps in a single transformation 
// Step 3 - Split lines into words 
val words = lower.flatMap(line => line.split("\\s+")) 
Note: flatten() is not available in Spark, only flatMap
Key-Value Pairs (Pair RDD) 
RDD[String]: hello, world, a, new, line, hello, ... 
.map(word => Tuple2(word, 1))  =  .map(word => (word, 1)) 
→ RDD[(String, Int)], i.e. RDD[Tuple2[String, Int]]: (hello, 1), (world, 1), (a, 1), (new, 1), (line, 1), (hello, 1), ... 
// Step 4 - Map each word to a (word, 1) pair 
val counts = words.map(word => (word, 1))
Shuffling 
RDD[(String, Int)]: (hello, 1), (world, 1), (a, 1), (new, 1), (line, 1), (hello, 1) 
.groupByKey → RDD[(String, Iterable[Int])]: (world, [1]), (a, [1]), (new, [1]), (line, [1]), (hello, [1, 1]) 
.mapValues(_.reduce((a, b) => a + b)) → RDD[(String, Int)]: (world, 1), (a, 1), (new, 1), (line, 1), (hello, 2) 
.reduceByKey((a, b) => a + b) performs the grouping and the reduction in a single step 
// Step 5 - Count all words 
val freq = counts.reduceByKey(_ + _)
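Both paths produce the same counts, but reduceByKey combines values on the map side before the shuffle, while groupByKey ships every (word, 1) pair across the network. A sketch of the equivalence, using counts from Step 4: 
val viaGroup = counts.groupByKey().mapValues(_.sum) // shuffles all pairs, then sums 
val viaReduce = counts.reduceByKey(_ + _)           // combines locally, then shuffles partial sums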
Top N (Prepare data) 
RDD[(String, Int)]: (world, 1), (a, 1), (new, 1), (line, 1), (hello, 2) 
.map(_.swap) → RDD[(Int, String)]: (1, world), (1, a), (1, new), (1, line), (2, hello) 
// Step 6 - Swap tuples (partial code) 
freq.map(_.swap)
Top N (First Attempt) 
RDD[(Int, String)]: (1, world), (1, a), (1, new), (1, line), (2, hello) 
.sortByKey → RDD[(Int, String)]: (2, hello), (1, world), (1, a), (1, new), (1, line)   (sortByKey(false) for descending) 
.take(N) → Array[(Int, String)]: (2, hello), (1, world)
Top N 
RDD[(Int, String)]: (1, world), (1, a), (1, new), (1, line), (2, hello) 
.top(N) computes a local top N per partition *, then reduces the partial results 
→ Array[(Int, String)]: (2, hello), (1, line) 
* local top N implemented by bounded priority queues 
// Step 6 - Swap tuples (complete code) 
val top = freq.map(_.swap).top(N)
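As an aside (a sketch, not from the slides): the swap can be avoided by passing an explicit Ordering to takeOrdered, with N as above: 
val topAlt = freq.takeOrdered(N)(Ordering.by[(String, Int), Int](_._2).reverse)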
Top Words by Frequency (Full Code) 
val spark = new SparkContext() 
// RDD creation from external data source 
val docs = spark.textFile("hdfs://docs/") 
// Normalize case, split lines into words, map to (word, 1) pairs 
val lower = docs.map(line => line.toLowerCase) 
val words = lower.flatMap(line => line.split("\\s+")) 
val counts = words.map(word => (word, 1)) 
// Count all words (automatic map-side combination) 
val freq = counts.reduceByKey(_ + _) 
// Swap tuples and get top results 
val top = freq.map(_.swap).top(N) 
top.foreach(println)
RDD Persistence (in-memory) 
.cache() (memory only) 
.persist() (memory only) 
.persist(storageLevel) 
StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, … 
Persistence and caching are lazy: the data is materialized the first time the RDD is computed
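A short sketch of choosing a storage level, reusing freq from Step 5 (the import path is the standard Spark one): 
import org.apache.spark.storage.StorageLevel 
freq.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk rather than recompute 
freq.count()    // first action materializes the persisted partitions 
freq.unpersist()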
RDD Lineage 
RDD Transformations build the lineage on the driver; an Action runs the job on the cluster 
words = sc.textFile("hdfs://large/file/")  // HadoopRDD 
        .map(_.toLowerCase)                // MappedRDD 
        .flatMap(_.split(" "))             // FlatMappedRDD 
nums  = words.filter(_.matches("[0-9]+"))  // FilteredRDD 
alpha = words.filter(_.matches("[a-z]+"))  // FilteredRDD 
alpha.count() // Action (run job on the cluster)
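The lineage built on the driver can be inspected with toDebugString (a real RDD method; the exact output format varies by version): 
println(alpha.toDebugString) 
// shows the chain FilteredRDD <- FlatMappedRDD <- MappedRDD <- HadoopRDD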
SchemaRDD & SQL 
SchemaRDD: an RDD of Row objects plus column metadata 
Queries with SQL 
Support for Reflection, JSON, Parquet, …
SchemaRDD & SQL 
case class Word(text: String, n: Int) 
val wordsFreq = freq.map { 
  case (text, count) => Word(text, count) 
} // RDD[Word] 
wordsFreq.registerTempTable("wordsFreq") 
val topWords = sql("select text, n from wordsFreq order by n desc limit 20") // RDD[Row] 
topWords.collect().foreach(println)
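The snippet assumes the Spark 1.x SQL entry point is in scope; a sketch of that setup, reusing the SparkContext from earlier: 
import org.apache.spark.sql.SQLContext 
val sqlContext = new SQLContext(spark) 
import sqlContext._ // brings sql(...) and the implicit conversion behind registerTempTable into scope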
Spark Streaming 
DStream: RDD RDD RDD RDD RDD RDD 
Data Collected, Buffered and Replicated by a Receiver (one per DStream), then Pushed to the stream as small RDDs 
Configurable Batch Intervals, e.g: 1 second, 5 seconds, 5 minutes 
Receivers: e.g. Kafka, Kinesis, Flume, Sockets, Akka, etc
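A minimal sketch of wiring a DStream up (the host, port and 5-second batch interval are illustrative; socketTextStream is a standard receiver): 
import org.apache.spark.streaming.{Seconds, StreamingContext} 
val ssc = new StreamingContext(spark, Seconds(5)) // reuse the SparkContext, 5-second batches 
val stream = ssc.socketTextStream("localhost", 9999) 
stream.print()         // output on each batch 
ssc.start()            // start the receiver and the processing 
ssc.awaitTermination()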
DStream Transformations 
A transformation applies to every RDD of the DStream, producing a new DStream (RDD by RDD) 
// Example 
val entries = stream.transform { rdd => rdd.map(Log.parse) } 
// Alternative 
val entries = stream.map(Log.parse)
Parallelism with Multiple Receivers 
DStream 1 (Receiver 1): RDD RDD RDD RDD RDD RDD 
DStream 2 (Receiver 2): RDD RDD RDD RDD RDD RDD 
union of (stream1, stream2, …) 
Union can be used to manage multiple DStreams as a single logical stream
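A sketch with illustrative socket receivers (StreamingContext.union is the standard call): 
val streams: Seq[org.apache.spark.streaming.dstream.DStream[String]] = 
  (1 to 3).map(_ => ssc.socketTextStream("localhost", 9999)) 
val unified = ssc.union(streams) // one logical DStream backed by three receivers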
Sliding Windows 
DStream: RDD RDD RDD RDD RDD RDD → windows W1, W2, W3, … 
Window Length: 3, Sliding Interval: 1
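A sketch of the window API with those settings expressed in time (assuming a 10-second batch interval, so a 3-batch window sliding every batch; wordCounts: DStream[(String, Int)] is illustrative): 
val windowedCounts = wordCounts.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)) 
// or, on any DStream: 
val windowed = stream.window(Seconds(30), Seconds(10))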
Deployment with Hadoop 
A file /large/file is stored in HDFS as blocks A, B, C, D with replication factor 3 (RF 3), spread over the Data Nodes 
Each Data Node (1 to 4) also runs a Spark Worker (DN + Spark); the Name Node and the Spark Master run alongside 
The Client submits the App (mode=cluster); the Spark Master allocates resources (cores and memory) for the Application 
The Driver and the Executors run on the Spark Workers, next to the HDFS blocks they read
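In practice the client submits the packaged application with spark-submit (its --deploy-mode cluster flag corresponds to the mode=cluster above). A sketch of the application-side setup, with an illustrative app name; master URL and deploy mode are normally supplied by spark-submit rather than hardcoded: 
import org.apache.spark.{SparkConf, SparkContext} 
val conf = new SparkConf().setAppName("TopWords") // illustrative application name 
val spark = new SparkContext(conf)                // master and deploy mode come from spark-submit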
Fernando Rodriguez Olivera 
twitter: @frodriguez