Introduction to Spark - Phoenix Meetup 08-19-2014

Introduction to Spark
Maxime Dumas – Systems Engineer, Cloudera
Phoenix Biomedical Informatics Group
August 19th 2014

Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• from Montreal, Canada

What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community

But first… how did we get here?

What does Hadoop look like?
[Diagram: a cluster of worker nodes, each running an HDFS worker (“DN”) and an MR worker (“TT”), …, alongside an HDFS master (“NN”), an MR master (“JT”), and a standby master]

But I want MORE!
[Diagram: the same cluster of HDFS workers, …, with MapReduce drawn as one layer on top of them, alongside the HDFS master (“NN”), the MR master (“JT”), and the standby master]

Hadoop as an Architecture
The Old Way
$30,000+ per TB
Expensive & Unattainable
• Hard to scale
• Network is a bottleneck
• Only handles relational data
• Difficult to add new fields & data types
Expensive, special-purpose, “reliable” servers
Expensive licensed software
[Diagram: Compute (RDBMS, EDW) connected over the Network to Data Storage (SAN, NAS)]
The Hadoop Way
$300-$1,000 per TB
Affordable & Attainable
• Scales out forever
• No bottlenecks
• Easy to ingest any data
• Agile data access
Commodity “unreliable” servers
Hybrid open source software
[Diagram: each node combines Compute (CPU), Memory, and Storage (Disk)]

CDH: the App Store for Hadoop
[Diagram: the CDH stack. Labels: Storage, Integration, Resource Management, Metadata; Batch Processing (MapReduce), Analytic MPP DBMS, Search Engine, In-Memory, Machine Learning, NoSQL DBMS, …; System Management, Data Management, Support, Security]

Introduction to Apache Spark
Credits:
• Ben White
• Todd Lipcon
• Ted Malaska
• Jairam Ranganathan
• Jayant Shekhar
• Sandy Ryza

Can we improve on MR?
• Problems with MR:
• Very low-level: requires a lot of code to do simple things
• Very constrained: everything must be described as “map” and “reduce”. Powerful but sometimes difficult to think in these terms.

Can we improve on MR?
• Two approaches to improve on MapReduce:
1. Special purpose systems to solve one problem domain well.
• Giraph / GraphLab (graph processing)
• Storm (stream processing)
• Impala (real-time SQL)
2. Generalize the capabilities of MapReduce to provide a richer foundation to solve problems.
• Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs)
Both are viable strategies depending on the problem!

What is Apache Spark?
Spark is a general purpose computational framework
Retains the advantages of MapReduce:
• Linear scalability
• Fault-tolerance
• Data Locality based computations
…but offers so much more:
• Leverages distributed memory for better performance
• Supports iterative algorithms that are not feasible in MR
• Improved developer experience
• Full Directed Graph expressions for data parallel computations
• Comes with libraries for machine learning, graph analysis, etc.

Getting started with Spark
• Java API
• Interactive shells:
• Scala (spark-shell)
• Python (pyspark)

Execution modes
• Standalone Mode
• Dedicated master and worker daemons
• Dedicated Spark runtime with static resource limits
• YARN Client Mode
• Launches a YARN application with the driver program running locally
• YARN Cluster Mode
• Launches a YARN application with the driver program running in the YARN ApplicationMaster
• Both YARN modes: dynamic resource management between Spark, MR, Impala…

Parallelized Collections
scala> val data = 1 to 5
data: Range.Inclusive = Range(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]
Now I can apply parallel operations to this array:
scala> distData.reduce(_ + _)
[… Adding task set 0.0 with 56 tasks …]
res0: Int = 15
What just happened?!
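One way to picture what happened: parallelize split the collection into partitions, each task reduced its own partition, and the partial results were combined into one value. The sketch below imitates that in plain Python, without Spark. The helper names parallelize_local and reduce_partitions and the partition count are illustrative assumptions, not Spark APIs.

```python
from functools import reduce
from operator import add


def parallelize_local(data, num_partitions):
    """Split a local collection into roughly equal partitions (hypothetical helper)."""
    items = list(data)
    return [items[i::num_partitions] for i in range(num_partitions)]


def reduce_partitions(partitions, func):
    """Reduce each non-empty partition on its own, then combine the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


partitions = parallelize_local(range(1, 6), num_partitions=3)
total = reduce_partitions(partitions, add)
print(partitions, total)
```

Because partial results are combined in no particular order, the function passed to reduce should be associative and commutative, as addition is here. Empty partitions are skipped, which is likely what happened to most of the 56 tasks in the log line above, since the range holds only five elements.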

RDD – Resilient Distributed Dataset
• Collections of objects partitioned across a cluster
• Stored in RAM or on Disk
• You can control persistence and partitioning
• Created by:
• Distributing local collection objects
• Transformation of data in storage
• Transformation of RDDs
• Automatically rebuilt on failure (resilient)
• Contains lineage to compute from storage
• Lazy materialization

Operations on RDDs
Transformations lazily transform an RDD into a new RDD
• map
• flatMap
• filter
• sample
• join
• sort
• reduceByKey
• …
Actions run computation to return a value
• collect
• reduce(func)
• foreach(func)
• count
• first, take(n)
• saveAs
• …
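The split between lazy transformations and eager actions can be sketched in plain Python. LazyDataset below is a toy stand-in invented for this illustration, not a Spark class: its map and filter only record what to do, and nothing runs until collect or count is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source   # the original data
        self._steps = steps     # tuple of (kind, func) pairs, applied lazily

    # Transformations: return a new dataset and compute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, func):
        return LazyDataset(self._source, self._steps + (("filter", func),))

    def _run(self):
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return items

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)   # record every real invocation
    return x * x


squares = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)   # nothing has run yet
result = squares.collect()         # the action forces evaluation
print(calls_before_action, result, len(calls))
```

Deferring the work this way is what lets an engine see the whole chain of transformations before deciding how to execute it.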

Fault Tolerance
• RDDs contain lineage.
• Lineage – source location and list of transformations
• Lost partitions can be re-computed from source data
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(…)) → Mapped RDD
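Lineage-based recovery can be sketched without Spark. In the toy model below, the source partitions, the log lines, and the helper compute_partition are all made up for illustration. The point is that a lost partition is rebuilt by replaying the recorded transformations over its source data, while the surviving partitions are left alone.

```python
# Source data split into partitions (stands in for HDFS blocks).
source_partitions = [
    ["ERROR\tdisk\tfull", "INFO\tok\tfine"],
    ["ERROR\tnet\ttimeout", "WARN\tcpu\thot"],
]

# Lineage: the ordered transformations that derive msgs from the source.
lineage = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]


def compute_partition(index):
    """Rebuild one partition by replaying the lineage over its source data."""
    items = source_partitions[index]
    for kind, func in lineage:
        if kind == "filter":
            items = [x for x in items if func(x)]
        else:
            items = [func(x) for x in items]
    return items


# Materialize every partition once, as a cache would.
cached = {i: compute_partition(i) for i in range(len(source_partitions))}

# Simulate losing the node that held partition 1.
del cached[1]

# Only the missing partition is recomputed; partition 0 is untouched.
missing = [i for i in range(len(source_partitions)) if i not in cached]
for i in missing:
    cached[i] = compute_partition(i)

print(missing, cached)
```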

Word Count in MapReduce
package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "wordcount");
        // Lets the cluster locate the jar that contains the job classes.
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Word Count in Spark
sc.textFile("words")
.flatMap(line => line.split(" "))
.map(word=>(word,1))
.reduceByKey(_+_).collect()
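For readers without a cluster at hand, the same three steps can be traced in plain Python on a local list. This is only a sketch of the data flow with made-up input lines. It uses no Spark API, and it runs on one machine rather than across partitions.

```python
from collections import defaultdict
from itertools import chain

lines = ["to be or", "not to be"]

# flatMap: split each line into words and flatten the result.
words = chain.from_iterable(line.split(" ") for line in lines)

# map: pair each word with a count of 1.
pairs = ((word, 1) for word in words)

# reduceByKey: add up the counts that share a key.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

result = dict(counts)
print(result)
```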

Logistic Regression
• Read two sets of points
• Look for a plane W that separates them
• Perform gradient descent:
• Start with random W
• On each iteration, sum a function of W over the data
• Move W in a direction that improves it
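The steps above can be sketched in plain Python. The point sets, the learning rate, the iteration count, and the fixed random seed are assumptions chosen for illustration. The loop is ordinary batch gradient descent on the logistic loss and is not taken from the talk's demo code.

```python
import math
import random

# Two small, linearly separable point sets (made-up illustration data).
positives = [(2.0, 1.5), (1.5, 2.5), (3.0, 2.0), (2.5, 3.0)]
negatives = [(-2.0, -1.5), (-1.5, -2.5), (-3.0, -2.0), (-2.5, -3.0)]

# Append a constant 1.0 so the last weight acts as the bias term.
data = [((x1, x2, 1.0), 1.0) for x1, x2 in positives] + \
       [((x1, x2, 1.0), 0.0) for x1, x2 in negatives]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Probability that x belongs to the positive set under weights w."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


rng = random.Random(0)                            # fixed seed for repeatability
w = [rng.uniform(-1.0, 1.0) for _ in range(3)]    # start with random W

learning_rate = 0.1
for _ in range(100):
    # Sum a function of W over the data: the gradient of the logistic loss.
    gradient = [0.0, 0.0, 0.0]
    for x, y in data:
        error = predict(w, x) - y
        for j in range(3):
            gradient[j] += error * x[j]
    # Move W in the direction that reduces the loss.
    w = [wj - learning_rate * gj for wj, gj in zip(w, gradient)]

correct = sum((predict(w, x) > 0.5) == (y == 1.0) for x, y in data)
print(w, correct, "of", len(data))
```

Every iteration reads the full data set again. That access pattern is why keeping the points in memory across iterations matters so much for this kind of algorithm.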

Logistic Regression Performance

Spark and Hadoop:
a Framework within a Framework

[Diagram: the same stack with Spark as one engine alongside MapReduce, Impala, Solr, HBase, …, on shared Storage, Integration, Resource Management, and Metadata layers, with System Management, Data Management, Support, and Security]

Spark Streaming
• Takes the concept of RDDs and extends it to DStreams
• Fault-tolerant like RDDs
• Transformable like RDDs
• Adds new “rolling window” operations
• Rolling averages, etc.
• But keeps everything else!
• Regular Spark code works in Spark Streaming
• Can still access HDFS data, etc.
• Example use cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS.
• Detecting anomalous behavior and triggering alerts.
• Continuous reporting of summary metrics for incoming data.
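A rolling-window average over micro-batches can be sketched in plain Python. The function name windowed_averages and the sample batches are invented for this illustration. A real DStream window is defined by time durations, whereas this toy counts batches.

```python
from collections import deque


def windowed_averages(batches, window_size):
    """Yield the average over the most recent `window_size` micro-batches."""
    window = deque(maxlen=window_size)   # old batches fall out automatically
    for batch in batches:
        window.append(batch)
        values = [v for b in window for v in b]
        yield sum(values) / len(values) if values else None


# Each inner list stands for the records that arrived in one batch interval.
batches = [[1, 3], [5], [7, 9], [11]]
averages = list(windowed_averages(batches, window_size=2))
print(averages)
```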

Micro-batching for on the fly ETL

What about SQL?
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

Fault Recovery Recap
• RDDs store dependency graph
• Because RDDs are deterministic, missing RDDs are rebuilt in parallel on other nodes
• Stateful RDDs can have infinite lineage
• Periodic checkpoints to disk clear lineage
• Faster recovery times
• Better handling of stragglers vs row-by-row streaming

Why Spark?
• Flexible like MapReduce
• High performance
• Machine learning, iterative algorithms
• Interactive data explorations
• Concise, easy API for developer productivity

Demo Time!
• Log file Analysis
• Machine Learning
• Spark Streaming

What’s Next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Try it online: Cloudera Live
• Cloudera provides pre-loaded VMs
• http://tiny.cloudera.com/quickstartvm

Questions?
Preferably related to the talk… or not.

Thank You!
Maxime Dumas
mdumas@cloudera.com
We’re hiring.
