What Is Spark? Up to 100× Faster

Spark is a fast, general-purpose cluster computing system that allows processing of large datasets across a cluster in a fault-tolerant manner. It improves efficiency through in-memory computing and general computation graphs. Spark provides APIs in Java, Scala and Python and can run on local machines, EC2 or private clusters using Mesos, YARN or standalone mode. Key concepts include resilient distributed datasets (RDDs) which are immutable distributed collections that can be rebuilt if a partition is lost.


What is Spark?

 Fast, expressive cluster computing system compatible with Apache Hadoop
- Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
 Improves efficiency (up to 100× faster) through:
- In-memory computing primitives
- General computation graphs
 Improves usability (often 2-10× less code) through:
- Rich APIs in Java, Scala, Python
- Interactive shell
How to Run It
 Local multicore: just a library in your program
 EC2: scripts for launching a Spark cluster
 Private cluster: Mesos, YARN, Standalone Mode
Languages
 APIs in Java, Scala and Python
 Interactive shells in Scala and Python
Outline
 Introduction to Spark
 Tour of Spark operations
 Job execution
 Standalone programs
 Deployment options
Key Idea
 Work with distributed collections as you would with local ones

 Concept: resilient distributed datasets (RDDs)


- Immutable collections of objects spread across a cluster
- Built through parallel transformations (map, filter, etc)
- Automatically rebuilt on failure
- Controllable persistence (e.g. caching in RAM)
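As a small illustration (a minimal PySpark sketch, not from the deck; the master URL and app name are placeholders), the following builds an RDD through parallel transformations and pins it in RAM:

from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-demo")    # illustrative local setup

# Immutable distributed collection, built through parallel transformations
nums = sc.parallelize(range(1, 1001))        # spread across partitions
evens = nums.filter(lambda x: x % 2 == 0)    # transformation (lazy)
squares = evens.map(lambda x: x * x)         # another transformation

squares.cache()            # controllable persistence: keep partitions in RAM
print(squares.count())     # first action computes the data and caches it
print(squares.take(5))     # later actions reuse the cached partitions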
Operations
 Transformations (e.g. map, filter, groupBy, join)
- Lazy operations to build RDDs from other RDDs
 Actions (e.g. count, collect, save)
- Return a result or write it to storage
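A minimal sketch of the distinction, assuming sc is an existing SparkContext (e.g. the shell variable introduced later); the data is illustrative:

# Assumes sc is an existing SparkContext
words = sc.parallelize(["spark", "hadoop", "spark", "rdd"])

# Transformations: lazy, they only describe how to build a new RDD
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)   # still no job has run

# Actions: trigger the computation and return a result (or write to storage)
print(words.count())      # => 4
print(counts.collect())   # e.g. [('spark', 2), ('hadoop', 1), ('rdd', 1)]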
Example: Mining Console Logs
 Load error messages from a log into memory, then interactively search for patterns
lines = spark.textFile("hdfs://...")                      # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()             # Action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver sends tasks to workers; each worker reads its block of the file once and caches its partition of messages in RAM for later queries]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance

RDDs track the transformations used to build them (their lineage) to recompute lost data

E.g.:

messages = textFile(...).filter(lambda s: "ERROR" in s) \
                        .map(lambda s: s.split('\t')[2])

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = contains(...)) → MappedRDD (func = split(…))
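The equivalent lineage can be built from PySpark; note that toDebugString is an assumption here, since it appeared in releases later than the 0.7-era API this deck describes:

# Assumes sc is a SparkContext (e.g. the shell's sc); the path is illustrative
messages = (sc.textFile("hdfs://namenode:9000/logs/app.log")
              .filter(lambda s: "ERROR" in s)
              .map(lambda s: s.split("\t")[2]))

# Spark records the chain of transformations; if a partition of `messages`
# is lost, only that partition is recomputed from the source file.
print(messages.toDebugString())   # prints the textFile -> filter -> map chain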
Fault Recovery Test

[Chart: iteration time (s) across 10 iterations of a job with a failure injected mid-run. Most iterations take 56-59 s, one takes 119 s, and the iteration where the failure happens takes 81 s while lost partitions are recomputed from lineage, after which times return to normal.]
Behavior with Less RAM

Iteration time (s) vs. fraction of the working set that fits in cache:

- Cache disabled: 69 s
- 25% cached: 58 s
- 50% cached: 41 s
- 75% cached: 30 s
- Fully cached: 12 s
Spark in Java and Scala

Java API:

JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      return s.contains("ERROR");
    }
  });
errors.count();

Scala API:

val lines = spark.textFile(…)
val errors = lines.filter(s => s.contains("ERROR"))
// can also write filter(_.contains("ERROR"))
errors.count
Which Language Should I Use?
 Standalone programs can be written in any of the three languages, but the interactive console is only available in Python & Scala
 Python developers: can stay with Python for both
 Java developers: consider using Scala for console (to learn the API)

 Performance: Java / Scala will be faster (statically typed), but Python can do well for
numerical work with NumPy
Scala Cheat Sheet
Variables:

var x: Int = 7
var x = 7        // type inferred
val y = "hi"     // read-only

Functions:

def square(x: Int): Int = x*x

def square(x: Int): Int = {
  x*x   // last line returned
}

Collections and closures:

val nums = Array(1, 2, 3)

nums.map((x: Int) => x + 2)   // => Array(3, 4, 5)
nums.map(x => x + 2)          // => same
nums.map(_ + 2)               // => same

nums.reduce((x, y) => x + y)  // => 6
nums.reduce(_ + _)            // => 6

Java interop:

import java.net.URL
new URL("http://cnn.com").openStream()

More details: scala-lang.org
Outline
 Introduction to Spark
 Tour of Spark operations
 Job execution
 Standalone programs
 Deployment options
Learning Spark
 Easiest way: Spark interpreter (spark-shell or pyspark)
- Special Scala and Python consoles for cluster use
 Runs in local mode on 1 thread by default, but can control with MASTER environment var:

MASTER=local ./spark-shell # local, 1 thread


MASTER=local[2] ./spark-shell # local, 2 threads
MASTER=spark://host:port ./spark-shell # Spark standalone cluster
First Stop: SparkContext
 Main entry point to Spark functionality
 Created for you in Spark shells as variable sc
 In standalone programs, you’d make your own (see later for details)
Creating RDDs
# Turn a local collection into an RDD
sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

# Use any existing Hadoop InputFormat
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
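Both creation calls also accept an optional second argument controlling how many partitions the RDD gets (a hedged sketch; the parameter names have varied across versions, e.g. numSlices for parallelize and minSplits for textFile):

# Assumes sc is an existing SparkContext; the path is illustrative
data = sc.parallelize(range(10000), 8)                 # ask for 8 slices
logs = sc.textFile("hdfs://namenode:9000/logs", 16)    # ask for at least 16 splits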
Basic Transformations
nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squares = nums.map(lambda x: x*x)   # => {1, 4, 9}

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)   # => {4}

# Map each element to zero or more others
nums.flatMap(lambda x: range(0, x))   # => {0, 0, 1, 0, 1, 2}
# range(0, x) is a sequence of numbers 0, 1, …, x-1
Basic Actions
nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
nums.collect() # => [1, 2, 3]
# Return first K elements
nums.take(2) # => [1, 2]
# Count number of elements
nums.count() # => 3
# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
Working with Key-Value Pairs
 Spark’s “distributed reduce” transformations act on RDDs of key-value pairs
 Python:
pair = (a, b)
pair[0]   # => a
pair[1]   # => b

 Scala:
val pair = (a, b)
pair._1   // => a
pair._2   // => b

 Java:
Tuple2 pair = new Tuple2(a, b);   // class scala.Tuple2
pair._1()   // => a
pair._2()   // => b
Some Key-Value Operations
pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}

pets.groupByKey()
# => {(cat, Seq(1, 2)), (dog, Seq(1))}

pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side
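To see why that matters, a hedged comparison (reusing the pets RDD above): both lines produce the same per-key totals, but reduceByKey pre-aggregates within each partition before the shuffle, while groupByKey ships every individual value across the network:

# Assumes sc and pets as defined above
totals = pets.reduceByKey(lambda x, y: x + y)               # map-side combine

totals_via_group = pets.groupByKey() \
                       .mapValues(lambda vs: sum(vs))       # sums after the shuffle

print(sorted(totals.collect()))            # => [('cat', 3), ('dog', 1)]
print(sorted(totals_via_group.collect()))  # same result, more data shuffled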


Example: Word Count

lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)

[Diagram: "to be or" / "not to be" is split into words, each word is mapped to a (word, 1) pair, and the pairs are reduced by key to (to, 2), (be, 2), (or, 1), (not, 1)]
Multiple Datasets
visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1")])

pageNames = sc.parallelize([("index.html", "Home"),
                            ("about.html", "About")])

visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
# ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
# ("about.html", (Seq("3.4.5.6"), Seq("About")))
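Building on the join above, a small hedged sketch (not in the original deck) that counts visits per page title by combining join with reduceByKey:

# visits: (url, ip) and pageNames: (url, title), as defined above
counts_by_title = (visits.join(pageNames)                 # (url, (ip, title))
                         .map(lambda kv: (kv[1][1], 1))   # (title, 1)
                         .reduceByKey(lambda x, y: x + y))

print(sorted(counts_by_title.collect()))
# => [('About', 1), ('Home', 2)]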
Controlling the Level of Parallelism
 All the pair RDD operations take an optional second parameter for number of tasks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
Using Local Variables
 External variables you use in a closure will automatically be shipped to the cluster:
query = raw_input("Enter a query:")
pages.filter(lambda x: x.startswith(query)).count()

 Some caveats:
- Each task gets a new copy (updates aren’t sent back)
- Variable must be Serializable (Java/Scala) or Pickle-able (Python)
- Don’t use fields of an outer object (ships all of it!)
Closure Mishap Example

class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...

  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)
       .reduce(...)
  }
}
// NotSerializableException: MyCoolRddApp (or Log)

How to get around it:

class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param           // references only a local variable
    rdd.map(x => x + param_)     // instead of this.param
       .reduce(...)
  }
}
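The same pitfall applies in PySpark, where closures are pickled: a hedged sketch (not from the deck) of the broken and fixed versions:

class MyCoolRddApp(object):
    def __init__(self):
        self.param = 3.14
        self.log = open("/tmp/app.log", "w")   # file handles can't be pickled

    def work_broken(self, rdd):
        # Captures `self`, so Spark tries to pickle the whole object (and fails)
        return rdd.map(lambda x: x + self.param).reduce(lambda a, b: a + b)

    def work_fixed(self, rdd):
        param = self.param                     # copy into a local variable first
        return rdd.map(lambda x: x + param).reduce(lambda a, b: a + b)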
More Details
 Spark supports lots of other operations!
 Full programming guide: spark-project.org/documentation
Outline
 Introduction to Spark
 Tour of Spark operations
 Job execution
 Standalone programs
 Deployment options
Software Components
 Spark runs as a library in your program (one instance per app)
 Runs tasks locally or on a cluster
- Standalone deploy cluster, Mesos or YARN
 Accesses storage via Hadoop InputFormat API
- Can use HBase, HDFS, S3, …

[Diagram: your application holds a SparkContext, which talks to a cluster manager or runs local threads; each worker node runs a Spark executor, and executors access HDFS or other storage]
Task Scheduler
 Supports general task graphs
 Pipelines functions where possible
 Cache-aware data reuse & locality
 Partitioning-aware to avoid shuffles

[Diagram: an RDD graph split into three stages at shuffle boundaries (groupBy, join); map and filter are pipelined within a stage, and partitions already in the cache are not recomputed]
Hadoop Compatibility
 Spark can read/write to any storage system / format that has a plugin for Hadoop!
- Examples: HDFS, S3, HBase, Cassandra, Avro, SequenceFile
- Reuses Hadoop’s InputFormat and OutputFormat APIs
 APIs like SparkContext.textFile support filesystems, while SparkContext.hadoopRDD
allows passing any Hadoop JobConf to configure an input source
Outline
 Introduction to Spark
 Tour of Spark operations
 Job execution
 Standalone programs
 Deployment options
Build Spark
 Requires Java 6+, Scala 2.9.2
git clone git://github.com/mesos/spark
cd spark
sbt/sbt package
# Optional: publish to local Maven cache
sbt/sbt publish-local
Add Spark to Your Project
 Scala and Java: add a Maven dependency on
groupId: org.spark-project
artifactId: spark-core_2.9.1
version: 0.7.0-SNAPSHOT

 Python: run program with our pyspark script


Create a SparkContext
Scala:

import spark.SparkContext
import spark.SparkContext._

val sc = new SparkContext("masterUrl", "name", "sparkHome", Seq("app.jar"))
// masterUrl: cluster URL, or local / local[N]
// name:      app name
// sparkHome: Spark install path on cluster
// app.jar:   list of JARs with app code (to ship)

Java:

import spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(
  "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python:

from pyspark import SparkContext

sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])
Complete App: Scala
import spark.SparkContext
import spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "WordCount", args(0), Seq(args(1)))
    val lines = sc.textFile(args(2))
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .saveAsTextFile(args(3))
  }
}
Complete App: Python
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    lines.flatMap(lambda s: s.split(" ")) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda x, y: x + y) \
         .saveAsTextFile(sys.argv[2])
Example: PageRank
Why PageRank?
 Good example of a more complex algorithm
- Multiple stages of map & reduce
 Benefits from Spark’s in-memory caching
- Multiple iterations over the same data
Basic Idea
 Give pages ranks (scores) based on links to them
- Links from many pages → high rank
- Link from a high-rank page → high rank

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs

[Diagram sequence: a small four-page link graph; every page starts at rank 1.0, contributions flow along its links each iteration, and the ranks converge to a final state of 1.44, 1.37, 0.73 and 0.46]
Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)
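For comparison, a hedged PySpark version of the same loop; it is not from the original deck, and it assumes links and ranks are pair RDDs shaped as in the comments above:

# links: RDD of (url, [neighbor urls]); ranks: RDD of (url, rank)
ITERATIONS = 10

def contributions(pair):
    url, (neighbors, rank) = pair
    return [(dest, rank / len(neighbors)) for dest in neighbors]

for i in range(ITERATIONS):
    contribs = links.join(ranks).flatMap(contributions)
    ranks = contribs.reduceByKey(lambda x, y: x + y) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

ranks.saveAsTextFile("hdfs://namenode:9000/out/pagerank")   # path illustrative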
PageRank Performance

Iteration time (s), Hadoop vs. Spark:

- 30 machines: Hadoop 171 s, Spark 23 s
- 60 machines: Hadoop 80 s, Spark 14 s
Other Iterative Algorithms

Time per iteration (s), Hadoop vs. Spark:

- K-Means Clustering: Hadoop 155 s, Spark 4.1 s
- Logistic Regression: Hadoop 110 s, Spark 0.96 s

Outline
 Introduction to Spark
 Tour of Spark operations
 Job execution
 Standalone programs
 Deployment options
Local Mode
 Just pass local or local[k] as master URL
 Still serializes tasks to catch marshaling errors
 Debug using local debuggers
- For Java and Scala, just run your main program in a debugger
- For Python, use an attachable debugger (e.g. PyDev, winpdb)
 Great for unit testing
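For example, a minimal local-mode unit test might look like this (a sketch; module and class names are illustrative):

import unittest
from pyspark import SparkContext

class WordCountTest(unittest.TestCase):
    def setUp(self):
        self.sc = SparkContext("local", "test")   # local mode, 1 thread

    def tearDown(self):
        self.sc.stop()   # release the context so other tests can create one

    def test_counts(self):
        lines = self.sc.parallelize(["to be or", "not to be"])
        counts = (lines.flatMap(lambda s: s.split(" "))
                       .map(lambda w: (w, 1))
                       .reduceByKey(lambda x, y: x + y))
        self.assertEqual(dict(counts.collect())["to"], 2)

if __name__ == "__main__":
    unittest.main()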
Private Cluster
 Can run with one of:
- Standalone deploy mode (similar to Hadoop cluster scripts)
- Apache Mesos: spark-project.org/docs/latest/running-on-mesos.html
- Hadoop YARN: spark-project.org/docs/0.6.0/running-on-yarn.html
 Basically requires configuring a list of workers, running launch scripts, and passing a
special cluster URL to SparkContext
Amazon EC2
 Easiest way to launch a Spark cluster
git clone git://github.com/mesos/spark.git
cd spark/ec2
./spark-ec2 -k keypair -i id_rsa.pem -s slaves \
[launch|stop|start|destroy] clusterName

 Details: spark-project.org/docs/latest/ec2-scripts.html

 New: run Spark on Elastic MapReduce – tinyurl.com/spark-emr


Viewing Logs
 Click through the web UI at master:8080
 Or, look at the stdout and stderr files in the Spark or Mesos “work” directory for your app:
work/<ApplicationID>/<ExecutorID>/stdout
 Application ID (Framework ID in Mesos) is printed when Spark connects
Community
 Join the Spark Users mailing list:
groups.google.com/group/spark-users

 Come to the Bay Area meetup:


www.meetup.com/spark-users
Conclusion
 Spark offers a rich API to make data analytics fast: both fast to write and fast to run
 Achieves 100x speedups in real applications
 Growing community with 14 companies contributing
 Details, tutorials, videos: www.spark-project.org
