What Is Spark?: Up To 100× Faster
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance
RDDs track the transformations used to build them (their lineage) to recompute lost data
E.g.:
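A minimal Scala sketch of such a lineage (the file path and field layout here are assumptions): if a partition of messages is lost, Spark re-runs just the filter and map on the corresponding block of the input file rather than replaying the whole job.

val lines = sc.textFile("hdfs://...")              // HDFS file
val errors = lines.filter(_.startsWith("ERROR"))   // filtered RDD: remembers its parent + the filter function
val messages = errors.map(_.split('\t')(2))        // mapped RDD: remembers its parent + the map function
messages.count()                                   // a lost partition is rebuilt from this chain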
[Chart: iteration time (s) vs. % of working set in cache — roughly 119 s with the cache disabled, dropping steadily as more of the working set fits in cache, down to about 12 s when fully cached]
Spark in Java and Scala
errors.count()
Which Language Should I Use?
Standalone programs can be written in any of the three, but the interactive console is available only in Python and Scala
Python developers: can stay with Python for both
Java developers: consider using Scala for the console (to learn the API)
Performance: Java / Scala will be faster (statically typed), but Python can do well for numerical work with NumPy
Scala Cheat Sheet
Variables:
var x: Int = 7
var x = 7          // type inferred
val y = "hi"       // read-only

Collections and closures:
val nums = Array(1, 2, 3)
nums.map((x: Int) => x + 2)  // => Array(3, 4, 5)
nums.map(x => x + 2)         // => same
nums.map(_ + 2)              // => same
Example: Word Count

lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
"to be or"  → "to", "be", "or"  → (to, 1), (be, 1), (or, 1)
"not to be" → "not", "to", "be" → (not, 1), (to, 1), (be, 1)
reduceByKey → (be, 2), (not, 1), (or, 1), (to, 2)
Multiple Datasets
visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1")])

pageNames = sc.parallelize([("index.html", "Home"), ("about.html", "About")])

visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
# ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
# ("about.html", (Seq("3.4.5.6"), Seq("About")))
Controlling the Level of Parallelism
All the pair RDD operations take an optional second parameter for number of tasks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageNames, 5)
Using Local Variables
External variables you use in a closure will automatically be shipped to the cluster:
query = raw_input("Enter a query:")
pages.filter(lambda x: x.startswith(query)).count()
Some caveats:
- Each task gets a new copy (updates aren’t sent back)
- Variable must be Serializable (Java/Scala) or Pickle-able (Python)
- Don’t use fields of an outer object (ships all of it!)
Closure Mishap Example
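A sketch of the kind of mishap meant here (the class and field names are invented, and the RDD type is written against the deck's old spark.* package): using a field of the enclosing object inside a closure forces Spark to ship the whole object, which fails if any of its fields is not serializable.

class MyCoolRddApp {
  val param = 3.14
  val log = new java.io.PrintWriter("app.log")   // not serializable

  def work(rdd: spark.RDD[Int]) {
    rdd.map(x => x + param)     // `param` is a field, so the closure captures `this`
       .reduce(_ + _)           // => Spark tries to ship all of MyCoolRddApp and hits
  }                             //    a NotSerializableException on the PrintWriter
}

Workaround, per the caveats above: copy the field into a local variable so only that value is captured.

class MySaferRddApp {
  val param = 3.14
  def work(rdd: spark.RDD[Int]) {
    val param_ = param          // local copy; only this Double ships with the task
    rdd.map(x => x + param_)
       .reduce(_ + _)
  }
}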
Complete App: Scala

import spark.SparkContext
import spark.SparkContext._
object WordCount {
  def main(args: Array[String]) {
    // SparkContext(cluster URL (or "local"), app name,
    //              Spark install path on the cluster, list of JARs with app code to ship)
    val sc = new SparkContext("local", "WordCount", args(0), Seq(args(1)))
    val lines = sc.textFile(args(2))
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .saveAsTextFile(args(3))
  }
}
Complete App: Python
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs

[Figure: PageRank iterations on a small four-page link graph — every rank starts at 1.0; after one iteration the ranks are 1.85, 1.0, 0.58, 0.58; after two, 1.31, 1.72, 0.58, 0.39; continuing until they settle near values such as 1.37, 0.73, 0.46]
Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
ranks.saveAsTextFile(...)
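A sketch of the iteration loop that belongs between those two declarations and the final save, following the algorithm above (the iteration count and the element types assumed for links and ranks are guesses consistent with the comments):

// assumes links: RDD[(String, Seq[String])] and ranks: RDD[(String, Double)]
val ITERATIONS = 10                                  // assumed; not specified above
for (i <- 1 to ITERATIONS) {
  // each page sends rank / |neighbors| to every neighbor
  val contribs = links.join(ranks).flatMap {
    case (url, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
  }
  // new rank = 0.15 + 0.85 * (sum of contributions received)
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}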
PageRank Performance
[Chart: iteration time (s) vs. number of machines — Hadoop: 171 s on 30 machines, 80 s on 60; Spark: 23 s on 30 machines, 14 s on 60]
Other Iterative Algorithms
[Chart: running time (s), Hadoop vs. Spark — K-Means Clustering: 155 vs. 4.1; Logistic Regression: 110 vs. 0.96]
Details: spark-project.org/docs/latest/ec2-scripts.html