Basics of RDD
What is RDD?
Dataset: Collection of data elements, e.g. Array, Tables, Data frame (R), collections of MongoDB
Distributed: Parts on multiple machines
Resilient: Recovers on failure
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
A collection of elements partitioned across the cluster
[Figure: a Spark Driver Application alongside several Spark Applications]
Creating RDD - Scala
WordCount - Scala
RDD Operations
Transformation Action
RDD - Operations : Transformation
Map Transformation
➢ map() is a transformation
➢ that runs the provided function against each element of the RDD
➢ and creates a new RDD from the results of that function
Map Transformation - Scala
➢ val arr = 1 to 10000
➢ val nums = sc.parallelize(arr)
➢ def multiplyByTwo(x:Int):Int = x*2
➢ multiplyByTwo(5)
10
➢ var dbls = nums.map(multiplyByTwo);
➢ dbls.take(5)
[2, 4, 6, 8, 10]
Transformations - filter() - scala
➢ var arr = 1 to 1000
➢ var nums = sc.parallelize(arr)
➢ def isEven(x:Int):Boolean = x%2 == 0
➢ var evens = nums.filter(isEven)
➢ evens.take(3)
[2, 4, 6]
nums 1 2 3 4 5 6 7 …..
evens 2 4 6 …..
RDD - Operations : Actions
Action Example - take()
➢ val arr = 1 to 1000000
➢ val nums = sc.parallelize(arr)
➢ def multiplyByTwo(x:Int):Int = x*2
➢ var dbls = nums.map(multiplyByTwo);
➢ dbls.take(5)
[2, 4, 6, 8, 10]
Action Example - saveAsTextFile()
RDD Operations
            Transformation    Action
Examples    map()             take()
Returns     Another RDD       Local value
Executes    Lazily            Immediately. Executes transformations
Lazy Evaluation Example - The waiter takes orders patiently
Customer 1: "Cheese burger, soup and a plate of noodles, please."
Customer 2: "Soup and a plate of noodles for me."
Waiter: "Ok. One cheese burger, two soups, two plates of noodles. Anything else, sir?"
Instant Evaluation
Cheese Burger...
And Soup?
Actions: Lazy Evaluation
1. Every time we call an action, the entire RDD must be computed from scratch (unless it has been persisted).
2. Every time d gets executed, a, b and c are run again:
a. lines = sc.textFile("myfile");
b. fewlines = lines.filter(...)
c. uppercaselines = fewlines.map(...)
d. uppercaselines.count()
3. When we call a transformation, it is not evaluated immediately.
4. Lazy evaluation helps Spark optimize performance.
5. It is similar to Pig, TensorFlow etc.
6. Instead of thinking of an RDD as a dataset, think of it as the instructions for how to compute the data.
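The points above can be sketched without a cluster. The following is a minimal illustration in plain Scala, not Spark itself: `Plan` is a hypothetical stand-in for an RDD, `textFile()` is a stand-in for `sc.textFile`, and the three log lines are made up. It only shows that transformations store a recipe and that each action re-runs the recipe from the source.

```scala
object LazyEvalSketch {
  // Counts how often the source is actually read.
  var sourceReads = 0

  // Hypothetical stand-in for an RDD: it stores a recipe, not the data.
  final class Plan[A](compute: () => Seq[A]) {
    // Transformations only wrap the recipe; nothing is computed here.
    def filter(p: A => Boolean): Plan[A] = new Plan[A](() => compute().filter(p))
    def map[B](f: A => B): Plan[B] = new Plan[B](() => compute().map(f))
    // An action runs the whole recipe, starting from the source.
    def count(): Int = compute().size
  }

  // Stand-in for sc.textFile("myfile"); the lines are invented for the example.
  def textFile(): Plan[String] = new Plan[String](() => {
    sourceReads += 1
    Seq("error: disk full", "ok", "error: timeout")
  })

  val lines = textFile()                               // a
  val fewlines = lines.filter(_.startsWith("error"))   // b
  val uppercaselines = fewlines.map(_.toUpperCase)     // c

  def main(args: Array[String]): Unit = {
    println(s"source reads before any action: $sourceReads")
    println(s"count: ${uppercaselines.count()}")       // d: runs a, b, c
    println(s"count: ${uppercaselines.count()}")       // d again: a, b, c run again
    println(s"source reads after two actions: $sourceReads")
  }
}
```

Each call to `count()` increases `sourceReads` by one, which mirrors the statement that every execution of d re-runs a, b and c.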
Actions: Lazy Evaluation - Optimization - Scala
lines2.collect()
Lineage Graph
[Figure: Spark code shown next to its lineage graph, with nodes including lowercaselines and uppercaselines]
Transformations:: flatMap() - Scala
➢ var linesRDD = sc.parallelize( Array("this is a dog", "named jerry"))
➢ def toWords(line:String):Array[String]= line.split(" ")
➢ var wordsRDD = linesRDD.flatMap(toWords)
➢ wordsRDD.collect()
➢ ['this', 'is', 'a', 'dog', 'named', 'jerry']
How is it different from Map()?
● With map(), the resulting RDD has the same number of elements as the input RDD.
● map() converts each element to exactly one element, while flatMap() can convert one element to zero or more elements.
What would happen if map() is used?
➢ var linesRDD = sc.parallelize( Array("this is a dog", "named jerry"))
➢ def toWords(line:String):Array[String]= line.split(" ")
➢ var wordsRDD1 = linesRDD.map(toWords)
➢ wordsRDD1.collect()
➢ [['this', 'is', 'a', 'dog'], ['named', 'jerry']]
FlatMap
flatMap as map
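The code for this slide is not preserved, so the following is only a sketch of the idea on a plain Scala collection (no Spark needed): if the function passed to flatMap returns a one-element collection, flatMap behaves like map. The name `multiplyByTwoWrapped` is hypothetical.

```scala
object FlatMapAsMapSketch {
  val nums = (1 to 5).toVector

  def multiplyByTwo(x: Int): Int = x * 2

  // Wrap the single result in a one-element Seq so that flatMap
  // has exactly one element to emit per input element.
  def multiplyByTwoWrapped(x: Int): Seq[Int] = Seq(multiplyByTwo(x))

  val viaMap = nums.map(multiplyByTwo)
  val viaFlatMap = nums.flatMap(multiplyByTwoWrapped)

  def main(args: Array[String]): Unit = {
    println(viaMap)
    println(viaFlatMap)
  }
}
```

Both values hold the same elements; on an RDD the same pattern would apply with `nums` created through `sc.parallelize`.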
flatMap as filter
➢ var arr = 1 to 1000
➢ var nums = sc.parallelize(arr)
➢ def isEven(x:Int):Array[Int] = {
➢   if(x%2 == 0) Array(x)
➢   else Array()
➢ }
➢ var evens = nums.flatMap(isEven)
➢ evens.take(3)
[2, 4, 6]
Transformations:: Union
➢ var a = sc.parallelize(Array('1','2','3'));
➢ var b = sc.parallelize(Array('A','B','C'));
➢ var c = a.union(b)
➢ c.collect();
['1', '2', '3', 'A', 'B', 'C']
Note: union() doesn't remove duplicates.
Transformations:: union()
InputRDD -> Filter -> errorsRDD
InputRDD -> Filter -> warningsRDD
errorsRDD, warningsRDD -> Union -> badlinesRDD
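The flow in this diagram can be sketched on a plain Scala collection. This is an illustration under assumptions: the log lines are invented, and `++` on local sequences stands in for the RDD `union()` call. Like `union()`, it keeps duplicates.

```scala
object UnionSketch {
  // Invented input lines, standing in for InputRDD.
  val inputLines = Seq(
    "error: disk full",
    "warning: low memory",
    "info: started",
    "error: timeout"
  )

  // Two independent filters over the same input.
  val errors = inputLines.filter(_.startsWith("error"))
  val warnings = inputLines.filter(_.startsWith("warning"))

  // ++ concatenates the two results, as union() does for RDDs.
  val badLines = errors ++ warnings

  // Concatenation keeps duplicates: the value 2 appears twice.
  val withDuplicates = Seq(1, 2) ++ Seq(2, 3)

  def main(args: Array[String]): Unit = {
    badLines.foreach(println)
    println(withDuplicates)
  }
}
```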
Actions: saveAsTextFile() - Scala
Actions: collect() - Scala
Brings all the elements of the RDD back to the driver. The data must fit into the driver's memory,
so for large RDDs it is mostly impractical.
1 2 3 4 5 6 7
Actions: take() - Scala
Brings only a few elements to the driver.
This is more practical than collect().
1 2 3 4 5 6 7
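Actions: count() - Scala
Finds out how many elements there are in an RDD.
Works in a distributed fashion: each partition is counted separately and the partial counts are added.
Partitions: [1, 2, 3] [4, 5] [6, 7]
Partial counts: 3, 2, 2
Total: 3 + 2 + 2 = 7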
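The same arithmetic can be sketched in plain Scala, with nested sequences as a stand-in for partitions. This only mirrors the idea of per-partition counts being added up; it is not how Spark is implemented internally.

```scala
object DistributedCountSketch {
  // Each inner Seq plays the role of one partition.
  val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6, 7))

  // Every partition is counted on its own ...
  val partialCounts = partitions.map(_.size)

  // ... and the partial counts are added to get the total.
  val total = partialCounts.sum

  def main(args: Array[String]): Unit = {
    println(partialCounts)
    println(total)
  }
}
```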
More Actions - Reduce()
To confirm, you could use the formula for summation of natural numbers
= n*(n+1)/2
= 100*101/2
= 5050
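The slide's own code is not preserved. Assuming it summed the numbers 1 to 100, as the formula suggests, the check can be sketched on a plain Scala range; on an RDD the call would presumably look like `sc.parallelize(1 to 100).reduce(_ + _)`.

```scala
object ReduceSumSketch {
  val nums = 1 to 100

  // reduce combines the elements pairwise with the given function.
  val total = nums.reduce(_ + _)

  // Closed-form check: n * (n + 1) / 2.
  val n = 100
  val byFormula = n * (n + 1) / 2

  def main(args: Array[String]): Unit = {
    println(total)
    println(byFormula)
  }
}
```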
How does reduce work?
[Figure: the RDD 3, 7, 13, 16, 9 is spread over Partition 1 and Partition 2. The intermediate sums 10, 25 and 23 are computed in the Spark applications, and the final result 48 is produced at the Spark driver.]
For avg(), can we use reduce?
We computed the sum using reduce.
Can we compute the average in the same way?
≫ var seq = sc.parallelize(Array(3.0, 7, 13, 16, 19))
≫ def avg(x: Double, y:Double):Double = {return (x+y)/2}
≫ var total = seq.reduce(avg);
total: Double = 9.875
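Why average with reduce is wrong?
[Figure: the RDD 3, 7, 13, 16, 9 spread over Partition 1 and Partition 2. Averaging pairwise gives the intermediate values 5 and 12.5 and the final result 10.75, whereas the true average is 48 / 5 = 9.6.]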
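Why average with reduce is wrong?
Averaging depends on the grouping: avg(avg(a, b), c) != avg(a, avg(b, c))
But sum is ok
Addition does not depend on the grouping: (a + b) + c = a + (b + c)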
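To make the contrast concrete, here is a small sketch in plain Scala that simulates two different partitionings of the values 3, 7, 13, 16, 9. The two splits and the helper `reducePartitioned` are assumptions for illustration; Spark chooses its own partitioning and combination order.

```scala
object ReduceGroupingSketch {
  val data = Seq(3.0, 7.0, 13.0, 16.0, 9.0)

  // Two hypothetical ways the same five values could be partitioned.
  val splitA = Seq(Seq(3.0, 7.0), Seq(13.0, 16.0, 9.0))
  val splitB = Seq(Seq(3.0, 7.0, 13.0), Seq(16.0, 9.0))

  def avg(x: Double, y: Double): Double = (x + y) / 2
  def sum(x: Double, y: Double): Double = x + y

  // Reduce inside each partition first, then reduce the partial results.
  def reducePartitioned(parts: Seq[Seq[Double]], f: (Double, Double) => Double): Double =
    parts.map(_.reduce(f)).reduce(f)

  val sumA = reducePartitioned(splitA, sum)
  val sumB = reducePartitioned(splitB, sum)
  val avgA = reducePartitioned(splitA, avg)
  val avgB = reducePartitioned(splitB, avg)
  val trueMean = data.sum / data.size

  def main(args: Array[String]): Unit = {
    println(s"sum: $sumA vs $sumB")   // same result for both splits
    println(s"avg: $avgA vs $avgB")   // differs between the splits
    println(s"true mean: $trueMean")
  }
}
```

The sum is the same for both splits, while the pairwise average changes with the split and matches the true mean in neither case.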
Reduce
Commutative
If changing the order of inputs does not make any difference to the
output, the function is commutative.
Commutative examples:
Addition: 2 + 3 = 3 + 2
Multiplication: 2 * 3 = 3 * 2
Average: (3 + 4 + 5) / 3 = (4 + 3 + 5) / 3
Euclidean distance: d(a, b) = d(b, a)
Non-commutative examples:
Division: 2 / 3 != 3 / 2
Subtraction: 2 - 3 != 3 - 2
Exponent / power: 2 ^ 3 != 3 ^ 2
Associative
Associative property:
We can add or multiply regardless of how
the numbers are grouped.
By 'grouped' we mean 'how you use
parentheses'.
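Both properties can be spot-checked on a few sample values. The sketch below is plain Scala and only tests the four sample numbers, so a `true` result is evidence rather than proof; the helper names `commutativeOn` and `associativeOn` are hypothetical.

```scala
object ReducePropertiesSketch {
  type Op = (Double, Double) => Double

  // Small whole numbers, so the floating-point arithmetic below is exact.
  val samples = Seq(1.0, 2.0, 3.0, 4.0)

  // f(a, b) == f(b, a) for every pair of samples.
  def commutativeOn(f: Op): Boolean =
    samples.forall(a => samples.forall(b => f(a, b) == f(b, a)))

  // f(f(a, b), c) == f(a, f(b, c)) for every triple of samples.
  def associativeOn(f: Op): Boolean =
    samples.forall(a => samples.forall(b => samples.forall(c =>
      f(f(a, b), c) == f(a, f(b, c)))))

  val add: Op = _ + _
  val subtract: Op = _ - _
  val avg: Op = (x, y) => (x + y) / 2

  def main(args: Array[String]): Unit = {
    println(s"add:      commutative=${commutativeOn(add)} associative=${associativeOn(add)}")
    println(s"subtract: commutative=${commutativeOn(subtract)} associative=${associativeOn(subtract)}")
    println(s"avg:      commutative=${commutativeOn(avg)} associative=${associativeOn(avg)}")
  }
}
```

The pairwise average passes the commutativity check but fails the associativity check, which is why it cannot be used directly as a reduce function.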
Approach 1 - Two actions: reduce() for the total, count() for the number of elements
➢ var rdd = sc.parallelize(Array(1.0,2,3, 4, 5 , 6, 7), 3);
➢ var avg = rdd.reduce(_ + _) / rdd.count();
Approach 2 - So, how to compute average?
Reduce (Total, Count) pairs: (Total1, Count1) and (Total2, Count2) combine into (Total1 + Total2, Count1 + Count2).
[Figure: for the elements 4, 5, 6 the pairs combine to (9, 2) and then (15, 3), so the average is 15/3 = 5.]
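The pair-based approach can be sketched in plain Scala using the figure's values 4, 5, 6. The split into two partitions is an assumption; the point is that combining (total, count) pairs gives the same answer however the values are grouped.

```scala
object AverageByPairsSketch {
  // Hypothetical partitioning of the elements 4, 5, 6.
  val partitions = Seq(Seq(4.0, 5.0), Seq(6.0))

  // Combine two (total, count) pairs by adding them component-wise.
  def combine(x: (Double, Int), y: (Double, Int)): (Double, Int) =
    (x._1 + y._1, x._2 + y._2)

  // Per partition: turn each value into (value, 1) and reduce the pairs.
  val partials = partitions.map(_.map(v => (v, 1)).reduce(combine))

  // Across partitions: reduce the partial pairs with the same function.
  val (total, count) = partials.reduce(combine)

  val average = total / count

  def main(args: Array[String]): Unit = {
    println(partials)
    println(s"total=$total count=$count average=$average")
  }
}
```

Only one pass over the data is needed, because the total and the count travel together through the same reduce.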
Comparison of the two approaches
Approach 1:
0.023900 + 0.065180
= 0.08908 seconds ~ 89 ms
Approach 2:
0.058654 seconds ~ 59 ms
Approach 1 takes approximately 1.5X as long.
How to compute Standard deviation?
The Standard Deviation is a measure of how spread out numbers are.
Let's calculate the SD of 2, 3, 5, 6.
The mean of the numbers is μ (already computed in the previous problem):
μ = (2 + 3 + 5 + 6) / 4 = 4
➢ val rdd = sc.parallelize(Array(2, 3, 5, 6))
➢ // Mean of the numbers, μ: reduce (value, 1) pairs to (sum, count)
➢ val rdd_count = rdd.map((_, 1))
➢ val (sum, count) = rdd_count.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
➢ val avg = sum.toDouble / count // toDouble avoids integer division
➢ // (xi - μ)^2
➢ val sqdiff = rdd.map(_ - avg).map(x => x * x)
➢ // ∑(xi - μ)^2
➢ val sum_sqdiff = sqdiff.reduce(_ + _)
➢ // √(1/N ∑(xi - μ)^2)
➢ import scala.math.sqrt
➢ val sd = sqrt(sum_sqdiff / count)
sd: Double = 1.5811388300841898
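The intermediate slides of this derivation are not preserved, so as a cross-check here is the same computation written out by hand for the values 2, 3, 5, 6, using the population formula from the code comments (division by N, here N = 4).

```latex
\mu = \frac{2 + 3 + 5 + 6}{4} = 4

\sum_{i=1}^{N} (x_i - \mu)^2 = (2-4)^2 + (3-4)^2 + (5-4)^2 + (6-4)^2 = 4 + 1 + 1 + 4 = 10

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} = \sqrt{\frac{10}{4}} = \sqrt{2.5} \approx 1.5811
```

This agrees with the value of sd printed above.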
Computing random sample from a dataset
The objective of the exercise is to pick a random sample from huge data.
Although RDD already provides a method for this (sample()), we are creating our own.
➢ var rdd = sc.parallelize(1 to 1000);
➢ var fraction = 0.1
➢ def cointoss(x:Int): Boolean = scala.util.Random.nextFloat() <= fraction
➢ var myrdd = rdd.filter(cointoss)
➢ var localsample = myrdd.collect()
➢ localsample.length
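The coin-toss idea can be tried without a cluster. The sketch below applies the same filter to a plain Scala collection; the fixed seed 42 is an assumption added only to make runs repeatable, and it is not part of the original code. Each element is kept with probability `fraction`, so the sample size is around `fraction * 1000` but varies from seed to seed.

```scala
object CoinTossSampleSketch {
  val fraction = 0.1

  // Seeded generator so repeated runs give the same sample (assumption).
  val rng = new scala.util.Random(42L)

  // Keep an element with probability `fraction`; the element itself is ignored.
  def cointoss(x: Int): Boolean = rng.nextDouble() < fraction

  val data = (1 to 1000).toVector

  // filter keeps the original order and never duplicates an element.
  val sample = data.filter(cointoss)

  def main(args: Array[String]): Unit = {
    println(s"sample size: ${sample.length}")
    println(s"first few: ${sample.take(5)}")
  }
}
```

On an RDD the built-in `sample(withReplacement, fraction, seed)` covers the same need.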
Thank you!