
Basics of RDD

RDDs are resilient distributed datasets: distributed collections of objects that can be operated on in parallel. RDDs are immutable, partitioned across a cluster, and can recover from failures. Transformations like map and filter lazily create new RDDs without executing anything, while actions like take, count, and saveAsTextFile trigger actual execution and return results to the driver program. Lazy evaluation enables optimizations such as combining multiple transformations into a single pass.


Basics of RDD

RDDs - Resilient Distributed Datasets

What is RDD?

Dataset: a collection of data elements, e.g. an array, a table, an R data frame, a MongoDB collection
Distributed: the parts are spread across multiple machines
Resilient: recovers on failure

Basics of RDD
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
A collection of elements partitioned across the cluster

[Diagram: the elements of one RDD spread across Machine 1 through Machine 4]

Basics of RDD
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET

[Diagram: a Resilient Distributed Dataset with partitions (1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15) spread across Node 1 through Node 4, each node running a Spark application, all coordinated by the driver application]

Basics of RDD
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
A collection of elements partitioned across the cluster

• An immutable, distributed collection of objects.


• Split into partitions, which may live on multiple nodes
• Can contain any data type:
○ Python,
○ Java,
○ or Scala objects,
○ including user-defined classes

Basics of RDD
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET

• An RDD can be persisted in memory


• An RDD automatically recovers from node failures
• Can hold any data type, with a special dataset type for key-value pairs
• Supports two types of operations:
○ Transformation
○ Action

Basics of RDD
Creating RDD - Scala

Method 1: by directly loading a file from a remote store

>> var lines = sc.textFile("/data/mr/wordcount/input/big.txt")

Method 2: by distributing an existing object

>> val arr = 1 to 10000
>> var nums = sc.parallelize(arr)

Basics of RDD
WordCount - Scala

var linesRdd = sc.textFile("/data/mr/wordcount/input/big.txt")
var words = linesRdd.flatMap(x => x.split(" "))
var wordsKv = words.map(x => (x, 1))
// def myfunc(x:Int, y:Int): Int = x + y   // a named equivalent of (_ + _)
var output = wordsKv.reduceByKey(_ + _)
output.take(10)
// or
output.saveAsTextFile("my_result")

Basics of RDD
RDD Operations

Two kinds of operations:

Transformation Action

Basics of RDD
RDD - Operations : Transformation
[Diagram: each partition of Resilient Distributed Dataset 1 passes through a transformation to produce the corresponding partition of Resilient Distributed Dataset 2]

• Transformations are operations on RDDs


• that return a new RDD,
• such as map() and filter()

Basics of RDD
Map Transformation

➢ map() is a transformation
➢ that runs the provided function against each element of the RDD
➢ and creates a new RDD from the results

Basics of RDD
Map Transformation - Scala
➢ val arr = 1 to 10000
➢ val nums = sc.parallelize(arr)
➢ def multiplyByTwo(x:Int):Int = x*2
➢ multiplyByTwo(5)
10
➢ var dbls = nums.map(multiplyByTwo);
➢ dbls.take(5)
[2, 4, 6, 8, 10]

Basics of RDD
Transformations - filter() - scala
➢ var arr = 1 to 1000
➢ var nums = sc.parallelize(arr)
➢ def isEven(x:Int):Boolean = x%2 == 0
➢ var evens = nums.filter(isEven)
➢ evens.take(3)
[2, 4, 6]

[Diagram: isEven() is applied to every element of nums (1, 2, 3, 4, 5, 6, 7, …); evens keeps only 2, 4, 6, …]

Basics of RDD
RDD - Operations : Actions

• Causes the full execution of the transformations


• Involves both the Spark driver and the nodes
• Example - take(): brings data back to the driver

Basics of RDD
Action Example - take()
➢ val arr = 1 to 1000000
➢ val nums = sc.parallelize(arr)
➢ def multiplyByTwo(x:Int):Int = x*2
➢ var dbls = nums.map(multiplyByTwo);
➢ dbls.take(5)
[2, 4, 6, 8, 10]

Basics of RDD
Action Example - saveAsTextFile()

To save results to HDFS or any other file system,


call saveAsTextFile(directoryName).
It creates the directory
and saves the results inside it.
If the directory already exists, it throws an error.

Basics of RDD
Action Example - saveAsTextFile()

val arr = 1 to 1000
val nums = sc.parallelize(arr)
def multiplyByTwo(x:Int):Int = x*2
var dbls = nums.map(multiplyByTwo);
dbls.saveAsTextFile("mydirectory")
// Check the HDFS home directory

Basics of RDD
RDD Operations

             Transformation    Action
Examples     map()             take()
Returns      Another RDD       A local value
Executes     Lazily            Immediately; executes the pending transformations
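
To see the difference in the shell, here is a minimal sketch (the numbers are illustrative): the map() call returns a new RDD instantly without touching the data; only the action forces execution.

val nums = sc.parallelize(1 to 1000000)
val dbls = nums.map(_ * 2)   // transformation: returns immediately, no work done yet
dbls.take(5)                 // action: triggers execution, returns Array(2, 4, 6, 8, 10) to the driver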

Basics of RDD
Lazy Evaluation Example - The waiter takes orders patiently

Customer 1: "Cheese burger, soup and a plate of noodles for me, please."
Customer 2: "Soup and a plate of noodles for me."
Waiter: "Ok. One cheese burger, two soups, two plates of noodles. Anything else, sir?"

The chef is able to optimize because multiple orders are clubbed together.

Basics of RDD
Instant Evaluation

Customer: "Cheese burger..."
Waiter: "Let me get a cheese burger for you. I'll be right back!"
Customer: "And soup?"

The soup order will be taken only once the waiter is back.

Basics of RDD
Instant Evaluation

Most programming languages use instant (eager) evaluation.

As soon as you type:

var x = 2 + 10

it doesn't wait; it evaluates immediately.

Basics of RDD
Actions: Lazy Evaluation
1. Every time we call an action, the entire RDD must be computed from scratch (see the caching sketch below)
2. Every time d gets executed, a, b, and c are re-run:
a. lines = sc.textFile("myfile");
b. fewlines = lines.filter(...)
c. uppercaselines = fewlines.map(...)
d. uppercaselines.count()
3. When we call a transformation, it is not evaluated immediately.
4. This helps Spark optimize performance.
5. This is similar to Pig, TensorFlow, etc.
6. Instead of thinking of an RDD as a dataset, think of it as a set of instructions for how to compute the data.
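
Spark's remedy for repeated recomputation is to persist an intermediate RDD. A minimal sketch (assuming "myfile" exists; the filter predicate and map function here are placeholders):

val lines = sc.textFile("myfile")
val fewlines = lines.filter(_.nonEmpty)                   // placeholder predicate
val uppercaselines = fewlines.map(_.toUpperCase).cache()  // mark for in-memory persistence
uppercaselines.count()   // first action: computes a, b, c and caches the result
uppercaselines.count()   // second action: served from the cache, no recomputation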

Basics of RDD
Actions: Lazy Evaluation - Optimization - Scala

// Version 1: two chained map()s
def Map1(x:String):String = x.trim();
def Map2(x:String):String = x.toUpperCase();

var lines = sc.textFile(...)
var lines1 = lines.map(Map1);
var lines2 = lines1.map(Map2);
lines2.collect()

// Version 2: one fused map() - equivalent, and what Spark can effectively
// execute in a single pass over the data, because map() is lazy
def Map3(x:String):String = {
  var y = x.trim();
  return y.toUpperCase();
}

var lines = sc.textFile(...)
var lines2 = lines.map(Map3);
lines2.collect()

Basics of RDD
Lineage Graph
Spark Code:

lines = sc.textFile("myfile");
fewlines = lines.filter(...)
uppercaselines = fewlines.map(...)
lowercaselines = fewlines.map(...)
uppercaselines.count()

Lineage Graph:

HDFS input split --sc.textFile--> (1) lines
(1) lines --filter--> (2) fewlines
(2) fewlines --map--> (3) uppercaselines
(2) fewlines --map--> (3) lowercaselines

Basics of RDD
Transformations:: flatMap() - Scala

To convert one record of an RDD into multiple records.

Basics of RDD
Transformations:: flatMap() - Scala
➢ var linesRDD = sc.parallelize( Array("this is a dog", "named jerry"))
➢ def toWords(line:String):Array[String]= line.split(" ")
➢ var wordsRDD = linesRDD.flatMap(toWords)
➢ wordsRDD.collect()
➢ ['this', 'is', 'a', 'dog', 'named', 'jerry']

[Diagram: toWords() splits each element of linesRDD ("this is a dog", "named jerry"); wordsRDD holds the six individual words]

Basics of RDD
How is it different from Map()?

● With map(), the resulting RDD has the same number of elements as the input RDD.
● map() can only convert one record into one record, while flatMap() can convert one into many.

Basics of RDD
What would happen if map() is used
➢ var linesRDD = sc.parallelize( Array("this is a dog", "named jerry"))
➢ def toWords(line:String):Array[String]= line.split(" ")
➢ var wordsRDD1 = linesRDD.map(toWords)
➢ wordsRDD1.collect()
➢ [['this', 'is', 'a', 'dog'], ['named', 'jerry']]

[Diagram: with map(), toWords() produces one array per line, so wordsRDD1 holds two array elements rather than six words]

Basics of RDD
FlatMap

● Very similar to Hadoop's map()


● Can give out 0 or more records per input record

Basics of RDD
FlatMap

● Can emulate map() as well as filter()


● Can produce many values, or none (an empty array), per input record:
○ If it gives out a single value, it behaves like map().
○ If it gives out an empty array, it behaves like filter().

Basics of RDD
flatMap as map

➢ val arr = 1 to 10000


➢ val nums = sc.parallelize(arr)
➢ def multiplyByTwo(x:Int) = Array(x*2)
➢ multiplyByTwo(5)
Array(10)
➢ var dbls = nums.flatMap(multiplyByTwo);
➢ dbls.take(5)
[2, 4, 6, 8, 10]

Basics of RDD
flatMap as filter
➢ var arr = 1 to 1000
➢ var nums = sc.parallelize(arr)
➢ def isEven(x:Int):Array[Int] = {
➢   if(x%2 == 0) Array(x)
➢   else Array()
➢ }
➢ var evens = nums.flatMap(isEven)
➢ evens.take(3)
[2, 4, 6]

Basics of RDD
Transformations:: Union
➢ var a = sc.parallelize(Array('1','2','3'));
➢ var b = sc.parallelize(Array('A','B','C'));
➢ var c = a.union(b)
➢ // Note: union doesn't remove duplicates
➢ c.collect();
['1', '2', '3', 'A', 'B', 'C']

[Diagram: ['1', '2', '3'] and ['A', 'B', 'C'] flow into Union, producing ['1', '2', '3', 'A', 'B', 'C']]

Basics of RDD
Transformations:: union()

[Diagram - RDD lineage graph created during log analysis:
inputRDD --filter--> errorsRDD
inputRDD --filter--> warningsRDD
errorsRDD + warningsRDD --union--> badlinesRDD]
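
A minimal sketch of the pipeline behind that graph (the file path and the ERROR/WARNING markers are assumptions for illustration):

val inputRDD = sc.textFile("log.txt")                     // hypothetical log file
val errorsRDD = inputRDD.filter(_.contains("ERROR"))      // one filtered branch
val warningsRDD = inputRDD.filter(_.contains("WARNING"))  // a second branch from the same parent
val badlinesRDD = errorsRDD.union(warningsRDD)            // both branches feed the union
badlinesRDD.count()                                       // the action triggers the whole graph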

Basics of RDD
Actions: saveAsTextFile() - Scala

Saves all the elements to HDFS as text files.


➢ var a = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7));
➢ a.saveAsTextFile("myresult");
➢ // Check HDFS:
➢ // there should be a myresult folder in your home directory.

Basics of RDD
Actions: collect() - Scala
Brings all the elements back to the driver. The data must fit in the driver's memory.
For large datasets this is mostly impractical.

➢ var a = sc.parallelize(Array(1,2,3, 4, 5 , 6, 7));


➢ a
org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:21

➢ var localarray = a.collect();


➢ localarray
[1, 2, 3, 4, 5, 6, 7]

1 2 3 4 5 6 7

Basics of RDD
Actions: take() - Scala
Brings only a few elements back to the driver.
This is more practical than collect().

➢ var a = sc.parallelize(Array(1,2,3, 4, 5 , 6, 7));


➢ var localarray = a.take(4);
➢ localarray
[1, 2, 3, 4]

1 2 3 4 5 6 7

Basics of RDD
Actions: count() - Scala
Returns the number of elements in the RDD.
The counting happens in a distributed fashion.

➢ var a = sc.parallelize(Array(1,2,3, 4, 5 , 6, 7), 3);


➢ var mycount = a.count();
➢ mycount
7

[Diagram: partitions (1, 2, 3), (4, 5), (6, 7) are counted locally as 3, 2, 2;
the driver adds the partial counts: 3 + 2 + 2 = 7]

Basics of RDD
More Actions - Reduce()

Aggregates the elements of a dataset using a function. The function:


• takes 2 arguments and returns one
• must be commutative and associative, so the work can run in parallel
• must have a return type that is the same as its argument type

➢ var seq = sc.parallelize(1 to 100)


➢ def sum(x: Int, y:Int):Int = {return x+y}
➢ var total = seq.reduce(sum);
total: Int = 5050

Basics of RDD
More Actions - Reduce()

To confirm, you can use the formula for the sum of the first n natural numbers:
n(n+1)/2 = 100 * 101 / 2 = 5050

Basics of RDD
How does reduce work?

[Diagram: Partition 1 holds (3, 7, 13), Partition 2 holds (16, 9).
Each Spark application reduces its partition locally: 3 + 7 = 10, then 10 + 13 = 23; 16 + 9 = 25.
The Spark driver combines the partial results: 23 + 25 = 48.]

Basics of RDD
For avg(), can we use reduce?
We computed the sum using reduce. Can we compute the average the same way?

≫ var seq = sc.parallelize(Array(3.0, 7, 13, 16, 19))
≫ def avg(x: Double, y:Double):Double = {return (x+y)/2}
≫ var total = seq.reduce(avg);
total: Double = 9.875

This is wrong: the correct average of 3, 7, 13, 16, 19 is 11.6.

Basics of RDD
Why average with reduce is wrong?

[Diagram: Partition 1 holds (3, 7, 13), Partition 2 holds (16, 9).
Locally: avg(3, 7) = 5, then avg(5, 13) = 9; avg(16, 9) = 12.5.
The driver then computes avg(9, 12.5) = 10.75, which is not the true average.]

Basics of RDD
Why average with reduce is wrong?

avg(avg(x, y), z) != avg(x, avg(y, z))

Basics of RDD
But sum is ok

(x + y) + z = x + (y + z)
Basics of RDD
Reduce

A reduce function must be commutative and associative;
otherwise the results can be unpredictable and wrong.
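
A quick way to see this in the shell (a sketch; the exact values depend on how elements land in partitions): reduce the same data with subtraction, which is neither commutative nor associative, under two different partition counts.

val r2 = sc.parallelize(1 to 10, 2).reduce(_ - _)   // one grouping of elements
val r5 = sc.parallelize(1 to 10, 5).reduce(_ - _)   // a different grouping
// r2 and r5 will generally differ; a sum (_ + _) would give 55 either way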

Basics of RDD
Commutative
If changing the order of the inputs does not change the output,
the function is commutative.

Examples:
Addition: 2 + 3 = 3 + 2
Multiplication: 2 * 3 = 3 * 2
Average: (3 + 4 + 5) / 3 = (4 + 3 + 5) / 3
Euclidean distance

Non-commutative:
Division: 2 / 3 != 3 / 2
Subtraction: 2 - 3 != 3 - 2
Exponent / power: 4 ^ 2 != 2 ^ 4

Basics of RDD
Associative
Associative property:
you can add or multiply regardless of how the numbers are grouped.
By 'grouped' we mean 'how you use parentheses'.

Examples:
Multiplication: (3 * 4) * 2 = 3 * (4 * 2)
Min: Min(Min(3, 4), 30) = Min(3, Min(4, 30)) = 3
Max: Max(Max(3, 4), 30) = Max(3, Max(4, 30)) = 30

Non-associative:
Division: (2 / 3) / 4 != 2 / (3 / 4)
Subtraction: (2 - 3) - 1 != 2 - (3 - 1)
Exponent / power: (4 ^ 2) ^ 3 != 4 ^ (2 ^ 3)
Average: avg(avg(2, 3), 4) != avg(2, avg(3, 4))
Basics of RDD
Solving Some Problems with Spark
Approach 1 - So, how to compute average?

Approach 1:
➢ var rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3);
➢ var avg = rdd.reduce(_ + _) / rdd.count();

What's wrong with this approach?

We compute the RDD twice: once during reduce and once during count.
Can we compute the sum and the count in a single reduce?

Basics of RDD
Approach 2 - So, how to compute average?
Combine per-partition pairs: (Total1, Count1) + (Total2, Count2) => (Total1 + Total2, Count1 + Count2)

[Diagram: elements 4, 5, 6 map to (4, 1), (5, 1), (6, 1);
reduce combines (4, 1) + (5, 1) = (9, 2), then (9, 2) + (6, 1) = (15, 3);
finally 15 / 3 = 5]

Basics of RDD
Approach 2 - So, how to compute average?
Combine per-partition pairs: (Total1, Count1) + (Total2, Count2) => (Total1 + Total2, Count1 + Count2)

➢ var rdd = sc.parallelize(Array(1.0,2,3, 4, 5 , 6, 7), 3);


➢ var rdd_count = rdd.map((_, 1))
➢ var (sum, count) = rdd_count.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
➢ var avg = sum / count
avg: Double = 4.0

Basics of RDD
Comparison of the two approaches

Approach 1:
0.023900 + 0.065180 = 0.08908 seconds ~ 89 ms
Approach 2:
0.058654 seconds ~ 58 ms

Roughly a 1.5x difference.

Basics of RDD
How to compute Standard Deviation?

The Standard Deviation is a measure of how spread out numbers are:
σ = √( (1/N) · ∑(xᵢ - μ)² )

1. Work out the mean (the simple average of the numbers).
2. Then, for each number: subtract the mean and square the result.
3. Then work out the mean of those squared differences.
4. Take the square root of that and we are done!

Basics of RDD
So, how to compute Standard Deviation?
Let's calculate the SD of 2, 3, 5, 6

1. The mean of the numbers is μ = (2 + 3 + 5 + 6) / 4 = 4   (already computed in the previous problem)
2. xᵢ - μ = (-2, -1, 1, 2)                                   (can be done using map())
3. (xᵢ - μ)² = (4, 1, 1, 4)
4. ∑(xᵢ - μ)² = 10                                           (requires reduce())
5. √( (1/N) · ∑(xᵢ - μ)² ) = √(10/4) = √2.5 = 1.5811         (can be performed locally)
Basics of RDD
So, how to compute Standard deviation?
➢ var rdd = sc.parallelize(Array(2, 3, 5, 6))
// Mean or average of the numbers is μ
➢ var rdd_count = rdd.map((_, 1))
➢ var (sum, count) = rdd_count.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
➢ var avg = sum / count
// (xᵢ - μ)²
➢ var sqdiff = rdd.map( _ - avg).map(x => x*x)
// ∑(xᵢ - μ)²
➢ var sum_sqdiff = sqdiff.reduce(_ + _)
// √( (1/N) · ∑(xᵢ - μ)² )
➢ import math._
➢ var sd = sqrt(sum_sqdiff*1.0/count)
sd: Double = 1.5811388300841898

Basics of RDD
Computing random sample from a dataset

The objective of this exercise is to pick a random sample from huge data.
(RDD already provides a sample() method, but here we build our own.)

1. Let's first understand it for picking, say, 50% of the records.
2. The approach is very simple: for each record of the RDD we do a coin toss. If it's heads, we keep the element; otherwise we discard it. This can be achieved using filter.
3. For picking any other fraction, we would use a coin with hundreds of faces, in other words a random number generator.
4. Note that this does not give a sample of exactly the requested size.
Basics of RDD
Computing random sample from a dataset
➢ var rdd = sc.parallelize(1 to 1000);
➢ var fraction = 0.1
➢ def cointoss(x:Int): Boolean = scala.util.Random.nextFloat() <= fraction
➢ var myrdd = rdd.filter(cointoss)
➢ var localsample = myrdd.collect()
➢ localsample.length
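
For comparison, a minimal sketch using the built-in method (sample() takes a with-replacement flag and a fraction; an optional seed argument makes the sample reproducible):

var sampled = rdd.sample(false, 0.1)   // without replacement, ~10% of the records
sampled.count()   // around 100 for a 1000-element RDD, not exactly 100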

Basics of RDD

Thank you!