
L7A - Spark RDD with Scala in Cluster (version 2.4)

Outlines
• Creating an RDD
• RDD operations
• Scenario 1 - getting familiar with parallelizing (using an array)
• Scenario 2 - getting familiar with parallelizing (using a list)
• Scenarios 3 to 9 (based on word count)
o Scenario 3 - creating a new RDD from a data source
o Scenario 4 - creating a new RDD from an existing RDD
o Scenario 5 - exploring the filtering function
o Scenario 6 - counting the length of each word
o Scenario 7 - sorting the words according to their length
o Scenario 8 - saving to a text file
o Scenario 9 - converting to a dataframe for using dataframe APIs

Accessing Spark
Scala:
spark-shell --conf "spark.port.maxRetries=100"

PySpark:
pyspark --conf "spark.port.maxRetries=100"

To terminate the Spark session, press ctrl + c

creating an RDD
• Parallelizing - we use it when we have a collection in the driver program.
  Collections are containers that hold a sequenced, linear set of items, such as List, Set, Tuple, Option, Map, etc.
• Referencing a dataset - we use it when the dataset resides in an external storage system (e.g. HDFS, HBase, a shared file system)
• Creating an RDD from an existing RDD
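
As a minimal sketch of these three approaches (assuming a spark-shell session, where sc is already available, and the abstract.txt file used in the later scenarios):

// 1) parallelizing a collection held in the driver program
val rdd_parallelized = sc.parallelize(List(1, 2, 3, 4, 5))

// 2) referencing a dataset in external storage (here, a file in HDFS)
val rdd_from_file = sc.textFile("abstract.txt")

// 3) creating an RDD from an existing RDD
val rdd_doubled = rdd_parallelized.map(x => x * 2)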

RDD operations
RDDs support two types of operations:

• Transformation operations
o They create a new RDD from another RDD without modifying it (because RDDs are immutable).
o Applying transformations builds an RDD lineage, also known as an RDD dependency graph, which is a Directed Acyclic Graph (DAG).
o Transformations are lazy in nature, i.e. they are not executed immediately; they only execute when we call an action.
o Basic transformation operations are map() and filter().
o The resultant RDD can be smaller than its parent RDD (e.g. due to filter(), sample()), bigger (e.g. due to flatMap(), union()), or the same size (due to map()).

• Action operations
o They are used at the end of a Spark pipeline to get a Scala (or whatever language you use) object from the final RDD.

References:
• https://www.javahelps.com/2019/02/spark-03-understanding-resilient.html
• https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
• https://data-flair.training/blogs/apache-spark-map-vs-flatmap/
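
To make the laziness concrete, here is a small, hedged sketch (assuming a spark-shell session and the abstract.txt file used later); nothing runs until the final action:

// Transformations: only the lineage (DAG) is recorded, nothing is computed yet
val rdd_lines = sc.textFile("abstract.txt")
val rdd_words = rdd_lines.flatMap(line => line.split(" "))
val rdd_short = rdd_words.filter(word => word.length < 5)

// Action: triggers execution of the whole lineage and returns a plain Scala value
val howMany = rdd_short.count()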


Scenario 1 • Getting familiar with a parallelized collection (using an array)

Steps
• create a collection
• parallelize the collection
• perform any relevant operation
• end the spark session (ctrl + c)

create a collection
Assume we have a set of data in an array, which can be defined as follows:

val data = Array(1,2,3,4,5)

parallelizing the collection
val distData = sc.parallelize(data)

Now, distData is an RDD.

operation
Example: summation and counting (see the sketch below)
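
The original commands were shown as screenshots; a plausible sketch of such operations on distData (the exact commands may differ) is:

distData.sum()            // action: 15.0 (sum of 1..5, returned as a Double)
distData.count()          // action: 5 (number of elements)
distData.reduce(_ + _)    // action: 15 (equivalent summation using reduce)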

exploring num of partitions
How many partitions were created for the RDD? Run this command:
• distData.partitions.size

Scenario 2 • Getting familiar with a parallelized collection (using a list)

Steps
• create a collection
• parallelize the collection
• perform any relevant operation
• end the spark session (ctrl + c)

creation
Assume we have a set of data in a list.

parallelizing
Now, rdd_country is an RDD.

operation
Example: counting and collecting (see the sketch below)
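
The creation and parallelizing commands for this scenario were shown as screenshots; the sketch below is an assumption of what they look like (the country names are hypothetical), with rdd_country taken from the note above:

// hypothetical sample list of countries
val data_country = List("Malaysia", "Indonesia", "Thailand", "Singapore")

// parallelizing the list; rdd_country is now an RDD
val rdd_country = sc.parallelize(data_country)

rdd_country.count()      // action: 4 elements for this hypothetical list
rdd_country.collect()    // action: returns the elements to the driver as an Array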


Scenario 3 • creating a new RDD from a data source

sample dataset
abstract.txt

Steps
• upload / put this dataset into your hdfs (see the example command below)
• create an RDD object to point to a dataset
• display the content of the file
• end the spark session (ctrl + c)
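
For the upload step, a typical command (run from a regular terminal, not spark-shell; the target directory below is only a placeholder for your own HDFS home directory) would be:

hdfs dfs -put abstract.txt /user/your_hdfs_username/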

read and assign
Run this command:

val rdd_file = sc.textFile("abstract.txt")

Note: At this stage, the data are not actually loaded into rdd_file yet, due to lazy evaluation.

display the content
Run this command:

rdd_file.collect()

Note: Only when an action is triggered (e.g. collect) is the actual data loaded and displayed on the screen.

exploring RDD
How many partitions were created for the RDD? Run this command:
• rdd_file.partitions.size

exploration - using spark web UI
1) Go to (you must be in the UiTM network) - http://10.5.19.231:18088/
2) Choose "Show incomplete applications" at the bottom of the page
3) Explore Jobs, Stages, DAG, and Tasks

Examples: DAG visualization, number of tasks, execution timeline

Scenario 4 • Creating an RDD from another RDD (based on a word count program)

Note: This scenario is a continuation of the previous scenario.

Steps
• upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words using a whitespace delimiter
• transform the words into key-value pairs
• reduce redundant key-value pairs

splitting
The command:

val rdd_split = rdd_file.flatMap(line => line.split(" "))

Note:
• flatMap() is a one-to-many transformation function
• it will create/produce another RDD
• in this case, from each string line (i.e. held by rdd_file), we want to produce multiple words delimited by whitespace

Illustration:
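(The original illustration is a figure. As a rough, runnable stand-in using a hypothetical input line:)

// hypothetical single-line input, standing in for a line of abstract.txt
val demo_line = sc.parallelize(Seq("big data with spark"))
val demo_split = demo_line.flatMap(line => line.split(" "))
demo_split.collect()   // Array(big, data, with, spark) - one word per element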
Then, we can display the output:

rdd_split.collect()

If we count the number of words, we will get the following:

rdd_split.count()

Note: This count may include duplicated words

exploring execution via spark web UI
1) Check out the completed jobs
2) Check out the stage of job 1
3) Check out the tasks of job 1


mapping
The command:

val rdd_map = rdd_split.map(word => (word,1))

Note:
• map() is a one-to-one transformation function
• it will create/produce another RDD
• in this case, for each word, we want to produce a key-value pair where the value is set to 1

Illustration:
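(The original illustration is a figure. As a hypothetical, self-contained stand-in using the same demo words:)

val demo_words = sc.parallelize(Seq("big", "data", "with", "spark"))
val demo_map = demo_words.map(word => (word, 1))
demo_map.collect()   // Array((big,1), (data,1), (with,1), (spark,1))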

Then, we can display the output:

rdd_map.collect

exploring the execution via spark web UI
1) Check out the completed jobs
2) Check out the stage of job 2
3) Check out the tasks of job 2

Reducing
The command:

val rdd_reduce = rdd_map.reduceByKey(_+_)

Note:
• reduceByKey() is used to combine or merge the values of the same key
• it can only be used if the data structure is a key-value pair
• the resulting RDD will be smaller than (or at most equal in size to) the input RDD
• in this case, the keys are the words and the value increases when the same word appears again
• further reading - https://backtobazics.com/big-data/spark/apache-spark-reducebykey-example/

illustration:
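(The original illustration is a figure. As a hypothetical stand-in, assuming the word "spark" appears twice in the input pairs:)

val demo_pairs = sc.parallelize(Seq(("big", 1), ("spark", 1), ("data", 1), ("spark", 1)))
val demo_reduce = demo_pairs.reduceByKey(_ + _)
demo_reduce.collect()   // e.g. Array((big,1), (data,1), (spark,2)) - order may vary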

Then, we can display the output:

rdd_reduce.collect

We can also display the output using this command:

rdd_reduce.collect.foreach(println)

If we count the number of words, we will get the following:

rdd_reduce.count

exploring the execution via spark web UI
1) Check out the completed jobs
2) Check out the stage of job 3
3) Check out the tasks of job 3 (involving stage 3 and stage 4)

Notes:
• Notice that the same executors (Id 5 and 6) were used to perform tasks 6, 7, 8 and 9
• Stages were separated because data shuffling was involved. That is, the function reduceByKey() caused data in the input partitions to be shuffled into the output partitions

Scenario 5 • Exploring the filtering function (based on the word count program)

Note: This scenario is a continuation of the previous scenario.

Steps
• upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words using a whitespace delimiter (refer to previous scenario)
• filter the keywords that we need

recall the previous commands
val rdd_file = sc.textFile("abstract.txt")
val rdd_split = rdd_file.flatMap(line => line.split(" "))

filtering
The command:

val rdd_filter = rdd_split.filter(w => w.contains("res"))

Note:
• notice that rdd_split is used, which refers to the output of the flatMap of the previous scenario (not the output of reduceByKey)
• in this command, rdd_split is the input, while rdd_filter is the reference to the output RDD

The output:
rdd_filter.collect

exploring the execution via spark web UI
1) Check out the completed jobs
2) Check out the stage of job 4
3) Check out the tasks of job 4

Note:
• Only one executor was needed to execute tasks 10 and 11

Scenario 6 • Counting the length of each word (based on the word count program)

Note: This scenario is a continuation of the previous scenario.

Steps
• upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words using a whitespace delimiter (refer to previous scenario)
• count the length of each word

recall the previous commands
val rdd_file = sc.textFile("abstract.txt")
val rdd_split = rdd_file.flatMap(line => line.split(" "))

counting
The command:

val rdd_word_length = rdd_split.map(word => (word,word.length))

The output:
rdd_word_length.collect.foreach(println)

exploring the execution via spark web UI
1) Check out the completed jobs
2) Check out the stage of job 5
3) Check out the tasks of job 5

Scenario 7 • sorting the words according to their length

Note: This scenario is a continuation of the previous scenario.

Steps
• upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words using a whitespace delimiter (refer to previous scenario)
• count the length of each word (refer to previous scenario)
• sort the words

recall the previous commands
val rdd_file = sc.textFile("abstract.txt")
val rdd_split = rdd_file.flatMap(line => line.split(" "))
val rdd_word_length = rdd_split.map(word => (word,word.length))

sorting
Step 1: Rearrange the key-value pairs
• the value, which is the length, becomes the key
• the key, which is the word, becomes the value

The command:
val rdd_rearrange = rdd_word_length.map(word => (word._2,word._1))

The output:
rdd_rearrange.collect

Step 2: Sort the words according to their length

The command:
val rdd_sort_word = rdd_rearrange.sortByKey()

The output:
rdd_sort_word.collect.foreach(println)

Note: If you need to sort the words in descending order:

The command:
val rdd_sort_word_desc = rdd_rearrange.sortByKey(false)

Note:
• the false value means descending order

The output:
rdd_sort_word_desc.collect.foreach(println)

exploring the execution via spark web UI
1) Check out the stages involved for rearranging and sorting

Note:
• Stages are separated due to the shuffling caused by the sortByKey() function

(to explore) combining rearrange and sort in a single-line command
The command:

val rdd_sort = rdd_word_length.map(word => (word._2,word._1)).sortByKey(false)

The output:
rdd_sort.collect.foreach(println)

Scenario 8 • saving to a text file

Note: This scenario is a continuation of the previous scenario, assuming you want to write the sorted words into a text file.

Steps
• upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words using a whitespace delimiter (refer to previous scenario)
• count the length of each word (refer to previous scenario)
• sort the words (refer to previous scenario)
• save to a text file

recall the previous commands
val rdd_file = sc.textFile("abstract.txt")
val rdd_split = rdd_file.flatMap(line => line.split(" "))
val rdd_word_length = rdd_split.map(word => (word,word.length))
val rdd_sort = rdd_word_length.map(word => (word._2,word._1)).sortByKey(false)

saving
There are 2 options: using coalesce or repartition.

The command (option 1):

rdd_sort.coalesce(1).saveAsTextFile("sortedwords.txt")

Note:
• you may use rdd_sort or rdd_sort_word_desc, depending on the latest RDD created from the previous scenario

The command (option 2):

rdd_sort.repartition(1).saveAsTextFile("sortedwords_v1.txt")

About the saved file:
• It should be in your hdfs directory
• You can view it via HUE

The output of coalesce:

The output of repartition:

Note: There is no difference between coalesce and repartition in terms of output. The difference is the way each reduces the partitions to produce a single file.

coalesce vs repartition
The similarity:
• Both coalesce(1) and repartition(1) reduce the number of partitions to a single partition.

The difference:
o coalesce uses existing partitions to minimize the amount of data that is shuffled.
o repartition creates new partitions and does a full shuffle.

Illustration for coalesce:

Illustration for repartition:
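
(The two illustrations above are figures in the original. As a rough textual sketch of the same behaviour, assuming the rdd_sort RDD from above; the starting partition count depends on your cluster:)

rdd_sort.partitions.size                   // e.g. 2 partitions to start with (cluster dependent)
rdd_sort.coalesce(1).partitions.size       // 1 - merges existing partitions, minimizing shuffling
rdd_sort.repartition(1).partitions.size    // 1 - builds new partitions via a full shuffle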

References
o https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce
o https://intellipaat.com/community/5166/write-single-csv-file-using-spark-csv

exploring the execution via spark web UI
1) Check out the stages involved for saving with coalesce and with repartition

Note:
• Notice that coalesce does not involve shuffling, hence no separation of stages is involved.

Note:
• Stage 14 is skipped because the data has been fetched from the cache (since coalesce was executed earlier on the same RDD) and re-execution of that stage is not required.
• Also notice that repartition involves shuffling, thus a separate stage is needed.

Reference:
• https://blog.rockthejvm.com/repartition-coalesce/

Scenario 9 • converting to a dataframe to utilize more functions via the dataframe APIs (e.g. filter, groupBy, etc.)

Steps
• upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words using a whitespace delimiter (refer to previous scenario)
• count the length of each word (refer to previous scenario)
• sort the words (refer to previous scenario)
• convert to a dataframe
• perform grouping
• perform grouping and sorting
• perform grouping, sorting and filtering in a single command

converting
The command:

val df_sorted = spark.createDataFrame(rdd_sort).toDF("length","word")

Note:
• df_sorted is a dataframe, while rdd_sort is an RDD
• if you have executed coalesce or repartition previously, you may get a warning related to lineage
• thus, you will need to re-execute the sorting from scenario 7

The output:
df_sorted.show()
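
As a side note (an assumption based on the standard Spark API, not part of the original slides), the same conversion can also be written with the implicit toDF helper that spark-shell imports by default:

// alternative conversion, relying on spark-shell's automatic `import spark.implicits._`
val df_sorted_alt = rdd_sort.toDF("length", "word")
df_sorted_alt.printSchema()   // length: integer, word: string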

exploring the execution via spark web UI
1) Check out the stages involved to convert to a dataframe

Note:
• This stage shows that the process of converting from an RDD into a dataframe involves a set of transformation operations
• However, with the functions available in the Spark API, programming based on a dataframe becomes easier

Grouping on a dataframe (no longer an RDD)
The command:

df_sorted.groupBy("word").count.show()

exploring the execution via spark web UI
1) Check out the stages involved

Note:
• This stage shows the set of RDD operations involved
• However, with the functions available in the Spark API, programming based on a dataframe becomes easier

Grouping and sorting in descending order
Step 1: Create a new dataframe that stores the result of grouping

The command:
val df_groupby = df_sorted.groupBy("word").count

The output:
df_groupby.show()

Step 2: Sort the dataframe in descending order

The command:
df_groupby.orderBy(desc("count")).show()

exploring the execution via spark web UI
1) Check out the stages involved

Grouping, sorting and filtering (in a single command)
Aim: To display the words which are repeated more than once

Recall the dataframe schema:
df_sorted.printSchema()

The command:

df_sorted.groupBy("word").agg(count("length").alias("repeating")).filter(col("repeating") > 1).sort(desc("repeating")).show()

exploration
Is the dataframe still a distributed dataset?

You can check the number of partitions for the created dataframe:
• df_sorted.rdd.getNumPartitions

Since the dataframe is still backed by partitioned data, the answer is yes.
