L7A_Spark RDD with Scala
4) Pyspark:
pyspark --conf "spark.port.maxRetries=100"
To terminate the Spark session, press Ctrl + C.
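The rest of this lab uses Scala, so you will likely launch the Scala shell instead; assuming the same configuration applies, the equivalent command is:
spark-shell --conf "spark.port.maxRetries=100"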
Creating an RDD
• Parallelizing - we use it when we have a collection in the driver program
  - collections are the containers that hold a sequenced, linear set of items, such as List, Set, Tuple, Option, Map, etc.
• Referencing a dataset - we use it when the dataset resides in an external storage system (e.g. HDFS, HBase, shared file system)
• Creating an RDD from an existing RDD
• Action operations
  o they are used at the end of a Spark pipeline to get a Scala object (or an object in whatever language you use) from the final RDD
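A minimal Scala sketch of these creation routes and of an action, as typed in spark-shell (the variable names and the file path are illustrative assumptions):

// 1) Parallelizing a collection held by the driver program
val rddFromCollection = sc.parallelize(Array(1, 2, 3, 4, 5))
// 2) Referencing a dataset in an external storage system (the path is an assumption)
val rddFromFile = sc.textFile("abstract.txt")
// 3) Creating an RDD from an existing RDD via a transformation
val rddDerived = rddFromCollection.map(x => x * 2)
// Action: collect() returns the result to the driver as a local Scala Array
rddDerived.collect()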
References:
• https://www.javahelps.com/2019/02/spark-03-understanding-resilient.html
• https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
• https://data-flair.training/blogs/apache-spark-map-vs-flatmap/
Scenario 1 • Getting familiar with parallelized collections
Create a collection
Assume we have a set of data in an array, which can be defined as follows:
val data = Array(1,2,3,4,5)
Parallelizing
Sample dataset: abstract.txt
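A minimal sketch of the parallelizing step and of creating an RDD from the sample dataset (rdd is an assumed name; rdd_file and the abstract.txt path match the commands recalled later in this lab):

val rdd = sc.parallelize(data)              // distribute the local array across the cluster
val rdd_file = sc.textFile("abstract.txt")  // reference the sample dataset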
Note: At this stage, the data is not actually loaded into rdd_file yet, due to lazy evaluation.
Note: Only when an action is triggered (e.g. collect) is the actual data loaded and displayed on the screen.
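A quick way to observe this lazy behaviour, using the RDDs sketched above:

rdd_file.count()    // an action: only now is abstract.txt actually read
rdd_file.collect()  // another action: loads the data and returns it to the driver for display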
Examples (as seen in the Spark web UI):
• DAG visualization
• Number of tasks
• Execution timeline
Scenario 4 • Creating RDD from another RDD (based on word count program)
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter
• transform the words into key-value pairs
• reduce redundant key-value pairs
Note:
• flatMap() is a one-to-many transformation function
• it will create/produce another RDD
• in this case, from each text line held by rdd_file, we want to produce multiple words delimited by whitespace
Illustration:
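The split command is the same one recalled later in this lab:

val rdd_split = rdd_file.flatMap(line => line.split(" "))  // one line in, many words out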
Then, we can display the output:
rdd_split.collect()
rdd_split.count()
Note:
• map() is a one-to-one transformation function
• it will create/produce another RDD
• in this case, for each word, we want to produce a key-value pair where the value is set to 1
Illustration:
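A minimal sketch of this map step; the name rdd_map is an assumption and may differ from the one used in the lab:

val rdd_map = rdd_split.map(word => (word, 1))  // each word becomes a (word, 1) pair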
Note:
• reduceByKey() is used to combine or merge the elements of the same key
• It can only be used if the data structure is key-value pair
• The resulting RDD will typically be smaller than the input RDD
• in this case, the keys are the words and the count for a key increases when the same word appears again
• further reading - https://backtobazics.com/big-data/spark/apache-spark-reducebykey-example/
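A minimal sketch of this reduce step, continuing from the rdd_map name assumed above (rdd_count is likewise an assumed name):

val rdd_count = rdd_map.reduceByKey((a, b) => a + b)  // sum the 1s of identical keys (words)
rdd_count.collect()                                   // action: trigger the computation and return the word counts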
Illustration (for stage 4):
Notes:
• Notice that the same executors (Id 5 and 6) were used to perform tasks 6, 7, 8 and 9
• Stages were separated because data shuffling was involved; that is, the reduceByKey() function caused data in the input partitions to be shuffled into the output partitions
Scenario 5 • Filtering keywords (based on word count program)
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter (refer to previous scenario)
• filter the keywords that we need
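The filter step could look like the following minimal sketch, assuming we keep only words containing the string "data" (the actual keyword used in the lab may differ):

val rdd_filter = rdd_split.filter(word => word.contains("data"))  // keep only matching words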
The output:
rdd_filter.collect
Note:
• Only one executor was needed to execute tasks 10 and 11
Scenario 6 • Counting the length of each word (based on word count program)
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter (refer to previous scenario)
• count the length of each word
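The command for this step is the same one recalled later in this lab:

val rdd_word_length = rdd_split.map(word => (word, word.length))  // pair each word with its length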
The output:
rdd_word_length.collect.foreach(println)
Scenario 7 • Sorting words by length (based on word count program)
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter (refer to previous scenario)
• count the length of each word (refer to previous scenario)
• sorting the words
Step 1: Rearrange the key-value pairs so that the word length becomes the key
The command:
val rdd_rearrange = rdd_word_length.map (word => (word._2,word._1))
The output:
rdd_rearrange.collect
Step 2: Sort the words according to their length
The command:
val rdd_sort_word = rdd_rearrange.sortByKey()
The output:
rdd_sort_word.collect.foreach(println)
The command:
val rdd_sort_word_desc = rdd_rearrange.sortByKey(false)
Note:
• the false value means sort in descending order
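Equivalently, the argument can be named to make the intent explicit (optional style note, standard Spark API):

rdd_rearrange.sortByKey(ascending = false)  // sort keys (word lengths) in descending order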
The output:
Exploring the execution via the Spark web UI
1) Check out the stages involved for rearranging and sorting
Note:
• Stages are separated due to the shuffling caused by the sortByKey() function
The output:
Scenario 8 • Saving the sorted words to a text file
Note: This scenario is a continuation of the previous scenario. Assume you want to write the sorted words into a text file.
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter (refer to previous scenario)
• count the length of each word (refer to previous scenario)
• sorting the words (refer to previous scenario)
• save to a text file
Recall the previous commands:
val rdd_file = sc.textFile("abstract.txt")
val rdd_split = rdd_file.flatMap(line => line.split(" "))
val rdd_word_length = rdd_split.map(word => (word,word.length))
val rdd_sort = rdd_word_length.map(word => (word._2,word._1)).sortByKey(false)
Note:
• you may use rdd_sort or rdd_sort_word_desc, depending on the latest RDD created in the previous scenario
Note: There is no difference between coalesce and repartition in terms of output. The difference is in how each reduces the number of partitions to produce a single file.
The difference:
o coalesce uses existing partitions to minimize the amount of data that is shuffled.
o repartition creates new partitions and performs a full shuffle.
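A minimal sketch of the save step under both approaches; the output directory names are assumptions, and note that saveAsTextFile writes a directory (and fails if it already exists):

rdd_sort.coalesce(1).saveAsTextFile("sorted_words_coalesce")        // merge into one partition without a full shuffle, then write
rdd_sort.repartition(1).saveAsTextFile("sorted_words_repartition")  // full shuffle into one new partition, then write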
Illustration for Coalesce:
References
o https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce
o https://intellipaat.com/community/5166/write-single-csv-file-using-spark-csv
Exploring the execution via the Spark web UI
1) Check out the stages involved for coalesce and repartition
Note:
• Notice that coalesce does not involve shuffling, hence no separation of stages is involved.
Note:
• Stage 14 is skipped because the data has been fetched from the cache (since coalesce was executed earlier on the same RDD) and re-execution of that stage is not required.
• Also notice that repartition involves shuffling, thus a separate stage is needed.
Reference:
• https://blog.rockthejvm.com/repartition-coalesce/
Scenario 9 • Converting to a dataframe and using dataframe APIs to utilize more functions (e.g. filter, groupBy, etc.)
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter (refer to previous scenario)
• count the length of each word (refer to previous scenario)
• sorting the words (refer to previous scenario)
• convert to a dataframe
• perform grouping
• perform grouping and sorting
• perform grouping, sorting and filter in a single command line
Converting to a dataframe
The command:
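The conversion might look like the following minimal sketch, assuming toDF with the column names seen in the schema later (rdd_sort holds (length, word) pairs); outside spark-shell you would also need import spark.implicits._:

val df_sorted = rdd_sort.toDF("length", "word")  // RDD of (length, word) tuples becomes a dataframe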
Note:
• df_sorted is a dataframe, rdd_sort is an RDD
• if you have executed coalesce or repartition previously, you may get a warning related to lineage
• thus, you will need to re-execute the sorting from scenario 7
The output:
df_sorted.show()
Note:
• This stage shows that the process of converting from an RDD into a dataframe involves a set of transformation operations
• However, with the functions available in the Spark API, programming tasks based on a dataframe become easier
Note:
• This stage shows the set of RDD operations involved
• However, with the functions available in the Spark API, programming tasks based on a dataframe become easier
Grouping and sorting in descending order
Step 1: Create a new dataframe that stores the result of grouping
The command:
val df_groupby = df_sorted.groupBy("word").count
The output:
df_groupby.show()
Step 2: Sort the dataframe in descending order
The command:
df_groupby.orderBy(desc("count")).show()
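Note that desc, together with the col and count helpers used in the next part, comes from the SQL functions object; if the identifier cannot be resolved in your shell, add the standard import:

import org.apache.spark.sql.functions.{col, count, desc}  // column helper functions for orderBy/agg/filter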
Grouping, sorting and filtering (in a single command line)
Aim: To display the words which are repeated more than once
Recall the dataframe schema:
df_sorted.printSchema()
The command:
df_sorted.groupBy("word").agg(count("length").alias("repeating")).filter(col("repeating") > 1).sort(desc("repeating")).show()
You can check the number of partitions for the created dataframe:
• df_sorted.rdd.getNumPartitions