L7A_Spark RDD with Scala
4) Pyspark:
pyspark --conf "spark.port.maxRetries=100"
To terminate the Spark session, press Ctrl + C.
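The rest of this lab uses Scala, so you will likely launch the Scala shell instead; assuming the same configuration applies, the equivalent command is:
spark-shell --conf "spark.port.maxRetries=100"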
Creating an RDD
• Parallelizing - we use it when we have a collection in the driver program
  - collections are the containers that hold a sequenced, linear set of items, such as List, Set, Tuple, Option, Map, etc.
• Referencing a dataset - we use it when the dataset resides in an external storage system (e.g. HDFS, HBase, shared file system)
• Creating an RDD from an existing RDD
• Action operations
  o they are used at the end of a Spark pipeline to get a Scala object (or an object in whatever language you use) from the final RDD
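A minimal Scala sketch of these creation routes and of an action, as typed in spark-shell (the variable names and the file path are illustrative assumptions):

// 1) Parallelizing a collection held by the driver program
val rddFromCollection = sc.parallelize(Array(1, 2, 3, 4, 5))
// 2) Referencing a dataset in an external storage system (the path is an assumption)
val rddFromFile = sc.textFile("abstract.txt")
// 3) Creating an RDD from an existing RDD via a transformation
val rddDerived = rddFromCollection.map(x => x * 2)
// Action: collect() returns the result to the driver as a local Scala Array
rddDerived.collect()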
References:
• https://www.javahelps.com/2019/02/spark-03-understanding-resilient.html
• https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
• https://data-flair.training/blogs/apache-spark-map-vs-flatmap/
Scenario 1 • Getting familiar with parallelized collections
Create a collection
Assume we have a set of data in an array, which can be defined as follows:
val data = Array(1,2,3,4,5)
Parallelizing
Sample dataset: abstract.txt
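A minimal sketch of the parallelizing step and of creating an RDD from the sample dataset (rdd is an assumed name; rdd_file and the abstract.txt path match the commands recalled later in this lab):

val rdd = sc.parallelize(data)              // distribute the local array across the cluster
val rdd_file = sc.textFile("abstract.txt")  // reference the sample dataset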
Note: At this stage, the data is not actually loaded into rdd_file yet, due to lazy evaluation.
Note: Only when an action is triggered (e.g. collect) is the actual data loaded and displayed on the screen.
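A quick way to observe this lazy behaviour, using the RDDs sketched above:

rdd_file.count()    // an action: only now is abstract.txt actually read
rdd_file.collect()  // another action: loads the data and returns it to the driver for display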
Examples (as seen in the Spark web UI):
• DAG visualization
• Number of tasks
• Execution timeline
Scenario 4 • Creating RDD from another RDD (based on word count program)
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter
• transform the words into key-value pairs
• reduce redundant key-value pairs
Note:
• flatMap() is a one-to-many transformation function
• it will create/produce another RDD
• in this case, from each text line held by rdd_file, we want to produce multiple words delimited by whitespace
Illustration:
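The split command is the same one recalled later in this lab:

val rdd_split = rdd_file.flatMap(line => line.split(" "))  // one line in, many words out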
Then, we can display the output:
rdd_split.collect()
rdd_split.count()
Note:
• map() is a one-to-one transformation function
• it will create/produce another RDD
• in this case, for each word, we want to produce a key-value pair where the value is set to 1
Illustration:
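A minimal sketch of this map step; the name rdd_map is an assumption and may differ from the one used in the lab:

val rdd_map = rdd_split.map(word => (word, 1))  // each word becomes a (word, 1) pair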
Note:
• reduceByKey() is used to combine or merge the elements of the same key
• It can only be used if the data structure is key-value pair
• The resulting RDD will typically be smaller than the input RDD
• in this case, the keys are the words and the count for a key increases when the same word appears again
• further reading - https://backtobazics.com/big-data/spark/apache-spark-reducebykey-example/
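A minimal sketch of this reduce step, continuing from the rdd_map name assumed above (rdd_count is likewise an assumed name):

val rdd_count = rdd_map.reduceByKey((a, b) => a + b)  // sum the 1s of identical keys (words)
rdd_count.collect()                                   // action: trigger the computation and return the word counts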
Illustration (for stage 4):
Notes:
• Notice that the same executors (Id 5 and 6) were used to perform tasks 6, 7, 8 and 9
• Stages were separated because data shuffling was involved; that is, the reduceByKey() function caused data in the input partitions to be shuffled into the output partitions
Scenario 5 • Filtering keywords (based on word count program)
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter (refer to previous scenario)
• filter the keywords that we need
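The filter step could look like the following minimal sketch, assuming we keep only words containing the string "data" (the actual keyword used in the lab may differ):

val rdd_filter = rdd_split.filter(word => word.contains("data"))  // keep only matching words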
The output:
rdd_filter.collect
Note:
• Only one executor was needed to execute tasks 10 and 11
Scenario 6 • Counting the length of each word (based on word count program)
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter (refer to previous scenario)
• count the length of each word
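The command for this step is the same one recalled later in this lab:

val rdd_word_length = rdd_split.map(word => (word, word.length))  // pair each word with its length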
The output:
rdd_word_length.collect.foreach(println)
Scenario 7 • Sorting words by length (based on word count program)
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter (refer to previous scenario)
• count the length of each word (refer to previous scenario)
• sorting the words
Step 1: Rearrange the key-value pairs so that the word length becomes the key
The command:
val rdd_rearrange = rdd_word_length.map (word => (word._2,word._1))
The output:
rdd_rearrange.collect
Step 2: Sort the words according to their length
The command:
val rdd_sort_word = rdd_rearrange.sortByKey()
The output:
rdd_sort_word.collect.foreach(println)
The command:
val rdd_sort_word_desc = rdd_rearrange.sortByKey(false)
Note:
• the false value means sort in descending order
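Equivalently, the argument can be named to make the intent explicit (optional style note, standard Spark API):

rdd_rearrange.sortByKey(ascending = false)  // sort keys (word lengths) in descending order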
The output:
Exploring the execution via the Spark web UI
1) Check out the stages involved for rearranging and sorting
Note:
• Stages are separated due to the shuffling caused by the sortByKey() function
The output:
Scenario 8 • Saving the sorted words to a text file
Note: This scenario is a continuation of the previous scenario. Assume you want to write the sorted words into a text file.
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter (refer to previous scenario)
• count the length of each word (refer to previous scenario)
• sorting the words (refer to previous scenario)
• save to a text file
Recall the previous commands:
val rdd_file = sc.textFile("abstract.txt")
val rdd_split = rdd_file.flatMap(line => line.split(" "))
val rdd_word_length = rdd_split.map(word => (word,word.length))
val rdd_sort = rdd_word_length.map(word => (word._2,word._1)).sortByKey(false)
Note:
• you may use rdd_sort or rdd_sort_word_desc, depending on the latest RDD created in the previous scenario
Note: There is no difference between coalesce and repartition in terms of output. The difference is in how each reduces the number of partitions to produce a single file.
The difference:
o coalesce uses existing partitions to minimize the amount of data that is shuffled.
o repartition creates new partitions and performs a full shuffle.
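A minimal sketch of the save step under both approaches; the output directory names are assumptions, and note that saveAsTextFile writes a directory (and fails if it already exists):

rdd_sort.coalesce(1).saveAsTextFile("sorted_words_coalesce")        // merge into one partition without a full shuffle, then write
rdd_sort.repartition(1).saveAsTextFile("sorted_words_repartition")  // full shuffle into one new partition, then write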
Illustration for Coalesce:
References
o https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce
o https://intellipaat.com/community/5166/write-single-csv-file-using-spark-csv
Exploring the execution via the Spark web UI
1) Check out the stages involved for coalesce and repartition
Note:
• Notice that coalesce does not involve shuffling, hence no separation of stages is involved.
Note:
• Stage 14 is skipped because the data has been fetched from the cache (since coalesce was executed earlier on the same RDD) and re-execution of that stage is not required.
• Also notice that repartition involves shuffling, thus a separate stage is needed.
Reference:
• https://blog.rockthejvm.com/repartition-coalesce/
Scenario 9 • Converting to a dataframe and using dataframe APIs to utilize more functions (e.g. filter, groupBy, etc.)
Steps • upload / put this dataset into your hdfs (refer to previous scenario)
• create an RDD object to point to a dataset (refer to previous scenario)
• display the content of the file (refer to previous scenario)
• split the words by using whitespace delimiter (refer to previous scenario)
• count the length of each word (refer to previous scenario)
• sorting the words (refer to previous scenario)
• convert to a dataframe
• perform grouping
• perform grouping and sorting
• perform grouping, sorting and filter in a single command line
Converting to a dataframe
The command:
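The conversion might look like the following minimal sketch, assuming toDF with the column names seen in the schema later (rdd_sort holds (length, word) pairs); outside spark-shell you would also need import spark.implicits._:

val df_sorted = rdd_sort.toDF("length", "word")  // RDD of (length, word) tuples becomes a dataframe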
Note:
• df_sorted is a dataframe, rdd_sort is an RDD
• if you have executed coalesce or repartition previously, you may get a warning related to lineage
• thus, you will need to re-execute the sorting from scenario 7
The output:
df_sorted.show()
Note:
• This stage shows that the process of converting from an RDD into a dataframe involves a set of transformation operations
• However, with the functions available in the Spark API, programming tasks based on a dataframe become easier
Note:
• This stage shows the set of RDD operations involved
• However, with the functions available in the Spark API, programming tasks based on a dataframe become easier
Grouping and sorting in descending order
Step 1: Create a new dataframe that stores the result of grouping
The command:
val df_groupby = df_sorted.groupBy("word").count
The output:
df_groupby.show()
Step 2: Sort the dataframe in descending order
The command:
df_groupby.orderBy(desc("count")).show()
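Note that desc, together with the col and count helpers used in the next part, comes from the SQL functions object; if the identifier cannot be resolved in your shell, add the standard import:

import org.apache.spark.sql.functions.{col, count, desc}  // column helper functions for orderBy/agg/filter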
Grouping, sorting and filtering (in a single command line)
Aim: To display the words which are repeated more than once
Recall the dataframe schema:
df_sorted.printSchema()
The command:
df_sorted.groupBy("word").agg(count("length").alias("repeating")).filter(col("repeating") > 1).sort(desc("repeating")).show()
You can check the number of partitions for the created dataframe:
• df_sorted.rdd.getNumPartitions