
Lesson 4

Spark DataFrame and RDDs

Raj Kamal and Preeti Saxena, "Big Data Analytics", Ch.05 L04: Spark and Big Data Analytics, © McGraw-Hill Education (India), 2019
Spark DataFrame

• A distributed collection of data organized into named columns
• Used for transformations such as filter, join and groupBy aggregation functions (a sketch follows below)
• Refer to Section 10.4.1 for merge and join functions for DataFrame objects
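A minimal Scala sketch of these operations, assuming a spark-shell session where the SparkSession spark is predefined; the table name toyPuzzleTypeCostTbl comes from Figure 5.6, while its columns puzzleType and cost are assumptions for illustration.

import org.apache.spark.sql.functions.avg
import spark.implicits._

val toyPuzzleTypeCostTbl = Seq(
  ("Jigsaw", 120.0), ("Cube", 250.0), ("Jigsaw", 180.0)
).toDF("puzzleType", "cost")

// filter: keep rows whose cost exceeds a threshold
val costly = toyPuzzleTypeCostTbl.filter($"cost" > 150.0)

// groupBy aggregation: average cost per puzzle type
val avgCost = toyPuzzleTypeCostTbl.groupBy("puzzleType").agg(avg("cost").as("avgCost"))

// join: attach the per-type average back to each row
val joined = toyPuzzleTypeCostTbl.join(avgCost, "puzzleType")
joined.show()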
DataFrames

• Created from several data sources: JSON datasets, Hive tables, Parquet row groups, structured data files, external data stores and RDDs (a sketch follows below)
• DataFrames are commonly used with Parquet and JSON objects
• Refer to Section 10.3.3 for conversion from a CSV-format dataset
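A hedged Scala sketch of reading from these sources in spark-shell; all file paths and the Hive table name are placeholders, and the Hive read assumes a session built with Hive support enabled.

val fromJson    = spark.read.json("hdfs:///data/toys.json")                          // JSON dataset
val fromParquet = spark.read.parquet("hdfs:///data/toys.parquet")                    // Parquet row groups
val fromCsv     = spark.read.option("header", "true").csv("hdfs:///data/toys.csv")   // CSV (Section 10.3.3)
val fromHive    = spark.sql("SELECT * FROM toys_hive_table")                          // Hive table (assumed name)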
Figure 5.6 Sample table toyPuzzleTypeCostTbl: rows and row groups in DataFrames

DataFrame (SchemaRDD)

• A DataFrame, earlier named SchemaRDD, is similar to a table in a traditional database
• The schema is a blueprint for the organization of data in an RDD (similar to a traditional database schema)
• The schema tells how the RDD is constructed
• Refer to Section 10.3.4 for creating a DataFrame from an RDD (a sketch follows below)
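A minimal Scala sketch of building a DataFrame from an RDD by supplying a schema, assuming spark-shell's predefined sc and spark; the field names and values are assumptions for illustration.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

// an RDD of Row objects
val rowRDD = sc.parallelize(Seq(Row("Jigsaw", 120.0), Row("Cube", 250.0)))

// the schema describes how the data in each Row is organized
val schema = StructType(Seq(
  StructField("puzzleType", StringType, nullable = false),
  StructField("cost", DoubleType, nullable = false)
))

val df = spark.createDataFrame(rowRDD, schema)
df.printSchema()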

DataFrame (SchemaRDD)

• A SchemaRDD is returned when queries are loaded or executed. A SchemaRDD is composed of Row objects and additionally carries the data-type information for each column. A Row object wraps an array of basic data types.

Spark Resilient Distributed Dataset (RDD)

• A collection of objects distributed over many computing nodes
• Parallel data structures on clusters
• An immutable (thus read-only), partitioned, distributed collection of objects

RDD Features

• Have an interface that enables transformations which apply the same operation to many data objects
• Each RDD can be split into multiple partitions, which may be computed in parallel on different nodes of a cluster (a sketch follows below)
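A small Scala sketch of partitioning, assuming spark-shell's sc; the data and the partition count are illustrative.

val nums = sc.parallelize(1 to 1000, 4)   // request 4 partitions
println(nums.getNumPartitions)            // 4; each partition may be processed on a different node
println(nums.map(_ * 2).sum())            // the map is applied partition by partition, in parallel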

RDD Features

• A fault-tolerant abstraction that enables in-memory cluster computing
• Created only through deterministic operations on either (i) data in a stable data store, such as a file, or (ii) other RDDs
• Enable efficient execution of iterative algorithms and interactive data mining
RDD Features

• Commands enable intermediate results to be explicitly persisted in memory, and
• The partitioning can be controlled so that data placement is optimized, and partitions can be manipulated using operators (a sketch follows below)
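A Scala sketch of explicit persistence and partition control in spark-shell; the key-value pairs and the choice of four partitions are assumptions for illustration.

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))   // control the placement of data
partitioned.persist(StorageLevel.MEMORY_ONLY)                 // keep the intermediate result in memory
println(partitioned.reduceByKey(_ + _).collect().mkString(","))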

Spark RDD Immutability

• Not capable of, or susceptible to, change once created
• Transform commands create a new RDD rather than modifying an existing one (a sketch follows below)
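A short Scala illustration of immutability in spark-shell; the values are assumptions.

val original = sc.parallelize(Seq(1, 2, 3))
val doubled  = original.map(_ * 2)            // returns a new RDD; original is untouched
println(original.collect().mkString(","))     // 1,2,3
println(doubled.collect().mkString(","))      // 2,4,6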

Commands for Creation of a New RDD

(i) Load an external dataset as a distributed collection of objects, or
(ii) use a driver program to distribute a collection of objects (a sketch follows below).
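A Scala sketch of the two creation routes in spark-shell; the file path and the collection values are placeholders.

val lines = sc.textFile("hdfs:///data/toys.txt")   // (i) load an external dataset
val nums  = sc.parallelize(Seq(10, 20, 30, 40))    // (ii) distribute a driver-side collection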

Transform Operation

• Each dataset is represented as an object
• A transform command invokes methods on these objects to create new RDD(s)
• Transform operations create RDDs from other RDDs (a sketch follows below)
• Example 5.9
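A Scala sketch of chained transformations in spark-shell (this is not the book's Example 5.9; the cost values are assumptions). Each transform builds a new RDD from an existing one, and nothing is computed until an action is applied.

val costs  = sc.parallelize(Seq(120.0, 250.0, 180.0))
val taxed  = costs.map(_ * 1.18)        // new RDD derived from costs
val costly = taxed.filter(_ > 200.0)    // new RDD derived from taxed; still lazy, no computation yet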

Action Operation

• An action (i) returns a value to the program or (ii) exports data to a Data Store
• The computation is carried out when an action takes place on an RDD for the first time; the action then returns a value or sends data to a Data Store (a sketch follows below)
• Example 5.9
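A Scala sketch of the two kinds of action in spark-shell (again not the book's Example 5.9; the values and the output path are placeholders).

val costs = sc.parallelize(Seq(120.0, 250.0, 180.0))
val total = costs.reduce(_ + _)                  // (i) triggers the computation and returns a value to the program
costs.saveAsTextFile("hdfs:///out/toyCosts")     // (ii) exports the data to a data store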

Removing Data

• Spark automatically monitors the usage of caches
• Spark removes cached data using a 'least recently used partitions removed first' strategy
• Cached data can also be removed explicitly with RDD.unpersist() (a sketch follows below)
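A Scala sketch of caching an RDD and then removing it explicitly instead of waiting for least-recently-used eviction; the data is illustrative.

val cached = sc.parallelize(1 to 100).cache()   // mark the RDD for in-memory caching
cached.count()                                   // the first action materializes the cache
cached.unpersist()                               // explicitly remove the cached partitions from memory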

Spark Data Types and Their Descriptions

• Table 5.4

Numeric Operations on an RDD

• Table 5.6
• count(*), count(expr); sum(col), sum(DISTINCT col), avg(col), avg(DISTINCT col), min(col) and max(col), which returns a DOUBLE (Table 4.10); a sketch follows below
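A Scala sketch applying these aggregate functions through Spark SQL in spark-shell; the table name toys and its columns are assumptions for illustration.

import spark.implicits._

val toys = Seq(("Jigsaw", 120.0), ("Cube", 250.0), ("Jigsaw", 180.0)).toDF("puzzleType", "cost")
toys.createOrReplaceTempView("toys")

spark.sql(
  "SELECT count(*), count(DISTINCT puzzleType), sum(cost), avg(cost), min(cost), max(cost) FROM toys"
).show()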

Use of Statistical Functions

• The statistical functions stdev(), sampleStdev(), variance() and sampleVariance() are used for analysis of the numeric data given as input (a sketch follows below)
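A Scala sketch of these functions in spark-shell; they are available on RDDs of numeric values, and the cost figures are assumptions.

val costs = sc.parallelize(Seq(120.0, 250.0, 180.0))
println(costs.stdev())            // population standard deviation
println(costs.sampleStdev())      // sample standard deviation
println(costs.variance())         // population variance
println(costs.sampleVariance())   // sample variance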

Shared Variables

• Broadcast Variable:
• Created from a value, denoted by a variable v, by running the method sc.broadcast(v). The broadcast variable is a wrapper around v (a sketch follows below)
• Example 5.11
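A Scala sketch of a broadcast variable in spark-shell (not the book's Example 5.11; the lookup map is an assumption for illustration).

val v = Map("Jigsaw" -> 120.0, "Cube" -> 250.0)    // value held in the driver
val broadcastVar = sc.broadcast(v)                  // wrapper around v, shipped once to each node
val types = sc.parallelize(Seq("Jigsaw", "Cube", "Jigsaw"))
val costs = types.map(t => broadcastVar.value.getOrElse(t, 0.0))   // tasks read the broadcast copy
println(costs.collect().mkString(","))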

Shared Variables

• Accumulator Variable:
• Accumulators are special variables that add values using associative and commutative operations, so they can be updated in parallel: for example, in count() or sum() (a sketch follows below)
• Example 5.12
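A Scala sketch of an accumulator in spark-shell (not the book's Example 5.12; the record values are assumptions for illustration).

val badRecords = sc.longAccumulator("badRecords")           // supports associative, commutative adds
val records = sc.parallelize(Seq("ok", "bad", "ok", "bad"))
records.foreach(r => if (r == "bad") badRecords.add(1))     // updated in parallel on the executors
println(badRecords.value)                                    // 2, read back in the driver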

Summary

We learnt:
• DataFrames
• Resilient Distributed Datasets (RDDs)
• RDD features
• Transformations and actions
• Shared variables

End of Lesson 4 on Spark DataFrame and RDDs

