
Lesson 4

Spark DataFrame and RDDs

Raj Kamal and Preeti Saxena, "Big Data Analytics", Ch.05 L04: Spark and Big Data Analytics, © McGraw-Hill Education (India), 2019
Spark DataFrame

• A distributed collection of data organized into named columns
• Used for transformations such as filter, join and groupBy aggregation functions (a sketch follows below)
• Refer to Section 10.4.1 for merge and join functions for DataFrame objects
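A minimal Scala sketch of these operations, assuming a spark-shell session where the SparkSession spark is predefined; the table name toyPuzzleTypeCostTbl comes from Figure 5.6, while its columns puzzleType and cost are assumptions for illustration.

import org.apache.spark.sql.functions.avg
import spark.implicits._

val toyPuzzleTypeCostTbl = Seq(
  ("Jigsaw", 120.0), ("Cube", 250.0), ("Jigsaw", 180.0)
).toDF("puzzleType", "cost")

// filter: keep rows whose cost exceeds a threshold
val costly = toyPuzzleTypeCostTbl.filter($"cost" > 150.0)

// groupBy aggregation: average cost per puzzle type
val avgCost = toyPuzzleTypeCostTbl.groupBy("puzzleType").agg(avg("cost").as("avgCost"))

// join: attach the per-type average back to each row
val joined = toyPuzzleTypeCostTbl.join(avgCost, "puzzleType")
joined.show()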
DataFrames

• Created from several data sources: JSON datasets, Hive tables, Parquet row groups, structured data files, external data stores and RDDs (a sketch follows below)
• DataFrames are commonly used with Parquet and JSON objects
• Refer to Section 10.3.3 for conversion from a CSV-format dataset
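A hedged Scala sketch of reading from these sources in spark-shell; all file paths and the Hive table name are placeholders, and the Hive read assumes a session built with Hive support enabled.

val fromJson    = spark.read.json("hdfs:///data/toys.json")                          // JSON dataset
val fromParquet = spark.read.parquet("hdfs:///data/toys.parquet")                    // Parquet row groups
val fromCsv     = spark.read.option("header", "true").csv("hdfs:///data/toys.csv")   // CSV (Section 10.3.3)
val fromHive    = spark.sql("SELECT * FROM toys_hive_table")                          // Hive table (assumed name)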
Figure 5.6 Sample table toyPuzzleTypeCostTbl: rows and row groups in DataFrames

DataFrame (SchemaRDD)

• A DataFrame, earlier named SchemaRDD, is similar to a table in a traditional database
• The schema is a blueprint for the organization of data in an RDD (similar to a traditional database schema)
• The schema tells how the RDD is constructed
• Refer to Section 10.3.4 for creating a DataFrame from an RDD (a sketch follows below)
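A minimal Scala sketch of building a DataFrame from an RDD by supplying a schema, assuming spark-shell's predefined sc and spark; the field names and values are assumptions for illustration.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

// an RDD of Row objects
val rowRDD = sc.parallelize(Seq(Row("Jigsaw", 120.0), Row("Cube", 250.0)))

// the schema describes how the data in each Row is organized
val schema = StructType(Seq(
  StructField("puzzleType", StringType, nullable = false),
  StructField("cost", DoubleType, nullable = false)
))

val df = spark.createDataFrame(rowRDD, schema)
df.printSchema()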

DataFrame (SchemaRDD)

• A SchemaRDD is returned when queries are loaded or executed. A SchemaRDD is composed of Row objects and additionally carries the data-type information for each column. A Row object wraps an array of basic data types.

Spark Resilient Distributed Dataset (RDD)

• A collection of objects distributed over many computing nodes
• Parallel data structures on clusters
• An immutable (thus read-only), partitioned, distributed collection of objects

RDD Features

• Have an interface that enables transformations which apply the same operation to many data objects
• Each RDD can be split into multiple partitions, which may be computed in parallel on different nodes of a cluster (a sketch follows below)
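A small Scala sketch of partitioning, assuming spark-shell's sc; the data and the partition count are illustrative.

val nums = sc.parallelize(1 to 1000, 4)   // request 4 partitions
println(nums.getNumPartitions)            // 4; each partition may be processed on a different node
println(nums.map(_ * 2).sum())            // the map is applied partition by partition, in parallel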

RDD Features

• A fault-tolerant abstraction that enables in-memory cluster computing
• Created only through deterministic operations on either (i) data in a stable data store, such as a file, or (ii) other RDDs
• Enable efficient execution of iterative algorithms and interactive data mining
RDD Features

• Commands enable intermediate results to be explicitly persisted in memory, and
• The partitioning can be controlled so that data placement is optimized, and partitions can be manipulated using operators (a sketch follows below)
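A Scala sketch of explicit persistence and partition control in spark-shell; the key-value pairs and the choice of four partitions are assumptions for illustration.

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))   // control the placement of data
partitioned.persist(StorageLevel.MEMORY_ONLY)                 // keep the intermediate result in memory
println(partitioned.reduceByKey(_ + _).collect().mkString(","))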

Spark RDD Immutability

• Not capable of, or susceptible to, change once created
• Transform commands create a new RDD rather than modifying an existing one (a sketch follows below)
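A short Scala illustration of immutability in spark-shell; the values are assumptions.

val original = sc.parallelize(Seq(1, 2, 3))
val doubled  = original.map(_ * 2)            // returns a new RDD; original is untouched
println(original.collect().mkString(","))     // 1,2,3
println(doubled.collect().mkString(","))      // 2,4,6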

Commands for Creation of a New RDD

(i) Load an external dataset as a distributed collection of objects, or
(ii) use a driver program to distribute a collection of objects (a sketch follows below).
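A Scala sketch of the two creation routes in spark-shell; the file path and the collection values are placeholders.

val lines = sc.textFile("hdfs:///data/toys.txt")   // (i) load an external dataset
val nums  = sc.parallelize(Seq(10, 20, 30, 40))    // (ii) distribute a driver-side collection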

Transform Operation

• Each dataset is represented as an object
• A transform command invokes methods on these objects to create new RDD(s)
• Transform operations create RDDs from other RDDs (a sketch follows below)
• Example 5.9
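A Scala sketch of chained transformations in spark-shell (this is not the book's Example 5.9; the cost values are assumptions). Each transform builds a new RDD from an existing one, and nothing is computed until an action is applied.

val costs  = sc.parallelize(Seq(120.0, 250.0, 180.0))
val taxed  = costs.map(_ * 1.18)        // new RDD derived from costs
val costly = taxed.filter(_ > 200.0)    // new RDD derived from taxed; still lazy, no computation yet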

Action Operation

• An action (i) returns a value to the program or (ii) exports data to a Data Store
• The computation is carried out when an action takes place on an RDD for the first time; the action then returns a value or sends data to a Data Store (a sketch follows below)
• Example 5.9
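A Scala sketch of the two kinds of action in spark-shell (again not the book's Example 5.9; the values and the output path are placeholders).

val costs = sc.parallelize(Seq(120.0, 250.0, 180.0))
val total = costs.reduce(_ + _)                  // (i) triggers the computation and returns a value to the program
costs.saveAsTextFile("hdfs:///out/toyCosts")     // (ii) exports the data to a data store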

Removing Data

• Spark automatically monitors the usage of caches
• Spark removes cached data using a 'least recently used partitions removed first' strategy
• Cached data can also be removed explicitly with RDD.unpersist() (a sketch follows below)
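A Scala sketch of caching an RDD and then removing it explicitly instead of waiting for least-recently-used eviction; the data is illustrative.

val cached = sc.parallelize(1 to 100).cache()   // mark the RDD for in-memory caching
cached.count()                                   // the first action materializes the cache
cached.unpersist()                               // explicitly remove the cached partitions from memory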

Spark Data Types and Their Descriptions

• Table 5.4

Numeric Operations on an RDD

• Table 5.6
• count(*), count(expr); sum(col), sum(DISTINCT col), avg(col), avg(DISTINCT col), min(col) and max(col), which returns a DOUBLE (Table 4.10); a sketch follows below
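A Scala sketch applying these aggregate functions through Spark SQL in spark-shell; the table name toys and its columns are assumptions for illustration.

import spark.implicits._

val toys = Seq(("Jigsaw", 120.0), ("Cube", 250.0), ("Jigsaw", 180.0)).toDF("puzzleType", "cost")
toys.createOrReplaceTempView("toys")

spark.sql(
  "SELECT count(*), count(DISTINCT puzzleType), sum(cost), avg(cost), min(cost), max(cost) FROM toys"
).show()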

Use of Statistical Functions

• The statistical functions stdev(), sampleStdev(), variance() and sampleVariance() are used for analysis of the numeric data given as input (a sketch follows below)
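A Scala sketch of these functions in spark-shell; they are available on RDDs of numeric values, and the cost figures are assumptions.

val costs = sc.parallelize(Seq(120.0, 250.0, 180.0))
println(costs.stdev())            // population standard deviation
println(costs.sampleStdev())      // sample standard deviation
println(costs.variance())         // population variance
println(costs.sampleVariance())   // sample variance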

Shared Variables

• Broadcast Variable:
• Created from a value, denoted by a variable v, by running the method sc.broadcast(v). The broadcast variable is a wrapper around v (a sketch follows below)
• Example 5.11
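A Scala sketch of a broadcast variable in spark-shell (not the book's Example 5.11; the lookup map is an assumption for illustration).

val v = Map("Jigsaw" -> 120.0, "Cube" -> 250.0)    // value held in the driver
val broadcastVar = sc.broadcast(v)                  // wrapper around v, shipped once to each node
val types = sc.parallelize(Seq("Jigsaw", "Cube", "Jigsaw"))
val costs = types.map(t => broadcastVar.value.getOrElse(t, 0.0))   // tasks read the broadcast copy
println(costs.collect().mkString(","))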

Shared Variables

• Accumulator Variable:
• Accumulators are special variables that add values using associative and commutative operations, so they can be updated in parallel: for example, in count() or sum() (a sketch follows below)
• Example 5.12
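A Scala sketch of an accumulator in spark-shell (not the book's Example 5.12; the record values are assumptions for illustration).

val badRecords = sc.longAccumulator("badRecords")           // supports associative, commutative adds
val records = sc.parallelize(Seq("ok", "bad", "ok", "bad"))
records.foreach(r => if (r == "bad") badRecords.add(1))     // updated in parallel on the executors
println(badRecords.value)                                    // 2, read back in the driver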

Summary

We learnt:
• DataFrames
• Resilient Distributed Datasets (RDDs)
• RDD features
• Transformations and actions
• Shared variables

End of Lesson 4 on Spark DataFrame and RDDs

