BDACh05L07bETLDATA ETLProcessInAnalytics


Lesson 7

Extract, Transform and Load Process

“Big Data Analytics”, Ch.05 L07: Spark and Big Data Analytics, 2019
Raj Kamal and Preeti Saxena © McGraw-Hill Education (India)
Three functions of the ETL process
• Extract: acquires data by querying a
Data Store or from another program
• Transform: changes the data into a desired
format (file, columnar, tabular or other)
• Load: places the transformed data into
another Data Store or a data warehouse
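
The three functions can be sketched end to end in a few lines. This is a minimal illustration using only the Python standard library, not Spark code: the inline CSV text stands in for the source Data Store, an in-memory SQLite table stands in for the warehouse, and the names `SOURCE_CSV`, `extract`, `transform`, `load` and the `temps` table are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for the result of a Data Store query.
SOURCE_CSV = "city,temp_f\nPune,86\nDelhi,104\nIndore,95\n"


def extract(text):
    """Extract: acquire raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert each record into the desired tabular form
    (city, temperature in Celsius rounded to one decimal)."""
    return [
        (r["city"], round((int(r["temp_f"]) - 32) * 5 / 9, 1))
        for r in rows
    ]


def load(rows, conn):
    """Load: place the transformed rows into the target store."""
    conn.execute("CREATE TABLE IF NOT EXISTS temps (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO temps VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute("SELECT city, temp_c FROM temps ORDER BY city").fetchall()
```

Keeping each stage as its own function means any one of them can be swapped (a different source, a different target) without touching the other two.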
Transform Functions
• join(), groupBy(), cogroup(), filter(),
map(), mapValues(), flatMap(), sort(),
partitionBy(), groupByKey(),
reduceByKey(), aggregateByKey(),
pipe(), coalesce(), sample(), union(),
crossProduct()
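
To make the semantics of a few of these functions concrete, here is a plain-Python sketch of what flatMap, map, filter, groupByKey and reduceByKey compute on a small word-count input. It is not Spark code and involves no partitioning or lazy evaluation; the helper names `group_by_key` and `reduce_by_key` are hypothetical stand-ins written for this illustration.

```python
from collections import defaultdict
from functools import reduce

lines = ["spark etl", "etl load", "spark spark"]

# flatMap: each input line yields zero or more output words.
words = [w for line in lines for w in line.split()]

# map: exactly one output pair per input word.
pairs = [(w, 1) for w in words]

# filter: keep only the pairs whose key satisfies a predicate.
kept = [(k, v) for k, v in pairs if k != "load"]


def group_by_key(kv_pairs):
    """groupByKey stand-in: collect all values that share a key."""
    groups = defaultdict(list)
    for k, v in kv_pairs:
        groups[k].append(v)
    return dict(groups)


def reduce_by_key(kv_pairs, func):
    """reduceByKey stand-in: fold the values of each key with func."""
    return {k: reduce(func, vs) for k, vs in group_by_key(kv_pairs).items()}


grouped = group_by_key(kept)
counts = reduce_by_key(kept, lambda a, b: a + b)
```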

Spark 2.3 with Pandas
• Includes transformation functions on
complex objects like arrays, maps and
sets of columns
• Pandas UDFs provide powerful
transformations: UDFs, vectorized UDFs
(VUDFs) and grouped vectorized UDFs
(GVUDFs)
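
The three kinds of UDF differ mainly in what the function receives: one value at a time, a whole batch of values, or all the rows of one group. The sketch below illustrates only that difference. It uses Python lists as a stand-in for pandas Series and imports neither Spark nor pandas; every function name in it is hypothetical.

```python
from collections import defaultdict
from statistics import mean

values = [10.0, 20.0, 30.0, 40.0]
keys = ["a", "a", "b", "b"]


def plus_one_row(x):
    """Row-at-a-time UDF stand-in: called once per value."""
    return x + 1


def plus_one_batch(batch):
    """Vectorized UDF stand-in: called once per batch of values."""
    return [x + 1 for x in batch]


def subtract_group_mean(group):
    """Grouped vectorized UDF stand-in: called once per group,
    sees every value of that group together."""
    m = mean(group)
    return [x - m for x in group]


row_result = [plus_one_row(x) for x in values]
batch_result = plus_one_batch(values)

# Split the values by key, then apply the grouped function to each group.
groups = defaultdict(list)
for k, v in zip(keys, values):
    groups[k].append(v)
grouped_result = {k: subtract_group_mean(vs) for k, vs in groups.items()}
```

The row and batch versions return the same values; the batch form simply makes one call instead of one per row, which is where vectorized execution saves overhead.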

Figure 5.9: An ETL pipeline using Spark SQL for
ETL Process and Data Source API v2 in Spark 2.3.

Extract
• Skipping Corrupt or Bad Records
or Files
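
Skipping bad records during extraction can be sketched as follows. In Spark this behaviour is typically requested through reader settings such as the `mode` option (for example `DROPMALFORMED`) or the `spark.sql.files.ignoreCorruptFiles` configuration; the code below is a plain-Python stand-in for the idea, with a hypothetical helper `extract_valid` and made-up input lines.

```python
import json

# Hypothetical raw input: one JSON record per line, two of them damaged.
raw_lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": ',      # truncated record
    'not json at all',         # garbage line
    '{"id": 3, "name": "c"}',
]


def extract_valid(lines):
    """Parse each line; keep the good records and set aside the bad ones
    instead of letting one corrupt line abort the whole extract."""
    good, bad = [], []
    for line in lines:
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            bad.append(line)
    return good, bad


records, rejected = extract_valid(raw_lines)
```

Returning the rejected lines alongside the good ones keeps them available for later inspection rather than dropping them silently.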

Extract and Load
• Multi-line JSON/CSV Support
• Load and Save files: SerDe (serializer/
deserializer) code obtains records from
unstructured data
• The save process uses the serializer
• The loading (extracting) process uses
the deserializer
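
A small round trip shows both ideas: saving serializes an in-memory record to text, loading deserializes it back, and a single record may span several physical lines. In Spark the reader option for the latter is `multiLine`; the sketch below uses only the Python standard library, and the sample `record` is made up for illustration.

```python
import csv
import io
import json

# Hypothetical record whose "note" field contains an embedded newline.
record = {"id": 7, "tags": ["etl", "spark"], "note": "line one\nline two"}

# Save: the serializer turns the record into text.
# indent=2 spreads one JSON record over several lines.
saved_json = json.dumps(record, indent=2)

# Load: the deserializer turns the multi-line text back into a record.
loaded = json.loads(saved_json)

# CSV: the writer quotes the field that contains a newline, so one
# logical row occupies two physical lines in the saved text.
buf = io.StringIO(newline="")
writer = csv.writer(buf)
writer.writerow(["id", "note"])
writer.writerow([record["id"], record["note"]])
saved_csv = buf.getvalue()

# Reading it back yields two rows (header + data), not three.
rows = list(csv.reader(io.StringIO(saved_csv, newline="")))
```

Note that CSV carries no type information, so the `id` comes back as the string `"7"`, whereas the JSON round trip preserves the integer.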

Example for Load and Save
• Example 5.13 explains the code for
SequenceFile, JSON and CSV file
load and save functions for obtaining
records, rows or files

Example
• Example 5.14 explains Spark SQL
transformations in Spark 2.3
• Complex objects, nested tables (one
column rows) and array
transformations
• Using the DataFrameWriter API.
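
What a transformation on nested and array-valued columns does can be sketched without Spark. The code below is not Example 5.14 from the book; it is a plain-Python stand-in that flattens a nested `customer` object into top-level columns and emits one output row per element of the `items` array, similar in spirit to an explode operation. The sample `orders` data and the helper `flatten_and_explode` are hypothetical.

```python
# Hypothetical records with a nested object and an array-valued field.
orders = [
    {"order_id": 1,
     "customer": {"name": "Asha", "city": "Pune"},
     "items": ["pen", "ink"]},
    {"order_id": 2,
     "customer": {"name": "Ravi", "city": "Delhi"},
     "items": ["book"]},
]


def flatten_and_explode(records):
    """Lift nested customer fields to top-level columns and
    produce one flat row per element of the items array."""
    rows = []
    for rec in records:
        for item in rec["items"]:
            rows.append({
                "order_id": rec["order_id"],
                "customer_name": rec["customer"]["name"],
                "customer_city": rec["customer"]["city"],
                "item": item,
            })
    return rows


flat = flatten_and_explode(orders)
```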

Summary
• Extract, Transform and Load
• Transform functions
• Load and Save
• Spark 2.3 includes transformation
functions on complex objects like
arrays, maps and sets of columns
• Pandas UDFs provide powerful
transformations: UDFs, vectorized UDFs
(VUDFs) and grouped vectorized UDFs
(GVUDFs)
End of Lesson 7 on
Applications and Big Data
analytics using Spark
