Synapse Project Deck

This document describes an Azure Synapse Analytics course that provides over 18 hours of learning content on topics such as data transformation, Spark SQL, joins, and schema management. It discusses how Synapse Analytics integrates transactional and analytical systems by combining SQL pools, Spark pools, and data lakes to enable data ingestion, preparation, transformation, storage, and visualization in one solution.

Basics to Advanced: Azure Synapse Analytics

Hands-On Project

Author: Shanmukh Sattiraju


Azure Synapse Analytics

Author: Shanmukh Sattiraju


Pre-requisites
• No prior experience with Azure Synapse Analytics needed; we start from scratch
• Basic knowledge of Python
• Basic knowledge of SQL
• Basic Azure cloud knowledge would be a plus

Author: Shanmukh Sattiraju


What you’ll get from this course

• More than 18.5 hours of updated learning content
• 50+ of the most commonly used PySpark transformations
• 45+ PySpark notebooks
• Practical understanding of Delta Lake
• Understanding of Spark optimization techniques
• Lifetime access to this course
• Certificate of completion at the end of the course

Author: Shanmukh Sattiraju


1 - Selection and Filtering
2 - Handling Nulls, Duplicates and Aggregations
3 - Data Transformation and Manipulation
4 - MSSpark Utilities
5 - Spark SQL
6 - Join Transformation
7 - String Manipulation and Sorting
8 - Window Functions
9 - Conversions and Pivoting
10 - Schema Definition and Management
11 - User Defined Functions

Author: Shanmukh Sattiraju
Project Architecture

Ingestion → Transformation → Loading

Raw container (Unemployment.csv) → Serverless SQL pool → Refined container → Spark pool → Processed container → Dedicated SQL pool → Reporting

Author: Shanmukh Sattiraju
Along with Hands-on project:

Author: Shanmukh Sattiraju


Origin of Azure Synapse Analytics

Author: Shanmukh Sattiraju


Rise of Data Warehouse
• It all started with the need for a separate transactional system and a separate analytical system

Author: Shanmukh Sattiraju


Example of a Transactional System

[Diagram: users interact with a website, which issues selects, writes, and updates against the underlying storage.]

Author: Shanmukh Sattiraju


Performing Analysis on Data

[Diagram: data engineers, data analysts, and business users run analysis directly against the same storage.]

Author: Shanmukh Sattiraju
OLTP vs OLAP

[Diagram: data moves from the OLTP (Online Transactional Processing) system to the OLAP (Online Analytical Processing) system through ETL with data cleansing/transformation. The OLTP side can contain NULL values and columns that we don't need for analysis; the OLAP data warehouse contains structured and cleaned data.]

Author: Shanmukh Sattiraju
Summary
• OLTP (Online Transactional Processing) is suited for current data that requires high read and write throughput

• OLAP (Online Analytical Processing) contains all the historical data

• OLAP is dedicated to performing analytics on the data, which brings the need for a data warehouse

Author: Shanmukh Sattiraju


A typical Data Warehouse

[Diagram: sources such as SQL databases, other databases, CSV files and JSON files feed an ETL process (data cleansing/transformation) that loads the data warehouse. The data warehouse contains structured and cleaned data.]

Author: Shanmukh Sattiraju
Data Lake

[Diagram: sources such as SQL databases, other databases, CSV files, JSON files and image/video data are ingested (ETL) into the data lake, which can store structured, semi-structured and unstructured data.]

Author: Shanmukh Sattiraju
Modern Data Warehouse

[Diagram: INGEST on-premises, external/IoT and multi-cloud (AWS, GCP, Azure) data with Azure Data Factory → PREPARE with Azure Data Factory → TRANSFORM & ENRICH with Azure Databricks → SERVE with Azure SQL Data Warehouse → VISUALIZE. Everything is STORED in Azure Data Lake Storage Gen2.]

Author: Shanmukh Sattiraju
Problem with Modern Data Warehouse

[Same diagram as above: the pipeline is stitched together from several separate services (Azure Data Factory, Azure Databricks, Azure SQL Data Warehouse, Azure Data Lake Storage Gen2).]

Author: Shanmukh Sattiraju
The Solution – Azure Synapse Analytics

[Diagram: on-premises, external/IoT and multi-cloud (AWS, GCP, Azure) data flows into Azure Synapse Analytics, stored in Azure Data Lake Storage Gen2 and visualized from there.]

Author: Shanmukh Sattiraju
Components of Azure Synapse Analytics

Synapse Studio: Data Integration, Management, Monitoring, Security

Analytics Pools:
• SQL Pools: Serverless SQL Pool and Dedicated SQL Pool
• Apache Spark Pools: Spark Pool
• Data Explorer Pools: Data Explorer Pool

Store: Azure Data Lake Storage Gen2, fed by on-premises, external/IoT and AWS/GCP/Azure data, with visualization on top

Author: Shanmukh Sattiraju
Replacing the Modern Data Warehouse

[Diagram: the modern data warehouse pipeline again: INGEST (Azure Data Factory), PREPARE (Azure Data Factory), TRANSFORM & ENRICH (Azure Databricks), SERVE (Azure SQL Data Warehouse), VISUALIZE, with storage in Azure Data Lake Storage Gen2.]

Author: Shanmukh Sattiraju
Replacing the Modern Data Warehouse

INGEST: Synapse Pipelines
PREPARE: Synapse Serverless SQL Pool (or) Dedicated SQL Pool, or Synapse Spark
TRANSFORM & ENRICH: Synapse Serverless SQL Pool (or) Synapse Spark
SERVE: Synapse Dedicated SQL Pool (or) Serverless SQL Pool
VISUALIZE
STORE: Azure Data Lake Storage Gen2

Author: Shanmukh Sattiraju
Ingest: Synapse Pipelines, Data Flows

Storage: Azure SQL Database (DW), Azure Data Lake (primary), Spark Tables

Compute: Dedicated SQL Pool (SQL DW), Serverless SQL Pool, Spark Pool

Visualize

Manage / Security

Author: Shanmukh Sattiraju
Azure Synapse Analytics

Microsoft’s Definition:

Azure Synapse is a limitless analytics service that brings together


enterprise data warehousing and Big Data analytics. It gives you the
freedom to query data on your terms, using either serverless or
dedicated resources—at scale.

Author: Shanmukh Sattiraju


Ingest: Synapse Pipelines, Data Flows

Storage: Azure SQL Database (DW), Azure Data Lake (primary), Spark Tables

Compute: Dedicated SQL Pool (SQL DW), Serverless SQL Pool, Spark Pool

Visualize

Manage / Security

Author: Shanmukh Sattiraju
Environment Setup

Author: Shanmukh Sattiraju


Understanding dataset
Unemployment dataset

Author: Shanmukh Sattiraju


Serverless SQL Pool

Author: Shanmukh Sattiraju


Project Architecture

Raw container (Unemployment.csv) → Serverless SQL pool → Refined container → Spark pool → Processed container → Dedicated SQL pool → Reporting

Author: Shanmukh Sattiraju
On-demand Serverless SQL Pool

[Diagram: the Synapse serverless SQL pool runs T-SQL queries directly over data stored in Azure Data Lake Storage Gen2.]

Author: Shanmukh Sattiraju


Serverless SQL Pool – Architecture - DQP

Author: Shanmukh Sattiraju


Benefits of a Serverless SQL Pool
• You are charged based on how much data is processed by each query, which makes it well suited for data exploration
• No underlying infrastructure to manage
• You can use T-SQL queries to work with your data (the same as in the dedicated SQL pool)
• You cannot create regular tables in the serverless SQL pool because it does not manage any storage of its own
• You can only create external tables or views to work with.
Author: Shanmukh Sattiraju
Analysing data with Serverless SQL Pool

• Mostly used for data exploration


• T-SQL to query data
• Pricing

Author: Shanmukh Sattiraju


Querying data with Serverless SQL Pool

• The OPENROWSET() function is used to query data
• We can use OPENROWSET() with or without the DATA_SOURCE parameter
• OPENROWSET() is used after the FROM clause of a T-SQL query
• The OPENROWSET function is not supported in the dedicated SQL pool.

SELECT
*
FROM
OPENROWSET()
Author: Shanmukh Sattiraju
Mandatory parameters for OPENROWSET()
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK '<storage-path>',        -- mandatory
        FORMAT = 'CSV',               -- mandatory: 'CSV', 'PARQUET' or 'DELTA'
        PARSER_VERSION = '2.0'        -- mandatory for CSV: '1.0' or '2.0'
    ) AS [result]
Author: Shanmukh Sattiraju
URL formats for BULK parameter

External Data Source          Prefix     Storage account path
Azure Blob Storage            http[s]    <storage_account>.blob.core.windows.net/path/file
Azure Blob Storage            wasb[s]    <container>@<storage_account>.blob.core.windows.net/path/file
Azure Data Lake Store Gen2    http[s]    <storage_account>.dfs.core.windows.net/path/file
Azure Data Lake Store Gen2    abfs[s]    <container>@<storage_account>.dfs.core.windows.net/path/file

External Data Source                 Full path
Azure Blob Storage with https        https://<storage_account>.blob.core.windows.net/path/file
Azure Blob Storage with wasbs        wasbs://<container>@<storage_account>.blob.core.windows.net/path/file
Azure Data Lake Store Gen2 (https)   https://<storage_account>.dfs.core.windows.net/path/file
Azure Data Lake Store Gen2 (abfss)   abfss://<container>@<storage_account>.dfs.core.windows.net/path/file


Author: Shanmukh Sattiraju
Creating external data source

SELECT
    TOP 10 *
FROM
    OPENROWSET(
        BULK 'folder/file',
        DATA_SOURCE = '<name>',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 2
    ) AS [result]

[Diagram: the external data source points to the data in ADLS Gen2 ('abfss://<container>@<storage_account>.dfs.core.windows.net/'); the serverless SQL database stores only the metadata used by the SQL script.]

Author: Shanmukh Sattiraju


External Data Source, Credential

Sources can be Azure Data Lake, Oracle, MySQL, and so on.

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<pass>';

CREATE DATABASE SCOPED CREDENTIAL <cred>
WITH IDENTITY = '<type>';

CREATE EXTERNAL DATA SOURCE <source_name>
WITH (
    LOCATION = '<path>',
    CREDENTIAL = <cred>
);

Author: Shanmukh Sattiraju


Credential
• A database scoped credential is a record that contains the authentication information required to connect to a resource outside SQL Server
• Before creating a database scoped credential, the database must have a master key to protect the credential

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
CREATE DATABASE SCOPED CREDENTIAL <cred>
WITH IDENTITY = '<identity name>';
GO
CREATE EXTERNAL DATA SOURCE <s_name>
WITH (
    LOCATION = 'abfss://<container>@<storage_account>.dfs.core.windows.net/',
    CREDENTIAL = <cred>
);

[Diagram: the external data source points to the data in ADLS Gen2; the serverless SQL database stores only the metadata used by the SQL script.]

Author: Shanmukh Sattiraju


CREATE EXTERNAL FILE FORMAT

• It defines the file format of the data referenced by an EXTERNAL TABLE (discussed later)
• Creating an external file format is a prerequisite for creating an external table
• The data created through an EXTERNAL TABLE uses this EXTERNAL FILE FORMAT
• In short, the EXTERNAL FILE FORMAT specifies the actual layout of the data referenced by an external table.

Currently supported file formats:

• DELIMITEDTEXT
• PARQUET
• DELTA (applies only to serverless SQL pools)

Author: Shanmukh Sattiraju


CREATE EXTERNAL FILE FORMAT
-- Create an external file format for DELIMITED (CSV/TSV) files.
CREATE EXTERNAL FILE FORMAT file_format_name
WITH (
FORMAT_TYPE = DELIMITEDTEXT
[ , FORMAT_OPTIONS ( <format_options> [ ,...n ] ) ]
[ , DATA_COMPRESSION = {
'org.apache.hadoop.io.compress.GzipCodec'
}
]);

<format_options> ::=
{
FIELD_TERMINATOR = field_terminator
| STRING_DELIMITER = string_delimiter
| FIRST_ROW = integer -- ONLY AVAILABLE FOR AZURE SYNAPSE ANALYTICS
| DATE_FORMAT = datetime_format
| USE_TYPE_DEFAULT = { TRUE | FALSE }
| ENCODING = {'UTF8' | 'UTF16'}
| PARSER_VERSION = {'parser_version'}
}

Author: Shanmukh Sattiraju
CREATE EXTERNAL FILE FORMAT
--Create an external file format for PARQUET files.

CREATE EXTERNAL FILE FORMAT file_format_name


WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = {
'org.apache.hadoop.io.compress.SnappyCodec'
| 'org.apache.hadoop.io.compress.GzipCodec' }
);

-- Create an external file format for Delta table files (serverless SQL pools in Synapse Analytics and SQL Server 2022).

CREATE EXTERNAL FILE FORMAT file_format_name


WITH (
FORMAT_TYPE = DELTA
);

Author: Shanmukh Sattiraju


Create External Table As Select (CETAS)
CREATE EXTERNAL TABLE ext_table
WITH (
    LOCATION = 'test/extfile/',
    DATA_SOURCE = <data_source_name>,
    FILE_FORMAT = <file_format_name>
) AS
SELECT
    TOP 10 [data].Year, [data].State
FROM
    OPENROWSET(
        BULK 'abfss://<container>@<storage_account>.dfs.core.windows.net/Unemployment.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [data]
Author: Shanmukh Sattiraju
Initial Transformation

[Diagram: Unemployment.csv in the raw container is written as .parquet to the refined container, using an EXTERNAL DATA SOURCE pointing at the refined container, an EXTERNAL FILE FORMAT of .parquet, and an EXTERNAL TABLE.]

Author: Shanmukh Sattiraju


External Table

[Diagram: an external table combines an external data source and an external file format to reference data stored externally (Azure Data Lake, Oracle, MySQL); you query it with SELECT * FROM ExternalTable.]

Author: Shanmukh Sattiraju


CREATE EXTERNAL TABLE

• With Synapse SQL, you can use external tables to read external data using either the dedicated SQL pool or the serverless SQL pool.

• An external table is a table-like object in Azure Synapse that represents the structure and schema of data stored in external data sources.

• External tables act as a reference point to data that is stored externally in storage (Azure Data Lake Storage or Azure Blob Storage).

• This helps to control access to external data.

• When the external table is created with CETAS, the resulting data is also written to the storage location you choose.

• Options needed to create an external table:

  • LOCATION = where you want to store the data created by the external table

  • DATA_SOURCE = holds the path of the storage

  • FILE_FORMAT = format you want for the data created by the external table

Author: Shanmukh Sattiraju


Serverless SQL pool initial transformation

[Diagram: Unemployment.csv in the raw container becomes .parquet in the refined container.]

Author: Shanmukh Sattiraju


History and Data processing before Spark

Author: Shanmukh Sattiraju


Big data approach

[Diagram: a single computer with its own RAM and storage for data storage and processing (monolithic) versus a distributed approach: a cluster of multiple machines, each with its own RAM and storage, working in parallel.]
Author: Shanmukh Sattiraju
Hadoop Platform

[Diagram: a cluster with a master node running the cluster manager (YARN) and worker nodes providing distributed storage via HDFS (Hadoop Distributed File System); the distributed approach adds multiple machines to achieve parallel processing.]
Author: Shanmukh Sattiraju
Hadoop Platform

• Cluster Manager: YARN
• Distributed Storage: HDFS
• Distributed Computing: MapReduce

Author: Shanmukh Sattiraju


YARN – Yet Another Resource Negotiator

[Diagram: an application is submitted to the Resource Manager (RM) on the master node; each worker node runs a Node Manager (NM), and Application Masters (AM) run on selected worker nodes.]

Author: Shanmukh Sattiraju


HDFS – Distributed Storage

[Diagram: data to be stored or copied is split into blocks (1 to 10); the NameNode (NN) on the master node keeps the metadata repository of which blocks live where, and the blocks are distributed and replicated across the DataNodes (DN) on the worker nodes.]

Author: Shanmukh Sattiraju


Map/Reduce – Distributed Computing

Map Shuffle Reduce

Author: Shanmukh Sattiraju


MapReduce – Distributed Computing

[Word-count example: each mapper reads lines from HDFS distributed storage and emits a key/value pair per word (key = word, value = 1). Shuffling groups the values by key (e.g. Ironman → [1,1], Superman → [1,1,1]). The reducers sum the occurrences (Ironman – 2, Superman – 3, Batman – 2, Antman – 2, Spiderman – 2) and write the results back to HDFS. Every iteration of the process involves an HDFS read and an HDFS write.]
Author: Shanmukh Sattiraju
Emergence of Spark

Author: Shanmukh Sattiraju


Drawbacks of MapReduce
Traditional Hadoop MapReduce processing

[Diagram: HDFS read → iteration 1 (process data) → write to HDFS disk → HDFS read → iteration 2 (process data) → HDFS write. Every iteration goes through disk-based storage.]
Author: Shanmukh Sattiraju


Emergence of Spark

[Diagram: a single HDFS (or any cloud storage) read loads the data into RAM, and iterations 1, 2 and 3 of the analysis run in memory without writing back to disk between steps.]
Author: Shanmukh Sattiraju


Apache Spark

Apache Spark is an open source in-memory application framework for


distributed data processing and iterative analysis on massive data
volumes

In simple terms, Spark is a


• Compute Engine
• Unified data processing System

Author: Shanmukh Sattiraju


Spark Core Concepts

Author: Shanmukh Sattiraju


Apache Spark Ecosystem

Higher-level APIs: Spark SQL (interactive queries), Spark Streaming, Spark ML (MLlib), Spark Graph (graph computation) and SparkR (R on Spark), built on the DataFrame/Dataset APIs.

Spark Core: the Spark Core API (Scala, Java, Python, SQL, R) and the RDD (Resilient Distributed Dataset) APIs.

Spark Engine: the distributed compute engine.

Runs on a cluster or resource manager (YARN, Mesos, Standalone, Kubernetes) over distributed storage (Azure Storage, Amazon S3, GCP).
Author: Shanmukh Sattiraju
Limitations with Hadoop

Metric               | Hadoop                                                          | Apache Spark
Performance          | Depends on disks for read and write operations; slower disk I/O | In-memory processing; 10-100x faster than Hadoop
Development          | Need to develop Map and Reduce code, which is complex           | Use native SQL via Spark SQL and composable APIs
Language             | Java                                                            | Java, Scala, Python and R
Storage              | HDFS                                                            | HDFS and cloud storage (Azure Storage, Amazon S3, etc.)
Resource management  | YARN                                                            | YARN, Mesos, Standalone, Kubernetes
Data processing      | Batch processing                                                | Batch processing, streaming, machine learning

Author: Shanmukh Sattiraju


Apache Spark Architecture

[Diagram: the master node runs the driver program with the Spark Context, which talks to the cluster manager; each worker node runs an executor with a cache and its tasks.]
Author: Shanmukh Sattiraju


Benefits of Spark pool

• Speed and efficiency

• Ease of Creation

• Support for ADLS Gen2

• Scalability

Author: Shanmukh Sattiraju


RDD – Resilient Distributed Dataset

• RDD stands for Resilient Distributed Dataset.
• It is the fundamental data structure of Apache Spark.
• It is an immutable collection of objects.
• An RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

Let's break down RDD:

Resilient = it is fault tolerant
Distributed = data is spread across partitions in the cluster
Dataset = a set of data that has rows and columns
Author: Shanmukh Sattiraju
RDD - Overview

[Diagram: an RDD is split into partitions 1-6; the driver, through the cluster manager, assigns the partitions to worker nodes so they can be processed in parallel.]
Author: Shanmukh Sattiraju


lambda, map(), filter()

Lambda:
Anonymous functions (i.e. functions defined without a name).
Syntax: lambda <value>: <expression>
E.g.:
a = lambda x: x + 10
print(a(10))
Result is 20

Map:
Returns a new distributed dataset formed by passing each element of the source through <function>.
Syntax: map(<function>)
E.g.: if the RDD has values [1, 2, 3]
rdd.map(lambda x: x + 10)
Result: [11, 12, 13]
map() adds 10 to each value in the given data.

Filter:
Returns a new dataset formed by selecting those elements of the source on which <function> returns true.
Syntax: filter(<function>)
E.g.: if the RDD has values [11, 12, 13]
rdd.filter(lambda x: x % 2 == 0)
Result: [12]
filter() applies the condition to each element and keeps the element in the new dataset only if the condition is true.

Author: Shanmukh Sattiraju
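A minimal PySpark sketch tying the three together (the values are illustrative; sc is taken from the Spark session available in a Synapse notebook):

sc = spark.sparkContext                            # SparkContext from the notebook's Spark session
rdd = sc.parallelize([1, 2, 3, 11, 12, 13])        # create an RDD from a Python list
rdd_added = rdd.map(lambda x: x + 10)              # transformation: add 10 to each element
rdd_even = rdd_added.filter(lambda x: x % 2 == 0)  # transformation: keep even values only
print(rdd_even.collect())                          # action: bring the result back to the driver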
RDD Operations

Transformations
• Any operation that leads to a change in the form of the data is a transformation.
• Transformations take an RDD as input and produce one or many RDDs as output.
• After executing a transformation, the resulting RDD(s) can be smaller, bigger or the same size as the parent RDDs.
• Transformations are lazy: they are not executed immediately, only when an action is called (this is called lazy evaluation).
• Transformations don't change the input RDD, because RDDs are immutable.
• E.g.: filter(), map(), flatMap(), etc.

Actions
• Any operation that returns a value or data back to the driver program is an action.
• Actions bring the laziness of RDDs into motion.
• E.g.: count(), collect(), take()
Author: Shanmukh Sattiraju


RDD operations

[Diagram: rdd_list → map() → rdd_add → filter() → rdd_filter (transformations), then collect() (action) sends the output to the driver program.]
Author: Shanmukh Sattiraju


Lineage

Author: Shanmukh Sattiraju


RDD Lineage

>> rdd_list = sc.parallelize(list)

>> rdd_add = rdd_list.map(<func>)

>> rdd_filter = rdd_add.filter(<func>)

>> rdd_filter.collect()

RDD lineage: rdd_list (sc.parallelize) → rdd_add (map() transformation) → rdd_filter (filter() transformation). The collect() ACTION sends the result to the driver program.
Author: Shanmukh Sattiraju


Word count

[Diagram: textFile() reads Biography.txt ("Tony Stark is an Avenger.. Tony Stark is genius, Tony Stark is a Billionaire…") from Azure Data Lake; flatMap() splits each line into words; map() emits (word, 1) pairs; reduceByKey() groups and sums the counts per word (Tony → 3, Stark → 3, Is → 3, A → 3, Avenger → 2, …); collect() returns the result to the driver program. The intermediate RDDs are RDDRead, RDDMap, RDDPaired and RDDreduced.]

Author: Shanmukh Sattiraju
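A short PySpark sketch of this word count (the storage path is a placeholder, not the course's actual storage account):

sc = spark.sparkContext
rdd_read = sc.textFile("abfss://raw@<storage_account>.dfs.core.windows.net/Biography.txt")

rdd_paired = (rdd_read
              .flatMap(lambda line: line.split())       # split each line into words
              .map(lambda word: (word, 1)))             # emit (word, 1) pairs
rdd_reduced = rdd_paired.reduceByKey(lambda a, b: a + b)  # sum the counts per word

print(rdd_reduced.collect())                            # action: results returned to the driver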
ReduceByKey() vs GroupByKey()
reduceByKey()

[Diagram: a combiner first aggregates values within each partition (e.g. Tony → 2, Is → 2 per partition), and only the partial results are shuffled and merged into the final counts (Tony → 4, Is → 4, Awesome → 2, Superhero → 1, Ironman → 1).]
Author: Shanmukh Sattiraju


ReduceByKey() vs GroupByKey()
groupByKey()

[Diagram: every individual (word, 1) pair is shuffled across the network first (e.g. all the separate Tony,1 pairs) and only then aggregated into the final counts (Tony → 4, Is → 4, Awesome → 2, Superhero → 1), so much more data moves between nodes.]
Author: Shanmukh Sattiraju


reduceByKey() vs groupByKey()

reduceByKey()
• Wide transformation
• Data is combined/aggregated within each partition before getting shuffled
• Less shuffling, as data is already combined
• More efficient

groupByKey()
• Wide transformation
• Data is combined/aggregated only after shuffling
• More shuffling, as all the raw data needs to be collected
• Less efficient
Author: Shanmukh Sattiraju


Execution plan

[Diagram: program execution is broken into jobs, each job into stages, and each stage into tasks. E.g. Job 0 → Stage 1 (Tasks 1-2) and Stage 2 (Tasks 3-4); Job 1 → Stage 3 (Tasks 5-6); Job 2 → Stage 4 (Tasks 7-8).]
Author: Shanmukh Sattiraju
Jobs

Number of Jobs = Number of Actions
Author: Shanmukh Sattiraju


Transformations

Narrow Transformations
• All the elements required to compute the records in a single partition live in a single partition of the parent RDD.
• E.g. map(), filter()

Wide Transformations
• The elements required to compute the records in a single partition may live in many partitions of the parent RDD.
• E.g. groupByKey() and reduceByKey()
Author: Shanmukh Sattiraju


Stage 0 / Stage 1

[Diagram: the word-count pipeline again. The narrow transformations flatMap() and map() run in Stage 0; the wide transformation reduceByKey() requires a shuffle, which starts Stage 1 (RDDPaired → RDDreduced).]
Author: Shanmukh Sattiraju
Stages

Number of Stages = Number of wide transformations applied + 1

Author: Shanmukh Sattiraju

Tasks

Number of Tasks = Number of Partitions
Author: Shanmukh Sattiraju


Summary

Author: Shanmukh Sattiraju


DAG

Directed Acyclic Graph

[Diagram: nodes A, B and C connected by directed edges with no cycles; in Spark the DAG groups operations such as map into stages (e.g. Stage 0).]
Author: Shanmukh Sattiraju


RDD lineage vs DAG

RDD Lineage
• Formed when an RDD/DataFrame is created or after each transformation is applied
• Each RDD points to one or more parent RDDs, which forms the lineage
• It is the logical plan
• It is like a portion of the DAG

DAG
• Forms when an action is called
• It is the physical plan built by the DAG scheduler after an action is called
• It is like a combination of many RDDs and their transformations
Author: Shanmukh Sattiraju
[Diagram: the higher-level APIs (SQL APIs and DataFrame/Dataset APIs) go through the Catalyst optimizer, which produces optimized calls to the lower-level RDD APIs.]
Author: Shanmukh Sattiraju


DataFrames
• DataFrames are built on top of the Spark RDD APIs
• The DataFrame API is more efficient: it can optimize operations using the underlying Catalyst optimizer

To read a DataFrame: DataFrameReader

Supported formats:
JSON
CSV
PARQUET
AVRO
ORC
TEXT
Author: Shanmukh Sattiraju
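As a small illustration of DataFrameReader, reading the course's Unemployment.csv from the raw container (the storage account name is a placeholder):

df = (spark.read
          .format("csv")
          .option("header", "true")
          .load("abfss://raw@<storage_account>.dfs.core.windows.net/Unemployment.csv"))
df.show(5)    # preview the first rows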


PySpark Transformations

Author: Shanmukh Sattiraju


Project Architecture

Raw container (Unemployment.csv) → Serverless SQL pool → Refined container → Spark pool → Processed container → Dedicated SQL pool → Reporting

Author: Shanmukh Sattiraju
Selection and filtering

We will use the functions below and understand their usage:

1. display()
2. select()
3. selectExpr()
4. filter()
5. where()

Author: Shanmukh Sattiraju


Handling NULLs/missing values and grouping/aggregation

We will use the functions below and understand their usage:

1. fillna()
2. na.fill()
3. groupBy()
4. agg()
5. dropna()
6. na.drop()

Author: Shanmukh Sattiraju


Handling NULLs and aggregation

Columns: Line Number, Year, Month, State, Labor Force, Employed, Unemployed, Unemployment Rate, Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy

Transformations:
• Identify NULLs: filter()
• Replace NULLs: fillna() or na.fill()
• Drop NULLs: dropna() or na.drop()
• Drop duplicate rows: dropDuplicates()
• Aggregation: groupBy() / agg()
Author: Shanmukh Sattiraju
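A minimal sketch of these steps, assuming df is the dataframe read from Unemployment.csv; the replacement value and the chosen columns are illustrative only:

from pyspark.sql.functions import col, avg

df_filled = df.fillna({"Industry": "Unknown"})              # replace NULLs in one column
df_clean = df_filled.dropna(subset=["Unemployment Rate"])   # drop rows with a NULL rate
df_dedup = df_clean.dropDuplicates()                        # drop duplicate rows

df_agg = (df_dedup.groupBy("State")
                  .agg(avg(col("Unemployment Rate").cast("double"))
                       .alias("Avg_Unemployment_Rate")))    # aggregation per state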
Data Transformation and Manipulation

We will use the functions below and understand their usage:

1. withColumn()
2. distinct()
3. drop()
4. withColumnRenamed()

Author: Shanmukh Sattiraju


Data Transformation and Manipulation

Before transformation: Line Number, Year, Month, State, Labor Force, Employed, Unemployed, Unemployment Rate, Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy

After transformation: Line_Number, Year, Month, State, Labor Force, Employed, Unemployed, Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy, UnEmployed Rate Percentage

Transformations:
• Add column: withColumn()
• Update column: withColumn()
• Update column based on a condition: withColumn(when...)
• Delete column: drop()
• Rename column: withColumnRenamed()
Author: Shanmukh Sattiraju
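A hedged sketch of these transformations on df (the rate-percentage formula and the "Unknown" default are assumptions made for illustration):

from pyspark.sql.functions import col, when

df2 = (df.withColumnRenamed("Line Number", "Line_Number")                   # rename column
         .withColumn("UnEmployed Rate Percentage",
                     col("Unemployment Rate").cast("double") * 100)         # add column
         .withColumn("Data Accuracy",
                     when(col("Data Accuracy").isNull(), "Unknown")
                     .otherwise(col("Data Accuracy")))                      # conditional update
         .drop("Unemployment Rate"))                                        # delete column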


MSSparkUtils
• Microsoft Spark Utilities (MSSparkUtils) is a built-in package that helps you easily perform common tasks.
• In Databricks, the equivalent is dbutils.
• You can use MSSparkUtils to work with file systems, get environment variables, chain notebooks together, and work with secrets.
• MSSparkUtils is available in PySpark (Python), Scala, .NET Spark (C#), and R (preview) notebooks and in Synapse pipelines.
• Using its file system utilities, we can work with storage much like a local file system.
Author: Shanmukh Sattiraju
MSSpark Utilities
This module provides various utilities for users to interact with the
rest of Synapse notebook.

• fs: Utility for filesystem operations in Synapse


• notebook: Utility for notebook operations (e.g, chaining
Synapse notebooks together)
• credentials: Utility for obtaining credentials (tokens and keys)
for Synapse resources
• env: Utility for obtaining environment metadata (e.g, userName,
clusterId etc)
Author: Shanmukh Sattiraju
Env utilities

• getUserName(): returns user name

• getUserId(): returns unique user id

• getJobId(): returns job id

• getWorkspaceName(): returns workspace name

• getPoolName(): returns Spark pool name

• getClusterId(): returns cluster id


Author: Shanmukh Sattiraju
Filesystem Utilities
• cp -> Copies a file or directory, possibly across file systems
• mv -> Moves a file or directory, possibly across file systems
• ls -> Lists the contents of a directory
• mkdirs -> Creates the given directory if it does not exist, also creating any necessary parent directories
• put -> Writes the given String out to a file, encoded in UTF-8
• head -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
• append -> Appends the content to a file
• rm -> Removes a file or directory
• exists -> Checks if a file or directory exists
• mount -> Mounts the given remote storage directory at the given mount point
• unmount -> Deletes a mount point
• mounts -> Shows information about what is mounted
• getMountPath -> Gets the local path of the mount point
Author: Shanmukh Sattiraju
Mounting Storage

jobId = mssparkutils.env.getJobId()

df = spark.read.format('csv')\
    .option('header', 'true')\
    .load('synfs:/' + jobId + '/lake/transformed/nulls.csv')

[Diagram: the notebook reads through the mount point /lake, which is attached to the Azure Data Lake container "refined", folder "transformed", file nulls.csv.]
Author: Shanmukh Sattiraju
Mounting storage to Spark pool
Mounting = attaching

Once we attach our storage to a mount point, we can access the storage account without using the full path name.

Syntax to mount (using linked service authentication - recommended):

mssparkutils.fs.mount(
    "<full_path>",
    "<mount_point_name>",
    {"LinkedService": "<linked_service_name>"}
)
Author: Shanmukh Sattiraju


To increase quota

Workspace level
• Every Azure Synapse workspace comes with a default quota of vCores that can be used for Spark. The quota is split between the user quota and the dataflow quota so that neither usage pattern uses up all the vCores in the workspace. The quota differs depending on the type of your subscription but is symmetrical between user and dataflow. However, if you request more vCores than are remaining in the workspace, you'll get the following error:

Failed to start session: [User] MAXIMUM_WORKSPACE_CAPACITY_EXCEEDED
Your Spark job requested 12 vCores.
However, the workspace only has xxx vCores available out of quota of yyy vCores.
Try reducing the numbers of vCores requested or increasing your vCore quota. Click here for more information - https://go.microsoft.com/fwlink/?linkid=213499

Author: Shanmukh Sattiraju


• To request an increase in workspace vCore quota:
• Select "Azure Synapse Analytics" as the service type.
• In the Quota details window, select Apache Spark (vCore) per workspace.

Author: Shanmukh Sattiraju


Notebook utilities

• exit -> This method lets you exit a notebook with a value.
• run -> This method runs a notebook and returns its exit value.

• Similar to the run() method, there is also the %run <notebook> magic command.

• Using %run followed by a notebook name also runs that notebook from another notebook.

Author: Shanmukh Sattiraju


Magic commands

• You can use multiple


languages in one notebook
• You need to specify
language magic command at
the beginning of a cell.
• By default, the entire
notebook will work on the
language that you choose at
the top

Author: Shanmukh Sattiraju


Access mount point from another notebook
• The purpose of mounting is to reduce our development effort
• Ideally, once we create the mount point in Synapse, it should be accessible from other notebooks

Notebook 1:
mssparkutils.fs.mount(
    "abfss://<container>@<storage_account>.dfs.core.windows.net/",
    "/lake",
    {"LinkedService": "synapse1121-WorkspaceDefaultStorage"}
)

Notebook 2:
mssparkutils.fs.mounts()
Returns an empty array
Author: Shanmukh Sattiraju


SQL Temp Views
• You cannot reference data or variables directly across different languages in a Synapse notebook.
• In Spark, a temporary view can be referenced across languages.
• The lifetime of this temporary view is tied to the SparkSession.
• A SQL script only works on a table or a view, not on a DataFrame.

df = spark.read.format('csv')\
    .option('header', 'true')\
    .load(<path>)

%%sql
SELECT * FROM df          -- does not work: df is a DataFrame, not a view

df.createTempView('<ViewName>')

%%sql
SELECT * FROM ViewName    -- works: ViewName is a temporary view

Author: Shanmukh Sattiraju


Temporary Views
All temporary views are active for the session only.

• createTempView()
  • Throws an error if another view is created with the same name in that session
• createOrReplaceTempView()
  • Used when you want to automate running your notebook and use the same name again

• There are 2 more views:
  • createGlobalTempView()
  • createOrReplaceGlobalTempView()

Global views make the view available for another notebook to access, provided both are attached to the same cluster. But Synapse Analytics is not like Databricks: we do not have a long-lived cluster, so global temp views do not behave the way they do in Databricks.
Author: Shanmukh Sattiraju
Workspace data
• Lake databases
  • You can define tables on top of lake data using Apache Spark notebooks
  • You can query these tables using the T-SQL (Transact-SQL) language through the serverless SQL pool

• SQL databases
  • You can define your own databases and tables directly using the serverless SQL pool
  • You can use T-SQL CREATE DATABASE and CREATE EXTERNAL TABLE to define the objects

Author: Shanmukh Sattiraju


Spark Managed vs External Tables
• Managed Tables
• These can be defined without a specified location
• The data files are stored within the storage used by the metastore
• Dropping the table not only removes its metadata from the catalog, but also
deletes the folder in which its data files are stored.

• External Tables
• These can be defined for a custom file location, where the data for the
table is stored.
• The metadata for the table is defined in the Spark catalog.
• Dropping the table deletes the metadata from the catalog, but doesn't
affect the data files.

Author: Shanmukh Sattiraju


Metadata sharing

[Diagram: a lake database created from the Spark pool is shared with the serverless SQL pool as a SQL database through metadata replication; both sit on top of the same Azure Data Lake.]


Author: Shanmukh Sattiraju
Joins and combining data
We will use the functions below and understand their usage:
1. join()
   I. Inner join
   II. Left join
   III. Right join
   IV. Outer join
   V. Left semi join
   VI. Left anti join
   VII. Cross join
2. union()

Author: Shanmukh Sattiraju


Join Transformations

Before transformation: Line_Number, Year, Month, State, Labor Force, Employed, Unemployed, Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy, UnEmployed Rate Percentage

After transformation: the same columns plus Expected Salary Range in USD, joined in on Education Level.

Transformation: joining data with .join()
Author: Shanmukh Sattiraju
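A small sketch of the join, using a made-up salary lookup dataframe keyed on Education Level:

salary_lookup = spark.createDataFrame(
    [("Bachelors", "50000-70000"), ("Masters", "70000-100000")],   # hypothetical rows
    ["Education Level", "Expected Salary Range in USD"])

df_joined = df.join(salary_lookup, on="Education Level", how="inner")   # .join() on the shared column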
String Manipulation and Sorting

We will use the functions below and understand their usage:

1. replace()
2. split()
3. concat()
4. orderBy()
5. sort()

Author: Shanmukh Sattiraju


String Manipulation and Sorting

Before transformation: Line_Number, Year, Month, State, Labor Force, Employed, Unemployed, Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy, UnEmployed Rate Percentage, Expected Salary Range in USD

After transformation: Line_Number, Year, Month, State, Labor_Force, Employed, Unemployed, Industry, Gender, Education_Level, Date_Inserted, Aggregation_Level, Data_Accuracy, UnEmployed_Rate_Percentage, Min_Salary_USD, Max_Salary_USD

Transformations:
• Add underscores in column names: replace()
• Create 2 columns from Expected Salary Range in USD: split()
• Combine the Month and Year columns: concat()
• Sorting: orderBy() / sort()

Author: Shanmukh Sattiraju


Window functions

We are making use of below functions and understand their usage


1. row_number()
2. rank()
3. dense_rank()

Author: Shanmukh Sattiraju


Window Functions

Before transformation: Line_Number, Year, Month, State, Labor_Force, Employed, Unemployed, Industry, Gender, Education_Level, Date_Inserted, Aggregation_Level, Data_Accuracy, UnEmployed_Rate_Percentage, Min_Salary_USD, Max_Salary_USD

After transformation: the same columns plus dense_rank.

Transformation: assigning ranks based on the unemployment rate with .dense_rank(); we also cover how .row_number() and .rank() work.

Author: Shanmukh Sattiraju
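A minimal dense_rank() sketch; partitioning by State is an assumption made for illustration:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank

w = Window.partitionBy("State").orderBy(col("UnEmployed_Rate_Percentage").desc())
df_ranked = df3.withColumn("dense_rank", dense_rank().over(w))   # rank rows within each state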


Pivoting and conversions

We will use the functions below and understand their usage:

1. cast()
2. pivot()
3. stack()
4. to_date()

Author: Shanmukh Sattiraju


Schema definition and Management
StructType and StructField
• StructType & StructField classes are used to programmatically specify the schema to the
DataFrame
• StructType:
• Represents the schema or structure of a DataFrame.
• It is a collection of StructField objects.
• Defines the columns and their data types in a DataFrame.
• Created by passing a list of StructField objects.
• StructField:
• Represents a single field or column in a DataFrame schema.
• Defines the name, data type, and other attributes of a column.
• Used as elements within a StructType object.
• Syntax: StructField(name, datatype, nullable=True)
• name: Name or identifier of the column.
• dataType: Data type of the column.
• nullable: Specifies whether the column can contain null values.
Author: Shanmukh Sattiraju
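A hedged sketch of enforcing a schema with StructType/StructField on a subset of the dataset's columns (the data types chosen here are assumptions):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("Year", IntegerType(), nullable=True),
    StructField("Month", StringType(), nullable=True),
    StructField("State", StringType(), nullable=True),
    StructField("Unemployment Rate", DoubleType(), nullable=True),
])

df = (spark.read.format("csv")
      .option("header", "true")
      .schema(schema)                    # enforce the schema instead of using inferSchema
      .load("abfss://raw@<storage_account>.dfs.core.windows.net/Unemployment.csv"))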
UDFs
• In PySpark, UDF stands for User-Defined Function.
• UDFs allow you to define custom functions that operate on Spark DataFrames or RDDs (Resilient Distributed Datasets).
• These functions can be used to perform complex computations.
• They are useful when the transformations you need are not available through built-in Spark functions.

Why UDFs?
• You cannot directly use plain Python functions on DataFrames or in Spark SQL.
• To use custom functions on DataFrames or in Spark SQL, you need UDFs.
Author: Shanmukh Sattiraju
Methods to create UDF
Method 1:

1. Create a function in Python syntax


2. Register that function as udf() to use it on dataframe or Spark SQL

Method 2:

Create a function in a Python syntax and wrap it with UDF Annotation

Author: Shanmukh Sattiraju


Steps to create UDF – Method 1
1. Define a Python function using def

Syntax:
def <function_name>(<args>):
    <function_definition>
    return <return_value>

2. Register the function as a UDF

Syntax (to use on a DataFrame):

from pyspark.sql.functions import udf
my_udf = udf(<function_name>, <returnType>)

Syntax (to use in Spark SQL):

spark.udf.register("<UDF_name>", <function_name>)

Author: Shanmukh Sattiraju


Steps to create UDF – Method 2

1. Use the @udf annotation to wrap the function for applying it on a DataFrame

Syntax:
@udf(<returnType>)
def <function_name>(<args>):
    <function_definition>
    return <return_value>

Author: Shanmukh Sattiraju
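A small sketch of both methods with a made-up helper (state_code is purely illustrative, not a course function):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Method 1: define a Python function, then register it as a UDF
def state_code(state):
    return state[:2].upper() if state else None

state_code_udf = udf(state_code, StringType())                   # for dataframes
spark.udf.register("state_code_sql", state_code, StringType())   # for Spark SQL

df_with_code = df.withColumn("State_Code", state_code_udf(df["State"]))

# Method 2: wrap the function with the @udf annotation
@udf(StringType())
def state_code2(state):
    return state[:2].upper() if state else None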


Dedicated SQL Pool
• Previously known as SQL Data Warehouse.
• This stores data in a relational table with Columnar storage
• Dedicated SQL pool is just a traditional Data Warehouse with MPP
architecture
• You will have an internal storage specific to Dedicated SQL Pool
• The size of the Dedicated SQL pool depends on DWU (Data Warehousing
Units) that we choose while creating it.

Author: Shanmukh Sattiraju


Project Architecture

Raw container (Unemployment.csv) → Serverless SQL pool → Refined container → Spark pool → Processed container → Dedicated SQL pool → Reporting

Author: Shanmukh Sattiraju
Synapse Dedicated SQL Architecture – MPP

[Diagram: a control node distributes queries across multiple compute nodes; the Data Movement Service (DMS) on each node moves data between nodes as needed.]
Author: Shanmukh Sattiraju


Performance Level

Author: Shanmukh Sattiraju


For DW1000c

[Diagram: DW1000c has 2 compute nodes. The 60 distributions in Azure Storage are split evenly across them: Node 1 gets distributions D1-D30 and Node 2 gets D31-D60.]
Author: Shanmukh Sattiraju


Scaling Compute with DWU

Author: Shanmukh Sattiraju


DW1500c

[Diagram: DW1500c has 3 compute nodes, each with its own DMS, and each handles 20 of the 60 distributions.]
Author: Shanmukh Sattiraju


When to consider a Dedicated SQL Pool?

• When the data size is more than 1 TB

• When you have more than a billion rows

• When you need high concurrency

• When you have predictable workloads

Author: Shanmukh Sattiraju


Copying data into Dedicated SQL pool

• You can copy data into a dedicated SQL pool in multiple ways.

• For now, let's look at the following ways:

  • Using the COPY command

  • Using the bulk load feature

  • Using a pipeline to copy data

Author: Shanmukh Sattiraju


Using the COPY command

1. Create a table:

CREATE TABLE [schema].[TableName]
(
    <Column_Name> <DataType>,
    <Column_Name> <DataType>,
    <Column_Name> <DataType>
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

2. Copy data into the table:

COPY INTO [schema].[TableName]
FROM '<https://ExternalFilePath>'
WITH
(
    FILE_TYPE = 'parquet'
);

Author: Shanmukh Sattiraju


Clustered columnstore index

[Diagram: with row storage, complete rows are stored together (e.g. 101,2013,Male,Vijay; 102,2013,Male,Stark; 103,2013,Female,Andrea; 104,2014,Male,Steve). With a clustered columnstore index, values are stored column by column (IDs 101-105, years 2013-2016, genders, names).]

Author: Shanmukh Sattiraju


Sharding Pattern

These sharding patterns are:

•Hash

•Round Robin

•Replicate

60 distributions

Author: Shanmukh Sattiraju


Round Robin Distribution
CREATE TABLE StudentDetails
WITH (DISTRIBUTION = ROUND_ROBIN)
AS ...

[Diagram: the rows of StudentDetails (101 Networking, 102 Linux, 103 Java, 101 Azure) are spread evenly across the distributions Dist_1 to Dist_4 with no logic tying a row to a distribution.]

Author: Shanmukh Sattiraju


Hash Distribution
CREATE TABLE StudentDetails
WITH (DISTRIBUTION = HASH(StudentID))
AS ...

[Diagram: rows are assigned to distributions by hashing the StudentID column, so both rows with StudentID 101 (Networking and Azure) land in the same distribution.]

Author: Shanmukh Sattiraju
Replicated Distribution
CREATE TABLE StudentDetails
WITH (DISTRIBUTION = REPLICATE)
AS ...

[Diagram: a full copy of the StudentDetails table (101 Networking, 102 Linux, 103 Java, 101 Azure) is kept on every distribution (Dist_1, Dist_2, Dist_3, ...).]

Author: Shanmukh Sattiraju
Sharding Patterns

Round Robin: distributes rows randomly and evenly across nodes, with no logic on how data is distributed. Performance is not optimized. Best for staging tables.

Hash: rows are distributed across nodes based on the hash column that we define (1 node = 1 hash value). Maximum query performance. Best for fact tables.

Replicate: keeps a copy of the entire table in every node (60 distributions make 60 copies). Good performance when used for small tables. Best for dimension tables.

Author: Shanmukh Sattiraju


Reporting data with Power BI

Author: Shanmukh Sattiraju


Project Architecture

Raw container (Unemployment.csv) → Serverless SQL pool → Refined container → Spark pool → Processed container → Dedicated SQL pool → Reporting

Author: Shanmukh Sattiraju
Spark Optimization Techniques

Author: Shanmukh Sattiraju


Spark can be optimized at 2 levels
1. Spark pool Optimization (For Synapse)
• Choosing right node size
• Number of vCores and Memory
• Auto-scaling enabled
• Number of nodes

2. Application or Code Level Optimization


• Writing code to make efficient use of available
resources
Author: Shanmukh Sattiraju
Avoid using collect()

[Diagram: collect() pulls the full dataset from every worker node back to the driver, which can run the driver out of memory.]

Author: Shanmukh Sattiraju


Instead use take()

[Diagram: take(n) returns only the first n elements from the worker nodes to the driver.]

Author: Shanmukh Sattiraju


Avoid using inferSchema

df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .load('<storage_path>')

Using inferSchema will:
• invoke an extra Spark job that reads all the columns
• take a lot of time to load because of that
• not always provide accurate data types (e.g. date columns)
• not be recommended for production notebooks

Best practice:
• Use StructType/StructField to enforce the schema on the columns
Author: Shanmukh Sattiraju
Data Serialization

[Diagram: when data is transferred between nodes over the network, the in-memory objects are serialized into a byte stream (0011010110101) on the sending node and de-serialized back into objects on the receiving node.]

Author: Shanmukh Sattiraju


Cache and persist

• They allow you to store intermediate data in memory
• The stored data is reused in subsequent actions, which can significantly improve the performance of your Spark applications
• Both caching and persisting are used to save intermediate results
• cache() saves data only in memory
• persist() can save data at multiple storage levels (covered shortly)

Author: Shanmukh Sattiraju


How cache() and persist() work

Without cache() or persist():

df = spark.read.format('csv')\
    .option('header','true')\
    .load('abfss://raw@da..')

df_transform = df.withColumn(...)

df_dropped = df_transform.drop(...)

df_converted = df_dropped.withColumn(...)

Every action on df_converted (select(), filter(), orderBy(), groupBy()) re-computes the whole chain from the source.

Author: Shanmukh Sattiraju


How cache() and persist() work

With cache() or persist():

df = spark.read.format('csv')\
    .option('header','true')\
    .load('abfss://raw@da..')

df_transform = df.withColumn(...)

df_dropped = df_transform.drop(...)

df_converted = df_dropped.withColumn(...)

df_converted.cache()

The result of the chain is stored in MEMORY, and subsequent actions (select(), filter(), orderBy(), groupBy()) reuse it instead of re-computing.

Author: Shanmukh Sattiraju


How cache() and persist() work

With cache() or persist():

The data is initially stored in memory and reused for subsequent actions.

Why subsequent actions?

1st action: the result is computed once and stored in memory
2nd action: instead of re-computing, it is retrieved from memory
...
6th action: instead of re-computing, it is retrieved from memory

Author: Shanmukh Sattiraju


cache() vs persist()
• cache() stores the data at the MEMORY_ONLY level
• persist() can store data at different persistence (storage) levels

Here, persistence level means where (memory / disk) and how (serialized or de-serialized) the data is stored.

The various persistence levels are:
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER (Java, Scala)
• MEMORY_AND_DISK_SER (Java, Scala)
• DISK_ONLY
• OFF_HEAP

Usage:
df.persist(StorageLevel.MEMORY_ONLY)
df.persist(StorageLevel.MEMORY_AND_DISK)
...
df.persist(StorageLevel.OFF_HEAP)

Author: Shanmukh Sattiraju


As per Spark documentation

In Python, stored objects will always be serialized with the Pickle library, so it does not
matter whether you choose a serialized level.

The available storage levels in Python (PySpark) include


• MEMORY_ONLY
• MEMORY_ONLY_2
• MEMORY_AND_DISK
• MEMORY_AND_DISK_2
• DISK_ONLY
• DISK_ONLY_2
• DISK_ONLY_3.

Author: Shanmukh Sattiraju


Persistence Levels
MEMORY_ONLY
• In memory, data is stored as de-serialized objects (Java / Scala)
• In memory, data is stored as serialized objects (PySpark)
• Any excess data that doesn't fit into memory is re-computed
• MEMORY_ONLY is functionally the same as using cache()
• Using cache() from PySpark stores the data in de-serialized format
• Usage: df.persist(StorageLevel.MEMORY_ONLY)
• Best suited use case: interactive data exploration

[Diagram: data that fits stays in MEMORY; excess data is re-computed when needed.]
Author: Shanmukh Sattiraju
Understanding StorageLevel

StorageLevel(<useDisk>, <useMemory>, <useOffHeap>, <deserialized>, <replication>)

For MEMORY_ONLY:

StorageLevel(false, true, false, false, 1)
            useDisk, useMemory, useOffHeap, deserialized, replication

Author: Shanmukh Sattiraju


Persistence Levels
MEMORY_AND_DISK
• First the data is stored in memory as:
  • de-serialized objects (Java or Scala)
  • serialized objects (PySpark)
• Any excess data that doesn't fit into memory is sent to disk (i.e. storage)
• Usage: df.persist(StorageLevel.MEMORY_AND_DISK)
• Best suited use case: machine learning training

[Diagram: data that fits stays in MEMORY; excess data spills to DISK.]
Author: Shanmukh Sattiraju
Persistence Levels
MEMORY_ONLY_SER
• Similar to MEMORY_ONLY in PySpark
• Data is stored as serialized objects (Java or Scala)
• This level is not separate in PySpark because PySpark data is already serialized
• Any excess data that doesn't fit into memory is re-computed
• Usage: df.persist(StorageLevel.MEMORY_ONLY_SER)
• Best suited use case: serialize data for memory optimization

[Diagram: serialized data (001011010011101) in MEMORY; excess data is re-computed.]
Author: Shanmukh Sattiraju
Persistence Levels
MEMORY_AND_DISK_SER
• Similar to MEMORY_AND_DISK in PySpark
• Data is stored as serialized objects (Java or Scala)
• This level is not separate in PySpark because PySpark data is already serialized
• Any excess data that doesn't fit into memory is sent to disk (storage)
• Usage: df.persist(StorageLevel.MEMORY_AND_DISK_SER)
• Best suited use case: when you want to reduce memory usage by spilling data to disk

[Diagram: serialized data (001011010011101) in MEMORY; excess data spills to DISK.]
Author: Shanmukh Sattiraju
Persistence Levels
DISK_ONLY
• Stores data only on disk
• Serialized objects in both Scala and PySpark
• Usage: df.persist(StorageLevel.DISK_ONLY)
• Best suited use case: large datasets that don't fit into memory

[Diagram: DATA stored on DISK.]

Author: Shanmukh Sattiraju


Persistence Levels
OFF_HEAP
• Stores data only in off-heap memory
• Usage: df.persist(StorageLevel.OFF_HEAP)
• Best suited use case: off-heap storage for extremely large datasets

[Diagram: DATA stored in OFF-HEAP MEMORY.]

Author: Shanmukh Sattiraju


Remaining Persistent Levels of PySpark

• MEMORY_ONLY_2
• MEMORY_AND_DISK_2
• DISK_ONLY_2
• DISK_ONLY_3.

Author: Shanmukh Sattiraju


Partitioning
• Partitioning is a way to split data into separate folders based on one or multiple columns.
• Each partition is saved into a separate folder.
• It optimizes queries by skipping the parts of the data that are not required.

[Example: a table with Year, Month and Unemployed columns (2012 Jan 211741; 2013 Jan 451751; 2014 Jan 51652; 2015 Jan 5174; 2016 Jan 21657; 2017 Jan 45868; 2018 Jan 87474) partitioned on the Year column produces one folder per year: 2012, 2013, 2014, 2015, 2016, 2017, 2018.]
Author: Shanmukh Sattiraju
Partitioning
Which column to choose for partitioning?

• A column that is used frequently in filtering
• Number of distinct values in the partition column = number of partitions = number of folders
• A column that has few distinct values (low cardinality)

Which column to avoid for partitioning?

• A column that has many distinct values
• This creates too many partitions and makes querying less efficient

Author: Shanmukh Sattiraju
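A hedged write example that partitions by the low-cardinality Year column (the output path is a placeholder):

(df.write
   .partitionBy("Year")          # one folder per distinct Year value
   .mode("overwrite")
   .parquet("abfss://processed@<storage_account>.dfs.core.windows.net/unemployment_by_year/"))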


Repartition and coalesce

Repartition()
• Repartition is a transformation API that can be used to
increase or decrease the number of partitions in a
dataframe/RDD.

Coalesce()
• Coalesce is a transformation API that can be used to decrease
the number of partitions in a dataframe/RDD.

Author: Shanmukh Sattiraju


Repartition
Wide transformation

[Diagram: repartition() fully shuffles the data: the original partitions [1,2,3,4], [5,6,7,8,9], [10,11,12,13,14], [15,16,17,18] are redistributed into new partitions such as [1,10,15,5,16,7], [3,8,11,14,17,4], [2,6,9,13,18,12].]

Author: Shanmukh Sattiraju

coalesce
Narrow transformation

[Diagram: coalesce() merges existing partitions without a full shuffle: [1,2,3,4] and [5,6,7,8,9] become [1,2,3,4,5,6,7,8,9]; [10,11,12,13,14] and [15,16,17,18] become [10,11,12,13,14,15,16,17,18].]

Author: Shanmukh Sattiraju
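A minimal sketch contrasting the two calls (the partition counts are arbitrary):

print(df.rdd.getNumPartitions())   # current number of partitions

df_repart = df.repartition(6)      # wide: full shuffle, can increase or decrease partitions
df_coal = df_repart.coalesce(2)    # narrow: only decreases partitions, avoids a full shuffle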


Repartition vs Coalesce

Purpose: repartition() reduces or increases the number of partitions by performing a full shuffle; coalesce() only reduces the number of partitions and avoids shuffling.

Shuffle: repartition() performs a full shuffle, which can be an expensive operation; coalesce() does not perform a full shuffle.

Number of partitions: repartition() can increase or decrease the number of partitions in a DataFrame; coalesce() only decreases it.

Data movement: repartition() moves data across the network to create the new partitioning scheme; coalesce() tries to minimize data movement and avoid shuffling whenever possible.

Performance: repartition() is generally slower because of the full shuffle; coalesce() is generally faster since it avoids shuffling whenever possible.
Author: Shanmukh Sattiraju
Broadcast variables

CA - California
NY - New York

Author: Shanmukh Sattiraju


Broadcast variables

Without broadcast:

[Diagram: the driver ships the variable val separately to every task on every worker node.]

With broadcast:

[Diagram: the driver sends the broadcast value once to each worker node, where it is cached and shared by all tasks on that node.]
Author: Shanmukh Sattiraju
Broadcast variables

• Read-only variables cached on each machine
• Access the value with .value
• Cached on each worker node in serialized form
• Useful when a dataset needs to be shared across all nodes
• Reduces data transfer

Author: Shanmukh Sattiraju
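A hedged sketch using the CA/NY lookup from the earlier slide; applying it through a UDF on a state-code column is an assumption made for illustration:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

state_lookup = {"CA": "California", "NY": "New York"}
bc_states = spark.sparkContext.broadcast(state_lookup)   # cached once per worker node

@udf(StringType())
def expand_state(code):
    return bc_states.value.get(code, code)               # read the broadcast value on the executor

df_full = df.withColumn("State_Full", expand_state(df["State"]))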


Serialization Types
• Java serialization
  ▪ Default serialization technique used by Spark
  ▪ Less performant than Kryo
  ▪ Not efficient in space utilization

• Kryo serialization
  ▪ Faster and more efficient
  ▪ Takes less time to convert an object to a byte stream, hence faster
  ▪ Since Spark 2.0, the framework has used Kryo for all internal shuffling of RDDs and DataFrames with simple types, arrays of simple types, and so on
  ▪ Spark also provides configurations to tune the Kryo serializer for our application's requirements

Author: Shanmukh Sattiraju


Delta Lake

Author: Shanmukh Sattiraju


Drawbacks of ADLS

ADLS != Database

[Diagram: a relational database gives you ACID guarantees (Atomicity, Consistency, Isolation, Durability); ADLS on its own does not.]
Author: Shanmukh Sattiraju
Drawbacks of ADLS

• No ACID properties
• Job failures lead to inconsistent data
• Simultaneous writes on same folder brings incorrect results
• No schema enforcement
• No support for updates
• No support for versioning

Author: Shanmukh Sattiraju


What is delta lake

• Open-source storage framework that brings reliability to data


lakes
• Brings transaction capabilities to data lakes
• Runs on top of your existing datalake and supports parquet
• Enables Lakehouse architecture

Author: Shanmukh Sattiraju


Lakehouse Architecture

A lakehouse combines the best elements of a data lake with the best elements of a data warehouse.

Evolution: data warehouse → modern data warehouse (uses a data lake) → lakehouse architecture.

Author: Shanmukh Sattiraju
How to create Delta Lake?

Instead of parquet...

dataframe.write\
    .format("parquet")\
    .save("/data/")

...replace with delta:

dataframe.write\
    .format("delta")\
    .save("/data/")

Author: Shanmukh Sattiraju


Delta format

Azure Data Lake


Storage

Parquet + Transaction Log

Author: Shanmukh Sattiraju


delta/
    _delta_log/
        0000.json        <- contains transaction information applied on the actual data
        0001.json
    <partition directory (if applied)>/
        file01.parquet   <- contains the actual data

Author: Shanmukh Sattiraju


Understanding Transaction log file (Delta Log)

• Contains records of every transaction performed on the delta


table

• Files under _delta_log will be stored in JSON format

• Single source of truth

Author: Shanmukh Sattiraju


Transaction log contents
JSON File = result of set of actions

• metadata – Table’s name, schema, partitioning ,etc


• Add – info of added file (with optional statistics)
• Remove – info of removed file
• Set Transaction – contains record of transaction id
• Change protocol – Contains the version that is used
• Commit info – Contains what operation was performed on this

Author: Shanmukh Sattiraju


Delta lake key features
• Open Source: Stored in form of parquet files in ADLS
• ACID Transactions: Ensures data quality
• Schema Enforcement : Restricts unexpected schema changes
• Schema Evolution: Accepts any required schema changes.
• Audit History: Logs all the change details happened on table
• Time Travel: Helps to get previous versions using version or
timestamp
• DML Operations: Enables us to use UPDATE, DELETE and MERGE
• Unified batch /Streaming: Follows same approach for batch and
streaming flows
Author: Shanmukh Sattiraju
Schema Enforcement

Loading new data Delta Table

WRITE

Author: Shanmukh Sattiraju


How does schema enforcement work?
Delta Lake validates the schema on writes.

Schema enforcement rules - the data being written:

1. Cannot contain any additional columns that are not present in the target table's schema
2. Cannot have column data types that differ from the column data types in the target table.

Author: Shanmukh Sattiraju


Schema Evolution

Loading new data Delta Table

WRITE

Author: Shanmukh Sattiraju
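A hedged sketch of opting in to schema evolution when appending a dataframe (new_df, hypothetical) whose schema has an extra column:

(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # accept the schema change instead of failing the write
    .save("/data/"))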


Audit Data Changes & Time Travel

• Delta automatically versions every operation that you perform

• You can time travel to historical versions

• This versioning makes it easy to audit data changes, roll back data in
case of accidental bad writes or deletes, and reproduce experiments
and reports.

Author: Shanmukh Sattiraju
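A minimal time-travel sketch reading an older version of the Delta table created above:

df_v1 = (spark.read.format("delta")
         .option("versionAsOf", 1)                 # or .option("timestampAsOf", "2024-01-01")
         .load("/data/"))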


Vacuum in Delta Lake

• VACUUM removes parquet files that are no longer referenced by the latest state in the transaction log
• It skips files that start with _ (underscore), which includes _delta_log
• It deletes files that are older than the retention threshold
• The default retention threshold is 7 days
• If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified data retention period.

Author: Shanmukh Sattiraju


Checkpoints in Delta Lake

[Diagram: each write adds another JSON commit file to _delta_log; after every 10 commits a checkpoint file such as 00000010.checkpoint.parquet is created.]

• Serves as a starting point for computing the table state
• Contains a replay of all actions performed so far
• Reduces the number of JSON files that must be read
• By default, a checkpoint is created every 10 commits

Author: Shanmukh Sattiraju


Optimize in Delta Lake

Operation      Parquet file           Log file    Rows   File state
CREATE TABLE   -                      000.json    -      -
WRITE          aabb.parquet           001.json    100    Active
WRITE          ccdd.parquet           002.json    101    Inactive
WRITE          eeff.parquet           003.json    102    Inactive
DELETE 101     gghh.parquet (empty)   004.json    -      Inactive
UPDATE 102     iijj.parquet           005.json    99     Active

Author: Shanmukh Sattiraju


UPSERT (Merge) in delta lake

• We can UPSERT (UPDATE + INSERT) data using MERGE command.


• If any matching rows found, it will update them
• If no matching rows found, this will insert that as new row

MERGE INTO <Destination_Table>


USING <Source_Table>
ON <Dest>.Col2 = <Source>.Col2
WHEN MATCHED
THEN UPDATE SET
<Dest>.Col1 = <Source>.Col1,
<Dest>.Col2 = <Source>.Col2
WHEN NOT MATCHED
THEN INSERT
VALUES(Source.Col1, Source.Col2)

Author: Shanmukh Sattiraju


End of the course

Author: Shanmukh Sattiraju
