Synapse Project Deck

This document describes an Azure Synapse Analytics course that provides over 18 hours of learning content on topics such as data transformation, Spark SQL, joins, and schema management. It discusses how Synapse Analytics integrates transactional and analytical systems by combining SQL pools, Spark pools, and data lakes to enable data ingestion, preparation, transformation, storage, and visualization in one solution.

Basics to Advanced: Azure Synapse Analytics

Hands-On Project

Author: Shanmukh Sattiraju


Azure Synapse Analytics

Author: Shanmukh Sattiraju


Pre-requisites
• No prior experience with Azure Synapse Analytics needed; we start from scratch
• Basic knowledge of Python
• Basic knowledge of SQL
• Basic Azure cloud knowledge would be a plus

Author: Shanmukh Sattiraju


What you’ll get from this course

• More than 18.5 hours of updated learning content
• 50+ of the most commonly used PySpark transformations
• 45+ PySpark notebooks
• Practical understanding of Delta Lake
• Understanding of Spark optimization techniques
• Lifetime access to this course
• Certificate of completion at the end of the course

Author: Shanmukh Sattiraju


1 - Selection and Filtering
2 - Handling Nulls, Duplicates and Aggregations
3 - Data Transformation and Manipulation
4 - MSSpark Utilities
5 - Spark SQL
6 - Join Transformation
7 - String Manipulation and Sorting
8 - Window Functions
9 - Conversions and Pivoting
10 - Schema Definition and Management
11 - User Defined Functions

Author: Shanmukh Sattiraju
Project Architecture

Ingestion → Transformation → Loading

Raw container (Unemployment.csv) → Serverless SQL pool → Refined container → Spark pool → Processed container → Dedicated SQL pool → Reporting

Author: Shanmukh Sattiraju
Along with Hands-on project:

Author: Shanmukh Sattiraju


Origin of Azure Synapse Analytics

Author: Shanmukh Sattiraju


Rise of Data Warehouse
• It all started with the need for a separate transactional system and a separate analytical system

Author: Shanmukh Sattiraju


Example of a Transactional System

[Diagram: users interact with a website, which issues selects, writes, and updates against the underlying storage.]

Author: Shanmukh Sattiraju


Performing Analysis on Data

[Diagram: data engineers, data analysts, and business users run analysis directly against the same storage.]

Author: Shanmukh Sattiraju
OLTP vs OLAP

[Diagram: data moves from the OLTP (Online Transactional Processing) system to the OLAP (Online Analytical Processing) system through ETL with data cleansing/transformation. The OLTP side can contain NULL values and columns that we don't need for analysis; the OLAP data warehouse contains structured and cleaned data.]

Author: Shanmukh Sattiraju
Summary
• OLTP (Online Transactional Processing) is suited for current data that requires high read and write throughput

• OLAP (Online Analytical Processing) contains all the historical data

• OLAP is dedicated to performing analytics on the data, which brings the need for a data warehouse

Author: Shanmukh Sattiraju


A typical Data Warehouse

[Diagram: sources such as SQL databases, other databases, CSV files and JSON files feed an ETL process (data cleansing/transformation) that loads the data warehouse. The data warehouse contains structured and cleaned data.]

Author: Shanmukh Sattiraju
Data Lake

[Diagram: sources such as SQL databases, other databases, CSV files, JSON files and image/video data are ingested (ETL) into the data lake, which can store structured, semi-structured and unstructured data.]

Author: Shanmukh Sattiraju
Modern Data Warehouse

[Diagram: INGEST on-premises, external/IoT and multi-cloud (AWS, GCP, Azure) data with Azure Data Factory → PREPARE with Azure Data Factory → TRANSFORM & ENRICH with Azure Databricks → SERVE with Azure SQL Data Warehouse → VISUALIZE. Everything is STORED in Azure Data Lake Storage Gen2.]

Author: Shanmukh Sattiraju
Problem with Modern Data Warehouse

[Same diagram as above: the pipeline is stitched together from several separate services (Azure Data Factory, Azure Databricks, Azure SQL Data Warehouse, Azure Data Lake Storage Gen2).]

Author: Shanmukh Sattiraju
The Solution – Azure Synapse Analytics

[Diagram: on-premises, external/IoT and multi-cloud (AWS, GCP, Azure) data flows into Azure Synapse Analytics, stored in Azure Data Lake Storage Gen2 and visualized from there.]

Author: Shanmukh Sattiraju
Components of Azure Synapse Analytics

Synapse Studio: Data Integration, Management, Monitoring, Security

Analytics Pools:
• SQL Pools: Serverless SQL Pool and Dedicated SQL Pool
• Apache Spark Pools: Spark Pool
• Data Explorer Pools: Data Explorer Pool

Store: Azure Data Lake Storage Gen2, fed by on-premises, external/IoT and AWS/GCP/Azure data, with visualization on top

Author: Shanmukh Sattiraju
Replacing the Modern Data Warehouse

[Diagram: the modern data warehouse pipeline again: INGEST (Azure Data Factory), PREPARE (Azure Data Factory), TRANSFORM & ENRICH (Azure Databricks), SERVE (Azure SQL Data Warehouse), VISUALIZE, with storage in Azure Data Lake Storage Gen2.]

Author: Shanmukh Sattiraju
Replacing the Modern Data Warehouse

INGEST: Synapse Pipelines
PREPARE: Synapse Serverless SQL Pool (or) Dedicated SQL Pool, or Synapse Spark
TRANSFORM & ENRICH: Synapse Serverless SQL Pool (or) Synapse Spark
SERVE: Synapse Dedicated SQL Pool (or) Serverless SQL Pool
VISUALIZE
STORE: Azure Data Lake Storage Gen2

Author: Shanmukh Sattiraju
Ingest: Synapse Pipelines, Data Flows

Storage: Azure SQL Database (DW), Azure Data Lake (primary), Spark Tables

Compute: Dedicated SQL Pool (SQL DW), Serverless SQL Pool, Spark Pool

Visualize

Manage / Security

Author: Shanmukh Sattiraju
Azure Synapse Analytics

Microsoft’s Definition:

Azure Synapse is a limitless analytics service that brings together


enterprise data warehousing and Big Data analytics. It gives you the
freedom to query data on your terms, using either serverless or
dedicated resources—at scale.

Author: Shanmukh Sattiraju


Ingest: Synapse Pipelines, Data Flows

Storage: Azure SQL Database (DW), Azure Data Lake (primary), Spark Tables

Compute: Dedicated SQL Pool (SQL DW), Serverless SQL Pool, Spark Pool

Visualize

Manage / Security

Author: Shanmukh Sattiraju
Environment Setup

Author: Shanmukh Sattiraju


Understanding dataset
Unemployment dataset

Author: Shanmukh Sattiraju


Serverless SQL Pool

Author: Shanmukh Sattiraju


Project Architecture

Raw container (Unemployment.csv) → Serverless SQL pool → Refined container → Spark pool → Processed container → Dedicated SQL pool → Reporting

Author: Shanmukh Sattiraju
On-demand Serverless SQL Pool

[Diagram: the Synapse serverless SQL pool runs T-SQL queries directly over data stored in Azure Data Lake Storage Gen2.]

Author: Shanmukh Sattiraju


Serverless SQL Pool – Architecture - DQP

Author: Shanmukh Sattiraju


Benefits of a Serverless SQL Pool
• You are charged based on how much data is processed by each query, which makes it well suited for data exploration
• No underlying infrastructure to manage
• You can use T-SQL queries to work with your data (the same as in the dedicated SQL pool)
• You cannot create regular tables in the serverless SQL pool because it does not manage any storage of its own
• You can only create external tables or views to work with.
Author: Shanmukh Sattiraju
Analysing data with Serverless SQL Pool

• Mostly used for data exploration


• T-SQL to query data
• Pricing

Author: Shanmukh Sattiraju


Querying data with Serverless SQL Pool

• The OPENROWSET() function is used to query data
• We can use OPENROWSET() with or without the DATA_SOURCE parameter
• OPENROWSET() is used after the FROM clause of a T-SQL query
• The OPENROWSET function is not supported in the dedicated SQL pool.

SELECT
*
FROM
OPENROWSET()
Author: Shanmukh Sattiraju
Mandatory parameters for OPENROWSET()
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK '<storage-path>',        -- mandatory
        FORMAT = 'CSV',               -- mandatory: 'CSV', 'PARQUET' or 'DELTA'
        PARSER_VERSION = '2.0'        -- mandatory for CSV: '1.0' or '2.0'
    ) AS [result]
Author: Shanmukh Sattiraju
URL formats for BULK parameter

External Data Source          Prefix     Storage account path
Azure Blob Storage            http[s]    <storage_account>.blob.core.windows.net/path/file
Azure Blob Storage            wasb[s]    <container>@<storage_account>.blob.core.windows.net/path/file
Azure Data Lake Store Gen2    http[s]    <storage_account>.dfs.core.windows.net/path/file
Azure Data Lake Store Gen2    abfs[s]    <container>@<storage_account>.dfs.core.windows.net/path/file

External Data Source                 Full path
Azure Blob Storage with https        https://<storage_account>.blob.core.windows.net/path/file
Azure Blob Storage with wasbs        wasbs://<container>@<storage_account>.blob.core.windows.net/path/file
Azure Data Lake Store Gen2 (https)   https://<storage_account>.dfs.core.windows.net/path/file
Azure Data Lake Store Gen2 (abfss)   abfss://<container>@<storage_account>.dfs.core.windows.net/path/file


Author: Shanmukh Sattiraju
Creating external data source

SELECT
    TOP 10 *
FROM
    OPENROWSET(
        BULK 'folder/file',
        DATA_SOURCE = '<name>',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 2
    ) AS [result]

[Diagram: the external data source points to the data in ADLS Gen2 ('abfss://<container>@<storage_account>.dfs.core.windows.net/'); the serverless SQL database stores only the metadata used by the SQL script.]

Author: Shanmukh Sattiraju


External Data Source, Credential

Sources can be Azure Data Lake, Oracle, MySQL, and so on.

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<pass>';

CREATE DATABASE SCOPED CREDENTIAL <cred>
WITH IDENTITY = '<type>';

CREATE EXTERNAL DATA SOURCE <source_name>
WITH (
    LOCATION = '<path>',
    CREDENTIAL = <cred>
);

Author: Shanmukh Sattiraju


Credential
• A database scoped credential is a record that contains the authentication information required to connect to a resource outside SQL Server
• Before creating a database scoped credential, the database must have a master key to protect the credential

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
CREATE DATABASE SCOPED CREDENTIAL <cred>
WITH IDENTITY = '<identity name>';
GO
CREATE EXTERNAL DATA SOURCE <s_name>
WITH (
    LOCATION = 'abfss://<container>@<storage_account>.dfs.core.windows.net/',
    CREDENTIAL = <cred>
);

[Diagram: the external data source points to the data in ADLS Gen2; the serverless SQL database stores only the metadata used by the SQL script.]

Author: Shanmukh Sattiraju


CREATE EXTERNAL FILE FORMAT

• It defines the file format of the data referenced by an EXTERNAL TABLE (discussed later)
• Creating an external file format is a prerequisite for creating an external table
• The data created through an EXTERNAL TABLE uses this EXTERNAL FILE FORMAT
• In short, the EXTERNAL FILE FORMAT specifies the actual layout of the data referenced by an external table.

Currently supported file formats:

• DELIMITEDTEXT
• PARQUET
• DELTA (applies only to serverless SQL pools)

Author: Shanmukh Sattiraju


CREATE EXTERNAL FILE FORMAT
-- Create an external file format for DELIMITED (CSV/TSV) files.
CREATE EXTERNAL FILE FORMAT file_format_name
WITH (
FORMAT_TYPE = DELIMITEDTEXT
[ , FORMAT_OPTIONS ( <format_options> [ ,...n ] ) ]
[ , DATA_COMPRESSION = {
'org.apache.hadoop.io.compress.GzipCodec'
}
]);

<format_options> ::=
{
FIELD_TERMINATOR = field_terminator
| STRING_DELIMITER = string_delimiter
| FIRST_ROW = integer -- ONLY AVAILABLE FOR AZURE SYNAPSE ANALYTICS
| DATE_FORMAT = datetime_format
| USE_TYPE_DEFAULT = { TRUE | FALSE }
| ENCODING = {'UTF8' | 'UTF16'}
| PARSER_VERSION = {'parser_version'}
}

Author: Shanmukh Sattiraju
CREATE EXTERNAL FILE FORMAT
--Create an external file format for PARQUET files.

CREATE EXTERNAL FILE FORMAT file_format_name


WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = {
'org.apache.hadoop.io.compress.SnappyCodec'
| 'org.apache.hadoop.io.compress.GzipCodec' }
);

-- Create an external file format for Delta table files (serverless SQL pools in Synapse Analytics and SQL Server 2022).

CREATE EXTERNAL FILE FORMAT file_format_name


WITH (
FORMAT_TYPE = DELTA
);

Author: Shanmukh Sattiraju


Create External Table As Select (CETAS)
CREATE EXTERNAL TABLE ext_table
WITH (
    LOCATION = 'test/extfile/',
    DATA_SOURCE = <data_source_name>,
    FILE_FORMAT = <file_format_name>
) AS
SELECT
    TOP 10 [data].Year, [data].State
FROM
    OPENROWSET(
        BULK 'abfss://<container>@<storage_account>.dfs.core.windows.net/Unemployment.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [data]
Author: Shanmukh Sattiraju
Initial Transformation

[Diagram: Unemployment.csv in the raw container is written as .parquet to the refined container, using an EXTERNAL DATA SOURCE pointing at the refined container, an EXTERNAL FILE FORMAT of .parquet, and an EXTERNAL TABLE.]

Author: Shanmukh Sattiraju


External Table

[Diagram: an external table combines an external data source and an external file format to reference data stored externally (Azure Data Lake, Oracle, MySQL); you query it with SELECT * FROM ExternalTable.]

Author: Shanmukh Sattiraju


CREATE EXTERNAL TABLE

• With Synapse SQL, you can use external tables to read external data using either the dedicated SQL pool or the serverless SQL pool.

• An external table is a table-like object in Azure Synapse that represents the structure and schema of data stored in external data sources.

• External tables act as a reference point to data that is stored externally in storage (Azure Data Lake Storage or Azure Blob Storage).

• This helps to control access to external data.

• When the external table is created with CETAS, the resulting data is also written to the storage location you choose.

• Options needed to create an external table:

  • LOCATION = where you want to store the data created by the external table

  • DATA_SOURCE = holds the path of the storage

  • FILE_FORMAT = format you want for the data created by the external table

Author: Shanmukh Sattiraju


Serverless SQL pool initial transformation

[Diagram: Unemployment.csv in the raw container becomes .parquet in the refined container.]

Author: Shanmukh Sattiraju


History and Data processing before Spark

Author: Shanmukh Sattiraju


Big data approach

[Diagram: a single computer with its own RAM and storage for data storage and processing (monolithic) versus a distributed approach: a cluster of multiple machines, each with its own RAM and storage, working in parallel.]
Author: Shanmukh Sattiraju
Hadoop Platform

[Diagram: a cluster with a master node running the cluster manager (YARN) and worker nodes providing distributed storage via HDFS (Hadoop Distributed File System); the distributed approach adds multiple machines to achieve parallel processing.]
Author: Shanmukh Sattiraju
Hadoop Platform

• Cluster Manager: YARN
• Distributed Storage: HDFS
• Distributed Computing: MapReduce

Author: Shanmukh Sattiraju


YARN – Yet Another Resource Negotiator

[Diagram: an application is submitted to the Resource Manager (RM) on the master node; each worker node runs a Node Manager (NM), and Application Masters (AM) run on selected worker nodes.]

Author: Shanmukh Sattiraju


HDFS – Distributed Storage

[Diagram: data to be stored or copied is split into blocks (1 to 10); the NameNode (NN) on the master node keeps the metadata repository of which blocks live where, and the blocks are distributed and replicated across the DataNodes (DN) on the worker nodes.]

Author: Shanmukh Sattiraju


Map/Reduce – Distributed Computing

Map Shuffle Reduce

Author: Shanmukh Sattiraju


MapReduce – Distributed Computing

[Word-count example: each mapper reads lines from HDFS distributed storage and emits a key/value pair per word (key = word, value = 1). Shuffling groups the values by key (e.g. Ironman → [1,1], Superman → [1,1,1]). The reducers sum the occurrences (Ironman – 2, Superman – 3, Batman – 2, Antman – 2, Spiderman – 2) and write the results back to HDFS. Every iteration of the process involves an HDFS read and an HDFS write.]
Author: Shanmukh Sattiraju
Emergence of Spark

Author: Shanmukh Sattiraju


Drawbacks of MapReduce
Traditional Hadoop MapReduce processing

[Diagram: HDFS read → iteration 1 (process data) → write to HDFS disk → HDFS read → iteration 2 (process data) → HDFS write. Every iteration goes through disk-based storage.]
Author: Shanmukh Sattiraju


Emergence of Spark

[Diagram: a single HDFS (or any cloud storage) read loads the data into RAM, and iterations 1, 2 and 3 of the analysis run in memory without writing back to disk between steps.]
Author: Shanmukh Sattiraju


Apache Spark

Apache Spark is an open source in-memory application framework for


distributed data processing and iterative analysis on massive data
volumes

In simple terms, Spark is a


• Compute Engine
• Unified data processing System

Author: Shanmukh Sattiraju


Spark Core Concepts

Author: Shanmukh Sattiraju


Apache Spark Ecosystem

Higher-level APIs: Spark SQL (interactive queries), Spark Streaming, Spark ML (MLlib), Spark Graph (graph computation) and SparkR (R on Spark), built on the DataFrame/Dataset APIs.

Spark Core: the Spark Core API (Scala, Java, Python, SQL, R) and the RDD (Resilient Distributed Dataset) APIs.

Spark Engine: the distributed compute engine.

Runs on a cluster or resource manager (YARN, Mesos, Standalone, Kubernetes) over distributed storage (Azure Storage, Amazon S3, GCP).
Author: Shanmukh Sattiraju
Limitations with Hadoop

Metric               | Hadoop                                                          | Apache Spark
Performance          | Depends on disks for read and write operations; slower disk I/O | In-memory processing; 10-100x faster than Hadoop
Development          | Need to develop Map and Reduce code, which is complex           | Use native SQL via Spark SQL and composable APIs
Language             | Java                                                            | Java, Scala, Python and R
Storage              | HDFS                                                            | HDFS and cloud storage (Azure Storage, Amazon S3, etc.)
Resource management  | YARN                                                            | YARN, Mesos, Standalone, Kubernetes
Data processing      | Batch processing                                                | Batch processing, streaming, machine learning

Author: Shanmukh Sattiraju


Apache Spark Architecture

[Diagram: the master node runs the driver program with the Spark Context, which talks to the cluster manager; each worker node runs an executor with a cache and its tasks.]
Author: Shanmukh Sattiraju


Benefits of Spark pool

• Speed and efficiency

• Ease of Creation

• Support for ADLS Gen2

• Scalability

Author: Shanmukh Sattiraju


RDD – Resilient Distributed Dataset

• RDD stands for Resilient Distributed Dataset.
• It is the fundamental data structure of Apache Spark.
• It is an immutable collection of objects.
• An RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

Let's break down RDD:

Resilient = it is fault tolerant
Distributed = data is spread across partitions in the cluster
Dataset = a set of data that has rows and columns
Author: Shanmukh Sattiraju
RDD - Overview

[Diagram: an RDD is split into partitions 1-6; the driver, through the cluster manager, assigns the partitions to worker nodes so they can be processed in parallel.]
Author: Shanmukh Sattiraju


lambda, map(), filter()

Lambda:
Anonymous functions (i.e. functions defined without a name).
Syntax: lambda <value>: <expression>
E.g.:
a = lambda x: x + 10
print(a(10))
Result is 20

Map:
Returns a new distributed dataset formed by passing each element of the source through <function>.
Syntax: map(<function>)
E.g.: if the RDD has values [1, 2, 3]
rdd.map(lambda x: x + 10)
Result: [11, 12, 13]
map() adds 10 to each value in the given data.

Filter:
Returns a new dataset formed by selecting those elements of the source on which <function> returns true.
Syntax: filter(<function>)
E.g.: if the RDD has values [11, 12, 13]
rdd.filter(lambda x: x % 2 == 0)
Result: [12]
filter() applies the condition to each element and keeps the element in the new dataset only if the condition is true.

Author: Shanmukh Sattiraju
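A minimal PySpark sketch tying the three together (the values are illustrative; sc is taken from the Spark session available in a Synapse notebook):

sc = spark.sparkContext                            # SparkContext from the notebook's Spark session
rdd = sc.parallelize([1, 2, 3, 11, 12, 13])        # create an RDD from a Python list
rdd_added = rdd.map(lambda x: x + 10)              # transformation: add 10 to each element
rdd_even = rdd_added.filter(lambda x: x % 2 == 0)  # transformation: keep even values only
print(rdd_even.collect())                          # action: bring the result back to the driver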
RDD Operations

Transformations
• Any operation that leads to a change in the form of the data is a transformation.
• Transformations take an RDD as input and produce one or many RDDs as output.
• After executing a transformation, the resulting RDD(s) can be smaller, bigger or the same size as the parent RDDs.
• Transformations are lazy: they are not executed immediately, only when an action is called (this is called lazy evaluation).
• Transformations don't change the input RDD, because RDDs are immutable.
• E.g.: filter(), map(), flatMap(), etc.

Actions
• Any operation that returns a value or data back to the driver program is an action.
• Actions bring the laziness of RDDs into motion.
• E.g.: count(), collect(), take()
Author: Shanmukh Sattiraju


RDD operations

[Diagram: rdd_list → map() → rdd_add → filter() → rdd_filter (transformations), then collect() (action) sends the output to the driver program.]
Author: Shanmukh Sattiraju


Lineage

Author: Shanmukh Sattiraju


RDD Lineage

>> rdd_list = sc.parallelize(list)

>> rdd_add = rdd_list.map(<func>)

>> rdd_filter = rdd_add.filter(<func>)

>> rdd_filter.collect()

RDD lineage: rdd_list (sc.parallelize) → rdd_add (map() transformation) → rdd_filter (filter() transformation). The collect() ACTION sends the result to the driver program.
Author: Shanmukh Sattiraju


Word count

[Diagram: textFile() reads Biography.txt ("Tony Stark is an Avenger.. Tony Stark is genius, Tony Stark is a Billionaire…") from Azure Data Lake; flatMap() splits each line into words; map() emits (word, 1) pairs; reduceByKey() groups and sums the counts per word (Tony → 3, Stark → 3, Is → 3, A → 3, Avenger → 2, …); collect() returns the result to the driver program. The intermediate RDDs are RDDRead, RDDMap, RDDPaired and RDDreduced.]

Author: Shanmukh Sattiraju
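A short PySpark sketch of this word count (the storage path is a placeholder, not the course's actual storage account):

sc = spark.sparkContext
rdd_read = sc.textFile("abfss://raw@<storage_account>.dfs.core.windows.net/Biography.txt")

rdd_paired = (rdd_read
              .flatMap(lambda line: line.split())       # split each line into words
              .map(lambda word: (word, 1)))             # emit (word, 1) pairs
rdd_reduced = rdd_paired.reduceByKey(lambda a, b: a + b)  # sum the counts per word

print(rdd_reduced.collect())                            # action: results returned to the driver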
ReduceByKey() vs GroupByKey()
reduceByKey()

[Diagram: a combiner first aggregates values within each partition (e.g. Tony → 2, Is → 2 per partition), and only the partial results are shuffled and merged into the final counts (Tony → 4, Is → 4, Awesome → 2, Superhero → 1, Ironman → 1).]
Author: Shanmukh Sattiraju


ReduceByKey() vs GroupByKey()
groupByKey()

[Diagram: every individual (word, 1) pair is shuffled across the network first (e.g. all the separate Tony,1 pairs) and only then aggregated into the final counts (Tony → 4, Is → 4, Awesome → 2, Superhero → 1), so much more data moves between nodes.]
Author: Shanmukh Sattiraju


reduceByKey() vs groupByKey()

reduceByKey()
• Wide transformation
• Data is combined/aggregated within each partition before getting shuffled
• Less shuffling, as data is already combined
• More efficient

groupByKey()
• Wide transformation
• Data is combined/aggregated only after shuffling
• More shuffling, as all the raw data needs to be collected
• Less efficient
Author: Shanmukh Sattiraju


Execution plan

[Diagram: program execution is broken into jobs, each job into stages, and each stage into tasks. E.g. Job 0 → Stage 1 (Tasks 1-2) and Stage 2 (Tasks 3-4); Job 1 → Stage 3 (Tasks 5-6); Job 2 → Stage 4 (Tasks 7-8).]
Author: Shanmukh Sattiraju
Jobs

Number of Jobs = Number of Actions
Author: Shanmukh Sattiraju


Transformations

Narrow Transformations
• All the elements required to compute the records in a single partition live in a single partition of the parent RDD.
• E.g. map(), filter()

Wide Transformations
• The elements required to compute the records in a single partition may live in many partitions of the parent RDD.
• E.g. groupByKey() and reduceByKey()
Author: Shanmukh Sattiraju


Stage 0 / Stage 1

[Diagram: the word-count pipeline again. The narrow transformations flatMap() and map() run in Stage 0; the wide transformation reduceByKey() requires a shuffle, which starts Stage 1 (RDDPaired → RDDreduced).]
Author: Shanmukh Sattiraju
Stages

Number of Stages = Number of wide transformations applied + 1

Author: Shanmukh Sattiraju

Tasks

Number of Tasks = Number of Partitions
Author: Shanmukh Sattiraju


Summary

Author: Shanmukh Sattiraju


DAG

Directed Acyclic Graph

[Diagram: nodes A, B and C connected by directed edges with no cycles; in Spark the DAG groups operations such as map into stages (e.g. Stage 0).]
Author: Shanmukh Sattiraju


RDD lineage vs DAG

RDD Lineage
• Formed when an RDD/DataFrame is created or after each transformation is applied
• Each RDD points to one or more parent RDDs, which forms the lineage
• It is the logical plan
• It is like a portion of the DAG

DAG
• Forms when an action is called
• It is the physical plan built by the DAG scheduler after an action is called
• It is like a combination of many RDDs and their transformations
Author: Shanmukh Sattiraju
[Diagram: the higher-level APIs (SQL APIs and DataFrame/Dataset APIs) go through the Catalyst optimizer, which produces optimized calls to the lower-level RDD APIs.]
Author: Shanmukh Sattiraju


DataFrames
• DataFrames are built on top of the Spark RDD APIs
• The DataFrame API is more efficient: it can optimize operations using the underlying Catalyst optimizer

To read a DataFrame: DataFrameReader

Supported formats:
JSON
CSV
PARQUET
AVRO
ORC
TEXT
Author: Shanmukh Sattiraju
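As a small illustration of DataFrameReader, reading the course's Unemployment.csv from the raw container (the storage account name is a placeholder):

df = (spark.read
          .format("csv")
          .option("header", "true")
          .load("abfss://raw@<storage_account>.dfs.core.windows.net/Unemployment.csv"))
df.show(5)    # preview the first rows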


PySpark Transformations

Author: Shanmukh Sattiraju


Project Architecture

Raw container (Unemployment.csv) → Serverless SQL pool → Refined container → Spark pool → Processed container → Dedicated SQL pool → Reporting

Author: Shanmukh Sattiraju
Selection and filtering

We will use the functions below and understand their usage:

1. display()
2. select()
3. selectExpr()
4. filter()
5. where()

Author: Shanmukh Sattiraju


Handling NULLs/missing values and grouping/aggregation

We will use the functions below and understand their usage:

1. fillna()
2. na.fill()
3. groupBy()
4. agg()
5. dropna()
6. na.drop()

Author: Shanmukh Sattiraju


Handling NULLs and aggregation

Columns: Line Number, Year, Month, State, Labor Force, Employed, Unemployed, Unemployment Rate, Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy

Transformations:
• Identify NULLs: filter()
• Replace NULLs: fillna() or na.fill()
• Drop NULLs: dropna() or na.drop()
• Drop duplicate rows: dropDuplicates()
• Aggregation: groupBy() / agg()
Author: Shanmukh Sattiraju
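A minimal sketch of these steps, assuming df is the dataframe read from Unemployment.csv; the replacement value and the chosen columns are illustrative only:

from pyspark.sql.functions import col, avg

df_filled = df.fillna({"Industry": "Unknown"})              # replace NULLs in one column
df_clean = df_filled.dropna(subset=["Unemployment Rate"])   # drop rows with a NULL rate
df_dedup = df_clean.dropDuplicates()                        # drop duplicate rows

df_agg = (df_dedup.groupBy("State")
                  .agg(avg(col("Unemployment Rate").cast("double"))
                       .alias("Avg_Unemployment_Rate")))    # aggregation per state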
Data Transformation and Manipulation

We will use the functions below and understand their usage:

1. withColumn()
2. distinct()
3. drop()
4. withColumnRenamed()

Author: Shanmukh Sattiraju


Data Transformation and Manipulation

Before transformation: Line Number, Year, Month, State, Labor Force, Employed, Unemployed, Unemployment Rate, Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy

After transformation: Line_Number, Year, Month, State, Labor Force, Employed, Unemployed, Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy, UnEmployed Rate Percentage

Transformations:
• Add column: withColumn()
• Update column: withColumn()
• Update column based on a condition: withColumn(when...)
• Delete column: drop()
• Rename column: withColumnRenamed()
Author: Shanmukh Sattiraju
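A hedged sketch of these transformations on df (the rate-percentage formula and the "Unknown" default are assumptions made for illustration):

from pyspark.sql.functions import col, when

df2 = (df.withColumnRenamed("Line Number", "Line_Number")                   # rename column
         .withColumn("UnEmployed Rate Percentage",
                     col("Unemployment Rate").cast("double") * 100)         # add column
         .withColumn("Data Accuracy",
                     when(col("Data Accuracy").isNull(), "Unknown")
                     .otherwise(col("Data Accuracy")))                      # conditional update
         .drop("Unemployment Rate"))                                        # delete column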


MSSparkUtils
• Microsoft Spark Utilities (MSSparkUtils) is a built-in package that helps you easily perform common tasks.
• In Databricks, the equivalent is dbutils.
• You can use MSSparkUtils to work with file systems, get environment variables, chain notebooks together, and work with secrets.
• MSSparkUtils is available in PySpark (Python), Scala, .NET Spark (C#), and R (preview) notebooks and in Synapse pipelines.
• Using its file system utilities, we can work with storage much like a local file system.
Author: Shanmukh Sattiraju
MSSpark Utilities
This module provides various utilities for users to interact with the
rest of Synapse notebook.

• fs: Utility for filesystem operations in Synapse


• notebook: Utility for notebook operations (e.g, chaining
Synapse notebooks together)
• credentials: Utility for obtaining credentials (tokens and keys)
for Synapse resources
• env: Utility for obtaining environment metadata (e.g, userName,
clusterId etc)
Author: Shanmukh Sattiraju
Env utilities

• getUserName(): returns user name

• getUserId(): returns unique user id

• getJobId(): returns job id

• getWorkspaceName(): returns workspace name

• getPoolName(): returns Spark pool name

• getClusterId(): returns cluster id


Author: Shanmukh Sattiraju
Filesystem Utilities
• cp -> Copies a file or directory, possibly across file systems
• mv -> Moves a file or directory, possibly across file systems
• ls -> Lists the contents of a directory
• mkdirs -> Creates the given directory if it does not exist, also creating any necessary parent directories
• put -> Writes the given String out to a file, encoded in UTF-8
• head -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
• append -> Appends the content to a file
• rm -> Removes a file or directory
• exists -> Checks if a file or directory exists
• mount -> Mounts the given remote storage directory at the given mount point
• unmount -> Deletes a mount point
• mounts -> Shows information about what is mounted
• getMountPath -> Gets the local path of the mount point
Author: Shanmukh Sattiraju
Mounting Storage

jobId = mssparkutils.env.getJobId()

df = spark.read.format('csv')\
    .option('header', 'true')\
    .load('synfs:/' + jobId + '/lake/transformed/nulls.csv')

[Diagram: the notebook reads through the mount point /lake, which is attached to the Azure Data Lake container "refined", folder "transformed", file nulls.csv.]
Author: Shanmukh Sattiraju
Mounting storage to Spark pool
Mounting = attaching

Once we attach our storage to a mount point, we can access the storage account without using the full path name.

Syntax to mount (using linked service authentication - recommended):

mssparkutils.fs.mount(
    "<full_path>",
    "<mount_point_name>",
    {"LinkedService": "<linked_service_name>"}
)
Author: Shanmukh Sattiraju


To increase quota

Workspace level
• Every Azure Synapse workspace comes with a default quota of vCores that can be used for Spark. The quota is split between the user quota and the dataflow quota so that neither usage pattern uses up all the vCores in the workspace. The quota differs depending on the type of your subscription but is symmetrical between user and dataflow. However, if you request more vCores than are remaining in the workspace, you'll get the following error:

Failed to start session: [User] MAXIMUM_WORKSPACE_CAPACITY_EXCEEDED
Your Spark job requested 12 vCores.
However, the workspace only has xxx vCores available out of quota of yyy vCores.
Try reducing the numbers of vCores requested or increasing your vCore quota. Click here for more information - https://go.microsoft.com/fwlink/?linkid=213499

Author: Shanmukh Sattiraju


• To request an increase in workspace vCore quota:
• Select "Azure Synapse Analytics" as the service type.
• In the Quota details window, select Apache Spark (vCore) per workspace.

Author: Shanmukh Sattiraju


Notebook utilities

• exit -> This method lets you exit a notebook with a value.
• run -> This method runs a notebook and returns its exit value.

• Similar to the run() method, there is also the %run <notebook> magic command.

• Using %run followed by a notebook name also runs that notebook from another notebook.

Author: Shanmukh Sattiraju


Magic commands

• You can use multiple


languages in one notebook
• You need to specify
language magic command at
the beginning of a cell.
• By default, the entire
notebook will work on the
language that you choose at
the top

Author: Shanmukh Sattiraju


Access mount point from another notebook
• The purpose of mounting is to reduce our development effort
• Ideally, once we create the mount point in Synapse, it should be accessible from other notebooks

Notebook 1:
mssparkutils.fs.mount(
    "abfss://<container>@<storage_account>.dfs.core.windows.net/",
    "/lake",
    {"LinkedService": "synapse1121-WorkspaceDefaultStorage"}
)

Notebook 2:
mssparkutils.fs.mounts()
Returns an empty array
Author: Shanmukh Sattiraju


SQL Temp Views
• You cannot reference data or variables directly across different languages in a Synapse notebook.
• In Spark, a temporary view can be referenced across languages.
• The lifetime of this temporary view is tied to the SparkSession.
• A SQL script only works on a table or a view, not on a DataFrame.

df = spark.read.format('csv')\
    .option('header', 'true')\
    .load(<path>)

%%sql
SELECT * FROM df          -- does not work: df is a DataFrame, not a view

df.createTempView('<ViewName>')

%%sql
SELECT * FROM ViewName    -- works: ViewName is a temporary view

Author: Shanmukh Sattiraju


Temporary Views
All temporary views are active for the session only.

• createTempView()
  • Throws an error if another view is created with the same name in that session
• createOrReplaceTempView()
  • Used when you want to automate running your notebook and use the same name again

• There are 2 more views:
  • createGlobalTempView()
  • createOrReplaceGlobalTempView()

Global views make the view available for another notebook to access, provided both are attached to the same cluster. But Synapse Analytics is not like Databricks: we do not have a long-lived cluster, so global temp views do not behave the way they do in Databricks.
Author: Shanmukh Sattiraju
Workspace data
• Lake databases
  • You can define tables on top of lake data using Apache Spark notebooks
  • You can query these tables using the T-SQL (Transact-SQL) language through the serverless SQL pool

• SQL databases
  • You can define your own databases and tables directly using the serverless SQL pool
  • You can use T-SQL CREATE DATABASE and CREATE EXTERNAL TABLE to define the objects

Author: Shanmukh Sattiraju


Spark Managed vs External Tables
• Managed Tables
• These can be defined without a specified location
• The data files are stored within the storage used by the metastore
• Dropping the table not only removes its metadata from the catalog, but also
deletes the folder in which its data files are stored.

• External Tables
• These can be defined for a custom file location, where the data for the
table is stored.
• The metadata for the table is defined in the Spark catalog.
• Dropping the table deletes the metadata from the catalog, but doesn't
affect the data files.

Author: Shanmukh Sattiraju


Metadata sharing

[Diagram: a lake database created from the Spark pool is shared with the serverless SQL pool as a SQL database through metadata replication; both sit on top of the same Azure Data Lake.]


Author: Shanmukh Sattiraju
Joins and combining data
We will use the functions below and understand their usage:
1. join()
   I. Inner join
   II. Left join
   III. Right join
   IV. Outer join
   V. Left semi join
   VI. Left anti join
   VII. Cross join
2. union()

Author: Shanmukh Sattiraju


Join Transformations

Before transformation: Line_Number, Year, Month, State, Labor Force, Employed, Unemployed, Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy, UnEmployed Rate Percentage

After transformation: the same columns plus Expected Salary Range in USD, joined in on Education Level.

Transformation: joining data with .join()
Author: Shanmukh Sattiraju
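A small sketch of the join, using a made-up salary lookup dataframe keyed on Education Level:

salary_lookup = spark.createDataFrame(
    [("Bachelors", "50000-70000"), ("Masters", "70000-100000")],   # hypothetical rows
    ["Education Level", "Expected Salary Range in USD"])

df_joined = df.join(salary_lookup, on="Education Level", how="inner")   # .join() on the shared column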
String Manipulation and Sorting

We will use the functions below and understand their usage:

1. replace()
2. split()
3. concat()
4. orderBy()
5. sort()

Author: Shanmukh Sattiraju


String Manipulation and Sorting

Before transformation: Line_Number, Year, Month, State, Labor Force, Employed, Unemployed, Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy, UnEmployed Rate Percentage, Expected Salary Range in USD

After transformation: Line_Number, Year, Month, State, Labor_Force, Employed, Unemployed, Industry, Gender, Education_Level, Date_Inserted, Aggregation_Level, Data_Accuracy, UnEmployed_Rate_Percentage, Min_Salary_USD, Max_Salary_USD

Transformations:
• Add underscores in column names: replace()
• Create 2 columns from Expected Salary Range in USD: split()
• Combine the Month and Year columns: concat()
• Sorting: orderBy() / sort()

Author: Shanmukh Sattiraju


Window functions

We are making use of below functions and understand their usage


1. row_number()
2. rank()
3. dense_rank()

Author: Shanmukh Sattiraju


Window Functions

Before transformation: Line_Number, Year, Month, State, Labor_Force, Employed, Unemployed, Industry, Gender, Education_Level, Date_Inserted, Aggregation_Level, Data_Accuracy, UnEmployed_Rate_Percentage, Min_Salary_USD, Max_Salary_USD

After transformation: the same columns plus dense_rank.

Transformation: assigning ranks based on the unemployment rate with .dense_rank(); we also cover how .row_number() and .rank() work.

Author: Shanmukh Sattiraju
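A minimal dense_rank() sketch; partitioning by State is an assumption made for illustration:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank

w = Window.partitionBy("State").orderBy(col("UnEmployed_Rate_Percentage").desc())
df_ranked = df3.withColumn("dense_rank", dense_rank().over(w))   # rank rows within each state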


Pivoting and conversions

We will use the functions below and understand their usage:

1. cast()
2. pivot()
3. stack()
4. to_date()

Author: Shanmukh Sattiraju


Schema definition and Management
StructType and StructField
• StructType & StructField classes are used to programmatically specify the schema to the
DataFrame
• StructType:
• Represents the schema or structure of a DataFrame.
• It is a collection of StructField objects.
• Defines the columns and their data types in a DataFrame.
• Created by passing a list of StructField objects.
• StructField:
• Represents a single field or column in a DataFrame schema.
• Defines the name, data type, and other attributes of a column.
• Used as elements within a StructType object.
• Syntax: StructField(name, datatype, nullable=True)
• name: Name or identifier of the column.
• dataType: Data type of the column.
• nullable: Specifies whether the column can contain null values.
Author: Shanmukh Sattiraju
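A hedged sketch of enforcing a schema with StructType/StructField on a subset of the dataset's columns (the data types chosen here are assumptions):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("Year", IntegerType(), nullable=True),
    StructField("Month", StringType(), nullable=True),
    StructField("State", StringType(), nullable=True),
    StructField("Unemployment Rate", DoubleType(), nullable=True),
])

df = (spark.read.format("csv")
      .option("header", "true")
      .schema(schema)                    # enforce the schema instead of using inferSchema
      .load("abfss://raw@<storage_account>.dfs.core.windows.net/Unemployment.csv"))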
UDFs
• In PySpark, UDF stands for User-Defined Function.
• UDFs allow you to define custom functions that operate on Spark DataFrames or RDDs (Resilient Distributed Datasets).
• These functions can be used to perform complex computations.
• They are useful when the transformations you need are not available through built-in Spark functions.

Why UDFs?
• You cannot directly use plain Python functions on DataFrames or in Spark SQL.
• To use custom functions on DataFrames or in Spark SQL, you need UDFs.
Author: Shanmukh Sattiraju
Methods to create UDF
Method 1:

1. Create a function in Python syntax


2. Register that function as udf() to use it on dataframe or Spark SQL

Method 2:

Create a function in a Python syntax and wrap it with UDF Annotation

Author: Shanmukh Sattiraju


Steps to create UDF – Method 1
1. Define a Python function using def

Syntax:
def <function_name>(<args>):
    <function_definition>
    return <return_value>

2. Register the function as a UDF

Syntax (to use on a DataFrame):

from pyspark.sql.functions import udf
my_udf = udf(<function_name>, <returnType>)

Syntax (to use in Spark SQL):

spark.udf.register("<UDF_name>", <function_name>)

Author: Shanmukh Sattiraju


Steps to create UDF – Method 2

1. Use the @udf annotation to wrap the function for applying it on a DataFrame

Syntax:
@udf(<returnType>)
def <function_name>(<args>):
    <function_definition>
    return <return_value>

Author: Shanmukh Sattiraju
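A small sketch of both methods with a made-up helper (state_code is purely illustrative, not a course function):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Method 1: define a Python function, then register it as a UDF
def state_code(state):
    return state[:2].upper() if state else None

state_code_udf = udf(state_code, StringType())                   # for dataframes
spark.udf.register("state_code_sql", state_code, StringType())   # for Spark SQL

df_with_code = df.withColumn("State_Code", state_code_udf(df["State"]))

# Method 2: wrap the function with the @udf annotation
@udf(StringType())
def state_code2(state):
    return state[:2].upper() if state else None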


Dedicated SQL Pool
• Previously known as SQL Data Warehouse.
• This stores data in a relational table with Columnar storage
• Dedicated SQL pool is just a traditional Data Warehouse with MPP
architecture
• You will have an internal storage specific to Dedicated SQL Pool
• The size of the Dedicated SQL pool depends on DWU (Data Warehousing
Units) that we choose while creating it.

Author: Shanmukh Sattiraju


Project Architecture

Raw container (Unemployment.csv) → Serverless SQL pool → Refined container → Spark pool → Processed container → Dedicated SQL pool → Reporting

Author: Shanmukh Sattiraju
Synapse Dedicated SQL Architecture – MPP

[Diagram: a control node distributes queries across multiple compute nodes; the Data Movement Service (DMS) on each node moves data between nodes as needed.]
Author: Shanmukh Sattiraju


Performance Level

Author: Shanmukh Sattiraju


For DW1000c

[Diagram: DW1000c has 2 compute nodes. The 60 distributions in Azure Storage are split evenly across them: Node 1 gets distributions D1-D30 and Node 2 gets D31-D60.]
Author: Shanmukh Sattiraju


Scaling Compute with DWU

Author: Shanmukh Sattiraju


DW1500c

[Diagram: DW1500c has 3 compute nodes, each with its own DMS, and each handles 20 of the 60 distributions.]
Author: Shanmukh Sattiraju


When to consider a Dedicated SQL Pool?

• When the data size is more than 1 TB

• When you have more than a billion rows

• When you need high concurrency

• When you have predictable workloads

Author: Shanmukh Sattiraju


Copying data into Dedicated SQL pool

• You can copy data into a dedicated SQL pool in multiple ways.

• For now, let's look at the following ways:

  • Using the COPY command

  • Using the bulk load feature

  • Using a pipeline to copy data

Author: Shanmukh Sattiraju


Using the COPY command

1. Create a table:

CREATE TABLE [schema].[TableName]
(
    <Column_Name> <DataType>,
    <Column_Name> <DataType>,
    <Column_Name> <DataType>
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

2. Copy data into the table:

COPY INTO [schema].[TableName]
FROM '<https://ExternalFilePath>'
WITH
(
    FILE_TYPE = 'parquet'
);

Author: Shanmukh Sattiraju


Clustered columnstore index

[Diagram: with row storage, complete rows are stored together (e.g. 101,2013,Male,Vijay; 102,2013,Male,Stark; 103,2013,Female,Andrea; 104,2014,Male,Steve). With a clustered columnstore index, values are stored column by column (IDs 101-105, years 2013-2016, genders, names).]

Author: Shanmukh Sattiraju


Sharding Pattern

These sharding patterns are:

•Hash

•Round Robin

•Replicate

60 distributions

Author: Shanmukh Sattiraju


Round Robin Distribution
CREATE TABLE StudentDetails
WITH (DISTRIBUTION = ROUND_ROBIN)
AS ...

[Diagram: the rows of StudentDetails (101 Networking, 102 Linux, 103 Java, 101 Azure) are spread evenly across the distributions Dist_1 to Dist_4 with no logic tying a row to a distribution.]

Author: Shanmukh Sattiraju


Hash Distribution
CREATE TABLE StudentDetails
WITH (DISTRIBUTION = HASH(StudentID))
AS ...

[Diagram: rows are assigned to distributions by hashing the StudentID column, so both rows with StudentID 101 (Networking and Azure) land in the same distribution.]

Author: Shanmukh Sattiraju
Replicated Distribution
CREATE TABLE StudentDetails
WITH (DISTRIBUTION = REPLICATE)
AS ...

[Diagram: a full copy of the StudentDetails table (101 Networking, 102 Linux, 103 Java, 101 Azure) is kept on every distribution (Dist_1, Dist_2, Dist_3, ...).]

Author: Shanmukh Sattiraju
Sharding Patterns

Round Robin: distributes rows randomly and evenly across nodes, with no logic on how data is distributed. Performance is not optimized. Best for staging tables.

Hash: rows are distributed across nodes based on the hash column that we define (1 node = 1 hash value). Maximum query performance. Best for fact tables.

Replicate: keeps a copy of the entire table in every node (60 distributions make 60 copies). Good performance when used for small tables. Best for dimension tables.

Author: Shanmukh Sattiraju


Reporting data with Power BI

Author: Shanmukh Sattiraju


Project Architecture

Raw container (Unemployment.csv) → Serverless SQL pool → Refined container → Spark pool → Processed container → Dedicated SQL pool → Reporting

Author: Shanmukh Sattiraju
Spark Optimization Techniques

Author: Shanmukh Sattiraju


Spark can be optimized at 2 levels
1. Spark pool Optimization (For Synapse)
• Choosing right node size
• Number of vCores and Memory
• Auto-scaling enabled
• Number of nodes

2. Application or Code Level Optimization


• Writing code to make efficient use of available
resources
Author: Shanmukh Sattiraju
Avoid using collect()

[Diagram: collect() pulls the full dataset from every worker node back to the driver, which can run the driver out of memory.]

Author: Shanmukh Sattiraju


Instead use take()

[Diagram: take(n) returns only the first n elements from the worker nodes to the driver.]

Author: Shanmukh Sattiraju


Avoid using inferSchema

df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .load('<storage_path>')

Using inferSchema will:
• invoke an extra Spark job that reads all the columns
• take a lot of time to load because of that
• not always provide accurate data types (e.g. date columns)
• not be recommended for production notebooks

Best practice:
• Use StructType/StructField to enforce the schema on the columns
Author: Shanmukh Sattiraju
Data Serialization

[Diagram: when data is transferred between nodes over the network, the in-memory objects are serialized into a byte stream (0011010110101) on the sending node and de-serialized back into objects on the receiving node.]

Author: Shanmukh Sattiraju


Cache and persist

• They allow you to store intermediate data in memory
• The stored data is reused in subsequent actions, which can significantly improve the performance of your Spark applications
• Both caching and persisting are used to save intermediate results
• cache() saves data only in memory
• persist() can save data at multiple storage levels (covered shortly)

Author: Shanmukh Sattiraju


How cache() and persist() work

Without cache() or persist():

df = spark.read.format('csv')\
    .option('header','true')\
    .load('abfss://raw@da..')

df_transform = df.withColumn(...)

df_dropped = df_transform.drop(...)

df_converted = df_dropped.withColumn(...)

Every action on df_converted (select(), filter(), orderBy(), groupBy()) re-computes the whole chain from the source.

Author: Shanmukh Sattiraju


How cache() and persist() work

With cache() or persist():

df = spark.read.format('csv')\
    .option('header','true')\
    .load('abfss://raw@da..')

df_transform = df.withColumn(...)

df_dropped = df_transform.drop(...)

df_converted = df_dropped.withColumn(...)

df_converted.cache()

The result of the chain is stored in MEMORY, and subsequent actions (select(), filter(), orderBy(), groupBy()) reuse it instead of re-computing.

Author: Shanmukh Sattiraju


How cache() and persist() work

With cache() or persist():

The data is initially stored in memory and reused for subsequent actions.

Why subsequent actions?

1st action: the result is computed once and stored in memory
2nd action: instead of re-computing, it is retrieved from memory
...
6th action: instead of re-computing, it is retrieved from memory

Author: Shanmukh Sattiraju


cache() vs persist()
• cache() stores the data at the MEMORY_ONLY level
• persist() can store data at different persistence (storage) levels

Here, persistence level means where (memory / disk) and how (serialized or de-serialized) the data is stored.

The various persistence levels are:
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER (Java, Scala)
• MEMORY_AND_DISK_SER (Java, Scala)
• DISK_ONLY
• OFF_HEAP

Usage:
df.persist(StorageLevel.MEMORY_ONLY)
df.persist(StorageLevel.MEMORY_AND_DISK)
...
df.persist(StorageLevel.OFF_HEAP)

Author: Shanmukh Sattiraju


As per Spark documentation

In Python, stored objects will always be serialized with the Pickle library, so it does not
matter whether you choose a serialized level.

The available storage levels in Python (PySpark) include


• MEMORY_ONLY
• MEMORY_ONLY_2
• MEMORY_AND_DISK
• MEMORY_AND_DISK_2
• DISK_ONLY
• DISK_ONLY_2
• DISK_ONLY_3.

Author: Shanmukh Sattiraju


Persistence Levels
MEMORY_ONLY
• In memory, data is stored as de-serialized objects (Java / Scala)
• In memory, data is stored as serialized objects (PySpark)
• Any excess data that doesn't fit into memory is re-computed
• MEMORY_ONLY is functionally the same as using cache()
• Using cache() from PySpark stores the data in de-serialized format
• Usage: df.persist(StorageLevel.MEMORY_ONLY)
• Best suited use case: interactive data exploration

[Diagram: data that fits stays in MEMORY; excess data is re-computed when needed.]
Author: Shanmukh Sattiraju
Understanding StorageLevel

StorageLevel(<useDisk>, <useMemory>, <useOffHeap>, <deserialized>, <replication>)

For MEMORY_ONLY:

StorageLevel(false, true, false, false, 1)
            useDisk, useMemory, useOffHeap, deserialized, replication

Author: Shanmukh Sattiraju


Persistence Levels
MEMORY_AND_DISK
• First the data is stored in memory as:
  • de-serialized objects (Java or Scala)
  • serialized objects (PySpark)
• Any excess data that doesn't fit into memory is sent to disk (i.e. storage)
• Usage: df.persist(StorageLevel.MEMORY_AND_DISK)
• Best suited use case: machine learning training

[Diagram: data that fits stays in MEMORY; excess data spills to DISK.]
Author: Shanmukh Sattiraju
Persistence Levels
MEMORY_ONLY_SER
• Similar to MEMORY_ONLY in PySpark
• Data is stored as serialized objects (Java or Scala)
• This level is not separate in PySpark because PySpark data is already serialized
• Any excess data that doesn't fit into memory is re-computed
• Usage: df.persist(StorageLevel.MEMORY_ONLY_SER)
• Best suited use case: serialize data for memory optimization

[Diagram: serialized data (001011010011101) in MEMORY; excess data is re-computed.]
Author: Shanmukh Sattiraju
Persistence Levels
MEMORY_AND_DISK_SER
• Similar to MEMORY_AND_DISK in PySpark
• Data is stored as serialized objects (Java or Scala)
• This level is not separate in PySpark because PySpark data is already serialized
• Any excess data that doesn't fit into memory is sent to disk (storage)
• Usage: df.persist(StorageLevel.MEMORY_AND_DISK_SER)
• Best suited use case: when you want to reduce memory usage by spilling data to disk

[Diagram: serialized data (001011010011101) in MEMORY; excess data spills to DISK.]
Author: Shanmukh Sattiraju
Persistence Levels
DISK_ONLY
• Stores data only on disk
• Serialized objects in both Scala and PySpark
• Usage: df.persist(StorageLevel.DISK_ONLY)
• Best suited use case: large datasets that don't fit into memory

[Diagram: DATA stored on DISK.]

Author: Shanmukh Sattiraju


Persistence Levels
OFF_HEAP
• Stores data only in off-heap memory
• Usage: df.persist(StorageLevel.OFF_HEAP)
• Best suited use case: off-heap storage for extremely large datasets

[Diagram: DATA stored in OFF-HEAP MEMORY.]

Author: Shanmukh Sattiraju


Remaining Persistent Levels of PySpark

• MEMORY_ONLY_2
• MEMORY_AND_DISK_2
• DISK_ONLY_2
• DISK_ONLY_3.

Author: Shanmukh Sattiraju


Partitioning
• Partitioning is a way to split data into separate folders based on one or multiple columns.
• Each partition is saved into a separate folder.
• It optimizes queries by skipping the parts of the data that are not required.

[Example: a table with Year, Month and Unemployed columns (2012 Jan 211741; 2013 Jan 451751; 2014 Jan 51652; 2015 Jan 5174; 2016 Jan 21657; 2017 Jan 45868; 2018 Jan 87474) partitioned on the Year column produces one folder per year: 2012, 2013, 2014, 2015, 2016, 2017, 2018.]
Author: Shanmukh Sattiraju
Partitioning
Which column to choose for partitioning?

• A column that is used frequently in filtering
• Number of distinct values in the partition column = number of partitions = number of folders
• A column that has few distinct values (low cardinality)

Which column to avoid for partitioning?

• A column that has many distinct values
• This creates too many partitions and makes querying less efficient

Author: Shanmukh Sattiraju
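A hedged write example that partitions by the low-cardinality Year column (the output path is a placeholder):

(df.write
   .partitionBy("Year")          # one folder per distinct Year value
   .mode("overwrite")
   .parquet("abfss://processed@<storage_account>.dfs.core.windows.net/unemployment_by_year/"))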


Repartition and coalesce

Repartition()
• Repartition is a transformation API that can be used to
increase or decrease the number of partitions in a
dataframe/RDD.

Coalesce()
• Coalesce is a transformation API that can be used to decrease
the number of partitions in a dataframe/RDD.

Author: Shanmukh Sattiraju


Repartition
Wide transformation

[Diagram: repartition() fully shuffles the data: the original partitions [1,2,3,4], [5,6,7,8,9], [10,11,12,13,14], [15,16,17,18] are redistributed into new partitions such as [1,10,15,5,16,7], [3,8,11,14,17,4], [2,6,9,13,18,12].]

Author: Shanmukh Sattiraju

coalesce
Narrow transformation

[Diagram: coalesce() merges existing partitions without a full shuffle: [1,2,3,4] and [5,6,7,8,9] become [1,2,3,4,5,6,7,8,9]; [10,11,12,13,14] and [15,16,17,18] become [10,11,12,13,14,15,16,17,18].]

Author: Shanmukh Sattiraju
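A minimal sketch contrasting the two calls (the partition counts are arbitrary):

print(df.rdd.getNumPartitions())   # current number of partitions

df_repart = df.repartition(6)      # wide: full shuffle, can increase or decrease partitions
df_coal = df_repart.coalesce(2)    # narrow: only decreases partitions, avoids a full shuffle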


Repartition vs Coalesce

Purpose: repartition() reduces or increases the number of partitions by performing a full shuffle; coalesce() only reduces the number of partitions and avoids shuffling.

Shuffle: repartition() performs a full shuffle, which can be an expensive operation; coalesce() does not perform a full shuffle.

Number of partitions: repartition() can increase or decrease the number of partitions in a DataFrame; coalesce() only decreases it.

Data movement: repartition() moves data across the network to create the new partitioning scheme; coalesce() tries to minimize data movement and avoid shuffling whenever possible.

Performance: repartition() is generally slower because of the full shuffle; coalesce() is generally faster since it avoids shuffling whenever possible.
Author: Shanmukh Sattiraju
Broadcast variables

CA - California
NY - New York

Author: Shanmukh Sattiraju


Broadcast variables

Without broadcast:

[Diagram: the driver ships the variable val separately to every task on every worker node.]

With broadcast:

[Diagram: the driver sends the broadcast value once to each worker node, where it is cached and shared by all tasks on that node.]
Author: Shanmukh Sattiraju
Broadcast variables

• Read-only variables cached on each machine
• Access the value with .value
• Cached on each worker node in serialized form
• Useful when a dataset needs to be shared across all nodes
• Reduces data transfer

Author: Shanmukh Sattiraju
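A hedged sketch using the CA/NY lookup from the earlier slide; applying it through a UDF on a state-code column is an assumption made for illustration:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

state_lookup = {"CA": "California", "NY": "New York"}
bc_states = spark.sparkContext.broadcast(state_lookup)   # cached once per worker node

@udf(StringType())
def expand_state(code):
    return bc_states.value.get(code, code)               # read the broadcast value on the executor

df_full = df.withColumn("State_Full", expand_state(df["State"]))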


Serialization Types
• Java serialization
  ▪ Default serialization technique used by Spark
  ▪ Less performant than Kryo
  ▪ Not efficient in space utilization

• Kryo serialization
  ▪ Faster and more efficient
  ▪ Takes less time to convert an object to a byte stream, hence faster
  ▪ Since Spark 2.0, the framework has used Kryo for all internal shuffling of RDDs and DataFrames with simple types, arrays of simple types, and so on
  ▪ Spark also provides configurations to tune the Kryo serializer for our application's requirements

Author: Shanmukh Sattiraju


Delta Lake

Author: Shanmukh Sattiraju


Drawbacks of ADLS

ADLS != Database

[Diagram: a relational database gives you ACID guarantees (Atomicity, Consistency, Isolation, Durability); ADLS on its own does not.]
Author: Shanmukh Sattiraju
Drawbacks of ADLS

• No ACID properties
• Job failures lead to inconsistent data
• Simultaneous writes on same folder brings incorrect results
• No schema enforcement
• No support for updates
• No support for versioning

Author: Shanmukh Sattiraju


What is delta lake

• Open-source storage framework that brings reliability to data


lakes
• Brings transaction capabilities to data lakes
• Runs on top of your existing datalake and supports parquet
• Enables Lakehouse architecture

Author: Shanmukh Sattiraju


Lakehouse Architecture

A lakehouse combines the best elements of a data lake with the best elements of a data warehouse.

Evolution: data warehouse → modern data warehouse (uses a data lake) → lakehouse architecture.

Author: Shanmukh Sattiraju
How to create Delta Lake?

Instead of parquet...

dataframe.write\
    .format("parquet")\
    .save("/data/")

...replace with delta:

dataframe.write\
    .format("delta")\
    .save("/data/")

Author: Shanmukh Sattiraju


Delta format

Azure Data Lake


Storage

Parquet + Transaction Log

Author: Shanmukh Sattiraju


delta/
    _delta_log/
        0000.json        <- contains transaction information applied on the actual data
        0001.json
    <partition directory (if applied)>/
        file01.parquet   <- contains the actual data

Author: Shanmukh Sattiraju


Understanding Transaction log file (Delta Log)

• Contains records of every transaction performed on the delta


table

• Files under _delta_log will be stored in JSON format

• Single source of truth

Author: Shanmukh Sattiraju


Transaction log contents
JSON File = result of set of actions

• metadata – Table’s name, schema, partitioning ,etc


• Add – info of added file (with optional statistics)
• Remove – info of removed file
• Set Transaction – contains record of transaction id
• Change protocol – Contains the version that is used
• Commit info – Contains what operation was performed on this

Author: Shanmukh Sattiraju


Delta lake key features
• Open Source: Stored in form of parquet files in ADLS
• ACID Transactions: Ensures data quality
• Schema Enforcement : Restricts unexpected schema changes
• Schema Evolution: Accepts any required schema changes.
• Audit History: Logs all the change details happened on table
• Time Travel: Helps to get previous versions using version or
timestamp
• DML Operations: Enables us to use UPDATE, DELETE and MERGE
• Unified batch /Streaming: Follows same approach for batch and
streaming flows
Author: Shanmukh Sattiraju
Schema Enforcement

Loading new data Delta Table

WRITE

Author: Shanmukh Sattiraju


How does schema enforcement work?
Delta Lake validates the schema on writes.

Schema enforcement rules - the data being written:

1. Cannot contain any additional columns that are not present in the target table's schema
2. Cannot have column data types that differ from the column data types in the target table.

Author: Shanmukh Sattiraju


Schema Evolution

Loading new data Delta Table

WRITE

Author: Shanmukh Sattiraju
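A hedged sketch of opting in to schema evolution when appending a dataframe (new_df, hypothetical) whose schema has an extra column:

(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # accept the schema change instead of failing the write
    .save("/data/"))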


Audit Data Changes & Time Travel

• Delta automatically versions every operation that you perform

• You can time travel to historical versions

• This versioning makes it easy to audit data changes, roll back data in
case of accidental bad writes or deletes, and reproduce experiments
and reports.

Author: Shanmukh Sattiraju
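A minimal time-travel sketch reading an older version of the Delta table created above:

df_v1 = (spark.read.format("delta")
         .option("versionAsOf", 1)                 # or .option("timestampAsOf", "2024-01-01")
         .load("/data/"))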


Vacuum in Delta Lake

• VACUUM removes parquet files that are no longer referenced by the latest state in the transaction log
• It skips files that start with _ (underscore), which includes _delta_log
• It deletes files that are older than the retention threshold
• The default retention threshold is 7 days
• If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified data retention period.

Author: Shanmukh Sattiraju


Checkpoints in Delta Lake

[Diagram: each write adds another JSON commit file to _delta_log; after every 10 commits a checkpoint file such as 00000010.checkpoint.parquet is created.]

• Serves as a starting point for computing the table state
• Contains a replay of all actions performed so far
• Reduces the number of JSON files that must be read
• By default, a checkpoint is created every 10 commits

Author: Shanmukh Sattiraju


Optimize in Delta Lake

Operation      Parquet file           Log file    Rows   File state
CREATE TABLE   -                      000.json    -      -
WRITE          aabb.parquet           001.json    100    Active
WRITE          ccdd.parquet           002.json    101    Inactive
WRITE          eeff.parquet           003.json    102    Inactive
DELETE 101     gghh.parquet (empty)   004.json    -      Inactive
UPDATE 102     iijj.parquet           005.json    99     Active

Author: Shanmukh Sattiraju


UPSERT (Merge) in delta lake

• We can UPSERT (UPDATE + INSERT) data using MERGE command.


• If any matching rows found, it will update them
• If no matching rows found, this will insert that as new row

MERGE INTO <Destination_Table>


USING <Source_Table>
ON <Dest>.Col2 = <Source>.Col2
WHEN MATCHED
THEN UPDATE SET
<Dest>.Col1 = <Source>.Col1,
<Dest>.Col2 = <Source>.Col2
WHEN NOT MATCHED
THEN INSERT
VALUES(Source.Col1, Source.Col2)

Author: Shanmukh Sattiraju


End of the course

Author: Shanmukh Sattiraju
