Synapse Project Deck
Hands-On Project
Transformation topics:
5 - Spark SQL
6 - Join
7 - String Manipulation and Functions
8 - Window Functions and sorting
9 - Conversions and pivoting
10 - Schema definition and management
11 - User Defined Functions
Author: Shanmukh Sattiraju
Project Architecture
(Diagram) Unemployment.csv lands in the Raw zone; the Serverless SQL pool refines it (Refined), the Spark Pool processes it (Processed), and the Dedicated SQL Pool serves it for Reporting.
Along with the hands-on project:
(Diagram) Users select, write, and update data through a website backed by storage; that storage is later used for analysis.
OLTP vs OLAP
(Diagram) OLTP (Online Transactional Processing) sources: SQL and other databases, CSV files, JSON files.
ETL (with data cleansing/transformation) moves data from the OLTP sources into the OLAP (Online Analytical Processing) side: a data warehouse and/or a data lake.
STORE: Azure Data Lake Storage Gen2 (cloud storage on AWS, GCP, or Azure).
Problem with the Modern Data Warehouse
(Diagram) STORE layer: Azure Data Lake Storage Gen2 (cloud storage on AWS, GCP, or Azure) sitting among separately managed services.
The Solution – Azure Synapse Analytics
(Diagram) On-premise data and external or IoT data are ingested, stored in Azure Data Lake Storage Gen2 (cloud storage on AWS, GCP, or Azure), and visualized.
Components of Azure Synapse Analytics
(Diagram) Synapse Studio brings together Data Integration, Management, Monitoring, and Security; Analytics Pools provide the compute; on-premise data is ingested and visualized; the STORE layer is Azure Data Lake Storage Gen2.
Replacing the Modern Data Warehouse
(Diagram) The same STORE layer, Azure Data Lake Storage Gen2 (cloud storage on AWS, GCP, or Azure), now fronted by Azure Synapse Analytics instead of separate services.
Ingest
Storage: Azure SQL Database (DW), Azure Data Lake (primary), Spark tables
Compute: Dedicated SQL Pool (SQL DW), Serverless SQL pool, Spark Pool
Visualize
Manage / Security
Azure Synapse Analytics
Microsoft's definition: Azure Synapse is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics.
It combines in one workspace: storage (Azure SQL Database DW, Azure Data Lake as primary storage, Spark tables), compute (Dedicated SQL Pool, Serverless SQL pool, Spark Pool), visualization, and management/security.
Environment Setup
(Diagram) Unemployment.csv (Raw) → Serverless SQL pool (Refined) → Spark Pool (Processed) → Dedicated SQL Pool → Reporting.
On-demand Serverless SQL pool
T-SQL queries:
SELECT * FROM OPENROWSET()
Mandatory parameters for OPENROWSET()
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK '<storage-path>',                   -- mandatory
        FORMAT = 'CSV' | 'PARQUET' | 'DELTA',    -- mandatory
        PARSER_VERSION = '1.0' | '2.0'           -- CSV files only
    ) AS [result]
URL formats for the BULK parameter
External Data Source          | Prefix  | Storage account path
Azure Blob Storage            | http[s] | <storage_account>.blob.core.windows.net/path/file
Azure Blob Storage            | wasb[s] | <container>@<storage_account>.blob.core.windows.net/path/file
Azure Data Lake Storage Gen2  | abfs[s] | <container>@<storage_account>.dfs.core.windows.net/path/file

When a DATA_SOURCE is supplied, the BULK path is relative to the data source location. External data source for data in ADLS Gen2 = 'abfss://<container>@<storage_account>.dfs.core.windows.net/'
SELECT
    TOP 10 *
FROM
    OPENROWSET(
        BULK 'folder/file',
        DATA_SOURCE = '<name>',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 2
    )
• It defines the file format the data will use when you create an EXTERNAL TABLE (discussed later).
• Creating an external file format is a prerequisite for creating an external table.
• The data created from an EXTERNAL TABLE uses this EXTERNAL FILE FORMAT.
• In short, the EXTERNAL FILE FORMAT specifies the actual layout of the data referenced by an external table.
<format_options> ::=
{
FIELD_TERMINATOR = field_terminator
| STRING_DELIMITER = string_delimiter
| FIRST_ROW = integer -- ONLY AVAILABLE FOR AZURE SYNAPSE ANALYTICS
| DATE_FORMAT = datetime_format
| USE_TYPE_DEFAULT = { TRUE | FALSE }
| ENCODING = {'UTF8' | 'UTF16'}
| PARSER_VERSION = {'parser_version'}
}
CREATE EXTERNAL FILE FORMAT
-- Create an external file format for PARQUET files (the format name is illustrative).
CREATE EXTERNAL FILE FORMAT ParquetFileFormat WITH (FORMAT_TYPE = PARQUET);
-- Create an external file format for Delta table files (serverless SQL pools in Synapse Analytics and SQL Server 2022; name is illustrative).
CREATE EXTERNAL FILE FORMAT DeltaFileFormat WITH (FORMAT_TYPE = DELTA);
CREATE EXTERNAL TABLE <table_name>
WITH (
    LOCATION = 'test/extfile/',
    DATA_SOURCE = <data_source_name>,
    FILE_FORMAT = <file_format_name>
) AS SELECT
    TOP 10 [data].Year, [data].State
FROM
    OPENROWSET(
        BULK 'abfss://[email protected]/Unemployment.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [data]
Initial Transformation
(Diagram) Unemployment.csv is transformed into .parquet through an EXTERNAL TABLE: an External Data Source plus an External File Format let you run SELECT * FROM ExternalTable (the same external-table concept also exists for sources such as Oracle and MySQL).
• With Synapse SQL, you can use external tables to read external data using a dedicated SQL pool or a serverless SQL pool.
• An external table is a table-like object in Azure Synapse that represents the structure and schema of data stored in external data sources.
• External tables act as a reference point to data stored externally in storage (Azure Data Lake Storage or Azure Blob Storage).
• In parallel with creating the external table (CETAS), the data is written to the storage location you choose.
• FILE_FORMAT = the format you want for the data created by the external table.
(Diagram) A Hadoop/YARN cluster: worker machines provide RAM and storage; Application Masters (AM) and Node Managers (NM) manage the work, DataNodes (DN) hold the HDFS blocks.
(Diagram) Iterative analysis on MapReduce reads from and writes to HDFS disk between Iteration 1, 2 and 3, whereas Spark keeps the data in RAM across iterations and can read from HDFS or any cloud storage.
(Diagram) The Spark stack: higher-level APIs (DataFrame/Dataset APIs, Spark SQL for interactive queries) usable from Scala, Java, Python, SQL and R, built on the Spark Core API (RDD – Resilient Distributed Dataset and the RDD APIs) and the distributed compute engine (Spark Engine).
Hadoop MapReduce vs Spark:
• Development: MapReduce requires writing complex Map and Reduce code; Spark lets you use native SQL via Spark SQL and composable APIs.
• Language: MapReduce – Java; Spark – Java, Scala, Python and R.
(Diagram) Spark runtime: the Driver Program submits work through the Cluster Manager to Worker nodes, where tasks run; an RDD is split into partitions (Partition 1–5) distributed across the worker nodes.
• Ease of creation
• Scalability
Transformations
• Any operation that leads to some change in the form of data is a transformation.
• Transformations take an RDD as input and produce one or more RDDs as output.
• After executing a transformation, the resulting RDD(s) can be smaller, bigger, or the same size as the parent RDDs.
• Transformations are lazy; they are not executed immediately. They execute only when an action is called (lazy evaluation).
• Transformations don't change the input RDD, because RDDs are immutable.
• E.g.: filter(), map(), flatMap(), etc.
Actions
• Any operation that returns a value or data back to the driver program is an action.
• Actions set the laziness of RDDs into motion.
• E.g.: count(), collect(), take()
(Diagram) rdd → map() → rdd_add → filter() → rdd_filter are transformations; the action collect() sends the output to the Driver Program.
RDD Lineage
(Diagram) rdd_list = sc.parallelize(list) → map() → rdd_add → filter() → rdd_filter; calling an action such as rdd_filter.collect() executes the whole lineage and returns the result to the Driver Program.
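A minimal PySpark sketch of this lazy behaviour (assumes the notebook's SparkSession, spark, is available):
sc = spark.sparkContext
rdd_list = sc.parallelize([1, 2, 3, 4, 5])          # create an RDD from a Python list
rdd_add = rdd_list.map(lambda x: x + 1)             # transformation: nothing runs yet
rdd_filter = rdd_add.filter(lambda x: x % 2 == 0)   # transformation: still nothing runs
print(rdd_filter.collect())                         # action: executes the whole lineage, prints [2, 4, 6]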
(Diagram) Word count on Biography.txt ("Tony Stark is an Avenger… Tony Stark is a genius, Billionaire…"): each partition splits its lines into words and maps them to (word, 1) pairs; the pairs are grouped by key across partitions (e.g. Stark,[1,1,1]) and reduced to counts (Tony,3  Stark,3  Is,3  a,3 …) that are returned to the Driver Program.
(Diagram) With a combiner, each partition pre-aggregates its own (word, 1) pairs (e.g. Tony,2  Is,2) before the shuffle, so only the partial counts are merged into the final result (Tony,4  Is,4  Awesome,2  Superhero,1); without a combiner, every individual (word, 1) pair is shuffled before being reduced.
Program Execution
(Diagram) A Spark application is broken into Jobs (Job 0, Job 1, Job 2), each job into Stages (Stage 1–4), and each stage into Tasks (Task 2–Task 8).
Jobs
Number of Jobs = Number of Actions
(Diagram) Word-count example: each partition maps its words to (word, 1) pairs (RDDPaired); grouping by key (e.g. Tony,[1,1,1]  Stark,[1,1,1]  is,[1,1,1]) and summing produces the reduced counts (RDDreduced: Tony,3  Stark,3  is,3 …).
Stages
Number of Stages = Number of Wide Transformations applied + 1
Tasks
Number of Tasks = Number of Partitions
(Diagram) Partitions A, B, C with a Map step.
• It is the logical plan.
Catalyst Optimizer
(Diagram) DataFrame/Dataset operations go through the Catalyst Optimizer and are executed as optimized lower-level RDD APIs.
To read a DataFrame: DataFrameReader
Supported formats: JSON, CSV, PARQUET, AVRO, ORC, TEXT
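A minimal sketch of the DataFrameReader API in a Synapse notebook (the storage path is a placeholder; spark is the notebook's SparkSession):
df = (spark.read
      .format('csv')                # also: 'json', 'parquet', 'avro', 'orc', 'text'
      .option('header', 'true')     # first row holds the column names
      .load('<storage_path>/Unemployment.csv'))
df.show(5)                          # preview the first 5 rows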
(Diagram) Unemployment.csv (Raw) → Serverless SQL pool (Refined) → Spark Pool (Processed) → Dedicated SQL Pool → Reporting.
Selection and filtering
(Diagram) Starting columns: Industry, Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy.
• Delete a column (e.g. Industry): drop()
• Rename a column (e.g. to "UnEmployed Rate Percentage"): withColumnRenamed()
Resulting columns: Gender, Education Level, Date Inserted, Aggregation Level, Data Accuracy, UnEmployed Rate Percentage.
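A minimal sketch of the two operations (df and the source column name 'Unemployment Rate' are assumptions for illustration):
df_dropped = df.drop('Industry')                      # delete the Industry column
df_renamed = df_dropped.withColumnRenamed(
    'Unemployment Rate',                              # assumed existing column name
    'UnEmployed Rate Percentage')                     # new name used in the slide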
spark.read.format('csv')\
    .option('header', 'true')\
    .load('synfs:/' + JobId + '/lake/transformed/nulls.csv')
Notebook
Mount point: /lake
(Diagram) A notebook reads the storage account through the mount point /lake.
Once we attach our storage to a mount point, we can access the storage account without using the full path name.
mssparkutils.fs.mount(
    "<full_path>",
    "<mount_point_name>",
    {"LinkedService": "<linked_service_name>"}
)
Workspace level
• Every Azure Synapse workspace comes with a default quota of
vCores that can be used for Spark. The quota is split between the
user quota and the dataflow quota so that neither usage pattern uses
up all the vCores in the workspace. The quota is different depending
on the type of your subscription but is symmetrical between user and
dataflow. However if you request more vCores than are remaining in
the workspace, then you'll get the following error:
Failed to start session: [User] MAXIMUM_WORKSPACE_CAPACITY_EXCEEDED
Your Spark job requested 12 vCores.
However, the workspace only has xxx vCores available out of quota of yyy vCores.
Try reducing the numbers of vCores requested or increasing your vCore quota. Click
here for more information - https://go.microsoft.com/fwlink/?linkid=213499
• exit -> This method lets you exit a notebook with a value.
• run -> This method runs a notebook and returns its exit value.
• Using %%run followed by a notebook name also runs that notebook from another notebook.
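A minimal sketch of both notebook utilities (the notebook name and timeout are placeholders):
result = mssparkutils.notebook.run('child_notebook', 90)   # run another notebook with a 90-second timeout and capture its exit value
mssparkutils.notebook.exit(result)                         # exit the current notebook, returning a value to the caller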
mssparkutils.fs.mount(
    "abfss://[email protected]/",
    "/lake",
    {"LinkedService": "synapse1121-WorkspaceDefaultStorage"}
)
Returns an empty array
The %%sql magic lets you run Spark SQL directly in a notebook cell.
df.createTempView('<ViewName>')
• createTempView()
  • Throws an error if another view with the same name already exists in that session.
• createOrReplaceTempView()
  • Use this when you want to automate running your notebook and reuse the same view name.
Global temp views make a view available to other notebooks attached to the same cluster. But Synapse Analytics is not like Databricks: there is no long-running cluster, so global temp views do not behave the way they do in Databricks.
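A minimal sketch of registering a temp view and querying it with Spark SQL (the view name is illustrative; df is assumed):
df.createOrReplaceTempView('unemployment_view')             # safe to re-run: replaces the view if it exists
spark.sql('SELECT COUNT(*) FROM unemployment_view').show()  # query the view with Spark SQL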
Workspace data
• Lake databases
• You can define tables on top of lake data using Apache Spark
notebooks
• You can query these tables using the T-SQL (Transact-SQL) language with the serverless SQL pool
• SQL Databases
• You can define your own databases and tables directly using the
serverless SQL pools
• You can use T-SQL CREATE DATABASE, CREATE EXTERNAL TABLE to
define the objects
• External Tables
• These can be defined for a custom file location, where the data for the
table is stored.
• The metadata for the table is defined in the Spark catalog.
• Dropping the table deletes the metadata from the catalog, but doesn't
affect the data files.
Metadata sharing by replication
Why UDFs?
• In Spark SQL or on PySpark DataFrames, you cannot directly apply plain Python functions.
• To use custom functions on DataFrames or in Spark SQL, you need UDFs.
Methods to create a UDF
Method 1: define a Python function and register it as a UDF (e.g. with udf()).
Syntax:
def <function_name>(<args>):
    <function_definition>
    return <value>
Method 2: use the @udf annotation to wrap the function so it can be applied on a DataFrame.
Syntax:
@udf(<return_type>)
def <function_name>(<args>):
    <function_definition>
    return <value>
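A minimal sketch of both methods (df and the State column follow earlier examples; the function logic is illustrative):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Method 1: plain Python function registered with udf()
def to_upper(value):
    return value.upper() if value is not None else None

to_upper_udf = udf(to_upper, StringType())
df_upper = df.withColumn('State_upper', to_upper_udf(df['State']))

# Method 2: @udf annotation
@udf(StringType())
def add_suffix(value):
    return value + '_US' if value is not None else None

df_suffix = df.withColumn('State_tagged', add_suffix(df['State']))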
(Diagram) Unemployment.csv (Raw) → Serverless SQL pool (Refined) → Spark Pool (Processed) → Dedicated SQL Pool → Reporting.
Synapse Dedicated SQL Architecture – MPP
(Diagram) The Data Movement Service (DMS) runs alongside the compute nodes; the 60 distributions (D1–D60) are spread evenly across them (e.g. Node 1 holds D1–D30 and Node 2 holds D31–D60; with three nodes, each holds 20).
(Diagram) A sample table (ID, Year, Gender, Name) with rows such as 101,2013,Male,Vijay; 102,2013,Male,Stark; 103,2013,Female,Andrea; 104,2014,Male,Steve is split across the distributions.
A table's rows are spread across the 60 distributions using one of three distribution options:
• Hash
• Round Robin
• Replicate
(Diagram) StudentDetails (Student ID, Subject) with rows 101-Networking, 102-Linux, 103-Java, 101-Azure placed into distributions Dist_1–Dist_4: in one layout the rows are spread across different distributions (e.g. 101-Networking in Dist_1, 101-Azure in Dist_3); in the other, the rows land together in a single distribution (all four in Dist_1) while the remaining distributions stay empty.
Replicated Distribution
CREATE TABLE StudentDetails
WITH (DISTRIBUTION = REPLICATE)
AS ...
(Diagram) The full StudentDetails table (101-Networking, 102-Linux, 103-Java, 101-Azure) is copied in full to each compute node, so every distribution shown (Dist_1, Dist_2, Dist_3) sees the complete table.
Sharding Patterns
Project Architecture
(Diagram) Unemployment.csv (Raw) → Serverless SQL pool (Refined) → Spark Pool (Processed) → Dedicated SQL Pool → Reporting.
Spark Optimization Techniques
(Diagram) collect() pulls data from every worker node back to the driver and can cause out-of-memory errors on the driver.
(Diagram) take(n) returns only the first n rows to the driver, so it is the safer way to inspect data.
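A minimal sketch of the safer alternative (df is assumed):
rows = df.take(5)        # brings only 5 rows to the driver
# rows = df.collect()    # avoid on large DataFrames: brings every row to the driver and can cause OOM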
df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .load('<storage_path>')
Using inferSchema:
• Invokes an extra Spark job that reads all the columns
• Takes a long time to load because of that
• Does not always give accurate data types (e.g. date columns)
• Is not recommended for production notebooks
Best practice:
• Use StructType/StructField to enforce a schema on the columns
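A minimal sketch of enforcing a schema instead of inferring it (column names are illustrative, based on columns mentioned elsewhere in the deck):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

schema = StructType([
    StructField('Year', IntegerType(), True),
    StructField('State', StringType(), True),
    StructField('Industry', StringType(), True),
    StructField('UnEmployed Rate Percentage', DoubleType(), True),
])

df = (spark.read.format('csv')
      .option('header', 'true')
      .schema(schema)              # no inferSchema job is triggered
      .load('<storage_path>'))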
Data Serialization
(Diagram) To transfer data between nodes over the network (or spill it to disk), in-memory objects must be serialized into a byte stream (0011010110101) and deserialized in memory on the other side.
df = spark.read.format('csv')\
    .option('header', 'true')\
    .load('abfss://raw@da..')
df_transform = df.withColumn(..)
df_dropped = df_transform\
    .drop(..)
df_converted = df_dropped\
    .withColumn(..)
Without caching, each action below (select(), filter(), orderBy(), groupBy()) recomputes the whole lineage from the CSV read onwards:
df_converted.select()
df_converted.filter()
df_converted.orderBy()
df_converted.groupBy()
With df_converted.cache(), the computed result is kept in MEMORY and reused by the subsequent actions instead of being recomputed.
In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level.
Understanding StorageLevel
StorageLevel( <useDisk>, <useMemory>, <useOffHeap>, <de-serialized>, <replication> )
(Diagram) Data is kept in MEMORY; excess data that does not fit is either recomputed (MEMORY_ONLY) or spilled to DISK (MEMORY_AND_DISK).
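A minimal sketch of persisting with an explicit storage level (df_converted is assumed from the caching example above):
from pyspark import StorageLevel

df_converted.persist(StorageLevel.MEMORY_AND_DISK)   # spill whatever does not fit in memory to disk
df_converted.count()                                 # first action materializes the persisted data
df_converted.unpersist()                             # release it when it is no longer needed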
Persistent Levels
MEMORY_ONLY_SER
• Similar to MEMORY_ONLY
• But the data is stored as a serialized object (Java or Scala)
• This level is not available in PySpark, because PySpark data is already serialized (with Pickle)
• Any excess data that doesn't fit into memory is recomputed when needed
• Usage: df.persist(StorageLevel.MEMORY_ONLY_SER)
• Best-suited use case: serializing data for memory optimization
(Diagram) Serialized data (001011010011101) is held in MEMORY; excess data is re-computed.
Persistent Levels
MEMORY_AND_DISK_SER
• Similar to MEMORY_AND_DISK
• But the data is stored as a serialized object (Java or Scala)
• This level is not available in PySpark, because PySpark data is already serialized (with Pickle)
• Any excess data that doesn't fit into memory is spilled to disk (storage)
• Usage: df.persist(StorageLevel.MEMORY_AND_DISK_SER)
• Best-suited use case: when you want to reduce memory usage by storing data on disk
(Diagram) Serialized data (001011010011101) is held in MEMORY; excess data goes to DISK.
Persistent Levels
DISK_ONLY
• Stores data only on disk
• Stored as a serialized object in both Scala and PySpark
• Usage: df.persist(StorageLevel.DISK_ONLY)
• Best-suited use case: when you have large datasets that don't fit into memory
(Diagram) DATA is written to DISK.
Replicated variants store each partition on two (or three) cluster nodes:
• MEMORY_ONLY_2
• MEMORY_AND_DISK_2
• DISK_ONLY_2
• DISK_ONLY_3
(Diagram) Example DataFrame (Year, Month, Unemployed), e.g. Year = 2012, split into partitions.
Repartition()
• Repartition is a transformation API that can be used to
increase or decrease the number of partitions in a
dataframe/RDD.
Coalesce()
• Coalesce is a transformation API that can be used to decrease
the number of partitions in a dataframe/RDD.
(Diagram) repartition(3) performs a full shuffle: partitions [1,2,3,4], [5,6,7,8,9], [10,11,12,13,14], [15,16,17,18] are redistributed into mixed partitions such as [1,10,15,5,16,7], [3,8,11,14,17,4], [2,6,9,13,18,12].
(Diagram) coalesce(2) merges existing partitions without a full shuffle: [1,2,3,4,5,6,7,8,9] and [10,11,12,13,14,15,16,17,18].
Data movement: repartition() moves data across the network to create the new partitioning scheme; coalesce() tries to minimize data movement and avoid shuffling whenever possible.
Performance: repartition() is generally slower due to the full shuffle operation; coalesce() is generally faster since it avoids shuffling whenever possible.
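A minimal sketch of both APIs (df is assumed; the partition counts are illustrative):
print(df.rdd.getNumPartitions())   # current number of partitions
df_more = df.repartition(8)        # full shuffle: redistribute into 8 partitions
df_fewer = df.coalesce(2)          # no full shuffle: merge existing partitions down to 2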
Broadcast variables
(Diagram) Without broadcasting, a small lookup (e.g. CA - California, NY - New York) is shipped with every task, so each task on each worker node holds its own copy of the variable (val).
(Diagram) With a broadcast variable (broad), each worker node receives a single read-only copy that all of its tasks share.
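A minimal sketch of broadcasting the lookup and using it in a UDF (column names are illustrative):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

state_lookup = {'CA': 'California', 'NY': 'New York'}
broadcast_states = spark.sparkContext.broadcast(state_lookup)   # one read-only copy per worker

@udf(StringType())
def full_state_name(code):
    return broadcast_states.value.get(code, code)               # tasks read the local broadcast copy

# df_named = df.withColumn('StateName', full_state_name(df['StateCode']))   # StateCode is an assumed column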
• Kryo Serialization
  ▪ Faster and more efficient than the default Java serialization
  ▪ Takes less time to convert an object to a byte stream, hence faster
  ▪ Since Spark 2.0, the framework has used Kryo for all internal shuffling of RDDs and DataFrames with simple types, arrays of simple types, and so on
  ▪ Spark also provides configurations to tune the Kryo serializer for our application's requirements
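A minimal sketch of switching to Kryo when building the session (the buffer value is an illustrative tuning choice):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
         .config('spark.kryoserializer.buffer.max', '512m')   # illustrative: raise the buffer for large objects
         .getOrCreate())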
ADLS != Database
A relational database guarantees the ACID properties: Atomicity, Consistency, Isolation, Durability.
Drawbacks of ADLS
• No ACID properties
• Job failures lead to inconsistent data
• Simultaneous writes to the same folder can produce incorrect results
• No schema enforcement
• No support for updates
• No support for versioning
Lakehouse
# Parquet write
dataframe.write\
    .format('parquet')\
    .save('/data/')
# Delta write
dataframe.write\
    .format('delta')\
    .save('/data/')
Each WRITE appends a commit file to _delta_log/:
_delta_log/
    0000.json
    0001.json
These JSON files contain the transaction information applied on the actual data.
Schema enforcement: a WRITE to a Delta table
1. Cannot contain any additional columns that are not present in the target table's schema.
2. Cannot have column data types that differ from the column data types in the target table.
• This versioning makes it easy to audit data changes, roll back data in case of accidental bad writes or deletes, and reproduce experiments and reports.
• VACUUM removes Parquet files that are no longer part of the latest state in the transaction log.
• It skips files whose names start with _ (underscore), which includes _delta_log.
• It deletes files that are older than the retention threshold.
• The default retention threshold is 7 days.
• If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified data retention period.
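A minimal sketch of time travel and VACUUM on the Delta path used above (requires Delta Lake support, as in Synapse Spark pools; the retention value shown is the 7-day default expressed in hours):
df_v0 = (spark.read.format('delta')
         .option('versionAsOf', 0)                      # read the table as it was at version 0
         .load('/data/'))

spark.sql('VACUUM delta.`/data/` RETAIN 168 HOURS')     # remove unreferenced files older than the retention threshold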
(Diagram) After every 10 WRITE commits, the JSON commit files in _delta_log are compacted into a checkpoint file (e.g. 00000010.checkpoint.parquet), so readers do not have to replay every individual JSON commit.