
Azure Synapse Analytics

James Serra
Data & AI Architect
Microsoft, NYC MTC
[email protected]
Blog: JamesSerra.com
About Me
 Microsoft, Big Data Evangelist
 In IT for 30 years, worked on many BI and DW projects
 Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
 Been perm employee, contractor, consultant, business owner
 Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
 Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
 Blog at JamesSerra.com
 Former SQL Server MVP
 Author of book “Reporting with Microsoft SQL Server 2012”
Agenda

 Introduction
 Studio
 Data Integration
 SQL Analytics
 Data Storage and Performance Optimizations
 SQL On-Demand
 Spark
 Security
 Connected Services
Azure Synapse Analytics is a limitless analytics service that brings together
enterprise data warehousing and Big Data analytics. It gives you the freedom
to query data on your terms, using either serverless on-demand or provisioned
resources, at scale. Azure Synapse brings these two worlds together with a
unified experience to ingest, prepare, manage, and serve data for immediate
business intelligence and machine learning needs.
Azure Synapse – SQL Analytics focus areas

 Best-in-class price per performance – Up to 94% less expensive than competitors
 Industry-leading security – Defense-in-depth security and a 99.9% financially backed availability SLA
 Workload-aware query execution – Manage heterogeneous workloads through workload priorities and isolation
 Data flexibility – Ingest a variety of data sources to derive maximum benefit; query all data
 Developer productivity – Use preferred tooling for SQL data warehouse development
Leveraging ISV partners with Azure Synapse Analytics
[Diagram: Azure Data Share ecosystem of ISV partners (+ many more) connecting to Azure Synapse Analytics, Power BI, and Azure Machine Learning]
What workloads are NOT suitable?

Operational workloads (OLTP):
• High-frequency reads and writes
• Large numbers of singleton selects
• High volumes of single-row inserts

Data preparation:
• Row-by-row processing needs
• Incompatible formats (XML)
What workloads are suitable?

Analytics:
• Store large volumes of data
• Consolidate disparate data into a single location
• Shape, model, transform, and aggregate data
• Batch/micro-batch loads
• Perform query analysis across large datasets
• Ad-hoc reporting across large data volumes
• All using simple SQL constructs (see the sketch below)
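A minimal sketch of the kind of set-based aggregation these workloads imply; the table and column names here are hypothetical:

-- Consolidated ad-hoc aggregation over a large fact table
SELECT Country,
       COUNT_BIG(*) AS OrderCount,
       SUM(Sales)   AS TotalSales
FROM dbo.FactOrders
GROUP BY Country;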
Azure Synapse Analytics
Integrated data platform for BI, AI and continuous intelligence

Artificial Intelligence / Machine Learning / Internet of Things
Intelligent Apps / Business Intelligence

Experience: Synapse Analytics Studio
Languages: SQL, Python, .NET, Java, Scala, R
Form factors: Provisioned, On-demand
Analytics runtimes
Platform services: Management, Security, Monitoring, Metastore, Data Integration
Azure Common Data Model
Enterprise Security
Data Lake Storage optimized for analytics
The same integrated platform, with its connected services:
• Azure Data Catalog
• Azure Data Lake Storage
• Azure Data Share
• Azure Databricks
• Azure HDInsight
• Azure Machine Learning
• Power BI
• 3rd-party integration
Provisioning Synapse workspace

Provisioning Synapse is easy. Specify:

 Subscription
 Resource group
 Workspace name
 Region
 Data Lake Storage account

A Synapse workspace contains SQL pools and Apache Spark pools.
Azure Synapse Analytics
Studio
Studio https://web.azuresynapse.net
A single place for Data Engineers, Data Scientists, and IT Pros to collaborate on enterprise analytics
Synapse Studio
Synapse Studio is divided into activity hubs, which organize the tasks needed for building an analytics solution:

 Overview – Quick access to common gestures, most recently used items, and links to tutorials and documentation.
 Data – Explore structured and unstructured data.
 Develop – Write code and define the business logic of the pipeline via notebooks, SQL scripts, data flows, etc.
 Orchestrate – Design pipelines that move and transform data.
 Monitor – Centralized view of all resource usage and activities in the workspace.
 Manage – Configure the workspace, pools, and access to artifacts.
Synapse Studio
Overview hub

Overview Hub
A starting point for activities, with key links to tasks, artifacts, and documentation.

 New dropdown – quickly start work on a new item
 Recent & Pinned – lists recently opened code artifacts; pin selected ones for quick access
Synapse Studio
Data hub
Data Hub
Explore data inside the workspace and in linked storage accounts
Data Hub – Storage accounts
Browse Azure Data Lake Storage Gen2 accounts and filesystems – navigate through folders to see data

Filepath

ADLS Gen2 Account

Container (filesystem)
Data Hub – Storage accounts
Preview a sample of your data
Data Hub – Storage accounts
See basic file properties
Data Hub – Storage accounts
Manage Access - Configure standard POSIX ACLs on files and folders
Data Hub – Storage accounts
Two simple gestures to start analyzing with SQL scripts or with notebooks.

T-SQL or PySpark auto-generated.


Data Hub – Storage accounts

SQL Script from Multiple files

Multi-select of files generates a SQL script that analyzes all those files together
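A minimal sketch of what such an auto-generated T-SQL script can look like for Parquet files, assuming the serverless OPENROWSET syntax; the account, filesystem, and folder names are placeholders:

-- Analyze multiple selected files together
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/<filesystem>/orders/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];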
Data Hub – Databases
Explore the different kinds of databases that exist in a workspace.

SQL pool

SQL on-demand

Spark
Data Hub – Databases
Familiar gestures generate T-SQL scripts from SQL metadata objects such as tables. Starting from a table, auto-generate a single line of PySpark code that makes it easy to load a SQL table into a Spark DataFrame.
Data Hub – Datasets
Orchestration datasets describe data that is persisted. Once a dataset is defined, it can be used in pipelines as a source of data or as a sink of data.
Synapse Studio
Develop hub
Develop Hub
Overview

 Provides a development experience to query, analyze, and model data

Benefits

 Multiple languages to analyze data under one umbrella
 Switch between notebooks and scripts without losing content
 Code IntelliSense offers reliable code development
 Create insightful visualizations
Develop Hub - SQL scripts
SQL Script

 Author SQL scripts
 Execute SQL scripts on a provisioned SQL pool or SQL on-demand
 Publish individual SQL scripts, or multiple SQL scripts through the Publish all feature
 Language support and IntelliSense
Develop Hub - SQL scripts
SQL Script
View results in Table or Chart form and export results in
several popular formats
Develop Hub - Notebooks
Notebooks

 Write multiple languages in one notebook using %%<name of language> magic commands
 Use temporary tables across languages
 Language support for syntax highlighting, syntax error checking, code completion, smart indent, and code folding
 Export results
Develop Hub - Notebooks
Configure session allows developers to control how many resources
are devoted to running their notebook.
Develop Hub - Notebooks
As notebook cells run, the underlying Spark application status is shown, providing immediate feedback and progress tracking.
Dataflow Capabilities

 Handle upserts, updates, and deletes on SQL sinks
 Add new partition methods
 Add schema drift support
 Commonly used ETL patterns (sequence generator / lookup transformation / SCD …)
 Add file handling (move files after read, write files to file names described in rows, etc.)
 New inventory of functions (e.g., hash functions for row comparison)
 Data lineage – capturing sink column lineage and impact analysis (invaluable for enterprise deployments)
 Implement commonly used ETL patterns as templates (SCD Type 1, Type 2, Data Vault)
Develop Hub - Data Flows
Data flows are a visual way of specifying how to transform data, providing a code-free experience.
Develop Hub – Power BI
Overview

 Create Power BI reports in the workspace
 Access published reports in the workspace
 Update reports in real time from the Synapse workspace and have changes reflected in the Power BI service
 Visually explore and analyze data
Develop Hub – Power BI
View published reports in Power BI workspace
Develop Hub – Power BI
Edit reports in Synapse workspace
Develop Hub – Power BI
Publish edited reports from the Synapse workspace to the Power BI workspace

 Publish changes by simply saving the report in the workspace
 Real-time publish on save
Synapse Studio
Orchestrate hub
Orchestrate Hub
Provides the ability to create pipelines to ingest, transform, and load data with 90+ built-in connectors.

 Offers a wide range of activities that a pipeline can perform.
Synapse Studio
Monitor hub
Monitor Hub
Overview

 Provides the ability to monitor orchestration, activities, and compute resources.
Monitoring Hub - Orchestration
Overview

 Monitor orchestration in the Synapse workspace for pipeline progress and status

Benefits

 Track all or specific pipelines
 Monitor pipeline run and activity run details
 Find the root cause of pipeline or activity failures
Monitoring Hub - Spark applications
Overview

 Monitor Spark pools and Spark applications for the progress and status of activities

Benefits

 Monitor Spark pool status: paused, active, resuming, scaling, upgrading
 Track the usage of resources
Synapse Studio
Manage hub
Manage Hub
Overview

 Provides the ability to manage linked services, orchestration, and security.
Manage – Linked services
Overview
Linked services define the connection information needed to connect to external resources.

Benefits
 Offers 90+ pre-built connectors
 Easy cross-platform data migration
 Represents data stores or compute resources
Manage – Access Control
Overview
Provides access control management over workspace resources and artifacts for admins and users.

Benefits
 Share the workspace with the team
 Increases productivity
 Manage permissions on code artifacts and Spark pools
Manage – Triggers
Overview
A trigger defines a unit of processing that determines when a pipeline execution needs to be kicked off.

Benefits
 Create and manage:
• Schedule triggers
• Tumbling window triggers
• Event triggers
 Control pipeline execution
Manage – Integration runtimes
Overview
Integration runtimes are the compute infrastructure used by pipelines to provide data integration capabilities across different network environments. An integration runtime provides the bridge between the activity and linked services.

Benefits
Offers the Azure Integration Runtime or the Self-Hosted Integration Runtime:

 Azure Integration Runtime – provides fully managed, serverless compute in Azure
 Self-Hosted Integration Runtime – uses compute resources on an on-premises machine or a VM inside a private network
Azure Synapse Analytics
Data Integration
Orchestration @ Scale

[Diagram: a Trigger starts a Pipeline composed of Activities, executed on a Self-hosted Integration Runtime or an Azure Integration Runtime]
Legend: Linked Service; Command and Control; Data Movement

Scalable
 Per-job elasticity
 Up to 4 GB/s

Simple
 Visually author, or via code (Python, .NET, etc.)
 Serverless, no infrastructure to manage

Access all your data
 90+ connectors provided and growing (cloud, on-premises, SaaS)
 Data Movement as a Service: 25 points of presence worldwide
 Self-hostable Integration Runtime for hybrid movement
90+ Connectors out of the box

Azure (15): Blob storage, Cosmos DB - SQL API, Cosmos DB - MongoDB API, Data Explorer, Data Lake Storage Gen1, Data Lake Storage Gen2, Database for MariaDB, Database for MySQL, Database for PostgreSQL, File Storage, SQL Database, SQL Database MI, SQL Data Warehouse, Search index, Table storage

Database & DW (26): Amazon Redshift, DB2, Drill, Google BigQuery, Greenplum, HBase, Hive, Apache Impala, Informix, MariaDB, Microsoft Access, MySQL, Netezza, Oracle, Phoenix, PostgreSQL, Presto, SAP BW Open Hub, SAP BW via MDX, SAP HANA, SAP table, Spark, SQL Server, Sybase, Teradata, Vertica

File Storage (6): Amazon S3, File system, FTP, Google Cloud Storage, HDFS, SFTP

File Formats (6): AVRO, Binary, Delimited Text, JSON, ORC, Parquet

NoSQL (3): Cassandra, Couchbase, MongoDB

Services and Apps (28): Amazon MWS, CDS for Apps, Concur, Dynamics 365, Dynamics AX, Dynamics CRM, Google AdWords, HubSpot, Jira, Magento, Marketo, Office 365, Oracle Eloqua, Oracle Responsys, Oracle Service Cloud, PayPal, QuickBooks, Salesforce, SF Service Cloud, SF Marketing Cloud, SAP C4C, SAP ECC, ServiceNow, Shopify, Square, Web table, Xero, Zoho

Generic (4): Generic HTTP, Generic OData, Generic ODBC, Generic REST
Pipelines
Overview

 Provide the ability to load data from a storage account to a desired linked service. Load data by manual pipeline execution or by orchestration.

Benefits

 Supports common loading patterns
 Fully parallel loading into data lake or SQL tables
 Graphical development experience
Prep & Transform Data
 Mapping Dataflow – code-free data transformation @ scale
 Wrangling Dataflow – code-free data preparation @ scale
Triggers
Overview

 Triggers represent a unit of processing that determines when a pipeline execution needs to be kicked off.
 Data Integration offers three trigger types:
1. Schedule – fires on a schedule defined by a start date, recurrence, and end date
2. Event – fires on a specified event
3. Tumbling window – fires at a periodic time interval from a specified start date, while retaining state
 Also provides the ability to monitor pipeline runs and control trigger execution.
Manage – Linked Services
Overview
Linked services define the connection information needed for a pipeline to connect to external resources.

Benefits
 Offers 85+ pre-built connectors
 Easy cross-platform data migration
 Represents data stores or compute resources
 NOTE: Linked services are all for Data Integration except for Power BI (eventually ADC, Databricks)
Manage – Integration runtimes
Overview
The compute infrastructure used by pipelines to provide data integration capabilities across different network environments. An integration runtime provides the bridge between the activity and linked services.

Benefits
Offers the Azure Integration Runtime or the Self-Hosted Integration Runtime:

 Azure Integration Runtime – provides fully managed, serverless compute in Azure
 Self-Hosted Integration Runtime – uses compute resources on an on-premises machine or a VM inside a private network
Azure Synapse Analytics
SQL Analytics
Platform: Performance
Overview
SQL Data Warehouse's industry-leading price-performance comes from leveraging the Azure ecosystem and core SQL Server engine improvements to produce massive gains in performance.

 These benefits require no customer configuration and are provided out of the box for every data warehouse:

• Gen2 adaptive caching – uses non-volatile memory solid-state drives (NVMe) to increase the I/O bandwidth available to queries
• Azure FPGA-accelerated networking enhancements – move data at rates of up to 1 GB/sec per node to improve queries
• Instant data movement – leverages multi-core parallelism in the underlying SQL Servers to move data efficiently between compute nodes
• Query optimization – ongoing investments in distributed query optimization
TPC-H 1 Petabyte Query Execution

Azure Synapse is the first and only analytics system to have run all TPC-H queries at 1-petabyte scale.

[Chart: query execution times for TPC-H queries 1-22]
Azure Synapse Analytics > SQL >

Comprehensive SQL functionality

Advanced storage system:
• Columnstore indexes
• Table partitions
• Distributed tables
• Isolation modes
• Materialized views
• Nonclustered indexes
• Result-set caching

T-SQL querying:
• Windowing aggregates
• Approximate execution (HyperLogLog)
• JSON data support

Complete SQL object model:
• Tables
• Views
• Stored procedures
• Functions
Azure Synapse Analytics > SQL >

Windowing functions

OVER clause
Defines a window, or specified set of rows, within a query result set, and computes a value for each row in the window.

Aggregate functions
COUNT, MAX, AVG, SUM, APPROX_COUNT_DISTINCT, MIN, STDEV, STDEVP, STRING_AGG, VAR, VARP, GROUPING, GROUPING_ID, COUNT_BIG, CHECKSUM_AGG

Ranking functions
RANK, NTILE, DENSE_RANK, ROW_NUMBER

Analytical functions
LAG, LEAD, FIRST_VALUE, LAST_VALUE, CUME_DIST, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK

ROWS | RANGE
PRECEDING, UNBOUNDED PRECEDING, CURRENT ROW, BETWEEN, FOLLOWING, UNBOUNDED FOLLOWING

-- Number salespeople by year-to-date sales within each postal code
SELECT ROW_NUMBER() OVER(PARTITION BY PostalCode ORDER BY SalesYTD DESC) AS "Row Number",
       LastName,
       SalesYTD,
       PostalCode
FROM Sales
WHERE SalesYTD <> 0
ORDER BY PostalCode;

Row Number  LastName            SalesYTD      PostalCode
1           Mitchell            4251368.5497  98027
2           Blythe              3763178.1787  98027
3           Carson              3189418.3662  98027
4           Reiter              2315185.611   98027
5           Vargas              1453719.4653  98027
6           Ansman-Wolfe        1352577.1325  98027
1           Pak                 4116870.2277  98055
2           Varkey Chudukaktil  3121616.3202  98055
3           Saraiva             2604540.7172  98055
4           Ito                 2458535.6169  98055
5           Valdez              1827066.7118  98055
6           Mensa-Annan         1576562.1966  98055
7           Campbell            1573012.9383  98055
8           Tsoflias            1421810.9242  98055
Azure Synapse Analytics > SQL >

Windowing functions (continued)

Analytical functions
LAG, LEAD, FIRST_VALUE, LAST_VALUE, CUME_DIST, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK

-- PERCENTILE_CONT, PERCENTILE_DISC
SELECT DISTINCT Name AS DepartmentName,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ph.Rate)
           OVER (PARTITION BY Name) AS MedianCont,
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY ph.Rate)
           OVER (PARTITION BY Name) AS MedianDisc
FROM HumanResources.Department AS d
INNER JOIN HumanResources.EmployeeDepartmentHistory AS dh
    ON dh.DepartmentID = d.DepartmentID
INNER JOIN HumanResources.EmployeePayHistory AS ph
    ON ph.BusinessEntityID = dh.BusinessEntityID
WHERE dh.EndDate IS NULL;

DepartmentName       MedianCont    MedianDisc
-------------------- ------------- -------------
Document Control     16.8269       16.8269
Engineering          34.375        32.6923
Executive            54.32695      48.5577
Human Resources      17.427850     16.5865

-- LAG function
SELECT BusinessEntityID,
       YEAR(QuotaDate) AS SalesYear,
       SalesQuota AS CurrentQuota,
       LAG(SalesQuota, 1, 0) OVER (ORDER BY YEAR(QuotaDate)) AS PreviousQuota
FROM Sales.SalesPersonQuotaHistory
WHERE BusinessEntityID = 275 AND YEAR(QuotaDate) IN ('2005','2006');

BusinessEntityID SalesYear   CurrentQuota          PreviousQuota
---------------- ----------- --------------------- ---------------------
275              2005        367000.00             0.00
275              2005        556000.00             367000.00
275              2006        502000.00             556000.00
275              2006        550000.00             502000.00
275              2006        1429000.00            550000.00
275              2006        1324000.00            1429000.00
Azure Synapse Analytics > SQL >

Windowing functions (continued)

ROWS | RANGE
PRECEDING, UNBOUNDED PRECEDING, CURRENT ROW, BETWEEN, FOLLOWING, UNBOUNDED FOLLOWING

-- FIRST_VALUE
SELECT JobTitle, LastName, VacationHours AS VacHours,
       FIRST_VALUE(LastName) OVER (PARTITION BY JobTitle
           ORDER BY VacationHours ASC ROWS UNBOUNDED PRECEDING) AS FewestVacHours
FROM HumanResources.Employee AS e
INNER JOIN Person.Person AS p
    ON e.BusinessEntityID = p.BusinessEntityID
ORDER BY JobTitle;

JobTitle                          LastName         VacHours   FewestVacHours
--------------------------------- ---------------- ---------- -------------------
Accountant                        Moreland         58         Moreland
Accountant                        Seamans          59         Moreland
Accounts Manager                  Liu              57         Liu
Accounts Payable Specialist       Tomic            63         Tomic
Accounts Payable Specialist       Sheperdigian     64         Tomic
Accounts Receivable Specialist    Poe              60         Poe
Accounts Receivable Specialist    Spoon            61         Poe
Accounts Receivable Specialist    Walton           62         Poe
Azure Synapse Analytics > SQL >

Approximate execution
HyperLogLog accuracy
Returns a result within 2% of the true cardinality on average.
E.g., if COUNT (DISTINCT) returns 1,000,000, HyperLogLog will return a value in the range of 999,736 to 1,016,234.

APPROX_COUNT_DISTINCT
Returns the approximate number of unique non-null values in a group.
Use Case: Approximating web usage trend behavior

-- Syntax
APPROX_COUNT_DISTINCT ( expression )

-- The approximate number of different order keys by order status from the orders table.
SELECT O_OrderStatus, APPROX_COUNT_DISTINCT(O_OrderKey) AS Approx_Distinct_OrderKey
FROM dbo.Orders
GROUP BY O_OrderStatus
ORDER BY O_OrderStatus;
Azure Synapse Analytics > SQL >

Approximate execution
[Chart: execution time of APPROX_COUNT_DISTINCT vs. COUNT DISTINCT]
Azure Synapse Analytics > SQL >

Group by options

Group by with ROLLUP
Creates a group for each combination of column expressions, rolls up the results into subtotals and grand totals, and calculates the aggregates of hierarchical data.

Grouping sets
Combine multiple GROUP BY clauses into one GROUP BY clause; the equivalent of a UNION ALL of the specified groups.

-- GROUP BY ROLLUP example
SELECT Country,
       Region,
       SUM(Sales) AS TotalSales
FROM Sales
GROUP BY ROLLUP (Country, Region);

-- GROUP BY GROUPING SETS example
SELECT Country,
       SUM(Sales) AS TotalSales
FROM Sales
GROUP BY GROUPING SETS ( Country, () );

-- Results
Country        Region            TotalSales
Canada         Alberta           100
Canada         British Columbia  500
Canada         NULL              600
United States  Montana           100
United States  NULL              100
NULL           NULL              700
Azure Synapse Analytics > SQL >

Snapshot isolation
Overview
Specifies that statements cannot read data that has been modified but not committed by other transactions. This prevents dirty reads.

ALTER DATABASE MyDatabase
SET ALLOW_SNAPSHOT_ISOLATION ON;

ALTER DATABASE MyDatabase
SET READ_COMMITTED_SNAPSHOT ON;

Isolation levels
• READ COMMITTED
• REPEATABLE READ
• SERIALIZABLE
• READ UNCOMMITTED

READ_COMMITTED_SNAPSHOT
OFF (default) – uses shared locks to prevent other transactions from modifying rows while running a read operation.
ON – uses row versioning to present each statement with a transactionally consistent snapshot of the data as it existed at the start of the statement. Locks are not used to protect the data from updates.
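A quick way to verify the current settings; a minimal sketch using the flags that sys.databases exposes:

-- Check snapshot isolation settings for a database
SELECT name,
       snapshot_isolation_state_desc,
       is_read_committed_snapshot_on
FROM sys.databases
WHERE name = 'MyDatabase';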
Azure Synapse Analytics > SQL >

JSON data support – insert JSON data

Overview
The JSON format enables representation of complex or hierarchical data structures in tables. JSON data is stored using standard NVARCHAR table columns.

Benefits
 Transform arrays of JSON objects into table format
 Performance optimization using clustered columnstore indexes and memory-optimized tables

-- Create table with column for JSON string
CREATE TABLE CustomerOrders
(
    CustomerId BIGINT NOT NULL,
    Country NVARCHAR(150) NOT NULL,
    OrderDetails NVARCHAR(3000) NOT NULL -- NVARCHAR column for JSON
) WITH (DISTRIBUTION = ROUND_ROBIN);

-- Populate table with semi-structured data
INSERT INTO CustomerOrders
VALUES
( 101,        -- CustomerId
  'Bahrain',  -- Country
  N'[{ "StoreId": "AW73565",
       "Order": { "Number":"SO43659",
                  "Date":"2011-05-31T00:00:00" },
       "Item": { "Price":2024.40, "Quantity":1 }
  }]'         -- OrderDetails
);
Azure Synapse Analytics > SQL >

JSON data support – read JSON data

Overview
Read JSON data stored in a string column with the following:
• ISJSON – verify if text is valid JSON
• JSON_VALUE – extract a scalar value from a JSON string
• JSON_QUERY – extract a JSON object or array from a JSON string

Benefits
 Ability to get standard columns as well as JSON columns
 Perform aggregation and filtering on JSON values

-- Return all rows with valid JSON data
SELECT CustomerId, OrderDetails
FROM CustomerOrders
WHERE ISJSON(OrderDetails) > 0;

CustomerId  OrderDetails
101         N'[{ "StoreId": "AW73565", "Order": { "Number":"SO43659", "Date":"2011-05-31T00:00:00" }, "Item": { "Price":2024.40, "Quantity":1 }}]'

-- Extract values from JSON string (array paths, since the column stores a JSON array)
SELECT CustomerId,
       Country,
       JSON_VALUE(OrderDetails,'$[0].StoreId') AS StoreId,
       JSON_QUERY(OrderDetails,'$[0].Item') AS ItemDetails
FROM CustomerOrders;

CustomerId  Country  StoreId  ItemDetails
101         Bahrain  AW73565  { "Price":2024.40, "Quantity":1 }
Azure Synapse Analytics > SQL >

JSON data support – modify and operate on JSON data

Overview
Use standard table columns and values from JSON text in the same analytical query. Modify JSON data with the following:
• JSON_MODIFY – modifies a value in a JSON string
• OPENJSON – converts a JSON collection to a set of rows and columns

Benefits
 Flexibility to update a JSON string using T-SQL
 Convert hierarchical data into a flat tabular structure

-- Modify Item Quantity value
UPDATE CustomerOrders
SET OrderDetails = JSON_MODIFY(OrderDetails, '$[0].Item.Quantity', 2);

OrderDetails
N'[{ "StoreId": "AW73565", "Order": { "Number":"SO43659", "Date":"2011-05-31T00:00:00" }, "Item": { "Price":2024.40, "Quantity": 2}}]'

-- Convert JSON collection to rows and columns
SELECT CustomerId,
       StoreId,
       OrderDetails.OrderDate,
       OrderDetails.OrderPrice
FROM CustomerOrders
CROSS APPLY OPENJSON (CustomerOrders.OrderDetails)
WITH ( StoreId VARCHAR(50) '$.StoreId',
       OrderNumber VARCHAR(100) '$.Order.Number',
       OrderDate DATETIME '$.Order.Date',
       OrderPrice DECIMAL(10,2) '$.Item.Price',
       OrderQuantity INT '$.Item.Quantity'
) AS OrderDetails;

CustomerId  StoreId  OrderDate            OrderPrice
101         AW73565  2011-05-31T00:00:00  2024.40
Azure Synapse Analytics > SQL >

Stored procedures
Overview
A stored procedure is a group of one or more SQL statements, or a reference to a Microsoft .NET Framework common language runtime (CLR) method. Promotes flexibility and modularity. Supports parameters and nesting.

Benefits
 Reduced server/client network traffic, improved performance
 Stronger security
 Easy maintenance

CREATE PROCEDURE HumanResources.uspGetAllEmployees
AS
SET NOCOUNT ON;
SELECT LastName, FirstName, JobTitle, Department
FROM HumanResources.vEmployeeDepartment;
GO

-- Execute a stored procedure
EXECUTE HumanResources.uspGetAllEmployees;
GO
-- Or
EXEC HumanResources.uspGetAllEmployees;
GO
-- Or, if this procedure is the first statement within a batch:
HumanResources.uspGetAllEmployees;
Azure Synapse Analytics
Data Storage and Performance Optimizations
Database tables
 Columnar storage and columnar ordering
 Table partitioning and hash distribution
 Nonclustered indexes

Optimized storage benefits
 Reduced migration risk
 Less data scanned
 Smaller cache required
 Smaller clusters
 Faster queries
Azure Synapse Analytics > SQL >

Tables – Indexes

Clustered columnstore index (default, primary)
 Highest level of data compression
 Best overall query performance

Clustered index (primary)
 Performant for looking up a single to few rows

Heap (primary)
 Faster loading and landing temporary data
 Best for small lookup tables

Nonclustered indexes (secondary)
 Enable ordering of multiple columns in a table
 Allow multiple nonclustered indexes on a single table
 Can be created on any of the above primary indexes
 More performant lookup queries

-- Create table with index
CREATE TABLE orderTable
(
    OrderId INT NOT NULL,
    Date DATE NOT NULL,
    Name VARCHAR(2),
    Country VARCHAR(2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX |
    HEAP |
    CLUSTERED INDEX (OrderId)
);

-- Add non-clustered index to table
CREATE INDEX NameIndex ON orderTable (Name);
Azure Synapse Analytics > SQL >

SQL Analytics columnstore tables

[Diagram: logical table structure vs. clustered columnstore index (OrderId) vs. clustered/non-clustered rowstore index (OrderId)]

Clustered columnstore index:
• Data is stored in compressed columnstore segments after being sliced into groups of rows (rowgroups/micro-partitions) for maximum compression
• Rows are stored in the delta rowstore until the number of rows is large enough to be compressed into the columnstore

Rowstore indexes:
• Data is stored in a B-tree index structure for performant lookup queries for particular rows
• Clustered rowstore index: the leaf nodes in the structure store the data values in a row
• Non-clustered (secondary) rowstore index: the leaf nodes store pointers to the data values, not the values themselves
Azure Synapse Analytics > SQL >

Ordered clustered columnstore indexes

Overview
Queries against tables with ordered columnstore segments can take advantage of improved segment elimination to drastically reduce the time needed to service a query.

-- Create table with ordered columnstore index
CREATE TABLE sortedOrderTable
(
    OrderId INT NOT NULL,
    Date DATE NOT NULL,
    Name VARCHAR(2),
    Country VARCHAR(2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX ORDER (OrderId)
);

-- Insert data into table with ordered columnstore index
INSERT INTO sortedOrderTable
VALUES (1, '01-01-2019', 'Dave', 'UK');

-- Create ordered clustered columnstore index on existing table
CREATE CLUSTERED COLUMNSTORE INDEX cciOrderId
ON dbo.OrderTable ORDER (OrderId);
Azure Synapse Analytics > SQL >

Tables – Distributions

Round-robin distributed
Distributes table rows evenly across all distributions at random.

Hash distributed
Distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.

Replicated
Full copy of the table accessible on each Compute node.

CREATE TABLE dbo.OrderTable
(
    OrderId INT NOT NULL,
    Date DATE NOT NULL,
    Name VARCHAR(2),
    Country VARCHAR(2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([OrderId]) |
                   ROUND_ROBIN |
                   REPLICATE
);
Azure Synapse Analytics > SQL >

Tables – Partitions

Overview
 Table partitions divide data into smaller groups
 In most cases, partitions are created on a date column
 Supported on all table types
 RANGE RIGHT – used for time partitions
 RANGE LEFT – used for number partitions

Benefits
 Improves the efficiency and performance of loading and querying by limiting the scope to a subset of data (see the partition-switching sketch below)
 Offers significant query performance enhancements where filtering on the partition key can eliminate unnecessary scans and IO

CREATE TABLE partitionedOrderTable
(
    OrderId INT NOT NULL,
    Date DATE NOT NULL,
    Name VARCHAR(2),
    Country VARCHAR(2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([OrderId]),
    PARTITION (
        [Date] RANGE RIGHT FOR VALUES (
            '2000-01-01', '2001-01-01', '2002-01-01',
            '2003-01-01', '2004-01-01', '2005-01-01'
        )
    )
);
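Partition switching is one way the loading benefit shows up in practice. A minimal sketch, assuming a hypothetical staging table with an identical schema, distribution, and partition scheme:

-- Swap a fully loaded staging partition into the production table
ALTER TABLE dbo.stagingOrderTable
SWITCH PARTITION 2 TO dbo.partitionedOrderTable PARTITION 2;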
Azure Synapse Analytics > SQL >

Tables – Distributions & Partitions

[Diagram: logical table structure mapped to physical data distribution, hash-distributed on OrderId with date partitions]

• Rows are hash-distributed on OrderId across 60 distributions (shards)
• Each shard is partitioned with the same date partitions
• A minimum of 1 million rows per distribution and partition is needed for optimal compression and performance of clustered columnstore tables
Azure Synapse Analytics > SQL >

Common table distribution methods

Fact – Use hash-distribution with a clustered columnstore index. Performance improves because hashing enables the platform to localize certain operations within the node itself during query execution. Operations that benefit:
    COUNT(DISTINCT( <hashed_key> ))
    OVER (PARTITION BY <hashed_key>)
    most JOIN <table_name> ON <hashed_key>
    GROUP BY <hashed_key>

Dimension – Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.

Staging – Use round-robin for the staging table. The load with CTAS is faster. Once the data is in the staging table, use INSERT…SELECT to move the data to production tables (see the sketch below).
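A minimal sketch of that staging pattern; the table names are hypothetical, with ext.Orders standing in for an external table over the data lake:

-- Load into a round-robin staging table with CTAS
CREATE TABLE dbo.Stage_Orders
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT * FROM ext.Orders;

-- Move the staged rows into the hash-distributed production table
INSERT INTO dbo.Fact_Orders
SELECT * FROM dbo.Stage_Orders;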
Views

 Database views
 Materialized views

Best-in-class price performance
Interactive dashboarding with materialized views:
- Automatic data refresh and maintenance
- Automatic query rewrites to improve performance
- Built-in advisor
Azure Synapse Analytics > SQL >

Materialized views

Overview
A materialized view pre-computes, stores, and maintains its data like a table. Materialized views are automatically updated when data in the underlying tables changes; this is a synchronous operation that occurs as soon as the data is changed. The auto-caching functionality allows the Azure Synapse Analytics query optimizer to consider using the view even if the view is not referenced in the query.
Supported aggregations: MAX, MIN, AVG, COUNT, COUNT_BIG, SUM, VAR, STDEV

Benefits
 Automatic and synchronous data refresh with data changes in base tables; no user action is required
 Same high availability and resiliency as regular tables

-- Create materialized view
CREATE MATERIALIZED VIEW Sales.vw_Orders
WITH
(
    DISTRIBUTION = ROUND_ROBIN | HASH(ProductID)
)
AS
SELECT SUM(UnitPrice*OrderQty) AS Revenue,
       OrderDate,
       ProductID,
       COUNT_BIG(*) AS OrderCount
FROM Sales.SalesOrderDetail
GROUP BY OrderDate, ProductID;
GO

-- Disable the view and put it in suspended mode
ALTER INDEX ALL ON Sales.vw_Orders DISABLE;
-- Re-enable the view by rebuilding it
ALTER INDEX ALL ON Sales.vw_Orders REBUILD;
Azure Synapse Analytics > SQL >

Materialized views - example

In this example, a query to get the year total sales per customer is shown to have a lot of data shuffles and joins that contribute to slow performance. No relevant materialized views exist on the data warehouse.

Execution time: 103 seconds

-- Get year total sales per customer
WITH year_total AS
(
    SELECT customer_id,
           first_name,
           last_name,
           birth_country,
           login,
           email_address,
           d_year,
           SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
    FROM customer cust
    JOIN catalog_sales sales ON cust.sk = sales.sk
    JOIN date_dim ON sales.sold_date = date_dim.date
    GROUP BY customer_id, first_name, last_name, birth_country, login, email_address, d_year
)
SELECT TOP 100 …
FROM year_total …
WHERE …
ORDER BY …
Azure Synapse Analytics > SQL >

Materialized views - example

Now, we add a materialized view to the data warehouse to increase the performance of the previous query. This view can be leveraged by the query even though it is not directly referenced. The original query is unchanged.

-- Create materialized view with hash distribution on customer_id
CREATE MATERIALIZED VIEW nbViewCS WITH (DISTRIBUTION = HASH(customer_id)) AS
SELECT customer_id,
       first_name,
       last_name,
       birth_country,
       login,
       email_address,
       d_year,
       SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
FROM customer cust
JOIN catalog_sales sales ON cust.sk = sales.sk
JOIN date_dim ON sales.sold_date = date_dim.date
GROUP BY customer_id, first_name, last_name, birth_country, login, email_address, d_year;
Azure Synapse Analytics > SQL >

Materialized views - example

The SQL Data Warehouse query optimizer automatically leverages the materialized view to speed up the same query, reducing the data shuffles and joins needed. Notice that the query does not need to reference the view directly, and no changes have been made to the query (it is identical to the original query above).

Execution time: 6 seconds
Azure Synapse Analytics > SQL >

Materialized views - recommendations

EXPLAIN – provides the query plan for a SQL Data Warehouse SQL statement without running the statement; view the estimated cost of the query operations.

EXPLAIN WITH_RECOMMENDATIONS – provides the query plan with recommendations to optimize the SQL statement performance.

EXPLAIN WITH_RECOMMENDATIONS
SELECT COUNT(*)
FROM (
    (SELECT DISTINCT c_last_name, c_first_name, d_date
     FROM store_sales, date_dim, customer
     WHERE store_sales.ss_sold_date_sk = date_dim.d_date_sk
       AND store_sales.ss_customer_sk = customer.c_customer_sk
       AND d_month_seq BETWEEN 1194 AND 1194+11)
    EXCEPT
    (SELECT DISTINCT c_last_name, c_first_name, d_date
     FROM catalog_sales, date_dim, customer
     WHERE catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
       AND catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
       AND d_month_seq BETWEEN 1194 AND 1194+11)
) top_customers;
T-SQL Language – COPY statement

Heterogeneous data preparation and ingestion: streaming ingestion from Event Hubs and IoT Hub, and file ingestion from Azure Data Lake, into the data warehouse via SQL Analytics.

COPY statement
- Simplified permissions (no CONTROL required)
- No need for external tables
- Standard CSV support (i.e. custom row terminators, escape delimiters, SQL dates)
- User-driven file selection (wildcard support)

-- Copy files in parallel directly into a data warehouse table
COPY INTO [dbo].[weatherTable]
FROM 'abfss://<storageaccount>.blob.core.windows.net/<filepath>'
WITH (
    FILE_FORMAT = 'DELIMITEDTEXT',
    SECRET = CredentialObject
);
Azure Synapse Analytics > SQL >

COPY command

Overview
Copies data from a source to a destination.

Benefits
 Retrieves data from all files in a folder and all its subfolders
 Supports multiple locations from the same storage account, separated by commas
 Supports Azure Data Lake Storage (ADLS) Gen2 and Azure Blob Storage
 Supports CSV, PARQUET, and ORC file formats

COPY INTO test_1
FROM 'https://XXX.blob.core.windows.net/customerdatasets/test_1.txt'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>'),
    FIELDQUOTE = '"',
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0X0A',
    ENCODING = 'UTF8',
    DATEFORMAT = 'ymd',
    MAXERRORS = 10,
    ERRORFILE = '/errorsfolder/', -- path starting from the storage container
    IDENTITY_INSERT = 'OFF'
);

COPY INTO test_parquet
FROM 'https://XXX.blob.core.windows.net/customerdatasets/test.parquet'
WITH (
    FILE_FORMAT = myFileFormat,
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>')
);
Data Flexibility – Parquet Direct
Overview

[Diagram: dashboards, reports, and ad-hoc analytics send queries to the Control node, which fans out to Compute nodes over Parquet data in storage and returns the result]

Best-in-class price performance; interactive dashboarding with resultset caching:
- Millisecond responses with resultset caching
- Cache survives pause/resume/scale operations
- Fully managed cache (1 TB in size)

ALTER DATABASE <DBNAME> SET RESULT_SET_CACHING ON;
Azure Synapse Analytics > SQL >

Result-set caching

Overview
 Caches the results of a query in DW storage. This enables interactive response times for repetitive queries against tables with infrequent data changes.
 The result-set cache persists even if the data warehouse is paused and resumed later.
 The query cache is invalidated and refreshed when the underlying table data or query code changes.
 The result cache is evicted regularly based on a time-aware least recently used (TLRU) algorithm.

Benefits
 Enhances performance when the same result is requested repetitively
 Reduces load on the server for repeated queries
 Offers monitoring of query execution with a result cache hit or miss

-- Turn on/off result-set caching for a database
-- Must be run on the MASTER database
ALTER DATABASE {database_name}
SET RESULT_SET_CACHING { ON | OFF };

-- Turn on/off result-set caching for a client session
-- Run on target data warehouse
SET RESULT_SET_CACHING { ON | OFF };

-- Check result-set caching setting for a database
-- Run on target data warehouse
SELECT is_result_set_caching_on
FROM sys.databases
WHERE name = {database_name};

-- Return all query requests with cache hits
-- Run on target data warehouse
SELECT *
FROM sys.dm_pdw_request_steps
WHERE command LIKE '%DWResultCacheDb%'
  AND step_index = 0;
Azure Synapse Analytics > SQL >

Result-set caching flow

1. Client sends a query to the DW.
2. The query is processed using DW compute nodes, which pull data from remote storage, process the query, and output the result back to the client app.
3. Query results are cached in remote storage so subsequent requests can be served immediately.
4. Subsequent executions of the same query bypass the compute nodes and are fetched instantly from the persistent cache in remote storage.
5. The remote storage cache is evicted regularly based on time, cache usage, and any modifications to the underlying table data; the cache must be regenerated if query results have been evicted.
Azure Synapse Analytics > SQL >

Resource classes

Overview
Pre-determined resource limits defined for a user or role.

Benefits
 Govern the system memory assigned to each query
 Effectively used to control the number of concurrent queries that can run on a data warehouse

Exemptions from the concurrency limit:
 CREATE|ALTER|DROP (TABLE|USER|PROCEDURE|VIEW|LOGIN)
 CREATE|UPDATE|DROP (STATISTICS|INDEX)
 SELECT from system views and DMVs
 EXPLAIN
 Result-set cache
 TRUNCATE TABLE
 ALTER AUTHORIZATION

/* View resource classes in the data warehouse */
SELECT name
FROM sys.database_principals
WHERE name LIKE '%rc%' AND type_desc = 'DATABASE_ROLE';

/* Change a user's resource class to 'largerc' */
EXEC sp_addrolemember 'largerc', 'loaduser';

/* Decrease the loading user's resource class */
EXEC sp_droprolemember 'largerc', 'loaduser';
Azure Synapse Analytics > SQL >

Resource class types

Static resource classes
Allocate the same amount of memory independent of the current service-level objective (SLO). Well-suited for fixed data sizes and loading jobs.
staticrc10 | staticrc20 | staticrc30 | staticrc40 | staticrc50 | staticrc60 | staticrc70 | staticrc80

Dynamic resource classes
Allocate a variable amount of memory depending on the current SLO. Well-suited for growing or variable datasets. All users default to the smallrc dynamic resource class.
smallrc | mediumrc | largerc | xlargerc

Resource class   Percentage memory   Max. concurrent queries
smallrc          3%                  32
mediumrc         10%                 10
largerc          22%                 4
xlargerc         70%                 1
Azure Synapse Analytics > SQL >

Concurrency slots (@DW1000c: 40 concurrency slots)

Overview
Queries running on a DW compete for access to system resources (CPU, IO, and memory). To guarantee access to resources, each running query is assigned a chunk of system memory (concurrency slots) for processing. The amount given is determined by the resource class of the user executing the query. Higher DW SLOs provide more memory and concurrency slots.

Memory (concurrency slots) per query:
 smallrc: 1 slot each
 staticrc20: 2 slots each
 mediumrc: 4 slots each
 xlargerc: 28 slots each
Azure Synapse Analytics > SQL >

Concurrent query limits (@DW1000c: 32 max concurrent queries, 40 slots)

Overview
The limit on how many queries can run at the same time is governed by two properties (a query sketch for checking assignments follows below):
• The max concurrent query count for the DW SLO
• The total available memory (concurrency slots) for the DW SLO

Increase the concurrent query limit by:
• Scaling up to a higher DW SLO (up to 128 concurrent queries)
• Using lower resource classes that use less memory per query

Example: 15 concurrent queries using all 40 slots
• 8 x smallrc (1 slot each)
• 4 x staticrc20 (2 slots each)
• 2 x mediumrc (4 slots each)
• 1 x staticrc50 (16 slots each)
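A minimal sketch for checking which resource class each active request was granted, using the resource_class column that sys.dm_pdw_exec_requests exposes:

-- Show resource class assignments for active requests
SELECT request_id, [status], resource_class
FROM sys.dm_pdw_exec_requests
WHERE [status] NOT IN ('Completed','Failed','Cancelled');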
Azure Synapse Analytics > SQL >

Workload management
Overview
Manages resources, ensures highly efficient resource utilization, and maximizes return on investment (ROI). The three pillars of workload management are:
1. Workload classification – assign a request to a workload group and set importance levels
2. Workload importance – influence the order in which a request gets access to resources
3. Workload isolation – reserve resources for a workload group
Azure Synapse Analytics > SQL >

Workload classification
Overview
Map queries to allocations of resources via pre-determined rules. Use with workload importance to effectively share resources across different workload types. If a query request is not matched to a classifier, it is assigned to the default workload group (smallrc resource class).

Benefits
 Map queries to both resource management and workload isolation concepts
 Manage groups of users with only a few classifiers

CREATE WORKLOAD CLASSIFIER classifier_name
WITH
(
    [WORKLOAD_GROUP = '<Resource Class>' ]
    [IMPORTANCE = { LOW | BELOW_NORMAL | NORMAL | ABOVE_NORMAL | HIGH } ]
    [MEMBERNAME = 'security_account']
)

WORKLOAD_GROUP: maps to an existing resource class
IMPORTANCE: specifies the relative importance of a request
MEMBERNAME: database user, role, AAD login, or AAD group

Monitoring DMVs
 sys.workload_management_workload_classifiers
 sys.workload_management_workload_classifier_details
Query these DMVs to view details about all active workload classifiers.
Azure Synapse Analytics > SQL >

Workload importance
Overview
 Queries past the concurrency limit enter a FIFO queue
 By default, queries are released from the queue on a first-in, first-out basis as resources become available
 Workload importance allows higher-priority queries to receive resources immediately, regardless of their position in the queue

Example
 State analysts have normal importance; the national analyst is assigned high importance
 State analyst queries execute in order of arrival
 When the national analyst's query arrives, it jumps to the top of the queue

CREATE WORKLOAD CLASSIFIER National_Analyst
WITH
(
    WORKLOAD_GROUP = 'smallrc',
    IMPORTANCE = HIGH,
    MEMBERNAME = 'National_Analyst_Login'
);
Workload-aware query execution: intra-cluster workload isolation (scale in)

 Multiple workloads share deployed resources (e.g., on a 1000c DWU instance: Sales reserved 60%, Marketing 40%)
 Reservation or shared resource configuration
 Online changes to workload policies

CREATE WORKLOAD GROUP Sales
WITH
(
    MIN_PERCENTAGE_RESOURCE = 60,
    CAP_PERCENTAGE_RESOURCE = 100,
    MAX_CONCURRENCY = 6
);
Azure Synapse Analytics > SQL >

Workload isolation
Overview
Allocates fixed resources to a workload group. Assign maximum and minimum usage for varying resources under load. These adjustments can be done live, without having to take SQL Analytics offline.

Benefits
 Reserve resources for a group of requests
 Limit the amount of resources a group of requests can consume
 Shared resources are accessed based on importance level
 Set a query timeout value; get DBAs out of the business of killing runaway queries

CREATE WORKLOAD GROUP group_name
WITH
(
    MIN_PERCENTAGE_RESOURCE = value
    , CAP_PERCENTAGE_RESOURCE = value
    , REQUEST_MIN_RESOURCE_GRANT_PERCENT = value
    [ [ , ] REQUEST_MAX_RESOURCE_GRANT_PERCENT = value ]
    [ [ , ] IMPORTANCE = { LOW | BELOW_NORMAL | NORMAL | ABOVE_NORMAL | HIGH } ]
    [ [ , ] QUERY_EXECUTION_TIMEOUT_SEC = value ]
)[ ; ]

[Diagram: resource allocation – group A 40%, group B 20%, shared 40%]

Monitoring DMV
 sys.workload_management_workload_groups – query to view configured workload groups
Azure Synapse Analytics > SQL >

Dynamic Management Views (DMVs)

Overview
Dynamic Management Views (DMVs) are queries that return information about model objects, server operations, and server health.

Benefits
 Simple SQL syntax
 Returns results in table format
 Easier to read and copy results
Azure Synapse Analytics > SQL >

SQL monitoring with DMVs

Overview
Offers monitoring of:
- all open and closed sessions
- session counts by user
- completed query counts by user
- all active and completed queries
- longest-running queries
- memory consumption

-- Count sessions by user
SELECT login_name, COUNT(*) AS session_count
FROM sys.dm_pdw_exec_sessions
WHERE status = 'Closed' AND session_id <> session_id()
GROUP BY login_name;

-- List all open sessions
SELECT *
FROM sys.dm_pdw_exec_sessions
WHERE status <> 'Closed' AND session_id <> session_id();

-- List all active queries
SELECT *
FROM sys.dm_pdw_exec_requests
WHERE status NOT IN ('Completed','Failed','Cancelled')
  AND session_id <> session_id()
ORDER BY submit_time DESC;
Azure Synapse Analytics > SQL >

Developer tools

 Azure Synapse Analytics Studio – Azure cloud service; offers an end-to-end lifecycle for analytics; connects to multiple services
 Azure Data Studio – runs on Windows, Linux, macOS; lightweight editor (queries, extensions, etc.)
 SQL Server Management Studio – runs on Windows; offers GUI support to query, design, and manage (queries, execution plans, etc.)
 Visual Studio – SSDT database projects – runs on Windows; create and maintain database code, compile, code refactoring
 Visual Studio Code – runs on Windows, Linux, macOS; development experience with a lightweight code editor
Azure Synapse Analytics > SQL >

Continuous integration and delivery (CI/CD)


Overview
Database project support in SQL Server Data Tools
(SSDT) allows teams of developers to collaborate over a
version-controlled data warehouse, and track, deploy
and test schema changes.

Benefits
Database project support includes first-class
integration with Azure DevOps. This adds support for:
• Azure Pipelines to run CI/CD workflows for any
platform (Linux, macOS, and Windows)
• Azure Repos to store project files in source control
• Azure Test Plans to run automated check-in tests to
verify schema updates and modifications
• Growing ecosystem of third-party integrations that
can be used to complement existing workflows
(Timetracker, Microsoft Teams, Slack, Jenkins, etc.)
Azure Synapse Analytics > SQL >

Azure Advisor recommendations

 Suboptimal table distribution – reduce data movement by replicating tables
 Data skew – choose a new hash-distribution key; the slowest distribution limits performance
 Cache misses – provision additional capacity
 Tempdb contention – scale or update the user resource class
 Suboptimal plan selection – create or update table statistics (see the sketch below)
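The statistics recommendation maps directly to T-SQL. A minimal sketch, reusing the hypothetical dbo.OrderTable from earlier slides:

-- Create statistics on a commonly filtered column
CREATE STATISTICS stats_Date ON dbo.OrderTable (Date);

-- Refresh statistics after significant data changes
UPDATE STATISTICS dbo.OrderTable;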
Azure Synapse Analytics > SQL >

Maintenance windows
Overview
Choose a time window for your upgrades.
Select a primary and secondary window within a seven-day
period.
Windows can be from 3 to 8 hours.
24-hour advance notification for maintenance events.

Benefits
Ensure upgrades happen on your schedule.
Predictable planning for long-running jobs.
Stay informed of start and end of maintenance.
Azure Synapse Analytics > SQL >

Automatic statistics management

Overview
 Statistics are automatically created and maintained for SQL pool. Incoming queries are analyzed, and individual column statistics are generated on the columns that improve cardinality estimates, to enhance query performance.
 Statistics are automatically updated as data modifications occur in underlying tables. By default, these updates are synchronous, but they can be configured to be asynchronous.
 Statistics are considered out of date when:
• There was a data change on an empty table
• The number of rows in the table at the time of statistics creation was 500 or less, and more than 500 rows have been updated
• The number of rows in the table at the time of statistics creation was more than 500, and more than 500 + 20% of rows have been updated

-- Turn on/off auto-create statistics settings
ALTER DATABASE {database_name}
SET AUTO_CREATE_STATISTICS { ON | OFF };

-- Turn on/off auto-update statistics settings
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS { ON | OFF };

-- Configure synchronous/asynchronous update
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS_ASYNC { ON | OFF };

-- Check statistics settings for a database
SELECT is_auto_create_stats_on,
       is_auto_update_stats_on,
       is_auto_update_stats_async_on
FROM sys.databases;
T-SQL Language – Native SQL streaming

Heterogeneous data preparation and ingestion: streaming ingestion from Event Hubs and IoT Hub directly into the data warehouse via SQL Analytics.

- High-throughput ingestion (up to 200 MB/sec)
- Delivery latencies in seconds
- Ingestion throughput scales with compute scale
- Analytics capabilities (SQL-based queries for joins, aggregations, filters)
- Removes the need to use Spark for streaming
Machine Learning-enabled DW: create, upload, and score models via SQL Analytics. Model + Data = Predictions.

T-SQL Language – Native PREDICT-ion
- T-SQL based experience (interactive/batch scoring)
- Interoperability with models built elsewhere
- Execute scoring where the data lives, in the data warehouse

-- T-SQL syntax for scoring data in SQL DW
SELECT d.*, p.Score
FROM PREDICT(MODEL = @onnx_model, DATA = dbo.mytable AS d)
WITH (Score FLOAT) AS p;
SQL Analytics
Data Lake Integration

ParquetDirect for interactive data lake exploration (13X faster)
- >10X performance improvement
- Full columnar optimizations (optimizer, batch)
- Built-in transparent caching (SSD, in-memory, resultset)
Azure Data Share

Enterprise data sharing
- Share from DW to DW/DB/other systems
- Choose the data format to receive data in (CSV, Parquet)
- One-to-many data sharing
- Share a single dataset or multiple datasets
SQL Analytics
new features available

GA features:
- Performance: Resultset caching
- Performance: Materialized Views
- Performance: Ordered columnstore
- Heterogeneous data: JSON support
- Trustworthy computation: Dynamic Data Masking
- Continuous integration & deployment: SSDT support
- Language: Read committed snapshot isolation

Public preview features:
- Workload management: Workload Isolation
- Data ingestion: Simple ingestion with COPY (see the sketch after this list)
- Data Sharing: Share DW data with Azure Data Share
- Trustworthy computation: Private LINK support

Private preview features:
- Data ingestion: Streaming ingestion & analytics in DW
- Built-in ML: Native Prediction/Scoring
- Data lake enabled: Fast query over Parquet files
- Language: Updateable distribution column
- Language: FROM clause with joins
- Language: Multi-column distribution support
- Security: Column-level Encryption

Note: private preview features require whitelisting
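Of the items above, the COPY statement collapses external-table setup into a single loading command. A hedged sketch, assuming a pre-created dbo.StagingSales table and a SAS-secured storage account (the account, container, and token are hypothetical):

-- Load CSV files from blob storage into a staging table with one statement
COPY INTO dbo.StagingSales
FROM 'https://myaccount.blob.core.windows.net/sales/2019/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas_token>'),
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0A',
    FIRSTROW = 2
);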
Power BI Aggregations and Synapse query performance
Azure Synapse Analytics
SQL On-Demand
Query Options

1. Provisioned SQL over relational database - Traditional SQL DW [existing]
2. Provisioned SQL over ADLS Gen2 - via external tables or openrowset [existing via PolyBase]
3. On-demand SQL over relational database - dependency on the flexible data model (data cells) over columnstore data (preview) [new]
4. On-demand SQL over ADLS Gen2 - via external tables or openrowset [new]
5. Provisioned Spark over relational database - Not possible
6. Provisioned Spark over ADLS Gen2 [new]
7. On-demand Spark over relational database - On-demand Spark is not supported
8. On-demand Spark over ADLS Gen2 - On-demand Spark is not supported

Notes:
• Separation of state (data, metadata and transactional logs) and compute
• Queries against data loaded into SQL Analytics tables are 2-3X faster than queries over external tables
• Improved performance compared to PolyBase. PolyBase is not used, but functional aspects are supported
• SQL on-demand will push down queries from the front-end to back-end nodes
• Warm-up for the first on-demand query takes about 20-25 seconds
• If you create a Spark Table, that table will be created as an external table in SQL Pool or On-Demand without having to keep a Spark cluster up and running
Distributed Query Processor (DQP)

• Auto-scale compute nodes - Instructs the underlying fabric when more compute power is needed to adjust to peaks in the workload. If compute power is granted, the Polaris DQP re-distributes tasks to leverage the new compute container. Note that in-flight tasks in the previous topology continue running, while new queries benefit from the new compute power after re-balancing
• Compute node fault tolerance - Recovers from faulty nodes while a query is running. If a node fails, the DQP re-schedules the tasks of the faulted node across the remainder of the healthy topology
• Compute node hot spots: rebalance queries or scale out nodes - The DQP can detect hot spots in the existing topology, that is, compute nodes overloaded due to data skew. In the event of a compute node running hot because of skewed tasks, the DQP can decide to re-schedule some of the tasks assigned to that compute node among others where the load is lower
• Multi-cluster - Multiple compute pools accessing the same data
• Cross-database queries - A query can specify multiple databases (see the sketch below)

These features work for both on-demand and provisioned over ADLS Gen2 and relational databases
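Cross-database queries use ordinary three-part names. A minimal hedged sketch (the SalesDb and RefDb database and table names are hypothetical, not from the deck):

-- Hypothetical cross-database join; each table lives in a different database
SELECT s.OrderID, r.CountryName
FROM SalesDb.dbo.Orders AS s
JOIN RefDb.dbo.Countries AS r
    ON s.CountryCode = r.CountryCode;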
Synapse SQL on-demand scenarios

What's in this file? How many rows are there? What's the max value?
SQL On-Demand reduces data lake exploration to a right-click!

How to convert CSVs to Parquet quickly? How to transform the raw data?
Use the full power of T-SQL to transform the data in the data lake.
SQL On-Demand

Overview
An interactive query service that provides T-SQL queries over high scale data in Azure Storage.

Benefits
Serverless
No infrastructure
Pay only for query execution
No ETL
Offers security
T-SQL syntax to query data
Supports data in various formats (Parquet, CSV, JSON)
Support for BI ecosystem
Data integration with Databricks, HDInsight

Clients such as Power BI, Azure Data Studio, and SSMS query through SQL On-Demand, which syncs table definitions with SQL DW and reads and writes data files in Azure Storage to curate and transform data.
SQL On Demand – Querying on storage
SQL On Demand – Querying CSV File

Overview
Uses OPENROWSET function to access data

Benefits
Ability to read CSV files in several layouts:
- no header row, Windows-style new line
- no header row, Unix-style new line
- header row, Unix-style new line
- header row, Unix-style new line, quoted
- header row, Unix-style new line, escape
- header row, Unix-style new line, tab-delimited
- without specifying all columns

SELECT *
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/population/population.csv',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
)
WITH (
    [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
    [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
    [year] smallint,
    [population] bigint
) AS [r]
WHERE
    country_name = 'Luxembourg'
    AND year = 2017
SQL On Demand – Querying CSV File

Read CSV file - header row, Unix-style new line

SELECT *
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/population-unix-hdr/population.csv',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0a',
    FIRSTROW = 2
)
WITH (
    [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
    [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
    [year] smallint,
    [population] bigint
) AS [r]
WHERE
    country_name = 'Luxembourg'
    AND year = 2017

Read CSV file - without specifying all columns

SELECT
    COUNT(DISTINCT country_name) AS countries
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/population/population.csv',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
)
WITH (
    [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2 2
) AS [r]
SQL On Demand – Querying folders

Overview
Uses OPENROWSET function to access data from multiple files or folders

Benefits
Offers reading multiple files/folders through usage of wildcards
Offers reading a specific file/folder
Supports use of multiple wildcards

SELECT
    YEAR(pickup_datetime) AS [year],
    SUM(passenger_count) AS passengers_total,
    COUNT(*) AS [rides_total]
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/taxi/*.*',
    FORMAT = 'CSV',
    FIRSTROW = 2
)
WITH (
    vendor_id VARCHAR(100) COLLATE Latin1_General_BIN2,
    pickup_datetime DATETIME2,
    dropoff_datetime DATETIME2,
    passenger_count INT,
    trip_distance FLOAT,
    rate_code INT,
    store_and_fwd_flag VARCHAR(100) COLLATE Latin1_General_BIN2,
    pickup_location_id INT,
    dropoff_location_id INT,
    payment_type INT,
    fare_amount FLOAT,
    extra FLOAT,
    mta_tax FLOAT,
    tip_amount FLOAT,
    tolls_amount FLOAT,
    improvement_surcharge FLOAT,
    total_amount FLOAT
) AS nyc
GROUP BY YEAR(pickup_datetime)
ORDER BY YEAR(pickup_datetime)
SQL On Demand – Querying folders

Read all files from multiple folders

SELECT
    YEAR(pickup_datetime) AS [year],
    SUM(passenger_count) AS passengers_total,
    COUNT(*) AS [rides_total]
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/t*i/',
    FORMAT = 'CSV',
    FIRSTROW = 2
)
WITH (
    vendor_id VARCHAR(100) COLLATE Latin1_General_BIN2,
    pickup_datetime DATETIME2,
    dropoff_datetime DATETIME2,
    passenger_count INT,
    trip_distance FLOAT,
    <... columns>
) AS nyc
GROUP BY YEAR(pickup_datetime)
ORDER BY YEAR(pickup_datetime)

Read subset of files in folder

SELECT
    payment_type,
    SUM(fare_amount) AS fare_total
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-*.csv',
    FORMAT = 'CSV',
    FIRSTROW = 2
)
WITH (
    vendor_id VARCHAR(100) COLLATE Latin1_General_BIN2,
    pickup_datetime DATETIME2,
    dropoff_datetime DATETIME2,
    passenger_count INT,
    trip_distance FLOAT,
    <... columns>
) AS nyc
GROUP BY payment_type
ORDER BY payment_type
SQL On Demand – Querying specific files

Overview
filename - Provides the file name that originates the row in the result
filepath - Provides the full path when no parameter is passed, or the part of the path matching the corresponding wildcard when a parameter is passed

Benefits
Provides the source name/path of the file/folder for each row in the result set

Example of filename function

SELECT
    r.filename() AS [filename],
    COUNT_BIG(*) AS [rows]
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-1*.csv',
    FORMAT = 'CSV',
    FIRSTROW = 2
)
WITH (
    vendor_id INT,
    pickup_datetime DATETIME2,
    dropoff_datetime DATETIME2,
    passenger_count SMALLINT,
    trip_distance FLOAT,
    <... columns>
) AS [r]
GROUP BY r.filename()
ORDER BY [filename]
SQL On Demand – Querying specific files

Example of filepath function

SELECT
    r.filepath() AS filepath,
    r.filepath(1) AS [year],
    r.filepath(2) AS [month],
    COUNT_BIG(*) AS [rows]
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_*-*.csv',
    FORMAT = 'CSV',
    FIRSTROW = 2
)
WITH (
    vendor_id INT,
    pickup_datetime DATETIME2,
    dropoff_datetime DATETIME2,
    passenger_count SMALLINT,
    trip_distance FLOAT,
    <... columns>
) AS [r]
WHERE
    r.filepath(1) IN ('2017')
    AND r.filepath(2) IN ('10', '11', '12')
GROUP BY r.filepath(), r.filepath(1), r.filepath(2)
ORDER BY filepath

Result:

filepath                                                               year  month  rows
https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-10.csv 2017  10     9768815
https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-11.csv 2017  11     9284803
https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-12.csv 2017  12     9508276
SQL On Demand – Querying Parquet files

Overview
Uses OPENROWSET function to access data

Benefits
Ability to specify column names of interest
Offers auto reading of column names and data types (see the sketch below)
Provides targeting of specific partitions using the filepath function

SELECT
    YEAR(pickup_datetime),
    passenger_count,
    COUNT(*) AS cnt
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/parquet/taxi/*/*/*',
    FORMAT = 'PARQUET'
)
WITH (
    pickup_datetime DATETIME2,
    passenger_count INT
) AS nyc
GROUP BY
    passenger_count,
    YEAR(pickup_datetime)
ORDER BY
    YEAR(pickup_datetime),
    passenger_count
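Because Parquet files carry their own schema, the WITH clause can also be omitted entirely, and column names and data types are inferred. A hedged sketch against the same hypothetical path:

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/parquet/taxi/*/*/*',
    FORMAT = 'PARQUET'
) AS nyc;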
SQL On Demand – Creating views

Overview
Create views using SQL On Demand queries

Benefits
Works the same as standard views

USE [mydbname]
GO

IF EXISTS(SELECT * FROM sys.views WHERE name = 'populationView')
    DROP VIEW populationView
GO

CREATE VIEW populationView AS
SELECT *
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/population/population.csv',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
)
WITH (
    [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
    [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
    [year] smallint,
    [population] bigint
) AS [r]

SELECT
    country_name, population
FROM populationView
WHERE
    [year] = 2019
ORDER BY
    [population] DESC
SQL On Demand – Querying JSON files

Overview
Reads JSON files and provides data in tabular format

Benefits
Supports OPENJSON, JSON_VALUE and JSON_QUERY functions (an OPENJSON sketch follows the examples below)

SELECT *
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/json/books/book1.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
)
WITH (
    jsonContent varchar(8000)
) AS [r]
SQL On Demand – Querying JSON files

Example of JSON_VALUE function

SELECT
    JSON_VALUE(jsonContent, '$.title') AS title,
    JSON_VALUE(jsonContent, '$.publisher') AS publisher,
    jsonContent
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/json/books/*.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
)
WITH (
    jsonContent varchar(8000)
) AS [r]
WHERE
    JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'

Example of JSON_QUERY function

SELECT
    JSON_QUERY(jsonContent, '$.authors') AS authors,
    jsonContent
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/json/books/*.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
)
WITH (
    jsonContent varchar(8000)
) AS [r]
WHERE
    JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
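OPENJSON, listed among the supported functions above, shreds a document into rows and columns in one step instead of extracting values one by one. A hedged sketch over the same hypothetical books files:

SELECT b.title, b.publisher
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/json/books/*.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
)
WITH (
    jsonContent varchar(8000)
) AS [r]
CROSS APPLY OPENJSON(jsonContent)
WITH (
    title VARCHAR(200) '$.title',
    publisher VARCHAR(200) '$.publisher'
) AS b;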
Create External Table As Select (CETAS)

Overview
Creates an external table and exports the results of the SELECT statement to storage. These operations import the data into the database only for the duration of the query.

Steps:
1. Create Master Key
2. Create Credentials
3. Create External Data Source
4. Create External File Format
5. Create External Table

-- Create a database master key if one does not already exist
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo';

-- Create a database scoped credential with Azure storage account key as the secret.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = '<my_account>',
    SECRET = '<azure_storage_account_key>';

-- Create an external data source with CREDENTIAL option.
CREATE EXTERNAL DATA SOURCE MyAzureStorage
WITH (
    LOCATION = 'wasbs://[email protected]/',
    CREDENTIAL = AzureStorageCredential,
    TYPE = HADOOP
);

-- Create an external file format
CREATE EXTERNAL FILE FORMAT MyAzureCSVFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS(
          FIELD_TERMINATOR = ',',
          FIRST_ROW = 2)
);

-- Create an external table
CREATE EXTERNAL TABLE dbo.FactInternetSalesNew
WITH (
    LOCATION = '/files/Customer',
    DATA_SOURCE = MyAzureStorage,
    FILE_FORMAT = MyAzureCSVFormat
)
AS SELECT T1.* FROM dbo.FactInternetSales T1 JOIN dbo.DimCustomer T2
ON ( T1.CustomerKey = T2.CustomerKey )
OPTION ( HASH JOIN );
SQL scripts > View and export results
SQL scripts > View results (chart)
Convert from CSV to Parquet on-demand
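The conversion shown above follows the CETAS pattern. A hedged sketch, assuming an external data source named MyDataLake has already been created as on the earlier CETAS slide (all object names here are hypothetical):

-- Parquet output format
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- Write the query result back to the data lake as Parquet files
CREATE EXTERNAL TABLE curated.population
WITH (
    LOCATION = '/curated/population/',
    DATA_SOURCE = MyDataLake,        -- hypothetical, pre-created data source
    FILE_FORMAT = ParquetFormat
)
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/population/population.csv',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ','
)
WITH (
    [country_code] VARCHAR (5),
    [country_name] VARCHAR (100),
    [year] smallint,
    [population] bigint
) AS [r];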
Azure Synapse Analytics
Spark
Azure Synapse Apache Spark - Summary

• Apache Spark 2.4 derivation
• Linux Foundation Delta Lake 0.4 support
• .NET Core 3.0 support
• Python 3.6 + Anaconda support
• Tightly coupled to other Azure Synapse services
  • Integrated security and sign-on
  • Integrated metadata
  • Integrated and simplified provisioning
  • Integrated UX including nteract-based notebooks
  • Fast load of SQL Analytics pools
• Multi-language support: .NET (C#), PySpark, Scala, Spark SQL, Java

• Core scenarios
  • Data prep / data engineering / ETL
  • Machine learning via Spark ML and Azure ML integration
• Extensible through library management
• Efficient resource utilization
• Fast start
• Auto scale (up and down)
• Auto pause
• Min cluster size of 3 nodes
Languages

Overview
Supports multiple languages for developing notebooks:
• PySpark (Python)
• Spark (Scala)
• .NET Spark (C#)
• Spark SQL
• Java
• R (early 2020)

Benefits
Allows writing multiple languages in one notebook using magic commands:
%%<name of language>
Offers use of temporary tables across languages
Notebooks > Configure Session
Apache Spark
A unified, open source, parallel, data processing framework for Big Data Analytics

Spark unifies:
• Batch processing (Spark Core Engine)
• SQL queries (Spark SQL)
• Stream processing (Spark Structured Streaming)
• Machine learning (Spark MLlib)
• Graph computation (Spark GraphX)

The Spark Core Engine typically runs on YARN.
http://spark.apache.org
Motivation for Apache Spark
Traditional approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing
involve lots of (slow) disk I/O: each iteration reads its input from HDFS and writes its output back to
HDFS before the next iteration can start.

Solution: keep data in-memory with a new distributed execution engine. After a single HDFS read, each
iteration chains its output directly into the next job's input with minimal read/write to disk, which is
10-100x faster than going through network and disk.
What makes Spark fast
[Diagram: MapReduce repeatedly reads from and writes to HDFS between stages, while Spark reads from HDFS once and keeps intermediate data in memory]
General Spark Cluster Architecture

Driver Program (SparkContext) -> Cluster Manager -> Worker Nodes (each with executors and cache) -> Data Sources (HDFS, SQL, NoSQL, ...)
Spark Component Features
Spark SQL, Spark Streaming, MLlib/SparkML, GraphX
Azure Synapse Apache Spark
Architecture Overview

• User creates a Synapse Workspace and Spark pool and launches Synapse Studio.
• User attaches a Notebook to the Spark pool and enters one or more Spark statements (code blocks).
• The Notebook client gets a user token from AAD and sends a Spark session create request to the Synapse Gateway.
• The Synapse Gateway authenticates the request, validates authorizations on the Workspace and Spark pool, and forwards it to the Spark (Livy) controller hosted in the Synapse Job Service frontend.
• The Job Service frontend forwards the request to the Job Service backend, which creates two jobs: one for creating the cluster and the other for creating the Spark session.
• The Job Service backend contacts the Synapse Resource Provider to obtain Workspace and Spark pool details and delegates the cluster creation request to the Synapse Instance Service.
• Once the instance is created, the Job Service backend forwards the Spark session creation request to the Livy endpoint in the cluster.
• Once the Spark session is created, the Notebook client sends Spark statements to the Job Service frontend.
• The Job Service frontend obtains the actual Livy endpoint of the cluster created for the particular user from the backend and sends the statement directly to Livy for execution.
Synapse Spark Instances

1. The Synapse Job Service sends a request to the Cluster Service (control plane) to create BBC clusters per the description in the associated Spark pool.
2. The Cluster Service sends a request to Azure using the Azure SDK to create VMs (required plus additional) with a specialized VHD in the subnet.
3. The specialized VHD contains bits for all the services required by the cluster type (e.g., Spark: YARN Resource Manager and Node Managers, Livy, Zookeeper, Hive Metastore, Spark executors) with prefetch instrumentation.
4. Once a VM boots up, its Node Agent sends a heartbeat to the Cluster Service to get the node configuration.
5. The nodes are initialized and assigned roles based on their first heartbeat.
6. Extra nodes get deleted on first heartbeat.
7. After the Cluster Service considers the cluster ready, it returns the Livy endpoint to the Job Service.
Creating a Spark pool (1 of 2)
Only required field from user; default settings

Creating a Spark pool (2 of 2) - optional
Customize component versions and auto-pause
Import libraries by providing a text file containing library name and version
Existing Approach: JDBC
1. JDBC to open connection
2. Apply any filters/projections
3. Spark reads the data serially

New Approach: JDBC and PolyBase
1. JDBC to issue CETAS + send filters/projections
2. DW applies the filters/projections and exports the data in parallel to the user-provisioned workspace-default data lake
3. Spark reads the data in parallel
Code-Behind Experience

Existing Approach:

val jdbcUsername = "<SQL DB ADMIN USER>"
val jdbcPwd = "<SQL DB ADMIN PWD>"
val jdbcHostname = "servername.database.windows.net"
val jdbcPort = 1433
val jdbcDatabase = "<AZURE SQL DB NAME>"

val jdbc_url =
  s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=60;"

val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPwd}")

val sqlTableDf = spark.read.jdbc(jdbc_url, "dbo.Tbl1", connectionProperties)

New Approach:

// Construct a Spark DataFrame from SQL Pool
var df = spark.read.sqlanalytics("sql1.dbo.Tbl1")

// Write the Spark DataFrame into SQL Pool
df.write.sqlanalytics("sql1.dbo.Tbl2")
Create Notebook on files in storage
View results in table format
SQL support
View results in chart format
Exploratory data analysis with graphs (histogram, boxplot, etc.)
Library Management - Python

Overview
Customers can add new Python libraries at the Spark pool level.
In the portal, specify the new requirements while creating the Spark pool in the Additional Settings blade.

Benefits
Input requirements.txt in simple pip freeze format (see the sample below)
Add new libraries to your cluster
Update versions of existing libraries on your cluster
Libraries get installed for your Spark pool during cluster creation
Ability to specify a different requirements file for different pools within the same workspace

Constraints
The library version must exist on the PyPI repository
Version downgrade of an existing library is not allowed
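The requirements file is plain pip freeze output, one pinned library per line; a two-line hypothetical sample (library versions here are illustrative only):

matplotlib==3.1.1
numpy==1.17.2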
Library Management - Python
Get list of installed libraries with version information
Spark ML Algorithms

Synapse Notebook: Connect to AML workspace
Simple code to connect to the AML workspace
Synapse Notebook: Configure AML job to run on Synapse
Configuration parameters

Synapse Notebook: Run AML job
ML job execution result
Industry-leading security and compliance

Enterprise-grade security
Industry-leading compliance

Compliance certifications include: ISO 27001, SOC 1 Type 2, SOC 2 Type 2, PCI DSS Level 1, Cloud Controls Matrix, ISO 27018, Content Delivery and Security Association, Shared Assessments, FedRAMP JAB P-ATO, HIPAA/HITECH, FIPS 140-2, 21 CFR Part 11, FERPA, DISA Level 2, CJIS, IRS 1075, ITAR-ready, Section 508 VPAT, European Union Model Clauses, EU Safe Harbor, United Kingdom G-Cloud, China Multi Layer Protection Scheme, China GB 18030, China CCCPPF, Singapore MTCS Level 3, Australian Signals Directorate, New Zealand GCIO, Japan Financial Services, ENISA IAF
Comprehensive Security

Category           Features
Data Protection    Data in Transit; Data Encryption at Rest; Data Discovery and Classification
Access Control     Object Level Security (Tables/Views); Row Level Security; Column Level Security; Dynamic Data Masking
Authentication     SQL Login; Azure Active Directory; Multi-Factor Authentication
Network Security   Virtual Networks; Firewall; Azure ExpressRoute
Threat Protection  Threat Detection; Auditing; Vulnerability Assessment
Threat Protection - Business requirements

How do we enumerate and track potential SQL vulnerabilities?
To mitigate any security misconfigurations before they become a serious issue.

How do we discover and alert on suspicious database activity?
To detect and resolve any data exfiltration or SQL injection attacks.
SQL auditing in Azure Log Analytics and Event Hubs

Gain insight into the database audit log:
• Configurable via audit policy
• SQL audit logs can reside in an Azure Storage account (Blob Storage), Azure Log Analytics, or Azure Event Hubs
• Rich set of tools (Log Analytics, Power BI dashboards) for investigating security alerts and tracking access to sensitive data

(1) Turn on SQL Auditing
(2) Analyze the audit log
SQL threat detection

Detect and investigate anomalous database activity:
• Detects potential SQL injection attacks
• Detects unusual access and data exfiltration activities
• Actionable alerts to investigate and remediate
• View alerts for your entire Azure tenant using Azure Security Center

(1) Turn on Threat Detection
(2) A possible threat to access/breach data occurs
(3) Real-time actionable alerts
SQL Data Discovery & Classification

Discover, classify, protect and track access to sensitive data:
• Automatic discovery of columns with sensitive data
• Add persistent sensitive data labels
• Audit and detect access to the sensitive data
• Manage labels for your entire Azure tenant using Azure Security Center
SQL Data Discovery & Classification - setup
Step 1: Enable Advanced Data Security on the logical SQL Server.
Step 2: Use recommendations and/or manual classification to classify all the sensitive columns in your tables.

SQL Data Discovery & Classification - audit sensitive data access
Step 1: Configure auditing for your target data warehouse. This can be configured for just a single data warehouse or all databases on a server.
Step 2: Navigate to the audit logs in the storage account and download the 'xel' log files to a local machine.
Step 3: Open the logs using the extended events viewer in SSMS. Configure the viewer to include the 'data_sensitivity_information' column.
Network Security - Business requirements

How do we implement network isolation?
Data at different levels of security needs to be accessed from different locations.

How do we achieve separation?
Disallowing access to entities outside the company's network security boundary.
Azure networking: application-access patterns

Access to Synapse Analytics comes from users on the Internet and from your virtual network (FrontEnd, Mid-tier, BackEnd tiers).

Access to/from Internet: Service Endpoints, DDoS protection, Web application firewall, Azure Firewall, Network virtual appliances
Backend connectivity and private traffic: ExpressRoute, VPN Gateways, Network security groups (NSGs), Application security groups (ASGs), User-defined routes (UDRs)
Securing with firewalls

Overview
By default, all access to your Azure Synapse Analytics is blocked by the firewall.
The firewall also manages virtual network rules that are based on virtual network service endpoints.
Server-level firewall rules check whether the client IP address is in an allowed range; if not, the connection fails.

Rules
Allow a specific IP address or a range of whitelisted IP addresses.
Allow Azure applications to connect.
Firewall configuration on the portal

By default, Azure blocks all external connections to port 1433.

Configure with the following steps:
Azure Synapse Analytics Resource: Server name > Firewalls and virtual networks
Firewall configuration using REST API

Managing firewall rules through the REST API must be authenticated. For information, see Authenticating Service Management Requests.

Server-level rules can be created, updated, or deleted using the REST API.

To create or update a server-level firewall rule, execute the PUT method:

PUT https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Sql/servers/{serverName}/firewallRules/{firewallRuleName}?api-version=2014-04-01

REQUEST BODY
{
  "properties": {
    "startIpAddress": "0.0.0.3",
    "endIpAddress": "0.0.0.3"
  }
}

To remove an existing server-level firewall rule, execute the DELETE method:

DELETE https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Sql/servers/{serverName}/firewallRules/{firewallRuleName}?api-version=2014-04-01

To list firewall rules, execute the GET method:

GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Sql/servers/{serverName}/firewallRules/{firewallRuleName}?api-version=2014-04-01
Firewall configuration using PowerShell/T-SQL

Windows PowerShell Azure cmdlets:

# PS Allow external IP access to SQL DW
PS C:\> New-AzureRmSqlServerFirewallRule `
    -ResourceGroupName "myResourceGroup" `
    -ServerName $servername `
    -FirewallRuleName "AllowSome" `
    -StartIpAddress "0.0.0.0" `
    -EndIpAddress "0.0.0.0"

Transact-SQL:

-- T-SQL Allow external IP access to SQL DW
EXECUTE sp_set_firewall_rule
    @name = N'ContosoFirewallRule',
    @start_ip_address = '192.168.1.1',
    @end_ip_address = '192.168.1.10'
VNET configuration on Azure portal

Configure with the following steps:
Azure Synapse Analytics Resource: Server name > Firewalls and virtual networks
REST API and PowerShell alternatives are available.

Note:
By default, VMs on your subnets cannot communicate with your SQL Data Warehouse.
There must first be a virtual network service endpoint for the rule to reference.
Authentication - Business requirements

How do I configure Azure Active Directory with Azure Synapse Analytics?
I want additional control in the form of multi-factor authentication.

How do I allow non-Microsoft accounts to be able to authenticate?
Azure Active Directory authentication

Overview
Manage user identities in one location.
Enable access to Azure Synapse Analytics and other Microsoft services with Azure Active Directory user identities and groups.

Benefits
Alternative to SQL Server authentication
Limits proliferation of user identities across databases
Allows password rotation in a single place
Enables management of database permissions by using external Azure Active Directory groups
Eliminates the need to store passwords
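Once an Azure AD admin is set on the server, AAD users and groups are added with familiar T-SQL. A minimal hedged sketch (the principal name is hypothetical):

-- Create a database user for an Azure AD user or group
CREATE USER [analysts@contoso.com] FROM EXTERNAL PROVIDER;

-- Grant permissions the same way as for SQL users
EXEC sp_addrolemember 'db_datareader', 'analysts@contoso.com';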
Azure Active Directory trust architecture

On-premises Active Directory federates through ADFS with Azure Active Directory, which in turn authenticates access to Azure Synapse Analytics.
Client tools (SQL Server Management Suite, SQL Server Data Tools, ADO.NET 4.6, custom apps) authenticate using the Azure Active Directory Authentication Library for SQL Server (ADALSQL).
SQL authentication
Overview
This authentication method uses a username and
password.
When you created the logical server for your data
warehouse, you specified a "server admin" login with a
username and password.
Using these credentials, you can authenticate to any
database on that server as the database owner.
Furthermore, you can create user logins and roles with
familiar SQL syntax.
-- Connect to master database and create a login
CREATE LOGIN ApplicationLogin WITH PASSWORD = 'Str0ng_password';
CREATE USER ApplicationUser FOR LOGIN ApplicationLogin;

-- Connect to SQL DW database and create a database user
CREATE USER DatabaseUser FOR LOGIN ApplicationLogin;
Access Control - Business requirements

How do I restrict access to sensitive data to only specific database users?

How do I ensure users only have access to relevant data?
For example, in a hospital only medical staff should be allowed to see patient data that is relevant to them, and not every patient's data.
Object-level security (tables, views, and more)

Overview
GRANT controls permissions on designated tables, views, stored procedures, and functions.
Prevents unauthorized queries against certain tables.
Simplifies design and implementation of security at the database level as opposed to the application level.

-- Grant SELECT permission to user RosaQdM on table Person.Address in the AdventureWorks2012 database
GRANT SELECT ON OBJECT::Person.Address TO RosaQdM;
GO

-- Grant REFERENCES permission on column BusinessEntityID in view HumanResources.vEmployee to user Wanida
GRANT REFERENCES(BusinessEntityID) ON OBJECT::HumanResources.vEmployee TO Wanida WITH GRANT OPTION;
GO

-- Grant EXECUTE permission on stored procedure HumanResources.uspUpdateEmployeeHireInfo to an application role called Recruiting11
USE AdventureWorks2012;
GRANT EXECUTE ON OBJECT::HumanResources.uspUpdateEmployeeHireInfo TO Recruiting11;
GO
Row-level security (RLS)

Overview
Fine-grained access control of specific rows in a database table.
Helps prevent unauthorized access when multiple users share the same tables.
Eliminates the need to implement connection filtering in multi-tenant applications.
Administer via SQL Server Management Studio or SQL Server Data Tools.
Enforcement logic lives inside the database and is schema bound to the table.
Row-level security
Creating policies

Filter predicates silently filter the rows available to read operations (SELECT, UPDATE, and DELETE).
The following example demonstrates the use of the CREATE SECURITY POLICY syntax:

-- Create a new schema and predicate function, which will use the application user ID
-- stored in CONTEXT_INFO to filter rows.
CREATE FUNCTION rls.fn_securitypredicate (@AppUserId int)
    RETURNS TABLE
    WITH SCHEMABINDING
AS
    RETURN (
        SELECT 1 AS fn_securitypredicate_result
        WHERE
            DATABASE_PRINCIPAL_ID() = DATABASE_PRINCIPAL_ID('dbo') -- application context
            AND CONTEXT_INFO() = CONVERT(VARBINARY(128), @AppUserId));
GO

-- The following syntax creates a security policy with a filter predicate for the Customer table
CREATE SECURITY POLICY [FederatedSecurityPolicy]
ADD FILTER PREDICATE [rls].[fn_securitypredicate]([CustomerId])
ON [dbo].[Customer];
Row-level security
Three steps:
1. Policy manager creates a filter predicate and security policy in T-SQL, binding the predicate to the Patients table.
2. App user (e.g., nurse) selects from the Patients table.
3. Security policy transparently rewrites the query to apply the filter predicate.

CREATE FUNCTION dbo.fn_securitypredicate(@wing int)
RETURNS TABLE WITH SCHEMABINDING AS
return SELECT 1 as [fn_securitypredicate_result] FROM
    StaffDuties d INNER JOIN Employees e
    ON (d.EmpId = e.EmpId)
    WHERE e.UserSID = SUSER_SID() AND @wing = d.Wing;

CREATE SECURITY POLICY dbo.SecPol
ADD FILTER PREDICATE dbo.fn_securitypredicate(Wing) ON Patients
WITH (STATE = ON)

The application's query

SELECT * FROM Patients

is transparently rewritten as

SELECT * FROM Patients
SEMIJOIN APPLY dbo.fn_securitypredicate(patients.Wing);

which executes as

SELECT Patients.* FROM Patients,
    StaffDuties d INNER JOIN Employees e ON (d.EmpId = e.EmpId)
WHERE e.UserSID = SUSER_SID() AND Patients.wing = d.Wing;
Column-level security

Overview
Controls access to specific columns in a database table based on the customer's group membership or execution context.
Simplifies the design and implementation of security by putting restriction logic in the database tier as opposed to the application tier.
Administer via the GRANT T-SQL statement.
Both Azure Active Directory (AAD) and SQL authentication are supported.
Column-level security
Three steps:
1. Policy manager creates a permission policy in T-SQL, binding the policy to the Patients table on a specific group.
2. App user (for example, a nurse) selects from the Patients table.
3. Permission policy prevents access to sensitive data.

CREATE TABLE Patients (
    PatientID int IDENTITY,
    FirstName varchar(100) NULL,
    SSN char(9) NOT NULL,
    LastName varchar(100) NOT NULL,
    Phone varchar(12) NULL,
    Email varchar(100) NULL
);

-- Allow 'Nurse' to access all columns except for the sensitive SSN column
GRANT SELECT ON Patients (
    PatientID, FirstName, LastName, Phone, Email
) TO Nurse;

Queries executed as 'Nurse' will fail if they include the SSN column:

SELECT * FROM Membership;

Msg 230, Level 14, State 1, Line 12
The SELECT permission was denied on the column 'SSN' of the object 'Membership', database 'CLS_TestDW', schema 'dbo'.
Data Protection - Business requirements

How do I protect sensitive data against unauthorized (high-privileged) users?
What key management options do I have?
Dynamic Data Masking

Overview
Prevent abuse of sensitive data by hiding it from users.
Easy configuration in the Azure Portal.
Policy-driven at table and column level, for a defined set of users.
Data masking applied in real time to query results based on policy.
Multiple masking functions available, such as full or partial, for various sensitive data categories (credit card numbers, SSN, etc.).

Example: with partial masking on Table.CreditCardNo, a stored value such as 4465-6571-7868-5796 is returned to non-privileged users as XXXX-XXXX-XXXX-5796.
Dynamic Data Masking
Three steps:
1. Security officer defines a dynamic data masking policy in T-SQL over sensitive data in the Employee table, using the built-in masking functions (default, email, random).
2. The app user selects from the Employee table.
3. The dynamic data masking policy obfuscates the sensitive data in the query results for non-privileged users.

ALTER TABLE [Employee]
ALTER COLUMN [SocialSecurityNumber]
ADD MASKED WITH (FUNCTION = 'DEFAULT()')

ALTER TABLE [Employee]
ALTER COLUMN [Email]
ADD MASKED WITH (FUNCTION = 'EMAIL()')

ALTER TABLE [Employee]
ALTER COLUMN [Salary]
ADD MASKED WITH (FUNCTION = 'RANDOM(1,20000)')

GRANT UNMASK to admin1

SELECT [First Name],
       [Social Security Number],
       [Email],
       [Salary]
FROM [Employee]
Types of data encryption

In transit: Transport Layer Security (TLS 1.2) from the client to the server. Protects data between client and server against snooping and man-in-the-middle attacks.
At rest: Transparent Data Encryption (TDE) for Azure Synapse Analytics protects data on the disk (database files, backups, transaction log, TempDB). Keys can be service or user managed; with service-managed keys, key management is handled by Azure, which makes it easier to obtain compliance.
In use: column encryption of customer data.
Transparent data encryption (TDE)

Overview
All customer data encrypted at rest.
TDE performs real-time I/O encryption and decryption of the data and log files.
Service- or user-managed keys.
Application changes are kept to a minimum.
Transparent encryption/decryption of data in a TDE-enabled client driver.
Compliant with many laws, regulations, and guidelines established across various industries.

USE master;
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<UseStrongPasswordHere>';
GO
CREATE CERTIFICATE MyServerCert WITH SUBJECT = 'My DEK Certificate';
GO
USE MyDatabase;
GO
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_128
ENCRYPTION BY SERVER CERTIFICATE MyServerCert;
GO
ALTER DATABASE MyDatabase
SET ENCRYPTION ON;
GO
Transparent data encryption (TDE) with Key Vault

Benefits with user-managed keys
Assume more control over who has access to your data and when.
Highly available and scalable cloud-based key store.
Central key management that allows separation of key management and data.
Configurable via Azure Portal, PowerShell, and REST API.
Single Sign-On

Implicit authentication - User provides login credentials once to access the Azure Synapse Workspace.

AAD authentication - Azure Synapse Studio requests a token to access each linked service as the user. A separate token is acquired for each of the following services:
1. ADLS Gen2
2. Azure Synapse Analytics
3. Power BI
4. Spark - Spark Livy API
5. management.azure.com - resource provisioning
6. Develop artifacts - dev.workspace.net
7. Graph endpoints

MSI authentication - Orchestration uses MSI auth for automation.
Azure Synapse Analytics
Connected Services
Azure Synapse Analytics
Limitless analytics service with unmatched time to insight

Unified platform and experience: Synapse Studio provides integration, management, monitoring, and security over the analytics runtimes (SQL and Spark) and Azure Data Lake Storage, connecting on-premises data, cloud data, and SaaS data with Azure Machine Learning and Power BI.
Azure Machine Learning

Overview
Data scientists can use Azure ML notebooks to do (distributed) data preparation on Synapse Spark compute.

Benefits
Connect to your existing Azure ML workspace and project
Use the AutoML Classifier for classification or regression problems
Train the model
Access open datasets
Azure Machine Learning (continued)
Power BI

Overview
Power BI is a business analytics service that delivers insights to enable fast, informed decisions.

Benefits
Create Power BI reports in the workspace
Access published reports in the workspace
Update reports in real time from the Synapse workspace and have the changes reflected in the Power BI service
Visually explore and analyze data
Migration Path
SQL DW – All of the data warehousing features that were generally available in Azure SQL Data Warehouse (intelligent
workload management, dynamic data masking, materialized views, etc.) continue to be generally available today. Businesses
can continue running their existing data warehouse workloads in production today with Azure Synapse and will automatically
benefit from the new capabilities which are in preview (unified experience with Azure Synapse studio, query-as-a-service,
built-in data integration, integrated Apache Spark, etc.) once they become generally available in 2020 and can use them in
production if they choose to do so. Customers will not have to migrate any workloads.

Azure Data Factory - Continue using Azure Data Factory. When the new functionality of data integration within Azure Synapse
becomes generally available, we will provide the capability to import your Azure Data Factory pipelines into Azure Synapse.
Your existing Azure Data Factory accounts and pipelines will work with Azure Synapse if you choose not to import them into
the Azure Synapse workspace. Note that Azure-SSIS Integration Runtime (IR) will not be supported in Synapse

Power BI – Customers link to a Power BI workspace within Azure Synapse Studio so no migration needed

ADLS Gen2 – Customers link to ADLS Gen2 within Azure Synapse Studio so no migration needed

Azure Databricks – TBD

Azure HDInsight - The Spark runtime within the Azure Synapse service is different from HDInsight
Q&A
James Serra, Big Data Evangelist
Email me at: [email protected]
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)