Azure Synapse
James Serra
Data & AI Architect
Microsoft, NYC MTC
[email protected]
Blog: JamesSerra.com
About Me
Microsoft, Big Data Evangelist
In IT for 30 years, worked on many BI and DW projects
Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
Been perm employee, contractor, consultant, business owner
Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
Blog at JamesSerra.com
Former SQL Server MVP
Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
Introduction
Studio
Data Integration
SQL Analytics
Data Storage and Performance Optimizations
SQL On-Demand
Spark
Security
Connected Services
Azure Synapse Analytics is a limitless analytics service that brings together
enterprise data warehousing and Big Data analytics. It gives you the freedom
to query data on your terms, using either serverless on-demand or provisioned
resources, at scale. Azure Synapse brings these two worlds together with a
unified experience to ingest, prepare, manage, and serve data for immediate
business intelligence and machine learning needs.
Azure Synapse – SQL Analytics
focus areas
• Up to 94% less expensive than competitors
• Defense-in-depth security and a 99.9% financially backed availability SLA
• Manage heterogeneous workloads through workload priorities and isolation
• Ingest a variety of data sources to derive maximum benefit. Query all data.
• Use preferred tooling for SQL data warehouse development
Leveraging ISV partners with Azure Synapse Analytics
Azure Data Share Ecosystem
+ many more
Platform diagram:
• Languages: SQL, Python, .NET, Java, Scala, R
• Form Factors: Provisioned, On-Demand
• Analytics Runtimes
• Cross-cutting services: Management, Security, Monitoring, Metastore, Data Integration
Resource Group
Workspace Name
Region
Monitor: centralized view of all resource usage and activities in the workspace.
Manage: configure the workspace, pools, and access to artifacts.
Synapse Studio
Overview hub
Overview Hub
The starting point for activities, with key links to tasks, artifacts, and documentation
Overview Hub
Overview
Filepath
Container (filesystem)
Data Hub – Storage accounts
Preview a sample of your data
Data Hub – Storage accounts
See basic file properties
Data Hub – Storage accounts
Manage Access - Configure standard POSIX ACLs on files and folders
Data Hub – Storage accounts
Two simple gestures to start analyzing with SQL scripts or with notebooks.
Multi-select of files generates a SQL script that analyzes all those files together
Data Hub – Databases
Explore the different kinds of databases that exist in a workspace.
SQL pool
SQL on-demand
Spark
Data Hub – Databases
Familiar gestures to generate T-SQL scripts from SQL metadata objects such as tables. Starting from a table, auto-generate a single line of PySpark code that makes it easy to load a SQL table into a Spark dataframe.
Data Hub – Datasets
Orchestration datasets describe data that is persisted. Once a dataset is defined, it can be used in pipelines as a source of data or as a sink of data.
Synapse Studio
Develop hub
Develop Hub
Overview
Benefits
Export results
Develop Hub - Notebooks
Configure session allows developers to control how many resources
are devoted to running their notebook.
Develop Hub - Notebooks
As notebook cells run, the underlying Spark application status is shown, providing immediate feedback and progress tracking.
Dataflow Capabilities
This feature provides the ability to monitor orchestration, activities, and compute resources.
Monitoring Hub - Orchestration
Overview
Benefits
Benefits
This feature provides the ability to manage Linked Services, Orchestration, and Security.
Manage – Linked services
Overview
It defines the connection information needed to
connect to external resources.
Benefits
Offers 90+ pre-built connectors
Benefits
Share workspace with the team
Increases productivity
Benefits
Create and manage
• Schedule trigger
• Event trigger
Benefits
Offers Azure Integration Runtime or Self-Hosted Integration
Runtime
[Diagram: pipeline activities dispatched to a Self-hosted Integration Runtime or the Azure Integration Runtime. Legend: Linked Service, Command and Control, Data Movement]
Scalable: per-job elasticity, up to 4 GB/s
Simple: visually author or via code (Python, .NET, etc.); serverless, no infrastructure to manage
Connectors include: Database for MariaDB, Database for MySQL, Hive, Apache Impala, SAP HANA, SAP table, SAP C4C, SAP ECC, Google AdWords, HubSpot, Table storage, + many more
Pipelines
Overview
Benefits
Benefits
Offers 85+ pre-built connectors
Benefits
Offers Azure Integration Runtime or Self-Hosted Integration
Runtime
[Chart: TPC-H 1 Petabyte Query Execution across the 22 TPC-H queries]
Azure Synapse Analytics > SQL >
• Materialized Views
• Nonclustered Indexes
• Result-set caching
Azure Synapse Analytics > SQL >
Windowing functions

OVER clause: defines a window, or specified set of rows, within a query result set. A window function computes a value for each row in the window.

SELECT
    ROW_NUMBER() OVER(PARTITION BY PostalCode ORDER BY SalesYTD DESC) AS "Row Number",
    LastName,
    SalesYTD,
    PostalCode
FROM Sales
WHERE SalesYTD <> 0
ORDER BY PostalCode;

Aggregate and analytic functions can also be windowed, e.g.:
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY ph.Rate)
    LAG(SalesQuota, 1, 0) OVER (ORDER BY YEAR(QuotaDate)) AS PreviousQuota
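The partition-and-rank logic behind ROW_NUMBER() OVER(PARTITION BY … ORDER BY …) can be sketched in plain Python. This is an illustration with made-up rows, not how the engine executes the query:

```python
from itertools import groupby

def row_number(rows, partition_key, order_key, descending=True):
    """Emulate ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)."""
    ordered = sorted(rows, key=order_key, reverse=descending)
    ordered.sort(key=partition_key)  # stable sort keeps order within each partition
    out = []
    for _, group in groupby(ordered, key=partition_key):
        for n, row in enumerate(group, start=1):  # number rows inside the partition
            out.append({**row, "row_number": n})
    return out

sales = [
    {"last_name": "Alberts", "postal_code": "98004", "sales_ytd": 100.0},
    {"last_name": "Ansman",  "postal_code": "98004", "sales_ytd": 250.0},
    {"last_name": "Blythe",  "postal_code": "98027", "sales_ytd": 400.0},
]
ranked = row_number(sales,
                    partition_key=lambda r: r["postal_code"],
                    order_key=lambda r: r["sales_ytd"])
```

Within postal code 98004, the higher SalesYTD row ranks first; the count restarts for 98027.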
Approximate execution
HyperLogLog accuracy
Returns a result within 2% of the true cardinality on average.
E.g., where COUNT(DISTINCT) returns 1,000,000, HyperLogLog will return a value in the range of 999,736 to 1,016,234.
APPROX_COUNT_DISTINCT
Returns the approximate number of unique non-null values in a group.
Use Case: Approximating web usage trend behavior
-- Syntax
APPROX_COUNT_DISTINCT ( expression )
-- The approximate number of different order keys by order status from the orders table.
SELECT O_OrderStatus, APPROX_COUNT_DISTINCT(O_OrderKey) AS Approx_Distinct_OrderKey
FROM dbo.Orders
GROUP BY O_OrderStatus
ORDER BY O_OrderStatus;
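The HyperLogLog idea behind APPROX_COUNT_DISTINCT can be illustrated with a minimal Python sketch: hash each value, use the first bits to pick a register, and record the longest run of leading zeros seen. This is a simplified textbook version, not the engine's implementation:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: estimates distinct count in O(m) memory."""
    def __init__(self, p=12):
        self.p = p                # 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:             # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog()
for i in range(10_000):
    hll.add(i)
    hll.add(i)   # duplicates do not inflate the estimate
estimate = hll.count()
```

With 4096 registers the estimate lands within a few percent of the true 10,000 distinct values, using a fixed, tiny amount of memory regardless of input size.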
Azure Synapse Analytics > SQL >
Approximate execution
APPROX_COUNT_DISTINCT
COUNT DISTINCT
Azure Synapse Analytics > SQL >
Group by options
Group by with rollup
• Creates a group for each combination of column expressions
• Rolls up the results into subtotals and grand totals
• Calculates the aggregates of hierarchical data

-- GROUP BY ROLLUP Example --
SELECT Country,
       Region,
       SUM(Sales) AS TotalSales
FROM Sales
GROUP BY ROLLUP (Country, Region);

-- Results --
Country  Region   TotalSales
Canada   Alberta  100

Grouping sets
• Combine multiple GROUP BY clauses into one GROUP BY clause
• Equivalent of UNION ALL of specified groups
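The subtotal hierarchy ROLLUP produces can be sketched in Python: every row contributes once to each prefix of the key list, so (Country, Region), (Country,), and the grand total () all accumulate. Illustrative data only:

```python
from collections import defaultdict

def rollup(rows, keys, value):
    """Emulate GROUP BY ROLLUP(k1, k2): totals per (k1, k2), per (k1,), and grand total."""
    totals = defaultdict(float)
    for row in rows:
        # each row contributes to every prefix of the key hierarchy
        for depth in range(len(keys), -1, -1):
            group = tuple(row[k] for k in keys[:depth])
            totals[group] += row[value]
    return dict(totals)

sales = [
    {"country": "Canada", "region": "Alberta", "sales": 100},
    {"country": "Canada", "region": "British Columbia", "sales": 200},
    {"country": "United States", "region": "Montana", "sales": 100},
]
totals = rollup(sales, keys=["country", "region"], value="sales")
# totals[("Canada",)] is the Canada subtotal; totals[()] is the grand total
```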
Snapshot isolation
Overview
Specifies that statements cannot read data that has been modified but not committed by other transactions. This prevents dirty reads.

ALTER DATABASE MyDatabase
SET ALLOW_SNAPSHOT_ISOLATION ON

ALTER DATABASE MyDatabase
SET READ_COMMITTED_SNAPSHOT ON

Isolation levels
• READ COMMITTED
• REPEATABLE READ
• SERIALIZABLE
• READ UNCOMMITTED

READ_COMMITTED_SNAPSHOT
OFF (default) – uses shared locks to prevent other transactions from modifying rows while running a read operation
ON – uses row versioning to present each statement with a transactionally consistent snapshot of the data as it existed at the start of the statement. Locks are not used to protect the data from updates.
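The row-versioning behavior can be sketched as a toy version store in Python: readers pin a snapshot timestamp and see the latest version committed at or before it, without taking locks. Illustrative only, not the engine's MVCC implementation:

```python
class VersionStore:
    """Toy row-versioning store: readers see the latest version
    committed at or before their snapshot timestamp, without locks."""
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value)
        self.clock = 0

    def commit(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        return self.clock

    def read(self, key, snapshot_ts):
        # scan versions newest-first for the one visible at the snapshot
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

store = VersionStore()
store.commit("row1", "v1")
ts = store.snapshot()        # statement's snapshot taken here
store.commit("row1", "v2")   # a concurrent writer commits
value_seen = store.read("row1", ts)   # reader still sees the pre-write value
```

The reader is never blocked by the writer and never sees the uncommitted or later value: no dirty reads, no shared locks.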
Azure Synapse Analytics > SQL >
JSON in tables

Extract values from JSON stored in tables with the following:
• JSON_VALUE – extracts a scalar value from a JSON string

N'[{"StoreId": "AW73565", "Order": { "Number":"SO43659", "Date":"2011-05-31T00:00:00" }, "Item": { "Price":2024.40, "Quantity":1 }}]'

Modify JSON data with the following:
• JSON_MODIFY – modifies a value in a JSON string

N'[{"StoreId": "AW73565", "Order": { "Number":"SO43659", "Date":"2011-05-31T00:00:00" }, "Item": { "Price":2024.40, "Quantity":2 }}]'
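The JSON_VALUE / JSON_MODIFY pattern maps directly onto any JSON library. A Python sketch using a reconstructed version of the slide's sample order document:

```python
import json

doc = ('[{"StoreId": "AW73565", '
       '"Order": {"Number": "SO43659", "Date": "2011-05-31T00:00:00"}, '
       '"Item": {"Price": 2024.40, "Quantity": 1}}]')

orders = json.loads(doc)

# JSON_VALUE equivalent: extract a scalar at a path, e.g. '$[0].Order.Number'
order_number = orders[0]["Order"]["Number"]

# JSON_MODIFY equivalent: update '$[0].Item.Quantity' and re-serialize
orders[0]["Item"]["Quantity"] = 2
modified = json.dumps(orders)
```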
Stored Procedures
Overview
A stored procedure is a group of one or more SQL statements or a reference to a Microsoft .NET Framework common language runtime (CLR) method.

Promotes flexibility and modularity.
Supports parameters and nesting.

CREATE PROCEDURE HumanResources.uspGetAllEmployees
AS
SET NOCOUNT ON;
SELECT LastName, FirstName, JobTitle, Department
FROM HumanResources.vEmployeeDepartment;
GO

-- Execute a stored procedure
EXECUTE HumanResources.uspGetAllEmployees;
GO
-- Or
EXEC HumanResources.uspGetAllEmployees;
GO
-- Or, if this procedure is the first statement within a batch:
HumanResources.uspGetAllEmployees;

Benefits
• Reduced server/client network traffic, improved performance
• Stronger security
• Easy maintenance
Azure Synapse Analytics
Data Storage and Performance Optimizations
Database Tables: Columnar Storage, Columnar Ordering, Table Partitioning, Hash Distribution, Nonclustered Indexes
Optimized Storage: Less Data Scanned, Smaller Cache Required, Smaller Clusters, Faster Queries, Reduce Migration Risk
Azure Synapse Analytics > SQL >
Tables – Indexes
Clustered columnstore index (default primary)
• Highest level of data compression
• Best overall query performance

Clustered index (primary)
• Performant for looking up a single to few rows

Heap (primary)

-- Create table with index
CREATE TABLE orderTable
(
    OrderId INT NOT NULL,
    Date DATE NOT NULL,
    Name VARCHAR(2),
    Country VARCHAR(2)
)
[Diagram: sample order rows compressed into columnstore segments (with min/max OrderId metadata per segment) alongside a delta rowstore, and B-tree index pages mapping OrderId to PageId]
Columnstore:
• Data is stored in compressed columnstore segments after being sliced into groups of rows (rowgroups/micro-partitions) for maximum compression
• Rows are stored in the delta rowstore until the number of rows is large enough to be compressed into the columnstore

Rowstore:
• Data is stored in a B-tree index structure for performant lookup queries for particular rows
• Clustered rowstore index: the leaf nodes in the structure store the data values in a row
• Non-clustered (secondary) rowstore index: the leaf nodes store pointers to the data values, not the values themselves
Azure Synapse Analytics > SQL >
Tables – Distributions
Round-robin distributed
• Distributes table rows evenly across all distributions at random.

Hash distributed
• Distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.

Replicated
• Full copy of the table accessible on each Compute node.

CREATE TABLE dbo.OrderTable
(
    OrderId INT NOT NULL,
    Date DATE NOT NULL,
    Name VARCHAR(2),
    Country VARCHAR(2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([OrderId]) |
                   ROUND_ROBIN |
                   REPLICATE
);
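The difference between the two non-replicated options can be sketched in Python: a deterministic hash always sends the same key to the same distribution (so joins and aggregations on that key stay local), while round-robin just cycles. The 60 distributions match Synapse; the CRC32 hash is an illustrative stand-in, not the engine's hash function:

```python
import zlib
from itertools import cycle

DISTRIBUTIONS = 60

def hash_distribution(order_id):
    """Deterministic: equal keys always land in the same distribution."""
    return zlib.crc32(str(order_id).encode()) % DISTRIBUTIONS

round_robin = cycle(range(DISTRIBUTIONS))  # even spread, no key affinity

rows = [82147, 85016, 85018, 82147]        # note the repeated OrderId
hashed = [hash_distribution(r) for r in rows]
rr = [next(round_robin) for _ in rows]
```

With hash distribution the two 82147 rows co-locate; with round-robin they may land anywhere, which is fine for staging but forces data movement for key-local operations.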
Azure Synapse Analytics > SQL >
Tables – Partitions
Overview

CREATE TABLE partitionedOrderTable
…

• Table rows are spread across 60 distributions (shards)
• Each shard is partitioned with the same date partitions (e.g. an 11-3-2018 partition)
• A minimum of 1 million rows per distribution and partition is needed for optimal compression and performance of clustered columnstore tables
Azure Synapse Analytics > SQL >
Fact tables: use hash-distribution with a clustered columnstore index. Performance improves because hashing enables the platform to localize certain operations within the node itself during query execution.
Operations that benefit:
    COUNT(DISTINCT <hashed_key>)
    OVER (PARTITION BY <hashed_key>)
    JOIN <table_name> ON <hashed_key>
    GROUP BY <hashed_key>

Dimension tables: use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.

Staging tables: use round-robin. The load with CTAS is faster. Once the data is in the staging table, use INSERT…SELECT to move the data to production tables.
Views
Database Views
Materialized Views
Best in class price
performance
Materialized views
-- Create indexed view
Overview

A materialized view pre-computes, stores, and maintains its data like a table. Materialized views are automatically updated when data in the underlying tables changes. This is a synchronous operation that occurs as soon as the data is changed. The auto-caching functionality allows the Azure Synapse Analytics Query Optimizer to consider using a materialized view even if the view is not referenced in the query.

Supported aggregations: MAX, MIN, AVG, COUNT, COUNT_BIG, SUM, VAR, STDEV

-- Create materialized view
CREATE MATERIALIZED VIEW Sales.vw_Orders
WITH
(
    DISTRIBUTION = ROUND_ROBIN | HASH(ProductID)
)
AS
SELECT SUM(UnitPrice*OrderQty) AS Revenue,
       OrderDate,
       ProductID,
       COUNT_BIG(*) AS OrderCount
FROM Sales.SalesOrderDetail
GROUP BY OrderDate, ProductID;
GO

-- Disable the materialized view and put it in suspended mode
ALTER INDEX ALL ON Sales.vw_Orders DISABLE;
-- Re-enable the materialized view by rebuilding it
ALTER INDEX ALL ON Sales.vw_Orders REBUILD;
Benefits
Automatic and synchronous data refresh with data changes
in base tables. No user action is required.
High availability and resiliency as regular tables
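Synchronous maintenance of a materialized aggregate can be sketched as keeping running totals updated on every base-table change, so queries read pre-computed results instead of rescanning the base table. A toy model mirroring the Sales.vw_Orders view above, not the engine's maintenance algorithm:

```python
from collections import defaultdict

class MaterializedAgg:
    """Maintains SUM(UnitPrice*OrderQty) and COUNT(*) per (OrderDate, ProductID)
    synchronously as base rows arrive, like an auto-maintained view."""
    def __init__(self):
        self.revenue = defaultdict(float)
        self.order_count = defaultdict(int)

    def insert(self, order_date, product_id, unit_price, qty):
        # the "view" is refreshed in the same step as the base-table write
        key = (order_date, product_id)
        self.revenue[key] += unit_price * qty
        self.order_count[key] += 1

    def query(self, order_date, product_id):
        # O(1) lookup instead of scanning and re-aggregating the base table
        key = (order_date, product_id)
        return self.revenue[key], self.order_count[key]

view = MaterializedAgg()
view.insert("2011-05-31", 776, 2024.40, 1)
view.insert("2011-05-31", 776, 2024.40, 2)
revenue, count = view.query("2011-05-31", 776)
```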
Azure Synapse Analytics > SQL >
Original query – get year total sales per customer:

-- Get year total sales per customer
WITH year_total AS
(
    SELECT customer_id,
           first_name,
           last_name,
           birth_country,
           login,
           email_address,
           d_year,
           SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
    FROM customer cust
    JOIN catalog_sales sales ON cust.sk = sales.sk
    JOIN date_dim ON sales.sold_date = date_dim.date
    GROUP BY customer_id, first_name, last_name, birth_country,
             login, email_address, d_year
)
SELECT TOP 100 …
FROM year_total …
WHERE …
ORDER BY …

Create a materialized view with hash distribution on the customer_id column:

-- Create materialized view for the query
CREATE MATERIALIZED VIEW nbViewCS WITH (DISTRIBUTION = HASH(customer_id)) AS
SELECT customer_id,
       first_name,
       last_name,
       birth_country,
       login,
       email_address,
       d_year,
       SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
FROM customer cust
JOIN catalog_sales sales ON cust.sk = sales.sk
JOIN date_dim ON sales.sold_date = date_dim.date
GROUP BY customer_id, first_name, last_name, birth_country,
         login, email_address, d_year
Azure Synapse Analytics > SQL >
SQL Analytics
[Diagram: streaming ingestion from Event Hubs and IoT Hub, plus heterogeneous data preparation & ingestion from Azure Data Lake, into the data warehouse]
COPY statement
- Simplified permissions (no CONTROL required)
- No need for external tables
- Standard CSV support (i.e. custom row terminators, escape delimiters, SQL dates)
- User-driven file selection (wildcard support)

--Copy files in parallel directly into data warehouse table
COPY INTO [dbo].[weatherTable]
FROM 'abfss://<storageaccount>.blob.core.windows.net/<filepath>'
WITH (
    FILE_FORMAT = 'DELIMITEDTEXT',
    SECRET = CredentialObject);
Azure Synapse Analytics > SQL >
COPY command
Overview
Copies data from source to destination.

Benefits
• Retrieves data from all files in the folder and all its subfolders
• Supports multiple locations from the same storage account, separated by commas
• Supports Azure Data Lake Storage (ADLS) Gen2 and Azure Blob Storage

COPY INTO test_1
FROM 'https://XXX.blob.core.windows.net/customerdatasets/test_1.txt'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>'),
    FIELDQUOTE = '"',
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0X0A',
    ENCODING = 'UTF8',
    DATEFORMAT = 'ymd',
    MAXERRORS = 10,
    ERRORFILE = '/errorsfolder/', --path starting from the storage container
    IDENTITY_INSERT
)
Result-set caching
Overview

Cache the results of a query in DW storage. This enables interactive response times for repetitive queries against tables with infrequent data changes.

-- Turn on/off result-set caching for a database
-- Must be run on the MASTER database
ALTER DATABASE {database_name}
SET RESULT_SET_CACHING { ON | OFF }

1. Client sends query to DW.
2. Query is processed using DW compute nodes, which pull data from remote storage, process the query, and output back to the client app. Query results are cached in remote storage so subsequent requests can be served immediately.
3. Subsequent executions of the same query bypass the compute nodes and can be fetched instantly from the persistent cache in remote storage.
4. Remote storage cache is evicted regularly based on time, cache usage, and any modifications to underlying table data.
5. Cache will need to be regenerated if query results have been evicted from the cache.
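The steps above can be sketched as a dictionary keyed by the query text and invalidated when an underlying table changes. A toy illustration, not the service's cache design:

```python
class ResultSetCache:
    """Toy result-set cache: hits bypass compute; any write to an
    underlying table evicts cached results that depend on it."""
    def __init__(self, execute):
        self.execute = execute            # stands in for the compute nodes
        self.cache = {}                   # query text -> (tables, result)
        self.compute_runs = 0

    def query(self, sql, tables):
        if sql in self.cache:
            return self.cache[sql][1]     # served straight from cache
        self.compute_runs += 1
        result = self.execute(sql)
        self.cache[sql] = (set(tables), result)
        return result

    def table_modified(self, table):
        # eviction: drop every cached result that depends on the table
        self.cache = {q: v for q, v in self.cache.items() if table not in v[0]}

engine = ResultSetCache(execute=lambda sql: f"rows for: {sql}")
q = "SELECT COUNT(*) FROM dbo.Orders"
engine.query(q, tables={"dbo.Orders"})   # computed
engine.query(q, tables={"dbo.Orders"})   # cache hit, compute bypassed
engine.table_modified("dbo.Orders")      # underlying data changed: evict
engine.query(q, tables={"dbo.Orders"})   # cache regenerated
```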
Azure Synapse Analytics > SQL >
Resource classes
Overview
Pre-determined resource limits defined for a user or role. /* View resource classes in the data warehouse */
SELECT name
FROM sys.database_principals
Benefits WHERE name LIKE '%rc%' AND type_desc = 'DATABASE_ROLE';
Govern the system memory assigned to each query.
/* Change user's resource class to 'largerc' */
EXEC sp_addrolemember 'largerc', 'loaduser';
can run on a data warehouse.
/* Decrease the loading user's resource class */
EXEC sp_droprolemember 'largerc', 'loaduser';
Exemptions to concurrency limit:
CREATE|ALTER|DROP (TABLE|USER|PROCEDURE|VIEW|LOGIN)
CREATE|UPDATE|DROP (STATISTICS|INDEX)
SELECT from system views and DMVs
EXPLAIN
Result-Set Cache
TRUNCATE TABLE
ALTER AUTHORIZATION
CREATE|UPDATE|DROP STATISTICS
Azure Synapse Analytics > SQL >
Workload Management
Overview
It manages resources, ensures highly efficient resource utilization,
and maximizes return on investment (ROI).
The three pillars of workload management are:
1. Workload Classification – assign a request to a workload group and set importance levels.
2. Workload Importance – influence the order in which a request gets access to resources.
3. Workload Isolation – reserve resources for a workload group.
Azure Synapse Analytics > SQL >
Workload classification
Overview
Map queries to allocations of resources via pre-determined rules. Use with workload importance to effectively share resources across different workload types. If a query request is not matched to a classifier, it is assigned to the default workload group (smallrc resource class).

CREATE WORKLOAD CLASSIFIER classifier_name
WITH
(
    [WORKLOAD_GROUP = '<Resource Class>' ]
    [IMPORTANCE = { LOW | BELOW_NORMAL | NORMAL | ABOVE_NORMAL | HIGH } ]
    [MEMBERNAME = 'security_account']
)

WORKLOAD_GROUP: maps to an existing resource class
IMPORTANCE: specifies the relative importance of a request
MEMBERNAME: database user, role, AAD login or AAD group

Benefits
• Map queries to both Resource Management and Workload Isolation concepts.
• Manage groups of users with only a few classifiers.

Monitoring DMVs
sys.workload_management_workload_classifiers
sys.workload_management_workload_classifier_details
Query DMVs to view details about all active workload classifiers.
Azure Synapse Analytics > SQL >
Workload importance
Overview
Queries past the concurrency limit enter a FIFO queue. By default, queries are released from the queue on a first-in, first-out basis as resources become available. Workload importance allows higher-priority queries to receive resources immediately, regardless of their position in the queue.

Example
State analysts have normal importance; the national analyst is assigned high importance. State analyst queries execute in order of arrival. When the national analyst's query arrives, it jumps to the top of the queue.

CREATE WORKLOAD CLASSIFIER National_Analyst
WITH
(
    [WORKLOAD_GROUP = 'smallrc']
    [IMPORTANCE = HIGH]
    [MEMBERNAME = 'National_Analyst_Login']
)
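The queueing behavior can be sketched with a priority queue: higher importance is released first, and ties fall back to arrival order (FIFO). A toy model of the example, not the scheduler's implementation:

```python
import heapq

# lower number = released sooner
IMPORTANCE = {"low": 4, "below_normal": 3, "normal": 2, "above_normal": 1, "high": 0}

class WorkloadQueue:
    """Queries past the concurrency limit wait here; release order is
    importance first, then arrival (FIFO within the same importance)."""
    def __init__(self):
        self.heap = []
        self.arrival = 0

    def submit(self, name, importance="normal"):
        heapq.heappush(self.heap, (IMPORTANCE[importance], self.arrival, name))
        self.arrival += 1

    def release(self):
        return heapq.heappop(self.heap)[2]

q = WorkloadQueue()
q.submit("state_analyst_1")                      # normal importance
q.submit("state_analyst_2")
q.submit("national_analyst", importance="high")  # arrives last, runs next
order = [q.release() for _ in range(3)]
```

The national analyst's query jumps the queue; the state analysts still run in arrival order relative to each other.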
CREATE WORKLOAD GROUP Sales
WITH
(
    [ MIN_PERCENTAGE_RESOURCE = 60 ]
    [ CAP_PERCENTAGE_RESOURCE = 100 ]
    [ MAX_CONCURRENCY = 6 ]
)

[Diagram: on a 1000c DWU instance, the Sales workload group reserves 60% of compute while Marketing shares the remaining 40%, up to the 100% cap]

Workload isolation
- Multiple workloads share deployed resources
- Reservation or shared resource configuration
- Online changes to workload policies
Azure Synapse Analytics > SQL >
Workload Isolation
Overview
Allocate fixed resources to a workload group. Assign maximum and minimum usage for varying resources under load. These adjustments can be made live, without having to take SQL Analytics offline.

CREATE WORKLOAD GROUP group_name
WITH
(
    MIN_PERCENTAGE_RESOURCE = value
    , CAP_PERCENTAGE_RESOURCE = value
    , REQUEST_MIN_RESOURCE_GRANT_PERCENT = value
    [ [ , ] REQUEST_MAX_RESOURCE_GRANT_PERCENT = value ]
    [ [ , ] IMPORTANCE = {LOW | BELOW_NORMAL | NORMAL | ABOVE_NORMAL | HIGH} ]
    [ [ , ] QUERY_EXECUTION_TIMEOUT_SEC = value ]
)[ ; ]

Benefits
• Reserve resources for a group of requests
• Limit the amount of resources a group of requests can consume
• Shared resources accessed based on importance level
• Set a query timeout value. Get DBAs out of the business of killing runaway queries

[Chart: resource allocation – group A 40%, group B 20%, shared 40%]

Monitoring DMVs
sys.workload_management_workload_groups
Query to view configured workload groups.
Azure Synapse Analytics > SQL >
Overview
Dynamic Management Views (DMV) are queries that return information
about model objects, server operations, and server health.
Benefits:
Simple SQL syntax
Returns result in table format
Easier to read and copy result
Azure Synapse Analytics > SQL >
Developer Tools
• Azure Synapse Analytics (Studio)
• Visual Studio – SSDT database projects
• Azure Data Studio (queries, extensions, etc.)
• SQL Server Management Studio (queries, execution plans, etc.)
• Visual Studio Code
Azure Synapse Analytics > SQL >
Developer Tools
• Azure Synapse Analytics: Azure cloud service; offers end-to-end lifecycle for analytics
• Visual Studio – SSDT database projects: runs on Windows; create and maintain database code, compile, code refactoring
• Azure Data Studio: runs on Windows, Linux, macOS; lightweight editor (queries and extensions)
• SQL Server Management Studio: runs on Windows; offers GUI support to query, design and manage
• Visual Studio Code: runs on Windows, Linux, macOS; offers development experience with lightweight code editor; connects to multiple services
Azure Synapse Analytics > SQL >
Benefits
Database project support includes first-class
integration with Azure DevOps. This adds support for:
• Azure Pipelines to run CI/CD workflows for any
platform (Linux, macOS, and Windows)
• Azure Repos to store project files in source control
• Azure Test Plans to run automated check-in tests to
verify schema updates and modifications
• Growing ecosystem of third-party integrations that
can be used to complement existing workflows
(Timetracker, Microsoft Teams, Slack, Jenkins, etc.)
Azure Synapse Analytics > SQL >
Data Skew: the slowest distribution limits performance. Choose a new hash-distribution key.
Cache Misses: provision additional capacity.
Tempdb Contention: scale or update the user resource class.
Maintenance windows
Overview
Choose a time window for your upgrades.
Select a primary and secondary window within a seven-day
period.
Windows can be from 3 to 8 hours.
24-hour advance notification for maintenance events.
Benefits
Ensure upgrades happen on your schedule.
Predictable planning for long-running jobs.
Stay informed of start and end of maintenance.
Azure Synapse Analytics > SQL >
Statistics are automatically updated as data modifications occur in underlying tables. By default, these updates are synchronous but can be configured to be asynchronous.

-- Configure synchronous/asynchronous update
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS_ASYNC { ON | OFF }

Statistics are refreshed when, for example, the table's row count was 500 or less and more than 500 rows have since been updated.
Model + Data = Predictions

T-SQL Language: native PREDICT-ion
- T-SQL based experience (interactive/batch scoring)
- Interoperability with models built elsewhere
- Execute scoring where the data lives (data warehouse, data lake)
Integration
Notes:
• Auto-scale compute nodes - Instruct the underlying fabric the need for more compute power to
adjust to peaks during the workload. If compute power is granted, the Polaris DQP will re-distribute
tasks leveraging the new compute container. Note that in-flight tasks in the previous topology
continue running, while new queries get the new compute power with the new re-balancing
• Compute node fault tolerance - Recover from faulty nodes while a query is running. If a node fails
the DQP re-schedules the tasks in the faulted node through the remainder of the healthy topology
• Compute node hot spot: rebalance queries or scale out nodes - Can detect hot spots in the existing topology, that is, compute nodes overloaded due to data skew. In the event of a compute node running hot because of skewed tasks, the DQP can decide to re-schedule some of the tasks assigned to that compute node amongst others where the load is less
• Multi-cluster - Multiple compute pools accessing the same data
• Cross-database queries – A query can specify multiple databases
These features work for both on-demand and provisioned over ADLS Gen2 and relational databases
Azure Synapse Analytics
Integrated data platform for BI, AI and continuous intelligence
What’s in this file? How many rows are there? What’s the max value?
How to convert CSVs to Parquet quickly? How to transform the raw data?
Use the full power of T-SQL to transform the data in the data lake
Azure Synapse Analytics > SQL >
SQL On-Demand
Overview
An interactive query service that provides serverless T-SQL queries over data in the lake
Benefits
Ability to read CSV files with:
- no header row, Windows-style new line
- no header row, Unix-style new line
- header row, Unix-style new line
- header row, Unix-style new line, quoted

SELECT *
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/population/population.csv',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
)

SELECT *
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/population-unix-hdr/population.csv',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0a',
    FIRSTROW = 2
)
WITH (
    [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
    [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
    [year] smallint,
    [population] bigint
) AS [r]
WHERE
    country_name = 'Luxembourg'
    AND year = 2017

SELECT
    COUNT(DISTINCT country_name) AS countries
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/population/population.csv',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
)
WITH (
    [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2 2
) AS [r]
Azure Synapse Analytics > SQL On Demand
Read all files from multiple folders:

SELECT YEAR(pickup_datetime) AS [year],
       SUM(passenger_count) AS passengers_total,
       COUNT(*) AS [rides_total]
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/t*i/',
    FORMAT = 'CSV',
    FIRSTROW = 2 )
WITH (
    vendor_id VARCHAR(100) COLLATE Latin1_General_BIN2,
    pickup_datetime DATETIME2,
    dropoff_datetime DATETIME2,
    passenger_count INT,
    trip_distance FLOAT,
    <… columns>
) AS nyc
GROUP BY YEAR(pickup_datetime)
ORDER BY YEAR(pickup_datetime)

Read a subset of files in a folder:

SELECT payment_type,
       SUM(fare_amount) AS fare_total
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-*.csv',
    FORMAT = 'CSV',
    FIRSTROW = 2 )
WITH (
    vendor_id VARCHAR(100) COLLATE Latin1_General_BIN2,
    pickup_datetime DATETIME2,
    dropoff_datetime DATETIME2,
    passenger_count INT,
    trip_distance FLOAT,
    <…columns>
) AS nyc
GROUP BY payment_type
ORDER BY payment_type
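The wildcard behavior, where every file matching a pattern is treated as one logical table with the header skipped (FIRSTROW = 2), can be sketched locally with glob + csv. The filenames below are created on the fly for illustration:

```python
import csv
import glob
import os
import tempfile

def read_matching_csvs(pattern, skip_header=True):
    """Treat every file matching the pattern as one logical table,
    like OPENROWSET(BULK '.../yellow_tripdata_2017-*.csv', FIRSTROW = 2)."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            if skip_header:
                next(reader, None)        # FIRSTROW = 2: skip the header row
            rows.extend(reader)
    return rows

# set up two small monthly files to stand in for blob storage
tmp = tempfile.mkdtemp()
for month, fares in [("01", ["5.5", "7.0"]), ("02", ["9.0"])]:
    with open(os.path.join(tmp, f"yellow_tripdata_2017-{month}.csv"), "w") as f:
        f.write("fare_amount\n" + "\n".join(fares) + "\n")

rows = read_matching_csvs(os.path.join(tmp, "yellow_tripdata_2017-*.csv"))
fare_total = sum(float(r[0]) for r in rows)
```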
Azure Synapse Analytics > SQL On Demand
GROUP BY r.filename()
ORDER BY [filename]
Azure Synapse Analytics > SQL On Demand
SELECT
country_name, population
FROM populationView
WHERE
[year] = 2019
ORDER BY
[population] DESC
Azure Synapse Analytics > SQL On Demand
Benefits
Supports OPENJSON, JSON_VALUE and JSON_QUERY functions

SELECT *
FROM OPENROWSET(
    BULK 'https://XXX.blob.core.windows.net/json/books/book1.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
)
WITH (
    jsonContent varchar(8000)
) AS [r]
Azure Synapse Analytics > SQL On Demand
Create External Table As Select

Overview
Creates an external table and then exports the results of the SELECT statement. These operations import data into the database for the duration of the query.

Steps:
1. Create master key
2. Create credentials
3. Create external data source
4. Create external file format
5. Create external table

-- Create a database master key if one does not already exist
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo';

-- Create a database scoped credential with Azure storage account key as the secret.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = '<my_account>'
,   SECRET = '<azure_storage_account_key>'
;

-- Create an external data source with CREDENTIAL option.
CREATE EXTERNAL DATA SOURCE MyAzureStorage
WITH
(   LOCATION = 'wasbs://[email protected]/'
,   CREDENTIAL = AzureStorageCredential
,   TYPE = HADOOP
)
;

-- Create an external file format
CREATE EXTERNAL FILE FORMAT MyAzureCSVFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS(
          FIELD_TERMINATOR = ',',
          FIRST_ROW = 2)
);

-- Create an external table
CREATE EXTERNAL TABLE dbo.FactInternetSalesNew
WITH (
    LOCATION = '/files/Customer',
    DATA_SOURCE = MyAzureStorage,
    FILE_FORMAT = MyAzureCSVFormat
)
AS SELECT T1.* FROM dbo.FactInternetSales T1 JOIN dbo.DimCustomer T2
ON ( T1.CustomerKey = T2.CustomerKey )
OPTION ( HASH JOIN );
SQL scripts > View and export results
SQL scripts > View results (chart)
Convert from CSV to Parquet on-demand
Azure Synapse Analytics
Spark
Azure Synapse Analytics
Integrated data platform for BI, AI and continuous intelligence
Platform Languages
MANAGEMENT
MANAGEMENT
SQL Python .NET Java Scala R
Form Factors
SECURITY
SECURITY
PROVISIONED ON-DEMAND
Analytics Runtimes
MONITORING
MONITORING
METASTORE
METASTORE
DATA INTEGRATION
Benefits
Allows you to write multiple languages in one notebook using magic commands:
%%<name of language>
Runs on YARN
Spark Structured Streaming (stream processing) and Spark MLlib (machine learning)
http://spark.apache.org
Motivation for Apache Spark
Traditional approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing involve lots of (slow) disk I/O. Each chained job reads its input from HDFS and writes its output back to disk before the next iteration can start, so disk becomes the bottleneck.
What makes Spark fast: data is read from HDFS once and the working set is kept in memory across iterations, with minimal disk read/write in between. Memory is 10–100x faster than network and disk.
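A toy Python illustration (not actual Spark or MapReduce code) of why chained disk I/O hurts: the MapReduce-style loop round-trips through simulated storage on every iteration, while the Spark-style loop keeps the working set in memory. Both produce the same answer; only the data movement differs.

```python
def mapreduce_style(data, iterations):
    storage = list(data)              # simulated HDFS
    for _ in range(iterations):
        loaded = list(storage)        # disk read at the start of each job
        transformed = [x * 2 for x in loaded]
        storage = list(transformed)   # disk write before the next job
    return storage

def spark_style(data, iterations):
    dataset = list(data)              # read from HDFS once, cached in memory
    for _ in range(iterations):
        dataset = [x * 2 for x in dataset]  # iterate entirely in memory
    return dataset
```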
General Spark Cluster Architecture
Components: Driver Program (SparkContext), Cluster Manager, worker nodes; libraries include MLlib/SparkML and GraphX.
Azure Synapse Apache Spark
Architecture Overview
[Diagram: Synapse Studio client, AAD, Synapse Gateway and Auth Service, Job Service frontend (Spark/Livy controller, API) and backend (Instance Creation, Spark Plugin), Synapse Resource Provider, and the Spark instance running on VMs.]
• User creates a Synapse Workspace and Spark pool and launches Synapse Studio.
• User attaches a Notebook to the Spark pool and enters one or more Spark statements (code blocks).
• The Notebook client gets a user token from AAD and sends a Spark session create request to the Synapse Gateway.
• The Synapse Gateway authenticates the request, validates authorizations on the Workspace and Spark pool, and forwards it to the Spark (Livy) controller hosted in the Synapse Job Service frontend.
• The Job Service frontend forwards the request to the Job Service backend, which creates two jobs: one for creating the cluster and the other for creating the Spark session.
• The Job Service backend contacts the Synapse Resource Provider to obtain Workspace and Spark pool details and delegates the cluster creation request to the Synapse Instance Service.
• Once the instance is created, the Job Service backend forwards the Spark session creation request to the Livy endpoint in the cluster.
• Once the Spark session is created, the Notebook client sends Spark statements to the Job Service frontend.
• The Job Service frontend obtains the actual Livy endpoint for the cluster created for that particular user from the backend and sends the statement directly to Livy for execution.
Synapse Spark Instances
[Diagram: Synapse Cluster Service (control plane) creates VMs from a specialized VHD in a subnet; nodes run YARN RM/NM, Livy, Zookeeper, Hive Metastore, and Spark executors, each with a Node Agent that heartbeats to the Cluster Service.]
1. Synapse Job Service sends a request to the Cluster Service for creating BBC clusters per the description in the associated Spark pool.
2. The Cluster Service sends a request to Azure (via the Azure Resource Provider, using the Azure SDK) to create VMs (required plus additional) with a specialized VHD.
3. The specialized VHD contains bits for all the services required by the cluster type (e.g. Spark), with prefetch instrumentation.
4. Once a VM boots up, the Node Agent sends a heartbeat to the Cluster Service to get its node configuration.
5. The nodes are initialized and assigned roles based on their first heartbeat.
6. Extra nodes get deleted on first heartbeat.
7. After the Cluster Service considers the cluster ready, it returns the Livy endpoint to the Job Service.
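Steps 4–6 can be sketched as a small Python model (an assumption about the mechanism, not the actual service code): required roles are handed out in first-heartbeat order, and any surplus nodes are marked for deletion.

```python
def assign_roles(required_roles, first_heartbeats):
    """Assign roles in first-heartbeat order; surplus nodes are deleted."""
    assignments = {}
    to_delete = []
    remaining = list(required_roles)
    for node in first_heartbeats:
        if remaining:
            assignments[node] = remaining.pop(0)  # initialize and assign a role
        else:
            to_delete.append(node)                # extra node: delete on first heartbeat
    return assignments, to_delete

roles = ["YARN RM - 01", "YARN NM - 01", "Zookeeper - 01"]
assigned, deleted = assign_roles(roles, ["vm-003", "vm-001", "vm-004", "vm-002"])
```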
Creating a Spark pool (1 of 2)
Default Settings
Creating a Spark pool (2 of 2) - optional
// Connect via JDBC
val jdbcUsername = "<SQL DB ADMIN USER>"
val jdbcPwd = "<SQL DB ADMIN PWD>"
val jdbcHostname = "servername.database.windows.net"
val jdbcPort = 1433
val jdbcDatabase = "<AZURE SQL DB NAME>"
val jdbc_url = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=60;"
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPwd}")

// Construct a Spark DataFrame from SQL Pool
var df = spark.read.sqlanalytics("sql1.dbo.Tbl1")

// Write the Spark DataFrame into SQL Pool
df.write.sqlanalytics("sql1.dbo.Tbl2")
View results in chart format
Exploratory data analysis with graphs – histogram, boxplot, etc.
Library Management - Python
Overview
Customers can add new Python libraries at the Spark pool level. In the Portal, specify the new requirements while creating the Spark pool, in the Additional Settings blade. Input a requirements.txt in simple pip-freeze format.
Benefits
Add new libraries to your cluster
Update versions of existing libraries on your cluster
Libraries get installed for your Spark pool during cluster creation
Ability to specify a different requirements file for different pools within the same workspace
Constraints
The library version must exist on the PyPI repository
Version downgrade of an existing library is not allowed
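A small sketch of the pip-freeze format the pool expects — a hypothetical validator, not Synapse code — that parses name==version lines and flags requests that would violate the no-downgrade constraint.

```python
def parse_requirements(text):
    """Parse pip-freeze style lines ('name==version') into a dict."""
    reqs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        reqs[name] = version
    return reqs

def check_no_downgrade(installed, requested):
    """Return the names whose requested version is lower than installed."""
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))
    return [name for name, version in requested.items()
            if name in installed and as_tuple(version) < as_tuple(installed[name])]

installed = parse_requirements("numpy==1.18.1\npandas==1.0.3")
requested = parse_requirements("numpy==1.16.0\nscipy==1.4.1")
bad = check_no_downgrade(installed, requested)
```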
Library Management - Python
Get list of installed libraries with version information
Spark ML Algorithms
Spark ML
Algorithms
Synapse Notebook: Connect to AML workspace
Configuration parameters
Synapse Notebook: Run AML job
Compliance certifications: ISO 27001, SOC 1 Type 2, SOC 2 Type 2, PCI DSS Level 1, CSA Cloud Controls Matrix, ISO 27018, Content Delivery and Security Association, Shared Assessments, FedRAMP JAB P-ATO, HIPAA/HITECH, FIPS 140-2, 21 CFR Part 11, FERPA, DISA Level 2, CJIS, IRS 1075, ITAR-ready, Section 508 VPAT, EU Model Clauses, EU Safe Harbor, United Kingdom G-Cloud, China Multi Layer Protection Scheme, China GB 18030, China CCCPPF, Singapore MTCS Level 3, Australian Signals Directorate, New Zealand GCIO, Japan Financial Services, ENISA IAF
Comprehensive Security

Category           Feature
Data Protection    Data in Transit
                   Data Encryption at Rest
                   Data Discovery and Classification
Access Control     Object Level Security (Tables/Views)
                   Row Level Security
                   Column Level Security
                   Dynamic Data Masking
Authentication     SQL Login
                   Azure Active Directory
                   Multi-Factor Authentication
Network Security   Virtual Networks
                   Firewall
                   Azure ExpressRoute
Threat Protection  Threat Detection
                   Auditing
                   Vulnerability Assessment
Threat Protection - Business requirements
Network Security
Threat Protection
SQL auditing in Azure Log Analytics and Event Hubs
Gain insight into database audit log
Azure networking: application-access patterns
[Diagram: users connect over the Internet, pass the server firewall check, and reach databases DB 1, DB 2, DB 3.]
Firewall configuration on the portal
By default, Azure blocks all external connections to port 1433.
GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Sql/servers/{serverName}/firewallRules/{firewallRuleName}?api-version=2014-04-01
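The same REST call can be scripted. A minimal Python helper (the parameter values below are hypothetical) that builds the management URL above — the placeholders in braces are the only moving parts:

```python
def firewall_rule_url(subscription_id, resource_group, server_name, rule_name):
    """Build the ARM URL for a SQL server firewall rule (api-version 2014-04-01)."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Sql"
        f"/servers/{server_name}"
        f"/firewallRules/{rule_name}"
        "?api-version=2014-04-01"
    )

url = firewall_rule_url("0000-1111", "my-rg", "myserver", "AllowMyIp")
```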
Firewall configuration using PowerShell/T-SQL
Windows PowerShell Azure cmdlets
Note:
By default, VMs on your subnets cannot communicate with your SQL Data Warehouse. There must first be a virtual network service endpoint for the rule to reference.
Authentication - Business requirements
Azure Active Directory authentication
Overview
Manage user identities in one location.
Benefits
Alternative to SQL Server authentication
Limits proliferation of user identities across databases
Allows password rotation in a single place
[Diagram: multiple customers connect to Azure Synapse Analytics from SQL Server Management Studio, SQL Server Data Tools, ADO.NET 4.6, or an app, authenticating through the Azure Active Directory Authentication Library for SQL Server (ADALSQL).]
SQL authentication
Overview
This authentication method uses a username and
password.
When you created the logical server for your data
warehouse, you specified a "server admin" login with a
username and password.
Using these credentials, you can authenticate to any
database on that server as the database owner.
Furthermore, you can create user logins and roles with familiar SQL syntax.
Object-level security (tables, views, and more)
Overview
GRANT controls permissions on designated tables, views, stored procedures, and functions.
Prevent unauthorized queries against certain tables.
Simplifies design and implementation of security at the database level as opposed to application level.
-- Grant SELECT permission to user RosaQdM on table Person.Address in the AdventureWorks2012 database
GRANT SELECT ON OBJECT::Person.Address TO RosaQdM;
GO
-- Grant REFERENCES permission on column BusinessEntityID in view HumanResources.vEmployee to user Wanida
GRANT REFERENCES(BusinessEntityID) ON OBJECT::HumanResources.vEmployee to Wanida with GRANT OPTION;
GO
-- Grant EXECUTE permission on stored procedure HumanResources.uspUpdateEmployeeHireInfo to an application role called Recruiting11
USE AdventureWorks2012;
GRANT EXECUTE ON OBJECT::HumanResources.uspUpdateEmployeeHireInfo TO Recruiting11;
GO
Row-level security (RLS)
Overview
Fine-grained access control over specific rows in a database table.
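A toy Python model of the RLS idea (the table and predicate here are hypothetical; in SQL Data Warehouse this is done with CREATE SECURITY POLICY and an inline table-valued function): a security predicate is applied to every query, so each user sees only the rows they are allowed to.

```python
orders = [
    {"order_id": 1, "sales_rep": "alice", "amount": 100},
    {"order_id": 2, "sales_rep": "bob",   "amount": 250},
    {"order_id": 3, "sales_rep": "alice", "amount": 75},
]

def security_predicate(row, current_user):
    # Mirrors the filter predicate function bound by a security policy:
    # reps see their own rows; a manager sees everything.
    return row["sales_rep"] == current_user or current_user == "manager"

def select_all(table, current_user):
    # The policy filters rows transparently on every SELECT.
    return [row for row in table if security_predicate(row, current_user)]

alice_rows = select_all(orders, "alice")
manager_rows = select_all(orders, "manager")
```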
Dynamic Data Masking
Overview
Prevent abuse of sensitive data by hiding it from users.
Example – Table.CreditCardNo:
Stored: 4465-6571-7868-5796, 4484-5434-6858-6550
Masked: XXXX-XXXX-XXXX-5796, XXXX-XXXX-XXXX-1978
Dynamic Data Masking
Three steps
1. The security officer defines a dynamic data masking policy in T-SQL over sensitive data in the Employee table, using the built-in masking functions (default, email, random):
ALTER TABLE [Employee]
ALTER COLUMN [SocialSecurityNumber]
ADD MASKED WITH (FUNCTION = 'DEFAULT()')
2. The app user selects from the Employee table:
SELECT [First Name],
       [Social Security Number],
       [Email],
       [Salary]
FROM [Employee]
3. Masked data is returned to the app user.
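The credit-card mask shown on the earlier slide behaves roughly like this Python sketch (an approximation of partial masking, not the engine's implementation):

```python
def mask_card_number(value):
    """Expose only the last four digits, as in XXXX-XXXX-XXXX-5796."""
    last_four = value.replace("-", "")[-4:]
    return "XXXX-XXXX-XXXX-" + last_four

masked = mask_card_number("4465-6571-7868-5796")
```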
In transit: Transport Layer Security (TLS 1.2) from the client to the server protects data between client and server against snooping and man-in-the-middle attacks.
Transparent data encryption (TDE)
Overview
All customer data is encrypted at rest.
TDE performs real-time I/O encryption and decryption of the data and log files.
Service- or user-managed keys.
Application changes kept to a minimum.
Transparent encryption/decryption of data in a TDE-enabled client driver.
Compliant with many laws, regulations, and guidelines established across various industries.
USE master;
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<UseStrongPasswordHere>';
GO
CREATE CERTIFICATE MyServerCert WITH SUBJECT = 'My DEK Certificate';
GO
USE MyDatabase;
GO
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_128
ENCRYPTION BY SERVER CERTIFICATE MyServerCert;
GO
ALTER DATABASE MyDatabase
SET ENCRYPTION ON;
GO
Key Vault
Benefits with user-managed keys
Assume more control over who has access to your data and when.
Synapse Studio
[Diagram: on-premises, cloud, and SaaS data land in Azure Data Lake Storage; Synapse Studio analytics runtimes (SQL) connect to Azure Machine Learning and Power BI.]
Azure Machine Learning
Overview
Data scientists can use Azure ML notebooks to do (distributed) data preparation on Synapse Spark compute.
Benefits
Connect to your existing Azure ML workspace and project
Use the AutoML classifier for classification or regression problems
Train the model
Access open datasets
Azure Machine Learning (continued)
Power BI
Overview
Benefits
Azure Data Factory - Continue using Azure Data Factory. When the new data integration functionality within Azure Synapse becomes generally available, we will provide the capability to import your Azure Data Factory pipelines into Azure Synapse. Your existing Azure Data Factory accounts and pipelines will work with Azure Synapse if you choose not to import them into the Azure Synapse workspace. Note that the Azure-SSIS Integration Runtime (IR) will not be supported in Synapse.
Power BI – Customers link to a Power BI workspace within Azure Synapse Studio so no migration needed
ADLS Gen2 – Customers link to ADLS Gen2 within Azure Synapse Studio so no migration needed
Azure HDInsight - The Spark runtime within the Azure Synapse service is different from HDInsight
Q&A
James Serra, Big Data Evangelist
Email me at: [email protected]
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)