Introduction to Azure Data Lake
and U-SQL for SQL users
Michael Rys (@MikeDoesBigData)
John Morcos
Microsoft Corp
The Traditional Data Warehouse
Data sources
Non-relational data
The Data Lake approach
Ingest all data
regardless of
requirements
Store all data
in native format
without schema
definition
Do analysis
Using analytic
engines like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
Azure Data Lake (Store, HDInsight, Analytics)
[Diagram: an Analytics layer with ADL Analytics (.NET, SQL, Python, R scaled out by U-SQL) and HDInsight (Hive), above a Storage layer with ADL Store (WebHDFS)]
Why U-SQL?
Some sample use cases
Digital Crime Unit – Analyze complex attack patterns
to understand BotNets and to predict and mitigate
future attacks by analyzing log records with
complex custom algorithms
Image Processing – Large-scale image feature
extraction and classification using custom code
Shopping Recommendation – Complex pattern
analysis and prediction over shopping records
using proprietary algorithms
Characteristics
of Big Data
Analytics
Requires processing
of any type of data
Allow use of custom
algorithms
Scale to any size and
be efficient
Status Quo:
SQL for
Big Data
• Declarativity does scaling and parallelization for you
• Extensibility is bolted on and not “native”
• Hard to work with anything other than structured data
• Difficult to extend with custom code
Status Quo:
Programming
Languages for
Big Data
• Extensibility through custom code is “native”
• Declarativity is bolted on and not “native”
• User often has to care about scale and performance
• SQL is second class, embedded within strings
• Often no code reuse/sharing across queries
Why U-SQL?
• Declarativity and Extensibility are equally native to the language!
Get benefits of both!
Makes it easy for you by unifying:
• Declarative and imperative
• Unstructured and structured data processing
• Local and remote Queries
• Increase productivity and agility from Day 1 and
at Day 100 for YOU!
Scales out your
custom imperative
Code (written in .NET,
Python, R, and more
to come) in a
declarative SQL-
based framework
The origins
of U-SQL
SCOPE – Microsoft’s internal
Big Data language
• SQL and C# integration model
• Optimization and Scaling model
• Runs hundreds of thousands of jobs daily
Hive
• Complex data types (Maps, Arrays)
• Data format alignment for text files
T-SQL/ANSI SQL
• Many of the SQL capabilities (windowing functions, meta
data model etc.)
Query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store
Benefits
• Avoid moving large amounts of data across the
network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by
maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources
• Projections
• Filters
• Joins
[Diagram: a U-SQL query running in Azure Data Lake Analytics sends queries to Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage]
U-SQL offers Advanced Analytics
Extensions for
Massively Parallel
processing
• Python
• R
Built-in Cognitive
capabilities
• Imaging
• Detecting Objects
• Detecting Emotion in Faces
• Detecting Text (OCR)
• Text Analysis
• Key Phrase Extraction
• Sentiment Analysis
Demo
Show me U-SQL!
https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
Expression-flow
Programming Style
Automatic "in-lining" of U-SQL expressions –
whole script leads to a single execution model.
Execution plan that is optimized out-of-the-
box and w/o user intervention.
Per job and user driven level of parallelization.
Detail visibility into execution steps, for
debugging.
Heatmap like functionality to identify
performance bottlenecks.
U-SQL extensibility
Extend U-SQL with C#/.NET, Python, R
Built-in operators, functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
• Schema on Read
• Write to File
• Built-in and custom Extractors and
Outputters
• ADL Storage and Azure Blob
Storage
“Unstructured” Files
EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in Extractors: Csv, Tsv, Text with lots of options
• Custom Extractors: e.g., JSON, XML, etc. (see http://usql.io)
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
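To make "schema on read" concrete, here is a minimal Python sketch (not U-SQL, and not a description of how the built-in extractors are implemented). The file holds untyped text, and the column names and types are supplied only when the data is read. The `extract` helper and the sample data are hypothetical.

```python
import csv
import io

def extract(text, schema):
    """Apply a (name, converter) schema to untyped CSV text at read time."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, record)})
    return rows

# Mirrors "EXTRACT a string, b int" from the slide; the data is made up.
raw = "x,1\ny,2\n"
rows = extract(raw, [("a", str), ("b", int)])
```

The same raw text could be read again with a different schema, which is the point of deferring the schema to the query.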
Show me File Sets!
https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
• Simple Patterns
• Virtual Columns
• Only on EXTRACT for now
File Sets
Simple pattern language on filename and path
@pattern string =
"/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns date and suffix
• Wildcards the filename
• Limits on number of files
(Current limit 800-3000 is increased in special preview)
Virtual columns
EXTRACT
name string
, suffix string // virtual column
, date DateTime // virtual column
FROM @pattern
USING Extractors.Csv();
• Refer to virtual columns in query predicates to get partition
elimination
• Warning gets raised if no partition elimination was found
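The file-set pattern and its virtual columns can be pictured with a small Python sketch. This is an illustration under assumptions, not the U-SQL implementation: the helpers `pattern_to_regex` and `bind_virtual_columns` are hypothetical, only the tokens shown on the slide are handled, and the file list is made up. Filtering on the bound columns before any file is opened is the idea behind partition elimination.

```python
import re
from datetime import date

PATTERN = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}"

def pattern_to_regex(pattern):
    """Translate the slide's pattern tokens into a regular expression."""
    tokens = {
        "{date:yyyy}": r"(?P<yyyy>\d{4})",
        "{date:MM}": r"(?P<MM>\d{2})",
        "{date:dd}": r"(?P<dd>\d{2})",
        "{*}": r"[^/]*",  # wildcard, binds no column
    }
    parts, pos = [], 0
    for m in re.finditer(r"\{[^}]*\}", pattern):
        parts.append(re.escape(pattern[pos:m.start()]))
        token = m.group(0)
        # Any other {name} is treated as a string-valued virtual column.
        parts.append(tokens.get(token, rf"(?P<{token[1:-1]}>[^/.]+)"))
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("^" + "".join(parts) + "$")

def bind_virtual_columns(regex, path):
    """Return the virtual columns for a path, or None if it does not match."""
    m = regex.match(path)
    if m is None:
        return None
    return {
        "date": date(int(m["yyyy"]), int(m["MM"]), int(m["dd"])),
        "suffix": m["suffix"],
    }

files = [
    "/input/2017/05/19/tweets.csv",
    "/input/2017/05/20/tweets.csv",
    "/input/2017/05/20/notes.txt",
    "/other/2017/05/20/tweets.csv",
]

regex = pattern_to_regex(PATTERN)
selected = []
for path in files:
    cols = bind_virtual_columns(regex, path)
    # Predicate on virtual columns: only matching files would be read.
    if cols and cols["date"] == date(2017, 5, 20) and cols["suffix"] == "csv":
        selected.append(path)
```

With this predicate only one of the four paths is selected, so the other files never need to be opened.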
Create
shareable data
and code
https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
Meta Data Object Model
[Diagram: an ADLA Account/Catalog contains [1,n] Databases, a Database contains [1,n] Schemas, and these contain [0,n] user objects. Objects shown: tables (with clustered index, partitions, statistics), external tables, views, TVFs, procedures, table types, packages, data sources, credentials, and C# assemblies with C# functions, UDAggs, UDTs, extractors, outputters, processors, reducers, combiners, and appliers. Legend: "Contains", "Refers to", "Implemented and named by", MD name, C# name.]
• Naming
• Discovery
• Sharing
• Securing
U-SQL Catalog Naming
• Default Database and Schema context: master.dbo
• Quote identifiers with []: [my table]
• Stores data in ADL Storage /catalog folder
Discovery
• Visual Studio Server Explorer
• Azure Data Lake Analytics Portal
• SDKs and Azure PowerShell commands
Sharing
• Within an Azure Data Lake Analytics account
• Across ADLA accounts that share same Azure Active Directory:
• Referencing Assemblies
• Calling TVFs and referencing tables and views
• Inserting into Tables
Securing
• Secured with AAD principals at catalog and Database level
• Views for simple cases
• TVFs for parameterization and
most cases
VIEWs and TVFs
Views
CREATE VIEW V AS EXTRACT…
CREATE VIEW V AS SELECT …
• Cannot contain user-defined objects (e.g. UDF or UDOs)!
• Will be inlined
Table-Valued Functions (TVFs)
CREATE FUNCTION F (@arg string = "default")
RETURNS @res [TABLE ( … )]
AS BEGIN … @res = … END;
• Provides parameterization
• One or more results
• Can contain multiple statements
• Can contain user-code (needs assembly reference)
• Will always be inlined
• Infers schema or checks against specified return schema
Procedures
CREATE PROCEDURE P (@arg string = "default") AS
BEGIN
…;
CREATE TABLE T …;
OUTPUT @res TO …;
INSERT INTO T …;
END;
• Provides parameterization
• No result but writes into file or table
• Can contain multiple statements
• Can contain user-code (needs assembly reference)
• Will always be inlined
• Can contain DDL (but no CREATE, DROP
FUNCTION/PROCEDURE)
• CREATE TABLE
• CREATE TABLE AS SELECT
Tables
CREATE TABLE T (col1 int
, col2 string
, col3 SQL.MAP<string,string>
, INDEX idx CLUSTERED (col2 ASC)
PARTITIONED BY (col1)
DISTRIBUTED BY HASH (col2)
);
• Structured Data, built-in Data types only (no UDTs)
• Clustered Index (needs to be specified): row-oriented
• Fine-grained distribution (needs to be specified):
• HASH, DIRECT HASH, RANGE, ROUND ROBIN
• Addressable Partitions (optional)
CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…;
CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);
• Infer the schema from the query
• Still requires index and distribution (does not support partitioning)
When to use
Tables
Benefits of Table clustering and distribution
• Faster lookup of data provided by distribution and clustering when right
distribution/cluster is chosen
• Data distribution provides better localized scale out
• Used for filters, joins and grouping
Benefits of Table partitioning
• Provides data life cycle management (“expire” old partitions)
• Partial re-computation of data at partition level
• Query predicates can provide partition elimination
Do not use when…
• No filters, joins and grouping
• No reuse of the data for future queries
If in doubt: use sampling (e.g., SAMPLE ANY(x)) and test.
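The benefit of hash distribution for filters, joins, and grouping can be sketched in Python. This is a conceptual illustration only: the hash function (`zlib.crc32`), the bucket count, and the sample rows are assumptions and say nothing about how ADLA actually hashes or stores data. The point is that rows with the same key always land in the same bucket, so an equality filter or a group-by on that key touches one bucket instead of all of them.

```python
import zlib

def bucket_of(key, buckets):
    """Deterministic bucket number for a string key (illustrative hash)."""
    return zlib.crc32(key.encode("utf-8")) % buckets

def distribute(rows, key, buckets):
    """Place each row into the bucket chosen by the hash of its key."""
    parts = [[] for _ in range(buckets)]
    for row in rows:
        parts[bucket_of(row[key], buckets)].append(row)
    return parts

rows = [
    {"col1": 1, "col2": "alpha"},
    {"col1": 2, "col2": "beta"},
    {"col1": 3, "col2": "alpha"},
    {"col1": 4, "col2": "gamma"},
]
parts = distribute(rows, "col2", buckets=4)

# An equality filter on the distribution key only needs one bucket.
target = parts[bucket_of("alpha", 4)]
alpha_rows = [row for row in target if row["col2"] == "alpha"]
```

If no query ever filters, joins, or groups on the key, this placement buys nothing, which matches the "do not use when" advice above.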
• ALTER TABLE ADD/DROP
COLUMN
Evolving Tables
ALTER TABLE T ADD COLUMN eventName string;
ALTER TABLE T DROP COLUMN col3;
ALTER TABLE T ADD COLUMN result string, clientId
string, payload int?;
ALTER TABLE T DROP COLUMN clientId, result;
• Meta-data only operation
• Existing rows will get
• Non-nullable types: C# data type default value (e.g., int will
be 0)
• Nullable types: null
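One way to picture a metadata-only ADD COLUMN is that stored rows are left untouched and the missing values are supplied when the rows are read. The Python sketch below illustrates that idea under assumptions: the helpers `default_for` and `read_row`, the subset of C# defaults, and the column `retries` are hypothetical, and this is not a description of ADLA's storage format.

```python
# C# default values for a few non-nullable types (assumed subset).
CSHARP_DEFAULTS = {"int": 0, "long": 0, "double": 0.0, "bool": False}

def default_for(type_name):
    """Value an existing row gets for a newly added column of this type."""
    # Nullable value types end in "?"; string is a nullable reference type.
    if type_name.endswith("?") or type_name == "string":
        return None
    return CSHARP_DEFAULTS[type_name]

def read_row(stored, columns):
    """Pad a row written under an older schema with defaults for new columns."""
    padded = list(stored)
    for _, type_name in columns[len(stored):]:
        padded.append(default_for(type_name))
    return dict(zip((name for name, _ in columns), padded))

columns = [("col1", "int"), ("col2", "string")]
old_row = (7, "a")  # written before the ALTER TABLE

# ADD COLUMN changes only the column list; old_row itself is not rewritten.
columns = columns + [("payload", "int?"), ("retries", "int")]
row = read_row(old_row, columns)
```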
Let’s do
some SQL
with U-SQL!
https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
U-SQL
Joins
Join operators
• INNER JOIN
• LEFT or RIGHT or FULL OUTER JOIN
• CROSS JOIN
• SEMIJOIN
• equivalent to IN subquery
• ANTISEMIJOIN
• Equivalent to NOT IN subquery
Notes
• ON clause comparisons need to be of the simple form:
rowset.column == rowset.column
or AND conjunctions of the simple equality comparison
• If a comparand is not a column, wrap it into a column in a previous
SELECT
• If the comparison operation is not ==, put it into the WHERE clause
• turn the join into a CROSS JOIN if no equality comparison
Reason: Syntax calls out which joins are efficient
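The join notes above can be illustrated with a short Python sketch of the semantics (plain Python, not U-SQL; the sample rowsets and helper names are made up). SEMIJOIN and ANTISEMIJOIN return left rows at most once, and an inner join matches on equality first, with any non-equality comparison applied afterwards as a filter, which is what moving it into the WHERE clause amounts to.

```python
customers = [
    {"cid": 1, "name": "Ann"},
    {"cid": 2, "name": "Bob"},
    {"cid": 3, "name": "Cy"},
]
orders = [
    {"oid": 10, "cid": 1, "amount": 5.0},
    {"oid": 11, "cid": 1, "amount": 50.0},
    {"oid": 12, "cid": 3, "amount": 20.0},
]

def semijoin(left, right, key):
    """Left rows that have at least one match (like an IN subquery)."""
    keys = {row[key] for row in right}
    return [row for row in left if row[key] in keys]

def antisemijoin(left, right, key):
    """Left rows that have no match (like a NOT IN subquery)."""
    keys = {row[key] for row in right}
    return [row for row in left if row[key] not in keys]

def inner_join(left, right, key, where=lambda lrow, rrow: True):
    """Equality match via a hash index, then an optional residual filter."""
    index = {}
    for rrow in right:
        index.setdefault(rrow[key], []).append(rrow)
    return [(lrow, rrow)
            for lrow in left
            for rrow in index.get(lrow[key], [])
            if where(lrow, rrow)]

with_orders = [c["name"] for c in semijoin(customers, orders, "cid")]
without_orders = [c["name"] for c in antisemijoin(customers, orders, "cid")]
all_pairs = inner_join(customers, orders, "cid")
big_orders = inner_join(customers, orders, "cid",
                        where=lambda c, o: o["amount"] > 10)
```

Note that Ann appears once in the semijoin result but twice in the inner join, because she has two orders.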
U-SQL
Analytics
Windowing Expression
Window_Function_Call 'OVER' '('
[ Over_Partition_By_Clause ]
[ Order_By_Clause ]
[ Row_Clause ]
')'.
Window_Function_Call :=
Aggregate_Function_Call
| Analytic_Function_Call
| Ranking_Function_Call.
Windowing Aggregate Functions
ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP
Analytics Functions
CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT,
PERCENTILE_DISC, PERCENT_RANK, LEAD, LAG
Ranking Functions
DENSE_RANK, NTILE, RANK, ROW_NUMBER
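The difference between the ranking functions is easiest to see on ties. The Python sketch below emulates ROW_NUMBER, RANK, and DENSE_RANK over a partition with ascending order; the `rank_rows` helper and the sample data are hypothetical and only illustrate the usual SQL semantics.

```python
from itertools import groupby

def rank_rows(rows, partition_key, order_key):
    """Add row_number, rank, and dense_rank per partition, ascending order."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    for _, group in groupby(ordered, key=lambda r: r[partition_key]):
        previous = object()  # sentinel that equals no real value
        rank = dense = 0
        for number, row in enumerate(group, start=1):
            if row[order_key] != previous:
                rank = number   # RANK jumps to the current row number
                dense += 1      # DENSE_RANK advances by one per distinct value
                previous = row[order_key]
            out.append({**row, "row_number": number,
                        "rank": rank, "dense_rank": dense})
    return out

rows = [
    {"dept": "a", "sal": 10},
    {"dept": "a", "sal": 10},
    {"dept": "a", "sal": 20},
    {"dept": "b", "sal": 5},
]
ranked = rank_rows(rows, "dept", "sal")
triples = [(r["row_number"], r["rank"], r["dense_rank"]) for r in ranked]
```

For the two tied rows in department "a", ROW_NUMBER still counts 1 and 2, while RANK and DENSE_RANK both give 1; the next row gets RANK 3 but DENSE_RANK 2.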
“Top 5”s
Surprises for
SQL Users
• AS is not as
• C# keywords and SQL keywords overlap
• Costly to make case-insensitive -> Better
build capabilities than tinker with syntax
• = != ==
• Remember: C# expression language
• null IS NOT NULL
• C# nulls are two-valued
• PROCEDURES but no WHILE
• No UPDATE, DELETE, nor MERGE (yet)
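The point about two-valued nulls can be shown with a small Python sketch. It contrasts ANSI SQL's three-valued comparison, where comparing with NULL yields UNKNOWN, with the C#-style comparison that U-SQL expressions use, where `null == null` is true. Python's `None` stands in for both NULL and UNKNOWN here, and the helper names are made up for illustration.

```python
def sql_equals(a, b):
    """ANSI SQL style: any comparison with NULL is UNKNOWN (None here)."""
    if a is None or b is None:
        return None
    return a == b

def csharp_equals(a, b):
    """C# style: the result is always true or false, and null == null."""
    return a == b

def where(rows, predicate):
    """A WHERE clause keeps a row only when the predicate is exactly true."""
    return [row for row in rows if predicate(row) is True]

values = [1, None, 2]

# "x = NULL" in ANSI SQL matches nothing; "x == null" in C# finds the null row.
sql_matches = where(values, lambda x: sql_equals(x, None))
csharp_matches = where(values, lambda x: csharp_equals(x, None))
```

This is why a filter written as `x == null` behaves as a null test in a C# expression, whereas the equivalent ANSI SQL comparison would return no rows.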
U-SQL Language Philosophy
Declarative Query and Transformation Language:
• Uses SQL’s SELECT FROM WHERE with GROUP
BY/Aggregation, Joins, SQL Analytics functions
• Optimizable, Scalable
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Operates on Unstructured & Structured Data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on .NET
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined Aggregators (C#)
• User-defined Operators (UDO) (C#, Python, R)
U-SQL provides the Parallelization and Scale-out
Framework for Usercode
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER,
COMBINER, APPLIER
Federated query across distributed data sources
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float, ... );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt"
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt"
USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
, MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
, AGG<MyAgg.MySum>(o.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
Scales out your data processing over large amounts of data
Unifies natively SQL’s declarativity and PL’s extensibility
Unifies querying structured and unstructured data
Unifies querying Data Lake and SQL Server (in Azure) data
Increase productivity & agility on Day 1 & 100 for YOU!
Sign up for an Azure Data Lake account at
http://www.azure.com/datalake and give us your feedback via
http://aka.ms/adlfeedback!
This is why U-SQL!
Additional
Resources
Blogs, presentations and community pages:
http://aka.ms/AzureDataLake
http://usql.io (U-SQL Github)
http://blogs.msdn.microsoft.com/mrys/
http://blogs.msdn.microsoft.com/azuredatalake/
http://www.slideshare.net/MichaelRys
Documentation, articles, and videos:
http://aka.ms/usql_reference
https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
https://msdn.microsoft.com/en-us/magazine/mt614251
https://channel9.msdn.com/Search?term=U-SQL#ch9Search
https://www.youtube.com/results?search_query=U-SQL
ADL forums and feedback
http://aka.ms/adlfeedback
https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
http://stackoverflow.com/questions/tagged/u-sql
• Continue your education
at Microsoft Virtual
Academy online.
SQLSaturday Sponsors!
[Sponsor logos by tier: Titanium & Global Partner, Gold, Silver, Bronze]
Without the generosity of these sponsors, this event would not be
possible! Please, stop by the vendor booths and thank them.

Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)

  • 1. Introduction to Azure Data Lake and U-SQL for SQL users Michael Rys (@MikeDoesBigData) John Morcos Microsoft Corp
  • 2. The Traditional Data Warehouse 2 Data sourcesNon-relational data
  • 3. The Data Lake approach Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Interactive queries Batch queries Machine Learning Data warehouse Real-time analytics Devices
  • 4. WebHDFS .NET, SQL, Python, R scaled out by U-SQL ADL Analytics HDInsight ADL Store HiveAnalytics Storage Azure Data Lake (Store, HDInsight, Analytics)
  • 6. Some sample use cases Digital Crime Unit – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks by analyzing log records with complex custom algorithms Image Processing – Large-scale image feature extraction and classification using custom code Shopping Recommendation – Complex pattern analysis and prediction over shopping records using proprietary algorithms Characteristics of Big Data Analytics Requires processing of any type of data Allow use of custom algorithms Scale to any size and be efficient
  • 7. Status Quo: SQL for Big Data  Declarativity does scaling and parallelization for you  Extensibility is bolted on and not “native”  hard to work with anything other than structured data  difficult to extend with custom code
  • 8. Status Quo: Programming Languages for Big Data  Extensibility through custom code is “native”  Declarativity is bolted on and not “native”  User often has to care about scale and performance  SQL is 2nd class within string  Often no code reuse/ sharing across queries
  • 9. Why U-SQL?  Declarativity and Extensibility are equally native to the language! Get benefits of both! Makes it easy for you by unifying: • Declarative and imperative • Unstructured and structured data processing • Local and remote Queries • Increase productivity and agility from Day 1 and at Day 100 for YOU! Scales out your custom imperative Code (written in .NET, Python, R, and more to come) in a declarative SQL- based framework
  • 10. The origins of U-SQL SCOPE – Microsoft’s internal Big Data language • SQL and C# integration model • Optimization and Scaling model • Runs 100’000s of jobs daily Hive • Complex data types (Maps, Arrays) • Data format alignment for text files T-SQL/ANSI SQL • Many of the SQL capabilities (windowing functions, meta data model etc.)
  • 11. Query data where it lives Easily query data in multiple Azure data stores without moving it to a single store Benefits • Avoid moving large amounts of data across the network between stores • Single view of data irrespective of physical location • Minimize data proliferation issues caused by maintaining multiple copies • Single query language for all data • Each data store maintains its own sovereignty • Design choices based on the need • Push SQL expressions to remote SQL sources • Projections • Filters • Joins U-SQL Query Query Azure Storage Blobs Azure SQL in VMs Azure SQL DB Azure Data Lake Analytics Azure SQL Data Warehouse Azure Data Lake Storage
  • 12. U-SQL offers Advanced Analytics Extensions for Massively Parallel processing • Python • R Built-in Cognitive capabilities • Imaging • Detecting Objects • Detecting Emotion in Faces • Detecting Text (OCR) • Text Analysis • Key Phrase Extraction • Sentiment Analysis
  • 14. Expression-flow Programming Style Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model. Execution plan that is optimized out-of-the- box and w/o user intervention. Per job and user driven level of parallelization. Detail visibility into execution steps, for debugging. Heatmap like functionality to identify performance bottlenecks.
  • 15. U-SQL extensibility Extend U-SQL with C#/.NET, Python, R Built-in operators, function, aggregates C# expressions (in SELECT expressions) User-defined aggregates (UDAGGs) User-defined functions (UDFs) User-defined operators (UDOs)
  • 16. • Schema on Read • Write to File • Built-in and custom Extractors and Outputters • ADL Storage and Azure Blob Storage “Unstructured” Files EXTRACT Expression @s = EXTRACT a string, b int FROM "filepath/file.csv" USING Extractors.Csv(encoding: Encoding.Unicode); • Built-in Extractors: Csv, Tsv, Text with lots of options • Custom Extractors: e.g., JSON, XML, etc. (see http://usql.io) OUTPUT Expression OUTPUT @s TO "filepath/file.csv" USING Outputters.Csv(); • Built-in Outputters: Csv, Tsv, Text • Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io) Filepath URIs • Relative URI to default ADL Storage account: "filepath/file.csv" • Absolute URIs: • ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv" • WASB: "wasb://container@account/filepath/file.csv"
  • 17. Show me File Sets! https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
  • 18. • Simple Patterns • Virtual Columns • Only on EXTRACT for now File Sets Simple pattern language on filename and path @pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}"; • Binds two columns date and suffix • Wildcards the filename • Limits on number of files (Current limit 800-3000 is increased in special preview) Virtual columns EXTRACT name string , suffix string // virtual column , date DateTime // virtual column FROM @pattern USING Extractors.Csv(); • Refer to virtual columns in query predicates to get partition elimination • Warning gets raised if no partition elimination was found
  • 20. Meta Data Object Model ADLA Account/Catalog Database Schema [1,n] [1,n] [0,n] tables views TVFs C# Fns C# UDAgg Clustered Index partitions C# Assemblies C# Extractors Data Source C# Reducers C# Processors C# Combiners C# Outputters Ext. tables User objects Refers toContains Implemented and named by Procedures Creden- tials MD Name C# Name C# Applier Table Types Legend Statistics C# UDTs Packages
  • 21. • Naming • Discovery • Sharing • Securing U-SQL Catalog Naming • Default Database and Schema context: master.dbo • Quote identifiers with []: [my table] • Stores data in ADL Storage /catalog folder Discovery • Visual Studio Server Explorer • Azure Data Lake Analytics Portal • SDKs and Azure Powershell commands Sharing • Within an Azure Data Lake Analytics account • Across ADLA accounts that share same Azure Active Directory: • Referencing Assemblies • Calling TVFs and referencing tables and views • Inserting into Tables Securing • Secured with AAD principals at catalog and Database level
  • 22. • Views for simple cases • TVFs for parameterization and most cases VIEWs and TVFs Views CREATE VIEW V AS EXTRACT… CREATE VIEW V AS SELECT … • Cannot contain user-defined objects (e.g. UDF or UDOs)! • Will be inlined Table-Valued Functions (TVFs) CREATE FUNCTION F (@arg string = "default") RETURNS @res [TABLE ( … )] AS BEGIN … @res = … END; • Provides parameterization • One or more results • Can contain multiple statements • Can contain user-code (needs assembly reference) • Will always be inlined • Infers schema or checks against specified return schema
  • 23. Procedures CREATE PROCEDURE P (@arg string = "default“) AS BEGIN …; CREATE TABLE T …; OUTPUT @res TO …; INSERT INTO T …; END; • Provides parameterization • No result but writes into file or table • Can contain multiple statements • Can contain user-code (needs assembly reference) • Will always be inlined • Can contain DDL (but no CREATE, DROP FUNCTION/PROCEDURE)
  • 24. • CREATE TABLE • CREATE TABLE AS SELECT Tables CREATE TABLE T (col1 int , col2 string , col3 SQL.MAP<string,string> , INDEX idx CLUSTERED (col2 ASC) PARTITIONED BY (col1) DISTRIBUTED BY HASH (col2) ); • Structured Data, built-in Data types only (no UDTs) • Clustered Index (needs to be specified): row-oriented • Fine-grained distribution (needs to be specified): • HASH, DIRECT HASH, RANGE, ROUND ROBIN • Addressable Partitions (optional) CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …; CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…; CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT); • Infer the schema from the query • Still requires index and distribution (does not support partitioning)
  • 25. When to use Tables Benefits of Table clustering and distribution • Faster lookup of data provided by distribution and clustering when right distribution/cluster is chosen • Data distribution provides better localized scale out • Used for filters, joins and grouping Benefits of Table partitioning • Provides data life cycle management (“expire” old partitions) • Partial re-computation of data at partition level • Query predicates can provide partition elimination Do not use when… • No filters, joins and grouping • No reuse of the data for future queries If in doubt: use sampling (e.g., SAMPLE ANY(x)) and test.
  • 26. • ALTER TABLE ADD/DROP COLUMN Evolving Tables ALTER TABLE T ADD COLUMN eventName string; ALTER TABLE T DROP COLUMN col3; ALTER TABLE T ADD COLUMN result string, clientId string, payload int?; ALTER TABLE T DROP COLUMN clientId, result; • Meta-data only operation • Existing rows will get • Non-nullable types: C# data type default value (e.g., int will be 0) • Nullable types: null
  • 27. Let’s do some SQL with U-SQL! https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
  • 28. U-SQL Joins Join operators • INNER JOIN • LEFT or RIGHT or FULL OUTER JOIN • CROSS JOIN • SEMIJOIN • equivalent to IN subquery • ANTISEMIJOIN • Equivalent to NOT IN subquery Notes • ON clause comparisons need to be of the simple form: rowset.column == rowset.column or AND conjunctions of the simple equality comparison • If a comparand is not a column, wrap it into a column in a previous SELECT • If the comparison operation is not ==, put it into the WHERE clause • turn the join into a CROSS JOIN if no equality comparison Reason: Syntax calls out which joins are efficient
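The join notes above can be illustrated with a short script. This is a sketch, assuming the orders and customers files used elsewhere in the deck; the threshold of 100 and the output paths are made up. It shows a non-equality condition moved out of the ON clause, and a SEMIJOIN standing in for an IN subquery.

```usql
@o =
    EXTRACT oid int, cid int, odate DateTime, amount double
    FROM "/input/orders.txt"
    USING Extractors.Csv();

@c =
    EXTRACT cid int, name string, city string
    FROM "/input/customers.txt"
    USING Extractors.Csv();

// ON may only hold column == column comparisons (combined with AND).
// The non-equality condition therefore goes into WHERE.
@big =
    SELECT c.cid, c.name, o.oid, o.amount
    FROM @c AS c
         INNER JOIN @o AS o
         ON c.cid == o.cid
    WHERE o.amount > 100;

// Customers with at least one order: the U-SQL form of
// "WHERE cid IN (SELECT cid FROM orders)".
@with_orders =
    SELECT c.cid, c.name
    FROM @c AS c
         LEFT SEMIJOIN @o AS o
         ON c.cid == o.cid;

OUTPUT @big TO "/output/big_orders.csv" USING Outputters.Csv();
OUTPUT @with_orders TO "/output/customers_with_orders.csv" USING Outputters.Csv();
```

For an INNER JOIN, moving the condition to WHERE does not change the result. For an outer join it can, because the filter is then applied after the null-extended rows have been produced.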
  • 29. U-SQL Analytics Windowing Expression: Windowing_Expression := Window_Function_Call 'OVER' '(' [ Over_Partition_By_Clause ] [ Order_By_Clause ] [ Row_Clause ] ')'. Window_Function_Call := Aggregate_Function_Call | Analytic_Function_Call | Ranking_Function_Call. Windowing Aggregate Functions: ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP Analytics Functions: CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK, LEAD, LAG Ranking Functions: DENSE_RANK, NTILE, RANK, ROW_NUMBER
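A windowing expression following the grammar above might look like this. This is a sketch under assumptions: the orders file, its columns and the "top 3" cut-off are illustrative only. It combines a ranking function and a windowing aggregate over the same partition.

```usql
@o =
    EXTRACT oid int, cid int, odate DateTime, amount double
    FROM "/input/orders.txt"
    USING Extractors.Csv();

@ranked =
    SELECT cid,
           oid,
           amount,
           // Ranking function: position of each order within its customer.
           ROW_NUMBER() OVER (PARTITION BY cid ORDER BY amount DESC) AS rn,
           // Windowing aggregate: the same total repeated on every row of the partition.
           SUM(amount) OVER (PARTITION BY cid) AS total_per_customer
    FROM @o;

// The window result is filtered in a second statement.
@top3 =
    SELECT cid, oid, amount, total_per_customer
    FROM @ranked
    WHERE rn <= 3;

OUTPUT @top3 TO "/output/top3_orders.csv" USING Outputters.Csv();
```

Unlike GROUP BY, the OVER clause keeps one output row per input row, which is why the per-customer total can sit next to each individual order.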
  • 30. “Top 5”s Surprises for SQL Users • AS is not as • C# keywords and SQL keywords overlap • Costly to make case-insensitive -> Better build capabilities than tinker with syntax • = != == • Remember: C# expression language • null IS NOT NULL • C# nulls are two-valued • PROCEDURES but no WHILE • No UPDATE, DELETE, nor MERGE (yet)
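Several of these surprises show up in even a small query. The following is a sketch, assuming the customers file from the deck's later example; the literal values are made up. The comments mark where C# rules apply instead of SQL rules.

```usql
@c =
    EXTRACT cid int, name string, city string
    FROM "/input/customers.txt"
    USING Extractors.Csv();

@r =
    SELECT cid,
           // AS must be upper-case: lower-case "as" is the C# type-cast operator.
           (name ?? "(unknown)") AS name,   // ?? is C#'s null-coalescing operator
           city
    FROM @c
    WHERE city == "Seattle"   // comparison is ==, since expressions are C#
          OR city == null;    // two-valued logic: this is true or false, never UNKNOWN

OUTPUT @r TO "/output/customers_clean.csv" USING Outputters.Csv();
```

In ANSI SQL the predicate `city = NULL` never evaluates to true, whereas here `city == null` is true for rows where city is null.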
  • 31. U-SQL Language Philosophy Declarative Query and Transformation Language: • Uses SQL’s SELECT FROM WHERE with GROUP BY/Aggregation, Joins, SQL Analytics functions • Optimizable, Scalable Expression-flow programming style: • Easy to use functional lambda composition • Composable, globally optimizable Operates on Unstructured & Structured Data • Schema on read over files • Relational metadata objects (e.g. database, table) Extensible from ground up: • Type system is based on .NET • Expression language IS C# • User-defined functions (U-SQL and C#) • User-defined Aggregators (C#) • User-defined Operators (UDO) (C#, Python, R) U-SQL provides the Parallelization and Scale-out Framework for User code • EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER Federated query across distributed data sources REFERENCE ASSEMBLY MyDB.MyAssembly; CREATE TABLE T( cid int, first_order DateTime , last_order DateTime, order_count int , order_amount float, ... ); @o = EXTRACT oid int, cid int, odate DateTime, amount float FROM "/input/orders.txt" USING Extractors.Csv(); @c = EXTRACT cid int, name string, city string FROM "/input/customers.txt" USING Extractors.Csv(); @j = SELECT c.cid, MIN(o.odate) AS firstorder , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt , AGG<MyAgg.MySum>(o.amount) AS totalamount FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10 GROUP BY c.cid; OUTPUT @j TO "/output/result.txt" USING new MyData.Write(); INSERT INTO T SELECT * FROM @j;
  • 32. Scales out your data processing over large amounts of data Unifies natively SQL’s declarativity and PL’s extensibility Unifies querying structured and unstructured data Unifies querying Data Lake and SQL Server (in Azure) data Increase productivity & agility on Day 1 & 100 for YOU! Sign up for an Azure Data Lake account at http://www.azure.com/datalake and give us your feedback via http://aka.ms/adlfeedback! This is why U-SQL!
  • 33. Additional Resources Blogs, presentations and community pages: http://aka.ms/AzureDataLake http://usql.io (U-SQL Github) http://blogs.msdn.microsoft.com/mrys/ http://blogs.msdn.microsoft.com/azuredatalake/ http://www.slideshare.net/MichaelRys Documentation, articles, and videos: http://aka.ms/usql_reference https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/ https://msdn.microsoft.com/en-us/magazine/mt614251 https://channel9.msdn.com/Search?term=U-SQL#ch9Search https://www.youtube.com/results?search_query=U-SQL ADL forums and feedback http://aka.ms/adlfeedback https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake http://stackoverflow.com/questions/tagged/u-sql • Continue your education at Microsoft Virtual Academy online.
  • 34. SQLSaturday Sponsors! Titanium & Global Partner Gold Silver Bronze Without the generosity of these sponsors, this event would not be possible! Please, stop by the vendor booths and thank them.

Editor's Notes

  • #3: Why is Gartner saying this? What is the current state of the traditional data warehouse? There are 4 key reasons why data warehouses are at their tipping point and why something needs to change. (1) Increase in data volumes: Data volumes are expected to grow 10X over the next five years, and traditional data warehouses cannot keep up with this explosion of data. (2) Real-time data: Analysts and business stakeholders want access to real-time, dynamic data. I want my data. I want it fast. With the increase in data volumes, it’s hard to keep up. (3) New data sources and types: 85% of data growth is coming from “non-relational” data in the form of things like web logs, sensor data, social sentiment and devices. What new skills do folks need to be trained on? What’s the time to solution? Because, as we all know, time is money. (4) Cloud-born data: Data from the cloud (i.e., CRM, ERP, etc.) stored by any type of corporate-owned system. How do you incorporate both on-premises and cloud data as part of your data warehouse? This is the last trend that is breaking the traditional data warehouse. Because of these four trends, we need to evolve our traditional data warehouse to become the “modern data warehouse.” We believe Microsoft’s modern data warehouse approach properly addresses this need.
  • #4: A data lake is an enterprise wide repository of every type of data collected in a single place. Data of all types can be arbitrarily stored in the data lake prior to any formal definition of requirements or schema for the purposes of operational and exploratory analytics. Advanced analytics can be done using Hadoop, Machine Learning tools, or act as a lower cost data preparation location prior to moving curated data into a data warehouse. In these cases, customers would load data into the data lake prior to defining any transformation logic. This is bottom up because data is collected first and the data itself gives you the insight and helps derive conclusions or predictive models.
  • #7: Add velocity?
  • #8: Hard to operate on unstructured data: Even Hive requires metadata to be created to operate on unstructured data. Adding custom Java functions, aggregators and SerDes involves many steps, often requires access to the server’s head node, and differs based on the type of operation. Requires many tools and steps. Some examples: Hive UDAgg: Code and compile .java into .jar: extend the AbstractGenericUDAFResolver class (does type checking, argument checking and overloading); extend the GenericUDAFEvaluator class (implements logic in 8 methods). Deploy: deploy the jar into the class path on the server; edit FunctionRegistry.java to register as built-in; update the content of show functions with ant. Hive UDF (as of v0.13): Code; load the JAR onto the head node or at a URI; CREATE FUNCTION USING JAR to register and load the jar into the classpath for every function (instead of registering the jar once and just using the functions).
  • #9: Spark supports custom “inputters and outputters” for defining custom RDDs. No UDAggs. Simple integration of UDFs, but only for the duration of the program. No reuse/sharing. Cloud Dataflow? User has to care about scale and perf. Spark UDAgg: is not yet supported (SPARK-3947). Spark UDF: Write an inline function: def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state); for SQL usage, register the table: customerTable.registerTempTable("customerTable"); register each UDF: sqlContext.udf.register("westernState", westernState _); call it: val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
  • #10: Offers auto-scaling and performance. Operates on unstructured data without tables needed. Easy to extend declaratively with custom code: consistent model for UDO, UDF and UDAgg. Easy to query remote sources even without external tables. U-SQL UDAgg: Code and compile .cs file: implement IAggregate’s 3 methods: Init(), Accumulate(), Terminate(); C# takes care of type checking, generics etc. Deploy: Tooling: one-click registration of the assembly in the user db; By hand: copy file to ADL, then CREATE ASSEMBLY to register the assembly. Use via AGG<MyNamespace.MyAggregate<T>>(a). U-SQL UDF: Code in C#, register assembly once, call by C# name.
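The "by hand" deployment path described in this note can be sketched on the U-SQL side as follows. The C# aggregator itself is not shown. The assembly name, its path in the store and the class MyNamespace.MySum are hypothetical placeholders for a compiled class that implements the three aggregator methods.

```usql
// One-time registration. Assumes MyAssembly.dll was compiled from C#
// and uploaded to this (hypothetical) path in the store.
CREATE ASSEMBLY IF NOT EXISTS MyAssembly FROM "/assemblies/MyAssembly.dll";

// Every script that uses the code references the registered assembly.
REFERENCE ASSEMBLY MyAssembly;

@o =
    EXTRACT oid int, cid int, odate DateTime, amount double
    FROM "/input/orders.txt"
    USING Extractors.Csv();

@totals =
    SELECT cid,
           // MyNamespace.MySum is an assumed user-defined aggregator over double.
           AGG<MyNamespace.MySum>(amount) AS total_amount
    FROM @o
    GROUP BY cid;

OUTPUT @totals TO "/output/totals.csv" USING Outputters.Csv();
```

Once registered, the assembly stays in the catalog, so later scripts only need the REFERENCE ASSEMBLY line. This is the reuse across queries that the Hive and Spark notes above describe as missing.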
  • #11: Remove SCOPE for external customers?
  • #12: DATA SOURCE: Represents a remote data source such as Azure SQL Database. You have to specify all the details (connection string, credentials, etc.) required to connect to it and issue queries. EXTERNAL TABLE: A local table, with columns defined in C# types, that redirects queries issued against it to the remote table that it is based on. U-SQL automatically does the type conversion. External tables let you impose a specific schema on the remote data, shielding you from remote schema changes. You can issue queries that ‘join’ external and local tables. PASS THROUGH queries: These queries are issued directly against the remote data source in the syntax of the remote data source (say, T-SQL for Azure SQL Database). REMOTABLE_TYPES: For every external data source you have to specify the list of ‘remotable’ types. This list constrains the types of queries that will be remoted. Ex: REMOTABLE_TYPES = (bool, byte, short, ushort, int, decimal); LAZY METADATA LOADING: Here the remote data is schematized only when the query is actually issued to the remote data source. Your program must be able to deal with remote schema changes.
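The three federated-query building blocks in this note can be sketched together. Everything named here is an assumption for illustration: the data source MyAzureSqlDb, the credential MyCred (assumed to have been created in the catalog beforehand), the remote database and table names, and the column lists.

```usql
// DATA SOURCE: connection details plus the types whose predicates may be remoted.
CREATE DATA SOURCE IF NOT EXISTS MyAzureSqlDb
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=MyRemoteDb;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = MyCred,
    REMOTABLE_TYPES = (bool, byte, short, ushort, int, decimal)
);

// EXTERNAL TABLE: a local schema, in C# types, over the remote table.
CREATE EXTERNAL TABLE IF NOT EXISTS dbo.RemoteCustomers
(
    cid int,
    name string,
    city string
)
FROM MyAzureSqlDb LOCATION "[dbo].[Customers]";

// PASS THROUGH query: the text inside EXECUTE is T-SQL, run by the remote database.
@seattle =
    SELECT *
    FROM EXTERNAL MyAzureSqlDb
    EXECUTE @"SELECT cid, name FROM dbo.Customers WHERE city = 'Seattle'";

// Joining remote data with a file in the lake.
@o =
    EXTRACT oid int, cid int, odate DateTime, amount double
    FROM "/input/orders.txt"
    USING Extractors.Csv();

@joined =
    SELECT c.cid, c.name, o.amount
    FROM dbo.RemoteCustomers AS c
         INNER JOIN @o AS o
         ON c.cid == o.cid;

OUTPUT @seattle TO "/output/seattle.csv" USING Outputters.Csv();
OUTPUT @joined TO "/output/joined.csv" USING Outputters.Csv();
```

Note the different comparison operators: the pass-through text uses T-SQL's single `=`, while the U-SQL join condition uses C#'s `==`.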
  • #16: Extensions require .NET assemblies to be registered with a database
  • #18: Shows simple Extract, OUTPUT Then simple extensibility with string functions.
  • #21: Add file sets.
  • #23: Show Views, TVFs and Tables
  • #33: GROUP BY, ORDER BY, CROSS APPLY
  • #37: Use for language experts