U-SQL Killer scenarios:
Custom Processing, Big Cognition,
Image and JSON processing at Scale
Michael Rys (@MikeDoesBigData)
John Morcos
Microsoft Corp
Agenda
Introduction to U-SQL’s Extensibility
U-SQL Cognitive Services
More Custom Image processing
Python in U-SQL
R in U-SQL
JSON processing
U-SQL extensibility
Extend U-SQL with C#/.NET, Python, R etc.
Built-in operators, functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
What are UDOs?
User-Defined Extractors
User-Defined Outputters
User-Defined Processors
Take one row and produce one row
Pass-through versus transforming
User-Defined Appliers
Take one row and produce 0 to n rows
Used with OUTER/CROSS APPLY
User-Defined Combiners
Combines rowsets (like a user-defined join)
User-Defined Reducers
Take n rows and produce m rows (normally m<n)
Custom Operator Extensions, scaled out by U-SQL
Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
EXTRACT
OUTPUT
CROSS APPLY
PROCESS
COMBINE
REDUCE
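The row contracts listed above (one row in and one row out for processors, 0 to n rows out for appliers, n rows in and m rows out for reducers) can be modeled in a few lines of plain Python. This is only a conceptual sketch: the function names `process`, `apply_rows`, and `reduce_rows` and the sample rows are hypothetical and are not part of the U-SQL API.

```python
from itertools import groupby
from operator import itemgetter


def process(rows, fn):
    """Processor contract: exactly one output row per input row."""
    for row in rows:
        yield fn(row)


def apply_rows(rows, fn):
    """Applier contract (CROSS APPLY style): each input row yields 0..n rows."""
    for row in rows:
        for out in fn(row):
            yield {**row, **out}


def reduce_rows(rows, key, fn):
    """Reducer contract: each key group of n rows yields m rows."""
    ordered = sorted(rows, key=itemgetter(key))
    for k, group in groupby(ordered, key=itemgetter(key)):
        yield from fn(k, list(group))


# Hypothetical sample rowset
rows = [
    {"author": "a", "tags": "x;y"},
    {"author": "a", "tags": ""},
    {"author": "b", "tags": "z"},
]

upper = list(process(rows, lambda r: {**r, "author": r["author"].upper()}))
exploded = list(
    apply_rows(rows, lambda r: ({"tag": t} for t in r["tags"].split(";") if t))
)
counts = list(
    reduce_rows(exploded, "author", lambda k, g: [{"author": k, "n": len(g)}])
)
```

Because each function only sees one row or one key group at a time, a scale-out engine is free to run many instances side by side, which is the property the explicit U-SQL syntax relies on.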
Copyright | Camera Make | Camera Model | Thumbnail
Michael   | Canon       | 70D          | (image)
Michael   | Samsung     | S7           | (image)
https://github.com/Azure/usql/tree/master/Examples/ImageApp
.Net API provided to build UDOs
Any .Net language usable; however, only C# is first-class in tooling
Use U-SQL specific .Net DLLs
Deploying UDOs
Compile DLL
Upload DLL to ADLS
Register with U-SQL script
Visual Studio provides tool support
UDOs can
Invoke managed code
Invoke native code deployed with UDO assemblies
Invoke other language runtimes (e.g., Python, R)
Be scaled out by U-SQL execution framework
UDOs cannot
Communicate between different UDO invocations
Call web services or reach outside the vertex boundary
How to specify UDOs?
Code behind
C# Class Project for U-SQL
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
    private byte[] _row_delim;
    private string _col_delim;
    private Encoding _encoding;

    // Define a non-default constructor to pass in custom parameters
    public DriverExtractor(string row_delim = "\r\n", string col_delim = ",",
                           Encoding encoding = null)
    {
        _encoding = encoding == null ? Encoding.UTF8 : encoding;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    } // DriverExtractor

    // Convert the text value to the type of the target schema column
    private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
    {
        var schema = outputrow.Schema;
        if (schema[i].Type == typeof(int))
        {
            var tmp = Convert.ToInt32(c);
            outputrow.Set(i, tmp);
        }
        ...
    } // OutputValueAtCol_I

    // Split the input into rows, then each row into columns, and emit one row each
    public override IEnumerable<IRow> Extract(IUnstructuredReader input,
                                              IUpdatableRow outputrow)
    {
        foreach (var row in input.Split(_row_delim))
        {
            using (var s = new StreamReader(row, _encoding))
            {
                int i = 0;
                foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                {
                    OutputValueAtCol_I(c, i++, outputrow);
                } // foreach
            } // using
            yield return outputrow.AsReadOnly();
        } // foreach
    } // Extract
} // class DriverExtractor
UDO model
Marking UDOs
Parameterizing UDOs
UDO signature
UDO-specific processing pattern
Rowsets and their schemas in UDOs
Setting results
By position
By name
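The processing pattern of the extractor above (split the input on the row delimiter, split each row on the column delimiter, convert each value to the schema type, set it by position) can be traced in plain Python. This is a rough sketch under stated assumptions: the `extract` function, the `(name, type)` schema format, and the sample bytes are hypothetical, and unlike the C# code it skips the empty chunk after a trailing delimiter.

```python
def extract(data, schema, row_delim="\r\n", col_delim=",", encoding="utf-8"):
    """Split a byte buffer into rows and columns, converting values by schema.

    schema is a list of (column name, Python type) pairs; this format is an
    assumption made for illustration only.
    """
    for raw_row in data.split(row_delim.encode(encoding)):
        if not raw_row:
            continue  # skip the empty chunk that follows a trailing delimiter
        cols = raw_row.decode(encoding).split(col_delim)
        row = {}
        for (name, typ), text in zip(schema, cols):
            row[name] = typ(text)  # set by position, converted to the column type
        yield row


# Hypothetical schema and input
schema = [("driver_id", int), ("name", str)]
rows = list(extract(b"1,Ann\r\n2,Bob\r\n", schema))
```

As in the C# version, the delimiters and encoding are constructor-style parameters with defaults, so the same code can read comma-, tab-, or pipe-separated files.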
Managing Assemblies
Create assemblies
Reference assemblies
Enumerate assemblies
Drop assemblies
Visual Studio makes registration easy!
• CREATE ASSEMBLY db.assembly FROM @path;
• CREATE ASSEMBLY db.assembly FROM byte[];
• Can also include additional resource files
• REFERENCE ASSEMBLY db.assembly;
• Referencing .Net Framework Assemblies
• Always accessible system namespaces:
• U-SQL specific (e.g., for SQL.MAP)
• All provided by System.dll, System.Core.dll, System.Data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
• Add all other .Net Framework Assemblies with:
REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating Assemblies
• PowerShell command
• U-SQL Studio Server Explorer and Azure Portal
• DROP ASSEMBLY db.assembly;
DEPLOY RESOURCE
Syntax:
'DEPLOY' 'RESOURCE' file_path_URI { ',' file_path_URI }.
Example:
DEPLOY RESOURCE "/config/configfile.xml", "package.zip";
Semantics:
• Files have to be in ADLS or WASB
• Files are deployed to vertex and are accessible from any custom code
Limits:
• Single resource file limit is 400MB
• Overall limit for deployed resource files is 3GB
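The two limits above are easy to check before submitting a job. The following is a minimal sketch, assuming the file sizes are already known and that MB and GB are meant as binary units (an assumption; the slide does not say). The helper `check_resources` is hypothetical and not part of any Azure SDK.

```python
MB = 1024 * 1024
GB = 1024 * MB
SINGLE_FILE_LIMIT = 400 * MB  # per-file limit quoted on the slide
TOTAL_LIMIT = 3 * GB          # overall limit quoted on the slide


def check_resources(sizes):
    """sizes maps a resource path to its size in bytes; returns a list of problems."""
    problems = [
        f"{path}: exceeds single-file limit"
        for path, size in sizes.items()
        if size > SINGLE_FILE_LIMIT
    ]
    if sum(sizes.values()) > TOTAL_LIMIT:
        problems.append("total size exceeds overall limit")
    return problems


# Hypothetical sizes for the two files named in the example above
ok = check_resources({"/config/configfile.xml": 10 * MB, "package.zip": 399 * MB})
```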
U-SQL Vertex Code
[Figure: Compilation and Optimization of a U-SQL script, using the U-SQL Metadata Service. Compilation output (in job folder): C#, C++, Algebra, managed dll, native dll. Deployed to Vertices: compilation output, REFERENCE ASSEMBLY, ADLS DEPLOY RESOURCE (additional non-dll files and deployed resources), and system files (built-in Runtimes, Core DLLs, OS).]
Cognitive Services
https://github.com/Azure/usql/tree/master/Examples/ImageApp
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-cognitive
Example image tags: Car, Green, Parked, Outdoor, Racing
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
REFERENCE ASSEMBLY ImageOcr;
@imgs =
EXTRACT FileName string, ImgData byte[]
FROM @"/images/{FileName}.jpg"
USING new Cognition.Vision.ImageExtractor();
// Extract the number of objects on each image and tag them
@objects =
PROCESS @imgs
PRODUCE FileName,
NumObjects int,
Tags string
READONLY FileName
USING new Cognition.Vision.ImageTagger();
OUTPUT @objects
TO "/objects.tsv"
USING Outputters.Tsv();
Imaging
REFERENCE ASSEMBLY [TextCommon];
REFERENCE ASSEMBLY [TextSentiment];
REFERENCE ASSEMBLY [TextKeyPhrase];
@WarAndPeace =
EXTRACT No int,
Year string,
Book string, Chapter string,
Text string
FROM @"/usqlext/samples/cognition/war_and_peace.csv"
USING Extractors.Csv();
@sentiment =
PROCESS @WarAndPeace
PRODUCE No,
Year,
Book, Chapter,
Text,
Sentiment string,
Conf double
USING new Cognition.Text.SentimentAnalyzer(true);
OUTPUT @sentiment
TO "/sentiment.tsv"
USING Outputters.Tsv();
Text Analysis
U-SQL/Cognitive Example
• Identify objects in images (tags)
• Identify faces and emotions in images
• Join datasets: find out which tags are associated with happiness
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
@objects =
PROCESS MegaFaceView
PRODUCE FileName, NumObjects int, Tags string
READONLY FileName
USING new Cognition.Vision.ImageTagger();
@tags =
SELECT FileName, T.Tag
FROM @objects
CROSS APPLY
EXPLODE(SqlArray.Create(Tags.Split(';')))
AS T(Tag)
WHERE T.Tag.ToString().Contains("dog") OR
T.Tag.ToString().Contains("cat");
@emotion_raw =
PROCESS MegaFaceView
PRODUCE FileName string, NumFaces int, Emotion string
READONLY FileName
USING new Cognition.Vision.EmotionAnalyzer();
@emotion =
SELECT FileName, T.Emotion
FROM @emotion_raw
CROSS APPLY
EXPLODE(SqlArray.Create(Emotion.Split(';')))
AS T(Emotion);
@correlation =
SELECT T.FileName, Emotion, Tag
FROM @emotion AS E
INNER JOIN
@tags AS T
ON E.FileName == T.FileName;
[Figure: Images feed Objects and Emotions, followed by filter, join, and aggregate steps]
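The explode, filter, join, and aggregate steps of the script above can be traced on a tiny in-memory dataset. The sketch below uses plain Python lists instead of rowsets; the file names, tags, and emotions are made up for illustration, and the `explode` helper is hypothetical.

```python
from collections import Counter

# Made-up outputs of the tagger and the emotion analyzer
objects = [
    {"FileName": "img1", "Tags": "dog;grass;outdoor"},
    {"FileName": "img2", "Tags": "cat;sofa"},
    {"FileName": "img3", "Tags": "car;road"},
]
emotion_raw = [
    {"FileName": "img1", "Emotion": "happiness"},
    {"FileName": "img2", "Emotion": "neutral;happiness"},
    {"FileName": "img3", "Emotion": "surprise"},
]


def explode(rows, column, out_name, sep=";"):
    """One output row per separated value, like CROSS APPLY EXPLODE."""
    for row in rows:
        for value in row[column].split(sep):
            yield {"FileName": row["FileName"], out_name: value}


# Filter: keep only dog and cat tags
tags = [
    t for t in explode(objects, "Tags", "Tag")
    if "dog" in t["Tag"] or "cat" in t["Tag"]
]
emotion = list(explode(emotion_raw, "Emotion", "Emotion"))

# Join: inner join on FileName
correlation = [
    {"FileName": t["FileName"], "Emotion": e["Emotion"], "Tag": t["Tag"]}
    for e in emotion
    for t in tags
    if e["FileName"] == t["FileName"]
]

# Aggregate: how often each tag co-occurs with happiness
happy_tags = Counter(r["Tag"] for r in correlation if r["Emotion"] == "happiness")
```

The nested-loop join is fine for a handful of rows; at scale the join strategy is chosen by the U-SQL optimizer.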
Python Processing
Input rowset:
Author | Tweet
MikeDoesBigData | @AzureDataLake: Come and see the #SQLSaturday sessions on #USQL
AzureDataLake | What are your recommendations for #SQLSaturday? @MikeDoesBigData
Output rowset after Python processing:
Author | Mentions | Topics
MikeDoesBigData | {@AzureDataLake} | {#SQLSaturday, #USQL}
AzureDataLake | {@MikeDoesBigData} | {#SQLSaturday}
REFERENCE ASSEMBLY [ExtPython];
DECLARE @myScript = @"
def get_mentions(tweet):
    return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )

def usqlml_main(df):
    del df['time']
    del df['author']
    df['mentions'] = df.tweet.apply(get_mentions)
    del df['tweet']
    return df
";
@t =
SELECT * FROM
(VALUES
("D1","T1","A1","@foo Hello World @bar"),
("D2","T2","A2","@baz Hello World @beer")
) AS D( date, time, author, tweet );
@m =
REDUCE @t ON date
PRODUCE date string, mentions string
USING new Extension.Python.Reducer(pyScript:@myScript);
Use U-SQL to create a massively distributed program.
Executing Python code across many nodes.
Using standard libraries such as numpy and pandas.
Documentation:
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
Python Extensions
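The mention and topic extraction from the tweet table can be tried locally without U-SQL or pandas. The sketch below is a standard-library variant, not the script from the slide: it keeps the leading `@` or `#` (as the table shows) and strips a few trailing punctuation characters, and the helper `_tokens` and the function `get_topics` are hypothetical names.

```python
def _tokens(tweet, prefix):
    """Words starting with the prefix, with trailing punctuation removed."""
    words = (w.rstrip(":;,.!?") for w in tweet.split() if w.startswith(prefix))
    return [w for w in words if len(w) > 1]


def get_mentions(tweet):
    return ";".join(_tokens(tweet, "@"))


def get_topics(tweet):
    return ";".join(_tokens(tweet, "#"))


# The two sample tweets from the slide
tweets = [
    ("MikeDoesBigData",
     "@AzureDataLake: Come and see the #SQLSaturday sessions on #USQL"),
    ("AzureDataLake",
     "What are your recommendations for #SQLSaturday? @MikeDoesBigData"),
]
result = [(author, get_mentions(t), get_topics(t)) for author, t in tweets]
```

Inside the Python extension, functions like these would be applied to a DataFrame column from `usqlml_main`, as in the script above.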
R Processing
R running in U-SQL
Generate a linear model
SampleScript_LM_Iris.R
REFERENCE ASSEMBLY [ExtR];
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFileModelSummary string =
@"/my/R/Output/LMModelSummaryCoefficientsIrisFromRCommand.txt";
DECLARE @myRScript = @"
inputFromUSQL$Species = as.factor(inputFromUSQL$Species)
lm.fit=lm(unclass(Species)~.-Par, data=inputFromUSQL)
# Do not return readonly columns, and make sure that the column names are the same in the U-SQL and R scripts.
outputToUSQL=data.frame(summary(lm.fit)$coefficients)
colnames(outputToUSQL) <- c(""Estimate"", ""StdError"", ""tValue"", ""Pr"")
outputToUSQL";
@InputData =
EXTRACT SepalLength double, SepalWidth double, PetalLength double,
PetalWidth double, Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData = SELECT 0 AS Par, * FROM @InputData;
@ModelCoefficients = REDUCE @ExtendedData ON Par
PRODUCE Par, Estimate double, StdError double, tValue double, Pr double
READONLY Par
USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe");
OUTPUT @ModelCoefficients TO @OutputFileModelSummary USING Outputters.Tsv();
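The script above reduces on a constant key (`Par` is always 0), so all rows reach a single reducer invocation that fits one model and returns its coefficients as rows. The sketch below imitates that shape in plain Python with a single-predictor ordinary least squares fit, which is a simplified stand-in for R's `lm`. The data values, the `Term` column, and the `fit_line` helper are made up for illustration.

```python
from itertools import groupby


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up measurements, not taken from iris.csv; Par = 0 for every row
rows = [
    {"Par": 0, "x": x, "y": y}
    for x, y in zip([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 1.5, 2.0])
]

coefficients = []
by_par = sorted(rows, key=lambda r: r["Par"])
for par, group in groupby(by_par, key=lambda r: r["Par"]):
    group = list(group)  # a constant key yields exactly one group
    intercept, slope = fit_line([r["x"] for r in group], [r["y"] for r in group])
    coefficients.append({"Par": par, "Term": "(Intercept)", "Estimate": intercept})
    coefficients.append({"Par": par, "Term": "x", "Estimate": slope})
```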
R running in U-SQL
Use a previously generated model
REFERENCE ASSEMBLY master.ExtR;
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda"; // Prediction model
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/Output/LMPredictionsIris.csv";
DECLARE @PartitionCount int = 10;
// R script to run
DECLARE @myRScript = @"
load(""my_model_LM_Iris.rda"")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval=""confidence""))";
@InputData =
EXTRACT SepalLength double, SepalWidth double, PetalLength double,
PetalWidth double, Species string
FROM @IrisData
USING Extractors.Csv();
//Randomly partition the data to apply the model in parallel
@ExtendedData =
SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par, *
FROM @InputData;
// Predict Species
@RScriptOutput =
REDUCE @ExtendedData ON Par
PRODUCE Par, fit double, lwr double, upr double
READONLY Par
USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe",
stringsAsFactors:false);
OUTPUT @RScriptOutput TO @OutputFilePredictions
USING Outputters.Csv(outputHeader:true);
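The random partitioning trick used above (assign each row a random key in a fixed range, then reduce on that key so each partition is scored independently) can be sketched in plain Python. The `predict` function is a hypothetical stand-in for the pre-trained model loaded from the deployed resource, and its coefficients are made up.

```python
import random
from collections import defaultdict


def predict(row):
    # Hypothetical stand-in for the pre-trained model; coefficients are made up
    return 0.4 * row["PetalLength"] - 0.3


def score_in_partitions(rows, partition_count, seed=0):
    """Assign each row a random partition key, then score partition by partition."""
    rng = random.Random(seed)
    partitions = defaultdict(list)
    for row in rows:
        partitions[rng.randrange(partition_count)].append(row)
    # Partitions do not depend on each other, so a scale-out engine
    # could hand each one to a different worker.
    results = []
    for par, part in sorted(partitions.items()):
        for row in part:
            results.append({"Par": par, "fit": predict(row)})
    return results


rows = [{"PetalLength": float(i)} for i in range(1, 21)]
scored = score_in_partitions(rows, partition_count=10)
```

This only works because prediction is row-wise: the result for a row does not depend on which partition it lands in. Model fitting, as in the previous example, does not have that property.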
JSON Processing
How do I extract data from JSON documents?
https://github.com/Azure/usql/tree/master/Examples/DataFormats
https://github.com/Azure/usql/tree/master/Examples/JSONExamples
Architecture of Sample Format Assembly
Single JSON document per file: Use JsonExtractor
Multiple JSON documents per file:
Do not allow row delimiter (e.g., CR/LF) in JSON
Use built-in Text Extractor to extract
Use JsonTuple to schematize (with CROSS APPLY)
Currently loads full JSON document into memory; better to use JSONReader processing if documents are large
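The "multiple JSON documents per file" layout described above, with one document per line and no raw row delimiter inside a document, can be illustrated with the Python standard library. This is a minimal sketch; the sample documents and the function name `read_json_lines` are hypothetical.

```python
import json


def read_json_lines(text):
    """Parse one JSON document per line, skipping blank lines."""
    for line in text.split("\n"):
        if line.strip():
            yield json.loads(line)


# Two made-up documents, one per line
raw = '{"id": 1, "name": "Ann"}\n{"id": 2, "name": "Bob"}\n'
docs = list(read_json_lines(raw))
names = [d["name"] for d in docs]

# json.dumps escapes newlines inside strings, so a serialized document
# stays on a single line and is safe to use with a line-based extractor.
one_line = json.dumps({"note": "first\nsecond"})
```

Each document is parsed fully into memory, which matches the caveat above about large documents.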
[Figure: Microsoft.Analytics.Samples.Formats, built on Newtonsoft.Json, Microsoft.Hadoop.Avro, and System.Xml]
@json =
EXTRACT personid int,
name string,
addresses string
FROM @input
USING new Json.JsonExtractor("[*].person");
@person =
SELECT personid,
name,
Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
FROM @json;
@addresses = SELECT personid, name, Json.JsonFunctions.JsonTuple(address) AS address
FROM @person
CROSS APPLY
EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);
@result =
SELECT personid,
name,
address["addressid"] AS addressid,
address["street"] AS street,
address["postcode"] AS postcode,
address["city"] AS city
FROM @addresses;
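The flattening performed by the script above (one output row per person and address) looks as follows in plain Python. The document shape is an assumption inferred from the `[*].person` path and the column names in the script; the sample values are made up.

```python
import json

# Assumed shape: an array of objects, each with a "person" member whose
# "addresses" member holds an "address" array. All values are made up.
document = json.loads("""
[
  {"person": {"personid": 1, "name": "Ann",
              "addresses": {"address": [
                {"addressid": 10, "street": "1 Main St",
                 "postcode": "12345", "city": "Springfield"},
                {"addressid": 11, "street": "2 Oak Ave",
                 "postcode": "67890", "city": "Shelbyville"}]}}},
  {"person": {"personid": 2, "name": "Bob",
              "addresses": {"address": [
                {"addressid": 20, "street": "3 Elm Rd",
                 "postcode": "24680", "city": "Ogdenville"}]}}}
]
""")

ADDRESS_FIELDS = ("addressid", "street", "postcode", "city")

# Explode the address array: one output row per (person, address) pair
result = [
    {
        "personid": person["personid"],
        "name": person["name"],
        **{field: address[field] for field in ADDRESS_FIELDS},
    }
    for person in (item["person"] for item in document)
    for address in person["addresses"]["address"]
]
```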
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Processing at scale (SQL Saturday 635)
What are UDOs?
Custom Operator Extensions written in .Net (C#)
Scaled out by U-SQL
UDO Tips and Warnings
Tips when Using UDOs:
READONLY clause to allow pushing predicates through UDOs
REQUIRED clause to allow column pruning through UDOs
PRESORT on REDUCE if you need global order
Hint cardinality if the optimizer chooses the wrong plan
Warnings and better alternatives:
Use SELECT with UDFs instead of PROCESS
Use User-defined Aggregators instead of REDUCE
Learn to use Windowing Functions (OVER expression)
Good use-cases for PROCESS/REDUCE/COMBINE:
The logic needs to dynamically access the input and/or output schema. E.g., create a JSON doc for the data in the row where the columns are not known a priori.
Your UDF-based solution creates too much memory pressure and you can write more memory-efficient code in a UDO
You need an ordered Aggregator or need to produce more than 1 row per group
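The last use case, an ordered operation that produces more than one row per group, is something a plain aggregate cannot express. The sketch below shows the idea in Python as a "top n per group" reducer; the function name `top_n_per_group` and the sample rows are hypothetical, and the ordering column is assumed to be numeric.

```python
from itertools import groupby
from operator import itemgetter


def top_n_per_group(rows, key, order_by, n):
    """Sort within each group, then emit up to n ranked rows per group."""
    ordered = sorted(rows, key=lambda r: (r[key], -r[order_by]))
    for _, group in groupby(ordered, key=itemgetter(key)):
        for rank, row in enumerate(group, start=1):
            if rank > n:
                break  # the rest of this group is skipped
            yield {**row, "rank": rank}


# Made-up rows
rows = [
    {"author": "a", "likes": 5},
    {"author": "a", "likes": 9},
    {"author": "a", "likes": 1},
    {"author": "b", "likes": 3},
]
top2 = list(top_n_per_group(rows, "author", "likes", 2))
```

An aggregate would return a single value per author, whereas this reducer returns up to n rows per author in a defined order. In U-SQL, the same result can often be obtained with a windowing function such as ROW_NUMBER in an OVER expression, which is the alternative recommended above.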
Additional Resources
Blogs and community page:
http://usql.io (U-SQL GitHub)
http://blogs.msdn.microsoft.com/azuredatalake/
http://blogs.msdn.microsoft.com/mrys/
https://channel9.msdn.com/Search?term=U-SQL#ch9Search
Documentation, presentations and articles:
http://aka.ms/usql_reference
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
https://docs.microsoft.com/en-us/azure/data-lake-analytics/
https://msdn.microsoft.com/en-us/magazine/mt614251
https://msdn.microsoft.com/magazine/mt790200
http://www.slideshare.com/MichaelRys
Getting Started with R in U-SQL
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
ADL forums and feedback
https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
http://stackoverflow.com/questions/tagged/u-sql
http://aka.ms/adlfeedback
SQLSaturday Sponsors!
Titanium
& Global Partner
Gold
Silver
Bronze
Without the generosity of these sponsors, this event would not be
possible! Please, stop by the vendor booths and thank them.

U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Processing at scale (SQL Saturday 635)

  • 1. U-SQL Killer scenarios: Custom Processing, Big Cognition, Image and JSON processing at Scale Michael Rys (@MikeDoesBigData) John Morcos Microsoft Corp
  • 2. Agenda Introduction to U-SQL’s Extensibility U-SQL Cognitive Services More Custom Image processing Python in U-SQL R in U-SQL JSON processing
  • 3. U-SQL extensibility Extend U-SQL with C#/.NET, Python, R etc. Built-in operators, function, aggregates C# expressions (in SELECT expressions) User-defined aggregates (UDAGGs) User-defined functions (UDFs) User-defined operators (UDOs)
  • 4. What are UDOs? User-Defined Extractors User-Defined Outputters User-Defined Processors Take one row and produce one row Pass-through versus transforming User-Defined Appliers Take one row and produce 0 to n rows Used with OUTER/CROSS APPLY User-Defined Combiners Combines rowsets (like a user-defined join) User-Defined Reducers Take n rows and produce m rows (normally m<n) Scaled out with explicit U-SQL Syntax that takes a UDO instance (created as part of the execution): EXTRACT OUTPUT CROSS APPLY Custom Operator Extensions Scaled out by U-SQL PROCESS COMBINE REDUCE
  • 5. Copyright Camera Make Camera Model Thumbnail Michael Canon 70D Michael Samsung S7 https://github.com/Azure/usql/tree/master/Examples/ImageApp
  • 6. .Net API provided to build UDOs Any .Net language usable however only C# is first-class in tooling Use U-SQL specific .Net DLLs Deploying UDOs Compile DLL Upload DLL to ADLS register with U-SQL script VisualStudio provides tool support UDOs can Invoke managed code Invoke native code deployed with UDO assemblies Invoke other language runtimes (e.g., Python, R) be scaled out by U-SQL execution framework UDOs cannot Communicate between different UDO invocations Call Webservices/Reach outside the vertex boundary How to specify UDOs?
  • 8. C# Class Project for U-SQLHow to specify UDOs?
  • 9. [SqlUserDefinedExtractor] public class DriverExtractor : IExtractor { private byte[] _row_delim; private string _col_delim; private Encoding _encoding; // Define a non-default constructor since I want to pass in my own parameters public DriverExtractor( string row_delim = "rn", string col_delim = ",“ , Encoding encoding = null ) { _encoding = encoding == null ? Encoding.UTF8 : encoding; _row_delim = _encoding.GetBytes(row_delim); _col_delim = col_delim; } // DriverExtractor // Converting text to target schema private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow) { var schema = outputrow.Schema; if (schema[i].Type == typeof(int)) { var tmp = Convert.ToInt32(c); outputrow.Set(i, tmp); } ... } //SerializeCol public override IEnumerable<IRow> Extract( IUnstructuredReader input , IUpdatableRow outputrow) { foreach (var row in input.Split(_row_delim)) { using(var s = new StreamReader(row, _encoding)) { int i = 0; foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None)) { OutputValueAtCol_I(c, i++, outputrow); } // foreach } // using yield return outputrow.AsReadOnly(); } // foreach } // Extract } // class DriverExtractor UDO model Marking UDOs Parameterizing UDOs UDO signature UDO-specific processing pattern Rowsets and their schemas in UDOs Setting results By position By name
  • 10. Managing Assemblies Create assemblies Reference assemblies Enumerate assemblies Drop assemblies VisualStudio makes registration easy! • CREATE ASSEMBLY db.assembly FROM @path; • CREATE ASSEMBLY db.assembly FROM byte[]; • Can also include additional resource files • REFERENCE ASSEMBLY db.assembly; • Referencing .Net Framework Assemblies • Always accessible system namespaces: • U-SQL specific (e.g., for SQL.MAP) • All provided by system.dll system.core.dll system.data.dll, System.Runtime.Serialization.dll, mscorelib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq) • Add all other .Net Framework Assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML]; • Enumerating Assemblies • Powershell command • U-SQL Studio Server Explorer and Azure Portal • DROP ASSEMBLY db.assembly;
  • 11. DEPLOY RESOURCE Syntax: 'DEPLOY' 'RESOURCE' file_path_URI { ',' file_path_URI }. Example: DEPLOY RESOURCE "/config/configfile.xml", "package.zip"; Semantics: • Files have to be in ADLS or WASB • Files are deployed to vertex and are accessible from any custom code Limits: • Single resource file limit is 400MB • Overall limit for deployed resource files is 3GB
  • 12. U-SQL Vertex Code C# C++ Algebra Additional non-dll files & Deployed resources managed dll native dll Compilation output (in job folder) Compilation and Optimization U-SQL Metadata Service Deployed to Vertices REFERENCE ASSEMBLY ADLS DEPLOY RESOURCE System files (built-in Runtimes, Core DLLs, OS)
  • 14. REFERENCE ASSEMBLY ImageCommon; REFERENCE ASSEMBLY FaceSdk; REFERENCE ASSEMBLY ImageEmotion; REFERENCE ASSEMBLY ImageTagging; REFERENCE ASSEMBLY ImageOcr; @imgs = EXTRACT FileName string, ImgData byte[] FROM @"/images/{FileName}.jpg" USING new Cognition.Vision.ImageExtractor(); // Extract the number of objects on each image and tag them @objects = PROCESS @imgs PRODUCE FileName, NumObjects int, Tags string READONLY FileName USING new Cognition.Vision.ImageTagger(); OUTPUT @objects TO "/objects.tsv" USING Outputters.Tsv(); Imaging
  • 15. REFERENCE ASSEMBLY [TextCommon]; REFERENCE ASSEMBLY [TextSentiment]; REFERENCE ASSEMBLY [TextKeyPhrase]; @WarAndPeace = EXTRACT No int, Year string, Book string, Chapter string, Text string FROM @"/usqlext/samples/cognition/war_and_peace.csv" USING Extractors.Csv(); @sentiment = PROCESS @WarAndPeace PRODUCE No, Year, Book, Chapter, Text, Sentiment string, Conf double USING new Cognition.Text.SentimentAnalyzer(true); OUTPUT @sentinment TO "/sentiment.tsv" USING Outputters.Tsv(); Text Analysis
  • 16. U-SQL/Cognitive Example • Identify objects in images (tags) • Identify faces and emotions and images • Join datasets – find out which tags are associated with happiness REFERENCE ASSEMBLY ImageCommon; REFERENCE ASSEMBLY FaceSdk; REFERENCE ASSEMBLY ImageEmotion; REFERENCE ASSEMBLY ImageTagging; @objects = PROCESS MegaFaceView PRODUCE FileName, NumObjects int, Tags string READONLY FileName USING new Cognition.Vision.ImageTagger(); @tags = SELECT FileName, T.Tag FROM @objects CROSS APPLY EXPLODE(SqlArray.Create(Tags.Split(';'))) AS T(Tag) WHERE T.Tag.ToString().Contains("dog") OR T.Tag.ToString().Contains("cat"); @emotion_raw = PROCESS MegaFaceView PRODUCE FileName string, NumFaces int, Emotion string READONLY FileName USING new Cognition.Vision.EmotionAnalyzer(); @emotion = SELECT FileName, T.Emotion FROM @emotion_raw CROSS APPLY EXPLODE(SqlArray.Create(Emotion.Split(';'))) AS T(Emotion); @correlation = SELECT T.FileName, Emotion, Tag FROM @emotion AS E INNER JOIN @tags AS T ON E.FileName == T.FileName; Images Objects Emotions filter join aggregate
  • 17. Python Processing Python Author Tweet MikeDoesBigData @AzureDataLake: Come and see the #SQLSaturday sessions on #USQL AzureDataLake What are your recommendations for #SQLSaturday? @MikeDoesBigData Author Mentions Topics MikeDoesBigData {@AzureDataLake} {#SQLSaturday, #USQL} AzureDataLake {@MikeDoesBigData} {#SQLSaturday}
  • 18. REFERENCE ASSEMBLY [ExtPython]; DECLARE @myScript = @" def get_mentions(tweet): return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) ) def usqlml_main(df): del df['time'] del df['author'] df['mentions'] = df.tweet.apply(get_mentions) del df['tweet'] return df "; @t = SELECT * FROM (VALUES ("D1","T1","A1","@foo Hello World @bar"), ("D2","T2","A2","@baz Hello World @beer") ) AS D( date, time, author, tweet ); @m = REDUCE @t ON date PRODUCE date string, mentions string USING new Extension.Python.Reducer(pyScript:@myScript); Use U-SQL to create a massively distributed program. Executing Python code across many nodes. Using standard libraries such as numpy and pandas. Documentation: https://docs.microsoft.com/en- us/azure/data-lake-analytics/data- lake-analytics-u-sql-python- extensions Python Extensions
  • 20. R running in U-SQL: Generate a linear model (SampleScript_LM_Iris.R)

      REFERENCE ASSEMBLY [ExtR];

      DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
      DECLARE @OutputFileModelSummary string = @"/my/R/Output/LMModelSummaryCoefficientsIrisFromRCommand.txt";
      DECLARE @myRScript = @"
      inputFromUSQL$Species = as.factor(inputFromUSQL$Species)
      lm.fit=lm(unclass(Species)~.-Par, data=inputFromUSQL)
      #do not return readonly columns and make sure that the column names are the same in usql and r scripts,
      outputToUSQL=data.frame(summary(lm.fit)$coefficients)
      colnames(outputToUSQL) <- c(""Estimate"", ""StdError"", ""tValue"", ""Pr"")
      outputToUSQL";

      @InputData =
          EXTRACT SepalLength double,
                  SepalWidth double,
                  PetalLength double,
                  PetalWidth double,
                  Species string
          FROM @IrisData
          USING Extractors.Csv();

      @ExtendedData =
          SELECT 0 AS Par, *
          FROM @InputData;

      @ModelCoefficients =
          REDUCE @ExtendedData ON Par
          PRODUCE Par, Estimate double, StdError double, tValue double, Pr double
          READONLY Par
          USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe");

      OUTPUT @ModelCoefficients
      TO @OutputFileModelSummary
      USING Outputters.Tsv();
  • 21. R running in U-SQL: Use a previously generated model

      REFERENCE ASSEMBLY master.ExtR;

      DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda"; // Prediction Model

      DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
      DECLARE @OutputFilePredictions string = @"/Output/LMPredictionsIris.csv";
      DECLARE @PartitionCount int = 10;

      // R script to run
      DECLARE @myRScript = @"
      load(""my_model_LM_Iris.rda"")
      outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval=""confidence""))";

      @InputData =
          EXTRACT SepalLength double,
                  SepalWidth double,
                  PetalLength double,
                  PetalWidth double,
                  Species string
          FROM @IrisData
          USING Extractors.Csv();

      // Randomly partition the data to apply the model in parallel
      @ExtendedData =
          SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par, *
          FROM @InputData;

      // Predict Species
      @RScriptOutput =
          REDUCE @ExtendedData ON Par
          PRODUCE Par, fit double, lwr double, upr double
          READONLY Par
          USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe", stringsAsFactors:false);

      OUTPUT @RScriptOutput
      TO @OutputFilePredictions
      USING Outputters.Csv(outputHeader:true);
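The scale-out pattern on slide 21 is to give every row a random partition key and then reduce on that key, so the model is applied once per partition. A minimal Python sketch of that idea follows; `partition_rows`, `reduce_partitions`, and `placeholder_score` are hypothetical names, and the scoring function is a stand-in rather than the R model from the slide.

```python
import random
from collections import defaultdict


def partition_rows(rows, partition_count, seed=0):
    """Attach a random partition key to each row, like the Par column."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[rng.randrange(partition_count)].append(row)
    return groups


def reduce_partitions(groups, score):
    """Call score once per partition, as a reducer is called once per key."""
    out = []
    for par in sorted(groups):
        out.extend((par, result) for result in score(groups[par]))
    return out


def placeholder_score(rows):
    # Stand-in for the model: returns sepal length + sepal width per row.
    return [round(r[0] + r[1], 1) for r in rows]


rows = [(5.1, 3.5), (4.9, 3.0), (6.2, 2.9), (5.9, 3.0)]
groups = partition_rows(rows, partition_count=3)
scored = reduce_partitions(groups, placeholder_score)
print(scored)
```

Because the key is random, the partitions carry no meaning of their own; they only bound how much data each reducer invocation sees.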
  • 22. JSON Processing
      How do I extract data from JSON documents?
      https://github.com/Azure/usql/tree/master/Examples/DataFormats
      https://github.com/Azure/usql/tree/master/Examples/JSONExamples
  • 23. JSON Processing
      Architecture of Sample Format Assembly (diagram labels): Microsoft.Analytics.Samples.Formats, NewtonSoft.Json, Microsoft.Hadoop.Avro, System.Xml
      • Single JSON document per file: use JsonExtractor
      • Multiple JSON documents per file:
          Do not allow the row delimiter (e.g., CR/LF) inside the JSON
          Use the built-in text extractor to extract
          Use JsonTuple to schematize (with CROSS APPLY)
      • Currently loads the full JSON document into memory; it is better to use JSONReader processing if documents are large
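The "multiple JSON documents per file" approach on slide 23 relies on each document occupying exactly one line. The following Python sketch shows that idea with the standard `json` module; the helper name `read_json_lines` and the sample documents are assumptions made for illustration and are not part of the sample format assembly.

```python
import io
import json


def read_json_lines(stream):
    """Yield one parsed document per non-empty line.

    This only works if no document contains a raw line break, which is
    the constraint stated on the slide.
    """
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample input: two documents, one per line.
data = io.StringIO('{"id": 1, "name": "Ann"}\n{"id": 2, "name": "Bob"}\n')
docs = list(read_json_lines(data))
names = [d["name"] for d in docs]
print(names)
```

Reading line by line also keeps memory use proportional to the largest single document rather than to the whole file.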
  • 24. JSON Processing

      @json =
          EXTRACT personid int,
                  name string,
                  addresses string
          FROM @input
          USING new Json.JsonExtractor("[*].person");

      @person =
          SELECT personid,
                 name,
                 Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
          FROM @json;

      @addresses =
          SELECT personid,
                 name,
                 Json.JsonFunctions.JsonTuple(address) AS address
          FROM @person
               CROSS APPLY EXPLODE(Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);

      @result =
          SELECT personid,
                 name,
                 address["addressid"] AS addressid,
                 address["street"] AS street,
                 address["postcode"] AS postcode,
                 address["city"] AS city
          FROM @addresses;
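The script on slide 24 turns one nested person document into one flat row per address. The Python sketch below shows the same shape change with the standard `json` module. The input document is hypothetical: it is only shaped to match the path `[*].person` and the column names used on the slide, and `flatten_people` is an illustrative name.

```python
import json

# Hypothetical input shaped to match the slide's path and column names.
raw = """
[
  {"person": {"personid": 1, "name": "Ann", "addresses": {"address": [
      {"addressid": 10, "street": "1 Main St", "postcode": "12345", "city": "Olympia"},
      {"addressid": 11, "street": "2 Oak Ave", "postcode": "23456", "city": "Boise"}]}}},
  {"person": {"personid": 2, "name": "Bob", "addresses": {"address": [
      {"addressid": 20, "street": "3 Elm Rd", "postcode": "34567", "city": "Salem"}]}}}
]
"""


def flatten_people(text):
    """Return one (personid, name, addressid, street, postcode, city) row per address."""
    rows = []
    for item in json.loads(text):
        person = item["person"]
        # The inner loop plays the role of CROSS APPLY EXPLODE.
        for address in person["addresses"]["address"]:
            rows.append((person["personid"], person["name"], address["addressid"],
                         address["street"], address["postcode"], address["city"]))
    return rows


result = flatten_people(raw)
for row in result:
    print(row)
```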
  • 26. What are UDOs? Custom Operator Extensions written in .Net (C#) Scaled out by U-SQL
  • 27. UDO Tips and Warnings
      Tips when using UDOs:
        • Use the READONLY clause to allow pushing predicates through UDOs
        • Use the REQUIRED clause to allow column pruning through UDOs
        • Use PRESORT on REDUCE if you need global order
        • Hint cardinality if the optimizer chooses the wrong plan
      Warnings and better alternatives:
        • Use SELECT with UDFs instead of PROCESS
        • Use user-defined aggregators instead of REDUCE
        • Learn to use windowing functions (OVER expression)
      Good use cases for PROCESS/REDUCE/COMBINE:
        • The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON document for the data in a row whose columns are not known a priori.
        • Your UDF-based solution creates too much memory pressure, and you can write more memory-efficient code in a UDO.
        • You need an ordered aggregator, or you need to produce more than one row per group.
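The last use case above, an ordered computation that returns more than one row per group, is what a plain aggregate cannot express. A minimal Python sketch of such a reducer follows, assuming a made-up sessionization task: events of one user that lie within `max_gap` of each other are merged into one session row. The function name, the data, and the threshold are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(rows, max_gap):
    """rows: (user, timestamp) tuples. Returns (user, start, end) sessions."""
    out = []
    # Sorting by (user, timestamp) stands in for grouping on user with
    # the rows of each group presorted by timestamp.
    for user, group in groupby(sorted(rows), key=itemgetter(0)):
        start = end = None
        for _, ts in group:
            if start is None:
                start = end = ts
            elif ts - end <= max_gap:
                end = ts  # still inside the current session
            else:
                out.append((user, start, end))  # close the session, open a new one
                start = end = ts
        out.append((user, start, end))
    return out


events = [("ann", 1), ("bob", 5), ("ann", 2), ("ann", 10), ("bob", 6), ("ann", 11)]
sessions = sessionize(events, max_gap=3)
print(sessions)
```

The group for "ann" yields two rows and the group for "bob" yields one, and the result depends on the rows of each group arriving in timestamp order.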
  • 28. Additional Resources
      Blogs and community page:
        http://usql.io (U-SQL GitHub)
        http://blogs.msdn.microsoft.com/azuredatalake/
        http://blogs.msdn.microsoft.com/mrys/
        https://channel9.msdn.com/Search?term=U-SQL#ch9Search
      Documentation, presentations and articles:
        http://aka.ms/usql_reference
        https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
        https://docs.microsoft.com/en-us/azure/data-lake-analytics/
        https://msdn.microsoft.com/en-us/magazine/mt614251
        https://msdn.microsoft.com/magazine/mt790200
        http://www.slideshare.com/MichaelRys
      Getting Started with R in U-SQL:
        https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
      ADL forums and feedback:
        https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
        http://stackoverflow.com/questions/tagged/u-sql
        http://aka.ms/adlfeedback
  • 29. SQLSaturday Sponsors! Sponsor tiers: Titanium & Global Partner, Gold, Silver, Bronze. Without the generosity of these sponsors, this event would not be possible! Please stop by the vendor booths and thank them.

Editor's Notes

  • #4: Extensions require .NET assemblies to be registered with a database