Big Data Processing with .NET and Spark
Michael Rys
Principal Program Manager, Azure Data
@MikeDoesBigData
Agenda
• What is Apache Spark
• Why .NET for Apache Spark
• What is .NET for Apache Spark
• Demos
• How does it perform
• Where does it run
• Special Announcement & Call to Action
• Apache Spark is an OSS fast analytics engine for big data and machine learning
• Improves efficiency through:
  • General computation graphs beyond map/reduce
  • In-memory computing primitives
• Allows developers to scale out their user code & write in their language of choice
• Rich APIs in Java, Scala, Python, R, SparkSQL etc.
• Batch processing, streaming and interactive shell
• Available on Azure via Azure Synapse, Azure Databricks, Azure HDInsight, IaaS/Kubernetes
.NET Developers 💖 Apache Spark…
A lot of big-data-relevant business logic (millions of lines of code) is written in .NET!
It is expensive and difficult to translate into Python/Scala/Java!
.NET developers are locked out of big data processing due to the lack of .NET support in OSS big data solutions.
In a recently conducted .NET Developer survey (> 1000 developers), more than 70%
expressed interest in Apache Spark!
They would like to tap into the OSS eco-system for code libraries, support, and hiring.
Goal: .NET for Apache Spark is aimed at providing
.NET developers a first-class experience when
working with Apache Spark.
Non-Goal: Converting existing Scala/Python/Java
Spark developers.
We are developing it in the open!
Contributions to foundational OSS projects:
• Apache Spark Core: SPARK-28271, SPARK-28278, SPARK-28283, SPARK-28282, SPARK-28284,
SPARK-28319, SPARK-28238, SPARK-28856, SPARK-28970, SPARK-29279, SPARK-29373
• Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737,
ARROW-4543, ARROW-4435, ARROW-4503, ARROW-4717, ARROW-4337, ARROW-5887,
ARROW-5908, ARROW-6314, ARROW-6682
• Pyrolite (Pickling Library): Improve pickling/unpickling performance, Add a Strong Name to
Pyrolite, Improve Pickling Performance, Hash set handling, Improve unpickling performance
.NET for Apache Spark is open source
• Website: https://dot.net/spark
• GitHub: https://github.com/dotnet/spark
• Frequent releases (about every 6 weeks), current release v0.12.1
• Integrates with .NET Interactive (https://github.com/dotnet/interactive) and nteract/Jupyter
Spark project improvement proposals:
• Interop support for Spark language extensions: SPARK-26257
• .NET bindings for Apache Spark: SPARK-27006
Journey so far
• ~2k GitHub unique visitors/wk
• ~8k GitHub page views/wk
• 260 GitHub issues closed
• 246 GitHub PRs merged
• 127k NuGet downloads
• 39 GitHub contributors
Customer Success: O365’s MSAI
Microsoft Search, Assistant & Intelligence Team: Towards Modern Workspaces in O365 (Scale: ~50 TB)
Job: Build ML/deep models on top of substrate data to infuse intelligence into Office 365 products. Our data resides in Azure Data Lake Storage. We write code that cooks/featurizes data, which in turn gets fed into our ML models.
Why Spark.NET? Given that our business logic (e.g., featurizers and tokenizers for normalizing text) is written in C#, Spark.NET is an ideal candidate for our workloads. We leverage Spark.NET to run those libraries at scale.
Experience: Very promising; a stable & highly vibrant community that is helping us iterate with the agility we want. Looking forward to a longer working relationship and broader adoption across Substrate Intelligence / MSAI.
.NET provides full-spectrum Spark support
• Spark DataFrames with SparkSQL: works with Spark v2.3.x/v2.4.x and includes ~300 SparkSQL functions, plus Grouped Map, Delta Lake, and .NET Spark UDFs
• Batch & streaming: including Spark Structured Streaming and all Spark-supported data sources
• .NET Standard 2.0: works with .NET Framework v4.6.1+ and .NET Core v2.1/v3.1 and includes C#/F# support
• Data Science: including access to ML.NET and an interactive notebook with C# REPL
• Speed & productivity: performance-optimized interop, as fast or faster than PySpark, with support for HW vectorization
Examples: https://github.com/dotnet/spark/examples
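
To give a flavor of the batch & streaming support above, here is a minimal Structured Streaming sketch in C#. It is not from the deck; the socket source, host, and port are illustrative choices:

using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

var spark = SparkSession.Builder().AppName("StreamingWordCount").GetOrCreate();

// Read a stream of lines from a local socket (e.g., fed by `nc -lk 9999`).
DataFrame lines = spark
    .ReadStream()
    .Format("socket")
    .Option("host", "localhost")
    .Option("port", 9999)
    .Load();

// Split each line into words and keep a running count per word.
DataFrame words = lines.Select(Explode(Split(lines["value"], " ")).Alias("word"));
DataFrame counts = words.GroupBy("word").Count();

// Emit the running counts to the console until the query is cancelled.
counts.WriteStream()
    .OutputMode("complete")
    .Format("console")
    .Start()
    .AwaitTermination();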
Introduction to Spark Programming: DataFrame

UserId   State  Salary
Terry    WA     XX
Rahul    WA     XX
Dan      WA     YY
Tyson    CA     ZZ
Ankit    WA     YY
Michael  WA     YY
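
To make the table concrete, here is a hedged sketch of building that DataFrame with .NET for Apache Spark; the GenericRow-based CreateDataFrame overload is assumed to be available (it appears in recent releases), and the salary placeholders are kept from the slide:

using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

var spark = SparkSession.Builder().AppName("DataFrameIntro").GetOrCreate();

// Schema matching the slide's table.
var schema = new StructType(new[]
{
    new StructField("UserId", new StringType()),
    new StructField("State", new StringType()),
    new StructField("Salary", new StringType())
});

var rows = new[]
{
    new GenericRow(new object[] { "Terry", "WA", "XX" }),
    new GenericRow(new object[] { "Rahul", "WA", "XX" }),
    new GenericRow(new object[] { "Dan", "WA", "YY" }),
    new GenericRow(new object[] { "Tyson", "CA", "ZZ" }),
    new GenericRow(new object[] { "Ankit", "WA", "YY" }),
    new GenericRow(new object[] { "Michael", "WA", "YY" })
};

DataFrame df = spark.CreateDataFrame(rows, schema);
df.GroupBy("State").Count().Show();   // e.g., users per state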
.NET for Apache Spark programmability
var spark = SparkSession.Builder().GetOrCreate();
var df = spark.Read().Json("input.json");

// Define a UDF that concatenates name and age.
var concat = Udf<int?, string, string>((age, name) => name + age);

df.Filter(df["age"] > 21)
  .Select(concat(df["age"], df["name"]))
  .Show();
Language comparison: TPC-H Query 2

Scala:
val europe = region.filter($"r_name" === "EUROPE")
.join(nation, $"r_regionkey" === nation("n_regionkey"))
.join(supplier, $"n_nationkey" === supplier("s_nationkey"))
.join(partsupp,
supplier("s_suppkey") === partsupp("ps_suppkey"))
val brass = part.filter(part("p_size") === 15
&& part("p_type").endsWith("BRASS"))
.join(europe, europe("ps_partkey") === $"p_partkey")
val minCost = brass.groupBy(brass("ps_partkey"))
.agg(min("ps_supplycost").as("min"))
brass.join(minCost, brass("ps_partkey") === minCost("ps_partkey"))
.filter(brass("ps_supplycost") === minCost("min"))
.select("s_acctbal", "s_name", "n_name",
"p_partkey", "p_mfgr", "s_address",
"s_phone", "s_comment")
.sort($"s_acctbal".desc,
$"n_name", $"s_name", $"p_partkey")
.limit(100)
.show()

C#:
var europe = region.Filter(Col("r_name") == "EUROPE")
.Join(nation, Col("r_regionkey") == nation["n_regionkey"])
.Join(supplier, Col("n_nationkey") == supplier["s_nationkey"])
.Join(partsupp,
supplier["s_suppkey"] == partsupp["ps_suppkey"]);
var brass = part.Filter(part["p_size"] == 15
& part["p_type"].EndsWith("BRASS"))
.Join(europe, europe["ps_partkey"] == Col("p_partkey"));
var minCost = brass.GroupBy(brass["ps_partkey"])
.Agg(Min("ps_supplycost").As("min"));
brass.Join(minCost, brass["ps_partkey"] == minCost["ps_partkey"])
.Filter(brass["ps_supplycost"] == minCost["min"])
.Select("s_acctbal", "s_name", "n_name",
"p_partkey", "p_mfgr", "s_address",
"s_phone", "s_comment")
.Sort(Col("s_acctbal").Desc(),
Col("n_name"), Col("s_name"), Col("p_partkey"))
.Limit(100)
.Show();
Similar syntax – dangerously copy/paste friendly! Watch for the differences: Scala’s $"col_name" vs. C#’s Col("col_name"), method capitalization (filter vs. Filter), and the equality operators (=== in Scala vs. == in C#).
Demo 1: Getting started locally
Submitting a Spark Application

spark-submit (Scala):
spark-submit `
  --class <user-app-main-class> `
  --master local `
  <path-to-user-jar> `
  <argument(s)-to-your-app>

spark-submit (.NET):
spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  <path-to-microsoft-spark-jar> `
  <path-to-your-app-exe> <argument(s)-to-your-app>

The microsoft-spark JAR is provided by the .NET for Apache Spark library; the app executable is provided by the user and has the business logic.
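
For illustration (not from the deck), a filled-in .NET submission might look like this; the JAR and app names are placeholders for whatever your build produced:

# Hypothetical example; adjust the JAR name to the one shipped in your app's build output.
spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  .\microsoft-spark-<version>.jar `
  dotnet .\mySparkApp.dll input.json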
Demo 2: Locally debugging a .NET for Spark App

spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local <path-to-microsoft-spark-jar> `
Debugging User-defined Code
https://github.com/dotnet/spark/pull/294
Step 1: Write your app code.
Step 2: set DOTNET_WORKER_DEBUG=1, then run spark-submit with the debug argument.
Step 3: Switch to your app code, add a breakpoint at your business logic, and press F5.
Step 4: In the `Choose Just-In-Time Debugger` dialog, choose “New Instance of …” and select your app code CS file.
Step 5: That’s it! Have fun 🙂
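
Putting the steps together, the repo's debugging guide runs roughly the following; the trailing debug argument tells DotnetRunner to wait for your app to connect, and the JAR path is a placeholder:

set DOTNET_WORKER_DEBUG=1
spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  <path-to-microsoft-spark-jar> `
  debug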
Demo 3: Twitter analysis in the Cloud
What is happening when you write .NET Spark code?
Your .NET program uses the .NET for Apache Spark bindings to build a DataFrame/SparkSQL operation tree that is handed to Spark. Did you define a .NET UDF?
• No: regular execution path (no .NET runtime during execution); same speed as with Scala Spark.
• Yes: interop between Spark and .NET; faster than with PySpark.
Performance: Worker-side Interop
On a Spark worker node, the JVM-side Spark Executor cooperates with the CLR-side Microsoft.Spark.Worker:
1. Run a task with a UDF
2. Launch the worker executable
3. Serialize UDFs & data (handing them to the .NET UDF library and your user Spark library)
4. Execute the user-defined operations
5. Write serialized result rows back
Challenge: How to serialize data between JVM & CLR?
• Pickling: row-oriented (the default)
• Apache Arrow: column-oriented
df.GroupBy("age")
.Apply(
new StructType(new[]
{
new StructField("age", new IntegerType()),
new StructField("nameCharCount", new IntegerType())
}),
batch => CountCharacters(batch, "age", "name"))
.Show();
Simplifying experience with Arrow
New experience (using the .NET DataFrame from Microsoft.Data.Analysis):
private static FxDataFrame CountCharacters(
FxDataFrame df,
string groupColName,
string summaryColName)
{
int charCount = 0;
for (long i = 0; i < df.RowCount; ++i)
{
charCount += ((string)df[summaryColName][i]).Length;
}
return new FxDataFrame(new[] {
new PrimitiveColumn<int>(groupColName,
new[] { (int?)df[groupColName][0] }),
new PrimitiveColumn<int>(summaryColName,
new[] { charCount }) });
}
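
A note on names: FxDataFrame above is presumably an alias for the in-process .NET DataFrame type, kept distinct from Spark's distributed DataFrame. A sketch of the usings this assumes (type names per the preview bits shown in the talk):

using Microsoft.Data.Analysis;                           // in-process .NET DataFrame
using FxDataFrame = Microsoft.Data.Analysis.DataFrame;   // alias used in the example
using DataFrame = Microsoft.Spark.Sql.DataFrame;         // Spark's distributed DataFrame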
Previous experience (using a raw Apache Arrow RecordBatch):
private static RecordBatch CountCharacters(
RecordBatch records,
string groupColName,
string summaryColName)
{
int summaryColIndex = records.Schema.GetFieldIndex(summaryColName);
StringArray stringValues = records.Column(summaryColIndex) as StringArray;
int charCount = 0;
for (int i = 0; i < stringValues.Length; ++i)
{
charCount += stringValues.GetString(i).Length;
}
int groupColIndex = records.Schema.GetFieldIndex(groupColName);
Field groupCol = records.Schema.GetFieldByIndex(groupColIndex);
return new RecordBatch(
new Schema.Builder()
.Field(groupCol)
.Field(f => f.Name(summaryColName).DataType(Int32Type.Default))
.Build(),
new IArrowArray[]
{
records.Column(groupColIndex),
new Int32Array.Builder().Append(charCount).Build()
},
records.Length);
}
Performance – warm cluster runs for Pickling serialization (for the Arrow improvements, see the next slide)
Takeaway 1: Where UDF performance does not matter, .NET is on par with Python.
Takeaway 2: Where UDF performance is critical, .NET is ~2x faster than Python!
Performance – warm cluster runs for C#, Pickling vs. Arrow serialization
Takeaway: Since Q1 is interop-bound, we see a 33% perf improvement with better serialization.
Performance – warm cluster runs for Arrow serialization in C# vs. Python
Takeaway: Since the serialization inefficiencies have been removed, we are left with similar perf across languages; if you like .NET, you can stick with .NET 🙂
Works everywhere!
Cross platform: Windows, Ubuntu, macOS
Cross cloud: Azure & AWS
• Installed out of the box: Azure Synapse
• Installation docs on GitHub: Azure HDI Spark, AWS EMR Spark, Databricks (Azure & AWS)
Using .NET for Spark in Azure Synapse: Batch Submission
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/spark-dotnet
1. cd mySparkApp
2. dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64
3. Zip the publish folder
4. Upload the ZIP file to your cloud storage
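
A hedged sketch of those steps as shell commands; the zip layout and upload tool are illustrative, and the linked docs are authoritative:

cd mySparkApp
dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64
# Zip the contents of the publish output folder:
cd bin/Release/netcoreapp3.1/ubuntu.16.04-x64/publish
zip -r ../mySparkApp.zip .
# Upload mySparkApp.zip to your storage account, e.g., with azcopy or the Azure portal.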
The same submission, annotated with the Synapse batch-submission form fields:
• Language selects the semantics of the submission fields
• The ZIP file contains the Spark application, including UDF DLLs, and even the Spark or .NET runtime if a different version is needed
• Main program (Unix executable)
• Program parameters as needed
• Additional resource and library files that are not included in the ZIP (e.g., shared DLLs, config files)
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/spark-dotnet
Using .NET for Spark in Azure Synapse: Notebooks with .NET Interactive
• Language selects the type of notebook: Interactive C#
• The Spark context `spark` is built in
Using .NET for Spark in Azure Synapse: Notebooks with .NET Interactive – importing NuGet packages
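
In .NET Interactive, NuGet packages are imported with the #r directive; for example (package and version are illustrative):

#r "nuget: Microsoft.Spark.Extensions.Delta, 1.0.0"

using Microsoft.Spark.Extensions.Delta.Tables;
// In Synapse notebooks the `spark` session object is already available.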
Using .NET for Spark in Azure Databricks
• Not available out of the box, but can be used via batch submission
• https://github.com/dotnet/spark/blob/master/deployment/README.md#databricks
Note: Traditional Databricks notebooks are proprietary and cannot integrate .NET.
Please contact @Databricks if you want to use it out of the box 🙂
VSCode extension for Spark .NET
Author:
• Spark .NET project creation
• Dependency packaging
• Language service
• Sample code
Run:
• Reference management
• Spark local run
• Spark cluster run (e.g., HDInsight)
Fix:
• Debug
Extension to VSCode
 Tap into VSCode for C# programming
 Automate Maven and Spark dependency
for environment setup
 Facilitate first project success through
project template and sample code
 Support Spark local run and cluster run
 Integrate with Azure for HDInsight clusters
navigation
 Azure Databricks integration planned
ANNOUNCING: .NET for Apache Spark v1.0 is released!
First-class C# and F# bindings to Apache Spark, bringing the power of big data analytics to .NET developers:
• Apache Spark 2.4/3.0: DataFrames, Structured Streaming, Delta Lake
• .NET Standard 2.0: C# and F#, ML.NET
• Performance optimized with Apache Arrow and HW vectorization
• First-class integration in Azure Synapse: batch submission and interactive .NET notebooks
Learn more at http://dot.net/Spark
What’s next?
• More programming experiences in .NET (UDAF, UDT support, multi-language UDFs)
• Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake)
• Tooling experiences (e.g., Jupyter/nteract, VS Code, Visual Studio, others?)
• Idiomatic experiences for C# and F# (LINQ, Type Provider)
• Out-of-box experiences (Azure Synapse, Azure HDInsight, Azure Databricks, Cosmos DB Spark, SQL 2019 BDC, …)
Go to https://github.com/dotnet/spark and let us know what is important to you!
Call to action: Engage, use & guide us!
Related session:
• Big Data and Data Warehousing Together with Azure Synapse Analytics
Useful links:
• http://github.com/dotnet/spark
• https://www.nuget.org/packages/Microsoft.Spark
• https://aka.ms/GoDotNetForSpark
• https://docs.microsoft.com/dotnet/spark
Website:
• https://dot.net/spark (Request a Demo!)
Starter videos – .NET for Apache Spark 101:
• Watch on YouTube
• Watch on Channel 9
Available out-of-box on Azure Synapse & Azure HDInsight Spark
Running .NET for Spark anywhere: https://aka.ms/InstallDotNetForSpark
You & @MikeDoesBigData #DotNetForSpark
© Copyright Microsoft Corporation. All rights reserved.
Editor's Notes
• #10: “Spark.Net team helped enhance the user experience which was a major issue for adoption in Satori”
• #11: No RDD support.