How can .NET contribute to Data Science? What is .NET Interactive? What do notebooks have to do with it? And Apache Spark? And Pythonism? And Azure? In this session, let's try to put these ideas in order.
The Polyglot Data Scientist - Exploring R, Python, and SQL Server (Sarah Dutkiewicz)
This document provides an overview of a presentation on being a polyglot data scientist using multiple languages and tools. It discusses using SQL, R, and Python together in data science work. The presentation covers the challenges of being a polyglot, how SQL Server with R or Python can help solve problems more easily, and examples of analyzing sensor data with these tools. It also discusses resources for learning more about R, Python, and machine learning services in SQL Server.
Jupyter Notebooks and Apache Spark are first-class citizens of the Data Science space, a true requirement for the "modern" data scientist. Now, with Azure Synapse, these two computing powers are available to the .NET developer, and .NET is available to all data scientists. Let's look at what .NET can do for notebooks and Spark inside Azure Synapse, and at what Synapse, notebooks, and Spark are.
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi... (Databricks)
The document discusses how to extend Apache Spark APIs without modifying Spark source code using Scala's "Enrich My Library" pattern. It provides an example of adding a .validate() method to Dataset objects to enable validation checks. The pattern involves defining an implicit class that augments existing types with new methods. This allows validation classes to integrate seamlessly with Spark jobs while keeping code concise, isolated and testable. Other uses like metrics collection and logging are also discussed.
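The talk's pattern is Scala-specific (an implicit class adding .validate() to Dataset). A loose PySpark analogue of the same idea, assuming invented column names and a simple null-check rule, is to chain a reusable validation step with DataFrame.transform (Spark 3.0+) so it integrates with the rest of a job without touching Spark source:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("validate-demo").getOrCreate()

def validate_not_null(*cols):
    """Return a transform that fails fast if any of `cols` contains nulls."""
    def _apply(df: DataFrame) -> DataFrame:
        for c in cols:
            bad = df.filter(F.col(c).isNull()).count()
            if bad:
                raise ValueError(f"Validation failed: {bad} null values in column '{c}'")
        return df
    return _apply

df = spark.createDataFrame([(1, "ok"), (2, "also ok")], ["id", "payload"])

# .transform() chains the check into the pipeline, much like the Scala
# implicit-class .validate() described in the talk.
result = (df.transform(validate_not_null("id", "payload"))
            .withColumn("len", F.length("payload")))
result.show()
```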
Whirlpools in the Stream with Jayesh Lalwani (Databricks)
This document summarizes some challenges and solutions related to structured streaming in Spark. It discusses issues with joining streaming and batch data due to lack of pushdown predicates. It also covers problems with caching batch dataframes, lack of a JDBC sink in streaming mode initially, issues with checkpoints being inconsistent, and limitations on aggregating aggregated dataframes. Solutions proposed include caching data outside Spark, looking up batch data in map/flatmap, direct database writes, using NFS for checkpoints, and custom aggregations without Spark SQL.
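As a minimal sketch of the stream-to-batch join the talk refers to (the rate source, device IDs and lookup table below are invented for the example), a streaming DataFrame can be joined against a static one; the static side is re-read every micro-batch, which is where the pushdown and caching concerns come from:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

# Static lookup table (the batch side of the join).
lookup = spark.createDataFrame(
    [(0, "sensor-a"), (1, "sensor-b")], ["device_id", "device_name"])

# Streaming side: the built-in rate source emits (timestamp, value) rows.
events = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .withColumn("device_id", F.col("value") % 2))

# Stream-static join: Spark re-evaluates the static side per micro-batch,
# which is why pushdown predicates and caching behaviour matter here.
enriched = events.join(lookup, on="device_id", how="left")

query = (enriched.writeStream.outputMode("append")
         .format("console").option("truncate", False).start())
query.awaitTermination(30)
```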
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira (Databricks)
This document discusses patterns for modern data integration using streaming data. It outlines an evolution from data warehouses to data lakes to streaming data. It then describes four key patterns: 1) Stream all things (data) in one place, 2) Keep schemas compatible and process data on, 3) Enable ridiculously parallel single message transformations, and 4) Perform streaming data enrichment to add additional context to events. Examples are provided of using Apache Kafka and Kafka Connect to implement these patterns for a large hotel chain integrating various data sources and performing real-time analytics on customer events.
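A minimal sketch of the "ridiculously parallel single message transformation" pattern using the kafka-python client; the topic names and the enrichment field are hypothetical:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-events",                           # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for msg in consumer:
    event = msg.value
    # Single-message transformation: enrich the event with derived context,
    # then publish it downstream. No shared state, so it parallelizes trivially
    # across consumer instances in the same consumer group.
    event["late_checkin"] = event.get("check_in_hour", 12) >= 22
    producer.send("enriched-events", value=event)
```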
Bring Your Own Container: Using Docker Images In Production (Databricks)
Condé Nast is a global leader in the media production space housing iconic brands such as The New Yorker, Wired, Vanity Fair, and Epicurious, among many others. Along with our content production, Condé Nast invests heavily in companion products to improve and enhance our audience’s experience. One such product solution is Spire, Condé Nast’s service for user segmentation and targeted advertising for over a hundred million users.
While Spire started as a set of databricks notebooks, we later utilized DBFS for deploying Spire distributions in the form of Python Whls, and more recently, we have packaged the entire production environment into docker images deployed onto our Databricks clusters. In this talk, we will walk through the process of evolving our python distributions and production environment into docker images, and discuss where this has streamlined our deployment workflow, where there were growing pains, and how to deal with them.
Getting Ready to Use Redis with Apache Spark with Tague Griffith (Databricks)
This technical tutorial is designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. The session starts with a quick introduction to Redis and the capabilities Redis provides. It will cover the basic data types provided by Redis and the module system. Using an ad serving use case, Griffith will look at how Redis can improve the performance and reduce the cost of using complex ML-models in production.
You will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark and then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will also be discussed, focusing primarily on decision trees and regression (linear and logistic) with code examples to demonstrate how to use these features.
By the end of the session, you should feel confident building a prototype/proof-of-concept application using Redis and Spark. You’ll understand how Redis complements Spark, and how to use Redis to serve complex, ML-models with high performance.
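A much-simplified sketch of the train-in-Spark, serve-from-Redis flow described above. It uses plain redis-py hashes rather than the redis-ml module's own commands, and the feature values and key names are invented for the example:

```python
import math
import redis
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("train-then-serve").getOrCreate()

# Toy training set: label + feature vector (e.g. ad-serving features).
train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.9, 0.1])), (0.0, Vectors.dense([0.1, 0.8]))],
    ["label", "features"])
model = LogisticRegression(maxIter=10).fit(train)

# Persist the learned parameters in Redis (a hash keyed by model name).
r = redis.Redis(host="localhost", port=6379)
r.hset("model:ad-click", mapping={
    "w0": float(model.coefficients[0]),
    "w1": float(model.coefficients[1]),
    "intercept": float(model.intercept),
})

# Serving path: no Spark involved, just a hash read and a dot product.
params = {k.decode(): float(v) for k, v in r.hgetall("model:ad-click").items()}
x = [0.7, 0.2]
z = params["intercept"] + params["w0"] * x[0] + params["w1"] * x[1]
print("click probability:", 1.0 / (1.0 + math.exp(-z)))
```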
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... (Spark Summit)
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself in three ways:
– The use of Databricks and AWS makes this a scalable implementation. Compute resources are considerably lower than with traditional legacy technology running big boxes 24/7. Scalability is crucial, as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years.
– We create a fingerprint for each piece of content using deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search and hence offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine.
– Traditional author-disambiguation or record-deduplication algorithms are batch-processing with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. It is therefore crucial to maintain historical profiles, and we have developed a machine learning implementation that deals with data streams and processes them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to process the raw output of the pairwise similarity function into final clusters. Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
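A small PySpark sketch of the fingerprint idea: learn word2vec document vectors with Spark ML, then compare records by cosine similarity instead of raw text. The toy documents and vector size are illustrative only:

```python
import numpy as np
from pyspark.ml.feature import Word2Vec
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fingerprints").getOrCreate()

docs = spark.createDataFrame([
    ("doc1", "deep learning for author disambiguation".split()),
    ("doc2", "supervised learning for record deduplication".split()),
    ("doc3", "spark streaming at scale".split()),
], ["id", "words"])

# Learn dense "fingerprints"; tiny sizes are only for the example.
w2v = Word2Vec(vectorSize=16, minCount=1, inputCol="words", outputCol="vector")
model = w2v.fit(docs)
vectors = {row["id"]: row["vector"].toArray()
           for row in model.transform(docs).collect()}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pairwise similarity over fingerprints replaces expensive raw-text comparison.
print(cosine(vectors["doc1"], vectors["doc2"]))
print(cosine(vectors["doc1"], vectors["doc3"]))
```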
Using Databricks as an Analysis Platform (Databricks)
Over the past year, YipitData spearheaded a full migration of its data pipelines to Apache Spark via the Databricks platform. Databricks now empowers its 40+ data analysts to independently create data ingestion systems, manage ETL workflows, and produce meaningful financial research for our clients.
Deep Learning on Apache® Spark™: Workflows and Best Practices (Jen Aman)
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
The document discusses lessons learned from building a real-time data processing platform using Spark and microservices. Key aspects include:
- A microservices-inspired architecture was used with Spark Streaming jobs processing data in parallel and communicating via Kafka.
- This modular approach allowed for independent development and deployment of new features without disrupting existing jobs.
- While Spark provided batch and streaming capabilities, managing resources across jobs and achieving low latency proved challenging.
- Alternative technologies like Kafka Streams and Confluent's Schema Registry were identified to improve resilience, schemas, and processing latency.
- Overall the platform demonstrated strengths in modularity, A/B testing, and empowering data scientists, but faced challenges around
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi (Databricks)
At Apple we rely on processing large datasets to power key components of Apple’s largest production services. Spark is continuing to replace and augment traditional MR workloads with its speed and low barrier to entry. Our current analytics infrastructure consists of over an exabyte of storage and close to a million cores. Our footprint is also growing further with the addition of new elastic services for streaming, ad hoc and interactive analytics.
In this talk we will cover the challenges of working at scale with tricks and lessons learned managing large multi-tenant clusters. We will also discuss designing and building a self-service elastic analytics platform on Mesos.
Insights Without Tradeoffs: Using Structured Streaming (Databricks)
Apache Spark 2.0 introduced Structured Streaming, which allows users to continually and incrementally update their view of the world as new data arrives while still using the same familiar Spark SQL abstractions. Michael Armbrust from Databricks talks about the progress made since the release of Spark 2.0 on robustness, latency, expressiveness and observability, using examples of production end-to-end continuous applications.
Speaker: Michael Armbrust
Video: http://go.databricks.com/videos/spark-summit-east-2017/using-structured-streaming-apache-spark
This talk was originally presented at Spark Summit East 2017.
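A minimal Structured Streaming sketch of the "incrementally updated view" idea: the same groupBy you would write on a batch DataFrame, kept continuously up to date as new rows arrive (the rate source here stands in for a real stream such as Kafka):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-view").getOrCreate()

# The rate source stands in for a real event stream (Kafka, Event Hubs, ...).
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# The same Spark SQL abstractions as batch: this aggregation is updated
# incrementally per micro-batch, maintaining a continuously fresh view.
counts = (events
          .withWatermark("timestamp", "1 minute")
          .groupBy(F.window("timestamp", "10 seconds"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("truncate", False)
         .start())
query.awaitTermination(60)
```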
MLflow: Infrastructure for a Complete Machine Learning Life Cycle (Databricks)
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.
In this talk, we will present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
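A minimal sketch of the MLflow tracking API on a toy scikit-learn model; the run name and parameter values are arbitrary:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Each run records parameters, metrics and the model itself, so experiments
# can be compared and reproduced later from the tracking UI.
with mlflow.start_run(run_name="iris-baseline"):
    C = 0.5
    model = LogisticRegression(C=C, max_iter=200).fit(X, y)
    mlflow.log_param("C", C)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")
```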
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi... (Databricks)
Predictive intelligence from machine learning has the potential to change everything in our day to day experiences, from education to entertainment, from travel to healthcare, from business to leisure and everything in between. Modern ML frameworks are batch by nature and cannot pivot on the fly to changing user data or situations. Many simple ML applications such as those that enhance the user experience, can benefit from real-time robust predictive models that adapt on the fly.
Join this session to learn how common practices in machine learning such as running a trained model in production can be substantially accelerated and radically simplified by using Redis modules that natively store and execute common models generated by Spark ML and Tensorflow algorithms. We will also discuss the implementation of simple, real-time feed-forward neural networks with Neural Redis and scenarios that can benefit from such efficient, accelerated artificial intelligence.
Real-life implementations of these new techniques at a large consumer credit company for fraud analytics, at an online e-commerce provider for user recommendations and at a large media company for targeting content will also be discussed.
Empowering Zillow’s Developers with Self-Service ETL (Databricks)
The document discusses Zillow's efforts to empower its developers with self-service extract, transform, load (ETL) tools. It describes two key components: Zetlas, a user-friendly tool that allows non-coding users to automate SQL workflows through a graphical interface; and Zagger, a developer-focused service that automates data engineering tasks and integrates with common ETL tools. The tools were developed to meet different user needs, lower the barrier to data work, and provide modular platforms that can be expanded over time. Zillow aims to continue growing adoption of these self-service ETL tools and unifying their backends.
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia (Jen Aman)
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.
Time Series Anomaly Detection with Azure and .NET (Marco Parenzan)
If you have any device or source that generates values over time (even a log from a service), you want to determine whether, in a given time frame, the time series is correct or whether you can detect some anomalies. What can you do as a developer (not a Data Scientist) with .NET or Azure? Let's see how in this session.
Data Build Tool (DBT) is an open source technology for setting up your data lake using best practices from software engineering. This SQL-first technology is a great marriage between Databricks and Delta, allowing you to maintain high-quality data and documentation during the entire data lake lifecycle. In this talk I’ll give an introduction to DBT and show how we can leverage Databricks to do the actual heavy lifting. Next, I’ll present how DBT supports Delta to enable upserting using SQL. Finally, we show how we integrate DBT+Databricks into the Azure cloud and how we emit the pipeline metrics to Azure Monitor to make sure that you have observability over your pipeline.
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig... (Alex Zeltov)
Introduction to Big Data Analytics using Apache Spark on HDInsight on Azure (SaaS) and/or HDP on Azure (PaaS)
This workshop will provide an introduction to Big Data Analytics using Apache Spark on HDInsight on Azure (SaaS) and/or an HDP deployment on Azure (PaaS). There will be a short lecture that includes an introduction to Spark and the Spark components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
The lecture will be followed by a demo. There will be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create a Hive table, explore the data with Spark and SQL, transform the data, and then issue some SQL queries. We will be using Scala and/or PySpark for the labs.
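A short PySpark sketch of that workflow (load raw data, register it for SQL, transform and query, then persist a table). The file path, schema and table name are made up for the example, and enableHiveSupport assumes a Hive metastore is available, as it is on HDInsight/HDP:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdinsight-workshop")
         .enableHiveSupport()          # needed for saveAsTable on a Hive metastore
         .getOrCreate())

# Load raw data from HDFS (path and columns are illustrative).
sales = (spark.read
         .option("header", True)
         .option("inferSchema", True)
         .csv("hdfs:///data/raw/sales.csv"))

# Register the data for SQL, then transform and query it.
sales.createOrReplaceTempView("sales")
top_products = spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
    LIMIT 10
""")
top_products.show()

# Optionally persist the result as a Hive table for other tools to query.
top_products.write.mode("overwrite").saveAsTable("top_products")
```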
This document discusses Apache Zeppelin, an open-source web-based notebook that enables interactive data analytics. It provides an overview of Zeppelin's history and architecture, including how interpreters and notebook storage are pluggable. The document also outlines Zeppelin's roadmap for improving enterprise support through features like multi-tenancy, impersonation, job management and frontend performance.
This document discusses Apache Zeppelin, an open-source web-based notebook that allows for interactive data analytics. It can be used for data exploration, visualization, collaboration and publishing. Zeppelin has deep integration with Apache Spark and supports multiple languages including Scala, Python, and SQL. It provides a Spark interpreter that allows users to analyze data using Spark without having to configure Spark themselves. The document demonstrates Zeppelin's functionality through examples and encourages readers to try it out and get involved in the community.
This document summarizes a presentation on using SQL Server Integration Services (SSIS) with HDInsight. It introduces Tillmann Eitelberg and Oliver Engels, who are experts on SSIS and HDInsight. The agenda covers traditional ETL processes, challenges of big data, useful Apache Hadoop components for ETL, clarifying statements about Hadoop and ETL, using Hadoop in the ETL process, how SSIS is more than just an ETL tool, tools for working with HDInsight, getting started with Azure HDInsight, and using SSIS to load and transform data on HDInsight clusters.
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum... (Spark Summit)
Devops engineers have applied a great deal of creativity and energy to invent tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can be a distraction from focusing on the application logic. In the spirit of devops, automating Spark cluster management tasks allows engineers to focus their attention on application code that provides value to end-users.
Using Openshift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.
We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.
The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.
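This is not the OpenShift-specific self-scaling mechanism the talk demonstrates, but as a point of comparison, stock Spark dynamic allocation (Spark 3.x, for example on Kubernetes with shuffle tracking) provides a basic form of the same elasticity through configuration alone:

```python
from pyspark.sql import SparkSession

# Built-in elasticity knobs: executors are added and removed based on the task
# backlog, between the min/max bounds below. Values are illustrative.
spark = (SparkSession.builder
         .appName("elastic-job")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "1")
         .config("spark.dynamicAllocation.maxExecutors", "20")
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
         .getOrCreate())

# Subsequent jobs now scale the executor count up and down automatically.
print(spark.range(10_000_000).selectExpr("sum(id)").collect())
```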
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ... (Databricks)
BigDL is a distributed deep learning framework for Apache Spark open sourced by Intel. BigDL helps make deep learning more accessible to the Big Data community, by allowing them to continue the use of familiar tools and infrastructure to build deep learning applications. With BigDL, users can write their deep learning applications as standard Spark programs, which can then directly run on top of existing Spark or Hadoop clusters.
In this session, we will introduce BigDL, how our customers use BigDL to build End to End ML/DL applications, platforms on which BigDL is deployed and also provide an update on the latest improvements in BigDL v0.1, and talk about further developments and new upcoming features of BigDL v0.2 release (e.g., support for TensorFlow models, 3D convolutions, etc.).
Accelerating Machine Learning on Databricks Runtime (Databricks)
"We all know the unprecedented potential impact for Machine Learning. But how do you take advantage of the myriad of data and ML tools now available? How do you streamline processes, speed up discovery, share knowledge, and scale up implementations for real-life scenarios?
In this talk, we'll cover some of the latest innovations brought into the Databricks Unified Analytics Platform for Machine Learning. In particular we will show you how to:
- Get started quickly using the Databricks Runtime for Machine Learning, that provides pre-configured Databricks clusters including the most popular ML frameworks and libraries, Conda support, performance optimizations, and more.
- Get started with most popular Deep Learning frameworks within a few minutes and go deep with state of the art model DL diagnostics tools.
- Scale up Deep Learning training workloads from a single machine to large clusters for the most demanding applications using the new HorovodRunner with ease.
- How all of these ML frameworks get exposed to large and distributed data using Databricks Runtime for Machine Learning."
The document summarizes the past, present, and future of Hadoop at LinkedIn. It describes how LinkedIn initially implemented PYMK on Oracle in 2006, then moved to Hadoop in 2008 with 20 nodes, scaling up to over 10,000 nodes and 1000 users by 2016 running various big data frameworks. It discusses the challenges of scaling hardware and processes, and how LinkedIn developed tools like HDFS Dynamometer, Dr. Elephant, Byte-Ray and SoakCycle to help with scaling, performance tuning, dependency management and integration testing of Hadoop clusters. The future may include the Dali project to make data more accessible through different views.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters... (Databricks)
At the end of the day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights rather than preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
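A heavily reduced PySpark sketch of the core idea: apply a provider-dictated schema to a JSON stream as early as possible and land it somewhere SQL-queryable. GoPro's actual pipeline spans Spark Streaming, Kafka, HBase, Hive and S3; the paths and fields below are invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("dynamic-ddl-sketch").getOrCreate()

# Schema dictated by the data provider, applied as soon as the data arrives.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
])

events = spark.readStream.schema(schema).json("/landing/device-events/")

# Land the structured stream where analysts can reach it with plain SQL.
query = (events.writeStream
         .format("parquet")
         .option("path", "/curated/device_events")
         .option("checkpointLocation", "/checkpoints/device_events")
         .outputMode("append")
         .trigger(processingTime="1 minute")
         .start())
```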
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE... (Michael Rys)
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
.net developer for Jupyter Notebook and Apache Spark and viceversa (Marco Parenzan)
Jupyter Notebooks and Apache Spark are first-class citizens of the Data Science space, a true requirement for the "modern" data scientist. But there was a prerequisite: being a Python developer. Now Microsoft is investing in C# as another first-class citizen in this space. Let's look at what .NET can do for notebooks and Spark, and at what notebooks and Spark are.
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ... (Michael Rys)
This document introduces .NET for Apache Spark, which allows .NET developers to use the Apache Spark analytics engine for big data and machine learning. It discusses why .NET support is needed for Apache Spark given that much business logic is written in .NET. It provides an overview of .NET for Apache Spark's capabilities including Spark DataFrames, machine learning, and performance that is on par or faster than PySpark. Examples and demos are shown. Future plans are discussed to improve the tooling, expand programming experiences, and provide out-of-box experiences on platforms like Azure HDInsight and Azure Databricks. Readers are encouraged to engage with the open source project and provide feedback.
This presentation focuses on the value proposition for Azure Databricks for Data Science. First, the talk includes an overview of the merits of Azure Databricks and Spark. Second, the talk includes demos of data science on Azure Databricks. Finally, the presentation includes some ideas for data science production.
Apache Arrow at DataEngConf Barcelona 2018 (Wes McKinney)
Wes McKinney is a leading open source developer who created Python's pandas library and now leads the Apache Arrow project. Apache Arrow is an open standard for in-memory analytics that aims to improve data sharing and reuse across systems by defining a common columnar data format and memory layout. It allows data to be accessed and algorithms to be reused across different programming languages with near-zero data copying. Arrow is being integrated into various data systems and is working to expand its computational libraries and language support.
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015 (Mike Broberg)
Use Apache Spark Streaming with IBM Watson on Bluemix to perform sentiment analysis and track how a conversation is trending on Twitter.
By David Taieb: https://twitter.com/DTAIEB55
Video: https://youtu.be/KLc_wazud3s
Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/
Architecting an Open Source AI Platform 2018 edition (David Talby)
How to build a scalable AI platform using open source software. The end-to-end architecture covers data integration, interactive queries & visualization, machine learning & deep learning, deploying models to production, and a full 24x7 operations toolset in a high-compliance environment.
Simplifying AI integration on Apache Spark (Databricks)
Spark is an ETL and data processing engine especially suited for big data. Most of the time an organization has different teams working in different languages, frameworks and libraries, which need to be integrated into the ETL pipelines or used for general data processing. For example, a Spark ETL job may be written in Scala by the data engineering team, but there is a need to integrate a machine learning solution written in Python/R developed by the data science team. These kinds of solutions are not very straightforward to integrate with the Spark engine, and that requires a great amount of collaboration between different teams, hence increasing overall project time and cost. Furthermore, these solutions keep changing and upgrading over time, using the latest versions of the technologies and improved designs and implementations, especially in the machine learning domain, where ML models and algorithms keep improving with new data and new approaches. So there is significant downtime involved in integrating these upgraded versions.
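One common way to bridge that gap, shown here as a hedged sketch rather than the talk's own solution, is to wrap the Python team's scoring logic in a vectorized pandas UDF so a Spark ETL pipeline can call it directly; the column names and the toy scoring function are invented:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("embed-python-model").getOrCreate()

@pandas_udf(DoubleType())
def risk_score(amount: pd.Series) -> pd.Series:
    # Toy stand-in for the data science team's model; in practice the model
    # could be loaded from a pickle or an MLflow artifact inside the UDF.
    return 1.0 / (1.0 + np.exp(-(amount - 50.0) / 10.0))

df = spark.createDataFrame([(1, 10.0), (2, 55.0), (3, 200.0)], ["id", "amount"])

# The ETL pipeline calls the Python logic through a vectorized UDF instead of
# a bespoke integration layer between teams.
df.withColumn("risk", risk_score("amount")).show()
```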
Spark + AI Summit 2020 had over 35,000 attendees from 125 countries. The majority of participants were data engineers and data scientists. Apache Spark is now widely used with Python and SQL. Spark 3.0 includes improvements like adaptive query execution that accelerate queries by 2-18x. Delta Engine is a new high performance query engine for data lakes built on Spark 3.0.
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform (Yao Yao)
Yao Yao Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
Ananth Ravishankar has over 9 years of experience developing web and distributed applications using Java/J2EE technologies. He has expertise in all phases of the software development lifecycle and experience working with agile methodologies. Ananth has worked on projects in various domains including retail, investment banking, and media research. He is proficient in technologies such as Java, JSP, Servlets, Struts, Hadoop, Hive, and relational databases.
Ai & Data Analytics 2018 - Azure Databricks for data scientist (Alberto Diaz Martin)
This document summarizes a presentation given by Alberto Diaz Martin on Azure Databricks for data scientists. The presentation covered how Databricks can be used for infrastructure management, data exploration and visualization at scale, reducing time to value through model iterations and integrating various ML tools. It also discussed challenges for data scientists and how Databricks addresses them through features like notebooks, frameworks, and optimized infrastructure for deep learning. Demo sections showed EDA, ML pipelines, model export, and deep learning modeling capabilities in Databricks.
Presentation on the struggles with traditional architectures and an overview of the Lambda Architecture utilizing Spark to drive massive amounts of both batch and streaming data for processing and analytics
In this session we will delve into the world of Azure Databricks and analyze why it is becoming a fundamental tool for the data scientist and/or data engineer in conjunction with Azure services.
Big Data and NoSQL for Database and BI Pros (Andrew Brust)
This document provides an agenda and overview for a conference session on Big Data and NoSQL for database and BI professionals held from April 10-12 in Chicago, IL. The session will include an overview of big data and NoSQL technologies, then deeper dives into Hadoop, NoSQL databases like HBase, and tools like Hive, Pig, and Sqoop. There will also be demos of technologies like HDInsight, Elastic MapReduce, Impala, and running MapReduce jobs.
Using PySpark to Process Boat Loads of Data (Robert Dempsey)
Learn how to use PySpark for processing massive amounts of data. Combined with the GitHub repo - https://github.com/rdempsey/pyspark-for-data-processing - this presentation will help you gain familiarity with processing data using Python and Spark.
If you're thinking about machine learning and not sure if it can help improve your business, but want to find out, set up a free 20-minute consultation with us: https://calendly.com/robertwdempsey/free-consultation
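A representative PySpark snippet of the kind of processing the deck walks through (read, filter, derive, aggregate, rank); the bucket path and column names are invented for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("boatloads").getOrCreate()

# File layout and columns are invented for the example.
rides = (spark.read
         .option("header", True)
         .option("inferSchema", True)
         .csv("s3a://example-bucket/rides/*.csv"))

# Typical PySpark processing: filter, derive a column, aggregate, rank.
daily = (rides
         .filter(F.col("fare") > 0)
         .withColumn("day", F.to_date("pickup_time"))
         .groupBy("day", "city")
         .agg(F.count("*").alias("trips"),
              F.round(F.avg("fare"), 2).alias("avg_fare"))
         .orderBy(F.desc("trips")))

daily.show(20, truncate=False)
```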
Venkata Kumar has over 8 years of experience as a senior Python developer. He has extensive experience developing web and mobile applications using Python frameworks like Django and Flask. He has also worked on backend development with databases like MongoDB, MySQL, Oracle, and SQL Server. Some of his responsibilities have included designing and developing RESTful APIs, building automated workflows using Python, and implementing responsive user interfaces with HTML, CSS, JavaScript, and frameworks like AngularJS.
We usually talk about and present Azure IoT (Central) with a bit of a "maker" slant. In this session, instead, we talk to the SCADA engineer. How do you configure Azure IoT Central for the industrial world? Where is OPC/UA? What does IoT Plug & Play have to do with all this? And Azure IoT Central... what advantages does it give us? We try to answer these and other questions in this session...
The Azure developer loves PaaS services because they are "ready to use". But when we propose our solutions to companies, we run into IT departments that appreciate infrastructure elements, IaaS. Why not (re)discover them, adding a pinch of Hybrid, which with the recent Azure Kubernetes Service Edge Essentials can even run on hardware you can keep at home? So in this session we will discover, among other things, VNETs, S2S VPNs, Azure Arc, Private Endpoints, and AKS EE.
Static abstract members nelle interfacce di C# 11 e dintorni di .NET 7.pptx (Marco Parenzan)
Did interfaces in C# need to evolve? Maybe yes. Are they violating some fundamental principles? We'll see. Are we being asked to jump through some hoops? Let's look at all of this by telling a story (of code, of course).
Azure Synapse Analytics for your IoT Solutions (Marco Parenzan)
Let's find out in this session how Azure Synapse Analytics, with its SQL Serverless Pool, ADX, Data Factory, Notebooks, and Spark, can be useful for managing data analysis in an IoT solution.
Power BI Streaming Data Flow e Azure IoT Central (Marco Parenzan)
Since 2015, Power BI users have been able to analyze data in real-time thanks to the integration with other Microsoft products and services. With streaming dataflow, you'll bring real-time analytics completely within Power BI, removing most of the restrictions we had, while integrating key analytics features like streaming data preparation and no coding. To see it in action, we will study a specific case of streaming such as IoT with Azure IoT Central.
What are actors? What are they used for? How can we develop them? And how are they published and used on Azure? Let's see how it's done in this session.
Generic Math, a feature now scheduled for .NET 7, and Azure IoT PnP have reawakened a topic that, in my past, led me to take two or three trips, thanks to the University of Trieste, to Cambridge (around 2006/2007) and to Seattle (2010, when I spoke publicly about Azure for the first time :) and where I got to know the legendary Don Box!), to talk about .NET code dealing with mathematics and physics: units of measure and matrices. The arrival of Notebooks in the .NET world and an old dream tied to the ANTLR library (and all my code generation exercises) lead me to put this jumble of ideas in order... or at least I'll try (I don't know if it all holds together).
.NET gets better every year for a developer who still dreams of developing a video game. Without pretensions, and without talking about Unity or any other framework, just "barebones" .NET code, we will try to write a game (or parts of it) in the style of the 80s (because I was a kid in those years). In Christmas style.
Building IoT infrastructure on edge with .net, Raspberry PI and ESP32 to conn... (Marco Parenzan)
The document discusses building an IoT infrastructure on the edge with .NET that connects devices like Raspberry Pis and ESP32s to Azure. It describes setting up a network of Raspberry Pi devices running .NET Core and connecting sensors to collect data and send events to an Apache Kafka cluster. The events are then aggregated using Apache Spark on another Raspberry Pi and the results routed to the cloud. Issues encountered include Kafka's Java dependencies, Spark's complex processing model, and lack of documentation around integrating Pi, Kafka and Spark. While the technologies work individually, configuring and integrating them presented challenges at the edge.
How can you handle defects? If you are in a factory, production can turn out objects with defects. Or values from sensors can tell you over time that some values are not "normal". What can you do as a developer (not a Data Scientist) with .NET or Azure to detect these anomalies? Let's see how in this session.
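The session is about doing this with .NET and Azure services; purely to make the problem concrete, here is a naive rolling z-score baseline in Python (not ML.NET or Azure Anomaly Detector) that flags points deviating strongly from their recent history:

```python
import statistics

def rolling_zscore_anomalies(values, window=20, threshold=3.0):
    """Flag points that deviate more than `threshold` standard deviations
    from the mean of the preceding `window` points. A naive baseline."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.fmean(history)
        std = statistics.pstdev(history) or 1e-9  # avoid division by zero
        if abs(values[i] - mean) / std > threshold:
            anomalies.append((i, values[i]))
    return anomalies

# A flat signal with one spike: only the spike is reported as anomalous.
series = [10.0] * 50 + [42.0] + [10.0] * 20
print(rolling_zscore_anomalies(series))
```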
What advantages does Azure give us? From the software development point of view, one of them is certainly the variety of data management services. This allows us to stop being SQL-centric and to use the right service for the right problem, up to applying a Polyglot Persistence strategy (and we will see what that means), while respecting proper IT resource management and DevOps practices.
- Azure IoT Central provides a fully managed platform for building IoT solutions that is compliant with the Azure IoT platform.
- It offers predictable pricing per device, forces useful modeling practices like device twins and plug and play, and provides industry templates to accelerate solution building.
- While it handles much of the complexity, it also maintains compatibility with customizing solutions using the full Azure IoT platform and other Azure services.
It happens that we have to develop several services and deploy them in Azure. They are small and repetitive, but different; often not very different. Why not use code generation techniques to simplify the development and deployment of these services? Let's see how .NET comes to meet us and helps us deploy to Azure.
Running Kafka and Spark on Raspberry PI with Azure and some .net magicMarco Parenzan
IoT scenarios necessarily pass through the Edge component, and the Raspberry Pi is a great way to explore this world. If we need to receive IoT events from sensors, how do I implement an MQTT endpoint? Kafka is a clever way to do this. And how do I process the data in Kafka? Spark is another clever way of doing this. How do we write custom code for these environments? .NET, now in version 6, is another clever way to do it! And maybe we also communicate with Azure. We'll see in this session if we can make it all work!
3. Marco Parenzan
Solution Sales Specialist in Insight for Digital Innovation
Azure MVP
Community Lead 1nn0va // Pordenone
Linkedin: https://www.linkedin.com/in/marcoparenzan/
4. .NET per la Data Science
(e anche di più)
Marco Parenzan
5. C# language evolution
• C# 1.0 was a new managed language
• C# 2.0 introduced generics
• C# 3.0 enabled LINQ
• C# 4.0 was all about interoperability with dynamic non-strongly typed languages.
• C# 5.0 simplified asynchronous programming with the async and await keywords.
• With C# 6.0 the language has been increasingly shaped by conversation with the community, now to the point of taking language features as contributions from outside Microsoft
• C# 7.x will be no exception, with tuples and pattern matching as the biggest features, transforming and streamlining the flow of data and control in code.
Point releases
• C# 7.1, 7.2, 7.3: Safe Efficient Code, More Freedom, Less Code
• C# 8 running in the function path
• C# 9: records, top-level statements
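Since C# 9 is the last release named in this list, here is a minimal hedged sketch of the two features it calls out, records and top-level statements (the type and values are invented for the example):

    using System;

    // Top-level statements: no class or Main declaration needed in this file.
    var r1 = new SensorReading("device-42", 21.5, DateTime.UtcNow);
    var r2 = r1 with { Temperature = 22.0 };   // non-destructive mutation via `with`
    Console.WriteLine(r1 == r2);               // False: records compare by value

    // A C# 9 record: positional constructor, init-only properties, value-based equality.
    public record SensorReading(string DeviceId, double Temperature, DateTime Timestamp);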
7. Top-level statements
• Only one file in the project can contain top-level statements: the code you would otherwise write inside a Main method goes directly in the file, without declaring a method or a class
• A second file with top-level statements is a compile error
• You have no reference to the containing class (it is compiler generated)
• …and it is async by default: you can await directly at the top level!
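A minimal sketch of a Program.cs made only of top-level statements (the URL is illustrative):

    using System;
    using System.Net.Http;

    // The whole file is the entry point: no namespace, class, or Main declaration.
    using var http = new HttpClient();
    var body = await http.GetStringAsync("https://example.org");   // await works at the top level
    Console.WriteLine($"Downloaded {body.Length} characters at {DateTime.UtcNow:O}");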
10. What about Data Science and Spark?
• In a recent survey, more than 70% of .NET devs expressed interest in Apache Spark
• Millions of lines of big data-usable business logic are written in .NET
• But .NET devs are locked out from big data processing: lack of .NET support in OSS big data solutions
• We want a first-class .NET data processing experience
11. Batch vs. Notebooks
• Batch
  • Works on slow data stored in a data lake
  • Submit a complete app in one single deploy
  • Receive the entire output
• Notebook
  • «Sketching» the code
  • Write/delete/rewrite continuously
  • Run cell by cell (but also all at once): interactive
  • In the world of Mathematica
13. Evolution of REPL
• At the beginning there was Mono
• Then Dynamic/DLR (C# 4)
• C#/F# Interactive (C# 6 + Roslyn)
• Try .NET, then .NET Interactive
14. .NET Interactive Architectural Overview
• The kernel concept in .NET Interactive is a component that accepts commands and produces outputs.
• The commands are typically blocks of arbitrary code, and the outputs are events that describe the results and effects of that code. The Kernel class represents this core abstraction.
• “Coding like a chat with a Bot”
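A hedged sketch of how .NET Interactive is typically installed and registered as a set of Jupyter kernels (tool and command names as documented for .NET Interactive around the time of this deck; versions may differ):

    dotnet tool install -g Microsoft.dotnet-interactive
    dotnet interactive jupyter install   # registers the .NET (C#/F#/PowerShell) kernels with Jupyter

After that, `jupyter kernelspec list` should show the .NET kernels alongside the Python one.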
15. Jupyter
• Evolution and generalization of the seminal role of Mathematica (the notebook)
• + Python adoption (ipynb)
• + Web (HTTP + HTML + Markdown)
• + Kernel
17. Data Processing in an Azure World...
• Thousands of IoT sensors in a factory, producing petabytes of data
• Started with Stream Analytics...
• Like Azure Data Explorer...
• Data Warehouse...
• ...but is there a standard? Yes!
19. Apache Spark
• Spark unifies: batch processing, interactive SQL, real-time processing, machine learning, deep learning and graph processing
• A unified, open source, parallel data processing framework for Big Data Analytics
• [Architecture diagram] The Spark Core Engine, running on YARN, with the libraries Spark SQL (batch processing), Spark Structured Streaming / Spark Streaming (stream processing), Spark MLlib (machine learning) and GraphX (graph computation)
• http://spark.apache.org
20. General Spark Cluster Architecture
• [Diagram: a Driver Program with its SparkContext, a Cluster Manager, and worker Nodes with caches, all reading from Data Sources (HDFS, SQL, NoSQL, …)]
• The ‘Driver’ runs the user’s ‘main’ function and executes the various parallel operations on the worker nodes.
• The results of the operations are collected by the driver.
• The worker nodes read and write data from/to Data Sources including HDFS.
• Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
• Worker nodes and the Driver Node execute as VMs in public clouds (AWS, Google and Azure).
22. DataFrame as the core of Spark
• Recipe:
  • Create a session
  • Create a DataFrame
  • Define a user-defined function
  • Manipulate and view data
• [Diagram] CSV, JSON, RDBMS, Parquet and binary data all feed a DataFrame; user programs are written against the DataFrame abstraction
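A minimal sketch of that recipe in C# with the Microsoft.Spark package (the CSV file, column names and the UDF are invented for the example):

    using Microsoft.Spark.Sql;
    using static Microsoft.Spark.Sql.Functions;

    // 1. Create the session
    var spark = SparkSession.Builder().AppName("sensor-demo").GetOrCreate();

    // 2. Create a DataFrame from a CSV file (path and schema are illustrative)
    DataFrame df = spark.Read().Option("header", "true").Csv("sensors.csv");

    // 3. Define a user-defined function: flag readings above a threshold
    var isHot = Udf<string, bool>(t => double.TryParse(t, out var v) && v > 30.0);

    // 4. Manipulate and view the data
    df.WithColumn("hot", isHot(df["temperature"]))
      .GroupBy("deviceId", "hot")
      .Count()
      .Show();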
24. .NET for Apache Spark
• .NET bindings (C# and F#) to Spark
• Written on the Spark interop layer, designed to provide high performance bindings to multiple languages
• Re-use the knowledge, skills and code you have as a .NET developer
• Compliant with .NET Standard
• You can use .NET for Apache Spark anywhere you write .NET code
• Original project: Mobius
  • https://github.com/microsoft/Mobius
25. .NET Spark support
• Spark DataFrames with SparkSQL
  • Spark 2.3.x, 2.4.x, 3.0
  • ~300 SparkSQL functions
  • DeltaLake
• .NET Standard 2.0
  • C#/F#
  • .NET Framework 4.6.1+
  • .NET Core 2.1+
• Batch & Streaming
  • Structured Streaming
• Data Science
  • ML.NET
  • Notebooks
26. Using .NET for Spark
• Get started with .NET for Apache Spark | Microsoft Docs
  • https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started?tabs=windows
• Install .NET
• Install Java
• Install Apache Spark
• Install .NET for Apache Spark
• Create your app
• Install NuGet package
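As a sketch, those steps typically end with commands along these lines, following the get-started tutorial linked above (project name, paths and the package/worker versions are illustrative):

    # create the app and add the Spark bindings
    dotnet new console -o MySparkApp
    dotnet add MySparkApp package Microsoft.Spark
    dotnet build MySparkApp

    # submit the compiled app to a local Spark installation through the .NET runner
    spark-submit \
      --class org.apache.spark.deploy.dotnet.DotnetRunner \
      --master local \
      MySparkApp/bin/Debug/net6.0/microsoft-spark-3-0_2.12-2.1.1.jar \
      dotnet MySparkApp/bin/Debug/net6.0/MySparkApp.dll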
27. Run Spark with .NET in a container
• Not an official Microsoft image
  • https://hub.docker.com/r/3rdman/dotnet-spark
• Install nothing on your machine other than Docker
• Launch and debug your code from Visual Studio and Visual Studio Code
• Very good for development
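A hedged sketch of how such an image is typically pulled and started (the tag, the shell entrypoint and any port mappings are assumptions; check the image page on Docker Hub for the actual tags and options):

    docker pull 3rdman/dotnet-spark:latest
    # start an interactive container; the exact entrypoint depends on the chosen tag
    docker run -it --rm 3rdman/dotnet-spark:latest bash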
30. Engine for business-changing insights with seamless ecosystem integration
• [Diagram] Azure Synapse Analytics brings together data integration, data warehousing and big data processing
• Built on Azure Data Lake Storage + Common Data Model
31. Azure Synapse Analytics
Limitless analytics service with unmatched time to insight
• Platform: Azure Data Lake Storage and the Common Data Model, with enterprise security; optimized for analytics, data lake integrated and Common Data Model aware
• Integrated platform services for management, security, monitoring and the metastore
• Analytics runtimes: available provisioned and serverless on-demand; Synapse SQL offering T-SQL for batch, streaming and interactive processing; Apache Spark for big data processing with Python, Scala and .NET; plus data integration
• Form factors: provisioned and on-demand SQL
• Languages: SQL, Python, .NET, Java, Scala, R; multiple languages suited to different analytics workloads
• Experience: Synapse Analytics Studio, SaaS developer experiences for code free and code first
• Designed for analytics workloads at any scale, feeding Artificial Intelligence / Machine Learning / Internet of Things and Intelligent Apps / Business Intelligence
32. Manage – Apache Spark pools
• Overview
• Provides ability to Pause, Scale, Assign Tags and upload packages from Studio.
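Spark pools can also be managed outside the Studio; a hedged Azure CLI sketch of creating one (resource names, node size and Spark version are invented for the example):

    az synapse spark pool create \
      --name demopool \
      --workspace-name my-synapse-workspace \
      --resource-group my-resource-group \
      --spark-version 3.1 \
      --node-size Small \
      --node-count 3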
35. Synapse Service
• [Architecture diagram listing: Synapse Studio, a Gateway with AAD Auth Service and Resource Provider, the Synapse Job Service with its Job Service Frontend (Spark API Controller), Job Service Backend (Spark Plugin) and Instance Creation Service, backing databases, and a Spark Instance running as VMs in Azure]
36. Develop Hub - Notebooks
• Notebooks
  • Allow you to write multiple languages in one notebook
    • %%<name of language>
  • Offer the use of temporary tables across languages
  • Language support for syntax highlighting, syntax errors, code completion, smart indent and code folding
  • Export results
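A hedged sketch of what this looks like in a Synapse Spark notebook: two cells, one per language, sharing data through a temporary view (the magic names follow the Synapse documentation; the table and values are invented):

    %%csharp
    // C# (.NET for Apache Spark) cell: build a tiny DataFrame and expose it as a temp view
    var df = spark.Sql("SELECT 'device-1' AS deviceId, 21.5 AS temperature");
    df.CreateOrReplaceTempView("readings");

    %%sql
    -- SQL cell: query the temporary view created by the C# cell
    SELECT deviceId, temperature FROM readings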
39. Conclusion
• Spark is here to stay
• Now it is a place for .NET skills too
• Azure Synapse Analytics is the best fusion between the old and the new world