Data science and deep learning on Spark with 1/10th of the code
Roope Astala, Sudarshan Ragunathan
Microsoft Corporation
Agenda
• User story: Snow Leopard Conservation
• Introducing Microsoft Machine Learning Library for Apache Spark
– Vision: Productivity, Scalability, State-of-Art Algorithms, Open Source
– Deep Learning and Image Processing
– Text Analytics
Disclaimer: Any roadmap items are subject to change without notice.
Apache®, Apache Spark, and Spark® are either registered trademarks or trademarks of the Apache
Software Foundation in the United States and/or other countries.
Rhetick Sengupta
President, Board of Directors
Snow leopards
• 3,900-6,500 individuals left in the wild
• Variety of Threats
– Poaching
– Retribution killing
– Loss of prey
– Loss of habitat (mining)
• Little known about their ecology, behavior, movement patterns, survival rates
• More data required to influence survival
Range spread across 1.5 million km²
Camera trapping since 2009
• 1,700 sq km
• 42 camera traps
• 8,490 trap nights
• 4 primary sampling periods
• 56 secondary sampling periods
• ~1.3 million images
Camera Trap Images
Manually classifying images averages 300 hours per survey.
Automated Image Classification Benefits
Short term
• Thousands of hours of researcher and volunteer time saved
• Resources redeployed to science and conservation vs. image sorting
• Much more accurate data on range and population
Long term
• Population numbers that can be accurately monitored
• Influence governments on protected areas
• Enhance community-based conservation programs
• Predict threats before they happen
www.snowleopard.org
How can you help?
We need more camera surveys!
• 1,700 sq km surveyed of 1,500,000 sq km
• $500 will buy an additional camera
• $5,000 will fund a researcher
• Any amount helps
Contact me directly at rhetick@hotmail.com or donate online.
Microsoft Machine Learning Library for Apache Spark (MMLSpark)
GitHub Repo: https://github.com/Azure/mmlspark
Get started now using Docker image:
docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark
Navigate to http://localhost:8888 to view example Jupyter notebooks
Challenges when using Spark for ML
• User needs to write lots of “ceremonial” code to prepare features for ML algorithms.
– Coerce types and data layout to what’s expected by the learner
– Use different conventions for different learners
• Lack of domain-specific libraries: computer vision or text analytics…
• Limited capabilities for model evaluation & model management
Vision of Microsoft Machine Learning Library for Apache Spark
• Enable you to solve ML problems dealing with large amounts of data.
• Make you as a professional data scientist more productive on Spark, so you can focus on ML problems, not software engineering problems.
• Provide cutting-edge ML algorithms on Spark: deep learning, computer vision, text analytics…
• Open source at GitHub
Example: Hello MMLSpark
Predict income from US Census data
Using Plain PySpark
(Several slides of plain PySpark code, shown as images in the original deck.)
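For contrast, a minimal sketch of the kind of ceremony plain SparkML requires for the same task. The column names and stages here are assumptions based on the standard US Census Adult dataset, not taken from the original slides:

# Hypothetical plain-SparkML equivalent: every categorical column needs
# its own indexing and encoding stage before assembly into one vector.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

categorical = ["workclass", "education", "marital-status", "occupation"]  # assumed columns
numeric = ["age", "hours-per-week"]                                       # assumed columns

indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical]
encoders = [OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec") for c in categorical]

# The label must be indexed separately, and all features assembled into a
# single vector column before the learner will accept them.
labelIndexer = StringIndexer(inputCol="income", outputCol="label")
assembler = VectorAssembler(inputCols=[c + "_vec" for c in categorical] + numeric,
                            outputCol="features")

pipeline = Pipeline(stages=indexers + encoders + [labelIndexer, assembler, LogisticRegression()])
model = pipeline.fit(train)
auc = BinaryClassificationEvaluator().evaluate(model.transform(test))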
Using MMLSpark
from mmlspark import TrainClassifier, ComputeModelStatistics
from pyspark.ml.classification import LogisticRegression

# TrainClassifier featurizes the DataFrame columns internally: no manual
# indexing, encoding, or vector assembly needed.
model = TrainClassifier(model=LogisticRegression(), labelCol="income").fit(train)
prediction = model.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
Algorithms
• Deep learning through Microsoft Cognitive Toolkit (CNTK)
– Scale-out DNN featurization and scoring: take an existing DNN model, or train one locally on a big GPU machine, and deploy it to a Spark cluster to score large data.
– Scale-up training on a GPU VM: preprocess large data on Spark cluster workers and feed it to the GPU to train the DNN.
• Scale-out algorithms for “traditional” ML through SparkML
Design Principles
• Run on every platform and in every language supported by Spark
• Follow the SparkML pipeline model for composability. MMLSpark consists of Estimators and Transformers that can be combined with existing SparkML components into pipelines (see the sketch after this list).
• Use SparkML DataFrames as the common format. You can use existing capabilities in Spark for reading the data into your model.
• Consistent handling of different datatypes – text, categoricals, images – for different algorithms. No need for low-level type coercion, encoding, or vector assembly.
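As an illustration of that composability, a minimal sketch that mixes MMLSpark stages with a built-in SparkML stage in one Pipeline. The column names and the SQLTransformer step are assumptions for illustration, not from the original slides:

from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer
from pyspark.ml.classification import LogisticRegression
from mmlspark import TextFeaturizer, TrainClassifier

# MMLSpark Estimators and Transformers are ordinary SparkML
# PipelineStages, so they compose freely with built-in stages.
pipeline = Pipeline(stages=[
    SQLTransformer(statement="SELECT text, label FROM __THIS__ WHERE text IS NOT NULL"),
    TextFeaturizer(inputCol="text", outputCol="features"),          # MMLSpark stage
    TrainClassifier(model=LogisticRegression(), labelCol="label"),  # MMLSpark stage
])
model = pipeline.fit(trainData)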
Deep Neural Net Featurization
• Basic idea: interior layers of pre-trained DNN models carry high-order information about features
• Using “headless” pre-trained DNNs allows us to extract a really good set of features from images that can in turn be used to train more “traditional” models like random forests, SVM, logistic regression, etc.
– Pre-trained DNNs are typically state-of-the-art models on datasets like ImageNet, MS COCO or CIFAR, for example ResNet (Microsoft), GoogLeNet (Google), Inception (Google), VGG, etc.
• Transfer learning enables us to train effective models where we don’t have enough data, computational power, or domain expertise to train a new DNN from scratch
• Performance scales with the number of executors
(Diagram: images → headless DNN → high-order features → classifier → predictions)
DNN Featurization using MMLSpark
# Point CNTKModel at a pre-trained ResNet and read features from an
# interior ("headless") node instead of the final prediction layer.
cntkModel = (CNTKModel().setInputCol("images")
             .setOutputCol("features")
             .setModelLocation(resnetModel)
             .setOutputNode("z.x"))

# Evaluate the DNN over the images to produce feature vectors.
featurizedImages = cntkModel.transform(imagesWithLabels).select(['labels', 'features'])

# Train a traditional classifier on the DNN features.
model = TrainClassifier(model=LogisticRegression(), labelCol="labels").fit(featurizedImages)
The DNN featurization is incorporated as a SparkML pipeline stage. The evaluation happens directly on the JVM from Scala: no Python UDF overhead!
Image Processing Transforms
DNNs are often picky about their input data shape and normalization.
We provide bindings to OpenCV image ingress and processing operations, exposed as SparkML PipelineStages:
images = spark.readImages(IMAGE_PATH, recursive = True, sampleRatio = 0.1)

# Chain resize and crop steps; each is applied by OpenCV on the workers.
tr = (ImageTransform().setOutputCol("transformed")
      .resize(height = 200, width = 200)
      .crop(0, 0, height = 180, width = 180))

smallImages = tr.transform(images).select("transformed")
Image Pipeline for Object Detection and Classification
• In real data, objects are often not neatly framed, one per image: there can be many candidate sub-images.
• You could extract candidate sub-images using flatMaps and UDFs, but this gets cumbersome quickly (a sketch of the manual approach follows the diagram below).
• We plan to provide simplified pipeline components to support a generic object detection workflow.
(Diagram: detect candidate ROIs → extract sub-images → classify sub-images → predictions: Cloud, None, Sun, Smiley, None, …)
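For a flavor of why the manual route gets cumbersome, a rough sketch of cropping candidate regions with a Python UDF. The schema here (a binary image column plus a column of candidate boxes) is a simplifying assumption, not MMLSpark's actual image type:

import io
from PIL import Image
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, BinaryType

def cropCandidates(imageBytes, boxes):
    # Decode once, then crop and re-encode each candidate box.
    img = Image.open(io.BytesIO(imageBytes))
    crops = []
    for left, top, right, bottom in boxes:
        buf = io.BytesIO()
        img.crop((left, top, right, bottom)).save(buf, format="PNG")
        crops.append(bytearray(buf.getvalue()))
    return crops

cropUdf = udf(cropCandidates, ArrayType(BinaryType()))

# One output row per candidate sub-image; each then needs its own
# featurization and scoring pass, so the bookkeeping grows quickly.
subImages = (images
             .withColumn("crops", cropUdf("image", "boxes"))
             .select(explode("crops").alias("subImage")))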
Training of DNNs on GPU node
• GPUs are very powerful for training DNNs. However, running an entire cluster of GPUs is often too expensive and unnecessary.
• Instead, load and prep large data on a CPU Spark cluster, then feed the prepped data to a GPU node on the virtual network for training. Once the DNN is trained, broadcast the model to the CPU nodes for evaluation.
learner = CNTKLearner(brainScript=brainscriptText, dataTransfer='hdfs-mount',
                      gpuMachines='my-gpu-vm', workingDir='file:/tmp/').fit(trainData)
predictions = learner.setOutputNode('z').transform(testData)
(Diagram: raw data → processed data as DataFrame on the CPU cluster → GPU VM on the same virtual network → trained DNN as PipelineStage)
Text Analytics
• Goal: provide a one-step text featurization capability that lets you take free-form text columns and turn them into feature vectors for ML algorithms.
– Tokenization, stop-word removal, case normalization
– N-gram creation, feature hashing, IDF weighting
• Future: multi-language support, more advanced text preprocessing, and DNN featurization capabilities.
Example: Text analytics for document classification by sentiment
from pyspark.ml import Pipeline
from mmlspark import TextFeaturizer, SelectColumns, TrainClassifier
from pyspark.ml.classification import LogisticRegression

# One stage handles tokenization, stop-word removal, n-grams,
# feature hashing, and IDF weighting.
textFeaturizer = TextFeaturizer(inputCol="text", outputCol="features",
                                useStopWordsRemover=True, useIDF=True,
                                minDocFreq=5, numFeatures=2**16)
columnSelector = SelectColumns(cols=["features", "label"])
classifier = TrainClassifier(model=LogisticRegression(), labelCol="label")
textPipeline = Pipeline(stages=[textFeaturizer, columnSelector, classifier])
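To use the pipeline, fit and score as with any SparkML pipeline. A minimal sketch, assuming train and test DataFrames with text and label columns:

# Fit the whole pipeline on labeled documents, then score held-out data;
# ComputeModelStatistics (shown earlier) summarizes the predictions.
from mmlspark import ComputeModelStatistics

model = textPipeline.fit(train)
prediction = model.transform(test)
metrics = ComputeModelStatistics().transform(prediction)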
Easy to get started
GitHub Repo: https://github.com/Azure/mmlspark
Get started now using Docker image:
docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark
Navigate to http://localhost:8888 for example Jupyter notebooks
Spark package installable for generic Spark 2.1 clusters
Script Action installation on Azure HDInsight cluster
  • 29. Easy to get started GitHub Repo: https://github.com/Azure/mmlspark Get started now using Docker image: docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark Navigate to http://localhost:8888 for example Jupyter notebooks Spark package installable for generic Spark 2.1 clusters Script Action installation on Azure HDInsight cluster