SlideShare a Scribd company logo
Distributed Models Over Distributed Data
with MLflow, Pyspark, and Pandas
Thunder Shiviah
thunder@databricks.com
Senior Solutions Architect, Databricks
SAIS EUROPE - 2019
Abstract
● More data often improves DL models in high variance problem spaces (with semi or
unstructured data) such as NLP, image, video but does not significantly improve high
bias problem spaces where traditional ML is more appropriate.
● That said, in the limits of more data, there still not a reason to go distributed if you
can instead use transfer learning.
● Data scientists have pain points
○ running many models in parallel
○ automating the experimental set up
○ Getting others (especially analysts) within an organization to use their models
● Databricks solves these problems using pandas udfs, ml runtime and mlflow.
Why do single node data science?
Is more data always better?
The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, and Fernando Pereira,
Google 2009
Banko and Brill, 2001
Team A came up with a very sophisticated algorithm using the Netflix
data. Team B used a very simple algorithm, but they added in
additional data beyond the Netflix set: information about movie genres
from the Internet Movie Database (IMDB). Guess which team did
better?
Team B got much better results, close to the best results on the Netflix
leaderboard! I'm really happy for them, and they're going to tune their
algorithm and take a crack at the grand prize.
But the bigger point is, adding more, independent data usually
beats out designing ever-better algorithms to analyze an existing
data set. I'm often surprised that many people in the business, and
even in academia, don't realize this. - "More data usually beats better algorithms", Anand
Rajaraman
Okay, case closed. Right?
Not so fast!
● First Norvig, Banko & Brill were focused on NLP specifically, not general ML.
● Second, Norvig et al were not saying that ML models trained on more data are better
than ML models trained on less data. Norvig was comparing ML (statistical
learning) to hand coded language rules in NLP: “...a deep approach that relies on
hand-coded grammars and ontologies, represented as complex networks of relations;”
(Norvig, 2009)
● In his 2010 UBC Department of Computer Science's Distinguished Lecture Series
talk, Norvig mentions that regarding NLP, google is quickly hitting their limits
with simple models and is needing to focus on more complex models (deep
learning).
● Re: the Netflix data set: The winning team actually then commented and said
Our experience with the Netflix data is different.
IMDB data and the likes gives more information only about the movies, not about the users ... The test set is
dominated by heavily rated movies (but by sparsely rating users), thus we don't really need this extra
information about the movies.
Our experiments clearly show that once you have strong CF models, such extra data is redundant
and cannot improve accuracy on the Netflix dataset.
Netflix algorithm in production
Netflix recommendations: beyond the 5 stars, Xavier Amatriain (Netflix)
Oh, and what about the size of those data sets?
● 1 billion word corpus = ~2GB
● Netflix prize data = 700Mb compressed
○ 1.5 GB uncompressed (source)
So where does that leave us?
Bias vs Variance
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Andrew Ng, AI is the New Electricity
Conclusion: more data makes sense for high variance
(semi-structured or unstructured) problem domains like
text and images. Sampling makes sense for high bias
domains such as structured problem domains.
Should we always use more data with deep learning?
No! Transfer learning on smaller data often beats
training nets from scratch on larger data-sets.
Open AI pointed out that while the amount of compute has been a key component of AI
progress, “Massive compute is certainly not a requirement to produce important results.”
(source)
In a benchmark run by our very own Matei Zaharia at Stanford, Fast.ai was able to win both
fastest and cheapest image classification:
Imagenet competition, our results were:
● Fastest on publicly available infrastructure, fastest on GPUs, and fastest on a single
machine (and faster than Intel’s entry that used a cluster of 128 machines!)
● Lowest actual cost (although DAWNBench’s official results didn’t use our actual cost,
as discussed below).Overall, our findings were:
● Algorithmic creativity is more important than bare-metal performance
(source)
Transfer learning models with a small number of training examples can achieve better
performance than models trained from scratch on 100x the data
Introducing state of the art text classification with universal language models,
Jeremy Howard and Sebastian Ruder
Take-away: Even in the case of deep learning, if an
established model exists, it’s better to use transfer
learning on small data then train from scratch on larger
data
So where does databricks fit into this story?
Training models (including hyperparameter search and
cross validation) is embarrassingly parallel
Shift from distributed data to distributed models
Introducing Pandas UDF for PySpark: How to run your native Python code with PySpark, fast.
Data Scientists spend lots of time setting up their
environment
Rule of frequency: automate the things you do the most
Source: xkcd
Maybe...
Source: xkcd
ML runtime
Data Scientists have a difficult time tracking models
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Data Scientists don’t know how to get the rest of the
organization to use their models
Key insight: Most modern enterprise run on SQL not REST
APIs
Use MLflow + spark UDFs to democratize ML within the
org.
See my mlflow deployment example notebook.
Summary
● More data often improves DL models in high variance problem spaces (with semi or
unstructured data) such as NLP, image, video but does not significantly improve high
bias problem spaces where traditional ML is more appropriate.
● That said, in the limits of more data, there still not a reason to go distributed if you
can instead use transfer learning.
● Data scientists have pain points
○ running many models in parallel
○ automating the experimental set up
○ Getting others (especially analysts) within an organization to use their models
● Databricks solves these problems using pandas udfs, ml runtime and mlflow.
● See my paper for more information.
Thank you!
thunder@databricks.com
Ad

More Related Content

What's hot (20)

Ray and Its Growing Ecosystem
Ray and Its Growing EcosystemRay and Its Growing Ecosystem
Ray and Its Growing Ecosystem
Databricks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
SANG WON PARK
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
Databricks
 
"Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow""Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow"
Databricks
 
Seamless End-to-End Production Machine Learning with Seldon and MLflow
 Seamless End-to-End Production Machine Learning with Seldon and MLflow Seamless End-to-End Production Machine Learning with Seldon and MLflow
Seamless End-to-End Production Machine Learning with Seldon and MLflow
Databricks
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at Scale
Databricks
 
Machine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache SparkMachine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache Spark
Databricks
 
Large scale-lm-part1
Large scale-lm-part1Large scale-lm-part1
Large scale-lm-part1
gohyunwoong
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
tutorialvillage
 
mlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecyclemlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecycle
Databricks
 
Productionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model ServingProductionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model Serving
Databricks
 
NoSQL
NoSQLNoSQL
NoSQL
Radu Potop
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
Arvind Devaraj
 
Normalization 방법
Normalization 방법 Normalization 방법
Normalization 방법
홍배 김
 
DataOps introduction : DataOps is not only DevOps applied to data!
DataOps introduction : DataOps is not only DevOps applied to data!DataOps introduction : DataOps is not only DevOps applied to data!
DataOps introduction : DataOps is not only DevOps applied to data!
Adrien Blind
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
ragho
 
Managing the Machine Learning Lifecycle with MLOps
Managing the Machine Learning Lifecycle with MLOpsManaging the Machine Learning Lifecycle with MLOps
Managing the Machine Learning Lifecycle with MLOps
Fatih Baltacı
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
confluent
 
Ray and Its Growing Ecosystem
Ray and Its Growing EcosystemRay and Its Growing Ecosystem
Ray and Its Growing Ecosystem
Databricks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
SANG WON PARK
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
Databricks
 
"Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow""Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow"
Databricks
 
Seamless End-to-End Production Machine Learning with Seldon and MLflow
 Seamless End-to-End Production Machine Learning with Seldon and MLflow Seamless End-to-End Production Machine Learning with Seldon and MLflow
Seamless End-to-End Production Machine Learning with Seldon and MLflow
Databricks
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at Scale
Databricks
 
Machine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache SparkMachine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache Spark
Databricks
 
Large scale-lm-part1
Large scale-lm-part1Large scale-lm-part1
Large scale-lm-part1
gohyunwoong
 
mlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecyclemlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecycle
Databricks
 
Productionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model ServingProductionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model Serving
Databricks
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
Arvind Devaraj
 
Normalization 방법
Normalization 방법 Normalization 방법
Normalization 방법
홍배 김
 
DataOps introduction : DataOps is not only DevOps applied to data!
DataOps introduction : DataOps is not only DevOps applied to data!DataOps introduction : DataOps is not only DevOps applied to data!
DataOps introduction : DataOps is not only DevOps applied to data!
Adrien Blind
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
ragho
 
Managing the Machine Learning Lifecycle with MLOps
Managing the Machine Learning Lifecycle with MLOpsManaging the Machine Learning Lifecycle with MLOps
Managing the Machine Learning Lifecycle with MLOps
Fatih Baltacı
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
confluent
 

Similar to Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas (20)

Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016
Grigory Sapunov
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
Edge AI and Vision Alliance
 
Practical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Practical Artificial Intelligence: Deep Learning Beyond Cats and CarsPractical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Practical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Alexey Rybakov
 
deep-learning-ppt-full-notes.pptx presen
deep-learning-ppt-full-notes.pptx presendeep-learning-ppt-full-notes.pptx presen
deep-learning-ppt-full-notes.pptx presen
RamakanthChhaparwal
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
MLconf
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
Büşra İçöz
 
Deep learning with tensorflow
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflow
Charmi Chokshi
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde..."Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
Edge AI and Vision Alliance
 
Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017
Clarisse Hedglin
 
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Yves Peirsman
 
Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019 Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geißler
 
Data science presentation
Data science presentationData science presentation
Data science presentation
MSDEVMTL
 
Deep learning on mobile
Deep learning on mobileDeep learning on mobile
Deep learning on mobile
Anirudh Koul
 
building intelligent systems with large scale deep learning
building intelligent systems with large scale deep learningbuilding intelligent systems with large scale deep learning
building intelligent systems with large scale deep learning
mustafa sarac
 
Deep neural networks and tabular data
Deep neural networks and tabular dataDeep neural networks and tabular data
Deep neural networks and tabular data
JimmyLiang20
 
Deep learning for NLP
Deep learning for NLPDeep learning for NLP
Deep learning for NLP
Shishir Choudhary
 
Roman Kyslyi: Синтетичні дані – стратегії, використання (UA)
Roman Kyslyi: Синтетичні дані – стратегії, використання (UA)Roman Kyslyi: Синтетичні дані – стратегії, використання (UA)
Roman Kyslyi: Синтетичні дані – стратегії, використання (UA)
Lviv Startup Club
 
Neuromation.io AI Ukraine Presentation
Neuromation.io AI Ukraine PresentationNeuromation.io AI Ukraine Presentation
Neuromation.io AI Ukraine Presentation
Bohdan Klimenko
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera, Inc.
 
Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016
Grigory Sapunov
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
Edge AI and Vision Alliance
 
Practical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Practical Artificial Intelligence: Deep Learning Beyond Cats and CarsPractical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Practical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Alexey Rybakov
 
deep-learning-ppt-full-notes.pptx presen
deep-learning-ppt-full-notes.pptx presendeep-learning-ppt-full-notes.pptx presen
deep-learning-ppt-full-notes.pptx presen
RamakanthChhaparwal
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
MLconf
 
Deep learning with tensorflow
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflow
Charmi Chokshi
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde..."Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
Edge AI and Vision Alliance
 
Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017
Clarisse Hedglin
 
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Yves Peirsman
 
Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019 Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geißler
 
Data science presentation
Data science presentationData science presentation
Data science presentation
MSDEVMTL
 
Deep learning on mobile
Deep learning on mobileDeep learning on mobile
Deep learning on mobile
Anirudh Koul
 
building intelligent systems with large scale deep learning
building intelligent systems with large scale deep learningbuilding intelligent systems with large scale deep learning
building intelligent systems with large scale deep learning
mustafa sarac
 
Deep neural networks and tabular data
Deep neural networks and tabular dataDeep neural networks and tabular data
Deep neural networks and tabular data
JimmyLiang20
 
Roman Kyslyi: Синтетичні дані – стратегії, використання (UA)
Roman Kyslyi: Синтетичні дані – стратегії, використання (UA)Roman Kyslyi: Синтетичні дані – стратегії, використання (UA)
Roman Kyslyi: Синтетичні дані – стратегії, використання (UA)
Lviv Startup Club
 
Neuromation.io AI Ukraine Presentation
Neuromation.io AI Ukraine PresentationNeuromation.io AI Ukraine Presentation
Neuromation.io AI Ukraine Presentation
Bohdan Klimenko
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera, Inc.
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Decision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdfDecision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdf
Saikat Basu
 
Volkswagen - Analyzing the World's Biggest Purchasing Process
Volkswagen - Analyzing the World's Biggest Purchasing ProcessVolkswagen - Analyzing the World's Biggest Purchasing Process
Volkswagen - Analyzing the World's Biggest Purchasing Process
Process mining Evangelist
 
Principles of information security ch02_2.ppt
Principles of information security ch02_2.pptPrinciples of information security ch02_2.ppt
Principles of information security ch02_2.ppt
EstherBaguma
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136
illuminati Agent uganda call+256776963507/0741506136
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Adopting Process Mining at the Rabobank - use case
Adopting Process Mining at the Rabobank - use caseAdopting Process Mining at the Rabobank - use case
Adopting Process Mining at the Rabobank - use case
Process mining Evangelist
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
Call illuminati Agent in uganda+256776963507/0741506136
Call illuminati Agent in uganda+256776963507/0741506136Call illuminati Agent in uganda+256776963507/0741506136
Call illuminati Agent in uganda+256776963507/0741506136
illuminati Agent uganda call+256776963507/0741506136
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Suncorp - Integrating Process Mining at Australia's Largest Insurer
Suncorp - Integrating Process Mining at Australia's Largest InsurerSuncorp - Integrating Process Mining at Australia's Largest Insurer
Suncorp - Integrating Process Mining at Australia's Largest Insurer
Process mining Evangelist
 
real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682
way to join real illuminati Agent In Kampala Call/WhatsApp+256782561496/0756664682
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
Impact_of_AI_on_Everyday_Life and info.pptx
Impact_of_AI_on_Everyday_Life and info.pptxImpact_of_AI_on_Everyday_Life and info.pptx
Impact_of_AI_on_Everyday_Life and info.pptx
swatibhusari5
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Deloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining ProjectsDeloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining Projects
Process mining Evangelist
 
717239550-Hotel-Management-Ppt-Final.pptx
717239550-Hotel-Management-Ppt-Final.pptx717239550-Hotel-Management-Ppt-Final.pptx
717239550-Hotel-Management-Ppt-Final.pptx
dharmendrasingh31102
 
Decision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdfDecision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdf
Saikat Basu
 
Volkswagen - Analyzing the World's Biggest Purchasing Process
Volkswagen - Analyzing the World's Biggest Purchasing ProcessVolkswagen - Analyzing the World's Biggest Purchasing Process
Volkswagen - Analyzing the World's Biggest Purchasing Process
Process mining Evangelist
 
Principles of information security ch02_2.ppt
Principles of information security ch02_2.pptPrinciples of information security ch02_2.ppt
Principles of information security ch02_2.ppt
EstherBaguma
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Adopting Process Mining at the Rabobank - use case
Adopting Process Mining at the Rabobank - use caseAdopting Process Mining at the Rabobank - use case
Adopting Process Mining at the Rabobank - use case
Process mining Evangelist
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Suncorp - Integrating Process Mining at Australia's Largest Insurer
Suncorp - Integrating Process Mining at Australia's Largest InsurerSuncorp - Integrating Process Mining at Australia's Largest Insurer
Suncorp - Integrating Process Mining at Australia's Largest Insurer
Process mining Evangelist
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
Impact_of_AI_on_Everyday_Life and info.pptx
Impact_of_AI_on_Everyday_Life and info.pptxImpact_of_AI_on_Everyday_Life and info.pptx
Impact_of_AI_on_Everyday_Life and info.pptx
swatibhusari5
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Deloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining ProjectsDeloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining Projects
Process mining Evangelist
 
717239550-Hotel-Management-Ppt-Final.pptx
717239550-Hotel-Management-Ppt-Final.pptx717239550-Hotel-Management-Ppt-Final.pptx
717239550-Hotel-Management-Ppt-Final.pptx
dharmendrasingh31102
 

Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas

  • 1. Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas Thunder Shiviah [email protected] Senior Solutions Architect, Databricks SAIS EUROPE - 2019
  • 2. Abstract ● More data often improves DL models in high variance problem spaces (with semi or unstructured data) such as NLP, image, video but does not significantly improve high bias problem spaces where traditional ML is more appropriate. ● That said, in the limits of more data, there still not a reason to go distributed if you can instead use transfer learning. ● Data scientists have pain points ○ running many models in parallel ○ automating the experimental set up ○ Getting others (especially analysts) within an organization to use their models ● Databricks solves these problems using pandas udfs, ml runtime and mlflow.
  • 3. Why do single node data science?
  • 4. Is more data always better? The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, and Fernando Pereira, Google 2009
  • 6. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better? Team B got much better results, close to the best results on the Netflix leaderboard! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize. But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this. - "More data usually beats better algorithms", Anand Rajaraman
  • 9. ● First Norvig, Banko & Brill were focused on NLP specifically, not general ML. ● Second, Norvig et al were not saying that ML models trained on more data are better than ML models trained on less data. Norvig was comparing ML (statistical learning) to hand coded language rules in NLP: “...a deep approach that relies on hand-coded grammars and ontologies, represented as complex networks of relations;” (Norvig, 2009) ● In his 2010 UBC Department of Computer Science's Distinguished Lecture Series talk, Norvig mentions that regarding NLP, google is quickly hitting their limits with simple models and is needing to focus on more complex models (deep learning). ● Re: the Netflix data set: The winning team actually then commented and said Our experience with the Netflix data is different. IMDB data and the likes gives more information only about the movies, not about the users ... The test set is dominated by heavily rated movies (but by sparsely rating users), thus we don't really need this extra information about the movies. Our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset.
  • 10. Netflix algorithm in production Netflix recommendations: beyond the 5 stars, Xavier Amatriain (Netflix)
  • 11. Oh, and what about the size of those data sets? ● 1 billion word corpus = ~2GB ● Netflix prize data = 700Mb compressed ○ 1.5 GB uncompressed (source)
  • 12. So where does that leave us?
  • 15. Andrew Ng, AI is the New Electricity
  • 16. Conclusion: more data makes sense for high variance (semi-structured or unstructured) problem domains like text and images. Sampling makes sense for high bias domains such as structured problem domains.
  • 17. Should we always use more data with deep learning?
  • 18. No! Transfer learning on smaller data often beats training nets from scratch on larger data-sets. Open AI pointed out that while the amount of compute has been a key component of AI progress, “Massive compute is certainly not a requirement to produce important results.” (source)
  • 19. In a benchmark run by our very own Matei Zaharia at Stanford, Fast.ai was able to win both fastest and cheapest image classification: Imagenet competition, our results were: ● Fastest on publicly available infrastructure, fastest on GPUs, and fastest on a single machine (and faster than Intel’s entry that used a cluster of 128 machines!) ● Lowest actual cost (although DAWNBench’s official results didn’t use our actual cost, as discussed below).Overall, our findings were: ● Algorithmic creativity is more important than bare-metal performance (source)
  • 20. Transfer learning models with a small number of training examples can achieve better performance than models trained from scratch on 100x the data Introducing state of the art text classification with universal language models, Jeremy Howard and Sebastian Ruder
  • 21. Take-away: Even in the case of deep learning, if an established model exists, it’s better to use transfer learning on small data then train from scratch on larger data
  • 22. So where does databricks fit into this story?
  • 23. Training models (including hyperparameter search and cross validation) is embarrassingly parallel
  • 24. Shift from distributed data to distributed models Introducing Pandas UDF for PySpark: How to run your native Python code with PySpark, fast.
  • 25. Data Scientists spend lots of time setting up their environment
  • 26. Rule of frequency: automate the things you do the most Source: xkcd
  • 29. Data Scientists have a difficult time tracking models
  • 31. Data Scientists don’t know how to get the rest of the organization to use their models
  • 32. Key insight: Most modern enterprise run on SQL not REST APIs
  • 33. Use MLflow + spark UDFs to democratize ML within the org. See my mlflow deployment example notebook.
  • 34. Summary ● More data often improves DL models in high variance problem spaces (with semi or unstructured data) such as NLP, image, video but does not significantly improve high bias problem spaces where traditional ML is more appropriate. ● That said, in the limits of more data, there still not a reason to go distributed if you can instead use transfer learning. ● Data scientists have pain points ○ running many models in parallel ○ automating the experimental set up ○ Getting others (especially analysts) within an organization to use their models ● Databricks solves these problems using pandas udfs, ml runtime and mlflow. ● See my paper for more information.