SlideShare a Scribd company logo
Re-imagine Data Monitoring
with whylogs and Apache
Spark
Andy Dang
Co-Founder & Lead Engineer, WhyLabs
Outline
ML Data Challenges
How traditional data analysis techniques fail ML
data pipelines
Lightweight Profiling for Big ML
Data
Profiling techniques for detecting data quality
problems
The Open Source whylogs Library
Building the standard for data logging
2
Source: Google Cloud AI
3
ML Lifecycle
Issues encountered in production (small sample)...
...or it simply doesn’t work, and nobody know why...
● Experiment/production
environment mismatch
● Wrong model version deployed
● Underprovisioned hardware
● Inappropriate hardware
● Latency/SLA issues
● Data permissions misconfigured
● Untracked changes broke prod
● Traffic sent to the wrong model
● Computational instability
● Customers gaming the model
(adversarial attacks)
● PII data exposed
● Expected accuracy doesn’t
materialize
● Pre-processing mismatch in
experiments vs. production
● Retrained on faulty data
● Accuracy improves on one
segment, regresses in others
● Outliers predicted incorrectly
● Bias identified
● Correlation with protected
features
● Overfitting on training/test
● Surge in missing values
● Surge in duplicates
● Poor performance on new
categories
● Poor performance on new
customer segments
● Poor performance on outliers
● Data quality issues affect
accuracy
● Production data doesn’t match
test/training
● Accuracy is decaying over time
● Data drift in inputs
● Concept drift in outputs
● Extreme predictions for out of
distribution data
● Model not generalizing on new
data / new segments
● Major customer behavior shift
4
Issues encountered in production (small sample)...
issues caused by data
● Experiment/production
environment mismatch
● Wrong model version deployed
● Underprovisioned hardware
● Inappropriate hardware
● Latency/SLA issues
● Data permissions misconfigured
● Untracked changes broke prod
● Traffic sent to the wrong model
● Computational instability
● Customers gaming the model
(adversarial attacks)
● PII data exposed
● Expected accuracy doesn’t
materialize
● Pre-processing mismatch in
experiments vs. production
● Retrained on faulty data
● Accuracy improves on one
segment, regresses in others
● Outliers predicted incorrectly
● Bias identified
● Correlation with protected
features
● Overfitting on training/test
● Surge in missing values
● Surge in duplicates
● Poor performance on new
categories
● Poor performance on new
customer segments
● Poor performance on outliers
● Data quality issues affect
accuracy
● Production data doesn’t match
test/training
● Accuracy is decaying over time
● Data drift in inputs
● Concept drift in outputs
● Extreme predictions for out of
distribution data
● Model not generalizing on new
data / new segments
● Major customer behavior shift
5
Data Logs
Model Metadata
Pipeline Metadata
i.e. data profiling
Data profiling refers to the analysis of information [...] in order to clarify the
structure, content, relationships, and derivation rules of the data [Wikipedia]
6
Data monitoring starts with logging
7
Sampling Profiling
Pros
● Easy to build
● Little upfront design
● Log & raw data analysis identical
● Scalable & lightweight
● Flexible & configurable
● Rare events and outlier-dependent metrics
● Directly interpretable results
Cons
● I/O & storage
● Noisy
● Requires statistical analysis
● Rare events & outliers
● Min/max, unique values, etc
● Data dependent output format
● No existing widespread solutions
● Mathematical & engineering challenges
Data logs: sampling vs. profiling
8
Data logs: must be accurate
Median: errors in the estimate of the median for sampling vs profiling for various distributions. Mean
absolute error and mean relative (fractional) absolute error are shown.
9
Data logs: must be scalable
Dataset Size # of entries # of features Memory
consumption
Output size
Lending Club 1.6G 2.2M 151 14MB 7.4MB
NYC Tickets 1.9G 10.8 43 14MB 2.3MB
Pain pills 75GB 178M 42 15MB 2MB
10
Logging ML data at scale
Four key paradigms:
● Approximations rather than exact results
● Lightweight
● Additive
● Batch and streaming support
profile: collection of lightweight metrics that provide these
properties
Lightweight
Old Approach
11
process process process process
Data Warehouse/
Data Lake
Processing Engine
New Approach
process
profiling
process
profiling
process
profiling
process
profiling
Profile Store
Analysis
Only feasible if:
● Profiling is fast
● Profiling is not memory intensive
Additive
12
dataset 1 dataset 2 dataset 3
sort (shuffle)
reduce step
Median
dataset 1
profile 1
dataset 2
profile 2
dataset 3
profile 3
add(profile1, 2, 3)
Estimated Median
Batch and streaming support
13
partition
1
profiling
partition
2
profiling
partition
3
profiling
partition
n
profiling
Spark/Hive
Query Engine
No shuffle!
day 0
profiling
day 1
profiling
day 2
profiling
day 3
profiling
... ...
sum(profiles)
Approximate Statistics
● Using Stochastic Streaming Algorithms
○ Model the problem as a stochastic process
○ Apache Datasketches is the open source implementation
● Statistics that we focus on at the moment:
○ Histograms
○ Frequent items
○ Cardinality
14
whylogs: The Data Logging Library
● Multi-language support: Python + Java
● Support both data engineering and data science
workflows
● Extensibility: image support. Text, video, audio &
embeddings support to come
● Growing integration list:
15
16
whylogs: Python
● A few lines of code to start logging
● Integrate with popular data science libraries
● Out of the box visualization utilities
17
whylogs in Apache Spark
Data Lake
col1
, col2
, …, coln
partition 1
partition 2
partition k
profile
profile
merge (
profile1
,
profile2
…,
profilek
)
global profile
Schema
Metadata
Sketches
profile
Metrics
18
Simple Spark API
19
pySpark support
20
Catch distribution drift in a few lines of code
21
Scalable monitoring at input feature granularity
Monitoring layer for ML applications
22
23
bit.ly/whylogs
andy@whylabs.ai
@andy_dng
24
bit.ly/whylogs
Help build the open
standard for data
logging!
Thank you!
Ad

More Related Content

What's hot (20)

Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Databricks
 
Data Mesh
Data MeshData Mesh
Data Mesh
Piethein Strengholt
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
Denodo
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
Jeffrey T. Pollock
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
Laurent Leturgez
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in Production
Databricks
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen
 
A Practical Enterprise Feature Store on Delta Lake
A Practical Enterprise Feature Store on Delta LakeA Practical Enterprise Feature Store on Delta Lake
A Practical Enterprise Feature Store on Delta Lake
Databricks
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Databricks
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
Denodo
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in Production
Databricks
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen
 
A Practical Enterprise Feature Store on Delta Lake
A Practical Enterprise Feature Store on Delta LakeA Practical Enterprise Feature Store on Delta Lake
A Practical Enterprise Feature Store on Delta Lake
Databricks
 

Similar to Re-imagine Data Monitoring with whylogs and Spark (20)

MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
Provectus
 
SFScon 22 - Fiete Lüer - Heading towards reproducible machine learning resear...
SFScon 22 - Fiete Lüer - Heading towards reproducible machine learning resear...SFScon 22 - Fiete Lüer - Heading towards reproducible machine learning resear...
SFScon 22 - Fiete Lüer - Heading towards reproducible machine learning resear...
South Tyrol Free Software Conference
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
zekeLabs Technologies
 
C2_W1---.pdf
C2_W1---.pdfC2_W1---.pdf
C2_W1---.pdf
Humayun Kabir
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
DataWorks Summit
 
Model selection and tuning at scale
Model selection and tuning at scaleModel selection and tuning at scale
Model selection and tuning at scale
Owen Zhang
 
vodQA Pune (2019) - Testing AI,ML applications
vodQA Pune (2019) - Testing AI,ML applicationsvodQA Pune (2019) - Testing AI,ML applications
vodQA Pune (2019) - Testing AI,ML applications
vodQA
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
Awantik Das
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning
Mikhail Rozhkov
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
Databricks
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
Petr Zapletal
 
MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
sundharakumarkb1
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
Shesha R
 
20240515 - Chicago PUG - Clustering in PostgreSQL: Because one database serve...
20240515 - Chicago PUG - Clustering in PostgreSQL: Because one database serve...20240515 - Chicago PUG - Clustering in PostgreSQL: Because one database serve...
20240515 - Chicago PUG - Clustering in PostgreSQL: Because one database serve...
Umair Shahid
 
20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...
20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...
20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...
Umair Shahid
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
Knoldus Inc.
 
(Faiz) MachineLearning(ppt).pptx
(Faiz) MachineLearning(ppt).pptx(Faiz) MachineLearning(ppt).pptx
(Faiz) MachineLearning(ppt).pptx
Faiz430036
 
The Critical Missing Component in the Production ML Stack
The Critical Missing Component in the Production ML StackThe Critical Missing Component in the Production ML Stack
The Critical Missing Component in the Production ML Stack
Databricks
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
Lars Albertsson
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
Daniel Marcous
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
Provectus
 
SFScon 22 - Fiete Lüer - Heading towards reproducible machine learning resear...
SFScon 22 - Fiete Lüer - Heading towards reproducible machine learning resear...SFScon 22 - Fiete Lüer - Heading towards reproducible machine learning resear...
SFScon 22 - Fiete Lüer - Heading towards reproducible machine learning resear...
South Tyrol Free Software Conference
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
zekeLabs Technologies
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
DataWorks Summit
 
Model selection and tuning at scale
Model selection and tuning at scaleModel selection and tuning at scale
Model selection and tuning at scale
Owen Zhang
 
vodQA Pune (2019) - Testing AI,ML applications
vodQA Pune (2019) - Testing AI,ML applicationsvodQA Pune (2019) - Testing AI,ML applications
vodQA Pune (2019) - Testing AI,ML applications
vodQA
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
Awantik Das
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning
Mikhail Rozhkov
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
Databricks
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
Petr Zapletal
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
Shesha R
 
20240515 - Chicago PUG - Clustering in PostgreSQL: Because one database serve...
20240515 - Chicago PUG - Clustering in PostgreSQL: Because one database serve...20240515 - Chicago PUG - Clustering in PostgreSQL: Because one database serve...
20240515 - Chicago PUG - Clustering in PostgreSQL: Because one database serve...
Umair Shahid
 
20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...
20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...
20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...
Umair Shahid
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
Knoldus Inc.
 
(Faiz) MachineLearning(ppt).pptx
(Faiz) MachineLearning(ppt).pptx(Faiz) MachineLearning(ppt).pptx
(Faiz) MachineLearning(ppt).pptx
Faiz430036
 
The Critical Missing Component in the Production ML Stack
The Critical Missing Component in the Production ML StackThe Critical Missing Component in the Production ML Stack
The Critical Missing Component in the Production ML Stack
Databricks
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
Lars Albertsson
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
Daniel Marcous
 
Ad

More from Databricks (20)

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Databricks
 
Ad

Recently uploaded (20)

Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
Andhra Pradesh Micro Irrigation Project”
Andhra Pradesh Micro Irrigation Project”Andhra Pradesh Micro Irrigation Project”
Andhra Pradesh Micro Irrigation Project”
vzmcareers
 
Introcomputerscienceand datascience.pptx
Introcomputerscienceand datascience.pptxIntrocomputerscienceand datascience.pptx
Introcomputerscienceand datascience.pptx
abdulrehmanbscsf22
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Chromatography_Detailed_Information.docx
Chromatography_Detailed_Information.docxChromatography_Detailed_Information.docx
Chromatography_Detailed_Information.docx
NohaSalah45
 
Induction Program of MTAB online session
Induction Program of MTAB online sessionInduction Program of MTAB online session
Induction Program of MTAB online session
LOHITH886892
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
KNN_Logistic_Regression_Presentation_Styled.pptx
KNN_Logistic_Regression_Presentation_Styled.pptxKNN_Logistic_Regression_Presentation_Styled.pptx
KNN_Logistic_Regression_Presentation_Styled.pptx
sonujha1980712
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
brainstorming-techniques-infographics.pptx
brainstorming-techniques-infographics.pptxbrainstorming-techniques-infographics.pptx
brainstorming-techniques-infographics.pptx
maritzacastro321
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Andhra Pradesh Micro Irrigation Project”
Andhra Pradesh Micro Irrigation Project”Andhra Pradesh Micro Irrigation Project”
Andhra Pradesh Micro Irrigation Project”
vzmcareers
 
Introcomputerscienceand datascience.pptx
Introcomputerscienceand datascience.pptxIntrocomputerscienceand datascience.pptx
Introcomputerscienceand datascience.pptx
abdulrehmanbscsf22
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Chromatography_Detailed_Information.docx
Chromatography_Detailed_Information.docxChromatography_Detailed_Information.docx
Chromatography_Detailed_Information.docx
NohaSalah45
 
Induction Program of MTAB online session
Induction Program of MTAB online sessionInduction Program of MTAB online session
Induction Program of MTAB online session
LOHITH886892
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
KNN_Logistic_Regression_Presentation_Styled.pptx
KNN_Logistic_Regression_Presentation_Styled.pptxKNN_Logistic_Regression_Presentation_Styled.pptx
KNN_Logistic_Regression_Presentation_Styled.pptx
sonujha1980712
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
brainstorming-techniques-infographics.pptx
brainstorming-techniques-infographics.pptxbrainstorming-techniques-infographics.pptx
brainstorming-techniques-infographics.pptx
maritzacastro321
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 

Re-imagine Data Monitoring with whylogs and Spark

  • 1. Re-imagine Data Monitoring with whylogs and Apache Spark Andy Dang Co-Founder & Lead Engineer, WhyLabs
  • 2. Outline ML Data Challenges How traditional data analysis techniques fail ML data pipelines Lightweight Profiling for Big ML Data Profiling techniques for detecting data quality problems The Open Source whylogs Library Building the standard for data logging 2
  • 3. Source: Google Cloud AI 3 ML Lifecycle
  • 4. Issues encountered in production (small sample)... ...or it simply doesn’t work, and nobody know why... ● Experiment/production environment mismatch ● Wrong model version deployed ● Underprovisioned hardware ● Inappropriate hardware ● Latency/SLA issues ● Data permissions misconfigured ● Untracked changes broke prod ● Traffic sent to the wrong model ● Computational instability ● Customers gaming the model (adversarial attacks) ● PII data exposed ● Expected accuracy doesn’t materialize ● Pre-processing mismatch in experiments vs. production ● Retrained on faulty data ● Accuracy improves on one segment, regresses in others ● Outliers predicted incorrectly ● Bias identified ● Correlation with protected features ● Overfitting on training/test ● Surge in missing values ● Surge in duplicates ● Poor performance on new categories ● Poor performance on new customer segments ● Poor performance on outliers ● Data quality issues affect accuracy ● Production data doesn’t match test/training ● Accuracy is decaying over time ● Data drift in inputs ● Concept drift in outputs ● Extreme predictions for out of distribution data ● Model not generalizing on new data / new segments ● Major customer behavior shift 4
  • 5. Issues encountered in production (small sample)... issues caused by data ● Experiment/production environment mismatch ● Wrong model version deployed ● Underprovisioned hardware ● Inappropriate hardware ● Latency/SLA issues ● Data permissions misconfigured ● Untracked changes broke prod ● Traffic sent to the wrong model ● Computational instability ● Customers gaming the model (adversarial attacks) ● PII data exposed ● Expected accuracy doesn’t materialize ● Pre-processing mismatch in experiments vs. production ● Retrained on faulty data ● Accuracy improves on one segment, regresses in others ● Outliers predicted incorrectly ● Bias identified ● Correlation with protected features ● Overfitting on training/test ● Surge in missing values ● Surge in duplicates ● Poor performance on new categories ● Poor performance on new customer segments ● Poor performance on outliers ● Data quality issues affect accuracy ● Production data doesn’t match test/training ● Accuracy is decaying over time ● Data drift in inputs ● Concept drift in outputs ● Extreme predictions for out of distribution data ● Model not generalizing on new data / new segments ● Major customer behavior shift 5
  • 6. Data Logs Model Metadata Pipeline Metadata i.e. data profiling Data profiling refers to the analysis of information [...] in order to clarify the structure, content, relationships, and derivation rules of the data [Wikipedia] 6 Data monitoring starts with logging
  • 7. 7 Sampling Profiling Pros ● Easy to build ● Little upfront design ● Log & raw data analysis identical ● Scalable & lightweight ● Flexible & configurable ● Rare events and outlier-dependent metrics ● Directly interpretable results Cons ● I/O & storage ● Noisy ● Requires statistical analysis ● Rare events & outliers ● Min/max, unique values, etc ● Data dependent output format ● No existing widespread solutions ● Mathematical & engineering challenges Data logs: sampling vs. profiling
  • 8. 8 Data logs: must be accurate Median: errors in the estimate of the median for sampling vs profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown.
  • 9. 9 Data logs: must be scalable Dataset Size # of entries # of features Memory consumption Output size Lending Club 1.6G 2.2M 151 14MB 7.4MB NYC Tickets 1.9G 10.8 43 14MB 2.3MB Pain pills 75GB 178M 42 15MB 2MB
  • 10. 10 Logging ML data at scale Four key paradigms: ● Approximations rather than exact results ● Lightweight ● Additive ● Batch and streaming support profile: collection of lightweight metrics that provide these properties
  • 11. Lightweight Old Approach 11 process process process process Data Warehouse/ Data Lake Processing Engine New Approach process profiling process profiling process profiling process profiling Profile Store Analysis Only feasible if: ● Profiling is fast ● Profiling is not memory intensive
  • 12. Additive 12 dataset 1 dataset 2 dataset 3 sort (shuffle) reduce step Median dataset 1 profile 1 dataset 2 profile 2 dataset 3 profile 3 add(profile1, 2, 3) Estimated Median
  • 13. Batch and streaming support 13 partition 1 profiling partition 2 profiling partition 3 profiling partition n profiling Spark/Hive Query Engine No shuffle! day 0 profiling day 1 profiling day 2 profiling day 3 profiling ... ... sum(profiles)
  • 14. Approximate Statistics ● Using Stochastic Streaming Algorithms ○ Model the problem as a stochastic process ○ Apache Datasketches is the open source implementation ● Statistics that we focus on at the moment: ○ Histograms ○ Frequent items ○ Cardinality 14
  • 15. whylogs: The Data Logging Library ● Multi-language support: Python + Java ● Support both data engineering and data science workflows ● Extensibility: image support. Text, video, audio & embeddings support to come ● Growing integration list: 15
  • 16. 16 whylogs: Python ● A few lines of code to start logging ● Integrate with popular data science libraries ● Out of the box visualization utilities
  • 17. 17 whylogs in Apache Spark Data Lake col1 , col2 , …, coln partition 1 partition 2 partition k profile profile merge ( profile1 , profile2 …, profilek ) global profile Schema Metadata Sketches profile Metrics
  • 20. 20 Catch distribution drift in a few lines of code
  • 21. 21 Scalable monitoring at input feature granularity
  • 22. Monitoring layer for ML applications 22
  • 24. [email protected] @andy_dng 24 bit.ly/whylogs Help build the open standard for data logging! Thank you!