
ML Testing

Release Engineering for Machine Learning Applications


(REMLA, CS4295)

Luís Cruz
[email protected]

Sebastian Proksch
[email protected]
REMLA 2022
Outline

• ML Testing Landscape
• What to test?
• How to test?
• Mutamorphic Testing
• Tools and Resources

2
ML Testing Landscape

https://arxiv.org/abs/1906.10742

3
ML Testing
Zhang et al. (2019)

• Definition 1 – ML Bug: any imperfection in a machine learning item that causes a
discordance between the existing and the required conditions.

• Definition 2 – ML Testing: any activities designed to reveal machine learning bugs.

4
ML Testing
Publications during
2007–2019
(⚠ Cumulative plot)

Zhang et al. (2019)


5
Machine Learning Testing – Ways of Organising Related Work
Zhang et al. (2019)

[Figure: taxonomy used by the survey]
• Introduction of MLT: Definition; Comparison with software testing; Preliminary of Machine Learning
• Contents of Machine Learning Testing: How to test (MLT workflow); Where & What to test (MLT components); Why to test (MLT properties)
• Testing workflow (how to test): Test input generation; Test oracle generation; Test adequacy evaluation; Test prioritisation and reduction; Bug report analysis; Debug and repair
• Testing components (where & what to test): Data testing; Learning program testing; Framework testing
• Testing properties (why to test): Correctness; Model Relevance; Robustness & Security; Efficiency; Fairness; Interpretability; Privacy
• Application Scenario: Autonomous driving; Machine translation; Natural language inference
• Analysis and Comparison: Machine learning categories; Machine learning properties; Data sets; Open-source tools
• Research Direction

6
Automated ML testing

• Ensure that all executions are exactly the same, in the same environment
(no more “it works on my machine, I swear!”).

• Run tests faster: if you have a thousand models in your pipeline,
running them one by one is not the best way to spend your time.

• Easier debugging: detect a model’s malfunction earlier and avoid deploying it
into production.

7
[Figure: tests apply at every stage of the pipeline]
Tests for… Data Quality, Data Cleaning, Feature Extraction, Modelling, Trained Model, Model Monitoring

8
What should we test?

9
ML Test Score

https://research.google/pubs/pub45742/

10
ML Test Score

• 4 main angles:
• Tests for features and data.
• Tests for model development.
• Tests for ML infrastructure.
• Monitoring tests for ML.

11
⚠ Disclaimer
Some tests are not covered here, but they are no less important.
Check the paper for the full list.
Test that the distributions of each feature
match your expectations.

# Tests for Features and Data


13
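
A minimal pytest sketch of such a distribution check is shown below; the file path, column names, and expected bounds are illustrative assumptions, not prescribed values.

./tests/test_feature_distributions.py

import pandas
import pytest

@pytest.fixture()
def df():
    yield pandas.read_csv("train_data.csv")  # path is illustrative

def test_age_distribution(df):
    # 'age' should stay within plausible bounds and keep a stable central tendency.
    assert df['age'].between(0, 120).all()
    assert 30 <= df['age'].mean() <= 50   # expected band is an assumption
    assert df['age'].std() <= 25

def test_label_balance(df):
    # The positive-class rate should stay within a historically observed band.
    positive_rate = (df['label'] == 1).mean()
    assert 0.05 <= positive_rate <= 0.30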
Test the relationship between each feature
and the target, and the pairwise correlations
between individual signals.

# Tests for Features and Data


14
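
A sketch of how this could look with pytest and pandas; the feature names and thresholds are illustrative assumptions.

./tests/test_feature_correlations.py

import pandas
import pytest

@pytest.fixture()
def df():
    yield pandas.read_csv("train_data.csv")  # path is illustrative

def test_feature_target_correlation(df):
    # A feature we rely on should carry some signal about the target.
    assert abs(df['income'].corr(df['label'])) >= 0.1

def test_no_redundant_feature_pairs(df):
    # Near-duplicate features add cost without adding information.
    corr = df[['income', 'age', 'height']].corr().abs()
    for a in corr.columns:
        for b in corr.columns:
            if a != b:
                assert corr.loc[a, b] <= 0.95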
Test the cost of each feature.
• Latency
• Memory usage
• More upstream data dependencies
• Additional instability

# Tests for Features and Data


15
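
Latency is the easiest of these costs to test directly. A sketch, assuming a hypothetical extract_geo_features helper and an illustrative budget:

./tests/test_feature_cost.py

import time
import pandas
import pytest

from src.features import extract_geo_features  # hypothetical feature-extraction helper

@pytest.fixture()
def raw_batch():
    yield pandas.read_csv("raw_sample.csv").head(1000)  # path and batch size are illustrative

def test_geo_feature_latency(raw_batch):
    # The feature must be computable within the serving latency budget.
    start = time.perf_counter()
    extract_geo_features(raw_batch)
    elapsed = time.perf_counter() - start
    assert elapsed < 0.5  # budget in seconds; threshold is an assumption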
Test that a model does not contain any
features that have been manually determined
as unsuitable for use.

# Tests for Features and Data


16
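
A sketch using a denylist; the feature names are illustrative, and feature_names_in_ assumes a scikit-learn (>= 1.0) estimator that was fitted on a DataFrame.

./tests/test_forbidden_features.py

import joblib
import pytest

FORBIDDEN_FEATURES = {"gender", "ethnicity", "deprecated_score"}  # illustrative denylist

@pytest.fixture()
def trained_model():
    yield joblib.load('trained_model.sav')

def test_no_forbidden_features(trained_model):
    # scikit-learn (>= 1.0) estimators fitted on a DataFrame expose the input feature names.
    used = set(trained_model.feature_names_in_)
    assert used.isdisjoint(FORBIDDEN_FEATURES)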
Test that your system maintains privacy
controls across its entire data pipeline.
(not only in raw data but also in intermediate stages)

# Tests for Features and Data


17
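
One cheap automated check is asserting that PII columns never reach intermediate artifacts. The stage paths and column names below are illustrative assumptions.

./tests/test_privacy_controls.py

import pandas
import pytest

PII_COLUMNS = {"email", "phone", "ssn"}                          # illustrative PII set
INTERMEDIATE_FILES = ["data/cleaned.csv", "data/features.csv"]   # hypothetical pipeline stages

@pytest.mark.parametrize("path", INTERMEDIATE_FILES)
def test_no_pii_in_intermediate_data(path):
    df = pandas.read_csv(path)
    assert PII_COLUMNS.isdisjoint(df.columns)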
Test all code that creates input features, both
in training and serving.
E.g., methods used to clean date formats; methods used to remove stop words.

# Tests for Features and Data


18
Test that every model specification undergoes
a code review and is checked in to a
repository
mllint might be useful here

# Tests for Model Development


19
Theory vs. Practice

Test the relationship between offline proxy
metrics and the actual impact metrics.
For example, how does a 1% improvement in accuracy metrics translate
into effects on business metrics (e.g., user satisfaction)?

# Tests for Model Development


20
Test the impact of each tunable
hyperparameter.
What’s the oracle?

# Tests for Model Development


21
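
One possible oracle is that the value used in production is not clearly dominated by simpler settings. A sketch, assuming the project's train_model and evaluate_score helpers; the candidate values and tolerance are illustrative.

./tests/test_hyperparameters.py

from src.train_model import train_model, evaluate_score  # assumed project helpers

def test_max_depth_impact():
    # The production setting should not be clearly dominated by a simpler one.
    production_score = evaluate_score(train_model(max_depth=8))
    for candidate in [2, 4, 16]:
        candidate_score = evaluate_score(train_model(max_depth=candidate))
        assert production_score >= candidate_score - 0.02  # tolerance is an assumption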
Test the effect of model staleness.

# Tests for Model Development

Figure credits: Clemens Mewald, 2018

22
Test against a simpler model as a baseline

# Tests for Model Development


23
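
A sketch using scikit-learn's DummyClassifier as the trivial baseline, assuming the trained model is a scikit-learn classifier; paths, the label column, and the margin are illustrative assumptions.

./tests/test_baseline.py

import joblib
import pandas
import pytest
from sklearn.dummy import DummyClassifier

@pytest.fixture()
def test_data():
    df = pandas.read_csv("test_data.csv")           # path is illustrative
    yield df.drop(columns=['label']), df['label']

def test_beats_trivial_baseline(test_data):
    X, y = test_data
    model = joblib.load('trained_model.sav')
    baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
    # The trained model should clearly outperform predicting the majority class.
    assert model.score(X, y) >= baseline.score(X, y) + 0.05  # margin is an assumption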
Test model quality on important data slices.

# Tests for Model Development


24
Test the model for implicit bias.

# Tests for Model Development


25
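
A sketch that compares model quality across groups of a sensitive attribute; the attribute, the evaluate_score helper, and the tolerance are illustrative assumptions.

./tests/test_bias.py

import joblib
import pandas
import pytest

from src.train_model import evaluate_score  # assumed project helper

@pytest.fixture()
def trained_model():
    yield joblib.load('trained_model.sav')

@pytest.fixture()
def test_data():
    yield pandas.read_csv("test_data.csv")

def test_score_parity_across_groups(trained_model, test_data):
    # Model quality should not differ sharply between demographic groups.
    scores = [
        evaluate_score(trained_model, test_data[test_data['gender'] == g])
        for g in test_data['gender'].unique()
    ]
    assert max(scores) - min(scores) <= 0.05  # tolerance is an assumption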
Test the reproducibility of training.
Test non-determinism robustness.

# Tests for ML Infrastructure


# Tests for Model Development
26
Integration test the full ML pipeline.

# Tests for ML Infrastructure


27
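
A sketch of an end-to-end check on a tiny sample; run_pipeline is a hypothetical entry point that chains cleaning, feature extraction, training, and evaluation, and the output names are assumptions.

./tests/test_pipeline_integration.py

from src.pipeline import run_pipeline  # hypothetical pipeline entry point

def test_full_pipeline_on_small_sample(tmp_path):
    # Run the whole pipeline on a tiny fixture so the test stays fast.
    result = run_pipeline(data_path="tests/fixtures/sample.csv", output_dir=tmp_path)
    assert (tmp_path / "model.sav").exists()
    assert result["test_score"] > 0.5  # sanity threshold, not a quality bar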
Test models via a canary process before they
enter production serving environments.
Example: A/B testing

# Tests for ML Infrastructure


28
Test how quickly and safely a model can be
rolled back to a previous serving version.

# Tests for ML Infrastructure


29
Test that data invariants hold in training and
serving inputs.
E.g., shape of distributions of features should be the same in training data and serving data.

# Monitoring Tests for ML


30
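
A hand-rolled sketch of such an invariant check; in practice TFDV (see "Useful tools") automates schema and drift validation. Paths, the chosen feature, and the tolerance are illustrative assumptions.

./tests/test_data_invariants.py

import pandas
import pytest

@pytest.fixture()
def train_df():
    yield pandas.read_csv("train_data.csv")        # path is illustrative

@pytest.fixture()
def serving_df():
    yield pandas.read_csv("serving_sample.csv")    # path is illustrative

def test_schema_matches(train_df, serving_df):
    # Serving data should expose the same features as training data (minus the label).
    assert list(train_df.columns.drop('label')) == list(serving_df.columns)

def test_feature_distribution_shift(train_df, serving_df):
    # The mean of a key feature should not drift too far between training and serving.
    drift = abs(train_df['age'].mean() - serving_df['age'].mean())
    assert drift <= 0.1 * train_df['age'].std()    # tolerance is an assumption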
Test for model staleness.

# Monitoring Tests for ML


31
Test for dramatic or slow-leak regressions in
training speed, serving latency, throughput, or
RAM usage.

# Monitoring Tests for ML


32
Test for regressions in prediction quality on
served data.

# Monitoring Tests for ML


33
Final score
ML test score
• 1 point. If you do the test manually.
• 2 points. If you do the test automatically. ⭐
• Meaning of the final score sum:
• 0 points: More of a prototype project than a productionized system.
• 1-2 points: Not totally untested, but it is worth considering the
possibility of serious holes in reliability.
• 3-4 points: There’s basic productionization, but additional
investment may be needed.
• 5-6 points: Reasonably tested, but it’s possible that more of those
tests and procedures may be automated.
• 7-10 points: Strong levels of automated testing and monitoring,
appropriate for critical systems.
• 12+ points: Exceptional levels of automated testing and monitoring.

34
But really, how should we test?
(a few basic examples)

❌✔❌❌✔✔
35
PyTest – basic example

Project structure

/my_project_folder
    /src
        train_model.py
    /tests
        test_trained_model.py
    ...

./tests/test_trained_model.py

import joblib  # sklearn.externals.joblib has been removed; use joblib directly

def test_something_in_the_model():
    model = joblib.load('trained_dummy_model.sav')
    # ...
    assert ...

Run the tests

$ pytest

36
Duplicates
Unit test

./tests/test_data_cleaning.py

import pandas
import pytest

@pytest.fixture()
def df():
    df = pandas.read_csv("data.csv")  # path is illustrative
    yield df

def test_no_duplicates(df):
    # Every id appears exactly once, and (date, id) pairs are unique.
    assert len(df['id'].unique()) == df.shape[0]
    assert df.groupby(['date', 'id']).size().max() == 1

37
Preprocess methods
Unit test

./tests/test_data_cleaning.py

import pandas
import pytest

from src.data_cleaning import preprocess_missing_name, preprocess_city  # assumed project module

@pytest.fixture()
def df():
    df = pandas.read_csv("data.csv")  # path is illustrative
    yield df

def test_preprocess_missing_name(df):
    assert preprocess_missing_name("10019\n") is None

def test_preprocess_city(df):
    assert preprocess_city("amsterdam") == "Amsterdam"
    assert preprocess_city("AMS") == "Amsterdam"
    assert preprocess_city(" Amsterdam ") == "Amsterdam"

38
Value ranges
Unit test

./tests/test_data_cleaning.py

import pandas
import pytest

@pytest.fixture()
def df():
    df = pandas.read_csv("data.csv")  # path is illustrative
    yield df

def test_value_ranges(df):
    assert all(df['percentage'] <= 1)
    # Summing per group yields a Series, so wrap the comparison in all().
    assert all(df.groupby('name')['budget'].sum() <= 1000)
    assert all(df['height'] >= 0)

39
Test Non-determinism Robustness
Model Validation tests

• Performance stability when using different random seeds.

• If a model is performant, it should have little dependency on random
variance.

• Make seed an attribute in the pipeline; test different seeds;
assert for low variability.

40
Test Non-determinism Robustness
Unit test
./tests/test_data_cleaning.py

import joblib
import pytest

from src.train_model import train_model, evaluate_score  # assumed project module

@pytest.fixture()
def trained_model():
    trained_model = joblib.load('trained_model.sav')
    yield trained_model

def test_nondeterminism_robustness(trained_model):
    original_score = evaluate_score(trained_model)  # e.g., a score between 0 and 1
    for seed in [1, 2]:
        model_variant = train_model(random_state=seed)
        assert abs(original_score - evaluate_score(model_variant)) <= 0.03

41
Test Noise Robustness

• 1. What happens to the performance when we change a few training
data points?

• 2. What happens to the performance when we add acceptable noise
to test data points? (E.g., add typos.) A sketch of this second check follows below.

[Figure: a few noisy data points should not completely change the model]

42
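
A sketch of the noise-robustness check; add_typos is a hypothetical perturbation helper, and evaluate_score an assumed project function. Column name and tolerance are illustrative.

./tests/test_noise_robustness.py

import joblib
import pandas
import pytest

from src.train_model import evaluate_score   # assumed project helper
from src.perturbations import add_typos      # hypothetical noise generator

@pytest.fixture()
def trained_model():
    yield joblib.load('trained_model.sav')

@pytest.fixture()
def test_data():
    yield pandas.read_csv("test_data.csv")

def test_typo_robustness(trained_model, test_data):
    clean_score = evaluate_score(trained_model, test_data)
    noisy_data = test_data.copy()
    noisy_data['description'] = noisy_data['description'].apply(add_typos)
    noisy_score = evaluate_score(trained_model, noisy_data)
    # Acceptable noise should only cause a small drop in performance.
    assert clean_score - noisy_score <= 0.05  # tolerance is an assumption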
Test model quality on important data slices
Model validation tests

⚠ Warning! This test is highly dependent on the problem.

./tests/test_data_slice.py

import joblib
import pandas
import pytest

from src.train_model import evaluate_score  # assumed project module

@pytest.fixture()
def trained_model():
    trained_model = joblib.load('trained_model.sav')
    yield trained_model

@pytest.fixture()
def test_data():
    test_data = pandas.read_csv("test_data.csv")
    yield test_data

def test_data_slice(trained_model, test_data):
    original_score = evaluate_score(trained_model, test_data)
    sliced_data = test_data[test_data['city'] == 'Delft']
    sliced_score = evaluate_score(trained_model, sliced_data)
    # Compare the two scores: important slices should stay close to the overall score.
    assert abs(original_score - sliced_score) <= 0.05

43
What else?

• A lot of work is yet to be done in this area:
  • There is not much documentation around this topic.
  • What to test? Practitioners are still looking for testing best practices.

? 44
Mutamorphic testing and repair

https://arxiv.org/abs/1910.02688
45
Mutamorphic testing

• Metamorphic testing + mutation

• Metamorphic testing: derive new test cases based on properties of the
existing ones.
E.g., commutative property:
assertEqual(add(1, 2), 3) => assertEqual(add(2, 1), 3)

• Mutamorphic: the new test cases are not based on exactly-preserving (safe)
properties, but on approximate relations (e.g., context-similar words).

• Black-/grey-box testing.
No access to the model or its training codebase.

• Implemented for ML-based translators.
“It is okay” and “It is fine” have a mutamorphic relationship
(context-similar).
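
A sketch of the mutamorphic idea for a translator; translate() and similarity() are hypothetical wrappers around the system under test and a sentence-similarity metric, and the threshold is an assumption.

./tests/test_translation_mutamorphic.py

from src.translator import translate     # hypothetical black-box wrapper around the translator
from src.similarity import similarity    # hypothetical similarity metric in [0, 1]

def test_context_similar_replacement():
    original = "It is okay"
    mutant = "It is fine"                 # one word replaced by a context-similar word
    t_original = translate(original, target_lang="nl")
    t_mutant = translate(mutant, target_lang="nl")
    # The two translations should stay structurally close; a large divergence
    # is reported as an inconsistency (a failing test).
    assert similarity(t_original, t_mutant) >= 0.8  # threshold is an assumption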
Mutamorphic Testing

1. Automatic Test Input Generation: generate sentences by replacing one word
with a context-similar word (mutation).

2. Automatic Test Oracle Generation: when the translation of the mutant and
of the original sentence are fairly different, we have a failing test.

3. Automatic Inconsistency Repair: find a mutant sentence with a translation
that we can use to replace the original translation.
(Only works for translations with similar structure and word types.)

❌✔❌❌✔✔ → ✔✔✔✔✔✔

47
Mutamorphic Testing

1. Automatic Test Input Generation   2. Automatic Test Oracle Generation   3. Automatic Inconsistency Repair

48
Useful tools

• TFDV. https://github.com/tensorflow/data-validation

• mllint. https://github.com/bvobart/mllint

49
Wrap-up

50
Final Project

• What should you take from this class to the final project?
