
ML Testing

Release Engineering for Machine Learning Applications


(REMLA, CS4295)

Luís Cruz
[email protected]

Sebastian Proksch
[email protected]
REMLA 2022
Outline

• ML Testing Landscape
• What to test?
• How to test?
• Mutamorphic Testing
• Tools and Resources

2
ML Testing Landscape

https://arxiv.org/abs/1906.10742

3
ML Testing
Zhang et al. (2019)

• Definition 1 – ML Bug: any imperfection in a machine learning item that causes a
discordance between the existing and the required conditions.

• Definition 2 – ML Testing: any activities designed to reveal machine learning bugs.

4
ML Testing
Publications during
2007–2019
(⚠ Cumulative plot)

Zhang et al. (2019)


5
Machine Learning Testing – Ways of Organising Related Work
Zhang et al. (2019)

[Figure: taxonomy used by the survey]
• Introduction of MLT: Definition; Comparison with software testing; Preliminary of Machine Learning
• Contents of Machine Learning Testing: How to test (MLT workflow); Where & What to test (MLT components); Why to test (MLT properties)
• Testing workflow (how to test): Test input generation; Test oracle generation; Test adequacy evaluation; Test prioritisation and reduction; Bug report analysis; Debug and repair
• Testing components (where & what to test): Data testing; Learning program testing; Framework testing
• Testing properties (why to test): Correctness; Model Relevance; Robustness & Security; Efficiency; Fairness; Interpretability; Privacy
• Application Scenario: Autonomous driving; Machine translation; Natural language inference
• Analysis and Comparison: Machine learning categories; Machine learning properties; Data sets; Open-source tools
• Research Direction

6
Automated ML testing

• Ensure that all executions are exactly the same, in the same environment
(no more “it works on my machine, I swear!”).

• Run tests faster: if you have a thousand models in your pipeline,
running them one by one is not the best way to spend your time.

• Easier debugging: detect a model’s malfunction earlier and avoid deploying it
into production.

7
[Figure: tests apply at every stage of the pipeline]
Tests for… Data Quality, Data Cleaning, Feature Extraction, Modelling, Trained Model, Model Monitoring

8
What should we test?

9
ML Test Score

https://research.google/pubs/pub45742/

10
ML Test Score

• 4 main angles:
• Tests for features and data.
• Tests for model development.
• Tests for ML infrastructure.
• Monitoring tests for ML.

11
⚠ Disclaimer
Some tests are not covered here, but they are no less important.
Check the paper for the full list.
Test that the distributions of each feature
match your expectations.

# Tests for Features and Data


13
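
A minimal pytest sketch of such a distribution check is shown below; the file path, column names, and expected bounds are illustrative assumptions, not prescribed values.

./tests/test_feature_distributions.py

import pandas
import pytest

@pytest.fixture()
def df():
    yield pandas.read_csv("train_data.csv")  # path is illustrative

def test_age_distribution(df):
    # 'age' should stay within plausible bounds and keep a stable central tendency.
    assert df['age'].between(0, 120).all()
    assert 30 <= df['age'].mean() <= 50   # expected band is an assumption
    assert df['age'].std() <= 25

def test_label_balance(df):
    # The positive-class rate should stay within a historically observed band.
    positive_rate = (df['label'] == 1).mean()
    assert 0.05 <= positive_rate <= 0.30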
Test the relationship between each feature
and the target, and the pairwise correlations
between individual signals.

# Tests for Features and Data


14
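
A sketch of how this could look with pytest and pandas; the feature names and thresholds are illustrative assumptions.

./tests/test_feature_correlations.py

import pandas
import pytest

@pytest.fixture()
def df():
    yield pandas.read_csv("train_data.csv")  # path is illustrative

def test_feature_target_correlation(df):
    # A feature we rely on should carry some signal about the target.
    assert abs(df['income'].corr(df['label'])) >= 0.1

def test_no_redundant_feature_pairs(df):
    # Near-duplicate features add cost without adding information.
    corr = df[['income', 'age', 'height']].corr().abs()
    for a in corr.columns:
        for b in corr.columns:
            if a != b:
                assert corr.loc[a, b] <= 0.95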
Test the cost of each feature.
• Latency
• Memory usage
• More upstream data dependencies
• Additional instability

# Tests for Features and Data


15
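
Latency is the easiest of these costs to test directly. A sketch, assuming a hypothetical extract_geo_features helper and an illustrative budget:

./tests/test_feature_cost.py

import time
import pandas
import pytest

from src.features import extract_geo_features  # hypothetical feature-extraction helper

@pytest.fixture()
def raw_batch():
    yield pandas.read_csv("raw_sample.csv").head(1000)  # path and batch size are illustrative

def test_geo_feature_latency(raw_batch):
    # The feature must be computable within the serving latency budget.
    start = time.perf_counter()
    extract_geo_features(raw_batch)
    elapsed = time.perf_counter() - start
    assert elapsed < 0.5  # budget in seconds; threshold is an assumption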
Test that a model does not contain any
features that have been manually determined
as unsuitable for use.

# Tests for Features and Data


16
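
A sketch using a denylist; the feature names are illustrative, and feature_names_in_ assumes a scikit-learn (>= 1.0) estimator that was fitted on a DataFrame.

./tests/test_forbidden_features.py

import joblib
import pytest

FORBIDDEN_FEATURES = {"gender", "ethnicity", "deprecated_score"}  # illustrative denylist

@pytest.fixture()
def trained_model():
    yield joblib.load('trained_model.sav')

def test_no_forbidden_features(trained_model):
    # scikit-learn (>= 1.0) estimators fitted on a DataFrame expose the input feature names.
    used = set(trained_model.feature_names_in_)
    assert used.isdisjoint(FORBIDDEN_FEATURES)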
Test that your system maintains privacy
controls across its entire data pipeline.
(not only in raw data but also in intermediate stages)

# Tests for Features and Data


17
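
One cheap automated check is asserting that PII columns never reach intermediate artifacts. The stage paths and column names below are illustrative assumptions.

./tests/test_privacy_controls.py

import pandas
import pytest

PII_COLUMNS = {"email", "phone", "ssn"}                          # illustrative PII set
INTERMEDIATE_FILES = ["data/cleaned.csv", "data/features.csv"]   # hypothetical pipeline stages

@pytest.mark.parametrize("path", INTERMEDIATE_FILES)
def test_no_pii_in_intermediate_data(path):
    df = pandas.read_csv(path)
    assert PII_COLUMNS.isdisjoint(df.columns)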
Test all code that creates input features, both
in training and serving.
E.g., methods used to clean date formats; methods used to remove stop words.

# Tests for Features and Data


18
Test that every model specification undergoes
a code review and is checked in to a
repository
mllint might be useful here

# Tests for Model Development


19
Theory vs. Practice

Test the relationship between offline proxy
metrics and the actual impact metrics.
For example, how does a 1% improvement in accuracy metrics translate
into effects on business metrics (e.g., user satisfaction)?

# Tests for Model Development


20
Test the impact of each tunable
hyperparameter.
What’s the oracle?

# Tests for Model Development


21
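
One possible oracle is that the value used in production is not clearly dominated by simpler settings. A sketch, assuming the project's train_model and evaluate_score helpers; the candidate values and tolerance are illustrative.

./tests/test_hyperparameters.py

from src.train_model import train_model, evaluate_score  # assumed project helpers

def test_max_depth_impact():
    # The production setting should not be clearly dominated by a simpler one.
    production_score = evaluate_score(train_model(max_depth=8))
    for candidate in [2, 4, 16]:
        candidate_score = evaluate_score(train_model(max_depth=candidate))
        assert production_score >= candidate_score - 0.02  # tolerance is an assumption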
Test the effect of model staleness.

# Tests for Model Development

Figure credits: Clemens Mewald, 2018

22
Test against a simpler model as a baseline

# Tests for Model Development


23
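
A sketch using scikit-learn's DummyClassifier as the trivial baseline, assuming the trained model is a scikit-learn classifier; paths, the label column, and the margin are illustrative assumptions.

./tests/test_baseline.py

import joblib
import pandas
import pytest
from sklearn.dummy import DummyClassifier

@pytest.fixture()
def test_data():
    df = pandas.read_csv("test_data.csv")           # path is illustrative
    yield df.drop(columns=['label']), df['label']

def test_beats_trivial_baseline(test_data):
    X, y = test_data
    model = joblib.load('trained_model.sav')
    baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
    # The trained model should clearly outperform predicting the majority class.
    assert model.score(X, y) >= baseline.score(X, y) + 0.05  # margin is an assumption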
Test model quality on important data slices.

# Tests for Model Development


24
Test the model for implicit bias.

# Tests for Model Development


25
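
A sketch that compares model quality across groups of a sensitive attribute; the attribute, the evaluate_score helper, and the tolerance are illustrative assumptions.

./tests/test_bias.py

import joblib
import pandas
import pytest

from src.train_model import evaluate_score  # assumed project helper

@pytest.fixture()
def trained_model():
    yield joblib.load('trained_model.sav')

@pytest.fixture()
def test_data():
    yield pandas.read_csv("test_data.csv")

def test_score_parity_across_groups(trained_model, test_data):
    # Model quality should not differ sharply between demographic groups.
    scores = [
        evaluate_score(trained_model, test_data[test_data['gender'] == g])
        for g in test_data['gender'].unique()
    ]
    assert max(scores) - min(scores) <= 0.05  # tolerance is an assumption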
Test the reproducibility of training.
Test non-determinism robustness.

# Tests for ML Infrastructure


# Tests for Model Development
26
Integration test the full ML pipeline.

# Tests for ML Infrastructure


27
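
A sketch of an end-to-end check on a tiny sample; run_pipeline is a hypothetical entry point that chains cleaning, feature extraction, training, and evaluation, and the output names are assumptions.

./tests/test_pipeline_integration.py

from src.pipeline import run_pipeline  # hypothetical pipeline entry point

def test_full_pipeline_on_small_sample(tmp_path):
    # Run the whole pipeline on a tiny fixture so the test stays fast.
    result = run_pipeline(data_path="tests/fixtures/sample.csv", output_dir=tmp_path)
    assert (tmp_path / "model.sav").exists()
    assert result["test_score"] > 0.5  # sanity threshold, not a quality bar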
Test models via a canary process before they
enter production serving environments.
Example: A/B testing

# Tests for ML Infrastructure


28
Test how quickly and safely a model can be
rolled back to a previous serving version.

# Tests for ML Infrastructure


29
Test that data invariants hold in training and
serving inputs.
E.g., shape of distributions of features should be the same in training data and serving data.

# Monitoring Tests for ML


30
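
A hand-rolled sketch of such an invariant check; in practice TFDV (see "Useful tools") automates schema and drift validation. Paths, the chosen feature, and the tolerance are illustrative assumptions.

./tests/test_data_invariants.py

import pandas
import pytest

@pytest.fixture()
def train_df():
    yield pandas.read_csv("train_data.csv")        # path is illustrative

@pytest.fixture()
def serving_df():
    yield pandas.read_csv("serving_sample.csv")    # path is illustrative

def test_schema_matches(train_df, serving_df):
    # Serving data should expose the same features as training data (minus the label).
    assert list(train_df.columns.drop('label')) == list(serving_df.columns)

def test_feature_distribution_shift(train_df, serving_df):
    # The mean of a key feature should not drift too far between training and serving.
    drift = abs(train_df['age'].mean() - serving_df['age'].mean())
    assert drift <= 0.1 * train_df['age'].std()    # tolerance is an assumption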
Test for model staleness.

# Monitoring Tests for ML


31
Test for dramatic or slow-leak regressions in
training speed, serving latency, throughput, or
RAM usage.

# Monitoring Tests for ML


32
Test for regressions in prediction quality on
served data.

# Monitoring Tests for ML


33
Final score
ML test score
• 1 point. If you do the test manually.
• 2 points. If you do the test automatically. ⭐
• Meaning of the final score sum:
• 0 points: More of a prototype project than a productionized system.
• 1-2 points: Not totally untested, but it is worth considering the
possibility of serious holes in reliability.
• 3-4 points: There’s basic productionization, but additional
investment may be needed.
• 5-6 points: Reasonably tested, but it’s possible that more of those
tests and procedures may be automated.
• 7-10 points: Strong levels of automated testing and monitoring,
appropriate for critical systems.
• 12+ points: Exceptional levels of automated testing and monitoring.

34
But really, how should we test?
(a few basic examples)

❌✔❌❌✔✔
35
PyTest – basic example

Project structure

/my_project_folder
    /src
        train_model.py
    /tests
        test_trained_model.py
    ...

./tests/test_trained_model.py

import joblib  # sklearn.externals.joblib has been removed; use joblib directly

def test_something_in_the_model():
    model = joblib.load('trained_dummy_model.sav')
    # ...
    assert ...

Run the tests

$ pytest

36
Duplicates
Unit test

./tests/test_data_cleaning.py

import pandas
import pytest

@pytest.fixture()
def df():
    df = pandas.read_csv("data.csv")  # path is illustrative
    yield df

def test_no_duplicates(df):
    # Every id appears exactly once, and (date, id) pairs are unique.
    assert len(df['id'].unique()) == df.shape[0]
    assert df.groupby(['date', 'id']).size().max() == 1

37
Preprocess methods
Unit test

./tests/test_data_cleaning.py

import pandas
import pytest

from src.data_cleaning import preprocess_missing_name, preprocess_city  # assumed project module

@pytest.fixture()
def df():
    df = pandas.read_csv("data.csv")  # path is illustrative
    yield df

def test_preprocess_missing_name(df):
    assert preprocess_missing_name("10019\n") is None

def test_preprocess_city(df):
    assert preprocess_city("amsterdam") == "Amsterdam"
    assert preprocess_city("AMS") == "Amsterdam"
    assert preprocess_city(" Amsterdam ") == "Amsterdam"

38
Value ranges
Unit test

./tests/test_data_cleaning.py

import pandas
import pytest

@pytest.fixture()
def df():
    df = pandas.read_csv("data.csv")  # path is illustrative
    yield df

def test_value_ranges(df):
    assert all(df['percentage'] <= 1)
    # Summing per group yields a Series, so wrap the comparison in all().
    assert all(df.groupby('name')['budget'].sum() <= 1000)
    assert all(df['height'] >= 0)

39
Test Non-determinism Robustness
Model Validation tests

• Performance stability when using different random seeds.

• If a model is performant, it should have little dependency on random
variance.

• Make seed an attribute in the pipeline; test different seeds;
assert for low variability.

40
Test Non-determinism Robustness
Unit test
./tests/test_data_cleaning.py

import joblib
import pytest

from src.train_model import train_model, evaluate_score  # assumed project module

@pytest.fixture()
def trained_model():
    trained_model = joblib.load('trained_model.sav')
    yield trained_model

def test_nondeterminism_robustness(trained_model):
    original_score = evaluate_score(trained_model)  # e.g., a score between 0 and 1
    for seed in [1, 2]:
        model_variant = train_model(random_state=seed)
        assert abs(original_score - evaluate_score(model_variant)) <= 0.03

41
Test Noise Robustness

• 1. What happens to the performance when we change a few training
data points?

• 2. What happens to the performance when we add acceptable noise
to test data points? (E.g., add typos.) A sketch of this second check follows below.

[Figure: a few noisy data points should not completely change the model]

42
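
A sketch of the noise-robustness check; add_typos is a hypothetical perturbation helper, and evaluate_score an assumed project function. Column name and tolerance are illustrative.

./tests/test_noise_robustness.py

import joblib
import pandas
import pytest

from src.train_model import evaluate_score   # assumed project helper
from src.perturbations import add_typos      # hypothetical noise generator

@pytest.fixture()
def trained_model():
    yield joblib.load('trained_model.sav')

@pytest.fixture()
def test_data():
    yield pandas.read_csv("test_data.csv")

def test_typo_robustness(trained_model, test_data):
    clean_score = evaluate_score(trained_model, test_data)
    noisy_data = test_data.copy()
    noisy_data['description'] = noisy_data['description'].apply(add_typos)
    noisy_score = evaluate_score(trained_model, noisy_data)
    # Acceptable noise should only cause a small drop in performance.
    assert clean_score - noisy_score <= 0.05  # tolerance is an assumption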
Test model quality on important data slices
Model validation tests

⚠ Warning! This test is highly dependent on the problem.

./tests/test_data_slice.py

import joblib
import pandas
import pytest

from src.train_model import evaluate_score  # assumed project module

@pytest.fixture()
def trained_model():
    trained_model = joblib.load('trained_model.sav')
    yield trained_model

@pytest.fixture()
def test_data():
    test_data = pandas.read_csv("test_data.csv")
    yield test_data

def test_data_slice(trained_model, test_data):
    original_score = evaluate_score(trained_model, test_data)
    sliced_data = test_data[test_data['city'] == 'Delft']
    sliced_score = evaluate_score(trained_model, sliced_data)
    # Compare the two scores: important slices should stay close to the overall score.
    assert abs(original_score - sliced_score) <= 0.05

43
What else?

• A lot of work is yet to be done in this area:
  • There is not much documentation around this topic.
  • What to test? Practitioners are still looking for testing best practices.

? 44
Mutamorphic testing and repair

https://arxiv.org/abs/1910.02688
45
Mutamorphic testing

• Metamorphic testing + mutation

• Metamorphic testing: derive new test cases based on properties of the
existing ones.
E.g., commutative property:
assertEqual(add(1, 2), 3) => assertEqual(add(2, 1), 3)

• Mutamorphic: the new test cases are not based on exactly-preserving (safe)
properties, but on approximate relations (e.g., context-similar words).

• Black-/grey-box testing.
No access to the model or its training codebase.

• Implemented for ML-based translators.
“It is okay” and “It is fine” have a mutamorphic relationship
(context-similar).
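
A sketch of the mutamorphic idea for a translator; translate() and similarity() are hypothetical wrappers around the system under test and a sentence-similarity metric, and the threshold is an assumption.

./tests/test_translation_mutamorphic.py

from src.translator import translate     # hypothetical black-box wrapper around the translator
from src.similarity import similarity    # hypothetical similarity metric in [0, 1]

def test_context_similar_replacement():
    original = "It is okay"
    mutant = "It is fine"                 # one word replaced by a context-similar word
    t_original = translate(original, target_lang="nl")
    t_mutant = translate(mutant, target_lang="nl")
    # The two translations should stay structurally close; a large divergence
    # is reported as an inconsistency (a failing test).
    assert similarity(t_original, t_mutant) >= 0.8  # threshold is an assumption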
Mutamorphic Testing

1. Automatic Test Input Generation: generate sentences by replacing one word
with a context-similar word (mutation).

2. Automatic Test Oracle Generation: when the translation of the mutant and
of the original sentence are fairly different, we have a failing test.

3. Automatic Inconsistency Repair: find a mutant sentence with a translation
that we can use to replace the original translation.
(Only works for translations with similar structure and word types.)

❌✔❌❌✔✔ → ✔✔✔✔✔✔

47
Mutamorphic Testing

1. Automatic Test Input Generation   2. Automatic Test Oracle Generation   3. Automatic Inconsistency Repair

48
Useful tools

• TFDV. https://github.com/tensorflow/data-validation

• mllint. https://github.com/bvobart/mllint

49
Wrap-up

50
Final Project

• What should you take from this class to the final project?
