03_ml_testing
Luís Cruz
[email protected]
Sebastian Proksch
[email protected]
REMLA 2022
Outline
• ML Testing Landscape
• What to test?
• How to test?
• Mutamorphic Testing
• Tools and Resources
ML Testing Landscape
https://arxiv.org/abs/1906.10742
ML Testing
Zhang et al. (2019)
ML Testing
Publications during 2007–2019 (⚠ Cumulative plot)
Automated ML testing
• Ensure that every execution is exactly the same and runs in the same environment (no more “it works on my machine, I swear!”).
• Run tests faster: if your pipeline has a thousand models, running them one by one is not the best way to spend your time.
Tests for…
• Data Quality
• Data Cleaning
• Feature Extraction
• Model Monitoring
What should we test?
ML Test Score
https://research.google/pubs/pub45742/
ML Test Score
• 4 main angles:
• Tests for features and data.
• Tests for model development.
• Tests for ML infrastructure.
• Monitoring tests for ML.
⚠ Disclaimer
Some tests are not covered here, but they are no less important.
Check the paper for the full list.
Test that the distributions of each feature
match your expectations.
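A minimal sketch of such a distribution check, using only the standard library. The expected mean and standard deviation, the tolerances, and the synthetic sample are all assumptions made for illustration; in a real project the values would come from the actual feature column.

```python
import random
import statistics

# Hypothetical expectations for one numeric feature,
# e.g. taken from a trusted reference sample.
EXPECTED_MEAN = 170.0
EXPECTED_STDEV = 10.0


def load_feature_values(n=5000, seed=0):
    """Stand-in for reading a feature column: draws synthetic values."""
    rng = random.Random(seed)
    return [rng.gauss(EXPECTED_MEAN, EXPECTED_STDEV) for _ in range(n)]


def check_distribution(values, mean_tol=1.0, stdev_tol=1.0):
    """Return True if sample mean and standard deviation are within tolerance."""
    mean_ok = abs(statistics.mean(values) - EXPECTED_MEAN) <= mean_tol
    stdev_ok = abs(statistics.stdev(values) - EXPECTED_STDEV) <= stdev_tol
    return mean_ok and stdev_ok


def test_feature_distribution():
    assert check_distribution(load_feature_values())
```

Comparing summary statistics is the simplest option; a statistical test or a histogram comparison may fit better when the shape of the distribution matters.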
Figure credits: Clemens Mewald, 2018
Tests for Model Development
Test against a simpler model as a baseline
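A possible shape for this test, sketched with plain Python. The synthetic dataset, the threshold "model", and the 0.1 margin are assumptions; the majority-class predictor plays the role of the simpler baseline.

```python
import random
from collections import Counter


def make_dataset(n=1000, seed=0):
    """Synthetic binary data: label is 1 when x > 0.5, with 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data


def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)


def majority_baseline(train):
    """Simplest possible model: always predict the most frequent training label."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority


def threshold_model(x):
    """Stand-in for the trained model under test."""
    return int(x > 0.5)


def test_model_beats_baseline():
    data = make_dataset()
    train, test = data[:800], data[800:]
    baseline_acc = accuracy(majority_baseline(train), test)
    model_acc = accuracy(threshold_model, test)
    # Require a clear margin over the baseline, not just a tie.
    assert model_acc >= baseline_acc + 0.1
```

If the model cannot clearly beat a trivial baseline, the extra complexity is hard to justify.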
But really, how should we test?
(a few basic examples)
PyTest – basic example
Project structure
/my_project_folder
    ...
    /src
        train_model.py
    /tests
        test_trained_model.py
./tests/test_trained_model.py
import joblib  # sklearn.externals.joblib was removed from scikit-learn

def test_something_in_the_model():
    model = joblib.load('trained_dummy_model.sav')
    # ...
    assert ...
$ pytest
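Duplicates
Unit test
./tests/test_data_cleaning.py
import pandas
import pytest

@pytest.fixture()
def df():
    # Placeholder path: the slide does not say where the data comes from.
    df = pandas.read_csv("data.csv")
    yield df

def test_no_duplicates(df):
    # Every id appears once, and no (date, id) pair is repeated.
    assert len(df['id'].unique()) == df.shape[0]
    assert df.groupby(['date', 'id']).size().max() == 1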
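Preprocess methods
Unit test
./tests/test_data_cleaning.py
import pandas
import pytest

# preprocess_missing_name and preprocess_city are the project's own cleaning
# functions; their import path is not shown on the slide.

@pytest.fixture()
def df():
    # Placeholder path: the slide does not say where the data comes from.
    df = pandas.read_csv("data.csv")
    yield df

def test_preprocess_missing_name(df):
    assert preprocess_missing_name("10019\n") is None

def test_preprocess_city(df):
    assert preprocess_city("amsterdam") == "Amsterdam"
    assert preprocess_city("AMS") == "Amsterdam"
    assert preprocess_city(" Amsterdam ") == "Amsterdam"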
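Value ranges
Unit test
./tests/test_data_cleaning.py
import pandas
import pytest

@pytest.fixture()
def df():
    # Placeholder path: the slide does not say where the data comes from.
    df = pandas.read_csv("data.csv")
    yield df

def test_value_ranges(df):
    assert all(df['percentage'] <= 1)
    # The grouped sum is a Series, so it needs all() before it can be asserted.
    assert all(df.groupby('name')['budget'].sum() <= 1000)
    assert all(df['height'] >= 0)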
Test Non-determinism Robustness
Model Validation tests
Test Non-determinism Robustness
Unit test
./tests/test_data_cleaning.py
import joblib
import pytest

# train_model and evaluate_score are the project's own helpers (not shown on the slide).

@pytest.fixture()
def trained_model():
    trained_model = joblib.load('trained_model.sav')
    yield trained_model

def test_nondeterminism_robustness(trained_model):
    original_score = evaluate_score(trained_model)  # score between 0..100
    for seed in [1, 2]:
        model_variant = train_model(random_state=seed)
        # Retraining with a different seed should barely move the score.
        assert abs(original_score - evaluate_score(model_variant)) <= 0.03
Test Noise Robustness
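One way such a test could look, sketched with a toy threshold classifier in place of a trained model. The noise level (sigma = 0.01) and the allowed accuracy drop (0.05) are assumptions and would need tuning per problem.

```python
import random


def predict(x):
    """Stand-in for the trained model: a fixed threshold classifier."""
    return int(x > 0.5)


def accuracy(inputs, labels):
    return sum(predict(x) == y for x, y in zip(inputs, labels)) / len(labels)


def add_gaussian_noise(inputs, sigma, seed=0):
    """Perturb every input with zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in inputs]


def test_noise_robustness():
    rng = random.Random(42)
    inputs = [rng.random() for _ in range(2000)]
    labels = [int(x > 0.5) for x in inputs]
    clean_acc = accuracy(inputs, labels)
    noisy_acc = accuracy(add_gaussian_noise(inputs, sigma=0.01), labels)
    # Small perturbations should cost at most a few points of accuracy.
    assert clean_acc - noisy_acc <= 0.05
```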
Test model quality on important data slices
Model validation tests
⚠ Warning! This test is highly dependent on the problem.
./tests/test_data_slice.py
@pytest.fixture()
def trained_model():
    trained_model = joblib.load('trained_model.sav')
    yield trained_model

@pytest.fixture()
def test_data():
    test_data = pandas.read_csv("test_data.csv")
    yield test_data
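The slide shows only the fixtures. A test that uses them might compare accuracy per slice, as in the sketch below. The records, the slice keys, and both thresholds are hypothetical and stand in for real predictions on the test data.

```python
from collections import defaultdict

# Hypothetical labelled test records: (slice key, true label, model prediction).
RECORDS = [
    ("NL", 1, 1), ("NL", 0, 0), ("NL", 1, 1), ("NL", 0, 1),
    ("PT", 1, 1), ("PT", 0, 0), ("PT", 1, 1), ("PT", 0, 0),
]

MIN_ACCURACY = 0.7  # assumed quality bar for every slice
MAX_GAP = 0.3       # assumed maximum gap between best and worst slice


def accuracy_per_slice(records):
    """Group records by slice key and compute accuracy within each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for key, label, prediction in records:
        totals[key] += 1
        hits[key] += int(label == prediction)
    return {key: hits[key] / totals[key] for key in totals}


def test_quality_on_slices():
    scores = accuracy_per_slice(RECORDS)
    # Every slice must clear the quality bar on its own ...
    assert all(score >= MIN_ACCURACY for score in scores.values())
    # ... and no slice may lag far behind the best one.
    assert max(scores.values()) - min(scores.values()) <= MAX_GAP
```

Which slices count as important, and how large a gap is acceptable, depends on the problem.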
What else?
Mutamorphic testing and repair
https://arxiv.org/abs/1910.02688
Mutamorphic testing
• Mutamorphic: the new test cases are not exactly based on safe properties.
• Black-/grey-box testing: no access to the model or its training codebase.
1. Automatic Test Input Generation
2. Automatic Test Oracle Generation
3. Automatic Inconsistency Repair
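A toy illustration of the first two steps (input generation and oracle). This is not the method from the linked paper: the word-for-word translator, the replacement table, and the token-difference oracle are hypothetical stand-ins, and the translator contains an injected bug so that the oracle has something to find. Repair (step 3) is not shown.

```python
from collections import Counter

# Hypothetical replacement pairs; a real setup would pick similar words automatically.
REPLACEMENTS = {"men": "women", "good": "great"}

VOCABULARY = {
    "men": "mannen", "women": "vrouwen", "are": "zijn",
    "good": "goede", "great": "geweldige", "engineers": "ingenieurs",
}


def translate(sentence):
    """Toy black-box system under test: word-for-word lookup with an injected bug."""
    words = sentence.split()
    if "women" in words:
        # Injected inconsistency: the adjective is silently dropped.
        words = [w for w in words if w not in ("good", "great")]
    return " ".join(VOCABULARY.get(w, w) for w in words)


def generate_mutants(sentence):
    """Step 1: create new inputs by replacing one word at a time."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word in REPLACEMENTS:
            yield " ".join(words[:i] + [REPLACEMENTS[word]] + words[i + 1:])


def is_consistent(original_output, mutant_output, max_changed_tokens=2):
    """Step 2: oracle. One replaced input word may change one token on each side."""
    a, b = Counter(original_output.split()), Counter(mutant_output.split())
    changed = sum(((a - b) + (b - a)).values())
    return changed <= max_changed_tokens


def find_inconsistencies(sentence):
    """Return the mutants whose output differs too much from the original output."""
    original_output = translate(sentence)
    return [m for m in generate_mutants(sentence)
            if not is_consistent(original_output, translate(m))]
```

The test only calls `translate`, so it treats the system as a black box, in line with the second bullet above.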
Mutamorphic Testing
Useful tools
• mllint. https://github.com/bvobart/mllint
Wrap-up
Final Project