Developing a machine learning or a deep learning model appears to be a relatively
straightforward task. It usually involves research, collecting and preprocessing the data,
extracting features, building and training the model, evaluation, and inference. Most of the
time is consumed in the data-preprocessing phase, followed by the model-building phase.
If the accuracy is not up to the mark, we iterate over the whole process until we reach a
satisfactory accuracy.
The difficulty arises when we want to put the model into production in the real world. The
model often does not perform as well as it did during the training and evaluation phase. This
happens primarily because of concept drift or data drift and issues concerning data integrity.
Therefore, testing an ML model becomes very important so that we can understand its
strengths and weaknesses and act accordingly.
In this article, we will discuss some of the tools that can be leveraged to test an ML model.
Some of these tools and libraries are open-source, while others require a subscription. Either
way, this article will fully explore the tools which will be handy for your MLOps pipeline.
Why does model testing matter?
Model testing and evaluation are similar to what we call diagnosis and screening in
medicine.
Model evaluation is similar to screening, where the performance of the model is checked
based upon certain metrics like the F1 score or MSE loss. These metrics do not point to a
focused area of concern.
Model testing is similar to diagnosis, where a specific test, like an invariance test or a unit
test, aims to find a particular issue in the model.
What will a typical ML software testing suite include?
Apart from testing data, the ML testing suite contains tools to test the model's capability to
predict, as well as to check for overfitting, underfitting, variance, bias, et cetera. The idea of
the testing framework is to inspect the pipeline in the three major phases of development:
data ingestion,
data preprocessing,
and model evaluation.
Frameworks like Robust Intelligence and Kolena rigorously and automatically test the given
ML pipeline in these areas to ensure a production-ready model.
What are the best tools for machine learning model testing?
Now, let’s discuss some of the tools for testing ML models. This section is divided into three
parts: open-source tools, subscription-based tools, and hybrid tools.
Open-source model testing tools
1. DeepChecks
DeepChecks is an open-source Python framework for testing ML models and data. It enables
users to test the ML pipeline in three different phases:
The image above shows the schema of three different tests that could be performed in an ML
pipeline | Source
These tests can be performed all at once or each independently.
Installation
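DeepChecks can be installed from PyPI. A minimal setup, assuming a standard Python
environment with pip available:
pip install deepchecks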
DeepChecks introduces three important terms: Check, Condition and Suite. It is worth noting
that these three terms together form the core structure of the framework.
Check
It enables a user to inspect a specific aspect of the data and models. The framework contains
various classes which allow you to check both of them. You can do a full check as well. Here
are a couple of such checks:
1. Data inspection covers data drift, duplication, missing values, string mismatch,
statistical analysis such as data distribution, et cetera. You can find the various
data-inspection tools within the check module, which allows you to precisely design
the inspection methods for your datasets. These are some of the classes that you will
find for data inspection:
‘DataDuplicates’,
‘DatasetsSizeComparison’,
‘DateTrainTestLeakageDuplicates’,
‘DateTrainTestLeakageOverlap’,
‘DominantFrequencyChange’,
‘FeatureFeatureCorrelation’,
‘FeatureLabelCorrelation’,
‘FeatureLabelCorrelationChange’,
‘IdentifierLabelCorrelation’,
‘IndexTrainTestLeakage’,
‘IsSingleValue’,
‘MixedDataTypes’,
‘MixedNulls’,
‘WholeDatasetDrift’
In the following example, we will inspect whether the dataset has duplicates or not. We will
import the class DataDuplicates from the checks module and pass the dataset as a parameter.
This will return a table containing relevant information on whether the dataset has duplicate
values or not.
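A minimal sketch of this check, assuming a dataset object named data and the
deepchecks.tabular API:
from deepchecks.tabular.checks import DataDuplicates
# Run the duplicate-rows check; the result is rendered as a table
DataDuplicates().run(data)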
As you can see, the table above yields relevant information about the number of duplicates
present in the dataset. Now let's see how DeepChecks uses visual aids to present the same
kind of information.
In the following example, we will inspect feature-feature correlation within the dataset. For
that, we will import the FeatureFeatureCorrelation class from the checks module.
from deepchecks.tabular.checks import FeatureFeatureCorrelation

ffc = FeatureFeatureCorrelation()
ffc.run(data)
An example of inspecting feature-feature correlation within the dataset | Source: Author
As you can see from both examples, the results can be displayed either in the form of a table
or a graph, or even both to give relevant information to the user.
An example of a model check or inspection on a Random Forest Classifier | Source: Author
Condition
It is a function or attribute that can be added to a Check. Essentially, it contains a predefined
parameter that can return a pass, fail, or warning result. These parameters can be modified
accordingly. Follow the code snippet below to get an understanding.
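A minimal sketch of adding a condition to a check, assuming the
add_condition_feature_pps_less_than helper available on FeatureLabelCorrelation in recent
DeepChecks versions:
from deepchecks.tabular.checks import FeatureLabelCorrelation
# Fail the check if any single feature predicts the label with a PPS of 0.8 or more
check = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.8)
check.run(data)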
The image above shows a bar graph of feature-label correlation. It essentially measures the
predictive power of an independent feature, i.e., how well that feature alone can predict the
target value. When you add a condition to a check, as in the example above, the condition
returns additional information listing the features that fall above and below the condition
threshold.
In this particular example, you will find that the condition returned a statement stating that
the algorithm “Found 2 out of 4 features with PPS above threshold: {‘petal width (cm)’:
‘0.9’, ‘petal length (cm)’: ‘0.87’}” meaning that features with high PPS are suitable to predict
the labels.
Suite
It is an ordered collection of checks for both data and model. All the checks can be found in
the suite module. Below is a schematic diagram of the framework and how it works.
The schematic diagram of the suite of checks and how it works | Source
As you can see from the image above, the data and the model can be passed into the suites
which contain the different checks. The checks can be provided with the conditions for much
more precise testing.
You can run the following code to see the list of 35 checks and their conditions that
DeepChecks provides:
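A minimal sketch, assuming the tabular full_suite helper of DeepChecks:
from deepchecks.tabular.suites import full_suite

suite = full_suite()
# Printing the suite lists every check it contains along with its attached conditions
print(suite)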
In conclusion, Check, Condition, and Suites allow users to essentially check the data and
model in their respective tasks. These can be extended and modified according to the
requirements of the project and for various use cases.
DeepChecks allows flexibility and instant validation of the ML pipeline with little effort.
Its strong boilerplate code allows users to automate the whole testing process, which can
save a lot of time.
Key features
1. It supports both classification and regression models, for both computer vision and
tabular datasets.
2. It can easily run a large group of checks with a single call.
3. It is flexible, editable, and expandable.
4. It yields results in both tabular and visual formats.
5. It does not require a login dashboard, as all the results, including the visualizations,
are displayed instantly during execution itself, and it offers a pretty good UX on the go.
An example of performance checks | Source
Key drawbacks
2. Drifter-ML
Drifter-ML is an ML model testing tool written specifically for the Scikit-learn library. It can
also be used to test datasets, similar to DeepChecks. It has five modules, each very specific to
the task at hand.
Installation
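Drifter-ML is distributed on PyPI; a minimal setup, assuming the package name drifter-ml:
pip install drifter-ml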
Drifter-ML conforms to the Scikit-learn blueprint for models, i.e., the model must expose
.fit and .predict methods. This also means that you can test deep learning models, since
Keras provides a Scikit-learn-compatible wrapper API. Check the example below.
# Source: https://drifter-ml.readthedocs.io/en/latest/classification-tests.html#lower-bound-classification-measures
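A minimal sketch in that spirit, assuming the legacy keras.wrappers.scikit_learn.KerasClassifier
wrapper (newer setups may need the scikeras package instead):
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

def build_model():
    # A small fully connected network for binary classification
    model = Sequential()
    model.add(Dense(12, input_dim=3, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# The wrapper exposes .fit and .predict, so drifter-ml can test it like any Scikit-learn model
clf = KerasClassifier(build_fn=build_model, epochs=10, batch_size=32, verbose=0)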
The example above shows the ease with which you can design an ANN model for use with
drifter-ml. Similarly, you can design a test case. In the test defined below, we verify that the
model's cross-validated precision stays above a lower boundary when classifying the two
classes.
import joblib
import pandas as pd
from drifter_ml.classification_tests import ClassificationTests

def test_cv_precision_lower_boundary():
    # Load the test data and the trained Scikit-learn-compatible model
    df = pd.read_csv("data.csv")
    column_names = ["A", "B", "C"]
    target_name = "target"
    clf = joblib.load("model.joblib")
    test_suite = ClassificationTests(clf, df, target_name, column_names)
    # The test passes only if the cross-validated precision stays above 0.9
    lower_boundary = 0.9
    return test_suite.cross_val_precision_lower_boundary(lower_boundary)
Drifter-ML is specifically written for Scikit-learn, and this library acts as an extension
to it. All the classes and methods are written in sync with Scikit-learn, so data and
model testing become relatively easy and straightforward.
On a side note, if you would like to work on an open-source library, you could extend this
one to other machine learning and deep learning libraries such as PyTorch.
Key features
Key drawbacks
Subscription-based tools
1. Kolena.io
Kolena argues that the split-test-dataset methodology isn't as reliable as it seems. Splitting
the dataset provides a global representation of the entire population distribution but fails to
capture local representations at a granular level; this is especially true for individual labels or
classes. There are hidden nuances of features that still need to be discovered. This leads to the
failure of the model in the real world, even though the model yields good scores on the
performance metrics during training and evaluation.
One way of addressing that issue is by creating a much more focused dataset, which can be
achieved by breaking a given class into smaller subclasses for focused results, or even by
creating subsets of the features themselves. Such a dataset enables the ML model to extract
features and representations at a much more granular level. This also increases the
performance of the model by balancing bias and variance so that the model generalizes well
in real-world scenarios.
For example, when building a classification model, a given class in the dataset can be broken
down into various subsets and those subsets into finer subsets. This can enable users to test
the model in various scenarios. In the table below, the CAR class is tested against several test
cases to check the model’s performance on various attributes.
CAR class tested against several test cases to check the model’s performance on various
attributes | Source
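To make the idea concrete, here is a small framework-agnostic sketch (hypothetical column
names and a trained model, not Kolena's API) that scores a model separately on hand-defined
sub-slices of the CAR class:
from sklearn.metrics import accuracy_score

# Hypothetical test cases: each selects a sub-slice of the CAR class from a labeled test DataFrame
test_cases = {
    "car_at_night": test_df[(test_df.label == "CAR") & (test_df.time_of_day == "night")],
    "car_occluded": test_df[(test_df.label == "CAR") & (test_df.occluded)],
}

for name, subset in test_cases.items():
    preds = model.predict(subset[feature_columns])
    print(name, accuracy_score(subset["label"], preds))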
Another benefit is that whenever we face a new scenario in the real world, a new test case can
be designed and tested immediately. Likewise, users can build more comprehensive test cases
for a variety of tasks and train or build a model. The users can also generate a detailed report
on a model’s performance in each category of test cases and compare it to the previous
models with each iteration.
If you are working on a large-scale deep learning model which will be complex to monitor,
then Kolena will be beneficial.
Key features
Key drawbacks
2. Robust Intelligence
Robust Intelligence is an end-to-end platform that tests both the data and the model across the
ML lifecycle. Its offering is built around three components:
1. AI Stress Testing, which runs hundreds of automated tests to measure the robustness of the
model and evaluate its performance before deployment.
Evaluating the performance of the model | Source
2. AI Firewall, which automatically creates a wrapper around the trained model to protect it
from bad data in real-time. The wrapper is configured based on the model. It also
automatically checks both the data and model, reducing manual effort and time.
3. AI Continuous Testing, which monitors the model and automatically tests the deployed
model to check whether it needs updates or retraining. The testing covers data drift, errors,
root cause analysis, anomaly detection, et cetera. All the insights gained during continuous
testing are displayed on the dashboard.
Key features
Key drawbacks
Hybrid frameworks
1. Etiq.ai
Etiq is an AI-observability platform that supports the AI/ML lifecycle. Like Kolena and
Robust Intelligence, the framework offers ML model testing, monitoring, optimization, and
explainability.
Etiq follows a structure similar to DeepChecks, and this structure remains at the core of the
framework.
Etiq offers two types of tests: scan and root cause analysis (RCA), the latter being an
experimental pipeline. The scan type offers the following:
Accuracy: In some cases, high accuracy can indicate a problem just as low accuracy
can. In such cases, an ‘accuracy’ scan can be helpful. If the accuracy is too high, then
you might do a leakage scan, or if it is low, then you can do a drift scan.
Leakage: It helps you to find data leakage.
Drift: It can help you to find feature drift, target drift, concept drift, and prediction
drift.
Bias: This refers to algorithmic bias that can arise from automated decision-making and
cause unintended discrimination.
Etiq.ai offers a multi-step pipeline, which means you can monitor the tests by logging the
results of each step in the ML pipeline. This allows you to identify and repair bias within
the model. If you are looking for a framework that can do the heavy lifting in your AI
pipeline, then Etiq.ai is the one to go with.
All the points above apply to the free-tier usage.
One key feature of Etiq.ai is that it allows you to be very precise and straightforward in your
model building and deploying approaches. It aims to give users the tools that can help them
to achieve the desired model. At times, the development process drifts away from the
original plan, mostly because of a lack of tools needed to shape the model. If you want to
deploy a model that is aligned with the proposed requirements, then Etiq.ai is the way to go.
This is because the framework offers similar tests at each step throughout your ML pipeline.
Key features
Key drawbacks
1. At the moment, in production, they only provide functionality for batch processing.
2. To apply tests to tasks pertaining to segmentation, regression, or recommendation
engines, you must get in touch with the team.
Conclusion
The ML testing frameworks that we discussed are directed toward the needs of the users. All
of the frameworks have their own pros and cons. But you can definitely get by using any one
of these frameworks. ML model testing frameworks play an integral part in defining how the
model will perform when deployed to a real-world scenario.
If you are looking for a free and easy-to-use ML testing framework for structured datasets
and smaller ML models, then go with DeepChecks. If you are working with DL algorithms,
then Etiq.ai is a good option. But if you can spare some money, then you should definitely
inquire about Kolena. And lastly, if you are working in a mid to large-size enterprise and
looking for ML testing solutions, then hands-down, it has to be Robust Intelligence.
I hope this article provided you with all the preliminary information needed for you to get
started with ML testing. Please share this article with everyone who needs it.
References
1. https://www.robustintelligence.com/
2. https://aws.amazon.com/marketplace/pp/prodview-23bciknsbkgta
3. https://etiq.ai/
4. https://docs.etiq.ai/
5. https://arxiv.org/pdf/2005.04118.pdf
6. https://medium.com/kolena-ml/best-practices-for-ml-model-testing-224366d3f23c
7. https://docs.kolena.io/
8. https://www.kolena.io/
9. https://github.com/EricSchles/drifter_ml
10. https://arxiv.org/pdf/2203.08491.pdf
11. https://medium.com/@ptannor/new-open-source-for-validating-and-testing-machine-learning-86bb9c575e71
12. https://deepchecks.com/
13. https://www.xenonstack.com/insights/machine-learning-model-testing
14. https://www.jeremyjordan.me/testing-ml/
15. https://neptune.ai/blog/ml-model-testing-teams-share-how-they-test-models
16. https://mlops.toys