Uploaded by Sajal Khandelwal


Testing in Data Science

This is what you need for testing, btw.


In data science, two types of tests can be written, in addition to the usual
unit tests written with the pytest library:
1. For Data Analysis
2. For Machine Learning
In data analysis, you need to test the code on previously unseen data
(basically data validation).
You do that by checking the properties of the outcome rather than the value
of the outcome. There are libraries for that; I found four of them, and there
are obviously more:
1. engarde
2. Hypothesis
3. Feature Forge
4. Voluptuous
These libraries check properties of the output data, rather than the exact
values. In addition, NumPy and pandas ship built-in testing utilities
(numpy.testing and pandas.testing) that you can use for this.
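As a small illustration (the DataFrame and its columns here are invented for the example), property checks with plain pandas methods and the numpy.testing / pandas.testing utilities look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, 35, 41],
                   "income": [30_000.0, 52_000.0, 61_000.0]})

# Check properties of the data rather than exact values:
assert df["age"].between(0, 120).all()          # plausible range
assert df["income"].notna().all()               # no missing values
np.testing.assert_array_less(0, df["income"].to_numpy())  # strictly positive
pd.testing.assert_index_equal(df.columns, pd.Index(["age", "income"]))
```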
For example, Hypothesis (which seems to be the most useful in our case)
creates random data given some specifications and runs it through our code to
assert the properties we want to check. It also hunts for edge cases on its
own and, when a test fails, shrinks the input down to a minimal failing
example.

This blog basically confirms your doubts


An example of Hypothesis
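A minimal sketch of what such a test looks like; normalize and the properties checked here are hypothetical stand-ins, not taken from the linked example:

```python
# Property-based test with Hypothesis: instead of asserting exact values,
# we assert properties that must hold for ANY generated input.
from hypothesis import given, strategies as st

def normalize(xs):
    # Hypothetical function under test: scale values into [0, 1].
    lo, hi = min(xs), max(xs)
    if lo == hi:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

@given(st.lists(st.floats(allow_nan=False, allow_infinity=False,
                          min_value=-1e6, max_value=1e6),
                min_size=1))
def test_normalize_properties(xs):
    out = normalize(xs)
    # Property 1: output has the same length as the input.
    assert len(out) == len(xs)
    # Property 2: every value lies in [0, 1].
    assert all(0.0 <= v <= 1.0 for v in out)

test_normalize_properties()  # Hypothesis generates and shrinks the inputs
```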
These talks would help:
1. Testing for Properties

2. Data Validation
NumPy builtin data validation
In testing ML models, there are a couple of steps involved. First, you need
to pytest all the non-machine-learning code.
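For the non-ML code these are ordinary pytest tests; a tiny sketch with a hypothetical helper function:

```python
# A plain pytest-style unit test for non-ML helper code.
# (pytest discovers any function named test_* and runs its asserts.)

def clean_column_name(name):
    # Hypothetical helper: normalize a raw column header.
    return name.strip().lower().replace(" ", "_")

def test_clean_column_name():
    assert clean_column_name("  Sale Price ") == "sale_price"
    assert clean_column_name("ID") == "id"

test_clean_column_name()  # pytest would call this automatically
```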
Since models cannot be tested directly, there are ways to work around it:
1. Blackbox Testing for Machine Learning
2. QA for ML Models

You can still do the property checks on the output data. Feature Forge is
aimed specifically at ML.
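A sketch of such property checks on model output, using a NumPy softmax as a stand-in for any classifier's predict_proba (the shapes and seed are made up):

```python
import numpy as np

# Hypothetical model output: a batch of class-probability rows.
# (Stand-in for model.predict_proba(X) from any classifier.)
def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
proba = softmax(rng.normal(size=(100, 3)))

# Property checks on the output data, not on exact values:
assert proba.shape == (100, 3)                      # shape is preserved
assert np.all((proba >= 0) & (proba <= 1))          # valid probabilities
assert np.allclose(proba.sum(axis=1), 1.0)          # rows sum to one
```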
Then there are the metrics we talked about in class yesterday, which are
used to check the quality of the model.
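A sketch of such quality metrics computed by hand with NumPy (the labels and predictions below are made up):

```python
import numpy as np

# Hypothetical held-out labels and model predictions.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

accuracy = np.mean(y_true == y_pred)        # fraction of correct predictions
tp = np.sum((y_pred == 1) & (y_true == 1))
precision = tp / np.sum(y_pred == 1)        # correctness of positive calls
recall = tp / np.sum(y_true == 1)           # coverage of true positives

# A quality gate: fail the test suite if the model degrades.
assert accuracy >= 0.7
```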
In our specific problem, we could use the Hypothesis library to get a random
DataFrame, pass it through our function, and check whether any rows still have
a correlation above a certain threshold. Since the data is random but the
parameters can be defined, we can get exactly the kind of test we want.
I'll write a test for this later. I'll share the code once it works.
Hope this helps.
