An Introduction to SHAP Values and Machine Learning Interpretability
Machine learning models are powerful but hard to interpret. However, SHAP
values can help you understand how model features impact predictions.
Machine learning models are becoming increasingly complex, powerful, and able to make accurate predictions. However, as these models become "black boxes," it becomes even harder to understand how they arrive at those predictions. This has led to a growing focus on machine learning interpretability and explainability.
For example, imagine you applied for a loan at a bank but were rejected. You want to know the reason for the rejection, but the customer service agent tells you that an algorithm dismissed the application and that they cannot determine why. This is frustrating, right? You deserve an explanation for a decision that affects you. That's why companies are trying to make their machine learning models more transparent and understandable.
One of the most promising tools for this process is SHAP values, which measure how much
each feature (such as income, age, credit score, etc.) contributes to the model's prediction.
SHAP values can help you see which features are most important for the model and how they
affect the outcome.
In this tutorial, we will learn about SHAP values and their role in machine learning model interpretation. We will also use the `shap` Python package to create and analyze different plots for interpreting models.
What Are SHAP Values?
SHAP (SHapley Additive exPlanations) values show how each feature affects the final prediction, the significance of each feature compared to others, and the model's reliance on interactions between features.
SHAP values are model-agnostic, meaning they can be used to interpret any machine learning
model, including:
Linear regression
Decision trees
Random forests
Neural networks
SHAP values also have several key properties:
Additivity
SHAP values are additive, which means that the contribution of each feature to the final
prediction can be computed independently and then summed up. This property allows for
efficient computation of SHAP values, even for high-dimensional datasets.
Local accuracy
SHAP values add up to the difference between the expected model output and the actual output
for a given input. This means that SHAP values provide an accurate and local interpretation of
the model's prediction for a given input.
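In equation form, local accuracy (together with additivity) means that for a single input x, the base value plus the per-feature SHAP values reconstructs the model output exactly:

f(x) = E[f(X)] + φ_1 + φ_2 + … + φ_M

where φ_i is the SHAP value of feature i, M is the number of features, and E[f(X)] is the average model output over the data (the expected or base value).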
Missingness
SHAP values are zero for missing or irrelevant features for a prediction. This makes SHAP
values robust to missing data and ensures that irrelevant features do not distort the
interpretation.
Consistency
SHAP values do not change when the model changes unless the contribution of a feature
changes. This means that SHAP values provide a consistent interpretation of the model's
behavior, even when the model architecture or parameters change.
Overall, SHAP values provide a consistent and objective way to gain insights into how a
machine learning model makes predictions and which features have the greatest influence.
Setting Up
Install SHAP either using PyPI or conda-forge:
pip install shap
or
conda install -c conda-forge shap
Load the Telecom Customer Churn dataset. The dataset looks clean, and the target column is "Churn."
import shap
import pandas as pd
import numpy as np
shap.initjs()
customer = pd.read_csv("data/customer_churn.csv")
customer.head()
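Before generating the classification report below, we need a trained model. Here is a minimal sketch of the setup the rest of the tutorial assumes: a train/test split and a random forest classifier named clf. The split ratio, random_state, and other hyperparameters are illustrative choices, not taken from the original.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Separate the features from the "Churn" target column
X = customer.drop("Churn", axis=1)
y = customer["Churn"]

# Hold out a test set for evaluation and SHAP analysis
# (25% test size and random_state=42 are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Train a random forest classifier and predict on the test set
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)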
# Classification Report
print(classification_report(y_test, y_pred))
The model has shown better performance for the "0" label than for the "1" label due to the unbalanced dataset. Overall, it is an acceptable result with 94% accuracy.
Check out our Classification in Machine Learning guide to learn about classification in machine
learning with Python examples.
We will first create an explainer object by providing the random forest classification model, then calculate the SHAP values using the testing set.
explainer = shap.Explainer(clf)
shap_values = explainer.shap_values(X_test)
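As a quick sanity check of the additivity and local accuracy properties discussed earlier, each row's SHAP values plus the expected value should reconstruct the model's output for that row. A minimal sketch, assuming the tree explainer returns a per-class list of SHAP arrays here (newer SHAP releases may instead return a single array with a class dimension):

# Base value + sum of per-feature SHAP values should match the predicted probability of class "0"
reconstructed = explainer.expected_value[0] + shap_values[0].sum(axis=1)
predicted = clf.predict_proba(X_test)[:, 0]
print(np.allclose(reconstructed, predicted))  # should print True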
Summary Plot
Display the summary_plot using the SHAP values and the testing set.
shap.summary_plot(shap_values, X_test)
The summary plot shows the feature importance of each feature in the model. The results show
that “Status,” “Complaints,” and “Frequency of use” play major roles in determining the results.
Display the summary_plot of the label “0”.
shap.summary_plot(shap_values[0], X_test)
The Y-axis indicates the feature names, in order of importance from top to bottom.
The X-axis represents the SHAP value, which indicates the degree of change in log odds.
The color of each point on the graph represents the value of the corresponding feature, with red indicating high values and blue indicating low values.
If you look at the "Complaints" feature, you will see that it is mostly high with a negative SHAP value. This means that higher complaint counts tend to affect the output negatively.
Dependence Plot
A dependence plot is a type of scatter plot that displays how a model's predictions are affected by a specific feature, in this case "Subscription Length." On average, subscription length has a mostly positive effect on the model's output.
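The dependence plot can be generated with shap.dependence_plot; a minimal sketch, assuming the feature column in this dataset is named "Subscription Length" (adjust the name if your column differs):

# Dependence plot of "Subscription Length" against its label-"0" SHAP values
shap.dependence_plot("Subscription Length", shap_values[0], X_test)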
Force Plot
We will examine the first sample in the testing set to determine which features contributed to the
"0" result. To do this, we will utilize a force plot and provide the expected value, SHAP value,
and testing sample.
shap.plots.force(explainer.expected_value[0], shap_values[0][0, :], X_test.iloc[0, :])
We can clearly see that zero complaints and zero call failures contributed negatively to customer loss; in other words, they pushed the prediction toward the "0" (no-churn) label.
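We can also explain a sample the model predicted as churned ("1"). A minimal sketch, where the sample index is simply the first test row predicted as "1" (an illustrative choice, not taken from the original):

# Explain the first test sample the model predicts as churned ("1")
churn_idx = int(np.argmax(y_pred == 1))
shap.plots.force(explainer.expected_value[1], shap_values[1][churn_idx, :], X_test.iloc[churn_idx, :])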
You can see all of the features, with their values and magnitudes, that have contributed to the loss of a customer. It seems that even one unresolved complaint can cost a telecommunications company a customer.
Decision Plot
We will now display the decision_plot. It visually depicts the model's decisions by mapping the cumulative SHAP values for each prediction.
Each plotted line on the decision plot shows how strongly the individual features contributed to a single model prediction, thus explaining which feature values pushed the prediction.
Note: the decision plot for the target label "1" is tilted towards "1".
Next, display the decision plot for the target label "0".
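A minimal sketch for generating both decision plots, assuming the same per-class shap_values list used above:

# Decision plot for the target label "1" (note how it is tilted towards "1")
shap.decision_plot(explainer.expected_value[1], shap_values[1], X_test)

# Decision plot for the target label "0"
shap.decision_plot(explainer.expected_value[0], shap_values[0], X_test)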
Applications of SHAP Values
SHAP values can be applied in many ways, including:
1. Model debugging. By examining the SHAP values, we can identify any biases or outliers in the data that may be causing the model to make mistakes.
2. Feature importance. Identifying and removing low-impact features can create a more optimized model.
3. Model summaries. SHAP can provide a global summary of a model in the form of a SHAP value summary plot, giving an overview of the most important features across the entire dataset.
4. Detecting biases. SHAP value analysis helps identify whether certain features disproportionately affect particular groups, enabling the detection and reduction of discrimination in the model.
5. Fairness auditing. SHAP values can be used to assess a model's fairness and ethical implications.
6. Regulatory approval. SHAP values can help gain regulatory approval by explaining the model's decisions.
Conclusion
We have explored SHAP values and how we can use them to provide interpretability for
machine learning models. While having an accurate model is essential, companies need to go
beyond accuracy and focus on interpretability and transparency to gain the trust of users and
regulators.
Being able to explain why a model made a particular prediction helps debug potential biases,
identify data issues, and justify the model's decisions.