The Mythos of Model Interpretability
In machine learning, the concept of interpretability is both important and slippery.
ZACHARY C. LIPTON
Supervised machine-learning models boast
remarkable predictive capabilities. But can you
trust your model? Will it work in deployment?
What else can it tell you about the world?
Models should be not only good, but also
interpretable, yet the task of interpretation appears
underspecified. The academic literature has provided
diverse and sometimes non-overlapping motivations for
interpretability and has offered myriad techniques for
rendering models interpretable. Despite this ambiguity,
many authors proclaim their models to be interpretable
axiomatically, absent further argument. Problematically,
it is not clear what common properties unite these
techniques.
This article seeks to refine the discourse on
interpretability. First, it examines the objectives of previous work in this area; then it considers the model properties and techniques thought to confer interpretability.
INTRODUCTION
Until recently, humans had a monopoly on agency in
society. If you applied for a job, loan, or bail, a human
decided your fate. If you went to the hospital, a human
would attempt to categorize your malady and recommend
treatment. For consequential decisions such as these, you
might demand an explanation from the decision-making
agent.
If your loan application is denied, for example, you
might want to understand the agent’s reasoning in a bid to
strengthen your next application. If the decision was based
on a flawed premise, you might contest this premise in the
hope of overturning the decision. In the hospital, a doctor’s
explanation might educate you about your condition.
In societal contexts, the reasons for a decision often
matter. For example, intentionally causing death (murder)
and unintentionally causing it (manslaughter) are distinct crimes.
Similarly, a hiring decision being based (directly or
indirectly) on a protected characteristic such as race has a
bearing on its legality. However, today’s predictive models
are not capable of reasoning at all.
Given a medical scan, for example, an ML algorithm can assign a probability that the scan depicts a cancerous tumor. The ML algorithm takes in a large corpus of (input, output) pairs, and outputs a model that can predict the output corresponding to a previously unseen input. Formally, researchers call this problem setting supervised learning. Then, to automate decisions fully, one feeds the model's output into some decision rule. For example, spam filters programmatically discard emails predicted to be spam with a level of confidence exceeding some threshold.
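As a minimal sketch of this pipeline (assuming scikit-learn; the synthetic features, labels, and the 0.9 threshold below are illustrative placeholders, not details from the article):

```python
# Minimal sketch: supervised learning followed by a thresholded decision rule.
# The data are random placeholders standing in for email features and spam labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1000, 20))                    # (input, output) pairs: features ...
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)     # ... and labels (1 = spam)

model = LogisticRegression().fit(X, y)        # learn the input -> output mapping

# Decision rule: discard an email only when the predicted spam probability
# exceeds a confidence threshold.
THRESHOLD = 0.9
spam_probability = model.predict_proba(X[:5])[:, 1]
discard = spam_probability > THRESHOLD
print(list(zip(spam_probability.round(2), discard)))
```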
Thus, ML-based systems do not know why a given input
should receive some label, only that certain inputs are
correlated with that label. For example, shown a dataset
in which the only orange objects are basketballs, an image
classifier might learn to classify all orange objects as
basketballs.
This model would achieve high accuracy even on held-out
images, despite failing to grasp the difference that
actually makes a difference.
As ML penetrates critical areas such as medicine, the
criminal justice system, and financial markets, the inability
of humans to understand these models seems problematic.
Some suggest model interpretability as a remedy, but in the academic literature, few authors articulate precisely what interpretability means or precisely how their proposed solution is useful.
Trust
Some authors suggest interpretability is a prerequisite
for trust.9,23 Again, what is trust? Is it simply confidence
that a model will perform well? If so, a sufficiently
accurate model should be demonstrably trustworthy, and
interpretability would serve no purpose. Trust might also
be defined subjectively. For example, a person might feel
more at ease with a well-understood model, even if this
understanding serves no obvious purpose. Alternatively,
when the training and deployment objectives diverge, trust might denote confidence that the model will perform well with respect to the real deployment objectives, not merely the objective optimized during training.
Causality
Although supervised learning models are only optimized
directly to make associations, researchers often use them
in the hope of inferring properties of the natural world. For
example, a simple regression model might reveal a strong
association between thalidomide use and birth defects, or
between smoking and lung cancer.29
The associations learned by supervised learning models are not guaranteed to reflect causal relationships, but one might hope that interpreting them yields clues about the causal relationships between, for example, physiologic signals and affective states. The task of inferring causal relationships from observational data has been extensively studied.22 Causal inference methods, however, tend to rely on strong assumptions and are not widely used by practitioners, especially on large, complex data sets.

Transferability
Typically, training and test data are chosen by randomly
partitioning examples from the same distribution. A
model’s generalization error is then judged by the gap
between its performance on training and test data.
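A small sketch of this protocol (scikit-learn assumed; the random data and model choice are placeholders, not from the article):

```python
# Minimal sketch: estimate generalization by the train/test performance gap
# after randomly partitioning examples drawn from the same distribution.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((2000, 10))
y = (X.sum(axis=1) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
gap = model.score(X_train, y_train) - model.score(X_test, y_test)
print(f"train/test accuracy gap: {gap:.3f}")
```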
Humans exhibit a far richer capacity to generalize,
however, transferring learned skills to unfamiliar
situations. ML algorithms are already used in these
situations, such as when the environment is nonstationary.
Models are also deployed in settings where their use might
alter the environment, invalidating their future predictions.
Along these lines, Caruana et al.3 describe a model trained
to predict probability of death from pneumonia that
assigned less risk to patients if they also had asthma.
Presumably, asthma was predictive of a lower risk of death
because of the more aggressive treatment these patients received.
Informativeness
Sometimes, decision theory is applied to the outputs of
supervised models to take actions in the real world. In
another common use paradigm, however, the supervised
model is used instead to provide information to human
decision-makers, a setting considered by Kim et al.11 and
Huysmans et al.8 While the machine-learning objective
might be to reduce error, the real-world purpose is to
provide useful information. The most obvious way that a
model conveys information is via its outputs, but it may
be possible via some procedure to convey additional
information to the human decision-maker.
An interpretation may prove informative even without
shedding light on a model’s inner workings. For example,
a diagnosis model might provide intuition to a human
decision maker by pointing to similar cases in support
of a diagnostic decision. In some cases, a supervised
learning model is trained when the real task more closely
resembles unsupervised learning. The real goal might be
to explore the underlying structure of the data, and the
labeling objective serves only as weak supervision.
Simulatability
In the strictest sense, a model might be called transparent if a person can contemplate the entire model at once.
Decomposability
A second notion of transparency might be that each part
of the model—input, parameter, and calculation—admits
an intuitive explanation. This accords with the property of
intelligibility as described by Lou et al.15 For example, each
node in a decision tree might correspond to a plain text
description (e.g., all patients with diastolic blood pressure
over 150). Similarly, the parameters of a linear model could
be described as representing strengths of association
between each feature and the label.
Note that this notion of interpretability requires
that inputs themselves be individually interpretable,
disqualifying some models with highly engineered or
anonymous features. While this notion is popular, it
shouldn't be accepted blindly. The weights of a linear model
might seem intuitive, but they can be fragile with respect
to feature selection and preprocessing. For example, the
coefficient corresponding to the association between
flu risk and vaccination might be positive or negative,
depending on whether the feature set includes indicators
of old age, infancy, or immunodeficiency.
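The following synthetic sketch illustrates that fragility; the data-generating process (an old-age confounder that drives both vaccination and flu risk) is invented purely for illustration:

```python
# Minimal sketch: a linear coefficient can flip sign depending on which other
# features are included. Synthetic data; old age confounds both vaccination
# and flu risk.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
old_age = rng.binomial(1, 0.3, n)
vaccinated = rng.binomial(1, 0.2 + 0.6 * old_age)      # older people vaccinate more
flu_risk = 0.5 * old_age - 0.2 * vaccinated + rng.normal(0, 0.1, n)

# Vaccination alone: the coefficient comes out positive (it proxies for age).
coef_alone = LinearRegression().fit(vaccinated.reshape(-1, 1), flu_risk).coef_[0]

# Vaccination plus an age indicator: the coefficient turns negative.
X = np.column_stack([vaccinated, old_age])
coef_adjusted = LinearRegression().fit(X, flu_risk).coef_[0]
print(f"{coef_alone:+.3f} vs. {coef_adjusted:+.3f}")
```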
Algorithmic transparency
A final notion of transparency might apply at the level of
the learning algorithm itself. In the case of linear models,
you may understand the shape of the error surface. You
can prove that training will converge to a unique solution,
even for previously unseen data sets. This might provide
some confidence that the model will behave in an online
setting requiring programmatic retraining on previously unseen data.
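As a small illustration of this kind of transparency (ordinary least squares is used here as a stand-in example; the article does not single out a specific model):

```python
# Minimal sketch: for linear least squares the error surface is convex, and
# with full-rank features the trained weights are the unique minimizer.
# Data are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = rng.standard_normal(200)

# Unique closed-form solution of min_w ||Xw - y||^2 via the normal equations.
w = np.linalg.solve(X.T @ X, X.T @ y)

# The Hessian 2 X^T X is positive semidefinite, so the error surface is convex.
print(w, np.linalg.eigvalsh(2 * X.T @ X).min() >= 0)
```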
Text explanations
Humans often justify decisions verbally. Similarly, one
model might be trained to generate predictions, and
a separate model, such as a recurrent neural network language model, to generate an explanation.
Local explanations
While it may be difficult to describe succinctly the full
mapping learned by a neural network, some of the
literature focuses instead on explaining what a neural
network depends on locally. One popular approach for
deep neural nets is to compute a saliency map. Typically,
this means taking the gradient of the output corresponding to
the correct class with respect to a given input vector. For
images, this gradient can be applied as a mask, highlighting
regions of the input that, if changed, would most influence
the output.25,30
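A minimal sketch of such a saliency map, assuming PyTorch; `model` stands for any differentiable image classifier and `image` for a single input tensor, both placeholders rather than anything specified in the article:

```python
# Minimal sketch: gradient-based saliency for one image and one target class.
# `model` is any torch.nn.Module classifier; `image` has shape (1, C, H, W).
import torch

def saliency_map(model, image, target_class):
    model.eval()
    image = image.detach().clone().requires_grad_(True)   # track input gradients
    scores = model(image)                                  # forward pass: class scores
    scores[0, target_class].backward()                     # d(score) / d(input)
    # Per-pixel gradient magnitude (max over channels): the regions that,
    # if changed, would most influence the output locally.
    return image.grad.abs().max(dim=1).values.squeeze(0)
```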
Note that these explanations of what a model is
focusing on may be misleading. The saliency map is a local
explanation only. Once you move a single pixel, you may get
a very different saliency map. This contrasts with linear
models, which model global relationships between inputs
and outputs.
Another attempt at local explanations is made by
Ribeiro et al.23 In this work, the authors explain the
decisions of any model in a local region near a particular
point by learning a separate sparse linear model to explain
the decisions of the first. Strangely, although the method’s
appeal over saliency maps owes to its ability to provide
explanations for non-differentiable models, it is more
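The gist can be sketched as follows (an illustrative reimplementation of the idea, not the authors' code; `black_box`, the sampling scale, and the Lasso penalty are all assumed placeholders):

```python
# Rough sketch: explain one prediction of an arbitrary model by fitting a
# proximity-weighted sparse linear surrogate in a local region around x.
import numpy as np
from sklearn.linear_model import Lasso

def local_explanation(black_box, x, n_samples=2000, scale=0.1, alpha=0.01):
    rng = np.random.default_rng(0)
    perturbed = x + scale * rng.standard_normal((n_samples, x.shape[0]))
    predictions = black_box(perturbed)               # query the opaque model
    # Weight perturbed samples by proximity to x, then fit a sparse linear model.
    weights = np.exp(-np.sum((perturbed - x) ** 2, axis=1) / (2 * scale ** 2))
    surrogate = Lasso(alpha=alpha).fit(perturbed, predictions, sample_weight=weights)
    return surrogate.coef_                           # local feature attributions

# Toy usage: a nonlinear "model" explained at one particular point.
black_box = lambda X: np.sin(X[:, 0]) + X[:, 1] ** 2
print(local_explanation(black_box, np.array([0.5, 1.0])))
```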
Explanation by example
One post hoc mechanism for explaining the decisions of
a model might be to report (in addition to predictions)
which other examples are most similar with respect to
the model, a method suggested by Caruana et al.2 Training
a deep neural network or latent variable model for a
discriminative task provides access to not only predictions
but also the learned representations. Then, for any
example, in addition to generating a prediction, you can
use the activations of the hidden layers to identify the
k-nearest neighbors based on the proximity in the space
learned by the model. This sort of explanation by example
has precedent in how humans sometimes justify actions by
analogy. For example, doctors often refer to case studies
to support a planned treatment protocol.
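A minimal sketch of this procedure, assuming PyTorch and scikit-learn; the two-layer encoder, the random data, and the choice of k are placeholders, not details from the article:

```python
# Minimal sketch: explanation by example via nearest neighbors in the space
# of a model's hidden activations. Model and data are random placeholders.
import torch
import torch.nn as nn
from sklearn.neighbors import NearestNeighbors

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # stand-in hidden layers
classifier = nn.Linear(32, 10)                          # stand-in output layer

train_x = torch.randn(500, 64)                          # training inputs
query_x = torch.randn(1, 64)                            # the example to explain

with torch.no_grad():
    train_h = encoder(train_x)                          # learned representations
    query_h = encoder(query_x)
    prediction = classifier(query_h).argmax(dim=1).item()

# Alongside the prediction, report the k training examples the model "sees"
# as most similar in its learned representation space.
neighbors = NearestNeighbors(n_neighbors=5).fit(train_h.numpy())
_, neighbor_ids = neighbors.kneighbors(query_h.numpy())
print(prediction, neighbor_ids[0])
```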
In the neural network literature, Mikolov et al.19 use
such an approach to examine the learned representations
of words after training the word2vec model. While their
model is trained for discriminative skip-gram prediction,
to examine which relationships the model has learned
they enumerate nearest neighbors of words based on distances in the learned embedding space.
DISCUSSION
The concept of interpretability appears simultaneously
important and slippery. Earlier, this article analyzed both
the motivations for interpretability and some attempts by
the research community to confer it. Now let's consider the
implications of this analysis and offer several takeaways.
Linear models are not strictly more interpretable
than deep neural networks. Despite this claim's enduring
popularity, its truth value depends on which notion of
interpretability is employed. With respect to algorithmic
transparency, this claim seems uncontroversial, but
given high-dimensional or heavily engineered features,
linear models lose simulatability or decomposability,
respectively.
When choosing between linear and deep models,
you must often make a tradeoff between algorithmic
transparency and decomposability. This is because
deep neural networks tend to operate on raw or lightly
processed features. So, if nothing else, the features are
intuitively meaningful, and post hoc reasoning is sensible.
To get comparable performance, however, linear models
often must operate on heavily hand-engineered features.
Lipton et al.13 demonstrate such a case where linear
models can approach the performance of recurrent neural
networks (RNNs) only at the cost of decomposability.
For some kinds of post hoc interpretation, deep neural networks may even hold an advantage, since the rich representations they learn can be probed by the techniques described above.
References
1. Athey, S., Imbens, G. W. 2015. Machine learning methods for estimating heterogeneous causal effects; https://arxiv.org/abs/1504.01132v1 (see also ref. 7).
2. Caruana, R., Kangarloo, H., Dionisio, J. D., Sinha, U.,
Johnson, D. 1999. Case-based explanation of non-case-
based learning methods. In Proceedings of the American
Medical Informatics Association (AMIA) Symposium:
212-215.
3. Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M.,
Elhadad, N. 2015. Intelligible models for healthcare:
Predicting pneumonia risk and hospital 30-day
readmission. In Proceedings of the 21st Annual SIGKDD
International Conference on Knowledge Discovery and
Data Mining, 1721-1730.
4. Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., Blei,
D. M. 2009. Reading tea leaves: how humans interpret
topic models. In Proceedings of the 22nd International
Conference on Neural Information Processing Systems
(NIPS), 288-296.