From Understanding to Utilization: A Survey on Explainability for Large Language Models

Haoyan Luo, Imperial College London, [email protected]
Lucia Specia, Imperial College London, [email protected]

arXiv:2401.12874v2 [cs.CL] 22 Feb 2024

Abstract

Explainability for Large Language Models (LLMs) is a critical yet challenging aspect of natural language processing. As LLMs are increasingly integral to diverse applications, their "black-box" nature sparks significant concerns regarding transparency and ethical use. This survey underscores the imperative for increased explainability in LLMs, delving into both the research on explainability and the various methodologies and tasks that utilize an understanding of these models. Our focus is primarily on pre-trained Transformer-based LLMs, such as LLaMA (Touvron et al., 2023), which pose distinctive interpretability challenges due to their scale and complexity. In terms of existing methods, we classify them into local and global analyses, based on their explanatory objectives. When considering the utilization of explainability, we explore several compelling methods that concentrate on model editing, controlled generation, and model enhancement. Additionally, we examine representative evaluation metrics and datasets, elucidating their advantages and limitations. Our goal is to reconcile theoretical and empirical understanding with practical implementation, proposing exciting avenues for explanatory techniques and their applications in the LLMs era.

1 Introduction

In the rapidly evolving field of natural language processing, Large Language Models (LLMs) have emerged as a cornerstone, demonstrating remarkable proficiency across a spectrum of tasks. Despite their effectiveness, LLMs, often characterized as "black-box" systems, present a substantial challenge in terms of explainability and transparency. This opacity can lead to unintended consequences, such as the generation of harmful or misleading content (Gehman et al., 2020), and the occurrence of model hallucinations (Weidinger et al., 2021). These issues underscore the urgency for improved explainability, not just for understanding, but for responsible and ethical application.

Explainability in LLMs serves two critical functions. For end users, it fosters trust by clarifying the model's reasoning in a nontechnical manner, enhancing understanding of their capabilities and potential flaws (Zhao et al., 2023). For developers and researchers, it offers insights into unintended biases and areas of improvement, serving as a tool for improving the performance of the model in downstream tasks (Bastings et al., 2022; Meng et al., 2023a; Li et al., 2023b). However, the scale of LLMs poses unique challenges to explainability. Larger models with more parameters and extensive training data are harder to interpret. Traditional explanation methods such as SHAP values (Lundberg and Lee, 2017) become less practical for these large-scale models (Zhao et al., 2023). Moreover, a comprehensive understanding of LLM-specific phenomena, including in-context learning (Halawi et al., 2023; Hendel et al., 2023; Todd et al., 2023; Wang et al., 2023), along with addressing issues such as model hallucinations (Ji et al., 2023; Chuang et al., 2023) and inherent biases (dev, 2023; An and Rudinger, 2023; Schick et al., 2021), is vital for ongoing refinement in model design.

In this survey, we focus on explainability methods for pre-trained Transformer-based LLMs, often termed base models. These models often scale up in training data and have billions of parameters; examples include GPT-2 (Radford et al., 2019), GPT-J (Chen et al., 2021), GPT-3 (Brown et al., 2020), OPT (Yordanov et al., 2022), and the LLaMA family (Touvron et al., 2023). In Section 2, we categorize and pose research questions based on our survey. Based on this categorization, we review explainability methods in Section 3, followed by a discussion in Section 4 on how these insights are leveraged. We further discuss evaluation methods and metrics in Section 5. Our goal is to synthesize and critically assess contemporary research, aiming to bridge the gap between understanding and practical application of insights derived from complex language models.

2 Overview

The field of LLMs is rapidly advancing, making explainability not only a tool for understanding these complex systems but also essential for their improvement. This section categorizes current explainability approaches, highlights the challenges in ethical and controllable generation, and proposes research questions for future exploration.

Categorization of Methods. Figure 1 presents a structured categorization of explainability methods for pre-trained language models (LMs) and their applications. We divide these into two broad domains: Local Analysis and Global Analysis. Local Analysis covers feature attribution and transformer block analysis, delving into the detailed operations of models. Global Analysis, on the other hand, includes probing-based methods and mechanistic interpretability, offering a comprehensive understanding of model behaviors and capacities. Beyond understanding, we also explore applications of these insights in enhancing LLM capabilities, focusing on model editing, capability enhancement, and controlled generation.

3 Explainability for Large Language Models

3.1 Local Analysis

Local explanations in LLMs aim to elucidate how models generate specific predictions, such as sentiment classification or token predictions, for a given input. This section categorizes local explanation methods into two types: feature attribution analysis and analysis of individual Transformer (Vaswani et al., 2017) components.

3.1.1 Feature Attribution Explanation

Feature attribution analysis, a local method for explaining a prediction, quantifies the relevance of each input token to a model's prediction. Given an input text x with n tokens {x_1, x_2, ..., x_n}, a pre-trained language model f outputs f(x). Attribution methods assign a relevance score R(x_i) (Modarressi et al., 2022; Ferrando et al., 2022; Modarressi et al., 2023) to each token x_i, reflecting its contribution to f(x). This category includes perturbation-based, gradient-based, and vector-based methods.

Perturbation-Based Methods. Perturbation-based methods, such as LIME (Ribeiro et al., 2016) and SHAP (Lundberg and Lee, 2017), alter input features to observe changes in model output. However, this removal strategy assumes input features are independent and ignores correlations among them. Additionally, models can be over-confident even when the predictions are nonsensical or wrong (Feng et al., 2018). These methods also face challenges in efficiency and reliability, as highlighted in (Atanasova et al., 2020), leading to their diminished emphasis in recent attribution research.

Gradient-Based Methods. One might consider gradient-based explanation methods a natural approach for feature attribution. This type of method computes per-token importance scores (Kindermans et al., 2016) using backward gradient vectors. Techniques such as gradient × input (Kindermans et al., 2017) and integrated gradients (IG) (Sundararajan et al., 2017) accumulate the gradients obtained as the input is interpolated between a reference point and the actual input. Despite their widespread use, one main challenge of IG is the computational overhead of achieving high-quality integrals (Sikdar et al., 2021; Enguehard, 2023). Their attribution scores have also been shown to be unreliable in terms of faithfulness (Ferrando et al., 2022), and their ability to elucidate the forward dynamics within hidden states remains constrained.
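To make the gradient-based recipe concrete, the following is a minimal sketch (not taken from any of the cited papers) of gradient × input attribution for a small causal LM from the HuggingFace transformers library; the choice of "gpt2", the prompt, and the L2-norm reduction are illustrative assumptions.

    # Minimal sketch of gradient x input token attribution for a causal LM.
    # "gpt2" and the L2-norm reduction are illustrative choices, not the exact
    # procedure of any paper cited above.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    enc = tokenizer("The capital of France is", return_tensors="pt")
    input_ids = enc["input_ids"]

    # Embed tokens manually so gradients can flow back to the input embeddings.
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    outputs = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])

    # Score the model's top prediction for the next token.
    next_logits = outputs.logits[0, -1]
    target = next_logits.argmax()
    next_logits[target].backward()

    # Gradient x input, reduced to one scalar per token (L2 norm of the product).
    attributions = (embeds.grad[0] * embeds[0]).norm(dim=-1)
    for tok, score in zip(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()), attributions):
        print(f"{tok:>12s}  {score.item():.4f}")

Integrated gradients would instead average such gradients over several interpolations between a reference input (e.g., all-zero embeddings) and the actual input, which is where the computational overhead discussed above comes from.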
Vector-Based Methods. Vector-based analyses, which focus on token representation formation, have emerged as a key approach. Approaches range from global attribution from the final output layer to more granular, layer-wise decomposition of token representations (Chen et al., 2020; Modarressi et al., 2022). Consider decomposing the i-th token representation in layer l ∈ {0, 1, 2, ..., L, L+1}[1], i.e., x_i^l ∈ {x_1^l, x_2^l, ..., x_N^l}, into elemental vectors attributable to each of the N input tokens:

    x_i^l = \sum_{k=1}^{N} x_{i \Leftarrow k}^l    (1)

The norm (Modarressi et al., 2022) or the L1 norm (Ferrando et al., 2022) of the attribution vector for the k-th input (x_{i \Leftarrow k}^l) can be used to quantify its total attribution to x_i^l.

[1] l = 0 is the input embedding layer and l = L+1 is the language model head over the last decoder layer.
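As a small illustration of how Equation (1) is used, the sketch below assumes the decomposition vectors x^l_{i⇐k} have already been produced by some vector-based method such as GlobEnc or DecompX (obtaining them is the hard part and is not shown; random placeholders stand in for them) and reduces them to a token-by-token attribution matrix via vector norms.

    # Sketch of using Eq. (1): given decomposition vectors x^l_{i<=k} (random
    # placeholders here, standing in for the output of a vector-based method),
    # the attribution of input token k to representation i is a norm of the
    # corresponding vector.
    import numpy as np

    N, d = 6, 768                      # number of tokens and hidden size (illustrative)
    decomp = np.random.randn(N, N, d)  # decomp[i, k] plays the role of x^l_{i<=k}

    # Sanity check of Eq. (1): the token representation is the sum over k.
    x_l = decomp.sum(axis=1)           # x^l_i = sum_k x^l_{i<=k}

    attr_l2 = np.linalg.norm(decomp, ord=2, axis=-1)   # ||x^l_{i<=k}||_2
    attr_l1 = np.abs(decomp).sum(axis=-1)              # ||x^l_{i<=k}||_1

    # Row i of the (row-normalized) matrix gives the attribution of each input
    # token k to the i-th token representation at layer l.
    attr = attr_l2 / attr_l2.sum(axis=1, keepdims=True)
    print(attr.shape)  # (N, N)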
Figure 1: Categorization of literature on explainability in LLMs, focusing on techniques (left, Section 3) and their applications (right, Section 4).

Although several established strategies, such as attention rollouts (Abnar and Zuidema, 2020; Ferrando et al., 2022; Modarressi et al., 2022), focus on the global impact of inputs on outputs by aggregating the local behaviors of all layers, they often neglect the Feed-Forward Network (FFN) in the analyses due to its nonlinearities. Recent works address this by approximating and decomposing activation functions and constructing decomposed token representations throughout layers (Yang et al., 2023; Modarressi et al., 2023). Empirical evaluations demonstrate the efficacy of vector-based analysis and exemplify the potential of such methods in dissecting each hidden state representation within transformers.

3.1.2 Dissecting Transformer Blocks

Tracking a Transformer block's component-by-component internal processing can provide rich information on its intermediate computations, given the stacked architecture of decoder-based language models (Kobayashi et al., 2023). In a transformer inference pass, the input embeddings are transformed through a sequence of L transformer layers, each composed of a multi-head self-attention (MHSA) sublayer followed by an MLP sublayer (Vaswani et al., 2017). Formally, the representation x_i^l of token i at layer l is obtained by:

    x_i^l = x_i^{l-1} + a_i^l + m_i^l    (2)

where a_i^l and m_i^l are the outputs from the l-th MHSA and MLP sublayers, respectively.[2] While studies have frequently analyzed individual Transformer components (Kobayashi et al., 2020; Modarressi et al., 2022), the interaction between these sublayers is less explored, presenting an avenue for future research.

[2] For brevity, bias terms and layer normalization (Ba et al., 2016) are omitted, as they are nonessential for most of the analysis.
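As a concrete way to inspect the decomposition in Equation (2), the sketch below registers forward hooks on the attention and MLP submodules of a HuggingFace GPT-2 model and checks that consecutive hidden states differ by a^l + m^l. The module paths (transformer.h[l].attn and .mlp) are specific to that implementation and are an assumption of this example, not part of the surveyed methods.

    # Sketch: record the MHSA and MLP sublayer outputs (a^l and m^l in Eq. (2))
    # with forward hooks. Module names follow HuggingFace's GPT-2 implementation
    # and differ for other architectures.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    attn_out, mlp_out = {}, {}

    def save(store, layer):
        def hook(module, inputs, output):
            # GPT-2's attention returns a tuple; its first element is the sublayer output.
            store[layer] = (output[0] if isinstance(output, tuple) else output).detach()
        return hook

    handles = []
    for l, block in enumerate(model.transformer.h):
        handles.append(block.attn.register_forward_hook(save(attn_out, l)))
        handles.append(block.mlp.register_forward_hook(save(mlp_out, l)))

    enc = tokenizer("Rome is the capital of Italy", return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    for h in handles:
        h.remove()

    # For GPT-2's pre-LN blocks, hidden_states[l+1] = hidden_states[l] + a^l + m^l
    # holds exactly, so the printed difference should be (numerically) zero.
    l = 3
    approx = out.hidden_states[l] + attn_out[l] + mlp_out[l]
    print((approx - out.hidden_states[l + 1]).abs().max())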
Figure 2: Studied role of each Transformer component. (a) gives an overview of the attention mechanism in Transformers. Sizes of the colored circles illustrate the value of the scalar or the norm of the corresponding vector (Kobayashi et al., 2020). (b) analyzes the FFN updates in the vocabulary space, showing that each update can be decomposed into sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable (Geva et al., 2022).

Analyzing MHSA Sublayers. Attention mechanisms in MHSA sublayers are instrumental in capturing meaningful correlations between intermediate states of the input that can explain the model's predictions. Visualizing attention weights and utilizing gradient attribution scores are two primary methods for analyzing these sublayers (Zhao et al., 2023). Many studies have analyzed the linguistic capabilities of Transformers by tracking attention weights (Abnar and Zuidema, 2020; Katz and Belinkov, 2023; Kobayashi et al., 2023). For instance, attention mechanisms typically prioritize specific tokens while diminishing the emphasis on frequent words or special tokens, a phenomenon observable through norm-based analysis metrics, as illustrated in Figure 2(a) (Kobayashi et al., 2020). In gradient analysis, some methods calculate gradients as partial derivatives of model outputs with respect to attention weights (Barkan et al., 2021), while others use integrated gradients, which are cumulative versions of these partial derivatives (Hao et al., 2021). Generally, these combined approaches, which integrate attention metrics with gradient information, tend to outperform methods using either metric in isolation.
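A minimal starting point for this kind of analysis is simply to pull the attention weights out of a model. The sketch below does so for a HuggingFace GPT-2 model and prints how much attention the final position pays to each token; the model name and the particular layer and head inspected are arbitrary illustrative choices, and depending on the transformers version an eager attention implementation may be needed for attention weights to be returned.

    # Sketch: inspect MHSA attention weights. output_attentions=True returns one
    # tensor per layer of shape (batch, heads, query, key). Some transformers
    # versions require the eager attention implementation for this to work.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    enc = tokenizer("When Mary and John went to the store, John gave a drink to",
                    return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_attentions=True)

    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    layer, head = 5, 3                      # an arbitrary layer/head to inspect
    attn = out.attentions[layer][0, head]   # (query positions, key positions)

    # Attention received by each key position from the final query position.
    received = attn[-1]
    for tok, w in zip(tokens, received):
        print(f"{tok:>10s}  {w.item():.3f}")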
Analyzing MLP Sublayers. More recently, a surge of works has investigated the knowledge captured by the FFN layers (Geva et al., 2022; Dai et al., 2022). These layers, consuming the majority of each layer's parameter budget at 8d^2 compared to 4d^2 for self-attention layers (where d represents the model's hidden dimension), function akin to key-value memories (Geva et al., 2021). Here, each "key" is associated with specific textual patterns identified during training, and each "value" generates a corresponding output vocabulary distribution (Geva et al., 2021). Figure 2(b) focuses on the FFN outputs, illustrating how each update within these layers can be broken down into sub-updates linked to individual parameter vectors, often encoding concepts that are interpretable to humans (Geva et al., 2022). Additionally, there is an emerging interest in input-independent methods, which interpret model parameters directly, thus eliminating the need for a forward pass (Dar et al., 2023).

3.2 Global Analysis

In contrast to local analysis, which focuses on elucidating individual model predictions, global analysis aims to understand and explain the knowledge or linguistic properties encoded in the hidden state activations of a model. This section explores two primary approaches to global analysis: probing methods that scrutinize model representations, and mechanistic interpretability (Transformer Circuits, 2022), an emerging perspective that seeks to reverse engineer the inner workings of deep neural networks.

3.2.1 Probing-Based Methods

Self-supervised pre-training endows models with extensive linguistic knowledge, derived from large-scale training datasets. Probing-based methods are employed to capture the internal representations
within these networks. This approach involves training a classifier, known as a probe, on the network's activations to distinguish between various types of inputs or outputs. In the following sections, we discuss studies related to probing, categorized based on their objectives: probing for semantic knowledge or analyzing learned representations.

Probing Knowledge. LLMs trained on extensive text corpora are recognized for their ability to encapsulate context-independent semantic and factual knowledge accessible via textual prompts (Petroni et al., 2019). Research in this area primarily focuses on formulating textual queries to extract various types of background knowledge from language models (Hewitt and Manning, 2019; Peng et al., 2022). Interestingly, probes can sometimes unearth factual information even in scenarios where language models may not reliably produce truthful outputs (Hernandez et al., 2023).

Probing Representations. LLMs are adept at developing context-dependent knowledge representations. To analyze these, probing classifiers are applied, typically involving a shallow classifier trained on the activations of attention heads to predict specific features. A notable study in this area involved training linear classifiers to identify a select group of attention heads that exhibit high linear probing accuracy for truthfulness (Li et al., 2023b). This research revealed a pattern of specialization across attention heads, with the representation of "truthfulness" predominantly processed in the early to middle layers, and only a few heads in each layer showing standout performance. Such insights pave the way for exploring more complex representations. For instance, research by (Li et al., 2023a) has revealed nonlinear internal representations, such as board game states, in models that initially lack explicit knowledge of the game or its rules.
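The basic recipe is short enough to sketch end to end: extract hidden states from a frozen model and fit a shallow classifier on them. The toy texts, the "positive/negative" property, the choice of layer, and the use of GPT-2 below are illustrative assumptions rather than the setup of any study cited above; a real probing study would also use held-out data and control tasks.

    # Minimal probing sketch: fit a linear classifier on frozen hidden states from
    # one layer to test whether a property is linearly decodable. The tiny toy
    # dataset and the "positive/negative" property are illustrative only.
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2").eval()

    texts = ["I loved this movie", "A wonderful experience", "Absolutely fantastic",
             "I hated this movie", "A terrible experience", "Absolutely awful"]
    labels = [1, 1, 1, 0, 0, 0]
    layer = 6

    feats = []
    for t in texts:
        enc = tokenizer(t, return_tensors="pt")
        with torch.no_grad():
            hs = model(**enc, output_hidden_states=True).hidden_states[layer]
        feats.append(hs[0, -1].numpy())    # representation of the last token

    probe = LogisticRegression(max_iter=1000).fit(feats, labels)
    print("training accuracy of the probe:", probe.score(feats, labels))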
3.2.2 Mechanistic Interpretability

Mechanistic interpretability seeks to comprehend language models by examining individual neurons and their interconnections, often conceptualized as circuits (Transformer Circuits, 2022; Zhao et al., 2023). This field encompasses various approaches, which can be primarily categorized into three groups: circuit discovery, causal tracing, and vocabulary lens. Each of these approaches offers distinct perspectives and insights into the mechanisms of language models.

Circuit Discovery. The circuit-based mechanistic interpretability approach aims to align learned model representations with known ground truths, initially by reverse-engineering the model's algorithm to fully comprehend its feature set (Chughtai et al., 2023). A prominent example of this approach is the analysis of GPT-2 small (Radford et al., 2019), where a study identified a human-understandable subgraph within the computational graph responsible for performing the indirect object identification (IOI) task (Wang et al., 2022). In IOI, sentences like "When Mary and John went to the store, John gave a drink" are expected to be completed with "Mary". The study discovered a circuit comprising 26 attention heads – just 1.1% of the total (head, token position) pairs – that predominantly manages this task. This circuits-based mechanistic view provides opportunities to scale our understanding to both larger models and more complex tasks, including recent explorations into In-Context Learning (ICL) (Halawi et al., 2023; Hendel et al., 2023; Todd et al., 2023; Wang et al., 2023).

Causal Tracing. The concept of causal analysis in machine learning has evolved from early methods that delineate dependencies between hidden variables using causal graphs (Pearl et al., 2000) to more recent approaches like causal mediation analysis (Vig et al., 2020). This newer method quantifies the impact of intermediate activations in neural networks on their output (Meng et al., 2023a). Specifically, (Meng et al., 2023a) assesses each activation's contribution to accurate factual predictions through three distinct operational phases: a clean run generating correct predictions, a corrupted run where predictions are impaired, and a corrupted-with-restoration run that evaluates the ability of a single state to rectify the prediction. Termed causal tracing, this approach has identified crucial causal states predominantly in the middle layers, particularly at the last subject position, where MLP contributions are most significant (Figure 3). This finding underscores the role of middle-layer MLPs in factual recall within LLMs.
Figure 3: The intensity of each grid cell represents the average causal indirect effect of a hidden state on expressing a factual
association. Darker cells indicate stronger causal mediators. It was found that the MLPs at the last subject token and the attention
modules at the last token play crucial roles. (Meng et al., 2023a)
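The three runs can be imitated in a few dozen lines. The following is a much-simplified sketch in the spirit of the causal tracing procedure described above, not a reproduction of the original implementation: the GPT-2 module paths, the hard-coded subject token positions, the noise scale, and the single (layer, position) restored are all illustrative assumptions.

    # Much-simplified causal tracing sketch (in the spirit of Meng et al., 2023a):
    # (1) clean run, (2) corrupt the subject token embeddings with noise, (3) re-run
    # while restoring one layer's hidden state at one position from the clean run.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt = "The Eiffel Tower is located in the city of"
    subject_positions = [1, 2, 3, 4]   # approximate positions of "Eiffel Tower" (illustrative)
    enc = tok(prompt, return_tensors="pt")

    def next_token_prob(logits, token_id):
        return torch.softmax(logits[0, -1], dim=-1)[token_id].item()

    # (1) Clean run: cache hidden states and the predicted token.
    with torch.no_grad():
        clean = model(**enc, output_hidden_states=True)
    answer_id = clean.logits[0, -1].argmax().item()
    p_clean = next_token_prob(clean.logits, answer_id)

    # (2) Corrupted run: add noise to the subject embeddings via an embedding hook.
    def corrupt_hook(module, inputs, output):
        output = output.clone()
        output[0, subject_positions] += 0.5 * torch.randn_like(output[0, subject_positions])
        return output

    # (3) Restoration: overwrite one block's output at one position with the clean state.
    def make_restore_hook(layer, position):
        def hook(module, inputs, output):
            hidden = output[0].clone()
            hidden[0, position] = clean.hidden_states[layer + 1][0, position]
            return (hidden,) + output[1:]
        return hook

    layer, position = 6, subject_positions[-1]
    h1 = model.transformer.wte.register_forward_hook(corrupt_hook)
    with torch.no_grad():
        p_corrupt = next_token_prob(model(**enc).logits, answer_id)
    h2 = model.transformer.h[layer].register_forward_hook(make_restore_hook(layer, position))
    with torch.no_grad():
        p_restored = next_token_prob(model(**enc).logits, answer_id)
    h1.remove(); h2.remove()

    print(f"clean {p_clean:.3f}  corrupted {p_corrupt:.3f}  restored {p_restored:.3f}")

Sweeping the restored (layer, position) pair over the whole grid and averaging over noise samples yields maps like the one summarized in Figure 3.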

Vocabulary Lens. Recent work has suggested that model knowledge and knowledge retrieval may be localized within small parts of a language model (Geva et al., 2021) by projecting weights and hidden states onto their vocabulary space. To analyze the components in vocabulary space, we read from each token component x_k^l at layer l at the last token position N (N is omitted here), by projecting with the unembedding matrix E:

    p_k^l = softmax(E \, ln(x_k^l))    (3)

where ln stands for the layer normalization before the LM head. (Belrose et al., 2023) refines model predictions at each transformer layer and decodes hidden states into vocabulary distributions based on this method. Exploring this avenue further, (Geva et al., 2022) illuminated the role of transformer feed-forward layers in predictions, spotlighting specific conceptual emphases via FFN sub-updates. There is also a growing interest in input-independent methodologies, where model parameters are interpreted directly, bypassing a forward pass (Dar et al., 2023).

Augmenting projection-focused interpretations, (Din et al., 2023) first unveiled a feasible application for such projections, suggesting early-exit strategies by treating hidden state representations as final outputs. (Geva et al., 2023) pinpointed two critical junctures where information propagates to the final predictions, via projections and attention edge intervention. While much of the focus has been on how hidden states relate to model outputs, recent works have also highlighted the roles of individual tokens, revealing that their contributions through attention outputs are laden with rich semantic information (Ram et al., 2023; Katz and Belinkov, 2023).
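Equation (3) is easy to try directly. The sketch below applies this vocabulary projection (the "logit lens" view) to a HuggingFace GPT-2 model, decoding the last-position hidden state of every third layer; the model, prompt, and layer stride are illustrative choices, and the tuned lens of Belrose et al. (2023) would additionally learn a per-layer affine translator before the projection.

    # Sketch of Eq. (3): project an intermediate hidden state through the final
    # layer norm and the unembedding matrix to read it in vocabulary space.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    enc = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)

    for layer in range(0, len(out.hidden_states), 3):
        x = out.hidden_states[layer][0, -1]                 # last-position state at this layer
        logits = model.lm_head(model.transformer.ln_f(x))   # E * ln(x) as in Eq. (3)
        top = torch.softmax(logits, dim=-1).topk(3)
        print(layer, [(tok.decode([int(i)]), round(p.item(), 3))
                      for p, i in zip(top.values, top.indices)])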
4 Leveraging Explainability

In this section, we discuss how explainability can be used as a tool to debug and improve models. Although various approaches aim to improve model capabilities with fine-tuning or re-training, we focus on methods specifically designed with a strong foundation in model explainability.

4.1 Model Editing

Despite the ability to train proficient LLMs, the methodology for ensuring their relevance and rectifying errors remains elusive. In recent years, there has been a surge in techniques for editing LLMs. The goal is to efficiently modify the knowledge or behavior of LLMs within specific domains without adversely affecting their performance on other inputs (Yao et al., 2023).

Hypernetwork Knowledge Editors. This type of knowledge editor includes memory-based models and editors with additional parameters. Memory-based models store all edit examples explicitly in memory, building on the explainability finding of key-value memories inside the FFN (Section 3.1.2). They can then employ a retriever to extract the most relevant edit facts for each new input, guiding the model to generate the edited fact. SERAC (Mitchell et al., 2022), for instance, adopts a distinct counterfactual model while leaving the original model unchanged. Editors with additional parameters introduce extra trainable parameters within LLMs. These parameters are trained on a modified dataset while the original model parameters remain static. For example, T-Patcher (Huang et al., 2023) integrates one neuron (patch) for one mistake in the last FFN layer of the model, which takes effect only when encountering its corresponding mistake.

Locate-Then-Edit. The locate-then-edit paradigm first identifies the parameters corresponding to the specific knowledge and then modifies them by directly updating the target parameters. The Knowledge Neuron (KN) method (Dai et al., 2022) introduces a knowledge attribution technique to pinpoint the "knowledge neuron" (a key-value pair in the FFN matrix) that embodies the knowledge, and then updates these neurons. ROME (Meng et al., 2023a) and MEMIT (Meng et al., 2023b) apply causal tracing (Section 3.2.2) to locate the editing area. Instead of modifying the knowledge neurons in the FFN, ROME alters the entire matrix. Building on these two methods, PMET (Li et al., 2023c) also involves the attention value to achieve better performance.

4.2 Enhancing Model Capability

While LLMs demonstrate versatility in various NLP tasks, insights from explainability can significantly enhance these capabilities. This section highlights two key tasks where explainability has shown considerable impact in recent work: improving the utilization of long text (Xiao et al., 2023; Liu et al., 2023; Pope et al., 2022) and enhancing the performance of In-Context Learning (ICL) (Hendel et al., 2023; Halawi et al., 2023; Wang et al., 2023).

4.2.1 Improving Utilization of Long Text

The optimization of handling long text aims to enhance the ability of LLMs to capture and effectively utilize content within longer contexts. This is particularly challenging because LLMs tend to struggle with generalizing to sequence lengths longer than what they were pretrained on, such as the 4K limit for Llama-2 (Touvron et al., 2023). Window attention (Beltagy et al., 2020) maintains a fixed-size sliding window on the key-value (KV) states of the most recent tokens. While this approach ensures constant memory usage and decoding speed after the cache is initially filled, it faces limitations when the sequence length exceeds the cache size (Liu et al., 2023). An innovative solution proposed by (Xiao et al., 2023) takes advantage of MHSA explanations (Section 3.1.2) for LLMs, which allocate a significant amount of attention to the initial tokens. They introduce StreamingLLM, a simple and efficient framework that allows LLMs to handle unlimited text without fine-tuning. This is achieved by retaining the "attention sink," which consists of several initial tokens, in the KV states (Figure 4). The authors also demonstrate that pre-training models with a dedicated sink token can further improve streaming performance.
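The cache policy behind this idea is simple enough to sketch without any framework: always keep a handful of initial "sink" positions plus a recent window and evict everything in between. The function below is a toy illustration, not the StreamingLLM implementation, which must also handle the positional re-indexing of the retained cache; the default sizes are arbitrary.

    # Toy sketch of an attention-sink KV-cache policy: keep the first few "sink"
    # positions plus the most recent window, evicting everything in between.
    # Real implementations also re-index positions of the retained entries.
    def positions_to_keep(seq_len: int, n_sink: int = 4, window: int = 1024) -> list[int]:
        if seq_len <= n_sink + window:
            return list(range(seq_len))
        sinks = list(range(n_sink))
        recent = list(range(seq_len - window, seq_len))
        return sinks + recent

    # Example: with 2048 cached tokens, keep positions 0-3 and the last 1024.
    kept = positions_to_keep(2048)
    print(len(kept), kept[:6], kept[-3:])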
4.2.2 Improving In-Context Learning

In-context Learning (ICL) has emerged as a powerful capability alongside the development of scaled-up LLMs (Brown et al., 2020). ICL stands out because it does not require extensive updates to the vast number of model parameters and relies on human-understandable natural language instructions (Dong et al., 2023). As a result, it offers a promising approach to harness the full potential of LLMs. With mechanistic interpretability (Section 3.2.2), (Wang et al., 2023) reveal that label words in the demonstration examples function as anchors, which can be used to improve ICL performance with a simple anchor re-weighting method. (Halawi et al., 2023) study harmful imitation in ICL through the vocabulary lens to inspect a model's internal representations (Section 3.2.2), and identify two related phenomena: overthinking and false induction heads, heads in late layers that attend to and copy false information from previous demonstrations and whose ablation improves ICL performance. Furthermore, using causal tracing (Section 3.2.2), (Hendel et al., 2023; Todd et al., 2023) find that a small number of attention heads transport a compact representation of the demonstrated task, which they call a task vector or function vector (FV). These FVs can be summed to create vectors that trigger new complex tasks and improve performance for few-shot prompting (Todd et al., 2023).

4.3 Controllable Generation

Though large language models have obtained superior performance in text generation, they sometimes fall short of producing factual content. Leveraging explainability provides opportunities for building fast, inference-time techniques that improve generation models' factuality, calibration, and controllability, and that align them more with human preferences.

4.3.1 Reducing Hallucination

Hallucinations in LLMs refer to generated content that is not based on training data or facts; various factors such as imperfect learning and decoding contribute to this (Ji et al., 2023). To mitigate hallucinations, initial approaches used reinforcement learning from human feedback (Ouyang et al., 2022) and distillation into smaller models such as Alpaca (Li et al., 2023d). Leveraging explainability provides a significantly less expensive way to reduce hallucination, with the advantage of being adjustable and minimally invasive.
Figure 4: (a) Dense Attention (Vaswani et al., 2017) has O(T^2) time complexity and an increasing cache size; its performance decreases when the text length exceeds the pre-training text length. (b) Window Attention caches the most recent L tokens' KV; while efficient in inference, performance declines sharply once the starting tokens' keys and values are evicted. (c) Sliding Window with Re-computation (Pope et al., 2022) performs well on long texts, but its O(TL^2) complexity, stemming from quadratic attention in context re-computation, makes it considerably slow. (d) StreamingLLM (Xiao et al., 2023) keeps the attention sink (several initial tokens) for stable attention computation, combined with the recent tokens; it is efficient and offers stable performance on extended texts.

For example, (Li et al., 2023b) use as few as 40 samples to locate "truthful" heads and directions through a trained probe (Section 3.2.1). They propose inference-time intervention (ITI), a computationally inexpensive strategy that intervenes on attention heads to shift the activations in the "truthful" direction, which achieves comparable or better performance relative to the instruction-finetuned model.
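The mechanics of such an intervention are easy to sketch with forward hooks, as below: a fixed steering vector is added to the attention sublayer output of a few layers during generation. This is a simplified stand-in for ITI, which learns per-head "truthful" directions with probes and intervenes before the output projection; the random direction, the chosen layers, the scale alpha, and the GPT-2 module paths are all illustrative assumptions.

    # Simplified inference-time intervention sketch: add a steering vector to the
    # attention sublayer output of selected layers at generation time. ITI proper
    # uses probed per-head directions applied before the output projection.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    direction = torch.randn(model.config.hidden_size)   # placeholder for a probed direction
    direction = direction / direction.norm()
    alpha, layers = 5.0, [8, 9, 10]

    def steer(module, inputs, output):
        out = output[0] + alpha * direction              # shift activations along the direction
        return (out,) + output[1:]

    handles = [model.transformer.h[l].attn.register_forward_hook(steer) for l in layers]

    enc = tok("Q: What happens if you crack your knuckles?\nA:", return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**enc, max_new_tokens=30, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0], skip_special_tokens=True))

    for h in handles:
        h.remove()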
4.3.2 Ethical Alignment

As research on AI fairness gains increasing importance, there have been efforts to detect social bias (Fleisig et al., 2023; An and Rudinger, 2023) and suppress toxicity (Gehman et al., 2020; Schick et al., 2021) in LMs. Many previous debiasing methods (Qian et al., 2022) have focused on constructing anti-stereotypical datasets and then either retraining the LM from scratch or conducting fine-tuning. This line of debiasing approaches, although effective, comes with high costs for data construction and model retraining. Moreover, it faces the challenge of catastrophic forgetting if fine-tuning is performed (Zhao et al., 2023). While little work has focused on the interpretability side of fairness research, (dev, 2023) explore interpreting and mitigating social biases in LLMs by introducing the concept of social bias neurons. Inspired by the gradient-based attribution method IG (Section 3.1.1), they introduce an interpretable technique, denoted integrated gap gradient (IG²), to pinpoint social bias neurons by back-propagating and integrating the gradients of the logits gap for a selected pair of demographics.[3] Using this interpretation, they suppress the activations of the pinpointed neurons to mitigate bias. Extensive experiments have verified the effectiveness of this method and demonstrated the potential applicability of explainability methods for ethical alignment research in LLMs.

[3] Demographics include properties like gender, sexuality, occupation, etc. Nine common demographics are collected, and pairs of demographics are selected to reveal the fairness gap (dev, 2023).

5 Evaluation

Recently, LLMs such as GPT-4 (OpenAI, 2023) have shown impressive abilities to generate natural language explanations for their predictions. However, it remains unclear whether these explanations actually help humans understand the reasoning of the model (Zhao et al., 2023). Specifically designed evaluation methods are needed to better assess the performance of explainability methods, such as attribution. Furthermore, calibrated datasets and metrics are required to evaluate the application of explainability to downstream tasks, such as truthfulness evaluation.[4]

[4] Due to space limits, we only discuss the most commonly used evaluation approaches in explainability research.

5.1 Evaluating Explanation Plausibility

One common technique to evaluate the plausibility of attribution analysis is to remove the K% of tokens with the highest or lowest estimated importance and observe the impact on the model output (Chen et al., 2020; Modarressi et al., 2023). Another approach to assessing explanation plausibility involves indirect methods, such as measuring the performance of model editing, particularly for
“locate-then-edit” editing methods, which heavily rely on interpretation accuracy. Recent research (Yao et al., 2023; Zhao et al., 2023) suggests that having evaluation datasets is crucial for evaluating factual editing in LLMs. Two commonly used datasets for this purpose are ZsRE (Levy et al., 2017), a Question Answering (QA) dataset that employs question rephrasings generated through back-translation, and CounterFact (Meng et al., 2023a), a more challenging dataset that includes counterfacts that start with low scores compared to correct facts.
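The deletion-style check in the first approach is straightforward to run. The sketch below masks the top-K% highest-scoring tokens (with random placeholder scores standing in for any attribution method from Section 3.1.1) and measures the drop in the probability the model assigns to its original next-token prediction; the model, prompt, masking token, and K are illustrative assumptions, and a faithful attribution method should produce a larger drop than this random baseline.

    # Minimal sketch of deletion-based evaluation of an attribution method: mask the
    # top-K% highest-scoring tokens and measure the drop in the probability of the
    # model's original prediction. The scores below are random placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    enc = tok("The quick brown fox jumps over the lazy dog because it is", return_tensors="pt")
    ids = enc["input_ids"][0]

    with torch.no_grad():
        logits = model(**enc).logits[0, -1]
    pred = logits.argmax().item()
    p_orig = torch.softmax(logits, dim=-1)[pred].item()

    scores = torch.rand(len(ids))        # placeholder attribution scores
    scores[-1] = -1.0                    # never remove the position we condition on
    k = max(1, int(0.2 * len(ids)))      # remove the top 20%
    masked = ids.clone()
    # GPT-2 has no dedicated mask token; the eos token stands in for a removed token.
    masked[scores.topk(k).indices] = tok.eos_token_id

    with torch.no_grad():
        p_masked = torch.softmax(model(input_ids=masked.unsqueeze(0)).logits[0, -1],
                                 dim=-1)[pred].item()

    print(f"p(original prediction): {p_orig:.3f} -> {p_masked:.3f} after masking top-{k} tokens")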
ing attention flow in transformers. In Proceedings
of the 58th Annual Meeting of the Association for
5.2 Evaluating Truthfulness Computational Linguistics, pages 4190–4197, On-
Model truthfulness is an important metric for mea- line. Association for Computational Linguistics.
suring the trustworthiness of generative models. Haozhe An and Rachel Rudinger. 2023. Nichelle and
We expect model outputs to be both informative nancy: The influence of demographic attributes and
and factually correct and faithful. Ideally, human tokenization length on first name biases. In Proceed-
ings of the 61st Annual Meeting of the Association
annotators would label model answers as true or for Computational Linguistics (Volume 2: Short Pa-
false, given a gold standard answer, but this is of- pers), pages 388–401, Toronto, Canada. Association
ten costly. (Lin et al., 2022) propose the use of for Computational Linguistics.
two fine-tuned GPT-3-13B models (GPT-judge) to Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain,
classify each answer as true or false and informa- Deep Ganguli, Tom Henighan, Andy Jones, Nicholas
tive or not. Evaluation using GPT-judge is a stan- Joseph, Ben Mann, Nova DasSarma, Nelson El-
dard practice on TruthfulQA benchmark, a widely hage, Zac Hatfield-Dodds, Danny Hernandez, Jack-
son Kernion, Kamal Ndousse, Catherine Olsson,
used dataset adversarially constructed to measure Dario Amodei, Tom Brown, Jack Clark, Sam Mc-
whether a language model is truthful in generat- Candlish, Chris Olah, and Jared Kaplan. 2021. A
ing answers (Askell et al., 2021; Li et al., 2023b; general language assistant as a laboratory for align-
Chuang et al., 2023). The main metric of Truth- ment.
fulQA is true*informative, a product of scalar Pepa Atanasova, Jakob Grue Simonsen, Christina Li-
truthful and informative scores. This metric not oma, and Isabelle Augenstein. 2020. A diagnostic
only captures how many questions are answered study of explainability techniques for text classifi-
cation. In Proceedings of the 2020 Conference on
truthfully but also prevents the model from indis- Empirical Methods in Natural Language Processing
criminately replying with “I have no comment” by (EMNLP), pages 3256–3274, Online. Association for
assessing the informativeness of each answer. Computational Linguistics.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hin-
6 Conclusion ton. 2016. Layer normalization.
In this survey, we have presented a comprehensive Oren Barkan, Edan Hauon, Avi Caciularu, Ori Katz,
overview of explainability for LLMs and their ap- Itzik Malkiel, Omri Armstrong, and Noam Koenig-
stein. 2021. Grad-sam: Explaining transformers
plications. We have summarized methods for local via gradient self-attention maps. In Proceedings of
and global analysis based on the objectives of expla- the 30th ACM International Conference on Informa-
nations. In addition, we have discussed the use of tion & Knowledge Management, CIKM ’21, page
explanations to enhance models and the evaluation 2882–2887, New York, NY, USA. Association for
Computing Machinery.
of these methods. Major future research directions
to understanding LLM include developing explana- Jasmijn Bastings, Sebastian Ebert, Polina Zablotskaia,
tion methods tailored to different language models Anders Sandholm, and Katja Filippova. 2022. "will
you find these shortcuts?" a protocol for evaluating
and making LLMs more trustworthy and aligned the faithfulness of input salience methods for text
with human values by using explainability knowl- classification.
edge. As LLMs continue to advance, explainability
Nora Belrose, Zach Furman, Logan Smith, Danny Ha-
will become incredibly vital to ensure that these lawi, Igor Ostrovsky, Lev McKinney, Stella Bider-
models are transparent, fair, and beneficial. We man, and Jacob Steinhardt. 2023. Eliciting latent
hope that this review of the literature provides a predictions from transformers with the tuned lens.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. 2020. Language models are few-shot learners.

Hanjie Chen, Guangtao Zheng, and Yangfeng Ji. 2020. Generating hierarchical explanations on text classification via feature interaction detection. In Proceedings of ACL 2020.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, et al. 2021. Evaluating large language models trained on code.

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. 2023. DoLa: Decoding by contrasting layers improves factuality in large language models.

Bilal Chughtai, Lawrence Chan, and Neel Nanda. 2023. A toy model of universality: Reverse engineering how networks learn group operations.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of ACL 2022.

Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. 2023. Analyzing transformers in embedding space. In Proceedings of ACL 2023.

Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Mor Geva. 2023. Jump to conclusions: Short-cutting transformers with linear transformations.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A survey on in-context learning.

Joseph Enguehard. 2023. Sequential integrated gradients: A simple but effective method for explaining language models.

Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Proceedings of EMNLP 2018.

Javier Ferrando, Gerard I. Gállego, and Marta R. Costa-jussà. 2022. Measuring the mixing of contextual information in the transformer. In Proceedings of EMNLP 2022.

Eve Fleisig, Aubrie Amstutz, Chad Atalla, Su Lin Blodgett, Hal Daumé III, Alexandra Olteanu, Emily Sheng, Dan Vann, and Hanna Wallach. 2023. FairPrism: Evaluating fairness-related harms in text generation. In Proceedings of ACL 2023.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of EMNLP 2020.

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models.

Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of EMNLP 2022.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of EMNLP 2021.

Danny Halawi, Jean-Stanislas Denain, and Jacob Steinhardt. 2023. Overthinking the truth: Understanding how language models process false demonstrations.

Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2021. Self-attention attribution: Interpreting information interactions inside transformer.

Roee Hendel, Mor Geva, and Amir Globerson. 2023. In-context learning creates task vectors.

Evan Hernandez, Belinda Z. Li, and Jacob Andreas. 2023. Inspecting and editing knowledge representations in language models.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL-HLT 2019.

Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2023. Transformer-Patcher: One mistake worth one neuron.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.

Shahar Katz and Yonatan Belinkov. 2023. Interpreting transformer's attention dynamic memory and visualizing the semantic information flow of GPT.

Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. 2017. The (un)reliability of saliency methods.

Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, and Sven Dähne. 2016. Investigating the influence of noise and distractors on the interpretation of neural networks.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of EMNLP 2020.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2023. Analyzing feed-forward blocks in transformers through the lens of attention map.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of CoNLL 2017.

Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023a. Emergent world representations: Exploring a sequence model trained on a synthetic task.

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023b. Inference-time intervention: Eliciting truthful answers from a language model.

Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, and Jie Yu. 2023c. PMET: Precise model editing in a transformer. ArXiv, abs/2308.08742.

Zhihui Li, Max Gronke, and Charles Steidel. 2023d. Alpaca: A new semi-analytic model for metal absorption lines emerging from clumpy galactic environments.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts.

Scott Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2023a. Locating and editing factual associations in GPT.

Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023b. Mass-editing memory in a transformer.

Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D. Manning, and Chelsea Finn. 2022. Memory-based model editing at scale.

Ali Modarressi, Mohsen Fayyaz, Ehsan Aghazadeh, Yadollah Yaghoobzadeh, and Mohammad Taher Pilehvar. 2023. DecompX: Explaining transformers decisions by propagating token decomposition. In Proceedings of ACL 2023.

Ali Modarressi, Mohsen Fayyaz, Yadollah Yaghoobzadeh, and Mohammad Taher Pilehvar. 2022. GlobEnc: Quantifying global token attribution by incorporating the whole encoder layer in transformers. In Proceedings of NAACL-HLT 2022.

OpenAI. 2023. GPT-4 technical report.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, et al. 2022. Training language models to follow instructions with human feedback.

Judea Pearl et al. 2000. Models, reasoning and inference. Cambridge, UK: Cambridge University Press, 19(2):3.

Hao Peng, Xiaozhi Wang, Shengding Hu, Hailong Jin, Lei Hou, Juanzi Li, Zhiyuan Liu, and Qun Liu. 2022. COPEN: Probing conceptual knowledge in pre-trained language models.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP-IJCNLP 2019.

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2022. Efficiently scaling transformer inference.

Rebecca Qian, Candace Ross, Jude Fernandes, Eric Michael Smith, Douwe Kiela, and Adina Williams. 2022. Perturbation augmentation for fairer NLP. In Proceedings of EMNLP 2022.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Ori Ram, Liat Bezalel, Adi Zicher, Yonatan Belinkov, Jonathan Berant, and Amir Globerson. 2023. What are you token about? Dense retrieval as distributions over the vocabulary. In Proceedings of ACL 2023.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier.

Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP.

Sandipan Sikdar, Parantapa Bhattacharya, and Kieran Heese. 2021. Integrated directional gradients: Feature interaction attribution for neural NLP models. In Proceedings of ACL-IJCNLP 2021.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of ICML 2017.

Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. 2023. Function vectors in large language models.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, et al. 2023. Llama 2: Open foundation and fine-tuned chat models.

Transformer Circuits. 2022. Mechanistic interpretations of transformer circuits.

Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS 2017.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber. 2020. Causal mediation analysis for interpreting neural NLP: The case of gender bias.

Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2022. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small.

Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of EMNLP 2023.

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2021. Ethical and social risks of harm from language models.

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks.

Sen Yang, Shujian Huang, Wei Zou, Jianbing Zhang, Xinyu Dai, and Jiajun Chen. 2023. Local interpretation of transformer based on linear decomposition. In Proceedings of ACL 2023.

Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023. Editing large language models: Problems, methods, and opportunities.

Yordan Yordanov, Vid Kocijan, Thomas Lukasiewicz, and Oana-Maria Camburu. 2022. Few-shot out-of-domain transfer learning of natural language explanations in a label-abundant setup. In Findings of EMNLP 2022.

Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2023. Explainability for large language models: A survey.
