Cross-Modal Flow
1. Introduction
Multimodal large language models (MLLMs) [5, 11, 24, 27, 28] have demonstrated notable performance across a wide range of vision-language tasks, which is largely attributed to the combination of powerful auto-regressive large language models [39, 40, 44, 47] and visual encoders [13, 16, 35]. Specifically, the LLMs generate responses based on both visual and linguistic inputs, where the visual representations extracted from an image encoder precede the word embeddings in the input sequence. Despite the successful performance and wide applicability of MLLMs, there is still a lack of understanding of their internal working mechanisms at play when solving multimodal tasks. Acquiring deeper insights into these mechanisms could not only enhance the interpretability and transparency [31, 33] of these models but also pave the way for developing more efficient and robust models for multimodal interactions.

Some initial studies have begun to explore the internal states corresponding to external behaviors of MLLMs, focusing on specific aspects such as information storage in the model's parameters [6], reflecting undesirable content generation through the logit distributions of the generated tokens [46], the localization and evolution of object-related visual information [32, 34, 37], the localization of safety mechanisms [43], and the reduction of redundant visual tokens [45]. However, the information flow between the two modalities within MLLMs remains poorly understood, thus prompting our main question: Where in the model and how is visual and linguistic information integrated within auto-regressive MLLMs to generate the final prediction in vision-language tasks?

To address this question, we investigate the interaction of different modalities by locating and analyzing the information flow [15] between them, across different layers. Our focus is on the task of visual question answering (VQA), a popular multimodal task, where the answer is generated by MLLMs based on the input image and the corresponding question. Specifically, we aim to reverse engineer the information flow between the two modalities at inference time, by selectively inhibiting specific attention patterns between tokens corresponding to visual and linguistic inputs and by observing the resulting changes in the performance of the answer prediction.

In modern auto-regressive MLLMs, which employ a decoder-only Transformer architecture [41], the attention layer is the sole module enabling communication between hidden representations corresponding to different positions of the input. To inhibit cross-modal information flow, we therefore adopt an attention knockout approach, proposed by Geva et al. [19]. We use it to block attention edges connecting different types of hidden representations (e.g. image and question) at specific transformer layers.

We apply this method to a range of MLLMs from the LLaVA series, including LLaVA-1.5-7b, LLaVA-1.5-13b [27], LLaVA-v1.6-Vicuna-7b [28] and Llama3-LLaVA-NEXT-8b [2], and a number of diverse question types in VQA, as shown in Table 1. Our experiments focus on the following research questions: (1) How is the (more general) visual information from the whole image fused with the linguistic information in the question? (2) How is the more targeted visual information (i.e. specific image regions directly relevant to answering the question) integrated with the linguistic information from the question? and (3) In what ways do the linguistic and visual components of the input contribute to the final answer prediction? To answer these questions we conduct a series of experiments, blocking information flow (1) from the input positions corresponding to the whole image to the different parts of the question; (2) from the input positions corresponding to image regions containing objects relevant to answering the question to the question; and (3) from the input positions corresponding to the image and the question to the final prediction, across different layers of the MLLM.

Our results reveal that in MLLMs, visual information undergoes a two-stage integration into the language representation within the lower-to-middle layers: first in a comprehensive manner, and subsequently in a more targeted fashion. This integrated multimodal representation is then propagated to the hidden representations in the subsequent layers, ultimately reaching the last position for generating an accurate response. The visualization of this mechanism is shown in Figure 1. To the best of our knowledge, ours is the first paper to elucidate the information flow between the two modalities in auto-regressive MLLMs. It thus contributes to enhancing the transparency of these models and provides novel and valuable insights for their development.

2. Related work

MLLMs  Multimodal large language models have demonstrated remarkable performance across a wide range of vision-language tasks, which is largely attributed to the development of auto-regressive large language models. The representative MLLMs [5, 11, 24-28] consist of an image encoder [13, 16, 35] and a powerful decoder-only large language model [39, 40, 44, 47]. The visual and linguistic information is integrated within the original LLM. In this paper, we investigate the inner working mechanism of multimodal information processing in these models.

Interpretability of multimodal models  The interpretability of multimodal models has attracted a great deal of attention in the research community. Works in [7, 17] treat the model as a black box, analyzing input-output relationships to interpret the behavior of models, such as comparing the importance of different modalities [7] and the different modalities' contribution to visual or textual tasks [17]. The works from [3, 8, 29, 38] aim to explain predictions by tracing outputs to specific input contributions for a single sample, including through merging the attention scores [3, 38], using gradient-based methods [8] or model disentanglement [29]. Additionally, some works [9, 20, 36] adopt a top-down approach, probing learned representations to uncover high-level concepts, such as visual semantics [9], verb understanding [20], and shape and size [36]. In contrast, our work focuses on the model's internal processing mechanisms when solving multimodal tasks.

Mechanistic interpretability of MLLMs  Mechanistic interpretability [31, 33] is an emerging research area in NLP, aiming to reverse-engineer detailed computations within neural networks. While it has gained traction in NLP, research in the multimodal domain remains limited. Palit et al. [34] introduced a causal tracing tool for image-conditioned text generation on BLIP [23], marking one of the few efforts in this direction.
The input image is processed by the image encoder to obtain $N_V$ visual patch features $V = [v_i]_{i=1}^{N_V}$, $v_i \in \mathbb{R}^d$. Similarly, the text $t$, consisting of $N_T$ tokens, is embedded into representations through a lookup table of word embeddings, resulting in the text input $T = [t_i]_{i=1}^{N_T}$, $t_i \in \mathbb{R}^d$. By concatenation of $V$ and $T$, the multimodal input sequence $I = [v_1 \ldots v_{N_V}, t_1 \ldots t_{N_T}] \in \mathbb{R}^{N \times d}$, where $N = N_V + N_T$, is fed into the MLLM.
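To make this concrete, the following Python sketch assembles such an input sequence from randomly initialized stand-ins, assuming the visual patch features have already been projected into the LLM embedding space; all sizes and tensors are illustrative and not taken from any specific LLaVA implementation.

import torch

# Illustrative sizes (not tied to any particular checkpoint).
N_V, N_T, d = 576, 32, 4096            # visual patches, text tokens, hidden size
vocab_size = 32000

# V: visual patch features, assumed already projected into the LLM embedding space.
V = torch.randn(N_V, d)

# T: text token embeddings obtained from the word-embedding lookup table.
token_ids = torch.randint(0, vocab_size, (N_T,))
embedding_table = torch.nn.Embedding(vocab_size, d)
T = embedding_table(token_ids)

# I = [v_1 ... v_{N_V}, t_1 ... t_{N_T}]: visual features precede the word embeddings.
I = torch.cat([V, T], dim=0)            # shape (N, d) with N = N_V + N_T
assert I.shape == (N_V + N_T, d)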
Hidden representation  The input sequence is fed into the MLLM, where the hidden representation at each token position is encoded across $L$ transformer layers. Each layer primarily consists of two modules: a masked multi-head attention (MHAT) followed by a fully connected feed-forward network (FFN) [41]. For conciseness, we have excluded ...

After splitting $W_O^\ell$ into $\{W_O^{\ell,j}\}_{j=1}^{H} \in \mathbb{R}^{\frac{d}{H} \times d}$, we follow works in [12, 15, 19] to represent the output of MHAT, $A^\ell = [a_i^\ell]_{i=1}^{N} \in \mathbb{R}^{N \times d}$ at layer $\ell$, as the sum of the outputs from the different heads

$A^\ell = \sum_{j=1}^{H} A^{\ell,j} V^{\ell,j} W_O^{\ell,j}$   (2)

$A^{\ell,j} = \mathrm{softmax}\!\left(\frac{Q^{\ell,j} (K^{\ell,j})^\top}{\sqrt{d/H}} + M^{\ell,j}\right)$   (3)

where $M^{\ell,j}$ is a strictly upper triangular mask for $A^{\ell,j}$ for the $j$-th head at layer $\ell$. For an auto-regressive transformer model, $M^{\ell,j}$ is used to guarantee that every position of the input sequence cannot attend to succeeding positions and attends to all preceding positions. Therefore, for the element $M_{s,t}^{\ell,j}$ with the coordinate $(s, t)$ in $M^{\ell,j}$,

$M_{s,t}^{\ell,j} = \begin{cases} -\infty & \text{if } t > s, \\ 0 & \text{otherwise.} \end{cases}$   (4)
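The sketch below spells out Eqs. (2)-(4) in code: the MHAT output is accumulated head by head under a strictly upper-triangular causal mask. The projection matrices are random stand-ins rather than weights of any model studied in this paper.

import math
import torch

def causal_mask(N: int) -> torch.Tensor:
    """Eq. (4): M[s, t] = -inf if t > s, i.e. no attention to succeeding positions."""
    M = torch.zeros(N, N)
    M[torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)] = float("-inf")
    return M

def mhat_output(X, W_Q, W_K, W_V, W_O, H: int) -> torch.Tensor:
    """Eqs. (2)-(3): sum over heads of A^{l,j} V^{l,j} W_O^{l,j}."""
    N, d = X.shape
    d_h = d // H
    M = causal_mask(N)
    out = torch.zeros(N, d)
    for j in range(H):
        # Per-head projections (slice j of the full projection matrices).
        Q = X @ W_Q[:, j * d_h:(j + 1) * d_h]            # (N, d/H)
        K = X @ W_K[:, j * d_h:(j + 1) * d_h]
        V = X @ W_V[:, j * d_h:(j + 1) * d_h]
        A = torch.softmax(Q @ K.T / math.sqrt(d_h) + M, dim=-1)    # Eq. (3)
        out = out + A @ V @ W_O[j * d_h:(j + 1) * d_h, :]          # one term of Eq. (2)
    return out

# Toy example with random weights (illustrative only).
N, d, H = 8, 64, 4
X = torch.randn(N, d)
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))
W_O = torch.randn(d, d)                     # split row-wise into the W_O^{l,j}
A_l = mhat_output(X, W_Q, W_K, W_V, W_O, H)  # (N, d), the A^l of Eq. (2)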
FFN  The FFN computes the output representation through

$f_j^\ell = W_U^\ell\, \sigma\!\left(W_B^\ell \left(a_j^\ell + h_j^{\ell-1}\right)\right)$   (5)

where $W_U^\ell \in \mathbb{R}^{d \times d_{ff}}$ and $W_B^\ell \in \mathbb{R}^{d_{ff} \times d}$ are projection matrices with inner dimensionality $d_{ff}$, and $\sigma$ is a nonlinear activation function.
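A direct transcription of Eq. (5), with GELU as an arbitrary stand-in for the unspecified nonlinearity $\sigma$ and random matrices in place of trained weights:

import torch

d, d_ff = 64, 256                        # hidden size and FFN inner dimensionality (illustrative)
W_B = torch.randn(d_ff, d)               # W_B^l in Eq. (5)
W_U = torch.randn(d, d_ff)               # W_U^l in Eq. (5)
sigma = torch.nn.functional.gelu         # stand-in for the nonlinearity sigma

a_j = torch.randn(d)                     # MHAT output at position j, a_j^l
h_prev = torch.randn(d)                  # hidden state from the previous layer, h_j^{l-1}

f_j = W_U @ sigma(W_B @ (a_j + h_prev))  # Eq. (5)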
Output  The hidden representation $h_N^L$ corresponding to the last position $N$ of the input sequence at the final layer $L$ is projected by an unembedding matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, and finally the probability distribution over all words in the vocabulary $\mathcal{V}$ is computed by

$P_N = \mathrm{softmax}\!\left(E h_N^L\right)$,   (6)

where the word with the highest probability in $P_N$ is the final prediction.
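Eq. (6) in code, again with a random unembedding matrix and a toy vocabulary; the probability recorded here corresponds to the $p_1$ used in the evaluation protocol of Section 4.

import torch

d, vocab_size = 64, 1000                    # illustrative sizes
E = torch.randn(vocab_size, d)              # unembedding matrix E in Eq. (6)
h_last = torch.randn(d)                     # h_N^L: last-position hidden state at the final layer

P_N = torch.softmax(E @ h_last, dim=-1)     # Eq. (6): distribution over the vocabulary
prediction = int(P_N.argmax())              # word with the highest probability
p1 = float(P_N[prediction])                 # probability of the predicted answer word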
3.2. Attention knockout

In this paper, we mainly investigate the interaction between different modalities by locating and analyzing the information flow between them. We adopt a reverse-engineering approach to trace the information flow. Specifically, by intentionally blocking specific connections between different components in the computation process, we trace the information flow within them by observing changes in the probability of the final prediction.

In MLLMs, the attention module (MHAT) is the only module that enables communication between the different types of hidden representations corresponding to different positions in the input sequence. Therefore, we intentionally block the attention edges between hidden representations at different token positions (termed attention knockout) to trace the information flow between them. We take inspiration from the work of [19], where the authors use attention knockout to assess how factual information is extracted from a single-modality LLM by evaluating the contribution of certain words in a sentence to the last-position prediction. We extend this method to multimodal research by examining not only the contribution of each modality to the last-position prediction but also the transfer of information between the different modalities.

Intuitively, when blocking the attention edge connecting two hidden representations corresponding to different positions of the input sequence leads to a significant deterioration in model performance, it suggests that there exists functionally important information transfer between these two representations. Therefore, we locate the information flow between different hidden representations corresponding to different positions of the input sequence, such as visual inputs, linguistic inputs, and the last position in the input sequence (the position of answer prediction), by blocking the attention edge between them in the MHAT module and observing the resulting decline in performance as compared to the original model with an intact attention pattern.

Formally, in order to prevent information flow from the hidden representations $h_s^\ell$ with position $s$ in the source set $S$ (e.g. all positions of visual tokens in the input sequence) to the hidden representations $h_t^\ell$ with position $t$ in the target set $T$ (e.g. all positions of linguistic tokens in the input sequence) at a specific layer $\ell < L$, we set the corresponding element $M_{s,t}^{\ell,j}$ in $M^{\ell,j}$ to $-\infty$, and the updated Eq. (4) is

$M_{s,t}^{\ell,j} = \begin{cases} -\infty & \text{if } t > s \text{ or } (s \in S \text{ and } t \in T), \\ 0 & \text{otherwise.} \end{cases}$   (7)

This prevents the token positions in the target set from attending to those in the source set when the MLLM generates the predicted answer.
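To make the knockout concrete, the sketch below builds the modified mask of Eq. (7) on top of the causal mask of Eq. (4) for a toy input layout in which image tokens precede question tokens. The position sets and sizes are illustrative; injecting the mask into chosen layers of a real MLLM (e.g. via attention hooks) is omitted here.

import torch

def causal_mask(N: int) -> torch.Tensor:
    """Eq. (4): a position attends only to itself and to preceding positions."""
    M = torch.zeros(N, N)
    M[torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)] = float("-inf")
    return M

def knockout_mask(N: int, source: list, target: list) -> torch.Tensor:
    """Eq. (7): additionally block information flow from source to target positions.

    Convention used in this sketch: rows index the attending (query) position and
    columns the attended-to (key) position, so blocking flow from source to target
    means queries at target positions may not attend to keys at source positions.
    """
    M = causal_mask(N)
    tgt = torch.tensor(target).unsqueeze(1)   # (|T|, 1)
    src = torch.tensor(source).unsqueeze(0)   # (1, |S|)
    M[tgt, src] = float("-inf")
    return M

# Toy layout mirroring the MLLM input: image tokens first, then question tokens,
# with the last position used for answer prediction (sizes are illustrative).
N_V, N_T = 6, 4
N = N_V + N_T
image_positions = list(range(N_V))               # source set S
question_positions = list(range(N_V, N - 1))     # target set T (excluding the last position)

M = knockout_mask(N, source=image_positions, target=question_positions)
# At the knocked-out layer(s), this mask replaces the plain causal mask in Eq. (3),
# so the question positions can no longer read from the image positions.
print(M)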
4. Experimental setting

Setup  Our paper investigates the inner working mechanism of MLLMs, focusing on visual question answering (VQA). Typically, the VQA setup involves an image and a corresponding question about this image, which the model needs to answer. We first investigate where the information from the different modalities (image and textual question) is processed in MLLMs, and then how it is integrated within the model. Finally, we explore how the MLLM makes the final decision using this multimodal information.

Tasks and data  We collect our data from the validation set of the GQA dataset [21]. GQA is a dataset designed to support visual reasoning and compositional question answering, offering the semantic and visual richness of real-world images. It is derived from the Visual Genome dataset, which includes detailed scene graph structures [22]. In GQA, the questions are categorized along two dimensions: structure and semantics. The former defines the question format (5 classes) and the latter refers to the semantic information for the main subject of the question (5 classes). The answers to these questions consist of only one word or a short phrase, which makes them easy to evaluate. Based on the two dimensions, the questions in GQA are categorized into 15 groups. We exclude most groups, which consist of simple binary (yes/no) questions and on which the models investigated in this paper perform poorly. Finally, we select 6 out of the 15 groups (covering 4 structural and 4 semantic classes) whose average accuracy is higher than 80%, as shown in Table 1. The difficulty of our selected groups ranges from simple multimodal perception tasks to more complex multimodal reasoning. For example, ChooseAttr and ChooseCat ask about basic object attributes and categories for one object in the image, ChooseRel and QueryAttr involve spatial reasoning, and CompareAttr and LogicalObj require more challenging comparisons and logical reasoning between two objects in the image. For each selected group, we sample an average of 920 image-question pairs that are correctly predicted by most models used in this paper. For each model, we only use correctly predicted samples for analysis (each model achieves an accuracy greater than 95% on the dataset we collected). More details about the dataset and the process of collection can be found in Appendix A.

Name          Structural type   Semantic type   Open / Binary   Question example                                       Answer   Num.
ChooseAttr    Choose            Attribute       Open            What was used to make the door, wood or metal?         Wood     1000
ChooseCat     Choose            Category        Open            Which piece of furniture is striated, bed or door?     Bed      1000
ChooseRel     Choose            Relation        Open            Is the door to the right or to the left of the bed?    Right    964
CompareAttr   Compare           Attribute       Open            What is common to the bike and the dog?                Color    570
LogicalObj    Logical           Object          Binary          Are there either women or men that are running?        No       991
QueryAttr     Query             Attribute       Open            In which part of the image is the dog?                 Left     1000

Table 1. Different types of questions in our VQA dataset. The questions are categorized based on two dimensions: structure and semantics. The structural types define the question format, including Choose for selecting between alternatives, Compare for comparisons between objects, Logical for logical inference, and Query for open-ended questions. The semantic types focus on the subject matter, covering Object existence and the Attribute, Category, and Relation of objects. Additionally, questions are labeled as Open for open-ended queries or Binary for yes/no answers. The dataset is derived from the GQA dataset [21]. Due to space limitations, we present two images, noting that 50% of question samples in our dataset have unique images.

Format  Formally, given an image $i$ and a question $q$ (the question may contain answer options $o_s = [o_1, o_2]$), the model is expected to generate the answer $a$ in the last position of the input sequence. In addition, the correct one among the options is referred to as the true option ($o_t$) while the other ones are denoted as false options ($o_f$). Since the image, question and options might contain multiple input tokens, we use $I$, $Q$, $O_t$, $O_f$ to represent the sets of input positions corresponding to the image, question, true option and false option, respectively.

Evaluation  We quantify the information flow between different input parts by evaluating the relative change in the probability of the answer word that is caused by blocking connections between different input parts (attention knockout). Formally, given an image-question pair, the MLLM generates the answer $a$ with the highest probability $p_1$ from the output distribution $P_N$ defined in Equation (6). After applying attention knockout at specific layers, we record the updated probability $p_2$ for the same answer $a$. The relative change in probability, $p_c\%$, is calculated as $p_c\% = ((p_2 - p_1)/p_1) \times 100$. In this paper, attention knockout is applied to each transformer layer (within a defined window) individually, and we evaluate the respective $p_c$ values.

Models  We investigate current state-of-the-art, open-source multimodal large language models from the LLaVA series: LLaVA-1.5-7b, LLaVA-1.5-13b [27], LLaVA-v1.6-Vicuna-7b [28] and Llama3-LLaVA-NEXT-8b [2], which achieve state-of-the-art performance across a diverse range of 11 tasks including GQA. These models are trained on similar publicly available data but with different architectures and model sizes, which allows us to explore cross-modal interaction and processing over different architectures and minimize interference of unknown factors from the training data. All these models have the same image encoder (CLIP-ViT-L-336px [35]) but different LLMs: Vicuna-v1.5-7b [47] with 32 layers (transformer blocks) in LLaVA-1.5-7b and LLaVA-v1.6-Vicuna-7b, Vicuna-v1.5-13b [47] with 40 layers in LLaVA-1.5-13b, and Llama3-8b [14] with 32 layers in Llama3-LLaVA-NEXT-8b, where Vicuna-v1.5 uses the standard dense transformer architecture [41] and Llama3 adopts grouped-query attention [4]. In terms of image processing, LLaVA-1.5-7b and LLaVA-1.5-13b directly feed the original fixed-length image patch features from the image encoder into the LLM as input tokens. In contrast, LLaVA-v1.6-Vicuna-7b and Llama3-LLaVA-NEXT-8b employ a dynamic high-resolution technique, which dynamically adjusts the image resolution, resulting in variable-length image patch features with higher resolution. Due to space limitations, we will primarily present the results for LLaVA-1.5-13b in the subsequent sections of this paper, while similar findings for the other models are presented in Appendix E.
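The evaluation protocol described above can be summarized by the following schematic sketch, in which answer_probability is a hypothetical placeholder for a forward pass of the MLLM on one image-question pair, with attention knockout optionally applied to a window of layers; only the layer sweep and the computation of $p_c\%$ are spelled out.

from typing import Iterable, Optional

def answer_probability(knockout_layers: Optional[Iterable[int]] = None) -> float:
    """Hypothetical stand-in for a forward pass of the MLLM on one image-question pair.

    It should return the probability of the originally predicted answer a, with
    attention knockout (Eq. (7)) applied at knockout_layers if given. The dummy
    numbers below are for illustration only.
    """
    return 0.92 if not knockout_layers else 0.55

def relative_change(num_layers: int = 40, window: int = 5) -> dict:
    """p_c% = ((p2 - p1) / p1) * 100, computed per window of consecutive layers."""
    p1 = answer_probability()                          # intact attention pattern
    pc = {}
    for start in range(num_layers - window + 1):
        layers = range(start, start + window)          # knock out this window of layers
        p2 = answer_probability(knockout_layers=layers)
        pc[start] = (p2 - p1) / p1 * 100.0
    return pc

print(relative_change(num_layers=8, window=3))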
5. Contribution of different modalities to the final prediction

For a successful answer prediction in the VQA task, the MLLM processes the input image-question pair $[i, q]$ and generates the final answer from the output layer of the model corresponding to the last position. We first investigate whether the different modalities directly contribute to the final prediction.