
Cross-modal Information Flow in Multimodal Large Language Models

Zhi Zhang*, Srishti Yadav*†, Fengze Han‡, Ekaterina Shutova*

*ILLC, University of Amsterdam, Netherlands
†Dept. of Computer Science, University of Copenhagen, Denmark
‡Dept. of Computer Engineering, Technical University of Munich, Germany
[email protected], [email protected], [email protected], [email protected]

arXiv:2411.18620v1 [cs.AI] 27 Nov 2024

Abstract

The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities—language and vision—in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in MLLMs, thereby facilitating future research into multimodal information localization and editing.

Figure 1. Illustration of the internal mechanism of MLLMs when solving multimodal tasks. From bottom to top layers, the model first propagates general visual information from the whole image into the linguistic hidden representation; next, selected visual information relevant to answering the question is transferred to the linguistic representation; finally, the integrated multimodal information within the hidden representation of the question flows to the last position, facilitating the final prediction. In addition, the answers are initially generated in lowercase form and the first letter is then converted to uppercase. [Example in the figure: the question "Are the blinds up or down?" with the assistant's answer "Up".]

1. Introduction
Multimodal large language models (MLLMs) [5, 11, 24, 27, 28] have demonstrated notable performance across a wide range of vision-language tasks, which is largely attributed to the combination of powerful auto-regressive large language models [39, 40, 44, 47] and visual encoders [13, 16, 35]. Specifically, LLMs generate responses based on both visual and linguistic inputs, where visual representations extracted from an image encoder precede the word embeddings in the input sequence. Despite the successful performance and wide applicability of MLLMs, there is still a lack of understanding of their internal working mechanisms at play when
solving multimodal tasks. Acquiring deeper insights into these mechanisms could not only enhance the interpretability and transparency [31, 33] of these models but also pave the way for developing more efficient and robust models for multimodal interactions.

Some initial studies have begun to explore the internal states corresponding to external behaviors of MLLMs, focusing on specific aspects such as information storage in the model's parameters [6], reflecting undesirable content generation through logit distributions of the generated tokens [46], the localization and evolution of object-related visual information [32, 34, 37], the localization of the safety mechanism [43] and the reduction of redundant visual tokens [45]. However, the information flow between the two modalities within MLLMs remains poorly understood, thus prompting our main question: Where in the model and how is visual and linguistic information integrated within auto-regressive MLLMs to generate the final prediction in vision-language tasks?

To address this question, we investigate the interaction of different modalities by locating and analyzing the information flow [15] between them, across different layers. Our focus is on the task of visual question answering (VQA), a popular multimodal task, where the answer is generated by MLLMs based on the input image and the corresponding question. Specifically, we aim to reverse engineer the information flow between the two modalities at inference time, by selectively inhibiting specific attention patterns between tokens corresponding to visual and linguistic inputs and by observing the resulting changes in the performance of the answer prediction.

In modern auto-regressive MLLMs, which employ a Transformer decoder-only architecture [41], the attention layer is the sole module enabling communication between hidden representations corresponding to different positions of the input. To inhibit cross-modal information flow, we therefore adopt an attention knockout approach, proposed by Geva et al. [19]. We use it to block attention edges connecting different types of hidden representations (e.g. image and question) at specific transformer layers.

We apply this method to a range of MLLMs from the LLaVA series, including LLaVA-1.5-7b, LLaVA-1.5-13b [27], LLaVA-v1.6-Vicuna-7b [28] and Llama3-LLaVA-NEXT-8b [2], and a number of diverse question types in VQA, as shown in Table 1. Our experiments focus on the following research questions: (1) How is the (more general) visual information from the whole image fused with the linguistic information in the question? (2) How is the more targeted visual information (i.e. specific image regions directly relevant to answering the question) integrated with linguistic information from the question? and (3) In what ways do the linguistic and visual components of the input contribute to the final answer prediction? To answer these questions, we conduct a series of experiments, blocking information flow between (1) the input positions corresponding to the whole image and the different parts of the question; (2) the input positions corresponding to image regions containing objects relevant to answering the question and the question; (3) the input positions corresponding to the image and the question and the final prediction, across different layers of the MLLM.

Our results reveal that in MLLMs, visual information undergoes a two-stage integration into the language representation within the lower-to-middle layers: first in a comprehensive manner, and subsequently in a more targeted fashion. This integrated multimodal representation is then propagated to the hidden representations in the subsequent layers, ultimately reaching the last position for generating an accurate response. A visualization of this mechanism is shown in Figure 1. To the best of our knowledge, ours is the first paper to elucidate the information flow between the two modalities in auto-regressive MLLMs. It thus contributes to enhancing the transparency of these models and provides novel and valuable insights for their development.

2. Related work

MLLMs Multimodal large language models have demonstrated remarkable performance across a wide range of vision-language tasks, which is largely attributed to the development of auto-regressive large language models. Representative MLLMs [5, 11, 24–28] consist of an image encoder [13, 16, 35] and a powerful decoder-only large language model [39, 40, 44, 47]. The visual and linguistic information are integrated within the original LLM. In this paper, we investigate this inner working mechanism of multimodal information processing in these models.

Interpretability of multimodal models The interpretability of multimodal models has attracted a great deal of attention in the research community. Works in [7, 17] treat the model as a black box, analyzing input–output relationships to interpret the behavior of models, such as comparing the importance of different modalities [7] and the different modalities' contribution to visual or textual tasks [17]. The works in [3, 8, 29, 38] aim to explain predictions by tracing outputs to specific input contributions for a single sample, including through merging attention scores [3, 38], using gradient-based methods [8] or model disentanglement [29]. Additionally, some works [9, 20, 36] adopt a top-down approach, probing learned representations to uncover high-level concepts, such as visual semantics [9], verb understanding [20], and shape and size [36]. In contrast, our work focuses on the model's internal processing mechanisms when solving multimodal tasks.

Mechanistic interpretability of MLLMs Mechanistic interpretability [31, 33] is an emerging research area in
NLP, aiming to reverse-engineer the detailed computations within neural networks. While it has gained traction in NLP, research in the multimodal domain remains limited. Palit et al. [34] introduced a causal tracing tool for image-conditioned text generation on BLIP [23], marking one of the few early efforts in this area. Several initial studies have started to explore the internal states of MLLMs by linking external behaviours to specific mechanisms, such as information storage in model parameters [6], undesirable content generation reflected in the logit distributions of the first generated token [46], localization and evolution of object-related visual information [32, 34, 37], safety mechanism localization [43], and reducing redundant visual tokens [45]. However, research offering a comprehensive understanding of the internal mechanisms behind multimodal information integration in MLLMs is still lacking. This paper makes an important first step towards filling this gap.

Figure 2. The typical architecture of a multimodal large language model. It consists of an image encoder and a decoder-only large language model in which the multimodal information is integrated. We omitted the projection matrix for the visual patch features as it is nonessential for our analysis. [The figure shows the image passing through CLIP-ViT-L and the question plus "Assistant:" passing through the tokenizer, both feeding the auto-regressive model, which produces the answer.]

3. Tracing information flow in MLLMs

The focus of this paper is on auto-regressive multimodal large language models, which consist of an image encoder and a decoder-only language model, as shown in Figure 2. The image encoder transforms images into representations that the language model can take as input, while the language model integrates these visual cues with any provided text, generating responses one word at a time. Often, these components are initialized from a pre-trained image encoder (e.g. CLIP-ViT-L-336px [35]) and a large language model (e.g. Llama 2 [40]), respectively. Since the interaction between modalities only occurs in the decoder-only transformer, our analysis centers around it and we refer to it as the MLLM for brevity unless otherwise specified.

3.1. Background: MLLMs

Input The input to an MLLM typically comprises image and text features, with the image features being initially extracted from an image encoder and the text being encoded through word embeddings. Formally, an image $x$ is evenly split into fixed-size patches and encoded by an image encoder to obtain $N_V$ visual patch features $V = [v_i]_{i=1}^{N_V}$, $v_i \in \mathbb{R}^d$. Similarly, the text $t$, consisting of $N_T$ tokens, is embedded into representations through a lookup table of word embeddings, resulting in the text input $T = [t_i]_{i=1}^{N_T}$, $t_i \in \mathbb{R}^d$. By concatenation of $V$ and $T$, the multimodal input sequence $I = [v_1 \ldots v_{N_V}, t_1 \ldots t_{N_T}] \in \mathbb{R}^{N \times d}$, where $N = N_V + N_T$, is fed into the MLLM.

Hidden representation The input sequence is fed into the MLLM, where the hidden representation at each token position is encoded across $L$ transformer layers. Each layer primarily consists of two modules: a masked multi-head attention (MHAT) followed by a fully connected feed-forward network (FFN) [41]. For conciseness, we have excluded the bias terms and layer normalization, as they are not crucial for our analysis. Formally, the hidden representation $h_i^\ell \in \mathbb{R}^d$ at position $i$ of the input sequence at layer $\ell$ can be expressed as

$$h_i^\ell = h_i^{\ell-1} + a_i^\ell + f_i^\ell, \qquad (1)$$

where $a_i^\ell \in \mathbb{R}^d$ and $f_i^\ell \in \mathbb{R}^d$ are the outputs of the MHAT and FFN modules at layer $\ell$, respectively. $h_i^0$ represents the vector in the input $I$ at position $i$. All hidden representations at layer $\ell$ corresponding to the whole input $I$ can be denoted by $H^\ell = [h_i^\ell]_{i=1}^{N} \in \mathbb{R}^{N \times d}$.

MHAT The masked multi-head attention (MHAT) module in each transformer layer $\ell$ contains four projection matrices: $W_Q^\ell, W_K^\ell, W_V^\ell, W_O^\ell \in \mathbb{R}^{d \times d}$. For multi-head attention, the input $H^{\ell-1}$ is first projected to query, key and value: $Q^\ell = H^{\ell-1} W_Q^\ell$, $K^\ell = H^{\ell-1} W_K^\ell$, $V^\ell = H^{\ell-1} W_V^\ell$. Then the projected query, key and value matrices are evenly split along the columns into $H$ different heads: $\{Q^{\ell,j}\}_{j=1}^{H}, \{K^{\ell,j}\}_{j=1}^{H}, \{V^{\ell,j}\}_{j=1}^{H} \in \mathbb{R}^{N \times \frac{d}{H}}$, respectively. After splitting $W_O^\ell$ into $\{W_O^{\ell,j}\}_{j=1}^{H} \in \mathbb{R}^{\frac{d}{H} \times d}$, we follow the works in [12, 15, 19] to represent the output of MHAT, $A^\ell = [a_i^\ell]_{i=1}^{N} \in \mathbb{R}^{N \times d}$ at layer $\ell$, as the sum of the outputs from the different heads:

$$A^\ell = \sum_{j=1}^{H} A^{\ell,j} V^{\ell,j} W_O^{\ell,j} \qquad (2)$$

$$A^{\ell,j} = \mathrm{softmax}\!\left(\frac{Q^{\ell,j} (K^{\ell,j})^T}{\sqrt{d/H}} + M^{\ell,j}\right) \qquad (3)$$

where $M^{\ell,j}$ is a strictly upper triangular mask for $A^{\ell,j}$ for the $j$-th head at layer $\ell$. For an auto-regressive transformer model, $M^{\ell,j}$ is used to guarantee that every position of the input sequence cannot attend to succeeding positions and attends to all preceding positions. Therefore, for the element $M_{s,t}^{\ell,j}$ with coordinate $(s,t)$ in $M^{\ell,j}$,

$$M_{s,t}^{\ell,j} = \begin{cases} -\infty & \text{if } t > s, \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$
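To make Eqs. (2)–(4) concrete, here is a minimal PyTorch sketch of the masked multi-head attention output for a single layer. It is our own illustrative reformulation, not the authors' or LLaVA's implementation; the tensor names and the single fused projection per matrix are assumptions.

```python
import math
import torch

def mhat_output(H_prev, W_Q, W_K, W_V, W_O, n_heads):
    """Masked multi-head attention output A^l of Eqs. (2)-(4).

    H_prev: (N, d) hidden states H^{l-1}; W_Q, W_K, W_V, W_O: (d, d).
    Returns an (N, d) tensor, i.e. the per-head contributions summed up.
    """
    N, d = H_prev.shape
    d_head = d // n_heads

    # Project and split along the columns into heads (Q^{l,j}, K^{l,j}, V^{l,j}).
    Q = (H_prev @ W_Q).view(N, n_heads, d_head)
    K = (H_prev @ W_K).view(N, n_heads, d_head)
    V = (H_prev @ W_V).view(N, n_heads, d_head)

    # Causal mask M^{l,j} of Eq. (4): -inf strictly above the diagonal,
    # so a position only attends to itself and to preceding positions.
    M = torch.triu(torch.full((N, N), float("-inf")), diagonal=1)

    out = torch.zeros(N, d)
    for j in range(n_heads):
        scores = Q[:, j] @ K[:, j].T / math.sqrt(d_head) + M
        A_j = torch.softmax(scores, dim=-1)              # A^{l,j}, Eq. (3)
        W_O_j = W_O[j * d_head:(j + 1) * d_head, :]      # per-head slice W_O^{l,j}
        out += A_j @ V[:, j] @ W_O_j                     # summand of Eq. (2)
    return out
```

The attention knockout used in this paper only ever edits the mask M; the projections and the rest of the computation are left untouched.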
FFN The FFN computes the output representation through

$$f_j^\ell = W_U^\ell \, \sigma\!\left(W_B^\ell \left(a_j^\ell + h_j^{\ell-1}\right)\right) \qquad (5)$$

where $W_U^\ell \in \mathbb{R}^{d \times d_{ff}}$ and $W_B^\ell \in \mathbb{R}^{d_{ff} \times d}$ are projection matrices with inner dimensionality $d_{ff}$, and $\sigma$ is a nonlinear activation function.

Output The hidden representation $h_N^L$ corresponding to the last position $N$ of the input sequence at the final layer $L$ is projected by an unembedding matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, and finally the probability distribution over all words in the vocabulary $\mathcal{V}$ is computed by

$$P_N = \mathrm{softmax}\!\left(E h_N^L\right), \qquad (6)$$

where the word with the highest probability in $P_N$ is the final prediction.

3.2. Attention knockout

In this paper, we mainly investigate the interaction between different modalities by locating and analyzing the information flow between them. We adopt a reverse-engineering approach to trace the information flow. Specifically, by intentionally blocking specific connections between different components in the computation process, we trace the information flow within them by observing changes in the probability of the final prediction.

In MLLMs, the attention module (MHAT) is the only module with the function of communication between different types of hidden representations corresponding to different positions in the input sequence. Therefore, we intentionally block the attention edges between hidden representations at different token positions (termed attention knockout) to trace the information flow between them. We take inspiration from the work of [19], where the authors use attention knockout to assess how factual information is extracted from a single-modality LLM by evaluating the contribution of certain words in a sentence to the last-position prediction. We extend this method to multimodal research by examining not only the contribution of each modality to the last-position prediction but also the transfer of information between the different modalities.

Intuitively, when blocking the attention edge connecting two hidden representations corresponding to different positions of the input sequence leads to a significant deterioration in model performance, it suggests that there exists functionally important information transfer between these two representations. Therefore, we locate the information flow between different hidden representations corresponding to different positions of the input sequence, such as visual inputs, linguistic inputs, and the last position in the input sequence (the position of answer prediction), by blocking the attention edge between them in the MHAT module and observing the resulting decline in performance as compared to the original model with an intact attention pattern.

Formally, in order to prevent information flow from the hidden representation $h_s^\ell$ with position $s$ in the source set $S$ (e.g. all positions of visual tokens in the input sequence) to the hidden representation $h_t^\ell$ with position $t$ in the target set $T$ (e.g. all positions of linguistic tokens in the input sequence) at a specific layer $\ell < L$, we set the corresponding element $M_{s,t}^{\ell,j}$ in $M^{\ell,j}$ to $-\infty$, and the updated Eq. (4) is

$$M_{s,t}^{\ell,j} = \begin{cases} -\infty & \text{if } (t > s) \text{ or } (s \in S \text{ and } t \in T), \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$

This prevents the token positions in the target set from attending to those in the source set when the MLLM generates the predicted answer.
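As a concrete illustration of Eq. (7), the sketch below builds the modified mask for one layer, given a source set S and a target set T of token positions. It is a hedged, standalone sketch: in practice one would apply the same entries through the attention-mask argument or a forward hook of the actual MLLM rather than rebuild attention from scratch, and the sizes in the example are illustrative only.

```python
import torch

def knockout_mask(N, source_positions, target_positions):
    """Attention mask of Eq. (7) for one layer (rows = queries, cols = keys).

    An entry is -inf when the query position may not attend to the key
    position: either because of the causal constraint of Eq. (4), or because
    the key is in the source set S and the query is in the target set T.
    """
    M = torch.triu(torch.full((N, N), float("-inf")), diagonal=1)  # Eq. (4)
    for t in target_positions:       # e.g. question-token positions (T)
        for s in source_positions:   # e.g. image-token positions (S)
            M[t, s] = float("-inf")  # knock out the attention edge s -> t
    return M

# Example: block the question from attending to the image in a LLaVA-1.5-style
# input, where N_V fixed-length image patch tokens precede N_T question tokens
# (576 patches for CLIP-ViT-L-336px; N_T here is an arbitrary example).
N_V, N_T = 576, 32
S = range(N_V)             # image positions
T = range(N_V, N_V + N_T)  # question positions
M = knockout_mask(N_V + N_T, S, T)
```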
4. Experimental setting

Setup Our paper investigates the inner working mechanism of MLLMs, focusing on visual question answering (VQA). Typically, the VQA setup involves an image and a corresponding question about this image, which the model needs to answer. We first investigate where the information from the different modalities (image and textual question) is processed in MLLMs, and then how it is integrated within the model. Finally, we explore how the MLLM makes the final decision using this multimodal information.

Tasks and data We collect our data from the validation set of the GQA dataset [21]. GQA is a dataset designed to support visual reasoning and compositional question answering, offering the semantic and visual richness of real-world images. It is derived from the Visual Genome dataset, which includes detailed scene graph structures [22]. In GQA, the questions are categorized along two dimensions: structure and semantics. The former defines the question format (5 classes) and the latter refers to the semantic information of the main subject of the question (5 classes). The answers to these questions consist of only one word or phrase, which is easy to evaluate. Based on the two dimensions, the questions in GQA are categorized into 15 groups. We exclude most groups that consist of simple binary questions (yes/no) or that show poor performance for the models investigated in this paper. Finally, we select 6 out of the 15 groups (4 structural and 4 semantic classes) whose average performance is higher than 80%, as shown in Table 1.
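The selection of question groups and samples described above (detailed in Appendix A) boils down to a simple filtering step. The sketch below is schematic only: the file name and the `types`/`structural`/`semantic`/`answer` fields reflect our reading of the public GQA release and may need adjusting, and the correctness check is a stand-in for actually running the MLLM.

```python
import json
import random

# Structural/semantic pairs corresponding to the six groups in Table 1
# (labels are assumptions about the GQA type vocabulary).
SELECTED_GROUPS = {("choose", "attr"), ("choose", "cat"), ("choose", "rel"),
                   ("compare", "attr"), ("logical", "obj"), ("query", "attr")}

def is_correct(sample) -> bool:
    # Stand-in: the paper keeps only samples the studied MLLM answers
    # correctly; plug in a real model call and compare to sample["answer"].
    return True

with open("val_balanced_questions.json") as f:   # GQA validation questions
    questions = json.load(f)

candidates = []
for qid, q in questions.items():
    t = q["types"]
    if (t["structural"], t["semantic"]) in SELECTED_GROUPS:
        candidates.append({"qid": qid, **q})

kept = [s for s in candidates if is_correct(s)]
random.seed(0)
subset = random.sample(kept, min(1000, len(kept)))  # at most 1000 (per group, in practice)
```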
Name | Structural type | Semantic type | Open / Binary | Question example | Answer | Num.
ChooseAttr | Choose | Attribute | Open | What was used to make the door, wood or metal? | Wood | 1000
ChooseCat | Choose | Category | Open | Which piece of furniture is striated, bed or door? | Bed | 1000
ChooseRel | Choose | Relation | Open | Is the door to the right or to the left of the bed? | Right | 964
CompareAttr | Compare | Attribute | Open | What is common to the bike and the dog? | Color | 570
LogicalObj | Logical | Object | Binary | Are there either women or men that are running? | No | 991
QueryAttr | Query | Attribute | Open | In which part of the image is the dog? | Left | 1000

Table 1. Different types of questions in our VQA dataset. The questions are categorized based on two dimensions: structure and semantics. The structural types define the question format, including: Choose for selecting between alternatives, Compare for comparisons between objects, Logical for logical inference, and Query for open-ended questions. The semantic types focus on the subject matter, covering Object existence, and the Attribute, Category and Relation of objects. Additionally, questions are labeled as Open for open-ended queries or Binary for yes/no answers. The dataset is derived from the GQA dataset [21]. Due to space limitations, we present two image examples in the original table, noting that 50% of the question samples in our dataset have unique images.

The difficulty of our selected groups ranges from simple multimodal perception tasks to more complex multimodal reasoning. For example, ChooseAttr and ChooseCat ask about basic object attributes and categories for one object in the image, ChooseRel and QueryAttr involve spatial reasoning, and CompareAttr and LogicalObj require more challenging comparisons and logical reasoning between two objects in the image. For each selected group, we sample an average of 920 image-question pairs that are correctly predicted by most models used in this paper. For each model, we only use correctly predicted samples for analysis (each model achieves an accuracy greater than 95% on the dataset we collected). More details about the dataset and the process of collection can be found in Appendix A.

Format Formally, given an image $i$ and a question $q$ (the question may contain answer options $o_s = [o_1, o_2]$), the model is expected to generate the answer $a$ at the last position of the input sequence. In addition, the correct option is referred to as the true option ($o_t$) while the other ones are denoted as the false option ($o_f$). Since the image, question and options might contain multiple input tokens, we use $I$, $Q$, $O_t$, $O_f$ to represent the sets of input positions corresponding to the image, question, true option and false option, respectively.

Models We investigate current state-of-the-art open-source multimodal large language models from the LLaVA series: LLaVA-1.5-7b, LLaVA-1.5-13b [27], LLaVA-v1.6-Vicuna-7b [28] and Llama3-LLaVA-NEXT-8b [2], which achieve state-of-the-art performance across a diverse range of 11 tasks including GQA. These models are trained on similar publicly available data but with different architectures and model sizes, which allows us to explore cross-modal interaction and processing over different architectures and to minimize interference of unknown factors from the training data. All these models have the same image encoder (CLIP-ViT-L-336px [35]) but different LLMs: Vicuna-v1.5-7b [47] with 32 layers (transformer blocks) in LLaVA-1.5-7b and LLaVA-v1.6-Vicuna-7b, Vicuna-v1.5-13b [47] with 40 layers in LLaVA-1.5-13b, and Llama3-8b [14] with 32 layers in Llama3-LLaVA-NEXT-8b, where Vicuna-v1.5 is a standard dense transformer architecture [41] and Llama3 adopts grouped query attention [4]. In terms of image processing, LLaVA-1.5-7b and LLaVA-1.5-13b directly feed the original fixed-length image patch features from the image encoder into the LLM as input tokens. In contrast, LLaVA-v1.6-Vicuna-7b and Llama3-LLaVA-NEXT-8b employ a dynamic high-resolution technique, which dynamically adjusts image resolution, resulting in variable-length image patch features with higher resolution. Due to space limitations, we will primarily present the results for LLaVA-1.5-13b in the subsequent sections of this paper, while similar findings for the other models are presented in Appendix E.
Evaluation We quantify the information flow between different input parts by evaluating the relative change in the probability of the answer word caused by blocking connections between different input parts (attention knockout). Formally, given an image-question pair, the MLLM generates the answer $a$ with the highest probability $p_1$ from the output distribution $P_N$ defined in Equation (6). After applying attention knockout at specific layers, we record the updated probability $p_2$ for the same answer $a$ as in $p_1$. The relative change in probability, $p_c\%$, is calculated as $p_c\% = ((p_2 - p_1)/p_1) \times 100$. In this paper, attention knockout is applied to each transformer layer (within a defined window) individually and we evaluate the respective $p_c$ values.
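The metric is simple enough to state in code; a minimal helper (names are ours):

```python
def relative_change(p1: float, p2: float) -> float:
    """Relative change p_c% in the answer probability (Section 4, Evaluation).

    p1: probability of the answer word under the intact model.
    p2: probability of the same answer after attention knockout.
    A return value of -60.0 means the knockout removed ~60% of the probability.
    """
    return (p2 - p1) / p1 * 100.0
```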

5. Contribution of different modalities to the final prediction

For a successful answer prediction in the VQA task, the MLLM processes the input image-question pair $[i, q]$ and generates the final answer from the output layer of the model at the last position. We first investigate whether the different modalities directly contribute to the final prediction.

Experiment 1 For each layer $\ell$ in the MLLM, we block the target set (the last position) from attending to each source set ($I$ or $Q$) respectively at the layers within a window of $k = 9$ layers around the $\ell$-th layer¹, and measure the change in the probability of the correct answer word. The last position is the $N$-th position in the input sequence, and it is also the first generated sub-word of the predicted answer. Typically, the answers contain a single word or phrase, which might sometimes be tokenized into several sub-word tokens. Therefore, we also conduct the same experiment and observe the probability change at the final generated sub-word of the predicted answer. Both the first and final generated sub-words yield similar results. Thus, we present all the results for the first generated sub-words in the main body of the paper, with details on the final sub-words provided in Appendix C.

¹ We experimented with different values of $k$, as described in Appendix B, and observed similar trends as in the analysis we present in this section.

Figure 3. The relative changes in prediction probability on LLaVA-1.5-13b with six VQA tasks (panels: ChooseAttr, ChooseCat, ChooseRel, CompareAttr, LogicalObj, QueryAttr). Question↛Last, Image↛Last and Last↛Last represent preventing the last position from attending to the question, the image and itself, respectively.

Observation 1: the contribution to the prediction at the last position is derived from other input components, rather than the input itself at this position. First of all, as an auto-regressive model, it is assumed that the input generated from preceding steps at the final position already encompasses the crucial information required for predicting the correct answer. However, as shown in Figure 3, when we block the attention edge from the last position to itself (Last ↛ Last), there is negligible change observed in the probability of the final prediction. This implies that the input at the last position does not encompass crucial information for the final prediction of the model. The prediction is, therefore, mainly influenced by other parts of the input sequence.

Observation 2: The information from the question positions plays a direct and predominant role in influencing the final prediction. As shown in Figure 3, blocking attention from the last position to the hidden representations in $Q$ (Question ↛ Last) results in a significant reduction in prediction probabilities across all six tasks. For example, in the ChooseAttr task, this decreases the prediction probability by up to ∼30%. This highlights the critical flow of information from $Q$ to the last position, directly affecting the final prediction. It is worth noting that this information flow pattern is observed primarily in the middle layers, where performance reductions consistently occur across all six tasks. In contrast, information from the image positions ($I$) does not directly and significantly impact the final prediction in most tasks, except for QueryAttr, where a slight information flow from $I$ to the last position is observed. However, this direct influence is negligible compared to its indirect effects, discussed below. An additional experiment on the information flow between different parts of the question, such as the options $o_s$ and object words, and the last position can be found in Appendix F.

Experiment 2 As the MLLM is auto-regressive and the input format is the image followed by the question in our setting, the information from the image ($I$) can propagate to the positions of the question ($Q$), but not the other way around. To establish whether this indeed occurs, for each layer $\ell$, we block $Q$ from attending to $I$ with the same window size ($k = 9$) around the $\ell$-th layer and observe the change in the probability of the answer word at the last position as above.

Observation: Information flow from the image positions to the question positions occurs twice As shown in Figure 4, blocking the question positions from attending to the image positions leads to a reduction in prediction probability. This is visible in the lower layers, in two different parts of the model. We first observe a sharp drop in layers ∼0–4 and then a second, smaller drop around the 10th layer. This indicates a two-stage integration process of visual information into the representations of the question. In the first drop, attention knockout reduces the prediction probability by an average of ∼60% across all six tasks. In the second drop, tasks such as ChooseAttr, ChooseCat, ChooseRel, and QueryAttr show another average reduction of ∼21%, while CompareAttr and LogicalObj exhibit smaller decreases. Despite the variability in the magnitude of the reduction, the layers responsible for the information flow remain consistent across all tasks, which is also observed during the first drop. An additional experiment on the information flow between the image and different parts of the question, such as the options $o_s$ and object words, can be found in Appendix F.
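Experiments 1 and 2 can be summarized as the following sweep over layers: for each layer ℓ, block one type of attention edge in a window of k layers centred on ℓ, re-run the model, and record the relative probability change. The sketch assumes two hypothetical helpers, `run_clean` and `run_with_knockout`, that return the answer probability with and without the Eq. (7) mask applied at the given layers; it is not the authors' code.

```python
def layer_window(center: int, k: int, n_layers: int):
    """Indices of the window of k layers centred on `center`, clipped to the model depth."""
    half = k // 2
    return range(max(0, center - half), min(n_layers, center + half + 1))

def knockout_sweep(inputs, source_set, target_set,
                   run_clean, run_with_knockout, n_layers=40, k=9):
    """Per-layer relative change p_c% in the answer probability (cf. Figures 3-5).

    run_clean(inputs) -> answer probability of the intact model.
    run_with_knockout(inputs, layers, S, T) -> answer probability when the
    attention edges S -> T are blocked at `layers` (assumed helpers).
    """
    p1 = run_clean(inputs)
    curve = []
    for layer in range(n_layers):
        layers = layer_window(layer, k, n_layers)
        p2 = run_with_knockout(inputs, layers, source_set, target_set)
        curve.append((p2 - p1) / p1 * 100.0)
    return curve
```

In Experiment 1, the target set is the last position and the source set is I or Q; Experiment 2 uses I as the source set and Q as the target set.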
Figure 4. The relative changes in prediction probability when blocking attention edges from the question positions to the image positions on LLaVA-1.5-13b with six VQA tasks (panels: ChooseAttr, ChooseCat, ChooseRel, CompareAttr, LogicalObj, QueryAttr).

Figure 5. The relative changes in prediction probability on LLaVA-1.5-13b with six VQA tasks. Related Image Patches↛Question and Other Image Patches↛Question represent blocking the question positions from attending to the two kinds of image patches, i.e. the region of interest and the remainder, respectively.

Overall information flow Given the input sequence, image and question, with the corresponding sets of positions $I$ and $Q$ respectively, the MLLM first propagates information twice from the image positions to the question positions in the lower-to-middle layers of the MLLM. Subsequently, in the middle layers, the information flows from the question positions to the last position for the final prediction. Overall, this reveals the existence of distinct and disjoint stages in the computation process of different layers in the MLLM, where critical information transfer points from different positions corresponding to the different modalities are observed to influence the final predictions of the model. These findings are also observed in the other three MLLMs (Appendix E).

6. How is the linguistic and visual information integrated?

The results of the above analysis suggest a two-stage integration process of the two modalities within an MLLM. In this section, we further investigate how the information about specific visual and linguistic concepts is integrated across these two stages.

Experiment To investigate how the model uses the image to answer the question, we conducted attention knockout experiments at the level of individual objects and individual words. The dataset used in the paper consists of questions targeting specific objects, and each object is annotated with the bounding box of a certain image region. Based on whether an image patch includes the corresponding bounding boxes (objects), we divide the input image patch features $V$ into two groups: $V_{obj}$, corresponding to the patches containing the objects mentioned in the question, and $V_{oth}$, containing the remaining patches. Then, for each layer $\ell$, we use the same attention knockout method to block the target set $Q$ from attending to each source set, $I_{obj}$ and $I_{oth}$, corresponding to the positions of $V_{obj}$ and $V_{oth}$ in the input sequence respectively, at the layers within a window of $k = 9$ layers around the $\ell$-th layer, and observe the change in the probability of the correct answer word.
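A minimal sketch of the $V_{obj}$ / $V_{oth}$ split used here: map each annotated bounding box onto the grid of image patches and collect the overlapping patch indices. The square 24x24 grid over 336-pixel images matches CLIP-ViT-L-336px as used by the LLaVA-1.5 models, but the geometry, names and rounding choices below are our assumptions, not the authors' code.

```python
def patches_for_box(box, image_size=336, grid=24):
    """Indices of the image patches overlapping a bounding box.

    box: (x1, y1, x2, y2) in pixel coordinates of the resized image.
    Patches are indexed row-major on a grid x grid layout (576 for 24x24);
    indices are relative to the image-token block, so add that block's
    offset to obtain positions in the full input sequence.
    """
    patch = image_size / grid
    x1, y1, x2, y2 = box
    cols = range(int(x1 // patch), min(grid, int(x2 // patch) + 1))
    rows = range(int(y1 // patch), min(grid, int(y2 // patch) + 1))
    return {r * grid + c for r in rows for c in cols}

def split_patches(boxes, n_patches=576):
    """Split patch indices into V_obj (overlapping any annotated object) and V_oth."""
    v_obj = set()
    for b in boxes:
        v_obj |= patches_for_box(b)
    v_oth = set(range(n_patches)) - v_obj
    return sorted(v_obj), sorted(v_oth)
```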
Observation: Shifting focus from a comprehensive representation to specific regions of interest As illustrated in Figure 5, blocking the attention edges between the positions of $V_{obj}$ and the question (Related Image Patches↛Question) and between the positions of $V_{oth}$ and the question (Other Image Patches↛Question) appear to account for the two performance drops observed in Figure 4, individually. Specifically, Other Image Patches↛Question clearly results in a significant and predominant reduction in prediction probability during the first stage of cross-modal integration, while Related Image Patches↛Question plays the dominant role in the second stage. It is noteworthy that both types of cross-modal information transfer occur in similar layers within the MLLM across all six tasks. Even for the CompareAttr and LogicalObj tasks, although only slight changes in probability are observed during the second stage, the layers in which this happens remain consistent with those of the other tasks. This suggests that in the lower layers, the model integrates the information from the whole image into the question positions, building a more generic representation, and it is only in the later layers that the model starts to pay attention to the specific regions in the image relevant to the question, fusing the more fine-grained linguistic and visual representations. The other MLLMs also present similar results, as shown in Appendix E. An additional, more fine-grained analysis of intervening on the attention edges between object words in the question and the image region can be found in Appendix F. Moreover, we find that compared with LLaVA-1.5-13b, the smaller LLaVA-1.5-7b model has less information flow from the positions of $V_{oth}$ to those of the question in the first stage, as shown in Appendix E.

7. How is the final answer generated?

Experiment To track the process of answer generation in the MLLM, motivated by the logit lens approach [1], we monitor the probability of the correct answer from the hidden representations at the last position of the input sequence across all layers. Formally, for each layer $\ell$ at the last position $N$, we use the unembedding matrix $E$ (as defined in Equation (6)) to compute the probability distribution over the entire vocabulary $\mathcal{V}$:

$$P_N^\ell = \mathrm{softmax}\!\left(E h_N^\ell\right), \qquad (8)$$

where the probability of the target answer word $w_a$ is given by the corresponding entry in $P_N^\ell$, denoted as $P_N^\ell(w_a)$. As the tokenizer in most MLLMs distinguishes the case of a word, especially its initial letter, we monitor the probability of the answer word both starting with an uppercase letter (Capitalized Answer) and with a lowercase letter (Noncapitalized Answer).

Figure 6. The probability of the answer word at the last position across all layers in LLaVA-1.5-13b with six VQA tasks. Capitalized Answer and Noncapitalized Answer represent the answer word with or without the initial letter in uppercase, respectively. As the ChooseAttr, ChooseCat and ChooseRel tasks contain a false option, we also provide its probability (Capitalized/Noncapitalized False Option).

Observation 1: The model is able to predict the correct answer starting at the layer immediately following multimodal integration As illustrated in Figure 6, the probability of the answer word with a lowercase initial letter (Noncapitalized Answer) rises sharply from near 0 to a range of ∼20% to ∼70% across the six VQA tasks, around the model's middle layers. This implies that the model rapidly acquires the capability to predict correct answers in these middle layers, where the phase of multimodal information integration has just fully completed (see Figure 5) and the multimodal information is still being transferred from the question positions to the last position (see Figure 3).

Observation 2: Semantic generation is followed by syntactic refinement As shown in Figure 6, across all VQA tasks, the probability of the Noncapitalized Answer starts to gradually decrease to nearly zero after its increase in the middle layers. In contrast, the probability of the Capitalized Answer remains low in the initial layers and only starts to increase after around the 20th layer. This indicates that the model has already semantically inferred the answer by about halfway through the layers, and in the higher layers, the model starts to refine the syntactic correctness of the answer. Similar findings for other models are shown in Appendix E.

8. Conclusion

In this paper, we unveil the inner working mechanisms of auto-regressive multimodal large language models in handling multimodal tasks. Our experiments reveal that different multimodal tasks exhibit similar processing patterns within the model. Specifically, when provided with an input consisting of an image and a question, within the lower-to-middle layers the model initially propagates the overall image information into the hidden representations of the question in the lower layers, and then selectively transfers only the question-relevant image information into the hidden representations of the question, facilitating multimodal information integration. In the middle layers, this integrated multimodal information is propagated to the hidden representation of the last position for the final prediction. In addition, we find that the answers are initially generated in lowercase form in the middle layers and the first letter is then converted to uppercase in the higher layers. These findings enhance the transparency of such models, offering new research directions for better understanding the interaction of the two modalities in MLLMs and ultimately leading to improved model designs.
References

[1] Interpreting GPT: the logit lens. LessWrong. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed 14-11-2024.
[2] lmms-lab/llama3-llava-next-8b · Hugging Face. https://huggingface.co/lmms-lab/llama3-llava-next-8b, 2024. Accessed 2024-11-13.
[3] Estelle Aflalo, Meng Du, Shao-Yen Tseng, Yongfei Liu, Chenfei Wu, Nan Duan, and Vasudev Lal. VL-InterpreT: An interactive visualization tool for interpreting vision-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21406–21415, 2022.
[4] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
[5] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023. arXiv:2308.12966.
[6] Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, and Daniela Massiceti. Understanding information storage and transfer in multi-modal large language models. arXiv preprint arXiv:2406.04236, 2024.
[7] Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. Lecture Notes in Computer Science, 12351 LNCS:565–580, 2020. arXiv:2005.07310.
[8] Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. pages 397–406, 2021. arXiv:2103.15679.
[9] Adam Dahlgren Lindström, Johanna Björklund, Suna Bensch, and Frank Drewes. Probing multimodal embeddings for linguistic properties: the visual-semantic case. In Proceedings of the 28th International Conference on Computational Linguistics, pages 730–744, Barcelona, Spain (Online), 2020. International Committee on Computational Linguistics.
[10] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers, 2022. arXiv:2104.08696.
[11] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023. arXiv:2305.06500.
[12] Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. arXiv preprint arXiv:2209.02535, 2022.
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. 2020. arXiv:2010.11929.
[14] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[15] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
[16] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
[17] Stella Frank, Emanuele Bugliarello, and Desmond Elliott. Vision-and-language or vision-for-language? On cross-modal influence in multimodal transformers. pages 9847–9857, 2021. arXiv:2109.04448.
[18] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021.
[19] Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models, 2023. arXiv:2304.14767.
[20] Lisa Anne Hendricks and Aida Nematzadeh. Probing image-language transformers for verb understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3635–3644, Online, 2021. Association for Computational Linguistics.
[21] Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
[23] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. Technical report.
[24] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. 2023. arXiv:2301.12597.
[25] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-Gemini: Mining the potential of multi-modality vision language models, 2024.
[26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. arXiv:2304.08485.
[27] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024.
[28] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
[29] Yiwei Lyu, Paul Pu Liang, Zihao Deng, Ruslan Salakhutdinov, and Louis-Philippe Morency. DIME: Fine-grained interpretations of multimodal models via disentangled local explanations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pages 455–467, 2022.
[30] Kevin Meng, David Bau, Alex J. Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 2022.
[31] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.
[32] Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual information processing in vision-language models. arXiv preprint arXiv:2410.07149, 2024.
[33] Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://www.transformer-circuits.pub/2022/mech-interp-essay, 2024. Accessed 2024-10-20.
[34] Vedant Palit, Rohan Pandey, Aryaman Arora, and Paul Pu Liang. Towards vision-language mechanistic interpretability: A causal tracing tool for BLIP.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[36] Emmanuelle Salin, Badreddine Farah, Stéphane Ayache, and Benoit Favre. Are vision-language transformers learning multimodal representations? A probing perspective. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022.
[37] Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, and Antonio Torralba. Multimodal neurons in pretrained text-only transformers. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2854–2859, Paris, France, 2023. IEEE.
[38] Gabriela Ben Melech Stan, Raanan Yehezkel Rohekar, Yaniv Gurwicz, Matthew Lyle Olson, Anahita Bhiwandiwalla, Estelle Aflalo, Chenfei Wu, Nan Duan, Shao-Yen Tseng, and Vasudev Lal. LVLM-Intrepret: An interpretability tool for large vision-language models. arXiv preprint arXiv:2404.03118, 2024.
[39] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models.
[40] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
[42] Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning, 2023. arXiv:2305.14160.
[43] Shicheng Xu, Liang Pang, Yunchang Zhu, Huawei Shen, and Xueqi Cheng. Cross-modal safety mechanism transfer in large vision-language models. arXiv preprint arXiv:2410.12662, 2024.
[44] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models, 2022. arXiv:2205.01068.
[45] Xiaofeng Zhang, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, and Jieping Ye. From redundancy to relevance: Enhancing explainability in multimodal large language models, 2024. arXiv:2406.06579.
[46] Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, and Stephen Gould. The first to know: How token distributions reveal hidden knowledge in large vision-language models? arXiv preprint arXiv:2403.09037, 2024.
[47] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. 2023.
Cross-modal Information Flow in Multimodal Large Language Models
Supplementary Material

Structural type | Object | Attribute | Category | Relation | Global
Verify | 86.21% | 83.00% | – | 87.82% | 95.56%
Query | – | 71.20% | 62.88% | 52.84% | 55.74%
Choose | – | 90.17% | 92.03% | 87.19% | 96.76%
Logical | 88.92% | 76.17% | – | – | –
Compare | – | 71.23% | – | – | –

Table 2. The accuracy on the validation set of the GQA dataset [21] for LLaVA-1.5-13b [27], by structural (rows) and semantic (columns) question type. In the original table, symbols mark whether a category contains binary (yes/no) questions, open questions, or both.

Figure 7. The relative changes in prediction probability on LLaVA-1.5-13b for the ChooseAttr task with different window sizes (panels: window 1, 3, 5, 7, 11, 15). Question↛Last and Image↛Last represent preventing the last position from attending to the question and the image, respectively.

A. Dataset collection

We collect our data from the validation set of the GQA dataset [21], which is designed for visual reasoning and compositional question-answering. Derived from the Visual Genome dataset [22], GQA provides real-world images enriched with detailed scene graphs. Questions in GQA are categorized along two dimensions: structure (5 classes, defining question formats) and semantics (5 classes, specifying the main subject's semantic focus). Structural classes include: (1) verify (yes/no questions), (2) query (open questions), (3) choose (questions with two alternatives), (4) logical (logical inference), and (5) compare (object comparisons). Semantic classes are: (1) object (existence questions), (2) attribute (object properties or positions), (3) category (object identification within a class), (4) relation (questions about relational subjects/objects), and (5) global (overall scene properties like weather or location). Based on the combination of these two dimensions, the questions in GQA are categorized into 15 groups, as shown in Table 2. We select 6 out of the 15 groups according to the following steps. First, we exclude the verify type, as it is quite simple, involving only straightforward binary questions (e.g., "Is the apple red?"). Then we focus on types with an average accuracy above 80% for the LLaVA-1.5-13b model [27], retaining ChooseAttr, ChooseCat, ChooseRel, and LogicalObj. ChooseGlo is excluded due to its limited sample size in the validation set of GQA (only 556 instances). After that, to enhance question-type diversity, we select high-performing subtypes (accuracy > 80%) in CompareAttr and QueryAttr from the GQA dataset. Specifically, we use the positionQuery subtype for spatial-relation questions in QueryAttr and the twoCommon subtype for comparing common attributes between two objects in CompareAttr. Finally, for each of the six types, we sample at most 1000 examples that are predicted correctly by the LLaVA-1.5-13b model from the validation set of GQA, resulting in our final data for this paper, as shown in Table 1.

B. Information flow for different window sizes k

In the main body of the paper, we use a window size of k = 9 for an easier analysis of the internal working mechanism of the multimodal large language models when performing multimodal tasks. Here we present the relative change in probability on LLaVA-1.5-13b for the ChooseAttr task with different window sizes of k = 1, 3, 5, 7, 9, 11, 15. The resulting information flow between different parts of the input sequence (image and question) and the last position, and between the image and the question, is shown in Figure 7 and Figure 8, respectively. Overall, the observations on the information flow are consistent across different window sizes k. Specifically, the critical information flow from the question to the last position occurs in the middle layers, while critical information flow from the image to the last position is not observed for any window size, as shown in Figure 7. As for the critical information flow from the image to the question, the two distinct critical information flows are observed across different window sizes, where both of them occur in the lower layers, sequentially following each other, as illustrated in Figure 8.
,PDJH 4XHVWLRQ 4XHVWLRQ /DVW ,PDJH /DVW /DVW /DVW
   

&KDQJHLQSUREDELOLW\ 

&KDQJHLQSUREDELOLW\ 

&KDQJHLQSUREDELOLW\ 

&KDQJHLQSUREDELOLW\ 

&KDQJHLQSUREDELOLW\ 

&KDQJHLQSUREDELOLW\ 


    


 
  

 
   




  
 

                       
           
/D\HU /D\HU /D\HU /D\HU /D\HU /D\HU
(a) Window 1 (b) Window 3 (c) Window 5 (a) ChooseAttr (b) ChooseCat (c) ChooseRel
  

&KDQJHLQSUREDELOLW\ 

&KDQJHLQSUREDELOLW\ 

&KDQJHLQSUREDELOLW\ 
&KDQJHLQSUREDELOLW\ 

&KDQJHLQSUREDELOLW\ 

&KDQJHLQSUREDELOLW\ 
 

   


    

   
 

    

    

     

                       
           
/D\HU /D\HU /D\HU /D\HU /D\HU /D\HU
(d) Window 7 (e) Window 11 (f) Window 15 (d) CompareAttr (e) LogicalObj (f) QueryAttr

Figure 8. The relative changes in prediction probability when Figure 9. The relative changes in prediction probability for the
blocking attention edges from the question positions to the image final generated sub-word of the answer on LLaVA-1.5-13b with
positions on LLaVA-1.5-13b with the tasks of ChooseAttr for dif- six VQA tasks. The Question↛ Last, Image↛ Last and Last↛
ferent window sizes. Last represent preventing last position from attending to Question,
Image and itself respectively.

C. Changes in probability of the last sub-word generation

In this paper, the answers in the dataset we use typically consist of one word or one phrase, which may be tokenized into several sub-word tokens. In the main body of the paper, we present the relative change in probability of the first generated sub-word; the final generated sub-word yields similar results. Specifically, we conduct the same experiments as in the main body of the paper, i.e., six tasks (ChooseAttr, ChooseCat, ChooseRel, LogicalObj, QueryAttr and CompareAttr) on the LLaVA-1.5-13b model. Instead of calculating the relative change in probability for the first generated sub-word token, we calculate it for the final generated sub-word token of the correct answer word. As shown in Figure 9, Figure 10 and Figure 11, the information flow from different parts of the input sequence (image and question) to the last position, from image to question, and from different image patches (related image patches and other image patches) to question is consistent with the observations in Figure 3, Figure 4 and Figure 5 in the main body of the paper.

Figure 10. The relative changes in prediction probability for the final generated sub-word of the answer when blocking attention edges from the question positions to the image positions on LLaVA-1.5-13b with six VQA tasks.

D. Constructing multimodal semantic representations

We have investigated how multimodal information is integrated through the MHAT module in Section 6. We now take a closer look at how the multimodal semantic representation is constructed.
Figure 11. The relative changes in prediction probability for the final generated sub-word of the answer on LLaVA-1.5-13b with six VQA tasks. Related Image Patches↛question and Other Image Patches↛question represent blocking the position of question from attending to that of different image patches, region of interest and remainder, respectively.

Figure 12. The Jaccard similarity between the predicted words of the original model and those of the intervened model, with the MLP and MHAT modules removed individually (LLaVA-1.5-13b).

Experiment. To identify which module in the transformer contributes to the formulation of multimodal semantic information within hidden representations, we employ a module knockout approach to evaluate the significance of individual transformer modules. As shown in Equation (1), the hidden representation at layer ℓ is computed by adding a_i^ℓ and f_i^ℓ to h_i^{ℓ−1}, where a_i^ℓ and f_i^ℓ are derived from the MHAT (Equation (2)) and MLP (Equation (5)) modules, respectively. This allows us to determine which module contributes to constructing semantic information by selectively zeroing out the outputs of MHAT or MLP, the two additive modules in the transformer layer. Specifically, for each layer ℓ, we intervene by setting a_i^{ℓ′} or f_i^{ℓ′} (i ∈ Q) to zero across 9 consecutive layers ℓ′ ∈ {ℓ, . . . , min(ℓ + 8, L)}. We then measure the importance of each module for constructing the multimodal semantic information by observing the semantic change of the hidden representations corresponding to the question positions Q at the final layer L. Our focus on layer L is inspired by Geva et al. [19], who highlight that semantic information peaks in the final layer. We follow Wang et al. [42], who evaluate the semantic content of a hidden representation using the top-k words predicted from this representation. We estimate semantic content using the top-10 words predicted from each hidden representation at Q, derived via Equation (6), where h_N^L is replaced with h_i^L (i ∈ Q). We then quantify the change in semantic content of the hidden representations resulting from our interventions using the Jaccard similarity:

J(W_o, W_i) = |W_o ∩ W_i| / |W_o ∪ W_i|,    (9)

where W_o and W_i denote the sets of 10·|Q| predicted words from the original and intervened models, respectively.
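To make the module knockout and Equation (9) concrete, the sketch below (our illustration, not the authors' released code) zeroes a module's output at the question positions via a forward hook and compares the top-10 predicted word sets of an original and an intervened run with the Jaccard similarity. The hook target, `lm_head`, `id_to_word` and the toy tensors in the demo are illustrative assumptions.

# Sketch: zero-ablate an additive module's output at the question positions
# and compare top-10 word sets with the Jaccard similarity of Equation (9).
import torch


def jaccard_similarity(words_original: set, words_intervened: set) -> float:
    """Equation (9): |Wo ∩ Wi| / |Wo ∪ Wi| over two sets of predicted words."""
    union = words_original | words_intervened
    return len(words_original & words_intervened) / len(union) if union else 1.0


def zeroing_hook(question_positions):
    """Forward hook that zeroes a module's output at the question positions,
    i.e. sets a_i^l (MHAT) or f_i^l (MLP) to zero for i in Q."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, dim)
        hidden[:, question_positions, :] = 0.0
        return output
    return hook


def top_k_words(hidden_at_Q: torch.Tensor, unembedding: torch.nn.Linear,
                id_to_word, k: int = 10) -> set:
    """Project each question-position hidden state to the vocabulary and
    collect the union of the top-k predicted words (10·|Q| words in total)."""
    logits = unembedding(hidden_at_Q)               # (|Q|, vocab)
    top_ids = logits.topk(k, dim=-1).indices        # (|Q|, k)
    return {id_to_word[int(i)] for i in top_ids.flatten()}


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    vocab, dim, num_q = 50, 16, 3
    id_to_word = [f"w{i}" for i in range(vocab)]
    lm_head = torch.nn.Linear(dim, vocab, bias=False)
    h_original = torch.randn(num_q, dim)
    h_intervened = h_original + 0.3 * torch.randn(num_q, dim)
    W_o = top_k_words(h_original, lm_head, id_to_word)
    W_i = top_k_words(h_intervened, lm_head, id_to_word)
    print("Jaccard similarity:", jaccard_similarity(W_o, W_i))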
Observation: The MLP module plays a greater role in constructing semantic representations than the MHAT module. As shown in Figure 12 for the LLaVA-1.5-13b model, removing the MLP module severely impacts the semantic representation, reducing the average Jaccard similarity across the six tasks by ∼90% when the MLP is removed in the first layer and by ∼25% in the last layer. In contrast, removing the MHAT module has a smaller effect, with reductions of ∼60% and ∼10% at the first and last layers, respectively. This highlights the MLP module's important role in generating multimodal semantic representations. These results align with findings from [10, 18, 30], who demonstrate that factual information is primarily stored in the MLP module, emphasizing its contribution to enriching semantic information. This is also observed in the LLaVA-1.5-7b model, as shown in Figure 13.

E. Experiments on other models

We conduct the same experiments (six VQA task types) as in the main body of the paper with three other models. The six VQA task types are ChooseAttr, ChooseCat, ChooseRel, LogicalObj, QueryAttr and CompareAttr. The three other models are LLaVA-1.5-7b, LLaVA-v1.6-Vicuna-7b and Llama3-LLaVA-NEXT-8b.
Figure 13. The Jaccard similarity between the predicted words of the original model and those of the intervened model, with the MLP and MHAT modules removed individually (LLaVA-1.5-7b).

Figure 14. The relative changes in prediction probability on LLaVA-1.5-7b with six VQA tasks. The Question↛ Last, Image↛ Last and Last↛ Last represent preventing last position from attending to Question, Image and itself respectively.

Figure 15. The relative changes in prediction probability when blocking attention edges from the question positions to the image positions on LLaVA-1.5-7b with six VQA tasks.

Figure 16. The relative changes in prediction probability on LLaVA-1.5-7b with six VQA tasks. Related Image Patches↛question and Other Image Patches↛question represent blocking the position of question from attending to that of different image patches, region of interest and remainder, respectively.

E.1. LLaVA-1.5-7b

LLaVA-1.5-7b is a smaller version of the LLaVA-1.5-13b model presented in the main body of the paper. It contains 32 transformer blocks (layers) instead of the 40 layers in LLaVA-1.5-13b. The information flow from different parts of the input sequence (image and question) to the last position, from image to question, and from different image patches (related image patches and other image patches) to question, as shown in Figure 14, Figure 15 and Figure 16 respectively, is consistent with the observations for the LLaVA-1.5-13b model, shown in Figure 3, Figure 4 and Figure 5 respectively, in the main body of the paper. Specifically, the model first propagates critical information twice from the image positions to the question positions in the lower-to-middle layers of the MLLM. Of these two rounds of multimodal information integration, the first focuses on producing representations over the whole image, while the second tends to construct question-related representations. Subsequently, in the middle layers, the critical multimodal information flows from the question positions to the last position for the final prediction. The difference between the two models lies in the magnitude of the reduction in probability when blocking the attention edges between image and question: in LLaVA-1.5-7b, the first drop is rather smaller than in LLaVA-1.5-13b. However, this does not conflict with our conclusion in the main body of the paper that the information flows from image to question twice, one flow after the other. Moreover, the probability change of the answer word across all layers, as shown in Figure 17, is also consistent with the result in Figure 6 in the main body of the paper. Specifically, the model first generates the answer semantically in the middle layers and then refines the syntactic correctness of the answer in the higher layers.
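The per-layer answer probabilities shown in Figure 17 (and in Figures 6, 21 and 25) can be obtained by projecting the intermediate hidden state at the last position through the model's output head at every layer. The sketch below shows one way to do this, under the assumption of a decoder that exposes per-layer hidden states together with a final normalization `final_norm` and an unembedding `lm_head`; the function and variable names are ours, not the authors'.

# Hedged sketch: per-layer probability of a given answer token at the last
# position, in the spirit of the "logit lens". Assumes `hidden_states` is a
# list of (batch, seq_len, dim) tensors, one per layer.
import torch


def answer_prob_per_layer(hidden_states, final_norm, lm_head, answer_token_id):
    """Return a list with P(answer_token | last position) for every layer."""
    probs = []
    for h in hidden_states:
        last = h[:, -1, :]                      # hidden state at the last position
        logits = lm_head(final_norm(last))      # project into the vocabulary space
        p = torch.softmax(logits.float(), dim=-1)[:, answer_token_id]
        probs.append(p.item())
    return probs


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs: 4 "layers", vocab of 100, dim of 16.
    torch.manual_seed(0)
    dim, vocab = 16, 100
    hidden_states = [torch.randn(1, 5, dim) for _ in range(4)]
    final_norm = torch.nn.LayerNorm(dim)
    lm_head = torch.nn.Linear(dim, vocab, bias=False)
    # One would compare e.g. the capitalized and non-capitalized first sub-word ids.
    print(answer_prob_per_layer(hidden_states, final_norm, lm_head, answer_token_id=42))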
Figure 17. The probability of the answer word at the last position across all layers in LLaVA-1.5-7b with six VQA tasks. Capitalized Answer and Noncapitalized Answer represent the answer word with or without the uppercase initial letter, respectively. As the ChooseAttr, ChooseCat and ChooseRel tasks contain a false option, we also provide its probability.

Figure 18. The relative changes in prediction probability on LLaVA-v1.6-Vicuna-7b with six VQA tasks. The Question↛ Last, Image↛ Last and Last↛ Last represent preventing last position from attending to Question, Image and itself respectively.

Figure 19. The relative changes in prediction probability when blocking attention edges from the question positions to the image positions on LLaVA-v1.6-Vicuna-7b with six VQA tasks.
E.2. LLaVA-v1.6-Vicuna-7b

LLaVA-v1.6-Vicuna-7b has a similar architecture to LLaVA-1.5-13b in the main body of the paper. The differences between them are the number of layers and the way image patch features are processed. LLaVA-v1.6-Vicuna-7b has 32 layers versus 40 layers in LLaVA-1.5-13b. LLaVA-1.5-13b directly feeds the original fixed-length image patch features from the image encoder into the LLM as input tokens. In contrast, LLaVA-v1.6-Vicuna-7b employs a dynamic high-resolution technique, which dynamically adjusts the image resolution, resulting in variable-length image patch features with higher resolution. Specifically, the higher resolution is implemented by splitting the image into grids and encoding them independently.
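As a rough illustration of the grid-based, dynamic high-resolution idea (a sketch of the general mechanism only, not the model's actual preprocessing code), the snippet below splits a higher-resolution image into tiles that would each be encoded independently by the vision encoder.

# Simplified illustration: split an image tensor into a grid of tiles that
# would each be encoded independently, yielding variable-length patch features.
import torch


def split_into_grid(image: torch.Tensor, grid_rows: int, grid_cols: int) -> torch.Tensor:
    """image: (C, H, W) with H, W divisible by the grid shape.
    Returns (grid_rows * grid_cols, C, H // grid_rows, W // grid_cols)."""
    c, h, w = image.shape
    th, tw = h // grid_rows, w // grid_cols
    tiles = image.unfold(1, th, th).unfold(2, tw, tw)      # (C, rows, cols, th, tw)
    tiles = tiles.permute(1, 2, 0, 3, 4).contiguous()      # (rows, cols, C, th, tw)
    return tiles.view(grid_rows * grid_cols, c, th, tw)


if __name__ == "__main__":
    img = torch.randn(3, 672, 672)          # e.g. a higher-resolution input
    tiles = split_into_grid(img, 2, 2)      # four 336x336 tiles
    print(tiles.shape)                      # torch.Size([4, 3, 336, 336])
    # Each tile (plus, in practice, a resized overview of the full image)
    # would then be passed through the image encoder separately.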
Figure 20. The relative changes in prediction probability on LLaVA-v1.6-Vicuna-7b with six VQA tasks. Related Image Patches↛question and Other Image Patches↛question represent blocking the position of question from attending to that of different image patches, region of interest and remainder, respectively.

Figure 21. The probability of the answer word at the last position across all layers in LLaVA-v1.6-Vicuna-7b with six VQA tasks. Capitalized Answer and Noncapitalized Answer represent the answer word with or without the uppercase initial letter, respectively. As the ChooseAttr, ChooseCat and ChooseRel tasks contain a false option, we also provide its probability.
The information flow from different parts of the input sequence (image and question) to the last position, from image to question, and from different image patches (related image patches and other image patches) to question, as shown in Figure 18, Figure 19 and Figure 20 respectively, is consistent with the observations for the LLaVA-1.5-13b model, shown in Figure 3, Figure 4 and Figure 5 respectively, in the main body of the paper. Specifically, the model first propagates critical information twice from the image positions to the question positions in the lower-to-middle layers of the MLLM. In this dual-stage multimodal information integration, the first stage emphasizes generating holistic representations of the entire image, while the second stage focuses on constructing representations that are specifically aligned with the given question. Subsequently, in the middle layers, the critical multimodal information flows from the question positions to the last position for the final prediction. Moreover, the probability change of the answer word across all layers, as shown in Figure 21, is also consistent with the result in Figure 6 in the main body of the paper. Specifically, the model first generates the answer semantically in the middle layers and then refines the syntactic correctness of the answer in the higher layers.

E.3. Llama3-LLaVA-NEXT-8b

Llama3-LLaVA-NEXT-8b has a rather different architecture from LLaVA-1.5-13b in the main body of the paper. The differences between them are the number of layers, the way image patch features are processed, and the attention mechanism. Llama3-LLaVA-NEXT-8b has 32 layers versus 40 layers in LLaVA-1.5-13b. LLaVA-1.5-13b directly feeds the original fixed-length image patch features from the image encoder into the LLM as input tokens. In contrast, Llama3-LLaVA-NEXT-8b employs a dynamic high-resolution technique, which dynamically adjusts the image resolution, resulting in variable-length image patch features with higher resolution. Specifically, the higher resolution is implemented by splitting the image into grids and encoding them independently. As for the attention mechanism, LLaVA-1.5-13b uses a standard, dense transformer architecture [41], while Llama3-LLaVA-NEXT-8b adopts grouped query attention [4], where the queries are grouped and the queries in the same group share the same key and value heads.
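The following sketch illustrates the grouped-query-attention idea in isolation (a generic toy example, not the Llama3-LLaVA-NEXT-8b source code): a small number of key/value heads is shared by groups of query heads by repeating them before the usual scaled dot-product attention.

# Toy illustration of grouped query attention (GQA): n_q query heads share
# n_kv (< n_q) key/value heads, so each K/V head serves a group of queries.
import torch
import torch.nn.functional as F


def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d)."""
    b, n_q, s, d = q.shape
    n_kv = k.shape[1]
    group = n_q // n_kv                       # query heads per shared K/V head
    k = k.repeat_interleave(group, dim=1)     # (batch, n_q_heads, seq, d)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / d**0.5
    # Causal mask, as in a decoder-only LLM.
    mask = torch.triu(torch.ones(s, s, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    b, s, d = 1, 6, 8
    q = torch.randn(b, 8, s, d)               # 8 query heads
    k = torch.randn(b, 2, s, d)               # 2 shared key heads
    v = torch.randn(b, 2, s, d)               # 2 shared value heads
    print(grouped_query_attention(q, k, v).shape)   # torch.Size([1, 8, 6, 8])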
Figure 22. The relative changes in prediction probability on llama3-llava-next-8b with six VQA tasks. The Question↛ Last, Image↛ Last and Last↛ Last represent preventing last position from attending to Question, Image and itself respectively.

Figure 23. The relative changes in prediction probability when blocking attention edges from the question positions to the image positions on llama3-llava-next-8b with six VQA tasks.

Figure 24. The relative changes in prediction probability on llama3-llava-next-8b with six VQA tasks. Related Image Patches↛question and Other Image Patches↛question represent blocking the position of question from attending to that of different image patches, region of interest and remainder, respectively.

Figure 25. The probability of the answer word at the last position across all layers in llama3-llava-next-8b with six VQA tasks. Capitalized Answer and Noncapitalized Answer represent the answer word with or without the uppercase initial letter, respectively. As the ChooseAttr, ChooseCat and ChooseRel tasks contain a false option, we also provide its probability.

The information flow from different parts of the input sequence (image and question) to the last position, from image to question, and from different image patches (related image patches and other image patches) to question, as shown in Figure 22, Figure 23 and Figure 24 respectively, is consistent with the observations for the LLaVA-1.5-13b model, shown in Figure 3, Figure 4 and Figure 5 respectively, in the main body of the paper. Although the information flow from image to question in Figure 23 appears to exhibit only a single drop, Figure 24 reveals that, in the lower layers, the information flow from Other Image Patches to the question plays a dominant role compared to that from Related Image Patches to the question, whereas in the following layers the information flow from Related Image Patches to the question is more notable than that from Other Image Patches to the question. This observation indicates that the model still has a two-stage multimodal information integration process. Specifically, in the first stage, the model focuses on generating holistic representations of the entire image. In the second stage, it refines these representations to align them more closely with the specific given question. Subsequently, in the middle layers, the critical multimodal information flows from the question positions to the last position for the final prediction. Moreover, the probability changes for the Capitalized Answer across all layers, as illustrated in Figure 25, align closely with the results in the main body of the paper, while no such pattern is observed for the Noncapitalized Answer. This suggests that the model generates the syntactically correct answer directly, without a distinct intermediate step of semantic generation followed by syntactic correction. A potential explanation for this behavior is that when Llama3 generates an answer to a given question, it first outputs a "\n" token, which may act as a cue to produce an answer word starting with an uppercase letter.
Figure 26. The relative changes in prediction probability on LLaVA-v1.5-13b with six VQA tasks. Preventing the last position from attending to different parts of the question, such as True Option, False Option, Objects in question, Question without Options, Question without Objects, and both True Option and False Option together.

Figure 27. The relative changes in prediction probability on LLaVA-v1.5-13b with six VQA tasks. Preventing information flow from Question without Options to Options and from Question without Objects to Objects.

F. Fine-grained analysis of the information flow

In the main body of the paper, we primarily focus on the information flow within one specific combination of image, question and last position when analyzing multimodal information integration. In this section, we further investigate the information flow between finer-grained parts of the input sequence, including the question without options, the true option, the false option, the objects in the question, the question without objects, the related image patches and the other image patches. We use the same attention knockout method to block the attention edges between them and investigate the information flow between them.

F.1. Different parts of the question to the last position

In the tasks of ChooseAttr, ChooseCat and ChooseRel, for each layer ℓ, we block the last position from attending to different parts of the question, including question without options, true option and false option, with the same window size (k = 9) around the ℓ-th layer, and observe the change in the probability of the answer word at the last position. In the tasks of CompareAttr, LogicalObj and QueryAttr, we conduct the same operations as above, except that we block the last position from attending to objects or question without objects, as these tasks do not contain options in the question.

As shown in Figure 26 (a), (b) and (c), for the tasks of ChooseAttr, ChooseCat and ChooseRel, the information flow from the true option and from the false option to the last position occurs in similar (higher) layers of the model. When blocking the last position from attending to the true option, the probability is reduced, whereas blocking the last position from attending to the false option results in an increase in the probability of the correct answer word. The increase is reasonable, because the question becomes easier for the model without the false option. For the tasks of ChooseAttr and ChooseCat, in the information flowing to the last position, the options play a dominant role, while the question without options only results in a small reduction in the probability of the correct answer word. In contrast, for the ChooseRel task, the true option does not significantly reduce the probability of the correct answer word. This may stem from the format of the ChooseRel questions, where the options are positioned in the middle of the question, rather than at the end as in the ChooseAttr and ChooseCat tasks. As a result, the options in ChooseRel are less effective at aggregating the complete contextual information of the question within an auto-regressive transformer decoder. Consequently, the flow of information from the options to the final position becomes less critical in determining the correct answer.

As the questions in our dataset target one or more specific objects in the image, we also conduct experiments on blocking the last position from attending to the objects or to the question without objects. As shown in Figure 26 (d), (e) and (f), the critical information from the objects does not transfer directly into the last position, compared with that from the question without objects to the last position. This implies that the objects might affect the final prediction in an indirect way.

F.2. Different parts of the question to different parts of the question

In the tasks of ChooseAttr, ChooseCat and ChooseRel, for each layer ℓ, we block the options from attending to the question without options with the same window size (k = 9) around the ℓ-th layer and observe the change in the probability of the answer word. In the tasks of CompareAttr and LogicalObj, we conduct the same operations as above, except that we block the objects from attending to the question without objects.
Figure 28. The relative changes in prediction probability on LLaVA-v1.5-13b with six VQA tasks. Blocking the information flow from Image to different parts of the question, including True Option, False Option, Objects in question, Question without Objects, and Question without Options.

Figure 29. The relative changes in prediction probability on LLaVA-v1.5-13b with six VQA tasks. Blocking the information flow from Other Image Patches to different parts of the question, including True Option, False Option, Objects in question, Question without Objects, and Question without Options.

As shown in Figure 27 (a), (b) and (c), for the tasks of ChooseAttr, ChooseCat and ChooseRel, the information flow from question without options to the true option occurs in similar transformer layers to that from question without options to the false option. We also observe that these indirect information flows from question without options to the false option occur before the information flow from the options to the last position, shown in Figure 26. This indicates that the information of the question is aggregated into the options in the lower layers, and the information in the options is then transferred to the last position for the final answer prediction in the higher layers. For the tasks of CompareAttr and LogicalObj, we observe that the information flow from question without objects to objects occurs in the lower layers.

F.3. Image to different parts of question

In the tasks of ChooseAttr, ChooseCat and ChooseRel, for each layer ℓ, we block the attention edges between the image and different parts of the question, including question without options, true option and false option, with the same window size (k = 9) around the ℓ-th layer and observe the change in the probability of the answer word. In the tasks of CompareAttr, LogicalObj and QueryAttr, we conduct the same operations as above, except that we block the attention edges between the image and question without objects or objects, respectively.

As illustrated in Figure 28, the overall information flow from the image to different parts of the question aligns consistently with the information flow from the image to the entire question, as depicted in Figure 4 in the main body of the paper. Specifically, there are two distinct flows from the image to the question. Notably, however, different parts of the question exhibit varying magnitudes of probability change, especially in the second drop in probability, which may be because different kinds of questions have different attention patterns to the image. For example, during the second drop in probability, in the tasks of ChooseAttr and ChooseCat, the image information does not transfer to the false option, while it transfers much more information to the true option. However, this pattern is not observed in the ChooseRel task, where most image information is transferred into question without options and objects.

F.4. Other image patches to different parts of question

In the tasks of ChooseAttr, ChooseCat and ChooseRel, for each layer ℓ, we block the attention edges between other image patches and different parts of the question, including question without options, true option, false option, objects and question without objects, with the same window size (k = 9) around the ℓ-th layer and observe the change in the probability of the answer word. In the tasks of CompareAttr, LogicalObj and QueryAttr, we conduct the same operations as above, except that we block the attention edges between other image patches and question without objects or objects, respectively.

As shown in Figure 29, the information flow from other image patches to different parts of the question for all six tasks aligns consistently with the flow observed from other image patches to the entire question, as illustrated in Figure 5 in the main body of the paper. Specifically, the information flow dominantly occurs in the first drop in probability in the lower layers, regardless of which part of the question is being blocked.
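The fine-grained analyses in this appendix require the token positions of each question part (true option, false option, objects, and the remainder of the question). One simple way to obtain such position sets, assuming the part is tokenized identically in isolation and in context, is sketched below; this is our illustration rather than the authors' preprocessing code.

# Sketch: locate the token positions of a question part (e.g. the true option
# or an object phrase) inside the tokenized question, so that attention edges
# to/from exactly those positions can be knocked out.
def find_subspan_positions(question_ids, part_ids, offset=0):
    """Return the absolute positions of `part_ids` inside `question_ids`.
    `offset` is the index of the first question token in the full input
    sequence (image tokens precede the question in LLaVA-style inputs)."""
    n, m = len(question_ids), len(part_ids)
    for start in range(n - m + 1):
        if question_ids[start:start + m] == part_ids:
            return list(range(offset + start, offset + start + m))
    return []  # not found, e.g. due to a tokenization mismatch


if __name__ == "__main__":
    # Toy ids: question tokens with a two-token "option" at positions 4-5.
    question_ids = [11, 12, 13, 14, 21, 22, 15]
    option_ids = [21, 22]
    # Suppose 576 image tokens precede the question in the input sequence.
    print(find_subspan_positions(question_ids, option_ids, offset=576))  # [580, 581]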
 
Figure 30. The relative changes in prediction probability on LLaVA-v1.5-13b with six VQA tasks. Blocking the information flow from Related Image Patches to different parts of the question, including True Option, False Option, Objects in question, Question without Objects, and Question without Options.

Figure 31. The Jaccard similarity between the predicted words of the original model LLaVA-1.5-13b and those of the intervened model blocking the question from attending to the image, on the task of ChooseAttr.

F.5. Related image patches to different parts of question

In the tasks of ChooseAttr, ChooseCat and ChooseRel, for each layer ℓ, we block the attention edges between the related image patches and different parts of the question, including question without options, true option, false option, objects and question without objects, with the same window size (k = 9) around the ℓ-th layer and observe the change in the probability of the answer word. In the tasks of CompareAttr, LogicalObj and QueryAttr, we conduct the same operations as above, except that we block the attention edges between the related image patches and question without objects or objects, respectively.

The overall information flow from related image patches to different parts of the question for all six tasks, shown in Figure 30, aligns consistently with the flow observed from related image patches to the entire question, as illustrated in Figure 5 in the main body of the paper. Specifically, the information flow dominantly occurs in the second drop in probability in the lower-to-middle layers (around the 10th layer). However, some parts of the question do not receive information flowing from the related image patches, for example, the objects in the ChooseCat task, or the false and true options in the ChooseRel task.

G. The influence of images on the semantics of questions

We already know that the image information is integrated into the representations corresponding to the question positions. To investigate whether the image affects the final semantics of the question, for each layer ℓ, we prevent the question from attending to the image, with the same window size (k = 9) around the ℓ-th layer, and observe the change in the semantics of the question at the final layer. The semantics of the question is evaluated with the Jaccard similarity, as in Appendix D.

As illustrated in Figure 31, the Jaccard similarity demonstrates a significant decline in the lower layers, resembling the behavior observed in the layers where information flows from the image to the question. This pattern highlights the critical role of image information in constructing the final multimodal semantic representation.
