
Mitigate Position Bias in Large Language Models via Scaling a Single Dimension

Yijiong Yu1†, Huiqiang Jiang2, Xufang Luo2, Qianhui Wu2, Chin-Yew Lin2,
Dongsheng Li2, Yuqing Yang2, Yongfeng Huang1, Lili Qiu2
1Tsinghua University, 2Microsoft Corporation
[email protected], [email protected]
{hjiang,xufluo,qianhuiwu,cyl,dongsli,yuqyang,liliqiu}@microsoft.com

arXiv:2406.02536v1 [cs.CL] 4 Jun 2024

Abstract

Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as "lost in the middle", a phenomenon that is especially pronounced in long-context scenarios: the placement of key information at different positions in a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. It further identifies that, in addition to position embeddings, the causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling these positional hidden states. Experiments on NaturalQuestions multi-document QA, KV retrieval, LongBench, and a timeline reorder task, using various models including RoPE models, context-window-extended models, and Alibi models, demonstrate the effectiveness and generalizability of our approach. Our method can improve performance by up to 15.2% by modifying just one dimension of the hidden states. Our code is available at https://aka.ms/PositionalHidden.

1 Introduction
Long-context large language models (LLMs) [1, 2, 3, 4, 5, 6] have recently garnered significant
attention within the community, enabling LLMs to handle longer and more complex tasks such as
long-context question-answering [7, 8] and repository-level code understanding [9]. However, recent
research [8, 10, 11, 12, 13] indicates that these long-context LLMs struggle to effectively and
consistently utilize all the information provided in the context, exhibiting a position bias known as
"lost in the middle": LLMs tend to ignore information in the middle of the prompt, even though they
can utilize information at the beginning and end of the prompt well. This issue occurs in nearly all
LLMs [10, 14, 15], whether they are decoder-only or encoder-decoder models, powerful models or
small LLMs. For example, for the GPT-3.5-Turbo model on the NaturalQuestions multi-document QA
task, the performance difference between ground-truth information placed in the middle of the prompt
versus at the ends is 22 points with a 2.3k-token prompt [10]. This significantly
impacts the practical application of LLMs in real-world scenarios. Studies [16, 17] show that this
position bias becomes more severe as the context length increases, hindering the practical application
of long-context LLMs.
Previous works have analyzed this issue from the perspectives of data distribution [14, 18, 19] and
position embeddings [15, 20]. For example, FILM [19] addresses position bias by constructing data

† Work during internship at Microsoft.

Preprint. Under review.


with key information distributed in various positions for supervised fine-tuning (SFT). Ms-PoE [15]
mitigates position bias by interpolating RoPE [21] using head-wise scaling factors. However, these
methods require additional overhead for training or online estimation of scaling coefficients and are
currently applicable to only a few models, limiting their generalizability.
To fundamentally understand and alleviate position bias in LLMs, we first explored the micro-level
manifestation of position bias in LLMs and observed patterns in the attention weights consistent
with position bias. Next, we investigated the underlying causes of attention weight-induced position
bias. By separately modifying the position embedding and the causal mask, we found that, in addition
to the position embedding, the causal mask also significantly affects position bias. Further analysis
revealed that the causal mask introduces "positional hidden states", which are positively correlated
with absolute positions, thereby conveying positional information to LLMs. These positional hidden
states appear regardless of which position-encoding method is used, including RoPE [21], Alibi [22],
and even NoPE [23].
Based on the above findings, we propose a position bias mitigation method named "scale positional
hidden states". Specifically, we first design a prior-based searching algorithm that quickly identifies
which dimensions of hidden states within the model are positional hidden states, using monotonicity,
smoothness, and loss on a validation set as indicators. Next, we design an attention modification
algorithm that only lets the scaled hidden states influence the attention of the last token of the prompt,
implemented efficiently with FlashAttention [24].
Extensive experiments on various models, including LLaMA-2 [25], Vicuna [26], Mistral [27],
Gemma [28], Qwen [29], and MPT [30], and across different tasks, including Multi-document QA,
KV retrieval, LongBench [31] benchmark, and the timeline reorder task [11], demonstrate that our
method effectively mitigates position bias by modifying only one dimension of the hidden states of
the model, achieving improvements of up to 15.2%. Our method is compatible with various position
embeddings, including RoPE [21] and Alibi [22], and shows good generalization.
Our main contributions are as follows:

1. We find that position bias can be reflected in attention patterns.


2. We discover that the causal mask also introduces position bias and generates positional
hidden states correlated with absolute positions in the hidden layers.
3. We propose a method for identifying and scaling the positional hidden states to mitigate
position bias.

2 Beyond Position Embeddings: Causal Masks Also Contribute to Position Bias in LLMs
This section identifies patterns in attention weights that closely correspond to position bias. Additionally, we discover that, apart from position embeddings, positional information in LLMs can also be generated by the causal mask; this information tends to accumulate in a few specific hidden-state channels and bears significant responsibility for the emergence of position bias.

2.1 Microscopic Manifestations of Position Bias in Transformers: Attention Weight Patterns

The attention of an auto-regressive Transformer can be represented by the following equations:

q = \mathcal{P}(W^Q h(n), n), \quad k = \mathcal{P}(W^K h(m), m)
a_{n,m} = \mathrm{Softmax}\left(\frac{q k^{\top} + \mathrm{Mask}}{\sqrt{d}}\right)    (1)

where h is the hidden states and h(n) is the hidden state of the n-th token, W^Q and W^K are the weights of the linear layers, P is the position-encoding function such as RoPE [21], d is the dimensionality of the query and key states, n and m are the positional order information, and Mask is the causal mask.
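To make Eq. (1) concrete, here is a minimal single-head sketch in PyTorch. The position-encoding function `apply_pe` (standing in for P, e.g. RoPE) and the projection matrices are assumed inputs; this is an illustrative sketch, not the paper's released code.

```python
import math
import torch

def causal_attention_weights(h, W_Q, W_K, apply_pe):
    """Minimal single-head version of Eq. (1).

    h: (seq_len, d_model) hidden states; W_Q, W_K: (d_model, d) projections;
    apply_pe(x, pos): assumed position-encoding function P (e.g. RoPE).
    Returns the (seq_len, seq_len) attention weights a_{n,m}.
    """
    seq_len = h.shape[0]
    pos = torch.arange(seq_len)
    q = apply_pe(h @ W_Q, pos)  # P(W_Q h(n), n)
    k = apply_pe(h @ W_K, pos)  # P(W_K h(m), m)
    d = q.shape[-1]
    scores = q @ k.T / math.sqrt(d)
    # Causal mask: 0 where m <= n, -inf where m > n.
    mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    return torch.softmax(scores + mask, dim=-1)
```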
To explore the micro-level manifestations of position bias in Transformers, we analyzed the attention
weights for sentences containing key information, using a KV retrieval task that requires the model
to retrieve the ground-truth value of a given key from a list containing 50 key-value pairs (see
Appendix D for details). As shown in Figure 1, in deep layers the model exhibits retrieval-like
behavior, focusing on the ground-truth information and forming the diagonal pattern observed in Figure 1b.
In the shallow layers, by contrast, it always focuses most attention on the start or end of the prompt,
wherever the key information is located, exhibiting the vertical-line patterns shown in Figure 1a.
In the layers exhibiting retrieval-like behavior, the attention weights for the key information (Gold KV)
exhibit patterns similar to position bias: when the key information is located at the start or end of the
prompt, the attention weights focused on it are relatively high, while in the middle they are significantly
lower. Moreover, we extract the attention to the key information (averaged over layers 15~25) for
different context lengths in Figure 1c: as the context length grows, the attenuation of attention weights
with respect to position becomes more pronounced, reaching almost zero in the middle. More details
are given in Appendices I and D.
Furthermore, in Appendix E, we found that artificially adjusting the attention weights to the key
information can directly improve the corresponding accuracy. Thus, we claim that position bias is,
to a large extent, caused by these attention weight patterns at the micro level.

[Figure 1 graphics: attention (10^-3) heatmaps of the gold KV's index vs. each KV's index for two layers, and attention across context lengths.]
(a) Vertical Line Pattern (b) Diagonal Line Pattern (c) Across Context Lengths
Figure 1: Attention distribution of the ground-truth KV pair to each KV pair across different positions
on the KV retrieval task [10] using Mistral-7B [27]. (a) and (b) show the results averaged across
all heads of the layer. (c) shows the attention of the ground-truth KV to the ground-truth KV (i.e.,
diagonal lines from (b)) across different context lengths.

2.2 Causal Mask Also Contributes to Position Bias

[Figure 2 graphics: attention weight (1e-3) and accuracy (%) vs. position of the ground-truth KV (1st to 50th), for Original, Crop Mask, PE to Beginning, and PE to End.]

Figure 2: Performance of different methods with the ground-truth KV at different positions in the KV
retrieval task [10] using Mistral-7B [27].

Based on Eq. (1), the position embedding P allows LLMs to acquire positional information. However,
recent works [23, 32, 33] indicate that, besides position embeddings, the causal mask can also
introduce positional information.
Therefore, in this section, we aim to determine whether these two factors affect position bias by
modifying different properties of the ground-truth KV pair. Specifically, we introduce the following
three baselines: (1) Crop Mask, which modifies the causal mask so that the ground-truth KV pair
only sees itself but not the previous tokens. (2) PE to Beginning, which reduces the position IDs
of the ground-truth KV pair to be the same as the first KV pair. (3) PE to End, which increases the
position IDs of the ground-truth KV pair to be the same as the last KV pair. More details are provided
in Appendix F.
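The following sketch illustrates how these baselines can be expressed, assuming a standard additive causal mask and integer position ids; the helper names are ours, and the exact implementation used in the paper is described in Appendix F.

```python
import torch

def crop_causal_mask(seq_len, span_start, span_end, keep_first_token=True):
    """Crop Mask baseline: tokens in [span_start, span_end) no longer attend
    to earlier context (optionally keeping the first token, cf. Appendix F)."""
    neg_inf = float("-inf")
    mask = torch.triu(torch.full((seq_len, seq_len), neg_inf), diagonal=1)
    mask[span_start:span_end, :span_start] = neg_inf  # block previous tokens
    if keep_first_token:
        mask[span_start:span_end, 0] = 0.0            # keep attention to token 0
    return mask

def remap_position_ids(seq_len, span_start, span_end, target_start):
    """PE to Beginning / PE to End baselines: give the ground-truth span the
    position ids starting at target_start (0 for beginning, a late id for end)."""
    pos = torch.arange(seq_len)
    pos[span_start:span_end] = torch.arange(
        target_start, target_start + (span_end - span_start))
    return pos
```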
As shown in Figure 2, the original results exhibit a "lost in the middle" pattern not only in accuracy
but also in attention weight. Secondly, PE to End helps to a certain degree, but it can hardly bring the
model's performance up to the accuracy obtained when the ground-truth KV pair is positioned at the
start or end of the prompt. Furthermore, PE to Beginning results in a noticeable performance drop as
well as reduced attention weight when the gold KV is close to the end. In contrast, modifying the
causal mask effectively enhances attention, especially to the latter KVs, and improves performance in
the middle to almost on par with the beginning. Based on these observations, we conclude that,
besides the position embedding, the causal mask is also an important factor affecting position bias
and the corresponding attention weights. Moreover, solely modifying the position embedding hardly
alleviates position bias completely.

2.3 Causal Mask Stores Position Information in Specific Hidden-State Channels

Definition 2.1 (Positional Hidden States). Let h_k(p) denote the k-th dimension of the hidden states
at token position p. We define positional hidden states h_t as hidden states whose values vary
consistently and monotonically with the position sequence. Therefore, their derivative (after curve
fitting) should always be positive or negative:

• h'_t(p) > 0, \forall p  or  h'_t(p) < 0, \forall p

To further analyze how positional information is transmitted in Transformers, we define a special
type of hidden state that directly reflects absolute positional information with a high correlation to
position IDs, called positional hidden states, as given in Definition 2.1. We employ monotonicity
rather than correlation as the primary property of positional hidden states, since correlation does not
account for the sequential nature of positions. As shown in Figure 3, our experiments reveal that
causal LLMs consistently possess such hidden states across most layers (details in Appendix J), even
though these models have no explicit absolute position embeddings, which suggests that the causal
mask is a likely source of absolute positional information. We indeed show that it has a major influence
on positional hidden states through perturbation experiments on the causal mask and position
embedding in Appendix G. Combining this with the conclusion from Section 2.2, we conclude that
the causal mask stores positional information in some special hidden states, which in turn affect the
attention weights and cause position bias.
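As a concrete illustration of Definition 2.1, the NumPy sketch below (our own helper, not the released code) fits a cubic to one hidden dimension over positions and checks whether the fitted derivative keeps a single sign; the first positions are skipped because their values are often outliers (see Appendix C.2).

```python
import numpy as np

def is_positional_dimension(values, skip=100, degree=3):
    """Check Definition 2.1 for one hidden-state dimension.

    values: 1-D array of h_t(p) over token positions p.
    Returns True if the least-squares cubic fit of h_t(p) is monotonic,
    i.e. its derivative is everywhere positive or everywhere negative.
    """
    p = np.arange(skip, len(values))
    coeffs = np.polyfit(p, values[skip:], degree)      # curve fitting
    deriv = np.polyval(np.polyder(coeffs), p)          # fitted h'_t(p)
    return bool(np.all(deriv > 0) or np.all(deriv < 0))
```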

[Figure 3 graphics: hidden-state values of one dimension (Dim) plotted against position id for each model.]
(a) Mistral-7b (b) LLaMA-2-7b (c) MPT-30b (d) TinyLlama-NoPE-1.1B
Figure 3: Averaged positional hidden states across all layers in different models.

3 Methodology

Based on the findings in Section 2, although the causal mask profoundly influences position bias, it is
not feasible to know the positions of effective information in the prompt in advance, making methods
that modify the causal mask difficult to design. Therefore, we propose a method to mitigate position
bias by scaling the positional hidden states, as shown in Figure 4. Specifically, it consists of two
steps: identifying the positional hidden states h_t and scaling them by the factor s.

[Figure 4 schematic: the positional hidden states of the input X are scaled (f) before forming the last token's query q and the keys K, and the resulting attention row is combined with the original attention weights.]
Figure 4: The framework of scaling positional hidden states and modifying attention.

3.1 Problem Formulation

Given a pre-trained LLM θ and a general dataset {x, y}, our objective is to find the optimal positional
hidden states h_t and the corresponding scaling factor s that maximally reduce position bias, which can
be formulated as follows:

\arg\min_{h_t \in H,\, s < 1} \; \mathbb{E}\left[\sum_{i=1}^{|P|} \mathcal{L}\big(x, y, p_i; F(\theta, h_t, s)\big)\right]    (2)

where P represents the set of different positions of the ground-truth information within the prompt x,
F(θ, h_t, s) denotes the operation of scaling the t-th dimension of the LLM θ's hidden states by the
scaling factor s, and L denotes the loss of the modified model on general downstream tasks.

3.2 Identifying Positional Hidden States

We have defined positional hidden states in Definition 2.1. However, the original values of hidden
states may not strictly satisfy monotonicity. After curve fitting, we can identify dozens or hundreds of
dimensions that exhibit varying degrees of relevance to positional information. Thus, the first step of
our method is to find the dimension that best fits the properties of positional hidden states.

To efficiently search for the positional hidden states among the LLM's hidden-state dimensions, we
leverage the characteristics of positional hidden states defined in Section 2.3 and propose a prior-based
positional hidden search algorithm. As shown in Algorithm 1, the search process consists of two steps:
1) Identify the top-k dimensions ρ in the hidden states that are monotonic in more than ε layers and are
as smooth as possible. Here c_t is the number of layers in which h_t(p) is monotonic, and g_t is the
smoothness score of h_t(p), computed with Eq. (3). 2) Use a small validation dataset D_val = {x, y} to
evaluate the impact of scaling each of these candidate positional hidden states and select the positional
hidden states h_t that lead to the minimal loss L_t.

\mathrm{Smooth}(h_t) = \int |h''_t(p)|^2 \, dp    (3)

Algorithm 1 Positional Hidden State Search
1: Input: LLM θ, hidden states H, layer number L, validation set D_val, positions set P, threshold ε
   # Identify top-K positional dimensions
2: ρ ← ∅
3: for t ← 1 to |H| do
4:   c_t ← 0, g_t ← 0
5:   for l ← 1 to L do
6:     if h'_t(p) > 0, ∀p or h'_t(p) < 0, ∀p then
7:       c_t ← c_t + 1, g_t ← g_t + Smooth(h^l_t)
8:     end if
9:   end for
10:  if c_t > ε then
11:    ρ ← ρ ∪ {t}
12:  end if
13: end for
14: ρ ← arg min-K_{t∈ρ} g_t
    # Evaluate on the validation dataset
15: for t ∈ ρ do
16:   L_t ← 0
17:   for p ∈ P do
18:     L_t ← L_t + L(x, y, p; F(θ, h_t, s))
19:   end for
20: end for
21: t ← arg min_{t∈ρ} L_t
22: return t

As for selecting the best scale factor, we take 0.5, 0, -0.5, and -1, run each on the validation set to
obtain the validation loss, and then select the scaling factor with the lowest loss.
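For illustration, a compact Python sketch of Algorithm 1 under simplifying assumptions: the hidden states are pre-collected as a (layers, positions, dims) array, `is_positional_dimension` is the monotonicity check sketched in Section 2.3, the smoothness of Eq. (3) is approximated with a second-order difference (Appendix C.2), and the validation loss `eval_loss` is task-specific and left abstract.

```python
import numpy as np

def smoothness(values, skip=100):
    """Approximate Eq. (3) by summing squared second-order differences."""
    return float(np.sum(np.diff(values[skip:], n=2) ** 2))

def search_positional_dimension(hidden, eval_loss, positions, eps, top_k=10):
    """Prior-based search of Algorithm 1 (sketch).

    hidden: array of shape (num_layers, seq_len, num_dims) of hidden states.
    eval_loss(t, p): validation loss when dimension t is scaled and the
        ground-truth information sits at position p (abstract, task-specific).
    Returns the dimension with the lowest total validation loss.
    """
    num_layers, _, num_dims = hidden.shape
    candidates = []
    for t in range(num_dims):
        c_t, g_t = 0, 0.0
        for layer in range(num_layers):
            values = hidden[layer, :, t]
            if is_positional_dimension(values):   # monotonic after curve fitting
                c_t += 1
                g_t += smoothness(values)
        if c_t > eps:                             # monotonic in enough layers
            candidates.append((g_t, t))
    top = [t for _, t in sorted(candidates)[:top_k]]  # top-k smoothest dims
    return min(top, key=lambda t: sum(eval_loss(t, p) for p in positions))
```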

3.3 Scaling the Positional Hidden States

To minimize the impact of this modification on the semantics of LLMs, we propose to scale the
positional hidden states so that only the last token is affected, as shown in Figure 4. Specifically, for
the tokens preceding the last token, the attention calculation remains the same as the original. For the
attention computation of the last token of a sequence of length l, we obtain the modified query state
q_l (of the l-th token, i.e., the last token) and key states K (of all tokens) by scaling the positional
hidden states. That is,

q_l = \mathcal{P}(W^Q f(h(l), p, s), l), \quad K = \mathcal{P}(W^K f(h, p, s), [1, 2, \ldots, l])    (4)

Here f(h, p, s) means the p-th dimension of h is scaled by the factor s. The corresponding attention
calculation is then:

z = \begin{cases} \mathrm{Softmax}\left(\frac{q_i K^{\top} + \mathrm{Mask}}{\sqrt{d}}\right) V, & i < l \\ \mathrm{Softmax}\left(\frac{q_l K^{\top}}{\sqrt{d}}\right) V, & i = l \end{cases}    (5)

where z is the attention output. We use FlashAttention [24] to implement our method with minimal
overhead. After obtaining the combined attention weights, the remaining computations are the same
as the original. As shown in Appendix C.4, our method only causes a slight increase in latency.
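A simplified, non-Flash sketch of Eqs. (4)-(5) for a single head is given below; `apply_pe` again stands in for P, and `dim` and `scale` are the positional dimension and factor found in Section 3.2. The actual implementation fuses this into FlashAttention, so the code is only meant to show the logic.

```python
import math
import torch

def last_token_scaled_attention(h, W_Q, W_K, W_V, apply_pe, dim, scale):
    """Attention output where only the last token's row of attention weights is
    computed from hidden states whose dimension `dim` is scaled by `scale`."""
    seq_len = h.shape[0]
    pos = torch.arange(seq_len)
    d = W_Q.shape[1]

    # Original attention for all rows i < l.
    q = apply_pe(h @ W_Q, pos)
    k = apply_pe(h @ W_K, pos)
    v = h @ W_V
    mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    attn = torch.softmax(q @ k.T / math.sqrt(d) + mask, dim=-1)

    # Eq. (4): recompute q_l and K from scaled hidden states f(h, p, s).
    h_scaled = h.clone()
    h_scaled[:, dim] *= scale
    q_last = apply_pe(h_scaled @ W_Q, pos)[-1]
    k_scaled = apply_pe(h_scaled @ W_K, pos)

    # Eq. (5): replace only the last row of the attention weights.
    attn[-1] = torch.softmax(q_last @ k_scaled.T / math.sqrt(d), dim=-1)
    return attn @ v
```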

4 Experiments

4.1 Setup

Evaluation Tasks and Models We apply our method to a wide range of state-of-the-art open-source
LLMs, including: 1) RoPE [34] models: LLaMA-2 (7B, 13B) [25], Mistral-7B [27], Gemma-7B [28],
Qwen1.5-7B [29]; 2) Context window extended models: Vicuna (7B, 13B) [26]; 3) Alibi [22] models:
MPT-30B [30]. All the models we use are instruction-tuned versions.
We evaluate performance across three aspects: 1) Position-bias-related tests on NaturalQuestions
multi-document QA [10] and KV retrieval [10] with the ground truth at different positions in the
prompt. The NaturalQuestions task includes 20 documents with a prompt length of about 2.3k tokens,
while the KV retrieval task includes 140 KV pairs with an average length of about 10k tokens. 2) A
general long-context benchmark, LongBench [31], including multi-document QA, single-document
QA, summarization, few-shot learning, synthetic tasks, and code completion, totaling 16 tasks with
an average length of 37k tokens. 3) A position-sensitive task, timeline reordering from LooGLE [11],
with an average length of 10k tokens. For prompts that exceed the context windows of LLMs, we
follow LongBench's approach of truncating from the middle and retaining the head and tail of the
prompt to fit within the context window. We use the metrics and scripts provided by these
benchmarks for evaluation.

Implementation Details In this paper, we implement our approach using PyTorch, HuggingFace
Transformers, and FlashAttention [24] on an A100 GPU. To ensure stable and reproducible results,
we use greedy decoding in all experiments. For the search part, we set the top-k size of positional
hidden states to 10 and ε to L/4, where L is the number of layers. The validation set is a synthetic
KV retrieval dataset consisting of 100 examples, which do not overlap with the test set. The search
process takes approximately 10 minutes. For the scaling part, we only modify the intermediate layers
of the model to minimize the negative impact on performance. The details of the scaling dimensions,
layer ranges, and factors are shown in Table 4. More details are provided in Appendix C.

Baselines We include two training-free position bias mitigation approaches as our baselines: (i)
Original, the original results of the LLMs with the ground truth at different positions in the prompt. (ii)
w/ Ms-PoE [15], which uses a head-aware position embedding scaling method to mitigate position bias.
We follow the paper's settings and apply scaling coefficients of 1.2 to 1.8 starting from the 3rd layer.

Table 1: Performance of different methods with different models on NaturalQuestions (20 docs) [10]
and KV retrieval (140 KV pairs) [10] dataset.
Methods | NaturalQuestions: 1st, 5th, 10th, 15th, 20th, Avg. | KV Retrieval: 0%, 25%, 50%, 75%, 100%, Avg.
LLaMA-2-7b-chat 32.4 23.8 30.6 31.6 38.2 31.3 77.6 24.6 62.0 35.6 78.0 55.6
LLaMA-2-7b-chat w/ Ms-PoE 40.8 29.2 33.0 32.8 39.6 35.1 95.0 29.8 21.4 51.8 89.8 57.6
LLaMA-2-7b-chat w/ Ours 33.6 34.0 40.6 43.0 51.8 40.6 63.6 38.0 82.2 40.6 94.6 63.8
LLaMA-2-13b-chat 45.2 39.6 40.4 44.2 51.0 44.1 74.2 39.0 70.4 84.4 86.8 71.0
LLaMA-2-13b-chat w/ Ms-PoE 48.4 41.4 42.4 45.4 52.6 46.0 87.8 28.0 35.4 49.2 83.0 56.7
LLaMA-2-13b-chat w/ Ours 50.6 43.4 45.0 49.4 58.2 49.3 41.2 17.0 49.6 76.8 84.8 53.9
Vicuna-7b-v1.5-16k 70.4 54.8 46.8 45.8 47.8 53.1 98.4 0.8 0.2 0.2 0.2 20.0
Vicuna-7b-v1.5-16k w/ Ms-PoE 67.0 55.2 50.6 46.8 48.2 53.6 97.4 36.8 15.6 5.2 6.6 32.3
Vicuna-7b-v1.5-16k w/ Ours 63.8 57.6 53.6 51.2 55.6 56.4 95.4 22.0 12.6 5.2 20.4 31.1
Vicuna-13b-v1.5-16k 67.4 48.2 45.2 45.6 44.4 50.2 95.6 74.2 64.2 58.8 18.2 62.2
Vicuna-13b-v1.5-16k w/ Ms-PoE 70.0 51.4 46.8 42.8 47.0 51.6 91.8 59.4 71.6 74.4 48.8 69.2
Vicuna-13b-v1.5-16k w/ Ours 67.4 51.4 47.6 48.8 48.0 52.7 97.2 83.4 80.8 68.8 35.4 73.1
Mistral-7b-Instruct-v0.2 57.2 55.0 61.2 61.6 62.6 59.5 99.8 93.0 89.0 95.0 94.2 94.2
Mistral-7b-Instruct-v0.2 w/ Ms-PoE 58.2 60.0 62.6 58.8 62.2 60.4 99.8 95.6 88.4 96.0 95.4 95.0
Mistral-7b-Instruct-v0.2 w/ Ours 61.2 56.4 63.2 59.8 64.0 60.9 97.6 93.2 90.6 95.6 93.8 94.2
Gemma-1.1-7b-it 29.6 25.2 28.2 29.6 27.4 28.0 98.6 67.0 62.4 83.4 100.0 82.3
Gemma-1.1-7b-it w/ Ms-PoE 33.8 29.0 31.6 28.6 28.6 30.3 0.0 0.0 0.0 0.0 0.0 0.0
Gemma-1.1-7b-it w/ Ours 35.4 31.4 36.0 35.4 35.0 34.6 97.6 95.8 97.6 96.8 99.6 97.5
Qwen1.5-7b-chat 72.4 53.8 52.2 51.2 54.4 56.8 100.0 97.2 84.6 60.0 56.4 79.6
Qwen1.5-7b-chat w/ Ms-PoE 67.4 49.8 48.2 47.4 47.0 52.0 3.4 1.4 2.8 2.6 0.6 2.2
Qwen1.5-7b-chat w/ Ours 67.4 55.2 53.6 56.0 59.4 58.3 97.2 95.6 98.8 76.6 94.4 92.5
MPT-30b-chat 75.6 49.6 39.0 33.4 39.6 47.4 71.4 34.8 31.6 41.6 74.0 50.7
MPT-30b-chat w/ Ms-PoE / / / / / / / / / / / /
MPT-30b-chat w/ Ours 75.0 48.8 41.6 40.6 44.0 50.0 99.0 65.8 48.6 46.6 69.4 65.9

4.2 Main Results

Tables 1, 2, and 6 present the performance of various methods in different benchmarks. Several
observations and conclusions can be drawn: 1) Our method consistently improves overall performance
at different positions, with increases of up to 9.3%, 15.2%, and 4.7% in NQ, KV retrieval, and
LongBench, respectively, except for LLaMA-2-13B in KV retrieval. Additionally, compared to the
SoTA method Ms-PoE, our method shows significant improvements of up to 6.3%, 97.5%, and 14%
in NQ, KV retrieval, and LongBench. The poor performance of Ms-PoE in KV retrieval can be
attributed to the interpolation causing information loss. 2) Our method effectively enhances LLMs’
understanding of information located in the middle and latter parts of the prompt. For key information
at the beginning of the prompt, performance is comparable to baselines. Considering only the average
performance of the last four positions, our method’s improvements over the original increase to
11.3% and 16.8% in NQ and KV retrieval, respectively, and over Ms-PoE increase to 8.7% and
97.5% in NQ and KV retrieval, respectively. 3) Our approach is effective not only for RoPE models
but also for context window extended models like Vicuna-16K, which already readjust RoPE [34].
Additionally, our method can be adapted to different position embeddings, such as Alibi [22] models
like MPT, resulting in improvements of 2.6%, 15.2%, and 1.2% in NQ, KV retrieval, and LongBench,
respectively. 4) Our method demonstrated varying degrees of improvement across different tasks,
with the most significant increases being 22.9% in few-shot learning tasks, 8.6% in code tasks, 4%
in synthetic tasks, 9.2% in single document QA tasks, and 1.9% in multi-document QA tasks. In
summarization tasks, performance was nearly on par with the original results. 5) Our method does
not disrupt the necessary position information in LLMs, as detailed in Appendix H.

4.3 Analysis

Table 2: Performance of different methods with different models on LongBench [31].
Models SingleDoc MultiDoc Synth. Summ. FewShot Code AVG
LLaMA-2-7b-chat 28.9 29.7 6.6 26.3 10.2 12.2 19.0
LLaMA-2-7b-chat w/ Ms-PoE 29.8 31.7 10.5 26.7 6.4 13.2 19.7
LLaMA-2-7b-chat w/ Ours 29.2 29.3 9.7 25.0 18.9 20.8 22.1
LLaMA-2-13b-chat 21.4 14.6 11.2 26.1 4.7 16.9 15.8
LLaMA-2-13b-chat w/ Ms-PoE 20.8 15.4 12.7 27.3 3.1 15.7 15.8
LLaMA-2-13b-chat w/ Ours 30.6 9.6 10.8 25.7 27.6 18.7 20.5
Vicuna-7b-v1.5-16k 30.2 21.6 7.2 26.7 9.4 21.2 19.4
Vicuna-7b-v1.5-16k w/ Ms-PoE 32.3 24.2 8.3 28.0 9.8 22.2 20.8
Vicuna-7b-v1.5-16k w/ Ours 27.1 22.1 11.2 26.1 16.7 20.2 20.6
Vicuna-13b-v1.5-16k 31.1 33.8 21.2 26.2 21.6 23.8 26.3
Vicuna-13b-v1.5-16k w/ Ms-PoE 34.5 33.1 16.0 27.5 21.0 25.0 26.2
Vicuna-13b-v1.5-16k w/ Ours 30.1 35.1 25.0 25.8 27.0 24.7 27.9
Mistral-7b-Instruct-v0.2 37.8 28.5 49.7 28.8 49.9 44.0 39.8
Mistral-7b-Instruct-v0.2 w/ Ms-PoE 41.7 22.2 38.4 24.9 14.0 19.5 26.8
Mistral-7b-Instruct-v0.2 w/ Ours 38.4 30.4 49.8 29.4 51.4 45.3 40.8
Gemma-1.1-7b-it 39.4 23.2 32.2 24.2 14.4 19.8 25.5
Gemma-1.1-7b-it w/ Ms-PoE 41.7 22.2 38.4 24.9 14.0 19.5 26.8
Gemma-1.1-7b-it w/ Ours 39.0 23.0 35.5 24.5 14.9 19.3 25.7
Qwen1.5-7b-chat 46.4 39.5 38.4 22.3 39.9 44.6 38.5
Qwen1.5-7b-chat w/ Ms-PoE 42.0 41.5 30.3 25.7 43.2 41.4 37.4
Qwen1.5-7b-chat w/ Ours 45.8 38.8 38.5 22.1 40.0 48.1 38.9
MPT-30b-chat 27.9 21.9 7.5 25.7 18.8 16.7 19.7
MPT-30b-chat w/ Ms-PoE / / / / / / /
MPT-30b-chat w/ Ours 29.4 19.5 6.7 25.8 23.0 21.2 20.9

From Bias to Balance As shown in Table 1, our method mainly helps when the key information is
not at the beginning, but it can decrease performance if the model performs significantly better when
the key information is at the beginning. This reveals that the positional hidden states may be an
important factor causing the model to miss the rear parts of the context while focusing too much on
the beginning. Therefore, scaling this dimension can shift the model's attention from being overly
focused on the beginning to a more balanced distribution. We validated these points by testing
different scale factors, as shown in Figure 5.



 
[Figure 5 graphics: attention (10^-3) and accuracy (%) across gold KV depths for different scale factors.]
(a) Scale factor v.s. Attention (b) Scale factor v.s. Performance
Figure 5: Attention distribution and performance when scaling dimension 2393 of Vicuna-7b-v1.5-
16k with different scale factors on KV retrieval [10] of 100 KV pairs.

Scale Factor The scaling factor directly controls the degree and direction of the impact of position
hidden states on position bias. As shown in Figure 5, when the scaling factor is positive, the model
exhibits a clear bias towards focusing more on the beginning, while when negative, this bias shifts
to focusing more on the end. A factor between 0.5 and -1 leads to the most balanced attention
distribution; meanwhile, the improvement in accuracy also reaches its peak. This result proves
that the positional hidden states we scaled can indeed influence the bias of LLMs towards focusing
excessively on the beginning. By adjusting the coefficients appropriately, this bias can be effectively
mitigated.

Table 3: Average performance of different ground-truth positions using different methods on Natu-
ralQuestions multi-document QA dataset (20 docs) [10].
Method LLaMA-2-7b Vicuna-13b Gemma-7b Mistral-7b Qwen1.5-7b
Original 31.3 50.2 28.0 59.5 56.8
Ours 40.6 52.7 34.6 60.9 58.3
w/o monotonicity 40.6 51.8 34.6 60.9 58.3
w/o smoothness 40.6 52.7 27.8 60.9 58.3
w/o validation set 30.1 51.8 26.5 60.9 58.3
w/ scale 2 dimensions 37.2 50.8 31.7 60.1 57.2
w/ modify last 16 tokens 41.6 51.5 34.6 59.7 58.1
w/ modify all tokens 44.0 50.8 31.7 59.5 57.4

Ablation Study To evaluate the contributions of different components in our method, we introduce
the following sets for the ablation study: (1) Ours w/o monotonicity, w/o smoothness, and w/o
validation set, which adjust the search algorithm by not considering these three indicators, respectively
(details in Appendix C.2). (2) Ours w/ scale 2 dimensions, which modifies the top-2 positional hidden
states simultaneously. (3) Ours w/ modify last 16 tokens and w/ modify all tokens, which adjust the
range of tokens affected by the scaling operation in Equ.(5).
Table 3 shows the ablation results. Without filtering by monotonicity or smoothness, performance
may decline, and removing the validation set results in a larger decline in model performance. When
the range of tokens or dimensions affected by scaling is expanded, most models experience varying
degrees of performance loss. Considering these factors, we choose to modify only the last token and
the top-1 positional dimension to achieve the best performance.

5 Related Works
Long-Context LLMs Recent research has focused on expanding the context window size of LLMs.
The main approaches include: 1) Staged pre-training [35, 36]: Gradually increasing the context
window size during training. 2) Modifying or interpolating position embeddings [22, 34, 37, 38].
3) Utilizing external memory modules for context storage [39, 40]. 4) Expanding computations
across multiple devices in a distributed manner [41]. While these methods address context window
expansion, their impact on positional bias in downstream tasks has yet to be discussed.

Addressing Position Bias Although LLMs incorporate explicit positional information through
methods like RoPE [21] or Alibi [22], studies such as [10, 16] have found that LLMs exhibit varying
degrees of position bias, referred to as "lost in the middle". Recent works aiming to mitigate this
issue and improve LLM performance in long-context scenarios can be categorized as follows: 1)
RoPE-based methods: These approaches modify the RoPE computation process to alleviate long-
distance information decay, including Attention Bucket [20], which uses an ensemble of multiple
RoPE bases to mitigate position bias, and Ms-PoE [15], which dynamically interpolates with a small
coefficient for different heads. 2) SFT-based methods [14, 18, 19]: These methods construct data with
more diverse key information distributions or employ system2think SFT tasks to mitigate position
bias. They require further training of the model. 3) Attention mask-based methods [42]: These
methods modify attention mechanisms, including Attention Transition [43], which redirects attention
to significant parts of the context and Stable Mask [44], which introduces pseudo attention into
the causal mask, ensuring stable attention distribution when facing lengthy texts. 4) Prompt-based
methods [45, 46]: These methods introduce an external module to reorder or compress information
in the prompt, thereby mitigating position bias.

6 Conclusion
This paper proposes a method of scaling positional hidden states to mitigate the position bias issue
in LLMs. Specifically, the study first confirms that attention weights manifest position bias within
Transformers. Additionally, experiments demonstrate that, besides position embeddings, the causal
mask also contributes to position bias, which is transmitted to other modules through hidden states
containing absolute positional information, termed positional hidden states. Based on this, we
introduce a prior-based positional hidden search algorithm and mitigate the model's position bias by
scaling the identified positional hidden states. Testing eight open-source models with different
position embeddings on tasks such as NaturalQuestions multi-document QA, KV retrieval, and
LongBench, the results show that our method effectively reduces position bias and improves model
performance.

References
[1] Gradient. Llama-3 8b instruct gradient 4194k (v0.1), 2024.

[2] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-
baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al.
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv
preprint, abs/2403.05530, 2024.

[3] Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video
and language with ringattention. ArXiv preprint, abs/2402.08268, 2024.

[4] Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li,
Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. ArXiv
preprint, abs/2403.04652, 2024.

[5] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany
Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha
Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu
Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon,
Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider,
Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann,
Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat
Lee, Yuanzhi Li, Chen Liang, Weishung Liu, Eric Lin, Zeqi Lin, Piyush Madan, Arindam Mitra,
Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas
Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji
Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang,
Hiteshi Sharma, Xia Song, Masahiro Tanaka, Xin Wang, Rachel Ward, Guanhua Wang, Philipp
Witte, Michael Wyatt, Can Xu, Jiahang Xu, Sonali Yadav, Fan Yang, Ziyi Yang, Donghan Yu,
Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang,
Yunan Zhang, and Xiren Zhou. Phi-3 technical report: A highly capable language model locally
on your phone, 2024.

[6] DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024.

[7] Avi Caciularu, Matthew E Peters, Jacob Goldberger, Ido Dagan, and Arman Cohan. Peek
across: Improving multi-document modeling via cross-document question-answering. pages
1970–1989, 2023.

[8] Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, and Wenhu Chen. Long-context LLMs Struggle
with Long In-context Learning, 2024.

[9] Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, DC Vageesh, Arun Iyer, Suresh
Parthasarathy, Sriram Rajamani, B Ashok, and Shashank Shet. Codeplan: Repository-level
coding using llms and planning. 2023.

[10] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni,
and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of
the Association for Computational Linguistics, 12:157–173, 2024.

[11] Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. LooGLE: Can Long-Context
Language Models Understand Long Contexts?, 2023.

[12] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael
Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context,
2023.

[13] Ruixiang Tang, Dehan Kong, Longtao Huang, and Hui Xue. Large language models can be
lazy learners: Analyze shortcuts in in-context learning. In Findings of the Association for
Computational Linguistics: ACL 2023, pages 4645–4657, 2023.

[14] He Junqing, Pan Kunhao, Dong Xiaoqun, Song Zhuoyang, Liu Yibo, Liang Yuxin, Wang Hao,
Sun Qianguo, Zhang Songxin, Xie Zejian, et al. Never lost in the middle: Improving large lan-
guage models via attention strengthening question answering. ArXiv preprint, abs/2311.09198,
2023.

[15] Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia
Wu, and Zhangyang Wang. Found in the Middle: How Language Models Use Long Contexts
Better via Plug-and-Play Positional Encoding, 2024.

[16] Greg Kamradt. Needle in a haystack - pressure testing llms, 2023.

[17] Jun Zhao, Can Zu, Hao Xu, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi Zhang, and Xuanjing
Huang. Longagent: Scaling language models to 128k context through multi-agent collaboration.
ArXiv preprint, abs/2402.11550, 2024.

[18] Yijiong Yu. Training With "Paraphrasing the Original Text” Improves Long-Context Perfor-
mance, 2023.

[19] Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, and Jian-Guang Lou. Make Your LLM
Fully Utilize the Context, 2024.

[20] Yuhan Chen, Ang Lv, Ting-En Lin, Changyu Chen, Yuchuan Wu, Fei Huang, Yongbin Li,
and Rui Yan. Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large
Language Models for Effective Tool Use, 2023.

[21] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
Enhanced transformer with rotary position embedding, 2024.

[22] Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear
biases enables input length extrapolation. In The Tenth International Conference on Learning
Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.

[23] Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. Transformer language models
without positional encodings still learn positional information. In Findings of the Association
for Computational Linguistics: EMNLP 2022, pages 1382–1390, Abu Dhabi, United Arab
Emirates, 2022. Association for Computational Linguistics.

[24] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. 2023.

[25] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023.

[26] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna:
An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.

[27] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, et al. Mistral 7b. ArXiv preprint, abs/2310.06825, 2023.

[28] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya
Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open
models based on gemini research and technology. ArXiv preprint, abs/2403.08295, 2024.

[29] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin
Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu,
Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren,
Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu,
Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu,
Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang,
Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. ArXiv
preprint, abs/2309.16609, 2023.
[30] MosaicML NLP Team. Introducing mpt-30b: Raising the bar for open-source foundation
models, 2023. Accessed: 2023-06-22.
[31] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao
Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A
bilingual, multitask benchmark for long context understanding. ArXiv preprint, abs/2308.14508,
2023.
[32] Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, and Xiaoling
Wang. Length Generalization of Causal Transformers without Position Encoding, 2024.
[33] Ta-Chung Chi, Ting-Han Fan, Li-Wei Chen, Alexander Rudnicky, and Peter Ramadge. Latent
positional information is in the self-attention variance of transformer language models without
positional embeddings. In Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), pages 1183–1193, 2023.
[34] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending Context
Window of Large Language Models via Positional Interpolation, 2023.
[35] Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih
Yavuz, Philippe Laban, Ben Krause, Senthil Purushwalkam, Tong Niu, Wojciech Kryściński,
Lidiya Murakhovs’ka, Prafulla Kumar Choubey, Alex Fabbri, Ye Liu, Rui Meng, Lifu Tu,
Meghana Bhat, Chien-Sheng Wu, Silvio Savarese, Yingbo Zhou, Shafiq Joty, and Caiming
Xiong. Xgen-7b technical report. ArXiv preprint, abs/2309.03450, 2023.
[36] Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and
Hao Peng. Data engineering for scaling language models to 128k context. ArXiv preprint,
abs/2402.10171, 2024.
[37] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context
window extension of large language models, 2023.
[38] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan
Yang, and Mao Yang. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens,
2024.
[39] Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R. Gormley. Unlimiformer: Long-
range transformers with unlimited length input. In Thirty-seventh Conference on Neural
Information Processing Systems, 2023.
[40] Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski,
and Piotr Miłoś. Focused transformer: Contrastive training for context scaling. In Thirty-seventh
Conference on Neural Information Processing Systems, 2023.
[41] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for
near-infinite context. 2023.
[42] Zhiyuan He, Huiqiang Jiang, Zilong Wang, Yuqing Yang, Luna Qiu, and Lili Qiu. Position
engineering: Boosting large language models through positional information manipulation.
arXiv preprint arXiv:2404.11216, 2024.
[43] Yifei Gao, Lei Wang, Jun Fang, Longhua Hu, and Jun Cheng. Empower Your Model with
Longer and Better Context Comprehension, 2023.

[44] Qingyu Yin, Xuzheng He, Xiang Zhuang, Yu Zhao, Jianhua Yao, Xiaoyu Shen, and Qiang
Zhang. StableMask: Refining Causal Masking in Decoder-only Transformer, 2024.
[45] Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and
Lili Qiu. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via
Prompt Compression, 2023.
[46] Alexander Peysakhovich and Adam Lerer. Attention Sorting Combats Recency Bias In Long
Context Language Models, 2023.
[47] Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval Head
Mechanistically Explains Long-Context Factuality, 2024.
[48] Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive Activations in Large
Language Models, 2024.

A Limitations
For different LLMs, an additional search cost is required to determine the positional hidden states,
including identifying the top-K positional dimensions and evaluating them on the small validation
dataset, which takes approximately 10 minutes for 7B-level LLMs on a single A100 GPU.
Due to the difficulty of fully explaining the impact of a given dimension on an LLM's behavior, the
top-1 dimension returned by the search algorithm may not always have the optimal performance
across tasks; sometimes the optimal dimension is among the top-3 results.
Because of the recalculation of the query and key states, the additional time cost will increase linearly
with the length of the input sequence.

B Broader Impacts
Our methods explore the intrinsic causes of position bias in LLMs and propose a way to alleviate this
bias by pruning positional hidden states. This improves the inference capabilities of LLMs, making
them more applicable to a wider range of scenarios, especially long-context LLMs and more complex
applications. Additionally, our work promotes further research on the deep relationships between
the causal mask, hidden states, attention weights and position bias in LLMs, aiding in the iterative
development of related technologies.

C Experiment Details
C.1 Datasets Details

We choose the NaturalQuestions multi-document QA and Key-Value retrieval datasets used in the
"lost in the middle" paper [10] to evaluate the degree to which our method alleviates position bias.
NaturalQuestions multi-document QA requires the model to answer a question based on one key
information document inserted into a long context consisting of many irrelevant documents.
Key-Value retrieval requires the model to retrieve the value corresponding to a given key from a list
consisting of hundreds of key-value pairs. These two datasets are both classic in-context tasks that
aim to evaluate differences in model performance when key information is located at different
positions in the context. The evaluation metric is accuracy, based on whether the model's response
contains a string of the correct answer. In addition, we evaluate our method's improvements across
multiple task types using LongBench [31], a benchmark for bilingual, multitask, and comprehensive
assessment of the long-context understanding capabilities of LLMs. It contains six major categories,
covering single-document QA, multi-document QA, summarization, few-shot learning, synthetic
tasks, and code completion. The evaluation metrics are: F1 for single-document QA and
multi-document QA, Rouge-L for summarization, accuracy (exact match) for few-shot learning and
synthetic tasks, and edit similarity for code completion. During inference, since the original context
may sometimes be too long, input sequences are truncated in the middle to avoid exceeding the
context window of the model.
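A small sketch of this LongBench-style middle truncation (the helper name is ours): the head and tail of the token sequence are kept and the middle is dropped so the prompt fits within the context window.

```python
def truncate_middle(tokens, max_len):
    """Keep the head and tail of `tokens`, dropping the middle, so that the
    result has at most `max_len` tokens (LongBench-style truncation)."""
    if len(tokens) <= max_len:
        return list(tokens)
    half = max_len // 2
    return list(tokens[:half]) + list(tokens[len(tokens) - (max_len - half):])
```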

C.2 Additional Implemention Details

Curve Fitting When we perform curve fitting on h(p), we use a least-squares cubic polynomial fit.
When judging monotonicity, we skip the first 100 positions because the first few values are often
outliers. Since h(p) is originally a discrete function, in practice we employ the second-order
difference to approximate the second-order derivative when computing smoothness.

Ms-PoE on Mistral When applying Ms-PoE [15] to Mistral-7B [27] with its default parameters
(minimum scale factor 1.2 and maximum 1.8), we found the model fails to generate normal responses,
so we set the maximum scale factor to 1.2, under which Ms-PoE [15] is equivalent to PI [34] with a
scale factor of 1.2.

Ablation of the Searching Algorithm We conducted ablation experiments to demonstrate the
necessity of using the three indicators (monotonicity, smoothness, validation loss) in our searching
algorithm. Ours w/o monotonicity means we select the top-10 smoothest dimensions and then decide
using the validation loss. Ours w/o smoothness means we select the top-10 dimensions with the
highest number of monotonic layers and then use the validation loss. Ours w/o validation loss means
we first select the top-10 dimensions with the highest number of monotonic layers and then simply
choose the smoothest one among them.

C.3 Scaled Dimensions Details

Table 4: The scaled dimensions, scale factors and applied layers of models.
Model Dimension Scale factor Applied layers
LLaMA-2-7b-chat 2,393 -1 10~25
LLaMA-2-13b-chat 4,283 -1 10~34
Vicuna-7b-v1.5-16k 2,393 0 10~25
Vicuna-13b-v1.5-16k 4,923 0 10~34
Mistral-7B-Instruct-v0.2 213 0 10~25
Gemma-1.1-7b-it 1,665 0 10~22
Qwen1.5-7b-chat 1,081 0.2 10~25
MPT-30b-chat 6,926 0 10~42

The scaled dimensions, scale factors, and applied layers of each model used in our experiments are
shown in Table 4.

C.4 Inference Latency

Table 5: Time consumed (minutes) by LLaMA-2-7b-chat on a single A100 GPU.
Method KV Retrieval NaturalQuestion
FlashAttention-2 22 14
Ours 32 15
Ms-PoE 61 26

Table 5 shows the running time of LLaMA-2-7b-chat with different methods on the KV retrieval
dataset, consisting of 500 samples with an average length of about 10,000 tokens, and the
multi-document QA dataset, consisting of 500 samples with an average length of about 3,300 tokens.
Our method requires recomputing the query and key states and thus inevitably takes more time than
the baseline, but the cost is within an acceptable range. In contrast, Ms-PoE [15] needs to compute
the attention weights twice, resulting in a doubling of time consumption.

D Obtain Attention to Key Information


To avoid the influence of the model's internal knowledge and to make the attention calculation
simpler, we conduct a KV retrieval task whose prompt format is as follows:

Json data: {"os08jbk1limft6wgxeda": "imx6lyp4b8ogjaq7ret1", ......(n key-value pairs)} The
value of key "os08jbk1limft6wgxeda" is "

The last token of the prompt directly takes on the task of predicting the answer, i.e., the value that
needs to be retrieved. Hence, the last token's attention weights to the previous text can reflect
whether it accurately retrieves the key information. We define the model's attention (in some layer)
to the key information as A_G in Eq. (6), where G represents the set of token positions corresponding
to the key information, l is the position of the last token of the prompt, and a_{l,j} represents the
attention weight of the l-th token to the j-th token. By shifting G, we use the same method to
calculate the attention to every other KV pair.
A_G = \frac{1}{|G|} \sum_{j \in G} a_{l,j}    (6)
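A one-function sketch of Eq. (6), assuming the head-averaged attention matrix of one layer is available as a tensor:

```python
import torch

def attention_to_key_info(attn, gold_span):
    """Eq. (6): average attention from the last prompt token to the key tokens.

    attn: (seq_len, seq_len) head-averaged attention weights of one layer.
    gold_span: (start, end) token positions G of the ground-truth KV pair.
    """
    start, end = gold_span
    return attn[-1, start:end].mean()
```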

E Attention vs. Performance

 

 

[Figure 6 graphics: attention and accuracy plotted against the gold KV position, with and without the attention modification.]
(a) Attention (b) Attention w/ modification (c) Accuracy (d) Accuracy w/ modification
Figure 6: Distribution of attention weight and accuracy as the ground-truth KV is placed at different
positions in the prompt. (b) and (d) are situations when the attention on the 25th KV pair is modified.

As illustrated in Figure 6, when we manually multiply by 2 all the attention weights to the tokens
belonging to the key information (here we only choose the 25th KV pair, as shown in Figure 6b)
during the model's forward pass on the KV retrieval task, the corresponding retrieval accuracy for the
25th KV also improves, while the other positions remain mostly unchanged, as shown in Figure 6d.
This result shows that the attention weight on the key information is positively correlated with the
retrieval accuracy.

F How We Modify Causal Mask and Position Embedding in KV Retrieval

In method 1 of Section 2.2, we crop the causal mask so that the "key tokens" cannot attend to the
previous tokens. As shown in Figure 7, the white part represents the cropped part, where attention
weights are 0, and the orange part represents the attention between tokens within the key tokens. In
addition, we retain the attention of the key tokens to the first token to maintain the stability of the
attention distribution. Moreover, we only modify the causal mask in layers 1~8, yet the attention to
the key information is still significantly improved in layers 15~31, which indicates that the positional
information generated by the causal mask in earlier layers can be transmitted to later layers with the
positional hidden states as the medium; thus modifying the causal mask only in the earlier layers can
induce a profound shift in the model's comprehension of positional information.
In methods 2 and 3 of Section 2.2, we modify the position embeddings by altering the position ids.
The specific operation is shown in Figure 8, in which we directly replace the position ids
corresponding to the key tokens with the position ids of the starting tokens (or the ending tokens); in
effect, only the attention weights of the last token to the previous tokens are modified. We apply this
modification in all layers. In contrast to modifying the causal mask, if we only modify the position
embedding in the earlier layers, the attention in the later layers remains almost unchanged, which
indicates that the positional information generated by the position embedding may be temporary and
can hardly be transmitted across layers.

G Perturbation on Causal Mask and Position Embedding

To further explore the origin of these positional hidden states, we performed perturbation experiments.
As depicted in Figure 9c, subtracting 200 from the position ids corresponding to the 400th to 600th
tokens (reducing PE) had only a minor effect on the positional hidden states, whereas, in Figure 9b,
cropping the causal mask so that the 400th to 600th tokens cannot attend to the 1st to 400th tokens
(cropping the causal mask) led to significant fluctuations in the positional hidden states of the 400th
to 600th tokens. This result shows that the causal mask is the main factor producing this kind of
positional hidden state: it is the token's position under the causal mask, not the position ids of the
position embedding, that determines its value in the positional hidden states.


Figure 7: Cropping the causal mask so that key tokens cannot see previous tokens, except the first
token.

[Figure 8 schematic: the normal position ids 0, 1, ..., 199, 200, 201, 202, 203, ..., 499 are modified to 0, 1, ..., 199, 0, 1, 2, 203, ..., 499, so the key tokens receive the position ids of the starting tokens before RoPE is applied to Q and K.]

Figure 8: Shifting position ids to the start (PE to beginning).

H Does this Method Compromise the Ability to Perceive Positional Information?

To show that our method is harmless for position-sensitive tasks even though it eliminates some
positional information, we conduct the timeline reorder task from LooGLE [11], whose objective is to
arrange events dispersed throughout a long text in chronological order. The results in Table 6 show
that our method does not impair the model's performance on position-sensitive tasks. This also
suggests that the positional information we eliminate may not be necessary for the model to function
well.

[Figure 9 graphics: values of hidden-state dimension 213 across token positions, shown per layer.]
(a) Original hidden state (b) Crop mask (c) Reduce PE


Figure 9: Perturbation experiments on the causal mask and position embedding (PE), showing
dimension 213 of the hidden states of Mistral-7b [27] with a randomly synthesized corpus as input.

Table 6: Performance of different models on the position-sensitive timeline reorder task [11].
Model Accuracy
Vicuna-7b-v1.5-16k 20.83
Vicuna-7b-v1.5-16k w/ Ours 20.83
Qwen1.5-7b-chat 28.13
Qwen1.5-7b-chat w/ Ours 28.13
Mistral-7B-Instruct-v0.2 18.75
Mistral-7B-Instruct-v0.2 w/ Ours 19.79

I Attention Distribution Layer-wise and Head-wise

Figure 10 shows Mistral-7b's attention to each KV pair in the context at each layer (averaged across
all attention heads) on a KV retrieval task, when the gold KV is placed at different positions. The
y-axis is the gold KV's position, the x-axis is each KV's position, and the colorbar scale represents
attention (10^-3). We observe that diagonal patterns, which indicate that attention is concentrated on
the "key tokens", appear only in the later layers (starting from layer 14) and may be a manifestation
of retrieval behavior. In contrast, the earlier layers only focus on the beginning or end, regardless of
where the key information is located.
Figure 11 shows the head-wise situation for layer 15. Only a portion of the attention heads exhibit
diagonal patterns, which may correspond to retrieval heads [47]. The attention distribution in these
heads also shows a pattern corresponding to "lost in the middle", being larger at the beginning or end
and significantly smaller in the middle.

[Figure 10 graphics: 32 per-layer attention heatmaps.]
Figure 10: The average attention weight on each KV pair, across all 32 layers of Mistral-7b, on a
50-KV-pair retrieval task, when the gold KV is placed at each different position.

[Figure 11 heatmaps: one panel per attention head of layer 15 (Head 0 through Head 31), each
plotting the average attention to every KV position (x-axis) against the gold KV's position (y-axis).]

Figure 11: The average attention weight on each KV pair, across all 32 attention heads of layer 15 of
Mistral-7b, on a 50-KV-pair retrieval task, when the gold KV is placed at each different position.

J Positional Hidden States Visualization


We show the positional hidden states of each layer for various models in Figure 12. When visualizing,
we discard the first 30 tokens because their hidden state values are often huge (usually hundreds of
times larger than the normal value [48]), which can disrupt the monotonicity. We observe that the
monotonic trend first appears in the first layer (i.e., just after the first attention mechanism) and
becomes more marked in subsequent layers.
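
A minimal sketch of this visualization is given below, assuming a Hugging Face-style causal LM that
returns per-layer hidden states. The dimension index comes from Figure 12 (e.g., 213 for Mistral-7b,
2393 for LLaMA-2-7b-chat), while the function name and the subplot layout are our own illustrative
choices.

import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_positional_dim(model, input_ids, dim, skip=30):
    """Plot one hidden-state dimension against token position for every layer,
    skipping the first `skip` tokens, whose hidden-state values can be hundreds
    of times larger than normal and would hide the monotonic trend."""
    out = model(input_ids, output_hidden_states=True)
    layer_outputs = out.hidden_states[1:]  # drop the embedding-layer output
    n = len(layer_outputs)
    cols = (n + 3) // 4
    fig, axes = plt.subplots(4, cols, figsize=(3 * cols, 10))
    for ax, (i, h) in zip(axes.flat, enumerate(layer_outputs)):
        values = h[0, skip:, dim].float().cpu()
        ax.plot(range(skip, skip + len(values)), values)
        ax.set_title(f"layer {i}")
    fig.tight_layout()
    return fig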

[Figure 12 line plots: one panel per layer for each model, showing the value of the positional hidden
dimension against position. Panels: (a) LLaMA-2-7b-chat, dim=2393; (b) Mistral-7b, dim=213;
(c) MPT-30b, dim=6926; (d) TinyLlama-NoPE, dim=1156.]


Figure 12: Positional hidden states output by each layer of LLaMA-2-7b-chat, Mistral-7b-Instruct-
v0.2, MPT-30b-chat and TinyLlama-NoPE-1.1B. The x-axis represents the position, and the y-axis
represents the value of the states.

