Mitigate Position Bias in Large Language Models Via Scaling A Single Dimension
Yijiong Yu1†, Huiqiang Jiang2 , Xufang Luo2 , Qianhui Wu2 , Chin-Yew Lin2 ,
Dongsheng Li2 , Yuqing Yang2 , Yongfeng Huang1 , Lili Qiu2
1 Tsinghua University, 2 Microsoft Corporation
[email protected],[email protected]
{hjiang,xufluo,qianhuiwu,cyl,dongsli,yuqyang,liliqiu}@microsoft.com
Abstract
1 Introduction
Long-context large language models (LLMs) [1, 2, 3, 4, 5, 6] have recently garnered significant attention within the community, enabling LLMs to handle longer and more complex tasks such as long-context question answering [7, 8] and repository-level code understanding [9]. However, recent research [8, 10, 11, 12, 13] indicates that these long-context LLMs struggle to effectively and consistently utilize all the information provided in the context, exhibiting a position bias known as "lost in the middle": LLMs tend to ignore information in the middle of the prompt, even though they can utilize information at the beginning and end of the prompt well. This issue occurs in nearly all LLMs [10, 14, 15], whether decoder-only or encoder-decoder, powerful or small. For example, on the NaturalQuestions multi-document QA task with a 2.3k-token prompt, GPT-3.5-Turbo shows a 22-point performance gap between placing the ground-truth information in the middle of the prompt and placing it at the ends [10]. This significantly impacts the practical application of LLMs in real-world scenarios, and studies [16, 17] show that the position bias becomes more severe as the context length increases, further hindering the deployment of long-context LLMs.
Previous works have analyzed this issue from the perspectives of data distribution [14, 18, 19] and
position embeddings [15, 20]. For example, FILM [19] addresses position bias by constructing data
† Work during internship at Microsoft.
where h is the hidden states and h^{(n)} is the hidden state of the n-th token, W^Q and W^K are the weights of the query and key projection layers, P is the position encoding function such as RoPE [21], d is the dimensionality of the query and key states, n and m are the positional indices, and Mask is the causal mask.
To explore the micro-level manifestations of position bias in Transformers, we analyzed the attention weights on sentences containing key information, using a KV retrieval task that requires the model to retrieve the ground-truth value of a given key from a list of 50 key-value pairs (see Appendix D for details). As shown in Figure 1, in deep layers the model exhibits retrieval-like behavior, focusing its attention on the ground-truth information and forming the diagonal pattern observed in Figure 1b. In the shallow layers, by contrast, the model always focuses most of its attention on the start or end of the prompt, regardless of where the key information is located, exhibiting the vertical-line patterns shown in Figure 1a.
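As an illustration of how such per-layer attention maps can be obtained, the sketch below builds a synthetic 50-pair KV retrieval prompt and extracts head-averaged attention maps from a Hugging Face causal LM. The model name and prompt format are placeholder assumptions, not the exact setup of our experiments.

import json
import random
import string

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def make_kv_prompt(num_pairs=50, gold_index=25, seed=0):
    # Synthetic KV retrieval prompt: a JSON object of random key-value pairs
    # plus a query asking for the value of the key at position `gold_index`.
    rnd = random.Random(seed)
    keys = ["".join(rnd.choices(string.ascii_lowercase, k=16)) for _ in range(num_pairs)]
    vals = ["".join(rnd.choices(string.digits, k=16)) for _ in range(num_pairs)]
    gold_key = keys[gold_index]
    prompt = ("Extract the value of the given key from the JSON object below.\n"
              + json.dumps(dict(zip(keys, vals)))
              + f"\nKey: {gold_key}\nValue:")
    return prompt, gold_key

model_name = "meta-llama/Llama-2-7b-hf"  # assumed model; any decoder-only LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")

prompt, gold_key = make_kv_prompt()
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer;
# averaging over heads gives a single (seq, seq) attention map per layer.
attn_per_layer = [a[0].mean(dim=0) for a in out.attentions]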
In the layers exhibiting retrieval-like behavior, the attention weights on the key information (Gold KV) follow a pattern similar to position bias: when the key information is located at the start or end of the prompt, the attention weights focused on it are relatively high, whereas in the middle they are significantly lower. Moreover, we plot the attention to the key information (averaged over layers 15~25) for different context lengths in Figure 1c: as the context length grows, the attenuation of attention weights with respect to position becomes more pronounced, dropping to almost zero in the middle. More details are provided in Appendices D and I.
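Continuing the sketch above (and under the same placeholder assumptions), a Figure 1c-style curve can be approximated by locating the gold key's token span, summing the attention that the final query token pays to it, averaging over layers 15~25, and sweeping the gold pair's position:

def gold_attention(attn_per_layer, gold_slice, layers=range(15, 26)):
    # Attention from the last prompt token to the gold-KV token span,
    # averaged over the chosen layers (15~25, matching the range used above).
    vals = [attn_per_layer[l][-1, gold_slice].sum().item() for l in layers]
    return sum(vals) / len(vals)

scores = []
for gold_index in range(0, 50, 5):
    prompt, gold_key = make_kv_prompt(gold_index=gold_index)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    attn_per_layer = [a[0].mean(dim=0) for a in out.attentions]
    # Roughly locate the gold key's tokens via character offsets (approximate).
    enc = tokenizer(prompt, return_offsets_mapping=True)
    start = prompt.index(gold_key)
    idx = [i for i, (s, e) in enumerate(enc["offset_mapping"])
           if s >= start and e <= start + len(gold_key) and e > s]
    scores.append((gold_index, gold_attention(attn_per_layer, slice(idx[0], idx[-1] + 1))))
# Plotting the recorded attention against gold_index yields a Figure 1c-style curve.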
Furthermore, in Appendix E we find that artificially adjusting the attention weights assigned to the key information can directly improve the corresponding accuracy. Thus, we claim that position bias is, to a large extent, caused by these attention weight patterns at the micro level.
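Conceptually, such an intervention amounts to boosting the pre-softmax attention scores on the key-information tokens. The snippet below illustrates the idea on a standalone single-head attention computation; it is a sketch of the concept, not the exact procedure used in Appendix E.

import torch

def attention_with_boost(q, k, v, key_span, boost=2.0):
    # Causal attention in which the pre-softmax scores for the key-information
    # columns (key_span) receive an additive log-space boost, i.e. the attention
    # those positions collect grows by roughly a factor exp(boost) before renormalization.
    d = q.size(-1)
    scores = q @ k.transpose(0, 1) / d ** 0.5                      # (seq, seq)
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    scores[:, key_span] = scores[:, key_span] + boost              # upweight key info
    return scores.softmax(dim=-1) @ v

# Example: upweight a hypothetical gold-KV span at positions 100..120,
# given (seq, d) tensors q, k, v from a single attention head:
# boosted_out = attention_with_boost(q, k, v, key_span=slice(100, 121))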
[Figure 1: attention-weight visualizations for the KV retrieval task; recoverable axis/legend labels: "Layer", "Attention", "Gold KV".]