
Cartridges: Lightweight and general-purpose long context representations via self-study

Sabri Eyuboglu 1 ∗ Ryan Ehrlich 1 ∗ Simran Arora 1,2 ∗ Neel Guha 1 Dylan Zinsley 3 Emily Liu 1
Will Tennien 1 Atri Rudra 3 James Zou 1 Azalia Mirhoseini 1,4 Christopher Ré 1
1 Stanford University 2 Caltech 3 University at Buffalo 4 Google DeepMind * Equal contribution
# [email protected], [email protected], [email protected]
© HazyResearch/cartridges

arXiv:2506.06266v1 [cs.CL] 6 Jun 2025

Abstract

Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal
documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context
learning (ICL). Although current models support contexts of 100K–1M tokens, this setup is costly to serve
because the memory consumption of the KV cache scales with input length. We explore an alternative:
training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which
we call a CARTRIDGE, and decode a response. Critically, the cost of training a CARTRIDGE can be amortized
across all the queries referencing the same corpus. However, we find that the naive approach of training
the CARTRIDGE with next-token prediction on the corpus is not competitive with ICL. Instead, we propose
SELF-STUDY, a training recipe in which we generate synthetic conversations about the corpus and train
the CARTRIDGE with a context-distillation objective. We find that CARTRIDGES trained with SELF-STUDY
replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context
benchmarks, CARTRIDGES trained with SELF-STUDY match ICL performance while using 38.6× less memory
and enabling 26.4× higher throughput. SELF-STUDY also extends the model's effective context length
(e.g. from 128k to 484k tokens on MTOB) and, surprisingly, leads to CARTRIDGES that can be composed at
inference time without retraining.

1 Introduction

Large language model (LLM) users often place large text corpora into the context window. For instance, a user
or organization may use LLMs to understand codebases [56], financial documents [33], legal texts [28, 102],
textbooks [61], or personal files [7]. LLMs excel here due to in-context learning (ICL), enabling accurate
responses to diverse queries (e.g., factual Q&A, summarization, code generation) [20].
Despite its flexibility, this usage paradigm is costly to serve. ICL requires maintaining a KV cache that grows
linearly with the input length. For example, LLaMA 70B needs 84 GB of memory (at 16-bit precision) to
answer a single question over a 128k-token context [21]. This severely limits user throughput: on a single
H100 GPU, LLaMA 8B’s peak throughput (tokens/s) drops by 77× when increasing the context from 1k to
120k tokens (Figure 3).
Prior work has thus explored ways to reduce KV cache memory usage. For instance, prompt compres-
sion methods reduce the number of tokens stored in the cache using summarization, or self-information
filtering [16, 37, 48], while KV cache compression techniques directly compress the stored key-value
pairs [23, 60, 75, 98]. Unfortunately, there are memory-quality tradeoffs associated with these methods: in
experiments on challenging long-context tasks, we find that performance degrades rapidly when applying
these methods with compression ratios greater than 2× (see Figure 4).
Motivated by the observation that the cost of preparing a KV cache can be amortized across many queries
that reference the same corpus, we explore a complementary approach based on offline training. Given a
specific text corpus (e.g. a patient's medical record), we freeze the LLM and train a smaller KV cache offline
by backpropagating loss into the key and value vectors, in a process closely resembling prefix tuning [45, 47].
We call the trained KV cache representing the corpus a "CARTRIDGE." At inference time, we load the trained
CARTRIDGE, append the user's messages, and decode. Because users repeatedly reference the same corpora
(e.g. SEC filings, codebases, personal files), each CARTRIDGE can be trained once offline and reused. This
approach also integrates cleanly with existing inference servers, which are already designed to manage
per-user KV caches [44, 101].

[Figure 1 visual: users send many messages (e.g. "What is the D&A margin for FY15?", "Write a rock song
about the docs") grounded in a single large corpus. With in-context learning, the documents are represented
by a KV cache produced with standard prefill, which consumes high GPU memory. With CARTRIDGES, the
documents are represented by a compressed KV cache trained with self-study, which uses 38.6× less memory
and enables 26.4× higher throughput while still supporting general queries.]

Figure 1: Producing CARTRIDGES via self-study. For a given document corpus, we train a CARTRIDGE by
distilling the corpus into a parameterized KV cache through a process we call SELF-STUDY. At inference
time, this CARTRIDGE can be loaded into an LLM, which can then be used to answer diverse queries about
the corpus, simulating in-context analysis of the corpus while requiring substantially less memory.
Achieving ICL-equivalent functionality requires CARTRIDGES to satisfy two non-trivial desiderata. First,
CARTRIDGES should replicate the generality of ICL and provide accurate responses across diverse user
prompts [20]. Second, CARTRIDGES should replicate ICL's structural awareness—its ability to reason over
document structure and understand how distant parts of a corpus relate or depend on each other (an ability
that degrades when using lossy KV-cache compression methods). It is unclear if there is a procedure that
satisfies these desiderata while providing memory efficiency.
The natural baseline approach is to train a CARTRIDGE with a next-token prediction objective on the raw
corpus. Excitingly, this yields CARTRIDGES that memorize the corpus perfectly while using 107× less memory
than the KV cache. However, the resulting CARTRIDGES are not general: they degrade the LM's ability to
respond to diverse questions beyond regurgitating the corpus (Figure 3).
To address these challenges and produce general, structurally aware C ARTRIDGES for any text corpus, we
propose an automated method called S ELF -S TUDY. S ELF -S TUDY has two steps:

1. Synthetic data generation (Section 4.1): We generate synthetic training data by prompting the model to
quiz itself about the corpus content, resulting in a synthetic conversation trace. Training on these lets us
avoid training on the same exact text multiple times and improves generality (see Figure 3). To support
corpora that exceed the effective context length of the model, we chunk the corpus when generating
synthetic conversations. We also curate a set of seed prompts that bias the synthetic conversations towards
global reasoning and improve structural awareness (see Figure 6 right).
2. Context distillation (Section 4.2): We train on the synthetic conversations using a context-distillation
objective [10, 72], which aligns the C ARTRIDGE-augmented model’s next-token distributions with the
distributions of the model with the corpus in context. We find that the context distillation substantially
improves the quality of the C ARTRIDGES compared to next-token-prediction (see Figure 6 center).

In summary, given a large corpus of text, our goal is to train a small virtual KV cache, termed C ARTRIDGE,
that when used by the model, mimics the conversational behavior of the model with the entire corpus in
context. To do this, we generate synthetic conversations and train the C ARTRIDGE on them with a context
distillation objective — a recipe we call S ELF -S TUDY.
Evaluations. We evaluate CARTRIDGES trained with SELF-STUDY on a set of challenging benchmarks that
pair a single large text corpus (100k-484k tokens) with a diverse set of queries [2, 33, 76]. We make three claims.
First, CARTRIDGES extend the quality-memory frontier—averaged across the benchmarks, CARTRIDGES
produced with SELF-STUDY match ICL quality while consuming 38.6× less memory, enabling a 26.4×
increase in peak throughput (tokens per second) when serving many users with different corpora. These
memory reductions and speedups represent an order-of-magnitude improvement over state-of-the-art cache
compression baselines (e.g. DuoAttention [84]). Second, CARTRIDGES enable context length extrapolation.
On the MTOB benchmark [76], where models must translate from Kalamang, a low-resource language,
into English, we use SELF-STUDY with LLAMA-8B to construct a small CARTRIDGE from a 484k token
textbook. This CARTRIDGE outperforms ICL over the first 130,000 tokens of the textbook by 11.0 chrF points
and matches the ICL performance over a curated subset of the textbook. Third, SELF-STUDY also yields
CARTRIDGES that are composable without joint optimization: multiple CARTRIDGES can be concatenated and
queried together, emulating ICL's ability to flexibly answer queries over multiple documents concatenated
in context (see Figure 7).
Additionally, we carefully ablate the design decisions in SELF-STUDY and CARTRIDGES (Section 5.3 and
Appendix A). Notably, we compare CARTRIDGES parameterized as a KV cache [47] with CARTRIDGES
parameterized as a LoRA [31] and find that the KV cache parameterization performs better on both in-domain
and out-of-domain tasks.

Method                              Consumes limited memory   Retains corpus information   Supports diverse prompts
In-context learning                 ✗                         ✓                            ✓
Prompt / KV cache compression       ✓                         ✗                            ✓
CARTRIDGE + next-token prediction   ✓                         ✓                            ✗
CARTRIDGE + SELF-STUDY              ✓                         ✓                            ✓

Figure 2: Comparing KV caching strategies. CARTRIDGE improves memory efficiency, while retaining the quality of
in-context learning across a broad set of prompts. ✓ indicates a strength and ✗ indicates a limitation.
In this work, we demonstrate how offline KV cache training can dramatically reduce the cost of serving
language models in settings where users repeatedly include the same text corpora in context. We hope that
these cost reductions could enable new applications that are currently intractable, like coding agents with
full-repository context or long-term memory in chatbots.

2 Preliminaries

We begin by discussing related work (Section 2.1), formalizing our problem (Section 2.2), and providing
background on language models and KV caches (Section 2.3).

2.1 Related work

See Appendix B for a detailed discussion of prior work.

Parameter-Efficient Fine-Tuning Prior work has explored a range of strategies for adapting pretrained
language models: prompt distillation [43, 73], self-instruction [58], and domain-specific training [17]. Variants
of this approach train the corpus into smaller modules ("adapters") that can be added to the model and offer
parameter-efficiency benefits [31, 45, 47, 54, 74]. A number of works have explored the idea of composing
multiple different parameter-efficient adapters through various aggregation operations [26, 32, 46, 82, 83, 86,
99, 100].
Most similar to our work are recent knowledge injection methods, which aim to internalize C into model
weights, allowing models to answer queries from parametric knowledge as opposed to ICL [43, 53, 74].
Recent work like LIFT [53] uses synthetic data to train a per-document adapter for long-context documents,
but does not study the throughput improvements stemming from the lack of a large KV cache. Our approach
improves the quality-memory (and thus quality-throughput) tradeoff frontier, matching ICL performance and
supporting composability while keeping the memory footprint small.

Architectures Research has also examined more memory-efficient alternatives to traditional attention [79],
which leverage sparsity [9, 15, 77, 93] or alter the structure of attention [3, 71], among other strategies [6, 50,
96]. Certain variants (i.e., grouped-query attention) appear in popular models like Llama-3, which we study
in our experiments.

[Figure 3 visual: (Left) Generalization to diverse query types on GENCONVO. Query types include memorization
(e.g. "Please complete the rest of the passage..."), data structuring (e.g. "Please list AMD's customers in JSON
format"), synthesis (e.g. "Please summarize AMD's FY20 10-K."), creative (e.g. "Write a poem about AMD's Q3
performance."), mathematical reasoning (e.g. "Compute AMD FY20 days payable outstanding."), disjoint
reasoning (e.g. "List all the tables in AMD's FY20 10-K document."), and factual questions (e.g. "Who is on
AMD's board as of FY20?"). Memorization is closely related to the next-token prediction objective. Methods:
Cartridges (self-study), Cartridges (next-token prediction), truncated ICL, and full ICL. (Center) Quality-memory
tradeoff. (Right) Peak throughput (tokens/s) vs. cache size for Llama 3.2 3B and Llama 3.1 8B.]

Figure 3: CARTRIDGES trained with SELF-STUDY balance the generality and memory consumption
tradeoff. We compare four methods on the GENCONVO dataset: CARTRIDGES trained with next-token
prediction over C, CARTRIDGES trained with SELF-STUDY, full ICL, and truncated ICL, a prompt compression
method in which we truncate C to the first k tokens. (Left) We evaluate on different slices of the
GENCONVO dataset. CARTRIDGES trained with next-token prediction perform well on memorization
queries, which resemble their training distribution, but cannot generalize to other queries like the other
methods can. (Center) The x-axis measures the size of the KV cache in GB for the different methods. The y-axis
shows log-perplexity on the GENCONVO dataset averaged over the query types. (Right) Peak throughput
(tokens/s) measured for different cache sizes for LLAMA-3B and LLAMA-8B with SGLang [101] on a
single H100 (see Appendix A).

Prompt and KV-cache compression As the size of the KV cache drives the model memory footprint,
research has examined different strategies for reducing the cache size. One set of approaches focuses on
making the prompt smaller—explicit methods alter the prompt text through summarization and filtering
[16, 37, 48, 63, 94], while implicit methods compress prompt representations into a set of “soft” tokens [14,
24, 45, 55, 65, 91]. Another set of approaches exploits observations about the mathematical structure of
the KV cache [11, 41, 92], often finding that because a small number of keys dominate the attention scores
of subsequent queries, non-impactful key-value pairs (or tokens) can be dropped [23, 49, 60, 75, 98] or
merged [80, 81, 97].

Synthetic data generation A large body of work has focused on generating synthetic training data [1, 22,
58, 67]. For example, Bonito is a model that is fine-tuned to generate synthetic data [58], and MetaSynth is a
method proposed by Riaz et al. that uses a language model to orchestrate several expert LLMs for domain-
specific synthetic data generation [67]. The training process for Phi-4, a 14 billion parameter language model,
also incorporates significant amounts of synthetically generated data [1].

2.2 Problem setup

We assume a setting in which users issue a stream of diverse queries about a common corpus of text. We
denote the corpus as C and the query set as Q = {q1 , q2 , . . . , qm }. Illustrative examples of C include legal
filings, financial documents, code repositories, chat histories, and medical records.

Example: Financial Analysis


C may correspond to the 2022 Form 10-K filing [78] for AMD, which is almost 100k tokens. The queries
an analyst might ask an LLM to answer with respect to this form are diverse, including: (1) recalling
factual information, (2) performing mathematical reasoning over values, or (3) even generating creative
responses (e.g., a poem) grounded in the 10-K’s information.

Let R = {r1 , r2 , . . . , rm } denote the responses the LLM produces for the queries. We have two objectives.
First, we wish to maximize the quality of responses R under some quality metric (e.g. accuracy). Second, we
wish to minimize the LLM’s memory footprint while it is answering questions with respect to the document.
This is because larger memory footprints decrease throughput and necessitate more hardware to serve the
same number of users (Figure 3, Right).

2.3 Language models and KV caches

Recall that an LLM F accepts as input a sequence of n tokens x ∈ V^n drawn from a discrete vocabulary
V ⊂ Z of tokens, each represented by a unique integer. The output, which we denote F(·|x), corresponds to
a categorical distribution over the vocabulary V conditioned on the prefix x ∈ V^n.
Inside the language model, each token x[i] in x is embedded into a d-dimensional space, yielding a matrix
u ∈ R^{n×d}. The matrix u is passed through a stack of L model layers, each of which mixes the matrix along the n
and d dimensions, with layer ℓ outputting y_ℓ ∈ R^{n×d}. The final y_L is mapped to the logits over V with a
linear projection.
Most modern language models use the Transformer architecture based on self-attention [79]. Given an input
u ∈ R^{n×d} for sequence length n and embedding dimension d, it computes the output y ∈ R^{n×d} via a
softmax over the projections q, k, v = uW_q, uW_k, uW_v:

    y[i] = \sum_{j=1}^{i} \frac{\exp\!\big(q[i]^\top k[j] / \sqrt{d}\big)\, v[j]}{\sum_{t=1}^{i} \exp\!\big(q[i]^\top k[t] / \sqrt{d}\big)}    (1)

where weight matrices Wq , Wk and Wv for each layer are learned during training.
When generating from F , we generate one token at a time by sampling from F (· | x) and appending the
sampled token to x. Critically, the attention operator is causal: every output y[i ] is conditioned on prior
tokens. This allows us to avoid recomputing the keys and values for the prior tokens by storing them in a
KV cache {k[j], v[j]}_{j=1}^{i}, which grows in i. Thus, generation proceeds in two phases: (1) prefill, where we
compute the KV cache for the initial prompt x and (2) decode, where we generate the response token by token
and append to the KV cache. After prefill, if x consists primarily of the corpus C , the KV cache effectively
serves as a representation of the corpus C . This is why including a long corpus C in the context x produces
large memory footprints, as the size of the KV cache scales linearly in the length of x.
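
To make the memory scaling concrete, below is a minimal single-head sketch (in PyTorch; not the paper's code) of Equation (1) with a growing KV cache: prefill populates the cache for the prompt, and each decoded token appends one key-value pair, so cache memory grows linearly with context length.

```python
import torch

def attend_with_cache(q_i, k_cache, v_cache):
    """Compute y[i] as in Eq. (1) for the newest query vector, attending over all cached keys/values."""
    d = q_i.shape[-1]
    scores = (k_cache @ q_i) / d ** 0.5       # (i,) attention logits against every cached key
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_cache                  # (d,) attention output for position i

# Toy decode loop: the cache grows by one (k, v) pair per token instead of recomputing history.
d = 64
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(4):                            # pretend we decode 4 tokens
    q_i, k_i, v_i = torch.randn(3, d)         # projections of the newest token
    k_cache = torch.cat([k_cache, k_i[None]])
    v_cache = torch.cat([v_cache, v_i[None]])
    y_i = attend_with_cache(q_i, k_cache, v_cache)
```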

3 The C ARTRIDGE paradigm

In this section, we describe the C ARTRIDGE paradigm, in which we generate representations of the corpus C
offline with training, instead of the standard approach of constructing them on-the-fly with prefill.

3.1 Formalizing C ARTRIDGES

Our goal is to train a C ARTRIDGE for a given corpus C . A C ARTRIDGE is a small set of parameters Z ∈ R∗
(i.e. an adapter [31, 47]) that augments an LLM F and causes it to behave as if it had C in its context window.
Formally, let F_Z(·|q) denote the distribution of F augmented with Z given a query q. For all q ∈ Q, we want
to ensure that samples r_Z ∼ F_Z(·|q) are as good as or better than the ICL sample r_q ∼ F(·|C ⊕ q), according to
some query-specific scoring function. In order for F_Z(·|q) to match or exceed the behavior of F(·|C ⊕ q),
three important criteria should be met.

• Displays generality: Because Q might span a diverse range of question types (e.g., mathematical reasoning,
factual recall, comprehension, summarization, and more), it is essential that F_Z can generalize across
different q ∈ Q. This is non-trivial because Q is unknown when Z is being learned offline. If F Z does
not generalize, then practitioners may need to learn different Z for different distributions of queries, which
increases the cost of the C ARTRIDGE. Ideally, Z should only need to be learned once, yet work for multiple
types of queries.

• Captures long range dependencies: Z should also capture long range dependencies contained within C .
In many settings, correctly answering different q ∈ Q requires reasoning about the order of information
presented in C . It is not clear how to capture these dependencies in Z.
• Capable of composition: Ideally, the representation Z and the mechanism by which F utilizes it should allow
for composition without any joint training of CARTRIDGES. Given Z1 and Z2 corresponding to
C1 and C2, ideally F_{[Z1,Z2]}(·|q) is similar to F(·|C1 ⊕ C2 ⊕ q).

3.2 Parameterizing C ARTRIDGES

We parameterize Z using a simplified version of prefix-tuning [47]. Specifically, we allocate a KV cache
composed of trainable key and value vectors z_k, z_v ∈ R^{p×d}. The size of the full Z ∈ R^{L×p×d×2} is controlled
by the hyperparameter p. The memory footprint of Z is equivalent to a KV cache for a prompt with p tokens.
In ICL, the KV cache for F_C(q) (where C is of length n_C and Q is of length n_Q) would contain n_C + n_Q
key-value pairs, with the first n_C corresponding to C and the last n_Q corresponding to Q:

ICL KV cache:        (k[1], v[1]), ..., (k[n_C], v[n_C])  [KV pairs for C],   (k[n_C + 1], v[n_C + 1]), ...  [KV pairs for q]
CARTRIDGE KV cache:  (z_k[1], z_v[1]), ..., (z_k[p], z_v[p])  [trainable KV pairs in Z],   (k[p + 1], v[p + 1]), ...  [KV pairs for q]

To train a C ARTRIDGE, we substitute the key-value pairs corresponding to C with Z, and directly optimize
them by back-propagating the loss into the key and value vectors. Critically, we freeze all parameters of
the model, only training the key and value vectors in Z. We discuss the choice of loss in Section 4.2.
Initialization Prior work finds that optimizing a randomly initialized cache Z is unstable and leads to
degraded performance [47]. Instead, these works initialize the trainable cache with a smaller dimensionality
d and then re-project it to the original dimension with an MLP. In contrast, we find that proper initialization
of Z allows us to directly optimize the full cache without reparametrization. Specifically, we initialize Z to
the KV cache corresponding to the first p tokens of the corpus C . Alternatively, we could use a summary of
the corpus or filter tokens using off-the-shelf prompt compression strategies [84]. In Section 5.3, we show
that our initializations lead to stable training and faster convergence than the random initialization.
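
As a concrete illustration of this parameterization and initialization, here is a hedged PyTorch sketch; the module and method names are our own rather than the released code, and it assumes a Hugging Face-style causal LM whose forward pass returns per-layer past_key_values, flattening the head dimension into d.

```python
import torch
import torch.nn as nn

class Cartridge(nn.Module):
    """Trainable KV cache: p key/value vectors per layer, i.e. Z in R^{L x p x d x 2} (illustrative only)."""
    def __init__(self, num_layers: int, p: int, d: int):
        super().__init__()
        self.z_k = nn.Parameter(torch.zeros(num_layers, p, d))
        self.z_v = nn.Parameter(torch.zeros(num_layers, p, d))

    @torch.no_grad()
    def init_from_corpus_prefix(self, frozen_lm, corpus_ids: torch.Tensor):
        """Initialize Z from the KV cache of the first p corpus tokens (Section 3.2).

        Assumes an HF-style model returning past_key_values as per-layer (k, v)
        tensors of shape (batch, heads, seq, head_dim)."""
        p = self.z_k.shape[1]
        out = frozen_lm(corpus_ids[:, :p], use_cache=True)
        for layer, (k, v) in enumerate(out.past_key_values):
            self.z_k[layer].copy_(k[0].transpose(0, 1).reshape(p, -1))  # (p, heads*head_dim) == (p, d)
            self.z_v[layer].copy_(v[0].transpose(0, 1).reshape(p, -1))

# Only the Cartridge is trained; every model weight stays frozen, e.g.:
#   for param in frozen_lm.parameters(): param.requires_grad_(False)
#   optimizer = torch.optim.AdamW(cartridge.parameters(), lr=2e-2)  # learning rate is illustrative
```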
Why this parameterization? We note that the parameter-efficient fine-tuning literature provides other ways to
augment an LLM with a set of additional parameters, in particular low-rank adaptation (LoRA) [31, 45, 47].
In Section 5.3, we perform a comprehensive comparison of C ARTRIDGES parameterized with prefix-tuning
and LoRA.

3.3 Serving C ARTRIDGES

A C ARTRIDGE can be served efficiently with minimal changes to existing LLM inference servers [40, 44, 101].
Because a C ARTRIDGE is a KV cache, it can be loaded directly into the KV cache slots using existing
mechanisms for handling cached prefixes. LLM inference servers are heavily optimized for managing
distinct KV-caches for multiple users [90], meaning C ARTRIDGES can be served at high throughput using
existing inference servers. Decoding tokens with a C ARTRIDGE is identical to serving a request with a prefix
of length p (the hyperparameter denoting the number of trainable tokens in the C ARTRIDGE). This contrasts
with other methods like LoRA, which require custom infrastructure to serve efficiently to multiple users [13].
See Figure 3 for the relationship between prefix length and throughput.

4 S ELF -S TUDY: A self-supervised method for training C ARTRIDGES

In this section, we describe S ELF -S TUDY, a simple approach for training a C ARTRIDGE Z on any corpus of
text. The design of S ELF -S TUDY is motivated by experiments showing how C ARTRIDGES trained with a
simpler recipe fail to generalize to diverse user queries.

[Figure 4 visual: panels for LongHealth, MTOB (KE), and QASPER. Legend: Cartridges (ours); long-context ICL,
including ICL over the full MTOB book; prompt compression via truncation and summarization; and KV-cache
compression via DuoAttention.]

Figure 4: CARTRIDGES match ICL quality with lower memory costs. We measure LLAMA-3B response
quality (y-axis) against KV cache memory (x-axis) for different methods, at different KV cache sizes. The
dashed line marks the quality of standard ICL.

Motivating observations The naive method for constructing a C ARTRIDGE would be to fine-tune the
parameters of Z with the next token prediction objective on the corpus text directly. We show results experi-
menting with this approach in Figure 3, where we evaluate on a dataset derived from FinanceBench [33],
which we refer to as G EN C ONVO (see Appendix D for details). G EN C ONVO contains multiple types of
questions (e.g. synthesis, reasoning). We find that the naïve next-token prediction approach can memorize
with near perfect perplexity (Figure 3 left), while consuming 107× less memory than ICL (Figure 3 center).
However, generalization to other slices is poor, as shown in Figure 3. We seek a training objective that
allows the responses from a model that uses the C ARTRIDGE to generalize to a diverse set of user queries,
resembling ICL.
Motivated by these observations, we describe a synthetic data generation recipe in Section 4.1 and a context-
distillation objective in Section 4.2. As we show in Figure 3, C ARTRIDGES trained with this approach can
generate responses to many types of queries that match the quality of queries generated with ICL. See
Figure 1 for a visualization of the C ARTRIDGE approach.

4.1 Self-supervised synthetic data to avoid overfitting

Towards training general CARTRIDGES, we propose using LLM-generated synthetic data to construct our
training dataset D_train.

Overall synthetic data pipeline Our overall pipeline puts information from the corpus C in context and
prompts the model to have a conversation with itself about the corpus to generate the synthetic query-
response pairs as shown in Algorithm 1. We represent the concatenation of two vectors with x ⊕ y.

Algorithm 1 SELF-STUDY: Data Generation

Input: C: Corpus, F: Model
Output: {a_1, b_1, . . . , a_k, b_k}: Conversation
1: c̃ ← chunk(C)                                       ▷ (1) Get a subcorpus of C that fits in the context window
2: s ← get_seed_prompt()                               ▷ (2) Get a prompt to seed the first message from A
3: for i = 1 to k do                                   ▷ (3) Sample a conversation with k back-and-forths
4:     a_i ∼ F(· | c̃ ⊕ s ⊕ a_1 ⊕ · · · ⊕ b_{i−1})       ▷ (3.1) Sample A's message with c̃ and s in context
5:     b_i ∼ F(· | c̃ ⊕ a_1 ⊕ · · · ⊕ b_{i−1} ⊕ a_i)      ▷ (3.2) Sample B's message with c̃ in context
6: end for
7: return {a_1, b_1, . . . , a_k, b_k}

The conversation is generated by iteratively sampling generations from two LLM participants A and B
(which are the same model). We maintain two different conversation histories: A’s starts with a user message
containing a seed prompt s (e.g. “Please start a conversation by asking a question about the document above.")
followed by alternating assistant and user messages from A and B, respectively. B’s conversation history does
not include the seed prompt and contains the same messages as A’s but with the roles of A and B swapped.
Both have the subcorpus c̃ in the system prompt. To build a training dataset, we sample m_train independent
conversations and concatenate the messages from A and B into a single sequence of tokens:

    \mathcal{D}_{\text{train}} = \{\, x^{(j)} = a_1^{(j)} \oplus b_1^{(j)} \oplus a_2^{(j)} \oplus b_2^{(j)} \oplus \cdots \oplus a_k^{(j)} \oplus b_k^{(j)} \,\}_{j=1}^{m_{\text{train}}}    (2)

where each x^{(j)} is a concatenation of the messages. Note that all of the datasets on which we evaluate in
the main paper involve a single turn. So, we set k = 1, generating a synthetic conversation with one user
message and one assistant message.
Note that the chunk and get_seed_prompt functions expose two different ways to control the data distribution
of the synthetic data. We find that these two design decisions are critical for training high quality C ARTRIDGES
with S ELF -S TUDY.
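
A minimal Python sketch of this generation loop is shown below; the `generate`, `chunk`, and seed-prompt names are placeholders for whatever chat-completion interface and chunking logic one uses, not the released implementation.

```python
import random

def chunk(corpus: str, min_tokens=512, max_tokens=4096) -> str:
    """Return a random subcorpus c~ that fits comfortably in the context window (tokens approximated by words here)."""
    words = corpus.split()
    size = random.randint(min_tokens, max_tokens)
    start = random.randrange(max(1, len(words) - size))
    return " ".join(words[start:start + size])

def self_study_conversation(corpus, generate, seed_prompts, k=1):
    """Algorithm 1: sample a k-turn synthetic conversation between two copies (A and B) of the same model.

    `generate(system, messages)` stands in for any chat-completion call that returns the assistant reply as a string.
    """
    c_tilde = chunk(corpus)                    # (1) subcorpus placed in both participants' system prompt
    seed = random.choice(seed_prompts)         # (2) generic seed prompt, shown only to A
    a_history, b_history = [{"role": "user", "content": seed}], []
    convo = []
    for _ in range(k):                         # (3) k back-and-forths
        a_msg = generate(system=c_tilde, messages=a_history)   # (3.1) A asks a question / makes a request
        a_history.append({"role": "assistant", "content": a_msg})
        b_history.append({"role": "user", "content": a_msg})
        b_msg = generate(system=c_tilde, messages=b_history)   # (3.2) B answers with c~ in context (no seed)
        b_history.append({"role": "assistant", "content": b_msg})
        a_history.append({"role": "user", "content": b_msg})
        convo.extend([a_msg, b_msg])
    return c_tilde, convo                      # keep c~ so the teacher can condition on it in Eq. (3)
```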

Chunking We use short subcorpora c̃ (between 512 and 4096 tokens) to let the LLM focus on different parts
of the corpus when generating data. This is motivated by observations in prior work [52, 57]. Furthermore,
chunking also allows us to train C ARTRIDGES on corpora longer than the model’s context window.

Seed prompts Instead of using just one seed prompt, we curate a list of five different seed prompt
types: structuring, summarization, question, use cases, and creative. The full list of seed prompts used in our
experiments is provided in Appendix C. Critically, in all our experiments the seed prompts are generic: they
do not mention anything related to the specifics of the corpora we evaluated (e.g. no mention of translation
for MTOB or medical terms for LongHealth). We use the same set of seed prompts in all of our main results.
In Section 5.3, we ablate the use of diverse seed prompts and find that it improves performance over a single
generic seed prompt by up to 4.8 accuracy points (43.6 → 48.4 on L ONG H EALTH).
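
For concreteness, the snippet below shows how such a set of generic seed prompts could be plugged into the generator sketched above; the wording here is an illustrative paraphrase of the five types (the exact prompts are listed in Appendix C), not the actual prompts.

```python
# Illustrative paraphrases of the five generic seed prompt types (structuring,
# summarization, question, use cases, creative); the exact wording is in Appendix C.
SEED_PROMPTS = [
    "Please reorganize some of the information in the document above into a structured format.",
    "Please summarize part of the document above.",
    "Please ask a question about the information in the document above.",
    "Please describe how the information in the document above might be used.",
    "Please make a creative request grounded in the document above.",
]

# One training example per sampled conversation (k = 1 in the paper's main experiments),
# keeping the chunk so the teacher can condition on it when computing Eq. (3):
# c_tilde, convo = self_study_conversation(corpus, generate, SEED_PROMPTS, k=1)
# example = {"context": c_tilde, "text": "".join(convo)}
```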

4.2 S ELF -S TUDY context-distillation objective

Given a fine-tuning dataset Dtrain , we adapt standard techniques from the model distillation literature [42,
43, 72]. We let F (·|x) denote the next token distribution given some input text x. Our teacher is the model
with the subcorpus, c̃, in context F (·|c̃) and our student is the same model adapted with a trainable cache
F Z (·). We use a classic distillation objective [30] that minimizes the KL-divergence between the teacher
and student next-token distributions over a sequence of tokens x and the corresponding subcorpus used to
generate them c̃.
    \arg\min_{Z} \; \sum_{(x, \tilde{c}) \in \mathcal{D}_{\text{train}}} \sum_{i=1}^{|x|} D_{\mathrm{KL}}\!\left( F(\cdot \mid \tilde{c} \oplus x[:i]) \,\|\, F_Z(\cdot \mid x[:i]) \right)    (3)

In Appendix A, we ablate the use of the context-distillation objective and show that it improves accuracy when
controlling for the amount of synthetic data (e.g. by 3.7 accuracy points on LONGHEALTH).
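
A hedged PyTorch sketch of Equation (3) follows: it takes the logits of the teacher (the frozen model with the chunk c̃ prepended) and of the student (the same model with the CARTRIDGE in place of the corpus KV pairs) at the positions of the synthetic conversation x, and sums the per-position KL divergences. Obtaining the aligned logits from a specific model is left out.

```python
import torch
import torch.nn.functional as F

def context_distillation_loss(teacher_logits: torch.Tensor,
                              student_logits: torch.Tensor) -> torch.Tensor:
    """Eq. (3): sum over positions of KL( F(.|c~ ⊕ x[:i]) || F_Z(.|x[:i]) ).

    teacher_logits: logits of the frozen model with the chunk in context, shape (|x|, |V|)
    student_logits: logits of the CARTRIDGE-augmented model,              shape (|x|, |V|)
    """
    teacher_logprobs = F.log_softmax(teacher_logits.detach(), dim=-1)  # teacher receives no gradient
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    # kl_div(input=log q, target=log p, log_target=True) computes KL(p || q) pointwise.
    return F.kl_div(student_logprobs, teacher_logprobs, log_target=True, reduction="sum")
```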

5 Results

We describe experiments evaluating the effectiveness of C ARTRIDGES trained with S ELF -S TUDY in various
long-context scenarios. Our results support the following claims. First, C ARTRIDGES trained with S ELF -
S TUDY can match or outperform ICL while maintaining generality and reducing serving costs (Section 5.1).
Second, S ELF -S TUDY is effective on corpora longer than the context window of the LLM (Section 5.2). Third,
when we concatenate two different C ARTRIDGES without any joint training, the model can respond to
queries requiring information from both C ARTRIDGES (Section 5.4). Finally, we include ablations to assess
the relative benefits of different aspects of S ELF -S TUDY and C ARTRIDGES (Section 5.3).

Datasets We study datasets consisting of diverse (q, r ) pairs about a single long document. Across datasets,
C ranges between 100k and 484k tokens. Our datasets are drawn from popular long-context benchmarks,
with some used as-released and others modified to meet this structure. These include: L ONG H EALTH [2],
MTOB [76], and QASPER [19]. We evaluate LLM response quality using accuracy for LONGHEALTH,
log perplexity for QASPER, and character n-gram F-score (chrF) for MTOB [64, 76]. Because each dataset
effectively consists of a "single" document, we train a single CARTRIDGE per dataset and evaluate it on the
query-response pairs (q, r). Appendix D provides further details.

[Figure 5 visual: panels for LongHealth, MTOB (KE), and QASPER. Curves show CARTRIDGES of different sizes
as a function of self-study duration (# of training steps), alongside constant-compute baselines (ICL and summary).]

Figure 5: Scaling SELF-STUDY compute. These plots show how quality improves as we scale the training
compute with SELF-STUDY. In all plots, the x-axis shows the total number of global training steps with batch
size 64 and maximum sequence length 1024. No synthetically generated data is reused (i.e. training proceeds
for one epoch). Curves are provided for CARTRIDGES of varying sizes (p ∈ {128, 512, 2048, 8192}). (Left)
The y-axis shows accuracy on LONGHEALTH [2] with LLAMA-8B. (Middle) The y-axis shows the chrF on
MTOB [76] with LLAMA-3B. (Right) The y-axis shows log-perplexity (lower is better) on QASPER [19] with
LLAMA-3B.

5.1 Pushing the quality/cost tradeoff frontier

We assess how C ARTRIDGES produced with S ELF -S TUDY fare in quality and memory consumption against
baselines for L ONG H EALTH and QASPER on L LAMA -3B. For both datasets, C fits within the model context
window (128k tokens). We compare to traditional ICL, two prompt compression baselines (prompt truncation
and prompt summarization using GPT-4o [59]), and a state-of-the-art KV cache compression baseline (Duo
Attention [38, 84]). We evaluate memory use in terms of KV cache size: the size of the KV cache for the ICL
model and prompt compression methods, the size of the C ARTRIDGE, and the size of the compressed KV
cache for KV cache compression methods like DuoAttention.
Figure 4 presents our main results. On both LONGHEALTH and QASPER, we find cache sizes at which
CARTRIDGES outperform ICL. Compared against ICL, CARTRIDGES offer substantial memory savings
at comparable performance: up to 10× for LONGHEALTH, and up to 100× for QASPER. In contrast,
compression baseline methods see performance degradations at compression factors as low as 2×. Crucially,
the small memory footprint of CARTRIDGES allows for much higher peak throughput (tokens/s). As Figure 3
(right) shows, cache sizes that match the performance of ICL allow for almost 26× higher throughput.
We also observe that CARTRIDGE performance scales as we increase the amount of compute used in self-
study: the longer a CARTRIDGE is trained, the better the task performance. Figure 5 plots performance for
differently sized CARTRIDGES as a function of the number of training steps. Across all sizes, we observe a
steady positive correlation between performance and compute.

5.2 Extending the effective context window

We evaluate whether SELF-STUDY allows us to accurately process corpora that exceed the context window
length. To study this, we consider the MTOB dataset and LLAMA-8B, which has a context window of 128k
tokens. MTOB provides two different long documents: a full 484k-token LaTeX textbook and a shorter 60k-
token version, which was manually curated by the dataset authors to exclude content not relevant to the
translation task. Even though the 484k textbook is 356k tokens longer than LLAMA-8B's context window
length, we can produce a CARTRIDGE for the full textbook using the chunking strategy of SELF-STUDY.
Figure 4 (middle plot) shows the performance of C ARTRIDGES of various sizes trained with S ELF -S TUDY.

[Figure 6 visual: (Left) Cartridge parameterization: prefix-tuning with increasing size vs. LoRA with increasing
rank, plotting MMLU accuracy against chrF on MTOB, with ICL (full corpus) and ICL (empty context) as
baselines. (Center) Self-study objective: context distillation vs. next-token prediction (Cartridge sizes 512 and
2048), chrF on MTOB vs. self-study duration (# of training steps). (Right) Self-study seed prompts: 5 seed
prompts vs. 1 seed prompt (Cartridge sizes 512 and 2048), chrF on MTOB vs. self-study duration.]

Figure 6: Ablating C ARTRIDGE and S ELF -S TUDY design choices. Ablations were performed on the
MTOB dataset (see Appendix A for full ablation experiments). (Left) We train C ARTRIDGES using two
different parameterizations: simplified prefix-tuning (as described in Section 3.2) and low-rank adaptation
(LoRA) [31]. The x-axis shows accuracy on MMLU and the y-axis shows accuracy on the target dataset. Each
point represents a different CARTRIDGE size. (Center) We train CARTRIDGES with SELF-STUDY using two
loss functions: a next token prediction loss (green) and a distillation loss (blue). The x axis is the number
of training steps, and the y axis is accuracy. Each hue represents a different C ARTRIDGE size. (Right) We
generate synthetic data according to Algorithm 1 and ablate the choice of seed prompts sampled on Line 2.
We consider two approaches: using a single, broad seed prompt (Green) or randomly sampling one of five
different types of seed prompts (Blue). The x axis is the number of training steps, and the y axis is accuracy.

As a point of comparison, we provide the results for KV cache baseline methods on the smaller 60k token
textbook, and also include ICL on a truncated version of the long textbook. As above, we observe that
the CARTRIDGE can match the performance of ICL on the hand-curated 60k-token version, while requiring
substantially less memory and only having access to the 484k-token version, which exceeds the context
window of LLAMA-8B. CARTRIDGES also outperform competitive baselines at every KV cache size, by up to
11.0 chrF points.

5.3 Ablating S ELF -S TUDY design choices

We perform ablations to study different aspects of S ELF -S TUDY and C ARTRIDGE parameterization. We
provide full results in Appendix A and highlight key findings here and in Figure 6.

C ARTRIDGE Parameterization In Section 3.2, we discuss how we parameterize the C ARTRIDGE with a
trainable KV cache, which is equivalent to a simplified version of prefix tuning [47]. There are a number
of other ways we could parameterize the CARTRIDGE, notably low-rank adaptation (LoRA), an extremely
popular parameter-efficient fine-tuning method [31].
We compare the prefix-tuning parameterization with LoRA (see Appendix A.1 for full results). First, we find
that the prefix-tuning parameterization is more effective than a memory-matched LoRA parameterization
on queries related to the corpus. For example, with C ARTRIDGES of size ∼ 0.6 GB on MTOB, prefix-tuning
outperforms LoRA by 4.5 ChRF points. (See Figure 8 for results on L ONG H EALTH and QASPER.) Even more
interesting is the gap between these parameterizations on queries unrelated to the document like MMLU [29].
When using a LoRA parameterization, we find that MMLU accuracy drops precipitously (from 54.7 to 45.3)
as we increase the C ARTRIDGE size (from 0.15 GB to 1.06 GB). In contrast, with prefix-tuning, the accuracy
drops much less rapidly (from 54.7 to 54.3) as we increase the size (from 0.15 GB to 0.96 GB). See Figure 8 for
plots illustrating these findings on L ONG H EALTH, QASPER, and MTOB. We also show that freezing the
attention sink (the first token in the key and value vectors) improves training stability (Figure 10).

[Figure 7 visual: (Left) Two independently trained Cartridges, one for a Pepsi 10-K (0.6 GB) and one for an
AMD 10-K (1.2 GB), are composed and loaded into the LLM, compared with truncated ICL over both
documents (39.8 GB). (Middle) log(perplexity) on multi-document questions for each method and cache size.
(Right) Example multi-document questions and responses, e.g. "Who audited the Boeing and AMD statements,
respectively?" and "List a few competitors for each of PepsiCo and AMD as stated in each 10-K."]

Figure 7: C ARTRIDGE Composition. (Left) Illustration of C ARTRIDGE composition, where two indepen-
dently trained C ARTRIDGES (one for a Pepsi 10-K and one for an AMD 10-K) are concatenated without any
additional training. (Middle) We evaluate composition on a dataset of multi-document questions requiring
information in two different ≈100k token documents with L LAMA -3B (see Appendix D). The x-axis shows
log-perplexity (lower is better) on gold-standard answers. We compare C ARTRIDGE composition with an (a)
ICL baseline where we truncate the document to fit in the 128k token context length and (b) a CARTRIDGE
baseline where we only include the C ARTRIDGE for one of the documents. (Right) Examples of responses to
multi-document questions using composed cartridges.

C ARTRIDGE Initialization We compare three different strategies for initializing the KV cache when
using the prefix-tuning parameterization: (1) random vectors (from a component-wise standard normal
distribution), (2) key and value vectors of random tokens, and (3) key and value vectors of the first p tokens
of the corpus. We find that initializing with key and value vectors of actual tokens (as opposed to random
vectors) is critical for achieving ICL-level performance. On L ONG H EALTH, random vectors achieve an
accuracy of 29.9% while key and value vectors of random tokens achieve an accuracy of 51.3%. Initializing
with the first p tokens provides an additional improvement of 4 percentage points to 55.3%. In the original
prefix-tuning paper, the authors show that initializing from tokens improves performance when performing
supervised fine-tuning on very small datasets [47]. Our results extend this finding to S ELF -S TUDY, where we
train on large synthetic datasets.

S ELF -S TUDY Seed Prompts Next, we ablate the choice of seed prompts (see Line 2 of Algorithm 1). We
compare two approaches: (1) always using the same seed prompt (“Please generate a single chat message to
begin a conversation about the information in the corpus. Ask a question about the corpus or make a request.") and
(2) randomly sampling one of five different types of seed prompts (e.g. structuring, summarization; see
full list in Appendix C). Note even with the latter approach, the seed prompts are generic: the same set
of seed prompts are used for all corpora. On MTOB, we find that using this small set of seed prompts
improves over the single seed prompt by 7.9 ChRF points (24.1 → 32.0; see Figure 6 Left). On L ONG H EALTH,
the improvement is 4.8 accuracy points (43.6 → 48.4 on L ONG H EALTH; see Figure 11). Interestingly, on
QASPER we do not see any significant benefit from using the diverse seed prompts. This is perhaps because,
compared to L ONG H EALTH and MTOB, the queries in QASPER are less reasoning intensive.

S ELF -S TUDY Objective Finally, we evaluate the importance of the context distillation objective (defined
in Section 4.2). Using the same S ELF -S TUDY synthetic data for both objectives, we compare the context-
distillation objective with a simpler next-token prediction objective. On MTOB, we find that using a context
distillation objective on the synthetic conversation data improves ChRF by 8.6 points (24.9 → 33.5; see
Figure 12 Center). We also see improvements on L ONG H EALTH and QASPER (see Figure 12).

5.4 Composing C ARTRIDGES

We evaluate if independently trained CARTRIDGES can be composed in order to serve queries about two
different corpora (see Figure 7, Left). We train CARTRIDGES across sizes {512, 1024, 2048, 4096} and across long
10-K documents from AMD, Pepsi, AMEX, and Boeing [33]. For each pair of CARTRIDGES (6 pairs
per cache size), we evaluate using a dataset of multi-document questions, i.e., questions requiring information
from both 10-Ks. Surprisingly, we find composition not only leads to coherent LLM generations off-the-shelf
without any re-training (Figure 7, Right), but also substantially outperforms the use of a single CARTRIDGE
(i.e. for only AMD) or ICL (which struggles due to context length limits) on the multi-document questions
(Figure 7, Center).
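
Mechanically, composition amounts to concatenating the trainable KV slots of the two CARTRIDGES along the sequence dimension before appending the query's own KV pairs. The sketch below is our own illustration of that idea, reusing the per-layer (z_k, z_v) layout from Section 3.2; position handling and serving details are omitted.

```python
import torch

def compose_cartridges(cartridges):
    """Concatenate independently trained Cartridges into one KV prefix, with no retraining.

    Each cartridge is a list over layers of (z_k, z_v) tensors of shape (p_i, d);
    the composition behaves like a single Cartridge with sum(p_i) slots per layer.
    """
    num_layers = len(cartridges[0])
    composed = []
    for layer in range(num_layers):
        z_k = torch.cat([c[layer][0] for c in cartridges], dim=0)
        z_v = torch.cat([c[layer][1] for c in cartridges], dim=0)
        composed.append((z_k, z_v))
    return composed

# e.g. answering a multi-document query over the Pepsi and AMD 10-Ks:
# kv_prefix = compose_cartridges([pepsi_cartridge, amd_cartridge])
# The query's own KV pairs are then appended after this prefix, as in Section 3.2.
```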

6 Discussion and conclusion

We propose C ARTRIDGES as an alternative to ICL for settings where many different user messages reference
the same large corpus of text. We demonstrate across a diverse set of language model workloads that,
when trained via S ELF -S TUDY, they match ICL’s response quality while substantially reducing memory
consumption (38.6× memory reduction across our evaluations) and increasing peak throughput (26.4×
higher tokens per second). C ARTRIDGES are simple to train, composable, and compatible with existing LLM
serving infrastructure.
However, compared with ICL, SELF-STUDY is not without limitations. Using SELF-STUDY to produce a KV
cache is much more costly than simply running standard ICL prefill. With our unoptimized implementation,
training an ICL-quality CARTRIDGE takes ∼30 minutes on a single 8×H100 node (for LLAMA-8B). So our
work does not provide a drop-in replacement for ICL, but rather demonstrates one way to trade off increased
compute for reduced memory when constructing a KV cache. This tradeoff is extremely advantageous
in many settings: users often issue many queries over the same corpus, and SELF-STUDY can be run
offline on idle or underutilized compute (e.g. at night when user load is low [25, 34]). Furthermore, there is
ample room for optimizations (e.g. improved shared-prefix attention kernels [18, 39, 90]) that would make
the SELF-STUDY training procedure more efficient.
Looking forward, we envision C ARTRIDGES enabling a broad class of context-aware AI applications that are
intractable with ICL today, from medical assistants that know a patient’s full medical history to LLM-powered
IDEs that understand entire codebases.

Acknowledgments We thank Jordan Juravsky, Dan Biderman, Bradley Brown, Mayee Chen, Avanika
Narayan, Avner May, Bill Mark, Benjamin Spector, Roberto Garcia, Quinn Mcintyre, Yasa Baig, Geoff Angus,
Kelly Buchanan, Mert Yuksekgonul, Eric Nguyen, Eric Wu, Kevin Wu, Owen Dugan, Jon Saad-Falcon,
Simon Guo and the entire Zou, Hazy, and Scaling Intelligence research labs for helpful discussions and
feedback. We gratefully acknowledge Modal, Prime Intellect, Voltage Park, and Together AI for providing
the GPUs that supported this work. We gratefully acknowledge the support of NIH under No. U54EB020405
(Mobilize), NSF under Nos. CCF2247015 (Hardware-Aware), CCF1763315 (Beyond Sparsity), CCF1563078
(Volume to Velocity), and 1937301 (RTML); US DEVCOM ARL under Nos. W911NF-23-2-0184 (Long-
context) and W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under Nos. N000142312633 (Deep
Signal Processing); Stanford HAI under No. 247183; NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC,
Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud,
Salesforce, Total, the HAI-GCP Cloud Credits for Research program, the Stanford Data Science Initiative
(SDSI), members of the Stanford SEAMS project: IBM and Felicis, as well as members of the Stanford DAWN
project: Meta, Google, and VMWare. SE is supported by the NSF Graduate Research Fellowship Program.
AR’s research is supported by NSF grant CCF#2247014. The U.S. Government is authorized to reproduce
and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any
opinions, findings, and conclusions or recommendations expressed in this material are those of the authors
and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH,
ONR, or the U.S. Government.

Contributions SE and RE conceived of C ARTRIDGES and S ELF -S TUDY. SE, RE, and SA designed the method,
implemented the experiments, wrote the manuscript, and contributed equally to the project. NG made
substantial contributions to the structure of the project and the final manuscript. EL and DZ implemented
and ran experiments and made meaningful contributions to the manuscript. WT implemented the LoRA
baselines. DZ and AR led the theoretical analysis. AR, JZ, AM, and CR supervised the project.

References
[1] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael
Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv
preprint arXiv:2412.08905, 2024.

[2] Lisa Adams, Felix Busch, Tianyu Han, Jean-Baptiste Excoffier, Matthieu Ortala, Alexander Löser,
Hugo JWL Aerts, Jakob Nikolas Kather, Daniel Truhn, and Keno Bressem. Longhealth: A question
answering benchmark with long clinical documents. arXiv preprint arXiv:2401.14490, 2024.

[3] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit
Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.
arXiv preprint arXiv:2305.13245, 2023.

[4] Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. arXiv preprint, 2024.

[5] Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra,
and Christopher Ré. Zoology: Measuring and improving recall in efficient language models, 2023.

[6] Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James
Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-
throughput tradeoff. arXiv preprint arXiv:2402.18668, 2024.

[7] Simran Arora and Christopher Ré. Can foundation models help us achieve perfect secrecy? arXiv
preprint arXiv:2205.13722, 2022.

[8] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova,
Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long
short-term memory. arXiv preprint arXiv:2405.04517, 2024.

[9] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv
preprint arXiv:2004.05150, 2020.

[10] Aman Bhargava, Cameron Witkowski, Alexander Detkov, and Matt Thomson. Prompt baking. arXiv
preprint arXiv:2409.13697, 2024.

[11] Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-
Chi Huang, Luis Ceze, Mohamed S Abdelfattah, and Kai-Chiang Wu. Palu: Compressing kv-cache
with low-rank projection. arXiv preprint arXiv:2407.21118, 2024.

[12] Vivek Chari, Guanghui Qin, and Benjamin Van Durme. Kv-distill: Nearly lossless learnable context
compression for llms. arXiv preprint arXiv:2503.10337, 2025.

[13] Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. Punica:
Multi-tenant lora serving. Proceedings of Machine Learning and Systems, 6:1–13, 2024.

[14] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to
compress contexts. arXiv preprint arXiv:2305.14788, 2023.

[15] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse
transformers. arXiv preprint arXiv:1904.10509, 2019.

[16] Yu-Neng Chuang, Tianwei Xing, Chia-Yuan Chang, Zirui Liu, Xun Chen, and Xia Hu. Learning to
compress prompt in natural language formats. arXiv preprint arXiv:2402.18700, 2024.

[17] Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre FT
Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, et al. Saullm-7b: A pioneering large
language model for law. arXiv preprint arXiv:2403.03883, 2024.

[18] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-
efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–
16359, 2022.

[19] Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A
dataset of information-seeking questions and answers anchored in research papers. arXiv preprint
arXiv:2105.03011, 2021.

[20] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong
Wu, Tianyu Liu, et al. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022.

[21] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 Herd of Models.
arXiv preprint arXiv:2407.21783, 2024.

[22] Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, and Graham Neubig. Better synthetic
data by retrieving and transforming existing datasets. arXiv preprint arXiv:2404.14361, 2024.

[23] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you
what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023.

[24] Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for
context compression in a large language model. arXiv preprint arXiv:2307.06945, 2023.

[25] Kanishk Goel, Jayashree Mohan, Nipun Kwatra, Ravi Shreyas Anupindi, and Ramachandran Ramjee.
Niyama: Breaking the silos of llm inference serving. arXiv preprint arXiv:2503.22562, 2025.

[26] Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok,
and Yu Zhang. Mixture of cluster-conditional lora experts for vision-language instruction tuning.
arXiv preprint arXiv:2312.12379, 2023.

[27] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv
preprint arXiv:2312.00752, 2023.

[28] Neel Guha, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Alex Chohlas-Wood, Austin
Peters, Brandon Waldon, Daniel Rockmore, Diego Zambrano, et al. Legalbench: A collaboratively built
benchmark for measuring legal reasoning in large language models. Advances in Neural Information
Processing Systems, 36:44123–44279, 2023.

[29] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob
Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,
2020.

[30] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv
preprint arXiv:1503.02531, 2015.

[31] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang,
Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

[32] Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. Lorahub: Efficient
cross-task generalization via dynamic lora composition. arXiv preprint arXiv:2307.13269, 2023.

[33] Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen.
Financebench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944,
2023.

[34] Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, Anjaly Parayil, Ankur Mallick, Rujia Wang, Renee St
Amant, Chetan Bansal, Victor Rühle, Anoop Kulkarni, et al. Serving models, fast and slow: optimizing
heterogeneous llm inferencing workloads at scale. arXiv preprint arXiv:2502.14617, 2025.

[35] Dulhan Jayalath, James Bradley Wendt, Nicholas Monath, Sandeep Tata, and Beliz Gunel. Long-
range tasks using short-context llms: Incremental reasoning with structured memories. arXiv preprint
arXiv:2412.18914, 2024.

[36] Fengqing Jiang. Identifying and mitigating vulnerabilities in llm-integrated applications. Master’s
thesis, University of Washington, 2024.

[37] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing
prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736, 2023.

[38] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing
prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika
Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,
pages 13358–13376, Singapore, December 2023. Association for Computational Linguistics.

[39] Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, and Azalia Mirhoseini.
Hydragen: High-throughput llm inference with shared prefixes, 2024.

[40] Jordan Juravsky, Ayush Chakravarthy, Ryan Ehrlich, Sabri Eyuboglu, Bradley Brown, Joseph Shetaye,
Christopher Ré, and Azalia Mirhoseini. Tokasaurus: An llm inference engine for high-throughput
workloads. https://scalingintelligence.stanford.edu/blogs/tokasaurus/, 2025.

[41] Junhyuck Kim, Jongho Park, Jaewoong Cho, and Dimitris Papailiopoulos. Lexico: Extreme kv cache
compression via sparse coding over universal dictionaries. arXiv preprint arXiv:2412.08890, 2024.

[42] Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016
conference on empirical methods in natural language processing, pages 1317–1327, 2016.

[43] Kalle Kujanpää, Harri Valpola, and Alexander Ilin. Knowledge injection via prompt distillation. arXiv
preprint arXiv:2412.14964, 2024.

[44] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph
Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model
serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages
611–626, 2023.

[45] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. arXiv preprint arXiv:2104.08691, 2021.

[46] Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye, Zhiyuan Cheng, Yinghao Tang, Yan Zhang,
Lei Duan, Jie Zuo, Cal Yang, et al. Mixlora: Enhancing large language models fine-tuning with
lora-based mixture of experts. arXiv preprint arXiv:2404.15159, 2024.

[47] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In
Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online, August 2021. Association
for Computational Linguistics.

[48] Yucheng Li. Unlocking context constraints of llms: Enhancing context efficiency of llms with self-
information-based content filtering. arXiv preprint arXiv:2304.12102, 2023.

[49] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai,
Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation.
Advances in Neural Information Processing Systems, 37:22947–22970, 2024.

[50] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong
Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts
language model. arXiv preprint arXiv:2405.04434, 2024.

[51] Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Minicache: Kv
cache compression in depth dimension for large language models. Advances in Neural Information
Processing Systems, 37, 2024.

[52] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and
Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association
for Computational Linguistics, 12:157–173, 2024.

[53] Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, and
Muhan Zhang. Lift: Improving long context understanding of large language models through long
input fine-tuning. arXiv preprint arXiv:2502.14644, 2025.

[54] Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Principal singular values and singular vectors
adaptation of large language models. Advances in Neural Information Processing Systems, 37:121038–
121072, 2024.

[55] Jesse Mu, Xiang Li, and Noah Goodman. Learning to compress prompts with gist tokens. Advances in
Neural Information Processing Systems, 36:19327–19352, 2023.

[56] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. Using an
llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on
Software Engineering, pages 1–13, 2024.

[57] Avanika Narayan, Dan Biderman, Sabri Eyuboglu, Avner May, Scott Linderman, James Zou, and
Christopher Re. Minions: Cost-efficient collaboration between on-device and cloud language models.
arXiv preprint arXiv:2502.15964, 2025.

[58] Nihal V Nayak, Yiyang Nan, Avi Trost, and Stephen H Bach. Learning to generate instruction tuning
datasets for zero-shot task adaptation. arXiv preprint arXiv:2402.18334, 2024.

[59] OpenAI. Gpt-4o system card, 2024.

[60] Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, and Roy Schwartz. Transformers are multi-state
rnns. arXiv preprint arXiv:2401.06104, 2024.

[61] Lisa Larrimore Ouellette, Amy Motomura, Jason Reinecke, and Jonathan S Masur. Can ai hold office
hours? Available at SSRN 5166938, 2025.

[62] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E
Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[63] Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor
Rühle, Yuqing Yang, Chin-Yew Lin, et al. Llmlingua-2: Data distillation for efficient and faithful
task-agnostic prompt compression. arXiv preprint arXiv:2403.12968, 2024.

[64] Maja Popović. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth
Workshop on Statistical Machine Translation, pages 392–395, 2015.

[65] Guanghui Qin, Corby Rosset, Ethan C Chau, Nikhil Rao, and Benjamin Van Durme. Dodo: Dynamic
contextual compression for decoder-only lms. arXiv preprint arXiv:2310.02409, 2023.

[66] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for
machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

[67] Haris Riaz, Sourav Bhabesh, Vinayak Arannil, Miguel Ballesteros, and Graham Horwood. Meta-
synth: Meta-prompting-driven agentic scaffolds for diverse synthetic data generation. arXiv preprint
arXiv:2504.12563, 2025.

[68] Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, and Douglas Orr.
Sparq attention: Bandwidth-efficient llm inference. arXiv preprint arXiv:2312.04985, 2023.

[69] Melisa Russak, Umar Jamil, Christopher Bryant, Kiran Kamble, Axel Magnuson, Mateusz Russak, and
Waseem AlShikh. Writing in the margins: Better inference pattern for long context retrieval. arXiv
preprint arXiv:2408.14906, 2024.

[70] Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, and Kaushik Roy. Eigen attention: Attention in
low-rank space for kv cache compression. arXiv preprint arXiv:2408.05646, 2024.

[71] Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint
arXiv:1911.02150, 2019.

[72] Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context. arXiv preprint
arXiv:2209.15189, 2022.

[73] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can
be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.

[74] Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia
Zhou, and Yiqun Liu. Parametric retrieval augmented generation. arXiv preprint arXiv:2501.15915,
2025.

[75] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-
aware sparsity for efficient long-context llm inference. arXiv preprint arXiv:2406.10774, 2024.

[76] Garrett Tanzer, Mirac Suzgun, Eline Visser, Dan Jurafsky, and Luke Melas-Kyriazi. A benchmark for
learning to translate a new language from one grammar book. arXiv preprint arXiv:2309.16575, 2023.

[77] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhu-
patiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2:
Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.

[78] U.S. Securities and Exchange Commission. How to read a 10-K, 2011. Accessed: 2025-05-14.

[79] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems,
30, 2017.

[80] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing
Xiong, and Mi Zhang. D2o: Dynamic discriminative operations for efficient generative inference of
large language models. arXiv preprint arXiv:2406.13035, 2024.

[81] Zheng Wang, Boxiao Jin, Zhongzhi Yu, and Minjia Zhang. Model tells you where to merge: Adaptive
kv cache merging for llms on long-context tasks. arXiv preprint arXiv:2407.08454, 2024.

[82] Xun Wu, Shaohan Huang, and Furu Wei. Mixture of lora experts. arXiv preprint arXiv:2404.13628, 2024.

[83] Chaojun Xiao, Zhengyan Zhang, Chenyang Song, Dazhi Jiang, Feng Yao, Xu Han, Xiaozhi Wang,
Shuo Wang, Yufei Huang, Guanyu Lin, et al. Configurable foundation models: Building llms from a
modular perspective. arXiv preprint arXiv:2409.02877, 2024.

[84] Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and
Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. arXiv
preprint arXiv:2410.10819, 2024.

[85] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language
models with attention sinks, 2024.

[86] Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit
Bansal, Leshem Choshen, and Alessandro Sordoni. A survey on model moerging: Recycling and
routing among specialized experts for collaborative learning. arXiv preprint arXiv:2408.07057, 2024.

[87] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng
Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

[88] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers
with the delta rule over sequence length. arXiv preprint arXiv:2406.06484, 2024.

[89] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers
with the delta rule over sequence length, 2025.

[90] Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris
Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable
attention engine for llm inference serving. arXiv preprint arXiv:2501.01005, 2025.

[91] Howard Yen. Long-context language modeling with parallel context encoding. Master’s thesis,
Princeton University, 2024.

[92] Hao Yu, Zelan Yang, Shen Li, Yong Li, and Jianxin Wu. Effectively compress kv heads for llm. arXiv
preprint arXiv:2406.07056, 2024.

[93] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago
Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer
sequences. Advances in neural information processing systems, 33:17283–17297, 2020.

[94] Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, and Zhiming Zheng. Adacomp: Ex-
tractive context compression with adaptive predictor for retrieval-augmented large language models.
arXiv preprint arXiv:2409.01579, 2024.

[95] Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, and Yelong
Shen. Lorc: Low-rank compression for llms kv cache with a progressive compression strategy. arXiv
preprint arXiv:2410.03111, 2024.

[96] Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, and Andrew Chi-Chih
Yao. Tensor product attention is all you need. arXiv preprint arXiv:2501.06425, 2025.

[97] Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji.
Cam: Cache merging for memory-efficient llms inference. In Forty-first International Conference on
Machine Learning, 2024.

[98] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song,
Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative
inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710,
2023.

[99] Ziyu Zhao, Leilei Gan, Guoyin Wang, Wangchunshu Zhou, Hongxia Yang, Kun Kuang, and Fei Wu.
Loraretriever: Input-aware lora retrieval and composition for mixed tasks in the wild. arXiv preprint
arXiv:2402.09997, 2024.

[100] Ziyu Zhao, Tao Shen, Didi Zhu, Zexi Li, Jing Su, Xuwu Wang, Kun Kuang, and Fei Wu. Merging loras
like playing lego: Pushing the modularity of lora to extremes through rank-wise clustering. arXiv
preprint arXiv:2409.16167, 2024.

[101] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi
Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured
language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024.

[102] Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D Manning, Peter
Henderson, and Daniel E Ho. A reasoning-focused legal retrieval benchmark. In Proceedings of the
2025 Symposium on Computer Science and Law, pages 169–193, 2025.

[103] Yuhao Zhou, Sirui Song, Boyang Liu, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Zhihao Zhang, Wei Li,
and Xuanjing Huang. Elitekv: Scalable kv cache compression via rope frequency selection and joint
low-rank projection. arXiv preprint arXiv:2503.01586, 2025.

[Figure 8 plots. Rows: LongHealth (top), QASPER (middle), MTOB (bottom). Columns: target-task performance (accuracy, log-perplexity, or ChRF) vs. MMLU accuracy; MMLU accuracy vs. Cartridge size (GB); MMLU accuracy vs. self-study duration (# of training steps). Legend: Cartridge parameterizations (prefix-tuning with increasing tokens, LoRA with increasing rank) and baselines (ICL with full corpus, ICL with empty context).]

Figure 8: Comparing C ARTRIDGE parameterizations. We train C ARTRIDGES using S ELF -S TUDY on the
corpora from L ONG H EALTH (Top), QASPER (Middle), and MTOB (Bottom) using two different parame-
terizations: simplified prefix-tuning (as described in Section 3.2) and low-rank adaptation (LoRA) [31]. We
experiment with different C ARTRIDGE sizes and choose LoRA rank and prefix-tuning cache size to align
on memory consumption. We evaluate the performance of the C ARTRIDGES on questions from the target
dataset (L ONG H EALTH or QASPER) using the same protocol as in Figure 4 and also on questions from
MMLU [29] that are unrelated to the corpora. (Left) The x-axis shows accuracy on MMLU and the y-axis
shows accuracy on the target dataset. Each point represents a different C ARTRIDGE size. (Center) The x-axis
shows C ARTRIDGE size in GB, and the y-axis shows accuracy on MMLU. (Right) The x-axis shows self-study
duration in training steps, and the y-axis shows accuracy on MMLU. The shade of the points represents the
size of the C ARTRIDGE.

A Extended Results

In this section, we ablate the main design choices of C ARTRIDGES and S ELF -S TUDY.

20
A.1 C ARTRIDGE design choices: parameterization and initialization

In our experiments, we parameterize the C ARTRIDGE with a simplified version of prefix-tuning and initialize
with a truncated KV-cache (see Section 3.2). In this section, we describe ablation experiments motivating
these design choices. First, we compare two different C ARTRIDGE parameterizations (Figure 8): simplified
prefix-tuning [47] and low-rank adaptation (LoRA) [31]. Then, we demonstrate the importance of proper
C ARTRIDGE initialization (Figure 9).

Parameterization We evaluate C ARTRIDGES trained on corpora from L ONG H EALTH or QASPER on both
in-domain (i.e. questions from L ONG H EALTH or QASPER) and out-of-domain (i.e. questions from an unrelated
benchmark, MMLU [29]) queries.
We find that the prefix-tuning parameterization is more effective than a memory-matched LoRA parameteri-
zation on both in-domain and out-of-domain queries. This is illustrated in Figure 8 (Left), where we see that
prefix-tuning occupies the top-right corner of the plot (high accuracy on both MMLU and the target dataset).
Notably, we find that as we increase the C ARTRIDGE size with LoRA tuning, performance on out-of-domain
queries (MMLU) drops significantly. At 1.06 GB (LoRA rank 1632), MMLU accuracy drops from 60.0%
to 45.3%. This drop in performance is highly correlated with the size of the C ARTRIDGE, suggesting that
LoRA is not well-suited to large Cartridges, which we show in Figure 4 are important for recovering ICL
performance. In contrast, with prefix-tuning the accuracy only drops to 54.3% at 1.06 GB. This degradation
is mostly invariant to the size of the C ARTRIDGE (54.7% at 0.15 GB), demonstrating that out-of-domain
performance is robust across C ARTRIDGE sizes.
On in-domain queries, prefix-tuning also outperforms LoRA, but the gap is smaller. Across all C ARTRIDGE
sizes, the best L ONG H EALTH accuracy prefix-tuning achieves is 55.6% at 0.96 GB, while the best LoRA
accuracy is 47.25% at 0.26 GB. Interestingly, LoRA accuracy at the largest C ARTRIDGE sizes is lower: 41.3%
at 0.96 GB. It is possible that this is due to the out-of-domain degradation of LoRA discussed above. Since
queries in the L ONG H EALTH test set are quite different from the synthetic queries generated by S ELF -S TUDY (e.g.
they are multiple choice and require complicated reasoning traces), out-of-domain robustness may also be
important for "in-domain" performance.
It is not clear why prefix-tuning is so much more robust than LoRA to out-of-domain performance degradation.
This is surprising given the similarity between a KV cache and an MLP: both are linear transformations
separated by a non-linearity. It is possible that this is due to the difference in activation function (SiLU for the
MLP vs. Softmax for attention). We leave a more detailed investigation into the root cause of this difference to future work.
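To make the comparison concrete, a minimal sketch of the prefix-tuning parameterization is shown below: the C ARTRIDGE is simply a set of trainable key and value vectors for every layer, which are loaded as a KV cache at inference time. The module and tensor shapes are illustrative assumptions, not the exact implementation from our codebase.

import torch
import torch.nn as nn

class PrefixTuningCartridge(nn.Module):
    # A Cartridge parameterized as trainable key/value prefixes for every layer
    # (simplified prefix-tuning). Shapes are illustrative assumptions.
    def __init__(self, num_layers, num_tokens, num_kv_heads, head_dim):
        super().__init__()
        self.keys = nn.ParameterList(
            [nn.Parameter(torch.randn(num_tokens, num_kv_heads, head_dim)) for _ in range(num_layers)]
        )
        self.values = nn.ParameterList(
            [nn.Parameter(torch.randn(num_tokens, num_kv_heads, head_dim)) for _ in range(num_layers)]
        )

    def as_kv_cache(self):
        # At inference time, these tensors are loaded as if they were the KV cache of a
        # prefix; the frozen model attends to them alongside the query's own tokens.
        # (Random initialization here is for brevity; the actual initialization is
        # discussed in the paragraph below.)
        return [(k, v) for k, v in zip(self.keys, self.values)]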

Initialization The standard way of initializing a k-token C ARTRIDGE in our main paper is to use the KV
cache of the first k tokens of the source document. In Figure 9, we ablate different initialization sources. We
try two additional initializations: random vectors and random tokens.
For random vectors, we simply initialize the parameters of the C ARTRIDGE from a component-wise standard
normal distribution. For random tokens, we initialize the C ARTRIDGE as the KV cache of the first k tokens of
arbitrary text (specifically, the Wikipedia page for gradient). The important difference between these two
strategies is that for random tokens the initial C ARTRIDGE is a "valid" KV cache produced by the model, while
for random vectors it is not.
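A minimal sketch of these three strategies is shown below, assuming a Hugging Face-style causal language model whose forward pass returns past_key_values; the function names and tensor shapes are illustrative assumptions.

import torch

@torch.no_grad()
def init_from_first_k_tokens(model, corpus_ids, k):
    # Default strategy: the KV cache of the first k tokens of the source document.
    out = model(input_ids=corpus_ids[:, :k], use_cache=True)
    return out.past_key_values  # per-layer (keys, values) tensors

@torch.no_grad()
def init_from_random_tokens(model, unrelated_ids, k):
    # "Random tokens": KV cache of k tokens of arbitrary, unrelated text, so the
    # initialization is still a "valid" cache produced by the model.
    out = model(input_ids=unrelated_ids[:, :k], use_cache=True)
    return out.past_key_values

def init_from_random_vectors(num_layers, k, num_kv_heads, head_dim):
    # "Random vectors": component-wise standard normal; not a valid model-produced cache.
    return [
        (torch.randn(1, num_kv_heads, k, head_dim), torch.randn(1, num_kv_heads, k, head_dim))
        for _ in range(num_layers)
    ]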

Freezing the attention sink A small yet important detail of training a C ARTRIDGE is that we do not allow the
first token's key and value vectors to be trainable. As studied in [85], the first key vector, which corresponds
to the beginning-of-sequence token and is thus the same for every sequence, acts as an "attention sink". We
observed that allowing those key and value vectors to be trainable when training a C ARTRIDGE led to
training instability (see Figure 10). For example, on some runs the MMLU accuracy would dip below 30%.
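A minimal sketch of this detail is shown below, assuming the per-layer keys and values are stored as tensors of shape (num_tokens, num_kv_heads, head_dim); the helper names and shapes are illustrative, not the exact implementation.

import torch
import torch.nn as nn

def split_attention_sink(init_keys, init_values):
    # The first position (the beginning-of-sequence "attention sink") is kept frozen;
    # only the remaining positions become trainable parameters.
    sink = (init_keys[:1].clone(), init_values[:1].clone())            # frozen
    trainable = (nn.Parameter(init_keys[1:].clone()),
                 nn.Parameter(init_values[1:].clone()))                # trained
    return sink, trainable

def assemble_prefix(sink, trainable):
    # Reassemble the full prefix with the frozen sink in position 0 before attention.
    return (torch.cat([sink[0], trainable[0]], dim=0),
            torch.cat([sink[1], trainable[1]], dim=0))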

[Figure 9 plots: LongHealth accuracy vs. self-study duration (# of training steps) for Cartridges of 2048 and 8192 tokens, initialized from the first k tokens of the corpus, from random tokens, and from random vectors.]

Figure 9: Ablating C ARTRIDGE initialization. We train C ARTRIDGES using S ELF -S TUDY on the corpora
from L ONG H EALTH with three different initialization strategies. The x-axis is the number of training steps and
the y-axis is the accuracy on L ONG H EALTH. The blue lines show results when initializing the C ARTRIDGE
using the KV cache of the first k tokens of the document. The purple lines initialize the C ARTRIDGE
from the KV cache of unrelated text. The green lines initialize the C ARTRIDGE with random vectors.
Initializing from the first k tokens leads to slightly stronger results than initializing from the KV cache of
random text. This difference may be more prominent on other corpora where the first k tokens are more
relevant to solving the downstream task.

[Figure 10 plots: MMLU accuracy (left) and LongHealth accuracy (right) vs. self-study duration (# of training steps), comparing a frozen first token against a trained first token for a 4096-token Cartridge trained on LongHealth with self-study.]

Figure 10: Freezing the attention sink. In both plots, the y-axis is accuracy and the x-axis is the training step.
The green line corresponds to a run where we allow a trainable first token. (Left) The y-axis is MMLU
accuracy. This plot exemplifies the training instability we observed when the first token's key and value
vectors were trainable: the MMLU score dips below 30% before recovering. (Right) The y-axis is accuracy on questions
from L ONG H EALTH.

A.2 S ELF -S TUDY design choices: data-generation and objective

In S ELF -S TUDY training we use a seeded data-generation process and a context-distillation training objective
(see Section 4). In this section, we ablate these design choices, comparing against the performance of
S ELF -S TUDY with simpler data-generation and objectives.

Data Generation In Section 4.1, we describe how we use five different seed prompt types when generating
data with Algorithm 1. These prompt types, structuring, summarization, question, use cases, and creative, are
described in more detail in Appendix C.1.

In this section, we compare the performance of S ELF -S TUDY with these five prompt types against S ELF -
S TUDY with a single prompt: “Please generate a single chat message to begin a conversation about the information
in the corpus. Ask a question about the corpus or make a request."
Across three datasets, we find that using the five different prompt types during S ELF -S TUDY leads to higher
quality C ARTRIDGES (see Figure 11). On MTOB with C ARTRIDGES of size 1024 tokens, we see a 7.9 point
ChRF improvement (24.1 → 32.0). On L ONG H EALTH, the improvement is 5.5 accuracy points (45.8 → 51.3).
Interestingly, on QASPER, we see no benefit from using the five different prompt types. It is possible this
is because the queries in the QASPER dataset are mostly factual questions that do not require complex
reasoning the way the L ONG H EALTH and MTOB queries do.

[Figure 11 plots: LongHealth, MTOB (KE), and QASPER performance vs. self-study duration (# of training steps), comparing self-study with 5 seed prompts against 1 seed prompt.]

Figure 11: Diverse seed prompts improve quality. We generate synthetic data according to Algorithm 1 and
ablate the choice of seed prompts sampled on Line 2. We consider two approaches: using a single, broad
seed prompt (Green) or randomly sampling one of five different types of seed prompts (Blue). We train
C ARTRIDGES using self-study with these two strategies on L ONG H EALTH, MTOB and QASPER corpora. In
all plots, the x axis is the number of training steps, and the y axis is either accuracy (for L ONG H EALTH and
MTOB) or perplexity on the ground truth answer (for QASPER). We use a C ARTRIDGE size of 1024 tokens.

Training Objective In Section 4, we describe the context-distillation objective we use [10, 42, 72]. This
approach requires that we collect top output probabilities from the in-context model’s output distribution
during data generation. A simpler alternative would be to just use a next-token prediction objective with a
cross-entropy loss.
In our comparison, we find that this simpler objective underperforms the context-distillation objective
(see Figure 12). Most notably, on MTOB with 2048 token C ARTRIDGES, context-distillation outperforms
next-token prediction by 8.3 ChRF points (24.9 → 33.2). On LongHealth, the gap is 3.7 accuracy points
(47.6 → 51.3).
As shown in Figure 12, quality consistently improves with more S ELF -S TUDY compute. It is
possible, therefore, that by spending more compute during S ELF -S TUDY with the next-token prediction objective,
we could close the gap. However, for a fixed amount of S ELF -S TUDY compute, context-distillation is
considerably more effective.
These results demonstrate how context-distillation plays an important role in efficiently recovering ICL
performance with S ELF -S TUDY.
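As a rough illustration of the objective described in Section 4, the sketch below implements a cross-entropy loss between the student's distribution (conditioned on the C ARTRIDGE) and the teacher's in-context distribution, where the teacher distribution is stored sparsely as its top-k next-token probabilities collected during data generation. The tensor names and shapes are assumptions for the sketch, not the exact loss code from our experiments.

import torch
import torch.nn.functional as F

def context_distillation_loss(student_logits, topk_ids, topk_probs):
    # student_logits: (seq, vocab) logits from the model conditioned on the Cartridge.
    # topk_ids:       (seq, k) token ids of the teacher's top-k next-token predictions,
    #                 collected while the teacher had the full corpus in context.
    # topk_probs:     (seq, k) the teacher's probabilities for those tokens.
    log_probs = F.log_softmax(student_logits, dim=-1)                 # (seq, vocab)
    student_topk = torch.gather(log_probs, dim=-1, index=topk_ids)    # (seq, k)
    # Cross-entropy against the (truncated, renormalized) teacher distribution.
    teacher = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return -(teacher * student_topk).sum(dim=-1).mean()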

A.3 Throughput measurement details

We provide details for the throughput measurements in Figure 3. We use the state-of-the-art SGLang
inference system, with default parameters [101]. We measure throughput on a single H100 GPU.
We first determine the largest batch size b that fits in GPU memory, given a cache of size k tokens. We then
randomly initialize b C ARTRIDGES of size k and pre-load the C ARTRIDGES into GPU memory. We finally

[Figure 12 plots: LongHealth, MTOB (KE), and QASPER performance vs. self-study duration (# of training steps), comparing the context-distillation and next-token prediction objectives at Cartridge sizes of 512 and 2048 tokens.]

Figure 12: Context-distillation objective improves training efficiency. We train C ARTRIDGES using S ELF -
S TUDY on the corpora from L ONG H EALTH (Left), MTOB (Center) and QASPER (Right) using two loss
functions: a next token prediction loss (green) and a distillation loss (blue). We evaluate the performance
of the C ARTRIDGES on questions from the target dataset (L ONG H EALTH, MTOB or QASPER) using the
same protocol as in Figure 5. In all plots, the x axis is the number of training steps, and the y axis is either
accuracy (for L ONG H EALTH and MTOB) or perplexity on ground truth answer (for QASPER). The shade of
the points represents the size of the C ARTRIDGE. Using a distillation loss achieves higher accuracy (or lower
perplexity for QASPER) across datasets and C ARTRIDGE sizes.

measure the time taken to decode 128 tokens per sequence. The C ARTRIDGES and decoded tokens are
appended to a KV-cache during generation. We report the average of 5 iterations after using 3 warm-up
iterations.
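The timing protocol can be summarized with the following sketch; decode_batch is a hypothetical placeholder for a call into the serving engine (SGLang in our measurements), not a real API.

import time

def measure_decode_throughput(decode_batch, batch_size, num_decode_tokens=128,
                              warmup_iters=3, timed_iters=5):
    # decode_batch(batch_size, num_decode_tokens) is assumed to decode
    # num_decode_tokens tokens for each of batch_size sequences whose
    # Cartridges are already resident in GPU memory.
    for _ in range(warmup_iters):
        decode_batch(batch_size, num_decode_tokens)

    elapsed = 0.0
    for _ in range(timed_iters):
        start = time.perf_counter()
        decode_batch(batch_size, num_decode_tokens)
        elapsed += time.perf_counter() - start

    avg_seconds = elapsed / timed_iters
    return batch_size * num_decode_tokens / avg_seconds  # tokens per second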

B Extended Related Work

In this section, we provide a more in-depth discussion of the place our work occupies in the broader literature.
The structure below mirrors the structure of our paper: first we discuss work related to the parameterization
and initialization of C ARTRIDGES (Appendix B.1), then we cover work that inspired the design of S ELF -
S TUDY (Appendix B.2), and finally we describe other approaches aimed at reducing the size of the KV-cache,
many of which we compare against in our experiments (Appendix B.3).

B.1 Prior work related to the parameterization of C ARTRIDGES

Below we discuss prior work from the parameter-efficient fine-tuning literature that informs the way we
parameterize C ARTRIDGES in our work.

B.1.1 Parameter-efficient Fine-tuning (PEFT)

In order to adapt large language models (LLMs) to particular domains or tasks in a more compute and
memory-efficient manner, several parameter-efficient fine-tuning (PEFT) methods have been developed.
Some of the most widely used PEFT methods include Low-Rank Adaptation (LoRA) [31], prefix-tuning [47],
and prompt-tuning [45].
Leveraging prior observations that fine-tuned language models exhibit an intrinsic low-rank structure, Hu
et al. propose LoRA, which freezes the model parameters and injects trainable rank-decomposition matrices
into each transformer layer. LoRA exhibits on-par or better fine-tuning quality while reducing the
number of trainable parameters by up to 10,000 times and the GPU memory requirement by 3 times [31].
Li et al. and Lester et al. take a different, lightweight approach to fine-tuning, proposing tunable
"prefixes" and "soft prompts", respectively, which are prepended to queries in order to steer the model toward desired
outputs. Li et al. propose prefix-tuning, which learns a continuous representation of the prefix activations
at each transformer layer. These learned activations are then prepended to the activations obtained by
passing the input prompt through the frozen transformer. In contrast, Lester et al. propose prompt-tuning,
which optimizes continuous embeddings only at the input layer, prepending a series of learnable soft tokens to the input prompt.
Both methods show strong performance while greatly reducing the number of learnable parameters and
improving the compute and memory efficiency of language model adaptation.
Principal Singular values and Singular vectors Adaptation (PiSSA) [54] is another more recent PEFT method
that attempts to ameliorate the slow convergence problems of LoRA. PiSSA initializes the LoRA rank
decomposition matrices with the principal components of the original matrix, and exhibits faster convergence
and enhanced performance compared to LoRA on several tasks, including GSM8K and MATH.
Several of these methods, especially LoRA, have been adapted specifically for distilling knowledge provided
in context into the parameters of a language model. Some of those methods are described in the sections
below; our work can be viewed as an extension of prefix-tuning to long-context tasks.

B.1.2 Parameter-efficient Adapter Composition and Merging

A number of works have explored the idea of composing multiple different parameter-efficient adapters (e.g.
LoRAs) by summing them together, concatenating them, or using a dynamic mixture of experts [26, 32, 46,
82, 83, 86, 99, 100]. For example, Huang et al. propose LoraHub, a framework for dynamically weighting and
composing multiple language model adapters [32]. Given a set of LoRA modules for different upstream tasks
and a new, unseen task with in-context examples, LoraHub dynamically weights the LoRAs and composes a
new LoRA module for the task. Similarly, Zhao et al. propose a method for dynamically retrieving the most
relevant language model LoRAs for a given task [99].

B.1.3 Parametric Knowledge Injection

Several recent works have explored methods for integrating external knowledge directly into model parame-
ters, known as parametric knowledge injection [43, 53, 74]. To the best of our knowledge, these studies are
the closest in scope to ours. Like ours, these works address the problem of parametric knowledge injection:
how to store large text corpora within parameters of a language model. Some use simple synthetic data
generation pipelines or context-distillation objectives. Unlike our work, these studies do not highlight the
memory reduction and throughput advantages of parametric knowledge injection techniques. We highlight
other differences below.
One parametric knowledge injection method, recently proposed by Kujanpää et al., is prompt distillation, in
which a teacher model with access to privileged knowledge generates question-answer pairs. These pairs are
then used to train a LoRA adapter for a student model (identical to the teacher model, but without access to
privileged information) using a distillation objective (i.e. mimicking the teacher’s full token distribution) [43].
This closely resembles our context-distillation objective, which we also found works better than next-token
prediction. However, unlike our work, Kujanpää et al. only train LoRA adapters of a single size (rank 1024)
and do not assess memory reductions with respect to full in-context learning. Indeed, they do not evaluate
against long-context ICL baselines at all, focusing instead on a comparison with RAG. Furthermore, they
evaluate on a relatively simple long-context setting – a concatenation of SQuAD passages [66] – which does
not exhibit long-range dependencies or require reasoning the way MTOB and L ONG H EALTH do.
Similarly, Mao et al. propose Long Input Fine-tuning (LIFT), which fine-tunes a language model using a
typical next-token prediction objective on overlapping segments of the corpus, as well as instruction tuning
on question-answer pairs generated from the corpus. Unlike our work, Mao et al. find that synthetic Q/A
pairs “offer minimal benefit and can even degrade performance due to overfitting" [53]. The difference in our
findings is perhaps due to the fact that they only generate ten synthetic examples, whereas we generate tens
of thousands. Furthermore, they use a weaker ICL baseline (Llama 3 8B) that only has 8k tokens of context.
Any contexts longer than 8k tokens are truncated before being fed to the ICL baseline.
Finally, Su et al. proposes Parametric Retrieval Augmented Generation (Parametric RAG), in which each
document has a corresponding LoRA adapter, trained on an augmented dataset consisting of the document,

rewritten versions of the document, and question-answer pairs generated from the document. At inference
time, a retriever is used to determine the relevant documents, and the corresponding LoRA adapters are merged
[74]. This method demonstrates significant gains over RAG on a variety of tasks, including WikiMultihopQA.

B.2 Prior work related to S ELF -S TUDY

B.2.1 Self Distillation and Context Distillation

Self-distillation is another method used to internalize the performance gains provided by information in
context (e.g. scratchpads, informative instructions) into the model parameters. In "Learning by Distilling
Context", the authors distill instructions and scratchpads into the model's parameters by first
conditioning the model on "[instructions] + [task-input]" to predict "[scratch-pad] + [final answer]", and then
fine-tuning the same model to predict its own "[final answer]" conditioned on the "[task-input]" alone, without
seeing the "[instructions]" or using the "[scratch-pad]" [72].

B.2.2 Synthetic Data Generation

Due to the ubiquitous need for high-quality data for fine-tuning (e.g. for use with the methods described
above), a large body of work has focused on generating high-quality synthetic data [1, 22, 58, 67]. For
example, Bonito is a model that is fine-tuned to generate synthetic data [58], and MetaSynth is a method
proposed by Riaz et al. that uses a language model to orchestrate several expert LLMs for domain-specific
synthetic data generation [67]. The training process for Phi-4, a 14 billion parameter language model,
also incorporates significant amounts of synthetically generated data [1]. Incorporating synthetic data, in
conjunction with new post-training techniques, allows Phi-4 to surpass its teacher model on STEM QA tasks,
as well as perform well for its size on reasoning benchmarks. These works demonstrate the potential for
synthetic data generation methods to augment the capabilities of language models.

B.3 Reducing the size of the KV cache

In this section, we discuss existing approaches for reducing the size of the KV cache.
First, in Appendix B.3.1, we discuss prompt compression methods, which reduce the size of the KV cache
by converting a long sequence of input embeddings into a shorter one. They can be split into hard-token
methods, which output discrete tokens from the vocabulary, and soft-token methods, which output new
token embeddings not from the vocabulary. Next, in Appendix B.3.2, we describe KV cache compression
methods, which directly modify the key and value matrices in the KV cache. Compared with prompt
compression methods, these are more expressive because they can produce a KV cache that no sequence of
input embeddings could have produced. Finally, in Appendix B.3.3, we describe works that propose
architectural changes to the multi-head attention operation, which reduce the memory footprint of the KV
cache. The methodology proposed in our work relies on cache-tuning, which could be viewed as a form of KV
cache compression.

B.3.1 Prompt compression

Hard-token prompt compression Some works aim to reduce the size of the KV cache by converting a longer
text into a shorter text [16, 37, 48, 63, 94]. These methods are typically referred to as hard-token prompt
compression methods because the resulting KV cache comes from discrete tokens in the vocabulary.
Compared with soft-token methods, these methods also work well with black-box API models.
These methods can be broadly classified into two categories: filtering and summarization based methods.
Filtering methods cut text from the original prompt using heuristics such as self-information. For example,
LLMLingua and Selective-Context use a smaller LLM to filter a long prompt (e.g. dropping redundant
tokens) before passing it to the main model [37, 48]. Summarization methods paraphrase a long prompt into
a smaller number of tokens [16].

Soft-token prompt compression with adapted LLMs In one line of work, researchers train a model
(typically an adapted LLM) to compress a long prompt into a smaller number of soft tokens [14, 24, 55, 65, 91].
For example, Autocompressors and In-context Autoencoders (ICAE) are LLMs that are fine-tuned to output
embeddings which can be used in soft-token prompts [14, 24]. Autocompressors are trained with full-
parameter fine-tuning and leverage a recursive strategy to generate the soft prompts, whereas ICAEs are
trained with LoRA and use a single forward pass to generate the soft prompts. A number of other works
also propose using an auxiliary model to produce soft-tokens from a long prompt [24, 65]. Gisting is another
method that differs from those above in that it uses the same LLM to compress the prompt into soft tokens
as it uses to generate the response [55].

Soft-token prompt compression via gradient-descent Soft tokens can also be produced by optimizing
input token embeddings with gradient descent. This idea, called prompt tuning, was first proposed for the
purpose of conditioning a frozen language model to perform specific tasks [45]. As such, it is an important
part of the parameter-efficient fine-tuning literature and is discussed in more detail in Appendix B.1.1. Since
then, Li et al. have extended prefix-tuning techniques to long-context settings, proposing a new method called
prefix propagation, which conditions prefixes on previous hidden states to achieve superior performance on
long-document tasks compared to prefix tuning [46].

B.3.2 KV cache compression

Hard-token KV cache compression Motivated by the observation that, in some settings, a small number
of keys dominate the attention scores of subsequent queries, several works have proposed KV cache eviction
policies wherein keys and values are dynamically dropped during generation [23, 60, 75, 98]. For example,
H20 drops keys and values from generated tokens based on a running sum of historical attention scores [98].
Similarly, SnapKV drops keys and values from prompt tokens based on a window of queries from the end of
the prompt [49].
A major limitation of eviction methods is that once a key is evicted, it cannot be recovered. Instead of evicting
keys permanently, another line of work focuses on selectively loading keys from the KV cache onto GPU streaming multiprocessors (SMs). While
these works do not reduce memory consumption of the KV cache, they can speed up inference by making
better use of GPU memory bandwidth [68, 75]. For example, the Quest method estimates critical tokens at
each decoding step and selectively loads them to SMs [75].
Compared with the hard-token prompt compression methods, KV-cache compression methods allow fine-
grained control at the level of an attention head. This means that a token can be dropped from one attention
head but not another.

Soft-token KV cache compression with merging In another line of work, instead of evicting tokens
from the KV cache, researchers propose merging similar tokens [51, 80, 81, 97]. For example, Cache
Merge (CaM) takes keys marked for eviction and merges them instead, using a weighting scheme based on
attention weights [97]. Wang et al. builds on this work by clustering key states into "merge sets" based on
cosine similarity, and merging states within a "merge set" with a Gaussian kernel weighting scheme, which
upweights states more similar to a pivotal state chosen as the token with the largest total attention score [81].
Wan et al. expands on both these works with Dynamic Discriminative Operations (D2O), which performs
optimizations at both the layer and token levels. D2O adjusts the KV cache budget for each layer based on
its attention density and uses an exponential moving average mechanism to dynamically determine when a
previously discarded token is similar enough to retained tokens to be merged back in [80]. All of these works
demonstrate promising results, offering similar or better performance on several tasks compared to a full
cache with a 50% or more reduction in cache size. However, there is still room for further improvement, as
these methods still fail to match full cache performance in several tasks, and even a 50% reduction in cache
size may still be prohibitively expensive for very large models or very long contexts. Additionally, these
works do not evaluate the effectiveness of these methods in long-context settings.

Soft-token KV cache compression with low-rank projection A number of works leverage the observation
that the KV cache exhibits low-rank structure to develop compression methods [11, 70, 92, 95, 103]. Similar
to compression methods based on merging, compression methods based on low-rank projection achieve
performance similar to or exceeding the full cache on several tasks at 50% compression, while experiencing
performance degradation upon further compression.

Soft-token KV cache compression with adapted LLMs Above we discussed how some works adapt an
LLM to output a shorter sequence of soft tokens given a long context. Similarly, one could adapt an LLM to
output a smaller KV cache given a long context. While less explored than the analogous prompt compression
approach, there is at least one published method that falls into this category. In KV-distill, the authors
add LoRA adapters to an LLM's query projections and train them to produce queries which aggregate
information from prior tokens [12]. The adapter is applied selectively to some tokens and only these tokens
are kept in the KV cache. The idea is that these selected tokens can act as sinks to collect information from
prior tokens. The adapter is trained with a distillation objective between a compressed and uncompressed
KV cache. However, unlike our work, KV-distill does not use any training at test time.

Soft-token KV cache compression with gradient-descent The idea of treating the key and value matrices
in a KV cache as weights and training them with gradient descent was first discussed in the prefix-tuning
paper [47]. In that work, the method was not applied to long contexts, but rather used as a parameter-efficient
fine-tuning method applied to training datasets with input-output pairs, so we discuss it in
more detail in Appendix B.1.1. Since then, we are not aware of works that have applied this technique to handle
long contexts.

B.3.3 Architectural changes

A number of works have proposed architectural changes to the original multi-head attention (MHA)
operation [79] that reduce the memory footprint of the KV cache. Because they fundamentally alter the
architecture, these methods are not immediately compatible with pre-trained models using the standard
MHA operation.
The earliest works in this direction developed fixed sparsity patterns in the attention map [9, 15, 93]. For
example, many works use a sliding window sparsity pattern wherein each token attends to a fixed window
of tokens around it. These approaches reduce the size of the KV cache because they require only keeping
around a fixed number of tokens in the KV cache. More recently, some large language models have adopted
sliding window sparsity in a subset of layers/heads [77].
While the methods above reduce the size of the cache by introducing sparsity at the token-level, another
class of methods changes the structure of the attention heads. Multi-query attention (MQA), the earliest
of such modifications, uses multiple query heads but only a single key and value head [71]. While MQA
dramatically reduces the size of the KV cache, it can lead to a significant drop in the expressive power of
the model. Grouped-query attention (GQA) is a middle ground between MQA and MHA that allows a
group of query heads to attend to a single key and value head [3]. Many frontier models use GQA, including
the Llama 3 architecture, which we use in our experiments [21, 36, 87]. More recently, a number of other
architectural modifications have been proposed, including Multi-head Latent Attention [50] and
Tensor Product Attention [96].
In another line of work, researchers observe that without the softmax operation in the attention mechanism
(i.e. linearizing the attention operator), the KV cache can be faithfully represented by the fixed-size matrix
K⊤V [6]. This allows us to represent the KV cache with a single matrix whose size is independent of the
context length.
Indeed, a large body of work has focused on developing architectures with fixed-size memory consumption
(i.e. models that do away with the KV cache). Notable examples include state-space models [27], RNNs [8],
and other linear attention variants [6, 88].

Prior work shows that there are tradeoffs between the memory consumption of an architecture and the
ability of a model to perform recall-intensive tasks, when controlling for compute (i.e. FLOPs) [6]. In this
context, our work shows that by increasing compute (i.e. FLOPs), we can reduce the memory consumption
of a model without sacrificing performance. In Appendix E, we provide a preliminary theoretical analysis
relating S ELF -S TUDY with recurrent architectures. However, future work should explore the relationship
between C ARTRIDGES and recurrent models in more depth.

B.3.4 Orchestration for long-context

In this section, we describe strategies for managing long contexts by orchestrating calls to LLMs. For instance,
the approach by [69] involves summarizing chunks of the context and then combining the summaries.
Similarly, PRISM [35] treats the context as a sequence of chunks, capturing key information in a structured
data format. MemGPT [62] introduces a virtual memory paging system, drawing inspiration from operating
systems. As context length reaches the limit of available memory, the system strategically determines which
information to retain.

C Extended method description

In this section, we detail the seed prompts and chunking strategy we used to train C ARTRIDGES with
S ELF -S TUDY.

C.1 S ELF -S TUDY seed prompts

As discussed in Algorithm 1, we seed the synthetic conversation generation with a prompt that elicits
conversations about different aspects of the document. For each conversation, we randomly sample one of
the following functions and create a seed prompt by calling it:

Structuring Seed Prompt Generator

import random

def structuring_seed_prompt(**kwargs):
    DATA_FORMATS = [
        "JSON",
        "YAML",
        "TOML",
        "INI",
        "XML",
        "plain text",
    ]

    data_format = random.choice(DATA_FORMATS)

    EXAMPLES = [
        (
            "Can you structure the information in {{subsection}} of {{document}} related to {{something specific}} "
            f"in the following format: {data_format}? "
            "Be sure to include precise information like any dates, times, names, and numerical values.'"
        ),
        ...  # additional example templates elided
    ]

    example = random.choice(EXAMPLES)

    return (
        f"Please generate a single chat message instructing an LLM to structure the information in {data_format}. "
        "Output only the chat message itself and absolutely nothing else. "
        "Make sure it is clear what section and document you are asking about. "
        f"The message can follow the following template, filling in details from the corpus: \n\n'{example}'"
    )

Summarization Seed Prompt Generator

def summarization_seed_prompt(**kwargs):
    prompts = [
        (
            "Please generate a single chat message instructing an LLM to summarize part of the corpus. "
            "Make sure the instruction is very explicit about the section of the corpus that you want to summarize. "
            "Include details (ids, names, titles, dates, etc.) that make it clear what you are asking about."
        ),
        (
            "Please generate a single chat message instructing an LLM to summarize a section. "
            "Make sure the instruction is explicit about the section that should be summarized and the document it is from."
        ),
    ]
    prompt = random.choice(prompts)
    return prompt

Question Seed Prompt Generator

def question_seed_prompt(**kwargs):
    prompts = [
        (
            "Generate a question for an LLM that will test its knowledge of the information in the corpus above. "
            "In your question be sure to include details (ids, names, titles, dates, etc.) that make it clear what you are asking about. "
            "Output only a single question. Do NOT include any other text or explanation other than the question."
        ),
        (
            "Generate a message for an LLM that will test its knowledge of the information in the corpus above. "
            "Be sure to include details (ids, names, titles, dates, etc.) in the question so that it can be answered without access to the corpus (i.e. closed-book setting). "
            "Output only a single question. Do NOT include any other text or explanation other than the question."
        ),
        (
            "You are helping to quiz a user about the information in the corpus. "
            "Please generate a question about the subsection of the corpus above. "
            "Be sure to include details (ids, names, titles, dates, etc.) in the question to make it clear what you are asking about. "
            "Answer only with the question, do not include any other text."
        ),
    ]
    prompt = random.choice(prompts)
    return prompt

Use Case Seed Prompt Generator

def use_case_seed_prompt(**kwargs):
    prompt = (
        "You are working to train a language model on the information in the following corpus. "
        "Your primary goal is to think about practical, real-world tasks or applications that someone could achieve using the knowledge contained within this corpus. "
        "Consider how a user might want to apply this information, not just recall it. "
        "After considering potential use cases, your task will be to generate a sample question that reflects one of these downstream applications. "
        "This question/instruction/task should be something a user, who has access to this corpus, might ask when trying to accomplish their specific goal. "
        "Output only a single question. Do NOT include any other text or explanation other than the question."
    )
    return prompt

Creative Seed Prompt Generator

def creative_seed_prompt(**kwargs):
    prompt = [
        (
            "You are having a creative conversation inspired by the information in the corpus. "
            "Please generate a question for your conversation partner to start off the discussion. "
            "Answer only with the question, do not include any other text."
        ),
    ]
    return random.choice(prompt)

C.2 S ELF -S TUDY chunking

For the S ELF -S TUDY data generation process, we extract uniformly random token-level chunks from the
input corpus C . A corresponding textual description is generally prepended to each chunk c̃ to contextualize
it when generating the seed prompt. This approach helps the model focus on different parts of the corpus
and generate diverse synthetic examples. The specific chunking parameters and descriptions are tailored to
each dataset:

• L ONG H EALTH: Chunks are sampled with a minimum size of 512 tokens and a maximum size of 4096
tokens. The accompanying description is: ‘Below is a section of a patient’s medical record. It is part of a larger
corpus of medical records for N_patients different patients.’
• AMD/FinanceBench: Fixed-size chunks of 8192 tokens are utilized. No specific descriptive text is
prepended to these chunks.
• MTOB: Chunks are sampled with a minimum size of 512 tokens and a maximum size of 4096 tokens. The
description used is: ‘The following is an excerpt from a grammar book about the Kalamang language.’
• QASPER: Following our general methodology, chunks are sampled with a minimum size of 512 tokens
and a maximum size of 4096 tokens. A generic description is used to contextualize the chunk as an excerpt
from a research paper, in line with the nature of the Qasper dataset.
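As a rough illustration, a chunk sampling step consistent with the description above could look like the following sketch; the function name, the Hugging Face-style tokenizer.decode call, and the way the description is prepended are assumptions for the sketch rather than our exact implementation.

import random

def sample_chunk(corpus_tokens, tokenizer, min_tokens=512, max_tokens=4096, description=""):
    # Sample a uniformly random token-level chunk from the tokenized corpus.
    chunk_len = random.randint(min_tokens, min(max_tokens, len(corpus_tokens)))
    start = random.randint(0, len(corpus_tokens) - chunk_len)
    chunk_text = tokenizer.decode(corpus_tokens[start:start + chunk_len])
    # Prepend a short dataset-specific description to contextualize the chunk.
    return f"{description}\n\n{chunk_text}" if description else chunk_text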

D Datasets

D.1 G EN C ONVO

To evaluate the ability of our approach to handle diverse queries over long documents, we generated
the G EN C ONVO dataset. We created G EN C ONVO using the AMD 2022 10-K filing, a document from the
FinanceBench corpus [33]. The primary purpose of G EN C ONVO is to simulate a wide range of tasks a user
might ask a model to perform given a long document, thereby testing the model’s comprehension, reasoning,
and ability to extract varied types of information. The generation process relies on Claude Sonnet 3.7 [4] and
is structured as follows:

1. Document Input: The entire source document (e.g., the AMD 2022 10-K, which is less than 200,000 tokens
and fits within the model’s context window) is provided to Claude Sonnet 3.7.
2. Question Generation: A series of distinct prompt templates (detailed below), designed to elicit different
reasoning traces (e.g., factual recall, synthesis, multi-hop reasoning), are used to generate questions. For
the given document and each prompt template, we ask the model to generate 16 unique questions. This
involves providing the model with the full document content alongside the specific question-generation
prompt.
3. Answer Generation: Subsequently, for each generated question, Claude Sonnet 3.7 is prompted again
with the original full document and the generated question to produce an answer. This process ensures
that the answers are grounded in the provided document.
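The sketch below illustrates this three-step pipeline; generate(prompt) is a hypothetical stand-in for a single call to Claude Sonnet 3.7 (not a real client API), and generating one question per call is a simplification of Step 2.

def build_genconvo(document, prompt_templates, generate, questions_per_template=16):
    # generate(prompt) is a placeholder for one LLM call (Claude Sonnet 3.7 in our setup).
    examples = []
    for template in prompt_templates:
        for _ in range(questions_per_template):
            # Step 2: generate a question grounded in the full document.
            question = generate(f"{document}\n\n{template}")
            # Step 3: answer the question with the full document in context.
            answer = generate(f"{document}\n\nQuestion: {question}\nAnswer:")
            examples.append({"question": question, "answer": answer, "template": template})
    return examples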

We hope G EN C ONVO provides a challenging benchmark that moves beyond simple fact retrieval, assessing
a model’s capacity for deeper understanding and more complex information processing over long contexts.
The following prompt templates were utilized for the question generation phase:

Factual Prompt Template


Please generate a question to test someone’s ability to remember factual details from the document. The
answer should be a few tokens long and be a factual detail from the statement, such as a number, entity,
date, title, or name.
This question should not be common knowledge: instead, it should be something that is only answerable
via information in the document.

Knowledge Prompt Template


Please generate a question that requires combining information mentioned both inside and outside the
document.
This question should require using a fact from the document and also a fact that you are confident about,
but is not mentioned in the document. For instance: - What are the founding dates of the companies
that got acquired this year? This is a good question because the names of the acquired companies are
mentioned in the document and the founding dates are not mentioned. - What is the name of the CEO’s
spouse? This is a good question because the name of the CEO is mentioned in the document and the
spouse’s name is not mentioned.
The answer should be a fact that is a few tokens long such as a number, entity, date, title, or name.

Disjoint Prompt Template


Please generate a multi-hop question that tests someone’s ability to use factual information mentioned
in at least two very different sub-sections of the document.
This question shouldn’t be a standard question about this kind of document. Instead, it should ask
about two particularly disconnected ideas, like comparing information about the amount of owned space
for the company headquarters with the amount of dollars of estimated liability or comparing the revenue
number with the number of employees.
This question should also test one’s ability to do retrieval: do not give away part of the answer in
the question. Ensure that for one to get the correct answer to the question, they need to understand
the document.
The answer should be a short: for example, a number, entity, date, title, or name.

Synthesize Prompt Template


Please generate a question that requires synthesizing and aggregating information in the document.
For instance, you could ask someone to summarize a page of the document, list all the key competitors
mentioned in the document, or summarize the company’s business model.

Structure Prompt Template


Please generate a question that requires understanding the structure of the document.
This question should be more about the structure of the document, rather than the precise statement
details. For instance, you could ask someone to list the titles of all the sections in the document,
describe the document structure, report the total number of pages, ask which section amongst two sections
comes first, or report the section with the largest number of tables.

Creative Prompt Template


Please generate a question about the document to test someone’s ability to comprehend the content of the
document. This question specifically should be focused on their ability to generalize the information
about the document to a strange question of sorts.
This question shouldn’t be a standard question about this kind of document, it should ask to do something
abnormal and creative, like writing a poem about a financial document.

32
Counting Prompt Template
Please generate a question that requires counting how frequently different events occur in the document.
This question should be about statistical properties of the document, rather than the statement details.
For instance, you could ask someone to count the number of times the word "million" is mentioned or
count the length of the shortest section title.
The answer should be a number.

Reasoning Prompt Template


Please generate a question that requires mathematical reasoning over the values in the document.
This question should require going beyond the facts directly mentioned in the statement, such as asking
to compute the percentage increase in revenue between two years, find the largest expense category, or
calculate difference in profit between two years.
The answer should be a number.

D.2 L ONG H EALTH

L ONG H EALTH is a benchmark for evaluating large language models ability to analyze and interpret long
clinical texts [2]. The benchmark consists of 20 fictional clinical case reports (each containing between 5,090
and 6,754 word) and 400 multiple-choice questions based on them.
In our experiments, the context C consists of the reports for a panel of n patients. We use n = 10 patients,
with a full panel of approximately 100k tokens, which fits in the context length of the L LAMA 3 models.
The questions are categorized into information extraction, negation, and sorting.
A sorting question is included below:

Please answer the question below about the following patient: ID patient_03, Name: Mr. John Williams,
Birthday: 1956-08-08 00:00:00, Diagnosis: Multiple Myeloma
<question>
Mr. Williams received multiple radiologic examinations. In which order did she receive them?
</question>
<options>
CT Whole Body > MR Spine Scan > CT Spine Scan > PSMA-PET-CT Scan > CT Chest > CT Whole Body > Whole
Body CT scan
Whole Body CT scan > CT Spine Scan > CT Whole Body > MR Spine Scan > CT Chest > PSMA-PET-CT Scan > CT
Whole Body.
CT Whole Body > CT Whole Body > CT Chest > CT Chest > PSMA-PET-CT Scan > MR Spine Scan > CT Spine Scan
> Whole Body CT scan > Chest X-ray
CT Chest > CT Spine Scan > CT Whole Body > Whole Body CT scan > PSMA-PET-CT Scan > MR Spine Scan > CT
Whole Body
Whole Body CT scan > CT Spine Scan > CT Whole Body > MR Spine Scan > CT Chest > CT Whole Body >
PSMA-PET-CT Scan
</options>
You should first think step by step. Then give your final answer exactly as it appears in the options.
Your output should be in the following format:
<thinking> {{YOUR_THOUGHT_PROCESS}} </thinking>

<answer>
{YOUR_ANSWER}
</answer>

An example of a negation question is included below:

Please answer the question below about the following patient: ID patient_01, Name: Anna
Sample, Birthday: 1970-01-01 00:00:00, Diagnosis: DLBCL

33
<question>
Which of these examinations were never performed in Mrs. Sample?
</question>
<options>
Bone marrow aspiration
CSF aspiration
MRI of the head
Pulmonary function testing Cardiac stress testing
</options>
You should first think step by step. Then give your final answer exactly as it appears in
the options. Your output should be in the following format:
<thinking> {{YOUR_THOUGHT_PROCESS}} </thinking>

<answer>
{YOUR_ANSWER}
</answer>

D.3 MTOB

The Machine Translation from One Book (MTOB) benchmark tests a large language model’s ability to learn
to translate between English and Kalamang, a low-resource language with virtually no web presence [76].
The core task is to perform translation (Kalamang to English, and English to Kalamang) by primarily relying
on a single comprehensive grammar book and a small set of accompanying linguistic resources. In our work,
we focus on translating from Kalamang to English.
The source documents provided by the MTOB benchmark are:

• A grammar of Kalamang: A comprehensive grammar textbook, with the original source provided in LATEX
format. This book details the phonology, morphology, and syntax of Kalamang.

• Bilingual Word List (W): A list of Kalamang words with their part-of-speech tags and English descriptions.

• Parallel Kalamang-English Corpus (S): A collection of 375 paired Kalamang-English sentences.

The MTOB authors preprocessed the grammar textbook from its original LATEX source into several plaintext
splits for their baseline experiments. These include:

• G m (Medium-length chunk): A plaintext segment of approximately 50k tokens consisting of an overview


chapter, a morpheme table from the grammar book, and the complete bilingual word list (W).

• G l (Long-length chunk): A larger plaintext segment of approximately 100k tokens, containing chapters
from the grammar book that the MTOB authors deemed most important for the translation task.

• Full Plaintext Textbook (G): The entire grammar book converted to plaintext.

The combination of the long-length chunk (G l ), the parallel sentences (S), and the word list (W) exceeds the
context window of Llama 3 models. We use the medium-length chunk G m and the parallel sentence list S as
input for our ICL baseline.

D.4 QASPER

QASPER is a benchmark for evaluating the ability of large language models to answer questions about
scientific papers [19]. To create a challenging multi-query long-context setting resembling the setup described
in Section 2.2, we concatenate 16 papers all related to QA NLP models to form out corpus C . In total, there are
78 questions about these 16 papers in the dataset, which we use as the queries Q.

34
Because the dataset only includes short answers and ground-truth spans containing evidence for each
answer, we rewrite the answers in a longer, more conversational format using GPT-4.1 and use these as the
targets when evaluating.

E Theoretical analysis: Relationship between attention, linear attention, and


C ARTRIDGES

When we generate text with an autoregressive Transformer, we have to maintain a KV-cache that grows
linearly with the length of the input and text. In Appendix B.3.3, we discussed a number of architectural
modifications that either reduce the size of the KV-cache or do away with it altogether. In particular, when
generating text with linear attention (e.g. [6]), we only need to maintain a constant-sized object – the KV-state
matrix – during generation.
Like the KV-state matrix in linear attention, C ARTRIDGES consume a constant amount of memory (i.e. their
size is a hyperparameter, which can be set independently of the input length). However, they differ from
the KV-state in how they are updated. In this work, C ARTRIDGES are updated using S ELF -S TUDY– gradient
descent on synthetically generated data. On the other hand, KV-states are updated using a linear attention
update rule.
In this section, we will study the update rules for attention, linear attention, and gradient descent when
applied to the multi-query associative recall (MQAR) problem [5], a popular synthetic benchmark task
used for studying the capabilities of long-context architectures. In particular, we consider a variant of
the standard MQAR problem where key-value pairs are repeated. First, we highlight some equivalences
between the update rules of these approaches in the case where input keys are orthonormal. Then, in the
more challenging case where input keys are in a Johnson-Lindenstrauss embedding, we provide a separation
result showing that the gradient descent update rule is able to exactly solve an MQAR problem that linear
attention cannot.
These theoretical results provide intuition for why constant-sized C ARTRIDGES are able to match the
performance of full KV-caches in long-context settings when linear-attention architectures have struggled to
do so.

E.1 Notation

All vectors are assumed to be row vectors.


Parenthesized superscripts (e.g. k(1) ) denote some temporal quality of an element. Subscripts denote
different elements in a set, as is standard.
A concise explanation for each variable:

• d : model (and token) dimension.


• m : number of unique key-value pairs.
• n : number of queries.
• N : number of key-value pairs in stream.

E.2 MQAR

We define the Multiple Query Associative Recall (MQAR) problem.


Definition 1. There is a universe of keys:
K ⊂ R1×d ,
and values:
V ⊂ R1 × d .

35
Definition 2. [5] In the MQAR problem, the input is:
(k(1) , v(1) ), . . . , (k( N ) , v( N ) ) where (k(t) , v(t) ) ∈ K × V for 1 ≤ t ≤ N,

followed by a set of queries


q1 , . . . qn where qi ∈ K for 1 ≤ i ≤ n.

Then for each i ∈ [n], output: (


vi∗ where i∗ = max{i ∈ [1, N ]|ki = q j }
0d if no such i exists.

E.3 m − repetitive MQAR

Definition 3. m − repetitive MQAR is a special case where each (K (t) , V (t) ) ∈ S, where:
S = {(k1 , v1 ), . . . , (km , vm )}.
Additionally, ki is unique.
(t)
Definition 4. To capture this, ri is defined as the number of occurrences of (ki , vi ) in the stream at timestep t.

E.3.1 Orthonormal Embedding

First, we will look at the MQAR problem in a restricted case, when all keys are orthonormal.
Definition 5. We call the set K to be orthonormal if for all k, k′ ∈ K:
(
′ 0 if k ̸= k′
⟨k, k ⟩ =
1 otherwise.

E.3.2 Johnson-Lindenstrauss Embedding

Next, we will look at the MQAR problem in a restricted case, when all keys are in a JL embedding.
Definition 6. Let ϵ > 0, we call the set K to be ϵ−JL if for all k, k′ ∈ K:
(
′ [−ϵ, ϵ] if k ̸= k′
⟨k, k ⟩ = .
1 otherwise.

E.4 Model Definitions

Below, we will describe three different model architectures. While they each exhibit different performance
and capabilities they can be describe with a common framework for the MQAR problem.

1. State: is how the model store Key-Value pairs.


2. Update rule: how the model incorporates new Key-Value pairs into its state.
3. Query rule: how the model uses its state to answer a look up a value or a query.

E.4.1 Transformer
1. The state is:
W ( t ) = ( K ( t ) , V ( t ) ),
where,
K ( t ) ∈ Rt × d , V ( t ) ∈ Rt × d .
Note that this consumes more memory as the context gets longer.
2. The update rule is:
K ( t +1) = K ( t ) ⊕ k ( t +1) , V ( t +1) = V ( t ) ⊕ v ( t +1)

36
3. On query q ∈ K, return:
 ⊤
q K (t) V (t) .

These rules define the transformer setting for MQAR.

E.4.2 Linear Attention


1. The state:
W ( t ) ∈ Rd × d .
2. The update rule is defined as:
W ( t +1) = W ( t ) + ( k ( t +1) ) ⊤ ( v ( t +1) ).
With the initial matrix being initialized to zeros. I.e. W (0) = 0d×d .
3. On query q, return:
qW (t) .
Lemma 1. [89] Linear attention rule emerges if we were to update using the loss function −k(t) W (t) vt .

It is important to mention here that we are not using any kernels for linear attention. These rules define the
linear attention setting for MQAR.
 ⊤  ⊤
Lemma 2. [89] W (t+1) = W (t) − k(t) k(t) W (t) + k(t) v(t) is the update rule that emerges when we use
the gradient descent loss function: 12 ||k(t) W (t) − v(t) ||22 .
Definition 7.
1 (t) (t)
L= ||k W − v(t) ||22
2

Proof. In general, gradient descent has the update rule:


W ( t + 1 ) = W ( t ) − η ∇W ( t ) . (4)
Taking the gradient of the loss function gives us:
1  ⊤
∇W ||k(t) W (t) − v(t) ||22 = k(t) (k(t) W (t) − v(t) )
2
 ⊤  ⊤
= k(t) k(t) W (t) − k(t) v(t) .

Using the above and choosing η = 1, we get for Equation (4)


 ⊤  ⊤ 
W ( t +1) = W ( t ) − 1 k(t) k(t) W (t) − k(t) v(t)
 ⊤  ⊤
= W (t) − k(t) k(t) W (t) + k(t) v(t) .

E.4.3 Gradient Descent

Gradient descent training on the cache. We look at the capability of this trained state on a certain input.

1. The state at time t is defined as:


W ( t ) ∈ Rd × d .
2. The update rule which follows from Lemma 2:
 ⊤  ⊤
W ( t +1) = W ( t ) − k ( t ) k(t) W (t) + k(t) v(t) .

With the initial matrix being initialized to zeros. I.e. W (0) = 0d×d .
3. On query q, return:
qW (t) .

37
E.4.4 Orthonormal Case

We now see how the three models perform on the m − repetitive MQAR when K is orthonormal.
Transformer
Lemma 3. On every input to MQAR (even those for 1-rep-MQAR) the state of Transformer needs Ω( Nd) parameters.

Intuitively, at each timestep, you will append d parameters to the state. At timestep t the model will have td
parameters.
Linear attention
Theorem 1. Linear attention can solve repetitive MQAR for any m ≥ 1 and orthonormal K, up to scaling (producing
(t)
ri vi when W (t) is queried with ki ) and all keys being distinct with O(d2 ) parameters.

Proof. We first prove that for any t ≥ 0:

m

(t) ⊤
W (t) = ri ′ ki′ vi′ . (5)

i =1

Base Case: Initially, W (0) = 0d×d . From this, we indeed have:


m

(0) ⊤
W (0) = ri ′ ki′ vi′ ,

i =1

since for all i′ ∈ [m]:


(0)
ri′ = 0.

Inductive hypothesis: Assume that the state matrix at some arbitrary integer timestep t is as claimed. I.e.:
m

(t) ⊤
W (t) = ri ′ ki′ vi′ .

i =1

Inductive step: If (k( j) , v( j) ) appears at timestep t + 1 the update rule will be:

W ( t +1) = W ( t ) + ( k ( t +1) ) ⊤ v ( t )
= W ( t ) + ( k j ) ⊤ vj

By the inductive hypothesis, we have that:

W ( t +1) = W ( t ) + k j ( v j ) ⊤
m

(t) ⊤
= ri ′ ki′ vi′ + k j ( vj ) ⊤

i =1
m

( t +1) ⊤
= ri ′ ki′ vi′ .

i =1

( t +1) (t) ( t +1) (t)


The final step follows from the fact that r j = r j + 1 when (k(t+1) , v(t+1) ) = (k j , vj ) and ri = ri for
all i ̸= j.
The proof of Equation (5) is complete by induction.

38
Finally, it is the case that on query ki :
m

(t) ⊤
ki W (t) = ki ri ′ ki′ vi′

i =1
m

(t)
= ri′ ki ki⊤′ vi′

i =1


(t) (t)
= ri′ ki ki⊤′ vi′ + ri ki ki⊤ vi
i ′ ̸ =i


(t) (t)
= ri ′ · 0 · vi′ + ri · 1 · vi

i ̸ =i
(t)
= ri · vi ,

as desired. In the above, the second last inequality follows from from Definition 5 and the fact that all ki are
distinct.
O(d2 ) parameters are needed as the matrix must have dimension d × d

Gradient Descent
Theorem 2. Gradient descent is able to exactly solve the m − repetitive MQAR (produce vi when W (t) is queries
with ki ) with O(d2 ) parameters.

Proof. Here we can handle repetitions because our update rule includes a "peel" term. This means it removes
the current value stored under a key before updating it with a new value.
We will show by induction that for all t ≥ 0:
m
W (t) = ∑

1 (t) · ki⊤′ vi′ .
r ′ >0
i =1 i

Base Case: Initially, the cache matrix is set to all zeros. From this, naturally follows that:
m
W (0) = ∑

0 · ki⊤′ vi′ ,
i =1

since for all i′

(0)
ri′ = 0.

Inductive hypothesis: Assume that at some arbitrary timestep t, we have:


m
W (t) = ∑′ 1r(′t) · ki⊤′ vi′
i i >0

Inductive step: If (kℓ , vℓ ) appears at timestep t + 1 the update will be:


! !
m m m
∑ 1ri(>t+01) ki⊤ vi = ∑

1 (t)
r′
ki⊤′ vi′ − ∑

1 (t)
r′
k⊤ ⊤
ℓ kℓ ki′ vi′ + k⊤
ℓ vℓ
i =1 i =1 i >0 i =1 i >0

the second term reduces to just peeling the term relating to kℓ , if it exists, as all other inner products are 0,
! 
m 
= ∑ 1 (t) ki′ vi′ − 1 (t) · k⊤
⊤ ⊤
ℓ ℓ + kℓ vℓ
v
r′ rℓ>0
i ′ =1 i >0
!
m
= ∑

1 (t)
r′
ki⊤′ vi′ + k⊤
ℓ vℓ
i ̸=ℓ i >0

39
This replaces the value associated with kℓ with the new value, while keeping everything else the same. This
is the form that we want, as the only time we want to add a key if it is an new key.
Finally, it is the case that on query ki :
!
m
ki · W (t)
= ki · ∑ 1r(′t) ki⊤′ vi′
i ′ =1 i >0
!
m
= ∑ 1r(′t) ki · ki⊤′ vi′
i ′ =1 i >0

=1 (t) · 1 · vi
r i >0

=1 (t) · vi
r i >0

Again here a matrix of dimension d × d can store d orthogonal vectors. Thus this requires, O(d2 ) parameters.

E.4.5 JL Embedding

We now see how the 3 models perform on the m − repetitive MQAR when K is ϵ−JL.
Transformer
Lemma 4. On every input to MQAR (even those for 1-rep-MQAR) the state of Transformer needs Ω( Nd) parameters.

We note that when K is ϵ−JL it is no longer possible to get the exact answer from query rule ki W (t) . Thus,
we need to add a decoding step.
Definition 8. The output decoding step is vi∗ where:
i∗ = arg max ⟨vi′ , ki W (t) ⟩.
i′ ∈[m]
Definition 9. For all i, j ∈ [m], define:
ϵi,j = ⟨ki , k j ⟩.

Linear Attention
 Definition 8) is unable to solve even the 2 − repetitive MQAR and
Theorem 3. Linear attention (+ decodingas in
1
each vi being 1-hot encoding unless K is ω N −JL.

Proof. Due to the agreeance between different keys, when querying for key i, there is noise from other keys
returned along with the correct answer. While we can tolerate some error, this error scales with the number
of times the model has seen a single key. Making it unfit for longer contexts, or contexts with many repeats.
First, note that the base case Equation (5) from Theorem 1 still holds. In general, this holds for all K.
Specifically, on query k1 we have:

(t) (t) (t) (t)


k1 W (t) = r1 ⟨k1 , k1 ⟩v1 + r2 ⟨k1 , k2 ⟩v2 = r1 v1 + r2 ϵ1,2 v2 .

Now, consider an input to 2 − repetitive MQAR such that


(t) (t)
r1 < r2 ϵ1,2 .
Note that in this case:
(t) (t)
r1 = ⟨v1 , k1 W (t) ⟩ < ⟨v2 , k1 W (t) ⟩ = r2 ϵ1,2
and hence we output v2 instead of v1 .
If the embedding was ω ( N1 the number of repeats could not overcome the ϵ value.

40
Gradient Descent

Theorem 4. Gradient descent (+ decoding as in Definition 8) is able to exactly solve m − repetitive MQAR with
O(d2 ) parameters for ϵ−JL K, as long as ϵ ≤ m2 (m1 −1) and α < m −1
m +1 .

Proof. We define:

(t)
Ci,j

to be the coefficient associated with ki⊤ vj in W (t) . Specifically, let

m m
∑ ∑ Ci,j ki⊤ vj
(t)
W (t) = (6)
i =1 j =1

We will prove by induction that:

(t) (t)
Ci,j = 1(ki ,vj ) has occurred + ∆i,j (7)

where,

t
∑ ((m − 1)ϵ)a .
(t)
∆i,j ≤ (8)
a =1

(t)
Base Case: Initially, the state is set to all zeros. From this, naturally follows that all of the Ci,j are zero. I.e.
Equation (7):

∆i,j = 0.

Inductive hypothesis: Assume that all for some timestep t and 1 ≤ i, j ≤ m:

(t) (t)
Ci,j = 1(ki ,vj ) has occurred + ∆i,j ,

(t)
where ∆i,j satisfies Equation (8).

41
Inductive Step: If at timestep t + 1 we are given (kℓ , vℓ ), from Equation (6) the update looks like:

m m
∑ ∑ Ci,j
( t +1) ⊤
W ( t +1) = ki vj
i =1 j =1
 
m m m m
∑ ∑ −∑ ∑
(t) (t)
= Ci′ ,j′ ki⊤′ vj′ Ci′ ,j′ k⊤ ⊤ 
ℓ k ℓ ki ′ vj′ + k⊤
ℓ vℓ
i ′ =1 j ′ =1 i ′ =1 j ′ =1
 
m m m m
∑ ∑ Ci′ ,j′ ki⊤′ vj′ −  ∑ ∑ ϵℓ,i′ Ci′ ,j′ k⊤
(t)  + k⊤ vℓ (t)
= ℓ vj′ ℓ
′ ′
i =1 j =1 ′ ′ i =1 j =1

change the associativity of the summations,


 ! 
m m m m
= ∑ ∑ C ′ ′ k ′ vj′ −  ∑ ∑ ϵℓ,i′ C ′ ′ k⊤ vj′  + k⊤ vℓ
(t) ⊤ ( t )
i ,j i i ,j ℓ ℓ
i ′ =1 j ′ =1 j ′ =1 i ′ =1

here we separate the first term where i′ = ℓ and i′ ̸= ℓ,


 ! 
m m m m m
∑ ∑ Ci′ ,j′ ki⊤′ vj′ + ∑ Cℓ,j′ k⊤ ∑ ∑
(t) (t) (t)
= ℓ vj′ − ϵℓ,i′ Ci′ ,j′ k⊤
ℓ vj′
 + k⊤ vℓ

′ ′
i ̸ = ℓ j =1 ′ j =1 ′ j =1 ′
i =1

here we separate the first term where i = ℓ and i′ ̸= ℓ, ′


   ! 
m m m m m
∑ ∑ Ci′ ,j′ ki⊤′ vj′ + ∑ Cℓ,j′ k⊤  ∑ ϵℓ,ℓ C ′ k⊤ vj′  −  ∑ ∑ ϵℓ,i′ C ′ ′
(t) (t) (t) (t)
= ℓ vj′ − ℓ,j ℓ i ,j
k⊤
ℓ vj′
 + k⊤ vℓ

′ ′
i ̸ = ℓ j =1 ′ j =1 ′ ′ ′ j =1 j =1 i ̸=ℓ

remove ϵ j,j ,
 ! 
m m m m m
∑ ∑ Ci′ ,j′ ki⊤′ vj′ + ∑ Cℓ,j′ k⊤
ℓ vj′ − ∑ Cℓ,j′ kℓ vj′ −
 ∑ ∑ ϵℓ,i′ C ′ ′
(t) (t) ⊤ (t) (t)
= i ,j
k⊤
ℓ vj′
 + k⊤ vℓ

′ ′
i ̸ = ℓ j =1 ′ j =1 ′ ′ ′ j =1 j =1 i ̸=ℓ

cancel terms,
 ! 
m m m
∑ ∑ Ci′ ,j′ ki⊤′ vj′ −  ∑ ∑ ϵℓ,i′ Ci′ ,j′
(t) (t)
= k⊤
ℓ vj′
 + k⊤ vℓ .

′ ′
i ̸ = ℓ j =1 ′ ′ j =1 i ̸=ℓ

Note with this we can see that:



(t)
Ci,j
 if ℓ ̸= i
( t +1)
Ci,j = if ℓ = i .

(t)
−
 ϵℓ,i′ Ci′ ,j + 1 j=ℓ

i ̸=ℓ

Thus, if i ̸= ℓ, we have:

( t +1) (t)
Ci,j = Ci,j ,

( t +1)
for i ̸= ℓ. The inductive statement holds for these pairs. Now let’s consider Cℓ,j . If ℓ = j then:


( t +1) ( t +1) (t)
Cℓ,ℓ = 1 + ∆ ℓ,ℓ = ϵℓ,i′ Ci′ ,j + 1

i ̸=ℓ

42
and note that by the triangle inequality and Definition 6:


( t +1) (t)
∆ ℓ,ℓ ≤ϵ Ci′ ,ℓ

i ̸=ℓ

by the inductive hypothesis,


t
≤ϵ ∑

(1 + ∑ ((m − 1)ϵ) a )
i ̸=ℓ a =1
t
= ((m − 1)ϵ)(1 + ∑ ((m − 1)ϵ)a )
a =1
t +1
= ( ∑ ((m − 1)ϵ) a ),
a =1
as desired.
Then for j ̸= ℓ, we have:
( t +1) ( t +1)
∆ j,ℓ = Ci,j


(t)
= ϵℓ,i′ Ci′ ,j

i ̸=ℓ
(t)
The bounding of ∆ℓ,j is similar to the ℓ = j case.

With this we have completed the inductive proof on error terms.


If the we set:

1
ϵ< ,
m2 ( m − 1)
we get the following bound:

t
∑ ((m − 1)ϵ)a
(t)
∆i,j ≤ (9)
a =1
( m − 1) ϵ
≤ (10)
1 − ( m − 1) ϵ
1
< 2 (11)
m −1
Before the next steps, we must bound:
⟨vi , vj ⟩ ≤ α (12)

For a query with ki , assuming we have seen ki before, we get:


(t)
ki · W (t) = vi + ∆i,j′ vj′

j ̸ =i

Now for the decoding step where for an arbitrary vj we get:


⟨vj , ki · W (t) ⟩ = ⟨vj , vi ⟩ + ⟨vj , ∑ ∆i,j′ vj′ ⟩
j′ ̸ =i

For the case where i = j it is the case that:


⟨vi , ki · W (t) ⟩ = 1 + ⟨vi , ∑ ∆i,j′ vj′ ⟩
j′ ̸ =i
1
≥ 1− α.
m+1

43
This follows from Equation (11) and Equation (12).
For the case where i ̸= j it is the case that:

⟨vj , ki · W (t) ⟩ = ⟨vi , vj ⟩ + ⟨vj , ∑ ∆i,j′ vj′ ⟩


j′ ̸ =i
1
≤ α+ α
m+1
This follows from Equation (11) and Equation (12).
m −1
As a result, we will always pick the correct value when α < m +1 .

44

You might also like