Hyena Hierarchy: Towards Larger Convolutional Language Models
Abstract
Recent advances in deep learning have relied heavily on the use of large Transformers due to their
ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits
quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic
methods based on low-rank and sparse approximations need to be combined with dense attention layers
to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic
drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions
and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of
thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-
spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-
the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103
and The Pile), reaching Transformer quality with a 20% reduction in training compute required at
sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length
8K, and 100× faster at sequence length 64K.
1 Introduction
Large Transformers have enabled a number of breakthrough advances in modeling language, vision, audio,
biology and numerous other domains (Vaswani et al., 2017), (Dosovitskiy et al., 2020), (Radford et al., 2022),
(Cramer, 2021). Much of the success of Transformers, powered by the attention operator (Vaswani et al.,
2017), relies on their scaling properties (Hoffmann et al., 2022) and the emergence of in-context learning
(Garg et al., 2022), which allows them to generalize to unseen data and tasks given context as input. The
Transformer block is a powerful tool for sequence modeling, but it is not without its limitations. One of the
most notable is the computational cost, which grows rapidly as the length of the input sequence increases.
Specifically, the cost scales quadratically with the length L of the sequence, which places a strict limit on the
amount of context that can be considered by the model. Breaking the quadratic barrier is a key step towards
new possibilities for deep learning, such as using entire textbooks as context, generating long-form music or
processing gigapixel scale images.
Efforts to reduce the computational cost of attention in models primarily involve the use of linearized,
low-rank, and sparse approximations (Child et al., 2019; Wang et al., 2020; Kitaev et al., 2020; Zhai et al.,
2021; Roy et al., 2021; Schlag et al., 2021; Tu et al., 2022). These approaches introduce a trade-off between
expressivity and speed, requiring hybridization with standard attention layers to reach Transformer quality
(Mehta et al., 2022; Dao et al., 2022c).
A growing amount of evidence suggests that attention mechanisms only utilize a small portion of their
quadratic capabilities for language processing (Olsson et al., 2022; Dao et al., 2022c), leading us to question
their role as the gold-standard operator for deep learning at scale. Specifically, we ask:
Are there subquadratic operators that can match the quality of attention at scale?
Figure 1.1: The Hyena operator is defined as a recurrence of two efficient subquadratic primitives: an implicit
long convolution h (i.e. Hyena filters parameterized by a feed-forward network) and multiplicative element-
wise gating of the (projected) input. The depth of the recurrence specifies the size of the operator. Hyena
can equivalently be expressed as a multiplication with data-controlled (conditioned by the input u) diagonal
matrices Dx and Toeplitz matrices Sh . In addition, Hyena exhibits sublinear parameter scaling (in sequence
length) and unrestricted context, similar to attention, while having lower time complexity.
We obtain a positive answer based on a composition of efficient subquadratic primitives, such as element-
wise multiplication (gating) and long convolutions i.e., convolutions with filter sizes as long as the input. We
rely on a set of targeted reasoning tasks, grounded in recent work on mechanistic interpretability (Elhage
et al., 2021; Power et al., 2022; Olsson et al., 2022; Zhang et al., 2022) such as recall and induction, to distill
three properties of attention correlated with its performance and the quality gap with existing subquadratic
approaches:
a. Data control: Attention implements an expressive data-controlled (Massaroli et al., 2020) linear operator¹,
encoding an entire family of linear functions in a single block.
b. Sublinear parameter scaling: Parameter counts of attention layers are decoupled from sequence length,
allowing Transformers to allocate more parameters elsewhere e.g., the feed-forward neural networks (FFNs)
between attention layers.
c. Unrestricted context: For a given input, attention has an unrestricted context i.e., it can approximate
dependencies between any two inputs, without arbitrary restrictions such as locality (except in cases using
masking such as autoregressive models).
The Hyena hierarchy Guided by these findings, we introduce the Hyena hierarchy, an operator defined by a
recurrence of two efficient subquadratic primitives: a long convolution and element-wise multiplicative
gating (see Figure 1.1). A specified depth (i.e., number of steps) of the recurrence controls the size of the
operator. For short recurrences, existing models are recovered as special cases (Mehta et al., 2022; Dao
et al., 2022c). By mapping each step in the Hyena recurrence to its corresponding matrix form, we reveal
Hyena operators to be equivalently defined as a decomposition of a data-controlled matrix i.e., a matrix
whose entries are functions of the input. Furthermore, we show how Hyena operators can be evaluated
efficiently without materializing the full matrix, by leveraging fast convolution algorithms (Selesnick and
Burrus, 2017). Empirically, Hyena operators are able to significantly shrink the quality gap with attention at
scale, reaching similar perplexity and downstream performance with a smaller computational budget (Section
4.2) and without hybridization of attention.
Narrowing the capabilities gap The design of Hyena is motivated by a quality gap between standard
dense attention and alternative subquadratic operators, which we identify by focusing on reasoning tasks cor-
related with language modeling performance at scale. We extend the suite of basic mechanistic interpretability
benchmarks (induction and recall) with additional tasks that probe how quickly model performance degrades as task difficulty increases, e.g., with longer sequences and larger vocabularies.
¹ Self-attention can be expressed as $y = A(k, q)\,v$, where $A$ is the attention matrix conditioned by linear projections $k, q$ of the input $u$.
Scaling in language and vision Next, we aim to verify whether rankings in our reasoning benchmark
suite are predictive of quality at scale. We test Hyena on autoregressive language modeling at the sub-billion
parameter scale, setting a new state-of-the-art for dense-attention-free architectures in standard datasets
(WikiText103 and The Pile) and matching Transformer quality. On The Pile at the 355M parameter scale, we match Transformer perplexity with a 20% reduction in the total count of floating point operations
(FLOPs). As an extension, we investigate the generality of Hyena operators by testing on large-scale im-
age recognition, replacing attention in the Vision Transformer (ViT) (Dosovitskiy et al., 2020). In image
classification, Hyena is able to match attention in accuracy when training on ImageNet-1k from scratch.
Toward much longer context Finally, we benchmark the efficiency of Hyena on long sequences. We
measure 5x speedups over dense self-attention at length 8192 – 2x over highly optimized FlashAttention (Dao et al., 2022b) – and 100x speedups over FlashAttention at sequence lengths of 64k, where the standard attention implementation in PyTorch runs out of memory.
Long convolutions and memory: A crude proxy for the memory of a single computational unit is how far in the past it can access information to produce the output at a certain step. This can be roughly quantified by the number of non-zero entries $\partial y_t / \partial u_{t-n}$ for $n = 0, \ldots, t$. The memory of a CNN filter is equivalent to the filter size $M$, since $\partial y_t / \partial u_{t-n} = h_n$. The total mnemonic capacity of an all-convolution CNN therefore scales with the number of model parameters. Implicit parametrizations, on the other hand, allow us to disentangle the memory of each filter from the parameter count: the effective length of the filter is implicitly controlled by the learned parameters. In an SSM, $\partial y_t / \partial u_{t-n} = \mathsf{C}\mathsf{A}^n\mathsf{B}$; the memory extent is determined solely by the spectral radius of $\mathsf{A}$ and can be finely tuned by the training process (see e.g. Gu et al., 2020, 2021). On the other hand, the number of parameters controls the expressivity of the memory unit, e.g., the number of basis functions forming $h_t$.
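As an illustration of this point (our own sketch, not taken from any reference implementation), the snippet below builds the filter $h_n = \mathsf{C}\mathsf{A}^n\mathsf{B}$ of a small random SSM and shows how scaling the spectral radius of $\mathsf{A}$ stretches or shrinks the memory extent while the parameter count stays fixed; the state dimension and radii are arbitrary choices.

```python
# Minimal sketch: the SSM filter h_n = C A^n B decays at a rate set by the spectral
# radius of A, independently of the number of parameters (d stays fixed below).
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # state dimension (illustrative)
B = rng.standard_normal((d, 1))
C = rng.standard_normal((1, d))

def ssm_filter(A, length):
    """Return h_n = C A^n B for n = 0, ..., length-1."""
    h, x = [], np.eye(d)
    for _ in range(length):
        h.append((C @ x @ B).item())
        x = A @ x
    return np.array(h)

for radius in (0.5, 0.9, 0.99):          # spectral radius controls the memory extent
    Q = np.linalg.qr(rng.standard_normal((d, d)))[0]   # orthogonal matrix
    h = ssm_filter(radius * Q, 64)                     # spectral radius = radius
    print(f"radius={radius}: |h_32| ~ {abs(h[32]):.2e}")
```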
Fast Methods for Convolutions One of the first applications of the Cooley-Tukey fast Fourier transform (FFT) algorithm was to implement convolution faster than the direct evaluation of (1), which at first glance has an asymptotic time complexity of $O(L^2)$. A common approach to achieve fast long convolutions in subquadratic time is the FFT algorithm. The method first converts the aperiodic convolution into a circular convolution (Selesnick and Burrus, 2017) by appropriate zero-padding of the input and filter sequences. The resulting kernel $\hat{S}_h$ is a circulant matrix and is diagonalized by the discrete Fourier basis
$$\hat{S}_h = W^{-1} D_H W$$
where $W$ is the DFT matrix, $W_{tt'} = z^{t'}$ with $z = e^{-i 2\pi t / L}$, and $H$ is the DFT of the padded filter $h$, $H = W\,\mathrm{pad}(h)$. Thus, the calculation of such convolutions is performed as
$$\mathrm{pad}(y) = \hat{S}_h\, \mathrm{pad}(u) = W^{-1} D_H W\, \mathrm{pad}(u) = \mathrm{iFFT}(D_H\, \mathrm{FFT}(\mathrm{pad}(u)))$$
where $D_H$ is the diagonal matrix with $H$ on its diagonal. The above is known as the convolution theorem of the DFT (Oppenheim et al., 1997). In this FFTConv form, the convolution can be performed without materializing the operator $S_h$, with the same asymptotic cost $O(L \log_2 L)$ as the FFT.
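A minimal numerical sketch of the FFTConv identity above (our own illustration, with arbitrary sizes): zero-padding to length $2L$ turns the circular convolution computed by the FFT into the aperiodic convolution, without materializing $S_h$.

```python
# FFTConv sketch: pad filter and input, multiply in the frequency domain, and recover
# the aperiodic convolution y_t = sum_n h_n u_{t-n}; compared against direct O(L^2) evaluation.
import numpy as np

def fftconv(u, h):
    L = len(u)
    n = 2 * L                               # pad so circular conv == aperiodic conv on first L points
    H = np.fft.rfft(h, n=n)                 # DFT of the padded filter (the diagonal D_H)
    U = np.fft.rfft(u, n=n)
    return np.fft.irfft(H * U, n=n)[:L]     # keep the first L outputs

rng = np.random.default_rng(0)
u, h = rng.standard_normal(1024), rng.standard_normal(1024)
direct = np.array([np.dot(h[:t + 1], u[t::-1]) for t in range(1024)])   # direct evaluation
assert np.allclose(fftconv(u, h), direct, atol=1e-8)
```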
Figure 2.1: Comparison between data-controlled matrices: SelfAttention and Hyena.
$$y = \mathrm{SelfAttention}(u) = \mathrm{SoftMax}(u M_q M_k^\top u^\top)\, u M_v = A(u)\, u M_v \tag{3}$$
where $M_q, M_k, M_v \in \mathbb{R}^{D \times D}$ are learnable linear projections and SoftMax is intended to be applied row-wise.
Attention parametrizes a family of dense linear operators and for an input u, indexes through it via
projections of u i.e., A(u). We refer to operators of this type as data-controlled, as they encode a linear
transformation u 7→ y, that is, however, nonlinearly defined by u. This approach yields expressive nonlinear
operators in u, and we hypothesize contributes, together with other mechanisms (Olsson et al., 2022), to the
ability of certain operators to learn in-context i.e., to adapt to unseen tasks by leveraging context. In deep
learning, the projections take on specific names: query q = uMq , key k = uMk and value v = uMv . We often
rewrite the attention operator as y = A(q, k)v.
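The following toy sketch (ours, with random projections and without the usual $1/\sqrt{D}$ scaling, following (3) as written) makes the data control explicit: the $L \times L$ matrix $A(u)$ is built from the input itself and then applied as an ordinary linear map to a projection of that same input.

```python
# Attention as a data-controlled linear operator: A(u) is a function of u,
# and y is obtained by applying that matrix linearly to a projection of u.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

L, D = 8, 16
rng = np.random.default_rng(0)
u = rng.standard_normal((L, D))
Mq, Mk, Mv = (rng.standard_normal((D, D)) for _ in range(3))

A = softmax(u @ Mq @ Mk.T @ u.T)    # L x L data-controlled matrix, row-wise softmax
y = A @ (u @ Mv)                    # y = A(u) u M_v
print(y.shape)                      # (8, 16)
```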
Remark 2.1. Similarly to implicit convolutions, SelfAttention does not entangle its ability to access distant information with the number of parameters: it looks at the whole sequence at the price of $O(L^2)$ operations.
Definition 3.1 (Order-$N$ Hyena Operator). Let $(v, x^1, \ldots, x^N)$ be projections of the input and let $h^1, \ldots, h^N$ be a set of learnable filters. The $\mathrm{Hyena}_N$ operator is defined by the recurrence:
$$z^1_t = v_t, \qquad z^{n+1}_t = x^n_t\, (h^n * z^n)_t, \quad n = 1, \ldots, N, \qquad y_t = z^{N+1}_t \tag{4}$$
Remark 3.1. The time complexity of a Hyena recurrence is $O(N L \log_2 L)$. The input-output map can be rewritten as
$$y = x^N \cdot (h^N * (x^{N-1} \cdot (h^{N-1} * (\cdots))))$$
where each convolution is performed through the Fourier domain in $O(L \log_2 L)$.
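A compact sketch of the recurrence in (4) on single-channel signals, assuming FFT-based causal convolutions and random stand-ins for the projections and filters (the full operator acts on $D$ channels produced by Algorithm 1):

```python
# Order-N Hyena recurrence: z^{n+1} = x^n * (h^n conv z^n), evaluated with FFT convolutions.
import numpy as np

def fftconv(u, h):
    n = 2 * len(u)
    return np.fft.irfft(np.fft.rfft(h, n=n) * np.fft.rfft(u, n=n), n=n)[:len(u)]

def hyena(v, xs, hs):
    """y = x^N . (h^N conv (... x^1 . (h^1 conv v)))."""
    z = v
    for x, h in zip(xs, hs):
        z = x * fftconv(z, h)        # element-wise gate after each long convolution
    return z

L, N = 256, 2
rng = np.random.default_rng(0)
v = rng.standard_normal(L)
xs = [rng.standard_normal(L) for _ in range(N)]
hs = [rng.standard_normal(L) for _ in range(N)]
print(hyena(v, xs, hs).shape)        # (256,)
```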
Interestingly, the element-wise product in the time domain corresponds to convolution in the frequency domain, i.e.
$$\widehat{x \cdot u} = \hat{x} * \hat{u},$$
where $\hat{x}, \hat{u}$ denote the DFTs of $x$ and $u$, respectively. Thus, Hyena alternates between applying convolutions in the time and frequency domains (equivalently, between applying element-wise products in the time and frequency domains). One potential explanation for the effectiveness of this procedure is that the convolution
in the time domain (element-wise multiplication in the frequency domain) increases the memory length,
allowing for a broader context to be taken into account. On the other hand, the element-wise multiplication
in the time domain (convolution in the frequency domain) allows for more fine-grained selection of specific
frequency components of the signal.
Remark 3.2 (Hyena generalizes H3 and GSS). The H3 mechanism (Dao et al., 2022c) corresponds to $\mathrm{Hyena}_2$ and GSS (Mehta et al., 2022) to $\mathrm{Hyena}_1$, with a particular choice of parametrization for the long convolutions (SSMs).
Analysis of the H3 mechanism as a decomposition $D_q S_\psi D_k S_\phi$ of its surrogate attention matrix⁵ clarifies
a connection to fast evaluation algorithms for matrix-vector multiplications. In particular, the generalization
of (8) to an arbitrary order is inspired by fast evaluation algorithms for structured dense matrices based on
butterfly decompositions (Li et al., 2015; Dao et al., 2019, 2022a), with length of the decomposition closely
tied to its expressivity (in the classes of matrices it can represent). The Hyena operator blends data control
with a special case of butterfly decomposition.
Remark 3.3. Hyena operators have unbounded context. Namely, they are not artificially restricted by e.g.,
locality, and can learn long-range dependencies between any of the elements of v via long convolutions, which
we discuss next.
Specializing filters in Hyena The window and positional encoding functions are used to specialize filters
in Hyena operators, biasing them towards a specific type. Figure 3.1 provides an important example: we
choose at least one of the convolutions in Hyena to be shaped towards exponential decay, mirroring the
findings of (Li et al., 2022) in other applications. Interestingly, we find that long exponentially decaying
filters display synergy with high-frequency filters, as they enable the operator to select specific inputs at
specific steps⁶. Similarly to (Romero et al., 2021b), we use high-frequency periodic activations (sine) in the
FFN. This allows (7) to learn filters with high-frequency content, addressing the low-frequency bias of neural
networks (Basri et al., 2020). Owing to the FFN, the parametrization in (7) can approximate filters obtained
through other means, such as S4 (Gu et al., 2020, 2021), CKConv (Romero et al., 2021b), SGConv (Li et al.,
2022) and Fourier Neural Operator (FNO) (Li et al., 2020).
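To make the filter specialization concrete, here is a hedged sketch of an implicit filter in the spirit of (7): an FFN with sine activations maps positional encoding features to filter values, and an exponential-decay window biases the filters. Layer widths, the form of the positional encoding, and the decay rate are illustrative assumptions rather than the paper's exact choices.

```python
# Illustrative implicit filter: FFN with sine activations over a simple sinusoidal positional
# encoding, multiplied by an exponential-decay window (shapes the filters toward decay).
import math
import torch
import torch.nn as nn

class Sin(nn.Module):
    def __init__(self, freq=10.0):
        super().__init__()
        self.freq = freq
    def forward(self, x):
        return torch.sin(self.freq * x)

class ImplicitFilter(nn.Module):
    def __init__(self, d_model, emb_dim=8, width=32, decay=0.02):
        super().__init__()
        self.emb_dim, self.decay = emb_dim, decay
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim, width), Sin(),
            nn.Linear(width, width), Sin(),
            nn.Linear(width, d_model),
        )
    def forward(self, L):
        t = torch.linspace(0, 1, L)[:, None]                      # positions in [0, 1]
        k = torch.arange(self.emb_dim // 2)[None, :]
        pos = torch.cat([torch.sin(t * (k + 1) * math.pi),        # sinusoidal features
                         torch.cos(t * (k + 1) * math.pi)], dim=-1)
        h = self.ffn(pos)                                         # (L, d_model) filter values
        window = torch.exp(-self.decay * torch.arange(L))[:, None]
        return h * window                                         # decay-modulated filters

h = ImplicitFilter(d_model=64)(L=1024)
print(h.shape)                                                    # torch.Size([1024, 64])
```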
Preserving causality Causality is necessary to train autoregressive language models, in order for the
output at a given position to depend only on the past. For example, Transformers mask the attention
matrix to be lower triangular. In the case of Hyena, causality can be guaranteed by parametrizing causal
convolutions:
⁵ Some of this analysis is reported in the Appendix.
⁶ This observation finds mirrors in the parametrization of the convolutions in H3 (Dao et al., 2022c) as a shift SSM and a diagonal SSM.
Proposition 3.1 (Causal Hyenas). If each filter hn , n = 1, . . . , N is causal, then the corresponding HyenaN
operator is causal.
In practice, we need not constrain the learning of the filter (7) to ensure its numerical causality: if we use FFT-based convolution algorithms, it suffices to evaluate the filter at $t = 0, \ldots, L-1$ and zero-pad the input and filter sequences to length $2L-1$ before taking the FFT.
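A quick numerical check of Proposition 3.1 under these assumptions (our own snippet, not the paper's code): with zero-padded FFT convolutions and a causal filter, perturbing a later input never changes earlier outputs.

```python
# Causality check: y_s depends only on u_0, ..., u_s when the filter is causal
# and the FFT convolution is computed after zero-padding.
import numpy as np

def causal_fftconv(u, h):
    n = 2 * len(u)                        # pad input and filter before the FFT
    return np.fft.irfft(np.fft.rfft(h, n=n) * np.fft.rfft(u, n=n), n=n)[:len(u)]

rng = np.random.default_rng(0)
L, t = 128, 100
u, h = rng.standard_normal(L), rng.standard_normal(L)
u_pert = u.copy()
u_pert[t] += 1.0                          # perturb a "future" position
y, y_pert = causal_fftconv(u, h), causal_fftconv(u_pert, h)
assert np.allclose(y[:t], y_pert[:t])     # outputs before position t are unchanged
```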
Efficiency One bottleneck of long convolution models can be their low utilization of hardware accelerators, especially when they involve iterative numerical methods to materialize the filter⁷. Evaluation of (7) is fast, since it involves a single forward pass of an FFN, and can be performed in parallel across the sequence length and all orders of a Hyena operator as displayed in Algorithm 2, increasing hardware utilization. An additional source of low utilization is the FFT, which is also shared by other long convolution layers. This bottleneck can be partially addressed by blocking (Selesnick and Burrus, 2017) and by optimization of the underlying routines (Dao et al., 2022c). We benchmark runtime in Section 4.4.
Algorithm 1 Projection
Require: Input sequence $u \in \mathbb{R}^{L \times D}$
1. In parallel across $L$: $\hat{z} = \mathrm{Linear}(u)$, $\mathrm{Linear} : \mathbb{R}^{D} \to \mathbb{R}^{(N+1)D}$
2. In parallel across $D$: $z = \mathrm{DepthwiseConv1d}(h, \hat{z})$, where $h$ is a short convolution filter
3. Reshape and split $z$ into $x^1, \ldots, x^N, v$. Each element has dimensions $x^n \in \mathbb{R}^{D \times L}$
Return $x^1, \ldots, x^N, v$
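A possible PyTorch rendering of Algorithm 1 is sketched below; the module and argument names are ours, and the short filter length of 3 follows the note in step 2. It is a sketch of the projection step, not the reference implementation.

```python
# Projection sketch: pointwise Linear to (N+1)D channels, depthwise short causal conv,
# then split into the gates x^1, ..., x^N and the value v.
import torch
import torch.nn as nn

class HyenaProjection(nn.Module):
    def __init__(self, d_model, order, filter_len=3):
        super().__init__()
        self.order = order
        inner = (order + 1) * d_model
        self.linear = nn.Linear(d_model, inner)                         # step 1
        self.short_conv = nn.Conv1d(inner, inner, filter_len,
                                    padding=filter_len - 1, groups=inner)  # step 2 (depthwise)

    def forward(self, u):                      # u: (batch, L, d_model)
        L = u.shape[1]
        z = self.linear(u).transpose(1, 2)     # (batch, (N+1)D, L)
        z = self.short_conv(z)[..., :L]        # crop padding to keep the conv causal
        *x, v = z.chunk(self.order + 1, dim=1) # step 3: split into x^1, ..., x^N, v
        return x, v

x, v = HyenaProjection(d_model=64, order=2)(torch.randn(4, 128, 64))
print(len(x), v.shape)                         # 2 torch.Size([4, 64, 128])
```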
Proposition 3.2 (Computational Complexity). The computational cost of processing an input $u \in \mathbb{R}^{L \times D}$ with an order-$N$ Hyena operator is $O(N D L (\log_2 L + D))$.
⁷ In contrast, deep learning primitives are designed for high GPU utilization, with FFNs and attention usually reaching
[Figure: associative recall accuracy (%) as a function of sequence length (2^7 to 2^15), one panel per vocabulary size (10, 20, 30, 40); methods compared: Hyena, CKConv, TransferFunc, H3, Conv1d, FNO.]
Figure 4.1: Benchmark of long convolution parametrizations in order 2 Hyena operators on associative recall
(%). Our results show that implicit parametrizations scale more favorably in vocabulary size (number of
possible values of tokens in the input) and length of the sequence.
4 Experiments
4.1 Shrinking the gap on in-context learning
We begin by empirically motivating the Hyena design, including the choice of long convolution parametriza-
tion. We consider the suite of tasks described in Table 4.1. Our evaluation is grounded in recent work
on mechanistic interpretability of Transformers (Elhage et al., 2021; Power et al., 2022; Olsson et al., 2022;
Zhang et al., 2022). Recently, associative recall, in particular, has been successfully used to guide the design
of H3 (Dao et al., 2022c). We extend the suite of tasks from these works and benchmark more challenging versions of each task. For example, solving associative recall with a vocabulary size of only 10
reveals whether a model is structurally capable of performing recall. Testing on much longer sequences and
larger vocabularies reveals additional gaps in performance that are otherwise hidden.
How to parametrize long convolutions We compare the performance of the following long convolution parametrizations for the two long convolutions ($S^1$ and $S^2$) in an order-2 Hyena:
• Conv1d: Explicit convolutions (regular convolution layers with fixed filter size).
• FNO: Filters parametrized explicitly in the frequency-domain (Li et al., 2020).
• H3: Implicit parametrization using state-space models (SSMs), in particular the standard S4 (Gu et al.,
2021).
• TransferFunc: Implicit parametrization via transfer functions, a classical system-theoretic generalization of SSMs (Chen, 1984).
Table 4.2: Test accuracy (%) for associative recall on longer sequences, vocabulary size 30. The symbol ✗ marks settings where the model does not fit in memory.

Sequence length   Hyena   FlashTransformer   Transformer   GSS   H3    AFT   RWKV
30k               100.0   32.4               ✗             5.3   8.4   2.3   12.4
64k               100.0   26.7               ✗             2.1   4.3   1.2   6.5
131k              97.2    ✗                  ✗             0.1   0.6   0.8   2.3
• CKConv: Implicit parametrization using FFNs (Romero et al., 2021b).
• Hyena: Combination of implicit parametrizations via FFNs (with exponential decay modulation as shown in Figure 3.1), and short explicit filters.
All models have the same width and 2 layers. Figure 4.1 shows implicit approaches based on FFNs outperform
other long convolutions, with the gap widening on longer sequences and larger vocabulary sizes. We train
a different model on each setting of sequence length and vocabulary size. The ranking is correlated with
the ability to decouple sequence length from parameter count (Hyena, CKConv, TransferFunc, H3) and
expressivity (Hyena, CKConv). We observe similar trends on the other tasks.
Pushing sequence length to the limit Next, we evaluate associative recall performance on extremely
long sequences of length 131k. To the best of our knowledge, these represent the first empirical display of
attention-free in-context learning on sequences of this length. The gap between parametrization schemes
widens as shown in Appendix A, with Hyena outperforming CKConv by 80 points.
Comparing operators We repeat our associative recall experiment, this time benchmarking different 2
layer models rather than changing the convolution parametrization: an order 2 Hyena, GSS (Mehta et al.,
2022), H3 (Dao et al., 2022c), AFT-conv (Zhai et al., 2021), RWKV (Peng, 2021), and a standard GPT
(Brown et al., 2020) using FlashAttention (Dao et al., 2022b). As shown in Table 4.2, Hyena is the only
operator able to solve the task. Our results challenge the notion that only Transformers are capable of this type of in-context learning. Surprisingly, rankings of model performance at a fixed sequence length on
The Pile are consistent with rankings on aggregate scores on our synthetics (Appendix C).
Generality of Hyena operators and filters Hyena operators and filters can also be applied successfully beyond language tasks. We experiment on sequential CIFAR, where pixels are flattened as a sequence, and use the same operator defined for language. We reach the accuracy of standard S4 (Gu et al., 2021) with the same model size (91%). In Section 4.5 and Appendix A, we discuss larger-scale image classification experiments
with Hyena.
[Figure: test loss vs. training FLOPs (×10^19) on The Pile at the 355M parameter scale, comparing Hyena and GPT.]
Figure 4.2: Preliminary "scaling law" of language models on The Pile. Comparison of our approach (red)
based on long convolutions and gating (Hyena) and a standard GPT (blue) (Brown et al., 2020). We reach
the perplexity of GPT with a smaller training FLOP budget.
Table 4.3: Perplexity on WikiText103 (same tokenizer). ∗ are results from (Dao et al., 2022c). Deeper and thinner models (Hyena-slim) achieve lower perplexity.

Table 4.4: Perplexity on The Pile for models trained until a total number of tokens, e.g., 5 billion (different runs for each token total). All models use the same tokenizer (GPT2). FLOP count is for the 15 billion token run.
We compare a small Hyena model against a reference RWKV (Peng, 2021) checkpoint (trained for 332 billion tokens), and a reference GPTNeo (Black et al., 2021) (trained for 300 billion tokens) of the
same size. Tables 4.5 and 4.6 summarize the results. Hyena performs similarly to other models despite
having been trained on less than half the number of total tokens. We observe Hyena to display characteristic
few-shot capabilities of standard Transformers, with some tasks e.g., MultiRC seeing a lift of more than 20%
accuracy over zero-shot when the model is provided additional prompts as context. The improvements are
more noticeable in generation tasks, where the additional prompts can instruct the model on how it should
be responding to the questions. We report an additional downstream evaluation on the LAMBADA task
(Paperno et al., 2016) in Appendix A.
Table 4.5: Zero-shot accuracy (%) on SuperGLUE tasks for small models.
Table 4.6: Few-shot (3) accuracy (%) on SuperGLUE tasks for small models.
[Figure: runtime (ms) of Hyena, Attention, and FlashAttention vs. sequence length; left panel spans 10^3 to 10^5, right panel zooms into roughly 10^3 to 10^3.8.]
Figure 4.3: Benchmarking runtime of Hyena, Attention and FlashAttention with varying sequence lengths.
Batch size is set to 64. The figure on the right is an inset showing a zoomed-in portion of the figure on the
left.
4.4 Benchmarking
We benchmark runtime of an order 2 Hyena operator compared to attention and FlashAttention layers (Dao
et al., 2022b). Hyena uses a fused CUDA kernel to perform FFTConv (Dao et al., 2022c). We set batch
size to 64 and measure runtime (in milliseconds). Results are provided in Figure 4.3. Hyena speedups reach
100× at sequence length 64K. The crossover point for Hyena and attention is at length 2048, and for Hyena and FlashAttention between 4096 and 8196. Despite the absolute reduction in FLOPs, speedups are achieved only on longer sequences, where the gap grows sufficiently large. This occurs because the hardware utilization of Hyena is lower than that of FlashAttention. We expect the gap to the theoretical maximum speedup to shrink with improved implementations of FFTConv and specialized hardware.
5 Discussion and Conclusion
In this work, we introduced an attention-free drop-in replacement to the core building block of many large-
scale language models. Hyena operators are a recurrence of gating and implicitly parametrized long convo-
lutions, can be evaluated efficiently in subquadratic time, and can learn in-context on very long sequences.
On The Pile, deep stacks of Hyena operators constitute one of the first attention-free, convolutional archi-
tectures to match perplexity and downstream performance of Transformers with a significant reduction in
training compute. Our promising results at the sub-billion parameter scale suggest that attention may not
be all we need, and that simpler subquadratic designs such as Hyena, informed by a set of simple guiding
principles and evaluation on mechanistic interpretability benchmarks, may form the basis for efficient large
models. We are excited about what new capabilities Hyena opens up as we scale and optimize the inference
speed of these models.
Acknowledgments
We would like to thank Karan Goel, Albert Gu, Avanika Narayan, Khaled Saab, Michael Zhang, Elliot Ep-
stein and Sabri Eyuboglu for helpful discussion and feedback on earlier drafts, and Together Computer and
Crusoe for providing the compute used to train models in this paper. We gratefully acknowledge the support
of NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078
(Volume to Velocity), and 1937301 (RTML); US DEVCOM ARL under No. W911NF-21-2-0251 (Interactive
Human-AI Teaming); ONR under No. N000141712266 (Unifying Weak Supervision); ONR N00014-20-1-2480:
Understanding and Applying Non-Euclidean Geometry in Machine Learning; N000142012275 (NEPTUNE);
NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture,
Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for
Research program, the Stanford Data Science Initiative (SDSI), Department of Defense (DoD) through the
National Defense Science and Engineering Graduate Fellowship (NDSEG) Program, and members of the Stan-
ford DAWN project: Facebook, Google, and VMWare. This work is supported by NSF (1651565), AFOSR
(FA95501910024), ARO (W911NF-21-1-0125), ONR, DOE (DE-SC0022222), CZ Biohub, and Sloan Fellow-
ship. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommenda-
tions expressed in this material are those of the authors and do not necessarily reflect the views, policies, or
endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.
References
S. Arora, A. Narayan, M. F. Chen, L. J. Orr, N. Guha, K. Bhatia, I. Chami, F. Sala, and C. Ré. Ask me
anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441, 2022.
R. Basri, M. Galun, A. Geifman, D. Jacobs, Y. Kasten, and S. Kritchman. Frequency bias in neural networks
for input of non-uniform density. In International Conference on Machine Learning, pages 685–694. PMLR,
2020.
S. Black, L. Gao, P. Wang, C. Leahy, and S. Biderman. GPT-Neo: Large Scale Autoregressive Language
Modeling with Mesh-Tensorflow, Mar. 2021. URL https://doi.org/10.5281/zenodo.5297715. If you
use this software, please cite it using these metadata.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry,
A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems,
33:1877–1901, 2020.
C.-T. Chen. Linear system theory and design. Saunders college publishing, 1984.
R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv
preprint arXiv:1904.10509, 2019.
P. Cramer. Alphafold2 and the future of structural biology. Nature structural & molecular biology, 28(9):
704–705, 2021.
E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation
with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition workshops, pages 702–703, 2020.
T. Dao, A. Gu, M. Eichhorn, A. Rudra, and C. Ré. Learning fast algorithms for linear transforms using
butterfly factorizations. In International conference on machine learning, pages 1517–1527. PMLR, 2019.
T. Dao, B. Chen, N. S. Sohoni, A. Desai, M. Poli, J. Grogan, A. Liu, A. Rao, A. Rudra, and C. Ré. Monarch:
Expressive structured matrices for efficient and accurate training. In International Conference on Machine
Learning, pages 4690–4721. PMLR, 2022a.
T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact attention
with io-awareness. arXiv preprint arXiv:2205.14135, 2022b.
T. Dao, D. Y. Fu, K. K. Saab, A. W. Thomas, A. Rudra, and C. Ré. Hungry hungry hippos: Towards
language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022c.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer,
G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.
arXiv preprint arXiv:2010.11929, 2020.
N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly,
et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
K. Fukushima and S. Miyake. Neocognitron: A self-organizing neural network model for a mechanism of
visual pattern recognition. In Competition and cooperation in neural nets, pages 267–285. Springer, 1982.
L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima,
et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,
2020.
S. Garg, D. Tsipras, P. Liang, and G. Valiant. What can transformers learn in-context? a case study of
simple function classes. arXiv preprint arXiv:2208.01066, 2022.
A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré. Hippo: Recurrent memory with optimal polynomial
projections. Advances in Neural Information Processing Systems, 33:1474–1487, 2020.
A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint
arXiv:2111.00396, 2021.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan. Augmix: A simple data
processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781, 2019.
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A.
Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. arXiv preprint
arXiv:2203.15556, 2022.
G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In European
conference on computer vision, pages 646–661. Springer, 2016.
N. Kitaev, Ł. Kaiser, and A. Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451,
2020.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.
Y. Li, H. Yang, E. R. Martin, K. L. Ho, and L. Ying. Butterfly factorization. Multiscale Modeling &
Simulation, 13(2):714–732, 2015.
Y. Li, T. Cai, Y. Zhang, D. Chen, and D. Dey. What makes convolutional models great on long sequence
modeling? arXiv preprint arXiv:2210.09298, 2022.
Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier
neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.
P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu,
A. Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
S. Massaroli, M. Poli, J. Park, A. Yamashita, and H. Asama. Dissecting neural odes. Advances in Neural
Information Processing Systems, 33:3952–3963, 2020.
H. Mehta, A. Gupta, A. Cutkosky, and B. Neyshabur. Long range language modeling via gated state spaces.
arXiv preprint arXiv:2206.13947, 2022.
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing
scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
E. Nguyen, K. Goel, A. Gu, G. W. Downs, P. Shah, T. Dao, S. A. Baccus, and C. Ré. S4nd: Modeling
images and videos as multidimensional signals using state spaces. arXiv preprint arXiv:2210.06583, 2022.
C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen,
et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
A. V. Oppenheim, A. S. Willsky, S. H. Nawab, and J.-J. Ding. Signals and systems, volume 2. Prentice hall
Upper Saddle River, NJ, 1997.
D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and
R. Fernández. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint
arXiv:1606.06031, 2016.
B. Peng. RWKV-LM, 8 2021. URL https://github.com/BlinkDL/RWKV-LM.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on
control and optimization, 30(4):838–855, 1992.
A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting
on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via
large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022.
J. W. Rae, A. Potapenko, S. M. Jayakumar, C. Hillier, and T. P. Lillicrap. Compressive transformers for
long-range sequence modelling. arXiv preprint, 2019. URL https://arxiv.org/abs/1911.05507.
D. W. Romero, R.-J. Bruintjes, J. M. Tomczak, E. J. Bekkers, M. Hoogendoorn, and J. C. van Gemert.
Flexconv: Continuous kernel convolutions with differentiable kernel sizes. arXiv preprint arXiv:2110.08059,
2021a.
D. W. Romero, A. Kuzina, E. J. Bekkers, J. M. Tomczak, and M. Hoogendoorn. Ckconv: Continuous kernel
convolution for sequential data. arXiv preprint arXiv:2102.02611, 2021b.
O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation.
In International Conference on Medical image computing and computer-assisted intervention, pages 234–
241. Springer, 2015.
A. Roy, M. Saffar, A. Vaswani, and D. Grangier. Efficient content-based sparse attention with routing
transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
I. Schlag, K. Irie, and J. Schmidhuber. Linear transformers are secretly fast weight programmers. In Inter-
national Conference on Machine Learning, pages 9355–9366. PMLR, 2021.
I. W. Selesnick and C. S. Burrus. Fast convolution and filtering. In The Digital Signal Processing Handbook,
pages 8–1. CRC Press, 2017.
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. Superglue:
A stickier benchmark for general-purpose language understanding systems. Advances in neural information
processing systems, 32, 2019.
S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity. arXiv
preprint arXiv:2006.04768, 2020.
L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan. Tokens-to-token
vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international
conference on computer vision, pages 558–567, 2021.
S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. Cutmix: Regularization strategy to train strong
classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 6023–6032, 2019.
S. Zhai, W. Talbott, N. Srivastava, C. Huang, H. Goh, R. Zhang, and J. Susskind. An attention free
transformer. arXiv preprint arXiv:2105.14103, 2021.
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv
preprint arXiv:1710.09412, 2017.
Y. Zhang, A. Backurs, S. Bubeck, R. Eldan, S. Gunasekar, and T. Wagner. Unveiling transformers with lego:
a synthetic reasoning task. arXiv preprint arXiv:2206.04301, 2022.
Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. In Proceedings of the
AAAI conference on artificial intelligence, volume 34, pages 13001–13008, 2020.
Hyena Hierarchy
Supplementary Material
Contents
1 Introduction
4 Experiments
  4.1 Shrinking the gap on in-context learning
  4.2 Language Modeling
  4.3 Downstream Evaluation
  4.4 Benchmarking
  4.5 Large-Scale Image Classification
A Experimental Details
  A.1 Mechanistic Design Synthetic Benchmarks
  A.2 Language Modeling
  A.3 Downstream Evaluation
  A.4 Image Classification
A Experimental Details
An implementation of Hyena can be found at this link.
For each task, we train models using the hyperparameters shown in Table A.1. We consider increasing set-
tings of difficulty controlled by sequence length, spanning values 1024, 2048, 4098, 8196, 16392, 32784, 65568, 131136, and vocabulary sizes 10, 20, 30, 40. For ICL of functions, we vary instead the dimension $n_o$.
Note that for associative recall on longer sequences, multiple copies of key-value tuples appear in the
prompt. To see this, consider how likely it is to sample multiple copies of a particular key-value pair with
a vocabulary size of 40, in order to form a sequence of 100k characters. Models capable of looking further
back in the sequence effectively see more data, and can solve challenging versions of the in-context learning
task. Increasing the vocabulary size increases the average distance between instances of the same
key-value pair in each prompt, highlighting performance gaps between different approaches.
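To make the setup concrete, the snippet below is our reconstruction of an associative recall example from the description above (not the paper's data pipeline): each sequence fixes a key-to-value mapping, interleaves sampled key-value pairs, and ends with a query key whose associated value is the target.

```python
# Illustrative associative recall generator: keys and values are drawn from disjoint
# token ranges; repeated keys stay consistent with a per-sequence mapping.
import numpy as np

def associative_recall_example(vocab_size, seq_len, rng):
    n_pairs = seq_len // 2
    mapping = rng.permutation(vocab_size) + vocab_size     # fixed key -> value mapping
    keys = rng.integers(0, vocab_size, size=n_pairs)
    values = mapping[keys]
    prompt = np.empty(2 * n_pairs, dtype=np.int64)
    prompt[0::2], prompt[1::2] = keys, values              # interleave key-value pairs
    query = keys[rng.integers(0, n_pairs)]
    return np.append(prompt, query), mapping[query]        # (input tokens, target value)

rng = np.random.default_rng(0)
x, y = associative_recall_example(vocab_size=40, seq_len=1024, rng=rng)
print(x.shape, y)    # (1025,) <target token>
```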
Table A.1: Hyperparameter settings for reasoning and in-context learning tasks.
Optimizer AdamW
Optimizer momentum β1 , β2 = 0.9, 0.98
Base learning rate 0.0005
Weight decay 0.1
Dropout None
Batch size 32
Training epochs 200
Num samples 2000
Learning rate schedule cosine decay
Warmup epochs 10
Warmup schedule linear
Number of layers 2
Width 64
• Conv1d: Explicit convolutions (regular convolution layers with fixed filter size). We use a fixed filter size
of 64, to match parameters of the other approaches.
• FNO: Filters parametrized explicitly in the frequency-domain (Li et al., 2020). We set the number of modes
to 64.
• H3: Implicit parametrization using state-space models (SSMs), and in particular the standard S4 (Gu et al.,
2021). We set the state dimension to 64.
• TransferFunc: Implicit parametrization via transfer functions, a classical system-theoretic generalization of SSMs. Transfer functions are defined by a ratio of polynomials (we parametrize the coefficients, and evaluate the polynomials efficiently via FFTs; a minimal sketch appears after this list). We set the order to 64.
• CKConv: Implicit parametrization using FFNs (Romero et al., 2021b).
• Hyena: Combination of implicit parametrizations via FFNs (with exponential decay modulation as shown in Figure 3.1), and short explicit filters.
CKConv and Hyena use the same size of FFNs (width 32 to match in parameters).
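As a rough sketch of the TransferFunc parametrization described above (ours, with assumed coefficient scales and padding length): evaluate the ratio of two polynomials on the unit circle via the FFT and invert to obtain the filter taps.

```python
# Transfer-function filter sketch: h is the (approximate) impulse response of B(z)/A(z),
# obtained by evaluating both polynomials at roots of unity and taking an inverse FFT.
import numpy as np

def transfer_function_filter(b, a, L):
    """b, a: numerator / denominator coefficients; returns the first L filter taps."""
    B = np.fft.rfft(b, n=2 * L)          # polynomial evaluations on the unit circle
    A = np.fft.rfft(a, n=2 * L)
    return np.fft.irfft(B / A, n=2 * L)[:L]

rng = np.random.default_rng(0)
order = 64
b = rng.standard_normal(order) * 0.1
a = np.r_[1.0, rng.standard_normal(order - 1) * 0.01]   # keep the denominator well-conditioned
h = transfer_function_filter(b, a, L=1024)
print(h.shape)                           # (1024,)
```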
In Table A.2, we report additional results on the challenging setting of sequence length 131072 and vocabulary size 30. Implicit parametrizations of convolutions outperform explicit parametrizations on associative recall, with CKConv and Hyena greatly improving on the ability to extract the right key-value relations
from different inputs. In Appendix C, we discuss how results on our synthetic tasks can be indicative of
performance at a larger scale.
Table A.2: Test accuracy (%) in associative recall on sequences of length 131072, vocabulary size 30.
Hyena CKConv TransferFunc H3 FNO Conv1d
97.2 14.3 0.5 0.6 0.3 0.5
Operator comparisons: We compare different models on the same associative recall task, using hyper-
parameters in Table A.1. Hyena uses our filter parametrization with decay windowing for long convolutions,
and short explicit convolutions of size 3 after the dense input projections. All other models use defaults from
their largest scale experiment, while keeping the size to 2 layers and width 64.
A note on Transformer performance Transformers can solve associative recall tasks with longer se-
quences, provided the length does not prevent them from fitting in memory, and enough examples are present
in the training data. In all our experiments, we keep the number of samples fixed (2000), a regime where
Transformers struggle to find the generalizing solution (see Table A.1).
For shorter sequences (see Appendix C), Transformers solve the task easily even with limited data, com-
parably to Hyena.
More broadly, these different properties of attention and attention-free token-mixing layers may explain
improved performance when they are combined in hybrid architectures (Dao et al., 2022c). The focus of this work has been identifying an architecture capable of performing without attention, which is necessary
to tackle domains where long sequences are common. However, when training with shorter sequences (up
to 8k), if final downstream performance is the only metric of interest, improved results can be obtained by
hybridizing our models similarly to H3 (Dao et al., 2022c).
The Pile: We follow the same procedure and train 125M and 355M-sized models on The Pile (Gao et al.,
2020). Hyperparameters are reported in Table A.3. Hyperparameters for 355M are the same beyond a
reduction in peak learning rate to 4 · 10−4 . For larger models (1.3B), we set a learning rate of 2.2 · 10−4 .
We perform three experiments for each model type and size, and train for 5, 10, 15 billion tokens at a
sequence length 2048 and global batch size 256. All models are trained on a single node of 8 A100 80GB
GPUs. We use order 2 Hyenas, with the same architectural considerations described above for WikiText103.
In addition to our data scaling experiments at 5, 10 and 15 billion tokens, we provide preliminary results
for models at the 1.3B parameter scale (10.8 perplexity after 5 billion tokens), and train a 153M model (130
billion tokens), reaching a perplexity of 9.8. The 153M model is the same one used in our downstream evaluation on
SuperGLUE.
Training hyperparameters match those of standard GPT training pipelines, and are thus likely suboptimal
for new attention-free architectures such as Hyena. We run some preliminary experiments and find that, e.g., modifications to the learning rate schedule (currently linear warmup followed by cosine decay) improve perplexity at convergence for Hyena models (we recommend slightly lower learning rates for Hyena
models compared to GPT of a similar size). Despite these findings, we use standard GPT hyperparameters
for both GPT and Hyena.
PG-19 We also report results of additional training runs on other datasets. We train a Hyena 153M model
on the standard PG-19 long-range corpus (Rae et al., 2019), with a context length of 16k tokens, reaching a
test perplexity of 14.6 (using the standard GPT2 tokenizer) in 8 epochs.
Architectures Architectural hyperparameters for Hyena are shown in Table A.4. We use sine as an acti-
vation function for the FFN of Hyena filters.
FLOP computation The number of floating point operations (FLOPs) reported in the main text are
computed using the same strategy as in (Hoffmann et al., 2022). For GPT, we do not use the approximation,
opting instead for the more accurate formula based on FLOP counts of individual layers. In the case of
Hyena, FLOPs are computed using the same method, except attention layers are replaced by:
i. Projections: order × d_model × d_model × seq_len.
ii. Short conv on projections: order × d_model × seq_len × filter_len (usually 3).
iii. FFTConv: 5 × (order - 1) × d_model × log(seq_len) × seq_len.
iv. Output: d_model × d_model × seq_len.
with a leading factor 2 to account for both additions and multiplications.
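The itemized formulas translate directly into a small helper (our own, not the paper's script); we read the logarithm as base 2 to match Proposition 3.2, and the leading factor 2 counts multiply-adds.

```python
# Rough FLOP count for one Hyena block following items (i)-(iv) above;
# "order", "d_model", "seq_len", "filter_len" mirror the names in the text.
import math

def hyena_block_flops(order, d_model, seq_len, filter_len=3):
    proj = order * d_model * d_model * seq_len                            # i. projections
    short_conv = order * d_model * seq_len * filter_len                   # ii. short conv on projections
    fft_conv = 5 * (order - 1) * d_model * math.log2(seq_len) * seq_len   # iii. FFTConv
    out = d_model * d_model * seq_len                                     # iv. output projection
    return 2 * (proj + short_conv + fft_conv + out)

print(f"{hyena_block_flops(order=2, d_model=768, seq_len=2048):.3e} FLOPs per block")
```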
Models The models considered are the open-source checkpoint of GPTNeo 125M trained for 300B tokens on The Pile, and the RWKV-v4 169M checkpoint trained for 332B tokens on The Pile. Hyena is a 153M
model trained for 137B tokens on The Pile.
LAMBADA: We evaluate Hyena on the LAMBADA (Paperno et al., 2016) task. We apply a stop word
filter and check whether predictions for all tokens corresponding to the last word agree with the ground truth.
The small Hyena model trained on 137B tokens reaches 44.64% accuracy.
CIFAR-10: We use CIFAR-10 in sequential and 2D experiments. For sequential, we use the Hyena operator
defined in our language tasks and compare with an S4 model (Gu et al., 2021) of the same size by swapping
layers in the residual blocks. In 2D, we learn Hyena filters (in both x and y dimensions) that are equal to the
size of the input shape, and forgo the gating mechanism from our language experiments. We window (i.e.,
apply a soft mask spatially to) the Hyena filters with a decay term. The rate of decay varies across channels,
ensuring different sizes of the filters at initialization. We compare with another implicit 2D convolution,
S4ND (Nguyen et al., 2022), by swapping the model layers with the 2D Hyena filters. The "isometric" model
consists of 4 residual blocks of model dimension 128. We use basic image augmentations, 0.1 dropout, 0.03
weight decay and train for 100 epochs using a Nvidia T4 GPU.
Table A.5: ViT and ViT-Hyena settings for ImageNet-1k.
Image size 224²
Optimizer AdamW
Optimizer momentum β1 , β2 = 0.9, 0.999
Weight init trunc. normal (std=0.02)
ViT base learning rate 1e−3
Hyena-ViT base learning rate 2e−4
ViT weight decay 0.05
Hyena-ViT weight decay 0.01
Dropout None
Batch size 1024
Training epochs 300
Learning rate schedule cosine decay
Warmup epochs 10
Warmup schedule linear
Randaugment (Cubuk et al., 2020) (9,0.5,layers=2)
Mixup (Zhang et al., 2017) 0.8
Cutmix (Yun et al., 2019) 1.0
Random erasing (Zhong et al., 2020) 0.25
Label smoothing (Szegedy et al., 2016) 0.1
Stochastic depth (Huang et al., 2016) 0.1
Exp.mov. avg (EMA) (Polyak and Juditsky, 1992) None
Isolating the surrogate attention matrix In the case of length-$L$ discrete sequences,
$$z_t = k_t \sum_{m=0}^{L-1} \phi_{t-m} v_m, \qquad y_t = q_t \sum_{m=0}^{L-1} \psi_{t-m} z_m \tag{10}$$
Therefore we can rewrite (8) as
$$
\begin{aligned}
y_t &= q_t \sum_{m=0}^{L-1} \psi_{t-m} k_m \sum_{n=0}^{L-1} \phi_{m-n} v_n \\
    &= q_t \sum_{m=0}^{L-1} \sum_{n=0}^{L-1} \psi_{t-m} k_m \phi_{m-n} v_n && \text{Move } \psi, k \text{ inside inner sum} \\
    &= q_t \sum_{n=0}^{L-1} \sum_{m=0}^{L-1} \psi_{t-m} k_m \phi_{m-n} v_n && \text{Index shift} \\
    &= \sum_{n=0}^{L-1} q_t \Big( \sum_{m=0}^{L-1} \psi_{t-m} k_m \phi_{m-n} \Big) v_n
\end{aligned} \tag{11}
$$
And we can define the surrogate attention matrix $A^\psi_\phi(q, k)$:
$$[A^\psi_\phi(q, k)]_{t,t'} = q_t \sum_{m=0}^{L-1} \psi_{t-m} k_m \phi_{m-t'}. \tag{12}$$
Continuous Signals: We can also consider the case of continuous signals on a group $G$. In the continuous case, we can expand the convolutions in (8) as
$$(\phi * v)_t = \int_G \phi_{t-g}\, v_g\, dg, \qquad (\psi * z)_t = \int_G \psi_{t-g}\, z_g\, dg \tag{13}$$
This allows us to rewrite (8) as
$$
\begin{aligned}
y_t &= q_t\, (\psi * (k\, (\phi * v)))_t \\
    &= q_t \int_G \psi_{t-g}\, k_g \int_G \phi_{g-\tau}\, v_\tau\, d\tau\, dg \\
    &= q_t \int_G \int_G \psi_{t-g}\, k_g\, \phi_{g-\tau}\, v_\tau\, d\tau\, dg \\
    &= q_t \int_G \int_G \psi_{t-g}\, k_g\, \phi_{g-\tau}\, v_\tau\, dg\, d\tau && \text{Variable swap} \\
    &= \int_G q_t \int_G \psi_{t-g}\, k_g\, \phi_{g-\tau}\, v_\tau\, dg\, d\tau && \text{Pull } q_t \text{ into the } \tau \text{ integral} \\
    &= \int_G \Big( q_t \int_G \psi_{t-g}\, k_g\, \phi_{g-\tau}\, dg \Big) v_\tau\, d\tau && \text{Pull } v_\tau \text{ out of the } g \text{ integral}
\end{aligned} \tag{14}
$$
There is a linear operator $A : v \mapsto y = Av$ which we interpret as the surrogate attention operator. $A$ is conditioned on the query $q$, key $k$ and filters $\phi$ and $\psi$: $A = A^\psi_\phi(q, k)$. The kernel $K$ of the operator is given by
$$K(t, t') = q_t \int_G \psi_{t-g}\, k_g\, \phi_{g-t'}\, dg \tag{15}$$
Operator decomposition of the surrogate attention matrix We can decompose the linear map $v \mapsto y = A^\psi_\phi(q, k)\, v$ into a sequence of factors, each dependent on a projection of the input: $A^\psi_\phi(q, k) = A^\psi(q)\, A^\phi(k)$. Let $D_q$ and $D_k$ be the $L \times L$ diagonal matrices whose respective main diagonal entries are the entries of $q$ and $k$.
We can expand the matrix multiplications in (16) in the case of causal filters $\phi$ and $\psi$ as
$$
D_q S_\psi D_k S_\phi =
\begin{pmatrix} q_0 & & & \\ & q_1 & & \\ & & \ddots & \\ & & & q_{L-1} \end{pmatrix}
\begin{pmatrix} \psi_0 & & & \\ \psi_1 & \psi_0 & & \\ \vdots & \vdots & \ddots & \\ \psi_{L-1} & \psi_{L-2} & \cdots & \psi_0 \end{pmatrix}
\begin{pmatrix} k_0 & & & \\ & k_1 & & \\ & & \ddots & \\ & & & k_{L-1} \end{pmatrix}
\begin{pmatrix} \phi_0 & & & \\ \phi_1 & \phi_0 & & \\ \vdots & \vdots & \ddots & \\ \phi_{L-1} & \phi_{L-2} & \cdots & \phi_0 \end{pmatrix}
$$
$$
= \underbrace{\begin{pmatrix} q_0\psi_0 & & & \\ q_1\psi_1 & q_1\psi_0 & & \\ \vdots & \vdots & \ddots & \\ q_{L-1}\psi_{L-1} & q_{L-1}\psi_{L-2} & \cdots & q_{L-1}\psi_0 \end{pmatrix}}_{A^\psi(q)}
\underbrace{\begin{pmatrix} k_0\phi_0 & & & \\ k_1\phi_1 & k_1\phi_0 & & \\ \vdots & \vdots & \ddots & \\ k_{L-1}\phi_{L-1} & k_{L-1}\phi_{L-2} & \cdots & k_{L-1}\phi_0 \end{pmatrix}}_{A^\phi(k)}
\tag{19}
$$
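The decomposition can be checked numerically; the snippet below (our own verification, with random sequences) builds $D_q S_\psi D_k S_\phi$ explicitly and confirms it reproduces the two-stage computation in (10).

```python
# Verify that the surrogate attention matrix D_q S_psi D_k S_phi applied to v
# matches the sequential computation z = k * (phi conv v), y = q * (psi conv z).
import numpy as np

def toeplitz_causal(f):
    """Lower-triangular Toeplitz matrix S_f with (S_f)_{t,t'} = f_{t-t'} for t >= t'."""
    L = len(f)
    S = np.zeros((L, L))
    for t in range(L):
        S[t, :t + 1] = f[t::-1]
    return S

rng = np.random.default_rng(0)
L = 64
q, k, v, phi, psi = (rng.standard_normal(L) for _ in range(5))

A = np.diag(q) @ toeplitz_causal(psi) @ np.diag(k) @ toeplitz_causal(phi)   # D_q S_psi D_k S_phi
z = k * (toeplitz_causal(phi) @ v)           # z_t = k_t sum_m phi_{t-m} v_m   (Eq. 10)
y = q * (toeplitz_causal(psi) @ z)           # y_t = q_t sum_m psi_{t-m} z_m
assert np.allclose(A @ v, y)
```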
Table C.1: Hyena Accuracy on associative recall with varying vocabulary size 10, 20, 30, 40 in relation to
test loss on The Pile after 5 billion tokens. We notice a correlation between the two performance metrics,
suggesting that slices of our mechanistic design synthetics may be potentially predictive of performance at
scale.
Model Acc @ 10 Acc @ 20 Acc @ 30 Acc @ 40 Loss @ 5B on The Pile
Conv1d 32 11 10 8 4.21
AFT-conv 55 21 12 10 3.57
H3 92 60 13 10 2.69
Transformer 100 100 92 82 2.59
Hyena 100 100 98 85 2.59
[Figure: test loss vs. training epochs for Hyena on addition, one panel per combination of depth (1, 2, 3 layers) and number of digits (2, 4, 8, 16).]
Figure C.1: Test loss and accuracy of Hyena on addition with different numbers of digits and model depths.
Each plot reports the results of a different experiment, with the curve tracing test results during training.
Single layer recall All experiments on our synthetic tasks default to 2 layer models. We choose 2 as
it is the canonical number for mechanistic analysis of Transformers (Elhage et al., 2021) based on circuits.
Interestingly, a single layer of Hyena (width 64) is capable of performing associative recall, solving the task
completely even in the challenging setting with vocabulary size 40. Reverse engineering exactly how the
single Hyena operator is able to perform recall is left for future work.
D Samples and Visualizations
D.1 Hyena Matrices
We provide visualizations of attention and Hyena matrices activated by test strings. In D.1, D.2, we compare
GPTNeo (Black et al., 2021) attention matrices with Hyena matrices extracted by our pre-trained small
Hyena model. In D.3 and D.4, we provide additional Hyena matrices for the 355M model, activated by test
strings of different length.
For attention, we visualize the raw post-softmax matrix. For Hyena matrices, we plot the (element-wise) absolute value of $H(u)$:
$$H(u) = D^N_x S^N_h \cdots D^2_x S^2_h\, D^1_x S^1_h, \qquad \hat{H}(u)_{ij} = |H(u)_{ij}|$$
Since Hyena does not normalize the entries of its matrices with, e.g., softmax, there are notable differences with attention: (1) the entries of $H(u)$ can be either positive or negative, and (2) the magnitude is unconstrained. We observe the magnitude of matrices in pre-trained Hyena models to be around $10^{-3}$.
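For reference, a minimal sketch of how such a matrix can be materialized for plotting; the gates and filters below are random placeholders standing in for the projections and implicit filters of a trained model.

```python
# Materialize a Hyena matrix H(u) = D_x^N S_h^N ... D_x^1 S_h^1 and take element-wise
# magnitudes for visualization (entries can be negative and are not normalized).
import numpy as np

def toeplitz_causal(f):
    L = len(f)
    S = np.zeros((L, L))
    for t in range(L):
        S[t, :t + 1] = f[t::-1]
    return S

rng = np.random.default_rng(0)
L, N = 128, 2
H = np.eye(L)
for _ in range(N):
    gate, filt = rng.standard_normal(L), rng.standard_normal(L)   # placeholders for x^n, h^n
    H = np.diag(gate) @ toeplitz_causal(filt) @ H
H_abs = np.abs(H)                        # what gets plotted
print(H_abs.shape)                       # (128, 128)
```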
Figure D.1: Attention matrices from a GPTNeo small model. We use the test string "Attention is all you need. Attention is".
Figure D.2: Hyena matrices from a Hyena small model (same model used for SuperGLUE downstream evaluations). We use the test string "Attention is all you need. Attention is". We note that Hyena has a different data-controlled matrix for each channel, i.e. for each dimension in its width, since it does not use heads.
Figure D.3: Data-controlled Hyena matrices (355M model), activated by the string "When a doctor doctors
a doctor, does the doctor doing the doctoring doctor as the doctor being doctored wants to be doctored or does
the doctor doing the doctoring doctor as they want to doctor? ". Rows in the plot are matrices from different
layers, columns are matrices from different channels. The operator shows characteristic patterns of attention
matrices, without attention.
Figure D.4: Data-controlled Hyena matrices (355M model), activated by the string "Mrs. Dursley, Mr.
Dursley, Dudley Dursley", from Causal scrubbing: results on induction heads. Rows in the plot are matrices
from different layers, columns are matrices from different channels.
D.2 Hyena Filters
Figure D.5 provides a visualization of Hyena long convolution filters at initialization and after training to
completion on The Pile.
We find a substantial performance difference (up to 5% perplexity) between initialization schemes. If the
filters at initialization are excessively smooth (see Appendix D.3 for a discussion of positional encoding and
activation), the model finds a worse solution and takes longer to converge. Further, we observe that initialization schemes which regularize filters towards the typical filters learned at convergence decrease performance. These
observations are in line with performance gaps between convolution parametrization schemes discussed in
main text and Appendix A.1. In particular, the performance improvements obtained via Hyena filters could
be due to easier optimization in the space of convolutional filters.
At convergence, Hyena learns a collection of lower-order filters with a similar structure, which can be
exploited to further speed up inference after training.
Figure D.5: [Top]: Long convolution Hyena filters at initialization (153M parameters, 18 layer model).
[Bottom]: Filters after training for 130 billion tokens on The Pile.
[Figure panels: positional encoding, impulse response h_t, magnitude response |FFT[h]|, and phase response ∠FFT[h], plotted against sequence index.]
Figure D.7: Hyena filters at initialization with 65 positional encoding features K = 32.
Figure D.8: Hyena filters at initialization with 65 positional encoding features K = 64.
Figure D.9: Hyena filters at initialization with 17 positional encoding features K = 8 and frequency of sinusoidal activation set to 10.
D.4 Downstream Examples
MultiRC We report examples of downstream evaluation of small models on the MultiRC question-answering
task. We report answers of small Hyena (153M, trained for 130B tokens on The Pile) and the public check-
point RWKV-v4 (Peng, 2021) (169M, trained for 332B tokens on The Pile). We select randomized examples
with indices being powers of 2. Alignment of Hyena’s responses to the task format is greatly improved by
providing few-shot examples in the prompt, which may be a promising sign for larger models based on the
proposed attention-free architecture. Each example shows responses for specific examples in the validation
set (example indices are listed at the beginning of each example).
MultiRC index: 2, 4
Passage: What causes a change in motion? The application of a force. Any time an object
changes motion, a force has been applied. In what ways can this happen? Force can cause an
object at rest to start moving. Forces can cause objects to speed up or slow down. Forces
can cause a moving object to stop. Forces can also cause a change in direction. In short,
forces cause changes in motion. The moving object may change its speed, its direction,
or both. We know that changes in motion require a force. We know that the size of the
force determines the change in motion. How much an object’s motion changes when a force is
applied depends on two things. It depends on the strength of the force. It also depends on
the object’s mass. Think about some simple tasks you may regularly do. You may pick up a
baseball. This requires only a very small force.
Question: Based on the previous passage, Would the mass of a baseball affect how much force
you have to use to pick it up? Is "Less the mass, less the force applied" a correct answer?
Answer: Yes
RWKV zero-shot: The mass of a baseball depends on the force applied. if the mass of the
baseball is small, the force applied is small.
RWKV few-shot: The mass, less the force applied, is a more important force. it is the force
that changes the mass of a baseball.
Question: Based on the previous passage, What do you apply to an object to make it move or
stop? Is "Strength" a correct answer?
Answer: No
RWKV zero-shot: The force applied to the object is the force applied to the object.
RWKV few-shot: The force applied to a baseball is a force. it is a force that is applied to
a baseball.
MultiRC index: 32, 64
Passage: The film opens with Sunita, a medical student, and her friends working on a project
about the human brain. She wants to investigate the curious case of Sanjay Singhania,
a notable city businessman, who is reported to have anterograde amnesia. Her professor
denies access to Sanjay’s records as it is currently under criminal investigation. Sunita,
nonetheless, decides to investigate the matter herself. Sanjay is introduced as he brutally
murders a man. He takes a Polaroid picture of the man, and writes on it “done”. It is
revealed that Sanjay has anterograde amnesia where he loses his memory every 15 minutes.
Sanjay uses a system of photographs, notes, and tattoos on his body to recover his memory
after each cycle. It is revealed that Sanjay is ultimately out to avenge the death of his
sweetheart Kalpana, and that he is systematically killing the people who were responsible
for it. His main target is “Ghajini”, a notable social personality in the city. Police
Inspector Arjun Yadav, on the case of the serial murders, tracks Sanjay down to his flat
and attacks and disables him. Yadav finds two diaries where Sanjay has chronicled the
events of 2005 and 2006 . The film flashes back to 2005 as Yadav reads the diary. Sanjay
Singhania is shown as the owner of the Air Voice mobile telephone company. In the course of
his business, Sanjay sends his men to meet Kalpana, a struggling model, about putting up a
billboard above her apartment. The owner of Kalpana’s advertising firm misinterprets this
as a romantic advance, and in view of a possible lucrative Air Voice ad campaign and other
benefits, encourages Kalpana to accept the overture.
Question: Based on the previous passage, Why did Sanjay murder a man? Is "Sanjay" a correct
answer?
Answer: Yes
Question: Based on the previous passage, Every 15 minutes, Sanjay goes through what process,
Which frustrates his attempts to avenge the death of his sweetheart? Is "He forgets about
facts" a correct answer?
Answer: Yes
MultiRC index: 128
Passage: In 1863, Alexander II re-convened the Diet of Finland and initiated several reforms
increasing Finland's autonomy from Russia including establishment of its own currency,
the markka. Liberation of business led to increased foreign investment and industrial
development. Finland also got its first railways, separately established under Finnish
administration. Finally, the elevation of Finnish from a language of the common people to
a national language equal to Swedish opened opportunities for a larger proportion of the
society. Alexander II is still regarded as "The Good Tsar" in Finland. These reforms could
be seen as results of a genuine belief that reforms were easier to test in an underpopulated,
homogeneous country, than in the whole of Russia. They may also be seen as a reward for the
loyalty of its relatively western-oriented population during the Crimean War and during the
Polish uprising. Encouraging Finnish nationalism and language can also be seen as an attempt
to dilute ties with Sweden.
Question: Based on the previous passage, Alexander II is considered what in Finland since
1863? Is "The good tsar" a correct answer?
Answer: Yes
MultiRC index: 1024
Passage: Einstein and Maric married in January 1903. In May 1904, the couple’s first son,
Hans Albert Einstein, was born in Bern, Switzerland. Their second son, Eduard, was born
in Zurich in July 1910. In 1914, the couple separated; Einstein moved to Berlin and his
wife remained in Zurich with their sons. They divorced on 14 February 1919, having lived
apart for five years. Eduard, whom his father called "Tete" (for petit), had a breakdown at
about age 20 and was diagnosed with schizophrenia. His mother cared for him and he was also
committed to asylums for several periods, including full-time after her death. The marriage
with Maric does not seem to have been very happy. In letters revealed in 2015, Einstein
wrote to his early love, Marie Winteler, about his marriage and his still strong feelings
for Marie. In 1910 he wrote to her that "I think of you in heartfelt love every spare minute
and am so unhappy as only a man can be" while his wife was pregnant with their second child.
Einstein spoke about a "misguided love" and a "missed life" regarding his love for Marie.
Einstein married Elsa Lowenthal on 2 June 1919, after having had a relationship with her
since 1912. She was a first cousin maternally and a second cousin paternally. In 1933, they
emigrated to the United States. In 1935, Elsa Einstein was diagnosed with heart and kidney
problems; she died in December 1936.
Question: Based on the previous passage, How much time passed, after Albert Einstein’s
father divorced his mother, that he re-married? Is "3 months and 18 days" a correct answer?
Answer: Yes
RWKV zero-shot: Albert Einstein was born on 3 march 1916 in Gerlin, Germany. he was the son
of a German doctor and a german woman.
RWKV few-shot: It is not a correct answer. The exact date is not known.