noise_step: Training in 1.58b With No Gradient Memory
Preprint
Will Brickner
[email protected]
Abstract
Training large machine learning models is fantastically computationally and energetically burdensome.
With astronomical costs, training runs are a risky endeavor accessible only to a trace portion of
humanity. By restricting weights to low precision, inference throughput and energy consumption
can be dramatically improved. It was recently shown that LLM inference can be performed in
1.58-bit (ternary) precision without any performance loss [1]. However, training remains unimproved,
occurring in f16 precision. This paper presents an algorithm which trains directly in ternary precision,
operates without backpropagation or momentum, and can run concurrently with model inference
at similar cost. The gradient representation is exceptionally well-suited to distributed training
paradigms. A simple model representation naturally follows, with several remarkable properties.
Massive reduction in the memory and energy usage of model training is enabled.
1 Gradient Estimation
Using the Jacobian-vector product (JVP), one can compute the alignment between an arbitrary perturbation vector ν and the
loss gradient exactly. The alignment can be computed in tandem with f(x), and requires materializing neither the full Jacobian nor the gradient ∇f.
$$\alpha_f(\nu) \;=\; J_f\,\nu \;=\; \nabla f \cdot \nu \tag{1}$$
$$\widehat{\nabla f} \;=\; \frac{1}{n}\sum_i \nu_i\,\alpha(\nu_i), \qquad \nu_i \sim \mathcal{N}(0, I) \tag{2}$$
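As a minimal illustration of Eqs. (1)–(2), the estimator can be sketched in JAX; the quadratic loss and the sample count are illustrative stand-ins, not part of this work. The essential point is that jax.jvp returns f(x) together with the alignment α(ν) = ∇f · ν from a single forward-mode pass, so the gradient is never materialized.

```python
import jax
import jax.numpy as jnp

# Minimal sketch of Eqs. (1)-(2). The quadratic loss is an illustrative stand-in.
def loss(w):
    return jnp.sum((w - 1.0) ** 2)

def estimate_gradient(w, key, n=64):
    est = jnp.zeros_like(w)
    for k in jax.random.split(key, n):
        nu = jax.random.normal(k, w.shape)        # ν_i ~ N(0, I)
        _, alpha = jax.jvp(loss, (w,), (nu,))     # α(ν_i) = ∇f · ν_i, computed alongside f(w)
        est = est + nu * alpha                    # accumulate ν_i α(ν_i)
    return est / n

w = jnp.zeros(8)
print(estimate_gradient(w, jax.random.PRNGKey(0)))  # ≈ ∇loss(w) = 2(w - 1) = -2 everywhere
```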
In the ternary domain, values are restricted to T = {−1, 0, +1}. Directions and magnitudes of ternary vectors are highly
constrained; vector weighting is not useful. Gradient estimation can be recovered if the perturbations νi are sparse.
$$\nu_i \;\sim\; \operatorname{Bernoulli}(s) \odot \mathcal{U}\{-1, +1\} \tag{3}$$
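A minimal NumPy sketch of Eq. (3); the sparsity s used below is an arbitrary illustrative value.

```python
import numpy as np

def sample_ternary_perturbation(rng, shape, s=1e-4):
    """ν ~ Bernoulli(s) ⊙ U{-1, +1}: almost all zeros, a sparse sprinkling of ±1."""
    mask = rng.random(shape) < s                    # Bernoulli(s) support
    signs = rng.integers(0, 2, size=shape) * 2 - 1  # uniform ±1
    return (mask * signs).astype(np.int8)

rng = np.random.default_rng(0)
nu = sample_ternary_perturbation(rng, (4, 1024), s=1e-2)
print(np.count_nonzero(nu))                         # ≈ s · 4096 nonzero components
```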
Remarkably, only the sign of the alignment is required for estimation.
$$\widehat{\nabla f} \;=\; \sum_i \nu_i\,\operatorname{sgn}\alpha(\nu_i) \tag{4}$$
Convergence is improved by rejecting perturbations with alignment magnitude below the step median:
$$\alpha_\tau(\nu) \;=\; \operatorname{sgn}\alpha(\nu)\,\mathbb{1}\!\left[\,|\alpha(\nu)| \ge \operatorname{median}_j\,|\alpha(\nu_j)|\,\right] \;\in\; T \tag{5}$$
$$\widehat{\nabla f} \;=\; \sum_i \nu_i\,\alpha_\tau(\nu_i) \tag{6}$$
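A sketch of Eqs. (4)–(6), assuming the alignments α(νi) have already been obtained from forward passes (e.g. via the JVP sketch above): perturbations whose alignment magnitude falls below the step median contribute nothing, and the remainder contribute only their sign.

```python
import numpy as np

def ternary_gradient_estimate(perturbations, alphas):
    """Σ ν_i α_τ(ν_i), where α_τ(ν_i) ∈ {-1, 0, +1} follows Eq. (5)."""
    alphas = np.asarray(alphas, dtype=np.float64)
    keep = np.abs(alphas) >= np.median(np.abs(alphas))  # reject weak alignments
    alpha_tau = np.sign(alphas) * keep                  # ternary alignment codes
    est = np.zeros_like(perturbations[0], dtype=np.int32)
    for nu, a in zip(perturbations, alpha_tau):
        est += nu.astype(np.int32) * int(a)
    return est                                           # saturate or clamp to T as desired
```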
The only hyperparameters are the number and sparsity of perturbations. Summation may be performed with saturation,
or in higher precision followed by ternary clamping; both options are sketched below. The algorithm presented here is undoubtedly one member of a
large family, each with varying efficiency and convergence properties.
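The two summation options, sketched in NumPy over the per-perturbation contributions νi ατ(νi); which variant converges better is not established here.

```python
import numpy as np

def sum_saturating(contributions):
    """Accumulate ν_i α_τ(ν_i) with saturation: every intermediate stays in T."""
    acc = np.zeros_like(contributions[0], dtype=np.int8)
    for c in contributions:
        acc = np.clip(acc.astype(np.int32) + c, -1, 1).astype(np.int8)
    return acc

def sum_then_clamp(contributions):
    """Accumulate in higher precision, then clamp back to T once at the end."""
    acc = np.sum(np.asarray(contributions, dtype=np.int32), axis=0)
    return np.clip(acc, -1, 1).astype(np.int8)
```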
2 Representation Efficiency
One Seed is All You Need Pseudorandom noise has a beautiful property: it is deterministic. Vast sequences can be
reproduced from only a seed. This means the perturbation vectors νi need never be stored in memory or transmitted.
Perturbation components can be generated as they are used, faster than the speed of memory.
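A sketch of the idea, assuming a counter-style keying of (run seed, step index, perturbation index); the keying scheme and sampling procedure are illustrative choices, not a prescribed format.

```python
import numpy as np

def perturbation(run_seed, step, i, shape, s=1e-4):
    """Regenerate ν_i deterministically from (seed, step, index) -- nothing is stored."""
    rng = np.random.default_rng([run_seed, step, i])     # counter-style keying
    mask = rng.random(shape) < s
    signs = rng.integers(0, 2, size=shape) * 2 - 1
    return (mask * signs).astype(np.int8)

# The same key always reproduces the same perturbation, so it is never kept in memory.
a = perturbation(42, step=1000, i=3, shape=(1024,))
b = perturbation(42, step=1000, i=3, shape=(1024,))
assert np.array_equal(a, b)
```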
Distributed Training The throughput of distributed training algorithms is generally limited by synchronization,
wherein gradients and optimizer states are exchanged between participants. To mitigate this problem, many sophisticated
schemes have been devised [3]. Without modification, noise_step gradient steps are encoded using only one tern
(1.58 bits) per perturbation, dramatically reducing total communication. Hybridization with existing algorithms may
prove fruitful.
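Under these assumptions, the entire communication payload for one step is its index plus one tern per perturbation; receivers regenerate each νi locally from the shared seed and apply Eq. (6). A rough sketch, with illustrative helper names:

```python
import numpy as np

def regenerate(run_seed, step, i, shape, s=1e-4):
    # Same deterministic regeneration as sketched in the previous section.
    rng = np.random.default_rng([run_seed, step, i])
    return ((rng.random(shape) < s) *
            (rng.integers(0, 2, size=shape) * 2 - 1)).astype(np.int8)

def encode_step(alpha_taus):
    """The entire payload for one step: one tern per perturbation."""
    return np.asarray(alpha_taus, dtype=np.int8)          # values in {-1, 0, +1}

def apply_step(weights, run_seed, step, alpha_taus, s=1e-4):
    """Receiver side: regenerate each ν_i from the shared seed and apply Eq. (6)."""
    acc = weights.astype(np.int32)
    for i, a in enumerate(alpha_taus):
        if a != 0:                                        # rejected perturbations cost nothing
            acc += regenerate(run_seed, step, i, weights.shape, s).astype(np.int32) * int(a)
    return np.clip(acc, -1, 1).astype(np.int8)
```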
A complete gradient step can now be represented in a few machine words. With such compact steps, a simple model
representation emerges with remarkable properties.
Model Transport To download a model is to download its weights. Because weight initialization is pseudorandom, the
initialization too can be recovered from only a seed. By expressing a model as its steps, the transport size of a model no longer directly
depends on the number of parameters, but on the product of steps and perturbations. With great hubris, one can roughly
estimate the size of a ternary GPT-3 175B in this format to be between 600KB and 19MB.¹ Size reductions also apply
to higher precision models, with size increasing linearly in alignment precision. As discussed later, efficient recovery to
weight space relies on perturbation sparsity.
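Footnote 1's estimate, reproduced as arithmetic; the 300B training tokens and 3.2M-token batch size are the assumed GPT-3 175B training figures carried over from that footnote.

```python
import math

steps = 300e9 / 3.2e6                          # ≈ 93750 optimizer steps (footnote 1)
for samples in (32, 1024):                     # perturbations per step
    bits = math.log2(3) * steps * samples      # one tern per perturbation per step
    print(samples, f"{bits / 8 / 1e6:.1f} MB") # ≈ 0.6 MB and ≈ 19.0 MB
```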
What Can Be Model updates also benefit: to represent additional full-rank training, one can simply specify the
additional steps. The initialization can be a point in weight space (i.e. a base model), or a prior sequence of steps.
Unburdened By What Has Been The complete history of model weights can be recovered at a cost proportional to the product of
the number of parameters recovered, perturbations used, and steps consumed. Any subset of model weights can be
recovered independently with no overhead. Training can be resumed from any previous step in model history. These
statements are all true a priori. It may also be possible to edit past training steps, e.g. through masking or negation, but
this exotic trick requires empirical validation not performed here.
The Burden The reconstruction algorithm is simple, embarrassingly parallel, highly local, and produces its results
from noise. It is ideal for modern hardware. However, model step reduction has complexity O(nks) in the n parameters,
k perturbations per step, and s steps. For large models with high step count, reconstruction becomes impractical, as every
weight must be updated with every perturbation vector from every step. Complete reconstruction of a ternary model similar
to GPT-3 175B would require roughly 10¹⁹ sums.²
Liberation The perturbation vectors have a fortunate property: almost every component is zero. Reducing over only the
non-zero components lowers the reconstruction cost by a factor of the average noise density. Because most steps occur
in the ultra-high sparsity regime, reconstruction cost is made tolerable. However, the memory access pattern becomes
non-local. Modifications to how perturbation noise is generated may admit a local implementation.
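A sketch of sparse reconstruction under one possible representation: a model is an initialization plus, for every step, its ternary ατ codes. Generating only the nonzero components of each perturbation (here by sampling their count and positions directly) keeps the cost proportional to the noise density, at the price of the scatter-style, non-local memory access noted above. Layout and generation scheme are illustrative assumptions.

```python
import numpy as np

def sparse_perturbation(run_seed, step, i, n, s=1e-4):
    """Generate only the nonzero entries of ν_i as (indices, signs): O(s·n) work."""
    rng = np.random.default_rng([run_seed, step, i])
    k = rng.binomial(n, s)                                 # expected s·n nonzeros
    idx = rng.choice(n, size=k, replace=False)
    sgn = (rng.integers(0, 2, size=k) * 2 - 1).astype(np.int32)
    return idx, sgn

def reconstruct(run_seed, step_codes, n, s=1e-4):
    """Replay the step history; cost scales with the nonzeros actually touched."""
    w = np.zeros(n, dtype=np.int32)                        # or a seeded ternary initialization
    for step, alpha_taus in enumerate(step_codes):         # one tern per perturbation per step
        for i, a in enumerate(alpha_taus):
            if a != 0:
                idx, sgn = sparse_perturbation(run_seed, step, i, n, s)
                w[idx] += sgn * int(a)                     # scatter-add: non-local access
    return np.clip(w, -1, 1).astype(np.int8)               # or saturate per step instead
```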
¹ steps ≈ 300B training tokens / 3.2M batch size = 93750, samples ∈ [32, 1024], bits = log₂ 3 · steps · samples ∈ [4.74 × 10⁶, 1.5168 × 10⁸]
² ops ≥ weights · samples · steps ∈ [5.25 × 10¹⁷, 1.68 × 10¹⁹]
3 Convergence Properties
Because of the discrete geometry of ternary space, gradient steps are also discrete. Weight trajectories are necessarily
discontinuous. Loss curves carry greater noise than in high precision networks, and model performance at fixed size is
slightly inferior. Larger batch sizes are required, but optimization proceeds aggressively and converges similarly to
Adam.
To demonstrate convergence behavior, a simple MLP was trained to classify MNIST samples using noise_step and
Adam. The benchmark model is a ReLU MLP with 4 layers, layer normalization, and a hidden dimension of 256.
All weights are confined to ternary; all activations are f32. The Adam optimizer stores weights in full precision, which
are clamped to ternary for the forward pass. This constitutes a Straight-Through Estimator, as used in the BitNet paper.
It should be noted that the network does not strictly belong to the BitNet class, as the ternary dense layers are not given
a continuous scale parameter.³
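For reference, the ternary clamp with a straight-through gradient can be written in JAX as below; the round-and-clip quantizer is an assumption, since the text specifies only "clamped to ternary" and no scale parameter.

```python
import jax
import jax.numpy as jnp

def ternary_ste(w):
    """Forward pass sees the ternary clamp; the gradient passes straight through.
    Round-and-clip is an assumed quantizer (no scale parameter)."""
    q = jnp.clip(jnp.round(w), -1.0, 1.0)
    return w + jax.lax.stop_gradient(q - w)
```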
Figure 1: noise_step loss and accuracy curves { samples=128, density=6 × 10⁻⁵ → 3.7 × 10⁻⁶ }.
Figure 2: Adam optimizer loss and accuracy curves, default parameters.
Empirically, noise_step encounters difficulty in the high-step regime, where only a tiny portion of model parameters
remain with suboptimal values. Increasing noise sparsity improves convergence. For this demonstration, the noise
density is scheduled crudely in two stages: 6 × 10⁻⁵ → 1.5 × 10⁻⁵ when L < 2, and 1.5 × 10⁻⁵ → 3.7 × 10⁻⁶ when
L < 1. More principled and flexible methods for sparsity scheduling are needed.
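The two-stage schedule used here, written out as a function of the training loss L; the thresholds and densities are exactly those quoted above, while the step-function form is simply the crude schedule described.

```python
def noise_density(loss):
    """Two-stage density schedule used for the MNIST demonstration (assumed step form)."""
    if loss >= 2.0:
        return 6e-5
    if loss >= 1.0:
        return 1.5e-5
    return 3.7e-6
```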
4 Implementation Advice
This work does not provide optimized kernels for noise_step training. Many optimization opportunities exist; some
are outlined here.
Ternary Codes Seemingly without exception, ternary compute kernels use 2-bit encodings to represent ternary values. This
approach has the advantage of simplicity: simple shifting can be used to isolate terns, and arithmetic is easier to perform.
For this simplicity, 2-bit encodings pay a 27% space overhead cost. To minimize memory usage and access,
a more efficient encoding is needed. Previous work on the binary representation of ternary numbers offers excellent
space efficiency, packing 5 terns per byte [4]. Utilizing this encoding can reduce space overhead to just 0.95%. Even
smaller representations are possible when encoding steps, as the distribution of ατ (ν) is biased, evenly split between
zero and uniform sign noise. Development of efficient transport encodings for steps is left to future work.
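The ≈0.95% figure corresponds to base-3 packing of 5 terns per byte (3⁵ = 243 ≤ 256, i.e. 1.6 bits per tern). The sketch below implements this straightforward base-3 variant; the exact bit layout of [4] may differ.

```python
import numpy as np

def pack_terns(terns):
    """Pack 5 terns into one byte via base-3: 3^5 = 243 ≤ 256, i.e. 1.6 bits per tern."""
    t = np.asarray(terns, dtype=np.int16) + 1              # map {-1, 0, +1} -> {0, 1, 2}
    t = np.pad(t, (0, -len(t) % 5)).reshape(-1, 5)         # pad to a multiple of 5
    return (t * 3 ** np.arange(5)).sum(axis=1).astype(np.uint8)

def unpack_terns(packed, n):
    """Recover the first n terns from the packed byte stream."""
    digits = (packed[:, None] // 3 ** np.arange(5)) % 3    # base-3 digits of each byte
    return (digits.reshape(-1)[:n] - 1).astype(np.int8)

codes = np.random.default_rng(0).integers(-1, 2, size=17).astype(np.int8)
assert np.array_equal(unpack_terns(pack_terns(codes), 17), codes)
```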
JVP Sparsity The batched JVP is computed alongside an inference pass according to a set of pushforward rules,
similar to differentiation rules in reverse mode. High perturbation sparsity means many pushforward rules can be
simplified, often reading and producing only a few elements or matrix columns.
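As an example of such a simplification, consider the pushforward of a dense layer y = xW under a weight perturbation V: the tangent is dy = xV. When V is nonzero only in a few columns, only those columns of V need be read and only those output columns are produced. A NumPy sketch with hypothetical helper names:

```python
import numpy as np

def linear_jvp_dense(x, V):
    """Pushforward of y = x @ W under weight perturbation V: dy = x @ V (reads all of V)."""
    return x @ V

def linear_jvp_sparse(x, nz_cols, V_sub, out_dim):
    """Same tangent when V is nonzero only in columns nz_cols: read just V[:, nz_cols],
    produce just those output columns."""
    dy = np.zeros((x.shape[0], out_dim), dtype=x.dtype)
    dy[:, nz_cols] = x @ V_sub
    return dy

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
V = np.zeros((64, 128), dtype=np.float32)
nz = [3, 17, 90]
V[:, nz] = rng.integers(-1, 2, size=(64, len(nz)))
assert np.allclose(linear_jvp_dense(x, V), linear_jvp_sparse(x, nz, V[:, nz], 128))
```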
³ A future revision of this preprint will contain amendments allowing co-optimization of ternary and high precision values.
Perturbation Orthogonality, Column Exclusivity Orthogonal perturbations are ideal, extracting maximum information
about the gradient. Sparse perturbations tend to already be highly orthogonal. Given strict orthogonality
and column exclusivity⁴, there will be no overlap in the non-zero elements of JVP intermediates. This allows the
intermediates over many perturbations to be stored together densely, reducing kernel shared memory usage.
5 Related Work
This work presents a novel algorithm enabling direct ternary training. It should be explicitly stated that its structure draws
influence from [2]. Though not directly related to the present work, the TernGrad algorithm performs a stochastic quantization
of high precision gradients to ternary, reducing communication in distributed learning contexts [5]. Speculatively, there
may exist a fundamental connection between the present work and 1-bit compressed sensing.
References
[1] Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang,
Jilong Xue, and Furu Wei. The era of 1-bit LLMs: All large language models are in 1.58 bits, 2024.
[2] Atılım Güneş Baydin, Barak A. Pearlmutter, Don Syme, Frank Wood, and Philip Torr. Gradients without
backpropagation, 2022.
[3] Matthias Langer, Zhen He, Wenny Rahayu, and Yanbo Xue. Distributed training of deep learning models: A
taxonomic perspective. IEEE Transactions on Parallel and Distributed Systems, 31(12):2802–2818, December
2020.
[4] Olivier Muller, Adrien Prost-Boucle, Alban Bourge, and Frédéric Pétrot. Efficient decompression of binary encoded
balanced ternary sequences. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(8):1962–1966,
2019.
[5] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients
to reduce communication in distributed deep learning. CoRR, abs/1705.07878, 2017.
⁴ Column exclusivity means the nonzero elements in the batched perturbations of a specific matrix will always be in separate columns.