noise_step: Training in 1.58b With No Gradient Memory
Preprint
Will Brickner
[email protected]
Abstract
Training large machine learning models is fantastically computationally and energetically burdensome.
With astronomical costs, training runs are a risky endeavor accessible only to a trace portion of
humanity. By restricting weights to low precision, inference throughput and energy consumption
can be dramatically improved. It was recently shown that LLM inference can be performed in
1.58-bit (ternary) precision without any performance loss [1]. However, training remains unimproved,
occurring in f16 precision. This paper presents an algorithm which trains directly in ternary precision,
operates without backpropagation or momentum, and can run concurrently with model inference
at similar cost. The gradient representation is exceptionally well-suited to distributed training
paradigms. A simple model representation naturally follows, with several remarkable properties.
Massive reduction in the memory and energy usage of model training is enabled.
1 Gradient Estimation
Using the Jacobian-vector product (JVP), one can compute the alignment between an arbitrary perturbation vector ν and the
loss gradient exactly. The alignment can be computed in tandem with f(x), and requires materializing neither the full Jacobian nor the gradient ∇f.
$$\alpha_f(\nu) \;=\; J_f\,\nu \;=\; \nabla f \cdot \nu \tag{1}$$
$$\widehat{\nabla f} \;=\; \frac{1}{n}\sum_i \nu_i\,\alpha(\nu_i), \qquad \nu_i \sim \mathcal{N}(0, I) \tag{2}$$
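As a minimal illustration of Eqs. (1)–(2), the estimator can be sketched in JAX; the quadratic loss and the sample count are illustrative stand-ins, not part of this work. The essential point is that jax.jvp returns f(x) together with the alignment α(ν) = ∇f · ν from a single forward-mode pass, so the gradient is never materialized.

```python
import jax
import jax.numpy as jnp

# Minimal sketch of Eqs. (1)-(2). The quadratic loss is an illustrative stand-in.
def loss(w):
    return jnp.sum((w - 1.0) ** 2)

def estimate_gradient(w, key, n=64):
    est = jnp.zeros_like(w)
    for k in jax.random.split(key, n):
        nu = jax.random.normal(k, w.shape)        # ν_i ~ N(0, I)
        _, alpha = jax.jvp(loss, (w,), (nu,))     # α(ν_i) = ∇f · ν_i, computed alongside f(w)
        est = est + nu * alpha                    # accumulate ν_i α(ν_i)
    return est / n

w = jnp.zeros(8)
print(estimate_gradient(w, jax.random.PRNGKey(0)))  # ≈ ∇loss(w) = 2(w - 1) = -2 everywhere
```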
In the ternary domain, values are restricted to T = {−1, 0, +1}. Directions and magnitudes of ternary vectors are highly
constrained; vector weighting is not useful. Gradient estimation can be recovered if the perturbations νi are sparse.
$$\nu_i \;\sim\; \operatorname{Bernoulli}(s) \odot \mathcal{U}\{-1, +1\} \tag{3}$$
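A minimal NumPy sketch of Eq. (3); the sparsity s used below is an arbitrary illustrative value.

```python
import numpy as np

def sample_ternary_perturbation(rng, shape, s=1e-4):
    """ν ~ Bernoulli(s) ⊙ U{-1, +1}: almost all zeros, a sparse sprinkling of ±1."""
    mask = rng.random(shape) < s                    # Bernoulli(s) support
    signs = rng.integers(0, 2, size=shape) * 2 - 1  # uniform ±1
    return (mask * signs).astype(np.int8)

rng = np.random.default_rng(0)
nu = sample_ternary_perturbation(rng, (4, 1024), s=1e-2)
print(np.count_nonzero(nu))                         # ≈ s · 4096 nonzero components
```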
Remarkably, only the sign of the alignment is required for estimation.
$$\widehat{\nabla f} \;=\; \sum_i \nu_i\,\operatorname{sgn}\alpha(\nu_i) \tag{4}$$
Convergence is improved by rejecting perturbations with alignment magnitude below the step median:
$$\alpha_\tau(\nu) \;=\; \operatorname{sgn}\alpha(\nu)\,\mathbb{1}\!\left[\,|\alpha(\nu)| \ge \operatorname{median}_j\,|\alpha(\nu_j)|\,\right] \;\in\; T \tag{5}$$
$$\widehat{\nabla f} \;=\; \sum_i \nu_i\,\alpha_\tau(\nu_i) \tag{6}$$
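A sketch of Eqs. (4)–(6), assuming the alignments α(νi) have already been obtained from forward passes (e.g. via the JVP sketch above): perturbations whose alignment magnitude falls below the step median contribute nothing, and the remainder contribute only their sign.

```python
import numpy as np

def ternary_gradient_estimate(perturbations, alphas):
    """Σ ν_i α_τ(ν_i), where α_τ(ν_i) ∈ {-1, 0, +1} follows Eq. (5)."""
    alphas = np.asarray(alphas, dtype=np.float64)
    keep = np.abs(alphas) >= np.median(np.abs(alphas))  # reject weak alignments
    alpha_tau = np.sign(alphas) * keep                  # ternary alignment codes
    est = np.zeros_like(perturbations[0], dtype=np.int32)
    for nu, a in zip(perturbations, alpha_tau):
        est += nu.astype(np.int32) * int(a)
    return est                                           # saturate or clamp to T as desired
```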
The only hyperparameters are the number and sparsity of perturbations. Summation may be performed with saturation,
or in higher precision followed by ternary clamping; both options are sketched below. The algorithm presented here is undoubtedly one member of a
large family, each with varying efficiency and convergence properties.
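The two summation options, sketched in NumPy over the per-perturbation contributions νi ατ(νi); which variant converges better is not established here.

```python
import numpy as np

def sum_saturating(contributions):
    """Accumulate ν_i α_τ(ν_i) with saturation: every intermediate stays in T."""
    acc = np.zeros_like(contributions[0], dtype=np.int8)
    for c in contributions:
        acc = np.clip(acc.astype(np.int32) + c, -1, 1).astype(np.int8)
    return acc

def sum_then_clamp(contributions):
    """Accumulate in higher precision, then clamp back to T once at the end."""
    acc = np.sum(np.asarray(contributions, dtype=np.int32), axis=0)
    return np.clip(acc, -1, 1).astype(np.int8)
```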
2 Representation Efficiency
One Seed is All You Need Pseudorandom noise has a beautiful property: it is deterministic. Vast sequences can be
reproduced from only a seed. This means the perturbation vectors νi need never be stored in memory or transmitted.
Perturbation components can be generated as they are used, faster than the speed of memory.
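A sketch of the idea, assuming a counter-style keying of (run seed, step index, perturbation index); the keying scheme and sampling procedure are illustrative choices, not a prescribed format.

```python
import numpy as np

def perturbation(run_seed, step, i, shape, s=1e-4):
    """Regenerate ν_i deterministically from (seed, step, index) -- nothing is stored."""
    rng = np.random.default_rng([run_seed, step, i])     # counter-style keying
    mask = rng.random(shape) < s
    signs = rng.integers(0, 2, size=shape) * 2 - 1
    return (mask * signs).astype(np.int8)

# The same key always reproduces the same perturbation, so it is never kept in memory.
a = perturbation(42, step=1000, i=3, shape=(1024,))
b = perturbation(42, step=1000, i=3, shape=(1024,))
assert np.array_equal(a, b)
```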
Distributed Training The throughput of distributed training algorithms is generally limited by synchronization,
wherein gradients and optimizer states are exchanged between participants. To mitigate this problem, many sophisticated
schemes have been devised [3]. Without modification, noise_step gradient steps are encoded using only one tern
(1.58 bits) per perturbation, dramatically reducing total communication. Hybridization with existing algorithms may
prove fruitful.
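Under these assumptions, the entire communication payload for one step is its index plus one tern per perturbation; receivers regenerate each νi locally from the shared seed and apply Eq. (6). A rough sketch, with illustrative helper names:

```python
import numpy as np

def regenerate(run_seed, step, i, shape, s=1e-4):
    # Same deterministic regeneration as sketched in the previous section.
    rng = np.random.default_rng([run_seed, step, i])
    return ((rng.random(shape) < s) *
            (rng.integers(0, 2, size=shape) * 2 - 1)).astype(np.int8)

def encode_step(alpha_taus):
    """The entire payload for one step: one tern per perturbation."""
    return np.asarray(alpha_taus, dtype=np.int8)          # values in {-1, 0, +1}

def apply_step(weights, run_seed, step, alpha_taus, s=1e-4):
    """Receiver side: regenerate each ν_i from the shared seed and apply Eq. (6)."""
    acc = weights.astype(np.int32)
    for i, a in enumerate(alpha_taus):
        if a != 0:                                        # rejected perturbations cost nothing
            acc += regenerate(run_seed, step, i, weights.shape, s).astype(np.int32) * int(a)
    return np.clip(acc, -1, 1).astype(np.int8)
```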
A complete gradient step can now be represented in a few machine words. With such compact steps, a simple model
representation emerges with remarkable properties.
Model Transport To download a model is to download its weights. Because weight initialization is pseudorandom, the
initialization too can be recovered from only a seed. By expressing a model as its steps, the transport size of a model no longer directly
depends on the number of parameters, but on the product of steps and perturbations. With great hubris, one can roughly
estimate the size of a ternary GPT-3 175B in this format to be between 600KB and 19MB.¹ Size reductions also apply
to higher precision models, with size increasing linearly in alignment precision. As discussed later, efficient recovery to
weight space relies on perturbation sparsity.
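Footnote 1's estimate, reproduced as arithmetic; the 300B training tokens and 3.2M-token batch size are the assumed GPT-3 175B training figures carried over from that footnote.

```python
import math

steps = 300e9 / 3.2e6                          # ≈ 93750 optimizer steps (footnote 1)
for samples in (32, 1024):                     # perturbations per step
    bits = math.log2(3) * steps * samples      # one tern per perturbation per step
    print(samples, f"{bits / 8 / 1e6:.1f} MB") # ≈ 0.6 MB and ≈ 19.0 MB
```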
What Can Be Model updates also benefit: to represent additional full-rank training, one can simply specify the
additional steps. The initialization can be a point in weight space (i.e. a base model), or a prior sequence of steps.
Unburdened By What Has Been The complete history of model weights can be recovered at a cost proportional to the product of
the number of parameters recovered, perturbations used, and steps consumed. Any subset of model weights can be
recovered independently with no overhead. Training can be resumed from any previous step in model history. These
statements are all true a priori. It may also be possible to edit past training steps, e.g. through masking or negation, but
this exotic trick requires empirical validation not performed here.
The Burden The reconstruction algorithm is simple, embarrassingly parallel, highly local, and produces its results
from noise. It is ideal for modern hardware. However, model step reduction has complexity O(nks) in the n parameters,
k perturbations per step, and s steps. For large models with high step count, reconstruction becomes impractical, as every
weight must be updated with every perturbation vector from every step. Complete reconstruction of a ternary model similar
to GPT-3 175B would require roughly 10¹⁹ sums.²
Liberation The perturbation vectors have a fortunate property: almost every component is zero. Reducing over only the
non-zero components lowers the reconstruction cost by a factor of the average noise density. Because most steps occur
in the ultra-high sparsity regime, reconstruction cost is made tolerable. However, the memory access pattern becomes
non-local. Modifications to how perturbation noise is generated may admit a local implementation.
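A sketch of sparse reconstruction under one possible representation: a model is an initialization plus, for every step, its ternary ατ codes. Generating only the nonzero components of each perturbation (here by sampling their count and positions directly) keeps the cost proportional to the noise density, at the price of the scatter-style, non-local memory access noted above. Layout and generation scheme are illustrative assumptions.

```python
import numpy as np

def sparse_perturbation(run_seed, step, i, n, s=1e-4):
    """Generate only the nonzero entries of ν_i as (indices, signs): O(s·n) work."""
    rng = np.random.default_rng([run_seed, step, i])
    k = rng.binomial(n, s)                                 # expected s·n nonzeros
    idx = rng.choice(n, size=k, replace=False)
    sgn = (rng.integers(0, 2, size=k) * 2 - 1).astype(np.int32)
    return idx, sgn

def reconstruct(run_seed, step_codes, n, s=1e-4):
    """Replay the step history; cost scales with the nonzeros actually touched."""
    w = np.zeros(n, dtype=np.int32)                        # or a seeded ternary initialization
    for step, alpha_taus in enumerate(step_codes):         # one tern per perturbation per step
        for i, a in enumerate(alpha_taus):
            if a != 0:
                idx, sgn = sparse_perturbation(run_seed, step, i, n, s)
                w[idx] += sgn * int(a)                     # scatter-add: non-local access
    return np.clip(w, -1, 1).astype(np.int8)               # or saturate per step instead
```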
¹ steps ≈ 300B training tokens / 3.2M batch size = 93750, samples ∈ [32, 1024], bits = log₂ 3 · steps · samples ∈ [4.74 × 10⁶, 1.5168 × 10⁸]
² ops ≥ weights · samples · steps ∈ [5.25 × 10¹⁷, 1.68 × 10¹⁹]
3 Convergence Properties
Because of the discrete geometry of ternary space, gradient steps are also discrete. Weight trajectories are necessarily
discontinuous. Loss curves carry greater noise than in high precision networks, and model performance at fixed size is
slightly inferior. Larger batch sizes are required, but optimization proceeds aggressively and converges similarly to
Adam.
To demonstrate convergence behavior, a simple MLP was trained to classify MNIST samples using noise_step and
Adam. The benchmark model is a ReLU MLP with 4 layers, layer normalization, and a hidden dimension of 256.
All weights are confined to ternary; all activations are f32. The Adam optimizer stores weights in full precision, which
are clamped to ternary for the forward pass. This constitutes a Straight-Through Estimator, as used in the BitNet paper.
It should be noted that the network does not strictly belong to the BitNet class, as the ternary dense layers are not given
a continuous scale parameter.³
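For reference, the ternary clamp with a straight-through gradient can be written in JAX as below; the round-and-clip quantizer is an assumption, since the text specifies only "clamped to ternary" and no scale parameter.

```python
import jax
import jax.numpy as jnp

def ternary_ste(w):
    """Forward pass sees the ternary clamp; the gradient passes straight through.
    Round-and-clip is an assumed quantizer (no scale parameter)."""
    q = jnp.clip(jnp.round(w), -1.0, 1.0)
    return w + jax.lax.stop_gradient(q - w)
```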
Figure 1: noise_step loss and accuracy curves { samples=128, density=6 × 10⁻⁵ → 3.7 × 10⁻⁶ }.
Figure 2: Adam optimizer loss and accuracy curves, default parameters.
Empirically, noise_step encounters difficulty in the high-step regime, where only a tiny portion of model parameters
remain with suboptimal values. Increasing noise sparsity improves convergence. For this demonstration, the noise
density is scheduled crudely in two stages: 6 × 10⁻⁵ → 1.5 × 10⁻⁵ when L < 2, and 1.5 × 10⁻⁵ → 3.7 × 10⁻⁶ when
L < 1. More principled and flexible methods for sparsity scheduling are needed.
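The two-stage schedule used here, written out as a function of the training loss L; the thresholds and densities are exactly those quoted above, while the step-function form is simply the crude schedule described.

```python
def noise_density(loss):
    """Two-stage density schedule used for the MNIST demonstration (assumed step form)."""
    if loss >= 2.0:
        return 6e-5
    if loss >= 1.0:
        return 1.5e-5
    return 3.7e-6
```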
4 Implementation Advice
This work does not provide optimized kernels for noise_step training. Many optimization opportunities exist; some
are outlined here.
Ternary Codes Seemingly without exception, ternary compute kernels use 2-bit encodings to represent ternary values. This
approach has the advantage of simplicity: simple shifting can be used to isolate terns, and arithmetic is easier to perform.
For this simplicity, 2-bit encodings pay a 27% space overhead cost. To minimize memory usage and access,
a more efficient encoding is needed. Previous work on the binary representation of ternary numbers offers excellent
space efficiency, packing 5 terns per byte [4]. Utilizing this encoding can reduce space overhead to just 0.95%. Even
smaller representations are possible when encoding steps, as the distribution of ατ (ν) is biased, evenly split between
zero and uniform sign noise. Development of efficient transport encodings for steps is left to future work.
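The ≈0.95% figure corresponds to base-3 packing of 5 terns per byte (3⁵ = 243 ≤ 256, i.e. 1.6 bits per tern). The sketch below implements this straightforward base-3 variant; the exact bit layout of [4] may differ.

```python
import numpy as np

def pack_terns(terns):
    """Pack 5 terns into one byte via base-3: 3^5 = 243 ≤ 256, i.e. 1.6 bits per tern."""
    t = np.asarray(terns, dtype=np.int16) + 1              # map {-1, 0, +1} -> {0, 1, 2}
    t = np.pad(t, (0, -len(t) % 5)).reshape(-1, 5)         # pad to a multiple of 5
    return (t * 3 ** np.arange(5)).sum(axis=1).astype(np.uint8)

def unpack_terns(packed, n):
    """Recover the first n terns from the packed byte stream."""
    digits = (packed[:, None] // 3 ** np.arange(5)) % 3    # base-3 digits of each byte
    return (digits.reshape(-1)[:n] - 1).astype(np.int8)

codes = np.random.default_rng(0).integers(-1, 2, size=17).astype(np.int8)
assert np.array_equal(unpack_terns(pack_terns(codes), 17), codes)
```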
JVP Sparsity The batched JVP is computed alongside an inference pass according to a set of pushforward rules,
similar to differentiation rules in reverse mode. High perturbation sparsity means many pushforward rules can be
simplified, often reading and producing only a few elements or matrix columns.
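As an example of such a simplification, consider the pushforward of a dense layer y = xW under a weight perturbation V: the tangent is dy = xV. When V is nonzero only in a few columns, only those columns of V need be read and only those output columns are produced. A NumPy sketch with hypothetical helper names:

```python
import numpy as np

def linear_jvp_dense(x, V):
    """Pushforward of y = x @ W under weight perturbation V: dy = x @ V (reads all of V)."""
    return x @ V

def linear_jvp_sparse(x, nz_cols, V_sub, out_dim):
    """Same tangent when V is nonzero only in columns nz_cols: read just V[:, nz_cols],
    produce just those output columns."""
    dy = np.zeros((x.shape[0], out_dim), dtype=x.dtype)
    dy[:, nz_cols] = x @ V_sub
    return dy

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
V = np.zeros((64, 128), dtype=np.float32)
nz = [3, 17, 90]
V[:, nz] = rng.integers(-1, 2, size=(64, len(nz)))
assert np.allclose(linear_jvp_dense(x, V), linear_jvp_sparse(x, nz, V[:, nz], 128))
```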
³ A future revision of this preprint will contain amendments allowing co-optimization of ternary and high precision values.
Perturbation Orthogonality, Column Exclusivity Orthogonal perturbations are ideal, extracting maximum information
about the gradient. Sparse perturbations tend to already be highly orthogonal. Given strict orthogonality
and column exclusivity⁴, there will be no overlap in the non-zero elements of JVP intermediates. This allows the
intermediates over many perturbations to be stored together densely, reducing kernel shared memory usage.
5 Related Work
This work presents a novel algorithm enabling direct ternary training. It should be explicitly stated that its structure draws
influence from [2]. Though not directly related to the present work, the TernGrad algorithm performs a stochastic quantization
of high precision gradients to ternary, reducing communication in distributed learning contexts [5]. Speculatively, there
may exist a fundamental connection between the present work and 1-bit compressed sensing.
References
[1] Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang,
Jilong Xue, and Furu Wei. The era of 1-bit LLMs: All large language models are in 1.58 bits, 2024.
[2] Atılım Güneş Baydin, Barak A. Pearlmutter, Don Syme, Frank Wood, and Philip Torr. Gradients without
backpropagation, 2022.
[3] Matthias Langer, Zhen He, Wenny Rahayu, and Yanbo Xue. Distributed training of deep learning models: A
taxonomic perspective. IEEE Transactions on Parallel and Distributed Systems, 31(12):2802–2818, December
2020.
[4] Olivier Muller, Adrien Prost-Boucle, Alban Bourge, and Frédéric Pétrot. Efficient decompression of binary encoded
balanced ternary sequences. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(8):1962–1966,
2019.
[5] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients
to reduce communication in distributed deep learning. CoRR, abs/1705.07878, 2017.
⁴ Column exclusivity means the nonzero elements in the batched perturbations of a specific matrix will always be in separate columns.