Generative Language Modeling for Automated Theorem Proving
Abstract
We explore the application of transformer-based language models to automated
theorem proving. This work is motivated by the possibility that a major limitation
of automated theorem provers compared to humans – the generation of original
mathematical terms – might be addressable via generation from language models.
We present an automated prover and proof assistant, GPT-f, for the Metamath
formalization language, and analyze its performance. GPT-f found new short proofs
that were accepted into the main Metamath library, which is, to our knowledge, the
first time a deep learning based system has contributed proofs that were adopted by
a formal mathematics community.
1 Introduction
Artificial neural networks have enjoyed a spectacularly successful decade, having made considerable
advances in computer vision [1, 2], translation [3, 4, 5], speech recognition [6, 7], image generation
[8, 9, 10, 11, 12], game playing [13, 14, 15], and robotics [16, 17]. Especially notable is the recent
rapid progress in language understanding and generation capabilities [18, 19, 20, 21, 22].
With the possible exception of AlphaGo [13] and AlphaZero [23], reasoning tasks are conspicuously
absent from the list above. In this work we take a step towards addressing this absence by applying a
transformer language model to automated theorem proving.
Automated theorem proving [24] is an appealing domain for exploring reasoning in general and the
reasoning capabilities of language models in particular for several reasons:
• Reasoning-complete: Proving theorems very likely requires general and flexible reasoning;
thus an advance in theorem proving is also an advance in reasoning more broadly.
• Search: Automated theorem proving systems can quickly check the correctness of proofs,
making it a productive environment for the use and development of search methods.
• Automated data generation: The ability to verify proofs makes it possible to automatically
generate new problems that could then be used as training data. This is especially important,
since collecting high quality data for reasoning tasks can be difficult.
Learning to prove theorems is somewhat analogous to learning to play Go: both offer an automated
way of determining success (the game of Go is a miniature formal system), and both offer an
automated way of generating new data via self-play-type approaches. This similarity, together with
the clear success of AlphaZero, suggests that automated theorem proving might prove to be a fruitful
domain for the study of reasoning in neural networks where significant progress may be possible.
Our contributions are the following:
• We verify that generative pre-training substantially improves performance, and that
pre-training on mathematical data (such as arXiv) leads to better performance than
pre-training on generic text from the web.
• We find that model size is positively correlated with performance, even though the size of
the Metamath dataset is relatively small.
• We also achieve a new state of the art for the Metamath environment with our best model
capable of closing 56.22% of proofs from a held-out test set (vs 21.16% for the current
state of the art, MetaGen-IL [25]), demonstrating that the Transformer architecture may be
suitable for formal reasoning.
2 Related Work
Deep learning applied to premise selection and proof guidance. Research on automated theorem
proving dates back to the 1950s [24], but mainstream proof assistants still suffer from the combinatorial
explosion of their search space as they are scaled to large corpora, motivating the use of deep learning.
Early applications of deep learning to formal mathematics focused primarily on premise selection and
proof guidance. DeepMath [26] explored the use of CNNs and RNNs to predict whether a premise is
useful for demonstrating a given conjecture; their results were later improved by FormulaNet [27] through
the use of graph neural networks, reminiscent of NeuroSAT [28]. Proof guidance consists of selecting
the next clause to process inside an automated theorem prover. Loos et al. [29] investigated the use of
models similar to DeepMath’s for proof guidance and demonstrated a significant uplift on the Mizar
library.
Deep learning applied to automated theorem proving. HOList [30] proposes a formal environment
based on HOL Light. They achieve their best performance [31] with a GNN model designed
for premise selection and the use of exploration. More recently, the same team studied the use of
the BERT objective with Transformers on formal statements [32], demonstrating the potential of
leveraging Transformers for formal reasoning. Their study focuses on preliminary tasks that are
related to, but do not directly consist of, proving formal theorems (such as typing and conjecturing).
GamePad [33] and CoqGym/ASTactic [34] introduce environments based on the Coq theorem
prover. ASTactic generates tactics as programs by sequentially expanding a partial abstract syntax
tree. Holophrasm [35] and MetaGen-IL [25] propose RNN-based models to generate proofs for
Metamath (the formal system we focus on). They rely on three different models, one to value goals,
one to select premises, and one to generate substitutions. MetaGen-IL also demonstrates an uplift in
performance by generating synthetic data through forward proving.
Use of Transformers for symbolic tasks. Several lines of work have explored language
modeling using Transformers [18]. Language modeling improvements have been demonstrated from
better pre-training tasks, using various objectives such as auto-regressive generation [19, 20, 21], token
masking [22] or sequence masking [36], but the resulting language models have so far fallen short when
applied to reasoning-oriented tasks such as algebraic word problems [37, 38]. Recently, Lample and
Charton [39] successfully applied Transformers to anti-derivative calculus and solving differential
equations, hinting that Transformers are capable of generating the exogenous terms involved in
the substitutions required for successful symbolic integration. The Universal Transformer [40], a
Transformer with tied weights, was also shown to be successful at more algorithmic tasks.
Saxton et al. [41] also evaluated the Transformer architecture on a variety of mathematical problems.
3 Formal Environment
We chose Metamath [42] as our formal environment. Metamath is powered by a simple meta-logic
based on a single substitution rule [43].
The main Metamath library is called set.mm, which is a collection of ∼ 38k proofs based on ZFC set
theory (while other formalisms can also be used on top of Metamath’s meta logic, they are not used
in set.mm).
Metamath has several properties that make it convenient to use with neural networks:
• Metamath does not have high-level tactics, which means that all of its proof steps are
very low-level. As an example, the de Bruijn factor [44] (the quotient of the size of a
formalization of a mathematical text and the size of its informal original) of a
Metamath proof is ∼10-20, while it is around ∼1-3 in Coq, HOL Light, or Lean. Lower-level
proof steps mean longer proofs, with a greater chance of compounding errors during
search.
• The current state of the tooling around Metamath makes it a very “DIY” system, one that is
not yet ready for broad adoption by the mathematics community.
While our approach would be applicable to other formal systems (such as Lean, Coq, or HOL Light),
Metamath’s features allow faster prototyping and reduced iteration time in the near term, which is
why we chose it for this project.
The set.mm library contains the background theorems required to demonstrate most Olympiad-style or
undergraduate mathematics problems. For example, assisted by the GPT-f proof assistant
described in this work in section 6.2, we formalized IMO 1972 problem B2.
3.1 Proving in Metamath
As an example, assume we want to prove |- ( 3 + 2 ) = 5 using the definition of 4 and 5 as respective
successors of 3 and 4. As a first step, we should use an equality transitivity theorem such as:
[[
|- A = B # first hypothesis
|- C = B # second hypothesis
]]
|- A = C # conclusion
To apply the transitivity theorem, we need to provide a substitution that substitutes A with ( 3 + 2 )
and C with 5 such that the conclusion of the theorem unifies with the current goal. We are left with
providing a substitution for B, which can hardly be discovered mechanically (hence the appeal of
generative language modeling). We can substitute B with ( 4 + 1 ), as is the case in the actual proof
in Metamath's set.mm library (Metamath Proof Explorer, 3p2e5: http://us.metamath.org/mpeuni/3p2e5.html).
Putting it all together, the goal here is:
|- ( 3 + 2 ) = 5
[[
|- A = B # first hypothesis
|- C = B # second hypothesis
]]
|- A = C # conclusion
{{ A : ( 3 + 2 ) }} # substitution of A
{{ B : ( 4 + 1 ) }} # substitution of B
{{ C : 5 }} # substitution of C
|- ( 3 + 2 ) = ( 4 + 1 )
|- 5 = ( 4 + 1 )
Applying the following proof step with no hypothesis (the definition of 5, df-5:
http://us.metamath.org/mpeuni/df-5.html) to the second subgoal allows us to prove it:
[[ ]] |- 5 = ( 4 + 1 )
Note that this proof step has no hypothesis and no substitution involved. It therefore closes that
branch of the proof tree. From there the proof can be continued with the first subgoal, proving
backward, until no subgoal is left. Also note that a proof for a given theorem of the library can only
use theorems proven before it; we enforce that constraint when benchmarking our models, even
though they are trained on the library as a whole.
In most formal systems, a proof step consists of a goal and a mechanism that, given the goal, produces
new subgoals; such a mechanism is generally referred to as a tactic. In Metamath, there is only one type
of tactic, based on substitution as illustrated above. Additionally, since the substituted theorem must
unify with the current goal, the current goal can be deduced from the tactic itself (the theorem and
substitution pair), which is not generally the case in other systems. As such, we use tactic and proof
step interchangeably in the rest of the paper.
This informal presentation of Metamath is sufficient to understand the objectives we use to train our
models. A more formal definition of Metamath’s meta-logic can be found in the Metamath Book [42].
3.2 Dataset
Metamath's set.mm uses a binary compressed format to represent proofs of statements. We process the
library and extract a dataset of proof steps, stored as JSON blobs using the representation presented
above. For each proof step we store a GOAL, a PROOFSTEP, and a reference to the parent goal if any,
encoding the tree structure of the proofs:
{
"proof_label": "unidmrn",
"goal": "[[ ]] |- U. U. ‘’ A = ( dom A u. ran A )",
"proof_step": "[[ |- A = B |- C = B ]] |- A = C \\
{{ A : U. U. ‘’ A }} \\
{{ B : ( ran ‘’ A u. dom ‘’ A ) }} \\
{{ C : ( dom A u. ran A ) }}",
"proof_step_hash": "37yZVNorgF8=",
"parent_hash": ["n4Kl7judEN4="]
}
The dataset contains ∼3m such proof steps for ∼38k theorems (distinct proof labels). We split
that dataset into a train set and two valid and test sets, each containing ∼1k proofs sampled
randomly (∼90k proof steps each).
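As an illustration, a minimal sketch (file name and seed are ours) of loading the proof-step JSON lines and splitting them by theorem, so that all steps of a proof land in the same split:

import json
import random
from collections import defaultdict

steps_by_theorem = defaultdict(list)
with open("proof_steps.jsonl") as f:  # one JSON blob per proof step
    for line in f:
        step = json.loads(line)
        steps_by_theorem[step["proof_label"]].append(step)

labels = sorted(steps_by_theorem)
random.Random(0).shuffle(labels)  # fixed seed for a reproducible split
valid, test, train = labels[:1000], labels[1000:2000], labels[2000:]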
3.3 Glossary
term: A string that complies with the Metamath grammar.
statement or proposition: A potentially empty set of hypotheses (terms) and a conclusion (term) entailed by the hypotheses.
theorem: A proven statement.
axiom: An assumed statement.
goal: A statement in the context of a proof search.
substitutions: A list of pairs of variables and terms (used to substitute the variables within a theorem or an axiom).
tactic: A theorem and substitutions that unify with a goal.
subgoals: Goals generated by a tactic (the substituted hypotheses of the tactic's theorem).
proof step: A goal and a tactic, potentially generating new subgoals.
proof: A tree of goals and tactics whose root is the demonstrated theorem; leaves of the tree are tactics with no subgoals or goals that are hypotheses of the root theorem.
4 Model
4.1 Architecture
We use decoder-only Transformers similar to GPT-2 [20] and GPT-3 [21]. The largest model we
study has 36 layers and 774m trainable parameters.
4.2 Training Objective
The proofstep objective we use for training is a conditional language modeling objective trained
to generate the PROOFSTEP given a GOAL, which is directly applicable to proof searches. To do so,
we format our data in the following way:

GOAL <goal> PROOFSTEP <proofstep>

There is one such objective for each JSON line in our dataset. We train with only one sentence per
context (no chunking), masking the rest of the context by assigning a loss weight w_loss = 0. As we
train, we track the valid loss and sequence accuracy while masking the query (GOAL) part of the objective.
We regularize the training by early stopping at the point of minimum valid loss and by applying a
weight decay w_d = 0.1.
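For illustration, a sketch (helper name ours) of how a proof step from section 3.2 would be serialized into such a training context, assuming the GOAL/PROOFSTEP format above:

def proofstep_context(step: dict) -> str:
    # Conditional language modeling: the model is trained to generate
    # everything after PROOFSTEP, conditioned on the goal (the GOAL part
    # is masked out of the loss).
    return f"GOAL {step['goal']} PROOFSTEP {step['proof_step']}"

# For the unidmrn proof step shown in section 3.2, this yields:
# GOAL [[ ]] |- U. U. `' A = ( dom A u. ran A ) PROOFSTEP [[ |- A = B
# |- C = B ]] |- A = C {{ A : U. U. `' A }} {{ B : ( ran `' A u. dom `' A ) }}
# {{ C : ( dom A u. ran A ) }}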
4.3 Proof Search
We find proofs by running proof searches. A proof search maintains a proof tree and a queue of open
goals sorted by their cumulative logprob, initialized with the root goal that we wish to demonstrate
(see figure 1). The cumulative logprob of a goal is defined as the sum of the logprobs of the tactics
that were used to reach that goal from the root goal. Intuitively, we expand the goals for which the
generative model is the most confident globally. This has a tendency to explore breadth-first, as deeper
goals have more parent tactics and therefore typically a higher cumulative logprob.
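One consistent way to write this priority (treating each tactic's logprob as a negative log-likelihood cost, which matches both the minimal-cumulative-logprob selection below and the breadth-first tendency just described) is

c(g) = Σ_{t ∈ path(root → g)} − log p_θ(t | goal(t)),

where the queue always pops the open goal with minimal c(g).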
Each time we expand an open goal we sample e = 32 tactics (the proofstep objective described
above) from the model at temperature t = 1.0, deduplicate them, and apply the valid tactics (of
which there are at most e) to the goal being expanded. Each successful tactic application generates
new subgoals that are added to the proof tree and the proof search queue. The expanded goal is then
removed from the queue. Note that the subgoals associated with a successfully applied tactic all
share the same cumulative logprob and will eventually be expanded together (as subgoals generated
from their own expansion will mechanically have a higher cumulative logprob, and will therefore be
inserted behind them in the queue). We denote the process of selecting the minimal cumulative logprob
goal and expanding it as a proof search expansion.
Each proof search involves d = 128 goal expansions, so proofs generated have at most d proof steps.
When evaluating our models, we attempt a proof search for each statement in the valid set a = 4
times, starting from an empty proof tree each time. In the above, a, e, and d are hyperparameters of
the search process that we can vary to achieve better performance (at the cost of more compute), but
keep constant as we compare models.
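A minimal sketch of this search loop (interfaces to the model and to the kernel are assumed; proof-tree bookkeeping and success detection are omitted):

import heapq

def proof_search(root_goal, sample_tactics, apply_tactic, e=32, d=128):
    # sample_tactics(goal, e): up to e deduplicated (tactic, cost) pairs, where
    #   cost is the tactic's negative logprob under the model at t = 1.0.
    # apply_tactic(goal, tactic): list of subgoals, or None if the tactic is invalid.
    tie = 0
    queue = [(0.0, tie, root_goal)]  # open goals keyed by cumulative cost
    for _ in range(d):  # at most d expansions, hence proofs of at most d steps
        if not queue:
            break
        cost, _, goal = heapq.heappop(queue)
        for tactic, tactic_cost in sample_tactics(goal, e):
            subgoals = apply_tactic(goal, tactic)  # checked by the verifier
            if subgoals is None:
                continue  # invalid generation, discarded
            for subgoal in subgoals:
                # Subgoals of one tactic share the same cumulative cost and
                # are therefore eventually expanded together.
                tie += 1
                heapq.heappush(queue, (cost + tactic_cost, tie, subgoal))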
Performing such proof searches requires tightly coupling a Metamath verifier with our models. We
implemented a Metamath kernel in Python to avoid the performance cost and brittleness of interacting
with an external kernel over its REPL through standard I/O. It also provides us with a flexible
environment to experiment with new ideas in ways that were not anticipated by existing verifiers.
The kernel consists of a modified LR(0) parser to check that terms generated by our models comply
with the Metamath grammar, along with Goal and Tactic objects that implement the Metamath
substitution and represent proof trees. Our implementation is capable of exporting our in-memory
representations to both our JSON marshalled format and the official set.mm proof format. The latter
allows us to verify the proofs we generate with an external Metamath kernel implementation such as
mmverify.py or metamath-exe.
Collectively, this proof search procedure and the formal verifier tied with it are what we refer to as
the GPT-f automated prover.
4.4 Evaluation
Figure 1: Proof search consists of maintaining a proof tree in which multiple tactics are explored for
each goal, starting from the root goal. Goals are expanded by cumulative (tactic) logprob priority.
The hyperparameters we chose to evaluate our models attempt to minimize the variance in the
evaluation process while using the least amount of compute. Decreasing the number of expansions
per goal e increases the variance, as we explore the action space less systematically when expanding
each goal. The value of e can be taken quite high, as each auto-regressive proof step generation uses
the same query for a given goal and can therefore be batched. Increasing e too much may also
hurt performance, given the breadth-first nature of the cumulative logprob ordering. Increasing the
number of attempts per proposition a decreases variance and consistently improves performance up
to reasonably high values (we use a = 32 attempts for our final benchmarking). We found that a = 4
limited the variance in our evaluations while remaining tractable given the amount of compute we
had available. Finally, the proof search depth d has little impact on variance but naturally improves
performance (we take d = 128 to evaluate our models and d = 256 for our final benchmarking).
The number of expansions we use per proof search may appear relatively low, but it is important
to realize that it already requires a substantial amount of compute, as each expansion consists of the
auto-regressive generation of e = 32 tactics (generally hundreds of tokens, and therefore forward
passes, each). Empirically, the hyperparameters we chose require on average ∼1k GPU-hours
(on V100s) to evaluate our 700m parameter model (which leverages GPT-3's sparse attention as
well as key-value caching).
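To make the cost concrete (assuming, say, ∼250 generated tokens per sampled tactic): a single attempt involves up to d × e = 128 × 32 ≈ 4k tactic generations, i.e., on the order of a million token-level forward passes, before multiplying by the a attempts per statement and the ∼1k statements being evaluated.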
4.5 Pre-training
We study the effect of pre-training on the performance of our models. We pre-train our models
both on GPT-3's post-processed version of CommonCrawl and on a more reasoning-focused mix of
GitHub, arXiv, and Math StackExchange.
GitHub is downloaded using BigQuery and filtered to only include deduplicated files from selected
programming languages (excluding markdown, stylesheets, and HTML). arXiv is downloaded using Bulk
Data Access and filtered to only include articles labeled as Mathematics whose LaTeX source
is available. Math StackExchange is downloaded from its snapshot on the Internet Archive and
post-processed to remove HTML tags and to correlate questions and answers. We denote this mix
as WebMath.
4.6 Synthetic Datasets
Despite being among the largest formal mathematics libraries, the Metamath library remains scarce in
the context of deep learning, especially in light of the advantages demonstrated on various NLP tasks
by pre-training on large corpora. Additionally, set.mm mostly focuses on well-known high-level theorems
and does not include a large number of the technical lemmas resembling the mathematics exercises
used as curriculum for humans. Finally, since Metamath lacks high-level tactics such as HOL Light's
ARITH_RULE or Lean's ring, it is critical to ensure that our models are capable of proving at least the
basic technical theorems generally handled by high-level tactics in other systems (in domains such as
arithmetic or ring equalities and inequalities).
To achieve this goal we designed synthetic datasets allowing us to generate proofs for each of these
domains at will, while controlling precisely how many proofs we use to augment our training set.
We describe below the synthetic datasets we designed and report in section 5 the sample complexity
associated with these synthetic tasks.
Table 2: Average number of proof steps produced by our synthetic generators for ndigits = 3, 9, 18.

Generator          ndigits = 3   ndigits = 9   ndigits = 18
Addition (in Z)    19            48            94
Division           13            93            292
Modulo             25            82            206
Exponentiation     7             27            68
4 https://console.cloud.google.com/marketplace/details/github/github-repos
5 https://arxiv.org/help/bulk_data
6 https://archive.org/details/stackexchange
7 https://www.cl.cam.ac.uk/~jrh13/hol-light/HTML/ARITH_RULE.html
8 https://leanprover-community.github.io/mathlib_docs/algebra/ring/basic.html#ring
9 Metamath Proof Explorer - decadd http://us.metamath.org/mpeuni/decadd.html
10 Metamath Proof Explorer - decaddc http://us.metamath.org/mpeuni/decaddc.html
Our goal is to leverage these synthetic generators to ensure our models are confident when faced with
such subgoals, in order to mitigate the large number of proof steps they require.
Table 3: Metamath theorems used by our Ring Algebra synthetic generators. Theorems are available
in the Metamath Proof Explorer.
Theorem Weight Description
eqcomd 1 Commutative law for class equality.
int-addcomd 1 Addition commutativity.
int-addassocd 1 Addition associativity.
int-mulcomd 1 Multiplication commutativity.
int-mulassocd 1 Multiplication associativity.
int-leftdistd 3 Left distribution of multiplication over addition.
int-rightdistd 3 Right distribution of multiplication over addition.
int-sqdefd 5 Definition of the square.
muladdd2 5 Product of two sums.
Example statements produced by our Ring Algebra generators:

ABBA(AB)^2 + (C + A) = A + (ABBA)^2 + C
(AA)^2 = A^2 AA
((BA + CA)^2)^2 = (BA + CA)^2 (BAAB + ACCA + BAAC + ABCA)
((A + B)^2)^2 (A + A) = ((A + B)^2 (AB + AB + AA + BB) + (A + B)^2 (AB + AB + AA + BB)) A
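A sketch of how such statements can be produced (our own simplified rendition; the actual generators apply the Table 3 theorems to build full Metamath proofs, not just statements): grow a random ring expression, then rewrite it to obtain a provably equal right-hand side.

import random

def random_expr(depth, variables=("A", "B", "C")):
    # Grow a random ring expression of bounded depth.
    if depth == 0 or random.random() < 0.3:
        return random.choice(variables)
    if random.random() < 0.2:
        return f"( {random_expr(depth - 1, variables)} )^2"  # squares, cf. int-sqdefd
    op = random.choice(["+", "*"])
    return f"( {random_expr(depth - 1, variables)} {op} {random_expr(depth - 1, variables)} )"

def random_equality(depth=6, variables=("A", "B")):
    # Pair an expression with a trivially equal rewrite of itself; the real
    # generators chain many weighted rewrites (commutativity, associativity,
    # distributivity, squares) and emit the corresponding Metamath proof.
    left = random_expr(depth, variables)
    right = random_expr(depth, variables)
    op = random.choice(["+", "*"])
    lhs = f"( {left} {op} {right} )"
    rhs = f"( {right} {op} {left} )"  # commutativity rewrite at the root
    return f"{lhs} = {rhs}"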
4.7 Learned Value Function
To achieve better performance, we also iteratively train a value function to guide the proof search, in
place of the cumulative logprob priority described above.
We implement the value function by means of an outcome objective, as follows. Any time we attempt
to prove a statement, we generate a significant number of intermediate goals.
Table 4: Number of proofs and proof steps added to constitute our augmented dataset.
Generator Number of Proofs Number of Proofsteps
9-digit Addition (in Z) 100 4541
9-digit Division 100 10047
9-digit Modulo 50 4438
9-digit Exponentiation 50 910
Ring Equalities (depth = 6, nbvar = 2) 50 1373
Ring Equalities (depth = 6, nbvar = 3) 50 1499
Some of these goals will lead to the proof, other goals will be proved without being part of the final
proof, while others will not be resolved at all. To obtain a value function, we simply train our model
to predict whether a goal produced during proof search ended up being resolved, by generating a new
dataset of the following form:

GOAL <goal> OUTCOME <P|N>

where a goal is annotated with "P" if it was resolved, and "N" otherwise.
The binary nature of the OUTCOME allows the definition of a provability function fP as the
conditional probability of the token P given a GOAL, without having to introduce a separate value head.
Given a goal g, for a model parametrized by θ:

fP(g) = pθ( P | GOAL <g> OUTCOME )
Not having to introduce a separate value head greatly simplifies the overall architecture. Training
only involves augmenting the dataset with outcome objectives (as additional masked sentences), and
sampling the provability function simply consists of reading the probability distribution for the
token following the OUTCOME keyword (which can be done in one forward pass).
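A sketch of reading fP in a single forward pass, assuming a Hugging Face-style model interface and a tokenizer in which " P" is a single token (both assumptions are ours):

import torch

def provability(model, tokenizer, goal: str) -> float:
    # Probability of the token "P" immediately after the OUTCOME keyword;
    # one forward pass, no separate value head.
    ids = tokenizer.encode(f"GOAL {goal} OUTCOME")
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    p_id = tokenizer.encode(" P")[0]
    return probs[p_id].item()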
We train the value function iteratively: at each iteration we run proof searches for the statements of
the train set, and then construct new training data as follows:
• We extract the full proofs that were found for statements from the train set at each
previous iteration and merge them with the original train set. We deduplicate proof steps
at the proof level. This dataset becomes our new train set for the proofstep objective.
• We extract the annotated goals visited by the proof searches for statements from the train set,
as well as the goals from the original train set (annotated positively), and deduplicate the
goals, giving priority to positive outcome annotations (a sketch of this deduplication follows
below). This dataset becomes our new train set for the outcome objective.
This iterative training allows us to control for overfitting on both objectives: we process in the same
way the data generated by proof searches on statements from the valid set, and use the resulting
datasets to track the associated valid losses.
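For instance, the positive-priority deduplication of outcome annotations can be written as follows (a sketch; annotated pairs each visited goal string with its "P"/"N" label):

def dedup_goals(annotated):
    # Keep one label per goal, giving priority to positive outcomes: once a
    # goal has been resolved anywhere, it is kept as a positive example.
    labels = {}
    for goal, outcome in annotated:
        if outcome == "P" or goal not in labels:
            labels[goal] = outcome
    return labels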
Training a value function gives the model an opportunity to learn from its errors on data it generates.
It also shifts proof searches from breadth-first exploration to one that is more focused, adapting to
the level of confidence modeled by the value function.
5 Experiments
We fine-tune all of our models with 1024 examples per global batch and a context size of 2048 tokens,
for at most 32B tokens (our augmented dataset contains ∼1B tokens), early stopping at minimum valid
loss when applicable. We anneal the learning rate to zero (over 32B tokens). We found that restarting
the training with an annealing schedule that matches the early-stopping point for a given model only
provides a marginal improvement, so we avoided doing so.
The models are trained with the BPE encoding reported in [21], the same tokenization being used for
text, code, and formalized statements. We leave a thorough ablation of the encoding as future work,
as preliminary experimental results demonstrate possible gains with specialized tokenization techniques.
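For instance, inspecting how a formal statement tokenizes under the GPT-2 BPE (here using the tiktoken library; the statement is the example from section 3.1) suggests why specialized tokenization might help:

import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the BPE used by GPT-2/GPT-3
tokens = enc.encode("|- ( 3 + 2 ) = 5")
print(len(tokens), [enc.decode([t]) for t in tokens])
# Most symbols become their own token, so formal statements consume
# many tokens relative to their information content.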
5.1 Baselines
We report three baselines: (i) the state of the art for Metamath's set.mm as reported by MetaGen-IL
[25] (their benchmarking methodology is close to ours, so the numbers are directly comparable);
(ii) a 160m parameter model trained from scratch on our raw dataset using the proofstep
objective; and (iii) a 160m parameter model trained from scratch on our augmented dataset (same objective).
Table 5: Baseline performance from MetaGen-IL as well as 160m parameter models trained on the
raw and augmented datasets.
Model Performance # Tokens
MetaGen-IL[25] 21.16% N/A
160m raw dataset (ours) 29.22% 18B
160m augmented dataset (ours) 28.96% 18B
We explain the improvement over MetaGen-IL (despite not relying on forward-proving data generation
techniques) by our use of a simpler architecture (one unique Transformer vs 3 separate GRU
networks); a more straightforward objective (direct auto-regressive generation of the full tactic as text
vs separate premise selection and generation of the substitutions); more learnable parameters (160m
vs 300k (three 2-layer bidirectional GRUs with 128 hidden units)); and more compute at training as well
as test time.
Note that the dataset augmentation may have a marginally negative effect on performance on the valid
set with our 160m model (but we are within typical variance). We report in section 5.5 a more reliably
positive effect with a pre-trained 700m model.
5.2 Model Size
Table 6: Performance for various model sizes trained on the augmented dataset.
Model Performance Perplexity # Tokens
160m augmented 28.96% 1.041 18B
400m augmented 30.23% 1.042 18B
700m augmented 31.58% 1.040 18B
These results demonstrate that model size positively impacts performance in our formal setup, despite
the training dataset being limited in size (we train for ∼18 epochs). Note that the bigger the model,
the more compute we use at training time as well as at benchmarking time.
5.3 Pre-training
Models are pre-trained on CommonCrawl using GPT-3’s[21] methodology for 260B tokens. When
studying the effect of pre-training on WebMath we start from a CommonCrawl pre-trained model and
continue pre-training on WebMath for 16B additional tokens. We also report results after pre-training
on GitHub only instead of WebMath for the same number of tokens.
We hypothesize that the positive pre-training effect is primarily driven by the emergence and transfer
of features that are relevant to formal reasoning. One could argue that most of these features
are probably shallow and mostly relevant at the syntactic level, but the lower performance achieved
with GitHub only, in comparison to WebMath, suggests that some features may be more elaborate. We
leave a broader investigation of this question as future work; it could be achieved by studying
the performance of linear probes on the features of the different pre-trained models with respect to a
formal objective, such as the truth value of a set of statements provided in the Metamath (or any other
formal) language.
Table 8: Performance for model sizes ranging from 160m to 1.5b parameters, pre-trained on WebMath.
Model Performance Perplexity # Tokens
160m (WebMath) 34.79% 1.029 16B
400m (WebMath) 39.94% 1.026 15B
700m (WebMath) 42.56% 1.024 15B
1p5b (WebMath) 42.39% 1.024 13B
It is unclear why we do not observe a smooth improvement in performance between the 700m and
the 1.5b models in table 8. The lack of any guarantee that the valid set has a smooth difficulty
distribution may play a role here. Another effect may originate from the limited size of the training set,
leading the training dynamics to saturate as we grow the number of parameters. We leave a closer
study of this effect as future work; it could be accomplished by training on various fractions of the
training dataset and checking for similar saturation plateaus.
5.4 Learned Value Function
We report the performance of our models as we iteratively train on data generated by sampling proofs
against the verifier.
Table 9: Performance of the 160m and 700m parameter models as we iterate through the learned
value function data generation and re-training process. policy only consists of adding newly found
positive proofs to the policy training (without training a value function), while policy+value consists of
the full iterative data generation and training described in section 4.7.
Model                           Iteration 0   Iteration 1   Iteration 2
160m (WebMath) policy only      34.79%        38.17%        38.34%
160m (WebMath) policy+value     -             39.27%        40.70%
700m (WebMath) policy only      42.56%        42.23%        43.15%
700m (WebMath) policy+value     -             44.59%        47.21%
While overfitting on the train set does not generally appear to negatively impact performance on the
valid set (and can even often help noticeably if not too catastrophic), we discovered that it dramatically
hurts our iterative training process. We hypothesize that overfitting collapses the data generation into
a mode where exploration is weakened, the model being overly optimistic about its predictions on
the train set. We therefore carefully avoid overfitting by tracking the loss on the associated valid set,
early stopping as we reach a minimum.
There is probably additional performance to be extracted by running more iterations, given how
continuous this iterative process appears to be. We leave the design of a less compute-intensive
iterative data generation process as future work. Indeed, we believe that a lot of computation is
spent on subgoals that do not necessarily provide much signal for the value function, and each
iteration is quite compute-intensive, as it requires sampling proofs for the entire training set (which
takes ∼20k GPU-hours on V100s in our current setup).
5.5 Synthetic Dataset Ablation
Ablation of our synthetic dataset augmentation demonstrates that synthetically generated proofs
generalize to some extent and provide a noticeable uplift in performance on the valid set for larger
models.
Table 10: Ablation of the augmented dataset for 160m and 700m parameters models.
Model Performance Perplexity # Tokens
160m (WebMath) raw dataset 34.12% 1.029 16B
160m (WebMath) augmented dataset 34.79% 1.029 16B
700m (WebMath) raw dataset 40.28% 1.024 15B
700m (WebMath) augmented dataset 42.56% 1.024 15B
Our main motivation for including synthetic proofs in our training, beyond the relative uplift achieved,
is to study the effect of model size and of training a value function on the sample complexity
of our models, as we control exactly how many examples from the synthetic domains we use for
training. Table 11 reports the performance on 100 synthetically generated statements (distinct from
the train set) as well as the number of synthetic proofs present in the training set for each model (in
parentheses).
Table 11: Performance of our models on 100 test statements from our synthetic generators (run with
the same parameters used to augment the training set; see table 4).
Model 9-digit addition 9-digit division Ring equalities
160m raw 13% (0) 4% (0) 6% (0)
160m augmented 78% (100) 27% (100) 77% (100)
160m policy+value (iteration 1) 87% (100) 24% (100) 71% (100)
160m policy+value (iteration 2) 90% (100) 28% (100) 79% (100)
700m raw 12% (0) 5% (0) 7% (0)
700m augmented 76% (100) 32% (100) 82% (100)
700m policy+value (iteration 1) 90% (100) 40% (100) 78% (100)
700m policy+value (iteration 2) 92% (100) 47% (100) 88% (100)
This demonstrates the close (yet not perfectly correlated) relationship between sample complexity
and performance in our formal reasoning setup, suggesting that sample complexity is an important
driver of improved performance in formal mathematics.
More importantly, it demonstrates that our models are capable of acquiring new non-trivial capabilities
with a number of training examples that is compatible with manual formalization. We plan to study
similar learning dynamics in the future for more challenging tasks for which we don't have a
synthetic generator.
5.6 Results
We attempted to push the performance of our models by increasing both the number of expansions
per proof search from d = 128 to d = 256 and the number of attempts per proof from a = 4 to
a = 32. We report the achieved performance as a function of the number of attempts per statement
on the valid set in Table 12.
Finally, we performed a final evaluation with d = 256 and a = 32 of our 700m model policy+value
(iteration 2) on the held-out test set:
Table 12: Performance of our 700m model policy+value (iteration 2) as we double the number of
attempts a per proposition (with d = 256).
Attempts Performance Delta
a=2 42.90%
a=4 47.29% +4.39%
a=8 51.26% +3.97%
a = 16 54.05% +2.99%
a = 32 56.50% +2.45%
Perf^test_{a=32, e=32, d=256}(θ_700m) = 56.22%
6 Output
We describe in this section two projects we executed, aimed at sharing results and tools based on our
work with the Metamath community.
6.1 Proof Shortening
We contributed 23 shortened proofs of theorems to the Metamath library. These proofs were
generated by the GPT-f automated prover. To discover shorter proofs, we sampled proofs for
statements from the set.mm library, comparing the length of the solutions found by our models to
their ground-truth versions, and also verifying that the shorter proofs didn't rely on additional axioms.
The reception from the Metamath community was positive, proof length being a metric the community
cares about:
“Any ML-based system is impressive if it can find many shorter proofs than the
ones we already have. Nice work.”
“The shorter proof is easier to translate. It’s more symmetric in that it treats A
and B identically. It’s philosophically more concise in that it doesn’t rely on the
existence of a universal class of all sets.”
To our knowledge, these shortened proofs are the first effective contribution of a deep learning system
to a formal mathematics library.
6.2 GPT-f Proof Assistant
We created an online proof assistant to allow interactive proof constructions with the assistance of
our models.
We used it to formalize more than 200 theorems and exercises. We found our models to be particularly
useful for automatically generating a variety of the technical low-level proof steps required in most
Metamath proofs, for searching the library by adapting existing theorems to the format needed by the
user (e.g., deduction form), and for suggesting theorems to use. Even when mistaken, our models
generally go for the right theorems, and their erroneous substitutions are often easy for humans to fix.
11 https://github.com/metamath/set.mm/pull/1547
12 https://github.com/metamath/set.mm/pull/1561
13 https://groups.google.com/g/metamath/c/-FNsw2wyllI
14 To determine whether other deep learning-based provers have made contributions to their respective libraries,
we looked for such contributions in the following systems: the HOList family in HOL Light, CoqGym+ASTactic in
Coq, and TacticToe in HOL4. In addition, we interviewed 6 experts in formal mathematics and/or deep learning
applied to formal mathematics.
15 https://groups.google.com/g/metamath/c/D09W2QVR-_I/m/g_rsqGj0AAAJ
Figure 2: Screenshot of the GPT-f Proof Assistant
We shared the proof assistant with the Metamath community with the objective of making it mutually
beneficial, helping the community be more productive and, reciprocally, helping us improve our
models' accuracy by automatically gathering human feedback. We also plan to extend GPT-f to other
formal systems.
7 Conclusion
In this paper, we present the GPT-f automated prover and proof assistant and show that the Transformer
architecture is suitable for formal reasoning, achieving a new state-of-the-art result on the Metamath library.
In particular we demonstrate the importance of pre-training as well as iterative training of a value
function. Our results suggest that tightly coupling a deep learning system with a formal system opens
up interesting opportunities for further research, with the goal of better leveraging the generative
power of the former and the verification capabilities of the latter.
Acknowledgments
Szymon Sidor, Jakub Pachocki, Harri Edwards, Yura Burda and Vedant Misra inspired many of
the ideas presented in this work, offering their guidance throughout the process of building GPT-f.
Auguste Poiroux implemented the synthetic dataset generators presented in this paper, and formalized
a large number of theorems using the proof assistant, providing invaluable feedback in the process.
Szymon Sidor, Pranav Shyam, John Schulman, Jared Kaplan, Ryan Lowe and Jack Clark slogged
through drafts of this paper, identifying errors and sources of confusion as well as providing helpful
suggestions. Finally, the authors would like to thank the whole Metamath community for their
support, feedback, and encouragement, in particular, David A. Wheeler for his motivating enthusiasm
and Mario Carneiro for his precious help on a wide variety of technical questions.
16 Deduction Form and Natural Deduction http://us.metamath.org/mpeuni/mmnatded.html
References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems, pages
1097–1105, 2012.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
[3] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine
translation system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144, 2016.
[6] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural
networks. In International conference on machine learning, pages 1764–1772, 2014.
[7] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg,
Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech
2: End-to-end speech recognition in english and mandarin. In International conference on
machine learning, pages 173–182, 2016.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural
information processing systems, pages 2672–2680, 2014.
[9] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with
deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[10] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for
improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[11] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative
adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 4401–4410, 2019.
[12] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan,
and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International
Conference on Machine Learning, 2020.
[13] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-
che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mas-
tering the game of go with deep neural networks and tree search. nature, 529(7587):484–489,
2016.
[14] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ebiak, Christy
Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large
scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[15] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun-
young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster
level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[16] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur
Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube
with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
[17] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep
visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pages 5998–6008, 2017.
[19] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding by generative pre-training, 2018.
[20] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[21] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[22] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[23] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur
Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering
chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint
arXiv:1712.01815, 2017.
[24] John Harrison, Josef Urban, and Freek Wiedijk. History of interactive theorem proving. In
Computational Logic, volume 9, pages 135–214, 2014.
[25] Mingzhe Wang and Jia Deng. Learning to prove theorems by learning to generate theorems.
arXiv preprint arXiv:2002.07019, 2020.
[26] Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Eén, François Chollet, and
Josef Urban. Deepmath-deep sequence models for premise selection. In Advances in Neural
Information Processing Systems, pages 2235–2243, 2016.
[27] Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. Premise selection for theorem proving
by deep graph embedding. In Advances in Neural Information Processing Systems, pages
2786–2796, 2017.
[28] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L
Dill. Learning a sat solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018.
[29] Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided
proof search. arXiv preprint arXiv:1701.06972, 2017.
[30] Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart Wilcox. Holist: An
environment for machine learning of higher order logic theorem proving. In International
Conference on Machine Learning, pages 454–463, 2019.
[31] Kshitij Bansal, Sarah M Loos, Markus N Rabe, and Christian Szegedy. Learning to reason in
large theories without imitation. arXiv preprint arXiv:1905.10501, 2019.
[32] Markus N Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical reasoning
via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757, 2020.
[33] Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. Gamepad: A learning
environment for theorem proving. arXiv preprint arXiv:1806.00608, 2018.
[34] Kaiyu Yang and Jia Deng. Learning to prove theorems via interacting with proof assistants.
arXiv preprint arXiv:1905.09381, 2019.
[35] Daniel Whalen. Holophrasm: a neural automated theorem prover for higher-order logic. arXiv
preprint arXiv:1608.02644, 2016.
[36] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[37] Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by ratio-
nale generation: Learning to solve and explain algebraic word problems. arXiv preprint
arXiv:1705.04146, 2017.
[38] Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh
Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based
formalisms. arXiv preprint arXiv:1905.13319, 2019.
[39] Guillaume Lample and François Charton. Deep learning for symbolic mathematics. arXiv
preprint arXiv:1912.01412, 2019.
[40] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
versal transformers. arXiv preprint arXiv:1807.03819, 2018.
[41] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical
reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019.
[42] Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Pure Mathe-
matics, 2019. http://us.metamath.org/downloads/metamath.pdf.
[44] Freek Wiedijk. The "de Bruijn factor", 2014. http://www.cs.ru.nl/~freek/factor/.
[45] Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Grosse. Int: An inequality benchmark for
evaluating generalization in theorem proving. arXiv preprint arXiv:2007.02924, 2020.
[46] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning
and tree search. In Advances in Neural Information Processing Systems, pages 5360–5370,
2017.
A Key Results
Table 13: Key results described in this paper (on the valid set) with a summary of the source of
performance gains.
Model Performance Gain Main ablation
MetaGen-IL [25] 21.16% Baseline and state of the art.
160m (ours) 28.96% +7.8% Use of Transformers.
700m (ours) 31.58% +2.5% Increase in parameter count.
700m WebMath (ours) 42.56% +10.9% Pre-training.
700m policy+value (ours) 47.21% +4.6% Iterated learned value function.
700m policy+value a = 32 (ours) 56.50% +9.2% Increased test-time compute.
B.1 Proof of nn0onn0ex
(n − m) ∈ ℕ0 ⇔ ((n − m) ∈ ℤ ∧ 0 ≤ (n − m)), which is exactly the statement of a theorem available
in the Metamath library, elnn0z. The statement of elnn0z is memorized by the model, and the
generation of the substitution term for Q is driven by this memorization.
+ eluzelz |- ( N e. ( ZZ>= ` M ) -> N e. ZZ )
+ eluzel2 |- ( N e. ( ZZ>= ` M ) -> M e. ZZ )
+ zsubcld |- ( N e. ( ZZ>= ` M ) -> ( N - M ) e. ZZ )
+ eluzle |- ( N e. ( ZZ>= ` M ) -> M <_ N )
+ eluzelre |- ( N e. ( ZZ>= ` M ) -> N e. RR )
+ eluzel2 |- ( N e. ( ZZ>= ` M ) -> M e. ZZ )
+ zred |- ( N e. ( ZZ>= ` M ) -> M e. RR )
+ subge0d |- ( N e. ( ZZ>= ` M ) -> ( 0 <_ ( N - M ) <-> M <_ N ) )
+ mpbird |- ( N e. ( ZZ>= ` M ) -> 0 <_ ( N - M ) )
+ jca |- ( N e. ( ZZ>= ` M ) -> ( ( N - M ) e. ZZ /\
0 <_ ( N - M ) ) )
+ elnn0z |- ( ( N - M ) e. NN0 <-> ( ( N - M ) e. ZZ /\
0 <_ ( N - M ) ) )
+ sylibr |- ( N e. ( ZZ>= ` M ) -> ( N - M ) e. NN0 )
This proof displays the model's capability to demonstrate non-trivial propositional logic statements, a
task of interest because of its relationship to SAT solving.
+ pm2.21 |- ( -. ph -> ( ph -> ps ) )
+ orcd |- ( -. ph -> ( ( ph -> ps ) \/ ( ph -> ch ) ) )
+ ax-1 |- ( ps -> ( ph -> ps ) )
+ ax-1 |- ( ch -> ( ph -> ch ) )
+ orim12i |- ( ( ps \/ ch ) -> ( ( ph -> ps ) \/ ( ph -> ch ) ) )
+ ja |- ( ( ph -> ( ps \/ ch ) ) -> ( ( ph -> ps ) \/
( ph -> ch ) ) )
+ orc |- ( ps -> ( ps \/ ch ) )
+ imim2i |- ( ( ph -> ps ) -> ( ph -> ( ps \/ ch ) ) )
+ olc |- ( ch -> ( ps \/ ch ) )
+ imim2i |- ( ( ph -> ch ) -> ( ph -> ( ps \/ ch ) ) )
+ jaoi |- ( ( ( ph -> ps ) \/ ( ph -> ch ) ) ->
( ph -> ( ps \/ ch ) ) )
+ impbii |- ( ( ph -> ( ps \/ ch ) ) <-> ( ( ph -> ps ) \/
( ph -> ch ) ) )
+ bicomi |- ( ( ( ph -> ps ) \/ ( ph -> ch ) ) <->
( ph -> ( ps \/ ch ) ) )
20 Metamath Proof Explorer - elnn0z http://us.metamath.org/mpeuni/elnn0z.html