Generative Language Modeling for Automated Theorem Proving
Abstract
We explore the application of transformer-based language models to automated
theorem proving. This work is motivated by the possibility that a major limitation
of automated theorem provers compared to humans – the generation of original
mathematical terms – might be addressable via generation from language models.
We present an automated prover and proof assistant, GPT-f, for the Metamath
formalization language, and analyze its performance. GPT-f found new short proofs
that were accepted into the main Metamath library, which is, to our knowledge, the
first time a deep learning based system has contributed proofs that were adopted by
a formal mathematics community.
1 Introduction
Artificial neural networks have enjoyed a spectacularly successful decade, having made considerable
advances in computer vision [1, 2], translation [3, 4, 5], speech recognition [6, 7], image generation
[8, 9, 10, 11, 12], game playing [13, 14, 15], and robotics [16, 17]. Especially notable is the recent
rapid progress in language understanding and generation capabilities [18, 19, 20, 21, 22].
With the possible exception of AlphaGo [13] and AlphaZero [23], reasoning tasks are conspicuously
absent from the list above. In this work we take a step towards addressing this absence by applying a
transformer language model to automated theorem proving.
Automated theorem proving [24] is an appealing domain for exploring reasoning in general and the
reasoning capabilities of language models in particular for several reasons:
• Reasoning-complete: Proving theorems very likely requires general and flexible reasoning;
thus an advance in theorem proving is also an advance in reasoning more broadly.
• Search: Automated theorem proving systems can quickly check the correctness of proofs,
making it a productive environment for the use and development of search methods.
• Automated data generation: The ability to verify proofs makes it possible to automatically
generate new problems that could then be used as training data. This is especially important,
since collecting high quality data for reasoning tasks can be difficult.
Learning to prove theorems is somewhat analogous to learning to play Go: both offer an automated
way of determining success (the game of Go is a miniature formal system), and both offer an
automated way of generating new data via self-play-type approaches. This similarity, together with
the clear success of AlphaZero, suggests that automated theorem proving might prove to be a fruitful
domain for the study of reasoning in neural networks where significant progress may be possible.
Our contributions are the following:
• We verify that generative pre-training substantially improves performance, and that
pre-training on mathematical data (such as arXiv) leads to better performance than
pre-training on generic text from the web.
• We find that model size is positively correlated with performance, even though the size of
the Metamath dataset is relatively small.
• We also achieve a new state of the art for the Metamath environment with our best model
capable of closing 56.22% of proofs from a held-out test set (vs 21.16% for the current
state of the art, MetaGen-IL [25]), demonstrating that the Transformer architecture may be
suitable for formal reasoning.
2 Related Work
Deep learning applied to premise selection and proof guidance. Research on automated theorem
proving dates back to the 1950s [24], but mainstream proof assistants still suffer from the combinatorial
explosion of their search space as they are scaled to large corpora, motivating the use of deep learning.
Early applications of deep learning to formal mathematics focused primarily on premise selection and
proof guidance. DeepMath [26] explored the use of CNNs and RNNs to predict whether a premise is
useful for demonstrating a given conjecture; their results were later improved by FormulaNet [27] through
the use of graph neural networks, reminiscent of NeuroSAT [28]. Proof guidance consists of selecting
the next clause to process inside an automated theorem prover. Loos et al. [29] investigated the use of
models similar to DeepMath’s for proof guidance and demonstrated a significant uplift on the Mizar
library.
Deep learning applied to automated theorem proving. HOList [30] proposes a formal environment
based on HOL Light. They achieve their best performance [31] with a GNN model designed
for premise selection and the use of exploration. More recently, the same team studied the use of
the BERT objective with Transformers on formal statements [32], demonstrating the potential of
leveraging Transformers for formal reasoning. Their study focuses on preliminary tasks that are
related to, but do not directly consist of, proving formal theorems (such as typing and conjecturing).
GamePad [33] and CoqGym/ASTactic [34] introduce environments based on the Coq theorem
prover. ASTactic generates tactics as programs by sequentially expanding a partial abstract syntax
tree. Holophrasm [35] and MetaGen-IL [25] propose RNN-based models to generate proofs for
Metamath (the formal system we focus on). They rely on three different models, one to value goals,
one to select premises, and one to generate substitutions. MetaGen-IL also demonstrates an uplift in
performance by generating synthetic data through forward proving.
Use of Transformers for symbolic tasks. Several lines of work have explored language
modeling using Transformers [18]. Language modeling improvements have been demonstrated from
better pre-training tasks, using various objectives such as auto-regressive generation [19, 20, 21], token
masking [22] or sequence masking [36], but the resulting language models have so far fallen short when
applied to reasoning-oriented tasks such as algebraic word problems [37, 38]. Recently, Lample and
Charton [39] successfully applied Transformers to anti-derivative calculus and solving differential
equations, hinting that Transformers are capable of generating the exogenous terms involved in
the substitutions required for successful symbolic integration. The Universal Transformer [40], a
Transformer with tied weights, was also shown to be successful at more algorithmic tasks.
Saxton et al. [41] also evaluated the Transformer architecture on a variety of mathematical problems.
3 Formal Environment
We chose Metamath [42] as our formal environment. Metamath is powered by a simple meta-logic
based on a single substitution rule [43].
The main Metamath library is called set.mm, which is a collection of ∼ 38k proofs based on ZFC set
theory (while other formalisms can also be used on top of Metamath’s meta logic, they are not used
in set.mm).
Metamath has several properties that make it convenient to use with neural networks:
• Metamath does not have high-level tactics, which means that all of its proof steps are
very low-level. As an example, the de Bruijn factor [44] (the quotient of the size of a
formalization of a mathematical text and the size of its informal original) of a
Metamath proof is ∼10-20, while it is around ∼1-3 in Coq, HOL Light, or Lean. Lower-level
proof steps mean longer proofs, with a greater chance of compounding errors during
search.
• The current state of the tooling around Metamath makes it a very “DIY” system, one that is
not yet ready for broad adoption by the mathematics community.
While our approach would be applicable to other formal systems (such as Lean, Coq, or HOL Light),
Metamath’s features allow faster prototyping and reduced iteration time in the near term, which is
why we chose it for this project.
The set.mm library contains the background theorems required to demonstrate most Olympiad-style or
undergraduate mathematics problems. For example, assisted by the GPT-f proof assistant
described in this work in section 6.2, we formalized IMO 1972 problem B2.
3.1 Proving in Metamath
As an example, assume we want to prove |- ( 3 + 2 ) = 5 using the definition of 4 and 5 as respective
successors of 3 and 4. As a first step, we should use an equality transitivity theorem such as:
[[
|- A = B # first hypothesis
|- C = B # second hypothesis
]]
|- A = C # conclusion
To apply the transitivity theorem, we need to provide a substitution that substitutes A with ( 3 + 2 )
and C with 5 such that the conclusion of the theorem unifies with the current goal. We are left with
providing a substitution for B, which can hardly be discovered mechanically (hence the appeal of
generative language modeling). We can substitute B with ( 4 + 1 ), as is the case in the actual proof
in Metamath's set.mm library (Metamath Proof Explorer, 3p2e5: http://us.metamath.org/mpeuni/3p2e5.html).
Putting it all together, the goal here is:
|- ( 3 + 2 ) = 5
[[
|- A = B # first hypothesis
|- C = B # second hypothesis
]]
|- A = C # conclusion
{{ A : ( 3 + 2 ) }} # substitution of A
{{ B : ( 4 + 1 ) }} # substitution of B
{{ C : 5 }} # substitution of C
|- ( 3 + 2 ) = ( 4 + 1 )
|- 5 = ( 4 + 1 )
Applying the following proof step with no hypothesis (the definition of 5, df-5:
http://us.metamath.org/mpeuni/df-5.html) to the second subgoal allows us to prove it:
[[ ]] |- 5 = ( 4 + 1 )
Note that this proof step has no hypothesis and no substitution involved. It therefore closes that
branch of the proof tree. From there the proof can be continued with the first subgoal, proving
backward, until no subgoal is left. Also note that a proof for a given theorem of the library can only
use theorems proven before it; we enforce that constraint when benchmarking our models, even
though they are trained on the library as a whole.
In most formal systems, a proof step consists of a goal and a mechanism that, given the goal, produces
new subgoals; such a mechanism is generally referred to as a tactic. In Metamath, there is only one type
of tactic, based on substitution as illustrated above. Additionally, since the substituted theorem must
unify with the current goal, the current goal can be deduced from the tactic itself (the theorem and
substitution pair), which is not generally the case in other systems. As such, we use tactic and proof
step interchangeably in the rest of the paper.
This informal presentation of Metamath is sufficient to understand the objectives we use to train our
models. A more formal definition of Metamath’s meta-logic can be found in the Metamath Book [42].
3.2 Dataset
Metamath's set.mm uses a binary compressed format to represent proofs of statements. We process the
library and extract a dataset of proof steps, stored as JSON blobs using the representation presented
above. For each proof step we store a GOAL, a PROOFSTEP, and a reference to the parent goal if any,
encoding the tree structure of the proofs:
{
"proof_label": "unidmrn",
"goal": "[[ ]] |- U. U. ‘’ A = ( dom A u. ran A )",
"proof_step": "[[ |- A = B |- C = B ]] |- A = C \\
{{ A : U. U. ‘’ A }} \\
{{ B : ( ran ‘’ A u. dom ‘’ A ) }} \\
{{ C : ( dom A u. ran A ) }}",
"proof_step_hash": "37yZVNorgF8=",
"parent_hash": ["n4Kl7judEN4="]
}
The dataset contains ∼3m such proof steps for ∼38k theorems (distinct proof labels). We split
that dataset into a train set and two valid and test sets, each containing ∼1k proofs sampled
randomly (∼90k proof steps each).
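As an illustration, a minimal sketch (file name and seed are ours) of loading the proof-step JSON lines and splitting them by theorem, so that all steps of a proof land in the same split:

import json
import random
from collections import defaultdict

steps_by_theorem = defaultdict(list)
with open("proof_steps.jsonl") as f:  # one JSON blob per proof step
    for line in f:
        step = json.loads(line)
        steps_by_theorem[step["proof_label"]].append(step)

labels = sorted(steps_by_theorem)
random.Random(0).shuffle(labels)  # fixed seed for a reproducible split
valid, test, train = labels[:1000], labels[1000:2000], labels[2000:]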
3.3 Glossary
term: A string that complies with the Metamath grammar.
statement or proposition: A potentially empty set of hypotheses (terms) and a conclusion (term) entailed by the hypotheses.
theorem: A proven statement.
axiom: An assumed statement.
goal: A statement in the context of a proof search.
substitutions: A list of pairs of variables and terms (used to substitute the variables within a theorem or an axiom).
tactic: A theorem and substitutions that unify with a goal.
subgoals: Goals generated by a tactic (the substituted hypotheses of the tactic's theorem).
proof step: A goal and a tactic, potentially generating new subgoals.
proof: A tree of goals and tactics whose root is the demonstrated theorem; leaves of the tree are tactics with no subgoals or goals that are hypotheses of the root theorem.
4 Model
4.1 Architecture
We use decoder-only Transformers similar to GPT-2 [20] and GPT-3 [21]. The largest model we
study has 36 layers and 774m trainable parameters.
4.2 Training Objective
The proofstep objective we use for training is a conditional language modeling objective trained
to generate the PROOFSTEP given a GOAL, which is directly applicable to proof searches. To do so,
we format our data in the following way:

GOAL <goal> PROOFSTEP <proofstep>

There is one such objective for each JSON line in our dataset. We train with only one sentence per
context (no chunking), masking the rest of the context by assigning a loss weight w_loss = 0. As we
train, we track the valid loss and sequence accuracy while masking the query (GOAL) part of the objective.
We regularize the training by early stopping at the point of minimum valid loss and by applying a
weight decay w_d = 0.1.
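For illustration, a sketch (helper name ours) of how a proof step from section 3.2 would be serialized into such a training context, assuming the GOAL/PROOFSTEP format above:

def proofstep_context(step: dict) -> str:
    # Conditional language modeling: the model is trained to generate
    # everything after PROOFSTEP, conditioned on the goal (the GOAL part
    # is masked out of the loss).
    return f"GOAL {step['goal']} PROOFSTEP {step['proof_step']}"

# For the unidmrn proof step shown in section 3.2, this yields:
# GOAL [[ ]] |- U. U. `' A = ( dom A u. ran A ) PROOFSTEP [[ |- A = B
# |- C = B ]] |- A = C {{ A : U. U. `' A }} {{ B : ( ran `' A u. dom `' A ) }}
# {{ C : ( dom A u. ran A ) }}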
4.3 Proof Search
We find proofs by running proof searches. A proof search maintains a proof tree and a queue of open
goals sorted by their cumulative logprob, initialized with the root goal that we wish to demonstrate
(see figure 1). The cumulative logprob of a goal is defined as the sum of the logprobs of the tactics
that were used to reach that goal from the root goal. Intuitively, we expand the goals for which the
generative model is the most confident globally. This has a tendency to explore breadth-first, as deeper
goals have more parent tactics and therefore typically a higher cumulative logprob.
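One consistent way to write this priority (treating each tactic's logprob as a negative log-likelihood cost, which matches both the minimal-cumulative-logprob selection below and the breadth-first tendency just described) is

c(g) = Σ_{t ∈ path(root → g)} − log p_θ(t | goal(t)),

where the queue always pops the open goal with minimal c(g).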
Each time we expand an open goal we sample e = 32 tactics (the proofstep objective described
above) from the model at temperature t = 1.0, deduplicate them, and apply the valid tactics (of
which there are at most e) to the goal being expanded. Each successful tactic application generates
new subgoals that are added to the proof tree and the proof search queue. The expanded goal is then
removed from the queue. Note that the subgoals associated with a successfully applied tactic all
share the same cumulative logprob and will eventually be expanded together (as subgoals generated
from their own expansion will mechanically have a higher cumulative logprob, and will therefore be
inserted behind them in the queue). We denote the process of selecting the minimal cumulative logprob
goal and expanding it as a proof search expansion.
Each proof search involves d = 128 goal expansions, so proofs generated have at most d proof steps.
When evaluating our models, we attempt a proof search for each statement in the valid set a = 4
times, starting from an empty proof tree each time. In the above, a, e, and d are hyperparameters of
the search process that we can vary to achieve better performance (at the cost of more compute), but
keep constant as we compare models.
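A minimal sketch of this search loop (interfaces to the model and to the kernel are assumed; proof-tree bookkeeping and success detection are omitted):

import heapq

def proof_search(root_goal, sample_tactics, apply_tactic, e=32, d=128):
    # sample_tactics(goal, e): up to e deduplicated (tactic, cost) pairs, where
    #   cost is the tactic's negative logprob under the model at t = 1.0.
    # apply_tactic(goal, tactic): list of subgoals, or None if the tactic is invalid.
    tie = 0
    queue = [(0.0, tie, root_goal)]  # open goals keyed by cumulative cost
    for _ in range(d):  # at most d expansions, hence proofs of at most d steps
        if not queue:
            break
        cost, _, goal = heapq.heappop(queue)
        for tactic, tactic_cost in sample_tactics(goal, e):
            subgoals = apply_tactic(goal, tactic)  # checked by the verifier
            if subgoals is None:
                continue  # invalid generation, discarded
            for subgoal in subgoals:
                # Subgoals of one tactic share the same cumulative cost and
                # are therefore eventually expanded together.
                tie += 1
                heapq.heappush(queue, (cost + tactic_cost, tie, subgoal))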
Performing such proof searches requires tightly coupling a Metamath verifier with our models. We
implemented a Metamath kernel in Python to avoid the performance cost and brittleness of interacting
with an external kernel over its REPL through standard I/O. It also provides us with a flexible
environment to experiment with new ideas in ways that were not anticipated by existing verifiers.
The kernel consists of a modified LR(0) parser to check that terms generated by our models comply
with the Metamath grammar, along with Goal and Tactic objects that implement the Metamath
substitution and represent proof trees. Our implementation is capable of exporting our in-memory
representations to both our JSON marshalled format and the official set.mm proof format. The latter
allows us to verify the proofs we generate with an external Metamath kernel implementation such as
mmverify.py or metamath-exe.
Collectively, this proof search procedure and the formal verifier tied with it are what we refer to as
the GPT-f automated prover.
4.4 Evaluation
Figure 1: Proof search consists of maintaining a proof tree in which multiple tactics are explored for
each goal, starting from the root goal. Goals are expanded by cumulative (tactic) logprob priority.
The hyperparameters we chose to evaluate our models attempt to minimize the variance in the
evaluation process while using the least amount of compute. Decreasing the number of expansions
per goal e increases the variance, as we explore the action space less systematically when expanding
each goal. The value of e can be taken quite high, as each auto-regressive proof step generation uses
the same query for a given goal and can therefore be batched. Increasing e too much may also
hurt performance, given the breadth-first nature of the cumulative logprob ordering. Increasing the
number of attempts per proposition a decreases variance and consistently improves performance up
to reasonably high values (we use a = 32 attempts for our final benchmarking). We found that a = 4
limited the variance in our evaluations while remaining tractable given the amount of compute we
had available. Finally, the proof search depth d has little impact on variance but naturally improves
performance (we take d = 128 to evaluate our models and d = 256 for our final benchmarking).
The number of expansions we use per proof search may appear relatively low, but it is important
to realize that it already requires a substantial amount of compute, as each expansion consists of the
auto-regressive generation of e = 32 tactics (generally hundreds of tokens, and therefore forward
passes, each). Empirically, the hyperparameters we chose require on average ∼1k GPU-hours
(on V100s) to evaluate our 700m parameter model (which leverages GPT-3's sparse attention as
well as key-value caching).
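To make the cost concrete (assuming, say, ∼250 generated tokens per sampled tactic): a single attempt involves up to d × e = 128 × 32 ≈ 4k tactic generations, i.e., on the order of a million token-level forward passes, before multiplying by the a attempts per statement and the ∼1k statements being evaluated.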
4.5 Pre-training
We study the effect of pre-training on the performance of our models. We pre-train our models
both on GPT-3's post-processed version of CommonCrawl and on a more reasoning-focused mix of
GitHub, arXiv, and Math StackExchange.
GitHub is downloaded using BigQuery and filtered to only include deduplicated files from selected
programming languages (excluding markdown, stylesheets, and HTML). arXiv is downloaded using Bulk
Data Access and filtered to only include articles labeled as Mathematics whose LaTeX source
is available. Math StackExchange is downloaded from its snapshot on the Internet Archive and
post-processed to remove HTML tags and to correlate questions and answers. We denote this mix
as WebMath.
4.6 Synthetic Datasets
Despite being among the largest formal mathematics libraries, the Metamath library remains scarce in
the context of deep learning, especially in light of the advantages demonstrated on various NLP tasks
by pre-training on large corpora. Additionally, set.mm mostly focuses on well-known high-level theorems
and does not include a large number of the technical lemmas resembling the mathematics exercises
used as curriculum for humans. Finally, since Metamath lacks high-level tactics such as HOL Light's
ARITH_RULE or Lean's ring, it is critical to ensure that our models are capable of proving at least the
basic technical theorems generally handled by high-level tactics in other systems (in domains such as
arithmetic or ring equalities and inequalities).
To achieve this goal we designed synthetic datasets allowing us to generate proofs for each of these
domains at will, while controlling precisely how many proofs we use to augment our training set.
We describe below the synthetic datasets we designed and report in section 5 the sample complexity
associated with these synthetic tasks.
Table 2: Average number of proof steps produced by our synthetic generators for ndigits = 3, 9, 18.

Generator          ndigits = 3   ndigits = 9   ndigits = 18
Addition (in Z)    19            48            94
Division           13            93            292
Modulo             25            82            206
Exponentiation     7             27            68
4 https://console.cloud.google.com/marketplace/details/github/github-repos
5 https://arxiv.org/help/bulk_data
6 https://archive.org/details/stackexchange
7 https://www.cl.cam.ac.uk/~jrh13/hol-light/HTML/ARITH_RULE.html
8 https://leanprover-community.github.io/mathlib_docs/algebra/ring/basic.html#ring
9 Metamath Proof Explorer - decadd http://us.metamath.org/mpeuni/decadd.html
10 Metamath Proof Explorer - decaddc http://us.metamath.org/mpeuni/decaddc.html
Our goal is to leverage these synthetic generators to ensure our models are confident when faced with
such subgoals, in order to mitigate the large number of proof steps they require.
Table 3: Metamath theorems used by our Ring Algebra synthetic generators. Theorems are available
in the Metamath Proof Explorer.
Theorem Weight Description
eqcomd 1 Commutative law for class equality.
int-addcomd 1 Addition commutativity.
int-addassocd 1 Addition associativity.
int-mulcomd 1 Multiplication commutativity.
int-mulassocd 1 Multiplication associativity.
int-leftdistd 3 Left distribution of multiplication over addition.
int-rightdistd 3 Right distribution of multiplication over addition.
int-sqdefd 5 Definition of the square.
muladdd2 5 Product of two sums.
Example statements produced by our Ring Algebra generators:

ABBA(AB)^2 + (C + A) = A + (ABBA)^2 + C
(AA)^2 = A^2 AA
((BA + CA)^2)^2 = (BA + CA)^2 (BAAB + ACCA + BAAC + ABCA)
((A + B)^2)^2 (A + A) = ((A + B)^2 (AB + AB + AA + BB) + (A + B)^2 (AB + AB + AA + BB)) A
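A sketch of how such statements can be produced (our own simplified rendition; the actual generators apply the Table 3 theorems to build full Metamath proofs, not just statements): grow a random ring expression, then rewrite it to obtain a provably equal right-hand side.

import random

def random_expr(depth, variables=("A", "B", "C")):
    # Grow a random ring expression of bounded depth.
    if depth == 0 or random.random() < 0.3:
        return random.choice(variables)
    if random.random() < 0.2:
        return f"( {random_expr(depth - 1, variables)} )^2"  # squares, cf. int-sqdefd
    op = random.choice(["+", "*"])
    return f"( {random_expr(depth - 1, variables)} {op} {random_expr(depth - 1, variables)} )"

def random_equality(depth=6, variables=("A", "B")):
    # Pair an expression with a trivially equal rewrite of itself; the real
    # generators chain many weighted rewrites (commutativity, associativity,
    # distributivity, squares) and emit the corresponding Metamath proof.
    left = random_expr(depth, variables)
    right = random_expr(depth, variables)
    op = random.choice(["+", "*"])
    lhs = f"( {left} {op} {right} )"
    rhs = f"( {right} {op} {left} )"  # commutativity rewrite at the root
    return f"{lhs} = {rhs}"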
4.7 Learned Value Function
To achieve better performance, we also iteratively train a value function to guide the proof search, in
place of the cumulative logprob priority described above.
We implement the value function by means of an outcome objective, as follows. Any time we attempt
to prove a statement, we generate a significant number of intermediate goals.
Table 4: Number of proofs and proof steps added to constitute our augmented dataset.
Generator Number of Proofs Number of Proofsteps
9-digit Addition (in Z) 100 4541
9-digit Division 100 10047
9-digit Modulo 50 4438
9-digit Exponentiation 50 910
Ring Equalities (depth = 6, nbvar = 2) 50 1373
Ring Equalities (depth = 6, nbvar = 3) 50 1499
Some of these goals will lead to the proof, other goals will be proved without being part of the final
proof, while others will not be resolved at all. To obtain a value function, we simply train our model
to predict whether a goal produced during proof search ended up being resolved, by generating a new
dataset of the following form:

GOAL <goal> OUTCOME <P|N>

where a goal is annotated with "P" if it was resolved, and "N" otherwise.
The binary nature of the OUTCOME allows the definition of a provability function fP as the
conditional probability of the token P given a GOAL, without having to introduce a separate value head.
Given a goal g, for a model parametrized by θ:

fP(g) = pθ( P | GOAL <g> OUTCOME )
Not having to introduce a separate value head greatly simplifies the overall architecture. Training
only involves augmenting the dataset with outcome objectives (as additional masked sentences), and
sampling the provability function simply consists of reading the probability distribution for the
token following the OUTCOME keyword (which can be done in one forward pass).
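A sketch of reading fP in a single forward pass, assuming a Hugging Face-style model interface and a tokenizer in which " P" is a single token (both assumptions are ours):

import torch

def provability(model, tokenizer, goal: str) -> float:
    # Probability of the token "P" immediately after the OUTCOME keyword;
    # one forward pass, no separate value head.
    ids = tokenizer.encode(f"GOAL {goal} OUTCOME")
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    p_id = tokenizer.encode(" P")[0]
    return probs[p_id].item()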
We train the value function iteratively: at each iteration we run proof searches for the statements of
the train set, and then construct new training data as follows:
• We extract the full proofs that were found for statements from the train set at each
previous iteration and merge them with the original train set. We deduplicate proof steps
at the proof level. This dataset becomes our new train set for the proofstep objective.
• We extract the annotated goals visited by the proof searches for statements from the train set,
as well as the goals from the original train set (annotated positively), and deduplicate the
goals, giving priority to positive outcome annotations (a sketch of this deduplication follows
below). This dataset becomes our new train set for the outcome objective.
This iterative training allows us to control for overfitting on both objectives: we process in the same
way the data generated by proof searches on statements from the valid set, and use the resulting
datasets to track the associated valid losses.
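For instance, the positive-priority deduplication of outcome annotations can be written as follows (a sketch; annotated pairs each visited goal string with its "P"/"N" label):

def dedup_goals(annotated):
    # Keep one label per goal, giving priority to positive outcomes: once a
    # goal has been resolved anywhere, it is kept as a positive example.
    labels = {}
    for goal, outcome in annotated:
        if outcome == "P" or goal not in labels:
            labels[goal] = outcome
    return labels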
Training a value function gives the model an opportunity to learn from its errors on data it generates.
It also shifts proof searches from breadth-first exploration to one that is more focused, adapting to
the level of confidence modeled by the value function.
5 Experiments
We fine-tune all of our models with 1024 examples per global batch and a context size of 2048 tokens,
for at most 32B tokens (our augmented dataset contains ∼1B tokens), early stopping at minimum valid
loss when applicable. We anneal the learning rate to zero (over 32B tokens). We found that restarting
the training with an annealing schedule that matches the early-stopping point for a given model only
provides a marginal improvement, so we avoided doing so.
The models are trained with the BPE encoding reported in [21], the same tokenization being used for
text, code, and formalized statements. We leave a thorough ablation of the encoding as future work,
as preliminary experimental results demonstrate possible gains with specialized tokenization techniques.
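For instance, inspecting how a formal statement tokenizes under the GPT-2 BPE (here using the tiktoken library; the statement is the example from section 3.1) suggests why specialized tokenization might help:

import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the BPE used by GPT-2/GPT-3
tokens = enc.encode("|- ( 3 + 2 ) = 5")
print(len(tokens), [enc.decode([t]) for t in tokens])
# Most symbols become their own token, so formal statements consume
# many tokens relative to their information content.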
5.1 Baselines
We report three baselines: (i) the state of the art for Metamath's set.mm as reported by MetaGen-IL
[25] (their benchmarking methodology is close to ours, so the numbers are directly comparable);
(ii) a 160m parameter model trained from scratch on our raw dataset using the proofstep
objective; and (iii) a 160m parameter model trained from scratch on our augmented dataset (same objective).
Table 5: Baseline performance from MetaGen-IL as well as 160m parameter models trained on the
raw and augmented datasets.
Model Performance # Tokens
MetaGen-IL[25] 21.16% N/A
160m raw dataset (ours) 29.22% 18B
160m augmented dataset (ours) 28.96% 18B
We explain the improvement over MetaGen-IL (despite not relying on forward-proving data generation
techniques) by our use of a simpler architecture (one unique Transformer vs 3 separate GRU
networks); a more straightforward objective (direct auto-regressive generation of the full tactic as text
vs separate premise selection and generation of the substitutions); more learnable parameters (160m
vs 300k (three 2-layer bidirectional GRUs with 128 hidden units)); and more compute at training as well
as test time.
Note that the dataset augmentation may have a marginally negative effect on performance on the valid
set with our 160m model (but we are within typical variance). We report in section 5.5 a more reliably
positive effect with a pre-trained 700m model.
5.2 Model Size
Table 6: Performance for various model sizes trained on the augmented dataset.
Model Performance Perplexity # Tokens
160m augmented 28.96% 1.041 18B
400m augmented 30.23% 1.042 18B
700m augmented 31.58% 1.040 18B
These results demonstrate that model size positively impacts performance in our formal setup, despite
the training dataset being limited in size (we train for ∼18 epochs). Note that the bigger the model,
the more compute we use at training time as well as at benchmarking time.
5.3 Pre-training
Models are pre-trained on CommonCrawl using GPT-3’s[21] methodology for 260B tokens. When
studying the effect of pre-training on WebMath we start from a CommonCrawl pre-trained model and
continue pre-training on WebMath for 16B additional tokens. We also report results after pre-training
on GitHub only instead of WebMath for the same number of tokens.
We hypothesize that the positive pre-training effect is primarily driven by the emergence and transfer
of features that are relevant to formal reasoning. One could argue that most of these features
are probably shallow and mostly relevant at the syntactic level, but the lower performance achieved
with GitHub only, in comparison to WebMath, suggests that some features may be more elaborate. We
leave a broader investigation of this question as future work; it could be achieved by studying
the performance of linear probes on the features of the different pre-trained models with respect to a
formal objective, such as the truth value of a set of statements provided in the Metamath (or any other
formal) language.
Table 8: Performance for model sizes ranging from 160m to 1.5b parameters, pre-trained on WebMath.
Model Performance Perplexity # Tokens
160m (WebMath) 34.79% 1.029 16B
400m (WebMath) 39.94% 1.026 15B
700m (WebMath) 42.56% 1.024 15B
1p5b (WebMath) 42.39% 1.024 13B
It is unclear why we do not observe a smooth improvement in performance between the 700m and
the 1.5b models in table 8. The lack of any guarantee that the valid set has a smooth difficulty
distribution may play a role here. Another effect may originate from the limited size of the training set,
leading the training dynamics to saturate as we grow the number of parameters. We leave a closer
study of this effect as future work; it could be accomplished by training on various fractions of the
training dataset and checking for similar saturation plateaus.
5.4 Learned Value Function
We report the performance of our models as we iteratively train on data generated by sampling proofs
against the verifier.
Table 9: Performance of the 160m and 700m parameter models as we iterate through the learned
value function data generation and re-training process. policy only consists of adding newly found
positive proofs to the policy training (without training a value function), while policy+value consists of
the full iterative data generation and training described in section 4.7.
Model                           Iteration 0   Iteration 1   Iteration 2
160m (WebMath) policy only      34.79%        38.17%        38.34%
160m (WebMath) policy+value     -             39.27%        40.70%
700m (WebMath) policy only      42.56%        42.23%        43.15%
700m (WebMath) policy+value     -             44.59%        47.21%
While overfitting on the train set does not generally appear to negatively impact performance on the
valid set (and can even often help noticeably if not too catastrophic), we discovered that it dramatically
hurts our iterative training process. We hypothesize that overfitting collapses the data generation into
a mode where exploration is weakened, the model being overly optimistic about its predictions on
the train set. We therefore carefully avoid overfitting by tracking the loss on the associated valid set,
early stopping as we reach a minimum.
There is probably additional performance to be extracted by running more iterations, given how
continuous this iterative process appears to be. We leave the design of a less compute-intensive
iterative data generation process as future work. Indeed, we believe that a lot of computation is
spent on subgoals that do not necessarily provide much signal for the value function, and each
iteration is quite compute-intensive, as it requires sampling proofs for the entire training set (which
takes ∼20k GPU-hours on V100s in our current setup).
5.5 Synthetic Dataset Ablation
Ablation of our synthetic dataset augmentation demonstrates that synthetically generated proofs
generalize to some extent and provide a noticeable uplift in performance on the valid set for larger
models.
Table 10: Ablation of the augmented dataset for 160m and 700m parameters models.
Model Performance Perplexity # Tokens
160m (WebMath) raw dataset 34.12% 1.029 16B
160m (WebMath) augmented dataset 34.79% 1.029 16B
700m (WebMath) raw dataset 40.28% 1.024 15B
700m (WebMath) augmented dataset 42.56% 1.024 15B
Our main motivation for including synthetic proofs in our training, beyond the relative uplift achieved,
is to study the effect of model size and of training a value function on the sample complexity
of our models, as we control exactly how many examples from the synthetic domains we use for
training. Table 11 reports the performance on 100 synthetically generated statements (distinct from
the train set) as well as the number of synthetic proofs present in the training set for each model (in
parentheses).
Table 11: Performance of our models on 100 test statements from our synthetic generators (run with
the same parameters used to augment the training set; see table 4).
Model 9-digit addition 9-digit division Ring equalities
160m raw 13% (0) 4% (0) 6% (0)
160m augmented 78% (100) 27% (100) 77% (100)
160m policy+value (iteration 1) 87% (100) 24% (100) 71% (100)
160m policy+value (iteration 2) 90% (100) 28% (100) 79% (100)
700m raw 12% (0) 5% (0) 7% (0)
700m augmented 76% (100) 32% (100) 82% (100)
700m policy+value (iteration 1) 90% (100) 40% (100) 78% (100)
700m policy+value (iteration 2) 92% (100) 47% (100) 88% (100)
This demonstrates the close (yet not perfectly correlated) relationship between sample complexity
and performance in our formal reasoning setup, suggesting that sample complexity is an important
driver of improved performance in formal mathematics.
More importantly, it demonstrates that our models are capable of acquiring new non-trivial capabilities
with a number of training examples that is compatible with manual formalization. We plan to study
similar learning dynamics in the future for more challenging tasks for which we don't have a
synthetic generator.
5.6 Results
We attempted to push the performance of our models by increasing both the number of expansions
per proof search from d = 128 to d = 256 and the number of attempts per proof from a = 4 to
a = 32. We report the achieved performance as a function of the number of attempts per statement
on the valid set in Table 12.
Finally, we performed a final evaluation with d = 256 and a = 32 of our 700m model policy+value
(iteration 2) on the held-out test set:
Table 12: Performance of our 700m model policy+value (iteration 2) as we double the number of
attempts a per proposition (with d = 256).
Attempts Performance Delta
a=2 42.90%
a=4 47.29% +4.39%
a=8 51.26% +3.97%
a = 16 54.05% +2.99%
a = 32 56.50% +2.45%
Perf^test_{a=32, e=32, d=256}(θ_700m) = 56.22%
6 Output
We describe in this section two projects we executed, aimed at sharing results and tools based on our
work with the Metamath community.
6.1 Proof Shortening
We contributed 23 shortened proofs of theorems to the Metamath library. These proofs were
generated by the GPT-f automated prover. To discover shorter proofs, we sampled proofs for
statements from the set.mm library, comparing the length of the solutions found by our models to
their ground-truth versions, and also verifying that the shorter proofs didn't rely on additional axioms.
The reception from the Metamath community was positive, proof length being a metric the community
cares about:
“Any ML-based system is impressive if it can find many shorter proofs than the
ones we already have. Nice work.”
“The shorter proof is easier to translate. It’s more symmetric in that it treats A
and B identically. It’s philosophically more concise in that it doesn’t rely on the
existence of a universal class of all sets.”
To our knowledge, these shortened proofs are the first effective contribution of a deep learning system
to a formal mathematics library.
6.2 GPT-f Proof Assistant
We created an online proof assistant to allow interactive proof constructions with the assistance of
our models.
We used it to formalize more than 200 theorems and exercises. We found our models to be particularly
useful for automatically generating a variety of the technical low-level proof steps required in most
Metamath proofs, for searching the library by adapting existing theorems to the format needed by the
user (e.g., deduction form), and for suggesting theorems to use. Even when mistaken, our models
generally go for the right theorems, and their erroneous substitutions are often easy for humans to fix.
11 https://github.com/metamath/set.mm/pull/1547
12 https://github.com/metamath/set.mm/pull/1561
13 https://groups.google.com/g/metamath/c/-FNsw2wyllI
14 To determine whether other deep learning-based provers have made contributions to their respective libraries,
we looked for such contributions in the following systems: the HOList family in HOL Light, CoqGym+ASTactic in
Coq, and TacticToe in HOL4. In addition, we interviewed 6 experts in formal mathematics and/or deep learning
applied to formal mathematics.
15 https://groups.google.com/g/metamath/c/D09W2QVR-_I/m/g_rsqGj0AAAJ
Figure 2: Screenshot of the GPT-f Proof Assistant
We shared the proof assistant with the Metamath community with the objective of making it mutually
beneficial, helping the community be more productive and, reciprocally, helping us improve our
models' accuracy by automatically gathering human feedback. We also plan to extend GPT-f to other
formal systems.
7 Conclusion
In this paper, we present the GPT-f automated prover and proof assistant and show that the Transformer
architecture is suitable for formal reasoning, achieving a new state-of-the-art result on the Metamath library.
In particular we demonstrate the importance of pre-training as well as iterative training of a value
function. Our results suggest that tightly coupling a deep learning system with a formal system opens
up interesting opportunities for further research, with the goal of better leveraging the generative
power of the former and the verification capabilities of the latter.
Acknowledgments
Szymon Sidor, Jakub Pachocki, Harri Edwards, Yura Burda and Vedant Misra inspired many of
the ideas presented in this work, offering their guidance throughout the process of building GPT-f.
Auguste Poiroux implemented the synthetic dataset generators presented in this paper, and formalized
a large number of theorems using the proof assistant, providing invaluable feedback in the process.
Szymon Sidor, Pranav Shyam, John Schulman, Jared Kaplan, Ryan Lowe and Jack Clark slogged
through drafts of this paper, identifying errors and sources of confusion as well as providing helpful
suggestions. Finally, the authors would like to thank the whole Metamath community for their
support, feedback, and encouragement, in particular, David A. Wheeler for his motivating enthusiasm
and Mario Carneiro for his precious help on a wide variety of technical questions.
16 Deduction Form and Natural Deduction http://us.metamath.org/mpeuni/mmnatded.html
References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems, pages
1097–1105, 2012.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
[3] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine
translation system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144, 2016.
[6] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural
networks. In International conference on machine learning, pages 1764–1772, 2014.
[7] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg,
Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech
2: End-to-end speech recognition in english and mandarin. In International conference on
machine learning, pages 173–182, 2016.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural
information processing systems, pages 2672–2680, 2014.
[9] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with
deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[10] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for
improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[11] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative
adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 4401–4410, 2019.
[12] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan,
and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International
Conference on Machine Learning, 2020.
[13] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-
che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mas-
tering the game of go with deep neural networks and tree search. nature, 529(7587):484–489,
2016.
[14] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ebiak, Christy
Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large
scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[15] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun-
young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster
level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[16] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur
Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube
with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
[17] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep
visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pages 5998–6008, 2017.
[19] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding by generative pre-training, 2018.
[20] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[21] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[22] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[23] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur
Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering
chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint
arXiv:1712.01815, 2017.
[24] John Harrison, Josef Urban, and Freek Wiedijk. History of interactive theorem proving. In
Computational Logic, volume 9, pages 135–214, 2014.
[25] Mingzhe Wang and Jia Deng. Learning to prove theorems by learning to generate theorems.
arXiv preprint arXiv:2002.07019, 2020.
[26] Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Eén, François Chollet, and
Josef Urban. Deepmath-deep sequence models for premise selection. In Advances in Neural
Information Processing Systems, pages 2235–2243, 2016.
[27] Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. Premise selection for theorem proving
by deep graph embedding. In Advances in Neural Information Processing Systems, pages
2786–2796, 2017.
[28] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L
Dill. Learning a sat solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018.
[29] Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided
proof search. arXiv preprint arXiv:1701.06972, 2017.
[30] Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart Wilcox. Holist: An
environment for machine learning of higher order logic theorem proving. In International
Conference on Machine Learning, pages 454–463, 2019.
[31] Kshitij Bansal, Sarah M Loos, Markus N Rabe, and Christian Szegedy. Learning to reason in
large theories without imitation. arXiv preprint arXiv:1905.10501, 2019.
[32] Markus N Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical reasoning
via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757, 2020.
[33] Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. Gamepad: A learning
environment for theorem proving. arXiv preprint arXiv:1806.00608, 2018.
[34] Kaiyu Yang and Jia Deng. Learning to prove theorems via interacting with proof assistants.
arXiv preprint arXiv:1905.09381, 2019.
[35] Daniel Whalen. Holophrasm: a neural automated theorem prover for higher-order logic. arXiv
preprint arXiv:1608.02644, 2016.
[36] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[37] Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by ratio-
nale generation: Learning to solve and explain algebraic word problems. arXiv preprint
arXiv:1705.04146, 2017.
[38] Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh
Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based
formalisms. arXiv preprint arXiv:1905.13319, 2019.
[39] Guillaume Lample and François Charton. Deep learning for symbolic mathematics. arXiv
preprint arXiv:1912.01412, 2019.
[40] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni-
versal transformers. arXiv preprint arXiv:1807.03819, 2018.
[41] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical
reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019.
[42] Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Pure Mathe-
matics, 2019. http://us.metamath.org/downloads/metamath.pdf.
[44] Freek Wiedijk. The "de Bruijn factor", 2014. http://www.cs.ru.nl/~freek/factor/.
[45] Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Grosse. Int: An inequality benchmark for
evaluating generalization in theorem proving. arXiv preprint arXiv:2007.02924, 2020.
[46] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning
and tree search. In Advances in Neural Information Processing Systems, pages 5360–5370,
2017.
A Key Results
Table 13: Key results described in this paper (on the valid set) with a summary of the source of
performance gains.
Model Performance Gain Main ablation
MetaGen-IL [25] 21.16% Baseline and state of the art.
160m (ours) 28.96% +7.8% Use of Transformers.
700m (ours) 31.58% +2.5% Increase in parameter count.
700m WebMath (ours) 42.56% +10.9% Pre-training.
700m policy+value (ours) 47.21% +4.6% Iterated learned value function.
700m policy+value a = 32 (ours) 56.50% +9.2% Increased test-time compute.
B.1 Proof of nn0onn0ex
(n − m) ∈ ℕ0 ⇔ ((n − m) ∈ ℤ ∧ 0 ≤ (n − m)), which is exactly the statement of a theorem available
in the Metamath library, elnn0z. The statement of elnn0z is memorized by the model, and the
generation of the substitution term for Q is driven by this memorization.
+ eluzelz |- ( N e. ( ZZ>= ` M ) -> N e. ZZ )
+ eluzel2 |- ( N e. ( ZZ>= ` M ) -> M e. ZZ )
+ zsubcld |- ( N e. ( ZZ>= ` M ) -> ( N - M ) e. ZZ )
+ eluzle |- ( N e. ( ZZ>= ` M ) -> M <_ N )
+ eluzelre |- ( N e. ( ZZ>= ` M ) -> N e. RR )
+ eluzel2 |- ( N e. ( ZZ>= ` M ) -> M e. ZZ )
+ zred |- ( N e. ( ZZ>= ` M ) -> M e. RR )
+ subge0d |- ( N e. ( ZZ>= ` M ) -> ( 0 <_ ( N - M ) <-> M <_ N ) )
+ mpbird |- ( N e. ( ZZ>= ` M ) -> 0 <_ ( N - M ) )
+ jca |- ( N e. ( ZZ>= ` M ) -> ( ( N - M ) e. ZZ /\
0 <_ ( N - M ) ) )
+ elnn0z |- ( ( N - M ) e. NN0 <-> ( ( N - M ) e. ZZ /\
0 <_ ( N - M ) ) )
+ sylibr |- ( N e. ( ZZ>= ` M ) -> ( N - M ) e. NN0 )
This proof displays the model's capability to demonstrate non-trivial propositional logic statements, a
task of interest because of its relationship to SAT solving.
+ pm2.21 |- ( -. ph -> ( ph -> ps ) )
+ orcd |- ( -. ph -> ( ( ph -> ps ) \/ ( ph -> ch ) ) )
+ ax-1 |- ( ps -> ( ph -> ps ) )
+ ax-1 |- ( ch -> ( ph -> ch ) )
+ orim12i |- ( ( ps \/ ch ) -> ( ( ph -> ps ) \/ ( ph -> ch ) ) )
+ ja |- ( ( ph -> ( ps \/ ch ) ) -> ( ( ph -> ps ) \/
( ph -> ch ) ) )
+ orc |- ( ps -> ( ps \/ ch ) )
+ imim2i |- ( ( ph -> ps ) -> ( ph -> ( ps \/ ch ) ) )
+ olc |- ( ch -> ( ps \/ ch ) )
+ imim2i |- ( ( ph -> ch ) -> ( ph -> ( ps \/ ch ) ) )
+ jaoi |- ( ( ( ph -> ps ) \/ ( ph -> ch ) ) ->
( ph -> ( ps \/ ch ) ) )
+ impbii |- ( ( ph -> ( ps \/ ch ) ) <-> ( ( ph -> ps ) \/
( ph -> ch ) ) )
+ bicomi |- ( ( ( ph -> ps ) \/ ( ph -> ch ) ) <->
( ph -> ( ps \/ ch ) ) )
20 Metamath Proof Explorer - elnn0z http://us.metamath.org/mpeuni/elnn0z.html