---
base_model:
- Qwen/Qwen3-8B
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

<div align="center">
    <h1> <a href="http://blog.goedel-prover.com"> <strong>Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date</strong></a></h1>
</div>

<div align="center">
  
[![Website](https://img.shields.io/badge/%F0%9F%A4%96%20Homepage-Goedel-536af5?color=536af5&logoColor=white)](http://blog.goedel-prover.com)
[![GitHub](https://img.shields.io/badge/GitHub-Code-black.svg?logo=github)](https://github.com/Goedel-LM/Goedel-Prover-V2)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20face-Goedel-ffc107?color=ffc107&logoColor=white)](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B)
[![arXiv](https://img.shields.io/badge/arXiv-2508.03613-b31b1b.svg?style=flat)](https://arxiv.org/abs/2508.03613)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

</div>

## 1. Introduction

We introduce Goedel-Prover-V2, an open-source language model series that sets a new state-of-the-art in automated formal proof generation. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) **Scaffolded data synthesis**: We generate synthetic proof tasks of increasing difficulty to progressively train the model, enabling it to master increasingly complex theorems; (2) **Verifier-guided self-correction**: The model learns to iteratively revise its own proofs by leveraging feedback from Lean’s compiler, closely mimicking how humans refine their work; (3) **Model averaging**: We combine multiple model checkpoints to improve robustness and overall performance.

Our smaller model, Goedel-Prover-V2-8B, reaches 83.0% on the MiniF2F test set at Pass@32, matching the performance of the prior state-of-the-art DeepSeek-Prover-V2-671B while being nearly 100 times smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.0% on MiniF2F at Pass@32 in standard mode and 90.4% in self-correction mode, outperforming the prior SOTA DeepSeek-Prover-V2-671B and the concurrent Kimina-Prover-72B by a large margin. Additionally, our flagship model with self-correction solves 64 problems on PutnamBench at Pass@64, securing 1st place on the leaderboard and surpassing DeepSeek-Prover-V2-671B's record of 47 problems at Pass@1024.

## 2. Benchmark Performance

**Self-correction mode**: The model improves proof quality by first generating an initial candidate and then using Lean compiler feedback to revise it iteratively. We perform two rounds of self-correction, which remains computationally efficient: the total output length (the initial proof plus two revisions) grows only modestly, from the standard 32K to 40K tokens.
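The self-correction loop described above can be sketched as follows. This is an illustrative outline only: `generate_proof`, `revise_proof`, and `lean_check` are hypothetical stand-ins for the model call and the Lean compiler check, not APIs from this release.

```python
def prove_with_self_correction(statement, generate_proof, revise_proof,
                               lean_check, rounds=2):
    """Generate an initial proof, then revise it up to `rounds` times
    using Lean compiler feedback (two rounds in our setup).

    All three callables are hypothetical stand-ins:
      generate_proof(statement) -> proof string
      lean_check(statement, proof) -> (ok: bool, feedback: str)
      revise_proof(statement, proof, feedback) -> revised proof string
    """
    proof = generate_proof(statement)
    for _ in range(rounds):
        ok, feedback = lean_check(statement, proof)
        if ok:
            return proof          # compiler accepted the proof
        proof = revise_proof(statement, proof, feedback)
    return proof                  # best attempt after all rounds
```

The loop stops early once Lean accepts the proof, which is why the average token cost stays close to the standard single-pass budget.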


<style>
  .fig-row {
    display: flex;
    justify-content: space-between; /* spread them out */
    align-items: flex-start;        /* align tops */
    gap: 1rem;                      /* space between images */
  }
  .fig-row img {
    display: block;
    width: 100%;
    height: auto;
  }
  .fig-row .panel {
    /* override per‐panel width as needed */
    /* e.g. .panel-1 { width:25%; } .panel-2 { width:40%; } etc. */
  }
  figure {
    margin: 0;
  }
  figure figcaption {
    text-align: center;
    font-size: 0.9em;
    margin-top: 0.75rem;
    color: #555;
  }
 figure figcaption strong {
    font-weight: bold;
  }
  /* Italicize the rest of the caption */
  figure figcaption em {
    font-style: italic;
  }
</style>

<figure>
  <div class="fig-row">
    <div class="panel panel-1" style="width:100%;">
      <img src="https://github.com/Goedel-LM/Goedel-Prover-V2/blob/main/assets/combined_performance_plots_varied_width.png?raw=true" alt="…">
    </div>
  </div>
  <figcaption>
  <strong>Figure 1</strong>: <em>Pass@32 performance on MiniF2F, PutnamBench, and our new MathOlympiadBench containing 360 IMO-level problems.</em>
  </figcaption>
</figure>

The charts above demonstrate the state-of-the-art performance of Goedel-Prover-V2. All numbers are reported at Pass@32: (1) across all three datasets, our flagship 32B model, in both standard and self-correction mode, significantly outperforms the prior state-of-the-art DeepSeek-Prover-V2-671B and Kimina-Prover-72B; (2) on MiniF2F, our 8B model matches the performance of DeepSeek-Prover-V2-671B while being nearly 100 times smaller.


<div align="center">
  <table style="margin: 0 auto;">
    <thead>
      <tr>
        <th>#</th>
        <th>Model</th>
        <th>num‑solved</th>
        <th>compute</th>
      </tr>
    </thead>
    <tbody>
      <tr><td>1</td><td><strong>Goedel-Prover-V2-32B (self-correction mode)</strong></td><td><strong>86</strong></td><td><strong>Pass@192</strong></td></tr>
      <tr><td>1</td><td><strong>Goedel-Prover-V2-32B (self-correction mode)</strong></td><td><strong>57</strong></td><td><strong>Pass@32</strong></td></tr>
      <tr><td>1</td><td><strong>Goedel-Prover-V2-32B</strong></td><td><strong>43</strong></td><td><strong>Pass@32</strong></td></tr>
      <tr><td>2</td><td>DeepSeek‑Prover‑V2-671B</td><td>47</td><td>Pass@1024</td></tr>
      <tr><td>2</td><td>DeepSeek‑Prover‑V2-671B</td><td>22</td><td>Pass@32</td></tr>
      <tr><td>3</td><td>DSP+</td><td>23</td><td>Pass@128</td></tr>
      <tr><td>4</td><td>Kimina‑Prover‑7B‑Distill</td><td>10</td><td>Pass@192</td></tr>
      <tr><td>5</td><td>Self-play Theorem Prover</td><td>8</td><td>Pass@3200</td></tr>
      <tr><td>6</td><td>Goedel-Prover-V1</td><td>7</td><td>Pass@512</td></tr>
    </tbody>
  </table>
    <!-- table caption -->
  <p><strong>Table 1</strong>: <em>PutnamBench leaderboard. Goedel-Prover-V2-32B secures the top rank with significantly less compute (pass count) than the previous state-of-the-art.</em></p>
</div>

## 3. Compelling Scaling Performance


<figure>
  <div class="fig-row">
    <div class="panel panel-1" style="width:80%;">
      <img src="https://github.com/Goedel-LM/Goedel-Prover-V2/blob/main/assets/inference_scale_performance.png?raw=true" alt="…">
    </div>
  </div>
  <figcaption>
    <strong>Figure 2</strong>: <em>Performance on MiniF2F test set under different sample budgets.</em>
  </figcaption>
</figure>

The scaling curves above show that our 32B model consistently outperforms all prior state-of-the-art models across the entire range of inference-time compute budgets.

## 4. Model & Dataset Downloads

We release our Goedel-Prover-V2 models and the new MathOlympiadBench benchmark to foster future research.

<div align="center">
  
| Model | Download |
| -------- | -------- |
|    Goedel-Prover-V2-32B    |   [🤗Download](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B)    |
|    Goedel-Prover-V2-8B    |   [🤗Download](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-8B)    |

</div>

<div align="center">

| Dataset | Download |
| -------- | -------- |
|    MathOlympiadBench    |   [🤗Download](https://huggingface.co/datasets/Goedel-LM/MathOlympiadBench)    |

</div>

**MathOlympiadBench** (Math Olympiad Bench) comprises human-verified formalizations of Olympiad-level mathematical competition problems, sourced from [Compfiles](https://dwrensha.github.io/compfiles/imo.html) and [IMOSLLean4 repository](https://github.com/mortarsanjaya/IMOSLLean4). MathOlympiadBench contains 360 problems, including 158 IMO problems from 1959 to 2024, 131 IMO shortlist problems covering 2006 to 2023, 68 regional mathematical Olympiad problems, and 3 additional mathematical puzzles. 

This model is being released to aid other open-source projects, including those geared towards the upcoming IMO competition. A full paper with all details will be released in the coming weeks.

## 5. Environment Setup

We follow [DeepSeek-Prover-V1.5](https://github.com/deepseek-ai/DeepSeek-Prover-V1.5), which uses Lean 4 version 4.9 and the corresponding Mathlib. Please refer to the following instructions to set up the environment.

### Requirements

* Supported platform: Linux
* Python 3.10

### Installation

1.  **Install Lean 4**

    Follow the instructions on the [Lean 4 installation page](https://leanprover.github.io/lean4/doc/quickstart.html) to set up Lean 4.

2.  **Clone the repository**

```sh
git clone --recurse-submodules https://github.com/Goedel-LM/Goedel-Prover-V2.git
cd Goedel-Prover-V2
```

3.  **Install required packages**
```sh
conda env create -f goedelv2.yml
```

If you encounter an error while installing packages (for example, flash-attn), run the following:

```sh
conda activate goedelv2
pip install torch==2.6.0
conda env update --file goedelv2.yml
```

4.  **Build Mathlib4**

```sh
cd mathlib4
lake build
```

5.  **Test Lean 4 and mathlib4 installation**

```sh
cd ..
python lean_compiler/repl_scheduler.py
```
If there is any error, reinstall Lean 4 and rebuild mathlib4.

If you have already installed Lean and Mathlib for other projects and want to reuse that installation, note that you may need to modify `DEFAULT_LAKE_PATH` and `DEFAULT_LEAN_WORKSPACE` in `lean_compiler/repl_scheduler.py`.
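For reference, the two constants point at the `lake` binary and the workspace containing a built Mathlib. The paths below are examples only; the actual values depend on where your installation lives:

```python
# In lean_compiler/repl_scheduler.py (example paths, adjust to your system):
DEFAULT_LAKE_PATH = "/home/user/.elan/bin/lake"   # path to your `lake` binary
DEFAULT_LEAN_WORKSPACE = "/home/user/mathlib4/"   # workspace with built Mathlib
```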

## 6. Quick Start
You can directly use [Hugging Face Transformers](https://github.com/huggingface/transformers) for model inference.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(30)

model_id = "Goedel-LM/Goedel-Prover-V2-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)


formal_statement = """
import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat


theorem square_equation_solution {x y : ℝ} (h : x^2 + y^2 = 2*x - 4*y - 5) : x + y = -1 := by
  sorry
""".strip()

prompt = """
Complete the following Lean 4 code:

```lean4
{}```

Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies.
The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof.
""".strip()

chat = [
  {"role": "user", "content": prompt.format(formal_statement)},
]

inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Proof generation can take several minutes at this output length.
import time
start = time.time()
outputs = model.generate(inputs, max_new_tokens=32768)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
print(f"Generation took {time.time() - start:.1f} seconds")
```
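The prompt asks the model to emit a proof plan followed by the final Lean 4 code in a ```lean4 fenced block, so the generated text must be post-processed before it can be sent to the compiler. A small helper (illustrative, not part of the release) can extract the last such block:

```python
import re

def extract_lean_block(text):
    """Return the contents of the last ```lean4 fenced block in `text`,
    or None if no such block is present."""
    blocks = re.findall(r"```lean4\n(.*?)```", text, re.DOTALL)
    return blocks[-1].strip() if blocks else None
```

Taking the last block matters in self-correction mode, where earlier blocks in the transcript may be rejected drafts.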

## 7. Batch Inference and Self-correction

```sh
bash scripts/pipeline.sh
```

## 8. Citation
```bibtex
@article{lin2025goedelproverv2,
  title={Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction},
  author={Lin, Yong and Tang, Shange and Lyu, Bohan and Yang, Ziran and Chung, Jui-Hui and Zhao, Haoyu and Jiang, Lai and Geng, Yihan and Ge, Jiawei and Sun, Jingruo and others},
  journal={arXiv preprint arXiv:2508.03613},
  year={2025}
}
```