In July 2025, OpenAI tested a new general-purpose AI model on the problems of the 2025 International Mathematical Olympiad. It solved five of the six problems, earning a gold-medal-level score of 35 out of 42.
The model wrote natural-language proofs without using external tools, demonstrating human-like reasoning. Although the evaluation was private rather than an official IMO entry, the model's success shows strong ability in areas that are usually hard for AI:
- Geometry: Like AlphaGeometry, the model could handle complex, multi-step geometric proofs.
- Combinatorics: The model solved problems involving the discrete structures and arrangements typical of IMO-level challenges.
- Number theory: The model managed problems that required complex algebraic work and an understanding of integer properties.
- Inequalities: The model demonstrated creative multi-step methods for proving inequalities.
- Advanced algebra: The model produced detailed step-by-step proofs for complex functional equations.
Main breakthroughs from this achievement include:
- Human-level reasoning: Rather than relying on specialized tools or pattern matching, the model used natural language and step-by-step thinking to solve the problems.
- Gold-medal performance: A score of 35 out of 42 meets the gold-medal threshold, on par with the top human competitors.
- General purpose: Unlike models focused only on geometry, this one showed flexible reasoning across many areas of mathematics.
Note: OpenAI reported this achievement, but the 2025 evaluation was done quickly, and experts are still reviewing some results.
We tested our internal model on all ten First Proof problems, research-level math challenges designed to test whether AI can produce correct, checkable proofs. Unlike short-answer or computation problems, these require building full arguments in specialized areas, and only experts can reliably judge whether the solutions are correct. Top experts wrote the First Proof problems, and some remained unsolved for years before the authors found answers. A university department with expertise in the relevant areas could solve many of them within a week.
We shared our proof attempts on Saturday, February 14, 2026, at midnight Pacific Time. After receiving expert feedback, five of the model's proofs (problems 4, 5, 6, 9, and 10) appear likely correct, while the others are still under review. At first we thought our solution to problem 2 was probably right, but after reading the first official proof commentary and further community analysis, we now believe it is incorrect. We appreciate everyone's responses and look forward to further review.
You can find all our proof attempts here. The preprint includes all ten proofs and a new appendix with prompt patterns and examples that show how we interacted with the models during the process.
We believe novel frontier research is the most important way to evaluate the capabilities of next-generation AI models. Benchmarks are useful, but they can miss some of the hardest parts of research:
- sustaining long chains of reasoning
- choosing the right interactions
- handling ambiguity in problem statements
- producing arguments that survive expert scrutiny
Frontier challenges like First Proof help us stress-test those capabilities in settings where correctness is hard to verify and failure modes are informative.
We are currently training a new model with a focus on improving its rigor, so that it can think continuously for many hours and remain highly confident in its conclusions. When the First Proof problems were announced, they seemed like a perfect test bed, so we tried the model on them over the weekend. It has already solved two of the problems (numbers 9 and 10). As it trained, it became increasingly capable, eventually solving, in our estimation, at least three more. We were particularly pleased when it solved number 6, and then, two days later, number 4, as those problems are close to the research fields of many of us. It’s incredible to watch a model get tangibly smarter day by day.
James R. Lee (OpenAI reasoning researcher)
We used the model with minimal human supervision. During training, we sometimes suggested trying strategies that had worked before. For some proofs, after receiving expert feedback, we asked the model to add detail or clarify its reasoning to make the proofs easier to check. We also set up a back-and-forth between the model and ChatGPT to help with checking, formatting, and style. For some problems, we selected the best of several attempts based on expert evaluation. This was a quick process, not as organized as we would want for a fully controlled evaluation.
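As a rough illustration of the best-of-several selection step described above, here is a minimal sketch in Python. All names here (`select_best_attempt`, `generate_proof`, `score_attempt`) are hypothetical placeholders for this sketch, not actual OpenAI tooling, and the toy scorer stands in for the expert evaluation:

```python
# Hypothetical sketch of best-of-n attempt selection: generate several proof
# attempts for a problem, score each one, and keep the highest-scoring draft.
# The generator and scorer are placeholders, not real APIs.

def select_best_attempt(problem, n_attempts, generate_proof, score_attempt):
    """Generate n_attempts proof drafts and return the best one with its score."""
    attempts = [generate_proof(problem) for _ in range(n_attempts)]
    scored = [(score_attempt(attempt), attempt) for attempt in attempts]
    best_score, best_attempt = max(scored, key=lambda pair: pair[0])
    return best_attempt, best_score


if __name__ == "__main__":
    # Deterministic toy stand-ins: three fixed drafts, scored by length.
    drafts = iter(["short proof", "a somewhat longer proof", "mid-length proof"])
    fake_generate = lambda problem: next(drafts)
    fake_score = len  # stand-in for an expert-evaluation score
    best, score = select_best_attempt("problem 9", 3, fake_generate, fake_score)
    print(best, score)
```

In practice the scoring step was human expert judgment rather than an automated function, which is part of why the authors describe the process as quick and informal.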
We look forward to working with the First Proof organizers on a more rigorous experiment and evaluation process.
This work builds on earlier results from frontier reasoning models in math and science. In July 2025, we achieved gold-medal-level performance on the International Mathematical Olympiad with a general-purpose reasoning model, scoring 35/42 points. In November, we shared early experiments in accelerating science with GPT-5, a set of case studies in which GPT-5 helped researchers make concrete progress across math, physics, biology, and other fields, along with the limitations we observed. Most recently, we reported a physics collaboration in which GPT-5.2 proposed a candidate expression for a gluon amplitude formula, which was then checked by an internal model and verified by the authors.
We look forward to working more closely with the community to evaluate research-level reasoning, including seeking expert feedback on these attempts. We are also excited to bring these new capabilities to future public models.
Source: Our First Proof submissions