MIT scientists are building the world’s largest collection of math Olympiad problems, and opening it up to everyone

Every year, countries competing in the International Mathematical Olympiad (IMO) compile a booklet of their best original problems. Those booklets change hands at the competition, and then they quietly disappear. No one has ever systematically collected them, cleaned them up, and made them available: not to AI researchers testing the limits of mathematical reasoning, and not to students around the world training for these competitions mostly on their own.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST), and the company HUMAIN have now done just that.
MathNet is the largest and highest-quality dataset of Olympiad math problems ever created. It includes over 30,000 expert-authored problems and solutions spanning 47 countries, 17 languages, and 143 competitions, making it five times larger than the next largest dataset of its kind. The work will be presented at the International Conference on Learning Representations (ICLR) in Brazil later this month.
What makes MathNet unique is not just its size, but its scope. The largest previous Olympiad dataset draws almost exclusively from competitions in the United States and China. MathNet spans dozens of countries across six continents and 17 languages, includes both text-based and image-based problems and solutions, and covers four decades of competition math. The goal is to capture the full range of mathematical ideas and problem-solving cultures that exist throughout the global mathematical community, not just the most visible ones.
“Each country brings a portfolio of its most novel and classic problems,” said Shaden Alshammari, an MIT PhD student and lead author on the paper. “They were sharing brochures, but no one had made the effort to collect them, clean them up, and put them online.”
Building MathNet required tracking down 1,595 PDF volumes totaling 25,000 pages, including digitized documents and scans spanning many years and more than a dozen languages. A key part of that archive came from an unlikely source: Navid Safaei, a longtime member of the IMO community and fellow author, who has been collecting and scanning those booklets by hand since 2006. His archive forms much of the backbone of the dataset.
Provenance matters as much as scale. Where most existing math datasets draw problems from public forums such as Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition publications. The solutions in those booklets are written by experts, are peer-reviewed, and often run to many pages, with the authors working through several approaches to the same problem. That depth gives AI models a much richer signal for learning mathematical reasoning than the short, informal solutions common in publicly available datasets. It also makes the dataset genuinely useful for students: anyone preparing for the IMO or a national competition can now access a central, searchable collection of high-quality problems and worked solutions from around the world.
“I remember many students for whom it was entirely an individual effort. No one in their country was training them for this kind of competition,” said Alshammari, who competed in the IMO as a student herself. “We hope this gives them a central place with high-level problems and solutions to learn from.”
The team has deep roots in the IMO community. Sultan Albarakati, another author, currently serves on the IMO board, and the researchers are working to share the dataset with the IMO foundation directly. To validate the dataset, they assembled an international team of more than 30 people from countries including Armenia, Russia, Ukraine, Vietnam, and Poland, who worked together to check thousands of problems and solutions.
“The MathNet website has the potential to be an excellent resource for students and leaders looking for new problems to tackle, or for a solution to a difficult question,” said Tanish Patil, deputy leader of the Swiss IMO team. “Although some Olympiad problem archives exist (in particular, the Contest Collections forums on AoPS), these resources lack standardized formatting, verified solutions, and important problem metadata such as topics and ideas. It will also be interesting to see how this dataset is used to improve the performance of reasoning models.”
MathNet also serves as a robust benchmark for AI performance, and the results paint a more complex picture than recent headlines about AI’s math prowess might suggest. Frontier models have made remarkable progress: some have reportedly achieved gold-medal performance at the IMO, and on standard benchmarks they now solve problems that would frustrate many people. But MathNet shows that progress is uneven. Even GPT-5, the best-performing model tested, averaged 69.3 percent on MathNet’s main benchmark of 6,400 problems, failing about one in three Olympiad-level problems. And when problems involve images, performance drops dramatically across the board, revealing visual reasoning as a consistent weak spot for even the most capable models.
Several open-source models scored 0 percent on Mongolian-language problems, highlighting another area where current AI systems fall short of their full potential.
“GPT models are equally good in English and in other languages,” says Alshammari. “But most open-source models fail completely in low-resource languages, like Mongolian.”
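To give a sense of how such language-by-language gaps surface, here is a minimal sketch of breaking benchmark accuracy down by language. It is illustrative only: the field names and the toy grading rule are assumptions, not the team’s actual evaluation code, which would grade full solutions far more carefully.

```python
# Hypothetical sketch: score model answers and break accuracy down by language,
# in the spirit of the evaluation described above. Field names are illustrative.
from collections import defaultdict

def accuracy_by_language(records):
    """records: iterable of dicts with 'language', 'answer', and 'model_answer'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["language"]] += 1
        # A real grader would verify proofs or final answers much more rigorously.
        if r["model_answer"].strip() == r["answer"].strip():
            correct[r["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# A per-language table like this is what exposes gaps such as 0 percent on Mongolian.
sample = [
    {"language": "English", "answer": "42", "model_answer": "42"},
    {"language": "Mongolian", "answer": "7", "model_answer": "13"},
]
print(accuracy_by_language(sample))  # {'English': 1.0, 'Mongolian': 0.0}
```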
MathNet is also designed to address a deeper limitation in how AI models learn math. When training data skews toward English and Chinese problems, models draw on only a narrow slice of mathematical tradition. A Romanian combinatorics problem or a Brazilian number theory problem may approach the same underlying idea from a completely different angle. Exposure to that variety, the researchers argue, makes both humans and AI systems better mathematical thinkers.
Beyond problem solving, MathNet introduces a retrieval benchmark that asks whether models can recognize when two problems share the same underlying mathematical structure, a skill important to both AI development and the math community itself. Near-duplicate problems have appeared in actual IMO contests over the years because spotting structural equivalence across different wordings, languages, and formats is genuinely difficult, even for committees of experts. Testing eight state-of-the-art embedding models, the researchers found that even the strongest identified the correct match only about 5 percent of the time on the first try, and that the models tended to rate superficially similar problems as closer matches than structurally equivalent ones.
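To make the retrieval task concrete, here is a minimal sketch of measuring top-1 matching accuracy with an off-the-shelf embedding model. The embedding model, the toy problems, and the single-pair “gold” labels are stand-ins, not the paper’s setup.

```python
# Sketch of a retrieval evaluation: does the nearest neighbor of each query
# problem (by embedding similarity) point to its known structural match?
import numpy as np
from sentence_transformers import SentenceTransformer

queries = ["Prove that among any 5 integers, some 3 have a sum divisible by 3."]
corpus = [
    "Show that any 5 whole numbers contain 3 whose sum is a multiple of 3.",  # true match
    "Find all primes p such that p^2 + 2 is also prime.",
]
gold = [0]  # index in `corpus` of the structurally equivalent problem

model = SentenceTransformer("all-MiniLM-L6-v2")
q = model.encode(queries, normalize_embeddings=True)
c = model.encode(corpus, normalize_embeddings=True)

top1 = np.argmax(q @ c.T, axis=1)            # nearest corpus problem for each query
recall_at_1 = float(np.mean(top1 == np.array(gold)))
print(f"recall@1 = {recall_at_1:.2f}")
```

On MathNet the same idea plays out across thousands of problems in many languages, which is where the roughly 5 percent first-try match rate comes from.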
The dataset also includes a retrieval-augmented generation benchmark, testing whether giving a model a structurally related problem before asking it to solve a new one improves performance. It does, but only if the retrieved problem is genuinely relevant. DeepSeek-V3.2-Speciale improved by up to 12 percent with well-matched retrievals, while irrelevant retrievals hurt performance in about 22 percent of cases.
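One way to picture that setup is prepending a retrieved problem and its worked solution to the prompt before posing the new problem. The sketch below only shows prompt construction; `retrieve` and `ask_model` are placeholders for a retriever and a model API, not the benchmark’s actual pipeline.

```python
# Sketch of retrieval-augmented problem solving: show the model a structurally
# related problem and its worked solution before asking it to solve a new one.

def build_rag_prompt(new_problem: str, retrieved_problem: str, retrieved_solution: str) -> str:
    return (
        "Here is a worked example of a related competition problem.\n\n"
        f"Problem: {retrieved_problem}\n"
        f"Solution: {retrieved_solution}\n\n"
        "Now solve the following problem, reasoning step by step.\n\n"
        f"Problem: {new_problem}\n"
    )

def solve_with_retrieval(new_problem, retrieve, ask_model):
    # `retrieve` returns a (problem, solution) pair; a poorly matched retrieval
    # can hurt performance, which is why retrieval quality matters as much as
    # the prompt itself.
    related_problem, related_solution = retrieve(new_problem)
    return ask_model(build_rag_prompt(new_problem, related_problem, related_solution))
```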
Alshammari co-authored the paper with Safaei, HUMAIN AI engineer Abrar Zainal, KAUST Academy Director Sultan Albarakati, and MIT CSAIL colleagues: graduate student Kevin Wen SB ’25; Microsoft Principal Engineering Manager Mark Hamilton SM ’22, PhD ’25; and professors William Freeman and Antonio Torralba. Their work was funded, in part, by a Schwarzman College of Computing Fellowship and the National Science Foundation.
MathNet is publicly available at mathnet.csail.mit.edu.



