Teaching AI agents to ask better questions by playing “Battleship” | MIT News

In 2026, the hype around artificial intelligence agents is higher than ever. These autonomous systems can “think” and perform well-defined tasks in areas such as customer service and software development, often using language models (LMs). But fields like medical diagnosis and scientific discovery require them to inquire about multiple possible solutions in uncertain environments, something LMs struggle with.

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University’s School of Engineering and Applied Sciences (SEAS) took a deep dive into LMs to understand their core problems in high-performance settings. Their experiment: “Battleship,” a classic guessing game that has helped cognitive scientists test how people seek information.

CSAIL and SEAS scholars added a twist by redesigning the game so that players ask and answer natural-language questions. In their “Collaborative Battleship” game, one participant is the “captain” who asks where the hidden ships are, while a teammate plays “spotter” by answering those questions in real time.

The researchers first had more than 40 people play the game together, and collected their questions and yes-no answers to create the “BattleshipQA” dataset. These results became a useful point of comparison when the team tested high-end LMs (like GPT-5) and smaller models (like Llama 4 Scout) in their game. Without any additional training of the models, they found that advanced LMs can “beat” people at “Battleship” (that is, finish the game in fewer turns), but smaller systems reason far less effectively.

The main problem is that most models cannot come up with useful questions. To get the LMs to ask in ways that reveal more information about the hidden vessels, the researchers gave each model a Monte Carlo inference strategy, which carefully weighs the probability that different options are correct given each answer. The result: AI models can beat regular players at “Battleship,” regardless of model size.

Perhaps the most striking results were the gains for Llama 4 Scout. On its own, the small LM beat humans only 8 percent of the time. But with its upgraded reasoning, the model reached a “Battleship” win rate of 82 percent against humans. This careful and efficient style of questioning also allowed the model to outperform the frontier model (GPT-5), while operating at about 1 percent of its cost.

In addition to these improvements, the researchers narrowed the gap between humans and LMs in answering questions. While GPT-5 was a reliable spotter that helped models complete games quickly, the smaller systems had a nasty habit of giving incorrect answers about where ships were hidden. The models saw a 15 percent accuracy boost on average when they first turned the questions into code that explicitly told them how to verify their answers (for example, having the model run a quick scan of an area when asked whether a ship was there).

“Modern language models are very well designed to answer complex questions, but it’s not clear that they learn to ask good questions,” said MIT PhD student and CSAIL researcher Gabriel Grand SM ’23, lead author of a paper about the work. “Our work shows that asking informative questions depends on the ability to predict and simulate the world. We find that when we give agents access to a ‘model of the world,’ they ask better questions and find resources more efficiently.”

Sea change in LMs

The team’s focus was to get LMs to ask better questions. Using Monte Carlo techniques, LMs treat possible game states as individual particles. Those that appear most plausible given each response from the spotter are weighted more heavily, like balls that inflate or deflate each turn. With this flexible method of calculation, the captain can ask questions that draw more information out of the spotter.
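The weighting idea described above can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the 3×3 board with a single two-cell ship, the noise level `eps`, the three candidate questions, and all function names are assumptions chosen for illustration. Each possible ship placement is a particle, a yes-no answer reweights the particles, and the captain picks the question with the highest expected information gain.

```python
import math
from itertools import product


def all_boards(size=3, ship_len=2):
    """Enumerate every placement of one ship as a frozenset of (row, col) cells."""
    boards = []
    for r, c in product(range(size), repeat=2):
        if c + ship_len <= size:  # horizontal placement fits
            boards.append(frozenset((r, c + i) for i in range(ship_len)))
        if r + ship_len <= size:  # vertical placement fits
            boards.append(frozenset((r + i, c) for i in range(ship_len)))
    return boards


def reweight(particles, weights, question, answer, eps=0.1):
    """Scale each particle by how well it agrees with the spotter's answer.

    eps is an assumed probability that the spotter answers incorrectly.
    """
    scaled = [
        w * ((1 - eps) if question(board) == answer else eps)
        for board, w in zip(particles, weights)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]


def binary_entropy(p):
    """Entropy in bits of a yes/no variable with P(yes) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_info_gain(particles, weights, question, eps=0.1):
    """Expected entropy reduction from asking a yes/no question of a noisy spotter."""
    p_yes_true = sum(w for board, w in zip(particles, weights) if question(board))
    p_yes_observed = eps + (1 - 2 * eps) * p_yes_true
    return binary_entropy(p_yes_observed) - binary_entropy(eps)


# Hypothetical candidate questions, each written as a predicate over a board.
questions = {
    "does the ship touch row 0?": lambda b: any(r == 0 for r, _ in b),
    "does the ship cover the center?": lambda b: (1, 1) in b,
    "is the ship horizontal?": lambda b: len({r for r, _ in b}) == 1,
}

boards = all_boards()
weights = [1 / len(boards)] * len(boards)  # uniform prior over placements

best_name = max(
    questions, key=lambda name: expected_info_gain(boards, weights, questions[name])
)
posterior = reweight(boards, weights, questions[best_name], answer=True)
```

In this toy setting the orientation question splits the twelve placements evenly, so it scores highest; after a “yes,” the horizontal placements carry most of the weight while the vertical ones shrink but do not vanish, because the spotter is assumed to be occasionally wrong.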

The scientists then turned to the widely used programming language Python to help AI spotters. Each question the captain asked was automatically converted into a coded command. For example, the question, “Is there a ship in one column that exceeds two rows?” is turned into instructions for the spotter LM to search the area in question and check how long the digital game piece is. By giving the model clear instructions in a language it understands, each model gave correct answers more often. The lightweight system GPT-4o-mini saw an accuracy boost of 30 percent, for example, and even the larger model Claude 4 Opus jumped by about eight points.
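As a rough illustration of the question-to-code step, here is a sketch of the kind of program a spotter might run for the example question above. The board encoding (strings with `.` for water and a letter per ship), the helper `ship_cells`, and the function `answer` are all hypothetical; the article does not describe the actual generated code, and no language model is called here.

```python
from collections import defaultdict

# Assumed board encoding: "." is water, each letter marks one ship's cells.
BOARD = [
    "..A..",
    "..A..",
    "..A..",
    ".....",
    "BB...",
]


def ship_cells(board):
    """Group occupied (row, col) cells by ship letter."""
    cells = defaultdict(list)
    for r, row in enumerate(board):
        for c, ch in enumerate(row):
            if ch != ".":
                cells[ch].append((r, c))
    return dict(cells)


def answer(board):
    """Hypothetical program for the question:
    'Is there a ship in one column that exceeds two rows?'
    """
    for cells in ship_cells(board).values():
        rows = {r for r, _ in cells}
        cols = {c for _, c in cells}
        # The ship sits in a single column and spans more than two rows.
        if len(cols) == 1 and len(rows) > 2:
            return True
    return False


reply = "yes" if answer(BOARD) else "no"
```

Because the answer is computed by scanning the board rather than recalled or guessed, the reply follows deterministically from the ship positions, which is the property the paragraph above credits for the accuracy gains.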

“The field has seen great success in ‘autonomous’ techniques, where LMs generate code to validate their solutions,” said senior author Jacob Andreas, MIT professor of electrical engineering and computer science and CSAIL principal investigator. “What I find most exciting about this work is that it opens up opportunities to use these methods to find better solutions in the first place, by improving the LMs’ skills to evaluate and gather information. We are very excited to elevate this work from scientific domains to applications such as coding and solving mathematical problems.”

Let’s play something else

But how would this approach work in other board games? The team tested their newly equipped LMs on “Guess Who?”, where large and small models skillfully narrowed down 100 options to correctly guess which hidden character was chosen. Llama 4 Scout succeeded 30 percent of the time, but after Grand and his colleagues retooled it, the model completed the mission in more than 72 percent of its runs. Meanwhile, GPT-4o jumped from 62 percent to 90 percent. GPT-5 was counting each match to ensure that the questions were answered as accurately as possible.

Although LMs have made promising progress in both games, there is room for improvement. For example, models still struggle to answer complex questions, compared to humans. OpenAI researcher, recent Harvard graduate, and co-author Valerio Pepe added that “GPT-5 can beat your average ‘Battleship’ player, and it’s a hair better with our methods. However, professional players are still hard to beat for all models, unlike chess, where even top players are unsuccessful against AI systems.”

The researchers’ findings show that AI agents have untapped potential for “needle-in-a-haystack” discovery: navigating a vast array of options to find unconventional solutions to scientific challenges. While advanced search capabilities could make them excellent research assistants for tasks such as identifying the molecular structure of compounds, the researchers caution that “Collaborative Battleship” is a simple test bed. They would like to test LMs in more complex settings, where systems must consider many options.

Grand also plans to have humans and AI models play as teammates to learn whether they perform best together. The models may also benefit from fine-tuning on game simulations, and with more computing power, LMs will have more advanced reasoning capabilities to predict how the game will unfold.

“As AI systems become more powerful, the most difficult problems are becoming social: tracking similarities, resolving conflicts, and getting to know different partners over time,” said Robert Hawkins, an assistant professor of linguistics at Stanford University, who was not involved in the paper. “This work captures these phenomena well in a controlled collaborative environment, and makes a compelling case that the real bottleneck for AI agents is not just the calculation of appropriate questions, but the logical thinking required to derive their answers.”

Grand and Pepe co-authored the paper with two CSAIL principal investigators: MIT Associate Professor Jacob Andreas and MIT Professor Joshua Tenenbaum. Their work was supported, in part, by the MIT Siegel Family Quest for Intelligence, the MIT-IBM Watson AI Lab, the FinTechAI@CSAIL initiative, a Sloan Research Fellowship, Intel, the Air Force Office of Scientific Research, the Defense Advanced Research Projects Agency, the Office of Naval Research, and the National Science Foundation. They are presenting their paper as an oral presentation at the International Conference on Learning Representations (ICLR) in April.
