Teaching AI models to say “I’m not sure” | MIT News

Confidence is persuasive. In artificial intelligence systems, it is often misleading.
Today’s most capable AI models share a trait with the loudest voice in the room: they deliver every answer with the same unshakable certainty, whether they are correct or merely speculating. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have now traced that overconfidence to a specific flaw in how these models are trained, and have developed a way to correct it without sacrificing accuracy.
The method, called RLCR (Reinforcement Learning with Calibration Rewards), trains language models to produce confidence ratings that are calibrated to their answers. Rather than simply producing an answer, the model reasons about its uncertainty and outputs a confidence score alongside it. In tests across multiple benchmarks, RLCR reduced calibration error by up to 90 percent while maintaining or improving accuracy, both on tasks the model was trained on and on entirely new tasks it had never seen. The work will be presented at an international machine-learning conference later this month.
The problem stems from a surprisingly simple source. The reinforcement learning (RL) techniques behind recent breakthroughs in AI reasoning, including the training methods used in systems like OpenAI’s o1, reward models for getting the right answer and punish them for getting it wrong. There is no in between. A model that arrives at the correct answer through careful reasoning receives the same reward as one that guesses correctly by chance. Over time, this trains models to answer every question they are asked with full confidence, whether they are working from solid evidence or flipping a coin.
That overconfidence has consequences. When models are used in medicine, law, finance, or any setting where users make decisions based on AI output, a system that projects high confidence regardless of its true reliability fails in ways that are hard to detect from the outside. A model that says “I’m 95 percent sure” when it’s right only half the time is more dangerous than one that simply gets the answer wrong, because users have no signal telling them to seek a second opinion.
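To make that failure mode concrete, here is a minimal sketch of the calibration gap in the scenario described above. The numbers are illustrative assumptions, not data from the study:

```python
# Toy illustration (assumed numbers): a model that reports 95% confidence
# on answers that turn out to be correct only half the time.

confidences = [0.95] * 10   # what the model claims, per answer
outcomes = [1, 0] * 5       # what actually happened: 50% correct

accuracy = sum(outcomes) / len(outcomes)
avg_confidence = sum(confidences) / len(confidences)
calibration_gap = avg_confidence - accuracy

# A gap of 0.45 means the model is badly overconfident.
print(f"accuracy={accuracy:.2f}, confidence={avg_confidence:.2f}, gap={calibration_gap:.2f}")
```

A well-calibrated model would show a gap near zero: its stated confidence would track how often it is actually right.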
“The standard training method is simple and powerful, but it gives the model no incentive to express uncertainty or say ‘I don’t know,’” says Mehul Damani, an MIT PhD student and co-author of the paper. “So the model naturally learns to guess confidently, even when it’s uncertain.”
RLCR addresses this by adding a single term to the reward function: the Brier score, a well-established measure that penalizes the gap between a model’s stated confidence and its actual accuracy. During training, the model learns to reason about both the problem and its own uncertainty, generating an answer and a confidence estimate together. Confidently wrong answers are penalized; so are needlessly hedged correct ones.
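The shape of such a reward can be sketched in a few lines. This is an illustrative simplification, not the authors’ implementation: it assumes a correctness reward of 1 or 0 and subtracts the Brier penalty on a self-reported confidence between 0 and 1.

```python
def rlcr_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus the Brier penalty (confidence - outcome)^2.

    Illustrative sketch of a calibration-aware reward; the exact form used
    in the paper may differ.
    """
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# A confident correct answer earns nearly the full reward,
print(rlcr_reward(True, 0.95))   # 0.9975
# a confident wrong answer is penalized heavily,
print(rlcr_reward(False, 0.95))  # -0.9025
# and a hedged correct answer lands in between.
print(rlcr_reward(True, 0.5))    # 0.75
```

Under this reward, the best strategy is no longer to always claim certainty: overstating confidence on a wrong answer costs more than admitting doubt.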
The math backs this up: the team formally proved that this reward structure incentivizes models that are both accurate and well-calibrated. They then tested the method on a 7-billion-parameter model across a range of question-answering and math benchmarks, including six datasets the model had never been trained on.
The results showed a consistent pattern. Standard RL training degraded calibration relative to the base model, making models worse at estimating their own uncertainty. RLCR reversed that effect, substantially improving calibration without losing accuracy. The method also outperformed post-hoc approaches, in which a separate classifier is trained to assign confidence scores after the fact. “What’s surprising is that conventional RL training doesn’t just fail to help calibration, it actively hurts it,” says Isha Puri, an MIT PhD student and co-author. “Models become more capable and more overconfident at the same time.”
The team also showed that the confidence ratings RLCR produces are genuinely useful at inference time. When a model generates multiple candidate answers, selecting the one with the highest self-reported confidence, or weighting votes by confidence in a majority-voting scheme, improves accuracy as the amount of test-time compute scales.
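The confidence-weighted voting scheme can be sketched as follows. The candidate answers and confidence values here are made-up examples, not outputs from the paper’s model:

```python
from collections import defaultdict

def weighted_vote(candidates: list[tuple[str, float]]) -> str:
    """Pick the answer whose self-reported confidences sum to the highest total.

    Illustrative sketch of confidence-weighted majority voting over multiple
    sampled (answer, confidence) pairs.
    """
    scores: dict[str, float] = defaultdict(float)
    for answer, confidence in candidates:
        scores[answer] += confidence
    return max(scores, key=scores.get)

# Hypothetical samples: "42" wins with total weight 0.9 + 0.8 = 1.7.
samples = [("42", 0.9), ("41", 0.4), ("42", 0.8), ("40", 0.3)]
print(weighted_vote(samples))  # 42
```

Note that this can differ from a plain majority vote: a single high-confidence answer can outweigh several low-confidence ones, which is exactly where calibrated confidence scores pay off.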
Further findings suggest that the act of reasoning about uncertainty has value in itself. The researchers trained classifiers on model outputs and found that including the model’s explicit verbalized uncertainty estimate in the input improved classifier performance, especially for smaller models. A model’s own account of what it does and doesn’t know carries real information, not mere decoration.
Besides Damani and Puri, other authors on the paper are Stewart Slocum, Idan Shenfeld, Leshem Choshen, and senior authors Jacob Andreas and Yoon Kim.


