Best AI chatbots avoid harm but fall short in high-risk conversations, new startup benchmark finds – GeekWire

Mpathic, a Seattle startup that helps AI companies test how their models respond to risk, has a new message for Claude, ChatGPT, and Gemini: you’re getting safer, but you’re not safe enough.
The company on Tuesday released mPACT, a physician-led benchmark that tests how AI models handle high-risk conversations, including those involving suicide risk, eating disorders and misinformation.
Across all three benchmarks, the best models generally avoided dangerous responses and often recognized signs of distress, but they consistently fell short of what a clinician would consider an adequate response to a real crisis, according to the company’s findings.
“A lot of people don’t say ‘I’m at risk’ directly; they show it through subtle behaviors over time that are visible to human clinicians,” said Grin Lord, Mpathic founder and CEO and a board-certified psychologist. “The models are getting better at recognizing these moments, but the response still needs to meet that information with real support.”
Here is what Mpathic found as the models navigated sensitive situations that they already encounter in the real world.
Suicide risk: This was the strongest area for all models, although no single model led on every dimension.
- Claude Sonnet 4.5 achieved the highest mPACT composite score, showing strong clinical alignment across detection, interpretation and response, and was described as the closest mirror to how a human clinician would respond.
- GPT-5.2 led in harm avoidance, meaning it was much better at not doing something wrong, although testers noted that its responses weren’t always helpful enough.
- Gemini 2.5 Flash performed well when danger signs were obvious but was weak at subtle early warning signs.
Eating disorders: This was the weakest area for all models, with combined performance near the median baseline. The main challenge is that eating disorder risk is often indirect and culturally normalized, framed as dieting, discipline or health promotion, which makes it difficult for models to flag.
- Claude Sonnet 4.5 again led in overall clinical alignment and had very low rates of harmful behavior.
- Gemini 2.5 Flash performed better in high-risk situations but struggled with subtle signals.
- GPT-5.2 showed a mixed profile: strong in supportive behavior but also more likely to provide harmful information.
Misinformation: Models struggled here in a subtle but important way – not by stating false information directly, but by reinforcing questionable beliefs, expressing unwarranted confidence, and presenting one-sided information without sufficiently challenging users’ assumptions.
The benchmark found that this failure is particularly pronounced in dynamic, multi-turn interactions, where a model’s errors can gradually accumulate over time.
- GPT-5.2 generally led in helping users think more clearly rather than reinforcing negative assumptions.
- Claude Sonnet 4.5 was close behind and was particularly strong in pushing back against unsupported beliefs.
- Grok 4.1 and Mistral Medium 3 were the weakest performers.
Where models go wrong: The findings include examples of how some models fail in practice.
In one eating disorder conversation, a user openly talked about adding a stimulant to a protein smoothie – a clear sign of disordered eating – and the model responded by calling it a “smart mom’s move” and asking for the name of the product, completely missing the risk. In another, the model provided detailed instructions on how to hide purging when a user asked how to keep it quiet.
In the suicide benchmark, a model responded to a user expressing suicidal ideation by providing a detailed list of methods ranked by effectiveness, complete with identifying details, while assuring the user that thinking about methods without taking action is “no problem.”
Alison Cerezo, a science executive at Mpathic and a licensed psychologist, presented mPACT as a transparency tool that the field has lacked.
“We need a shared, clinically based standard for AI behavior,” she said. “mPACT is designed to bring transparency and accountability to how these programs work when it matters most.”
The mPACT benchmarks were developed and tested by licensed clinicians, who designed dynamic conversations that simulate real-world interactions across varying levels of risk. Each model’s responses were scored by trained clinicians rather than automated systems, using a rubric that captured both helpful and harmful behaviors within a single response.
Mpathic was founded in 2021, initially to bring more empathy to business communication by analyzing conversations in texts, emails, and audio calls. The company has since shifted its focus to AI safety, working with frontier model developers to prevent risky model behavior across use cases ranging from mental health to financial risk and customer support.
The startup counts Seattle Children’s Hospital and Panasonic WELL among its clinical partners. Mpathic raised $15 million in funding in 2025, led by Foundry VC, and says it grew fivefold in the final quarter of last year.
Ranked No. 188 in the GeekWire 200 index of the Pacific Northwest’s top startups, Mpathic was a finalist for Startup of the Year at the 2026 GeekWire Awards last week.


