Can you rely on AI chatbots for medical advice?

Carsten Eickhoff from the University of Tübingen examines the problems that arise when AI chatbots are used for medical queries.
Imagine that you have just been diagnosed with early-stage cancer and, before your next appointment, you type a question into an AI chatbot: “What other clinics can successfully treat cancer?” In seconds you get a polished answer that reads as if it were written by a doctor. Except that some of the claims are baseless, the footnotes lead nowhere, and the chatbot never once suggests that the question itself might be the wrong one to ask.
That situation is not hypothetical. It is what a team of seven researchers found when they put five of the world’s best-known chatbots through a systematic stress test of their health information. The results are published in BMJ Open.
The chatbots, ChatGPT, Gemini, Grok, Meta AI and DeepSeek, were each asked 50 health and medical questions covering cancer, vaccines, stem cells, nutrition and sports performance. Two experts independently rated every response. About half of all responses were rated problematic: roughly 20pc very problematic and 30pc somewhat problematic. None of the chatbots reliably produced an accurate reference list, and only two of the 250 responses were outright refusals to answer.
Overall, the five chatbots performed almost identically. Grok was the worst performer, with 58pc of its answers rated problematic, followed by ChatGPT at 52pc and Meta AI at 50pc.
Performance varied by subject, however. The chatbots handled vaccines and cancer, fields with large, well-established bodies of research, much better, yet still produced problematic answers about a quarter of the time. They stumbled most on nutrition and sports performance, topics where the internet is awash with conflicting advice and hard evidence is thin.
Open-ended questions are where things really went sideways: 32pc of those answers were rated very problematic, compared with 7pc for closed questions. That distinction matters because many real-world health questions are open-ended. People don’t ask chatbots pure true-or-false questions. They ask things like: “What are the best supplements for overall health?” That is exactly the kind of question that invites a smooth, confident response and can produce dangerous advice.
When the researchers asked each chatbot to provide 10 scientific references for a topic, the median accuracy was just 40pc. No chatbot managed a single fully accurate list across 25 attempts. Errors ranged from wrong authors and broken links to completely fabricated papers. This is a particular risk because references look like evidence. A lay reader who sees a well-formatted citation list has little reason to question the content behind it.
Why chatbots get things wrong
There is a simple reason why chatbots get medical answers wrong: language models do not know facts. They predict the statistically most likely next word based on their training data and the context of the conversation. They do not weigh evidence or make clinical judgments. Their training material includes peer-reviewed papers, but also Reddit threads, health blogs and social media discussions.
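To make that concrete, here is a minimal Python sketch of what “predicting the next word” looks like in practice. It uses the small, openly available GPT-2 model via Hugging Face’s transformers library, an illustrative stand-in rather than one of the chatbots tested in the study:

# Minimal sketch of next-token prediction. GPT-2 is an illustrative
# stand-in for the much larger models behind commercial chatbots.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The most effective treatment for early-stage cancer is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence, vocabulary)

# Probability distribution over the whole vocabulary for the next token only.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}: {prob.item():.3f}")

Whichever continuation scores highest wins, whether or not it is medically true. Fluency and factual accuracy are, mechanically speaking, unrelated.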
The researchers did not ask neutral questions. They deliberately designed prompts intended to nudge the chatbots into giving misleading answers, a common stress test known in AI safety research as ‘red teaming’. This means the error rates are probably higher than what you would see with neutral prompts. The study also tested the free versions of each model available in February 2025; paid tiers and newer releases may perform better.
However, many people use exactly these free versions, and many real-world health questions are not carefully worded. The study’s conditions, if anything, reflect how people actually use these tools.
The findings do not stand in isolation; they sit within a growing body of evidence that paints a consistent picture.
A study published in Nature Medicine in February 2026 showed something surprising. The chatbots themselves can produce the right medical answer about 95pc of the time. But when real people used those same chatbots, they got the right answer less than 35pc of the time, no better than people who did not use them at all. In other words, the problem is not just whether the chatbot gives the right answer; it is whether everyday users can understand and apply that answer correctly.
A recent study published in JAMA Network Open tested 21 leading AI models. The researchers asked them to generate possible medical diagnoses. When the models were given only basic information, such as a patient’s age, sex and symptoms, they struggled, failing to suggest the correct set of possible conditions more than 80pc of the time. When test and lab results were added, accuracy rose to more than 90pc.
Meanwhile, another US study, published in Nature’s Communications Medicine, found that chatbots readily repeated and elaborated on fabricated medical conditions that had been entered into the prompt.
Taken together, these studies suggest that the weaknesses found in the BMJ Open study are not quirks of a single testing method but reflect something more fundamental about where the technology stands today.
These chatbots are not going away, and they shouldn’t. They can summarise complex topics, help prepare questions for the doctor and serve as a starting point for research. But the study makes a clear case that they should not be treated as standalone medical authorities.
If you use one of these chatbots for health information, verify any medical claim it makes, treat its references as leads to be checked rather than facts, and be wary when an answer sounds confident but offers no caveats.
Carsten Eickhoff
Carsten Eickhoff is a professor of medical data science at the University of Tübingen. His lab develops machine learning and natural language processing techniques with the goal of improving patient safety, individual health and the quality of medical care. Carsten has authored more than 150 papers at computer science conferences and in medical journals and has served as an advisor and research committee member for more than 70 students.



