An AI read his MRI and disagreed with his doctor. He left with less certainty, not more.

A man with three weeks of shoulder pain walked out of a clinic with a diagnosis, a treatment plan, and a bad feeling. So he did what a growing number of people now do: he fed the raw scan to a model. In a post published on 28 June, a developer who writes under the name Antoine describes exporting his MRI as a 266MB DICOM package, handing it to Opus 4.8 running inside Claude Code, and letting it install its own tooling and work the images for about an hour. The clinic's radiologist had reported a "Grade III (>50%-width) partial-thickness tear" at the insertion of his subscapularis tendon, and had already begun an intervention-heavy course of treatment. Opus reported an intact tendon. A second pass, set up as a blind arbitration between the two readings, sided with the machine: no discrete tear, mild insertional tendinosis at most.

Read one way, this is the story the technology has been promising for years. A patient with no medical training got a credible expert reading of his own scan for the price of a subscription, and it caught what looked like premature, aggressive treatment. Anthropic is building the productised version of exactly this competence. Claude Science, released in beta the same week, is pitched as a workbench that "runs analyses, searches databases, and traces every step from data wrangling to publication," with healthcare and life sciences named among its target verticals. And the capability claim is not marketing. On 23 June, Nature Medicine published a head-to-head study in which general-purpose models from OpenAI, Google and Anthropic outperformed two FDA-cleared clinical decision tools, OpenEvidence and Wolters Kluwer's UpToDate Expert AI, on questions submitted by practising physicians. The obvious line writes itself: the second opinion has been democratised.

That is the line I want to break from, because it misreads what actually happened to Antoine. He did not walk away with a better answer. He walked away with two confident answers and no way to choose between them. In his own words he is now in "a state of limbo," less sure than when he started, weighing whether to find another doctor or wait and see. The machine did not resolve his uncertainty. It manufactured more of it.

The scarce good was never the reading

Here is the thing the democratisation framing skips over. Radiology's value to a patient is not a description of what is on the film. It is an accountable description: a reading that a named, licensed, insured person has put their signature to and will answer for if it is wrong. That signature is what lets a patient stop worrying and hand the problem over. Antoine gestures at this himself when he writes about "the incredibly peaceful feeling of being in the hands of an expert you trust."

What Opus produced was a reading nobody stands behind. It is fluent, it cites its own reasoning, it even flags where it cannot resolve a dispute, and all of that makes it more persuasive, not more accountable. The pitch is that AI gives everyone a radiologist. What I can verify is that AI gives everyone a second reading with the accountability stripped out, and then leaves the patient holding the contradiction. For a person trying to decide whether to let someone put needles in their shoulder, a second unadjudicated opinion is not obviously an improvement on one. It can be worse, because it removes the ability to defer to anyone at all.

This is the part the anecdote's happy reading gets backwards. The most useful thing a model did for Antoine was not reading the scan. It was the earlier step, where a different model, GPT 5.5 Pro, flagged that the clinic had used shockwave therapy against a clinical guideline for non-calcific rotator-cuff tendinopathy, and had injected him with Traumeel, a preparation registered in Germany as a homeopathic medicine with no therapeutic indication. Those are text-reasoning tasks: cross-checking a treatment against published guidance. They are genuinely valuable and models are genuinely good at them. But they are a different capability from reading pixels off a DICOM, and it is the pixel-reading that produced the flat contradiction neither party could settle.

The strongest case for the other side

The fair version of the opposing argument is not that patients should trust the machine blindly. It is narrower and harder to dismiss. Overtreatment is real, access is unevenly distributed, and for a lot of people the alternative to an AI second read is no second read at all. Antoine's clinic did start billing shockwave and homeopathy within minutes of the scan; a sceptical prompt that surfaces that is worth having. And the Nature Medicine result is a real data point, not a vibe: on the query type that matters most at the point of care, the unstructured question a clinician actually asks, the general models already beat tools that cleared a regulatory bar. Antoine's own hope is a trajectory argument. Give it a couple of model generations, he writes, and we will trust AI to read an MRI the way we trust it to proofread an email.

I take that seriously, and I think it still misses. The trajectory argument assumes the bottleneck is model accuracy, and that once the accuracy curve clears some line the accountability problem dissolves behind it. The history says the opposite. Generating a confident, plausible, guideline-shaped reading was never the hard part of clinical AI. IBM's Watson for Oncology could do that in 2017. What it could not do was be right in a way anyone could stand behind: a 2018 STAT investigation found it producing "unsafe and incorrect" treatment recommendations, and MD Anderson had already shelved a project on which a University of Texas audit put the spend at US$62 million. Watson failed not because it sounded uncertain but because it sounded certain and the certainty did not survive contact with real cases. Fluency was the trap, not the fix.

Benchmark-beating is not deployment

There is a measurement gap sitting under all of this that the point-and-shoot MRI workflow flies straight past. Beating a cleared tool on physician-submitted questions is a benchmark result. It is not the same as reading a random patient's DICOM export from a one-line prompt of "right shoulder pain for 2-3 weeks," which, as Antoine notes, was less context than the human doctors received. The clearance apparatus that governs this space is enormous and almost entirely built around imaging: as of March 2026 the FDA had cleared 1,524 AI algorithms, 1,163 of them in radiology. And even that gated population has a validation problem. A University of North Carolina team found that of 521 AI devices the FDA authorised through 2022, about 43% shipped with no clinical validation data at all, and only 22 had been through a randomised controlled trial. If the tools that clear the bar are that thinly validated, a general model reading an unfamiliar scan with no task-specific evaluation is not somewhere above that standard. It is off the chart that measures it.

Notice what Claude Science, the official product, actually sells into this. Its pitch is reproducibility: every figure "ships with its history," the exact code and environment "so they can be reproduced, edited, or defended months later." That is accountability language, and it is aimed at researchers who will have to defend a result to a reviewer. The viral consumer workflow that Antoine ran has none of that scaffolding: no reproducibility guarantee, no defensibility, no one to answer to. The gap between the two is the whole argument. Anthropic knows the professional buyer needs an audit trail. The patient pointing Claude Code at his own shoulder gets a verdict and a "not medical advice" disclaimer.

The bet

I don't have a clean read on where consumer AI diagnostics land, and anyone who says they do is selling something. But the useful question is not "will the models get accurate enough," because on narrow tasks they may already be, and accuracy was never what Antoine was missing. The question is whether anyone will attach accountability to the output. Right now the entire consumer category runs on the disclaimer: the model gives you a reading and explicitly refuses to be responsible for it. That is the tell that it is a demo, not a service.

So here is the falsifiable version. Watch for the first consumer-facing AI second-opinion product that ships with indemnity instead of a disclaimer, where the vendor, an insurer, or a clinic formally accepts liability for the read the way a radiologist's signature does. Until that exists, every story like Antoine's will end the same way his did: not with a patient who knows more, but with one who has been handed a second answer and left alone to decide which stranger to believe. That is the bet. The hard problem in medical AI was never generating the opinion. It was being willing to own it.