01 June 2026 | Research Deep-Dive

The Same Patient, Two Descriptions, Two Diagnoses: Why Clinical LLMs Need Semantic Stability Testing

A new paper reveals that clinical LLMs produce different diagnoses when the same patient scenario is described using different words — and introduces metrics to test for it. Here's why this matters for anyone deploying AI in healthcare.

Most clinical LLM evaluations test whether a model gives the right answer for a given input. Almost none test whether the model gives the same answer when the input is rephrased. A new paper shows that this gap is not just a theoretical concern — it is a patient safety risk.


The Problem: Same Patient, Different Words, Different Diagnosis

Consider a patient presenting with chest pain. A nurse writes "pt c/o chest pain." The patient says "my chest hurts." A junior doctor describes it as "retrosternal discomfort." A consultant notes "anginal syndrome." These are all descriptions of the same clinical scenario — but if you feed each one into a clinical LLM, you may get different diagnoses, different recommendations, different risk assessments.

A new paper from researchers at the Indian Institute of Science makes this failure mode precise. They show that current clinical LLMs are not semantically stable: semantically equivalent rephrasings of the same patient scenario produce meaningfully different model outputs. This is not a minor inconsistency. It is a patient safety risk.

Why this matters

If rephrasing the same clinical scenario changes the model's recommendation, the model is not safe to deploy. The variability is in the model's sensitivity to phrasing — not in the patient's condition.


Why Standard Similarity Metrics Fail

The obvious first question is: can you just check whether two descriptions are "similar" and flag cases where they diverge? The answer is no — not with standard tools.

The paper demonstrates a critical gap between embedding similarity and clinical equivalence. Cosine similarity on sentence embeddings treats "diabetes" and "no history of diabetes" as highly similar, because both phrases are about the same concept. Clinically, they are opposites: the first indicates that the patient has diabetes, and the second states explicitly that they do not.

| Phrase Pair | Cosine Similarity | Clinical Meaning |
| --- | --- | --- |
| "diabetes" vs "no history of diabetes" | High | Opposite |
| "prior MI" vs "acute MI" | High | Different clinical states |
| "no evidence of malignancy" vs "malignancy" | High | Opposite |

This is not an edge case. Clinical text is full of negation, temporality, and conditional framing. Standard embedding models were not trained to distinguish these clinically critical differences. They mostly capture topical and surface-level similarity, not medical meaning.
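To make the gap concrete, here is a minimal sketch, not the paper's method. It uses a bag-of-words cosine as a simple stand-in for an embedding-based similarity score, and a crude negation-cue check as a stand-in for a clinically-aware one. The two example sentences, the cue list, and every function name are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical cue list for illustration only. Real clinical negation
# detection needs scope handling, abbreviations, and much richer rules.
NEGATION_CUES = {"no", "not", "denies", "without"}


def tokens(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_negated(text):
    """True if the text contains any of the negation cues."""
    return any(t in NEGATION_CUES for t in tokens(text))


def polarity_conflict(a, b):
    """True if exactly one of the two texts is negated."""
    return is_negated(a) != is_negated(b)


# Illustrative sentences (assumed, not taken from the paper).
a = "patient has a history of diabetes"
b = "patient has no history of diabetes"

score = bow_cosine(a, b)
conflict = polarity_conflict(a, b)
print(f"surface cosine = {score:.2f}, polarity conflict = {conflict}")
# prints: surface cosine = 0.83, polarity conflict = True
```

The surface score is high even though the two sentences say opposite things, and only the polarity check notices. The same check reports no conflict for "prior MI" vs "acute MI", which is a reminder that negation handling alone does not cover temporality.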


What the Paper Introduces

The paper proposes clinically-aware similarity metrics that go beyond cosine distance. These metrics account for negation, temporality, and clinical context.

With these metrics in hand, the researchers can reliably determine when two patient descriptions are genuinely semantically equivalent — and then test whether the model produces equivalent outputs.

The finding is clear: it does not. Rephrasing changes the output more often than it should.


Why This Matters for Enterprise Clinical AI

Three implications for enterprise leaders

Standard validation misses this failure mode entirely. Real-world clinical notes vary by author. And this is a deployment blocker, not a research curiosity.

First, standard validation misses this entirely. Most clinical LLM evaluations test on fixed benchmarks with fixed wording. They measure accuracy against gold-standard answers. But they do not test whether the model is consistent across rephrasings of the same input. A model can score well on benchmarks and still be unstable in practice.

Second, the real-world impact is concrete. Clinical notes are written by different people with different training, different shorthand, and different levels of detail. A triage note, a patient's own words, a consultant's summary, and a discharge letter may all describe the same patient differently. If your model's output depends on which version it receives, you have a consistency problem that no amount of benchmark accuracy can fix.

Third, this is a deployment blocker. If a clinical LLM gives different recommendations depending on who wrote the note — nurse, patient, junior doctor, consultant — then the model is not safe to deploy. The variability is not in the patient's condition; it is in the model's sensitivity to phrasing. That is exactly the kind of failure that causes harm in production.


The Governance Connection

This paper has an important implication that goes beyond technical model evaluation: you cannot govern what is not consistent.

AI governance frameworks — whether internal policies, regulatory submissions, or audit requirements — assume that a system produces reproducible outputs for equivalent inputs. If the same patient described two different ways gets two different recommendations, what does "the model's recommendation" even mean? How do you audit it? How do you explain it to a regulator? How do you defend it in a clinical incident review?

Semantic stability is not a nice-to-have. It is a prerequisite for every other governance activity: risk assessment, bias evaluation, post-market surveillance, and regulatory defensibility. Without it, the rest of the governance stack is built on sand.


What This Means Practically

  1. Add semantic stability testing to your validation suite. Before deploying any clinical model, test it with multiple semantically equivalent rephrasings of the same scenarios. Measure output consistency, not just accuracy.
  2. Use clinically-aware similarity metrics, not just cosine distance. Standard embeddings will miss clinically critical differences. The paper's metrics provide a starting point, though they will need validation on your specific clinical domain.
  3. Monitor for stability in production. Do not just test at deployment time. Track whether model outputs remain consistent as input phrasing varies in real clinical notes over time.
  4. Document stability as part of your regulatory evidence. If you are preparing a submission or audit trail, include evidence that the model produces consistent outputs for equivalent inputs. Regulators are likely to expect this, even where they do not yet require it explicitly.
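The first step above can be sketched as a small test harness. This is an illustration under stated assumptions, not the paper's evaluation protocol. It assumes model outputs have already been normalised to a comparable label, so exact string match is a fair comparison. It uses pairwise agreement as the consistency score and a hypothetical pass threshold of 1.0. `toy_model` is a deliberately brittle stand-in for a real clinical LLM call, and all names are invented for the example.

```python
from itertools import combinations
from typing import Callable, Dict, List


def pairwise_agreement(outputs: List[str]) -> float:
    """Fraction of output pairs that are identical (1.0 if fewer than 2)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def stability_report(
    model: Callable[[str], str],
    scenarios: Dict[str, List[str]],
    threshold: float = 1.0,  # assumed pass mark; tune per use case
) -> Dict[str, dict]:
    """Run the model on every phrasing of each scenario and score agreement."""
    report = {}
    for name, phrasings in scenarios.items():
        outputs = [model(p) for p in phrasings]
        score = pairwise_agreement(outputs)
        report[name] = {
            "outputs": outputs,
            "agreement": score,
            "stable": score >= threshold,
        }
    return report


def toy_model(note: str) -> str:
    """Hypothetical stand-in that keys on surface words, to show instability."""
    text = note.lower()
    if "pain" in text or "angina" in text:
        return "cardiac workup"
    return "routine review"


# Phrasings of the chest-pain scenario described earlier in the article.
scenarios = {
    "chest_pain": [
        "pt c/o chest pain",
        "my chest hurts",
        "retrosternal discomfort",
        "anginal syndrome",
    ],
}

report = stability_report(toy_model, scenarios)
print(report["chest_pain"]["agreement"], report["chest_pain"]["stable"])
```

Here only two of the six phrasing pairs agree, so the scenario is flagged as unstable even though the toy model gives a plausible answer for some phrasings. For free-text outputs, exact match would need to be replaced with a clinically-aware comparison. The per-scenario report could also be stored as part of an audit trail.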

Key Takeaways

  1. Clinical LLMs are not semantically stable. Rephrasing the same patient scenario changes the model's output. This is a patient safety risk, not a research curiosity.
  2. Standard similarity metrics are clinically blind. Cosine distance on embeddings cannot distinguish "diabetes" from "no diabetes." You need clinically-aware metrics that understand negation, temporality, and context.
  3. Semantic stability is a governance prerequisite. You cannot audit, regulate, or defend a system whose outputs change depending on who wrote the clinical note. Consistency comes first.

Limitations Worth Noting

The paper is a strong contribution, but some limitations are worth acknowledging:


Paper Details

Title: Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
Published: arXiv 2605.30646v1, 14 pages, 5 figures
Categories: cs.CL, cs.AI
Link: https://arxiv.org/abs/2605.30646
