
The Danger of ‘Mostly Right’

Written by Jan H. Blümel | Feb 3, 2026 8:02:35 AM

There are two ways software can fail. It can fail loudly, or it can fail quietly.

When software fails loudly, when an app crashes or a server times out, it is annoying, but it is safe. The system stops. You know something is wrong.

But when software fails quietly, it is dangerous. It continues to operate and, in the case of LLMs, it still gives you an answer to your question. It looks like it is working.

We have spent the last few years building AI for one of the most regulated sectors: healthcare, and in particular healthcare infrastructure. And what we have learned is that most popular generic AI applications suffer from a severe case of quiet failure [1]. They are dangerous not because they are incompetent, but because they are competent enough to deceive you.

The Web of Accountability

To understand why this matters, you have to understand the environment. In the built environment, and specifically in healthcare estates, compliance is not a checkbox. It is a web of accountability.

If you are a software engineer and you make a mistake, you break the build. If you are responsible for building safety in a hospital and you make a mistake, for example regarding fire safety, you face legal consequences. Under the Building Safety Act 2022, failure is not just reputational; it can be criminal.

Compliance here is an ongoing assurance process. Decisions are distributed across roles such as Authorising Engineers, Heads of Estates and Authorised Persons, and every decision must be defensible years after it is made.

The regulatory environment functions as a dense, living network where a change in a single act can ripple through dozens of technical memoranda. Because these regulations are constantly evolving and frequently overlap, an estates team is never just looking for a 'fact'; it is looking for a path through a maze of conflicting statutory requirements. We found that teams report spending over 20 hours a week per person just searching these fragmented repositories. They are desperate for help. And that desperation makes them vulnerable to tools that sound helpful.

In this context, accuracy alone is insufficient. What matters is trust under inspection.

The Illusion of Competence

When we tested the most popular AI applications, like ChatGPT and Copilot, and their underlying LLMs against real NHS scenarios, they didn't hallucinate wildly. They didn't speak in gibberish.

They sounded like a confident consultant.

They would give an answer that was about 80% correct. They would use the right tone. They would sound professional. But they would subtly misinterpret a regulation, or apply a fire safety rule meant for an office building to a surgical ward.

We ran a structured evaluation using 250 questions drawn from our deployments and real compliance-driven scenarios. The results were stark:

  • Microsoft Copilot: 36% accuracy
  • ChatGPT: 74% accuracy
  • CompliMind (Domain Specific): 96% accuracy

The scary number here isn't the 36%. It’s the 74%. [2]

A system that is wrong 64% of the time, as Copilot was in this specific test, has only a limited ability to fool its user. It is obviously broken.

But a system that is right 74% of the time is a trap. It is right often enough to build trust, but wrong often enough to cause a disaster. In a regulated environment, an answer that is almost right is actually completely wrong, because it cannot be defended in an audit.

Users don't reject unsafe tools; they reject unusable ones.

Context is a System Responsibility

The problem isn't that the underlying generic LLMs are "bad". It's that they are misaligned by design. They are optimised for plausibility and conversational fluency. They want to be helpful.

But in regulated work, "helpfulness" is a bug. If a model doesn't know the specific Health Technical Memorandum (HTM) that applies, the helpful thing to do is to refuse to answer. Generic models rarely refuse. They guess.

Many people think the solution is "better prompting." They think if they just teach users to say "Act as an NHS Estates Director," the problem goes away. Relying on the user to provide context is a category error. Prompting only works if the user already knows:

  1. Which specific regulatory regime applies.
  2. Which document takes precedence (e.g., does the HTM override the British Standard here?).
  3. What critical context the AI has omitted.

This assumes the user is already an expert. But what about junior staff? What about cross-functional teams or governance reviewers? If the system’s correctness depends on the user’s ability to ask the perfect question, the system has failed. [3]

We realised that context is not a prompt; it is a system responsibility.

When we enforce system boundaries, embedding the regulatory regime and forcing the model to cite its sources, accuracy jumps. More importantly, safety jumps. The system stops producing reasonable-sounding but wrong answers.
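
To make the idea concrete, here is a minimal sketch of what "context as a system responsibility" can look like in code. It assumes a simple retrieve-then-answer pipeline; the Passage type, the keyword retriever and the prompt wording are illustrative stand-ins, not CompliMind's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    doc_id: str   # e.g. "HTM 05-02"; identifiers here are illustrative
    section: str  # clause or paragraph reference
    text: str     # verbatim excerpt the answer must cite

def retrieve_passages(question: str, corpus: list[Passage]) -> list[Passage]:
    """Stand-in retriever: naive keyword overlap instead of real search."""
    terms = set(question.lower().split())
    return [p for p in corpus if terms & set(p.text.lower().split())]

def build_grounded_prompt(question: str, corpus: list[Passage]) -> str | None:
    """The application, not the user, attaches the regulatory context.
    Returns None when nothing relevant is found, forcing a refusal
    upstream instead of letting the model guess."""
    passages = retrieve_passages(question, corpus)
    if not passages:
        return None  # refuse rather than answer without grounding
    context = "\n".join(f"[{p.doc_id} {p.section}] {p.text}" for p in passages)
    return (
        "Answer only from the passages below and cite every claim with its "
        "[document section] marker. If the passages do not cover the question, "
        "say that you cannot answer.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The detail that matters is not the toy retrieval heuristic but where the context enters the pipeline: the application attaches the regime and the citations, so a junior staff member asking a naive question is grounded in the same documents as an Authorised Person.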

Trust is an Application Property

This leads to a counter-intuitive conclusion: Trust is not a property of the AI model. It is a property of the application.

Where LLMs struggle, the application has to be designed for shared responsibility. To keep users safe and build trust, the application must do things that chat interfaces usually hate doing (a rough sketch follows the list):

  1. Refusal: The system must admit when it doesn't know.
  2. Citation: Every claim must link directly to its origin in the regulation.
  3. Scope: The system must understand that some questions are out of bounds.
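
As a sketch of how those three properties might be enforced at the application layer, the snippet below gates a drafted answer before it reaches the user. It is a deliberately simplified illustration; the indexed documents, scope keywords, citation format and refusal text are all assumptions for the example, not a description of CompliMind's production checks.

```python
import re

# Hypothetical response gate illustrating the three properties above.
# Document identifiers, scope keywords and refusal wording are assumptions.
KNOWN_SOURCES = {"HTM 05-02", "HTM 06-01"}  # documents the system actually indexes
SCOPE_KEYWORDS = ("fire", "electrical", "ventilation", "water")

REFUSAL = ("I cannot answer this from the guidance I hold. "
           "Please consult the relevant Authorised Person.")

def gate_response(question: str, draft_answer: str) -> str:
    # Scope: refuse questions the system was never designed to answer.
    if not any(k in question.lower() for k in SCOPE_KEYWORDS):
        return REFUSAL
    # Citation: the draft must cite at least one indexed document
    # and nothing outside the indexed set.
    cited = set(re.findall(r"\[(HTM [\d-]+)\]", draft_answer))
    if not cited or not cited <= KNOWN_SOURCES:
        # Refusal: admit ignorance rather than pass a plausible guess through.
        return REFUSAL
    return draft_answer
```

A call like gate_response("What fire door rating applies to a ward corridor?", "Per [HTM 05-02], ...") passes the gate, while an uncited or out-of-scope draft comes back to the user as a visible refusal rather than a plausible guess.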

We found that when users see an interface that cites its sources, their behaviour changes. They stop treating the AI as a guru and start treating it as a search engine. They verify. They check. The application forces them to remain the expert.

The Reality Gap

AI fails quietly in regulated built environments not because models are immature, but because systems are under-designed for accountability.

Bridging this gap requires treating AI not as a replacement for critical thinking, but as accountable infrastructure. It requires the right context, explicit boundaries, visible failure modes, and interfaces that support human judgement rather than bypass it.

Notes

[1] I am deliberately not referring to widely used LLMs such as versions of Claude, GPT or Gemini because I believe, as explained later, that the applications they are embedded in induce the quiet failure.

[2] The performance gap we saw in our evaluation (96% vs 74%) wasn't random. The errors in generic models clustered around edge cases and document precedence, exactly the places where human compliance teams also struggle. This suggests that the "last mile" of AI performance isn't about more compute; it's about better engineering of the context.

[3] It is a common mistake to think that expertise is a shield against AI error. In fact, expertise often makes you more susceptible to "automation bias." Because you know what the right answer should look like, you are more likely to skim a plausible-sounding paragraph and miss a subtle error in a numerical value or a regulatory citation. The more "fluent" the AI sounds, the more it bypasses our critical faculties.