Over the past several months I’ve been writing about the engineering reality behind enterprise AI: why demos don’t translate to production, why the context problem matters, why RAG is the pattern worth understanding, why vector databases are infrastructure decisions with long tails. Each post has been circling the same central problem from a different angle.
This one names it directly.
The problem is reliability. Specifically: the near-total absence of it in most enterprise AI deployments, the insufficient attention the industry is paying to it, and the significant gap between what organizations are being sold and what they’re actually getting.
I’m going to call this what it is: reliable enterprise AI — deterministic enough to be audited, consistent enough to be trusted, and observable enough to be fixed when it breaks. That combination of properties is what enterprise software has always required. AI is not exempt. And the industry is far behind on delivering it.
What we mean by reliable
Reliability in software has a reasonably well-developed vocabulary. Uptime. Latency percentiles. Error rates. Mean time to recovery. These metrics are imperfect, but they’re concrete, measurable, and standardized enough that engineers across organizations can have meaningful conversations about them.
AI systems need an expanded vocabulary, and we’re still developing it.
Uptime matters — an AI system that’s down is unreliable in the traditional sense. But an AI system that’s up and returning wrong answers is unreliable in a way that uptime metrics don’t capture. You need a third dimension: output quality. And output quality in AI systems is harder to measure, harder to monitor, and harder to defend to a regulator than uptime.
Here’s what reliable enterprise AI actually requires, from where I’m standing:
Consistency. Given the same inputs and context, the system should produce outputs that are consistent in quality and substance — not necessarily identical word-for-word, but not wildly different in accuracy or scope. Systems where the same question gets a good answer on Tuesday and a wrong one on Thursday, for no traceable reason, cannot be trusted in regulated environments.
Auditability. Every output that matters should be traceable. What was retrieved? What prompt was sent? What did the model receive? What did it produce? These aren’t nice-to-haves for enterprise AI — they’re prerequisites for compliance, for debugging, and for the basic organizational accountability that large institutions require. An AI system that can’t explain its outputs is not enterprise-ready, regardless of how impressive those outputs are.
Graceful failure. When the system doesn’t have enough good information to answer reliably, it should say so — clearly, in terms users can understand. “I don’t have reliable information on this” is a better output than a confidently wrong answer. Most AI systems are not designed for this. They generate something plausible regardless of whether the underlying context supports it.
Stability across model changes. LLM providers update their models. Behavior changes — sometimes subtly, sometimes significantly — without warning. An enterprise AI system needs to be architected so that model updates don’t cause unexpected behavioral regressions. That means abstraction layers, evaluation frameworks, and the organizational discipline to run regression checks before relying on any model update.
Why this isn’t being solved
The honest answer is that reliability is hard, slow, and unsexy. It doesn’t show up in demos. It doesn’t generate press releases. It doesn’t look impressive in a board presentation.
What gets investment is capability. New model releases. Larger context windows. Better benchmark scores. Multimodal inputs. Each of these is genuinely interesting and often genuinely useful. But capability without reliability is a prototype, not a product. And the enterprise is full of AI prototypes that got shipped to production because the capability was impressive enough that nobody asked the harder questions.
There’s also a structural misalignment between where AI products are typically built and where they’re deployed. Most AI development happens in environments where reliability is a secondary concern: research labs, startup product teams, hackathons. The skills and practices that make production software reliable (rigorous testing, observability, incident management, rollback procedures) are not uniformly present in those environments.
When those products reach enterprise, the reliability work that should have been done upfront gets deferred, or skipped entirely, or left to the enterprise engineering team to retrofit after the fact. Retrofitting reliability is expensive and incomplete. It should be designed in.
The regulatory dimension
Here’s a pressure that’s going to force this issue whether the industry is ready or not: the regulatory environment is changing.
The EU AI Act — which came into force this year — establishes requirements for AI systems used in high-risk contexts: employment decisions, credit assessments, access to essential services, law enforcement, and others. These requirements include documentation of the system’s capabilities and limitations, logging of inputs and outputs, human oversight mechanisms, and accuracy and robustness standards. Non-compliance isn’t theoretical — the penalties are structured to be consequential for large organizations.

Most enterprise AI systems deployed today could not pass a rigorous EU AI Act audit for high-risk use cases. The logging isn’t comprehensive enough. The human oversight mechanisms aren’t properly designed. The accuracy claims aren’t substantiated with the required documentation. The systems weren’t built with these requirements in mind, because the requirements weren’t yet law when they were built.
They’re law now. And similar regulation is coming in other jurisdictions.
This isn’t a compliance scare story. It’s a signal about where engineering investment needs to go. The organizations that build reliable, auditable, well-documented AI systems now will be in a substantially better position in 18 months than the ones that treat compliance as a future problem.
What reliable enterprise AI actually looks like
I want to be concrete about this, because the phrase “reliable AI” is easy to say and hard to operationalize.
A reliable enterprise AI system has a retrieval layer that’s been evaluated not just for average performance but for worst-case behavior — what does it return when the query is ambiguous, when the knowledge base doesn’t contain the relevant information, when the user phrases their question in an unexpected way?

It has an output layer that’s been tested for the failure modes that matter in the specific domain — factual errors, hallucinated references, contradictions with source material, outputs that are technically accurate but contextually inappropriate.
It has an observability stack that captures, for every meaningful transaction, the complete context: what was retrieved, what was sent to the model, what the model returned, what was shown to the user. Not sampled. Everything that matters.
It has evaluation runs on a representative set of queries — not a benchmark someone else defined, but queries that reflect the actual use cases in the actual organization — that run before any significant change to the system goes live.
And it has a documented understanding of where the system is and isn’t reliable — which use cases it handles well, which ones it doesn’t, and what the plan is when users encounter the boundary.
That’s not a vision of perfect AI. It’s a description of engineering discipline applied to AI systems — the same discipline that competent engineering teams apply to any critical production system.
Where I’m heading with this
I’ve spent the last year developing a working framework for how to think about reliability in enterprise AI systems. Not a theoretical framework — a practical one, grounded in understanding what breaks, why, and what the available tools are for addressing it.
Over the coming months I’m going deeper on the implementation side. What does a RAG system with serious reliability engineering actually look like under the hood? What evaluation approaches work in practice? How do you design observability for AI systems that captures what actually matters? How do you structure a knowledge base for consistency at scale?
These aren’t questions I have complete answers to yet. But they’re the right questions. And working through them in public — documenting what I find, including the parts that don’t go as expected — is more useful than waiting until I have everything figured out.
The goal is to contribute something to a conversation the industry needs to be having more seriously: not just how to build AI that’s impressive, but how to build AI that can be trusted.
Reliable enterprise AI. That’s the problem worth solving.
Some information may be outdated