What I'm Learning About Enterprise AI Architecture
Over the last 18-24 months, I’ve been trying to understand why some AI systems fail in production while others survive real-world conditions. Some of that learning comes from building code. Some comes from reading what others have tried and where they’ve stumbled.
I’m currently working through a Data Privacy, Ethics and Responsible AI specialization on Coursera, and this post connects some patterns I think I’m seeing, though I’m still figuring out how they fit together. If you spot something I’m missing, I’d want to know.
What I’m Observing: Determinism Beats Probability (Usually)
The first thing I notice is the gap between what LLMs are designed for and what enterprise systems need them to do.
LLMs are probabilistic. You ask them the same question twice, you might get different answers. That’s by design. They’re great for creative tasks, brainstorming, drafting.
But in enterprise contexts (healthcare, education, finance, legal) probabilistic outputs become a liability. Parents can’t understand why the AI gave their child a different grade on review. Regulators can’t audit a system that produces different results on identical inputs. Auditors have no idea what happened.
What I’m seeing in teams that move to production successfully is a shift: stop expecting the LLM to be a knowledge source. Use it as a reasoning engine instead. Let the actual knowledge come from somewhere you can control and audit: a database, a vector store, an information system you own.
This is especially important in K-12 education, where every output should be explainable. For example: “The system said X because we retrieved documents Y and Z from your curriculum, and the model reasoned that…” That’s traceable. That’s defensible.
I’m still working through how to make this principle work at scale, but it’s becoming clearer that this is where the architecture needs to pivot.
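To make the "reasoning engine, not knowledge source" idea concrete, here is a minimal Python sketch. It assumes the documents have already been retrieved from a store you control; `SourceDoc`, `build_grounded_prompt`, and the document ids are hypothetical names for illustration, and no specific LLM provider is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDoc:
    doc_id: str  # identifier in a store you own (hypothetical)
    text: str


def build_grounded_prompt(question: str, sources: list[SourceDoc]) -> str:
    """Put the retrieved text in the prompt and ask the model to cite doc ids."""
    context = "\n".join(f"[{s.doc_id}] {s.text}" for s in sources)
    return (
        "Answer using only the sources below. "
        "Cite the source id in brackets for every claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )


sources = [
    SourceDoc("curriculum-7.2", "Photosynthesis converts light into chemical energy."),
    SourceDoc("curriculum-7.3", "Chlorophyll absorbs light in plant cells."),
]
prompt = build_grounded_prompt("What is photosynthesis?", sources)
```

Because the source ids travel with the prompt, the explanation "we retrieved documents Y and Z" can be produced from data you already hold rather than from the model's memory.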
The Pattern That Emerges: RAG Isn’t Just Retrieval
Retrieval-Augmented Generation (RAG) is everywhere now. But what I’m noticing is that basic RAG (“search for relevant documents, throw them at the LLM”) fails in practice more often than it succeeds.
Production-grade systems seem to have evolved a pattern. It looks something like this:
The Ingestion Layer
Here, your data (PDFs, databases, Confluence pages) gets analyzed carefully. How you break it into chunks, how you embed it, and what metadata you attach all determine whether your system actually retrieves useful information.
At this point, context-aware chunking (breaking on logical boundaries rather than arbitrary character counts) seems to matter more than I initially thought. If you chunk a multi-part question into three separate documents, the retrieval system can’t see the relationship between them.
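As a rough illustration of chunking on logical boundaries, the sketch below packs whole paragraphs into chunks instead of cutting at a fixed character count. It assumes paragraphs are separated by blank lines, which is only one possible notion of a "logical boundary"; the function name and the `max_chars` default are my own choices.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars where possible.

    A paragraph longer than max_chars is kept whole rather than cut mid-thought.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With this approach, the parts of a short multi-part question stay in one chunk as long as they fit within `max_chars`.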
The Retrieval and Ranking Layer
This is where I think most RAG systems underinvest. Simple similarity search (“find the vectors closest to my query vector”) often misses what you actually need.
The teams doing better work are combining keyword search (BM25) with semantic search, then using reranking models (like Cohere’s rerank) to improve the actual ordering. It sounds like overhead, but the cost is tiny compared to the retrieval failures it prevents.
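One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). That is my choice for this sketch, not something the teams above necessarily use. It assumes you already have two ranked lists of document ids, one from BM25 and one from vector search; the ids are made up, and the reranking step is left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ordering.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc_b", "doc_a", "doc_c"]   # e.g. from BM25
semantic_hits = ["doc_a", "doc_d", "doc_b"]  # e.g. from vector search
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Documents that rank well in both lists rise to the top, and the fused list is what you would hand to a reranking model.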
The Control and Observability Layer
This is the part that shocked me when I started thinking seriously about it. If your RAG system retrieves information that the user asking the question doesn’t have permission to see, you’ve just leaked data. The retrieval access controls have to be built into the database layer, not layered on top.
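The sketch below shows the ordering that matters: permissions are applied before ranking, so a forbidden chunk can never reach the model. It uses an in-memory list as a stand-in for the store. In a real system I would expect the same condition to be a metadata filter inside the database query itself. `Chunk`, `allowed_roles`, and `retrieve_for_user` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_roles: frozenset[str]  # hypothetical metadata attached at ingestion
    score: float                   # similarity score from the search step


def retrieve_for_user(
    candidates: list[Chunk], user_roles: set[str], top_k: int = 3
) -> list[Chunk]:
    """Drop chunks the user may not see, then take the top_k by score."""
    visible = [c for c in candidates if c.allowed_roles & user_roles]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]
```

Filtering after the model has seen the text would be too late, which is why the check sits in the retrieval path rather than in the prompt.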
And observability — what actually got retrieved, what score it had, what the model did with it — is essential before you ship anything. I wrote separately about this because it’s the piece most teams skip and then regret.
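A minimal version of that trace could be one JSON line per request. The field names here are my assumptions, and a production system would likely add request ids, model version, and latency.

```python
import json
from datetime import datetime, timezone


def retrieval_trace(query: str, hits: list[tuple[str, float]], answer: str) -> str:
    """Serialize one request as a JSON line: what was retrieved, its score, the output."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "retrieved": [{"doc_id": doc_id, "score": score} for doc_id, score in hits],
        "answer": answer,
    }
    return json.dumps(record)


line = retrieval_trace(
    "When are fractions introduced?",
    [("curriculum-3.1", 0.91), ("curriculum-3.4", 0.77)],
    "Fractions are introduced in grade 3 [curriculum-3.1].",
)
```

Writing these lines to an append-only log gives you something to inspect when an answer looks wrong.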
Engineering & Prompting
There’s a pattern I’ve seen, and I think it’s worth naming: a team builds an LLM-based system, and when it doesn’t work properly, instead of redesigning the architecture, they write a longer prompt, then a longer one…
That’s usually a signal that the architecture is wrong, not the instructions.
If I find myself thinking “I just need to prompt this better”, I’m probably missing something structural. Maybe the model is the wrong one. Maybe the retrieval is missing context. Maybe I’m asking the model to do something that should be deterministic logic instead.
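As one hypothetical example of the last case: if a final grade is a weighted average of rubric scores, that arithmetic can live in ordinary code, so identical inputs always give identical results. The criteria names and weights below are invented for illustration.

```python
def weighted_grade(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (0-100) into a final grade with fixed weights."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return round(sum(scores[c] * weights[c] for c in weights) / total_weight, 1)


grade = weighted_grade(
    {"content": 80, "grammar": 90},
    {"content": 3, "grammar": 1},
)
```

The model can still help with the judgment-heavy part, such as suggesting a score per criterion for a teacher to review, while the combination step stays auditable.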
The patterns I’ve seen work better lean on established engineering practices:
- Modularity. The LLM should be decoupled from the vector database, from the business logic, from the access control layer. That way when you want to swap model providers (which you will), you’re not rewriting your entire system.
- Typing and Frameworks. Using something like LangChain4j for Java gives you structure. It forces you to think about inputs and outputs as types, not just strings. That sounds like overhead until the type system catches a bug that would otherwise have surfaced in production.
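Here is a small sketch of both ideas using Python's `typing.Protocol` rather than LangChain4j. The interfaces, class names, and stub implementations are all hypothetical, and a real adapter would wrap a provider's client behind the same interface.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Answer:
    text: str
    source_ids: tuple[str, ...]


class Retriever(Protocol):
    def search(self, query: str) -> list[tuple[str, str]]: ...  # (doc_id, text)


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class AnswerService:
    """Business logic depends on the two interfaces, not on a vendor SDK."""

    def __init__(self, retriever: Retriever, model: ChatModel) -> None:
        self._retriever = retriever
        self._model = model

    def ask(self, question: str) -> Answer:
        docs = self._retriever.search(question)
        context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in docs)
        reply = self._model.complete(f"{context}\n\nQuestion: {question}")
        return Answer(text=reply, source_ids=tuple(doc_id for doc_id, _ in docs))


# Stand-ins for illustration only.
class StaticRetriever:
    def search(self, query: str) -> list[tuple[str, str]]:
        return [("doc-1", "Fractions are introduced in grade 3.")]


class EchoModel:
    def complete(self, prompt: str) -> str:
        return f"(stub reply to {len(prompt)} chars)"


answer = AnswerService(StaticRetriever(), EchoModel()).ask("When are fractions introduced?")
```

Swapping the model provider then means writing a new class with a `complete` method, while `AnswerService` and its typed `Answer` stay unchanged.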
I’m still learning where exactly the line is between “this is an LLM problem” and “this is an architecture problem,” but I’m becoming convinced the line is more architectural than I initially thought.
The Gap Nobody Talks About: Governance
This is the part I was least prepared to care about when I started this journey, and now I can’t stop thinking about it.
Reliability doesn’t just mean “the model gives good answers”. It means:
- Can someone audit this system? Can you explain why it made this decision?
- Is there a record of what data it used? Can you show a regulator that information?
- If the system fails, do you know? Do you have monitoring in place?
- Who is responsible when something goes wrong?
- Does the system respect data permissions? If a student shouldn’t see a document, the system shouldn’t retrieve it.
The EU AI Act is specific about this. For high-risk systems, Article 13 covers transparency, Article 14 covers human oversight, and Article 17 covers the quality management system. Annex III lists several education uses as high-risk, including evaluating learning outcomes.
What I’m realizing is that governance isn’t something you add after the system works. It’s architectural. It changes how you design retrieval, how you log decisions, what information you expose.
I’m still working through the details of this, and I know I’m only scratching the surface. But it’s becoming clear that teams ignoring governance until later in the process are building systems that will be expensive to redesign.
The Reality Check: Humans Still Matter
One more pattern that keeps showing up: reliable systems don’t try to be fully autonomous.
A teacher reviewing AI-generated grades isn’t a “blocker” to efficiency. That human review is where the system gets better. It’s where errors surface. It’s where the system learns.
In K-12 education especially — but increasingly in other regulated domains — the assumption should be that meaningful human oversight is built into the workflow. The question isn’t “how do we remove the human” but “how do we make human oversight actually work.”
That means:
- The AI output is clearly marked as AI
- The human can easily see what information the AI used (what documents it retrieved)
- The human has a simple way to override or escalate
- There’s a record of what the human decided and why
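The four requirements above map fairly naturally onto a single record type. This is a sketch under my own assumptions: `ReviewRecord`, `Decision`, and the field names are hypothetical, and storage and UI are out of scope.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Decision(Enum):
    ACCEPTED = "accepted"
    OVERRIDDEN = "overridden"
    ESCALATED = "escalated"


@dataclass(frozen=True)
class ReviewRecord:
    ai_output: str
    source_ids: tuple[str, ...]   # what the AI retrieved, shown to the reviewer
    reviewer: str
    decision: Decision
    reason: str
    final_output: Optional[str]   # None while an escalation is pending
    generated_by_ai: bool = True  # the output is always marked as AI
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def review(
    ai_output: str,
    source_ids: tuple[str, ...],
    reviewer: str,
    decision: Decision,
    reason: str,
    replacement: Optional[str] = None,
) -> ReviewRecord:
    """Record what the human decided and why."""
    if not reason.strip():
        raise ValueError("a reason is required so the record explains the decision")
    if decision is Decision.OVERRIDDEN and replacement is None:
        raise ValueError("an override needs a replacement value")
    final = {
        Decision.ACCEPTED: ai_output,
        Decision.OVERRIDDEN: replacement,
        Decision.ESCALATED: None,
    }[decision]
    return ReviewRecord(ai_output, source_ids, reviewer, decision, reason, final)
```

Requiring a reason on every decision, including acceptance, is a design choice in this sketch. It adds friction, so whether it is worth it depends on the workflow.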
I’m still figuring out the UX patterns that make this work without slowing everything down, but I’m convinced that’s where the design problem actually is.
What I’m Still Figuring Out
- How do you scale this without the governance burden becoming paralyzing?
- When is a RAG system the right architecture and when is it overhead?
- How much reranking actually helps in practice (vs. looking good in benchmarks)?
- Where exactly should access control live (database layer, retrieval layer, LLM layer)?
- How do you measure whether human oversight is actually working, or if it’s just ceremonial?
These are the questions keeping me up, and I expect my answers will change as I learn more.
Conclusion
What has surprised me most so far is how much the system matters relative to the model. Everyone talks about getting better models, but I keep finding that teams win or lose based on the architecture: how they design retrieval, how they handle governance, whether they’ve actually built in human oversight or just claimed they did.
I’m not sure yet how to systematize that observation into a repeatable framework. But that’s what I’m trying to figure out.