The Questions I'm Asking Before Writing a Single Line of AI Code

I’ve been developing software for quite a few years now. Long enough to have made the mistake, more than once, of starting with the solution instead of the problem. Or of trying to understand the technology before understanding what it was actually trying to guarantee.

And I think AI is the most technically complex thing I’ve encountered in all these years of software development. That’s precisely why I’m being cautious. Before writing a single line of AI code (before even choosing a framework, evaluating a model, or designing a pipeline), I ask myself a series of questions that have nothing to do with capability. They have everything to do with what happens when the system fails.

How do I know that a failure has occurred in this system?

This is the first question. Not “What should the system do?”, but “What happens when it doesn’t?”.

In traditional software, the failure is usually explicit. An exception is thrown, a query returns no results, a timeout occurs. The system itself tells you that something went wrong. In AI systems, failure is often invisible. The model returns a seemingly safe answer, but it’s incomplete, slightly erroneous, or outside the intended scope, and nothing in the process detects it.

This asymmetry changes how we design. If the failure mode is silent, the architecture must compensate for it. Depending on what's at stake, that means output validation, retrieval quality thresholds, confidence scores, or human review. The right combination depends entirely on the real-world consequences of an incorrect output.
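As a sketch of what "compensating for a silent failure mode" can look like in code: instead of passing every model output downstream, gate it behind explicit checks. The `confidence` field, the threshold value, and the escalation messages below are illustrative placeholders, not a prescribed API.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    answer: str
    confidence: float  # assumed to come from the model or a separate scorer


def guarded_answer(result: ModelResult, threshold: float = 0.8) -> str:
    """Return the model's answer only if it passes basic checks;
    otherwise emit an explicit escalation signal instead of silently
    forwarding a low-quality answer downstream."""
    if not result.answer.strip():
        return "ESCALATE: empty answer"
    if result.confidence < threshold:
        return "ESCALATE: low confidence"
    return result.answer
```

The point is not the specific checks but the shape: the system converts an invisible failure (a bad answer) into a visible one (an escalation) that logging and humans can act on.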

An incorrect answer in a low-stakes consumer recommendation system is a minor inconvenience. An incorrect answer in a system students use to understand educational content has different consequences. An incorrect answer in a medical, legal, or financial context has very different consequences.

The issue of failure modes forces us to think about what’s at stake before thinking about functionality. That’s the right order.

Who is responsible when the result is incorrect?

This is the question most engineering teams postpone until something goes wrong. It shouldn’t be postponed.

AI systems introduce a new kind of ambiguity in responsibility. When a deterministic system produces an incorrect result, the cause can be traced: a bug in the code, a bug in the data, a failure in the infrastructure. The chain of responsibility is clear.

When an AI system produces an incorrect result, the cause is distributed among the model, the training data, the retrieval mechanism, the prompt, and the context provided at inference time.

The chain is harder to trace and easier to obscure.

This ambiguity is exploitable, not necessarily deliberately, but structurally. Teams that haven’t defined responsibility before implementation tend to discover, during an incident, that everyone assumed someone else was responsible.

The issue isn’t just internal. The EU AI Act, which comes into force on August 1, 2024, imposes explicit accountability requirements on systems classified as high-risk.

For these systems, accountability is not something defined after implementation.

It is something designed into the system before its launch, with audit logs, human oversight mechanisms, and documented decision logic. Understanding whether your system fits this classification is not a matter for the compliance team, but for architecture.

What does “reliable enough” really mean in this context?

Reliability in AI is not binary; it's a distribution.

A RAG system doesn't simply work or fail outright. It works well for some queries, poorly for others, and inconsistently for a third category that depends on factors you may not fully control: the quality of the retrieved context, the phrasing of the input, the model's temperature.

Asking “Is this system reliable?” is the wrong question. The right question is: “What is the acceptable failure rate for this use case, and how will we measure whether we meet it?”

This question seems like it should be answered by product managers or business stakeholders.

It should be answered collaboratively, but engineers must be involved, as the answer has direct architectural implications.

A system where 95% accuracy is acceptable can be designed differently from one where the minimum is 99.5%. The retrieval strategy, validation layers, fallback behavior, and human oversight requirements all vary with the agreed reliability level. If that level isn't known before development begins, the system will be built to the wrong standard: either over-engineered for the actual need or under-built for it.
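Making "the acceptable failure rate" concrete can be as simple as a small evaluation gate agreed on before building. The evaluation data and the target below are illustrative; the comparison of prediction to expected answer would in practice be task-specific rather than exact string equality.

```python
def failure_rate(predictions: list[str], expected: list[str]) -> float:
    """Fraction of evaluation cases where the system's output
    does not match the expected answer (exact match, for illustration)."""
    failures = sum(1 for p, e in zip(predictions, expected) if p != e)
    return failures / len(expected)


def meets_target(predictions: list[str], expected: list[str],
                 max_failure_rate: float) -> bool:
    """Gate: does the measured failure rate stay within the threshold
    agreed with stakeholders (e.g. 0.05 for a 95% accuracy floor)?"""
    return failure_rate(predictions, expected) <= max_failure_rate
```

Running this gate in CI against a fixed evaluation set turns "is it reliable?" into a number the team can actually discuss and track.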

What data does this system handle and what obligations does it entail?

Data in AI systems involves obligations that go beyond what most engineering teams typically consider.

Training data, retrieval corpora, user inputs, output logs: each of these elements generates legal and ethical risks that are substantially more complex in AI contexts than in traditional software.

GDPR issues that seemed manageable in a standard application become considerably more complex when user queries are logged to improve retrieval, personal data is used as part of a context window, or results are generated that could be used to make decisions about individuals.

For systems operating in Europe, and increasingly for any system with European users, the EU AI Act adds an additional layer. High-risk systems have explicit data governance requirements: documentation of training data sources, bias monitoring, and data quality standards.

These are not requirements that can be easily implemented later. They must be designed from the outset.

I’m not a lawyer, and this is not legal advice. But saying “we’ll deal with regulatory compliance later” is one of the most costly phrases in software development. The question of what data the system handles and what the legal and ethical implications are belongs in the architecture phase, not in the subsequent legal review.

How will I know if the system degrades over time?

Production software changes. Dependencies are updated, infrastructure is modified, and traffic patterns evolve. Good engineering practice includes designing systems that can be monitored over time, not just validated at the time of implementation.

AI systems degrade in ways that traditional software does not. Model vendors update versions frequently. Retrieval quality shifts as document corpora change. Prompts that worked well with one version of a model produce subtly different results with the next. These changes are often invisible without specific instrumentation.

Before writing a single line of AI code, I want to know the monitoring strategy. What metrics will tell me if the system is performing as expected? What is the process for detecting degradation? What is the threshold at which a human intervenes instead of letting the system continue operating autonomously?
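A minimal sketch of one of those answers, assuming you can log a per-response quality score (from validation checks, user feedback, or an evaluator model): track a rolling window and alert when the average drops below an agreed baseline. The baseline, window size, and tolerance here are placeholders.

```python
from collections import deque


class DegradationMonitor:
    """Tracks a rolling window of per-response quality scores and
    flags when the rolling average falls below a fixed baseline."""

    def __init__(self, baseline: float, window: int = 100,
                 tolerance: float = 0.05):
        self.baseline = baseline          # quality level agreed at launch
        self.tolerance = tolerance        # allowed dip before alerting
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> None:
        self.scores.append(score)

    def degraded(self) -> bool:
        if not self.scores:
            return False
        avg = sum(self.scores) / len(self.scores)
        return avg < self.baseline - self.tolerance
```

This is deliberately simple; the real decision it encodes is the one from the text: at what threshold does a human intervene instead of letting the system keep running.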

These aren’t questions for after implementation. A system delivered without answers to these questions isn’t a production system. It’s a demo that just happens to be running in production.

Why Ask These Questions Before Writing Code?

None of these questions are specific to AI. They are the questions a meticulous engineer asks before building any system where the consequences of failure are significant.

What makes them particularly important in AI is that the technology’s capabilities are so compelling that they create pressure to start building before the questions are answered. The demos are impressive. The frameworks are accessible. The time from “I understand how this works” to “I have something up and running” is shorter than ever for a technology of this complexity.

That speed is the risk, not the technology itself.

The teams that will build reliable enterprise AI are not the ones that move fastest from concept to implementation. They are the ones that dedicate enough time to these questions, before writing a single line of code, to know exactly what they are building, for whom, under what constraints, and how they will know if it stops working.

Conclusion

The gap between a working AI demo and a reliable production system is almost never a technical gap.

It’s a lack of questions.

Teams that rush to implementation without defining failure modes, accountability, reliability thresholds, data obligations, and a monitoring strategy build systems that work until they don't, with no way to detect the transition. These questions don't slow down development; they define what you're actually trying to build.

Author: Raúl Ferrer
Published: 2024-03-14
License: CC BY-NC-SA 4.0
