Six principles for AI systems that work in production

The AI projects that work in production share a small number of structural decisions. None of them is about the model. All of them are about the system around the model: the threshold defined before the build begins, the constraint treated as architecture rather than an obstacle, the surface designed as the actual product, the MVP scoped as the first instance of the platform, the human kept in the loop as a permanent feature rather than a transitional one. The discipline of building AI well is the discipline of getting these decisions right early enough for the model to do its job, and of refusing to defer them until the cost of deferral becomes unavoidable.

This article closes a series of eight pieces that traced what we have learned across three years of building AI systems for clients in energy, academic publishing and maritime operations. The series sits inside a longer practice. We have been building digital products at Thinslices for 15 years across more than 220 projects, and the AI work of the past three years is the latest chapter of that practice rather than a separate one. The eight articles cover the part of the work we could share publicly, drawn from projects with real constraints, real users and consequences when the system gets it wrong.

Each previous article took a single question and worked through it in depth.

Eight articles is a lot of ground. Across that ground, a pattern emerged that none of the individual pieces could fully name on its own. The teams that get AI into production are not making better technical choices than the teams that struggle. They are making the same set of structural choices in roughly the same sequence, regardless of the domain they work in. The principles behind that pattern are what this article is about.

None of these principles is novel in isolation. What is rare is treating them as a system, applying them with discipline, and refusing to be talked out of them when the pressure to ship faster or impress an audience pushes against them. The teams that hold the line on these principles tend to ship AI that compounds value. The teams that abandon them at the first sign of friction tend to ship AI that does not survive contact with production.

What we have learned across the series

Before naming the principles, it is worth being honest about what kind of synthesis this is. Each previous article in the series argued a specific position, and most of those positions were not the consensus view in their respective topic areas. Document extraction accuracy is not primarily a fine-tuning problem. Browser-use agents work best as components, not as solutions. Confidence is the engineering problem in RAG, not retrieval. Human-in-the-loop is the destination, not the bridge. The MVP-to-platform transition is a redesign problem, not a scaling problem.

These positions are not unrelated. Each one is an instance of the same underlying observation: the AI work that matters is not in the model. It is in the architecture, the product design, the workflow, the operational discipline that surrounds the model. The model contributes capability. Everything else contributes reliability.

The six principles that follow are an attempt to name that observation in its most useful form.

Principle One

The system around the model matters more than the model

Most AI failures in production are not model failures. The model is doing what it is supposed to do, and the system around it is not. The retrieval layer returns the wrong content, often confidently. The interface hides the confidence signal behind fluent prose. Downstream processes treat the output as deterministic when it is anything but, and the monitoring layer, where there is one, cannot distinguish a correct answer from a confident hallucination.

Teams that hit reliability problems in production almost always reach for the model first: better prompts, lower temperature, a different provider. Sometimes this helps. Most of the time, the model is fine and the system is wrong. We covered this directly in our piece on how to build RAG systems that work in production, where the central argument is that confidence calibration, source attribution and presentation are the engineering work, not retrieval quality.

The same pattern shows up in document extraction, where pushing past 97% accuracy requires pipeline segmentation, contextual retrieval and continuous feedback loops working together, with fine-tuning establishing the baseline rather than producing the result. And it shows up in browser-use agents, which produce durable value only when they are treated as one component in a larger system rather than as a complete solution.

The recurring lesson is that the model is one part of a working AI product. Often it is the most interesting part to talk about. It is rarely the part that determines whether the product works.

This is worth restating because the model landscape in 2026 makes the principle more urgent, not less. The frontier itself has become a cluster, with independent benchmark analysis from Logic showing Claude Opus 4.7, GPT-5.4, Gemini 3.1 Pro and Grok 4 separated by only a few percentage points on most production-relevant tests, with differences "under 3 to 5 percentage points apart" that "rarely matter in production." Open-weight models from DeepSeek, Moonshot and Zhipu now match closed-model performance on key tasks while running inside the customer's own infrastructure, with one 2026 frontier-models analysis noting that "the quality gap between open and closed is essentially gone." The cost spread across the available models is enormous, with mid-tier offerings delivering most of the frontier's quality at a fraction of the price. A practical 2026 guide to model selection made the implication explicit: "None of these are reasons to avoid frontier AI; they are reasons to design the surrounding system thoughtfully — good prompts, retrieval for facts, cheaper models for easy work, and human review where it counts — so the model's strengths show and its weaknesses are contained."

If you are building an AI product right now, the implication is direct. Picking the right model is no longer the hard part. If your team is spending more time evaluating model providers than designing the system around the model, you are spending your effort in the wrong place. Choose a default. Build the routing layer so you can swap models when economics or capability shifts. Design the retrieval, the human interface, the confidence behavior, and the monitoring as the actual product. The system around the model is where your competitive advantage lives, and it is the only part of the work that survives the next frontier release.
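
As a minimal sketch of what such a routing layer could look like, assuming each provider client can be wrapped as a plain prompt-to-text function: the class name ModelRouter, the task names and the stub backends below are all hypothetical, and no real provider API is called. The point is only that callers name a task, and the mapping from task to model lives in one place that can change without touching them.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A backend is anything that maps a prompt to text. Real provider clients
# would be wrapped to fit this signature; none are called in this sketch.
ModelFn = Callable[[str], str]


@dataclass
class ModelRouter:
    """Hypothetical routing layer: callers name a task, not a provider."""
    backends: Dict[str, ModelFn]
    routes: Dict[str, str]  # task name -> backend name
    default: str            # backend used when a task has no explicit route

    def complete(self, task: str, prompt: str) -> str:
        name = self.routes.get(task, self.default)
        return self.backends[name](prompt)

    def reroute(self, task: str, backend: str) -> None:
        """Point a task at a different backend without changing any caller."""
        if backend not in self.backends:
            raise KeyError(f"unknown backend: {backend}")
        self.routes[task] = backend


# Stand-in backends so the sketch runs without network access.
def frontier_stub(prompt: str) -> str:
    return f"[frontier] {prompt}"


def small_stub(prompt: str) -> str:
    return f"[small] {prompt}"


router = ModelRouter(
    backends={"frontier": frontier_stub, "small": small_stub},
    routes={"classify": "small"},  # easy work goes to the cheaper model
    default="frontier",
)
```

With this shape, moving a task to a cheaper or newer model is a one-line call such as router.reroute("summarize", "small"), and the retrieval, interface and monitoring code never learns which provider answered.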

Principle Two

The constraint is the architecture, not the obstacle

Most AI projects encounter constraints early on: regulatory requirements, data residency rules, performance budgets, compliance perimeters, limited training data, and industry-specific quality standards. The temptation is to treat these as obstacles to work around, and to design the system as if they were temporary inconveniences that better engineering will eventually eliminate.

The teams that build durable AI systems do the opposite. They treat the constraints as the architecture itself. The compliance perimeter shapes the model selection, the performance budget shapes the inference design, the quality standard shapes the training data strategy. The constraint stops being an obstacle and becomes the design input that produces a coherent system.

We saw this most clearly in our piece on how regulated companies build AI without third-party APIs, where the SOC 2 requirement that data could not leave the customer's infrastructure was not a problem to solve but the starting point of the architecture. The same logic appears in human-in-the-loop AI systems, where the constraint that no decision closes without human involvement is not a workaround for an immature model. It is the design principle that makes the system trustworthy.
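
One way to read "the constraint is the architecture" in code is to make the rule part of the data type rather than a policy applied afterwards. The sketch below is an illustration under our own assumptions, not the implementation from either project: the Decision class and its field names are hypothetical. It models a model proposal that only counts as closed once a named reviewer has accepted or overridden it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    """A model proposal that stays open until a named human resolves it."""
    item_id: str
    proposed: str                   # what the model suggests
    reviewer: Optional[str] = None  # who closed it, if anyone
    final: Optional[str] = None     # the outcome the reviewer confirmed

    @property
    def closed(self) -> bool:
        return self.reviewer is not None and self.final is not None

    def close(self, reviewer: str, final: Optional[str] = None) -> None:
        """Close the decision; the reviewer may accept or override."""
        if not reviewer.strip():
            raise ValueError("a decision cannot close without a reviewer")
        self.reviewer = reviewer
        self.final = self.proposed if final is None else final


decision = Decision(item_id="m-1", proposed="reject")
```

Here decision.closed is False until someone calls close with a reviewer name, for example decision.close("editor-a", final="accept"). Downstream code that only acts on closed decisions then inherits the constraint without having to re-implement it.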

Constraints filter for clarity. AI projects that have to satisfy real constraints from day one tend to produce more deliberate engineering than projects that imagine themselves to be free of them.

If you build AI for a market that touches the EU, the constraint is now operationally binding. The EU AI Act's high-risk obligations became enforceable on 2 August 2026, with the AI Act Omnibus provisionally deferring some Annex III categories to December 2027. A Cloud Security Alliance research note from March 2026 found that more than half of organizations still lacked a basic AI system inventory at that point, meaning most enterprises were treating compliance as a problem to address after the build rather than as a design input shaping the build itself. The teams approaching the deadline with workable systems are the ones who took the constraint seriously from day one. The teams approaching it with retrofit projects are paying the cost the principle is designed to avoid.

Principle Three

Define the threshold before you build

Almost every AI project we have seen fail had no defined accuracy threshold at the start. The team had a general sense that the model needed to be "good enough." Good enough was not specified in measurable terms. The build was scoped against the impressive-demo standard, not against the operationally useful standard. When the system shipped, the threshold for being operationally useful was discovered after the fact, and it was almost always higher than the threshold the system had been designed to meet.

The threshold is not a technical aspiration. It is the number below which the automation does not displace the manual effort it was supposed to replace. Below it, the business case does not hold and the system has to be either rebuilt or quietly abandoned. We covered this in our piece on how to build a business case for AI before writing a line of code, where the central argument is that the threshold needs to sit in the requirements before any model evaluation begins.
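
To make "the number below which the automation does not displace the manual effort" concrete, here is a deliberately simple break-even sketch. The cost model is our own assumption for illustration: every automated item incurs a fixed automation cost, and every wrong output incurs an error cost (finding it, fixing it, and any downstream damage). The function names and the figures are hypothetical, and a real business case would need a richer model.

```python
def saving_per_item(accuracy: float, manual_cost: float,
                    automation_cost: float, error_cost: float) -> float:
    """Saving versus doing the item by hand, under the assumed cost model:
    automated cost per item = automation_cost + (1 - accuracy) * error_cost
    """
    return manual_cost - automation_cost - (1.0 - accuracy) * error_cost


def break_even_accuracy(manual_cost: float, automation_cost: float,
                        error_cost: float) -> float:
    """Accuracy at which saving_per_item is exactly zero."""
    if error_cost <= 0:
        raise ValueError("error_cost must be positive")
    return 1.0 - (manual_cost - automation_cost) / error_cost


# Illustrative figures only: 5.00 to process an item by hand, 1.00 to
# automate it, 16.00 each time a wrong output has to be caught and corrected.
threshold = break_even_accuracy(manual_cost=5.0, automation_cost=1.0,
                                error_cost=16.0)
print(threshold)  # 0.75
```

With these figures the system loses money on every item below 75% accuracy, however good the demo looks. Raising the error cost, which is what high-stakes domains do, pushes the threshold up quickly, and that is why the number belongs in the requirements rather than in the retrospective.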

The same principle applies to process selection. Our piece on why AI automation ROI is highest on repetitive, high-volume processes argued that the four criteria for evaluating a candidate process must be applied honestly before the technology conversation begins. A process that scores poorly on any one of them will not produce durable returns regardless of how good the model becomes.

The discipline of defining the threshold early is the difference between a project that ships and a project that proves something. Both can be useful. Only one of them justifies the budget.

The most public version of this lesson is IBM Watson for Oncology, the AI cancer-treatment recommendation system that hospitals worldwide signed up for and that IBM eventually scaled back significantly after physicians found the recommendations were sometimes unsafe. The technical ambition was genuine. The threshold was implicit. For an AI to be operationally useful in cancer treatment, the bar is something approaching diagnostic-grade accuracy across a vast space of edge cases, and no system designed against a softer threshold was ever going to clear it. The lesson has compounded across the industry. Gartner's April 2026 research found that 57% of infrastructure and operations (I&O) leaders whose AI projects failed cited expecting too much too fast, and an MIT NANDA study from 2025 attributed the 95% zero-return rate on enterprise GenAI pilots in part to "the absence of a defined outcome before build starts." Define what good enough means in measurable terms before any code is written. Then design against it.

Principle Four

The MVP is the first instance of the platform

If you are building an AI MVP that you expect to scale, the most consequential decisions are not the ones that feel like platform decisions. They are the ones that feel like ordinary MVP decisions. The data model you pick in week two. The way you wire up the first external integration. Whether your check definitions live in code or in configuration. Whether the human reviewer interface is designed for one workflow or as a reusable layer. These choices look small at MVP time and become structural at platform time.

The teams that succeed at scaling AI systems make these decisions deliberately, with the platform in mind, before there is a platform to defend. The data model assumes multi-tenancy before there is a second tenant. External systems are wrapped in adapters with consistent internal interfaces. The human-in-the-loop interface is designed once and standardized across every check, every workflow, every tenant. The architecture generalizes. The content does not.

The cost of getting this wrong is concrete and increasingly well-documented. A 2026 SaaS multi-tenancy guide notes that the database isolation decision, made in week two of a typical project, ends up costing six to twelve months of re-architecture once the product reaches 500 paying customers. Another 2026 SaaS development analysis puts the cost of correcting weak architectural foundations after the fact at $150,000 to $400,000. The decision to design for multi-tenancy at MVP time is not over-engineering. It is the cheapest version of a decision the team will have to make anyway, taken at the moment it costs the least.

This is the argument of our piece on the five MVP decisions that turn an AI MVP into an enterprise AI platform. The framework draws on a project that began as a manuscript screening tool for one academic publisher and is now scaling to forty publisher review areas without a rebuild. The expansion was made possible by specific choices at MVP time: modeling each check as a configurable unit rather than as hardcoded logic, wrapping external systems like Trinka, Crossref and iThenticate in adapters, standardizing the reviewer interface from day one, and refusing to generalize publisher-specific content prematurely. The MVP and the platform are not two products. They are two stages of one product, and the decisions made in the first stage determine what is possible in the second.
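
The first two of those choices can be sketched in a few lines. This is an illustration under our own assumptions, not the project's code: CheckAdapter, CheckConfig, run_checks and the word-count stand-in are hypothetical names, and no real Trinka, Crossref or iThenticate API is shown. Each external system sits behind one internal interface, and each check is a piece of data that names an adapter and a pass mark.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class CheckAdapter(Protocol):
    """Internal interface that every wrapped external system satisfies."""
    def run(self, manuscript: str, params: dict) -> float: ...


class WordCountAdapter:
    """Stand-in adapter; a real one would wrap a vendor client behind run()."""
    def run(self, manuscript: str, params: dict) -> float:
        return float(len(manuscript.split()))


@dataclass
class CheckConfig:
    """One check as data: which adapter, which parameters, what passes."""
    name: str
    adapter: str
    min_score: float
    params: dict = field(default_factory=dict)


def run_checks(manuscript: str, configs: List[CheckConfig],
               adapters: Dict[str, CheckAdapter]) -> Dict[str, bool]:
    """Run every configured check and report pass or fail by check name."""
    results: Dict[str, bool] = {}
    for cfg in configs:
        score = adapters[cfg.adapter].run(manuscript, cfg.params)
        results[cfg.name] = score >= cfg.min_score
    return results


adapters: Dict[str, CheckAdapter] = {"word_count": WordCountAdapter()}

# Same architecture, different content: each publisher supplies its own list.
publisher_a = [CheckConfig("min_length", "word_count", min_score=5.0)]
publisher_b = [CheckConfig("min_length", "word_count", min_score=3.0)]
```

Adding a publisher in this shape means adding a list of CheckConfig entries, and adding an external system means adding one adapter. Neither requires changes to run_checks, which is the sense in which the architecture generalizes while the content does not.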

Principle Five

The product is the surface, not the model

Users do not interact with the model. They interact with the surface the team builds around it. The surface decides what the user sees, what they trust, how they verify, what they override and how their interactions get fed back into the system. The surface is the product. The model is the engine.

This sounds obvious until you watch how AI teams actually allocate their time. The model gets the engineering attention. The interface gets whatever is left. The result is often a workflow that technically functions but does not respect the user's attention, judgment, or domain knowledge. The user trusts the system less than they should, or more than they should, because the surface does not give them the information they need to calibrate either way.

What surprises most tech teams is that users do not behave the way the model treats them. The model treats every interaction as an isolated request. Users do not. They calibrate trust contextually, by stakes, by recent track record, by what the interface chose to surface and what it chose to hide. Yext's 2026 Consumer Search Behaviors Report found that while 43% of consumers now use AI search tools daily and 62% trust AI for brand decisions, "the higher the personal cost of being wrong, the less willing people are to trust a single source." Users have learned, faster than most teams expected, to demand verification when the consequence matters.

The other surprise of 2026 has been the backlash against agreeable AI. A February 2026 analysis of AI UX patterns made the observation directly: AI that answers every prompt with enthusiastic agreement is no longer perceived as friendly. It is perceived as unreliable. The default "be helpful and affirming" tone that worked for the first generation of consumer AI is now actively eroding trust in production systems. Users do not want a cheerleader. They want a collaborator that pushes back where pushback is warranted, that names uncertainty when uncertainty exists, that shows the source so the user can verify, and that recognizes when the user knows more than the model does. None of this is in the model. All of it is in the surface.

The teams that get this right invert the sequence. They design the workflow first, the surface second, the model third. The surface is treated as the primary product. The model is selected to support it.

The argument runs through our piece on RAG systems in production (where source attribution and confidence visibility are the engineering work that determines whether the system is trustworthy), and through our piece on human-in-the-loop AI systems (where the interface is the part that compounds adoption and trust over time, not the model).
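
A small sketch of what "confidence visibility" can mean at the surface, assuming a calibrated confidence score and a list of sources already arrive from upstream: the Answer type, the render function and the 0.7 cut-off are hypothetical choices for illustration. The surface, not the model, decides that an unsourced answer is withheld and that a low-confidence one is labeled as such.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to be calibrated upstream, in [0, 1]
    sources: List[str] = field(default_factory=list)


def render(answer: Answer, min_confidence: float = 0.7) -> str:
    """Turn a model answer into what the user actually sees."""
    if not answer.sources:
        # Nothing to verify against, so the surface declines to assert.
        return "No supporting source found. Please verify manually."
    if answer.confidence < min_confidence:
        header = (f"Low confidence ({answer.confidence:.0%}). "
                  "Check the sources before relying on this.")
    else:
        header = f"Confidence {answer.confidence:.0%}."
    cited = "\n".join(f"  [{i}] {s}"
                      for i, s in enumerate(answer.sources, start=1))
    return f"{header}\n{answer.text}\nSources:\n{cited}"


shown = render(Answer("Clause 4 applies.", 0.55, ["contract.pdf, p. 3"]))
```

The same model output produces three different user experiences depending on these few lines, which is the practical meaning of treating the surface as the product.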

A model on its own is a capability. A product is a model wrapped in a surface that respects the user. The second is what produces durable returns.

Principle Six

The hardest decisions are the easiest to defer

Every AI project arrives at a small number of decisions that feel uncomfortable to make at the moment they have to be made: the accuracy threshold, before any model is evaluated; the compliance posture, before the architecture is sketched; the role of the human, before the workflow is designed; the configuration boundary between tenants, before there is a second tenant; the monitoring layer, before there is anything in production to monitor.

These decisions are uncomfortable because they require the team to commit to something concrete before there is evidence to support the commitment. The temptation is to defer them: ship the MVP first, see how the model performs, wait for the second customer before designing for multi-tenancy, build the monitoring after the first production incident.

Every one of these decisions becomes harder, more expensive and more constraining the longer it is deferred. The accuracy threshold becomes harder to define once the team has spent six months calibrating to no threshold at all, the multi-tenancy architecture becomes harder to retrofit once the data model has settled, and the monitoring layer becomes harder to design once the failure modes are already showing up in production.

The teams that build durable AI systems make these decisions at the moment they are hardest to make, because that is the moment when they are cheapest. The teams that struggle defer them until the cost is unavoidable.

This pattern is the through-line that runs across every previous article in this series. It is the principle the other five rest on. Get the structural decisions right early enough that the model can do its job. Defer them and the model will spend the next two years compensating for choices nobody got around to making.

What ties the principles together

The six principles describe the same underlying discipline applied at different layers of the system. The model is the engine of an AI product, but it is rarely the part that determines whether the product works. The product works because the team made deliberate decisions about the system around the model: the threshold, the constraint, the surface, the platform path, the structural choices that compound over time. This is not novel. It is the same discipline that produces durable software products in any domain. What makes it worth restating in the AI context is that the current generation of model capability is impressive enough to encourage teams to think the model is the product. It is not.

What this series did not address

It is worth acknowledging that the structural decisions described above are not the only ones that matter. They are the ones a body of work focused on architecture and product design could responsibly address. Four adjacent topics shape whether AI systems work in production just as decisively, and the next phase of writing will go there.

The first is organizational and cultural: how teams need to be structured to ship AI reliably, how engineers, product designers and domain experts need to collaborate, and how procurement, security and compliance need to engage with AI projects from the start rather than at the end. The technical work does not survive contact with the rest of the organization unless these questions are answered well.

The other three are cost economics at scale (inference cost, infrastructure design, the trade-offs between self-hosted and hosted approaches), evaluation methodology (benchmark sets that reflect production reality, continuous accuracy assessment, drift detection), and the second-order product question of how AI capability changes what a product should be in the first place. The principles above are the ones we have seen matter most across projects in regulated industries, high-stakes domains and operationally complex environments. They are the ones we now apply by default. The next body of work will sit beside them, not above them.

Pressure-test your AI work against these principles

If you are building or evaluating an AI system and want to think through which of these principles your project is acting on and where it is at risk, let's connect.