There is a question that comes up early in almost every AI conversation we have with founders and product leaders: "Is our process a good candidate for this?"
It sounds like a simple question. It is not. A recent MIT study reports that 95% of enterprise generative AI pilots fail to deliver measurable business impact, and that the primary cause is not the technology itself but the absence of workflow integration and a defined outcome before the build begins. Most teams answer the question by focusing on the technology first, evaluating what a particular model or agent framework can do, and then searching for a process to apply it. That sequence produces many promising pilots but leaves production systems in short supply.
The teams that get reliable returns from AI automation tend to approach it the other way around. They start with the process, apply a small set of diagnostic criteria, and only then ask whether AI is the right tool. The criteria are not complicated. But applying them honestly before committing to a build significantly changes the outcome.
This article lays out those criteria, drawn from two projects in our current AI portfolio that both fit the same pattern, and explains why that pattern matters more than the specific technology used to automate it.
Across the projects we have built, the ones that deliver durable returns share a recognizable profile. Not identical processes, not identical industries, but the same structural characteristics.
The first is volume. A process that happens once a week is a different problem from one that happens ten thousand times a month. AI automation has fixed costs: model selection, fine-tuning, infrastructure, integration, and ongoing monitoring. Those costs need to be distributed across enough repetitions to generate a positive return. The higher the volume, the sooner the savings on each repetition cover those fixed costs.
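The volume argument can be made concrete with a simple break-even calculation. The sketch below is a minimal illustration, not a costing model we use: the function name, its parameters, and every figure in the example are assumptions chosen only to show how volume moves the payback period.

```python
from typing import Optional


def break_even_months(
    fixed_cost: float,
    monthly_volume: int,
    manual_cost_per_unit: float,
    automated_cost_per_unit: float,
    monthly_running_cost: float = 0.0,
) -> Optional[float]:
    """Return the months needed for cumulative savings to cover the fixed cost.

    Returns None when the process never pays back at this volume.
    """
    saving_per_unit = manual_cost_per_unit - automated_cost_per_unit
    monthly_saving = monthly_volume * saving_per_unit - monthly_running_cost
    if monthly_saving <= 0:
        return None
    return fixed_cost / monthly_saving


# Illustrative figures only (assumptions, not project data).
weekly_process = break_even_months(120_000, 4, 2.50, 0.50, 2_000)
high_volume_process = break_even_months(120_000, 10_000, 2.50, 0.50, 2_000)

print(weekly_process)       # no payback at this volume
print(high_volume_process)  # payback period in months
```

With these assumed figures, the process that runs about once a week never pays back, while the one that runs ten thousand times a month recovers the fixed cost in under seven months.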
The second is a defined quality standard. This is the criterion most teams underestimate. For AI to replace or meaningfully assist a manual process, there must be a measurable definition of what "correct" means. Not a general sense that the output is good enough, but a specific, verifiable standard that can be used to benchmark the model's performance and evaluate whether it has reached the threshold required for the business case to hold. As we covered in our piece on how to build a business case for AI before writing a line of code, that accuracy threshold is not a technical aspiration. It is the number below which the automation does not displace the manual effort it was supposed to replace.
The third is input structure. Fully unstructured inputs, open-ended conversations, creative outputs, highly contextual judgment calls: these are not impossible for AI to handle, but they are harder to automate reliably and the accuracy ceiling tends to be lower. Processes with structured or semi-structured inputs (forms, documents, defined fields, repeating formats) give the model more signal to work with and tend to produce more consistent outputs.
The fourth is current reliance on manual human effort. This sounds obvious, but it matters for a specific reason. If a process is already partially automated, the incremental value of AI is smaller and harder to isolate. The clearest business cases involve processes in which a team currently performs repetitive work that produces consistent output. The labor cost is visible, the error rate is measurable, and the displacement calculation is straightforward.
A process that scores strongly on all four criteria is not guaranteed to produce a successful AI implementation. But in our experience, a process that scores poorly on any one of them is unlikely to deliver the kind of reliable, compounding returns that justify the investment. Let's do a quick exercise together to see how your process would rate as an AI automation candidate:
Score your process on five criteria. Give yourself 2, 1 or 0 points per question.
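For readers who prefer to run the exercise in code, here is a minimal sketch of the scoring. It assumes the five criteria are the four structural criteria described above plus the availability of historical data, which is raised as a question later in this article; that mapping, the criterion names, and the decision to flag any criterion scored 0 are our assumptions for illustration.

```python
# Hypothetical criterion names; the fifth (historical data) is an assumption.
CRITERIA = (
    "volume",
    "quality_standard",
    "input_structure",
    "manual_effort",
    "historical_data",
)


def score_process(answers: dict) -> dict:
    """Total the 0/1/2 answers and list any criterion scored 0."""
    for criterion in CRITERIA:
        if criterion not in answers:
            raise ValueError(f"missing answer for {criterion}")
        if answers[criterion] not in (0, 1, 2):
            raise ValueError(f"{criterion} must be scored 0, 1 or 2")
    total = sum(answers[c] for c in CRITERIA)
    weak = [c for c in CRITERIA if answers[c] == 0]
    return {"total": total, "max": 2 * len(CRITERIA), "weak": weak}


example = score_process(
    {
        "volume": 2,
        "quality_standard": 1,
        "input_structure": 2,
        "manual_effort": 2,
        "historical_data": 0,
    }
)
print(example)
```

The total is less informative than the list of weak criteria: a high total can hide a single 0, and a poor score on any one criterion is the signal to look at before committing to a build.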
The reason this diagnostic framework is worth articulating is that it is not theoretical. Two of the three projects in our current AI portfolio fit the profile above almost exactly, despite being in different industries, serving different clients and involving different technical approaches.
The first is a utility invoice processing pipeline built for a U.S.-based energy intelligence platform. Each month, roughly 14,000 invoices arrive from hundreds of utility providers across North America, each requiring data extraction, normalization, and entry into the platform's system. The process was high volume, had a defined accuracy target tied directly to the business case, involved semi-structured inputs (utility invoices vary in format but follow predictable field types), and relied entirely on a team of support agents performing manual validation. Every one of the four criteria was met before a model was evaluated.
The second is an AI-assisted manuscript screening platform built for a major academic publisher. Editorial teams were processing submissions across a large journal portfolio, running a defined set of pre-peer-review checks on each manuscript: author list verification, conflict of interest signals, sensitive topic flags, AI-generated content indicators and others. The process ran at high volume across hundreds of journals, each check had a defined pass or fail standard, the inputs were structured documents with consistent metadata, and the workload was carried entirely by a manual team of checkers and supervisors. Again, all four criteria were present before a line of code was written.
On the surface, these two projects have little in common. One processes financial documents in a regulated infrastructure environment. The other screens academic manuscripts for editorial integrity. The technology stacks are different, the compliance requirements are different and the domain expertise required is different.
What is the same is the shape of the problem. A team doing repetitive work. A defined standard for what correct looks like. A high volume of consistent inputs. A clear cost to the manual process that the automation can be measured against.
That structural similarity is what made both projects viable. It is also what makes the pattern transferable. A logistics company processing shipping documents, a healthcare platform validating intake forms, a financial services firm reviewing loan applications: each of these can be evaluated against the same four criteria before any AI investment is made.
One of the more useful things the two projects illustrate is that "automating a business process" rarely means removing humans entirely. In both cases, the design was more deliberate than that.
For the invoice processing pipeline, the system was built to extract and normalize data at scale, with support agents reviewing exceptions and corrections rather than processing every invoice manually. The model handles the volume; the human team handles the edge cases and feeds corrections back into the training pipeline. The labor cost does not disappear; it drops significantly and is concentrated on higher-value work. The technical approach behind this, including the fine-tuning strategy and the accuracy architecture, is covered in detail in our piece on AI document extraction and why fine-tuning alone is not enough.
For the manuscript screening platform, the design was explicitly human-in-the-loop across every check. The system runs automated verification, pulls data from external sources, and produces a Pass or Fail recommendation. The final decision always sits with a human checker, who can accept or override the recommendation. Across the more than sixteen checks currently in the platform, roughly 75% follow this assisted pattern, where the AI processes information and suggests an outcome and the checker confirms. Around 15% are enriched checks, where the system surfaces data from external sources and the checker interprets it. The remaining 10% are manual checks with digital support, where the platform provides the instructions and records the result but the checker performs the verification themselves.
None of the checks closes without human involvement. The goal of the design is not zero intervention. It is a better-informed and more consistent intervention, with verification work centralized in a single interface rather than spread across multiple external systems.
This tiered approach is not a compromise forced by technical limitations. It is a product design principle. Earning the right to automate a check more aggressively requires demonstrating that the assisted result consistently meets the quality standard. Until that threshold is reached, the human override is not a fallback. It is a quality gate.
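One way to make "consistently meets the quality standard" testable is to measure how often the checker's final decision agrees with the AI recommendation for each check. The sketch below is an illustration under stated assumptions, not the platform's implementation: the record layout, the check names, the agreement threshold, and the minimum sample size are all hypothetical.

```python
from collections import defaultdict

# Each record: (check name, AI recommendation, checker's final decision).
# Hypothetical log; names and values are made up for illustration.
decision_log = [
    ("author_list", "Pass", "Pass"),
    ("author_list", "Pass", "Pass"),
    ("author_list", "Fail", "Pass"),  # checker overrode the recommendation
    ("author_list", "Pass", "Pass"),
    ("conflict_of_interest", "Pass", "Pass"),
    ("conflict_of_interest", "Fail", "Fail"),
    ("conflict_of_interest", "Pass", "Pass"),
    ("conflict_of_interest", "Pass", "Pass"),
]


def agreement_by_check(decisions):
    """Count agreements and totals per check."""
    stats = defaultdict(lambda: {"agreed": 0, "total": 0})
    for check, recommended, final in decisions:
        stats[check]["total"] += 1
        if recommended == final:
            stats[check]["agreed"] += 1
    return dict(stats)


def meets_quality_gate(stats, threshold, min_samples):
    """Return checks whose agreement rate reaches the threshold on enough samples."""
    return sorted(
        check
        for check, s in stats.items()
        if s["total"] >= min_samples and s["agreed"] / s["total"] >= threshold
    )


stats = agreement_by_check(decision_log)
candidates = meets_quality_gate(stats, threshold=0.95, min_samples=4)
print(candidates)
```

In this toy log, only the check with no overrides clears the assumed 95% gate. In practice the sample size would need to be far larger, and agreement alone does not prove correctness if checkers tend to accept recommendations by default, so the measure is a starting point rather than a verdict.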
The result in both cases was not a dramatic reduction in headcount from day one. For the manuscript platform specifically, the total number of checks a checker now runs per manuscript has actually increased compared to the previous process, because the platform centralizes verifications that used to be distributed across multiple external systems. Time per manuscript has not necessarily decreased. What changed is the nature of the work: less time spent navigating between disconnected tools, more time spent in a single interface with standardized instructions, and a significantly more consistent process across the team. Quantitative metrics are still being captured as journal rollout expands, with a Power BI dashboard in development to track them.
The four criteria above provide a starting framework. Applying them to a specific process requires honest answers to a small set of questions.
How many times does this process run per month, and what is the fully loaded cost of each repetition?
This establishes the baseline that any automation investment needs to beat. If the answer is unclear, the business case cannot be built, and the investment cannot be justified.
What does "correct" look like for a single unit of this process, and how would you know if the output was wrong?
If the quality standard cannot be articulated in measurable terms, the accuracy target cannot be set, and the model cannot be evaluated against it. Vague standards produce vague results.
How structured are the inputs?
Documents with consistent fields and formats are easier to process than free-form text. Mixed or highly variable inputs are not disqualifying, but they increase engineering complexity and may lower the accuracy ceiling.
Who is currently doing this work, and what happens to their time if the process is automated?
This question is often skipped, but it matters for two reasons. First, it clarifies the displacement calculation. Second, it surfaces organizational dynamics that will affect the rollout. Teams that understand how automation changes their work are more likely to engage with it constructively, which directly affects the quality of the feedback loops that improve the model over time.
Is there historical data available?
Both projects described above had an unusual advantage: years of processed, corrected examples that could be used to fine-tune the model. That data was not created for AI training; it was the accumulated output of the manual process. Teams that have been running a manual process for years often have this asset without recognizing it. Teams starting from scratch face a harder problem and a longer timeline to production-grade accuracy.
A high-profile, executive-facing process feels like a meaningful target. It generates momentum and buy-in. But if it does not meet the structural criteria above, the result is a technically impressive system that does not compound value.
Teams often begin with a general sense that the process has a quality bar without defining what that bar is in measurable terms. The model reaches 90% accuracy and the team celebrates, until someone realizes that the downstream process still requires a human to review every output in order to catch the errors, which means the labor saving never materializes. Defining the accuracy threshold before development begins is not a documentation exercise. It is what makes the success criteria testable.
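The arithmetic behind that disappointment is easy to sketch. The example below compares the human hours left over when every output still has to be reviewed against the case where only flagged exceptions are reviewed. The function and all figures (volume, minutes per task, review fractions) are illustrative assumptions, not measurements from either project.

```python
def human_hours_per_month(
    volume: int,
    review_fraction: float,
    review_minutes: float,
    error_rate: float,
    fix_minutes: float,
) -> float:
    """Hours of human effort remaining after automation.

    review_fraction: share of outputs a person still has to look at.
    error_rate: share of outputs that are wrong and need correcting.
    """
    review = volume * review_fraction * review_minutes
    fixes = volume * error_rate * fix_minutes
    return (review + fixes) / 60


# Illustrative figures only (assumptions).
volume = 10_000
manual_hours = volume * 3 / 60  # 3 minutes per item when done by hand

# 90% accuracy, but nothing identifies the wrong 10%: review everything.
review_all = human_hours_per_month(volume, 1.0, 2, 0.10, 3)

# Same accuracy, but only 15% of outputs are flagged for review.
review_flagged = human_hours_per_month(volume, 0.15, 2, 0.10, 3)

print(manual_hours, round(review_all), round(review_flagged))
```

Under these assumptions the manual baseline is 500 hours a month. Reviewing everything leaves about 383 hours, a modest saving, while reviewing only flagged exceptions leaves about 100. The headline accuracy is identical in both cases; what differs is whether the threshold and the review design were defined up front.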
A high-volume process with poorly defined quality standards, highly unstructured inputs or no historical data is not a strong candidate. Volume amplifies the return on a well-structured automation. It also amplifies the cost of a poorly scoped one.
Both projects in this article involved teams of people whose daily workflows changed significantly as automation was introduced. In both cases, the rollout was designed to earn trust incrementally rather than replace effort immediately. That approach is not slower; it is more durable. Teams that understand the system and trust its outputs engage with it differently from teams that feel displaced by it. The quality of the human feedback loop, which is one of the primary mechanisms for improving the model over time, depends directly on that engagement.
The utility invoice pipeline and the manuscript screening platform use different models, different infrastructure, different integration approaches and different accuracy targets. What they share is the underlying process structure that made both viable.
McKinsey's 2025 research on AI returns found that organizations reporting significant financial returns from AI are roughly twice as likely to have redesigned full-workflow processes before selecting modeling techniques. Technology selection without process redesign rarely produces durable returns.
That transferability is the point. The question "is AI right for this process" is not primarily a question about technology. It is a question about the process itself. The technology choices follow from the answer.
Teams that develop the discipline to evaluate process fit before evaluating technology options tend to build systems that compound value rather than systems that demonstrate capability. The diagnostic criteria above are not a guarantee of success. They are a filter that removes the most common source of failure before the build begins.