AI document extraction accuracy: why fine-tuning alone is not enough
Document extraction accuracy is not a single problem but a sequence of failure modes resolved in order. Fine-tuning an open-weight visual-language model closes most of the gap from a general-purpose baseline, but rarely the gap that matters: the one between early performance and the threshold a business case requires. Closing that distance is a separate engineering effort, and the techniques that get there compound on each other rather than substitute for each other.
If you are reading this, there is a reasonable chance you already have something working. A model that reads documents, finds the relevant fields, outputs structured data. It looked promising in testing. The problem is getting it to work accurately, consistently and cheaply enough to justify the automation at scale, and that is a different challenge from getting it to work at all.
What AI changes is not the basic act of reading. It is the ability to read documents that do not conform to a fixed template. Free-form layouts, variable field positions, inconsistent terminology and formats that shift across providers, regions and years: that flexibility is genuinely valuable. It is also where the hard problems begin.
The challenge is not getting to a first working prototype. A reasonable model, applied to a reasonable sample of documents, will produce impressive early results. The challenge is getting from that early performance to the accuracy level a business process actually requires, and staying there at production volumes, under real conditions, with documents you have not seen before.
The scale of the problem is not small. According to APQC's 2025 cross-industry accounts payable benchmarks, the cost to process a single invoice ranges from $1.77 for top-performing organizations to $10.89 for bottom performers, with a cross-industry median of $5.83. For high-volume platforms processing tens of thousands of invoices each month, that spread represents a material operational cost, not an accounting footnote.
That journey from plausible to production-reliable is what this article is about. It draws on a project we built for a U.S.-based energy intelligence platform processing tens of thousands of utility invoices each month, one of the more technically demanding document extraction problems we have worked on. The specific techniques matter less than the principle behind them: closing the accuracy gap requires a compounding set of approaches, each addressing a different failure mode, applied in sequence as the easier gains run out.
Why does AI document extraction fail at scale
The immediate challenge with unstructured documents is not reading them. It is reading them reliably. A visual-language model that extracts the right fields from 90% of invoices in a test set can look very capable. The remaining 10% is where the operational reality sits.
Those failures are not randomly distributed. They cluster around predictable patterns: poor scan quality, unusual layouts, ambiguous field labels, missing data, multi-page documents where relevant information appears in different positions depending on the provider. Each of those patterns has a different root cause and requires a different mitigation. Treating them as a single problem to be solved by improving the model is a mistake that costs teams significant time.

APQC benchmark data shows that OCR-only systems typically achieve 85 to 95% accuracy on clean, structured invoices but struggle with inconsistent layouts and low-quality scans, while machine-learning models can push accuracy into the high nineties when applied correctly. The gap between those two outcomes is not a minor technical distinction. It is the difference between a system that requires continuous manual intervention and one that can run at scale.
There is also the accuracy threshold problem. A system operating at 90% accuracy across 14,000 invoices per month still produces 1,400 errors. If each error requires manual review to catch and correct, the labor saving that justified the automation disappears. The accuracy target for a document extraction system is not aspirational. It is the number below which the business case no longer holds. As we covered in our piece on how to build a business case for AI before writing a line of code, that target needs to be defined before any model evaluation begins, because it determines whether any of the subsequent engineering is pointed in the right direction.
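To make the threshold arithmetic concrete, here is a minimal sketch; the per-error review cost is an illustrative assumption, not a benchmark figure:

```python
def monthly_error_cost(volume: int, accuracy: float, review_cost_per_error: float) -> float:
    """Cost of catching and correcting the extractions the model gets wrong."""
    errors = volume * (1 - accuracy)          # 14,000 invoices at 90% -> ~1,400 errors
    return errors * review_cost_per_error

# Assumed $4 per manual correction, purely for illustration.
at_90 = monthly_error_cost(14_000, 0.90, review_cost_per_error=4.0)   # roughly $5,600 per month
at_97 = monthly_error_cost(14_000, 0.97, review_cost_per_error=4.0)   # roughly $1,680 per month
```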
Getting from a strong baseline to a production-grade accuracy target is where the real work concentrates. And it is rarely a linear process.
What does fine-tuning actually do for document extraction
For most document extraction projects, the first major accuracy lever is fine-tuning an open-weight visual-language model on domain-specific data. A general-purpose model has broad capability but limited precision on any particular document type. Fine-tuning on real examples from the target domain (invoices from specific utility providers, claims from specific insurers, filings from specific regulatory bodies) brings the model's behavior into much closer alignment with the actual task.
The energy platform project had an unusual advantage here: years of historical invoices, each one processed and corrected by a support agent team. That data was not created for AI training. It was the output of a manual workflow that had accumulated over time. Converting it into a usable training resource required structuring it correctly and building an API to expose it to the fine-tuning pipeline, but the underlying material was already there.
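What "structuring it correctly" can look like in practice: a minimal sketch that pairs each historical invoice with the field values agents verified for it, written out as one JSON object per line. The record layout, field names and file format here are assumptions for illustration, not the project's actual schema:

```python
import json
from pathlib import Path

def correction_to_training_example(record: dict) -> dict:
    """Pair a historical invoice with the field values support agents verified for it."""
    return {
        "document": record["invoice_pdf_path"],        # the document the model will be shown
        "target": {                                     # agent-corrected ground truth
            field["name"]: field["corrected_value"]
            for field in record["fields"]
        },
        "provider": record["provider_id"],              # useful later for stratified evaluation
    }

def build_training_file(records: list[dict], out_path: str) -> None:
    """Write one JSON object per line, a common input format for fine-tuning pipelines."""
    with Path(out_path).open("w") as f:
        for record in records:
            f.write(json.dumps(correction_to_training_example(record)) + "\n")
```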
The model started from a baseline accuracy of around 85%. Fine-tuning on that historical invoice data provided a strong foundation, but it was not enough to reach production-grade reliability on its own. Pushing beyond that foundation required a different category of engineering work entirely.
How to improve AI extraction accuracy beyond fine-tuning
Fine-tuning moves the model from its general-purpose baseline to a task-specific foundation. Getting it the rest of the way requires three techniques applied together, each targeting a different source of extraction failure.
Pipeline segmentation: why splitting documents improves accuracy
Processing a document as a single unit sounds reasonable until you encounter a twelve-page utility invoice with account summaries, usage breakdowns, rate schedules and billing adjustments scattered across different sections. The model has to maintain attention across all of it while extracting specific fields, and the cognitive load compounds with document length and layout complexity.
Pipeline segmentation breaks that problem into parts. A separate, lighter model runs first and identifies what is where: which sections contain which types of information. The main extraction model then operates on individual segments rather than the full document, and the outputs are recombined. The task becomes structurally simpler at every step.
In the energy platform project, this was applied after fine-tuning had reached its ceiling and the remaining errors clustered around the most structurally complex invoices. Segmentation addressed those directly, not by making the model smarter, but by reducing what it had to hold in view at any one time.
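A minimal sketch of that two-pass flow, with `segmenter` and `extractor` standing in as hypothetical model calls rather than any specific framework:

```python
def extract_with_segmentation(pages, segmenter, extractor):
    """Label pages first, then extract from each section in isolation and recombine."""
    # 1. Light pass: a smaller model labels each page with the section it belongs to.
    labels = [segmenter(page) for page in pages]

    # 2. Group contiguous pages that share a label into segments.
    segments = []
    for page, label in zip(pages, labels):
        if segments and segments[-1][0] == label:
            segments[-1][1].append(page)
        else:
            segments.append((label, [page]))

    # 3. Heavy pass: the main extraction model sees one segment at a time.
    partial_results = [extractor(seg_pages, section=label) for label, seg_pages in segments]

    # 4. Recombine, keeping the first value found for any field that appears more than once.
    merged = {}
    for result in partial_results:
        for field, value in result.items():
            merged.setdefault(field, value)
    return merged
```

The main model never has to hold a rate schedule and a billing adjustment in view at the same time, which is the point of the split.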
Contextual retrieval: how historical data reduces extraction errors
A well-tuned model still encounters documents it has not seen before. New providers enter the mix. Existing providers update their layouts. Field labels shift between billing periods without warning. Fine-tuning cannot anticipate these variations because it optimizes on the training distribution, not on future documents the model has not yet encountered.
The approach that addressed this was connecting the extraction model via a Model Context Protocol server to the platform's historical invoice database. Before processing a new invoice, the system retrieves comparable invoices from the same provider and passes that history as context. The model is no longer reasoning from general knowledge about what utility invoices look like. It is reasoning from verified examples of what this specific provider's invoices look like, including every correction a support agent has applied over the years.
This distinction matters more than it might appear. Fine-tuning updates the model's weights permanently, shifting its general behavior across all future extractions. Contextual retrieval enriches a single extraction decision at inference time, without changing the model at all. The two techniques operate at different levels and address different failure modes. Together, they are more effective than either would be alone.
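To make the retrieval step concrete, here is a simplified sketch of assembling that context before an extraction call; the `history_store` interface is a stand-in for the Model Context Protocol server and invoice database described above:

```python
def build_extraction_context(provider_id: str, history_store, max_examples: int = 3) -> str:
    """Fetch verified invoices from the same provider and format them as context for the model."""
    examples = history_store.latest_verified(provider_id=provider_id, limit=max_examples)
    blocks = []
    for ex in examples:
        fields = "\n".join(f"  {name}: {value}" for name, value in ex["verified_fields"].items())
        blocks.append(f"Invoice {ex['invoice_id']} ({ex['billing_period']}):\n{fields}")
    return (
        "Verified field values from this provider's previous invoices:\n\n"
        + "\n\n".join(blocks)
        + "\n\nUse the same field names and value formats when extracting the new invoice."
    )
```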
Continuous feedback loops: turning corrections into training data
The third technique is less a discrete engineering step and more a structural commitment: every correction a support agent makes to an extracted field gets captured, structured and routed back into the training pipeline.
This requires deliberate setup. Corrections need to be captured at the field level rather than flagged at the document level. The data needs to be formatted for fine-tuning, not just logged. And the retraining cycle needs to run frequently enough that improvements surface within a meaningful timeframe rather than accumulating invisibly.
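A minimal sketch of field-level capture, with the record layout and the file-based queue as illustrative assumptions; in practice the queue would feed the same training-file build shown earlier:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class FieldCorrection:
    """One agent correction, captured at the field level rather than the document level."""
    invoice_id: str
    provider_id: str
    field_name: str
    extracted_value: str   # what the model produced
    corrected_value: str   # what the agent changed it to
    corrected_at: str

def record_correction(queue_path: str, invoice_id: str, provider_id: str,
                      field_name: str, extracted_value: str, corrected_value: str) -> None:
    """Append the correction to a queue the retraining cycle drains on its next run."""
    correction = FieldCorrection(
        invoice_id=invoice_id,
        provider_id=provider_id,
        field_name=field_name,
        extracted_value=extracted_value,
        corrected_value=corrected_value,
        corrected_at=datetime.now(timezone.utc).isoformat(),
    )
    with open(queue_path, "a") as f:
        f.write(json.dumps(asdict(correction)) + "\n")
```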
When it is working, the accuracy curve does not plateau. The production workload itself becomes training data. The model improves on the document types it encounters most often, and edge cases that caused failures get absorbed into the training distribution over time. For teams building extraction systems on their own infrastructure rather than relying on third-party APIs, this is one of the harder-to-replicate advantages: a model that reflects your providers, your correction standards and your accumulated operational history, not a generalized capability someone else maintains.
What does it take to reach 97% extraction accuracy
It is worth being direct about what reaching 97% demands. The techniques are not interchangeable and the sequence matters. Fine-tuning has to establish the baseline before contextual retrieval can improve on it. The feedback loop needs production volume before it generates meaningful signal. Introducing them out of order wastes the potential of each.
The numbers from the energy platform project also depend on conditions that will not be universal: a large historical dataset, a correction-intensive manual workflow that had produced consistently labeled data over years, and a relatively contained document type despite the variation across providers. Different inputs will produce a different curve. What holds across projects is the structure: each technique addresses a failure mode the others do not, and skipping one means leaving that failure mode unaddressed.
What teams get wrong about AI document extraction
The first thing teams underestimate is how much work remains after the early gains. Initial results with a capable model often produce accuracy in the high eighties or low nineties without significant effort. That can create a false impression that consistent accuracy at scale is similarly accessible. It is not. The difference between 90% and 97% is a large multiple of the engineering work required to get from baseline to 90%.
The second thing is data quality. Fine-tuning on low-quality or inconsistently labeled data produces unpredictable results. The historical correction data in the energy platform project was valuable precisely because it had been produced by a consistent team applying consistent standards. Teams that assume any volume of existing data is automatically useful for training often find that quality problems in their historical data limit the gains they can achieve.
The third is the evaluation methodology. Measuring extraction accuracy requires a benchmark dataset that reflects the real distribution of documents the system will encounter in production, including the difficult ones. A benchmark constructed from clean, well-formatted examples will overestimate production performance. The failures that matter are not on the easy documents; they are on the edge cases, and those need to be deliberately included in the evaluation set.
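One way to keep those edge cases visible is to report accuracy per document stratum rather than as a single number. A minimal sketch, assuming the benchmark carries a stratum label on each document:

```python
from collections import defaultdict

def accuracy_by_stratum(predictions: list[dict], benchmark: list[dict]) -> dict[str, float]:
    """Field-level accuracy per stratum, so hard documents cannot hide behind the easy majority."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, truth in zip(predictions, benchmark):
        stratum = truth["stratum"]   # e.g. "clean_scan", "multi_page", "new_provider"
        for field, expected in truth["fields"].items():
            total[stratum] += 1
            if pred["fields"].get(field) == expected:
                correct[stratum] += 1
    return {s: correct[s] / total[s] for s in total}
```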
The fourth is operational continuity. As the distribution of real-world documents shifts over time, through new providers, updated formats and regulatory changes, accuracy can degrade without explicit monitoring. A system that performs well at launch may be underperforming twelve months later, not because the model changed but because the documents did. Accuracy needs to be tracked continuously in production, not validated once before launch.
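A minimal sketch of that kind of monitoring, assuming per-provider accuracy is already computed each month from the corrections the feedback loop captures; the threshold and window are placeholders to adjust to the business case:

```python
def providers_below_threshold(monthly_accuracy: dict[str, list[float]],
                              threshold: float = 0.97, window: int = 3) -> list[str]:
    """Flag providers whose recent production accuracy has drifted below the business threshold.

    monthly_accuracy maps a provider id to its accuracy per month, oldest first.
    """
    flagged = []
    for provider, history in monthly_accuracy.items():
        recent = history[-window:]
        if recent and sum(recent) / len(recent) < threshold:
            flagged.append(provider)
    return flagged
```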
Can you run AI document extraction in a regulated environment
The energy platform project had an additional layer of complexity that shaped every technical decision: the platform was SOC 2 compliant, and the invoices it processed contained login credentials and sensitive client data that could not be transmitted to a third-party API. The extraction pipeline had to run entirely within the platform's own infrastructure.
That constraint is worth flagging here because it is more common than it appears. Any platform that handles credentials, financial data, health records or personal information under regulatory obligations faces some version of this restriction. The compliance boundary does not make document extraction impossible. It narrows the model selection options, changes the infrastructure economics, and adds operational responsibilities that a hosted arrangement would otherwise carry. We covered this in more depth in our piece on building compliant AI in regulated environments.
What the constraint does not change is the accuracy engineering itself. The same techniques (fine-tuning, pipeline segmentation, contextual retrieval and feedback loops) apply regardless of whether the visual-language model is hosted or self-hosted. The constraint changes where the model runs; the accuracy challenge is the same.
Why the order of these techniques matters
The practical takeaway from working through a project like this is that extraction accuracy at scale is not a single technical problem with a single solution. It is a set of failure modes, each requiring a targeted response, applied in a deliberate sequence as the easier approaches reach their limits.
That sequence matters. Applying contextual retrieval before establishing a strong fine-tuned baseline wastes the potential of both techniques. Building a feedback loop before there is enough production volume to generate meaningful training data produces noise rather than signal. The order in which these techniques are introduced is part of the engineering.
What makes this worth understanding for decision-makers is not the specific techniques but the underlying structure. A document extraction system performing at 90% is not nearly there. It is at the point where fine-tuning has delivered what it can, and the more deliberate architectural work begins. Knowing the shape of the problem, what to expect at each stage and what each technique addresses, changes how teams plan, how they allocate engineering time, and how they evaluate whether a project is making real progress toward the accuracy a live system actually requires.
The gap between a convincing demo and consistent accuracy at scale is real. Closing it is achievable. It requires understanding which failure modes remain at each stage of the work, and applying the technique that addresses each one.
Build AI that works in production
If you are evaluating or building a document extraction pipeline and want a clear-eyed view of what the accuracy journey looks like for your specific use case, we are happy to work through it with you.