How regulated companies build AI without third-party APIs

Read time: 7 mins

At some point in almost every AI project we work on with a regulated client, someone on the team says some version of the same thing: "We can't send that data outside."

It usually lands like a problem. The obvious implementation path (connect your data to a capable hosted model, get results back, iterate) suddenly has a wall across it. SOC 2 compliance, data residency requirements, security perimeters built around credentials and sensitive client information: any of these can make the default approach to AI architecture a compliance violation rather than a technical decision.

For some time now, every major provider (OpenAI, Anthropic, and others) has been rolling out SOC 2 compliant offerings and APIs. Many of our clients use these APIs, particularly when delivered through platforms such as Azure or AWS Bedrock. For some customers, however, even these contractual assurances are insufficient; they require fully self-hosted solutions to retain total control.

What happens next is where teams diverge. Some treat the constraint as a reason to delay, or, more commonly, to scope the AI initiative so narrowly that the sensitive data never enters the workflow. A few treat it as a design input and build accordingly. The difference in outcomes between these two responses is significant, and it rarely shows up in the technology. It shows up in whether the system ever makes it to production.

We have built AI systems in environments where data could not leave the building, where credentials for thousands of third-party accounts had to be managed within a private security perimeter, and where accuracy targets were defined not by what felt achievable but by what the business case required. The architecture that emerges from those constraints differs from the default and, in some respects, is more deliberate. This article attempts to make that process legible.

Before getting into the mechanics, it is worth noting that the compliance constraint does not arrive alone. As we covered in our piece on pre-development business cases for AI, cost ceilings and accuracy targets need to be established before any technical decisions are made. In regulated environments, a hard boundary on where data is permitted to travel sits alongside the constraints from the first conversation. All three shape the architecture. None of them can be treated as implementation details.

Why data residency rules change your AI architecture before you start

When data cannot leave the organization's infrastructure, the model has to come to the data rather than the data going to the model.

That single inversion changes everything. Model selection, infrastructure design, fine-tuning strategy, operational cost: none of those decisions can be made sensibly without the compliance constraint on the table first.

Teams that discover this late, after a vendor relationship has been established and a build is underway, face a difficult choice. Restart the architectural work with the constraint properly incorporated, or proceed with an approach that will either fail a compliance review or require significant remediation before production. Neither outcome is acceptable.

The constraint belongs at the start of the scoping process. Not at the end.

What this looks like in practice

A project we worked on illustrates how a compliance requirement reshapes architecture before a line of code is written.

A U.S.-based energy intelligence platform managed utility data for a large portfolio of commercial and industrial clients. Each month, roughly 14,000 invoices arrived through a single third-party data aggregation service, sourced from hundreds of utility providers across North America, each of which required the platform to hold login credentials on behalf of its clients.

That credential management was itself a layered operational problem: credentials stored in a password manager, two-factor authentication handled through auto-forwarded email codes, and a team of support agents performing manual validation on extracted invoice data. The platform was SOC 2 compliant. That compliance status was not a bureaucratic detail. It meant that the usernames, passwords, and invoice data the platform managed on behalf of its clients could not be transmitted to a third-party API.

Sending an invoice to a hosted vision-language model (VLM) for structured data extraction was not a technical option. It was a compliance violation.

When sensitive data can’t leave the building, external APIs are off the table. Bringing open-weight models inside your own VPC takes more engineering rigor upfront, but it’s the only reliable way to clear security reviews and actually get AI into production.

Tiberiu-ioan Szatmari - AI Engineer at Thinslices


This single constraint determined that any AI system built for this workflow would need to run entirely within the platform's own infrastructure. That decision, reached before any model evaluation began, shaped everything that followed.

Hosted model approach versus self-hosted model approach: how a SOC 2 compliance requirement reshapes AI architecture before a line of code is written.


The real cost of running your own VLM in production

Self-hosting means the organization takes on responsibilities that a hosted provider would otherwise handle. This is worth being direct about, because it is where project scopes most often underestimate the work involved.

With a hosted model, the provider manages model versioning, infrastructure scaling, security patching, and in some cases ongoing model improvement. With a self-hosted model, all of that becomes the organization's responsibility.

In the energy platform project, the infrastructure decision resolved into a specific AWS instance type with NVIDIA GPUs, selected because it could process between 900 and 1,500 invoices per minute, depending on scan quality, at an annualized hardware cost that fit within the cost-per-invoice ceiling the business case had established. That ceiling was not a matter of engineering preference. It was the number below which the system would generate a positive return over a three-year horizon.

Owning the infrastructure also opened a cost optimization that a hosted arrangement would never have permitted. Invoice processing is a batch workload. Jobs are short-lived and can be restarted without significant disruption if interrupted. That makes the workload well-suited to cloud spot instances, which can be 50 to 90 percent cheaper than on-demand pricing.

That optimization was only available because the team owned the infrastructure. It is one of several places where the compliance constraint, which initially appeared as a cost burden, became a cost lever.
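The arithmetic behind that lever is straightforward. The sketch below uses illustrative prices and the throughput range cited above; the actual instance rates and discounts would come from the platform's own AWS account.

```python
# Illustrative cost-per-invoice arithmetic. The hourly rate and the ~70%
# spot discount are hypothetical figures for the example, not the
# project's actual numbers.

def cost_per_invoice(hourly_rate_usd: float, invoices_per_minute: float) -> float:
    """Hardware cost attributed to a single invoice at a given throughput."""
    invoices_per_hour = invoices_per_minute * 60
    return hourly_rate_usd / invoices_per_hour

on_demand = cost_per_invoice(hourly_rate_usd=12.00, invoices_per_minute=900)
spot = cost_per_invoice(hourly_rate_usd=12.00 * 0.30, invoices_per_minute=900)

print(f"on-demand: ${on_demand:.5f}/invoice")
print(f"spot:      ${spot:.5f}/invoice")
```

At batch scale, the difference compounds: the same hardware budget processes roughly three times the volume, which is exactly the kind of margin that decides whether a cost-per-invoice ceiling is met.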

How compliance narrows your model selection options

Selecting a self-hosted language model is not the same as selecting the best-performing model on a public benchmark. The evaluation process in a constrained deployment starts with economics, not performance.

The cost-per-unit ceiling sets the outer boundary on hardware spend. The hardware that fits within that ceiling limits which models are viable before any performance evaluation begins. Only then does the question of accuracy become relevant.

Within that constraint, the evaluation shifts to baseline task performance on the actual task, not a proxy. In the energy platform project, this process led to Mistral and Qwen as the candidate models: both small enough to run on the selected hardware within the cost constraint, both providing strong enough baseline performance to justify the fine-tuning investment that would follow.

Credentials add another layer of complexity that rarely appears in general AI evaluations. When an AI agent navigates utility provider portals to retrieve invoices, it needs login credentials for each provider. Those credentials cannot be passed to a hosted model. They must be retrieved from a secure internal store, passed to the agent within the private infrastructure, used to complete the authentication flow, and then destroyed.

The entire sequence takes place within the security perimeter. That requirement alone eliminates most hosted model architectures from consideration before a single performance benchmark is run.
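The retrieve-use-destroy sequence can be sketched as a context manager. This is a minimal illustration, assuming a hypothetical internal secret store (`fetch_secret` here stands in for something like Vault or AWS Secrets Manager reached over a private endpoint); it is not the project's actual credential code.

```python
# Sketch of the credential lifecycle: fetched from an internal store,
# held only for the login flow, then destroyed. All names are illustrative.

from contextlib import contextmanager

def fetch_secret(provider_id: str) -> dict:
    # Stand-in for an internal secret store inside the security perimeter.
    # In production this would be an authenticated call over a private network.
    return {"username": "agent@example.com", "password": "s3cret"}

@contextmanager
def borrowed_credentials(provider_id: str):
    creds = fetch_secret(provider_id)
    try:
        yield creds
    finally:
        # Best-effort destruction: wipe the mapping before returning,
        # so no reference to the secret outlives the login flow.
        creds.clear()

with borrowed_credentials("utility-123") as creds:
    # The agent completes the provider portal login here,
    # entirely inside the private infrastructure.
    assert "username" in creds
```

The point of the pattern is that credential lifetime is bounded by the authentication flow itself, which is exactly the property a hosted-model architecture cannot guarantee.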

Fine-tuning a private VLM to hit your accuracy target

A self-hosted open-weight model running on cost-constrained hardware will rarely perform at production-ready accuracy on a specialized task without further work. Fine-tuning is where the real engineering effort concentrates, and where the returns are not linear.

The energy platform project had an unusual advantage: hundreds of thousands of historical invoices in the system, each processed, digitized, and corrected by the support agent team over years of operation. That dataset was not created for AI training. It was the output of a manual workflow that had been running long before the AI project began. Converting it into a fine-tuning resource required structuring the data correctly and building an API endpoint to expose it to the training pipeline, but the underlying material was already there.

The model started with a baseline accuracy of around 85%. While fine-tuning on that historical invoice data provided a strong foundation, relying on fine-tuning alone wasn't enough to reach production-grade reliability. Pushing the system's accuracy to over 97% required augmenting the model with a robust engineering architecture. We achieved this by integrating three techniques:

Pipeline segmentation

Invoices were split into sections before extraction, with a separate model identifying which parts of the document contained which types of information. The main extraction model then ran on individual segments rather than the full document, and the outputs were combined.
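The control flow of that pattern can be sketched in a few lines. The two models are passed in as callables; the toy stand-ins below exist only to show the segment-then-merge shape, not the actual VLM calls.

```python
# Sketch of segment-then-extract: a segmenter model splits the document,
# the extraction model runs per section, and partial results are merged.
# Both callables below are trivial stand-ins for the self-hosted models.

from typing import Callable

def segment_then_extract(
    document: str,
    segmenter: Callable[[str], dict],       # document -> {section name: text}
    extractor: Callable[[str, str], dict],  # (section name, text) -> fields
) -> dict:
    """Run extraction on each section and merge the partial results."""
    merged: dict = {}
    for name, text in segmenter(document).items():
        merged.update(extractor(name, text))
    return merged

# Toy stand-ins to demonstrate the flow:
segmenter = lambda doc: dict(zip(["billing", "usage"], doc.split("|")))
extractor = lambda name, text: {name: text.strip()}

print(segment_then_extract("Total $120 | 840 kWh", segmenter, extractor))
# -> {'billing': 'Total $120', 'usage': '840 kWh'}
```

Running the extraction model on short, focused segments rather than a full multi-page invoice is what makes a small model's accuracy hold up.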

Contextual retrieval

A Model Context Protocol server connected the self-hosted model to the historical invoice database, allowing it to retrieve comparable invoices from the same provider before processing a new one. The accumulated history of corrections made by the support agent team became a real-time reference that informed each extraction decision.
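The retrieval step itself is simple once the history is exposed. The sketch below uses an in-memory dictionary as a stand-in for the historical invoice store behind the MCP server; structure and field names are illustrative.

```python
# Sketch of contextual retrieval: before extracting a new invoice, pull
# recent human-corrected invoices from the same provider as reference
# context. HISTORY is an in-memory stand-in for the real invoice database.

HISTORY = {
    "acme-utility": [
        {"fields": {"total": "118.40"}, "corrected": True},
        {"fields": {"total": "121.03"}, "corrected": True},
        {"fields": {"total": "119.00"}, "corrected": False},
    ],
}

def comparable_invoices(provider: str, limit: int = 2) -> list[dict]:
    """Return the most recent human-corrected invoices from the same
    provider, to be included as reference context in the extraction prompt."""
    corrected = [inv for inv in HISTORY.get(provider, []) if inv["corrected"]]
    return corrected[-limit:]

context = comparable_invoices("acme-utility")
print(len(context))  # 2
```

Filtering to human-corrected records matters: the context should carry the support team's accumulated judgment, not the model's own earlier mistakes.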

Continuous feedback loops

Every correction a support agent made to an extracted field was captured and routed back into the training pipeline. Each human correction was not just a fix. It was a data point that made the next extraction slightly more accurate.
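The capture step can be as small as turning each agent edit into a supervised training record. The schema and function names below are illustrative, not the project's actual pipeline.

```python
# Sketch of the feedback loop: every genuine agent correction becomes a
# training record appended to the fine-tuning queue. Schema is illustrative.

import json
from dataclasses import dataclass, asdict

@dataclass
class CorrectionRecord:
    invoice_id: str
    field: str
    model_value: str   # what the model extracted
    agent_value: str   # what the support agent corrected it to

training_queue: list[str] = []

def capture_correction(record: CorrectionRecord) -> None:
    # Only genuine corrections carry training signal; unchanged fields are skipped.
    if record.model_value != record.agent_value:
        training_queue.append(json.dumps(asdict(record)))

capture_correction(CorrectionRecord("inv-001", "total", "118.4O", "118.40"))
print(len(training_queue))  # 1
```

The filter on unchanged fields is the detail that keeps the loop honest: a queue padded with no-op "corrections" would dilute the signal the next fine-tuning run depends on.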

None of these steps was visible at the outset of the project. They emerged from the process of defining an accuracy target, measuring honestly against it, and engineering systematically toward it.

“Small” open-weight models don't come out of the box ready for production. Fine-tuning is a good start, but closing that final gap to over 97% accuracy requires rigorous engineering around the model: segmentation, historical context retrieval, and continuous feedback loops. It’s heavy lifting, but the payoff is a purpose-built system that executes your specific workflow safely and at a fraction of the cost.

Tiberiu-ioan Szatmari - AI Engineer at Thinslices

What teams consistently get wrong about self-hosted AI

The compliance constraint is usually identified. The operational reality of sustaining a self-hosted model in production is usually not.

Hosted models improve over time as providers update them. Self-hosted models do not. The version deployed at launch is the version in production until the team deliberately upgrades it, which requires revalidating performance, managing the infrastructure transition, and accepting that the model's behavior may change in ways that affect downstream processes.

Accuracy drift is the other underestimated risk. As the distribution of real-world inputs shifts away from the training distribution, extraction quality degrades gradually. Without explicit monitoring for this, a self-hosted model can quietly underperform for weeks before the problem surfaces in downstream data quality.

Both of these are solvable. But they need to be scoped into the project from the start, not treated as maintenance tasks to figure out after launch.
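Drift monitoring, in particular, does not need elaborate tooling when a human correction loop already exists: the agent correction rate is itself a drift signal. A minimal sketch, with an illustrative window size and alert threshold:

```python
# Sketch of drift monitoring via the correction rate: if support agents
# begin correcting a larger share of extracted fields, the input
# distribution has likely shifted. Window and threshold are illustrative.

from collections import deque

class DriftMonitor:
    def __init__(self, window: int = 500, alert_rate: float = 0.05):
        self.outcomes = deque(maxlen=window)  # True = field needed correction
        self.alert_rate = alert_rate

    def record(self, was_corrected: bool) -> None:
        self.outcomes.append(was_corrected)

    def drifting(self) -> bool:
        if not self.outcomes:
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.alert_rate

monitor = DriftMonitor(window=100, alert_rate=0.05)
for _ in range(98):
    monitor.record(False)
for _ in range(2):
    monitor.record(True)
print(monitor.drifting())  # False: a 2% correction rate is under threshold
```

The same correction stream that feeds the fine-tuning pipeline doubles as the early-warning system, which is one more return on building the feedback loop in the first place.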

Why compliance constraints produce better AI systems

There is a perspective on compliance constraints that rarely gets said out loud: they tend to produce better engineering decisions than the unconstrained alternative.

Teams working with hosted models can move fast. The provider handles infrastructure, versioning, and ongoing improvement. The tradeoff is limited visibility into how the model behaves, limited ability to adapt it to the specific task, and limited control over how it changes over time.

Teams working within a data perimeter are forced to make all of those decisions explicitly. They select the model with knowledge of its baseline characteristics. They fine-tune it on their own data, which gives them direct insight into where it fails and why. They run it on infrastructure they control, which means they understand its cost structure and can optimize it in ways a hosted arrangement would never permit.

The result is a system that is more auditable, more predictable, and more aligned with the specific requirements of the business.

The compliance constraint, treated as a design input rather than an obstacle, tends to produce AI systems that are more deliberately engineered than the default alternative. The organizations that move through this challenge most effectively are the ones that start that conversation first, before a vendor has been selected, before infrastructure has been provisioned, and before the engineering team has formed a view of what the system should look like.

Building AI inside a compliance perimeter?

If you are working through a regulated AI deployment and want a clear-eyed view of what the architecture, model selection, and infrastructure decisions actually involve, we are happy to work through it with you.