There is an emerging category of product in vogue at the moment, mostly designed for teams using LLM APIs. It is usually called an LLM Gateway, or an AI Gateway, or something like that.
At Doubleword we are releasing our control layer, another product in this category. I want to talk about why you would need this, what I think you should value when choosing a gateway, and why we think we do it well.
The case for AI Gateways
How LLM application building has changed
The way to produce intelligence that was tailored to your use-case before LLMs ate everything was that you would store, manage and maintain a massive trove of data. Then a team of expensive data scientists would run an algorithm to encode all that data into a “model”, a complex and fairly inflexible artefact, that you would then deploy to provide the same sort of insight over and over again.
With LLMs it has changed a bit. LLMs when scaled up do seem to produce “general purpose intelligence” in some sort of controversial but meaningful way. A lot of what would require lots of specialization before just does not any more. On the other hand, the incentives for the companies that provide these large models to flatten the whole space into one big prompt to their preferred model means that the advice for how to build interesting AI applications is suspiciously one-note.
In general, we have not found the right balance here. Building your own model from your data from scratch is too much, but writing your prompt once and sending it to OpenAI is too little.
How to specialise in an age of generalist models
In practice, to build a good application with some level of specialisation, you are going to have to do some extra work, to “specialise” your model.
As I see it, there are three different levels you might go through, in increasing levels of difficulty:
- Modifying your system prompt.
- Picking the best model amongst a set of API providers.
- Owning your own stack. Choosing self-hosted AI models, enhancing them such that they are meaningfully your own, and then hosting them consistently, efficiently, and reliably.
I work for Doubleword, a company that works on making point 3 possible. In the long run, for applications where you want the highest performance, the deepest integration, and the most control, this is the right way to operate. This is part of what we have been calling InferenceOps.
But that is not all applications. All applications have to start somewhere, and for most people, it should not be with buying a rack of B200s. So we are going to end up using APIs.
Why AI gateways
For API models
API models are frustrating for a lot of different reasons. Most of them boil down to the same thing — someone else is managing them. It means that they can be unreliable without accountability, that they can deprecate models that you were relying on, and you do not get privacy, data residency, or tenancy guarantees.
The dream of self-hosting is that you can own your own stack — that you can have more control over your operations, and not fail because someone else did a bad job.
The control layer is our way to help people get some of these benefits from API models. The control layer is a layer of insulation between your developers and the unreliable LLM APIs. It is a single point of control, in which you can define stable APIs that your developers can rely on, even as the API sands shift underneath you. Changing a model can be done in one place. Visibility into every request means that you control what does and does not get sent to API models, so you know at a glance what data is leaving your organization. Rate limits, virtual API keys, and flexible user management mean that you can control and secure your use of LLM APIs, without being locked into any single provider.
For self-hosted models
An AI gateway transforms a self-hosted stack into a production-ready resource. If you deploy an inference stack on your infrastructure, whether it be local and single-tenant, or large scale and multi-tenant, then you should understand your security posture.
Nowadays, all the popular inference engines expose web APIs. By and large, they provide no means to authenticate or authorize users of those APIs. Nor should they — it is not part of their job. But if the default is nothing, then lots of people will do nothing. And nothing is what people do in practice.
Self-hosted inference stacks need an authentication layer. The control layer is a simple and flexible way to provide that layer, in a way that is designed to scale from the smallest deployments to massive multi-tenant systems.
Anatomy of an AI Gateway
AI gateways sit between your users and LLM APIs. Where your developers would call the OpenAI API, they instead call an internal API that you make available.
Beneath the hood, the service you have configured at that internal API receives the request from that developer and sends it where it needs to go.
It is not a big change in infrastructure, but it does some big things:
- It lets you understand and build on your team’s usage of AI in real time.
- It lets you make decisions about how data flows to and from AI models.
- If you do it wrong, it creates a single point of failure, and slows down every request you make.
Understanding AI usage
Seeing what is happening
The control layer gives you a single pane of glass in which you can see how everyone in your organization is using AI.
This is very powerful. AI is expensive and hard to get right. If you want to get better, you have to know what you are doing now.
But this is hard now. Telemetry helps, but it is like looking through frosted glass — the tools we have are designed to do everything. Since every interaction with a model is an API call, we can see everything that matters by intercepting and forwarding these API calls. There is alignment between the goals of your users and the things that you are monitoring — they use the system by making API calls, and you are looking at the API calls.
We do not spend days building dashboards that nobody looks at. There is alignment between the goals of your users and the things that you are monitoring — they use the system by making API calls, and you are looking at the API calls.
Building a flywheel
Once you have built a mature, usable application that works in production and your users rely on it, then the real work starts. How do you make it better? The path from here to there is fuzzy and fraught with dangers.
One thing everyone agrees on is that you need data. Data on how your system is being used, data on how fast it is running, and, if at all possible, the full kitchen sink — every request and response that your application is sending.
We built in this functionality, at the level of granularity that you need. Full-featured RBAC means that sensitive data is only visible to users with the required permissions, and configurable logging means that you do not store what you do not want.
Gating access
The control layer lets you flexibly and transparently decide to whom models are available. We keep your downstream API keys encrypted in the database, and your users issue virtual keys that they use to access the models to which your team has granted them access.
Why its easy to do badly
Infrastructure is hard. An AI gateway will sit in the hot path of every single request your organization makes to an AI model. This could be hundreds of thousands of requests per second. If the gateway goes down, you can be cut off completely. When we were looking for an authentication and authorization layer for our self-hosted inference stack, it became clear that we needed to build something that took this responsibility seriously.
This responsibility makes a few things absolutely paramount:
Reliability
We built our control layer from the ground up in Rust. Rust’s error modelling means that runtime panics are incredibly rare. We maintain 100% test coverage of every code path that is exercised along the path of an AI request. We make sure that we do as little as possible with each request. The less the gateway does, the less can go wrong.
Performance
Gateways are classic examples of IO-bound applications. The intuition is that this is the kind of place where you can get away with an interpreted language and an event loop. There is some truth to this. But there is a philosophy mismatch. Gateways are the simplest possible web applications. This simplicity and flexibility means that they get hit with massive load. Performance is something you need to put in as a keystone of your design — you cannot put it back in later.
Our API key implementation requires no database round trip — instead, a materialized in-memory cache of API keys validates incoming requests in real time. This also means we do not make compromises on revocation — as you might if you used JWTs for example. This also means we do not have the operational overhead of deploying an additional caching layer.
Requests that transit the gateway are not deserialized. Instead, we use serde’s zero-copy implementation to read the model field in incoming JSON bodies, and then route the request at the HTTP level, without needing to parse incoming or outgoing requests. This also makes the system substantially easier to debug — you get the confidence that the errors your users receive are coming from downstream applications, and are not being wrapped or transformed by your proxy.
If we do not parse the requests, how do we log them in a structured way? We built the outlet Rust crate for asynchronous logging of requests. The requests are relayed to a background thread as raw bytes, and then the background thread parses and stores them as structured data. Only the Postgres outlet is implemented for now, but other outlets are on the roadmap.
For benchmarks comparing the control layer to other offerings, see here — https://fergusfinn.com/blog/control-layer-benchmarking
Simplicity
The entire system runs as a single binary that talks to a Postgres database. We do not have a million features, and we do not intend to. A gateway should be infrastructure, in the same way that nginx is infrastructure.
Outlook
Directly building on third-party LLM APIs is a trade-off: you gain initial speed, but sacrifice long-term control. For any serious application, that trade-off eventually breaks down. The instability of shifting models, unpredictable performance, and opaque usage data is not a viable foundation for production systems.
Self-hosted LLMs are an important part of the answer. But as it stands, they are missing important components to make them viable for production use.
An AI gateway is the correct architectural response to both of these problems. But adding a new service to the hot path of every request is a decision that cannot be taken lightly. A gateway that is not fundamentally performant and reliable ceases to be a solution and becomes part of the problem, compromising the very applications it is meant to support. It must be treated as core infrastructure.
The control layer has been that infrastructure for our in-house inference stack for a while now. The design choices — a single binary in Rust, zero-copy request handling, and an obsession with simplicity — are a direct response to the challenges of building a gateway correctly. We intend to make it a solid foundation for providing the control and observability that professional AI development requires.
We are releasing it open source, so that other teams can benefit from it too. It is now available on GitHub — https://github.com/doublewordai/control-layer
This is the first step towards a mature InferenceOps practice. It is about moving beyond one-off API calls and building a deliberate, robust system for deploying intelligence.
