Doubleword | Inference when no one is waiting

Most conversations in inference have centred on making the experience better for someone waiting. Reducing time-to-first-token. Claude's "fast mode".¹ Groq and Cerebras. The whole technical project of the last few years has assumed that a human, somewhere, is waiting for the response.

That assumption held for a real reason: we didn't trust AI to run on its own. For almost every use case, we designed applications so a human was checking the work regularly, catching compounding errors and hallucinations before they spiralled. The latency obsession was downstream of that - if a human is "in the loop", they really don't want to be waiting.

This assumption is now breaking down. Frontier models, both open source and proprietary, are good enough that we can trust them to reason and work autonomously for longer stretches without supervision, and that trend is only pointing in one direction². Once no one is waiting, what matters isn't how quickly the answer comes back. It's how much useful work gets done, and how good it is.

A huge proportion of AI workloads fit this pattern. The ones that don't - call them "tell me now" - are use cases that by construction require human interaction: customer support agents, chatbots, voice. Everything else is "just get it done".

The industry has poured almost all of its inference effort into "tell me now" use cases, even though "just get it done" is where the volume of tokens is going to come from. "Tell me now" is bounded by the surface area of interaction a system can have with a person. "Just get it done", by contrast, is boundless. There is an endless backlog of problems that would benefit from intelligence, if intelligence were cheap enough.

These two regimes call for genuinely different inference systems. You can build a stack that's mediocre at both or excellent at one.

We're building for the "just get it done" regime. We believe that if you drive the cost of background inference down far enough, you change what's feasible to build. Workloads that were marginally too expensive become routine. Problems that seemed completely out of reach become feasible. The goal, as plainly as I can state it, is to provide the highest possible IQ per dollar.

Optimizing for the highest IQ per dollar, and for the "just get it done" use cases, is an underexplored corner of the inference research space. Most inference work over the last few years has chased the lowest possible latency. We, by contrast, optimize for IQ per dollar, and when you optimize for that deliberately at every layer of the stack, there is a world of gains available.

Intelligence has always been scarce, so we've had to ration it. We researched the diseases that were most profitable to research. We translated the books that were most likely to sell. We reserved legal representation for those with the most means. If we get to a world where intelligence is abundant, it doesn't just make existing work cheaper - it changes which problems are worth working on. And that is the world we’re building inference for.


Footnotes

¹https://code.claude.com/docs/en/fast-mode

²https://metr.org/time-horizons/

