May 5, 2026

Inference when no one is waiting

Meryem Arik

Most conversations in inference have centred on making the experience better for someone waiting. Reducing time-to-first-token. Claude's "fast mode" [1]. Groq and Cerebras. The whole technical project of the last few years has assumed that a human, somewhere, is waiting for the response.

That assumption held for a real reason: we didn't trust AI to run on its own. For almost every use case, we designed applications so a human was checking the work regularly, catching compounding errors and hallucinations before they spiralled. The latency obsession was downstream of that - if a human is in the loop, they really don't want to be waiting.

This assumption is now breaking down. Frontier models, both open source and proprietary, are good enough that we can trust them to reason and work autonomously for longer stretches without supervision, and that trend is only pointing in one direction [2]. Once no one is waiting, what matters isn't how quickly the answer comes back. It's how much useful work gets done, and how good it is.

A huge proportion of AI workloads fit this pattern. The ones that don't - call them "tell me now" - are use cases that by construction require human interaction: customer support agents, chatbots, voice. Everything else is "just get it done".

The industry has poured almost all of its inference effort into "tell me now" use cases, even though "just get it done" is where the volume of tokens is going to come from. "Tell me now" is bounded by the surface area of interaction a system can have with a person. "Just get it done", by contrast, is boundless. There is an endless backlog of problems that would benefit from intelligence, if intelligence were cheap enough.

These two regimes call for genuinely different inference systems. You can build a stack that's mediocre at both or excellent at one.

We're building for the "just get it done" regime. We believe that if you drive the cost of background inference down far enough, you change what's feasible to build. Workloads that were marginally too expensive become routine. Problems that seemed completely out of reach become feasible. The goal, as plainly as I can state it, is to provide the highest possible IQ per dollar.

Optimizing for the highest IQ per dollar, which is what the "just get it done" use cases call for, is an underexplored corner of the inference research space. Most inference work over the last few years has chased the lowest possible latency. When you instead optimize for IQ per dollar deliberately, at every layer of the stack, there is a world of gains available.

Intelligence has always been scarce, so we've had to ration it. We researched the diseases that were most profitable to research. We translated the books that were most likely to sell. We reserved legal representation for those with the most means. If we get to a world where intelligence is abundant, it doesn't just make existing work cheaper - it changes which problems are worth working on. And that is the world we're building inference for.

Footnotes

[1] https://code.claude.com/docs/en/fast-mode

[2] https://metr.org/time-horizons/
