June 4, 2025

Behind the Stack, Ep 2: How Many Users Can My GPU Serve?

Jamie Dborin

Introduction

When self-hosting LLMs and productionising AI, one of the first practical questions you’ll run into is: “How many users can this system actually support?”

It’s a question rooted in system design, not just intuition. While it's easy to watch GPU utilization or rely on batch size as a proxy, neither gives you a reliable measure of how far your hardware can actually stretch under real-world loads.

In this video, we break down the calculation that gives you a usable estimate of your system's capacity - grounded in memory constraints and model architecture. With just a few known quantities (model config, token usage, GPU size), you can forecast how many users your setup can realistically support as well as how to grow that number.

GPU Memory: What's Actually Using It?

At inference time, your GPU memory gets divided among three major components:

  • Model Weights - a fixed chunk, based on parameter count and precision

  • Activations - temporary tensors created during forward passes (often small and engine-managed)

  • KV Cache - memory that stores every token currently active in the system

For real-time or multi-user workloads, the KV cache is often the limiting factor. It's what determines whether a new user’s request can be served without delay, regardless of what your GPU utilization says.
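To make that concrete, the per-token KV cache footprint can be computed directly from the model architecture. A minimal sketch, using the commonly published Llama-3-8B config values (treat these as assumptions; check your model's config.json):

```python
# Per-token KV cache size, derived from model architecture.
# The leading factor of 2 covers both the K and the V tensor at each layer.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one in-flight token occupies across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama-3-8B: 32 layers, 8 KV heads (GQA), head dim 128, FP16 (2 bytes)
per_token = kv_bytes_per_token(32, 8, 128, 2)
print(per_token)   # 131072 bytes, i.e. ~0.00013 GB per token
```

Note how grouped-query attention (8 KV heads rather than 32 query heads) already shrinks this figure fourfold compared to standard multi-head attention.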

The Core Calculation

Let’s say you’re running LLaMA 8B in FP16 on an 80GB A100. The numbers break down roughly as:

  • Model Weights ≈ 16GB (8B params × 2 bytes per param)

  • Remaining VRAM for KV cache ≈ 64GB

  • Each token uses ≈ 0.00013 GB (based on head size, KV heads, layers, and precision)

That gives you:

64 GB / 0.00013 GB per token ≈ 492,000 tokens total

Now, assume each user sends 8K tokens of input and expects a 2K token output:

492,000 / 10,000 tokens per user ≈ 49 users

This gives you a rough upper bound on concurrent users - based entirely on memory, not compute throughput.
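The whole estimate fits in a few lines of arithmetic; a sketch using the numbers from the text:

```python
# Back-of-envelope capacity estimate: FP16 Llama 8B on an 80GB A100,
# using the per-token KV cache figure quoted above.

GPU_GB = 80
PARAMS_B = 8
BYTES_PER_PARAM = 2              # FP16
KV_GB_PER_TOKEN = 0.00013        # from head size, KV heads, layers, precision
TOKENS_PER_USER = 8_000 + 2_000  # input + expected output

weights_gb = PARAMS_B * BYTES_PER_PARAM        # 16 GB
kv_budget_gb = GPU_GB - weights_gb             # 64 GB left for KV cache
total_tokens = kv_budget_gb / KV_GB_PER_TOKEN  # ~492,000 tokens
max_users = int(total_tokens // TOKENS_PER_USER)
print(max_users)   # 49
```

Swapping in your own model config and GPU size gives an equivalent first-pass estimate for any deployment.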

Scaling That Number

Once you understand the math, there are three main ways to increase your capacity:

1. Quantize the Model (and/or the KV Cache)

Reducing model precision shrinks the memory footprint of weights - and can sometimes reduce KV cache size if supported by your inference engine.


KV cache quantization is less common in production but can double or quadruple token capacity where supported. The tradeoff is increased decoding latency unless fused dequantization kernels are available.
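A rough sense of how precision changes token capacity, reusing the 80GB A100 / 8B-parameter numbers from the calculation above (illustrative only; actual savings depend on the engine and quantization scheme):

```python
# Token capacity at different precisions (illustrative assumption:
# INT8 halves weight bytes, and INT8 KV halves per-token KV size).

GPU_GB, PARAMS_B = 80, 8
KV_FP16_GB_PER_TOKEN = 0.00013

configs = [
    ("FP16 weights, FP16 KV", 2, 1.0),
    ("INT8 weights, FP16 KV", 1, 1.0),
    ("INT8 weights, INT8 KV", 1, 0.5),
]

results = {}
for name, weight_bytes, kv_scale in configs:
    kv_budget_gb = GPU_GB - PARAMS_B * weight_bytes
    results[name] = kv_budget_gb / (KV_FP16_GB_PER_TOKEN * kv_scale)
    print(f"{name}: ~{results[name]:,.0f} tokens")
```

Quantizing weights alone frees some VRAM for cache, but quantizing the cache itself is what moves capacity from ~490K to over a million tokens in this sketch.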

2. Increase Available VRAM

You can scale up or out:

  • Vertical scaling: Upgrade to higher VRAM GPUs (e.g., 24GB → 80GB → 128GB)

  • Horizontal scaling: Distribute the model across multiple GPUs using tensor parallelism or pipeline parallelism

More VRAM gives you a larger KV cache - and therefore more tokens to work with. Horizontal scaling introduces some duplication overhead and infrastructure complexity, but it’s often necessary at larger scale.
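Under tensor parallelism, a simplified version of the same budget math looks like this (my assumption, not from the text: weights shard evenly across GPUs, and activation/engine overhead is ignored):

```python
# Sketch of total token budget across N identical GPUs with a single
# sharded copy of the model weights.

def total_tokens(n_gpus: int, gpu_gb: float = 80.0, weights_gb: float = 16.0,
                 kv_gb_per_token: float = 0.00013) -> int:
    """Total KV-cache tokens the cluster can hold."""
    kv_budget_gb = n_gpus * gpu_gb - weights_gb  # weights stored once, sharded
    return int(kv_budget_gb / kv_gb_per_token)

for n in (1, 2, 4):
    print(f"{n} GPU(s): ~{total_tokens(n):,} tokens")
```

Because the weights are paid for only once, each additional GPU contributes its entire VRAM to the KV cache, so capacity grows slightly faster than linearly in this idealized model.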

3. Offload the KV Cache

Some engines allow you to offload older KV layers to CPU or even disk, or keep only the last few layers on GPU. This can reduce GPU KV cache usage by 90%+.

The catch is latency. Unless your inference engine overlaps data movement with computation efficiently, you’ll see increased response times - so this is best used in workloads that prioritize token capacity over speed.
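As a hypothetical model of that tradeoff (real engines offload at block or layer granularity, not a smooth fraction), the GPU-resident footprint per token shrinks in proportion to the fraction offloaded:

```python
# If an engine offloads a fraction of the KV cache to CPU or disk,
# the GPU-resident bytes per token shrink proportionally.

def gpu_resident_tokens(kv_budget_gb: float = 64.0,
                        kv_gb_per_token: float = 0.00013,
                        offload_frac: float = 0.9) -> int:
    gpu_gb_per_token = kv_gb_per_token * (1 - offload_frac)
    return int(kv_budget_gb / gpu_gb_per_token)

print(f"~{gpu_resident_tokens():,} tokens with 90% of the KV cache offloaded")
```

A 90% offload turns the ~490K-token budget into roughly ten times as many GPU-resident tokens, at the cost of fetching the offloaded portion back during decode.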

Other Considerations

These calculations give you a strong first estimate, but real-world behavior varies depending on:

  • Inference engine - whether it supports paged attention, chunked prefill, quantized cache, etc.

  • Workload shape - are requests long, short, bursty, or streaming?

  • Fragmentation - fixed-size page allocation can leave some KV cache space unused

  • Decode behavior - token generation is more memory-bound than prefill, so reducing load won't necessarily improve response times

If you’re tuning for production, these second-order factors can shift your real limits by 10–30%.

Conclusion

If you’re self-hosting LLMs and need to hit concurrency or latency targets, it’s critical to move from intuition to calculation. With just a few inputs - model size, VRAM, context length - you can:

  • Estimate concurrency limits
  • Choose the right model precision
  • Plan upgrades and scaling strategies
  • Tune memory settings per engine

This is what determines whether you can serve 10 users - or 100.
