70x faster cold(ish) starts for SGLang
April 6, 2026

Fergus Finn

It takes nearly twelve minutes to start serving a 122 billion parameter MoE model [1] on a B200 in Kubernetes [2].

That's a long time.

Who knows where the time goes

First, measure.

A lot of slow stuff! 21 seconds of Python imports. Autotuning, JIT compilation. 531s of weight loading [3]! To be clear, this all makes sense with SGLang's API contract. SGLang getting 10% more tokens per second is worth way more than launching 10s faster. So over time, the launch time creeps up and up.

Most of this startup work produces artifacts that SGLang already knows how to cache. Compiled kernels, autotuned configs, JIT outputs. If we persist them across restarts, we get a warm start:

Much better. The kernel caches cut most of the compilation time, and the OS page cache does the rest. The weights are still on disk, but once they've been read, Linux keeps them in RAM. The second load reads from page cache instead of NVMe.

We're still not doing great though. 88s! Only 31s of that is actually loading weights. The other 57s is imports, config parsing, autotuning, warmup. All overhead we're paying every time, even though the results are the same. The only irreducible part is moving 117 GB of weights onto the GPU.

On this machine we have PCIe gen 5 x16 (64 GB/s theoretical to GPU) and an NVMe gen4 SSD (7-8 GB/s to RAM). So the bandwidth floor for weight transfer is somewhere between 117/8 ≈ 15s from disk and 117/64 ≈ 1.8s from RAM. Can we get close?
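As a quick check on those two floors, here is a minimal sketch of the arithmetic. The function name and constants are illustrative only; the figures are the ones quoted above, taking the upper end of the NVMe range.

```python
def transfer_floor_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on transfer time: data size divided by link bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


WEIGHTS_GB = 117      # weight size quoted in the text
NVME_GB_PER_S = 8     # upper end of the 7-8 GB/s NVMe figure
PCIE_GB_PER_S = 64    # theoretical PCIe gen 5 x16 figure

disk_floor = transfer_floor_seconds(WEIGHTS_GB, NVME_GB_PER_S)
ram_floor = transfer_floor_seconds(WEIGHTS_GB, PCIE_GB_PER_S)
print(f"floor from disk: {disk_floor:.1f}s, floor from RAM: {ram_floor:.1f}s")
```

Neither number accounts for overheads such as page registration or filesystem metadata, so they are floors rather than targets.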

Here's one I made earlier

One way to approach this problem is to take the timeline above and start hitting it with the optimization hammer. None of this stuff has to take this much time, and where it does need to take time, it could be cached, parallelised, pipelined.

It's hard though. SGLang is a big project, with a lot of moving parts. And the current structure is rational! It makes sense to spend lots of time in startup making sure we're as performant as we can be at runtime. We should expect SGLang's maintainers to stay rational too: if we make startup fast, it will later become slow again.

What we really want to do is just take everything that SGLang does on startup (including whatever they might choose to do in the future) and cache it, all together in some nice package that's fast to restore from. In the systems world this is called checkpoint/restore. The leading solution on Linux is CRIU, which can checkpoint and restore a process tree through existing kernel APIs. Its CUDA plugin calls NVIDIA's cuda-checkpoint tool to capture and restore GPU device state.

One catch: a naive CRIU checkpoint of SGLang would include all GPU memory, 192GB on our B200. Most of that is weights and KV cache that SGLang already knows how to reload. So we strip them out, shrinking the checkpoint from 192GB to 6.6GB, but we pay for it with a weight reload step on wake. Here's the restore timeline:

Lots better [4]! We've shaded over some frustrating work here: stripping weights and KV cache from the checkpoint, packaging it as an OCI image for Kubernetes, GPU device remapping, waking up after restore, and getting CRIU and SGLang to play nice.

Down to the wire

32s, but still far from the bandwidth floor. Of the 32s, only 19s is actually loading weights. The other 13s is overhead: container setup (6s) and CRIU process restore (7s). Before we tackle the weight loading, we can cut this overhead roughly in half.

Containerd was doing redundant work on every checkpoint restore that it doesn't do on normal pod launches. Two patches cut container setup from 6s to 3s. CRIU itself was also slower than it needed to be. The main fix is zero-copy page restore: mmap the checkpoint pages directly instead of copying them into fresh allocations. After that, most of the remaining CRIU time is the cuda-checkpoint driver call itself.
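To illustrate the difference between the two restore strategies, here is a minimal sketch in Python. It is not CRIU's code: the file and both helper names are made up for the example, and a private copy-on-write mapping is assumed as the stand-in for "mmap the checkpoint pages directly".

```python
import mmap
import os
import tempfile


def load_by_copy(path: str) -> bytes:
    """Copying restore: read() allocates a fresh buffer and copies every byte."""
    with open(path, "rb") as f:
        return f.read()


def load_by_mmap(path: str) -> mmap.mmap:
    """Mapping restore: a private copy-on-write mapping of the file.

    Pages are faulted in on first touch, and writes stay in the mapping
    without ever reaching the file on disk.
    """
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)


if __name__ == "__main__":
    # Hypothetical two-page "checkpoint" file, for illustration only.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"A" * 8192)
    pages = load_by_mmap(tmp.name)
    pages[0:1] = b"B"                        # the restored process dirties a page
    print(pages[0:1], load_by_copy(tmp.name)[0:1])
    pages.close()
    os.unlink(tmp.name)
```

The mapping version does no up-front copy at all, which is where the saving comes from when most restored pages are only ever read.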

Overhead is down to about 6.5s. That leaves the weights.

Keep the home fires burning

It's finally time to tackle the big green bit. We've got lots of RAM on the device [5], so the best we can do is PCIe bandwidth: 64GB/s for this machine setup. How are we going to get there?

First, the weight reload. We sneakily dropped it from 31s in the warm start to 19s in the restore baseline. Instead of loading from the safetensors files on disk (like on the initial load), on checkpoint we dump a serialized representation of the actual allocations [6]. Then, when we load back, we load through a ring buffer with direct IO. We hit the disk ceiling pretty handily for this device.
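The ring buffer idea can be sketched roughly as follows. This is an illustration, not the loader described above: the function name, slot sizes and the `consume` callback are hypothetical, and it uses plain unbuffered reads rather than direct IO, since O_DIRECT needs aligned buffers and filesystem support that a short portable example can't assume.

```python
import queue
import threading
from typing import Callable, List


def stream_through_ring(path: str, consume: Callable[[memoryview], None],
                        slot_bytes: int = 1 << 20, slots: int = 4) -> int:
    """Read `path` through a fixed ring of reusable buffers; return bytes consumed."""
    ring = [bytearray(slot_bytes) for _ in range(slots)]
    free: "queue.Queue[int]" = queue.Queue()
    filled: "queue.Queue" = queue.Queue()
    errors: List[BaseException] = []
    for i in range(slots):
        free.put(i)

    def producer() -> None:
        try:
            with open(path, "rb", buffering=0) as f:
                while True:
                    i = free.get()              # wait for the consumer to free a slot
                    n = f.readinto(ring[i])
                    if not n:
                        return                  # end of file
                    filled.put((i, n))
        except OSError as exc:
            errors.append(exc)
        finally:
            filled.put(None)                    # sentinel: nothing more to consume

    reader = threading.Thread(target=producer, daemon=True)
    reader.start()
    total = 0
    while True:
        item = filled.get()
        if item is None:
            break
        i, n = item
        consume(memoryview(ring[i])[:n])        # stand-in for the copy to the GPU
        total += n
        free.put(i)                             # hand the slot back to the reader
    reader.join()
    if errors:
        raise errors[0]
    return total
```

The point of the ring is that the disk read for slot k+1 overlaps with the consumption of slot k, while memory use stays fixed at `slots * slot_bytes`. The consumer has to finish with a slot (or copy out of it) before handing it back.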

But the disk ceiling is pretty low. We want the RAM ceiling: 1.8s. We can add faster disks, GPU Direct Storage, all this stuff. But if we don't want to re-spec our machine, and, like I do here, we've just got an NVMe-backed virtio disk, we need to get the weights into RAM before the restore starts.

The problem is, by definition, the process doesn't exist before it starts restoring. So it's going to have to get this RAM-backed weights file from someone else.

This 'someone else' is a daemon on the node, whose job is to watch the weights checkpoint directory and stage its contents into RAM. It exposes a unix socket to restored containers. On wake, torch_memory_saver queries the socket. If the daemon has the weights staged, it passes the file descriptor over and reload happens from RAM instead of disk. Otherwise, we fall back to regular restore. Depending on how much memory you give the daemon, the chance of a cache hit can be very high or very low [7].
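The handoff can be sketched with Python's standard library, which wraps SCM_RIGHTS in `socket.send_fds` and `socket.recv_fds` (Python 3.9+, Unix only). Everything else here is an assumption for illustration: the function names, the 8-byte size header, and the use of `os.memfd_create` as a stand-in for the daemon's staged weights. The real daemon's protocol is not shown in this post.

```python
import mmap
import os
import socket
import tempfile


def stage_in_ram(payload: bytes) -> int:
    """Daemon side: place bytes in an anonymous file and return its fd."""
    if hasattr(os, "memfd_create"):             # Linux: RAM-backed anonymous file
        fd = os.memfd_create("staged-weights")
    else:                                       # fallback so the sketch runs elsewhere
        fd, path = tempfile.mkstemp()
        os.unlink(path)
    view = memoryview(payload)
    while view:                                 # os.write may write fewer bytes than asked
        view = view[os.write(fd, view):]
    return fd


def send_staged_fd(sock: socket.socket, fd: int, size: int) -> None:
    """Daemon side: send the size as data and the fd as ancillary data."""
    socket.send_fds(sock, [size.to_bytes(8, "little")], [fd])


def receive_and_map(sock: socket.socket) -> mmap.mmap:
    """Restored-process side: receive the fd and map the staged bytes read-only."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, 8, 1)
    if not fds:
        raise RuntimeError("nothing staged; fall back to loading from disk")
    try:
        return mmap.mmap(fds[0], int.from_bytes(msg, "little"), access=mmap.ACCESS_READ)
    finally:
        os.close(fds[0])                        # the mapping keeps its own reference


if __name__ == "__main__":
    daemon_end, pod_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    staged = stage_in_ram(b"weights" * 1000)
    send_staged_fd(daemon_end, staged, 7000)
    mapped = receive_and_map(pod_end)
    print(len(mapped), mapped[:7])
```

Because only a file descriptor crosses the socket, no weight bytes are copied between the daemon and the restored process: both end up mapping the same pages.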

The remaining problem is getting 118GB from RAM to the GPU quickly. A naive cudaMemcpy from an unpinned buffer gets nowhere near PCIe speeds. You need the driver to know the buffer is stable in RAM, so it can issue DMA directly. But the registration call (cudaHostRegister) is slow, scaling with the number of pages backing the buffer. We solve this with hugepages (fewer pages to register) and by pipelining registration with the transfer [8].
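Both halves of that fix can be put in back-of-envelope form. The sketch below is a timing model only, not CUDA code: it counts pages per buffer for different page sizes, and compares a serial register-then-copy loop with a two-stage pipeline in which block k+1 is registered while block k copies. The per-block timings in the usage lines are invented for illustration, not measurements from this setup.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30


def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
    """Number of pages backing a buffer (rounded up)."""
    return -(-buffer_bytes // page_bytes)


def serial_time(blocks: int, register_s: float, copy_s: float) -> float:
    """Register a block, copy it, then move on to the next one."""
    return blocks * (register_s + copy_s)


def pipelined_time(blocks: int, register_s: float, copy_s: float) -> float:
    """Two-stage pipeline: registration of the next block overlaps the current copy."""
    if blocks == 0:
        return 0.0
    return register_s + (blocks - 1) * max(register_s, copy_s) + copy_s


# Pages backing one 1 GiB block for each page size mentioned in the footnotes.
for name, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
    print(f"{name} pages per GiB: {pages_needed(GIB, size)}")

# Hypothetical per-block costs (seconds), chosen only to show the shape of the win.
print(serial_time(118, 0.010, 0.025), pipelined_time(118, 0.010, 0.025))
```

In the model, once registration is cheaper than the copy it overlaps with, its cost disappears from the total apart from the first block. Hugepages are what make registration cheap enough for that to hold.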

The net effect is big:

We're getting 38GB/s effective loading from RAM to the GPU. It's not 64GB/s (~50 GB/s is perhaps a more realistic expectation), but it's dead fast.

Conclusion

Here are all the reload paths on the same absolute scale:

9.6s. 70x faster than cold, 9x faster than a warm start. Most of what's left is the cuda-checkpoint driver call (3.5s) and the weight DMA (3.1s at 38 GB/s). The theoretical floor is around 1.8s. We're not there yet, but we're getting close [9].
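For readers who want to check the headline numbers against each other, a small sketch follows. All inputs are figures quoted in this post. The cold-start duration is not stated to the second, so the value below is only what the 70x figure implies.

```python
RESTORE_S = 9.6       # final restore time
WARM_S = 88           # warm start time
COLD_SPEEDUP = 70     # quoted speedup over the cold start
WEIGHTS_GB = 118      # RAM-to-GPU transfer size quoted above
DMA_S = 3.1           # weight DMA time


def speedup(baseline_s: float, improved_s: float) -> float:
    """How many times faster the improved path is than the baseline."""
    return baseline_s / improved_s


def effective_bandwidth(size_gb: float, seconds: float) -> float:
    """Achieved throughput in GB/s."""
    return size_gb / seconds


warm_speedup = speedup(WARM_S, RESTORE_S)
dma_gb_per_s = effective_bandwidth(WEIGHTS_GB, DMA_S)
implied_cold_s = COLD_SPEEDUP * RESTORE_S   # derived from the 70x claim, not measured here

print(f"warm speedup: {warm_speedup:.1f}x")
print(f"DMA throughput: {dma_gb_per_s:.1f} GB/s")
print(f"implied cold start: {implied_cold_s / 60:.1f} min")
```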

Footnotes

  1. Qwen3.5-122B-A10B-FP8, MoE (10B active of 122B total), FP8, served with SGLang v0.5.10.
  2. MicroK8s v1.35.0, containerd 2.1.3, runc 1.2.5, NVIDIA GPU Operator v26.3.0 with CDI enabled, driver 580.126.20. 8x B200 (Blackwell, 192GB each).
  3. echo 3 > /proc/sys/vm/drop_caches, clear all the kernel caches, then load from scratch. I thought this number was off, but it reproduces. It's a genuinely cold load, i.e. no page caching of the downloaded sglang image, no page caching of the downloaded model weights. You'd likely never see a start this cold except post-reboot.
  4. You might have noticed the weight reload dropped from 31s to 19s. More on this later.
  5. If we had a spare GPU already running the same model, we could load P2P over NVLink at 1.8TB/s. But we're scaling from zero here.
  6. torch_memory_saver interposes SGLang's CUDA allocations via virtual memory APIs. It already supports backing weights to CPU during release_memory_occupation, but not to disk. We add a batched device-to-host transfer that writes a single flat file (doublewordai/torch_memory_saver#2). On reload, we just need to get the allocations back into their proper places. SGLang supports this weight reload via update_weights_from_disk, but (1) it requires reloading from safetensors, which can be slow, and (2) the reload path for lots of models drifts from the original load path (it's actually broken for this model).
  7. The daemon can also partially stage: it hands out a pre-populated ring buffer, which it refills from disk as the consumer drains it.
  8. The Linux kernel can back a buffer by regular 4 KiB pages, or by 2 MiB (or 1 GiB) hugepages. Fewer pages per GB means lower registration overhead. The daemon owns a pool of hugepages, allocates into them, and hands them out over its socket. For the pipeline: you register a block, then issue an async H2D copy on that block. While that copy is executing, you register the next block. I didn't think this would work (the advice I've seen is that cudaHostRegister overhead kills the transfer benefit) but it really does seem to work with hugepages. And apparently registration and copying don't serialize. I expected them to; otherwise, why don't we just do this all the time instead of using CPU bounce buffers? The consumer side is a short ceremony: open a Unix socket, receive the shared memory fd via SCM_RIGHTS, mmap it in-process, cudaHostRegister each 1 GiB span (pipelined with the H2D DMA), and copy to GPU. Operationally it's a DaemonSet that owns the hugepage reservation, and restored pods just mount the checkpoint directory plus the Unix socket (doublewordai/torch_memory_saver#3).
  9. For example, dynamo's snapshot framework runs a persistent root daemon that takes over more control of the snapshot restore. This gives them lots more control, but has more moving parts.
