Doubleword | Takeoff Inference v0.11 Release

We're excited to announce the release of TitanML's Takeoff Inference v0.11, which includes several new capabilities to improve performance and usability.

Reranking and Classification Endpoints

We've added a new "/classify" endpoint for text classification tasks such as sentiment analysis and natural language inference, and for serving reranking models. It lets you use the full sequence representations from models like T5 and BERT to score document relevance for retrieval.
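
To make the reranking use case concrete, here is a minimal sketch of how relevance scores can be used to reorder retrieved documents. It does not call Takeoff: the `rerank` and `toy_score` names are hypothetical, and `toy_score` is a simple word-overlap stand-in for the score a reranking model would return for a (query, document) pair.

```python
from typing import Callable, List, Tuple


def rerank(query: str, documents: List[str],
           score: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and sort by descending relevance."""
    scored = [(doc, score(query, doc)) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(query: str, document: str) -> float:
    """Stand-in scorer (assumption): fraction of query words found in the document.

    In a real deployment this would be replaced by the relevance score
    returned by a reranking model.
    """
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


documents = [
    "Container images can be large",
    "Inference servers batch requests",
    "CUDA graphs speed up inference",
]
ranked = rerank("cuda inference", documents, toy_score)
for doc, relevance in ranked:
    print(f"{relevance:.2f}  {doc}")
```

Keeping the scorer as a plain callable means the sorting logic stays the same whichever model produces the scores.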

CUDA Graph Caching

CUDA graphs can accelerate inference but consume additional memory. To balance this tradeoff, we've implemented an LRU cache that stores a capped number of CUDA graphs. This improves average throughput while reducing the chance of out-of-memory errors on longer sequences.
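
The general idea of a capped LRU cache can be sketched as follows. This is an illustration of the technique only, not Takeoff's implementation: the `GraphLRUCache` class, the `capture` callback, and the use of a (batch size, sequence length) tuple as the cache key are all assumptions, and a string stands in for a captured CUDA graph.

```python
from collections import OrderedDict
from typing import Callable, Hashable, List


class GraphLRUCache:
    """Holds at most `capacity` graphs, evicting the least recently used one."""

    def __init__(self, capacity: int, capture: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.capture = capture  # hypothetical: builds a graph for a given key
        self._graphs: "OrderedDict[Hashable, object]" = OrderedDict()

    def get(self, key: Hashable) -> object:
        if key in self._graphs:
            self._graphs.move_to_end(key)  # hit: mark as most recently used
            return self._graphs[key]
        graph = self.capture(key)  # miss: capture a new graph
        self._graphs[key] = graph
        if len(self._graphs) > self.capacity:
            self._graphs.popitem(last=False)  # evict the least recently used
        return graph

    def keys(self) -> List[Hashable]:
        """Cached keys, ordered from least to most recently used."""
        return list(self._graphs)


captured = []


def fake_capture(key: Hashable) -> str:
    """Stand-in for graph capture; records which keys were captured."""
    captured.append(key)
    return f"graph for {key}"


cache = GraphLRUCache(capacity=2, capture=fake_capture)
for shape in [(1, 128), (1, 256), (1, 128), (1, 512)]:
    cache.get(shape)

print(cache.keys())  # (1, 256) has been evicted
print(captured)      # (1, 128) was captured only once
```

The cap bounds the memory spent on graphs, while recently used shapes keep the speedup; shapes that fall out of the cache pay the capture cost again if they reappear.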

Smaller Container Image

By refactoring some dependencies, we've significantly reduced the container image size compared to the previous version. The smaller image can be installed on more resource-constrained systems without compromising model support.
