TitanTakeoff x LangChain: Supercharged local inference for LLMs
August 31, 2023


Jamie Dborin

Challenges

With the release of many open-source large language models over the past year, developers are increasingly keen to deploy their own LLMs. However, without specialised knowledge, developers experimenting with LLM deployment on their own hardware may face unexpected technical difficulties. The recent scramble for powerful GPUs has also made it significantly harder to secure enough GPU capacity to deploy the best model at the desired latency and scale.

Developers are therefore faced with an unappealing choice: accept a suboptimal application by compromising on model size and quality, or pay for costly deployments built on manual optimisation and expensive GPUs, all while losing time to one-off technical eccentricities.

Titan Takeoff Inference Server

Falcon-7B-instruct model running on a CPU with Titan Takeoff Server.

That being said, deploying your own models locally doesn’t have to be difficult or frustrating. The Titan Takeoff Inference Server offers a simple solution for the local deployment of open-source large language models (LLMs), even on memory-constrained CPUs. With it, users gain the benefits of on-premises inference — reduced latency, enhanced data security, long-run cost savings, and unparalleled flexibility in model customisation and integration — without additional complexity, along with the ability to deploy larger and more powerful models on memory-constrained hardware.


Titan Takeoff Server offers significant performance benefits for deployment and inferencing of LLMs.

With its lightning-fast inference speeds and support for low-cost, readily available devices, the Titan Takeoff Inference Server is suitable for developers who need to constantly deploy, test, and refine their LLMs. Through the use of state-of-the-art memory compression techniques, it offers a 10x improvement in throughput, a 3-4x improvement in latency, and a 4-20x cost saving through the use of smaller GPUs, in comparison with the base model implementation. In an era where control and efficiency are paramount, the Titan Takeoff Inference Server stands out as an optimal solution for deploying and serving LLMs.

Seamless integration with LangChain

With the recent integration of Titan Takeoff into LangChain, users can run inference against their LLMs with minimal setup and coding overhead. You can view a short demonstration of how to use the LangChain integration with Titan Takeoff:

Demo of the Titan Takeoff X LangChain integration

You can start deploying and running inference on your LLMs with these simple steps:

1. Install the Iris CLI, which will allow you to run the Titan Takeoff Inference Server:

   pip install titan-iris

2. Start the Titan Takeoff Inference Server, specifying the model name on HuggingFace, as well as the device if you’re using a GPU. This pulls the model from the HuggingFace server, allowing you to run inference against the model locally:

   iris takeoff --model tiiuae/falcon-7b-instruct --device cuda

3. The Titan Takeoff Inference Server is now ready. Initialise the LLM object, providing a custom port (if the server is not running on the default port 8000) or other generation parameters such as temperature. There is also an option to specify a streaming flag:

   from langchain.llms import TitanTakeoff

   llm = TitanTakeoff(port=5000, temperature=0.8, streaming=True)
   output = llm("What is the weather in London in August?")
   print(output)
   # Output: The weather in London in August can vary, with some sunny days and occasional rain showers. The average temperature is around 20-25°C (68-77°F).

You have now made your first inference call to an LLM with the Titan Takeoff Inference Server running right on your local machine!

For more examples demonstrating the use of the Titan Takeoff x LangChain integration, view our guide here.

Conclusion

The integration of the Titan Takeoff Inference Server with LangChain marks a transformative phase in the development and deployment of language model-powered applications. As developers and enterprises seek faster, more efficient, and cost-effective ways to leverage the capabilities of LLMs, solutions like this pave the way for a smarter, seamless, and supercharged future.

About TitanML

TitanML is an NLP development platform and service focusing on the deployability of LLMs. Our Titan Takeoff Inference Server is a hyper-optimised LLM inference server that ‘just works’, making it the fastest and simplest way to experiment with and deploy LLMs locally.

Our documentation and Discord community are here to support you.
