March 1, 2024

Why Long Context Length is Not the Death of RAG

Meryem Arik

Google recently announced that its new Gemini model can handle over 1 million tokens of context, a huge leap in AI capabilities. Many have proclaimed that this advancement spells the end of Retrieval-Augmented Generation (RAG) systems. However, we don't believe long context length represents the demise of RAG, for several key reasons:

Cost and Speed

Long contexts are computationally expensive to run. The more context provided, the slower and more resource-intensive model inference becomes, since attention and memory costs grow with the number of tokens processed. RAG systems reduce the number of tokens that need processing by retrieving only the most relevant passages up front, enabling faster and cheaper results.
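
As a rough illustration of this trade-off, here is a toy sketch (our own, not from any particular stack: the word-overlap scorer stands in for a real embedding model, and the corpus and query are invented) showing how top-k retrieval shrinks the prompt compared with stuffing in the whole corpus:

```python
# A toy sketch of why retrieval cuts inference cost: instead of stuffing the
# entire corpus into the prompt, retrieve only the top-k most relevant chunks.
import string

def words(text: str) -> set[str]:
    """Lowercase, strip punctuation, split into a set of words."""
    return set(text.lower().translate(str.maketrans("", "", string.punctuation)).split())

def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    q = words(query)
    return len(q & words(chunk)) / max(len(q), 1)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring chunks for the query."""
    return sorted(corpus, key=lambda c: score(query, c), reverse=True)[:k]

corpus = [
    "The 2023 annual report shows revenue grew 14% year over year.",
    "Office recycling guidelines: paper and cardboard go in the blue bin.",
    "Gross margin improved to 62% in the fourth quarter of 2023.",
    "The cafeteria is closed on public holidays.",
]
query = "What was revenue growth in 2023?"
prompt_chunks = retrieve(query, corpus)

# Approximate token counts by word counts, just for illustration.
print("Full-context prompt:", sum(len(c.split()) for c in corpus), "words")
print("RAG prompt (top-2): ", sum(len(c.split()) for c in prompt_chunks), "words")
```

In a real pipeline the scorer would be a dense embedding model and the corpus would hold thousands of documents, so the gap between the two prompt sizes is far larger than in this toy example.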

Unproven Performance

While impressive in scale, it remains to be seen how accurate Gemini's recall is over such vast contexts. RAG systems optimize the entire pipeline - search, embeddings and ranking - to feed the prompt only relevant content. Gemini's recall over 1 million tokens requires further evaluation.
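
One common way to probe this, sketched below, is a "needle in a haystack" test: bury a known fact at varying depths in a long filler document and check whether the model can still retrieve it. The `generate` callable is a hypothetical stand-in for whatever model API is under test, and the needle and filler text are invented.

```python
# A hedged sketch of a "needle in a haystack" recall test for long contexts.
NEEDLE = "The secret launch code is 7-42-19."
FILLER = "This sentence is routine filler text with no useful information. "

def build_haystack(total_sentences: int, needle_depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * total_sentences
    sentences.insert(int(needle_depth * total_sentences), NEEDLE + " ")
    return "".join(sentences)

def recall_test(generate, depths=(0.0, 0.25, 0.5, 0.75, 1.0), size=5000):
    """Check, at each depth, whether the model surfaces the buried fact."""
    results = {}
    for depth in depths:
        prompt = build_haystack(size, depth) + "\nWhat is the secret launch code?"
        answer = generate(prompt)  # hypothetical model call
        results[depth] = "7-42-19" in answer
    return results

# Usage (assuming some model client): results = recall_test(my_model_generate)
```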

Loss of Auditability

A major advantage of RAG systems is the audit trail they provide, showing exactly which content was deemed relevant and passed to the model as input. This grants some explainability into the otherwise "black box" workings of AI. With ultra-long contexts like Gemini's, that auditability is lost in the sheer volume of input, hampering its usefulness for many enterprise use cases.
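
At its simplest, that audit trail is just a log of the exact chunks shown to the model for each answer. The sketch below is illustrative: `retrieve` and `generate` are placeholders for a pipeline's actual retrieval and model calls, and the record fields are our own choice.

```python
# A minimal sketch of a RAG audit trail: for every answer, record exactly
# which chunks were retrieved and passed to the model.
import json
import time

def answer_with_audit(query, retrieve, generate, log_path="rag_audit.jsonl"):
    chunks = retrieve(query)           # e.g. top-k retrieval over the corpus
    answer = generate(query, chunks)   # hypothetical model call
    record = {
        "timestamp": time.time(),
        "query": query,
        "retrieved_chunks": chunks,    # the exact evidence the model saw
        "answer": answer,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return answer
```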

In summary, while long context length is an exciting advancement that shows AI's potential, it alone is unlikely to make RAG obsolete quite yet. The strengths around cost, performance optimization and auditability mean RAG still has significant value in operational environments. We look forward to seeing how these capabilities evolve together over time.