The race for innovation and supremacy in AI shows no signs of slowing down. Just months after the debut of Gemini, Google revealed Gemini 1.5, a large language model (LLM) capable of handling contexts of up to 10 million tokens. Simultaneously, OpenAI has taken the stage with Sora, a robust text-to-video model celebrated for its captivating visual effects. The face-off between these two cutting-edge technologies has sparked discussions about the future of AI, especially the role and potential demise of Retrieval Augmented Generation (RAG).
Will Long-context LLMs Kill RAG?
The RAG framework, incorporating a vector database, an LLM, and prompt-as-code, is a cutting-edge technology that seamlessly integrates external sources to enrich an LLM's knowledge base for precise and relevant answers. It is a proven solution that effectively addresses fundamental LLM challenges such as hallucinations and a lack of domain-specific knowledge.
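To make the three pieces concrete, here is a minimal sketch of the retrieve-then-prompt flow. Everything in it is an illustrative assumption rather than a real RAG stack: `embed` is a toy bag-of-words stand-in for an embedding model, the in-memory list stands in for a vector database, and `build_prompt` is a hypothetical prompt template. The resulting prompt would then be sent to an LLM of your choice.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (stand-in for a vector DB search)."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Hypothetical prompt template that grounds the LLM in the retrieved context."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


docs = [
    "Milvus is an open-source vector database.",
    "Sora is a text-to-video model from OpenAI.",
    "Gemini 1.5 supports very long context windows.",
]
query = "What is Milvus?"
top = retrieve(query, docs, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The point of the sketch is the division of labor: retrieval narrows the knowledge base down to a handful of relevant passages, and only those passages reach the model.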
Witnessing Gemini's impressive performance in handling long contexts, some voices quickly predict RAG's demise. For example, in a review of Gemini 1.5 Pro on Twitter, Dr. Yao Fu boldly stated, "The 10M context kills RAG."
Is this assertion true? From my perspective, the answer is "NO." The development of the RAG technology has just begun and will continue to evolve. While Gemini excels in managing extended contexts, it grapples with persistent challenges encapsulated as the 4Vs: Velocity, Value, Volume, and Variety.
LLMs' 4Vs Challenges
Velocity: Gemini faces hurdles in achieving sub-second response times for extensive contexts, evidenced by a 30-second delay in responding to a 360,000-token context. Despite optimism about LLMs' computational advancements, sub-second responses over long contexts remain challenging for large transformer-based models.
Value: The value proposition of LLMs is undermined by the considerable inference costs associated with generating high-quality answers in long contexts. For example, processing a 1-million-token context at a rate of $0.0015 per 1,000 tokens would cost $1.50 for a single request. Costs at this level are impractical for everyday use and pose a significant barrier to widespread adoption.
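The arithmetic behind that figure is easy to check. The sketch below uses the price and context size quoted above; the 4,000-token RAG prompt is a hypothetical size chosen only for comparison, not a measured number.

```python
def request_cost(context_tokens: int, price_per_1k_tokens: float) -> float:
    """Input cost of one request, given a per-1,000-token price."""
    return context_tokens / 1000 * price_per_1k_tokens


PRICE_PER_1K = 0.0015  # USD per 1,000 tokens, the rate quoted in the text

long_context_cost = request_cost(1_000_000, PRICE_PER_1K)
rag_cost = request_cost(4_000, PRICE_PER_1K)  # hypothetical retrieved-context size

print(f"long context: ${long_context_cost:.4f} per request")
print(f"RAG prompt:   ${rag_cost:.4f} per request")
```

Under these assumptions the gap per request is roughly 250x, and it compounds with every query a production application serves.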
Volume: Despite its capability to handle a large context window of up to ten million tokens, Gemini's volume capacity is dwarfed when compared to the vastness of unstructured data. For instance, no LLM, including Gemini, can adequately accommodate the colossal scale of data found within the Google search index. Furthermore, private corporate data will have to stay within the confines of their owners, who may choose to use RAG, train their own models, or use a private LLM.
Variety: Real-world use cases involve not only unstructured data like lengthy texts, images, and videos but also a diverse range of structured data, such as time-series data, graph data, and code changes, that may not be easily captured by an LLM for training purposes. Streamlined data structures and retrieval algorithms are essential to process such varied data efficiently.
All these challenges highlight the importance of a balanced approach in developing AI applications, making RAG increasingly crucial in the evolving landscape of artificial intelligence.
Strategies for Optimizing RAG Effectiveness
While RAG has proven beneficial in reducing LLM hallucinations, it does have limitations. In this section, we'll explore strategies for optimizing RAG effectiveness: striking a balance between accuracy and performance so that RAG systems become more adaptable across a broader range of applications.
Enhancing Long Context Understanding
Conventional RAG techniques often rely on chunking for vectorizing unstructured data, primarily due to the size limitations of embedding models and their context windows. However, this chunking approach presents two notable drawbacks.
Firstly, it breaks down the input sequence into isolated chunks, disrupting the continuity of context and negatively impacting embedding quality.
Secondly, there's a risk of separating consecutive information into distinct chunks, potentially resulting in incomplete retrieval of essential information.
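A small sketch makes the second drawback visible. `chunk_words` is a hypothetical helper that splits on whitespace into fixed-size windows; real pipelines usually chunk by tokens or sentences. The `overlap` parameter shows a common mitigation, included here only for contrast, since overlap reduces boundary splits without restoring full context.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop early enough that the final window is not just a repeat of the overlap.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


text = "The invoice is due on March 3. Late payments incur a 5 percent fee."

plain = chunk_words(text, size=6)
overlapped = chunk_words(text, size=6, overlap=2)

for label, chunks in (("no overlap", plain), ("overlap=2", overlapped)):
    print(label)
    for chunk in chunks:
        print("  ", repr(chunk))
```

Without overlap, the due date "March 3." is cut in half across two chunks, so neither chunk's embedding carries the complete fact. With overlap, one chunk contains it whole, at the cost of storing and searching redundant text.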
In response to these challenges, emerging embedding strategies based on LLMs have gained traction as efficient solutions. They boast better embedding capability and support expanded context windows. For instance, SFR-Embedding-Mistral and GritLM-7B, two of the best-performing embedding models on the Hugging Face MTEB leaderboard, support 32k-token-long contexts, showcasing a substantial improvement in embedding capabilities. This enhancement in embedding unstructured data also elevates RAG's understanding of long contexts.
Another effective approach to tackle the challenges above is the recently released BGE Landmark Embedding strategy. This approach adopts a chunking-free architecture, where embeddings for the fine-grained input units, e.g., sentences, can be generated based on a coherent long context. It also leverages a position-aware function to facilitate the complete retrieval of helpful information comprising multiple consecutive sentences within the long context. Therefore, landmark embedding is beneficial to enhancing the ability of RAG systems to comprehend and process long contexts.
Utilizing Hybrid Search for Improved Search Quality
The quality of RAG responses hinges on the system's ability to retrieve high-quality information. Data cleaning, structured information extraction, and hybrid search are all effective ways to enhance retrieval quality. Recent research suggests sparse vector models like SPLADE outperform dense vector models in out-of-domain knowledge retrieval, keyword perception, and many other areas.
The recently open-sourced BGE-M3 embedding model can generate sparse, dense, and ColBERT-like token vectors within the same model. This innovation significantly improves retrieval quality by conducting hybrid retrievals across different types of vectors. Notably, this approach aligns with the widely accepted hybrid search concept among vector database vendors like Zilliz. For example, the upcoming release of Milvus 2.4 promises a more comprehensive hybrid search of dense and sparse vectors.
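One way to picture hybrid retrieval is as a merge of two ranked lists, one from a dense search and one from a sparse search. The sketch below uses reciprocal rank fusion (RRF), a widely used fusion method. It is an assumption chosen for illustration: the text above does not say how BGE-M3 or Milvus combine results, and the document IDs and the constant `k = 60` are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense-vector results
sparse_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical sparse-vector results

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that dense and sparse similarity scores live on different scales.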
Utilizing Advanced Technologies to Enhance RAG's Performance
Maximizing RAG capabilities involves addressing numerous algorithmic challenges and leveraging sophisticated engineering capabilities and technologies. As highlighted by Wenqi Glantz in her blog, developing a RAG pipeline presents at least 12 complex engineering challenges. Addressing these challenges requires a deep understanding of ML algorithms and the use of advanced techniques like query rewriting, intent recognition, and entity detection.
Even advanced models like Gemini face substantial hurdles. They require 32 calls to achieve a 90.0% accuracy rate on the MMLU benchmark, as reported by Google. This underscores how demanding it is to maximize performance, in RAG systems as much as in the models themselves.
Vector databases, one of the cutting-edge AI technologies, are a core component in the RAG pipeline. Opting for a more mature and advanced vector database, such as Milvus, extends the capabilities of your RAG pipeline from answer generation to tasks like classification, structured data extraction, and handling intricate PDF documents. Such multifaceted enhancements contribute to the adaptability of RAG systems across a broader spectrum of application use cases.
Conclusion: RAG Remains a Linchpin for the Sustained Success of AI Applications
LLMs are reshaping the world, but they cannot change our world's fundamental principles. The separation of computation, memory, and external storage has existed since the inception of the von Neumann architecture in 1945. Even with single-machine memory reaching the terabyte level today, SATA and flash disks still play crucial roles in different application use cases. This demonstrates the resilience of established paradigms in the face of technological evolution.
The RAG framework is still a linchpin for the sustained success of AI applications. Its provision of long-term memory for LLMs proves indispensable for developers seeking an optimal balance between query quality and cost-effectiveness. For large enterprises deploying generative AI, RAG is a critical tool for cost control without compromising response quality.
Just as ever-larger memory has not displaced hard drives, long context windows will not displace RAG: its role, coupled with its supporting technologies, remains integral and adaptive. It is poised to endure and coexist within the ever-evolving landscape of AI applications.