Retrieval-Augmented Generation (RAG) improves LLMs by connecting them to external knowledge bases, mitigating hallucination and staleness. The lecture covers RAG architectures, optimization strategies for both retrievers and generators, and scaling challenges. Future directions include improved retrieval, better training methods, and multimodality, with an emphasis on a system-centric approach.

This segment discusses the challenges facing current language models, including hallucination, attribution issues, obsolescence, and the need for data revision and customization, setting the stage for the introduction of retrieval augmentation.

This segment introduces RAG as a solution to these limitations, explaining its core architecture and comparing the purely parametric and retrieval-augmented settings to "closed-book" and "open-book" exams. It highlights the shift from a parametric to a semi-parametric approach (see the minimal pipeline sketch below).

This segment discusses Dragon, a state-of-the-art dense retriever trained with progressive data augmentation, and the prevalent trend of hybrid search, which combines sparse and dense retrieval for improved results. The speaker highlights how combining retrieval techniques gives a best-of-both-worlds approach in RAG applications (a rank-fusion sketch follows below).

This segment introduces Replug, a method for improving retrieval by minimizing the KL divergence between the retrieval distribution and the language model's likelihood of the correct answer. The speaker emphasizes Replug's model-agnostic nature and its effectiveness in optimizing retrieval for various language models, even those accessed only via APIs (see the KL sketch below).

This segment explores improving retrieval by placing a ranker on top of BM25 results. The speaker explains how this approach allows backpropagation into the ranker while keeping the language model fixed, yielding a more optimized and contextualized retrieval system, and discusses its advantages and implications for RAG performance.

This segment critiques the "frozen" RAG approach, in which pre-trained models are composed without further training, highlighting its limitations and setting the stage for more sophisticated retrieval methods.

This segment delves into sparse retrieval methods, specifically TF-IDF and BM25, explaining their mechanisms and limitations and laying a foundation before moving to dense methods (a toy BM25 scorer follows below).

This segment contrasts the limitations of "frozen" RAG architectures with the benefits of jointly optimizing the retriever and the generator. The speaker uses the analogy of two halves of a brain that do not communicate to illustrate the shortcomings of the frozen approach and advocates for an integrated, contextualized system in which both components learn together.

This segment introduces Realm, a pioneering method that updates both the query and document encoders, unlike previous approaches. The speaker explains the computational expense of updating the document encoder, especially with large corpora where the entire index must be re-embedded, and highlights Realm's innovative way of addressing this challenge. The discussion emphasizes the significance of Realm's contribution to retrieval-augmented generation.
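As a reference point for the frozen-RAG discussion above, here is a minimal sketch of a retrieve-then-generate loop: dense retrieval is a dot product against pre-computed document embeddings, and the retrieved passages are simply stuffed into the prompt of an off-the-shelf generator. The embeddings and the final LLM call are placeholders (random vectors and a commented-out call), not any specific model or API.

```python
# Minimal sketch of a "frozen" RAG pipeline: off-the-shelf retriever and
# generator glued together with no joint training. Vectors are random
# stand-ins for real embeddings; the LLM call is left as a placeholder.
import numpy as np

rng = np.random.default_rng(0)
docs = ["Passage about dense retrieval.", "Passage about BM25.", "Passage about RAG."]
doc_vecs = rng.normal(size=(len(docs), 8))   # stand-in for pre-computed doc embeddings
query_vec = rng.normal(size=8)               # stand-in for embed(question)

scores = doc_vecs @ query_vec                # dense retrieval = dot product
top_k = np.argsort(-scores)[:2]              # keep the 2 best passages

context = "\n\n".join(docs[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: What is RAG?"
# answer = some_llm.generate(prompt)         # generator is frozen; no gradients flow back
print(prompt)
```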
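The sparse-retrieval segment describes TF-IDF and BM25. Below is a toy Okapi BM25 scorer over a three-document corpus, using common default values for k1 and b; the corpus, query, and constants are purely illustrative.

```python
# Toy Okapi BM25 scorer: idf-weighted, length-normalized term-frequency scoring.
import math
from collections import Counter

corpus = [doc.lower().split() for doc in
          ["the cat sat on the mat", "dogs and cats", "retrieval augmented generation"]]
N = len(corpus)
avgdl = sum(len(d) for d in corpus) / N
df = Counter(term for doc in corpus for term in set(doc))   # document frequency

def bm25(query: str, doc_id: int, k1: float = 1.2, b: float = 0.75) -> float:
    doc = corpus[doc_id]
    tf = Counter(doc)
    score = 0.0
    for term in query.lower().split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)   # length normalization
        score += idf * tf[term] * (k1 + 1) / denom
    return score

print(bm25("cat mat", 0), bm25("cat mat", 1))   # doc 0 matches, doc 1 does not
```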
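For the hybrid-search discussion around Dragon, one common way to combine a sparse and a dense result list is reciprocal rank fusion. The lecture does not prescribe a particular fusion method, so treat this as one plausible choice, with made-up document ids.

```python
# Reciprocal rank fusion (RRF): merge several rankings by summing 1/(k + rank).
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: each inner list holds doc ids ordered best-first by one retriever.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]     # hypothetical sparse results
dense_ranking = ["d1", "d4", "d3", "d9"]    # hypothetical dense results
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```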
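The Replug segment describes training the retriever so that its distribution over documents matches how much each document helps the language model produce the correct answer. Here is a minimal numeric sketch of that KL objective, with invented scores standing in for real retriever similarities and LM log-likelihoods of the gold answer.

```python
# Replug-style retriever objective (sketch): KL between the retrieval
# distribution and the LM's preference over documents, computed from the
# LM's likelihood of the correct answer given each document.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

retriever_scores = np.array([2.0, 0.5, -1.0])   # similarity(query, doc_i), made up
lm_loglik = np.array([-1.2, -3.0, -4.5])        # log p_LM(answer | query, doc_i), made up

p_retrieval = softmax(retriever_scores)          # retrieval distribution
q_lm = softmax(lm_loglik)                        # LM "preference" over documents

kl = np.sum(p_retrieval * (np.log(p_retrieval) - np.log(q_lm)))
print(kl)   # minimized w.r.t. the retriever only; the LM can stay a black box / API
```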
This segment details training methodologies for RAG systems, including prefix language modeling, T5-style denoising, and title-to-section generation. The speaker emphasizes the importance of aligning the training loss with the language model being used, highlighting practical considerations in building effective RAG systems (the three objectives are sketched at the end of this summary).

This segment discusses the design of generators specifically for RAG systems, focusing on architectural innovations in Mistral 7B such as sliding-window attention and grouped-query attention. The speaker analyzes how effective these features are in a RAG setting and proposes alternatives, such as chunked cross-attention, for tighter integration of retrieved information (a sliding-window mask sketch appears at the end).

This segment presents a comparison between retrieval-augmented models and their closed-book equivalents. The speaker highlights the significant performance improvements achieved by incorporating retrieval, emphasizing its substantial impact across language modeling tasks.

This segment delves into strategies for updating the retriever component in RAG systems. The speaker discusses the trade-offs between updating the document encoder, adding a reranker, and updating only the query side, emphasizing the importance of high-quality embedding models and the impact of dataset characteristics on the optimal update strategy (a query-side-only update is sketched at the end).

This segment compares RAG with long-context language models, highlighting RAG's efficiency advantages in handling long contexts. The speaker explains how long-context models often implicitly adopt retrieval-like mechanisms to cope with the computational cost of attending to very long input sequences, showing that the two approaches are converging.

This segment explores decoupling memorization from the language model by using a large index together with a rich encoder. The speaker discusses the advantages of this approach, including better scaling trade-offs: smaller language models can be used while knowledge lookup relies on dot products, which are significantly cheaper than self-attention. The discussion also touches on the potential obsolescence of dedicated vector databases in favor of more efficient alternatives such as BM25 and existing sparse databases.

This segment delves into hallucination in language models, clarifying the distinction between hallucination and simple errors. The speaker argues that end-to-end training of retrieval-augmented systems is crucial for mitigating hallucination, and that a system in which the language model only reasons and speaks, with knowledge derived solely from retrieval, offers a solution. The discussion also touches on the importance of accurately measuring and defining hallucination in the field.

This segment focuses on extending RAG systems beyond text to incorporate multimodality, specifically images. The speaker highlights the success of their work on Lens, a language model enhanced with vision capabilities that achieves near state-of-the-art results in visual question answering. The discussion emphasizes the growing trend toward multimodality, driven by models like GPT-4V, and its potential for future advances in RAG.
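To make the three training objectives named above concrete, here is a toy construction of (input, target) pairs for prefix language modeling, T5-style denoising, and title-to-section generation. The document, span splits, and sentinel tokens are illustrative, not the lecture's exact data recipe.

```python
# Toy (input, target) pairs for the three RAG training objectives.
doc_title = "Sliding Window Attention"
doc_text = "Sliding window attention restricts each token to a fixed local context."

# 1. Prefix language modeling: predict the continuation from a prefix.
words = doc_text.split()
prefix_lm = (" ".join(words[:6]), " ".join(words[6:]))

# 2. T5-style denoising: mask spans and predict them behind sentinel tokens.
denoising = ("Sliding window attention <extra_id_0> each token to a fixed <extra_id_1> context.",
             "<extra_id_0> restricts <extra_id_1> local")

# 3. Title-to-section generation: generate the section body from its title.
title_to_section = (doc_title, doc_text)

print(prefix_lm, denoising, title_to_section, sep="\n")
```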
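The Mistral 7B discussion mentions sliding-window attention. The sketch below builds the corresponding attention mask, where each token attends only to itself and the previous few positions rather than the full causal prefix; the window size here is arbitrary.

```python
# Sliding-window attention mask: causal AND within a fixed local window.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    i = np.arange(seq_len)[:, None]      # query positions
    j = np.arange(seq_len)[None, :]      # key positions
    return (j <= i) & (j > i - window)   # attend to at most `window` recent tokens

print(sliding_window_mask(6, 3).astype(int))
```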
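The retriever-update segment contrasts full document-encoder updates with cheaper query-side updates. Below is a sketch of the query-side-only variant in PyTorch: the document index stays frozen, so nothing needs re-embedding, and a contrastive loss updates only the query encoder. The shapes, loss, and optimizer are assumptions for illustration, not the lecture's exact recipe.

```python
# Query-side-only retriever update: frozen document index, trainable query encoder.
import torch

query_encoder = torch.nn.Linear(16, 16)        # trainable query encoder (toy size)
doc_index = torch.randn(1000, 16)              # frozen, pre-computed doc embeddings

opt = torch.optim.Adam(query_encoder.parameters(), lr=1e-3)
query = torch.randn(4, 16)                     # a batch of query features
positive_ids = torch.tensor([3, 7, 42, 512])   # gold document for each query

q = query_encoder(query)                       # (4, 16)
scores = q @ doc_index.T                       # (4, 1000) dot-product retrieval
loss = torch.nn.functional.cross_entropy(scores, positive_ids)   # contrastive objective
loss.backward()                                # gradients hit the query encoder only
opt.step()
```

Because the document embeddings never change, the pre-built index remains valid after every update, which is exactly why query-side updates are so much cheaper than retraining the document encoder.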