This lecture covers advanced LLM reasoning techniques for improved problem-solving. It explores prompting strategies (chain-of-thought, analogical), response selection (self-consistency), iterative refinement, and search methods (Tree of Thoughts), aiming to enhance LLMs' ability to tackle complex real-world challenges.

This segment explains the fundamental architecture of a large language model-based agent, detailing how it uses an LLM as its "brain" to reason, plan actions, interact with the environment, receive feedback, and revise its internal memory for improved future planning. It also outlines common components within the agent framework, such as tools and knowledge retrieval.

This section explains why LLMs benefit from agent frameworks, emphasizing the trial-and-error nature of real-world problem-solving. It highlights the importance of environmental interaction for understanding success and failure modes, leveraging external tools and knowledge, and the role of agent workflows in facilitating complex task accomplishment.

This segment discusses the increasing prevalence of AI agents across various applications and domains. It showcases successful demonstrations in code generation, computer use, personal assistance, and robotics, while also mentioning emerging applications in education and finance, highlighting the transformative impact of AI agents on how AI models are perceived and utilized.

This section focuses on the rapid advancements in reasoning models, citing specific model releases (OpenAI, Gemini) and their performance improvements. It emphasizes the impressive progress in math and coding problem-solving, showcasing the rapid evolution of these models and their capabilities.

This segment highlights remarkable achievements of reasoning models, such as reaching near-gold-medal performance at the International Mathematical Olympiad (IMO) and ranking among top human participants in competitive programming contests. It underscores the rapid progress and previously unimaginable capabilities of these models.

This segment provides a detailed overview of the course structure, focusing on a deeper dive into reasoning techniques (inference, scaling, training, search, planning), software engineering applications (code generation, verification, auto-formalization), and real-world enterprise applications of AI agents, including safety and ethical considerations.

This section outlines the course logistics, highlighting key changes from the previous semester, such as a later lab due date and the introduction of two project tracks: an applications track (similar to the previous semester) and a new research track for students interested in submitting conference or workshop publications.

This segment explores the sensitivity of large language models to prompt design. An ablation study on different instruction variations reveals significant performance changes even with minor wording adjustments, highlighting the lack of clear principles for optimal prompt writing and the ongoing need for prompt engineering best practices.

This segment details a methodology using analogical prompting to improve code generation. The approach has the model first generate higher-level knowledge, including tutorials and example problems, to enhance its ability to solve new coding problems.
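To make this concrete, here is a minimal sketch of an analogical prompt for code generation. The `llm` argument is an assumed user-supplied text-in/text-out completion function (any chat-API wrapper), and the instruction wording is illustrative rather than the exact prompt used in the paper.

```python
from typing import Callable

# Minimal sketch of analogical prompting for code generation.
# `llm` is any text-in/text-out completion function supplied by the caller;
# it is an assumption of this sketch, not part of the method itself.

ANALOGICAL_TEMPLATE = """\
Problem: {problem}

Instructions:
1. Describe the core algorithmic concepts or tutorials relevant to this problem.
2. Recall three relevant and distinct example problems. For each, state the
   problem, explain the solution approach, and give the solution code.
3. Using the tutorial and examples above, solve the original problem and
   output the final code.
"""

def analogical_prompt_solve(problem: str, llm: Callable[[str], str]) -> str:
    """Single pass: the model generates its own tutorial and exemplars, then solves."""
    return llm(ANALOGICAL_TEMPLATE.format(problem=problem))
```

Everything happens in one generation pass; the self-generated tutorial and exemplars simply sit in the context before the final solution is produced.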
Results show that this method outperforms both zero-shot and few-shot prompting techniques on various tasks, even though the generated examples contain some inaccuracies.

This section analyzes the surprising effectiveness of analogical prompting despite the presence of errors in the model-generated examples. Approximately 70% of the generated examples were accurate and relevant, yet the model still benefited significantly even with the remaining noisy examples. This finding suggests that the model's ability to generate reasonable examples is crucial to the approach. A scaling study across different GPT models is also introduced.

This segment presents a scaling study comparing the effectiveness of analogical prompting across various GPT models. Larger models benefit less from analogical prompting compared to smaller ones, but still outperform zero-shot and few-shot methods. The generated chain of thought is suggested to be more tailored to the reasoning style of large language models, aligning with recent trends in reasoning model development.

This segment delves into inference-time techniques for AI reasoning, using OpenAI's models as examples. It presents data showing how increased inference time leads to significantly improved accuracy on challenging reasoning benchmarks, even reaching near-human performance levels at a high computational cost.

This section showcases demonstrations of reasoning models, highlighting the visibility of the model's thought process and its ability to generate long chains of reasoning steps. It uses examples from OpenAI and Google's Gemini models to illustrate how these models approach problem-solving, including multi-modal reasoning with image inputs.

This segment discusses the core idea of triggering long chains of thought in LLMs to improve reasoning performance. It explores different approaches, including few-shot prompting, instruction prompting, and reinforcement learning, highlighting the various methods used to elicit detailed reasoning steps from the models.

This section outlines a three-part approach to reasoning techniques at inference time: basic prompting techniques (increasing the token budget), search and selection from multiple candidates (exploring the solution space), and iterative self-improvement (increasing depth to reach solutions). It emphasizes the combined use of these techniques for optimal performance.

This segment compares standard prompting with chain-of-thought (CoT) prompting, highlighting the limitations of standard prompting and the advantages of CoT in providing rationale and improving performance, especially for larger models. It also discusses scaling curves and the impact of model size on CoT's effectiveness.

This section introduces zero-shot CoT prompting, which eliminates the need for manually annotated examples, and compares its performance to few-shot CoT. It then introduces analogical prompting, a novel approach that allows the model to generate its own relevant examples, combining the benefits of both zero-shot and few-shot approaches while avoiding manual labeling.

The speaker argues against limiting large language models to single solutions, advocating for exploring multiple solution branches to allow recovery from mistakes. Two methods are proposed: generating multiple candidate solutions per problem and allowing multiple potential reasoning steps at each stage, significantly increasing the chance of finding a correct solution.
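A minimal sketch of these two ingredients (the zero-shot chain-of-thought trigger and diverse parallel sampling of candidate solutions) follows. The `sample` function is an assumed placeholder for whatever chat-API wrapper is in use, and the exact trigger phrase is the commonly cited one rather than anything specific to this lecture.

```python
from typing import Callable, List

# Minimal sketch: zero-shot chain-of-thought plus diverse parallel sampling.
# `sample` is assumed to be a user-supplied function that draws one response
# from the model at a given temperature.

COT_SUFFIX = "\n\nLet's think step by step."

def generate_candidates(
    question: str,
    sample: Callable[[str, float], str],
    n: int = 10,
    temperature: float = 0.8,
) -> List[str]:
    """Trigger a long chain of thought and sample several candidate solutions.

    High-temperature sampling keeps the candidates diverse, which the later
    selection step (e.g., self-consistency) relies on.
    """
    prompt = question + COT_SUFFIX
    return [sample(prompt, temperature) for _ in range(n)]
```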
This section introduces the concept of using large language models to automate prompt engineering. A method is described where a large language model proposes prompts, which are then scored based on their performance on a small validation set. This approach aims to reduce manual effort and potentially improve prompt quality beyond human-written prompts.

This segment details a method that uses reinforcement learning to iteratively improve prompts. Two models act as an optimizer and an evaluator, respectively. The optimizer proposes new instructions based on past performance, and the evaluator assesses their accuracy. This approach aims to improve prompts creatively, beyond simple search and mutation.

This section describes the design of a meta-prompt for optimizing prompts. The meta-prompt includes past instruction attempts with their corresponding accuracies, along with an example problem. Results show that this approach achieves performance comparable to human-written few-shot prompting, demonstrating the potential of automated prompt optimization.

This segment discusses the advantages of automated prompt optimization, highlighting its time-saving aspect and its ability to discover surprising and effective prompts. It then transitions to a discussion of chain-of-thought prompting, explaining its effectiveness in adapting to problem difficulty.

This section delves into the mechanics of chain-of-thought prompting, emphasizing its ability to perform variable computation and adapt to different problem complexities. It highlights the incorporation of human-like reasoning strategies such as decomposition and planning, which contribute to its effectiveness.

This segment introduces least-to-most prompting, a technique that decomposes complex problems into simpler subproblems. This approach achieves easy-to-hard generalization, enabling the model to solve more challenging problems by sequentially solving simpler subproblems. Impressive results on the SCAN benchmark are presented.

This section extends the least-to-most prompting idea to real-world applications, such as translating natural language questions to SPARQL queries. A dynamic approach is introduced, where relevant examples are selected for each subproblem, addressing the challenges of longer sentences and larger vocabularies.

This segment presents results demonstrating the superior performance of the dynamic least-to-most prompting approach compared to baselines. It highlights the scalability of the method and transitions to a discussion of task-specific reasoning structures.

This section introduces a method for enabling models to automatically discover task-specific reasoning structures without manual labeling. The model selects appropriate reasoning strategies from a predefined set and composes a reasoning structure to solve the problem. Results show improved performance on various benchmarks.

This concluding segment summarizes the key findings and discusses the evolving best practices for interacting with large language models. It highlights the changing preferences of models regarding prompt styles and the need for adaptive strategies.

This segment outlines three crucial criteria for effective prompting strategies in large language models: encouraging longer chains of thought for complex tasks, scaling with varying task difficulty, and supporting the reasoning strategies needed for the task. These criteria are essential for developing robust and adaptable prompting techniques.
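Before the lecture moves on to iterative self-improvement, here is a minimal sketch of the least-to-most decomposition idea described earlier in this section. The `llm` callable and the plain-instruction prompts are assumptions of the sketch; real implementations drive both stages with few-shot exemplars rather than bare instructions.

```python
from typing import Callable, List

# Minimal sketch of least-to-most prompting, assuming a user-supplied
# `llm` completion function.

def decompose(question: str, llm: Callable[[str], str]) -> List[str]:
    """Stage 1: ask the model to break the problem into simpler subquestions."""
    prompt = (
        f"Problem: {question}\n"
        "List the simpler subquestions that must be answered first, "
        "one per line, in the order they should be solved."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def solve_least_to_most(question: str, llm: Callable[[str], str]) -> str:
    """Stage 2: solve subquestions in order, feeding each answer back as context."""
    context = f"Problem: {question}\n"
    for sub in decompose(question, llm):
        answer = llm(context + f"\nQ: {sub}\nA:")
        context += f"\nQ: {sub}\nA: {answer}\n"
    # Final pass: answer the original question given all solved subproblems.
    return llm(context + f"\nQ: {question}\nA:")
```

The key design point is that each subproblem is solved with all previously solved subproblems in context, which is what enables the easy-to-hard generalization discussed above.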
This segment introduces the concept of iterative self-improvement in LLMs, contrasting it with the suboptimal parallel generation of multiple solutions. It highlights the human-like iterative error-correction process and introduces key papers on reflection and self-refinement, emphasizing the two crucial steps: solution generation and feedback generation using internal and external signals.

This section details the self-reflection and self-refinement process, explaining how models generate feedback based on observations and external evaluations. It discusses the agentic setup used in the original reflection paper, where the model proposes actions, receives environmental feedback, and refines its output based on both internal and external signals. The segment emphasizes the effectiveness of this approach, particularly with high-quality evaluations or reliable external signals.

This segment highlights the challenge of selecting the best response from multiple candidates without an oracle scorer during inference. While user selection is possible, the ideal scenario involves the model internally selecting the best solution, avoiding manual review of numerous options. The speaker introduces the concept of self-consistency as a potential solution.

The speaker introduces the self-consistency method, a simple yet effective approach for selecting the best response from multiple candidates. The method focuses on the consistency of final answers across multiple generated responses, selecting the most frequently appearing answer. This approach demonstrates significant performance improvements across various models and benchmarks, particularly on math problems.

This segment analyzes the scaling behavior of the self-consistency approach, comparing it to probability-based ranking. The results show that self-consistency scales much better with an increasing number of responses, continuing to improve performance even with up to 40 responses, unlike probability-based ranking, which plateaus. The speaker also discusses exceptions to this trend, such as training a model as a verifier.

This section compares self-consistency with other baselines, including beam search and sampling baselines, highlighting its superior performance. The importance of diverse responses for effective self-consistency is emphasized, suggesting methods like high-temperature sampling or nucleus sampling to ensure diversity. The speaker transitions to an analysis of model calibration.

This segment explores the relationship between accuracy and consistency in the self-consistency approach. The analysis reveals that higher consistency (more responses leading to the same answer) indicates greater model certainty and a higher likelihood of accuracy, which explains the effectiveness of consistency as a selection criterion. The applicability of this approach to code generation is also mentioned.
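The selection rule itself is straightforward to sketch. The regex-based answer extractor below is an illustrative assumption (the method presumes tasks where a short final answer can be parsed from each sampled response, as in the math benchmarks discussed above).

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Minimal sketch of self-consistency: sample several reasoning paths,
# extract the final answer from each, and return the most frequent one.

def extract_answer(response: str) -> Optional[str]:
    """Illustrative extractor: expects responses ending with 'The answer is 42.'"""
    match = re.search(r"answer is\s*([-+]?\d+(?:\.\d+)?)", response, re.IGNORECASE)
    return match.group(1) if match else None

def self_consistency(responses: Iterable[str]) -> Optional[str]:
    """Majority vote over the final answers of independently sampled responses."""
    answers = [a for a in (extract_answer(r) for r in responses) if a is not None]
    if not answers:
        return None
    # The size of the winning vote also serves as a rough confidence signal:
    # the more responses agree, the more likely the selected answer is correct.
    return Counter(answers).most_common(1)[0][0]
```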
The speaker discusses the application of consistency-based selection in the AlphaCode system for competitive programming. The context of competitive programming is explained, highlighting the challenge of generating code that passes both given and hidden test cases. The speaker introduces filtering and clustering as key components of AlphaCode's inference stage.

This segment details AlphaCode's clustering approach for code selection. The system generates new test inputs, executes the sampled programs on them, and clusters programs with identical outputs, under the assumption that programs in the same cluster are semantically equivalent. The speaker notes that while clustering improves performance, it doesn't guarantee the best solution.

The speaker introduces universal self-consistency, addressing the limitation of requiring an answer-extraction process in the original self-consistency method. This new approach leverages the large language model itself to perform consistency-based selection, instructing it to select the most consistent response based on majority consensus. The method's performance is evaluated across various applications.

This segment discusses limitations of universal self-consistency, particularly its dependence on the long-context capabilities of the language model. While performance might not scale as well as the original self-consistency with more responses, it remains practical for most tasks within a reasonable number of candidate responses. The speaker proposes training a large language model as a ranker to further improve performance.

The speaker explores the idea of training a large language model as a ranker to improve upon consistency-based selection. This ranker would ideally outperform simple consistency criteria by judging the likelihood of accuracy. The speaker mentions the GSM8K dataset and the "Let's Verify Step by Step" paper, which demonstrate methods for training verifiers to judge the correctness of mathematical solutions. Two approaches for training verifiers are discussed: outcome-supervised and process-supervised reward models.

This segment shifts the focus from solution-level response selection to stepwise scoring. The speaker argues that with a good stepwise scorer, a tree-search approach can be more efficient, prioritizing promising partial solutions and reducing token costs. The speaker introduces the Tree of Thoughts (ToT) prompting method as an example.

The speaker illustrates the Tree of Thoughts (ToT) method with the Game of 24 as an example. The method involves two stages: step generation (proposing next steps) and step evaluation (assessing the promise of current states). The speaker discusses using the large language model for step evaluation and selection, potentially using voting to determine the best path. The effectiveness of ToT with breadth-first search is highlighted.

This segment summarizes the discussed response selection and search methods. The speaker reiterates the effectiveness of consistency-based selection and the potential benefits of stepwise scoring and tree-search methods, particularly in reducing token costs and improving scalability. The speaker concludes by mentioning iterative self-improvement as a method to increase the depth of reasoning.

This segment focuses on the application of self-improvement to code generation, drawing parallels to human debugging processes within IDEs. It introduces the concept of interactive loops in code development, where programmers investigate execution results and revise code based on observations. The segment then transitions to the speaker's paper on self-debugging, setting the stage for a deeper dive into feedback formats.

This segment explores different feedback formats for self-debugging in code generation. It details various feedback types, including simple correctness indicators, unit-test results (including runtime errors), line-by-line code explanations, and execution traces.
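A minimal sketch of such a loop, using raw unit-test execution output as the feedback signal, is shown below. The `llm` callable and the prompt wording are assumptions of this sketch (the paper studies richer feedback such as code explanations and execution traces), and a real implementation would also strip code fences from the model's replies before executing them.

```python
import subprocess
import sys
from typing import Callable, Tuple

# Minimal sketch of a self-debugging loop with unit-test feedback,
# assuming a user-supplied `llm` completion function.

def run_tests(code: str, test_script: str, timeout: int = 10) -> Tuple[bool, str]:
    """Execute the candidate code followed by its unit tests; return (passed, log)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code + "\n" + test_script],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "Execution timed out."
    return proc.returncode == 0, (proc.stdout + proc.stderr)[-2000:]

def self_debug(problem: str, test_script: str,
               llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Generate code, run the tests, and feed failures back for revision."""
    code = llm(f"Write a Python solution for:\n{problem}\nReturn only code.")
    for _ in range(max_rounds):
        passed, log = run_tests(code, test_script)
        if passed:
            break
        # Feed the execution feedback back and ask for a revised program.
        code = llm(
            f"Problem:\n{problem}\n\nCurrent solution:\n{code}\n\n"
            f"It fails the unit tests with this output:\n{log}\n\n"
            "Return only the corrected Python code."
        )
    return code
```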
The segment concludes by highlighting the consistent performance improvements across different LLMs achieved through self-debugging and the impact of more informative feedback.

This segment shifts the focus to self-correction in question-answering tasks, contrasting results with and without an oracle verifier. It presents negative results from a study where LLMs lacked access to an oracle, demonstrating that their inability to reliably judge correctness often led to worse performance after each self-correction iteration.

This segment investigates the impact of feedback prompts on self-correction performance. It explores the use of general-purpose feedback prompts and their limitations, noting that while prompt adjustments can influence the model's tendency to retain initial responses, they don't necessarily lead to performance improvements. The segment then sets the stage for a comparison with multi-response self-correction methods.

This segment compares self-correction using a single initial response with a multi-response approach (multi-agent debate). It introduces the multi-agent debate baseline, where multiple responses are generated and evaluated by the LLM, and contrasts it with self-consistency (selecting the most consistent response from parallel generations). The analysis reveals that self-consistency scales better than multi-agent debate when considering token budget constraints.

This segment addresses the challenge of optimally allocating token budgets between parallel and sequential generation methods. It highlights the task and model dependency of this optimization problem and presents a study showing how the optimal ratio of parallel to sequential generation varies with problem difficulty. The study concludes that a compute-optimal curve, balancing both methods, can outperform purely parallel generation.

This segment discusses the influence of model size on inference time and budget allocation. It argues that using more expensive models is not always optimal, especially when smaller models can achieve comparable performance with less computational cost. The segment introduces a study showing how optimal model selection depends on the available inference budget.

This concluding segment summarizes the key takeaways, emphasizing the importance of adapting techniques to the model's capabilities and the specific task. It introduces Richard Sutton's "The Bitter Lesson," highlighting the importance of methods that scale with increased computation and the need to teach models to discover what they haven't yet discovered, rather than focusing solely on pre-defined content.