This lecture covers advanced LLM reasoning techniques for improved problem-solving. It explores prompting strategies (chain-of-thought, analogical), response selection (self-consistency), iterative refinement, and search methods (tree of thought), aiming to enhance LLMs' ability to tackle complex real-world challenges.

This segment explains the fundamental architecture of a large language model-based agent, detailing how it uses an LLM as its "brain" to reason, plan actions, interact with the environment, receive feedback, and revise its internal memory for improved future planning. It also outlines common components within the agent framework, such as tools and knowledge retrieval.

This section explains why LLMs benefit from agent frameworks, emphasizing the trial-and-error nature of real-world problem-solving. It highlights the importance of environmental interaction for understanding success and failure modes, leveraging external tools and knowledge, and the role of agent workflows in facilitating complex task accomplishment.

This segment discusses the increasing prevalence of AI agents across various applications and domains. It showcases successful demonstrations in code generation, computer use, personal assistance, and robotics, while also mentioning emerging applications in education and finance, highlighting the transformative impact of AI agents on how AI models are perceived and utilized.

This section focuses on the rapid advancements in reasoning models, citing specific model releases (OpenAI, Gemini) and their performance improvements. It emphasizes the impressive progress in math and coding problem-solving, showcasing the rapid evolution of these models and their capabilities.

This segment highlights remarkable achievements of reasoning models, such as achieving near-gold medal performance in the International Mathematical Olympiad (IMO) and ranking among top human participants in competitive programming contests. It underscores the rapid progress and previously unimaginable capabilities of these models.

This segment provides a detailed overview of the course structure, focusing on a deeper dive into reasoning techniques (inference, scaling, training, search, planning), software engineering applications (code generation, verification, auto-formalization), and real-world enterprise applications of AI agents, including safety and ethical considerations.

This section outlines the course logistics, highlighting key changes from the previous semester, such as a later lab due date and the introduction of two project tracks: an applications track (similar to the previous semester) and a new research track for students interested in submitting conference or workshop publications.

This segment explores the sensitivity of large language models to prompt design. An ablation study on different instruction variations reveals significant performance changes even with minor wording adjustments. This highlights the lack of clear principles for optimal prompt writing and the ongoing need for prompt engineering best practices.

This segment details a methodology using analogical prompting to improve code generation. The approach involves a model generating higher-level knowledge, including tutorials and example problems, to enhance its ability to solve new coding problems.
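As a rough illustration of this setup (a sketch, not the paper's exact prompt), the instruction asks the model to recall related problems and write a short tutorial before solving the target problem; `call_llm` is a hypothetical stand-in for whatever chat-completion client is actually used:

```python
from typing import Callable

ANALOGICAL_PROMPT = """\
Problem: {problem}

Instructions:
1. Recall three relevant and distinct example problems, and describe how each is solved.
2. Write a short tutorial on the core concepts behind this problem.
3. Using the examples and the tutorial, solve the original problem step by step.
"""

def solve_with_analogical_prompting(problem: str, call_llm: Callable[[str], str]) -> str:
    # The model self-generates its own exemplars and tutorial before answering,
    # so no manually labeled few-shot examples are required.
    return call_llm(ANALOGICAL_PROMPT.format(problem=problem))
```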
Results show that this method outperforms both zero-shot and few-shot prompting techniques on various tasks, even though the generated examples contain some inaccuracies.

This section analyzes the surprising effectiveness of analogical prompting despite the presence of errors in the model-generated examples. Approximately 70% of the generated examples were accurate and relevant, yet the model still benefited significantly from the remaining noisy data. This finding suggests that the model's ability to generate reasonable examples is crucial for its learning process. A scaling study across different GPT models is also introduced.

This segment presents a scaling study comparing the effectiveness of analogical prompting across various GPT models. Larger models benefit less from analogical prompting compared to smaller ones, but still outperform zero-shot and few-shot methods. The generated chain of thought is suggested to be more tailored to the reasoning style of large language models, aligning with recent trends in reasoning model development.

This segment delves into inference-time techniques for AI reasoning, using OpenAI's models as examples. It presents data showing how increased inference time leads to significantly improved accuracy on challenging reasoning benchmarks, even reaching near-human performance levels at a high computational cost.

This section showcases demonstrations of reasoning models, highlighting the visibility of the model's thought process and its ability to generate long chains of reasoning steps. It uses examples from OpenAI and Google's Gemini models to illustrate how these models approach problem-solving, including multi-modal reasoning with image inputs.

This segment discusses the core idea of triggering long chains of thought in LLMs to improve reasoning performance. It explores different approaches, including few-shot prompting, instruction prompting, and reinforcement learning, highlighting the various methods used to elicit detailed reasoning steps from the models.

This section outlines a three-part approach to reasoning techniques at inference time: basic prompting techniques (increasing the token budget), search and selection from multiple candidates (exploring the solution space), and iterative self-improvement (increasing depth to reach solutions). It emphasizes the combined use of these techniques for optimal performance.

This segment compares standard prompting with chain-of-thought (CoT) prompting, highlighting the limitations of standard prompting and the advantages of CoT in providing rationale and improving performance, especially for larger models. It also discusses scaling curves and the impact of model size on CoT's effectiveness.

This section introduces zero-shot CoT prompting, which eliminates the need for manually annotated examples, and compares its performance to few-shot CoT. It then introduces analogical prompting, a novel approach that allows the model to generate its own relevant examples, combining the benefits of both zero-shot and few-shot approaches while avoiding manual labeling.

The speaker argues against limiting large language models to single solutions, advocating for exploring multiple solution branches to allow recovery from mistakes. Two methods are proposed: generating multiple candidate solutions per problem and allowing multiple potential reasoning steps at each stage, significantly increasing the chance of finding a correct solution.
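A minimal sketch of the first of these two methods, sampling several independent candidates at a non-zero temperature so that a single bad generation is not fatal; the `sample_llm` helper is a hypothetical placeholder:

```python
from typing import Callable, List

def generate_candidates(
    problem: str,
    sample_llm: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> completion
    n: int = 8,
    temperature: float = 0.8,
) -> List[str]:
    # Independent samples explore different solution branches, so one flawed
    # generation does not doom the whole attempt.
    prompt = f"Q: {problem}\nA: Let's think step by step."
    return [sample_llm(prompt, temperature) for _ in range(n)]
```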
This section introduces the concept of using large language models to automate prompt engineering. A method is described where a large language model proposes prompts, which are then scored based on their performance on a small validation set. This approach aims to reduce manual effort and potentially improve prompt quality beyond human-written prompts.

This segment details a method that uses reinforcement learning to iteratively improve prompts. Two models act as an optimizer and evaluator, respectively. The optimizer proposes new instructions based on past performance, and the evaluator assesses their accuracy. This approach aims to creatively improve prompts beyond simple search and mutation.

This section describes the design of a meta-prompt for optimizing prompts using reinforcement learning. The meta-prompt includes past instruction attempts with their corresponding accuracies and an example problem. Results show that this approach achieves performance comparable to human-written few-shot prompting, demonstrating the potential for automated prompt optimization.

This segment discusses the advantages of automated prompt optimization, highlighting its time-saving aspect and ability to discover surprising and effective prompts. It then transitions to a discussion of chain-of-thought prompting, explaining its effectiveness in adapting to problem difficulty.

This section delves into the mechanics of chain-of-thought prompting, emphasizing its ability to perform variable computation and adapt to different problem complexities. It highlights the incorporation of human-like reasoning strategies such as decomposition and planning, which contribute to its effectiveness.

This segment introduces least-to-most prompting, a technique that decomposes complex problems into simpler subproblems. This approach achieves easy-to-hard generalization, enabling the model to solve more challenging problems by sequentially solving simpler subproblems. Impressive results on the SCAN benchmark are presented.

This section extends the least-to-most prompting idea to real-world applications, such as translating natural language questions to SPARQL queries. A dynamic approach is introduced, where relevant examples are selected for each subproblem, addressing the challenges of longer sentences and larger vocabularies.

This segment presents results demonstrating the superior performance of the dynamic least-to-most prompting approach compared to baselines. It highlights the scalability of the method and transitions to a discussion on task-specific reasoning structures.

This section introduces a method for enabling models to automatically discover task-specific reasoning structures without manual labeling. The model selects appropriate reasoning strategies from a predefined set and composes a reasoning structure to solve the problem. Results show improved performance on various benchmarks.

This concluding segment summarizes the key findings and discusses the evolving best practices for interacting with large language models. It highlights the changing preferences of models regarding prompt styles and the need for adaptive strategies.

This segment outlines three crucial criteria for effective prompting strategies in large language models: encouraging longer chains of thought for complex tasks, scaling with varying task difficulty, and supporting the reasoning strategies needed for the task. These criteria are essential for developing robust and adaptable prompting techniques.
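Looking back at the automated prompt optimization described at the start of this section, here is a minimal sketch of the optimizer/scorer loop; all helpers are hypothetical and the meta-prompt wording is illustrative rather than the paper's:

```python
from typing import Callable, List, Tuple

def optimize_instruction(
    propose_llm: Callable[[str], str],   # hypothetical optimizer model
    score: Callable[[str], float],       # hypothetical: accuracy of an instruction on a small validation set
    seed_instruction: str = "Let's think step by step.",
    rounds: int = 10,
) -> str:
    # Keep a scored history of instructions and ask the optimizer model to
    # propose a new instruction that it expects to score higher.
    history: List[Tuple[str, float]] = [(seed_instruction, score(seed_instruction))]
    for _ in range(rounds):
        trajectory = "\n".join(f"text: {ins}\nscore: {acc:.2f}" for ins, acc in history)
        meta_prompt = (
            "Below are previous instructions with their validation accuracies.\n"
            f"{trajectory}\n"
            "Write a new instruction that is different from the ones above "
            "and likely to achieve a higher accuracy."
        )
        candidate = propose_llm(meta_prompt)
        history.append((candidate, score(candidate)))
    return max(history, key=lambda pair: pair[1])[0]
```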
This segment introduces the concept of iterative self-improvement in LLMs, contrasting it with the suboptimal parallel generation of multiple solutions. It highlights the human-like iterative error correction process and introduces key papers on reflection and self-refinement, emphasizing the two crucial steps: solution generation and feedback generation using internal and external signals.

This section details the self-reflection and self-refinement process, explaining how models generate feedback based on observations and external evaluations. It discusses the agentic setup used in the original reflection paper, where the model proposes actions, receives environmental feedback, and refines its output based on both internal and external signals. The segment emphasizes the effectiveness of this approach, particularly with high-quality evaluations or reliable external signals.

This segment highlights the challenge of selecting the best response from multiple candidates without an oracle scorer during inference. While user selection is possible, the ideal scenario involves the model internally selecting the best solution, avoiding manual review of numerous options. The speaker introduces the concept of self-consistency as a potential solution.

The speaker introduces the self-consistency method, a simple yet effective approach for selecting the best response from multiple candidates. The method focuses on the consistency of final answers across multiple generated responses, selecting the most frequently appearing answer. This approach demonstrates significant performance improvements across various models and benchmarks, particularly on math problems.

This segment analyzes the scaling effect of the self-consistency approach, comparing it to probability-based ranking. The results show that self-consistency scales much better with increasing numbers of responses, continuing to improve performance even with up to 40 responses, unlike probability-based ranking, which plateaus. The speaker also discusses exceptions to this trend, such as training a model as a verifier.

This section compares self-consistency with other baselines, including beam search and sample baselines, highlighting its superior performance. The importance of diverse responses for effective self-consistency is emphasized, suggesting methods like high-temperature sampling or nucleus sampling to ensure diversity. The speaker transitions to an analysis of model calibration.

This segment explores the relationship between accuracy and consistency in the self-consistency approach. The analysis reveals that higher consistency (more responses leading to the same answer) indicates greater model certainty and a higher likelihood of accuracy. This explains the effectiveness of consistency as a selection criterion. The applicability of this approach to code generation is also mentioned.

The speaker discusses the application of consistency-based selection in the AlphaCode system for competitive programming. The context of competitive programming is explained, highlighting the challenge of generating code that passes both given and hidden test cases. The speaker introduces filtering and clustering as key components of AlphaCode's inference stage.

This segment details AlphaCode's clustering approach for code selection. The system generates new test inputs, executes sample programs, and clusters programs with identical outputs. The assumption is that programs in the same cluster are semantically equivalent.
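A minimal sketch of this clustering idea, assuming a hypothetical `run_program` helper that executes a candidate program on one input and returns its output:

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def cluster_by_behavior(
    programs: List[str],
    test_inputs: List[str],
    run_program: Callable[[str, str], str],  # hypothetical: (source, stdin) -> stdout
) -> List[List[str]]:
    # Programs that produce identical outputs on every generated test input are
    # treated as semantically equivalent; one submission is then picked per
    # large cluster, since cross-sample agreement is the selection signal.
    clusters: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for source in programs:
        signature = tuple(run_program(source, stdin) for stdin in test_inputs)
        clusters[signature].append(source)
    return sorted(clusters.values(), key=len, reverse=True)  # largest clusters first
```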
The speaker notes that while clustering improves performance, it doesn't guarantee the best solution.

The speaker introduces universal self-consistency, addressing the limitation of requiring an answer extraction process in the original self-consistency method. This new approach leverages the large language model itself to perform consistency-based selection, instructing it to select the most consistent response based on majority consensus. The method's performance is evaluated across various applications.

This segment discusses limitations of universal self-consistency, particularly its dependence on the long-context capabilities of the language model. While performance might not scale as well as the original self-consistency with more responses, it remains practical for most tasks within a reasonable number of candidate responses. The speaker proposes training a large language model as a ranker to further improve performance.

The speaker explores the idea of training a large language model as a ranker to improve upon consistency-based selection. This ranker would ideally outperform simple consistency criteria by judging the likelihood of accuracy. The speaker mentions the GSM8K dataset and the "Let's Verify Step by Step" paper, which demonstrate methods for training verifiers to judge the correctness of mathematical solutions. Two approaches for training verifiers are discussed: outcome-supervised and process-supervised reward models.

This segment shifts focus from solution-level response selection to stepwise scoring. The speaker argues that with a good stepwise scorer, a tree search approach can be more efficient, prioritizing promising partial solutions and reducing token costs. The speaker introduces the Tree of Thoughts (ToT) prompting method as an example.

The speaker illustrates the Tree of Thoughts (ToT) method with an example of the game 24. The method involves two stages: step generation (proposing next steps) and step evaluation (assessing the promise of current states). The speaker discusses using the large language model for step evaluation and selection, potentially using voting to determine the best path. The effectiveness of ToT with breadth-first search is highlighted.

This segment summarizes the discussed response selection and search methods. The speaker reiterates the effectiveness of consistency-based selection and the potential benefits of stepwise scoring and tree search methods, particularly in reducing token costs and improving scalability. The speaker concludes by mentioning iterative self-improvement as a method to increase the depth of reasoning.

This segment focuses on the application of self-improvement to code generation, drawing parallels to human debugging processes within IDEs. It introduces the concept of interactive loops in code development, where programmers investigate execution results and revise code based on observations. The segment then transitions to the speaker's paper on self-debugging, setting the stage for a deeper dive into feedback formats.

This segment explores different feedback formats for self-debugging in code generation. It details various feedback types, including simple correctness indicators, unit test results (including runtime errors), code explanations (line-by-line), and execution traces.
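A rough sketch of a self-debugging loop driven by unit-test feedback; the `generate` and `run_tests` helpers are hypothetical placeholders, not the paper's implementation:

```python
from typing import Callable, Tuple

def self_debug(
    problem: str,
    generate: Callable[[str], str],                # hypothetical code-generating model
    run_tests: Callable[[str], Tuple[bool, str]],  # hypothetical: code -> (all_passed, error_log)
    max_turns: int = 3,
) -> str:
    # Generate code, execute the unit tests, and feed the failure message back
    # to the model for another attempt, mirroring a programmer's debug loop.
    code = generate(f"Write a Python function for this problem:\n{problem}")
    for _ in range(max_turns):
        passed, log = run_tests(code)
        if passed:
            break
        feedback_prompt = (
            f"Problem:\n{problem}\n\nYour previous code:\n{code}\n\n"
            f"It failed the unit tests with:\n{log}\n"
            "Explain the bug line by line, then return the corrected code."
        )
        code = generate(feedback_prompt)
    return code
```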
The segment concludes by highlighting the consistent performance improvements across different LLMs achieved through self-debugging and the impact of more informative feedback.

This segment shifts the focus to self-correction in question-answering tasks, contrasting results with and without an oracle verifier. It presents negative results from a study where LLMs lacked access to an oracle, demonstrating that their inability to reliably judge correctness often led to worse performance after each self-correction iteration.

This segment investigates the impact of feedback prompts on self-correction performance. It explores the use of general-purpose feedback prompts and their limitations, noting that while prompt adjustments can influence the model's tendency to retain initial responses, they don't necessarily lead to performance improvements. The segment then sets the stage for a comparison with multi-response self-correction methods.

This segment compares self-correction using a single initial response with a multi-response approach (multi-agent debate). It introduces the multi-agent debate baseline, where multiple responses are generated and evaluated by the LLM, and contrasts it with self-consistency (selecting the most consistent response from parallel generations). The analysis reveals that self-consistency scales better than multi-agent debate when considering token budget constraints.

This segment addresses the challenge of optimally allocating token budgets between parallel and sequential generation methods. It highlights the task and model dependency of this optimization problem and presents a study showing how the optimal ratio of parallel to sequential generation varies with problem difficulty. The study concludes that a compute-optimal curve, balancing both methods, can outperform purely parallel generation.

This segment discusses the influence of model size on inference time and budget allocation. It argues that using more expensive models is not always optimal, especially when smaller models can achieve comparable performance with less computational cost. The segment introduces a study showing how optimal model selection depends on the available inference budget.

This concluding segment summarizes the key takeaways, emphasizing the importance of adapting techniques to the model's capabilities and the specific task. It introduces Richard Sutton's "The Bitter Lesson," highlighting the importance of methods that scale with increased computation and the need to teach models to discover what they haven't yet discovered, rather than focusing solely on pre-defined content.

CS 194/294-280 (Advanced LLM Agents) - Lecture 1, Xinyun Chen

Advanced Large Language Model Agents: A Study Guide

This study guide summarizes key concepts from a lecture on advanced large language model agents.

Course Overview
Course Title: Advanced Large Language Model Agents
Instructors: Xinyun Chen (Google DeepMind), Professor Dawn Song (UC Berkeley), and Kaiyu Yang (Meta)
Topics Covered: Fundamental reasoning techniques. Agent workflows. Various applications (code generation, computer use, personal assistant, robotics, education, finance, etc.). Software engineering aspects. Mathematics (code generation and verification, auto-formalization, theorem proving). Safety and ethics for real-world deployment.
Course Logistics: The lab assignments will be released later in the semester, allowing more time for completion.
Two project tracks are available: an applications track (similar to last semester) and a research track (new, focusing on conference/workshop publications). Guest speakers will be invited throughout the semester.

Reasoning Models and Inference Time Techniques
Recent Progress in Reasoning Models: Significant advancements have been made in reasoning models, particularly in mathematics and coding. Examples include:
AlphaProof and AlphaGeometry (Google DeepMind): Achieved near-gold medal performance in the IMO.
OpenAI's models: Ranked among the top 200 human participants in Codeforces competitions.
OpenAI's o1 and o3 models: These models demonstrate impressive performance on challenging tasks, especially with increased inference time. Higher inference-time budgets lead to significantly higher accuracy. This highlights the trade-off between accuracy and computational cost.
Google's Gemini 2.0: This model also shows significant improvements in reasoning capabilities and makes its "thought process" more visible to users, allowing for better understanding of its reasoning steps. This increased transparency improves user trust and understanding.
Common Theme: Many recent breakthroughs involve triggering the large language model to generate a long chain of thought (CoT) before reaching the final solution.

Inference Time Techniques: Part 1 - Basic Prompting
Standard Prompting: Involves providing question-answer pairs, but lacks the rationale behind the solution. This often leads to poor performance on reasoning benchmarks.
Chain-of-Thought (CoT) Prompting: Includes calculation steps or reasoning steps in the examples, showing the model how to derive the solution. This significantly improves performance, especially for larger models, demonstrating a scaling effect.
Few-Shot CoT: Requires fewer manually annotated examples than standard CoT prompting. This reduces the manual effort required for creating training data.
Zero-Shot CoT: Achieved through instructions like "Let's think step by step," eliminating the need for manually annotated examples. This is more convenient than few-shot CoT, but performance is generally lower.
Analogical Prompting: The model recalls relevant examples from its training data to solve the problem, generating more tailored examples for each problem. This mimics human reasoning, where past experience guides problem-solving. This approach outperforms both few-shot and zero-shot CoT.
Prompt Engineering with LLMs: LLMs can be used to automatically generate and optimize prompts, reducing manual effort and potentially discovering better prompts than humans can. This automates a time-consuming and often difficult task.

Inference Time Techniques: Part 2 - Search and Selection
Multiple Candidate Solutions: Generating multiple solutions allows the model to recover from mistakes in a single generation. This increases the robustness of the model.
Self-Consistency: Selects the most frequent answer among multiple generated responses, regardless of the reasoning process. This simple method significantly improves performance and scales better than probability-based ranking.
Model Calibration: Self-consistency improves performance by increasing model certainty and the likelihood of accurate predictions.
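To make the self-consistency selection above concrete, a minimal sketch that samples several reasoning paths and majority-votes over the extracted final answers (the sampler and the answer-extraction step are simplified placeholders):

```python
from collections import Counter
from typing import Callable, List

def self_consistency_answer(
    question: str,
    sample_llm: Callable[[str], str],  # hypothetical sampler, called with temperature > 0
    n: int = 16,
) -> str:
    # Majority vote over final answers from independently sampled chains of thought.
    prompt = f"Q: {question}\nA: Let's think step by step."
    answers: List[str] = []
    for _ in range(n):
        response = sample_llm(prompt)
        # Naive extraction: assume the final line of the response holds the answer.
        answers.append(response.strip().splitlines()[-1])
    return Counter(answers).most_common(1)[0][0]
```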
Clustering for Code Generation (AlphaCode): Clusters code solutions based on execution results, selecting one program from each of the largest clusters. This improves performance by identifying semantically equivalent code solutions.
Universal Self-Consistency: Extends self-consistency to tasks without easily extractable answers (e.g., free-form text generation). The LLM performs the consistency-based selection.
LLM-Based Ranking: Training a separate LLM as a ranker can improve upon simple consistency-based selection. This allows for a more nuanced evaluation of response quality.
Stepwise Evaluation: Using a stepwise evaluator allows for tree search, prioritizing promising partial solutions and reducing token costs.

Inference Time Techniques: Part 3 - Iterative Self-Improvement
Self-Reflection and Self-Refinement: The model generates feedback based on its observations and refines its output iteratively. This mimics human debugging processes and works well with high-quality external evaluation.
Self-Debugging for Code Generation: Provides different feedback formats (correctness, unit tests, code explanations, execution traces) to improve code debugging. More informative feedback leads to better performance.
Self-Correction in Question Answering: While self-correction with oracle verification shows improvement, self-correction without external feedback often leads to worse performance. This highlights the importance of reliable external feedback.
Multi-Agent Debate: Generating multiple responses and using an LLM to evaluate and update them. While showing some improvement, it doesn't scale as well as self-consistency with the same token budget.
Balancing Parallel and Sequential Generation: The optimal balance between parallel and sequential generation depends on the task, the model, and the quality of self-evaluation. Simpler problems benefit more from self-correction, while harder problems may benefit from a combination of parallel and sequential approaches.
Model Size and Inference Budget: Choosing the right model size is crucial for optimal inference cost. Lighter models might be more efficient for smaller budgets, while larger models might be necessary for more challenging problems with larger budgets.

Key Takeaways
Significant advancements have been made in large language model reasoning capabilities. Chain-of-thought prompting is a powerful technique for improving reasoning performance. Self-consistency and other methods for selecting among multiple candidate solutions are effective and scalable. Iterative self-improvement can enhance model performance, but reliable external feedback is crucial. The optimal strategy for utilizing inference-time resources depends on the task, model, and available computational budget. A principled approach to scaling with increased computation is key.

CS 194/294-280 (Advanced LLM Agents) - Lecture 1, Xinyun Chen

Advanced Large Language Model Agents: Lecture 1 Notes

This lecture provides an overview of advanced large language model (LLM) agents, focusing on inference-time techniques for reasoning. It covers various prompting techniques, search and selection methods, and iterative self-improvement strategies.

Course Introduction
Instructors: Xinyun Chen (Google DeepMind), Professor Dawn Song (UC Berkeley), and Kaiyu Yang (Meta)
Teaching Assistants: Alex (Head TA), Tara, Ashin, Jason
Course Overview: A deep dive into the methodology of LLM agents, focusing on reasoning techniques (inference, scaling, training), and applications in software engineering and mathematics (code generation, verification, formalization).
The course also covers real-world enterprise applications, advanced agent workflows, and safety/ethics considerations.
Previous Semester: The course was previously offered, attracting 15,000 students online and 3,000 participants in a hackathon.
LLM Agent Architecture: The core of an LLM agent is an LLM that performs reasoning and planning to take actions. It interacts with the environment, receives feedback, and revises its internal memory for improved planning. This framework allows the agent to use external feedback and tools to enhance its capabilities. Common components: tool use, retrieval.
Why Agent Frameworks? Real-world tasks involve trial and error. Agents interact with environments to learn from successes and failures, leveraging external tools and knowledge. This facilitates complex task decomposition, subtask allocation, division of labor, collaboration, and multi-agent generation.
2025: The Year of Agents: A significant increase in LLM agent applications across various domains (code generation, personal assistants, robotics, education, finance).

Course Logistics
Lab Assignments: Labs are more involved this semester and have later due dates.
Project Tracks: Two tracks are available: an applications track (similar to last semester) and a new research track for students interested in conference/workshop publications (under the supervision of Dawn Song's students).
Guest Speakers: Guest speakers will be invited to discuss advanced agent systems and applications. Check the course website for updates.
Inference Time Techniques: The lecture focuses on inference-time techniques for LLM reasoning.

Reasoning Models and Inference Time
Recent Progress: Significant advancements in reasoning models since late 2024 (e.g., OpenAI's models, Google's Gemini). Impressive performance gains in math and coding. Examples: AlphaProof/AlphaGeometry (near gold medal in IMO), OpenAI o3 (top 200 on Codeforces).
o1 Results: Showed impressive performance on challenging tasks where previous models struggled (math, competitive programming).
o3 Results (ARC-AGI Benchmark): Demonstrated that increased inference time leads to improved accuracy. With a high inference-time budget ($1000 per task), the model achieved 87.5% accuracy, matching human performance.
o1 Demo (Planning Problem): Showcased the model's thought process, taking over a minute to generate a solution. The interface summarized the key reasoning stages (initial plan, evaluation, revision).
Gemini 2.0 Demo (Multimodal Query): Demonstrated reasoning with image input, highlighting the model's ability to discover insightful steps (e.g., transforming a 9 into a 6 to achieve a sum of 30).

Triggering Long Chains of Thought
Core Idea: Triggering LLMs to generate long chains of thought before reaching a final solution. Approaches:
Few-shot prompting: Demonstrating the thought process in examples.
Instruction prompting: Using instructions like "Let's think step by step".
Chat models: Even without specific instructions, chat models tend to generate some thought process.
Training methods: Instruction tuning (including chain-of-thought data), reinforcement learning.

Inference Time Techniques: Part 1 - Basic Prompting
Standard Prompting: Providing question-answer pairs; lacks rationale. Analogous to teaching only the final solution without context.
Chain-of-Thought (CoT) Prompting: Including calculation steps in examples to demonstrate the solution process. Model performance improves significantly with larger models and CoT.
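For contrast with standard prompting, a minimal few-shot CoT prompt in the style of the original chain-of-thought work; the exemplar spells out the intermediate arithmetic so the model imitates the reasoning, and the exact wording here is illustrative:

```python
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
   5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

# Standard prompting would replace the demonstration's reasoning with just
# "The answer is 11.", giving the model no rationale to imitate.
def build_cot_prompt(question: str) -> str:
    return COT_PROMPT.format(question=question)
```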
Scaling Curves (CoT vs. Standard Prompting): Larger models benefit more from CoT.
Zero-Shot CoT: Achieving CoT generation with instructions like "Let's think step by step", without examples. More convenient but less performant than few-shot CoT.
Analogical Prompting: Instead of providing examples, instruct the model to recall relevant examples and solve the problem based on them. This is motivated by human analogical reasoning (Polya's "How to Solve It"). Outperforms zero-shot and even few-shot CoT. Even with noisy self-generated examples, the model benefits. Larger models benefit more from analogical prompting, sometimes outperforming retrieval-based methods.

Inference Time Techniques: Part 2 - Search and Selection
Increasing Solution Space Exploration: Generating multiple candidate solutions or multiple potential next steps. Challenge: selecting the best response without an oracle scorer.
Self-Consistency: Generating multiple responses and selecting the most frequent answer. Scales better than probability-based ranking. Model calibration plays a role; higher consistency correlates with higher accuracy.
AlphaCode (Code Generation): Uses filtering and clustering based on execution results.
Universal Self-Consistency: Instructing the LLM to perform consistency-based selection directly, eliminating the need for answer extraction. Works well for various tasks, even without answer extraction. Performance is bounded by long-context capabilities.
LLM-based Rankers: Training an LLM to rank responses; can outperform simple consistency criteria. Requires careful training and data design. Process-supervised reward models (verifying steps) scale better than outcome-based models.
Tree Search with Stepwise Scorer: Using a stepwise scorer to prioritize promising partial solutions during the search process, potentially reducing token cost.

Inference Time Techniques: Part 3 - Iterative Self-Improvement
Iterative Improvement: LLMs iteratively improve their responses based on feedback. This aligns better with human error correction.
Reflection and Self-Refinement: Generating feedback and refining the output based on internal and external observations. Works well with high-quality evaluation or reliable external signals.
Self-Debugging (Code Generation): Providing various feedback formats (correctness, unit tests, code explanations, execution traces) to improve debugging performance. More informative feedback leads to better performance.
Self-Correction without Oracle Feedback: LLMs struggle with self-correction without oracle verification; they often make the solution worse. General-purpose feedback prompts can affect self-correction behavior but don't guarantee improvement.
Multi-Agent Debate: Generating multiple responses and having the LLM re-evaluate them. Doesn't always outperform self-consistency when the token budget is controlled.
Balancing Parallel vs. Sequential Generation: The optimal balance depends on the task, the model, and the model's ability for self-reflection and correction. For simpler problems, self-correction is more beneficial. For harder problems, a mix of parallel and sequential generation might be optimal.
Model Size and Inference Budget: Consider the model's cost-effectiveness. Lighter models might be more efficient for smaller budgets.
General Principles for Effective Reasoning: Scale with increased computation; teach the model to discover what it hasn't discovered yet (Richard Sutton's Bitter Lesson).
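As a toy illustration of the parallel-versus-sequential budget split discussed above, the sketch below spends a fixed generation budget on some independent samples and uses the remainder for sequential revisions; both helper functions are hypothetical, and the right split is task- and model-dependent:

```python
from typing import Callable, List

def spend_budget(
    question: str,
    sample_llm: Callable[[str], str],       # hypothetical: independent sample
    revise_llm: Callable[[str, str], str],  # hypothetical: (question, previous answer) -> revision
    total_budget: int = 8,
    parallel_fraction: float = 0.5,
) -> List[str]:
    # Split a fixed generation budget between independent parallel samples and
    # sequential self-revisions of the last sample; the best fraction depends on
    # task difficulty and on how well the model can evaluate its own answers.
    n_parallel = max(1, int(total_budget * parallel_fraction))
    candidates = [sample_llm(question) for _ in range(n_parallel)]
    current = candidates[-1]
    for _ in range(total_budget - n_parallel):
        current = revise_llm(question, current)
        candidates.append(current)
    return candidates  # a final answer can then be picked, e.g., by self-consistency voting
```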
Summary and Key Takeaways

This lecture provided a comprehensive overview of advanced inference-time techniques for improving the reasoning capabilities of large language models within an agent framework. Key takeaways include the importance of chain-of-thought prompting, the effectiveness of consistency-based response selection, and the potential benefits of iterative self-improvement, while acknowledging the limitations of self-correction without reliable external feedback. The optimal strategies for leveraging these techniques depend heavily on the specific task, model capabilities, and available computational resources. The "Bitter Lesson" emphasizes the importance of scaling methods with increased computation and focusing on methods that allow models to discover new knowledge.