Agentic Evals by Shishir Patil

Introduction and Current LLM Interaction: The speaker, Shishir Patil, introduces the topic of agents and agentic evaluation. He clarifies that the presentation draws on academic literature and open-source projects, and that some views are personal opinions. He begins by illustrating how users interact with LLMs today: a user prompts an LLM, receives a response, and then takes action in the digital world based on that response.

The Agentic Approach: The goal is to invert this interaction, placing the agent at the center. The agent receives a prompt, performs actions (including information gathering), observes the response, and reports back to the user. This approach leverages humans' strength at distinguishing and LLMs' strength at generating. An example uses cake recipes: humans can better judge which cake they prefer after tasting, while LLMs excel at generating recipes.

Concrete Example: GPU Allocation: A concrete example is a request for a GPU. The agent needs to reason and plan (identifying which hyperscaler the user is on), check resource quotas, and perform the action (starting a GPU instance and returning an SSH session).

Defining Agents: The speaker defines an agent as having three components: the LLM model, the framework (which orchestrates the LLM and its tools and manages state and fault tolerance), and the tools themselves.

Evaluating LLMs for Agentic Behavior: Function Calling: Evaluating an LLM's agentic capabilities focuses on function calling (also called tool calling or API invocation). Offline evaluation is preferred over online evaluation for scalability. A method borrowed from programming languages, Abstract Syntax Trees (ASTs), is introduced for offline evaluation of function calls: the generated call is compared against a tree of all possible valid calls for a given API. The Berkeley Function Calling Leaderboard is cited as an example of this type of evaluation.

Evaluating Entire Agents: MLGym: The MLGym framework is presented as a complete system for evaluating agents. It consists of an environment and a benchmark (MLGym-Bench). The environment lets agents interact with a computer's shell and file system, and the benchmark includes 13 diverse AI research tasks. This allows evaluation across skills such as generating hypotheses, processing data, training models, and analyzing results. An example result from the MLGym paper shows how agent actions change over steps, from exploratory actions early on to model training and validation later.

Meta's Contributions: Llama Models and Llama Stack: Meta's Llama models (Llama 2 and Llama 3) and the Llama Stack framework are highlighted. The Llama family offers a range of model sizes for different use cases, with recent releases adding multimodality. Llama Stack aims to simplify building with Llama models, supporting different models, tasks, and environments, and includes features like telemetry and robust state management.

Observations on Building Agents: Three key observations are shared. First, optimizing for agentic capabilities is difficult because it involves multiple sub-capabilities; isolating and improving these individually is suggested. Second, determining when an agent is finished is challenging; techniques such as limiting steps or using end-of-sequence tokens are discussed. Third, personalized agents are more effective, allowing flexible use of historical knowledge or available tools.
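To make the three components and the stop-condition observation concrete, here is a minimal agent-loop sketch in Python built around the GPU-allocation example. All names (`call_llm`, `check_quota`, `start_gpu_instance`) are hypothetical placeholders rather than any framework's actual API; the loop simply shows a framework dispatching tool calls, enforcing a step limit, and stopping when the model signals it is done.

```python
import json

# Hypothetical tool implementations; in practice these would wrap a
# hyperscaler's SDK (quota APIs, instance provisioning, etc.).
def check_quota(instance_type: str) -> dict:
    return {"instance_type": instance_type, "available": True}

def start_gpu_instance(instance_type: str) -> dict:
    return {"ssh": "ssh user@203.0.113.7", "instance_type": instance_type}

TOOLS = {"check_quota": check_quota, "start_gpu_instance": start_gpu_instance}

def call_llm(messages: list[dict]) -> dict:
    """Placeholder for a model call that returns either a tool call,
    e.g. {"tool": "check_quota", "args": {...}}, or a final answer,
    e.g. {"final": "Here is your SSH session: ..."}."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):           # step limit guards against non-termination
        decision = call_llm(messages)
        if "final" in decision:           # model signals completion (stop condition)
            return decision["final"]
        tool = TOOLS[decision["tool"]]    # framework dispatches the chosen tool
        observation = tool(**decision["args"])
        messages.append({"role": "tool", "content": json.dumps(observation)})
    return "Stopped: step limit reached without a final answer."
```

In a production framework the loop would also persist state and retry failed tool calls, which is the fault tolerance responsibility mentioned above.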
The Future: Autonomous Agentic Systems: The ultimate goal is autonomous agentic systems in which multiple agents interact with one another, and user interaction is limited to observing downstream tasks and occasional monitoring. This raises open research questions about undo functionality and associative interactions.

Conclusion: The presentation closes with the key points: a definition of agents (LLM, framework, tools), the use of the Berkeley Function Calling Leaderboard and MLGym-Bench for evaluation, and a look at the future of autonomous agents.

Agents and Agentic Evaluation: A Deep Dive

This document provides a structured breakdown of the video transcript on agents and agentic evaluation, organized into chapters and sections with summaries and key terms.

Chapter 1: Introduction and Current LLM Interaction
Summary: This chapter introduces the speaker and sets the context by explaining the current paradigm of human-LLM interaction. It highlights the limitations of this approach and introduces agents as a potential solution.
Current Interaction: Users interact with LLMs (Large Language Models) through APIs, prompting them for responses and then manually acting on the digital world based on those responses. This is a user-centered model.
Agentic Approach: The proposed shift is to place an agent at the center. The agent receives a prompt, performs actions (including information gathering), observes the results, and reports back to the user. This leverages the strengths of humans (distinguishing) and LLMs (generating).
Example: The difference is illustrated with a cake-baking example. Humans are better at deciding which cake they prefer after tasting, while LLMs are better at generating recipes. Agents allow the LLM to "bake the cakes" (perform actions) before the human evaluates the result.

Chapter 2: Agentic Systems: Definition and Components
Summary: This chapter defines agents and breaks down their key components.
Agent Definition: An agent consists of three core components: the LLM model itself; the framework, which orchestrates the LLM and its available tools, manages state, and handles fault tolerance; and the tools the agent uses to interact with the external world.

Chapter 3: Evaluating LLMs for Agentic Behavior: Function Calling
Summary: This chapter focuses on evaluating the LLM's ability to interact with tools, specifically through function calling, and introduces offline evaluation using Abstract Syntax Trees (ASTs).
Function Calling: The mechanism by which the LLM interacts with external tools and services by generating syntactically correct function calls.
Offline Evaluation: A method that scales because it does not require executing every function call; it uses ASTs to verify the correctness of the call the LLM generates.
Abstract Syntax Trees (ASTs): Tree-like representations of code that can be compared against the set of valid function calls a service exposes.
Berkeley Function Calling Leaderboard: A project that uses this approach to evaluate LLMs' function-calling capabilities.
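As a rough illustration of the AST idea (not the Berkeley Function Calling Leaderboard's actual harness), the sketch below parses a model-generated call with Python's `ast` module and checks the function name and keyword arguments against a hand-written schema. The `start_gpu_instance` schema is invented for this example; a real evaluation would also check parameter types and values.

```python
import ast

# Hypothetical API description the model is allowed to call; a real harness
# would derive this from the function documentation given in the prompt.
ALLOWED = {
    "start_gpu_instance": {
        "required": {"instance_type"},
        "optional": {"region", "count"},
    }
}

def check_call(generated: str) -> bool:
    """Parse a model-generated call such as
    'start_gpu_instance(instance_type="a100", region="us-west")' and check it
    against the allowed schema, without ever executing it."""
    try:
        node = ast.parse(generated, mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    schema = ALLOWED.get(node.func.id)
    if schema is None:                       # unknown function name
        return False
    kwargs = {kw.arg for kw in node.keywords}
    if not schema["required"] <= kwargs:     # missing required parameters
        return False
    return kwargs <= schema["required"] | schema["optional"]  # no stray parameters

print(check_call('start_gpu_instance(instance_type="a100")'))  # True
print(check_call('start_gpu_instance(region="us-west")'))      # False
```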
Chapter 4: Evaluating Entire Agents: MLGym
Summary: This chapter introduces MLGym, a framework for evaluating complete agents rather than just the underlying LLM.
MLGym: A unified framework for implementing and experimenting with different training algorithms for large language model agents. It consists of an environment and a benchmark (MLGym-Bench).
MLGym-Bench: The benchmark includes 13 diverse AI research tasks across various domains, requiring agents to demonstrate a range of skills.
Agent Evaluation in MLGym: The framework allows detailed analysis of agent behavior, tracking tool usage and the steps taken to solve each task.
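To give a feel for the kind of environment loop described here, below is a toy gym-style sketch in which an agent policy issues shell commands in a scratch directory and receives their output as observations. This is an illustrative stand-in under assumed names (`ShellEnv`, `run_episode`), not MLGym's actual interface.

```python
import subprocess
import tempfile

class ShellEnv:
    """Toy stand-in for a shell/file-system environment: the agent acts by
    issuing shell commands inside a working directory and observes their
    output. Illustrative only, not MLGym's real API."""

    def __init__(self):
        self.workdir = tempfile.mkdtemp(prefix="agent_ws_")

    def step(self, command: str) -> str:
        result = subprocess.run(
            command, shell=True, cwd=self.workdir,
            capture_output=True, text=True, timeout=60,
        )
        return result.stdout + result.stderr   # observation fed back to the agent

def run_episode(policy, task: str, max_steps: int = 50) -> list[tuple[str, str]]:
    """policy(task, history) -> next shell command, or None when done.
    Returns the trajectory of (action, observation) pairs for later analysis,
    e.g. to see how actions shift from exploration to training and validation."""
    env = ShellEnv()
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        action = policy(task, history)
        if action is None:
            break
        observation = env.step(action)
        history.append((action, observation))
    return history
```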
Chapter 5: Meta's Contributions: Llama Models and Llama Stack
Summary: This chapter details Meta's contributions to the field, focusing on the Llama family of models and the Llama Stack framework.
Llama Models: A series of LLMs with varying sizes and capabilities, offering options for different use cases and computational resources.
Llama Stack: A framework designed to simplify the development and deployment of applications using Llama models, supporting various tasks, environments, and safety considerations.

Chapter 6: Observations and Best Practices in Agent Development
Summary: This chapter shares practical observations and best practices for building and optimizing agents.
Optimizing Agentic Capabilities: This is challenging because it involves multiple sub-capabilities. The speaker suggests isolating and improving these sub-capabilities individually before integrating them.
Determining Agent Completion: Identifying when an agent has finished its task is difficult. Techniques include setting step limits, masking end-of-sequence tokens, or using thresholds.
Personalized Agents: Personalized agents, which adapt to user preferences and resource constraints, are highly effective.

Chapter 7: The Future: Autonomous Agentic Systems
Summary: This chapter offers a glimpse into the future of agents, focusing on the concept of autonomous agentic systems.
Autonomous Agentic Systems: These systems involve multiple agents interacting with each other and the digital world, with human users only observing high-level results and intervening periodically.
Open Research Questions: The transition to autonomous systems raises several open questions regarding undo functionality, associative interactions, and overall system design.

Final Summary: The video explores the evolving landscape of Large Language Models (LLMs), moving beyond simple user-prompted interactions to the more sophisticated concept of agents. It defines agents as comprising an LLM, a framework, and tools, and discusses methods for evaluating both the LLM's function-calling capabilities and the overall agent's performance using frameworks like MLGym. Meta's contributions, including the Llama models and Llama Stack, are highlighted. Finally, the video looks toward the future of autonomous agentic systems, emphasizing the research challenges and opportunities in this rapidly developing field.

Discussion Questions:
How can we effectively measure and improve the "agentic capabilities" of LLMs beyond simply evaluating function-calling accuracy?
What are the ethical implications of increasingly autonomous agentic systems, and how can we ensure responsible development and deployment?
What novel approaches are needed to determine when an agent has completed its task, especially in complex, open-ended scenarios?
Beyond function calling, what other key capabilities are crucial for building truly effective and versatile AI agents?
How can the design of agentic systems be optimized to facilitate personalized user experiences while maintaining efficiency and cost-effectiveness?

The speaker uses the example of baking cakes to illustrate the difference between how humans typically interact with LLMs now and the desired interaction with agentic systems. In the current model, which the speaker likens to being given two cake recipes, it is difficult for a human to simply look at the instructions and predict which cake they would prefer; this mirrors a user receiving a response from a chatbot and having to analyze the text itself. In contrast, if the speaker were given the two baked cakes to taste, they could quickly and easily decide which one they prefer. This represents the agentic interaction model, where the agent performs an action (like baking the cake) and the user then observes and distinguishes the output of that action (tasting the cake). The core idea is that humans are adept at distinguishing or evaluating outcomes, while LLMs are skilled at generating possibilities or performing actions. Agentic systems aim to leverage this by having the agent act, allowing the user to focus on evaluating the results rather than becoming the bottleneck who must verify every step of the process or the generated text itself.

The speaker introduces the topic of agents and agentic evaluation, noting that there are various definitions for agents. The definition they propose is that an agent consists of three core components: the Large Language Model (LLM) itself, the framework that orchestrates the LLM and available tools, and the tools themselves. Based on this definition, agentic evaluation involves assessing the capabilities and performance of these agent systems. This includes evaluating the LLM's ability to perform functions like tool calling or invoking APIs, which is considered a key capability for enabling agentic behavior. Evaluation also extends to assessing the entire agent system's ability to handle complex tasks, such as reasoning, planning, reading the state of the environment, and performing actions. Furthermore, evaluating agents involves understanding tricky aspects such as determining when an agent has completed its task, and optimizing for the multiple sub-capabilities required for an agent to be effective. Frameworks like MLGym are presented as examples of robust systems designed for evaluating the performance of entire agent systems on various tasks.
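Tying back to the cake example, the generate-then-distinguish workflow can be sketched as best-of-N selection: the model generates several candidate outcomes and a human picks among the finished results. Both functions below are hypothetical stand-ins, not part of any real library.

```python
import random

def generate_candidates(prompt: str, n: int = 3) -> list[str]:
    """Placeholder for an LLM producing n candidate outcomes
    (the 'baked cakes' rather than the recipes)."""
    return [f"outcome {i} for: {prompt}" for i in range(n)]

def human_distinguish(candidates: list[str]) -> str:
    """Stand-in for the human evaluator, who is good at picking a winner
    among finished outcomes even when judging plans or recipes is hard."""
    return random.choice(candidates)   # in reality: present the options to the user

best = human_distinguish(generate_candidates("bake a chocolate cake"))
print(best)
```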