The video discusses the evolution and importance of agentic AI and compound systems, emphasizing inference time scaling to enhance AI performance and cost-efficiency. Key topics include:
Compound Systems: The concept of combining multiple AI model calls, either in parallel (like Google's "chain of thought at 32" [0:30] or AlphaCode's "million parallel calls" [1:26]) or sequentially, to achieve better performance or efficiency.
FrugalGPT: An early example of routing, where simpler questions are answered by smaller, cheaper models, leading to significant cost savings (2:08).
Inference Time Scaling: The idea that by adding more compute at inference time (e.g., more calls, longer chains of thought), AI systems can achieve higher intelligence or accuracy. This is compared to traditional model scaling (increasing parameters) (12:25). An example from poker AI shows how search (inference time scaling) dramatically improved performance (13:10).
The "Bitter Lesson": Reinforcing the idea that search and general learning methods that can assimilate compute are crucial for AI progress (15:25).
Ember Framework: A new framework designed to simplify the construction and sharing of these compound AI systems, analogous to PyTorch for neural networks (18:51, 23:47). It allows combining calls across different models and providers (27:37) to improve quality or reduce cost.
Practical Applications: Discussion of how these systems can be used for reliable answers (17:41), optimizing prompts (34:04), understanding model uncertainty (34:47), and deploying AI agents in real-world scenarios (1:02:03).
Defining AI Agents: A critical look at the term "agent" in AI, suggesting it should align with the reinforcement learning definition of an entity interacting with an environment and receiving rewards (36:26).
Multi-Agent Workflows: A demonstration of deploying AI agents for executive reporting, showcasing how agents with defined roles, goals, and tools can automate complex tasks (1:03:06).
Goose: An open-source AI agent built as an MCP client for orchestrating multiple agents to build applications (1:28:39). It can create development teams (e.g., planner, project manager, developers, QA, tech writer) to collaboratively build a product like "AI Brief Me" (1:31:43).
Okta/Auth0's Role: Discussing the security implications and challenges of deploying AI agents that interact with the real world, particularly concerning authentication and authorization for agentic workflows (2:29:10).

YouTube generated summary

The key takeaways from the video are:
Compound AI systems enhance performance: Combining multiple AI model calls (horizontally with replicas or vertically with deeper thought chains) significantly improves output quality and reliability, going beyond what single models can achieve (0:30, 1:26, 3:20).
Inference time scaling is a new frontier: Applying more computational resources at the inference stage (when the model is used) is as crucial as scaling model parameters during training, bending the curve of progress and efficiency (12:25, 14:00).
Cost-effectiveness is achievable: Despite making multiple calls, intelligent orchestration and leveraging cheaper models can result in overall cost savings while maintaining or improving performance (2:08, 43:45).
Frameworks are vital for progress: Tools like Ember and Goose are being developed to make it easier to build, share, and optimize these complex "networks of networks," accelerating research and practical application (18:51, 23:47, 1:28:39).
Multi-agent systems are the future: Orchestrating specialized AI agents with different roles (e.g., planner, developer, QA) allows for collaborative and automated task completion, akin to a human team (1:31:43).
Security and control are paramount: As AI agents interact more with the real world, robust authentication, authorization, and monitoring mechanisms become critical to manage their actions and ensure safety (2:29:10).

YouTube generated key takeaways

Gemini generated summary, highlights, key takeaways, quotes

Summary of the Sessions
The video features several workshops and presentations on various topics related to Agentic AI. The sessions cover:
10x AI Agent Price Performance with Inference Time Scaling: Jared Quincy Davis from Foundry discusses improving AI agent performance and cost-effectiveness by orchestrating multiple models in what he calls "compound systems" and introduces the Ember framework for building these systems.
Multimodality Challenges, Neural Operating System, and Deploying AI Agents: Presentations from Lambda explore the concept of a Neural OS where LLMs act as the operating system, the practical deployment of AI agents for tasks like executive reporting, and research on the future of multimodal learning, particularly the potential of diffusion models.
Vibe Coding with Goose: Building Apps with AI Agents and MCP: Angie Jones from Block leads a workshop on using the open-source Goose agent and the Model Context Protocol (MCP) standard to build an application, demonstrating the power of orchestrating sub-agents.
AI Agents that Interact with the World Around Them: Sam B. S. from Auth0/Okta discusses the critical security and identity challenges when agents interact with APIs, emphasizing the use of OAuth and best practices for managing credentials.

Highlights and Key Takeaways
Inference Time Scaling: Performance can be improved not just with bigger models, but by smartly composing and scaling existing models at inference time.
Compound Systems: Combining multiple AI model calls (ensembles, routers, judge-based aggregation) can lead to performance beyond what a single model can achieve.
Neural OS: A new concept where an LLM acts as the operating system, capable of generating its own UI and interacting with other instances.
Diffusion vs. Auto-regressive Models: Research suggests that as data becomes a bottleneck, diffusion models may prove to be more data-efficient and outperform auto-regressive models in the long run.
Agent Security: As agents interact with external APIs, robust security and identity management, such as OAuth, are essential.
Open Source: Frameworks like Ember and open-source agents like Goose are making it easier for researchers and developers to build complex AI agent systems.

Interesting Quotes
"If I told you that I'm budget insensitive, that $3 per million tokens is too cheap... To what extent could I do that and in the most naive sense actually get better performance?" - Jared Quincy Davis [03:54]
"In the future when we talk about an 8B system, we might not be talking about the number of parameters, we might be talking about the number of calls in some inference time architecture." - Jared Quincy Davis [02:11:12]
"Everybody who's ever had a hot shower has had a good idea, but it's the people who get out of the shower, dry off, and then go do something that make an impact. And that 'do something' part is where I think the agentic AIs can really shine." - Danny Krauss [56:30]
"We're going to run out of new data points by 2028." - Amir Zadeh [01:13:28]
"Don't take your vibe coded app to production." - Angie Jones [01:32:40]
"Do not pass the access token to the LLM. It should never have access to a credentials that in plain text in the inference space." - Sam B. S. [02:42:56]

Video URL: https://www.youtube.com/watch?v=_w5m3h9jY-w

Agentic AI Summit | Afternoon Workshops
Here are the core concepts and their explanations from the provided content:

Compound AI Systems
These systems combine multiple AI models or calls to achieve better performance or reliability than a single model.
Early examples include making parallel calls to a single model and filtering results (like AlphaCode asking a million replicas the same question).

Inference Time Scaling
This refers to improving AI performance by scaling the computation done during inference (when the model is generating an output), rather than just by increasing model size during training. It can involve vertical scaling (longer "chains of thought" within a single model) or horizontal scaling (using multiple model replicas in parallel).

The "Bitter Lesson" and Compute Assimilation
A core principle in AI is that methods capable of effectively assimilating more computational power (FLOPs) tend to yield better results over time. The success of deep learning and new inference time scaling methods is tied to their ability to ride the wave of increasing compute availability.

Practical Frameworks for Compound Systems (Ember)
Tools like Ember (Inference Time Scaling Architecture Framework) are being developed to make it easier for developers to build, manage, and optimize these complex AI systems. They offer functionalities similar to deep learning frameworks (like PyTorch) but for composing calls across different models and providers.

Cost Efficiency in Compound Systems
By strategically combining calls to multiple models, including cheaper, smaller models, it's possible to achieve high performance while still being significantly more cost-effective than relying solely on the most expensive "frontier" models.

The Future of AI System Architecture
The speaker suggests that in the future, AI systems might be characterized more by their "inference time architecture" and "number of calls" than by the number of parameters. This is driven by falling inference costs and the increasing diversity and specialization of available AI models.

The "Agent" Definition Debate
The speaker highlights a common misconception of the term "AI Agent."
According to the original reinforcement learning (RL) definition, an agent is defined by its ability to take actions in an environment that produces rewards. Desirable features like memory or tool use are enhancements for powerful agents but not part of the fundamental definition. Compound systems are related to, but distinct from, agents.

Neural OS Concept
This concept envisions Large Language Models (LLMs) acting as the core "operating system" of a computer. Users provide complex prompts, and the LLM itself generates interfaces (e.g., HTML/CSS) and performs actions, effectively making the prompt the "source code" for dynamic applications.

Deployment of AI Agents
Real-world applications of AI agents involve defining roles, goals, and tasks, and equipping the agents with specific tools (e.g., a CSV reader). These multi-agent workflows can be structured to automate complex business processes, such as generating detailed executive reports from various data sources.

Multimodal Learning Challenges
The field faces a data bottleneck, especially for modalities beyond text (e.g., medical scans, infrared), with projections indicating a global shortage of new unique data points by 2028. This contrasts with the continuous increase in computational power.

Diffusion vs. Autoregressive Models
Research suggests that diffusion models are more data-efficient and will outperform autoregressive models in data-constrained environments, given sufficient compute. Autoregressive models are better if compute is the primary bottleneck. This informs strategic choices for model development based on available resources.

Goose AI Agent for "Vibe Coding"
Goose is an open-source AI agent designed to facilitate rapid prototyping and development ("vibe coding"). It acts as a Model Context Protocol (MCP) client, allowing it to orchestrate multiple sub-agents, connect to various tools, and be agnostic to the underlying LLM.

Orchestrating Sub-Agents with Goose
Goose can dynamically spin up teams of specialized sub-agents (e.g., Planner, Architect, Frontend/Backend Developers, QA, Tech Writer) and coordinate their work, including running tasks in parallel, to expedite project development.

Goose Features and Debugging
Goose offers various modes (autonomous, manual, smart, chat), integrates with external tools via MCP extensions, and includes features for managing LLM context windows. It also provides detailed "tool calls" and outputs to help users debug agent behavior.

Security Challenges for AI Agents
As AI agents interact with digital systems, significant security concerns arise. These include the risk of LLMs gaining access to sensitive credentials (which should never be passed in plain text), agents being social-engineered, and the need for robust authorization for internal APIs.

Securing AI Agents with Identity Providers
Solutions involve abstracting authorization by offloading token management and delegation to dedicated authorization servers (like Auth0/Okta). This ensures credentials are never exposed to the LLM, are securely stored in token vaults, and agents only receive explicitly granted, scoped access tokens.

Human Approval and Fine-Grained Authorization
For sensitive operations (e.g., financial transactions), AI agents may need to request human approval, which requires independent validation between the API and the agent's worker process. Additionally, fine-grained authorization and careful tool design are crucial to limit data disclosure and prevent over-privileged access.

Gistr generated core concepts

Agentic AI Summit | Afternoon Workshops
TL;DR: The session explores cutting-edge advancements in AI agents, covering performance optimization through compound systems, novel AI operating systems, comparative analysis of foundational models, collaborative AI development, and critical security measures for agent interactions.

The Gist:
Topic: AI Agents and Advanced LLM Architectures
Core Concept: This session delves into the evolving landscape of AI agents and large language models (LLMs), focusing on how to enhance their performance, create more sophisticated systems, enable collaborative AI development, and secure their interactions with digital environments.

Key Discussions & Insights:

Inference Time Scaling & Compound Systems (Jared Davis, Foundry)
Core Concept: AI agent performance can be dramatically improved and made more cost-effective by composing multiple LLM calls into "compound systems" rather than relying solely on larger, single models.
How it works:
  Parallel Calls/Ensembles: Asking the same question to multiple model replicas and aggregating responses (e.g., voting, filtering).
  Routing: Directing queries to smaller, cheaper models for simple tasks and larger models for complex ones (e.g., FrugalGPT).
  Judge-Based Aggregation: Using a separate model to verify or approve responses from other models.
Key Learnings:
  This approach bends the curve on intelligence-per-compute, similar to how training efficiency has improved.
  Combining different models (even across providers) can yield superior results and cost savings.
  The future of "X-billion parameter systems" might refer to "X-million calls" in a complex inference architecture.

Neural OS & AI Agent Deployment (Anthony & Danny Krauss, Lambda)
Neural OS (Danny Krauss):
Core Concept: Enabling LLMs to act as the core "computer" or "operating system," generating dynamic interfaces (HTML/CSS) and networking with other LLM instances.
How it works: LLMs interpret prompts to generate interactive UIs and communicate with other Neural OS instances, effectively performing actions beyond just text generation.
Insights: This paradigm allows non-technical users to "make something" and offers a path to improving outputs by simply updating the underlying LLM.
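
Returning to the compound-system patterns listed under the Foundry talk above (parallel calls with voting, and routing between cheap and expensive models), the following is a minimal sketch of how the two can be combined. Everything in it is an assumption made for illustration: the `Model` type, the stub models, and the `min_agreement` threshold are hypothetical, and using agreement among cheap replicas as the escalation signal is a simplification chosen here, not the scoring used by FrugalGPT or by Ember.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple

# Hypothetical stand-in for an LLM call: prompt in, answer out.
Model = Callable[[str], str]


def majority_vote(models: Sequence[Model], prompt: str) -> Tuple[str, float]:
    """Query every model in parallel; return (top answer, agreement ratio)."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = list(pool.map(lambda m: m(prompt), models))
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def cascade(cheap: Sequence[Model], strong: Model, prompt: str,
            min_agreement: float = 0.6) -> Tuple[str, str]:
    """Accept the cheap ensemble's answer if it agrees enough, else escalate."""
    answer, agreement = majority_vote(cheap, prompt)
    if agreement >= min_agreement:
        return answer, "cheap"
    return strong(prompt), "strong"


# Stub models (assumptions): the small ones only "know" one easy question.
def small_a(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unsure-a"


def small_b(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unsure-b"


def small_c(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unsure-c"


def large(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "42"


if __name__ == "__main__":
    cheap_models = [small_a, small_b, small_c]
    print(cascade(cheap_models, large, "What is 2 + 2?"))
    print(cascade(cheap_models, large, "A much harder question"))
```

With these stubs, the easy question is answered by the cheap ensemble and the hard one is escalated, which is the cost-saving behavior the routing item describes. A judge-based variant would replace the agreement check with a call to a separate verifier model.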
AI Agent Deployment (Anthony):
Core Concept: Demonstrating practical deployment of multi-agent AI workflows for business automation (e.g., executive reporting).
Key Steps:
  Define agents with roles, goals, backstories, tasks, and tools (e.g., LlamaIndex CSV reader).
  Structure workflows for agents (e.g., sales ops, marketing ops, customer support agents consolidating reports).
  Deploy agents using tools like CrewAI and managed serverless inference APIs (e.g., Lambda Cloud).
Insights: Shows how to automate complex, multi-stage business processes using specialized AI agents.

Multimodal Challenges: Auto-regressive vs. Diffusion (Amir Zadeh, Lambda)
Problem: Data scarcity is a growing bottleneck for AI progress, especially in multimodal domains, while compute is abundant.
Insight: Research suggests diffusion models will outperform auto-regressive models in data-constrained settings.
Key Findings:
  Diffusion models are more data-efficient, extracting significantly more information from the same dataset over more epochs (higher "half-life" of data).
  Auto-regressive models quickly exhaust the "newness" of data points compared to diffusion models.
Advice: If compute-bottlenecked, choose auto-regressive for faster loss reduction. If data-bottlenecked, choose diffusion to optimize models with more FLOPs.

Vibe Coding with Goose AI Agent (Angie Jones, Block)
What: Goose is an open-source, LLM-agnostic AI agent framework built as an MCP (Model Context Protocol) client, designed for collaborative AI-driven software development ("vibe coding").
How it works:
  Orchestrates a team of specialized sub-agents (planner, project manager, architect, frontend/backend devs, QA, tech writer) to build applications.
  Supports parallel execution of tasks to accelerate development.
  Allows integration of various LLMs and external tools via MCP servers.
  Provides debugging capabilities and context management within chat sessions.
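
The orchestration pattern described above (a planner first, independent roles in parallel, QA last) can be sketched with plain `asyncio`. This is an illustrative assumption, not Goose's implementation or API: `run_subagent` is a hypothetical stub standing in for an LLM call with a role prompt, and the task strings are invented. Only the role names and the "AI Brief Me" product name come from the summary.

```python
import asyncio
from typing import Dict, List


async def run_subagent(role: str, task: str) -> str:
    """Hypothetical sub-agent; a real one would call an LLM with a role prompt."""
    await asyncio.sleep(0)  # yield control, standing in for network latency
    return f"[{role}] done: {task}"


async def build_app(idea: str) -> Dict[str, List[str]]:
    # Sequential stage: the plan must exist before anyone can build from it.
    plan = await run_subagent("planner", f"plan '{idea}'")

    # Parallel stage: independent roles run concurrently.
    # gather() returns results in the same order as its arguments.
    build = await asyncio.gather(
        run_subagent("frontend developer", "build UI"),
        run_subagent("backend developer", "build API"),
        run_subagent("tech writer", "draft README"),
    )

    # Sequential stage: QA starts only after all parallel work has finished.
    qa = await run_subagent("QA", "test the integrated app")
    return {"plan": [plan], "build": list(build), "qa": [qa]}


result = asyncio.run(build_app("AI Brief Me"))

if __name__ == "__main__":
    for stage, outputs in result.items():
        for line in outputs:
            print(stage, line)
```

The point of the sketch is the dependency shape: stages that depend on earlier output are awaited in sequence, while roles with no dependency on each other share one `gather` call.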
Key Learnings:
  AI agents can rapidly generate prototypes and accelerate development cycles.
  It is important to define clear roles and responsibilities for sub-agents.
  Vibe-coded apps require thorough QA and security considerations before production deployment.

Security for AI Agents & Digital Identities (Shak Hika, Auth0/Okta)
Problem: AI agents interacting with digital services (e.g., Jira, banking apps) introduce complex security challenges related to delegated access, sensitive data, and privilege management.
Key Issues:
  Passing raw credentials to LLMs is a severe security risk.
  Custom tool development for every API is inefficient and prone to security flaws.
  Agents may attempt actions requiring human approval (step-up authentication).
Solutions/Techniques:
  Centralized Token Management: Leverage authorization servers (e.g., Auth0/Okta) and token vaults to manage credentials securely, preventing LLMs from direct access.
  OAuth/OIDC: Utilize standard protocols for delegated access, ensuring agents only receive specific, scoped tokens.
  Secure Gateways: Implement proxy MCP servers or internal gateways to control and audit agent access to internal APIs.
  Data Minimization: Limit the amount of sensitive information exposed to agents and use encryption (e.g., JSON Web Encryption) for data in transit.
  Fine-Grained Authorization: Control which data an agent can access and which actions it can perform, even with a valid token.
Impact: Crucial for enabling secure and scalable agent interactions in enterprise environments.

Key Topics Covered: AI Agent Performance Optimization, Compound AI Systems, Inference Time Scaling, LLM Architectures, Multimodal AI, Neural Operating Systems, AI Agent Deployment & Workflows, Open-Source AI Agents (Goose), MCP (Model Context Protocol), Collaborative AI Development ("Vibe Coding"), AI Agent Security, Authentication & Authorization (OAuth, OIDC), Token Management & Vaults

GISTR generated Gist
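
As an illustration of the token-handling advice in the security section above (keep credentials in a vault, grant scoped access, never expose the token to the LLM), here is a minimal sketch. All names are hypothetical: `TokenVault`, `call_tool`, and `fake_issue_api` are invented for this example and are not Auth0 or Okta APIs. A real deployment would obtain and store tokens through an authorization server rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Set, Tuple


@dataclass
class TokenVault:
    """Hypothetical in-memory vault keyed by (user, service)."""
    _tokens: Dict[Tuple[str, str], Tuple[str, FrozenSet[str]]] = field(
        default_factory=dict)

    def store(self, user: str, service: str, token: str,
              scopes: Set[str]) -> None:
        self._tokens[(user, service)] = (token, frozenset(scopes))

    def get(self, user: str, service: str, required_scope: str) -> str:
        token, scopes = self._tokens[(user, service)]
        if required_scope not in scopes:
            raise PermissionError(f"scope '{required_scope}' not granted")
        return token


def call_tool(vault: TokenVault, user: str, service: str, scope: str,
              api: Callable[[str], dict]) -> dict:
    """Run a tool for the agent; the token never leaves this function."""
    token = vault.get(user, service, scope)
    response = api(token)
    # Pass back only the fields the model needs, never the credential.
    return {"status": response["status"], "data": response["data"]}


def fake_issue_api(token: str) -> dict:
    """Stub API (assumption) that echoes the token in a debug field."""
    if token != "secret-token":
        return {"status": 401, "data": [], "debug_auth": token}
    return {"status": 200, "data": ["ISSUE-1"], "debug_auth": token}


vault = TokenVault()
vault.store("alice", "tracker", "secret-token", {"read:issues"})

# Only this sanitized result would be placed in the model's context.
tool_result = call_tool(vault, "alice", "tracker", "read:issues",
                        fake_issue_api)

if __name__ == "__main__":
    print(tool_result)
```

The model only ever sees `tool_result`. The scope check inside `get` is a small stand-in for the fine-grained authorization point: a request for a scope that was never granted fails before any API call is made.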