Agentic Performance Management (APM 2.0) represents an evolution from traditional Application Performance Management, necessary for monitoring the new class of model-based, non-deterministic agentic applications. While traditional metrics like infrastructure utilization, API latency, and throughput remain relevant, APM 2.0 incorporates new types of metrics focused on the behavior and outcomes of agents and their underlying models. Key examples of metrics that should be included in Agentic Performance Management systems are:

- Traditional APM Metrics: Latency and reliability of API calls, infrastructure utilization, and overall throughput. These provide foundational system-health information.
- Model Observability Metrics: Metrics on the performance and behavior of the language models powering the agents, such as hallucination detection, problem detection, and data drift.
- Agent/Workflow-Specific Metrics:
  - Outcome and Alignment: Whether the agent produces the correct outcome for the user, its alignment with business policy or goals, and the detection and understanding of misalignments.
  - Workflow Execution: How agents hand off tasks to one another (ensuring chains work properly), the success of user sessions and journeys, the performance of individual workflow spans, and hierarchical visibility into workflow and sub-agent performance.
  - Internal Agent Behavior: Metrics on the agent's internal workings, such as how many reflection iterations were needed to complete a task.
  - Tool and API Usage: Details of tool execution, including which APIs were called, whether they were necessary, whether their output was processed correctly, and why they were invoked.
  - Agent Interaction and Context: Information flow between agents, whether context was transferred properly, and how workflow planning happened.
  - Control Mechanisms: Whether adjustments, alignments, fallback triggers, and guardrails were correctly invoked.
- Failure Analysis: Understanding the root cause of failures (e.g., model problems vs. system issues, tool-call failures, database-lookup failures) and why agents fail or succeed.
- Auditability: The ability to log decisions, tool calls, plans, and action chains to create an audit trail and reconstruct the execution graph for debugging and understanding.

These metrics move beyond system health to focus on the dynamic behavior, decision-making, and outcomes of agentic workflows, making it possible to reason about performance and ensure reliability and trustworthiness.

Agentic Observability - Making LLM Apps Debuggable, Trustworthy, and Scalable by Krishna Gade

Agentic Observability: Building Trust in AI Agents

Summary: This document outlines the concept of Agentic Observability, a crucial aspect of building reliable and trustworthy AI agents. It discusses the shift from traditional code-based applications to model-based, non-deterministic agentic AI applications and the challenges of monitoring their complex workflows. The document proposes Agentic Performance Management (APM 2.0) as a solution, integrating traditional APM with model observability to provide comprehensive insight into agent behavior and performance.

Chapter 1: The Rise of Agentic AI and its Challenges

1.1 The Changing Nature of Software:
- Traditional software is largely code-based, deterministic, and static. Agentic AI applications are model-based, non-deterministic, and highly data-dependent.
- Many teams are transforming existing software into agent-based applications or layering agentic capabilities onto existing systems.

1.2 What are Agents and Workflows?
- An agent is a system that uses LLMs to make dynamic decisions and direct workflows to solve complex tasks.
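That definition of an agent can be sketched as a minimal decision loop: the model chooses the next action at runtime rather than following a hard-coded flow. Everything here is illustrative; `call_llm` is a hypothetical stub standing in for a real model call, and the tool registry is invented for the example.

```python
# Minimal sketch of an LLM-directed agent: the model picks the next step,
# the agent executes it, and the loop repeats until the model says "done".
# `call_llm` is a hypothetical stand-in for a real language-model call.

def call_llm(prompt: str) -> str:
    # Stub: deterministically route a booking request to a tool, then stop.
    # A real implementation would query a language model.
    if "book" in prompt.lower() and "flight_search" not in prompt:
        return "flight_search"
    return "done"

TOOLS = {
    "flight_search": lambda query: f"3 flights found for: {query}",
}

def run_agent(user_input: str, max_steps: int = 5) -> list:
    history = [user_input]
    for _ in range(max_steps):
        # Dynamic decision: the model, not the code, picks the next step.
        decision = call_llm("\n".join(history))
        if decision == "done":
            break
        history.append(decision)                      # record which tool ran
        history.append(TOOLS[decision](user_input))   # record its output
    return history
```

The `max_steps` cap is the kind of control mechanism the metrics above monitor: without it, a non-deterministic loop has no guaranteed termination.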
- A workflow is a directed acyclic graph of execution steps, sequencing agentic tasks. Examples include prompt chaining, routing, parallelization, and evaluator-optimizer workflows.

1.3 The Complexity of Multi-Agent Workflows:
- Agentic AI moves LLM applications from simple prompt-response interactions to complex, goal-directed, multi-step workflows.
- This increased complexity makes it harder to identify failure points (LLM issues, tool failures, database lookups, etc.), and monitoring multi-agent workflows is more complex than monitoring single-agent workflows.
- A travel-booking application is used as an example, highlighting the complexity of monitoring its many sessions, sub-agents, and interactions.

Chapter 2: Agentic Performance Management (APM 2.0)

2.1 The Need for a New Approach:
- Traditional APM (APM 1.0) focuses on deterministic applications, monitoring server failures, latency, and throughput.
- Agentic AI requires a new approach, APM 2.0, which incorporates new metrics and behaviors.

2.2 Integrating APM and Model Observability:
- Agentic observability systems must combine traditional Application Performance Management (infrastructure utilization, response latencies, throughput) with Model Performance Management (hallucination detection, data drift).
- This integrated system provides hierarchical visibility into both workflow performance and individual agent performance.

Chapter 3: Building an Agentic Observability System

3.1 Key Concepts:
- Internalized observability: Observability needs to be integrated within the agentic workflow.
- Reflection: Agents should reason about their own actions and outcomes.
- Tool-usage monitoring: Tracking tool calls, APIs invoked, and planning processes.
- Agent collaboration: Monitoring interactions between multiple agents.

3.2 Anatomy of an Observed Agent:
- An observed agent tracks user input, context, plans, subtasks, tool calls, workflow reflection, and alignment with guardrails. This creates a feedback loop for course correction.
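The anatomy described above can be sketched as a trace record plus a step function that writes into it: every subtask, tool call, and guardrail check is logged, and a failed check feeds back so the caller can course-correct. All names here (`AgentTrace`, `guardrail_ok`, `observed_step`) are hypothetical, a minimal sketch rather than any vendor's actual API.

```python
# Sketch of an "observed agent": every step of the loop is recorded in a
# trace, and a guardrail check feeds back into the next iteration.

from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    user_input: str
    context: dict = field(default_factory=dict)
    plan: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    reflections: list = field(default_factory=list)
    guardrail_violations: list = field(default_factory=list)

def guardrail_ok(output: str) -> bool:
    # Toy policy guardrail: forbid unapproved refunds. A real guardrail
    # would check business policy, safety, and alignment far more broadly.
    return "refund" not in output.lower()

def observed_step(trace: AgentTrace, subtask: str, tool_output: str) -> bool:
    """Execute one subtask, record it, and course-correct on guardrail failure."""
    trace.plan.append(subtask)
    trace.tool_calls.append({"subtask": subtask, "output": tool_output})
    if not guardrail_ok(tool_output):
        trace.guardrail_violations.append(subtask)
        trace.reflections.append(f"retry {subtask}: guardrail violated")
        return False  # feedback loop: caller should re-plan this subtask
    return True
```

Because everything lands in one record, counts such as reflection iterations and guardrail invocations (the new metrics of section 3.3) fall out of the trace for free.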
3.3 New Metrics for Agentic Observability:
- Metrics include reflection iterations, tool-execution metrics, workflow-planning details, agent communication, guardrail invocations, and fallback triggers.

3.4 Fiddler AI's Approach:
- Treat the workflow as a dynamic system, instrumenting all actions (decisions, tool calls, plans) with trace IDs.
- Visualize and reconstruct the execution graph for root-cause analysis.
- Use models to generate metrics (model-based, agent-level, and system-level).

Chapter 4: Production Readiness and Hot Takes

4.1 Production Readiness:
- Production-ready agentic applications must be observable, allowing teams to reason about performance and policy alignment and providing an audit trail.
- Insights should be shared with various stakeholders (AI engineers, application engineers, product managers, business people, compliance, and security).

4.2 Hot Takes:
- Traditional logs and metrics are largely insufficient for agentic observability; runtime semantic tracing is needed to understand the "why."
- Agentic observability is not an externalized system; it is co-observing agents and their cognitive processes, enabling live intervention.
- Agentic failures are behaviors, not necessarily bugs. The focus should be on alignment with policies and business goals, and on course correction.

Final Summary

This document details the emerging field of Agentic Observability, emphasizing its critical role in deploying reliable and trustworthy AI agents. It highlights the shift from traditional software development and monitoring to a new paradigm that requires integrating traditional APM with model observability. The key concepts of internalized observability, reflection, and monitoring of tool usage and agent collaboration are introduced. A new set of metrics is proposed for understanding the complex behavior of agentic workflows, and a practical approach is outlined, exemplified by Fiddler AI's methodology.
The document concludes by emphasizing the importance of Agentic Observability for production-ready AI applications and its unique characteristics compared to traditional observability.
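As a concrete illustration of the trace-based instrumentation described in section 3.4, the sketch below logs each agent action (decision, tool call, plan step) as a span carrying a shared trace ID and a parent span ID, then rebuilds the execution graph from the spans. The span schema and function names are assumptions made for this example, not Fiddler's actual format.

```python
# Each action in a workflow run becomes a span keyed by a shared trace ID;
# parent links let us reconstruct the execution graph for root-cause analysis.

import uuid
from collections import defaultdict

def new_span(trace_id, parent_id, kind, name):
    return {"trace_id": trace_id, "span_id": uuid.uuid4().hex,
            "parent_id": parent_id, "kind": kind, "name": name}

def reconstruct_graph(spans):
    """Rebuild parent -> children edges of one workflow execution."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent_id"]].append(s["name"])
    return dict(children)

# Usage: a root planning decision fans out into two parallel tool calls.
trace_id = uuid.uuid4().hex
root = new_span(trace_id, None, "decision", "plan_trip")
spans = [root,
         new_span(trace_id, root["span_id"], "tool_call", "flight_search"),
         new_span(trace_id, root["span_id"], "tool_call", "hotel_search")]
graph = reconstruct_graph(spans)
```

After a failure, walking this graph from the failing span back to the root shows which decision led to it, which is exactly the audit trail the document argues production-ready agents require.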