How do you monitor AI agents in production? Not like you monitor web apps. A healthy HTTP 200 from an agent that just hallucinated a customer refund is worse than a clean 500 error; the error stops the process, the hallucination damages it. Traditional APM tools catch crashes and latency spikes. They miss the failure mode that matters most for agents: confidently wrong behavior.
Production agent monitoring requires a three-layer model that tracks infrastructure health, execution quality, and business outcomes. Here’s what to watch at each layer, what thresholds to set, and what the dashboard looks like.
Why Traditional Monitoring Fails for Agents
Traditional software is deterministic. The same request produces the same response. When it doesn’t, something is broken, and the error code tells you what. Monitoring deterministic systems is a solved problem: uptime, latency percentiles, error rates, throughput.
Agents are probabilistic. The same customer email processed twice might produce two different responses: both valid, both different. The “error” isn’t a status code. It’s a response that’s grammatically correct, structurally sound, and factually wrong. Or a tool call that succeeds on the API side but passes the wrong parameters. Or an action sequence that completes every step but in the wrong order.
Standard monitoring sees all of these as successes. Pod is healthy. Latency is normal. No errors in the log. Meanwhile, the agent just sent a pricing quote with last quarter’s rates to a customer who specifically asked about the new pricing tier.
This is why agent monitoring needs its own metric set. The traditional metrics still matter: you need to know if pods are running and APIs are responding. But they’re the foundation, not the full picture.
Layer 1: Infrastructure Metrics
The base layer. These metrics tell you whether the system is running. They don’t tell you whether it’s running well.
Pod health and resource usage. CPU, memory, restart count. Agents are resource-intensive: an agent processing a complex multi-step workflow can spike memory as it accumulates context. Set memory alerts at 80% of limit, not the default 90%. Agent processes that hit memory limits don’t always crash cleanly; they sometimes truncate context silently, which causes the quality degradation that Layer 2 catches.
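The 80% memory check can be sketched as a simple guard; this is an illustrative snippet, assuming usage and limit values (in bytes) already arrive from your metrics pipeline, and the constant name is our own:

```python
# Illustrative memory alert check. Assumes usage/limit bytes come from
# an existing metrics pipeline (e.g. kubelet stats); names are ours.
MEMORY_WARN_FRACTION = 0.80  # alert at 80% of the pod's memory limit

def memory_alert(usage_bytes: int, limit_bytes: int) -> bool:
    """Return True once a pod crosses 80% of its memory limit."""
    return usage_bytes >= limit_bytes * MEMORY_WARN_FRACTION

# A pod with a 2 GiB limit alerts at 1.6 GiB, well before the silent
# context truncation that can precede an OOM kill.
```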
LLM API latency and error rates. Agents depend on external LLM calls. Track p50, p95, and p99 latency for every model call. Alert on sustained p95 above 5 seconds; not because the system is down, but because slow model responses cause timeout cascades in multi-step workflows. Track rate limit hits separately. A rate-limited agent isn’t broken; it’s throttled, and the distinction matters for diagnosis.
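A minimal sketch of the p95 check, using a nearest-rank percentile over a window of recent model-call latencies; the function names and the sampling strategy are assumptions, not part of any specific APM tool:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; sufficient for alerting sketches."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

def p95_latency_alert(latencies_s: list[float], threshold_s: float = 5.0) -> bool:
    """Alert when p95 model-call latency exceeds the 5-second threshold."""
    return percentile(latencies_s, 95) > threshold_s
```

In practice you would evaluate this over a sustained window (e.g. several consecutive scrape intervals) rather than a single batch, to avoid alerting on one slow call.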
MCP server availability. Each tool the agent uses is an MCP server. If the CRM integration is down, the agent can’t look up customers. Track each MCP server’s uptime and response time independently. A single MCP server outage degrades the agent’s capability in a specific domain; it doesn’t crash the agent, but it changes what the agent can do. That partial degradation is harder to detect than a full outage.
Token consumption. Track tokens per task, per agent, per day. Token usage is your cost signal and your complexity signal. A task that normally takes 2,000 tokens suddenly consuming 8,000 means the agent is reasoning harder, probably because it encountered something unexpected. Cost anomalies are behavior anomalies.
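The token anomaly signal described above reduces to a ratio check against the rolling baseline (the 2x warning ratio matches the threshold table later in this piece; the function name is illustrative):

```python
def token_anomaly(tokens_used: int, baseline_avg: float,
                  warn_ratio: float = 2.0) -> bool:
    """Flag a task whose token use runs well above the 30-day average.
    A 2,000-token task suddenly at 8,000 tokens is a 4x anomaly."""
    return tokens_used >= baseline_avg * warn_ratio
```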
Layer 2: Execution Metrics
The diagnostic layer. These metrics tell you whether the agent is doing its job correctly.
Task completion rate. What percentage of assigned tasks does the agent complete without human intervention? Track this daily. A healthy agent completes 85-95% of tasks autonomously. Below 80%, something has changed; either the task distribution shifted or the agent’s capabilities degraded. Don’t set the target at 100%. Agents that never escalate are agents that never admit uncertainty, and that’s a worse problem.
Intervention rate. How often does a human step in to correct, override, or complete an agent’s work? This is the single most important metric in agent monitoring. A rising intervention rate is the earliest signal of drift. Track it as a 7-day rolling average. Alert when it rises 10 percentage points above the 30-day baseline. A 5% intervention rate climbing to 15% means something fundamental has changed.
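The rolling-average comparison can be sketched as follows. One simplifying assumption: the 30-day baseline here includes the most recent 7 days; in production you might exclude them so a spike doesn't inflate its own baseline:

```python
from statistics import mean

def intervention_alert(daily_rates: list[float],
                       warn_pts: float = 10.0,
                       crit_pts: float = 20.0):
    """Compare the 7-day rolling intervention rate (percentage points)
    against the 30-day baseline. Needs at least 30 days of history."""
    if len(daily_rates) < 30:
        return None
    baseline = mean(daily_rates[-30:])
    rolling7 = mean(daily_rates[-7:])
    delta = rolling7 - baseline
    if delta >= crit_pts:
        return "critical"
    if delta >= warn_pts:
        return "warning"
    return None
```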
Tool call success rate. What percentage of MCP server calls succeed? Break this down by tool. A 98% success rate on CRM lookups but a 73% success rate on email sends tells you exactly where to investigate. Track failure reasons: authentication errors, parameter validation failures, rate limits, timeouts. Each failure type has a different root cause and a different fix.
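The per-tool breakdown is a simple aggregation over call records; the record shape below (`tool`, `ok` keys) is a hypothetical schema, not a standard MCP log format:

```python
from collections import defaultdict

def tool_success_rates(calls: list[dict]) -> dict[str, float]:
    """Per-tool success rate from records like {"tool": "crm_lookup", "ok": True}.
    The record shape is illustrative; adapt to your own call logs."""
    totals: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for call in calls:
        totals[call["tool"]] += 1
        if call["ok"]:
            successes[call["tool"]] += 1
    return {tool: successes[tool] / totals[tool] for tool in totals}
```

The same aggregation, keyed on a failure-reason field instead of the tool name, gives you the auth/validation/rate-limit/timeout breakdown.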
Context window utilization. How much of the available context window is the agent using per task? Track the average and the 95th percentile. An agent routinely hitting 90% of its context window is running out of room to reason. It starts dropping earlier context to fit new information, and that’s exactly the context loss failure mode that causes silent degradation. Alert at 85% utilization on p95.
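The p95 utilization alert can be sketched like this, assuming per-task token counts and the model's window size are available (the 128k window in the test is an assumed example):

```python
import math

def context_utilization_alert(used_tokens: list[int], window_size: int,
                              warn: float = 0.85, crit: float = 0.95):
    """Nearest-rank p95 of per-task context-window utilization."""
    fractions = sorted(t / window_size for t in used_tokens)
    p95 = fractions[max(math.ceil(0.95 * len(fractions)) - 1, 0)]
    if p95 > crit:
        return "critical"
    if p95 > warn:
        return "warning"
    return None
```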
Decision confidence distribution. Agents built for production attach confidence scores to their outputs. Plot the distribution weekly. A healthy distribution clusters between 0.75 and 0.95. A distribution that’s flattening (more decisions in the 0.5-0.7 range) means the agent is encountering more situations where its current skills and context aren’t a clean fit. That’s your signal to run The Recursive Loop: check what patterns are emerging and encode new skills to address them.
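One way to operationalize "flattening" is to watch the fraction of decisions landing in the uncertain 0.5-0.7 band; the 20% trigger below is an assumed starting point, not a threshold from the article's table:

```python
def confidence_flattening(scores: list[float], flat_band=(0.5, 0.7),
                          max_fraction: float = 0.2) -> bool:
    """Flag when too many decisions fall in the 0.5-0.7 'uncertain' band,
    the signal that current skills no longer fit the incoming work.
    max_fraction=0.2 is an illustrative default; tune to your agent."""
    lo, hi = flat_band
    in_band = sum(1 for s in scores if lo <= s <= hi)
    return in_band / len(scores) > max_fraction
```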
Layer 3: Business Outcome Metrics
The impact layer. These metrics tell you whether the agent is delivering value.
Resolution rate. For customer-facing agents: what percentage of interactions reach a satisfactory resolution without human escalation? This is the metric leadership cares about. Track it alongside CSAT scores for agent-handled interactions versus human-handled interactions. The gap tells you where the agent is strong and where it’s weak.
Time savings. Measure the actual time saved per task compared to the manual baseline. An invoice processing agent that takes 4 minutes per invoice versus the 22-minute manual average is saving 18 minutes per unit. Multiply by volume for the daily impact. Track this over time; it should improve as The Recursive Loop adds capabilities. If it plateaus or declines, the agent is hitting the limits of its current encoding.
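The daily-impact arithmetic from the invoice example, as a one-liner (the 500-invoice volume in the test is an assumed figure for illustration):

```python
def daily_time_saved_minutes(manual_min: float, agent_min: float,
                             daily_volume: int) -> float:
    """Per-unit savings times volume: (22 - 4) min across 500 invoices
    is 9,000 minutes, or 150 hours, per day."""
    return (manual_min - agent_min) * daily_volume
```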
Cost per action. Total the infrastructure cost (compute, LLM API calls, MCP server usage) per completed task. Compare against the cost of a human performing the same task. This ratio determines ROI and guides investment decisions. A task that costs $0.35 for the agent versus $12.00 for a human is a clear win. A task that costs $3.50 for the agent versus $4.00 for a human needs tighter encoding to justify the overhead.
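The cost-per-action and ROI comparison in code form; the cost categories mirror the paragraph above, and the function names are ours:

```python
def cost_per_action(llm_cost: float, compute_cost: float, mcp_cost: float,
                    tasks_completed: int) -> float:
    """Blended infrastructure cost per completed task."""
    return (llm_cost + compute_cost + mcp_cost) / tasks_completed

def roi_ratio(agent_cost: float, human_cost: float) -> float:
    """How many times cheaper the agent is than the human baseline.
    $0.35 vs $12.00 is roughly a 34x advantage; $3.50 vs $4.00 barely 1.1x."""
    return human_cost / agent_cost
```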
Error rate by business impact. Not all errors are equal. An agent that misclassifies a low-priority support ticket is a nuisance. An agent that sends a contract to the wrong recipient is a legal exposure. Categorize errors by business impact: low (cosmetic, easily corrected), medium (requires human correction, customer notices), high (financial, legal, or reputational consequence). Alert immediately on any high-impact error. Track medium-impact errors weekly.
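The routing policy for the three impact levels can be made explicit in code; the routing target names (`page_now` and friends) are hypothetical labels for whatever your alerting stack calls these actions:

```python
IMPACT_LEVELS = {"low", "medium", "high"}

def route_error(impact: str) -> str:
    """Route an error by business impact: high pages immediately,
    medium goes to the weekly review, low is logged only.
    Routing target names are illustrative."""
    if impact not in IMPACT_LEVELS:
        raise ValueError(f"unknown impact level: {impact}")
    return {"high": "page_now", "medium": "weekly_review", "low": "log_only"}[impact]
```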
Alert Thresholds That Work
Across multiple client engagements where we deployed agent monitoring, these thresholds have proven reliable:
| Metric | Warning | Critical |
|---|---|---|
| Intervention rate (7-day avg) | +10 pts above 30-day baseline | +20 pts above baseline |
| Task completion rate | Below 80% | Below 65% |
| Context utilization (p95) | Above 85% | Above 95% |
| Tool call success rate | Below 95% | Below 85% |
| Token cost per task | 2x above 30-day avg | 5x above avg |
| High-impact errors | Any occurrence | 2+ in 24 hours |
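A subset of the table above, expressed as alert-rule configuration. The metric keys and the evaluation helper are an illustrative sketch, not a real alerting DSL; the threshold values come straight from the table:

```python
# Threshold values from the table above; direction says which side breaches.
THRESHOLDS = {
    "task_completion_rate":    {"warning": 0.80, "critical": 0.65, "direction": "below"},
    "context_utilization_p95": {"warning": 0.85, "critical": 0.95, "direction": "above"},
    "tool_call_success_rate":  {"warning": 0.95, "critical": 0.85, "direction": "below"},
}

def evaluate(metric: str, value: float):
    """Return 'critical', 'warning', or None for a metric reading."""
    rule = THRESHOLDS[metric]
    def breached(level: str) -> bool:
        limit = rule[level]
        return value < limit if rule["direction"] == "below" else value > limit
    if breached("critical"):
        return "critical"
    if breached("warning"):
        return "warning"
    return None
```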
Warning alerts go to the agent operations team. Critical alerts page the on-call engineer and pause the agent’s autonomous actions pending review.
Dashboard Design: Agent Monitoring vs. Traditional APM
A traditional APM dashboard shows request volume, error rates, and latency. An agent monitoring dashboard shows something fundamentally different.
Top row: health at a glance. Three numbers: task completion rate (trailing 24h), intervention rate (7-day rolling), cost per action (trailing 24h). These three tell you in two seconds whether the system is performing.
Second row: execution quality. Decision confidence histogram (last 7 days), tool call success rates by MCP server, context window utilization trend. This row is where you spot degradation before it hits the business metrics.
Third row: business impact. Resolution rate trend, time savings per task trend, error breakdown by impact level. This row is what you show leadership.
Fourth row: infrastructure. Pod health, LLM latency, MCP server status. This row only matters when something in rows one through three looks wrong; it helps you determine whether the issue is infrastructure or agent quality.
The order matters. Traditional dashboards put infrastructure at the top because that’s where most problems originate in deterministic systems. Agent dashboards put business outcomes at the top because agent failures often happen at the quality layer; the infrastructure is fine, the behavior is wrong.
Monitoring as a Trust-Building Mechanism
Monitoring isn’t just diagnostic. It’s the mechanism by which agents earn trust.
When leadership can see a dashboard showing 92% task completion, 4% intervention rate, and $0.28 cost per action (updated daily, auditable, based on real production data) the conversation changes. The agent isn’t a black box doing mysterious things. It’s a system with measurable performance that can be compared against the human baseline.
Business-as-Code makes this possible because every agent action traces back to a specific skill, a specific schema, a specific context document. When intervention rate spikes, you don’t just know that something went wrong. You know which skill produced the bad output, which schema was involved, and which context document needs updating.
That traceability is what separates monitoring from logging. Logs tell you what happened. Monitoring tells you what to do about it. And in a system built on The Recursive Loop, what you do about it is encode the fix (add the missing skill, update the stale context, refine the schema) and watch the metrics improve in the next cycle.
The monitoring system doesn’t just watch the agents. It feeds the improvement loop that makes them better.
Frequently Asked Questions
What's the single most important metric for agent monitoring?
Intervention rate: how often a human has to step in to correct the agent. A rising intervention rate means the agent is drifting. A falling one means it's learning. This metric captures quality, reliability, and user trust in one number.
How is monitoring agents different from monitoring traditional software?
Traditional software is deterministic: the same input produces the same output. Agents are probabilistic: the same input can produce different outputs. You need to monitor behavioral patterns, not just error codes. A 200 OK from an agent that hallucinated a customer record is worse than a clean 500 error.
What monitoring tools does NimbleBrain use?
We monitor at three layers: infrastructure metrics (K8s pod health, resource usage), agent execution metrics (task completion, tool usage, context window utilization), and business outcome metrics (resolution rate, time savings, cost per action). The NimbleBrain Platform provides built-in agent observability.