AI Agent Observability: How to Dashboard Claude Code When All You Have Is a Terminal
You scheduled three agents to run overnight. Memory consolidation at midnight. Test suite at 1 AM. Documentation sync at 2 AM. You wake up, check CI — green. Git log shows commits from all three agents. Everything looks fine.
But did the memory consolidation actually finish, or did it silently hit a rate limit at 12:47 AM and write a half-complete summary? Did the test runner spend $4.80 on a run that normally costs $0.09 because it got stuck in a retry loop? Did the docs sync publish a page referencing the old API because it ran before the test runner's fixes were committed?
CI doesn't tell you this. Git log doesn't tell you this. The terminal is long gone. You have zero visibility into what actually happened.
This is the observability gap for AI agents, and it's fundamentally different from the observability problems we've solved for traditional software.
Agent Observability vs. LLM Observability
Most "AI observability" tools are actually LLM observability tools. They monitor API calls — latency, token counts, error rates, prompt/completion pairs. That's useful, but it's like monitoring database queries without understanding the application logic. You know what the model said, but not what the agent did.
Agent observability needs five pillars that go beyond API-level metrics:
Liveness — Is the agent running right now? When did it last run? Unless you're tracking heartbeats, a cron job that silently stops executing looks identical to one that was never scheduled.
Activity — What is the agent doing? Not just "it made an API call" but "it's consolidating memory files from the last 7 days" or "it's running the integration test suite against staging." Activity context turns opaque token consumption into understandable work.
Outcomes — Did the agent achieve its goal? A successful API call doesn't mean a successful agent run. The model might return a perfectly formatted response that contains wrong information. Outcomes track whether the agent's actions produced the intended result — files written, tests passed, PRs created.
Cost — How much did this agent run cost? Not in aggregate, but per run. A memory consolidation job should cost around $0.01. A test suite run might cost $0.09. If either of those numbers jumps to $2.00, something went wrong — maybe an infinite loop, maybe a context window explosion, maybe a model tier mismatch. (For deep coverage on cost tracking, see How to Monitor Claude Code Usage and Costs.)
Health — Is the agent's infrastructure healthy? Are its memory files within size limits? Are its pipeline dependencies satisfied? Is its schedule current? Health is the meta-layer that catches problems before they manifest as failed runs.
Traditional APM tools give you liveness and cost. Agent-specific observability adds activity, outcomes, and health.
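Cost-per-run, for example, falls out of the token counts directly. A minimal sketch, using illustrative per-million-token rates — the numbers below are assumptions for the example, not quotes; check current model pricing before relying on them:

```python
# Hypothetical (input, output) USD rates per million tokens -- illustrative
# placeholders, NOT authoritative pricing. Cache reads/writes are ignored here.
RATES = {
    "haiku": (0.80, 4.00),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate one run's cost in USD from its token counts."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000
```

With per-run numbers like these, a routine job that suddenly costs 20x its baseline stands out immediately — which is exactly the signal the anomaly detection below builds on.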
The Signal Stack
Agent signals come from four sources, each serving a different observability need.
Log Events
The foundation is structured log events from two primary sources:
Cron run logs capture every scheduled execution. Each entry records the agent name, start time, duration, token usage (input, output, cache reads, cache writes), model used, and exit status. These accumulate in per-run JSONL files and form the basis for historical analysis.
Config audit logs track every change to agent configuration — schedule modifications, tool permission changes, model tier updates, memory path reassignment. When an agent starts behaving differently, the config audit trail tells you what changed and when.
Both log types use structured formats with consistent fields: timestamp, source, level (info/warn/error), category, and a detail payload. The level and category fields enable filtering — you can surface all errors across all agents, or all activity from a specific agent, or all config changes in the last 24 hours.
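Reading and filtering events of this shape is straightforward. A sketch, assuming per-run JSONL files with the fields named above (`timestamp`, `source`, `level`, `category`, plus a detail payload — field names are illustrative, not a fixed schema):

```python
import json
from pathlib import Path

def read_events(log_dir, level=None, source=None, category=None):
    """Yield structured log events from JSONL files, optionally filtered.

    Assumed event shape: {"timestamp": ..., "source": ..., "level": ...,
    "category": ..., "detail": ...}. Any filter left as None matches all.
    """
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                event = json.loads(line)
                if level and event.get("level") != level:
                    continue
                if source and event.get("source") != source:
                    continue
                if category and event.get("category") != category:
                    continue
                yield event

# Surface all errors across all agents:
#   for e in read_events("logs/", level="error"):
#       print(e["timestamp"], e["source"], e["detail"])
```

The same function covers all three filtering patterns: errors across all agents, activity from one agent, or a single category across a time window (add a timestamp bound for the last case).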
Log Summaries
Raw logs are comprehensive but dense. The summary layer aggregates them into digestible views: runs per agent per day, average cost per run, error frequency, token usage trends. Summaries answer the "how are things going in general" question without requiring you to parse individual log entries.
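The aggregation itself is a small fold over the event stream. A sketch, assuming each run event carries `source`, an ISO `timestamp`, a `cost_usd`, and a `status` field (illustrative names):

```python
from collections import defaultdict

def summarize(events):
    """Roll run-log events up into per-agent, per-day summaries:
    run count, total and average cost, and error frequency."""
    daily = defaultdict(lambda: {"runs": 0, "cost": 0.0, "errors": 0})
    for e in events:
        key = (e["source"], e["timestamp"][:10])  # (agent, YYYY-MM-DD)
        daily[key]["runs"] += 1
        daily[key]["cost"] += e.get("cost_usd", 0.0)
        if e.get("status") == "error":
            daily[key]["errors"] += 1
    return {
        key: {**v, "avg_cost": v["cost"] / v["runs"]}
        for key, v in daily.items()
    }
```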
Live Stream
For real-time observability, the system uses Server-Sent Events (SSE) to push log entries to a dashboard as they happen. The architecture is intentionally simple: a single SSE connection streams events with level tags (INF, WRN, ERR, DBG), and the client filters and renders them.
The live stream widget maintains a 500-line circular buffer. Old entries roll off as new ones arrive. This prevents memory growth during long-running sessions while keeping enough history to understand recent context. The buffer size is a deliberate trade-off: 500 lines is enough to reconstruct the last several minutes of activity but not enough to substitute for persistent log storage.
Level filtering lets you focus on what matters. During normal operation, you might filter to WRN and ERR only. When debugging a specific agent, switch to all levels including DBG. The filter state persists across widget collapses, so you don't lose your configuration when minimizing the panel.
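The buffer-plus-filter behavior is a few lines of logic. The real widget lives in the browser, but the mechanics translate directly; in Python, `collections.deque` with `maxlen=500` gives the circular buffer for free. A sketch with assumed event shapes:

```python
from collections import deque

LEVELS = {"INF", "WRN", "ERR", "DBG"}

class LiveStreamBuffer:
    """Client-side model of the live stream widget: a 500-line circular
    buffer plus independently toggleable level filters. Event shape
    (level, source, message) is illustrative."""

    def __init__(self, maxlen=500):
        self.buffer = deque(maxlen=maxlen)   # old entries roll off automatically
        self.visible_levels = set(LEVELS)    # all levels shown by default

    def push(self, level, source, message):
        self.buffer.append((level, source, message))

    def toggle(self, level):
        # In the real widget this set would also be persisted so the
        # filter survives collapse/re-expand.
        self.visible_levels.symmetric_difference_update({level})

    def render(self):
        return [e for e in self.buffer if e[0] in self.visible_levels]
```

Note that filtering happens at render time, not at push time: a hidden DBG entry still occupies its buffer slot, so re-enabling the level brings recent entries back.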
Anomaly Detection
The simplest and most effective anomaly signal: flag any run that consumes more than 3x the median token count for that agent. This single heuristic catches:
- Retry loops (agent keeps failing and retrying, burning tokens each iteration)
- Context window explosions (agent accidentally loads a massive file into context)
- Model tier mismatches (a routine job accidentally running on Opus instead of Haiku)
- Infinite tool loops (agent calls the same tool repeatedly without making progress)
The 3x threshold is generous enough to avoid false positives from normal variance but tight enough to catch genuine problems. A memory consolidation job that normally uses 2,000 tokens firing at 6,000+ tokens is worth investigating even if it "succeeded."
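The heuristic fits in a dozen lines. A sketch, assuming run records with `agent` and `tokens` fields (illustrative names):

```python
from statistics import median

def flag_anomalies(runs, threshold=3.0):
    """Flag runs whose token count exceeds `threshold` times the
    per-agent median. `runs` is a list of {"agent": ..., "tokens": ...}."""
    by_agent = {}
    for r in runs:
        by_agent.setdefault(r["agent"], []).append(r["tokens"])
    medians = {agent: median(ts) for agent, ts in by_agent.items()}
    return [r for r in runs if r["tokens"] > threshold * medians[r["agent"]]]
```

Using the median rather than the mean matters here: a single runaway run would drag the mean upward and hide itself, while the median stays anchored to normal behavior.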
Live Streaming in Practice
The live stream widget has two states: collapsed and expanded.
Collapsed shows a small pill in the corner of the screen — just enough to indicate the stream is active and show the current event rate. This is the default state because most of the time you don't need to watch events in real time. You just want to know the system is active.
Expanded opens a 440×440 pixel panel showing the scrolling event stream. Each event displays its timestamp, level pill (color-coded: blue for INF, yellow for WRN, red for ERR, gray for DBG), source agent, and message. JSON payloads in events are rendered as collapsed blocks — click to expand and see the full structured data.
The widget is draggable. Default position is bottom-right, 20 pixels from the viewport edges. You can reposition it anywhere and the position persists. This matters in practice because different workflows want the stream in different locations — bottom-right for general monitoring, top-right when you're working in the bottom of the screen, left side when your main content is on the right.
Level filtering pills across the top of the expanded view let you toggle each level independently. The filter combination persists when you collapse and re-expand. A common pattern: keep only ERR visible during overnight runs, then expand the filter to include WRN and INF when you're actively debugging.
Historical Log Browsing
Real-time streaming handles "what's happening now." Historical browsing handles "what happened while I wasn't watching."
The timeline view shows agent activity over time — when each agent ran, how long the run took, and whether it succeeded or failed. This is the first place you look after overnight runs. A healthy timeline shows regular, evenly-spaced runs with green success indicators. A problematic timeline shows gaps (missed runs), clusters (retry storms), or red indicators (failures).
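Gap detection on such a timeline is simple to sketch: given one agent's sorted run timestamps and its expected interval, any spacing well beyond the interval is a likely missed run. A minimal version (the 1.5x slack factor and data shapes are assumptions, not the tool's actual logic):

```python
from datetime import datetime, timedelta

def find_gaps(run_times, expected_interval, slack=1.5):
    """Return (prev, cur) timestamp pairs where the spacing between
    consecutive runs exceeds `slack` times the expected interval --
    i.e. likely missed runs. `run_times` must be sorted ascending."""
    gaps = []
    for prev, cur in zip(run_times, run_times[1:]):
        if cur - prev > expected_interval * slack:
            gaps.append((prev, cur))
    return gaps
```

Retry storms show up as the inverse condition — spacings far below the expected interval — and can be caught with the symmetric check.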
Source filtering lets you isolate a single agent's history. When you notice that the documentation sync has been failing every third run, you filter to that agent and trace through its recent history. The config audit trail often reveals the cause — maybe someone changed a dependency path, or a tool permission was revoked.
Error surfacing prioritizes showing you failures first. Instead of scrolling through hundreds of successful runs to find the three that failed, the error view pulls all failures to the top with their context: what the agent was doing, what the error was, and what the token consumption looked like (anomalous token usage on a failed run usually means the agent struggled before giving up).
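That ordering reduces to a single sort key: failures before successes, highest token consumption first within each group. A sketch, assuming run records with `status` and `tokens` fields:

```python
def surface_errors(runs):
    """Order runs failures-first; within each group, the most
    token-hungry runs (the likely 'struggled before giving up'
    cases) come first."""
    return sorted(runs, key=lambda r: (r["status"] != "error", -r.get("tokens", 0)))
```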
Pipeline Health Checks
Individual agent observability is necessary but not sufficient. In a multi-agent system, the relationships between agents matter as much as the agents themselves.
Pipeline health checks use AI-powered analysis to evaluate the entire agent graph. The health check prompt examines six dimensions:
- Overall health summary — a one-paragraph assessment of system status
- Agent ownership — flags any jobs that are UNOWNED, meaning no agent is responsible for them
- Broken edges — identifies pipeline connections where the source job has errored, meaning downstream jobs may be consuming bad data
- Schedule gaps — finds jobs that are infrequent or overdue relative to their configured schedule
- Missing deliveries — catches jobs that run but don't deliver output to any configured target
- Actionable recommendations — the top 2-3 things you should fix, prioritized by impact
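A prompt along these lines could drive the check. The wording below is illustrative — a sketch of how the six dimensions might be assembled, not the tool's actual prompt:

```python
def build_health_check_prompt(pipeline_json: str) -> str:
    """Assemble a pipeline health-check prompt covering the six
    dimensions. Exact phrasing is a hypothetical example."""
    return "\n".join([
        "You are auditing a multi-agent pipeline. Given the pipeline",
        "graph and run history below, report on:",
        "1. Overall health: a one-paragraph assessment of system status.",
        "2. Ownership: flag any jobs that are UNOWNED (no responsible agent).",
        "3. Broken edges: connections whose source job has errored.",
        "4. Schedule gaps: jobs that are infrequent or overdue.",
        "5. Missing deliveries: jobs that run but deliver no output.",
        "6. Recommendations: the top 2-3 fixes, prioritized by impact.",
        "",
        "Pipeline state:",
        pipeline_json,
    ])
```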
The health check streams its analysis via SSE, so you see results appear in real time as the model works through each dimension. This is more than a UX nicety — streaming means you can start acting on ownership issues while the model is still analyzing schedules.
The inline follow-up chat lets you ask questions about the health check results without leaving the dashboard. "Why is the test runner UNOWNED?" or "What would break if I reschedule the memory consolidation to 2 AM?" The chat has access to the same pipeline context, so answers are grounded in your actual configuration.
Per-job context building deserves special mention. When the health check examines a specific job, it constructs a focused context window containing only that job's configuration, recent run history, and immediate dependencies. This keeps the analysis accurate without overwhelming the model with the entire pipeline graph. An individual job analysis might include its last 10 runs, its upstream and downstream connections, and its schedule — typically under 2,000 tokens of context for a precise, actionable assessment. (See How to Build AI Agent Pipelines for the full pipeline architecture.)
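Context building of this kind can be sketched as a simple projection over the pipeline data. The data shapes below (job dicts, run records, `src`/`dst` edges) are assumptions for illustration:

```python
import json

def build_job_context(job, runs, edges, max_runs=10):
    """Build a focused context window for one job: its config and
    schedule, its last `max_runs` runs, and its immediate upstream
    and downstream neighbors -- nothing else from the graph."""
    name = job["name"]
    upstream = [e["src"] for e in edges if e["dst"] == name]
    downstream = [e["dst"] for e in edges if e["src"] == name]
    recent = [r for r in runs if r["agent"] == name][-max_runs:]
    return json.dumps({
        "job": job,                 # config + schedule
        "recent_runs": recent,      # last N runs only
        "upstream": upstream,
        "downstream": downstream,
    }, indent=2)
```

The point of the projection is what it leaves out: unrelated jobs, ancient history, and transitive dependencies all stay out of the window, which is how the analysis fits in a couple of thousand tokens.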
What Enterprise Tools Get Wrong
The AI observability market is growing fast, and most tools are aimed at enterprise teams running hundreds of agents across cloud infrastructure. They offer dashboards-as-a-service, centralized log aggregation, multi-tenant role-based access, and enough configuration to require its own documentation.
For a solo developer or small team running Claude Code agents from a terminal, these tools are wildly overbuilt. You don't need multi-tenant access control when you're the only tenant. You don't need cloud log aggregation when your logs live in local JSONL files. You don't need a SaaS subscription when your agents run on your laptop.
What you need is:
- A local dashboard that reads your filesystem directly
- Real-time streaming without cloud infrastructure
- Cost tracking per agent per run with no third-party data egress
- Health checks that understand your specific pipeline graph
- An interface that runs alongside your terminal, not instead of it
The gap between enterprise observability platforms and "grep through log files" is enormous. Most developers running agent teams end up in the second category because the first category requires too much setup, too much infrastructure, and too much ongoing cost for what amounts to a local development tool.
Filling that gap — giving terminal-based agent teams real observability without SaaS complexity — is a design problem as much as an engineering one. The tool needs to be zero-config for basic use, incrementally configurable for advanced use, and lightweight enough to run alongside the agents it's monitoring.
From Invisible to Observable
The transition from "agents that run" to "agents you can see" changes how you work with AI teams. When you can see what your agents are doing — in real time and historically — you stop treating them as black boxes and start treating them as team members. You notice when they're struggling. You catch problems before they compound. You optimize based on data instead of intuition.
The five pillars — liveness, activity, outcomes, cost, health — give you the vocabulary. The signal stack — log events, summaries, live stream, anomaly detection — gives you the data. The dashboard — timeline, filters, health checks, inline chat — gives you the interface.
None of this requires exotic infrastructure. It requires structured logs, an SSE endpoint, and a thoughtful UI. The hard part isn't the technology. It's deciding to treat agent observability as a first-class concern instead of an afterthought.
Want visibility into your Claude Code agent teams? ClawPort gives you the activity console, live stream widget, and AI-powered pipeline health checks — all running locally, no SaaS required. Free and open source.