LLM Call Analytics and Coordination Metrics
Every LLM provider call is tracked with comprehensive metadata for financial reporting, debugging, and orchestration overhead analysis. These analytics drive the multi-agent tuning signals (orchestration ratio, coordination efficiency, error amplification) that complement the budget and cost controls.
Per-Call Tracking and Proxy Overhead Metrics
Every completion call produces a CompletionResponse with TokenUsage (token counts and cost). The engine layer creates a CostRecord (with agent/task context) and records it into CostTracker. The engine additionally logs proxy overhead metrics at task completion:
- turns_per_task: number of LLM turns to complete the task
- tokens_per_task: total tokens consumed
- cost_per_task: total cost in configured currency
- duration_seconds: wall-clock execution time
- prompt_tokens: estimated system prompt tokens
- prompt_token_ratio: ratio of prompt tokens to total tokens (overhead indicator; warns when >0.3)
These are natural overhead indicators; a task consuming 15 turns and 50k tokens for a one-line fix signals a problem. Metrics are captured in TaskCompletionMetrics, a frozen Pydantic model with a from_run_result() factory method.
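As a rough illustration of how these fields relate, the sketch below uses a frozen dataclass as a stand-in for the Pydantic model. The class name TaskMetricsSketch and the prompt_overhead_warning property are hypothetical; only the field names and the 0.3 threshold come from the list above, and the real from_run_result() factory is not reproduced here.

```python
from dataclasses import dataclass

PROMPT_RATIO_WARN = 0.3  # warning threshold stated in the list above


@dataclass(frozen=True)
class TaskMetricsSketch:
    """Illustrative stand-in for TaskCompletionMetrics (not the real model)."""

    turns_per_task: int
    tokens_per_task: int
    cost_per_task: float
    duration_seconds: float
    prompt_tokens: int

    @property
    def prompt_token_ratio(self) -> float:
        # Guard against division by zero for tasks that consumed no tokens.
        if self.tokens_per_task == 0:
            return 0.0
        return self.prompt_tokens / self.tokens_per_task

    @property
    def prompt_overhead_warning(self) -> bool:
        # Hypothetical helper: True when the ratio exceeds the 0.3 threshold.
        return self.prompt_token_ratio > PROMPT_RATIO_WARN


# Example values are made up for illustration.
metrics = TaskMetricsSketch(
    turns_per_task=15,
    tokens_per_task=50_000,
    cost_per_task=0.42,
    duration_seconds=310.0,
    prompt_tokens=20_000,
)
```

With these made-up numbers the system prompt accounts for 40% of all tokens, so the overhead warning would fire.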
Call Categorisation and Orchestration Ratio
When multi-agent coordination exists, each CostRecord is tagged with a call category:
| Category | Description | Examples |
|---|---|---|
| productive | Direct task work: tool calls, code generation, task output | Agent writing code, running tests |
| coordination | Inter-agent communication: delegation, reviews, meetings | Manager reviewing work, agent presenting in meeting |
| system | Framework overhead: system prompt injection, context loading | Initial prompt, memory retrieval injection |
| embedding | Embedding model calls: memory store/retrieve vectorization | Mem0 store embedding, similarity search query embedding |
The orchestration ratio (coordination / total) is surfaced in metrics and alerts. If coordination tokens consistently exceed productive tokens, the company configuration needs tuning (fewer approval layers, simpler meeting protocols, etc.).
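A minimal sketch of the ratio, assuming it is token-based and that "total" spans all four categories. The function name and the (category, tokens) record shape are illustrative and do not reflect the actual CostRecord schema.

```python
from collections import Counter


def orchestration_ratio(records):
    """Share of tokens spent on coordination calls.

    `records` is an iterable of (call_category, tokens) pairs; this shape
    is an assumption made for the example.
    """
    totals = Counter()
    for category, tokens in records:
        totals[category] += tokens
    total = sum(totals.values())
    # No tokens recorded yet: report zero rather than dividing by zero.
    return totals["coordination"] / total if total else 0.0


# Made-up token counts for one task.
records = [
    ("productive", 6000),
    ("coordination", 3000),
    ("system", 800),
    ("embedding", 200),
]
ratio = orchestration_ratio(records)
```

Here coordination accounts for 3,000 of 10,000 tokens, so the ratio is 0.3.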
Coordination Metrics Suite
The suite comprises nine coordination metrics derived from empirical agent scaling research (Kim et al., 2025). They explain coordination dynamics and enable data-driven tuning of multi-agent configurations.
| Metric | Symbol | Definition | What It Signals |
|---|---|---|---|
| Coordination efficiency | Ec | success_rate / (turns / turns_sas): success normalised by relative turn count vs single-agent baseline | Overall coordination ROI. Low Ec = coordination costs exceed benefits |
| Coordination overhead | O% | (turns_mas - turns_sas) / turns_sas * 100%: relative turn increase | Communication cost. Optimal band: 200–300%. Above 400% = over-coordination |
| Error amplification | Ae | error_rate_mas / error_rate_sas: relative failure probability | Whether MAS corrects or propagates errors. Centralised ~4.4x, Independent ~17.2x |
| Message density | c | Inter-agent messages per reasoning turn | Communication intensity. Performance saturates at ~0.39 messages/turn |
| Redundancy rate | R | Mean cosine similarity of agent output embeddings | Agent agreement. Optimal at ~0.41 (balances fusion with independence) |
| Amdahl ceiling | Sc | Theoretical max speedup from Amdahl's Law given parallelizable fraction | Diminishing returns threshold. Recommends ideal team size |
| Straggler gap | Gs | (slowest_turn - median_turn) / median_turn | Bottleneck severity. High gap = one agent blocks the group |
| Token-speedup ratio | Rt | total_tokens / speedup_factor | Cost efficiency of parallelism. Rising ratio = diminishing token ROI |
| Message overhead | Mo | Pairwise message count relative to team size | Quadratic communication detection. is_quadratic flag when O(n^2) |
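The ratio-style definitions in the table can be sketched directly. The function names below are illustrative, the "turns" term in Ec is assumed to mean the multi-agent turn count, and the Amdahl functions use the standard form of Amdahl's Law rather than any project-specific variant.

```python
from statistics import median


def coordination_efficiency(success_rate, turns_mas, turns_sas):
    """Ec = success_rate / (turns_mas / turns_sas)."""
    return success_rate / (turns_mas / turns_sas)


def coordination_overhead_pct(turns_mas, turns_sas):
    """O% = (turns_mas - turns_sas) / turns_sas * 100."""
    return (turns_mas - turns_sas) / turns_sas * 100.0


def error_amplification(error_rate_mas, error_rate_sas):
    """Ae = error_rate_mas / error_rate_sas."""
    return error_rate_mas / error_rate_sas


def straggler_gap(turn_durations):
    """Gs = (slowest_turn - median_turn) / median_turn."""
    mid = median(turn_durations)
    return (max(turn_durations) - mid) / mid


def token_speedup_ratio(total_tokens, speedup_factor):
    """Rt = total_tokens / speedup_factor."""
    return total_tokens / speedup_factor


def amdahl_speedup(parallel_fraction, n_agents):
    """Amdahl's Law speedup for n agents: 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_agents)


def amdahl_ceiling(parallel_fraction):
    """Limit of the speedup as the team grows without bound: 1 / (1 - p)."""
    return 1.0 / (1.0 - parallel_fraction)


# Made-up inputs: a 3x turn increase over the single-agent baseline.
overhead = coordination_overhead_pct(turns_mas=30, turns_sas=10)
efficiency = coordination_efficiency(0.8, turns_mas=30, turns_sas=10)
```

With 30 multi-agent turns against a 10-turn baseline, the overhead is 200%, the lower edge of the optimal band in the table. The functions assume non-zero baselines (turns_sas, error_rate_sas, median turn time); a real collector would need to handle missing baseline data.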
All 9 metrics are opt-in via coordination_metrics.enabled in analytics config. Ec and
O% are cheap (turn counting). Ae requires baseline comparison data. R requires
semantic analysis of agent outputs (embedding cosine similarity). c, Sc, Gs, Rt,
and Mo are computed from execution telemetry (turn counts, token usage, message logs).

    coordination_metrics:
      enabled: false                  # opt-in -- enable for data gathering
      collect:
        - efficiency                  # cheap -- turn counting
        - overhead                    # cheap -- turn counting
        - error_amplification         # requires SAS baseline data
        - message_density             # requires message counting infrastructure
        - redundancy                  # requires embedding computation on outputs
        - amdahl_ceiling              # computed from parallelizable fraction
        - straggler_gap               # computed from per-agent turn times
        - token_speedup_ratio         # computed from token usage + speedup
        - message_overhead            # computed from pairwise message counts
      error_taxonomy:
        enabled: false                # opt-in -- enable for targeted diagnosis
        categories:
          - logical_contradiction
          - numerical_drift
          - context_omission
          - coordination_failure

The Ae baseline is a sliding window of recent single-agent (SAS)
runs. Its size is the budget.baseline_window_size setting
(default 50), sourced from the SYNTHORG_BUDGET_BASELINE_WINDOW_SIZE
environment variable at API start. It is read-only post-init: the
window is sized once when the baseline store is constructed, so a
change requires a restart.
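The fixed-size window can be sketched with a bounded deque, whose maxlen cannot be changed after construction. The class name BaselineStoreSketch, its methods, and the choice to store one failure flag per SAS run are assumptions for illustration; only the environment variable name and the default of 50 come from the text above.

```python
import os
from collections import deque

ENV_VAR = "SYNTHORG_BUDGET_BASELINE_WINDOW_SIZE"
DEFAULT_WINDOW = 50


class BaselineStoreSketch:
    """Hypothetical sliding window over recent single-agent (SAS) runs."""

    def __init__(self, window_size=None):
        if window_size is None:
            # Read once at construction; later env changes have no effect.
            window_size = int(os.environ.get(ENV_VAR, DEFAULT_WINDOW))
        # deque.maxlen is read-only, so the window is sized exactly once.
        self._runs = deque(maxlen=window_size)

    def record(self, failed):
        """Append one SAS run outcome; the oldest run is evicted when full."""
        self._runs.append(bool(failed))

    def error_rate(self):
        """Fraction of failed runs in the window, or None with no data."""
        if not self._runs:
            return None
        return sum(self._runs) / len(self._runs)


store = BaselineStoreSketch(window_size=3)
for failed in (True, False, False, False):
    store.record(failed)
```

After four recorded runs in a window of three, the initial failure has been evicted, so the baseline error rate reflects only the three most recent runs.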
Full Analytics Layer Configuration
The full configuration expands per-call metadata for financial and operational reporting:

    call_analytics:
      track:
        - call_category               # productive, coordination, system, embedding
        - success                     # true/false
        - retry_count                 # 0 = first attempt succeeded
        - retry_reason                # rate_limit, timeout, internal_error
        - latency_ms                  # wall-clock time for the call
        - finish_reason               # stop, tool_use, max_tokens, error
        - cache_hit                   # prompt caching hit/miss (provider-dependent)
      aggregation:
        - per_agent_daily             # agent spending over time
        - per_task                    # total cost per task
        - per_department              # department-level rollups
        - per_provider                # provider reliability and cost comparison
        - orchestration_ratio         # coordination vs productive tokens
      alerts:
        orchestration_ratio:
          info: 0.30                  # info if coordination > 30% of total
          warn: 0.50                  # warn if coordination > 50% of total
          critical: 0.70              # critical if coordination > 70% of total
        retry_rate_warn: 0.1          # warn if > 10% of calls need retries

Analytics metadata is append-only and never blocks execution. Failed analytics writes are logged and skipped; the agent's task is never delayed by telemetry.
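The alert thresholds in the configuration above could be evaluated along these lines. This is a sketch only: the function names are hypothetical, and the strict greater-than comparison is inferred from the config comments.

```python
ORCHESTRATION_THRESHOLDS = {"info": 0.30, "warn": 0.50, "critical": 0.70}
RETRY_RATE_WARN = 0.1


def orchestration_alert_level(ratio, thresholds=ORCHESTRATION_THRESHOLDS):
    """Return the most severe level whose threshold the ratio exceeds, or None."""
    level = None
    # Walk from least to most severe so the last match wins.
    for name in ("info", "warn", "critical"):
        if ratio > thresholds[name]:
            level = name
    return level


def retry_rate_exceeded(retried_calls, total_calls, warn=RETRY_RATE_WARN):
    """True when more than `warn` of all calls needed at least one retry."""
    if total_calls == 0:
        return False
    return retried_calls / total_calls > warn


level = orchestration_alert_level(0.55)
```

A ratio of 0.55 crosses the info and warn thresholds but not critical, so the resulting level is "warn". A ratio of exactly 0.30 raises no alert under the strict comparison assumed here.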
Coordination Error Taxonomy
When coordination metrics collection is enabled, the system classifies coordination errors into structured categories for targeted diagnosis. Detection is pluggable: every category has a cheap heuristic variant (regex/structural), and some also have an optional LLM-backed semantic variant (more accurate but more expensive, disabled by default).
Detection scope: detectors operate at SAME_TASK (single execution) or TASK_TREE (parent + delegate executions via parent_task_id linkage). Cross-agent data is sanitized via sanitize_message before inclusion.
| Error Category | Description | Heuristic Variant | Semantic Variant | Default Scope |
|---|---|---|---|---|
| Logical contradiction | Agent asserts both "X is true" and "X is false" | Regex assertion matching | LLM reasoning over assistant texts | SAME_TASK |
| Numerical drift | Accumulated errors from cascading rounding (>5% deviation) | Context-labelled number extraction + % drift | LLM cross-verification of numerical claims | SAME_TASK |
| Context omission | Failure to reference previously established entities | Capitalized entity set diff (first-half/second-half) | LLM entity introduction/disposition tracking | SAME_TASK |
| Coordination failure | Message misinterpretation, task allocation conflicts | Tool errors + error finish reasons | LLM classification of coordination breakdowns | SAME_TASK |
| Delegation protocol violation | Broken delegation chains, missing parent linkage | Structural check: parent_task_id, delegation_chain integrity | - | TASK_TREE |
| Review pipeline violation | PASS without stages, PASS contradicting FAIL stage | Structural check: verdict/stage consistency | - | TASK_TREE |
| Authority breach attempt | Execution cost exceeding authority budget limit | Budget comparison: total turn cost vs limit | - | SAME_TASK |
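To make the heuristic style concrete, here is a rough sketch of a numerical drift check built on the 5% threshold from the table. The regular expression, the function name, and the choice of the first mention as the reference value are all assumptions for illustration and are not the project's actual detector.

```python
import re

# Hypothetical pattern: matches "label = 12.5" style statements.
LABELLED_NUMBER = re.compile(r"(\w+)\s*=\s*(-?\d+(?:\.\d+)?)")
DRIFT_THRESHOLD = 0.05  # the >5% deviation from the table above


def find_numerical_drift(texts, threshold=DRIFT_THRESHOLD):
    """Return (label, baseline, value, drift) tuples for drifting values."""
    baselines = {}
    findings = []
    for text in texts:
        for label, raw in LABELLED_NUMBER.findall(text):
            value = float(raw)
            if label not in baselines:
                baselines[label] = value  # first mention is the reference
                continue
            baseline = baselines[label]
            if baseline == 0:
                continue  # relative drift is undefined for a zero baseline
            drift = abs(value - baseline) / abs(baseline)
            if drift > threshold:
                findings.append((label, baseline, value, drift))
    return findings


# Made-up assistant texts from a single task.
texts = ["total = 100", "after rounding, total = 103", "final total = 106"]
drift_findings = find_numerical_drift(texts)
```

In this example the value 103 stays within tolerance (3% drift), while 106 is flagged because it deviates 6% from the first stated total.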
Pipeline architecture: detectors implement the Detector protocol and are discovered dynamically from ErrorTaxonomyConfig.detectors (a dict mapping ErrorCategory to per-category variant/scope config). When multiple variants target the same category, a CompositeDetector runs them concurrently and deduplicates findings by (turn_range, description_hash, category).
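The concurrent run and deduplication step might look roughly like the following. Finding, CompositeDetectorSketch, and the two variant coroutines are hypothetical names; the real Detector protocol and finding schema are not shown in this section, and SHA-256 is just one possible choice for description_hash.

```python
import asyncio
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    category: str
    turn_range: tuple
    description: str

    def dedup_key(self):
        # (turn_range, description_hash, category), as described above.
        digest = hashlib.sha256(self.description.encode("utf-8")).hexdigest()
        return (self.turn_range, digest, self.category)


class CompositeDetectorSketch:
    """Runs several variants for one category and merges their findings."""

    def __init__(self, variants):
        self._variants = list(variants)

    async def detect(self, conversation):
        # Run all variants concurrently; gather preserves variant order.
        results = await asyncio.gather(*(v(conversation) for v in self._variants))
        seen = set()
        unique = []
        for findings in results:
            for finding in findings:
                key = finding.dedup_key()
                if key not in seen:
                    seen.add(key)
                    unique.append(finding)
        return unique


async def heuristic_variant(conversation):
    return [Finding("logical_contradiction", (2, 5), "X asserted true and false")]


async def semantic_variant(conversation):
    return [
        Finding("logical_contradiction", (2, 5), "X asserted true and false"),
        Finding("logical_contradiction", (7, 9), "Y asserted true and false"),
    ]


composite = CompositeDetectorSketch([heuristic_variant, semantic_variant])
findings = asyncio.run(composite.detect([]))
```

Both variants report the same contradiction over turns 2 to 5, so only one copy survives alongside the finding unique to the semantic variant.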
Downstream sinks: ClassificationSink protocol enables wiring findings into the performance tracker (PerformanceTrackerSink) and notification dispatcher (NotificationDispatcherSink, threshold-filtered).
Cost control: LLM semantic variants share the provider's rate limiter and track per-classification-run cost against classification_budget_per_task.
Error taxonomy classification runs post-execution (never blocks agent work) and logs structured events to the observability layer. Enable via coordination_metrics.error_taxonomy.enabled: true.
Error categories are derived from Kim et al., 2025 and the Multi-Agent System Failure Taxonomy (MAST) by Cemri et al. (2025).