LLM Call Analytics and Coordination Metrics

Every LLM provider call is tracked with comprehensive metadata for financial reporting, debugging, and orchestration overhead analysis. These analytics drive the multi-agent tuning signals (orchestration ratio, coordination efficiency, error amplification) that complement the budget and cost controls.

Per-Call Tracking and Proxy Overhead Metrics

Every completion call produces a CompletionResponse with TokenUsage (token counts and cost). The engine layer wraps this in a CostRecord (adding agent and task context) and records it in CostTracker. At task completion, the engine also logs proxy overhead metrics:

  • turns_per_task: number of LLM turns to complete the task
  • tokens_per_task: total tokens consumed
  • cost_per_task: total cost in configured currency
  • duration_seconds: wall-clock execution time
  • prompt_tokens: estimated system prompt tokens
  • prompt_token_ratio: ratio of prompt tokens to total tokens (overhead indicator; warns when >0.3)

These are natural overhead indicators; a task consuming 15 turns and 50k tokens for a one-line fix signals a problem. Metrics are captured in TaskCompletionMetrics, a frozen Pydantic model with a from_run_result() factory method.
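As an illustration of the fields listed above, the following is a minimal sketch of how such a metrics object might be derived from a run result. It uses a frozen standard-library dataclass in place of the Pydantic model so that it stays self-contained; `RunResult`, its field names, and the `prompt_overhead_warning` property are assumptions made for this example, not the project's actual API.

```python
from dataclasses import dataclass

PROMPT_RATIO_WARN = 0.3  # warning threshold quoted in the list above


@dataclass(frozen=True)
class RunResult:
    """Hypothetical stand-in for the engine's run result."""
    turns: int
    total_tokens: int
    total_cost: float
    duration_seconds: float
    prompt_tokens: int


@dataclass(frozen=True)
class TaskCompletionMetricsSketch:
    """Illustrative, immutable container for the proxy overhead metrics."""
    turns_per_task: int
    tokens_per_task: int
    cost_per_task: float
    duration_seconds: float
    prompt_tokens: int
    prompt_token_ratio: float

    @classmethod
    def from_run_result(cls, result: RunResult) -> "TaskCompletionMetricsSketch":
        # Guard against division by zero for runs that consumed no tokens.
        ratio = (
            result.prompt_tokens / result.total_tokens
            if result.total_tokens
            else 0.0
        )
        return cls(
            turns_per_task=result.turns,
            tokens_per_task=result.total_tokens,
            cost_per_task=result.total_cost,
            duration_seconds=result.duration_seconds,
            prompt_tokens=result.prompt_tokens,
            prompt_token_ratio=ratio,
        )

    @property
    def prompt_overhead_warning(self) -> bool:
        # True when the system prompt accounts for more than 30% of all tokens.
        return self.prompt_token_ratio > PROMPT_RATIO_WARN


metrics = TaskCompletionMetricsSketch.from_run_result(
    RunResult(
        turns=15,
        total_tokens=50_000,
        total_cost=1.25,
        duration_seconds=420.0,
        prompt_tokens=20_000,
    )
)
print(metrics.prompt_token_ratio, metrics.prompt_overhead_warning)
```

Freezing the object means a completed task's metrics cannot be mutated after the fact, which suits append-only analytics.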

Call Categorisation and Orchestration Ratio

When multi-agent coordination exists, each CostRecord is tagged with a call category:

| Category | Description | Examples |
|---|---|---|
| productive | Direct task work: tool calls, code generation, task output | Agent writing code, running tests |
| coordination | Inter-agent communication: delegation, reviews, meetings | Manager reviewing work, agent presenting in meeting |
| system | Framework overhead: system prompt injection, context loading | Initial prompt, memory retrieval injection |
| embedding | Embedding model calls: memory store/retrieve vectorization | Mem0 store embedding, similarity search query embedding |

The orchestration ratio (coordination / total) is surfaced in metrics and alerts. If coordination tokens consistently exceed productive tokens, the company configuration needs tuning (fewer approval layers, simpler meeting protocols, etc.).
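The ratio itself is simple arithmetic over tagged records. The sketch below assumes each record reduces to a `(category, tokens)` pair and that "total" means tokens across all four categories; both are assumptions for illustration, and `orchestration_ratio` is a hypothetical helper rather than the project's function.

```python
from collections import Counter
from typing import Iterable, Tuple


def orchestration_ratio(records: Iterable[Tuple[str, int]]) -> float:
    """Return coordination tokens divided by total tokens (0.0 if no tokens)."""
    totals: Counter = Counter()
    for category, tokens in records:
        totals[category] += tokens
    total = sum(totals.values())
    return totals["coordination"] / total if total else 0.0


# Hypothetical token counts for one task, one entry per tagged call.
records = [
    ("productive", 6000),
    ("coordination", 3000),
    ("system", 800),
    ("embedding", 200),
]
ratio = orchestration_ratio(records)
print(f"orchestration ratio: {ratio:.2f}")
```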

Coordination Metrics Suite

The suite comprises nine coordination metrics derived from empirical agent scaling research (Kim et al., 2025). These metrics explain coordination dynamics and enable data-driven tuning of multi-agent configurations.

| Metric | Symbol | Definition | What It Signals |
|---|---|---|---|
| Coordination efficiency | Ec | success_rate / (turns / turns_sas): success normalised by relative turn count vs single-agent baseline | Overall coordination ROI. Low Ec = coordination costs exceed benefits |
| Coordination overhead | O% | (turns_mas - turns_sas) / turns_sas * 100%: relative turn increase | Communication cost. Optimal band: 200-300%. Above 400% = over-coordination |
| Error amplification | Ae | error_rate_mas / error_rate_sas: relative failure probability | Whether MAS corrects or propagates errors. Centralised ~4.4x, Independent ~17.2x |
| Message density | c | Inter-agent messages per reasoning turn | Communication intensity. Performance saturates at ~0.39 messages/turn |
| Redundancy rate | R | Mean cosine similarity of agent output embeddings | Agent agreement. Optimal at ~0.41 (balances fusion with independence) |
| Amdahl ceiling | Sc | Theoretical max speedup from Amdahl's Law given parallelizable fraction | Diminishing returns threshold. Recommends ideal team size |
| Straggler gap | Gs | (slowest_turn - median_turn) / median_turn | Bottleneck severity. High gap = one agent blocks the group |
| Token-speedup ratio | Rt | total_tokens / speedup_factor | Cost efficiency of parallelism. Rising ratio = diminishing token ROI |
| Message overhead | Mo | Pairwise message count relative to team size | Quadratic communication detection. is_quadratic flag when O(n^2) |

All 9 metrics are opt-in via coordination_metrics.enabled in analytics config. Ec and O% are cheap (turn counting). Ae requires baseline comparison data. R requires semantic analysis of agent outputs (embedding cosine similarity). c, Sc, Gs, Rt, and Mo are computed from execution telemetry (turn counts, token usage, message logs).
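Several of the definitions in the table reduce to one-line formulas. The sketch below transcribes five of them directly; the function names are hypothetical, the `turns` term in Ec is read as the multi-agent turn count, and the Amdahl ceiling uses the standard finite-worker form of Amdahl's Law, which may differ from how the project computes Sc.

```python
from statistics import median
from typing import Sequence


def coordination_efficiency(success_rate: float, turns_mas: float, turns_sas: float) -> float:
    """Ec: success rate normalised by turn count relative to the SAS baseline."""
    return success_rate / (turns_mas / turns_sas)


def coordination_overhead_pct(turns_mas: float, turns_sas: float) -> float:
    """O%: relative turn increase over the single-agent baseline, in percent."""
    return (turns_mas - turns_sas) / turns_sas * 100.0


def error_amplification(error_rate_mas: float, error_rate_sas: float) -> float:
    """Ae: multi-agent error rate relative to the single-agent error rate."""
    return error_rate_mas / error_rate_sas


def straggler_gap(turn_durations: Sequence[float]) -> float:
    """Gs: (slowest_turn - median_turn) / median_turn."""
    med = median(turn_durations)
    return (max(turn_durations) - med) / med


def amdahl_ceiling(parallel_fraction: float, workers: int) -> float:
    """Sc: Amdahl's Law speedup for a given parallelizable fraction and team size."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


# Hypothetical run: 30 multi-agent turns against a 10-turn single-agent baseline.
print(coordination_efficiency(0.8, 30, 10))
print(coordination_overhead_pct(30, 10))
print(error_amplification(0.2, 0.05))
print(straggler_gap([10.0, 12.0, 30.0]))
print(amdahl_ceiling(0.75, 4))
```

These helpers do not guard against a zero baseline (`turns_sas`, `error_rate_sas`, or a zero median); a real implementation would need to decide how to report the metric when no baseline data exists.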

coordination_metrics:
  enabled: false                       # opt-in -- enable for data gathering
  collect:
    - efficiency                       # cheap -- turn counting
    - overhead                         # cheap -- turn counting
    - error_amplification              # requires SAS baseline data
    - message_density                  # requires message counting infrastructure
    - redundancy                       # requires embedding computation on outputs
    - amdahl_ceiling                   # computed from parallelizable fraction
    - straggler_gap                    # computed from per-agent turn times
    - token_speedup_ratio              # computed from token usage + speedup
    - message_overhead                 # computed from pairwise message counts
  error_taxonomy:
    enabled: false                     # opt-in -- enable for targeted diagnosis
    categories:
      - logical_contradiction
      - numerical_drift
      - context_omission
      - coordination_failure

The Ae baseline is a sliding window of recent single-agent (SAS) runs. Its size is the budget.baseline_window_size setting (default 50), sourced from the SYNTHORG_BUDGET_BASELINE_WINDOW_SIZE environment variable at API start. It is read-only post-init: the window is sized once when the baseline store is constructed, so a change requires a restart.
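The fixed-at-construction behaviour follows naturally from a bounded buffer. The sketch below uses `collections.deque` with `maxlen` as one plausible way to implement such a window; `BaselineStoreSketch` and the choice to store one failed/succeeded flag per run are assumptions, not the project's actual baseline store.

```python
import os
from collections import deque
from typing import Deque, Optional


class BaselineStoreSketch:
    """Sliding window over the most recent single-agent (SAS) run outcomes."""

    def __init__(self, window_size: int) -> None:
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        # maxlen is fixed when the deque is created, so resizing means rebuilding.
        self._failures: Deque[bool] = deque(maxlen=window_size)

    def record(self, failed: bool) -> None:
        # Appending to a full deque silently evicts the oldest entry.
        self._failures.append(failed)

    def error_rate(self) -> Optional[float]:
        """Return the SAS error rate over the window, or None without data."""
        if not self._failures:
            return None
        return sum(self._failures) / len(self._failures)


# Read the setting once at start-up, falling back to the documented default of 50.
configured_size = int(os.environ.get("SYNTHORG_BUDGET_BASELINE_WINDOW_SIZE", "50"))
production_store = BaselineStoreSketch(configured_size)

# A small window makes the eviction behaviour easy to see.
store = BaselineStoreSketch(window_size=3)
for failed in (True, True, False, False):
    store.record(failed)
print(store.error_rate())  # computed over the last three runs only
```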

Full Analytics Layer Configuration

The analytics layer can track expanded per-call metadata for financial and operational reporting:

call_analytics:
  track:
    - call_category                    # productive, coordination, system, embedding
    - success                          # true/false
    - retry_count                      # 0 = first attempt succeeded
    - retry_reason                     # rate_limit, timeout, internal_error
    - latency_ms                       # wall-clock time for the call
    - finish_reason                    # stop, tool_use, max_tokens, error
    - cache_hit                        # prompt caching hit/miss (provider-dependent)
  aggregation:
    - per_agent_daily                  # agent spending over time
    - per_task                         # total cost per task
    - per_department                   # department-level rollups
    - per_provider                     # provider reliability and cost comparison
    - orchestration_ratio              # coordination vs productive tokens
  alerts:
    orchestration_ratio:
      info: 0.30                       # info if coordination > 30% of total
      warn: 0.50                       # warn if coordination > 50% of total
      critical: 0.70                   # critical if coordination > 70% of total
    retry_rate_warn: 0.1               # warn if > 10% of calls need retries

Analytics metadata is append-only and never blocks execution. Failed analytics writes are logged and skipped; the agent's task is never delayed by telemetry.
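The three orchestration_ratio thresholds in the configuration above map a ratio to at most one alert level. The following sketch shows one way to evaluate them, assuming a strict "greater than" comparison as the config comments suggest; `alert_level` is a hypothetical helper.

```python
from typing import Optional

# Thresholds copied from the alerts.orchestration_ratio block, highest first.
THRESHOLDS = (("critical", 0.70), ("warn", 0.50), ("info", 0.30))


def alert_level(ratio: float) -> Optional[str]:
    """Return the most severe level whose threshold the ratio exceeds."""
    for level, threshold in THRESHOLDS:
        if ratio > threshold:
            return level
    return None


for sample in (0.25, 0.45, 0.55, 0.90):
    print(sample, alert_level(sample))
```

Checking the highest threshold first ensures a ratio of 0.90 is reported once as critical rather than three times at increasing severity.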

Coordination Error Taxonomy

When coordination metrics collection is enabled, the system classifies coordination errors into structured categories for targeted diagnosis. Detection is pluggable: each category has a cheap heuristic variant (regex/structural), and the categories that list one in the table below also have an optional LLM-backed semantic variant (accurate, expensive, disabled by default).

Detection scope: detectors operate at SAME_TASK (single execution) or TASK_TREE (parent + delegate executions via parent_task_id linkage). Cross-agent data is sanitized via sanitize_message before inclusion.

| Error Category | Description | Heuristic Variant | Semantic Variant | Default Scope |
|---|---|---|---|---|
| Logical contradiction | Agent asserts both "X is true" and "X is false" | Regex assertion matching | LLM reasoning over assistant texts | SAME_TASK |
| Numerical drift | Accumulated errors from cascading rounding (>5% deviation) | Context-labelled number extraction + % drift | LLM cross-verification of numerical claims | SAME_TASK |
| Context omission | Failure to reference previously established entities | Capitalized entity set diff (first-half/second-half) | LLM entity introduction/disposition tracking | SAME_TASK |
| Coordination failure | Message misinterpretation, task allocation conflicts | Tool errors + error finish reasons | LLM classification of coordination breakdowns | SAME_TASK |
| Delegation protocol violation | Broken delegation chains, missing parent linkage | Structural check: parent_task_id, delegation_chain integrity | - | TASK_TREE |
| Review pipeline violation | PASS without stages, PASS contradicting FAIL stage | Structural check: verdict/stage consistency | - | TASK_TREE |
| Authority breach attempt | Execution cost exceeding authority budget limit | Budget comparison: total turn cost vs limit | - | SAME_TASK |
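
To make the heuristic column more concrete, here is a rough sketch of what a numerical drift check could look like: it extracts `label = number` pairs, treats the first value seen for each label as the baseline, and flags labels whose later values deviate by more than 5%. The regular expression, the first-seen baseline, and the `numerical_drift` name are all assumptions for illustration; the actual detector may extract and compare numbers quite differently.

```python
import re
from typing import Dict, Iterable

# Matches "label = 12.5" or "label: 12.5" with a single-word label.
LABELLED_NUMBER = re.compile(r"\b([A-Za-z_]+)\s*[:=]\s*(-?\d+(?:\.\d+)?)")


def numerical_drift(texts: Iterable[str], threshold: float = 0.05) -> Dict[str, float]:
    """Return labels whose value drifted beyond the threshold, with the latest drift."""
    first_seen: Dict[str, float] = {}
    drifted: Dict[str, float] = {}
    for text in texts:
        for label, raw in LABELLED_NUMBER.findall(text):
            key = label.lower()
            value = float(raw)
            # The first occurrence of a label becomes its baseline.
            baseline = first_seen.setdefault(key, value)
            if baseline == 0:
                continue  # relative drift is undefined for a zero baseline
            drift = abs(value - baseline) / abs(baseline)
            if drift > threshold:
                drifted[key] = drift
    return drifted


turns = [
    "Computed total = 100, count = 8",
    "After rounding, total = 103",
    "Final report: total = 107, count = 8",
]
print(numerical_drift(turns))
```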

Pipeline architecture: detectors implement the Detector protocol and are discovered dynamically from ErrorTaxonomyConfig.detectors (a dict mapping ErrorCategory to per-category variant/scope config). When multiple variants target the same category, a CompositeDetector runs them concurrently and deduplicates findings by (turn_range, description_hash, category).
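
The concurrent-run-then-deduplicate pattern described above can be sketched with `asyncio.gather`. Everything here is illustrative: `Finding`, the `detect` method signature, `FixedDetector`, and the use of SHA-256 for the description hash are assumptions, and only the deduplication key `(turn_range, description_hash, category)` comes from the text.

```python
import asyncio
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Sequence, Set, Tuple


@dataclass(frozen=True)
class Finding:
    category: str
    turn_range: Tuple[int, int]
    description: str

    @property
    def dedupe_key(self) -> Tuple[Tuple[int, int], str, str]:
        digest = hashlib.sha256(self.description.encode("utf-8")).hexdigest()
        return (self.turn_range, digest, self.category)


class Detector(Protocol):
    async def detect(self, transcript: Sequence[str]) -> List[Finding]: ...


class CompositeDetectorSketch:
    """Runs several variants for one category and merges their findings."""

    def __init__(self, detectors: Iterable[Detector]) -> None:
        self._detectors = list(detectors)

    async def detect(self, transcript: Sequence[str]) -> List[Finding]:
        # Run all variants concurrently; gather preserves detector order.
        results = await asyncio.gather(
            *(d.detect(transcript) for d in self._detectors)
        )
        seen: Set[Tuple[Tuple[int, int], str, str]] = set()
        unique: List[Finding] = []
        for findings in results:
            for finding in findings:
                if finding.dedupe_key not in seen:
                    seen.add(finding.dedupe_key)
                    unique.append(finding)
        return unique


class FixedDetector:
    """Test double that returns a canned list of findings."""

    def __init__(self, findings: List[Finding]) -> None:
        self._findings = findings

    async def detect(self, transcript: Sequence[str]) -> List[Finding]:
        return list(self._findings)


dup = Finding("logical_contradiction", (2, 4), "X asserted true and false")
other = Finding("logical_contradiction", (5, 6), "Y asserted true and false")
composite = CompositeDetectorSketch(
    [FixedDetector([dup]), FixedDetector([dup, other])]
)
findings = asyncio.run(composite.detect(["turn text"]))
print(len(findings))
```

Keying on a hash of the description rather than the full text keeps the dedupe set small when descriptions are long, at the cost of treating only byte-identical descriptions as duplicates.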

Downstream sinks: ClassificationSink protocol enables wiring findings into the performance tracker (PerformanceTrackerSink) and notification dispatcher (NotificationDispatcherSink, threshold-filtered).

Cost control: LLM semantic variants share the provider's rate limiter and track per-classification-run cost against classification_budget_per_task.

Error taxonomy classification runs post-execution (never blocks agent work) and logs structured events to the observability layer. Enable via coordination_metrics.error_taxonomy.enabled: true.

The error categories are derived from Kim et al. (2025) and the Multi-Agent System Failure Taxonomy (MAST) by Cemri et al. (2025).