Evaluation Loop¶

The closed-loop evaluation framework continuously measures agent performance and identifies improvement opportunities. It is built on top of the existing five-pillar evaluation, performance tracking, trajectory scoring, and training infrastructure documented in HR and Agent Lifecycle.

Closed-Loop Architecture¶

flowchart LR
    T[Trace Capture] --> BT[Behavior Tagging]
    BT --> EE[Eval Enrichment<br/>5 Pillars]
    EE --> PI[Pattern Identification<br/>stub]
    PI --> FP[Fix Proposal<br/>stub]
    FP --> V[Validation<br/>next run]
    V -->|feedback| T

The EvalLoopCoordinator orchestrates existing services into cycles: collect metrics, enrich with five-pillar evaluation, identify failure patterns (stub), propose targeted fixes (stub), and validate via next-run trajectory scores.

Behaviour Tagging¶

Each turn is tagged with one or more BehaviorTag categories for fine-grained trace analysis and eval routing:

Tag	Description
`file_operations`	Read, write, edit, list, grep
`retrieval`	Search, multi-hop document synthesis
`tool_use`	Generic tool selection and chaining
`memory`	Recall, preference extraction, persistence
`conversation`	Multi-turn dialogue, clarifying questions
`summarization`	Context overflow handling
`delegation`	Sub-agent spawning, handoff
`coordination`	Multi-agent pipeline waves
`verification`	Rubric grading, quality assessment

Tags are inferred by BehaviorTaggerMiddleware (opt-in, after_model slot) via tool-name pattern matching. Stored on TurnRecord.behavior_tags.

Efficiency Ratios¶

Per-run efficiency ratios measured against IdealTrajectoryBaseline:

Step ratio: observed steps / ideal steps (1.0 = on target)
Tool call ratio: observed calls / ideal calls
Latency ratio: observed time / ideal time
Verbosity ratio: observed output tokens / ideal tokens (from SlopCodeBench)
Structural erosion score: composite 0.0--1.0 (duplicated blocks, cyclomatic complexity delta, dead-branch ratio)
PTE: Prefill Token Equivalents (hardware-aware cost metric, from arXiv:2604.05404)
PTE ratio: observed PTE / ideal PTE

Baselines are human-curated and versioned, not auto-updated from observed runs.

Quality Erosion Detection¶

A stagnation variant (StagnationReason.QUALITY_EROSION): the agent keeps working but structural_erosion_score climbs past threshold (default 0.5). QualityErosionDetector implements the StagnationDetector protocol and fires corrective prompt injection or termination.

External Benchmarks¶

A pluggable ExternalBenchmark protocol allows adopting external benchmark suites without modifying the framework. ExternalBenchmarkRegistry manages registration and execution.

Agent Evaluation Testing¶

Tests tagged @pytest.mark.agent_eval(category="file_operations") run separately from the default unit suite:

uv run python -m pytest tests/evals/ -n 8 --eval-timeout=300

The n1_prefix_replay utility replays the first N-1 turns of a recorded trace and lets the agent generate only the final turn for regression testing.

CI: evals.yml runs nightly and on the run-evals label.