Verification & Quality¶
This page covers the quality-assurance pipeline attached to agent output: the verification stage that runs after an agent completes a task, the harness middleware that wraps every agent invocation, the review pipeline that validates produced artifacts, and the intake engine that ingests new work.
Verification Stage¶
Verification is a first-class stage in the workflow engine. Three converging research sources -- Marco DeepResearch (verification-centric agent frameworks), GEMS (five-stage agent loop with explicit Verifier), and the Anthropic three-agent harness (Planner/Generator/Evaluator with calibrated grading) -- all converge on verification as a separate agent with its own context, not a self-evaluation inside the generator step.
Workflow Node and Edge Types¶
WorkflowNodeType.VERIFICATION is a control-flow node like CONDITIONAL. Three dedicated edge types route verification outcomes:
VERIFICATION_PASS-- artifact acceptedVERIFICATION_FAIL-- artifact rejected, routed to regenerationVERIFICATION_REFER-- confidence below threshold, escalated to human review
Blueprint validation enforces exactly one of each edge type per verification node.
Calibrated Rubric Grading¶
Each verification node references a VerificationRubric by name. A rubric contains:
- Criteria (
RubricCriterion) -- weighted dimensions withbinary,ternary, orscoregrade types - Calibration examples -- few-shot demonstrations for LLM graders
- Minimum confidence -- below this threshold, the verdict is overridden to
REFER
Built-in rubrics: frontend-design (four criteria: design/originality/craft/functionality) and default-task (correctness/completeness/probe-adherence).
Atomic Criteria Decomposition¶
Acceptance criteria are decomposed into atomic binary probes (AtomicProbe) via a pluggable CriteriaDecomposer protocol. The default LLMCriteriaDecomposer uses the medium-tier provider. An IdentityCriteriaDecomposer maps each criterion to one probe for deterministic testing.
Structured Handoff Artifacts¶
HandoffArtifact carries the payload, artifact references, probes, and optional rubric between stages. A model validator rejects self-handoff (from_agent_id == to_agent_id). Immutability is enforced by the frozen Pydantic model (frozen=True).
Self-Evaluation Rejection¶
Self-evaluation -- where the generator also judges its own output -- is explicitly rejected. Prior research documents that self-evaluation produces over-confidence and fails to catch the generator's own blind spots.
VerificationResult.evaluator_agent_idMUST differ from the generator agent ID -- enforced by model validator at construction.
Pluggable Grading¶
The RubricGrader protocol follows the standard protocol + strategy + factory + config discriminator pattern (mirroring engine/classification/). Variants: LLM (production) and HEURISTIC (testing/fallback). Configuration via VerificationConfig.
Harness Middleware Layer¶
The engine uses a composable middleware layer for cross-cutting concerns that span agent execution and multi-agent coordination. Two separate protocols serve two distinct pipelines.
Agent Middleware¶
Protocol: AgentMiddleware (engine/middleware/protocol.py). Six async hooks in declared order:
| Hook | Runs | Purpose |
|---|---|---|
before_agent |
Once on invocation | Load memory, validate input, record hashes |
before_model |
Before each model call | Trim history, redact PII, inject context |
wrap_model_call |
Around model call | Caching, dynamic tools, model swap |
wrap_tool_call |
Around tool execution | Inject context, gate tools |
after_model |
After model responds | Human-in-loop, assumption-violation checks |
after_agent |
Once on completion | Save results, notify, cleanup |
Composition: before_* left-to-right, after_* right-to-left, wrap_* onion-style (each wraps the next). Exceptions propagate to the classification pipeline.
Default chain: checkpoint_resume, delegation_chain_hash, authority_deference, sanitize_message, security_interceptor, policy_gate, approval_gate, assumption_violation, classification, cost_recording.
Optional middleware (registered in _AGENT_OPT_IN, must be enabled explicitly):
SemanticDriftDetector(after_modelslot) -- compares model output against task acceptance criteria using cosine similarity. Opt-in viaCompanyConfig.security.semantic_drift_enabled. Fail-soft: logs warnings but never blocks.
Coordination Middleware¶
Protocol: CoordinationMiddleware (engine/middleware/coordination_protocol.py). Five async hooks:
| Hook | Pipeline Position | Purpose |
|---|---|---|
before_decompose |
Before Phase 1 | Clarification gate |
after_decompose |
After Phase 1 | Post-decomposition analysis |
before_dispatch |
Before Phase 3-5 | Plan review gate, task ledger |
after_rollup |
After Phase 6 | Progress ledger, replan hook |
before_update_parent |
Before Phase 7 | Authority deference scan |
Default chain: clarification_gate, task_ledger, plan_review_gate, progress_ledger, coordination_replan, authority_deference_coordination.
S1 Constraint Hooks¶
| Middleware | Hook | Behavior |
|---|---|---|
AuthorityDeferenceGuard |
before_agent |
Detects authority cues in transcripts, logs patterns, injects justification header |
AssumptionViolationMiddleware |
after_model |
Detects broken assumptions, emits escalation events |
ClarificationGateMiddleware |
before_decompose |
Validates acceptance criteria specificity |
DelegationChainHashMiddleware |
before_agent |
Records SHA-256 content hash for delegation drift detection |
Configuration¶
Per-company: CompanyConfig.middleware (MiddlewareConfig) with agent and coordination sub-configs.
Per-task: Task.middleware_override replaces the company-level chain when set.
Error Semantics¶
Middleware exceptions propagate to the classification pipeline. ClassificationResult.action decides: retry, escalate, or fail. No silent swallowing.
Review Pipeline¶
The review pipeline provides a configurable chain of review stages for tasks
in IN_REVIEW status. See the Client Simulation design
page for the full architecture, including ReviewStage protocol, pipeline
execution semantics, and metadata tracking.
Key design decisions:
- No new TaskStatus values for pipeline tracking -- tasks stay
IN_REVIEWthroughout; progress is tracked in task metadata. - Short-circuit on FAIL -- first failing stage sends the task back to
IN_PROGRESSfor rework with the stage name and reason in metadata. - Backward compatible -- when no pipeline is configured, the existing
ReviewGateServicesingle-stage behavior is preserved.
Intake Engine¶
The intake engine processes ClientRequest submissions through an independent
state machine (RequestStatus) before creating tasks in the task engine. See
Client Simulation for the full request lifecycle and
intake strategy contracts.
See Also¶
- Task & Workflow Engine -- task dispatch, state coordination
- Agent Execution -- per-agent execution loop
- Coordination -- multi-agent topology, decomposition
- Design Overview -- full index