ACG Formalism Evaluation for SynthOrg Engine Architecture¶

Dated snapshot

This evaluation is a point-in-time record. References to progressive trust and seniority-based agent selection describe designs that have since been removed; authority now derives from the role reporting graph (see HR & Agent Lifecycle).

Context¶

The IBM/RPI survey "From Static Templates to Dynamic Runtime Graphs" (arXiv:2603.22386, companion repository: IBM awesome list) introduces the Agentic Computation Graph (ACG) formalism to unify the vocabulary for describing agent workflows: from static templates to dynamic runtime graphs, scheduling policies, execution traces, and mutation strategies. SynthOrg has all these concepts but expresses them through domain-specific names (execution loops, hybrid plans, turn records, coordination topology). This evaluation assesses whether adopting ACG vocabulary would improve architecture clarity, identify gaps, and inform design decisions for structural credit assignment and agent pruning.

ACG Vocabulary Mapping¶

The following table maps each ACG concept to its SynthOrg equivalent, with fidelity assessment and source file references.

Core Graph Concepts¶

ACG Concept	SynthOrg Equivalent	Source	Fidelity	Notes
ACG Template	`CompanyConfig` + Company YAML	`src/synthorg/core/company.py`, `src/synthorg/config/schema.py`	Partial	ACG templates are graph-level (workflow topology). SynthOrg's YAML is org-level (agent roster, tool permissions, budget). Closer analogue would be `WorkflowDefinition` for workflow templates.
Realised Graph	`AgentContext` + `TaskExecution` + `CoordinationResult`	`src/synthorg/engine/context.py`, `src/synthorg/engine/coordination/models.py`	Strong	The realised graph IS the running state: context, history, accumulated cost, current position. Multi-agent coordination adds `CoordinationPhaseResult` per phase.
Execution Trace	`tuple[TurnRecord, ...]` in `ExecutionResult` + observability events	`src/synthorg/engine/loop_protocol.py`, `src/synthorg/observability/events/`	Strong	SynthOrg's trace is richer than ACG baseline: per-turn cost, token usage, tool fingerprints, stagnation signals, quality scores. 100+ event constant domains.
Nodes (atomic actions)	LLM calls (`call_provider`), tool invocations (`execute_tool_calls`), validation gates (`check_budget`, `check_stagnation`)	`src/synthorg/engine/loop_helpers.py`	Partial	Node typing is implicit in loop control flow, not a first-class abstraction. There is no `Node` type; actions are identified by function names and turn records.
Edges (control/data flow)	`SubtaskDefinition.dependencies` DAG, `DecompositionPlan.dependency_edges`	`src/synthorg/engine/decomposition/models.py`	Strong for multi-agent	Edges are explicit in multi-agent decomposition (dependency DAG). Implicit in single-agent loops (sequential execution order, no formal edge representation).
Scheduling Policies	`AutoLoopConfig` + `select_loop_type()` + `CoordinationConfig` + `AutoTopologyConfig`	`src/synthorg/engine/loop_selector.py`, `src/synthorg/engine/routing/models.py`	Strong	Three-way loop selection (react/plan-execute/hybrid) and topology selection (SAS/centralised/decentralised/context-dependent) are scheduling policies. Budget-aware downgrade is a resource-constrained policy.

Dynamic Behaviour Concepts¶

ACG Concept	SynthOrg Equivalent	Source	Fidelity	Notes
Conditional branching	HybridLoop replan decisions, PlanExecuteLoop step completion	`src/synthorg/engine/hybrid_loop.py`, `src/synthorg/engine/hybrid/step_helpers.py`, `src/synthorg/engine/hybrid/replan_helpers.py`	Partial	Branching is embedded in loop logic (replan if step fails), not graph-level conditional edges. No formal "if node X succeeds, take edge Y" representation.
Parallel composition	`ParallelExecutor`, `CoordinationWave`, `asyncio.TaskGroup`	`src/synthorg/engine/parallel.py`, `src/synthorg/engine/coordination/models.py`	Strong	Parallel waves in coordination are first-class. `ParallelExecutor` handles concurrent subtask dispatch with `fail_fast` semantics.
Graph mutation	Hybrid replanning (`attempt_replan`), stagnation correction injection	`src/synthorg/engine/hybrid/replan_helpers.py`, `src/synthorg/engine/stagnation/`	Partial	Replanning mutates the execution plan (new subtask list). Stagnation correction injects a new message. These are graph mutations but are not described in those terms.
Termination conditions	`TerminationReason` enum (7 values: COMPLETED, MAX_TURNS, BUDGET_EXHAUSTED, SHUTDOWN, PARKED, STAGNATION, ERROR)	`src/synthorg/engine/loop_protocol.py`	Strong	Richer than typical ACG termination models. 7 named reasons provide precise signal for recovery and routing decisions.

Resource and Cost Concepts¶

ACG Concept	SynthOrg Equivalent	Source	Fidelity	Notes
Node cost	`TurnRecord.cost` per turn, `TokenUsage` per completion	`src/synthorg/engine/loop_protocol.py`, `src/synthorg/providers/models.py`	Strong	Per-turn cost tracking with provider breakdown. Accumulated over execution via `ctx.accumulated_cost`.
Resource constraints	`BudgetEnforcer` (3-layer), quota degradation, context budget	`src/synthorg/budget/enforcer.py`, `src/synthorg/engine/context_budget.py`	Strong (exceeds ACG)	SynthOrg's resource model is more sophisticated than ACG: multi-layer enforcement, per-agent daily limits, context fill tracking, risk budget.
Quality-cost tradeoffs	Budget-aware loop downgrade (hybrid->plan_execute at 80%), model auto-downgrade, quota degradation strategies	`src/synthorg/engine/loop_selector.py`, `src/synthorg/budget/enforcer.py`	Strong	Explicit tradeoff mechanisms with hard budget caps. Downgrade-only at task boundaries (consistency guarantee).

Concepts SynthOrg Has That ACG Does Not Capture¶

Progressive trust: agent trust levels (RESTRICTED/STANDARD/ELEVATED) with human approval for promotion. No ACG equivalent.
Personality and behavioural configuration: PersonalityConfig with Big Five + behavioural enums affecting decision style. No ACG equivalent.
Memory injection: episodic and procedural memory retrieval shaping context before execution. No ACG equivalent.
Prompt profiles: verbosity adaptation by model tier. No ACG equivalent.
Autonomy levels: 4 presets (full/semi/supervised/locked) with tool permission gating. No ACG equivalent.

ACG Survey Findings Validation¶

The survey identifies four structural findings that apply to SynthOrg's architecture:

Claim: When an agent workflow uses the wrong graph structure (e.g., sequential when parallel would be correct), prompt engineering cannot compensate. Structural changes yield greater gains.

SynthOrg validation: Confirmed by the loop selector. The auto-selector maps task complexity to loop type (simple->react, medium->plan_execute, complex->hybrid). The documentation for this design explicitly states that choosing the wrong loop for a task complexity degrades quality beyond what prompt tuning can recover. The budget-aware downgrade (hybrid->plan_execute at 80% monthly) is a deliberate quality tradeoff, accepted because the budget constraint makes the simpler structure correct.

Implication: The loop selector is doing real structural work. Adding complexity to system prompts for tasks that should use a different loop is not a substitute. This validates investing in the auto-selector's classification accuracy over prompt length.

Finding 2: Strong Verifiers Enable More Aggressive Graph Mutation¶

Claim: Systems with high-quality output verification can mutate graphs more aggressively (add/remove nodes, change topology) because they can catch degradation early.

SynthOrg validation: Partially confirmed. The quality scoring system (L2+L3 in src/synthorg/engine/quality/) provides per-step quality signals. The hybrid loop's replan trigger uses these signals to decide whether to add a replan step (graph mutation). However, the quality verifier's confidence is not used to modulate mutation aggressiveness; the replan threshold is fixed, not adaptive to verifier confidence.

Implication: As the quality scoring system matures, consider making the replan threshold adaptive to verifier confidence. High-confidence quality signal -> allow more replans. Low-confidence signal -> conserve budget.

Finding 3: Selection/Pruning from a Super-Graph Beats Unconstrained Generation¶

Claim: Starting from a well-designed set of node/edge templates and selecting subsets outperforms generating arbitrary workflows from scratch.

SynthOrg validation: Strongly confirmed. The Company YAML and 33 built-in roles in src/synthorg/core/role_catalog.py are a super-graph of organisational patterns. Template packs in api/controllers/template_packs.py apply curated patterns. The meeting protocols (3 variants) and loop types (3 variants) are a bounded selection space rather than open-ended generation.

Implication: Adding more template packs and expanding the super-graph is a higher-value investment than adding more free-form configuration options.

Finding 4: Quality-Cost Tradeoffs Must Be Explicit with Hard Budget Caps¶

Claim: Agentic workflows need explicit Pareto frontier navigation between output quality and token cost, with hard caps preventing runaway spending.

SynthOrg validation: Confirmed. The budget system has hard caps at multiple levels (per-task, per-agent daily, monthly hard stop). Model auto-downgrade is an explicit quality-cost tradeoff. The DegradationConfig and quota degradation strategies (alert/fallback/queue) are explicit Pareto navigation mechanisms. The coordination metrics (Amdahl ceiling, straggler gap) provide efficiency bounds.

Implication: The existing budget architecture is sound. The missing piece is exposing the quality-cost tradeoffs via the REST API: specifically, GET /tasks/{id} response and the CoordinationResult Python type should surface cost, quality, and efficiency metadata (estimated cost, actual cost, quality score, Amdahl ceiling, straggler gap). See #688 coordination metrics gap (Gap G4) for the full scoping.

Structural Credit Assignment¶

Problem Statement¶

In multi-agent task pipelines, when a downstream subtask fails, it is not always clear whether the root cause is:

Direct failure: The assigned agent's execution failed
Upstream contamination: The agent received poor-quality input from a predecessor
Coordination overhead: The routing decision created an inefficient handoff
Quality gate propagation: The agent passed quality gates but the downstream consumer found a defect

SynthOrg currently attributes all failure information to the executing agent's TaskExecution:

infer_failure_category() in src/synthorg/engine/recovery.py is keyword-based heuristic classification applied per-execution, not per-agent in a coordination run
RecoveryResult captures one failure_category per execution
CoordinationResult has CoordinationPhaseResult per phase but no per-agent attribution
The 5-pillar evaluation in src/synthorg/hr/evaluation/evaluator.py scores agents over time windows, not for specific pipeline failures

Proposed Design¶

AgentContribution model integrates with CoordinationResult:

Note: CoordinationResult has model_config = ConfigDict(frozen=True). Adding agent_contributions directly is a breaking change. The recommended approach is a separate wrapper: CoordinationResultWithAttribution(result: CoordinationResult, agent_contributions: tuple[AgentContribution, ...]), stored and returned in place of the bare result by _post_execution_pipeline. This preserves immutability and avoids migrating existing persisted CoordinationResult records.

class AgentContribution(BaseModel):
    """Per-agent attribution within a coordination run."""
    agent_id: str
    subtask_id: str
    contribution_score: float  # 0.0 = no contribution, 1.0 = fully responsible
    failure_attribution: Literal[
        "direct", "upstream_contamination", "coordination_overhead", "quality_gate"
    ] | None
    evidence: str | None  # pointer to error findings or quality signal

Attribution algorithm:

Topological sort of DecompositionResult.dependency_edges
For each failing subtask, walk backward through dependency edges
Classify: if predecessor's StepQualitySignal is low, attribute "upstream_contamination" to the predecessor; if the local execution raised directly, attribute "direct" to the executing agent
Coordination overhead: if CoordinationMetrics.error_amplification > threshold, attribute a fraction to topology mismatch
Normalise contribution scores so they sum to 1.0 across the pipeline

Integration points:

Run as part of _post_execution_pipeline after coordination completes
Feed AgentContribution into PerformanceTracker.record_task_metric() for trend detection
Surface in GET /tasks/{id} response metadata for operator inspection

Scope note: This is a research recommendation, not an implementation spec. The minimum viable version introduces a CoordinationResultWithAttribution wrapper containing the original (immutable) CoordinationResult plus a list of AgentContribution objects populated per-agent subtask result using the existing keyword-heuristic from infer_failure_category(). This preserves CoordinationResult immutability with no changes to the frozen model.

Agent Pruning / Dropout Evaluation¶

Current State¶

The infrastructure for agent removal exists and is production-grade:

src/synthorg/hr/offboarding_service.py: OffboardingService, the full pipeline for agent removal (task reassignment, memory archival, team notification, status termination)
src/synthorg/hr/enums.py: FiringReason.PERFORMANCE exists as a reason code
src/synthorg/hr/performance/tracker.py: PerformanceTracker providing rolling windows, trend detection (Theil-Sen), quality and collaboration scoring

What does not exist: any automated trigger for OffboardingService.offboard() based on performance data. FiringReason.PERFORMANCE is defined but never programmatically invoked.

Pruning Signal Sources¶

Four signal categories that should drive pruning recommendations:

Performance trend: Theil-Sen slope below declining_threshold for the 30d window. Available from AgentPerformanceSnapshot.quality_trend in hr/performance/tracker.py.
Utilisation: Tasks assigned relative to team size. Low-utilisation agents are redundant overhead. Currently tracked via task records: a task-per-agent-per-window count would be the metric.
Skill redundancy: High Jaccard similarity of required skills with another agent on the team, combined with high routing substitutability (how often could the other agent have handled this agent's tasks based on RoutingCandidate.score).
Budget pressure: Monthly utilisation approaching threshold. When a team exceeds its budget allocation, pruning lowest-performing agents reduces future spend.

Proposed Protocol¶

PruningEvaluation (new model)
  agent_id: str
  pruning_score: float   # 0.0 = retain, 1.0 = prune
  signals: list[PruningSignal]  # which criteria triggered
  recommendation: Literal["PRUNE", "RETAIN", "MONITOR"]

PruningPolicy (new model)
  quality_decline_threshold: float   # Theil-Sen slope below which to flag
  utilization_minimum: float         # tasks-per-window below which to flag
  redundancy_threshold: float        # Jaccard similarity above which to flag
  cooldown_days: int                 # min time between pruning decisions
  min_team_size: int                 # never prune below this team size

PruningService (new service)
  evaluate(agent_id) -> PruningEvaluation
    # Reads from PerformanceTracker, RoutingHistory
  recommend_pruning(department_name) -> list[PruningEvaluation]
    # Scans all agents in department, returns sorted by pruning_score

Human approval gate: Any PruningEvaluation with recommendation="PRUNE" creates an ApprovalItem following the same approval pattern used by the hiring and promotion pipelines. Required fields:

id: unique UUID per PruningEvaluation
title: short summary, e.g. "Prune agent {agent_id} ({reason})"
description: rationale from PruningEvaluation.signals (quality decline slope, utilisation, Jaccard overlap), affected team, and safety constraint check results
requested_by: the PruningService identifier or calling system
action_type: "org:prune"
risk_level: ApprovalRiskLevel.MEDIUM
created_at: ISO 8601 timestamp

Pruning is never fully automated; it is recommendation plus human approval.

Safety Constraints¶

Minimum team size: Never prune if the team would fall below min_team_size
Unique skill protection: Never prune the last agent with a required skill that no other agent possesses (validated against RoutingHistory.required_skills)
Mid-task protection: Flag if agent has active task assignments; recommendation becomes "MONITOR" instead of "PRUNE" until tasks complete
Cooldown period: PruningPolicy.cooldown_days prevents consecutive pruning decisions
Seniority preference: When multiple agents are candidates, prefer pruning lower-seniority agents first

Relationship to HR Module¶

Pruning is the inverse of hiring. The same OffboardingService used for voluntary departures handles performance-based pruning. The FiringReason.PERFORMANCE code exists precisely for this. The only new infrastructure needed is:

PruningService to evaluate signals and generate recommendations
PruningPolicy config model
API endpoint to surface recommendations (GET /agents/pruning-recommendations or similar)
The approval flow reuses the existing ApprovalItem infrastructure

Relationship to ACG AgentDropout and Adaptive Graph Pruning¶

The ACG survey discusses AgentDropout (removing underperforming agents mid-run) and Adaptive Graph Pruning (removing redundant workflow nodes). SynthOrg's proposed pruning is more conservative; it operates at the HR layer (between runs) rather than mid-execution. Mid-execution dropout (removing an agent after it has started a subtask) is significantly more complex due to task handoff and context transfer requirements. The inter-run HR pruning is the correct first implementation target.

Vocabulary Adoption Recommendation¶

Options Considered¶

Code rename: Replace SynthOrg-specific terms with ACG terms in class/method names
Docs-only: Add an ACG glossary to docs/architecture/ mapping terms
Bidirectional glossary: Document both terminologies, neither renamed

Recommendation: Bidirectional Glossary (Option 3)¶

The ACG formalism is a useful external reference vocabulary but is incomplete for SynthOrg's concepts (trust, personality, autonomy, memory, prompt profiles). Code-level renaming would:

Remove domain-specific precision (e.g., "HybridLoop" is more descriptive than "ConditionalACGNode")
Break existing API contracts and test surface
Gain little because the ACG vocabulary is not user-facing

The value of ACG is in research alignment (citing papers using shared vocabulary) and gap identification (seeing what SynthOrg lacks in the ACG model). Both goals are served by a bidirectional glossary without code changes.

Action: Add docs/architecture/acg-glossary.md mapping ACG concepts to SynthOrg equivalents (using the mapping table from this document). Reference in the design spec and in research communications.

Formal node typing is the one ACG concept that could benefit from a lightweight code adoption. Introducing a NodeType enum with values LLM_CALL, TOOL_INVOCATION, QUALITY_CHECK, BUDGET_CHECK, STAGNATION_CHECK and tagging TurnRecord with the node types executed in that turn would improve execution trace analysis without significant refactoring. This is optional but would directly enable structural credit assignment (knowing which node type failed).

Backward compatibility: TurnRecord is part of execution traces and may be persisted. The node_types field must be added as optional with a default (e.g., node_types: tuple[NodeType, ...] = ()) so existing records remain valid without migration. Serialisation/deserialisation must tolerate the absent field. Consumers (trace analysers, evaluation pipelines) should treat an empty tuple as "unknown composition" rather than erroring.

Summary of Recommendations¶

Bidirectional ACG glossary in docs/architecture/acg-glossary.md; no code changes
Structural credit assignment: Add CoordinationResultWithAttribution wrapper (frozen CoordinationResult + AgentContribution list); run attribution in _post_execution_pipeline; feed into PerformanceTracker
Agent pruning: Implement PruningService + PruningPolicy; wire to existing OffboardingService; human approval gate required
Optional node typing: Add NodeType enum to TurnRecord for richer trace analysis
Adaptive quality-cost tradeoff: Make hybrid loop replan threshold adaptive to quality verifier confidence (longer-term)