ACG Formalism Evaluation for SynthOrg Engine Architecture¶
Context¶
The IBM/RPI survey "From Static Templates to Dynamic Runtime Graphs" (arXiv:2603.22386, companion repository: IBM awesome list) introduces the Agentic Computation Graph (ACG) formalism to unify the vocabulary for describing agent workflows: from static templates to dynamic runtime graphs, scheduling policies, execution traces, and mutation strategies. SynthOrg has all these concepts but expresses them through domain-specific names (execution loops, hybrid plans, turn records, coordination topology). This evaluation assesses whether adopting ACG vocabulary would improve architecture clarity, identify gaps, and inform design decisions for structural credit assignment and agent pruning.
ACG Vocabulary Mapping¶
The following table maps each ACG concept to its SynthOrg equivalent, with fidelity assessment and source file references.
Core Graph Concepts¶
| ACG Concept | SynthOrg Equivalent | Source | Fidelity | Notes |
|---|---|---|---|---|
| ACG Template | CompanyConfig + Company YAML |
src/synthorg/core/company.py, src/synthorg/config/schema.py |
Partial | ACG templates are graph-level (workflow topology). SynthOrg's YAML is org-level (agent roster, tool permissions, budget). Closer analogue would be WorkflowDefinition for workflow templates. |
| Realised Graph | AgentContext + TaskExecution + CoordinationResult |
src/synthorg/engine/context.py, src/synthorg/engine/coordination/models.py |
Strong | The realised graph IS the running state: context, history, accumulated cost, current position. Multi-agent coordination adds CoordinationPhaseResult per phase. |
| Execution Trace | tuple[TurnRecord, ...] in ExecutionResult + observability events |
src/synthorg/engine/loop_protocol.py, src/synthorg/observability/events/ |
Strong | SynthOrg's trace is richer than ACG baseline: per-turn cost, token usage, tool fingerprints, stagnation signals, quality scores. 100+ event constant domains. |
| Nodes (atomic actions) | LLM calls (call_provider), tool invocations (execute_tool_calls), validation gates (check_budget, check_stagnation) |
src/synthorg/engine/loop_helpers.py |
Partial | Node typing is implicit in loop control flow, not a first-class abstraction. There is no Node type; actions are identified by function names and turn records. |
| Edges (control/data flow) | SubtaskDefinition.dependencies DAG, DecompositionPlan.dependency_edges |
src/synthorg/engine/decomposition/models.py |
Strong for multi-agent | Edges are explicit in multi-agent decomposition (dependency DAG). Implicit in single-agent loops (sequential execution order, no formal edge representation). |
| Scheduling Policies | AutoLoopConfig + select_loop_type() + CoordinationConfig + AutoTopologyConfig |
src/synthorg/engine/loop_selector.py, src/synthorg/engine/routing/models.py |
Strong | Three-way loop selection (react/plan-execute/hybrid) and topology selection (SAS/centralised/decentralised/context-dependent) are scheduling policies. Budget-aware downgrade is a resource-constrained policy. |
Dynamic Behaviour Concepts¶
| ACG Concept | SynthOrg Equivalent | Source | Fidelity | Notes |
|---|---|---|---|---|
| Conditional branching | HybridLoop replan decisions, PlanExecuteLoop step completion | src/synthorg/engine/hybrid_loop.py, src/synthorg/engine/hybrid/step_helpers.py, src/synthorg/engine/hybrid/replan_helpers.py |
Partial | Branching is embedded in loop logic (replan if step fails), not graph-level conditional edges. No formal "if node X succeeds, take edge Y" representation. |
| Parallel composition | ParallelExecutor, CoordinationWave, asyncio.TaskGroup |
src/synthorg/engine/parallel.py, src/synthorg/engine/coordination/models.py |
Strong | Parallel waves in coordination are first-class. ParallelExecutor handles concurrent subtask dispatch with fail_fast semantics. |
| Graph mutation | Hybrid replanning (attempt_replan), stagnation correction injection |
src/synthorg/engine/hybrid/replan_helpers.py, src/synthorg/engine/stagnation/ |
Partial | Replanning mutates the execution plan (new subtask list). Stagnation correction injects a new message. These are graph mutations but are not described in those terms. |
| Termination conditions | TerminationReason enum (7 values: COMPLETED, MAX_TURNS, BUDGET_EXHAUSTED, SHUTDOWN, PARKED, STAGNATION, ERROR) |
src/synthorg/engine/loop_protocol.py |
Strong | Richer than typical ACG termination models. 7 named reasons provide precise signal for recovery and routing decisions. |
Resource and Cost Concepts¶
| ACG Concept | SynthOrg Equivalent | Source | Fidelity | Notes |
|---|---|---|---|---|
| Node cost | TurnRecord.cost per turn, TokenUsage per completion |
src/synthorg/engine/loop_protocol.py, src/synthorg/providers/models.py |
Strong | Per-turn cost tracking with provider breakdown. Accumulated over execution via ctx.accumulated_cost. |
| Resource constraints | BudgetEnforcer (3-layer), quota degradation, context budget |
src/synthorg/budget/enforcer.py, src/synthorg/engine/context_budget.py |
Strong (exceeds ACG) | SynthOrg's resource model is more sophisticated than ACG: multi-layer enforcement, per-agent daily limits, context fill tracking, risk budget. |
| Quality-cost tradeoffs | Budget-aware loop downgrade (hybrid->plan_execute at 80%), model auto-downgrade, quota degradation strategies | src/synthorg/engine/loop_selector.py, src/synthorg/budget/enforcer.py |
Strong | Explicit tradeoff mechanisms with hard budget caps. Downgrade-only at task boundaries (consistency guarantee). |
Concepts SynthOrg Has That ACG Does Not Capture¶
- Progressive trust: agent trust levels (RESTRICTED/STANDARD/ELEVATED) with human approval for promotion. No ACG equivalent.
- Personality and behavioural configuration:
PersonalityConfigwith Big Five + behavioural enums affecting decision style. No ACG equivalent. - Memory injection: episodic and procedural memory retrieval shaping context before execution. No ACG equivalent.
- Prompt profiles: verbosity adaptation by model tier. No ACG equivalent.
- Autonomy levels: 4 presets (full/semi/supervised/locked) with tool permission gating. No ACG equivalent.
ACG Survey Findings Validation¶
The survey identifies four structural findings that apply to SynthOrg's architecture:
Finding 1: Structural Improvements > Prompt Refinement When Scaffold Is Poorly Matched¶
Claim: When an agent workflow uses the wrong graph structure (e.g., sequential when parallel would be correct), prompt engineering cannot compensate. Structural changes yield greater gains.
SynthOrg validation: Confirmed by the loop selector. The auto-selector maps task complexity to loop type (simple->react, medium->plan_execute, complex->hybrid). The documentation for this design explicitly states that choosing the wrong loop for a task complexity degrades quality beyond what prompt tuning can recover. The budget-aware downgrade (hybrid->plan_execute at 80% monthly) is a deliberate quality tradeoff, accepted because the budget constraint makes the simpler structure correct.
Implication: The loop selector is doing real structural work. Adding complexity to system prompts for tasks that should use a different loop is not a substitute. This validates investing in the auto-selector's classification accuracy over prompt length.
Finding 2: Strong Verifiers Enable More Aggressive Graph Mutation¶
Claim: Systems with high-quality output verification can mutate graphs more aggressively (add/remove nodes, change topology) because they can catch degradation early.
SynthOrg validation: Partially confirmed. The quality scoring system (L2+L3 in
src/synthorg/engine/quality/) provides per-step quality signals. The hybrid loop's replan
trigger uses these signals to decide whether to add a replan step (graph mutation). However,
the quality verifier's confidence is not used to modulate mutation aggressiveness; the
replan threshold is fixed, not adaptive to verifier confidence.
Implication: As the quality scoring system matures, consider making the replan threshold adaptive to verifier confidence. High-confidence quality signal -> allow more replans. Low-confidence signal -> conserve budget.
Finding 3: Selection/Pruning from a Super-Graph Beats Unconstrained Generation¶
Claim: Starting from a well-designed set of node/edge templates and selecting subsets outperforms generating arbitrary workflows from scratch.
SynthOrg validation: Strongly confirmed. The Company YAML and 30 built-in roles in
src/synthorg/core/role_catalog.py are a super-graph of organisational patterns. Template
packs in api/controllers/template_packs.py apply curated patterns. The meeting protocols
(3 variants) and loop types (3 variants) are a bounded selection space rather than
open-ended generation.
Implication: Adding more template packs and expanding the super-graph is a higher-value investment than adding more free-form configuration options.
Finding 4: Quality-Cost Tradeoffs Must Be Explicit with Hard Budget Caps¶
Claim: Agentic workflows need explicit Pareto frontier navigation between output quality and token cost, with hard caps preventing runaway spending.
SynthOrg validation: Confirmed. The budget system has hard caps at multiple levels
(per-task, per-agent daily, monthly hard stop). Model auto-downgrade is an explicit
quality-cost tradeoff. The DegradationConfig and quota degradation strategies
(alert/fallback/queue) are explicit Pareto navigation mechanisms. The coordination metrics
(Amdahl ceiling, straggler gap) provide efficiency bounds.
Implication: The existing budget architecture is sound. The missing piece is exposing
the quality-cost tradeoffs via the REST API: specifically, GET /tasks/{id} response
and the CoordinationResult Python type should surface cost, quality, and efficiency
metadata (estimated cost, actual cost, quality score, Amdahl ceiling, straggler gap).
See #688 coordination metrics gap (Gap G4) for the full scoping.
Structural Credit Assignment¶
Problem Statement¶
In multi-agent task pipelines, when a downstream subtask fails, it is not always clear whether the root cause is:
- Direct failure: The assigned agent's execution failed
- Upstream contamination: The agent received poor-quality input from a predecessor
- Coordination overhead: The routing decision created an inefficient handoff
- Quality gate propagation: The agent passed quality gates but the downstream consumer found a defect
SynthOrg currently attributes all failure information to the executing agent's
TaskExecution:
infer_failure_category()insrc/synthorg/engine/recovery.pyis keyword-based heuristic classification applied per-execution, not per-agent in a coordination runRecoveryResultcaptures onefailure_categoryper executionCoordinationResulthasCoordinationPhaseResultper phase but no per-agent attribution- The 5-pillar evaluation in
src/synthorg/hr/evaluation/evaluator.pyscores agents over time windows, not for specific pipeline failures
Proposed Design¶
AgentContribution model integrates with CoordinationResult:
Note: CoordinationResult has model_config = ConfigDict(frozen=True). Adding
agent_contributions directly is a breaking change. The recommended approach is a
separate wrapper: CoordinationResultWithAttribution(result: CoordinationResult,
agent_contributions: tuple[AgentContribution, ...]), stored and returned in place of
the bare result by _post_execution_pipeline. This preserves immutability and avoids
migrating existing persisted CoordinationResult records.
class AgentContribution(BaseModel):
"""Per-agent attribution within a coordination run."""
agent_id: str
subtask_id: str
contribution_score: float # 0.0 = no contribution, 1.0 = fully responsible
failure_attribution: Literal[
"direct", "upstream_contamination", "coordination_overhead", "quality_gate"
] | None
evidence: str | None # pointer to error findings or quality signal
Attribution algorithm:
- Topological sort of
DecompositionResult.dependency_edges - For each failing subtask, walk backward through dependency edges
- Classify: if predecessor's
StepQualitySignalis low, attribute "upstream_contamination" to the predecessor; if the local execution raised directly, attribute "direct" to the executing agent - Coordination overhead: if
CoordinationMetrics.error_amplification > threshold, attribute a fraction to topology mismatch - Normalise contribution scores so they sum to 1.0 across the pipeline
Integration points:
- Run as part of
_post_execution_pipelineafter coordination completes - Feed
AgentContributionintoPerformanceTracker.record_task_metric()for trend detection - Surface in
GET /tasks/{id}response metadata for operator inspection
Scope note: This is a research recommendation, not an implementation spec. The
minimum viable version introduces a CoordinationResultWithAttribution wrapper containing
the original (immutable) CoordinationResult plus a list of AgentContribution objects
populated per-agent subtask result using the existing keyword-heuristic from
infer_failure_category(). This preserves CoordinationResult immutability with no changes
to the frozen model.
Agent Pruning / Dropout Evaluation¶
Current State¶
The infrastructure for agent removal exists and is production-grade:
src/synthorg/hr/offboarding_service.py:OffboardingService, the full pipeline for agent removal (task reassignment, memory archival, team notification, status termination)src/synthorg/hr/enums.py:FiringReason.PERFORMANCEexists as a reason codesrc/synthorg/hr/performance/tracker.py:PerformanceTrackerproviding rolling windows, trend detection (Theil-Sen), quality and collaboration scoring
What does not exist: any automated trigger for OffboardingService.offboard() based on
performance data. FiringReason.PERFORMANCE is defined but never programmatically invoked.
Pruning Signal Sources¶
Four signal categories that should drive pruning recommendations:
-
Performance trend: Theil-Sen slope below
declining_thresholdfor the 30d window. Available fromAgentPerformanceSnapshot.quality_trendinhr/performance/tracker.py. -
Utilisation: Tasks assigned relative to team size. Low-utilisation agents are redundant overhead. Currently tracked via task records: a task-per-agent-per-window count would be the metric.
-
Skill redundancy: High Jaccard similarity of required skills with another agent on the team, combined with high routing substitutability (how often could the other agent have handled this agent's tasks based on
RoutingCandidate.score). -
Budget pressure: Monthly utilisation approaching threshold. When a team exceeds its budget allocation, pruning lowest-performing agents reduces future spend.
Proposed Protocol¶
PruningEvaluation (new model)
agent_id: str
pruning_score: float # 0.0 = retain, 1.0 = prune
signals: list[PruningSignal] # which criteria triggered
recommendation: Literal["PRUNE", "RETAIN", "MONITOR"]
PruningPolicy (new model)
quality_decline_threshold: float # Theil-Sen slope below which to flag
utilization_minimum: float # tasks-per-window below which to flag
redundancy_threshold: float # Jaccard similarity above which to flag
cooldown_days: int # min time between pruning decisions
min_team_size: int # never prune below this team size
PruningService (new service)
evaluate(agent_id) -> PruningEvaluation
# Reads from PerformanceTracker, RoutingHistory
recommend_pruning(department_name) -> list[PruningEvaluation]
# Scans all agents in department, returns sorted by pruning_score
Human approval gate: Any PruningEvaluation with recommendation="PRUNE" creates an
ApprovalItem following the same approval pattern used by the hiring and promotion
pipelines. Required fields:
id: unique UUID perPruningEvaluationtitle: short summary, e.g."Prune agent {agent_id} ({reason})"description: rationale fromPruningEvaluation.signals(quality decline slope, utilisation, Jaccard overlap), affected team, and safety constraint check resultsrequested_by: thePruningServiceidentifier or calling systemaction_type:"org:prune"risk_level:ApprovalRiskLevel.MEDIUMcreated_at: ISO 8601 timestamp
Pruning is never fully automated; it is recommendation plus human approval.
Safety Constraints¶
- Minimum team size: Never prune if the team would fall below
min_team_size - Unique skill protection: Never prune the last agent with a required skill that no
other agent possesses (validated against
RoutingHistory.required_skills) - Mid-task protection: Flag if agent has active task assignments; recommendation becomes "MONITOR" instead of "PRUNE" until tasks complete
- Cooldown period:
PruningPolicy.cooldown_daysprevents consecutive pruning decisions - Seniority preference: When multiple agents are candidates, prefer pruning lower-seniority agents first
Relationship to HR Module¶
Pruning is the inverse of hiring. The same OffboardingService used for voluntary
departures handles performance-based pruning. The FiringReason.PERFORMANCE code exists
precisely for this. The only new infrastructure needed is:
PruningServiceto evaluate signals and generate recommendationsPruningPolicyconfig model- API endpoint to surface recommendations (
GET /agents/pruning-recommendationsor similar) - The approval flow reuses the existing
ApprovalIteminfrastructure
Relationship to ACG AgentDropout and Adaptive Graph Pruning¶
The ACG survey discusses AgentDropout (removing underperforming agents mid-run) and Adaptive Graph Pruning (removing redundant workflow nodes). SynthOrg's proposed pruning is more conservative; it operates at the HR layer (between runs) rather than mid-execution. Mid-execution dropout (removing an agent after it has started a subtask) is significantly more complex due to task handoff and context transfer requirements. The inter-run HR pruning is the correct first implementation target.
Vocabulary Adoption Recommendation¶
Options Considered¶
- Code rename: Replace SynthOrg-specific terms with ACG terms in class/method names
- Docs-only: Add an ACG glossary to
docs/architecture/mapping terms - Bidirectional glossary: Document both terminologies, neither renamed
Recommendation: Bidirectional Glossary (Option 3)¶
The ACG formalism is a useful external reference vocabulary but is incomplete for SynthOrg's concepts (trust, personality, autonomy, memory, prompt profiles). Code-level renaming would:
- Remove domain-specific precision (e.g., "HybridLoop" is more descriptive than "ConditionalACGNode")
- Break existing API contracts and test surface
- Gain little because the ACG vocabulary is not user-facing
The value of ACG is in research alignment (citing papers using shared vocabulary) and gap identification (seeing what SynthOrg lacks in the ACG model). Both goals are served by a bidirectional glossary without code changes.
Action: Add docs/architecture/acg-glossary.md mapping ACG concepts to SynthOrg
equivalents (using the mapping table from this document). Reference in the design spec
and in research communications.
Formal node typing is the one ACG concept that could benefit from a lightweight code
adoption. Introducing a NodeType enum with values LLM_CALL, TOOL_INVOCATION,
QUALITY_CHECK, BUDGET_CHECK, STAGNATION_CHECK and tagging TurnRecord with the
node types executed in that turn would improve execution trace analysis without
significant refactoring. This is optional but would directly enable structural credit
assignment (knowing which node type failed).
Backward compatibility: TurnRecord is part of execution traces and may be
persisted. The node_types field must be added as optional with a default (e.g.,
node_types: tuple[NodeType, ...] = ()) so existing records remain valid without
migration. Serialisation/deserialisation must tolerate the absent field. Consumers
(trace analysers, evaluation pipelines) should treat an empty tuple as "unknown
composition" rather than erroring.
Summary of Recommendations¶
- Bidirectional ACG glossary in
docs/architecture/acg-glossary.md; no code changes - Structural credit assignment: Add
CoordinationResultWithAttributionwrapper (frozenCoordinationResult+AgentContributionlist); run attribution in_post_execution_pipeline; feed intoPerformanceTracker - Agent pruning: Implement
PruningService+PruningPolicy; wire to existingOffboardingService; human approval gate required - Optional node typing: Add
NodeTypeenum toTurnRecordfor richer trace analysis - Adaptive quality-cost tradeoff: Make hybrid loop replan threshold adaptive to quality verifier confidence (longer-term)