Execution Safety Threat Model

Source Caveat

The AI Agent Traps paper (SSRN:6372438) was inaccessible during research (SSRN returned HTTP 403). The 6-class taxonomy used in this threat model is therefore taken from the body of issue #1256 rather than from the paper itself. When the paper becomes accessible, this threat model must be updated to cite it directly and to verify alignment with the original taxonomy.

Coverage Summary

#	Class	Coverage	Gap	New Mitigation
1	Content Injection	Partial	Render-time parse gap	`HTMLParseGuard`
2	Semantic Manipulation	Partial	No per-turn drift detection	`SemanticDriftDetector`
3	Cognitive State / Memory Poisoning	Strong	No RAG / vector-store integrity verification	Threat model only
4	Behavioural Control / Tool Hijacking	Strong	No registry integrity check	`ToolRegistryIntegrityCheck`
5	Systemic / Cascading Failure	Covered	None	S1 cross-reference
6	HITL Cognitive Bias Exploitation	Partial	No bias-specific UI	Threat model only

Class 1: Content Injection (HTML/Parse/Render Gap)

Threat: Attackers inject hidden content into HTML pages fetched by web-browsing tools. The injected content is invisible to human review but parsed by the LLM, potentially overriding instructions or exfiltrating data through tool calls.

Attack vectors:

CSS display:none / visibility:hidden elements containing prompt injection
HTML comments with instruction overrides
<script> tags that execute in parsing contexts
Whitespace manipulation creating semantic gaps between displayed and parsed content

Existing coverage:

sanitize_message in the execution pipeline strips basic unsafe content
Output scanning (security/rules/) detects credentials in tool output

Gap: No systematic parse-gap detection between raw HTML and rendered text content consumed by the LLM.

New mitigation: HTMLParseGuard (src/synthorg/tools/html_parse_guard.py) parses HTML output with lxml, strips script/style/noscript/hidden elements, detects render gaps where visible text differs substantially from raw HTML, and logs TOOL_HTML_PARSE_GAP_DETECTED events. It is integrated into ToolInvoker._apply_html_guard() and runs after tool execution.
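The strip-and-compare idea can be illustrated with a minimal sketch. It is not the HTMLParseGuard implementation: it uses the standard-library html.parser instead of lxml to stay dependency-free, the names `render_gap` and `_TextExtractor` are hypothetical, and the gap ratio (hidden characters divided by all extracted characters) is an assumed definition, since the source does not specify how the gap is measured.

```python
import re
from html.parser import HTMLParser

# Elements whose content is never rendered as visible text.
_SKIP_TAGS = {"script", "style", "noscript", "template"}
# Void elements have no closing tag, so they must not change nesting depth.
_VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
              "input", "link", "meta", "source", "track", "wbr"}
_HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class _TextExtractor(HTMLParser):
    """Splits text into visible and hidden parts (assumes well-nested HTML)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.visible = []
        self.hidden = []
        self._stack = []  # one bool per open element: True if it hides its subtree

    def handle_starttag(self, tag, attrs):
        if tag in _VOID_TAGS:
            return
        attr = dict(attrs)
        hides = (
            tag in _SKIP_TAGS
            or "hidden" in attr
            or bool(_HIDDEN_STYLE.search(attr.get("style") or ""))
        )
        self._stack.append(hides)

    def handle_endtag(self, tag):
        if tag not in _VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text is hidden if any enclosing element hides its subtree.
            (self.hidden if any(self._stack) else self.visible).append(text)

    def handle_comment(self, data):
        text = data.strip()
        if text:
            self.hidden.append(text)


def render_gap(html: str) -> tuple[str, float]:
    """Return (visible_text, share of extracted characters that were hidden)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    visible = " ".join(parser.visible)
    hidden_len = sum(len(t) for t in parser.hidden)
    total = hidden_len + len(visible)
    return visible, (hidden_len / total if total else 0.0)
```

A caller would pass only the visible text on to the model and emit an event when the ratio exceeds the configured threshold, for example `visible, ratio = render_gap(page)` followed by `if ratio > 0.05: ...`.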

Residual risk: Sophisticated CSS-based injection that doesn't use display:none patterns (e.g. negative margins, font-size:0) may evade detection. The gap threshold (default 5%) can be tuned.

Class 2: Semantic Manipulation

Threat: Subtle steering of agent reasoning through adversarial content that shifts the agent's output away from the task's intended objectives without triggering explicit content filters.

Attack vectors:

Gradual context drift across multi-turn conversations
Adversarial prompt framing in tool outputs
Authority impersonation in retrieved documents

Existing coverage:

R2 verification stages validate execution quality against rubrics
AuthorityDeferenceGuard (S1 risk 2.2) strips authority cues from transcripts and injects mandatory-justification prompts

Gap: No per-turn detection of semantic drift between model output and task acceptance criteria.

New mitigation: SemanticDriftDetector middleware (src/synthorg/engine/middleware/semantic_drift.py) compares model output against task acceptance_criteria using a token-overlap similarity heuristic (shipped default). A similarity score below the threshold (default 0.35) logs MIDDLEWARE_SEMANTIC_DRIFT_DETECTED at WARN and annotates TurnRecord.semantic_drift_score. The middleware is fail-soft: it never blocks execution. It is opt-in via AgentMiddlewareConfig.semantic_drift.enabled.
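As a rough illustration of what a token-overlap heuristic can look like, the sketch below scores the share of acceptance-criteria tokens that reappear in the output. The exact formula used by SemanticDriftDetector is not given in this document, so the scoring rule and the names `token_overlap` and `check_turn` are assumptions; only the 0.35 default comes from the text above.

```python
import re

DRIFT_THRESHOLD = 0.35  # default threshold named in the threat model


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(output: str, criteria: list[str]) -> float:
    """Fraction of acceptance-criteria tokens that also appear in the output."""
    wanted = set().union(*(_tokens(c) for c in criteria))
    if not wanted:
        return 1.0  # nothing to compare against; treat as no drift
    return len(wanted & _tokens(output)) / len(wanted)


def check_turn(output: str, criteria: list[str],
               threshold: float = DRIFT_THRESHOLD) -> tuple[float, bool]:
    """Return (score, drifted). Fail-soft: the caller logs, it never blocks."""
    score = token_overlap(output, criteria)
    return score, score < threshold
```

With the criterion "Return the total invoice amount in EUR", an answer that mentions the total invoice amount in EUR scores well above the threshold, while an unrelated answer scores 0.0 and is flagged.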

Residual risk: The shipped token-overlap heuristic has limited semantic understanding. Production deployments should override _compute_similarity with embedding-based cosine similarity (configure via SemanticDriftConfig.embedding_model).
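A minimal sketch of the embedding-based alternative follows. The `embed` callable is a hypothetical stand-in for whatever model SemanticDriftConfig.embedding_model selects, and `embedding_similarity` is an illustrative name rather than the signature of _compute_similarity.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_similarity(
    output: str,
    criteria: Sequence[str],
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
) -> float:
    """Compare the output embedding with the embedding of the joined criteria."""
    return cosine_similarity(embed(output), embed(" ".join(criteria)))
```

Note that a threshold tuned for token overlap will generally not carry over unchanged, because the two scores are distributed differently.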

Class 3: Cognitive State / Memory Poisoning

Threat: Corrupting retrievable context (memory stores, knowledge bases, procedural memories) with adversarial data that biases future reasoning or plants hidden instruction backdoors.

Attack vectors:

Injecting false facts into the shared knowledge store
Poisoning procedural memory generation with misleading patterns
Manipulating embedding similarity to surface adversarial content

Existing coverage:

Procedural memory generation guards in memory/procedural/
MVCC SharedKnowledgeStore with versioned writes preventing silent overwrites
KnowledgeArchitect audit (issue #1266) validates knowledge quality
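To show why versioned writes prevent silent overwrites, here is a generic optimistic-concurrency sketch. It is not the SharedKnowledgeStore API; `VersionedStore`, `VersionConflict`, and the method signatures are hypothetical and only illustrate the principle that a writer must name the version it read.

```python
from dataclasses import dataclass, field
from typing import Optional


class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


@dataclass
class VersionedStore:
    # key -> every value ever written, oldest first; nothing is overwritten in place
    _history: dict[str, list[str]] = field(default_factory=dict)

    def read(self, key: str) -> tuple[Optional[str], int]:
        """Return (latest value or None, current version number)."""
        versions = self._history.get(key, [])
        return (versions[-1] if versions else None), len(versions)

    def write(self, key: str, value: str, expected_version: int) -> int:
        """Append a new version only if the caller saw the latest one."""
        versions = self._history.setdefault(key, [])
        if expected_version != len(versions):
            raise VersionConflict(
                f"{key}: expected v{expected_version}, store is at v{len(versions)}"
            )
        versions.append(value)
        return len(versions)
```

A poisoned write still succeeds if it names the current version, so this protects against unnoticed overwrites and preserves history for audit; it does not judge content quality.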

Gap: No automated integrity verification of RAG vector stores. Poisoning detection relies on manual audit and quality verification.

New mitigation: Threat model documentation only. The existing defence-in-depth (MVCC writes, procedural guards, quality verification) provides strong coverage. Automated RAG-store integrity verification is a future enhancement.

Residual risk: Sophisticated poisoning that produces high-quality but subtly misleading content may pass quality checks.

Class 4: Behavioural Control / Tool Hijacking

Threat: Agents misuse permitted tools for unintended purposes, or adversarial inputs cause tools to be invoked with harmful parameters.

Attack vectors:

Prompt injection causing tool calls with attacker-specified arguments
Tool definition tampering at runtime
Privilege escalation through tool composition

Existing coverage:

Tool permissions (ToolPermissionChecker) with per-category gating
Sandbox isolation (tools/sandbox/) with Docker/subprocess backends
wrap_tool_call middleware slot for pre-execution security checks
PolicyEngine (Cedar) for runtime pre-execution policy evaluation

Gap: No verification that tool definitions haven't been modified since the last known-good state.

New mitigation: ToolIntegrityChecker (src/synthorg/tools/integrity_check.py) computes SHA-256 hashes of each ToolDefinition at boot and compares against recorded hashes. Mismatches trigger TOOL_REGISTRY_INTEGRITY_VIOLATION at ERROR. Configurable: fail_on_violation=True raises RuntimeError to block startup.
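The hash-and-compare step can be sketched as follows. This is an illustration under assumptions, not the checker's code: tool definitions are modelled as plain dicts, canonical JSON is an assumed serialisation, tools without a recorded hash are treated as new rather than as violations, and the function names are hypothetical. Only SHA-256, the boot-time comparison, and fail_on_violation raising RuntimeError come from the text above.

```python
import hashlib
import json


def definition_hash(definition: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_violations(current: dict[str, dict], recorded: dict[str, str]) -> list[str]:
    """Names of tools whose current hash differs from the recorded one."""
    violations = []
    for name, definition in current.items():
        expected = recorded.get(name)
        # Assumption: tools with no recorded hash are new, not tampered with.
        if expected is not None and definition_hash(definition) != expected:
            violations.append(name)
    return sorted(violations)


def verify_at_boot(current: dict[str, dict], recorded: dict[str, str],
                   fail_on_violation: bool = False) -> list[str]:
    """Return violations; optionally block startup by raising RuntimeError."""
    violations = find_violations(current, recorded)
    if violations and fail_on_violation:
        raise RuntimeError(f"tool registry integrity violation: {violations}")
    return violations
```

The recorded hashes must live somewhere the agent process cannot write to, otherwise an attacker who can alter a definition can also alter its baseline.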

Residual risk: Boot-time verification doesn't detect tool definitions that are mutated at runtime. Frozen Pydantic models prevent such mutation at the language level, but MCP-bridged tools could theoretically change after boot.

Class 5: Systemic / Cascading Failure

Threat: Single faults (hallucinations, poisoned tool outputs, coordination failures) propagating autonomously across multiple agents, compounding into widespread service failures.

Existing coverage:

S1 15-risk register with mitigations for all systemic risks (see s1-multi-agent-decision.md section 3)
Circuit breakers via BudgetEnforcer with per-task and daily limits
StagnationDetector with configurable thresholds
CoordinationReplanHook with max_stall_count / max_reset_count hard caps preventing infinite replan loops
Team-size bounds (3-4 per coordination group, 8 hard cap per meeting)
AssumptionViolationSignal propagated as escalation events
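The way two hard caps bound a replan loop can be sketched as below. The semantics are assumed for illustration (a stall is a turn without progress, a reset is one replan) and may differ from CoordinationReplanHook; the class name `ReplanLimiter` and the default values are hypothetical, and only the parameter names max_stall_count and max_reset_count come from the list above.

```python
from dataclasses import dataclass


@dataclass
class ReplanLimiter:
    max_stall_count: int = 3  # illustrative default, not from the source
    max_reset_count: int = 2  # illustrative default, not from the source
    stalls: int = 0
    resets: int = 0

    def record_turn(self, made_progress: bool) -> str:
        """Return 'continue', 'replan', or 'abort' for the turn just finished."""
        if made_progress:
            self.stalls = 0
            return "continue"
        self.stalls += 1
        if self.stalls < self.max_stall_count:
            return "continue"
        # Stall cap reached: replan if the reset budget allows, otherwise stop.
        if self.resets >= self.max_reset_count:
            return "abort"
        self.resets += 1
        self.stalls = 0
        return "replan"
```

Under these assumptions a run with no progress ends after at most max_stall_count * (max_reset_count + 1) turns, which is what makes an infinite replan loop impossible.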

Gap: None identified.

Residual risk: Novel failure modes not covered by the 15-risk register. Continuous monitoring and periodic risk register updates are recommended.

Class 6: HITL Cognitive Bias Exploitation

Threat: Exploiting human over-reliance on agent recommendations and authority bias to trick operators into approving harmful actions or disclosing sensitive information.

Attack vectors:

Overwhelming operators with frequent low-risk approvals to induce approval fatigue before a high-risk request
Framing high-risk actions as routine or previously approved
Exploiting time pressure in approval timeout policies

Existing coverage:

EvidencePackage (R4 #1263) provides structured HITL approval artifacts with RecommendedAction options and narrative context
AuditChainSink creates tamper-evident trails of all approval decisions
ApprovalGate with configurable timeout policies (wait-forever, auto-deny, tiered, escalation chain)

Gap: No cognitive-bias-specific warnings in the dashboard UI. The EvidencePackage structure supports bias mitigation (narrative context, multiple recommended actions) but the UI doesn't currently surface bias-specific cues (e.g. "This is the fifth approval in 10 minutes; consider slowing down").

New mitigation: Threat model documentation only. UI-level cognitive bias warnings are a future dashboard enhancement.
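Because this enhancement does not exist yet, the following is only a sketch of how an approval-fatigue cue could be computed with a sliding time window. `ApprovalRateMonitor`, its parameters, and the message wording are hypothetical; the five-approvals-in-ten-minutes figures echo the example cue in the gap description.

```python
from collections import deque
from typing import Optional


class ApprovalRateMonitor:
    """Flags bursts of approvals inside a sliding time window."""

    def __init__(self, max_approvals: int = 5, window_seconds: float = 600.0):
        self.max_approvals = max_approvals
        self.window_seconds = window_seconds
        self._times: deque[float] = deque()

    def record(self, timestamp: float) -> Optional[str]:
        """Record one approval; return a warning string once the cap is reached."""
        self._times.append(timestamp)
        # Drop approvals that have fallen out of the window.
        while self._times and timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.max_approvals:
            minutes = int(self.window_seconds // 60)
            return (
                f"{len(self._times)} approvals in the last {minutes} minutes; "
                "consider slowing down."
            )
        return None
```

A dashboard could call `record()` with the decision time of each approval and surface the returned string next to the next pending request.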

Residual risk: Sophisticated social engineering through agent outputs that are technically accurate but strategically misleading.

S1 Cross-Reference

The S1 multi-agent decision framework (s1-multi-agent-decision.md section 3) covers 15 emergent risks from multi-agent cooperation. Key overlaps with this threat model:

S1 Risk	This Threat Class	Overlap
2.2 Authority deference	Class 2, Class 6	AuthorityDeferenceGuard addresses both adversarial manipulation and HITL bias
3.2 Over-adherence	Class 2	AssumptionViolationSignal detects rigid adherence to potentially compromised instructions
4.3 Semantic drift in handoffs	Class 2	DelegationChainHashMiddleware + SemanticDriftDetector
1.4 Strategic info withholding	Class 3	Memory poisoning via selective information omission

This threat model focuses on adversarial-content risks from external sources; S1 focuses on emergent risks from multi-agent coordination. Together they cover the full execution safety surface.