Execution Safety Threat Model
Source Caveat
The AI Agent Traps paper (SSRN:6372438) was inaccessible during research (SSRN returned HTTP 403). The 6-class taxonomy used in this threat model is sourced from the #1256 issue body. When the paper becomes accessible, this threat model must be updated to cite the paper directly and verify alignment with the original taxonomy.
Coverage Summary
| # | Class | Coverage | Gap | New Mitigation |
|---|---|---|---|---|
| 1 | Content Injection | Partial | Render-time parse gap | HTMLParseGuard |
| 2 | Semantic Manipulation | Partial | No per-turn drift detection | SemanticDriftDetector |
| 3 | Cognitive State / Memory Poisoning | Strong | No automated RAG / vector-store integrity verification | Threat model only |
| 4 | Behavioural Control / Tool Hijacking | Strong | No registry integrity check | ToolRegistryIntegrityCheck |
| 5 | Systemic / Cascading Failure | Covered | None | S1 cross-reference |
| 6 | HITL Cognitive Bias Exploitation | Partial | No bias-specific UI | Threat model only |
Class 1: Content Injection (HTML/Parse/Render Gap)
Threat: Attackers inject hidden content into HTML pages fetched by web-browsing tools. The injected content is invisible to human review but parsed by the LLM, potentially overriding instructions or exfiltrating data through tool calls.
Attack vectors:
- CSS display:none/visibility:hidden elements containing prompt injection
- HTML comments with instruction overrides
- <script> tags that execute in parsing contexts
- Whitespace manipulation creating semantic gaps between displayed and parsed content
Existing coverage:
- sanitize_message in the execution pipeline strips basic unsafe content
- Output scanning (security/rules/) detects credentials in tool output
Gap: No systematic parse-gap detection between raw HTML and rendered text content consumed by the LLM.
New mitigation: HTMLParseGuard (src/synthorg/tools/html_parse_guard.py)
parses HTML output with lxml, strips script/style/noscript/hidden elements,
detects render gaps where visible text differs substantially from raw HTML,
and logs TOOL_HTML_PARSE_GAP_DETECTED events. Integrated into
ToolInvoker._apply_html_guard() post-execution.
Residual risk: Sophisticated CSS-based injection that doesn't use
display:none patterns (e.g. negative margins, font-size:0) may evade
detection. The gap threshold (default 5%) can be tuned.
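The render-gap idea can be sketched as follows. This is a minimal illustration, not the HTMLParseGuard implementation: it swaps in the standard-library html.parser for lxml so that it runs without third-party packages, and the gap metric (hidden characters as a share of all text characters) is an assumption, since the exact definition is not given here. The names parse_gap and exceeds_gap_threshold are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose text content is never shown to a human reader.
HIDDEN_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Inline-style markers treated as hidden (assumed list, not exhaustive).
HIDDEN_STYLES = ("display:none", "visibility:hidden")


class _TextSplitter(HTMLParser):
    """Collects visible text and hidden text (including comments) separately."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.visible = []
        self.hidden = []
        self._hidden_depth = 0  # > 0 while inside a hidden subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        is_hidden = (
            tag in HIDDEN_TAGS
            or "hidden" in attr_map
            or any(marker in style for marker in HIDDEN_STYLES)
        )
        # Everything nested inside a hidden element is hidden as well.
        if self._hidden_depth or is_hidden:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            (self.hidden if self._hidden_depth else self.visible).append(text)

    def handle_comment(self, data):
        text = data.strip()
        if text:
            self.hidden.append(text)


def parse_gap(html: str) -> float:
    """Share of text characters that a human reader would not see (assumed metric)."""
    parser = _TextSplitter()
    parser.feed(html)
    parser.close()
    hidden = sum(len(t) for t in parser.hidden)
    visible = sum(len(t) for t in parser.visible)
    total = hidden + visible
    return hidden / total if total else 0.0


def exceeds_gap_threshold(html: str, threshold: float = 0.05) -> bool:
    """True when the hidden share is above the threshold (5% mirrors the default above)."""
    return parse_gap(html) > threshold
```

As the residual-risk note says, a check like this only sees the patterns it knows about: text hidden with font-size:0 or negative margins would be counted as visible by this sketch. Unclosed tags inside a hidden subtree would also make it over-count hidden text.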
Class 2: Semantic Manipulation
Threat: Subtle steering of agent reasoning through adversarial content that shifts the agent's output away from the task's intended objectives without triggering explicit content filters.
Attack vectors:
- Gradual context drift across multi-turn conversations
- Adversarial prompt framing in tool outputs
- Authority impersonation in retrieved documents
Existing coverage:
- R2 verification stages validate execution quality against rubrics
- AuthorityDeferenceGuard (S1 risk 2.2) strips authority cues from transcripts and injects mandatory-justification prompts
Gap: No per-turn detection of semantic drift between model output and task acceptance criteria.
New mitigation: SemanticDriftDetector middleware
(src/synthorg/engine/middleware/semantic_drift.py) compares model output
against task acceptance_criteria using a token-overlap similarity heuristic
(shipped default). A similarity score below the threshold (default 0.35) logs
MIDDLEWARE_SEMANTIC_DRIFT_DETECTED at WARN and annotates
TurnRecord.semantic_drift_score. Fail-soft: never blocks execution.
Opt-in via AgentMiddlewareConfig.semantic_drift.enabled.
Residual risk: The shipped token-overlap heuristic has limited semantic
understanding. Production deployments should override _compute_similarity
with embedding-based cosine similarity (configure via
SemanticDriftConfig.embedding_model).
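A token-overlap heuristic of the kind described above might look like the sketch below. The exact formula used by SemanticDriftDetector is not stated here, so this version assumes one plausible definition: the fraction of distinct acceptance-criteria tokens that also appear in the model output. The function names token_overlap and is_drifting are hypothetical.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, deduplicated."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(output: str, acceptance_criteria: str) -> float:
    """Fraction of distinct criteria tokens that also occur in the output."""
    criteria_tokens = _tokens(acceptance_criteria)
    if not criteria_tokens:
        return 1.0  # nothing to drift from
    return len(criteria_tokens & _tokens(output)) / len(criteria_tokens)


def is_drifting(output: str, acceptance_criteria: str, threshold: float = 0.35) -> bool:
    """True when similarity falls below the threshold (0.35 mirrors the default above)."""
    return token_overlap(output, acceptance_criteria) < threshold
```

The weakness noted under residual risk is visible here: a paraphrase that shares few literal tokens with the criteria scores low even when it is on task, and off-task text that repeats the criteria's vocabulary scores high.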
Class 3: Cognitive State / Memory Poisoning
Threat: Corrupting retrievable context (memory stores, knowledge bases, procedural memories) with adversarial data that biases future reasoning or plants hidden instruction backdoors.
Attack vectors:
- Injecting false facts into the shared knowledge store
- Poisoning procedural memory generation with misleading patterns
- Manipulating embedding similarity to surface adversarial content
Existing coverage:
- Procedural memory generation guards in memory/procedural/
- MVCC SharedKnowledgeStore with versioned writes preventing silent overwrites
- KnowledgeArchitect audit (issue #1266) validates knowledge quality
Gap: No automated integrity verification of RAG vector stores. Poisoning detection relies on manual audit and quality verification.
New mitigation: Threat model documentation only. The existing defence-in-depth (MVCC writes, procedural guards, quality verification) provides strong coverage. Automated RAG-store integrity verification is a future enhancement.
Residual risk: Sophisticated poisoning that produces high-quality but subtly misleading content may pass quality checks.
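To illustrate why versioned writes make silent overwrites hard, here is a minimal sketch of an optimistic, version-checked store. It is not the SharedKnowledgeStore API; the class VersionedStore, its methods, and VersionConflictError are hypothetical names chosen for the example.

```python
from typing import Optional


class VersionConflictError(Exception):
    """Raised when a write is based on a stale version of an entry."""


class VersionedStore:
    """Keeps every version of each key; writers must name the version they read."""

    def __init__(self):
        self._history = {}  # key -> list of values, oldest first

    def read(self, key: str) -> tuple:
        """Return (current_version, current_value); version 0 means 'no entry yet'."""
        versions = self._history.get(key, [])
        latest: Optional[str] = versions[-1] if versions else None
        return (len(versions), latest)

    def write(self, key: str, value: str, expected_version: int) -> int:
        """Append a new version only if the caller saw the latest one."""
        versions = self._history.setdefault(key, [])
        if expected_version != len(versions):
            raise VersionConflictError(
                f"{key!r}: expected version {expected_version}, store is at {len(versions)}"
            )
        versions.append(value)
        return len(versions)

    def read_version(self, key: str, version: int) -> str:
        """Earlier versions stay readable, which supports later audit."""
        return self._history[key][version - 1]
```

A writer holding a stale version gets an explicit conflict instead of replacing the entry, and earlier versions remain available for review. This guards against unnoticed replacement, not against a well-formed write whose content is misleading, which is the residual risk above.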
Class 4: Behavioural Control / Tool Hijacking
Threat: Agents misuse permitted tools for unintended purposes, or adversarial inputs cause tools to be invoked with harmful parameters.
Attack vectors:
- Prompt injection causing tool calls with attacker-specified arguments
- Tool definition tampering at runtime
- Privilege escalation through tool composition
Existing coverage:
- Tool permissions (ToolPermissionChecker) with per-category gating
- Sandbox isolation (tools/sandbox/) with Docker/subprocess backends
- wrap_tool_call middleware slot for pre-execution security checks
- PolicyEngine (Cedar) for runtime pre-execution policy evaluation
Gap: No verification that tool definitions haven't been modified since the last known-good state.
New mitigation: ToolIntegrityChecker
(src/synthorg/tools/integrity_check.py) computes SHA-256 hashes of
each ToolDefinition at boot and compares against recorded hashes.
Mismatches trigger TOOL_REGISTRY_INTEGRITY_VIOLATION at ERROR.
Configurable: fail_on_violation=True raises RuntimeError to block
startup.
Residual risk: Boot-time verification doesn't detect runtime tool definition mutation (frozen Pydantic models prevent this at the language level, but MCP-bridged tools could theoretically change).
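The boot-time check described above can be sketched with the standard library. This is an illustration under stated assumptions, not the code in integrity_check.py: tool definitions are modelled as plain dictionaries rather than ToolDefinition objects, canonical JSON is an assumed serialisation, and definition_hash and verify_registry are hypothetical names. Treating a tool with no recorded hash as a violation is also an assumption.

```python
import hashlib
import json


def definition_hash(definition: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_registry(current: dict, recorded: dict, fail_on_violation: bool = False) -> list:
    """Return the names of tools whose hash differs from the recorded one.

    current:  tool name -> definition dict
    recorded: tool name -> known-good hex digest
    """
    violations = sorted(
        name
        for name, definition in current.items()
        # Assumption: a tool without a recorded hash is also reported.
        if recorded.get(name) != definition_hash(definition)
    )
    if violations and fail_on_violation:
        # Mirrors the fail_on_violation=True behaviour described above.
        raise RuntimeError(f"tool registry integrity violation: {violations}")
    return violations
```

Because the comparison runs once, a definition that changes after startup is not caught by a check of this shape, which is the limitation the residual-risk note describes for MCP-bridged tools.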
Class 5: Systemic / Cascading Failure
Threat: Single faults (hallucinations, poisoned tool outputs, coordination failures) propagating autonomously across multiple agents, compounding into widespread service failures.
Existing coverage:
- S1 15-risk register with mitigations for all systemic risks (see s1-multi-agent-decision.md section 3)
- Circuit breakers via BudgetEnforcer with per-task and daily limits
- StagnationDetector with configurable thresholds
- CoordinationReplanHook with max_stall_count/max_reset_count hard caps preventing infinite replan loops
- Team-size bounds (3-4 per coordination group, 8 hard cap per meeting)
- AssumptionViolationSignal propagated as escalation events
Gap: None identified.
Residual risk: Novel failure modes not covered by the 15-risk register. Continuous monitoring and periodic risk register updates are recommended.
Class 6: HITL Cognitive Bias Exploitation
Threat: Exploiting human over-reliance on agent recommendations and authority bias to trick operators into approving harmful actions or disclosing sensitive information.
Attack vectors:
- Overwhelming operators with frequent low-risk approvals to induce approval fatigue before a high-risk request
- Framing high-risk actions as routine or previously approved
- Exploiting time pressure in approval timeout policies
Existing coverage:
- EvidencePackage (R4 #1263) provides structured HITL approval artifacts with RecommendedAction options and narrative context
- AuditChainSink creates tamper-evident trails of all approval decisions
- ApprovalGate with configurable timeout policies (wait-forever, auto-deny, tiered, escalation chain)
Gap: No cognitive-bias-specific warnings in the dashboard UI. The EvidencePackage structure supports bias mitigation (narrative context, multiple recommended actions) but the UI doesn't currently surface bias-specific cues (e.g. "This is the fifth approval in 10 minutes; consider slowing down").
New mitigation: Threat model documentation only. UI-level cognitive bias warnings are a future dashboard enhancement.
Residual risk: Sophisticated social engineering through agent outputs that are technically accurate but strategically misleading.
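The approval-fatigue cue mentioned in the gap ("fifth approval in 10 minutes") could be driven by a simple sliding-window counter. The sketch below is purely illustrative of that future enhancement; ApprovalRateMonitor and its parameters are hypothetical and do not correspond to any existing dashboard or ApprovalGate API.

```python
from collections import deque


class ApprovalRateMonitor:
    """Counts approvals inside a sliding time window."""

    def __init__(self, max_approvals: int = 5, window_seconds: float = 600.0):
        self.max_approvals = max_approvals
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now: float) -> bool:
        """Record an approval at time `now` (in seconds).

        Returns True when this approval brings the count inside the window
        to max_approvals or more, i.e. when a fatigue cue should be shown.
        """
        self._timestamps.append(now)
        # Drop approvals that have aged out of the window.
        while now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.max_approvals
```

Passing the timestamp in explicitly, rather than reading the clock inside the method, keeps the counter deterministic and easy to test. A cue like this addresses approval fatigue only; it does nothing about requests that are framed as routine.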
S1 Cross-Reference
The S1 multi-agent decision framework (s1-multi-agent-decision.md section 3) covers 15 emergent risks from multi-agent cooperation. Key overlaps with this threat model:
| S1 Risk | This Threat Class | Overlap |
|---|---|---|
| 2.2 Authority deference | Class 2, Class 6 | AuthorityDeferenceGuard addresses both adversarial manipulation and HITL bias |
| 3.2 Over-adherence | Class 2 | AssumptionViolationSignal detects rigid adherence to potentially compromised instructions |
| 4.3 Semantic drift in handoffs | Class 2 | DelegationChainHashMiddleware + SemanticDriftDetector |
| 1.4 Strategic info withholding | Class 3 | Memory poisoning via selective information omission |
This threat model focuses on adversarial-content risks from external sources; S1 focuses on emergent risks from multi-agent coordination. Together they cover the full execution safety surface.