feat(security): implement Prompt Injection Mitigation (5 mechanisms) #572

Closed
opened 2026-03-04 23:39:57 +00:00 by freemo · 1 comment
Owner

Metadata

| Field | Value |
|-------|-------|
| Commit Message | feat(security): implement Prompt Injection Mitigation (5 mechanisms) |
| Branch | feature/m5-prompt-injection-mitigation |

Summary

The specification defines 5 prompt injection mitigation mechanisms (spec lines ~43865-43877) for server mode. None are implemented in the codebase. The existing _sanitize_python_content in plan_service.py only strips code fences from Python output -- it does NOT implement the spec's prompt injection protections.

Spec Reference

  • Section: Architecture > Security Model > Prompt Injection Mitigation
  • Lines: 43865-43877

Description

The spec requires 5 mechanisms:

  1. Input sanitization: User-provided text in action arguments, invariant text, and session prompts is sanitized before inclusion in LLM prompts. HTML entities, control characters, and known injection patterns are escaped or rejected.
  2. Prompt boundary markers: System prompts and user content are separated by [USER_CONTENT_START] and [USER_CONTENT_END] markers. The system prompt explicitly instructs the LLM to recognize these boundaries.
  3. Output validation: LLM outputs used as tool invocations are validated against the tool's JSON Schema before execution.
  4. Tool capability restrictions: Tools declare capabilities (read_only, writes, checkpointable, side_effects). Execution engine enforces these declarations.
  5. Unsafe tool gating: Tools marked as unsafe are blocked unless the automation profile explicitly sets allow_unsafe_tools: true. In server mode, only administrators can create such profiles.
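
As an illustration of mechanisms 1 and 2, here is a minimal sketch of a sanitizer using only the Python standard library. The class and method names follow the acceptance criteria, but everything else is an assumption. The pattern list is hypothetical, since the issue does not enumerate the spec's "known injection patterns". The choice to reject injection patterns while escaping HTML and stripping control characters is one possible reading of "escaped or rejected". `PromptInjectionError` is an invented name.

```python
import html
import re

USER_CONTENT_START = "[USER_CONTENT_START]"
USER_CONTENT_END = "[USER_CONTENT_END]"

# Hypothetical examples only; the real pattern list would come from the spec.
_INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+the\s+system\s+prompt", re.IGNORECASE),
]

# ASCII control characters except tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


class PromptInjectionError(ValueError):
    """Raised when user text matches a known injection pattern (assumed name)."""


class PromptSanitizer:
    def sanitize_user_input(self, text: str) -> str:
        # Strip control characters first so they cannot be used to split a pattern.
        text = _CONTROL_CHARS.sub("", text)
        for pattern in _INJECTION_PATTERNS:
            if pattern.search(text):
                raise PromptInjectionError(
                    f"input rejected: matches {pattern.pattern!r}"
                )
        # Neutralise spoofed boundary markers so user text cannot close the block early.
        for marker in (USER_CONTENT_START, USER_CONTENT_END):
            text = text.replace(marker, "(marker removed)")
        # Escape &, <, > and quotes as HTML entities.
        return html.escape(text, quote=True)

    def wrap_user_content(self, text: str) -> str:
        # Sanitize, then enclose the result in the boundary markers.
        clean = self.sanitize_user_input(text)
        return f"{USER_CONTENT_START}\n{clean}\n{USER_CONTENT_END}"
```

In this sketch `wrap_user_content()` always sanitizes before wrapping, so a caller cannot emit markers around unsanitized text. Whether the real implementation couples the two steps this way is a design decision left to the implementer.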

Acceptance Criteria

  • Implement PromptSanitizer class with methods: sanitize_user_input(text) -> str, wrap_user_content(text) -> str (adds boundary markers)
  • Integrate sanitizer into session prompt construction, action argument handling, and invariant text processing
  • Implement LLM output validation against tool JSON Schema before tool execution in the actor runtime
  • Enforce tool capability declarations at runtime (read_only, writes, checkpointable, side_effects)
  • Implement unsafe tool gating based on automation profile allow_unsafe_tools flag
  • Tests: injection pattern detection tests, boundary marker tests, capability enforcement tests

Related Issues

  • Epic: Security & Safety Hardening #362
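
The output-validation criterion above (LLM output checked against the tool's JSON Schema before execution) could be sketched as follows. To stay self-contained, this example hand-rolls a check for a small subset of JSON Schema (`required`, `properties` with a simple `type`, and `additionalProperties: false`). It is not a full validator, and a real implementation would more likely delegate to an established JSON Schema library. The names `ToolInvocationError`, `validate_tool_arguments`, and `execute_validated` are hypothetical.

```python
import json
from typing import Any, Callable

# Mapping from JSON Schema type names to Python types (subset only).
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


class ToolInvocationError(ValueError):
    """Raised when LLM output does not satisfy the tool's schema (assumed name)."""


def validate_tool_arguments(arguments: Any, schema: dict) -> None:
    if not isinstance(arguments, dict):
        raise ToolInvocationError("tool arguments must be a JSON object")
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in arguments:
            raise ToolInvocationError(f"missing required argument {name!r}")
    for name, value in arguments.items():
        if name not in properties:
            if schema.get("additionalProperties", True) is False:
                raise ToolInvocationError(f"unexpected argument {name!r}")
            continue
        expected = properties[name].get("type")
        py_type = _JSON_TYPES.get(expected) if isinstance(expected, str) else None
        if py_type is None:
            continue  # type keyword absent or outside this sketch's subset
        # bool is a subclass of int in Python, but JSON Schema keeps them distinct.
        if isinstance(value, bool) and expected != "boolean":
            raise ToolInvocationError(f"argument {name!r} must be {expected}")
        if not isinstance(value, py_type):
            raise ToolInvocationError(f"argument {name!r} must be {expected}")


def execute_validated(schema: dict, raw_llm_output: str, run_tool: Callable[..., Any]) -> Any:
    """Parse and validate the LLM's tool call; run the tool only if it passes."""
    try:
        arguments = json.loads(raw_llm_output)
    except json.JSONDecodeError as exc:
        raise ToolInvocationError("tool call is not valid JSON") from exc
    validate_tool_arguments(arguments, schema)
    return run_tool(**arguments)
```

The key property is ordering: the tool callable is never reached unless parsing and validation both succeed.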

Suggested Milestone

v3.3.0 (Security hardening)

Priority

High

Suggested Assignee

@freemo (security architecture)

Subtasks

  • Code: Implement PromptSanitizer class with sanitize_user_input() (escape HTML entities, control chars, known injection patterns) and wrap_user_content() (boundary markers)
  • Code: Integrate sanitizer into session prompt construction, action argument handling, and invariant text processing paths
  • Code: Implement LLM output validation against tool JSON Schema before tool execution in the actor runtime
  • Code: Enforce tool capability declarations (read_only, writes, checkpointable, side_effects) at runtime in the execution engine
  • Code: Implement unsafe tool gating based on automation profile allow_unsafe_tools flag; server mode restricts profile creation to admins
  • Docs: Document the 5 prompt injection mitigation mechanisms in relevant security docs
  • Behave tests: Add BDD feature file features/security/prompt_injection_mitigation.feature covering all 5 mechanisms
  • Robot tests: Add Robot Framework integration tests for injection pattern detection, boundary markers, capability enforcement, and unsafe tool gating
  • ASV benchmarks: Add ASV benchmark for sanitization throughput (benchmarks/bench_prompt_sanitizer.py)
  • Quality: coverage ≥97%: Verify via nox -s coverage_report
  • Quality: nox full suite: Run nox (all default sessions), fix any errors
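
For the capability-enforcement and unsafe-gating subtasks, a minimal sketch of the runtime checks might look like the following. All type and function names here (`ToolSpec`, `AutomationProfile`, `CapabilityViolation`, `create_profile`, `check_invocation`) are hypothetical. The issue also does not define the exact semantics of each capability flag, so the rules encoded below are assumptions, not the spec's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Declared capabilities of a tool (field names taken from the issue text)."""
    name: str
    read_only: bool = True
    writes: bool = False
    checkpointable: bool = False
    side_effects: bool = False
    unsafe: bool = False


@dataclass(frozen=True)
class AutomationProfile:
    name: str
    allow_unsafe_tools: bool = False


class CapabilityViolation(PermissionError):
    """Raised when an invocation exceeds what the tool or profile permits."""


def create_profile(name: str, *, allow_unsafe_tools: bool = False,
                   server_mode: bool, is_admin: bool) -> AutomationProfile:
    # In server mode, only administrators may create profiles that allow unsafe tools.
    if allow_unsafe_tools and server_mode and not is_admin:
        raise CapabilityViolation("only administrators may enable allow_unsafe_tools")
    return AutomationProfile(name, allow_unsafe_tools)


def check_invocation(tool: ToolSpec, profile: AutomationProfile, *,
                     wants_write: bool, needs_checkpoint: bool = False) -> None:
    """Raise CapabilityViolation unless the invocation is permitted."""
    if tool.unsafe and not profile.allow_unsafe_tools:
        raise CapabilityViolation(f"{tool.name} is unsafe and the profile does not allow it")
    # Assumed rule: a write needs the 'writes' capability and must not be read_only.
    if wants_write and (tool.read_only or not tool.writes):
        raise CapabilityViolation(f"{tool.name} did not declare write capability")
    if needs_checkpoint and not tool.checkpointable:
        raise CapabilityViolation(f"{tool.name} is not checkpointable")
```

The execution engine would call `check_invocation()` immediately before dispatching a tool, so a denied call fails before any side effect occurs.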

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
freemo self-assigned this 2026-03-05 00:29:59 +00:00
freemo added this to the v3.4.0 milestone 2026-03-05 00:29:59 +00:00
Author
Owner

CTO verification: Issue verified. The spec clearly mandates all 5 prompt injection mitigation mechanisms (lines ~43865-43877) and none are currently implemented. The existing _sanitize_python_content in plan_service.py only strips code fences — it does NOT implement the spec's injection protections.

Moving to State/In Progress. Implementing now.

Blocks
#396 Epic: ACMS Context Pipeline
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#572