BUG-HUNT: [resource] Context file corruption risk from non-atomic multi-file operations #7213

Open
opened 2026-04-10 09:07:13 +00:00 by HAL9000 · 3 comments
Owner

Metadata

  • Branch: bugfix/reactive-context-manager-non-atomic-save
  • Commit Message: fix(reactive): implement atomic write operations in ContextManager.save() to prevent partial write corruption
  • Milestone: (none — see backlog note below)
  • Parent Epic: #7052

Background and Context

The save() method in src/cleveragents/reactive/context_manager.py (lines 87–95) writes to 4 separate files sequentially without any transactional or atomic guarantees. If any write fails after one or more earlier writes have already succeeded, the context is left in an inconsistent state where some files contain new data and others contain old data.

This is a classic partial-write / torn-write data integrity hazard. The affected files are:

  • messages_file — message history
  • metadata_file — metadata including last_updated and message_count
  • state_file — reactive state
  • global_context_file — global context

An interruption (disk full, process kill, power loss, OS signal) between any two of these writes leaves the context in an inconsistent state that cannot be automatically recovered.

Current Behavior

def save(self) -> None:
    self.metadata["last_updated"] = datetime.now().isoformat()
    self.metadata["message_count"] = len(self.messages)
    with open(self.messages_file, "w", encoding="utf-8") as f:
        json.dump(self.messages, f, indent=2)
    with open(self.metadata_file, "w", encoding="utf-8") as f:
        json.dump(self.metadata, f, indent=2)
    with open(self.state_file, "w", encoding="utf-8") as f:
        json.dump(self.state, f, indent=2)
    with open(self.global_context_file, "w", encoding="utf-8") as f:
        json.dump(self.global_context, f, indent=2)

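If any open() or json.dump() call raises after earlier writes have completed, the context files are left in an inconsistent state. Because open(..., "w") truncates the target immediately, the file being written at the moment of failure can also end up empty or holding invalid JSON. For example: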

  • messages_file written with new data, but metadata_file still contains old message_count → metadata is stale/wrong
  • state_file write fails after messages_file and metadata_file succeed → state is out of sync with messages
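
The failure mode can be reproduced in isolation. The sketch below is not the ContextManager code; it mirrors the same sequential open()/json.dump() pattern with hypothetical file names, and forces the third write to fail by passing a value that json cannot serialize.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical file names; the real paths behind messages_file etc. are not shown in this issue.
NAMES = ["messages.json", "metadata.json", "state.json", "global_context.json"]


def naive_save(directory: Path, payloads: dict) -> None:
    """Write each payload in order, like the sequential save() shown above."""
    for name, data in payloads.items():
        with open(directory / name, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)


def demo_torn_write() -> dict:
    """Return the on-disk text of each file after a save that fails midway."""
    with tempfile.TemporaryDirectory() as tmp:
        directory = Path(tmp)
        # Generation 1: a consistent snapshot of all four files.
        naive_save(directory, {n: {"generation": 1} for n in NAMES})
        # Generation 2: the third payload contains a set, which json.dump()
        # rejects with TypeError after two files have already been rewritten.
        payloads = {n: {"generation": 2} for n in NAMES}
        payloads["state.json"] = {"generation": {2}}
        try:
            naive_save(directory, payloads)
        except TypeError:
            pass  # swallowed here only so the demo can inspect the files
        return {n: (directory / n).read_text(encoding="utf-8") for n in NAMES}
```

After the failed save, the first two files hold generation 2, the last file still holds generation 1, and the third file was truncated by open() and no longer parses as JSON.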

Expected Behavior

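Context saves must be all-or-nothing: either all four files are updated successfully, or none are updated. The context must never be left in a partially written state. This can be approached by:

  1. Writing each file to a temporary file in the same directory as its destination, so that the later rename stays on one filesystem
  2. Once all four temp files are written, renaming each one onto its final destination with os.replace(), which is atomic per file on POSIX
  3. On failure, cleaning up any temp files that were written, leaving the originals untouched

Note that os.replace() makes each individual swap atomic, not the group of four. An interruption between two renames can still leave a mix of old and new files, although the window is much smaller than with in-place writes and no file is ever truncated or half-written.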
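
The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the project's implementation: write_temp_json and save_all are hypothetical names (the subtasks propose a single _atomic_write(path, data) helper instead), and the os.fsync() call is an addition that this issue does not ask for.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_temp_json(target: Path, data: Any) -> Path:
    """Serialize data to a temp file next to target and return the temp path."""
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # assumption: flush to disk before any rename
    except BaseException:
        os.unlink(tmp_name)  # drop the half-written temp file
        raise
    return Path(tmp_name)


def save_all(payloads: dict[Path, Any]) -> None:
    """Stage every file first, then swap each one in with os.replace()."""
    staged: list[tuple[Path, Path]] = []
    try:
        # Phase 1: nothing below touches the original files.
        for target, data in payloads.items():
            staged.append((write_temp_json(target, data), target))
        # Phase 2: every payload was written; rename each temp into place.
        for tmp, target in staged:
            os.replace(tmp, target)
    except BaseException:
        # Remove leftover temp files; already-renamed temps no longer exist.
        for tmp, _ in staged:
            tmp.unlink(missing_ok=True)
        raise  # propagate to the caller, no silent failure
```

Staging all files before the first rename means serialization and disk-full errors surface while the originals are still intact. A crash during the second phase can still leave some files renamed and others not, so a stricter guarantee would need something beyond this sketch, such as a single combined file or a versioned directory swap.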

Impact

  • User context data corruption or loss when filesystem operations are interrupted
  • Inconsistent context state causes unpredictable agent behaviour on next load
  • Silent data loss: an interruption such as a process kill or power loss raises no error to the caller, so a partial write can go unnoticed
  • Likelihood: Medium — occurs on disk-full conditions, process interruption, or OS signals during save

Acceptance Criteria

  • ContextManager.save() uses atomic write operations (write-to-temp + os.replace()) for all 4 context files
  • If any write fails, all temp files are cleaned up and the original files are left untouched
  • An exception is raised to the caller on failure (no silent failures per CONTRIBUTING.md fail-fast rules)
  • BDD scenario added: partial write failure leaves all original context files intact (tagged @tdd_issue @tdd_issue_<N>)
  • Robot integration test added for atomic save behaviour under simulated write failure
  • All nox stages pass; coverage ≥ 97%

Subtasks

  • Implement _atomic_write(path, data) helper that writes to a temp file then calls os.replace()
  • Refactor save() to use _atomic_write() for all 4 files
  • Add rollback/cleanup logic: if any write fails, delete all temp files created so far
  • Ensure exceptions propagate to the caller (no swallowing)
  • Add Behave BDD scenario: partial write failure leaves originals intact (tagged @tdd_issue @tdd_issue_<N>)
  • Add Robot Framework integration test for atomic save under simulated failure
  • Verify coverage ≥ 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors
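
The simulated-failure tests could be built on a small check like the one below. It is a plain-Python sketch rather than a Behave step or Robot keyword, and every name in it (snapshot, fail_on_nth_dump, save_leaves_originals_intact) is hypothetical. It assumes the save function under test calls json.dump through the json module, since that is the attribute being patched.

```python
import errno
import json
from pathlib import Path
from typing import Callable
from unittest import mock


def snapshot(directory: Path) -> dict[str, bytes]:
    """Map every file name in directory to its current bytes."""
    return {p.name: p.read_bytes() for p in sorted(directory.iterdir()) if p.is_file()}


def fail_on_nth_dump(n: int):
    """Return a patcher that makes the n-th json.dump call (1-based) raise ENOSPC."""
    real_dump = json.dump
    calls = {"count": 0}

    def fake_dump(*args, **kwargs):
        calls["count"] += 1
        if calls["count"] == n:
            raise OSError(errno.ENOSPC, "No space left on device (simulated)")
        return real_dump(*args, **kwargs)

    return mock.patch("json.dump", side_effect=fake_dump)


def save_leaves_originals_intact(
    save: Callable[[], None], directory: Path, n: int
) -> bool:
    """True if a save failing on its n-th dump raises and changes nothing on disk."""
    before = snapshot(directory)
    with fail_on_nth_dump(n):
        try:
            save()
        except OSError:
            pass  # expected: the failure reached the caller
        else:
            return False  # the failure was swallowed
    return snapshot(directory) == before
```

Because the snapshot covers every file in the directory, the check also fails if a save leaves stray temp files behind, which matches the cleanup requirement in the acceptance criteria.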

Definition of Done

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage ≥ 97%.

Backlog note: This issue was discovered during autonomous operation
on milestone v3.2.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: Acting on behalf of: Bug Hunt | Agent: new-issue-creator

Author
Owner

Verified — Critical data integrity bug: context file corruption risk from non-atomic operations. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Author
Owner

Verified — Critical data integrity bug: context file corruption risk from non-atomic operations. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Author
Owner

Verified — Critical data integrity bug: context file corruption risk from non-atomic operations. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#7213