BUG-HUNT: [data-integrity] context_manager.py save() non-atomic — process kill between file writes leaves corrupt state #7486

Open
opened 2026-04-10 20:47:14 +00:00 by HAL9000 · 1 comment
Owner

Bug Report: Data Integrity — ContextManager.save() Non-Atomic Multi-File Write

Severity Assessment

  • Impact: Context state permanently corrupted on process interruption — messages.json reflects new data while metadata.json/state.json are stale
  • Likelihood: Medium — any crash, kill signal, or disk error during save
  • Priority: High

Location

  • File: src/cleveragents/reactive/context_manager.py
  • Function: ContextManager.save
  • Category: data-integrity

Description

save() writes four separate files sequentially with no atomicity guarantee. If the process is killed after messages.json is written but before metadata.json or state.json, the context is in a permanently inconsistent state. There is no backup of the previous state and no way to detect or recover from partial writes.

Evidence

def save(self) -> None:
    with open(self.messages_file, "w", ...) as f: json.dump(...)   # written
    with open(self.metadata_file, "w", ...) as f: json.dump(...)   # kill here → inconsistent
    with open(self.state_file, "w", ...) as f: json.dump(...)
    with open(self.global_context_file, "w", ...) as f: json.dump(...)

After a partial save, messages_file has new content but metadata_file has stale message_count. On next load, the mismatch causes incorrect behavior.
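The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `naive_save` and `is_consistent` are hypothetical helpers, only two of the four files are modeled, and it assumes `messages.json` holds a list and `metadata.json` holds a `message_count` key.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def naive_save(directory: Path, messages: list, fail_after: Optional[int] = None) -> None:
    """Write the files one after another; optionally simulate a kill partway through."""
    payloads = [
        ("messages.json", messages),
        ("metadata.json", {"message_count": len(messages)}),
    ]
    for written, (name, data) in enumerate(payloads):
        if fail_after is not None and written >= fail_after:
            # Stand-in for the process dying between two file writes.
            raise RuntimeError("simulated kill")
        with open(directory / name, "w", encoding="utf-8") as f:
            json.dump(data, f)


def is_consistent(directory: Path) -> bool:
    """Compare the stored message_count with the number of stored messages."""
    messages = json.loads((directory / "messages.json").read_text(encoding="utf-8"))
    metadata = json.loads((directory / "metadata.json").read_text(encoding="utf-8"))
    return metadata["message_count"] == len(messages)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        naive_save(root, ["first"])
        try:
            naive_save(root, ["first", "second"], fail_after=1)
        except RuntimeError:
            pass
        print("consistent after interrupted save:", is_consistent(root))
```

A check like `is_consistent` could also run at load time to detect a partial save, although detection alone does not restore the previous state.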

Expected Behavior

Either all four files are updated atomically, or a crash leaves all files in the previous consistent state.
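One way to get this all-or-nothing property across several files is to write them into a fresh snapshot directory and publish that directory with a single atomic rename of a small pointer file. This is a sketch of that technique under stated assumptions, not the layout `ContextManager` currently uses: the `CURRENT` pointer file, the `snapshot-*` directory names, and both function names are hypothetical.

```python
import json
import os
import tempfile
import uuid
from pathlib import Path
from typing import Any, Dict

POINTER = "CURRENT"  # hypothetical pointer file naming the live snapshot directory


def save_snapshot(root: Path, files: Dict[str, Any]) -> Path:
    """Write every file into a new directory, then publish it in one atomic step."""
    snapshot = root / f"snapshot-{uuid.uuid4().hex}"
    snapshot.mkdir(parents=True)
    for name, data in files.items():
        with open(snapshot / name, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
    # Readers only ever follow the pointer, so the new snapshot becomes visible
    # in a single os.replace() call, or not at all if we crash before it.
    tmp_pointer = root / (POINTER + ".tmp")
    with open(tmp_pointer, "w", encoding="utf-8") as f:
        f.write(snapshot.name)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_pointer, root / POINTER)
    return snapshot


def load_snapshot(root: Path) -> Dict[str, Any]:
    """Load all JSON files from the snapshot the pointer currently names."""
    snapshot = root / (root / POINTER).read_text(encoding="utf-8").strip()
    return {p.name: json.loads(p.read_text(encoding="utf-8")) for p in snapshot.iterdir()}
```

A crash before the final `os.replace()` leaves the pointer on the previous snapshot, so a reader sees either the old set of files or the new one. The sketch does not clean up abandoned or superseded snapshot directories; a real implementation would need to.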

Actual Behavior

A crash during save leaves files in a mixed old/new state with no recovery path.

Suggested Fix

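Write each file to a .tmp sibling, then swap it into place with os.replace(). This makes each individual write atomic, so no file is ever left truncated or half-written. A kill between two os.replace() calls can still leave the four files from different saves, so full all-or-nothing behavior across the set needs an extra step (for example, a single commit marker that is written last).

import json
import os
from pathlib import Path
from typing import Any

def _atomic_write(path: Path, data: Any) -> None:
    # Append ".tmp" to the full name so files sharing a stem get distinct temp files.
    tmp = path.with_name(path.name + ".tmp")
    try:
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        os.replace(tmp, path)  # atomic when tmp and path are on the same filesystem
    except Exception:
        tmp.unlink(missing_ok=True)
        raise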

def save(self) -> None:
    _atomic_write(self.messages_file, ...)
    _atomic_write(self.metadata_file, ...)
    _atomic_write(self.state_file, ...)
    _atomic_write(self.global_context_file, ...)
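To check the per-file behavior, a failed write should leave the previous file contents intact and no temp file behind. The sketch below is a self-contained variant of the helper, written as an illustration rather than as the project's implementation: `atomic_write_json` is a hypothetical name, and it uses `tempfile.mkstemp` so that overlapping saves do not share a temp file name.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def atomic_write_json(path: Path, data: Any) -> None:
    """Write JSON to a unique temp file next to `path`, then rename it over `path`."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file on any failure; the original file is untouched.
        Path(tmp_name).unlink(missing_ok=True)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "metadata.json"
        atomic_write_json(target, {"message_count": 1})
        try:
            # object() is not JSON-serializable, so this write fails midway.
            atomic_write_json(target, {"message_count": object()})
        except TypeError:
            pass
        print(target.read_text(encoding="utf-8"))
```

The temp file is created in the same directory as the target because `os.replace()` is only atomic within a single filesystem.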

Category

data-integrity

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor

HAL9000 added this to the v3.5.0 milestone 2026-04-10 21:39:20 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: High — Concurrency/data integrity bug in autonomy hardening components that impacts M6 milestone functionality
  • Milestone: v3.5.0 (M6: Autonomy Hardening) — This component is core to autonomous execution, guardrails, and context management
  • Story Points: 3 (M) — Bug fix with clear reproduction path
  • MoSCoW: Must Have — Autonomy hardening requires correct concurrency and data integrity
  • Type: Bug

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#7486