Fix race condition in BoundedMemorySaver._prune causing data corruption under concurrent access #3448

Open
opened 2026-04-05 17:11:20 +00:00 by freemo · 1 comment
Owner

Metadata

  • Branch: fix/bounded-memory-saver-thread-safety
  • Commit Message: fix(memory): add threading lock to BoundedMemorySaver._prune to prevent race condition
  • Milestone: (none — backlog)
  • Parent Epic: #362

Background and Context

BoundedMemorySaver (in src/cleveragents/agents/graphs/plan_generation.py, lines 70–105) subclasses MemorySaver and overrides put to prune old checkpoints via _prune. The _prune method directly mutates three shared dictionaries — storage, blobs, and writes — without any synchronisation primitive.

Because ainvoke is designed for concurrent use (multiple coroutines or threads may call put simultaneously), this creates a classic TOCTOU race condition: two concurrent callers can both pass the len(checkpoints) <= self.max_checkpoints guard, then both attempt to pop the same checkpoint entries, leading to KeyError exceptions or silently corrupted checkpoint state.

Current Behavior

BoundedMemorySaver._prune modifies storage, blobs, and writes without holding any lock. Under concurrent put calls the following failure modes are possible:

  • Double-pop / KeyError: two threads both identify the same checkpoint IDs for removal and both attempt to pop them; the second pop silently returns None (for blobs/writes) or raises KeyError (for storage).
  • Under-pruning: two threads both read len(checkpoints) before either has pruned, so both decide pruning is needed but each removes a different subset, leaving more entries than max_checkpoints.
  • Over-pruning / data loss: interleaved pops can remove checkpoints that were just written by the other thread.

Expected Behavior

BoundedMemorySaver._prune (and by extension put) must be safe to call from multiple threads or coroutines concurrently. The shared storage, blobs, and writes dictionaries must be protected by a lock so that the read-check-and-prune sequence is atomic.

Acceptance Criteria

  • A threading.Lock (or asyncio.Lock if async-only usage is guaranteed) is acquired before entering the prune logic and released after.
  • No KeyError or silent data loss occurs when put is called concurrently from multiple threads/coroutines.
  • Existing unit tests for BoundedMemorySaver continue to pass.
  • New concurrency regression test added: two threads call put simultaneously and the resulting checkpoint count equals max_checkpoints.
  • All nox stages pass; coverage ≥ 97%.

Supporting Information

  • File: src/cleveragents/agents/graphs/plan_generation.py
  • Class: BoundedMemorySaver
  • Lines: 70–105
  • Suggested fix (from bug report):
    import threading
    
    class BoundedMemorySaver(MemorySaver):
        def __init__(self, *, max_checkpoints: int = 2) -> None:
            super().__init__()
            self.max_checkpoints = max(1, max_checkpoints)
            self._lock = threading.Lock()
    
        def _prune(self, thread_id: str, checkpoint_ns: str) -> None:
            with self._lock:
                # ... existing prune logic unchanged
    
  • Category: concurrency

Backlog note: This issue was discovered during autonomous operation
on milestone NONE. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.

Subtasks

  • Add self._lock = threading.Lock() to BoundedMemorySaver.__init__
  • Wrap the body of _prune in with self._lock:
  • Evaluate whether asyncio.Lock is more appropriate given async usage patterns; document the decision
  • Tests (pytest/Behave): Add concurrency regression test — two threads call put simultaneously, assert final checkpoint count equals max_checkpoints
  • Verify no existing tests break
  • Verify coverage ≥ 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage ≥ 97%.

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/bounded-memory-saver-thread-safety` - **Commit Message**: `fix(memory): add threading lock to BoundedMemorySaver._prune to prevent race condition` - **Milestone**: *(none — backlog)* - **Parent Epic**: #362 ## Background and Context `BoundedMemorySaver` (in `src/cleveragents/agents/graphs/plan_generation.py`, lines 70–105) subclasses `MemorySaver` and overrides `put` to prune old checkpoints via `_prune`. The `_prune` method directly mutates three shared dictionaries — `storage`, `blobs`, and `writes` — without any synchronisation primitive. Because `ainvoke` is designed for concurrent use (multiple coroutines or threads may call `put` simultaneously), this creates a classic TOCTOU race condition: two concurrent callers can both pass the `len(checkpoints) <= self.max_checkpoints` guard, then both attempt to `pop` the same checkpoint entries, leading to `KeyError` exceptions or silently corrupted checkpoint state. ## Current Behavior `BoundedMemorySaver._prune` modifies `storage`, `blobs`, and `writes` without holding any lock. Under concurrent `put` calls the following failure modes are possible: - **Double-pop / KeyError**: two threads both identify the same checkpoint IDs for removal and both attempt to `pop` them; the second pop silently returns `None` (for `blobs`/`writes`) or raises `KeyError` (for `storage`). - **Under-pruning**: two threads both read `len(checkpoints)` before either has pruned, so both decide pruning is needed but each removes a different subset, leaving more entries than `max_checkpoints`. - **Over-pruning / data loss**: interleaved pops can remove checkpoints that were just written by the other thread. ## Expected Behavior `BoundedMemorySaver._prune` (and by extension `put`) must be safe to call from multiple threads or coroutines concurrently. The shared `storage`, `blobs`, and `writes` dictionaries must be protected by a lock so that the read-check-and-prune sequence is atomic. ## Acceptance Criteria - [ ] A `threading.Lock` (or `asyncio.Lock` if async-only usage is guaranteed) is acquired before entering the prune logic and released after. - [ ] No `KeyError` or silent data loss occurs when `put` is called concurrently from multiple threads/coroutines. - [ ] Existing unit tests for `BoundedMemorySaver` continue to pass. - [ ] New concurrency regression test added: two threads call `put` simultaneously and the resulting checkpoint count equals `max_checkpoints`. - [ ] All nox stages pass; coverage ≥ 97%. ## Supporting Information - **File**: `src/cleveragents/agents/graphs/plan_generation.py` - **Class**: `BoundedMemorySaver` - **Lines**: 70–105 - Suggested fix (from bug report): ```python import threading class BoundedMemorySaver(MemorySaver): def __init__(self, *, max_checkpoints: int = 2) -> None: super().__init__() self.max_checkpoints = max(1, max_checkpoints) self._lock = threading.Lock() def _prune(self, thread_id: str, checkpoint_ns: str) -> None: with self._lock: # ... existing prune logic unchanged ``` - Category: `concurrency` > **Backlog note:** This issue was discovered during autonomous operation > on milestone NONE. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. ## Subtasks - [ ] Add `self._lock = threading.Lock()` to `BoundedMemorySaver.__init__` - [ ] Wrap the body of `_prune` in `with self._lock:` - [ ] Evaluate whether `asyncio.Lock` is more appropriate given async usage patterns; document the decision - [ ] Tests (pytest/Behave): Add concurrency regression test — two threads call `put` simultaneously, assert final checkpoint count equals `max_checkpoints` - [ ] Verify no existing tests break - [ ] Verify coverage ≥ 97% via `nox -s coverage_report` - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. - All nox stages pass. - Coverage ≥ 97%. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: ca-new-issue-creator
freemo added this to the v3.5.0 milestone 2026-04-05 17:12:59 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: Medium (elevated from Backlog) — This is a real concurrency bug that can cause data corruption (KeyError, silent data loss) under concurrent put calls. The fix is trivial (add a threading.Lock) but the impact is significant for autonomous execution.
  • Milestone: v3.5.0 (Autonomy Hardening — concurrent subplan execution requires thread-safe checkpoint management)
  • Story Points: 2 (S) — Add a lock to __init__ and wrap _prune in with self._lock:. Well-defined fix with clear acceptance criteria.
  • MoSCoW: Should Have — v3.5.0 acceptance criterion "Parallel execution scales to 10+ concurrent subplans" requires thread-safe checkpoint management. This race condition would surface under parallel execution.
  • Parent Epic: #362 (Security & Safety Hardening)

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Issue triaged by project owner: - **State**: Verified - **Priority**: Medium (elevated from Backlog) — This is a real concurrency bug that can cause data corruption (KeyError, silent data loss) under concurrent `put` calls. The fix is trivial (add a threading.Lock) but the impact is significant for autonomous execution. - **Milestone**: v3.5.0 (Autonomy Hardening — concurrent subplan execution requires thread-safe checkpoint management) - **Story Points**: 2 (S) — Add a lock to `__init__` and wrap `_prune` in `with self._lock:`. Well-defined fix with clear acceptance criteria. - **MoSCoW**: Should Have — v3.5.0 acceptance criterion "Parallel execution scales to 10+ concurrent subplans" requires thread-safe checkpoint management. This race condition would surface under parallel execution. - **Parent Epic**: #362 (Security & Safety Hardening) --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: ca-project-owner
freemo removed this from the v3.5.0 milestone 2026-04-06 21:05:30 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#362 Epic: Security & Safety Hardening
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#3448
No description provided.