BUG-HUNT: [concurrency] Race condition in SandboxManager.commit_all() lock ordering #7060

Open
opened 2026-04-10 07:27:43 +00:00 by HAL9000 · 1 comment
Owner

Bug Report: [Concurrency] — Race condition in SandboxManager.commit_all() lock ordering

Severity Assessment

  • Impact: Data corruption in concurrent plan executions; sandboxes can be modified or removed between snapshot and use
  • Likelihood: Medium — occurs under concurrent load when multiple plans execute simultaneously
  • Priority: Critical (data loss potential)

Location

  • File: src/cleveragents/infrastructure/sandbox/manager.py
  • Method: SandboxManager.commit_all()
  • Lines: ~260–300

Description

commit_all() acquires self._lock only long enough to snapshot the active sandbox list, then immediately releases it before performing any commit or rollback operations. This creates a TOCTOU (time-of-check/time-of-use) window: between the snapshot and the subsequent sandbox.commit() / sandbox.rollback() calls, another thread can mutate self._active_sandboxes — adding, removing, or replacing sandbox entries for the same plan_id. The result is that commit_all() may operate on stale or already-destroyed sandbox objects, leading to silent data corruption or partial commits.

Evidence

# Lock released immediately after snapshot — window opens here
with self._lock:
    sandboxes = list(self._active_sandboxes.get(plan_id, {}).values())

# ← Any thread can mutate _active_sandboxes[plan_id] here ←

committable = [
    sb
    for sb in sandboxes
    if sb.status in (SandboxStatus.CREATED, SandboxStatus.ACTIVE)
]
# ... commit/rollback loop runs entirely outside the lock

The docstring itself acknowledges this:

"The internal lock serializes access to the sandbox registry, but individual sandbox commit() and rollback() calls run outside the lock to avoid blocking all manager operations during potentially slow I/O."

While the intent is to avoid holding the lock during slow I/O, the current implementation provides no protection against concurrent structural mutations to the sandbox registry for the same plan_id.

Expected Behaviour

All reads and writes to _active_sandboxes for a given plan_id during a commit_all() call should be atomic with respect to other threads. Either:

  1. The lock is held for the entire operation (acceptable if commit I/O is fast enough), or
  2. A per-plan_id lock is used so that only the affected plan is serialised, or
  3. The snapshot is taken under the lock and the sandbox objects themselves are made thread-safe so that structural mutations to the registry cannot affect an in-flight commit.

Actual Behaviour

Sandboxes can be added, removed, or replaced in _active_sandboxes[plan_id] between the snapshot and the commit loop, breaking the all-or-nothing atomicity guarantee and potentially causing:

  • Commits against already-closed/replaced sandbox objects
  • Rollbacks that miss newly-added sandboxes
  • Silent partial commits leaving resources in an inconsistent state

Suggested Fix

Introduce a per-plan_id threading.Lock (or RLock) so that commit_all() holds the plan-level lock for its entire duration without blocking unrelated plans:

# Acquire a per-plan lock before snapshotting and hold it through commit/rollback
with self._plan_lock(plan_id):
    with self._lock:
        sandboxes = list(self._active_sandboxes.get(plan_id, {}).values())
    # commit/rollback loop here — plan-level lock still held

Alternatively, use snapshot isolation: copy the sandbox objects under the lock and ensure the registry entry is atomically replaced (not mutated in place) so that in-flight operations always see a consistent view.


Metadata

  • Branch: fix/concurrency-sandbox-manager-commit-all-lock-ordering
  • Commit Message: fix(sandbox): hold plan-level lock across commit_all() to eliminate TOCTOU race
  • Milestone: v3.7.0
  • Parent Epic: #7023

Subtasks

  • Reproduce the race condition with a concurrent test (two threads calling commit_all() for the same plan_id simultaneously)
  • Introduce per-plan_id locking (or equivalent snapshot-isolation mechanism) in SandboxManager
  • Update commit_all() docstring to accurately reflect the new thread-safety guarantee
  • Add/update unit tests covering concurrent commit_all() invocations
  • Add/update integration tests verifying atomicity under concurrent plan execution
  • Verify no deadlock is introduced between self._lock and the new per-plan lock (lock ordering must be consistent)

Definition of Done

  • A regression test that reliably triggers the race condition without the fix (and passes with it) is merged
  • commit_all() is safe to call concurrently for the same plan_id without data corruption
  • No new deadlock paths introduced (verified by lock-ordering analysis or tooling)
  • All nox stages pass
  • Coverage >= 97%

Backlog note: This issue was discovered during autonomous operation
on milestone v3.7.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: Acting on behalf of: Bug Hunt | Agent: new-issue-creator

## Bug Report: [Concurrency] — Race condition in `SandboxManager.commit_all()` lock ordering ### Severity Assessment - **Impact**: Data corruption in concurrent plan executions; sandboxes can be modified or removed between snapshot and use - **Likelihood**: Medium — occurs under concurrent load when multiple plans execute simultaneously - **Priority**: Critical (data loss potential) ### Location - **File**: `src/cleveragents/infrastructure/sandbox/manager.py` - **Method**: `SandboxManager.commit_all()` - **Lines**: ~260–300 ### Description `commit_all()` acquires `self._lock` only long enough to snapshot the active sandbox list, then immediately releases it before performing any commit or rollback operations. This creates a TOCTOU (time-of-check/time-of-use) window: between the snapshot and the subsequent `sandbox.commit()` / `sandbox.rollback()` calls, another thread can mutate `self._active_sandboxes` — adding, removing, or replacing sandbox entries for the same `plan_id`. The result is that `commit_all()` may operate on stale or already-destroyed sandbox objects, leading to silent data corruption or partial commits. ### Evidence ```python # Lock released immediately after snapshot — window opens here with self._lock: sandboxes = list(self._active_sandboxes.get(plan_id, {}).values()) # ← Any thread can mutate _active_sandboxes[plan_id] here ← committable = [ sb for sb in sandboxes if sb.status in (SandboxStatus.CREATED, SandboxStatus.ACTIVE) ] # ... commit/rollback loop runs entirely outside the lock ``` The docstring itself acknowledges this: > *"The internal lock serializes access to the sandbox registry, but individual sandbox `commit()` and `rollback()` calls run outside the lock to avoid blocking all manager operations during potentially slow I/O."* While the intent is to avoid holding the lock during slow I/O, the current implementation provides no protection against concurrent structural mutations to the sandbox registry for the same `plan_id`. ### Expected Behaviour All reads and writes to `_active_sandboxes` for a given `plan_id` during a `commit_all()` call should be atomic with respect to other threads. Either: 1. The lock is held for the entire operation (acceptable if commit I/O is fast enough), **or** 2. A per-`plan_id` lock is used so that only the affected plan is serialised, **or** 3. The snapshot is taken under the lock and the sandbox objects themselves are made thread-safe so that structural mutations to the registry cannot affect an in-flight commit. ### Actual Behaviour Sandboxes can be added, removed, or replaced in `_active_sandboxes[plan_id]` between the snapshot and the commit loop, breaking the all-or-nothing atomicity guarantee and potentially causing: - Commits against already-closed/replaced sandbox objects - Rollbacks that miss newly-added sandboxes - Silent partial commits leaving resources in an inconsistent state ### Suggested Fix Introduce a per-`plan_id` `threading.Lock` (or `RLock`) so that `commit_all()` holds the plan-level lock for its entire duration without blocking unrelated plans: ```python # Acquire a per-plan lock before snapshotting and hold it through commit/rollback with self._plan_lock(plan_id): with self._lock: sandboxes = list(self._active_sandboxes.get(plan_id, {}).values()) # commit/rollback loop here — plan-level lock still held ``` Alternatively, use snapshot isolation: copy the sandbox objects under the lock and ensure the registry entry is atomically replaced (not mutated in place) so that in-flight operations always see a consistent view. --- ## Metadata - **Branch**: `fix/concurrency-sandbox-manager-commit-all-lock-ordering` - **Commit Message**: `fix(sandbox): hold plan-level lock across commit_all() to eliminate TOCTOU race` - **Milestone**: v3.7.0 - **Parent Epic**: #7023 ## Subtasks - [ ] Reproduce the race condition with a concurrent test (two threads calling `commit_all()` for the same `plan_id` simultaneously) - [ ] Introduce per-`plan_id` locking (or equivalent snapshot-isolation mechanism) in `SandboxManager` - [ ] Update `commit_all()` docstring to accurately reflect the new thread-safety guarantee - [ ] Add/update unit tests covering concurrent `commit_all()` invocations - [ ] Add/update integration tests verifying atomicity under concurrent plan execution - [ ] Verify no deadlock is introduced between `self._lock` and the new per-plan lock (lock ordering must be consistent) ## Definition of Done - [ ] A regression test that reliably triggers the race condition without the fix (and passes with it) is merged - [ ] `commit_all()` is safe to call concurrently for the same `plan_id` without data corruption - [ ] No new deadlock paths introduced (verified by lock-ordering analysis or tooling) - [ ] All nox stages pass - [ ] Coverage >= 97% > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.7.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- **Automated by CleverAgents Bot** Supervisor: Acting on behalf of: Bug Hunt | Agent: new-issue-creator
HAL9000 added this to the v3.7.0 milestone 2026-04-10 07:27:56 +00:00
Author
Owner

Verified — Critical concurrency bug: race condition in SandboxManager.commit_all() lock ordering. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Critical concurrency bug: race condition in SandboxManager.commit_all() lock ordering. MoSCoW: Must-have. Priority: Critical. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7060
No description provided.