BUG-HUNT: [resource] SandboxManager.commit_all() acquires lock but releases it before committing, then re-acquires for cleanup creating a lock window #7344

Open
opened 2026-04-10 17:57:22 +00:00 by HAL9000 · 1 comment
Owner

Bug Report: [resource/concurrency] SandboxManager.commit_all() performs commits outside the lock creating race condition

Severity Assessment

  • Impact: Two concurrent calls to commit_all() for the same plan could both see the same set of sandboxes, leading to double-commit of the same changes (applying them twice) or double-rollback
  • Likelihood: Low-Medium — concurrent execution of commit_all() is warned against but not prevented
  • Priority: High

Location

  • File: src/cleveragents/infrastructure/sandbox/manager.py
  • Function/Class: SandboxManager.commit_all()
  • Lines: ~150-220

Description

The SandboxManager.commit_all() method has a documented but unfixed race condition: the docstring explicitly states "This method is not safe for concurrent calls on the same plan_id." However, the implementation's lock usage makes this race worse than necessary.

The method acquires self._lock to read the list of sandboxes, then releases the lock before committing. This means:

  1. Thread A calls commit_all("plan-1"), reads [sandbox_a, sandbox_b] under lock, then releases lock
  2. Thread B calls commit_all("plan-1"), also reads [sandbox_a, sandbox_b] under lock, then releases lock
  3. Both threads independently commit sandbox_a and sandbox_b
  4. sandbox_a and sandbox_b are committed twice, leaving the target in an undefined state

Additionally, when commit_all() needs to do atomic rollback after a partial failure, the _rollback_committed() method also runs outside the lock. This means another thread might see partially-committed state and make incorrect decisions.

The actual commits and rollbacks for NoSandbox and TransactionSandbox (non-rollbackable types) can cause permanent data loss when committed twice because there's no rollback path.

Evidence

def commit_all(self, plan_id: str) -> list[CommitResult]:
    # ...
    with self._lock:
        # Lock acquired here
        sandboxes = list(self._active_sandboxes.get(plan_id, {}).values())
    # Lock RELEASED here — race window begins!
    
    committable = [
        sb
        for sb in sandboxes
        if sb.status in (SandboxStatus.CREATED, SandboxStatus.ACTIVE)
    ]
    # ...
    # All commits happen outside the lock
    for sandbox in ordered:
        try:
            result = sandbox.commit()  # Outside lock!
            committed.append((sandbox, result))
        except Exception as exc:
            # Rollback also outside lock!
            rolled_back_ids, failed_rollback_ids = self._rollback_committed(
                committed, plan_id
            )

The warning in the docstring acknowledges this but does not enforce it:

.. warning:: Thread safety

   This method is **not safe** for concurrent calls on the same
   *plan_id*.

Expected Behavior

The method should either:

  1. Use a per-plan-id lock to prevent concurrent commit_all() calls for the same plan, or
  2. Use an atomic state transition (e.g., mark sandboxes as "committing" while under lock) to prevent double-commit

Actual Behavior

The warning is documentation-only. Nothing prevents concurrent commit_all() calls for the same plan, and the lock scope is insufficient to provide real protection.

Suggested Fix

Add a per-plan-id lock or transition sandbox status atomically under the global lock before committing:

def commit_all(self, plan_id: str) -> list[CommitResult]:
    if not plan_id:
        raise ValueError("plan_id cannot be empty")
    
    with self._lock:
        sandboxes = list(self._active_sandboxes.get(plan_id, {}).values())
        # Atomically claim committable sandboxes by marking them as COMMITTING
        committable = []
        for sb in sandboxes:
            if sb.status in (SandboxStatus.CREATED, SandboxStatus.ACTIVE):
                # Mark as committing to prevent concurrent commit_all claims
                sb._status = SandboxStatus.COMMITTING  # or equivalent sentinel
                committable.append(sb)
    
    # Now commit outside the lock — only this thread "owns" these sandboxes
    ...

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor

## Bug Report: [resource/concurrency] SandboxManager.commit_all() performs commits outside the lock creating race condition ### Severity Assessment - **Impact**: Two concurrent calls to `commit_all()` for the same plan could both see the same set of sandboxes, leading to double-commit of the same changes (applying them twice) or double-rollback - **Likelihood**: Low-Medium — concurrent execution of `commit_all()` is warned against but not prevented - **Priority**: High ### Location - **File**: `src/cleveragents/infrastructure/sandbox/manager.py` - **Function/Class**: `SandboxManager.commit_all()` - **Lines**: ~150-220 ### Description The `SandboxManager.commit_all()` method has a **documented but unfixed race condition**: the docstring explicitly states "This method is **not safe** for concurrent calls on the same *plan_id*." However, the implementation's lock usage makes this race worse than necessary. The method acquires `self._lock` to read the list of sandboxes, then **releases the lock** before committing. This means: 1. Thread A calls `commit_all("plan-1")`, reads `[sandbox_a, sandbox_b]` under lock, then releases lock 2. Thread B calls `commit_all("plan-1")`, also reads `[sandbox_a, sandbox_b]` under lock, then releases lock 3. Both threads independently commit `sandbox_a` and `sandbox_b` 4. `sandbox_a` and `sandbox_b` are committed twice, leaving the target in an undefined state Additionally, when `commit_all()` needs to do atomic rollback after a partial failure, the `_rollback_committed()` method also runs **outside the lock**. This means another thread might see partially-committed state and make incorrect decisions. The actual commits and rollbacks for `NoSandbox` and `TransactionSandbox` (non-rollbackable types) can cause **permanent data loss** when committed twice because there's no rollback path. ### Evidence ```python def commit_all(self, plan_id: str) -> list[CommitResult]: # ... with self._lock: # Lock acquired here sandboxes = list(self._active_sandboxes.get(plan_id, {}).values()) # Lock RELEASED here — race window begins! committable = [ sb for sb in sandboxes if sb.status in (SandboxStatus.CREATED, SandboxStatus.ACTIVE) ] # ... # All commits happen outside the lock for sandbox in ordered: try: result = sandbox.commit() # Outside lock! committed.append((sandbox, result)) except Exception as exc: # Rollback also outside lock! rolled_back_ids, failed_rollback_ids = self._rollback_committed( committed, plan_id ) ``` The warning in the docstring acknowledges this but does not enforce it: ``` .. warning:: Thread safety This method is **not safe** for concurrent calls on the same *plan_id*. ``` ### Expected Behavior The method should either: 1. Use a per-plan-id lock to prevent concurrent `commit_all()` calls for the same plan, or 2. Use an atomic state transition (e.g., mark sandboxes as "committing" while under lock) to prevent double-commit ### Actual Behavior The warning is documentation-only. Nothing prevents concurrent `commit_all()` calls for the same plan, and the lock scope is insufficient to provide real protection. ### Suggested Fix Add a per-plan-id lock or transition sandbox status atomically under the global lock before committing: ```python def commit_all(self, plan_id: str) -> list[CommitResult]: if not plan_id: raise ValueError("plan_id cannot be empty") with self._lock: sandboxes = list(self._active_sandboxes.get(plan_id, {}).values()) # Atomically claim committable sandboxes by marking them as COMMITTING committable = [] for sb in sandboxes: if sb.status in (SandboxStatus.CREATED, SandboxStatus.ACTIVE): # Mark as committing to prevent concurrent commit_all claims sb._status = SandboxStatus.COMMITTING # or equivalent sentinel committable.append(sb) # Now commit outside the lock — only this thread "owns" these sandboxes ... ``` ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor
HAL9000 added this to the v3.5.0 milestone 2026-04-10 18:44:13 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified — SandboxManager lock scope bug is a real concurrency issue with documented evidence
  • Priority: Priority/Critical — double-commit of sandbox changes can cause permanent data corruption
  • Milestone: v3.5.0 — SandboxManager is core to the parallel execution and apply workflow
  • Type: Type/Bug
  • MoSCoW: Must Have — sandbox commit correctness is required for safe plan apply

The fix: add per-plan-id locking or atomically mark sandboxes as "committing" under the global lock.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Issue triaged by project owner: - **State**: Verified — SandboxManager lock scope bug is a real concurrency issue with documented evidence - **Priority**: Priority/Critical — double-commit of sandbox changes can cause permanent data corruption - **Milestone**: v3.5.0 — SandboxManager is core to the parallel execution and apply workflow - **Type**: Type/Bug - **MoSCoW**: Must Have — sandbox commit correctness is required for safe plan apply The fix: add per-plan-id locking or atomically mark sandboxes as "committing" under the global lock. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7344
No description provided.