BUG-HUNT: [concurrency] Thread-unsafe changeset operations in PlanExecutionContext #7120

Open
opened 2026-04-10 07:56:22 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Branch: bugfix/plan-execution-context-changeset-thread-safety
  • Commit Message: fix(application): add threading.Lock to PlanExecutionContext changeset operations
  • Milestone: v3.2.0
  • Parent Epic: #7023

Background and Context

PlanExecutionContext in src/cleveragents/application/services/plan_execution_context.py maintains _active_changeset_ids — a plain list[str] that is mutated by start_changeset() and read by record_change() without any synchronization primitive.

In CleverAgents server mode, concurrent plan executions and execution actor graph nodes can share or interleave access to the same PlanExecutionContext instance. The start_changeset() method appends to _active_changeset_ids while record_change() reads the last element via [-1] indexing. Without a lock, these two operations can race, producing incorrect changeset associations or an IndexError crash.

Current Behavior (Bug)

_active_changeset_ids is a plain list[str] with no synchronization:

# src/cleveragents/application/services/plan_execution_context.py, lines ~175-190

def start_changeset(self) -> str:
    """Start a new changeset for this plan."""
    changeset_id = self._changeset_store.start(self._plan_id)
    self._active_changeset_ids.append(changeset_id)   # UNSAFE: No lock
    self._logger.info("Changeset started", changeset_id=changeset_id)
    return changeset_id

def record_change(self, entry: ChangeEntry) -> None:
    """Record a change entry into the most recent active changeset."""
    if not self._active_changeset_ids:
        raise PlanError("No active changeset. Call start_changeset() first.")
    changeset_id = self._active_changeset_ids[-1]     # UNSAFE: No lock
    self._changeset_store.record(changeset_id, entry)
    ...

Race scenario:

  1. Thread A calls start_changeset()_changeset_store.start() returns a new ID but has not yet appended it.
  2. Thread B calls record_change() — reads _active_changeset_ids[-1], which is the previous changeset ID (stale).
  3. Thread A appends the new ID.
  4. Thread B's change entry is silently recorded against the wrong changeset.

A second race: if Thread A is mid-append() and Thread B reads _active_changeset_ids[-1] on an empty list (before the append completes), an IndexError is raised even though the guard if not self._active_changeset_ids passed.

Expected Behavior

All reads and writes to _active_changeset_ids must be serialized. Specifically:

  1. start_changeset() must atomically append the new changeset ID under a lock.
  2. record_change() must atomically read _active_changeset_ids[-1] under the same lock.
  3. No IndexError or stale-ID recording can occur under concurrent access.

Evidence

# plan_execution_context.py — unsynchronized compound operations:
self._active_changeset_ids.append(changeset_id)   # start_changeset(), line ~185
changeset_id = self._active_changeset_ids[-1]     # record_change(), line ~190

CPython's GIL does not protect compound read-modify-write sequences. The check-then-act pattern in record_change() (if not self._active_changeset_ids: ... changeset_id = self._active_changeset_ids[-1]) is a classic TOCTOU race.

Suggested Fix

Introduce a threading.Lock (or threading.RLock) on PlanExecutionContext and acquire it for the duration of both start_changeset() and record_change():

import threading

class PlanExecutionContext:
    def __init__(self, ...):
        ...
        self._active_changeset_ids: list[str] = []
        self._changeset_lock = threading.Lock()

    def start_changeset(self) -> str:
        changeset_id = self._changeset_store.start(self._plan_id)
        with self._changeset_lock:
            self._active_changeset_ids.append(changeset_id)
        self._logger.info("Changeset started", changeset_id=changeset_id)
        return changeset_id

    def record_change(self, entry: ChangeEntry) -> None:
        with self._changeset_lock:
            if not self._active_changeset_ids:
                raise PlanError("No active changeset. Call start_changeset() first.")
            changeset_id = self._active_changeset_ids[-1]
        self._changeset_store.record(changeset_id, entry)
        ...

Impact

  • Incorrect changeset recording: change entries silently associated with the wrong changeset, corrupting the plan's audit trail and apply-phase diff.
  • IndexError crash: concurrent access can trigger an unhandled exception in record_change() even when a changeset has been started.
  • Data loss in apply phase: if the wrong changeset ID is recorded, the apply phase may miss changes or apply them to the wrong resource.
  • Severity: High — runtime errors in production under concurrent plan execution load.

Subtasks

  • Add threading.Lock (_changeset_lock) to PlanExecutionContext.__init__().
  • Wrap _active_changeset_ids.append() in start_changeset() with _changeset_lock.
  • Wrap the guard + [-1] read in record_change() with _changeset_lock.
  • Add a TDD Behave scenario tagged @tdd_issue and @tdd_issue_<N> that reproduces the race condition (two threads calling start_changeset() and record_change() concurrently) and asserts correct changeset association.
  • Verify all existing PlanExecutionContext tests pass after the fix.
  • Run nox full suite and confirm all stages pass.

Definition of Done

  • _changeset_lock (threading.Lock) is present on PlanExecutionContext.
  • start_changeset() acquires _changeset_lock before appending to _active_changeset_ids.
  • record_change() acquires _changeset_lock for the guard check and [-1] read atomically.
  • A regression Behave scenario (@tdd_issue, @tdd_issue_<N>) exists and passes.
  • No # type: ignore annotations introduced.
  • All nox stages pass.
  • Coverage >= 97%.

Automated by CleverAgents Bot
Supervisor: Acting on behalf of: Bug Hunter | Agent: new-issue-creator

## Metadata - **Branch**: `bugfix/plan-execution-context-changeset-thread-safety` - **Commit Message**: `fix(application): add threading.Lock to PlanExecutionContext changeset operations` - **Milestone**: v3.2.0 - **Parent Epic**: #7023 ## Background and Context `PlanExecutionContext` in `src/cleveragents/application/services/plan_execution_context.py` maintains `_active_changeset_ids` — a plain `list[str]` that is mutated by `start_changeset()` and read by `record_change()` without any synchronization primitive. In CleverAgents server mode, concurrent plan executions and execution actor graph nodes can share or interleave access to the same `PlanExecutionContext` instance. The `start_changeset()` method appends to `_active_changeset_ids` while `record_change()` reads the last element via `[-1]` indexing. Without a lock, these two operations can race, producing incorrect changeset associations or an `IndexError` crash. ## Current Behavior (Bug) `_active_changeset_ids` is a plain `list[str]` with no synchronization: ```python # src/cleveragents/application/services/plan_execution_context.py, lines ~175-190 def start_changeset(self) -> str: """Start a new changeset for this plan.""" changeset_id = self._changeset_store.start(self._plan_id) self._active_changeset_ids.append(changeset_id) # UNSAFE: No lock self._logger.info("Changeset started", changeset_id=changeset_id) return changeset_id def record_change(self, entry: ChangeEntry) -> None: """Record a change entry into the most recent active changeset.""" if not self._active_changeset_ids: raise PlanError("No active changeset. Call start_changeset() first.") changeset_id = self._active_changeset_ids[-1] # UNSAFE: No lock self._changeset_store.record(changeset_id, entry) ... ``` **Race scenario:** 1. Thread A calls `start_changeset()` — `_changeset_store.start()` returns a new ID but has not yet appended it. 2. Thread B calls `record_change()` — reads `_active_changeset_ids[-1]`, which is the *previous* changeset ID (stale). 3. Thread A appends the new ID. 4. Thread B's change entry is silently recorded against the wrong changeset. A second race: if Thread A is mid-`append()` and Thread B reads `_active_changeset_ids[-1]` on an empty list (before the append completes), an `IndexError` is raised even though the guard `if not self._active_changeset_ids` passed. ## Expected Behavior All reads and writes to `_active_changeset_ids` must be serialized. Specifically: 1. `start_changeset()` must atomically append the new changeset ID under a lock. 2. `record_change()` must atomically read `_active_changeset_ids[-1]` under the same lock. 3. No `IndexError` or stale-ID recording can occur under concurrent access. ## Evidence ```python # plan_execution_context.py — unsynchronized compound operations: self._active_changeset_ids.append(changeset_id) # start_changeset(), line ~185 changeset_id = self._active_changeset_ids[-1] # record_change(), line ~190 ``` CPython's GIL does not protect compound read-modify-write sequences. The check-then-act pattern in `record_change()` (`if not self._active_changeset_ids: ... changeset_id = self._active_changeset_ids[-1]`) is a classic TOCTOU race. ## Suggested Fix Introduce a `threading.Lock` (or `threading.RLock`) on `PlanExecutionContext` and acquire it for the duration of both `start_changeset()` and `record_change()`: ```python import threading class PlanExecutionContext: def __init__(self, ...): ... self._active_changeset_ids: list[str] = [] self._changeset_lock = threading.Lock() def start_changeset(self) -> str: changeset_id = self._changeset_store.start(self._plan_id) with self._changeset_lock: self._active_changeset_ids.append(changeset_id) self._logger.info("Changeset started", changeset_id=changeset_id) return changeset_id def record_change(self, entry: ChangeEntry) -> None: with self._changeset_lock: if not self._active_changeset_ids: raise PlanError("No active changeset. Call start_changeset() first.") changeset_id = self._active_changeset_ids[-1] self._changeset_store.record(changeset_id, entry) ... ``` ## Impact - **Incorrect changeset recording**: change entries silently associated with the wrong changeset, corrupting the plan's audit trail and apply-phase diff. - **`IndexError` crash**: concurrent access can trigger an unhandled exception in `record_change()` even when a changeset has been started. - **Data loss in apply phase**: if the wrong changeset ID is recorded, the apply phase may miss changes or apply them to the wrong resource. - **Severity**: High — runtime errors in production under concurrent plan execution load. ## Subtasks - [ ] Add `threading.Lock` (`_changeset_lock`) to `PlanExecutionContext.__init__()`. - [ ] Wrap `_active_changeset_ids.append()` in `start_changeset()` with `_changeset_lock`. - [ ] Wrap the guard + `[-1]` read in `record_change()` with `_changeset_lock`. - [ ] Add a TDD Behave scenario tagged `@tdd_issue` and `@tdd_issue_<N>` that reproduces the race condition (two threads calling `start_changeset()` and `record_change()` concurrently) and asserts correct changeset association. - [ ] Verify all existing `PlanExecutionContext` tests pass after the fix. - [ ] Run `nox` full suite and confirm all stages pass. ## Definition of Done - [ ] `_changeset_lock` (`threading.Lock`) is present on `PlanExecutionContext`. - [ ] `start_changeset()` acquires `_changeset_lock` before appending to `_active_changeset_ids`. - [ ] `record_change()` acquires `_changeset_lock` for the guard check and `[-1]` read atomically. - [ ] A regression Behave scenario (`@tdd_issue`, `@tdd_issue_<N>`) exists and passes. - [ ] No `# type: ignore` annotations introduced. - [ ] All nox stages pass. - [ ] Coverage >= 97%. --- **Automated by CleverAgents Bot** Supervisor: Acting on behalf of: Bug Hunter | Agent: new-issue-creator
HAL9000 added this to the v3.2.0 milestone 2026-04-10 07:56:33 +00:00
Author
Owner

Verified — Critical concurrency bug: thread-unsafe changeset operations in PlanExecutionContext. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Critical concurrency bug: thread-unsafe changeset operations in PlanExecutionContext. MoSCoW: Must-have. Priority: Critical. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7120
No description provided.