cleveragents/cleveragents-core

Fork 3

BUG-HUNT: [concurrency] Thread-unsafe changeset operations in PlanExecutionContext #7120

New issue

Open

opened 2026-04-10 07:56:22 +00:00 by HAL9000 · 1 comment

HAL9000 commented

2026-04-10 07:56:22 +00:00

Owner

Metadata

Branch: bugfix/plan-execution-context-changeset-thread-safety
Commit Message: fix(application): add threading.Lock to PlanExecutionContext changeset operations
Milestone: v3.2.0
Parent Epic: #7023

Background and Context

PlanExecutionContext in src/cleveragents/application/services/plan_execution_context.py maintains _active_changeset_ids — a plain list[str] that is mutated by start_changeset() and read by record_change() without any synchronization primitive.

In CleverAgents server mode, concurrent plan executions and execution actor graph nodes can share or interleave access to the same PlanExecutionContext instance. The start_changeset() method appends to _active_changeset_ids while record_change() reads the last element via [-1] indexing. Without a lock, these two operations can race, producing incorrect changeset associations or an IndexError crash.

Current Behavior (Bug)

_active_changeset_ids is a plain list[str] with no synchronization:

# src/cleveragents/application/services/plan_execution_context.py, lines ~175-190

def start_changeset(self) -> str:
    """Start a new changeset for this plan."""
    changeset_id = self._changeset_store.start(self._plan_id)
    self._active_changeset_ids.append(changeset_id)   # UNSAFE: No lock
    self._logger.info("Changeset started", changeset_id=changeset_id)
    return changeset_id

def record_change(self, entry: ChangeEntry) -> None:
    """Record a change entry into the most recent active changeset."""
    if not self._active_changeset_ids:
        raise PlanError("No active changeset. Call start_changeset() first.")
    changeset_id = self._active_changeset_ids[-1]     # UNSAFE: No lock
    self._changeset_store.record(changeset_id, entry)
    ...

Race scenario:

Thread A calls start_changeset() — _changeset_store.start() returns a new ID but has not yet appended it.
Thread B calls record_change() — reads _active_changeset_ids[-1], which is the previous changeset ID (stale).
Thread A appends the new ID.
Thread B's change entry is silently recorded against the wrong changeset.

A second race: if Thread A is mid-append() and Thread B reads _active_changeset_ids[-1] on an empty list (before the append completes), an IndexError is raised even though the guard if not self._active_changeset_ids passed.

Expected Behavior

All reads and writes to _active_changeset_ids must be serialized. Specifically:

start_changeset() must atomically append the new changeset ID under a lock.
record_change() must atomically read _active_changeset_ids[-1] under the same lock.
No IndexError or stale-ID recording can occur under concurrent access.

Evidence

# plan_execution_context.py — unsynchronized compound operations:
self._active_changeset_ids.append(changeset_id)   # start_changeset(), line ~185
changeset_id = self._active_changeset_ids[-1]     # record_change(), line ~190

CPython's GIL does not protect compound read-modify-write sequences. The check-then-act pattern in record_change() (if not self._active_changeset_ids: ... changeset_id = self._active_changeset_ids[-1]) is a classic TOCTOU race.

Suggested Fix

Introduce a threading.Lock (or threading.RLock) on PlanExecutionContext and acquire it for the duration of both start_changeset() and record_change():

import threading

class PlanExecutionContext:
    def __init__(self, ...):
        ...
        self._active_changeset_ids: list[str] = []
        self._changeset_lock = threading.Lock()

    def start_changeset(self) -> str:
        changeset_id = self._changeset_store.start(self._plan_id)
        with self._changeset_lock:
            self._active_changeset_ids.append(changeset_id)
        self._logger.info("Changeset started", changeset_id=changeset_id)
        return changeset_id

    def record_change(self, entry: ChangeEntry) -> None:
        with self._changeset_lock:
            if not self._active_changeset_ids:
                raise PlanError("No active changeset. Call start_changeset() first.")
            changeset_id = self._active_changeset_ids[-1]
        self._changeset_store.record(changeset_id, entry)
        ...

Impact

Incorrect changeset recording: change entries silently associated with the wrong changeset, corrupting the plan's audit trail and apply-phase diff.
IndexError crash: concurrent access can trigger an unhandled exception in record_change() even when a changeset has been started.
Data loss in apply phase: if the wrong changeset ID is recorded, the apply phase may miss changes or apply them to the wrong resource.
Severity: High — runtime errors in production under concurrent plan execution load.

Subtasks

Add threading.Lock (_changeset_lock) to PlanExecutionContext.__init__().
Wrap _active_changeset_ids.append() in start_changeset() with _changeset_lock.
Wrap the guard + [-1] read in record_change() with _changeset_lock.
Add a TDD Behave scenario tagged @tdd_issue and @tdd_issue_<N> that reproduces the race condition (two threads calling start_changeset() and record_change() concurrently) and asserts correct changeset association.
Verify all existing PlanExecutionContext tests pass after the fix.
Run nox full suite and confirm all stages pass.

Definition of Done

_changeset_lock (threading.Lock) is present on PlanExecutionContext.
start_changeset() acquires _changeset_lock before appending to _active_changeset_ids.
record_change() acquires _changeset_lock for the guard check and [-1] read atomically.
A regression Behave scenario (@tdd_issue, @tdd_issue_<N>) exists and passes.
No # type: ignore annotations introduced.
All nox stages pass.
Coverage >= 97%.

Automated by CleverAgents Bot
Supervisor: Acting on behalf of: Bug Hunter | Agent: new-issue-creator

## Metadata - **Branch**: `bugfix/plan-execution-context-changeset-thread-safety` - **Commit Message**: `fix(application): add threading.Lock to PlanExecutionContext changeset operations` - **Milestone**: v3.2.0 - **Parent Epic**: #7023 ## Background and Context `PlanExecutionContext` in `src/cleveragents/application/services/plan_execution_context.py` maintains `_active_changeset_ids` — a plain `list[str]` that is mutated by `start_changeset()` and read by `record_change()` without any synchronization primitive. In CleverAgents server mode, concurrent plan executions and execution actor graph nodes can share or interleave access to the same `PlanExecutionContext` instance. The `start_changeset()` method appends to `_active_changeset_ids` while `record_change()` reads the last element via `[-1]` indexing. Without a lock, these two operations can race, producing incorrect changeset associations or an `IndexError` crash. ## Current Behavior (Bug) `_active_changeset_ids` is a plain `list[str]` with no synchronization: ```python # src/cleveragents/application/services/plan_execution_context.py, lines ~175-190 def start_changeset(self) -> str: """Start a new changeset for this plan.""" changeset_id = self._changeset_store.start(self._plan_id) self._active_changeset_ids.append(changeset_id) # UNSAFE: No lock self._logger.info("Changeset started", changeset_id=changeset_id) return changeset_id def record_change(self, entry: ChangeEntry) -> None: """Record a change entry into the most recent active changeset.""" if not self._active_changeset_ids: raise PlanError("No active changeset. Call start_changeset() first.") changeset_id = self._active_changeset_ids[-1] # UNSAFE: No lock self._changeset_store.record(changeset_id, entry) ... ``` **Race scenario:** 1. Thread A calls `start_changeset()` — `_changeset_store.start()` returns a new ID but has not yet appended it. 2. Thread B calls `record_change()` — reads `_active_changeset_ids[-1]`, which is the *previous* changeset ID (stale). 3. Thread A appends the new ID. 4. Thread B's change entry is silently recorded against the wrong changeset. A second race: if Thread A is mid-`append()` and Thread B reads `_active_changeset_ids[-1]` on an empty list (before the append completes), an `IndexError` is raised even though the guard `if not self._active_changeset_ids` passed. ## Expected Behavior All reads and writes to `_active_changeset_ids` must be serialized. Specifically: 1. `start_changeset()` must atomically append the new changeset ID under a lock. 2. `record_change()` must atomically read `_active_changeset_ids[-1]` under the same lock. 3. No `IndexError` or stale-ID recording can occur under concurrent access. ## Evidence ```python # plan_execution_context.py — unsynchronized compound operations: self._active_changeset_ids.append(changeset_id) # start_changeset(), line ~185 changeset_id = self._active_changeset_ids[-1] # record_change(), line ~190 ``` CPython's GIL does not protect compound read-modify-write sequences. The check-then-act pattern in `record_change()` (`if not self._active_changeset_ids: ... changeset_id = self._active_changeset_ids[-1]`) is a classic TOCTOU race. ## Suggested Fix Introduce a `threading.Lock` (or `threading.RLock`) on `PlanExecutionContext` and acquire it for the duration of both `start_changeset()` and `record_change()`: ```python import threading class PlanExecutionContext: def __init__(self, ...): ... self._active_changeset_ids: list[str] = [] self._changeset_lock = threading.Lock() def start_changeset(self) -> str: changeset_id = self._changeset_store.start(self._plan_id) with self._changeset_lock: self._active_changeset_ids.append(changeset_id) self._logger.info("Changeset started", changeset_id=changeset_id) return changeset_id def record_change(self, entry: ChangeEntry) -> None: with self._changeset_lock: if not self._active_changeset_ids: raise PlanError("No active changeset. Call start_changeset() first.") changeset_id = self._active_changeset_ids[-1] self._changeset_store.record(changeset_id, entry) ... ``` ## Impact - **Incorrect changeset recording**: change entries silently associated with the wrong changeset, corrupting the plan's audit trail and apply-phase diff. - **`IndexError` crash**: concurrent access can trigger an unhandled exception in `record_change()` even when a changeset has been started. - **Data loss in apply phase**: if the wrong changeset ID is recorded, the apply phase may miss changes or apply them to the wrong resource. - **Severity**: High — runtime errors in production under concurrent plan execution load. ## Subtasks - [ ] Add `threading.Lock` (`_changeset_lock`) to `PlanExecutionContext.__init__()`. - [ ] Wrap `_active_changeset_ids.append()` in `start_changeset()` with `_changeset_lock`. - [ ] Wrap the guard + `[-1]` read in `record_change()` with `_changeset_lock`. - [ ] Add a TDD Behave scenario tagged `@tdd_issue` and `@tdd_issue_<N>` that reproduces the race condition (two threads calling `start_changeset()` and `record_change()` concurrently) and asserts correct changeset association. - [ ] Verify all existing `PlanExecutionContext` tests pass after the fix. - [ ] Run `nox` full suite and confirm all stages pass. ## Definition of Done - [ ] `_changeset_lock` (`threading.Lock`) is present on `PlanExecutionContext`. - [ ] `start_changeset()` acquires `_changeset_lock` before appending to `_active_changeset_ids`. - [ ] `record_change()` acquires `_changeset_lock` for the guard check and `[-1]` read atomically. - [ ] A regression Behave scenario (`@tdd_issue`, `@tdd_issue_<N>`) exists and passes. - [ ] No `# type: ignore` annotations introduced. - [ ] All nox stages pass. - [ ] Coverage >= 97%. --- **Automated by CleverAgents Bot** Supervisor: Acting on behalf of: Bug Hunter | Agent: new-issue-creator

HAL9000 added this to the v3.2.0 milestone

2026-04-10 07:56:33 +00:00

HAL9000 added the

State

Unverified

label

2026-04-10 07:57:40 +00:00

HAL9000 added a new dependency

2026-04-10 07:58:01 +00:00

#7023 [AUTO-BUG-POOL] Status: Bug Detection Report (Cycle 2)

freemo added the

labels

2026-04-12 18:37:04 +00:00

freemo removed a dependency

2026-04-13 16:34:26 +00:00

#7023 [AUTO-BUG-POOL] Status: Bug Detection Report (Cycle 2)

HAL9000 added

and removed

labels

2026-04-14 06:48:47 +00:00

HAL9000 commented

2026-04-14 06:48:48 +00:00

Author

Owner

✅ Verified — Critical concurrency bug: thread-unsafe changeset operations in PlanExecutionContext. MoSCoW: Must-have. Priority: Critical.

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Critical concurrency bug: thread-unsafe changeset operations in PlanExecutionContext. MoSCoW: Must-have. Priority: Critical. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor

HAL9000 referenced this issue

2026-04-30 19:40:22 +00:00

fix(e2e): restore M5/WF14 tests broken by JSON envelope and stale config #9912