Race condition in StateManager.update_state when using parallel execution #8290

Open
opened 2026-04-13 08:05:33 +00:00 by HAL9000 · 3 comments
Owner

Metadata

  • Commit message: fix(langgraph): add asyncio.Lock to StateManager.update_state for thread safety
  • Branch name: bugfix/langgraph-state-manager-race-condition

Background and Context

The StateManager.update_state method in src/cleveragents/langgraph/state.py (line 109) is not protected by any locking mechanism. When a LangGraph is configured with parallel_execution=True, multiple nodes can execute concurrently and call update_state at the same time.

The unprotected critical section spans the entire body of update_state: reading and writing self.history, calling self.state.update(updates, mode), incrementing self.state.execution_count, emitting self.state_stream.on_next(self.state), incrementing self.update_count, and (per the acceptance criteria below) calling _save_checkpoint(). If a concurrent call interleaves with any of these operations, the result can be lost updates, stale reads, or an inconsistent execution_count.
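The lost-update pattern described above can be sketched in a few lines. This is a minimal illustration, not the real StateManager: ToyStateManager and run_unlocked are hypothetical names, and the asyncio.sleep(0) is an assumed suspension point standing in for any await (or thread switch) that occurs between reading and writing shared state. Under asyncio, a body with no suspension point cannot interleave on a single event loop, so whether the real method is exposed this way depends on how parallel_execution schedules nodes.

```python
import asyncio


class ToyStateManager:
    """Hypothetical stand-in for StateManager; not the real class."""

    def __init__(self) -> None:
        self.execution_count = 0

    async def update_state(self) -> None:
        current = self.execution_count      # read shared state
        await asyncio.sleep(0)              # yield: another task may run here
        self.execution_count = current + 1  # write back a possibly stale value


async def run_unlocked(n: int) -> int:
    manager = ToyStateManager()
    # Launch n concurrent updates against the same manager.
    await asyncio.gather(*(manager.update_state() for _ in range(n)))
    return manager.execution_count


lost_update_result = asyncio.run(run_unlocked(10))
print(lost_update_result)  # smaller than 10 when updates are lost
```

Each task reads the counter before any task has written it back, so most increments overwrite one another instead of accumulating.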

This is a correctness bug that becomes a data-integrity hazard as soon as parallel_execution=True is used — which is a core requirement of v3.5.0 (M6: Autonomy Hardening, "Parallel execution scales to 10+ concurrent subplans").

Expected Behavior

All calls to StateManager.update_state are serialised. Concurrent node executions each see their update applied atomically; no update is silently dropped and execution_count always reflects the true number of completed updates.

Acceptance Criteria

  • An asyncio.Lock (or threading.Lock if the call-path is synchronous) is added to StateManager.__init__
  • update_state acquires the lock before reading or writing any shared state and releases it on exit (including on exception)
  • The history snapshot, self.state.update(), execution_count increment, state_stream.on_next(), update_count increment, and _save_checkpoint() call are all inside the critical section
  • Existing unit tests continue to pass
  • At least one new BDD scenario demonstrates that two concurrent update_state calls on the same StateManager produce the correct final state with no lost updates
  • nox -e lint and nox -e typecheck pass with no new errors
  • Test coverage for src/cleveragents/langgraph/state.py remains ≥ 97%
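A rough sketch of what the criteria above could look like with asyncio.Lock follows. LockedStateManager and run_locked are hypothetical; the attribute names are taken from this issue, but the state container (a plain dict here), the update signature, and the asyncio.sleep(0) suspension point are assumptions made only so the example runs on its own. The calls to state_stream.on_next() and _save_checkpoint() are shown as a comment because their signatures are not given here.

```python
import asyncio
from typing import Any


class LockedStateManager:
    """Hypothetical sketch; names follow the issue text, internals are assumed."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.state: dict[str, Any] = {}
        self.history: list[dict[str, Any]] = []
        self.execution_count = 0
        self.update_count = 0

    async def update_state(self, updates: dict[str, Any]) -> None:
        async with self._lock:  # released on normal exit and on exception
            self.history.append(dict(self.state))  # snapshot before mutation
            current = self.execution_count
            await asyncio.sleep(0)  # stand-in for any await inside the body
            self.state.update(updates)
            self.execution_count = current + 1
            self.update_count += 1
            # state_stream.on_next(...) and _save_checkpoint() would also go here


async def run_locked(n: int) -> LockedStateManager:
    manager = LockedStateManager()
    await asyncio.gather(*(manager.update_state({f"k{i}": i}) for i in range(n)))
    return manager


locked = asyncio.run(run_locked(10))
print(locked.execution_count, locked.update_count, len(locked.history))
```

Because the whole body sits inside async with, a second caller cannot start its snapshot until the first has finished its counters, which is the atomicity the Expected Behavior section asks for.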

Subtasks

  • Add self._lock = asyncio.Lock() (or threading.Lock()) to StateManager.__init__
  • Wrap the body of update_state in async with self._lock: (or with self._lock:)
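  • Verify that callers of update_state are compatible with the chosen lock type: asyncio.Lock only serialises coroutines on a single event loop and is not thread-safe, so callers that run on separate threads need threading.Lock instead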
  • Write a BDD feature file features/state_manager_thread_safety.feature with concurrent-update scenarios
  • Write corresponding step definitions in features/steps/state_manager_thread_safety_steps.py
  • Update CHANGELOG.md under [Unreleased] > Fixed
  • Run full quality gate (nox -e lint, nox -e typecheck, nox -e unit_tests) and confirm all pass
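If the caller check in the subtasks shows that nodes run on worker threads rather than as coroutines, the threading.Lock variant applies. The sketch below assumes a synchronous update_state; ThreadSafeCounterState and run_threads are hypothetical names, and only the two counters are modelled.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadSafeCounterState:
    """Hypothetical synchronous stand-in; models only the two counters."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.execution_count = 0
        self.update_count = 0

    def update_state(self) -> None:
        with self._lock:  # released on normal exit and on exception
            self.execution_count += 1
            self.update_count += 1


def run_threads(workers: int, calls_per_worker: int) -> ThreadSafeCounterState:
    manager = ThreadSafeCounterState()

    def worker() -> None:
        for _ in range(calls_per_worker):
            manager.update_state()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return manager


threaded = run_threads(workers=8, calls_per_worker=1000)
print(threaded.execution_count, threaded.update_count)
```

One design point to check before choosing this variant: threading.Lock is not re-entrant, so a state_stream subscriber that calls update_state from inside on_next would deadlock while the lock is held. If such subscribers exist, threading.RLock may be the safer choice.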

Definition of Done

This issue is closed when:

  1. The update_state method is fully protected by a lock and no concurrent call can interleave with another.
  2. The fix is covered by at least one automated test that would fail without the lock.
  3. All quality gates (lint, typecheck, unit_tests) pass.
  4. The PR is reviewed, approved, and merged into master.

Automated by CleverAgents Bot
Agent: new-issue-creator

HAL9000 added this to the v3.5.0 milestone 2026-04-13 08:07:01 +00:00
Author
Owner

[AUTO-EPIC] Epic Linkage

This issue is a child of Epic #8082 — A2A Facade Session & Guard Enforcement (M6) (v3.5.0).

The StateManager race condition in parallel execution is a concurrency safety issue that must be resolved for the A2A facade and parallel execution to work correctly.

Dependency direction: This issue (#8290) BLOCKS Epic #8082.


Automated by CleverAgents Bot
Supervisor: Epic Planning | Agent: epic-planning-pool-supervisor

Author
Owner

Parent Epic: Blocks #8082 (Epic: A2A Facade Session & Guard Enforcement — M6)

This issue is a child of Epic #8082 under milestone v3.5.0 (M6: Autonomy Hardening). The race condition in StateManager.update_state directly undermines the parallel execution scaling requirement ("Parallel execution scales to 10+ concurrent subplans") that is a core acceptance criterion of that Epic and milestone.


Automated by CleverAgents Bot
Agent: new-issue-creator

Author
Owner

Verified: the race condition in StateManager.update_state during parallel execution is a data-integrity risk that directly impacts the v3.5.0 acceptance criterion for parallel execution (10+ concurrent subplans). Must Have fix for v3.5.0.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#8290