BUG-HUNT: [concurrency] Race condition in StateManager #1688

Open
opened 2026-04-02 23:29:51 +00:00 by freemo · 2 comments
Owner

Metadata

  • Branch: fix/concurrency-statemanager-race-condition
  • Commit Message: fix(langgraph): add threading lock to StateManager to eliminate race condition on shared state
  • Milestone: v3.5.0
  • Parent Epic: #362

Bug Report: [concurrency] — Race condition in StateManager

Severity Assessment

  • Impact: The StateManager can get into an inconsistent state, leading to incorrect behavior and data corruption.
  • Likelihood: High, especially in graphs with parallel execution.
  • Priority: Critical

Location

  • File: src/cleveragents/langgraph/state.py
  • Function/Class: StateManager
  • Lines: 109-132

Description

The StateManager class is not thread-safe. The state attribute is accessed and modified by multiple threads without any locking. The update_state method, in particular, is not atomic, which means that a thread could be interrupted in the middle of an update, leaving the state in an inconsistent state.

This is especially critical given the v3.5.0 milestone goal of scaling parallel execution to 10+ concurrent subplans — each subplan may invoke update_state concurrently, making this a high-likelihood data corruption vector.

Evidence

def update_state(
    self,
    updates: dict[str, Any],
    mode: StateUpdateMode = StateUpdateMode.MERGE,
    node_id: str | None = None,
) -> GraphState:
    if self.is_closed:
        raise RuntimeError("StateManager is closed")
    if self.enable_time_travel:
        snapshot = StateSnapshot(
            state=self.state.to_dict(), timestamp=datetime.now(), node_id=node_id
        )
        self.history.append(snapshot)
        if len(self.history) > self.max_history_size:
            self.history = self.history[-self.max_history_size :]

    self.state.update(updates, mode)
    self.state.execution_count += 1
    self.state_stream.on_next(self.state)

    self.update_count += 1
    if self.checkpoint_dir and self.update_count % self.checkpoint_interval == 0:
        self._save_checkpoint()
    return self.state

Expected Behavior

The StateManager should be thread-safe. All access to the state attribute should be protected by a lock.

Actual Behavior

Multiple threads can concurrently enter update_state, leading to:

  • Non-atomic read-modify-write on self.state.execution_count
  • Non-atomic read-modify-write on self.update_count
  • Torn writes to self.history (list replacement is not atomic)
  • Concurrent self.state.update() calls producing interleaved/corrupted state

Suggested Fix

Add a threading.Lock to the StateManager class and use it to protect all access to the state attribute. The update_state method should acquire the lock at the beginning and release it at the end.

import threading

class StateManager:
    def __init__(self, ...):
        ...
        self._lock: threading.Lock = threading.Lock()

    def update_state(self, updates, mode=..., node_id=None):
        with self._lock:
            # existing body unchanged
            ...

Category

concurrency

Subtasks

  • Add threading.Lock field _lock to StateManager.__init__ with full type annotation
  • Wrap the entire body of update_state in with self._lock: to make it atomic
  • Audit all other public/protected methods on StateManager that read or write self.state, self.history, self.update_count, or self.is_closed for thread-safety
  • Protect close() / is_closed flag with the same lock to prevent use-after-close races
  • Write Behave scenario: concurrent update_state calls from multiple threads produce consistent execution_count and no torn state
  • Write Behave scenario: close() called concurrently with update_state raises RuntimeError cleanly without data corruption
  • Run nox -e typecheck — verify Pyright passes with no # type: ignore suppressions
  • Run nox -e unit_tests — verify all Behave scenarios pass
  • Run nox -e coverage_report — verify coverage ≥ 97%
  • Run nox (all default sessions), fix any errors

Definition of Done

  • All subtasks above are completed and checked off.
  • StateManager is demonstrably thread-safe: concurrent update_state calls from ≥ 2 threads produce a consistent, non-corrupted GraphState.
  • No # type: ignore or type-suppression comments introduced.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage >= 97%.

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/concurrency-statemanager-race-condition` - **Commit Message**: `fix(langgraph): add threading lock to StateManager to eliminate race condition on shared state` - **Milestone**: v3.5.0 - **Parent Epic**: #362 ## Bug Report: [concurrency] — Race condition in StateManager ### Severity Assessment - **Impact**: The `StateManager` can get into an inconsistent state, leading to incorrect behavior and data corruption. - **Likelihood**: High, especially in graphs with parallel execution. - **Priority**: Critical ### Location - **File**: `src/cleveragents/langgraph/state.py` - **Function/Class**: `StateManager` - **Lines**: 109-132 ### Description The `StateManager` class is not thread-safe. The `state` attribute is accessed and modified by multiple threads without any locking. The `update_state` method, in particular, is not atomic, which means that a thread could be interrupted in the middle of an update, leaving the state in an inconsistent state. This is especially critical given the v3.5.0 milestone goal of scaling parallel execution to **10+ concurrent subplans** — each subplan may invoke `update_state` concurrently, making this a high-likelihood data corruption vector. ### Evidence ```python def update_state( self, updates: dict[str, Any], mode: StateUpdateMode = StateUpdateMode.MERGE, node_id: str | None = None, ) -> GraphState: if self.is_closed: raise RuntimeError("StateManager is closed") if self.enable_time_travel: snapshot = StateSnapshot( state=self.state.to_dict(), timestamp=datetime.now(), node_id=node_id ) self.history.append(snapshot) if len(self.history) > self.max_history_size: self.history = self.history[-self.max_history_size :] self.state.update(updates, mode) self.state.execution_count += 1 self.state_stream.on_next(self.state) self.update_count += 1 if self.checkpoint_dir and self.update_count % self.checkpoint_interval == 0: self._save_checkpoint() return self.state ``` ### Expected Behavior The `StateManager` should be thread-safe. All access to the `state` attribute should be protected by a lock. ### Actual Behavior Multiple threads can concurrently enter `update_state`, leading to: - Non-atomic read-modify-write on `self.state.execution_count` - Non-atomic read-modify-write on `self.update_count` - Torn writes to `self.history` (list replacement is not atomic) - Concurrent `self.state.update()` calls producing interleaved/corrupted state ### Suggested Fix Add a `threading.Lock` to the `StateManager` class and use it to protect all access to the `state` attribute. The `update_state` method should acquire the lock at the beginning and release it at the end. ```python import threading class StateManager: def __init__(self, ...): ... self._lock: threading.Lock = threading.Lock() def update_state(self, updates, mode=..., node_id=None): with self._lock: # existing body unchanged ... ``` ### Category concurrency ## Subtasks - [ ] Add `threading.Lock` field `_lock` to `StateManager.__init__` with full type annotation - [ ] Wrap the entire body of `update_state` in `with self._lock:` to make it atomic - [ ] Audit all other public/protected methods on `StateManager` that read or write `self.state`, `self.history`, `self.update_count`, or `self.is_closed` for thread-safety - [ ] Protect `close()` / `is_closed` flag with the same lock to prevent use-after-close races - [ ] Write Behave scenario: concurrent `update_state` calls from multiple threads produce consistent `execution_count` and no torn state - [ ] Write Behave scenario: `close()` called concurrently with `update_state` raises `RuntimeError` cleanly without data corruption - [ ] Run `nox -e typecheck` — verify Pyright passes with no `# type: ignore` suppressions - [ ] Run `nox -e unit_tests` — verify all Behave scenarios pass - [ ] Run `nox -e coverage_report` — verify coverage ≥ 97% - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done - [ ] All subtasks above are completed and checked off. - [ ] `StateManager` is demonstrably thread-safe: concurrent `update_state` calls from ≥ 2 threads produce a consistent, non-corrupted `GraphState`. - [ ] No `# type: ignore` or type-suppression comments introduced. - [ ] A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - [ ] The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - [ ] The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. - [ ] All nox stages pass. - [ ] Coverage >= 97%. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: ca-new-issue-creator
freemo added this to the v3.5.0 milestone 2026-04-02 23:30:01 +00:00
Author
Owner

Label compliance fix applied:

  • Added missing labels based on issue content
  • Reason: Per CONTRIBUTING.md, every issue must have exactly one State/*, Type/*, and Priority/* label.

Automated by CleverAgents Bot
Supervisor: Backlog Grooming | Agent: ca-backlog-groomer

Label compliance fix applied: - Added missing labels based on issue content - Reason: Per CONTRIBUTING.md, every issue must have exactly one `State/*`, `Type/*`, and `Priority/*` label. --- **Automated by CleverAgents Bot** Supervisor: Backlog Grooming | Agent: ca-backlog-groomer
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • MoSCoW: MoSCoW/Should Have — bug or spec compliance issue.

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Issue triaged by project owner: - **State**: Verified - **MoSCoW**: MoSCoW/Should Have — bug or spec compliance issue. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: ca-project-owner
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#362 Epic: Security & Safety Hardening
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#1688
No description provided.