BUG-HUNT: [concurrency] In-memory storage in PlanLifecycleService is not thread-safe #1281

Open
opened 2026-04-02 08:44:14 +00:00 by freemo · 0 comments
Owner

Bug Report: [concurrency] — In-memory storage in PlanLifecycleService is not thread-safe

Severity Assessment

  • Impact: In a multi-threaded or async environment, simultaneous writes to the in-memory caches (_actions, _plans) could lead to race conditions, resulting in data loss or an inconsistent state.
  • Likelihood: Low, as the primary deployment is likely single-threaded CLI or database-backed. However, if the service is ever used in a concurrent server environment without a database, this bug will manifest.
  • Priority: Medium

Location

  • File: src/cleveragents/application/services/plan_lifecycle_service.py
  • Function/Class: PlanLifecycleService
  • Lines: 213-214

Description

The PlanLifecycleService uses standard Python dictionaries (self._actions, self._plans) as a fallback for in-memory storage when a database UnitOfWork is not provided. Access to these dictionaries is not protected by locks or any other synchronization mechanism.

If two threads or async tasks were to call a method like create_action or use_action concurrently, they could both read from and write to these dictionaries simultaneously, leading to a race condition. For example, one action creation could be overwritten by another, leading to data loss.

Evidence

# Relevant code snippet showing the issue
class PlanLifecycleService:
    def __init__(self, ...):
        # ...
        # In-memory fallback storage (used only when no UoW is provided)
        self._actions: dict[str, Action] = {}
        self._plans: dict[str, Plan] = {}

    def create_action(self, ...) -> Action:
        # ...
        # This line is not atomic and is not protected by a lock
        self._actions[action_name_str] = action
        return action

Expected Behavior

A service that might be used in different contexts (CLI, server) should be designed to be thread-safe. The in-memory storage should use thread-safe data structures, such as a dictionary protected by a threading.Lock or asyncio.Lock, to ensure that writes are atomic.

Actual Behavior

The use of standard dictionaries for in-memory storage makes the service unsafe for concurrent use when not backed by a database.

Suggested Fix

Wrap all access (read and write) to self._actions and self._plans with a lock.

import threading

class PlanLifecycleService:
    def __init__(self, ...):
        # ...
        self._actions: dict[str, Action] = {}
        self._plans: dict[str, Plan] = {}
        self._lock = threading.Lock() # Or asyncio.Lock for async contexts

    def create_action(self, ...) -> Action:
        # ...
        with self._lock:
            self._actions[action_name_str] = action
        return action

    def get_action(self, action_name: str) -> Action:
        with self._lock:
            action = self._actions.get(normalised)
        # ...

Category

concurrency


Metadata

  • Branch: fix/concurrency-plan-lifecycle-service-thread-safe-in-memory-storage
  • Commit Message: fix(concurrency): protect PlanLifecycleService in-memory storage with threading.Lock
  • Milestone: v3.3.0
  • Parent Epic: #362

Subtasks

  • Write a TDD issue-capture Behave scenario tagged @tdd_issue, @tdd_expected_fail that demonstrates the race condition (e.g., two concurrent calls to create_action result in data loss or inconsistent state)
  • Evaluate whether threading.Lock or asyncio.Lock is the more appropriate synchronisation primitive given the service's usage contexts (CLI vs. async server)
  • Add self._lock = threading.Lock() (or equivalent) to PlanLifecycleService.__init__ (lines 213-214 of plan_lifecycle_service.py)
  • Wrap all read and write accesses to self._actions with self._lock across all methods in PlanLifecycleService
  • Wrap all read and write accesses to self._plans with self._lock across all methods in PlanLifecycleService
  • Remove the @tdd_expected_fail tag from the TDD scenario once the fix is in place and verify it passes
  • Run nox -e typecheck to confirm no type regressions
  • Run nox -e unit_tests to confirm all Behave tests pass
  • Run nox -e coverage_report to confirm coverage remains ≥ 97%

Definition of Done

  • The @tdd_expected_fail TDD capture scenario is committed first and demonstrates the race condition on self._actions and self._plans
  • self._lock (a threading.Lock or asyncio.Lock) is added to PlanLifecycleService.__init__
  • All read and write accesses to self._actions and self._plans are wrapped with self._lock
  • The TDD scenario passes without @tdd_expected_fail after the fix
  • No regressions in existing plan lifecycle or action-related tests
  • All nox stages pass (nox -e lint, nox -e typecheck, nox -e unit_tests, nox -e integration_tests, nox -e coverage_report)
  • Coverage >= 97%
  • Commit is on branch fix/concurrency-plan-lifecycle-service-thread-safe-in-memory-storage with message fix(concurrency): protect PlanLifecycleService in-memory storage with threading.Lock
  • PR is merged and this issue is closed
## Bug Report: [concurrency] — In-memory storage in PlanLifecycleService is not thread-safe ### Severity Assessment - **Impact**: In a multi-threaded or async environment, simultaneous writes to the in-memory caches (`_actions`, `_plans`) could lead to race conditions, resulting in data loss or an inconsistent state. - **Likelihood**: Low, as the primary deployment is likely single-threaded CLI or database-backed. However, if the service is ever used in a concurrent server environment without a database, this bug will manifest. - **Priority**: Medium ### Location - **File**: `src/cleveragents/application/services/plan_lifecycle_service.py` - **Function/Class**: `PlanLifecycleService` - **Lines**: 213-214 ### Description The `PlanLifecycleService` uses standard Python dictionaries (`self._actions`, `self._plans`) as a fallback for in-memory storage when a database `UnitOfWork` is not provided. Access to these dictionaries is not protected by locks or any other synchronization mechanism. If two threads or async tasks were to call a method like `create_action` or `use_action` concurrently, they could both read from and write to these dictionaries simultaneously, leading to a race condition. For example, one action creation could be overwritten by another, leading to data loss. ### Evidence ```python # Relevant code snippet showing the issue class PlanLifecycleService: def __init__(self, ...): # ... # In-memory fallback storage (used only when no UoW is provided) self._actions: dict[str, Action] = {} self._plans: dict[str, Plan] = {} def create_action(self, ...) -> Action: # ... # This line is not atomic and is not protected by a lock self._actions[action_name_str] = action return action ``` ### Expected Behavior A service that might be used in different contexts (CLI, server) should be designed to be thread-safe. The in-memory storage should use thread-safe data structures, such as a dictionary protected by a `threading.Lock` or `asyncio.Lock`, to ensure that writes are atomic. ### Actual Behavior The use of standard dictionaries for in-memory storage makes the service unsafe for concurrent use when not backed by a database. ### Suggested Fix Wrap all access (read and write) to `self._actions` and `self._plans` with a lock. ```python import threading class PlanLifecycleService: def __init__(self, ...): # ... self._actions: dict[str, Action] = {} self._plans: dict[str, Plan] = {} self._lock = threading.Lock() # Or asyncio.Lock for async contexts def create_action(self, ...) -> Action: # ... with self._lock: self._actions[action_name_str] = action return action def get_action(self, action_name: str) -> Action: with self._lock: action = self._actions.get(normalised) # ... ``` ### Category concurrency --- ## Metadata - **Branch**: `fix/concurrency-plan-lifecycle-service-thread-safe-in-memory-storage` - **Commit Message**: `fix(concurrency): protect PlanLifecycleService in-memory storage with threading.Lock` - **Milestone**: v3.3.0 - **Parent Epic**: #362 ## Subtasks - [ ] Write a TDD issue-capture Behave scenario tagged `@tdd_issue`, `@tdd_expected_fail` that demonstrates the race condition (e.g., two concurrent calls to `create_action` result in data loss or inconsistent state) - [ ] Evaluate whether `threading.Lock` or `asyncio.Lock` is the more appropriate synchronisation primitive given the service's usage contexts (CLI vs. async server) - [ ] Add `self._lock = threading.Lock()` (or equivalent) to `PlanLifecycleService.__init__` (lines 213-214 of `plan_lifecycle_service.py`) - [ ] Wrap all read and write accesses to `self._actions` with `self._lock` across all methods in `PlanLifecycleService` - [ ] Wrap all read and write accesses to `self._plans` with `self._lock` across all methods in `PlanLifecycleService` - [ ] Remove the `@tdd_expected_fail` tag from the TDD scenario once the fix is in place and verify it passes - [ ] Run `nox -e typecheck` to confirm no type regressions - [ ] Run `nox -e unit_tests` to confirm all Behave tests pass - [ ] Run `nox -e coverage_report` to confirm coverage remains ≥ 97% ## Definition of Done - [ ] The `@tdd_expected_fail` TDD capture scenario is committed first and demonstrates the race condition on `self._actions` and `self._plans` - [ ] `self._lock` (a `threading.Lock` or `asyncio.Lock`) is added to `PlanLifecycleService.__init__` - [ ] All read and write accesses to `self._actions` and `self._plans` are wrapped with `self._lock` - [ ] The TDD scenario passes without `@tdd_expected_fail` after the fix - [ ] No regressions in existing plan lifecycle or action-related tests - [ ] All nox stages pass (`nox -e lint`, `nox -e typecheck`, `nox -e unit_tests`, `nox -e integration_tests`, `nox -e coverage_report`) - [ ] Coverage >= 97% - [ ] Commit is on branch `fix/concurrency-plan-lifecycle-service-thread-safe-in-memory-storage` with message `fix(concurrency): protect PlanLifecycleService in-memory storage with threading.Lock` - [ ] PR is merged and this issue is closed
freemo added this to the v3.3.0 milestone 2026-04-02 08:44:52 +00:00
freemo self-assigned this 2026-04-02 18:45:26 +00:00
freemo removed this from the v3.3.0 milestone 2026-04-07 02:31:55 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#1281
No description provided.