BUG-HUNT: [data-integrity] PlanLifecycleService in-memory cache can serve stale data when multiple processes/instances exist #7424

Open
opened 2026-04-10 19:12:21 +00:00 by HAL9000 · 3 comments
Owner

Bug Report: Data Integrity — PlanLifecycleService In-Memory Cache Staleness

Severity Assessment

  • Impact: In multi-process or multi-instance deployments, the in-memory _plans and _actions caches in PlanLifecycleService will serve stale data. A plan modified by one process will not be visible to another process's cache, causing incorrect phase transition decisions and potentially allowing duplicate phase transitions.
  • Likelihood: Medium; any deployment with multiple worker processes (e.g., Gunicorn workers or multiple CLI invocations) is exposed to this
  • Priority: High

Location

  • File: src/cleveragents/application/services/plan_lifecycle_service.py
  • Functions: PlanLifecycleService.get_plan(), PlanLifecycleService.list_plans()
  • Lines: Throughout

Description

PlanLifecycleService uses a dual-mode approach:

  • In-memory dict as primary cache (self._plans, self._actions)
  • Database via UnitOfWork as persistence layer

The get_plan() method first checks the in-memory cache:

def get_plan(self, plan_id: str) -> Plan:
    plan = self._plans.get(plan_id)  # ← cache hit, no DB check!
    if plan is not None:
        return plan
    # Try persistence layer only on cache miss
    ...

This means:

  1. Once a plan is loaded into cache, it is NEVER refreshed from DB
  2. If another process (e.g., a second CLI invocation or async worker) updates the plan in DB, this instance will serve the old cached version
  3. A plan that has been moved to ProcessingState.APPLIED by one process can be re-executed by another process that has the old ProcessingState.QUEUED version cached

This is especially dangerous for idempotency — the same phase transition could be executed twice by two processes with different cache states.
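
The scenario above can be reproduced in miniature. The sketch below is illustrative only: `FakeDb`, `CacheFirstService`, and the simplified `Plan` with a string `state` are hypothetical stand-ins, not the real `PlanLifecycleService`, `UnitOfWork`, or `ProcessingState`. Two service instances play the role of two processes sharing one database.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Plan:
    plan_id: str
    state: str  # hypothetical stand-in for ProcessingState


class FakeDb:
    """Hypothetical stand-in for the shared database behind UnitOfWork."""

    def __init__(self) -> None:
        self.rows: Dict[str, Plan] = {}

    def get(self, plan_id: str) -> Optional[Plan]:
        return self.rows.get(plan_id)

    def save(self, plan: Plan) -> None:
        self.rows[plan.plan_id] = plan


class CacheFirstService:
    """Mimics the cache-first read path quoted above (one instance per process)."""

    def __init__(self, db: FakeDb) -> None:
        self._db = db
        self._plans: Dict[str, Plan] = {}

    def get_plan(self, plan_id: str) -> Plan:
        plan = self._plans.get(plan_id)
        if plan is not None:
            return plan  # cache hit: the database is never consulted
        persisted = self._db.get(plan_id)
        if persisted is None:
            raise KeyError(plan_id)
        self._plans[plan_id] = persisted
        return persisted

    def apply(self, plan_id: str) -> bool:
        """Run the QUEUED -> APPLIED transition; return True if it executed."""
        plan = self.get_plan(plan_id)
        if plan.state != "QUEUED":
            return False
        updated = replace(plan, state="APPLIED")
        self._plans[plan_id] = updated
        self._db.save(updated)
        return True


db = FakeDb()
db.save(Plan("p1", "QUEUED"))
process_a = CacheFirstService(db)
process_b = CacheFirstService(db)

process_b.get_plan("p1")                 # B caches the QUEUED version
first = process_a.apply("p1")            # A applies the plan and persists APPLIED
stale = process_b.get_plan("p1").state   # B still reports "QUEUED"
second = process_b.apply("p1")           # B runs the same transition again
```

Here both `first` and `second` are `True` even though the database already held the `APPLIED` state before the second call, which is the duplicate transition described in point 3.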

Evidence

# src/cleveragents/application/services/plan_lifecycle_service.py
def get_plan(self, plan_id: str) -> Plan:
    plan = self._plans.get(plan_id)   # ← cached version, potentially stale
    if plan is not None:
        return plan                    # ← returned without DB refresh!
    # Try persistence layer (only on cache miss, never on cache hit)
    if self._persisted and self.unit_of_work is not None:
        with self.unit_of_work.transaction() as ctx:
            persisted = ctx.lifecycle_plans.get(plan_id)
        if persisted is not None:
            self._plans[plan_id] = persisted
            return persisted
    raise NotFoundError(...)

Expected Behavior

In persisted mode, get_plan() should always read from the database (or use a short-lived cache with invalidation) rather than serve potentially stale in-memory copies.
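
As a rough illustration of this expectation, the check below uses a hypothetical `DbFirstReader` whose `get_plan()` consults the store on every call. The plain dict standing in for the database and the string states are assumptions made for the example, not the project's actual types.

```python
from typing import Dict, Optional


class DbFirstReader:
    """Hypothetical reader: every get_plan() call consults the shared store."""

    def __init__(self, store: Dict[str, str]) -> None:
        self._store = store               # stand-in for the shared database
        self._plans: Dict[str, str] = {}  # local cache, refreshed on each read

    def get_plan(self, plan_id: str) -> str:
        persisted: Optional[str] = self._store.get(plan_id)
        if persisted is None:
            raise KeyError(plan_id)
        self._plans[plan_id] = persisted  # keep the cache in step with the store
        return persisted


store = {"p1": "QUEUED"}
reader = DbFirstReader(store)
before = reader.get_plan("p1")   # "QUEUED"
store["p1"] = "APPLIED"          # simulates a write made by another process
after = reader.get_plan("p1")    # "APPLIED": the external update is visible
```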

Actual Behavior

Once a plan is in the in-memory cache, it is never refreshed, causing stale reads in any multi-process/multi-instance scenario.

Suggested Fix

  1. In persisted mode, always refresh from DB on get_plan():
def get_plan(self, plan_id: str) -> Plan:
    if self._persisted and self.unit_of_work is not None:
        with self.unit_of_work.transaction() as ctx:
            persisted = ctx.lifecycle_plans.get(plan_id)
        if persisted is not None:
            self._plans[plan_id] = persisted  # refresh cache
            return persisted
        raise NotFoundError(...)
    # In-memory mode: use cache
    plan = self._plans.get(plan_id)
    if plan is not None:
        return plan
    raise NotFoundError(...)
  2. Or use a short-lived cache with TTL-based invalidation
  3. Or document clearly that the service is single-instance only
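
The TTL option could look roughly like the sketch below. `TtlCache` and `get_with_ttl` are hypothetical names introduced for illustration, and the injectable `clock` parameter is an assumption added so that expiry can be exercised without sleeping; none of this is taken from the existing service.

```python
import time
from typing import Callable, Dict, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class TtlCache(Generic[T]):
    """Hypothetical cache whose entries expire ttl_seconds after being stored."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, T]] = {}

    def get(self, key: str) -> Optional[T]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: force the caller to reload
            return None
        return value

    def put(self, key: str, value: T) -> None:
        self._entries[key] = (self._clock(), value)

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)


def get_with_ttl(
    cache: TtlCache[T], load: Callable[[str], Optional[T]], key: str
) -> Optional[T]:
    """Return a fresh cached value, otherwise reload it through `load`."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    loaded = load(key)
    if loaded is not None:
        cache.put(key, loaded)
    return loaded


# Usage with a fake clock and a dict standing in for the database.
now = [0.0]
cache: TtlCache[str] = TtlCache(ttl_seconds=5.0, clock=lambda: now[0])
db = {"p1": "QUEUED"}

first_read = get_with_ttl(cache, db.get, "p1")   # loads "QUEUED" from the db
db["p1"] = "APPLIED"                             # write by another process
within_ttl = get_with_ttl(cache, db.get, "p1")   # still "QUEUED" inside the TTL
now[0] = 6.0
after_ttl = get_with_ttl(cache, db.get, "p1")    # expired, reloads "APPLIED"
```

A TTL only bounds how long a stale copy can be served; it does not remove the window. Code paths that decide phase transitions would likely still need to read from the database directly, or call `invalidate()` first.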

Category

data-flow

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD with tags: @tdd_issue, @tdd_issue_<this-issue-number>, @tdd_expected_fail.


Automated by CleverAgents Bot
Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor

HAL9000 added this to the v3.2.0 milestone 2026-04-10 19:34:59 +00:00
HAL9000 self-assigned this 2026-04-11 03:21:07 +00:00
Author
Owner

Verified — Critical data integrity bug: PlanLifecycleService in-memory cache serves stale data in multi-process scenarios. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Author
Owner

Verified — Critical data integrity bug: PlanLifecycleService in-memory cache serves stale data in multi-process scenarios. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Author
Owner

Verified — Critical data integrity bug: PlanLifecycleService in-memory cache serves stale data in multi-process scenarios. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#7424