UAT: MEMORY_ENGINES global dict in engine_cache.py is not thread-safe — concurrent writes from multiple threads can corrupt the cache #3950

Open
opened 2026-04-06 07:45:18 +00:00 by freemo · 1 comment
Owner

Metadata

  • Branch: fix/concurrency/memory-engines-thread-safe-cache
  • Commit Message: fix(concurrency): protect MEMORY_ENGINES global dict with threading.Lock
  • Milestone: None — backlog
  • Parent Epic: TBD — needs manual linking

Backlog note: This issue was discovered during autonomous UAT operation
on milestone v3.6.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Background and Context

Code-level analysis of src/cleveragents/infrastructure/database/engine_cache.py and its consumers identified a race condition in the global engine cache. The MEMORY_ENGINES dict is a module-level singleton shared across all threads, but it is accessed without any synchronization primitive. The AsyncWorker service uses a ThreadPoolExecutor with up to 4 worker threads, and SubplanExecutionService uses parallel execution — both can trigger database access concurrently, making this an active risk.

Feature Area: Async and Concurrency Patterns — Thread Safety in Shared State

Severity: Medium — race condition in the global engine cache; concurrent writes from multiple threads can corrupt the dict or create duplicate engines for the same in-memory database URL


Current Behavior

MEMORY_ENGINES is a plain Python dict with no thread synchronization:

# src/cleveragents/infrastructure/database/engine_cache.py
MEMORY_ENGINES: dict[str, Engine] = {}

It is accessed in a check-then-act pattern (read-check-write) from two separate files without any lock:

unit_of_work.py (lines 76-84):

if self.database_url not in MEMORY_ENGINES:
    MEMORY_ENGINES[self.database_url] = create_engine(
        self.database_url,
        ...
    )
self._engine = MEMORY_ENGINES[self.database_url]

migration_runner.py (lines 248-253):

if self.database_url not in MEMORY_ENGINES:
    MEMORY_ENGINES[self.database_url] = create_engine(
        self.database_url,
        ...
    )
engine = MEMORY_ENGINES[self.database_url]

The race condition:

  1. Thread A checks "sqlite:///:memory:" not in MEMORY_ENGINES → True
  2. Thread B checks "sqlite:///:memory:" not in MEMORY_ENGINES → True (before A writes)
  3. Thread A creates engine_A and writes to MEMORY_ENGINES
  4. Thread B creates engine_B and overwrites MEMORY_ENGINES["sqlite:///:memory:"]
  5. Thread A now holds a reference to engine_A (via self._engine) but MEMORY_ENGINES points to engine_B
  6. Both threads are now using different in-memory databases — schema created by one is invisible to the other

This is particularly dangerous for in-memory SQLite databases where the entire purpose of the cache is to ensure all connections share the same database instance.


Expected Behavior

Shared global state accessed from multiple threads must be protected by appropriate synchronization primitives (locks). The MEMORY_ENGINES dict must be guarded by a threading.Lock so that the check-then-act pattern is atomic.


Acceptance Criteria

  • A threading.Lock named _MEMORY_ENGINES_LOCK is added to engine_cache.py alongside MEMORY_ENGINES
  • All check-then-act accesses to MEMORY_ENGINES in unit_of_work.py are wrapped in with _MEMORY_ENGINES_LOCK:
  • All check-then-act accesses to MEMORY_ENGINES in migration_runner.py are wrapped in with _MEMORY_ENGINES_LOCK:
  • No new race condition is introduced (e.g., lock is not released before the engine reference is captured)
  • All existing tests continue to pass after the changes
  • New Behave scenarios cover concurrent access to MEMORY_ENGINES

Supporting Information

Code Locations

  • src/cleveragents/infrastructure/database/engine_cache.pyMEMORY_ENGINES dict (line 15)
  • src/cleveragents/infrastructure/database/unit_of_work.py — lines 76-84
  • src/cleveragents/infrastructure/database/migration_runner.py — lines 248-253

Add a threading.Lock to engine_cache.py and use it to protect the check-then-act pattern:

import threading
from sqlalchemy.engine import Engine

MEMORY_ENGINES: dict[str, Engine] = {}
_MEMORY_ENGINES_LOCK = threading.Lock()

Then in both consumers:

with _MEMORY_ENGINES_LOCK:
    if self.database_url not in MEMORY_ENGINES:
        MEMORY_ENGINES[self.database_url] = create_engine(...)
    engine = MEMORY_ENGINES[self.database_url]

Subtasks

  • Add _MEMORY_ENGINES_LOCK = threading.Lock() to src/cleveragents/infrastructure/database/engine_cache.py
  • Export _MEMORY_ENGINES_LOCK from engine_cache.py for use by consumers
  • Wrap the check-then-act block in unit_of_work.py (lines 76-84) with with _MEMORY_ENGINES_LOCK:
  • Wrap the check-then-act block in migration_runner.py (lines 248-253) with with _MEMORY_ENGINES_LOCK:
  • Verify no other consumers of MEMORY_ENGINES exist that also need the lock
  • Tests (Behave): Add BDD scenarios for concurrent MEMORY_ENGINES access (two threads racing to create the same engine URL)
  • Tests (Behave): Verify that both threads end up with the same engine instance after the race
  • Tests (Robot): Add integration test verifying thread-safe engine cache under concurrent load
  • Verify coverage >= 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly (fix(concurrency): protect MEMORY_ENGINES global dict with threading.Lock), followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly (fix/concurrency/memory-engines-thread-safe-cache).
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage >= 97%.

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/concurrency/memory-engines-thread-safe-cache` - **Commit Message**: `fix(concurrency): protect MEMORY_ENGINES global dict with threading.Lock` - **Milestone**: None — backlog - **Parent Epic**: TBD — needs manual linking > **Backlog note:** This issue was discovered during autonomous UAT operation > on milestone v3.6.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- ## Background and Context Code-level analysis of `src/cleveragents/infrastructure/database/engine_cache.py` and its consumers identified a race condition in the global engine cache. The `MEMORY_ENGINES` dict is a module-level singleton shared across all threads, but it is accessed without any synchronization primitive. The `AsyncWorker` service uses a `ThreadPoolExecutor` with up to 4 worker threads, and `SubplanExecutionService` uses parallel execution — both can trigger database access concurrently, making this an active risk. **Feature Area:** Async and Concurrency Patterns — Thread Safety in Shared State **Severity:** Medium — race condition in the global engine cache; concurrent writes from multiple threads can corrupt the dict or create duplicate engines for the same in-memory database URL --- ## Current Behavior `MEMORY_ENGINES` is a plain Python `dict` with no thread synchronization: ```python # src/cleveragents/infrastructure/database/engine_cache.py MEMORY_ENGINES: dict[str, Engine] = {} ``` It is accessed in a check-then-act pattern (read-check-write) from two separate files without any lock: **`unit_of_work.py` (lines 76-84):** ```python if self.database_url not in MEMORY_ENGINES: MEMORY_ENGINES[self.database_url] = create_engine( self.database_url, ... ) self._engine = MEMORY_ENGINES[self.database_url] ``` **`migration_runner.py` (lines 248-253):** ```python if self.database_url not in MEMORY_ENGINES: MEMORY_ENGINES[self.database_url] = create_engine( self.database_url, ... ) engine = MEMORY_ENGINES[self.database_url] ``` **The race condition:** 1. Thread A checks `"sqlite:///:memory:" not in MEMORY_ENGINES` → True 2. Thread B checks `"sqlite:///:memory:" not in MEMORY_ENGINES` → True (before A writes) 3. Thread A creates `engine_A` and writes to `MEMORY_ENGINES` 4. Thread B creates `engine_B` and overwrites `MEMORY_ENGINES["sqlite:///:memory:"]` 5. Thread A now holds a reference to `engine_A` (via `self._engine`) but `MEMORY_ENGINES` points to `engine_B` 6. Both threads are now using **different in-memory databases** — schema created by one is invisible to the other This is particularly dangerous for in-memory SQLite databases where the entire purpose of the cache is to ensure all connections share the same database instance. --- ## Expected Behavior Shared global state accessed from multiple threads must be protected by appropriate synchronization primitives (locks). The `MEMORY_ENGINES` dict must be guarded by a `threading.Lock` so that the check-then-act pattern is atomic. --- ## Acceptance Criteria - [ ] A `threading.Lock` named `_MEMORY_ENGINES_LOCK` is added to `engine_cache.py` alongside `MEMORY_ENGINES` - [ ] All check-then-act accesses to `MEMORY_ENGINES` in `unit_of_work.py` are wrapped in `with _MEMORY_ENGINES_LOCK:` - [ ] All check-then-act accesses to `MEMORY_ENGINES` in `migration_runner.py` are wrapped in `with _MEMORY_ENGINES_LOCK:` - [ ] No new race condition is introduced (e.g., lock is not released before the engine reference is captured) - [ ] All existing tests continue to pass after the changes - [ ] New Behave scenarios cover concurrent access to `MEMORY_ENGINES` --- ## Supporting Information ### Code Locations - `src/cleveragents/infrastructure/database/engine_cache.py` — `MEMORY_ENGINES` dict (line 15) - `src/cleveragents/infrastructure/database/unit_of_work.py` — lines 76-84 - `src/cleveragents/infrastructure/database/migration_runner.py` — lines 248-253 ### Recommended Fix Add a `threading.Lock` to `engine_cache.py` and use it to protect the check-then-act pattern: ```python import threading from sqlalchemy.engine import Engine MEMORY_ENGINES: dict[str, Engine] = {} _MEMORY_ENGINES_LOCK = threading.Lock() ``` Then in both consumers: ```python with _MEMORY_ENGINES_LOCK: if self.database_url not in MEMORY_ENGINES: MEMORY_ENGINES[self.database_url] = create_engine(...) engine = MEMORY_ENGINES[self.database_url] ``` --- ## Subtasks - [ ] Add `_MEMORY_ENGINES_LOCK = threading.Lock()` to `src/cleveragents/infrastructure/database/engine_cache.py` - [ ] Export `_MEMORY_ENGINES_LOCK` from `engine_cache.py` for use by consumers - [ ] Wrap the check-then-act block in `unit_of_work.py` (lines 76-84) with `with _MEMORY_ENGINES_LOCK:` - [ ] Wrap the check-then-act block in `migration_runner.py` (lines 248-253) with `with _MEMORY_ENGINES_LOCK:` - [ ] Verify no other consumers of `MEMORY_ENGINES` exist that also need the lock - [ ] Tests (Behave): Add BDD scenarios for concurrent `MEMORY_ENGINES` access (two threads racing to create the same engine URL) - [ ] Tests (Behave): Verify that both threads end up with the same engine instance after the race - [ ] Tests (Robot): Add integration test verifying thread-safe engine cache under concurrent load - [ ] Verify coverage >= 97% via `nox -s coverage_report` - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done This issue is complete when: - [ ] All subtasks above are completed and checked off. - [ ] A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly (`fix(concurrency): protect MEMORY_ENGINES global dict with threading.Lock`), followed by a blank line, then additional lines providing relevant details about the implementation. - [ ] The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly (`fix/concurrency/memory-engines-thread-safe-cache`). - [ ] The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. - [ ] All nox stages pass. - [ ] Coverage >= 97%. --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
Author
Owner

⚠️ Orphan Issue — Needs Manual Parent Epic Linking

This issue was created without a known parent Epic. Per CONTRIBUTING.md, all non-Epic, non-Legendary issues must be linked to a parent Epic using Forgejo's dependency system (child blocks parent).

This issue likely belongs under the Concurrency + Cleanup Epic (referenced in docs/timeline.md as Epic #365), but this could not be confirmed automatically. A maintainer should:

  1. Identify the correct parent Epic for this thread-safety fix (likely the Concurrency + Cleanup Epic)
  2. Open this issue and add the parent Epic under "blocks", OR open the parent Epic and add this issue under "depends on"

This creates the machine-readable directed link: this issue blocks the parent Epic (the Epic cannot be complete until this fix is done).


Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

⚠️ **Orphan Issue — Needs Manual Parent Epic Linking** This issue was created without a known parent Epic. Per CONTRIBUTING.md, all non-Epic, non-Legendary issues must be linked to a parent Epic using Forgejo's dependency system (child **blocks** parent). This issue likely belongs under the **Concurrency + Cleanup** Epic (referenced in `docs/timeline.md` as Epic #365), but this could not be confirmed automatically. A maintainer should: 1. Identify the correct parent Epic for this thread-safety fix (likely the Concurrency + Cleanup Epic) 2. Open this issue and add the parent Epic under **"blocks"**, OR open the parent Epic and add this issue under **"depends on"** This creates the machine-readable directed link: **this issue blocks the parent Epic** (the Epic cannot be complete until this fix is done). --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#3950
No description provided.