UAT: ContextTierService in-memory tier stores are not thread-safe — concurrent access can corrupt hot/warm/cold tier state #3992

Open
opened 2026-04-06 08:21:46 +00:00 by freemo · 0 comments
Owner

Background and Context

The ContextTierService in src/cleveragents/application/services/context_tiers.py is the core implementation of the ACMS hot/warm/cold tiered storage system described in the specification. The spec describes a multi-actor system where multiple agents can access context concurrently. The tiered storage (hot/warm/cold) must be safe for concurrent access, or at minimum the documentation must clearly indicate threading constraints and the system must enforce them.

The spec also notes that DefaultStrategyExecutor is designed with future parallel strategy invocation in mind — when that parallelism is enabled, the tier service will be accessed concurrently from multiple threads.

Current Behavior

The ContextTierService docstring explicitly states that the service is "designed for single-threaded use" and that the in-memory tier stores are plain dict instances without synchronisation:

class ContextTierService(TierRuntimeMixin, ScopedTierMixin):
    """...
    .. note::

       This service is designed for **single-threaded** use.  The
       in-memory tier stores are plain ``dict`` instances without
       synchronisation.  Concurrent callers must coordinate externally.
    """

The _hot, _warm, and _cold stores are plain Python dicts. Operations like store(), get(), promote(), demote(), and evict_lru() all modify these dicts without any locking. In a multi-actor scenario:

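  • Concurrent store() calls can interleave and leave the tier state inconsistent
  • _enforce_hot_budget() reads and then modifies _hot as separate steps, so two threads can act on the same stale view of the budget
  • promote() does a pop() followed by an insert, which is not atomic across threads; between the two steps the fragment is in neither tier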
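
The pop() + insert gap can be made visible with a small, deterministic sketch. The two dicts and the promote() function below are hypothetical stand-ins, not the real ContextTierService code, and the events exist only to force the interleaving that would otherwise happen by chance.

```python
import threading

# Hypothetical stand-ins for two tier stores; not the real service.
warm = {"frag-1": "payload"}
hot = {}

popped = threading.Event()   # set once the fragment has left the warm tier
resume = threading.Event()   # lets the promoting thread finish its insert
observed = {}


def promote(key):
    """Unsynchronised promotion: pop from warm, then insert into hot."""
    value = warm.pop(key)
    popped.set()             # the fragment is now in neither tier
    resume.wait(timeout=5)   # hold the race window open for the reader
    hot[key] = value


def reader(key):
    """Look the fragment up while the promotion is half done."""
    popped.wait(timeout=5)
    observed["found"] = key in hot or key in warm
    resume.set()


t1 = threading.Thread(target=promote, args=("frag-1",))
t2 = threading.Thread(target=reader, args=("frag-1",))
t1.start()
t2.start()
t1.join()
t2.join()

print(observed["found"])   # False: the reader saw the fragment in no tier
print("frag-1" in hot)     # True: the promotion completed afterwards
```

Holding a single lock around both the pop() and the insert, and around the reader's lookup, would close this window.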

Code location: src/cleveragents/application/services/context_tiers.py, ContextTierService.__init__() and all mutation methods

Steps to reproduce:

  1. Create a ContextTierService
  2. Concurrently call store() from multiple threads with different fragments
  3. Observe that fragments can be lost or that hot-tier budget enforcement can be incorrect, because of race conditions between the threads
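
A minimal harness for the steps above might look like the following. The issue does not show the signature of store(), so the harness takes a hypothetical store_fn(key, value) callable; against the real service this would be a small adapter around ContextTierService.store(). The plain dict used as a target here only demonstrates the call shape.

```python
import threading


def hammer_store(store_fn, n_threads=8, per_thread=500):
    """Call store_fn(key, value) from n_threads threads at once.

    Returns every key that was submitted, so the caller can compare it
    against what the target actually retained.
    """
    barrier = threading.Barrier(n_threads)
    submitted = [[] for _ in range(n_threads)]

    def worker(tid):
        barrier.wait()  # release all threads together to raise contention
        for i in range(per_thread):
            key = f"t{tid}-f{i}"
            store_fn(key, f"payload-{i}")
            submitted[tid].append(key)

    threads = [
        threading.Thread(target=worker, args=(tid,))
        for tid in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [key for keys in submitted for key in keys]


# Stand-in target (assumption): a plain dict instead of the real service.
retained = {}
submitted = hammer_store(retained.__setitem__)
missing = [key for key in submitted if key not in retained]
print(len(submitted), len(missing))
```

Because races are timing dependent, a single clean run against the real service would not prove the absence of the bug; repeated runs with more threads make a failure more likely to surface.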

Expected Behavior

Per the spec's multi-actor architecture, the ContextTierService should be safe for concurrent access. Either:

  • The tier stores should be protected by a reentrant lock (threading.RLock) so that all mutation methods are atomic, or
  • The service must enforce single-threaded access at the framework level (e.g., via an executor that serialises calls) and document this constraint prominently with a runtime guard
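
For the second option, one possible shape of a runtime guard is an owner-thread check on every entry point. SingleThreadGuard and GuardedStore below are hypothetical names for illustration, not existing classes in the codebase.

```python
import threading


class SingleThreadGuard:
    """Hypothetical helper that binds an object to its creating thread."""

    def __init__(self):
        self._owner_ident = threading.get_ident()

    def check(self):
        # Fail loudly instead of silently racing on the tier stores.
        if threading.get_ident() != self._owner_ident:
            raise RuntimeError(
                "tier service accessed from a thread other than its owner"
            )


class GuardedStore:
    """Toy store, not the real ContextTierService."""

    def __init__(self):
        self._guard = SingleThreadGuard()
        self._hot = {}

    def store(self, key, value):
        self._guard.check()
        self._hot[key] = value


svc = GuardedStore()
svc.store("a", 1)  # same thread as the constructor: allowed

errors = []


def offender():
    try:
        svc.store("b", 2)  # different thread: rejected by the guard
    except RuntimeError as exc:
        errors.append(str(exc))


t = threading.Thread(target=offender)
t.start()
t.join()
print(errors)
```

A guard like this does not make the service thread-safe; it only turns silent misuse into an immediate error, which suits the "explicitly single-threaded" variant of the docstring.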

Acceptance Criteria

  • ContextTierService is safe for concurrent access from multiple threads (no data loss, no budget enforcement errors under concurrent load)
  • All mutation methods (store, get, promote, demote, evict_lru, _enforce_hot_budget) are protected by a reentrant lock or equivalent synchronisation primitive
  • The class docstring accurately reflects the threading model (thread-safe or explicitly single-threaded with enforced guard)
  • Unit tests cover concurrent store() calls from multiple threads and verify no fragments are lost
  • No regression in existing ContextTierService tests

Supporting Information

  • Feature area: Memory and Knowledge Management — Context Tier Service
  • Discovered by: UAT testing of the ACMS Context Pipeline
  • Parent Epic: #396 Epic: ACMS Context Pipeline
  • Related spec section: ACMS Tiered Storage, multi-actor context access, DefaultStrategyExecutor parallel execution note
  • Impact: Silent data loss or incorrect budget enforcement in multi-actor CleverAgents deployments

Metadata

  • Branch: fix/context-tier-service-thread-safety
  • Commit Message: fix(context-tiers): add thread-safety to ContextTierService tier stores
  • Milestone: (none — backlog)
  • Parent Epic: #396

Subtasks

  • Add threading.RLock to ContextTierService.__init__() to protect _hot, _warm, and _cold stores
  • Wrap store() with lock acquisition to ensure atomic insert + budget enforcement
  • Wrap get() with lock acquisition to prevent torn reads during concurrent promotion
  • Wrap promote() with lock acquisition — pop() + insert must be atomic
  • Wrap demote() with lock acquisition
  • Wrap evict_lru() with lock acquisition
  • Wrap _enforce_hot_budget() with lock acquisition
  • Update class docstring to reflect thread-safe design
  • Tests (pytest/Behave): Add concurrent-access scenario — multiple threads calling store() simultaneously, assert no fragments lost
  • Tests (pytest/Behave): Add concurrent promotion/demotion scenario, assert tier counts remain consistent
  • Verify coverage >= 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors
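
The locking subtasks above can be sketched on a toy two-tier store. LockedTierStore is a hypothetical stand-in, and it assumes the hot budget is counted in entries, which the issue does not state. The point of the sketch is why the lock is reentrant: store() holds the lock and then calls _enforce_hot_budget(), which acquires it again.

```python
import threading
from collections import OrderedDict


class LockedTierStore:
    """Toy two-tier store; a stand-in, not the real ContextTierService."""

    def __init__(self, hot_budget):
        self._lock = threading.RLock()
        self._hot = OrderedDict()  # insertion order doubles as LRU order
        self._warm = {}
        self._hot_budget = hot_budget  # assumption: budget counts entries

    def store(self, key, value):
        with self._lock:
            self._hot[key] = value
            self._hot.move_to_end(key)
            self._enforce_hot_budget()  # nested acquisition of the lock

    def _enforce_hot_budget(self):
        with self._lock:
            while len(self._hot) > self._hot_budget:
                old_key, old_value = self._hot.popitem(last=False)
                self._warm[old_key] = old_value

    def promote(self, key):
        with self._lock:  # pop + insert happen under one lock hold
            if key in self._warm:
                value = self._warm.pop(key)
                self.store(key, value)

    def counts(self):
        with self._lock:
            return len(self._hot), len(self._warm)


svc = LockedTierStore(hot_budget=50)


def worker(tid):
    for i in range(200):
        svc.store(f"t{tid}-f{i}", i)


threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(svc.counts())  # (50, 1550): budget respected, no fragment lost
```

With a plain threading.Lock instead of threading.RLock, the nested call from store() into _enforce_hot_budget() would deadlock on the first insert.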

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly (fix(context-tiers): add thread-safety to ContextTierService tier stores), followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly (fix/context-tier-service-thread-safety).
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage >= 97%.

Backlog note: This issue was discovered during autonomous operation
on milestone v3.4.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

HAL9000 added this to the v3.5.0 milestone 2026-04-09 03:12:13 +00:00
Blocks
#396 Epic: ACMS Context Pipeline
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#3992