fix(acms): harden hot/warm/cold tier service reliability #11238

Merged
HAL9000 merged 2 commits from pr-fix/9663-hot-warm-cold-tier-reliability into master 2026-05-28 21:31:23 +00:00
Owner

Summary

Harden the ContextTierService for production reliability under concurrent plan execution and high-throughput scenario.

Changes

  1. Remove dead conflicting defaults: Deleted _DEFAULT_MAX_TOKENS_HOT, _DEFAULT_MAX_DECISIONS_WARM, _DEFAULT_MAX_DECISIONS_COLD which were never used (all budget flows through budget_from_settings() in context_tier_settings.py) and contradicted the canonical values.

  2. Warm-tier capacity enforcement: Added _enforce_warm_capacity() — count-based LRU eviction of the warm tier when max_decisions_warm is exceeded. Triggered after:

    • Cold → warm promotion (in promote())
    • Hot-budget-fallback restore to warm (when promoted fragment could't fit in hot)
  3. Immutable snapshot returns: get_all_fragments() and get_hot_fragments() now return model_copy() instances instead of live mutable references, preventing state corruption from external mutation while the RLock is held.

  4. Naming consistency: _COLD_SUMMARY_MAX_CHARS renamed to _default_summarisation_max_chars following snake_case convention used throughout the module.

Impact

  • No breaking API changes — return types and method signatures unchanged (copies still implement TieredFragment).
  • Warmer tiers no longer silently exceed max_decisions_warm capacity after promotions.
  • Thread-safety improved for all public reader methods.
## Summary Harden the `ContextTierService` for production reliability under concurrent plan execution and high-throughput scenario. ### Changes 1. **Remove dead conflicting defaults**: Deleted `_DEFAULT_MAX_TOKENS_HOT`, `_DEFAULT_MAX_DECISIONS_WARM`, `_DEFAULT_MAX_DECISIONS_COLD` which were never used (all budget flows through `budget_from_settings()` in `context_tier_settings.py`) and contradicted the canonical values. 2. **Warm-tier capacity enforcement**: Added `_enforce_warm_capacity()` — count-based LRU eviction of the warm tier when `max_decisions_warm` is exceeded. Triggered after: - Cold → warm promotion (in `promote()`) - Hot-budget-fallback restore to warm (when promoted fragment could't fit in hot) 3. **Immutable snapshot returns**: `get_all_fragments()` and `get_hot_fragments()` now return `model_copy()` instances instead of live mutable references, preventing state corruption from external mutation while the RLock is held. 4. **Naming consistency**: `_COLD_SUMMARY_MAX_CHARS` renamed to `_default_summarisation_max_chars` following snake_case convention used throughout the module. ### Impact - No breaking API changes — return types and method signatures unchanged (copies still implement `TieredFragment`). - Warmer tiers no longer silently exceed `max_decisions_warm` capacity after promotions. - Thread-safety improved for all public reader methods.
fix(acms): harden hot/warm/cold tier service reliability
All checks were successful
CI / lint (pull_request) Successful in 1m11s
CI / typecheck (pull_request) Successful in 1m28s
CI / security (pull_request) Successful in 1m27s
CI / push-validation (pull_request) Successful in 45s
CI / helm (pull_request) Successful in 51s
CI / build (pull_request) Successful in 1m10s
CI / quality (pull_request) Successful in 1m41s
CI / integration_tests (pull_request) Successful in 7m31s
CI / unit_tests (pull_request) Successful in 9m12s
CI / docker (pull_request) Successful in 1m47s
CI / coverage (pull_request) Successful in 14m21s
CI / status-check (pull_request) Successful in 3s
03e34b035e
- Remove unused conflicting _DEFAULT_* constants that conflicted with
  canonical defaults in context_tier_settings.py (which serves as the
  sole source of truth for budget/setting defaults).

- Add _enforce_warm_capacity() to enforce max_decisions_warm limit on
  warm tier after cold→warm promotion and hot-budget-fallback restore,
  preventing silent over-capacity data accumulation.

- Return deep copies (model_copy) from get_all_fragments() and
  get_hot_fragments() to prevent callers from mutating internal fragment
  state while the service holds its RLock under concurrent plan execution.

- Rename _COLD_SUMMARY_MAX_CHARS → _default_summarisation_max_chars for
  consistent snake_case naming throughout the module.

Fixes: PR #9663
HAL9000 force-pushed pr-fix/9663-hot-warm-cold-tier-reliability from 03e34b035e
All checks were successful
CI / lint (pull_request) Successful in 1m11s
CI / typecheck (pull_request) Successful in 1m28s
CI / security (pull_request) Successful in 1m27s
CI / push-validation (pull_request) Successful in 45s
CI / helm (pull_request) Successful in 51s
CI / build (pull_request) Successful in 1m10s
CI / quality (pull_request) Successful in 1m41s
CI / integration_tests (pull_request) Successful in 7m31s
CI / unit_tests (pull_request) Successful in 9m12s
CI / docker (pull_request) Successful in 1m47s
CI / coverage (pull_request) Successful in 14m21s
CI / status-check (pull_request) Successful in 3s
to 5031170a71
All checks were successful
CI / helm (pull_request) Successful in 35s
CI / lint (pull_request) Successful in 44s
CI / build (pull_request) Successful in 40s
CI / typecheck (pull_request) Successful in 58s
CI / quality (pull_request) Successful in 58s
CI / security (pull_request) Successful in 1m14s
CI / integration_tests (pull_request) Successful in 3m37s
CI / push-validation (pull_request) Successful in 20s
CI / unit_tests (pull_request) Successful in 6m40s
CI / docker (pull_request) Successful in 1m31s
CI / coverage (pull_request) Successful in 10m55s
CI / status-check (pull_request) Successful in 4s
2026-05-28 13:57:27 +00:00
Compare
HAL9000 force-pushed pr-fix/9663-hot-warm-cold-tier-reliability from 5031170a71
All checks were successful
CI / helm (pull_request) Successful in 35s
CI / lint (pull_request) Successful in 44s
CI / build (pull_request) Successful in 40s
CI / typecheck (pull_request) Successful in 58s
CI / quality (pull_request) Successful in 58s
CI / security (pull_request) Successful in 1m14s
CI / integration_tests (pull_request) Successful in 3m37s
CI / push-validation (pull_request) Successful in 20s
CI / unit_tests (pull_request) Successful in 6m40s
CI / docker (pull_request) Successful in 1m31s
CI / coverage (pull_request) Successful in 10m55s
CI / status-check (pull_request) Successful in 4s
to 68fa0479c0
All checks were successful
CI / push-validation (pull_request) Successful in 24s
CI / lint (pull_request) Successful in 33s
CI / helm (pull_request) Successful in 33s
CI / build (pull_request) Successful in 33s
CI / typecheck (pull_request) Successful in 58s
CI / quality (pull_request) Successful in 59s
CI / security (pull_request) Successful in 1m14s
CI / integration_tests (pull_request) Successful in 3m51s
CI / unit_tests (pull_request) Successful in 4m32s
CI / docker (pull_request) Successful in 1m25s
CI / coverage (pull_request) Successful in 11m9s
CI / status-check (pull_request) Successful in 3s
2026-05-28 19:50:48 +00:00
Compare
fix(context_tiers): guard cold→warm promotion against immediate eviction
All checks were successful
CI / push-validation (pull_request) Successful in 19s
CI / helm (pull_request) Successful in 36s
CI / lint (pull_request) Successful in 47s
CI / build (pull_request) Successful in 45s
CI / quality (pull_request) Successful in 48s
CI / security (pull_request) Successful in 1m19s
CI / typecheck (pull_request) Successful in 1m28s
CI / integration_tests (pull_request) Successful in 2m56s
CI / unit_tests (pull_request) Successful in 6m8s
CI / docker (pull_request) Successful in 1m34s
CI / coverage (pull_request) Successful in 12m57s
CI / status-check (pull_request) Successful in 3s
0fee675807
Mirror the warm→hot budget-fallback pattern for the cold→warm promotion
path in ContextTierService.promote(). When _enforce_warm_capacity()
evicts the just-promoted fragment (because its last_accessed timestamp
predates all existing warm entries), restore it with _touch() so it
carries a current timestamp and is not immediately re-evicted.

Without this guard the fragment is silently lost from all stores while
the caller receives a stale handle with tier=WARM — identical to the
data-loss scenario the hot-budget fallback was introduced to prevent.

ISSUES CLOSED: #9663
Author
Owner

Claimed by merge_drive.py (pid 1264876) until 2026-05-28T23:01:17.275801+00:00.

This claim is advisory and will be released when the cycle ends, or after the TTL by a sibling driver's expired-claim sweep.

<!-- merge_drive.py: claim --> Claimed by `merge_drive.py` (pid 1264876) until `2026-05-28T23:01:17.275801+00:00`. This claim is advisory and will be released when the cycle ends, or after the TTL by a sibling driver's expired-claim sweep.
HAL9001 approved these changes 2026-05-28 21:31:21 +00:00
HAL9001 left a comment

Approved by the controller reviewer stage (workflow 17).

Approved by the controller reviewer stage (workflow 17).
HAL9000 merged commit 1566917c40 into master 2026-05-28 21:31:23 +00:00
Author
Owner

event occurred 2026-05-28T13:54:29.105366+00:00

🌱 Grooming: proceed — PR cleared for processing.

(check no_duplicates, category no_duplicates)

PR #11238 is a targeted hardening fix for ContextTierService that combines four distinct reliability improvements: removing dead defaults, enforcing warm-tier capacity with LRU eviction, returning immutable fragment snapshots, and naming consistency. No open PR duplicates this specific combination. Related ACMS PRs (#9663, #10783, #11096) address original feature implementations or different constraints (budget enforcement, path matching, pipeline wiring); none implement the warm-capacity enforcement or immutability patterns central to #11238.

*event occurred 2026-05-28T13:54:29.105366+00:00* **🌱 Grooming: proceed** — PR cleared for processing. (check `no_duplicates`, category `no_duplicates`) PR #11238 is a targeted hardening fix for ContextTierService that combines four distinct reliability improvements: removing dead defaults, enforcing warm-tier capacity with LRU eviction, returning immutable fragment snapshots, and naming consistency. No open PR duplicates this specific combination. Related ACMS PRs (#9663, #10783, #11096) address original feature implementations or different constraints (budget enforcement, path matching, pipeline wiring); none implement the warm-capacity enforcement or immutability patterns central to #11238. <!-- controller:fingerprint:2facce9b80eaf0e6 -->
Author
Owner

event occurred 2026-05-28T13:57:04.956938+00:00

📋 Estimate: tier 1.

Single-file change (+41/-18) but involves non-trivial new logic: a new LRU eviction method (_enforce_warm_capacity()), changed return semantics for two public methods (mutable refs → model_copy() immutable snapshots), and modifications to the promote() call chain. Thread-safety reasoning required to verify correctness of RLock usage around the new eviction path. Not mechanical — standard engineering work warrants tier 1.

*event occurred 2026-05-28T13:57:04.956938+00:00* **📋 Estimate: tier 1.** Single-file change (+41/-18) but involves non-trivial new logic: a new LRU eviction method (_enforce_warm_capacity()), changed return semantics for two public methods (mutable refs → model_copy() immutable snapshots), and modifications to the promote() call chain. Thread-safety reasoning required to verify correctness of RLock usage around the new eviction path. Not mechanical — standard engineering work warrants tier 1. <!-- controller:fingerprint:58129e5477b1d935 -->
Author
Owner

(attempt #3, tier 1)

event occurred 2026-05-28T13:58:30.893624+00:00

🔧 Implementer attempt — rebased.

Pushed 1 commit: 5031170.

_(attempt #3, tier 1)_ *event occurred 2026-05-28T13:58:30.893624+00:00* **🔧 Implementer attempt — `rebased`.** Pushed 1 commit: `5031170`. <!-- controller:fingerprint:2e6f7f8ed885a31f -->
Author
Owner

event occurred 2026-05-28T19:46:37.111234+00:00

🌱 Grooming: proceed — PR cleared for processing.

(check no_duplicates, category no_duplicates)

Searched open PRs for topical match on ContextTierService hardening. Found related feature PR #9663 (hot/warm/cold tier implementation) but it is the original feature, not a duplicate of this fix. Other ACMS PRs address path matching, pipeline wiring, budget enforcement, and indexing—none overlap with ContextTierService production-reliability hardening. No duplicate detected.

*event occurred 2026-05-28T19:46:37.111234+00:00* **🌱 Grooming: proceed** — PR cleared for processing. (check `no_duplicates`, category `no_duplicates`) Searched open PRs for topical match on `ContextTierService` hardening. Found related feature PR #9663 (hot/warm/cold tier implementation) but it is the original feature, not a duplicate of this fix. Other ACMS PRs address path matching, pipeline wiring, budget enforcement, and indexing—none overlap with ContextTierService production-reliability hardening. No duplicate detected. <!-- controller:fingerprint:677e36505ea54fb7 -->
Author
Owner

event occurred 2026-05-28T19:50:33.790742+00:00

📋 Estimate: tier 1.

Single-file change (+41/-18) to ContextTierService, but contains non-trivial logic: a new _enforce_warm_capacity() method with count-based LRU eviction logic requiring correctness reasoning under concurrent access (RLock), and immutable snapshot returns via model_copy() with thread-safety implications. Dead code removal and constant rename are mechanical, but the eviction and snapshot components involve concurrency semantics that exceed the reliable envelope of tier 0. CI is fully green. Tier 1 is the appropriate default for this class of focused-but-logic-heavy single-file work.

*event occurred 2026-05-28T19:50:33.790742+00:00* **📋 Estimate: tier 1.** Single-file change (+41/-18) to ContextTierService, but contains non-trivial logic: a new _enforce_warm_capacity() method with count-based LRU eviction logic requiring correctness reasoning under concurrent access (RLock), and immutable snapshot returns via model_copy() with thread-safety implications. Dead code removal and constant rename are mechanical, but the eviction and snapshot components involve concurrency semantics that exceed the reliable envelope of tier 0. CI is fully green. Tier 1 is the appropriate default for this class of focused-but-logic-heavy single-file work. <!-- controller:fingerprint:0f37b7632bad7406 -->
Author
Owner

(attempt #7, tier 1)

event occurred 2026-05-28T19:51:15.657461+00:00

🔧 Implementer attempt — rebased.

Pushed 1 commit: 68fa047.

_(attempt #7, tier 1)_ *event occurred 2026-05-28T19:51:15.657461+00:00* **🔧 Implementer attempt — `rebased`.** Pushed 1 commit: `68fa047`. <!-- controller:fingerprint:5500a9b427dc6be9 -->
Author
Owner

event occurred 2026-05-28T20:15:50.347667+00:00

🔴 Changes requested

Confidence: high.

Blocking issues (1):

  • [error] src/cleveragents/application/services/context_tiers.py:295-307 — Lines 295-307 (cold→warm branch of promote):

    frag = self._cold.pop(fragment_id)
    promoted = frag.model_copy(update={"tier": ContextTier.WARM})
    self._warm[fragment_id] = promoted
    self._emit_tier_event(...)

    Enforce warm-tier capacity after promotion.

    self._enforce_warm_capacity()
    return promoted

After _enforce_warm_capacity() runs it calls del self._warm[oldest_id] on the LRU warm fragment (lines 461). When the warm tier is at capacity and the just-promoted cold fragment has the oldest last_accessed value among all warm fragments (no _touch() is called during cold→warm promotion — compare to _touch in the _touch staticmethod at line 518), the newly added fragment is the eviction target. It was already removed from self._cold at line 296, so after the eviction it exists in no tier. Yet line 307 unconditionally returns promoted with .tier=ContextTier.WARM, giving the caller a stale handle to a fragment that is silently lost from all stores. The warm→hot path at lines 322-335 explicitly guards against this identical scenario with if fragment_id not in self._hot: followed by restoration + _enforce_warm_capacity(). The cold→warm path was not given the same protection.

  • Suggested fix: Mirror the hot-budget fallback pattern. After self._enforce_warm_capacity() at line 306, add: if fragment_id not in self._warm: promoted = self._touch(promoted); self._warm[fragment_id] = promoted; self._enforce_warm_capacity(). The _touch() call updates last_accessed to now so the restored fragment is not immediately re-evicted. Alternatively, call _touch() on promoted before inserting at line 298 (i.e. promoted = self._touch(frag.model_copy(update={"tier": ContextTier.WARM}))), ensuring the just-promoted fragment is never the LRU immediately after promotion — matching the implicit access semantics of a promotion event.
*event occurred 2026-05-28T20:15:50.347667+00:00* **🔴 Changes requested** Confidence: high. **Blocking issues (1):** - [error] `src/cleveragents/application/services/context_tiers.py:295-307` — Lines 295-307 (cold→warm branch of `promote`): frag = self._cold.pop(fragment_id) promoted = frag.model_copy(update={"tier": ContextTier.WARM}) self._warm[fragment_id] = promoted self._emit_tier_event(...) # Enforce warm-tier capacity after promotion. self._enforce_warm_capacity() return promoted After `_enforce_warm_capacity()` runs it calls `del self._warm[oldest_id]` on the LRU warm fragment (lines 461). When the warm tier is at capacity and the just-promoted cold fragment has the oldest `last_accessed` value among all warm fragments (no `_touch()` is called during cold→warm promotion — compare to `_touch` in the `_touch` staticmethod at line 518), the newly added fragment is the eviction target. It was already removed from `self._cold` at line 296, so after the eviction it exists in no tier. Yet line 307 unconditionally returns `promoted` with `.tier=ContextTier.WARM`, giving the caller a stale handle to a fragment that is silently lost from all stores. The warm→hot path at lines 322-335 explicitly guards against this identical scenario with `if fragment_id not in self._hot:` followed by restoration + `_enforce_warm_capacity()`. The cold→warm path was not given the same protection. - _Suggested fix:_ Mirror the hot-budget fallback pattern. After `self._enforce_warm_capacity()` at line 306, add: `if fragment_id not in self._warm: promoted = self._touch(promoted); self._warm[fragment_id] = promoted; self._enforce_warm_capacity()`. The `_touch()` call updates `last_accessed` to now so the restored fragment is not immediately re-evicted. Alternatively, call `_touch()` on `promoted` before inserting at line 298 (i.e. `promoted = self._touch(frag.model_copy(update={"tier": ContextTier.WARM}))`), ensuring the just-promoted fragment is never the LRU immediately after promotion — matching the implicit access semantics of a promotion event. <!-- controller:fingerprint:47905ef5395e7138 -->
Author
Owner

(attempt #9, tier 1)

event occurred 2026-05-28T20:55:06.953674+00:00

🔧 Implementer attempt — resolved.

Pushed 1 commit: 0fee675.

Files touched: src/cleveragents/application/services/context_tiers.py.

_(attempt #9, tier 1)_ *event occurred 2026-05-28T20:55:06.953674+00:00* **🔧 Implementer attempt — `resolved`.** Pushed 1 commit: `0fee675`. Files touched: `src/cleveragents/application/services/context_tiers.py`. <!-- controller:fingerprint:d55e7261d0098740 -->
Author
Owner

event occurred 2026-05-28T21:24:46.362843+00:00

Approved

Reviewed at commit 0fee675.

Confidence: high.

*event occurred 2026-05-28T21:24:46.362843+00:00* **✅ Approved** Reviewed at commit `0fee675`. Confidence: high. <!-- controller:fingerprint:226ad8656743a723 -->
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core!11238
No description provided.