fix(acms): harden hot/warm/cold tier service reliability #11238

Merged
HAL9000 merged 2 commits from pr-fix/9663-hot-warm-cold-tier-reliability into master 2026-05-28 21:31:23 +00:00
Owner

Summary

Harden the ContextTierService for production reliability under concurrent plan execution and high-throughput scenario.

Changes

  1. Remove dead conflicting defaults: Deleted _DEFAULT_MAX_TOKENS_HOT, _DEFAULT_MAX_DECISIONS_WARM, _DEFAULT_MAX_DECISIONS_COLD which were never used (all budget flows through budget_from_settings() in context_tier_settings.py) and contradicted the canonical values.

  2. Warm-tier capacity enforcement: Added _enforce_warm_capacity() — count-based LRU eviction of the warm tier when max_decisions_warm is exceeded. Triggered after:

    • Cold → warm promotion (in promote())
    • Hot-budget-fallback restore to warm (when promoted fragment could't fit in hot)
  3. Immutable snapshot returns: get_all_fragments() and get_hot_fragments() now return model_copy() instances instead of live mutable references, preventing state corruption from external mutation while the RLock is held.

  4. Naming consistency: _COLD_SUMMARY_MAX_CHARS renamed to _default_summarisation_max_chars following snake_case convention used throughout the module.

Impact

  • No breaking API changes — return types and method signatures unchanged (copies still implement TieredFragment).
  • Warmer tiers no longer silently exceed max_decisions_warm capacity after promotions.
  • Thread-safety improved for all public reader methods.
## Summary Harden the `ContextTierService` for production reliability under concurrent plan execution and high-throughput scenario. ### Changes 1. **Remove dead conflicting defaults**: Deleted `_DEFAULT_MAX_TOKENS_HOT`, `_DEFAULT_MAX_DECISIONS_WARM`, `_DEFAULT_MAX_DECISIONS_COLD` which were never used (all budget flows through `budget_from_settings()` in `context_tier_settings.py`) and contradicted the canonical values. 2. **Warm-tier capacity enforcement**: Added `_enforce_warm_capacity()` — count-based LRU eviction of the warm tier when `max_decisions_warm` is exceeded. Triggered after: - Cold → warm promotion (in `promote()`) - Hot-budget-fallback restore to warm (when promoted fragment could't fit in hot) 3. **Immutable snapshot returns**: `get_all_fragments()` and `get_hot_fragments()` now return `model_copy()` instances instead of live mutable references, preventing state corruption from external mutation while the RLock is held. 4. **Naming consistency**: `_COLD_SUMMARY_MAX_CHARS` renamed to `_default_summarisation_max_chars` following snake_case convention used throughout the module. ### Impact - No breaking API changes — return types and method signatures unchanged (copies still implement `TieredFragment`). - Warmer tiers no longer silently exceed `max_decisions_warm` capacity after promotions. - Thread-safety improved for all public reader methods.
fix(acms): harden hot/warm/cold tier service reliability
All checks were successful
CI / lint (pull_request) Successful in 1m11s
CI / typecheck (pull_request) Successful in 1m28s
CI / security (pull_request) Successful in 1m27s
CI / push-validation (pull_request) Successful in 45s
CI / helm (pull_request) Successful in 51s
CI / build (pull_request) Successful in 1m10s
CI / quality (pull_request) Successful in 1m41s
CI / integration_tests (pull_request) Successful in 7m31s
CI / unit_tests (pull_request) Successful in 9m12s
CI / docker (pull_request) Successful in 1m47s
CI / coverage (pull_request) Successful in 14m21s
CI / status-check (pull_request) Successful in 3s
03e34b035e
- Remove unused conflicting _DEFAULT_* constants that conflicted with
  canonical defaults in context_tier_settings.py (which serves as the
  sole source of truth for budget/setting defaults).

- Add _enforce_warm_capacity() to enforce max_decisions_warm limit on
  warm tier after cold→warm promotion and hot-budget-fallback restore,
  preventing silent over-capacity data accumulation.

- Return deep copies (model_copy) from get_all_fragments() and
  get_hot_fragments() to prevent callers from mutating internal fragment
  state while the service holds its RLock under concurrent plan execution.

- Rename _COLD_SUMMARY_MAX_CHARS → _default_summarisation_max_chars for
  consistent snake_case naming throughout the module.

Fixes: PR #9663
HAL9000 force-pushed pr-fix/9663-hot-warm-cold-tier-reliability from 03e34b035e
All checks were successful
CI / lint (pull_request) Successful in 1m11s
CI / typecheck (pull_request) Successful in 1m28s
CI / security (pull_request) Successful in 1m27s
CI / push-validation (pull_request) Successful in 45s
CI / helm (pull_request) Successful in 51s
CI / build (pull_request) Successful in 1m10s
CI / quality (pull_request) Successful in 1m41s
CI / integration_tests (pull_request) Successful in 7m31s
CI / unit_tests (pull_request) Successful in 9m12s
CI / docker (pull_request) Successful in 1m47s
CI / coverage (pull_request) Successful in 14m21s
CI / status-check (pull_request) Successful in 3s
to 5031170a71
All checks were successful
CI / helm (pull_request) Successful in 35s
CI / lint (pull_request) Successful in 44s
CI / build (pull_request) Successful in 40s
CI / typecheck (pull_request) Successful in 58s
CI / quality (pull_request) Successful in 58s
CI / security (pull_request) Successful in 1m14s
CI / integration_tests (pull_request) Successful in 3m37s
CI / push-validation (pull_request) Successful in 20s
CI / unit_tests (pull_request) Successful in 6m40s
CI / docker (pull_request) Successful in 1m31s
CI / coverage (pull_request) Successful in 10m55s
CI / status-check (pull_request) Successful in 4s
2026-05-28 13:57:27 +00:00
Compare
HAL9000 force-pushed pr-fix/9663-hot-warm-cold-tier-reliability from 5031170a71
All checks were successful
CI / helm (pull_request) Successful in 35s
CI / lint (pull_request) Successful in 44s
CI / build (pull_request) Successful in 40s
CI / typecheck (pull_request) Successful in 58s
CI / quality (pull_request) Successful in 58s
CI / security (pull_request) Successful in 1m14s
CI / integration_tests (pull_request) Successful in 3m37s
CI / push-validation (pull_request) Successful in 20s
CI / unit_tests (pull_request) Successful in 6m40s
CI / docker (pull_request) Successful in 1m31s
CI / coverage (pull_request) Successful in 10m55s
CI / status-check (pull_request) Successful in 4s
to 68fa0479c0
All checks were successful
CI / push-validation (pull_request) Successful in 24s
CI / lint (pull_request) Successful in 33s
CI / helm (pull_request) Successful in 33s
CI / build (pull_request) Successful in 33s
CI / typecheck (pull_request) Successful in 58s
CI / quality (pull_request) Successful in 59s
CI / security (pull_request) Successful in 1m14s
CI / integration_tests (pull_request) Successful in 3m51s
CI / unit_tests (pull_request) Successful in 4m32s
CI / docker (pull_request) Successful in 1m25s
CI / coverage (pull_request) Successful in 11m9s
CI / status-check (pull_request) Successful in 3s
2026-05-28 19:50:48 +00:00
Compare
fix(context_tiers): guard cold→warm promotion against immediate eviction
All checks were successful
CI / push-validation (pull_request) Successful in 19s
CI / helm (pull_request) Successful in 36s
CI / lint (pull_request) Successful in 47s
CI / build (pull_request) Successful in 45s
CI / quality (pull_request) Successful in 48s
CI / security (pull_request) Successful in 1m19s
CI / typecheck (pull_request) Successful in 1m28s
CI / integration_tests (pull_request) Successful in 2m56s
CI / unit_tests (pull_request) Successful in 6m8s
CI / docker (pull_request) Successful in 1m34s
CI / coverage (pull_request) Successful in 12m57s
CI / status-check (pull_request) Successful in 3s
0fee675807
Mirror the warm→hot budget-fallback pattern for the cold→warm promotion
path in ContextTierService.promote(). When _enforce_warm_capacity()
evicts the just-promoted fragment (because its last_accessed timestamp
predates all existing warm entries), restore it with _touch() so it
carries a current timestamp and is not immediately re-evicted.

Without this guard the fragment is silently lost from all stores while
the caller receives a stale handle with tier=WARM — identical to the
data-loss scenario the hot-budget fallback was introduced to prevent.

ISSUES CLOSED: #9663
Author
Owner

Claimed by merge_drive.py (pid 1264876) until 2026-05-28T23:01:17.275801+00:00.

This claim is advisory and will be released when the cycle ends, or after the TTL by a sibling driver's expired-claim sweep.

<!-- merge_drive.py: claim --> Claimed by `merge_drive.py` (pid 1264876) until `2026-05-28T23:01:17.275801+00:00`. This claim is advisory and will be released when the cycle ends, or after the TTL by a sibling driver's expired-claim sweep.
HAL9001 approved these changes 2026-05-28 21:31:21 +00:00
HAL9001 left a comment

Approved by the controller reviewer stage (workflow 17).

Approved by the controller reviewer stage (workflow 17).
HAL9000 merged commit 1566917c40 into master 2026-05-28 21:31:23 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core!11238
No description provided.