fix(acms): harden hot/warm/cold tier service reliability #11238
Reference
cleveragents/cleveragents-core!11238
Branch: pr-fix/9663-hot-warm-cold-tier-reliability
Summary
Harden the ContextTierService for production reliability under concurrent plan execution and high-throughput scenarios.
Changes
Remove dead conflicting defaults: Deleted _DEFAULT_MAX_TOKENS_HOT, _DEFAULT_MAX_DECISIONS_WARM, and _DEFAULT_MAX_DECISIONS_COLD, which were never used (all budget flows through budget_from_settings() in context_tier_settings.py) and contradicted the canonical values.
Warm-tier capacity enforcement: Added _enforce_warm_capacity(), a count-based LRU eviction of the warm tier when max_decisions_warm is exceeded. Triggered after: […] promote())
Immutable snapshot returns: get_all_fragments() and get_hot_fragments() now return model_copy() instances instead of live mutable references, preventing state corruption from external mutation while the RLock is held.
Naming consistency: _COLD_SUMMARY_MAX_CHARS renamed to _default_summarisation_max_chars, following the snake_case convention used throughout the module.
Impact
[…] TieredFragment).
[…] max_decisions_warm capacity after promotions.
03e34b035e 5031170a71 5031170a71 68fa0479c0
Claimed by merge_drive.py (pid 1264876) until 2026-05-28T23:01:17.275801+00:00. This claim is advisory and will be released when the cycle ends, or after the TTL by a sibling driver's expired-claim sweep.
Approved by the controller reviewer stage (workflow 17).
event occurred 2026-05-28T13:54:29.105366+00:00
🌱 Grooming: proceed — PR cleared for processing.
(check no_duplicates, category no_duplicates)
PR #11238 is a targeted hardening fix for ContextTierService that combines four distinct reliability improvements: removing dead defaults, enforcing warm-tier capacity with LRU eviction, returning immutable fragment snapshots, and naming consistency. No open PR duplicates this specific combination. Related ACMS PRs (#9663, #10783, #11096) address original feature implementations or different constraints (budget enforcement, path matching, pipeline wiring); none implement the warm-capacity enforcement or immutability patterns central to #11238.
event occurred 2026-05-28T13:57:04.956938+00:00
📋 Estimate: tier 1.
Single-file change (+41/-18) but involves non-trivial new logic: a new LRU eviction method (_enforce_warm_capacity()), changed return semantics for two public methods (mutable refs → model_copy() immutable snapshots), and modifications to the promote() call chain. Thread-safety reasoning required to verify correctness of RLock usage around the new eviction path. Not mechanical — standard engineering work warrants tier 1.
(attempt #3, tier 1)
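The count-based LRU eviction discussed in this estimate can be sketched roughly as follows. This is a minimal illustration, not the code in context_tiers.py: the WarmTier and Fragment classes, the add() and ids() helpers, and the integer logical clock are hypothetical stand-ins. Only the names _enforce_warm_capacity, max_decisions_warm, and the use of an RLock come from the PR discussion.

```python
import threading
from dataclasses import dataclass, field
from itertools import count

# Deterministic logical clock standing in for real access timestamps (assumption).
_clock = count()


@dataclass
class Fragment:
    fragment_id: str
    last_accessed: int = field(default_factory=lambda: next(_clock))


class WarmTier:
    """Hypothetical stand-in for the warm store inside ContextTierService."""

    def __init__(self, max_decisions_warm: int) -> None:
        self._lock = threading.RLock()
        self._warm = {}
        self._max = max_decisions_warm

    def add(self, frag: Fragment) -> None:
        with self._lock:
            self._warm[frag.fragment_id] = frag
            # Safe to call while holding the lock because RLock is re-entrant.
            self._enforce_warm_capacity()

    def _enforce_warm_capacity(self) -> list:
        """Evict least-recently-accessed fragments until the count fits."""
        evicted = []
        with self._lock:
            while len(self._warm) > self._max:
                oldest_id = min(
                    self._warm, key=lambda k: self._warm[k].last_accessed
                )
                del self._warm[oldest_id]
                evicted.append(oldest_id)
        return evicted

    def ids(self) -> list:
        with self._lock:
            return sorted(self._warm)


tier = WarmTier(max_decisions_warm=2)
for name in ("a", "b", "c"):
    tier.add(Fragment(name))
print(tier.ids())  # ['b', 'c']: "a" had the oldest last_accessed value
```

Because the eviction helper re-acquires the same lock its caller already holds, a plain Lock would deadlock here; that is the property the RLock reasoning in the estimate would need to confirm for the real eviction path.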
event occurred 2026-05-28T13:58:30.893624+00:00
🔧 Implementer attempt —
rebased. Pushed 1 commit: 5031170.
event occurred 2026-05-28T19:46:37.111234+00:00
🌱 Grooming: proceed — PR cleared for processing.
(check no_duplicates, category no_duplicates)
Searched open PRs for topical match on ContextTierService hardening. Found related feature PR #9663 (hot/warm/cold tier implementation) but it is the original feature, not a duplicate of this fix. Other ACMS PRs address path matching, pipeline wiring, budget enforcement, and indexing; none overlap with ContextTierService production-reliability hardening. No duplicate detected.
event occurred 2026-05-28T19:50:33.790742+00:00
📋 Estimate: tier 1.
Single-file change (+41/-18) to ContextTierService, but contains non-trivial logic: a new _enforce_warm_capacity() method with count-based LRU eviction logic requiring correctness reasoning under concurrent access (RLock), and immutable snapshot returns via model_copy() with thread-safety implications. Dead code removal and constant rename are mechanical, but the eviction and snapshot components involve concurrency semantics that exceed the reliable envelope of tier 0. CI is fully green. Tier 1 is the appropriate default for this class of focused-but-logic-heavy single-file work.
(attempt #7, tier 1)
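The snapshot-return pattern mentioned in this estimate might look like the sketch below. To keep it dependency-free it uses dataclasses.replace() as a stand-in for Pydantic's model_copy(); the Store and Fragment classes and the put() helper are hypothetical, and only the get_hot_fragments name and the RLock come from the PR.

```python
import threading
from dataclasses import dataclass, replace


@dataclass
class Fragment:
    fragment_id: str
    content: str


class Store:
    """Hypothetical service holding fragments behind a re-entrant lock."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self._hot = {}

    def put(self, frag: Fragment) -> None:
        with self._lock:
            self._hot[frag.fragment_id] = frag

    def get_hot_fragments(self) -> list:
        with self._lock:
            # replace(f) with no changes builds a new instance with the same
            # field values (a shallow copy), so callers never hold the live
            # object stored in self._hot.
            return [replace(f) for f in self._hot.values()]


store = Store()
store.put(Fragment("f1", "original"))

snapshot = store.get_hot_fragments()
snapshot[0].content = "mutated by caller"

# The stored fragment is unaffected by the caller's mutation.
print(store.get_hot_fragments()[0].content)  # original
```

Both replace() and model_copy() with its default arguments make shallow copies, so nested mutable fields would still be shared with the stored object; whether that matters for TieredFragment depends on its field types, which are not shown in this PR.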
event occurred 2026-05-28T19:51:15.657461+00:00
🔧 Implementer attempt —
rebased. Pushed 1 commit: 68fa047.
event occurred 2026-05-28T20:15:50.347667+00:00
🔴 Changes requested
Confidence: high.
Blocking issues (1):
[error] src/cleveragents/application/services/context_tiers.py:295-307
Lines 295-307 (cold→warm branch of promote):
    frag = self._cold.pop(fragment_id)
    promoted = frag.model_copy(update={"tier": ContextTier.WARM})
    self._warm[fragment_id] = promoted
    self._emit_tier_event(...)
    # Enforce warm-tier capacity after promotion.
    self._enforce_warm_capacity()
    return promoted
When _enforce_warm_capacity() runs, it calls del self._warm[oldest_id] on the LRU warm fragment (line 461). When the warm tier is at capacity and the just-promoted cold fragment has the oldest last_accessed value among all warm fragments (no _touch() is called during cold→warm promotion; compare the _touch staticmethod at line 518), the newly added fragment is the eviction target. It was already removed from self._cold at line 296, so after the eviction it exists in no tier. Yet line 307 unconditionally returns promoted with .tier=ContextTier.WARM, giving the caller a stale handle to a fragment that is silently lost from all stores. The warm→hot path at lines 322-335 explicitly guards against this identical scenario with if fragment_id not in self._hot: followed by restoration + _enforce_warm_capacity(). The cold→warm path was not given the same protection.
After self._enforce_warm_capacity() at line 306, add: if fragment_id not in self._warm: promoted = self._touch(promoted); self._warm[fragment_id] = promoted; self._enforce_warm_capacity(). The _touch() call updates last_accessed to now so the restored fragment is not immediately re-evicted. Alternatively, call _touch() on promoted before inserting at line 298 (i.e. promoted = self._touch(frag.model_copy(update={"tier": ContextTier.WARM}))), ensuring the just-promoted fragment is never the LRU immediately after promotion, matching the implicit access semantics of a promotion event.
(attempt #9, tier 1)
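The failure mode described in this review, and the touch-before-insert alternative it suggests, can be reproduced in a small model. This is a sketch under stated assumptions: the Tiers and Fragment classes, the touch flag, the run() helper, and the integer logical clock are hypothetical, and locking and event emission are omitted. It is not the code that was pushed in response.

```python
from dataclasses import dataclass, replace
from itertools import count

# Logical clock standing in for real timestamps (assumption).
_clock = count(1)


@dataclass(frozen=True)
class Fragment:
    fragment_id: str
    tier: str
    last_accessed: int


class Tiers:
    """Hypothetical two-tier model of the cold→warm promotion path."""

    def __init__(self, max_warm: int) -> None:
        self.cold = {}
        self.warm = {}
        self.max_warm = max_warm

    def _enforce_warm_capacity(self) -> None:
        while len(self.warm) > self.max_warm:
            oldest_id = min(self.warm, key=lambda k: self.warm[k].last_accessed)
            del self.warm[oldest_id]

    def promote(self, fragment_id: str, touch: bool) -> Fragment:
        frag = self.cold.pop(fragment_id)
        promoted = replace(frag, tier="warm")
        if touch:
            # Treat promotion as an access so the fragment is not the LRU.
            promoted = replace(promoted, last_accessed=next(_clock))
        self.warm[fragment_id] = promoted
        self._enforce_warm_capacity()
        return promoted


def run(touch: bool):
    tiers = Tiers(max_warm=1)
    # "old" was last accessed before "recent", so it is the LRU once promoted.
    tiers.cold["old"] = Fragment("old", "cold", next(_clock))
    tiers.warm["recent"] = Fragment("recent", "warm", next(_clock))
    tiers.promote("old", touch=touch)
    return sorted(tiers.warm), sorted(tiers.cold)


print(run(touch=False))  # (['recent'], []): "old" is in no tier at all
print(run(touch=True))   # (['old'], []): "recent" is evicted instead
```

Without the touch, the promoted fragment is popped from the cold store and then immediately evicted from the warm store, which is the "exists in no tier" state the review flags.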
event occurred 2026-05-28T20:55:06.953674+00:00
🔧 Implementer attempt —
resolved. Pushed 1 commit: 0fee675.
Files touched: src/cleveragents/application/services/context_tiers.py.
event occurred 2026-05-28T21:24:46.362843+00:00
✅ Approved
Reviewed at commit 0fee675.
Confidence: high.