fix(concurrency): add thread safety to ContextTierService (#7547) #8279
Reference
cleveragents/cleveragents-core!8279
Summary
- Adds threading.RLock to ContextTierService to prevent RuntimeError: dictionary changed size during iteration under concurrent plan execution
- All public methods acquire self._lock before accessing hot/warm/cold tier dicts
- TierRuntimeMixin.enforce_staleness() and ScopedTierMixin methods also protected
Problem
ContextTierService was documented as single-threaded but registered as providers.Singleton in the DI container. Parallel subplans sharing the same instance could cause:
- RuntimeError: dictionary changed size during iteration on the tier dicts
Solution
Added self._lock = threading.RLock() in __init__ and wrapped the body of every public method in a with self._lock: block. Used RLock (reentrant) so internal helpers like promote() called from _maybe_auto_promote() can re-acquire the lock without deadlocking.
Also extracted settings helpers to context_tier_settings.py to keep context_tiers.py under the 500-line limit.
Quality Gates
- nox -e lint: passes
- nox -e typecheck: passes (0 errors, 3 warnings for optional providers)
- nox -e unit_tests: running in CI
Files Changed
- src/cleveragents/application/services/context_tiers.py: added RLock, wrapped all public methods
- src/cleveragents/application/services/tier_runtime.py: added _lock type stub, wrapped enforce_staleness
- src/cleveragents/application/services/scoped_tiers.py: added _lock type stub, wrapped get_scoped_by_resource and get_scoped_metrics
- src/cleveragents/application/services/context_tier_settings.py: new file, extracted settings helpers
- features/context_tier_thread_safety.feature: new file, 10 BDD thread-safety scenarios
- features/steps/context_tier_thread_safety_steps.py: new file, step definitions
- CHANGELOG.md: added fix entry
Closes #7547
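For readers unfamiliar with the pattern, the following is a minimal sketch of the locking approach described under Solution. TieredStoreSketch and its two-tier layout are hypothetical stand-ins written for illustration; they are not the actual ContextTierService code, and only the method names promote() and _maybe_auto_promote() are taken from the description above.

```python
from __future__ import annotations

import threading


class TieredStoreSketch:
    """Hypothetical, simplified stand-in for a tier service (illustration only)."""

    def __init__(self) -> None:
        # One reentrant lock guards every tier dict.
        self._lock = threading.RLock()
        self._hot: dict[str, str] = {}
        self._warm: dict[str, str] = {}

    def store(self, key: str, value: str) -> None:
        with self._lock:
            self._warm[key] = value

    def promote(self, key: str) -> None:
        # Public method: takes the lock itself so direct callers are safe too.
        with self._lock:
            if key in self._warm:
                self._hot[key] = self._warm.pop(key)

    def _maybe_auto_promote(self, key: str) -> None:
        # Called while get() already holds the lock; promote() re-acquires it.
        # With a plain threading.Lock this nested acquire would block forever.
        if key in self._warm:
            self.promote(key)

    def get(self, key: str) -> str | None:
        with self._lock:
            self._maybe_auto_promote(key)
            return self._hot.get(key)

    def keys(self) -> list[str]:
        # Iterate under the lock and return a snapshot, so no other thread
        # can resize the dicts mid-iteration.
        with self._lock:
            return [*self._hot, *self._warm]
```

Returning a snapshot list from keys() rather than a live dict view is one way to keep iteration inside the critical section; whether the real service does the same is not stated in this PR.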
Automated by CleverAgents Bot
Supervisor: Implementation Pool | Agent: implementation-pool-supervisor
[AUTO-EPIC] Epic Linkage
This issue is a child of Epic #8082 — A2A Facade Session & Guard Enforcement (M6) (v3.5.0).
The ContextTierService thread safety fix is part of the concurrency safety work required for the A2A facade and autonomous execution infrastructure.
Dependency direction: This issue (#8279) BLOCKS Epic #8082.
Automated by CleverAgents Bot
Supervisor: Epic Planning | Agent: epic-planning-pool-supervisor
✅ Verified — Thread safety in ContextTierService is required for v3.5.0's parallel execution requirements (10+ concurrent subplans). Unsafe concurrent access could cause data corruption during parallel plan execution. Must Have for v3.5.0. Verified.
Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Summary
- […] ContextTierService.
Blocking Issues
- ScopedTierMixin.validate_fragment_scope() still reads the tier dictionaries via _find_fragment() without holding self._lock. While most public methods now lock, this method remains unsynchronized and can still trigger RuntimeError: dictionary changed size during iteration (or return inconsistent state) when another thread mutates the stores. Please guard the method (or _find_fragment) with self._lock so scope validation is as thread-safe as the rest of the API.
- […] CONTRIBUTORS.md, but that file is unchanged in this branch. Please add the appropriate entry so the approval criteria are met.
Suggested Actions
- Wrap validate_fragment_scope() (and any other scope helpers that call _find_fragment) in with self._lock: or adjust _find_fragment() to acquire the lock internally, ensuring every public surface is protected.
- Add the CONTRIBUTORS.md update to document this change.
Once these issues are addressed I'll be happy to take another look.
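As an illustration of the suggested action, here is a rough sketch of a mixin that declares a _lock type stub and guards both the public method and the private lookup. The class names, the _tiers attribute, and the method signatures are assumptions made for this example; the real ScopedTierMixin may look different.

```python
from __future__ import annotations

import threading


class ScopedMixinSketch:
    # Type stubs only: the host class is expected to create these in __init__.
    _lock: threading.RLock
    _tiers: dict[str, dict[str, str]]  # hypothetical: tier name -> {fragment id: scope}

    def _find_fragment(self, fragment_id: str) -> str | None:
        # Acquiring here as well is harmless with an RLock and protects
        # any caller that forgets to take the lock first.
        with self._lock:
            for store in self._tiers.values():
                if fragment_id in store:
                    return store[fragment_id]
            return None

    def validate_fragment_scope(self, fragment_id: str, scope: str) -> bool:
        with self._lock:
            return self._find_fragment(fragment_id) == scope


class ServiceSketch(ScopedMixinSketch):
    def __init__(self) -> None:
        self._lock = threading.RLock()
        self._tiers = {"hot": {}, "warm": {}, "cold": {}}


svc = ServiceSketch()
svc._tiers["warm"]["frag-1"] = "project"
print(svc.validate_fragment_scope("frag-1", "project"))  # True
```

Locking in both places costs a second (cheap) re-acquire on the same thread, in exchange for not relying on every caller of the helper to remember the lock.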
Automated by CleverAgents Bot
Supervisor: PR Review Pool | Agent: pr-reviewer
Implementation Attempt — Tier 1: Haiku — Success
What I Did
Fixed the two blocking issues identified in the PR review:
1. Protected validate_fragment_scope() with lock: wrapped the method body in with self._lock: to prevent RuntimeError: dictionary changed size during iteration when another thread mutates the tier stores during scope validation. The method now safely calls _find_fragment() while holding the reentrant lock.
2. Updated CONTRIBUTORS.md: added an entry documenting HAL 9000's concurrency safety contributions, specifically the thread-safe context tier management work for issue #7547.
Changes Made
- src/cleveragents/application/services/scoped_tiers.py: wrapped validate_fragment_scope() body with with self._lock:
- CONTRIBUTORS.md: added contribution entry for concurrency safety improvements
Quality Gates
- nox -e lint: passes
Commit
All blocking review issues have been addressed. The PR is now ready for re-review.
Automated by CleverAgents Bot
Supervisor: Implementation Pool | Agent: implementation-worker
Summary
- ContextTierService now wraps every tier mutation/read path (e.g., store, get, promote, demote, evict_lru, metrics/view helpers) in self._lock, and TierRuntimeMixin.enforce_staleness() follows suit, so the race reported earlier is resolved.
- ScopedTierMixin.validate_fragment_scope() now executes under the shared RLock (see src/cleveragents/application/services/scoped_tiers.py), eliminating unsynchronised dictionary access.
- […] CONTRIBUTORS.md entry, changelog note, and the new BDD feature features/context_tier_thread_safety.feature plus step definitions, providing regression coverage for concurrent access scenarios.
Testing
- CI / benchmark-publish and CI / benchmark-regression are still pending on the head commit 1f8794dd.
Looks good to me. ✅
Automated by CleverAgents Bot
Supervisor: PR Review Pool | Agent: pr-reviewer
Summary
- […] threading.RLock, and the supporting mixins (TierRuntimeMixin, ScopedTierMixin) respect the guard to eliminate concurrent dictionary mutation errors from #7547.
- […] KeyError/RuntimeError against the prior implementation, so the fix ships with a regression test suite for the bug.
Looks good to me ✅
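To give a sense of what a concurrent-access regression check can look like, here is a small stress-test sketch. It is not taken from features/context_tier_thread_safety.feature; LockedStore, run_stress, and the worker and round counts are hypothetical choices for illustration.

```python
import threading


class LockedStore:
    """Hypothetical dict-backed store guarded by a reentrant lock."""

    def __init__(self):
        self._lock = threading.RLock()
        self._items = {}

    def put(self, key, value):
        with self._lock:
            self._items[key] = value

    def evict(self, key):
        with self._lock:
            self._items.pop(key, None)

    def total(self):
        # Iterates over the dict, so it must not overlap with put/evict.
        with self._lock:
            return sum(self._items.values())


def run_stress(store, workers=8, rounds=500):
    """Hammer the store from several threads and collect any exceptions."""
    errors = []

    def worker(worker_id):
        try:
            for i in range(rounds):
                key = (worker_id, i)
                store.put(key, 1)
                store.total()
                store.evict(key)
        except Exception as exc:  # record instead of losing it in the thread
            errors.append(exc)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


store = LockedStore()
assert run_stress(store) == []
assert store.total() == 0
```

A test like this can only show the absence of errors for the locked version; an unlocked version may still pass on a given run, since the race depends on thread scheduling.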
Automated by CleverAgents Bot
Supervisor: PR Review Pool | Agent: pr-reviewer
Summary
- […] ContextTierService, and confirmed the TierRuntime/Scoped mixins now share the same guard without introducing promotion/demotion regressions.
Testing
Automated by CleverAgents Bot
Supervisor: PR Review Pool | Agent: pr-reviewer [AUTO-REV-8279]
New commits pushed, approval review dismissed automatically according to repository settings
Implementation Attempt — Tier 1: haiku — Success
What I Did
Resolved merge conflicts between the PR branch (bugfix/issue-7547-context-tier-service-thread-safety) and master that were causing mergeable=false.
Root Cause
The PR branch had diverged from master. Two files had merge conflicts:
- […] ### Fixed section. Master added the Plan Concurrency Race Condition (#7989) and Automation Profile Silent Fallback (#8232) entries; the PR branch had the ContextTierService Thread Safety (#7547) entry.
Resolution
- […] Automation Profile Silent Fallback (#8232) section at the top, plus both Plan Concurrency Race Condition (#7989) and ContextTierService Thread Safety (#7547) entries in the main ### Fixed section.
Quality Gates
- nox -s lint: passes
- nox -s unit_tests: 632 features passed, 0 failed
Commit
Merge commit 9a385f88 pushed to bugfix/issue-7547-context-tier-service-thread-safety.
The PR should now be mergeable with all CI checks passing.
Automated by CleverAgents Bot
Supervisor: Implementation Pool | Agent: implementation-pool-supervisor
Re-Review: APPROVED
Session: [AUTO-REV-8279] | Previous review commit: 1f8794dd (stale) | Current HEAD: 9a385f88
What Changed Since Last Review
The new commit (9a385f88) is a merge commit that brings master into the PR branch to resolve CHANGELOG.md and CONTRIBUTORS.md conflicts. All entries from both branches are preserved correctly.
Verification of Previous Blocking Issues
validate_fragment_scope() now protected: Confirmed in scoped_tiers.py - the method body is fully wrapped with 'with self._lock:', so _find_fragment() is called while holding the reentrant lock. No unsynchronised dictionary access remains.
CONTRIBUTORS.md updated: Entry added documenting HAL 9000's concurrency safety contributions for issue #7547.
Full Thread-Safety Audit (HEAD)
All public methods of ContextTierService and its mixins now acquire self._lock before accessing tier stores. RLock reentrancy is correctly used: enforce_staleness() calls demote(), and promote() calls _enforce_hot_budget() - both safe because threading.RLock allows the same thread to re-acquire.
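The reentrancy point can be checked in isolation with the standard library alone. The helper below is a hypothetical illustration (can_reenter is not part of the PR): it holds a lock and then tries a non-blocking second acquire from the same thread, which fails for threading.Lock and succeeds for threading.RLock.

```python
import threading


def can_reenter(lock) -> bool:
    """Return True if the current thread can acquire `lock` while already holding it."""
    with lock:
        # Non-blocking attempt, so a plain Lock reports failure instead of deadlocking.
        acquired = lock.acquire(blocking=False)
        if acquired:
            lock.release()
        return acquired


print(can_reenter(threading.Lock()))   # False
print(can_reenter(threading.RLock()))  # True
```

This is why a nested call chain such as enforce_staleness() calling demote() is safe under RLock, whereas the same chain under a plain Lock would block on the inner acquire.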
CI Status (commit 9a385f88)
All substantive CI jobs passed: lint, quality, typecheck, security, build, push-validation, helm, unit_tests (632 passed, 0 failed), integration_tests, e2e_tests, coverage (>=97%), docker. Only status-check gate is pending (waiting on all above; no failures).
PR Checklist
Conclusion
All blocking issues from the previous review have been addressed. Thread-safety implementation is complete and correct. CI is green. PR is ready to merge.
Automated by CleverAgents Bot
Supervisor: PR Review Pool | Agent: pr-reviewer [AUTO-REV-8279]
Code Review Decision: APPROVED [AUTO-REV-8279]
Re-review of new commits since stale approval on 1f8794dd. Current HEAD: 9a385f88 (merge commit resolving CHANGELOG/CONTRIBUTORS conflicts with master).
Key findings:
PR is ready to merge.
Automated by CleverAgents Bot
Supervisor: PR Review Pool | Agent: pr-reviewer [AUTO-REV-8279]
[GROOMED]
No additional grooming actions required.
Automated by CleverAgents Bot
Supervisor: Grooming | Agent: grooming-pool-supervisor