[AUTO-GUARD-1] PlanLifecycleService lock owner misuse allows concurrent plan transitions #8095

Closed
opened 2026-04-13 03:30:30 +00:00 by HAL9000 · 1 comment
Owner

Summary

PlanLifecycleService now acquires its lock from LockService with owner_id=plan_id (see PR #8067). Because LockService treats owner_id as the unique identity of the caller, every concurrent invocation for the same plan presents the same owner identifier. The built-in re-entrant logic therefore renews the existing lock instead of raising LockConflictError, so the new guard never prevents the race the PR set out to fix.
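The failure mode can be illustrated with a minimal sketch. ToyLockService below is a hypothetical in-memory stand-in, not the real LockService; it assumes only the behaviour described in this issue (the same owner renews, a different owner raises LockConflictError).

```python
class LockConflictError(Exception):
    """Raised when a different owner already holds the lock."""


class ToyLockService:
    """Hypothetical stand-in for LockService; the real implementation may differ."""

    def __init__(self):
        # Maps (resource_type, resource_id) -> owner_id of the current holder.
        self._holders = {}

    def acquire(self, owner_id, resource_type, resource_id):
        key = (resource_type, resource_id)
        holder = self._holders.get(key)
        if holder is not None and holder != owner_id:
            raise LockConflictError(f"{key} is already held by {holder}")
        # Free lock or same owner: grant or renew (re-entrant behaviour).
        self._holders[key] = owner_id
        return "renewed" if holder == owner_id else "acquired"


locks = ToyLockService()
plan_id = "plan-123"  # illustrative value

# Two "concurrent" callers both identify themselves as the plan itself.
first = locks.acquire(owner_id=plan_id, resource_type="plan", resource_id=plan_id)
second = locks.acquire(owner_id=plan_id, resource_type="plan", resource_id=plan_id)
print(first, second)  # the second call is treated as a renewal, not a conflict
```

Under these assumptions the second caller is indistinguishable from the first, so no conflict is ever raised.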

Impact

Concurrent CLI/worker sessions that operate on the same plan still run Execute/Apply in parallel. The critical section remains unprotected, so plan state can be corrupted despite the new locking hooks.

Evidence

  • src/cleveragents/application/services/plan_lifecycle_service.py (PR #8067 head): both execute_plan and apply_plan call self._lock_service.acquire(owner_id=plan_id, resource_type="plan", resource_id=plan_id).
  • LockService.acquire docstring requires owner_id to be the caller identity and explicitly renews when the same owner re-locks.
  • features/lock_service_coverage.feature exercises conflicts by giving different owners distinct identifiers, confirming owner identity is what triggers LockConflictError.

Recommendation

Inject and use a real caller identity when acquiring the lock (e.g. CLI session ID, settings.instance_id, or a generated UUID per invocation) so that concurrent sessions present different owners. Alternatively, extend execute_plan / apply_plan to take an owner_id argument and thread it through the CLI/worker entry points.
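A rough sketch of the recommendation, assuming a generated UUID per invocation and an optional owner_id argument. ToyLockService and the free-standing execute_plan function here are hypothetical stand-ins for illustration; the real PlanLifecycleService and LockService signatures may differ, and lock release is omitted.

```python
import uuid


class LockConflictError(Exception):
    """Raised when a different owner already holds the lock."""


class ToyLockService:
    """Hypothetical stand-in for LockService: same owner renews, others conflict."""

    def __init__(self):
        self._holders = {}

    def acquire(self, owner_id, resource_type, resource_id):
        key = (resource_type, resource_id)
        holder = self._holders.get(key)
        if holder is not None and holder != owner_id:
            raise LockConflictError(f"{key} is already held by {holder}")
        self._holders[key] = owner_id


def execute_plan(lock_service, plan_id, owner_id=None):
    """Hypothetical entry point: lock the plan under a per-caller identity."""
    # Fall back to a fresh UUID so each invocation presents a distinct owner.
    owner_id = owner_id or uuid.uuid4().hex
    lock_service.acquire(owner_id=owner_id, resource_type="plan", resource_id=plan_id)
    # ... the critical section would run here ...
    return owner_id


locks = ToyLockService()
first_owner = execute_plan(locks, "plan-123")

# A second invocation gets its own UUID and is now rejected.
try:
    execute_plan(locks, "plan-123")
    conflict = False
except LockConflictError:
    conflict = True

# The original caller can still re-enter by passing its own identity.
same_owner = execute_plan(locks, "plan-123", owner_id=first_owner)
```

Threading owner_id through the entry point keeps re-entrancy available to the session that holds the lock while making every other session a distinct owner.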

Owner

superseded by next cycle

Reference
cleveragents/cleveragents-core#8095