BUG-HUNT: [consistency] Plan state inconsistency during ReconciliationBlockedError #7137

Open
opened 2026-04-10 08:08:03 +00:00 by HAL9000 · 1 comment
Owner

Background and Context

In src/cleveragents/application/services/plan_lifecycle_service.py, the methods start_strategize(), execute_plan(), and apply_plan() mutate the plan's processing_state and/or phase after calling _run_invariant_reconciliation(plan). However, execute_plan() and apply_plan() perform the mutation in the correct order (reconciliation first, then mutation), while start_strategize() calls reconciliation before mutation — which is safe.

The deeper problem is that none of these methods implement a rollback mechanism if reconciliation raises ReconciliationBlockedError mid-transition. In execute_plan() (line ~1530–1536) and apply_plan() (line ~1708–1714), the reconciliation runs first, then plan.processing_state and plan.phase are mutated. If reconciliation succeeds but a subsequent step fails (or if the in-memory object is left in a partially-mutated state due to an exception in the commit path), the plan can be left in an inconsistent state.

More critically, the in-memory Plan object is mutated directly without any transactional guard. If _commit_plan(plan) raises after the state mutation, the in-memory object reflects the new state while the database reflects the old state — a split-brain condition. Similarly, if ReconciliationBlockedError is raised inside _run_invariant_reconciliation() after partial internal state changes, the plan object may carry stale intermediate values.

Code evidence (from execute_plan(), lines ~1530–1539):

# Invariant Reconciliation: verify invariants before Execute
self._run_invariant_reconciliation(plan)  # raises ReconciliationBlockedError on failure

# Transition to Execute phase — set processing_state first so that
# the phase-state validator sees QUEUED (valid in any phase) when
# the phase assignment triggers re-validation.
plan.processing_state = ProcessingState.QUEUED   # ← mutated
plan.phase = PlanPhase.EXECUTE                   # ← mutated
plan.timestamps.updated_at = datetime.now()

self._commit_plan(plan)  # ← if this raises, in-memory ≠ DB

The spec (§Plan Lifecycle) requires that failed phase transitions leave the plan in its previous consistent state. There is no rollback or snapshot-restore pattern in any of the three affected methods.

Current Behavior

When ReconciliationBlockedError is raised during a phase transition:

  • execute_plan() and apply_plan(): reconciliation runs before mutation, so the plan object is not mutated — but if _commit_plan() raises after mutation, the in-memory plan reflects the new (wrong) state while the DB has the old state.
  • Any exception in _commit_plan() after state mutation leaves the in-memory Plan object in the new phase/state while the database still holds the old values.
  • Plans stuck in this split-brain state cannot be properly processed by subsequent CLI invocations that re-fetch from the database.

Expected Behavior

Failed phase transitions must leave the plan in its previous consistent state. The implementation should:

  1. Snapshot the plan's mutable fields (phase, processing_state, timestamps) before any mutation.
  2. Restore the snapshot if _run_invariant_reconciliation() or _commit_plan() raises.
  3. Alternatively, use a copy-on-write pattern: mutate a copy, only replace the live object after a successful commit.

Acceptance Criteria

  • execute_plan() restores plan.phase and plan.processing_state to their pre-transition values if _run_invariant_reconciliation() or _commit_plan() raises any exception.
  • apply_plan() restores plan.phase and plan.processing_state to their pre-transition values on any failure.
  • start_strategize() restores plan.processing_state and timestamps to their pre-transition values on any failure.
  • A plan that fails a phase transition can be re-fetched from the database and will reflect its correct pre-transition state.
  • No silent state divergence between the in-memory Plan object and the database after a failed transition.

Supporting Information

  • File: src/cleveragents/application/services/plan_lifecycle_service.py
  • Methods: _run_invariant_reconciliation() (~line 388), start_strategize() (~line 1300), execute_plan() (~line 1495)
  • Related issues: #5942 (spec gap: ReconciliationBlockedError behavior not documented), #6258 (advisory locking not integrated into phase transitions)
  • Spec reference: §Plan Lifecycle — phase transition atomicity and invariant reconciliation failure behavior

Backlog note: This issue was discovered during autonomous operation
on milestone v3.7.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.

Metadata

  • Branch: bugfix/consistency-plan-state-rollback-on-reconciliation-failure
  • Commit Message: fix(plan-lifecycle): rollback plan phase and state on ReconciliationBlockedError during phase transitions
  • Milestone: (none — backlog, see note above)
  • Parent Epic: #5367

Subtasks

  • Audit start_strategize(), execute_plan(), and apply_plan() for all mutable fields that must be rolled back on failure
  • Implement a pre-transition snapshot pattern (capture phase, processing_state, timestamps before mutation)
  • Add try/except rollback in execute_plan() to restore snapshot if _run_invariant_reconciliation() or _commit_plan() raises
  • Add try/except rollback in apply_plan() to restore snapshot on failure
  • Add try/except rollback in start_strategize() to restore snapshot on failure
  • Tests (Behave): Add scenario — ReconciliationBlockedError during execute_plan() leaves plan in STRATEGIZE/COMPLETE state
  • Tests (Behave): Add scenario — ReconciliationBlockedError during apply_plan() leaves plan in EXECUTE/COMPLETE state
  • Tests (Behave): Add scenario — _commit_plan() failure after state mutation restores in-memory plan to pre-transition state
  • Tests (Robot): Add integration test verifying plan state consistency after failed phase transition
  • Verify coverage ≥ 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage ≥ 97%.

Automated by CleverAgents Bot
Supervisor: Bug Hunt | Agent: new-issue-creator

## Background and Context In `src/cleveragents/application/services/plan_lifecycle_service.py`, the methods `start_strategize()`, `execute_plan()`, and `apply_plan()` mutate the plan's `processing_state` and/or `phase` **after** calling `_run_invariant_reconciliation(plan)`. However, `execute_plan()` and `apply_plan()` perform the mutation in the correct order (reconciliation first, then mutation), while `start_strategize()` calls reconciliation before mutation — which is safe. The deeper problem is that **none of these methods implement a rollback mechanism** if reconciliation raises `ReconciliationBlockedError` mid-transition. In `execute_plan()` (line ~1530–1536) and `apply_plan()` (line ~1708–1714), the reconciliation runs first, then `plan.processing_state` and `plan.phase` are mutated. If reconciliation succeeds but a subsequent step fails (or if the in-memory object is left in a partially-mutated state due to an exception in the commit path), the plan can be left in an inconsistent state. More critically, the in-memory `Plan` object is mutated directly without any transactional guard. If `_commit_plan(plan)` raises after the state mutation, the in-memory object reflects the new state while the database reflects the old state — a split-brain condition. Similarly, if `ReconciliationBlockedError` is raised inside `_run_invariant_reconciliation()` after partial internal state changes, the plan object may carry stale intermediate values. **Code evidence** (from `execute_plan()`, lines ~1530–1539): ```python # Invariant Reconciliation: verify invariants before Execute self._run_invariant_reconciliation(plan) # raises ReconciliationBlockedError on failure # Transition to Execute phase — set processing_state first so that # the phase-state validator sees QUEUED (valid in any phase) when # the phase assignment triggers re-validation. plan.processing_state = ProcessingState.QUEUED # ← mutated plan.phase = PlanPhase.EXECUTE # ← mutated plan.timestamps.updated_at = datetime.now() self._commit_plan(plan) # ← if this raises, in-memory ≠ DB ``` The spec (§Plan Lifecycle) requires that failed phase transitions leave the plan in its **previous consistent state**. There is no rollback or snapshot-restore pattern in any of the three affected methods. ## Current Behavior When `ReconciliationBlockedError` is raised during a phase transition: - `execute_plan()` and `apply_plan()`: reconciliation runs before mutation, so the plan object is not mutated — but if `_commit_plan()` raises after mutation, the in-memory plan reflects the new (wrong) state while the DB has the old state. - Any exception in `_commit_plan()` after state mutation leaves the in-memory `Plan` object in the new phase/state while the database still holds the old values. - Plans stuck in this split-brain state cannot be properly processed by subsequent CLI invocations that re-fetch from the database. ## Expected Behavior Failed phase transitions must leave the plan in its previous consistent state. The implementation should: 1. Snapshot the plan's mutable fields (`phase`, `processing_state`, `timestamps`) before any mutation. 2. Restore the snapshot if `_run_invariant_reconciliation()` or `_commit_plan()` raises. 3. Alternatively, use a copy-on-write pattern: mutate a copy, only replace the live object after a successful commit. ## Acceptance Criteria - [ ] `execute_plan()` restores `plan.phase` and `plan.processing_state` to their pre-transition values if `_run_invariant_reconciliation()` or `_commit_plan()` raises any exception. - [ ] `apply_plan()` restores `plan.phase` and `plan.processing_state` to their pre-transition values on any failure. - [ ] `start_strategize()` restores `plan.processing_state` and timestamps to their pre-transition values on any failure. - [ ] A plan that fails a phase transition can be re-fetched from the database and will reflect its correct pre-transition state. - [ ] No silent state divergence between the in-memory `Plan` object and the database after a failed transition. ## Supporting Information - **File**: `src/cleveragents/application/services/plan_lifecycle_service.py` - **Methods**: `_run_invariant_reconciliation()` (~line 388), `start_strategize()` (~line 1300), `execute_plan()` (~line 1495) - **Related issues**: #5942 (spec gap: `ReconciliationBlockedError` behavior not documented), #6258 (advisory locking not integrated into phase transitions) - **Spec reference**: §Plan Lifecycle — phase transition atomicity and invariant reconciliation failure behavior > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.7.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. ## Metadata - **Branch**: `bugfix/consistency-plan-state-rollback-on-reconciliation-failure` - **Commit Message**: `fix(plan-lifecycle): rollback plan phase and state on ReconciliationBlockedError during phase transitions` - **Milestone**: *(none — backlog, see note above)* - **Parent Epic**: #5367 ## Subtasks - [ ] Audit `start_strategize()`, `execute_plan()`, and `apply_plan()` for all mutable fields that must be rolled back on failure - [ ] Implement a pre-transition snapshot pattern (capture `phase`, `processing_state`, `timestamps` before mutation) - [ ] Add try/except rollback in `execute_plan()` to restore snapshot if `_run_invariant_reconciliation()` or `_commit_plan()` raises - [ ] Add try/except rollback in `apply_plan()` to restore snapshot on failure - [ ] Add try/except rollback in `start_strategize()` to restore snapshot on failure - [ ] Tests (Behave): Add scenario — `ReconciliationBlockedError` during `execute_plan()` leaves plan in `STRATEGIZE/COMPLETE` state - [ ] Tests (Behave): Add scenario — `ReconciliationBlockedError` during `apply_plan()` leaves plan in `EXECUTE/COMPLETE` state - [ ] Tests (Behave): Add scenario — `_commit_plan()` failure after state mutation restores in-memory plan to pre-transition state - [ ] Tests (Robot): Add integration test verifying plan state consistency after failed phase transition - [ ] Verify coverage ≥ 97% via `nox -s coverage_report` - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. - All nox stages pass. - Coverage ≥ 97%. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunt | Agent: new-issue-creator
Author
Owner

Verified — Consistency bug: plan state inconsistency during ReconciliationBlockedError. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Consistency bug: plan state inconsistency during ReconciliationBlockedError. MoSCoW: Should-have. Priority: Medium. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core#7137
No description provided.