fix(plan): implement error recovery for execute phase per spec §35958 and §18323 #10843

Closed
opened 2026-04-23 13:37:28 +00:00 by hamza.khyari · 0 comments
Member

Summary

When plan execute is called on a plan in execute/errored state, the CLI skips re-execution because it only handles execute/queued. The user cannot recover from failures — the plan is stuck.

The spec defines two distinct recovery mechanisms depending on the failure type:

  1. Transient failures (API timeout, rate limit, auth error): §35958 defines the modify_config automation flag for "automatically retrying operations that fail due to transient errors." The plan should retry the same strategy — no strategy revision needed.

  2. Strategy constraint failures (infeasible approach, unresolvable constraints): §18323-18329 defines Execute → Strategize reversion where the strategy actor receives the execution actor's findings and produces a revised decision tree before re-entering Execute.

Metadata

  • Commit Message: fix(plan): implement error recovery for execute phase per spec §35958 and §18323
  • Branch: bugfix/reexecute-errored-plan

Current Behavior

  1. plan execute fails → plan enters execute/errored (terminal per §18356)
  2. User runs plan execute again
  3. CLI sees EXECUTE phase, checks for EXECUTE/QUEUED → skips because state is ERRORED
  4. Plan stays stuck, apply fails

Expected Behavior

For transient failures (§35958)

  1. CLI detects EXECUTE/ERRORED with a transient error type
  2. Resets plan to EXECUTE/QUEUED and re-executes with the same strategy
  3. Plan reaches execute/complete, apply succeeds

For strategy constraint failures (§18323-18329)

  1. CLI detects EXECUTE/ERRORED with a strategy constraint error
  2. Reverts plan to STRATEGIZE/QUEUED phase
  3. Strategy actor receives the error findings and produces a revised decision tree (§18329)
  4. Plan re-enters Execute with the revised strategy

Subtasks

  • Classify error type from plan.error_details — transient vs strategy constraint
  • Transient path (§35958): reset execute/erroredexecute/queued, clear error, re-run run_execute with same decisions
  • Strategy path (§18323): revert execute/erroredstrategize/queued, run run_strategize which receives prior error findings, then proceed to execute
  • Automation thresholds (modify_config, delete_content) gate automatic retry/reversion only. Explicit plan execute CLI invocations always proceed — the user's intent is clear. Documented in code comments.
  • Document both paths in code comments with spec references
  • Verify transient path with scenario-15 (network failure recovery) — manual test
  • Add Behave scenario for full transient retry lifecycle

Definition of Done

  • plan execute on a transient-errored plan retries successfully (§35958)
  • plan execute on a strategy-errored plan reverts to Strategize (§18323)
  • Automation thresholds documented (apply to automatic operations only)
  • 6 Behave scenarios pass
  • Scenario-15 passes (manual test)
  • M1 E2E passes (no regression)
  • Code comments reference spec sections

Spec Reference

  • §18356 — errored is terminal state for Execute phase
  • §18323-18329 — Execute → Strategize reversion for strategy constraint failures
  • §18330 — Reversion controlled by delete_content automation flag
  • §35958 — modify_config flag for transient error retry
## Summary When `plan execute` is called on a plan in `execute/errored` state, the CLI skips re-execution because it only handles `execute/queued`. The user cannot recover from failures — the plan is stuck. The spec defines two distinct recovery mechanisms depending on the failure type: 1. **Transient failures** (API timeout, rate limit, auth error): §35958 defines the `modify_config` automation flag for *"automatically retrying operations that fail due to transient errors."* The plan should retry the same strategy — no strategy revision needed. 2. **Strategy constraint failures** (infeasible approach, unresolvable constraints): §18323-18329 defines **Execute → Strategize reversion** where the strategy actor receives the execution actor's findings and produces a revised decision tree before re-entering Execute. ## Metadata - **Commit Message**: `fix(plan): implement error recovery for execute phase per spec §35958 and §18323` - **Branch**: `bugfix/reexecute-errored-plan` ## Current Behavior 1. `plan execute` fails → plan enters `execute/errored` (terminal per §18356) 2. User runs `plan execute` again 3. CLI sees `EXECUTE` phase, checks for `EXECUTE/QUEUED` → skips because state is `ERRORED` 4. Plan stays stuck, apply fails ## Expected Behavior ### For transient failures (§35958) 1. CLI detects `EXECUTE/ERRORED` with a transient error type 2. Resets plan to `EXECUTE/QUEUED` and re-executes with the same strategy 3. Plan reaches `execute/complete`, apply succeeds ### For strategy constraint failures (§18323-18329) 1. CLI detects `EXECUTE/ERRORED` with a strategy constraint error 2. Reverts plan to `STRATEGIZE/QUEUED` phase 3. Strategy actor receives the error findings and produces a revised decision tree (§18329) 4. Plan re-enters Execute with the revised strategy ## Subtasks - [x] Classify error type from `plan.error_details` — transient vs strategy constraint - [x] **Transient path (§35958)**: reset `execute/errored` → `execute/queued`, clear error, re-run `run_execute` with same decisions - [x] **Strategy path (§18323)**: revert `execute/errored` → `strategize/queued`, run `run_strategize` which receives prior error findings, then proceed to execute - [x] Automation thresholds (`modify_config`, `delete_content`) gate *automatic* retry/reversion only. Explicit `plan execute` CLI invocations always proceed — the user's intent is clear. Documented in code comments. - [x] Document both paths in code comments with spec references - [x] Verify transient path with scenario-15 (network failure recovery) — manual test - [x] Add Behave scenario for full transient retry lifecycle ## Definition of Done - `plan execute` on a transient-errored plan retries successfully (§35958) - `plan execute` on a strategy-errored plan reverts to Strategize (§18323) - Automation thresholds documented (apply to automatic operations only) - 6 Behave scenarios pass - Scenario-15 passes (manual test) - M1 E2E passes (no regression) - Code comments reference spec sections ## Spec Reference - §18356 — `errored` is terminal state for Execute phase - §18323-18329 — Execute → Strategize reversion for strategy constraint failures - §18330 — Reversion controlled by `delete_content` automation flag - §35958 — `modify_config` flag for transient error retry
hamza.khyari added this to the v3.5.0 milestone 2026-04-23 13:37:28 +00:00
hamza.khyari changed title from fix(plan): reset errored plan to queued on re-execute to fix(plan): reset errored plan to queued on re-execute for transient failures 2026-04-23 21:42:07 +00:00
hamza.khyari changed title from fix(plan): reset errored plan to queued on re-execute for transient failures to fix(plan): implement error recovery for execute phase per spec §35958 and §18323 2026-04-23 21:44:58 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#10843
No description provided.