bug(plan): plan stuck in strategize/errored with no recovery path after transient failure #2265

Closed
opened 2026-04-03 11:59:23 +00:00 by hamza.khyari · 1 comment
Member

Metadata

  • Commit Message: fix(plan): allow retry of errored plan phases
  • Branch: fix/plan-errored-retry

Background and Context

When agents plan execute <id> fails during the strategize phase (e.g. due to an invalid API key, network timeout, or rate limit), the plan transitions to strategize/errored and becomes permanently unrecoverable. The user must abandon the plan and create a new one via agents plan use, losing the plan ID.

Reproduction Steps

  1. agents action create --config action.yaml
  2. agents plan use local/my-action "local/my-project" — returns plan ID
  3. export ANTHROPIC_API_KEY=invalid-key
  4. agents plan execute <plan-id> — fails with AuthenticationError: 401
  5. export ANTHROPIC_API_KEY=valid-key
  6. agents plan execute <plan-id> — fails with Plan is not in an executable state (current: strategize/errored)

Current Behavior

The plan is permanently stuck. plan execute rejects it because the state is errored, not queued or processing. There is no plan retry, plan reset, or plan restart command. The only option is to create a new plan.

Expected Behavior

Users should be able to retry a plan that errored during strategize (or any phase where no changes have been applied). Options:

  1. agents plan execute <id> should detect strategize/errored and offer to retry
  2. Or a new agents plan retry <id> command that resets errored → queued for the current phase
  3. At minimum, strategize/errored should be retryable since no side effects have occurred

Acceptance Criteria

  • A plan in strategize/errored state can be retried without creating a new plan
  • The retry resets the phase state from errored to queued and re-executes
  • Plans in execute/errored can be retried (execution may have side effects — warn user)
  • Plans in apply/errored can be retried (apply may have partial side effects — warn user)
  • The error message when a plan is not executable suggests the retry command
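
The last criterion could be met with a guard along the following lines. This is only a sketch: `ensure_executable`, `PlanNotExecutableError`, and `EXECUTABLE_STATES` are hypothetical names, not taken from the codebase, and `agents plan retry` is the command proposed in this issue, which does not exist yet.

```python
# Hypothetical sketch; none of these names are confirmed to exist in the project.

# States that `plan execute` accepts, per "Current Behavior" in this issue.
EXECUTABLE_STATES = {"queued", "processing"}


class PlanNotExecutableError(Exception):
    """Assumed error type raised when a plan cannot be executed."""


def ensure_executable(plan_id: str, phase: str, state: str) -> None:
    """Raise if the plan cannot be executed, hinting at retry when errored."""
    if state in EXECUTABLE_STATES:
        return
    message = f"Plan is not in an executable state (current: {phase}/{state})"
    if state == "errored":
        # Point the user at the proposed retry command instead of a dead end.
        message += f". Run `agents plan retry {plan_id}` to retry the {phase} phase."
    raise PlanNotExecutableError(message)
```

The base message is kept identical to the one quoted in the reproduction steps, so anything matching on it would still work; only the hint is appended.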

Subtasks

  • Add retry_phase() method to PlanLifecycleService that resets errored → queued
  • Add agents plan retry <plan-id> CLI command (or extend plan execute to handle errored state)
  • Add safety warnings for retrying phases with potential side effects (execute, apply)
  • Tests (Behave): Scenarios for retry after transient failure in each phase
  • Update error message to suggest retry when plan is in errored state
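
One possible shape for the first and third subtasks is sketched below. The `Plan` dataclass, its fields, and the warning text are assumptions made for illustration; only the names `PlanLifecycleService` and `retry_phase()` come from this issue, and the real service almost certainly persists state rather than mutating an in-memory object.

```python
from dataclasses import dataclass

# Phases where a retry may repeat side effects, per the acceptance criteria.
PHASES_WITH_SIDE_EFFECTS = {"execute", "apply"}


@dataclass
class Plan:
    """Hypothetical stand-in for the real plan model."""
    plan_id: str
    phase: str   # e.g. "strategize", "execute", "apply"
    state: str   # e.g. "queued", "processing", "errored"


class PlanLifecycleService:
    def retry_phase(self, plan: Plan) -> list[str]:
        """Reset an errored phase to queued; return warnings for the caller to show."""
        if plan.state != "errored":
            raise ValueError(
                f"Only errored plans can be retried (current: {plan.phase}/{plan.state})"
            )
        warnings: list[str] = []
        if plan.phase in PHASES_WITH_SIDE_EFFECTS:
            warnings.append(
                f"Phase '{plan.phase}' may have partially applied changes; "
                "retrying could repeat them."
            )
        # The phase is unchanged; only the state moves from errored to queued.
        plan.state = "queued"
        return warnings
```

Returning the warnings instead of printing them leaves it to the CLI layer to decide whether to ask for confirmation before re-executing.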

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: High (confirmed) — Plans becoming permanently unrecoverable after transient failures is a significant UX and workflow issue. Users lose their plan ID and must start over, which is especially painful for long-running plans.
  • Milestone: Assigning to v3.5.0 — This is a core plan lifecycle issue that directly affects the Autonomy Hardening milestone's acceptance criteria ("Full autonomy acceptance flow with hierarchical decomposition"). A plan that can't recover from transient failures cannot complete an autonomous workflow.
  • MoSCoW: Should Have — The spec's plan lifecycle model should support recovery from transient errors, especially in the strategize phase where no side effects have occurred. This is important for production reliability.
  • Parent Epic: Likely #368 (Plan Lifecycle) or #394 (Plan Lifecycle CLI)

Well-documented bug report with clear reproduction steps and thoughtful acceptance criteria. The suggestion to add agents plan retry or extend plan execute to handle errored states is sound.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

freemo added this to the v3.5.0 milestone 2026-04-03 12:21:11 +00:00
hamza.khyari 2026-04-09 14:27:27 +00:00
Reference
cleveragents/cleveragents-core#2265