bug(plan): plan stuck in processing state after worker crash with no CLI recovery path #4372

Open
opened 2026-04-08 07:02:53 +00:00 by hurui200320 · 1 comment
Member

Metadata

  • Commit Message: fix(plan): add force-cancel/reset for plans stuck in processing state
  • Branch: bugfix/plan-stuck-processing-recovery

Background and Context

When agents plan execute is run and the worker process is killed mid-execution (e.g., timeout, OOM, kill -9, container restart), the plan is left permanently stuck in strategize/processing or execute/processing state. No existing CLI command can recover the plan from this zombie state.

This was discovered during end-to-end CLI testing (v1.0.0, Schema v3) while building a sample project. A background plan execute process was terminated, and the plan became irrecoverable via the CLI — only direct SQLite database manipulation (UPDATE plans SET processing_state='queued' WHERE ...) could reset it.

Related issue: #2265 covers a similar "no recovery path" scenario, but specifically for plans stuck in strategize/errored state after a transient API failure. That issue requests plan retry/plan reset for errored plans. This issue is complementary — it covers plans stuck in processing state (the process died before the plan could transition to errored), which is a distinct failure mode requiring a force-cancel or force-reset mechanism.

Current Behavior

  1. User runs agents plan execute <PLAN_ID> (or the plan enters any processing state during strategize/execute)
  2. The worker process is killed (timeout, crash, OOM, manual kill, etc.)
  3. The plan remains in strategize/processing (or execute/processing) indefinitely
  4. Every recovery attempt via the existing CLI commands fails:
    • agents plan execute <PLAN_ID> → refuses because plan is already in processing
    • agents plan status <PLAN_ID> → correctly reports processing state but offers no remediation
    • agents plan revert <PLAN_ID> → refuses same-phase reversion (e.g., STRATEGIZE→STRATEGIZE)
    • agents plan resume <PLAN_ID> → does not restart the worker; the plan stays stuck
  5. The only recovery is direct database manipulation: sqlite3 db.sqlite "UPDATE plans SET processing_state='queued' WHERE id='<PLAN_ID>'"

Expected Behavior

When a plan is stuck in processing state, the user should be able to recover it via the CLI without resorting to direct database manipulation. At minimum, one of:

  • agents plan cancel <PLAN_ID> — force-cancel the plan, transitioning it to a terminal cancelled state
  • agents plan reset <PLAN_ID> — force-reset the plan back to queued state so it can be re-executed
  • agents plan execute --force <PLAN_ID> — force-restart execution, resetting the processing state first
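A minimal sketch of what a guarded force-reset could look like at the database level, assuming the plans table and processing_state column from the manual workaround above. The force_reset helper and the phase column are hypothetical and are not the project's actual API or schema. The extra WHERE condition is the sanity check: only a plan that is currently in processing is touched.

```python
import sqlite3


def force_reset(conn: sqlite3.Connection, plan_id: str) -> bool:
    """Move a plan from 'processing' back to 'queued'.

    Hypothetical helper. Returns True only if a row was actually changed,
    so a plan in any other state (or an unknown ID) is left untouched.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE plans SET processing_state = 'queued' "
            "WHERE id = ? AND processing_state = 'processing'",
            (plan_id,),
        )
    return cur.rowcount == 1


# Demo against an in-memory database; the schema here is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plans (id TEXT PRIMARY KEY, phase TEXT, processing_state TEXT)"
)
conn.execute("INSERT INTO plans VALUES ('PLAN_A', 'strategize', 'processing')")
conn.execute("INSERT INTO plans VALUES ('PLAN_B', 'execute', 'errored')")
conn.commit()

print(force_reset(conn, "PLAN_A"))  # True: the stuck plan is reset
print(force_reset(conn, "PLAN_B"))  # False: not in processing, left alone
```

Unlike the raw UPDATE used as a workaround, the guard makes the operation safe to repeat and prevents it from overwriting an errored or completed plan.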

Additionally, the system should detect stale processing states on startup or via a health check (e.g., plans in processing for longer than a configurable timeout with no active worker process).
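One way the stale check could work, sketched under assumptions: the plan record is assumed to store when processing started and the PID of the worker that claimed it. Neither field is confirmed by this issue, and is_stale and worker_alive are hypothetical names. The liveness probe uses signal 0, which is POSIX-only; that matches the Linux environment reported here.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Optional


def worker_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence, delivers nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stale(
    started_at: datetime,
    worker_pid: Optional[int],
    timeout: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """A plan is stale if it has been processing longer than `timeout`
    and no live worker process is attached to it."""
    now = now or datetime.now(timezone.utc)
    if now - started_at < timeout:
        return False
    return worker_pid is None or not worker_alive(worker_pid)


# Example: processing for 10 minutes, no recorded worker, 5-minute timeout.
started = datetime.now(timezone.utc) - timedelta(minutes=10)
print(is_stale(started, None, timedelta(minutes=5)))
```

A PID check alone can be fooled by PID reuse or by a worker running in another container, so a real implementation would probably combine it with a heartbeat timestamp.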

How to Reproduce

Environment: CleverAgents CLI v1.0.0, Schema v3, Linux

# 1. Set up a project with an action (assumes init, resource, project, action are already configured)
cd /workspace
agents plan use local/some-action local/some-project --automation-profile trusted

# 2. Note the Plan ID from the output (e.g., 01KNNT406AFJZH3AHASW02C14Y)

# 3. Start execution in the background
agents plan execute 01KNNT406AFJZH3AHASW02C14Y &
EXEC_PID=$!

# 4. Wait a few seconds for the plan to enter processing state
sleep 5

# 5. Kill the worker process (simulating a crash)
kill -9 $EXEC_PID

# 6. Check plan status — it will be stuck in processing
agents plan status 01KNNT406AFJZH3AHASW02C14Y
# Expected output shows: Phase: strategize (or execute), State: processing

# 7. Attempt recovery — all fail:
agents plan execute 01KNNT406AFJZH3AHASW02C14Y    # refuses: already processing
agents plan revert 01KNNT406AFJZH3AHASW02C14Y     # refuses: same-phase reversion
agents plan resume 01KNNT406AFJZH3AHASW02C14Y     # no effect: plan stays stuck

# 8. Verify: the plan is permanently stuck with no CLI escape
agents plan status 01KNNT406AFJZH3AHASW02C14Y
# Still shows: processing

To verify the fix: After implementing the fix, repeat steps 1–6, then use the new recovery command (e.g., agents plan cancel or agents plan reset). The plan should transition out of processing state and the user should be able to re-execute or discard it.

Acceptance Criteria

  • A CLI command exists to recover a plan stuck in processing state (e.g., plan cancel --force, plan reset, or plan execute --force)
  • The recovery command transitions the plan to a valid state (queued, cancelled, or errored)
  • The recovery command works for both strategize/processing and execute/processing states
  • The recovery command does not corrupt plan data (decisions, changesets, checkpoints are preserved where applicable)
  • If a stale-processing detection mechanism is added (e.g., timeout-based), it is configurable and documented
  • The existing plan revert / plan resume commands are updated to handle the processing edge case, OR their error messages are updated to guide the user toward the correct recovery command
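The state and data-preservation criteria above can be pinned down as a small transition rule. This is only an illustration: the plan is modelled as a plain dict, and State, RECOVERY_TARGETS, and recover are hypothetical names rather than types from the codebase.

```python
from enum import Enum


class State(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    CANCELLED = "cancelled"
    ERRORED = "errored"


# The valid recovery targets named in the acceptance criteria.
RECOVERY_TARGETS = frozenset({State.QUEUED, State.CANCELLED, State.ERRORED})


def recover(plan: dict, target: State) -> dict:
    """Return a copy of `plan` moved out of processing.

    Only the state changes; the phase, decisions, changesets, and
    checkpoints are carried over as they were.
    """
    if plan["state"] != State.PROCESSING.value:
        raise ValueError(f"plan is {plan['state']}, not processing")
    if target not in RECOVERY_TARGETS:
        raise ValueError(f"{target.value} is not a valid recovery target")
    return {**plan, "state": target.value}


stuck = {
    "id": "PLAN_A",
    "phase": "execute",  # the same rule applies to strategize
    "state": "processing",
    "decisions": ["d1"],
    "checkpoints": ["c1"],
}
recovered = recover(stuck, State.QUEUED)
print(recovered["phase"], recovered["state"], recovered["decisions"])
```

Because the rule never inspects the phase, it covers strategize/processing and execute/processing with the same code path.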

Subtasks

  • Add a force-cancel or force-reset CLI subcommand for stuck plans
  • Implement processing state detection and recovery in the plan lifecycle service
  • Handle both strategize/processing and execute/processing stuck states
  • Update plan status output to suggest recovery commands when a plan appears stuck (e.g., processing for >5 minutes with no active worker)
  • Tests (Behave): Add scenarios for stuck-processing recovery
  • Tests (Robot): Add integration test for kill-and-recover workflow
  • Verify coverage >=97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
HAL9000 added this to the v3.5.0 milestone 2026-04-08 13:55:11 +00:00
HAL9000 self-assigned this 2026-04-08 13:56:25 +00:00
Author
Member

I set the ticket type to task because this is not a simple bug fix. Fixing the reported issue requires changing the spec doc to add a new plan subcommand that resets the status, with some level of sanity checking.

Reference
cleveragents/cleveragents-core#4372