UAT: DecisionType.ERROR_RECOVERY is defined but never recorded in the decision tree — spec requires error recovery decisions to be persisted #4022

Open
opened 2026-04-06 08:42:38 +00:00 by freemo · 0 comments
Owner

Metadata

  • Branch: bugfix/error-recovery-decision-tree-recording
  • Commit Message: fix(plan-executor): record error_recovery decisions in decision tree during execute phase
  • Milestone: None (backlog)
  • Parent Epic: #394

Background and Context

The spec (docs/specification.md, line 18697) explicitly lists error_recovery as an actor-recorded decision type that must be recorded in the plan's decision tree when an error occurs during execution:

| Actor-recorded | strategy_choice, resource_selection, subplan_spawn, subplan_parallel_spawn, implementation_choice, tool_invocation, error_recovery, validation_response | Strategy or Execution actor | Actor identifies a choice point and calls record_decision |

The spec also states (line 28684):

When the target decision was created during Execute (e.g., an implementation_choice or error_recovery), both resource and reasoning rollback are needed

DecisionType.ERROR_RECOVERY = "error_recovery" is defined in decision.py (line 105) and included in EXECUTE_TYPES (line 144), but it is never used anywhere in the codebase to actually record a decision. It is dead code.
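The "defined but never recorded" condition can be illustrated with a small check. This is only a sketch: the enum below lists just two of the members named in this issue, the shape of EXECUTE_TYPES (a frozenset) is an assumption, and unrecorded_types is a hypothetical helper that does not exist in the codebase.

```python
from enum import Enum


class DecisionType(str, Enum):
    # Stand-in for the real enum in decision.py; only two of the
    # members mentioned in this issue are reproduced here.
    IMPLEMENTATION_CHOICE = "implementation_choice"
    ERROR_RECOVERY = "error_recovery"


# Assumed shape: the set of decision types that may be created during Execute.
EXECUTE_TYPES = frozenset(
    {DecisionType.IMPLEMENTATION_CHOICE, DecisionType.ERROR_RECOVERY}
)


def unrecorded_types(recorded):
    """Return the EXECUTE_TYPES members that never appear in `recorded`.

    `recorded` is any iterable of DecisionType values taken from the
    decision nodes of one or more plans (hypothetical input).
    """
    return EXECUTE_TYPES - set(recorded)


# With the behavior described in this issue, a plan that hit an error
# would still only contain implementation_choice nodes:
missing = unrecorded_types([DecisionType.IMPLEMENTATION_CHOICE])
print(sorted(m.value for m in missing))
```

A check of this kind could back the "no longer dead code" criterion, provided it is fed with decision types read from a real decision tree.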

Current Behavior

When an error occurs during plan execution:

  1. ErrorRecoveryService.record_error() is called (e.g., plan_executor.py lines 558–565, 791–798)
  2. The error is stored in ErrorRecoveryService._histories (in-memory only, not persisted to the decision tree)
  3. The error is stored in plan.error_details (a flat dict, not a decision tree node)
  4. record_decision() with decision_type=DecisionType.ERROR_RECOVERY is never called

A codebase-wide search confirms DecisionType.ERROR_RECOVERY is only referenced in decision.py at its definition — it is never used to record an actual decision anywhere.

Expected Behavior

When an error occurs during plan execution and a recovery action is taken (retry, revert, cancel, etc.), the execution actor must call record_decision with decision_type=error_recovery, recording:

  • The error that occurred
  • The recovery action chosen (retry/revert/cancel)
  • Alternatives considered
  • Confidence score
  • Rationale

This decision must be persisted in the decision tree so that:

  1. It can be reviewed via agents plan tree <PLAN_ID>
  2. It can be corrected via agents plan correct <DECISION_ID>
  3. Rollback to the error recovery decision point is possible (spec line 28684 requires both resource and reasoning rollback for error_recovery decisions)
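As a rough illustration of the fields listed above, the following sketch records an error_recovery node in an in-memory tree. The Decision and DecisionTree classes, their field names, and the record_error_recovery helper are all hypothetical stand-ins; the real record_decision signature and persistence layer are not shown in this issue and may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Decision:
    # Hypothetical node shape covering the five items the issue asks for.
    decision_type: str
    error: str
    chosen_action: str
    alternatives: List[str]
    confidence: float
    rationale: str


@dataclass
class DecisionTree:
    # Hypothetical in-memory tree; the real one must persist its nodes.
    nodes: List[Decision] = field(default_factory=list)

    def record_decision(self, decision: Decision) -> Decision:
        self.nodes.append(decision)
        return decision


def record_error_recovery(tree, error, chosen_action, alternatives,
                          confidence, rationale):
    """Build an error_recovery decision and store it in the tree."""
    decision = Decision(
        decision_type="error_recovery",
        error=f"{type(error).__name__}: {error}",
        chosen_action=chosen_action,
        alternatives=list(alternatives),
        confidence=confidence,
        rationale=rationale,
    )
    return tree.record_decision(decision)


tree = DecisionTree()
node = record_error_recovery(
    tree,
    error=TimeoutError("step timed out"),
    chosen_action="retry",
    alternatives=["revert", "cancel"],
    confidence=0.7,
    rationale="Failure looks transient; one retry is cheap.",
)
print(node.decision_type, node.chosen_action)
```

In the real executor, a call like this would presumably sit next to each existing ErrorRecoveryService.record_error() call, so that the error history and the decision tree stay in step.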

Acceptance Criteria

  • When an error occurs during plan execution and a recovery action is taken, record_decision(decision_type=DecisionType.ERROR_RECOVERY, ...) is called with the error details, chosen recovery action, alternatives, confidence, and rationale
  • The resulting decision node is persisted to the plan's decision tree (not only to ErrorRecoveryService._histories or plan.error_details)
  • agents plan tree <PLAN_ID> displays error_recovery decision nodes when errors occurred during execution
  • agents plan correct <DECISION_ID> can target an error_recovery decision node
  • Rollback to an error_recovery decision point triggers both resource and reasoning rollback per spec line 28684
  • DecisionType.ERROR_RECOVERY is no longer dead code — it is exercised by at least one code path

Supporting Information

Code locations:

  • src/cleveragents/domain/models/core/decision.py:105 — ERROR_RECOVERY = "error_recovery" (defined but unused)
  • src/cleveragents/domain/models/core/decision.py:144 — included in EXECUTE_TYPES (but no code path creates it)
  • src/cleveragents/application/services/plan_executor.py:558-565 — records error via ErrorRecoveryService but not decision tree
  • src/cleveragents/application/services/plan_executor.py:791-808 — retry loop records error but not decision tree

Steps to reproduce:

  1. Create a plan that fails during execute phase
  2. Observe that agents plan tree <PLAN_ID> shows no error_recovery decision nodes
  3. Observe that agents plan correct cannot be used to correct an error recovery decision

Related issue: #3988 (UAT: PlanExecutor.run_strategize() stores decision count in plan.error_details — a semantic misuse of the error field)

Backlog note: This issue was discovered during autonomous operation
on milestone v3.3.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.

Subtasks

  • Identify all error recovery call sites in plan_executor.py where ErrorRecoveryService.record_error() is called
  • Add record_decision(decision_type=DecisionType.ERROR_RECOVERY, ...) calls at each error recovery decision point, capturing: error details, chosen recovery action, alternatives considered, confidence score, and rationale
  • Ensure the decision is persisted to the plan's decision tree (not only in-memory or plan.error_details)
  • Verify agents plan tree <PLAN_ID> displays error_recovery nodes after an error recovery event
  • Verify agents plan correct <DECISION_ID> can target an error_recovery decision
  • Verify rollback to an error_recovery decision triggers both resource and reasoning rollback
  • Tests (unit): Add unit tests for error_recovery decision recording in plan_executor.py
  • Tests (Behave/Robot): Add integration scenario for error recovery decision tree persistence
  • Verify coverage >= 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors
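A unit test for the recording subtask might look roughly like the sketch below. run_with_retry is a simplified, hypothetical stand-in for the retry loop in plan_executor.py, and the keyword arguments passed to record_decision are assumptions rather than the actual API; only unittest.mock from the standard library is used.

```python
from unittest.mock import Mock


def run_with_retry(step, record_decision, max_attempts=3):
    """Hypothetical retry loop that records a decision on every failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            # Retry while attempts remain, otherwise give up and cancel.
            action = "retry" if attempt < max_attempts else "cancel"
            record_decision(
                decision_type="error_recovery",
                error=repr(exc),
                chosen_action=action,
                attempt=attempt,
            )
            if action == "cancel":
                raise


def test_records_error_recovery_decision_on_retry():
    # First call fails, second call succeeds.
    step = Mock(side_effect=[RuntimeError("boom"), "ok"])
    record_decision = Mock()

    assert run_with_retry(step, record_decision) == "ok"

    record_decision.assert_called_once()
    kwargs = record_decision.call_args.kwargs
    assert kwargs["decision_type"] == "error_recovery"
    assert kwargs["chosen_action"] == "retry"
```

The same pattern extends to the cancel path: make the step fail on every attempt and assert that the last recorded decision carries the cancel action before the exception propagates.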

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage >= 97%.

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-uat-tester

HAL9000 added this to the v3.5.0 milestone 2026-04-09 03:11:57 +00:00
Blocks
#394 Epic: Decision Framework
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#4022