UAT: SubplanFailureHandler.should_retry() silently ignores retry_failed=True for unknown error types — unknown errors are never retried #4001

Open
opened 2026-04-06 08:27:40 +00:00 by freemo · 0 comments
Owner

Metadata

  • Branch: fix/subplan-failure-handler-unknown-error-retry
  • Commit Message: fix(plan): retry unknown errors in SubplanFailureHandler when retry_failed=True
  • Milestone: None (backlog — see backlog note below)
  • Parent Epic: #368

Backlog note: This issue was discovered during autonomous operation
on milestone v3.4.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Discovered By

UAT tester (ca-uat-tester), feature area: Subplan and Parallel Execution

  • File: src/cleveragents/domain/models/core/plan.py
  • Class: SubplanFailureHandler
  • Method: should_retry()
  • Spec reference: docs/specification.md — "retry_failed (default: true)" — controls whether failed subplans are retried

Bug Description

SubplanFailureHandler.should_retry() only returns True for errors that match known RETRIABLE_FAILURES patterns. When retry_failed=True and the error is unknown (not in RETRIABLE_FAILURES or NON_RETRIABLE_ERRORS), the method returns False — meaning the subplan is never retried even though the user explicitly configured retry_failed=True.

What Was Tested

Code-level analysis of src/cleveragents/domain/models/core/plan.py, specifically SubplanFailureHandler.should_retry() (lines 1251–1278) and the RETRIABLE_FAILURES / NON_RETRIABLE_ERRORS constants (lines 1205–1221).

Expected Behavior (from spec)

Per docs/specification.md: "retry_failed (default: true)" — when retry_failed=True, failed subplans should be retried up to max_retries times. The spec does not define a whitelist of retriable error types; it implies all failures should be retried when retry_failed=True.

Actual Behavior

RETRIABLE_FAILURES: frozenset[str] = frozenset({
    "ValidationError",
    "TimeoutError",
    "TemporaryResourceError",
    "MergeConflictError",
})

NON_RETRIABLE_ERRORS: frozenset[str] = frozenset({
    "ConfigurationError",
    "AuthenticationError",
    "MissingResourceError",
    "CircularDependencyError",
})

def should_retry(self, config, status):
    if not config.retry_failed:
        return False
    if status.attempt_number > config.max_retries:
        return False
    error = status.error or ""
    for err_type in NON_RETRIABLE_ERRORS:
        if err_type in error:
            return False
    # Unknown errors: don't retry by default (conservative)
    return any(fail_type in error for fail_type in RETRIABLE_FAILURES)

If the error is "NetworkError: connection refused" (not in either set), should_retry() returns False even when retry_failed=True and attempt_number=1 <= max_retries=2.

Steps to Reproduce (code-level)

from cleveragents.domain.models.core.plan import (
    SubplanConfig, SubplanStatus, SubplanFailureHandler, ExecutionMode
)

config = SubplanConfig(retry_failed=True, max_retries=2)
status = SubplanStatus(
    subplan_id="01HGZ6FE0AQDYTR4BXVQZ6EA00",
    action_name="local/test",
    attempt_number=1,
    error="NetworkError: connection refused",  # Unknown error type
)
handler = SubplanFailureHandler()
result = handler.should_retry(config, status)
# result is False — but user set retry_failed=True!

Impact

Subplans that fail with unknown error types (e.g., network errors, custom errors from user-defined tools) are never retried even when retry_failed=True. This silently defeats the retry configuration.


Subtasks

  • Review the intended semantics of retry_failed in the spec
  • Update should_retry() to retry unknown errors when retry_failed=True (or document the conservative behavior as intentional)
  • Add unit test (Behave) verifying that unknown errors are retried when retry_failed=True

Definition of Done

  • SubplanFailureHandler.should_retry() behavior is consistent with the spec's retry_failed semantics
  • Unit tests (Behave) cover the unknown-error retry behavior
  • No regression in existing retry tests
  • All nox stages pass
  • Coverage >= 97%

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/subplan-failure-handler-unknown-error-retry` - **Commit Message**: `fix(plan): retry unknown errors in SubplanFailureHandler when retry_failed=True` - **Milestone**: None (backlog — see backlog note below) - **Parent Epic**: #368 > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.4.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- ## Discovered By UAT tester (ca-uat-tester), feature area: Subplan and Parallel Execution - **File**: `src/cleveragents/domain/models/core/plan.py` - **Class**: `SubplanFailureHandler` - **Method**: `should_retry()` - **Spec reference**: `docs/specification.md` — "`retry_failed` (default: `true`)" — controls whether failed subplans are retried --- ## Bug Description `SubplanFailureHandler.should_retry()` only returns `True` for errors that match known `RETRIABLE_FAILURES` patterns. When `retry_failed=True` and the error is unknown (not in `RETRIABLE_FAILURES` or `NON_RETRIABLE_ERRORS`), the method returns `False` — meaning the subplan is **never retried** even though the user explicitly configured `retry_failed=True`. ### What Was Tested Code-level analysis of `src/cleveragents/domain/models/core/plan.py`, specifically `SubplanFailureHandler.should_retry()` (lines 1251–1278) and the `RETRIABLE_FAILURES` / `NON_RETRIABLE_ERRORS` constants (lines 1205–1221). ### Expected Behavior (from spec) Per `docs/specification.md`: "`retry_failed` (default: `true`)" — when `retry_failed=True`, failed subplans should be retried up to `max_retries` times. The spec does not define a whitelist of retriable error types; it implies all failures should be retried when `retry_failed=True`. ### Actual Behavior ```python RETRIABLE_FAILURES: frozenset[str] = frozenset({ "ValidationError", "TimeoutError", "TemporaryResourceError", "MergeConflictError", }) NON_RETRIABLE_ERRORS: frozenset[str] = frozenset({ "ConfigurationError", "AuthenticationError", "MissingResourceError", "CircularDependencyError", }) def should_retry(self, config, status): if not config.retry_failed: return False if status.attempt_number > config.max_retries: return False error = status.error or "" for err_type in NON_RETRIABLE_ERRORS: if err_type in error: return False # Unknown errors: don't retry by default (conservative) return any(fail_type in error for fail_type in RETRIABLE_FAILURES) ``` If the error is `"NetworkError: connection refused"` (not in either set), `should_retry()` returns `False` even when `retry_failed=True` and `attempt_number=1 <= max_retries=2`. ### Steps to Reproduce (code-level) ```python from cleveragents.domain.models.core.plan import ( SubplanConfig, SubplanStatus, SubplanFailureHandler, ExecutionMode ) config = SubplanConfig(retry_failed=True, max_retries=2) status = SubplanStatus( subplan_id="01HGZ6FE0AQDYTR4BXVQZ6EA00", action_name="local/test", attempt_number=1, error="NetworkError: connection refused", # Unknown error type ) handler = SubplanFailureHandler() result = handler.should_retry(config, status) # result is False — but user set retry_failed=True! ``` ### Impact Subplans that fail with unknown error types (e.g., network errors, custom errors from user-defined tools) are never retried even when `retry_failed=True`. This silently defeats the retry configuration. --- ## Subtasks - [ ] Review the intended semantics of `retry_failed` in the spec - [ ] Update `should_retry()` to retry unknown errors when `retry_failed=True` (or document the conservative behavior as intentional) - [ ] Add unit test (Behave) verifying that unknown errors are retried when `retry_failed=True` ## Definition of Done - [ ] `SubplanFailureHandler.should_retry()` behavior is consistent with the spec's `retry_failed` semantics - [ ] Unit tests (Behave) cover the unknown-error retry behavior - [ ] No regression in existing retry tests - [ ] All nox stages pass - [ ] Coverage >= 97% --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
HAL9000 added this to the v3.5.0 milestone 2026-04-09 03:12:11 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#368 Epic: Subplans & Parallelism
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#4001
No description provided.