BUG-HUNT: [boundary] Off-by-one error in subplan retry logic #1271

Open
opened 2026-04-02 08:09:29 +00:00 by freemo · 2 comments
Owner

Bug Report: [boundary] — Off-by-one error in subplan retry logic

Severity Assessment

  • Impact: Subplans may be retried one more time than configured in the SubplanConfig, potentially leading to wasted resources and unexpected behavior.
  • Likelihood: High. This will affect any failed subplan where retries are enabled.
  • Priority: High

Location

  • File: src/cleveragents/domain/models/core/plan.py
  • Function/Class: SubplanFailureHandler.should_retry
  • Lines: 1258

Description

The should_retry method in SubplanFailureHandler contains an off-by-one error in its check of the maximum number of retries. The current logic is status.attempt_number > config.max_retries, which allows one more retry than intended. For example, if max_retries is 2, a subplan will be retried even when attempt_number is 2, leading to a third attempt.

Evidence

# Relevant code snippet showing the issue
@should_retry(
    self,
    config: SubplanConfig,
    status: SubplanStatus,
) -> bool:
    """Determine if a failed subplan should be retried."""
    if not config.retry_failed:
        return False
    if status.attempt_number > config.max_retries: # This is the bug
        return False

Expected Behavior

The check should be status.attempt_number >= config.max_retries. If max_retries is 2, the second attempt (attempt_number 2) should be the last one if it fails. The condition 2 >= 2 would be true, and should_retry would correctly return False.

Actual Behavior

The check status.attempt_number > config.max_retries allows an extra retry. If max_retries is 2, and attempt_number is 2, the condition 2 > 2 is false, and the method proceeds to return True (assuming the error is retriable), allowing a third attempt.

Suggested Fix

Change the line:

if status.attempt_number > config.max_retries:

to:

if status.attempt_number >= config.max_retries:

Category

boundary


Metadata

  • Branch: fix/boundary-subplan-retry-off-by-one
  • Commit Message: fix(boundary): correct off-by-one error in SubplanFailureHandler.should_retry
  • Milestone: v3.4.0
  • Parent Epic: #368

Subtasks

  • Write a TDD issue-capture test using @tdd_expected_fail in features/ that demonstrates the off-by-one bug (subplan retried more times than max_retries allows)
  • Fix the condition in SubplanFailureHandler.should_retry — change > to >= on the attempt_number vs max_retries comparison (line 1258 of src/cleveragents/domain/models/core/plan.py)
  • Remove the @tdd_expected_fail tag from the TDD test once the fix is in place
  • Add or update unit tests in features/ covering boundary values: attempt_number == max_retries (should NOT retry), attempt_number < max_retries (should retry), and attempt_number > max_retries (should NOT retry)
  • Run nox -e typecheck to confirm no type regressions
  • Run nox -e unit_tests to confirm all Behave tests pass
  • Run nox -e coverage_report to confirm coverage remains ≥ 97%

Definition of Done

  • The @tdd_expected_fail TDD capture test is committed and demonstrates the bug
  • The one-line fix (>>=) is applied in SubplanFailureHandler.should_retry
  • All boundary-value unit tests pass (attempt == max, attempt < max, attempt > max)
  • No regressions in existing subplan or retry-related tests
  • All nox stages pass (nox -e lint, nox -e typecheck, nox -e unit_tests, nox -e integration_tests, nox -e coverage_report)
  • Coverage >= 97%
  • Commit is on branch fix/boundary-subplan-retry-off-by-one with message fix(boundary): correct off-by-one error in SubplanFailureHandler.should_retry
  • PR is merged and this issue is closed
## Bug Report: [boundary] — Off-by-one error in subplan retry logic ### Severity Assessment - **Impact**: Subplans may be retried one more time than configured in the `SubplanConfig`, potentially leading to wasted resources and unexpected behavior. - **Likelihood**: High. This will affect any failed subplan where retries are enabled. - **Priority**: High ### Location - **File**: `src/cleveragents/domain/models/core/plan.py` - **Function/Class**: `SubplanFailureHandler.should_retry` - **Lines**: 1258 ### Description The `should_retry` method in `SubplanFailureHandler` contains an off-by-one error in its check of the maximum number of retries. The current logic is `status.attempt_number > config.max_retries`, which allows one more retry than intended. For example, if `max_retries` is 2, a subplan will be retried even when `attempt_number` is 2, leading to a third attempt. ### Evidence ```python # Relevant code snippet showing the issue @should_retry( self, config: SubplanConfig, status: SubplanStatus, ) -> bool: """Determine if a failed subplan should be retried.""" if not config.retry_failed: return False if status.attempt_number > config.max_retries: # This is the bug return False ``` ### Expected Behavior The check should be `status.attempt_number >= config.max_retries`. If `max_retries` is 2, the second attempt (`attempt_number` 2) should be the last one if it fails. The condition `2 >= 2` would be true, and `should_retry` would correctly return `False`. ### Actual Behavior The check `status.attempt_number > config.max_retries` allows an extra retry. If `max_retries` is 2, and `attempt_number` is 2, the condition `2 > 2` is false, and the method proceeds to return `True` (assuming the error is retriable), allowing a third attempt. ### Suggested Fix Change the line: ```python if status.attempt_number > config.max_retries: ``` to: ```python if status.attempt_number >= config.max_retries: ``` ### Category boundary --- ## Metadata - **Branch**: `fix/boundary-subplan-retry-off-by-one` - **Commit Message**: `fix(boundary): correct off-by-one error in SubplanFailureHandler.should_retry` - **Milestone**: v3.4.0 - **Parent Epic**: #368 ## Subtasks - [ ] Write a TDD issue-capture test using `@tdd_expected_fail` in `features/` that demonstrates the off-by-one bug (subplan retried more times than `max_retries` allows) - [ ] Fix the condition in `SubplanFailureHandler.should_retry` — change `>` to `>=` on the `attempt_number` vs `max_retries` comparison (line 1258 of `src/cleveragents/domain/models/core/plan.py`) - [ ] Remove the `@tdd_expected_fail` tag from the TDD test once the fix is in place - [ ] Add or update unit tests in `features/` covering boundary values: `attempt_number == max_retries` (should NOT retry), `attempt_number < max_retries` (should retry), and `attempt_number > max_retries` (should NOT retry) - [ ] Run `nox -e typecheck` to confirm no type regressions - [ ] Run `nox -e unit_tests` to confirm all Behave tests pass - [ ] Run `nox -e coverage_report` to confirm coverage remains ≥ 97% ## Definition of Done - [ ] The `@tdd_expected_fail` TDD capture test is committed and demonstrates the bug - [ ] The one-line fix (`>` → `>=`) is applied in `SubplanFailureHandler.should_retry` - [ ] All boundary-value unit tests pass (attempt == max, attempt < max, attempt > max) - [ ] No regressions in existing subplan or retry-related tests - [ ] All nox stages pass (`nox -e lint`, `nox -e typecheck`, `nox -e unit_tests`, `nox -e integration_tests`, `nox -e coverage_report`) - [ ] Coverage >= 97% - [ ] Commit is on branch `fix/boundary-subplan-retry-off-by-one` with message `fix(boundary): correct off-by-one error in SubplanFailureHandler.should_retry` - [ ] PR is merged and this issue is closed
freemo added this to the v3.4.0 milestone 2026-04-02 08:10:32 +00:00
Author
Owner

Dependency / Hierarchy Note

This issue is a child of #368 Epic: Subplans & Parallelism (parent Epic).

Per the ticket hierarchy rules:

  • This issue blocks any work that depends on correct subplan retry behaviour.
  • This issue depends on (is a child of) #368.

The fix is a single-line change (>>=) in SubplanFailureHandler.should_retry at line 1258 of src/cleveragents/domain/models/core/plan.py, scoped entirely within the Subplans & Parallelism epic domain.

**Dependency / Hierarchy Note** This issue is a child of **#368 Epic: Subplans & Parallelism** (parent Epic). Per the ticket hierarchy rules: - This issue **blocks** any work that depends on correct subplan retry behaviour. - This issue **depends on** (is a child of) #368. The fix is a single-line change (`>` → `>=`) in `SubplanFailureHandler.should_retry` at line 1258 of `src/cleveragents/domain/models/core/plan.py`, scoped entirely within the Subplans & Parallelism epic domain.
freemo self-assigned this 2026-04-02 18:45:27 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: High (confirmed) — Off-by-one error in subplan retry logic means subplans are retried one more time than configured. This wastes resources and produces unexpected behavior for any failed subplan with retries enabled.
  • Milestone: v3.4.0 (confirmed) — This milestone is overdue (due Feb 26) and 89% complete. This is a Must Have bug that needs to be resolved to close out the milestone.
  • MoSCoW: Must Have (confirmed) — Correct retry behavior is essential for the subplan execution model. The fix is a one-character change (>>=) but the impact is significant.
  • Parent Epic: #368 (Subplans & Parallelism)

Valid bug-hunt finding. The fix is straightforward but the bug is real and affects all subplan retry scenarios. This should be prioritized as part of closing out v3.4.0.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Issue triaged by project owner: - **State**: Verified - **Priority**: High (confirmed) — Off-by-one error in subplan retry logic means subplans are retried one more time than configured. This wastes resources and produces unexpected behavior for any failed subplan with retries enabled. - **Milestone**: v3.4.0 (confirmed) — This milestone is overdue (due Feb 26) and 89% complete. This is a Must Have bug that needs to be resolved to close out the milestone. - **MoSCoW**: Must Have (confirmed) — Correct retry behavior is essential for the subplan execution model. The fix is a one-character change (`>` → `>=`) but the impact is significant. - **Parent Epic**: #368 (Subplans & Parallelism) Valid bug-hunt finding. The fix is straightforward but the bug is real and affects all subplan retry scenarios. This should be prioritized as part of closing out v3.4.0. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: ca-project-owner
freemo removed this from the v3.4.0 milestone 2026-04-07 02:32:25 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#1271
No description provided.