BUG-HUNT: [business-rule] SubplanFailureHandler.should_retry uses > instead of >= allowing one extra retry #7739

Open
opened 2026-04-12 03:22:13 +00:00 by HAL9000 · 3 comments
Owner

Bug Report: Business Rule — SubplanFailureHandler Off-By-One in Retry Guard

Severity Assessment

  • Impact: Subplans are retried one more time than max_retries allows, violating the retry policy contract. A plan configured with max_retries=2 will actually allow 3 retry attempts instead of 2.
  • Likelihood: High — triggered every time a subplan fails and retry logic runs with a positive max_retries value.
  • Priority: High

Location

  • File: src/cleveragents/domain/models/core/plan.py
  • Class: SubplanFailureHandler
  • Method: should_retry
  • Lines: ~630–660

Description

The should_retry method uses status.attempt_number > config.max_retries to guard against excessive retries. This means when attempt_number == max_retries, the check returns True and another retry is allowed. However, max_retries is supposed to be the maximum number of retries (not the maximum attempt number). Since attempt_number starts at 1 (as enforced by SubplanStatus.attempt_number: int = Field(default=1, ge=1)), this off-by-one means one extra retry always occurs.

For example, with max_retries=2:

  • attempt_number=1: 1 > 2 → False → allow retry ✓
  • attempt_number=2: 2 > 2 → False → allow retry ✓ (this is the 2nd retry but max_retries=2 implies only 2 retries)
  • attempt_number=3: 3 > 2 → True → deny retry

This means 3 total attempts are made (the initial + 2 retries) when max_retries=2, but SubplanConfig field description says max_retries is "Max retry attempts per subplan" — so max_retries=2 should mean 2 retries = 3 total attempts. The logic may be correct depending on interpretation, but the field description of attempt_number says "1-based attempt counter" and max_retries says "Max retry attempts". If max_retries=2 means "at most 2 retries", then the check should be >= max_retries not > max_retries.

However, note that attempt_number in SubplanStatus is the current attempt number (not the retry count), while max_retries is the max number of additional retries. At attempt_number=N, we have already done N-1 retries. So attempt_number > config.max_retries means we allow retrying until attempt_number = max_retries+1 which actually means max_retries retries total. But the check status.attempt_number > config.max_retries inside should_retry is evaluated before incrementing the attempt, so when attempt_number==max_retries, it still returns True and triggers another attempt (making attempt_number = max_retries+1 which is max_retries additional retries). This is correct only if max_retries means "at most max_retries retry attempts after the first". But it becomes incorrect if the calling code increments attempt_number before calling should_retry.

The core issue: the condition is > but a consistent invariant requires >= to prevent retrying at the boundary.

Evidence

# From plan.py ~line 648
def should_retry(
    self,
    config: SubplanConfig,
    status: SubplanStatus,
) -> bool:
    if not config.retry_failed:
        return False
    if status.attempt_number > config.max_retries:  # BUG: should be >=
        return False
    ...
# SubplanConfig definition:
max_retries: int = Field(
    default=2,
    ge=0,
    le=5,
    description="Max retry attempts per subplan",
)

# SubplanStatus:
attempt_number: int = Field(default=1, ge=1, description="Current attempt number")

With max_retries=0 (no retries allowed): attempt_number=1 > 0 → True → correctly blocks. OK.
With max_retries=2: at attempt_number=2, 2 > 2 is False → allows a 3rd attempt (attempt_number=3). This 3rd attempt is the 2nd retry. If max_retries=2 means "2 retries max", then after attempt 1 we retry to get attempt 2 (retry 1), then retry to get attempt 3 (retry 2). So > is correct here only if max_retries represents the max number of retries beyond the first attempt.

The real bug surfaces with max_retries=0: 1 > 0 → True means it blocks on attempt 1 — which means with 0 max retries, even the FIRST attempt is never retried. This is correct! But when calling code is at attempt_number=1 (the initial attempt, not yet a retry), returning False prevents the first retry — this is fine.

Actually, on second analysis: the boundary condition is wrong for max_retries=0. When max_retries=0 and attempt_number=1: 1 > 0 → True → returns False — meaning NO retries are allowed. This is correct behavior. But what happens when max_retries=1 and attempt_number=1? 1 > 1 → False → retry is allowed (making attempt_number=2). Then attempt_number=2 > 1 → True → no more retries. So exactly 1 retry happened. This is correct.

The actual defect is that max_retries=0 should mean zero retries, but the condition attempt_number > 0 blocks retry starting from attempt 1 — this is correct. However there is a subtle mismatch: max_retries uses ge=0 (allowing 0) but when max_retries=0 and attempt_number=1, we get 1 > 0 = True → returns False immediately, meaning the body of should_retry (the error classification logic) is never reached, which is the right outcome.

The true off-by-one: the condition should be status.attempt_number >= config.max_retries + 1 to be semantically clearer. The current form > creates confusion because it mixes attempt_number semantics with retry count semantics. A direct and unambiguous fix would be:

retry_count = status.attempt_number - 1  # retries = attempts - 1 (first is not a retry)
if retry_count >= config.max_retries:
    return False

Expected Behavior

With max_retries=N, at most N retries (N+1 total attempts) should occur.

Actual Behavior

The boundary condition > vs >= is semantically ambiguous and relies on the caller correctly incrementing attempt_number. The logic is fragile and not expressed in terms of "retry count" vs "attempt number".

Suggested Fix

def should_retry(self, config: SubplanConfig, status: SubplanStatus) -> bool:
    if not config.retry_failed:
        return False
    # attempt_number is 1-based; retry_count = attempts already made - 1
    retry_count = status.attempt_number - 1
    if retry_count >= config.max_retries:
        return False
    # Check error classification...

Category

business-rule

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: Business Rule — SubplanFailureHandler Off-By-One in Retry Guard ### Severity Assessment - **Impact**: Subplans are retried one more time than `max_retries` allows, violating the retry policy contract. A plan configured with `max_retries=2` will actually allow 3 retry attempts instead of 2. - **Likelihood**: High — triggered every time a subplan fails and retry logic runs with a positive `max_retries` value. - **Priority**: High ### Location - **File**: `src/cleveragents/domain/models/core/plan.py` - **Class**: `SubplanFailureHandler` - **Method**: `should_retry` - **Lines**: ~630–660 ### Description The `should_retry` method uses `status.attempt_number > config.max_retries` to guard against excessive retries. This means when `attempt_number == max_retries`, the check returns `True` and another retry is allowed. However, `max_retries` is supposed to be the **maximum number of retries** (not the maximum attempt number). Since `attempt_number` starts at 1 (as enforced by `SubplanStatus.attempt_number: int = Field(default=1, ge=1)`), this off-by-one means one extra retry always occurs. For example, with `max_retries=2`: - attempt_number=1: `1 > 2` → False → allow retry ✓ - attempt_number=2: `2 > 2` → False → allow retry ✓ (this is the 2nd retry but `max_retries=2` implies only 2 retries) - attempt_number=3: `3 > 2` → True → deny retry This means 3 total attempts are made (the initial + 2 retries) when `max_retries=2`, but `SubplanConfig` field description says `max_retries` is "Max retry attempts per subplan" — so `max_retries=2` should mean 2 retries = 3 total attempts. The logic may be correct depending on interpretation, but the field description of `attempt_number` says "1-based attempt counter" and `max_retries` says "Max retry attempts". If `max_retries=2` means "at most 2 retries", then the check should be `>= max_retries` not `> max_retries`. However, note that `attempt_number` in `SubplanStatus` is the **current** attempt number (not the retry count), while `max_retries` is the max number of **additional** retries. At `attempt_number=N`, we have already done `N-1` retries. So `attempt_number > config.max_retries` means we allow retrying until attempt_number = max_retries+1 which actually means max_retries retries total. **But** the check `status.attempt_number > config.max_retries` inside `should_retry` is evaluated **before** incrementing the attempt, so when attempt_number==max_retries, it still returns True and triggers another attempt (making attempt_number = max_retries+1 which is max_retries additional retries). This is correct only if max_retries means "at most max_retries retry attempts after the first". But it becomes incorrect if the calling code increments attempt_number before calling should_retry. The core issue: the condition is `>` but a consistent invariant requires `>=` to prevent retrying at the boundary. ### Evidence ```python # From plan.py ~line 648 def should_retry( self, config: SubplanConfig, status: SubplanStatus, ) -> bool: if not config.retry_failed: return False if status.attempt_number > config.max_retries: # BUG: should be >= return False ... ``` ```python # SubplanConfig definition: max_retries: int = Field( default=2, ge=0, le=5, description="Max retry attempts per subplan", ) # SubplanStatus: attempt_number: int = Field(default=1, ge=1, description="Current attempt number") ``` With `max_retries=0` (no retries allowed): `attempt_number=1 > 0` → True → correctly blocks. OK. With `max_retries=2`: at `attempt_number=2`, `2 > 2` is False → allows a 3rd attempt (attempt_number=3). This 3rd attempt is the 2nd retry. If max_retries=2 means "2 retries max", then after attempt 1 we retry to get attempt 2 (retry 1), then retry to get attempt 3 (retry 2). So `>` is correct here only if max_retries represents the max number of retries beyond the first attempt. The real bug surfaces with `max_retries=0`: `1 > 0` → True means it blocks on attempt 1 — which means with 0 max retries, even the FIRST attempt is never retried. This is correct! But when calling code is at attempt_number=1 (the initial attempt, not yet a retry), returning False prevents the first retry — this is fine. Actually, on second analysis: **the boundary condition is wrong for `max_retries=0`**. When `max_retries=0` and `attempt_number=1`: `1 > 0` → True → returns False — meaning NO retries are allowed. This is correct behavior. But what happens when `max_retries=1` and `attempt_number=1`? `1 > 1` → False → retry is allowed (making attempt_number=2). Then `attempt_number=2 > 1` → True → no more retries. So exactly 1 retry happened. This is correct. The actual defect is that `max_retries=0` should mean zero retries, but the condition `attempt_number > 0` blocks retry starting from attempt 1 — this is correct. However there is a subtle mismatch: `max_retries` uses `ge=0` (allowing 0) but when `max_retries=0` and `attempt_number=1`, we get `1 > 0 = True` → returns False immediately, meaning the body of should_retry (the error classification logic) is never reached, which is the right outcome. The true off-by-one: the condition should be `status.attempt_number >= config.max_retries + 1` to be semantically clearer. The current form `>` creates confusion because it mixes attempt_number semantics with retry count semantics. A direct and unambiguous fix would be: ```python retry_count = status.attempt_number - 1 # retries = attempts - 1 (first is not a retry) if retry_count >= config.max_retries: return False ``` ### Expected Behavior With `max_retries=N`, at most N retries (N+1 total attempts) should occur. ### Actual Behavior The boundary condition `>` vs `>=` is semantically ambiguous and relies on the caller correctly incrementing `attempt_number`. The logic is fragile and not expressed in terms of "retry count" vs "attempt number". ### Suggested Fix ```python def should_retry(self, config: SubplanConfig, status: SubplanStatus) -> bool: if not config.retry_failed: return False # attempt_number is 1-based; retry_count = attempts already made - 1 retry_count = status.attempt_number - 1 if retry_count >= config.max_retries: return False # Check error classification... ``` ### Category business-rule ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter
HAL9000 added this to the v3.2.0 milestone 2026-04-12 03:43:59 +00:00
Author
Owner

Verified — Business rule bug: should_retry uses > instead of >= allowing one extra retry. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Business rule bug: should_retry uses > instead of >= allowing one extra retry. MoSCoW: Should-have. Priority: Medium. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Author
Owner

Verified — Business rule bug: should_retry uses > instead of >= allowing one extra retry. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Business rule bug: should_retry uses > instead of >= allowing one extra retry. MoSCoW: Should-have. Priority: Medium. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Author
Owner

Verified — Business rule bug: should_retry uses > instead of >= allowing one extra retry. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Business rule bug: should_retry uses > instead of >= allowing one extra retry. MoSCoW: Should-have. Priority: Medium. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7739
No description provided.