BUG-HUNT: [error-handling] Potential busy loop in retry_auto_debug #3020

Open
opened 2026-04-05 03:55:52 +00:00 by freemo · 2 comments
Owner

Metadata

  • Branch: fix/v3.3.0-retry-auto-debug-busy-loop
  • Commit Message: fix(core): add backoff sleep to retry_auto_debug when no debug_callback provided
  • Milestone: v3.3.0
  • Parent Epic: #362

Bug Report: [error-handling] — Potential busy loop in retry_auto_debug

Severity Assessment

  • Impact: Medium. A busy loop can consume significant CPU resources and make the application unresponsive.
  • Likelihood: Medium. This can happen if a wrapped function consistently returns an error dictionary without a debug_callback to handle it.
  • Priority: Medium

Location

  • File: src/cleveragents/core/retry_service_patterns.py
  • Function/Class: retry_auto_debug
  • Lines: ~589-604

Description

In retry_auto_debug, when the wrapped function returns a dictionary containing an "error" key and no debug_callback is provided, the error-dict branch never reaches the exponential backoff sleep that is applied after a caught exception. The concern is that retrying a persistent error on this path without any delay results in a busy loop, consuming CPU and potentially flooding the logs. Note that the snippet under Evidence shows this branch returning the result as-is rather than moving to the next iteration.

Evidence

                    if isinstance(result, dict):
                        error_value: Any = result.get("error")
                        if error_value is not None:
                            last_error = str(error_value)
                            if debug_callback:
                                debug_result: Any = await debug_callback(
                                    last_error, attempt
                                )
                                if isinstance(debug_result, dict) and debug_result.get(
                                    "fixed"
                                ):
                                    continue
                            else:
                                # S12: No debug callback — return the
                                # result as-is instead of discarding it.
                                return result

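The continue in the if debug_callback branch is taken only when the callback reports "fixed", and it jumps to the next attempt without passing through the backoff sleep. The else branch (no debug_callback) neither continues nor sleeps: it returns the error result immediately, so the error is never retried. The intended behavior is for an unresolved error to proceed to the next iteration, but only after a backoff sleep.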

Expected Behavior

If no debug_callback is provided, the function should still perform an exponential backoff sleep before the next retry attempt, just as it does when an exception is caught.
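
For reference, the delay schedule implied by the backoff expression quoted in this issue, min(2**attempt, 60), can be worked out as below. The helper name backoff_delay is hypothetical and exists only to illustrate the arithmetic; it is not part of retry_service_patterns.py.

```python
def backoff_delay(attempt: int, cap: int = 60) -> int:
    """Delay in seconds before the retry that follows `attempt` (0-based).

    Hypothetical helper mirroring the expression min(2**attempt, 60).
    """
    return min(2**attempt, cap)


# Delays for the first seven attempts: the 60-second cap applies from attempt 6.
delays = [backoff_delay(a) for a in range(7)]
print(delays)  # [1, 2, 4, 8, 16, 32, 60]
```

With the default cap, every attempt from the seventh onward waits 60 seconds.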

Suggested Fix

Add a sleep to the else block, or restructure the logic to ensure a sleep happens before the next iteration.

                        # ...
                        if error_value is not None:
                            last_error = str(error_value)
                            if debug_callback:
                                # ...
                                if isinstance(debug_result, dict) and debug_result.get("fixed"):
                                    continue
                            # No else branch here: an unresolved error falls
                            # through to the backoff sleep below.
                        else:
                            return result
                # ...
                # S17: Do not sleep after the final failed attempt.
                if attempt < max_debug_attempts - 1:
                    await asyncio.sleep(min(2**attempt, 60))
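
The fragment above can be hard to follow out of context, so here is a minimal, self-contained sketch of the same control flow. It is not the project's implementation: retry_sketch, its signature, and the injectable sleep parameter are hypothetical, and returning the last error dict once attempts are exhausted is an assumption made only so the sketch has a defined result.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

DebugCallback = Callable[[str, int], Awaitable[Any]]


async def retry_sketch(
    func: Callable[[], Awaitable[Any]],
    max_debug_attempts: int = 3,
    debug_callback: Optional[DebugCallback] = None,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Hypothetical stand-in for retry_auto_debug (illustration only)."""
    result: Any = None
    for attempt in range(max_debug_attempts):
        result = await func()
        error_value = result.get("error") if isinstance(result, dict) else None
        if error_value is None:
            return result  # No error reported: return the result as-is.
        if debug_callback is not None:
            debug_result = await debug_callback(str(error_value), attempt)
            if isinstance(debug_result, dict) and debug_result.get("fixed"):
                continue  # Callback reports a fix: retry immediately.
        # Unresolved error, with or without a callback: back off before
        # the next attempt, but do not sleep after the final one.
        if attempt < max_debug_attempts - 1:
            await sleep(min(2**attempt, 60))
    return result  # Assumption: surface the last error dict when exhausted.


async def always_fails() -> dict:
    return {"error": "persistent error"}


recorded: list = []


async def fake_sleep(seconds: float) -> None:
    recorded.append(seconds)  # Record the delay instead of waiting.


final = asyncio.run(
    retry_sketch(always_fails, max_debug_attempts=4, sleep=fake_sleep)
)
print(recorded)  # [1, 2, 4]
```

Injecting the sleep function is a choice made here only to keep the example fast to run; the real function may well call asyncio.sleep directly.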

Category

error-handling

Subtasks

  • Reproduce the busy loop by writing a Behave scenario where retry_auto_debug wraps a function that always returns {"error": "persistent error"} with no debug_callback
  • Confirm CPU spike / rapid log flooding in the test
  • Restructure the error-dict handling logic so that the exponential backoff sleep (asyncio.sleep(min(2**attempt, 60))) is always reached before the next iteration
  • Ensure the fix does not alter the return-as-is behaviour when debug_callback is absent and the result is not an error dict
  • Update / add Behave unit tests covering the no-callback backoff path
  • Run nox -e typecheck and confirm no new Pyright errors
  • Run nox -e unit_tests and confirm all scenarios pass
  • Run nox -e coverage_report and confirm coverage ≥ 97%
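
The reproduction and backoff subtasks above could be checked along the following lines. This is a plain-Python sketch, not a Behave step definition: retry_stand_in is a hypothetical, simplified stand-in for retry_auto_debug that models only the no-callback path, and the real test would import the project function instead.

```python
import asyncio
from unittest.mock import AsyncMock, call, patch


async def retry_stand_in(func, max_debug_attempts: int = 3):
    """Hypothetical stand-in covering only the no-callback backoff path."""
    result = None
    for attempt in range(max_debug_attempts):
        result = await func()
        if not (isinstance(result, dict) and result.get("error") is not None):
            return result
        # Back off between attempts, but not after the final one.
        if attempt < max_debug_attempts - 1:
            await asyncio.sleep(min(2**attempt, 60))
    return result


async def always_fails():
    return {"error": "persistent error"}


def check_backoff_between_retries() -> list:
    # Replace asyncio.sleep so the check runs instantly and records each delay.
    with patch("asyncio.sleep", new_callable=AsyncMock) as fake_sleep:
        asyncio.run(retry_stand_in(always_fails, max_debug_attempts=3))
    return fake_sleep.await_args_list


awaited = check_backoff_between_retries()
assert awaited == [call(1), call(2)]
```

Asserting on the recorded delays is deterministic, which may be easier to maintain than measuring CPU usage or log volume directly.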

Definition of Done

  • A Behave scenario demonstrates that retry_auto_debug sleeps between retries even when no debug_callback is supplied
  • The exponential backoff sleep is applied consistently regardless of whether debug_callback is provided
  • No regression in existing retry_auto_debug tests
  • All nox stages pass
  • Coverage ≥ 97%

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: ca-new-issue-creator

freemo added this to the v3.3.0 milestone 2026-04-05 03:57:17 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: Confirmed
  • MoSCoW: Should Have

Valid finding verified during batch triage.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Author
Owner

Label compliance fix applied:

  • Added missing label: Priority/Medium
  • Added missing label: Type/Bug
  • Reason: Issue was missing required Priority/* and Type/* labels per CONTRIBUTING.md. Labels inferred from issue body (severity: Medium, category: error-handling bug).

Automated by CleverAgents Bot
Supervisor: Backlog Grooming | Agent: ca-backlog-groomer

Blocks
#362 Epic: Security & Safety Hardening
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#3020