[AUTO-BUG-2] Infinite recursion in Node.execute() retry logic causes exponential retry explosion #10035

Open
opened 2026-04-16 13:48:32 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Commit Message: fix(langgraph): prevent infinite recursion in Node.execute() retry logic
  • Branch: bugfix/m3-langgraph-node-retry-recursion

Background and context

The Node.execute() method in src/cleveragents/langgraph/nodes.py implements a retry loop that recursively calls self.execute(state) when an exception occurs. Because self.execute() itself checks self.config.retry_policy and enters the retry loop again, each retry attempt can itself start another round of retries, so the recursion depth and the number of attempts are no longer bounded by max_retries.

Current behavior

When a node has a retry_policy configured (e.g., max_retries: 3) and the node execution fails on every attempt, the retry loop at lines 138–148 calls return await self.execute(state) recursively. Each recursive call re-enters the retry loop, resulting in up to max_retries^depth actual execution attempts and potential RecursionError / stack overflow.

Code snippet showing the bug (src/cleveragents/langgraph/nodes.py, lines 107–148):

async def execute(self, state: GraphState) -> dict[str, Any]:
    self.execution_count += 1
    ...
    try:
        ...
    except Exception as exc:
        self.last_error = exc
        self.logger.error("Node %s execution failed: %s", self.name, exc)
        if self.config.retry_policy:
            retry_count = self.config.retry_policy.get("max_retries", 3)
            retry_delay = self.config.retry_policy.get("delay", 1.0)
            for _ in range(retry_count):
                await asyncio.sleep(retry_delay)
                try:
                    return await self.execute(state)  # ← RECURSIVE CALL re-enters retry loop
                except Exception:
                    self.logger.warning("Node %s retry attempt failed", self.name)
                    continue
        return {"error": str(exc), "failed_node": self.name}

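With max_retries=3 and a node that fails on every attempt, the first failure enters the retry loop and calls self.execute(state). That call fails too, enters its own retry loop, and recurses again before the outer loop can advance to its second iteration. Because execute() turns a failure into a returned error dict rather than raising, the inner except clause is not reached for ordinary failures, and the recursion, not the for loop, drives the repetition. Nothing in the snippet bounds that depth, so the attempts are not capped at the intended 4 (1 initial + 3 retries); the chain continues until an attempt succeeds or RecursionError is raised.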
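The behavior described above can be checked with a small stand-in class. This is a minimal sketch, not the real Node: BuggyNode, AttemptBudgetExceeded, count_attempts, and the budget parameter are hypothetical names introduced only for this illustration, the body always fails, and the delay is set to 0 so the demo finishes quickly. Only the retry structure is copied from the snippet.

```python
import asyncio


class AttemptBudgetExceeded(BaseException):
    """Hypothetical guard so the demo terminates.

    It derives from BaseException so that the `except Exception`
    clauses in the retry logic do not swallow it.
    """


class BuggyNode:
    """Stand-in that copies only the recursive retry structure."""

    def __init__(self, max_retries: int, budget: int) -> None:
        self.retry_policy = {"max_retries": max_retries, "delay": 0}
        self.execution_count = 0
        self.budget = budget  # hypothetical cap on attempts for the demo

    async def execute(self, state: dict) -> dict:
        self.execution_count += 1
        if self.execution_count > self.budget:
            raise AttemptBudgetExceeded
        try:
            raise RuntimeError("always fails")  # simulated node failure
        except Exception as exc:
            if self.retry_policy:
                retry_count = self.retry_policy.get("max_retries", 3)
                retry_delay = self.retry_policy.get("delay", 1.0)
                for _ in range(retry_count):
                    await asyncio.sleep(retry_delay)
                    try:
                        # Recursive call: re-enters the retry logic.
                        return await self.execute(state)
                    except Exception:
                        continue
            return {"error": str(exc), "failed_node": "demo"}


def count_attempts(max_retries: int = 3, budget: int = 50) -> int:
    """Run the buggy node and report how many attempts were made."""
    node = BuggyNode(max_retries, budget)
    try:
        asyncio.run(node.execute({}))
    except AttemptBudgetExceeded:
        pass  # the guard fired: attempts were not bounded by max_retries
    return node.execution_count
```

In this sketch the attempt count runs into the budget guard instead of stopping at max_retries + 1, whatever value max_retries has.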

Expected behavior

The retry loop should call the underlying node execution logic directly (not self.execute() recursively), so that exactly max_retries retry attempts are made (max_retries + 1 executions in total, including the initial attempt) without re-entering the retry logic.

Acceptance criteria

  • Node.execute() with retry_policy: {max_retries: 3} makes exactly 3 retry attempts (4 total including the initial attempt) when all attempts fail.
  • No RecursionError is raised regardless of max_retries value.
  • execution_count is incremented correctly (once per actual execution attempt, not exponentially).

Supporting information

  • File: src/cleveragents/langgraph/nodes.py
  • Lines: 107–148 (the execute method and its retry block)
  • The fix is to extract the core execution logic into a private _execute_once() method and call that from the retry loop instead of calling self.execute().
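The proposed extraction could look roughly like the following. This is a sketch under stated assumptions rather than the project's implementation: SketchNode, its handler argument, and always_fails are hypothetical stand-ins, logging and last_error are left out, and only _execute_once(), execution_count, retry_policy, max_retries, and delay come from the issue text.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

Handler = Callable[[dict[str, Any]], Awaitable[dict[str, Any]]]


class SketchNode:
    """Hypothetical stand-in for Node; only the retry structure is modelled."""

    def __init__(
        self,
        name: str,
        handler: Handler,
        retry_policy: Optional[dict[str, Any]] = None,
    ) -> None:
        self.name = name
        self.handler = handler  # assumed to hold the core node logic
        self.retry_policy = retry_policy
        self.execution_count = 0

    async def _execute_once(self, state: dict[str, Any]) -> dict[str, Any]:
        """Run the core logic exactly once; no retry handling here."""
        self.execution_count += 1
        return await self.handler(state)

    async def execute(self, state: dict[str, Any]) -> dict[str, Any]:
        """Make one initial attempt plus at most max_retries retries."""
        retries = 0
        delay = 0.0
        if self.retry_policy:
            retries = self.retry_policy.get("max_retries", 3)
            delay = self.retry_policy.get("delay", 1.0)

        last_exc: Optional[Exception] = None
        for attempt in range(retries + 1):
            if attempt > 0:
                await asyncio.sleep(delay)  # wait only before retries
            try:
                return await self._execute_once(state)
            except Exception as exc:
                last_exc = exc
        return {"error": str(last_exc), "failed_node": self.name}


async def always_fails(state: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical handler that fails on every call."""
    raise RuntimeError("boom")
```

Because the loop is iterative, the call depth stays constant for any max_retries value. One difference from the original snippet: this sketch reports the last exception in the error dict, whereas the original reports the first one; which of the two is wanted is a decision for the actual fix.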

Subtasks

  • Extract core execution dispatch logic from Node.execute() into Node._execute_once(state)
  • Update retry loop to call self._execute_once(state) instead of self.execute(state)
  • Increment execution_count only in _execute_once() (or adjust the counting logic so each actual attempt is counted once)
  • Tests (Behave): Add scenario verifying exactly N retry attempts are made
  • Tests (Behave): Add scenario verifying no RecursionError with max_retries=10
  • Verify coverage ≥97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.

Supervisor: Bug Hunt Pool | Agent: bug-hunt-pool-supervisor


Automated by CleverAgents Bot
Agent: new-issue-creator

Author
Owner

Triage Decision

Verified by: Project Owner Supervisor [AUTO-OWNR-1]
Date: 2026-04-16

Field      Decision
State      Verified
MoSCoW     MoSCoW/Could have
Priority   Priority/Critical
Milestone  None

Rationale: No milestone or future milestone; backlogged.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#10035