BUG-HUNT: [concurrency] Node.execute retry policy causes exponential recursive retries — stack overflow risk with max_retries ≥ 2 #6515

Open
opened 2026-04-09 21:13:58 +00:00 by HAL9000 · 0 comments
Owner

Bug Report: [concurrency] — Node.execute Retry Policy Triggers Unbounded Recursive Retries

Severity Assessment

  • Impact: A node that has retry_policy set and keeps failing triggers recursive self.execute() calls: each retry attempt re-enters the full execute() method, which starts its own loop of max_retries attempts. With max_retries=3 and a recursion depth of 3, this produces up to 1 + 3 + 9 + 27 = 40 calls. Deeper chains can exhaust the call stack, and because every retry sleeps for the configured delay first, they can also create arbitrarily long delays.
  • Likelihood: Medium — triggered whenever a node has retry_policy configured and repeatedly fails.
  • Priority: High

Location

  • File: src/cleveragents/langgraph/nodes.py
  • Class: Node
  • Method: execute
  • Lines: ~105–130

Description

Inside Node.execute(), when the initial execution raises an exception and retry_policy is configured, the retry loop calls return await self.execute(state) — the full execute() method — recursively. Each such recursive call also checks retry_policy and spawns its own retry loop if it fails. This leads to exponential explosion.

Evidence

async def execute(self, state: GraphState) -> dict[str, Any]:
    self.execution_count += 1
    loop = asyncio.get_event_loop()
    start_time = loop.time()
    try:
        ...  # run the actual node
    except Exception as exc:
        self.last_error = exc
        self.logger.error("Node %s execution failed: %s", self.name, exc)
        if self.config.retry_policy:
            retry_count = self.config.retry_policy.get("max_retries", 3)
            retry_delay = self.config.retry_policy.get("delay", 1.0)
            for _ in range(retry_count):
                await asyncio.sleep(retry_delay)
                try:
                    return await self.execute(state)   # <-- RECURSIVE: re-enters full execute with retry logic!
                except Exception:
                    ...
                    continue
        return {"error": str(exc), "failed_node": self.name}
    finally:
        self.last_execution_time = loop.time() - start_time

Call tree with max_retries=3 and all attempts failing:

  • execute() fails → spawns 3 retries, each calls execute() recursively
    • Retry 1 calls execute() → fails → spawns 3 more retries → ...
    • Retry 2 calls execute() → fails → spawns 3 more retries → ...
    • Retry 3 calls execute() → fails → spawns 3 more retries → ...

Worst-case call count: at recursion depth d there are max_retries^d calls. With max_retries=3 and depth=10 (all fail), the deepest level alone accounts for 3^10 = 59,049 calls.
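The arithmetic behind the call tree can be checked with a few lines of Python. This is only a sketch of the simplified model used in this report: it assumes every attempt fails, every retry in every loop is actually reached, and recursion is cut off at a fixed depth. The function names are hypothetical and do not exist in the codebase.

```python
def calls_at_level(max_retries: int, level: int) -> int:
    """Number of execute() calls at one recursion level (the root is level 0)."""
    return max_retries ** level


def total_calls(max_retries: int, depth: int) -> int:
    """Total execute() calls in a full call tree cut off `depth` levels below the root."""
    return sum(calls_at_level(max_retries, k) for k in range(depth + 1))


def total_sleep_seconds(max_retries: int, depth: int, delay: float = 1.0) -> float:
    """In this model every call except the root is preceded by one sleep of `delay` seconds."""
    return (total_calls(max_retries, depth) - 1) * delay


if __name__ == "__main__":
    print(total_calls(3, 3))          # 1 + 3 + 9 + 27 = 40
    print(calls_at_level(3, 10))      # 59049
    print(total_calls(3, 10))         # 88573
    print(total_sleep_seconds(3, 3))  # 39.0
```

Under these assumptions the 40-call figure is the cumulative total for depth 3, while 3^10 = 59,049 counts only the calls at level 10; the cumulative total for depth 10 is 88,573.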

Additionally, self.execution_count is incremented at the start of every recursive call, so its final value reflects the exponential number of attempts rather than the intended "number of times this node was asked to execute."

Expected Behavior

The retry logic should call the internal execution helper (e.g., _execute_agent, _execute_function, etc.) directly, not the public execute() wrapper, which wraps the call in retry logic again. Each top-level call to execute() should make at most 1 + max_retries attempts in total.

Actual Behavior

Each failed retry attempt re-enters the full execute() method, triggering another retry loop. The total number of calls grows exponentially with recursion depth, with max_retries as the base.

Suggested Fix

Extract the inner dispatch logic into a private _run_once(state) method and call that from the retry loop:

async def execute(self, state: GraphState) -> dict[str, Any]:
    self.execution_count += 1
    ...
    policy = self.config.retry_policy or {}
    max_retries = policy.get("max_retries", 3) if policy else 0
    # Same default as the current code; avoids a KeyError when "delay" is not configured.
    retry_delay = policy.get("delay", 1.0)
    last_exc = None
    for attempt in range(1 + max_retries):
        if attempt > 0:
            await asyncio.sleep(retry_delay)
        try:
            return await self._run_once(state)
        except Exception as exc:
            last_exc = exc
            self.last_error = exc
            self.logger.warning("Node %s attempt %d failed: %s", self.name, attempt + 1, exc)
    return {"error": str(last_exc), "failed_node": self.name}
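To make the intended bound concrete, here is a minimal, self-contained sketch of the flat retry loop. It is not the real Node class: NodeConfig, the fail_times parameter, and attempt_count are hypothetical stand-ins added for illustration, and _run_once is the proposed helper, which does not exist yet. The delay is read with a default of 1.0, matching the current code.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class NodeConfig:
    # Stand-in for the real config object; only retry_policy is modelled.
    retry_policy: Optional[dict[str, Any]] = None


class Node:
    """Minimal stand-in for Node, just enough to exercise the retry loop."""

    def __init__(self, name: str, config: NodeConfig, fail_times: int) -> None:
        self.name = name
        self.config = config
        self.execution_count = 0  # number of top-level execute() calls
        self.attempt_count = 0    # hypothetical counter of individual attempts
        self.last_error: Optional[Exception] = None
        self._fail_times = fail_times  # hypothetical: how many attempts should fail

    async def _run_once(self, state: dict[str, Any]) -> dict[str, Any]:
        # Single attempt with no retry logic, so it is safe to call in a loop.
        self.attempt_count += 1
        if self.attempt_count <= self._fail_times:
            raise RuntimeError(f"attempt {self.attempt_count} failed")
        return {"ok": True, "attempts": self.attempt_count}

    async def execute(self, state: dict[str, Any]) -> dict[str, Any]:
        self.execution_count += 1
        policy = self.config.retry_policy or {}
        max_retries = policy.get("max_retries", 3) if policy else 0
        delay = policy.get("delay", 1.0)
        last_exc: Optional[Exception] = None
        for attempt in range(1 + max_retries):
            if attempt > 0:
                await asyncio.sleep(delay)
            try:
                return await self._run_once(state)
            except Exception as exc:
                last_exc = exc
                self.last_error = exc
        return {"error": str(last_exc), "failed_node": self.name}


async def demo() -> tuple[dict[str, Any], int, int]:
    # A node that never succeeds, with three retries and no delay.
    config = NodeConfig({"max_retries": 3, "delay": 0})
    node = Node("always_fails", config, fail_times=10**6)
    result = await node.execute({})
    return result, node.execution_count, node.attempt_count


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With max_retries=3 and a node that always fails, one execute() call makes exactly four attempts and execution_count stays at 1. Keeping a separate attempt counter is one possible way to preserve the "number of times this node was asked to execute" meaning of execution_count.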

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:27:55 +00:00
Reference
cleveragents/cleveragents-core#6515