UAT: Node.execute() retry policy causes infinite/exponential recursion — retry loop calls self.execute() recursively #1811

Open
opened 2026-04-02 23:54:10 +00:00 by freemo · 1 comment
Owner

Metadata

  • Commit Message: fix(langgraph): prevent exponential recursion in Node.execute() retry policy
  • Branch: fix/langgraph-node-execute-retry-recursion
  • Milestone: v3.5.0
  • Parent Epic: #390

Background and Context

The Node.execute() method in src/cleveragents/langgraph/nodes.py has a critical bug in its retry policy implementation. When a node execution fails and a retry_policy is configured, the retry loop calls self.execute(state) recursively. Because each recursive call runs the same retry policy check, every retry spawns up to N further retries of its own. The result is exponential recursion that ends in a stack overflow or a near-infinite execution loop.

This was discovered during UAT of the LangGraph integration feature area by UAT tester instance uat-tester-3994408-1775170787.

Current Behavior

File: src/cleveragents/langgraph/nodes.py
Lines: 138–147

Buggy code:

if self.config.retry_policy:
    retry_count = self.config.retry_policy.get("max_retries", 3)
    retry_delay = self.config.retry_policy.get("delay", 1.0)
    for _ in range(retry_count):
        await asyncio.sleep(retry_delay)
        try:
            return await self.execute(state)  # BUG: recursive call also triggers retry policy
        except Exception:
            self.logger.warning("Node %s retry attempt failed", self.name)
            continue

With max_retries=2, the function is called roughly 983 or more times (the nested retries grow as 1 + 2 + 4 + 8 + …) before Python's recursion limit or asyncio stack depth is reached.
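
To see how quickly the nested series grows, here is a small arithmetic sketch. It assumes a simplified model in which every failed call spawns max_retries nested calls and nesting is cut off at a hypothetical depth. Both function names are illustrative and are not part of cleveragents.

```python
def calls_with_nested_retries(max_retries: int, depth: int) -> int:
    """Total calls when each failed call spawns `max_retries` nested calls.

    Assumption: nesting stops after `depth` levels. This cutoff is a
    modelling device for the arithmetic, not something the buggy code does.
    """
    return sum(max_retries ** level for level in range(depth + 1))


def calls_with_flat_retries(max_retries: int) -> int:
    """Total calls for a flat loop: one initial attempt plus the retries."""
    return 1 + max_retries


if __name__ == "__main__":
    for depth in range(1, 6):
        nested = calls_with_nested_retries(2, depth)
        print(f"depth={depth} nested={nested} flat={calls_with_flat_retries(2)}")
```

In this model, max_retries=2 already reaches 63 calls at depth 5, while a flat loop stays at 3 regardless of depth.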

Steps to reproduce:

import asyncio
from cleveragents.langgraph.nodes import Node, NodeConfig, NodeType
from cleveragents.langgraph.state import GraphState

async def test_retry():
    call_count = 0

    async def failing_func(state):
        nonlocal call_count
        call_count += 1
        raise ValueError('Always fails')

    agents = {'failing': failing_func}
    node = Node(
        NodeConfig(
            name='retry_node',
            type=NodeType.FUNCTION,
            function='failing',
            retry_policy={'max_retries': 2, 'delay': 0.01}
        ),
        agents=agents
    )

    result = await node.execute(GraphState())
    print(f'Call count: {call_count}')  # Expected: 3 (1 + 2 retries), Actual: ~983+

asyncio.run(test_retry())

Expected Behavior

With max_retries=2, the function should be called exactly 3 times (1 initial attempt + 2 retries), then return the error result. The retry loop must not re-enter the full execute() method.

Root Cause

The retry loop calls self.execute(state), which re-enters the full execute() method, including the retry policy check. Each recursive call spawns another max_retries retries, creating exponential growth: 1 + max_retries + max_retries² + ….

Fix: The retry loop should call the underlying execution logic directly (e.g., the specific _execute_* method based on self.type) rather than calling self.execute() recursively. Alternatively, a _in_retry flag could be used to prevent nested retries.
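
The first option above could look roughly like the following sketch. SketchNode is a hypothetical stand-in for Node, _execute_once is a hypothetical name for the single-attempt logic, and the {"error": ...} return shape is an assumption, since the issue does not show what the real error result looks like.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, Optional

logger = logging.getLogger(__name__)


class SketchNode:
    """Hypothetical stand-in for Node; not the real cleveragents class."""

    def __init__(
        self,
        name: str,
        func: Callable[[Any], Awaitable[Any]],
        retry_policy: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.name = name
        self._func = func
        self.retry_policy = retry_policy

    async def _execute_once(self, state: Any) -> Any:
        # One attempt only; this method never looks at the retry policy.
        return await self._func(state)

    async def execute(self, state: Any) -> Any:
        try:
            return await self._execute_once(state)
        except Exception as exc:
            last_error: Exception = exc

        if self.retry_policy:
            max_retries = self.retry_policy.get("max_retries", 3)
            delay = self.retry_policy.get("delay", 1.0)
            for attempt in range(1, max_retries + 1):
                await asyncio.sleep(delay)
                try:
                    # Call the single-attempt method, not self.execute().
                    return await self._execute_once(state)
                except Exception as exc:
                    last_error = exc
                    logger.warning(
                        "Node %s retry attempt %d failed", self.name, attempt
                    )

        # Assumed error-result shape, for illustration only.
        return {"error": str(last_error)}


async def demo() -> tuple:
    calls = 0

    async def failing(state: Any) -> Any:
        nonlocal calls
        calls += 1
        raise ValueError("Always fails")

    node = SketchNode("retry_node", failing, {"max_retries": 2, "delay": 0.0})
    result = await node.execute({})
    return calls, result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With max_retries=2, demo() records three calls: the initial attempt plus two retries. Keeping the single-attempt logic in its own method means the retry loop has nothing to re-enter, so no guard flag is needed.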

Acceptance Criteria

  • With max_retries=N, the underlying function is called exactly N+1 times (1 initial + N retries) before the error result is returned.
  • No recursive call to self.execute() occurs within the retry loop.
  • The fix is covered by a BDD scenario that asserts the exact call count for a configurable max_retries value.
  • No regression in existing LangGraph node execution tests.
  • nox passes with coverage ≥ 97%.

Supporting Information

  • Discovered by UAT tester instance uat-tester-3994408-1775170787 during LangGraph integration UAT.
  • Affects all Node instances that configure a retry_policy.
  • Severity: Critical — any node with a retry policy will exhibit exponential recursion on failure, causing stack overflows in production.

Subtasks

  • Reproduce the bug with a failing BDD scenario (tagged @tdd_expected_fail) that asserts call count equals max_retries + 1
  • Refactor Node.execute() retry loop to call the appropriate _execute_* method directly instead of self.execute()
  • Remove the @tdd_expected_fail tag once the fix is in place and the scenario passes
  • Add BDD scenarios covering: zero retries, one retry, multiple retries, and retry exhaustion returning error result
  • Add integration test verifying no stack overflow occurs with max_retries=3 and a consistently failing node
  • Run nox (all default sessions) and fix any errors
  • Verify coverage ≥ 97% via nox -s coverage_report
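
The retry scenarios listed in the subtasks can be prototyped with plain asserts before they are written as BDD steps. In this sketch, make_flaky and retry_flat are hypothetical helpers: retry_flat stands in for the fixed Node.execute(), and the {"error": ...} result shape is an assumption. The BDD wiring is not shown because the issue does not name the framework.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple


def make_flaky(
    failures_before_success: int,
) -> Tuple[Callable[[Any], Awaitable[Any]], Dict[str, int]]:
    """Return (func, counter); func fails its first K calls, then succeeds."""
    counter = {"calls": 0}

    async def func(state: Any) -> Any:
        counter["calls"] += 1
        if counter["calls"] <= failures_before_success:
            raise ValueError("simulated failure")
        return "ok"

    return func, counter


async def retry_flat(
    func: Callable[[Any], Awaitable[Any]],
    state: Any,
    max_retries: int,
    delay: float = 0.0,
) -> Any:
    """Hypothetical flat retry loop standing in for the fixed execute()."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        if attempt:  # sleep only before retries, not before the first try
            await asyncio.sleep(delay)
        try:
            return await func(state)
        except Exception as exc:
            last_error = exc
    return {"error": str(last_error)}


async def check_call_counts() -> Dict[int, Tuple[int, Any]]:
    """Run an always-failing function for zero, one, and several retries."""
    results = {}
    for max_retries in (0, 1, 3):
        func, counter = make_flaky(failures_before_success=10**6)
        outcome = await retry_flat(func, {}, max_retries)
        results[max_retries] = (counter["calls"], outcome)
    return results


if __name__ == "__main__":
    for n, (calls, outcome) in asyncio.run(check_call_counts()).items():
        assert calls == n + 1, (n, calls)
        print(f"max_retries={n} calls={calls} outcome={outcome}")
```

Each max_retries value should yield exactly max_retries + 1 calls and an error result on exhaustion. Passing a small failures_before_success value to make_flaky covers the case where a retry eventually succeeds.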

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly (fix(langgraph): prevent exponential recursion in Node.execute() retry policy), followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly (fix/langgraph-node-execute-retry-recursion).
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage ≥ 97%.

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

freemo added this to the v3.5.0 milestone 2026-04-02 23:54:38 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: Critical — Node.execute() retry policy causes infinite/exponential recursion. This is a severe runtime bug that can cause stack overflows and resource exhaustion during plan execution.
  • Milestone: v3.5.0 (already assigned — milestone is past due)
  • MoSCoW: Must Have — The retry mechanism is a core part of the execution engine. Infinite recursion in retry logic is a showstopper that makes the execution engine unreliable. Per the specification, the execution phase must be safe and predictable.

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Reference
cleveragents/cleveragents-core#1811