UAT: PlanGenerationGraph._should_retry() mutates LangGraph state dict directly — violates LangGraph immutable state contract and causes retry count to not persist #3662

Open
opened 2026-04-05 21:13:51 +00:00 by freemo · 0 comments
Owner

Metadata

  • Branch: fix/plan-generation-retry-count-mutation
  • Commit Message: fix(langgraph): move retry_count increment from routing function to validate node
  • Milestone: v3.6.0
  • Parent Epic: #366

Background

LangGraph's state management requires that node functions return state updates as a new dict rather than mutating the state in-place. The state passed to node functions is a snapshot; mutations to it are not persisted to the graph's state store.

Current Behavior

PlanGenerationGraph._should_retry() in src/cleveragents/agents/graphs/plan_generation.py (lines 548–566) directly mutates the state dict instead of returning a state update:

def _should_retry(self, state: PlanGenerationState) -> str:
    validation = state.get("validation_result", {})
    retry_count = state.get("retry_count", 0)

    # Check if validation failed and retries available
    if validation.get("status") == "FAIL" and retry_count < self.max_retries:
        # Increment retry count
        state["retry_count"] = retry_count + 1  # ← DIRECT MUTATION — NOT PERSISTED
        return "retry"

    return "end"

In LangGraph, _should_retry() is used as a routing function (passed to add_conditional_edges()), not as a node function. Routing functions receive the state but their return value is used only for routing decisions — they cannot update state. The mutation state["retry_count"] = retry_count + 1 modifies a local copy of the state that is immediately discarded.

Expected Behavior

The retry count must be incremented by returning it from a node function (not a routing function). The correct approach is:

  1. The validate node should return {"retry_count": state.get("retry_count", 0) + 1} when validation fails
  2. The _should_retry routing function should only read retry_count to decide routing, never write it

Code Location

  • src/cleveragents/agents/graphs/plan_generation.py lines 548–566: _should_retry()
  • src/cleveragents/agents/graphs/plan_generation.py lines 493–546: _validate() — should be updated to increment retry_count

Impact

The retry mechanism in PlanGenerationGraph is broken. When validation fails:

  1. retry_count is never actually incremented in the persisted state
  2. On the next iteration, retry_count is still 0
  3. The graph will retry indefinitely (until max_retries is reached by coincidence or the LangGraph framework detects an infinite loop)
  4. Or, if the state is a frozen/immutable dict, this raises a TypeError at runtime

Steps to Reproduce

from unittest.mock import MagicMock
from cleveragents.agents.graphs.plan_generation import PlanGenerationGraph, PlanGenerationState

llm = MagicMock()
llm.invoke.return_value = "FAIL: syntax error"
graph = PlanGenerationGraph(llm=llm, max_retries=2)

state: PlanGenerationState = {
    "validation_result": {"status": "FAIL", "message": "syntax error"},
    "retry_count": 0,
    # ... other required fields
}
result = graph._should_retry(state)
assert result == "retry"
# Bug: state["retry_count"] is still 0 after the call
# The mutation was to a local copy, not the persisted state
assert state["retry_count"] == 0  # This passes — proving the mutation was lost

Subtasks

  • Move retry count increment from _should_retry() into _validate() node function
  • Update _validate() to return {"retry_count": state.get("retry_count", 0) + 1} when validation fails
  • Update _should_retry() to be a pure read-only routing function
  • Write Behave unit tests verifying retry count increments correctly across retries
  • Verify all nox stages pass; coverage ≥ 97%

Definition of Done

  • _should_retry() is a pure routing function (no state mutations)
  • retry_count is correctly incremented via _validate() node return value
  • Retry logic works correctly for up to max_retries attempts
  • Unit tests pass for retry behavior
  • All nox stages pass; coverage ≥ 97%

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/plan-generation-retry-count-mutation` - **Commit Message**: `fix(langgraph): move retry_count increment from routing function to validate node` - **Milestone**: v3.6.0 - **Parent Epic**: #366 ## Background LangGraph's state management requires that node functions return state updates as a new dict rather than mutating the state in-place. The state passed to node functions is a snapshot; mutations to it are not persisted to the graph's state store. ## Current Behavior `PlanGenerationGraph._should_retry()` in `src/cleveragents/agents/graphs/plan_generation.py` (lines 548–566) directly mutates the state dict instead of returning a state update: ```python def _should_retry(self, state: PlanGenerationState) -> str: validation = state.get("validation_result", {}) retry_count = state.get("retry_count", 0) # Check if validation failed and retries available if validation.get("status") == "FAIL" and retry_count < self.max_retries: # Increment retry count state["retry_count"] = retry_count + 1 # ← DIRECT MUTATION — NOT PERSISTED return "retry" return "end" ``` In LangGraph, `_should_retry()` is used as a **routing function** (passed to `add_conditional_edges()`), not as a node function. Routing functions receive the state but their return value is used only for routing decisions — they cannot update state. The mutation `state["retry_count"] = retry_count + 1` modifies a local copy of the state that is immediately discarded. ## Expected Behavior The retry count must be incremented by returning it from a **node function** (not a routing function). The correct approach is: 1. The `validate` node should return `{"retry_count": state.get("retry_count", 0) + 1}` when validation fails 2. The `_should_retry` routing function should only read `retry_count` to decide routing, never write it ## Code Location - `src/cleveragents/agents/graphs/plan_generation.py` lines 548–566: `_should_retry()` - `src/cleveragents/agents/graphs/plan_generation.py` lines 493–546: `_validate()` — should be updated to increment `retry_count` ## Impact The retry mechanism in `PlanGenerationGraph` is broken. When validation fails: 1. `retry_count` is never actually incremented in the persisted state 2. On the next iteration, `retry_count` is still 0 3. The graph will retry indefinitely (until `max_retries` is reached by coincidence or the LangGraph framework detects an infinite loop) 4. Or, if the state is a frozen/immutable dict, this raises a `TypeError` at runtime ## Steps to Reproduce ```python from unittest.mock import MagicMock from cleveragents.agents.graphs.plan_generation import PlanGenerationGraph, PlanGenerationState llm = MagicMock() llm.invoke.return_value = "FAIL: syntax error" graph = PlanGenerationGraph(llm=llm, max_retries=2) state: PlanGenerationState = { "validation_result": {"status": "FAIL", "message": "syntax error"}, "retry_count": 0, # ... other required fields } result = graph._should_retry(state) assert result == "retry" # Bug: state["retry_count"] is still 0 after the call # The mutation was to a local copy, not the persisted state assert state["retry_count"] == 0 # This passes — proving the mutation was lost ``` ## Subtasks - [ ] Move retry count increment from `_should_retry()` into `_validate()` node function - [ ] Update `_validate()` to return `{"retry_count": state.get("retry_count", 0) + 1}` when validation fails - [ ] Update `_should_retry()` to be a pure read-only routing function - [ ] Write Behave unit tests verifying retry count increments correctly across retries - [ ] Verify all nox stages pass; coverage ≥ 97% ## Definition of Done - [ ] `_should_retry()` is a pure routing function (no state mutations) - [ ] `retry_count` is correctly incremented via `_validate()` node return value - [ ] Retry logic works correctly for up to `max_retries` attempts - [ ] Unit tests pass for retry behavior - [ ] All nox stages pass; coverage ≥ 97% --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
freemo added this to the v3.6.0 milestone 2026-04-05 21:14:02 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#366 Epic: Post-MVP Deferred Work
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#3662
No description provided.