BUG-HUNT: [concurrency] PlanGenerationGraph._should_retry mutates state directly inside a LangGraph conditional edge — violates LangGraph's immutable state contract #6556

Open
opened 2026-04-09 21:19:07 +00:00 by HAL9000 · 0 comments
Owner

Bug Report: [concurrency] — _should_retry conditional edge function mutates workflow state in-place

Severity Assessment

  • Impact: LangGraph conditional edge functions are pure routing functions — they must return a string routing key without side effects. Mutating state inside a conditional edge bypasses LangGraph's state merging and checkpointing machinery. The retry_count increment may be ignored, double-applied, or lost entirely when the graph is checkpointed and resumed, leading to infinite retry loops or incorrect retry counting.
  • Likelihood: High — triggered on every plan generation that hits the validation step
  • Priority: High

Location

  • File: src/cleveragents/agents/graphs/plan_generation.py
  • Class: PlanGenerationGraph
  • Method: _should_retry
  • Lines: 385–393

Description

_should_retry is registered as a conditional edge function in _build_graph:

# plan_generation.py
workflow.add_conditional_edges(
    "validate",
    self._should_retry,  # Registered as routing function
    {"retry": "analyze_requirements", "end": END},
)

However, _should_retry directly mutates the state dict:

# plan_generation.py lines 385-393
def _should_retry(self, state: PlanGenerationState) -> str:
    validation = state.get("validation_result", {})
    retry_count = state.get("retry_count", 0)

    if validation.get("status") == "FAIL" and retry_count < self.max_retries:
        state["retry_count"] = retry_count + 1  # MUTATION in conditional edge!
        return "retry"

    return "end"

In LangGraph, state mutations must happen inside node functions (which return dicts merged into the state). Conditional edge functions are only allowed to read the state and return a routing key. Mutating state in a conditional edge:

  1. May not be persisted by the checkpointer (since LangGraph only checkpoints state after node executions)
  2. Can cause double-increments if the graph replays the edge check
  3. Is an undocumented side-effect that can break graph determinism on resume

Expected Behavior

_should_retry should only read state and return a routing key. The retry_count increment should happen inside the analyze_requirements node (which already runs on retry) or a dedicated intermediate node.

Suggested Fix

Remove the mutation from _should_retry:

def _should_retry(self, state: PlanGenerationState) -> str:
    validation = state.get("validation_result", {})
    retry_count = state.get("retry_count", 0)
    if validation.get("status") == "FAIL" and retry_count < self.max_retries:
        return "retry"
    return "end"

And increment retry_count at the start of _analyze_requirements when it detects it's being retried (e.g., retry_count > 0):

def _analyze_requirements(self, state: PlanGenerationState) -> dict[str, Any]:
    # Increment retry_count if this is a retry
    if state.get("validation_result", {}).get("status") == "FAIL":
        return {"retry_count": state.get("retry_count", 0) + 1, ...}

Category

concurrency / spec-alignment

TDD Note

After this bug is verified, a Type/Testing issue will be created with a TDD test tagged @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: [concurrency] — `_should_retry` conditional edge function mutates workflow state in-place ### Severity Assessment - **Impact**: LangGraph conditional edge functions are pure routing functions — they must return a string routing key without side effects. Mutating state inside a conditional edge bypasses LangGraph's state merging and checkpointing machinery. The `retry_count` increment may be ignored, double-applied, or lost entirely when the graph is checkpointed and resumed, leading to infinite retry loops or incorrect retry counting. - **Likelihood**: High — triggered on every plan generation that hits the validation step - **Priority**: High ### Location - **File**: `src/cleveragents/agents/graphs/plan_generation.py` - **Class**: `PlanGenerationGraph` - **Method**: `_should_retry` - **Lines**: 385–393 ### Description `_should_retry` is registered as a **conditional edge function** in `_build_graph`: ```python # plan_generation.py workflow.add_conditional_edges( "validate", self._should_retry, # Registered as routing function {"retry": "analyze_requirements", "end": END}, ) ``` However, `_should_retry` directly mutates the state dict: ```python # plan_generation.py lines 385-393 def _should_retry(self, state: PlanGenerationState) -> str: validation = state.get("validation_result", {}) retry_count = state.get("retry_count", 0) if validation.get("status") == "FAIL" and retry_count < self.max_retries: state["retry_count"] = retry_count + 1 # MUTATION in conditional edge! return "retry" return "end" ``` In LangGraph, state mutations must happen **inside node functions** (which return dicts merged into the state). Conditional edge functions are only allowed to **read** the state and return a routing key. Mutating state in a conditional edge: 1. May not be persisted by the checkpointer (since LangGraph only checkpoints state after node executions) 2. Can cause double-increments if the graph replays the edge check 3. Is an undocumented side-effect that can break graph determinism on resume ### Expected Behavior `_should_retry` should only read state and return a routing key. The `retry_count` increment should happen inside the `analyze_requirements` node (which already runs on retry) or a dedicated intermediate node. ### Suggested Fix Remove the mutation from `_should_retry`: ```python def _should_retry(self, state: PlanGenerationState) -> str: validation = state.get("validation_result", {}) retry_count = state.get("retry_count", 0) if validation.get("status") == "FAIL" and retry_count < self.max_retries: return "retry" return "end" ``` And increment `retry_count` at the start of `_analyze_requirements` when it detects it's being retried (e.g., `retry_count > 0`): ```python def _analyze_requirements(self, state: PlanGenerationState) -> dict[str, Any]: # Increment retry_count if this is a retry if state.get("validation_result", {}).get("status") == "FAIL": return {"retry_count": state.get("retry_count", 0) + 1, ...} ``` ### Category concurrency / spec-alignment ### TDD Note After this bug is verified, a Type/Testing issue will be created with a TDD test tagged `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter
HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:27:52 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#6556
No description provided.