BUG-HUNT: [concurrency] `PlanGenerationGraph._should_retry` mutates state directly inside a LangGraph conditional edge — violates LangGraph's immutable state contract #6556

New issue

Open

opened 2026-04-09 21:19:07 +00:00 by HAL9000 · 0 comments

HAL9000 commented

2026-04-09 21:19:07 +00:00

Owner

Bug Report: [concurrency] — `_should_retry` conditional edge function mutates workflow state in-place

Severity Assessment

Impact: LangGraph conditional edge functions are pure routing functions — they must return a string routing key without side effects. Mutating state inside a conditional edge bypasses LangGraph's state merging and checkpointing machinery. The retry_count increment may be ignored, double-applied, or lost entirely when the graph is checkpointed and resumed, leading to infinite retry loops or incorrect retry counting.
Likelihood: High — triggered on every plan generation that hits the validation step
Priority: High

Location

File: src/cleveragents/agents/graphs/plan_generation.py
Class: PlanGenerationGraph
Method: _should_retry
Lines: 385–393

Description

_should_retry is registered as a conditional edge function in _build_graph:

# plan_generation.py
workflow.add_conditional_edges(
    "validate",
    self._should_retry,  # Registered as routing function
    {"retry": "analyze_requirements", "end": END},
)

However, _should_retry directly mutates the state dict:

# plan_generation.py lines 385-393
def _should_retry(self, state: PlanGenerationState) -> str:
    validation = state.get("validation_result", {})
    retry_count = state.get("retry_count", 0)

    if validation.get("status") == "FAIL" and retry_count < self.max_retries:
        state["retry_count"] = retry_count + 1  # MUTATION in conditional edge!
        return "retry"

    return "end"

In LangGraph, state mutations must happen inside node functions (which return dicts merged into the state). Conditional edge functions are only allowed to read the state and return a routing key. Mutating state in a conditional edge:

May not be persisted by the checkpointer (since LangGraph only checkpoints state after node executions)
Can cause double-increments if the graph replays the edge check
Is an undocumented side-effect that can break graph determinism on resume

Expected Behavior

_should_retry should only read state and return a routing key. The retry_count increment should happen inside the analyze_requirements node (which already runs on retry) or a dedicated intermediate node.

Suggested Fix

Remove the mutation from _should_retry:

def _should_retry(self, state: PlanGenerationState) -> str:
    validation = state.get("validation_result", {})
    retry_count = state.get("retry_count", 0)
    if validation.get("status") == "FAIL" and retry_count < self.max_retries:
        return "retry"
    return "end"

And increment retry_count at the start of _analyze_requirements when it detects it's being retried (e.g., retry_count > 0):

def _analyze_requirements(self, state: PlanGenerationState) -> dict[str, Any]:
    # Increment retry_count if this is a retry
    if state.get("validation_result", {}).get("status") == "FAIL":
        return {"retry_count": state.get("retry_count", 0) + 1, ...}

TDD Note

After this bug is verified, a Type/Testing issue will be created with a TDD test tagged @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: [concurrency] — `_should_retry` conditional edge function mutates workflow state in-place ### Severity Assessment - **Impact**: LangGraph conditional edge functions are pure routing functions — they must return a string routing key without side effects. Mutating state inside a conditional edge bypasses LangGraph's state merging and checkpointing machinery. The `retry_count` increment may be ignored, double-applied, or lost entirely when the graph is checkpointed and resumed, leading to infinite retry loops or incorrect retry counting. - **Likelihood**: High — triggered on every plan generation that hits the validation step - **Priority**: High ### Location - **File**: `src/cleveragents/agents/graphs/plan_generation.py` - **Class**: `PlanGenerationGraph` - **Method**: `_should_retry` - **Lines**: 385–393 ### Description `_should_retry` is registered as a **conditional edge function** in `_build_graph`: ```python # plan_generation.py workflow.add_conditional_edges( "validate", self._should_retry, # Registered as routing function {"retry": "analyze_requirements", "end": END}, ) ``` However, `_should_retry` directly mutates the state dict: ```python # plan_generation.py lines 385-393 def _should_retry(self, state: PlanGenerationState) -> str: validation = state.get("validation_result", {}) retry_count = state.get("retry_count", 0) if validation.get("status") == "FAIL" and retry_count < self.max_retries: state["retry_count"] = retry_count + 1 # MUTATION in conditional edge! return "retry" return "end" ``` In LangGraph, state mutations must happen **inside node functions** (which return dicts merged into the state). Conditional edge functions are only allowed to **read** the state and return a routing key. Mutating state in a conditional edge: 1. May not be persisted by the checkpointer (since LangGraph only checkpoints state after node executions) 2. Can cause double-increments if the graph replays the edge check 3. Is an undocumented side-effect that can break graph determinism on resume ### Expected Behavior `_should_retry` should only read state and return a routing key. The `retry_count` increment should happen inside the `analyze_requirements` node (which already runs on retry) or a dedicated intermediate node. ### Suggested Fix Remove the mutation from `_should_retry`: ```python def _should_retry(self, state: PlanGenerationState) -> str: validation = state.get("validation_result", {}) retry_count = state.get("retry_count", 0) if validation.get("status") == "FAIL" and retry_count < self.max_retries: return "retry" return "end" ``` And increment `retry_count` at the start of `_analyze_requirements` when it detects it's being retried (e.g., `retry_count > 0`): ```python def _analyze_requirements(self, state: PlanGenerationState) -> dict[str, Any]: # Increment retry_count if this is a retry if state.get("validation_result", {}).get("status") == "FAIL": return {"retry_count": state.get("retry_count", 0) + 1, ...} ``` ### Category concurrency / spec-alignment ### TDD Note After this bug is verified, a Type/Testing issue will be created with a TDD test tagged `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter

HAL9000 added the

labels

2026-04-09 21:26:08 +00:00

HAL9000 added this to the v3.2.0 milestone

2026-04-09 21:27:52 +00:00

HAL9000 referenced this issue