BUG-HUNT: [error-handling] Graph continues execution after a node fails #1610

Open
opened 2026-04-02 23:10:36 +00:00 by freemo · 1 comment
Owner

Metadata

  • Branch: fix/error-handling-graph-node-failure-propagation
  • Commit Message: fix(error-handling): propagate node errors to stop graph execution on failure
  • Milestone: v3.3.0
  • Parent Epic: #362

Bug Report

Severity Assessment

  • Impact: If a node in the graph fails, the graph continues to execute, potentially leading to unexpected behavior and incorrect results.
  • Likelihood: High. Any unhandled exception in a node will trigger this behavior.
  • Priority: High

Location

  • File: src/cleveragents/langgraph/graph.py
  • Function/Class: _setup_node_stream_subscriptions
  • Lines: 153–171

Description

The on_error handler in the _setup_node_stream_subscriptions method only logs the error; it neither stops the graph nor propagates the error to the caller. As a result, when a node fails the graph keeps running and the caller of the execute method never learns about the failure.
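The failure mode can be illustrated with a minimal, standard-library-only sketch. `LoggingOnlyGraph` and the three node functions are hypothetical stand-ins, not the real `LangGraph` class or its stream router; the sketch assumes that a node failure reaches a handler which, like the one reported here, only logs.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("graph_sketch")


class LoggingOnlyGraph:
    """Hypothetical stand-in for the graph; not the real LangGraph class."""

    def __init__(self, nodes: List[Callable[[], None]]) -> None:
        self.nodes = nodes
        self.executed: List[str] = []

    def on_error(self, error: Exception, name: str) -> None:
        # Mirrors the reported handler: log the error and return.
        logger.error("Error in node stream %s: %s", name, error)

    def execute(self) -> None:
        for node in self.nodes:
            try:
                node()
                self.executed.append(node.__name__)
            except Exception as error:
                # The failure is handed to on_error, which swallows it.
                self.on_error(error, node.__name__)


def load() -> None:
    pass


def transform() -> None:
    raise ValueError("bad input")


def save() -> None:
    pass


graph = LoggingOnlyGraph([load, transform, save])
graph.execute()  # returns normally even though transform() raised
```

In this sketch `save` still runs after `transform` fails, and `execute` returns without raising, which is the behaviour the report describes.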

Evidence

def _setup_node_stream_subscriptions(self) -> None:
    for node_name in self.nodes:
        stream_name = f'__{self.name}_node_{node_name}__'
        if stream_name in self.stream_router.observables:
            observable = self.stream_router.observables[stream_name]

            def on_next(_msg: Any) -> None:
                pass

            def on_error(error: Exception, name: str = stream_name) -> None:
                self.logger.error('Error in node stream %s: %s', name, error)

            def on_completed() -> None:
                pass

            observer = Observer(
                on_next=on_next, on_error=on_error, on_completed=on_completed
            )
            observable.subscribe(observer)

Expected Behavior

The graph should stop execution when a node fails, and the error should be propagated to the caller of the execute method.

Suggested Fix

Modify the on_error handler so that it both stops graph execution and propagates the error: call the stop method of the LangGraph class, then forward the error to the main stream so that it reaches the caller of the execute method.
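One possible shape for the fix, as a sketch under stated assumptions rather than the actual patch: `FailFastGraph`, its `_stopped` and `_error` attributes, and the node functions are hypothetical, and the real `stop` method and main-stream propagation in `LangGraph` may work differently. The idea is that the handler records the first error and stops the graph, and `execute` re-raises the recorded error.

```python
from typing import Callable, List, Optional


class FailFastGraph:
    """Hypothetical stand-in; the real LangGraph API may differ."""

    def __init__(self, nodes: List[Callable[[], None]]) -> None:
        self.nodes = nodes
        self.executed: List[str] = []
        self._stopped = False
        self._error: Optional[Exception] = None

    def stop(self) -> None:
        # Assumed semantics: no further nodes are scheduled after stop().
        self._stopped = True

    def on_error(self, error: Exception, name: str) -> None:
        # Keep the first failure so it can be surfaced, then halt the graph.
        if self._error is None:
            self._error = error
        self.stop()

    def execute(self) -> None:
        for node in self.nodes:
            if self._stopped:
                break
            try:
                node()
                self.executed.append(node.__name__)
            except Exception as error:
                self.on_error(error, node.__name__)
        if self._error is not None:
            # Fail fast: surface the node error to the caller.
            raise self._error


def load() -> None:
    pass


def transform() -> None:
    raise ValueError("bad input")


def save() -> None:
    pass


graph = FailFastGraph([load, transform, save])
caught: Optional[Exception] = None
try:
    graph.execute()
except ValueError as error:
    caught = error
```

Here `save` never runs and the caller receives the original `ValueError`. Whether the real fix should re-raise the original exception or wrap it in a graph-level error is a design choice this sketch does not settle.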

Category

error-handling

Subtasks

  • Write a failing Behave scenario that reproduces the bug (TDD — must be merged before fix)
  • Modify on_error handler in _setup_node_stream_subscriptions to call self.stop() and propagate the error to the main stream
  • Ensure the error is surfaced to the caller of the execute method (fail fast, do not suppress)
  • Verify no # type: ignore suppressions are introduced; all code must pass nox -e typecheck
  • Tests (Behave): Add/update scenarios covering node failure propagation and graph halt behavior
  • Tests (Robot): Add integration test verifying graph halts on node error
  • Verify coverage >=97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage >= 97%.

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

freemo added this to the v3.3.0 milestone 2026-04-02 23:11:16 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: Priority/High (confirmed) — graph continuing execution after node failure violates the fail-fast principle
  • MoSCoW: MoSCoW/Must Have — per CONTRIBUTING.md, the project follows a fail-fast philosophy where exceptions must propagate. A graph silently continuing after a node failure can produce incorrect results and corrupt downstream state. This is a correctness bug in the core execution engine.
  • Milestone: v3.3.0 (confirmed — Corrections + Subplans + Checkpoints)
  • Parent Epic: #362

The specification requires that errors propagate and are not suppressed. The on_error handler in _setup_node_stream_subscriptions only logs the error without stopping execution, which directly violates the fail-fast design principle.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Blocks
#362 Epic: Security & Safety Hardening
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#1610