BUG-HUNT: [BOUNDARY] Agent lookup failure masking causes silent configuration errors #7151

Open
opened 2026-04-10 08:14:57 +00:00 by HAL9000 · 1 comment
Owner

Background and Context

In src/cleveragents/langgraph/nodes.py, the _execute_agent() method (lines ~170–185) contains a boundary condition where a missing agent in the agents registry is silently handled by synthesizing a fake "Agent not found" response. This directly violates the fail-fast principle mandated in CONTRIBUTING.md ("Do not suppress errors. Let exceptions propagate to top-level execution") and the argument validation mandate ("All public and protected class methods must validate arguments as the first guard").

When a caller specifies an agent name that does not exist in the registry — due to a typo, a missing registration, or a misconfigured actor YAML — the graph continues executing with a fabricated assistant message. The caller receives no exception, no error state, and no indication that the configuration is broken. This makes debugging extremely difficult and can cause downstream nodes to process invalid data silently.

Current Behavior

async def _execute_agent(self, state: GraphState) -> dict[str, Any]:
    if not self.config.agent:
        raise ValueError(f"Agent node {self.name} has no agent specified")
    agent = self.agents.get(self.config.agent)
    if not agent:
        # Fallback: synthesize a response when agent instance is unavailable
        return {
            "messages": [
                {
                    "role": "assistant",
                    "content": f"Agent {self.config.agent} not found",
                    "node": self.name,
                    "agent": self.config.agent,
                }
            ],
            "current_node": self.name,
        }

When self.config.agent is set but the agent is not registered, the method returns a synthesized dict instead of raising an exception. The graph continues executing as if the agent responded normally.
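
The masking effect can be reproduced in isolation. The following is a minimal sketch, not the project's code: `FallbackNode` and `downstream_consumes` are hypothetical stand-ins that model only the fallback branch quoted above and a downstream node that trusts any assistant message it receives.

```python
import asyncio
from typing import Any


class FallbackNode:
    """Hypothetical stand-in that reproduces only the fallback branch."""

    def __init__(self, name: str, agent_name: str, agents: dict[str, Any]) -> None:
        self.name = name
        self.agent_name = agent_name
        self.agents = agents

    async def execute(self) -> dict[str, Any]:
        agent = self.agents.get(self.agent_name)
        if not agent:
            # Same shape as the synthesized response in the issue.
            return {
                "messages": [
                    {
                        "role": "assistant",
                        "content": f"Agent {self.agent_name} not found",
                        "node": self.name,
                        "agent": self.agent_name,
                    }
                ],
                "current_node": self.name,
            }
        raise NotImplementedError("real agent execution is out of scope for this sketch")


def downstream_consumes(update: dict[str, Any]) -> str:
    """Hypothetical downstream node: treats the last message as real agent output."""
    last = update["messages"][-1]
    return last["content"].upper()


# "sumarizer" is a deliberate typo for the registered "summarizer" agent.
node = FallbackNode("summarize", "sumarizer", {"summarizer": object()})
update = asyncio.run(node.execute())
result = downstream_consumes(update)
```

No exception is raised at any point: the typo only surfaces as the text of an ordinary assistant message, which the downstream step processes like any other output.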

Expected Behavior

_execute_agent() must raise a ValueError (or a domain-specific ConfigurationError) immediately when the requested agent is not found in the registry — consistent with the fail-fast pattern already applied to the if not self.config.agent check above it. No fallback synthesis should occur. If explicit fallback behavior is ever needed, it must be an opt-in configuration option, not the default.

async def _execute_agent(self, state: GraphState) -> dict[str, Any]:
    if not self.config.agent:
        raise ValueError(f"Agent node {self.name} has no agent specified")
    agent = self.agents.get(self.config.agent)
    if not agent:
        raise ValueError(
            f"Agent node {self.name}: agent '{self.config.agent}' not found in registry. "
            f"Available agents: {sorted(self.agents.keys())}"
        )
    # ... proceed with real agent execution
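
To show what a caller would see with this change, here is a small self-contained sketch. `AgentNodeSketch` is a hypothetical stand-in for the real node class (its constructor and attribute names are assumptions); only the two guards from the snippet above are modeled.

```python
import asyncio
from typing import Any


class AgentNodeSketch:
    """Hypothetical stand-in for the real node class; only the lookup guards are modeled."""

    def __init__(self, name: str, agent_name: str, agents: dict[str, Any]) -> None:
        self.name = name
        self.agent_name = agent_name
        self.agents = agents

    async def execute_agent(self) -> Any:
        if not self.agent_name:
            raise ValueError(f"Agent node {self.name} has no agent specified")
        agent = self.agents.get(self.agent_name)
        if not agent:
            # Fail fast and list what is registered, so a typo is obvious.
            raise ValueError(
                f"Agent node {self.name}: agent '{self.agent_name}' not found in registry. "
                f"Available agents: {sorted(self.agents.keys())}"
            )
        return agent


# "sumarizer" is a deliberate typo for the registered "summarizer" agent.
node = AgentNodeSketch("summarize", "sumarizer", {"writer": object(), "summarizer": object()})
try:
    asyncio.run(node.execute_agent())
except ValueError as exc:
    message = str(exc)
else:
    message = ""
```

With the sorted list of registered names in the message, a misspelled agent name can be compared against the correct one directly from the traceback.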

Acceptance Criteria

  • _execute_agent() raises ValueError when the requested agent is not found in the registry
  • The error message includes the node name, the missing agent name, and the list of available agents
  • No synthesized fallback response is returned under any default code path
  • Validation is the first guard after the self.config.agent check (fail-fast order preserved)
  • No # type: ignore directives introduced
  • All existing Behave BDD scenarios for nodes.py continue to pass
  • New Behave scenarios cover: valid agent found (passes), agent name not in registry (raises ValueError), empty agent name (raises ValueError)
  • New Robot Framework integration test covers the end-to-end actor graph execution path with a missing agent registration
  • nox -s typecheck passes (Pyright strict)
  • Coverage remains ≥ 97%
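
The three scenarios named in the criteria above can be outlined as plain checks. This is only a sketch of the expected outcomes, not Behave step definitions: `lookup_agent` and `raises_value_error` are hypothetical helpers that model the two guards, and the real scenarios would exercise `_execute_agent()` itself.

```python
import asyncio
from typing import Any


async def lookup_agent(node_name: str, agent_name: str, agents: dict[str, Any]) -> Any:
    """Hypothetical helper that models only the two guards in _execute_agent()."""
    if not agent_name:
        raise ValueError(f"Agent node {node_name} has no agent specified")
    agent = agents.get(agent_name)
    if not agent:
        raise ValueError(
            f"Agent node {node_name}: agent '{agent_name}' not found in registry. "
            f"Available agents: {sorted(agents.keys())}"
        )
    return agent


def raises_value_error(agent_name: str, agents: dict[str, Any]) -> bool:
    """Run one lookup and report whether it failed fast with ValueError."""
    try:
        asyncio.run(lookup_agent("n1", agent_name, agents))
    except ValueError:
        return True
    return False


registry = {"writer": object()}

# One entry per scenario; each value is True when the expected outcome occurred.
outcomes = {
    "valid agent found": not raises_value_error("writer", registry),
    "agent name not in registry": raises_value_error("wirter", registry),
    "empty agent name": raises_value_error("", registry),
}
```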

Supporting Information

Location: src/cleveragents/langgraph/nodes.py, method _execute_agent(), lines ~170–185

Impact: Configuration errors (typos in agent names, missing agent registrations) are silently masked. Graphs appear to execute successfully but produce incorrect results containing fabricated assistant messages. Downstream nodes that depend on real agent output will process garbage data without any error signal, making root-cause analysis extremely difficult.

Suggested Fix: Replace the fallback return block with a raise ValueError(...) that includes the node name, the missing agent name, and the list of available agents for immediate debuggability. If opt-in fallback behavior is desired in the future, it should be gated behind an explicit NodeConfig field (e.g., allow_missing_agent: bool = False).
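
If the opt-in fallback is ever added, one possible shape is sketched here. Everything except the `allow_missing_agent` field name is an assumption: `NodeConfigSketch` is a hypothetical stand-in for `NodeConfig`, and `resolve_agent` is an illustrative helper, not an existing function in nodes.py.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class NodeConfigSketch:
    """Hypothetical stand-in for NodeConfig; only the fields used here."""

    agent: str
    allow_missing_agent: bool = False  # strict by default, as suggested in the issue


def resolve_agent(
    node_name: str, config: NodeConfigSketch, agents: dict[str, Any]
) -> Optional[Any]:
    """Return the agent, or None only when the config explicitly opted in."""
    agent = agents.get(config.agent)
    if agent is not None:
        return agent
    if config.allow_missing_agent:
        # The caller accepted a missing agent; it decides what to do next.
        return None
    raise ValueError(
        f"Agent node {node_name}: agent '{config.agent}' not found in registry. "
        f"Available agents: {sorted(agents.keys())}"
    )


registry = {"writer": "writer-agent"}

# Opt-in: a missing agent yields None instead of an exception.
lenient = resolve_agent("n1", NodeConfigSketch("ghost", allow_missing_agent=True), registry)

# Default: the same lookup fails fast.
try:
    resolve_agent("n1", NodeConfigSketch("ghost"), registry)
    strict_raised = False
except ValueError:
    strict_raised = True
```

Keeping the flag on the node configuration means the lenient path is visible in the actor YAML that enables it, rather than being hidden in the node implementation.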

Related Issues:

  • #7053: _execute_function() in the same file had a similar pattern (no allowlist validation before registry lookup); this issue is the sibling boundary condition in _execute_agent()
  • #6666: LangGraph.execute() bypasses StateManager invariants (same file, fail-fast violations)

TDD Requirement: Per CONTRIBUTING.md Bug Fix Workflow, a companion TDD: issue must be created and merged before this fix. The bug fix PR must remove @tdd_expected_fail from all @tdd_issue_<N> scenarios.

Backlog note: This issue was discovered during autonomous operation
on milestone v3.2.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.

Metadata

  • Branch: bugfix/langgraph-agent-lookup-fail-fast
  • Commit Message: fix(langgraph): raise ValueError on missing agent registry lookup in _execute_agent
  • Milestone: (backlog — see note above)
  • Parent Epic: #7053

Subtasks

  • Create companion TDD issue (TDD: [BOUNDARY] Agent lookup failure masking causes silent configuration errors) with @tdd_issue @tdd_issue_<N> @tdd_expected_fail Behave scenarios
  • Remove the synthesized fallback return block in _execute_agent()
  • Replace with raise ValueError(...) including node name, missing agent name, and available agents list
  • Add Behave BDD unit test scenarios: valid agent found, agent not in registry (raises), empty agent name (raises)
  • Add Robot Framework integration test for end-to-end actor graph execution with missing agent registration
  • Remove @tdd_expected_fail tag from all @tdd_issue_<N> scenarios in the bug fix PR
  • Run nox -s typecheck and confirm zero Pyright strict errors
  • Run nox -s coverage_report and confirm coverage ≥ 97%

Definition of Done

  • _execute_agent() raises ValueError (not returns) when agent is not found
  • Error message includes node name, missing agent name, and available agents
  • All new Behave BDD scenarios pass (including those previously tagged @tdd_expected_fail)
  • Robot Framework integration test passes
  • nox -s typecheck passes (zero Pyright strict errors)
  • All nox stages pass
  • Coverage ≥ 97%
  • PR merged and this issue closed

Automated by CleverAgents Bot
Supervisor: Acting on behalf of: UAT Testing | Agent: new-issue-creator

Author
Owner

Verified — Bug: agent lookup failure masking causes silent configuration errors. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#7151