BUG-HUNT: [RESOURCE-LEAK] ThreadPoolExecutor resource leak in graph node execution #7134

Open
opened 2026-04-10 08:06:44 +00:00 by HAL9000 · 2 comments
Owner

Background and Context

The _register_node_executor() method in src/cleveragents/langgraph/graph.py creates a ThreadPoolExecutor in a with block to bridge synchronous callers into the async async_executor coroutine. The future.result() call sits inside the with block, so the result is retrieved before the executor shuts down and the code is functionally correct, but the pattern creates a per-call executor that is torn down immediately after each invocation. Under high concurrency, when multiple graph nodes execute simultaneously, this results in repeated executor creation/destruction cycles, potential race conditions during shutdown, and possible thread exhaustion, because no shared pool bounds the total number of worker threads.

Current Behavior

def sync_executor(msg: StreamMessage) -> StreamMessage:
    def run_async() -> Any:
        return asyncio.run(async_executor(msg))

    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(run_async)
        return future.result()  # Executor shuts down after this returns

Each call to sync_executor allocates a brand-new ThreadPoolExecutor, submits one task, blocks for the result, then immediately shuts the executor down. Under concurrent node execution:

  • Thread pool creation/teardown overhead accumulates per node per invocation.
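  • If future.result() raises, the exception propagates to the caller once the context exit has completed shutdown(wait=True); the with block does not suppress it, but every failure still pays the full executor teardown cost.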
  • No upper bound on total threads created across concurrent node calls.
  • Executor lifecycle is not shared or reused, defeating the purpose of a pool.
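The unbounded thread growth described above can be reproduced with a minimal sketch. It is an illustration under stated assumptions, not the code from graph.py: async_executor here is a stand-in coroutine, messages are plain strings rather than StreamMessage, and the barrier exists only to hold every worker alive at the same moment so the count is deterministic.

```python
import asyncio
import concurrent.futures
import threading

CALLERS = 8  # number of simulated concurrent node calls (assumption)
barrier = threading.Barrier(CALLERS)
worker_ids = set()
lock = threading.Lock()


async def async_executor(msg: str) -> str:
    # Stand-in for the real coroutine; the actual node logic is not shown in the issue.
    return msg.upper()


def sync_executor(msg: str) -> str:
    def run_async() -> str:
        with lock:
            worker_ids.add(threading.get_ident())
        # Keep all workers alive simultaneously so their thread ids are distinct.
        barrier.wait(timeout=10)
        return asyncio.run(async_executor(msg))

    # Per-call executor, as in the pattern reported by this issue.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        return executor.submit(run_async).result()


callers = [
    threading.Thread(target=sync_executor, args=(f"m{i}",)) for i in range(CALLERS)
]
for t in callers:
    t.start()
for t in callers:
    t.join()

# One worker thread per concurrent call: nothing caps the total.
distinct_workers = len(worker_ids)
print(distinct_workers)
```

Because each call owns a private pool, the number of live worker threads tracks the number of in-flight calls rather than any configured limit.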

Expected Behavior

A shared, long-lived ThreadPoolExecutor (or equivalent) should be used across all node executions within a graph instance, with proper lifecycle management tied to the graph's own lifecycle (created on graph init, shut down on graph teardown). This eliminates per-call overhead and provides a bounded thread pool.
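One possible shape for the expected behaviour is sketched below. GraphSketch, its max_workers parameter, and the "graph-node" thread name prefix are hypothetical names chosen for illustration; the real class in graph.py may be structured differently.

```python
import asyncio
import concurrent.futures
from typing import Any


class GraphSketch:
    """Hypothetical stand-in for the graph class; not the real LangGraph API."""

    def __init__(self, max_workers: int = 4) -> None:
        # Created once per graph instance and reused by every node call.
        self._executor = concurrent.futures.ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix="graph-node"
        )

    async def async_executor(self, msg: Any) -> Any:
        # Placeholder for the real async node logic.
        await asyncio.sleep(0)
        return msg

    def sync_executor(self, msg: Any) -> Any:
        def run_async() -> Any:
            return asyncio.run(self.async_executor(msg))

        # Submit to the shared pool instead of building a new executor.
        return self._executor.submit(run_async).result()

    def close(self) -> None:
        # Explicit teardown; waits for in-flight work to finish.
        self._executor.shutdown(wait=True)

    def __enter__(self) -> "GraphSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


with GraphSketch(max_workers=2) as graph:
    results = [graph.sync_executor(i) for i in range(5)]
print(results)
```

A design point to weigh: with a bounded shared pool, a sync_executor call made from inside one of the pool's own workers blocks that worker while waiting for another, so nested calls can deadlock once all workers are occupied. The audit of call sites should confirm whether such nesting occurs.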

Acceptance Criteria

  • _register_node_executor() does not create a new ThreadPoolExecutor per invocation.
  • A shared executor is initialised once (e.g., in __init__ or _setup) and reused across all node calls.
  • The shared executor is properly shut down when the graph is torn down (e.g., via __del__, a close() method, or a context manager on the graph itself).
  • No thread exhaustion occurs under concurrent multi-node execution.
  • All existing BDD unit tests and Robot Framework integration tests continue to pass.
  • Coverage remains ≥ 97%.

Supporting Information

  • File: src/cleveragents/langgraph/graph.py
  • Method: _register_node_executor() (~lines 172–197)
  • Affected area: LangGraph bridge layer; impacts all graph node executions that use the sync→async bridge.
  • Related issues: #7122 (state corruption in concurrent graph execution), #7089 (unsafe event loop access in RouteBridge)

Metadata

  • Branch: bugfix/backlog-langgraph-threadpool-resource-leak
  • Commit Message: fix(langgraph): replace per-call ThreadPoolExecutor with shared pool in _register_node_executor
  • Milestone: (backlog — no milestone assigned)
  • Parent Epic: (see orphan note below)

Subtasks

  • Audit all call sites of _register_node_executor() to understand full scope of impact
  • Introduce a shared ThreadPoolExecutor attribute on the LangGraph class (or equivalent owner)
  • Refactor sync_executor to use the shared executor instead of creating a new one per call
  • Add proper shutdown of the shared executor in graph teardown
  • Update or add BDD Behave scenarios covering concurrent node execution resource behaviour
  • Verify no thread exhaustion under simulated concurrent load (integration test or benchmark)
  • Run full nox suite and confirm coverage ≥ 97%
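The concurrent-load verification subtask could start from a check like the following. It is a sketch under assumptions: MAX_WORKERS and CALLS are arbitrary values, async_executor is a stand-in coroutine, and a second pool is used only to simulate many simultaneous callers.

```python
import asyncio
import concurrent.futures
import threading

MAX_WORKERS = 4  # assumed bound for the shared pool
CALLS = 50       # assumed number of simulated concurrent node calls

shared = concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS)
seen = set()
seen_lock = threading.Lock()


async def async_executor(msg: int) -> int:
    # Stand-in for real async node work.
    await asyncio.sleep(0.001)
    return msg * 2


def sync_executor(msg: int) -> int:
    def run_async() -> int:
        # Record which shared-pool worker ran this call.
        with seen_lock:
            seen.add(threading.get_ident())
        return asyncio.run(async_executor(msg))

    return shared.submit(run_async).result()


# Drive the bridge from many caller threads at once.
with concurrent.futures.ThreadPoolExecutor(max_workers=CALLS) as callers:
    results = list(callers.map(sync_executor, range(CALLS)))
shared.shutdown(wait=True)

# The set of worker threads can never exceed the pool bound.
print(len(seen) <= MAX_WORKERS)
```

The same measurement could be wrapped in a Behave step or a Robot Framework keyword; the assertion that matters is that the worker count stays at or below the configured bound regardless of how many calls are in flight.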

Definition of Done

  • Shared executor replaces per-call executor in _register_node_executor()
  • Executor is properly shut down on graph teardown (no leaked threads)
  • BDD unit test scenario added/updated covering the resource management behaviour
  • Robot Framework integration test verifies concurrent node execution does not exhaust threads
  • All nox stages pass (nox with no arguments)
  • Coverage ≥ 97%
  • No # type: ignore suppressions introduced
  • Documentation updated if public API surface changes
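For the "no leaked threads" criterion, one way to check teardown is to name the pool's threads and look for survivors after shutdown. This is a minimal sketch; the "graph-node" prefix is a hypothetical choice, not a name taken from the codebase.

```python
import concurrent.futures
import threading

PREFIX = "graph-node"  # hypothetical thread name prefix


def live_workers() -> list[str]:
    # Names of currently alive threads that belong to the pool.
    return [t.name for t in threading.enumerate() if t.name.startswith(PREFIX)]


executor = concurrent.futures.ThreadPoolExecutor(
    max_workers=3, thread_name_prefix=PREFIX
)
list(executor.map(lambda x: x + 1, range(10)))

during = live_workers()        # idle workers persist until shutdown
executor.shutdown(wait=True)   # joins every worker thread
after = live_workers()

print(len(during) >= 1, after)
```

Setting thread_name_prefix on the shared executor makes this kind of check straightforward in tests and also helps when reading thread dumps.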

Backlog note: This issue was discovered during autonomous operation
on milestone v3.2.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

Author
Owner

⚠️ Orphan Notice — Manual Epic Linking Required

This issue was created by an automated agent and could not be linked to a proper parent Epic at creation time. No open Epic was found that specifically covers LangGraph resource management or the sync→async executor bridge layer.

Current provisional link: This issue has been set to block #7023 ([AUTO-BUG-POOL] Bug Detection Report Cycle 2) as a temporary tracking parent. This is not a proper Epic relationship per CONTRIBUTING.md hierarchy rules.

Action required by a maintainer:

  1. Identify or create an appropriate parent Epic covering LangGraph runtime resource management.
  2. Link this issue to that Epic using Forgejo's dependency system (this issue blocks the parent Epic).
  3. Remove or update the provisional link to #7023 if appropriate.

Related issues that may belong to the same Epic:

  • #7122 — State corruption in concurrent graph execution (unsafe Behave state)
  • #7089 — Unsafe event loop access in RouteBridge
  • #6756 — RxPyLangGraphBridge._create_graph_executor uses deprecated async pattern
  • #6523 — bridge.py _create_node_operator calls loop.create_task() on a closed loop

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

Author
Owner

Verified — Resource leak: ThreadPoolExecutor leak in graph node execution. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#7134