UAT: SubplanExecutionService._execute_with_timeout creates a new ThreadPoolExecutor per subplan call and leaks threads on timeout #3943

Open
opened 2026-04-06 07:40:23 +00:00 by freemo · 0 comments
Owner

Metadata

  • Branch: fix/concurrency/subplan-timeout-executor-leak
  • Commit Message: fix(concurrency): reuse executor in SubplanExecutionService._execute_with_timeout to prevent thread leak
  • Milestone: None — backlog
  • Parent Epic: #368

Backlog note: This issue was discovered during autonomous UAT operation
on milestone v3.4.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Background and Context

SubplanExecutionService._execute_with_timeout() creates a new ThreadPoolExecutor(max_workers=1) for every single subplan call. This is wasteful in the normal case and causes a thread leak in the timeout case, because wait=False on pool.shutdown() does not terminate the running thread — it continues executing in the background indefinitely.

This was discovered via code-level analysis of src/cleveragents/application/services/subplan_execution_service.py against concurrency best practices and resource management requirements.


Current Behavior

_execute_with_timeout() creates a new executor per call:

# src/cleveragents/application/services/subplan_execution_service.py
def _execute_with_timeout(
    self,
    status: SubplanStatus,
    timeout_seconds: int,
) -> tuple[SubplanStatus, dict[str, str]]:
    pool = ThreadPoolExecutor(max_workers=1)  # NEW EXECUTOR PER CALL
    future: Future[tuple[SubplanStatus, dict[str, str]]] = pool.submit(
        self._execute_one_with_retry, status
    )
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        future.cancel()
        error_msg = (
            f"SubplanTimeoutError: subplan timed out after {timeout_seconds}s"
        )
        return self._error_status(status, error_msg), {}
    finally:
        pool.shutdown(wait=False, cancel_futures=True)  # wait=False — thread may still be running

When a timeout occurs:

  1. future.cancel() is a no-op on an already-running Future
  2. pool.shutdown(wait=False, cancel_futures=True) does not wait for the thread to finish
  3. The underlying thread continues running _execute_one_with_retry indefinitely until it completes naturally
  4. The ThreadPoolExecutor is garbage-collected but its spawned thread is still alive, consuming CPU and memory

In a scenario with many parallel subplans that time out, the system accumulates leaked threads that continue running in the background, potentially causing interference with subsequent operations.

Code Location: src/cleveragents/application/services/subplan_execution_service.py_execute_with_timeout() method (lines ~310–335)


Expected Behavior

Resources (threads, executors) created during subplan execution must be properly cleaned up after use, including in timeout and error scenarios. The system must not accumulate leaked threads over time.


Acceptance Criteria

  • No new ThreadPoolExecutor is created per _execute_with_timeout call
  • Timed-out subplan threads do not continue running after the timeout is raised
  • The fix is verified under sequential and parallel execution modes
  • No regression in normal (non-timeout) subplan execution

Subtasks

  • Investigate and confirm thread leak via a targeted Behave scenario (e.g., assert active thread count does not grow after repeated timeouts)
  • Choose fix strategy: (a) reuse outer pool, (b) convert to asyncio.wait_for, or (c) add cooperative CancellationToken to _execute_one_with_retry
  • Implement chosen fix in SubplanExecutionService._execute_with_timeout
  • Tests (Behave): Add/update scenarios covering timeout + thread cleanup, sequential and parallel modes
  • Tests (Robot): Add integration scenario for subplan timeout resource cleanup
  • Verify coverage >= 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly (fix(concurrency): reuse executor in SubplanExecutionService._execute_with_timeout to prevent thread leak), followed by a blank line, then additional lines providing relevant implementation details.
  • The commit is pushed to the remote on branch fix/concurrency/subplan-timeout-executor-leak.
  • The commit is submitted as a pull request to master, reviewed by at least two contributors, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage >= 97%.

Supporting Information

  • Feature Area: Async and Concurrency Patterns — Concurrent Plan Execution
  • Severity: Medium — thread leak under timeout conditions; each timed-out subplan leaves a dangling thread that cannot be reclaimed
  • Recommended Fix Options:
    1. Reuse the outer pool: Pass the outer ThreadPoolExecutor to _execute_with_timeout instead of creating a new one per call
    2. Use asyncio.wait_for pattern: Convert to async and use asyncio.wait_for with proper task cancellation
    3. Add a cooperative cancellation token: Pass a CancellationToken (already implemented in AsyncWorker) to _execute_one_with_retry so the running thread can detect the timeout and exit cooperatively
  • Related Epic: #368 Epic: Subplans & Parallelism

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/concurrency/subplan-timeout-executor-leak` - **Commit Message**: `fix(concurrency): reuse executor in SubplanExecutionService._execute_with_timeout to prevent thread leak` - **Milestone**: None — backlog - **Parent Epic**: #368 > **Backlog note:** This issue was discovered during autonomous UAT operation > on milestone v3.4.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- ## Background and Context `SubplanExecutionService._execute_with_timeout()` creates a **new `ThreadPoolExecutor(max_workers=1)`** for every single subplan call. This is wasteful in the normal case and causes a thread leak in the timeout case, because `wait=False` on `pool.shutdown()` does not terminate the running thread — it continues executing in the background indefinitely. This was discovered via code-level analysis of `src/cleveragents/application/services/subplan_execution_service.py` against concurrency best practices and resource management requirements. --- ## Current Behavior `_execute_with_timeout()` creates a new executor per call: ```python # src/cleveragents/application/services/subplan_execution_service.py def _execute_with_timeout( self, status: SubplanStatus, timeout_seconds: int, ) -> tuple[SubplanStatus, dict[str, str]]: pool = ThreadPoolExecutor(max_workers=1) # NEW EXECUTOR PER CALL future: Future[tuple[SubplanStatus, dict[str, str]]] = pool.submit( self._execute_one_with_retry, status ) try: return future.result(timeout=timeout_seconds) except TimeoutError: future.cancel() error_msg = ( f"SubplanTimeoutError: subplan timed out after {timeout_seconds}s" ) return self._error_status(status, error_msg), {} finally: pool.shutdown(wait=False, cancel_futures=True) # wait=False — thread may still be running ``` When a timeout occurs: 1. `future.cancel()` is a no-op on an already-running `Future` 2. `pool.shutdown(wait=False, cancel_futures=True)` does **not** wait for the thread to finish 3. The underlying thread continues running `_execute_one_with_retry` indefinitely until it completes naturally 4. The `ThreadPoolExecutor` is garbage-collected but its spawned thread is still alive, consuming CPU and memory In a scenario with many parallel subplans that time out, the system accumulates leaked threads that continue running in the background, potentially causing interference with subsequent operations. **Code Location:** `src/cleveragents/application/services/subplan_execution_service.py` — `_execute_with_timeout()` method (lines ~310–335) --- ## Expected Behavior Resources (threads, executors) created during subplan execution must be properly cleaned up after use, including in timeout and error scenarios. The system must not accumulate leaked threads over time. --- ## Acceptance Criteria - [ ] No new `ThreadPoolExecutor` is created per `_execute_with_timeout` call - [ ] Timed-out subplan threads do not continue running after the timeout is raised - [ ] The fix is verified under sequential and parallel execution modes - [ ] No regression in normal (non-timeout) subplan execution --- ## Subtasks - [ ] Investigate and confirm thread leak via a targeted Behave scenario (e.g., assert active thread count does not grow after repeated timeouts) - [ ] Choose fix strategy: (a) reuse outer pool, (b) convert to `asyncio.wait_for`, or (c) add cooperative `CancellationToken` to `_execute_one_with_retry` - [ ] Implement chosen fix in `SubplanExecutionService._execute_with_timeout` - [ ] Tests (Behave): Add/update scenarios covering timeout + thread cleanup, sequential and parallel modes - [ ] Tests (Robot): Add integration scenario for subplan timeout resource cleanup - [ ] Verify coverage >= 97% via `nox -s coverage_report` - [ ] Run `nox` (all default sessions), fix any errors --- ## Definition of Done This issue is complete when: - [ ] All subtasks above are completed and checked off. - [ ] A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly (`fix(concurrency): reuse executor in SubplanExecutionService._execute_with_timeout to prevent thread leak`), followed by a blank line, then additional lines providing relevant implementation details. - [ ] The commit is pushed to the remote on branch `fix/concurrency/subplan-timeout-executor-leak`. - [ ] The commit is submitted as a **pull request** to `master`, reviewed by at least two contributors, and **merged** before this issue is marked done. - [ ] All nox stages pass. - [ ] Coverage >= 97%. --- ## Supporting Information - **Feature Area:** Async and Concurrency Patterns — Concurrent Plan Execution - **Severity:** Medium — thread leak under timeout conditions; each timed-out subplan leaves a dangling thread that cannot be reclaimed - **Recommended Fix Options:** 1. **Reuse the outer pool**: Pass the outer `ThreadPoolExecutor` to `_execute_with_timeout` instead of creating a new one per call 2. **Use `asyncio.wait_for` pattern**: Convert to async and use `asyncio.wait_for` with proper task cancellation 3. **Add a cooperative cancellation token**: Pass a `CancellationToken` (already implemented in `AsyncWorker`) to `_execute_one_with_retry` so the running thread can detect the timeout and exit cooperatively - **Related Epic:** #368 Epic: Subplans & Parallelism --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#368 Epic: Subplans & Parallelism
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#3943
No description provided.