BUG-HUNT: [concurrency] Race condition in container_executor.py subprocess cleanup causes resource leaks #7238

Open
opened 2026-04-10 10:27:25 +00:00 by HAL9000 · 4 comments
Owner

Metadata

  • Commit Message: fix(tool): synchronize subprocess cleanup in ContainerToolExecutor._run_command to prevent resource leaks
  • Branch: bugfix/m7-container-executor-subprocess-cleanup-race
  • Milestone: v3.6.0
  • Parent Epic: #5504

Background and Context

The ContainerToolExecutor._run_command method in src/cleveragents/tool/container_executor.py has a race condition between timeout handling and subprocess cleanup. This affects the container tool execution path used in devcontainer environments for autonomous agent sessions, where resource leaks can accumulate over time with many tool executions.

Per the specification, ContainerToolExecutor is the core subprocess management component for container-routed tool execution (ADR-043, ADR-039). Reliable subprocess cleanup is essential for long-running autonomous agent sessions.

Current Behavior

When a subprocess times out, proc.kill() sends SIGKILL but the finally block may attempt to close streams before the process has fully terminated. This creates three race conditions:

  1. Kill/Wait Race: proc.kill() sends SIGKILL but proc.wait() may return before the process fully terminates on high-load systems.
  2. Stream Cleanup Race: The finally block tries to close streams while the killed process may still be writing to them.
  3. Resource Leak: If streams are not properly closed due to the race, file descriptors leak and accumulate over time.

Affected code (src/cleveragents/tool/container_executor.py, ContainerToolExecutor._run_command, lines ~640–700):

def _run_command(self, cmd: list[str], timeout: int, *, stdin_data: str | None = None) -> _ExecResult:
    try:
        proc = subprocess.Popen(...)
        try:
            raw_stdout = _read_bounded(proc.stdout, _MAX_OUTPUT_BYTES)
            raw_stderr = _read_bounded(proc.stderr, _MAX_OUTPUT_BYTES)
            remaining = timeout - (time.monotonic() - start)
            proc.wait(timeout=max(remaining, 0))
        except subprocess.TimeoutExpired:
            proc.kill()  # ← Race condition: kill may not complete before cleanup
            proc.wait()  # ← May still be running when cleanup code executes
            # ... early return with timed_out=True ...
        finally:
            # Ensure all handles are closed even on unexpected errors.
            for stream in (proc.stdout, proc.stderr, proc.stdin):
                if stream and not stream.closed:
                    stream.close()  # ← May race with kill() cleanup

Impact:

  • Resource leaks: File descriptors may not be properly closed
  • System stability: Accumulated resource leaks can impact system performance
  • Unpredictable behavior: Race conditions cause different behavior under different system loads
  • Container integration: Failed cleanup affects devcontainer execution reliability

Expected Behavior

Process cleanup should be synchronized to ensure proper resource cleanup:

  • Wait for process termination confirmation before closing streams
  • Handle SIGKILL cleanup gracefully with a secondary timeout
  • Ensure no file descriptor leaks even in timeout scenarios
  • Wrap stream close operations in try/except OSError to handle already-closed streams

Suggested fix direction:

except subprocess.TimeoutExpired:
    proc.kill()
    try:
        proc.wait(timeout=5.0)  # Give kill time to complete
    except subprocess.TimeoutExpired:
        pass  # Process is truly stuck, but we still need cleanup
    elapsed = (time.monotonic() - start) * 1000.0
    return _ExecResult(
        stdout="",
        stderr=f"Command timed out after {timeout}s",
        exit_code=-1,
        duration_ms=elapsed,
        timed_out=True,
    )
finally:
    # Enhanced cleanup with proper synchronization
    if proc.poll() is None:
        try:
            proc.terminate()
            proc.wait(timeout=1.0)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait(timeout=1.0)
    for stream in (proc.stdout, proc.stderr, proc.stdin):
        if stream and not stream.closed:
            try:
                stream.close()
            except OSError:
                pass  # Stream may already be closed by process termination

Subtasks

  • Reproduce the race condition with a controlled test scenario (high-load subprocess timeout)
  • Fix _run_command timeout handling: add secondary proc.wait(timeout=5.0) after proc.kill()
  • Add proc.poll() check in finally block before stream cleanup to ensure process has terminated
  • Wrap stream close() calls in try/except OSError to handle already-closed streams
  • Tests (Behave): Add scenario for subprocess timeout cleanup race condition in container_executor
  • Tests (Robot): Add integration test for container executor timeout handling and resource cleanup
  • Verify coverage >= 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass
  • Coverage >= 97%

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

## Metadata - **Commit Message**: `fix(tool): synchronize subprocess cleanup in ContainerToolExecutor._run_command to prevent resource leaks` - **Branch**: `bugfix/m7-container-executor-subprocess-cleanup-race` - **Milestone**: v3.6.0 - **Parent Epic**: #5504 ## Background and Context The `ContainerToolExecutor._run_command` method in `src/cleveragents/tool/container_executor.py` has a race condition between timeout handling and subprocess cleanup. This affects the container tool execution path used in devcontainer environments for autonomous agent sessions, where resource leaks can accumulate over time with many tool executions. Per the specification, `ContainerToolExecutor` is the core subprocess management component for container-routed tool execution (ADR-043, ADR-039). Reliable subprocess cleanup is essential for long-running autonomous agent sessions. ## Current Behavior When a subprocess times out, `proc.kill()` sends SIGKILL but the `finally` block may attempt to close streams before the process has fully terminated. This creates three race conditions: 1. **Kill/Wait Race**: `proc.kill()` sends SIGKILL but `proc.wait()` may return before the process fully terminates on high-load systems. 2. **Stream Cleanup Race**: The `finally` block tries to close streams while the killed process may still be writing to them. 3. **Resource Leak**: If streams are not properly closed due to the race, file descriptors leak and accumulate over time. **Affected code** (`src/cleveragents/tool/container_executor.py`, `ContainerToolExecutor._run_command`, lines ~640–700): ```python def _run_command(self, cmd: list[str], timeout: int, *, stdin_data: str | None = None) -> _ExecResult: try: proc = subprocess.Popen(...) try: raw_stdout = _read_bounded(proc.stdout, _MAX_OUTPUT_BYTES) raw_stderr = _read_bounded(proc.stderr, _MAX_OUTPUT_BYTES) remaining = timeout - (time.monotonic() - start) proc.wait(timeout=max(remaining, 0)) except subprocess.TimeoutExpired: proc.kill() # ← Race condition: kill may not complete before cleanup proc.wait() # ← May still be running when cleanup code executes # ... early return with timed_out=True ... finally: # Ensure all handles are closed even on unexpected errors. for stream in (proc.stdout, proc.stderr, proc.stdin): if stream and not stream.closed: stream.close() # ← May race with kill() cleanup ``` **Impact:** - **Resource leaks**: File descriptors may not be properly closed - **System stability**: Accumulated resource leaks can impact system performance - **Unpredictable behavior**: Race conditions cause different behavior under different system loads - **Container integration**: Failed cleanup affects devcontainer execution reliability ## Expected Behavior Process cleanup should be synchronized to ensure proper resource cleanup: - Wait for process termination confirmation before closing streams - Handle SIGKILL cleanup gracefully with a secondary timeout - Ensure no file descriptor leaks even in timeout scenarios - Wrap stream close operations in `try/except OSError` to handle already-closed streams **Suggested fix direction:** ```python except subprocess.TimeoutExpired: proc.kill() try: proc.wait(timeout=5.0) # Give kill time to complete except subprocess.TimeoutExpired: pass # Process is truly stuck, but we still need cleanup elapsed = (time.monotonic() - start) * 1000.0 return _ExecResult( stdout="", stderr=f"Command timed out after {timeout}s", exit_code=-1, duration_ms=elapsed, timed_out=True, ) finally: # Enhanced cleanup with proper synchronization if proc.poll() is None: try: proc.terminate() proc.wait(timeout=1.0) except subprocess.TimeoutExpired: proc.kill() proc.wait(timeout=1.0) for stream in (proc.stdout, proc.stderr, proc.stdin): if stream and not stream.closed: try: stream.close() except OSError: pass # Stream may already be closed by process termination ``` ## Subtasks - [ ] Reproduce the race condition with a controlled test scenario (high-load subprocess timeout) - [ ] Fix `_run_command` timeout handling: add secondary `proc.wait(timeout=5.0)` after `proc.kill()` - [ ] Add `proc.poll()` check in `finally` block before stream cleanup to ensure process has terminated - [ ] Wrap stream `close()` calls in `try/except OSError` to handle already-closed streams - [ ] Tests (Behave): Add scenario for subprocess timeout cleanup race condition in `container_executor` - [ ] Tests (Robot): Add integration test for container executor timeout handling and resource cleanup - [ ] Verify coverage >= 97% via `nox -s coverage_report` - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. - All nox stages pass - Coverage >= 97% --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: new-issue-creator
HAL9000 added this to the v3.6.0 milestone 2026-04-10 10:27:34 +00:00
Author
Owner

⚠️ Label Application Note: This issue requires the following labels to be applied manually (label endpoint blocked by security policy):

  • State/Unverified (ID: 846)
  • Type/Bug (ID: 849)
  • Priority/Critical (ID: 858)

Per CONTRIBUTING.md, all Type/Bug issues are Priority/Critical.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

⚠️ **Label Application Note**: This issue requires the following labels to be applied manually (label endpoint blocked by security policy): - `State/Unverified` (ID: 846) - `Type/Bug` (ID: 849) - `Priority/Critical` (ID: 858) Per CONTRIBUTING.md, all `Type/Bug` issues are `Priority/Critical`. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: new-issue-creator
Author
Owner

Verified — Critical concurrency bug: race condition in container_executor.py subprocess cleanup. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Critical concurrency bug: race condition in container_executor.py subprocess cleanup. MoSCoW: Must-have. Priority: Critical. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Author
Owner

Verified — Critical concurrency bug: race condition in container_executor.py subprocess cleanup. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Critical concurrency bug: race condition in container_executor.py subprocess cleanup. MoSCoW: Must-have. Priority: Critical. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Author
Owner

Verified — Critical concurrency bug: race condition in container_executor.py subprocess cleanup. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Critical concurrency bug: race condition in container_executor.py subprocess cleanup. MoSCoW: Must-have. Priority: Critical. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core#7238
No description provided.