BUG-HUNT: [resource] StdioTransport.stop() uncaught TimeoutExpired after SIGKILL leaves _process unreset — resource leak and broken future stop/start calls #6575

Open
opened 2026-04-09 21:41:16 +00:00 by HAL9000 · 0 comments
Owner

Bug Report: [resource] — StdioTransport.stop() uncaught TimeoutExpired after SIGKILL leaves _process unreset

Severity Assessment

  • Impact: StdioTransport._process is never set to None when wait(timeout=2.0) raises after SIGKILL. Future calls to stop() re-enter the same code path; is_alive returns False (process exited), start() raises RuntimeError("Transport already started") because _process is not None. The subprocess pipe FDs are also never released, causing FD leaks in servers that restart many LSP instances.
  • Likelihood: Low in practice (a SIGKILLed process rarely survives 2 seconds), but the code path is reachable in kernel-level edge cases (zombie processes, cgrouped environments that defer reaping) and is provably incorrect.
  • Priority: High

Location

  • File: src/cleveragents/lsp/transport.py
  • Function: StdioTransport.stop()
  • Lines: 158–171

Description

After terminate() times out, stop() sends SIGKILL and calls self._process.wait(timeout=2.0). If this second wait also times out (raising subprocess.TimeoutExpired), the exception propagates uncaught. The cleanup lines code = self._process.returncode and self._process = None (lines 168–169) are skipped entirely, leaving self._process pointing at the (still-running or zombie) process object.

Consequences:

  1. Subsequent stop() calls: hits self._process.poll() is not None → returns the old (stale) return code, sets self._process = None, and returns, masking the earlier failure.
  2. Subsequent start() calls: self._process is not None and self.is_alive — since the process is dead, is_alive returns False, so the guard on line 92 (if self._process is not None and self.is_alive) does NOT raise. A new Popen is created and assigned to self._process, abandoning the old zombie handle without closing its pipes (FD leak).
  3. The uncaught TimeoutExpired propagates to callers of stop() (e.g., LspLifecycleManager._shutdown_server) which do not handle it, causing the lifecycle entry to be left in an inconsistent state.

Evidence

# transport.py lines 156-171
self._process.terminate()
try:
    self._process.wait(timeout=timeout)
except subprocess.TimeoutExpired:
    logger.warning(
        "lsp.transport.force_killing",
        pid=self._process.pid,
    )
    self._process.kill()
    self._process.wait(timeout=2.0)   # ← raises TimeoutExpired; no try/except

# These lines are SKIPPED if TimeoutExpired is raised above:
code = self._process.returncode
self._process = None                  # ← never executes on exception
logger.info("lsp.transport.stopped", exit_code=code)
return code

Expected Behavior

stop() should always reset self._process = None and close the subprocess pipes, even if the post-SIGKILL wait times out. The exception (if any) should be logged and either swallowed or re-raised after cleanup.

Actual Behavior

If wait(timeout=2.0) raises subprocess.TimeoutExpired, the exception propagates before self._process = None is reached, leaving the object in a broken state.

Suggested Fix

Wrap the post-SIGKILL wait in a try/except and perform cleanup unconditionally in a finally block:

self._process.kill()
try:
    self._process.wait(timeout=2.0)
except subprocess.TimeoutExpired:
    logger.error("lsp.transport.kill_timeout", pid=self._process.pid)
finally:
    # Always release the process handle and close pipes
    with contextlib.suppress(OSError):
        if self._process.stdin:
            self._process.stdin.close()
        if self._process.stdout:
            self._process.stdout.close()
        if self._process.stderr:
            self._process.stderr.close()
    code = self._process.returncode
    self._process = None

Category

resource

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: [resource] — `StdioTransport.stop()` uncaught `TimeoutExpired` after SIGKILL leaves `_process` unreset ### Severity Assessment - **Impact**: `StdioTransport._process` is never set to `None` when `wait(timeout=2.0)` raises after SIGKILL. Future calls to `stop()` re-enter the same code path; `is_alive` returns `False` (process exited), `start()` raises `RuntimeError("Transport already started")` because `_process is not None`. The subprocess pipe FDs are also never released, causing FD leaks in servers that restart many LSP instances. - **Likelihood**: Low in practice (a SIGKILLed process rarely survives 2 seconds), but the code path is reachable in kernel-level edge cases (zombie processes, cgrouped environments that defer reaping) and is provably incorrect. - **Priority**: High ### Location - **File**: `src/cleveragents/lsp/transport.py` - **Function**: `StdioTransport.stop()` - **Lines**: 158–171 ### Description After `terminate()` times out, `stop()` sends `SIGKILL` and calls `self._process.wait(timeout=2.0)`. If this second `wait` *also* times out (raising `subprocess.TimeoutExpired`), the exception propagates uncaught. The cleanup lines `code = self._process.returncode` and `self._process = None` (lines 168–169) are skipped entirely, leaving `self._process` pointing at the (still-running or zombie) process object. Consequences: 1. Subsequent `stop()` calls: hits `self._process.poll() is not None` → returns the old (stale) return code, sets `self._process = None`, and returns, masking the earlier failure. 2. Subsequent `start()` calls: `self._process is not None and self.is_alive` — since the process is dead, `is_alive` returns `False`, so the guard on line 92 (`if self._process is not None and self.is_alive`) does NOT raise. A new `Popen` is created and assigned to `self._process`, abandoning the old zombie handle without closing its pipes (FD leak). 3. The uncaught `TimeoutExpired` propagates to callers of `stop()` (e.g., `LspLifecycleManager._shutdown_server`) which do not handle it, causing the lifecycle entry to be left in an inconsistent state. ### Evidence ```python # transport.py lines 156-171 self._process.terminate() try: self._process.wait(timeout=timeout) except subprocess.TimeoutExpired: logger.warning( "lsp.transport.force_killing", pid=self._process.pid, ) self._process.kill() self._process.wait(timeout=2.0) # ← raises TimeoutExpired; no try/except # These lines are SKIPPED if TimeoutExpired is raised above: code = self._process.returncode self._process = None # ← never executes on exception logger.info("lsp.transport.stopped", exit_code=code) return code ``` ### Expected Behavior `stop()` should always reset `self._process = None` and close the subprocess pipes, even if the post-SIGKILL `wait` times out. The exception (if any) should be logged and either swallowed or re-raised **after** cleanup. ### Actual Behavior If `wait(timeout=2.0)` raises `subprocess.TimeoutExpired`, the exception propagates before `self._process = None` is reached, leaving the object in a broken state. ### Suggested Fix Wrap the post-SIGKILL `wait` in a `try/except` and perform cleanup unconditionally in a `finally` block: ```python self._process.kill() try: self._process.wait(timeout=2.0) except subprocess.TimeoutExpired: logger.error("lsp.transport.kill_timeout", pid=self._process.pid) finally: # Always release the process handle and close pipes with contextlib.suppress(OSError): if self._process.stdin: self._process.stdin.close() if self._process.stdout: self._process.stdout.close() if self._process.stderr: self._process.stderr.close() code = self._process.returncode self._process = None ``` ### Category resource ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter
HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:52:46 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#6575
No description provided.