Bug: LSP Transport Process Resource Leak on Exception in StdioTransport.start() #7044

Open
opened 2026-04-10 07:23:58 +00:00 by HAL9000 · 2 comments
Owner

Metadata

  • Branch: bugfix/m3.6.0-lsp-transport-resource-leak
  • Commit Message: fix(lsp): cleanup subprocess on failed initialization in StdioTransport.start()
  • Milestone: v3.6.0
  • Parent Epic: #824

Background and Context

The StdioTransport class in src/cleveragents/lsp/transport.py manages the full lifecycle of an LSP server subprocess. The start() method spawns the subprocess via subprocess.Popen, but if any subsequent initialization step raises an exception after the process handle is assigned to self._process, the subprocess is never cleaned up. The orphaned process continues running in the background, accumulating as a zombie until the OS reclaims it — or until resource exhaustion occurs.

This was identified during LSP module analysis (see #7037). The transport module is referenced in docs/specification.md LSP Server Lifecycle (lines 20744–20758) and was introduced in #826.

Current Behavior

In StdioTransport.start(), self._process is assigned the Popen object inside a try/except block that only catches FileNotFoundError and OSError during the Popen() call itself. However, any exception raised after self._process is assigned (e.g., during logging, post-spawn validation, or future initialization steps added by contributors) will propagate without triggering process cleanup.

# src/cleveragents/lsp/transport.py — start() method (current)
try:
    self._process = subprocess.Popen(...)
except FileNotFoundError as exc:
    raise LspError(...) from exc
except OSError as exc:
    raise LspError(...) from exc

# <- If anything here raises, self._process is set but the subprocess is never terminated
logger.info("lsp.transport.started", pid=self._process.pid, command=self._command)

The logger.info(...) call after Popen is currently safe, but the pattern is fragile: any future code added between the Popen assignment and the end of start() creates a resource leak window. The absence of a cleanup guard violates the resource management contract expected of a lifecycle-owning class.

Expected Behavior

If any exception occurs after the subprocess is successfully spawned but before start() returns normally, the subprocess must be terminated and waited on before the exception propagates. The _process handle must be reset to None so that is_alive and stop() reflect the correct state.

Suggested fix pattern:

try:
    self._process = subprocess.Popen(...)
except FileNotFoundError as exc:
    raise LspError(...) from exc
except OSError as exc:
    raise LspError(...) from exc

try:
    logger.info("lsp.transport.started", pid=self._process.pid, command=self._command)
    # ... any future post-spawn initialization ...
except Exception:
    # Ensure the spawned process is not left running as a zombie
    try:
        self._process.terminate()
        self._process.wait(timeout=_GRACEFUL_SHUTDOWN_TIMEOUT)
    except Exception:  # noqa: BLE001 — best-effort cleanup, must not mask original
        pass
    self._process = None
    raise

Acceptance Criteria

  • StdioTransport.start() wraps all post-Popen initialization in a try/except cleanup guard
  • On any exception after successful spawn, the subprocess is terminated (terminate() + wait()) and self._process is reset to None
  • The cleanup itself does not suppress the original exception
  • The fix is covered by a TDD issue-capture test (see Subtasks)
  • All nox stages pass
  • Coverage >= 97%

Supporting Information

  • File: src/cleveragents/lsp/transport.py
  • Class/Method: StdioTransport.start()
  • Severity: High — can cause process accumulation and resource exhaustion in long-running agents that repeatedly start/stop LSP servers
  • Category: resource management
  • Related: #826 (LSP runtime implementation), #7037 (LSP module analysis), Epic #824
  • Spec reference: docs/specification.md LSP Server Lifecycle (lines 20744–20758)

Implementation

Pull request: #10597

## Metadata - **Branch**: `bugfix/m3.6.0-lsp-transport-resource-leak` - **Commit Message**: `fix(lsp): cleanup subprocess on failed initialization in StdioTransport.start()` - **Milestone**: v3.6.0 - **Parent Epic**: #824 ## Background and Context The `StdioTransport` class in `src/cleveragents/lsp/transport.py` manages the full lifecycle of an LSP server subprocess. The `start()` method spawns the subprocess via `subprocess.Popen`, but if any subsequent initialization step raises an exception after the process handle is assigned to `self._process`, the subprocess is never cleaned up. The orphaned process continues running in the background, accumulating as a zombie until the OS reclaims it — or until resource exhaustion occurs. This was identified during LSP module analysis (see #7037). The transport module is referenced in `docs/specification.md` LSP Server Lifecycle (lines 20744–20758) and was introduced in #826. ## Current Behavior In `StdioTransport.start()`, `self._process` is assigned the `Popen` object inside a `try/except` block that only catches `FileNotFoundError` and `OSError` during the `Popen()` call itself. However, any exception raised **after** `self._process` is assigned (e.g., during logging, post-spawn validation, or future initialization steps added by contributors) will propagate without triggering process cleanup. ```python # src/cleveragents/lsp/transport.py — start() method (current) try: self._process = subprocess.Popen(...) except FileNotFoundError as exc: raise LspError(...) from exc except OSError as exc: raise LspError(...) from exc # <- If anything here raises, self._process is set but the subprocess is never terminated logger.info("lsp.transport.started", pid=self._process.pid, command=self._command) ``` The `logger.info(...)` call after `Popen` is currently safe, but the pattern is fragile: any future code added between the `Popen` assignment and the end of `start()` creates a resource leak window. The absence of a cleanup guard violates the resource management contract expected of a lifecycle-owning class. ## Expected Behavior If any exception occurs after the subprocess is successfully spawned but before `start()` returns normally, the subprocess must be terminated and waited on before the exception propagates. The `_process` handle must be reset to `None` so that `is_alive` and `stop()` reflect the correct state. **Suggested fix pattern:** ```python try: self._process = subprocess.Popen(...) except FileNotFoundError as exc: raise LspError(...) from exc except OSError as exc: raise LspError(...) from exc try: logger.info("lsp.transport.started", pid=self._process.pid, command=self._command) # ... any future post-spawn initialization ... except Exception: # Ensure the spawned process is not left running as a zombie try: self._process.terminate() self._process.wait(timeout=_GRACEFUL_SHUTDOWN_TIMEOUT) except Exception: # noqa: BLE001 — best-effort cleanup, must not mask original pass self._process = None raise ``` ## Acceptance Criteria - [ ] `StdioTransport.start()` wraps all post-`Popen` initialization in a `try/except` cleanup guard - [ ] On any exception after successful spawn, the subprocess is terminated (`terminate()` + `wait()`) and `self._process` is reset to `None` - [ ] The cleanup itself does not suppress the original exception - [ ] The fix is covered by a TDD issue-capture test (see Subtasks) - [ ] All `nox` stages pass - [ ] Coverage >= 97% ## Supporting Information - **File**: `src/cleveragents/lsp/transport.py` - **Class/Method**: `StdioTransport.start()` - **Severity**: High — can cause process accumulation and resource exhaustion in long-running agents that repeatedly start/stop LSP servers - **Category**: resource management - **Related**: #826 (LSP runtime implementation), #7037 (LSP module analysis), Epic #824 - **Spec reference**: `docs/specification.md` LSP Server Lifecycle (lines 20744–20758) ## Implementation Pull request: #10597
HAL9000 added this to the v3.6.0 milestone 2026-04-10 07:24:52 +00:00
Author
Owner

Verified — Critical resource leak: LSP Transport process resource leak on exception. MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Critical resource leak: LSP Transport process resource leak on exception. MoSCoW: Must-have. Priority: Critical. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Author
Owner

Implementation Attempt — Success

Implemented the subprocess cleanup fix in StdioTransport.start().
Wrapped post-Popen logger.info call in a try/except guard that calls stop() before re-raising, preventing zombie LSP server process leaks.
Added two new Behave scenarios in features/lsp_transport_coverage.feature for thorough coverage.

Changes committed to branch pr-fix-10597 and PR created.

PR: #11160


Automated by CleverAgents Bot
Supervisor: Implementation | Agent: task-implementor

**Implementation Attempt** — Success Implemented the subprocess cleanup fix in ``StdioTransport.start()``. Wrapped post-Popen ``logger.info`` call in a try/except guard that calls ``stop()`` before re-raising, preventing zombie LSP server process leaks. Added two new Behave scenarios in ``features/lsp_transport_coverage.feature`` for thorough coverage. Changes committed to branch ``pr-fix-10597`` and PR created. PR: https://git.cleverthis.com/cleveragents/cleveragents-core/pulls/11160 --- Automated by CleverAgents Bot Supervisor: Implementation | Agent: task-implementor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core#7044
No description provided.