UAT: LspLifecycleManager.restart_server() holds _lock during blocking I/O — potential deadlock under concurrent access #3026

Closed
opened 2026-04-05 04:05:24 +00:00 by freemo · 3 comments
Owner

Metadata

  • Branch: fix/lsp-lifecycle-restart-lock-deadlock
  • Commit Message: fix(lsp): release lock before blocking I/O in restart_server()
  • Milestone: v3.6.0
  • Parent Epic: #824

Background

The LspLifecycleManager in src/cleveragents/lsp/lifecycle.py is responsible for managing the lifecycle of LSP server processes. Thread safety is critical because the LSP Runtime can be accessed concurrently by multiple actor threads.

start_server() correctly uses a 3-phase approach to avoid holding the lock during blocking operations:

  • Phase 1 (short lock): Check state and reserve the slot.
  • Phase 2 (no lock): Perform blocking work — subprocess spawn, LSP handshake.
  • Phase 3 (short lock): Commit the new client into shared state.

However, restart_server() does not follow this pattern. It acquires self._lock and holds it for the entire operation, including all blocking I/O.

Current Behavior

restart_server() acquires self._lock and then performs multiple blocking operations while holding it:

def restart_server(self, name: str) -> LspClient:
    with self._lock:  # Lock acquired here
        managed = self._servers.get(name)
        ...
        managed.transport.stop()  # Blocking I/O while holding lock

        transport = StdioTransport(...)
        transport.start()  # Blocking subprocess spawn while holding lock

        client = LspClient(transport, server_name=name)
        try:
            client.initialize(managed.workspace_path)  # Blocking LSP handshake (up to 60s) while holding lock
        except Exception:
            transport.stop()
            raise

The client.initialize() call can block for up to 60 seconds (the _REQUEST_TIMEOUT in client.py). During this time, the lock is held, preventing any other thread from calling health_check(), get_client(), stop_server(), or list_running().

This is a potential deadlock scenario: if _get_healthy_client() in LspRuntime calls restart_server() from one thread while another thread is waiting to acquire the lock for health_check(), the second thread will block indefinitely.

Expected Behavior

restart_server() should follow the same 3-phase pattern as start_server():

  • Phase 1 (short lock): Read current state, remove the old entry, and mark the slot as restarting.
  • Phase 2 (no lock): Stop the old transport, spawn the new process, and perform the LSP handshake.
  • Phase 3 (short lock): Commit the new client into shared state.

No blocking I/O should occur while self._lock is held.

Subtasks

  • Refactor restart_server() to use the 3-phase lock pattern (read state → release lock → blocking I/O → acquire lock → commit state)
  • Ensure the old transport is stopped outside the lock (Phase 2)
  • Ensure StdioTransport.start() is called outside the lock (Phase 2)
  • Ensure client.initialize() is called outside the lock (Phase 2)
  • Handle error cases (failed restart) correctly without leaving shared state inconsistent
  • Tests (unit): Add concurrent-access test asserting health_check() is not blocked while restart_server() is in progress
  • Tests (unit): Add test asserting restart_server() does not hold the lock during client.initialize()
  • Verify coverage >= 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • restart_server() follows the same 3-phase lock pattern as start_server(), with no blocking I/O performed while self._lock is held.
  • Concurrent calls to health_check(), get_client(), stop_server(), and list_running() are not blocked while a restart_server() is in progress.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage >= 97%.

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/lsp-lifecycle-restart-lock-deadlock` - **Commit Message**: `fix(lsp): release lock before blocking I/O in restart_server()` - **Milestone**: v3.6.0 - **Parent Epic**: #824 ## Background The `LspLifecycleManager` in `src/cleveragents/lsp/lifecycle.py` is responsible for managing the lifecycle of LSP server processes. Thread safety is critical because the LSP Runtime can be accessed concurrently by multiple actor threads. `start_server()` correctly uses a 3-phase approach to avoid holding the lock during blocking operations: - **Phase 1** (short lock): Check state and reserve the slot. - **Phase 2** (no lock): Perform blocking work — subprocess spawn, LSP handshake. - **Phase 3** (short lock): Commit the new client into shared state. However, `restart_server()` does **not** follow this pattern. It acquires `self._lock` and holds it for the entire operation, including all blocking I/O. ## Current Behavior `restart_server()` acquires `self._lock` and then performs multiple blocking operations while holding it: ```python def restart_server(self, name: str) -> LspClient: with self._lock: # Lock acquired here managed = self._servers.get(name) ... managed.transport.stop() # Blocking I/O while holding lock transport = StdioTransport(...) transport.start() # Blocking subprocess spawn while holding lock client = LspClient(transport, server_name=name) try: client.initialize(managed.workspace_path) # Blocking LSP handshake (up to 60s) while holding lock except Exception: transport.stop() raise ``` The `client.initialize()` call can block for up to 60 seconds (the `_REQUEST_TIMEOUT` in `client.py`). During this time, the lock is held, preventing any other thread from calling `health_check()`, `get_client()`, `stop_server()`, or `list_running()`. This is a potential deadlock scenario: if `_get_healthy_client()` in `LspRuntime` calls `restart_server()` from one thread while another thread is waiting to acquire the lock for `health_check()`, the second thread will block indefinitely. ## Expected Behavior `restart_server()` should follow the same 3-phase pattern as `start_server()`: - **Phase 1** (short lock): Read current state, remove the old entry, and mark the slot as restarting. - **Phase 2** (no lock): Stop the old transport, spawn the new process, and perform the LSP handshake. - **Phase 3** (short lock): Commit the new client into shared state. No blocking I/O should occur while `self._lock` is held. ## Subtasks - [ ] Refactor `restart_server()` to use the 3-phase lock pattern (read state → release lock → blocking I/O → acquire lock → commit state) - [ ] Ensure the old transport is stopped outside the lock (Phase 2) - [ ] Ensure `StdioTransport.start()` is called outside the lock (Phase 2) - [ ] Ensure `client.initialize()` is called outside the lock (Phase 2) - [ ] Handle error cases (failed restart) correctly without leaving shared state inconsistent - [ ] Tests (unit): Add concurrent-access test asserting `health_check()` is not blocked while `restart_server()` is in progress - [ ] Tests (unit): Add test asserting `restart_server()` does not hold the lock during `client.initialize()` - [ ] Verify coverage >= 97% via `nox -s coverage_report` - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - `restart_server()` follows the same 3-phase lock pattern as `start_server()`, with no blocking I/O performed while `self._lock` is held. - Concurrent calls to `health_check()`, `get_client()`, `stop_server()`, and `list_running()` are not blocked while a `restart_server()` is in progress. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. - All nox stages pass. - Coverage >= 97%. --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
freemo added this to the v3.6.0 milestone 2026-04-05 04:06:07 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: Confirmed
  • MoSCoW: Should Have

Valid finding verified during batch triage.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Issue triaged by project owner: - **State**: Verified - **Priority**: Confirmed - **MoSCoW**: Should Have Valid finding verified during batch triage. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: ca-project-owner
Author
Owner

PR #3165 created on branch fix/lsp-lifecycle-restart-lock-deadlock. PR review and merge handled by continuous review stream.

Implementation summary:

  • Refactored restart_server() in src/cleveragents/lsp/lifecycle.py to use the 3-phase lock pattern (matching start_server()):
    • Phase 1 (short lock): snapshot state, remove old entry from _servers
    • Phase 2 (no lock): old_transport.stop(), StdioTransport.start(), client.initialize()
    • Phase 3 (short lock): insert new _ManagedServer, preserving ref_count
  • Added 3 new BDD scenarios in features/lsp_lifecycle_coverage.feature verifying:
    1. health_check() is not blocked during restart_server() blocking I/O
    2. _lock is not held during client.initialize()
    3. ref_count is preserved after restart
  • All quality gates pass: ruff lint , ruff format , pyright 0 errors

Automated by CleverAgents Bot
Supervisor: Implementation | Agent: ca-issue-worker

PR #3165 created on branch `fix/lsp-lifecycle-restart-lock-deadlock`. PR review and merge handled by continuous review stream. **Implementation summary:** - Refactored `restart_server()` in `src/cleveragents/lsp/lifecycle.py` to use the 3-phase lock pattern (matching `start_server()`): - Phase 1 (short lock): snapshot state, remove old entry from `_servers` - Phase 2 (no lock): `old_transport.stop()`, `StdioTransport.start()`, `client.initialize()` - Phase 3 (short lock): insert new `_ManagedServer`, preserving `ref_count` - Added 3 new BDD scenarios in `features/lsp_lifecycle_coverage.feature` verifying: 1. `health_check()` is not blocked during `restart_server()` blocking I/O 2. `_lock` is not held during `client.initialize()` 3. `ref_count` is preserved after restart - All quality gates pass: ruff lint ✅, ruff format ✅, pyright 0 errors ✅ --- **Automated by CleverAgents Bot** Supervisor: Implementation | Agent: ca-issue-worker
Author
Owner

PR #3165 has been reviewed, approved, and scheduled to merge when all CI checks complete.

Review summary: The 3-phase lock refactoring in restart_server() is correct and well-tested. All critical CI quality gates (lint, typecheck, security, unit_tests, coverage, integration_tests, e2e_tests) have passed. The merge will proceed automatically once the remaining benchmark and status-check jobs complete.


Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-pr-self-reviewer

PR #3165 has been reviewed, approved, and scheduled to merge when all CI checks complete. **Review summary**: The 3-phase lock refactoring in `restart_server()` is correct and well-tested. All critical CI quality gates (lint, typecheck, security, unit_tests, coverage, integration_tests, e2e_tests) have passed. The merge will proceed automatically once the remaining benchmark and status-check jobs complete. --- **Automated by CleverAgents Bot** Supervisor: PR Review | Agent: ca-pr-self-reviewer
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#824 Epic: LSP Functional Runtime
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#3026
No description provided.