UAT: LspLifecycleManager.stop_server() holds lock during blocking I/O — potential deadlock #3404

Open
opened 2026-04-05 16:30:37 +00:00 by freemo · 1 comment
Owner

Metadata

  • Branch: fix/lsp-lifecycle-stop-server-lock
  • Commit Message: fix(lsp): release lock before blocking I/O in stop_server() to prevent deadlock
  • Milestone: none (backlog)
  • Parent Epic: #824

Backlog note: This issue was discovered during autonomous operation
on milestone . It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.

Bug Description

LspLifecycleManager.stop_server() calls _shutdown_server() while holding self._lock. The _shutdown_server() method calls managed.client.shutdown() (which sends a JSON-RPC request and waits up to 60 seconds for a response) and managed.transport.stop() (which waits for process termination with a 5-second timeout). This means the lock is held during potentially long blocking I/O operations.

In contrast, restart_server() correctly uses a 3-phase lock pattern to avoid holding the lock during blocking I/O (documented in the method's docstring).

Expected Behavior

stop_server() should follow the same 3-phase lock pattern as restart_server():

  1. Phase 1 (short lock): Read state, remove entry from registry
  2. Phase 2 (no lock): Perform blocking shutdown (client.shutdown + transport.stop)
  3. Phase 3 (short lock): Any final state updates

This ensures that health_check(), get_client(), list_running(), and other lock-acquiring operations are not blocked for up to 65 seconds while a server shuts down.

Actual Behavior

In src/cleveragents/lsp/lifecycle.py lines 154–170:

def stop_server(self, name: str) -> None:
    with self._lock:                          # Lock acquired
        managed = self._servers.get(name)
        if managed is None:
            raise LspServerNotFoundError(name)
        managed.ref_count -= 1
        if managed.ref_count > 0:
            return
        # Last reference — shut down
        self._shutdown_server(managed)        # Blocking I/O while lock held!
        del self._servers[name]

_shutdown_server() (lines 172–184) calls:

  • managed.client.shutdown() — sends JSON-RPC shutdown request, waits up to _REQUEST_TIMEOUT = 60.0 seconds
  • managed.transport.stop() — waits up to _GRACEFUL_SHUTDOWN_TIMEOUT = 5.0 seconds

Total potential lock hold time: 65 seconds.

The same issue exists in stop_all() (lines 296–308), which calls _shutdown_server() for each server while holding the lock.

Code Location

  • File: src/cleveragents/lsp/lifecycle.py
  • Lines 154–170: stop_server() — lock held during _shutdown_server()
  • Lines 296–308: stop_all() — lock held during _shutdown_server() for each server
  • Lines 172–184: _shutdown_server() — performs blocking I/O

Impact

  • Any thread calling health_check(), get_client(), list_running(), or start_server() will block for up to 65 seconds while another thread is in stop_server()
  • Under concurrent load (multiple actors sharing servers), this can cause cascading delays
  • stop_all() compounds the issue: N servers × 65 seconds = N × 65 seconds of lock hold time

Subtasks

  • Refactor stop_server() to use 3-phase lock pattern (same as restart_server())
  • Refactor stop_all() to release lock before calling _shutdown_server()
  • Add concurrency test to features/lsp_lifecycle_coverage.feature verifying health_check() is not blocked during stop_server()
  • Update docstring to document the lock pattern

Definition of Done

  • stop_server() does not hold self._lock during client.shutdown() or transport.stop()
  • stop_all() does not hold self._lock during _shutdown_server() calls
  • Concurrency test passes verifying no blocking
  • All existing tests pass
  • All nox stages pass
  • Coverage >= 97%

Parent: #824


Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/lsp-lifecycle-stop-server-lock` - **Commit Message**: `fix(lsp): release lock before blocking I/O in stop_server() to prevent deadlock` - **Milestone**: none (backlog) - **Parent Epic**: #824 > **Backlog note:** This issue was discovered during autonomous operation > on milestone <M>. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. ## Bug Description `LspLifecycleManager.stop_server()` calls `_shutdown_server()` while holding `self._lock`. The `_shutdown_server()` method calls `managed.client.shutdown()` (which sends a JSON-RPC request and waits up to 60 seconds for a response) and `managed.transport.stop()` (which waits for process termination with a 5-second timeout). This means the lock is held during potentially long blocking I/O operations. In contrast, `restart_server()` correctly uses a 3-phase lock pattern to avoid holding the lock during blocking I/O (documented in the method's docstring). ## Expected Behavior `stop_server()` should follow the same 3-phase lock pattern as `restart_server()`: 1. **Phase 1** (short lock): Read state, remove entry from registry 2. **Phase 2** (no lock): Perform blocking shutdown (client.shutdown + transport.stop) 3. **Phase 3** (short lock): Any final state updates This ensures that `health_check()`, `get_client()`, `list_running()`, and other lock-acquiring operations are not blocked for up to 65 seconds while a server shuts down. ## Actual Behavior In `src/cleveragents/lsp/lifecycle.py` lines 154–170: ```python def stop_server(self, name: str) -> None: with self._lock: # Lock acquired managed = self._servers.get(name) if managed is None: raise LspServerNotFoundError(name) managed.ref_count -= 1 if managed.ref_count > 0: return # Last reference — shut down self._shutdown_server(managed) # Blocking I/O while lock held! del self._servers[name] ``` `_shutdown_server()` (lines 172–184) calls: - `managed.client.shutdown()` — sends JSON-RPC `shutdown` request, waits up to `_REQUEST_TIMEOUT = 60.0` seconds - `managed.transport.stop()` — waits up to `_GRACEFUL_SHUTDOWN_TIMEOUT = 5.0` seconds Total potential lock hold time: **65 seconds**. The same issue exists in `stop_all()` (lines 296–308), which calls `_shutdown_server()` for each server while holding the lock. ## Code Location - File: `src/cleveragents/lsp/lifecycle.py` - Lines 154–170: `stop_server()` — lock held during `_shutdown_server()` - Lines 296–308: `stop_all()` — lock held during `_shutdown_server()` for each server - Lines 172–184: `_shutdown_server()` — performs blocking I/O ## Impact - Any thread calling `health_check()`, `get_client()`, `list_running()`, or `start_server()` will block for up to 65 seconds while another thread is in `stop_server()` - Under concurrent load (multiple actors sharing servers), this can cause cascading delays - `stop_all()` compounds the issue: N servers × 65 seconds = N × 65 seconds of lock hold time ## Subtasks - [ ] Refactor `stop_server()` to use 3-phase lock pattern (same as `restart_server()`) - [ ] Refactor `stop_all()` to release lock before calling `_shutdown_server()` - [ ] Add concurrency test to `features/lsp_lifecycle_coverage.feature` verifying `health_check()` is not blocked during `stop_server()` - [ ] Update docstring to document the lock pattern ## Definition of Done - [ ] `stop_server()` does not hold `self._lock` during `client.shutdown()` or `transport.stop()` - [ ] `stop_all()` does not hold `self._lock` during `_shutdown_server()` calls - [ ] Concurrency test passes verifying no blocking - [ ] All existing tests pass - [ ] All nox stages pass - [ ] Coverage >= 97% Parent: #824 --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: Backlog (unchanged)
  • Story Points: 3 — M — Fix potential deadlock in LspLifecycleManager.stop_server().
  • MoSCoW: Should Have — Potential deadlock is a reliability concern.

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Issue triaged by project owner: - **State**: Verified - **Priority**: Backlog (unchanged) - **Story Points**: 3 — M — Fix potential deadlock in LspLifecycleManager.stop_server(). - **MoSCoW**: Should Have — Potential deadlock is a reliability concern. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: ca-project-owner
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#824 Epic: LSP Functional Runtime
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#3404
No description provided.