UAT: LspLifecycleManager.restart_server() holds the lock during process start — potential deadlock under concurrent access #2523

Open
opened 2026-04-03 18:45:30 +00:00 by freemo · 1 comment
Owner

Metadata

  • Branch: fix/lsp-lifecycle-restart-lock
  • Commit Message: fix(lsp): release lock during slow I/O in LspLifecycleManager.restart_server()
  • Milestone: v3.6.0
  • Parent Epic: #824

Summary

LspLifecycleManager.restart_server() in src/cleveragents/lsp/lifecycle.py holds the threading.Lock for the entire duration of the restart operation, including the slow transport.start() and client.initialize() calls (which can take several seconds). This is inconsistent with start_server(), which correctly releases the lock during the slow initialization phase. Under concurrent access, this can cause other threads to block indefinitely waiting for the lock, effectively creating a deadlock scenario.

What Was Tested

  • Code analysis of src/cleveragents/lsp/lifecycle.py
  • Comparison of start_server() and restart_server() lock patterns

Expected Behavior

restart_server() should follow the same three-phase lock pattern as start_server():

  1. Short lock — check state, stop old transport
  2. No lock — perform slow I/O (start new transport, initialize client)
  3. Short lock — register the new managed server

This prevents other threads from blocking during the slow initialization phase.

Actual Behavior

In src/cleveragents/lsp/lifecycle.py (lines 219-265), restart_server() holds the lock for the entire operation:

def restart_server(self, name: str) -> LspClient:
    with self._lock:  # ← Lock acquired here
        managed = self._servers.get(name)
        if managed is None:
            raise LspServerNotFoundError(name)
        
        managed.transport.stop()  # ← slow I/O inside lock
        
        transport = StdioTransport(...)
        transport.start()  # ← slow I/O inside lock (can take seconds)
        
        client = LspClient(transport, server_name=name)
        try:
            client.initialize(managed.workspace_path)  # ← slow I/O inside lock
        except Exception:
            transport.stop()
            raise
        
        managed.transport = transport
        managed.client = client
    # ← Lock released only after all I/O completes

Compare with start_server() which correctly releases the lock during initialization (lines 101-133).

Code Location

src/cleveragents/lsp/lifecycle.py lines 219-265: restart_server() method — lock held during slow I/O

Impact

When the LSP runtime triggers an automatic restart (e.g., after a server crash detected by _get_healthy_client()), any concurrent thread calling get_client(), health_check(), stop_server(), or list_running() will block for the entire duration of the restart (potentially 5-30 seconds for language server initialization). In a multi-actor environment where multiple actors share LSP servers, this can cause significant latency spikes or apparent deadlocks.

Subtasks

  • Refactor restart_server() to use the same three-phase lock pattern as start_server():
    1. Short lock: read state, stop old transport, clear old entry
    2. No lock: start new transport, initialize client
    3. Short lock: register new managed server
  • Handle the race condition where another thread may have restarted the same server concurrently (similar to start_server()'s race winner check)
  • Add BDD test for concurrent restart behavior
  • Verify all CI checks pass

Definition of Done

  • restart_server() releases the lock during slow I/O operations
  • Concurrent calls to other lifecycle methods do not block during restart
  • BDD test covers concurrent restart scenario
  • All CI checks pass
  • All nox stages pass
  • Coverage >= 97%

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/lsp-lifecycle-restart-lock` - **Commit Message**: `fix(lsp): release lock during slow I/O in LspLifecycleManager.restart_server()` - **Milestone**: v3.6.0 - **Parent Epic**: #824 ## Summary `LspLifecycleManager.restart_server()` in `src/cleveragents/lsp/lifecycle.py` holds the `threading.Lock` for the entire duration of the restart operation, including the slow `transport.start()` and `client.initialize()` calls (which can take several seconds). This is inconsistent with `start_server()`, which correctly releases the lock during the slow initialization phase. Under concurrent access, this can cause other threads to block indefinitely waiting for the lock, effectively creating a deadlock scenario. ## What Was Tested - Code analysis of `src/cleveragents/lsp/lifecycle.py` - Comparison of `start_server()` and `restart_server()` lock patterns ## Expected Behavior `restart_server()` should follow the same three-phase lock pattern as `start_server()`: 1. **Short lock** — check state, stop old transport 2. **No lock** — perform slow I/O (start new transport, initialize client) 3. **Short lock** — register the new managed server This prevents other threads from blocking during the slow initialization phase. ## Actual Behavior In `src/cleveragents/lsp/lifecycle.py` (lines 219-265), `restart_server()` holds the lock for the entire operation: ```python def restart_server(self, name: str) -> LspClient: with self._lock: # ← Lock acquired here managed = self._servers.get(name) if managed is None: raise LspServerNotFoundError(name) managed.transport.stop() # ← slow I/O inside lock transport = StdioTransport(...) transport.start() # ← slow I/O inside lock (can take seconds) client = LspClient(transport, server_name=name) try: client.initialize(managed.workspace_path) # ← slow I/O inside lock except Exception: transport.stop() raise managed.transport = transport managed.client = client # ← Lock released only after all I/O completes ``` Compare with `start_server()` which correctly releases the lock during initialization (lines 101-133). ## Code Location `src/cleveragents/lsp/lifecycle.py` lines 219-265: `restart_server()` method — lock held during slow I/O ## Impact When the LSP runtime triggers an automatic restart (e.g., after a server crash detected by `_get_healthy_client()`), any concurrent thread calling `get_client()`, `health_check()`, `stop_server()`, or `list_running()` will block for the entire duration of the restart (potentially 5-30 seconds for language server initialization). In a multi-actor environment where multiple actors share LSP servers, this can cause significant latency spikes or apparent deadlocks. ## Subtasks - [ ] Refactor `restart_server()` to use the same three-phase lock pattern as `start_server()`: 1. Short lock: read state, stop old transport, clear old entry 2. No lock: start new transport, initialize client 3. Short lock: register new managed server - [ ] Handle the race condition where another thread may have restarted the same server concurrently (similar to `start_server()`'s race winner check) - [ ] Add BDD test for concurrent restart behavior - [ ] Verify all CI checks pass ## Definition of Done - [ ] `restart_server()` releases the lock during slow I/O operations - [ ] Concurrent calls to other lifecycle methods do not block during restart - [ ] BDD test covers concurrent restart scenario - [ ] All CI checks pass - All nox stages pass - Coverage >= 97% --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
freemo added this to the v3.6.0 milestone 2026-04-03 18:45:35 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • MoSCoW: Should Have — Spec compliance or quality improvement that should be included in the milestone.

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Issue triaged by project owner: - **State**: Verified - **MoSCoW**: Should Have — Spec compliance or quality improvement that should be included in the milestone. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: ca-project-owner
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#824 Epic: LSP Functional Runtime
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#2523
No description provided.