BUG-HUNT: [concurrency] LspLifecycleManager.stop_server() and stop_all() perform blocking LSP shutdown while holding self._lock — all lifecycle ops stall for up to 65 s per server #6577

Open
opened 2026-04-09 21:42:48 +00:00 by HAL9000 · 0 comments
Owner

Bug Report: [concurrency] — Blocking I/O while holding _lock in stop_server() and stop_all()

Severity Assessment

  • Impact: Any thread calling start_server, stop_server, get_client, health_check, restart_server, or list_running is blocked for up to N × 65 s (N servers × 60 s LSP shutdown timeout + 5 s kill timeout) while stop_all() runs. In practice this means LSP operations on other servers are completely frozen during any shutdown sequence. stop_server() has the same flaw for a single server.
  • Likelihood: High — stop_all() is called during normal actor deactivation, and stop_server() on every reference release.
  • Priority: High

Location

  • File: src/cleveragents/lsp/lifecycle.py
  • Function: LspLifecycleManager.stop_server() (lines 154–170) and LspLifecycleManager.stop_all() (lines 297–308)
  • Called helper: _shutdown_server() (lines 172–184)

Description

_shutdown_server() calls managed.client.shutdown() (a blocking JSON-RPC request, up to _REQUEST_TIMEOUT = 60 s) followed by managed.transport.stop() (up to _GRACEFUL_SHUTDOWN_TIMEOUT = 5 s). Both stop_server() and stop_all() invoke _shutdown_server() inside a with self._lock: block.

The lifecycle lock is the same threading.Lock held for every read or write of self._servers. Holding it during blocking I/O means:

  • start_server() → Phase 1 lock acquisition stalls
  • get_client() → stalls
  • health_check() → stalls
  • restart_server() → Phase 1/3 stalls
  • list_running() → stalls

start_server() deliberately uses a 3-phase lock pattern specifically to avoid holding the lock during I/O (see comment at line 101: "Phase 2: Start and initialize OUTSIDE the lock (may take seconds)"). That same discipline is missing from the shutdown path.

Evidence

# lifecycle.py lines 154-170 — stop_server holds lock during shutdown
def stop_server(self, name: str) -> None:
    with self._lock:                          # ← lock acquired
        managed = self._servers.get(name)
        if managed is None:
            raise LspServerNotFoundError(name)
        managed.ref_count -= 1
        if managed.ref_count > 0:
            ...
            return
        # Last reference — shut down
        self._shutdown_server(managed)        # ← blocks up to 65 s while locked!
        del self._servers[name]
# lifecycle.py lines 297-308 — stop_all holds lock during N × shutdown
def stop_all(self) -> None:
    with self._lock:                          # ← lock acquired
        for name, managed in list(self._servers.items()):
            try:
                self._shutdown_server(managed)  # ← blocks up to 65 s per server!
            except Exception:
                logger.warning(...)
        self._servers.clear()
# lifecycle.py lines 172-184 — _shutdown_server is the blocking work
def _shutdown_server(self, managed: _ManagedServer) -> None:
    try:
        managed.client.shutdown()   # JSON-RPC request, up to 60 s
    except Exception:
        ...
    managed.transport.stop()        # process kill, up to 5 s

Note: start_server() (lines 101–133) correctly performs I/O outside the lock with a comment explaining why. The shutdown path lacks this same discipline.

Expected Behavior

Shutdown of individual servers should be performed outside the lock. The lock should only be held for the brief operation of removing the entry from _servers.

Actual Behavior

The full blocking shutdown (LSP protocol exchange + process kill/wait) is performed while holding self._lock, freezing all concurrent lifecycle operations.

Suggested Fix

Apply the same 3-phase lock pattern used by start_server and restart_server to stop_server:

  1. Under lock: decrement ref_count, if last reference remove entry from _servers, snapshot the managed object.
  2. Outside lock: call _shutdown_server(managed).

For stop_all, build the list of servers to shut down under the lock, then clear _servers, then shut them down outside the lock.

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: [concurrency] — Blocking I/O while holding `_lock` in `stop_server()` and `stop_all()` ### Severity Assessment - **Impact**: Any thread calling `start_server`, `stop_server`, `get_client`, `health_check`, `restart_server`, or `list_running` is blocked for up to `N × 65 s` (N servers × 60 s LSP shutdown timeout + 5 s kill timeout) while `stop_all()` runs. In practice this means LSP operations on **other** servers are completely frozen during any shutdown sequence. `stop_server()` has the same flaw for a single server. - **Likelihood**: High — `stop_all()` is called during normal actor deactivation, and `stop_server()` on every reference release. - **Priority**: High ### Location - **File**: `src/cleveragents/lsp/lifecycle.py` - **Function**: `LspLifecycleManager.stop_server()` (lines 154–170) and `LspLifecycleManager.stop_all()` (lines 297–308) - **Called helper**: `_shutdown_server()` (lines 172–184) ### Description `_shutdown_server()` calls `managed.client.shutdown()` (a blocking JSON-RPC request, up to `_REQUEST_TIMEOUT = 60 s`) followed by `managed.transport.stop()` (up to `_GRACEFUL_SHUTDOWN_TIMEOUT = 5 s`). Both `stop_server()` and `stop_all()` invoke `_shutdown_server()` **inside** a `with self._lock:` block. The lifecycle lock is the same `threading.Lock` held for every read or write of `self._servers`. Holding it during blocking I/O means: - `start_server()` → Phase 1 lock acquisition stalls - `get_client()` → stalls - `health_check()` → stalls - `restart_server()` → Phase 1/3 stalls - `list_running()` → stalls `start_server()` deliberately uses a 3-phase lock pattern specifically to avoid holding the lock during I/O (see comment at line 101: *"Phase 2: Start and initialize OUTSIDE the lock (may take seconds)"*). That same discipline is missing from the shutdown path. ### Evidence ```python # lifecycle.py lines 154-170 — stop_server holds lock during shutdown def stop_server(self, name: str) -> None: with self._lock: # ← lock acquired managed = self._servers.get(name) if managed is None: raise LspServerNotFoundError(name) managed.ref_count -= 1 if managed.ref_count > 0: ... return # Last reference — shut down self._shutdown_server(managed) # ← blocks up to 65 s while locked! del self._servers[name] ``` ```python # lifecycle.py lines 297-308 — stop_all holds lock during N × shutdown def stop_all(self) -> None: with self._lock: # ← lock acquired for name, managed in list(self._servers.items()): try: self._shutdown_server(managed) # ← blocks up to 65 s per server! except Exception: logger.warning(...) self._servers.clear() ``` ```python # lifecycle.py lines 172-184 — _shutdown_server is the blocking work def _shutdown_server(self, managed: _ManagedServer) -> None: try: managed.client.shutdown() # JSON-RPC request, up to 60 s except Exception: ... managed.transport.stop() # process kill, up to 5 s ``` Note: `start_server()` (lines 101–133) correctly performs I/O outside the lock with a comment explaining why. The shutdown path lacks this same discipline. ### Expected Behavior Shutdown of individual servers should be performed outside the lock. The lock should only be held for the brief operation of removing the entry from `_servers`. ### Actual Behavior The full blocking shutdown (LSP protocol exchange + process kill/wait) is performed while holding `self._lock`, freezing all concurrent lifecycle operations. ### Suggested Fix Apply the same 3-phase lock pattern used by `start_server` and `restart_server` to `stop_server`: 1. Under lock: decrement ref_count, if last reference remove entry from `_servers`, snapshot the `managed` object. 2. Outside lock: call `_shutdown_server(managed)`. For `stop_all`, build the list of servers to shut down under the lock, then clear `_servers`, then shut them down outside the lock. ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter
HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:52:45 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#6577
No description provided.