BUG-HUNT: [concurrency] Race condition in LSP lifecycle manager restart sequence #7144

Open
opened 2026-04-10 08:10:52 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Branch: bugfix/lsp-lifecycle-restart-race-condition
  • Commit Message: fix(lsp): eliminate race condition window in restart_server() by using restarting state flag instead of entry deletion
  • Milestone: (none — see backlog note below)
  • Parent Epic: #824

Backlog note: This issue was discovered during autonomous operation
on milestone v3.6.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.

Bug Report: Concurrency — Race Condition in LSP Lifecycle Manager Restart Sequence

Severity Assessment

  • Severity: High (race condition)
  • Impact: Multiple threads can see server as non-existent during restart, causing duplicate server starts or failures
  • Likelihood: Medium (requires concurrent access during restart)
  • Priority: High

Location

  • File: src/cleveragents/lsp/lifecycle.py
  • Function: restart_server()
  • Lines: 246–261 (Phase 1 — entry deletion), 290–291 (Phase 3 — re-insertion)

Description

The three-phase locking approach in restart_server() creates a race condition window where the server entry is deleted from self._servers (Phase 1) but not yet re-added (Phase 3). During this window, other threads calling health_check(), get_client(), or restart_server() will see the server as non-existent.

The design intent of deleting the entry before the blocking restart is to prevent other threads from using a stale/dead server reference during the restart. However, the side effect is that the server appears to have never existed — callers receive LspServerNotFoundError or trigger duplicate restart attempts rather than receiving a "server is restarting, please wait" signal.

Discovered during: Bug Hunt Cycle 2, LSP Module Analysis (Worker 14, issue #7037).

Evidence

# Phase 1: lines 246-261 - Server entry deleted while still needed
with self._lock:
    managed = self._servers.get(name)
    if managed is None:
        raise LspServerNotFoundError(name)

    # Remove the entry so concurrent callers see the server as absent
    # during the blocking restart.  Phase 3 will re-insert it.
    del self._servers[name]  # RACE CONDITION: Other threads see server as gone

# Phase 2: Blocking I/O outside lock (good)
# ... long-running transport operations (stop old process, start new process,
#     LSP initialize handshake) — may take several seconds ...

# Phase 3: lines 290-291 - Re-add server entry
with self._lock:
    self._servers[name] = new_managed  # Server reappears

Race Condition Scenarios

  1. Thread A starts restart (deletes entry) → Thread B calls health_check() → sees server as non-existent, returns unhealthy/raises
  2. Thread A starts restart (deletes entry) → Thread B calls get_client() → raises LspServerNotFoundError even though server is being restarted
  3. Thread A starts restart (deletes entry) → Thread B also calls restart_server() → raises LspServerNotFoundError (or worse, triggers a second concurrent restart if the check is absent)
  4. Thread A starts restart (deletes entry) → Thread B calls start_server() → may create a brand-new server instance for the same name, racing with Thread A's Phase 3 re-insertion

Expected Behavior

The server should remain visible to other threads during restart, with a "restarting" state that allows callers to either:

  • Wait for the restart to complete (using a threading.Event or threading.Condition)
  • Receive a clear, actionable error (e.g., LspServerRestartingError) rather than LspServerNotFoundError

Actual Behavior

Server entry is deleted from self._servers during Phase 1 and not re-added until Phase 3. During the entire Phase 2 window (which involves blocking I/O and may take several seconds), the server appears to not exist, causing race conditions and inconsistent behavior for all concurrent callers.

Suggested Fix

  1. Add a restarting flag to _ManagedServer instead of deleting the entry:

    @dataclass
    class _ManagedServer:
        ...
        restarting: bool = False
        restart_event: threading.Event = field(default_factory=threading.Event)
    
  2. Phase 1: Set managed.restarting = True and managed.restart_event.clear() instead of del self._servers[name]

  3. Make other methods respect the restarting state:

    • get_client() / health_check(): if managed.restarting, either wait on restart_event or raise LspServerRestartingError
    • restart_server(): if managed.restarting, raise LspServerRestartingError to prevent double-restart
  4. Phase 3: Update the existing entry in-place (or replace it) and call managed.restart_event.set() to wake waiting threads

  5. Consider using threading.Condition for more flexible coordination (e.g., timeout-based waiting)

  • #7102 — Race condition in start_server() (similar three-phase locking pattern, different method)
  • #7044 — LSP Transport Process Resource Leak on Exception in StdioTransport.start()
  • #7083 — DoS vulnerability in LSP server message reading
  • #7129 — Document leak in LSP runtime operations due to missing cleanup

Spec Reference

docs/specification.md — LSP Server Lifecycle section (server state management and thread safety guarantees)

TDD Note

Per the Bug Fix Workflow, a companion Type/Testing issue prefixed TDD: must be created before this bug is fixed. That TDD issue's sole deliverable is a test tagged @tdd_issue, @tdd_issue_<N> (where N is this issue's number), and @tdd_expected_fail that reproduces the race condition and proves the bug exists. This bug issue depends on (is blocked by) the TDD issue.

Subtasks

  • Create companion TDD issue (TDD: [concurrency] Race condition in LSP lifecycle manager restart sequence)
  • Link TDD issue as a blocker of this issue (this issue depends on TDD issue)
  • Write @tdd_issue @tdd_issue_<N> @tdd_expected_fail BDD scenario that reproduces the race by launching two threads — one calling restart_server() and one calling get_client() or health_check() — and asserting the second thread does not receive LspServerNotFoundError during the restart window
  • Add restarting: bool and restart_event: threading.Event fields to _ManagedServer dataclass
  • Modify restart_server() Phase 1 to set managed.restarting = True instead of deleting the entry
  • Modify restart_server() Phase 3 to update the entry in-place and call restart_event.set()
  • Update get_client() to handle restarting state (wait or raise LspServerRestartingError)
  • Update health_check() to handle restarting state (return appropriate status)
  • Update restart_server() to guard against double-restart when managed.restarting is already True
  • Remove @tdd_expected_fail tag from TDD test once fix is implemented
  • Run nox (all default sessions), fix any errors
  • Verify coverage ≥ 97% via nox -s coverage_report

Definition of Done

  • All subtasks above are completed and checked off
  • A companion TDD issue exists, its test is merged to master with @tdd_expected_fail, and this issue is linked as depending on that TDD issue
  • No thread calling get_client(), health_check(), or restart_server() receives LspServerNotFoundError while a restart is in progress for that server
  • Concurrent restart_server() calls for the same server name are serialized (no double-restart)
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done
  • All nox stages pass
  • Coverage ≥ 97%

Automated by CleverAgents Bot
Supervisor: Acting on behalf of: UAT Testing | Agent: new-issue-creator

## Metadata - **Branch**: `bugfix/lsp-lifecycle-restart-race-condition` - **Commit Message**: `fix(lsp): eliminate race condition window in restart_server() by using restarting state flag instead of entry deletion` - **Milestone**: *(none — see backlog note below)* - **Parent Epic**: #824 > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.6.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. ## Bug Report: Concurrency — Race Condition in LSP Lifecycle Manager Restart Sequence ### Severity Assessment - **Severity**: High (race condition) - **Impact**: Multiple threads can see server as non-existent during restart, causing duplicate server starts or failures - **Likelihood**: Medium (requires concurrent access during restart) - **Priority**: High ### Location - **File**: `src/cleveragents/lsp/lifecycle.py` - **Function**: `restart_server()` - **Lines**: 246–261 (Phase 1 — entry deletion), 290–291 (Phase 3 — re-insertion) ### Description The three-phase locking approach in `restart_server()` creates a race condition window where the server entry is deleted from `self._servers` (Phase 1) but not yet re-added (Phase 3). During this window, other threads calling `health_check()`, `get_client()`, or `restart_server()` will see the server as non-existent. The design intent of deleting the entry before the blocking restart is to prevent other threads from using a stale/dead server reference during the restart. However, the side effect is that the server appears to have never existed — callers receive `LspServerNotFoundError` or trigger duplicate restart attempts rather than receiving a "server is restarting, please wait" signal. **Discovered during**: Bug Hunt Cycle 2, LSP Module Analysis (Worker 14, issue #7037). ### Evidence ```python # Phase 1: lines 246-261 - Server entry deleted while still needed with self._lock: managed = self._servers.get(name) if managed is None: raise LspServerNotFoundError(name) # Remove the entry so concurrent callers see the server as absent # during the blocking restart. Phase 3 will re-insert it. del self._servers[name] # RACE CONDITION: Other threads see server as gone # Phase 2: Blocking I/O outside lock (good) # ... long-running transport operations (stop old process, start new process, # LSP initialize handshake) — may take several seconds ... # Phase 3: lines 290-291 - Re-add server entry with self._lock: self._servers[name] = new_managed # Server reappears ``` ### Race Condition Scenarios 1. **Thread A** starts restart (deletes entry) → **Thread B** calls `health_check()` → sees server as non-existent, returns unhealthy/raises 2. **Thread A** starts restart (deletes entry) → **Thread B** calls `get_client()` → raises `LspServerNotFoundError` even though server is being restarted 3. **Thread A** starts restart (deletes entry) → **Thread B** also calls `restart_server()` → raises `LspServerNotFoundError` (or worse, triggers a second concurrent restart if the check is absent) 4. **Thread A** starts restart (deletes entry) → **Thread B** calls `start_server()` → may create a brand-new server instance for the same name, racing with Thread A's Phase 3 re-insertion ### Expected Behavior The server should remain visible to other threads during restart, with a "restarting" state that allows callers to either: - Wait for the restart to complete (using a `threading.Event` or `threading.Condition`) - Receive a clear, actionable error (e.g., `LspServerRestartingError`) rather than `LspServerNotFoundError` ### Actual Behavior Server entry is deleted from `self._servers` during Phase 1 and not re-added until Phase 3. During the entire Phase 2 window (which involves blocking I/O and may take several seconds), the server appears to not exist, causing race conditions and inconsistent behavior for all concurrent callers. ### Suggested Fix 1. **Add a `restarting` flag to `_ManagedServer`** instead of deleting the entry: ```python @dataclass class _ManagedServer: ... restarting: bool = False restart_event: threading.Event = field(default_factory=threading.Event) ``` 2. **Phase 1**: Set `managed.restarting = True` and `managed.restart_event.clear()` instead of `del self._servers[name]` 3. **Make other methods respect the restarting state**: - `get_client()` / `health_check()`: if `managed.restarting`, either wait on `restart_event` or raise `LspServerRestartingError` - `restart_server()`: if `managed.restarting`, raise `LspServerRestartingError` to prevent double-restart 4. **Phase 3**: Update the existing entry in-place (or replace it) and call `managed.restart_event.set()` to wake waiting threads 5. **Consider using `threading.Condition`** for more flexible coordination (e.g., timeout-based waiting) ### Related Issues - **#7102** — Race condition in `start_server()` (similar three-phase locking pattern, different method) - **#7044** — LSP Transport Process Resource Leak on Exception in `StdioTransport.start()` - **#7083** — DoS vulnerability in LSP server message reading - **#7129** — Document leak in LSP runtime operations due to missing cleanup ### Spec Reference `docs/specification.md` — LSP Server Lifecycle section (server state management and thread safety guarantees) ### TDD Note Per the [Bug Fix Workflow](https://git.cleverthis.com/cleveragents/cleveragents-core/src/branch/master/CONTRIBUTING.md#bug-fix-workflow), a companion **`Type/Testing`** issue prefixed `TDD:` must be created before this bug is fixed. That TDD issue's sole deliverable is a test tagged `@tdd_issue`, `@tdd_issue_<N>` (where N is this issue's number), and `@tdd_expected_fail` that reproduces the race condition and proves the bug exists. **This bug issue depends on (is blocked by) the TDD issue.** ## Subtasks - [ ] Create companion TDD issue (`TDD: [concurrency] Race condition in LSP lifecycle manager restart sequence`) - [ ] Link TDD issue as a blocker of this issue (this issue depends on TDD issue) - [ ] Write `@tdd_issue @tdd_issue_<N> @tdd_expected_fail` BDD scenario that reproduces the race by launching two threads — one calling `restart_server()` and one calling `get_client()` or `health_check()` — and asserting the second thread does not receive `LspServerNotFoundError` during the restart window - [ ] Add `restarting: bool` and `restart_event: threading.Event` fields to `_ManagedServer` dataclass - [ ] Modify `restart_server()` Phase 1 to set `managed.restarting = True` instead of deleting the entry - [ ] Modify `restart_server()` Phase 3 to update the entry in-place and call `restart_event.set()` - [ ] Update `get_client()` to handle `restarting` state (wait or raise `LspServerRestartingError`) - [ ] Update `health_check()` to handle `restarting` state (return appropriate status) - [ ] Update `restart_server()` to guard against double-restart when `managed.restarting` is already `True` - [ ] Remove `@tdd_expected_fail` tag from TDD test once fix is implemented - [ ] Run `nox` (all default sessions), fix any errors - [ ] Verify coverage ≥ 97% via `nox -s coverage_report` ## Definition of Done - [ ] All subtasks above are completed and checked off - [ ] A companion TDD issue exists, its test is merged to `master` with `@tdd_expected_fail`, and this issue is linked as depending on that TDD issue - [ ] No thread calling `get_client()`, `health_check()`, or `restart_server()` receives `LspServerNotFoundError` while a restart is in progress for that server - [ ] Concurrent `restart_server()` calls for the same server name are serialized (no double-restart) - [ ] A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation - [ ] The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly - [ ] The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done - [ ] All nox stages pass - [ ] Coverage ≥ 97% --- **Automated by CleverAgents Bot** Supervisor: Acting on behalf of: UAT Testing | Agent: new-issue-creator
Author
Owner

Verified — Concurrency bug: race condition in LSP lifecycle manager restart. MoSCoW: Should-have. Priority: High.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Concurrency bug: race condition in LSP lifecycle manager restart. MoSCoW: Should-have. Priority: High. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#824 Epic: LSP Functional Runtime
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#7144
No description provided.