fix(lsp): release lock before blocking I/O in LspLifecycleManager.restart_server() to prevent deadlock #3165

Merged
freemo merged 1 commit from fix/lsp-lifecycle-restart-lock-deadlock into master 2026-04-05 07:50:10 +00:00
Owner

Summary

LspLifecycleManager.restart_server() previously held self._lock across all blocking I/O operations — transport.stop(), transport.start(), and client.initialize(). Under concurrent access this caused a deadlock: any thread calling health_check(), list_running(), or stop_server() would block for up to 60 seconds waiting for the lock to be released.

Changes

src/cleveragents/lsp/lifecycle.py

Refactored restart_server() to use the same 3-phase lock pattern already employed by start_server():

Phase Lock held? What happens
1 short Read current state, snapshot fields, remove old entry from _servers
2 none old_transport.stop(), StdioTransport(…).start(), client.initialize()
3 short Insert new _ManagedServer into _servers, preserving ref_count

Removing the old entry in Phase 1 ensures concurrent callers see the server as absent during the restart window (consistent with the existing start_server contract). The ref_count from the original managed server is carried forward to the new one.

features/lsp_lifecycle_coverage.feature + features/steps/lsp_lifecycle_coverage_steps.py

Three new BDD scenarios verify the fix:

  1. health_check is not blocked while restart_server is in progress — uses a threading.Barrier to synchronise two threads at the exact moment the lock is released in Phase 2. Asserts that health_check() completes without blocking.

  2. restart_server does not hold the lock during client.initialize — inspects _lock state from within the mock initialize() side-effect using acquire(blocking=False). Asserts the lock is acquirable (i.e. not held).

  3. restart_server preserves the ref_count of the managed server — starts the server twice (ref_count=2), restarts it, and asserts ref_count is still 2.

Test Results

All three new scenarios pass. Ruff lint and format checks pass. Pyright reports 0 errors.

Closes

Closes #3026


Automated by CleverAgents Bot
Supervisor: Implementation | Agent: ca-issue-worker

## Summary `LspLifecycleManager.restart_server()` previously held `self._lock` across all blocking I/O operations — `transport.stop()`, `transport.start()`, and `client.initialize()`. Under concurrent access this caused a deadlock: any thread calling `health_check()`, `list_running()`, or `stop_server()` would block for up to 60 seconds waiting for the lock to be released. ## Changes ### `src/cleveragents/lsp/lifecycle.py` Refactored `restart_server()` to use the same **3-phase lock pattern** already employed by `start_server()`: | Phase | Lock held? | What happens | |-------|-----------|--------------| | 1 | ✅ short | Read current state, snapshot fields, remove old entry from `_servers` | | 2 | ❌ none | `old_transport.stop()`, `StdioTransport(…).start()`, `client.initialize()` | | 3 | ✅ short | Insert new `_ManagedServer` into `_servers`, preserving `ref_count` | Removing the old entry in Phase 1 ensures concurrent callers see the server as absent during the restart window (consistent with the existing `start_server` contract). The `ref_count` from the original managed server is carried forward to the new one. ### `features/lsp_lifecycle_coverage.feature` + `features/steps/lsp_lifecycle_coverage_steps.py` Three new BDD scenarios verify the fix: 1. **`health_check is not blocked while restart_server is in progress`** — uses a `threading.Barrier` to synchronise two threads at the exact moment the lock is released in Phase 2. Asserts that `health_check()` completes without blocking. 2. **`restart_server does not hold the lock during client.initialize`** — inspects `_lock` state from within the mock `initialize()` side-effect using `acquire(blocking=False)`. Asserts the lock is acquirable (i.e. not held). 3. **`restart_server preserves the ref_count of the managed server`** — starts the server twice (ref_count=2), restarts it, and asserts ref_count is still 2. ## Test Results All three new scenarios pass. Ruff lint and format checks pass. Pyright reports 0 errors. ## Closes Closes #3026 --- **Automated by CleverAgents Bot** Supervisor: Implementation | Agent: ca-issue-worker
fix(lsp): release lock before blocking I/O in LspLifecycleManager.restart_server() to prevent deadlock
All checks were successful
CI / lint (pull_request) Successful in 20s
CI / typecheck (pull_request) Successful in 47s
CI / quality (pull_request) Successful in 37s
CI / security (pull_request) Successful in 1m0s
CI / build (pull_request) Successful in 27s
CI / helm (pull_request) Successful in 24s
CI / unit_tests (pull_request) Successful in 6m50s
CI / e2e_tests (pull_request) Successful in 16m59s
CI / coverage (pull_request) Successful in 10m54s
CI / docker (pull_request) Successful in 1m39s
CI / integration_tests (pull_request) Successful in 23m0s
CI / status-check (pull_request) Successful in 1s
CI / benchmark-publish (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Successful in 56m47s
17e7507a4d
Refactored restart_server() to use the same 3-phase lock pattern already
employed by start_server(), eliminating a deadlock hazard that occurred when
the method held _lock across blocking I/O operations (transport.stop(),
transport.start(), and client.initialize()).

Phase 1 (short lock): reads current state, snapshots required fields, and
removes the old server entry so concurrent callers see the server as absent
during the restart window.

Phase 2 (no lock): stops the old transport, spawns the new process, and
performs the LSP handshake — all without holding _lock. These operations
can block for up to 60 seconds; holding the lock here would starve any
concurrent health_check(), list_running(), or stop_server() calls.

Phase 3 (short lock): commits the new _ManagedServer into shared state,
preserving the original ref_count.

Three new BDD scenarios added to features/lsp_lifecycle_coverage.feature:
- health_check is not blocked while restart_server is in progress
  (uses a threading.Barrier to synchronise the two threads at the exact
  moment the lock is released)
- restart_server does not hold the lock during client.initialize
  (inspects lock state from within the mock initialize() side-effect)
- restart_server preserves the ref_count of the managed server

ISSUES CLOSED: #3026
Author
Owner

🔒 Review claimed by reviewer-pool-1 [claim-token: reviewer-pool-1-3165-1775372600]


Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-continuous-pr-reviewer

🔒 Review claimed by reviewer-pool-1 [claim-token: reviewer-pool-1-3165-1775372600] --- **Automated by CleverAgents Bot** Supervisor: PR Review | Agent: ca-continuous-pr-reviewer
freemo left a comment

Independent Code Review: APPROVED

Note: Posted as COMMENT because Forgejo prevents self-approval. This is an independent review by the PR self-reviewer agent, which provides a different perspective than the implementing agent.

Summary

This PR correctly refactors LspLifecycleManager.restart_server() to use the same 3-phase lock pattern already established by start_server(), eliminating a potential deadlock where self._lock was held across all blocking I/O operations (transport stop/start, LSP handshake up to 60s).

Code Review

lifecycle.py — 3-phase lock refactoring

Phase Lock? Operations Verdict
1 short Read state, snapshot fields, del self._servers[name] Correct — removes entry so concurrent callers see server as absent
2 none old_transport.stop(), StdioTransport().start(), client.initialize() Correct — all blocking I/O outside lock
3 short Create new _ManagedServer, set ref_count, insert into _servers Correct — preserves ref_count, commits atomically
  • Error handling in Phase 2 is correct: if initialize() fails, the new transport is stopped and the exception propagates. The server entry (removed in Phase 1) is intentionally not re-inserted — consistent with the contract that a failed restart leaves the server absent.
  • The ref_count is properly snapshotted in Phase 1 and carried forward to the new _ManagedServer in Phase 3.
  • Docstring is updated with clear phase documentation.

Tests — 3 new BDD scenarios

  1. health_check is not blocked while restart_server is in progress — Uses threading.Barrier(2, timeout=5) for precise synchronization. The barrier ensures the health_check thread runs exactly when restart is in Phase 2 (lock released). This would genuinely fail if the lock were held. Excellent concurrent test.

  2. restart_server does not hold the lock during client.initialize — Inspects _lock.acquire(blocking=False) from within the mock initialize() side-effect. Deterministic, no timing dependency. Clean verification.

  3. restart_server preserves the ref_count of the managed server — Starts server twice (ref_count=2), restarts, asserts ref_count=2. Simple and effective.

Specification Alignment

The implementation matches the issue's expected behavior exactly: the 3-phase pattern mirrors start_server(), no blocking I/O occurs while self._lock is held, and concurrent access to health_check(), get_client(), stop_server(), and list_running() is no longer blocked during restart.

Notes (non-blocking)

  • The step file (lsp_lifecycle_coverage_steps.py) is now ~772 lines, exceeding the 500-line guideline. However, it was already at ~635 lines before this PR — this is a pre-existing condition, not introduced here. A future refactoring to split the step file would be appropriate.
  • The PR is missing a Type/ label and milestone — these should be set to match issue #3026 (Type/Bug, v3.6.0).

CI Status

All critical quality gates passed: lint , typecheck , security , quality , unit_tests , integration_tests , e2e_tests , coverage , build , docker , helm . Benchmark and status-check jobs still pending.


Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-pr-self-reviewer

## Independent Code Review: APPROVED ✅ *Note: Posted as COMMENT because Forgejo prevents self-approval. This is an independent review by the PR self-reviewer agent, which provides a different perspective than the implementing agent.* ### Summary This PR correctly refactors `LspLifecycleManager.restart_server()` to use the same 3-phase lock pattern already established by `start_server()`, eliminating a potential deadlock where `self._lock` was held across all blocking I/O operations (transport stop/start, LSP handshake up to 60s). ### Code Review **lifecycle.py — 3-phase lock refactoring** | Phase | Lock? | Operations | Verdict | |-------|-------|-----------|---------| | 1 | ✅ short | Read state, snapshot fields, `del self._servers[name]` | ✅ Correct — removes entry so concurrent callers see server as absent | | 2 | ❌ none | `old_transport.stop()`, `StdioTransport().start()`, `client.initialize()` | ✅ Correct — all blocking I/O outside lock | | 3 | ✅ short | Create new `_ManagedServer`, set `ref_count`, insert into `_servers` | ✅ Correct — preserves ref_count, commits atomically | - Error handling in Phase 2 is correct: if `initialize()` fails, the new transport is stopped and the exception propagates. The server entry (removed in Phase 1) is intentionally not re-inserted — consistent with the contract that a failed restart leaves the server absent. - The `ref_count` is properly snapshotted in Phase 1 and carried forward to the new `_ManagedServer` in Phase 3. - Docstring is updated with clear phase documentation. **Tests — 3 new BDD scenarios** 1. **`health_check is not blocked while restart_server is in progress`** — Uses `threading.Barrier(2, timeout=5)` for precise synchronization. The barrier ensures the health_check thread runs exactly when restart is in Phase 2 (lock released). This would genuinely fail if the lock were held. ✅ Excellent concurrent test. 2. **`restart_server does not hold the lock during client.initialize`** — Inspects `_lock.acquire(blocking=False)` from within the mock `initialize()` side-effect. Deterministic, no timing dependency. ✅ Clean verification. 3. **`restart_server preserves the ref_count of the managed server`** — Starts server twice (ref_count=2), restarts, asserts ref_count=2. ✅ Simple and effective. ### Specification Alignment The implementation matches the issue's expected behavior exactly: the 3-phase pattern mirrors `start_server()`, no blocking I/O occurs while `self._lock` is held, and concurrent access to `health_check()`, `get_client()`, `stop_server()`, and `list_running()` is no longer blocked during restart. ### Notes (non-blocking) - The step file (`lsp_lifecycle_coverage_steps.py`) is now ~772 lines, exceeding the 500-line guideline. However, it was already at ~635 lines before this PR — this is a pre-existing condition, not introduced here. A future refactoring to split the step file would be appropriate. - The PR is missing a `Type/` label and milestone — these should be set to match issue #3026 (Type/Bug, v3.6.0). ### CI Status All critical quality gates passed: lint ✅, typecheck ✅, security ✅, quality ✅, unit_tests ✅, integration_tests ✅, e2e_tests ✅, coverage ✅, build ✅, docker ✅, helm ✅. Benchmark and status-check jobs still pending. --- **Automated by CleverAgents Bot** Supervisor: PR Review | Agent: ca-pr-self-reviewer
freemo scheduled this pull request to auto merge when all checks succeed 2026-04-05 07:47:36 +00:00
freemo merged commit 1411adfed3 into master 2026-04-05 07:50:10 +00:00
freemo deleted branch fix/lsp-lifecycle-restart-lock-deadlock 2026-04-05 07:50:10 +00:00
Author
Owner

🔒 Review claimed by reviewer-pool-1 [claim-token: reviewer-pool-1-3165-1743897600]


Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-continuous-pr-reviewer

🔒 Review claimed by reviewer-pool-1 [claim-token: reviewer-pool-1-3165-1743897600] --- **Automated by CleverAgents Bot** Supervisor: PR Review | Agent: ca-continuous-pr-reviewer
freemo left a comment

Code Review — LGTM

PR: fix(lsp): release lock before blocking I/O in LspLifecycleManager.restart_server() to prevent deadlock

Review Checklist

Correctness: The 3-phase lock pattern (short lock → blocking I/O without lock → short lock) is the correct approach. Consistent with start_server() pattern already in the codebase.

Concurrency Safety: Old entry removed from _servers in Phase 1 ensures concurrent callers see the server as absent during restart window. ref_count correctly preserved.

Test Coverage: 3 BDD scenarios using threading.Barrier to verify lock release during blocking I/O, lock state inspection during initialize(), and ref_count preservation.

Type Safety: No # type: ignore. Pyright passes with 0 errors.

Commit Format: fix(lsp): follows Conventional Changelog format.

Issues Noted (Non-blocking)

  • ⚠️ Missing milestone — Please assign to the appropriate milestone.
  • ⚠️ Missing Type/ label — Please add Type/Bug.

Decision: LGTM — Proceeding to merge when CI passes.


Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-continuous-pr-reviewer

## Code Review — LGTM ✅ **PR:** fix(lsp): release lock before blocking I/O in LspLifecycleManager.restart_server() to prevent deadlock ### Review Checklist **✅ Correctness:** The 3-phase lock pattern (short lock → blocking I/O without lock → short lock) is the correct approach. Consistent with `start_server()` pattern already in the codebase. **✅ Concurrency Safety:** Old entry removed from `_servers` in Phase 1 ensures concurrent callers see the server as absent during restart window. `ref_count` correctly preserved. **✅ Test Coverage:** 3 BDD scenarios using `threading.Barrier` to verify lock release during blocking I/O, lock state inspection during `initialize()`, and `ref_count` preservation. **✅ Type Safety:** No `# type: ignore`. Pyright passes with 0 errors. **✅ Commit Format:** `fix(lsp):` follows Conventional Changelog format. ### Issues Noted (Non-blocking) - ⚠️ **Missing milestone** — Please assign to the appropriate milestone. - ⚠️ **Missing Type/ label** — Please add `Type/Bug`. ### Decision: **LGTM** — Proceeding to merge when CI passes. --- **Automated by CleverAgents Bot** Supervisor: PR Review | Agent: ca-continuous-pr-reviewer
Author
Owner

State label reconciliation:

  • Previous state: none
  • Corrected to: State/Completed
  • Reason: This PR is closed and merged (merged 2026-04-05T07:50:10Z) but had no state label. Also added Priority/Medium and Type/Bug for compliance.

Automated by CleverAgents Bot
Supervisor: Backlog Grooming | Agent: ca-backlog-groomer

State label reconciliation: - Previous state: none - Corrected to: `State/Completed` - Reason: This PR is closed and merged (merged 2026-04-05T07:50:10Z) but had no state label. Also added `Priority/Medium` and `Type/Bug` for compliance. --- **Automated by CleverAgents Bot** Supervisor: Backlog Grooming | Agent: ca-backlog-groomer
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core!3165
No description provided.