fix(lsp): release lock before blocking I/O in LspLifecycleManager.restart_server() to prevent deadlock #3165
No reviewers
Labels
No labels
auto/needs-reevaluation
controller-managed
auto/blocked-by-deps
auto/ci-timeout
auto/claimed-implementer
auto/claimed-merge
auto/claimed-reviewer
auto/driver-down
auto/invariant-violation
auto/last-attempt-tier-0
auto/last-attempt-tier-1
auto/last-attempt-tier-2
auto/last-attempt-tier-min
Automation Tracking
auto/needs-conflict-resolution
auto/needs-implementer
auto/postmortem
auto/ready-to-merge
auto/restart-throttled
auto/revert
auto/sentinel
auto/stale-inactivity
auto/unstable
Blocked
Bounty
$100
Bounty
$1000
Bounty
$10000
Bounty
$20
Bounty
$2000
Bounty
$250
Bounty
$50
Bounty
$500
Bounty
$5000
Bounty
$750
MoSCoW
Could have
MoSCoW
Must have
MoSCoW
Should have
Needs Feedback
Points
1
Points
13
Points
2
Points
21
Points
3
Points
34
Points
5
Points
55
Points
8
Points
88
Priority
Backlog
Priority
CI Blocker
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Signed-off: Owner
Signed-off: Scrum Master
Signed-off: Tech Lead
Spike
State
Completed
State
Duplicate
State
In Progress
State
In Review
State
Paused
State
Unverified
State
Verified
State
Wont Do
Type
Automation
Type
Bug
Type
Discussion
Type
Documentation
Type
Epic
Type
Feature
Type
Legendary
Type
Refactor
Type
Support
Type
Task
Type
Testing
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
cleveragents/cleveragents-core!3165
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "fix/lsp-lifecycle-restart-lock-deadlock"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
LspLifecycleManager.restart_server()previously heldself._lockacross all blocking I/O operations —transport.stop(),transport.start(), andclient.initialize(). Under concurrent access this caused a deadlock: any thread callinghealth_check(),list_running(), orstop_server()would block for up to 60 seconds waiting for the lock to be released.Changes
src/cleveragents/lsp/lifecycle.pyRefactored
restart_server()to use the same 3-phase lock pattern already employed bystart_server():_serversold_transport.stop(),StdioTransport(…).start(),client.initialize()_ManagedServerinto_servers, preservingref_countRemoving the old entry in Phase 1 ensures concurrent callers see the server as absent during the restart window (consistent with the existing
start_servercontract). Theref_countfrom the original managed server is carried forward to the new one.features/lsp_lifecycle_coverage.feature+features/steps/lsp_lifecycle_coverage_steps.pyThree new BDD scenarios verify the fix:
health_check is not blocked while restart_server is in progress— uses athreading.Barrierto synchronise two threads at the exact moment the lock is released in Phase 2. Asserts thathealth_check()completes without blocking.restart_server does not hold the lock during client.initialize— inspects_lockstate from within the mockinitialize()side-effect usingacquire(blocking=False). Asserts the lock is acquirable (i.e. not held).restart_server preserves the ref_count of the managed server— starts the server twice (ref_count=2), restarts it, and asserts ref_count is still 2.Test Results
All three new scenarios pass. Ruff lint and format checks pass. Pyright reports 0 errors.
Closes
Closes #3026
Automated by CleverAgents Bot
Supervisor: Implementation | Agent: ca-issue-worker
🔒 Review claimed by reviewer-pool-1 [claim-token: reviewer-pool-1-3165-1775372600]
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-continuous-pr-reviewer
LspLifecycleManager.restart_server()holds_lockduring blocking I/O — potential deadlock under concurrent access #3026Independent Code Review: APPROVED ✅
Note: Posted as COMMENT because Forgejo prevents self-approval. This is an independent review by the PR self-reviewer agent, which provides a different perspective than the implementing agent.
Summary
This PR correctly refactors
LspLifecycleManager.restart_server()to use the same 3-phase lock pattern already established bystart_server(), eliminating a potential deadlock whereself._lockwas held across all blocking I/O operations (transport stop/start, LSP handshake up to 60s).Code Review
lifecycle.py — 3-phase lock refactoring
del self._servers[name]old_transport.stop(),StdioTransport().start(),client.initialize()_ManagedServer, setref_count, insert into_serversinitialize()fails, the new transport is stopped and the exception propagates. The server entry (removed in Phase 1) is intentionally not re-inserted — consistent with the contract that a failed restart leaves the server absent.ref_countis properly snapshotted in Phase 1 and carried forward to the new_ManagedServerin Phase 3.Tests — 3 new BDD scenarios
health_check is not blocked while restart_server is in progress— Usesthreading.Barrier(2, timeout=5)for precise synchronization. The barrier ensures the health_check thread runs exactly when restart is in Phase 2 (lock released). This would genuinely fail if the lock were held. ✅ Excellent concurrent test.restart_server does not hold the lock during client.initialize— Inspects_lock.acquire(blocking=False)from within the mockinitialize()side-effect. Deterministic, no timing dependency. ✅ Clean verification.restart_server preserves the ref_count of the managed server— Starts server twice (ref_count=2), restarts, asserts ref_count=2. ✅ Simple and effective.Specification Alignment
The implementation matches the issue's expected behavior exactly: the 3-phase pattern mirrors
start_server(), no blocking I/O occurs whileself._lockis held, and concurrent access tohealth_check(),get_client(),stop_server(), andlist_running()is no longer blocked during restart.Notes (non-blocking)
lsp_lifecycle_coverage_steps.py) is now ~772 lines, exceeding the 500-line guideline. However, it was already at ~635 lines before this PR — this is a pre-existing condition, not introduced here. A future refactoring to split the step file would be appropriate.Type/label and milestone — these should be set to match issue #3026 (Type/Bug, v3.6.0).CI Status
All critical quality gates passed: lint ✅, typecheck ✅, security ✅, quality ✅, unit_tests ✅, integration_tests ✅, e2e_tests ✅, coverage ✅, build ✅, docker ✅, helm ✅. Benchmark and status-check jobs still pending.
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-pr-self-reviewer
LspLifecycleManager.restart_server()holds_lockduring blocking I/O — potential deadlock under concurrent access #3026🔒 Review claimed by reviewer-pool-1 [claim-token: reviewer-pool-1-3165-1743897600]
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-continuous-pr-reviewer
Code Review — LGTM ✅
PR: fix(lsp): release lock before blocking I/O in LspLifecycleManager.restart_server() to prevent deadlock
Review Checklist
✅ Correctness: The 3-phase lock pattern (short lock → blocking I/O without lock → short lock) is the correct approach. Consistent with
start_server()pattern already in the codebase.✅ Concurrency Safety: Old entry removed from
_serversin Phase 1 ensures concurrent callers see the server as absent during restart window.ref_countcorrectly preserved.✅ Test Coverage: 3 BDD scenarios using
threading.Barrierto verify lock release during blocking I/O, lock state inspection duringinitialize(), andref_countpreservation.✅ Type Safety: No
# type: ignore. Pyright passes with 0 errors.✅ Commit Format:
fix(lsp):follows Conventional Changelog format.Issues Noted (Non-blocking)
Type/Bug.Decision: LGTM — Proceeding to merge when CI passes.
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-continuous-pr-reviewer
State label reconciliation:
State/CompletedPriority/MediumandType/Bugfor compliance.Automated by CleverAgents Bot
Supervisor: Backlog Grooming | Agent: ca-backlog-groomer