BUG-HUNT: [concurrency] Race condition in LSP lifecycle manager causes incorrect reference counting #7197

Open
opened 2026-04-10 08:49:15 +00:00 by HAL9000 · 3 comments
Owner

Bug Report: [Concurrency] — Race condition in LSP lifecycle manager causes incorrect reference counting

Severity Assessment

  • Impact: Resource leaks, incorrect server shutdown, zombie processes
  • Likelihood: Medium (occurs during concurrent server start/stop operations)
  • Priority: High

Location

  • File: src/cleveragents/lsp/lifecycle.py
  • Function/Class: LspLifecycleManager.start_server method
  • Lines: ~77-83

Description

The lifecycle manager has a race condition between checking if a server transport is alive and incrementing its reference count. If the transport crashes between the is_alive check and the ref_count += 1, the reference count becomes incorrect, leading to resource leaks and improper shutdown behavior.

Evidence

# Phase 1: Check for existing server (short lock)
with self._lock:
    existing = self._servers.get(config.name)
    if existing is not None and existing.transport.is_alive:  # <-- Check
        existing.ref_count += 1  # <-- Increment (race window here)
        logger.info(...)
        return existing.client

Race scenario:

  1. Thread A: existing.transport.is_alive returns True
  2. Transport crashes due to external factors
  3. Thread A: existing.ref_count += 1 increments count for dead server
  4. Later stop_server() calls won't properly clean up the dead server

Expected Behavior

Reference counting should be atomic with liveness checks, or dead servers should be detected and cleaned up automatically.

Actual Behavior

Dead servers can have positive reference counts, preventing cleanup and causing resource leaks.

Suggested Fix

Re-check liveness after incrementing or add automatic cleanup:

with self._lock:
    existing = self._servers.get(config.name)
    if existing is not None and existing.transport.is_alive:
        existing.ref_count += 1
        # Verify still alive after incrementing
        if not existing.transport.is_alive:
            existing.ref_count -= 1
            # Handle dead server cleanup
        else:
            return existing.client

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Metadata

  • Branch: bugfix/lsp-lifecycle-ref-count-race
  • Commit Message: fix(lsp): fix race condition in LspLifecycleManager reference counting between liveness check and increment
  • Milestone: (none — backlog, see note below)
  • Parent Epic: #824

Subtasks

  • Create companion TDD issue (TDD: BUG-HUNT: [concurrency] Race condition in LSP lifecycle manager causes incorrect reference counting)
  • Link TDD issue as a blocker of this issue (this issue depends on TDD issue)
  • Tests (Behave): Write @tdd_issue @tdd_issue_<N> @tdd_expected_fail scenario that simulates transport crash between is_alive check and ref_count += 1 and asserts reference count remains correct
  • Implement atomic liveness-check-and-increment in LspLifecycleManager.start_server() to eliminate the race window
  • Add dead-server detection and automatic cleanup in stop_server() for servers with stale positive reference counts
  • Remove @tdd_expected_fail tag from TDD test once fix is implemented
  • Verify no regression in sequential start_server() / stop_server() paths
  • Run nox (all default sessions), fix any errors
  • Verify coverage ≥ 97% via nox -s coverage_report

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A companion TDD issue exists, its test is merged to master with @tdd_expected_fail, and this issue is linked as depending on that TDD issue.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage ≥ 97%.

Backlog note: This issue was discovered during autonomous operation
on milestone v3.6.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: Unknown | Agent: new-issue-creator

## Bug Report: [Concurrency] — Race condition in LSP lifecycle manager causes incorrect reference counting ### Severity Assessment - **Impact**: Resource leaks, incorrect server shutdown, zombie processes - **Likelihood**: Medium (occurs during concurrent server start/stop operations) - **Priority**: High ### Location - **File**: `src/cleveragents/lsp/lifecycle.py` - **Function/Class**: `LspLifecycleManager.start_server` method - **Lines**: ~77-83 ### Description The lifecycle manager has a race condition between checking if a server transport is alive and incrementing its reference count. If the transport crashes between the `is_alive` check and the `ref_count += 1`, the reference count becomes incorrect, leading to resource leaks and improper shutdown behavior. ### Evidence ```python # Phase 1: Check for existing server (short lock) with self._lock: existing = self._servers.get(config.name) if existing is not None and existing.transport.is_alive: # <-- Check existing.ref_count += 1 # <-- Increment (race window here) logger.info(...) return existing.client ``` **Race scenario:** 1. Thread A: `existing.transport.is_alive` returns `True` 2. Transport crashes due to external factors 3. Thread A: `existing.ref_count += 1` increments count for dead server 4. Later `stop_server()` calls won't properly clean up the dead server ### Expected Behavior Reference counting should be atomic with liveness checks, or dead servers should be detected and cleaned up automatically. ### Actual Behavior Dead servers can have positive reference counts, preventing cleanup and causing resource leaks. ### Suggested Fix Re-check liveness after incrementing or add automatic cleanup: ```python with self._lock: existing = self._servers.get(config.name) if existing is not None and existing.transport.is_alive: existing.ref_count += 1 # Verify still alive after incrementing if not existing.transport.is_alive: existing.ref_count -= 1 # Handle dead server cleanup else: return existing.client ``` ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. --- ## Metadata - **Branch**: `bugfix/lsp-lifecycle-ref-count-race` - **Commit Message**: `fix(lsp): fix race condition in LspLifecycleManager reference counting between liveness check and increment` - **Milestone**: *(none — backlog, see note below)* - **Parent Epic**: #824 ## Subtasks - [ ] Create companion TDD issue (`TDD: BUG-HUNT: [concurrency] Race condition in LSP lifecycle manager causes incorrect reference counting`) - [ ] Link TDD issue as a blocker of this issue (this issue depends on TDD issue) - [ ] Tests (Behave): Write `@tdd_issue @tdd_issue_<N> @tdd_expected_fail` scenario that simulates transport crash between `is_alive` check and `ref_count += 1` and asserts reference count remains correct - [ ] Implement atomic liveness-check-and-increment in `LspLifecycleManager.start_server()` to eliminate the race window - [ ] Add dead-server detection and automatic cleanup in `stop_server()` for servers with stale positive reference counts - [ ] Remove `@tdd_expected_fail` tag from TDD test once fix is implemented - [ ] Verify no regression in sequential `start_server()` / `stop_server()` paths - [ ] Run `nox` (all default sessions), fix any errors - [ ] Verify coverage ≥ 97% via `nox -s coverage_report` ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A companion TDD issue exists, its test is merged to `master` with `@tdd_expected_fail`, and this issue is linked as depending on that TDD issue. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. - All nox stages pass. - Coverage ≥ 97%. > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.6.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- **Automated by CleverAgents Bot** Supervisor: Unknown | Agent: new-issue-creator
Author
Owner

Verified — Concurrency bug: race condition in LSP lifecycle manager. MoSCoW: Must-have. Priority: High.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Concurrency bug: race condition in LSP lifecycle manager. MoSCoW: Must-have. Priority: High. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Author
Owner

Verified — Concurrency bug: race condition in LSP lifecycle manager. MoSCoW: Must-have. Priority: High.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Concurrency bug: race condition in LSP lifecycle manager. MoSCoW: Must-have. Priority: High. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Author
Owner

Verified — Concurrency bug: race condition in LSP lifecycle manager. MoSCoW: Must-have. Priority: High.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Concurrency bug: race condition in LSP lifecycle manager. MoSCoW: Must-have. Priority: High. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#824 Epic: LSP Functional Runtime
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#7197
No description provided.