BUG-HUNT: [error-handling] LspClient._send_request() busy-loops when LSP server is alive but not responding, burning CPU #7368

Open
opened 2026-04-10 18:20:36 +00:00 by HAL9000 · 3 comments
Owner

Bug Report: [error-handling] LspClient._send_request() busy-loops when server is alive but unresponsive, burning CPU until timeout

Severity Assessment

  • Impact: When the LSP server is alive (process running) but not producing responses (e.g., deadlocked or overwhelmed), _send_request() keeps retrying read_message(timeout=remaining) until the deadline expires. If read_message returns None before its timeout elapses, the retries form a tight loop that can saturate a CPU core for the full request timeout
  • Likelihood: Medium — occurs when LSP servers are slow to respond (e.g., Pyright processing a large codebase, or server deadlock)
  • Priority: Medium

Location

  • File: src/cleveragents/lsp/client.py
  • Function/Class: LspClient._send_request()
  • Lines: ~70-110

Description

The _send_request() method uses a polling loop to wait for the server response:

deadline = time.monotonic() + timeout

while time.monotonic() < deadline:
    remaining = max(0.1, deadline - time.monotonic())
    msg = self._transport.read_message(timeout=remaining)
    if msg is None:
        if self._transport.is_alive:
            continue  # BUG: busy loop!
        raise LspError(...)
    # Process message...

When the server is alive (is_alive == True) but not sending any messages, read_message(timeout=remaining) returns None after waiting up to remaining seconds. The max(0.1, ...) guard floors each wait at 0.1 seconds: as the deadline approaches and remaining falls toward 0, max(0.1, 0.001) still evaluates to 0.1, so the final call can block for 100 ms even when less time is left.

More importantly, the continue statement re-enters the loop immediately after read_message returns None. If read_message returns None early (for example because of a short internal timeout or a partial read), the outer loop calls it again with no sleep in between, which can become a tight loop.

The correct approach is to pass the actual remaining time to read_message as its timeout and to let that blocking call pace the loop, since read_message already blocks for up to remaining seconds.

Evidence

while time.monotonic() < deadline:
    remaining = max(0.1, deadline - time.monotonic())  # minimum 0.1s
    msg = self._transport.read_message(timeout=remaining)
    if msg is None:
        if self._transport.is_alive:
            continue  # This can be a tight loop if read_message returns immediately!
        raise LspError(...)

If StdioTransport.read_message() has a bug where it returns None immediately (e.g., due to an empty read), this becomes a CPU-burning busy loop for the entire timeout duration (default: 60 seconds).
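
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: ImmediateNoneTransport and poll_like_original are hypothetical names, and RuntimeError stands in for LspError, whose constructor arguments are elided in this report. The stub assumes the worst case, a read_message that returns None without blocking at all.

```python
import time


class ImmediateNoneTransport:
    """Hypothetical stub: reports alive, but read_message never blocks."""

    is_alive = True

    def __init__(self):
        self.calls = 0

    def read_message(self, timeout):
        # Ignores the timeout entirely, imitating the suspected transport bug.
        self.calls += 1
        return None


def poll_like_original(transport, timeout):
    """Same control flow as the loop quoted in this report."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        remaining = max(0.1, deadline - time.monotonic())
        msg = transport.read_message(timeout=remaining)
        if msg is None:
            if transport.is_alive:
                continue  # nothing paces this retry
            raise RuntimeError("server exited")  # stand-in for LspError(...)
        return msg
    return None


transport = ImmediateNoneTransport()
result = poll_like_original(transport, timeout=0.05)
print(f"read_message called {transport.calls} times in 50 ms")
```

The exact count depends on the machine, but every call after the first is a wake-up that did no useful work, which is the CPU burn this report is about.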

Additionally, if remaining becomes very small (approaching deadline), max(0.1, remaining) might cause the total wait to exceed the timeout by up to 0.1 seconds, making timeout semantics imprecise.
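
The overshoot can be measured with a transport that behaves correctly and blocks for the whole timeout it is given. This is a small sketch under assumptions: BlockingTransport and timed_poll are hypothetical names, and the loop body is reduced to the parts that affect timing.

```python
import time


class BlockingTransport:
    """Hypothetical stub: alive, silent, and blocks for the full timeout."""

    is_alive = True

    def read_message(self, timeout):
        time.sleep(timeout)
        return None


def timed_poll(transport, timeout):
    """Run the original-style loop and return how long it actually took."""
    start = time.monotonic()
    deadline = start + timeout
    while time.monotonic() < deadline:
        # The 0.1 s floor from the report: applies even when less time is left.
        remaining = max(0.1, deadline - time.monotonic())
        msg = transport.read_message(timeout=remaining)
        if msg is None and transport.is_alive:
            continue
        break
    return time.monotonic() - start


elapsed = timed_poll(BlockingTransport(), timeout=0.05)
print(f"requested 0.05 s, waited {elapsed:.3f} s")
```

With a requested timeout of 0.05 seconds the floor turns the single read into a 0.1 second wait, so the caller waits roughly twice as long as it asked for.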

Expected Behavior

Each read_message call should wait for at most the time remaining until the deadline, and there should be a minimum sleep between retries to prevent tight loops. The loop should also exit cleanly once the deadline has passed, without overshooting it.

Actual Behavior

The continue statement can cause rapid polling if read_message returns None too quickly. The max(0.1, remaining) guard also means timeouts can exceed their specified duration by up to 100ms.

Suggested Fix

while True:
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        break
    # Use the full remaining time as the timeout - no tight looping
    actual_timeout = min(remaining, _REQUEST_TIMEOUT)
    msg = self._transport.read_message(timeout=actual_timeout)
    if msg is None:
        if not self._transport.is_alive:
            raise LspError(...)
        # Server alive but no message yet - check if deadline exceeded
        continue  # read_message already waited up to timeout
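
Note that the loop above removes the 0.1 second floor but would still spin if read_message returned None without blocking. One way to also get the minimum sleep asked for under Expected Behavior is sketched below. It is an illustration under assumptions, not the project's implementation: wait_for_message, min_pause, FakeClock, and ImmediateNoneTransport are hypothetical names, the clock and sleep parameters exist only so the example can run against a simulated clock, and RuntimeError stands in for LspError.

```python
import time


class FakeClock:
    """Hypothetical manual clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def monotonic(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


class ImmediateNoneTransport:
    """Hypothetical stub: alive, but read_message returns None at once."""

    is_alive = True

    def __init__(self):
        self.calls = 0

    def read_message(self, timeout):
        self.calls += 1
        return None


def wait_for_message(transport, timeout, min_pause=0.05,
                     clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("no response before deadline")
        started = clock()
        msg = transport.read_message(timeout=remaining)
        if msg is not None:
            return msg
        if not transport.is_alive:
            raise RuntimeError("server exited")  # stand-in for LspError(...)
        # Pause only if read_message came back sooner than min_pause,
        # and never sleep past the deadline.
        pause = min(min_pause - (clock() - started), deadline - clock())
        if pause > 0:
            sleep(pause)


clock = FakeClock()
transport = ImmediateNoneTransport()
try:
    wait_for_message(transport, timeout=1.0, min_pause=0.25,
                     clock=clock.monotonic, sleep=clock.sleep)
except TimeoutError:
    pass
print(transport.calls, clock.now)
```

Against the misbehaving stub, the simulated one-second timeout produces four reads spaced 0.25 seconds apart and ends exactly at the deadline. When read_message blocks properly, the pause evaluates to zero or less and is skipped, so the guard costs nothing in the normal case.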

Category

error-handling

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor

Author
Owner

Verified — Bug: LspClient busy-loops burning CPU when LSP server is unresponsive. MoSCoW: Must-have. Priority: High — CPU exhaustion risk.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#7368