BUG-HUNT: [concurrency] StdioTransport._read_one_message() calls select() then read(content_length) — body read has no timeout and blocks indefinitely if server stalls mid-message #6579

Open
opened 2026-04-09 21:43:32 +00:00 by HAL9000 · 0 comments
Owner

Bug Report: [concurrency] — read(content_length) after select() can block indefinitely

Severity Assessment

  • Impact: If an LSP server sends the Content-Length header plus the first byte of the body, then stalls, _read_one_message() blocks in stdout.read(content_length) forever. There is no timeout protecting this call. The entire LspClient._send_request() loop (which calls read_message()) will hang, blocking the calling thread until the server process dies and its end of the pipe is closed.
  • Likelihood: Medium — any slow or crashing LSP server that sends a partial message triggers this. Network-backed servers (TCP transport) and misbehaving language servers under memory pressure are realistic triggers.
  • Priority: High

Location

  • File: src/cleveragents/lsp/transport.py
  • Function: StdioTransport._read_one_message()
  • Lines: 261–267

Description

select.select() only reports that a read on the file descriptor will not block, i.e. that at least 1 byte is available or the stream has reached EOF. It does not guarantee that content_length bytes are available. The subsequent stdout.read(content_length) is a blocking call that waits until all content_length bytes arrive or the stream reaches EOF. If the server is slow, or stalls after sending part of the message body while keeping the pipe open, this call blocks indefinitely with no timeout.

The same pattern exists for each readline() call in the header loop (lines 233–242): each iteration calls select (with the full timeout, not the remaining time) and then readline() — which is also a blocking call. A server that sends the first byte of a header line and then stalls will block readline() indefinitely, even though select already returned.

The effective_timeout / remaining parameters visible in read_message() / _send_request() provide an outer deadline for the overall operation, but that deadline is enforced by time.monotonic() checks between message reads in _send_request() — not during a blocking read() syscall inside _read_one_message(). Once inside read(content_length), the thread is blocked in a syscall and cannot check the outer deadline.
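The behaviour described above can be reproduced without a real language server. The following is a minimal sketch, assuming a POSIX system (where select() works on pipes) and a buffered binary stream like the one subprocess typically provides for stdout. The function name demo_partial_body and the returned dictionary are hypothetical and are not part of transport.py.

```python
import os
import select
import threading


def demo_partial_body(content_length: int = 10) -> dict:
    """Show that select() readiness does not imply content_length bytes are available."""
    read_fd, write_fd = os.pipe()
    reader = os.fdopen(read_fd, "rb")  # buffered reader, standing in for the server's stdout
    os.write(write_fd, b"x")           # the "server" sends 1 body byte, then stalls

    # select() reports the stream as readable because 1 byte is waiting.
    ready, _, _ = select.select([reader], [], [], 1.0)

    result = {}

    def blocking_read() -> None:
        # Same call shape as the body read in the issue: no timeout is possible here.
        result["body"] = reader.read(content_length)

    worker = threading.Thread(target=blocking_read, daemon=True)
    worker.start()
    worker.join(timeout=0.5)           # read() would have returned by now if it could
    still_blocked = worker.is_alive()

    os.close(write_fd)                 # EOF is the only thing that unblocks the read
    worker.join(timeout=2.0)
    reader.close()
    return {
        "select_ready": bool(ready),
        "still_blocked": still_blocked,
        "body": result.get("body"),
    }
```

In this sketch the read only returns once the write end is closed, and it returns a short body. That matches the truncation check in the excerpt, which is reached only after the server exits.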

Evidence

# transport.py lines 261-267
# Read body (wait for remaining data)
ready, _, _ = select.select([stdout], [], [], timeout)   # confirms ≥1 byte ready
if not ready:
    return None
body_bytes = stdout.read(content_length)   # ← BLOCKS until ALL bytes arrive; no timeout!
if len(body_bytes) < content_length:
    return None  # Truncated — server exited mid-message

And in the header loop:

# transport.py lines 233-237
while True:
    ready, _, _ = select.select([stdout], [], [], timeout)  # ≥1 byte ready
    if not ready:
        return None
    line = stdout.readline()   # ← BLOCKS until '\n' arrives; no timeout!

Expected Behavior

read_message(timeout=T) should return None (or raise LspError) if the total read time exceeds T seconds, including within blocking read() / readline() syscalls.

Actual Behavior

Once select() returns ready, the subsequent read() or readline() can block indefinitely, because neither call takes a timeout. The outer deadline in _send_request() cannot interrupt a blocking syscall.

Suggested Fix

Two options:

  1. Set the stream to non-blocking mode (os.set_blocking(stdout.fileno(), False)) and catch BlockingIOError, looping with select to accumulate bytes with a deadline.
  2. Use a dedicated reader thread that puts messages onto a queue.Queue with a timeout, so the main thread can do queue.get(timeout=remaining).
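Option 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: read_exact and read_header_block are hypothetical helpers, the deadline is an absolute time.monotonic() value, and the file descriptor is read exclusively through os.read() (mixing this with a buffered stdout object would lose bytes held in Python's buffer). The code assumes a POSIX system.

```python
import os
import select
import time
from typing import Optional


def read_exact(fd: int, n: int, deadline: float) -> Optional[bytes]:
    """Read exactly n bytes from fd; return None on timeout or EOF."""
    buf = bytearray()
    while len(buf) < n:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None                      # overall deadline exceeded
        ready, _, _ = select.select([fd], [], [], remaining)
        if not ready:
            return None                      # nothing arrived before the deadline
        try:
            # os.read() returns whatever is available, up to the requested size.
            chunk = os.read(fd, n - len(buf))
        except BlockingIOError:
            continue                         # spurious readiness on a non-blocking fd
        if not chunk:
            return None                      # EOF: the peer closed the pipe
        buf += chunk
    return bytes(buf)


def read_header_block(fd: int, deadline: float) -> Optional[bytes]:
    """Read up to and including the blank line that ends the headers."""
    buf = bytearray()
    # One byte at a time keeps the sketch simple and never consumes body bytes.
    while not buf.endswith(b"\r\n\r\n"):
        byte = read_exact(fd, 1, deadline)
        if byte is None:
            return None
        buf += byte
    return bytes(buf)
```

A caller would compute deadline = time.monotonic() + timeout once per message and pass the same value to both helpers, so the header and body reads share a single budget. Bytes read before a timeout are discarded in this sketch, so the stream should be treated as desynchronised after a None result.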

Option 2 is a common pattern for synchronous LSP clients and avoids the complexity of non-blocking partial reads. The reader thread can still block on a stalled server, but the calling thread no longer does.
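A minimal sketch of the reader-thread approach follows. All names here (_read_frames, start_reader, read_message as a free function) are hypothetical and chosen for illustration; the real StdioTransport API may differ. The sketch assumes a blocking binary stream and uses None on the queue as an end-of-stream sentinel.

```python
import queue
import threading
from typing import BinaryIO, Optional


def _read_frames(stream: BinaryIO, out: "queue.Queue") -> None:
    """Blocking loop: parse Content-Length frames and enqueue each body."""
    try:
        while True:
            content_length = None
            while True:
                line = stream.readline()
                if not line:
                    return                       # EOF while reading headers
                if line in (b"\r\n", b"\n"):
                    break                        # blank line ends the header block
                name, _, value = line.partition(b":")
                if name.strip().lower() == b"content-length":
                    try:
                        content_length = int(value.strip())
                    except ValueError:
                        return                   # malformed header: stop reading
            if content_length is None:
                return                           # no Content-Length header seen
            body = stream.read(content_length)   # may block; only this thread waits
            if len(body) < content_length:
                return                           # truncated: server exited mid-message
            out.put(body)
    finally:
        out.put(None)                            # sentinel: no more messages


def start_reader(stream: BinaryIO) -> "queue.Queue":
    """Start the daemon reader thread and return the queue it fills."""
    messages: "queue.Queue" = queue.Queue()
    threading.Thread(target=_read_frames, args=(stream, messages), daemon=True).start()
    return messages


def read_message(messages: "queue.Queue", timeout: float) -> Optional[bytes]:
    """Return the next message body, or None on timeout or end of stream."""
    try:
        return messages.get(timeout=timeout)
    except queue.Empty:
        return None
```

With this shape, the caller's deadline is enforced by queue.get(timeout=...), which does honour a timeout. One design point to settle in a real fix is whether timeout and end of stream should be distinguishable; this sketch returns None for both.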

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:52:45 +00:00
Reference
cleveragents/cleveragents-core#6579