BUG-HUNT: [concurrency] MCPToolAdapter.connect() calls thread.join() while holding the RLock — deadlock when transport callback re-enters the adapter #6755

Open
opened 2026-04-10 01:58:19 +00:00 by HAL9000 · 2 comments
Owner

Bug Report: [concurrency] — MCPToolAdapter.connect() holds RLock during thread.join()

Severity Assessment

  • Impact: Complete deadlock of any thread that calls connect() or any adapter method that acquires _lock while a connection is in progress — permanently hangs the MCP integration in concurrent use
  • Likelihood: Triggered any time connect() is called concurrently from a second thread, or when a transport implementation attempts to call back into the adapter during connection setup
  • Priority: High

Location

  • File: src/cleveragents/mcp/adapter.py
  • Function/Class: MCPToolAdapter.connect()
  • Lines: ~286–315

Description

MCPToolAdapter.connect() acquires self._lock (an RLock) at the top of the method and then calls thread.join(timeout=timeout) while still holding the lock. This creates a critical concurrency hazard:

  1. Transport callback deadlock: If the MCPTransport.connect() implementation (running in _do_connect thread) tries to call any method on the adapter that also acquires self._lock (e.g., disconnect(), is_connected, capabilities), the thread will block attempting to acquire the RLock that the calling thread already holds. Since the calling thread is stuck in thread.join() waiting for the inner thread to finish, and the inner thread is blocked on the RLock — this is a classic deadlock.

  2. Concurrent callers blocked: If a second thread calls connect() while the first is blocked in thread.join(), it will block at with self._lock — unable to return even if the timeout elapses, because the first thread holds the lock.

  3. Health timer deadlock: McpClient._perform_health_check() calls self._adapter.discover_tools() which calls self._adapter.connect() via _ensure_started(). If McpClient.start() is called concurrently from another thread and start()_adapter.connect() holds the lock while joining, any subsequent discover_tools() call from the health timer thread will deadlock.

Evidence

# src/cleveragents/mcp/adapter.py, lines 286-315
def connect(self, timeout: float = 30.0) -> None:
    with self._lock:         # <-- lock acquired here
        if self._connected:
            return

        result: dict[str, Any] = {}
        exc_holder: list[BaseException] = []

        def _do_connect() -> None:
            try:
                result["caps"] = self._transport.connect(self._config)
                # ↑ If this calls back into adapter, it will block on _lock
            except Exception as exc:
                exc_holder.append(exc)

        thread = threading.Thread(target=_do_connect, daemon=True)
        thread.start()
        thread.join(timeout=timeout)   # <-- blocks while holding _lock!

        if thread.is_alive():
            ...

Expected Behavior

The lock should only be held for the brief state-transition critical sections (setting _connected = True, reading/writing _capabilities), NOT across the entire network I/O wait. The thread should be started and joined outside the lock, with the lock re-acquired only to write the final state.

Actual Behavior

thread.join() is called inside with self._lock:, meaning the lock is held for the entire duration of the network connection attempt (up to 30 seconds). Any thread attempting to acquire _lock during this time will block indefinitely if _do_connect never releases (because it's also blocked on the lock).

Suggested Fix

Move the thread start/join outside the lock, and only acquire the lock to write the final connected state:

def connect(self, timeout: float = 30.0) -> None:
    with self._lock:
        if self._connected:
            return

    result: dict[str, Any] = {}
    exc_holder: list[BaseException] = []

    def _do_connect() -> None:
        try:
            result["caps"] = self._transport.connect(self._config)
        except Exception as exc:
            exc_holder.append(exc)

    thread = threading.Thread(target=_do_connect, daemon=True)
    thread.start()
    thread.join(timeout=timeout)   # join without holding lock

    with self._lock:
        if thread.is_alive():
            raise ConnectionError(...)
        if exc_holder:
            raise ConnectionError(...) from exc_holder[0]
        if not self._connected:   # double-check after acquiring lock
            self._capabilities = result.get("caps") or {}
            self._connected = True

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: [concurrency] — `MCPToolAdapter.connect()` holds RLock during `thread.join()` ### Severity Assessment - **Impact**: Complete deadlock of any thread that calls `connect()` or any adapter method that acquires `_lock` while a connection is in progress — permanently hangs the MCP integration in concurrent use - **Likelihood**: Triggered any time `connect()` is called concurrently from a second thread, or when a transport implementation attempts to call back into the adapter during connection setup - **Priority**: High ### Location - **File**: `src/cleveragents/mcp/adapter.py` - **Function/Class**: `MCPToolAdapter.connect()` - **Lines**: ~286–315 ### Description `MCPToolAdapter.connect()` acquires `self._lock` (an `RLock`) at the top of the method and then calls `thread.join(timeout=timeout)` **while still holding the lock**. This creates a critical concurrency hazard: 1. **Transport callback deadlock**: If the `MCPTransport.connect()` implementation (running in `_do_connect` thread) tries to call any method on the adapter that also acquires `self._lock` (e.g., `disconnect()`, `is_connected`, `capabilities`), the thread will block attempting to acquire the RLock that the calling thread already holds. Since the calling thread is stuck in `thread.join()` waiting for the inner thread to finish, and the inner thread is blocked on the RLock — this is a classic deadlock. 2. **Concurrent callers blocked**: If a second thread calls `connect()` while the first is blocked in `thread.join()`, it will block at `with self._lock` — unable to return even if the timeout elapses, because the first thread holds the lock. 3. **Health timer deadlock**: `McpClient._perform_health_check()` calls `self._adapter.discover_tools()` which calls `self._adapter.connect()` via `_ensure_started()`. If `McpClient.start()` is called concurrently from another thread and `start()` → `_adapter.connect()` holds the lock while joining, any subsequent `discover_tools()` call from the health timer thread will deadlock. ### Evidence ```python # src/cleveragents/mcp/adapter.py, lines 286-315 def connect(self, timeout: float = 30.0) -> None: with self._lock: # <-- lock acquired here if self._connected: return result: dict[str, Any] = {} exc_holder: list[BaseException] = [] def _do_connect() -> None: try: result["caps"] = self._transport.connect(self._config) # ↑ If this calls back into adapter, it will block on _lock except Exception as exc: exc_holder.append(exc) thread = threading.Thread(target=_do_connect, daemon=True) thread.start() thread.join(timeout=timeout) # <-- blocks while holding _lock! if thread.is_alive(): ... ``` ### Expected Behavior The lock should only be held for the brief state-transition critical sections (setting `_connected = True`, reading/writing `_capabilities`), NOT across the entire network I/O wait. The thread should be started and joined **outside** the lock, with the lock re-acquired only to write the final state. ### Actual Behavior `thread.join()` is called inside `with self._lock:`, meaning the lock is held for the entire duration of the network connection attempt (up to 30 seconds). Any thread attempting to acquire `_lock` during this time will block indefinitely if `_do_connect` never releases (because it's also blocked on the lock). ### Suggested Fix Move the thread start/join outside the lock, and only acquire the lock to write the final connected state: ```python def connect(self, timeout: float = 30.0) -> None: with self._lock: if self._connected: return result: dict[str, Any] = {} exc_holder: list[BaseException] = [] def _do_connect() -> None: try: result["caps"] = self._transport.connect(self._config) except Exception as exc: exc_holder.append(exc) thread = threading.Thread(target=_do_connect, daemon=True) thread.start() thread.join(timeout=timeout) # join without holding lock with self._lock: if thread.is_alive(): raise ConnectionError(...) if exc_holder: raise ConnectionError(...) from exc_holder[0] if not self._connected: # double-check after acquiring lock self._capabilities = result.get("caps") or {} self._connected = True ``` ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter
Author
Owner

Label compliance fix applied: Added missing State/Unverified label per CONTRIBUTING.md requirements.


Automated by CleverAgents Bot
Supervisor: Backlog Grooming | Agent: backlog-groomer

Label compliance fix applied: Added missing `State/Unverified` label per CONTRIBUTING.md requirements. --- **Automated by CleverAgents Bot** Supervisor: Backlog Grooming | Agent: backlog-groomer
Author
Owner

Verified — Concurrency bug: MCPToolAdapter.connect() deadlock when transport callback re-enters adapter. MoSCoW: Must-have. Priority: High.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Concurrency bug: MCPToolAdapter.connect() deadlock when transport callback re-enters adapter. MoSCoW: Must-have. Priority: High. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#6755
No description provided.