MCPToolAdapter holds RLock during entire transport call, blocking concurrent operations #10512

Closed
opened 2026-04-18 10:27:34 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Branch: bugfix/mcp-adapter-lock-held-during-transport
  • Commit Message: Fix MCPToolAdapter RLock held during transport call blocking concurrent operations
  • Blocked by: #10510 (TDD issue)
  • Module: src/cleveragents/mcp/adapter.py

Background and Context

MCPToolAdapter.invoke() and MCPToolAdapter.discover_tools() hold the adapter's threading.RLock for the entire duration of the transport call (a blocking network I/O operation). This serializes all concurrent operations and blocks health checks during tool invocations, violating the MCP specification requirement that health checks must not block the main process.

Summary

MCPToolAdapter.invoke() and MCPToolAdapter.discover_tools() hold the adapter's threading.RLock for the entire duration of the transport call (a blocking network I/O operation). This serializes all concurrent operations and blocks health checks during tool invocations, violating the MCP specification requirement that health checks must not block the main process.

Code Evidence

File: src/cleveragents/mcp/adapter.py

def invoke(self, tool_name: str, arguments: dict[str, Any]) -> MCPToolResult:
    with self._lock:                              # ← lock acquired
        if not self._connected:
            ...
        descriptor = self._tools.get(tool_name)
        if descriptor is None:
            return MCPToolResult(success=False, ...)
        validation_error = self._validate_input(descriptor, arguments)
        if validation_error:
            return MCPToolResult(success=False, ...)
        start = time.monotonic()
        try:
            result = self._transport.call(        # ← SLOW I/O INSIDE LOCK
                "tools/call",
                {"name": tool_name, "arguments": arguments},
            )
        except TimeoutError as exc:
            ...
        except Exception as exc:
            ...
        elapsed = (time.monotonic() - start) * 1000
        ...
        return MCPToolResult(success=True, ...)   # ← lock released here

The same pattern exists in discover_tools():

def discover_tools(self, tool_filter=None) -> list[MCPToolDescriptor]:
    with self._lock:
        if not self._connected:
            ...
        result = self._transport.call("tools/list", {})  # ← SLOW I/O INSIDE LOCK
        ...

Impact

  1. Serialized tool calls: No two tool invocations can run concurrently — each must wait for the previous one to complete, including the full network round-trip.
  2. Health check starvation: _check_health() calls discover_tools() which tries to acquire self._lock. If a tool invocation is in progress, the health check is blocked for the entire duration of the tool call. This can cause health check timeouts and false-positive failure detection.
  3. Disconnect blocked: disconnect() acquires self._lock and is blocked while any tool call is in progress.
  4. Specification violation: The MCP specification requires that health checks must not block the main process. Holding the lock during transport calls violates this requirement.

Proposed Fix

Release the lock before the transport call and re-acquire it only for state mutations:

def invoke(self, tool_name: str, arguments: dict[str, Any]) -> MCPToolResult:
    with self._lock:
        if not self._connected:
            raise RuntimeError(...)
        descriptor = self._tools.get(tool_name)
        if descriptor is None:
            return MCPToolResult(success=False, ...)
        validation_error = self._validate_input(descriptor, arguments)
        if validation_error:
            return MCPToolResult(success=False, ...)
    # Lock released before I/O
    start = time.monotonic()
    try:
        result = self._transport.call(
            "tools/call",
            {"name": tool_name, "arguments": arguments},
        )
    except TimeoutError as exc:
        ...
    except Exception as exc:
        ...
    elapsed = (time.monotonic() - start) * 1000
    ...
    return MCPToolResult(success=True, ...)

Validation Gate

  • Code evidence: adapter.py invoke() and discover_tools()with self._lock: wraps self._transport.call()
  • Environment verification: Reproducible — concurrent tool calls are serialized; health checks block during invocations
  • Actionability: Release lock before transport call; re-acquire only for state mutations
  • Codebase freshness: Verified in current src/cleveragents/mcp/adapter.py
  • Severity match: CRITICAL — violates MCP spec, causes health check starvation, serializes all tool calls

Expected Behavior

Concurrent calls to invoke() and discover_tools() should be able to run in parallel (or at least not be serialized by the lock held during I/O). The lock should only protect shared state access, not the entire duration of the blocking transport call.

Acceptance Criteria

  • Concurrent invoke() calls complete in parallel (not serialized)
  • Health check _check_health() is not blocked during tool invocations
  • disconnect() is not blocked during tool invocations
  • All existing tests pass
  • No new race conditions introduced by the lock release

Subtasks

  • Confirm TDD issue #10510 is merged (failing test exists in codebase)
  • Create bugfix/mcp-adapter-lock-held-during-transport branch from master
  • Refactor invoke() to release lock before self._transport.call()
  • Refactor discover_tools() to release lock before self._transport.call()
  • Ensure thread safety of _connected and _tools state after lock release
  • Remove @tdd_expected_fail tag from the test added in #10510
  • Run full test suite via nox and confirm all tests pass
  • Open PR to master with Closes #<this_issue_number>

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.

Automated by CleverAgents Bot
Supervisor: Bug Hunt Pool | Agent: bug-hunt-pool-supervisor
Tag: [AUTO-BUG-7]


Automated by CleverAgents Bot
Agent: new-issue-creator

## Metadata - **Branch:** `bugfix/mcp-adapter-lock-held-during-transport` - **Commit Message:** `Fix MCPToolAdapter RLock held during transport call blocking concurrent operations` - **Blocked by:** #10510 (TDD issue) - **Module:** `src/cleveragents/mcp/adapter.py` ## Background and Context `MCPToolAdapter.invoke()` and `MCPToolAdapter.discover_tools()` hold the adapter's `threading.RLock` for the entire duration of the transport call (a blocking network I/O operation). This serializes all concurrent operations and blocks health checks during tool invocations, violating the MCP specification requirement that health checks must not block the main process. ## Summary `MCPToolAdapter.invoke()` and `MCPToolAdapter.discover_tools()` hold the adapter's `threading.RLock` for the entire duration of the transport call (a blocking network I/O operation). This serializes all concurrent operations and blocks health checks during tool invocations, violating the MCP specification requirement that health checks must not block the main process. ## Code Evidence **File:** `src/cleveragents/mcp/adapter.py` ```python def invoke(self, tool_name: str, arguments: dict[str, Any]) -> MCPToolResult: with self._lock: # ← lock acquired if not self._connected: ... descriptor = self._tools.get(tool_name) if descriptor is None: return MCPToolResult(success=False, ...) validation_error = self._validate_input(descriptor, arguments) if validation_error: return MCPToolResult(success=False, ...) start = time.monotonic() try: result = self._transport.call( # ← SLOW I/O INSIDE LOCK "tools/call", {"name": tool_name, "arguments": arguments}, ) except TimeoutError as exc: ... except Exception as exc: ... elapsed = (time.monotonic() - start) * 1000 ... return MCPToolResult(success=True, ...) # ← lock released here ``` The same pattern exists in `discover_tools()`: ```python def discover_tools(self, tool_filter=None) -> list[MCPToolDescriptor]: with self._lock: if not self._connected: ... result = self._transport.call("tools/list", {}) # ← SLOW I/O INSIDE LOCK ... ``` ## Impact 1. **Serialized tool calls**: No two tool invocations can run concurrently — each must wait for the previous one to complete, including the full network round-trip. 2. **Health check starvation**: `_check_health()` calls `discover_tools()` which tries to acquire `self._lock`. If a tool invocation is in progress, the health check is blocked for the entire duration of the tool call. This can cause health check timeouts and false-positive failure detection. 3. **Disconnect blocked**: `disconnect()` acquires `self._lock` and is blocked while any tool call is in progress. 4. **Specification violation**: The MCP specification requires that health checks must not block the main process. Holding the lock during transport calls violates this requirement. ## Proposed Fix Release the lock before the transport call and re-acquire it only for state mutations: ```python def invoke(self, tool_name: str, arguments: dict[str, Any]) -> MCPToolResult: with self._lock: if not self._connected: raise RuntimeError(...) descriptor = self._tools.get(tool_name) if descriptor is None: return MCPToolResult(success=False, ...) validation_error = self._validate_input(descriptor, arguments) if validation_error: return MCPToolResult(success=False, ...) # Lock released before I/O start = time.monotonic() try: result = self._transport.call( "tools/call", {"name": tool_name, "arguments": arguments}, ) except TimeoutError as exc: ... except Exception as exc: ... elapsed = (time.monotonic() - start) * 1000 ... return MCPToolResult(success=True, ...) ``` ## Validation Gate - ✅ **Code evidence**: `adapter.py` `invoke()` and `discover_tools()` — `with self._lock:` wraps `self._transport.call()` - ✅ **Environment verification**: Reproducible — concurrent tool calls are serialized; health checks block during invocations - ✅ **Actionability**: Release lock before transport call; re-acquire only for state mutations - ✅ **Codebase freshness**: Verified in current `src/cleveragents/mcp/adapter.py` - ✅ **Severity match**: CRITICAL — violates MCP spec, causes health check starvation, serializes all tool calls ## Expected Behavior Concurrent calls to `invoke()` and `discover_tools()` should be able to run in parallel (or at least not be serialized by the lock held during I/O). The lock should only protect shared state access, not the entire duration of the blocking transport call. ## Acceptance Criteria - Concurrent `invoke()` calls complete in parallel (not serialized) - Health check `_check_health()` is not blocked during tool invocations - `disconnect()` is not blocked during tool invocations - All existing tests pass - No new race conditions introduced by the lock release ## Subtasks - [ ] Confirm TDD issue #10510 is merged (failing test exists in codebase) - [ ] Create `bugfix/mcp-adapter-lock-held-during-transport` branch from `master` - [ ] Refactor `invoke()` to release lock before `self._transport.call()` - [ ] Refactor `discover_tools()` to release lock before `self._transport.call()` - [ ] Ensure thread safety of `_connected` and `_tools` state after lock release - [ ] Remove `@tdd_expected_fail` tag from the test added in #10510 - [ ] Run full test suite via `nox` and confirm all tests pass - [ ] Open PR to `master` with `Closes #<this_issue_number>` ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. --- Automated by CleverAgents Bot Supervisor: Bug Hunt Pool | Agent: bug-hunt-pool-supervisor Tag: [AUTO-BUG-7] --- **Automated by CleverAgents Bot** Agent: new-issue-creator
Author
Owner

Implementation Attempt — Tier 4: opus — Success

Fixed MCPToolAdapter RLock held during transport call blocking concurrent operations.

Changes:

  • src/cleveragents/mcp/adapter.py: Refactored invoke() and discover_tools() to release the RLock before making transport calls. Both methods now use a three-phase approach: (1) acquire lock for state validation, (2) release lock before transport call, (3) re-acquire lock only for shared state mutation.
  • features/tdd_mcp_adapter_rlock_concurrency.feature: Added 7 BDD scenarios testing concurrent operations, lock-checking, and validation.
  • features/steps/tdd_mcp_adapter_rlock_concurrency_steps.py: Step definitions with slow transport and lock-checking transport mocks.

Quality gate status: lint , typecheck , unit_tests (7/7 new + 44/44 existing MCP adapter + 35/35 other MCP scenarios)

PR: #10764


Automated by CleverAgents Bot
Supervisor: Implementation Pool | Agent: implementation-worker

**Implementation Attempt** — Tier 4: opus — Success Fixed MCPToolAdapter RLock held during transport call blocking concurrent operations. **Changes:** - `src/cleveragents/mcp/adapter.py`: Refactored `invoke()` and `discover_tools()` to release the RLock before making transport calls. Both methods now use a three-phase approach: (1) acquire lock for state validation, (2) release lock before transport call, (3) re-acquire lock only for shared state mutation. - `features/tdd_mcp_adapter_rlock_concurrency.feature`: Added 7 BDD scenarios testing concurrent operations, lock-checking, and validation. - `features/steps/tdd_mcp_adapter_rlock_concurrency_steps.py`: Step definitions with slow transport and lock-checking transport mocks. **Quality gate status:** lint ✅, typecheck ✅, unit_tests ✅ (7/7 new + 44/44 existing MCP adapter + 35/35 other MCP scenarios) PR: https://git.cleverthis.com/cleveragents/cleveragents-core/pulls/10764 --- **Automated by CleverAgents Bot** Supervisor: Implementation Pool | Agent: implementation-worker
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#10512
No description provided.