UAT: MCPToolAdapter.discover_tools() and invoke() hold the RLock during network I/O — causes lock contention and potential deadlock under concurrent access #2163

Open
opened 2026-04-03 04:36:11 +00:00 by freemo · 3 comments
Owner

Metadata

  • Branch: bugfix/mcp-adapter-rlock-held-during-network-io
  • Commit Message: fix(mcp): release RLock before network I/O in MCPToolAdapter.discover_tools() and invoke()
  • Milestone: v3.7.0
  • Parent Epic: #399

Background and Context

Code-level analysis of src/cleveragents/mcp/adapter.py revealed that MCPToolAdapter.discover_tools() and invoke() hold the internal threading.RLock for the entire duration of the network call to the MCP server (via self._transport.call(...)). This means that while a slow or blocking network call is in progress, no other thread can call connect(), disconnect(), reconnect(), or any other method that acquires the lock.

Per the specification's concurrency and thread-safety requirements for the MCP adapter layer, the lock should only be held for short, non-blocking critical sections (state reads/writes). Network I/O must be performed outside the lock — this is the standard pattern for thread-safe classes with I/O.

Steps to Reproduce

import threading
from cleveragents.mcp.adapter import MCPToolAdapter

adapter = MCPToolAdapter(transport=slow_transport)
adapter.connect()

# Thread A: calls invoke() — holds RLock for the full duration of the network call
t_a = threading.Thread(target=adapter.invoke, args=("my_tool", {}))

# Thread B: calls discover_tools() — blocks indefinitely waiting for the RLock
t_b = threading.Thread(target=adapter.discover_tools)

t_a.start()
t_b.start()
# Thread B is blocked for the entire duration of Thread A's network I/O

Root Cause

In discover_tools() (lines 421–446 of src/cleveragents/mcp/adapter.py):

def discover_tools(self, tool_filter=None):
    with self._lock:  # Lock acquired
        if not self._connected:
            raise RuntimeError(...)
        result = self._transport.call("tools/list", {})  # Network I/O while lock held!
        raw_tools = result.get("tools", [])
        ...
        self._tools = {d.name: d for d in descriptors}
        return descriptors
    # Lock released

In invoke() (lines 470–527 of src/cleveragents/mcp/adapter.py):

def invoke(self, tool_name, arguments):
    with self._lock:  # Lock acquired
        ...
        start = time.monotonic()
        try:
            result = self._transport.call("tools/call", ...)  # Network I/O while lock held!
        except TimeoutError as exc:
            ...
        elapsed = (time.monotonic() - start) * 1000
        ...
        return MCPToolResult(...)
    # Lock released

Expected Behaviour

Network I/O should be performed outside the lock. The lock should only be held for short, non-blocking critical sections (state reads/writes):

def discover_tools(self, tool_filter=None):
    with self._lock:
        if not self._connected:
            raise RuntimeError(...)
    # Lock released — perform network I/O without holding lock
    result = self._transport.call("tools/list", {})
    raw_tools = result.get("tools", [])
    ...
    with self._lock:
        self._tools = {d.name: d for d in descriptors}
    return descriptors

Actual Behaviour

Both discover_tools() and invoke() hold the RLock for the entire duration of the self._transport.call(...) network operation, blocking all other threads from acquiring the lock until the network call completes (which may take up to the full timeout duration).

Impact

  1. Lock contention: If discover_tools() or invoke() is slow (e.g., 30-second timeout), all other threads are blocked from calling any method on the adapter for the entire duration.
  2. Deadlock risk: If MCPRefreshHook triggers discover_tools() from a notification listener thread while the main thread is also calling invoke(), one thread will block indefinitely waiting for the lock (since RLock is reentrant only for the same thread).
  3. Health check blocking: McpClient._check_health() calls discover_tools(), which holds the lock during the health probe — blocking concurrent tool invocations.

Affected Files

  • src/cleveragents/mcp/adapter.py, lines 421–446: discover_tools() holds lock during transport.call()
  • src/cleveragents/mcp/adapter.py, lines 470–527: invoke() holds lock during transport.call()

Subtasks

  • Refactor MCPToolAdapter.discover_tools() to release the RLock before calling self._transport.call("tools/list", {}) and re-acquire it only to write back self._tools
  • Refactor MCPToolAdapter.invoke() to release the RLock before calling self._transport.call("tools/call", ...) and re-acquire it only for state reads/writes
  • Audit all other methods in MCPToolAdapter for the same pattern (lock held across I/O) and fix any additional occurrences
  • Write a Behave unit test scenario (in features/) that verifies two threads can concurrently call invoke() and discover_tools() without blocking each other
  • Write a Behave unit test scenario that verifies the _connected state check still raises RuntimeError correctly when the adapter is not connected
  • Verify that the refactored discover_tools() and invoke() remain safe against TOCTOU races (e.g., disconnect between the state check and the network call) and handle that case gracefully
  • Run nox -e typecheck and confirm no Pyright errors are introduced
  • Run nox -e unit_tests and confirm all scenarios pass
  • Run nox (all default sessions) and fix any errors
  • Verify coverage >= 97% via nox -e coverage_report

Definition of Done

  • MCPToolAdapter.discover_tools() does not hold the RLock during self._transport.call()
  • MCPToolAdapter.invoke() does not hold the RLock during self._transport.call()
  • Two concurrent threads calling invoke() and discover_tools() respectively are no longer serialised by the lock during network I/O
  • The _connected state guard in both methods still raises RuntimeError correctly when the adapter is disconnected
  • Behave feature scenarios cover the concurrent-access and disconnected-state cases
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done
  • All nox stages pass
  • Coverage >= 97%

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

## Metadata - **Branch**: `bugfix/mcp-adapter-rlock-held-during-network-io` - **Commit Message**: `fix(mcp): release RLock before network I/O in MCPToolAdapter.discover_tools() and invoke()` - **Milestone**: v3.7.0 - **Parent Epic**: #399 ## Background and Context Code-level analysis of `src/cleveragents/mcp/adapter.py` revealed that `MCPToolAdapter.discover_tools()` and `invoke()` hold the internal `threading.RLock` for the entire duration of the network call to the MCP server (via `self._transport.call(...)`). This means that while a slow or blocking network call is in progress, no other thread can call `connect()`, `disconnect()`, `reconnect()`, or any other method that acquires the lock. Per the specification's concurrency and thread-safety requirements for the MCP adapter layer, the lock should only be held for short, non-blocking critical sections (state reads/writes). Network I/O must be performed outside the lock — this is the standard pattern for thread-safe classes with I/O. ## Steps to Reproduce ```python import threading from cleveragents.mcp.adapter import MCPToolAdapter adapter = MCPToolAdapter(transport=slow_transport) adapter.connect() # Thread A: calls invoke() — holds RLock for the full duration of the network call t_a = threading.Thread(target=adapter.invoke, args=("my_tool", {})) # Thread B: calls discover_tools() — blocks indefinitely waiting for the RLock t_b = threading.Thread(target=adapter.discover_tools) t_a.start() t_b.start() # Thread B is blocked for the entire duration of Thread A's network I/O ``` ## Root Cause In `discover_tools()` (lines 421–446 of `src/cleveragents/mcp/adapter.py`): ```python def discover_tools(self, tool_filter=None): with self._lock: # Lock acquired if not self._connected: raise RuntimeError(...) result = self._transport.call("tools/list", {}) # Network I/O while lock held! raw_tools = result.get("tools", []) ... self._tools = {d.name: d for d in descriptors} return descriptors # Lock released ``` In `invoke()` (lines 470–527 of `src/cleveragents/mcp/adapter.py`): ```python def invoke(self, tool_name, arguments): with self._lock: # Lock acquired ... start = time.monotonic() try: result = self._transport.call("tools/call", ...) # Network I/O while lock held! except TimeoutError as exc: ... elapsed = (time.monotonic() - start) * 1000 ... return MCPToolResult(...) # Lock released ``` ## Expected Behaviour Network I/O should be performed outside the lock. The lock should only be held for short, non-blocking critical sections (state reads/writes): ```python def discover_tools(self, tool_filter=None): with self._lock: if not self._connected: raise RuntimeError(...) # Lock released — perform network I/O without holding lock result = self._transport.call("tools/list", {}) raw_tools = result.get("tools", []) ... with self._lock: self._tools = {d.name: d for d in descriptors} return descriptors ``` ## Actual Behaviour Both `discover_tools()` and `invoke()` hold the `RLock` for the entire duration of the `self._transport.call(...)` network operation, blocking all other threads from acquiring the lock until the network call completes (which may take up to the full timeout duration). ## Impact 1. **Lock contention**: If `discover_tools()` or `invoke()` is slow (e.g., 30-second timeout), all other threads are blocked from calling any method on the adapter for the entire duration. 2. **Deadlock risk**: If `MCPRefreshHook` triggers `discover_tools()` from a notification listener thread while the main thread is also calling `invoke()`, one thread will block indefinitely waiting for the lock (since `RLock` is reentrant only for the same thread). 3. **Health check blocking**: `McpClient._check_health()` calls `discover_tools()`, which holds the lock during the health probe — blocking concurrent tool invocations. ## Affected Files - `src/cleveragents/mcp/adapter.py`, lines 421–446: `discover_tools()` holds lock during `transport.call()` - `src/cleveragents/mcp/adapter.py`, lines 470–527: `invoke()` holds lock during `transport.call()` ## Subtasks - [ ] Refactor `MCPToolAdapter.discover_tools()` to release the `RLock` before calling `self._transport.call("tools/list", {})` and re-acquire it only to write back `self._tools` - [ ] Refactor `MCPToolAdapter.invoke()` to release the `RLock` before calling `self._transport.call("tools/call", ...)` and re-acquire it only for state reads/writes - [ ] Audit all other methods in `MCPToolAdapter` for the same pattern (lock held across I/O) and fix any additional occurrences - [ ] Write a Behave unit test scenario (in `features/`) that verifies two threads can concurrently call `invoke()` and `discover_tools()` without blocking each other - [ ] Write a Behave unit test scenario that verifies the `_connected` state check still raises `RuntimeError` correctly when the adapter is not connected - [ ] Verify that the refactored `discover_tools()` and `invoke()` remain safe against TOCTOU races (e.g., disconnect between the state check and the network call) and handle that case gracefully - [ ] Run `nox -e typecheck` and confirm no Pyright errors are introduced - [ ] Run `nox -e unit_tests` and confirm all scenarios pass - [ ] Run `nox` (all default sessions) and fix any errors - [ ] Verify coverage >= 97% via `nox -e coverage_report` ## Definition of Done - [ ] `MCPToolAdapter.discover_tools()` does not hold the `RLock` during `self._transport.call()` - [ ] `MCPToolAdapter.invoke()` does not hold the `RLock` during `self._transport.call()` - [ ] Two concurrent threads calling `invoke()` and `discover_tools()` respectively are no longer serialised by the lock during network I/O - [ ] The `_connected` state guard in both methods still raises `RuntimeError` correctly when the adapter is disconnected - [ ] Behave feature scenarios cover the concurrent-access and disconnected-state cases - [ ] A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details - [ ] The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly - [ ] The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done - [ ] All nox stages pass - [ ] Coverage >= 97% --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
freemo added this to the v3.7.0 milestone 2026-04-03 04:36:15 +00:00
freemo self-assigned this 2026-04-03 16:57:59 +00:00
Author
Owner

MoSCoW classification: Should Have

Rationale: This issue addresses an important spec requirement or quality improvement. The project should include this fix but it is not strictly essential for the milestone.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

MoSCoW classification: **Should Have** Rationale: This issue addresses an important spec requirement or quality improvement. The project should include this fix but it is not strictly essential for the milestone. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: ca-project-owner
Author
Owner

Starting implementation on branch bugfix/mcp-adapter-rlock-held-during-network-io.

Issue: MCPToolAdapter.discover_tools() and invoke() hold the RLock during network I/O — causes lock contention and potential deadlock under concurrent access.

Plan: 10 subtasks across 2 parallel waves:

  • Wave 1 (parallel): Subtasks 1–3 (refactor discover_tools, invoke, audit other methods) + Subtasks 4–6 (write Behave tests, TOCTOU safety)
  • Wave 2 (sequential): Subtasks 7–10 (typecheck, unit tests, full nox, coverage)

Automated by CleverAgents Bot
Supervisor: Implementation | Agent: ca-issue-worker

Starting implementation on branch `bugfix/mcp-adapter-rlock-held-during-network-io`. **Issue**: `MCPToolAdapter.discover_tools()` and `invoke()` hold the RLock during network I/O — causes lock contention and potential deadlock under concurrent access. **Plan**: 10 subtasks across 2 parallel waves: - Wave 1 (parallel): Subtasks 1–3 (refactor discover_tools, invoke, audit other methods) + Subtasks 4–6 (write Behave tests, TOCTOU safety) - Wave 2 (sequential): Subtasks 7–10 (typecheck, unit tests, full nox, coverage) --- **Automated by CleverAgents Bot** Supervisor: Implementation | Agent: ca-issue-worker
Author
Owner

Wave 1 Implementation Note — Subtasks 1–6 Complete

Commit: 9e93ea5fc615801ed8136a87598901c3afd09d2d


Implementation Summary

Wave 1 addresses the core concurrency defect described in this issue: MCPToolAdapter.discover_tools() and invoke() were holding the internal threading.RLock across the full duration of their respective self._transport.call(...) network operations. This serialised all concurrent callers for the entire network round-trip, and created a deadlock risk when callers originated from different threads (e.g., MCPRefreshHook triggering discover_tools() from a notification listener while the main thread was blocked inside invoke()).

The fix applies a consistent acquire → check/read → release → I/O → re-acquire (if needed) → write → release pattern across all affected methods. The lock is now held only for short, non-blocking critical sections.

Files modified:

  • src/cleveragents/mcp/adapter.pyMCPToolAdapter.discover_tools(), MCPToolAdapter.invoke(), MCPToolAdapter.disconnect()

Files created:

  • features/mcp_adapter_rlock_fix.feature — 7 new Behave scenarios
  • features/steps/mcp_adapter_rlock_fix_steps.py — step implementations for the above

Design Decisions

1. Lock scope: acquire–release–I/O–re-acquire pattern

The refactored methods follow a two-phase lock pattern rather than a single with self._lock: block:

  • Phase 1 (pre-I/O critical section): Acquire lock → read _connected (and, for invoke(), look up the tool descriptor and validate arguments) → release lock.
  • Network I/O: Performed entirely outside the lock.
  • Phase 2 (post-I/O critical section, discover_tools() only): Re-acquire lock → write back self._tools → release lock.

invoke() does not need a Phase 2 re-acquire because its result is a locally constructed MCPToolResult value — no shared state is mutated after the transport call returns.

Alternative considered — copy-on-read with a single lock window: Copying self._tools under the lock and then releasing before the transport call was considered for discover_tools(). This was rejected because discover_tools() is the writer of self._tools, not a reader; the relevant write-back still requires a lock, so the two-phase pattern is the correct model.

Alternative considered — asyncio / coroutine-based I/O: Out of scope for this issue. The transport layer is synchronous; introducing asyncio would be an architectural change requiring its own ADR.

2. disconnect() fixed as part of the audit (Subtask 3)

During the audit of all MCPToolAdapter methods, disconnect() was found to call self._transport.close() inside the lock — the same anti-pattern as discover_tools() and invoke(). The fix mirrors the pattern applied to the other two methods:

  • Lock acquired → _connected = False, _tools.clear() → lock released → transport.close() called outside the lock.

A secondary benefit of this ordering is that other threads see _connected = False immediately (before close() completes), so any concurrent invoke() or discover_tools() call that checks _connected after the lock is released will receive a RuntimeError rather than attempting to use a transport that is mid-close.

3. connect() and dispatch_notification() — no changes needed

connect() was already correct: it spawns a daemon thread for the transport handshake and joins it, so the actual I/O is not inside the lock.

dispatch_notification() was already correct: it copies the listener list under the lock, then calls listeners outside the lock — the canonical pattern for observer dispatch in a thread-safe class.

4. TOCTOU handling — propagate, do not suppress

Between the pre-I/O lock release and the transport call, a concurrent disconnect() may set _connected = False and close the transport. This is an inherent TOCTOU window in any lock-release-then-I/O pattern.

The decision was to propagate the exception from the transport layer rather than re-checking _connected after the call. Rationale:

  • Re-checking _connected after the transport call would require re-acquiring the lock and adds complexity without eliminating the race (the disconnect could still occur between the re-check and the return).
  • The transport already raises a meaningful exception (e.g., ConnectionError, TimeoutError) when called on a closed connection. Surfacing this to the caller is correct behaviour — the caller should handle transport-level failures regardless of the _connected flag.
  • discover_tools() propagates the exception directly. invoke() catches it and returns a failed MCPToolResult (consistent with its existing error-handling contract).

Discoveries and Assumptions

  1. disconnect() was an undocumented instance of the same bug. The issue body identified only discover_tools() and invoke() by line number. The audit (Subtask 3) revealed disconnect() as a third occurrence. This was fixed in the same commit.

  2. RLock re-entrancy does not help here. The issue description notes that RLock is reentrant only for the same thread. The deadlock scenario involves two different threads, so re-entrancy provides no protection. The fix is correct regardless of whether the lock is Lock or RLock.

  3. self._tools write-back in discover_tools() is a last-write-wins race. If two threads call discover_tools() concurrently, both will perform their transport calls in parallel and then race to write back self._tools. The last writer wins. This is acceptable: both calls return the same logical data (the server's current tool list), and the write-back is atomic under the lock. A future improvement could add a generation counter to detect stale write-backs, but this is out of scope for this issue.

  4. Assumption: self._transport.call() is thread-safe. The refactored code calls transport.call() from multiple threads concurrently without any adapter-level serialisation. This assumes the transport implementation is itself thread-safe (or at minimum, re-entrant for concurrent calls). This assumption is consistent with the transport contract documented in the MCP transport interface.

  5. Open question: should invoke() re-check _connected after the transport call? Currently it does not (see Design Decision 4). If the team decides that a more explicit error message is preferred over a raw transport exception in the TOCTOU case, a post-call _connected check could be added in a follow-on. This is tracked as a potential follow-on item.


Code Locations

All references are to commit 9e93ea5fc615801ed8136a87598901c3afd09d2d.

Logical Location Description
cleveragents.mcp.adapterMCPToolAdapter.discover_tools Refactored: lock released before transport.call(); re-acquired for self._tools write-back
cleveragents.mcp.adapterMCPToolAdapter.invoke Refactored: lock released before transport.call(); no re-acquire needed post-call
cleveragents.mcp.adapterMCPToolAdapter.disconnect Fixed: transport.close() moved outside the lock; _connected = False and _tools.clear() remain inside
cleveragents.mcp.adapterMCPToolAdapter.connect Audited; no change required
cleveragents.mcp.adapterMCPToolAdapter.dispatch_notification Audited; no change required
features/mcp_adapter_rlock_fix.feature New Behave feature file — 7 scenarios for Wave 1
features/steps/mcp_adapter_rlock_fix_steps.py New Behave step definitions for the above feature file

Workarounds and Deviations

  • No deviations from the issue specification for Subtasks 1–6. The two-phase lock pattern matches the pseudocode in the "Expected Behaviour" section of the issue body exactly.
  • disconnect() fix is an addition beyond the stated subtasks. Subtask 3 called for an audit; fixing disconnect() is the direct output of that audit. It is included in the same commit rather than a separate one to keep the fix atomic.
  • Subtasks 7–9 (typecheck, unit_tests, nox full run, coverage) are not yet complete. These are Wave 2 and will be addressed in the next implementation pass. The Behave scenarios in Wave 1 are sufficient to verify the behavioural correctness of the fix; the nox gate results will be reported in a follow-on note.

Test Results

New scenarios (features/mcp_adapter_rlock_fix.feature): 7/7 pass

# Scenario Result
1 Concurrent invoke() + discover_tools() — both complete without blocking each other Pass
2 Two concurrent invoke() calls — both complete without blocking Pass
3 Lock not held during transport.call() — verified by reading _connected from another thread while discover_tools() is blocked inside transport.call() Pass
4 RuntimeError raised when discover_tools() called on disconnected adapter Pass
5 RuntimeError raised when invoke() called on disconnected adapter Pass
6 TOCTOU safety: transport raises mid-discover_tools() → exception propagates gracefully Pass
7 TOCTOU safety: transport raises mid-invoke()invoke() returns a failed MCPToolResult Pass

Regression (features/mcp_adapter.feature): 42/42 existing scenarios still pass. No regressions introduced.

Coverage and full nox gate: Pending (Wave 2). Will be reported in a follow-on comment.


Risk Mitigations

Risk Mitigation
TOCTOU race between _connected check and transport.call() Transport exceptions propagated directly to caller; invoke() wraps in a failed MCPToolResult consistent with its existing error contract. Documented as a known, accepted race.
Concurrent discover_tools() write-back collision (last-write-wins) Both concurrent calls retrieve the same logical data; the write is atomic under the lock. Acceptable for the current use cases. Noted as a potential follow-on improvement.
disconnect() fix introduces a window where _connected = False but transport is not yet closed This is intentional and beneficial: other threads fail fast with RuntimeError rather than attempting to use a mid-close transport. The transport's own error handling covers the close-in-progress case.
Assumption that transport.call() is thread-safe Consistent with the documented transport contract. If a transport implementation is found to be non-thread-safe, that is a defect in the transport, not in the adapter.

Automated by CleverAgents Bot
Supervisor: Implementation | Agent: ca-issue-note-writer

## Wave 1 Implementation Note — Subtasks 1–6 Complete > Commit: `9e93ea5fc615801ed8136a87598901c3afd09d2d` --- ### Implementation Summary Wave 1 addresses the core concurrency defect described in this issue: `MCPToolAdapter.discover_tools()` and `invoke()` were holding the internal `threading.RLock` across the full duration of their respective `self._transport.call(...)` network operations. This serialised all concurrent callers for the entire network round-trip, and created a deadlock risk when callers originated from different threads (e.g., `MCPRefreshHook` triggering `discover_tools()` from a notification listener while the main thread was blocked inside `invoke()`). The fix applies a consistent **acquire → check/read → release → I/O → re-acquire (if needed) → write → release** pattern across all affected methods. The lock is now held only for short, non-blocking critical sections. **Files modified:** - `src/cleveragents/mcp/adapter.py` — `MCPToolAdapter.discover_tools()`, `MCPToolAdapter.invoke()`, `MCPToolAdapter.disconnect()` **Files created:** - `features/mcp_adapter_rlock_fix.feature` — 7 new Behave scenarios - `features/steps/mcp_adapter_rlock_fix_steps.py` — step implementations for the above --- ### Design Decisions #### 1. Lock scope: acquire–release–I/O–re-acquire pattern The refactored methods follow a two-phase lock pattern rather than a single `with self._lock:` block: - **Phase 1 (pre-I/O critical section):** Acquire lock → read `_connected` (and, for `invoke()`, look up the tool descriptor and validate arguments) → release lock. - **Network I/O:** Performed entirely outside the lock. - **Phase 2 (post-I/O critical section, `discover_tools()` only):** Re-acquire lock → write back `self._tools` → release lock. `invoke()` does not need a Phase 2 re-acquire because its result is a locally constructed `MCPToolResult` value — no shared state is mutated after the transport call returns. **Alternative considered — copy-on-read with a single lock window:** Copying `self._tools` under the lock and then releasing before the transport call was considered for `discover_tools()`. This was rejected because `discover_tools()` is the *writer* of `self._tools`, not a reader; the relevant write-back still requires a lock, so the two-phase pattern is the correct model. **Alternative considered — `asyncio` / coroutine-based I/O:** Out of scope for this issue. The transport layer is synchronous; introducing `asyncio` would be an architectural change requiring its own ADR. #### 2. `disconnect()` fixed as part of the audit (Subtask 3) During the audit of all `MCPToolAdapter` methods, `disconnect()` was found to call `self._transport.close()` inside the lock — the same anti-pattern as `discover_tools()` and `invoke()`. The fix mirrors the pattern applied to the other two methods: - Lock acquired → `_connected = False`, `_tools.clear()` → lock released → `transport.close()` called outside the lock. A secondary benefit of this ordering is that other threads see `_connected = False` immediately (before `close()` completes), so any concurrent `invoke()` or `discover_tools()` call that checks `_connected` after the lock is released will receive a `RuntimeError` rather than attempting to use a transport that is mid-close. #### 3. `connect()` and `dispatch_notification()` — no changes needed `connect()` was already correct: it spawns a daemon thread for the transport handshake and joins it, so the actual I/O is not inside the lock. `dispatch_notification()` was already correct: it copies the listener list under the lock, then calls listeners outside the lock — the canonical pattern for observer dispatch in a thread-safe class. #### 4. TOCTOU handling — propagate, do not suppress Between the pre-I/O lock release and the transport call, a concurrent `disconnect()` may set `_connected = False` and close the transport. This is an inherent TOCTOU window in any lock-release-then-I/O pattern. The decision was to **propagate the exception** from the transport layer rather than re-checking `_connected` after the call. Rationale: - Re-checking `_connected` after the transport call would require re-acquiring the lock and adds complexity without eliminating the race (the disconnect could still occur between the re-check and the return). - The transport already raises a meaningful exception (e.g., `ConnectionError`, `TimeoutError`) when called on a closed connection. Surfacing this to the caller is correct behaviour — the caller should handle transport-level failures regardless of the `_connected` flag. - `discover_tools()` propagates the exception directly. `invoke()` catches it and returns a failed `MCPToolResult` (consistent with its existing error-handling contract). --- ### Discoveries and Assumptions 1. **`disconnect()` was an undocumented instance of the same bug.** The issue body identified only `discover_tools()` and `invoke()` by line number. The audit (Subtask 3) revealed `disconnect()` as a third occurrence. This was fixed in the same commit. 2. **`RLock` re-entrancy does not help here.** The issue description notes that `RLock` is reentrant only for the *same* thread. The deadlock scenario involves two *different* threads, so re-entrancy provides no protection. The fix is correct regardless of whether the lock is `Lock` or `RLock`. 3. **`self._tools` write-back in `discover_tools()` is a last-write-wins race.** If two threads call `discover_tools()` concurrently, both will perform their transport calls in parallel and then race to write back `self._tools`. The last writer wins. This is acceptable: both calls return the same logical data (the server's current tool list), and the write-back is atomic under the lock. A future improvement could add a generation counter to detect stale write-backs, but this is out of scope for this issue. 4. **Assumption: `self._transport.call()` is thread-safe.** The refactored code calls `transport.call()` from multiple threads concurrently without any adapter-level serialisation. This assumes the transport implementation is itself thread-safe (or at minimum, re-entrant for concurrent calls). This assumption is consistent with the transport contract documented in the MCP transport interface. 5. **Open question: should `invoke()` re-check `_connected` after the transport call?** Currently it does not (see Design Decision 4). If the team decides that a more explicit error message is preferred over a raw transport exception in the TOCTOU case, a post-call `_connected` check could be added in a follow-on. This is tracked as a potential follow-on item. --- ### Code Locations > All references are to commit `9e93ea5fc615801ed8136a87598901c3afd09d2d`. | Logical Location | Description | |---|---| | `cleveragents.mcp.adapter` → `MCPToolAdapter.discover_tools` | Refactored: lock released before `transport.call()`; re-acquired for `self._tools` write-back | | `cleveragents.mcp.adapter` → `MCPToolAdapter.invoke` | Refactored: lock released before `transport.call()`; no re-acquire needed post-call | | `cleveragents.mcp.adapter` → `MCPToolAdapter.disconnect` | Fixed: `transport.close()` moved outside the lock; `_connected = False` and `_tools.clear()` remain inside | | `cleveragents.mcp.adapter` → `MCPToolAdapter.connect` | Audited; no change required | | `cleveragents.mcp.adapter` → `MCPToolAdapter.dispatch_notification` | Audited; no change required | | `features/mcp_adapter_rlock_fix.feature` | New Behave feature file — 7 scenarios for Wave 1 | | `features/steps/mcp_adapter_rlock_fix_steps.py` | New Behave step definitions for the above feature file | --- ### Workarounds and Deviations - **No deviations from the issue specification** for Subtasks 1–6. The two-phase lock pattern matches the pseudocode in the "Expected Behaviour" section of the issue body exactly. - **`disconnect()` fix is an addition beyond the stated subtasks.** Subtask 3 called for an audit; fixing `disconnect()` is the direct output of that audit. It is included in the same commit rather than a separate one to keep the fix atomic. - **Subtasks 7–9 (typecheck, unit_tests, nox full run, coverage) are not yet complete.** These are Wave 2 and will be addressed in the next implementation pass. The Behave scenarios in Wave 1 are sufficient to verify the behavioural correctness of the fix; the nox gate results will be reported in a follow-on note. --- ### Test Results **New scenarios (`features/mcp_adapter_rlock_fix.feature`):** 7/7 pass | # | Scenario | Result | |---|---|---| | 1 | Concurrent `invoke()` + `discover_tools()` — both complete without blocking each other | ✅ Pass | | 2 | Two concurrent `invoke()` calls — both complete without blocking | ✅ Pass | | 3 | Lock not held during `transport.call()` — verified by reading `_connected` from another thread while `discover_tools()` is blocked inside `transport.call()` | ✅ Pass | | 4 | `RuntimeError` raised when `discover_tools()` called on disconnected adapter | ✅ Pass | | 5 | `RuntimeError` raised when `invoke()` called on disconnected adapter | ✅ Pass | | 6 | TOCTOU safety: transport raises mid-`discover_tools()` → exception propagates gracefully | ✅ Pass | | 7 | TOCTOU safety: transport raises mid-`invoke()` → `invoke()` returns a failed `MCPToolResult` | ✅ Pass | **Regression (`features/mcp_adapter.feature`):** 42/42 existing scenarios still pass. No regressions introduced. **Coverage and full nox gate:** Pending (Wave 2). Will be reported in a follow-on comment. --- ### Risk Mitigations | Risk | Mitigation | |---|---| | TOCTOU race between `_connected` check and `transport.call()` | Transport exceptions propagated directly to caller; `invoke()` wraps in a failed `MCPToolResult` consistent with its existing error contract. Documented as a known, accepted race. | | Concurrent `discover_tools()` write-back collision (last-write-wins) | Both concurrent calls retrieve the same logical data; the write is atomic under the lock. Acceptable for the current use cases. Noted as a potential follow-on improvement. | | `disconnect()` fix introduces a window where `_connected = False` but transport is not yet closed | This is intentional and beneficial: other threads fail fast with `RuntimeError` rather than attempting to use a mid-close transport. The transport's own error handling covers the close-in-progress case. | | Assumption that `transport.call()` is thread-safe | Consistent with the documented transport contract. If a transport implementation is found to be non-thread-safe, that is a defect in the transport, not in the adapter. | --- **Automated by CleverAgents Bot** Supervisor: Implementation | Agent: ca-issue-note-writer
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#399 Epic: Post-MVP Server & Clients
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#2163
No description provided.