BUG-HUNT: [concurrency] McpClient.start() has TOCTOU race — concurrent callers both pass the _started=False guard, triggering double timer scheduling and double discover_tools() calls #6526

Open
opened 2026-04-09 21:15:04 +00:00 by HAL9000 · 1 comment
Owner

Bug Report: [concurrency] — McpClient.start() releases the lock before connection work is done — two threads can simultaneously start the client, causing duplicate idle/health-check timers

Severity Assessment

  • Impact: Two threads calling call_tool() concurrently before the client has started (a common scenario with lazy_start=True) can both trigger start() simultaneously. This results in: (1) two idle auto-stop timers scheduled — after the idle timeout, shutdown() fires twice; (2) two health-check timers scheduled — health checks run at double frequency; (3) discover_tools() called twice in rapid succession, causing unnecessary I/O.
  • Likelihood: Medium-High — any multi-threaded code that calls call_tool() before the first explicit start() is subject to this race, which is the default use case with lazy_start=True.
  • Priority: Priority/Backlog

Location

  • File: src/cleveragents/mcp/client.py
  • Function: McpClient.start() and McpClient._ensure_started()
  • Lines: 208–230 (start()), 254–275 (_ensure_started())

Description

start() holds the lock briefly to check _started and set _state = STARTING, then releases the lock before doing the actual connection work. The _started flag is only set to True inside the second with self._lock: block, which runs after the connection succeeds. In between, any other thread that calls start() (directly or via _ensure_started()) sees _started=False and proceeds to start a second connection.

Evidence

# src/cleveragents/mcp/client.py — McpClient.start()

def start(self) -> None:
    with self._lock:
        if self._started:
            return
        self._state = McpClientState.STARTING
    # ↑ Lock released here — _started is still False

    try:
        self._adapter.connect()        # Line 215 — Thread A and Thread B both reach here
        self._adapter.discover_tools() # Line 216 — runs twice in rapid succession
    except Exception:
        with self._lock:
            self._state = McpClientState.ERROR
        raise

    with self._lock:
        self._started = True           # Only set here, after connection
        self._state = McpClientState.RUNNING
        self._last_activity = time.monotonic()
        self._health_check_failures = 0

    self._schedule_idle_timer()        # Called by BOTH threads → two timers
    self._schedule_health_check()      # Called by BOTH threads → two timers

And _ensure_started():

def _ensure_started(self) -> None:
    with self._lock:
        if self._started:
            return
        if not self._config.lazy_start:
            raise RuntimeError(...)
    # ↑ Lock released here — _started is still False
    self.start()  # Both threads call start()

Race scenario:

Thread A (call_tool 1) Thread B (call_tool 2)
_ensure_started(): lock → _started=False → release
start(): lock → _started=False → set STARTING → release
adapter.connect() begins... _ensure_started(): lock → _started=False (!) → release
start(): lock → _started=False → set STARTING → release
adapter.connect() finishes (no-op for B) adapter.connect() is called (no-op due to adapter lock)
adapter.discover_tools() runs adapter.discover_tools() runs again
lock → set _started=True lock → set _started=True
_schedule_idle_timer()Timer #1 _schedule_idle_timer()Timer #2
_schedule_health_check()Timer #3 _schedule_health_check()Timer #4

After this race: two idle timers and two health-check timers are active, both will fire independently.

Expected Behavior

start() should be idempotent under concurrent calls. Only one connection attempt, one idle timer, and one health-check timer should be active at a time.

Actual Behavior

Concurrent call_tool() calls during lazy startup cause start() to run multiple times, producing duplicate background timers. The idle auto-stop fires twice, which can cause a double-shutdown; health checks run at 2× the configured frequency.

Suggested Fix

Introduce a _starting flag (similar to _shutting_down) that is set while the connection is in progress, while holding the lock:

def start(self) -> None:
    with self._lock:
        if self._started or self._starting:
            return
        self._starting = True
        self._state = McpClientState.STARTING

    try:
        self._adapter.connect()
        self._adapter.discover_tools()
    except Exception:
        with self._lock:
            self._starting = False
            self._state = McpClientState.ERROR
        raise

    with self._lock:
        self._started = True
        self._starting = False
        self._state = McpClientState.RUNNING
        self._last_activity = time.monotonic()
        self._health_check_failures = 0

    self._schedule_idle_timer()
    self._schedule_health_check()

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: [concurrency] — `McpClient.start()` releases the lock before connection work is done — two threads can simultaneously start the client, causing duplicate idle/health-check timers ### Severity Assessment - **Impact**: Two threads calling `call_tool()` concurrently before the client has started (a common scenario with `lazy_start=True`) can both trigger `start()` simultaneously. This results in: (1) two idle auto-stop timers scheduled — after the idle timeout, `shutdown()` fires twice; (2) two health-check timers scheduled — health checks run at double frequency; (3) `discover_tools()` called twice in rapid succession, causing unnecessary I/O. - **Likelihood**: Medium-High — any multi-threaded code that calls `call_tool()` before the first explicit `start()` is subject to this race, which is the default use case with `lazy_start=True`. - **Priority**: Priority/Backlog ### Location - **File**: `src/cleveragents/mcp/client.py` - **Function**: `McpClient.start()` and `McpClient._ensure_started()` - **Lines**: 208–230 (`start()`), 254–275 (`_ensure_started()`) ### Description `start()` holds the lock briefly to check `_started` and set `_state = STARTING`, then **releases the lock** before doing the actual connection work. The `_started` flag is only set to `True` inside the second `with self._lock:` block, which runs **after** the connection succeeds. In between, any other thread that calls `start()` (directly or via `_ensure_started()`) sees `_started=False` and proceeds to start a second connection. ### Evidence ```python # src/cleveragents/mcp/client.py — McpClient.start() def start(self) -> None: with self._lock: if self._started: return self._state = McpClientState.STARTING # ↑ Lock released here — _started is still False try: self._adapter.connect() # Line 215 — Thread A and Thread B both reach here self._adapter.discover_tools() # Line 216 — runs twice in rapid succession except Exception: with self._lock: self._state = McpClientState.ERROR raise with self._lock: self._started = True # Only set here, after connection self._state = McpClientState.RUNNING self._last_activity = time.monotonic() self._health_check_failures = 0 self._schedule_idle_timer() # Called by BOTH threads → two timers self._schedule_health_check() # Called by BOTH threads → two timers ``` And `_ensure_started()`: ```python def _ensure_started(self) -> None: with self._lock: if self._started: return if not self._config.lazy_start: raise RuntimeError(...) # ↑ Lock released here — _started is still False self.start() # Both threads call start() ``` **Race scenario:** | Thread A (call_tool 1) | Thread B (call_tool 2) | |---|---| | `_ensure_started()`: lock → `_started=False` → release | — | | `start()`: lock → `_started=False` → set STARTING → release | — | | `adapter.connect()` begins... | `_ensure_started()`: lock → `_started=False` (!) → release | | — | `start()`: lock → `_started=False` → set STARTING → release | | `adapter.connect()` finishes (no-op for B) | `adapter.connect()` is called (no-op due to adapter lock) | | `adapter.discover_tools()` runs | `adapter.discover_tools()` runs again | | lock → set `_started=True` | lock → set `_started=True` | | `_schedule_idle_timer()` → **Timer #1** | `_schedule_idle_timer()` → **Timer #2** | | `_schedule_health_check()` → **Timer #3** | `_schedule_health_check()` → **Timer #4** | After this race: two idle timers and two health-check timers are active, both will fire independently. ### Expected Behavior `start()` should be idempotent under concurrent calls. Only one connection attempt, one idle timer, and one health-check timer should be active at a time. ### Actual Behavior Concurrent `call_tool()` calls during lazy startup cause `start()` to run multiple times, producing duplicate background timers. The idle auto-stop fires twice, which can cause a double-shutdown; health checks run at 2× the configured frequency. ### Suggested Fix Introduce a `_starting` flag (similar to `_shutting_down`) that is set while the connection is in progress, while holding the lock: ```python def start(self) -> None: with self._lock: if self._started or self._starting: return self._starting = True self._state = McpClientState.STARTING try: self._adapter.connect() self._adapter.discover_tools() except Exception: with self._lock: self._starting = False self._state = McpClientState.ERROR raise with self._lock: self._started = True self._starting = False self._state = McpClientState.RUNNING self._last_activity = time.monotonic() self._health_check_failures = 0 self._schedule_idle_timer() self._schedule_health_check() ``` ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter
HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:27:53 +00:00
Author
Owner

Verified — Valid concurrency bug. Concurrent callers can both pass the _started=False guard, causing double timer scheduling and double discover_tools() calls. MoSCoW: Should Have (confirmed) — concurrency issue that can cause duplicate tool registrations.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Valid concurrency bug. Concurrent callers can both pass the _started=False guard, causing double timer scheduling and double discover_tools() calls. **MoSCoW: Should Have** (confirmed) — concurrency issue that can cause duplicate tool registrations. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#6526
No description provided.