BUG-HUNT: [concurrency] Race condition in _check_idle can cause premature shutdown #2413

Open
opened 2026-04-03 17:39:11 +00:00 by freemo · 1 comment
Owner

Metadata

  • Branch: fix/v3.6.0-mcp-client-check-idle-race-condition
  • Commit Message: fix(mcp): move shutdown() call inside lock in _check_idle to eliminate race condition
  • Milestone: v3.6.0
  • Parent Epic: #399

Background and Context

A race condition exists in the _check_idle method of the McpClient class in src/cleveragents/mcp/client.py (lines 349–369). The method checks whether the client has been idle for longer than the configured timeout and, if so, shuts it down. However, the check for idleness and the call to shutdown() are not performed atomically — the lock is released before shutdown() is called, creating a window in which another thread can update _last_activity, making the shutdown decision stale.

The race condition occurs in the following sequence:

  1. The idle timer fires and calls _check_idle.
  2. The _check_idle thread acquires the lock, checks _last_activity, and determines the client has been idle too long.
  3. The _check_idle thread releases the lock (exits the with block).
  4. Another thread invokes a tool, updating _last_activity to the current time.
  5. The _check_idle thread calls self.shutdown() — even though the client is no longer idle.

This violates the project's concurrency safety requirements and can lead to dropped connections, failed operations, and data loss.

Current Behavior (Bug)

# src/cleveragents/mcp/client.py  lines 349-369
def _check_idle(self) -> None:
    """Check if the client has been idle beyond the timeout."""
    with self._lock:
        if self._shutting_down or not self._started:
            return
        elapsed = time.monotonic() - self._last_activity
        if elapsed < self._config.idle_timeout_seconds:
            # Activity happened while timer was pending; reschedule.
            remaining = self._config.idle_timeout_seconds - elapsed
            timer = threading.Timer(remaining, self._check_idle)
            timer.daemon = True
            self._idle_timer = timer
            timer.start()
            return

    # ← Lock released here; another thread can update _last_activity now
    logger.info(
        "McpClient '%s' idle for %.1fs — auto-stopping",
        self.server_name,
        elapsed,
    )
    self.shutdown()  # ← Shutdown proceeds even if client is no longer idle

The client can be shut down even if there has been recent activity since the lock was released.

Expected Behavior

The decision to shut down the client and the shutdown itself must be an atomic operation. The client should only be shut down if it has been idle for the configured timeout at the time shutdown() is called.

Suggested Fix

Move the logger.info(...) and self.shutdown() calls inside the with self._lock: block. The shutdown() method is re-entrant, so it is safe to call from within the lock.

def _check_idle(self) -> None:
    """Check if the client has been idle beyond the timeout."""
    with self._lock:
        if self._shutting_down or not self._started:
            return
        elapsed = time.monotonic() - self._last_activity
        if elapsed < self._config.idle_timeout_seconds:
            # Activity happened while timer was pending; reschedule.
            remaining = self._config.idle_timeout_seconds - elapsed
            timer = threading.Timer(remaining, self._check_idle)
            timer.daemon = True
            self._idle_timer = timer
            timer.start()
            return

        logger.info(
            "McpClient '%s' idle for %.1fs — auto-stopping",
            self.server_name,
            elapsed,
        )
        self.shutdown()

Severity Assessment

  • Impact: High — unexpected client shutdown can cause dropped connections, failed operations, and data loss.
  • Likelihood: High — the race window depends on thread scheduling, which is unpredictable and environment-dependent.
  • Category: concurrency

Subtasks

  • Confirm shutdown() is re-entrant and safe to call while holding self._lock (audit lock acquisition inside shutdown()).
  • Move logger.info(...) and self.shutdown() inside the with self._lock: block in _check_idle.
  • Write a Behave scenario that reproduces the race condition (using a mock timer and controlled thread interleaving).
  • Write a Behave scenario verifying that a tool invocation arriving between the idle check and shutdown prevents the shutdown.
  • Verify all existing McpClient Behave scenarios continue to pass.
  • Run nox -e typecheck — confirm no type errors or suppressions introduced.
  • Run nox -e coverage_report — confirm coverage ≥ 97%.
  • Run nox (all default sessions) and fix any failures.

Definition of Done

  • _check_idle performs the idle check and shutdown() call atomically within self._lock.
  • No # type: ignore or equivalent type suppression is introduced.
  • New Behave scenarios cover: (a) race condition reproduction, (b) activity arriving between check and shutdown prevents premature shutdown.
  • All existing McpClient Behave scenarios pass.
  • All nox stages pass.
  • Coverage >= 97%

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/v3.6.0-mcp-client-check-idle-race-condition` - **Commit Message**: `fix(mcp): move shutdown() call inside lock in _check_idle to eliminate race condition` - **Milestone**: v3.6.0 - **Parent Epic**: #399 ## Background and Context A race condition exists in the `_check_idle` method of the `McpClient` class in `src/cleveragents/mcp/client.py` (lines 349–369). The method checks whether the client has been idle for longer than the configured timeout and, if so, shuts it down. However, the check for idleness and the call to `shutdown()` are **not performed atomically** — the lock is released before `shutdown()` is called, creating a window in which another thread can update `_last_activity`, making the shutdown decision stale. The race condition occurs in the following sequence: 1. The idle timer fires and calls `_check_idle`. 2. The `_check_idle` thread acquires the lock, checks `_last_activity`, and determines the client has been idle too long. 3. The `_check_idle` thread **releases the lock** (exits the `with` block). 4. Another thread invokes a tool, updating `_last_activity` to the current time. 5. The `_check_idle` thread calls `self.shutdown()` — even though the client is no longer idle. This violates the project's concurrency safety requirements and can lead to dropped connections, failed operations, and data loss. ## Current Behavior (Bug) ```python # src/cleveragents/mcp/client.py lines 349-369 def _check_idle(self) -> None: """Check if the client has been idle beyond the timeout.""" with self._lock: if self._shutting_down or not self._started: return elapsed = time.monotonic() - self._last_activity if elapsed < self._config.idle_timeout_seconds: # Activity happened while timer was pending; reschedule. remaining = self._config.idle_timeout_seconds - elapsed timer = threading.Timer(remaining, self._check_idle) timer.daemon = True self._idle_timer = timer timer.start() return # ← Lock released here; another thread can update _last_activity now logger.info( "McpClient '%s' idle for %.1fs — auto-stopping", self.server_name, elapsed, ) self.shutdown() # ← Shutdown proceeds even if client is no longer idle ``` The client can be shut down even if there has been recent activity since the lock was released. ## Expected Behavior The decision to shut down the client and the shutdown itself must be an **atomic operation**. The client should only be shut down if it has been idle for the configured timeout at the time `shutdown()` is called. ## Suggested Fix Move the `logger.info(...)` and `self.shutdown()` calls **inside** the `with self._lock:` block. The `shutdown()` method is re-entrant, so it is safe to call from within the lock. ```python def _check_idle(self) -> None: """Check if the client has been idle beyond the timeout.""" with self._lock: if self._shutting_down or not self._started: return elapsed = time.monotonic() - self._last_activity if elapsed < self._config.idle_timeout_seconds: # Activity happened while timer was pending; reschedule. remaining = self._config.idle_timeout_seconds - elapsed timer = threading.Timer(remaining, self._check_idle) timer.daemon = True self._idle_timer = timer timer.start() return logger.info( "McpClient '%s' idle for %.1fs — auto-stopping", self.server_name, elapsed, ) self.shutdown() ``` ## Severity Assessment - **Impact**: High — unexpected client shutdown can cause dropped connections, failed operations, and data loss. - **Likelihood**: High — the race window depends on thread scheduling, which is unpredictable and environment-dependent. - **Category**: concurrency ## Subtasks - [ ] Confirm `shutdown()` is re-entrant and safe to call while holding `self._lock` (audit lock acquisition inside `shutdown()`). - [ ] Move `logger.info(...)` and `self.shutdown()` inside the `with self._lock:` block in `_check_idle`. - [ ] Write a Behave scenario that reproduces the race condition (using a mock timer and controlled thread interleaving). - [ ] Write a Behave scenario verifying that a tool invocation arriving between the idle check and shutdown prevents the shutdown. - [ ] Verify all existing `McpClient` Behave scenarios continue to pass. - [ ] Run `nox -e typecheck` — confirm no type errors or suppressions introduced. - [ ] Run `nox -e coverage_report` — confirm coverage ≥ 97%. - [ ] Run `nox` (all default sessions) and fix any failures. ## Definition of Done - [ ] `_check_idle` performs the idle check and `shutdown()` call atomically within `self._lock`. - [ ] No `# type: ignore` or equivalent type suppression is introduced. - [ ] New Behave scenarios cover: (a) race condition reproduction, (b) activity arriving between check and shutdown prevents premature shutdown. - [ ] All existing `McpClient` Behave scenarios pass. - [ ] All nox stages pass. - [ ] Coverage >= 97% --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: ca-new-issue-creator
freemo added this to the v3.6.0 milestone 2026-04-03 17:39:16 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: High — Race conditions in shutdown logic can cause data loss and unpredictable behavior. This is a concurrency correctness issue.
  • Milestone: v3.6.0 (as specified in issue metadata)
  • MoSCoW: Should Have — Race conditions are serious but this specific one in idle-check shutdown is unlikely to cause data corruption. It should be fixed but is not blocking core functionality.
  • Parent Epic: #399 (Post-MVP Server & Clients)

The issue is well-described with clear code references, line numbers, and a precise fix proposal. Valid and actionable.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Issue triaged by project owner: - **State**: Verified - **Priority**: High — Race conditions in shutdown logic can cause data loss and unpredictable behavior. This is a concurrency correctness issue. - **Milestone**: v3.6.0 (as specified in issue metadata) - **MoSCoW**: Should Have — Race conditions are serious but this specific one in idle-check shutdown is unlikely to cause data corruption. It should be fixed but is not blocking core functionality. - **Parent Epic**: #399 (Post-MVP Server & Clients) The issue is well-described with clear code references, line numbers, and a precise fix proposal. Valid and actionable. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: ca-project-owner
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#399 Epic: Post-MVP Server & Clients
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#2413
No description provided.