BUG-HUNT: [MCP-CONCURRENCY] Timer race conditions in McpClient idle and health check management #7103

Open
opened 2026-04-10 07:44:00 +00:00 by HAL9000 · 2 comments
Owner

Metadata

  • Branch: fix/mcp-client-timer-race-conditions
  • Commit Message: fix(mcp): move timer creation and assignment inside lock protection in McpClient
  • Milestone: (none — backlog, see note below)
  • Parent Epic: Orphan — see note below; requires manual Epic linking

Backlog note: This issue was discovered during autonomous operation
on milestone v3.5.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Bug Report: [CONCURRENCY] — Timer race conditions in McpClient idle and health check management

Severity Assessment

  • Impact: Timer resource leaks, incorrect idle timeout behavior, potential memory leaks
  • Likelihood: Medium under concurrent usage with rapid idle/health check cycles
  • Priority: High

Location

  • File: src/cleveragents/mcp/client.py
  • Function/Class: McpClient._check_idle(), McpClient._perform_health_check()
  • Lines: 349-369, 373-467

Description

The McpClient timer management has race conditions where timer objects are created and assigned outside of lock protection, allowing timers to be created after cancellation calls and potentially causing resource leaks.

Evidence

# In src/cleveragents/mcp/client.py lines 349-369:
def _check_idle(self) -> None:
    """Check if the client has been idle beyond the timeout."""
    with self._lock:
        if self._shutting_down or not self._started:
            return
        elapsed = time.monotonic() - self._last_activity
        if elapsed < self._config.idle_timeout_seconds:
            # RACE CONDITION: Lock released before timer creation
            remaining = self._config.idle_timeout_seconds - elapsed
            timer = threading.Timer(remaining, self._check_idle)
            timer.daemon = True
            self._idle_timer = timer  # Assignment outside lock!
            timer.start()  # Start outside lock!
            return

Race condition scenarios:

  1. Thread A calls _check_idle(), releases lock, creates timer
  2. Thread B calls shutdown()_cancel_idle_timer() → sets self._idle_timer = None
  3. Thread A assigns new timer to self._idle_timer and starts it
  4. Result: Timer starts after being "canceled", can't be stopped

Similar issue exists in _perform_health_check() method.

Expected Behavior

  • Timer creation and assignment should be atomic within lock protection
  • Canceled timers should never start after cancellation
  • Shutdown should reliably stop all timers
  • No timer resource leaks should occur

Actual Behavior

  • Timer creation happens outside lock protection
  • Timers can start after being canceled by other threads
  • Potential for unstoppable timers that keep client alive
  • Race conditions in concurrent idle/health check scenarios

Suggested Fix

  1. Move timer creation and assignment inside lock protection
  2. Add shutdown check after acquiring lock before creating new timers
  3. Ensure timer.start() is called while holding the lock
  4. Add defensive checks in cancel methods for partially-started timers

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_, and @tdd_expected_fail to prove the bug exists before fixing it.

Subtasks

  • Reproduce the race condition with a targeted concurrent test
  • Move threading.Timer creation and self._idle_timer assignment inside self._lock in _check_idle()
  • Add post-lock shutdown guard before starting the new timer in _check_idle()
  • Apply the same fix to _perform_health_check() timer management
  • Add defensive cancel checks for partially-started timers in _cancel_idle_timer() and _cancel_health_check_timer()
  • Write BDD scenarios covering concurrent shutdown + timer creation race
  • Verify no timer resource leaks under concurrent load

Definition of Done

  • All subtasks checked
  • Commit created with exact first line: fix(mcp): move timer creation and assignment inside lock protection in McpClient
  • Commit pushed to branch fix/mcp-client-timer-race-conditions
  • PR submitted and merged
  • All nox stages pass
  • Coverage >= 97%

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

## Metadata - **Branch**: `fix/mcp-client-timer-race-conditions` - **Commit Message**: `fix(mcp): move timer creation and assignment inside lock protection in McpClient` - **Milestone**: (none — backlog, see note below) - **Parent Epic**: _Orphan — see note below; requires manual Epic linking_ > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.5.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- ## Bug Report: [CONCURRENCY] — Timer race conditions in McpClient idle and health check management ### Severity Assessment - **Impact**: Timer resource leaks, incorrect idle timeout behavior, potential memory leaks - **Likelihood**: Medium under concurrent usage with rapid idle/health check cycles - **Priority**: High ### Location - **File**: `src/cleveragents/mcp/client.py` - **Function/Class**: `McpClient._check_idle()`, `McpClient._perform_health_check()` - **Lines**: 349-369, 373-467 ### Description The McpClient timer management has race conditions where timer objects are created and assigned outside of lock protection, allowing timers to be created after cancellation calls and potentially causing resource leaks. ### Evidence ```python # In src/cleveragents/mcp/client.py lines 349-369: def _check_idle(self) -> None: """Check if the client has been idle beyond the timeout.""" with self._lock: if self._shutting_down or not self._started: return elapsed = time.monotonic() - self._last_activity if elapsed < self._config.idle_timeout_seconds: # RACE CONDITION: Lock released before timer creation remaining = self._config.idle_timeout_seconds - elapsed timer = threading.Timer(remaining, self._check_idle) timer.daemon = True self._idle_timer = timer # Assignment outside lock! timer.start() # Start outside lock! return ``` **Race condition scenarios:** 1. Thread A calls `_check_idle()`, releases lock, creates timer 2. Thread B calls `shutdown()` → `_cancel_idle_timer()` → sets `self._idle_timer = None` 3. Thread A assigns new timer to `self._idle_timer` and starts it 4. Result: Timer starts after being "canceled", can't be stopped Similar issue exists in `_perform_health_check()` method. ### Expected Behavior - Timer creation and assignment should be atomic within lock protection - Canceled timers should never start after cancellation - Shutdown should reliably stop all timers - No timer resource leaks should occur ### Actual Behavior - Timer creation happens outside lock protection - Timers can start after being canceled by other threads - Potential for unstoppable timers that keep client alive - Race conditions in concurrent idle/health check scenarios ### Suggested Fix 1. Move timer creation and assignment inside lock protection 2. Add shutdown check after acquiring lock before creating new timers 3. Ensure `timer.start()` is called while holding the lock 4. Add defensive checks in cancel methods for partially-started timers ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it. ## Subtasks - [ ] Reproduce the race condition with a targeted concurrent test - [ ] Move `threading.Timer` creation and `self._idle_timer` assignment inside `self._lock` in `_check_idle()` - [ ] Add post-lock shutdown guard before starting the new timer in `_check_idle()` - [ ] Apply the same fix to `_perform_health_check()` timer management - [ ] Add defensive cancel checks for partially-started timers in `_cancel_idle_timer()` and `_cancel_health_check_timer()` - [ ] Write BDD scenarios covering concurrent shutdown + timer creation race - [ ] Verify no timer resource leaks under concurrent load ## Definition of Done - [ ] All subtasks checked - [ ] Commit created with exact first line: `fix(mcp): move timer creation and assignment inside lock protection in McpClient` - [ ] Commit pushed to branch `fix/mcp-client-timer-race-conditions` - [ ] PR submitted and merged - [ ] All nox stages pass - [ ] Coverage >= 97% --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: new-issue-creator
Author
Owner

⚠️ Orphan Issue — Manual Epic Linking Required

This issue has no parent Epic. Extensive search found no open MCP-specific Epic. The closest ancestor was issue #938 (feat(mcp): implement lazy start/auto-stop and sandbox path rewriting for MCP tools), which is now closed (milestone v3.5.0). The McpClient timer management code introduced in that feature is where this bug resides.

Action required by maintainer: Please link this issue to the appropriate parent Epic (or create a new MCP reliability/concurrency Epic if one does not exist), then set the dependency so this issue blocks the parent Epic.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

⚠️ **Orphan Issue — Manual Epic Linking Required** This issue has no parent Epic. Extensive search found no open MCP-specific Epic. The closest ancestor was issue #938 (`feat(mcp): implement lazy start/auto-stop and sandbox path rewriting for MCP tools`), which is now **closed** (milestone v3.5.0). The `McpClient` timer management code introduced in that feature is where this bug resides. **Action required by maintainer:** Please link this issue to the appropriate parent Epic (or create a new MCP reliability/concurrency Epic if one does not exist), then set the dependency so this issue **blocks** the parent Epic. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: new-issue-creator
Author
Owner

Verified — Concurrency bug: timer race conditions in McpClient idle/health check management. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Concurrency bug: timer race conditions in McpClient idle/health check management. MoSCoW: Should-have. Priority: Medium. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7103
No description provided.