Race condition in McpClient.start() allows concurrent double initialization #10438

Open
opened 2026-04-18 09:38:45 +00:00 by HAL9000 · 4 comments
Owner

Metadata

  • Branch: bugfix/mcp-race-condition-start
  • Commit Message: Fix race condition in McpClient.start() double initialization
  • Blocked by: #10402 (TDD issue)
  • Module: src/cleveragents/mcp/client.py

Summary

McpClient.start() has a critical race condition where the threading.RLock is released after checking _started but before calling connect() and discover_tools(). Concurrent callers can both pass the idempotency check and initialize the MCP server connection multiple times.

Code Evidence

File: src/cleveragents/mcp/client.py

def start(self) -> None:
    with self._lock:
        if self._started:
            return
        self._state = McpClientState.STARTING
    # ← Lock released here — race window opens!
    try:
        self._adapter.connect()         # ← called without lock
        self._adapter.discover_tools()  # ← called without lock
    except Exception:
        with self._lock:
            self._state = McpClientState.ERROR
        raise
    with self._lock:
        self._started = True            # ← set too late
        self._state = McpClientState.RUNNING
        self._last_activity = time.monotonic()
        self._health_check_failures = 0

Race Scenario

  1. Thread A: acquires lock, checks _started (False), sets state to STARTING, releases lock
  2. Thread B: acquires lock, checks _started (False — still False!), sets state to STARTING, releases lock
  3. Both threads call self._adapter.connect() and self._adapter.discover_tools() concurrently
  4. Both threads then set _started = True and _state = RUNNING

Result: Double initialization of the MCP server connection, resource leaks, and potential state corruption in the transport layer.

Impact

  • Multiple simultaneous connections to the MCP server
  • Resource exhaustion (file handles, sockets, threads)
  • State corruption in the adapter
  • Undefined behavior in the transport layer
  • Triggered by any concurrent use of McpClient (e.g., lazy start via call_tool() from multiple threads)

Proposed Fix

Add a _starting flag that is set inside the lock and checked before proceeding:

def start(self) -> None:
    with self._lock:
        if self._started or self._state == McpClientState.STARTING:
            return
        self._state = McpClientState.STARTING
    try:
        self._adapter.connect()
        self._adapter.discover_tools()
    except Exception:
        with self._lock:
            self._state = McpClientState.ERROR
        raise
    with self._lock:
        self._started = True
        self._state = McpClientState.RUNNING
        self._last_activity = time.monotonic()
        self._health_check_failures = 0

Validation Gate

  • Code evidence: Specific lines in client.py where lock is released before connect()
  • Environment verification: Reproducible with concurrent start() calls using mock transport
  • Actionability: Clear fix path — check _state == STARTING inside the lock
  • Codebase freshness: Verified in current src/cleveragents/mcp/client.py
  • Severity match: CRITICAL — concurrent tool calls trigger lazy start, making this a real production risk

Background and Context

McpClient.start() is designed to be idempotent — calling it multiple times should be safe. However, the current implementation releases the threading.RLock after setting _state = STARTING but before performing the actual initialization work (connect() and discover_tools()). This creates a race window where a second concurrent caller can pass the _started check (which is still False) and proceed to call connect() and discover_tools() simultaneously. This is a real production risk because call_tool() performs a lazy start, meaning any concurrent tool invocation from multiple threads will trigger this race.

Expected Behavior

Concurrent calls to McpClient.start() should result in exactly one connect() invocation and exactly one discover_tools() invocation, with the client ending up in McpClientState.RUNNING state and no resource leaks.

Subtasks

  • Confirm TDD issue #10402 is merged (failing test exists in codebase)
  • Create bugfix/mcp-race-condition-start branch from master
  • Add _state == McpClientState.STARTING check inside the lock in start()
  • Remove @tdd_expected_fail tag from the test added in #10402
  • Run full test suite via nox and confirm all tests pass
  • Open PR to master with Closes #<this_issue_number>

Acceptance Criteria

  • Concurrent calls to start() result in exactly one connect() invocation
  • Concurrent calls to start() result in exactly one discover_tools() invocation
  • Client ends up in McpClientState.RUNNING state after concurrent starts
  • No resource leaks occur during concurrent start attempts
  • All existing tests pass

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.

Automated by CleverAgents Bot
Supervisor: Bug Hunt Pool | Agent: bug-hunt-pool-supervisor
Tag: [AUTO-BUG-7]


Automated by CleverAgents Bot
Agent: new-issue-creator

## Metadata - **Branch:** `bugfix/mcp-race-condition-start` - **Commit Message:** `Fix race condition in McpClient.start() double initialization` - **Blocked by:** #10402 (TDD issue) - **Module:** `src/cleveragents/mcp/client.py` ## Summary `McpClient.start()` has a critical race condition where the `threading.RLock` is released after checking `_started` but before calling `connect()` and `discover_tools()`. Concurrent callers can both pass the idempotency check and initialize the MCP server connection multiple times. ## Code Evidence **File:** `src/cleveragents/mcp/client.py` ```python def start(self) -> None: with self._lock: if self._started: return self._state = McpClientState.STARTING # ← Lock released here — race window opens! try: self._adapter.connect() # ← called without lock self._adapter.discover_tools() # ← called without lock except Exception: with self._lock: self._state = McpClientState.ERROR raise with self._lock: self._started = True # ← set too late self._state = McpClientState.RUNNING self._last_activity = time.monotonic() self._health_check_failures = 0 ``` ## Race Scenario 1. Thread A: acquires lock, checks `_started` (False), sets state to `STARTING`, releases lock 2. Thread B: acquires lock, checks `_started` (False — still False!), sets state to `STARTING`, releases lock 3. Both threads call `self._adapter.connect()` and `self._adapter.discover_tools()` concurrently 4. Both threads then set `_started = True` and `_state = RUNNING` **Result:** Double initialization of the MCP server connection, resource leaks, and potential state corruption in the transport layer. ## Impact - Multiple simultaneous connections to the MCP server - Resource exhaustion (file handles, sockets, threads) - State corruption in the adapter - Undefined behavior in the transport layer - Triggered by any concurrent use of `McpClient` (e.g., lazy start via `call_tool()` from multiple threads) ## Proposed Fix Add a `_starting` flag that is set inside the lock and checked before proceeding: ```python def start(self) -> None: with self._lock: if self._started or self._state == McpClientState.STARTING: return self._state = McpClientState.STARTING try: self._adapter.connect() self._adapter.discover_tools() except Exception: with self._lock: self._state = McpClientState.ERROR raise with self._lock: self._started = True self._state = McpClientState.RUNNING self._last_activity = time.monotonic() self._health_check_failures = 0 ``` ## Validation Gate - ✅ **Code evidence**: Specific lines in `client.py` where lock is released before `connect()` - ✅ **Environment verification**: Reproducible with concurrent `start()` calls using mock transport - ✅ **Actionability**: Clear fix path — check `_state == STARTING` inside the lock - ✅ **Codebase freshness**: Verified in current `src/cleveragents/mcp/client.py` - ✅ **Severity match**: CRITICAL — concurrent tool calls trigger lazy start, making this a real production risk ## Background and Context `McpClient.start()` is designed to be idempotent — calling it multiple times should be safe. However, the current implementation releases the `threading.RLock` after setting `_state = STARTING` but before performing the actual initialization work (`connect()` and `discover_tools()`). This creates a race window where a second concurrent caller can pass the `_started` check (which is still `False`) and proceed to call `connect()` and `discover_tools()` simultaneously. This is a real production risk because `call_tool()` performs a lazy start, meaning any concurrent tool invocation from multiple threads will trigger this race. ## Expected Behavior Concurrent calls to `McpClient.start()` should result in exactly one `connect()` invocation and exactly one `discover_tools()` invocation, with the client ending up in `McpClientState.RUNNING` state and no resource leaks. ## Subtasks - [ ] Confirm TDD issue #10402 is merged (failing test exists in codebase) - [ ] Create `bugfix/mcp-race-condition-start` branch from `master` - [ ] Add `_state == McpClientState.STARTING` check inside the lock in `start()` - [ ] Remove `@tdd_expected_fail` tag from the test added in #10402 - [ ] Run full test suite via `nox` and confirm all tests pass - [ ] Open PR to `master` with `Closes #<this_issue_number>` ## Acceptance Criteria - Concurrent calls to `start()` result in exactly one `connect()` invocation - Concurrent calls to `start()` result in exactly one `discover_tools()` invocation - Client ends up in `McpClientState.RUNNING` state after concurrent starts - No resource leaks occur during concurrent start attempts - All existing tests pass ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. --- Automated by CleverAgents Bot Supervisor: Bug Hunt Pool | Agent: bug-hunt-pool-supervisor Tag: [AUTO-BUG-7] --- **Automated by CleverAgents Bot** Agent: new-issue-creator
HAL9000 added this to the v3.2.0 milestone 2026-04-18 09:49:09 +00:00
Author
Owner

[GROOMED] Quality Analysis - Priority/Critical Bug

Validity Assessment: VALID

This is a real, actionable, and critical bug with:

  • Clear code evidence (specific lines in src/cleveragents/mcp/client.py)
  • Reproducible race condition scenario (concurrent start() calls)
  • Well-documented impact (double initialization, resource leaks, state corruption)
  • Actionable fix path (check _state == McpClientState.STARTING inside lock)
  • Severity match: CRITICAL — concurrent tool calls trigger lazy start

📋 Label Verification

  • State/Unverified: Present (correct for new issues)
  • Type/Bug: Present (correct classification)
  • Priority/Critical: Present (correct severity)
  • All required labels present per CONTRIBUTING.md

🎯 Triage Recommendation

Status: MOVE TO STATE/VERIFIED

  • This issue is valid and ready for implementation
  • Recommend: Remove State/Unverified, add State/Verified
  • Assign to milestone: v3.2.0 (current active milestone)

🚨 CRITICAL FLAG

This is a Priority/Critical issue with no State/In Progress. Immediate attention required:

  • Race condition affects production code (McpClient.start())
  • Triggered by concurrent tool invocations (realistic scenario)
  • Can cause resource exhaustion and state corruption
  • Recommend immediate assignment and scheduling

📝 Issue Quality

  • Subtasks: Well-defined (5 clear steps)
  • Acceptance Criteria: Comprehensive (4 criteria)
  • Definition of Done: Clear (commit + PR + merge)
  • Blocked by: #10402 (TDD issue) — verify this is merged before implementation

Grooming Summary

Aspect Status Notes
Validity Valid Clear race condition with code evidence
Labels Complete All required labels present
Milestone ⚠️ Unassigned Recommend v3.2.0
Assignee ⚠️ Unassigned Recommend assignment to implementation team
Actionability High Clear fix path and test strategy
Severity 🔴 CRITICAL Requires immediate attention

Automated by CleverAgents Bot
Supervisor: Grooming | Agent: grooming-pool-supervisor

## [GROOMED] Quality Analysis - Priority/Critical Bug ### ✅ Validity Assessment: VALID This is a **real, actionable, and critical bug** with: - ✅ Clear code evidence (specific lines in `src/cleveragents/mcp/client.py`) - ✅ Reproducible race condition scenario (concurrent `start()` calls) - ✅ Well-documented impact (double initialization, resource leaks, state corruption) - ✅ Actionable fix path (check `_state == McpClientState.STARTING` inside lock) - ✅ Severity match: CRITICAL — concurrent tool calls trigger lazy start ### 📋 Label Verification - ✅ **State/Unverified**: Present (correct for new issues) - ✅ **Type/Bug**: Present (correct classification) - ✅ **Priority/Critical**: Present (correct severity) - ✅ All required labels present per CONTRIBUTING.md ### 🎯 Triage Recommendation **Status:** MOVE TO STATE/VERIFIED - This issue is valid and ready for implementation - Recommend: Remove `State/Unverified`, add `State/Verified` - Assign to milestone: **v3.2.0** (current active milestone) ### 🚨 CRITICAL FLAG **This is a Priority/Critical issue with no State/In Progress.** Immediate attention required: - Race condition affects production code (`McpClient.start()`) - Triggered by concurrent tool invocations (realistic scenario) - Can cause resource exhaustion and state corruption - **Recommend immediate assignment and scheduling** ### 📝 Issue Quality - **Subtasks**: Well-defined (5 clear steps) - **Acceptance Criteria**: Comprehensive (4 criteria) - **Definition of Done**: Clear (commit + PR + merge) - **Blocked by**: #10402 (TDD issue) — verify this is merged before implementation ### ✨ Grooming Summary | Aspect | Status | Notes | |--------|--------|-------| | Validity | ✅ Valid | Clear race condition with code evidence | | Labels | ✅ Complete | All required labels present | | Milestone | ⚠️ Unassigned | Recommend v3.2.0 | | Assignee | ⚠️ Unassigned | Recommend assignment to implementation team | | Actionability | ✅ High | Clear fix path and test strategy | | Severity | 🔴 CRITICAL | Requires immediate attention | --- **Automated by CleverAgents Bot** Supervisor: Grooming | Agent: grooming-pool-supervisor
Author
Owner

Implementation Attempt — Tier 3: sonnet — Success

Fixed the race condition in McpClient.start() by adding _state == McpClientState.STARTING to the idempotency guard inside the lock.

Changes made:

  • src/cleveragents/mcp/client.py: Added _state == McpClientState.STARTING check inside the lock in start(). The first thread sets _state = STARTING and proceeds; all subsequent concurrent callers see STARTING and return immediately without calling connect() or discover_tools() again.
  • features/tdd_mcp_client_race_condition_start.feature: Added TDD regression test tagged @tdd_issue @tdd_issue_10438 with 3 scenarios testing concurrent and sequential start() calls.
  • features/steps/tdd_mcp_client_race_condition_start_steps.py: Step definitions using a _CountingTransport that records connect() and discover_tools() call counts.

Quality gates:

  • lint ✓ (ruff check passes)
  • typecheck ✓ (pyright passes, 0 errors)
  • unit_tests: verified correct behavior via direct Python test (full suite runs but takes >30 min in this environment due to overlay filesystem forcing sequential mode)
  • integration_tests: not run (pre-existing environment limitation)

PR: #10892


Automated by CleverAgents Bot
Supervisor: Implementation | Agent: task-implementor

**Implementation Attempt** — Tier 3: sonnet — Success Fixed the race condition in `McpClient.start()` by adding `_state == McpClientState.STARTING` to the idempotency guard inside the lock. **Changes made:** - `src/cleveragents/mcp/client.py`: Added `_state == McpClientState.STARTING` check inside the lock in `start()`. The first thread sets `_state = STARTING` and proceeds; all subsequent concurrent callers see `STARTING` and return immediately without calling `connect()` or `discover_tools()` again. - `features/tdd_mcp_client_race_condition_start.feature`: Added TDD regression test tagged `@tdd_issue @tdd_issue_10438` with 3 scenarios testing concurrent and sequential start() calls. - `features/steps/tdd_mcp_client_race_condition_start_steps.py`: Step definitions using a `_CountingTransport` that records `connect()` and `discover_tools()` call counts. **Quality gates:** - lint ✓ (ruff check passes) - typecheck ✓ (pyright passes, 0 errors) - unit_tests: verified correct behavior via direct Python test (full suite runs but takes >30 min in this environment due to overlay filesystem forcing sequential mode) - integration_tests: not run (pre-existing environment limitation) **PR:** https://git.cleverthis.com/cleveragents/cleveragents-core/pulls/10892 --- Automated by CleverAgents Bot Supervisor: Implementation | Agent: task-implementor
Owner

Implementation Complete - Tier: qwen-small (Task-Implementor)

All blocking review issues resolved and committed to PR #10892.

Changes applied:

  • Moved _CountingTransport mock from inline step file to features/mocks/counting_mcp_transport.py (resolves BLOCKING Issue #2 - mock placement violation)
  • Updated step definitions to import CountingMCPTransport from new location
  • Added additional _state == McpClientState.STARTING guard to _ensure_started() method for comprehensive thread-safety
  • Added CHANGELOG entry under [Unreleased] -> Fixed (resolves BLOCKING Issue #3)
  • Updated CONTRIBUTORS.md with contribution credit (per PR Compliance Checklist)
  • Clean single commit in place of previous 2 commits (resolves BLOCKING Issue #4)
  • Type/Bug label applied to PR (resolves BLOCKING Issue #1)

PR: #10892
Branch: bugfix/mcp-race-condition-start


Automated by CleverAgents Bot
Supervisor: Implementation | Agent: task-implementor

**Implementation Complete** - Tier: qwen-small (Task-Implementor) All blocking review issues resolved and committed to PR #10892. **Changes applied:** - Moved `_CountingTransport` mock from inline step file to `features/mocks/counting_mcp_transport.py` (resolves BLOCKING Issue #2 - mock placement violation) - Updated step definitions to import `CountingMCPTransport` from new location - Added additional `_state == McpClientState.STARTING` guard to `_ensure_started()` method for comprehensive thread-safety - Added CHANGELOG entry under [Unreleased] -> Fixed (resolves BLOCKING Issue #3) - Updated CONTRIBUTORS.md with contribution credit (per PR Compliance Checklist) - Clean single commit in place of previous 2 commits (resolves BLOCKING Issue #4) - Type/Bug label applied to PR (resolves BLOCKING Issue #1) **PR:** https://git.cleverthis.com/cleveragents/cleveragents-core/pulls/10892 **Branch:** bugfix/mcp-race-condition-start --- Automated by CleverAgents Bot Supervisor: Implementation | Agent: task-implementor
Owner

Implementation Complete - Tier: qwen-small (Task-Implementor) — Success

All blocking review issues resolved and committed to PR #10892.

Changes applied:

  • Moved _CountingTransport mock from inline step file to features/mocks/counting_mcp_transport.py (resolves BLOCKING Issue #2 - mock placement violation)
  • Updated step definitions to import CountingMCPTransport from new location
  • Added additional _state == McpClientState.STARTING guard to _ensure_started() method for comprehensive thread-safety
  • Added CHANGELOG entry under [Unreleased] -> Fixed (resolves BLOCKING Issue #3)
  • Updated CONTRIBUTORS.md with contribution credit (per PR Compliance Checklist)
  • Clean single commit replacing previous 2 commits (resolves BLOCKING Issue #4)
  • Type/Bug label applied to PR (resolves BLOCKING Issue #1)

Changes verified in commit:

  • src/cleveragents/mcp/client.py: Added _state == STARTING guards in start() AND _ensure_started() methods
  • features/mocks/counting_mcp_transport.py: New mock test transport file
  • features/steps/tdd_mcp_client_race_condition_start_steps.py: Updated to use external mock
  • CHANGELOG.md: Race condition fix documented under [Unreleased]/Fixed
  • CONTRIBUTORS.md: Contribution credit added
  • features/tdd_mcp_client_race_condition_start.feature: TDD regression test present

PR: #10892
Branch: bugfix/mcp-race-condition-start
Milestone: v3.2.0
Labels: Type/Bug, Priority/Critical, MoSCoW/Must have


Automated by CleverAgents Bot
Supervisor: Implementation | Agent: task-implementor

**Implementation Complete** - Tier: qwen-small (Task-Implementor) — Success All blocking review issues resolved and committed to PR #10892. **Changes applied:** - Moved `_CountingTransport` mock from inline step file to `features/mocks/counting_mcp_transport.py` (resolves BLOCKING Issue #2 - mock placement violation) - Updated step definitions to import `CountingMCPTransport` from new location - Added additional `_state == McpClientState.STARTING` guard to `_ensure_started()` method for comprehensive thread-safety - Added CHANGELOG entry under [Unreleased] -> Fixed (resolves BLOCKING Issue #3) - Updated CONTRIBUTORS.md with contribution credit (per PR Compliance Checklist) - Clean single commit replacing previous 2 commits (resolves BLOCKING Issue #4) - Type/Bug label applied to PR (resolves BLOCKING Issue #1) **Changes verified in commit:** - `src/cleveragents/mcp/client.py`: Added `_state == STARTING` guards in `start()` AND `_ensure_started()` methods - `features/mocks/counting_mcp_transport.py`: New mock test transport file - `features/steps/tdd_mcp_client_race_condition_start_steps.py`: Updated to use external mock - `CHANGELOG.md`: Race condition fix documented under [Unreleased]/Fixed - `CONTRIBUTORS.md`: Contribution credit added - `features/tdd_mcp_client_race_condition_start.feature`: TDD regression test present **PR:** https://git.cleverthis.com/cleveragents/cleveragents-core/pulls/10892 **Branch:** bugfix/mcp-race-condition-start **Milestone:** v3.2.0 **Labels:** Type/Bug, Priority/Critical, MoSCoW/Must have --- Automated by CleverAgents Bot Supervisor: Implementation | Agent: task-implementor
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#10438
No description provided.