TDD: Race condition in McpClient.start() allows concurrent double initialization #10402

Closed
opened 2026-04-18 09:30:53 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Branch: tdd/mcp-race-condition-start
  • Commit Message: TDD: Add test for race condition in McpClient.start() double initialization
  • Related Bug: (to be linked after bug issue is created)

Summary

This TDD issue captures the race condition in McpClient.start() where concurrent calls can bypass the idempotency check and initialize the MCP server connection multiple times.

Root Cause

In src/cleveragents/mcp/client.py, the start() method releases the lock after checking _started but before calling connect() and discover_tools():

def start(self) -> None:
    with self._lock:
        if self._started:
            return
        self._state = McpClientState.STARTING
    # Lock released here — race window opens!
    try:
        self._adapter.connect()       # ← called without lock
        self._adapter.discover_tools()  # ← called without lock
    except Exception:
        with self._lock:
            self._state = McpClientState.ERROR
        raise
    with self._lock:
        self._started = True          # ← set too late
        ...

Race scenario:

  1. Thread A: acquires lock, checks _started (False), sets state to STARTING, releases lock
  2. Thread B: acquires lock, checks _started (False — still False!), sets state to STARTING, releases lock
  3. Both threads call connect() and discover_tools() concurrently
  4. Result: double initialization, resource leaks, state corruption

Test to Write

A test tagged with @tdd_issue, @tdd_issue_<N>, and @tdd_expected_fail that:

  1. Creates a McpClient with a mock transport
  2. Calls start() from multiple threads simultaneously (e.g., 5 threads)
  3. Asserts that connect() was called exactly once
  4. Asserts that discover_tools() was called exactly once
  5. Asserts that the client ends up in McpClientState.RUNNING state

Subtasks

  • Write a concurrent test using threading.Thread to call start() from 5 threads simultaneously
  • Use a mock transport that counts connect() and discover_tools() calls
  • Tag the test with @tdd_issue, @tdd_issue_<N>, and @tdd_expected_fail
  • Verify the test fails (proving the bug exists) before the fix is applied
  • Create tdd/mcp-race-condition-start branch and open PR

Acceptance Criteria

  • Test demonstrates that concurrent start() calls result in multiple connect() invocations (proving the bug)
  • Test is tagged with @tdd_issue, @tdd_issue_<N>, and @tdd_expected_fail
  • Test passes CI with @tdd_expected_fail tag (inverted result)
  • Test is in the appropriate test file under features/ or robot/

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.

Automated by CleverAgents Bot
Supervisor: Bug Hunt Pool | Agent: bug-hunt-pool-supervisor
Tag: [AUTO-BUG-7]


Automated by CleverAgents Bot
Agent: new-issue-creator

## Metadata - **Branch:** `tdd/mcp-race-condition-start` - **Commit Message:** `TDD: Add test for race condition in McpClient.start() double initialization` - **Related Bug:** (to be linked after bug issue is created) ## Summary This TDD issue captures the race condition in `McpClient.start()` where concurrent calls can bypass the idempotency check and initialize the MCP server connection multiple times. ## Root Cause In `src/cleveragents/mcp/client.py`, the `start()` method releases the lock after checking `_started` but before calling `connect()` and `discover_tools()`: ```python def start(self) -> None: with self._lock: if self._started: return self._state = McpClientState.STARTING # Lock released here — race window opens! try: self._adapter.connect() # ← called without lock self._adapter.discover_tools() # ← called without lock except Exception: with self._lock: self._state = McpClientState.ERROR raise with self._lock: self._started = True # ← set too late ... ``` **Race scenario:** 1. Thread A: acquires lock, checks `_started` (False), sets state to STARTING, releases lock 2. Thread B: acquires lock, checks `_started` (False — still False!), sets state to STARTING, releases lock 3. Both threads call `connect()` and `discover_tools()` concurrently 4. Result: double initialization, resource leaks, state corruption ## Test to Write A test tagged with `@tdd_issue`, `@tdd_issue_<N>`, and `@tdd_expected_fail` that: 1. Creates a `McpClient` with a mock transport 2. Calls `start()` from multiple threads simultaneously (e.g., 5 threads) 3. Asserts that `connect()` was called exactly once 4. Asserts that `discover_tools()` was called exactly once 5. Asserts that the client ends up in `McpClientState.RUNNING` state ## Subtasks - [ ] Write a concurrent test using `threading.Thread` to call `start()` from 5 threads simultaneously - [ ] Use a mock transport that counts `connect()` and `discover_tools()` calls - [ ] Tag the test with `@tdd_issue`, `@tdd_issue_<N>`, and `@tdd_expected_fail` - [ ] Verify the test fails (proving the bug exists) before the fix is applied - [ ] Create `tdd/mcp-race-condition-start` branch and open PR ## Acceptance Criteria - Test demonstrates that concurrent `start()` calls result in multiple `connect()` invocations (proving the bug) - Test is tagged with `@tdd_issue`, `@tdd_issue_<N>`, and `@tdd_expected_fail` - Test passes CI with `@tdd_expected_fail` tag (inverted result) - Test is in the appropriate test file under `features/` or `robot/` ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. --- Automated by CleverAgents Bot Supervisor: Bug Hunt Pool | Agent: bug-hunt-pool-supervisor Tag: [AUTO-BUG-7] --- **Automated by CleverAgents Bot** Agent: new-issue-creator
Author
Owner

Implementation Attempt — Tier 3: sonnet — Success

Implemented the TDD issue-capture test for bug #10438 (McpClient.start() race condition).

What was done:

  • Created features/tdd_mcp_client_start_race.feature with a Gherkin scenario tagged @tdd_expected_fail @tdd_issue @tdd_issue_10438
  • Created features/steps/tdd_mcp_client_start_race_steps.py with a _CountingMockTransport that records connect() and discover_tools() call counts
  • Used threading.Barrier to synchronise 5 concurrent start() calls, maximising the chance of the race manifesting

Test behavior:

  • With bug present: discover_tools() is called 5 times (once per thread). The assertion fails. @tdd_expected_fail inverts this to a CI pass.
  • After bug fix: Both connect() and discover_tools() are called exactly once. The scenario passes. @tdd_expected_fail inverts this to a CI fail (unexpected pass), signalling the tag must be removed.

Quality gate status: lint , typecheck , unit_tests (15,661 scenarios passed), integration_tests ⚠️ (7 pre-existing failures), e2e_tests ⚠️ (10 pre-existing failures)

PR: #10915


Automated by CleverAgents Bot
Supervisor: Implementation | Agent: task-implementor

**Implementation Attempt** — Tier 3: sonnet — Success Implemented the TDD issue-capture test for bug #10438 (McpClient.start() race condition). **What was done:** - Created `features/tdd_mcp_client_start_race.feature` with a Gherkin scenario tagged `@tdd_expected_fail @tdd_issue @tdd_issue_10438` - Created `features/steps/tdd_mcp_client_start_race_steps.py` with a `_CountingMockTransport` that records `connect()` and `discover_tools()` call counts - Used `threading.Barrier` to synchronise 5 concurrent `start()` calls, maximising the chance of the race manifesting **Test behavior:** - With bug present: `discover_tools()` is called 5 times (once per thread). The assertion fails. `@tdd_expected_fail` inverts this to a CI pass. - After bug fix: Both `connect()` and `discover_tools()` are called exactly once. The scenario passes. `@tdd_expected_fail` inverts this to a CI fail (unexpected pass), signalling the tag must be removed. **Quality gate status:** lint ✅, typecheck ✅, unit_tests ✅ (15,661 scenarios passed), integration_tests ⚠️ (7 pre-existing failures), e2e_tests ⚠️ (10 pre-existing failures) **PR:** https://git.cleverthis.com/cleveragents/cleveragents-core/pulls/10915 --- Automated by CleverAgents Bot Supervisor: Implementation | Agent: task-implementor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#10402
No description provided.