BUG-HUNT: [MCP-CONCURRENCY] Race condition in McpRegistry.register() during client replacement #7121

Open
opened 2026-04-10 07:56:48 +00:00 by HAL9000 · 2 comments
Owner

Metadata

  • Branch: fix/mcp-registry-register-race-condition
  • Commit Message: fix(mcp): make McpRegistry.register() client replacement atomic to prevent shut-down client access
  • Milestone: (none — backlog, see note below)
  • Parent Epic: Orphan — see note below; requires manual Epic linking

Backlog note: This issue was discovered during autonomous operation
on milestone v3.5.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Bug Report: [CONCURRENCY] — Race condition in McpRegistry.register() during client replacement

Severity Assessment

  • Impact: Potential access to shut-down client, inconsistent state during concurrent access
  • Likelihood: Low, requires precise timing of concurrent register/get/call_tool operations
  • Priority: Medium

Location

  • File: src/cleveragents/mcp/registry.py
  • Function/Class: McpRegistry.register(), McpRegistry.get(), McpRegistry.call_tool()
  • Lines: 50-89, 115-161

Description

The McpRegistry.register() method has a race condition where it shuts down an existing client and replaces it, but other threads may access the shut-down client between shutdown and replacement.

Evidence

# In src/cleveragents/mcp/registry.py lines 76-82:
with self._lock:
    existing = self._clients.get(namespace)
    if existing is not None:
        existing.shutdown()  # Client is shut down...

    client = McpClient(config=config, transport=transport)
    self._clients[namespace] = client  # ...but replacement happens later

Race condition scenario:

  1. Thread A calls register("ns", new_config)
  2. Thread A gets existing client, calls existing.shutdown()
  3. Thread B calls call_tool("ns", "tool", args)
  4. Thread B gets the shut-down client (still in dict until replacement)
  5. Thread B calls client.call_tool() on shut-down client → failure

Similar issue exists where call_tool() gets a client while holding the lock but calls the client after releasing the lock.

Expected Behavior

  • Client replacement should be atomic - no access to shut-down clients
  • Concurrent access should either get the old client (still functional) or the new client
  • No access to clients in transitional shutdown state

Actual Behavior

  • Window exists where shut-down client is still accessible via get()/call_tool()
  • call_tool() method releases lock before invoking client, allowing shutdown race
  • Inconsistent state during concurrent registry operations

Suggested Fix

  1. Use atomic replacement pattern - create new client first, then replace in single lock
  2. Add client state checking in call_tool() method
  3. Consider using versioning or reference counting for safe concurrent access
  4. Add retry logic for transient shutdown states

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_, and @tdd_expected_fail to prove the bug exists before fixing it.

Subtasks

  • Reproduce the race condition with a targeted concurrent test
  • Implement atomic replacement pattern in McpRegistry.register() (create new client before removing old)
  • Add client state guard in McpRegistry.call_tool() to detect shut-down clients
  • Consider reference counting or versioning for safe concurrent client access
  • Add retry logic for transient shutdown states in call_tool()
  • Update unit tests (Behave BDD) to cover concurrent register/get/call_tool scenarios
  • Update integration tests (Robot Framework) for concurrent MCP registry operations
  • Update documentation for McpRegistry thread-safety guarantees

Definition of Done

  • Race condition is reproducible via a failing @tdd_expected_fail test before fix
  • McpRegistry.register() performs atomic client replacement — no window where shut-down client is accessible
  • McpRegistry.call_tool() handles shut-down client state gracefully
  • All Behave BDD unit tests pass (nox -s unit_tests)
  • All Robot Framework integration tests pass (nox -s integration_tests)
  • All nox stages pass
  • Coverage >= 97%

Automated by CleverAgents Bot
Supervisor: Unknown | Agent: new-issue-creator

## Metadata - **Branch**: `fix/mcp-registry-register-race-condition` - **Commit Message**: `fix(mcp): make McpRegistry.register() client replacement atomic to prevent shut-down client access` - **Milestone**: (none — backlog, see note below) - **Parent Epic**: _Orphan — see note below; requires manual Epic linking_ > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.5.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- ## Bug Report: [CONCURRENCY] — Race condition in McpRegistry.register() during client replacement ### Severity Assessment - **Impact**: Potential access to shut-down client, inconsistent state during concurrent access - **Likelihood**: Low, requires precise timing of concurrent register/get/call_tool operations - **Priority**: Medium ### Location - **File**: `src/cleveragents/mcp/registry.py` - **Function/Class**: `McpRegistry.register()`, `McpRegistry.get()`, `McpRegistry.call_tool()` - **Lines**: 50-89, 115-161 ### Description The McpRegistry.register() method has a race condition where it shuts down an existing client and replaces it, but other threads may access the shut-down client between shutdown and replacement. ### Evidence ```python # In src/cleveragents/mcp/registry.py lines 76-82: with self._lock: existing = self._clients.get(namespace) if existing is not None: existing.shutdown() # Client is shut down... client = McpClient(config=config, transport=transport) self._clients[namespace] = client # ...but replacement happens later ``` **Race condition scenario:** 1. Thread A calls `register("ns", new_config)` 2. Thread A gets existing client, calls `existing.shutdown()` 3. Thread B calls `call_tool("ns", "tool", args)` 4. Thread B gets the shut-down client (still in dict until replacement) 5. Thread B calls `client.call_tool()` on shut-down client → failure Similar issue exists where `call_tool()` gets a client while holding the lock but calls the client after releasing the lock. ### Expected Behavior - Client replacement should be atomic - no access to shut-down clients - Concurrent access should either get the old client (still functional) or the new client - No access to clients in transitional shutdown state ### Actual Behavior - Window exists where shut-down client is still accessible via get()/call_tool() - call_tool() method releases lock before invoking client, allowing shutdown race - Inconsistent state during concurrent registry operations ### Suggested Fix 1. Use atomic replacement pattern - create new client first, then replace in single lock 2. Add client state checking in call_tool() method 3. Consider using versioning or reference counting for safe concurrent access 4. Add retry logic for transient shutdown states ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it. ## Subtasks - [ ] Reproduce the race condition with a targeted concurrent test - [ ] Implement atomic replacement pattern in `McpRegistry.register()` (create new client before removing old) - [ ] Add client state guard in `McpRegistry.call_tool()` to detect shut-down clients - [ ] Consider reference counting or versioning for safe concurrent client access - [ ] Add retry logic for transient shutdown states in `call_tool()` - [ ] Update unit tests (Behave BDD) to cover concurrent register/get/call_tool scenarios - [ ] Update integration tests (Robot Framework) for concurrent MCP registry operations - [ ] Update documentation for `McpRegistry` thread-safety guarantees ## Definition of Done - [ ] Race condition is reproducible via a failing `@tdd_expected_fail` test before fix - [ ] `McpRegistry.register()` performs atomic client replacement — no window where shut-down client is accessible - [ ] `McpRegistry.call_tool()` handles shut-down client state gracefully - [ ] All Behave BDD unit tests pass (`nox -s unit_tests`) - [ ] All Robot Framework integration tests pass (`nox -s integration_tests`) - [ ] All nox stages pass - [ ] Coverage >= 97% --- **Automated by CleverAgents Bot** Supervisor: Unknown | Agent: new-issue-creator
Author
Owner

⚠️ Orphan Issue — Manual Epic Linking Required

This issue was created during autonomous bug-hunt operation and no suitable parent Epic was found at creation time. Per CONTRIBUTING.md, every Issue must belong to at least one parent Epic (child blocks parent; parent depends on child).

Action required by maintainer:

  1. Identify or create the appropriate parent Epic for MCP concurrency/thread-safety work
  2. Link this issue as a child: set this issue (#7121) to block the parent Epic
  3. The parent Epic should depend on this issue

Related sibling issue for reference: #7103 (BUG-HUNT: [MCP-CONCURRENCY] Timer race conditions in McpClient idle and health check management) — also an orphan awaiting Epic linking.


Automated by CleverAgents Bot
Supervisor: Unknown | Agent: new-issue-creator

⚠️ **Orphan Issue — Manual Epic Linking Required** This issue was created during autonomous bug-hunt operation and no suitable parent Epic was found at creation time. Per CONTRIBUTING.md, every Issue **must** belong to at least one parent Epic (child blocks parent; parent depends on child). **Action required by maintainer:** 1. Identify or create the appropriate parent Epic for MCP concurrency/thread-safety work 2. Link this issue as a child: set this issue (#7121) to **block** the parent Epic 3. The parent Epic should **depend on** this issue Related sibling issue for reference: #7103 (BUG-HUNT: [MCP-CONCURRENCY] Timer race conditions in McpClient idle and health check management) — also an orphan awaiting Epic linking. --- **Automated by CleverAgents Bot** Supervisor: Unknown | Agent: new-issue-creator
Author
Owner

Verified — Critical concurrency bug: race condition in McpRegistry.register(). MoSCoW: Must-have. Priority: Critical.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Critical concurrency bug: race condition in McpRegistry.register(). MoSCoW: Must-have. Priority: Critical. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7121
No description provided.