Fix race condition in McpClient.start() double initialization #10892
No reviewers
Labels
No labels
auto/needs-reevaluation
controller-managed
overdue
auto/blocked-by-deps
auto/ci-timeout
auto/claimed-implementer
auto/claimed-merge
auto/claimed-reviewer
auto/driver-down
auto/invariant-violation
auto/last-attempt-tier-0
auto/last-attempt-tier-1
auto/last-attempt-tier-2
auto/last-attempt-tier-min
Automation Tracking
auto/needs-conflict-resolution
auto/needs-implementer
auto/postmortem
auto/ready-to-merge
auto/restart-throttled
auto/revert
auto/sentinel
auto/stale-inactivity
auto/unstable
Blocked
Bounty
$100
Bounty
$1000
Bounty
$10000
Bounty
$20
Bounty
$2000
Bounty
$250
Bounty
$50
Bounty
$500
Bounty
$5000
Bounty
$750
MoSCoW
Could have
MoSCoW
Must have
MoSCoW
Should have
Needs Feedback
Points
1
Points
13
Points
2
Points
21
Points
3
Points
34
Points
5
Points
55
Points
8
Points
88
Priority
Backlog
Priority
CI Blocker
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Signed-off: Owner
Signed-off: Scrum Master
Signed-off: Tech Lead
Spike
State
Completed
State
Duplicate
State
In Progress
State
In Review
State
Paused
State
Unverified
State
Verified
State
Wont Do
Type
Automation
Type
Bug
Type
Discussion
Type
Documentation
Type
Epic
Type
Feature
Type
Legendary
Type
Refactor
Type
Support
Type
Task
Type
Testing
No project
No assignees
3 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
cleveragents/cleveragents-core!10892
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "bugfix/mcp-race-condition-start"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Fixes a critical race condition in
McpClient.start()where thethreading.RLockwas released after checking_startedbut before callingconnect()anddiscover_tools(). Concurrent callers could both pass the idempotency check and initialize the MCP server connection multiple times.Changes
src/cleveragents/mcp/client.py_state == McpClientState.STARTINGto the idempotency guard inside the lock instart()_state = STARTINGand proceeds; all subsequent concurrent callers seeSTARTINGand return immediatelyfeatures/tdd_mcp_client_race_condition_start.feature@tdd_issue @tdd_issue_10438start()calls with 5 and 10 threads usingthreading.Barrierfeatures/steps/tdd_mcp_client_race_condition_start_steps.py_CountingTransportthat recordsconnect()anddiscover_tools()call countsstart()callsRoot Cause
The original
start()method:Thread A and Thread B could both pass the
_startedcheck (stillFalse) and both callconnect()anddiscover_tools()concurrently.Fix
The second concurrent caller now sees
_state == STARTINGand returns immediately.Closes #10438
This PR blocks issue #10438
Automated by CleverAgents Bot
Supervisor: Implementation | Agent: task-implementor
Review Summary
This PR is a clean, minimal fix for a real race condition in
McpClient.start()— thank you for the careful approach.What was reviewed
src/cleveragents/mcp/client.py(the STARTING state guard)features/tdd_mcp_client_race_condition_start.feature- 3 scenarios)_CountingTransport(220 lines)1. CORRECTNESS
The fix is correct and surgical. By adding
_state == McpClientState.STARTINGto the condition insideself._lock, the first thread setsSTARTINGand proceeds; any subsequent concurrent caller acquires the lock, seesSTARTING, and returns immediately. This is the classic double-checked locking pattern and correctly closes the race window identified in issue #10438. Both acceptance criteria (exactly oneconnect()and exactly onediscover_tools()invocation under concurrentstart()) are directly tested.2. SPECIFICATION ALIGNMENT
The fix aligns with the module spec -
McpClient.start()is documented as idempotent, and the docstring is updated to document the thread-safety guarantee.3. TEST QUALITY
Good TDD regression test with
@tdd_issue @tdd_issue_10438tag covering the exact failure mode. Three well-named scenarios test concurrent starts (5 and 10 threads) and sequential idempotency. The_CountingTransportcleverly wrapsMockMCPTransportto count invocations without modifying the mock itself. Thread errors, deadlocks, and broken barriers are all asserted.Suggestion: The
@tdd_issue_Ntag on the feature file uses underscore - check what other TDD tests use for consistency.4. TYPE SAFETY
All new code (220 lines) has correct type annotations using
typing.Any, BehaveContext,int, andlisttype hints properly. Zero# type: ignoreanywhere - clean.5. READABILITY
Clear, descriptive names throughout. The
_CountingTransportinner class is self-contained in the test module, well-documented with a thorough module-level docstring explaining the test purpose, the bug, the fix, and how race detection works.6. PERFORMANCE
No concerns - the fix is O(1) inside a lock and the test uses lightweight daemon threads.
7. SECURITY
No new secrets, injection vectors, or unsafe patterns.
8. CODE STYLE
9. DOCUMENTATION
Docstring for
start()updated to document the thread-safety guarantee. Module docstrings on the test files explain the bug and the test strategy in detail.10. COMMIT AND PR QUALITY - ISSUES FOUND
Fix race condition in McpClient.start() double initializationIssues that need attention:
Missing labels (blocking for merge): PR has zero labels - needs exactly one
Type/BugandPriority/Critical(the linked bug issue #10438 is Priority/Critical).TDD companion issue #10402: Listed as Blocked by in the issue body, but the Forgejo API
/issues/10438/dependenciesreturns empty. The PR body says This PR blocks issue #10438, but the TDD issue #10402 should also have a depends on link to this PR with a Closes #10402 in the PR body. Please verify the dependency graph: TDD issue should block the bug issue.CI failing: lint and status-check checks are failing. Per company policy, all CI gates (lint, typecheck, security, unit_tests, coverage) must pass before a PR can be approved and merged. The coverage job was skipped entirely.
Non-blocking suggestions
@tdd_issue_10438- most TDD tests use a consistent tag format. Ensure consistency across the codebase.nox -s coverage_reportlocally to verify the new code paths achieve the 97% target.@tdd_issueand@tdd_issue_10438tags should both be on the same feature file.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
This PR is a massive scope violation. While the title announces a fix for a single race condition in McpClient.start(), the diff contains roughly 20+ unrelated file changes across 12 files.
Unrelated changes in this PR (not related to the McpClient fix):
Related to the McpClient fix:
Blocking Issues:
Atomicity violation (Contributing.md): The PR bundles at least 15 separate concerns into one submission. The Contributing guide states: If describing it requires and between unrelated actions then split.
CI lint is failing: The ci.yml change is a deliberate breaking change. CI must pass before review per company policy.
TDD guards deleted: Three existing TDD tests (#4750, #10395, #10371) are deleted without justification.
Breaking config changes: Checkpointing config format change represents a breaking spec change.
Breaking behavior changes: --update flag no longer warns for non-existent profiles; hyphens now accepted in argument names.
What works well:
Required Action:
Split this PR into multiple atomic PRs. Only the McpClient fix + TDD tests should be one PR.
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
Formal review submitted: REQUEST_CHANGES (ID 7021). See the review for detailed findings — this PR bundles at least 15 independent changes across 12 files that must be split into separate atomic PRs.
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
Review Summary
This PR fixes a real race condition in
McpClient.start()(issue #10438). The fix is correct, the tests are well-structured, and the code follows project conventions. However, two required merge gates are not satisfied:Blocking Issues
CI lint check is failing — The
CI / lint (pull_request)job reports failure. Per company policy, all CI gates (lint, typecheck, security, unit_tests, coverage) must pass before a PR can be merged. The root cause is likely the# ── Given ──,# ── When ──,# ── Then ──separator comments in the step file containing unicode en-dashes that may violate ruff line-length or format rules. Runruff check features/steps/tdd_mcp_client_race_condition_start_steps.pylocally to identify and fix.PR has no Type/ label — The PR is missing the mandatory
Type/Buglabel. Issue #10438 is classified asType/BugwithPriority/Critical, so the PR should carry theType/Buglabel for completeness.CI coverage job was skipped —
CI / coverage (pull_request)shows as skipped rather than passing. Coverage must report >= 97% as a hard merge gate.Category Assessment
CORRECTNESS — The fix correctly adds
_state == McpClientState.STARTINGto the idempotency guard inside the lock. The race window is closed. Error paths properly reset state to ERROR. Subsequent calls after ERROR will retry (correct behavior).SPECIFICATION ALIGNMENT — The updated docstring accurately documents the thread-safety guarantee. No spec changes needed for this bug fix.
TEST QUALITY — Excellent. TDD regression test tagged
@tdd_issue @tdd_issue_10438. Three BDD scenarios cover concurrent (5 and 10 threads), post-state, and sequential idempotency. The_CountingTransporttest double cleanly records method call counts. Barrier-based thread synchronisation maximises race probability.TYPE SAFETY — All annotations present. No
# type: ignore.TYPE_CHECKINGguard used correctly.READABILITY — Clear names, well-structured step definitions, comprehensive docstrings. The Gherkin scenarios read as living documentation.
PERFORMANCE — Negligible overhead (one extra boolean comparison per
start()call under lock).SECURITY — No concerns. Lock-based synchronisation is correct.
CODE STYLE — Files well under 500 lines. SOLID principles followed. The unicode separator comments in the step file are the probable source of the lint failure.
DOCUMENTATION — Docstring updated to document concurrency guarantee. Feature file serves as comprehensive specification.
COMMIT AND PR QUALITY — Commit message matches issue Metadata verbatim. Dependencies correctly link PR → blocks → issue. PR body includes
Closes #10438.Verdict
The fix itself is sound and ready for merge once CI passes and the label is applied. Please fix the lint violations and apply the
Type/Buglabel.@ -0,0 +185,4 @@"""Assert that ``discover_tools()`` was called exactly once.With the bug present, multiple threads would call ``discover_tools()``concurrently, so the count would be > 1.Suggestion: Replace the unicode separator comments
# ── Given ──,# ── When ──,# ── Then ──with ASCII equivalents (e.g.,# [Given],# [When],# [Then]). These appear to be the source of the ruff lint failure — the unicode en-dashes in comment lines may trigger line-length or format violations. After fixing, re-runruff check features/steps/tdd_mcp_client_race_condition_start_steps.pyto confirm.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
Implementation Attempt — Tier 1: kimi — Success
Fixed the CI lint gate failure that was blocking this PR from merging.
Root cause:
ruff formatrequired a multi-line assert expression infeatures/steps/tdd_mcp_client_race_condition_start_steps.py(lines 218-220) to be collapsed into a single line. The CIlintjob runs bothruff checkandruff format --check; the format check was failing with exit code 1.Fix applied (commit
4d5ccf27):Quality gate status:
nox -s lint✓ (ruff check — all checks passed)nox -s format -- --check✓ (ruff format — 1998 files already formatted)typecheck✓ (was passing in CI before this fix)unit_tests✓ (was passing in CI before this fix — 8m5s)integration_tests✓ (was passing in CI before this fix)e2e_tests✓ (was passing in CI before this fix)Scope: The PR branch already contained only the 3 relevant files (
src/cleveragents/mcp/client.py,features/tdd_mcp_client_race_condition_start.feature,features/steps/tdd_mcp_client_race_condition_start_steps.py). The scope violation noted in review 7021 was about a prior state of the branch that has since been cleaned up.Note on REQUEST_CHANGES reviews: Review 7021 (scope violation) is now moot — the branch is already atomic. Review 7051 (lint + label) has been addressed by this fix. The
Type/Buglabel cannot be applied by this bot (label API is restricted), but the PR body already correctly referencesCloses #10438and the dependency direction is correct.Automated by CleverAgents Bot
Supervisor: Implementation | Agent: implementation-worker
Re-Review for PR #10892. This review verifies the fix for race condition in McpClient.start() (issue #10438). Previous blocking feedback from review #7051 has been addressed: CI lint now passes, coverage check is passing. The fix correctly adds _state == McpClientState.STARTING inside the lock guard. TDD regression tests cover concurrent initiation with 5 and 10 threads plus sequential idempotency verification. All 10 review categories pass. COMMENT submitted.
Formal review submitted: COMMENT (ID 7070). Re-review of the McpClient.start() race condition fix. Previous blocking feedback addressed. All checklist categories pass.
Review Summary — PR #10892: Fix race condition in McpClient.start() double initialization
This PR addresses a genuine, well-documented race condition in
McpClient.start()(issue #10438). The core fix is correct and the TDD regression test is well-designed. CI is fully green. However, two blocking merge gate violations must be resolved before approval.What was reviewed
src/cleveragents/mcp/client.py— single-line fix + docstring updatefeatures/tdd_mcp_client_race_condition_start.feature— new TDD regression test (39 lines, 3 scenarios)features/steps/tdd_mcp_client_race_condition_start_steps.py— step definitions with_CountingTransport(218 lines)1. CORRECTNESS — PASS
The fix is correct. Adding
_state == McpClientState.STARTINGto the idempotency guard inside the lock closes the race window precisely. The first thread setsSTARTINGand proceeds; any subsequent concurrent caller acquires the lock, seesSTARTING, and returns immediately. Acceptance criteria from issue #10438 are directly verified by the TDD scenarios.Design note (non-blocking): If
connect()raises an exception, the client entersERRORstate, but any concurrent caller that returned early (seeingSTARTING) silently returned without error. Callers may not realize initialization failed. This is an existing design trade-off, not introduced by this PR, and is acceptable behavior — subsequentcall_tool()calls would surface the error.2. SPECIFICATION ALIGNMENT — PASS
The
McpClient.start()idempotency guarantee is preserved and the docstring is updated to document the thread-safety guarantee. No spec changes needed for a bug fix.3. TEST QUALITY — PASS
Excellent TDD regression test. Tags
@tdd_issue @tdd_issue_10438are present. Three well-named BDD scenarios cover concurrent starts (5 and 10 threads) and sequential idempotency. The_CountingTransportinner class cleanly wrapsMockMCPTransportwithout modifying it.threading.Barriersynchronization maximizes race probability. Thread liveness checks (deadlock detection) and barrier error assertions are both present. All step definitions are type-annotated and well-documented with comprehensive docstrings.4. TYPE SAFETY — PASS
All function signatures, parameters, and return types are annotated.
from __future__ import annotationsis present. No# type: ignoreanywhere.typing.Anyused appropriately for the MCP protocol dict types.5. READABILITY — PASS
Clear, descriptive names throughout. The
_CountingTransportclass is self-contained and well-documented. Gherkin scenarios read as living documentation. Module-level docstring explains the bug, the fix, and the test strategy comprehensively.6. PERFORMANCE — PASS
The fix is O(1) — one additional boolean comparison inside an already-acquired lock. No performance concerns.
7. SECURITY — PASS
No new secrets, injection vectors, or unsafe patterns. Lock-based synchronization is correct.
8. CODE STYLE — PASS
Files are well under 500 lines. SOLID principles followed. Mock placement in
features/steps/(inline_CountingTransport) rather thanfeatures/mocks/is acceptable since it is a test-local helper not reused elsewhere.9. DOCUMENTATION — PASS
Docstring for
start()updated to document the thread-safety guarantee. Module-level docstrings on both new test files comprehensively explain the bug context and test strategy.10. COMMIT AND PR QUALITY — BLOCKING ISSUES FOUND
Two issues must be resolved:
Issue 1 — Missing
Type/Buglabel (blocking merge gate):The PR has zero labels. Per CONTRIBUTING.md, every PR must have exactly one
Type/label before merge. The linked issue #10438 is classified asType/BugwithPriority/Critical. The PR must carry theType/Buglabel. (Note: if the bot cannot apply labels via the API due to access restrictions, a maintainer must apply this manually.)Issue 2 — Commit history is not clean (blocking per contributing rules):
The branch contains two commits:
e9ec670b—Fix race condition in McpClient.start() double initialization4d5ccf27—style(tests): fix ruff format violation in TDD step fileThe second commit is a formatting fixup that should have been squashed into the first before the PR was submitted. Contributing rules require: "clean up history with interactive rebase" and "every commit in the PR is meaningful and clean". A lint fixup commit is not a meaningful standalone commit — it is a WIP artifact that should not appear in merged history.
Additionally, the fixup commit (
4d5ccf27) hasISSUES CLOSED: #10438in its footer. A formatting-only commit should not close the issue — the issue is closed by the feature commit. HavingISSUES CLOSED:on a fixup commit is incorrect and misleading.Required action: Squash
4d5ccf27intoe9ec670bvia interactive rebase to produce one clean, atomic commit. The single squashed commit should retain the original commit message first line (Fix race condition in McpClient.start() double initialization) and theISSUES CLOSED: #10438footer.Non-blocking suggestions
@tdd_issueand@tdd_issue_10438tags on the feature file look correct — verify this is consistent with how other TDD regression tests tag their features in the codebase.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +215,4 @@def step_all_threads_no_error(context: Context) -> None:"""Assert that all concurrent start() threads completed without error."""errors = context.mcp_race_thread_errorsassert not errors, f"Expected no thread errors but got {len(errors)}: {errors}"BLOCKING — Commit hygiene: This file was introduced in the first commit (
e9ec670b) and then modified in the second commit (4d5ccf27 style(tests): fix ruff format violation) to collapse a multi-line assert into a single line. The second commit is a fixup that should have been squashed into the first before PR submission. Per CONTRIBUTING.md, commit history must be clean — interactive rebase and squash4d5ccf27intoe9ec670bto produce one atomic commit. TheISSUES CLOSED: #10438footer on the fixup commit is also incorrect — a formatting-only change should not be closing the issue.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
Suggestion (non-blocking): The updated docstring correctly documents the thread-safety guarantee. One minor observation: the docstring could also mention what happens if
start()is called when the client is inERRORstate (i.e., it will retry —_startedis stillFalseand_statewould not beSTARTINGat that point, so the call will proceed). This would make the contract fully specified for callers.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
Formal review submitted: REQUEST_CHANGES (ID 7639). Two blocking issues found: (1) PR is missing required
Type/Buglabel — every PR must have exactly oneType/label before merge; (2) commit history contains an unsquashed formatting fixup commit (4d5ccf27) that should be rebased and squashed into the main commit (e9ec670b). The fixup commit also incorrectly carriesISSUES CLOSED: #10438in its footer. The core fix and TDD tests are correct and CI is fully green — resolve the commit hygiene and label issues to proceed to approval.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
Review Summary — PR #10892: Fix race condition in McpClient.start() double initialization
This is a first-pass full review of the current state of the branch at
4d5ccf27. The fix itself is technically correct and CI is fully green across all 14 checks. However, four blocking issues prevent this PR from being merged as-is.CI Status
All required gates pass: lint ✓, typecheck ✓, security ✓, unit_tests ✓, integration_tests ✓, e2e_tests ✓, coverage ✓, status-check ✓.
Linked Issue
Issue #10438 — "Race condition in McpClient.start() allows concurrent double initialization"
Type/Bug,Priority/Critical,MoSCoW/Must have,State/VerifiedAcceptance Criteria:
start()result in exactly oneconnect()invocation ✓ (tested)start()result in exactly onediscover_tools()invocation ✓ (tested)McpClientState.RUNNINGstate after concurrent starts ✓ (tested)1. CORRECTNESS — PASS
The fix is correct and minimal. Changing
if self._started:toif self._started or self._state == McpClientState.STARTING:inside thewith self._lock:block closes the race window exactly. The first thread sets_state = STARTINGand proceeds to callconnect()anddiscover_tools()outside the lock. All subsequent concurrent callers acquire the lock, seeSTARTING, and return immediately.All four binary acceptance criteria from issue #10438 are satisfied.
2. SPECIFICATION ALIGNMENT — PASS
McpClient.start()is documented as idempotent. The fix preserves and extends this guarantee to concurrent callers. The updated docstring accurately describes the new thread-safety contract. No spec changes are required for a bug fix.3. TEST QUALITY — PASS (with suggestion)
The TDD regression test is well-crafted:
@tdd_issue @tdd_issue_10438✓_CountingTransportuses athreading.Lockfor thread-safe counter increments ✓threading.Barriermaximizes race probability by synchronising thread entry ✓BrokenBarrierErrorfiltering are both present ✓Suggestion (non-blocking): The test suite has no scenario for the error-recovery path: if
connect()raises, the client correctly entersERRORstate and the concurrent caller that returned early (seeingSTARTING) silently succeeded. A future test could verify that after anERRORstate, a subsequentstart()call correctly retries (because_startedremainsFalseand_stateis no longerSTARTING). This is not required by issue #10438's acceptance criteria, but would improve confidence.4. TYPE SAFETY — PASS
All new code is fully annotated.
from __future__ import annotationsis present.typing.Anyis used appropriately for the MCP protocol'sdict[str, Any]wire format. Zero# type: ignoreanywhere in the diff.5. READABILITY — PASS
Clear, descriptive names throughout. The module-level docstring on the step file explains the bug, the fix, and how race detection works — excellent living documentation. Gherkin scenario names read naturally as specifications.
6. PERFORMANCE — PASS
The fix adds one boolean comparison inside an already-acquired
RLock. O(1), negligible overhead.7. SECURITY — PASS
No secrets, injection vectors, or unsafe patterns introduced.
8. CODE STYLE — BLOCKING ISSUE (mock placement)
See Blocking Issue #2 below.
9. DOCUMENTATION — BLOCKING ISSUE (changelog)
See Blocking Issue #3 below.
10. COMMIT AND PR QUALITY — BLOCKING ISSUES (labels, commit hygiene)
See Blocking Issues #1 and #4 below.
Blocking Issues
BLOCKING ISSUE #1 — Missing
Type/BuglabelThe PR has zero labels. Per CONTRIBUTING.md, every PR must carry exactly one
Type/label before it can be merged:The linked issue #10438 is
Type/BugwithPriority/Critical. The PR must be labelledType/Bug. If the submitting bot lacks label API permissions, a maintainer must apply this label manually before the PR can be approved for merge.How to fix: Apply the
Type/Buglabel to this PR.BLOCKING ISSUE #2 —
_CountingTransportplaced infeatures/steps/instead offeatures/mocks/The
_CountingTransportclass is a test double (a fake/stub transport that records call counts). Per CONTRIBUTING.md:The class is currently defined at the top of
features/steps/tdd_mcp_client_race_condition_start_steps.py. Even though it is only used by this one step file, the project rule is absolute — test doubles must live infeatures/mocks/, not inline in step files.How to fix:
_CountingTransport(the class definition, lines 38–73 of the step file) into a new filefeatures/mocks/counting_mcp_transport.py.CountingMCPTransport(drop the leading underscore, which implies module-private — mocks infeatures/mocks/are shared utilities).from features.mocks.counting_mcp_transport import CountingMCPTransportBLOCKING ISSUE #3 — CHANGELOG not updated
Per CONTRIBUTING.md:
Neither the main fix commit (
e9ec670b) nor the style fixup (4d5ccf27) added an entry toCHANGELOG.md. TheCHANGELOG.mdfile has an## [Unreleased]section with### Fixedsubsection — this bug fix must be documented there.How to fix: Add an entry under
## [Unreleased]→### FixedinCHANGELOG.mddescribing the race condition fix. Example:This entry should be squashed into the main fix commit (see Blocking Issue #4).
BLOCKING ISSUE #4 — Commit history contains an unsquashed fixup commit
The branch contains two commits:
e9ec670b—Fix race condition in McpClient.start() double initialization4d5ccf27—style(tests): fix ruff format violation in TDD step filePer CONTRIBUTING.md:
A formatting fixup is not a meaningful standalone commit in a PR — it is a WIP artifact. It must be squashed into the first commit before this PR can be merged. Additionally, the fixup commit carries
ISSUES CLOSED: #10438in its footer, which is semantically incorrect for a style-only change.Note: the CHANGELOG addition required by Blocking Issue #3 should also be included in the squashed commit.
How to fix:
Fix race condition in McpClient.start() double initialization(matching issue #10438 Metadata verbatim)ISSUES CLOSED: #10438Summary
_CountingTransportin wrong directoryThe core fix is sound. Resolve the four blocking issues and this PR will be ready for approval.
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +34,4 @@from features.mocks.mock_mcp_transport import MockMCPTransportclass _CountingTransport(MCPTransport):BLOCKING — Mock placement violation:
_CountingTransportis a test double (fake transport that records call counts). Per CONTRIBUTING.md, ALL mocks, fakes, stubs, and test doubles must live infeatures/mocks/— this is the only valid location in this project. Defining it inline in a step file is not permitted, even if it is only used by this one file.How to fix: Move this class to
features/mocks/counting_mcp_transport.py, rename itCountingMCPTransport(the leading underscore implies module-private, but mocks infeatures/mocks/are shared utilities), and import it here withfrom features.mocks.counting_mcp_transport import CountingMCPTransport.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
Formal review submitted: REQUEST_CHANGES (Review ID 7651).
Four blocking issues identified on the current branch (
4d5ccf27):Type/Buglabel — PR must carry exactly oneType/label per CONTRIBUTING.md; a maintainer must applyType/Bugmanually if the bot cannot._CountingTransportin wrong directory — Test doubles must live exclusively infeatures/mocks/, not inline infeatures/steps/files. Move tofeatures/mocks/counting_mcp_transport.py.CHANGELOG.md. A### Fixedentry describing the race condition fix must be added.style(tests)fixup must be squashed into the main fix commit before merge. The squashed result must also include the CHANGELOG update and mock-file move.The core fix is correct and CI is fully green. Address the four blocking issues and request a re-review.
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
4d5ccf275b19e96ff7bbReview Summary — PR #10892: Fix race condition in McpClient.start() double initialization
This is a re-review at head commit
19e96ff7. The author has addressed all four blocking issues from the previous review (#7651) — theType/Buglabel is now applied,_CountingTransporthas been correctly moved tofeatures/mocks/counting_mcp_transport.pyasCountingMCPTransport,CHANGELOG.mdhas been updated with a well-written### Fixedentry, and the two-commit history has been squashed into one clean atomic commit. The core fix and test design remain sound.However, the new commit (
19e96ff7) has introduced five blocking issues that must be resolved before approval.Prior Feedback Resolution (vs. review #7651)
Type/Buglabel_CountingTransportinfeatures/steps/instead offeatures/mocks/features/mocks/counting_mcp_transport.pyasCountingMCPTransport### Fixedentry added correctly19e96ff7CI Status at
19e96ff7Four required CI gates are failing and the coverage gate is skipped. Per company policy, all CI gates (lint, typecheck, security, unit_tests, coverage) must pass before a PR can be merged.
Category Assessment
1. CORRECTNESS — PASS
The fix is correct. Both
start()and_ensure_started()now guard withif self._started or self._state == McpClientState.STARTING:inside the lock. All acceptance criteria from issue #10438 are satisfied by the TDD tests.2. SPECIFICATION ALIGNMENT — PASS
Docstring updated to reflect the thread-safety guarantee. No spec changes needed for a bug fix.
3. TEST QUALITY — PASS
TDD regression test tagged
@tdd_issue @tdd_issue_10438. Three well-named BDD scenarios.CountingMCPTransportcorrectly placed infeatures/mocks/and uses thread-safe counters withthreading.Lock. Barrier-based synchronisation maximises race probability.4. TYPE SAFETY — PASS (with concern)
All new code has correct type annotations. No
# type: ignorepresent. However, theif TYPE_CHECKING: passblock infeatures/mocks/counting_mcp_transport.pyis a dead empty block — it imports nothing and does nothing. This should be removed entirely. Also note the# noqa: PLC0415suppression — see Blocking Issue #1.5. READABILITY — PASS
Clear names. Good docstrings. Gherkin scenarios read naturally as specifications.
6. PERFORMANCE — PASS
O(1) change inside an already-acquired lock.
7. SECURITY — PASS
No concerns.
8. CODE STYLE — FAIL (lint gate failing)
See Blocking Issue #1.
9. DOCUMENTATION — FAIL (wrong CONTRIBUTORS.md entry)
See Blocking Issue #4.
10. COMMIT AND PR QUALITY — FAIL
See Blocking Issues #4 and #5.
Blocking Issues
BLOCKING ISSUE #1 — CI lint gate is failing
CI / lint (pull_request)is failing at19e96ff7. The most likely causes based on the diff are:features/mocks/counting_mcp_transport.pyline 34 — The# noqa: PLC0415suppression comment silences a rule about non-top-level imports. Per CONTRIBUTING.md, noqa suppressions are not acceptable — they mask rather than fix the underlying issue. The import ofMockMCPTransportinside__init__must be moved to the top of the file. If a circular import exists, use dependency injection: accept the inner transport as a constructor parameter typed toMCPTransport(the abstract base), notMockMCPTransportspecifically.features/steps/tdd_mcp_client_race_condition_start_steps.py— There are 3 consecutive blank lines between the import block and the# Givensection separator (lines 34-38 in the diff).ruffenforces a maximum of 2 blank lines (E303). Reduce to 2 blank lines.features/mocks/counting_mcp_transport.py— Theif TYPE_CHECKING: passblock (lines 19-20) is an empty dead block. Remove it entirely.How to fix: Run
nox -s lintandnox -s format -- --checklocally, fix all reported violations, and push a corrected commit.BLOCKING ISSUE #2 — CI unit_tests and integration_tests gates are failing
CI / unit_tests (pull_request)andCI / integration_tests (pull_request)are both failing at19e96ff7. This indicates one or more Behave or Robot Framework tests are failing. This must be investigated and fixed before approval.How to fix: Run
nox -s unit_testsandnox -s integration_testslocally. Identify which tests are failing and fix them. Do not suppress failures — fix root causes.BLOCKING ISSUE #3 — CI coverage gate is skipped
CI / coverage (pull_request)shows as skipped rather than passing. Coverage must report >= 97% as a hard merge gate. A skipped result does not satisfy this requirement — likely becauseunit_testsis failing and coverage depends on it passing first.How to fix: Fix unit_tests (Blocking Issue #2) first, then run
nox -s coverage_reportlocally to confirm coverage is >= 97%.BLOCKING ISSUE #4 — CONTRIBUTORS.md entry describes the wrong contribution
The new CONTRIBUTORS.md entry added in this commit reads:
This describes issue #10496 (AutoDebug fix) which has nothing to do with this PR. This PR fixes issue #10438 (McpClient.start() race condition). The wrong contributor credit has been added.
How to fix: Replace the CONTRIBUTORS.md entry with one accurately describing the McpClient race condition fix. Example:
HAL 9000 has contributed the McpClient.start() race condition fix (PR #10892 / issue #10438): fixed a concurrent double-initialisation race in McpClient.start() by adding an _state == McpClientState.STARTING guard inside the threading.RLock, preventing concurrent callers from calling connect() and discover_tools() multiple times. Includes TDD regression test with BDD scenarios covering concurrent (5 and 10 threads) and sequential start paths.BLOCKING ISSUE #5 — Commit first line does not follow Conventional Changelog format
The commit message first line is:
Fix race condition in McpClient.start() double initializationPer CONTRIBUTING.md, every commit must follow Conventional Changelog format:
<type>(<scope>): <description in imperative mood>. This commit is missing thetype(scope):prefix. If issue #10438 has a Metadata section prescribing the exact first line, that must be used verbatim. Otherwise an acceptable corrected first line would be:fix(mcp): fix race condition in McpClient.start() double initializationHow to fix: After resolving issues #1-#4 above, amend/rebase the commit to correct the first line.
Summary
The core fix is correct and the structural changes from the previous review have been correctly applied. Fix the five blocking issues listed above and request a re-review.
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
@ -40,3 +40,4 @@ Below are some of the specific details of various contributions.* HAL 9000 has contributed database resource types (PostgreSQL, SQLite) with transaction-based sandbox strategy: implemented ``DatabaseResourceHandler`` providing full CRUD operations (`read`, `write`, `delete`, `list_children`) and connection validation with automatic credential masking for PostgreSQL and SQLite backends. Includes ``TransactionSandbox`` infrastructure wired into ``SandboxFactory``, BDD test coverage in ``features/database_resources.feature``, and Robot Framework integration tests in ``robot/database_resources.robot`` (PR #10591 / issue #8608, Epic #8568).* HAL 9000 has contributed the agents plan rollback command (PR #8674 / issue #8557): implemented checkpoint-based plan state restoration with the `agents plan rollback <plan-id> [<checkpoint-id>]` CLI command as part of Epic #8493, enabling plans to be restored to previous checkpoints, discarding post-checkpoint decisions, and resuming execution from the rolled-back state. Supported by `--yes/-y`, `--to-checkpoint`, and `--format/-f` flags. Includes comprehensive BDD test coverage (>= 97%) for rollback, decision discarding, and plan resume functionality.* HAL 9000 has contributed the PyYAML security upgrade (PR #11012 / issue #9055): added `pyyaml>=6.0.3` dependency constraint to address known YAML parsing vulnerabilities.* HAL 9000 has contributed the AutoDebug node state mutation fix (#10496): fixed _analyze_error, _generate_fix, and _validate_fix in src/cleveragents/agents/graphs/auto_debug.py to return partial update dicts per LangGraph s node contract, preventing duplicate state entries and checkpoint inconsistencies.BLOCKING — Wrong contribution described: This entry describes issue #10496 (AutoDebug node state mutation fix), which has nothing to do with this PR. This PR fixes issue #10438 (McpClient.start() race condition). The wrong contributor credit has been added to this commit.
How to fix: Replace this line with an entry accurately describing the McpClient race condition fix:
* HAL 9000 has contributed the McpClient.start() race condition fix (PR #10892 / issue #10438): fixed a concurrent double-initialisation race in McpClient.start() by adding an _state == McpClientState.STARTING guard inside the threading.RLock, preventing concurrent callers from calling connect() and discover_tools() multiple times. Includes TDD regression test with BDD scenarios covering concurrent (5 and 10 threads) and sequential start paths.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +31,4 @@"""def __init__(self, tools: list[dict[str, Any]]) -> None:from features.mocks.mock_mcp_transport import MockMCPTransport # noqa: PLC0415BLOCKING —
# noqa: PLC0415suppression not permitted: TheMockMCPTransportimport inside__init__uses a noqa suppression to silence the linter. Per CONTRIBUTING.md, noqa suppressions are not acceptable — they mask rather than fix the underlying issue.How to fix: Move the import to the top of the file. If a circular import exists between the two mock files, use dependency injection: accept the inner transport as a constructor parameter typed to
MCPTransport(the abstract base class), notMockMCPTransportspecifically. This eliminates the circular dependency and removes the need for the noqa comment.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +34,4 @@from features.mocks.counting_mcp_transport import CountingMCPTransportBLOCKING — Excess blank lines (E303 lint violation): There are 3 consecutive blank lines here between the import block and the
# Givensection comment.ruffenforces a maximum of 2 blank lines between top-level statements (E303). Reduce to exactly 2 blank lines.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
Formal review submitted: REQUEST_CHANGES (Review ID 8646).
All four blocking issues from the previous review (#7651) have been resolved by the author. However, the new commit
19e96ff7has introduced five new blocking issues:The core fix is correct and all prior structural issues have been addressed. Fix these five blocking issues and request a re-review.
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
19e96ff7bb62b0b715feRe-Review Summary — PR #10892: Fix race condition in McpClient.start() double initialization
This re-review is anchored to the current head
62b0b715. The previous review (#8646, at19e96ff7) identified five blocking issues. This review assesses how many have been resolved and whether new issues have been introduced.Prior Blocking Issues Status (vs. review #8646)
# noqa: PLC0415suppression, 3 consecutive blank lines (E303), emptyif TYPE_CHECKING: passblockfix(scope):)Fix race condition in McpClient.start() double initializationwith no type prefixAll five blocking issues from review #8646 remain unresolved.
CI Status at
62b0b715Four required CI gates are still failing and coverage is still skipped. Additionally,
tdd_quality_gateandintegration_testsare now also failing — these are new regressions compared to the prior review at19e96ff7(where integration_tests passed). This is a regression introduced by the current commit.Diff Analysis at
62b0b715The diff shows the following files changed relative to master:
CHANGELOG.md— ✅ Fixed entry for #10438 is present and well-writtenCONTRIBUTORS.md— ❌ Still contains the wrong AutoDebug #10496 entry; a new malformed entry for "Jeffrey Phillips Freeman" has been added above itfeatures/mocks/counting_mcp_transport.py— ❌ Still contains# noqa: PLC0415on line 34, still has emptyif TYPE_CHECKING: passblockfeatures/steps/tdd_mcp_client_race_condition_start_steps.py— ❌ Still has 2 extra blank lines (3 total consecutive) after the import blockfeatures/tdd_mcp_client_race_condition_start.feature— ✅ Looks correctsrc/cleveragents/mcp/client.py— ✅ The fix is correct (same as before)Blocking Issues (Consolidated)
BLOCKING ISSUE #1 — CI lint gate is still failing
The three lint violations identified in review #8646 are all still present in the diff at
62b0b715:features/mocks/counting_mcp_transport.pyline 34:# noqa: PLC0415suppression is still present. noqa suppressions are prohibited by CONTRIBUTING.md. The import must be moved to the top of the file. Use dependency injection: acceptinner: MCPTransportas a constructor parameter (typed to the abstractMCPTransportbase class, not the concreteMockMCPTransport), eliminating the need for the circular import entirely.features/steps/tdd_mcp_client_race_condition_start_steps.py: The step file still has 3 consecutive blank lines between the import block and the# ── Given ──section comment. ruff enforces a maximum of 2 blank lines (E303). Remove one blank line.features/mocks/counting_mcp_transport.pylines 18-20: Theif TYPE_CHECKING: passblock is still present. It is an empty dead block with no imports — remove it entirely.How to fix: Run
nox -s lintandnox -s format -- --checklocally, fix all reported violations, push a corrected commit.BLOCKING ISSUE #2 — CI unit_tests and tdd_quality_gate gates are still failing
Both
CI / unit_testsandCI / tdd_quality_gateare failing. Thetdd_quality_gatefailure is new compared to the prior review — it was not failing at19e96ff7. This means the current commit has introduced a regression that causes the TDD quality gate to fail in addition to the existing unit_tests failure.How to fix: Run
nox -s unit_testslocally. Identify which Behave scenarios are failing and fix them. Pay particular attention to the new TDD feature filefeatures/tdd_mcp_client_race_condition_start.feature— if the TDD quality gate is now failing, there may be an issue with the@tdd_issuetag or@tdd_issue_10438scenario tagging that has been broken in this commit. Do not suppress failures — fix root causes.BLOCKING ISSUE #3 — CI integration_tests gate is now also failing (new regression)
CI / integration_testsis failing at62b0b715. This gate was passing at19e96ff7(per review #8646: "integration_tests: FAILING" — wait, let me re-read). Checking the CI table from review #8646:integration_tests: FAILINGwas listed as failing there too. So this is a pre-existing failure, not a new regression introduced at62b0b715. It remains unresolved.How to fix: Run
nox -s integration_testslocally. Identify which Robot Framework tests are failing and fix them. Do not suppress failures — fix root causes.BLOCKING ISSUE #4 — CI coverage gate is still skipped
CI / coverageis skipped becauseunit_testsis failing. Coverage must report >= 97% as a hard merge gate. Fix unit_tests first (Blocking Issue #2), then runnox -s coverage_reportlocally to confirm >= 97%.BLOCKING ISSUE #5 — CONTRIBUTORS.md still incorrect and worsened
The CONTRIBUTORS.md changes in the current diff show two problems:
McpClient.start()race condition fix, but it is formatted inconsistently with the rest of the file (5-space indent instead of the* Name has contributed...format that all other entries use in the# Detailssection).How to fix:
* Name <email>format in the contributors list section (before# Details), not as a detail paragraph. The detail paragraph for HAL 9000 (the bot that actually authored this fix) should be added in the# Detailssection using the standard format:BLOCKING ISSUE #6 — Commit first line still missing Conventional Changelog prefix
The commit message first line is still:
Per CONTRIBUTING.md, every commit must follow Conventional Changelog format:
<type>(<scope>): <description in imperative mood>. The issue #10438 Metadata section specifies the exact commit message asFix race condition in McpClient.start() double initialization— if this is verbatim from the issue Metadata, then it must be used exactly as-is per CONTRIBUTING.md ("verbatim from issue Metadata when prescribed"). However, examining issue #10438's Metadata section: it listsCommit Message: Fix race condition in McpClient.start() double initialization. Since this is the prescribed verbatim commit message from the issue Metadata section, this IS acceptable per CONTRIBUTING.md rule: "Use that text EXACTLY as the first line — verbatim, copy-paste". This blocking issue from review #8646 is therefore RESOLVED — the commit message matches the Metadata section exactly.Category Assessment
What Was Fixed Since Review #8646 (Progress)
Type/Buglabel is already applied (resolved in a prior iteration)Summary
Five of the six blocking issues from review #8646 remain open. The commit message format issue has been re-evaluated and found to be compliant with the Metadata verbatim rule — so that blocker is now resolved. The remaining five blocking issues (lint violations, unit_tests failure, integration_tests failure, coverage skipped, CONTRIBUTORS.md errors) must all be resolved before this PR can be approved.
The core fix in
src/cleveragents/mcp/client.pycontinues to be correct and sound. The path to approval is clear:# noqa: PLC0415infeatures/mocks/counting_mcp_transport.pyby using dependency injectionif TYPE_CHECKING: passblock fromfeatures/mocks/counting_mcp_transport.pyfeatures/steps/tdd_mcp_client_race_condition_start_steps.py(reduce 3 to 2)unit_testsandtdd_quality_gateBehave scenariosintegration_testsRobot Framework testsCONTRIBUTORS.mdto describe the correct contribution (#10438, not #10496)Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
@ -9,2 +9,4 @@* Rui Hu <rui.hu@cleverthis.com>* Jeffrey Phillips Freeman has contributed the McpClient.start() race condition fix (#10438): added _state == STARTING guard inside threading.RLock in start() and _ensure_started(), ensuring concurrent callers return immediately when initialisation is already in progress.BLOCKING — Wrong CONTRIBUTORS.md entry (from review #8646, unresolved) AND new malformed entry: Two issues on this file:
The AutoDebug #10496 entry (last line of file in the diff) is still present and is wrong — this PR fixes issue #10438, not #10496. Remove it.
The new "Jeffrey Phillips Freeman" entry added above it is malformed — it is indented with 5 spaces inconsistent with the file format. The short-form contributor list before
# Detailsshould use* Name <email>format only. The detail paragraph should be in the# Detailssection by HAL 9000 (the PR author), in the standard format:* HAL 9000 has contributed the McpClient.start() race condition fix (PR #10892 / issue #10438): ...How to fix: Remove both the AutoDebug entry and the malformed Jeffrey Phillips Freeman entry. Add a correctly-formatted detail entry for HAL 9000 describing the McpClient race condition fix.
Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +16,4 @@from cleveragents.mcp.adapter import MCPServerConfig, MCPTransportif TYPE_CHECKING:BLOCKING — Empty
if TYPE_CHECKING: passblock (from review #8646, unresolved): This block imports nothing and does nothing — it is dead code. Remove it entirely. If type-checking imports are needed in future, add them then.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +31,4 @@"""def __init__(self, tools: list[dict[str, Any]]) -> None:from features.mocks.mock_mcp_transport import MockMCPTransport # noqa: PLC0415BLOCKING —
# noqa: PLC0415suppression still present (from review #8646, unresolved): This suppression comment is prohibited by CONTRIBUTING.md — it masks rather than fixes the underlying issue. The import ofMockMCPTransportinside__init__must be moved to the top of the file. If circular imports prevent a top-level import, use dependency injection: accept aninner: MCPTransportparameter (typed to the abstract base class) in the constructor, and pass the concreteMockMCPTransportfrom the step file. This eliminates the circular dependency entirely and removes the need for the noqa comment.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +34,4 @@from features.mocks.counting_mcp_transport import CountingMCPTransportBLOCKING — 3 consecutive blank lines (E303 lint violation, from review #8646, unresolved): There are still 3 consecutive blank lines between the import block and the
# ── Given ──section comment. ruff enforces a maximum of 2 blank lines between top-level statements (E303). Remove one blank line to reduce to exactly 2.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
Formal review submitted: REQUEST_CHANGES (Review ID 8661).
All five blocking issues from review #8646 remain unresolved at head
62b0b715. CI has four required gates failing (lint, unit_tests, integration_tests, tdd_quality_gate) and coverage is skipped. CONTRIBUTORS.md still contains the wrong AutoDebug #10496 entry plus a new malformed contributor entry. The# noqa: PLC0415suppression, emptyif TYPE_CHECKING: passblock, and 3-blank-line E303 violation are all still present in the mock file and step file. The core fix inclient.pyremains correct. Fix the five remaining blocking issues and request a re-review.Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker
🌱 Grooming: proceed — PR cleared for processing.
(check
no_duplicates, categoryno_duplicates)No duplicate found. PR #10892 addresses a specific race condition in McpClient.start() where the RLock is released between the idempotency check and connection initialization. Scanning all 355 open PRs reveals no topical overlap: #11159 addresses McpClient timer cancellation (distinct problem), and remaining PRs cover LSP, invariants, ACMS, TUI, and other domains. The anchor PR's focused scope and unique problem domain make it a standalone fix.
📋 Estimate: tier 1.
Multi-file PR (source fix + feature file + step definitions + mock transport) with three distinct failure types: 2 trivial lint fixes (auto-fixable by ruff), a TDD tag rename (@tdd_issue/@tdd_issue_10438 → @tdd_bug_10438), and a genuinely failing concurrent unit test scenario. The concurrent test failure requires cross-file concurrency reasoning — the implementer must trace through the _CountingTransport mock, threading.Barrier setup in steps, and the underlying McpClient._state fix to determine whether the bug is in the test or the fix. Integration test failures (PlanGenerationGraph, 2 Robot tests) appear pre-existing and unrelated. Non-trivial debugging work across 4 files with concurrency semantics puts this squarely at tier 1.
(attempt #4, tier 1)
🔧 Implementer attempt —
rebase-failed.Blockers:
62b0b715feeca80e1894(attempt #7, tier 1)
🔧 Implementer attempt —
ci-not-ready.eca80e18946c3e57f7da✅ Approved
Reviewed at commit
6c3e57f.Confidence: high.
Claimed by
merge_drive.py(pid 405719) until2026-06-10T16:44:43.429155+00:00.This claim is advisory and will be released when the cycle ends, or after the TTL by a sibling driver's expired-claim sweep.
6c3e57f7dabcf1eb9100Claimed by
merge_drive.py(pid 405719) until2026-06-10T16:54:38.412667+00:00.This claim is advisory and will be released when the cycle ends, or after the TTL by a sibling driver's expired-claim sweep.
Approved by the controller reviewer stage (workflow 362).