UAT: McpClientConfig missing max_restarts field — health monitor restarts indefinitely on persistent server failure #3832

Open
opened 2026-04-06 06:50:03 +00:00 by freemo · 0 comments
Owner

Metadata

  • Branch: bugfix/mcp-client-config-max-restarts
  • Commit Message: fix(mcp): add max_restarts field to McpClientConfig to limit health-check restart loops
  • Milestone: (none — backlog)
  • Parent Epic: #397

Backlog note: This issue was discovered during autonomous UAT testing of the MCP Tool Integration feature area. It does not block current milestone completion and has been placed in the backlog for human review and future milestone assignment.

Background and Context

The specification (docs/specification.md) defines the MCP server health check configuration as:

health_check.max_restarts: integer | No | Maximum automatic restarts before marking server as failed. Default: 3, min: 0.

However, McpClientConfig in src/cleveragents/mcp/client.py does not have a max_restarts field:

class McpClientConfig(BaseModel):
    server: MCPServerConfig
    idle_timeout_seconds: float = Field(default=300.0, ...)
    health_check_interval_seconds: float = Field(default=30.0, ...)
    auto_restart: bool = Field(default=True, ...)
    lazy_start: bool = Field(default=True, ...)
    # max_restarts is MISSING

And McpClient._do_restart() has no restart limit:

def _do_restart(self) -> None:
    """Restart the MCP server connection."""
    # No check for max_restarts — restarts indefinitely
    try:
        self._adapter.reconnect()
        self._adapter.discover_tools()
    except Exception:
        ...
        self._schedule_health_check()
        return
    with self._lock:
        self._restart_count += 1
        ...

Impact: When an MCP server has a persistent failure (e.g., the server binary is missing or the network endpoint is permanently unavailable), McpClient will attempt to restart it indefinitely, consuming resources and generating log noise without ever marking the server as permanently failed.

The spec requires that after max_restarts consecutive failures, the server should be marked as failed (state = ERROR) and no further restart attempts should be made.

Current Behavior

McpClient restarts the MCP server indefinitely when auto_restart=True and the server keeps failing health checks. There is no upper bound on restart attempts.

Verification:

from cleveragents.mcp.client import McpClientConfig
print(list(McpClientConfig.model_fields.keys()))
# Output: ['server', 'idle_timeout_seconds', 'health_check_interval_seconds', 'auto_restart', 'lazy_start']
# max_restarts is absent

Expected Behavior

McpClientConfig should have a max_restarts field (default: 3, min: 0). After max_restarts consecutive restart failures, McpClient should:

  1. Stop attempting further restarts
  2. Set its state to McpClientState.ERROR
  3. Log a clear error message indicating the server has been marked as permanently failed

Acceptance Criteria

  • McpClientConfig has max_restarts: int = Field(default=3, ge=0, ...)
  • McpClient._do_restart() checks restart_count >= max_restarts and stops retrying when exceeded
  • When max_restarts is exceeded, McpClient.state becomes McpClientState.ERROR
  • When max_restarts=0, restart is disabled (same as auto_restart=False)
  • Behave scenario covers the max_restarts limit behavior

Subtasks

  • Add max_restarts: int = Field(default=3, ge=0, description="Max automatic restarts before marking server as failed") to McpClientConfig
  • Update McpClient._do_restart() to check self._restart_count >= self._config.max_restarts and set state to ERROR when exceeded
  • Add Behave scenario: "Health monitor stops restarting after max_restarts exceeded"
  • Run nox (all default sessions), fix any errors
  • Verify coverage >= 97% via nox -s coverage_report

Definition of Done

This issue is complete when:

  • McpClientConfig.max_restarts field exists with default=3, min=0
  • McpClient stops restarting after max_restarts consecutive failures
  • McpClient.state becomes ERROR after max_restarts exceeded
  • New Behave scenario covers the max_restarts limit
  • All existing MCP tests still pass
  • All nox stages pass
  • Coverage >= 97%
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

## Metadata - **Branch**: `bugfix/mcp-client-config-max-restarts` - **Commit Message**: `fix(mcp): add max_restarts field to McpClientConfig to limit health-check restart loops` - **Milestone**: *(none — backlog)* - **Parent Epic**: #397 > **Backlog note:** This issue was discovered during autonomous UAT testing of the MCP Tool Integration feature area. It does not block current milestone completion and has been placed in the backlog for human review and future milestone assignment. ## Background and Context The specification (`docs/specification.md`) defines the MCP server health check configuration as: ``` health_check.max_restarts: integer | No | Maximum automatic restarts before marking server as failed. Default: 3, min: 0. ``` However, `McpClientConfig` in `src/cleveragents/mcp/client.py` does not have a `max_restarts` field: ```python class McpClientConfig(BaseModel): server: MCPServerConfig idle_timeout_seconds: float = Field(default=300.0, ...) health_check_interval_seconds: float = Field(default=30.0, ...) auto_restart: bool = Field(default=True, ...) lazy_start: bool = Field(default=True, ...) # max_restarts is MISSING ``` And `McpClient._do_restart()` has no restart limit: ```python def _do_restart(self) -> None: """Restart the MCP server connection.""" # No check for max_restarts — restarts indefinitely try: self._adapter.reconnect() self._adapter.discover_tools() except Exception: ... self._schedule_health_check() return with self._lock: self._restart_count += 1 ... ``` **Impact:** When an MCP server has a persistent failure (e.g., the server binary is missing or the network endpoint is permanently unavailable), `McpClient` will attempt to restart it indefinitely, consuming resources and generating log noise without ever marking the server as permanently failed. The spec requires that after `max_restarts` consecutive failures, the server should be marked as failed (state = `ERROR`) and no further restart attempts should be made. ## Current Behavior `McpClient` restarts the MCP server indefinitely when `auto_restart=True` and the server keeps failing health checks. There is no upper bound on restart attempts. **Verification:** ```python from cleveragents.mcp.client import McpClientConfig print(list(McpClientConfig.model_fields.keys())) # Output: ['server', 'idle_timeout_seconds', 'health_check_interval_seconds', 'auto_restart', 'lazy_start'] # max_restarts is absent ``` ## Expected Behavior `McpClientConfig` should have a `max_restarts` field (default: 3, min: 0). After `max_restarts` consecutive restart failures, `McpClient` should: 1. Stop attempting further restarts 2. Set its state to `McpClientState.ERROR` 3. Log a clear error message indicating the server has been marked as permanently failed ## Acceptance Criteria - `McpClientConfig` has `max_restarts: int = Field(default=3, ge=0, ...)` - `McpClient._do_restart()` checks `restart_count >= max_restarts` and stops retrying when exceeded - When `max_restarts` is exceeded, `McpClient.state` becomes `McpClientState.ERROR` - When `max_restarts=0`, restart is disabled (same as `auto_restart=False`) - Behave scenario covers the max_restarts limit behavior ## Subtasks - [ ] Add `max_restarts: int = Field(default=3, ge=0, description="Max automatic restarts before marking server as failed")` to `McpClientConfig` - [ ] Update `McpClient._do_restart()` to check `self._restart_count >= self._config.max_restarts` and set state to `ERROR` when exceeded - [ ] Add Behave scenario: "Health monitor stops restarting after max_restarts exceeded" - [ ] Run `nox` (all default sessions), fix any errors - [ ] Verify coverage >= 97% via `nox -s coverage_report` ## Definition of Done This issue is complete when: - [ ] `McpClientConfig.max_restarts` field exists with default=3, min=0 - [ ] `McpClient` stops restarting after `max_restarts` consecutive failures - [ ] `McpClient.state` becomes `ERROR` after max_restarts exceeded - [ ] New Behave scenario covers the max_restarts limit - [ ] All existing MCP tests still pass - All nox stages pass - Coverage >= 97% - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#397 Epic: Server & Autonomy Infrastructure
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#3832
No description provided.