Proposal: improve ca-test-infra-improver — graceful handling of clone and tool failures #1809

Closed
opened 2026-04-02 23:53:43 +00:00 by freemo · 1 comment
Owner

Agent Improvement Proposal

Pattern Detected

Type: Prompt improvement — infrastructure failure misreporting
Affected Agent: ca-test-infra-improver (Worker Mode)
Evidence: During the v3.7.0 session and preceding sessions, the test infrastructure improver has filed 10+ issues about its own infrastructure failures (clone failures, tool failures, environment limitations) instead of handling them gracefully:

| Issue | Title | Type |
|---|---|---|
| #1694 | Unable to clone repository | Clone failure |
| #1691 | Unable to clone repository cleveragents/cleveragents-core | Clone failure |
| #1686 | CRITICAL: Unable to clone repository cleveragents/cleveragents-core | Clone failure |
| #1673 | Unable to clone repository | Clone failure |
| #1732 | Unable to clone repository due to TLS/SNI issue | Clone failure (closed) |
| #1713 | Unable to clone repository due to gnutls_handshake error | Clone failure (closed) |
| #1699 | Unable to clone repository due to TLS/SNI issue | Clone failure (closed) |
| #1695 | Unable to analyze CI execution time due to tool execution failures | Tool failure |
| #1740 | Unable to analyze CI execution time due to environment limitations | Tool failure (closed) |
| #1726 | Worker tools are failing | Tool failure (closed) |
| #1727 | Unable to access CI run history | Tool failure (closed) |
| #1596 | Repository is empty | Clone failure |

Root Cause: When the agent encounters infrastructure failures (wrong hostname for git clone, TLS/SNI issues, tool execution errors like "Maximum call stack size exceeded", or empty repository responses), it treats these as test infrastructure findings and files Forgejo issues about them. This is incorrect — these are failures in the agent's own execution environment, not problems with the project's test infrastructure.

The clone failures specifically stem from the agent constructing the git URL using git.cleveragents.com (derived from the organization name) instead of the actual Forgejo host git.cleverthis.com (from the FORGEJO_URL environment variable).
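
A minimal sketch of the hostname derivation described above, assuming `FORGEJO_URL` holds a full `http(s)` base URL such as `https://git.cleverthis.com`. The helper name `clone_url` and the `<base>/<owner>/<repo>.git` layout are illustrative assumptions, not part of the agent's actual prompt or tooling.

```python
import os
from urllib.parse import urlparse


def clone_url(owner: str, repo: str, env=os.environ) -> str:
    """Build an HTTPS clone URL from FORGEJO_URL instead of guessing the host.

    Hypothetical helper: the host is read from the environment and is never
    derived from the organization name.
    """
    base = env.get("FORGEJO_URL", "").strip()
    parsed = urlparse(base)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        # Fail loudly rather than fall back to a guessed hostname.
        raise ValueError("FORGEJO_URL is unset or not an http(s) URL")
    # Preserve any path prefix in case the instance is served under a sub-path.
    prefix = parsed.path.rstrip("/")
    return f"{parsed.scheme}://{parsed.netloc}{prefix}/{owner}/{repo}.git"
```

With `FORGEJO_URL=https://git.cleverthis.com`, calling `clone_url("cleveragents", "cleveragents-core")` yields a URL on `git.cleverthis.com`, not on a host inferred from the organization name.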

Impact:

  • 10+ false positive issues waste human review time
  • Some are labeled "CRITICAL", creating unnecessary alarm
  • The agent's entire analysis session is wasted when it can't clone
  • Issue tracker is polluted with non-actionable infrastructure noise

Proposed Change

Modify the Worker Mode section in ca-test-infra-improver.md to:

  1. Fix hostname derivation — Add explicit guidance in the Clone Isolation Protocol: "Derive the git host from the FORGEJO_URL environment variable or from the Forgejo PAT URL provided in the prompt. The Forgejo host is NOT necessarily git.<org-name>.com. Always check FORGEJO_URL first."

  2. Add clone failure handling — After the clone step, add:

    If git clone fails:
      1. Check if you're using the correct hostname (from FORGEJO_URL)
      2. Retry with the corrected hostname
      3. If still failing after 2 retries, EXIT with an error message
      4. NEVER file a Forgejo issue about clone failures — these are
         agent environment issues, not test infrastructure issues
    
  3. Add tool failure handling — Add a new section:

    If any tool (bash, read, etc.) fails with environment errors
    (ENOENT, stack overflow, permission denied, etc.):
      1. Log the error
      2. Skip the affected analysis step
      3. Continue with remaining analysis if possible
      4. NEVER file a Forgejo issue about tool failures — these are
         agent runtime issues, not test infrastructure issues
    
  4. Add scope restriction — Add to the "Important Rules" section: "You analyze the PROJECT's test infrastructure. Infrastructure failures in YOUR OWN execution environment (clone failures, tool crashes, API errors) are OUT OF SCOPE. Never file issues about your own environment."
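
The proposal changes prompt text rather than code, but the control flow in items 2 and 3 could be sketched roughly as follows. The function names, the injectable `run` parameter, and the choice of `OSError` and `RecursionError` as Python analogues of ENOENT, permission-denied, and stack-overflow errors are assumptions made for illustration only. The clone URL is assumed to have already been derived from `FORGEJO_URL`.

```python
import subprocess
import sys
from typing import Callable, Dict, List, Tuple

MAX_RETRIES = 2  # retries after the first attempt, as in the proposed prompt text


def clone_with_retries(url: str, dest: str, run: Callable = subprocess.run) -> bool:
    """Try `git clone` up to 1 + MAX_RETRIES times; report failure to stderr only.

    Returns True on success and False otherwise. It never files an issue:
    a failed clone is treated as an agent environment problem.
    """
    for attempt in range(1 + MAX_RETRIES):
        result = run(["git", "clone", url, dest], capture_output=True, text=True)
        if result.returncode == 0:
            return True
        print(f"clone attempt {attempt + 1} failed: {result.stderr.strip()}",
              file=sys.stderr)
    return False


def run_steps(steps: Dict[str, Callable[[], object]]) -> Tuple[Dict[str, object], List[str]]:
    """Run each analysis step; log and skip steps that hit environment errors."""
    results: Dict[str, object] = {}
    skipped: List[str] = []
    for name, step in steps.items():
        try:
            results[name] = step()
        except (OSError, RecursionError) as exc:
            # Environment failure: log it, skip this step, continue with the rest.
            print(f"skipping {name}: {exc!r}", file=sys.stderr)
            skipped.append(name)
    return results, skipped
```

A caller would exit with a non-zero status when `clone_with_retries` returns False, and would include the `skipped` list in its session log rather than in the issue tracker.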

Expected Impact

  • Eliminates this class of false positive issues (10+ filed across the v3.7.0 session and preceding sessions)
  • Prevents "CRITICAL" false alarms about clone failures
  • Ensures the agent either successfully analyzes or exits gracefully
  • Reduces noise in the issue tracker

Risk Assessment

  • Very low risk: These changes only add error handling and scope restrictions. No analysis logic is modified.
  • Potential concern: If a genuine test infrastructure issue involves cloning (e.g., CI clone step is misconfigured), the scope restriction might cause the agent to miss it. However, the restriction is specifically about the agent's OWN clone operation, not about clone steps in CI workflows. The agent can still analyze CI workflow files and identify clone-related CI issues.
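
One way to picture the distinction drawn in the potential concern above is to tag each finding with where it originated and to file issues only for project-side findings. The `Origin` enum, `Finding` dataclass, and `should_file_issue` function below are hypothetical names used for illustration; the agent's real behaviour is governed by its prompt.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    AGENT_RUNTIME = "agent_runtime"  # the agent's own clone, tools, or API calls
    PROJECT = "project"              # the project's tests, CI workflows, configs


@dataclass(frozen=True)
class Finding:
    title: str
    origin: Origin


def should_file_issue(finding: Finding) -> bool:
    """Only findings about the project itself are in scope for issue filing."""
    return finding.origin is Origin.PROJECT
```

Under this sketch, a misconfigured clone step found in a CI workflow file would carry `Origin.PROJECT` and still be reported, while the agent's own failed `git clone` would carry `Origin.AGENT_RUNTIME` and be logged only.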

This is a proposal from the agent evolver. A human must approve this issue before the change will be implemented. To approve: remove the needs feedback label, add State/Verified, or comment with approval.


Automated by CleverAgents Bot
Supervisor: Agent Evolver | Agent: ca-agent-evolver

Author
Owner

approved

Reference
cleveragents/cleveragents-core#1809