[CI BLOCKER] Add fallback to Anthropic Haiku when OpenAI E2E tests hit quota limits #10042

Closed
opened 2026-04-16 18:13:21 +00:00 by CoreRasurae · 2 comments
Member

Problem

E2E robot integration tests are failing due to OpenAI API quota exhaustion (HTTP 429 - insufficient_quota). The strategy actor attempts OpenAI calls, hits the quota limit, retries multiple times, exhausts retries, and then fails the entire test run.

Error signature:

Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
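
The signature above suggests how a quota exhaustion could be told apart from an ordinary rate limit: both arrive as HTTP 429, but only the former carries `insufficient_quota` in the error body. A minimal sketch, assuming the error payload is available as a dict shaped like the one above. The helper name `is_quota_exhausted` is hypothetical and is not part of this project or of the OpenAI SDK.

```python
from collections.abc import Mapping
from typing import Any, Optional


def is_quota_exhausted(status_code: int, body: Optional[Mapping[str, Any]]) -> bool:
    """Return True only for HTTP 429 responses whose error is insufficient_quota.

    Hypothetical helper: assumes `body` is shaped like the payload in the
    error signature above, i.e. {'error': {'type': ..., 'code': ...}}.
    """
    if status_code != 429 or not body:
        return False
    error = body.get("error")
    if not isinstance(error, Mapping):
        return False
    # Either field may carry the marker, so check both.
    return "insufficient_quota" in (error.get("type"), error.get("code"))


# Payload copied from the error signature, with the message abbreviated.
quota_body = {
    "error": {
        "message": "You exceeded your current quota, ...",
        "type": "insufficient_quota",
        "param": None,
        "code": "insufficient_quota",
    }
}
```

Retrying the same provider after this error is unlikely to help within a single test run, which is why the check is kept separate from ordinary rate-limit handling.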

Failing tests:

  • M6 E2E Full Autonomy Acceptance Flow :: End-to-end: init, resource...

Impact:

  • CI pipeline is blocked on OpenAI quota issues beyond our control
  • E2E tests cannot complete, preventing validation of critical autonomy workflows
  • Transient provider issues cause hard test failures instead of graceful degradation

Proposed Solution

  1. Configure fallback provider chain in E2E test configuration:

    • Primary: OpenAI (the listed model, claude-3-5-sonnet, is an Anthropic model; the OpenAI model name still needs to be confirmed)
    • Fallback: Anthropic Haiku (claude-3-5-haiku)
  2. Implement provider fallback logic in test setup:

    • Detect quota exhaustion (429 insufficient_quota errors)
    • Automatically retry with fallback provider
    • Log provider switches for visibility
  3. Update robot test fixtures to support multi-provider execution:

    • Set up test environment variables to include fallback configuration
    • Ensure strategy actor and LLM invocations respect the fallback chain
  4. Document the fallback behavior in test infrastructure docs
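
Steps 1 to 3 above could be sketched roughly as follows. This is an illustration under stated assumptions, not the existing fallback infrastructure from #324: the environment variable `E2E_LLM_PROVIDERS`, the exception `QuotaExhaustedError`, the logger name, and both function names are hypothetical, and each provider is modeled as a plain callable.

```python
import logging
import os
from typing import Callable, Dict, List, Mapping, Optional

logger = logging.getLogger("e2e.llm_fallback")  # hypothetical logger name


class QuotaExhaustedError(Exception):
    """Hypothetical error raised when a provider reports 429 insufficient_quota."""


def provider_chain_from_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Read the ordered provider chain from the (assumed) E2E_LLM_PROVIDERS variable."""
    env = os.environ if env is None else env
    raw = env.get("E2E_LLM_PROVIDERS", "openai,anthropic-haiku")
    return [name.strip() for name in raw.split(",") if name.strip()]


def complete_with_fallback(
    prompt: str,
    chain: List[str],
    providers: Dict[str, Callable[[str], str]],
) -> str:
    """Try each provider in order, moving on only when its quota is exhausted."""
    last_error = None
    for position, name in enumerate(chain):
        try:
            return providers[name](prompt)
        except QuotaExhaustedError as exc:
            last_error = exc
            if position + 1 < len(chain):
                # Log the switch so test output shows which provider was used.
                logger.warning(
                    "Provider %s hit its quota; falling back to %s",
                    name,
                    chain[position + 1],
                )
    raise RuntimeError("All providers in the fallback chain are exhausted") from last_error
```

Only the quota error triggers a switch here; any other exception propagates unchanged, so genuine test failures are not masked by the fallback.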


Acceptance Criteria

  • E2E robot tests detect OpenAI 429 quota errors gracefully
  • Failed OpenAI calls automatically retry with Anthropic Haiku
  • Test logs clearly indicate provider fallback occurred
  • All E2E acceptance tests pass with fallback provider
  • CI pipeline is no longer blocked by OpenAI quota issues
  • Solution integrates with existing LLM provider fallback infrastructure (#324)
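
The logging criterion above could be checked mechanically. The sketch below uses Python's standard `unittest.assertLogs` as a stand-in for whatever the robot suite actually provides; `switch_provider`, the logger name, and the message format are all hypothetical.

```python
import logging
import unittest

logger = logging.getLogger("e2e.llm_fallback")  # hypothetical logger name


def switch_provider(failed: str, fallback: str) -> str:
    """Stand-in for the real switch: logs the fallback and returns the new provider."""
    logger.warning("provider fallback: %s -> %s (insufficient_quota)", failed, fallback)
    return fallback


class FallbackLoggingTest(unittest.TestCase):
    def test_switch_is_logged(self):
        # assertLogs fails the test if nothing is logged at WARNING or above.
        with self.assertLogs("e2e.llm_fallback", level="WARNING") as captured:
            active = switch_provider("openai", "anthropic-haiku")
        self.assertEqual(active, "anthropic-haiku")
        self.assertIn("openai -> anthropic-haiku", captured.output[0])
```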

Related Issues

  • #9128 (guard integration/e2e jobs when LLM secrets unavailable): different scope; this is quota-driven fallback, not secret gating
  • #324 (provider fallback system): implement fallback selection with capability filtering
  • #9938 (defensive LLM exception handlers coverage): related exception handling

Priority: CI BLOCKER — prevents E2E test execution and validation


Automated detection: E2E test failures with OpenAI 429 insufficient_quota error

Owner

@CoreRasurae — Thank you for filing this CI blocker. This is a critical issue that is being escalated immediately.

Triage Assessment

Severity: CI Blocker — this prevents E2E test validation for all PRs currently in review
Type: Infrastructure / Test Configuration
Impact: Directly blocking PRs #10000 and #10002 from hamza.khyari, and potentially other PRs awaiting E2E validation

Root Cause

The E2E tests are configured to use OpenAI as the primary LLM provider. When the OpenAI API quota is exhausted (HTTP 429 insufficient_quota), the tests fail hard instead of falling back to an alternative provider.

Proposed Resolution

Your proposed solution is sound:

  1. Configure a fallback provider chain (OpenAI primary → Anthropic Haiku fallback)
  2. Implement quota exhaustion detection (429 insufficient_quota)
  3. Automatic retry with fallback provider
  4. Update robot test fixtures for multi-provider execution

Related Context

  • This explains the E2E test failures in PRs #10000 and #10002 (hamza.khyari's worktree sandbox fixes)
  • Issue #324 (provider fallback system) and #9128 (guard integration/e2e jobs) are related
  • The autonomous agent system has been notified of this blocker

Next Steps

The autonomous agent system will:

  1. Apply appropriate labels and milestone
  2. Queue this for implementation as Priority/CI Blocker
  3. Notify affected PR authors that the E2E failures may be infrastructure-related, not code-related

This issue will be tracked until the fallback mechanism is implemented and CI is unblocked.


Automated by CleverAgents Bot
Supervisor: Human Liaison | Agent: human-liaison-pool-supervisor
Worker: [AUTO-HUMAN-2]

Owner

Triage Decision

Verified by: Project Owner Supervisor [AUTO-OWNR-1]
Date: 2026-04-16

| Field | Decision |
|-------|----------|
| State | Verified |
| MoSCoW | MoSCoW/Must have |
| Priority | Priority/CI Blocker |
| Milestone | None |

Rationale: No milestone or future milestone; backlogged.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

HAL9000 added this to the v3.5.0 milestone 2026-04-17 07:16:11 +00:00
Reference
cleveragents/cleveragents-core#10042