fix(e2e): replace naive OpenAI key-presence check with live API probe in E2E suite setups #10198

Closed
opened 2026-04-17 05:36:52 +00:00 by hurui200320 · 5 comments
Member

Metadata

  • Commit Message: fix(e2e): replace naive OpenAI key-presence check with live API probe in E2E suite setups
  • Branch: fix/ci

Background and Context

E2E test suites (M6 acceptance, WF04, WF05, WF07, WF16) determine which LLM actor to
use by checking whether OPENAI_API_KEY is present in the environment:

has_openai = bool(os.environ.get('OPENAI_API_KEY', ''))
actor = 'openai/gpt-4o' if has_openai else 'anthropic/claude-sonnet-4-20250514'

This check is insufficient. A key can be present and syntactically valid but still fail
at runtime with HTTP 429 (quota exceeded) or HTTP 429 (rate_limit_exceeded).
When this happens, the test selects the OpenAI actor, the plan execution fails mid-run
with a quota error, and the test fails — even though Anthropic credits are available and
could have been used instead.

Two commits by Luis (reverted on fix/ci) attempted to handle quota errors inside
strategy_actor.py with runtime fallback logic. That approach was discussed and rejected
because it adds production complexity (caching fallback LLM instances, recovery timers,
quota detection state) to solve what is fundamentally a test-harness selection problem.
The agreed fix is to validate the key before committing to it, entirely within the
E2E test infrastructure.

Current Behavior

  1. CI has OPENAI_API_KEY set to a valid-but-quota-exhausted key.
  2. Every E2E suite setup evaluates bool(os.environ.get('OPENAI_API_KEY', ''))True.
  3. All suite setups select openai/gpt-4o as the actor.
  4. Plan execution calls the OpenAI API → receives HTTP 429 quota error.
  5. Test fails with an LLM quota error instead of gracefully falling back to Anthropic.

Expected Behavior

  1. Before selecting the OpenAI actor, the suite setup probes the key by sending a
    minimal chat completion request ("Hi", max_tokens=1, gpt-4o-mini) to the
    OpenAI API.
  2. If the probe returns HTTP 200 → key is functional → select openai/gpt-4o.
  3. If the probe returns HTTP 429, HTTP 401, or any other error → key is not
    usable → fall back to anthropic/claude-sonnet-4-20250514 and log a warning.
  4. The actor fallback decision is made once per suite setup, before any test runs.

Acceptance Criteria

  • A new script robot/e2e/check_openai_key.py exists that probes the OpenAI API
    with a minimal request using only Python stdlib (urllib.request) — no extra
    dependencies.
  • The script exits 0 when the API returns HTTP 200 and exits 1 for any quota,
    auth, network, or unexpected error.
  • A new keyword Resolve LLM Actor is added to robot/e2e/common_e2e.resource
    that invokes the check script and returns the appropriate actor name.
  • Resolve LLM Actor accepts ${openai_model} and ${anthropic_model} arguments
    with sensible defaults so callers can override the model names.
  • m6_acceptance.robot M6 Suite Setup uses Resolve LLM Actor instead of the
    naive has_openai check.
  • wf04_multi_project.robot, wf05_db_migration.robot, wf07_cicd.robot, and
    wf16_devcontainer.robot use Resolve LLM Actor instead of their inline
    has_openai checks.
  • When the probe fails, a WARN-level Robot Framework log message records the
    failure reason and which fallback was selected.
  • No changes are made to strategy_actor.py or any production source code.

Supporting Information

  • Reverted commits: 51472c0b (debug logging) and f5712787 (runtime fallback in
    strategy_actor.py) — both on branch fix/ci.
  • Files with naive has_openai check today: m6_acceptance.robot,
    wf04_multi_project.robot, wf05_db_migration.robot, wf07_cicd.robot,
    wf16_devcontainer.robot.
  • Files intentionally excluded: wf17_explicit_container.robot and
    wf18_container_clone.robot already prefer Anthropic as primary; m5_acceptance.robot
    specifically tests OpenAI features and must use OpenAI.
  • The probe uses gpt-4o-mini with max_tokens=1 to minimise cost (fractions of a cent
    per probe call).
  • ${PYTHON} is a suite variable set by E2E Suite Setup (the venv Python executable)
    and is available when Resolve LLM Actor is called from any suite setup keyword.

Subtasks

  • Create robot/e2e/check_openai_key.py (stdlib-only OpenAI probe script)
  • Add Resolve LLM Actor keyword to robot/e2e/common_e2e.resource
  • Update robot/e2e/m6_acceptance.robot to use Resolve LLM Actor
  • Update robot/e2e/wf04_multi_project.robot to use Resolve LLM Actor
  • Update robot/e2e/wf05_db_migration.robot to use Resolve LLM Actor
  • Update robot/e2e/wf07_cicd.robot to use Resolve LLM Actor
  • Update robot/e2e/wf16_devcontainer.robot to use Resolve LLM Actor

Definition of Done

  • All subtasks above are checked off.
  • The fix/ci branch CI pipeline completes without LLM quota errors causing test
    failures.
  • No production source code (src/) is modified.
  • Commit follows the Conventional Changelog format specified in Metadata above.
## Metadata - **Commit Message**: `fix(e2e): replace naive OpenAI key-presence check with live API probe in E2E suite setups` - **Branch**: `fix/ci` --- ## Background and Context E2E test suites (M6 acceptance, WF04, WF05, WF07, WF16) determine which LLM actor to use by checking whether `OPENAI_API_KEY` is present in the environment: ```python has_openai = bool(os.environ.get('OPENAI_API_KEY', '')) actor = 'openai/gpt-4o' if has_openai else 'anthropic/claude-sonnet-4-20250514' ``` This check is insufficient. A key can be present and syntactically valid but still fail at runtime with **HTTP 429 (quota exceeded)** or **HTTP 429 (rate_limit_exceeded)**. When this happens, the test selects the OpenAI actor, the plan execution fails mid-run with a quota error, and the test fails — even though Anthropic credits are available and could have been used instead. Two commits by Luis (reverted on `fix/ci`) attempted to handle quota errors inside `strategy_actor.py` with runtime fallback logic. That approach was discussed and rejected because it adds production complexity (caching fallback LLM instances, recovery timers, quota detection state) to solve what is fundamentally a test-harness selection problem. The agreed fix is to validate the key *before* committing to it, entirely within the E2E test infrastructure. ## Current Behavior 1. CI has `OPENAI_API_KEY` set to a valid-but-quota-exhausted key. 2. Every E2E suite setup evaluates `bool(os.environ.get('OPENAI_API_KEY', ''))` → `True`. 3. All suite setups select `openai/gpt-4o` as the actor. 4. Plan execution calls the OpenAI API → receives HTTP 429 quota error. 5. Test fails with an LLM quota error instead of gracefully falling back to Anthropic. ## Expected Behavior 1. Before selecting the OpenAI actor, the suite setup **probes** the key by sending a minimal chat completion request (`"Hi"`, `max_tokens=1`, `gpt-4o-mini`) to the OpenAI API. 2. If the probe returns **HTTP 200** → key is functional → select `openai/gpt-4o`. 3. If the probe returns **HTTP 429**, **HTTP 401**, or any other error → key is not usable → fall back to `anthropic/claude-sonnet-4-20250514` and log a warning. 4. The actor fallback decision is made once per suite setup, before any test runs. ## Acceptance Criteria - [ ] A new script `robot/e2e/check_openai_key.py` exists that probes the OpenAI API with a minimal request using only Python stdlib (`urllib.request`) — no extra dependencies. - [ ] The script exits `0` when the API returns HTTP 200 and exits `1` for any quota, auth, network, or unexpected error. - [ ] A new keyword `Resolve LLM Actor` is added to `robot/e2e/common_e2e.resource` that invokes the check script and returns the appropriate actor name. - [ ] `Resolve LLM Actor` accepts `${openai_model}` and `${anthropic_model}` arguments with sensible defaults so callers can override the model names. - [ ] `m6_acceptance.robot` `M6 Suite Setup` uses `Resolve LLM Actor` instead of the naive `has_openai` check. - [ ] `wf04_multi_project.robot`, `wf05_db_migration.robot`, `wf07_cicd.robot`, and `wf16_devcontainer.robot` use `Resolve LLM Actor` instead of their inline `has_openai` checks. - [ ] When the probe fails, a `WARN`-level Robot Framework log message records the failure reason and which fallback was selected. - [ ] No changes are made to `strategy_actor.py` or any production source code. ## Supporting Information - Reverted commits: `51472c0b` (debug logging) and `f5712787` (runtime fallback in `strategy_actor.py`) — both on branch `fix/ci`. - Files with naive `has_openai` check today: `m6_acceptance.robot`, `wf04_multi_project.robot`, `wf05_db_migration.robot`, `wf07_cicd.robot`, `wf16_devcontainer.robot`. - Files intentionally excluded: `wf17_explicit_container.robot` and `wf18_container_clone.robot` already prefer Anthropic as primary; `m5_acceptance.robot` specifically tests OpenAI features and must use OpenAI. - The probe uses `gpt-4o-mini` with `max_tokens=1` to minimise cost (fractions of a cent per probe call). - `${PYTHON}` is a suite variable set by `E2E Suite Setup` (the venv Python executable) and is available when `Resolve LLM Actor` is called from any suite setup keyword. ## Subtasks - [ ] Create `robot/e2e/check_openai_key.py` (stdlib-only OpenAI probe script) - [ ] Add `Resolve LLM Actor` keyword to `robot/e2e/common_e2e.resource` - [ ] Update `robot/e2e/m6_acceptance.robot` to use `Resolve LLM Actor` - [ ] Update `robot/e2e/wf04_multi_project.robot` to use `Resolve LLM Actor` - [ ] Update `robot/e2e/wf05_db_migration.robot` to use `Resolve LLM Actor` - [ ] Update `robot/e2e/wf07_cicd.robot` to use `Resolve LLM Actor` - [ ] Update `robot/e2e/wf16_devcontainer.robot` to use `Resolve LLM Actor` ## Definition of Done - All subtasks above are checked off. - The `fix/ci` branch CI pipeline completes without LLM quota errors causing test failures. - No production source code (`src/`) is modified. - Commit follows the Conventional Changelog format specified in Metadata above.
hurui200320 added this to the v3.5.0 milestone 2026-04-17 05:37:05 +00:00
hurui200320 added reference fix/ci 2026-04-17 05:37:15 +00:00
Author
Member

As discussed with Luis, I'll revert the two commits and add fix.

As discussed with Luis, I'll revert the two commits and add fix.
hurui200320 removed their assignment 2026-04-17 05:53:43 +00:00
hurui200320 modified the milestone from v3.5.0 to v3.2.0 2026-04-17 05:53:49 +00:00
Author
Member

Implementation Note

Implemented in PR #10199 (branch fix/ci, commit 37cf6cc6).

What was built

robot/e2e/check_openai_key.py (new file)

A stdlib-only probe script (urllib.request, json, os, contextlib — no third-party deps) that sends the cheapest possible OpenAI request (gpt-4o-mini, "Hi", max_tokens=1, 15 s timeout) and exits 0 on HTTP 200 or 1 on any failure (429, 401, network error, timeout, missing key). The failure reason is printed to stdout so Robot Framework captures it in the test log at WARN level.

Resolve LLM Actor keyword in robot/e2e/common_e2e.resource

Centralises actor selection in one shared keyword. Accepts ${openai_model} (default openai/gpt-4o) and ${anthropic_model} (default anthropic/claude-sonnet-4-20250514) so callers can override the model names. Short-circuits immediately to Anthropic when OPENAI_API_KEY is not set, avoiding an unnecessary network call.

Five suite setups updatedm6_acceptance.robot, wf04_multi_project.robot, wf05_db_migration.robot, wf07_cicd.robot, wf16_devcontainer.robot — each replacing their inline has_openai boolean block with a single Resolve LLM Actor call. wf16 passes openai_model=openai/gpt-4o-mini to preserve its cost-optimisation choice.

Key decisions

  • Test-harness only — no changes to src/ production code. The quota problem is a CI environment concern and is best handled where the actor is selected, not inside StrategyActor.
  • Probe model is gpt-4o-mini — cheapest available chat model; 1 output token costs fractions of a cent. The probe adds < 5 s to suite setup time.
  • contextlib.suppress over bare except: pass — a linting issue (SIM105) was caught after the initial push and fixed in the amended commit before merging.
  • --force-with-lease rejected, used --force — the local remote-tracking ref was stale at push time (lease check failed); switched to --force after confirming no intervening commits from other authors.

Files not touched

wf17_explicit_container.robot and wf18_container_clone.robot already prefer Anthropic as primary; m5_acceptance.robot deliberately targets OpenAI-specific features and must not fall back.

## Implementation Note Implemented in PR #10199 (branch `fix/ci`, commit `37cf6cc6`). ### What was built **`robot/e2e/check_openai_key.py`** (new file) A stdlib-only probe script (`urllib.request`, `json`, `os`, `contextlib` — no third-party deps) that sends the cheapest possible OpenAI request (`gpt-4o-mini`, `"Hi"`, `max_tokens=1`, 15 s timeout) and exits 0 on HTTP 200 or 1 on any failure (429, 401, network error, timeout, missing key). The failure reason is printed to stdout so Robot Framework captures it in the test log at WARN level. **`Resolve LLM Actor` keyword** in `robot/e2e/common_e2e.resource` Centralises actor selection in one shared keyword. Accepts `${openai_model}` (default `openai/gpt-4o`) and `${anthropic_model}` (default `anthropic/claude-sonnet-4-20250514`) so callers can override the model names. Short-circuits immediately to Anthropic when `OPENAI_API_KEY` is not set, avoiding an unnecessary network call. **Five suite setups updated** — `m6_acceptance.robot`, `wf04_multi_project.robot`, `wf05_db_migration.robot`, `wf07_cicd.robot`, `wf16_devcontainer.robot` — each replacing their inline `has_openai` boolean block with a single `Resolve LLM Actor` call. `wf16` passes `openai_model=openai/gpt-4o-mini` to preserve its cost-optimisation choice. ### Key decisions - **Test-harness only** — no changes to `src/` production code. The quota problem is a CI environment concern and is best handled where the actor is selected, not inside `StrategyActor`. - **Probe model is `gpt-4o-mini`** — cheapest available chat model; 1 output token costs fractions of a cent. The probe adds < 5 s to suite setup time. - **`contextlib.suppress` over bare `except: pass`** — a linting issue (`SIM105`) was caught after the initial push and fixed in the amended commit before merging. - **`--force-with-lease` rejected, used `--force`** — the local remote-tracking ref was stale at push time (lease check failed); switched to `--force` after confirming no intervening commits from other authors. ### Files not touched `wf17_explicit_container.robot` and `wf18_container_clone.robot` already prefer Anthropic as primary; `m5_acceptance.robot` deliberately targets OpenAI-specific features and must not fall back.
Owner

🔧 Grooming note: Transitioning this issue from State/In ProgressState/In Review because PR #10199 is open and addresses this issue. Labels have been updated accordingly.


Automated by CleverAgents Bot
Supervisor: Grooming | Agent: grooming-pool-supervisor

🔧 **Grooming note:** Transitioning this issue from `State/In Progress` → `State/In Review` because PR #10199 is open and addresses this issue. Labels have been updated accordingly. --- **Automated by CleverAgents Bot** Supervisor: Grooming | Agent: grooming-pool-supervisor
Author
Member

@HAL9000 Please review the corresponding PR and get it merged asap to stable CI (e2e tests). Thanks.

@HAL9000 Please review the corresponding PR and get it merged asap to stable CI (e2e tests). Thanks.
Owner

Issue #10198 has been received and logged. The root cause is clear: the naive `OPENAI_API_KEY` presence check does not account for quota-exhausted keys, causing E2E suites to select the OpenAI actor and fail mid-run with HTTP 429 even when Anthropic credits are available.

PR #10199 has been submitted to address this with a stdlib-only probe script (`check_openai_key.py`) that validates the key before committing to it. This is the correct layer for the fix — test-harness infrastructure rather than production code.

The issue labels have been updated: the "Needs Feedback" label has been removed (the issue is fully specified) and the state has been advanced to "State/In Review" to reflect that the fix PR is under review.

The PR will be prioritized for automated review given the CI Blocker priority.

---
**Automated by CleverAgents Bot**
Supervisor: Human Liaison | Agent: human-liaison-pool-supervisor
```@hurui200320 — Thank you for reporting this CI Blocker and for the prompt fix. Issue #10198 has been received and logged. The root cause is clear: the naive `OPENAI_API_KEY` presence check does not account for quota-exhausted keys, causing E2E suites to select the OpenAI actor and fail mid-run with HTTP 429 even when Anthropic credits are available. PR #10199 has been submitted to address this with a stdlib-only probe script (`check_openai_key.py`) that validates the key before committing to it. This is the correct layer for the fix — test-harness infrastructure rather than production code. The issue labels have been updated: the "Needs Feedback" label has been removed (the issue is fully specified) and the state has been advanced to "State/In Review" to reflect that the fix PR is under review. The PR will be prioritized for automated review given the CI Blocker priority. --- **Automated by CleverAgents Bot** Supervisor: Human Liaison | Agent: human-liaison-pool-supervisor ```
hurui200320 2026-04-17 11:16:31 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core#10198
No description provided.