[AUTO-INF-4] Flaky Tests: Replace real LLM API calls with a mock service in E2E tests #8078

Open
opened 2026-04-13 03:26:38 +00:00 by HAL9000 · 0 comments
Owner

Metadata

  • Commit message: chore(tests): replace real LLM API calls with a mock service in E2E tests
  • Branch name: chore/e2e-mock-llm-service

Background and Context

The current E2E test suite exhibits significant flakiness, primarily due to its direct dependency on external LLM services (OpenAI, Anthropic). These tests, located in the robot/e2e/ directory, use real API keys, which introduces non-determinism and unreliability from network latency, API rate limiting, and service availability.

Problem

  • Intermittent Failures: Recent workflow runs show a pattern of failures and cancellations in the e2e_tests job (e.g., runs #6454, #4965).
  • External Dependencies: Tests like wf18_container_clone.robot and m6_acceptance.robot explicitly use real API keys, making them vulnerable to external factors.
  • Parallel Execution: The E2E tests are run in parallel with 4 worker processes, which, combined with external dependencies, increases the likelihood of race conditions and other non-deterministic failures.

Expected Behavior

The E2E test suite runs reliably and deterministically in CI without any dependency on external LLM services. A lightweight mock or fake LLM service intercepts all API calls and returns predictable, pre-configured responses. The e2e_tests CI job passes consistently across all runs.

Acceptance Criteria

  • A mock/fake LLM service is implemented that simulates the OpenAI and/or Anthropic API response format
  • All E2E tests in robot/e2e/ are updated to use the mock service instead of real API keys
  • The mock service is configurable via environment variable or command-line flag (e.g., USE_MOCK_LLM=true)
  • Real API keys are no longer required to run the E2E test suite in CI
  • The e2e_tests CI job passes consistently across at least 3 consecutive runs without failures
  • The mock service returns deterministic, predictable responses for all test scenarios
  • Parallel execution with 4 workers continues to work correctly with the mock service

Subtasks

  • Audit robot/e2e/ to identify all tests that use real LLM API keys (e.g., wf18_container_clone.robot, m6_acceptance.robot)
  • Evaluate mock/intercept strategies: in-process fake server, responses library, pytest-httpserver, or similar
  • Implement a lightweight fake LLM server that returns deterministic responses matching the OpenAI/Anthropic API format
  • Create pre-configured response fixtures for all E2E test scenarios
  • Update E2E test configuration to route LLM calls to the mock service
  • Remove or gate real API key usage behind an opt-in flag (e.g., USE_REAL_LLM=true for manual testing)
  • Update CI workflow to run E2E tests without real API key secrets
  • Verify parallel execution (4 workers) works correctly with the mock service
  • Run the e2e_tests job 3+ consecutive times to confirm stability
  • Update developer documentation to describe the mock service setup

Definition of Done

This issue should be closed when:

  • All E2E tests in robot/e2e/ use the mock LLM service by default
  • No real API keys are required to run the E2E suite in CI
  • The e2e_tests CI job passes consistently without intermittent failures
  • The mock service is documented and easy for developers to extend with new response fixtures

Duplicate Check

  • Keyword search in open issues: No results found for "flaky", "e2e", "mock", or "external".
  • Cross-area search for similar proposals: No results found for issues with the [AUTO-INF-*] tag.
  • Closed issues search: No results found for "flaky", "e2e", "mock", or "external".

Overview

The current E2E test suite exhibits significant flakiness, primarily due to its direct dependency on external LLM services (OpenAI, Anthropic). These tests, located in the robot/e2e/ directory, use real API keys, which introduces non-determinism and unreliability from network latency, API rate limiting, and service availability.

Proposal

To improve the reliability and stability of the E2E test suite, I propose to replace the real LLM API calls with a mocked or fake LLM service.

This can be achieved by:

  1. Creating a lightweight, in-process fake LLM server that simulates the behavior of the real LLM APIs. This server would return predictable, deterministic responses, removing the dependency on external services.
  2. Using a library like responses or pytest-httpserver to intercept and mock the HTTP requests made to the LLM APIs.
  3. Configuring the E2E tests to use the mock service instead of the real APIs. This could be done using an environment variable or a command-line flag.

Benefits

  • Increased Reliability: Eliminating external dependencies will make the E2E tests more stable and less prone to intermittent failures.
  • Faster Feedback: Mocked responses will be much faster than real API calls, reducing the overall runtime of the E2E test suite.
  • Improved Developer Experience: A more reliable CI/CD pipeline will improve the developer experience and reduce the time spent debugging flaky tests.

This issue was automatically generated by the test-infra-worker with session tag [AUTO-INF-4].


Automated by CleverAgents Bot
Agent: new-issue-creator

## Metadata - **Commit message:** `chore(tests): replace real LLM API calls with a mock service in E2E tests` - **Branch name:** `chore/e2e-mock-llm-service` ## Background and Context The current E2E test suite exhibits significant flakiness, primarily due to its direct dependency on external LLM services (OpenAI, Anthropic). These tests, located in the `robot/e2e/` directory, use real API keys, which introduces non-determinism and unreliability from network latency, API rate limiting, and service availability. ### Problem - **Intermittent Failures:** Recent workflow runs show a pattern of failures and cancellations in the `e2e_tests` job (e.g., runs #6454, #4965). - **External Dependencies:** Tests like `wf18_container_clone.robot` and `m6_acceptance.robot` explicitly use real API keys, making them vulnerable to external factors. - **Parallel Execution:** The E2E tests are run in parallel with 4 worker processes, which, combined with external dependencies, increases the likelihood of race conditions and other non-deterministic failures. ## Expected Behavior The E2E test suite runs reliably and deterministically in CI without any dependency on external LLM services. A lightweight mock or fake LLM service intercepts all API calls and returns predictable, pre-configured responses. The `e2e_tests` CI job passes consistently across all runs. ## Acceptance Criteria - [ ] A mock/fake LLM service is implemented that simulates the OpenAI and/or Anthropic API response format - [ ] All E2E tests in `robot/e2e/` are updated to use the mock service instead of real API keys - [ ] The mock service is configurable via environment variable or command-line flag (e.g., `USE_MOCK_LLM=true`) - [ ] Real API keys are no longer required to run the E2E test suite in CI - [ ] The `e2e_tests` CI job passes consistently across at least 3 consecutive runs without failures - [ ] The mock service returns deterministic, predictable responses for all test scenarios - [ ] Parallel execution with 4 workers continues to work correctly with the mock service ## Subtasks - [ ] Audit `robot/e2e/` to identify all tests that use real LLM API keys (e.g., `wf18_container_clone.robot`, `m6_acceptance.robot`) - [ ] Evaluate mock/intercept strategies: in-process fake server, `responses` library, `pytest-httpserver`, or similar - [ ] Implement a lightweight fake LLM server that returns deterministic responses matching the OpenAI/Anthropic API format - [ ] Create pre-configured response fixtures for all E2E test scenarios - [ ] Update E2E test configuration to route LLM calls to the mock service - [ ] Remove or gate real API key usage behind an opt-in flag (e.g., `USE_REAL_LLM=true` for manual testing) - [ ] Update CI workflow to run E2E tests without real API key secrets - [ ] Verify parallel execution (4 workers) works correctly with the mock service - [ ] Run the `e2e_tests` job 3+ consecutive times to confirm stability - [ ] Update developer documentation to describe the mock service setup ## Definition of Done This issue should be closed when: - All E2E tests in `robot/e2e/` use the mock LLM service by default - No real API keys are required to run the E2E suite in CI - The `e2e_tests` CI job passes consistently without intermittent failures - The mock service is documented and easy for developers to extend with new response fixtures ### Duplicate Check - **Keyword search in open issues:** No results found for "flaky", "e2e", "mock", or "external". - **Cross-area search for similar proposals:** No results found for issues with the `[AUTO-INF-*]` tag. - **Closed issues search:** No results found for "flaky", "e2e", "mock", or "external". --- ### Overview The current E2E test suite exhibits significant flakiness, primarily due to its direct dependency on external LLM services (OpenAI, Anthropic). These tests, located in the `robot/e2e/` directory, use real API keys, which introduces non-determinism and unreliability from network latency, API rate limiting, and service availability. ### Proposal To improve the reliability and stability of the E2E test suite, I propose to **replace the real LLM API calls with a mocked or fake LLM service**. This can be achieved by: 1. **Creating a lightweight, in-process fake LLM server** that simulates the behavior of the real LLM APIs. This server would return predictable, deterministic responses, removing the dependency on external services. 2. **Using a library like `responses` or `pytest-httpserver`** to intercept and mock the HTTP requests made to the LLM APIs. 3. **Configuring the E2E tests to use the mock service** instead of the real APIs. This could be done using an environment variable or a command-line flag. ### Benefits - **Increased Reliability:** Eliminating external dependencies will make the E2E tests more stable and less prone to intermittent failures. - **Faster Feedback:** Mocked responses will be much faster than real API calls, reducing the overall runtime of the E2E test suite. - **Improved Developer Experience:** A more reliable CI/CD pipeline will improve the developer experience and reduce the time spent debugging flaky tests. --- *This issue was automatically generated by the `test-infra-worker` with session tag `[AUTO-INF-4]`.* --- **Automated by CleverAgents Bot** Agent: new-issue-creator
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#8078
No description provided.