[AUTO-INF-5] CI: benchmark-publish job depends on a single self-hosted runner #8130

Closed
opened 2026-04-13 03:41:44 +00:00 by HAL9000 · 2 comments
Owner

Metadata

  • Commit Message: chore(ci): improve benchmark-publish runner availability and resilience
  • Branch Name: chore/ci/benchmark-publish-runner-resilience

Background and Context

The benchmark-publish job in the CI pipeline is configured to run on docker-benchmark, a self-hosted runner. This creates a single point of failure for the CI pipeline. If the docker-benchmark runner is offline or unavailable, the benchmark-publish job will fail, which could block the entire pipeline on master/develop branches.

This is particularly concerning given the current state of the CI pipeline (see #8094), where runner availability and reliability are already under scrutiny. A single self-hosted runner with no fallback or pool means any maintenance window, hardware failure, or network issue on that runner will silently break benchmark publishing.

Expected Behavior

The benchmark-publish job should be resilient to individual runner failures. It should either:

  1. Run on a pool of multiple docker-benchmark runners for high availability, or
  2. Fall back to the default docker runners if the docker-benchmark runner is unavailable, or
  3. Be configured with a retry/fallback mechanism so a single runner outage does not block the pipeline.

The CI pipeline should not have a single point of failure for any job that runs on master/develop branches.
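
As a rough illustration of option 1, the fragment below sketches how the job could target a pool. It assumes the pipeline uses a GitHub Actions-compatible workflow syntax and that two or more runners have been registered with the same `docker-benchmark` label; the step shown is a placeholder, not the project's actual publishing step.

```yaml
# Sketch only: assumes a GitHub Actions-compatible workflow file.
jobs:
  benchmark-publish:
    # Runner choice: `docker-benchmark` is treated as a pool label here.
    # Any idle runner registered with this label may pick up the job, so
    # one runner going offline does not stall publishing as long as at
    # least one other runner with the same label is still online.
    runs-on: docker-benchmark
    steps:
      - run: echo "benchmark publishing steps go here"  # placeholder step
```

With this approach the workflow file itself barely changes; the resilience comes from registering additional runners under the existing label.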

Acceptance Criteria

  • The benchmark-publish job no longer depends on a single docker-benchmark runner as its sole execution target
  • At least one of the following is implemented:
    • A runner pool with 2+ docker-benchmark runners is configured and the job targets the pool
    • The job is migrated to the default docker runner (if no special requirements exist)
    • A fallback runner is configured so the job can run on an alternative runner if docker-benchmark is unavailable
  • The CI pipeline on master/develop does not fail due to docker-benchmark runner unavailability
  • The change is documented in the CI configuration comments or a relevant ADR/runbook

Subtasks

  • Audit the benchmark-publish job to determine if it has special requirements that mandate the docker-benchmark runner (e.g., GPU access, large disk, specific tooling)
  • If no special requirements: migrate benchmark-publish to the default docker runner
  • If special requirements exist: provision a second docker-benchmark runner and configure the job to target the runner pool
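  • Alternatively: configure a fallback runner in the CI YAML. Note that in GitHub Actions-compatible workflow syntax, a label list such as runs-on: [docker-benchmark, docker] selects a runner that carries all of the listed labels rather than falling back from one label to the other, so confirm how the CI system in use interprets this form before relying on it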
  • Update CI YAML to reflect the chosen solution
  • Verify the benchmark-publish job completes successfully on the new runner configuration
  • Add a comment in the CI YAML explaining the runner choice and fallback strategy
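
One possible shape for the fallback subtask is sketched below. It assumes a GitHub Actions-compatible syntax in which `runs-on` accepts an expression (worth verifying on the CI system actually in use), and it relies on a hypothetical script, `ci/benchmark-runner-online.sh`, that exits 0 when a `docker-benchmark` runner is reachable. Neither the script nor the job name `select-runner` comes from the issue.

```yaml
# Sketch only: script path and job names are hypothetical.
jobs:
  select-runner:
    runs-on: docker            # default runner decides where to publish
    outputs:
      label: ${{ steps.pick.outputs.label }}
    steps:
      - uses: actions/checkout@v4
      - id: pick
        run: |
          # HYPOTHETICAL availability check; replace with a real one.
          if ./ci/benchmark-runner-online.sh; then
            echo "label=docker-benchmark" >> "$GITHUB_OUTPUT"
          else
            echo "label=docker" >> "$GITHUB_OUTPUT"
          fi

  benchmark-publish:
    needs: select-runner
    # Runner choice: prefer docker-benchmark, fall back to docker.
    runs-on: ${{ needs.select-runner.outputs.label }}
    steps:
      - run: echo "benchmark publishing steps go here"  # placeholder step
```

The check and the dispatch are not atomic, so a runner that goes offline between the two jobs can still leave `benchmark-publish` queued; a runner pool avoids that gap.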

Definition of Done

This issue should be closed when:

  • The benchmark-publish job is no longer a single point of failure in the CI pipeline
  • The chosen runner strategy (pool, fallback, or migration) is implemented and verified
  • The CI pipeline passes on master/develop with the new configuration
  • No regression is introduced in benchmark publishing functionality

Automated by CleverAgents Bot
Agent: new-issue-creator

Author
Owner

Verified. A single point of failure in CI infrastructure is a real risk. This is a Should Have fix: important for CI reliability, but not blocking feature development.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

HAL9000 added this to the v3.2.0 milestone 2026-04-13 04:33:26 +00:00
Owner

superseded by next cycle

freemo 2026-04-13 16:18:54 +00:00
  • closed this issue
  • added the Type/Task label
Reference
cleveragents/cleveragents-core#8130