[AUTO-INF-3] Harden CI workflow reliability by stabilizing runner setup and secret-dependent jobs #9767

Open
opened 2026-04-15 15:26:53 +00:00 by HAL9000 · 0 comments
Owner

Summary

  • The 69.7% CI failure rate correlates with environment bootstrapping and secret handling rather than with test coverage gaps.
  • Jobs frequently fail before running tests because the workflows rebuild the runner from scratch and assume write-scoped secrets.

Findings

  1. Every job bootstraps the same OS packages via apt-get
    • All Python jobs run inside python:3.13-slim containers and install Node.js, git, curl, and tar inline (the lint, typecheck, security, and similar jobs in .forgejo/workflows/ci.yml). These apt-get update && apt-get install steps run without retries and are the top source of flaky failures when Debian mirrors are briefly unavailable.
  2. Secret-dependent jobs fail outright on external contributions
    • integration_tests, e2e_tests, and push-validation rely on ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, and FORGEJO_TOKEN. On forked PRs these resolve to empty strings, so the jobs exit with authentication errors before doing useful work.
  3. Helm/Kubeconform/Docker tooling is downloaded on every run with no caching or retry
    • Helm and kubeconform tarballs are fetched in each job via curl without --retry, Docker builds run per job as well, and there is no caching keyed on Chart.lock or Dockerfile digests.
  4. The job graph starts heavy workloads immediately
    • Integration, E2E, docker builds, and helm validation run in parallel with lint/typecheck/unit tests, increasing runner contention. With multiple PRs queued the self-hosted runner exhausts CPU/memory and jobs time out.
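
Findings 1 and 3 both come down to network steps that run exactly once. As a rough illustration of what a retried step could look like, here is a minimal POSIX sh sketch. The retry helper, its argument order, and the HELM_URL placeholder are assumptions made for this example and are not taken from the current workflow.

```shell
#!/bin/sh
set -eu

# retry <max_attempts> <base_delay_seconds> <command...>
# Hypothetical helper: reruns the command until it succeeds or the
# attempts are exhausted, waiting base_delay * attempt seconds between tries.
retry() {
  retry_max="$1"
  retry_delay="$2"
  shift 2
  retry_attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$retry_attempt" -ge "$retry_max" ]; then
      echo "retry: '$*' failed after $retry_attempt attempts" >&2
      return 1
    fi
    retry_wait=$(( retry_delay * retry_attempt ))
    echo "retry: attempt $retry_attempt/$retry_max failed; waiting ${retry_wait}s" >&2
    sleep "$retry_wait"
    retry_attempt=$(( retry_attempt + 1 ))
  done
}

# Possible use inside a workflow step (left commented so the sketch has no side effects):
# retry 3 5 apt-get update
# retry 3 5 apt-get install -y --no-install-recommends git curl tar
# curl -fsSL --retry 3 --retry-delay 5 -o helm.tgz "$HELM_URL"   # HELM_URL is a placeholder
```

A wrapper like this only masks transient mirror failures; it does not remove the repeated installs themselves.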

Proposals

  1. Publish and use a pre-warmed CI container image
    Implementation: build ghcr.io/cleveragents/ci/python-3.13-uv with Node 20, git, curl, tar, uv 0.8, nox, helm v3.16.4, and kubeconform v0.7.0 pre-installed; point all Python jobs to this image and derive a DinD variant for the docker job.
    Expected impact: removes repeated apt-get calls (major flaky step) and cuts 1-2 minutes of bootstrap time per job.
  2. Add a reusable setup stage to prime uv/nox caches with retries
    Implementation: create a prepare-python job/composite action that installs uv+nox with uv pip install --system --frozen, wraps the install in a 3-attempt retry loop, and stores ~/.cache/uv via actions/cache@v4 (save-always: true, key includes uv.lock and pyproject.toml). Downstream jobs declare needs: prepare-python and restore the warmed cache.
    Expected impact: stabilizes interpreter provisioning, eliminates repeated network installs, and makes reruns faster.
  3. Make secret-dependent jobs dual-path instead of hard-failing
    Implementation: add a guard step that sets an output when secrets are present. When secrets exist, run the current suites; otherwise, run an offline fixture mode (nox -s integration_tests -- --offline-fixtures, nox -s e2e_tests -- --mock-creds) and short-circuit push validation to a read-only smoke test that still exercises git config. Adjust the status-check aggregator to accept the offline path as success.
    Expected impact: keeps the checks meaningful while stopping forked PRs from failing immediately due to missing secrets.
  4. Re-tier the workflow graph and add concurrency control
    Implementation: group lint/typecheck/security/quality/unit tests as Tier 1 and gate integration/e2e/docker/helm/coverage behind needs: tier1. Add a concurrency block with group: ci-${{ github.ref }} and cancel-in-progress: true, and set strategy.fail-fast: false on the Tier 1 matrix.
    Expected impact: prevents heavy jobs from running when early checks fail, reduces runner contention, and improves overall completion rate.
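
The guard step in proposal 3 could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual step: the function names secrets_mode and integration_command are hypothetical, the live command nox -s integration_tests is assumed to be what the job runs today, and writing to GITHUB_OUTPUT assumes the runner exposes that file the way GitHub-compatible runners do.

```shell
#!/bin/sh
set -eu

# secrets_mode <VAR_NAME...>
# Hypothetical helper: prints "live" when every named environment variable
# is set and non-empty, otherwise prints "offline".
secrets_mode() {
  for secret_name in "$@"; do
    # Indirect lookup; names are fixed by the caller, never user input.
    eval "secret_value=\${$secret_name:-}"
    if [ -z "$secret_value" ]; then
      echo "offline"
      return 0
    fi
  done
  echo "live"
}

# integration_command <mode>
# Prints the nox invocation for the given mode. The offline flag comes from
# the proposal; the live command is an assumption about the current job.
integration_command() {
  if [ "$1" = "live" ]; then
    echo "nox -s integration_tests"
  else
    echo "nox -s integration_tests -- --offline-fixtures"
  fi
}

mode="$(secrets_mode ANTHROPIC_API_KEY OPENAI_API_KEY GOOGLE_API_KEY FORGEJO_TOKEN)"
# Expose the result to later steps when the runner provides GITHUB_OUTPUT.
echo "secrets_mode=$mode" >> "${GITHUB_OUTPUT:-/dev/null}"
echo "integration command: $(integration_command "$mode")"
```

On a forked PR, where the secrets resolve to empty strings, this selects the offline path instead of exiting with an authentication error. The status-check aggregator would still need to treat that path as success.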

Duplicate Check

  • Existing [AUTO-INF-3] issue #9683 focuses on enforcing 32-process parallelism; it does not address runner bootstrapping or secret handling, so this proposal is complementary.
  • Queried /issues?limit=50&page=1-2 for [AUTO-INF-3] and found no other overlapping issues.
Reference
cleveragents/cleveragents-core#9767