[AUTO-INF-1] Reduce CI wall-clock with prebuilt runner image and targeted packaging gates #9689

Open
opened 2026-04-15 03:16:31 +00:00 by HAL9000 · 0 comments
Owner

Summary

  • The last 82 successful ci.yml runs (Actions API pages 1-4 with detail=1) averaged ~27.2 minutes. 14 of those runs exceeded 45 minutes, including run 4821 at 131.9 minutes and run 4427 at 99.1 minutes. Recent PR runs such as 6574 still take 36.7 minutes despite containing no Docker or Helm changes.
  • Twelve jobs in .forgejo/workflows/ci.yml (the ones listed under Observations) boot the same python:3.13-slim container, run apt-get update && apt-get install for Node/Git/Helm, and reinstall uv + nox. Composite-action issue #9527 centralizes the YAML, but the per-job bootstrap still burns ~45-60 seconds each time.
  • Docker and Helm packaging jobs execute on every PR even when the change set only touches Python/tests/docs, so contributors wait for image builds and Helm linting that rarely surface regressions for those PRs.
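
The headline figures above (mean duration, number of runs over 45 minutes) can be recomputed from a list of run durations. The following is a minimal sketch, not the script used for this issue: the `summarize` helper is hypothetical, and the sample contains only the three durations quoted above rather than the full 82-run data set, so its mean will not match the ~27.2 minute average.

```python
from statistics import mean

SLOW_THRESHOLD_MIN = 45.0  # the "slow run" threshold quoted in the summary


def summarize(durations_min, threshold=SLOW_THRESHOLD_MIN):
    """Return the run count, mean duration, and number of runs above the threshold."""
    if not durations_min:
        raise ValueError("need at least one run duration")
    return {
        "runs": len(durations_min),
        "mean_min": round(mean(durations_min), 1),
        "over_threshold": sum(d > threshold for d in durations_min),
    }


# Only the three durations cited in the summary (runs 4821, 4427, 6574).
sample = [131.9, 99.1, 36.7]
print(summarize(sample))
```

Running the same helper on before/after samples would give the comparison requested in the acceptance criteria.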

Observations

  • Workflow definition: lint, typecheck, security, quality, unit_tests, integration_tests, e2e_tests, coverage, build, docker, helm, and push-validation all call apt-get + pip install uv==0.8.0 nox inside fresh containers. None of the jobs share the .nox virtualenvs, so every run reinstalls .[tests], slipcover, behave, robotframework, Helm/kubeconform, etc.
  • Cache configuration only preserves ~/.cache/uv; the heavier .nox environments are discarded between jobs, triggering cold installs of test extras and Helm toolchain in each job.
  • Actions data (pages 1-4) shows wide variance tied to heavy jobs. Runs with Docker/Helm work push the critical path above 45 minutes. Fast-path PRs without those steps still hover around 18-19 minutes because of repeated bootstrapping.

Proposal

  1. Adopt a pre-baked Forgejo runner image. Publish and pin a cleveragents/ci-runner:py3.13 (or similar) image that already contains Node 20+, Git, Helm 3.16.4, kubeconform 0.7.0, uv 0.8.0, nox, slipcover, and bandit/vulture/radon. Update container.image in ci.yml, nightly-quality.yml, and release.yml to use it. This removes apt-get update and pip install from every job, cutting 6–10 minutes per run and eliminating network-heavy flakes.
  2. Cache .nox virtualenvs across jobs. Extend the existing cache step to save .nox alongside ~/.cache/uv with keys derived from pyproject.toml and noxfile.py. With reuse_venv=True, subsequent jobs can reuse the same environment instead of recreating editable installs and Helm toolchains, saving ~30–60 seconds per job.
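  3. Gate Docker/Helm packaging jobs by path changes. Add path guardrails so Docker image builds run only when Dockerfile*, pyproject.toml, or scripts/cli/ change, and Helm validation runs only when k8s/ or Helm templates change. Because on.pull_request.paths filters the whole workflow rather than individual jobs, the docker and helm jobs either need to move into their own workflow files with paths filters or be guarded by job-level if: conditions fed by a change-detection step (with matching conditions for push). Provide a fallback manual dispatch so maintainers can force a run when needed. This keeps packaging coverage while avoiding 5–8 minute delays on Python-only PRs.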
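
The path-gating decision in item 3 can be sketched as a small change-detection helper. This is an illustration under assumptions, not the proposed implementation: `jobs_to_run` and `_matches` are hypothetical names, bare patterns such as `Dockerfile*` are matched against the file name in any directory (a design choice, not something the proposal specifies), and the Helm template location is not given above, so only `k8s/` is listed.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns taken from the proposal text. The Helm template path is not
# specified there, so it is deliberately left out of HELM_PATTERNS.
DOCKER_PATTERNS = ("Dockerfile*", "pyproject.toml", "scripts/cli/*")
HELM_PATTERNS = ("k8s/*",)


def _matches(path, patterns):
    """True if the path matches any pattern.

    Patterns containing "/" are compared with the full path; bare patterns
    are compared with the file name only (assumption made for this sketch).
    """
    name = PurePosixPath(path).name
    for pattern in patterns:
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            return True
    return False


def jobs_to_run(changed_files, force=False):
    """Decide whether the docker and helm jobs are needed for a change set.

    `force` models the manual-dispatch fallback described in the proposal.
    """
    return {
        "docker": force or any(_matches(f, DOCKER_PATTERNS) for f in changed_files),
        "helm": force or any(_matches(f, HELM_PATTERNS) for f in changed_files),
    }


if __name__ == "__main__":
    # Hypothetical Python-only change set: both packaging jobs are skipped.
    print(jobs_to_run(["src/app.py", "tests/test_app.py", "docs/index.md"]))
```

A change-detection job could print this result as job outputs that the docker and helm jobs read in their `if:` conditions.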

Acceptance Criteria

  • CI workflows run on the new pre-built runner image and no longer install Node/Git/Helm/uv within individual jobs.
  • .nox environments are cached and reused across jobs, with cache keys documented and invalidation strategy defined.
  • docker and helm jobs (and their nightly equivalents) are skipped automatically for change sets that do not touch Docker or Kubernetes assets, with documentation explaining the guardrails.
  • Measured PR CI runtime reduction is captured in the implementing PR (before/after sample runs).
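
For the cache-key criterion, one possible derivation hashes the two files named in the proposal together with the runner image tag. This is a hedged sketch: `nox_cache_key` is a hypothetical helper, and including the image tag is an assumption added here because virtualenvs record the path of the interpreter that created them, so a cached `.nox` directory should not outlive a change of base image.

```python
import hashlib
from pathlib import Path


def nox_cache_key(root, image_tag, files=("pyproject.toml", "noxfile.py")):
    """Build a cache key that changes when an input file or the image tag changes."""
    digest = hashlib.sha256()
    # Assumption: the runner image tag is part of the key (see lead-in).
    digest.update(image_tag.encode())
    for name in files:
        # Separate entries with NUL bytes so file boundaries stay unambiguous.
        digest.update(b"\0" + name.encode() + b"\0")
        digest.update((Path(root) / name).read_bytes())
    return f"nox-{digest.hexdigest()[:16]}"


if __name__ == "__main__":
    # Run from the repository root; the tag value is illustrative.
    print(nox_cache_key(Path("."), "py3.13"))
```

Under this scheme the invalidation strategy is implicit: editing `pyproject.toml` or `noxfile.py`, or bumping the image tag, produces a new key and therefore a cold rebuild of the environments.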

Duplicate Check

  • Open issues (pages 1–3) filtered for "CI runtime" and "coverage" show #9540 (push-validation gating) and #9534 (coverage reuse) already in flight; both focus on different bottlenecks.
  • Open issues #9527/#9528 cover composite actions and matrix consolidation but still perform per-job installs; this proposal targets a shared base image and caching.
  • Closed issues (page 1) with "CI runtime" returned no matches.

Automated by CleverAgents Bot
Supervisor: Test Infrastructure Pool | Agent: test-infra-worker

Reference
cleveragents/cleveragents-core#9689