[AUTO-INF-1] Reduce CI execution time for cleveragents-core #9782

Open
opened 2026-04-15 15:40:50 +00:00 by HAL9000 · 0 comments

Summary

  • p95 CI runtime is ~92 min (avg ~23 min) across the latest 20 runs; multiple PR runs still exceed an hour (e.g., #666 at 132 min, #272 at 99 min).
  • Failure rate remains high: across the most recent 90 runs there were 40 failures vs 37 successes (13 cancelled), with every scheduled run (#922, #292, #163, #803) failing within 30 seconds.
  • The long-running workflows are the integration, E2E, and coverage jobs. Each bootstraps a full toolchain in a fresh container and depends on external LLM secrets, whose absence frequently breaks cron jobs.

Current CI Behavior

  • Latest 20 runs: average 23.0 min, p95 92.1 min, 9 failures, 7 cancelled, 14 successes.
  • Scheduled runs consistently fail in under 30 s, which drags overall reliability down and masks regressions.
  • Representative long runs:
| Run # | Status | Duration (min) | Trigger | Title |
|---|---|---|---|---|
| 666 | success | 132 | pull_request | fix(test): address review findings in M3 smoke tests |
| 272 | success | 99 | pull_request | Merge branch master into feature/m3-actor-schema-examples |
| 275 | failure | 92 | pull_request | Build: configured coverage report to run in parallal, removed cap on number of parallel processes, limited by number of cores of env variable |
| 279 | success | 54 | pull_request | test(db): add resource registry robot smoke test |
| 1916 | success | 37 | pull_request | Merge branch master into feature/m6-cli-polish |
| 1742 | success | 35 | pull_request | fix(test): address Critical/Major review findings for WF01 integration tests |
| 1933 | cancelled | 24 | pull_request | feat(acms): implement UKO Layer 3 Technology Vocabularies (uko-py, uko-ts, uko-rs, uko-java) |
| 1915 | success | 19 | push | fix(test): remove skip guard and soft assertions from scientific paper E2E test |
| 85 | success | 16 | pull_request | feat(domain): add spec-aligned Resource and Project models |
| 83 | failure | 13 | pull_request | feat(qa): enforce coverage >=97% in CI |

Key Bottlenecks

  • Integration & E2E jobs: each container installs Node, git, curl, Helm, uv, and runs full nox suites. These jobs dominate wall-clock time (30–130 min) and require real LLM API keys, so missing secrets cause instant failure.
  • The coverage job re-runs the test matrix even after the unit/integration jobs finish, leading to ~90 min reruns (run #275 took 92 min) and duplicate dependency resolution.
  • Scheduled workflows reuse the same jobs without the necessary secrets, causing deterministic failures and inflating the headline failure rate to ~69%.
  • Environment bootstrap is duplicated across 12 jobs (apt-get + pip installs), increasing both runtime and flake exposure.

Optimization Proposals

  • Build and publish a reusable Docker image (or composite action) with Node, git, Helm, kubeconform, uv/nox pre-installed so each job skips apt-get + downloads. Expect 6–10 min savings per heavy job, reducing p95 by ~30–40 min.
  • Split integration_tests and e2e_tests into matrices keyed by suite/component (or increase TEST_PROCESSES with per-suite limits) to reduce single-job runtime below 30 min while keeping coverage complete. Capture timing in job summaries to track improvements.
  • Generate coverage inside unit_tests (store .coverage artifacts and run coverage combine there) so the dedicated coverage job only aggregates results instead of replaying the test suite.
  • Provide service credentials for scheduled runs (or move them to a workflow with mock providers) so cron jobs stop failing immediately; add a pre-flight check that emits a skipped status when secrets are intentionally absent rather than hard-failing.
  • Cache Helm tarballs, uv caches, and downloaded CLIs keyed by version and commit to eliminate repeated network fetches and transient HTTP failures. Pair with actions/cache/restore to shorten cold-start time.
  • Enable concurrency to cancel superseded PR runs and reduce queue pressure while keeping required checks intact.
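
Several of the proposals above could be combined in one workflow. The fragment below is a minimal sketch, not the contents of the real `ci.yml`: the secret name `LLM_API_KEY`, the image `registry.example.com/ci-base:latest`, the runner label `docker`, the suite names, the `UV_VERSION` value, the cron schedule, and the nox session name are all placeholders. It uses GitHub Actions-compatible syntax (`concurrency`, `strategy.matrix`, `container`, `actions/cache`); whether the deployed Forgejo runner honors every key, `concurrency` in particular, would need to be verified.

```yaml
# Sketch only: every name marked "placeholder" is an assumption,
# not something taken from the repository's actual workflow.
on:
  pull_request:
  schedule:
    - cron: "0 3 * * *"              # placeholder schedule

# Cancel an in-flight run when a newer run starts for the same event and ref.
concurrency:
  group: ci-${{ github.event_name }}-${{ github.ref }}
  cancel-in-progress: true

env:
  UV_VERSION: "0.0.0"                # placeholder; pin to the version in use

jobs:
  # Pre-flight: record whether LLM credentials exist instead of hard-failing.
  preflight:
    runs-on: docker                  # placeholder runner label
    outputs:
      has_secrets: ${{ steps.check.outputs.has_secrets }}
    steps:
      - id: check
        env:
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}   # placeholder secret name
        run: |
          if [ -n "$LLM_API_KEY" ]; then
            echo "has_secrets=true" >> "$GITHUB_OUTPUT"
          else
            echo "has_secrets=false" >> "$GITHUB_OUTPUT"
          fi

  integration_tests:
    needs: preflight
    # A false condition marks the job as skipped rather than failed.
    if: needs.preflight.outputs.has_secrets == 'true'
    runs-on: docker
    container:
      # Placeholder image assumed to ship Node, git, Helm, kubeconform, uv, nox.
      image: registry.example.com/ci-base:latest
    strategy:
      fail-fast: false
      matrix:
        suite: [suite_a, suite_b, suite_c]          # placeholder suite names
    steps:
      - uses: actions/checkout@v4
      # Exact key is per version and commit; restore-keys falls back to the
      # most recent cache for the same uv version.
      - uses: actions/cache@v4
        with:
          path: ~/.cache/uv
          key: uv-${{ env.UV_VERSION }}-${{ github.sha }}
          restore-keys: |
            uv-${{ env.UV_VERSION }}-
      # Placeholder session name; the suite is passed through as a positional argument.
      - env:
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
        run: nox -s integration_tests -- "${{ matrix.suite }}"
```

With this shape, a scheduled run without credentials would finish with `integration_tests` skipped instead of red, and each matrix entry would run as its own job, so a slow suite no longer serializes the others. Including `github.event_name` in the concurrency group keeps scheduled runs from cancelling PR runs on the same ref.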

Duplicate Check

  • Searched open issues for "[AUTO-INF-1]" via GET /api/v1/repos/cleveragents/cleveragents-core/issues?state=open&page=1&limit=20 → no matches.

References

  • Run #666 (132 min): https://git.cleverthis.com/cleveragents/cleveragents-core/actions/runs/666
  • Run #275 (coverage failure, 92 min): https://git.cleverthis.com/cleveragents/cleveragents-core/actions/runs/275
  • Run #922 (scheduled failure, 26s): https://git.cleverthis.com/cleveragents/cleveragents-core/actions/runs/922
  • CI workflow: https://git.cleverthis.com/cleveragents/cleveragents-core/src/branch/master/.forgejo/workflows/ci.yml