Add concurrency groups and job timeouts to CI workflows #451

Closed
opened 2026-02-26 03:46:51 +00:00 by freemo · 1 comment
Owner

Metadata

  • Commit Message: chore(ci): add concurrency groups and job timeouts to CI workflows
  • Branch: chore/ci-concurrency-timeouts

Background and Context

Investigation of CI runner delays revealed that the Forgejo Actions CI pipeline has no concurrency controls. Every push to any branch or PR triggers a full set of 11 parallel CI jobs (lint, typecheck, security, quality, unit_tests, integration_tests, coverage, benchmark-regression, benchmark-publish, build, docker) without cancelling previously queued or running jobs for the same branch. With 20+ open PRs triggering CI simultaneously, this wastes finite runner capacity on superseded builds: a developer who pushes a quick fixup commit ends up with two complete sets of 11 jobs competing for runners, instead of the first set being cancelled.

Additionally, no job-level timeout is configured on any CI job. If a job hangs (e.g., a network stall during apt-get update or an infinite loop in a test), it consumes a runner slot indefinitely until Forgejo's built-in stop_zombie_tasks cron (which runs every 5 minutes) catches it. Even then, the detection heuristics may not classify a hung-but-connected runner as a zombie.

The nightly quality workflow (nightly-quality.yml) has the same issues.

Current Behavior

  • Every push to a PR or protected branch spawns 11 CI jobs regardless of whether a previous run for the same ref is still in progress.
  • Superseded CI runs consume runner capacity unnecessarily, increasing queue wait times for all developers.
  • No timeout-minutes is set on any job, so hung jobs can occupy runners indefinitely.

Expected Behavior

  • When a new CI run is triggered for a branch that already has a run in progress, the previous run is automatically cancelled (cancel-in-progress: true).
  • Each CI job has an appropriate timeout-minutes value so that hung jobs are killed promptly and free up runner capacity.
  • The nightly quality workflow also has concurrency and timeout controls.
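
The cancellation behavior described above could be expressed roughly as follows. This is a minimal sketch, not the merged change: it assumes the Forgejo version in use honors GitHub-Actions-style `concurrency` syntax and exposes the `github.workflow` and `github.ref` context values, and the exact group key is an illustration of "workflow + ref".

```yaml
# Hypothetical excerpt of .forgejo/workflows/ci.yml (top level, next to `on:` and `jobs:`).
concurrency:
  # Assumed key: one group per workflow per ref, so pushes to different
  # branches or PRs never cancel each other.
  group: ${{ github.workflow }}-${{ github.ref }}
  # A newer run in the same group cancels the older run.
  cancel-in-progress: true
```

Keying on the ref as well as the workflow limits cancellation to runs that a newer push to the same branch or PR has actually superseded.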

Acceptance Criteria

  • ci.yml has a top-level concurrency block with cancel-in-progress: true keyed on workflow + ref.
  • Every job in ci.yml has a timeout-minutes value appropriate to its workload (fast jobs ~10 min, test jobs ~20 min, coverage/benchmark/docker ~30 min).
  • nightly-quality.yml has a top-level concurrency block with cancel-in-progress: true.
  • The full-quality-suite job in nightly-quality.yml has a timeout-minutes value (~45 min).
  • Both workflow files remain valid YAML and parse correctly.
  • CHANGELOG is updated with an entry describing the change.
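
As an illustration of the timeout tiers in the criteria above, a per-job `timeout-minutes` setting might look like the sketch below. Only three of the 11 jobs are shown. The job names and approximate values come from this issue; the `runs-on` label and the step bodies are placeholders, not the real workflow contents.

```yaml
# Hypothetical excerpt of .forgejo/workflows/ci.yml; remaining jobs omitted.
jobs:
  lint:
    runs-on: docker          # assumed runner label
    timeout-minutes: 10      # fast job tier (~10 min)
    steps:
      - run: echo "lint steps go here"         # placeholder step

  unit_tests:
    runs-on: docker          # assumed runner label
    timeout-minutes: 20      # test job tier (~20 min)
    steps:
      - run: echo "unit test steps go here"    # placeholder step

  docker:
    runs-on: docker          # assumed runner label
    timeout-minutes: 30      # coverage/benchmark/docker tier (~30 min)
    steps:
      - run: echo "image build steps go here"  # placeholder step
```

The timeout is set per job rather than per step, so a hang anywhere in the job releases the runner once the limit is reached.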

Subtasks

  • Investigate CI runner delays and identify root cause
  • Add concurrency block to .forgejo/workflows/ci.yml
  • Add timeout-minutes to all 11 jobs in ci.yml
  • Add concurrency block to .forgejo/workflows/nightly-quality.yml
  • Add timeout-minutes to full-quality-suite job in nightly-quality.yml
  • Validate YAML syntax of both files
  • Update CHANGELOG
  • Commit, push, and open PR
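
For the nightly workflow subtasks, a combined sketch of both controls might look like this. The issue does not state the concurrency key used for nightly-quality.yml, so the group expression here is an assumption, as are the `runs-on` label and the placeholder step; only the job name and the ~45 minute value come from this issue.

```yaml
# Hypothetical excerpt of .forgejo/workflows/nightly-quality.yml.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}   # assumed key, not confirmed by the issue
  cancel-in-progress: true

jobs:
  full-quality-suite:
    runs-on: docker          # assumed runner label
    timeout-minutes: 45      # ~45 min, per the acceptance criteria
    steps:
      - run: echo "full quality suite steps go here"   # placeholder step
```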

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
freemo added this to the v3.1.0 milestone 2026-02-26 03:47:20 +00:00
freemo 2026-02-26 19:23:56 +00:00
  • closed this issue
  • added the Points 5 label
Author
Owner

Closing: all subtasks checked off, PR #452 merged on 2026-02-26. CI concurrency groups and job timeouts are now active in both ci.yml and nightly-quality.yml.

Reference
cleveragents/cleveragents-core#451