[AUTO-INF-5] Add job-level timeout-minutes to all CI jobs in ci.yml to prevent indefinite hangs #9943

Open
opened 2026-04-16 06:23:21 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Commit message: ci: add job-level timeout-minutes to all jobs in ci.yml
  • Branch name: auto-inf-5/add-ci-job-timeouts

Background and Context

Twelve of the thirteen jobs in .forgejo/workflows/ci.yml have no timeout-minutes setting (only e2e_tests sets one), so a hung job (e.g., a deadlocked test process, a stalled network download, or a runaway nox session) will consume runner resources indefinitely. Workflow run history shows runs lasting 1h32m and 2h11m, indicating this is already causing real problems.

Expected Behavior

Every job in .forgejo/workflows/ci.yml has a timeout-minutes value appropriate to its observed run duration plus a safety margin. A hung job fails with a clear timeout error as soon as its limit elapses, freeing runner capacity and giving contributors prompt feedback.

Acceptance Criteria

  • All jobs in .forgejo/workflows/ci.yml have a timeout-minutes field set
  • Timeout values match or exceed the proposed table (see Proposed Improvement below)
  • e2e_tests retains its existing timeout-minutes: 45
  • CI pipeline passes with the new timeouts in place
  • No job is observed to time out under normal (non-hung) conditions

Subtasks

  • Audit .forgejo/workflows/ci.yml and list all jobs missing timeout-minutes
  • Add timeout-minutes: 15 to lint
  • Add timeout-minutes: 20 to typecheck
  • Add timeout-minutes: 20 to security
  • Add timeout-minutes: 15 to quality
  • Add timeout-minutes: 60 to unit_tests
  • Add timeout-minutes: 90 to integration_tests
  • Add timeout-minutes: 60 to coverage
  • Add timeout-minutes: 15 to build
  • Add timeout-minutes: 30 to docker
  • Add timeout-minutes: 15 to helm
  • Add timeout-minutes: 10 to push-validation
  • Add timeout-minutes: 5 to status-check
  • Open a PR, verify CI passes, and merge

Definition of Done

This issue is closed when all jobs in .forgejo/workflows/ci.yml have timeout-minutes set, the PR is merged to the default branch, and no normal CI run is observed to time out due to the new limits.


Summary

Twelve of the thirteen jobs in .forgejo/workflows/ci.yml have no timeout-minutes setting (only e2e_tests sets one), so a hung job (e.g., a deadlocked test process, a stalled network download, or a runaway nox session) will consume runner resources indefinitely. Workflow run history shows runs lasting 1h32m and 2h11m, indicating this is already causing real problems.

Current State

In .forgejo/workflows/ci.yml, only the e2e_tests job has a timeout:

e2e_tests:
    runs-on: docker
    timeout-minutes: 45   # ← only job with a timeout

The following jobs have no timeout-minutes at all:

  • lint
  • typecheck
  • security
  • quality
  • unit_tests
  • integration_tests
  • coverage
  • build
  • docker
  • helm
  • push-validation
  • status-check

Without a timeout, a single hung job blocks the entire status-check gate indefinitely, tying up the runner and preventing other PRs from getting feedback.
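To illustrate why one hung job can stall the gate, here is a minimal sketch. It assumes that status-check aggregates the other jobs through a needs list; the issue does not show that list, so the one below is hypothetical, as are the placeholder steps.

```yaml
# Hypothetical fragment: the real `needs` list and steps in ci.yml are not shown in this issue.
jobs:
  unit_tests:
    runs-on: docker
    # With no timeout-minutes here, a deadlocked test keeps this job running,
    # and status-check below keeps waiting on it.
    steps:
      - run: echo "placeholder for the real test steps"
  status-check:
    runs-on: docker
    needs: [unit_tests]     # assumed dependency; waits for upstream jobs to finish
    steps:
      - run: echo "all required jobs finished"
```

Under this assumption, a timeout on status-check alone would probably not help, because the gate job does not start until its upstream jobs finish. The limit needs to sit on each upstream job.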

Proposed Improvement

Add timeout-minutes to every job in .forgejo/workflows/ci.yml based on observed run durations plus a safety margin:

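| Job | Proposed timeout (minutes) |
|-----|----------------------------|
| lint | 15 |
| typecheck | 20 |
| security | 20 |
| quality | 15 |
| unit_tests | 60 |
| integration_tests | 90 |
| e2e_tests | 45 (already set) |
| coverage | 60 |
| build | 15 |
| docker | 30 |
| helm | 15 |
| push-validation | 10 |
| status-check | 5 |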

Example diff for unit_tests:

unit_tests:
    runs-on: docker
    timeout-minutes: 60   # ← add this
    container:
        image: python:3.13-slim
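The same one-line change applies to the other jobs. The sketch below shows the smallest and largest proposed limits side by side, using job names and values from the table above. The runs-on value for these jobs and the placeholder steps are assumptions, since the issue only shows runs-on for e2e_tests and unit_tests.

```yaml
# Sketch only: job names and timeout values come from the proposed table;
# runs-on and the steps are placeholders, not the real contents of ci.yml.
jobs:
  lint:
    runs-on: docker
    timeout-minutes: 15   # short static check
    steps:
      - run: echo "placeholder for the real lint steps"
  integration_tests:
    runs-on: docker
    timeout-minutes: 90   # largest proposed limit
    steps:
      - run: echo "placeholder for the real integration test steps"
  status-check:
    runs-on: docker
    timeout-minutes: 5    # smallest proposed limit
    steps:
      - run: echo "placeholder for the real gate logic"
```

Keeping timeout-minutes directly under runs-on in every job, as e2e_tests already does, should make a missing limit easy to spot in review.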

Expected Impact

  • Reliability: Hung jobs fail fast with a clear timeout error instead of blocking the runner indefinitely.
  • Resource efficiency: Runner capacity is freed promptly when a job hangs, allowing other PRs to proceed.
  • Faster feedback: Contributors get a clear "timed out" signal rather than waiting hours for a job that will never complete.
  • Fewer anomalous runs like the 1h32m and 2h11m ones: Timeouts cap each job's worst-case duration and show which job hung rather than letting the run drag on unnoticed.

Duplicate Check

  • Searched open issues for keywords: timeout, timeout-minutes, hang, indefinite, stuck job
  • Searched closed issues for keywords: timeout, timeout-minutes, hang, indefinite
  • Searched for AUTO-INF worker issues (AUTO-INF-1 through AUTO-INF-10): none address job-level timeout-minutes in ci.yml
  • Existing AUTO-INF-1 issues address wall-clock speed (caching, parallelism) — not timeout safety
  • Existing AUTO-INF-3 issues address parallelism settings — not timeout safety
  • Result: No duplicates found

Automated by CleverAgents Bot
Agent: new-issue-creator

Author
Owner

🔍 Triage Decision — Verified

Issue: [AUTO-INF-5] Add job-level timeout-minutes to all CI jobs in ci.yml
Type: Task (CI/Infrastructure)
Priority: Medium
MoSCoW: Should Have

Rationale

This is a well-scoped, actionable infrastructure improvement. Workflow run history already shows anomalous 1h32m and 2h11m runs, confirming real resource waste from missing timeouts. The proposed timeout values are reasonable and evidence-based. This does not block any feature work but meaningfully improves CI reliability and runner efficiency.

Marking as Should Have — important for operational health but not blocking any milestone deliverable. No milestone assigned; this is a cross-cutting CI improvement.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#9943