[AUTO-INF-4] unit_tests CI job missing timeout-minutes allows indefinite hangs to silently block CI pipeline #10193

Open
opened 2026-04-17 04:53:45 +00:00 by HAL9000 · 0 comments
Owner

Problem

The unit_tests CI job in .forgejo/workflows/ci.yml has no timeout-minutes setting, while the e2e_tests job explicitly sets timeout-minutes: 45.

# e2e_tests — has timeout ✅
e2e_tests:
  runs-on: docker
  timeout-minutes: 45   # ← present
  container:
    image: python:3.13-slim

# unit_tests — no timeout ❌
unit_tests:
  runs-on: docker
  # timeout-minutes: ???  ← missing
  container:
    image: python:3.13-slim

Without a job-level timeout, a single hanging Behave scenario (e.g., due to a deadlock in multiprocessing.Pool, an infinite retry loop, or a blocking I/O call) keeps the unit_tests job running until someone cancels it or a runner-level limit, if one is configured, kills it. In the meantime the job consumes CI runner resources and blocks all dependent jobs (docker, status-check).
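
To find other jobs with the same gap, a small check along the following lines could be run against the workflow file. This is only a sketch: the function name jobs_missing_timeout and the SAMPLE_WORKFLOW text are hypothetical, and the line-based scan assumes two-space indentation with jobs listed directly under a top-level jobs: key, which may not match the real ci.yml. A proper YAML parser would be more robust.

```python
def jobs_missing_timeout(workflow_text: str) -> list[str]:
    """Return names of jobs under `jobs:` that have no job-level timeout-minutes.

    Assumes two-space indentation: job names at indent 2, job keys at indent 4.
    """
    missing: list[str] = []
    in_jobs = False
    current = None        # name of the job currently being scanned
    has_timeout = False   # whether that job declared timeout-minutes

    def close_job() -> None:
        # Record the job we just finished scanning if it had no timeout.
        if current is not None and not has_timeout:
            missing.append(current)

    for raw in workflow_text.splitlines():
        line = raw.split(" #", 1)[0].rstrip()  # crude trailing-comment strip
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        indent = len(line) - len(line.lstrip())
        if indent == 0:
            # A new top-level key ends whatever job was open.
            close_job()
            current = None
            in_jobs = line.startswith("jobs:")
        elif in_jobs and indent == 2 and line.endswith(":"):
            # A new job starts; close the previous one first.
            close_job()
            current = line.strip()[:-1]
            has_timeout = False
        elif in_jobs and indent == 4 and line.strip().startswith("timeout-minutes:"):
            has_timeout = True
    close_job()
    return missing


# Hypothetical excerpt modelled on the two jobs quoted in this issue.
SAMPLE_WORKFLOW = """
jobs:
  e2e_tests:
    runs-on: docker
    timeout-minutes: 45
    container:
      image: python:3.13-slim
  unit_tests:
    runs-on: docker
    # timeout-minutes: ???
    container:
      image: python:3.13-slim
"""

print(jobs_missing_timeout(SAMPLE_WORKFLOW))
```

Commented-out timeout-minutes lines are deliberately ignored, so the unit_tests job in the sample is reported as missing a timeout.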

Impact

  • CI runner starvation: A hung unit_tests job holds a Docker runner slot indefinitely, preventing other PRs from getting CI time
  • No automatic recovery: Without a timeout, the job never fails — it just hangs until manually cancelled
  • Masks root causes: The consistent ~6m45s failure in issue #2850 suggests a timeout or deadlock; without an explicit timeout-minutes, it's unclear whether the job is timing out at the Forgejo runner level or failing for another reason
  • Blocks docker job: The docker job has needs: [lint, typecheck, security, quality, unit_tests] — a hung unit_tests blocks Docker image builds indefinitely
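
The difference a deadline makes can be illustrated outside CI. The sketch below is not how Forgejo enforces timeout-minutes; it uses Python's subprocess.run with a timeout to show the same idea, namely that a hung process becomes an explicit, reportable outcome instead of an open-ended wait. The helper name run_with_deadline and the HANGING_CMD stand-in for a stuck test run are hypothetical.

```python
import subprocess
import sys


def run_with_deadline(cmd: list[str], timeout_s: float) -> str:
    """Run cmd and classify the outcome as passed, failed, or timed out."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # once timeout_s seconds have elapsed.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timed out"
    return "passed" if completed.returncode == 0 else "failed"


# Stand-in for a hung test run: a child process that sleeps for a minute.
HANGING_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]

if __name__ == "__main__":
    # With a 1-second deadline the hang is reported instead of blocking.
    print(run_with_deadline(HANGING_CMD, timeout_s=1))
```

With a deadline, the hang surfaces as a distinct "timed out" result that can be told apart from an ordinary test failure, which is the distinction the third bullet above says is currently missing.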

Proposed Fix

Add timeout-minutes: 20 to the unit_tests CI job in .forgejo/workflows/ci.yml:

unit_tests:
  runs-on: docker
  timeout-minutes: 20   # ← ADD THIS
  container:
    image: python:3.13-slim

A 20-minute timeout is roughly three times the ~6-7 minutes that the 587 Behave scenarios take locally, which leaves headroom for slower CI runners while still failing a hung job before it consumes excessive CI resources.

Related Issues

  • #2850: P0 blocker, unit_tests CI job persistently failing after ~6m45s

Duplicate Check

Searched open issues for: timeout unit_tests, timeout-minutes, CI timeout behave, unit_tests hang, CI hang. No existing open or closed issue specifically addresses the missing timeout-minutes in the unit_tests CI job. Issue #2850 describes the symptom but does not propose adding a timeout as a mitigation.


Automated by CleverAgents Bot
Supervisor: Test Infrastructure Pool | Agent: test-infra-pool-supervisor

Reference
cleveragents/cleveragents-core#10193