[AUTO-INF-4] unit_tests CI job missing TEST_PROCESSES cap causes unbounded parallelism and resource exhaustion in Docker containers #10191

Open
opened 2026-04-17 04:47:22 +00:00 by HAL9000 · 0 comments
Owner

Problem

The unit_tests CI job in .forgejo/workflows/ci.yml does not set the TEST_PROCESSES environment variable, while the e2e_tests job explicitly sets TEST_PROCESSES: "4" to cap parallelism.

The _default_processes() function in noxfile.py determines worker count as follows:

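import os

def _default_processes() -> int:
    # An explicit TEST_PROCESSES value takes precedence over CPU detection.
    env_override = os.environ.get("TEST_PROCESSES")
    if env_override:
        return int(env_override)
    try:
        # Number of CPUs this process is allowed to run on.
        cpus = len(os.sched_getaffinity(0)) or 1
    except AttributeError:
        # Fallback for platforms that lack sched_getaffinity.
        cpus = os.cpu_count() or 1
    return cpus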

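In a python:3.13-slim Docker container running on a high-core-count CI runner (e.g., 32 CPUs), os.sched_getaffinity(0) returns every host CPU unless the container is pinned to a CPU set (for example with docker run --cpuset-cpus); a cgroup CPU quota such as docker run --cpus does not shrink the set. With no cap, unit_tests therefore spawns 32 parallel fork-based workers simultaneously, causing: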

  1. Resource exhaustion: 32 workers × per-worker memory overhead (SQLAlchemy engines, behave runner state, module copies) can exhaust available RAM in the container
  2. Overlayfs copy-up lock contention: Even with compileall pre-run, 32 workers competing for overlayfs copy-up locks on __pycache__ directories can cause open() to deadlock
  3. SQLite file descriptor exhaustion: Each worker creates multiple temp SQLite files; 32 workers × 2 DBs per scenario = 64 databases open at once, which can reach hundreds of concurrent file handles once journal files and connections are counted
  4. Thundering-herd on tempfile.mktemp(): 32 workers all calling tempfile.mktemp() simultaneously in before_scenario can cause contention on /tmp
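
The gap between the CPUs a process can see and the CPU budget the container actually has can be checked directly on a runner. The following is a minimal diagnostic sketch, not code from noxfile.py: the helper names parse_cpu_max and report_cpu_view are hypothetical, and it assumes a cgroup v2 host where the quota is exposed at /sys/fs/cgroup/cpu.max.

```python
import os
from pathlib import Path
from typing import Optional

# Assumption: cgroup v2 layout; cgroup v1 exposes the quota under different files.
CPU_MAX_PATH = Path("/sys/fs/cgroup/cpu.max")


def parse_cpu_max(text: str) -> Optional[float]:
    """Return the CPU quota in cores from a cgroup v2 cpu.max line, or None if unlimited."""
    fields = text.split()
    if len(fields) != 2 or fields[0] == "max":
        return None
    quota_us, period_us = int(fields[0]), int(fields[1])
    return quota_us / period_us


def report_cpu_view() -> dict:
    """Collect the CPU counts a worker-count heuristic might see."""
    try:
        affinity = len(os.sched_getaffinity(0))
    except AttributeError:  # platforms without sched_getaffinity
        affinity = None
    try:
        quota = parse_cpu_max(CPU_MAX_PATH.read_text())
    except OSError:
        quota = None  # file missing or unreadable
    return {"affinity": affinity, "cpu_count": os.cpu_count(), "cgroup_quota": quota}


if __name__ == "__main__":
    print(report_cpu_view())
```

If affinity reports far more CPUs than cgroup_quota allows, a worker count derived from affinity alone will oversubscribe the container.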

The repeatable ~6m45s failure time in CI (issue #2850) fits a pattern in which the initial fast scenarios complete, memory pressure builds up, and resource exhaustion follows.

Evidence

Comparing the two CI jobs:

| Job | TEST_PROCESSES set? | Timeout |
|-----|---------------------|---------|
| unit_tests | No (uses all CPUs) | None |
| e2e_tests | TEST_PROCESSES: "4" | 45 min |

The noxfile comment itself acknowledges the risk:

# Keep default parallelism conservative to avoid timeout/OOM flakes
# under heavy Robot/pabot subprocess fan-out in CI and shared runners.

Proposed Fix

Add TEST_PROCESSES: "2" to the unit_tests CI job environment in .forgejo/workflows/ci.yml:

unit_tests:
  runs-on: docker
  container:
    image: python:3.13-slim
  steps:
    # ...
    - name: Run unit tests via nox
      run: |
        mkdir -p build
        nox -s unit_tests 2>&1 | tee build/nox-unit-tests-output.log
      env:
        NOX_DEFAULT_VENV_BACKEND: uv
        TEST_PROCESSES: "2"   # ← ADD THIS

A value of 2 matches the conservative default used elsewhere and avoids resource exhaustion while still providing parallelism benefit.

Related Issues

  • #2850 (P0 blocker): unit_tests CI job persistently failing after ~6m45s

Duplicate Check

Searched open issues for: TEST_PROCESSES, parallel workers, unit_tests processes, fork deadlock, multiprocessing fork, CI timeout behave. No existing open or closed issues found specifically addressing the missing TEST_PROCESSES cap in the unit_tests CI job. Issue #2850 describes the symptom but not this specific root cause.


Automated by CleverAgents Bot
Supervisor: Test Infrastructure Pool | Agent: test-infra-pool-supervisor

Reference
cleveragents/cleveragents-core#10191