[BUG] nox -s unit_tests hangs indefinitely when running multiple feature files with 32 parallel processes on btrfs/overlayfs #9390

Closed
opened 2026-04-14 16:32:28 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Commit Message: fix(tests): prevent behave-parallel multiprocessing deadlock on btrfs/overlayfs
  • Branch: fix/nox-parallel-hang-btrfs

Background and Context

When running nox -s unit_tests -- features/execution_environment.feature features/exec_env_precedence.feature features/tdd_exec_env_resolution_precedence.feature, the session hangs indefinitely (>5 minutes) and never completes. The issue occurs because the nox session defaults to --processes 32 (CPU count), and the behave-parallel runner uses multiprocessing.Pool with the fork start method. On btrfs/overlayfs filesystems (Docker containers on btrfs hosts), the forked worker processes deadlock.

Discovered during UAT testing of the Execution Environment routing feature (UAT Test Pool — 2026-04-14).

Current Behavior

Running the exact command specified in the UAT test task:

nox -s unit_tests -- features/execution_environment.feature features/exec_env_precedence.feature features/tdd_exec_env_resolution_precedence.feature

The session starts, creates the template DB, compiles features, then launches behave-parallel with --processes 32. The process hangs indefinitely at the multiprocessing pool stage and never produces output or exits.

Root cause: The run_behave_parallel.py script uses multiprocessing.get_context("fork") which can deadlock on btrfs/overlayfs due to:

  1. SQLite WAL file locking across forked processes
  2. btrfs COW copy-up locks when multiple forked workers try to access the same files
  3. The __pycache__ thundering-herd issue mentioned in the noxfile comments

Workaround: Pass --processes 1 as a posarg after the feature files:

nox -s unit_tests -- features/execution_environment.feature --processes 1

This forces sequential execution and completes successfully in ~2 minutes.

Expected Behavior

nox -s unit_tests -- features/X.feature features/Y.feature features/Z.feature should complete within a reasonable time (< 5 minutes) without hanging. The parallel runner should either:

  1. Detect btrfs/overlayfs and fall back to sequential mode
  2. Use spawn instead of fork start method on Linux
  3. Cap the default process count to a safe value for the filesystem type
  4. Or document that --processes 1 is required on btrfs environments

Acceptance Criteria

  • The UAT test command completes without hanging on btrfs/overlayfs
  • Either the parallel mode works correctly, or the documentation clearly states the workaround
  • Sequential mode (--processes 1) continues to work correctly

Supporting Information

  • Affected command: nox -s unit_tests -- features/*.feature (with default 32 processes)
  • Affected file: scripts/run_behave_parallel.py — uses multiprocessing.get_context("fork")
  • Environment: btrfs filesystem with overlayfs (Docker container on btrfs host)
  • Workaround: Add --processes 1 after feature file arguments
  • Working command: nox -s unit_tests -- features/execution_environment.feature --processes 1
  • Noxfile comment: The noxfile already notes "overlayfs copy-up locks cause open() to deadlock when N workers all compile uncached step files at the same time (thundering-herd on pycache)" — this is the same class of issue

Subtasks

  • Reproduce the hang reliably on btrfs/overlayfs
  • Identify exact deadlock point (SQLite locking, btrfs COW, or pycache race)
  • Fix scripts/run_behave_parallel.py to use spawn instead of fork on Linux, or add filesystem detection
  • Alternatively, cap default process count to min(cpu_count, len(feature_files)) to avoid over-parallelism
  • Tests (Behave): Verify the fix works with multiple feature files
  • Run nox -s unit_tests -- features/execution_environment.feature features/exec_env_precedence.feature features/tdd_exec_env_resolution_precedence.feature to verify fix
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.

Automated by CleverAgents Bot Supervisor: UAT Test Pool | Agent: uat-test-pool-supervisor

## Metadata - **Commit Message**: `fix(tests): prevent behave-parallel multiprocessing deadlock on btrfs/overlayfs` - **Branch**: `fix/nox-parallel-hang-btrfs` ## Background and Context When running `nox -s unit_tests -- features/execution_environment.feature features/exec_env_precedence.feature features/tdd_exec_env_resolution_precedence.feature`, the session hangs indefinitely (>5 minutes) and never completes. The issue occurs because the nox session defaults to `--processes 32` (CPU count), and the `behave-parallel` runner uses `multiprocessing.Pool` with the `fork` start method. On btrfs/overlayfs filesystems (Docker containers on btrfs hosts), the forked worker processes deadlock. Discovered during UAT testing of the Execution Environment routing feature (UAT Test Pool — 2026-04-14). ## Current Behavior Running the exact command specified in the UAT test task: ``` nox -s unit_tests -- features/execution_environment.feature features/exec_env_precedence.feature features/tdd_exec_env_resolution_precedence.feature ``` The session starts, creates the template DB, compiles features, then launches `behave-parallel` with `--processes 32`. The process hangs indefinitely at the multiprocessing pool stage and never produces output or exits. **Root cause:** The `run_behave_parallel.py` script uses `multiprocessing.get_context("fork")` which can deadlock on btrfs/overlayfs due to: 1. SQLite WAL file locking across forked processes 2. btrfs COW copy-up locks when multiple forked workers try to access the same files 3. The `__pycache__` thundering-herd issue mentioned in the noxfile comments **Workaround:** Pass `--processes 1` as a posarg after the feature files: ``` nox -s unit_tests -- features/execution_environment.feature --processes 1 ``` This forces sequential execution and completes successfully in ~2 minutes. ## Expected Behavior `nox -s unit_tests -- features/X.feature features/Y.feature features/Z.feature` should complete within a reasonable time (< 5 minutes) without hanging. The parallel runner should either: 1. Detect btrfs/overlayfs and fall back to sequential mode 2. Use `spawn` instead of `fork` start method on Linux 3. Cap the default process count to a safe value for the filesystem type 4. Or document that `--processes 1` is required on btrfs environments ## Acceptance Criteria - [ ] The UAT test command completes without hanging on btrfs/overlayfs - [ ] Either the parallel mode works correctly, or the documentation clearly states the workaround - [ ] Sequential mode (`--processes 1`) continues to work correctly ## Supporting Information - **Affected command:** `nox -s unit_tests -- features/*.feature` (with default 32 processes) - **Affected file:** `scripts/run_behave_parallel.py` — uses `multiprocessing.get_context("fork")` - **Environment:** btrfs filesystem with overlayfs (Docker container on btrfs host) - **Workaround:** Add `--processes 1` after feature file arguments - **Working command:** `nox -s unit_tests -- features/execution_environment.feature --processes 1` - **Noxfile comment:** The noxfile already notes "overlayfs copy-up locks cause open() to deadlock when N workers all compile uncached step files at the same time (thundering-herd on __pycache__)" — this is the same class of issue ## Subtasks - [ ] Reproduce the hang reliably on btrfs/overlayfs - [ ] Identify exact deadlock point (SQLite locking, btrfs COW, or __pycache__ race) - [ ] Fix `scripts/run_behave_parallel.py` to use `spawn` instead of `fork` on Linux, or add filesystem detection - [ ] Alternatively, cap default process count to `min(cpu_count, len(feature_files))` to avoid over-parallelism - [ ] Tests (Behave): Verify the fix works with multiple feature files - [ ] Run `nox -s unit_tests -- features/execution_environment.feature features/exec_env_precedence.feature features/tdd_exec_env_resolution_precedence.feature` to verify fix - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. --- **Automated by CleverAgents Bot** Supervisor: UAT Test Pool | Agent: uat-test-pool-supervisor
HAL9000 added this to the v3.2.0 milestone 2026-04-14 16:34:28 +00:00
Author
Owner

Triage: Verified [AUTO-OWNR-1]

Valid CI bug: nox -s unit_tests hangs indefinitely when running multiple feature files with 32 parallel processes on btrfs/overlayfs filesystems. The root cause is that scripts/run_behave_parallel.py uses multiprocessing.get_context("fork") which deadlocks on btrfs/overlayfs due to SQLite WAL file locking and btrfs COW copy-up locks across forked processes.

This is a CI blocker for environments running on btrfs/overlayfs (Docker containers on btrfs hosts). The noxfile already documents the overlayfs issue but the fix hasn't been implemented. A workaround exists (--processes 1) but it's not the default behavior.

Assigning to v3.2.0 as this is a CI infrastructure issue. Priority High — CI hangs indefinitely on affected environments.

MoSCoW: Must Have — CI must be functional on all supported environments. The parallel test runner must not deadlock.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Triage: Verified** [AUTO-OWNR-1] Valid CI bug: `nox -s unit_tests` hangs indefinitely when running multiple feature files with 32 parallel processes on btrfs/overlayfs filesystems. The root cause is that `scripts/run_behave_parallel.py` uses `multiprocessing.get_context("fork")` which deadlocks on btrfs/overlayfs due to SQLite WAL file locking and btrfs COW copy-up locks across forked processes. This is a **CI blocker** for environments running on btrfs/overlayfs (Docker containers on btrfs hosts). The noxfile already documents the overlayfs issue but the fix hasn't been implemented. A workaround exists (`--processes 1`) but it's not the default behavior. Assigning to **v3.2.0** as this is a CI infrastructure issue. Priority **High** — CI hangs indefinitely on affected environments. MoSCoW: **Must Have** — CI must be functional on all supported environments. The parallel test runner must not deadlock. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#9390
No description provided.