[AUTO-INF-4] run_behave_parallel.py uses fork multiprocessing start method which is unsafe when parent process has threads (Python 3.13 CI deadlock risk) #10196

Open
opened 2026-04-17 05:01:00 +00:00 by HAL9000 · 0 comments
Owner

Problem

scripts/run_behave_parallel.py explicitly uses the fork start method for multiprocessing.Pool:

ctx = multiprocessing.get_context("fork")
with ctx.Pool(processes=min(processes, len(chunks))) as pool:
    results = pool.map(
        _worker_run_features,
        [(chunk, other_args) for chunk in chunks],
    )

The fork start method duplicates the parent's address space into each worker. This is fast (copy-on-write), but unsafe when the parent process has active threads at the time of forking, because only the thread that calls fork() is carried over into the child.

Why This Is Dangerous in CI

Before pool.map() is called, the parent process has already:

  1. Imported SQLAlchemy: the import alone is not expected to start threads, but any Engine or connection pool created before the fork carries its lock state and open connections into the workers
  2. Imported asyncio: _install_fast_sleep_patch() in environment.py patches asyncio.sleep. asyncio only starts threads once an event loop uses its default executor (for example via run_in_executor or DNS lookups), so this matters if a loop has already run in the parent
  3. Imported cleveragents: the pre-import step in run_behave_parallel.py imports the full cleveragents package, which may start background threads (e.g., for lifecycle management, health checks)
  4. Run before_all: the parent process calls before_all, which initializes the mock AI provider, patches MigrationRunner, and patches Scenario.run
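
Whether any of these steps really leaves a thread running in the parent can be measured rather than assumed. The sketch below is a hypothetical helper (new_threads_after is not part of run_behave_parallel.py): it snapshots threading.enumerate() before and after a callable and reports the threads that appeared and are still alive.

```python
import threading
from typing import Callable, List


def new_threads_after(step: Callable[[], object]) -> List[str]:
    """Run `step` and return the names of threads it left running."""
    # threading.enumerate() only lists threads that are currently alive.
    before = set(threading.enumerate())
    step()
    return sorted(t.name for t in threading.enumerate() if t not in before)
```

It could be wrapped around each pre-fork step, for example new_threads_after(lambda: importlib.import_module("cleveragents")). A module that was already imported earlier in the process will report nothing, so the check has to wrap the first import.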

When fork() is called with active threads in the parent:

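  • Mutexes held by other threads are copied in their locked state: if a thread holds a lock at the moment of fork, the child inherits the locked mutex but not the thread that would release it, so the child deadlocks as soon as it tries to acquire that lock
  • Python's GIL is not the main hazard: CPython reinitializes the GIL in the child after fork, and the per-interpreter GIL (available since Python 3.12) does not change that. The risk lies with locks that CPython does not reset, such as application-level threading.Lock objects and locks inside C extensions or the C library
  • File descriptors are duplicated: open SQLite file handles, log file handles, and socket connections are shared between parent and workers, so writes can interleave and a single connection can end up used by several processes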

Python's multiprocessing documentation flags this in its description of the fork start method:

"Note that safely forking a multithreaded process is problematic."

Since Python 3.12, os.fork() also emits a DeprecationWarning when it detects that the process has multiple threads.
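
The locked-mutex hazard can be reproduced in isolation. The following is a minimal POSIX-only sketch, not code from the repository: a helper thread holds a plain threading.Lock while the main thread forks, and the child then tries to take the same lock with a timeout. The function name and the timing values are illustrative assumptions.

```python
import os
import threading
import time
import warnings


def child_sees_locked_lock(hold_seconds: float = 0.5) -> bool:
    """Fork while a helper thread holds a lock; True if the child could not acquire it."""
    lock = threading.Lock()
    held = threading.Event()

    def holder() -> None:
        with lock:
            held.set()
            time.sleep(hold_seconds)

    helper = threading.Thread(target=holder)
    helper.start()
    held.wait()  # fork only once the helper thread owns the lock

    with warnings.catch_warnings():
        # Python 3.12+ warns about forking a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: the helper thread does not exist here, so nothing releases the lock.
        acquired = lock.acquire(timeout=hold_seconds * 2)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    helper.join()
    return os.waitstatus_to_exitcode(status) == 1
```

In the parent the lock is released after hold_seconds, yet the child still times out, because its copy of the lock was taken by a thread that was never duplicated. A worker that calls acquire() without a timeout would hang indefinitely instead.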

Observed Symptoms

The CI failure at a steady ~6m45s (issue #2850) fits a deadlock in which a forked worker inherits a locked mutex. The failure occurs after some scenarios complete (not immediately), which suggests the deadlock is triggered by a specific scenario that tries to acquire a lock a parent thread held at fork time.
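
One way to test the deadlock hypothesis is to have each worker dump its thread stacks if it runs longer than expected. Below is a minimal sketch using the standard library's faulthandler module; the helper names and the 300-second default are assumptions, not values taken from the CI configuration.

```python
import faulthandler
import sys
from typing import Optional, TextIO


def arm_hang_watchdog(timeout_s: float = 300.0, file: Optional[TextIO] = None) -> None:
    """Dump all thread tracebacks to `file` unless disarmed within `timeout_s` seconds."""
    target = file if file is not None else sys.stderr
    # `file` must have a real file descriptor; it has to stay open until the dump or cancel.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=target, exit=False)


def disarm_hang_watchdog() -> None:
    """Cancel a pending dump, e.g. once the worker has finished its chunk."""
    faulthandler.cancel_dump_traceback_later()
```

faulthandler runs the timer on its own watchdog thread, so the dump is still written when the worker's main thread is blocked on a lock. For the same reason it should be armed inside the worker, after the fork, rather than in the parent.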

Proposed Fix

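Detect thread-unsafe fork conditions, either with a PYTHONWARNINGS setting (to surface the DeprecationWarning that Python 3.12+ emits for multi-threaded forks) or with an explicit check, and/or switch to the spawn start method with a module pre-import optimization: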

import logging
import multiprocessing
import threading

logger = logging.getLogger(__name__)

# Option A: use spawn; workers start from a fresh interpreter and re-import modules
ctx = multiprocessing.get_context("spawn")

# Option B: keep fork, but warn if other threads are alive just before forking
active_threads = [t for t in threading.enumerate() if t is not threading.main_thread()]
if active_threads:
    logger.warning(
        "fork() called with %d active threads (%s): deadlock risk. "
        "Consider using TEST_PROCESSES=1 or the spawn start method.",
        len(active_threads),
        ", ".join(t.name for t in active_threads),
    )
ctx = multiprocessing.get_context("fork")

Option B is lower-risk (no behavior change) and provides diagnostic information for CI failures.
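
The two options can also be combined: keep fork while the parent is single-threaded and fall back to spawn otherwise, with a Pool initializer doing the module pre-import. This is a sketch under assumptions, not the script's actual code; extra_threads, pick_start_method, preload_modules, and make_pool are hypothetical names.

```python
import importlib
import multiprocessing
import threading
from typing import Iterable, List


def extra_threads() -> List[threading.Thread]:
    """Threads other than the main thread that are alive right now."""
    main = threading.main_thread()
    return [t for t in threading.enumerate() if t is not main]


def pick_start_method() -> str:
    """Use fork only when the parent is single-threaded; otherwise use spawn."""
    return "spawn" if extra_threads() else "fork"


def preload_modules(names: Iterable[str]) -> None:
    """Pool initializer: import heavy modules once per worker, not once per task."""
    for name in names:
        importlib.import_module(name)


def make_pool(processes: int, preload: Iterable[str] = ()):
    """Build a Pool with the chosen start method and the pre-import initializer."""
    ctx = multiprocessing.get_context(pick_start_method())
    return ctx.Pool(
        processes=processes,
        initializer=preload_modules,
        initargs=(tuple(preload),),
    )
```

With spawn, the worker function and its arguments must be picklable and importable from a module, and nothing the parent set up in memory (such as the patches applied in before_all) is inherited, so that setup would need to run again inside each worker. This is why Option A is a behavior change while Option B is not.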

Related Issues

  • #2850 — P0 blocker: unit_tests CI job persistently failing after ~6m45s

Duplicate Check

Searched open issues for: fork spawn multiprocessing, fork deadlock, multiprocessing fork threads, fork start method, GIL fork, thread-unsafe fork. No existing open or closed issues found specifically addressing the fork start method safety concern in run_behave_parallel.py. Issue #2850 mentions "race conditions in multiprocessing.Pool with fork start method" as a hypothesis but does not propose the specific diagnostic or fix described here.


Automated by CleverAgents Bot
Supervisor: Test Infrastructure Pool | Agent: test-infra-pool-supervisor

Reference
cleveragents/cleveragents-core#10196