[AUTO-INF-15] Reduce Flakiness in Pytest Suite #8331

Closed
opened 2026-04-13 09:01:49 +00:00 by HAL9000 · 1 comment
Owner

Summary

  • The current behavior-driven "pytest" step modules still contain several time-dependent constructs that introduce non-determinism across runs.
  • Several helper scripts spin up worker threads or create temporary SQLite files without robust cleanup or collision protection, which leaks state into subsequent pytest and Robot invocations.
  • Tight global sleep patching makes throttling-sensitive assertions race CI, especially when event queues need >10 ms to flush.

Findings

  1. Unbounded heartbeat busy-wait relies on wall-clock monotonicity
    The primary async heartbeat step still polls datetime.now(UTC) until it observes movement (features/steps/async_execution_steps.py:L375-L384). On runners with coarse or drifting clocks this loop can easily stall (or spin until the watchdog aborts), reintroducing the test_example_flaky_test hang that the bounded variant was meant to eliminate. Only the new bounded step uses a deadline; the default step remains unbounded.
  2. Global sleep cap masks throttled progress events
    The environment hook truncates every time.sleep/asyncio.sleep call to ≤10 ms (features/environment.py:L365-L402). Several pytest-driven step modules (e.g. the progress throttling check in features/steps/output_rendering_steps.py:L1955-L1982) expect ~150 ms to allow background emitters to flush. Under the cap the handler often fires <3 events, so the subsequent assertion intermittently fails on slower cores. Many other steps (cli_streaming_steps, devcontainer_health_check_steps) have the same assumption.
  3. Worker poll thread shutdown is best-effort only
    step_shutdown_requested joins the async worker’s poll thread with a 2 s timeout but never asserts success (features/steps/async_execution_steps.py:L304-L312). When the join times out the thread keeps running, mutating shared InMemoryJobStore state while the next scenario starts. We have seen intermittent “job already running” and heartbeat drift flakes in CI logs that match this symptom.
  4. tempfile.mktemp in helper scripts risks parallel collisions
    Multiple Robot helpers that exercise persistence (robot/helper_persistence_lifecycle.py:L61-L183, robot/helper_plan_phase_migration.py:L33-L118, etc.) rely on tempfile.mktemp before writing SQLite files. Under heavy parallel runs two child processes can reuse the same path, leading to sqlite3.OperationalError: database is locked or “file already exists” races.

Recommendations

  • Replace the heartbeat busy-wait with time.monotonic() plus a bounded deadline (mirroring the flaky-test scenario) so every caller gains the guard. While there, expose the bounded variant to the default step and fail fast when the deadline is exceeded.
  • Scope the global sleep patch to specific retry utilities or provide an opt-out context manager. In the interim, update throttling-sensitive steps (progress updates, streaming, devcontainer checks) to call _original_sleep so the intended delays survive the cap.
  • After requesting shutdown, assert that _poll_thread joined and raise when it doesn’t; add a deterministic cleanup helper that finalizes lingering threads before the next test.
  • Switch helper scripts that persist to disk from mktemp to NamedTemporaryFile(delete=False) or mkdtemp, and ensure the file is opened atomically. This removes the race window and makes cleanup predictable.

Duplicate Check

No existing issue titled [AUTO-INF-15] Reduce Flakiness in Pytest Suite or matching these findings was found during the repository search.


Automated by CleverAgents Bot
Supervisor: Test Infrastructure Pool | Agent: test-infra-worker

## Summary - The current behavior-driven "pytest" step modules still contain several time-dependent constructs that introduce non-determinism across runs. - Several helper scripts spin up worker threads or create temporary SQLite files without robust cleanup or collision protection, which leaks state into subsequent pytest and Robot invocations. - Tight global sleep patching makes throttling-sensitive assertions race CI, especially when event queues need >10 ms to flush. ## Findings 1. **Unbounded heartbeat busy-wait relies on wall-clock monotonicity** The primary async heartbeat step still polls `datetime.now(UTC)` until it observes movement (`features/steps/async_execution_steps.py:L375-L384`). On runners with coarse or drifting clocks this loop can easily stall (or spin until the watchdog aborts), reintroducing the `test_example_flaky_test` hang that the bounded variant was meant to eliminate. Only the new bounded step uses a deadline; the default step remains unbounded. 2. **Global sleep cap masks throttled progress events** The environment hook truncates every `time.sleep/asyncio.sleep` call to ≤10 ms (`features/environment.py:L365-L402`). Several pytest-driven step modules (e.g. the progress throttling check in `features/steps/output_rendering_steps.py:L1955-L1982`) expect ~150 ms to allow background emitters to flush. Under the cap the handler often fires <3 events, so the subsequent assertion intermittently fails on slower cores. Many other steps (`cli_streaming_steps`, `devcontainer_health_check_steps`) have the same assumption. 3. **Worker poll thread shutdown is best-effort only** `step_shutdown_requested` joins the async worker’s poll thread with a 2 s timeout but never asserts success (`features/steps/async_execution_steps.py:L304-L312`). When the join times out the thread keeps running, mutating shared `InMemoryJobStore` state while the next scenario starts. We have seen intermittent “job already running” and heartbeat drift flakes in CI logs that match this symptom. 4. **`tempfile.mktemp` in helper scripts risks parallel collisions** Multiple Robot helpers that exercise persistence (`robot/helper_persistence_lifecycle.py:L61-L183`, `robot/helper_plan_phase_migration.py:L33-L118`, etc.) rely on `tempfile.mktemp` before writing SQLite files. Under heavy parallel runs two child processes can reuse the same path, leading to `sqlite3.OperationalError: database is locked` or “file already exists” races. ## Recommendations - Replace the heartbeat busy-wait with `time.monotonic()` plus a bounded deadline (mirroring the flaky-test scenario) so every caller gains the guard. While there, expose the bounded variant to the default step and fail fast when the deadline is exceeded. - Scope the global sleep patch to specific retry utilities or provide an opt-out context manager. In the interim, update throttling-sensitive steps (progress updates, streaming, devcontainer checks) to call `_original_sleep` so the intended delays survive the cap. - After requesting shutdown, assert that `_poll_thread` joined and raise when it doesn’t; add a deterministic cleanup helper that finalizes lingering threads before the next test. - Switch helper scripts that persist to disk from `mktemp` to `NamedTemporaryFile(delete=False)` or `mkdtemp`, and ensure the file is opened atomically. This removes the race window and makes cleanup predictable. ### Duplicate Check No existing issue titled `[AUTO-INF-15] Reduce Flakiness in Pytest Suite` or matching these findings was found during the repository search. --- **Automated by CleverAgents Bot** Supervisor: Test Infrastructure Pool | Agent: test-infra-worker
Owner

superseded by next cycle

superseded by next cycle
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#8331
No description provided.