[AUTO-INF-14] Reduce Flakiness in Behave Suite #8330

Closed
opened 2026-04-13 09:01:15 +00:00 by HAL9000 · 1 comment
Owner

Summary

  • Several Behave step implementations still depend on real wall-clock timing or the OS scheduler, which makes the suite intermittently fail on slower, virtualized, or heavily loaded runners.
  • The attached findings highlight concrete hotspots where we can replace sleeps and busy-waits with deterministic signals or dependency injection to stabilise the suite.

Findings

  1. Unbounded busy-wait loops rely on wall-clock ticks

    • Evidence: features/steps/async_execution_steps.py::step_record_heartbeat (lines 375-383) loops on datetime.now(UTC) with time.sleep(0.001) but never enforces a monotonic deadline. On VMs with coarse clock resolution the loop can stall for seconds or hang entirely. The bounded variant added in test_infra_flaky_test_example_steps.py (lines 17-37) shows the guard we need everywhere.
    • Impact: Heartbeat-related scenarios (e.g. async_execution.feature, test_infra_flaky_test_example.feature) occasionally spin until Behave times out, matching the flaky failures reported by infra.
    • Recommendation: Lift the monotonic deadline logic from step_record_heartbeat_bounded into the shared step_record_heartbeat, or inject a fake clock that advances deterministically so no busy-waiting is required.
  2. Timestamp assertions depend on 10 ms sleeps succeeding

    • Evidence: features/steps/resource_registry_model_steps.py (line 469) and multiple call sites in features/steps/memory_service_coverage_steps.py (e.g. lines 437, 505, 508, 763) call time.sleep(0.01) to “ensure timestamp differs” before asserting on updated_at / last_seen. When the fast-sleep patch caps sleeps to 10 ms and the underlying clock has ≥15 ms resolution (Windows, some KVM guests), the timestamps remain equal and the assertions fail.
    • Impact: Recent CI runs on infra workers showed non-deterministic failures in the memory service and resource registry features whenever the timestamp comparison stayed equal.
    • Recommendation: Replace sleeps with deterministic time injection (e.g. mock the clock, use dependency injection for datetime, or advance the model timestamps manually) so the assertions do not depend on the wall clock advancing between calls.
  3. Async graph executor tests drain the loop via fixed 50 ms sleeps

    • Evidence: features/steps/bridge_coverage_steps.py (lines 170, 232, 263) and features/steps/bridge_remaining_coverage_steps.py (lines 290, 357) fall back to loop.run_until_complete(asyncio.sleep(0.05)) when _active_tasks is empty after scheduling. On congested event loops the coroutine that populates _active_tasks often runs only after that 50 ms window has elapsed, producing empty result lists and assertion failures. Similar patterns exist in features/steps/langgraph_bridge_steps.py (line 124).
    • Impact: These scenarios are the top source of “executor produced no results” flakes in nightly behave-parallel runs.
    • Recommendation: Drain the loop by awaiting explicit futures (e.g. wrap the Rx subscription in an asyncio.Event or always gather the bridge tasks) instead of sleeping, so completion is tied to signals rather than elapsed time.
  4. Output throttling tests sleep 150 ms waiting for background emissions

    • Evidence: features/steps/output_rendering_steps.py (lines 1969-1980) toggles throttling and then relies on time.sleep(0.15) to let buffered events flush. Under CPU contention the throttled callback may not have fired by the time the 150 ms sleep returns, yielding incorrect event counts.
    • Impact: The output_rendering.feature suite is intermittently red on slower containers, particularly when run together with other CPU-heavy suites.
    • Recommendation: Expose a synchronous flush hook (or patch the throttle delay to zero inside the test) so the verification can proceed without sleeping.
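
For findings 1 and 2, the two recommended options (a monotonic deadline around any remaining polling, and an injected clock that only moves when the test says so) could look roughly like the sketch below. FakeClock and wait_for_tick are hypothetical names introduced for illustration; they are not the existing step_record_heartbeat_bounded helper, whose exact shape is not shown in this issue.

```python
import time
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Hypothetical test clock: time moves only when advance() is called."""

    def __init__(self, start=None):
        self._now = start or datetime(2026, 1, 1, tzinfo=timezone.utc)

    def now(self):
        return self._now

    def advance(self, seconds):
        self._now += timedelta(seconds=seconds)


def wait_for_tick(read_time, timeout=1.0, poll=0.001,
                  monotonic=time.monotonic, sleep=time.sleep):
    """Poll until read_time() changes, bounded by a monotonic deadline.

    read_time is any zero-argument callable returning a timestamp
    (an assumption; the real steps read datetime.now(UTC) directly).
    """
    start = read_time()
    deadline = monotonic() + timeout
    while read_time() == start:
        if monotonic() >= deadline:
            # Fail fast instead of spinning until Behave times out.
            raise TimeoutError(f"clock did not advance within {timeout}s")
        sleep(poll)
    return read_time()
```

With an injected clock, a timestamp step can call clock.advance(0.01) between the create and update calls and assert updated_at > created_at without sleeping at all; wait_for_tick is only needed where a step must still observe the real clock.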

Recommended Remediations

  • Adopt monotonic, deadline-aware guard logic for all heartbeat/time-sensitive steps and prefer injecting clocks over reading datetime.now() directly.
  • Replace sleep-dependent timestamp assertions with deterministic fixtures (e.g. factory methods that control updated_at / last_seen values).
  • Refactor the async bridge executor steps to await concrete futures or events instead of asyncio.sleep, and add helper utilities to share the pattern across features.
  • Provide test-only hooks to flush throttled output handlers synchronously, avoiding reliance on scheduler latency.
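
The last two remediations (awaiting an explicit completion signal, and flushing a throttled handler synchronously) might be sketched as follows. Everything here is an assumption made for illustration: collect_until_done, slow_executor, and ThrottledEmitter are hypothetical stand-ins, and the real bridge executor and output handler may expose different callbacks.

```python
import asyncio
import threading


async def collect_until_done(schedule, timeout=5.0):
    """Wait on an explicit completion signal instead of a fixed sleep."""
    results = []
    done = asyncio.Event()
    # schedule() is assumed to accept on_next / on_completed callbacks.
    handle = schedule(results.append, done.set)  # keep the task referenced
    await asyncio.wait_for(done.wait(), timeout)
    return results


def slow_executor(on_next, on_completed):
    """Stand-in executor that emits only after the old 50 ms window."""
    async def produce():
        await asyncio.sleep(0.1)
        on_next("node-a")
        on_next("node-b")
        on_completed()
    return asyncio.get_running_loop().create_task(produce())


class ThrottledEmitter:
    """Hypothetical throttled handler with a test-only synchronous flush()."""

    def __init__(self, sink, delay=0.15):
        self._sink = sink
        self._delay = delay
        self._buffer = []
        self._lock = threading.Lock()
        self._timer = None

    def emit(self, event):
        with self._lock:
            self._buffer.append(event)
            if self._timer is None:
                # Background flush after the throttle delay.
                self._timer = threading.Timer(self._delay, self.flush)
                self._timer.daemon = True
                self._timer.start()

    def flush(self):
        """Deliver buffered events now and cancel any pending timer."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
            pending, self._buffer = self._buffer, []
        for event in pending:
            self._sink(event)
```

A step would then call asyncio.run(collect_until_done(...)) or emitter.flush() and assert on the results immediately; the timeout on the wait is a safety net for a genuinely stuck executor, not a tuning knob for slow runners.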

Duplicate Check

  • Reviewed open infrastructure issues for Behave ([AUTO-INF-5] Fix behave testing configuration, #8326) — that ticket addresses configuration paths and tag filters, not the runtime flakiness outlined here.
  • Reviewed Bug: @tdd_expected_fail scenarios fail intermittently when steps raise non-AssertionError exceptions (#8294) — focuses on inversion guards, not the time-dependent behaviors cited above.

Automated by CleverAgents Bot
Supervisor: Test Infrastructure Pool | Agent: test-infra-worker

Owner

superseded by next cycle
Reference
cleveragents/cleveragents-core#8330