[AUTO-INF-2] Stabilize async Behave timing, context tier locks, and parallel runner pre-import #9798

Open
opened 2026-04-15 16:05:27 +00:00 by HAL9000 · 0 comments
Owner

Summary

  • CI failure rate (69.7%) is driven by deterministic lint regressions, untagged Behave failures, and flaky async execution/context tier scenarios.
  • Targeted Behave reruns via nox -s unit_tests -- <feature> hang during the import cleveragents pre-import, blocking reproduction and causing CI timeouts.
  • Proposed fixes cover async timing waits, locking shared state, guarding the pre-import, and coordinating with #9686 to replace tempfile.mktemp.

Evidence

  • CI logs: ci_logs/pr6723_unit_tests.log (truncated at 44,110 lines), ci_logs/pr6729_integration_tests.log (truncated at 1,479 lines), ci_logs_pr5271_chunks/chunk_001.log-chunk_005.log (57 ruff errors), ci_logs/ci_run_12455_unit_tests.log (B904 lint).
  • Local runs 2026-04-15:
    • nox -s lint -> pass
    • nox -s unit_tests -- features/async_execution.feature -> hang (>90 s) after import cleveragents pre-import
    • nox -s unit_tests -- features/context_tier_thread_safety.feature -> hang (>90 s)
    • nox -s unit_tests -- features/test_infra_flaky_test_example.feature -> hang (>120 s)
  • analysis_results.txt: 11 Category B scenarios without @tdd_expected_fail tags plus one malformed tag scenario.

Findings

Async execution timing flakiness

  • Step definitions poll with time.sleep(0.1) while the global sleep cap forces 10 ms slices, so three-second deadlines fire early (same root cause as closed flaky issue #1542).
  • Fix: replace polling with threading.Event.wait timeouts, or call time._original_sleep where a true wait is required.
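
The Event-based wait could look roughly like the sketch below. The helper names (wait_for_completion, run_task_in_background) are hypothetical and are not taken from features/steps/async_execution_steps.py; the point is only that threading.Event.wait blocks on an internal lock in CPython 3 rather than going through time.sleep, so a capped time.sleep should not shorten the deadline.

```python
import threading


def wait_for_completion(done: threading.Event, timeout_s: float = 3.0) -> bool:
    """Block until `done` is set or `timeout_s` elapses; return True if it was set."""
    # Event.wait does not call time.sleep, so a patched or capped
    # time.sleep does not slice this wait into shorter intervals.
    return done.wait(timeout=timeout_s)


def run_task_in_background(work, done: threading.Event) -> threading.Thread:
    """Run `work()` on a thread and set `done` when it finishes (hypothetical helper)."""
    def _target():
        try:
            work()
        finally:
            done.set()  # signal completion even if work() raised

    thread = threading.Thread(target=_target, daemon=True)
    thread.start()
    return thread
```

A step would then assert on the return value of wait_for_completion instead of counting sleep iterations.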

Context tier thread safety races

  • Verification steps read ContextTierService maps without holding the service lock; transient duplicates appear when 4-20 threads finish concurrently.
  • Fix: hold the service lock (with service._lock:) while gathering the tier state that the assertions check.
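
One way to apply this is to copy the state under the lock and assert on the copy. The sketch below uses a hypothetical FakeTierService; only the _lock attribute name comes from this issue, and the _tiers dictionary shape is an assumption, not the real ContextTierService layout.

```python
import threading
from collections import Counter


class FakeTierService:
    """Hypothetical stand-in; only the `_lock` attribute name comes from the issue."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tiers = {}  # assumed shape: tier name -> list of context ids

    def assign(self, tier, context_id):
        with self._lock:
            self._tiers.setdefault(tier, []).append(context_id)


def snapshot_tiers(service):
    """Copy tier state while holding the lock so assertions see one consistent view."""
    with service._lock:
        return {tier: list(ids) for tier, ids in service._tiers.items()}


def duplicate_ids(snapshot):
    """Return context ids that appear more than once across all tiers."""
    counts = Counter(cid for ids in snapshot.values() for cid in ids)
    return sorted(cid for cid, n in counts.items() if n > 1)
```

Asserting on the snapshot keeps the lock hold time short and avoids running assertion code while other threads are blocked.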

Parallel runner pre-import hang

  • run_behave_parallel.py pre-imports cleveragents; the UKO detail level map builder blocks, so nox -s unit_tests -- <feature> never forks workers.
  • Fix: add a timeout or a guarded subprocess for the pre-import, and continue without it when the guard trips.
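
A guarded probe might be sketched as follows. The function name preimport_is_safe and the 30-second default are assumptions for illustration; nothing here reflects the current contents of scripts/run_behave_parallel.py.

```python
import subprocess
import sys


def preimport_is_safe(module: str = "cleveragents", timeout_s: float = 30.0) -> bool:
    """Probe the import in a child process; return False if it hangs or fails."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", f"import {module}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising,
        # so a blocked import does not leave a stray process behind.
        return False
    return result.returncode == 0
```

The runner would call the real pre-import only when the probe returns True and otherwise fork workers without it. The trade-off is that a healthy import is paid for twice, once in the probe and once in the parent.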

Tempfile.mktemp race (tracked in #9686)

  • before_scenario uses tempfile.mktemp for DB URIs; mktemp returns a name without creating the file, which leaves a time-of-check to time-of-use (TOCTOU) race across 32 workers.
  • Coordinate with #9686 to land the mkstemp-based fix.
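
For reference, a mkstemp-based replacement could look like the sketch below. The helper name make_scenario_db_uri and the SQLAlchemy-style sqlite:/// URI are assumptions; the actual fix is whatever lands in #9686.

```python
import os
import tempfile


def make_scenario_db_uri(prefix: str = "behave_") -> tuple[str, str]:
    """Create a unique SQLite file atomically and return (uri, path)."""
    # mkstemp creates and opens the file in one step, so no other worker
    # can claim the same name between name generation and file creation.
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".db")
    os.close(fd)  # only the reserved path is needed; the DB layer reopens it
    return f"sqlite:///{path}", path
```

Unlike mktemp, the file exists on return, so the matching cleanup hook has to delete it.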

Deterministic TDD failures

  • Category B scenarios needing fixes or tags: features/cli_extensions.feature (lines 307, 321, 336), features/cli_global_format_flag.feature:51, features/cli_output_formats.feature:95, features/m6_autonomy_acceptance.feature:170, features/tdd_a2a_sdk_dependency.feature:15, features/tdd_event_bus_exception_swallow.feature:22, features/tdd_session_create_di.feature:22, features/cli_json_envelope.feature:22 and 61.
  • Unknown TDD tag: features/tui_slash_overlay_descriptions.feature line 36.
  • Lint regressions from PR 5271 and run_12455 need blocking pre-commit automation.
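
A preflight for the unknown-tag case could be as small as the sketch below. It assumes that every valid TDD tag starts with tdd_expected_fail or tdd_issue; that rule is a guess based on the tag names in this issue, and the real validate_tdd_tags rules may differ.

```python
import re

# Assumption: these are the only valid TDD tag stems; the real list lives in the project.
KNOWN_TDD_TAGS = ("tdd_expected_fail", "tdd_issue")
TAG_RE = re.compile(r"@(\S+)")


def find_unknown_tdd_tags(feature_text: str) -> list[tuple[int, str]]:
    """Return (line_number, tag) for every @tdd* tag that matches no known stem."""
    problems = []
    for lineno, line in enumerate(feature_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("@"):
            continue  # only Gherkin tag lines are inspected
        for tag in TAG_RE.findall(stripped):
            if tag.startswith("tdd") and not tag.startswith(KNOWN_TDD_TAGS):
                problems.append((lineno, tag))
    return problems
```

Detecting the 11 untagged Category B scenarios needs the list of known-failing scenarios as input, which this sketch does not attempt.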

CI infrastructure gaps

  • Log capture is limited to 170 KB, so integration verdicts are missing.
  • The default 32-worker parallelism oversubscribes CPU and SQLite temp DBs; consider a CI cap of 16.
  • A fast validate_tdd_tags preflight is needed to stop deterministic failures early.

Recommended actions

  1. Update async execution step definitions to rely on threading.Event waits and explicit time._original_sleep; keep assertions monotonic. (Medium)
  2. Guard ContextTierService verification steps with the service lock. (Low)
  3. Add timeout/fallback around import cleveragents pre-import and emit telemetry. (Low)
  4. Coordinate with #9686 to replace tempfile.mktemp with mkstemp in Behave hooks. (Very low)
  5. Fail fast in CI on missing @tdd_expected_fail/@tdd_issue tags and triage the 11 Category B scenarios plus the malformed tag. (Medium)
  6. Increase CI log capture limit or upload artifacts; align job timeout with capture window. (Medium)
  7. Cap CI TEST_PROCESSES (or new CI_MAX_PROCESSES) to 16 to reduce contention and .pyc thundering herd. (Low)
  8. Enforce nox -s lint in pre-commit and PR gates to stop deterministic lint failures. (Low)
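
The worker cap in action 7 could be resolved along these lines. The function name, the reliance on a CI environment variable, and the precedence between TEST_PROCESSES and CI_MAX_PROCESSES are all assumptions made for this sketch.

```python
import os


def resolve_worker_count(env=None, ci_cap: int = 16, default: int = 32) -> int:
    """Pick the worker count, clamped to the CI cap when running under CI."""
    env = os.environ if env is None else env
    requested = int(env.get("TEST_PROCESSES", default))
    if env.get("CI"):  # assumption: the CI system exports a non-empty CI variable
        # CI_MAX_PROCESSES is the new variable proposed in this issue.
        cap = int(env.get("CI_MAX_PROCESSES", ci_cap))
        requested = min(requested, cap)
    return max(1, requested)  # never return fewer than one worker
```

Local runs keep the default of 32 unless TEST_PROCESSES says otherwise; only CI runs are clamped.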

Duplicate Check

| Keyword | Matching Open Issues | Notes |
| --- | --- | --- |
| "flaky" | #9778 partial, #9783 partial, #9790 mention; #1542 closed | No open issue covers async timing/context tier/pre-import findings. |
| "async execution" | None | No open issue on features/async_execution.feature timing flake. |
| "context tier" | None | No open issue on ContextTierService assertion races. |
| "tempfile.mktemp" | #9686 direct match; #8115/#7935/#7798/#7737 unrelated | Reference #9686 for the DB path race. |
| "pre-import" | None | No open issue on import cleveragents pre-import hang. |

References

  • ci_logs/pr6723_unit_tests.log, ci_logs/pr6729_integration_tests.log, ci_logs_pr5271_chunks/*
  • scripts/run_behave_parallel.py, features/steps/async_execution_steps.py, features/context_tier_thread_safety.feature
  • analysis_results.txt, issue #1542 (closed), issue #9686 (open)
Reference
cleveragents/cleveragents-core#9798