[AUTO-INF-2] Stabilize async Behave timing, context tier locks, and parallel runner pre-import #9800

Open
opened 2026-04-15 16:05:31 +00:00 by HAL9000 · 0 comments
Owner

Summary

  • Three distinct flaky-test root causes were identified in cleveragents-core through CI log analysis, code inspection, and local reproduction attempts: (1) async execution step definitions rely on time.sleep() polling that is capped at 10 ms by the global sleep patch, making wall-clock deadline assertions unreliable; (2) ContextTierService thread-safety assertion steps read tier state without holding _lock, allowing transient inconsistency; (3) the import cleveragents pre-import in scripts/run_behave_parallel.py triggers UKO module-level initialization that hangs indefinitely on some runner configurations, blocking both local reproduction and CI.
  • One near-duplicate open issue exists for the tempfile.mktemp race condition: #9686 [AUTO-INF-4] already proposes replacing mktemp with mkstemp in features/environment.py. This issue cross-references that finding rather than duplicating it.
  • Eleven deterministic (non-flaky) failures in Category B of analysis_results.txt are untagged scenarios that fail on every run and contribute directly to the 69.7% CI failure rate; they require either bug fixes or proper @tdd_expected_fail tagging.

Evidence

CI Logs Analysed

| Log File | PR / Run | Lines | Outcome |
|----------|----------|-------|---------|
| ci_logs/pr6598_unit_tests.log | PR #6598 | 103,248 | PASS |
| ci_logs/pr6607_unit_tests.log | PR #6607 | 103,280 | PASS |
| ci_logs/pr_4221_unit_tests.log | PR #4221 | 103,023 | PASS |
| ci_logs/pr7004_unit_tests.log | PR #7004 | 103,417 | PASS |
| ci_logs_pr6723_unit_tests.log | PR #6723 | 44,110 | TRUNCATED (mid-run) |
| ci_logs/pr6729_integration_tests.log | PR #6729 | 1,479 | TRUNCATED at ~170 KB log-capture limit |
| ci_logs/pr4583_integration_tests_run12211_job5.log | PR #4583 | 1,475 | TRUNCATED |
| ci_logs_pr5271_chunks/chunk_001-005.log | PR #5271 | 1,442 total | FAIL — 57 ruff lint errors |
| ci_logs/ci_run_12455_unit_tests.log | run 12455 | — | FAIL — B904 in plan.py:4220 |

Local Reproduction Runs

| Run # | Command | Duration | Outcome |
|-------|---------|----------|---------|
| 1 | nox -s lint (fresh venv) | ~5 s | PASS — all checks passed |
| 2 | nox -s unit_tests -- features/async_execution.feature | >90 s | HANG — import cleveragents pre-import never returns |
| 3 | nox -s unit_tests -- features/context_tier_thread_safety.feature | >90 s | HANG — same pre-import hang |
| 4 | nox -s unit_tests -- features/test_infra_flaky_test_example.feature | >120 s | HANG — same pre-import hang |

Analysis Artifacts

  • analysis_results.txt — 81 Category A (@tdd_expected_fail) + 11 Category B (untagged failures) + 1 Unknown (malformed TDD tags)
  • features/test_infra_flaky_test_example.feature — documents the previously fixed heartbeat flaky test (issue #1542)
  • features/environment.py — global sleep cap, tempfile.mktemp usage, TDD inversion patch
  • scripts/run_behave_parallel.py — pre-import hang location

Findings

1. Async Execution Timing Tests (features/async_execution.feature)

Affected scenarios: "Worker processes all queued jobs within 3 seconds", "all created jobs reach a terminal state within 3 seconds", "Worker stop cancels in-flight jobs via tokens", "Worker poll loop dispatches jobs via thread pool".

Root cause: features/environment.py installs a global time.sleep cap of 10 ms in before_all:

_MAX_SLEEP = 0.01  # 10 ms cap
def _capped_sleep(seconds: float) -> None:
    time._original_sleep(min(seconds, _MAX_SLEEP))
time.sleep = _capped_sleep

Step definitions in features/steps/async_execution_steps.py poll for job completion using time.sleep(0.1) inside a deadline loop. With the cap, each sleep is truncated to 10 ms, so the loop polls roughly ten times as often as intended. This near-busy polling consumes CPU and can starve the background worker threads being polled. On loaded CI runners with 32 parallel workers, this causes the 2–3 second deadline assertions to fail intermittently.

Fix: Replace polling time.sleep() calls in timing-sensitive step definitions with threading.Event.wait(timeout=N) (not subject to the sleep cap) or use time._original_sleep directly for genuine wall-clock waits:

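import threading
import time

# Instead of polling through the capped sleep:
deadline = time.monotonic() + 3
while not all_terminal and time.monotonic() < deadline:
    time.sleep(0.1)  # capped to 10 ms, so this polls ~10x more often than intended

# Block on an event that the worker callback sets once all jobs are terminal:
done_event = threading.Event()
# ... worker callback calls done_event.set() ...
all_terminal = done_event.wait(timeout=3.0)  # does not go through time.sleep, so the cap does not apply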

Effort: Medium (~10 step definitions to update in features/steps/async_execution_steps.py).
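
As an illustration of the two options in the fix above, here is a minimal, self-contained sketch. The names `wait_until` and `run_jobs_and_signal` are hypothetical and do not come from the step definitions; the sketch also assumes the environment patch stores the real sleep as `time._original_sleep`, as the excerpt above suggests, and falls back to `time.sleep` when it is absent.

```python
import threading
import time


def wait_until(predicate, timeout: float, interval: float = 0.1) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Uses the uncapped sleep if the test environment saved it as
    time._original_sleep (assumption), otherwise plain time.sleep.
    """
    sleep = getattr(time, "_original_sleep", time.sleep)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        sleep(interval)
    # One last check so a result that arrived during the final sleep counts.
    return predicate()


def run_jobs_and_signal(jobs, done_event: threading.Event) -> None:
    """Hypothetical worker: run every job, then signal completion."""
    for job in jobs:
        job()
    done_event.set()


results = []
done = threading.Event()
worker = threading.Thread(
    target=run_jobs_and_signal,
    args=([lambda: results.append("a"), lambda: results.append("b")], done),
)
worker.start()

# Event.wait blocks on an internal lock with a timeout and never calls
# time.sleep, so a patched time.sleep cannot shorten this wait.
finished_in_time = done.wait(timeout=3.0)
worker.join()
```

The event-based wait returns as soon as the worker signals, so the step neither spins nor waits longer than needed. The `wait_until` helper is a fallback for steps where no completion callback is available.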


2. Context Tier Thread-Safety Assertions (features/context_tier_thread_safety.feature)

Affected scenarios: "No fragment should appear in more than one tier", "Concurrent store from multiple threads does not raise RuntimeError".

Root cause: The assertion step reads tier state (_hot, _warm, _cold dicts) after all threads complete but without holding service._lock. Even though ContextTierService uses threading.RLock internally for mutations, the assertion step is outside the lock boundary. A thread scheduler preemption between the last mutation and the assertion read can expose a transiently inconsistent view.

Fix: Acquire the service lock in the assertion step:

with service._lock:
    hot = set(service._hot.keys())
    warm = set(service._warm.keys())
    cold = set(service._cold.keys())
assert not (hot & warm), "Fragment appears in both hot and warm tiers"
assert not (hot & cold), "Fragment appears in both hot and cold tiers"
assert not (warm & cold), "Fragment appears in both warm and cold tiers"

Effort: Low (2–3 step definitions in features/steps/context_tier_thread_safety_steps.py).
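
The same pattern can be factored into a helper so each assertion step takes one consistent snapshot. The sketch below is self-contained and uses a hypothetical `FakeTierService` stand-in that only mirrors the attribute names mentioned in this finding (`_lock`, `_hot`, `_warm`, `_cold`); it is not the real ContextTierService, and `tier_overlaps` is an illustrative name.

```python
import threading


class FakeTierService:
    """Hypothetical stand-in exposing the attributes named in this finding."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self._hot = {}
        self._warm = {}
        self._cold = {}

    def store(self, key: str, value: str) -> None:
        # Move the fragment into the hot tier atomically.
        with self._lock:
            self._warm.pop(key, None)
            self._cold.pop(key, None)
            self._hot[key] = value


def tier_overlaps(service) -> dict:
    """Copy the tier keys under the lock, then compute pairwise overlaps."""
    with service._lock:
        hot, warm, cold = set(service._hot), set(service._warm), set(service._cold)
    return {"hot&warm": hot & warm, "hot&cold": hot & cold, "warm&cold": warm & cold}


service = FakeTierService()
threads = [
    threading.Thread(target=service.store, args=(f"frag-{i % 5}", f"v{i}"))
    for i in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every writer before asserting

overlaps = tier_overlaps(service)
```

Joining every writer thread before the snapshot and copying the keys under the lock together rule out reading a fragment while it is being moved between tiers.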


3. import cleveragents Pre-Import Hang (scripts/run_behave_parallel.py)

Symptom: Running nox -s unit_tests -- <any_feature_file> hangs indefinitely. The UKO detail level map builder logs appear and then the process stalls:

2026-04-15 15:38:20 [debug] detail_level_map_builder.created child_domain=uko-oo: ...
2026-04-15 15:38:20 [debug] detail_level_map_builder.insert_after ...
# <hangs here indefinitely — reproduced 3/3 times locally>

Root cause: run_behave_parallel.py pre-imports cleveragents before forking workers to share module state via copy-on-write:

with suppress(ImportError):
    import cleveragents  # noqa: F401

This triggers module-level initialization code in the UKO subsystem that blocks on some runner configurations (likely waiting for a resource or lock that is never released in the test environment).

Fix: Add a 30-second timeout guard around the pre-import using signal.alarm:

import signal
from contextlib import suppress

def _safe_preimport() -> None:
    """Pre-import cleveragents, giving up after 30 s (Unix, main thread only)."""
    def _timeout(signum: int, frame: object) -> None:
        raise TimeoutError("Pre-import timed out after 30s; skipping")
    old = signal.signal(signal.SIGALRM, _timeout)  # SIGALRM is not available on Windows
    signal.alarm(30)
    try:
        with suppress(ImportError):
            import cleveragents  # noqa: F401
    except TimeoutError:
        pass  # Fall back to on-demand import in workers
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)  # restore the previous handler

Effort: Low (~15 lines in scripts/run_behave_parallel.py).

Note: This hang also blocks all local targeted test reproduction, making it impossible to run individual feature files for debugging without fixing this first.
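
A caveat on the `signal.alarm` guard: SIGALRM exists only on Unix, handlers can only be installed from the main thread, and the handler may not fire if the import is stuck in native code that never returns control to the interpreter. A possible alternative, shown here as a sketch and not as the runner's actual code, is to probe the import in a child process first and pre-import in the parent only when the probe finishes in time. The function names `import_completes` and `preimport_if_safe` are hypothetical.

```python
import importlib
import subprocess
import sys


def import_completes(module_name: str, timeout_s: float = 30.0) -> bool:
    """Return True if `import module_name` finishes in a child process in time."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", f"import {module_name}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising.
        return False
    return proc.returncode == 0


def preimport_if_safe(module_name: str, timeout_s: float = 30.0) -> bool:
    """Import in this process only if the child-process probe succeeded."""
    if not import_completes(module_name, timeout_s):
        return False  # workers fall back to importing on demand
    importlib.import_module(module_name)
    return True
```

The trade-off is that a healthy import is paid for twice, and a hang that occurs only intermittently could still hit the parent after a successful probe. The probe does, however, work on any platform and cannot leave the parent process stuck during the check itself.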


4. tempfile.mktemp() Race Condition — See #9686

Root cause: features/environment.py uses the deprecated tempfile.mktemp() in before_scenario() to generate per-scenario DB paths. mktemp() is non-atomic — it returns a path without creating the file. With 32 parallel workers, two scenarios can receive the same path before either copies the template DB, causing sqlite3.OperationalError: database is locked or cross-scenario state leakage.

This finding is already tracked in open issue #9686 ([AUTO-INF-4] Harden Behave temp DB paths and cap default parallelism), which proposes replacing mktemp with NamedTemporaryFile(delete=False) or mkstemp. No new issue is needed; this report cross-references #9686.
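
For reference, the `mkstemp` variant could look like the following sketch. The helper name `make_scenario_db` and its parameters are hypothetical; the actual change belongs to #9686.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def make_scenario_db(template_db: Path, tmp_dir: Optional[Path] = None) -> Path:
    """Create a uniquely named DB file and fill it from the template.

    mkstemp creates the file atomically, so two parallel workers can
    never be handed the same path.
    """
    fd, name = tempfile.mkstemp(suffix=".db", prefix="scenario_", dir=tmp_dir)
    os.close(fd)  # only the reserved path is needed, not the descriptor
    shutil.copyfile(template_db, name)  # overwrite the empty file with the template
    return Path(name)
```

Because the file already exists when `mkstemp` returns, the caller is responsible for deleting it in the scenario teardown.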


5. Deterministic TDD Failures (11 Category B Scenarios)

Root cause: analysis_results.txt lists 11 scenarios that fail on every run without proper @tdd_expected_fail tags:

| Feature File | Scenario |
|--------------|----------|
| features/cli_extensions.feature:307 | Plan status in JSON format contains required keys |
| features/cli_extensions.feature:321 | Action show in JSON format contains all required keys |
| features/cli_extensions.feature:336 | Action show in JSON includes invariants list when present |
| features/cli_global_format_flag.feature:51 | version command reads format from global ctx.obj |
| features/cli_output_formats.feature:95 | Format output handles all format types for dict |
| features/m6_autonomy_acceptance.feature:170 | M6 smoke A2A version negotiation rejects unsupported |
| features/tdd_a2a_sdk_dependency.feature:15 | a2a module is importable as a project dependency |
| features/tdd_event_bus_exception_swallow.feature:22 | Bug #988 — emit() logs exception message when handler raises |
| features/tdd_session_create_di.feature:22 | Session create command produces structured output |
| features/cli_json_envelope.feature:22 | YAML output includes all required envelope fields |
| features/cli_json_envelope.feature:61 | YAML envelope data field contains actor list payload |

Additionally, features/tui_slash_overlay_descriptions.feature:36 has @tdd_expected_fail but is missing the required @tdd_issue and @tdd_issue_<N> tags, causing validate_tdd_tags() to force it to Status.failed on every run.

Fix: For each Category B scenario, either fix the underlying bug or add proper @tdd_issue @tdd_issue_<N> @tdd_expected_fail tags. Add a CI pre-flight check that fails fast when Category B count is non-zero.
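
The tag rule described above can be checked before any scenario runs. The sketch below is a hypothetical pre-flight helper, not the project's `validate_tdd_tags()`; it assumes tags arrive as plain strings without the leading `@`, which is how Behave exposes `scenario.tags`.

```python
import re

# Matches the numbered companion tag, e.g. "tdd_issue_988".
_ISSUE_TAG = re.compile(r"^tdd_issue_\d+$")


def missing_tdd_tags(tags) -> list:
    """Return the companion tags an @tdd_expected_fail scenario lacks."""
    tags = set(tags)
    if "tdd_expected_fail" not in tags:
        return []  # the rule only applies to expected-fail scenarios
    missing = []
    if "tdd_issue" not in tags:
        missing.append("@tdd_issue")
    if not any(_ISSUE_TAG.match(tag) for tag in tags):
        missing.append("@tdd_issue_<N>")
    return missing
```

A pre-flight job could call this for every parsed scenario and exit non-zero when any list is non-empty, which would surface cases such as features/tui_slash_overlay_descriptions.feature:36 without running the suite.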


6. CI Infrastructure Gaps

  • Log capture size limit: All integration test logs are truncated at ~170 KB / ~1,479 lines. Integration test failures are invisible in captured logs.
  • PR #6723 log truncation: Unit test log cut at 44,110 lines (vs ~103,000 for completed runs), suggesting a job timeout or log capture interruption mid-run.
  • No job-level timeouts: Most CI jobs have no timeout-minutes, allowing runaway hangs (documented up to 29+ hours). See #8381.
  • Parallel .pyc thundering-herd: The noxfile pre-compiles features/ bytecode to prevent overlayfs copy-up lock deadlocks. This mitigation is in place but the pre-import hang (Finding #3) is a related unmitigated risk.

Recommended Actions

  1. [Immediate, Very Low] Fix malformed TDD tag on features/tui_slash_overlay_descriptions.feature:36 — add @tdd_issue @tdd_issue_<N> tags.
  2. [Immediate, Low] Add 30-second signal.alarm timeout guard around import cleveragents pre-import in scripts/run_behave_parallel.py (Finding #3). Unblocks all local targeted test debugging.
  3. [Short-term, Very Low] Replace tempfile.mktemp() with tempfile.mkstemp() in features/environment.py — coordinate with #9686.
  4. [Short-term, Low] Acquire service._lock in ContextTierService assertion steps in features/steps/context_tier_thread_safety_steps.py (Finding #2).
  5. [Short-term, Medium] Replace polling time.sleep() with threading.Event.wait(timeout=N) in timing-sensitive async execution step definitions (Finding #1). ~10 step definitions in features/steps/async_execution_steps.py.
  6. [Short-term, High] Fix or properly tag the 11 Category B untagged failing scenarios from analysis_results.txt. Add CI pre-flight check that fails fast when Category B count is non-zero.
  7. [Infrastructure, Medium] Increase CI log capture limit or add a compact pabot summary artifact so integration test failures are visible. See #9778 for broader test-layer restructuring.
  8. [Infrastructure, Low] Add timeout-minutes to all CI jobs. See #8381.

Duplicate Check

Searched open issues via Forgejo REST API for each keyword.

| Keyword | Matching Open Issues | Notes |
|---------|----------------------|-------|
| "flaky" | #9778 (partial), #9783 (partial), #9790 (passing mention) | #1542 is closed (heartbeat fix, 2026-04-03). No open issue covers the specific flaky tests in this report. |
| "async execution" | None (direct match) | #9778 mentions "async retries" generically. No issue covers features/async_execution.feature timing flakiness. |
| "context tier" | None | No open issue covers ContextTierService thread-safety assertion flakiness. |
| "tempfile.mktemp" | #9686 (direct match) | Issue #9686 [AUTO-INF-4] covers the exact tempfile.mktemp race in features/environment.py. Our tempfile finding references #9686 rather than duplicating it. |
| "pre-import" | None (direct match) | No open issue covers the import cleveragents hang in run_behave_parallel.py. |

Prior closed issue: #1542 (TEST-INFRA: [flaky-tests] Flaky test detected: test_example_flaky_test) — closed 2026-04-03, fixed with busy-wait loop. This report documents residual risks from the global sleep cap interaction.


References

Source Files

  • features/environment.py — global sleep cap, tempfile.mktemp usage, TDD inversion patch
  • scripts/run_behave_parallel.py — pre-import hang (lines ~280–285)
  • features/async_execution.feature — timing-sensitive scenarios
  • features/context_tier_thread_safety.feature — thread-safety assertion scenarios (issue #7547)
  • features/test_infra_flaky_test_example.feature — documents prior heartbeat fix
  • features/steps/test_infra_flaky_test_example_steps.py — busy-wait fix implementation
  • analysis_results.txt — TDD scenario categories (81 A, 11 B, 1 Unknown)

CI Logs

  • ci_logs/pr6598_unit_tests.log, pr6607_unit_tests.log, pr7004_unit_tests.log, pr_4221_unit_tests.log — passing unit test runs
  • ci_logs_pr5271_chunks/chunk_001-005.log — lint failure (57 ruff errors)
  • ci_logs/pr6729_integration_tests.log — integration test run (truncated at log-capture limit)

Related Issues

  • #1542 — Prior flaky test (heartbeat timestamp) — CLOSED
  • #9686 [AUTO-INF-4] Harden Behave temp DB paths — OPEN (covers tempfile.mktemp fix)
  • #9778 [AUTO-INF-5] Stabilize Behave/Robot test layers — OPEN (covers global sleep patch, TDD backlog)
  • #8381 [AUTO-INF-1] Add per-job timeouts — OPEN
  • #9767 [AUTO-INF-3] Harden CI workflow reliability — OPEN
  • #7547 ContextTierService thread safety (referenced in feature file)

Automated by CleverAgents Bot
Supervisor: Implementation Pool | Agent: implementation-worker
