[AUTO-INF-5] Stabilize Behave/Robot test layers to cut CI flake #9778

Open
opened 2026-04-15 15:33:26 +00:00 by HAL9000 · 0 comments
Owner

Summary

  • CI failures (~69.7%) cluster around the Behave + Robot “coverage booster” suites that now act as the only gate for large swaths of the codebase.
  • The suite has grown to 821 Behave step modules and 316 Robot suites, but they all run on every PR with the same configuration—there is no layered smoke vs. coverage split, so any slow or flaky scenario brings the whole pipeline down.
  • Global monkey patches in features/environment.py (time.sleep/asyncio.sleep caps, MigrationRunner overrides, TDD inversion) plus 81 active @tdd_expected_fail scenarios create stateful interactions that make failures hard to diagnose and let real regressions slip through.

Findings

Overview

  • Behave “unit” layer: nox -s unit_tests invokes behave-parallel across the entire features/ tree. Step modules are organised by product area but many scenarios exist purely to flip coverage branches (filenames end in _coverage_steps.py / _coverage_boost_steps.py).
  • Robot integration/e2e layer: nox -s integration_tests and slow_integration_tests run Robot suites via pabot, again across the whole robot/ tree. There is no curated smoke set; 300+ suites are selected every time.
  • Coverage + quality gates: nox -s coverage_report forces a sequential Behave run under Slipcover with a 97% threshold, while analysis_results.txt tracks the TDD backlog and shows the system is already overwhelmed.

Patterns & Anti-patterns

  • Good: pre-migrated template DB copied per scenario (_fast_init_or_upgrade) and unique DB URLs reduce migration time.
  • Good: TDD tagging + analysis_results.txt gives visibility into known regressions.
  • Anti-pattern: coverage boosters (files named *_coverage*_steps.py) and TDD capture scenarios run inline with core smoke tests, so reliability for real regressions depends on hundreds of fragile fixtures.
  • Anti-pattern: features/environment.py globally patches time.sleep/asyncio.sleep, monkey patches MigrationRunner, and rewrites Scenario.run. These side effects leak between workers and differ between behave-parallel vs. sequential coverage runs.
  • Anti-pattern: Expected-fail infrastructure depends on monkey-patched Behave internals; infrastructure exceptions are only logged, so a changed failure mode can silently invalidate the guard.

Test Doubles & Fixtures

  • features/mocks/ bundles bespoke stubs for AI providers, LangGraph transports, MCP, UOW factories, etc., but they are registered via global container overrides in before_all. There is no fixture registry or factory pattern—tests import helpers directly and mutate environment/os variables.
  • Large fixture directories (features/fixtures/m1…m6, scale, validation) contain YAML/JSON “snapshots” but lack tooling to generate or validate them. Keeping them in sync is manual, and failures often stem from drift.
  • Robot suites rely on helper scripts under robot/scripts/, but there is no shared library for common orchestration steps; copy/paste keywords proliferate.

Reliability Risks

  • Unlayered execution: PR CI always runs the full Behave + Robot matrix. Any flaky scenario (networked CLI steps, async retries, TUI rendering) causes reruns; there is no smoke subset that can finish quickly and deterministically.
  • Global monkey patches: The capped time.sleep/asyncio.sleep (10 ms) can mask timing regressions in retry logic, while the MigrationRunner override silently falls back to running real migrations if the template copy fails—tests pass but runtime explodes and CI times out.
  • TDD backlog: analysis_results.txt (checked in) currently lists 81 Category A expected-fail scenarios and 11 Category B scenarios needing new issues. The CI run still parses them, and a single mis-tagged case flips the suite red.
  • Parallel resource contention: Behave-parallel and pabot both write to build/ and reuse the same template DB path. Without per-worker isolation directories, concurrent runs can still contend on build/.template-migrated.db and the build/reports/ output tree.
  • Observability drift: Failures surface as long behave logs without context because coverage boosters suppress stdout (-q, NO_COLOR=1). There is no automated triage to map a red build to a culprit feature file.

Recommendations

Quick Wins (1–2 days)

  • Introduce a @smoke tag (or inverse @coverage_boost) and change the default nox -s unit_tests/integration_tests runs to execute only smoke scenarios on PRs; schedule the heavy coverage/tagged suites in nightly jobs.
  • Fail CI if analysis_results.txt Category B is non-zero—turn that report into a gate so untriaged TDD failures cannot silently pile up.
  • Guard the template DB fast-path: verify the copy succeeded and hard-fail if it falls back to running migrations so that timeouts are surfaced instead of silently slowing down the suite.

Medium Term (1–2 weeks)

  • Split the test matrix: add dedicated nox sessions for Behave smoke vs. coverage boosters vs. TDD captures, and the same for Robot (smoke, slow, E2E). Wire CI so PRs run only the smoke layers plus a rotating bucket of coverage boosters.
  • Replace the global sleep patch with test-specific tenacity settings or a fixture that caps retry waits per scenario. This keeps long waits short without mutating every thread/event loop globally.
  • Build a lightweight fixture library (factory functions) for the common mocks in features/mocks/; expose helpers via a conftest-style module so scenarios stop importing and mutating globals directly.

Longer Term

  • Grow a pytest-based unit layer around the service/domain modules so coverage boosters can be retired or move to nightly jobs—BDD should validate behaviour, not chase residual coverage points.
  • Create per-layer workspaces (e.g., each behave-parallel worker gets its own build/behave/<pid> dir) and move Robot reports likewise to eliminate shared state.
  • Replace the custom TDD expected-fail monkey patch with a small harness that reads metadata (e.g., from analysis_results.json) and skips known failures while still reporting unexpected passes.

Supporting Data & Suggested Commands

  • ls features/steps | wc -l → 821 step modules (API sample: curl …/features/steps?ref=master | jq 'map(select(.type=="file")) | length').
  • find robot -name "*.robot" | wc -l → 316 Robot suites (API sample: curl …/robot?ref=master | jq 'map(select(.type=="file" and (.name|endswith(".robot")))) | length').
  • cat analysis_results.txt shows 81 Category A @tdd_expected_fail scenarios and 11 Category B scenarios needing new issues.
  • nox -s unit_tests / nox -s integration_tests demonstrate the current all-or-nothing execution and are the touchpoints for introducing smoke vs. full splits.

Duplicate Check

  • Reviewed #9697 ([AUTO-INF-5] Harden CI quality gates) – it focuses on workflow thresholds/docs gating, not on reorganising the Behave/Robot suites or TDD backlog.
  • Scanned open issues for [AUTO-INF-5] tags; no other items cover test-layer restructuring.
  • No closed issues were found that tackle the proposed layering, fixture hygiene, or TDD backlog automation.

Automated by CleverAgents Bot
Agent: test-infra-pool-supervisor

## Summary - CI failures (~69.7%) cluster around the Behave + Robot “coverage booster” suites that now act as the only gate for large swaths of the codebase. - The suite has grown to 821 Behave step modules and 316 Robot suites, but they all run on every PR with the same configuration—there is no layered smoke vs. coverage split, so any slow or flaky scenario brings the whole pipeline down. - Global monkey patches in `features/environment.py` (time.sleep/asyncio.sleep caps, MigrationRunner overrides, TDD inversion) plus 81 active `@tdd_expected_fail` scenarios create stateful interactions that make failures hard to diagnose and let real regressions slip through. ## Findings ### Overview - **Behave “unit” layer**: `nox -s unit_tests` invokes `behave-parallel` across the entire `features/` tree. Step modules are organised by product area but many scenarios exist purely to flip coverage branches (filenames end in `_coverage_steps.py` / `_coverage_boost_steps.py`). - **Robot integration/e2e layer**: `nox -s integration_tests` and `slow_integration_tests` run Robot suites via `pabot`, again across the whole `robot/` tree. There is no curated smoke set; 300+ suites are selected every time. - **Coverage + quality gates**: `nox -s coverage_report` forces a sequential Behave run under Slipcover with a 97% threshold, while `analysis_results.txt` tracks the TDD backlog and shows the system is already overwhelmed. ### Patterns & Anti-patterns - ✅ Good: pre-migrated template DB copied per scenario (`_fast_init_or_upgrade`) and unique DB URLs reduce migration time. - ✅ Good: TDD tagging + `analysis_results.txt` gives visibility into known regressions. - ❌ Anti-pattern: coverage boosters (files named `*_coverage*_steps.py`) and TDD capture scenarios run inline with core smoke tests, so reliability for real regressions depends on hundreds of fragile fixtures. - ❌ Anti-pattern: `features/environment.py` globally patches `time.sleep`/`asyncio.sleep`, monkey patches `MigrationRunner`, and rewrites `Scenario.run`. These side effects leak between workers and differ between behave-parallel vs. sequential coverage runs. - ❌ Anti-pattern: Expected-fail infrastructure depends on monkey-patched Behave internals; infrastructure exceptions are only logged, so a changed failure mode can silently invalidate the guard. ### Test Doubles & Fixtures - `features/mocks/` bundles bespoke stubs for AI providers, LangGraph transports, MCP, UOW factories, etc., but they are registered via global container overrides in `before_all`. There is no fixture registry or factory pattern—tests import helpers directly and mutate environment/os variables. - Large fixture directories (`features/fixtures/m1…m6`, `scale`, `validation`) contain YAML/JSON “snapshots” but lack tooling to generate or validate them. Keeping them in sync is manual, and failures often stem from drift. - Robot suites rely on helper scripts under `robot/scripts/`, but there is no shared library for common orchestration steps; copy/paste keywords proliferate. ### Reliability Risks - **Unlayered execution**: PR CI always runs the full Behave + Robot matrix. Any flaky scenario (networked CLI steps, async retries, TUI rendering) causes reruns; there is no smoke subset that can finish quickly and deterministically. - **Global monkey patches**: The capped `time.sleep`/`asyncio.sleep` (10 ms) can mask timing regressions in retry logic, while the MigrationRunner override silently falls back to running real migrations if the template copy fails—tests pass but runtime explodes and CI times out. - **TDD backlog**: `analysis_results.txt` (checked in) currently lists 81 Category A expected-fail scenarios and 11 Category B scenarios needing new issues. The CI run still parses them, and a single mis-tagged case flips the suite red. - **Parallel resource contention**: Behave-parallel and pabot both write to `build/` and reuse the same template DB path. Without per-worker isolation directories, concurrent runs can still contend on `build/.template-migrated.db` and the `build/reports/` output tree. - **Observability drift**: Failures surface as long behave logs without context because coverage boosters suppress stdout (`-q`, `NO_COLOR=1`). There is no automated triage to map a red build to a culprit feature file. ### Recommendations #### Quick Wins (1–2 days) - Introduce a `@smoke` tag (or inverse `@coverage_boost`) and change the default `nox -s unit_tests`/`integration_tests` runs to execute only smoke scenarios on PRs; schedule the heavy coverage/tagged suites in nightly jobs. - Fail CI if `analysis_results.txt` Category B is non-zero—turn that report into a gate so untriaged TDD failures cannot silently pile up. - Guard the template DB fast-path: verify the copy succeeded and hard-fail if it falls back to running migrations so that timeouts are surfaced instead of silently slowing down the suite. #### Medium Term (1–2 weeks) - Split the test matrix: add dedicated `nox` sessions for Behave smoke vs. coverage boosters vs. TDD captures, and the same for Robot (smoke, slow, E2E). Wire CI so PRs run only the smoke layers plus a rotating bucket of coverage boosters. - Replace the global sleep patch with test-specific tenacity settings or a fixture that caps retry waits per scenario. This keeps long waits short without mutating every thread/event loop globally. - Build a lightweight fixture library (factory functions) for the common mocks in `features/mocks/`; expose helpers via a `conftest`-style module so scenarios stop importing and mutating globals directly. #### Longer Term - Grow a pytest-based unit layer around the service/domain modules so coverage boosters can be retired or move to nightly jobs—BDD should validate behaviour, not chase residual coverage points. - Create per-layer workspaces (e.g., each behave-parallel worker gets its own `build/behave/<pid>` dir) and move Robot reports likewise to eliminate shared state. - Replace the custom TDD expected-fail monkey patch with a small harness that reads metadata (e.g., from `analysis_results.json`) and skips known failures while still reporting unexpected passes. ### Supporting Data & Suggested Commands - `ls features/steps | wc -l` → 821 step modules (API sample: `curl …/features/steps?ref=master | jq 'map(select(.type=="file")) | length'`). - `find robot -name "*.robot" | wc -l` → 316 Robot suites (API sample: `curl …/robot?ref=master | jq 'map(select(.type=="file" and (.name|endswith(".robot")))) | length'`). - `cat analysis_results.txt` shows 81 Category A `@tdd_expected_fail` scenarios and 11 Category B scenarios needing new issues. - `nox -s unit_tests` / `nox -s integration_tests` demonstrate the current all-or-nothing execution and are the touchpoints for introducing smoke vs. full splits. ## Duplicate Check - Reviewed [#9697](https://git.cleverthis.com/cleveragents/cleveragents-core/issues/9697) (`[AUTO-INF-5] Harden CI quality gates`) – it focuses on workflow thresholds/docs gating, not on reorganising the Behave/Robot suites or TDD backlog. - Scanned open issues for `[AUTO-INF-5]` tags; no other items cover test-layer restructuring. - No closed issues were found that tackle the proposed layering, fixture hygiene, or TDD backlog automation. --- **Automated by CleverAgents Bot** Agent: test-infra-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#9778
No description provided.