Bug: @tdd_expected_fail scenarios fail intermittently when steps raise non-AssertionError exceptions, causing flaky CI #8294

Closed
opened 2026-04-13 08:10:19 +00:00 by hurui200320 · 2 comments
Member

Metadata

  • Commit Message: fix(testing): document and harden non-AssertionError guard in apply_tdd_inversion to reduce flaky CI
  • Branch: fix/tdd-inversion-non-assertion-guard

Background and Context

apply_tdd_inversion() in features/environment.py contains a guard that prevents result inversion when a step in a @tdd_expected_fail scenario fails with a non-AssertionError exception:

for step in all_steps:
    if (
        step.status == Status.failed
        and step.exception is not None
        and not isinstance(step.exception, AssertionError)  # ← guard
    ):
        _tdd_logger.warning(
            "Non-assertion exception in expected-fail scenario "
            "'%s' step '%s': %s — not inverting.",
            ...
        )
        return failed  # ← NOT inverted → scenario reported as FAIL

The guard's intent is correct: if a step fails with a RuntimeError, ConnectionError, TimeoutError, etc., that is likely an infrastructure problem rather than evidence of the captured bug, so the result should not be inverted.

However, this guard is the primary cause of the observed flaky unit test behavior. When a @tdd_expected_fail scenario's step fails intermittently with a non-AssertionError exception (e.g., a transient database connection error, a file system race, or an import error), the guard fires and the scenario is reported as FAIL instead of PASS. On a clean rerun without the infrastructure issue, the step raises AssertionError (the expected bug), the guard does not fire, and the scenario is correctly inverted to PASS.

This produces the exact "sometimes fails, rerun gives pass" symptom reported on master.

Current Behavior

The guard is working as designed, but the design has a gap: it does not distinguish between:

  1. Transient infrastructure errors (e.g., ConnectionError: db connection dropped) — should NOT invert, guard is correct.
  2. Deterministic non-AssertionError exceptions raised intentionally by a step to signal the bug (e.g., a step that raises ValueError to prove a validation bug) — should invert, guard incorrectly blocks it.

Additionally, the guard fires on any non-AssertionError exception, including exceptions that are wrapped or chained from the actual bug. This is overly conservative.

Proof of the flakiness mechanism:

from behave.model import Scenario, Step, Status
from features.environment import apply_tdd_inversion

for exc_class in [AssertionError, RuntimeError, ConnectionError, TimeoutError]:
    s = Scenario('<test>', 1, 'Scenario', 'test',
                 tags=['tdd_expected_fail', 'tdd_issue', 'tdd_issue_999'])
    s.hook_failed = False
    s.was_dry_run = False
    step = Step('<test>', 1, 'When', 'when', 'the test step')
    step.status = Status.failed
    step.exception = exc_class('simulated failure')
    s.steps.append(step)
    s.set_status(Status.failed)

    result = apply_tdd_inversion(s, True)
    print(f'{exc_class.__name__}: result={result}')
    # AssertionError: result=False  ← PASS (inverted correctly)
    # RuntimeError:   result=True   ← FAIL (guard fired — FLAKY!)
    # ConnectionError: result=True  ← FAIL (guard fired — FLAKY!)
    # TimeoutError:   result=True   ← FAIL (guard fired — FLAKY!)

Expected Behavior

The guard should remain in place (it is correct in principle), but the following improvements are needed:

  1. The warning log message should be surfaced more prominently so that when the guard fires, it is immediately visible in CI output and not buried in debug logs. Currently the warning is logged at WARNING level to cleveragents.testing.tdd_tags — this should be elevated or the message should appear in the test output.

  2. @tdd_expected_fail step definitions should be hardened to ensure they only raise AssertionError (or subclasses) when asserting the bug is present. Steps that call infrastructure (DB, files, network) should wrap non-AssertionError exceptions in AssertionError or use pytest.fail() / assert statements so the guard never fires spuriously.

  3. The guard should be documented in CONTRIBUTING.md under the TDD Issue Test Tags section, so developers writing @tdd_expected_fail scenarios know to use assert statements (not bare raise SomeError(...)) in their step definitions.

Acceptance Criteria

  • The _tdd_logger.warning(...) call in the non-AssertionError guard path is changed to also print to sys.stderr (or use print()) so the guard firing is visible in standard Behave output, not just in log files.
  • CONTRIBUTING.md is updated under "TDD Issue Test Tags" to document that @tdd_expected_fail step definitions must use assert statements (raising AssertionError) to signal the bug, not bare raise SomeError(...).
  • A new Behave scenario is added to features/tdd_expected_fail_infrastructure.feature that verifies the guard fires and logs a warning when a step raises a non-AssertionError exception.
  • Existing @tdd_expected_fail step definitions in features/steps/ are audited; any that raise non-AssertionError exceptions to signal the bug are updated to use assert or raise AssertionError(...).
  • Coverage remains ≥ 97%.

Supporting Information

  • File: features/environment.py, function apply_tdd_inversion, lines 184–200.
  • Root cause of flakiness: The 2005b8ef commit (today, Apr 13 2026) converted 234 @skip scenarios to @tdd_expected_fail. Many of these newly-enabled scenarios exercise real infrastructure (DB, files, containers). When that infrastructure fails transiently, the guard fires and the scenario is reported as FAIL instead of PASS.
  • Guard is correct in principle: The guard correctly prevents masking infrastructure failures as expected bug failures. The fix is to harden the step definitions and improve observability, not to remove the guard.
  • Companion issues: #8291 (untested-step reset gap) and #8293 (redundant clear_status() call).

Subtasks

  • Change the _tdd_logger.warning(...) in the non-AssertionError guard to also write to sys.stderr so it appears in Behave output
  • Update CONTRIBUTING.md > TDD Issue Test Tags to document the AssertionError-only requirement for @tdd_expected_fail step definitions
  • Audit all @tdd_expected_fail step definitions in features/steps/ for non-AssertionError raises that signal the bug; fix any found
  • Add a Behave scenario to features/tdd_expected_fail_infrastructure.feature verifying the guard fires on non-AssertionError exceptions (this may already exist — verify and extend if needed)
  • Verify nox -e unit_tests passes with no flaky failures
  • Verify coverage ≥ 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
## Metadata - **Commit Message**: `fix(testing): document and harden non-AssertionError guard in apply_tdd_inversion to reduce flaky CI` - **Branch**: `fix/tdd-inversion-non-assertion-guard` ## Background and Context `apply_tdd_inversion()` in `features/environment.py` contains a guard that prevents result inversion when a step in a `@tdd_expected_fail` scenario fails with a non-`AssertionError` exception: ```python for step in all_steps: if ( step.status == Status.failed and step.exception is not None and not isinstance(step.exception, AssertionError) # ← guard ): _tdd_logger.warning( "Non-assertion exception in expected-fail scenario " "'%s' step '%s': %s — not inverting.", ... ) return failed # ← NOT inverted → scenario reported as FAIL ``` The guard's intent is correct: if a step fails with a `RuntimeError`, `ConnectionError`, `TimeoutError`, etc., that is likely an infrastructure problem rather than evidence of the captured bug, so the result should not be inverted. However, this guard is the **primary cause of the observed flaky unit test behavior**. When a `@tdd_expected_fail` scenario's step fails intermittently with a non-`AssertionError` exception (e.g., a transient database connection error, a file system race, or an import error), the guard fires and the scenario is reported as **FAIL** instead of **PASS**. On a clean rerun without the infrastructure issue, the step raises `AssertionError` (the expected bug), the guard does not fire, and the scenario is correctly inverted to **PASS**. This produces the exact "sometimes fails, rerun gives pass" symptom reported on master. ## Current Behavior The guard is working as designed, but the design has a gap: it does not distinguish between: 1. **Transient infrastructure errors** (e.g., `ConnectionError: db connection dropped`) — should NOT invert, guard is correct. 2. **Deterministic non-AssertionError exceptions** raised intentionally by a step to signal the bug (e.g., a step that raises `ValueError` to prove a validation bug) — should invert, guard incorrectly blocks it. Additionally, the guard fires on **any** non-`AssertionError` exception, including exceptions that are wrapped or chained from the actual bug. This is overly conservative. **Proof of the flakiness mechanism:** ```python from behave.model import Scenario, Step, Status from features.environment import apply_tdd_inversion for exc_class in [AssertionError, RuntimeError, ConnectionError, TimeoutError]: s = Scenario('<test>', 1, 'Scenario', 'test', tags=['tdd_expected_fail', 'tdd_issue', 'tdd_issue_999']) s.hook_failed = False s.was_dry_run = False step = Step('<test>', 1, 'When', 'when', 'the test step') step.status = Status.failed step.exception = exc_class('simulated failure') s.steps.append(step) s.set_status(Status.failed) result = apply_tdd_inversion(s, True) print(f'{exc_class.__name__}: result={result}') # AssertionError: result=False ← PASS (inverted correctly) # RuntimeError: result=True ← FAIL (guard fired — FLAKY!) # ConnectionError: result=True ← FAIL (guard fired — FLAKY!) # TimeoutError: result=True ← FAIL (guard fired — FLAKY!) ``` ## Expected Behavior The guard should remain in place (it is correct in principle), but the following improvements are needed: 1. **The warning log message should be surfaced more prominently** so that when the guard fires, it is immediately visible in CI output and not buried in debug logs. Currently the warning is logged at `WARNING` level to `cleveragents.testing.tdd_tags` — this should be elevated or the message should appear in the test output. 2. **`@tdd_expected_fail` step definitions should be hardened** to ensure they only raise `AssertionError` (or subclasses) when asserting the bug is present. Steps that call infrastructure (DB, files, network) should wrap non-`AssertionError` exceptions in `AssertionError` or use `pytest.fail()` / `assert` statements so the guard never fires spuriously. 3. **The guard should be documented in CONTRIBUTING.md** under the TDD Issue Test Tags section, so developers writing `@tdd_expected_fail` scenarios know to use `assert` statements (not bare `raise SomeError(...)`) in their step definitions. ## Acceptance Criteria - [ ] The `_tdd_logger.warning(...)` call in the non-AssertionError guard path is changed to also print to `sys.stderr` (or use `print()`) so the guard firing is visible in standard Behave output, not just in log files. - [ ] CONTRIBUTING.md is updated under "TDD Issue Test Tags" to document that `@tdd_expected_fail` step definitions must use `assert` statements (raising `AssertionError`) to signal the bug, not bare `raise SomeError(...)`. - [ ] A new Behave scenario is added to `features/tdd_expected_fail_infrastructure.feature` that verifies the guard fires and logs a warning when a step raises a non-`AssertionError` exception. - [ ] Existing `@tdd_expected_fail` step definitions in `features/steps/` are audited; any that raise non-`AssertionError` exceptions to signal the bug are updated to use `assert` or `raise AssertionError(...)`. - [ ] Coverage remains ≥ 97%. ## Supporting Information - **File**: `features/environment.py`, function `apply_tdd_inversion`, lines 184–200. - **Root cause of flakiness**: The `2005b8ef` commit (today, Apr 13 2026) converted 234 `@skip` scenarios to `@tdd_expected_fail`. Many of these newly-enabled scenarios exercise real infrastructure (DB, files, containers). When that infrastructure fails transiently, the guard fires and the scenario is reported as FAIL instead of PASS. - **Guard is correct in principle**: The guard correctly prevents masking infrastructure failures as expected bug failures. The fix is to harden the step definitions and improve observability, not to remove the guard. - **Companion issues**: #8291 (untested-step reset gap) and #8293 (redundant `clear_status()` call). ## Subtasks - [ ] Change the `_tdd_logger.warning(...)` in the non-AssertionError guard to also write to `sys.stderr` so it appears in Behave output - [ ] Update CONTRIBUTING.md > TDD Issue Test Tags to document the `AssertionError`-only requirement for `@tdd_expected_fail` step definitions - [ ] Audit all `@tdd_expected_fail` step definitions in `features/steps/` for non-`AssertionError` raises that signal the bug; fix any found - [ ] Add a Behave scenario to `features/tdd_expected_fail_infrastructure.feature` verifying the guard fires on non-`AssertionError` exceptions (this may already exist — verify and extend if needed) - [ ] Verify `nox -e unit_tests` passes with no flaky failures - [ ] Verify coverage ≥ 97% via `nox -s coverage_report` - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done.
hurui200320 added this to the v3.2.0 milestone 2026-04-13 08:13:46 +00:00
Owner

[AUTO-EPIC] Epic Linkage

This issue is a child of Epic #8043 — M3 UAT Bug Resolution — Spec Alignment Backlog (v3.2.0).

The TDD expected fail intermittent failure bug is a test quality issue that falls under the M3 UAT bug resolution scope.

Dependency direction: This issue (#8294) BLOCKS Epic #8043.


Automated by CleverAgents Bot
Supervisor: Epic Planning | Agent: epic-planning-pool-supervisor

## [AUTO-EPIC] Epic Linkage This issue is a child of **Epic #8043** — M3 UAT Bug Resolution — Spec Alignment Backlog (v3.2.0). The TDD expected fail intermittent failure bug is a test quality issue that falls under the M3 UAT bug resolution scope. **Dependency direction**: This issue (#8294) BLOCKS Epic #8043. --- **Automated by CleverAgents Bot** Supervisor: Epic Planning | Agent: epic-planning-pool-supervisor
Author
Member

Self-QA Implementation Notes (Cycles 1–3)

Cycle 1

Review findings (3C/2M/6m/4n):

  • C1 (Critical): Non-atomic commit diff — 111 files in the PR diff, including apparent reversions of LockService (#7989), silent fallback (#8232), and fail_fast test (#7582), plus 98+ agent config files and a new CI workflow. Root cause: branch was not rebased onto master after those 107 files changed on master.
  • C2 (Critical): CHANGELOG.md appeared to delete entries for #8232, #7989, and #7582. Also, no new entry for #8294.
  • C3 (Critical): CONTRIBUTORS.md appeared to delete HAL 9000 contribution lines.
  • M2 (Major): Missing negative BDD scenario — no test verifying that _warning_with_stderr is NOT called when the exception IS an AssertionError.
  • m1–m6 (Minor): Weak getattr guard in stderr step; extra blank line in _warning_with_stderr; undocumented eager f-string; missing return value assertion in new scenario; audit result not in PR description; PR validation section incomplete.

Fixes applied:

  • C1/C2/C3: Performed git rebase origin/master. All 107 phantom diffs were ghost artifacts of the out-of-date branch base — the branch's actual commit only touched 5 files. After rebase, the PR diff is clean: exactly the files for #8294.
  • A new CHANGELOG entry for #8294 was added under [Unreleased] → ### Fixed.
  • M2: Added Scenario: apply_tdd_inversion does not emit guard warning for AssertionError exceptions with a Then stderr output should not contain "Non-assertion exception" step.
  • m1: Replaced getattr(context, "captured_stderr", "") with assert hasattr(context, "captured_stderr") plus a descriptive error message in both step_check_stderr_contains and the new step_check_stderr_not_contains step definitions.
  • m3: Removed extra blank line between docstring and function body in _warning_with_stderr.
  • m4: Added inline comment explaining the intentional eager f-string deviation.
  • m5: Added And the apply_tdd_inversion result should be True/False to both new scenarios. Note: apply_tdd_inversion returns False for a successful inversion (scenario no longer failed), so the AssertionError (happy path) scenario asserts False.
  • m2/m6: PR description updated with AC #4 audit note and full nox session list.

Cycle 2

Review findings (0C/1M/5m/5n):

  • M1 (Major): PR Validation section had only 3 of 7 nox sessions confirmed (✓). The 4 unconfirmed sessions included coverage_report (required by AC5: coverage ≥ 97%).
  • m1–m5 (Minor): Negative scenario missing step-level status assertion; PR title diverged from commit message; potential double stderr emission undocumented; no isolated unit test for _warning_with_stderr (deferred); no defensive truncation for str(step.exception).

Fixes applied:

  • M1: Ran all 4 remaining nox sessions in a local clone. All passed: integration_tests (1873/1873), e2e_tests (114/114), full nox (all 11 sessions), coverage_report (97.25% ≥ 97% threshold — AC5 satisfied). PR description updated with ✓ marks for all 7 sessions.
  • m1: Added And the step "bug step" should have status "passed" to the negative scenario for symmetry with the positive scenario.
  • m2: Updated PR title to exactly match commit message: fix(testing): document and harden non-AssertionError guard in apply_tdd_inversion to reduce flaky CI.
  • m3: Added inline comment: # Intentional duplication: logger may route to structured sinks not visible in CI; stderr guarantees visibility in Behave output.
  • m5: Added defensive truncation: exc_text = str(step.exception)[:500] to avoid runaway output for exceptions with large __str__ representations.
  • m4 (no isolated unit test) — deferred; BDD coverage is sufficient for this PR.

Cycle 3

Review findings: Approved — No critical or major issues. All acceptance criteria satisfied.

Minor note: PR milestone shows v3.5.0 but ticket #8294 is on v3.2.0; PR milestone should be corrected to v3.2.0 before merge.

Two formatting nits (cosmetic multi-line print() and exc_text assignment) — do not affect correctness.


Remaining Issues

Item Status
PR milestone mismatch (v3.5.0 → v3.2.0) Needs one-click fix in Forgejo before merge
_warning_with_stderr isolated unit test Deferred to follow-up ticket
"Unexpected pass" warning path not using _warning_with_stderr Deferred — out of scope for #8294
500-char truncation boundary test Deferred — trivial Python slice, follow-up ticket recommended

All code changes are correct, well-typed, and tested. Branch is clean after rebase. Coverage 97.25% ≥ 97%.

## Self-QA Implementation Notes (Cycles 1–3) ### Cycle 1 **Review findings (3C/2M/6m/4n):** - **C1 (Critical):** Non-atomic commit diff — 111 files in the PR diff, including apparent reversions of LockService (#7989), silent fallback (#8232), and fail_fast test (#7582), plus 98+ agent config files and a new CI workflow. Root cause: branch was not rebased onto master after those 107 files changed on master. - **C2 (Critical):** CHANGELOG.md appeared to delete entries for #8232, #7989, and #7582. Also, no new entry for #8294. - **C3 (Critical):** CONTRIBUTORS.md appeared to delete HAL 9000 contribution lines. - **M2 (Major):** Missing negative BDD scenario — no test verifying that `_warning_with_stderr` is NOT called when the exception IS an `AssertionError`. - **m1–m6 (Minor):** Weak `getattr` guard in stderr step; extra blank line in `_warning_with_stderr`; undocumented eager f-string; missing return value assertion in new scenario; audit result not in PR description; PR validation section incomplete. **Fixes applied:** - **C1/C2/C3:** Performed `git rebase origin/master`. All 107 phantom diffs were ghost artifacts of the out-of-date branch base — the branch's actual commit only touched 5 files. After rebase, the PR diff is clean: exactly the files for #8294. - A new CHANGELOG entry for #8294 was added under `[Unreleased] → ### Fixed`. - **M2:** Added `Scenario: apply_tdd_inversion does not emit guard warning for AssertionError exceptions` with a `Then stderr output should not contain "Non-assertion exception"` step. - **m1:** Replaced `getattr(context, "captured_stderr", "")` with `assert hasattr(context, "captured_stderr")` plus a descriptive error message in both `step_check_stderr_contains` and the new `step_check_stderr_not_contains` step definitions. - **m3:** Removed extra blank line between docstring and function body in `_warning_with_stderr`. - **m4:** Added inline comment explaining the intentional eager f-string deviation. - **m5:** Added `And the apply_tdd_inversion result should be True/False` to both new scenarios. Note: `apply_tdd_inversion` returns `False` for a successful inversion (scenario no longer failed), so the AssertionError (happy path) scenario asserts `False`. - **m2/m6:** PR description updated with AC #4 audit note and full nox session list. --- ### Cycle 2 **Review findings (0C/1M/5m/5n):** - **M1 (Major):** PR Validation section had only 3 of 7 nox sessions confirmed (✓). The 4 unconfirmed sessions included `coverage_report` (required by AC5: coverage ≥ 97%). - **m1–m5 (Minor):** Negative scenario missing step-level status assertion; PR title diverged from commit message; potential double stderr emission undocumented; no isolated unit test for `_warning_with_stderr` (deferred); no defensive truncation for `str(step.exception)`. **Fixes applied:** - **M1:** Ran all 4 remaining nox sessions in a local clone. All passed: integration_tests (1873/1873), e2e_tests (114/114), full nox (all 11 sessions), coverage_report (**97.25% ≥ 97% threshold — AC5 satisfied**). PR description updated with ✓ marks for all 7 sessions. - **m1:** Added `And the step "bug step" should have status "passed"` to the negative scenario for symmetry with the positive scenario. - **m2:** Updated PR title to exactly match commit message: `fix(testing): document and harden non-AssertionError guard in apply_tdd_inversion to reduce flaky CI`. - **m3:** Added inline comment: `# Intentional duplication: logger may route to structured sinks not visible in CI; stderr guarantees visibility in Behave output.` - **m5:** Added defensive truncation: `exc_text = str(step.exception)[:500]` to avoid runaway output for exceptions with large `__str__` representations. - **m4** (no isolated unit test) — deferred; BDD coverage is sufficient for this PR. --- ### Cycle 3 **Review findings:** ✅ **Approved** — No critical or major issues. All acceptance criteria satisfied. Minor note: PR milestone shows v3.5.0 but ticket #8294 is on v3.2.0; PR milestone should be corrected to v3.2.0 before merge. Two formatting nits (cosmetic multi-line `print()` and `exc_text` assignment) — do not affect correctness. --- ### Remaining Issues | Item | Status | |------|--------| | PR milestone mismatch (v3.5.0 → v3.2.0) | Needs one-click fix in Forgejo before merge | | `_warning_with_stderr` isolated unit test | Deferred to follow-up ticket | | "Unexpected pass" warning path not using `_warning_with_stderr` | Deferred — out of scope for #8294 | | 500-char truncation boundary test | Deferred — trivial Python slice, follow-up ticket recommended | All code changes are correct, well-typed, and tested. Branch is clean after rebase. Coverage 97.25% ≥ 97%.
hurui200320 2026-04-15 05:49:14 +00:00
Sign in to join this conversation.
No milestone
No project
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core#8294
No description provided.