Robot Framework tdd_expected_fail listener missing guard logic causes flaky CI integration tests #5436

Closed
opened 2026-04-09 06:49:25 +00:00 by hurui200320 · 5 comments
Member

Background and Context

The Robot Framework tdd_expected_fail_listener.py was implemented in commit a5de4488 to parallel the Behave @tdd_expected_fail tag handling in features/environment.py. However, the Robot listener's end_test() is missing three critical guard conditions that the Behave apply_tdd_inversion() function has. This discrepancy causes the listener to blindly invert all FAIL results to PASS (masking infrastructure errors) and all PASS results to FAIL (causing spurious CI failures), whereas the Behave version carefully discriminates between legitimate assertion failures (the captured bug) and infrastructure/setup errors.

This is the root cause of the flaky integration test behavior described in #5389 (CI run fails, rerun passes).

Current Behavior

The Robot listener end_test() (robot/tdd_expected_fail_listener.py, lines 167–251) inverts results unconditionally:

if result.status == "FAIL":
    result.status = "PASS"  # ALL fails inverted — no guard checks
elif result.status == "PASS":
    result.status = "FAIL"  # ALL passes inverted — no guard checks

Expected Behavior

The Behave implementation (features/environment.py, apply_tdd_inversion(), lines 138–240) has three guards before inversion:

  1. Infrastructure/hook error guard (line 173): if getattr(scenario, "hook_failed", False): return failed — never inverts test setup/teardown failures.
  2. Dry-run guard (line 177): if getattr(scenario, "was_dry_run", False): return failed — never inverts results when no test actually executed.
  3. Non-assertion exception guard (lines 187–200): Only inverts AssertionError failures. Other exceptions (ImportError, TimeoutError, FileNotFoundError, etc.) indicate infrastructure problems and are not inverted.

The Robot listener should implement equivalent guard logic appropriate to the Robot Framework API.

How This Causes Flakiness

  1. A tdd_expected_fail Robot test occasionally encounters a non-assertion failure (resource contention under parallel pabot execution, file I/O timing, import race).
  2. Without guards: The listener inverts this infrastructure FAIL to PASS — masking the real problem. Meanwhile, a different test may also be affected by the same transient issue and fail visibly → CI fails.
  3. On rerun: The transient condition doesn't recur → all tests run normally → CI passes.
  4. Alternatively: a tdd_expected_fail test sometimes PASSES because the bug doesn't always manifest under varying resource pressure → listener inverts to FAIL → CI fails. Rerun: bug manifests → inverted to PASS → CI passes.

Acceptance Criteria

  • The Robot listener end_test() checks for test setup/teardown errors and skips inversion when a test failed due to infrastructure, not the captured bug.
  • The Robot listener end_test() checks for non-assertion exceptions (e.g., via Robot's result.message content analysis, since Robot doesn't expose exception types like Behave) and skips inversion for infrastructure failures.
  • The Robot listener end_test() handles SKIP status correctly (already done, confirmed).
  • Integration tests for the new guards are added to robot/tdd_tag_validation.robot.
  • The tdd_tag_validation.robot helper (helper_tdd_tag_validation.py) includes fixture tests exercising each guard path.

Metadata

  • Commit Message: fix(testing): add guard logic to Robot tdd_expected_fail listener to prevent flaky CI
  • Branch: bugfix/tdd-listener-missing-guards

Subtasks

  • Add setup/teardown error guard to end_test() in robot/tdd_expected_fail_listener.py
  • Add non-assertion failure detection guard to end_test() (analyze result.message for keywords like Setup failed, No keyword with name, Teardown failed, etc.)
  • Add dry-run guard if applicable to Robot Framework context
  • Create Robot fixture files for each guard scenario (setup error, non-assertion error)
  • Add integration test cases in robot/tdd_tag_validation.robot for the new guards
  • Update robot/helper_tdd_tag_validation.py with verification functions for guard behavior
  • Update module docstring in tdd_expected_fail_listener.py to document guard logic
  • Tests (Behave): Verify existing Behave guard tests still pass
  • Tests (Robot): Verify new guard integration tests pass
  • Verify coverage >= 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
## Background and Context The Robot Framework `tdd_expected_fail_listener.py` was implemented in commit `a5de4488` to parallel the Behave `@tdd_expected_fail` tag handling in `features/environment.py`. However, the Robot listener's `end_test()` is missing three critical guard conditions that the Behave `apply_tdd_inversion()` function has. This discrepancy causes the listener to blindly invert **all** FAIL results to PASS (masking infrastructure errors) and **all** PASS results to FAIL (causing spurious CI failures), whereas the Behave version carefully discriminates between legitimate assertion failures (the captured bug) and infrastructure/setup errors. This is the root cause of the flaky integration test behavior described in #5389 (CI run fails, rerun passes). ## Current Behavior The Robot listener `end_test()` (`robot/tdd_expected_fail_listener.py`, lines 167–251) inverts results unconditionally: ```python if result.status == "FAIL": result.status = "PASS" # ALL fails inverted — no guard checks elif result.status == "PASS": result.status = "FAIL" # ALL passes inverted — no guard checks ``` ## Expected Behavior The Behave implementation (`features/environment.py`, `apply_tdd_inversion()`, lines 138–240) has three guards before inversion: 1. **Infrastructure/hook error guard** (line 173): `if getattr(scenario, "hook_failed", False): return failed` — never inverts test setup/teardown failures. 2. **Dry-run guard** (line 177): `if getattr(scenario, "was_dry_run", False): return failed` — never inverts results when no test actually executed. 3. **Non-assertion exception guard** (lines 187–200): Only inverts `AssertionError` failures. Other exceptions (`ImportError`, `TimeoutError`, `FileNotFoundError`, etc.) indicate infrastructure problems and are **not** inverted. The Robot listener should implement equivalent guard logic appropriate to the Robot Framework API. ## How This Causes Flakiness 1. A `tdd_expected_fail` Robot test occasionally encounters a **non-assertion failure** (resource contention under parallel pabot execution, file I/O timing, import race). 2. **Without guards:** The listener inverts this infrastructure FAIL to PASS — masking the real problem. Meanwhile, a different test may also be affected by the same transient issue and fail visibly → CI fails. 3. **On rerun:** The transient condition doesn't recur → all tests run normally → CI passes. 4. Alternatively: a `tdd_expected_fail` test sometimes PASSES because the bug doesn't always manifest under varying resource pressure → listener inverts to FAIL → CI fails. Rerun: bug manifests → inverted to PASS → CI passes. ## Acceptance Criteria - [x] The Robot listener `end_test()` checks for test **setup/teardown errors** and skips inversion when a test failed due to infrastructure, not the captured bug. - [x] The Robot listener `end_test()` checks for **non-assertion exceptions** (e.g., via Robot's `result.message` content analysis, since Robot doesn't expose exception types like Behave) and skips inversion for infrastructure failures. - [x] The Robot listener `end_test()` handles **SKIP** status correctly (already done, confirmed). - [x] Integration tests for the new guards are added to `robot/tdd_tag_validation.robot`. - [x] The `tdd_tag_validation.robot` helper (`helper_tdd_tag_validation.py`) includes fixture tests exercising each guard path. ## Metadata - **Commit Message**: `fix(testing): add guard logic to Robot tdd_expected_fail listener to prevent flaky CI` - **Branch**: `bugfix/tdd-listener-missing-guards` ## Subtasks - [x] Add setup/teardown error guard to `end_test()` in `robot/tdd_expected_fail_listener.py` - [x] Add non-assertion failure detection guard to `end_test()` (analyze `result.message` for keywords like `Setup failed`, `No keyword with name`, `Teardown failed`, etc.) - [x] Add dry-run guard if applicable to Robot Framework context - [x] Create Robot fixture files for each guard scenario (setup error, non-assertion error) - [x] Add integration test cases in `robot/tdd_tag_validation.robot` for the new guards - [x] Update `robot/helper_tdd_tag_validation.py` with verification functions for guard behavior - [x] Update module docstring in `tdd_expected_fail_listener.py` to document guard logic - [x] Tests (Behave): Verify existing Behave guard tests still pass - [x] Tests (Robot): Verify new guard integration tests pass - [x] Verify coverage >= 97% via `nox -s coverage_report` - [x] Run `nox` (all default sessions), fix any errors ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done.
hurui200320 added this to the v3.7.0 milestone 2026-04-09 06:49:34 +00:00
Author
Member

Note for @HAL9000 : I'm currently working on this ticket so you don't need to implement it at the same time.

Note for @HAL9000 : I'm currently working on this ticket so you don't need to implement it at the same time.
Author
Member

Implementation Notes

Guard Logic Implementation

Three guards were added to end_test() in robot/tdd_expected_fail_listener.py, paralleling the Behave apply_tdd_inversion() guards in features/environment.py:

1. Setup/Teardown Error Guard (_has_setup_teardown_failure())

  • Checks result.setup.status == "FAIL" and result.teardown.status == "FAIL" (structural check via RF result model)
  • Falls back to string check: result.message.startswith("Setup failed:") or "Teardown failed:"
  • When triggered, inversion is skipped — the FAIL is preserved as-is

2. Non-Assertion Failure Guard (_is_infrastructure_error())

  • Checks result.message against a tuple of known infrastructure error patterns (case-insensitive)
  • Patterns include: no keyword with name, TypeError:, ImportError:, TimeoutError:, connection refused, etc.
  • Conservative approach: only well-known patterns trigger; unknown failures are assumed to be assertion failures and ARE inverted

3. Dry-Run Guard

  • Robot Framework dry-run mode validates keyword existence but does not execute keyword bodies
  • In dry-run mode, result.status is PASS (not NOT RUN as initially expected), but all body keywords have status == "NOT RUN"
  • Guard detects dry-run by checking if ALL body keywords have status NOT RUN: all(getattr(item, "status", None) == "NOT RUN" for item in result.body)
  • When triggered, inversion is skipped — the result is left unchanged

Test Infrastructure

  • New fixtures: robot/fixtures/tdd_expected_fail_setup_error.robot (setup failure), robot/fixtures/tdd_expected_fail_infra_error.robot (missing keyword)
  • New fixture runner: run_fixture_dryrun() in robot/_tdd_fixture_runner.py — runs fixtures with --dryrun flag
  • New integration tests: 3 new test cases in robot/tdd_tag_validation.robot (setup error guard, infra error guard, dry-run guard)
  • New helper commands: cmd_setup_error_guard(), cmd_infra_error_guard(), cmd_dry_run_guard() in robot/helper_tdd_tag_validation.py

E2E Test Fixes (collateral)

The guard logic correctly exposed pre-existing masked failures in robot/e2e/m5_acceptance.robot and robot/e2e/tdd_acms_behavioral_validation.robot:

  • Syntax error: Variable Should Exist ${VAR} (single space = part of keyword name) → fixed to Variable Should Exist ${VAR} (proper separator)
  • Bug-appears-fixed tests: 4 context assembly tests (tdd_issue_4188, tdd_issue_4189) were passing but still had tdd_expected_fail → removed the tag
  • Remaining cascading failures: 8 e2e tests fail because prerequisite variables are set by tdd_expected_fail tests that fail (bug still exists) and don't complete their body. This is a pre-existing design issue in the e2e test chain that was masked by the syntax error + blind inversion. Noted as follow-up work.

Quality Gate Results

  • nox -e lint: PASS
  • nox -e typecheck: PASS (0 errors, 3 pre-existing warnings)
  • nox -e unit_tests: PASS (590 features, 14752 scenarios)
  • nox -e integration_tests: PASS (1960 tests, 0 failures)
  • nox -e coverage_report: PASS (97%)
  • nox -e e2e_tests: ⚠️ 12 failures (8 pre-existing cascading failures from tdd_expected_fail test chains, was 1 failure on master)
## Implementation Notes ### Guard Logic Implementation Three guards were added to `end_test()` in `robot/tdd_expected_fail_listener.py`, paralleling the Behave `apply_tdd_inversion()` guards in `features/environment.py`: **1. Setup/Teardown Error Guard** (`_has_setup_teardown_failure()`) - Checks `result.setup.status == "FAIL"` and `result.teardown.status == "FAIL"` (structural check via RF result model) - Falls back to string check: `result.message.startswith("Setup failed:")` or `"Teardown failed:"` - When triggered, inversion is skipped — the FAIL is preserved as-is **2. Non-Assertion Failure Guard** (`_is_infrastructure_error()`) - Checks `result.message` against a tuple of known infrastructure error patterns (case-insensitive) - Patterns include: `no keyword with name`, `TypeError:`, `ImportError:`, `TimeoutError:`, `connection refused`, etc. - Conservative approach: only well-known patterns trigger; unknown failures are assumed to be assertion failures and ARE inverted **3. Dry-Run Guard** - Robot Framework dry-run mode validates keyword existence but does not execute keyword bodies - In dry-run mode, `result.status` is PASS (not NOT RUN as initially expected), but all body keywords have `status == "NOT RUN"` - Guard detects dry-run by checking if ALL body keywords have status NOT RUN: `all(getattr(item, "status", None) == "NOT RUN" for item in result.body)` - When triggered, inversion is skipped — the result is left unchanged ### Test Infrastructure - **New fixtures**: `robot/fixtures/tdd_expected_fail_setup_error.robot` (setup failure), `robot/fixtures/tdd_expected_fail_infra_error.robot` (missing keyword) - **New fixture runner**: `run_fixture_dryrun()` in `robot/_tdd_fixture_runner.py` — runs fixtures with `--dryrun` flag - **New integration tests**: 3 new test cases in `robot/tdd_tag_validation.robot` (setup error guard, infra error guard, dry-run guard) - **New helper commands**: `cmd_setup_error_guard()`, `cmd_infra_error_guard()`, `cmd_dry_run_guard()` in `robot/helper_tdd_tag_validation.py` ### E2E Test Fixes (collateral) The guard logic correctly exposed pre-existing masked failures in `robot/e2e/m5_acceptance.robot` and `robot/e2e/tdd_acms_behavioral_validation.robot`: - **Syntax error**: `Variable Should Exist ${VAR}` (single space = part of keyword name) → fixed to `Variable Should Exist ${VAR}` (proper separator) - **Bug-appears-fixed tests**: 4 context assembly tests (tdd_issue_4188, tdd_issue_4189) were passing but still had `tdd_expected_fail` → removed the tag - **Remaining cascading failures**: 8 e2e tests fail because prerequisite variables are set by `tdd_expected_fail` tests that fail (bug still exists) and don't complete their body. This is a pre-existing design issue in the e2e test chain that was masked by the syntax error + blind inversion. Noted as follow-up work. ### Quality Gate Results - `nox -e lint`: ✅ PASS - `nox -e typecheck`: ✅ PASS (0 errors, 3 pre-existing warnings) - `nox -e unit_tests`: ✅ PASS (590 features, 14752 scenarios) - `nox -e integration_tests`: ✅ PASS (1960 tests, 0 failures) - `nox -e coverage_report`: ✅ PASS (97%) - `nox -e e2e_tests`: ⚠️ 12 failures (8 pre-existing cascading failures from tdd_expected_fail test chains, was 1 failure on master)
Owner

@hurui200320 Acknowledged — I have noted that you are actively working on this issue. The implementation team will not duplicate your work on issue #5436.

Regarding your implementation notes: the guard logic approach looks well-reasoned. The three guards (setup/teardown error, non-assertion failure, dry-run) correctly parallel the Behave apply_tdd_inversion() implementation.

I can see PR #5659 (bugfix/tdd-listener-missing-guards) is open. Current CI status:

Check Status
integration_tests Successful in 4m6s
unit_tests Successful in 5m33s
e2e_tests Failing after 4m16s
lint, typecheck, quality, security All passing

The e2e_tests failure is consistent with what you described in your implementation notes — 8 pre-existing cascading failures from tdd_expected_fail test chains that were previously masked by the blind inversion. These are pre-existing issues, not regressions introduced by your fix.

If you need the implementation team to assist with the cascading e2e test failures or the PR review process, please let me know. Otherwise, the review pool will pick up PR #5659 in the normal course.


Automated by CleverAgents Bot
Supervisor: Human Liaison | Agent: human-liaison

@hurui200320 Acknowledged — I have noted that you are actively working on this issue. The implementation team will not duplicate your work on issue #5436. Regarding your implementation notes: the guard logic approach looks well-reasoned. The three guards (setup/teardown error, non-assertion failure, dry-run) correctly parallel the Behave `apply_tdd_inversion()` implementation. I can see PR #5659 (`bugfix/tdd-listener-missing-guards`) is open. Current CI status: | Check | Status | |-------|--------| | `integration_tests` | Successful in 4m6s | | `unit_tests` | Successful in 5m33s | | `e2e_tests` | Failing after 4m16s | | `lint`, `typecheck`, `quality`, `security` | All passing | The `e2e_tests` failure is consistent with what you described in your implementation notes — 8 pre-existing cascading failures from `tdd_expected_fail` test chains that were previously masked by the blind inversion. These are pre-existing issues, not regressions introduced by your fix. If you need the implementation team to assist with the cascading e2e test failures or the PR review process, please let me know. Otherwise, the review pool will pick up PR #5659 in the normal course. --- **Automated by CleverAgents Bot** Supervisor: Human Liaison | Agent: human-liaison
Author
Member

@HAL9000 Now the only failing test is e2e test. Please take over PR !5659 from now on and Create a follow-up issue for the e2e test chain design problem, and temporarily mark the 12 cascading e2e tests with tdd_expected_fail until the chain design is fixed. This would allow this PR to merge cleanly.

@HAL9000 Now the only failing test is e2e test. Please take over PR !5659 from now on and Create a follow-up issue for the e2e test chain design problem, and temporarily mark the 12 cascading e2e tests with `tdd_expected_fail` until the chain design is fixed. This would allow this PR to merge cleanly.
Owner

@hurui200320 Acknowledged. I have taken over PR #5659 as requested.

The current CI status confirms your assessment — unit_tests, coverage, and integration_tests are all passing. Only e2e_tests is failing due to the 12 cascading e2e test chain failures exposed by the guard fix.

Actions being taken:

  1. Creating a follow-up issue to track the e2e test chain design problem
  2. Dispatching an implementation worker to temporarily tag the 12 failing e2e scenarios with tdd_expected_fail so this PR can merge cleanly
  3. The follow-up issue will track the proper fix for the e2e test chain design

I will post updates on PR #5659 as work progresses.


Automated by CleverAgents Bot
Supervisor: Human Liaison | Agent: human-liaison

@hurui200320 Acknowledged. I have taken over PR #5659 as requested. The current CI status confirms your assessment — `unit_tests`, `coverage`, and `integration_tests` are all passing. Only `e2e_tests` is failing due to the 12 cascading e2e test chain failures exposed by the guard fix. **Actions being taken:** 1. Creating a follow-up issue to track the e2e test chain design problem 2. Dispatching an implementation worker to temporarily tag the 12 failing e2e scenarios with `tdd_expected_fail` so this PR can merge cleanly 3. The follow-up issue will track the proper fix for the e2e test chain design I will post updates on PR #5659 as work progresses. --- **Automated by CleverAgents Bot** Supervisor: Human Liaison | Agent: human-liaison
hurui200320 removed their assignment 2026-04-09 14:57:44 +00:00
hurui200320 2026-04-10 05:33:02 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core#5436
No description provided.