feat(testing): implement @tdd_expected_fail tag handling in Behave environment #665

Major — False failures in --dry-run mode

In dry-run mode, Behave skips hooks (if not runner.config.dry_run and run_scenario:) and doesn't execute steps. _original_run returns False (no failures). The wrapper then enters the 'unexpected pass' branch at line 155 and forces a failure with "Bug appears to be fixed".

This is incorrect — no test actually ran, so the bug's status is unknown. Every @tdd_expected_fail scenario would fail during --dry-run, breaking test discovery.

Fix: Add a dry-run guard:

if getattr(self, 'was_dry_run', False):
    return failed

was_dry_run is set by Behave's Scenario.run() at the start of execution.
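A minimal sketch of where the guard would sit, assuming the wrapper has the shape quoted in this PR. `FakeScenario`, `original_run`, and `tdd_aware_run` are stand-ins written for illustration, not Behave classes or the PR's actual code; only the `was_dry_run` attribute and the `tdd_expected_fail` tag come from the review above.

```python
class FakeScenario:
    """Hypothetical stand-in for behave.model.Scenario (illustration only)."""

    def __init__(self, tags, was_dry_run):
        self.effective_tags = tags
        self.was_dry_run = was_dry_run


def original_run(scenario, runner):
    # Stand-in for the unpatched Scenario.run(): nothing fails here,
    # which is also what a dry run reports.
    return False


def tdd_aware_run(scenario, runner=None):
    failed = original_run(scenario, runner)
    if "tdd_expected_fail" not in scenario.effective_tags:
        return failed
    # Dry-run guard: no step ran, so the result says nothing about the
    # bug and must not be inverted.
    if getattr(scenario, "was_dry_run", False):
        return failed
    if failed:
        return False  # expected failure, reported as passed
    return True  # unexpected pass, reported as failed


dry = tdd_aware_run(FakeScenario({"tdd_expected_fail"}, was_dry_run=True))
real = tdd_aware_run(FakeScenario({"tdd_expected_fail"}, was_dry_run=False))
```

With the guard in place, the dry-run scenario is returned unchanged, while a real run that passes is still flagged as an unexpected pass.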


features/environment.py Outdated

						
@@ -29,0 +137,4 @@
        failed: bool = _original_run(self, runner)
        if not should_invert_result(set(self.effective_tags)):
            return failed

hurui200320 commented

Major — Hook errors masked by inversion

If before_scenario (lines 370-442), after_scenario, or context cleanup raises an exception on a @tdd_expected_fail scenario, the wrapper sees failed=True and flips it to passed. The hook error is silently swallowed.

The wrapper needs to check self.hook_failed or self.status in (Status.hook_error, Status.cleanup_error) and bail out without inverting:

if self.hook_failed or self.status in (
    Status.hook_error, Status.cleanup_error,
):
    return failed
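The guard can be sketched in isolation as follows. This is an illustration under stated assumptions: `Status` here is a local enum that mirrors only the status names cited in this review, and `FakeScenario`, `is_infrastructure_failure`, and `invert_if_allowed` are hypothetical helpers, not part of Behave or of the PR.

```python
from enum import Enum


class Status(Enum):
    """Local stand-in mirroring only the status names cited in the review."""

    passed = 1
    failed = 2
    hook_error = 3
    cleanup_error = 4


class FakeScenario:
    """Hypothetical stand-in for a scenario after _original_run returned."""

    def __init__(self, status, hook_failed=False):
        self.status = status
        self.hook_failed = hook_failed


def is_infrastructure_failure(scenario):
    """True when the failure came from a hook or cleanup, not from a step."""
    return bool(getattr(scenario, "hook_failed", False)) or scenario.status in (
        Status.hook_error,
        Status.cleanup_error,
    )


def invert_if_allowed(scenario, failed):
    """Invert the result only for genuine step failures or passes."""
    if is_infrastructure_failure(scenario):
        return failed  # infrastructure errors are passed through untouched
    return not failed
```

A hook or cleanup error keeps `failed=True`, so the runner still reports it; only ordinary step results are inverted.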


features/environment.py Outdated

						
@@ -29,0 +141,4 @@
        if failed:
            # Expected failure — the bug still exists.  Reset the failed
            # steps so the scenario is reported as passed.
            for step in self.all_steps:

hurui200320 commented

Minor — Any exception type treated as 'bug still exists'

The wrapper inverts any step failure, whether it's an AssertionError (expected for TDD bug tests) or a RuntimeError, TypeError, ConnectionError, etc. A genuine infrastructure failure during step execution would be silently treated as "bug still exists" rather than flagged as an error.

Consider verifying that failed steps have AssertionError before inverting, and logging a warning (without inverting) for non-assertion exceptions.
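One possible shape for that check, as a sketch only. `FakeStep` and `failure_is_assertion` are hypothetical names; the sketch assumes each failed step carries its exception on an `exception` attribute, as the quoted diff suggests.

```python
import logging

logger = logging.getLogger("tdd_expected_fail_sketch")


class FakeStep:
    """Hypothetical stand-in for a failed Behave step."""

    def __init__(self, name, exception=None):
        self.name = name
        self.exception = exception


def failure_is_assertion(failed_steps):
    """Return True only if every failed step raised an AssertionError.

    Intended to be called only when at least one step failed.
    """
    all_assertions = True
    for step in failed_steps:
        if not isinstance(step.exception, AssertionError):
            logger.warning(
                "Step '%s' failed with %s, not AssertionError; not inverting",
                step.name,
                type(step.exception).__name__,
            )
            all_assertions = False
    return all_assertions
```

The wrapper would invert the result only when this returns `True` and otherwise return `failed` unchanged, so a `ConnectionError` surfaces as a real failure.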


features/environment.py Outdated

						
@@ -29,0 +144,4 @@
            for step in self.all_steps:
                if step.status == Status.failed:
                    step.status = Status.untested
                    step.exception = None

hurui200320 commented

2026-03-11 06:39:21 +00:00

Minor — Diagnostic info lost without logging

The exception and traceback are cleared without any record. If a developer needs to verify that the expected failure is the correct failure (the right AssertionError, not an unrelated crash), this info is gone.

Consider logging at DEBUG before clearing:

_tdd_logger.debug(
    "Clearing expected-fail exception for step '%s': %s",
    step.name, step.exception,
)


hurui200320 requested changes 2026-03-11 06:39:21 +00:00

Dismissed

hurui200320 left a comment

Follow-Up Review — Previous Issues Still Unresolved + Additional Findings

The code has not been updated since my previous review (commit a2d31b3 unchanged). Both major issues I flagged remain unresolved. I'm re-confirming those and adding additional findings from a deeper pass.

Still Unresolved — Major Issues (from previous review)

1. Hook/cleanup errors silently masked by result inversion (features/environment.py:140)

_tdd_aware_run checks only if failed: without distinguishing whether the failure originates from a step assertion or from an infrastructure/hook error. If before_scenario raises (e.g., container override at line 439 fails), Behave sets hook_failed = True and returns failed=True. The wrapper flips this to passed — completely hiding the infrastructure error.

2. --dry-run mode falsely fails all @tdd_expected_fail scenarios (features/environment.py:136-162)

In dry-run mode, Behave doesn't execute steps; _original_run returns False. The wrapper sees "not failed" + @tdd_expected_fail → forces failure with "Bug appears to be fixed". This is incorrect — no test ran. Every @tdd_expected_fail scenario would fail during --dry-run, breaking test discovery workflows.

Still Unresolved — Minor Issues (from previous review)

3. Exception info discarded without logging (features/environment.py:147-148)

step.exception = None; step.exc_traceback = None loses diagnostic info. A _tdd_logger.debug(...) call before clearing would aid debugging.

4. Any exception type treated as "bug still exists" (features/environment.py:141-151)

A RuntimeError, TypeError, or ConnectionError during step execution would be silently inverted, not just AssertionError. The wrapper should at minimum log a warning when the failed step's exception is not an AssertionError.

New Finding — Documentation Inaccuracy

5. Feature file and step docstring reference incorrect mechanism (features/testing/tdd_expected_fail_demo.feature:4-5, features/steps/tdd_tag_validation_steps.py:104)

The demo feature description says "This exercises the after_scenario hook logic" and the step docstring says "The @tdd_expected_fail tag on the scenario causes the after_scenario hook to invert this failure." Both are inaccurate — the inversion is done by the Scenario.run() monkey-patch (_install_tdd_expected_fail_patch), not after_scenario. The PR itself correctly documents this in after_scenario at line 560 with # NOTE: TDD @tdd_expected_fail result inversion is handled by the Scenario.run() wrapper. The feature file and step docstring should be updated to match.

New Finding — Missing Test Coverage for "Unexpected Pass" Branch

6. No test exercises the "unexpected pass → forced failure" path (features/environment.py:152-162)

The demo feature tests only the "expected failure inverted to pass" path. The code branch at line 152-162 where a @tdd_expected_fail scenario passes (meaning the bug is fixed but the tag wasn't removed) is not exercised by any test. This is one of the acceptance criteria from ticket #627: "Scenarios tagged @tdd_expected_fail that pass have their result inverted to fail." While meta-testing this is non-trivial, even a unit-level test of the _tdd_aware_run function with a mocked scenario object would improve confidence.
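A unit-level test along those lines might look like the sketch below. It does not import the PR's code: `make_tdd_aware_run` is a simplified, hypothetical stand-in for the wrapper (the real one also resets step and scenario status), and the tag `tdd_bug_1` is an invented example value.

```python
from types import SimpleNamespace
from unittest import mock


def make_tdd_aware_run(original_run):
    """Simplified stand-in for the PR's wrapper (status resets omitted)."""

    def tdd_aware_run(scenario, runner):
        failed = original_run(scenario, runner)
        if "tdd_expected_fail" not in set(scenario.effective_tags):
            return failed
        return not failed

    return tdd_aware_run


def test_unexpected_pass_is_reported_as_failure():
    original = mock.Mock(return_value=False)  # the scenario "passed"
    scenario = SimpleNamespace(effective_tags=["tdd_expected_fail", "tdd_bug_1"])
    run = make_tdd_aware_run(original)
    assert run(scenario, None) is True
    original.assert_called_once_with(scenario, None)


def test_expected_failure_is_reported_as_pass():
    original = mock.Mock(return_value=True)  # the scenario failed as expected
    scenario = SimpleNamespace(effective_tags=["tdd_expected_fail", "tdd_bug_1"])
    assert make_tdd_aware_run(original)(scenario, None) is False


def test_untagged_scenario_is_untouched():
    original = mock.Mock(return_value=True)
    scenario = SimpleNamespace(effective_tags=["mock_only"])
    assert make_tdd_aware_run(original)(scenario, None) is True
```

Mocking `_original_run` this way exercises the "unexpected pass" branch without needing a Behave scenario that really passes under the tag.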

Positive Observations

- The Scenario.run() monkey-patch approach is correct: after_scenario cannot modify the runner's return value, so this is the right architectural choice.
- validate_tdd_tags() correctly implements all CONTRIBUTING.md tag validation rules with clear error messages.
- Regex r"tdd_bug_\d+" with fullmatch() is correct and not vulnerable to ReDoS.
- Idempotency guard (_tdd_run_patched) handles forked workers correctly.
- Step names are well-namespaced to avoid AmbiguousStep conflicts.
- @mock_only tag on test features is appropriate.
- CHANGELOG entry is well-written and references #627.
- Tag validation in before_scenario runs before other setup, which is the correct placement.
- Moving import logging to module level (from inside after_scenario) is a good cleanup.
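To illustrate the regex observation: `fullmatch()` requires the whole tag to match, whereas `match()` only anchors at the start. The sketch below uses the pattern quoted in this review; `is_bug_tag` is a hypothetical helper name, and the sample tags are invented.

```python
import re

# Pattern as quoted in the review; compiled once at module load.
_TDD_BUG_N_RE = re.compile(r"tdd_bug_\d+")


def is_bug_tag(tag):
    """True only when the entire tag has the form tdd_bug_<digits>."""
    return _TDD_BUG_N_RE.fullmatch(tag) is not None


accepted = is_bug_tag("tdd_bug_627")
rejected_suffix = is_bug_tag("tdd_bug_627_extra")
rejected_empty = is_bug_tag("tdd_bug_")
# match() would accept the suffixed tag because it only anchors at the start.
prefix_only = _TDD_BUG_N_RE.match("tdd_bug_627_extra") is not None
```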

Verdict: REQUEST_CHANGES

The two major issues (hook error masking and dry-run false failures) are blocking. The documentation inaccuracy and missing test coverage should also be addressed. The minor issues (exception logging, exception type discrimination) are recommended improvements.


features/steps/tdd_tag_validation_steps.py Outdated

						
@@ -0,0 +101,4 @@
def step_tdd_demo_deliberate_fail(context: Context) -> None:
    """This step deliberately fails to simulate a bug still being present.

    The ``@tdd_expected_fail`` tag on the scenario causes the after_scenario

hurui200320 commented

Documentation inaccuracy — This docstring says "The @tdd_expected_fail tag on the scenario causes the after_scenario hook to invert this failure into a pass."

This should reference the Scenario.run() wrapper, not after_scenario. Suggested fix:

    """This step deliberately fails to simulate a bug still being present.

    The ``@tdd_expected_fail`` tag on the scenario causes the Scenario.run()
    wrapper (installed by _install_tdd_expected_fail_patch) to invert this
    failure into a pass.
    """


features/testing/tdd_expected_fail_demo.feature Outdated

						
@@ -0,0 +1,12 @@
@mock_only
Feature: TDD expected-fail result inversion demo
  Demonstrates the @tdd_expected_fail tag inverting a deliberately failing
  scenario so it is reported as passed.  This exercises the after_scenario

hurui200320 commented

2026-03-11 06:39:21 +00:00

Documentation inaccuracy: This description says "after_scenario hook logic", but the inversion is performed by the Scenario.run() monkey-patch installed in _install_tdd_expected_fail_patch(), not by after_scenario. The comment at environment.py:560 explicitly states that the inversion is handled by the Scenario.run() wrapper. Please update to:

  Demonstrates the @tdd_expected_fail tag inverting a deliberately failing
  scenario so it is reported as passed.  This exercises the Scenario.run()
  wrapper installed by _install_tdd_expected_fail_patch() in features/environment.py.


hurui200320 requested changes 2026-03-11 06:39:27 +00:00

Dismissed

hurui200320 left a comment

Code Review — Rui (Agent 5 of 5)

PR Summary

This PR implements the @tdd_expected_fail three-tag system for TDD bug-capture tests in features/environment.py, as specified in ticket #627 and CONTRIBUTING.md > TDD Bug Test Tags. It adds:

- validate_tdd_tags(): enforces tag combination rules in before_scenario
- should_invert_result(): checks if a scenario has @tdd_expected_fail
- _install_tdd_expected_fail_patch(): monkey-patches Scenario.run() to invert pass/fail results
- 13 BDD validation scenarios + 1 demo scenario under features/testing/
- CHANGELOG.md entry

Verification of Previous Review Issues

I submitted a previous review on this commit. Since the commit SHA is unchanged (a2d31b39), all four issues from my previous review remain unresolved. I'm confirming they are still valid after deeper investigation:

Issues Found

1. Hook/cleanup errors silently masked by result inversion (Major — Blocking)

File: features/environment.py:136-162

Problem: _tdd_aware_run calls _original_run(self, runner) which sets self.hook_failed = True and returns failed=True when before_scenario/after_scenario hooks raise exceptions. The wrapper at line 141 only checks if failed: and inverts it to passed, hiding the infrastructure error completely.

I verified that Behave's Scenario.run() sets self.hook_failed (line ~7 of the source) and checks it at line ~32 to set failed = True. After _original_run returns, self.hook_failed is available and reliable.

Fix: Add a guard before the inversion logic:

if self.hook_failed or self.status in (Status.hook_error, Status.cleanup_error):
    return failed  # Infrastructure errors must never be inverted

2. `--dry-run` mode falsely fails all `@tdd_expected_fail` scenarios (Major — Blocking)

File: features/environment.py:136-162

Problem: In --dry-run, Behave skips hooks and step execution. _original_run sets self.was_dry_run = True and returns False (no failures). The wrapper enters the "unexpected pass" branch (line 152) and forces a failure with "Bug appears to be fixed." — even though no test actually ran.

I verified that Behave's Scenario.run() sets self.was_dry_run = dry_run_scenario at line ~10 of the source, so the attribute is available in the wrapper.

Fix: Skip inversion when in dry-run mode:

if getattr(self, 'was_dry_run', False):
    return failed

3. Diagnostic info discarded without logging (Minor)

File: features/environment.py:147-148

Problem: step.exception = None and step.exc_traceback = None discard the exception details with no record. When a developer needs to verify the failure is the correct expected failure (not an unrelated crash), the info is gone.

Fix: Log at DEBUG level before clearing:

_tdd_logger.debug("Clearing expected-fail exception for step '%s': %s", step.name, step.exception)

4. Non-AssertionError exceptions treated as "bug still exists" (Minor)

File: features/environment.py:141-151

Problem: The wrapper inverts any step failure, regardless of exception type. A RuntimeError, TypeError, or ConnectionError during step execution would be silently inverted instead of flagged. TDD bug tests should only raise AssertionError — any other exception likely indicates an infrastructure problem.

Fix: Check exception types before inverting; log a warning for non-assertion exceptions and do not invert them.

5. Misleading docstring/comment references to `after_scenario` (Trivial)

File: features/steps/tdd_tag_validation_steps.py:104, features/testing/tdd_expected_fail_demo.feature:5

Problem: The step docstring at line 104 says "The @tdd_expected_fail tag on the scenario causes the after_scenario hook to invert this failure into a pass." The demo feature description (line 5) says "This exercises the after_scenario hook logic." Both are inaccurate — the inversion is done in the Scenario.run() monkey-patch, not in after_scenario. The environment.py code itself correctly documents this at line 560-563 with a NOTE comment.

Fix: Update references to say "the Scenario.run() wrapper" instead of "the after_scenario hook".

Positive Observations

1. Well-chosen architecture: Using a Scenario.run() wrapper rather than after_scenario is the correct approach. after_scenario cannot modify the return value of Scenario.run(), so it can't change what the runner sees. The PR description and environment.py comments explain this design choice clearly.
2. Thorough validation logic: validate_tdd_tags() correctly covers all CONTRIBUTING.md tag rules with clear, actionable error messages including references to documentation.
3. Comprehensive test coverage: 13 scenarios for validation (5 valid combinations + 5 invalid combinations + 3 should_invert_result checks) plus 1 demo scenario exercising the actual inversion. All edge cases from CONTRIBUTING.md are covered.
4. Good namespacing of steps: Step names are prefixed with tdd tags / tdd demo to avoid AmbiguousStep conflicts, a common Behave pitfall.
5. Idempotency guard: _tdd_run_patched flag prevents double-patching in forked workers.
6. Module-level regex compilation: _TDD_BUG_N_RE is compiled once at module load. The tdd_bug_\d+ pattern is linear (no backtracking), so it's safe from ReDoS.
7. Clean CHANGELOG entry with proper issue reference.
8. logging import consolidation: Moved the import logging from inside after_scenario to the top-level imports, a clean improvement.
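For readers unfamiliar with the idempotency guard mentioned above, the general pattern can be sketched as follows. This is an assumption-laden illustration: the local `Scenario` class is a stand-in, `install_patch` is a hypothetical name, and storing the `_tdd_run_patched` flag on the class is a guess at how the PR does it.

```python
class Scenario:
    """Hypothetical stand-in for behave.model.Scenario."""

    def run(self, runner):
        return False


def install_patch(cls):
    """Wrap cls.run exactly once; repeated calls are no-ops.

    Returns True if the patch was installed by this call.
    """
    if getattr(cls, "_tdd_run_patched", False):
        return False
    original_run = cls.run

    def wrapped(self, runner):
        # Count how many wrapper layers a single run() call passes through.
        self.wrap_depth = getattr(self, "wrap_depth", 0) + 1
        return original_run(self, runner)

    cls.run = wrapped
    cls._tdd_run_patched = True
    return True


first = install_patch(Scenario)
second = install_patch(Scenario)
scenario = Scenario()
result = scenario.run(None)
```

Without the flag, a second call would wrap the already-wrapped method and the inversion logic would run twice, cancelling itself out.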

Verdict: REQUEST_CHANGES

The two major issues (hook error masking, dry-run false failures) are blocking. Both are correctness bugs that could mask real infrastructure failures in CI or break --dry-run test discovery. The fixes are straightforward (4-6 lines of additional guard logic) and should be quick to implement.


features/environment.py Outdated

						
@@ -29,0 +138,4 @@
        if not should_invert_result(set(self.effective_tags)):
            return failed
        if failed:

hurui200320 commented

Major (Blocking) — Hook errors silently masked by result inversion

After _original_run returns, self.hook_failed is True if before_scenario or after_scenario raised an exception. The wrapper only checks if failed: and inverts it to passed, completely hiding the infrastructure error.

I verified in Behave's Scenario.run() source that self.hook_failed is set at the top of the method and checked at line ~32. It is available and reliable after _original_run returns.

Fix: Add before line 141:

if self.hook_failed or self.status in (Status.hook_error, Status.cleanup_error):
    return failed  # Infrastructure errors must never be inverted


features/environment.py Outdated

						
@@ -29,0 +144,4 @@
            for step in self.all_steps:
                if step.status == Status.failed:
                    step.status = Status.untested
                    step.exception = None

hurui200320 commented

Minor — Diagnostic info lost without logging

The exception and traceback are cleared here without any record. If a developer needs to verify the expected failure is the correct failure (not an unrelated crash), there's no trace.

Fix: Log at DEBUG before clearing:

_tdd_logger.debug("Clearing expected-fail exception for step '%s': %s", step.name, step.exception)


features/environment.py Outdated

						
@@ -29,0 +149,4 @@
            self.clear_status()
            self.set_status(Status.passed)
            return False  # Not a failure for the runner
        # Unexpected pass — the bug appears to be fixed but the

hurui200320 commented

Major (Blocking) — --dry-run mode falsely fails all @tdd_expected_fail scenarios

In dry-run mode, Behave sets self.was_dry_run = True and skips step execution. _original_run returns False. The wrapper enters the 'unexpected pass' branch here and forces failure with "Bug appears to be fixed" — even though no test actually ran.

I verified in Behave's Scenario.run() source that self.was_dry_run = dry_run_scenario is set at line ~10.

Fix: Add after the should_invert_result check:

if getattr(self, 'was_dry_run', False):
    return failed


features/steps/tdd_tag_validation_steps.py Outdated

						
				@@ -0,0 +101,4 @@

				def step_tdd_demo_deliberate_fail(context: Context) -> None:

				    """This step deliberately fails to simulate a bug still being present.

				    The ``@tdd_expected_fail`` tag on the scenario causes the after_scenario

hurui200320 commented

Trivial — Inaccurate docstring

This says "the after_scenario hook", but the inversion is actually done in the Scenario.run() monkey-patch (_install_tdd_expected_fail_patch), not in after_scenario. The environment.py code itself documents this correctly at lines 560-563.

Should read: "The @tdd_expected_fail tag on the scenario causes the Scenario.run() wrapper to invert this failure into a pass."


hurui200320 requested changes 2026-03-11 06:40:14 +00:00

Dismissed

hurui200320 left a comment

Code Review — TDD Expected-Fail Tag Handling (Rui, second pass)

No new commits since my prior review (a2d31b3). Both major issues remain unresolved. Reaffirming REQUEST_CHANGES with a consolidated summary and one additional observation.

Issue 1 — Hook/cleanup errors silently masked by inversion (Major)

File: features/environment.py:136-162

_tdd_aware_run checks if failed: without distinguishing the source of failure. Behave's Scenario.run() sets failed = True for three distinct reasons:

1. A step failed (the case we want to invert)
2. after_scenario hook raised → self.hook_failed = True, self.status = Status.hook_error
3. Context cleanup raised → self.status = Status.cleanup_error

Cases 2 and 3 are infrastructure errors unrelated to the bug under test. The wrapper currently inverts all three, hiding real infrastructure problems.

Fix: Guard against hook/cleanup errors before inverting:

if self.hook_failed or self.status in (Status.hook_error, Status.cleanup_error):
    return failed  # Do NOT invert infrastructure errors
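
The guard can be factored into a small predicate so it can be unit-tested without running Behave. A minimal sketch, assuming a stand-in `Status` enum and a hypothetical helper name `is_infrastructure_failure`; in the real wrapper the enum would be Behave's own `Status` and the object would be the `Scenario` instance:

```python
from enum import Enum
from types import SimpleNamespace


class Status(Enum):
    """Stand-in for Behave's Status enum (values here are illustrative only)."""
    passed = "passed"
    failed = "failed"
    hook_error = "hook_error"
    cleanup_error = "cleanup_error"


def is_infrastructure_failure(scenario) -> bool:
    """Return True when the failure came from a hook or context cleanup.

    Hypothetical helper: such failures should be reported as-is, never inverted.
    """
    if getattr(scenario, "hook_failed", False):
        return True
    return getattr(scenario, "status", None) in (
        Status.hook_error,
        Status.cleanup_error,
    )


# A step failure is eligible for inversion; hook and cleanup errors are not.
step_failure = SimpleNamespace(hook_failed=False, status=Status.failed)
hook_failure = SimpleNamespace(hook_failed=True, status=Status.hook_error)
cleanup_failure = SimpleNamespace(hook_failed=False, status=Status.cleanup_error)
```

With this shape, `_tdd_aware_run` would return `failed` unchanged whenever the predicate is true, before reaching the inversion branch.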

Issue 2 — `--dry-run` mode falsely fails all `@tdd_expected_fail` scenarios (Major)

File: features/environment.py:136-162

Confirmed by reading Behave's Scenario.run() source: in dry-run mode, self.was_dry_run = True, hooks are skipped, steps get Status.untested, and failed stays False. The wrapper then sees "not failed" + @tdd_expected_fail → enters the "unexpected pass" branch → forces failure with "Bug appears to be fixed."

This breaks behave --dry-run for any feature file containing @tdd_expected_fail scenarios.

Fix:

if getattr(self, 'was_dry_run', False):
    return failed  # No test ran; cannot determine bug status
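
To show where the guard sits relative to the tag check, here is a simplified sketch of the wrapper. It is not the PR's implementation: `make_tdd_aware_run`, `fake_original_run`, and the lambda standing in for `should_invert_result` are assumptions, and the inversion is reduced to `not failed` so the control flow stays visible:

```python
from types import SimpleNamespace


def make_tdd_aware_run(original_run, should_invert):
    """Wrap a run function so dry-run results pass through untouched.

    `original_run` and `should_invert` are injected so the sketch can be
    exercised without Behave; both names are assumptions for illustration.
    """

    def tdd_aware_run(scenario, runner):
        failed = original_run(scenario, runner)
        if not should_invert(set(scenario.effective_tags)):
            return failed
        if getattr(scenario, "was_dry_run", False):
            return failed  # No step ran, so the bug's status is unknown
        # Simplified inversion: an expected failure passes, an unexpected pass fails.
        return not failed

    return tdd_aware_run


def fake_original_run(scenario, runner):
    """Mimic the dry-run behaviour described above: flag set, no failure."""
    scenario.was_dry_run = runner.config.dry_run
    return False


run = make_tdd_aware_run(fake_original_run, lambda tags: "tdd_expected_fail" in tags)
dry_runner = SimpleNamespace(config=SimpleNamespace(dry_run=True))
real_runner = SimpleNamespace(config=SimpleNamespace(dry_run=False))
scenario = SimpleNamespace(effective_tags=["tdd_bug", "tdd_bug_1", "tdd_expected_fail"])
```

Under these assumptions, `run(scenario, dry_runner)` reports no failure, while the same scenario on `real_runner` is still flagged as an unexpected pass.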

Issue 3 — Exception details discarded without logging (Minor)

File: features/environment.py:146-148

When inverting a failure, step.exception and step.exc_traceback are set to None with no prior logging. If a developer needs to verify the expected failure is the correct failure (e.g. the right AssertionError, not an unrelated crash), this diagnostic info is permanently lost.

Recommendation: Log at DEBUG before clearing:

_tdd_logger.debug(
    "Inverting expected-fail step '%s': %s", step.name, step.exception
)

Issue 4 — Non-AssertionError exceptions treated as "bug still exists" (Minor)

File: features/environment.py:141-151

Any exception type (RuntimeError, TypeError, ConnectionError, etc.) during step execution triggers the "expected failure" inversion. A genuine infrastructure error during a @tdd_expected_fail scenario would be silently swallowed. This is related to but distinct from Issue 1 — Issue 1 covers hook-level errors; this covers step-level non-assertion exceptions.

Recommendation: Check that the failed step's exception is an AssertionError before inverting. Log a warning and skip inversion for other exception types.
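
One way to express that check is a helper that collects the failed steps whose exception is not an `AssertionError`. This is a sketch only: `non_assertion_failures` is a hypothetical name, the steps are plain stand-in objects, and the string `"failed"` stands in for Behave's `Status.failed`:

```python
from types import SimpleNamespace


def non_assertion_failures(steps, failed_status="failed"):
    """Return failed steps whose exception is not an AssertionError.

    Hypothetical helper; `failed_status` stands in for Behave's Status.failed.
    """
    return [
        step
        for step in steps
        if step.status == failed_status
        and step.exception is not None
        and not isinstance(step.exception, AssertionError)
    ]


steps = [
    SimpleNamespace(name="the API is reachable", status="failed",
                    exception=ConnectionError("refused")),
    SimpleNamespace(name="the total is correct", status="failed",
                    exception=AssertionError("expected 3, got 2")),
    SimpleNamespace(name="the page loads", status="passed", exception=None),
]
suspicious = non_assertion_failures(steps)
```

If the returned list is non-empty, the wrapper could log a warning naming those steps and return `failed` without inverting.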

Issue 5 — "Unexpected pass" path produces no visible failure reason in test output (Minor, new)

File: features/environment.py:155-162

When a @tdd_expected_fail scenario unexpectedly passes, the wrapper calls self.set_status(Status.failed) and returns True, but no step is marked as failed and no error text is attached to the scenario. Behave's formatters (pretty, plain, JUnit XML) typically report the step that failed and its exception text. With no failed step, the output shows "FAILED" but gives no inline reason — the developer must check logs to find the warning.

Recommendation: Consider attaching a synthetic error to the last step so the failure reason appears in standard Behave output:

steps = list(self.all_steps)  # all_steps may be an iterator, so materialize it before indexing
last_step = steps[-1] if steps else None
if last_step:
    last_step.status = Status.failed
    last_step.exception = AssertionError(
        "Bug appears to be fixed. Remove the @tdd_expected_fail tag..."
    )

Positive Observations

Monkey-patch approach is correct. Wrapping Scenario.run() is the right strategy since after_scenario hooks cannot modify the return value. The idempotency guard (_tdd_run_patched) is sound.
validate_tdd_tags() is thorough. All CONTRIBUTING.md tag validation rules are covered: @tdd_bug_<N> requires @tdd_bug, @tdd_expected_fail requires both. Error messages are clear and reference the docs.
Regex is safe. r"tdd_bug_\d+" is a linear pattern with no ReDoS risk.
Test coverage is comprehensive for validation. 13 scenarios cover valid combos, invalid combos, and should_invert_result — well structured.
Step namespacing (tdd tags, tdd demo) avoids AmbiguousStep collisions with the existing 9700+ scenarios.
CHANGELOG entry is well-written and references the issue.
Inline documentation explaining the three-tag system and referencing CONTRIBUTING.md is excellent.
@mock_only tag on test features correctly skips DB setup overhead.

Verdict: REQUEST_CHANGES

Issues 1 and 2 are correctness bugs. Issue 1 can mask real infrastructure failures, and Issue 2 breaks --dry-run mode. Both should be fixed before merge. Issues 3-5 are minor improvements that would strengthen the implementation but are not blocking on their own.


features/environment.py Outdated

						
				@@ -29,0 +135,4 @@

				    def _tdd_aware_run(self: Any, runner: Any) -> bool:

				        failed: bool = _original_run(self, runner)

				        if not should_invert_result(set(self.effective_tags)):

hurui200320 commented

Major (Issue 2) — Still unresolved: Dry-run false failures.

Confirmed: Behave sets self.was_dry_run = dry_run_scenario at line ~10 of Scenario.run(). In dry-run mode, hooks are skipped and no steps execute, so failed is always False. The wrapper interprets this as 'unexpected pass' and forces failure.

Needs:

if getattr(self, 'was_dry_run', False):
    return failed


features/environment.py Outdated

						
				@@ -29,0 +138,4 @@

				        if not should_invert_result(set(self.effective_tags)):

				            return failed

				        if failed:

hurui200320 commented

Major (Issue 1) — Still unresolved: Hook/cleanup errors masked.

After reading Behave's Scenario.run() source, confirmed: after the after_scenario hook runs, if self.hook_failed is True, Behave sets self.status = Status.hook_error and failed = True. Similarly, if context _pop() raises, self.status = Status.cleanup_error and failed = True. Both cases reach the wrapper with failed=True, which then inverts to passed.

Needs a guard:

if self.hook_failed or self.status in (Status.hook_error, Status.cleanup_error):
    return failed


features/environment.py Outdated

						
				@@ -29,0 +144,4 @@

				            for step in self.all_steps:

				                if step.status == Status.failed:

				                    step.status = Status.untested

				                    step.exception = None

hurui200320 commented

Minor (Issue 3) — Still unresolved: These lines discard step.exception and step.exc_traceback without logging. Add _tdd_logger.debug(...) before clearing so developers can verify the failure was the expected one.


features/environment.py Outdated

						
				@@ -29,0 +158,4 @@

				            "bug fix workflow. See CONTRIBUTING.md > Bug Fix Workflow.",

				            self.name,

				        )

				        self.set_status(Status.failed)

hurui200320 commented

Minor (Issue 5, new) — No visible failure reason in test output.

This set_status(Status.failed) + return True forces a failure, but no step is marked as failed and no error text is attached. Behave formatters display the failed step's exception — without one, the output shows 'FAILED' with no explanation.

Consider setting a synthetic error on the last step so the reason appears in standard output rather than only in logs.


hurui200320 requested changes 2026-03-11 06:40:28 +00:00

Dismissed

hurui200320 left a comment

Code Review — Rui (Round 2, reaffirming previous findings)

My previous review's two major issues and two minor issues remain unaddressed on the current HEAD (a2d31b39). Reaffirming with additional verification below.

Issue 1 (Major — Still Open): Hook/infrastructure errors silently masked by result inversion

File: features/environment.py:136-162

_tdd_aware_run only checks if failed: without distinguishing the source of failure. Verified via inspect.getsource(Scenario.run) that Behave sets self.hook_failed = True when before_scenario raises (e.g. line 368's validate_tdd_tags, line 439's container override). _original_run then returns failed=True. The wrapper blindly flips this to passed.

Confirmed that self.hook_failed is set by Behave's Scenario.run() before it returns, so it's available in the wrapper. Status.hook_error (value 22) and Status.cleanup_error (value 23) also exist in Behave's Status enum.

Required fix:

if self.hook_failed:
    return failed  # Never invert infrastructure errors

Issue 2 (Major — Still Open): `--dry-run` mode falsely fails all `@tdd_expected_fail` scenarios

File: features/environment.py:136-162

Verified via inspect.getsource(Scenario.run) that Behave sets self.was_dry_run = dry_run_scenario at the start of Scenario.run(). In dry-run mode, no steps execute, _original_run returns False (no failures). The wrapper enters the "unexpected pass" branch and forces failure with "Bug appears to be fixed" — incorrect since no test ran.

Required fix:

if getattr(self, 'was_dry_run', False):
    return failed  # No-op during dry-run

Issue 3 (Minor — Still Open): Diagnostic info discarded without logging

File: features/environment.py:144-148

step.exception = None and step.exc_traceback = None destroy diagnostic info. If the expected failure is actually a RuntimeError from infrastructure (see Issue 4), the developer has no record. A _tdd_logger.debug() before clearing would preserve diagnostics in verbose runs.
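
A sketch of logging before clearing, runnable without Behave. The logger name, the `log_and_clear` helper, and the list-collecting handler are assumptions for illustration; only the log-then-clear order reflects the recommendation:

```python
import logging
from types import SimpleNamespace

_tdd_logger = logging.getLogger("tdd_expected_fail")  # logger name is an assumption


def log_and_clear(step) -> None:
    """Record the step's exception at DEBUG level, then clear it."""
    _tdd_logger.debug(
        "Inverting expected-fail step '%s': %r", step.name, step.exception
    )
    step.exception = None
    step.exc_traceback = None


class _ListHandler(logging.Handler):
    """Collect formatted messages so the sketch can be checked in isolation."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = _ListHandler()
_tdd_logger.addHandler(handler)
_tdd_logger.setLevel(logging.DEBUG)

step = SimpleNamespace(
    name="the total is correct",
    exception=AssertionError("expected 3, got 2"),
    exc_traceback=object(),
)
log_and_clear(step)
```

Using `%r` keeps the exception type in the message, which helps distinguish an `AssertionError` from an infrastructure exception when reading a debug log.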

Issue 4 (Minor — Still Open): Any exception type silently inverted

File: features/environment.py:141-151

The wrapper inverts any step failure regardless of exception type. A ConnectionError, TypeError, or RuntimeError during step execution would be treated as "bug still exists" and silently passed. Consider checking that the failing step's exception is AssertionError before inverting, and logging a warning for non-assertion exceptions.

New Observation: "Unexpected pass" failure message only in logs

File: features/environment.py:155-162

The acceptance criteria state: scenarios that unexpectedly pass should be "reported as failed with message". Currently the message only goes to _tdd_logger.warning(), which won't appear in Behave's standard test output or summary. The scenario shows as failed with no visible explanation — confusing for developers. Consider also setting an error message on the scenario itself or on a dummy step so it appears in test output.

This is a UX gap rather than a functional bug, but it weakens the developer experience specified in the ticket.

Positive Observations

Architecture choice is sound. Wrapping Scenario.run() instead of using after_scenario is the correct approach — after_scenario cannot modify the runner's pass/fail return value.
Regex safety. r"tdd_bug_\d+" with fullmatch() is not vulnerable to ReDoS and prevents partial matches.
Idempotency guard. _tdd_run_patched flag prevents double-patching under behave-parallel fork model.
Tag validation is comprehensive and covers all CONTRIBUTING.md rule combinations.
Test coverage is thorough. 13 validation scenarios + 1 demo scenario cover valid/invalid tag combos, should_invert_result, and actual inversion behavior.
Step naming with tdd tags/tdd demo prefixes avoids AmbiguousStep conflicts.
Inline documentation references CONTRIBUTING.md correctly.
CHANGELOG and feature-level @mock_only tagging are both correct.
logging import moved to module top-level from after_scenario — cleaner.

Verdict: REQUEST_CHANGES

The two major issues (hook error masking and dry-run false failures) are real bugs that would manifest in production use. Both have straightforward fixes. Please address Issues 1 and 2 before merge.


features/environment.py Outdated

						
				@@ -29,0 +134,4 @@

				    _original_run = Scenario.run

				    def _tdd_aware_run(self: Any, runner: Any) -> bool:

				        failed: bool = _original_run(self, runner)

hurui200320 commented

Major (Still Open) — Dry-run false failures. Behave sets self.was_dry_run = True when runner.config.dry_run is active (verified in Behave source). In dry-run mode, no steps execute, _original_run returns False, and this wrapper enters the 'unexpected pass' branch — incorrectly forcing failure on every @tdd_expected_fail scenario.

Add after the _original_run call:

if getattr(self, 'was_dry_run', False):
    return failed


features/environment.py Outdated

						
				@@ -29,0 +137,4 @@

				        failed: bool = _original_run(self, runner)

				        if not should_invert_result(set(self.effective_tags)):

				            return failed

hurui200320 commented

Major (Still Open) — Hook errors masked. After _original_run returns, self.hook_failed is guaranteed to be set by Behave's Scenario.run() (verified via source inspection). Add:

if self.hook_failed:
    return failed

before the should_invert_result check. Without this, a before_scenario exception on a @tdd_expected_fail scenario is silently reported as passed.


features/environment.py Outdated

						
				@@ -29,0 +141,4 @@

				        if failed:

				            # Expected failure — the bug still exists.  Reset the failed

				            # steps so the scenario is reported as passed.

				            for step in self.all_steps:

hurui200320 commented

Minor (Still Open) — Any exception type inverted. This branch inverts all failures regardless of exception type. A ConnectionError or TypeError during step execution would be silently treated as 'bug still exists'. Consider guarding:

for step in self.all_steps:
    if step.status == Status.failed:
        if step.exception and not isinstance(step.exception, AssertionError):
            _tdd_logger.warning(
                "Non-assertion exception in expected-fail scenario '%s' "
                "step '%s': %s — not inverting.",
                self.name, step.name, step.exception,
            )
            return failed


features/environment.py Outdated

						
				@@ -29,0 +144,4 @@

				            for step in self.all_steps:

				                if step.status == Status.failed:

				                    step.status = Status.untested

				                    step.exception = None

hurui200320 commented

Minor (Still Open) — Diagnostic info lost. Before setting step.exception = None, consider _tdd_logger.debug("Inverting expected-fail step '%s': %s", step.name, step.exception) so developers can verify the correct failure is being inverted during verbose/debug runs.


hurui200320 requested changes 2026-03-11 06:40:28 +00:00

Dismissed

hurui200320 left a comment

Code Review — Follow-up (Previous Issues Still Unresolved)

The commit SHA is unchanged (a2d31b3) since my previous review. The two major issues I flagged remain unaddressed. Restating them here for clarity, along with additional findings from a fresh pass.

Major Issues (Unresolved from Prior Review)

1. Hook/infrastructure errors silently masked by result inversion

features/environment.py:136-162 — _tdd_aware_run checks only if failed: without distinguishing the source of failure. Behave's own Scenario.run() sets self.hook_failed = True and returns failed = True when before_scenario, after_scenario, or a tag hook raises an exception. The wrapper currently flips this to passed for any @tdd_expected_fail scenario, completely hiding the infrastructure error.

I verified directly in Behave's source that self.hook_failed is set at the top of Scenario.run() and toggled when hooks fail (confirmed via inspect.getsource).

Fix: Add a guard before inversion:

if self.hook_failed:
    return failed

2. --dry-run mode falsely fails all @tdd_expected_fail scenarios

features/environment.py:136-162 — In dry-run mode, Behave skips hooks and step execution. _original_run returns False (no failures). The wrapper enters the "unexpected pass" branch and forces failure with "Bug appears to be fixed." This is incorrect — no test actually ran.

I verified directly in Behave's source that self.was_dry_run = dry_run_scenario is set at the start of Scenario.run().

Fix: Skip inversion during dry-run:

if getattr(self, "was_dry_run", False):
    return failed

Minor Issues

3. Exception details discarded without logging (features/environment.py:146-148)

When inverting a failed scenario to passed, step.exception and step.exc_traceback are set to None without any record. If a developer needs to verify that the expected failure is the correct failure (the right AssertionError, not an unrelated crash), this info is gone. A _tdd_logger.debug(...) call before clearing would preserve diagnostics without noise.

4. Non-AssertionError exceptions treated as "bug still exists" (features/environment.py:141-151)

The inversion logic inverts any step failure regardless of exception type. A RuntimeError, TypeError, or ConnectionError during step execution would be silently treated as "bug still exists" rather than flagged as an error. Consider verifying that failed steps have AssertionError before inverting, and logging a warning for non-assertion exceptions.

5. Inaccurate comments about hook location

features/environment.py:48 — Comment says helpers are "called from the before_scenario and after_scenario hooks respectively" but should_invert_result is called from _tdd_aware_run (the Scenario.run() wrapper), not from after_scenario.
features/testing/tdd_expected_fail_demo.feature:4 — Feature description says "This exercises the after_scenario hook logic" but the inversion is handled by the Scenario.run() wrapper, not after_scenario. The after_scenario hook even has a NOTE comment (line 560) clarifying this distinction.

Positive Observations

validate_tdd_tags() correctly implements all CONTRIBUTING.md validation rules with clear, actionable error messages.
Regex compiled at module level (_TDD_BUG_N_RE), uses fullmatch preventing partial matches — correct and efficient.
Idempotency guard (_tdd_run_patched) prevents double-patching in forked workers.
Monkey-patching Scenario.run() instead of using after_scenario is the right approach — after_scenario cannot change the return value of run().
Test coverage is comprehensive: 13 validation scenarios covering all valid/invalid tag combinations, plus 1 end-to-end demo.
Step names properly namespaced with tdd tags/tdd demo prefixes to avoid AmbiguousStep conflicts.
Feature files correctly placed under features/testing/ with @mock_only tag.
CHANGELOG updated with clear, detailed entry.
All acceptance criteria from #627 are functionally met (modulo the edge cases above).
Process-safe under behave-parallel (module-level patch, no shared mutable state).

Verdict: REQUEST_CHANGES

The two major issues — hook-error masking and dry-run false failures — are real edge cases confirmed against Behave's source code. They need to be fixed before merge. The minor issues (3-5) are improvements I'd like to see but would not block on independently.


features/environment.py Outdated

						
				@@ -29,0 +45,4 @@

				#                          fixed without removing the tag (reported as failed).

				#

				# The ``validate_tdd_tags`` and ``should_invert_result`` helpers below are

				# called from the ``before_scenario`` and ``after_scenario`` hooks

hurui200320 commented

Minor — Comment inaccuracy

This comment says the helpers are "called from the before_scenario and after_scenario hooks respectively" but should_invert_result is actually called from _tdd_aware_run (the Scenario.run() wrapper installed in before_all), not from after_scenario. The after_scenario hook itself has a NOTE (line 560) clarifying this. Consider updating this comment to match.


features/environment.py Outdated

						
				@@ -29,0 +135,4 @@

				    def _tdd_aware_run(self: Any, runner: Any) -> bool:

				        failed: bool = _original_run(self, runner)

				        if not should_invert_result(set(self.effective_tags)):

hurui200320 commented

Still unresolved — Dry-run false failure (Major)

In --dry-run mode, _original_run returns False (no failure) and self.was_dry_run is True. Without a guard, this falls through to the 'unexpected pass' branch at line 155, forcing every @tdd_expected_fail scenario to fail with a misleading message.

Add before the inversion:

if getattr(self, 'was_dry_run', False):
    return failed


features/environment.py Outdated

						
				@@ -29,0 +137,4 @@

				        failed: bool = _original_run(self, runner)

				        if not should_invert_result(set(self.effective_tags)):

				            return failed

hurui200320 commented

Still unresolved — Hook-error masking (Major)

After the call to _original_run, self.hook_failed will be True if any hook raised an exception. The current code proceeds to invert the result, hiding the infrastructure failure.

Behave source confirms self.hook_failed is set at the top of Scenario.run() and toggled on hook failure. Add:

if self.hook_failed:
    return failed

before the inversion logic.
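
The hook-error guard requested here and the dry-run guard from the previous comment could sit together at the top of the wrapper. The following is only a sketch under stated assumptions: `make_tdd_aware_run` is a hypothetical factory, the tag check is a simplified stand-in for `should_invert_result`, and the inversion is reduced to `not failed` (the PR also rewrites step statuses, which is omitted here).

```python
from types import SimpleNamespace
from typing import Any, Callable

RunFn = Callable[[Any, Any], bool]


def make_tdd_aware_run(original_run: RunFn) -> RunFn:
    """Build a run() wrapper that inverts results only when it is safe to."""

    def _tdd_aware_run(scenario: Any, runner: Any) -> bool:
        failed = original_run(scenario, runner)

        # Stand-in for should_invert_result(set(self.effective_tags)).
        if "tdd_expected_fail" not in scenario.effective_tags:
            return failed

        # Guard 1: a hook raised, so the failure is infrastructure, not the bug.
        if getattr(scenario, "hook_failed", False):
            return failed

        # Guard 2: nothing executed in dry-run mode, so there is nothing to invert.
        if getattr(scenario, "was_dry_run", False):
            return failed

        return not failed  # simplified inversion

    return _tdd_aware_run


def fake_scenario(hook_failed: bool = False, was_dry_run: bool = False) -> Any:
    """Stand-in for a tagged Behave scenario (assumption: three attributes suffice)."""
    return SimpleNamespace(
        effective_tags={"tdd_expected_fail"},
        hook_failed=hook_failed,
        was_dry_run=was_dry_run,
    )


run_failing = make_tdd_aware_run(lambda scenario, runner: True)
run_passing = make_tdd_aware_run(lambda scenario, runner: False)
```

Both guards return the untouched result, so a hook error still reports as failed and a dry run still reports as not failed.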


features/environment.py Outdated

						
				@@ -29,0 +144,4 @@

				            for step in self.all_steps:

				                if step.status == Status.failed:

				                    step.status = Status.untested

				                    step.exception = None

hurui200320 commented

2026-03-11 07:03:55 +00:00

Minor — Diagnostic info silently discarded

Clearing the exception and traceback here loses the only record of which failure occurred. For expected-fail scenarios, knowing the exception helps developers verify they're testing the right bug.

Consider:

_tdd_logger.debug(
    "Clearing expected-fail exception for step '%s': %s",
    step.name, step.exception,
)

before the assignment.


hurui200320 requested changes 2026-03-11 06:51:14 +00:00

hurui200320 left a comment

Combined Review Report — PR #665 (Ticket #627)

Agent consensus: 5/5 review agents returned REQUEST_CHANGES

PR Summary

Implements the three-tag TDD bug-capture system (@tdd_bug, @tdd_bug_<N>, @tdd_expected_fail) in Behave environment hooks. Adds validate_tdd_tags(), should_invert_result(), and _install_tdd_expected_fail_patch() to features/environment.py, along with 13 BDD validation scenarios + 1 demo scenario, and a CHANGELOG update.

Issues Found

Blocking (Major) — found by 5/5 agents

| # | File:Line | Description | Recommendation |
|---|-----------|-------------|----------------|
| 1 | `features/environment.py:136-162` | **Hook/infrastructure errors masked by inversion.** `_tdd_aware_run` does not check `self.hook_failed` before inverting the result. If a `before_scenario` or `after_scenario` hook raises an exception on a `@tdd_expected_fail` scenario, the wrapper flips `failed=True` to passed, silently hiding infrastructure errors in CI. | Add `if self.hook_failed: return failed` before the inversion logic. |
| 2 | `features/environment.py:136-162` | **`--dry-run` mode falsely fails all `@tdd_expected_fail` scenarios.** In dry-run mode, no steps execute and `_original_run` returns `False`. The wrapper enters the "unexpected pass" branch and forces failure with "Bug appears to be fixed" even though no test actually ran. | Add `if getattr(self, 'was_dry_run', False): return failed` before inversion. |

Non-blocking (Minor) — found by 3-5/5 agents

| # | File:Line | Description | Recommendation |
|---|-----------|-------------|----------------|
| 3 | `features/environment.py:146-148` | **Exception info discarded without logging.** `step.exception = None` and `step.exc_traceback = None` permanently destroy diagnostic info. Developers lose the ability to verify the expected failure is the correct failure. | Add `_tdd_logger.debug("Clearing expected failure: %s", step.exception)` before clearing. |
| 4 | `features/environment.py:141-151` | **Any exception type is inverted.** `RuntimeError`, `ConnectionError`, `TypeError`, etc. during step execution are all treated as "bug still exists" and silently inverted. Only `AssertionError` failures are genuine test assertions. | Consider warning on non-`AssertionError` exceptions, or at minimum logging them at DEBUG level. |
| 5 | `features/testing/tdd_expected_fail_demo.feature:4-5`, `features/steps/tdd_tag_validation_steps.py:104` | **Incorrect documentation.** Comments/docstrings reference "after_scenario hook" as the inversion mechanism, but inversion actually uses the `Scenario.run()` monkey-patch wrapper. | Update references to say "`Scenario.run()` wrapper" instead of "after_scenario hook". |
| 6 | `features/environment.py:155-162` | **"Unexpected pass" failure message only visible in logs.** The failure path sets `scenario.set_status(Status.failed)` but the guidance message ("Bug appears to be fixed") only goes to `_tdd_logger.warning()`, not visible in standard Behave output. | Attach the message to a step or scenario error text so it shows in test results. |
| 7 | (no specific line) | **"Unexpected pass" branch untested.** No test scenario exercises the path where a `@tdd_expected_fail` scenario unexpectedly passes. This is listed as an acceptance criterion in ticket #627. | Add a test scenario that triggers the "bug appears fixed" path. |

Positive Observations (unanimous across all agents)

Correct architecture: Monkey-patching Scenario.run() is the only approach that can control the pass/fail return value — good engineering judgment over using after_scenario
Thorough tag validation: validate_tdd_tags() implements all CONTRIBUTING.md rules with clear, actionable error messages
Safe regex: r"tdd_bug_\d+" with fullmatch() — no ReDoS risk, no partial match risk
Idempotent patching: _tdd_run_patched guard prevents double-patching under behave-parallel
Good test coverage: 13 validation + 1 demo scenarios, well-structured with @mock_only
Clean step namespacing: Avoids AmbiguousStep conflicts
Process-safe under forked workers
CHANGELOG entry is well-written with issue reference

Verdict: REQUEST_CHANGES

Issues #1 and #2 are confirmed correctness bugs (verified against Behave's source code) that must be fixed before merge. Both have straightforward 2-3 line fixes. Issues #3-#7 are recommended improvements.


hurui200320 commented

Sorry for the messy comments. Today I tried a different approach to reviewing: asking an LLM to spawn 5 agents to do independent reviews and then merge the findings into one review. But apparently I messed up the prompt, so each agent posted its own findings on Forgejo.

I did check them one by one, and it looks like the new approach is working. It generates a bunch of messy comments instead of one, but at least the agents pick out a bunch of issues in one round of review.

(This message is written by real Rui.)


CoreRasurae force-pushed feature/m5-behave-tdd-tags from a2d31b3952 to 107860259d

2026-03-11 15:24:04 +00:00


brent.edwards commented

2026-03-11 17:50:27 +00:00

The Lay of Pull Request Six-Hundred Sixty-Five

An Epic Review, in Cantos, of the TDD Expected-Fail Tag Handler

Canto I — The Invocation

Sing, O Muse, of tags that govern tests,
Of @tdd_expected_fail and its behests,
Of CoreRasurae, who labored long and well
To build a system that would catch bugs where they fell.
Yet fate, that fickle keeper of the merge,
Had merged another's work before this surge.
And so we read these thousand lines with care,
To render judgment: honest, true, and fair.

Canto II — The Great Duplication (P0:blocker)

Hear now the gravest finding of them all:
This PR's code already stands within the hall.
On master, in features/environment.py,
A function named handle_tdd_expected_fail holds sway (line 322).
It validates the three-tag system true,
It inverts the status — failed to passed, and passed askew —
It lives within after_scenario's embrace (line 403),
And handles every edge and corner case.

The PR proposes a different architecture:
A monkey-patch on Scenario.run() — a fracture
Installed in before_all, wrapping the method whole,
With apply_tdd_inversion playing the central role.

Were both to live, catastrophe would reign:

A failing test, inverted once, would pass — then pass again
through after_scenario's hook, which sees it passed,
and re-inverts to failed. The bug, still present, fails the blast.
A passing test (bug fixed!) inverted once to fail,
then after_scenario sees failure, inverts — and hides the tale.
The @tdd_expected_fail tag remains, the fix unseen,
A silent phantom haunting CI's green.

The PR is Mergeable: False. The conflicts are not cosmetic —
They arise because master's implementation is complete, not merely aesthetic.

Verdict: P0:blocker — The entire PR is superseded by the existing
implementation on master. Merging would cause double-inversion,
breaking all TDD expected-fail scenarios in both directions.
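
The double-inversion argument can be checked with a small truth-table sketch. It assumes each layer (the `Scenario.run()` wrapper and master's `after_scenario` handler) behaves as a plain negation of the failed flag; the function names are hypothetical and neither implementation is reproduced here.

```python
def invert(failed: bool) -> bool:
    """Expected-fail inversion, modelled as plain negation (an assumption)."""
    return not failed


def reported_failed(raw_failed: bool, layers: int) -> bool:
    """Apply the inversion `layers` times to the raw scenario result."""
    failed = raw_failed
    for _ in range(layers):
        failed = invert(failed)
    return failed


# One layer: a still-present bug (raw failure) is reported as passed.
bug_present_single = reported_failed(True, layers=1)
# Two layers: the same raw failure is reported as failed again.
bug_present_double = reported_failed(True, layers=2)
# Two layers: a fixed bug (raw pass) is reported as passed, so the stale tag goes unnoticed.
bug_fixed_double = reported_failed(False, layers=2)
```

With two layers the tag has no net effect, which matches both failure modes described in this canto.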

Canto III — The Status Untested (P1:must-fix)

Within apply_tdd_inversion, when failure turns to grace,
The code sets failed steps to Status.untested — an unstable place:

# PR line 200 of the diff
step.status = Status.untested  # Wrong!

But untested is not a final status in Behave's reckoning.
The parallel runner's _extract_summary() counts it as an error,
Not a pass, not a skip — a ghost that spreads its terror.
And should compute_status() ever be retriggered on the scene,
It would return Status.untested — not the passed we need to glean.

The existing code on master (line 387) does this right:

step.status = Status.passed  # Correct!

And resets both failed and skipped steps to passed outright.

Verdict: P1:must-fix — Status.untested causes incorrect summary
counts and inconsistent state. Should be Status.passed, and skipped
steps must also be reset.

Canto IV — The Phantom Property (P1:must-fix)

The PR accesses scenario.all_steps — but beware!
In real Behave, all_steps is a property returning an iterator,
Not a list that one may index or compare.

At diff line 219–220:

if scenario.all_steps:          # Always True for iterators!
    last_step = scenario.all_steps[-1]  # TypeError on real Scenario!

An itertools.chain object knows not __getitem__,
Nor does it yield False when empty — it is truthy every time.
The mock uses a plain list, hiding this defect,
But on a real Scenario, a TypeError would eject.

The existing code on master wisely uses scenario.steps (a list),
And never indexes all_steps — that trap it has dismissed.

Verdict: P1:must-fix — scenario.all_steps[-1] would raise
TypeError on a real Behave Scenario object. The mock masks the bug.
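
The iterator behaviour described in this canto can be reproduced with the standard library alone. The sketch below uses `itertools.chain` directly rather than a real Behave `Scenario`, and `last_step_or_none` is a hypothetical helper showing one way to avoid both problems.

```python
from itertools import chain
from typing import Any, Iterable, Optional


def last_step_or_none(all_steps: Iterable[Any]) -> Optional[Any]:
    """Materialise the iterable once, then index the resulting list."""
    steps = list(all_steps)
    return steps[-1] if steps else None


# An empty chain is still truthy: chain defines neither __bool__ nor __len__.
is_truthy = bool(chain([], []))

# Indexing a chain object raises TypeError because it has no __getitem__.
try:
    chain(["given"], ["then"])[-1]
    indexable = True
except TypeError:
    indexable = False
```

Converting to a list first makes both the emptiness check and the `[-1]` lookup behave the same way on a mock list and on an iterator.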

Canto V — The Forbidden Sigils (P1:must-fix)

Two lines bear the mark of the forbidden:

Scenario.run = _tdd_aware_run  # type: ignore[assignment]
Scenario._tdd_run_patched = True  # type: ignore[attr-defined]

CONTRIBUTING.md speaks plainly at line 548:

"Never use inline comments or annotations to suppress individual
type checking errors (e.g., no type: ignore)."

Two # type: ignore comments dwell within this patch,
A direct violation that no argument can unlatch.

Verdict: P1:must-fix — Two # type: ignore comments violate
the project's absolute prohibition on type suppression.
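
One possible way to install such a patch without inline suppressions is to go through `getattr`/`setattr`, which type checkers generally do not flag. This is a sketch, not the PR's code: the `Scenario` class below is a stand-in for `behave.model.Scenario`, `install_patch` and `make_inverting_wrapper` are hypothetical names, and whether this satisfies the intent of the CONTRIBUTING.md rule (rather than only its letter) is a question for the maintainers.

```python
from typing import Any, Callable

RunFn = Callable[[Any, Any], bool]


class Scenario:
    """Stand-in for behave.model.Scenario (assumption: only run() matters)."""

    def run(self, runner: Any) -> bool:
        return True  # pretend the scenario failed


def install_patch(cls: type, make_wrapper: Callable[[RunFn], RunFn]) -> bool:
    """Wrap cls.run once; return False if the patch was already installed."""
    if getattr(cls, "_tdd_run_patched", False):
        return False
    original: RunFn = getattr(cls, "run")
    setattr(cls, "run", make_wrapper(original))
    setattr(cls, "_tdd_run_patched", True)
    return True


def make_inverting_wrapper(original: RunFn) -> RunFn:
    """Simplified wrapper that only negates the result."""

    def wrapped(self: Any, runner: Any) -> bool:
        return not original(self, runner)

    return wrapped


first = install_patch(Scenario, make_inverting_wrapper)
second = install_patch(Scenario, make_inverting_wrapper)  # no-op: already patched
```

The second call is a no-op, mirroring the `_tdd_run_patched` idempotency guard mentioned in the earlier reviews.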

Canto VI — The Missing Test (P2:should-fix)

Of all the paths that apply_tdd_inversion may tread,
The primary use case — expected failure inverted to pass — is left unsaid
In the Behave feature file. The Robot helper tests it well,
But tdd_tag_validation.feature has no scenario to tell
The tale of failed=True with AssertionError in hand,
Where inversion yields False and the test is marked as planned.

Four inversion scenarios grace the feature file:

Unexpected pass (forced failure) ✓
Hook-error guard ✓
Dry-run guard ✓
Non-assertion guard ✓

But the happy path — the very reason the system exists — is absent.

Verdict: P2:should-fix — The primary expected-failure inversion path
is not tested in the Behave feature file (only in Robot).

Canto VII — The Empty Stage (P2:should-fix)

No test exists where scenario.all_steps stands empty —
A stage with no players, a theater with no entry.
The _make_mock_scenario helper always creates one step,
But what of scenarios with none? The code at line 219
Guards with if scenario.all_steps: — yet this guard is broken
(see Canto IV), and no test has ever spoken
To verify the behavior when no steps are found.

Verdict: P2:should-fix — The empty-steps edge case
is untested and contains the iterator truthiness bug.

Canto VIII — The Doubled Mock (P3:nit)

In features/steps/tdd_tag_validation_steps.py (diff line 433)
And robot/helper_tdd_tag_validation.py (diff line 752),
The _make_mock_scenario function stands, copy-pasted twice,
Identical in form and function — duplication's vice.
Any change to one must echo in the other,
Lest the tests diverge like children from their mother.

Verdict: P3:nit — Consider extracting the mock helper
to a shared module (perhaps features/mocks/mock_tdd_scenario.py)
to eliminate duplication.

Canto IX — The Verdict

O CoreRasurae, your craft is plain to see:
The guards for hooks, dry-runs, and non-assertion errors — these are worthy.
The tag validation logic is correct and clear,
The error messages helpful, the test coverage nearly here.

But master has already claimed this ground.
The handle_tdd_expected_fail function was found
In commits dating March the 9th, before this PR was born.
The monkey-patch approach, though clever, is now shorn
Of purpose — for the after_scenario hook suffices,
Combined with the summary-based exit in the noxfile's devices.

Recommendation: Request Changes

| Finding | Severity | File | Line(s) |
|---------|----------|------|---------|
| PR entirely superseded by existing `handle_tdd_expected_fail` on master | P0:blocker | `features/environment.py` | 322–403 (master) |
| `Status.untested` should be `Status.passed` | P1:must-fix | `features/environment.py` (diff) | 200 |
| `scenario.all_steps[-1]` raises TypeError on real Scenario | P1:must-fix | `features/environment.py` (diff) | 219–220 |
| Two `# type: ignore` comments violate CONTRIBUTING.md | P1:must-fix | `features/environment.py` (diff) | 249–250 |
| Primary expected-failure path not tested in Behave | P2:should-fix | `tdd_tag_validation.feature` | - |
| Empty `all_steps` edge case untested | P2:should-fix | `tdd_tag_validation.feature` | - |
| `_make_mock_scenario` duplicated across Behave/Robot | P3:nit | steps + helper | - |

The path forward: this PR should be closed in favor of the existing
implementation. If specific improvements from this PR (e.g., the
non-assertion exception guard, the dry-run guard) are desired atop
master's handle_tdd_expected_fail, they can be proposed as a
follow-up PR that builds on — rather than replaces — the existing code.

Thus ends the Lay of PR Six-Hundred Sixty-Five.
May the tests stay green, the types stay checked,
And the tags stay properly triple-decked.


freemo added the

labels 2026-03-11 18:15:27 +00:00

freemo commented

2026-03-11 18:17:54 +00:00

PM Review — Day 31 (Specification Update)

Merge conflict detected. This conflict is due to significant specification changes made today.

CRITICAL PATH ITEM

This PR is on the project critical path. The TDD infrastructure chain is:
#665 → #629 → all bug fix PRs → M3 closure

Without this PR merging, NO bug fixes can proceed through the TDD pipeline.

Spec Alignment Check

Behave TDD tag infrastructure is NOT impacted by protocol or TUI changes. The @tdd_expected_fail tag system is orthogonal to the A2A transition.

Action Required

@CoreRasurae — IMMEDIATE: Rebase against master and request review from @brent.edwards. This is your #1 priority.

Note: New integration test tagging issues (#684, #685, #686) have been created for the @mocked/@llm system, which builds on this TDD infrastructure.


freemo referenced this pull request

2026-03-11 20:25:36 +00:00

Implement @tdd_expected_fail tag handling in Behave environment #627

freemo referenced this pull request

2026-03-11 20:25:40 +00:00

Implement @tdd_expected_fail tag handling in Robot Framework #628

freemo referenced this pull request

2026-03-11 20:25:46 +00:00

test(e2e): validate M3 acceptance criteria for v3.2.0 milestone closure #494

freemo referenced this pull request

2026-03-11 20:29:02 +00:00

feat(testing): implement @tdd_expected_fail tag handling in Robot Framework #673

freemo commented

2026-03-11 20:29:07 +00:00

PM Review — Day 31 (2026-03-11)

Potential issue: superseded by master? A review comment noted that handle_tdd_expected_fail already exists on master in features/environment.py:322-403. If this is accurate, this PR's monkey-patch approach would cause double-inversion if merged.

Action required:

1. @CoreRasurae: please verify whether master already has the @tdd_expected_fail handler. If so, this PR may need to be either:
   - Closed as superseded, with incremental improvements (hook-error guard, dry-run guard) proposed as separate PRs
   - Rebased and refactored to build on the existing implementation rather than replacing it
2. The PR also has merge conflicts that need resolution regardless.

Please report back on whether the master implementation is complete. This determines whether we close this PR or refactor it.


CoreRasurae force-pushed feature/m5-behave-tdd-tags from 107860259d to 282a46b0d1

2026-03-11 21:17:33 +00:00


CoreRasurae referenced this issue from a commit

2026-03-11 21:58:33 +00:00

fix(testing): address review feedback for @tdd_expected_fail handling

brent.edwards requested changes 2026-03-11 22:28:04 +00:00

Dismissed

brent.edwards left a comment

Review — PR #665 (rebased `2927c5de`)

Reviewed: features/environment.py, features/steps/tdd_tag_validation_steps.py, features/testing/tdd_tag_validation.feature, features/testing/tdd_expected_fail_demo.feature, robot/helper_tdd_tag_validation.py, robot/tdd_tag_validation.robot, CHANGELOG.md

Compared against: master (39595657), merge-base a8bb543f

Overall Assessment

This PR is a genuine and substantial improvement over master's existing handle_tdd_expected_fail. The previous P0:blocker ("superseded by master") from my epic-poem review is withdrawn — Luis was correct that this is "more complete". Key improvements:

- Scenario.run() monkey-patch fixes a real correctness bug: master's after_scenario approach inverts scenario.status but cannot modify the failed boolean returned by Scenario.run(), causing the runner to report wrong aggregate pass/fail counts and a non-zero exit code even after successful inversion.
- Guards for hook_failed, was_dry_run, and non-AssertionError exceptions prevent silent masking of infrastructure failures.
- Extracted pure functions (validate_tdd_tags, should_invert_result, apply_tdd_inversion) are independently testable.
- Synthetic AssertionError on unexpected-pass makes the failure reason visible in Behave formatters.
- All 7 of Rui's prior findings are fully resolved in the rebase.
- All 4 of my prior epic-poem findings (Status.untested, iterator indexing, `# type: ignore`, missing tests) are fixed.
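The return-value problem in the first bullet can be illustrated without Behave. This is a minimal sketch under stated assumptions: `FakeScenario`, `original_run`, and `tdd_aware_run` are hypothetical stand-ins, the tag check is reduced to a single tag, and the guards only approximate what the PR describes.

```python
class FakeScenario:
    """Hypothetical stand-in for a Behave scenario (not the real class)."""

    def __init__(self, tags, step_error=None, hook_failed=False, was_dry_run=False):
        self.effective_tags = tags
        self.step_error = step_error      # exception raised by a step, if any
        self.hook_failed = hook_failed
        self.was_dry_run = was_dry_run


def original_run(scenario):
    """Stand-in for the unpatched run(): returns True when the scenario failed."""
    return scenario.hook_failed or scenario.step_error is not None


def tdd_aware_run(scenario):
    """Wrap original_run and invert its boolean for expected-fail scenarios."""
    failed = original_run(scenario)
    if "tdd_expected_fail" not in scenario.effective_tags:
        return failed
    if scenario.was_dry_run or scenario.hook_failed:
        return failed                     # nothing ran, or infrastructure broke
    if failed and not isinstance(scenario.step_error, AssertionError):
        return failed                     # non-assertion errors are not inverted
    return not failed                     # expected failure passes; unexpected pass fails


tags = ["tdd_expected_fail"]
print(tdd_aware_run(FakeScenario(tags, step_error=AssertionError("bug present"))))
print(tdd_aware_run(FakeScenario(tags)))
```

A hook that runs after the scenario can rewrite a status attribute, but it has no way to change this returned boolean, which is the gap the wrapper closes.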

However, 2 P1 issues and 5 P2 issues remain before merge.

Prior Findings — Disposition

| Prior finding | Disposition |
|---|---|
| P0: PR superseded by master | WITHDRAWN: PR genuinely adds Scenario.run() return-value fix, 3 guards, testable decomposition |
| P1: `Status.untested` should be `Status.passed` | FIXED in 2nd commit |
| P1: `scenario.all_steps[-1]` TypeError on iterator | FIXED: `list(scenario.all_steps)` then index |
| P1: Two `# type: ignore` comments | FIXED in 2nd commit |
| P2: Primary expected-failure path untested | FIXED: demo scenario + BDD + Robot |
| P2: Empty `all_steps` edge case untested | FIXED: "handles unexpected pass with no steps" |
| P3: `_make_mock_scenario` duplicated | STILL APPLIES (P3 below) |
| Rui findings 1-7 (hook errors, dry-run, logging, non-assertion, comments, unexpected-pass message, test coverage) | ALL RESOLVED |
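The iterator fix recorded in the table (`list(scenario.all_steps)` then index) can be reproduced with plain Python. A minimal sketch, assuming, as the earlier review stated, that `all_steps` yields an `itertools.chain`-style iterator; `FakeScenario` and `last_step_or_none` are hypothetical names, not Behave code.

```python
from itertools import chain


class FakeScenario:
    """Hypothetical stand-in; only mimics an iterator-returning all_steps."""

    def __init__(self, background_steps, steps):
        self.background_steps = background_steps
        self.steps = steps

    @property
    def all_steps(self):
        # A fresh iterator on every access, never a list.
        return chain(self.background_steps, self.steps)


def last_step_or_none(scenario):
    """Materialize the iterator first, then test emptiness and index."""
    steps = list(scenario.all_steps)
    return steps[-1] if steps else None


empty = FakeScenario([], [])
print(bool(empty.all_steps))  # an empty chain object is still truthy
try:
    empty.all_steps[-1]
except TypeError as exc:
    print("indexing failed:", exc)
print(last_step_or_none(empty))
print(last_step_or_none(FakeScenario(["given"], ["when", "then"])))
```

A mock that stores `all_steps` as a plain list would pass both the truthiness check and the indexing, which is why the original defect was invisible to the mock-based tests.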

New Findings

P1:must-fix

F1. handle_tdd_expected_fail is zombie code; infrastructure tests give false confidence

handle_tdd_expected_fail is retained (features/environment.py:221) but the PR removed its call from after_scenario. The 6 scenarios in tdd_expected_fail_infrastructure.feature still exercise it — but this is not the production path. The production inversion runs through apply_tdd_inversion (called from the Scenario.run() wrapper). The two functions diverge materially:

| Aspect | `apply_tdd_inversion` (production) | `handle_tdd_expected_fail` (dead) |
|---|---|---|
| Non-assertion exception guard | Yes, skips inversion | No, inverts everything |
| Step clearing scope | Only `failed`/`skipped` steps | All steps unconditionally |
| Exception logging before clear | DEBUG level | None |
| Input | `failed: bool` parameter | Reads `scenario.status` |

The infrastructure tests pass — but they validate semantics that differ from production in at least three ways. This is a false-confidence coverage signal.

Fix: Either (a) update tdd_expected_fail_infrastructure.feature + steps to call apply_tdd_inversion instead, or (b) remove handle_tdd_expected_fail and the old infrastructure tests entirely (the new 19+1 BDD + 12 Robot scenarios cover apply_tdd_inversion comprehensively), or (c) document handle_tdd_expected_fail as a backward-compatible API for external tooling and add the missing non-assertion guard for parity.
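To make the false-confidence risk concrete, here is a small parity check. It is only a sketch: `production_like` and `legacy_like` are hypothetical reductions of the two columns in the table above to a single boolean decision, not the real `apply_tdd_inversion` or `handle_tdd_expected_fail`.

```python
def production_like(failed, exc):
    """Models the 'production' column: non-assertion errors are not inverted."""
    if failed and not isinstance(exc, AssertionError):
        return failed
    return not failed


def legacy_like(failed, exc):
    """Models the 'dead' column: the result is inverted unconditionally."""
    return not failed


# (failed, exception) pairs a test suite might feed to either function.
CASES = {
    "assertion failure": (True, AssertionError("bug still present")),
    "non-assertion error": (True, RuntimeError("fixture crashed")),
    "unexpected pass": (False, None),
}

diverging = [
    name
    for name, (failed, exc) in CASES.items()
    if production_like(failed, exc) != legacy_like(failed, exc)
]
print(diverging)
```

Under these assumptions the two models agree on two of the three cases, so a suite that only exercises the legacy function can stay green while saying nothing about how production treats a crashed fixture.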

F2. from behave.model import Scenario moved inside function — CONTRIBUTING.md import violation (regression)

Master has from behave.model import Scenario at the top of features/environment.py (line 13). The PR removed this top-level import and placed it inside _install_tdd_expected_fail_patch() (line 295):

def _install_tdd_expected_fail_patch() -> None:
    from behave.model import Scenario  # ← function-level import

CONTRIBUTING.md §1289-1294: "Ensure all imports are at the top of the Python file. Do not scatter imports throughout the file or bury them inside functions." Only exception: if TYPE_CHECKING:. This is a regression — master was compliant on this specific import.

(Note: ~15 pre-existing inner imports in the same file are not introduced by this PR and are a separate cleanup concern.)

Fix: Restore from behave.model import Scenario at the top of the file alongside from behave.model import Status (line 13). Remove the inner import from _install_tdd_expected_fail_patch.

P2:should-fix

F3. validate_tdd_tags raises bare ValueError in before_scenario — confusing hook-error output

At line 510, validate_tdd_tags(set(scenario.effective_tags)) is called without a try/except. When tags are invalid, the ValueError propagates into Behave's run_hook(), which prints:

HOOK-ERROR in before_scenario: ValueError: Scenario has @tdd_bug_123 but is missing...

This looks like an infrastructure crash, not a tag configuration error. Master's explicit sys.stderr.write("TDD TAG ERROR: 'scenario_name' — ...") was more scannable and included the scenario name prominently. No test exercises this hook-error path to verify the output is usable.

Fix: Wrap the call in before_scenario:

try:
    validate_tdd_tags(set(scenario.effective_tags))
except ValueError as exc:
    scenario.set_status(Status.failed)
    _tdd_logger.error("TDD TAG ERROR in %r: %s", scenario.name, exc)
    return
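Since no test exercises this path, a test along the following lines might help. It is a sketch under stated assumptions: `FakeScenario`, `validate_stub`, `before_scenario_sketch`, and the logger name `tdd_sketch` are all hypothetical, string statuses replace Behave's `Status` enum, and the validation rule is illustrative rather than the real `validate_tdd_tags` logic.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log messages so a test can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


class FakeScenario:
    """Hypothetical stand-in exposing only what the wrapper touches."""

    def __init__(self, name, tags):
        self.name = name
        self.effective_tags = tags
        self.status = "untested"

    def set_status(self, status):
        self.status = status


def validate_stub(tags):
    """Illustrative rule only; the real validation rules are not reproduced."""
    if "tdd_expected_fail" not in tags:
        raise ValueError("illustrative tag error")


def before_scenario_sketch(scenario, logger):
    """Return False when tags are rejected, True when setup may continue."""
    try:
        validate_stub(set(scenario.effective_tags))
    except ValueError as exc:
        scenario.set_status("failed")
        logger.error("TDD TAG ERROR in %r: %s", scenario.name, exc)
        return False
    return True


handler = ListHandler()
logger = logging.getLogger("tdd_sketch")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # keep the message out of the root logger's output

bad = FakeScenario("login bug", ["tdd_bug_123"])
ok = before_scenario_sketch(bad, logger)
print(ok, bad.status, handler.messages)
```

Asserting on the captured message checks that the scenario name appears prominently, which is the usability property the finding is concerned with.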

F4. Demo-only steps in wrong file (BDD organization violation)

The 3 "tdd demo" steps (tdd demo a step that always succeeds, tdd demo a deliberately failing assertion is executed, tdd demo this step is never reached) are defined in tdd_tag_validation_steps.py but used only by tdd_expected_fail_demo.feature.

CONTRIBUTING.md §1172-1174: "Steps used only by foo.feature must live in foo_steps.py."

Fix: Move these 3 steps to features/steps/tdd_expected_fail_demo_steps.py.

F5. Branch name says m5 but milestone is M3 (v3.2.0)

Branch: feature/m5-behave-tdd-tags — milestone: v3.2.0 = M3. The mismatch originates from issue #627 metadata, but it's worth flagging. Not blocking since the feature/ prefix (not tdd/) is correct for a Type/Feature issue.

F6. Only 1 integration test for the production Scenario.run() wrapper

The demo feature (1 scenario) is the sole test that exercises the real production path through before_all → patch install → Scenario.run() → apply_tdd_inversion. All guard paths (hook_failed, dry-run, non-assertion) and the unexpected-pass path are tested only with mock objects. Consider adding at least one more integration scenario (e.g., an @tdd_expected_fail scenario that passes, triggering the unexpected-pass forced-failure path through the real pipeline).

F7. Two commits — squash needed before merge

CONTRIBUTING.md requires one commit per issue. The branch has:

1. feat(testing): implement @tdd_expected_fail tag handling... (ISSUES CLOSED: #627)
2. fix(testing): address review feedback... (Refs: #627)

Fix: Squash into a single commit before merge.

P3:nit

F8. _make_mock_scenario duplicated — tdd_tag_validation_steps.py:108 and helper_tdd_tag_validation.py:34 contain identical helpers. Sharing is difficult across Behave/Robot boundaries, but a shared module under features/testing/ could work since both already import from features.environment.

F9. Logger name changed — "behave.tdd_expected_fail" → "cleveragents.testing.tdd_tags". Any existing logging configuration targeting the old name will silently stop capturing TDD inversion logs.
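The effect described in F9 follows from how Python's `logging` hierarchy works and can be checked in isolation. The sketch below uses the two logger names from the finding; `ListHandler` and the log messages are hypothetical.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log messages so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


# Existing configuration that targets the old logger name.
old_handler = ListHandler()
old_logger = logging.getLogger("behave.tdd_expected_fail")
old_logger.addHandler(old_handler)
old_logger.setLevel(logging.INFO)

# Code that now logs under the new name.
new_logger = logging.getLogger("cleveragents.testing.tdd_tags")
new_logger.setLevel(logging.INFO)
new_logger.info("inverted expected failure")
missed = list(old_handler.messages)  # snapshot before re-targeting

# Re-targeting the configuration restores capture.
new_logger.addHandler(old_handler)
new_logger.info("captured after re-targeting")
print(missed, old_handler.messages)
```

The two names share no ancestor other than the root logger, so a handler attached to the old name never sees records emitted under the new one.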

F10. scenario: Any type annotations — handle_tdd_expected_fail and apply_tdd_inversion both use Any instead of Scenario. Loses static type safety. (Minor since Pyright only checks src/, not features/.)

F11. Logging level downgraded — Exception details logged at DEBUG in apply_tdd_inversion vs INFO in master's handle_tdd_expected_fail. Less visible in default configurations.

F12. Idempotency guard untested — _install_tdd_expected_fail_patch's if getattr(Scenario, "_tdd_run_patched", False): return branch has no test coverage.
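A test for the guard in F12 could look roughly like this. It is a sketch only: `FakeScenario` and `install_patch` are hypothetical stand-ins that reuse the `_tdd_run_patched` marker named in the finding, and the wrapper body is a placeholder for the real inversion logic.

```python
class FakeScenario:
    """Hypothetical stand-in for the class being patched."""

    def run(self, runner):
        return False


def install_patch(cls):
    """Wrap cls.run once; a second call returns early because of the marker."""
    if getattr(cls, "_tdd_run_patched", False):
        return
    original = cls.run

    def wrapped(self, runner):
        # Placeholder for the inversion logic; here it just delegates.
        return original(self, runner)

    cls.run = wrapped
    cls._tdd_run_patched = True


install_patch(FakeScenario)
first_wrapper = FakeScenario.run
install_patch(FakeScenario)  # second call must not wrap the wrapper
same_wrapper = FakeScenario.run is first_wrapper
result = FakeScenario().run(runner=None)
print(same_wrapper, result)
```

Comparing the function object before and after the second call is enough to show the early-return branch ran; without the guard the second install would replace `run` with a new wrapper around the first.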

Merge Gate

| Check | Status |
|---|---|
| P0:blocker | 0 (clear) |
| P1:must-fix | 2 (F1: zombie code, F2: import violation), must resolve |
| Lint | Not verified (needs CI run) |
| Typecheck | Not verified |
| Coverage ≥97% | Not verified |
| Reviewer approved | No: REQUEST_CHANGES |

Requesting changes on F1 and F2. The remaining P2s are strong recommendations. Excellent work overall — the monkey-patch approach is correct, well-documented, and properly guarded. The test coverage is thorough. Once the P1s are addressed, this should be ready for a final pass.


features/environment.py Outdated

						
				@@ -36,0 +218,4 @@

				    return True  # Force failure for the runner

				def handle_tdd_expected_fail(scenario: Any) -> None:

brent.edwards commented

P1:must-fix (F1) — This function is no longer called from after_scenario (the PR correctly moved inversion to the Scenario.run() wrapper). However, the 6 old scenarios in tdd_expected_fail_infrastructure.feature still call it directly. This creates a false-confidence coverage signal — tests pass but validate dead code with different semantics from the production apply_tdd_inversion (missing non-assertion guard, unconditional step clearing, no exception logging).

Fix: Either update the old infrastructure tests to exercise apply_tdd_inversion instead, or remove handle_tdd_expected_fail and the old tests entirely.


features/environment.py Outdated

						
				@@ -36,0 +292,4 @@

				    The patch is installed once in ``before_all`` and is idempotent.

				    """

				    from behave.model import Scenario

brent.edwards commented

P1:must-fix (F2) — from behave.model import Scenario moved inside function. Master had this at the top of the file (line 13). CONTRIBUTING.md §1289-1294 requires all imports at module level (only exception: TYPE_CHECKING). This is a regression.

Fix: Restore from behave.model import Scenario at the top alongside from behave.model import Status, and remove this inner import.


features/environment.py Outdated

						
				@@ -228,0 +507,4 @@

				    # Validate the three-tag system BEFORE any other setup so that

				    # misconfigured TDD tests are caught immediately.

				    # See CONTRIBUTING.md > TDD Bug Test Tags for the full specification.

				    validate_tdd_tags(set(scenario.effective_tags))

brent.edwards commented

P2:should-fix (F3) — Bare ValueError propagating into Behave's hook dispatcher produces HOOK-ERROR in before_scenario: ValueError: ... which looks like an infrastructure crash, not a tag error. Master's explicit stderr messages were clearer.

Suggested fix:

try:
    validate_tdd_tags(set(scenario.effective_tags))
except ValueError as exc:
    scenario.set_status(Status.failed)
    _tdd_logger.error("TDD TAG ERROR in %r: %s", scenario.name, exc)
    return

Also: no test exercises this hook-error path.


features/steps/tdd_tag_validation_steps.py Outdated

						
				@@ -0,0 +117,4 @@

				@then("tdd demo this step is never reached")

				def step_tdd_demo_never_reached(context: Context) -> None:

brent.edwards commented