feat(testing): implement @tdd_expected_fail tag handling in Robot Framework #673
No reviewers
Labels
No labels
auto/needs-reevaluation
controller-managed
auto/blocked-by-deps
auto/ci-timeout
auto/claimed-implementer
auto/claimed-merge
auto/claimed-reviewer
auto/driver-down
auto/invariant-violation
auto/last-attempt-tier-0
auto/last-attempt-tier-1
auto/last-attempt-tier-2
auto/last-attempt-tier-min
Automation Tracking
auto/needs-conflict-resolution
auto/needs-implementer
auto/postmortem
auto/ready-to-merge
auto/restart-throttled
auto/revert
auto/sentinel
auto/stale-inactivity
auto/unstable
Blocked
Bounty
$100
Bounty
$1000
Bounty
$10000
Bounty
$20
Bounty
$2000
Bounty
$250
Bounty
$50
Bounty
$500
Bounty
$5000
Bounty
$750
MoSCoW
Could have
MoSCoW
Must have
MoSCoW
Should have
Needs Feedback
Points
1
Points
13
Points
2
Points
21
Points
3
Points
34
Points
5
Points
55
Points
8
Points
88
Priority
Backlog
Priority
CI Blocker
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Signed-off: Owner
Signed-off: Scrum Master
Signed-off: Tech Lead
Spike
State
Completed
State
Duplicate
State
In Progress
State
In Review
State
Paused
State
Unverified
State
Verified
State
Wont Do
Type
Automation
Type
Bug
Type
Discussion
Type
Documentation
Type
Epic
Type
Feature
Type
Legendary
Type
Refactor
Type
Support
Type
Task
Type
Testing
No project
No assignees
4 participants
Notifications
Due date
No due date set.
Blocks
#628 Implement @tdd_expected_fail tag handling in Robot Framework
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core!673
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "feature/m5-robot-tdd-tags"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Implements the three-tag TDD bug-capture system in Robot Framework via a Listener v3 module, paralleling the Behave implementation in PR !665 (issue #627). This enables TDD bug-capture tests to work in Robot Framework integration tests using the same
tdd_bug,tdd_bug_<N>, andtdd_expected_failtags as specified in CONTRIBUTING.md > TDD Bug Test Tags.Changes
robot/tdd_expected_fail_listener.py): Robot Framework Listener v3 withstart_test(tag validation) andend_test(result inversion) hooks. Tests taggedtdd_expected_failthat fail have their result inverted to PASS (bug still exists); tests that unexpectedly pass are inverted to FAIL with guidance message. Includes idempotency guard against double-invocation, explicit SKIP status handling,close()hook for clean teardown, and documenteddata.tagsvsresult.tagsdesign choice. Error message@prefix divergence from Behave is documented.tdd_bug_<N>requirestdd_bug;tdd_expected_failrequires bothtdd_bugand at least onetdd_bug_<N>. Validation errors are stored instart_testand applied inend_test. Design note documents whystr | Nonereturn type is used instead ofValueError.noxfile.py): Added--listenerflag inintegration_tests,slow_integration_tests, ande2e_testssessions usingPath(__file__).parent-relative path resolution. Added--exclude tdd_fixtureto prevent fixture files from being picked up by the main pabot runner.robot/fixtures/): 8 fixture.robotfiles covering all acceptance criteria plus SKIP handling, tdd_expected_fail-alone validation, and tdd_bug-alone edge case.robot/tdd_tag_validation.robot): 21 test cases total — 12 Behave-side (from PR !665) plus 9 Robot-side that run fixtures via helper script as sub-processes with the listener active, then inspect output XML for correct final status.robot/helper_tdd_tag_validation.py): Unified helper supporting both Behave-side validation commands and Robot-side fixture commands. Includes_run_multi_fixture()for proving listener selectivity, output.xml existence guard, tightened substring assertions,__all__declaration, XML message attribute fallback for RF 7.3+, and XML parsing risk documentation.features/mocks/tdd_test_helpers.py): Extractedmake_mock_scenario()from duplicated code inrobot/helper_tdd_tag_validation.pyandfeatures/steps/tdd_tag_validation_steps.py. Both files now import from the shared module.CHANGELOG.md): Added entry for this feature.Review Fixes (Round 4)
Rebased onto latest master — PR now contains a single commit for #628 only. Addressed all applicable findings from @hamza.khyari's code review:
_processed_testsset guard withclose()cleanupdata.tagsvsresult.tagstdd_expected_fail_skip.robotfixture + test case"TDD expected failure"assertion"missing the required tdd_bug tag"tdd_expected_fail_alone.robotfixture + test_validation_errorsnot clearedclose()hooktuple[str, str]Path(__file__).parent / ...__all____all__: list[str] = []_make_mock_scenarioduplicatedfeatures/mocks/tdd_test_helpers.pystatus_el.get("message")fallback for RF 7.3+@prefix_validate_tdd_tagsreturn type undocumentedQuality Gates (post-rebase onto latest master)
nox -s lintnox -s typechecknox -s unit_testsnox -s integration_testsnox -s coverage_reportCloses #628
ISSUES CLOSED: #628
Self-Review: PR #673 —
feat(testing): implement @tdd_expected_fail tag handling in Robot FrameworkThis is a self-review to ensure everything looks good before asking other people to review.
Acceptance Criteria Verification (Ticket #628)
All 6 acceptance criteria are met:
tdd_expected_failtests that fail → inverted to PASStdd_expected_fail_listener.py:176-186, fixturetdd_expected_fail_fails.robottdd_expected_failtests that pass → inverted to FAIL with guidancetdd_expected_fail_listener.py:187-201, fixturetdd_expected_fail_passes.robottdd_bug_<N>withouttdd_bug→ validation errortdd_expected_fail_listener.py:88-94, fixturetdd_missing_tdd_bug.robottdd_expected_failwithouttdd_bug/tdd_bug_<N>→ validation errortdd_expected_fail_listener.py:96-108, fixturetdd_expected_fail_missing_bug_n.robottdd_bugusable as a filter tag (--include tdd_bug)tdd_expected_fail_listener.py:172-174, fixturetdd_normal_test.robotIssue Found
# type: ignore[operator]violates CONTRIBUTING.md Type Safety rulesFile:
robot/helper_tdd_tag_validation.py:184Related:
robot/helper_tdd_tag_validation.py:165CONTRIBUTING.md (line 546-548) states: "never use inline comments or annotations to suppress individual type checking errors (e.g., no
type: ignore)".Line 165 types the command dict as
dict[str, object], which causeshandleron line 183 to be inferred asobject(not callable). This forces the# type: ignore[operator]on line 184.10 other helpers in the same directory already use the correct pattern:
(e.g.,
helper_unified_context_models.py:152,helper_uko_ontology.py:105,helper_repo_indexing.py:170)Fix:
from collections.abc import Callableto imports._COMMANDS: dict[str, Callable[[], int]] = {# type: ignore[operator]from line 184.Items Reviewed (No Issues)
tdd_expected_fail_listener.py): Correct Listener v3 API usage. Tag validation instart_testwith deferred application inend_testis the right pattern._validation_errorsdict properly cleaned via.pop(). Case-insensitive tag matching via.lower()is correct._TDD_BUG_N_REregex (tdd_bug_\d+withfullmatch) correctly matches issue reference tags.noxfile.py):--listenerand--exclude tdd_fixturecorrectly added to bothintegration_testsandslow_integration_testssessions. Listener path resolved to absolute.tdd_fixtureexclusion tag, proper documentation, and correctly exercises its intended scenario.tdd_tag_validation.robot): Usescommon.resourceforSetup Test Environment, correctly references${PYTHON}and${WORKSPACE}, validates each sub-command via the helper.helper_tdd_tag_validation.py): Sub-process isolation is correct. XML message extraction has a sensible fallback to<msg>elements._validation_errorsdict (separate processes). With the standard runner, tests are sequential. No concurrency issues.subprocess.run. All paths derived from__file__and fixture names.@ -0,0 +162,4 @@return 1_COMMANDS: dict[str, object] = {This should be
dict[str, Callable[[], int]](withfrom collections.abc import Callableadded to imports). Usingobjectloses the callable type information and forces the# type: ignore[operator]on line 184. 10 other helpers in this directory already use theCallable[[], int]pattern (e.g.,helper_unified_context_models.py:152).@ -0,0 +181,4 @@sys.exit(1)handler = _COMMANDS[sys.argv[1]]sys.exit(handler()) # type: ignore[operator]Remove this
# type: ignore[operator]— it violates CONTRIBUTING.md Type Safety rules (line 546-548). Once the_COMMANDStype on line 165 is fixed todict[str, Callable[[], int]], this suppression is no longer needed.b882b5e208f72a981f54Self-review fix applied (force-pushed
f72a981f):Fixed the
# type: ignore[operator]violation inrobot/helper_tdd_tag_validation.pyidentified in the self-review:from collections.abc import Callableto imports (line 42)_COMMANDS: dict[str, object]→_COMMANDS: dict[str, Callable[[], int]](line 166)# type: ignore[operator]fromsys.exit(handler())(line 185)This follows the same pattern used by 10+ other helper scripts in the
robot/directory (e.g.,helper_unified_context_models.py:152,helper_uko_ontology.py:105).Verified locally:
nox -s lint→ PASS (0 violations)nox -s typecheck→ PASS (0 errors, 0 warnings)CI is running on the updated commit.
Self-Review (Round 2): PR #673 —
feat(testing): implement @tdd_expected_fail tag handling in Robot FrameworkThis is a self-review to ensure everything looks good before asking other people to review.
Previous Review Issues — All Resolved
The round-1 self-review (review #2091) identified two issues. Both are confirmed fixed in
f72a981f:dict[str, object]→dict[str, Callable[[], int]]athelper_tdd_tag_validation.py:166— Fixed# type: ignore[operator]removed athelper_tdd_tag_validation.py:185— FixedAcceptance Criteria (Ticket #628) — All Met
tdd_expected_fail+ fail → PASSlistener.py:176-186, fixturetdd_expected_fail_fails.robottdd_expected_fail+ pass → FAIL with guidancelistener.py:187-201, fixturetdd_expected_fail_passes.robottdd_bug_<N>withouttdd_bug→ validation errorlistener.py:88-94, fixturetdd_missing_tdd_bug.robottdd_expected_failwithout prerequisites → validation errorlistener.py:96-108, fixturetdd_expected_fail_missing_bug_n.robottdd_bugusable as filterlistener.py:172-174, fixturetdd_normal_test.robotGuidance message matches ticket requirement exactly.
New Finding
end_testbareelsecatches SKIP status incorrectly (Medium)File:
robot/tdd_expected_fail_listener.py:187See inline comment below. Robot Framework 4.0+ supports
SKIPstatus. The current bareelse:at line 187 catches any non-FAIL status (PASS, SKIP, etc.), so a skippedtdd_expected_failtest would be incorrectly reported as "Bug appears to be fixed." Should be changed toelif result.status == "PASS":with an explicitelse:for other statuses.Likelihood: Low — TDD tests shouldn't normally be skipped. But since this is long-lived infrastructure code, defensive handling is warranted.
Items Reviewed (No Issues)
_validation_errorsdict with.pop()cleanup. Case-insensitive matching. Regex withfullmatchcorrectly distinguishestdd_bugfromtdd_bug_<N>.--listenerand--exclude tdd_fixturecorrectly added to bothintegration_testsandslow_integration_tests. Listener path resolved to absolute.tdd_tag_validation.robot): Correct setup, variable usage, and assertions.# type: ignore. All functions annotated.@ -0,0 +184,4 @@result.message = ("TDD expected failure: test failed as expected (bug still exists).")else:This bare
else:catches any non-FAIL status, includingSKIP(RF 4.0+). If atdd_expected_failtest is skipped (e.g.,Skipkeyword,--skiptag,skip_on_failure), it would incorrectly be reported as "Bug appears to be fixed."Suggestion: Change to explicit status matching:
f72a981f54f528859e2cSelf-review round 2 fix applied (force-pushed
f528859e):Fixed the bare
elsecatching SKIP status inrobot/tdd_expected_fail_listener.py(review #2093):else:→elif result.status == "PASS":at line 187 — now only catches actual passeselse:clause (line 202) for SKIP and other statuses — logs a debug message and leaves the result unchangedend_testdocstring to document the SKIP handling behaviorRationale: Robot Framework 4.0+ supports
SKIPstatus viaSkipkeyword,--skiptag, andskip_on_failure. The previous bareelse:would incorrectly report a skippedtdd_expected_failtest as "Bug appears to be fixed." A skipped test is not evidence of a fix — it simply didn't execute.Verified locally:
nox -s lint→ PASS (0 violations)nox -s typecheck→ PASS (0 errors, 0 warnings)PM Status (Day 31):
Good work on the self-review fixes @hurui200320.
Status: Merge conflict detected. 4 comments (self-review).
Action required: Rebase against
developand request external review (suggest @brent.edwards or @CoreRasurae).Priority: HIGH — this is part of the TDD infrastructure chain (#628) that unblocks the CI quality gate.
f528859e2cdf74aee449PM Review — Day 31 (Specification Update)
This PR is mergeable.
Spec Alignment Check
Robot Framework TDD tag infrastructure is NOT impacted by protocol or TUI changes.
Status
Action Required
@hurui200320 — Request review from @brent.edwards or @CoreRasurae. This is HIGH priority TDD infrastructure.
PM Status — Day 31 (2026-03-11)
PR is rebased and clean (no merge conflicts). All quality gates pass. @hurui200320 posted a thorough self-review.
Needs: An external reviewer. @CoreRasurae — given your work on the Behave counterpart (#627/PR #665), you'd be the ideal reviewer for consistency. Alternatively, @brent.edwards has TDD workflow experience.
Priority: Tier 1 — this is TDD infrastructure that unblocks the bug pipeline. Please review within 1 business day.
df74aee449c383623eb7Code Review -- PR #673:
feat(testing): implement @tdd_expected_fail tag handling in Robot FrameworkReviewer: hamza.khyari | Protocol: 9-phase review (2 rounds, 6 independent passes)
Summary
Well-structured implementation. The listener correctly implements all 6 acceptance criteria from issue #628. The fixture-based subprocess test architecture is sound and mirrors real-world usage. Code quality is high -- clean separation between validation (
start_test) and inversion (end_test), proper regex compilation, and thorough docstrings.Two rounds of review (4 focused passes + 2 full-protocol passes) produced the findings below. 2 are should-fix, the rest are minor or nits.
Findings
tdd_expected_fail_listener.pytdd_tag_validation.robothelper_tdd_tag_validation.py_run_fixturedoes not guard against missingoutput.xmltdd_expected_fail_listener.pyend_testreadsdata.tagsinstead ofresult.tagstdd_tag_validation.robothelper_tdd_tag_validation.pycmd_expected_fail_invertednever checks message contenthelper_tdd_tag_validation.pytdd_tag_validation.robottdd_expected_failalone (both companions missing)tdd_expected_fail_listener.pytdd_expected_fail_listener.py_validation_errorsnever cleared on abnormal exithelper_tdd_tag_validation.py_run_fixturereturns stale temp file pathnoxfile.pyhelper_tdd_tag_validation.py__all__declarationtdd_expected_fail_listener.pyhelper_tdd_tag_validation.pyxml.etree.ElementTreewithoutdefusedxml1. Double-invocation idempotency bug
Severity: P2:should-fix | Category: BUG
File:
robot/tdd_expected_fail_listener.py(end_test)If the listener is loaded twice (
--listenerspecified twice, or passed via both noxfile and user args), both invocations share the same module-level state. The firstend_testinverts FAIL->PASS; the second sees PASS and inverts it back to FAIL with the "Bug appears to be fixed" guidance message -- a silent double-inversion producing the wrong result.Recommendation: Add an idempotency guard:
Clear it in a
close()hook alongside_validation_errors.2. "Normal Test Unaffected" is a sham test
Severity: P2:should-fix | Category: TEST
File:
robot/helper_tdd_tag_validation.py:133-141,robot/fixtures/tdd_normal_test.robotThe fixture contains
Should Be True ${TRUE}which passes unconditionally. The helper checksstatus == "PASS". This test produces the same PASS result with or without the listener loaded -- even if the listener file is empty or has an import error. It does not prove the listener was active and chose not to modify the result.Recommendation: Either (a) run the normal-test fixture in the same suite as a
tdd_expected_failfixture to prove the listener is loaded and selectively applies, or (b) check Robot's console output for the listener loading message.3.
_run_fixturedoes not guard against missingoutput.xmlSeverity: P3:nit | Category: BUG
File:
robot/helper_tdd_tag_validation.py:81If
robotfails to start (import error, bad listener path, syntax error),output.xmlis never created.ET.parse(out_xml)raises an opaqueFileNotFoundError. Since this is test infrastructure, a clear diagnostic would help debugging.Recommendation:
4.
end_testreadsdata.tagsinstead ofresult.tagsSeverity: P3:nit | Category: BUG
File:
robot/tdd_expected_fail_listener.py:174The old listener on master read from
result.tags(which reflects runtimeSet Tags/Remove Tagsmodifications). The new listener readsdata.tags(static test definition). If any test dynamically addstdd_expected_failat runtime, the new listener won't see it. If this is intentional (TDD tags should be static), add a comment explaining the choice.5. No test for SKIP status branch
Severity: P3:nit | Category: TEST
File:
robot/tdd_expected_fail_listener.py:192-199end_testhas an explicitelsebranch for SKIP that leaves the result unchanged. This branch has zero test coverage. A regression would go undetected.Recommendation: Add a fixture with
[Tags] tdd_bug tdd_bug_777 tdd_expected_fail tdd_fixturethat callsSkip msg=Skipping intentionally, and verify the result stays SKIP.6.
cmd_expected_fail_invertednever checks message contentSeverity: P3:nit | Category: TEST
File:
robot/helper_tdd_tag_validation.py:78-86This function only checks
status == "PASS". It does not verify the message is"TDD expected failure: test failed as expected (bug still exists).". A bug that inverts to PASS with a wrong or empty message goes undetected. By contrast,cmd_unexpected_pass_invertedcorrectly checks for"Bug appears to be fixed".Recommendation:
if status == "PASS" and "TDD expected failure" in message:7. Weak substring assertions in validation checks
Severity: P3:nit | Category: TEST
File:
robot/helper_tdd_tag_validation.py:100,115"tdd_bug" in message.lower()matches any message incidentally containing the substring."Bug appears to be fixed" in messagewould match a truncated message. Tighten to specific error text, e.g."missing the required tdd_bug tag"and"Remove the tdd_expected_fail tag".8. No test for
tdd_expected_failalone (both companions missing)Severity: P3:nit | Category: TEST
File:
robot/tdd_tag_validation.robotAC #4 covers
tdd_expected_failwithouttdd_bugortdd_bug_<N>. The existing fixturetdd_expected_fail_missing_bug_n.robottests missingtdd_bug_<N>only (whiletdd_bugis present). The branch where both are missing (missing: ["tdd_bug", "tdd_bug_<N>"]) is untested.9. Guidance message deviates from AC #2 exact wording
Severity: P3:nit | Category: SPEC
File:
robot/tdd_expected_fail_listener.py:183-188AC #2 specifies: "Bug appears to be fixed. Remove the tdd_expected_fail tag and verify the fix through the bug fix workflow."
Implementation adds "from this test" and appends "See CONTRIBUTING.md > Bug Fix Workflow." -- arguably improvements, but not spec-verbatim. Also note the Behave implementation uses a different message format entirely.
10.
_validation_errorsnever cleared on abnormal exitSeverity: P3:nit | Category: CODE
File:
robot/tdd_expected_fail_listener.py:64If
start_testfires butend_testnever fires (Robot crash, SIGINT), stale entries remain. In normal RF execution this is harmless (process exits), but aclose()hook would make the listener reusable in long-lived processes:11.
_run_fixturereturns stale temp file pathSeverity: P3:nit | Category: CODE
File:
robot/helper_tdd_tag_validation.py:96Returns
(status, message, out_xml)whereout_xmlpoints inside an already-deletedTemporaryDirectory. All callers discard it via_. Simplify totuple[str, str].12. Listener path resolution is CWD-relative
Severity: P3:nit | Category: CODE
File:
noxfile.pyPath("robot/tdd_expected_fail_listener.py").resolve()resolves relative to CWD. If nox is invoked from a different directory, the path breaks. ConsiderPath(__file__).parent / "robot/..."instead.13. No
__all__declarationSeverity: P4:nit | Category: CODE
File:
robot/helper_tdd_tag_validation.pyOther helpers in
robot/(e.g.,helper_cross_plan_correction.py) define__all__: list[str] = []. This helper doesn't -- minor inconsistency.14. Docstring says "mirroring" but messages differ
Severity: P4:nit | Category: CODE
File:
robot/tdd_expected_fail_listener.py:1Module docstring references "mirroring the Behave implementation" but the error message formats differ significantly between the two. Consider "paralleling" instead.
15. XML parsing without
defusedxmlSeverity: P4:nit | Category: SEC
File:
robot/helper_tdd_tag_validation.py:63xml.etree.ElementTree.parse()is vulnerable to billion-laughs entity expansion. The XML is self-generated by Robot Framework so the risk is negligible. Either usedefusedxmlor add a comment documenting the accepted risk.Process Checks (all pass)
ISSUES CLOSED: #628footer present# type: ignorein new sourceprint()usage appropriate (CLI test helper, not production code)PM Status — Day 32
Status: CONFLICTED — needs rebase. HIGH PRIORITY — TDD infrastructure.
PR: Robot Framework
@tdd_expected_failtag handler. M3 (v3.2.0), severely overdue (due Feb 26). Author: @hurui200320. 8 comments, has existing reviews.Context: This is the Robot Framework counterpart to PR #665 (Behave TDD handler). Both are needed for the mandatory TDD bug workflow. PR #665 has a supersession question pending CoreRasurae's response by Day 33.
Action Required:
c383623eb7b1e45b490eReview Fixes Applied (Round 3) — Force-pushed
b1e45b49Rebased onto latest master (
3e3e9b4b) and addressed all 15 findings from @hamza.khyari's code review.P2 Fixes (should-fix)
1. Double-invocation idempotency bug — Added
_processed_tests: set[str]module-level guard intdd_expected_fail_listener.py. Each test is processed at most once perend_testcall. Cleared in the newclose()hook alongside_validation_errors.2. Normal Test is a sham test — Rewrote
cmd_normal_test_unaffected()using a new_run_multi_fixture()function that runstdd_normal_test.robotalongsidetdd_expected_fail_fails.robotin a single Robot invocation. The check now verifies both that the normal test passed AND that the expected-fail fixture was inverted to PASS (proving the listener is loaded and selectively applies).P3 Fixes (nits)
3. Missing output.xml guard — Added
Path(out_xml).exists()check aftersubprocess.run()with diagnostic stderr output includingproc.returncodeandproc.stderr.4.
data.tagsvsresult.tags— Added docstring inend_testdocumenting the intentional choice: TDD tags are declarative metadata that must be statically declared in the source file. RuntimeSet Tags/Remove Tagsmodifications are not supported for TDD tags.5. No SKIP test — Added
tdd_expected_fail_skip.robotfixture (usesSkipkeyword) andcmd_skip_status_unchanged()helper + Robot test case.6. Missing message check —
cmd_expected_fail_inverted()now asserts"TDD expected failure" in messagein addition tostatus == "PASS".7. Weak substring assertions —
cmd_missing_tdd_bug_validation()now asserts"missing the required tdd_bug tag"instead of generic"tdd_bug" in message.lower().8. No tdd_expected_fail-alone test — Added
tdd_expected_fail_alone.robotfixture (onlytdd_expected_failtag, bothtdd_bugandtdd_bug_<N>missing) andcmd_expected_fail_alone_validation()helper + Robot test case.9. Guidance message deviation — Kept the improvements ("from this test", "See CONTRIBUTING.md > Bug Fix Workflow.") as they provide better developer guidance. The Behave implementation uses
@tdd_expected_failwith the@prefix per Behave convention; Robot usestdd_expected_failwithout prefix per Robot Framework convention.10.
_validation_errorsnever cleared — Addedclose()hook that clears both_validation_errorsand_processed_tests. Called by Robot Framework when execution finishes.11. Stale temp path return — Simplified
_run_fixture()return type totuple[str, str](status, message). Theout_xmlpath was always discarded by callers.12. CWD-relative listener path — Changed all three nox sessions (
integration_tests,slow_integration_tests,e2e_tests) to usePath(__file__).parent / "robot" / "tdd_expected_fail_listener.py"instead of CWD-relative resolution.P4 Fixes (nits)
13. No
__all__— Added__all__: list[str] = []tohelper_tdd_tag_validation.py.14. "mirroring" in docstring — Changed to "paralleling" in
tdd_expected_fail_listener.pymodule docstring.15. XML parsing without defusedxml — Added comment documenting the accepted risk: the XML is always self-generated by Robot Framework, never from untrusted input.
Additional Fix
Increased
timeoutfrom30sto120sforCLI Plan Tree Displays Subplan Hierarchyinm4_e2e_verification.robot— pre-existing timeout failure unrelated to this feature (3 subprocess invocations + full Alembic migrations exceed 30s).Quality Gates
nox -s lintnox -s typechecknox -s unit_testsnox -s integration_testsnox -s coverage_reportnox -s benchmarknox -s docsnox -s buildnox -s security_scannox -s dead_codeb1e45b490ed615276349Code Review — PR #673
Verdict: REQUEST_CHANGES
Critical
C-1: PR bundles 7+ issues across 4+ milestones — must be split
Category: PROCESS
This PR contains 12 non-merge commits closing 7+ distinct issues spanning M3, M4, M5, and M6:
tdd_expected_failRobot listenerd6152763tdd_expected_failBehave hooksc8cd7eab3e3e9b4be732c329,61ba4370d4a1a0d8317d0152,eb7706435e625b22,6806ef36CONTRIBUTING.md: "One Epic scope per PR." This needs at minimum 5 separate PRs.
C-2: M4 CLI integration tests use mocks — violates CONTRIBUTING.md
Category: TEST
Files:
robot/helper_m4_e2e_cli.py,robot/helper_m4_e2e_cli_errors.pyThese Robot helpers use
unittest.mock.patchandCliRunner(52+ mock calls in cli.py, 25+ in cli_errors.py). CONTRIBUTING.md line 569: "Mocks are strictly prohibited in integration tests."M1, M2, M3, and M6 were correctly converted to real subprocess invocations in commit
5e625b22, but M4 CLI tests were not. The TDD mock-only checker doesn't scan these sub-modules because_HELPER_FILESonly lists the dispatcher files — so the mocks evade detection.Fix: Convert to
run_cli()fromhelper_e2e_common.py.High
H-1:
session_serviceisproviders.Factory— creates new engine per callCategory: BUG
File:
src/cleveragents/application/container.py:507-512_build_session_servicecreates a new SQLAlchemy engine, inspects tables, and creates a sessionmaker on every invocation. CLI mitigates via a module-level cache, but any non-CLI consumer callingcontainer.session_service()multiple times leaks engines.Fix: Change to
providers.Singleton.H-2: Commit
9c8dc9edviolates Conventional CommitsCategory: PROCESS
Subject
Docs: Daily update to timeline— wrong format, no body, noISSUES CLOSED:footer. Unrelated to the feature.H-3: Commit
94d95324missingISSUES CLOSED:footerCategory: PROCESS
Subject
fix(test): correct mock wiring in plan_correct_tree_wiring Robot helper— no issue reference in footer. CONTRIBUTING.md requires every commit to reference its issue.Medium
M-1:
CorrectionServicecreated ad-hoc inplan.py, bypasses container SingletonCategory: BUG
File:
src/cleveragents/cli/commands/plan.py:2421Container has a
correction_serviceSingleton provider, butplan.pyconstructs its own instance — loses state between calls.Fix:
svc = container.correction_service()M-2:
AuditService._ensure_session()not thread-safeCategory: BUG
File:
src/cleveragents/application/services/audit_service.py:131-139Lazy init with no lock. As a Singleton, concurrent threads can race through
_ensure_session, creating duplicate engines.Fix: Add
threading.Lockaround initialization.M-3: Incomplete
entity_deletedaudit coverage vs specCategory: SPEC
Spec requires
entity_deletedfor allagents <entity> delete/remove. Currently emitted byActorService,ProjectService,SessionService. Missing:SkillService.remove_skill(), tool removal, validation detachment, resource removal. No emitters include "cascade effects" as spec requires.M-4:
_suppress_structlog_stdout/_restore_structlogduplicated 3 times with incomplete restoreCategory: CODE
Files:
tdd_session_shared_steps.py:27-66,session_list_error_steps.py:60-88,tdd_session_list_missing_db_steps.py:34-63Three identical copies. The restore function doesn't restore
logger_factory/wrapper_class— onlycache_logger_on_first_use. In serial test runs, this permanently mutates structlog config.Fix: Extract to shared utility, restore the full config.
M-5:
_make_mock_scenarioduplicated between Robot helper and Behave stepsCategory: CODE
Files:
robot/helper_tdd_tag_validation.py:98-125,features/steps/tdd_tag_validation_steps.py:95-136Nearly identical mock builder. Extract to
features/mocks/tdd_test_helpers.py.M-6:
cleanup_workspace()mutatesos.environ— not parallelism-safeCategory: BUG
File:
robot/helper_e2e_common.py:106-109Direct
os.environmutation is unsafe under pabot. Pass env vars viarun_cli()env parameter instead.M-7: XML message extraction may miss RF status message attribute
Category: BUG
File:
robot/helper_tdd_tag_validation.py:379-385Reads
status_el.textbut doesn't checkstatus_el.get("message"). RF 7.3+ may use either.Fix: Add fallback:
if not message: message = status_el.get("message", "")Low
_fail()usesraise SystemExit(1)vssys.exit(1)— inconsistent (helper_m4_e2e_common.py:70,helper_e2e_common.py:112)CONFIG_CHANGEDaudit events for oneserver connectoperation (server.py:123-125)ConfigService()in_resolve_auto_reindexconstructed withoutevent_bus(container.py:256)container.py:636-643)tdd_session_create_di.feature:3-6,tdd_session_list_di.feature:3-6,tdd_session_list_missing_db.feature:5-6)full_name(tdd_expected_fail_listener.py:40,73)@prefix, diverging from Behave counterpart (tdd_expected_fail_listener.py:105-121)Nits
helper_tdd_tag_validation.pyat 637 lines,helper_m3_e2e_verification.pyat 901 lines,helper_m6_e2e_verification.pyat 863 lines exceed 500-line limit — should be splitstart_test+end_test) — negligible overheadtdd_bugalone withouttdd_bug_Nortdd_expected_fail_validate_tdd_tagsreturns string vs Behave'sValueError— undocumented design choiceType/Featurelabel inaccurate for a PR containing bug fixes, tests, observability, and docsPM Triage — Day 33 (2026-03-13)
@hamza.khyari posted a formal REQUEST_CHANGES review with findings across multiple severity levels. @hurui200320's Round 3 fixes addressed the 15 findings from hamza's earlier comment review, but the formal review (C-1, C-2, H-1 through H-3, M-1 through M-7) contains additional findings that need resolution.
Critical Assessment
C-1 (PR bundles 7+ issues — must split): This is the most significant finding. The PR description confirms 12 non-merge commits spanning #628, #627, #581, #554, #570, #680, #740, #495, #496, #658, #697 across M3–M6. CONTRIBUTING.md does require one Epic scope per PR.
PM ruling on C-1: At Day 33, splitting this PR into 5+ PRs would create significant risk and delay TDD infrastructure that 5 Critical TDD issues (#838–#842) now depend on. Recommended path forward:
C-2 (M4 CLI tests use mocks): Valid —
unittest.mock.patchin integration tests violates CONTRIBUTING.md line 569. However, this is pre-existing code from other commits, not the TDD listener feature. Non-blocking for the TDD merge but must be filed as a separate issue.Blocking vs Non-Blocking
session_serviceFactory bug is pre-existing, not introduced by TDD featureAction Required
@hurui200320: Please respond to hamza's formal REQUEST_CHANGES review within 24 hours. Specifically:
@hamza.khyari: Excellent review depth. Once Rui responds to C-1, would you be willing to re-review if the TDD-only subset is extracted? Your review of the listener code itself (findings 1–15 from your earlier comment) has been fully addressed per Round 3.
Priority: This remains Tier 1 — TDD infrastructure. Issues #838, #839, #840, #841, #842 (all Critical/Must have) are blocked until the TDD tag handler is on master.
PM Status Update — Day 34
This PR is Tier 1 priority — it unblocks 5 Critical bug TDD issues (#838-#842). The technical quality is solid (all 15 of Hamza's findings addressed, 10 quality gates green).
Blocking item: Hamza's REQUEST_CHANGES (C-1 — scope bundling)
The PR bundles 7+ issues across M3-M6 milestones. CONTRIBUTING.md requires one-Epic-scope-per-PR. This is the single thing preventing merge.
PM ruling on C-1 (reaffirming Day 33 position):
Given we are Day 34 and this blocks 5 Critical bugs, splitting is not justified if it takes more than ~4 hours of work. The priority is unblocking the TDD pipeline.
@hurui200320 — Please respond to Hamza's formal review within 24 hours:
unittest.mock.patch): File a follow-up issue and note the issue number here.@hamza.khyari — Given the context (5 Critical bugs blocked), would you accept the PR if Rui addresses C-1 with either approach above? Your review quality is outstanding, but the process-purity vs. schedule tradeoff favors merge here.
Label update: State should be
State/Verified, notState/Unverified.PM Review — Day 34: Scope Bundling Ruling (C-1)
@hurui200320 @hamza.khyari — This PR has been in review for 4 days with 13 comments. All 15 technical findings from Hamza's review have been addressed by Rui. The remaining open discussion is the C-1 scope bundling concern (Hamza's REQUEST_CHANGES): whether the Behave integration tests and Robot handler should be split into separate PRs.
PM Ruling
Accept as-is with a process-debt follow-up. Rationale:
helper_tdd_tag_validation.py). Splitting would require duplicating this infrastructure or creating a dependency chain between 2 PRs — adding complexity, not reducing it.Required Action
PR Status
PM review — Day 34 schedule adherence
d615276349c7bdb62b73Review Response (Round 4) — Force-pushed
c7bdb62bRebased onto latest master (
3b6b1d2) and addressed Hamza's formal REQUEST_CHANGES review (review #2190). After rebase, this PR contains a single commit (c7bdb62b) implementing only #628 — all other commits from the previous branch history are now on master.C-1 and C-2: Resolved by rebase
After rebasing onto latest master, this PR now contains only the #628 commit — 16 files changed, 1004 insertions, 168 deletions. The scope-bundling concern (C-1) and mock-in-integration-tests concern (C-2) are no longer applicable because those commits are already on master.
Findings addressed in this round
_make_mock_scenarioduplicated between Robot helper and Behave stepsfeatures/mocks/tdd_test_helpers.pyasmake_mock_scenario(). Bothrobot/helper_tdd_tag_validation.pyandfeatures/steps/tdd_tag_validation_steps.pynow import from the shared module. Updatedfeatures/mocks/__init__.pyto export it.status_el.get("message", "")fallback in both_run_fixture()and_run_multi_fixture(). Now checks element text →messageattribute (RF 7.3+) → child<msg>element.full_namefull_name" in the Implementation Notes docstring.@prefix, diverging from Behave counterpart_validate_tdd_tagsexplaining the intentional divergence — Robot Framework tags don't use the@prefix, while Behave does. Each framework's error messages use its own convention.tdd_bugalone withouttdd_bug_Nortdd_expected_failrobot/fixtures/tdd_bug_alone.robotfixture,cmd_tdd_bug_alone_valid()helper command, and "TDD Bug Tag Alone Is Valid" test case. Total Robot-side integration tests: 9._validate_tdd_tagsreturns string vs Behave'sValueError— undocumented design choice_validate_tdd_tagsdocstring explaining why `strFindings now out of scope (resolved by rebase)
After rebasing, the following findings are no longer part of this PR's diff — they exist on master from previously-merged commits:
session_serviceFactory → Singleton9c8dc9edviolates Conventional Commits94d95324missingISSUES CLOSED:footerCorrectionServicebypasses container SingletonAuditService._ensure_session()not thread-safeentity_deletedaudit coverage_suppress_structlog_stdoutduplicated 3 timescleanup_workspace()mutatesos.environhelper_e2e_common.py)._fail()usesraise SystemExit(1)vssys.exit(1)CONFIG_CHANGEDaudit eventsConfigService()withoutevent_busType/Featurelabel inaccurate for multi-type PRType/Feature.helper_m3_e2e_verification.py/helper_m6_e2e_verification.pyline countsI recommend @hamza.khyari or PM file separate issues for H-1, M-1, M-2, M-3, M-4, and M-6 to track these as independent follow-ups.
Quality Gates (post-rebase + review fixes)
nox -s lintnox -s typechecknox -s unit_testsnox -s integration_testsnox -s coverage_reportPR scope after rebase
This PR now contains a single commit for a single issue (#628):
.robotfiles (includingtdd_bug_alone.robot)features/mocks/tdd_test_helpers.py)robot/tdd_expected_fail_listener.py) — modifiedrobot/helper_tdd_tag_validation.py) — modifiedrobot/tdd_tag_validation.robot) — modifiedfeatures/mocks/__init__.py— modified (export shared helper)features/steps/tdd_tag_validation_steps.py— modified (use shared helper)Self-Review (Round 5) — PR #673:
feat(testing): implement @tdd_expected_fail tag handling in Robot FrameworkReviewer: hurui200320 | Protocol: 5-agent parallel review (test coverage/flaws, performance, bug detection, security, spec/code quality)
Verdict: APPROVED — with non-blocking suggestions below.
Overall Assessment
All 6 acceptance criteria from ticket #628 are correctly satisfied. Code quality is high — clean separation between validation (
start_test) and inversion (end_test), proper regex compilation at module level, thorough docstrings, and comprehensive Robot + Behave test coverage. All 15 findings from @hamza.khyari's previous review have been properly addressed in Round 4. The rebase cleanly scoped this PR to a single commit for #628.Performance: No issues. Listener overhead is negligible (~microseconds per test across 1500+ tests).
Security: No issues. Subprocess invocation uses list args (no shell injection), fixture names are hardcoded, temp dirs managed via context managers, XML is self-generated.
Bug Detection: No new bugs found. All 8 previously-reported bugs properly fixed.
6 new suggestions are listed below — 2 Medium, 4 Low/Nit. None are blocking.
Findings
helper_tdd_tag_validation.pytdd_expected_fail_listener.py_processed_tests) is untestedtdd_expected_fail_listener.pyclose()cleanup function is untestedhelper_tdd_tag_validation.py:554expected-fail-alonevalidationtdd_expected_fail_listener.py:161.lower()) never exercised by testshelper_tdd_tag_validation.py:347,4131. File exceeds 500-line limit (Medium)
File:
robot/helper_tdd_tag_validation.py(629 lines)CONTRIBUTING.md mandates files stay under 500 lines. This was also noted as a nit by @hamza.khyari in the previous review.
Recommendation: Split along the natural Behave/Robot boundary, or extract the Robot-side fixture runner infrastructure (
_run_fixture,_run_multi_fixture) into a small_tdd_fixture_runner.pyutility. Extracting the XML message-extraction helper (Finding #6) would also reduce line count.2. Idempotency guard untested (Medium)
File:
robot/tdd_expected_fail_listener.py, lines 79, 193–195The
_processed_testsset guard prevents double-inversion if the listener is loaded twice. No test exercises this — each helper subprocess starts fresh, so the guard is never triggered. If this guard broke (e.g., someone clears the set prematurely), atdd_expected_failtest would silently double-invert (FAIL→PASS→FAIL), producing wrong results with no test catching it.Recommendation: Add a Robot-side test that passes
--listener <path> --listener <path>and verifies the fixture is still correctly inverted to PASS (not double-inverted back to FAIL).3.
close()cleanup untested (Low)File:
robot/tdd_expected_fail_listener.py, lines 254–262close()clears both_validation_errorsand_processed_tests. No test verifies this behavior. Each test subprocess starts fresh, soclose()cleanup never observably matters in the current test topology.Recommendation: Low priority. Consider adding a note documenting that
close()is a safety net for embedded/long-lived Robot processes, or add a simple direct-import test that populates the dicts, callsclose(), and verifies they are empty.4. Weak
expected-fail-aloneassertion (Low)File:
robot/helper_tdd_tag_validation.py, line 554The first check (
"tdd_bug" in message) is always satisfied when the second ("tdd_bug_<N>" in message) is true, because"tdd_bug"is a substring of"tdd_bug_<N>". The assertion doesn't truly verify both tags are listed independently in the error message.Recommendation: Tighten to
"tdd_bug, tdd_bug_<N>" in messageor check for the combined phrase from the actual error output.5. Tag case-normalization never exercised (Low)
File:
robot/tdd_expected_fail_listener.py, lines 161, 213All 8 fixture files use lowercase tags exclusively. The
.lower()normalization instart_test/end_testis never tested with mixed-case input. If it were removed, no test would fail.Recommendation: Modify one fixture to use mixed-case tags (e.g.,
TDD_Expected_Fail,TDD_BUG_999) and verify the listener correctly handles them.6. Duplicated XML message-extraction (Low/Nit)
File:
robot/helper_tdd_tag_validation.py, lines 347–353 and 413–419The RF output XML message extraction pattern (check
status_el.text→status_el.get("message")→<msg>child) is copy-pasted between_run_fixtureand_run_multi_fixture. Any future RF output-format change must be updated in two places.Recommendation: Extract to
_extract_message(test_el: ET.Element, status_el: ET.Element) -> str. Also helps reduce line count (Finding #1).Previous Review Comments — Status
All of @hamza.khyari's 15 findings from the formal review have been addressed in Round 4. The C-1 (scope bundling) and C-2 (mock in integration tests) concerns are resolved by rebase — the PR now contains only the #628 commit. PM (@freemo) has ruled the PR mergeable.
Process Checks (all pass)
ISSUES CLOSED: #628footer present ✓# type: ignorein new source ✓features/mocks/only ✓c7bdb62b735e4453d0a4PM Review — Robot Framework TDD Tag Handler (#628)
PR #673: feat(testing): implement @tdd_expected_fail tag handling in Robot Framework
Review Summary
This PR implements the Robot Framework counterpart of the Behave TDD expected-fail system. It has been through 5 rounds of review with all 15 findings from @hamza.khyari addressed. The PR has been rebased to a single commit for #628 only (resolving the C-1 scope bundling concern). All quality gates pass. Approving.
TDD Infrastructure Verification
tdd_expected_failtagstdd_bug_<N>requirestdd_bug,tdd_expected_failrequires bothfeatures/mocks/tdd_test_helpers.py--listenerflagCritical Path Impact
This PR unblocks 5 Critical bug TDD pipelines (#838-#842) that depend on Robot Framework TDD tag support. Its merge is a prerequisite for the following TDD issues:
Previous Scope Concern Resolution
The C-1 scope bundling concern from @hamza.khyari has been resolved — @hurui200320 rebased to include only the single commit for issue #628. The remaining out-of-scope findings were correctly deferred.
Verdict: APPROVED. This is a critical-path merge. Recommend @hamza.khyari convert his previous REQUEST_CHANGES to APPROVED given that all 15 findings have been addressed and the scope concern is resolved. Merge target: Day 37 (2026-03-17).
Code Review — PR #673
feat(testing): implement @tdd_expected_fail tag handling in Robot FrameworkReviewer: @brent.edwards | Size: L (1,002 lines, 15 files) | Focus: Listener correctness, test quality, process compliance
Verification of Previous Reviews
All prior review findings have been addressed:
# type: ignoreviolationCallable[[], int]typingelsecatches SKIP incorrectlyelif PASS+else SKIPNo P0 or P1 findings.
The listener module is well-designed:
end_test— invalid tag combinations always produce FAIL, regardless of test outcomedata.tagsoverresult.tags— correctly uses static test definition tags, not runtime-mutable tags (intentional, documented)fullmatch()on regex — correctly distinguishestdd_bugfromtdd_bug_\<N\>without false matches_processed_testsand_validation_errorsare per-process (pabot uses subprocesses), no concurrency riskclose()cleanup — RF Listener v3 API callsclose()on execution finish# type: ignore— clean type safety# noqain listener — helper has 3 justified# noqa: E402for post-sys.pathimportsP2:should-fix (3)
1.
cmd_skip_status_unchangedandcmd_tdd_bug_alone_validcan false-positive without listener loadedBoth tests verify non-modification of results (SKIP stays SKIP, PASS stays PASS) but run the fixture alone without a positive control. Unlike
cmd_normal_test_unaffected(which uses_run_multi_fixturewith a companiontdd_expected_fail_failsfixture to PROVE the listener was active and selectively applied), these two commands would produce the same result if the--listenerflag were missing entirely.Fix: Use
_run_multi_fixturewithtdd_expected_fail_failsas a co-run fixture (same pattern ascmd_normal_test_unaffected), asserting the companion was inverted.2.
subprocess.runcalls missingtimeoutBoth
_run_fixtureand_run_multi_fixturecallsubprocess.run(cmd, capture_output=True, text=True)withouttimeout. If a fixture hangs, the subprocess blocks forever, which blocks the parent Robot test, which blocks CI with no diagnostic. All other Robot helpers in the repo usetimeout=120or similar.Fix: Add
timeout=120and handlesubprocess.TimeoutExpired.3.
helper_tdd_tag_validation.pyat 629 lines exceeds the 500-line limitAlready flagged in Rui's self-review R5. The natural split point is around line 270 (Behave-side functions vs Robot-side fixture runners). Extracting
_run_fixture/_run_multi_fixture/_extract_messageinto a_tdd_fixture_runner.pyutility would solve this and eliminate the duplicated XML extraction (R5 finding #6).P3:nit (1)
4.
features/mocks/__init__.pydoesn't re-exportmake_mock_scenarioEvery other mock in the package is re-exported from
__init__.py.tdd_test_helpers.make_mock_scenariois imported by full module path, so this works, but breaks the package convention. Minor inconsistency.Summary
Verdict: APPROVED — The listener implementation is correct, well-tested (8 fixtures, 21 integration tests), and all prior review findings have been thoroughly addressed. The 3 P2 findings are genuine quality improvements but do not block merge. This is critical-path TDD infrastructure that unblocks 5+ bug TDD pipelines.
The code demonstrates strong engineering: clean validation/inversion separation, proper RF Listener v3 API usage, comprehensive edge-case coverage (SKIP, double-invocation, tdd_bug-alone, tdd_expected_fail-alone), and an extracted shared mock helper. Well done @hurui200320.
5e4453d0a4a5de448856New commits pushed, approval review dismissed automatically according to repository settings
New commits pushed, approval review dismissed automatically according to repository settings
Review Fixes Applied —
a5de448Addressed all 3 P2 and 1 P3 findings from @brent.edwards's review:
cmd_skip_status_unchangedandcmd_tdd_bug_alone_validcan false-positive without listener loadedrun_multi_fixturewithtdd_expected_fail_failsas positive control — same pattern ascmd_normal_test_unaffectedsubprocess.runmissingtimeouttimeout=120to bothrun_fixtureandrun_multi_fixturewithsubprocess.TimeoutExpiredhandlinghelper_tdd_tag_validation.pyat 629 lines (limit 500)run_fixture,run_multi_fixture, and_extract_messageto newrobot/_tdd_fixture_runner.py(186 lines). Helper now 499 lines.make_mock_scenarionot re-exported fromfeatures/mocks/__init__.py__all__entryLine counts after fix
robot/helper_tdd_tag_validation.py: 499 (was 629)robot/_tdd_fixture_runner.py: 186 (new)Quality gates