test: consolidated Brent QA batch — issues #179, #180, #404, #405, #187 #442

2026-02-25T22:14:58Z

brent.edwards commented

2026-02-25 22:14:58 +00:00

Summary

Consolidated QA test batch covering M3 and M4 milestones.

Issues Included

Issue	Title	Branch	Individual PR
#179	test(e2e): add M3 decision + validation suites	`feature/m3-decision-validation-smoke`	#437
#180	test(persistence): add decision persistence suites	`feature/m4-decision-tests`	#438
#404	test(e2e): verify M3 success criteria	`test/m3-e2e-verification`	#439
#405	test(e2e): verify M4 success criteria	`test/m4-e2e-verification`	#440
#187	test(e2e): add M4 correction + subplan suites	`feature/m4-correction-subplan-smoke`	#441

New Test Artifacts

Behave Scenarios: 62 new scenarios across 3 feature files

m3_decision_validation_smoke.feature (21 scenarios)
decision_persistence.feature (21 scenarios)
m4_correction_subplan_smoke.feature (20 scenarios)

Robot Framework Integration Tests: 33 new test cases across 5 suites

m3_decision_validation_smoke.robot (8 cases)
m3_e2e_verification.robot (10 cases)
m4_e2e_verification.robot (7 cases)
decision_persistence.robot (7 cases)
m4_correction_subplan_smoke.robot (8 cases)

ASV Benchmarks: 3 new benchmark files

m3_smoke_bench.py (5 suites)
decision_persistence_bench.py (4 suites)
m4_smoke_bench.py (4 suites)

Bug Fixes Applied

Fixed CorrectionService patch target in M3/M4 helpers (lazy import in plan.py)
Fixed invalid ULID strings in M4 smoke helper (Pydantic pattern validation)
Fixed dry-run assertion that checked for text absent from the plain-format output
Added missing --guidance flag to correction dry-run CLI invocation

Closes #179 #180 #404 #405 #187


brent.edwards added the Type/Testing label 2026-02-25 22:14:58 +00:00

brent.edwards added 12 commits 2026-02-25 22:14:58 +00:00

test(e2e): verify M3 success criteria — decision tree and correction

CI / lint (pull_request) Successful in 30s
CI / benchmark-publish (pull_request) Has been skipped
CI / quality (pull_request) Successful in 32s
CI / security (pull_request) Successful in 49s
CI / build (pull_request) Successful in 29s
CI / typecheck (pull_request) Successful in 1m13s
CI / integration_tests (pull_request) Failing after 3m56s
CI / benchmark-regression (pull_request) Successful in 20m57s
CI / unit_tests (pull_request) Has been cancelled
CI / coverage (pull_request) Has been cancelled
CI / docker (pull_request) Has been cancelled

aeb5cc5110

Robot Framework E2E test suite for M3 milestone verification covering:
- Plan execution generating decisions during Strategize phase
- Decision tree viewing with parent-child relationships and BFS traversal
- Decision explanation with full context snapshot verification
- Invariant add/list via CLI and InvariantService with scope filtering
- Dry-run correction via CorrectionService with impact analysis
- Live revert correction execution with decision re-creation
- Context snapshot round-trip serialisation assertions
- Decision tree persistence via model_dump/model_validate
- Correction revert re-execution from decision point
- Invariant enforcement during strategize with merge precedence

ISSUES CLOSED: #404

test(persistence): add decision persistence suites

CI / lint (pull_request) Successful in 21s
CI / typecheck (pull_request) Successful in 40s
CI / security (pull_request) Successful in 41s
CI / benchmark-publish (pull_request) Has been skipped
CI / quality (pull_request) Successful in 37s
CI / build (pull_request) Successful in 32s
CI / integration_tests (pull_request) Successful in 4m46s
CI / unit_tests (pull_request) Successful in 10m36s
CI / docker (pull_request) Successful in 1m2s
CI / benchmark-regression (pull_request) Successful in 22m26s
CI / coverage (pull_request) Successful in 1h55m50s

f4c4884d03

Add comprehensive persistence test suites for the Decision domain model
covering serialization round-trips (model_dump/model_validate, JSON),
context-snapshot persistence, correction-chain reconstruction, and
decision-tree reconstruction from serialized data.
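
As a rough illustration of the round-trip and tree-reconstruction checks described above, the following sketch uses only the standard library. It is not the project's Decision model: `DecisionRecord`, its fields, and the sample data are hypothetical stand-ins, and plain `dataclasses` plus `json` are swapped in for Pydantic's `model_dump`/`model_validate`.

```python
import json
from collections import deque
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical stand-in for the real Decision model."""
    decision_id: str
    parent_id: Optional[str]
    summary: str


def round_trip(records):
    """Serialize records to JSON text and rebuild them from that text."""
    payload = json.dumps([asdict(r) for r in records])
    return [DecisionRecord(**item) for item in json.loads(payload)]


def bfs_order(records):
    """Rebuild parent -> children links and walk the tree breadth-first."""
    children = {}
    roots = []
    for rec in records:
        if rec.parent_id is None:
            roots.append(rec.decision_id)
        else:
            children.setdefault(rec.parent_id, []).append(rec.decision_id)
    order = []
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    return order


original = [
    DecisionRecord("root", None, "choose strategy"),
    DecisionRecord("a", "root", "pick tool"),
    DecisionRecord("b", "root", "pick model"),
    DecisionRecord("a1", "a", "set timeout"),
]
restored = round_trip(original)
```

A persistence test in this style would assert that `restored == original` and that the traversal order over the restored records matches the order over the originals.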

Deliverables:
- Behave: 21 scenarios in features/decision_persistence.feature
- Robot Framework: 7 integration tests in robot/decision_persistence.robot
- ASV benchmarks: 4 suites (17 benchmarks) in benchmarks/decision_persistence_bench.py
- Documentation: Updated docs/development/testing.md

ISSUES CLOSED: #180

test(e2e): add M3 decision + validation suites

CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 19s
CI / build (pull_request) Successful in 20s
CI / security (pull_request) Successful in 55s
CI / typecheck (pull_request) Successful in 1m5s
CI / integration_tests (pull_request) Successful in 4m4s
CI / benchmark-regression (pull_request) Successful in 22m21s
CI / unit_tests (pull_request) Failing after 34m7s
CI / docker (pull_request) Has been skipped
CI / coverage (pull_request) Has been cancelled

ace7311de4

test(e2e): verify M4 success criteria — subplans and parallel execution

CI / lint (pull_request) Successful in 23s
CI / quality (pull_request) Successful in 30s
CI / benchmark-publish (pull_request) Has been skipped
CI / security (pull_request) Successful in 44s
CI / typecheck (pull_request) Successful in 44s
CI / build (pull_request) Successful in 24s
CI / integration_tests (pull_request) Successful in 5m18s
CI / unit_tests (pull_request) Successful in 21m31s
CI / benchmark-regression (pull_request) Successful in 26m33s
CI / docker (pull_request) Successful in 1m1s
CI / coverage (pull_request) Failing after 1h29m10s

cfc319ad27

test(e2e): add M4 correction + subplan suites 917c2bc546

fix(test): correct patch target for CorrectionService in M3 e2e helper

CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 20s
CI / quality (pull_request) Successful in 20s
CI / build (pull_request) Successful in 26s
CI / typecheck (pull_request) Successful in 57s
CI / security (pull_request) Successful in 1m2s
CI / integration_tests (pull_request) Successful in 4m16s
CI / unit_tests (pull_request) Successful in 23m40s
CI / benchmark-regression (pull_request) Successful in 25m34s
CI / docker (pull_request) Successful in 15s
CI / coverage (pull_request) Successful in 1h40m2s

2e53caaef1

The CorrectionService import in plan.py is a lazy import inside the
correct() function body, so it does not exist as a module-level attribute.
Patch the class at its definition site instead:
  cleveragents.application.services.correction_service.CorrectionService

Fixes CI integration_tests failure for PR #439.
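
The reasoning in this commit message can be reproduced in isolation. The sketch below is an assumption-laden miniature, not the project's code: `demo_services` and `demo_plan` are hypothetical in-memory modules standing in for the real service module and `plan.py`. It shows that `unittest.mock.patch` fails on a name that only exists inside a function body, and succeeds when aimed at the module where the class is defined.

```python
import sys
import types
from unittest import mock

# Hypothetical stand-in for the module that defines the service class.
services = types.ModuleType("demo_services")


class CorrectionService:
    def analyze(self):
        return "real"


services.CorrectionService = CorrectionService
sys.modules["demo_services"] = services

# Hypothetical stand-in for plan.py, which imports the class lazily.
plan = types.ModuleType("demo_plan")


def correct():
    # Lazy import: the name is bound only inside this call,
    # so demo_plan never gains a CorrectionService attribute.
    from demo_services import CorrectionService
    return CorrectionService().analyze()


plan.correct = correct
sys.modules["demo_plan"] = plan

# Patching the use site fails because the attribute does not exist there.
try:
    with mock.patch("demo_plan.CorrectionService"):
        pass
except AttributeError:
    patched_at_use_site = False
else:
    patched_at_use_site = True

# Patching the definition site works: the lazy import picks up the mock.
with mock.patch("demo_services.CorrectionService") as fake_cls:
    fake_cls.return_value.analyze.return_value = "mocked"
    result = plan.correct()
```

The lazy import resolves the name at call time, so whatever object sits on the defining module when `correct()` runs is the one that gets used.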

fix(test): correct patch targets and ULID validation in M4 smoke helper

CI / lint (pull_request) Successful in 23s
CI / typecheck (pull_request) Successful in 59s
CI / security (pull_request) Successful in 51s
CI / quality (pull_request) Successful in 35s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 30s
CI / integration_tests (pull_request) Successful in 5m11s
CI / unit_tests (pull_request) Failing after 25m35s
CI / docker (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Successful in 26m22s
CI / coverage (pull_request) Failing after 1h27m30s

55df1915f1

- Patch CorrectionService at its definition site instead of the lazy
  import location in plan.py (same fix as M3 helper)
- Replace hardcoded fake subplan/correction IDs with real ULIDs
  generated via ulid.ULID() to satisfy Pydantic pattern validation
- Add missing --guidance flag to correction dry-run CLI invocation
- Fix subplan-status-sequential assertion to check for plan_id
  (subplan_count is not part of the status JSON output)

fix(test): correct dry-run assertion in M3 decision validation smoke

CI / lint (pull_request) Successful in 24s
CI / typecheck (pull_request) Successful in 58s
CI / security (pull_request) Successful in 56s
CI / quality (pull_request) Successful in 42s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 25s
CI / integration_tests (pull_request) Successful in 5m7s
CI / unit_tests (pull_request) Successful in 25m18s
CI / benchmark-regression (pull_request) Successful in 26m14s
CI / docker (pull_request) Successful in 16s
CI / coverage (pull_request) Successful in 1h40m20s

22986aaaf9

The CLI plain format output does not include the literal text "Dry Run"
(that string only appears in the Rich panel title). Assert against
"risk_level" which is present in the dry-run impact output instead.

Merge branch 'feature/m4-decision-tests' into develop-brent-5 5a212481e1

Merge branch 'test/m3-e2e-verification' into develop-brent-5 49f097472a

Merge branch 'test/m4-e2e-verification' into develop-brent-5 a2b0611722

Merge feature/m4-correction-subplan-smoke into develop-brent-5

CI / lint (pull_request) Successful in 23s
CI / typecheck (pull_request) Successful in 1m7s
CI / security (pull_request) Successful in 1m0s
CI / quality (pull_request) Successful in 28s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 25s
CI / integration_tests (pull_request) Successful in 5m50s
CI / unit_tests (pull_request) Failing after 32m24s
CI / docker (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Successful in 25m44s
CI / coverage (pull_request) Failing after 1h28m6s

c80cda2590

brent.edwards referenced this pull request

2026-02-25 22:15:07 +00:00

test(e2e): add M3 decision + validation suites #437

brent.edwards referenced this pull request

2026-02-25 22:15:07 +00:00

test(persistence): add decision persistence suites #438

brent.edwards referenced this pull request

2026-02-25 22:15:08 +00:00

test(e2e): verify M3 success criteria — decision tree and correction #439

brent.edwards referenced this pull request

2026-02-25 22:15:08 +00:00

test(e2e): verify M4 success criteria — subplans and parallel execution #440

brent.edwards referenced this pull request

2026-02-25 22:15:08 +00:00

test(e2e): add M4 correction + subplan suites #441

brent.edwards added 10 commits 2026-02-26 02:15:42 +00:00

fix(test): harden retry_patterns feature against flaky CI failures

CI / lint (pull_request) Successful in 16s
CI / typecheck (pull_request) Successful in 37s
CI / quality (pull_request) Successful in 24s
CI / security (pull_request) Successful in 51s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 19s
CI / integration_tests (pull_request) Successful in 4m37s
CI / unit_tests (pull_request) Successful in 12m10s
CI / docker (pull_request) Successful in 16s
CI / benchmark-regression (pull_request) Successful in 27m0s
CI / coverage (pull_request) Successful in 1h51m34s

ef58883f7a

The jitter-spread scenario used millisecond-bucketed wall-clock
timestamps to assert that retried operations did not cluster.  On busy
CI runners — especially when a preceding scenario's time.sleep patch
leaked through a late cleanup — tenacity's retry waits became no-ops
and all five operations landed in the same millisecond bucket, tripping
the max_in_bucket <= 3 assertion.

Three hardening changes:

1. Jitter test: switch from time.time() to time.monotonic_ns() and
   replace the fragile bucket assertion with a unique-timestamp count
   (>= 2 distinct readings among 5 sequential operations).
2. Timeout test: eagerly restore time.sleep in a try/finally block so
   subsequent scenarios never observe the patched no-op, regardless of
   behave's cleanup ordering.
3. Async circuit breaker: lower recovery_timeout from 0.2 s to 0.1 s
   (the sleep step already waits 0.2 s) to give a wider safety margin
   on slow CI machines.
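
The first two hardening changes can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the suite's actual step code: the helper names `count_distinct_readings` and `run_with_patched_sleep` are hypothetical, and the 1 ms pause merely stands in for the retry waits.

```python
import time


def count_distinct_readings(n=5):
    """Take n sequential monotonic readings and count the unique ones."""
    readings = []
    for _ in range(n):
        readings.append(time.monotonic_ns())
        time.sleep(0.001)  # stand-in for a retry wait
    return len(set(readings))


def run_with_patched_sleep(action):
    """Run action with time.sleep replaced by a no-op, restoring it eagerly."""
    original_sleep = time.sleep
    time.sleep = lambda _seconds: None
    try:
        return action()
    finally:
        # Restore before returning or propagating, so later code
        # never observes the patched no-op.
        time.sleep = original_sleep


def failing_action():
    raise TimeoutError("simulated timeout")


sleep_before = time.sleep
try:
    run_with_patched_sleep(failing_action)
except TimeoutError:
    pass
sleep_after = time.sleep
distinct = count_distinct_readings()
```

Counting distinct monotonic readings avoids any dependence on how operations fall into wall-clock buckets, and the `finally` clause restores `time.sleep` even when the action raises.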

fix(test): address review findings in M3 smoke tests

CI / lint (pull_request) Successful in 24s
CI / typecheck (pull_request) Successful in 55s
CI / security (pull_request) Successful in 47s
CI / quality (pull_request) Successful in 28s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 23s
CI / integration_tests (pull_request) Successful in 5m7s
CI / unit_tests (pull_request) Successful in 20m55s
CI / benchmark-regression (pull_request) Successful in 25m40s
CI / docker (pull_request) Successful in 14s
CI / coverage (pull_request) Successful in 1h48m41s

93730067a2

- Replace dead after_scenario() hook with context.add_cleanup() calls
  so patchers are properly stopped (Behave only runs hooks from
  environment.py)
- Fix invalid ULID constants: pad DECISION/CORRECTION to 26 chars,
  replace invalid 'I' with 'J' in INVARIANT_ULID
- Add temp file cleanup for NamedTemporaryFile(delete=False) via
  context.add_cleanup() in behave steps and try/finally in robot helper
- Remove decision-persistence doc sections from testing.md that belong
  in PR #438, not this PR
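
For context on the ULID fix: a canonical ULID is 26 characters of Crockford Base32, an alphabet that omits I, L, O, and U, and its first character cannot exceed 7 because the value is 128 bits. The check below is a sketch of a strict validator under those rules. The project's actual Pydantic pattern is not shown in this PR, so `ULID_PATTERN` and `is_valid_ulid` are illustrative assumptions.

```python
import re

# Strict canonical form: 26 chars, Crockford Base32, first char 0-7.
# Assumed pattern for illustration; the project's own pattern may differ.
ULID_PATTERN = re.compile(r"[0-7][0-9A-HJKMNP-TV-Z]{25}")


def is_valid_ulid(value: str) -> bool:
    """Return True if value is a canonical upper-case ULID string."""
    return ULID_PATTERN.fullmatch(value) is not None


good = "01ARZ3NDEKTSV4RRFFQ69G5FAV"       # 26 chars, valid alphabet
with_i = "01ARZ3NDEKTSV4RRFFQ69G5FAI"     # contains the excluded letter I
too_short = "01ARZ3NDEKTSV4RRFFQ69G5FA"   # 25 chars
too_long = "01ARZ3NDEKTSV4RRFFQ69G5FAVX"  # 27 chars

results = {
    "good": is_valid_ulid(good),
    "with_i": is_valid_ulid(with_i),
    "too_short": is_valid_ulid(too_short),
    "too_long": is_valid_ulid(too_long),
}
```

Both kinds of error mentioned in these commits, wrong length and an excluded letter, are rejected by such a pattern.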

fix(test): derive decision type count from enum instead of hard-coding 11

CI / lint (pull_request) Successful in 21s
CI / typecheck (pull_request) Successful in 55s
CI / security (pull_request) Successful in 51s
CI / quality (pull_request) Successful in 32s
CI / integration_tests (pull_request) Successful in 5m12s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 24s
CI / unit_tests (pull_request) Successful in 33m37s
CI / benchmark-regression (pull_request) Successful in 26m3s
CI / docker (pull_request) Successful in 1m1s
CI / coverage (pull_request) Has been cancelled

b99510d016

Replace hard-coded '11 decision types' with dynamic len(DecisionType) in
the assertion and remove the literal from scenario names, step patterns,
docstrings, and docs so the tests stay correct when new DecisionType
values are added.
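
A small sketch of the idea, assuming a hypothetical three-member enum (the real `DecisionType` members are not listed in this PR): the expected count is derived from the enum itself, so adding a member does not invalidate the assertion.

```python
from enum import Enum


class DecisionType(Enum):
    """Hypothetical members for illustration only."""
    STRATEGY = "strategy"
    TOOL_SELECTION = "tool_selection"
    RETRY = "retry"


def exercised_types():
    """Stand-in for the code under test: returns every type it handled."""
    return {member for member in DecisionType}


seen = exercised_types()
expected_count = len(DecisionType)  # derived, never hard-coded
all_covered = len(seen) == expected_count
```

`len()` on an `Enum` class counts its distinct members (aliases excluded), which is the quantity such a coverage assertion needs.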

fix(test): route M3 E2E subcommands through CLI rendering path

CI / lint (pull_request) Successful in 24s
CI / typecheck (pull_request) Successful in 1m2s
CI / security (pull_request) Successful in 1m1s
CI / quality (pull_request) Successful in 42s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 25s
CI / integration_tests (pull_request) Successful in 5m18s
CI / unit_tests (pull_request) Successful in 30m5s
CI / docker (pull_request) Successful in 15s
CI / benchmark-regression (pull_request) Successful in 26m25s
CI / coverage (pull_request) Has been cancelled

4f079c20ea

- decision-tree-view, decision-explain, and decision-tree-persistence
  now invoke 'plan status --format plain' via mocked lifecycle service
  so regressions in CLI rendering/serialization are caught
- plan-generates-decisions now asserts use_action was called by the CLI
  and verifies plan status renders the strategize phase after creation
- Updated robot test case documentation to reflect CLI integration

fix(test): correct mock targets, method names, ULIDs, and CLI args in M4 smoke suite

CI / lint (pull_request) Successful in 22s
CI / typecheck (pull_request) Successful in 55s
CI / security (pull_request) Successful in 48s
CI / quality (pull_request) Successful in 28s
CI / integration_tests (pull_request) Successful in 5m6s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 23s
CI / unit_tests (pull_request) Successful in 34m47s
CI / benchmark-regression (pull_request) Successful in 25m45s
CI / docker (pull_request) Successful in 1m15s
CI / coverage (pull_request) Has been cancelled

3c9a3efdf1

- Patch CorrectionService at its module path, not the local-import site
- Use request_correction/execute_correction/analyze_impact (real API)
- Replace invalid 27-char subplan_id values with valid 26-char ULIDs
- Add required --guidance flag to dry-run CLI invocation
- Fix feature file assertions to match actual CLI output

Resolves CI failures in unit_tests (job 4) and coverage (job 6) for
run 645 on PR #441.

Merge feature/m3-decision-validation-smoke into develop-brent-5 b3f126b147

Merge feature/m4-decision-tests into develop-brent-5 3d3c8fda33

Merge test/m3-e2e-verification into develop-brent-5 2e2ae3b9b8

Merge test/m4-e2e-verification into develop-brent-5 a00c33c4c6

Merge feature/m4-correction-subplan-smoke into develop-brent-5

CI / lint (pull_request) Successful in 22s
CI / typecheck (pull_request) Successful in 1m2s
CI / quality (pull_request) Successful in 29s
CI / security (pull_request) Successful in 49s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 22s
CI / integration_tests (pull_request) Successful in 4m55s
CI / unit_tests (pull_request) Successful in 30m41s
CI / benchmark-regression (pull_request) Successful in 26m32s
CI / docker (pull_request) Successful in 15s
CI / coverage (pull_request) Has been cancelled

cf59f285fd

brent.edwards added 1 commit 2026-02-26 03:44:20 +00:00

Merge branch 'master' into develop-brent-5

CI / lint (pull_request) Successful in 23s
CI / typecheck (pull_request) Successful in 56s
CI / security (pull_request) Successful in 57s
CI / quality (pull_request) Successful in 32s
CI / integration_tests (pull_request) Successful in 5m30s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 27s
CI / unit_tests (pull_request) Successful in 32m48s
CI / docker (pull_request) Successful in 14s
CI / benchmark-regression (pull_request) Successful in 24m15s
CI / coverage (pull_request) Successful in 1h8m22s

c53b880a5e

freemo approved these changes 2026-02-26 15:31:18 +00:00

Dismissed

brent.edwards added 2 commits 2026-02-26 16:14:40 +00:00

refactor(test): rename decision persistence files to avoid conflict with master 32b81793a2

Rename our serialization-focused decision persistence suites to
*_serialization to avoid add/add conflicts with the repository-based
decision persistence suites that landed on master independently:

- decision_persistence.feature -> decision_persistence_serialization.feature
- decision_persistence_steps.py -> decision_persistence_serialization_steps.py
- decision_persistence_bench.py -> decision_persistence_serialization_bench.py
- decision_persistence.robot -> decision_persistence_serialization.robot
- helper_decision_persistence.py -> helper_decision_persistence_serialization.py

Updated robot helper path, step docstring, and testing.md references.

Merge branch 'master' into develop-brent-5

CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 20s
CI / quality (pull_request) Successful in 29s
CI / typecheck (pull_request) Successful in 36s
CI / security (pull_request) Successful in 54s