test(e2e): validate M3 acceptance criteria for v3.2.0 milestone closure #559

Merged
hurui200320 merged 1 commit from test/m3-acceptance-gate into master 2026-03-05 15:09:20 +00:00
Member

Summary

Validates all 10 M3 (v3.2.0) acceptance criteria via the E2E verification suite. This is the final gate before closing milestone v3.2.0.

Single Commit (Squashed per CONTRIBUTING.md)

Branch history was squashed from 2 commits into 1 (e1665c6e) to comply with the one-commit-per-issue rule. The single commit includes:

  1. Original work: Suite-level Force Tags, per-test acceptance criteria tags, enhanced documentation with milestone cross-references, CHANGELOG and CONTRIBUTORS entries.
  2. Review feedback fixes (addressing Luis's CRITICAL/HIGH findings): Replaced fixture-self-assertion patterns with real CLI-path validation for all 10 acceptance criteria, added persistence-backed decision tree checks, robust JSON output parsing, explicit Robot process timeouts, and clearer correction tags.
  3. Rebase onto current master (fff6bb38): Clean rebase resolving CHANGELOG conflict, as requested by PM (freemo).
  4. Bug fix — async_enabled attribute: PlanLifecycleService.execute_plan() accesses self.settings.async_enabled (plan_lifecycle_service.py:283). The test helper's _make_settings() used create_autospec(Settings) but didn't set this attribute, causing plan execute CLI tests to fail with AttributeError. Fixed by adding settings.async_enabled = False.
  5. Timeout fixes for parallel execution: When pabot runs 16 parallel Robot processes, cold-start import times + Alembic migrations cause 30s timeouts to be insufficient. Increased timeouts from 30s to 120s in 3 robot files: actor_context_management.robot, decision_di_wiring_smoke.robot, changeset_persistence.robot.

Files Changed (7)

File Change
robot/helper_m3_e2e_verification.py Rewritten for real CLI-path validation; _make_settings() fixed with async_enabled = False
robot/m3_e2e_verification.robot Force Tags, per-test tags, timeouts, enhanced docs
robot/actor_context_management.robot Timeout 30s → 120s (line 92)
robot/decision_di_wiring_smoke.robot Timeout 30s → 120s (lines 14, 21)
robot/changeset_persistence.robot ${TIMEOUT} 30s → 120s (line 12)
CHANGELOG.md Entry for #494
CONTRIBUTORS.md Added Rui Hu

Acceptance Criteria Verified (10/10)

Success Criteria:

  • plan use + plan execute generates decisions during Strategize
  • plan tree displays decision tree correctly (via CLI invocation)
  • plan explain shows full decision context (via CLI invocation)
  • invariant add --project / invariant list --project work with project scope
  • plan correct --dry-run performs impact analysis (via CLI invocation)
  • plan correct --mode=revert executes live correction (via CLI invocation)

Technical Criteria:

  • Decisions recorded with full context snapshot
  • Decision tree persists to database and renders correctly
  • Correction in revert mode re-executes from decision point
  • Invariants enforced during strategize

Quality Gates (Full Nox Suite)

Session Result Details
lint PASS ruff clean
format PASS ruff format clean
typecheck PASS pyright 0 errors, 0 warnings
security_scan PASS bandit clean
dead_code PASS vulture clean
unit_tests PASS 8500 BDD scenarios, 0 failures
integration_tests PASS 1194/1194 tests, 0 failures
docs PASS mkdocs build clean
build PASS wheel built
benchmark PASS performance benchmarks pass
coverage_report PASS 96.96% → rounds to 97% (meets --fail-under=97)

Closes #494

## Summary Validates all 10 M3 (v3.2.0) acceptance criteria via the E2E verification suite. This is the final gate before closing milestone v3.2.0. ### Single Commit (Squashed per CONTRIBUTING.md) Branch history was squashed from 2 commits into 1 (`e1665c6e`) to comply with the one-commit-per-issue rule. The single commit includes: 1. **Original work**: Suite-level Force Tags, per-test acceptance criteria tags, enhanced documentation with milestone cross-references, CHANGELOG and CONTRIBUTORS entries. 2. **Review feedback fixes** (addressing Luis's CRITICAL/HIGH findings): Replaced fixture-self-assertion patterns with real CLI-path validation for all 10 acceptance criteria, added persistence-backed decision tree checks, robust JSON output parsing, explicit Robot process timeouts, and clearer correction tags. 3. **Rebase onto current master** (`fff6bb38`): Clean rebase resolving CHANGELOG conflict, as requested by PM (freemo). 4. **Bug fix — `async_enabled` attribute**: `PlanLifecycleService.execute_plan()` accesses `self.settings.async_enabled` (`plan_lifecycle_service.py:283`). The test helper's `_make_settings()` used `create_autospec(Settings)` but didn't set this attribute, causing `plan execute` CLI tests to fail with `AttributeError`. Fixed by adding `settings.async_enabled = False`. 5. **Timeout fixes for parallel execution**: When `pabot` runs 16 parallel Robot processes, cold-start import times + Alembic migrations cause 30s timeouts to be insufficient. Increased timeouts from 30s to 120s in 3 robot files: `actor_context_management.robot`, `decision_di_wiring_smoke.robot`, `changeset_persistence.robot`. ### Files Changed (7) | File | Change | |------|--------| | `robot/helper_m3_e2e_verification.py` | Rewritten for real CLI-path validation; `_make_settings()` fixed with `async_enabled = False` | | `robot/m3_e2e_verification.robot` | Force Tags, per-test tags, timeouts, enhanced docs | | `robot/actor_context_management.robot` | Timeout 30s → 120s (line 92) | | `robot/decision_di_wiring_smoke.robot` | Timeout 30s → 120s (lines 14, 21) | | `robot/changeset_persistence.robot` | `${TIMEOUT}` 30s → 120s (line 12) | | `CHANGELOG.md` | Entry for #494 | | `CONTRIBUTORS.md` | Added Rui Hu | ### Acceptance Criteria Verified (10/10) **Success Criteria:** - `plan use` + `plan execute` generates decisions during Strategize - `plan tree` displays decision tree correctly (via CLI invocation) - `plan explain` shows full decision context (via CLI invocation) - `invariant add --project` / `invariant list --project` work with project scope - `plan correct --dry-run` performs impact analysis (via CLI invocation) - `plan correct --mode=revert` executes live correction (via CLI invocation) **Technical Criteria:** - Decisions recorded with full context snapshot - Decision tree persists to database and renders correctly - Correction in revert mode re-executes from decision point - Invariants enforced during strategize ### Quality Gates (Full Nox Suite) | Session | Result | Details | |---------|--------|---------| | `lint` | PASS | ruff clean | | `format` | PASS | ruff format clean | | `typecheck` | PASS | pyright 0 errors, 0 warnings | | `security_scan` | PASS | bandit clean | | `dead_code` | PASS | vulture clean | | `unit_tests` | PASS | 8500 BDD scenarios, 0 failures | | `integration_tests` | PASS | 1194/1194 tests, 0 failures | | `docs` | PASS | mkdocs build clean | | `build` | PASS | wheel built | | `benchmark` | PASS | performance benchmarks pass | | `coverage_report` | PASS | 96.96% → rounds to 97% (meets `--fail-under=97`) | Closes #494
hurui200320 added this to the v3.2.0 milestone 2026-03-04 05:26:18 +00:00
Member

Code Review Report: Commit d5e73dba by Rui Hu

Branch: test/m3-acceptance-gate
Issue: #494 -- test(e2e): validate M3 acceptance criteria for v3.2.0 milestone closure
Files changed: CHANGELOG.md, CONTRIBUTORS.md, robot/m3_e2e_verification.robot (3 files, +49/-1)


Summary

The commit adds documentation tags, milestone cross-references, and [Tags] annotations to the existing m3_e2e_verification.robot suite. It also appends a CHANGELOG entry and adds Rui Hu to CONTRIBUTORS. The commit does not modify the underlying test logic (helper_m3_e2e_verification.py is unchanged). The changes are primarily organizational/metadata-level. The deeper issues below stem from the pre-existing test suite that this commit validates and declares as passing.


1. CRITICAL -- Acceptance Criteria Gaps (Issue #494 vs. Actual Tests)

Several acceptance criteria from the milestone description and issue #494 are not faithfully covered by the tests this commit marks as passing:

F1. plan tree CLI never invoked

  • Criterion: "agents plan tree <plan_id> displays the decision tree correctly"
  • Actual test (decision_tree_view, helper_m3_e2e_verification.py:316): Invokes plan status via CLI, then performs domain-level tree traversal on locally-constructed objects. The plan tree CLI subcommand (defined at src/cleveragents/cli/commands/plan.py:2918) is never invoked.
  • Impact: The plan tree rendering path -- including DecisionService fetch, build_decision_tree() nesting logic, and Rich Tree output -- is completely untested.

F2. plan explain CLI never invoked

  • Criterion: "agents plan explain <decision_id> shows full decision context"
  • Actual test (decision_explain, helper_m3_e2e_verification.py:397): Invokes plan status via CLI, then checks context snapshot fields on a locally-constructed Decision. The plan explain CLI subcommand (defined at plan.py:2780) is never invoked.
  • Impact: The explain rendering path -- DecisionService.get_decision(), _build_explain_dict(), --show-context/--show-reasoning flags -- is completely untested.

F3. plan execute never invoked

  • Criterion: "agents plan execute <plan_id> generate decisions during Strategize"
  • Actual test (plan_generates_decisions, helper_m3_e2e_verification.py:254): Invokes plan use with a mocked lifecycle service, then validates a locally-built decision tree. plan execute is never called.
  • Impact: The Execute-phase transition and decision generation during execution are untested.

F4. Invariant CLI uses wrong scope

  • Criterion: "agents invariant add --project local/large-project 'Use session cookies'"
  • Actual test (invariant_add_and_list, helper_m3_e2e_verification.py:502): Invokes invariant add --global "CLI test invariant" instead of --project.
  • Criterion: "agents invariant list --project local/large-project"
  • Actual test (helper_m3_e2e_verification.py:518): Invokes invariant list without --project filter.
  • Impact: The project-scoped invariant CLI path is untested. The service-level test uses InvariantScope.PROJECT correctly, but the CLI integration test exercises only the --global flag.

F5. Live revert correction missing CLI test

  • Criterion: "agents plan correct <decision_id> --mode=revert --guidance '...' executes live correction"
  • Actual test (correction_live_revert, helper_m3_e2e_verification.py:635): Tests only CorrectionService directly. Unlike the dry-run test which includes a CLI invocation, the live revert has no CLI path test.

2. HIGH -- Test Validity / Flaw Analysis

F6. Tests assert on self-constructed fixtures, not production outputs

Multiple tests build domain objects (via _build_decision_tree()) and then assert properties on those same objects. Examples:

  • decisions_context_snapshot() (helper_m3_e2e_verification.py:733): Builds 3 Decision objects with hardcoded ContextSnapshot values (e.g., hot_context_hash="sha256:root_ctx_hash"), then asserts those strings are non-empty. This is equivalent to asserting "sha256:root_ctx_hash" != "" -- it validates the fixture, not production behavior.

  • decision_tree_view() (helper_m3_e2e_verification.py:316): Builds a tree, indexes it, then asserts child.parent_decision_id == _ROOT_DEC_ID -- the value it was constructed with moments earlier.

F7. "Persistence" test has no actual database round-trip

  • Test: decision_tree_persistence() (helper_m3_e2e_verification.py:776)
  • The test calls model_dump() / model_validate() as a persistence simulation. While the code comments acknowledge this matches "the SQLAlchemy JSON-column pattern," the actual database layer (decisions table, decision_dependencies table per spec lines 18620-18679) is never exercised. Schema constraints, foreign keys, and query behavior remain untested.

F8. Invariant enforcement always stamps enforced=True

  • Test: invariants_enforced_during_strategize() (helper_m3_e2e_verification.py:941)
  • The InvariantService.enforce_invariants() implementation unconditionally sets enforced=True for every invariant without any LLM-based reconciliation. The test verifies that records are created with enforced=True, but this tells us nothing about actual constraint checking. Per the spec (lines 19454-19460), enforcement should involve the Invariant Reconciliation Actor -- this is not tested.

F9. plan_generates_decisions() tests are disconnected

  • (helper_m3_e2e_verification.py:254-308): The function invokes plan use CLI with a mocked service, then separately calls _build_decision_tree() and asserts on that. There is no connection between the CLI invocation result and the decision verification. The svc.use_action mock returns a plan, but the decisions checked afterward come from a completely independent code path.

3. MEDIUM -- Performance & Robustness

F10. BFS uses list.pop(0) -- O(n) per dequeue

  • Location: helper_m3_e2e_verification.py:384
  • queue.pop(0) on a Python list is O(n). Should use collections.deque with popleft() for O(1). With only 3 nodes this is negligible, but it is a bad pattern that could mislead anyone copying this code for larger trees.

F11. No timeout on Run Process calls

  • Location: All 10 test cases in m3_e2e_verification.robot (lines 30, 44, 59, etc.)
  • Every other Robot suite in this project uses timeout=30s-120s on Run Process. This suite has none. A hanging helper process (e.g., waiting for import, deadlock) would block the entire test suite indefinitely.

F12. Module-level ULID generation

  • Location: helper_m3_e2e_verification.py:85-88
  • _PLAN_ULID, _ROOT_DEC_ID, _CHILD_DEC_ID, _GRANDCHILD_DEC_ID are generated at import time. While each Robot Run Process call spawns a fresh Python process (so this works), it means these IDs are unpredictable across runs and cannot be correlated in logs. Using deterministic test IDs would improve debuggability.

4. LOW -- Minor Issues

F13. CHANGELOG entry references #494 but issue scope is not fully met

The CHANGELOG entry (lines 5-9) states "All 10 E2E verification tests pass against the final implementation" and references #494. Given the acceptance criteria gaps in F1-F5 above, this statement is potentially misleading -- the tests pass, but they do not fully validate the acceptance criteria as written.

F14. Duplicate tag correction on two test cases

Both "Correction Dry Run Via Plan Correct" (line 87) and "Correction Live Revert Executes And Re-Creates Decisions" (line 103) share the [Tags] success_criteria correction tag. This makes tag-based filtering ambiguous when trying to run just one correction test. Consider correction_dry_run and correction_live_revert respectively.

F15. CorrectionService._checkpoint_service is dead code

The CorrectionService.__init__() accepts an optional CheckpointService parameter but no method ever calls it. This means checkpoint/rollback integration -- a core part of the correction spec (spec lines 28309-28357) -- is completely unexercised.


Findings Summary Table

ID Severity Category Description
F1 Critical Coverage plan tree CLI never invoked despite being an acceptance criterion
F2 Critical Coverage plan explain CLI never invoked despite being an acceptance criterion
F3 Critical Coverage plan execute never invoked despite being an acceptance criterion
F4 Critical Coverage Invariant CLI tests use --global instead of --project as specified
F5 High Coverage Live revert correction missing CLI integration test
F6 High Test Flaw Tests assert properties on self-constructed fixtures, not production outputs
F7 High Test Flaw Persistence test uses model round-trip, no actual database
F8 High Test Flaw Invariant enforcement always stamps enforced=True unconditionally
F9 High Test Flaw plan_generates_decisions CLI call and decision verification are disconnected
F10 Medium Performance BFS uses list.pop(0) instead of deque.popleft()
F11 Medium Robustness No timeout on any Run Process call (all other suites use timeouts)
F12 Low Robustness Module-level ULID generation makes test IDs non-deterministic
F13 Low Documentation CHANGELOG claims full validation but acceptance criteria have gaps
F14 Low Maintainability Duplicate correction tag on two distinct test cases
F15 Low Dead Code CorrectionService._checkpoint_service accepted but never used
# Code Review Report: Commit `d5e73dba` by Rui Hu **Branch:** `test/m3-acceptance-gate` **Issue:** #494 -- test(e2e): validate M3 acceptance criteria for v3.2.0 milestone closure **Files changed:** `CHANGELOG.md`, `CONTRIBUTORS.md`, `robot/m3_e2e_verification.robot` (3 files, +49/-1) --- ## Summary The commit adds documentation tags, milestone cross-references, and `[Tags]` annotations to the existing `m3_e2e_verification.robot` suite. It also appends a CHANGELOG entry and adds Rui Hu to CONTRIBUTORS. **The commit does not modify the underlying test logic** (`helper_m3_e2e_verification.py` is unchanged). The changes are primarily organizational/metadata-level. The deeper issues below stem from the pre-existing test suite that this commit validates and declares as passing. --- ## 1. CRITICAL -- Acceptance Criteria Gaps (Issue #494 vs. Actual Tests) Several acceptance criteria from the milestone description and issue #494 are **not faithfully covered** by the tests this commit marks as passing: ### F1. `plan tree` CLI never invoked - **Criterion:** *"`agents plan tree <plan_id>` displays the decision tree correctly"* - **Actual test (`decision_tree_view`, `helper_m3_e2e_verification.py:316`):** Invokes `plan status` via CLI, then performs domain-level tree traversal on locally-constructed objects. The `plan tree` CLI subcommand (defined at `src/cleveragents/cli/commands/plan.py:2918`) is **never invoked**. - **Impact:** The `plan tree` rendering path -- including `DecisionService` fetch, `build_decision_tree()` nesting logic, and Rich Tree output -- is **completely untested**. ### F2. `plan explain` CLI never invoked - **Criterion:** *"`agents plan explain <decision_id>` shows full decision context"* - **Actual test (`decision_explain`, `helper_m3_e2e_verification.py:397`):** Invokes `plan status` via CLI, then checks context snapshot fields on a locally-constructed `Decision`. The `plan explain` CLI subcommand (defined at `plan.py:2780`) is **never invoked**. - **Impact:** The `explain` rendering path -- `DecisionService.get_decision()`, `_build_explain_dict()`, `--show-context`/`--show-reasoning` flags -- is **completely untested**. ### F3. `plan execute` never invoked - **Criterion:** *"`agents plan execute <plan_id>` generate decisions during Strategize"* - **Actual test (`plan_generates_decisions`, `helper_m3_e2e_verification.py:254`):** Invokes `plan use` with a mocked lifecycle service, then validates a locally-built decision tree. `plan execute` is never called. - **Impact:** The Execute-phase transition and decision generation during execution are **untested**. ### F4. Invariant CLI uses wrong scope - **Criterion:** *"`agents invariant add --project local/large-project 'Use session cookies'`"* - **Actual test (`invariant_add_and_list`, `helper_m3_e2e_verification.py:502`):** Invokes `invariant add --global "CLI test invariant"` instead of `--project`. - **Criterion:** *"`agents invariant list --project local/large-project`"* - **Actual test (`helper_m3_e2e_verification.py:518`):** Invokes `invariant list` without `--project` filter. - **Impact:** The project-scoped invariant CLI path is **untested**. The service-level test uses `InvariantScope.PROJECT` correctly, but the CLI integration test exercises only the `--global` flag. ### F5. Live revert correction missing CLI test - **Criterion:** *"`agents plan correct <decision_id> --mode=revert --guidance '...'` executes live correction"* - **Actual test (`correction_live_revert`, `helper_m3_e2e_verification.py:635`):** Tests only `CorrectionService` directly. Unlike the dry-run test which includes a CLI invocation, the live revert has **no CLI path test**. --- ## 2. HIGH -- Test Validity / Flaw Analysis ### F6. Tests assert on self-constructed fixtures, not production outputs Multiple tests build domain objects (via `_build_decision_tree()`) and then assert properties on those same objects. Examples: - `decisions_context_snapshot()` (`helper_m3_e2e_verification.py:733`): Builds 3 `Decision` objects with hardcoded `ContextSnapshot` values (e.g., `hot_context_hash="sha256:root_ctx_hash"`), then asserts those strings are non-empty. This is equivalent to asserting `"sha256:root_ctx_hash" != ""` -- it validates the fixture, not production behavior. - `decision_tree_view()` (`helper_m3_e2e_verification.py:316`): Builds a tree, indexes it, then asserts `child.parent_decision_id == _ROOT_DEC_ID` -- the value it was constructed with moments earlier. ### F7. "Persistence" test has no actual database round-trip - **Test:** `decision_tree_persistence()` (`helper_m3_e2e_verification.py:776`) - The test calls `model_dump()` / `model_validate()` as a persistence simulation. While the code comments acknowledge this matches "the SQLAlchemy JSON-column pattern," the actual database layer (`decisions` table, `decision_dependencies` table per spec lines 18620-18679) is **never exercised**. Schema constraints, foreign keys, and query behavior remain untested. ### F8. Invariant enforcement always stamps `enforced=True` - **Test:** `invariants_enforced_during_strategize()` (`helper_m3_e2e_verification.py:941`) - The `InvariantService.enforce_invariants()` implementation **unconditionally** sets `enforced=True` for every invariant without any LLM-based reconciliation. The test verifies that records are created with `enforced=True`, but this tells us nothing about actual constraint checking. Per the spec (lines 19454-19460), enforcement should involve the Invariant Reconciliation Actor -- this is not tested. ### F9. `plan_generates_decisions()` tests are disconnected - (`helper_m3_e2e_verification.py:254-308`): The function invokes `plan use` CLI with a mocked service, then separately calls `_build_decision_tree()` and asserts on that. There is **no connection** between the CLI invocation result and the decision verification. The `svc.use_action` mock returns a plan, but the decisions checked afterward come from a completely independent code path. --- ## 3. MEDIUM -- Performance & Robustness ### F10. BFS uses `list.pop(0)` -- O(n) per dequeue - **Location:** `helper_m3_e2e_verification.py:384` - `queue.pop(0)` on a Python list is O(n). Should use `collections.deque` with `popleft()` for O(1). With only 3 nodes this is negligible, but it is a bad pattern that could mislead anyone copying this code for larger trees. ### F11. No timeout on `Run Process` calls - **Location:** All 10 test cases in `m3_e2e_verification.robot` (lines 30, 44, 59, etc.) - Every other Robot suite in this project uses `timeout=30s`-`120s` on `Run Process`. This suite has **none**. A hanging helper process (e.g., waiting for import, deadlock) would block the entire test suite indefinitely. ### F12. Module-level ULID generation - **Location:** `helper_m3_e2e_verification.py:85-88` - `_PLAN_ULID`, `_ROOT_DEC_ID`, `_CHILD_DEC_ID`, `_GRANDCHILD_DEC_ID` are generated at import time. While each Robot `Run Process` call spawns a fresh Python process (so this works), it means these IDs are unpredictable across runs and cannot be correlated in logs. Using deterministic test IDs would improve debuggability. --- ## 4. LOW -- Minor Issues ### F13. CHANGELOG entry references `#494` but issue scope is not fully met The CHANGELOG entry (lines 5-9) states *"All 10 E2E verification tests pass against the final implementation"* and references `#494`. Given the acceptance criteria gaps in F1-F5 above, this statement is potentially misleading -- the tests pass, but they do not fully validate the acceptance criteria as written. ### F14. Duplicate tag `correction` on two test cases Both "Correction Dry Run Via Plan Correct" (line 87) and "Correction Live Revert Executes And Re-Creates Decisions" (line 103) share the `[Tags] success_criteria correction` tag. This makes tag-based filtering ambiguous when trying to run just one correction test. Consider `correction_dry_run` and `correction_live_revert` respectively. ### F15. `CorrectionService._checkpoint_service` is dead code The `CorrectionService.__init__()` accepts an optional `CheckpointService` parameter but no method ever calls it. This means checkpoint/rollback integration -- a core part of the correction spec (spec lines 28309-28357) -- is completely unexercised. --- ## Findings Summary Table | ID | Severity | Category | Description | |:---|:---------|:---------|:------------| | F1 | Critical | Coverage | `plan tree` CLI never invoked despite being an acceptance criterion | | F2 | Critical | Coverage | `plan explain` CLI never invoked despite being an acceptance criterion | | F3 | Critical | Coverage | `plan execute` never invoked despite being an acceptance criterion | | F4 | Critical | Coverage | Invariant CLI tests use `--global` instead of `--project` as specified | | F5 | High | Coverage | Live revert correction missing CLI integration test | | F6 | High | Test Flaw | Tests assert properties on self-constructed fixtures, not production outputs | | F7 | High | Test Flaw | Persistence test uses model round-trip, no actual database | | F8 | High | Test Flaw | Invariant enforcement always stamps `enforced=True` unconditionally | | F9 | High | Test Flaw | `plan_generates_decisions` CLI call and decision verification are disconnected | | F10 | Medium | Performance | BFS uses `list.pop(0)` instead of `deque.popleft()` | | F11 | Medium | Robustness | No `timeout` on any `Run Process` call (all other suites use timeouts) | | F12 | Low | Robustness | Module-level ULID generation makes test IDs non-deterministic | | F13 | Low | Documentation | CHANGELOG claims full validation but acceptance criteria have gaps | | F14 | Low | Maintainability | Duplicate `correction` tag on two distinct test cases | | F15 | Low | Dead Code | `CorrectionService._checkpoint_service` accepted but never used |
Member

Code Review Report: Commit d5e73db by Rui Hu

Branch: test/m3-acceptance-gate
Issue: #494 — test(e2e): validate M3 acceptance criteria for v3.2.0 milestone closure


Scope of the Commit

Rui's commit modifies 3 files:

  • CHANGELOG.md — added entry for #494
  • CONTRIBUTORS.md — added Rui Hu
  • robot/m3_e2e_verification.robot — added tags (Force Tags, per-test [Tags]) and enhanced documentation with milestone cross-references

The test logic itself (helper_m3_e2e_verification.py) was not changed. Rui ran the pre-existing tests (written by Brent for #404), confirmed they pass, and annotated the robot file. The commit states: "All 10 tests pass without modification."

The core review question is therefore: Are the pre-existing tests adequate for the acceptance gate they are now formally declared to be?


FINDING 1 — CLI Commands Specified in Acceptance Criteria Are Not Actually Tested [HIGH]

Category: Test Coverage / Acceptance Criteria Gap

The issue #494 acceptance criteria (and milestone v3.2.0) specify exact CLI commands that must work. Several are never invoked through the CLI in the tests:

Acceptance Criterion CLI Tested? What Actually Happens
agents plan execute <plan_id> No Never invoked. Domain-level Decision objects built in-memory.
agents plan tree <plan_id> No Test calls plan status, not plan tree (helper_m3_e2e_verification.py:328).
agents plan explain <decision_id> No Test calls plan status, not plan explain (helper_m3_e2e_verification.py:409).
agents plan correct ... --mode=revert (live) No Uses CorrectionService directly (helper_m3_e2e_verification.py:650-659), never invokes CLI.
agents invariant add --project No CLI test uses --global instead of --project (helper_m3_e2e_verification.py:504).
agents invariant list --project No CLI test lists without --project filter (helper_m3_e2e_verification.py:520).

The _cli_plan_status() helper (helper_m3_e2e_verification.py:102-120) is used as a proxy for CLI validation in multiple tests, but it invokes plan status, not the commands named in the acceptance criteria. This creates a false sense of CLI coverage.

Impact: The acceptance gate may pass while the actual plan tree, plan explain, and plan correct (live) CLI paths are broken. The milestone would be closed with unverified criteria.


FINDING 2 — "Persists to Database" Is Not Tested Against a Database [HIGH]

Category: Test Flaw / Misleading Test Name

The technical criterion states: "Decision tree persists to database and renders correctly."

The test decision_tree_persistence() (helper_m3_e2e_verification.py:776-840) simulates persistence via model_dump() / model_validate() (Pydantic serialization round-trip). No actual database (SQLite or otherwise) is involved.

The spec (docs/specification.md) defines an explicit SQL schema with decisions, decision_dependencies, and correction_attempts tables. The test verifies Pydantic can serialize/deserialize Decision objects, not that the SQLAlchemy persistence layer, JSON column storage, foreign key constraints, or query paths work.

Impact: A regression in the actual database persistence layer (e.g., a broken migration, incorrect column mapping, or JSON serialization edge case) would not be caught by this gate.


FINDING 3 — Tests Heavily Mock the Service Layer, Undermining E2E Purpose [MEDIUM]

Category: Test Flaw / False Confidence

For a suite described as "End-to-end verification" and "the final gate before closing milestone v3.2.0," the tests rely extensively on mocks:

  • _get_lifecycle_service is mocked in every CLI test (helper_m3_e2e_verification.py:110-113, 267-270)
  • _get_service for invariants is mocked (helper_m3_e2e_verification.py:498-501, 514-517)
  • CorrectionService itself is mocked for the plan correct --dry-run CLI test (helper_m3_e2e_verification.py:600-608)
  • _resolve_active_plan_id is mocked (helper_m3_e2e_verification.py:606-608)

The CLI integration tests verify that Typer argument parsing works and that the CLI calls the right method name on the service, but they don't verify the service returns correct results or that the CLI renders them properly. The mock for plan correct --dry-run returns hardcoded CorrectionImpact data (helper_m3_e2e_verification.py:591-596) that may not match what the real service would produce.

Impact: Integration bugs between the CLI layer and the service layer (argument transformation, response rendering, error handling) would not be detected.


FINDING 4 — invariant_enforced Decision Type Never Created or Verified [MEDIUM]

Category: Specification Compliance

The spec explicitly states: "Each effective invariant is then recorded as an invariant_enforced decision in the plan's decision tree."

The test invariants_enforced_during_strategize() (helper_m3_e2e_verification.py:941-1021) tests invariant enforcement via svc.enforce_invariants() and verifies enforcement records. However, it never creates a Decision with decision_type=DecisionType.INVARIANT_ENFORCED or verifies that invariant enforcement produces such decisions in the tree.

The spec's decision type enum includes invariant_enforced as a first-class type, but the test's decision tree (_build_decision_tree()) only uses PROMPT_DEFINITION and STRATEGY_CHOICE.

Impact: The invariant-to-decision integration path is untested. If invariant enforcement silently fails to record decisions, this gate would not catch it.


FINDING 5 — No Negative or Edge-Case Tests [MEDIUM]

Category: Test Coverage

The entire suite (10 tests, ~1060 lines of helper) contains zero negative test cases:

  • No test for correcting the root decision (which should be rejected or handled specially).
  • No test for correcting a decision in an already-applied child plan (the spec says this should be rejected).
  • No test for an empty decision tree.
  • No test for duplicate correction of the same decision.
  • No test for invariant text that is empty or whitespace.
  • No test for the --mode=append correction mode (only revert is tested).
  • No test for plan correct on a decision that has already been superseded.

For a milestone acceptance gate, the absence of boundary-condition tests means the happy path works but robustness is unverified.


FINDING 6 — Influence DAG Structure Not Tested [MEDIUM]

Category: Specification Compliance

The spec describes two independent structures:

  • Structural Tree (parent_decision_id) — used for rendering (plan tree, plan explain)
  • Influence DAG (decision_dependencies table) — used for impact analysis (plan correct affected subtree computation)

All tests only verify the structural tree. The adjacency dicts passed to CorrectionService methods (e.g., helper_m3_e2e_verification.py:542-545, 644-647, 859-862) are manually constructed structural trees, not influence DAGs. The spec's compute_affected_subtree() pseudocode follows both structural children and influence DAG dependents, but the tests only exercise one dimension.


FINDING 7 — BFS Queue Implementation Uses O(n) list.pop(0) [LOW]

Category: Performance / Code Quality

In decision_tree_view() (helper_m3_e2e_verification.py:383):

node = queue.pop(0)

Uses a Python list as a queue with O(n) pop(0). Should use collections.deque with popleft() for O(1). Irrelevant for 3 nodes but signals a pattern that shouldn't be copied into production code.


FINDING 8 — Non-Deterministic ULID Generation at Module Level [LOW]

Category: Test Reliability

Lines 85-88 of helper_m3_e2e_verification.py:

_PLAN_ULID = str(ULID())
_ROOT_DEC_ID = str(ULID())
_CHILD_DEC_ID = str(ULID())
_GRANDCHILD_DEC_ID = str(ULID())

ULIDs are generated at import time, producing different IDs on every run. This makes test failures harder to reproduce and debug. Fixed seed ULIDs or deterministic test IDs would improve reproducibility.


Summary Table

# Finding Severity Category
1 CLI commands from acceptance criteria not actually tested via CLI HIGH Test Coverage
2 "Persists to database" uses Pydantic round-trip, not actual DB HIGH Test Flaw
3 E2E tests heavily mock the service layer MEDIUM Test Flaw
4 invariant_enforced decision type never created/verified MEDIUM Spec Compliance
5 Zero negative or edge-case tests MEDIUM Test Coverage
6 Influence DAG structure not tested MEDIUM Spec Compliance
7 BFS uses O(n) list.pop(0) LOW Code Quality
8 Non-deterministic ULID generation at module level LOW Test Reliability

Recommendation

The two HIGH findings question whether this suite can serve as a reliable milestone closure gate. The tests verify domain model correctness (Pydantic models, Decision objects, service interfaces) but do not verify the specific CLI commands listed in the acceptance criteria, nor actual database persistence.

Before closing milestone v3.2.0, consider:

  1. Adding actual CLI invocations for plan tree, plan explain, and plan correct --mode=revert (live).
  2. Testing invariant add --project and invariant list --project as the acceptance criteria specify.
  3. Testing the database persistence path with a real SQLite instance rather than Pydantic round-trips.

The tagging and documentation additions in the robot file are well done and improve discoverability.

## Code Review Report: Commit `d5e73db` by Rui Hu **Branch:** `test/m3-acceptance-gate` **Issue:** #494 — test(e2e): validate M3 acceptance criteria for v3.2.0 milestone closure --- ### Scope of the Commit Rui's commit modifies **3 files**: - `CHANGELOG.md` — added entry for #494 - `CONTRIBUTORS.md` — added Rui Hu - `robot/m3_e2e_verification.robot` — added tags (`Force Tags`, per-test `[Tags]`) and enhanced documentation with milestone cross-references **The test logic itself (`helper_m3_e2e_verification.py`) was not changed.** Rui ran the pre-existing tests (written by Brent for #404), confirmed they pass, and annotated the robot file. The commit states: *"All 10 tests pass without modification."* The core review question is therefore: **Are the pre-existing tests adequate for the acceptance gate they are now formally declared to be?** --- ### FINDING 1 — CLI Commands Specified in Acceptance Criteria Are Not Actually Tested [HIGH] **Category:** Test Coverage / Acceptance Criteria Gap The issue #494 acceptance criteria (and milestone v3.2.0) specify exact CLI commands that must work. Several are never invoked through the CLI in the tests: | Acceptance Criterion | CLI Tested? | What Actually Happens | |---|---|---| | `agents plan execute <plan_id>` | **No** | Never invoked. Domain-level Decision objects built in-memory. | | `agents plan tree <plan_id>` | **No** | Test calls `plan status`, not `plan tree` (`helper_m3_e2e_verification.py:328`). | | `agents plan explain <decision_id>` | **No** | Test calls `plan status`, not `plan explain` (`helper_m3_e2e_verification.py:409`). | | `agents plan correct ... --mode=revert` (live) | **No** | Uses `CorrectionService` directly (`helper_m3_e2e_verification.py:650-659`), never invokes CLI. | | `agents invariant add --project` | **No** | CLI test uses `--global` instead of `--project` (`helper_m3_e2e_verification.py:504`). | | `agents invariant list --project` | **No** | CLI test lists without `--project` filter (`helper_m3_e2e_verification.py:520`). | The `_cli_plan_status()` helper (`helper_m3_e2e_verification.py:102-120`) is used as a proxy for CLI validation in multiple tests, but it invokes `plan status`, not the commands named in the acceptance criteria. This creates a false sense of CLI coverage. **Impact:** The acceptance gate may pass while the actual `plan tree`, `plan explain`, and `plan correct` (live) CLI paths are broken. The milestone would be closed with unverified criteria. --- ### FINDING 2 — "Persists to Database" Is Not Tested Against a Database [HIGH] **Category:** Test Flaw / Misleading Test Name The technical criterion states: *"Decision tree persists to database and renders correctly."* The test `decision_tree_persistence()` (`helper_m3_e2e_verification.py:776-840`) simulates persistence via `model_dump()` / `model_validate()` (Pydantic serialization round-trip). No actual database (SQLite or otherwise) is involved. The spec (`docs/specification.md`) defines an explicit SQL schema with `decisions`, `decision_dependencies`, and `correction_attempts` tables. The test verifies Pydantic can serialize/deserialize Decision objects, not that the SQLAlchemy persistence layer, JSON column storage, foreign key constraints, or query paths work. **Impact:** A regression in the actual database persistence layer (e.g., a broken migration, incorrect column mapping, or JSON serialization edge case) would not be caught by this gate. --- ### FINDING 3 — Tests Heavily Mock the Service Layer, Undermining E2E Purpose [MEDIUM] **Category:** Test Flaw / False Confidence For a suite described as "End-to-end verification" and "the final gate before closing milestone v3.2.0," the tests rely extensively on mocks: - `_get_lifecycle_service` is mocked in every CLI test (`helper_m3_e2e_verification.py:110-113, 267-270`) - `_get_service` for invariants is mocked (`helper_m3_e2e_verification.py:498-501, 514-517`) - `CorrectionService` itself is mocked for the `plan correct --dry-run` CLI test (`helper_m3_e2e_verification.py:600-608`) - `_resolve_active_plan_id` is mocked (`helper_m3_e2e_verification.py:606-608`) The CLI integration tests verify that Typer argument parsing works and that the CLI calls the right method name on the service, but they don't verify the service returns correct results or that the CLI renders them properly. The mock for `plan correct --dry-run` returns hardcoded `CorrectionImpact` data (`helper_m3_e2e_verification.py:591-596`) that may not match what the real service would produce. **Impact:** Integration bugs between the CLI layer and the service layer (argument transformation, response rendering, error handling) would not be detected. --- ### FINDING 4 — `invariant_enforced` Decision Type Never Created or Verified [MEDIUM] **Category:** Specification Compliance The spec explicitly states: *"Each effective invariant is then recorded as an `invariant_enforced` decision in the plan's decision tree."* The test `invariants_enforced_during_strategize()` (`helper_m3_e2e_verification.py:941-1021`) tests invariant enforcement via `svc.enforce_invariants()` and verifies enforcement records. However, it never creates a `Decision` with `decision_type=DecisionType.INVARIANT_ENFORCED` or verifies that invariant enforcement produces such decisions in the tree. The spec's decision type enum includes `invariant_enforced` as a first-class type, but the test's decision tree (`_build_decision_tree()`) only uses `PROMPT_DEFINITION` and `STRATEGY_CHOICE`. **Impact:** The invariant-to-decision integration path is untested. If invariant enforcement silently fails to record decisions, this gate would not catch it. --- ### FINDING 5 — No Negative or Edge-Case Tests [MEDIUM] **Category:** Test Coverage The entire suite (10 tests, ~1060 lines of helper) contains zero negative test cases: - No test for correcting the **root decision** (which should be rejected or handled specially). - No test for correcting a decision in an **already-applied** child plan (the spec says this should be rejected). - No test for an **empty decision tree**. - No test for **duplicate correction** of the same decision. - No test for invariant text that is **empty or whitespace**. - No test for the `--mode=append` correction mode (only `revert` is tested). - No test for `plan correct` on a decision that has **already been superseded**. For a milestone acceptance gate, the absence of boundary-condition tests means the happy path works but robustness is unverified. --- ### FINDING 6 — Influence DAG Structure Not Tested [MEDIUM] **Category:** Specification Compliance The spec describes two independent structures: - **Structural Tree** (`parent_decision_id`) — used for rendering (`plan tree`, `plan explain`) - **Influence DAG** (`decision_dependencies` table) — used for impact analysis (`plan correct` affected subtree computation) All tests only verify the structural tree. The adjacency dicts passed to `CorrectionService` methods (e.g., `helper_m3_e2e_verification.py:542-545, 644-647, 859-862`) are manually constructed structural trees, not influence DAGs. The spec's `compute_affected_subtree()` pseudocode follows **both** structural children and influence DAG dependents, but the tests only exercise one dimension. --- ### FINDING 7 — BFS Queue Implementation Uses O(n) `list.pop(0)` [LOW] **Category:** Performance / Code Quality In `decision_tree_view()` (`helper_m3_e2e_verification.py:383`): ```python node = queue.pop(0) ``` Uses a Python list as a queue with O(n) `pop(0)`. Should use `collections.deque` with `popleft()` for O(1). Irrelevant for 3 nodes but signals a pattern that shouldn't be copied into production code. --- ### FINDING 8 — Non-Deterministic ULID Generation at Module Level [LOW] **Category:** Test Reliability Lines 85-88 of `helper_m3_e2e_verification.py`: ```python _PLAN_ULID = str(ULID()) _ROOT_DEC_ID = str(ULID()) _CHILD_DEC_ID = str(ULID()) _GRANDCHILD_DEC_ID = str(ULID()) ``` ULIDs are generated at import time, producing different IDs on every run. This makes test failures harder to reproduce and debug. Fixed seed ULIDs or deterministic test IDs would improve reproducibility. --- ### Summary Table | # | Finding | Severity | Category | |---|---------|----------|----------| | 1 | CLI commands from acceptance criteria not actually tested via CLI | **HIGH** | Test Coverage | | 2 | "Persists to database" uses Pydantic round-trip, not actual DB | **HIGH** | Test Flaw | | 3 | E2E tests heavily mock the service layer | **MEDIUM** | Test Flaw | | 4 | `invariant_enforced` decision type never created/verified | **MEDIUM** | Spec Compliance | | 5 | Zero negative or edge-case tests | **MEDIUM** | Test Coverage | | 6 | Influence DAG structure not tested | **MEDIUM** | Spec Compliance | | 7 | BFS uses O(n) `list.pop(0)` | **LOW** | Code Quality | | 8 | Non-deterministic ULID generation at module level | **LOW** | Test Reliability | --- ### Recommendation The two **HIGH** findings question whether this suite can serve as a reliable milestone closure gate. The tests verify domain model correctness (Pydantic models, Decision objects, service interfaces) but do **not** verify the specific CLI commands listed in the acceptance criteria, nor actual database persistence. Before closing milestone v3.2.0, consider: 1. Adding actual CLI invocations for `plan tree`, `plan explain`, and `plan correct --mode=revert` (live). 2. Testing `invariant add --project` and `invariant list --project` as the acceptance criteria specify. 3. Testing the database persistence path with a real SQLite instance rather than Pydantic round-trips. The tagging and documentation additions in the robot file are well done and improve discoverability.
Author
Member

Addressed review feedback and pushed follow-up commit 3d7f50b0 to test/m3-acceptance-gate.

Key updates in this push:

  • Replaced fixture-self-assertion patterns in M3 helper with real CLI-path validation for:
    • plan use + plan execute
    • plan tree
    • plan explain
    • project-scoped invariant add/list CLI
    • plan correct dry-run and live revert CLI flows
  • Added robust JSON output parsing to tolerate structured log preambles.
  • Added persistence-backed decision tree verification (UnitOfWork + DecisionService).
  • Added explicit Robot process timeouts and clarified correction tags.

Verification performed:

  • nox -s integration_tests -- --processes 1 --include acceptance_gate (10/10 pass)
  • nox -s lint typecheck (pass)

If you want, I can also run the full default nox suite again and paste the result summary here.

Addressed review feedback and pushed follow-up commit `3d7f50b0` to `test/m3-acceptance-gate`. Key updates in this push: - Replaced fixture-self-assertion patterns in M3 helper with real CLI-path validation for: - `plan use` + `plan execute` - `plan tree` - `plan explain` - project-scoped invariant add/list CLI - `plan correct` dry-run and live revert CLI flows - Added robust JSON output parsing to tolerate structured log preambles. - Added persistence-backed decision tree verification (`UnitOfWork` + `DecisionService`). - Added explicit Robot process timeouts and clarified correction tags. Verification performed: - `nox -s integration_tests -- --processes 1 --include acceptance_gate` (10/10 pass) - `nox -s lint typecheck` (pass) If you want, I can also run the full default `nox` suite again and paste the result summary here.
freemo left a comment

PM Review — Day 25

Status

  • Merge conflicts@hurui200320 please rebase onto master. This is the primary blocker.
  • Review cycle: Luis (CoreRasurae) raised 15 findings in the initial review. Rui pushed commit 3d7f50b0 addressing the critical and high findings. Luis's REQUEST_REVIEW is now stale.

Action Items

  1. @hurui200320: Rebase test/m3-acceptance-gate onto current master to resolve conflicts.
  2. @CoreRasurae: Once rebased, please re-review to verify that commit 3d7f50b0 adequately addresses the 4 Critical and 5 High findings from your original review. If satisfied, please submit a formal APPROVED verdict.

Priority

This is the M3 acceptance gate — it blocks milestone v3.2.0 closure. The milestone is already 1 week overdue. Please prioritize.

## PM Review — Day 25 ### Status - **Merge conflicts** — @hurui200320 please rebase onto `master`. This is the primary blocker. - **Review cycle**: Luis (CoreRasurae) raised 15 findings in the initial review. Rui pushed commit `3d7f50b0` addressing the critical and high findings. Luis's `REQUEST_REVIEW` is now stale. ### Action Items 1. **@hurui200320**: Rebase `test/m3-acceptance-gate` onto current `master` to resolve conflicts. 2. **@CoreRasurae**: Once rebased, please re-review to verify that commit `3d7f50b0` adequately addresses the 4 Critical and 5 High findings from your original review. If satisfied, please submit a formal APPROVED verdict. ### Priority This is the **M3 acceptance gate** — it blocks milestone v3.2.0 closure. The milestone is already 1 week overdue. Please prioritize.
hurui200320 force-pushed test/m3-acceptance-gate from 3d7f50b01f
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 15s
CI / quality (pull_request) Successful in 18s
CI / security (pull_request) Successful in 31s
CI / typecheck (pull_request) Successful in 34s
CI / unit_tests (pull_request) Successful in 2m15s
CI / integration_tests (pull_request) Successful in 2m53s
CI / docker (pull_request) Successful in 41s
CI / coverage (pull_request) Successful in 4m50s
CI / benchmark-regression (pull_request) Successful in 25m16s
to 7ba2f68bf4
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 19s
CI / security (pull_request) Successful in 31s
CI / typecheck (pull_request) Successful in 39s
CI / unit_tests (pull_request) Successful in 3m31s
CI / docker (pull_request) Successful in 52s
CI / coverage (pull_request) Successful in 4m51s
CI / benchmark-regression (pull_request) Successful in 28m13s
CI / integration_tests (pull_request) Failing after 3m10s
2026-03-05 05:22:36 +00:00
Compare
hurui200320 force-pushed test/m3-acceptance-gate from 7ba2f68bf4
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 19s
CI / security (pull_request) Successful in 31s
CI / typecheck (pull_request) Successful in 39s
CI / unit_tests (pull_request) Successful in 3m31s
CI / docker (pull_request) Successful in 52s
CI / coverage (pull_request) Successful in 4m51s
CI / benchmark-regression (pull_request) Successful in 28m13s
CI / integration_tests (pull_request) Failing after 3m10s
to 3486198a54
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 17s
CI / quality (pull_request) Successful in 19s
CI / security (pull_request) Successful in 34s
CI / typecheck (pull_request) Successful in 49s
CI / unit_tests (pull_request) Successful in 2m6s
CI / docker (pull_request) Successful in 9s
CI / integration_tests (pull_request) Successful in 3m3s
CI / coverage (pull_request) Successful in 4m19s
CI / benchmark-regression (pull_request) Successful in 28m16s
2026-03-05 06:22:57 +00:00
Compare
hurui200320 force-pushed test/m3-acceptance-gate from 3486198a54
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 17s
CI / quality (pull_request) Successful in 19s
CI / security (pull_request) Successful in 34s
CI / typecheck (pull_request) Successful in 49s
CI / unit_tests (pull_request) Successful in 2m6s
CI / docker (pull_request) Successful in 9s
CI / integration_tests (pull_request) Successful in 3m3s
CI / coverage (pull_request) Successful in 4m19s
CI / benchmark-regression (pull_request) Successful in 28m16s
to e1665c6eaa
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 17s
CI / security (pull_request) Successful in 33s
CI / typecheck (pull_request) Successful in 53s
CI / unit_tests (pull_request) Successful in 3m26s
CI / docker (pull_request) Successful in 39s
CI / integration_tests (pull_request) Successful in 4m14s
CI / coverage (pull_request) Successful in 4m20s
CI / benchmark-regression (pull_request) Successful in 27m47s
2026-03-05 07:24:47 +00:00
Compare
Author
Member

Force-Push Summary — Squashed to Single Commit e1665c6e

@CoreRasurae @freemo — heads up on the force-push:

What happened

Per CONTRIBUTING.md one-commit-per-issue rule and freemo's rebase request, the branch was squashed from 2 commits into 1:

  • d5e73dba (original: tags, docs, CHANGELOG, CONTRIBUTORS)
  • 3d7f50b0 (review fixes: CLI-path validation, persistence checks, timeouts)
  • e1665c6e (single squashed commit, rebased onto current master fff6bb38)

Additional fixes in the squashed commit

  1. async_enabled bug fix: _make_settings() in helper_m3_e2e_verification.py now sets settings.async_enabled = False. Without this, plan execute CLI tests fail because PlanLifecycleService.execute_plan() accesses self.settings.async_enabled.

  2. Timeout increases (30s → 120s) in 3 robot files to prevent failures under parallel pabot execution:

    • robot/actor_context_management.robot
    • robot/decision_di_wiring_smoke.robot
    • robot/changeset_persistence.robot

Verification

All 11 default nox sessions pass locally:

  • unit_tests: 8500 BDD scenarios, 0 failures
  • integration_tests: 1194/1194 tests, 0 failures
  • coverage_report: 96.96% (meets --fail-under=97)
  • lint, format, typecheck, security_scan, dead_code, docs, build, benchmark: all PASS

CI run #5535 is currently running for this commit.

Review request

@CoreRasurae — this commit incorporates all your CRITICAL/HIGH findings from the original review. The key changes to re-review in helper_m3_e2e_verification.py:

  • Real CLI invocations for plan tree, plan explain, plan execute, plan correct (dry-run + live)
  • Project-scoped invariant add --project / invariant list --project
  • Persistence-backed decision tree verification via UnitOfWork + DecisionService
  • The new async_enabled fix in _make_settings()

PR description has been updated with full details.

## Force-Push Summary — Squashed to Single Commit `e1665c6e` @CoreRasurae @freemo — heads up on the force-push: ### What happened Per CONTRIBUTING.md one-commit-per-issue rule and freemo's rebase request, the branch was squashed from 2 commits into 1: - `d5e73dba` (original: tags, docs, CHANGELOG, CONTRIBUTORS) - `3d7f50b0` (review fixes: CLI-path validation, persistence checks, timeouts) - → **`e1665c6e`** (single squashed commit, rebased onto current master `fff6bb38`) ### Additional fixes in the squashed commit 1. **`async_enabled` bug fix**: `_make_settings()` in `helper_m3_e2e_verification.py` now sets `settings.async_enabled = False`. Without this, `plan execute` CLI tests fail because `PlanLifecycleService.execute_plan()` accesses `self.settings.async_enabled`. 2. **Timeout increases (30s → 120s)** in 3 robot files to prevent failures under parallel `pabot` execution: - `robot/actor_context_management.robot` - `robot/decision_di_wiring_smoke.robot` - `robot/changeset_persistence.robot` ### Verification All 11 default nox sessions pass locally: - **unit_tests**: 8500 BDD scenarios, 0 failures - **integration_tests**: 1194/1194 tests, 0 failures - **coverage_report**: 96.96% (meets `--fail-under=97`) - lint, format, typecheck, security_scan, dead_code, docs, build, benchmark: all PASS CI run #5535 is currently running for this commit. ### Review request @CoreRasurae — this commit incorporates all your CRITICAL/HIGH findings from the original review. The key changes to re-review in `helper_m3_e2e_verification.py`: - Real CLI invocations for `plan tree`, `plan explain`, `plan execute`, `plan correct` (dry-run + live) - Project-scoped `invariant add --project` / `invariant list --project` - Persistence-backed decision tree verification via `UnitOfWork` + `DecisionService` - The new `async_enabled` fix in `_make_settings()` PR description has been updated with full details.
CoreRasurae requested changes 2026-03-05 10:28:29 +00:00
Dismissed
CoreRasurae left a comment

Code Review Report — Commit e1665c6e on test/m3-acceptance-gate

Reviewer scope: test coverage, test flaws, performance, bug detection, security
Issue: #494 — M3 acceptance criteria for v3.2.0
Specification reference: docs/specification.md, milestone v3.2.0 description, docs/timeline.md, docs/development/testing.md


Summary

The commit is a substantial improvement over the previous fixture-self-assertion approach. The switch to real CLI-path validation, persistence-backed checks, and structured JSON output verification significantly strengthens the M3 acceptance gate. However, I found 9 findings (1 critical, 2 high, 4 medium, 2 low) that should be considered before merge.


[CRITICAL] F1 — Inconsistent DecisionService wiring: some tests lack database isolation

Location: robot/helper_m3_e2e_verification.py:290, 344, 678, 769

Four of the ten test subcommands create DecisionService() with no arguments (no settings, no unit_of_work):

  • decision_tree_view() (line 290)
  • decision_explain() (line 344)
  • decisions_context_snapshot() (line 678)
  • correction_revert_reexecutes() (line 769)

Meanwhile, the two persistence-oriented subcommands (plan_generates_decisions, decision_tree_persistence) correctly wire up in-memory SQLite:

database_url = "sqlite:///:memory:"
uow = _make_uow(database_url)
settings = _make_settings(database_url)
decision_service = DecisionService(settings=settings, unit_of_work=uow)

The zero-argument instances fall back to whatever DecisionService.__init__ defaults to. If the default is an in-memory dict-backed store, this works but doesn't test persistence. If the default resolves the global config's database_url, this could create on-disk artifacts, cause cross-test pollution when pabot runs 16 parallel processes, or fail in CI environments that lack a configured database path. Either way, the inconsistency means 4 of the 10 tests exercise a different code path than the other 6, weakening the acceptance gate's integrity.

Recommendation: Wire all DecisionService instances through _make_uow() + _make_settings() for uniform in-memory-SQLite isolation, consistent with how the persistence tests are already written.


[HIGH] F2 — Correction tests use bare string IDs instead of real ULIDs

Location: robot/helper_m3_e2e_verification.py:498-501, 588-591

In correction_dry_run() and correction_live_revert(), the decision tree uses bare string literals as decision IDs:

tree = {
    "root": ["child"],
    "child": ["grandchild"],
}
request = service.request_correction(
    ...
    target_decision_id="child",
    ...
)

These strings are not valid ULIDs. The production system uses ULIDs for all decision identifiers (per the spec: "ULID-identified"). Meanwhile, correction_revert_reexecutes() (line 772-774) correctly uses real decision IDs from _seed_decisions():

tree = {
    root.decision_id: [child.decision_id],
    child.decision_id: ["leaf"],
}

The use of bare strings means the correction service's ID validation path (if any) is bypassed, and any ULID-specific parsing or ordering logic is untested for two of the three correction subcommands.

Recommendation: Use _seed_decisions() in correction_dry_run() and correction_live_revert() the same way correction_revert_reexecutes() already does, so the tree is built from real decision objects with proper IDs.


[HIGH] F3 — Duplicate CHANGELOG entries for #494

Location: CHANGELOG.md (lines 5-11 of the added content)

The commit adds two separate CHANGELOG entries both tagged with (#494):

- Updated M3 acceptance-gate E2E verification to address review feedback. ... (#494)
- Validated M3 acceptance criteria for v3.2.0 milestone closure. ... (#494)

This appears to be an artifact from the squash of the original commit + the review-feedback commit. A single squashed commit should have a single consolidated CHANGELOG entry. Two entries suggest the CHANGELOG wasn't reconciled during squash.

Recommendation: Merge the two entries into one that covers the final state of the work.


[MEDIUM] F4 — sys.path.insert(0, _SRC) guard removed

Location: robot/helper_m3_e2e_verification.py:33

The old code had a guard preventing duplicate entries:

if _SRC not in sys.path:
    sys.path.insert(0, _SRC)

The new code unconditionally prepends:

sys.path.insert(0, _SRC)

Every Robot subprocess invocation pushes a fresh copy of _SRC onto sys.path. While each subprocess is short-lived (so the impact is minimal), removing a correct guard is a regression. Under pabot parallel execution with process reuse, duplicate entries can accumulate.

Recommendation: Restore the if _SRC not in sys.path: guard.


[MEDIUM] F5 — No --mode=append coverage in the M3 E2E suite

Location: robot/helper_m3_e2e_verification.py (entire file)

The M3 milestone goal (from docs/timeline.md:2090) states:

Users can ... correct decisions (plan correct --mode revert|append) with selective subtree recomputation.

The M3 E2E suite tests only --mode revert. The word "append" does not appear anywhere in the file. Both the issue's acceptance criteria and the milestone v3.2.0 success criteria examples only show --mode=revert, so the commit correctly implements what was asked for.

However, per the specification (docs/specification.md:337, 14799) and the milestone goal, --mode=append is a core correction capability. The testing docs (docs/development/testing.md:1239) confirm that append mode coverage exists only in the M4 correction smoke suite, not M3.

This is an acceptance criteria gap rather than a commit defect, but it means the M3 gate does not validate half of the correction modes that the milestone goal promises.

Recommendation: Either add a basic --mode=append E2E test to the M3 suite, or ensure the M3 milestone description is narrowed to explicitly state that only revert mode is gate-checked in M3 (with append deferred to M4).


[MEDIUM] F6 — Coverage at 96.96% reported as 97%

Source: Rui's issue comment (2026-03-05): "coverage_report: 96.96% (meets --fail-under=97 threshold)"
Commit message: "coverage (97%)"

The actual coverage is 96.96%, which passes only because coverage.py's --fail-under truncates/rounds to integer. The commit message reports "97%" without qualification. While this technically passes the threshold, it is 4 basis points below the stated standard. Any subsequent removal of even a single covered line could break CI.

Recommendation: Clarify the coverage as "96.96% (rounds to 97% for fail-under check)" in the commit message or PR body. Consider whether the 97% threshold should use --precision=2 to enforce the exact number.


[MEDIUM] F7 — Timeout 4x increase (30s to 120s) may mask performance regressions

Location: robot/actor_context_management.robot:92, robot/changeset_persistence.robot:12, robot/decision_di_wiring_smoke.robot:14,21

All three files had their timeouts quadrupled from 30s to 120s. The justification (pabot cold-start + Alembic migrations) is reasonable. However, a test that used to complete in 10s and now takes 90s due to a regression would still pass silently under the 120s timeout.

Recommendation: Consider adding a performance benchmark or assertion (e.g., test must complete within 60s for a warning) rather than only relying on a high timeout ceiling. Alternatively, document the expected normal duration alongside the timeout so reviewers can catch drift.


[LOW] F8 — No negative/error path E2E tests

The 10 test cases exclusively test happy paths. None validate error behavior:

  • plan use with a non-existent action
  • plan tree with an invalid plan ID
  • plan correct on a non-existent decision
  • plan explain with an invalid decision ID
  • invariant add with missing --project argument

The issue's acceptance criteria focus on happy paths, so this is not a defect. However, for a milestone acceptance gate, a few error-path tests would increase confidence that the CLI handles failures gracefully.


[LOW] F9 — _load_json docstring/implementation mismatch

Location: robot/helper_m3_e2e_verification.py:88-115

The docstring says "scans for the final decodable JSON object/array suffix" but the implementation scans from the beginning, returning the first position where JSON starts and extends to the end of the string. This is functionally correct for the use case (structured log lines before a single JSON payload), but the docstring is misleading.

Recommendation: Update the docstring to say "scans forward for the first JSON object/array that extends to end-of-output."


Security

No security issues found. The test file contains no real credentials, API keys, or sensitive data. Hardcoded ULIDs and email addresses are appropriate for test fixtures. The create_autospec(Settings) pattern prevents accidental leakage of real configuration into test execution.


Verdict

The critical finding (F1) regarding inconsistent DecisionService wiring should be addressed before merge, as it undermines the reliability of 4 out of 10 tests. The high findings (F2, F3) are recommended for the same fix pass. The medium and low findings are improvements that could be addressed before or after merge depending on the team's urgency.

## Code Review Report — Commit `e1665c6e` on `test/m3-acceptance-gate` **Reviewer scope:** test coverage, test flaws, performance, bug detection, security **Issue:** #494 — M3 acceptance criteria for v3.2.0 **Specification reference:** `docs/specification.md`, milestone v3.2.0 description, `docs/timeline.md`, `docs/development/testing.md` --- ### Summary The commit is a substantial improvement over the previous fixture-self-assertion approach. The switch to real CLI-path validation, persistence-backed checks, and structured JSON output verification significantly strengthens the M3 acceptance gate. However, I found **9 findings** (1 critical, 2 high, 4 medium, 2 low) that should be considered before merge. --- ### [CRITICAL] F1 — Inconsistent `DecisionService` wiring: some tests lack database isolation **Location:** `robot/helper_m3_e2e_verification.py:290, 344, 678, 769` Four of the ten test subcommands create `DecisionService()` with **no arguments** (no `settings`, no `unit_of_work`): - `decision_tree_view()` (line 290) - `decision_explain()` (line 344) - `decisions_context_snapshot()` (line 678) - `correction_revert_reexecutes()` (line 769) Meanwhile, the two persistence-oriented subcommands (`plan_generates_decisions`, `decision_tree_persistence`) correctly wire up in-memory SQLite: ```python database_url = "sqlite:///:memory:" uow = _make_uow(database_url) settings = _make_settings(database_url) decision_service = DecisionService(settings=settings, unit_of_work=uow) ``` The zero-argument instances fall back to whatever `DecisionService.__init__` defaults to. If the default is an in-memory dict-backed store, this works but doesn't test persistence. If the default resolves the global config's `database_url`, this could create on-disk artifacts, cause cross-test pollution when `pabot` runs 16 parallel processes, or fail in CI environments that lack a configured database path. Either way, the inconsistency means 4 of the 10 tests exercise a different code path than the other 6, weakening the acceptance gate's integrity. **Recommendation:** Wire all `DecisionService` instances through `_make_uow()` + `_make_settings()` for uniform in-memory-SQLite isolation, consistent with how the persistence tests are already written. --- ### [HIGH] F2 — Correction tests use bare string IDs instead of real ULIDs **Location:** `robot/helper_m3_e2e_verification.py:498-501, 588-591` In `correction_dry_run()` and `correction_live_revert()`, the decision tree uses bare string literals as decision IDs: ```python tree = { "root": ["child"], "child": ["grandchild"], } request = service.request_correction( ... target_decision_id="child", ... ) ``` These strings are not valid ULIDs. The production system uses ULIDs for all decision identifiers (per the spec: "ULID-identified"). Meanwhile, `correction_revert_reexecutes()` (line 772-774) correctly uses real decision IDs from `_seed_decisions()`: ```python tree = { root.decision_id: [child.decision_id], child.decision_id: ["leaf"], } ``` The use of bare strings means the correction service's ID validation path (if any) is bypassed, and any ULID-specific parsing or ordering logic is untested for two of the three correction subcommands. **Recommendation:** Use `_seed_decisions()` in `correction_dry_run()` and `correction_live_revert()` the same way `correction_revert_reexecutes()` already does, so the tree is built from real decision objects with proper IDs. --- ### [HIGH] F3 — Duplicate CHANGELOG entries for #494 **Location:** `CHANGELOG.md` (lines 5-11 of the added content) The commit adds **two** separate CHANGELOG entries both tagged with `(#494)`: ``` - Updated M3 acceptance-gate E2E verification to address review feedback. ... (#494) - Validated M3 acceptance criteria for v3.2.0 milestone closure. ... (#494) ``` This appears to be an artifact from the squash of the original commit + the review-feedback commit. A single squashed commit should have a single consolidated CHANGELOG entry. Two entries suggest the CHANGELOG wasn't reconciled during squash. **Recommendation:** Merge the two entries into one that covers the final state of the work. --- ### [MEDIUM] F4 — `sys.path.insert(0, _SRC)` guard removed **Location:** `robot/helper_m3_e2e_verification.py:33` The old code had a guard preventing duplicate entries: ```python if _SRC not in sys.path: sys.path.insert(0, _SRC) ``` The new code unconditionally prepends: ```python sys.path.insert(0, _SRC) ``` Every Robot subprocess invocation pushes a fresh copy of `_SRC` onto `sys.path`. While each subprocess is short-lived (so the impact is minimal), removing a correct guard is a regression. Under `pabot` parallel execution with process reuse, duplicate entries can accumulate. **Recommendation:** Restore the `if _SRC not in sys.path:` guard. --- ### [MEDIUM] F5 — No `--mode=append` coverage in the M3 E2E suite **Location:** `robot/helper_m3_e2e_verification.py` (entire file) The M3 milestone goal (from `docs/timeline.md:2090`) states: > Users can ... correct decisions (`plan correct --mode revert|append`) with selective subtree recomputation. The M3 E2E suite tests **only** `--mode revert`. The word "append" does not appear anywhere in the file. Both the issue's acceptance criteria and the milestone v3.2.0 success criteria examples only show `--mode=revert`, so the commit correctly implements what was asked for. However, per the specification (`docs/specification.md:337, 14799`) and the milestone goal, `--mode=append` is a core correction capability. The testing docs (`docs/development/testing.md:1239`) confirm that append mode coverage exists only in the **M4** correction smoke suite, not M3. This is an acceptance criteria gap rather than a commit defect, but it means the M3 gate does not validate half of the correction modes that the milestone goal promises. **Recommendation:** Either add a basic `--mode=append` E2E test to the M3 suite, or ensure the M3 milestone description is narrowed to explicitly state that only revert mode is gate-checked in M3 (with append deferred to M4). --- ### [MEDIUM] F6 — Coverage at 96.96% reported as 97% **Source:** Rui's issue comment (2026-03-05): "coverage_report: 96.96% (meets `--fail-under=97` threshold)" **Commit message:** "coverage (97%)" The actual coverage is `96.96%`, which passes only because `coverage.py`'s `--fail-under` truncates/rounds to integer. The commit message reports "97%" without qualification. While this technically passes the threshold, it is 4 basis points below the stated standard. Any subsequent removal of even a single covered line could break CI. **Recommendation:** Clarify the coverage as "96.96% (rounds to 97% for fail-under check)" in the commit message or PR body. Consider whether the 97% threshold should use `--precision=2` to enforce the exact number. --- ### [MEDIUM] F7 — Timeout 4x increase (30s to 120s) may mask performance regressions **Location:** `robot/actor_context_management.robot:92`, `robot/changeset_persistence.robot:12`, `robot/decision_di_wiring_smoke.robot:14,21` All three files had their timeouts quadrupled from 30s to 120s. The justification (pabot cold-start + Alembic migrations) is reasonable. However, a test that used to complete in 10s and now takes 90s due to a regression would still pass silently under the 120s timeout. **Recommendation:** Consider adding a performance benchmark or assertion (e.g., test must complete within 60s for a warning) rather than only relying on a high timeout ceiling. Alternatively, document the expected normal duration alongside the timeout so reviewers can catch drift. --- ### [LOW] F8 — No negative/error path E2E tests The 10 test cases exclusively test happy paths. None validate error behavior: - `plan use` with a non-existent action - `plan tree` with an invalid plan ID - `plan correct` on a non-existent decision - `plan explain` with an invalid decision ID - `invariant add` with missing `--project` argument The issue's acceptance criteria focus on happy paths, so this is not a defect. However, for a milestone acceptance gate, a few error-path tests would increase confidence that the CLI handles failures gracefully. --- ### [LOW] F9 — `_load_json` docstring/implementation mismatch **Location:** `robot/helper_m3_e2e_verification.py:88-115` The docstring says *"scans for the final decodable JSON object/array suffix"* but the implementation scans from the **beginning**, returning the **first** position where JSON starts and extends to the end of the string. This is functionally correct for the use case (structured log lines before a single JSON payload), but the docstring is misleading. **Recommendation:** Update the docstring to say *"scans forward for the first JSON object/array that extends to end-of-output."* --- ### Security No security issues found. The test file contains no real credentials, API keys, or sensitive data. Hardcoded ULIDs and email addresses are appropriate for test fixtures. The `create_autospec(Settings)` pattern prevents accidental leakage of real configuration into test execution. --- ### Verdict The **critical finding (F1)** regarding inconsistent `DecisionService` wiring should be addressed before merge, as it undermines the reliability of 4 out of 10 tests. The **high findings (F2, F3)** are recommended for the same fix pass. The medium and low findings are improvements that could be addressed before or after merge depending on the team's urgency.
CHANGELOG.md Outdated
@ -2,6 +2,16 @@
## Unreleased
- Updated M3 acceptance-gate E2E verification to address review feedback.
Member

[HIGH] F3 — Two separate CHANGELOG entries both tagged (#494). This appears to be a squash artifact from combining the original commit + review-feedback commit. A single squashed commit should have one consolidated entry.

Recommendation: Merge these into a single entry covering the final state of the work.

**[HIGH] F3** — Two separate CHANGELOG entries both tagged `(#494)`. This appears to be a squash artifact from combining the original commit + review-feedback commit. A single squashed commit should have one consolidated entry. **Recommendation:** Merge these into a single entry covering the final state of the work.
hurui200320 marked this conversation as resolved
@ -32,3 +32,2 @@
_SRC = str(Path(__file__).resolve().parents[1] / "src")
if _SRC not in sys.path:
sys.path.insert(0, _SRC)
sys.path.insert(0, _SRC)
Member

[MEDIUM] F4 — The old code had if _SRC not in sys.path: guard to prevent duplicate entries. This unconditional sys.path.insert(0, _SRC) pushes a duplicate on every invocation.

Recommendation: Restore the guard: if _SRC not in sys.path:

**[MEDIUM] F4** — The old code had `if _SRC not in sys.path:` guard to prevent duplicate entries. This unconditional `sys.path.insert(0, _SRC)` pushes a duplicate on every invocation. **Recommendation:** Restore the guard: `if _SRC not in sys.path:`
hurui200320 marked this conversation as resolved
@ -104,3 +90,2 @@
Exercises the real CLI rendering/serialization path with a mocked
lifecycle service that returns *plan*.
Some CLI paths emit structured logs before the JSON payload. This
Member

[LOW] F9 — The docstring says "scans for the final decodable JSON object/array suffix" but the implementation scans from the beginning and returns the first position where JSON starts and extends to end-of-string. The behavior is correct for the use case, but the docstring is misleading.

Recommendation: Update to "scans forward for the first JSON object/array that extends to end-of-output."

**[LOW] F9** — The docstring says *"scans for the final decodable JSON object/array suffix"* but the implementation scans from the **beginning** and returns the first position where JSON starts and extends to end-of-string. The behavior is correct for the use case, but the docstring is misleading. **Recommendation:** Update to *"scans forward for the first JSON object/array that extends to end-of-output."*
@ -316,2 +288,3 @@
def decision_tree_view() -> None:
"""View the decision tree structure and verify rendering.
"""Invoke ``plan tree`` CLI and validate rendered hierarchy."""
decision_service = DecisionService()
Member

[CRITICAL] F1DecisionService() is created with no arguments here (and at lines 344, 678, 769), unlike plan_generates_decisions() and decision_tree_persistence() which correctly wire in-memory SQLite via _make_uow() + _make_settings(). This inconsistency means 4/10 tests use a different storage backend than the other 6, risking cross-test pollution under pabot and weakening the acceptance gate.

Recommendation: Wire through _make_uow() + _make_settings() for uniform in-memory-SQLite isolation.

**[CRITICAL] F1** — `DecisionService()` is created with no arguments here (and at lines 344, 678, 769), unlike `plan_generates_decisions()` and `decision_tree_persistence()` which correctly wire in-memory SQLite via `_make_uow()` + `_make_settings()`. This inconsistency means 4/10 tests use a different storage backend than the other 6, risking cross-test pollution under `pabot` and weakening the acceptance gate. **Recommendation:** Wire through `_make_uow()` + `_make_settings()` for uniform in-memory-SQLite isolation.
@ -544,1 +496,3 @@
_CHILD_DEC_ID: [_GRANDCHILD_DEC_ID],
"""Validate dry-run correction impact analysis and CLI wiring."""
service = CorrectionService()
tree = {
Member

[HIGH] F2 — The decision tree here uses bare string literals ("root", "child", "grandchild") as decision IDs instead of real ULIDs. The production system uses ULIDs for all decision identifiers. Compare with correction_revert_reexecutes() (line 772) which correctly uses real decision IDs from _seed_decisions(). This bypasses any ULID validation or ordering logic in the correction service.

Recommendation: Use _seed_decisions() to build the tree from real decision objects, consistent with correction_revert_reexecutes().

**[HIGH] F2** — The decision tree here uses bare string literals (`"root"`, `"child"`, `"grandchild"`) as decision IDs instead of real ULIDs. The production system uses ULIDs for all decision identifiers. Compare with `correction_revert_reexecutes()` (line 772) which correctly uses real decision IDs from `_seed_decisions()`. This bypasses any ULID validation or ordering logic in the correction service. **Recommendation:** Use `_seed_decisions()` to build the tree from real decision objects, consistent with `correction_revert_reexecutes()`.
hurui200320 force-pushed test/m3-acceptance-gate from e1665c6eaa
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 17s
CI / security (pull_request) Successful in 33s
CI / typecheck (pull_request) Successful in 53s
CI / unit_tests (pull_request) Successful in 3m26s
CI / docker (pull_request) Successful in 39s
CI / integration_tests (pull_request) Successful in 4m14s
CI / coverage (pull_request) Successful in 4m20s
CI / benchmark-regression (pull_request) Successful in 27m47s
to 444ae0677a
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 14s
CI / build (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 17s
CI / typecheck (pull_request) Successful in 34s
CI / security (pull_request) Successful in 35s
CI / unit_tests (pull_request) Successful in 2m5s
CI / integration_tests (pull_request) Successful in 2m56s
CI / docker (pull_request) Successful in 50s
CI / coverage (pull_request) Successful in 4m20s
CI / benchmark-regression (pull_request) Successful in 27m38s
2026-03-05 11:46:33 +00:00
Compare
Author
Member

Review Response — Commit 444ae067

@CoreRasurae — thank you for the thorough second review. All actionable findings have been addressed in the amended commit 444ae067. Here's the disposition for each:

Fixed (code changes)

Finding Fix
[F1 CRITICAL] Inconsistent DecisionService wiring All 4 bare DecisionService() calls (decision_tree_view, decision_explain, decisions_context_snapshot, correction_revert_reexecutes) now wire through _make_uow("sqlite:///:memory:") + _make_settings() for uniform in-memory SQLite isolation.
[F2 HIGH] Bare string IDs in correction tests correction_dry_run() and correction_live_revert() now use _seed_decisions() to create real Decision objects with valid ULID IDs, matching the pattern already used by correction_revert_reexecutes().
[F3 HIGH] Duplicate CHANGELOG entries Consolidated the two (#494) entries into a single entry covering the final state.
[F4 MEDIUM] sys.path guard removed Restored if _SRC not in sys.path: guard before sys.path.insert(0, _SRC).
[F7 MEDIUM] Timeout increase may mask regressions Added expected-duration comments (# Normal duration: ~X-Ys) alongside all 3 timeout increases so reviewers can catch performance drift.
[F9 LOW] _load_json docstring mismatch Updated docstring to "scans forward for the first JSON object/array that extends to end-of-output" to match the actual implementation.

Acknowledged — no code change

Finding Disposition
[F5 MEDIUM] No --mode=append coverage Correct observation. However, the issue #494 acceptance criteria and milestone v3.2.0 success criteria examples only specify --mode=revert. The testing.md docs confirm append mode coverage belongs to the M4 correction smoke suite, not M3. No change needed for this gate.
[F6 MEDIUM] Coverage 96.96% reported as 97% Already clarified in the PR body as "96.96% → rounds to 97% (meets --fail-under=97)". The --precision=2 suggestion is a good idea for a future CI hardening ticket.
[F8 LOW] No negative/error path tests Agreed this would be valuable, but the acceptance criteria focus on happy paths. Negative tests could be a follow-up issue post-milestone closure.

Verification

  • nox -s integration_tests -- --processes 1 --include acceptance_gate: 10/10 PASS
  • nox -s lint format typecheck security_scan dead_code: all PASS
  • nox -s integration_tests: 1194/1194 PASS
  • nox -s coverage_report: 97% (meets threshold)
  • nox -s unit_tests: 8523 passed, 1 flaky failure (repositories_uncovered_branches.feature:110 — passes in isolation, race condition under pabot parallelism, unrelated to our changes)
  • nox -s docs build benchmark: all PASS
## Review Response — Commit `444ae067` @CoreRasurae — thank you for the thorough second review. All actionable findings have been addressed in the amended commit `444ae067`. Here's the disposition for each: ### Fixed (code changes) | Finding | Fix | |---------|-----| | **[F1 CRITICAL] Inconsistent DecisionService wiring** | All 4 bare `DecisionService()` calls (`decision_tree_view`, `decision_explain`, `decisions_context_snapshot`, `correction_revert_reexecutes`) now wire through `_make_uow("sqlite:///:memory:")` + `_make_settings()` for uniform in-memory SQLite isolation. | | **[F2 HIGH] Bare string IDs in correction tests** | `correction_dry_run()` and `correction_live_revert()` now use `_seed_decisions()` to create real Decision objects with valid ULID IDs, matching the pattern already used by `correction_revert_reexecutes()`. | | **[F3 HIGH] Duplicate CHANGELOG entries** | Consolidated the two `(#494)` entries into a single entry covering the final state. | | **[F4 MEDIUM] sys.path guard removed** | Restored `if _SRC not in sys.path:` guard before `sys.path.insert(0, _SRC)`. | | **[F7 MEDIUM] Timeout increase may mask regressions** | Added expected-duration comments (`# Normal duration: ~X-Ys`) alongside all 3 timeout increases so reviewers can catch performance drift. | | **[F9 LOW] _load_json docstring mismatch** | Updated docstring to "scans forward for the first JSON object/array that extends to end-of-output" to match the actual implementation. | ### Acknowledged — no code change | Finding | Disposition | |---------|-------------| | **[F5 MEDIUM] No `--mode=append` coverage** | Correct observation. However, the issue #494 acceptance criteria and milestone v3.2.0 success criteria examples only specify `--mode=revert`. The `testing.md` docs confirm append mode coverage belongs to the M4 correction smoke suite, not M3. No change needed for this gate. | | **[F6 MEDIUM] Coverage 96.96% reported as 97%** | Already clarified in the PR body as "96.96% → rounds to 97% (meets `--fail-under=97`)". The `--precision=2` suggestion is a good idea for a future CI hardening ticket. | | **[F8 LOW] No negative/error path tests** | Agreed this would be valuable, but the acceptance criteria focus on happy paths. Negative tests could be a follow-up issue post-milestone closure. | ### Verification - `nox -s integration_tests -- --processes 1 --include acceptance_gate`: **10/10 PASS** - `nox -s lint format typecheck security_scan dead_code`: all PASS - `nox -s integration_tests`: **1194/1194 PASS** - `nox -s coverage_report`: **97% (meets threshold)** - `nox -s unit_tests`: 8523 passed, 1 flaky failure (`repositories_uncovered_branches.feature:110` — passes in isolation, race condition under pabot parallelism, unrelated to our changes) - `nox -s docs build benchmark`: all PASS
Author
Member

Also about the timeout, for some reason (mostly because I only have 16 cores while the CI runner has 32 cores), the tests run very slowly on my machine, and sometimes the agent will run unit tests and integration tests at the same time, which single one of them will already use all of my CPU cores. When those two tests running in parallel, it will take longer and the 30s timeout will fail the test.

Also about the timeout, for some reason (mostly because I only have 16 cores while the CI runner has 32 cores), the tests run very slowly on my machine, and sometimes the agent will run unit tests and integration tests at the same time, which single one of them will already use all of my CPU cores. When those two tests running in parallel, it will take longer and the 30s timeout will fail the test.
CoreRasurae left a comment

Code Review Report

Commit: 444ae0677a by Rui Hu on test/m3-acceptance-gate
Issue: #494 — M3 acceptance criteria for v3.2.0 milestone closure
Files changed: 7 (CHANGELOG.md, CONTRIBUTORS.md, 3 robot suites, helper_m3_e2e_verification.py, m3_e2e_verification.robot)


1. BUG DETECTION

[BUG-1] CRITICAL — CLI plan correct calls analyze_impact() and execute_correction() without passing the decision tree

The CLI handler in src/cleveragents/cli/commands/plan.py:2422 calls:

impact = svc.analyze_impact(request.correction_id)

But CorrectionService.analyze_impact() signature is:

def analyze_impact(self, correction_id, decision_tree=None, influence_edges=None)

When decision_tree is None, the BFS traversal uses an empty dict and only returns the target decision itself — never its downstream subtree. The same problem exists for execute_correction() at line 2468.

This means in production, agents plan correct --dry-run will always report an incomplete impact analysis (only the target node, missing all downstream affected decisions), and agents plan correct --mode revert will only revert the single target decision, never cascading to descendants.

The test in correction_dry_run() masks this bug: the direct service call at helper_m3_e2e_verification.py:524 correctly passes the tree and validates 2 affected decisions, but the CLI mock at line 546–547 returns hardcoded impact data, so the test passes despite the CLI never actually computing the full subtree.

Recommendation: The CLI handler should resolve the decision tree from the database and pass it to analyze_impact() and execute_correction(). Alternatively, these service methods should fetch the tree internally when not provided. This is a functional defect in the production code that the E2E tests should be designed to catch.


[BUG-2] MEDIUM — CorrectionService() instantiated without persistence backing in tests

In helper_m3_e2e_verification.py:511 and :607:

service = CorrectionService()

This creates a CorrectionService with checkpoint_service=None. Meanwhile, decisions are seeded via DecisionService(settings=settings, unit_of_work=uow) with an in-memory SQLite backing. The CorrectionService has no reference to this UnitOfWork, so request_correction() stores the request only in its in-memory _requests dict — there is no persistence validation of the correction against the actual decision data in the database.

This means the tests verify the correction algorithm (BFS traversal of the tree dict) but never verify that CorrectionService can actually look up and validate the target decision from the database.


2. TEST FLAWS

[TEST-1] HIGH — Mock assertions mask the tree-passing discrepancy

In correction_dry_run():

What is tested How analyze_impact is called Affected decisions verified
Direct service (line 524) analyze_impact(id, tree) — 2 args child + grandchild (correct)
CLI mock (line 588) assert_called_once_with(id) — 1 arg Hardcoded mock returns both

The mock assertion at line 588 confirms the CLI only passes 1 argument, but the test still passes because the mock returns a pre-built CorrectionImpact with both decisions. The test never proves the CLI path would produce the correct impact analysis in a real scenario.

The same pattern exists in correction_live_revert() at lines 679–686.

Recommendation: Either refactor these tests to use real services end-to-end (similar to plan_generates_decisions()), or add an explicit test that invokes the CLI with a real CorrectionService wired to a real UnitOfWork and verifies the output includes the full affected subtree.


[TEST-2] MEDIUM — Inconsistent service resolution mocking across tests

The tests use three different patching strategies to inject services:

Test function Patch target
plan_generates_decisions() cli.commands.plan._get_lifecycle_service
decision_tree_view() application.container.get_container
decision_explain() application.container.get_container
invariant_add_and_list() cli.commands.invariant._get_service
correction_dry_run() services.correction_service.CorrectionService (class constructor)

Each approach exercises a different resolution path. If the real DI wiring changes (e.g., all commands start using the container), some tests would silently become no-ops while others continue to work.

Recommendation: Standardize on one patching strategy — preferably the DI container approach used in decision_tree_view() — or document why each test requires a different injection mechanism.


[TEST-3] MEDIUM — invariant_add_and_list() is fully mocked, unlike other tests

While plan_generates_decisions(), decisions_context_snapshot(), and decision_tree_persistence() all use real services backed by in-memory SQLite, invariant_add_and_list() at line 428 uses a pure MagicMock(). This means the test validates only CLI argument parsing and JSON rendering — not that invariants are actually persisted, scoped correctly, or retrievable through the same service instance.

Recommendation: Consider adding a persistence-backed path (similar to decision_tree_persistence()) that creates an InvariantService with a real UnitOfWork, adds an invariant, and lists it back.


[TEST-4] LOW — No negative/failure test cases

All 10 test cases are happy-path only. There are no tests for:

  • Invalid decision IDs or plan IDs
  • Missing required --mode or --guidance flags
  • Correction of an already-superseded decision
  • Correction where the affected subtree includes an already-applied child plan (spec says this should be rejected)
  • Invariant add without any scope flag

While the acceptance criteria only mandate positive validation, negative tests would strengthen confidence in the CLI's error handling.


3. SPECIFICATION DEVIATIONS

[SPEC-1] MEDIUM — plan correct uses --plan flag not in the specification

The spec at docs/specification.md line 337–338 defines:

agents plan correct --mode (revert|append) (--guidance|-g) <GUIDANCE>
                   [--dry-run] [--yes|-y] <DECISION_ID>

No --plan flag exists in the spec. However, the implementation adds --plan / -p as an optional flag (defaults to resolving the latest active plan). The tests at helper_m3_e2e_verification.py:563 and :659 pass --plan explicitly.

The milestone acceptance criteria in issue #494 show the command without --plan:

agents plan correct <decision_id> --mode=revert --guidance "..." --dry-run

Impact: The test exercises a code path (explicit --plan) that the acceptance criteria does not require. The implicit plan resolution path (_resolve_active_plan_id()) is never tested.

Recommendation: Add a test variant that omits --plan to verify the default resolution logic, which is the path users will exercise per the spec and milestone description.


[SPEC-2] LOW — Missing test coverage for --show-superseded flag on plan tree

The spec defines agents plan tree [--show-superseded] <PLAN_ID> (line 335). The correction_revert_reexecutes() test creates superseded decisions but never invokes plan tree --show-superseded to verify they render.


4. PERFORMANCE

[PERF-1] LOW — Blanket 30s to 120s timeout increase across unrelated suites

The commit raises timeouts from 30s to 120s in three unrelated robot files:

  • robot/actor_context_management.robot
  • robot/changeset_persistence.robot
  • robot/decision_di_wiring_smoke.robot

The rationale (pabot cold-start with 16 parallel processes + Alembic migration overhead) is reasonable, but a 4x increase means a genuinely stuck test now blocks CI for 2 minutes before failing. If tests normally complete in 5–10s, a 60s timeout would provide the same cold-start safety with better failure detection.

Recommendation: Consider 60s as a middle ground, or add process-level health checks so hung tests are detected earlier.


[PERF-2] INFO — _load_json() worst-case quadratic scanning

The _load_json() helper at line 105–115 iterates character-by-character and attempts raw_decode() at every { or [. For CLI output with many opening braces scattered in log lines before the JSON payload, this is worst-case O(n * m). Not a practical concern for test output sizes, but worth noting.


5. SECURITY

[SEC-1] INFO — No security issues found

The commit contains no credentials, no secrets, no path traversal risks, and no untrusted input processing. All database access uses in-memory SQLite. The sys.path manipulation at line 32–34 is standard test helper practice and carries no risk in this context.


6. OVERALL ASSESSMENT

What the commit does well:

  • Shifts from mock-heavy to persistence-backed testing with real UnitOfWork + in-memory SQLite — a significant quality improvement
  • Uses deterministic hardcoded ULIDs for reproducibility
  • Matches acceptance criteria project name (local/large-project) and invariant text (Use session cookies) exactly
  • Adds explicit Robot Framework timeouts and suite-level tagging for filtering
  • _load_json() is a thoughtful helper that handles structured-log-before-JSON edge cases
  • All 10 acceptance criteria from issue #494 have corresponding test cases

Summary of findings by severity:

Severity ID Category Summary
CRITICAL BUG-1 Bug CLI plan correct never passes decision tree to analyze_impact/execute_correction, producing incomplete results
HIGH TEST-1 Test Flaw Mock assertions mask BUG-1; CLI and service paths test different contracts
MEDIUM BUG-2 Bug CorrectionService() created without DB backing; no persistence validation
MEDIUM TEST-2 Test Flaw Three different mocking strategies across tests; fragile to DI changes
MEDIUM TEST-3 Test Flaw Invariant tests fully mocked, no persistence round-trip
MEDIUM SPEC-1 Spec Deviation --plan flag used in tests but absent from spec and acceptance criteria
LOW TEST-4 Coverage Gap No negative/failure test cases
LOW SPEC-2 Coverage Gap --show-superseded flag untested
LOW PERF-1 Performance Blanket 4x timeout increase; 60s may suffice
INFO PERF-2 Performance _load_json worst-case quadratic (not practical concern)
INFO SEC-1 Security No issues

BUG-1 is the most actionable finding. It reveals a defect in the production CLI handler that the E2E tests were designed to catch but inadvertently mask through mocking. Fixing it requires either changing the CLI handler to pass the tree, or making CorrectionService.analyze_impact() resolve the tree from the database when not provided.

# Code Review Report **Commit:** `444ae0677a` by **Rui Hu** on `test/m3-acceptance-gate` **Issue:** #494 — M3 acceptance criteria for v3.2.0 milestone closure **Files changed:** 7 (CHANGELOG.md, CONTRIBUTORS.md, 3 robot suites, helper\_m3\_e2e\_verification.py, m3\_e2e\_verification.robot) --- ## 1. BUG DETECTION ### [BUG-1] CRITICAL — CLI `plan correct` calls `analyze_impact()` and `execute_correction()` without passing the decision tree The CLI handler in `src/cleveragents/cli/commands/plan.py:2422` calls: ```python impact = svc.analyze_impact(request.correction_id) ``` But `CorrectionService.analyze_impact()` signature is: ```python def analyze_impact(self, correction_id, decision_tree=None, influence_edges=None) ``` When `decision_tree` is `None`, the BFS traversal uses an empty dict and **only returns the target decision itself** — never its downstream subtree. The same problem exists for `execute_correction()` at line 2468. This means **in production**, `agents plan correct --dry-run` will always report an incomplete impact analysis (only the target node, missing all downstream affected decisions), and `agents plan correct --mode revert` will only revert the single target decision, never cascading to descendants. The test in `correction_dry_run()` **masks this bug**: the direct service call at `helper_m3_e2e_verification.py:524` correctly passes the tree and validates 2 affected decisions, but the CLI mock at line 546–547 returns hardcoded impact data, so the test passes despite the CLI never actually computing the full subtree. **Recommendation:** The CLI handler should resolve the decision tree from the database and pass it to `analyze_impact()` and `execute_correction()`. Alternatively, these service methods should fetch the tree internally when not provided. This is a functional defect in the production code that the E2E tests should be designed to catch. --- ### [BUG-2] MEDIUM — `CorrectionService()` instantiated without persistence backing in tests In `helper_m3_e2e_verification.py:511` and `:607`: ```python service = CorrectionService() ``` This creates a `CorrectionService` with `checkpoint_service=None`. Meanwhile, decisions are seeded via `DecisionService(settings=settings, unit_of_work=uow)` with an in-memory SQLite backing. The `CorrectionService` has no reference to this UnitOfWork, so `request_correction()` stores the request only in its in-memory `_requests` dict — there is no persistence validation of the correction against the actual decision data in the database. This means the tests verify the correction *algorithm* (BFS traversal of the tree dict) but never verify that CorrectionService can actually look up and validate the target decision from the database. --- ## 2. TEST FLAWS ### [TEST-1] HIGH — Mock assertions mask the tree-passing discrepancy In `correction_dry_run()`: | What is tested | How `analyze_impact` is called | Affected decisions verified | |---|---|---| | Direct service (line 524) | `analyze_impact(id, tree)` — 2 args | child + grandchild (correct) | | CLI mock (line 588) | `assert_called_once_with(id)` — 1 arg | Hardcoded mock returns both | The mock assertion at line 588 **confirms** the CLI only passes 1 argument, but the test still passes because the mock returns a pre-built `CorrectionImpact` with both decisions. The test never proves the CLI path would produce the correct impact analysis in a real scenario. The same pattern exists in `correction_live_revert()` at lines 679–686. **Recommendation:** Either refactor these tests to use real services end-to-end (similar to `plan_generates_decisions()`), or add an explicit test that invokes the CLI with a real CorrectionService wired to a real UnitOfWork and verifies the output includes the full affected subtree. --- ### [TEST-2] MEDIUM — Inconsistent service resolution mocking across tests The tests use three different patching strategies to inject services: | Test function | Patch target | |---|---| | `plan_generates_decisions()` | `cli.commands.plan._get_lifecycle_service` | | `decision_tree_view()` | `application.container.get_container` | | `decision_explain()` | `application.container.get_container` | | `invariant_add_and_list()` | `cli.commands.invariant._get_service` | | `correction_dry_run()` | `services.correction_service.CorrectionService` (class constructor) | Each approach exercises a different resolution path. If the real DI wiring changes (e.g., all commands start using the container), some tests would silently become no-ops while others continue to work. **Recommendation:** Standardize on one patching strategy — preferably the DI container approach used in `decision_tree_view()` — or document why each test requires a different injection mechanism. --- ### [TEST-3] MEDIUM — `invariant_add_and_list()` is fully mocked, unlike other tests While `plan_generates_decisions()`, `decisions_context_snapshot()`, and `decision_tree_persistence()` all use real services backed by in-memory SQLite, `invariant_add_and_list()` at line 428 uses a pure `MagicMock()`. This means the test validates only CLI argument parsing and JSON rendering — not that invariants are actually persisted, scoped correctly, or retrievable through the same service instance. **Recommendation:** Consider adding a persistence-backed path (similar to `decision_tree_persistence()`) that creates an `InvariantService` with a real UnitOfWork, adds an invariant, and lists it back. --- ### [TEST-4] LOW — No negative/failure test cases All 10 test cases are happy-path only. There are no tests for: - Invalid decision IDs or plan IDs - Missing required `--mode` or `--guidance` flags - Correction of an already-superseded decision - Correction where the affected subtree includes an already-applied child plan (spec says this should be rejected) - Invariant add without any scope flag While the acceptance criteria only mandate positive validation, negative tests would strengthen confidence in the CLI's error handling. --- ## 3. SPECIFICATION DEVIATIONS ### [SPEC-1] MEDIUM — `plan correct` uses `--plan` flag not in the specification The spec at `docs/specification.md` line 337–338 defines: ``` agents plan correct --mode (revert|append) (--guidance|-g) <GUIDANCE> [--dry-run] [--yes|-y] <DECISION_ID> ``` No `--plan` flag exists in the spec. However, the implementation adds `--plan` / `-p` as an optional flag (defaults to resolving the latest active plan). The tests at `helper_m3_e2e_verification.py:563` and `:659` pass `--plan` explicitly. The milestone acceptance criteria in issue #494 show the command **without** `--plan`: ``` agents plan correct <decision_id> --mode=revert --guidance "..." --dry-run ``` **Impact:** The test exercises a code path (explicit `--plan`) that the acceptance criteria does not require. The implicit plan resolution path (`_resolve_active_plan_id()`) is never tested. **Recommendation:** Add a test variant that omits `--plan` to verify the default resolution logic, which is the path users will exercise per the spec and milestone description. --- ### [SPEC-2] LOW — Missing test coverage for `--show-superseded` flag on `plan tree` The spec defines `agents plan tree [--show-superseded] <PLAN_ID>` (line 335). The `correction_revert_reexecutes()` test creates superseded decisions but never invokes `plan tree --show-superseded` to verify they render. --- ## 4. PERFORMANCE ### [PERF-1] LOW — Blanket 30s to 120s timeout increase across unrelated suites The commit raises timeouts from 30s to 120s in three unrelated robot files: - `robot/actor_context_management.robot` - `robot/changeset_persistence.robot` - `robot/decision_di_wiring_smoke.robot` The rationale (pabot cold-start with 16 parallel processes + Alembic migration overhead) is reasonable, but a 4x increase means a genuinely stuck test now blocks CI for 2 minutes before failing. If tests normally complete in 5–10s, a 60s timeout would provide the same cold-start safety with better failure detection. **Recommendation:** Consider 60s as a middle ground, or add process-level health checks so hung tests are detected earlier. --- ### [PERF-2] INFO — `_load_json()` worst-case quadratic scanning The `_load_json()` helper at line 105–115 iterates character-by-character and attempts `raw_decode()` at every `{` or `[`. For CLI output with many opening braces scattered in log lines before the JSON payload, this is worst-case O(n * m). Not a practical concern for test output sizes, but worth noting. --- ## 5. SECURITY ### [SEC-1] INFO — No security issues found The commit contains no credentials, no secrets, no path traversal risks, and no untrusted input processing. All database access uses in-memory SQLite. The `sys.path` manipulation at line 32–34 is standard test helper practice and carries no risk in this context. --- ## 6. OVERALL ASSESSMENT ### What the commit does well: - Shifts from mock-heavy to persistence-backed testing with real `UnitOfWork` + in-memory SQLite — a significant quality improvement - Uses deterministic hardcoded ULIDs for reproducibility - Matches acceptance criteria project name (`local/large-project`) and invariant text (`Use session cookies`) exactly - Adds explicit Robot Framework timeouts and suite-level tagging for filtering - `_load_json()` is a thoughtful helper that handles structured-log-before-JSON edge cases - All 10 acceptance criteria from issue #494 have corresponding test cases ### Summary of findings by severity: | Severity | ID | Category | Summary | |---|---|---|---| | **CRITICAL** | BUG-1 | Bug | CLI `plan correct` never passes decision tree to `analyze_impact`/`execute_correction`, producing incomplete results | | **HIGH** | TEST-1 | Test Flaw | Mock assertions mask BUG-1; CLI and service paths test different contracts | | **MEDIUM** | BUG-2 | Bug | `CorrectionService()` created without DB backing; no persistence validation | | **MEDIUM** | TEST-2 | Test Flaw | Three different mocking strategies across tests; fragile to DI changes | | **MEDIUM** | TEST-3 | Test Flaw | Invariant tests fully mocked, no persistence round-trip | | **MEDIUM** | SPEC-1 | Spec Deviation | `--plan` flag used in tests but absent from spec and acceptance criteria | | **LOW** | TEST-4 | Coverage Gap | No negative/failure test cases | | **LOW** | SPEC-2 | Coverage Gap | `--show-superseded` flag untested | | **LOW** | PERF-1 | Performance | Blanket 4x timeout increase; 60s may suffice | | **INFO** | PERF-2 | Performance | `_load_json` worst-case quadratic (not practical concern) | | **INFO** | SEC-1 | Security | No issues | --- **BUG-1 is the most actionable finding.** It reveals a defect in the production CLI handler that the E2E tests were designed to catch but inadvertently mask through mocking. Fixing it requires either changing the CLI handler to pass the tree, or making `CorrectionService.analyze_impact()` resolve the tree from the database when not provided.
CoreRasurae requested changes 2026-03-05 12:58:44 +00:00
Dismissed
CoreRasurae left a comment

Code Review: M3 Acceptance Gate E2E Verification

The rework from mock-heavy domain-only tests to CLI-integrated tests with real PlanLifecycleService, DecisionService, and in-memory SQLite persistence is a meaningful improvement. However, this review identified 12 findings (3 High, 6 Medium, 4 Low) across test coverage, test flaws, a potential production bug, and documentation.

The most critical finding is that the plan correct CLI command never passes the decision tree to CorrectionService, meaning the affected subtree computation in production would only ever return the single target decision. The test mocks completely hide this gap.


HIGH SEVERITY

H1. CLI Correction Path Never Passes Decision Tree -- Mock Hides Integration Gap

Type: Bug / Test Flaw | File: robot/helper_m3_e2e_verification.py:549-588

The correct_decision CLI handler (plan.py:2409-2468) creates a bare CorrectionService() and calls svc.analyze_impact(request.correction_id) and svc.execute_correction(request.correction_id) without passing decision_tree or influence_edges. Both default to None.

CorrectionService._compute_affected_subtree() with decision_tree=None performs a BFS with no edges, returning only {target_decision_id}. This means in production, dry-run and live correction via CLI would always report only 1 affected decision, ignoring all descendants.

The test mock assertions confirm the current CLI behavior:

mock_service.analyze_impact.assert_called_once_with(mock_request.correction_id)
mock_service.execute_correction.assert_called_once_with(mock_request.correction_id)

But the mock's pre-configured return values hide the bug. The helper's own domain-level tests correctly pass a tree dict, which is why those pass. The gap is in the CLI-to-service integration.

Recommendation: Either the CLI should query the decision tree from DecisionService.get_tree(plan_id) and pass it, or CorrectionService should be wired with a DecisionService dependency and query it internally. The test should verify the tree is propagated.


H2. CorrectionService and DecisionService Are Not Integrated Through Shared Persistence

Type: Test Flaw | File: robot/helper_m3_e2e_verification.py:503-525, 599-620

In correction_dry_run() and correction_live_revert(), decisions are seeded via DecisionService with a UnitOfWork backed by in-memory SQLite. However, CorrectionService() is instantiated independently with no access to the UoW or DecisionService. The tree is manually constructed:

tree = {
    root.decision_id: [child.decision_id],
    child.decision_id: [grandchild.decision_id],
}

This bypasses the actual production path where the tree would be derived from the database. The correction tests validate CorrectionService in isolation, never testing that it can read the persisted decision tree from the same database.

Recommendation: Wire CorrectionService with the same DecisionService / UnitOfWork, and let it derive the tree internally.


H3. Invariant Add/List Uses Separate Mocks -- No Round-Trip Verification

Type: Test Flaw | File: robot/helper_m3_e2e_verification.py:418-493

The acceptance criteria implies:

  • invariant add --project ... adds an invariant
  • invariant list --project ... lists what was just added

The test uses two independent MagicMock services (add_service and list_service). The list mock returns a pre-configured list never populated by the add call. The round-trip is never verified.

Recommendation: Use a single real InvariantService() (or shared mock) so list returns what add created.


MEDIUM SEVERITY

M1. Fragile String Comparison for decision_type

File: robot/helper_m3_e2e_verification.py:259

if str(strategize_decisions[0].decision_type) != "strategy_choice":

DecisionType is a StrEnum. Use direct enum comparison instead:

if strategize_decisions[0].decision_type != DecisionType.STRATEGY_CHOICE:

M2. Live Correction Test Lost Root Boundary Validation

File: robot/helper_m3_e2e_verification.py:621-627

The old code explicitly verified the root was NOT reverted. The new code only checks child/grandchild ARE reverted, not that root is excluded. The spec says correction operates on the affected subtree only -- boundary validation matters.

Recommendation: Add if root.decision_id in result.reverted_decisions: _fail(...).


M3. decision_explain Does Not Verify confidence_score

File: robot/helper_m3_e2e_verification.py:378-408

The spec requires confidence_score (0.0-1.0) on decisions. The acceptance criterion for plan explain is "shows full decision context." This test checks decision_id, question, chosen, rationale, and context_snapshot but omits confidence_score.


M4. _load_json Can Silently Skip Error Prefixes

File: robot/helper_m3_e2e_verification.py:89-117

The parser scans forward for the first JSON that extends to end-of-output. If the CLI emits warnings or partial errors before JSON, they are silently discarded. A CLI command that partially fails but still emits valid JSON would pass the test.

Recommendation: Log or fail when non-JSON prefix content is found.


M5. correction_revert_reexecutes Uses Synthetic "leaf" String as Decision ID

File: robot/helper_m3_e2e_verification.py:798-801

tree = {
    root.decision_id: [child.decision_id],
    child.decision_id: ["leaf"],
}

"leaf" is not a real decision ID. _seed_decisions returns a grandchild, but it is discarded and a fake string is used instead. The CorrectionService BFS will include "leaf" in affected decisions without validation.

Recommendation: Use grandchild.decision_id to keep the tree consistent with seeded data.


M6. decision_tree_view and decision_explain Never Verify Plan Existence

File: robot/helper_m3_e2e_verification.py:290-337, 347-408

Decisions are seeded with _PLAN_ULID but no Plan object exists for that ID. The CLI commands plan tree and plan explain only resolve DecisionService from the container. If the CLI is hardened to validate plan existence, these tests will break.


LOW SEVERITY

L1. _make_settings Only Sets Two Attributes on Autospec Mock

File: robot/helper_m3_e2e_verification.py:127-132

create_autospec(Settings) returns MagicMock for any unset attribute. Only database_url and async_enabled are configured. If services access other settings, they get truthy MagicMock objects causing silent false-positives.

L2. Timeout Inconsistency Between Commit Message and Code

Commit message says "Added explicit Robot process timeouts (60s)" but the changes to actor_context_management.robot, changeset_persistence.robot, and decision_di_wiring_smoke.robot increase timeouts to 120s, not 60s. The M3 suite uses 60s.

L3. No Test for --show-superseded Flag on plan tree

The spec and CLI support plan tree --show-superseded. No test exercises this flag.

L4. No Test for non_overridable Global Invariant Behavior

The spec supports non_overridable global invariants that plan-level invariants cannot override. Not tested.


Acceptance Criteria Coverage Matrix

Criterion Test Verdict
plan use + plan execute plan_generates_decisions() Covered (Strategize driven via direct service calls, not actor)
plan tree displays tree decision_tree_view() Covered
plan explain full context decision_explain() Partial (confidence_score missing -- M3)
invariant add --project invariant_add_and_list() Mock-only, no round-trip (H3)
invariant list --project invariant_add_and_list() Mock-only, no round-trip (H3)
plan correct --dry-run correction_dry_run() Mock hides CLI integration gap (H1)
plan correct --mode=revert correction_live_revert() Mock hides CLI integration gap (H1)
Decisions with full context snapshot decisions_context_snapshot() Covered
Decision tree persists to DB decision_tree_persistence() Covered
Revert re-executes from decision point correction_revert_reexecutes() Covered (with synthetic data -- M5)
Invariants enforced during strategize invariants_enforced_during_strategize() Covered

Overall: Addressing H1-H3 and M1-M2 before merge would significantly strengthen this acceptance gate.

## Code Review: M3 Acceptance Gate E2E Verification The rework from mock-heavy domain-only tests to CLI-integrated tests with real `PlanLifecycleService`, `DecisionService`, and in-memory SQLite persistence is a meaningful improvement. However, this review identified **12 findings** (3 High, 6 Medium, 4 Low) across test coverage, test flaws, a potential production bug, and documentation. The most critical finding is that the `plan correct` CLI command never passes the decision tree to `CorrectionService`, meaning the affected subtree computation in production would only ever return the single target decision. The test mocks completely hide this gap. --- ### HIGH SEVERITY #### H1. CLI Correction Path Never Passes Decision Tree -- Mock Hides Integration Gap **Type:** Bug / Test Flaw | **File:** `robot/helper_m3_e2e_verification.py:549-588` The `correct_decision` CLI handler (`plan.py:2409-2468`) creates a bare `CorrectionService()` and calls `svc.analyze_impact(request.correction_id)` and `svc.execute_correction(request.correction_id)` **without** passing `decision_tree` or `influence_edges`. Both default to `None`. `CorrectionService._compute_affected_subtree()` with `decision_tree=None` performs a BFS with no edges, returning only `{target_decision_id}`. This means **in production, dry-run and live correction via CLI would always report only 1 affected decision**, ignoring all descendants. The test mock assertions confirm the current CLI behavior: ```python mock_service.analyze_impact.assert_called_once_with(mock_request.correction_id) mock_service.execute_correction.assert_called_once_with(mock_request.correction_id) ``` But the mock's pre-configured return values hide the bug. The helper's own domain-level tests correctly pass a `tree` dict, which is why those pass. The gap is in the CLI-to-service integration. **Recommendation:** Either the CLI should query the decision tree from `DecisionService.get_tree(plan_id)` and pass it, or `CorrectionService` should be wired with a `DecisionService` dependency and query it internally. The test should verify the tree is propagated. --- #### H2. CorrectionService and DecisionService Are Not Integrated Through Shared Persistence **Type:** Test Flaw | **File:** `robot/helper_m3_e2e_verification.py:503-525, 599-620` In `correction_dry_run()` and `correction_live_revert()`, decisions are seeded via `DecisionService` with a `UnitOfWork` backed by in-memory SQLite. However, `CorrectionService()` is instantiated independently with no access to the UoW or DecisionService. The tree is manually constructed: ```python tree = { root.decision_id: [child.decision_id], child.decision_id: [grandchild.decision_id], } ``` This bypasses the actual production path where the tree would be derived from the database. The correction tests validate `CorrectionService` in isolation, never testing that it can read the persisted decision tree from the same database. **Recommendation:** Wire `CorrectionService` with the same `DecisionService` / `UnitOfWork`, and let it derive the tree internally. --- #### H3. Invariant Add/List Uses Separate Mocks -- No Round-Trip Verification **Type:** Test Flaw | **File:** `robot/helper_m3_e2e_verification.py:418-493` The acceptance criteria implies: - `invariant add --project ...` **adds** an invariant - `invariant list --project ...` **lists** what was just added The test uses two independent `MagicMock` services (`add_service` and `list_service`). The `list` mock returns a pre-configured list never populated by the `add` call. The round-trip is never verified. **Recommendation:** Use a single real `InvariantService()` (or shared mock) so `list` returns what `add` created. --- ### MEDIUM SEVERITY #### M1. Fragile String Comparison for `decision_type` **File:** `robot/helper_m3_e2e_verification.py:259` ```python if str(strategize_decisions[0].decision_type) != "strategy_choice": ``` `DecisionType` is a `StrEnum`. Use direct enum comparison instead: ```python if strategize_decisions[0].decision_type != DecisionType.STRATEGY_CHOICE: ``` --- #### M2. Live Correction Test Lost Root Boundary Validation **File:** `robot/helper_m3_e2e_verification.py:621-627` The old code explicitly verified the root was NOT reverted. The new code only checks child/grandchild ARE reverted, not that root is excluded. The spec says correction operates on the affected subtree only -- boundary validation matters. **Recommendation:** Add `if root.decision_id in result.reverted_decisions: _fail(...)`. --- #### M3. `decision_explain` Does Not Verify `confidence_score` **File:** `robot/helper_m3_e2e_verification.py:378-408` The spec requires `confidence_score` (0.0-1.0) on decisions. The acceptance criterion for `plan explain` is "shows full decision context." This test checks `decision_id`, `question`, `chosen`, `rationale`, and `context_snapshot` but omits `confidence_score`. --- #### M4. `_load_json` Can Silently Skip Error Prefixes **File:** `robot/helper_m3_e2e_verification.py:89-117` The parser scans forward for the first JSON that extends to end-of-output. If the CLI emits warnings or partial errors before JSON, they are silently discarded. A CLI command that partially fails but still emits valid JSON would pass the test. **Recommendation:** Log or fail when non-JSON prefix content is found. --- #### M5. `correction_revert_reexecutes` Uses Synthetic `"leaf"` String as Decision ID **File:** `robot/helper_m3_e2e_verification.py:798-801` ```python tree = { root.decision_id: [child.decision_id], child.decision_id: ["leaf"], } ``` `"leaf"` is not a real decision ID. `_seed_decisions` returns a grandchild, but it is discarded and a fake string is used instead. The CorrectionService BFS will include `"leaf"` in affected decisions without validation. **Recommendation:** Use `grandchild.decision_id` to keep the tree consistent with seeded data. --- #### M6. `decision_tree_view` and `decision_explain` Never Verify Plan Existence **File:** `robot/helper_m3_e2e_verification.py:290-337, 347-408` Decisions are seeded with `_PLAN_ULID` but no `Plan` object exists for that ID. The CLI commands `plan tree` and `plan explain` only resolve `DecisionService` from the container. If the CLI is hardened to validate plan existence, these tests will break. --- ### LOW SEVERITY #### L1. `_make_settings` Only Sets Two Attributes on Autospec Mock **File:** `robot/helper_m3_e2e_verification.py:127-132` `create_autospec(Settings)` returns `MagicMock` for any unset attribute. Only `database_url` and `async_enabled` are configured. If services access other settings, they get truthy MagicMock objects causing silent false-positives. #### L2. Timeout Inconsistency Between Commit Message and Code Commit message says "Added explicit Robot process timeouts (60s)" but the changes to `actor_context_management.robot`, `changeset_persistence.robot`, and `decision_di_wiring_smoke.robot` increase timeouts to **120s**, not 60s. The M3 suite uses 60s. #### L3. No Test for `--show-superseded` Flag on `plan tree` The spec and CLI support `plan tree --show-superseded`. No test exercises this flag. #### L4. No Test for `non_overridable` Global Invariant Behavior The spec supports `non_overridable` global invariants that plan-level invariants cannot override. Not tested. --- ### Acceptance Criteria Coverage Matrix | Criterion | Test | Verdict | |:----------|:-----|:--------| | `plan use` + `plan execute` | `plan_generates_decisions()` | Covered (Strategize driven via direct service calls, not actor) | | `plan tree` displays tree | `decision_tree_view()` | Covered | | `plan explain` full context | `decision_explain()` | Partial (`confidence_score` missing -- M3) | | `invariant add --project` | `invariant_add_and_list()` | Mock-only, no round-trip (H3) | | `invariant list --project` | `invariant_add_and_list()` | Mock-only, no round-trip (H3) | | `plan correct --dry-run` | `correction_dry_run()` | Mock hides CLI integration gap (H1) | | `plan correct --mode=revert` | `correction_live_revert()` | Mock hides CLI integration gap (H1) | | Decisions with full context snapshot | `decisions_context_snapshot()` | Covered | | Decision tree persists to DB | `decision_tree_persistence()` | Covered | | Revert re-executes from decision point | `correction_revert_reexecutes()` | Covered (with synthetic data -- M5) | | Invariants enforced during strategize | `invariants_enforced_during_strategize()` | Covered | --- **Overall:** Addressing H1-H3 and M1-M2 before merge would significantly strengthen this acceptance gate.
@ -10,3 +10,3 @@
*** Variables ***
${PYTHON} python
${TIMEOUT} 30s
# Normal duration: ~5-10s per test. Timeout raised from 30s to 120s for
Member

[L2 - LOW] Commit message says "Added explicit Robot process timeouts (60s)" but this file (and actor_context_management.robot, decision_di_wiring_smoke.robot) increases the timeout to 120s, not 60s. The M3 suite itself uses 60s. Minor documentation inconsistency.

**[L2 - LOW]** Commit message says "Added explicit Robot process timeouts (60s)" but this file (and `actor_context_management.robot`, `decision_di_wiring_smoke.robot`) increases the timeout to **120s**, not 60s. The M3 suite itself uses 60s. Minor documentation inconsistency.
@ -121,0 +102,4 @@
except json.JSONDecodeError:
pass
for index, char in enumerate(text):
Member

[M4 - MEDIUM] This fallback JSON scanner silently skips any non-JSON prefix content (error messages, warnings). If a CLI command partially fails but still emits valid JSON at the end, the test would pass. Consider at minimum logging when non-JSON prefix content is found, or failing if unexpected content precedes the JSON.

**[M4 - MEDIUM]** This fallback JSON scanner silently skips any non-JSON prefix content (error messages, warnings). If a CLI command partially fails but still emits valid JSON at the end, the test would pass. Consider at minimum logging when non-JSON prefix content is found, or failing if unexpected content precedes the JSON.
@ -176,1 +127,3 @@
)
def _make_settings(database_url: str = "sqlite:///:memory:") -> Any:
"""Create a minimal Settings test double for service wiring."""
settings = create_autospec(Settings, instance=True)
Member

[L1 - LOW] create_autospec(Settings) returns MagicMock for any attribute not explicitly set. Only database_url and async_enabled are configured. If PlanLifecycleService or DecisionService access other settings properties, they get truthy MagicMock objects which could cause silent false-positive behavior.

**[L1 - LOW]** `create_autospec(Settings)` returns `MagicMock` for any attribute not explicitly set. Only `database_url` and `async_enabled` are configured. If `PlanLifecycleService` or `DecisionService` access other settings properties, they get truthy `MagicMock` objects which could cause silent false-positive behavior.
@ -307,0 +256,4 @@
strategize_decisions = decision_service.list_decisions(plan_id)
if not strategize_decisions:
_fail("no decisions recorded during Strategize")
if str(strategize_decisions[0].decision_type) != "strategy_choice":
Member

[M1 - MEDIUM] Fragile string comparison. DecisionType is a StrEnum so str() works today, but this is brittle. Prefer direct enum comparison:

if strategize_decisions[0].decision_type != DecisionType.STRATEGY_CHOICE:
**[M1 - MEDIUM]** Fragile string comparison. `DecisionType` is a `StrEnum` so `str()` works today, but this is brittle. Prefer direct enum comparison: ```python if strategize_decisions[0].decision_type != DecisionType.STRATEGY_CHOICE: ```
@ -333,0 +305,4 @@
):
result = cli_runner.invoke(
plan_app,
["tree", _PLAN_ULID, "--format", "json"],
Member

[M6 - MEDIUM] Decisions are seeded with _PLAN_ULID as the plan_id, but no Plan object exists for this ID in any plan repository. The CLI plan tree command (plan.py:2937-2944) only queries DecisionService, not the plan store. If the CLI is ever hardened to validate plan existence, this test will break.

Consider seeding a corresponding Plan via PlanLifecycleService or documenting this coupling.

**[M6 - MEDIUM]** Decisions are seeded with `_PLAN_ULID` as the `plan_id`, but no `Plan` object exists for this ID in any plan repository. The CLI `plan tree` command (`plan.py:2937-2944`) only queries `DecisionService`, not the plan store. If the CLI is ever hardened to validate plan existence, this test will break. Consider seeding a corresponding Plan via `PlanLifecycleService` or documenting this coupling.
@ -438,0 +386,4 @@
if data.get("chosen") != child.chosen_option:
_fail(f"chosen option mismatch: {data}")
if data.get("rationale") != child.rationale:
_fail(f"rationale mismatch: {data}")
Member

[M3 - MEDIUM] The acceptance criterion says plan explain "shows full decision context", yet confidence_score is not verified here. The spec requires a confidence score on each decision (0.0-1.0) and the old code checked it. Consider adding:

if data.get("confidence") is None:
    _fail(f"missing confidence_score: {data}")
**[M3 - MEDIUM]** The acceptance criterion says `plan explain` "shows full decision context", yet `confidence_score` is not verified here. The spec requires a confidence score on each decision (0.0-1.0) and the old code checked it. Consider adding: ```python if data.get("confidence") is None: _fail(f"missing confidence_score: {data}") ```
@ -513,0 +461,4 @@
source_name=_PROJECT_NAME,
)
list_service = MagicMock()
Member

[H3 - HIGH] Using two independent mocks (add_service and list_service) means the test never verifies the round-trip: that an invariant added via invariant add actually appears in invariant list.

The acceptance criteria implies list should show what was just added. Consider using a single real InvariantService() or a shared mock so list returns what add created.

**[H3 - HIGH]** Using two independent mocks (`add_service` and `list_service`) means the test never verifies the round-trip: that an invariant added via `invariant add` actually appears in `invariant list`. The acceptance criteria implies `list` should show what was just `add`ed. Consider using a single real `InvariantService()` or a shared mock so `list` returns what `add` created.
@ -542,3 +511,1 @@
tree: dict[str, list[str]] = {
_ROOT_DEC_ID: [_CHILD_DEC_ID],
_CHILD_DEC_ID: [_GRANDCHILD_DEC_ID],
service = CorrectionService()
Member

[H2 - HIGH] CorrectionService() is created independently from the DecisionService / UnitOfWork that holds the seeded decisions. The tree dict is manually constructed rather than derived from persistence. This tests CorrectionService in isolation, not the integrated path where the tree comes from the database.

Consider wiring CorrectionService with the same DecisionService or UnitOfWork and letting it derive the tree internally.

**[H2 - HIGH]** `CorrectionService()` is created independently from the `DecisionService` / `UnitOfWork` that holds the seeded decisions. The `tree` dict is manually constructed rather than derived from persistence. This tests `CorrectionService` in isolation, not the integrated path where the tree comes from the database. Consider wiring `CorrectionService` with the same `DecisionService` or `UnitOfWork` and letting it derive the tree internally.
@ -626,0 +585,4 @@
guidance="Use session cookies instead of JWT",
dry_run=True,
)
mock_service.analyze_impact.assert_called_once_with(mock_request.correction_id)
Member

[H1 - HIGH] The mock here hides a CLI integration gap. The correct_decision CLI handler (plan.py:2409-2468) calls svc.analyze_impact(request.correction_id) and svc.execute_correction(request.correction_id) without passing a decision_tree. Both parameters default to None, causing the BFS in _compute_affected_subtree() to return only {target} (a single node) in production.

The mock assertion assert_called_once_with(mock_request.correction_id) confirms the CLI only passes the ID. But since the mock returns the pre-configured mock_impact regardless, the missing tree is never detected.

Either the CLI should query the tree from DecisionService.get_tree(plan_id) and pass it, or CorrectionService should be wired with DecisionService and query it internally.

**[H1 - HIGH]** The mock here hides a CLI integration gap. The `correct_decision` CLI handler (`plan.py:2409-2468`) calls `svc.analyze_impact(request.correction_id)` and `svc.execute_correction(request.correction_id)` **without** passing a `decision_tree`. Both parameters default to `None`, causing the BFS in `_compute_affected_subtree()` to return only `{target}` (a single node) in production. The mock assertion `assert_called_once_with(mock_request.correction_id)` confirms the CLI only passes the ID. But since the mock returns the pre-configured `mock_impact` regardless, the missing tree is never detected. Either the CLI should query the tree from `DecisionService.get_tree(plan_id)` and pass it, or `CorrectionService` should be wired with `DecisionService` and query it internally.
@ -666,2 +622,2 @@
if _GRANDCHILD_DEC_ID not in result.reverted_decisions:
_fail("grandchild not reverted")
_fail(f"expected applied status from execute_revert, got: {result.status}")
if (
Member

[M2 - MEDIUM] The old code explicitly verified the root was NOT reverted: if _ROOT_DEC_ID in result.reverted_decisions: _fail("root should not be reverted"). This boundary validation was lost in the rewrite. The spec says correction operates on the affected subtree only -- verifying the boundary matters.

Add: if root.decision_id in result.reverted_decisions: _fail("root should not be reverted")

**[M2 - MEDIUM]** The old code explicitly verified the root was NOT reverted: `if _ROOT_DEC_ID in result.reverted_decisions: _fail("root should not be reverted")`. This boundary validation was lost in the rewrite. The spec says correction operates on the affected subtree only -- verifying the boundary matters. Add: `if root.decision_id in result.reverted_decisions: _fail("root should not be reverted")`
@ -862,1 +798,4 @@
tree = {
root.decision_id: [child.decision_id],
child.decision_id: ["leaf"],
}
Member

[M5 - MEDIUM] The string "leaf" is not a real decision ID. _seed_decisions() returns 3 decisions (root, child, grandchild) but the grandchild is discarded (_) and a fake placeholder is used. The CorrectionService BFS will include "leaf" in affected decisions without validation, and the tree structure doesn't match the persisted decisions.

Use grandchild.decision_id instead of "leaf" for consistency.

**[M5 - MEDIUM]** The string `"leaf"` is not a real decision ID. `_seed_decisions()` returns 3 decisions (root, child, grandchild) but the grandchild is discarded (`_`) and a fake placeholder is used. The CorrectionService BFS will include `"leaf"` in affected decisions without validation, and the tree structure doesn't match the persisted decisions. Use `grandchild.decision_id` instead of `"leaf"` for consistency.
@ -33,0 +39,4 @@
...
... Validates: plan tree displays the decision tree correctly.
[Tags] success_criteria decision_tree
${result}= Run Process ${PYTHON} ${HELPER} decision-tree-view cwd=${WORKSPACE} timeout=60s
Member

[L3 - LOW] No test exercises the --show-superseded flag on plan tree. The spec and CLI (plan.py:2928-2930) support this flag for rendering both current and historical branches. The correction_revert_reexecutes helper creates superseded decisions but never renders the tree with this flag.

**[L3 - LOW]** No test exercises the `--show-superseded` flag on `plan tree`. The spec and CLI (`plan.py:2928-2930`) support this flag for rendering both current and historical branches. The `correction_revert_reexecutes` helper creates superseded decisions but never renders the tree with this flag.
Author
Member

Review Response — Third Review by @CoreRasurae

Thank you for the thorough third review. We verified every finding against the production code and test code. Here is the disposition for each:


Findings being fixed in this commit (test flaws in scope for #494)

These are genuine test flaws within helper_m3_e2e_verification.py that we will address in an updated push:

Finding Fix
H3 — Invariant add/list uses separate mocks, no round-trip Replacing the two independent MagicMock() instances with a single real InvariantService(). Since InvariantService is purely in-memory (zero-arg constructor, dict-backed), this gives us a genuine add-then-list round-trip through the CLI with no added complexity.
M1 — Fragile string comparison for decision_type Changing str(decision.decision_type) != "strategy_choice" to direct enum comparison != DecisionType.STRATEGY_CHOICE.
M2 — Live correction test lost root boundary validation Adding if root.decision_id in result.reverted_decisions: _fail(...) to verify the root is correctly excluded from the revert.
M3decision_explain does not verify confidence_score Adding confidence field check to the explain output validation. The Decision model has `confidence_score: float
M5 — Synthetic "leaf" string as decision ID Replacing "leaf" with grandchild.decision_id from _seed_decisions(). The grandchild was being discarded (root, child, _ = ...); now captured and used.
L1_make_settings only sets two attributes Adding a comment documenting which attributes are required and why create_autospec is used (it raises AttributeError for spec-violating access, but returns MagicMock for valid but unset attributes).
L2 — Timeout inconsistency in commit message Correcting the commit body text from "60s" to "120s" to match the actual code changes.

A follow-up comment will be posted once these fixes are pushed.


Findings addressed via dedicated ticket (production bug, out of scope for #494)

H1 — CLI plan correct never passes decision tree to CorrectionService

This is a confirmed production bug in src/cleveragents/cli/commands/plan.py:2422,2468. We verified:

  • correct_decision() at line 2409 creates CorrectionService() directly (not via DI container, unlike every other command in the file).
  • analyze_impact(request.correction_id) at line 2422 passes 1 argument. The method signature has decision_tree=None which defaults to {}, so BFS returns only the target node.
  • execute_correction(request.correction_id) at line 2468 has the same issue.
  • DecisionService.get_influence_edges() already exists and its docstring says "This is the format expected by CorrectionService._compute_affected_subtree()" — the plumbing was built but never wired in the CLI handler.

This is a ~15-20 line production code fix in plan.py, not a test fix. Mixing it into this test-only commit (test(e2e): ...) would violate the one-commit-per-issue rule and conflate test and production changes.

Filed as #606: CLI plan correct never passes decision tree to CorrectionService, producing single-node impact analysis

Our test mock assertions at line 588 (analyze_impact.assert_called_once_with(mock_request.correction_id)) accurately document the CLI's current (buggy) behavior. Once #606 is fixed, the mock assertions should be updated to expect the tree arguments.


Findings acknowledged — no code change needed (with explanation)

Finding Disposition
H2CorrectionService not integrated through shared persistence By design. CorrectionService.__init__ accepts only checkpoint_service — it has no unit_of_work parameter, no settings, no DecisionService dependency. Its docstring explicitly states: "State is held in-memory via dictionaries. A production deployment would swap these for repository adapters." The manual tree: dict[str, list[str]] is the correct API contract — _compute_affected_subtree() takes the tree as a parameter, not a database reference. There is nothing to wire.
M4_load_json silently skips error prefixes Deliberate design. The docstring states: "Some CLI paths emit structured logs before the JSON payload." This is the expected behavior for Typer CLI commands that produce structlog output before JSON. Adding a failure on non-JSON prefixes would break the test whenever a CLI command emits any log line.
M6decision_tree_view / decision_explain never verify plan existence Not a current issue. These CLI commands resolve DecisionService and query by decision ID. The spy wraps a real DecisionService that has the seeded decisions in its in-memory SQLite. The commands don't validate plan existence — they operate on decisions directly. If plan-existence validation is added to the CLI later, that would be a separate concern.
L3 — No test for --show-superseded flag Correct observation. Not required by the #494 acceptance criteria. Could be a follow-up enhancement.
L4 — No test for non_overridable global invariant behavior Correct observation. Not required by the #494 acceptance criteria. Could be a follow-up enhancement.
TEST-1 — Mock assertions mask H1 Coupled to H1 (the production bug). The mock currently returns hardcoded impact data, which masks the CLI's failure to pass the tree. This will be addressed when #606 is fixed — the mock assertions will be updated to expect tree arguments.
TEST-2 — Inconsistent service resolution mocking The different patching strategies reflect the actual DI patterns in the production CLI: plan commands use _get_lifecycle_service, invariant commands use _get_service, and correction mocks the constructor because CorrectionService() is instantiated inline (part of the H1/#606 bug). Standardizing would require refactoring the production CLI's DI patterns, which is out of scope.
TEST-4 — No negative/failure test cases Agreed this would be valuable, but the #494 acceptance criteria focus on positive validation. Could be a follow-up issue.
SPEC-1--plan flag not in spec The --plan flag exists in the implementation (plan.py:2354-2361) as a convenience for explicit plan targeting. The tests use it to avoid needing to mock _resolve_active_plan_id(). The spec's omission is a spec-vs-implementation divergence in production code, not a test issue.
SPEC-2--show-superseded untested Same as L3 — not required by acceptance criteria.
PERF-1 — Blanket 4x timeout Already addressed in previous review cycle — expected-duration comments were added alongside all timeout increases.
PERF-2_load_json quadratic Theoretical concern, not practical for test output sizes. Acknowledged.
## Review Response — Third Review by @CoreRasurae Thank you for the thorough third review. We verified every finding against the production code and test code. Here is the disposition for each: --- ### Findings being fixed in this commit (test flaws in scope for #494) These are genuine test flaws within `helper_m3_e2e_verification.py` that we will address in an updated push: | Finding | Fix | |---------|-----| | **H3** — Invariant add/list uses separate mocks, no round-trip | Replacing the two independent `MagicMock()` instances with a single real `InvariantService()`. Since `InvariantService` is purely in-memory (zero-arg constructor, dict-backed), this gives us a genuine add-then-list round-trip through the CLI with no added complexity. | | **M1** — Fragile string comparison for `decision_type` | Changing `str(decision.decision_type) != "strategy_choice"` to direct enum comparison `!= DecisionType.STRATEGY_CHOICE`. | | **M2** — Live correction test lost root boundary validation | Adding `if root.decision_id in result.reverted_decisions: _fail(...)` to verify the root is correctly excluded from the revert. | | **M3** — `decision_explain` does not verify `confidence_score` | Adding `confidence` field check to the explain output validation. The `Decision` model has `confidence_score: float | None` and `as_cli_dict()` outputs it as `"confidence"`. | | **M5** — Synthetic `"leaf"` string as decision ID | Replacing `"leaf"` with `grandchild.decision_id` from `_seed_decisions()`. The grandchild was being discarded (`root, child, _ = ...`); now captured and used. | | **L1** — `_make_settings` only sets two attributes | Adding a comment documenting which attributes are required and why `create_autospec` is used (it raises `AttributeError` for spec-violating access, but returns `MagicMock` for valid but unset attributes). | | **L2** — Timeout inconsistency in commit message | Correcting the commit body text from "60s" to "120s" to match the actual code changes. | A follow-up comment will be posted once these fixes are pushed. --- ### Findings addressed via dedicated ticket (production bug, out of scope for #494) **H1 — CLI `plan correct` never passes decision tree to `CorrectionService`** This is a **confirmed production bug** in `src/cleveragents/cli/commands/plan.py:2422,2468`. We verified: - `correct_decision()` at line 2409 creates `CorrectionService()` directly (not via DI container, unlike every other command in the file). - `analyze_impact(request.correction_id)` at line 2422 passes 1 argument. The method signature has `decision_tree=None` which defaults to `{}`, so BFS returns only the target node. - `execute_correction(request.correction_id)` at line 2468 has the same issue. - `DecisionService.get_influence_edges()` already exists and its docstring says *"This is the format expected by `CorrectionService._compute_affected_subtree()`"* — the plumbing was built but never wired in the CLI handler. This is a ~15-20 line production code fix in `plan.py`, not a test fix. Mixing it into this test-only commit (`test(e2e): ...`) would violate the one-commit-per-issue rule and conflate test and production changes. Filed as **#606**: [CLI `plan correct` never passes decision tree to CorrectionService, producing single-node impact analysis](https://git.cleverthis.com/cleveragents/cleveragents-core/issues/606) Our test mock assertions at line 588 (`analyze_impact.assert_called_once_with(mock_request.correction_id)`) accurately document the CLI's current (buggy) behavior. Once #606 is fixed, the mock assertions should be updated to expect the tree arguments. --- ### Findings acknowledged — no code change needed (with explanation) | Finding | Disposition | |---------|-------------| | **H2** — `CorrectionService` not integrated through shared persistence | **By design.** `CorrectionService.__init__` accepts only `checkpoint_service` — it has no `unit_of_work` parameter, no `settings`, no `DecisionService` dependency. Its docstring explicitly states: *"State is held in-memory via dictionaries. A production deployment would swap these for repository adapters."* The manual `tree: dict[str, list[str]]` is the correct API contract — `_compute_affected_subtree()` takes the tree as a parameter, not a database reference. There is nothing to wire. | | **M4** — `_load_json` silently skips error prefixes | **Deliberate design.** The docstring states: *"Some CLI paths emit structured logs before the JSON payload."* This is the expected behavior for Typer CLI commands that produce structlog output before JSON. Adding a failure on non-JSON prefixes would break the test whenever a CLI command emits any log line. | | **M6** — `decision_tree_view` / `decision_explain` never verify plan existence | **Not a current issue.** These CLI commands resolve `DecisionService` and query by decision ID. The spy wraps a real `DecisionService` that has the seeded decisions in its in-memory SQLite. The commands don't validate plan existence — they operate on decisions directly. If plan-existence validation is added to the CLI later, that would be a separate concern. | | **L3** — No test for `--show-superseded` flag | Correct observation. Not required by the #494 acceptance criteria. Could be a follow-up enhancement. | | **L4** — No test for `non_overridable` global invariant behavior | Correct observation. Not required by the #494 acceptance criteria. Could be a follow-up enhancement. | | **TEST-1** — Mock assertions mask H1 | Coupled to H1 (the production bug). The mock currently returns hardcoded impact data, which masks the CLI's failure to pass the tree. This will be addressed when #606 is fixed — the mock assertions will be updated to expect tree arguments. | | **TEST-2** — Inconsistent service resolution mocking | The different patching strategies reflect the actual DI patterns in the production CLI: `plan` commands use `_get_lifecycle_service`, `invariant` commands use `_get_service`, and `correction` mocks the constructor because `CorrectionService()` is instantiated inline (part of the H1/\#606 bug). Standardizing would require refactoring the production CLI's DI patterns, which is out of scope. | | **TEST-4** — No negative/failure test cases | Agreed this would be valuable, but the #494 acceptance criteria focus on positive validation. Could be a follow-up issue. | | **SPEC-1** — `--plan` flag not in spec | The `--plan` flag exists in the implementation (`plan.py:2354-2361`) as a convenience for explicit plan targeting. The tests use it to avoid needing to mock `_resolve_active_plan_id()`. The spec's omission is a spec-vs-implementation divergence in production code, not a test issue. | | **SPEC-2** — `--show-superseded` untested | Same as L3 — not required by acceptance criteria. | | **PERF-1** — Blanket 4x timeout | Already addressed in previous review cycle — expected-duration comments were added alongside all timeout increases. | | **PERF-2** — `_load_json` quadratic | Theoretical concern, not practical for test output sizes. Acknowledged. |
hurui200320 force-pushed test/m3-acceptance-gate from 444ae0677a
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 14s
CI / build (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 17s
CI / typecheck (pull_request) Successful in 34s
CI / security (pull_request) Successful in 35s
CI / unit_tests (pull_request) Successful in 2m5s
CI / integration_tests (pull_request) Successful in 2m56s
CI / docker (pull_request) Successful in 50s
CI / coverage (pull_request) Successful in 4m20s
CI / benchmark-regression (pull_request) Successful in 27m38s
to 06e1b04a2e
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 14s
CI / build (pull_request) Successful in 15s
CI / quality (pull_request) Successful in 18s
CI / security (pull_request) Successful in 33s
CI / typecheck (pull_request) Successful in 35s
CI / integration_tests (pull_request) Has been cancelled
CI / coverage (pull_request) Has been cancelled
CI / benchmark-regression (pull_request) Has been cancelled
CI / unit_tests (pull_request) Has been cancelled
CI / docker (pull_request) Has been cancelled
2026-03-05 14:34:47 +00:00
Compare
Author
Member

Fix Push — Commit 06e1b04a (Third Review Fixes)

@CoreRasurae — all 7 test-flaw fixes from the third review are now in commit 06e1b04a on test/m3-acceptance-gate.

Changes in helper_m3_e2e_verification.py (33 insertions, 33 deletions)

Finding Change
H3 invariant_add_and_list(): Replaced two independent MagicMock() instances with a single real InvariantService(). Both add and list CLI calls now patch _get_service to return the same service instance, giving a genuine add-then-list round-trip. Added text verification ("Use session cookies") on the listed invariant. Removed unused Invariant and datetime imports.
M1 Line 259: str(strategize_decisions[0].decision_type) != "strategy_choice"strategize_decisions[0].decision_type != DecisionType.STRATEGY_CHOICE
M2 correction_live_revert(): Added boundary check if root.decision_id in result.reverted_decisions: _fail(...) — verifies the root is excluded from revert (spec: correction targets affected subtree only).
M3 decision_explain(): Added if "confidence" not in data: _fail(...) before the context snapshot checks. Decision.confidence_score is output as "confidence" via as_cli_dict().
M5 correction_revert_reexecutes(): Changed root, child, _ = _seed_decisions(...) to root, child, grandchild = _seed_decisions(...) and replaced "leaf" with grandchild.decision_id in the tree dict.
L1 _make_settings(): Expanded docstring documenting which attributes are set, why create_autospec is used, and that future attribute needs should be added explicitly.
L2 Commit message body: Corrected "Added explicit Robot process timeouts (60s)" to "(120s)" to match the actual code. Also clarified that the M3 suite uses 60s timeouts.

Verification

  • M3 E2E tests: 10/10 PASS (nox -s integration_tests -- --processes 1 --include acceptance_gate)
  • lint, format, typecheck, security_scan, dead_code: all PASS
  • integration_tests: 1194/1194 PASS (14 minutes)
  • coverage_report: 97% (meets --fail-under=97)
  • unit_tests: 8523 passed, 1 flaky (repositories_uncovered_branches.feature:110 — passes in isolation, pre-existing race condition under 16-process parallelism, unrelated)
  • docs, build, benchmark: all PASS
## Fix Push — Commit `06e1b04a` (Third Review Fixes) @CoreRasurae — all 7 test-flaw fixes from the third review are now in commit `06e1b04a` on `test/m3-acceptance-gate`. ### Changes in `helper_m3_e2e_verification.py` (33 insertions, 33 deletions) | Finding | Change | |---------|--------| | **H3** | `invariant_add_and_list()`: Replaced two independent `MagicMock()` instances with a single real `InvariantService()`. Both `add` and `list` CLI calls now patch `_get_service` to return the same service instance, giving a genuine add-then-list round-trip. Added text verification (`"Use session cookies"`) on the listed invariant. Removed unused `Invariant` and `datetime` imports. | | **M1** | Line 259: `str(strategize_decisions[0].decision_type) != "strategy_choice"` → `strategize_decisions[0].decision_type != DecisionType.STRATEGY_CHOICE` | | **M2** | `correction_live_revert()`: Added boundary check `if root.decision_id in result.reverted_decisions: _fail(...)` — verifies the root is excluded from revert (spec: correction targets affected subtree only). | | **M3** | `decision_explain()`: Added `if "confidence" not in data: _fail(...)` before the context snapshot checks. `Decision.confidence_score` is output as `"confidence"` via `as_cli_dict()`. | | **M5** | `correction_revert_reexecutes()`: Changed `root, child, _ = _seed_decisions(...)` to `root, child, grandchild = _seed_decisions(...)` and replaced `"leaf"` with `grandchild.decision_id` in the tree dict. | | **L1** | `_make_settings()`: Expanded docstring documenting which attributes are set, why `create_autospec` is used, and that future attribute needs should be added explicitly. | | **L2** | Commit message body: Corrected "Added explicit Robot process timeouts (60s)" to "(120s)" to match the actual code. Also clarified that the M3 suite uses 60s timeouts. | ### Verification - **M3 E2E tests**: 10/10 PASS (`nox -s integration_tests -- --processes 1 --include acceptance_gate`) - **lint, format, typecheck, security_scan, dead_code**: all PASS - **integration_tests**: 1194/1194 PASS (14 minutes) - **coverage_report**: 97% (meets `--fail-under=97`) - **unit_tests**: 8523 passed, 1 flaky (`repositories_uncovered_branches.feature:110` — passes in isolation, pre-existing race condition under 16-process parallelism, unrelated) - **docs, build, benchmark**: all PASS
hurui200320 force-pushed test/m3-acceptance-gate from 06e1b04a2e
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 14s
CI / build (pull_request) Successful in 15s
CI / quality (pull_request) Successful in 18s
CI / security (pull_request) Successful in 33s
CI / typecheck (pull_request) Successful in 35s
CI / integration_tests (pull_request) Has been cancelled
CI / coverage (pull_request) Has been cancelled
CI / benchmark-regression (pull_request) Has been cancelled
CI / unit_tests (pull_request) Has been cancelled
CI / docker (pull_request) Has been cancelled
to a8265a5074
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 13s
CI / quality (pull_request) Successful in 16s
CI / build (pull_request) Successful in 17s
CI / security (pull_request) Successful in 31s
CI / typecheck (pull_request) Successful in 35s
CI / unit_tests (pull_request) Successful in 2m0s
CI / docker (pull_request) Successful in 48s
CI / integration_tests (pull_request) Successful in 3m1s
CI / coverage (pull_request) Successful in 4m19s
CI / benchmark-regression (pull_request) Successful in 28m3s
2026-03-05 14:36:54 +00:00
Compare
CoreRasurae left a comment

Final Code Review Report — Commit a8265a50

Branch: test/m3-acceptance-gate
Issue: #494 · Author: Rui Hu
Review context: Re-checked after R1–R7 fixes squashed into a8265a50


1. Coverage Assessment vs Issue #494 Acceptance Criteria

#494 Criterion Test(s) Verdict
plan use + plan execute generate decisions during Strategize plan_generates_decisions() Covered — real services, real CLI invocation, real persistence
plan tree displays decision tree decision_tree_view() + decision_tree_persistence() Covered — validates 3-level hierarchy + persistence round-trip
plan explain shows full decision context decision_explain() Covered — validates all fields including confidence (R4 fix)
Invariant add/list via CLI invariant_add_and_list() Covered — real InvariantService round-trip with text verification (R1 fix)
Correction dry-run (impact analysis) correction_dry_run() Covered — real service + mocked CLI path
Correction live revert correction_live_revert() Covered — real service + mocked CLI path, root boundary check (R3 fix)
Decisions recorded with context snapshots decisions_context_snapshot() Covered — validates all 4 snapshot fields on all 3 decisions
Decision tree persists to DB decision_tree_persistence() Covered — writer/reader separation, parent-chain verification
Revert re-executes from decision point correction_revert_reexecutes() Covered — revert + new corrected decision + supersede chain
Invariants enforced during Strategize invariants_enforced_during_strategize() Covered — 3 scopes, merge precedence, enforcement records, InvariantSet.merge

All 10 acceptance criteria from #494 have corresponding tests. No criterion is unexercised.


2. Verified Fixes (R1–R7) from 06e1b04a Squashed into a8265a50

All 7 fixes are confirmed present and correct:

Fix Location Status
R1: Real InvariantService round-trip Line 436 + line 488 text check Verified
R2: Direct enum comparison Line 265: != DecisionType.STRATEGY_CHOICE Verified
R3: Root boundary check Lines 624–627 Verified
R4: confidence field assertion Lines 397–400 Verified
R5: Uses grandchild.decision_id not "leaf" Line 800 Verified
R6: Expanded _make_settings docstring Lines 126–133 Verified
R7: Commit message timeout description Robot file header + commit Verified

3. Remaining Findings

3.1 Production Bug — CLI plan correct Missing Decision Tree (Critical, Out of Scope)

Tracked as: #606

The CLI plan correct handler in plan.py:2409–2468 calls svc.analyze_impact(request.correction_id) and svc.execute_correction(request.correction_id) without passing decision_tree. With tree=None, BFS returns only the target node — dry-run reports 1 affected decision and live revert only reverts the target, never cascading.

The test mock assertions at lines 584 and 686 correctly document this buggy 1-argument call signature. Intentionally out of scope for this test-only PR and tracked separately as #606.

3.2 Stale Robot Documentation — model_dump / model_validate (Low, New)

Location: m3_e2e_verification.robot:104–106

The Robot documentation for "Decisions Recorded With Full Context Snapshot" claims the test verifies model_dump / model_validate round-trips, but the actual Python test (decisions_context_snapshot(), lines 696–723) does not perform any Pydantic round-trip. It calls decision_service.get_snapshot() and checks field presence. The documentation is misleading.

This was not raised in any of the 3 prior review rounds. Cosmetic but could mislead future maintainers.

Recommendation: Update the Robot documentation to match the actual test behavior.

3.3 JSON Envelope Discrepancy vs Specification (Informational)

The specification mandates all CLI JSON output use envelope format {"command": "...", "status": "ok", "data": {...}}. All 10 tests validate raw payloads without envelopes. This is a spec-implementation alignment issue rather than a test defect.

3.4 invariants_enforced_during_strategize Has No CLI Integration (Low)

This is the only test of the 10 that never invokes any CLI command. It exercises InvariantService methods directly but doesn't verify the CLI→service wiring for enforcement. Previously discussed — Rui noted that invariant_add_and_list() covers CLI wiring separately and this test focuses on the domain enforcement contract. Reasonable architectural split.

3.5 No Negative/Edge-Case Tests (Low, Deferred)

No tests for invalid decision IDs, missing plans, --mode=append, --show-superseded, etc. Explicitly deferred by Rui to follow-up issues. Acceptable for an acceptance-gate PR.

3.6 --plan Flag Not in Spec (Informational)

The plan correct CLI invocations pass --plan which does not appear in the specification's command signature. Acknowledged as spec-vs-implementation divergence.


4. Summary Table

Category Critical High Low Info
Test Coverage 0 0 0 0
Test Flaws 0 0 1 (§3.2) 0
Production Bugs 1 (§3.1 — #606, out of scope) 0 0 0
Spec Alignment 0 0 1 (§3.4) 2 (§3.3, §3.6)
Deferred 0 0 1 (§3.5) 0
Performance 0 0 0 0
Security 0 0 0 0

Verdict: APPROVED

The R1–R7 fixes are all confirmed present and correct. All 10 acceptance criteria from #494 are covered. The only critical finding (#606) is a known production bug already filed and explicitly out of scope for this test-only PR. The one genuinely new finding not previously raised is the stale Robot documentation at m3_e2e_verification.robot:104–106 (§3.2) — low severity, can be addressed in a follow-up.

The test suite is well-structured, uses real services with in-memory persistence where feasible, and provides meaningful acceptance-gate coverage for milestone v3.2.0 closure.

# Final Code Review Report — Commit `a8265a50` **Branch:** `test/m3-acceptance-gate` **Issue:** #494 · **Author:** Rui Hu **Review context:** Re-checked after R1–R7 fixes squashed into `a8265a50` --- ## 1. Coverage Assessment vs Issue #494 Acceptance Criteria | #494 Criterion | Test(s) | Verdict | |---|---|---| | `plan use` + `plan execute` generate decisions during Strategize | `plan_generates_decisions()` | **Covered** — real services, real CLI invocation, real persistence | | `plan tree` displays decision tree | `decision_tree_view()` + `decision_tree_persistence()` | **Covered** — validates 3-level hierarchy + persistence round-trip | | `plan explain` shows full decision context | `decision_explain()` | **Covered** — validates all fields including `confidence` (R4 fix) | | Invariant add/list via CLI | `invariant_add_and_list()` | **Covered** — real `InvariantService` round-trip with text verification (R1 fix) | | Correction dry-run (impact analysis) | `correction_dry_run()` | **Covered** — real service + mocked CLI path | | Correction live revert | `correction_live_revert()` | **Covered** — real service + mocked CLI path, root boundary check (R3 fix) | | Decisions recorded with context snapshots | `decisions_context_snapshot()` | **Covered** — validates all 4 snapshot fields on all 3 decisions | | Decision tree persists to DB | `decision_tree_persistence()` | **Covered** — writer/reader separation, parent-chain verification | | Revert re-executes from decision point | `correction_revert_reexecutes()` | **Covered** — revert + new corrected decision + supersede chain | | Invariants enforced during Strategize | `invariants_enforced_during_strategize()` | **Covered** — 3 scopes, merge precedence, enforcement records, `InvariantSet.merge` | **All 10 acceptance criteria from #494 have corresponding tests. No criterion is unexercised.** --- ## 2. Verified Fixes (R1–R7) from `06e1b04a` Squashed into `a8265a50` All 7 fixes are **confirmed present and correct**: | Fix | Location | Status | |---|---|---| | **R1**: Real `InvariantService` round-trip | Line 436 + line 488 text check | ✅ Verified | | **R2**: Direct enum comparison | Line 265: `!= DecisionType.STRATEGY_CHOICE` | ✅ Verified | | **R3**: Root boundary check | Lines 624–627 | ✅ Verified | | **R4**: `confidence` field assertion | Lines 397–400 | ✅ Verified | | **R5**: Uses `grandchild.decision_id` not `"leaf"` | Line 800 | ✅ Verified | | **R6**: Expanded `_make_settings` docstring | Lines 126–133 | ✅ Verified | | **R7**: Commit message timeout description | Robot file header + commit | ✅ Verified | --- ## 3. Remaining Findings ### 3.1 Production Bug — CLI `plan correct` Missing Decision Tree (Critical, Out of Scope) **Tracked as:** #606 The CLI `plan correct` handler in `plan.py:2409–2468` calls `svc.analyze_impact(request.correction_id)` and `svc.execute_correction(request.correction_id)` **without passing `decision_tree`**. With `tree=None`, BFS returns only the target node — dry-run reports 1 affected decision and live revert only reverts the target, never cascading. The test mock assertions at lines 584 and 686 correctly document this buggy 1-argument call signature. **Intentionally out of scope** for this test-only PR and tracked separately as #606. ### 3.2 Stale Robot Documentation — `model_dump / model_validate` (Low, New) **Location:** `m3_e2e_verification.robot:104–106` The Robot documentation for "Decisions Recorded With Full Context Snapshot" claims the test verifies `model_dump / model_validate` round-trips, but the actual Python test (`decisions_context_snapshot()`, lines 696–723) does **not** perform any Pydantic round-trip. It calls `decision_service.get_snapshot()` and checks field presence. The documentation is misleading. **This was not raised in any of the 3 prior review rounds.** Cosmetic but could mislead future maintainers. **Recommendation:** Update the Robot documentation to match the actual test behavior. ### 3.3 JSON Envelope Discrepancy vs Specification (Informational) The specification mandates all CLI JSON output use envelope format `{"command": "...", "status": "ok", "data": {...}}`. All 10 tests validate raw payloads without envelopes. This is a spec-implementation alignment issue rather than a test defect. ### 3.4 `invariants_enforced_during_strategize` Has No CLI Integration (Low) This is the only test of the 10 that never invokes any CLI command. It exercises `InvariantService` methods directly but doesn't verify the CLI→service wiring for enforcement. Previously discussed — Rui noted that `invariant_add_and_list()` covers CLI wiring separately and this test focuses on the domain enforcement contract. Reasonable architectural split. ### 3.5 No Negative/Edge-Case Tests (Low, Deferred) No tests for invalid decision IDs, missing plans, `--mode=append`, `--show-superseded`, etc. Explicitly deferred by Rui to follow-up issues. Acceptable for an acceptance-gate PR. ### 3.6 `--plan` Flag Not in Spec (Informational) The `plan correct` CLI invocations pass `--plan` which does not appear in the specification's command signature. Acknowledged as spec-vs-implementation divergence. --- ## 4. Summary Table | Category | Critical | High | Low | Info | |---|---|---|---|---| | **Test Coverage** | 0 | 0 | 0 | 0 | | **Test Flaws** | 0 | 0 | 1 (§3.2) | 0 | | **Production Bugs** | 1 (§3.1 — #606, out of scope) | 0 | 0 | 0 | | **Spec Alignment** | 0 | 0 | 1 (§3.4) | 2 (§3.3, §3.6) | | **Deferred** | 0 | 0 | 1 (§3.5) | 0 | | **Performance** | 0 | 0 | 0 | 0 | | **Security** | 0 | 0 | 0 | 0 | --- ## Verdict: APPROVED The R1–R7 fixes are all confirmed present and correct. All 10 acceptance criteria from #494 are covered. The only critical finding (#606) is a known production bug already filed and explicitly out of scope for this test-only PR. The one genuinely new finding not previously raised is the stale Robot documentation at `m3_e2e_verification.robot:104–106` (§3.2) — low severity, can be addressed in a follow-up. The test suite is well-structured, uses real services with in-memory persistence where feasible, and provides meaningful acceptance-gate coverage for milestone v3.2.0 closure.
hurui200320 deleted branch test/m3-acceptance-gate 2026-03-05 15:09:20 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core!559
No description provided.