test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure #560

2026-03-04T07:26:22Z

hurui200320 commented

2026-03-04 07:26:22 +00:00

Summary

Validates all M4 acceptance criteria for v3.3.0 milestone closure. Adds CLI-exercising integration tests for the three subplan commands that lacked actual CLI invocation, fixes a pre-existing unit test failure, and corrects CONTRIBUTORS.md ordering.

Changes

New CLI Integration Tests (Robot Framework)

Three new test cases in robot/m4_e2e_verification.robot with helper functions in robot/helper_m4_e2e_verification.py:

Test	CLI Command	Mocking Approach
`CLI Plan Use Creates Plan With Subplan Config`	`plan use local/refactor-action local/monorepo`	Patches `_get_lifecycle_service`; mocks `get_action_by_name` + `use_action`
`CLI Plan Execute Transitions With Subplans`	`plan execute <plan_id>`	Patches `_get_lifecycle_service`; mocks `get_plan` + `execute_plan`
`CLI Plan Tree Displays Subplan Hierarchy`	`plan tree <plan_id> --format json`	Patches `get_container`; real `Decision` objects with `SUBPLAN_SPAWN`/`SUBPLAN_PARALLEL_SPAWN` types

All three follow the same pattern as the existing plan-diff test: Typer CliRunner.invoke() with mocked services.

Pre-existing Unit Test Fix

File: features/steps/repositories_uncovered_branches_steps.py
Scenario: repo branch cov upsert profile with schema version mismatch (line 110)
Root cause: Plain sessionmaker returns a new session per call. The Given step inserted via session A and committed session B (different session), so the When step's session C couldn't see the uncommitted data.
Fix: Changed to scoped_session(sessionmaker(...)) so all factory calls return the same thread-local session.

Other Fixes

CONTRIBUTORS.md: Moved "Rui Hu" to correct alphabetical position (between Freeman and Khyari)
CHANGELOG.md: Updated #495 entry to document CLI test additions

M4 Acceptance Criteria Verification

All 7 criteria exercised by 10 E2E tests (7 existing + 3 new) + 8 smoke tests:

#	Criterion	Test(s)	Status
1	Subplans spawned during Execute via SubplanConfig	`spawn-subplans`, `cli-plan-use`, `cli-plan-execute`	PASS
2	Parallel execution with max_parallel bounds	`parallel-exec`	PASS
3	Three-way merge combines non-conflicting changes	`merge-clean`	PASS
4	Merge conflicts surfaced with git markers	`merge-conflict`	PASS
5	Parent plan tracks subplan statuses	`parent-tracking`	PASS
6	Plan tree displays subplan hierarchy	`plan-tree`, `cli-plan-tree`	PASS
7	Plan diff shows merged results	`plan-diff` (already uses CLI)	PASS

Quality Gates

Stage	Result
lint	pass
format	pass (1074 files unchanged)
typecheck	pass (0 errors)
unit_tests	8524 scenarios, 0 failures
integration_tests	1110/1118 pass (8 pre-existing failures in `cli_plan_context_commands.robot`, identical on master)
coverage_report	97% (threshold: 97%)
security_scan	pass
dead_code	pass
docs	pass
build	pass
benchmark	pass

Files Changed

CHANGELOG.md — Updated #495 entry
CONTRIBUTORS.md — Fixed alphabetical ordering
features/steps/repositories_uncovered_branches_steps.py — scoped_session fix
robot/helper_m4_e2e_verification.py — 3 new helper functions + imports
robot/m4_e2e_verification.robot — 3 new test cases

ISSUES CLOSED: #495

## Summary Validates all M4 acceptance criteria for v3.3.0 milestone closure. Adds CLI-exercising integration tests for the three subplan commands that lacked actual CLI invocation, fixes a pre-existing unit test failure, and corrects CONTRIBUTORS.md ordering. ## Changes ### New CLI Integration Tests (Robot Framework) Three new test cases in `robot/m4_e2e_verification.robot` with helper functions in `robot/helper_m4_e2e_verification.py`: | Test | CLI Command | Mocking Approach | |------|-------------|-----------------| | `CLI Plan Use Creates Plan With Subplan Config` | `plan use local/refactor-action local/monorepo` | Patches `_get_lifecycle_service`; mocks `get_action_by_name` + `use_action` | | `CLI Plan Execute Transitions With Subplans` | `plan execute <plan_id>` | Patches `_get_lifecycle_service`; mocks `get_plan` + `execute_plan` | | `CLI Plan Tree Displays Subplan Hierarchy` | `plan tree <plan_id> --format json` | Patches `get_container`; real `Decision` objects with `SUBPLAN_SPAWN`/`SUBPLAN_PARALLEL_SPAWN` types | All three follow the same pattern as the existing `plan-diff` test: Typer `CliRunner.invoke()` with mocked services. ### Pre-existing Unit Test Fix **File:** `features/steps/repositories_uncovered_branches_steps.py` **Scenario:** `repo branch cov upsert profile with schema version mismatch` (line 110) **Root cause:** Plain `sessionmaker` returns a new session per call. The `Given` step inserted via session A and committed session B (different session), so the `When` step's session C couldn't see the uncommitted data. **Fix:** Changed to `scoped_session(sessionmaker(...))` so all factory calls return the same thread-local session. ### Other Fixes - **CONTRIBUTORS.md**: Moved "Rui Hu" to correct alphabetical position (between Freeman and Khyari) - **CHANGELOG.md**: Updated #495 entry to document CLI test additions ## M4 Acceptance Criteria Verification All 7 criteria exercised by 10 E2E tests (7 existing + 3 new) + 8 smoke tests: | # | Criterion | Test(s) | Status | |---|-----------|---------|--------| | 1 | Subplans spawned during Execute via SubplanConfig | `spawn-subplans`, **`cli-plan-use`**, **`cli-plan-execute`** | PASS | | 2 | Parallel execution with max_parallel bounds | `parallel-exec` | PASS | | 3 | Three-way merge combines non-conflicting changes | `merge-clean` | PASS | | 4 | Merge conflicts surfaced with git markers | `merge-conflict` | PASS | | 5 | Parent plan tracks subplan statuses | `parent-tracking` | PASS | | 6 | Plan tree displays subplan hierarchy | `plan-tree`, **`cli-plan-tree`** | PASS | | 7 | Plan diff shows merged results | `plan-diff` (already uses CLI) | PASS | ## Quality Gates | Stage | Result | |-------|--------| | lint | pass | | format | pass (1074 files unchanged) | | typecheck | pass (0 errors) | | unit_tests | 8524 scenarios, 0 failures | | integration_tests | 1110/1118 pass (8 pre-existing failures in `cli_plan_context_commands.robot`, identical on master) | | coverage_report | 97% (threshold: 97%) | | security_scan | pass | | dead_code | pass | | docs | pass | | build | pass | | benchmark | pass | ## Files Changed - `CHANGELOG.md` — Updated #495 entry - `CONTRIBUTORS.md` — Fixed alphabetical ordering - `features/steps/repositories_uncovered_branches_steps.py` — `scoped_session` fix - `robot/helper_m4_e2e_verification.py` — 3 new helper functions + imports - `robot/m4_e2e_verification.robot` — 3 new test cases ISSUES CLOSED: #495

hurui200320 added this to the v3.3.0 milestone 2026-03-04 07:26:28 +00:00

hurui200320 added the

Type

Testing

label 2026-03-04 07:26:39 +00:00

hurui200320 added a new dependency 2026-03-04 07:27:03 +00:00

#495 test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure

hurui200320 referenced this pull request

2026-03-04 07:27:55 +00:00

test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure #495

hurui200320 force-pushed test/m4-acceptance-gate from 9f5430fc78 to 4d7dd670b2

2026-03-04 07:30:35 +00:00

Compare

hurui200320 self-assigned this 2026-03-04 07:54:49 +00:00

CoreRasurae commented

2026-03-04 11:16:51 +00:00

Code Review Report: Commit `4d7dd67` by Rui Hu

Branch: test/m4-acceptance-gate
Commit: 4d7dd670b2bee7744ab66723fbca90e664c3d385
Issue: #495 — M4 acceptance criteria for v3.3.0

Scope of the Commit

The commit modifies only 2 files: CHANGELOG.md (+5 lines) and CONTRIBUTORS.md (+1 line). No test code was added or modified. The commit message and issue comments claim that all existing M4 E2E tests pass against the final v3.3.0 implementation without modification.

HIGH Severity

1. Domain Model Bug: `is_subplan` and `is_root_plan` Can Be `True` Simultaneously

File: src/cleveragents/domain/models/core/plan.py:821-829

is_subplan returns True when parent_plan_id is not None. is_root_plan returns True when root_plan_id is None or root_plan_id == plan_id. A plan with parent_plan_id=X but root_plan_id=None is both a subplan and a root plan -- a logically contradictory state. No test creates this configuration. The E2E tests always set both or neither, never exposing this inconsistency.

Spec reference: The specification defines root_plan_id as "The top-most plan in the tree" and parent_plan_id as "present for child plans." A child plan should never be the root.

2. `SEQUENTIAL + fail_fast=False` Behavior Untested in `SubplanFailureHandler`

File: src/cleveragents/domain/models/core/plan.py:1047+

should_stop_others() returns True for SEQUENTIAL mode regardless of fail_fast -- this is the spec-documented behavior that "if one fails, subsequent child plans are not started." However, no test verifies this significant behavioral rule. Tests only cover PARALLEL mode.

3. Tests Labeled "E2E" Are Heavily Mocked -- Not True End-to-End

The M4 test files are named m4_e2e_verification.robot but the underlying Python helpers rely extensively on unittest.mock.MagicMock and patch:

plan_diff() (helper_m4_e2e_verification.py:303-335): Mocks _get_apply_service entirely -- the actual diff computation is never exercised. The test only verifies that the CLI calls the service, not that the service produces correct output.
plan_tree() (helper_m4_e2e_verification.py:237-295): Manually constructs a dict from subplan_statuses -- the actual agents plan tree CLI command is never invoked, unlike plan_diff which at least uses runner.invoke.
All correction tests (helper_m4_correction_subplan_smoke.py:114-228): Mock both the lifecycle service and correction service. No real correction flow runs.

These tests verify domain model construction and CLI argument routing, not actual end-to-end behavior. The merge tests (merge_clean, merge_conflict) are the exception -- they directly exercise GitMergeStrategy.merge() with real content and are genuinely functional.

4. `SubplanMergeService` Entirely Untested End-to-End

The bridge between domain-level SubplanMergeStrategy enum values and infrastructure merge implementations (SubplanMergeService) is never exercised by any M4 test. The conflict simulation scenarios only validate fixture JSON structure, never invoking actual merge operations. The issue acceptance criteria state "Three-way merge combines non-conflicting changes; conflicts surfaced" -- but this is verified only at the infrastructure layer (GitMergeStrategy) and never through the service that plans would actually use.

5. `DEPENDENCY_ORDERED` Execution Mode Exists But Is Completely Untested

File: src/cleveragents/domain/models/core/plan.py:129-137

The ExecutionMode enum has three values: SEQUENTIAL, PARALLEL, and DEPENDENCY_ORDERED. The fixture data references dependency_ordered, but no test ever creates a config with this mode. SubplanFailureHandler.should_stop_others implicitly treats DEPENDENCY_ORDERED like PARALLEL (returns False for stop) -- it's unclear whether this is correct since a dependency failure should arguably stop dependent subplans.

MEDIUM Severity

6. `parallel-max` Test Doesn't Verify Runtime Scheduling

File: robot/helper_m4_e2e_verification.py:343-451

The test verifies SubplanConfig validation constraints (max_parallel bounds 1-50) and manually counts statuses. It never actually schedules or executes subplans concurrently. The max_parallel constraint is tested as a Pydantic validation rule, not as a runtime scheduling limit. The issue acceptance criteria state "Parallel subplan execution works with max_parallel" -- this is only partially verified.

7. `SubplanAttempt` Class Never Instantiated Standalone in Tests

File: src/cleveragents/domain/models/core/plan.py:404-424

SubplanAttempt is constructed inline within SubplanStatus.previous_attempts in helper_m4_e2e_verification.py:604-612, but no standalone test validates its field constraints (attempt_number ge=1, started_at required, etc.). The fixture data describes previous attempts but no test parses them into actual model objects.

8. `as_cli_dict()` Has ~80% of Conditional Paths Untested

File: src/cleveragents/domain/models/core/plan.py:898-998

Tests only check plan_id, name, phase, state, and subplan_count. Approximately 80 lines of conditional rendering logic remain untested: automation_profile, invariants, arguments ordering, cost_metadata, skeleton_metadata, multi_project_metadata, last_completed_step, last_checkpoint_id, etc.

9. Unknown Error Types Silently Return `False` in `should_retry`

File: src/cleveragents/domain/models/core/plan.py:1047+

An error string matching neither RETRIABLE_FAILURES nor NON_RETRIABLE_ERRORS (e.g., "RuntimeError: unexpected") defaults to non-retriable. An error containing both a retriable and non-retriable substring (e.g., "ConfigurationError caused by TimeoutError") always returns False due to evaluation order. Neither case is tested.

10. `GitMergeStrategy` Negative Return Code Fallback Untested

File: src/cleveragents/infrastructure/sandbox/merge.py:130-136

The code handles returncode < 0 from git merge-file by falling back to SequentialMergeStrategy. This error path is never exercised by any test.

11. Dead Code: `validate_phase_state_consistency` Validator

File: src/cleveragents/domain/models/core/plan.py:738-748

The validator checks if self.processing_state is None and sets it. But processing_state has type ProcessingState with a default of QUEUED -- Pydantic will reject None before the validator runs. This code is unreachable.

LOW Severity

12. `CONTRIBUTORS.md` Alphabetical Ordering Broken

The existing entries (after Jeffrey Phillips Freeman) are sorted alphabetically by last name: Edwards, Khyari, Mendes. Rui Hu was appended at the end, but alphabetically "Hu" should come between "Freeman" and "Khyari."

13. Self-Referencing `parent_plan_id == plan_id` Passes Validation

File: src/cleveragents/domain/models/core/plan.py:257-288

PlanIdentity(plan_id=X, parent_plan_id=X) creates a plan that is its own parent. No model validator catches this logically invalid state.

14. No Encoding Guard on `GitMergeStrategy` Temp File Writes

File: src/cleveragents/infrastructure/sandbox/merge.py:88-155

Temp files are written with encoding="utf-8". Content containing invalid UTF-8 or null bytes would raise an unhandled exception. No test covers this edge case.

15. `JsonMergeStrategy` Ignores `base` Parameter

File: src/cleveragents/infrastructure/sandbox/merge.py:207-291

The three-way merge JsonMergeStrategy.merge(base, ours, theirs) accepts base but never uses it -- it performs a two-way merge of ours and theirs only. The docstring calls this a "future enhancement" but the behavior diverges from the spec's three-way merge requirement for JSON resources.

Commit & Process Checks

Check	Status	Notes
Commit message matches issue metadata	PASS	First line is exact match
Branch matches issue metadata	PASS	`test/m4-acceptance-gate`
PR #560 open targeting `master`	PASS	Milestone v3.3.0 assigned
CHANGELOG entry present	PASS	References #495
`ISSUES CLOSED: #495` in commit	PASS
Quality gates (nox)	PASS	8134 unit / 1137 integration / 97% coverage
Unchecked subtask: "Close milestone v3.3.0"	Expected	Can't close until PR merges

Summary

The commit itself (CHANGELOG + CONTRIBUTORS) is clean and minimal. The M4 E2E tests that this commit validates do pass and cover the core acceptance criteria at the domain model and infrastructure level. However, the review reveals that these tests are more accurately described as integration/model-validation tests with heavy mocking rather than true end-to-end tests. The most significant gaps are:

A domain model bug where subplan and root plan states can be contradictory (Finding #1)
Key behavioral rules verified only through manual model construction rather than through the actual service/CLI path (Findings #3, #4, #6)
An entire execution mode (DEPENDENCY_ORDERED) that exists in the codebase but is completely unexercised (Finding #5)
The SubplanMergeService -- the actual service that plans use for merging -- is never tested end-to-end (Finding #4)

These findings don't block the PR from merging (the commit accurately reports what it does), but they represent real test coverage gaps that should be tracked as follow-up work.

## Code Review Report: Commit `4d7dd67` by Rui Hu **Branch:** `test/m4-acceptance-gate` **Commit:** `4d7dd670b2bee7744ab66723fbca90e664c3d385` **Issue:** #495 — M4 acceptance criteria for v3.3.0 --- ### Scope of the Commit The commit modifies only **2 files**: `CHANGELOG.md` (+5 lines) and `CONTRIBUTORS.md` (+1 line). No test code was added or modified. The commit message and issue comments claim that all existing M4 E2E tests pass against the final v3.3.0 implementation without modification. --- ## HIGH Severity ### 1. Domain Model Bug: `is_subplan` and `is_root_plan` Can Be `True` Simultaneously **File:** `src/cleveragents/domain/models/core/plan.py:821-829` `is_subplan` returns `True` when `parent_plan_id is not None`. `is_root_plan` returns `True` when `root_plan_id is None or root_plan_id == plan_id`. A plan with `parent_plan_id=X` but `root_plan_id=None` is **both a subplan and a root plan** -- a logically contradictory state. No test creates this configuration. The E2E tests always set both or neither, never exposing this inconsistency. **Spec reference:** The specification defines `root_plan_id` as "The top-most plan in the tree" and `parent_plan_id` as "present for child plans." A child plan should never be the root. --- ### 2. `SEQUENTIAL + fail_fast=False` Behavior Untested in `SubplanFailureHandler` **File:** `src/cleveragents/domain/models/core/plan.py:1047+` `should_stop_others()` returns `True` for `SEQUENTIAL` mode regardless of `fail_fast` -- this is the spec-documented behavior that "if one fails, subsequent child plans are not started." However, no test verifies this significant behavioral rule. Tests only cover `PARALLEL` mode. --- ### 3. Tests Labeled "E2E" Are Heavily Mocked -- Not True End-to-End The M4 test files are named `m4_e2e_verification.robot` but the underlying Python helpers rely extensively on `unittest.mock.MagicMock` and `patch`: - **`plan_diff()`** (`helper_m4_e2e_verification.py:303-335`): Mocks `_get_apply_service` entirely -- the actual diff computation is never exercised. The test only verifies that the CLI calls the service, not that the service produces correct output. - **`plan_tree()`** (`helper_m4_e2e_verification.py:237-295`): Manually constructs a dict from `subplan_statuses` -- the actual `agents plan tree` CLI command is never invoked, unlike `plan_diff` which at least uses `runner.invoke`. - **All correction tests** (`helper_m4_correction_subplan_smoke.py:114-228`): Mock both the lifecycle service and correction service. No real correction flow runs. These tests verify domain model construction and CLI argument routing, not actual end-to-end behavior. The merge tests (`merge_clean`, `merge_conflict`) are the exception -- they directly exercise `GitMergeStrategy.merge()` with real content and are genuinely functional. --- ### 4. `SubplanMergeService` Entirely Untested End-to-End The bridge between domain-level `SubplanMergeStrategy` enum values and infrastructure merge implementations (`SubplanMergeService`) is never exercised by any M4 test. The conflict simulation scenarios only validate fixture JSON structure, never invoking actual merge operations. The issue acceptance criteria state "Three-way merge combines non-conflicting changes; conflicts surfaced" -- but this is verified only at the infrastructure layer (`GitMergeStrategy`) and never through the service that plans would actually use. --- ### 5. `DEPENDENCY_ORDERED` Execution Mode Exists But Is Completely Untested **File:** `src/cleveragents/domain/models/core/plan.py:129-137` The `ExecutionMode` enum has three values: `SEQUENTIAL`, `PARALLEL`, and `DEPENDENCY_ORDERED`. The fixture data references `dependency_ordered`, but no test ever creates a config with this mode. `SubplanFailureHandler.should_stop_others` implicitly treats `DEPENDENCY_ORDERED` like `PARALLEL` (returns `False` for stop) -- it's unclear whether this is correct since a dependency failure should arguably stop dependent subplans. --- ## MEDIUM Severity ### 6. `parallel-max` Test Doesn't Verify Runtime Scheduling **File:** `robot/helper_m4_e2e_verification.py:343-451` The test verifies `SubplanConfig` validation constraints (max_parallel bounds 1-50) and manually counts statuses. It never actually schedules or executes subplans concurrently. The `max_parallel` constraint is tested as a Pydantic validation rule, not as a runtime scheduling limit. The issue acceptance criteria state "Parallel subplan execution works with max_parallel" -- this is only partially verified. --- ### 7. `SubplanAttempt` Class Never Instantiated Standalone in Tests **File:** `src/cleveragents/domain/models/core/plan.py:404-424` `SubplanAttempt` is constructed inline within `SubplanStatus.previous_attempts` in `helper_m4_e2e_verification.py:604-612`, but no standalone test validates its field constraints (`attempt_number` ge=1, `started_at` required, etc.). The fixture data describes previous attempts but no test parses them into actual model objects. --- ### 8. `as_cli_dict()` Has ~80% of Conditional Paths Untested **File:** `src/cleveragents/domain/models/core/plan.py:898-998` Tests only check `plan_id`, `name`, `phase`, `state`, and `subplan_count`. Approximately 80 lines of conditional rendering logic remain untested: `automation_profile`, `invariants`, `arguments` ordering, `cost_metadata`, `skeleton_metadata`, `multi_project_metadata`, `last_completed_step`, `last_checkpoint_id`, etc. --- ### 9. Unknown Error Types Silently Return `False` in `should_retry` **File:** `src/cleveragents/domain/models/core/plan.py:1047+` An error string matching neither `RETRIABLE_FAILURES` nor `NON_RETRIABLE_ERRORS` (e.g., `"RuntimeError: unexpected"`) defaults to non-retriable. An error containing both a retriable and non-retriable substring (e.g., `"ConfigurationError caused by TimeoutError"`) always returns `False` due to evaluation order. Neither case is tested. --- ### 10. `GitMergeStrategy` Negative Return Code Fallback Untested **File:** `src/cleveragents/infrastructure/sandbox/merge.py:130-136` The code handles `returncode < 0` from `git merge-file` by falling back to `SequentialMergeStrategy`. This error path is never exercised by any test. --- ### 11. Dead Code: `validate_phase_state_consistency` Validator **File:** `src/cleveragents/domain/models/core/plan.py:738-748` The validator checks `if self.processing_state is None` and sets it. But `processing_state` has type `ProcessingState` with a default of `QUEUED` -- Pydantic will reject `None` before the validator runs. This code is unreachable. --- ## LOW Severity ### 12. `CONTRIBUTORS.md` Alphabetical Ordering Broken The existing entries (after Jeffrey Phillips Freeman) are sorted alphabetically by last name: Edwards, Khyari, Mendes. Rui Hu was appended at the end, but alphabetically "Hu" should come between "Freeman" and "Khyari." --- ### 13. Self-Referencing `parent_plan_id == plan_id` Passes Validation **File:** `src/cleveragents/domain/models/core/plan.py:257-288` `PlanIdentity(plan_id=X, parent_plan_id=X)` creates a plan that is its own parent. No model validator catches this logically invalid state. --- ### 14. No Encoding Guard on `GitMergeStrategy` Temp File Writes **File:** `src/cleveragents/infrastructure/sandbox/merge.py:88-155` Temp files are written with `encoding="utf-8"`. Content containing invalid UTF-8 or null bytes would raise an unhandled exception. No test covers this edge case. --- ### 15. `JsonMergeStrategy` Ignores `base` Parameter **File:** `src/cleveragents/infrastructure/sandbox/merge.py:207-291` The three-way merge `JsonMergeStrategy.merge(base, ours, theirs)` accepts `base` but never uses it -- it performs a two-way merge of `ours` and `theirs` only. The docstring calls this a "future enhancement" but the behavior diverges from the spec's three-way merge requirement for JSON resources. --- ## Commit & Process Checks | Check | Status | Notes | |:------|:-------|:------| | Commit message matches issue metadata | PASS | First line is exact match | | Branch matches issue metadata | PASS | `test/m4-acceptance-gate` | | PR #560 open targeting `master` | PASS | Milestone v3.3.0 assigned | | CHANGELOG entry present | PASS | References #495 | | `ISSUES CLOSED: #495` in commit | PASS | | | Quality gates (nox) | PASS | 8134 unit / 1137 integration / 97% coverage | | Unchecked subtask: "Close milestone v3.3.0" | Expected | Can't close until PR merges | --- ## Summary The commit itself (CHANGELOG + CONTRIBUTORS) is clean and minimal. The M4 E2E tests that this commit validates do pass and cover the core acceptance criteria at the domain model and infrastructure level. However, the review reveals that these tests are more accurately described as **integration/model-validation tests with heavy mocking** rather than true end-to-end tests. The most significant gaps are: 1. A **domain model bug** where subplan and root plan states can be contradictory (Finding #1) 2. Key behavioral rules verified only through manual model construction rather than through the actual service/CLI path (Findings #3, #4, #6) 3. An entire execution mode (`DEPENDENCY_ORDERED`) that exists in the codebase but is completely unexercised (Finding #5) 4. The `SubplanMergeService` -- the actual service that plans use for merging -- is never tested end-to-end (Finding #4) These findings don't block the PR from merging (the commit accurately reports what it does), but they represent real test coverage gaps that should be tracked as follow-up work.

freemo added the

labels 2026-03-05 03:28:20 +00:00

freemo reviewed 2026-03-05 03:35:15 +00:00

freemo left a comment

PM Review — Day 25

Status

Merge conflicts — @hurui200320 please rebase onto master. This is the primary blocker.
Review: Luis (CoreRasurae) posted a detailed comment-based review (5 HIGH, 6 MEDIUM, 4 LOW findings) but no formal PR review verdict was submitted. Luis's conclusion was that the findings should be tracked as follow-up work and don't necessarily block the merge.

Action Items

@hurui200320: Rebase test/m4-acceptance-gate onto current master to resolve conflicts.
@CoreRasurae: Please submit a formal PR review verdict (APPROVED or REQUEST_CHANGES). Your comment review was thorough but we need a formal verdict for the merge gate.
If the findings are follow-up items (as Luis recommended), we should track them as separate issues.

Priority

This is the M4 acceptance gate — it blocks milestone v3.3.0 closure. The milestone is 3 days overdue. Please prioritize after #559.

## PM Review — Day 25 ### Status - **Merge conflicts** — @hurui200320 please rebase onto `master`. This is the primary blocker. - **Review**: Luis (CoreRasurae) posted a detailed comment-based review (5 HIGH, 6 MEDIUM, 4 LOW findings) but no formal PR review verdict was submitted. Luis's conclusion was that the findings should be tracked as follow-up work and don't necessarily block the merge. ### Action Items 1. **@hurui200320**: Rebase `test/m4-acceptance-gate` onto current `master` to resolve conflicts. 2. **@CoreRasurae**: Please submit a formal PR review verdict (APPROVED or REQUEST_CHANGES). Your comment review was thorough but we need a formal verdict for the merge gate. 3. If the findings are follow-up items (as Luis recommended), we should track them as separate issues. ### Priority This is the **M4 acceptance gate** — it blocks milestone v3.3.0 closure. The milestone is 3 days overdue. Please prioritize after #559.

hurui200320 force-pushed test/m4-acceptance-gate from 4d7dd670b2 to 11b4c2dc43

2026-03-05 07:56:23 +00:00

Compare

hurui200320 force-pushed test/m4-acceptance-gate from 11b4c2dc43 to baac6bf19a

2026-03-05 10:40:50 +00:00

Compare

hurui200320 referenced this pull request

2026-03-05 10:41:39 +00:00

test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure #495

CoreRasurae approved these changes 2026-03-05 15:26:41 +00:00

Dismissed

CoreRasurae left a comment

Code Review Report — Commit `baac6bf` (Rui Hu)

Branch: test/m4-acceptance-gate
Issue: #495 — Validate M4 acceptance criteria for v3.3.0 milestone closure
Spec reference: docs/specification.md

Summary

The commit adds 3 new CLI-exercising integration tests (cli-plan-use, cli-plan-execute, cli-plan-tree) to the M4 E2E verification suite, fixes a pre-existing BDD session isolation bug, and updates CONTRIBUTORS.md. The tests pass and quality gates are green. However, the review identified 1 bug introduced by the fix itself, several test coverage gaps against the acceptance criteria, and a broader codebase pattern issue that the fix reveals.

Verdict: APPROVED — the commit achieves its stated goal and quality gates pass. Findings below are recommended for follow-up, not blockers.

BUGS

B1. `scoped_session` missing `.remove()` cleanup — HIGH

File: features/steps/repositories_uncovered_branches_steps.py:444

The fix correctly wraps sessionmaker in scoped_session to solve the split-session bug. However, scoped_session uses thread-local storage and requires .remove() to be called after each scenario to release the session. No cleanup exists:

No .remove() call in the step file
No scoped_session-aware cleanup in features/environment.py
No registered cleanup handler

Impact: Session/connection leaks across scenarios. The scoped_session registry holds a reference to the old session, preventing garbage collection of the session and its engine. In long test runs, this accumulates leaked in-memory SQLite connections.

Recommended fix: Register cleanup in the setup step:

@given("repo branch cov an in-memory database with automation profile tables")
def step_automation_db(context: Context):
    context.rb_engine = _make_engine()
    context.rb_session_factory = scoped_session(sessionmaker(bind=context.rb_engine))
    context.add_cleanup(context.rb_session_factory.remove)

B2. Same split-session pattern unfixed at line 142 — MEDIUM

File: features/steps/repositories_uncovered_branches_steps.py:142

The Tool/Validation scenarios at line 142 use the exact same risky pattern:

context.rb_session_factory = sessionmaker(bind=context.rb_engine)

This factory is passed to ToolRepository, ToolRegistryRepository, and ValidationAttachmentRepository, while step code separately calls session_factory() to commit. This is the same root cause that was fixed at line 444, and may break under different conditions.

B3. ~13 other step files share the risky `sessionmaker` pattern — LOW

Across features/steps/, approximately 13 other files pass a bare sessionmaker(bind=engine) as session_factory= to repositories that internally call it to obtain sessions, while step code also calls it independently. Files using the safer lambda: session pattern (e.g., plan_persistence_steps.py, action_persistence_steps.py, decision_persistence_steps.py) are not affected. This commit's fix was surgical, but the broader pattern persists.

TEST COVERAGE GAPS

C1. `cli-plan-use` does not verify the project argument — MEDIUM

File: robot/helper_m4_e2e_verification.py:756-766

The test verifies get_action_by_name("local/refactor-action") and that use_action was called, and checks the action_name argument. But it never verifies the project argument "local/monorepo" was passed through to use_action(project_links=...). The acceptance criteria (issue #495 and the milestone) explicitly require verifying agents plan use local/refactor-action local/monorepo — the project binding is half of this command.

C2. `cli-plan-execute` does not verify subplan statuses in output — MEDIUM

File: robot/helper_m4_e2e_verification.py:806-822

The mock returns a plan with 3 subplan statuses (children A, B, C), but the test only checks that the plan ID appears in the output. It does not verify that the CLI renders subplan-related information (subplan count, statuses, etc.). The entire purpose of M4 is subplan management — the test should confirm the CLI output reflects it.

C3. `cli-plan-tree` JSON structure verification is shallow — LOW/MEDIUM

File: robot/helper_m4_e2e_verification.py:937-947

The test does substring matching on the serialized JSON:

if "subplan_parallel_spawn" not in tree_str:

This does not verify:

Tree hierarchy (that subplan_spawn nodes are children of subplan_parallel_spawn)
Node count (5 decisions expected)
Parent-child relationships via decision_id
Presence of downstream_plan_ids in the output

Since build_decision_tree() returns structured nested dicts, the test should walk tree_data and verify the nesting depth and relationships.

C4. No new CLI integration test for `plan diff` — LOW

Issue #495 acceptance criteria includes "agents plan diff <plan_id> shows merged results". The pre-existing plan-diff test (lines 311-343) covers this through a mocked apply service, but no new CLI-exercising test was added to match the pattern of the other 3 new tests.

TEST FLAWS

F1. Fragile argument extraction in `cli_plan_use` — LOW

File: robot/helper_m4_e2e_verification.py:758-766

The use_action call_args extraction tries 3 different strategies to find the action_name:

action_arg = call_kwargs.kwargs.get(
    "action_name",
    call_kwargs[1].get("action_name") if len(call_kwargs) > 1 else None,
)
if action_arg is None and call_kwargs.args:
    action_arg = call_kwargs.args[0]

This defensive-to-the-point-of-fragile approach suggests uncertainty about the CLI-to-service calling convention. In practice, the CLI at plan.py:1431 always calls use_action(action_name=..., ...) as keyword arguments, so only the first .kwargs.get("action_name") branch is ever exercised. The other branches are dead code that obscures test intent.

F2. No negative/error path tests for the 3 new CLI commands — LOW

None of the 3 new tests exercise failure scenarios:

Invalid action name in plan use
Read-only plan guard in plan execute
Empty decision tree in plan tree

SPECIFICATION ALIGNMENT — No Issues

The tests correctly model the specification's decision hierarchy:

prompt_definition as root (spec line 18413)
strategy_choice under it (spec line 18616)
subplan_parallel_spawn grouping concurrent child spawns (spec line 18177)
Individual subplan_spawn under the parallel container (spec line 18288)
All decisions belonging to the parent plan (spec lines 18175-18176)

The DecisionType enum values, ULID formats, and Decision model field usage are all correct per the codebase definitions.

PERFORMANCE — No Issues

No performance concerns in the test code.

SECURITY — No Issues

No credential exposure, injection vectors, or unsafe patterns. Mocks are used appropriately.

Final Verdict

The commit achieves its stated goal: the M4 E2E suite runs green and validates the milestone criteria. The primary concern is B1 (session leak from missing scoped_session.remove()), which is a latent bug introduced by the fix. C1 and C2 are meaningful coverage gaps where the tests verify the "happy path plumbing" but miss verifying the specific M4-relevant outputs (project binding and subplan statuses). B2 and B3 represent pre-existing technical debt that this commit's pattern exposes but does not address.

Approved — recommended to address B1 as a fast follow-up.

# Code Review Report — Commit `baac6bf` (Rui Hu) **Branch:** `test/m4-acceptance-gate` **Issue:** #495 — Validate M4 acceptance criteria for v3.3.0 milestone closure **Spec reference:** `docs/specification.md` --- ## Summary The commit adds 3 new CLI-exercising integration tests (`cli-plan-use`, `cli-plan-execute`, `cli-plan-tree`) to the M4 E2E verification suite, fixes a pre-existing BDD session isolation bug, and updates CONTRIBUTORS.md. The tests pass and quality gates are green. However, the review identified **1 bug introduced by the fix itself**, **several test coverage gaps** against the acceptance criteria, and **a broader codebase pattern issue** that the fix reveals. **Verdict: APPROVED** — the commit achieves its stated goal and quality gates pass. Findings below are recommended for follow-up, not blockers. --- ## BUGS ### B1. `scoped_session` missing `.remove()` cleanup — HIGH **File:** `features/steps/repositories_uncovered_branches_steps.py:444` The fix correctly wraps `sessionmaker` in `scoped_session` to solve the split-session bug. However, `scoped_session` uses thread-local storage and **requires** `.remove()` to be called after each scenario to release the session. No cleanup exists: - No `.remove()` call in the step file - No `scoped_session`-aware cleanup in `features/environment.py` - No registered cleanup handler **Impact:** Session/connection leaks across scenarios. The `scoped_session` registry holds a reference to the old session, preventing garbage collection of the session and its engine. In long test runs, this accumulates leaked in-memory SQLite connections. **Recommended fix:** Register cleanup in the setup step: ```python @given("repo branch cov an in-memory database with automation profile tables") def step_automation_db(context: Context): context.rb_engine = _make_engine() context.rb_session_factory = scoped_session(sessionmaker(bind=context.rb_engine)) context.add_cleanup(context.rb_session_factory.remove) ``` ### B2. Same split-session pattern unfixed at line 142 — MEDIUM **File:** `features/steps/repositories_uncovered_branches_steps.py:142` The Tool/Validation scenarios at line 142 use the exact same risky pattern: ```python context.rb_session_factory = sessionmaker(bind=context.rb_engine) ``` This factory is passed to `ToolRepository`, `ToolRegistryRepository`, and `ValidationAttachmentRepository`, while step code separately calls `session_factory()` to commit. This is the same root cause that was fixed at line 444, and may break under different conditions. ### B3. ~13 other step files share the risky `sessionmaker` pattern — LOW Across `features/steps/`, approximately 13 other files pass a bare `sessionmaker(bind=engine)` as `session_factory=` to repositories that internally call it to obtain sessions, while step code also calls it independently. Files using the safer `lambda: session` pattern (e.g., `plan_persistence_steps.py`, `action_persistence_steps.py`, `decision_persistence_steps.py`) are not affected. This commit's fix was surgical, but the broader pattern persists. --- ## TEST COVERAGE GAPS ### C1. `cli-plan-use` does not verify the project argument — MEDIUM **File:** `robot/helper_m4_e2e_verification.py:756-766` The test verifies `get_action_by_name("local/refactor-action")` and that `use_action` was called, and checks the action_name argument. But it never verifies the project argument `"local/monorepo"` was passed through to `use_action(project_links=...)`. The acceptance criteria (issue #495 and the milestone) explicitly require verifying `agents plan use local/refactor-action local/monorepo` — the project binding is half of this command. ### C2. `cli-plan-execute` does not verify subplan statuses in output — MEDIUM **File:** `robot/helper_m4_e2e_verification.py:806-822` The mock returns a plan with 3 subplan statuses (children A, B, C), but the test only checks that the plan ID appears in the output. It does not verify that the CLI renders subplan-related information (subplan count, statuses, etc.). The entire purpose of M4 is subplan management — the test should confirm the CLI output reflects it. ### C3. `cli-plan-tree` JSON structure verification is shallow — LOW/MEDIUM **File:** `robot/helper_m4_e2e_verification.py:937-947` The test does substring matching on the serialized JSON: ```python if "subplan_parallel_spawn" not in tree_str: ``` This does not verify: - Tree hierarchy (that `subplan_spawn` nodes are children of `subplan_parallel_spawn`) - Node count (5 decisions expected) - Parent-child relationships via `decision_id` - Presence of `downstream_plan_ids` in the output Since `build_decision_tree()` returns structured nested dicts, the test should walk `tree_data` and verify the nesting depth and relationships. ### C4. No new CLI integration test for `plan diff` — LOW Issue #495 acceptance criteria includes "`agents plan diff <plan_id>` shows merged results". The pre-existing `plan-diff` test (lines 311-343) covers this through a mocked apply service, but no new CLI-exercising test was added to match the pattern of the other 3 new tests. --- ## TEST FLAWS ### F1. Fragile argument extraction in `cli_plan_use` — LOW **File:** `robot/helper_m4_e2e_verification.py:758-766` The `use_action` call_args extraction tries 3 different strategies to find the `action_name`: ```python action_arg = call_kwargs.kwargs.get( "action_name", call_kwargs[1].get("action_name") if len(call_kwargs) > 1 else None, ) if action_arg is None and call_kwargs.args: action_arg = call_kwargs.args[0] ``` This defensive-to-the-point-of-fragile approach suggests uncertainty about the CLI-to-service calling convention. In practice, the CLI at `plan.py:1431` always calls `use_action(action_name=..., ...)` as keyword arguments, so only the first `.kwargs.get("action_name")` branch is ever exercised. The other branches are dead code that obscures test intent. ### F2. No negative/error path tests for the 3 new CLI commands — LOW None of the 3 new tests exercise failure scenarios: - Invalid action name in `plan use` - Read-only plan guard in `plan execute` - Empty decision tree in `plan tree` --- ## SPECIFICATION ALIGNMENT — No Issues The tests correctly model the specification's decision hierarchy: - `prompt_definition` as root (spec line 18413) - `strategy_choice` under it (spec line 18616) - `subplan_parallel_spawn` grouping concurrent child spawns (spec line 18177) - Individual `subplan_spawn` under the parallel container (spec line 18288) - All decisions belonging to the parent plan (spec lines 18175-18176) The `DecisionType` enum values, ULID formats, and `Decision` model field usage are all correct per the codebase definitions. --- ## PERFORMANCE — No Issues No performance concerns in the test code. ## SECURITY — No Issues No credential exposure, injection vectors, or unsafe patterns. Mocks are used appropriately. --- ## Final Verdict The commit achieves its stated goal: the M4 E2E suite runs green and validates the milestone criteria. The primary concern is **B1** (session leak from missing `scoped_session.remove()`), which is a latent bug introduced by the fix. **C1** and **C2** are meaningful coverage gaps where the tests verify the "happy path plumbing" but miss verifying the specific M4-relevant outputs (project binding and subplan statuses). **B2** and **B3** represent pre-existing technical debt that this commit's pattern exposes but does not address. **Approved** — recommended to address B1 as a fast follow-up.

hurui200320 force-pushed test/m4-acceptance-gate from baac6bf19a to d02aa33150

2026-03-06 03:17:42 +00:00

Compare

hurui200320 dismissed CoreRasurae's review 2026-03-06 03:17:42 +00:00

Reason:

New commits pushed, approval review dismissed automatically according to repository settings

hurui200320 scheduled this pull request to auto merge when all checks succeed 2026-03-06 03:17:57 +00:00

hurui200320 merged commit 09d92ac67b into master

2026-03-06 03:22:45 +00:00

hurui200320 referenced this issue from a commit

2026-03-06 03:22:46 +00:00

test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure (#560)

hurui200320 deleted branch test/m4-acceptance-gate

2026-03-06 03:22:46 +00:00

brent.edwards referenced this pull request

2026-03-10 00:42:02 +00:00

bug(test): M1–M6 E2E verification suites use mocks instead of exercising the real system #658

hurui200320 referenced this issue from a commit

2026-03-11 07:01:52 +00:00

test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure

~~hurui200320 referenced this pull request 2026-03-11 07:02:08 +00:00~~

test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure #681

hurui200320 referenced this pull request

2026-03-11 07:03:25 +00:00

test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure #495

Sign in to join this conversation.

3 Participants

Notifications

Due Date

No due date set.

Blocks

#495 test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure

cleveragents/cleveragents-core

Reference: cleveragents/cleveragents-core#560

test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure #560

Summary

Changes

New CLI Integration Tests (Robot Framework)

Pre-existing Unit Test Fix

Other Fixes

M4 Acceptance Criteria Verification

Quality Gates

Files Changed

Code Review Report: Commit 4d7dd67 by Rui Hu

Scope of the Commit

HIGH Severity

1. Domain Model Bug: is_subplan and is_root_plan Can Be True Simultaneously

2. SEQUENTIAL + fail_fast=False Behavior Untested in SubplanFailureHandler

3. Tests Labeled "E2E" Are Heavily Mocked -- Not True End-to-End

4. SubplanMergeService Entirely Untested End-to-End

5. DEPENDENCY_ORDERED Execution Mode Exists But Is Completely Untested

MEDIUM Severity

6. parallel-max Test Doesn't Verify Runtime Scheduling

7. SubplanAttempt Class Never Instantiated Standalone in Tests

8. as_cli_dict() Has ~80% of Conditional Paths Untested

9. Unknown Error Types Silently Return False in should_retry

10. GitMergeStrategy Negative Return Code Fallback Untested

11. Dead Code: validate_phase_state_consistency Validator

LOW Severity

12. CONTRIBUTORS.md Alphabetical Ordering Broken

13. Self-Referencing parent_plan_id == plan_id Passes Validation

14. No Encoding Guard on GitMergeStrategy Temp File Writes

15. JsonMergeStrategy Ignores base Parameter

Commit & Process Checks

Summary

PM Review — Day 25

Status

Action Items

Priority

Code Review Report — Commit baac6bf (Rui Hu)

Summary

BUGS

B1. scoped_session missing .remove() cleanup — HIGH

B2. Same split-session pattern unfixed at line 142 — MEDIUM

B3. ~13 other step files share the risky sessionmaker pattern — LOW

TEST COVERAGE GAPS

C1. cli-plan-use does not verify the project argument — MEDIUM

C2. cli-plan-execute does not verify subplan statuses in output — MEDIUM

C3. cli-plan-tree JSON structure verification is shallow — LOW/MEDIUM

C4. No new CLI integration test for plan diff — LOW

TEST FLAWS

F1. Fragile argument extraction in cli_plan_use — LOW

F2. No negative/error path tests for the 3 new CLI commands — LOW

SPECIFICATION ALIGNMENT — No Issues

PERFORMANCE — No Issues

SECURITY — No Issues

Final Verdict

Code Review Report: Commit `4d7dd67` by Rui Hu

1. Domain Model Bug: `is_subplan` and `is_root_plan` Can Be `True` Simultaneously

2. `SEQUENTIAL + fail_fast=False` Behavior Untested in `SubplanFailureHandler`

4. `SubplanMergeService` Entirely Untested End-to-End

5. `DEPENDENCY_ORDERED` Execution Mode Exists But Is Completely Untested

6. `parallel-max` Test Doesn't Verify Runtime Scheduling

7. `SubplanAttempt` Class Never Instantiated Standalone in Tests

8. `as_cli_dict()` Has ~80% of Conditional Paths Untested

9. Unknown Error Types Silently Return `False` in `should_retry`

10. `GitMergeStrategy` Negative Return Code Fallback Untested

11. Dead Code: `validate_phase_state_consistency` Validator

12. `CONTRIBUTORS.md` Alphabetical Ordering Broken

13. Self-Referencing `parent_plan_id == plan_id` Passes Validation

14. No Encoding Guard on `GitMergeStrategy` Temp File Writes

15. `JsonMergeStrategy` Ignores `base` Parameter

Code Review Report — Commit `baac6bf` (Rui Hu)

B1. `scoped_session` missing `.remove()` cleanup — HIGH

B3. ~13 other step files share the risky `sessionmaker` pattern — LOW

C1. `cli-plan-use` does not verify the project argument — MEDIUM

C2. `cli-plan-execute` does not verify subplan statuses in output — MEDIUM

C3. `cli-plan-tree` JSON structure verification is shallow — LOW/MEDIUM

C4. No new CLI integration test for `plan diff` — LOW

F1. Fragile argument extraction in `cli_plan_use` — LOW