test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure #560
No reviewers
Labels
No labels
auto/needs-reevaluation
controller-managed
auto/blocked-by-deps
auto/ci-timeout
auto/claimed-implementer
auto/claimed-merge
auto/claimed-reviewer
auto/driver-down
auto/invariant-violation
auto/last-attempt-tier-0
auto/last-attempt-tier-1
auto/last-attempt-tier-2
auto/last-attempt-tier-min
Automation Tracking
auto/needs-conflict-resolution
auto/needs-implementer
auto/postmortem
auto/ready-to-merge
auto/restart-throttled
auto/revert
auto/sentinel
auto/stale-inactivity
auto/unstable
Blocked
Bounty
$100
Bounty
$1000
Bounty
$10000
Bounty
$20
Bounty
$2000
Bounty
$250
Bounty
$50
Bounty
$500
Bounty
$5000
Bounty
$750
MoSCoW
Could have
MoSCoW
Must have
MoSCoW
Should have
Needs Feedback
Points
1
Points
13
Points
2
Points
21
Points
3
Points
34
Points
5
Points
55
Points
8
Points
88
Priority
Backlog
Priority
CI Blocker
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Signed-off: Owner
Signed-off: Scrum Master
Signed-off: Tech Lead
Spike
State
Completed
State
Duplicate
State
In Progress
State
In Review
State
Paused
State
Unverified
State
Verified
State
Wont Do
Type
Automation
Type
Bug
Type
Discussion
Type
Documentation
Type
Epic
Type
Feature
Type
Legendary
Type
Refactor
Type
Support
Type
Task
Type
Testing
No project
No assignees
3 participants
Notifications
Due date
No due date set.
Blocks
#495 test(e2e): validate M4 acceptance criteria for v3.3.0 milestone closure
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core!560
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "test/m4-acceptance-gate"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Validates all M4 acceptance criteria for v3.3.0 milestone closure. Adds CLI-exercising integration tests for the three subplan commands that lacked actual CLI invocation, fixes a pre-existing unit test failure, and corrects CONTRIBUTORS.md ordering.
Changes
New CLI Integration Tests (Robot Framework)
Three new test cases in
robot/m4_e2e_verification.robotwith helper functions inrobot/helper_m4_e2e_verification.py:CLI Plan Use Creates Plan With Subplan Configplan use local/refactor-action local/monorepo_get_lifecycle_service; mocksget_action_by_name+use_actionCLI Plan Execute Transitions With Subplansplan execute <plan_id>_get_lifecycle_service; mocksget_plan+execute_planCLI Plan Tree Displays Subplan Hierarchyplan tree <plan_id> --format jsonget_container; realDecisionobjects withSUBPLAN_SPAWN/SUBPLAN_PARALLEL_SPAWNtypesAll three follow the same pattern as the existing
plan-difftest: TyperCliRunner.invoke()with mocked services.Pre-existing Unit Test Fix
File:
features/steps/repositories_uncovered_branches_steps.pyScenario:
repo branch cov upsert profile with schema version mismatch(line 110)Root cause: Plain
sessionmakerreturns a new session per call. TheGivenstep inserted via session A and committed session B (different session), so theWhenstep's session C couldn't see the uncommitted data.Fix: Changed to
scoped_session(sessionmaker(...))so all factory calls return the same thread-local session.Other Fixes
M4 Acceptance Criteria Verification
All 7 criteria exercised by 10 E2E tests (7 existing + 3 new) + 8 smoke tests:
spawn-subplans,cli-plan-use,cli-plan-executeparallel-execmerge-cleanmerge-conflictparent-trackingplan-tree,cli-plan-treeplan-diff(already uses CLI)Quality Gates
cli_plan_context_commands.robot, identical on master)Files Changed
CHANGELOG.md— Updated #495 entryCONTRIBUTORS.md— Fixed alphabetical orderingfeatures/steps/repositories_uncovered_branches_steps.py—scoped_sessionfixrobot/helper_m4_e2e_verification.py— 3 new helper functions + importsrobot/m4_e2e_verification.robot— 3 new test casesISSUES CLOSED: #495
9f5430fc784d7dd670b2Code Review Report: Commit
4d7dd67by Rui HuBranch:
test/m4-acceptance-gateCommit:
4d7dd670b2bee7744ab66723fbca90e664c3d385Issue: #495 — M4 acceptance criteria for v3.3.0
Scope of the Commit
The commit modifies only 2 files:
CHANGELOG.md(+5 lines) andCONTRIBUTORS.md(+1 line). No test code was added or modified. The commit message and issue comments claim that all existing M4 E2E tests pass against the final v3.3.0 implementation without modification.HIGH Severity
1. Domain Model Bug:
is_subplanandis_root_planCan BeTrueSimultaneouslyFile:
src/cleveragents/domain/models/core/plan.py:821-829is_subplanreturnsTruewhenparent_plan_id is not None.is_root_planreturnsTruewhenroot_plan_id is None or root_plan_id == plan_id. A plan withparent_plan_id=Xbutroot_plan_id=Noneis both a subplan and a root plan -- a logically contradictory state. No test creates this configuration. The E2E tests always set both or neither, never exposing this inconsistency.Spec reference: The specification defines
root_plan_idas "The top-most plan in the tree" andparent_plan_idas "present for child plans." A child plan should never be the root.2.
SEQUENTIAL + fail_fast=FalseBehavior Untested inSubplanFailureHandlerFile:
src/cleveragents/domain/models/core/plan.py:1047+should_stop_others()returnsTrueforSEQUENTIALmode regardless offail_fast-- this is the spec-documented behavior that "if one fails, subsequent child plans are not started." However, no test verifies this significant behavioral rule. Tests only coverPARALLELmode.3. Tests Labeled "E2E" Are Heavily Mocked -- Not True End-to-End
The M4 test files are named
m4_e2e_verification.robotbut the underlying Python helpers rely extensively onunittest.mock.MagicMockandpatch:plan_diff()(helper_m4_e2e_verification.py:303-335): Mocks_get_apply_serviceentirely -- the actual diff computation is never exercised. The test only verifies that the CLI calls the service, not that the service produces correct output.plan_tree()(helper_m4_e2e_verification.py:237-295): Manually constructs a dict fromsubplan_statuses-- the actualagents plan treeCLI command is never invoked, unlikeplan_diffwhich at least usesrunner.invoke.helper_m4_correction_subplan_smoke.py:114-228): Mock both the lifecycle service and correction service. No real correction flow runs.These tests verify domain model construction and CLI argument routing, not actual end-to-end behavior. The merge tests (
merge_clean,merge_conflict) are the exception -- they directly exerciseGitMergeStrategy.merge()with real content and are genuinely functional.4.
SubplanMergeServiceEntirely Untested End-to-EndThe bridge between domain-level
SubplanMergeStrategyenum values and infrastructure merge implementations (SubplanMergeService) is never exercised by any M4 test. The conflict simulation scenarios only validate fixture JSON structure, never invoking actual merge operations. The issue acceptance criteria state "Three-way merge combines non-conflicting changes; conflicts surfaced" -- but this is verified only at the infrastructure layer (GitMergeStrategy) and never through the service that plans would actually use.5.
DEPENDENCY_ORDEREDExecution Mode Exists But Is Completely UntestedFile:
src/cleveragents/domain/models/core/plan.py:129-137The
ExecutionModeenum has three values:SEQUENTIAL,PARALLEL, andDEPENDENCY_ORDERED. The fixture data referencesdependency_ordered, but no test ever creates a config with this mode.SubplanFailureHandler.should_stop_othersimplicitly treatsDEPENDENCY_ORDEREDlikePARALLEL(returnsFalsefor stop) -- it's unclear whether this is correct since a dependency failure should arguably stop dependent subplans.MEDIUM Severity
6.
parallel-maxTest Doesn't Verify Runtime SchedulingFile:
robot/helper_m4_e2e_verification.py:343-451The test verifies
SubplanConfigvalidation constraints (max_parallel bounds 1-50) and manually counts statuses. It never actually schedules or executes subplans concurrently. Themax_parallelconstraint is tested as a Pydantic validation rule, not as a runtime scheduling limit. The issue acceptance criteria state "Parallel subplan execution works with max_parallel" -- this is only partially verified.7.
SubplanAttemptClass Never Instantiated Standalone in TestsFile:
src/cleveragents/domain/models/core/plan.py:404-424SubplanAttemptis constructed inline withinSubplanStatus.previous_attemptsinhelper_m4_e2e_verification.py:604-612, but no standalone test validates its field constraints (attempt_numberge=1,started_atrequired, etc.). The fixture data describes previous attempts but no test parses them into actual model objects.8.
as_cli_dict()Has ~80% of Conditional Paths UntestedFile:
src/cleveragents/domain/models/core/plan.py:898-998Tests only check
plan_id,name,phase,state, andsubplan_count. Approximately 80 lines of conditional rendering logic remain untested:automation_profile,invariants,argumentsordering,cost_metadata,skeleton_metadata,multi_project_metadata,last_completed_step,last_checkpoint_id, etc.9. Unknown Error Types Silently Return
Falseinshould_retryFile:
src/cleveragents/domain/models/core/plan.py:1047+An error string matching neither
RETRIABLE_FAILURESnorNON_RETRIABLE_ERRORS(e.g.,"RuntimeError: unexpected") defaults to non-retriable. An error containing both a retriable and non-retriable substring (e.g.,"ConfigurationError caused by TimeoutError") always returnsFalsedue to evaluation order. Neither case is tested.10.
GitMergeStrategyNegative Return Code Fallback UntestedFile:
src/cleveragents/infrastructure/sandbox/merge.py:130-136The code handles
returncode < 0fromgit merge-fileby falling back toSequentialMergeStrategy. This error path is never exercised by any test.11. Dead Code:
validate_phase_state_consistencyValidatorFile:
src/cleveragents/domain/models/core/plan.py:738-748The validator checks
if self.processing_state is Noneand sets it. Butprocessing_statehas typeProcessingStatewith a default ofQUEUED-- Pydantic will rejectNonebefore the validator runs. This code is unreachable.LOW Severity
12.
CONTRIBUTORS.mdAlphabetical Ordering BrokenThe existing entries (after Jeffrey Phillips Freeman) are sorted alphabetically by last name: Edwards, Khyari, Mendes. Rui Hu was appended at the end, but alphabetically "Hu" should come between "Freeman" and "Khyari."
13. Self-Referencing
parent_plan_id == plan_idPasses ValidationFile:
src/cleveragents/domain/models/core/plan.py:257-288PlanIdentity(plan_id=X, parent_plan_id=X)creates a plan that is its own parent. No model validator catches this logically invalid state.14. No Encoding Guard on
GitMergeStrategyTemp File WritesFile:
src/cleveragents/infrastructure/sandbox/merge.py:88-155Temp files are written with
encoding="utf-8". Content containing invalid UTF-8 or null bytes would raise an unhandled exception. No test covers this edge case.15.
JsonMergeStrategyIgnoresbaseParameterFile:
src/cleveragents/infrastructure/sandbox/merge.py:207-291The three-way merge
JsonMergeStrategy.merge(base, ours, theirs)acceptsbasebut never uses it -- it performs a two-way merge ofoursandtheirsonly. The docstring calls this a "future enhancement" but the behavior diverges from the spec's three-way merge requirement for JSON resources.Commit & Process Checks
test/m4-acceptance-gatemasterISSUES CLOSED: #495in commitSummary
The commit itself (CHANGELOG + CONTRIBUTORS) is clean and minimal. The M4 E2E tests that this commit validates do pass and cover the core acceptance criteria at the domain model and infrastructure level. However, the review reveals that these tests are more accurately described as integration/model-validation tests with heavy mocking rather than true end-to-end tests. The most significant gaps are:
DEPENDENCY_ORDERED) that exists in the codebase but is completely unexercised (Finding #5)SubplanMergeService-- the actual service that plans use for merging -- is never tested end-to-end (Finding #4)These findings don't block the PR from merging (the commit accurately reports what it does), but they represent real test coverage gaps that should be tracked as follow-up work.
PM Review — Day 25
Status
master. This is the primary blocker.Action Items
test/m4-acceptance-gateonto currentmasterto resolve conflicts.Priority
This is the M4 acceptance gate — it blocks milestone v3.3.0 closure. The milestone is 3 days overdue. Please prioritize after #559.
4d7dd670b211b4c2dc4311b4c2dc43baac6bf19aCode Review Report — Commit
baac6bf(Rui Hu)Branch:
test/m4-acceptance-gateIssue: #495 — Validate M4 acceptance criteria for v3.3.0 milestone closure
Spec reference:
docs/specification.mdSummary
The commit adds 3 new CLI-exercising integration tests (
cli-plan-use,cli-plan-execute,cli-plan-tree) to the M4 E2E verification suite, fixes a pre-existing BDD session isolation bug, and updates CONTRIBUTORS.md. The tests pass and quality gates are green. However, the review identified 1 bug introduced by the fix itself, several test coverage gaps against the acceptance criteria, and a broader codebase pattern issue that the fix reveals.Verdict: APPROVED — the commit achieves its stated goal and quality gates pass. Findings below are recommended for follow-up, not blockers.
BUGS
B1.
scoped_sessionmissing.remove()cleanup — HIGHFile:
features/steps/repositories_uncovered_branches_steps.py:444The fix correctly wraps
sessionmakerinscoped_sessionto solve the split-session bug. However,scoped_sessionuses thread-local storage and requires.remove()to be called after each scenario to release the session. No cleanup exists:.remove()call in the step filescoped_session-aware cleanup infeatures/environment.pyImpact: Session/connection leaks across scenarios. The
scoped_sessionregistry holds a reference to the old session, preventing garbage collection of the session and its engine. In long test runs, this accumulates leaked in-memory SQLite connections.Recommended fix: Register cleanup in the setup step:
B2. Same split-session pattern unfixed at line 142 — MEDIUM
File:
features/steps/repositories_uncovered_branches_steps.py:142The Tool/Validation scenarios at line 142 use the exact same risky pattern:
This factory is passed to
ToolRepository,ToolRegistryRepository, andValidationAttachmentRepository, while step code separately callssession_factory()to commit. This is the same root cause that was fixed at line 444, and may break under different conditions.B3. ~13 other step files share the risky
sessionmakerpattern — LOWAcross
features/steps/, approximately 13 other files pass a baresessionmaker(bind=engine)assession_factory=to repositories that internally call it to obtain sessions, while step code also calls it independently. Files using the saferlambda: sessionpattern (e.g.,plan_persistence_steps.py,action_persistence_steps.py,decision_persistence_steps.py) are not affected. This commit's fix was surgical, but the broader pattern persists.TEST COVERAGE GAPS
C1.
cli-plan-usedoes not verify the project argument — MEDIUMFile:
robot/helper_m4_e2e_verification.py:756-766The test verifies
get_action_by_name("local/refactor-action")and thatuse_actionwas called, and checks the action_name argument. But it never verifies the project argument"local/monorepo"was passed through touse_action(project_links=...). The acceptance criteria (issue #495 and the milestone) explicitly require verifyingagents plan use local/refactor-action local/monorepo— the project binding is half of this command.C2.
cli-plan-executedoes not verify subplan statuses in output — MEDIUMFile:
robot/helper_m4_e2e_verification.py:806-822The mock returns a plan with 3 subplan statuses (children A, B, C), but the test only checks that the plan ID appears in the output. It does not verify that the CLI renders subplan-related information (subplan count, statuses, etc.). The entire purpose of M4 is subplan management — the test should confirm the CLI output reflects it.
C3.
cli-plan-treeJSON structure verification is shallow — LOW/MEDIUMFile:
robot/helper_m4_e2e_verification.py:937-947The test does substring matching on the serialized JSON:
This does not verify:
subplan_spawnnodes are children ofsubplan_parallel_spawn)decision_iddownstream_plan_idsin the outputSince
build_decision_tree()returns structured nested dicts, the test should walktree_dataand verify the nesting depth and relationships.C4. No new CLI integration test for
plan diff— LOWIssue #495 acceptance criteria includes "
agents plan diff <plan_id>shows merged results". The pre-existingplan-difftest (lines 311-343) covers this through a mocked apply service, but no new CLI-exercising test was added to match the pattern of the other 3 new tests.TEST FLAWS
F1. Fragile argument extraction in
cli_plan_use— LOWFile:
robot/helper_m4_e2e_verification.py:758-766The
use_actioncall_args extraction tries 3 different strategies to find theaction_name:This defensive-to-the-point-of-fragile approach suggests uncertainty about the CLI-to-service calling convention. In practice, the CLI at
plan.py:1431always callsuse_action(action_name=..., ...)as keyword arguments, so only the first.kwargs.get("action_name")branch is ever exercised. The other branches are dead code that obscures test intent.F2. No negative/error path tests for the 3 new CLI commands — LOW
None of the 3 new tests exercise failure scenarios:
plan useplan executeplan treeSPECIFICATION ALIGNMENT — No Issues
The tests correctly model the specification's decision hierarchy:
prompt_definitionas root (spec line 18413)strategy_choiceunder it (spec line 18616)subplan_parallel_spawngrouping concurrent child spawns (spec line 18177)subplan_spawnunder the parallel container (spec line 18288)The
DecisionTypeenum values, ULID formats, andDecisionmodel field usage are all correct per the codebase definitions.PERFORMANCE — No Issues
No performance concerns in the test code.
SECURITY — No Issues
No credential exposure, injection vectors, or unsafe patterns. Mocks are used appropriately.
Final Verdict
The commit achieves its stated goal: the M4 E2E suite runs green and validates the milestone criteria. The primary concern is B1 (session leak from missing
scoped_session.remove()), which is a latent bug introduced by the fix. C1 and C2 are meaningful coverage gaps where the tests verify the "happy path plumbing" but miss verifying the specific M4-relevant outputs (project binding and subplan statuses). B2 and B3 represent pre-existing technical debt that this commit's pattern exposes but does not address.Approved — recommended to address B1 as a fast follow-up.
baac6bf19ad02aa33150New commits pushed, approval review dismissed automatically according to repository settings
hurui200320 referenced this pull request2026-03-11 07:02:08 +00:00