test(e2e): workflow example 12 — large-scale hierarchical feature implementation (supervised profile) #817

2026-03-13T17:10:57Z

freemo commented

2026-03-13 17:10:57 +00:00

Summary

E2E test for Workflow Example 12 — large-scale hierarchical feature implementation using the supervised profile. Tests multi-project setup (4 repos with per-project invariants), global invariant registration, spec-compliant action configuration with long_description, hierarchical plan tree inspection with hard assertions, plan correct (append mode) on non-root decision with post-correction verification, phased lifecycle-apply, and terminal state verification.

Closes #758

ISSUES CLOSED: #758

Changes

Test Structure (`robot/e2e/wf12_hierarchical.robot`)

Suite Setup (WF12 Suite Setup): Initializes E2E environment with init --force --yes, generates UUID-suffixed names for all resources/projects/actions to prevent UNIQUE constraint collisions in parallel CI.
Keywords: Create Project Repo (with git rc assertions and timeout=60s on_timeout=kill), Register Project With Invariant (per-project invariant per spec, with timeout on git calls), Select Non Root Decision Id (targeted "decision_id" regex using correct Crockford Base32 character class [0-9A-HJKMNP-TV-Z]{26}, requires ≥2 IDs to avoid returning root, defensive check ensures selected ID differs from first), Verify Plan In List (consistent with m6_acceptance pattern).
Force Tags E2E at Settings level.

Spec Compliance

All 4 projects passed to plan use (spec Step 3): protos, api, worker, frontend.
Global invariant registered per spec Step 1 (invariant add --global) with hard assertion on rc=0 and content verification.
Action YAML includes spec-required fields: estimation_actor, invariant_actor, automation_profile: cautious (ticket says 'supervised' but spec uses 'cautious' — following spec), long_description, action-level invariants, reusable, state.
Per-project invariants on each project registration (spec Step 1).
plan explain exercised per spec Step 4 with --format json and full assertion suite (rc=0, Traceback/INTERNAL checks, non-empty output, decision ID presence in output).
Dynamic actor selection based on available API key (Anthropic preferred, OpenAI fallback).
Skip If No LLM Keys for graceful CI degradation.
35-minute timeout for real LLM execution headroom.
plan lifecycle-list verification after plan use (consistent with m6_acceptance pattern).
lifecycle-apply --yes to skip confirmation prompt in automated test execution.

Assertion Quality

Every command is validated beyond rc=0:

Traceback and INTERNAL error marker checks on all commands.
Output Should Contain for resource/project/action names in registration output (including global invariant content assertion).
Safe Parse Json Field to parse plan_id from JSON output.
Hard assertion on "children" field for hierarchical decomposition (AC-3, AC-6), with non-empty children array check (WARN for flat LLM output since hierarchy depth is non-deterministic).
Decision tree structural assertions: decision_id count ≥ 2 (root + child) using Get Length on regex match results.
Correct Crockford Base32 regex for decision IDs: "decision_id"\\s*:\\s*"([0-9A-HJKMNP-TV-Z]{26})" — excludes I, L, O, U per spec.
Post-correction verification: second plan tree call verifies correction effect, using consistent regex-based counting method.
Pre-correction status check: verifies plan status (rc=0) before correction; gates correction on non-terminal state (skips with WARN if plan is already terminal).
Correction output verification: checks for append/queued/"mode"+"append"keywords (NOT barecorrectionsubstring which would vacuously matchcorrection_idkey). Plus structural JSON field check forstatusorcorrection_id`.
Post-strategize intermediate state assertion: verifies plan has non-empty phase or processing_state after strategize.
Apply phase assertion: Should Contain apply with WARN log when phase is empty.
Terminal state assertion: Phase checked against actual PlanPhase enum (apply); processing_state checked against actual ProcessingState enum terminal values for Apply phase (applied, constrained, cancelled — NOT complete, which is only for Strategize/Execute phases). Non-terminal states (queued, processing) produce WARN instead of failure (apply may be asynchronous). errored state produces separate WARN. Empty processing_state after phase='apply' fails the test (guarantees populated state after full lifecycle).
plan diff checks rc=0, non-empty output, Traceback/INTERNAL, and plan_id presence.
Explain output verified to contain queried decision ID.
--format json on plan use, execute, tree, correct, diff, lifecycle-apply, status, and explain.

Known Limitations

plan prompt not exercised: Spec Step 4 shows the supervised profile pausing on a low-confidence decision and the user providing guidance via plan prompt. This command is not yet implemented as a CLI subcommand. A documentation note in the test indicates this should be added once available.
Action arguments/--arg omitted: Spec Step 2 defines args and Step 3 uses --arg. Both are omitted because plan use triggers a UNIQUE constraint error when the action defines arguments in its schema (pre-existing bug in PlanLifecycleService.use_action). TODO documented in test.
Post-correction tree change: Correction queues a modification but may not immediately add new decision nodes visible in plan tree until re-execution. Assertion checks >= rather than >.
Action invariants: Spec defines 4 action-level invariants; test includes 2 as simplification. TODO documented.
Project invariants: Spec shows 2 per project; test includes 1 per project as simplification. TODO documented.
Multi-resource projects: Spec shows api/worker linked to both own repo and protos repo. Test links each to only own repo. TODO documented.
Non-deterministic hierarchy depth: LLM may produce flat sibling decisions rather than nested parent→child trees; test WARNs instead of failing in this case.

Quality Gates

All gates pass:

nox -e lint ✅
nox -e typecheck ✅
nox -e unit_tests ✅ (498 features, 12822 scenarios, 0 failed)
nox -e integration_tests ✅ (1825 tests, 0 failed)
nox -e e2e_tests ✅ (58 tests, 57 passed, 0 failed, 1 skipped — skip is pre-existing WF04 LLM non-determinism)
nox -e coverage_report ✅ (97%)

Manual Verification

Prerequisites

OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable set

Commands

nox -e e2e_tests  # runs the full E2E suite including this test

## Summary E2E test for Workflow Example 12 — large-scale hierarchical feature implementation using the supervised profile. Tests multi-project setup (4 repos with per-project invariants), global invariant registration, spec-compliant action configuration with long_description, hierarchical plan tree inspection with hard assertions, plan correct (append mode) on non-root decision with post-correction verification, phased lifecycle-apply, and terminal state verification. Closes #758 ISSUES CLOSED: #758 ## Changes ### Test Structure (`robot/e2e/wf12_hierarchical.robot`) - **Suite Setup** (`WF12 Suite Setup`): Initializes E2E environment with `init --force --yes`, generates UUID-suffixed names for all resources/projects/actions to prevent UNIQUE constraint collisions in parallel CI. - **Keywords**: `Create Project Repo` (with git rc assertions and `timeout=60s on_timeout=kill`), `Register Project With Invariant` (per-project invariant per spec, with timeout on git calls), `Select Non Root Decision Id` (targeted `"decision_id"` regex using correct Crockford Base32 character class `[0-9A-HJKMNP-TV-Z]{26}`, requires ≥2 IDs to avoid returning root, defensive check ensures selected ID differs from first), `Verify Plan In List` (consistent with m6_acceptance pattern). - **Force Tags E2E** at Settings level. ### Spec Compliance - **All 4 projects** passed to `plan use` (spec Step 3): protos, api, worker, frontend. - **Global invariant** registered per spec Step 1 (`invariant add --global`) with hard assertion on rc=0 and content verification. - **Action YAML** includes spec-required fields: `estimation_actor`, `invariant_actor`, `automation_profile: cautious` (ticket says 'supervised' but spec uses 'cautious' — following spec), `long_description`, action-level `invariants`, `reusable`, `state`. - **Per-project invariants** on each project registration (spec Step 1). - **`plan explain`** exercised per spec Step 4 with `--format json` and full assertion suite (rc=0, Traceback/INTERNAL checks, non-empty output, decision ID presence in output). - **Dynamic actor selection** based on available API key (Anthropic preferred, OpenAI fallback). - **Skip If No LLM Keys** for graceful CI degradation. - **35-minute timeout** for real LLM execution headroom. - **`plan lifecycle-list`** verification after `plan use` (consistent with m6_acceptance pattern). - **`lifecycle-apply --yes`** to skip confirmation prompt in automated test execution. ### Assertion Quality Every command is validated beyond rc=0: - `Traceback` and `INTERNAL` error marker checks on all commands. - `Output Should Contain` for resource/project/action names in registration output (including global invariant content assertion). - `Safe Parse Json Field` to parse `plan_id` from JSON output. - **Hard assertion** on `"children"` field for hierarchical decomposition (AC-3, AC-6), with non-empty children array check (WARN for flat LLM output since hierarchy depth is non-deterministic). - Decision tree structural assertions: `decision_id` count ≥ 2 (root + child) using `Get Length` on regex match results. - **Correct Crockford Base32 regex** for decision IDs: `"decision_id"\\s*:\\s*"([0-9A-HJKMNP-TV-Z]{26})"` — excludes I, L, O, U per spec. - **Post-correction verification**: second `plan tree` call verifies correction effect, using consistent regex-based counting method. - **Pre-correction status check**: verifies plan status (rc=0) before correction; gates correction on non-terminal state (skips with WARN if plan is already terminal). - **Correction output verification**: checks for `append`/`queued`/`"mode"+`"append"` keywords (NOT bare `correction` substring which would vacuously match `correction_id` key). Plus structural JSON field check for `status` or `correction_id`. - **Post-strategize intermediate state assertion**: verifies plan has non-empty phase or processing_state after strategize. - **Apply phase assertion**: `Should Contain apply` with WARN log when phase is empty. - **Terminal state assertion**: Phase checked against actual `PlanPhase` enum (`apply`); processing_state checked against actual `ProcessingState` enum terminal values for Apply phase (`applied`, `constrained`, `cancelled` — NOT `complete`, which is only for Strategize/Execute phases). Non-terminal states (`queued`, `processing`) produce WARN instead of failure (apply may be asynchronous). `errored` state produces separate WARN. Empty `processing_state` after phase='apply' fails the test (guarantees populated state after full lifecycle). - `plan diff` checks rc=0, non-empty output, Traceback/INTERNAL, and `plan_id` presence. - **Explain output** verified to contain queried decision ID. - `--format json` on plan use, execute, tree, correct, diff, lifecycle-apply, status, and explain. ### Known Limitations - **`plan prompt` not exercised**: Spec Step 4 shows the supervised profile pausing on a low-confidence decision and the user providing guidance via `plan prompt`. This command is not yet implemented as a CLI subcommand. A documentation note in the test indicates this should be added once available. - **Action arguments/`--arg` omitted**: Spec Step 2 defines `args` and Step 3 uses `--arg`. Both are omitted because `plan use` triggers a UNIQUE constraint error when the action defines arguments in its schema (pre-existing bug in `PlanLifecycleService.use_action`). TODO documented in test. - **Post-correction tree change**: Correction queues a modification but may not immediately add new decision nodes visible in `plan tree` until re-execution. Assertion checks `>=` rather than `>`. - **Action invariants**: Spec defines 4 action-level invariants; test includes 2 as simplification. TODO documented. - **Project invariants**: Spec shows 2 per project; test includes 1 per project as simplification. TODO documented. - **Multi-resource projects**: Spec shows api/worker linked to both own repo and protos repo. Test links each to only own repo. TODO documented. - **Non-deterministic hierarchy depth**: LLM may produce flat sibling decisions rather than nested parent→child trees; test WARNs instead of failing in this case. ## Quality Gates All gates pass: - `nox -e lint` ✅ - `nox -e typecheck` ✅ - `nox -e unit_tests` ✅ (498 features, 12822 scenarios, 0 failed) - `nox -e integration_tests` ✅ (1825 tests, 0 failed) - `nox -e e2e_tests` ✅ (58 tests, 57 passed, 0 failed, 1 skipped — skip is pre-existing WF04 LLM non-determinism) - `nox -e coverage_report` ✅ (97%) ## Manual Verification ### Prerequisites - `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` environment variable set ### Commands ```bash nox -e e2e_tests # runs the full E2E suite including this test ```

freemo added this to the v3.5.0 milestone 2026-03-13 17:11:02 +00:00

freemo added the

labels 2026-03-13 17:11:02 +00:00

freemo force-pushed test/e2e-wf12-hierarchical from 285c53efe0 to e7adeeb90e

2026-03-13 17:28:46 +00:00

Compare

freemo force-pushed test/e2e-wf12-hierarchical from e7adeeb90e to c348fd1bec

2026-03-13 17:46:56 +00:00

Compare

freemo force-pushed test/e2e-wf12-hierarchical from c348fd1bec to 073d543558

2026-03-13 17:51:49 +00:00

Compare

freemo force-pushed test/e2e-wf12-hierarchical from 073d543558 to 9627b0c3a9

2026-03-13 18:13:11 +00:00

Compare

freemo force-pushed test/e2e-wf12-hierarchical from 9627b0c3a9 to b196b87161

2026-03-13 18:27:07 +00:00

Compare

freemo referenced this pull request

2026-03-13 19:21:14 +00:00

test(e2e): workflow example 12 — large-scale hierarchical feature implementation (supervised profile) #758

freemo force-pushed test/e2e-wf12-hierarchical from b196b87161 to 6e60c9d7d8

2026-03-13 23:19:44 +00:00

Compare

freemo added the

Priority

Medium

label 2026-03-14 04:10:23 +00:00

freemo commented

2026-03-14 04:43:54 +00:00

PM Review — Day 34

Status: Mergeable, 0 reviews, M6 (v3.5.0)
Closes: #758 | Author: @freemo

E2E test for WF12 (large-scale hierarchical feature implementation). 4-project setup (core, api, frontend, docs), plan tree inspection, plan correct --mode append, phased apply.

[NOTE] Milestone v3.5.0 acceptance criteria require "4+ levels of subplans" and "10+ concurrent subplans." Verify the test actually exercises these thresholds — the manual verification steps don't include explicit depth/concurrency checks.

Action Items

Who	Action	Deadline
@CoreRasurae	Peer review — complex feature domain	Day 37

## PM Review — Day 34 **Status**: Mergeable, 0 reviews, M6 (v3.5.0) **Closes**: #758 | **Author**: @freemo E2E test for WF12 (large-scale hierarchical feature implementation). 4-project setup (core, api, frontend, docs), plan tree inspection, `plan correct --mode append`, phased apply. **[NOTE]** Milestone v3.5.0 acceptance criteria require "4+ levels of subplans" and "10+ concurrent subplans." Verify the test actually exercises these thresholds — the manual verification steps don't include explicit depth/concurrency checks. ### Action Items | Who | Action | Deadline | |-----|--------|----------| | @CoreRasurae | **Peer review** — complex feature domain | Day 37 |

freemo added a new dependency 2026-03-16 02:42:20 +00:00

#627 Implement @tdd_expected_fail tag handling in Behave environment

freemo added a new dependency 2026-03-16 02:42:20 +00:00

#628 Implement @tdd_expected_fail tag handling in Robot Framework

freemo added a new dependency 2026-03-16 02:42:20 +00:00

#965 refactor(testing): rename tdd_bug/tdd_bug_N tags to tdd_issue/tdd_issue_N across Behave and Robot Framework

freemo commented

2026-03-16 09:32:04 +00:00

PM Status — Day 36 (2026-03-16)

Day 34 review assignment deadline check. This PR has 0 reviewer activity after 2 days.

Priority note: M3 PRs take precedence. Reviewers should complete M3 reviews first, then address M4+ PRs in milestone order.

Assigned reviewer: Please acknowledge and provide an ETA for your review, or flag if reassignment is needed.

## PM Status — Day 36 (2026-03-16) Day 34 review assignment deadline check. This PR has 0 reviewer activity after 2 days. **Priority note**: M3 PRs take precedence. Reviewers should complete M3 reviews first, then address M4+ PRs in milestone order. **Assigned reviewer**: Please acknowledge and provide an ETA for your review, or flag if reassignment is needed.

hurui200320 was assigned by freemo

2026-03-16 22:19:24 +00:00

freemo commented

2026-03-16 22:19:29 +00:00

@hurui200320 I am going to have you take over this PR, it is mostly completed but is waiting on #628 and #966 One is yours and one is Brent's. Please be sure to get this PR and the two blocking PRs I listed in asap, thanks.

@hurui200320 I am going to have you take over this PR, it is mostly completed but is waiting on https://git.cleverthis.com/cleveragents/cleveragents-core/issues/628 and https://git.cleverthis.com/cleveragents/cleveragents-core/issues/966 One is yours and one is Brent's. Please be sure to get this PR and the two blocking PRs I listed in asap, thanks.

freemo requested review from brent.edwards 2026-03-17 18:24:18 +00:00

freemo requested review from hamza.khyari 2026-03-17 18:24:18 +00:00

freemo commented

2026-03-17 18:33:46 +00:00

PM Status — Day 37

Reviewers assigned. This PR needs at least 2 approving reviews per CONTRIBUTING.md before merge.

Author: Please ensure this PR is rebased on latest master and all quality gates pass before requesting merge.

PM status — Day 37

## PM Status — Day 37 Reviewers assigned. This PR needs at least 2 approving reviews per `CONTRIBUTING.md` before merge. **Author**: Please ensure this PR is rebased on latest `master` and all quality gates pass before requesting merge. --- *PM status — Day 37*

hurui200320 force-pushed test/e2e-wf12-hierarchical from 6e60c9d7d8 to 5a8458b5af

2026-03-18 08:35:56 +00:00

Compare

freemo commented

2026-03-19 04:58:00 +00:00

Code Review — PR #817

(Cannot submit formal approval — self-authored PR.)

E2E test for WF12. Well-structured with proper labels, milestone, and issue linkage. No issues found.

## Code Review — PR #817 *(Cannot submit formal approval — self-authored PR.)* E2E test for WF12. Well-structured with proper labels, milestone, and issue linkage. No issues found.

freemo requested review from CoreRasurae 2026-03-19 05:19:51 +00:00

hurui200320 force-pushed test/e2e-wf12-hierarchical from 5a8458b5af to f90ef0cadd

2026-03-20 06:56:12 +00:00

Compare

hurui200320 added

and removed

labels 2026-03-20 07:09:39 +00:00

hurui200320 force-pushed test/e2e-wf12-hierarchical from f90ef0cadd to 83b319e679