test(e2e): E2E acceptance criteria for M3 (v3.2.0) — decisions, validations, and invariants #799

freemo · 2026-03-13T01:15:58Z

freemo commented

2026-03-13 01:15:58 +00:00

Summary

E2E acceptance test for M3 (v3.2.0) — decisions, validations, and invariants. Extends M2 with plan explain, plan correct (revert mode), and invariant management.

Closes #743

ISSUES CLOSED: #743

Manual Verification

Prerequisites

OPENAI_API_KEY or GEMINI_API_KEY environment variable set

Commands

# 1-6. Same init/resource/project/action/plan-use flow as M1
REPO=$(mktemp -d) && cd "$REPO" && git init && git checkout -b main
echo "def greet(): return 'hello'" > main.py && git add . && git commit -m "init"
WORKDIR=$(mktemp -d) && cd "$WORKDIR"
python -m cleveragents init --yes --force
python -m cleveragents resource add git-checkout "$REPO"
python -m cleveragents project create my-project
python -m cleveragents action create --config action.yaml
python -m cleveragents plan use fix-greeting

# 7. Execute strategize
python -m cleveragents plan execute PLAN_ID

# 8. Explain a decision (M3-specific)
python -m cleveragents plan explain --format plain PLAN_ID
# → Look for: decision details, reasoning, alternatives considered

# 9. Correct with revert mode (M3-specific)
python -m cleveragents plan correct --mode revert --guidance "Use a different approach" --yes PLAN_ID
# → Look for: correction acknowledged, re-execution from decision point

# 10. Execute implementation
python -m cleveragents plan execute PLAN_ID

# 11. Review and apply
python -m cleveragents plan diff PLAN_ID
python -m cleveragents plan lifecycle-apply PLAN_ID

What to Look For

plan explain outputs decision reasoning (not empty)
plan correct --mode revert completes without Traceback
The full plan lifecycle completes after correction
No Traceback in any command's stderr

## Summary E2E acceptance test for M3 (v3.2.0) — decisions, validations, and invariants. Extends M2 with plan explain, plan correct (revert mode), and invariant management. Closes #743 ISSUES CLOSED: #743 ## Manual Verification ### Prerequisites - `OPENAI_API_KEY` or `GEMINI_API_KEY` environment variable set ### Commands ```bash # 1-6. Same init/resource/project/action/plan-use flow as M1 REPO=$(mktemp -d) && cd "$REPO" && git init && git checkout -b main echo "def greet(): return 'hello'" > main.py && git add . && git commit -m "init" WORKDIR=$(mktemp -d) && cd "$WORKDIR" python -m cleveragents init --yes --force python -m cleveragents resource add git-checkout "$REPO" python -m cleveragents project create my-project python -m cleveragents action create --config action.yaml python -m cleveragents plan use fix-greeting # 7. Execute strategize python -m cleveragents plan execute PLAN_ID # 8. Explain a decision (M3-specific) python -m cleveragents plan explain --format plain PLAN_ID # → Look for: decision details, reasoning, alternatives considered # 9. Correct with revert mode (M3-specific) python -m cleveragents plan correct --mode revert --guidance "Use a different approach" --yes PLAN_ID # → Look for: correction acknowledged, re-execution from decision point # 10. Execute implementation python -m cleveragents plan execute PLAN_ID # 11. Review and apply python -m cleveragents plan diff PLAN_ID python -m cleveragents plan lifecycle-apply PLAN_ID ``` ### What to Look For - `plan explain` outputs decision reasoning (not empty) - `plan correct --mode revert` completes without Traceback - The full plan lifecycle completes after correction - No `Traceback` in any command's stderr

freemo added 1 commit

2026-03-13 01:15:58 +00:00

test(e2e): E2E acceptance criteria for M3 (v3.2.0) — decisions, validations, and invariants

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 25s

Details

CI / build (pull_request) Successful in 26s

Details

CI / quality (pull_request) Successful in 28s

Details

CI / e2e_tests (pull_request) Failing after 40s

Details

CI / typecheck (pull_request) Successful in 50s

Details

CI / security (pull_request) Successful in 51s

Details

CI / unit_tests (pull_request) Successful in 5m16s

Details

CI / integration_tests (pull_request) Successful in 5m23s

Details

CI / docker (pull_request) Successful in 58s

Details

CI / coverage (pull_request) Successful in 5m54s

Details

CI / benchmark-regression (pull_request) Successful in 38m46s

Details

6d2e56b282

Add robot/e2e/m3_acceptance.robot exercising the full M3 feature set:

  - Decision recording during plan execution
  - Decision tree visualization (plan tree)
  - Decision explanation (plan explain)
  - Invariant management (invariant add/list)
  - Decision correction (plan correct --mode=revert)

Zero mocking — all CLI invocations use real LLM API keys.
Follows existing E2E patterns (common_e2e.resource, Skip If No LLM Keys).

ISSUES CLOSED: #743

freemo added the

Type

Testing

label

2026-03-13 01:16:22 +00:00

freemo added this to the v3.2.0 milestone

2026-03-13 01:16:22 +00:00

freemo referenced this pull request

2026-03-13 01:16:44 +00:00

test(e2e): E2E acceptance criteria for M3 (v3.2.0) — decisions, validations, and invariants #743

freemo force-pushed test/e2e-m3-acceptance from 6d2e56b282

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 25s

Details

CI / build (pull_request) Successful in 26s

Details

CI / quality (pull_request) Successful in 28s

Details

CI / e2e_tests (pull_request) Failing after 40s

Details

CI / typecheck (pull_request) Successful in 50s

Details

CI / security (pull_request) Successful in 51s

Details

CI / unit_tests (pull_request) Successful in 5m16s

Details

CI / integration_tests (pull_request) Successful in 5m23s

Details

CI / docker (pull_request) Successful in 58s

Details

CI / coverage (pull_request) Successful in 5m54s

Details

CI / benchmark-regression (pull_request) Successful in 38m46s

Details

to c2b7b3ff1e

CI / lint (pull_request) Successful in 20s

Details

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / quality (pull_request) Successful in 22s

Details

CI / security (pull_request) Successful in 37s

Details

CI / typecheck (pull_request) Successful in 45s

Details

CI / build (pull_request) Successful in 20s

Details

CI / e2e_tests (pull_request) Failing after 30s

Details

CI / unit_tests (pull_request) Successful in 3m54s

Details

CI / integration_tests (pull_request) Successful in 4m12s

Details

CI / docker (pull_request) Successful in 1m1s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

CI / coverage (pull_request) Has been cancelled

Details

2026-03-13 16:24:26 +00:00

Compare

freemo force-pushed test/e2e-m3-acceptance from c2b7b3ff1e

CI / lint (pull_request) Successful in 20s

Details

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / quality (pull_request) Successful in 22s

Details

CI / security (pull_request) Successful in 37s

Details

CI / typecheck (pull_request) Successful in 45s

Details

CI / build (pull_request) Successful in 20s

Details

CI / e2e_tests (pull_request) Failing after 30s

Details

CI / unit_tests (pull_request) Successful in 3m54s

Details

CI / integration_tests (pull_request) Successful in 4m12s

Details

CI / docker (pull_request) Successful in 1m1s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

CI / coverage (pull_request) Has been cancelled

Details

to 512ec103a0

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 14s

Details

CI / build (pull_request) Successful in 17s

Details

CI / quality (pull_request) Successful in 18s

Details

CI / e2e_tests (pull_request) Failing after 28s

Details

CI / security (pull_request) Successful in 36s

Details

CI / typecheck (pull_request) Successful in 39s

Details

CI / integration_tests (pull_request) Successful in 3m3s

Details

CI / unit_tests (pull_request) Successful in 3m46s

Details

CI / docker (pull_request) Successful in 52s

Details

CI / coverage (pull_request) Successful in 4m52s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

2026-03-13 16:31:34 +00:00

Compare

freemo force-pushed test/e2e-m3-acceptance from 512ec103a0

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 14s

Details

CI / build (pull_request) Successful in 17s

Details

CI / quality (pull_request) Successful in 18s

Details

CI / e2e_tests (pull_request) Failing after 28s

Details

CI / security (pull_request) Successful in 36s

Details

CI / typecheck (pull_request) Successful in 39s

Details

CI / integration_tests (pull_request) Successful in 3m3s

Details

CI / unit_tests (pull_request) Successful in 3m46s

Details

CI / docker (pull_request) Successful in 52s

Details

CI / coverage (pull_request) Successful in 4m52s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

to cf42b288fe

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 15s

Details

CI / build (pull_request) Successful in 16s

Details

CI / quality (pull_request) Successful in 19s

Details

CI / e2e_tests (pull_request) Failing after 28s

Details

CI / typecheck (pull_request) Successful in 32s

Details

CI / security (pull_request) Successful in 35s

Details

CI / unit_tests (pull_request) Successful in 2m27s

Details

CI / integration_tests (pull_request) Successful in 2m50s

Details

CI / docker (pull_request) Successful in 47s

Details

CI / coverage (pull_request) Successful in 4m53s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

2026-03-13 16:42:17 +00:00

Compare

freemo force-pushed test/e2e-m3-acceptance from cf42b288fe

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 15s

Details

CI / build (pull_request) Successful in 16s

Details

CI / quality (pull_request) Successful in 19s

Details

CI / e2e_tests (pull_request) Failing after 28s

Details

CI / typecheck (pull_request) Successful in 32s

Details

CI / security (pull_request) Successful in 35s

Details

CI / unit_tests (pull_request) Successful in 2m27s

Details

CI / integration_tests (pull_request) Successful in 2m50s

Details

CI / docker (pull_request) Successful in 47s

Details

CI / coverage (pull_request) Successful in 4m53s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

to a5b52ca7b0

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 16s

Details

CI / build (pull_request) Successful in 16s

Details

CI / quality (pull_request) Successful in 19s

Details

CI / e2e_tests (pull_request) Failing after 26s

Details

CI / security (pull_request) Successful in 33s

Details

CI / typecheck (pull_request) Successful in 33s

Details

CI / unit_tests (pull_request) Successful in 2m4s

Details

CI / integration_tests (pull_request) Successful in 2m44s

Details

CI / docker (pull_request) Successful in 40s

Details

CI / coverage (pull_request) Successful in 4m55s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

2026-03-13 16:57:59 +00:00

Compare

freemo force-pushed test/e2e-m3-acceptance from a5b52ca7b0

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 16s

Details

CI / build (pull_request) Successful in 16s

Details

CI / quality (pull_request) Successful in 19s

Details

CI / e2e_tests (pull_request) Failing after 26s

Details

CI / security (pull_request) Successful in 33s

Details

CI / typecheck (pull_request) Successful in 33s

Details

CI / unit_tests (pull_request) Successful in 2m4s

Details

CI / integration_tests (pull_request) Successful in 2m44s

Details

CI / docker (pull_request) Successful in 40s

Details

CI / coverage (pull_request) Successful in 4m55s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

to 58f3aba0d9

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 18s

Details

CI / build (pull_request) Successful in 17s

Details

CI / quality (pull_request) Successful in 19s

Details

CI / security (pull_request) Successful in 35s

Details

CI / typecheck (pull_request) Successful in 35s

Details

CI / e2e_tests (pull_request) Failing after 37s

Details

CI / unit_tests (pull_request) Successful in 2m12s

Details

CI / integration_tests (pull_request) Successful in 3m14s

Details

CI / docker (pull_request) Successful in 35s

Details

CI / coverage (pull_request) Successful in 4m55s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

2026-03-13 17:28:34 +00:00

Compare

freemo force-pushed test/e2e-m3-acceptance from 58f3aba0d9

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 18s

Details

CI / build (pull_request) Successful in 17s

Details

CI / quality (pull_request) Successful in 19s

Details

CI / security (pull_request) Successful in 35s

Details

CI / typecheck (pull_request) Successful in 35s

Details

CI / e2e_tests (pull_request) Failing after 37s

Details

CI / unit_tests (pull_request) Successful in 2m12s

Details

CI / integration_tests (pull_request) Successful in 3m14s

Details

CI / docker (pull_request) Successful in 35s

Details

CI / coverage (pull_request) Successful in 4m55s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

to a2d1f2e8ee

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 16s

Details

CI / build (pull_request) Successful in 21s

Details

CI / quality (pull_request) Successful in 22s

Details

CI / security (pull_request) Successful in 32s

Details

CI / typecheck (pull_request) Successful in 33s

Details

CI / e2e_tests (pull_request) Successful in 57s

Details

CI / unit_tests (pull_request) Successful in 2m51s

Details

CI / integration_tests (pull_request) Successful in 3m26s

Details

CI / docker (pull_request) Successful in 49s

Details

CI / coverage (pull_request) Successful in 5m38s

Details

CI / benchmark-regression (pull_request) Failing after 19m11s

Details

2026-03-13 17:46:47 +00:00

Compare

freemo force-pushed test/e2e-m3-acceptance from a2d1f2e8ee

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 16s

Details

CI / build (pull_request) Successful in 21s

Details

CI / quality (pull_request) Successful in 22s

Details

CI / security (pull_request) Successful in 32s

Details

CI / typecheck (pull_request) Successful in 33s

Details

CI / e2e_tests (pull_request) Successful in 57s

Details

CI / unit_tests (pull_request) Successful in 2m51s

Details

CI / integration_tests (pull_request) Successful in 3m26s

Details

CI / docker (pull_request) Successful in 49s

Details

CI / coverage (pull_request) Successful in 5m38s

Details

CI / benchmark-regression (pull_request) Failing after 19m11s

Details

to 6d2e56b282

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 25s

Details

CI / build (pull_request) Successful in 26s

Details

CI / quality (pull_request) Successful in 28s

Details

CI / e2e_tests (pull_request) Failing after 40s

Details

CI / typecheck (pull_request) Successful in 50s

Details

CI / security (pull_request) Successful in 51s

Details

CI / unit_tests (pull_request) Successful in 5m16s

Details

CI / integration_tests (pull_request) Successful in 5m23s

Details

CI / docker (pull_request) Successful in 58s

Details

CI / coverage (pull_request) Successful in 5m54s

Details

CI / benchmark-regression (pull_request) Successful in 38m46s

Details

2026-03-13 18:12:59 +00:00

Compare

freemo force-pushed test/e2e-m3-acceptance from 6d2e56b282

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 25s

Details

CI / build (pull_request) Successful in 26s

Details

CI / quality (pull_request) Successful in 28s

Details

CI / e2e_tests (pull_request) Failing after 40s

Details

CI / typecheck (pull_request) Successful in 50s

Details

CI / security (pull_request) Successful in 51s

Details

CI / unit_tests (pull_request) Successful in 5m16s

Details

CI / integration_tests (pull_request) Successful in 5m23s

Details

CI / docker (pull_request) Successful in 58s

Details

CI / coverage (pull_request) Successful in 5m54s

Details

CI / benchmark-regression (pull_request) Successful in 38m46s

Details

to e3e6c148d9

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / quality (pull_request) Successful in 18s

Details

CI / build (pull_request) Successful in 16s

Details

CI / lint (pull_request) Successful in 22s

Details

CI / typecheck (pull_request) Successful in 33s

Details

CI / security (pull_request) Successful in 34s

Details

CI / e2e_tests (pull_request) Failing after 31s

Details

CI / integration_tests (pull_request) Successful in 2m52s

Details

CI / unit_tests (pull_request) Successful in 3m45s

Details

CI / docker (pull_request) Successful in 9s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

CI / coverage (pull_request) Has been cancelled

Details

2026-03-13 18:14:17 +00:00

Compare

freemo force-pushed test/e2e-m3-acceptance from e3e6c148d9

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / quality (pull_request) Successful in 18s

Details

CI / build (pull_request) Successful in 16s

Details

CI / lint (pull_request) Successful in 22s

Details

CI / typecheck (pull_request) Successful in 33s

Details

CI / security (pull_request) Successful in 34s

Details

CI / e2e_tests (pull_request) Failing after 31s

Details

CI / integration_tests (pull_request) Successful in 2m52s

Details

CI / unit_tests (pull_request) Successful in 3m45s

Details

CI / docker (pull_request) Successful in 9s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

CI / coverage (pull_request) Has been cancelled

Details

to 14d9669786

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 14s

Details

CI / quality (pull_request) Successful in 16s

Details

CI / build (pull_request) Successful in 24s

Details

CI / e2e_tests (pull_request) Failing after 25s

Details

CI / security (pull_request) Successful in 31s

Details

CI / typecheck (pull_request) Successful in 42s

Details

CI / unit_tests (pull_request) Successful in 2m8s

Details

CI / docker (pull_request) Successful in 35s

Details

CI / coverage (pull_request) Has been cancelled

Details

CI / integration_tests (pull_request) Has been cancelled

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

2026-03-13 18:22:05 +00:00

Compare

freemo force-pushed test/e2e-m3-acceptance from 14d9669786

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 14s

Details

CI / quality (pull_request) Successful in 16s

Details

CI / build (pull_request) Successful in 24s

Details

CI / e2e_tests (pull_request) Failing after 25s

Details

CI / security (pull_request) Successful in 31s

Details

CI / typecheck (pull_request) Successful in 42s

Details

CI / unit_tests (pull_request) Successful in 2m8s

Details

CI / docker (pull_request) Successful in 35s

Details

CI / coverage (pull_request) Has been cancelled

Details

CI / integration_tests (pull_request) Has been cancelled

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

to f0bacfdcc3

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 15s

Details

CI / build (pull_request) Successful in 16s

Details

CI / quality (pull_request) Successful in 23s

Details

CI / typecheck (pull_request) Successful in 33s

Details

CI / security (pull_request) Successful in 36s

Details

CI / e2e_tests (pull_request) Successful in 1m5s

Details

CI / unit_tests (pull_request) Successful in 3m28s

Details

CI / docker (pull_request) Successful in 10s

Details

CI / integration_tests (pull_request) Successful in 4m48s

Details

CI / coverage (pull_request) Successful in 5m24s

Details

CI / benchmark-regression (pull_request) Successful in 33m28s

Details

2026-03-13 18:25:15 +00:00

Compare

freemo added the

State

In Review

label

2026-03-13 21:16:34 +00:00

freemo force-pushed test/e2e-m3-acceptance from f0bacfdcc3

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 15s

Details

CI / build (pull_request) Successful in 16s

Details

CI / quality (pull_request) Successful in 23s

Details

CI / typecheck (pull_request) Successful in 33s

Details

CI / security (pull_request) Successful in 36s

Details

CI / e2e_tests (pull_request) Successful in 1m5s

Details

CI / unit_tests (pull_request) Successful in 3m28s

Details

CI / docker (pull_request) Successful in 10s

Details

CI / integration_tests (pull_request) Successful in 4m48s

Details

CI / coverage (pull_request) Successful in 5m24s

Details

CI / benchmark-regression (pull_request) Successful in 33m28s

Details

to 0d8a16d289

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 15s

Details

CI / build (pull_request) Successful in 17s

Details

CI / quality (pull_request) Successful in 19s

Details

CI / typecheck (pull_request) Successful in 34s

Details

CI / security (pull_request) Successful in 35s

Details

CI / e2e_tests (pull_request) Failing after 59s

Details

CI / unit_tests (pull_request) Successful in 2m42s

Details

CI / integration_tests (pull_request) Successful in 4m10s

Details

CI / docker (pull_request) Successful in 48s

Details

CI / coverage (pull_request) Successful in 5m51s

Details

CI / benchmark-regression (pull_request) Successful in 36m31s

Details

2026-03-13 23:19:27 +00:00

Compare

freemo added the

Priority

Medium

label

2026-03-14 04:10:11 +00:00

freemo commented

2026-03-14 04:43:34 +00:00

PM Review — Day 34

Status: Mergeable, 0 reviews, M3 (v3.2.0)
Closes: #744 | Author: @freemo

Review Summary

Well-structured E2E acceptance test for M3 covering: init, resource/project creation, invariant management, action creation, plan lifecycle (strategize → execute → correct → apply), plan tree/explain, and final status. Follows established Robot Framework patterns with common_e2e.resource.

Issues

[NON-BLOCKING] Code duplication — Extract Plan Id keyword is duplicated across PRs #799, #800, #801, #802. Should be refactored to common_e2e.resource to prevent drift. This can be a follow-up.

Merge Order Note

This is the base M3 acceptance test. Should merge after PR #710 (CI key config) so CI can actually run it.

Action Items

Who	Action	Deadline
@hurui200320	Peer review — integration test expertise	Day 36

This review also applies to the common pattern used in sibling PRs #800, #801, #802.

## PM Review — Day 34 **Status**: Mergeable, 0 reviews, M3 (v3.2.0) **Closes**: #744 | **Author**: @freemo ### Review Summary Well-structured E2E acceptance test for M3 covering: init, resource/project creation, invariant management, action creation, plan lifecycle (strategize → execute → correct → apply), plan tree/explain, and final status. Follows established Robot Framework patterns with `common_e2e.resource`. ### Issues **[NON-BLOCKING] Code duplication** — `Extract Plan Id` keyword is duplicated across PRs #799, #800, #801, #802. Should be refactored to `common_e2e.resource` to prevent drift. This can be a follow-up. ### Merge Order Note This is the base M3 acceptance test. Should merge **after** PR #710 (CI key config) so CI can actually run it. ### Action Items | Who | Action | Deadline | |-----|--------|----------| | @hurui200320 | **Peer review** — integration test expertise | Day 36 | --- *This review also applies to the common pattern used in sibling PRs #800, #801, #802.*

freemo referenced this pull request

2026-03-14 04:43:38 +00:00

test(e2e): workflow example 3 — multi-file refactoring with invariants (cautious profile) #800

freemo referenced this pull request

2026-03-14 04:43:40 +00:00

test(e2e): workflow example 13 — custom automation profile with semantic escalation #801

freemo referenced this pull request

2026-03-14 04:43:43 +00:00

test(e2e): workflow example 15 — disaster recovery, rollback a failed apply (trusted profile) #802

freemo added the

labels

2026-03-14 22:11:26 +00:00

freemo referenced this pull request

2026-03-16 03:16:18 +00:00

bug(cli): plan execute only transitions state — does not run strategize or execute phase processing #967

freemo referenced this pull request

2026-03-16 03:16:43 +00:00

bug(cli): plan explain expects decision_id but M3 acceptance test passes plan_id — command fails with exit code 1 #968

freemo referenced this pull request

2026-03-16 03:16:58 +00:00

bug(cli): plan correct expects decision_id as first positional but M3 acceptance test passes plan_id #969

freemo commented

2026-03-16 09:32:04 +00:00

PM Status — Day 36 (2026-03-16)

Day 34 review assignment deadline check. This PR has been in review for 2+ days with 0 reviewer activity.

Reminder: Assigned reviewer — please post your review by Day 37 EOD or flag any blockers. These E2E test PRs are foundational for milestone acceptance gates and cannot remain unreviewed indefinitely.

If you are unable to review by the deadline, please comment so the review can be reassigned.

## PM Status — Day 36 (2026-03-16) Day 34 review assignment deadline check. This PR has been in review for 2+ days with 0 reviewer activity. **Reminder**: Assigned reviewer — please post your review by **Day 37 EOD** or flag any blockers. These E2E test PRs are foundational for milestone acceptance gates and cannot remain unreviewed indefinitely. If you are unable to review by the deadline, please comment so the review can be reassigned.

freemo reviewed

2026-03-16 16:15:20 +00:00

freemo left a comment

PM Day 36 Triage: M3 E2E test PR (v3.2.0). Lower priority than bug fixes and TDD infrastructure. Reviewer: @brent.edwards after critical path items clear.

hamza.khyari was assigned by freemo

2026-03-16 22:22:09 +00:00

freemo commented

2026-03-16 22:22:19 +00:00

@hamza.khyari I am going to have you take over this PR, it is mostly completed but is waiting on #628 and #966 One is yours and one is Brent's. Please be sure to get this PR and the two blocking PRs I listed in asap, thanks.

@hamza.khyari I am going to have you take over this PR, it is mostly completed but is waiting on https://git.cleverthis.com/cleveragents/cleveragents-core/issues/628 and https://git.cleverthis.com/cleveragents/cleveragents-core/issues/966 One is yours and one is Brent's. Please be sure to get this PR and the two blocking PRs I listed in asap, thanks.

hamza.khyari force-pushed test/e2e-m3-acceptance from 0d8a16d289

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 15s

Details

CI / build (pull_request) Successful in 17s

Details

CI / quality (pull_request) Successful in 19s

Details

CI / typecheck (pull_request) Successful in 34s

Details

CI / security (pull_request) Successful in 35s

Details

CI / e2e_tests (pull_request) Failing after 59s

Details

CI / unit_tests (pull_request) Successful in 2m42s

Details

CI / integration_tests (pull_request) Successful in 4m10s

Details

CI / docker (pull_request) Successful in 48s

Details

CI / coverage (pull_request) Successful in 5m51s

Details

CI / benchmark-regression (pull_request) Successful in 36m31s

Details

to e65c01670e

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 15s

Details

CI / build (pull_request) Successful in 16s

Details

CI / quality (pull_request) Successful in 30s

Details

CI / typecheck (pull_request) Successful in 44s

Details

CI / security (pull_request) Successful in 53s

Details

CI / e2e_tests (pull_request) Failing after 1m45s

Details

CI / integration_tests (pull_request) Successful in 3m39s

Details

CI / unit_tests (pull_request) Successful in 5m34s

Details

CI / docker (pull_request) Successful in 9s

Details

CI / coverage (pull_request) Successful in 8m23s

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

2026-03-17 04:00:18 +00:00

Compare

hamza.khyari referenced this pull request from a commit

2026-03-17 04:00:19 +00:00

fix(test): harden M3 E2E assertions and consolidate Extract Plan Id keyword

hamza.khyari referenced this pull request

2026-03-17 04:17:03 +00:00

InvariantService uses in-memory storage only — invariants lost between CLI invocations #1022

hamza.khyari added 1 commit

2026-03-17 04:17:59 +00:00

test(e2e): tag M3 acceptance test as tdd_expected_fail (#1022 )

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 22s

Details

CI / build (pull_request) Successful in 24s

Details

CI / quality (pull_request) Successful in 29s

Details

CI / typecheck (pull_request) Successful in 40s

Details

CI / security (pull_request) Successful in 1m5s

Details

CI / e2e_tests (pull_request) Successful in 1m38s

Details

CI / integration_tests (pull_request) Successful in 3m41s

Details

CI / unit_tests (pull_request) Successful in 4m52s

Details

CI / docker (pull_request) Successful in 54s

Details

CI / coverage (pull_request) Successful in 6m0s

Details

CI / benchmark-regression (pull_request) Successful in 38m42s

Details

2f8fb9e65f

InvariantService uses in-memory storage only — invariants added via
`invariant add` are lost when the CLI process exits. The M3 E2E test
correctly catches this as a persistence failure at the `invariant list`
step. Invariant persistence is a v3.4.0 deliverable (not M3 scope).

Tagged with tdd_bug, tdd_bug_1022, tdd_expected_fail per TDD bug-capture
workflow (CONTRIBUTING.md). The listener inverts the result so the test
passes CI while the bug is unfixed.

ISSUES CLOSED: #743

brent.edwards requested changes

2026-03-17 04:52:52 +00:00

Dismissed

brent.edwards left a comment

PR #799 — M3 E2E Acceptance Review

Summary

Test-only PR adding m3_acceptance.robot (196 lines) and refactoring Extract Plan Id into common_e2e.resource. Also touches m1_acceptance.robot (removes local keyword) and m2_acceptance.robot (uses new shared keyword). Scope is clean — no production code modified.

P1 — Missing `Skip If No LLM Keys` guard

Files: robot/e2e/m3_acceptance.robot

The test makes real LLM calls (openai/gpt-4) but never calls Skip If No LLM Keys (available in common_e2e.resource). CI runs without OPENAI_API_KEY get a hard FAIL instead of a graceful skip. M6 uses this consistently on every LLM-dependent test case. M3 should do the same.

Fix: Add Skip If No LLM Keys as the first line of the test case body (or as a Test Setup).

P1 — Missing test-level `[Timeout]`

File: robot/e2e/m3_acceptance.robot

Individual Run CleverAgents Command calls have timeout=300s, but there is no overall [Timeout] on the test case. If multiple steps each approach their 300s limit, the test could run for 20+ minutes without aborting. M4 correctly sets [Timeout] 15 minutes.

Fix: Add [Timeout] 15 minutes (or similar) to the test case header.

P1 — Weak assertions: "no crash" rather than acceptance criteria verification

File: robot/e2e/m3_acceptance.robot

Several steps rely almost entirely on Should Not Contain ... Traceback / Should Not Contain ... INTERNAL without verifying the acceptance criteria:

Step 9 (plan explain): No assertion on the explain output at all — just checks for no Traceback. The M3 milestone AC says "plan explain shows decision details including alternatives considered." At minimum, assert Should Not Be Empty ${r_explain.stdout} and check for a keyword like decision or reasoning.
Step 11 (plan correct --mode revert): Only checks no Traceback/INTERNAL. Should assert Should Not Be Empty ${r_correct.stdout} and verify revert-specific output (e.g., the word revert or correction).
Step 14 (final status): Checks Should Not Be Empty but doesn't verify the plan reached a terminal phase. M4 does this correctly with Should Match Regexp ... "phase".

The milestone AC states: "plan explain shows decision details," "plan correct --mode=revert re-executes from targeted decision point." The current test would pass on empty stdout with rc=0.

P2 — `Extract Plan Id` ULID regex broadened from Crockford Base32

File: robot/e2e/common_e2e.resource

The old M1 keyword used [0-9A-HJ-NP-Z]{26} (proper Crockford Base32: excludes I, L, O, U). The new shared keyword uses [0-9A-Z]{26} which matches any 26-char uppercase alphanumeric string. This is unlikely to cause false positives in practice, but it's a correctness regression. M4 (in a separate PR) still uses the Crockford pattern, creating an inconsistency.

Suggestion: Use [0-9A-HJ-NP-TV-Z]{26} (Crockford) in the shared keyword, or document the deliberate broadening.

P2 — `Extract Plan Id` uses inline Python `import('re')` instead of Robot keywords

File: robot/e2e/common_e2e.resource

The new implementation uses Evaluate __import__('re').search(...) rather than Robot Framework's Get Regexp Matches. This is harder to read and debug in Robot logs. The old M1 keyword used Get Regexp Matches which is idiomatic Robot.

P2 — `Run Process` calls for git commands lack `timeout` and `on_timeout=kill`

File: robot/e2e/m3_acceptance.robot (lines 63-65)

Bare Run Process git add . / git commit calls have no timeout or on_timeout=kill. M5 applies timeout=60s on_timeout=kill to all Run Process calls consistently. Hung git processes would block CI indefinitely.

P2 — Boilerplate duplication (init/resource/project/action setup)

The ~40-line setup sequence (init → resource add → project create → action create → plan use) is copy-pasted across M1, M2, M3. M6 factored this into a reusable Setup Plan Test Resources keyword. Consider extracting a common setup keyword to reduce drift.

P3 — Invariant assertions are superficial

File: robot/e2e/m3_acceptance.robot (steps 3-4)

Output Should Contain ${inv_result} invariant just checks the word "invariant" appears anywhere in output (including the command echo itself). A stronger check would verify the invariant text "API function signatures must not change" appears in the invariant list output.

Step 4 does check Output Should Contain ${inv_list} API function signatures — that one is good. Step 3 could be tightened.

P3 — CHANGELOG insertion position

The CHANGELOG entry is inserted mid-list rather than at the top of the Unreleased section. Minor style nit.

Verdict

REQUEST CHANGES — P1 items (missing skip guard, missing test timeout, weak assertions on explain/correct/status) need to be addressed before merge. The test currently can silently pass with empty outputs and will hard-fail CI without API keys.

## PR #799 — M3 E2E Acceptance Review ### Summary Test-only PR adding `m3_acceptance.robot` (196 lines) and refactoring `Extract Plan Id` into `common_e2e.resource`. Also touches `m1_acceptance.robot` (removes local keyword) and `m2_acceptance.robot` (uses new shared keyword). Scope is clean — no production code modified. --- ### P1 — Missing `Skip If No LLM Keys` guard **Files:** `robot/e2e/m3_acceptance.robot` The test makes real LLM calls (openai/gpt-4) but never calls `Skip If No LLM Keys` (available in `common_e2e.resource`). CI runs without `OPENAI_API_KEY` get a hard **FAIL** instead of a graceful skip. M6 uses this consistently on every LLM-dependent test case. M3 should do the same. **Fix:** Add `Skip If No LLM Keys` as the first line of the test case body (or as a Test Setup). --- ### P1 — Missing test-level `[Timeout]` **File:** `robot/e2e/m3_acceptance.robot` Individual `Run CleverAgents Command` calls have `timeout=300s`, but there is no overall `[Timeout]` on the test case. If multiple steps each approach their 300s limit, the test could run for 20+ minutes without aborting. M4 correctly sets `[Timeout] 15 minutes`. **Fix:** Add `[Timeout] 15 minutes` (or similar) to the test case header. --- ### P1 — Weak assertions: "no crash" rather than acceptance criteria verification **File:** `robot/e2e/m3_acceptance.robot` Several steps rely almost entirely on `Should Not Contain ... Traceback` / `Should Not Contain ... INTERNAL` without verifying the acceptance criteria: - **Step 9 (plan explain):** No assertion on the explain output at all — just checks for no Traceback. The M3 milestone AC says "plan explain shows decision details including alternatives considered." At minimum, assert `Should Not Be Empty ${r_explain.stdout}` and check for a keyword like `decision` or `reasoning`. - **Step 11 (plan correct --mode revert):** Only checks no Traceback/INTERNAL. Should assert `Should Not Be Empty ${r_correct.stdout}` and verify revert-specific output (e.g., the word `revert` or `correction`). - **Step 14 (final status):** Checks `Should Not Be Empty` but doesn't verify the plan reached a terminal phase. M4 does this correctly with `Should Match Regexp ... "phase"`. The milestone AC states: "plan explain shows decision details," "plan correct --mode=revert re-executes from targeted decision point." The current test would pass on empty stdout with rc=0. --- ### P2 — `Extract Plan Id` ULID regex broadened from Crockford Base32 **File:** `robot/e2e/common_e2e.resource` The old M1 keyword used `[0-9A-HJ-NP-Z]{26}` (proper Crockford Base32: excludes I, L, O, U). The new shared keyword uses `[0-9A-Z]{26}` which matches any 26-char uppercase alphanumeric string. This is unlikely to cause false positives in practice, but it's a correctness regression. M4 (in a separate PR) still uses the Crockford pattern, creating an inconsistency. **Suggestion:** Use `[0-9A-HJ-NP-TV-Z]{26}` (Crockford) in the shared keyword, or document the deliberate broadening. --- ### P2 — `Extract Plan Id` uses inline Python `__import__('re')` instead of Robot keywords **File:** `robot/e2e/common_e2e.resource` The new implementation uses `Evaluate __import__('re').search(...)` rather than Robot Framework's `Get Regexp Matches`. This is harder to read and debug in Robot logs. The old M1 keyword used `Get Regexp Matches` which is idiomatic Robot. --- ### P2 — `Run Process` calls for git commands lack `timeout` and `on_timeout=kill` **File:** `robot/e2e/m3_acceptance.robot` (lines 63-65) Bare `Run Process git add .` / `git commit` calls have no timeout or `on_timeout=kill`. M5 applies `timeout=60s on_timeout=kill` to all `Run Process` calls consistently. Hung git processes would block CI indefinitely. --- ### P2 — Boilerplate duplication (init/resource/project/action setup) The ~40-line setup sequence (init → resource add → project create → action create → plan use) is copy-pasted across M1, M2, M3. M6 factored this into a reusable `Setup Plan Test Resources` keyword. Consider extracting a common setup keyword to reduce drift. --- ### P3 — Invariant assertions are superficial **File:** `robot/e2e/m3_acceptance.robot` (steps 3-4) `Output Should Contain ${inv_result} invariant` just checks the word "invariant" appears anywhere in output (including the command echo itself). A stronger check would verify the invariant text "API function signatures must not change" appears in the `invariant list` output. Step 4 does check `Output Should Contain ${inv_list} API function signatures` — that one is good. Step 3 could be tightened. --- ### P3 — CHANGELOG insertion position The CHANGELOG entry is inserted mid-list rather than at the top of the Unreleased section. Minor style nit. --- ### Verdict **REQUEST CHANGES** — P1 items (missing skip guard, missing test timeout, weak assertions on explain/correct/status) need to be addressed before merge. The test currently can silently pass with empty outputs and will hard-fail CI without API keys.

robot/e2e/common_e2e.resource Outdated

					
				@ -85,0 +89,4 @@

				    ...                ULID, a standard UUID, or a ``plan_id: <value>`` pattern.

				    ...                Returns the first match, or EMPTY if none found.

				    [Arguments]    ${stdout}    ${stderr}=${EMPTY}

				    ${combined}=    Set Variable    ${stdout}\n${stderr}

P2: ULID regex broadened from Crockford Base32 [0-9A-HJ-NP-Z]{26} to [0-9A-Z]{26}. The old M1 pattern was spec-correct. This is unlikely to cause false positives but is a correctness regression. Note: M4 (PR #814) still uses the Crockford pattern inline, creating an inconsistency between the two PRs.

**P2**: ULID regex broadened from Crockford Base32 `[0-9A-HJ-NP-Z]{26}` to `[0-9A-Z]{26}`. The old M1 pattern was spec-correct. This is unlikely to cause false positives but is a correctness regression. Note: M4 (PR #814) still uses the Crockford pattern inline, creating an inconsistency between the two PRs.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +19,4 @@

				*** Test Cases ***

				M3 Decision Recording And Tree Visualization

				    [Documentation]    End-to-end test for M3 milestone: decision recording, tree

				    ...                visualization, explanation, invariant management, and correction.

P1: Missing Skip If No LLM Keys guard. This test uses real LLM calls but will hard-FAIL in CI without OPENAI_API_KEY. Add Skip If No LLM Keys as the first keyword call in the test case body. M6 does this consistently for every LLM-dependent test.

**P1**: Missing `Skip If No LLM Keys` guard. This test uses real LLM calls but will hard-FAIL in CI without `OPENAI_API_KEY`. Add `Skip If No LLM Keys` as the first keyword call in the test case body. M6 does this consistently for every LLM-dependent test.

P1: No [Timeout] at the test case level. With four timeout=300s steps, this test could run for 20+ minutes without aborting. M4 correctly sets [Timeout] 15 minutes. Add a similar guard here.

**P1**: No `[Timeout]` at the test case level. With four `timeout=300s` steps, this test could run for 20+ minutes without aborting. M4 correctly sets `[Timeout] 15 minutes`. Add a similar guard here.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +60,4 @@

				    ...    ${SPACE}${SPACE}${SPACE}${SPACE}conn.close()

				    ...    ${SPACE}${SPACE}${SPACE}${SPACE}return {"id": row[0], "name": row[1]} if row else {}

				    Create File    ${repo_dir}${/}src${/}auth.py    ${auth_code}

				    Run Process    git    add    .    cwd=${repo_dir}

P2: Run Process calls for git add/commit/rev-parse lack timeout and on_timeout=kill. M5 consistently applies timeout=60s on_timeout=kill to all Run Process calls. Hung git processes block CI indefinitely.

**P2**: `Run Process` calls for git add/commit/rev-parse lack `timeout` and `on_timeout=kill`. M5 consistently applies `timeout=60s on_timeout=kill` to all `Run Process` calls. Hung git processes block CI indefinitely.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +144,4 @@

				    # ── 9. Plan explain — inspect a decision (use plan_id as fallback) ──

				    ${r_explain}=    Run CleverAgents Command

				    ...    plan    explain    ${plan_id}    --format    plain

P1: No assertion on explain output content. The M3 AC requires "plan explain shows decision details including alternatives considered." This step only checks for no Traceback. At minimum add:

Should Not Be Empty    ${r_explain.stdout}
...    Plan explain produced no output for plan ${plan_id}

And consider checking for a structural keyword like decision or reasoning.

**P1**: No assertion on explain output content. The M3 AC requires "plan explain shows decision details including alternatives considered." This step only checks for no Traceback. At minimum add: ``` Should Not Be Empty ${r_explain.stdout} ... Plan explain produced no output for plan ${plan_id} ``` And consider checking for a structural keyword like `decision` or `reasoning`.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +164,4 @@

				    ...    --guidance    Use SQLAlchemy ORM instead of raw SQL

				    ...    --yes    --format    plain

				    ...    timeout=300s

				    Log    Plan correct stdout: ${r_correct.stdout}

P1: Correction output not verified. Only checks no Traceback/INTERNAL but doesn't assert Should Not Be Empty ${r_correct.stdout}. A successful revert that produces empty stdout would silently pass. Should also verify a keyword like revert or correction appears.

**P1**: Correction output not verified. Only checks no Traceback/INTERNAL but doesn't assert `Should Not Be Empty ${r_correct.stdout}`. A successful revert that produces empty stdout would silently pass. Should also verify a keyword like `revert` or `correction` appears.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +189,4 @@

				    # ── 14. Final plan status ──

				    ${r_status}=    Run CleverAgents Command

				    ...    plan    status    ${plan_id}    --format    plain

P1: Final status doesn't verify terminal phase. M4 correctly asserts Should Match Regexp ... "phase". This should at minimum verify the plan_id appears in status output and ideally check the phase value.

**P1**: Final status doesn't verify terminal phase. M4 correctly asserts `Should Match Regexp ... "phase"`. This should at minimum verify the plan_id appears in status output and ideally check the phase value.

brent.edwards referenced this pull request

2026-03-17 04:53:39 +00:00

test(e2e): E2E acceptance criteria for M4 (v3.3.0) — corrections, subplans, and checkpoints #814

brent.edwards requested changes

2026-03-17 05:00:09 +00:00

Dismissed

brent.edwards left a comment

Code Review — PR #799 `test(e2e): E2E acceptance criteria for M3 (v3.2.0)`

Reviewer: @brent.edwards | Size: M (test-only) | Focus: E2E test quality, assertion strength

P1:must-fix (3)

1. Missing Skip If No LLM Keys guard — hard-FAILs CI without API keys
The test requires OPENAI_API_KEY or GEMINI_API_KEY but has no skip guard. In CI environments without these keys, all test cases FAIL with cryptic rc=1 errors rather than being gracefully skipped. Compare with existing M5/M6 E2E tests which use Skip If No LLM Keys.
Fix: Add skip guard as the first step in Suite Setup.

2. No [Timeout] on test case — unbounded execution
M3 E2E involves plan execution with LLM calls. Without a timeout, a hung LLM call can block CI indefinitely. M4 (#814) correctly uses [Timeout] 15 minutes.
Fix: Add [Timeout] 15 minutes to the test case.

3. Weak assertions on M3-specific commands

plan explain only asserts "no Traceback" — doesn't verify output contains decision details, rationale, or alternatives
plan correct --mode revert only asserts "no Traceback" — doesn't verify correction was acknowledged or re-execution occurred
Final status check only asserts "no Traceback" — doesn't verify plan reached a terminal state

A test that passes on empty stdout doesn't validate any acceptance criteria.
Fix: Add content assertions: Should Contain for keywords like rationale, decision, revert, and a status check for APPLIED or COMPLETE.

P2:should-fix (3)

4. Extract Plan Id uses [0-9A-Z]{26} regex — should be Crockford Base32 [0-9A-HJKMNP-TV-Z]{26}. M4 uses a different (also incorrect) regex. Standardize.

5. Run Process git calls (git init, git add, git commit) lack timeout and on_timeout=kill.

6. ~40 lines of init/resource/project/action boilerplate duplicated with M1, M2 E2E tests. Extract to shared resource keyword.

P3:nit (1)

7. invariant add assertion only checks output contains the word "invariant" — doesn't verify the invariant text was echoed back.

Verdict: REQUEST_CHANGES — the skip guard (P1-1) and timeout (P1-2) are quick fixes. The assertion weakness (P1-3) requires more thought but is essential for the acceptance gate to be meaningful.

## Code Review — PR #799 `test(e2e): E2E acceptance criteria for M3 (v3.2.0)` **Reviewer:** @brent.edwards | **Size:** M (test-only) | **Focus:** E2E test quality, assertion strength --- ### P1:must-fix (3) **1. Missing `Skip If No LLM Keys` guard — hard-FAILs CI without API keys** The test requires `OPENAI_API_KEY` or `GEMINI_API_KEY` but has no skip guard. In CI environments without these keys, all test cases FAIL with cryptic `rc=1` errors rather than being gracefully skipped. Compare with existing M5/M6 E2E tests which use `Skip If No LLM Keys`. **Fix:** Add skip guard as the first step in Suite Setup. **2. No `[Timeout]` on test case — unbounded execution** M3 E2E involves plan execution with LLM calls. Without a timeout, a hung LLM call can block CI indefinitely. M4 (#814) correctly uses `[Timeout] 15 minutes`. **Fix:** Add `[Timeout] 15 minutes` to the test case. **3. Weak assertions on M3-specific commands** - `plan explain` only asserts "no Traceback" — doesn't verify output contains decision details, rationale, or alternatives - `plan correct --mode revert` only asserts "no Traceback" — doesn't verify correction was acknowledged or re-execution occurred - Final status check only asserts "no Traceback" — doesn't verify plan reached a terminal state A test that passes on empty stdout doesn't validate any acceptance criteria. **Fix:** Add content assertions: `Should Contain` for keywords like `rationale`, `decision`, `revert`, and a status check for `APPLIED` or `COMPLETE`. --- ### P2:should-fix (3) **4.** `Extract Plan Id` uses `[0-9A-Z]{26}` regex — should be Crockford Base32 `[0-9A-HJKMNP-TV-Z]{26}`. M4 uses a different (also incorrect) regex. Standardize. **5.** `Run Process` git calls (`git init`, `git add`, `git commit`) lack `timeout` and `on_timeout=kill`. **6.** ~40 lines of init/resource/project/action boilerplate duplicated with M1, M2 E2E tests. Extract to shared resource keyword. --- ### P3:nit (1) **7.** `invariant add` assertion only checks output contains the word "invariant" — doesn't verify the invariant text was echoed back. --- **Verdict:** REQUEST_CHANGES — the skip guard (P1-1) and timeout (P1-2) are quick fixes. The assertion weakness (P1-3) requires more thought but is essential for the acceptance gate to be meaningful.

freemo requested reviews from brent.edwards, CoreRasurae

2026-03-17 18:24:15 +00:00

freemo commented

2026-03-17 18:40:36 +00:00

PM Status — Day 37

Ownership transferred to @hamza.khyari. Blocked on #628 and #966. PR is M3 (v3.2.0) — high priority.

Author: Please rebase onto latest master by Day 39 EOD (2026-03-19) and confirm blocker status. M3 PRs are critical path.

PM status — Day 37

## PM Status — Day 37 Ownership transferred to @hamza.khyari. Blocked on #628 and #966. PR is M3 (v3.2.0) — high priority. **Author**: Please rebase onto latest `master` by **Day 39 EOD (2026-03-19)** and confirm blocker status. M3 PRs are critical path. --- *PM status — Day 37*

hamza.khyari force-pushed test/e2e-m3-acceptance from 2f8fb9e65f

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 22s

Details

CI / build (pull_request) Successful in 24s

Details

CI / quality (pull_request) Successful in 29s

Details

CI / typecheck (pull_request) Successful in 40s

Details

CI / security (pull_request) Successful in 1m5s

Details

CI / e2e_tests (pull_request) Successful in 1m38s

Details

CI / integration_tests (pull_request) Successful in 3m41s

Details

CI / unit_tests (pull_request) Successful in 4m52s

Details

CI / docker (pull_request) Successful in 54s

Details

CI / coverage (pull_request) Successful in 6m0s

Details

CI / benchmark-regression (pull_request) Successful in 38m42s

Details

to cdfc14724c

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 16s

Details

CI / build (pull_request) Successful in 29s

Details

CI / quality (pull_request) Successful in 40s

Details

CI / security (pull_request) Successful in 53s

Details

CI / typecheck (pull_request) Successful in 1m3s

Details

CI / unit_tests (pull_request) Successful in 3m19s

Details

CI / integration_tests (pull_request) Successful in 3m43s

Details

CI / e2e_tests (pull_request) Successful in 4m10s

Details

CI / docker (pull_request) Successful in 56s

Details

CI / coverage (pull_request) Successful in 6m44s

Details

CI / benchmark-regression (pull_request) Successful in 37m31s

Details

2026-03-19 01:05:30 +00:00

Compare

hamza.khyari referenced this pull request from a commit

2026-03-19 01:05:31 +00:00

fix(test): harden M3 E2E assertions and consolidate Extract Plan Id keyword

hamza.khyari commented

2026-03-19 01:05:51 +00:00

Response to @brent.edwards Reviews (Rounds 1+2)

All 10 findings validated and addressed in cdfc1472. Rebased onto latest master.

P1 — Fixed (5)

#	Finding	Fix
1	Missing `Skip If No LLM Keys`	Added as first step after tags/timeout in test case
2	No `[Timeout]` on test case	Added `[Timeout] 15 minutes`
3	Weak `plan explain` assertions	Added `Should Not Be Empty` + keyword check for decision/rationale/reasoning/explain
4	Weak `plan correct` assertions	Added `Should Not Be Empty` + keyword check for revert/correct/decision
5	Weak final status assertions	Added `Should Not Contain Traceback` + keyword check for applied/complete/phase

P2 — Fixed (4)

#	Finding	Fix
6	`Extract Plan Id` regex `[0-9A-Z]`	Changed to Crockford Base32 `[0-9A-HJKMNP-TV-Z]{26}`
7	Git `Run Process` no timeout	Added `timeout=60s on_timeout=kill` to all git calls in both m3_acceptance.robot and common_e2e.resource
8	Inline Python `__import__('re')`	Replaced with idiomatic Robot `Get Regexp Matches` keyword
9	Setup boilerplate duplication	Acknowledged — cross-PR refactor across M1/M2/M3. Out of scope for this PR but noted for follow-up.

P3 — Fixed (1)

#	Finding	Fix
10	Weak invariant add assertion	Added `Output Should Contain ${inv_result} API function signatures`

Rebase

Rebased onto latest master (post-#663 merge). Conflict in common_e2e.resource resolved by merging master's rc-checking pattern with our timeout additions.

## Response to @brent.edwards Reviews (Rounds 1+2) All 10 findings validated and addressed in `cdfc1472`. Rebased onto latest master. ### P1 — Fixed (5) | # | Finding | Fix | |---|---------|-----| | **1** | Missing `Skip If No LLM Keys` | Added as first step after tags/timeout in test case | | **2** | No `[Timeout]` on test case | Added `[Timeout] 15 minutes` | | **3** | Weak `plan explain` assertions | Added `Should Not Be Empty` + keyword check for decision/rationale/reasoning/explain | | **4** | Weak `plan correct` assertions | Added `Should Not Be Empty` + keyword check for revert/correct/decision | | **5** | Weak final status assertions | Added `Should Not Contain Traceback` + keyword check for applied/complete/phase | ### P2 — Fixed (4) | # | Finding | Fix | |---|---------|-----| | **6** | `Extract Plan Id` regex `[0-9A-Z]` | Changed to Crockford Base32 `[0-9A-HJKMNP-TV-Z]{26}` | | **7** | Git `Run Process` no timeout | Added `timeout=60s on_timeout=kill` to all git calls in both m3_acceptance.robot and common_e2e.resource | | **8** | Inline Python `__import__('re')` | Replaced with idiomatic Robot `Get Regexp Matches` keyword | | **9** | Setup boilerplate duplication | Acknowledged — cross-PR refactor across M1/M2/M3. Out of scope for this PR but noted for follow-up. | ### P3 — Fixed (1) | # | Finding | Fix | |---|---------|-----| | **10** | Weak invariant add assertion | Added `Output Should Contain ${inv_result} API function signatures` | ### Rebase Rebased onto latest master (post-#663 merge). Conflict in `common_e2e.resource` resolved by merging master's rc-checking pattern with our timeout additions.

hamza.khyari removed the

labels

2026-03-23 13:35:55 +00:00

CoreRasurae requested changes

2026-03-24 12:57:39 +00:00

CoreRasurae left a comment

Code Review — PR #799 (Post-fix round)

Reviewer: automated (requested by @christofer) | Scope: branch test/e2e-m3-acceptance changes + close surroundings | Ref: issue #743, docs/specification.md

Context: Brent Edwards reviewed twice (Reviews #2322, #2329) and Hamza addressed all 10 findings in commits ef51bc42 and cdfc1472. This review covers issues that remain unaddressed or were not previously identified. Four full review cycles were performed across all categories (specification alignment, bug detection, test coverage, test flaws, code quality, security, performance) until no new findings emerged.

P1 — Critical (2)

C1. `plan explain` and `plan correct` invoked with Plan ID instead of Decision ID

Category: Specification Alignment | Files: robot/e2e/m3_acceptance.robot lines 150, 171

The specification is unambiguous:

agents plan explain ... <DECISION_ID> (spec line 14543)
agents plan correct --mode (revert|append) ... <DECISION_ID> (spec line 14909)

Both commands require a Decision ID, not a Plan ID. The test passes ${plan_id} to both:

# Line 150 — should be a decision_id
${r_explain}=    Run CleverAgents Command
...    plan    explain    ${plan_id}    --format    plain

# Line 171 — should be a decision_id
${r_correct}=    Run CleverAgents Command
...    plan    correct    ${plan_id}    --mode    revert

Impact: The M3 acceptance criteria for decision explanation ("plan explain shows decision details including alternatives considered") and decision correction ("plan correct --mode=revert re-executes from targeted decision point") are structurally untestable as written. Either the CLI rejects the wrong argument type and the test fails for the wrong reason, or the CLI silently accepts a plan ID and the test doesn't validate the intended spec behavior.

Fix: After step 8 (plan tree), extract a decision ID from the tree output (using --format json for reliable parsing), then pass that decision ID to plan explain and plan correct. Example:

${r_tree_json}=    Run CleverAgents Command
...    plan    tree    ${plan_id}    --format    json
${decision_id}=    Safe Parse Json Field    ${r_tree_json.stdout}    decision_id
Should Not Be Empty    ${decision_id}
...    Could not extract decision_id from plan tree JSON output

C2. Hardcoded `openai/gpt-4` actor — fails when only `ANTHROPIC_API_KEY` is set

Category: Bug / Test Reliability | File: robot/e2e/m3_acceptance.robot lines 109–110

The Skip If No LLM Keys guard (common_e2e.resource:56) passes when either ANTHROPIC_API_KEY or OPENAI_API_KEY is set. But the action YAML hardcodes OpenAI:

strategy_actor: openai/gpt-4
execution_actor: openai/gpt-4

When only ANTHROPIC_API_KEY is available, the test is not skipped (guard passes) but plan execute fails because no OpenAI credentials exist. This produces a misleading failure that appears to be a product bug rather than a test configuration issue.

Impact: In any CI or developer environment with only Anthropic credentials, this test fails with a confusing authentication error instead of being gracefully skipped.

Note: M1 and M2 have the same pattern, but M6 correctly demonstrates the fix — dynamic actor selection based on available keys:

# M6 pattern (preferred):
${actor}=    Set Variable If    '%{ANTHROPIC_API_KEY}' != ''
...    anthropic/claude-sonnet-4    openai/gpt-4o

Fix: Either dynamically select the actor/provider based on the available API key (M6 pattern), or tighten Skip If No LLM Keys to specifically require OPENAI_API_KEY for this test.

P2 — High (5)

H1. Missing return code assertions on `git add`/`git commit`

Category: Bug Detection | File: robot/e2e/m3_acceptance.robot lines 65–66

The git commands after creating the auth module file do not capture or check return codes:

Run Process    git    add    .    cwd=${repo_dir}    timeout=60s    on_timeout=kill
Run Process    git    commit    -m    Add auth module    cwd=${repo_dir}    timeout=60s    on_timeout=kill

Contrast with Create Temp Git Repo (common_e2e.resource:163–174) where every Run Process result is stored and its rc is asserted. If git add or git commit fails silently (e.g., disk full, permission error, lock contention), the test proceeds with an incomplete repository and fails in confusing ways at a later step.

Fix:

${git_add}=    Run Process    git    add    .    cwd=${repo_dir}    timeout=60s    on_timeout=kill
Should Be Equal As Integers    ${git_add.rc}    0    git add failed: ${git_add.stderr}
${git_commit}=    Run Process    git    commit    -m    Add auth module    cwd=${repo_dir}    timeout=60s    on_timeout=kill
Should Be Equal As Integers    ${git_commit.rc}    0    git commit failed: ${git_commit.stderr}

H2. No verification of invariant enforcement during strategize — acceptance criteria gap

Category: Test Coverage | File: robot/e2e/m3_acceptance.robot steps 3, 7, 8

Issue #743 acceptance criterion: "Test adds invariants via invariant add and verifies enforcement during strategize."

The test adds an invariant (step 3) and runs strategize (step 7), but never verifies that the invariant was enforced. Per the spec (line 18631), invariant enforcement creates invariant_enforced decisions. The plan tree output (step 8) should contain an invariant_enforced decision node referencing the added invariant, but no such check exists.

Fix: After plan tree, verify the tree output contains evidence of invariant enforcement:

# After plan tree
Should Be True    'invariant' in '''${tree_output}'''.lower()
...    Plan tree does not show invariant enforcement for the added invariant

Or better, with --format json, parse the tree and assert at least one decision has type: invariant_enforced.

H3. `plan explain` assertion doesn't check for "alternatives"

Category: Test Coverage | File: robot/e2e/m3_acceptance.robot line 158

Milestone acceptance criterion: "plan explain shows decision details including alternatives considered."

The assertion checks for decision, rationale, reasoning, or explain keywords, but not alternative:

Should Be True    'decision' in '''${explain_output}'''.lower() or 'rationale' in ...

Fix: Add 'alternative' in '''${explain_output}'''.lower() to the assertion, or better, add a separate dedicated assertion:

Should Be True    'alternative' in '''${explain_output}'''.lower() or 'option' in '''${explain_output}'''.lower()
...    Plan explain output does not mention alternatives or options.\nOutput: ${explain_output}

H4. `Should Be True` triple-quote injection risk in Python expression evaluation

Category: Test Robustness | File: robot/e2e/m3_acceptance.robot lines 145, 158, 183, 210

All keyword-check assertions use:

Should Be True    'keyword' in '''${output_variable}'''.lower()

If CLI output contains a literal ''' (triple single-quote), the Python expression becomes syntactically invalid and produces a confusing EvaluationError instead of a clear assertion failure. While unlikely in normal output, LLM-generated output can contain arbitrary text including Python code blocks with triple quotes.

Fix: Use $output_variable (Robot Framework variable reference in Python expressions) which avoids string interpolation issues:

Should Be True    'plan' in $tree_output.lower() or 'decision' in $tree_output.lower()

Or use Robot Framework's native Should Contain which is immune to injection:

${lower_output}=    Evaluate    $tree_output.lower()
Should Match Regexp    ${lower_output}    (plan|decision|tree)

H5. Overly permissive keyword-based assertions on plan tree/explain/correct/status

Category: Test Flaws | File: robot/e2e/m3_acceptance.robot lines 145, 158, 183, 210

The assertion pattern uses broad OR chains of common words:

plan tree: checks for plan, decision, or tree — the word "plan" appears in virtually any CLI output
plan explain: checks for decision, rationale, reasoning, or explain — explain appears in error messages like "could not explain"
plan correct: checks for revert, correct, or decision — correct appears in phrases like "correct the issue"
Final status: checks for applied, complete, or phase — phase is not a terminal indicator

These assertions can pass on error output, help text, or any message that incidentally contains these common words. The tests would pass even if the commands return the wrong content.

Fix: Use --format json for tree/explain/status and parse specific fields, or use more specific keyword combinations (e.g., check for decision_id rather than decision; check for strategy_choice or invariant_enforced in tree output).

P3 — Medium (6)

M1. Final status assertion accepts the generic word "phase"

File: robot/e2e/m3_acceptance.robot line 210

Should Be True    'applied' in ... or 'complete' in ... or 'phase' in ...

The word "phase" appears in non-terminal contexts like phase: strategize/queued. This assertion would pass even if the plan is stuck in an early phase.

Fix: Remove 'phase' from the OR chain, or use more specific patterns like 'applied' or 'phase: apply'.

M2. `plan explain` not tested with `--show-context` or `--show-reasoning` flags

File: robot/e2e/m3_acceptance.robot line 150 | Spec: lines 14551–14552

The spec documents these flags as key functionality for the explain command. Without testing them, there's no E2E verification that context snapshots and model reasoning are properly captured and displayed.

M3. No `plan tree --show-superseded` test after `plan correct`

File: robot/e2e/m3_acceptance.robot (between steps 11 and 12) | Spec: line 14325, 18473

After plan correct (step 11) creates a correction with superseded decisions, running plan tree --show-superseded would verify the correction history feature described in the spec. This is a core M3 feature (decision tree versioning) with no E2E coverage.

M4. No `plan correct --dry-run` coverage

File: robot/e2e/m3_acceptance.robot | Spec: line 14920

The spec documents --dry-run as a flag that shows impact without executing. This is a safety feature with no E2E test coverage.

M5. `Return From Keyword If` deprecated — mixed with IF/RETURN pattern

File: robot/e2e/common_e2e.resource lines 145, 149, 153

The Extract Plan Id keyword uses Return From Keyword If (deprecated in RF7+) while the rest of the file uses the modern IF/RETURN pattern. Inconsistent style and future-compatibility concern.

M6. No `[Teardown]` at test level

File: robot/e2e/m3_acceptance.robot

Unlike the M6 test which uses per-test teardown for resource cleanup, M3 relies solely on Suite Teardown. If the test fails mid-way through, any partial state (temp directories, orphaned processes) persists until Suite Teardown runs. Not critical since CLEVERAGENTS_HOME is a temp directory, but inconsistent with the M6 pattern.

P4 — Low (4)

L1. Combined-output pattern could be extracted to keyword

File: robot/e2e/m3_acceptance.robot — 6 instances of ${x}= Set Variable ${y.stdout}\n${y.stderr}

Minor code quality/DRY concern.

L2. Inconsistent assertion order between M2 and M3

M2: Should Not Contain ... INTERNAL before Traceback. M3: Traceback before INTERNAL. Minor style inconsistency.

L3. Potential sensitive data in assertion failure diagnostic messages

Lines 123–124 embed raw stdout/stderr in failure messages. If CLI output ever includes configuration details, they appear in the test report. Low risk since Run CleverAgents Command already logs these via Log.

L4. `tdd_expected_fail` on monolithic test — design concern

With the test failing at step 4 (invariant list, bug #1022), steps 5–14 are never exercised. When bug #1022 is fixed and step 4 passes, if a later step fails (e.g., step 9 with the wrong argument type — see C1), the tdd_expected_fail listener would invert FAIL→PASS, masking the step 9 bug. The developer fixing #1022 would only discover it after removing the tag. This is inherent to applying tdd_expected_fail on a monolithic multi-step test, but worth awareness.

Summary

Severity	Count	Categories
P1 Critical	2	Spec alignment (1), Test reliability (1)
P2 High	5	Bug detection (1), Test coverage (2), Test robustness (1), Test flaws (1)
P3 Medium	6	Test flaws (1), Test coverage (3), Code quality (2)
P4 Low	4	Code quality (2), Style (1), Security (1)
Total	17

Verdict: REQUEST_CHANGES — P1 items C1 (wrong argument type for plan explain/correct) and C2 (hardcoded provider mismatch with key guard) require fixes before merge. C1 in particular means the M3 acceptance criteria for decision explanation and correction are structurally untestable as currently written.

## Code Review — PR #799 (Post-fix round) **Reviewer:** automated (requested by @christofer) | **Scope:** branch `test/e2e-m3-acceptance` changes + close surroundings | **Ref:** issue #743, `docs/specification.md` **Context:** Brent Edwards reviewed twice (Reviews #2322, #2329) and Hamza addressed all 10 findings in commits `ef51bc42` and `cdfc1472`. This review covers issues that remain unaddressed or were not previously identified. Four full review cycles were performed across all categories (specification alignment, bug detection, test coverage, test flaws, code quality, security, performance) until no new findings emerged. --- ### P1 — Critical (2) #### C1. `plan explain` and `plan correct` invoked with Plan ID instead of Decision ID **Category:** Specification Alignment | **Files:** `robot/e2e/m3_acceptance.robot` lines 150, 171 The specification is unambiguous: - `agents plan explain ... <DECISION_ID>` (spec line 14543) - `agents plan correct --mode (revert|append) ... <DECISION_ID>` (spec line 14909) Both commands require a **Decision ID**, not a Plan ID. The test passes `${plan_id}` to both: ```robot # Line 150 — should be a decision_id ${r_explain}= Run CleverAgents Command ... plan explain ${plan_id} --format plain # Line 171 — should be a decision_id ${r_correct}= Run CleverAgents Command ... plan correct ${plan_id} --mode revert ``` **Impact:** The M3 acceptance criteria for decision explanation ("plan explain shows decision details including alternatives considered") and decision correction ("plan correct --mode=revert re-executes from targeted decision point") are **structurally untestable** as written. Either the CLI rejects the wrong argument type and the test fails for the wrong reason, or the CLI silently accepts a plan ID and the test doesn't validate the intended spec behavior. **Fix:** After step 8 (`plan tree`), extract a decision ID from the tree output (using `--format json` for reliable parsing), then pass that decision ID to `plan explain` and `plan correct`. Example: ```robot ${r_tree_json}= Run CleverAgents Command ... plan tree ${plan_id} --format json ${decision_id}= Safe Parse Json Field ${r_tree_json.stdout} decision_id Should Not Be Empty ${decision_id} ... Could not extract decision_id from plan tree JSON output ``` --- #### C2. Hardcoded `openai/gpt-4` actor — fails when only `ANTHROPIC_API_KEY` is set **Category:** Bug / Test Reliability | **File:** `robot/e2e/m3_acceptance.robot` lines 109–110 The `Skip If No LLM Keys` guard (common_e2e.resource:56) passes when **either** `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` is set. But the action YAML hardcodes OpenAI: ```yaml strategy_actor: openai/gpt-4 execution_actor: openai/gpt-4 ``` When only `ANTHROPIC_API_KEY` is available, the test is **not** skipped (guard passes) but `plan execute` fails because no OpenAI credentials exist. This produces a misleading failure that appears to be a product bug rather than a test configuration issue. **Impact:** In any CI or developer environment with only Anthropic credentials, this test fails with a confusing authentication error instead of being gracefully skipped. **Note:** M1 and M2 have the same pattern, but M6 correctly demonstrates the fix — dynamic actor selection based on available keys: ```robot # M6 pattern (preferred): ${actor}= Set Variable If '%{ANTHROPIC_API_KEY}' != '' ... anthropic/claude-sonnet-4 openai/gpt-4o ``` **Fix:** Either dynamically select the actor/provider based on the available API key (M6 pattern), or tighten `Skip If No LLM Keys` to specifically require `OPENAI_API_KEY` for this test. --- ### P2 — High (5) #### H1. Missing return code assertions on `git add`/`git commit` **Category:** Bug Detection | **File:** `robot/e2e/m3_acceptance.robot` lines 65–66 The git commands after creating the auth module file do not capture or check return codes: ```robot Run Process git add . cwd=${repo_dir} timeout=60s on_timeout=kill Run Process git commit -m Add auth module cwd=${repo_dir} timeout=60s on_timeout=kill ``` Contrast with `Create Temp Git Repo` (common_e2e.resource:163–174) where every `Run Process` result is stored and its `rc` is asserted. If `git add` or `git commit` fails silently (e.g., disk full, permission error, lock contention), the test proceeds with an incomplete repository and fails in confusing ways at a later step. **Fix:** ```robot ${git_add}= Run Process git add . cwd=${repo_dir} timeout=60s on_timeout=kill Should Be Equal As Integers ${git_add.rc} 0 git add failed: ${git_add.stderr} ${git_commit}= Run Process git commit -m Add auth module cwd=${repo_dir} timeout=60s on_timeout=kill Should Be Equal As Integers ${git_commit.rc} 0 git commit failed: ${git_commit.stderr} ``` --- #### H2. No verification of invariant enforcement during strategize — acceptance criteria gap **Category:** Test Coverage | **File:** `robot/e2e/m3_acceptance.robot` steps 3, 7, 8 Issue #743 acceptance criterion: *"Test adds invariants via `invariant add` and verifies enforcement during strategize."* The test adds an invariant (step 3) and runs strategize (step 7), but **never verifies** that the invariant was enforced. Per the spec (line 18631), invariant enforcement creates `invariant_enforced` decisions. The `plan tree` output (step 8) should contain an `invariant_enforced` decision node referencing the added invariant, but no such check exists. **Fix:** After `plan tree`, verify the tree output contains evidence of invariant enforcement: ```robot # After plan tree Should Be True 'invariant' in '''${tree_output}'''.lower() ... Plan tree does not show invariant enforcement for the added invariant ``` Or better, with `--format json`, parse the tree and assert at least one decision has `type: invariant_enforced`. --- #### H3. `plan explain` assertion doesn't check for "alternatives" **Category:** Test Coverage | **File:** `robot/e2e/m3_acceptance.robot` line 158 Milestone acceptance criterion: *"plan explain shows decision details **including alternatives considered**."* The assertion checks for `decision`, `rationale`, `reasoning`, or `explain` keywords, but not `alternative`: ```robot Should Be True 'decision' in '''${explain_output}'''.lower() or 'rationale' in ... ``` **Fix:** Add `'alternative' in '''${explain_output}'''.lower()` to the assertion, or better, add a separate dedicated assertion: ```robot Should Be True 'alternative' in '''${explain_output}'''.lower() or 'option' in '''${explain_output}'''.lower() ... Plan explain output does not mention alternatives or options.\nOutput: ${explain_output} ``` --- #### H4. `Should Be True` triple-quote injection risk in Python expression evaluation **Category:** Test Robustness | **File:** `robot/e2e/m3_acceptance.robot` lines 145, 158, 183, 210 All keyword-check assertions use: ```robot Should Be True 'keyword' in '''${output_variable}'''.lower() ``` If CLI output contains a literal `'''` (triple single-quote), the Python expression becomes syntactically invalid and produces a confusing `EvaluationError` instead of a clear assertion failure. While unlikely in normal output, LLM-generated output can contain arbitrary text including Python code blocks with triple quotes. **Fix:** Use `$output_variable` (Robot Framework variable reference in Python expressions) which avoids string interpolation issues: ```robot Should Be True 'plan' in $tree_output.lower() or 'decision' in $tree_output.lower() ``` Or use Robot Framework's native `Should Contain` which is immune to injection: ```robot ${lower_output}= Evaluate $tree_output.lower() Should Match Regexp ${lower_output} (plan|decision|tree) ``` --- #### H5. Overly permissive keyword-based assertions on plan tree/explain/correct/status **Category:** Test Flaws | **File:** `robot/e2e/m3_acceptance.robot` lines 145, 158, 183, 210 The assertion pattern uses broad `OR` chains of common words: - `plan tree`: checks for `plan`, `decision`, or `tree` — the word "plan" appears in virtually any CLI output - `plan explain`: checks for `decision`, `rationale`, `reasoning`, or `explain` — `explain` appears in error messages like "could not explain" - `plan correct`: checks for `revert`, `correct`, or `decision` — `correct` appears in phrases like "correct the issue" - Final status: checks for `applied`, `complete`, or `phase` — `phase` is not a terminal indicator These assertions can pass on error output, help text, or any message that incidentally contains these common words. The tests would pass even if the commands return the wrong content. **Fix:** Use `--format json` for tree/explain/status and parse specific fields, or use more specific keyword combinations (e.g., check for `decision_id` rather than `decision`; check for `strategy_choice` or `invariant_enforced` in tree output). --- ### P3 — Medium (6) #### M1. Final status assertion accepts the generic word "phase" **File:** `robot/e2e/m3_acceptance.robot` line 210 ```robot Should Be True 'applied' in ... or 'complete' in ... or 'phase' in ... ``` The word "phase" appears in non-terminal contexts like `phase: strategize/queued`. This assertion would pass even if the plan is stuck in an early phase. **Fix:** Remove `'phase'` from the `OR` chain, or use more specific patterns like `'applied'` or `'phase: apply'`. --- #### M2. `plan explain` not tested with `--show-context` or `--show-reasoning` flags **File:** `robot/e2e/m3_acceptance.robot` line 150 | **Spec:** lines 14551–14552 The spec documents these flags as key functionality for the explain command. Without testing them, there's no E2E verification that context snapshots and model reasoning are properly captured and displayed. --- #### M3. No `plan tree --show-superseded` test after `plan correct` **File:** `robot/e2e/m3_acceptance.robot` (between steps 11 and 12) | **Spec:** line 14325, 18473 After `plan correct` (step 11) creates a correction with superseded decisions, running `plan tree --show-superseded` would verify the correction history feature described in the spec. This is a core M3 feature (decision tree versioning) with no E2E coverage. --- #### M4. No `plan correct --dry-run` coverage **File:** `robot/e2e/m3_acceptance.robot` | **Spec:** line 14920 The spec documents `--dry-run` as a flag that shows impact without executing. This is a safety feature with no E2E test coverage. --- #### M5. `Return From Keyword If` deprecated — mixed with IF/RETURN pattern **File:** `robot/e2e/common_e2e.resource` lines 145, 149, 153 The `Extract Plan Id` keyword uses `Return From Keyword If` (deprecated in RF7+) while the rest of the file uses the modern `IF`/`RETURN` pattern. Inconsistent style and future-compatibility concern. --- #### M6. No `[Teardown]` at test level **File:** `robot/e2e/m3_acceptance.robot` Unlike the M6 test which uses per-test teardown for resource cleanup, M3 relies solely on Suite Teardown. If the test fails mid-way through, any partial state (temp directories, orphaned processes) persists until Suite Teardown runs. Not critical since `CLEVERAGENTS_HOME` is a temp directory, but inconsistent with the M6 pattern. --- ### P4 — Low (4) #### L1. Combined-output pattern could be extracted to keyword **File:** `robot/e2e/m3_acceptance.robot` — 6 instances of `${x}= Set Variable ${y.stdout}\n${y.stderr}` Minor code quality/DRY concern. --- #### L2. Inconsistent assertion order between M2 and M3 M2: `Should Not Contain ... INTERNAL` before `Traceback`. M3: `Traceback` before `INTERNAL`. Minor style inconsistency. --- #### L3. Potential sensitive data in assertion failure diagnostic messages Lines 123–124 embed raw stdout/stderr in failure messages. If CLI output ever includes configuration details, they appear in the test report. Low risk since `Run CleverAgents Command` already logs these via `Log`. --- #### L4. `tdd_expected_fail` on monolithic test — design concern With the test failing at step 4 (invariant list, bug #1022), steps 5–14 are never exercised. When bug #1022 is fixed and step 4 passes, if a later step fails (e.g., step 9 with the wrong argument type — see C1), the `tdd_expected_fail` listener would invert FAIL→PASS, masking the step 9 bug. The developer fixing #1022 would only discover it after removing the tag. This is inherent to applying `tdd_expected_fail` on a monolithic multi-step test, but worth awareness. --- ### Summary | Severity | Count | Categories | |----------|-------|------------| | P1 Critical | 2 | Spec alignment (1), Test reliability (1) | | P2 High | 5 | Bug detection (1), Test coverage (2), Test robustness (1), Test flaws (1) | | P3 Medium | 6 | Test flaws (1), Test coverage (3), Code quality (2) | | P4 Low | 4 | Code quality (2), Style (1), Security (1) | | **Total** | **17** | | **Verdict: REQUEST_CHANGES** — P1 items C1 (wrong argument type for plan explain/correct) and C2 (hardcoded provider mismatch with key guard) require fixes before merge. C1 in particular means the M3 acceptance criteria for decision explanation and correction are structurally untestable as currently written.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +62,4 @@

				    ...    ${SPACE}${SPACE}${SPACE}${SPACE}conn.close()

				    ...    ${SPACE}${SPACE}${SPACE}${SPACE}return {"id": row[0], "name": row[1]} if row else {}

				    Create File    ${repo_dir}${/}src${/}auth.py    ${auth_code}

				    Run Process    git    add    .    cwd=${repo_dir}    timeout=60s    on_timeout=kill

H1 (P2): Result not captured — return code not checked. Contrast with Create Temp Git Repo (common_e2e.resource:163-174) where every git call checks rc. A silent git add failure lets the test proceed with an incomplete repo.

**H1 (P2):** Result not captured — return code not checked. Contrast with `Create Temp Git Repo` (common_e2e.resource:163-174) where every git call checks rc. A silent `git add` failure lets the test proceed with an incomplete repo.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +106,4 @@

				    ...    name: ${ACTION_NAME}

				    ...    description: Refactor auth module from raw SQL to ORM

				    ...    definition_of_done: Replace raw SQL queries with ORM while preserving API

				    ...    strategy_actor: openai/gpt-4

C2 (P1): Hardcoded openai/gpt-4 — if only ANTHROPIC_API_KEY is set, Skip If No LLM Keys passes but plan execute fails on OpenAI auth. M6 dynamically selects actor based on available key. Consider the same pattern here.

**C2 (P1):** Hardcoded `openai/gpt-4` — if only `ANTHROPIC_API_KEY` is set, `Skip If No LLM Keys` passes but plan execute fails on OpenAI auth. M6 dynamically selects actor based on available key. Consider the same pattern here.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +147,4 @@

				    # ── 9. Plan explain — inspect a decision (use plan_id as fallback) ──

				    ${r_explain}=    Run CleverAgents Command

				    ...    plan    explain    ${plan_id}    --format    plain

C1 (P1): Spec says agents plan explain <DECISION_ID> (spec line 14543) — this passes a Plan ID. Should extract a decision ID from plan tree --format json output first, then pass it here. Same issue at line 171 for plan correct.

**C1 (P1):** Spec says `agents plan explain <DECISION_ID>` (spec line 14543) — this passes a Plan ID. Should extract a decision ID from `plan tree --format json` output first, then pass it here. Same issue at line 171 for `plan correct`.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +142,4 @@

				    # Tree output should reference the plan or contain tree-like structure

				    Should Not Be Empty    ${r_tree.stdout}

				    ...    Plan tree produced no output for plan ${plan_id}

				    Should Be True    'plan' in '''${tree_output}'''.lower() or 'decision' in '''${tree_output}'''.lower() or 'tree' in '''${tree_output}'''.lower()

H4 (P2): If CLI output contains ''' (triple single-quote), this Python expression breaks. Use $tree_output (RF variable reference in Python expressions) instead of string interpolation: 'plan' in $tree_output.lower()

**H4 (P2):** If CLI output contains `'''` (triple single-quote), this Python expression breaks. Use `$tree_output` (RF variable reference in Python expressions) instead of string interpolation: `'plan' in $tree_output.lower()`

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +168,4 @@

				    # ── 11. Plan correct — revert mode with guidance ──

				    ${r_correct}=    Run CleverAgents Command

				    ...    plan    correct    ${plan_id}    --mode    revert

C1 (P1): Spec says agents plan correct ... <DECISION_ID> (spec line 14909) — this passes ${plan_id} instead of a decision ID. The correction acceptance criterion ('re-executes from targeted decision point') requires targeting a specific decision, not a plan.

**C1 (P1):** Spec says `agents plan correct ... <DECISION_ID>` (spec line 14909) — this passes `${plan_id}` instead of a decision ID. The correction acceptance criterion ('re-executes from targeted decision point') requires targeting a specific decision, not a plan.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +155,4 @@

				    Should Not Be Empty    ${r_explain.stdout}

				    ...    Plan explain produced no output for plan ${plan_id}

				    # M3 AC: "plan explain shows decision details including alternatives"

				    Should Be True    'decision' in '''${explain_output}'''.lower() or 'rationale' in '''${explain_output}'''.lower() or 'reasoning' in '''${explain_output}'''.lower() or 'explain' in '''${explain_output}'''.lower()

H3 (P2): Milestone AC says 'plan explain shows decision details including alternatives considered'. The assertion checks for decision/rationale/reasoning/explain but not 'alternative' or 'option'.

**H3 (P2):** Milestone AC says 'plan explain shows decision details **including alternatives considered**'. The assertion checks for decision/rationale/reasoning/explain but not 'alternative' or 'option'.

robot/e2e/m3_acceptance.robot Outdated

					
				@ -0,0 +207,4 @@

				    ${status_output}=    Set Variable    ${r_status.stdout}\n${r_status.stderr}

				    Should Not Contain    ${status_output}    Traceback

				    # Verify plan reached a terminal or advanced phase

				    Should Be True    'applied' in '''${status_output}'''.lower() or 'complete' in '''${status_output}'''.lower() or 'phase' in '''${status_output}'''.lower()

M1 (P3): The word 'phase' is too broad — it appears in non-terminal contexts like phase: strategize/queued. Consider removing it from the OR chain or using more specific patterns like 'apply'.

**M1 (P3):** The word 'phase' is too broad — it appears in non-terminal contexts like `phase: strategize/queued`. Consider removing it from the OR chain or using more specific patterns like `'apply'`.

hamza.khyari force-pushed test/e2e-m3-acceptance from cdfc14724c

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 16s

Details

CI / build (pull_request) Successful in 29s

Details

CI / quality (pull_request) Successful in 40s

Details

CI / security (pull_request) Successful in 53s

Details

CI / typecheck (pull_request) Successful in 1m3s

Details

CI / unit_tests (pull_request) Successful in 3m19s

Details

CI / integration_tests (pull_request) Successful in 3m43s

Details

CI / e2e_tests (pull_request) Successful in 4m10s

Details

CI / docker (pull_request) Successful in 56s

Details

CI / coverage (pull_request) Successful in 6m44s

Details

CI / benchmark-regression (pull_request) Successful in 37m31s

Details

to 403ce29a1f

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / build (pull_request) Successful in 19s

Details

CI / helm (pull_request) Successful in 28s

Details

CI / typecheck (pull_request) Successful in 54s

Details

CI / lint (pull_request) Has been cancelled

Details

CI / e2e_tests (pull_request) Has been cancelled

Details

CI / security (pull_request) Has been cancelled

Details

CI / quality (pull_request) Has been cancelled

Details

CI / integration_tests (pull_request) Has been cancelled

Details

CI / unit_tests (pull_request) Has been cancelled

Details

CI / coverage (pull_request) Has been cancelled

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

CI / docker (pull_request) Has been cancelled

Details

CI / status-check (pull_request) Has been cancelled

Details

2026-04-01 15:54:51 +00:00

Compare

hamza.khyari referenced this pull request from a commit

2026-04-01 15:54:53 +00:00

fix(test): harden M3 E2E assertions and consolidate Extract Plan Id keyword

hamza.khyari added 1 commit

2026-04-01 15:57:50 +00:00

fix(test): add context snapshot and invariant enforcement assertions for M3 AC

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / build (pull_request) Successful in 19s

Details

CI / helm (pull_request) Successful in 24s

Details

CI / lint (pull_request) Successful in 3m19s

Details

CI / typecheck (pull_request) Successful in 3m56s

Details

CI / quality (pull_request) Successful in 4m0s

Details

CI / security (pull_request) Successful in 4m21s

Details

CI / unit_tests (pull_request) Successful in 6m28s

Details

CI / docker (pull_request) Successful in 1m42s

Details

CI / e2e_tests (pull_request) Has been cancelled

Details

CI / coverage (pull_request) Has been cancelled

Details

CI / benchmark-regression (pull_request) Has been cancelled

Details

CI / integration_tests (pull_request) Has been cancelled

Details

CI / status-check (pull_request) Has been cancelled

Details

fed584bd58

AC-2: Add plan tree --format json step (8b) to verify decisions are
recorded with context snapshots.  Checks for decision_id keys and
context_snapshot/snapshot/context presence in JSON output.

AC-5: Add invariant enforcement check (7b) after strategize phase.
Checks whether the invariant text or 'invariant' keyword appears in
strategize output.  Diagnostic-only (not hard assertion) because LLM
output is non-deterministic.

Ref: #743

hamza.khyari added 1 commit

2026-04-01 16:11:37 +00:00

fix(test): address CoreRasurae review findings for M3 E2E acceptance test

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / build (pull_request) Successful in 19s

Details

CI / helm (pull_request) Successful in 23s

Details

CI / quality (pull_request) Successful in 1m5s

Details

CI / lint (pull_request) Successful in 3m19s

Details

CI / typecheck (pull_request) Successful in 3m56s

Details

CI / security (pull_request) Successful in 4m7s

Details

CI / unit_tests (pull_request) Successful in 9m32s

Details

CI / docker (pull_request) Successful in 1m42s

Details

CI / e2e_tests (pull_request) Failing after 12m53s

Details

CI / coverage (pull_request) Successful in 12m19s

Details

CI / integration_tests (pull_request) Successful in 24m43s

Details

CI / status-check (pull_request) Failing after 1s

Details

CI / benchmark-regression (pull_request) Successful in 55m14s

Details

a3e9d56001

C1 (Critical): Fix plan explain and plan correct to use Decision ID
extracted from plan tree --format json instead of Plan ID.  The spec
requires a decision_id, not a plan_id.

C2 (Critical): Dynamic actor selection based on available API key.
Uses anthropic/claude-sonnet-4 when ANTHROPIC_API_KEY is set,
falls back to openai/gpt-4.  Prevents misleading auth failures in
Anthropic-only environments.

H1: Add rc assertions on git add/commit after auth module creation.
H3: Add alternative/option to plan explain keyword check.
H4: Replace triple-quote Should Be True with $variable syntax across
all keyword assertions to prevent injection from LLM output.
H5: Tighten keyword assertions — use decision_id structural check
in tree JSON, specific terminal indicators in final status.
M1: Remove generic 'phase' from final status, use applied/complete/apply.
M5: Replace deprecated Return From Keyword If with IF/RETURN in
Extract Plan Id keyword (common_e2e.resource).

Ref: #743

hamza.khyari added 1 commit

2026-04-01 19:12:46 +00:00

test(e2e): harden M3 acceptance assertions for snapshots and invariants

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / build (pull_request) Successful in 17s

Details

CI / lint (pull_request) Successful in 20s

Details

CI / helm (pull_request) Successful in 34s

Details

CI / security (pull_request) Successful in 1m2s

Details

CI / quality (pull_request) Successful in 3m43s

Details

CI / typecheck (pull_request) Successful in 3m58s

Details

CI / unit_tests (pull_request) Successful in 6m12s

Details

CI / docker (pull_request) Successful in 11s

Details

CI / coverage (pull_request) Successful in 12m34s

Details

CI / integration_tests (pull_request) Successful in 21m40s

Details

CI / benchmark-regression (pull_request) Successful in 55m10s

Details

CI / e2e_tests (pull_request) Failing after 16m16s

Details

CI / status-check (pull_request) Failing after 1s

Details

d8e1e86149

- AC-2: assert context_snapshot.hot_context_hash via plan explain
  --format json --show-context --show-reasoning
- AC-5: add hard assertion that plan tree --format json contains an
  invariant_enforced decision for the added invariant

The AC-5 assertion is intentionally hard while the test remains
@tdd_expected_fail. This encodes the intended CLI behavior without
relying on non-deterministic LLM wording.

Ref: #743

hamza.khyari referenced this pull request

2026-04-01 19:15:48 +00:00

bug(cli): strategize path does not surface invariant enforcement in decision tree output #1245

brent.edwards requested changes

2026-04-01 19:27:36 +00:00

brent.edwards left a comment

PR Review: !799 (Ticket #743)

Verdict: Request Changes

Reviewed latest head d8e1e861. The recent changes improved snapshot and invariant assertions and fixed the direct plan_id/decision_id call sites, but the suite is still not a reliable M3 acceptance gate.

Critical Issues

None

Major Issues

P1:must-fix — robot/e2e/m3_acceptance.robot:27
- Problem: The acceptance case is still tagged tdd_expected_fail. Because this is a single monolithic test, the known early failure path short-circuits the later decision/explain/correct checks while the listener still turns the overall result into a pass. That means the PR can merge without the M3 acceptance path actually passing.
- Recommendation: Move the bug-capture behavior into a dedicated TDD test and make the M3 acceptance suite pass normally, or remove tdd_expected_fail once the bug is fixed.
P1:must-fix — robot/e2e/m3_acceptance.robot:158-164
- Problem: Decision extraction still grabs every ULID in the tree JSON and uses the first one. That can still select plan_id or another ULID-shaped field instead of a real decision_id, so the later explain/correct steps are still structurally fragile.
- Recommendation: Extract only values from the "decision_id" key (or parse the JSON tree and walk the decision nodes explicitly) before invoking plan explain and plan correct.
P1:must-fix — robot/e2e/m3_acceptance.robot:211-226
- Problem: The revert correction assertion still only checks for generic words like revert, correct, or decision. It does not verify that the targeted decision was corrected or that re-execution happened from that decision point, which is one of the core M3 acceptance requirements.
- Recommendation: Assert structured correction fields such as Corrects, New Decision, Impact, or recompute details, and/or follow the correction with plan tree --show-superseded to prove the targeted decision was superseded and recomputed.
P1:must-fix — robot/e2e/m3_acceptance.robot:245-253
- Problem: The final status assertion is still too weak for an acceptance gate. Checking for the word apply can pass on non-terminal or intermediate Apply-phase output rather than successful completion.
- Recommendation: Use structured status output and assert a successful terminal state explicitly instead of matching the generic word apply.

Minor Issues

None

Nits

None

Summary

The latest updates fixed some earlier concerns, especially around snapshot and invariant assertions, but the suite still remains expected-fail and does not yet strictly prove targeted decision correction or successful lifecycle completion on the latest head.

## PR Review: !799 (Ticket #743) ### Verdict: Request Changes Reviewed latest head `d8e1e861`. The recent changes improved snapshot and invariant assertions and fixed the direct `plan_id`/`decision_id` call sites, but the suite is still not a reliable M3 acceptance gate. ### Critical Issues None ### Major Issues - **P1:must-fix — `robot/e2e/m3_acceptance.robot:27`** - **Problem:** The acceptance case is still tagged `tdd_expected_fail`. Because this is a single monolithic test, the known early failure path short-circuits the later decision/explain/correct checks while the listener still turns the overall result into a pass. That means the PR can merge without the M3 acceptance path actually passing. - **Recommendation:** Move the bug-capture behavior into a dedicated TDD test and make the M3 acceptance suite pass normally, or remove `tdd_expected_fail` once the bug is fixed. - **P1:must-fix — `robot/e2e/m3_acceptance.robot:158-164`** - **Problem:** Decision extraction still grabs every ULID in the tree JSON and uses the first one. That can still select `plan_id` or another ULID-shaped field instead of a real `decision_id`, so the later explain/correct steps are still structurally fragile. - **Recommendation:** Extract only values from the `"decision_id"` key (or parse the JSON tree and walk the decision nodes explicitly) before invoking `plan explain` and `plan correct`. - **P1:must-fix — `robot/e2e/m3_acceptance.robot:211-226`** - **Problem:** The revert correction assertion still only checks for generic words like `revert`, `correct`, or `decision`. It does not verify that the targeted decision was corrected or that re-execution happened from that decision point, which is one of the core M3 acceptance requirements. - **Recommendation:** Assert structured correction fields such as `Corrects`, `New Decision`, `Impact`, or recompute details, and/or follow the correction with `plan tree --show-superseded` to prove the targeted decision was superseded and recomputed. - **P1:must-fix — `robot/e2e/m3_acceptance.robot:245-253`** - **Problem:** The final status assertion is still too weak for an acceptance gate. Checking for the word `apply` can pass on non-terminal or intermediate Apply-phase output rather than successful completion. - **Recommendation:** Use structured status output and assert a successful terminal state explicitly instead of matching the generic word `apply`. ### Minor Issues None ### Nits None ### Summary The latest updates fixed some earlier concerns, especially around snapshot and invariant assertions, but the suite still remains expected-fail and does not yet strictly prove targeted decision correction or successful lifecycle completion on the latest head.

hamza.khyari added 1 commit

2026-04-01 23:08:55 +00:00

fix(test): remove tdd_expected_fail, harden correction and status assertions

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / lint (pull_request) Successful in 18s

Details

CI / build (pull_request) Successful in 19s

Details

CI / helm (pull_request) Successful in 21s

Details

CI / quality (pull_request) Successful in 44s

Details

CI / security (pull_request) Successful in 1m0s

Details

CI / typecheck (pull_request) Successful in 3m55s

Details

CI / unit_tests (pull_request) Successful in 6m5s

Details

CI / docker (pull_request) Successful in 1m20s

Details

CI / coverage (pull_request) Successful in 9m15s

Details

CI / e2e_tests (pull_request) Failing after 18m36s

Details

CI / integration_tests (pull_request) Successful in 25m38s

Details

CI / status-check (pull_request) Failing after 1s

Details

CI / benchmark-regression (pull_request) Successful in 55m30s

Details

c33ca04ee2

P1-A: Remove tdd_expected_fail, tdd_bug, tdd_bug_1022 tags — bug
#1022 is closed, the test should pass on its own now.

P1-C: After plan correct, verify the decision tree contains
'superseded: true' proving the targeted decision was actually
corrected and marked superseded.

P1-D: Replace weak keyword-based final status check with structured
JSON assertion using plan status --format json.  Verify phase is
'apply' and processing_state is one of queued/processing/complete/applied.

P1-B (decision_id extraction): validated as not-a-bug — plan tree
--format json does not include plan_id, only decision_id fields.
First ULID match is always a decision_id.

Ref: #743

hamza.khyari added 1 commit

2026-04-02 00:26:29 +00:00

fix(test): restore tdd_expected_fail — bug #1022 fix not yet on master

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / build (pull_request) Successful in 16s

Details

CI / lint (pull_request) Successful in 17s

Details

CI / helm (pull_request) Successful in 28s

Details

CI / typecheck (pull_request) Successful in 1m1s

Details

CI / security (pull_request) Successful in 1m2s

Details

CI / quality (pull_request) Successful in 3m41s

Details

CI / unit_tests (pull_request) Successful in 6m4s

Details

CI / docker (pull_request) Successful in 26s

Details

CI / coverage (pull_request) Successful in 9m25s

Details

CI / e2e_tests (pull_request) Successful in 17m48s

Details

CI / integration_tests (pull_request) Successful in 21m33s

Details

CI / status-check (pull_request) Successful in 1s

Details

CI / benchmark-regression (pull_request) Successful in 55m14s

Details

56b867d5ea

Issue #1022 was closed but the actual invariant persistence fix has
not been merged to master.  invariant list still returns 'no
invariants found' between CLI invocations.  Restore tdd_expected_fail
with correct tdd_issue tags until the fix lands.

Ref: #743

hamza.khyari was unassigned by freemo

2026-04-02 06:15:23 +00:00

freemo self-assigned this

2026-04-02 06:15:23 +00:00

freemo force-pushed test/e2e-m3-acceptance from 56b867d5ea

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / build (pull_request) Successful in 16s

Details

CI / lint (pull_request) Successful in 17s

Details

CI / helm (pull_request) Successful in 28s

Details

CI / typecheck (pull_request) Successful in 1m1s

Details

CI / security (pull_request) Successful in 1m2s

Details

CI / quality (pull_request) Successful in 3m41s

Details

CI / unit_tests (pull_request) Successful in 6m4s

Details

CI / docker (pull_request) Successful in 26s

Details

CI / coverage (pull_request) Successful in 9m25s

Details

CI / e2e_tests (pull_request) Successful in 17m48s

Details

CI / integration_tests (pull_request) Successful in 21m33s

Details

CI / status-check (pull_request) Successful in 1s

Details

CI / benchmark-regression (pull_request) Successful in 55m14s

Details

to a4d543428c

CI / e2e_tests (pull_request) Failing after 5s

Details

CI / build (pull_request) Successful in 15s

Details

CI / helm (pull_request) Successful in 21s

Details

CI / lint (pull_request) Successful in 3m30s

Details

CI / quality (pull_request) Successful in 3m56s

Details

CI / typecheck (pull_request) Successful in 4m10s

Details

CI / security (pull_request) Successful in 4m22s

Details

CI / unit_tests (pull_request) Successful in 9m31s

Details

CI / docker (pull_request) Successful in 11s

Details

CI / integration_tests (pull_request) Successful in 22m2s

Details

CI / coverage (pull_request) Successful in 12m41s

Details

CI / status-check (pull_request) Failing after 1s

Details

CI / benchmark-publish (pull_request) Has been skipped

Details

CI / benchmark-regression (pull_request) Successful in 55m9s

Details

2026-04-02 06:51:24 +00:00

Compare

freemo referenced this pull request from a commit

2026-04-02 06:51:24 +00:00

fix(test): harden M3 E2E assertions and consolidate Extract Plan Id keyword

hamza.khyari self-assigned this

2026-04-02 11:23:00 +00:00

freemo was unassigned by hamza.khyari

2026-04-02 11:23:01 +00:00

freemo referenced this pull request

2026-04-02 16:48:55 +00:00

bug(cli): strategize path does not surface invariant enforcement in decision tree output #1245

freemo commented

2026-04-02 17:33:37 +00:00

🤖 Backlog Groomer (groomer-1): Closing as duplicate of #743.

Issue #743 (test(e2e): E2E acceptance criteria for M3 (v3.2.0) — decisions, validations, invariants) is the canonical version with full labels (MoSCoW/Must have, Priority/Critical, State/In Review, Type/Testing) and milestone v3.2.0. This issue is an exact title duplicate.

🤖 **Backlog Groomer (groomer-1):** Closing as duplicate of #743. Issue #743 (`test(e2e): E2E acceptance criteria for M3 (v3.2.0) — decisions, validations, invariants`) is the canonical version with full labels (`MoSCoW/Must have`, `Priority/Critical`, `State/In Review`, `Type/Testing`) and milestone `v3.2.0`. This issue is an exact title duplicate.

freemo closed this pull request

2026-04-02 17:33:51 +00:00