bug(test): M1–M6 E2E verification suites use mocks instead of exercising the real system #658

Closed
opened 2026-03-10 00:28:40 +00:00 by brent.edwards · 10 comments
Member

Summary

The E2E verification test suites for milestones M1 through M6 (robot/m{1..6}_e2e_verification.robot + robot/helper_m{1..6}_e2e_verification.py) do not perform genuine end-to-end testing. Across the 6 suites (56 total test cases), the CLI-facing tests mock the service layer, use Typer's in-process CliRunner instead of subprocess invocation, and create test fixtures programmatically. As a result, these suites cannot detect integration failures that occur when running the actual agents CLI in a real environment.


Audit Results by Milestone

M1 — m1_e2e_verification (8 test cases)

Files: robot/m1_e2e_verification.robot, robot/helper_m1_e2e_verification.py
Related: Issue #402, PR #431

| Test case | What it mocks | CLI method | Lines |
|-----------|---------------|------------|-------|
| action_create_from_yaml | action._get_lifecycle_service | CliRunner | 193–208 |
| resource_register_git_checkout | resource._get_registry_service | CliRunner | 231–255 |
| project_create_and_link | project._get_namespaced_project_repo, project._get_resource_link_repo, project._get_resource_registry_service | CliRunner | 284–316 |
| plan_full_lifecycle | action._get_lifecycle_service, plan._get_lifecycle_service, plan._get_apply_service | CliRunner | 370–430 |
| sqlite_persistence_check | none (domain-model round-trip only) | No CLI | 438–497 |
| changeset_from_tool_invocations | none (in-memory store only) | No CLI | 505–545 |
| sandbox_isolation_check | none (direct sandbox API) | No CLI | 553–606 |
| post_apply_commit_check | none (direct sandbox API) | No CLI | 614–660 |

Summary: 4/8 tests mock service layer + use CliRunner. 2/8 are pure domain-model tests. 2/8 use real sandbox API (no mocks, but no CLI either).


M2 — m2_e2e_verification (9 test cases)

Files: robot/m2_e2e_verification.robot, robot/helper_m2_e2e_verification.py

| Test case | What it mocks | CLI method | Lines |
|-----------|---------------|------------|-------|
| actor_yaml_create_load | (none) | No CLI (direct ActorLoader) | 147–187 |
| actor_add_config_cli | actor._get_services | CliRunner | 195–244 |
| action_create | action._get_lifecycle_service | CliRunner | 252–284 |
| plan_use_execute | plan._get_lifecycle_service (x2) | CliRunner | 292–341 |
| actor_yaml_parse_validate | (none) | No CLI (direct ActorConfigSchema) | 349–402 |
| actor_compile_stategraph | (none) | No CLI (direct compile_actor) | 410–443 |
| tool_router_resolve_external | (none) | No CLI (direct ToolCallRouter) | 451–495 |
| validation_runner_execute | (none) | No CLI (direct ToolRunner) | 503–567 |
| changeset_multifile | (none) | No CLI (direct ChangeSetCapture) | 575–635 |

Summary: 3/9 tests mock service layer + use CliRunner. 6/9 are pure library-level tests with no mocking and no CLI.


M3 — m3_e2e_verification (10 test cases)

Files: robot/m3_e2e_verification.robot, robot/helper_m3_e2e_verification.py

| Test case | What it mocks | CLI method | Lines |
|-----------|---------------|------------|-------|
| plan_generates_decisions | plan._get_lifecycle_service (x2) | CliRunner | 210–288 |
| decision_tree_view | container.get_container + spy on DecisionService | CliRunner | 296–345 |
| decision_explain | container.get_container + spy on DecisionService | CliRunner | 353–421 |
| invariant_add_and_list | invariant._get_service (x2) | CliRunner | 429–491 |
| correction_dry_run | CorrectionService (class), container.get_container | CliRunner + direct | 499–604 |
| correction_live_revert | CorrectionService (class), container.get_container | CliRunner + direct | 612–722 |
| decisions_context_snapshot | (none) | No CLI (direct service) | 730–757 |
| decision_tree_persistence | container.get_container | CliRunner + direct | 765–816 |
| correction_revert_reexecutes | (none) | No CLI (direct service) | 824–873 |
| invariants_enforced_during_strategize | (none) | No CLI (direct service) | 881–942 |

Summary: 7/10 tests mock service layer or DI container + use CliRunner. 3/10 are pure service-level tests with no mocking or CLI. Uses create_autospec(Settings) throughout, and in-memory SQLite for persistence tests.


M4 — m4_e2e_verification (10 test cases)

Files: robot/m4_e2e_verification.robot, robot/helper_m4_e2e_verification.py

| Test case | What it mocks | CLI method | Lines |
|-----------|---------------|------------|-------|
| spawn_subplans | (none) | No CLI (domain model) | 190–237 |
| plan_tree | (none) | No CLI (domain model) | 245–303 |
| plan_diff | plan._get_apply_service | CliRunner | 311–343 |
| parallel_max | (none) | No CLI (domain model) | 351–459 |
| merge_clean | (none) | No CLI (domain model) | 467–508 |
| merge_conflict | (none) | No CLI (domain model) | 516–569 |
| parent_tracking | (none) | No CLI (domain model) | 577–707 |
| cli_plan_use | plan._get_lifecycle_service | CliRunner | 715–768 |
| cli_plan_execute | plan._get_lifecycle_service | CliRunner | 776–824 |
| cli_plan_tree | container.get_container | CliRunner | 839–949 |

Summary: 4/10 tests mock service layer + use CliRunner. 6/10 are pure domain-model tests with no mocking or CLI.


M5 — m5_e2e_verification (9 test cases)

Files: robot/m5_e2e_verification.robot, robot/helper_m5_e2e_verification.py
Related: PR #456

| Test case | What it mocks | CLI method | Lines |
|-----------|---------------|------------|-------|
| project_create_large | (none) | No CLI (direct repo) | 130–195 |
| resource_register_link | (none) | No CLI (direct service) | 203–246 |
| indexing_complete | (none) | No CLI (direct service) | 254–288 |
| context_tier_config | (none) | No CLI (domain model) | 296–349 |
| context_policy_set_show | (none) | No CLI (direct SQL) | 357–417 |
| acms_phase_inheritance | (none) | No CLI (domain model) | 425–484 |
| acms_scoped_context | (none) | No CLI (domain model + SQL) | 492–573 |
| context_policy_clear | (none) | No CLI (domain model + SQL) | 581–637 |
| context_view_validation | (none) | No CLI (domain model) | 645–688 |

Summary: 0/9 tests use mocks or CliRunner. All 9 exercise real application code against in-memory SQLite. This is the only suite with zero mocking. However, none of the 9 tests invoke the CLI at all — they test services and domain models directly, not end-to-end user workflows.


M6 — m6_e2e_verification (10 test cases)

Files: robot/m6_e2e_verification.robot, robot/helper_m6_e2e_verification.py

| Test case | What it mocks | CLI method | Lines |
|-----------|---------------|------------|-------|
| action_create_porting | action._get_lifecycle_service | CliRunner | 151–202 |
| plan_use_execute | plan._get_lifecycle_service | CliRunner | 210–267 |
| hierarchical_decomposition | (none) | No CLI (domain model) | 275–363 |
| correction_affected_subtree | (none) | No CLI (domain model) | 371–471 |
| parallel_execution_scale | (none) | No CLI (domain model) | 479–538 |
| porting_task_autonomous | (none) | No CLI (domain model) | 546–620 |
| plan_apply_lifecycle | plan._get_lifecycle_service | CliRunner | 628–661 |
| failure_handler_logic | (none) | No CLI (domain model) | 669–737 |
| subplan_config_modes | (none) | No CLI (domain model) | 745–782 |
| decision_tree_porting | (none) | No CLI (domain model) | 790–853 |

Summary: 3/10 tests mock service layer + use CliRunner. 7/10 are pure domain-model tests with no mocking or CLI.


Aggregate Statistics

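| Suite | Test cases | Mocked + CliRunner | Pure domain/service (no CLI) | Real subprocess CLI |
|-------|------------|--------------------|------------------------------|---------------------|
| M1 | 8 | 4 | 4 | 0 |
| M2 | 9 | 3 | 6 | 0 |
| M3 | 10 | 7 | 3 | 0 |
| M4 | 10 | 4 | 6 | 0 |
| M5 | 9 | 0 | 9 | 0 |
| M6 | 10 | 3 | 7 | 0 |
| Total | 56 | 21 | 35 | 0 |
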
  • 21 of 56 tests (38%) mock the service layer and use Typer's CliRunner — these cannot detect DI wiring, database, or process-level integration failures.
  • 35 of 56 tests (62%) are pure domain-model or service-level tests that never invoke the CLI at all — these are legitimate unit/integration tests but should not be labeled "E2E verification."
  • 0 of 56 tests invoke the agents CLI binary via subprocess.run or equivalent — meaning no test in any E2E suite verifies the actual user-facing command.
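
The totals above can be re-derived from the per-suite summaries. The snippet below is only an arithmetic cross-check; the tuples are copied from the audit tables and nothing else is assumed.

```python
# (test cases, mocked + CliRunner, pure domain/service) per suite,
# copied from the per-milestone summaries in this issue.
counts = {
    "M1": (8, 4, 4),
    "M2": (9, 3, 6),
    "M3": (10, 7, 3),
    "M4": (10, 4, 6),
    "M5": (9, 0, 9),
    "M6": (10, 3, 7),
}

total_cases = sum(cases for cases, _, _ in counts.values())
total_mocked = sum(mocked for _, mocked, _ in counts.values())
total_pure = sum(pure for _, _, pure in counts.values())

# Every test case falls into exactly one of the two categories.
rows_consistent = all(cases == mocked + pure for cases, mocked, pure in counts.values())
```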

Unique patch() targets across all 6 suites

| Patched path | Used in |
|--------------|---------|
| cleveragents.cli.commands.action._get_lifecycle_service | M1, M2, M6 |
| cleveragents.cli.commands.plan._get_lifecycle_service | M1, M2, M3, M4, M6 |
| cleveragents.cli.commands.plan._get_apply_service | M1, M4 |
| cleveragents.cli.commands.resource._get_registry_service | M1 |
| cleveragents.cli.commands.project._get_namespaced_project_repo | M1 |
| cleveragents.cli.commands.project._get_resource_link_repo | M1 |
| cleveragents.cli.commands.project._get_resource_registry_service | M1 |
| cleveragents.cli.commands.actor._get_services | M2 |
| cleveragents.cli.commands.invariant._get_service | M3 |
| cleveragents.application.container.get_container | M3, M4 |
| cleveragents.application.services.correction_service.CorrectionService | M3 |

Root Cause

Three design choices recur across the suites (M5 is the exception for the first two, since it uses neither mocks nor CliRunner) and prevent the tests from catching real-world failures:

1. Service layer is mocked (MagicMock)

The DI container, database connections, repository implementations, and service logic are replaced with MagicMock objects in every CLI-facing test. This means:

  • DI wiring bugs (e.g., the container.db() AttributeError in issues #554 and #570) are invisible.
  • Database schema/migration issues are invisible.
  • Any service-layer validation or business logic is bypassed.
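
A minimal illustration of why a bare MagicMock hides this class of bug. The `Container` class and `call_db` function are hypothetical stand-ins invented for this sketch, not the project's real container or command code; only the `unittest.mock` behaviour is standard.

```python
from unittest.mock import MagicMock, create_autospec


class Container:
    """Hypothetical DI container that (by mistake) exposes no ``db`` provider."""

    def settings(self) -> dict:
        return {"database_url": "sqlite:///:memory:"}


def call_db(container):
    """Mimics command code that assumes the container has a ``db()`` provider."""
    return container.db()


def raises_attribute_error(container) -> bool:
    try:
        call_db(container)
    except AttributeError:
        return True
    return False


# A bare MagicMock fabricates any attribute on demand, so the bug is masked.
mock_hides_bug = not raises_attribute_error(MagicMock())

# The real object fails the same way a user would see it fail.
real_object_exposes_bug = raises_attribute_error(Container())

# An autospec'd mock is constrained to the real interface and also fails.
autospec_exposes_bug = raises_attribute_error(create_autospec(Container, instance=True))
```

Autospeccing narrows the gap, but it still replaces service behaviour, so it does not substitute for running against the real container.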

2. CliRunner used instead of subprocess

Typer's CliRunner invokes command functions in-process. This does not test:

  • The actual agents CLI entry point and argument parsing by the OS shell.
  • Environment variable propagation.
  • Process exit codes as observed by callers.
  • Rich console output routing (a known issue — see CoreRasurae's review F1 on PR #595).
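
As a rough sketch of the subprocess alternative, a helper along these lines would exercise the entry point, the environment, and the exit code as an external caller sees them. The name `run_cli` and its parameters are assumptions for illustration; the demo runs the Python interpreter as a stand-in child process, where a real suite would pass the `agents` command line instead.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional, Sequence


def run_cli(
    argv: Sequence[str],
    extra_env: Optional[Mapping[str, str]] = None,
    timeout: float = 60.0,
) -> subprocess.CompletedProcess:
    """Run a command in a child process and capture stdout, stderr, and exit code."""
    env = os.environ.copy()
    env.update(extra_env or {})  # variables must cross a real process boundary
    return subprocess.run(
        list(argv),
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # the test asserts on returncode itself
    )


# Stand-in child: echoes an environment variable and exits with status 3.
result = run_cli(
    [sys.executable, "-c", "import os, sys; print(os.environ['DEMO_FLAG']); sys.exit(3)"],
    extra_env={"DEMO_FLAG": "on"},
)
```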

3. Test fixtures created programmatically

YAML config files are created via tempfile.mkstemp() / tempfile.NamedTemporaryFile() rather than placed at the paths documented in the milestone verification commands. This means file path resolution, permissions, and the documented user workflow are never tested.
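
One possible shape for on-disk fixtures, as a sketch only: the helper name `write_fixture`, the relative path `examples/actions/demo.yaml`, and the YAML content are all hypothetical, and the temporary directory merely stands in for the working directory a user would run the documented commands from.

```python
import tempfile
from pathlib import Path


def write_fixture(root: Path, relative_path: str, text: str) -> Path:
    """Write a fixture file at a fixed path relative to ``root``."""
    target = root / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


# Stand-in for the working directory of the documented verification commands.
workspace = Path(tempfile.mkdtemp())
fixture = write_fixture(workspace, "examples/actions/demo.yaml", "name: demo\n")
```

The point of fixing the relative path is that the CLI then resolves the same path a user would type, rather than a random temp-file name.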


Impact

  • False confidence: All 56 tests pass, suggesting M1–M6 success criteria are verified end-to-end. In reality, only CLI function signatures and mock-compatible call patterns are verified for the 21 CLI-facing tests, and the remaining 35 tests are domain/service-level tests that don't touch the CLI.
  • Known bugs not caught: The DI container wiring bugs (#554, #570) would have been caught by any E2E test that ran agents action create or agents session create without mocking the service layer.
  • Acceptance criteria not met: For example, issue #402 requires: "Assertions verify Plan and Action records persist to SQLite" — the tests verify this using in-memory domain objects, not by querying an actual SQLite database after a real CLI invocation.

Recommendation

  1. Replace CliRunner with subprocess.run in the 21 CLI-facing tests to invoke the actual agents binary.
  2. Remove service-layer mocks and use a real file-based SQLite database for the CLI tests.
  3. Create YAML fixtures as files on disk at the paths documented in the milestone verification commands, then invoke the CLI against those paths.
  4. Assert against real database state by querying the SQLite file after CLI commands complete.
  5. Relabel the 35 non-CLI tests as what they are: domain-model unit tests and service-level integration tests. They are valuable, but they are not E2E verification. M5 in particular has zero CLI invocations.
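
Recommendation 4 could look roughly like the following. This is a sketch under assumptions: `count_rows` is a hypothetical helper, and the `plans` table and its columns are invented for the demo; the real schema is not shown in this issue. The setup code plays the role of the CLI writing to the database file.

```python
import sqlite3
import tempfile
from pathlib import Path


def count_rows(db_path: Path, table: str) -> int:
    """Count rows in ``table`` of an on-disk SQLite database."""
    if not db_path.is_file():
        raise FileNotFoundError(db_path)
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    conn = sqlite3.connect(str(db_path))
    try:
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        return count
    finally:
        conn.close()


# Demo setup: stands in for a CLI command that persisted one record.
db_path = Path(tempfile.mkdtemp()) / "demo.db"
writer = sqlite3.connect(str(db_path))
writer.execute("CREATE TABLE plans (id INTEGER PRIMARY KEY, name TEXT)")
writer.execute("INSERT INTO plans (name) VALUES ('demo')")
writer.commit()
writer.close()

plan_count = count_rows(db_path, "plans")
```

Opening a separate connection after the command finishes is what distinguishes this from asserting on in-memory domain objects.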

References

  • Issue #402 — M1 E2E verification requirement / PR #431
  • PR #456 — M5 E2E verification
  • Issues #559, #560 — M3 and M4 E2E verification
  • robot/m{1..6}_e2e_verification.robot — Robot test suites
  • robot/helper_m{1..6}_e2e_verification.py — Python test helpers
  • Issues #554, #570 — DI container wiring bugs that these suites should have caught
brent.edwards changed title from bug(test): M1 E2E verification suite uses mocks instead of exercising the real system to bug(test): M1–M6 E2E verification suites use mocks instead of exercising the real system 2026-03-10 00:42:02 +00:00
freemo added this to the v3.5.0 milestone 2026-03-11 05:34:32 +00:00
Owner

PM Triage (Day 31):

  • Labels: Type/Bug, Priority/Critical, MoSCoW/Must have, State/Unverified, Points/13
  • Milestone: v3.5.0
  • Assignee: @brent.edwards (author, QA specialist)

Assessment: This is a legitimate test infrastructure bug. The E2E verification suites for M1-M6 use mocks instead of exercising the real system, which undermines confidence in milestone acceptance gates. Per CONTRIBUTING.md, bugs receive Priority/Critical.

TDD workflow: Since this is a testing/infrastructure bug (the bug IS in the tests themselves), a TDD counterpart would be circular. No TDD issue will be created — the fix itself will replace mocks with real system calls.

Priority: Important but not blocking current sprint. Assigned to v3.5.0. @brent.edwards — please assess scope and provide an implementation plan.

Owner

PM Escalation — Day 31 (2026-03-11)

This bug is past due. Milestone v3.5.0 due date was 2026-03-10 (yesterday). Neither this bug nor its TDD counterpart #697 has started.

Scope: 13-point issue spanning 56 test cases across 6 milestone E2E suites. This is the largest open bug by effort.

Dependency chain:

  • #697 (TDD counterpart) → State/Unverified, not started
  • #697 depends on #684 (integration test tagging system) — status unknown
  • This bug depends on #697 TDD merge per CONTRIBUTING.md workflow

Concerns:

  1. @brent.edwards is currently assigned to 6 open items (#554, #570, #658, #680, #683, #697). With PR #680 addressing #554/#570, focus should shift to #658.
  2. The TDD counterpart #697 needs to be verified and started before any bug fix work.
  3. No dev acknowledgment has been posted on this issue since it was triaged.

Action required:

  • @brent.edwards — please acknowledge this issue and provide an ETA for starting TDD #697.
  • If #684 is a blocker, please flag it immediately so we can unblock.
  • Deadline: TDD #697 PR must be submitted by Day 35 (2026-03-15).
Owner

Cross-reference Note

This issue overlaps partially with #698 (mock LLM removal from Robot tests). FakeListLLM usage in plan_generation_graph.robot and context_analysis_agent.robot is scoped to both issues. Please coordinate with #698 work (assigned @freemo) to avoid conflicting changes.

Additionally, #699 (remove unittest.mock from Robot tests) covers the MagicMock/patch usage that this issue also targets in the E2E helpers.

Owner

PM Status — Day 32: TDD Workflow Tracking

TDD Pipeline Status:

| Stage | Item | Status |
|-------|------|--------|
| 1. TDD Issue | #697 | Open, assigned to @brent.edwards |
| 2. TDD PR | #738 | Open, BLOCKED: REQUEST_CHANGES from PM review + awaiting peer review |
| 3. TDD merge | (none) | Waiting on PR #738 |
| 4. Bug fix PR | (none) | Cannot start until TDD tests are merged |

Blocker Detail: PR #738 has our REQUEST_CHANGES review with 3 required fixes:

  1. Empty PR body — needs a summary of changes
  2. Missing Closes #697 keyword (note: should close TDD issue #697, not bug #658)
  3. Missing MoSCoW/Must have label

Additionally, @CoreRasurae has been assigned as peer reviewer but has not yet submitted a review.

Action needed:

  • @brent.edwards — Address the 3 required changes from the PM review on PR #738
  • @CoreRasurae — Please submit a code review on PR #738

Dependency note: TDD issue #697 depends on #684 (integration test tagging system). Verify #684 is complete before merging.

Scope: This is the largest open bug — replacing mock-based E2E tests with real subprocess invocations across 6 milestone suites (56 test cases). The fix will be substantial work. The TDD tests correctly use AST analysis to programmatically verify mock usage, which is a sound approach.
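
For readers unfamiliar with the approach, AST-based detection of mock usage can be sketched as below. This is not the code from PR #738, which is not reproduced in this thread; `find_mock_imports` and `MOCK_MODULES` are names made up for the illustration, and the sketch only looks at import statements.

```python
import ast

# Module names treated as "mocking libraries" for this sketch.
MOCK_MODULES = {"unittest.mock", "mock"}


def find_mock_imports(source: str) -> list[str]:
    """Return the mock-related names imported by a piece of Python source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # e.g. ``import unittest.mock``
            found.extend(a.name for a in node.names if a.name in MOCK_MODULES)
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            for alias in node.names:
                qualified = f"{module}.{alias.name}"
                # e.g. ``from unittest.mock import patch`` or ``from unittest import mock``
                if module in MOCK_MODULES or qualified in MOCK_MODULES:
                    found.append(qualified)
    return sorted(found)


sample = "import subprocess\nfrom unittest.mock import MagicMock, patch\n"
hits = find_mock_imports(sample)
clean = find_mock_imports("import subprocess\n")
```

A test built on this would read each helper file and fail while `hits` is non-empty, which is what makes it a useful failing-first check for the mock removal.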

Owner

PM Acknowledgment — Day 32

Confirmed PR #738 (TDD tests) and PR #784 (bug fix) received.

Current status:

  • PR #738 (TDD tests): REQUEST_CHANGES — 3 items to address (empty body, missing Closes #697, missing MoSCoW label)
  • PR #784 (bug fix): COMMENT review — code quality is excellent, APPROVED in substance. Blocked until PR #738 merges first per TDD workflow.

Merge order: PR #738 → PR #784

@brent.edwards: Please address the 3 requested changes on PR #738 so both PRs can proceed to merge.

Owner

PM Note (Day 33): Closing — PR #784 merged 2026-03-12T22:13 converting all M1-M6 E2E suites to real subprocess CLI invocations. TDD counterpart PR #738 also merged. Issue was not auto-closed by Forgejo (commit keyword format mismatch). Bug is resolved.

Owner

PM Note (Day 33 update): Cannot close — Forgejo reports open dependencies. PR #784 merged 2026-03-12T22:13 and TDD PR #738 merged 2026-03-12T22:02. Bug is functionally resolved. @freemo — please close this issue manually (admin dependency override needed, same situation as #698).

Owner

PM status update (Day 34): Fix PR #784 was merged on 2026-03-12. TDD PR #738 was also merged. This bug is functionally resolved. However, Forgejo blocks closure via API due to unresolved dependency metadata.

@freemo — please close this issue manually using admin dependency override. The fix is verified and merged.

Owner

PM Note — Day 34: Admin Closure Requested

This bug is functionally resolved — PR #784 (the mock-removal fix) was merged, and the TDD counterpart #697 is already closed.

However, Forgejo did not auto-close this issue because PR #784's Closes reference may not have matched (or the merge method didn't trigger auto-close). The fix is verified and merged on master.

@freemo — Please manually close this issue as resolved. The fix has been on master since Day 32.


PM status — Day 34 housekeeping

Owner

Decided won't do; we went with the E2E system instead to represent true real-world tests.

freemo 2026-03-14 23:00:24 +00:00
freemo added reference master 2026-03-14 23:00:44 +00:00
Reference
cleveragents/cleveragents-core#658