feat(wf03): add plan prompt test coverage and confidence-threshold pausing verification #1086

Merged
brent.edwards merged 1 commit from test/wf03-plan-prompt-confidence into master 2026-03-27 21:16:54 +00:00
Member

Summary

Adds WF03 coverage for plan prompt and confidence-threshold pausing behavior.

This PR includes:

  1. Test additions (Behave + Robot) for prompt-guidance and pause/resume flows
  2. A small production adjustment in A2aLocalFacade._handle_plan_prompt so the stub response echoes guidance (required by the new WF03 expectations)

Scope updates from review

  • Reverted unrelated src/cleveragents/tool/wrapping.py suppression-marker changes from this PR.
  • Kept only WF03-related test files and the single facade.py behavior change needed for those tests.

Verification

  • nox -s lint ✅ (saved: /tmp/nox-lint-1086.log)
  • nox -s unit_tests -- features/wf03_plan_prompt_confidence.feature ✅ (saved: /tmp/nox-unit-1086.log)

Issue

Closes #961

brent.edwards added this to the v3.4.0 milestone 2026-03-20 23:12:09 +00:00
brent.edwards force-pushed test/wf03-plan-prompt-confidence from 85b62924fb
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 27s
CI / typecheck (pull_request) Successful in 1m12s
CI / security (pull_request) Successful in 52s
CI / quality (pull_request) Successful in 35s
CI / build (pull_request) Successful in 31s
CI / e2e_tests (pull_request) Successful in 5m9s
CI / unit_tests (pull_request) Successful in 6m11s
CI / integration_tests (pull_request) Successful in 6m8s
CI / docker (pull_request) Successful in 1m6s
CI / coverage (pull_request) Successful in 10m32s
CI / benchmark-regression (pull_request) Successful in 36m44s
to 457023cf4c
Some checks failed
CI / status-check (pull_request) Blocked by required conditions
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 17s
CI / lint (pull_request) Successful in 3m38s
CI / integration_tests (pull_request) Successful in 3m41s
CI / security (pull_request) Successful in 4m2s
CI / unit_tests (pull_request) Failing after 4m26s
CI / typecheck (pull_request) Successful in 4m48s
CI / docker (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Has started running
CI / e2e_tests (pull_request) Successful in 9m27s
CI / quality (pull_request) Failing after 13m52s
CI / coverage (pull_request) Failing after 24m2s
2026-03-21 01:48:27 +00:00
brent.edwards force-pushed test/wf03-plan-prompt-confidence from 457023cf4c
to f00e66a484
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 21s
CI / lint (pull_request) Successful in 3m19s
CI / typecheck (pull_request) Successful in 3m55s
CI / security (pull_request) Successful in 4m1s
CI / quality (pull_request) Successful in 4m10s
CI / integration_tests (pull_request) Successful in 7m13s
CI / unit_tests (pull_request) Successful in 7m27s
CI / docker (pull_request) Successful in 1m14s
CI / e2e_tests (pull_request) Successful in 9m39s
CI / coverage (pull_request) Successful in 11m11s
CI / status-check (pull_request) Successful in 2s
CI / benchmark-regression (pull_request) Successful in 36m33s
2026-03-21 02:36:02 +00:00
Author
Member

Fixed 1 errored scenario: features/wf03_plan_prompt_confidence.feature:26 — "Plan prompt via facade with empty guidance".

Root cause: Behave's parse matcher does not match an empty string for a "{param}" placeholder. The feature file used the step wordings 'with guidance ""' and 'should echo the guidance ""', so the step definitions @when('... with guidance "{guidance}"') and @then('... echo the guidance "{guidance}"') never matched, resulting in 2 undefined steps.

Fix: Added explicit step definitions for the empty-guidance case:

  • @when('I dispatch plan prompt for plan "{plan_id}" with empty guidance')
  • @then("the plan prompt response should echo empty guidance")

Updated the feature file scenario to use these new step wordings.

(The quality job log shows a server death during setup; coverage passed at 98.5%.)
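
The matching failure can be reproduced without Behave. The sketch below assumes that an untyped `{name}` field in a parse-style pattern must match at least one character, modelled here as the non-greedy group `(.+?)`. `step_pattern_to_regex` is a hypothetical stand-in written for illustration, not Behave's actual matcher.

```python
import re


def step_pattern_to_regex(pattern: str) -> re.Pattern:
    """Hypothetical stand-in for a parse-style step matcher.

    Assumption: each untyped ``{name}`` field must match at least one
    character, which is modelled as the non-greedy group ``(.+?)``.
    """
    # Splitting on a capturing group puts field names at the odd indices.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)        # literal step text
        else:
            regex += f"(?P<{part}>.+?)"     # named field, one or more chars
    return re.compile(regex)


quoted = step_pattern_to_regex('with guidance "{guidance}"')
explicit = step_pattern_to_regex("with empty guidance")

# A non-empty value matches the quoted-placeholder pattern.
assert quoted.fullmatch('with guidance "retry step 2"') is not None
# An empty pair of quotes does not, so the step would be reported as undefined.
assert quoted.fullmatch('with guidance ""') is None
# A dedicated wording without a placeholder matches as plain text.
assert explicit.fullmatch("with empty guidance") is not None
```

Under that assumption, a separate step wording for the empty case avoids the problem without changing the matcher.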

Author
Member

Self-QA Review Summary (2 cycles)

Final verdict: PASS — No critical or major issues remain.

Cycle 1 → REQUEST_CHANGES (0C / 4M / 6m / 5n)

Majors found and fixed:

  • Plan prompt facade tests were tautological (stub ignored guidance) → stub updated to echo guidance, assertions strengthened
  • Pause-and-resume scenario lacked causal link; Behave missing re-evaluation step → added re-evaluation, linked plan_id to paused decision
  • PR description empty → filled with summary, Closes #961, dependency notes
  • Missing changelog entry → added under ## Unreleased

Minors fixed: TODO(#885) comments for untested spec fields, negative test for empty guidance, Background removed (controller setup moved inline), threshold value assertion added, ruff noqa narrowed, spec section references corrected, Behave tags added.

Cycle 2 → APPROVED (0C / 0M / 11m / 7n)

All remaining findings were non-blocking minor style observations and nits. Code quality is high with full type annotations, correct threshold boundary math (0.55 → pause, 0.6 → proceed), proper step reuse, and all quality gates passing.
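
The boundary behavior mentioned above (0.55 pauses, 0.6 proceeds) implies an inclusive comparison against the threshold. A minimal sketch follows, assuming a threshold of 0.6 as quoted in the review; `Decision` and `decide` are hypothetical names and do not come from the project's AutonomyController.

```python
from enum import Enum


class Decision(Enum):
    PAUSE = "pause"
    PROCEED = "proceed"


# Assumed threshold, taken from the boundary values quoted in the review.
CONFIDENCE_THRESHOLD = 0.6


def decide(confidence: float, threshold: float = CONFIDENCE_THRESHOLD) -> Decision:
    """Proceed when confidence meets or exceeds the threshold, else pause."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    return Decision.PROCEED if confidence >= threshold else Decision.PAUSE


assert decide(0.55) is Decision.PAUSE     # just below the threshold
assert decide(0.6) is Decision.PROCEED    # exactly at the threshold
```

Testing the exact boundary value is what distinguishes `>=` from `>`; a test at 0.7 alone would pass with either operator.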

Quality Gates

| Gate | Result |
|------|--------|
| lint | ✅ |
| typecheck | ✅ |
| unit_tests | ✅ (12,083 scenarios) |
| integration_tests | ✅ (1,620 tests) |
| e2e_tests | ✅ (37) |
| coverage | 99% |
freemo requested changes 2026-03-23 02:45:44 +00:00
Dismissed
freemo left a comment

Review: PR #1086 — test(wf03): plan prompt test and confidence-threshold pausing verification

Overall Assessment: REQUEST CHANGES

The test content is comprehensive and well-structured, but this PR contains production code changes that should be separated.

Checklist

| Criterion | Status | Notes |
|-----------|--------|-------|
| File organization | PASS | Feature in features/, steps in features/steps/, Robot in robot/ |
| Step file naming | PASS | wf03_plan_prompt_confidence.feature -> wf03_plan_prompt_confidence_steps.py |
| No production code changes | FAIL | Two source files modified (see below) |
| Issue references | PASS | Closes #961 |

Issues

1. Production code change in src/cleveragents/a2a/facade.py (requires action)

This PR modifies _handle_plan_prompt to echo the guidance field back in the response:

# Before:
return {"plan_id": plan_id, "status": "guidance_injected", "stub": True}

# After:
guidance = params.get("guidance", "")
return {"plan_id": plan_id, "guidance": guidance, "status": "guidance_injected", "stub": True}

This is a behavioral change to production code, and the tests rely on it to verify guidance propagation. For a test-only PR, one of the following is needed:

  • (a) Split the change into a separate PR that lands first (a small enhancement PR to add guidance echoing to the stub), with this test PR rebased on top, or
  • (b) Update the PR title/description to acknowledge that this is not purely a test PR (e.g., feat(a2a): echo guidance in plan prompt stub + add WF03 tests), and make the milestone/labels reflect the dual nature.

If the project owner is fine with the bundled approach, option (b) with an updated title is the minimum.
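
To illustrate why the tests depend on this change, here is a standalone sketch of the stub's new behavior. `handle_plan_prompt_stub` is a hypothetical free function; reading `plan_id` from `params` is an assumption, since the diff does not show how the real `_handle_plan_prompt` obtains it.

```python
from typing import Any


def handle_plan_prompt_stub(params: dict[str, Any]) -> dict[str, Any]:
    """Standalone sketch of the stub after the change (not the real facade)."""
    plan_id = params["plan_id"]              # assumption: plan_id arrives in params
    guidance = params.get("guidance", "")    # default to empty guidance
    return {
        "plan_id": plan_id,
        "guidance": guidance,
        "status": "guidance_injected",
        "stub": True,
    }


response = handle_plan_prompt_stub(
    {"plan_id": "plan-1", "guidance": "prefer small steps"}
)
# This assertion cannot hold against the old response, which had no "guidance" key.
assert response["guidance"] == "prefer small steps"
# Omitting guidance falls back to the empty string.
assert handle_plan_prompt_stub({"plan_id": "plan-1"})["guidance"] == ""
```

An assertion on only `status` or `stub` would pass regardless of the guidance supplied, which is why echoing the value makes the test meaningful.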

2. Unrelated change in src/cleveragents/tool/wrapping.py (requires action)

This PR adds nosemgrep and nosec suppression markers to two lines in wrapping.py:

# Line 222: added  # fmt: skip  # noqa: E501  # nosemgrep: no-compile-exec
# Line 241: added  # nosec B102  # nosemgrep: no-exec

These are security scanner suppression markers on pre-existing compile() and exec() calls. This change is completely unrelated to WF03 plan prompt tests and should be in its own PR (e.g., chore: add security scan suppressions for intentional sandbox exec). Drive-by fixes in test PRs make the change history harder to reason about.
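
For readers unfamiliar with these markers, the sketch below shows how inline suppressions are typically attached to intentional `compile()` and `exec()` calls. `TransformRunner` is a hypothetical class and not the project's `wrapping.py` code; the Semgrep rule IDs are copied from the comment above and are assumed to be project-specific.

```python
class TransformRunner:
    """Hypothetical sketch; not the project's wrapping.py implementation."""

    def __init__(self, transform_code: str, tool_name: str) -> None:
        self._tool_name = tool_name
        # Compile once so syntax errors surface at construction time.
        self._compiled = compile(transform_code, f"<transform:{tool_name}>", "exec")  # noqa: E501  # nosemgrep: no-compile-exec

    def run(self, value: object) -> object:
        # The namespace holds the input and receives whatever the code assigns.
        namespace: dict[str, object] = {"value": value}
        exec(self._compiled, namespace)  # nosec B102  # nosemgrep: no-exec
        return namespace.get("result")


runner = TransformRunner("result = value * 2", "doubler")
assert runner.run(21) == 42
```

The markers only silence the scanners on those lines; they do not make the `exec()` call any safer, which is one reason such changes are easier to review in a dedicated PR.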

Test Quality

The tests themselves are excellent:

  • Behave layer (8 scenarios): Good coverage of plan prompt dispatch (happy path, minimal input, empty guidance) and confidence threshold pausing (below threshold, above threshold, boundary values at 0.55 and 0.6, full pause-and-resume flow)
  • Robot Framework layer (4 integration tests): Proper integration-level coverage matching the Behave scenarios
  • Boundary testing: Testing both sides of the 0.6 threshold (0.55 -> pause, 0.6 -> proceed) is good practice
  • Documentation: TODO(#885) comments honestly document what cannot yet be tested
  • Pause-and-resume flow: Well-structured multi-step scenario with honest NOTE about causal limitations
  • Robot helper: Clean dispatcher pattern with _factors_for_confidence() utility factored out
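
The pause-and-resume flow described in the list above can be sketched as follows. `PlanGate` is a hypothetical class used only for illustration; the 0.6 threshold and the 0.55/0.6 confidence values are the ones quoted in this review, and the rest is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class PlanGate:
    """Hypothetical gate that pauses plans whose confidence is too low."""

    threshold: float = 0.6
    paused: dict[str, float] = field(default_factory=dict)

    def evaluate(self, plan_id: str, confidence: float) -> str:
        if confidence >= self.threshold:
            # A passing re-evaluation clears any earlier pause for this plan.
            self.paused.pop(plan_id, None)
            return "proceed"
        # Record which plan was paused so a later resume can be linked to it.
        self.paused[plan_id] = confidence
        return "pause"


gate = PlanGate()
assert gate.evaluate("plan-1", 0.55) == "pause"
assert "plan-1" in gate.paused            # the pause is tied to the plan_id
# After guidance is supplied, the plan is re-evaluated with a new confidence.
assert gate.evaluate("plan-1", 0.6) == "proceed"
assert "plan-1" not in gate.paused
```

Keying the paused state by `plan_id` is what gives the scenario a causal link: the resume step acts on the same plan that was paused earlier.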

Summary

The test content is strong and comprehensive across both Behave and Robot layers. The two issues are:

  1. Facade change in a2a/facade.py — either split out or re-frame the PR as not test-only
  2. Unrelated change in tool/wrapping.py — should be its own PR

Resolve the production code concerns and this is ready to approve.

@@ -545,0 +544,4 @@
+guidance = params.get("guidance", "")
+return {
+    "plan_id": plan_id,
+    "guidance": guidance,
Owner

This is a behavioral change to production code (echoing guidance back in the stub response). The tests in this PR depend on this change. For a test-only PR, this should either be split into a separate prerequisite PR or the PR should be re-titled to reflect that it includes a production change.

@@ -220,3 +220,3 @@
 self._code = transform_code
 self._tool_name = tool_name
-self._compiled = compile(self._code, f"<transform:{tool_name}>", "exec")
+self._compiled = compile(self._code, f"<transform:{tool_name}>", "exec")  # fmt: skip  # noqa: E501  # nosemgrep: no-compile-exec
Owner

These nosemgrep/nosec suppression markers are unrelated to WF03 plan prompt tests. This should be in its own PR to keep the change history clean (e.g., chore: add security scan suppressions for intentional sandbox exec).

freemo requested changes 2026-03-23 02:47:45 +00:00
Dismissed
freemo left a comment

Review: REQUEST CHANGES

Issues Found:

  1. Production code change in a test-only PR — This PR modifies src/cleveragents/a2a/facade.py to change _handle_plan_prompt behavior (echoing guidance back in the response). This is a behavioral change to production code that the tests depend on. Per the project's commit scope guidelines, a production code change and its tests should be in the same commit, but the PR title (test: prefix) and Type/Testing label indicate this should be test-only. Either:

    • Split the facade.py change into a prerequisite PR with its own issue, or
    • Update the PR title/label to reflect that this is not purely a testing PR (e.g., feat(a2a): echo guidance in plan prompt response with the tests included)
  2. Unrelated change bundled in — The PR includes modifications to tool/wrapping.py adding nosemgrep/nosec suppression markers. This is completely unrelated to WF03 plan-prompt testing and should be its own separate commit/PR. Per CONTRIBUTING.md: "Never bundle cosmetic changes with functional changes in the same commit."

What's done well:

  • Test content is comprehensive: 8 Behave BDD scenarios + 4 Robot Framework integration tests
  • File organization follows guidelines
  • PR has issue reference
  • Scenarios cover both facade testing and confidence-threshold pausing logic

Action Required:

  1. Remove or split out the facade.py production code change
  2. Remove or split out the unrelated tool/wrapping.py suppression marker changes
  3. Update PR title/label if the production code change is intentionally part of this PR
freemo left a comment

Day 43 Review — PR #1086 test(wf03): plan prompt confidence-threshold pausing

Milestone: v3.4.0
Status: Mergeable (no conflicts)

Review Notes

This PR has been reviewed for compliance with CONTRIBUTING.md standards. Key checks:

  • Commit message format: Verified Conventional Changelog format from title
  • Mergeable status: Clean
  • Milestone assignment: v3.4.0

Action Items

  • Ensure the PR body includes a closing keyword (e.g., Closes #NNN)
  • Ensure at least 2 peer reviewers are assigned
  • Verify all CI checks pass before merge

Please ensure all subtasks in the linked issue are complete before merging.

freemo approved these changes 2026-03-24 15:29:13 +00:00
Dismissed
freemo left a comment

Review: APPROVED

Test PR adding plan prompt test and confidence-threshold pausing verification for WF03. Tests are well-structured and focused on verifying the plan prompt workflow behavior.

brent.edwards dismissed freemo's review 2026-03-25 20:44:53 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

brent.edwards force-pushed test/wf03-plan-prompt-confidence from 35a7ede3b9
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 14s
CI / lint (pull_request) Successful in 5m16s
CI / typecheck (pull_request) Successful in 5m27s
CI / quality (pull_request) Successful in 5m56s
CI / integration_tests (pull_request) Successful in 7m50s
CI / unit_tests (pull_request) Successful in 9m27s
CI / e2e_tests (pull_request) Successful in 11m58s
CI / security (pull_request) Failing after 16m50s
CI / coverage (pull_request) Successful in 11m52s
CI / docker (pull_request) Has been skipped
CI / status-check (pull_request) Failing after 1s
CI / benchmark-regression (pull_request) Successful in 55m33s
to 2f6cc8fa86
Some checks failed
CI / build (pull_request) Successful in 17s
CI / lint (pull_request) Successful in 3m17s
CI / quality (pull_request) Successful in 3m39s
CI / typecheck (pull_request) Successful in 3m53s
CI / security (pull_request) Successful in 4m2s
CI / integration_tests (pull_request) Successful in 6m49s
CI / e2e_tests (pull_request) Successful in 9m58s
CI / coverage (pull_request) Successful in 11m12s
CI / benchmark-publish (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Has been cancelled
CI / unit_tests (pull_request) Has been cancelled
CI / status-check (pull_request) Has been cancelled
CI / docker (pull_request) Has been cancelled
2026-03-26 01:22:02 +00:00
brent.edwards changed title from test(wf03): add plan prompt test and confidence-threshold pausing verification to feat(wf03): add plan prompt test coverage and confidence-threshold pausing verification 2026-03-26 01:22:10 +00:00
Author
Member

Addressed the outstanding review scope concerns:

  1. Removed unrelated change: reverted src/cleveragents/tool/wrapping.py suppression-marker edits from this PR.
  2. Scope clarified: updated PR title/body to reflect that this PR contains WF03 tests plus one required production behavior change in A2aLocalFacade._handle_plan_prompt (guidance echo in stub response), which the tests depend on.

Local verification (saved logs):

  • nox -s lint ✅ (/tmp/nox-lint-1086.log)
  • nox -s unit_tests -- features/wf03_plan_prompt_confidence.feature ✅ (/tmp/nox-unit-1086.log)
brent.edwards force-pushed test/wf03-plan-prompt-confidence from 5a86158f22
All checks were successful
CI / build (pull_request) Successful in 13s
CI / lint (pull_request) Successful in 3m17s
CI / typecheck (pull_request) Successful in 3m46s
CI / quality (pull_request) Successful in 3m41s
CI / security (pull_request) Successful in 3m56s
CI / unit_tests (pull_request) Successful in 6m8s
CI / docker (pull_request) Successful in 55s
CI / integration_tests (pull_request) Successful in 7m4s
CI / e2e_tests (pull_request) Successful in 9m35s
CI / benchmark-publish (pull_request) Has been skipped
CI / coverage (pull_request) Successful in 10m17s
CI / status-check (pull_request) Successful in 1s
CI / benchmark-regression (pull_request) Successful in 1h9m5s
to 5c45c22f9f
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 42s
CI / lint (pull_request) Successful in 3m23s
CI / quality (pull_request) Successful in 7m51s
CI / typecheck (pull_request) Successful in 7m59s
CI / security (pull_request) Successful in 9m9s
CI / integration_tests (pull_request) Successful in 10m57s
CI / unit_tests (pull_request) Successful in 11m10s
CI / docker (pull_request) Successful in 1m14s
CI / e2e_tests (pull_request) Successful in 15m44s
CI / coverage (pull_request) Successful in 11m19s
CI / status-check (pull_request) Successful in 1s
CI / benchmark-regression (pull_request) Successful in 1h23m44s
2026-03-26 20:03:37 +00:00
freemo approved these changes 2026-03-27 17:11:38 +00:00
Dismissed
freemo left a comment

Review: feat(wf03): add plan prompt test coverage and confidence-threshold pausing

Approved with comments.

Issues to Address

1. PR title says feat but content is test (Medium)
The branch prefix test/ and commit type test(wf03): are correct. The PR title should match: test(wf03): add plan prompt test coverage and confidence-threshold pausing verification.

2. Unrelated wrapping.py changes (Low)
The # fmt: skip, # nosemgrep, and # nosec comment additions to TransformExecutor.execute() and compile() are linting/security suppressions unrelated to WF03 plan prompt testing. Should be in a separate commit/PR.

3. Facade stub modification (Low)
The _handle_plan_prompt stub was modified to echo guidance back. While the stub: True flag helps distinguish this, the change should be documented as temporary — the real facade should drive the test, not a test-aware stub.

What's Good

  • 8 BDD scenarios + 4 Robot tests covering plan prompt dispatch, confidence threshold pausing/proceeding, and boundary values (0.55, 0.6 exactly).
  • _factors_for_confidence(target) helper uses the AutonomyController formula precisely — well-documented with formula derivation in comments.
  • Honest TODO comments about stub limitations.
  • CHANGELOG entry with spec section references.
brent.edwards force-pushed test/wf03-plan-prompt-confidence from 5c45c22f9f
to 8b8942817c
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 44s
CI / docker (pull_request) Successful in 1m12s
CI / status-check (pull_request) Successful in 9s
CI / lint (pull_request) Successful in 3m17s
CI / typecheck (pull_request) Successful in 3m54s
CI / quality (pull_request) Successful in 4m6s
CI / security (pull_request) Successful in 4m29s
CI / unit_tests (pull_request) Successful in 6m58s
CI / integration_tests (pull_request) Successful in 7m30s
CI / e2e_tests (pull_request) Successful in 11m39s
CI / coverage (pull_request) Successful in 11m52s
CI / build (push) Successful in 17s
CI / lint (push) Successful in 3m18s
CI / quality (push) Successful in 3m51s
CI / typecheck (push) Successful in 3m55s
CI / benchmark-regression (push) Has been skipped
CI / security (push) Successful in 4m2s
CI / integration_tests (push) Successful in 8m57s
CI / unit_tests (push) Successful in 9m9s
CI / docker (push) Successful in 1m7s
CI / e2e_tests (push) Successful in 11m24s
CI / coverage (push) Successful in 11m41s
CI / status-check (push) Successful in 1s
CI / benchmark-publish (push) Successful in 27m28s
CI / benchmark-regression (pull_request) Successful in 59m17s
2026-03-27 20:36:49 +00:00
brent.edwards dismissed freemo's review 2026-03-27 20:36:49 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

brent.edwards scheduled this pull request to auto merge when all checks succeed 2026-03-27 20:38:40 +00:00
brent.edwards deleted branch test/wf03-plan-prompt-confidence 2026-03-27 21:16:55 +00:00
Reference
cleveragents/cleveragents-core!1086