test(e2e): workflow example 14 — server mode team collaboration (supervised profile) #805

Merged
CoreRasurae merged 2 commits from test/e2e-wf14-server-mode into master 2026-03-26 19:26:39 +00:00
Member

Summary

E2E test for Workflow Example 14 — server mode team collaboration using the supervised profile. Tests CLI configuration for server settings, action/actor registration, and profile inspection (no live server required).

Closes #760

ISSUES CLOSED: #760

Manual Verification

Prerequisites

  • No LLM API key required (this test does not make LLM calls)

Commands

WORKDIR=$(mktemp -d) && cd "$WORKDIR"
python -m cleveragents init --yes --force

# 1. Server configuration (WF14-specific)
python -m cleveragents config set server.url http://localhost:8080
python -m cleveragents config get server.url
# → Look for: "http://localhost:8080"

python -m cleveragents config set core.namespace team-workspace
python -m cleveragents config get core.namespace
# → Look for: "team-workspace"

# 2. Diagnostics
python -m cleveragents diagnostics --format plain
# → Look for: system diagnostics output (versions, paths, config)

# 3. Action and actor registration
python -m cleveragents action create --config action.yaml
python -m cleveragents action show action-name
# → Look for: action details

python -m cleveragents actor add --config actor.yaml
python -m cleveragents actor list --format plain
# → Look for: registered actors listed

# 4. Profile inspection
python -m cleveragents automation-profile show supervised
# → Look for: supervised profile details with guard settings

# 5. Plan listing
python -m cleveragents plan list
# → Look for: empty list or existing plans

What to Look For

  • config set/get round-trips server.url and core.namespace
  • diagnostics outputs system information without error
  • Action and actor registration succeeds
  • automation-profile show displays profile details
  • No Traceback in any command's stderr
## Summary E2E test for Workflow Example 14 — server mode team collaboration using the supervised profile. Tests CLI configuration for server settings, action/actor registration, and profile inspection (no live server required). Closes #760 ISSUES CLOSED: #760 ## Manual Verification ### Prerequisites - No LLM API key required (this test does not make LLM calls) ### Commands ```bash WORKDIR=$(mktemp -d) && cd "$WORKDIR" python -m cleveragents init --yes --force # 1. Server configuration (WF14-specific) python -m cleveragents config set server.url http://localhost:8080 python -m cleveragents config get server.url # → Look for: "http://localhost:8080" python -m cleveragents config set core.namespace team-workspace python -m cleveragents config get core.namespace # → Look for: "team-workspace" # 2. Diagnostics python -m cleveragents diagnostics --format plain # → Look for: system diagnostics output (versions, paths, config) # 3. Action and actor registration python -m cleveragents action create --config action.yaml python -m cleveragents action show action-name # → Look for: action details python -m cleveragents actor add --config actor.yaml python -m cleveragents actor list --format plain # → Look for: registered actors listed # 4. Profile inspection python -m cleveragents automation-profile show supervised # → Look for: supervised profile details with guard settings # 5. Plan listing python -m cleveragents plan list # → Look for: empty list or existing plans ``` ### What to Look For - `config set/get` round-trips server.url and core.namespace - `diagnostics` outputs system information without error - Action and actor registration succeeds - `automation-profile show` displays profile details - No `Traceback` in any command's stderr
CoreRasurae added this to the v3.6.0 milestone 2026-03-13 01:43:51 +00:00
freemo force-pushed test/e2e-wf14-server-mode from 8ecaa82262
Some checks failed
CI / lint (pull_request) Successful in 22s
CI / benchmark-publish (pull_request) Has been skipped
CI / quality (pull_request) Successful in 16s
CI / typecheck (pull_request) Successful in 43s
CI / build (pull_request) Successful in 23s
CI / security (pull_request) Successful in 46s
CI / e2e_tests (pull_request) Successful in 51s
CI / unit_tests (pull_request) Successful in 3m43s
CI / docker (pull_request) Successful in 54s
CI / integration_tests (pull_request) Failing after 5m36s
CI / coverage (pull_request) Successful in 6m51s
CI / benchmark-regression (pull_request) Successful in 38m53s
to 92aa055ac5
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 17s
CI / build (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 19s
CI / typecheck (pull_request) Successful in 36s
CI / security (pull_request) Successful in 38s
CI / e2e_tests (pull_request) Successful in 45s
CI / unit_tests (pull_request) Successful in 3m1s
CI / docker (pull_request) Successful in 41s
CI / integration_tests (pull_request) Successful in 3m57s
CI / coverage (pull_request) Successful in 4m57s
CI / benchmark-regression (pull_request) Successful in 36m32s
2026-03-13 18:46:49 +00:00
Compare
freemo force-pushed test/e2e-wf14-server-mode from 92aa055ac5
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 17s
CI / build (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 19s
CI / typecheck (pull_request) Successful in 36s
CI / security (pull_request) Successful in 38s
CI / e2e_tests (pull_request) Successful in 45s
CI / unit_tests (pull_request) Successful in 3m1s
CI / docker (pull_request) Successful in 41s
CI / integration_tests (pull_request) Successful in 3m57s
CI / coverage (pull_request) Successful in 4m57s
CI / benchmark-regression (pull_request) Successful in 36m32s
to 3d240a890e
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 14s
CI / quality (pull_request) Successful in 15s
CI / typecheck (pull_request) Successful in 35s
CI / build (pull_request) Successful in 25s
CI / security (pull_request) Successful in 49s
CI / e2e_tests (pull_request) Successful in 57s
CI / unit_tests (pull_request) Successful in 2m53s
CI / integration_tests (pull_request) Successful in 3m38s
CI / docker (pull_request) Successful in 8s
CI / coverage (pull_request) Successful in 5m40s
CI / benchmark-regression (pull_request) Successful in 35m49s
2026-03-13 23:19:48 +00:00
Compare
Owner

PM Review — Day 34

Status: Mergeable, 0 reviews, M7 (v3.6.0)
Author: @CoreRasurae

E2E test for WF14 (server mode team collaboration, supervised profile).

Action Items

Who Action Deadline
@hamza.khyari Peer review Day 38
## PM Review — Day 34 **Status**: Mergeable, 0 reviews, M7 (v3.6.0) **Author**: @CoreRasurae E2E test for WF14 (server mode team collaboration, supervised profile). ### Action Items | Who | Action | Deadline | |-----|--------|----------| | @hamza.khyari | **Peer review** | Day 38 |
Owner

PM Status — Day 36 (2026-03-16)

Day 34 review assignment deadline check. 0 reviewer activity after 2 days.

Assigned reviewer: Please acknowledge and provide an ETA for review. Prioritize M3 PRs first, then M4+ in milestone order.

## PM Status — Day 36 (2026-03-16) Day 34 review assignment deadline check. 0 reviewer activity after 2 days. **Assigned reviewer**: Please acknowledge and provide an ETA for review. Prioritize M3 PRs first, then M4+ in milestone order.
freemo left a comment

PM Day 36: M7 E2E test. Merge conflict. @CoreRasurae rebase needed.

PM Day 36: M7 E2E test. Merge conflict. @CoreRasurae rebase needed.
Owner

PM Status — Day 37 — Rebase Required

This PR has merge conflicts and cannot be merged in its current state. 42% of all open PRs (21 of 50) have conflicts — this is a project-wide issue that must be resolved.

@CoreRasurae — Please rebase this PR onto master by Day 39 EOD (2026-03-19). If you cannot rebase by then, please post a comment explaining the blocker.


PM rebase request — Day 37

## PM Status — Day 37 — Rebase Required This PR has **merge conflicts** and cannot be merged in its current state. 42% of all open PRs (21 of 50) have conflicts — this is a project-wide issue that must be resolved. @CoreRasurae — Please rebase this PR onto `master` by **Day 39 EOD (2026-03-19)**. If you cannot rebase by then, please post a comment explaining the blocker. --- *PM rebase request — Day 37*
CoreRasurae force-pushed test/e2e-wf14-server-mode from 3d240a890e
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 14s
CI / quality (pull_request) Successful in 15s
CI / typecheck (pull_request) Successful in 35s
CI / build (pull_request) Successful in 25s
CI / security (pull_request) Successful in 49s
CI / e2e_tests (pull_request) Successful in 57s
CI / unit_tests (pull_request) Successful in 2m53s
CI / integration_tests (pull_request) Successful in 3m38s
CI / docker (pull_request) Successful in 8s
CI / coverage (pull_request) Successful in 5m40s
CI / benchmark-regression (pull_request) Successful in 35m49s
to 8ecaa82262
Some checks failed
CI / lint (pull_request) Successful in 22s
CI / benchmark-publish (pull_request) Has been skipped
CI / quality (pull_request) Successful in 16s
CI / typecheck (pull_request) Successful in 43s
CI / build (pull_request) Successful in 23s
CI / security (pull_request) Successful in 46s
CI / e2e_tests (pull_request) Successful in 51s
CI / unit_tests (pull_request) Successful in 3m43s
CI / docker (pull_request) Successful in 54s
CI / integration_tests (pull_request) Failing after 5m36s
CI / coverage (pull_request) Successful in 6m51s
CI / benchmark-regression (pull_request) Successful in 38m53s
2026-03-18 21:11:40 +00:00
Compare
CoreRasurae force-pushed test/e2e-wf14-server-mode from 8ecaa82262
Some checks failed
CI / lint (pull_request) Successful in 22s
CI / benchmark-publish (pull_request) Has been skipped
CI / quality (pull_request) Successful in 16s
CI / typecheck (pull_request) Successful in 43s
CI / build (pull_request) Successful in 23s
CI / security (pull_request) Successful in 46s
CI / e2e_tests (pull_request) Successful in 51s
CI / unit_tests (pull_request) Successful in 3m43s
CI / docker (pull_request) Successful in 54s
CI / integration_tests (pull_request) Failing after 5m36s
CI / coverage (pull_request) Successful in 6m51s
CI / benchmark-regression (pull_request) Successful in 38m53s
to 91ccbf4aab
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 27s
CI / quality (pull_request) Successful in 30s
CI / security (pull_request) Successful in 42s
CI / typecheck (pull_request) Successful in 46s
CI / unit_tests (pull_request) Successful in 3m10s
CI / integration_tests (pull_request) Successful in 3m35s
CI / docker (pull_request) Successful in 56s
CI / e2e_tests (pull_request) Successful in 4m14s
CI / coverage (pull_request) Successful in 7m45s
CI / benchmark-regression (pull_request) Successful in 37m34s
2026-03-18 21:20:24 +00:00
Compare
CoreRasurae force-pushed test/e2e-wf14-server-mode from 91ccbf4aab
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 27s
CI / quality (pull_request) Successful in 30s
CI / security (pull_request) Successful in 42s
CI / typecheck (pull_request) Successful in 46s
CI / unit_tests (pull_request) Successful in 3m10s
CI / integration_tests (pull_request) Successful in 3m35s
CI / docker (pull_request) Successful in 56s
CI / e2e_tests (pull_request) Successful in 4m14s
CI / coverage (pull_request) Successful in 7m45s
CI / benchmark-regression (pull_request) Successful in 37m34s
to 4e134ea2ba
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 16s
CI / build (pull_request) Successful in 30s
CI / quality (pull_request) Successful in 39s
CI / security (pull_request) Successful in 44s
CI / typecheck (pull_request) Successful in 46s
CI / unit_tests (pull_request) Successful in 3m13s
CI / integration_tests (pull_request) Successful in 3m45s
CI / docker (pull_request) Successful in 55s
CI / e2e_tests (pull_request) Successful in 4m13s
CI / coverage (pull_request) Successful in 8m2s
CI / benchmark-regression (pull_request) Successful in 41m25s
2026-03-18 22:08:52 +00:00
Compare
Author
Member

Code Review Report — PR #805 (test/e2e-wf14-server-mode)

Reviewer: Automated deep review (3 global cycles across all categories)
Scope: All changes on branch test/e2e-wf14-server-mode relative to master, cross-referenced with issue #760 acceptance criteria and docs/specification.md Example 14.
Commit: 4e134ea2 by Luis Mendes — test(e2e): workflow example 14 — server mode team collaboration (supervised profile)

Files changed: robot/e2e/wf14_server_mode.robot (new, 89 lines), CHANGELOG.md (+6 lines), src/cleveragents/infrastructure/sandbox/strategy_registry.py (4 lines renamed)


Summary

The PR adds an E2E Robot Framework test for Specification Workflow Example 14 (Server Mode — Team Collaboration) and includes a minor parameter rename in strategy_registry.py. The test structure follows established patterns from common_e2e.resource, but multiple gaps exist between the test coverage and the issue acceptance criteria / specification requirements. Several test-engineering concerns also need attention.

Findings: 12 total — 3 High, 5 Medium, 4 Low


HIGH Severity

H-1. [Test Coverage] Issue AC not fulfilled — missing action list --namespace and plan list --namespace tests

File: robot/e2e/wf14_server_mode.robot

Issue #760 AC states:

  • "Test lists shared actions from the namespace"
  • "Test monitors plans across the namespace via plan list --namespace"

Specification Example 14 Steps 2 and 4 show:

  • agents action list --namespace myteam
  • agents plan list --namespace myteam

Neither command is present in the test. The test runs action show (line 58) but never action list --namespace, and plan list (line 81) without --namespace.

Moreover, investigation confirms plan list (and plan lifecycle-list) do not support a --namespace flag in the current CLI implementation. The action list command does support --namespace.

Impact: Two acceptance criteria are unmet. The plan list --namespace gap also reveals an unimplemented CLI feature the spec assumes exists.

Recommendation: Add action list --namespace myteam assertion. For plan list, either (a) document that the --namespace flag is not yet implemented and defer that AC, or (b) implement the flag first.


H-2. [Test Coverage] Actions and actors created in local/ namespace instead of team namespace

File: robot/e2e/wf14_server_mode.robot, lines 44, 66

The test claims to validate "team collaboration" with a shared namespace. The spec shows actions published as myteam/generate-tests and actors as myteam/review-pipeline. However, the test hardcodes local/ prefixes:

  • Line 44: local/gen-tests-${epoch}
  • Line 66: local/team-rev-${epoch}

The local/ prefix explicitly overrides the configured core.namespace = myteam, so the test never actually exercises namespace-scoped publishing.

Impact: The core "team collaboration via shared namespace" scenario from the spec is not validated.

Recommendation: Create entities without the local/ prefix (e.g., gen-tests-${epoch}) or use the myteam/ prefix so core.namespace is exercised.


H-3. [Test Coverage] CHANGELOG claims do not match actual test coverage

File: CHANGELOG.md, lines 145-149

The CHANGELOG entry claims: "shared action listing, and plan monitoring across namespaces using the supervised automation profile via real CLI with real LLM API keys."

Three inaccuracies:

  1. "shared action listing" — no action list command is tested
  2. "plan monitoring across namespaces"plan list is tested without --namespace
  3. "real LLM API keys" — the test does not use or require LLM API keys (no Skip If No LLM Keys guard, no LLM-dependent commands)

Impact: Misleading documentation for future maintainers and reviewers.

Recommendation: Align the CHANGELOG with what the test actually exercises, or expand the test to cover the claimed behavior.


MEDIUM Severity

M-1. [Test Flaw] Missing config cleanup/teardown for persistent state

File: robot/e2e/wf14_server_mode.robot, lines 17-25

The WF14 E2E Server Config Setup test sets server.url and core.namespace in the config but has no [Teardown] to restore them. By comparison, M6 tests consistently use [Teardown] blocks (e.g., config set core.automation-profile manual).

Config changes persist in the config.toml file under cwd. If other test suites share the same data directory or run sequentially in the same workspace, these leaked settings (server.url=https://agents.example.com, core.namespace=myteam) could cause interference.

Recommendation: Add [Teardown] to the config test case, or add a suite-level [Suite Teardown] step that resets server.url and core.namespace.


M-2. [Test Flaw] Missing server.token configuration per specification

File: robot/e2e/wf14_server_mode.robot

Specification Example 14 Step 1 includes three config steps: server.url, server.token, and core.namespace. The test covers server.url (line 17) and core.namespace (line 22) but skips server.token. Since server.token is the authentication mechanism for server mode, testing its configuration round-trip would strengthen coverage of the server mode setup flow.

Recommendation: Add a config set / config get round-trip for server.token.


M-3. [Test Flaw] plan list test verifies non-crash only, not meaningful behavior

File: robot/e2e/wf14_server_mode.robot, lines 78-83

The test uses expected_rc=None (accepts any exit code) and only asserts Should Not Be Empty ${result.stdout}${result.stderr}. Since plan list without initialization prints an error message ("No project found. Run 'agents init' first.") and aborts, the assertion passes on the error output. The test documentation says "Verify that the plan list command executes successfully" but does not actually verify success.

Recommendation: Either (a) call agents init --yes --force before this test and check for expected_rc=0, or (b) adjust the documentation to state it verifies graceful failure rather than success.


M-4. [Test Flaw] Epoch-based name uniqueness may fail in parallel CI

File: robot/e2e/wf14_server_mode.robot, lines 43, 65

Uses int(__import__('time').time()) (second-level granularity) for unique name suffixes. In parallel CI runners executing within the same second, this produces identical names and risks UNIQUE constraint violations. The M6 acceptance test uses uuid4().hex[:12] which is collision-resistant even under parallelism.

Recommendation: Replace epoch with uuid4().hex[:12] for consistency with established patterns and CI safety.


M-5. [Test Flaw] Misleading test documentation in diagnostics test

File: robot/e2e/wf14_server_mode.robot, lines 28-30

The documentation states: "Server connectivity will fail (no real server) so the return code is not checked." However, investigation of system.py:build_diagnostics_data() confirms there is no server connectivity check in the diagnostics command. The 10 diagnostic checks are all local (config file, data directory, database, API keys, disk space, file permissions, git, stale locks, async workers, error patterns). The expected_rc=None is unnecessary since diagnostics should succeed (rc=0) even without a server.

Recommendation: Update the documentation to reflect the actual diagnostic checks, and consider setting expected_rc=0 since no server check exists.


LOW Severity

L-1. [Code Quality] Unrelated cls to klass rename bundled in E2E test commit

File: src/cleveragents/infrastructure/sandbox/strategy_registry.py, lines 262-297

The rename of the cls parameter to klass in the _validate_protocol staticmethod is a valid improvement (avoids confusion with the classmethod cls convention) but is unrelated to issue #760 or the E2E test work. Bundling unrelated changes in a test commit makes git history less navigable.

Recommendation: Separate this into its own commit or PR for clean history.


L-2. [Code Quality] Inconsistent tag pattern across E2E suites

File: robot/e2e/wf14_server_mode.robot

The WF14 test applies [Tags] E2E per test case (lines 15, 31, 41, 63, 80, 87), while m6_acceptance.robot uses Force Tags E2E at suite level. Both are valid, but Force Tags is cleaner when all tests in a suite share the same tag.

Recommendation: Consider using Force Tags E2E at suite level for consistency with M6.


L-3. [Code Quality] Minimal actor YAML may not exercise full registration flow

File: robot/e2e/wf14_server_mode.robot, lines 67-69

The actor YAML contains only provider: openai and model: gpt-4. While this may be sufficient for actor add, the spec example shows a richer actor configuration with type, capabilities, tools, and graph nodes. A minimal YAML exercises only the happy path minimum.

Recommendation: Consider adding at least description and type fields to better exercise the actor registration flow.


L-4. [Security] No issues found — good practices observed

The test correctly uses https://agents.example.com (RFC 2606 reserved domain) for the server URL. No real tokens, API keys, or sensitive data appear in the test code.


Information / Positive Notes

  • The strategy_registry.py rename is functionally correct and all 4 references (parameter, docstring, getattr, f-string) are consistently updated.
  • YAML files created during tests are properly cleaned up by E2E Suite Teardown (recursive removal of ${SUITE_HOME}).
  • The Create File keyword correctly resolves from the OperatingSystem library imported transitively via common_e2e.resource.
  • The Should Contain Any assertions in the diagnostics test correctly match the actual diagnostic category names ("Config file", "Database", "Disk space").
  • The automation-profile show supervised test (line 88) correctly targets a built-in profile that doesn't require database initialization.

Verdict

REQUEST CHANGES — The test has structural correctness but significant coverage gaps relative to issue #760 acceptance criteria and the specification. The high-severity items (H-1, H-2, H-3) should be addressed before merge. Medium-severity items (M-1 through M-5) represent test-engineering improvements that strengthen reliability and maintainability.

# Code Review Report — PR #805 (`test/e2e-wf14-server-mode`) **Reviewer**: Automated deep review (3 global cycles across all categories) **Scope**: All changes on branch `test/e2e-wf14-server-mode` relative to `master`, cross-referenced with issue #760 acceptance criteria and `docs/specification.md` Example 14. **Commit**: `4e134ea2` by Luis Mendes — `test(e2e): workflow example 14 — server mode team collaboration (supervised profile)` **Files changed**: `robot/e2e/wf14_server_mode.robot` (new, 89 lines), `CHANGELOG.md` (+6 lines), `src/cleveragents/infrastructure/sandbox/strategy_registry.py` (4 lines renamed) --- ## Summary The PR adds an E2E Robot Framework test for Specification Workflow Example 14 (Server Mode — Team Collaboration) and includes a minor parameter rename in `strategy_registry.py`. The test structure follows established patterns from `common_e2e.resource`, but multiple gaps exist between the test coverage and the issue acceptance criteria / specification requirements. Several test-engineering concerns also need attention. **Findings**: 12 total — 3 High, 5 Medium, 4 Low --- ## HIGH Severity ### H-1. [Test Coverage] Issue AC not fulfilled — missing `action list --namespace` and `plan list --namespace` tests **File**: `robot/e2e/wf14_server_mode.robot` Issue #760 AC states: - "Test lists shared actions from the namespace" - "Test monitors plans across the namespace via `plan list --namespace`" Specification Example 14 Steps 2 and 4 show: - `agents action list --namespace myteam` - `agents plan list --namespace myteam` Neither command is present in the test. The test runs `action show` (line 58) but never `action list --namespace`, and `plan list` (line 81) without `--namespace`. Moreover, investigation confirms `plan list` (and `plan lifecycle-list`) do **not** support a `--namespace` flag in the current CLI implementation. The `action list` command does support `--namespace`. **Impact**: Two acceptance criteria are unmet. The `plan list --namespace` gap also reveals an unimplemented CLI feature the spec assumes exists. **Recommendation**: Add `action list --namespace myteam` assertion. For `plan list`, either (a) document that the `--namespace` flag is not yet implemented and defer that AC, or (b) implement the flag first. --- ### H-2. [Test Coverage] Actions and actors created in `local/` namespace instead of team namespace **File**: `robot/e2e/wf14_server_mode.robot`, lines 44, 66 The test claims to validate "team collaboration" with a shared namespace. The spec shows actions published as `myteam/generate-tests` and actors as `myteam/review-pipeline`. However, the test hardcodes `local/` prefixes: - Line 44: `local/gen-tests-${epoch}` - Line 66: `local/team-rev-${epoch}` The `local/` prefix explicitly overrides the configured `core.namespace = myteam`, so the test never actually exercises namespace-scoped publishing. **Impact**: The core "team collaboration via shared namespace" scenario from the spec is not validated. **Recommendation**: Create entities without the `local/` prefix (e.g., `gen-tests-${epoch}`) or use the `myteam/` prefix so `core.namespace` is exercised. --- ### H-3. [Test Coverage] CHANGELOG claims do not match actual test coverage **File**: `CHANGELOG.md`, lines 145-149 The CHANGELOG entry claims: "shared action listing, and plan monitoring across namespaces using the `supervised` automation profile via real CLI with real LLM API keys." Three inaccuracies: 1. **"shared action listing"** — no `action list` command is tested 2. **"plan monitoring across namespaces"** — `plan list` is tested without `--namespace` 3. **"real LLM API keys"** — the test does not use or require LLM API keys (no `Skip If No LLM Keys` guard, no LLM-dependent commands) **Impact**: Misleading documentation for future maintainers and reviewers. **Recommendation**: Align the CHANGELOG with what the test actually exercises, or expand the test to cover the claimed behavior. --- ## MEDIUM Severity ### M-1. [Test Flaw] Missing config cleanup/teardown for persistent state **File**: `robot/e2e/wf14_server_mode.robot`, lines 17-25 The `WF14 E2E Server Config Setup` test sets `server.url` and `core.namespace` in the config but has no `[Teardown]` to restore them. By comparison, M6 tests consistently use `[Teardown]` blocks (e.g., `config set core.automation-profile manual`). Config changes persist in the `config.toml` file under `cwd`. If other test suites share the same data directory or run sequentially in the same workspace, these leaked settings (`server.url=https://agents.example.com`, `core.namespace=myteam`) could cause interference. **Recommendation**: Add `[Teardown]` to the config test case, or add a suite-level `[Suite Teardown]` step that resets `server.url` and `core.namespace`. --- ### M-2. [Test Flaw] Missing `server.token` configuration per specification **File**: `robot/e2e/wf14_server_mode.robot` Specification Example 14 Step 1 includes three config steps: `server.url`, `server.token`, and `core.namespace`. The test covers `server.url` (line 17) and `core.namespace` (line 22) but skips `server.token`. Since `server.token` is the authentication mechanism for server mode, testing its configuration round-trip would strengthen coverage of the server mode setup flow. **Recommendation**: Add a `config set` / `config get` round-trip for `server.token`. --- ### M-3. [Test Flaw] `plan list` test verifies non-crash only, not meaningful behavior **File**: `robot/e2e/wf14_server_mode.robot`, lines 78-83 The test uses `expected_rc=None` (accepts any exit code) and only asserts `Should Not Be Empty ${result.stdout}${result.stderr}`. Since `plan list` without initialization prints an error message ("No project found. Run 'agents init' first.") and aborts, the assertion passes on the error output. The test documentation says "Verify that the plan list command executes successfully" but does not actually verify success. **Recommendation**: Either (a) call `agents init --yes --force` before this test and check for `expected_rc=0`, or (b) adjust the documentation to state it verifies graceful failure rather than success. --- ### M-4. [Test Flaw] Epoch-based name uniqueness may fail in parallel CI **File**: `robot/e2e/wf14_server_mode.robot`, lines 43, 65 Uses `int(__import__('time').time())` (second-level granularity) for unique name suffixes. In parallel CI runners executing within the same second, this produces identical names and risks UNIQUE constraint violations. The M6 acceptance test uses `uuid4().hex[:12]` which is collision-resistant even under parallelism. **Recommendation**: Replace epoch with `uuid4().hex[:12]` for consistency with established patterns and CI safety. --- ### M-5. [Test Flaw] Misleading test documentation in diagnostics test **File**: `robot/e2e/wf14_server_mode.robot`, lines 28-30 The documentation states: "Server connectivity will fail (no real server) so the return code is not checked." However, investigation of `system.py:build_diagnostics_data()` confirms there is **no server connectivity check** in the diagnostics command. The 10 diagnostic checks are all local (config file, data directory, database, API keys, disk space, file permissions, git, stale locks, async workers, error patterns). The `expected_rc=None` is unnecessary since diagnostics should succeed (rc=0) even without a server. **Recommendation**: Update the documentation to reflect the actual diagnostic checks, and consider setting `expected_rc=0` since no server check exists. --- ## LOW Severity ### L-1. [Code Quality] Unrelated `cls` to `klass` rename bundled in E2E test commit **File**: `src/cleveragents/infrastructure/sandbox/strategy_registry.py`, lines 262-297 The rename of the `cls` parameter to `klass` in the `_validate_protocol` staticmethod is a valid improvement (avoids confusion with the classmethod `cls` convention) but is unrelated to issue #760 or the E2E test work. Bundling unrelated changes in a test commit makes git history less navigable. **Recommendation**: Separate this into its own commit or PR for clean history. --- ### L-2. [Code Quality] Inconsistent tag pattern across E2E suites **File**: `robot/e2e/wf14_server_mode.robot` The WF14 test applies `[Tags] E2E` per test case (lines 15, 31, 41, 63, 80, 87), while `m6_acceptance.robot` uses `Force Tags E2E` at suite level. Both are valid, but `Force Tags` is cleaner when all tests in a suite share the same tag. **Recommendation**: Consider using `Force Tags E2E` at suite level for consistency with M6. --- ### L-3. [Code Quality] Minimal actor YAML may not exercise full registration flow **File**: `robot/e2e/wf14_server_mode.robot`, lines 67-69 The actor YAML contains only `provider: openai` and `model: gpt-4`. While this may be sufficient for `actor add`, the spec example shows a richer actor configuration with type, capabilities, tools, and graph nodes. A minimal YAML exercises only the happy path minimum. **Recommendation**: Consider adding at least `description` and `type` fields to better exercise the actor registration flow. --- ### L-4. [Security] No issues found — good practices observed The test correctly uses `https://agents.example.com` (RFC 2606 reserved domain) for the server URL. No real tokens, API keys, or sensitive data appear in the test code. --- ## Information / Positive Notes - The `strategy_registry.py` rename is functionally correct and all 4 references (parameter, docstring, getattr, f-string) are consistently updated. - YAML files created during tests are properly cleaned up by `E2E Suite Teardown` (recursive removal of `${SUITE_HOME}`). - The `Create File` keyword correctly resolves from the `OperatingSystem` library imported transitively via `common_e2e.resource`. - The `Should Contain Any` assertions in the diagnostics test correctly match the actual diagnostic category names ("Config file", "Database", "Disk space"). - The `automation-profile show supervised` test (line 88) correctly targets a built-in profile that doesn't require database initialization. --- ## Verdict **REQUEST CHANGES** — The test has structural correctness but significant coverage gaps relative to issue #760 acceptance criteria and the specification. The high-severity items (H-1, H-2, H-3) should be addressed before merge. Medium-severity items (M-1 through M-5) represent test-engineering improvements that strengthen reliability and maintainability.
CoreRasurae force-pushed test/e2e-wf14-server-mode from 4e134ea2ba
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 16s
CI / build (pull_request) Successful in 30s
CI / quality (pull_request) Successful in 39s
CI / security (pull_request) Successful in 44s
CI / typecheck (pull_request) Successful in 46s
CI / unit_tests (pull_request) Successful in 3m13s
CI / integration_tests (pull_request) Successful in 3m45s
CI / docker (pull_request) Successful in 55s
CI / e2e_tests (pull_request) Successful in 4m13s
CI / coverage (pull_request) Successful in 8m2s
CI / benchmark-regression (pull_request) Successful in 41m25s
to 8f3c15aa08
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 23s
CI / quality (pull_request) Successful in 28s
CI / build (pull_request) Successful in 18s
CI / typecheck (pull_request) Successful in 49s
CI / security (pull_request) Successful in 53s
CI / e2e_tests (pull_request) Failing after 4m40s
CI / integration_tests (pull_request) Successful in 5m43s
CI / coverage (pull_request) Successful in 7m11s
CI / unit_tests (pull_request) Failing after 11m51s
CI / benchmark-regression (pull_request) Successful in 39m22s
CI / docker (pull_request) Has been skipped
2026-03-18 22:55:28 +00:00
Compare
CoreRasurae force-pushed test/e2e-wf14-server-mode from 8f3c15aa08
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 23s
CI / quality (pull_request) Successful in 28s
CI / build (pull_request) Successful in 18s
CI / typecheck (pull_request) Successful in 49s
CI / security (pull_request) Successful in 53s
CI / e2e_tests (pull_request) Failing after 4m40s
CI / integration_tests (pull_request) Successful in 5m43s
CI / coverage (pull_request) Successful in 7m11s
CI / unit_tests (pull_request) Failing after 11m51s
CI / benchmark-regression (pull_request) Successful in 39m22s
CI / docker (pull_request) Has been skipped
to 064d448cce
Some checks failed
CI / build (pull_request) Successful in 14s
CI / lint (pull_request) Successful in 3m14s
CI / quality (pull_request) Successful in 3m44s
CI / typecheck (pull_request) Successful in 3m48s
CI / security (pull_request) Successful in 4m1s
CI / integration_tests (pull_request) Successful in 5m52s
CI / e2e_tests (pull_request) Failing after 5m54s
CI / unit_tests (pull_request) Successful in 5m56s
CI / docker (pull_request) Successful in 1m7s
CI / coverage (pull_request) Successful in 11m33s
CI / status-check (pull_request) Successful in 1s
CI / benchmark-publish (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Has started running
2026-03-23 23:05:12 +00:00
Compare
brent.edwards requested changes 2026-03-23 23:25:47 +00:00
Dismissed
brent.edwards left a comment

Code Review — PR #805 (Ticket #760)

Branch: test/e2e-wf14-server-mode
Author: @CoreRasurae
Reviewer: @brent.edwards
Method: Multi-focus parallel review (7 threads: spec compliance, bug detection, test quality, code quality, security, strategy_registry analysis, stale review triage) + fresh-eyes pass


Verdict: Request Changes

There is 1 P0:blocker finding (security fix revert), 1 P1:must-fix (unmet acceptance criterion), and 5 P2:should-fix items. The P0 alone requires immediate attention before merge.

Note on stale self-review: The detailed self-review by @CoreRasurae (comment #67681, against commit 4e134ea2) identified 12 issues. The current commit 064d448c addresses 6 of 8 HIGH/MEDIUM findings. H-1 and H-2 are partially fixed; M-3 remains. The strategy_registry.py unrelated change (L-1) was correctly removed. Good rebase work overall.


P0:blocker — Must Fix Immediately

1. Security fix revert: shell=True re-introduced in cli_coverage_steps.py

File: features/steps/cli_coverage_steps.py:43–44

This PR reverts the security fix from commit bd18491b (fix(test): replace shell=True with shell=False in cli_coverage_steps.py subprocess call, issue #734). The diff shows:

  • import shlex removed
  • shlex.split(command) replaced back with command + shell=True
  • The CHANGELOG entry documenting the #734 fix is also deleted

The original fix was explicitly described as "defense-in-depth command injection prevention, consistent with the existing pattern in cli_plan_context_commands_steps.py." Reverting it re-introduces shell=True subprocess execution, which the project's security standards prohibit. The comment says "Use shell=True to handle quoted arguments correctly" — suggesting the fix broke a test scenario.

Impact: P0 — A security fix merged to master 4 days ago is being silently reverted in an unrelated E2E test PR with no justification or discussion. Even in test code, shell=True is a security concern per the project's own standards and the original fix's rationale.

Recommendation: Either (a) keep the shlex.split() approach and fix whatever test it broke (the correct solution), or (b) if shell=True is genuinely needed for quoted argument handling, explicitly document WHY in a code comment, add # nosec B602 with justification, and create a follow-up issue to find a secure alternative. Do NOT silently revert a security fix. The CHANGELOG entry for #734 must also be restored.


P1:must-fix — Must Fix Before Merge

2. Ticket AC "plan list --namespace" marked complete but unfulfilled

File: robot/e2e/wf14_server_mode.robot:112–119

Ticket #760 acceptance criterion states: "Test monitors plans across the namespace via plan list --namespace" — this checkbox is marked done in the ticket. Specification Example 14 Step 4 explicitly shows agents plan list --namespace myteam. However, the CLI's plan list command does not support a --namespace flag (confirmed in cli/commands/plan.py). The test runs bare plan list with expected_rc=None.

The test documentation (lines 115–116) honestly notes this as a "spec gap," which is good — but the ticket AC is marked complete when the condition cannot be fulfilled.

Recommendation: Uncheck this AC item in ticket #760 and add a note that it's deferred until the --namespace flag is implemented. A checked-off AC that isn't actually validated is misleading for milestone tracking.


P2:should-fix — Fix in Follow-Up PR (Within 3 Days)

3. plan list test assertion is tautological

File: robot/e2e/wf14_server_mode.robot:117–119

expected_rc=None (accepts any exit code) + Should Not Be Empty ${result.stdout}${result.stderr} means this test passes whenever the command produces any output on stdout or stderr — including error messages, tracebacks, or deprecation warnings. Since init --yes was already called in suite setup, plan list should succeed with rc=0.

Recommendation: Change to expected_rc=0. If plan list genuinely fails after init, that's a bug worth catching rather than masking with a permissive assertion.

4. diagnostics test uses expected_rc=None unnecessarily

File: robot/e2e/wf14_server_mode.robot:59

The documentation correctly notes diagnostics checks are all local and should succeed without a live server. Yet expected_rc=None is used. Without --check, diagnostics always returns rc=0.

Recommendation: Change to expected_rc=0 to match the documentation's own assertion that the command should succeed.

5. Actor uses local/ namespace, contradicting "team collaboration" scenario

File: robot/e2e/wf14_server_mode.robot:99

The actor is created with local/team-rev-${RUN_SUFFIX}. Spec Example 14 shows myteam/review-pipeline. The action was correctly fixed to use myteam/ prefix, but the actor was not. The documentation (lines 97–98) explains this is due to ActorService validation, which is honest — but means the "team collaboration" scenario is only half-exercised for namespace behavior.

Recommendation: If ActorService requires local/, file a follow-up issue tracking this spec gap and reference it in the test documentation.

6. Ticket AC "real LLM API keys" unmet

File: robot/e2e/wf14_server_mode.robot (entire file)

Ticket #760 AC states: "All invocations use real LLM API keys — no mocking, stubbing, or test doubles." No command in this test invokes an LLM. The PR description itself says "No LLM API key required." The AC is marked done but the condition isn't met.

Recommendation: Update ticket #760 to remove or reword the LLM AC — server mode config testing genuinely doesn't need LLM calls. Alternatively, add one LLM-dependent step guarded by Skip If No LLM Keys.

7. supervised profile verification is very shallow

File: robot/e2e/wf14_server_mode.robot:121–124

The test only checks that the output contains the string "supervised." The M6 suite's equivalent test checks for auto_strategize, auto_execute, and auto_apply fields. For a test whose title emphasizes the "supervised automation profile," only verifying the name is insufficient.

Recommendation: Add assertions for supervised-specific fields:

Output Should Contain    ${result}    auto_strategize
Output Should Contain    ${result}    auto_execute

P3:nit — Author Discretion

8. init --yes without --force deviates from established pattern

robot/e2e/wf14_server_mode.robot:18 — M6 uses init --force --yes. While WF14's fresh temp dir makes this unlikely to fail, --force provides defense-in-depth.

9. Missing inter-test dependency guard on Action List test

robot/e2e/wf14_server_mode.robot:88–93 — If the Action Create test fails, Action List will fail with an unclear "variable not found" error. Other suites use [Setup] Variable Should Exist guards.

10. Loose db substring in diagnostics assertion

robot/e2e/wf14_server_mode.robot:63db is only 2 characters and could match unrelated content. Keep only database and Database.

11. CHANGELOG wording "plan list graceful handling" oversells the test

The test only verifies non-empty output with any exit code. "Plan list smoke test" would be more accurate.


Items Verified as Correct

Area Status
Branch name — matches ticket metadata exactly
Commit message — matches ticket metadata exactly with em-dash
Commit atomicity — single commit, 2 files (but see P0 for cli_coverage_steps.py!) ⚠️
PR description — summary, closes keyword, manual verification
CHANGELOG entry — follows project pattern (but deletes #734 entry — see P0) ⚠️
Force Tags E2E at suite level — consistent with M6
uuid4().hex[:12] for unique names — CI-safe
Suite teardown — resets server.url, server.token, core.namespace
server.token round-trip test — added per spec Step 1
action list --namespace myteam — added per spec Step 2
Action uses myteam/ prefix — matches spec
strategy_registry.py rename — correctly removed from PR
Safe server URLhttps://agents.example.com (RFC 2606 reserved)
No real tokens/secrets in test code
Robot Framework section order — Settings/Keywords/Test Cases
LabelsType/Testing, State/In Review, correct milestone
ISSUES CLOSED: #760 in commit footer

Stale Self-Review Resolution Summary

ID Finding Status in 064d448c
H-1 Missing action list --namespace Fixed (plan list --namespace acknowledged as spec gap)
H-2 local/ namespace on entities Action fixed to myteam/; actor documented as ActorService constraint
H-3 CHANGELOG inaccurate Fixed — no longer claims LLM keys or cross-namespace plan monitoring
M-1 Missing config teardown Fixed — suite teardown resets all 3 config keys
M-2 Missing server.token Fixed — set/get round-trip added
M-3 plan list non-meaningful assertion ⚠️ Documentation updated but assertion unchanged
M-4 Epoch-based uniqueness Fixed — uses uuid4
M-5 Misleading diagnostics doc Fixed — correctly says checks are local
L-1 Unrelated strategy_registry.py change Removed from PR
L-2 Inconsistent tag pattern Fixed — Force Tags E2E

Summary

The E2E test itself is well-structured and has been substantially improved since the self-review. Config round-trips, namespace-scoped action listing, proper teardown, and uuid-based uniqueness are all solid. The test covers Steps 1–2 of spec Example 14 well.

However, the P0 finding is a hard blocker: this PR silently reverts a security fix (shell=True subprocess call) from issue #734 that was merged to master 4 days ago. The CHANGELOG entry documenting that fix is also deleted. This must be addressed before any other consideration.

Beyond that, the remaining findings are primarily about tightening assertions (P2 #3, #4), documenting spec gaps properly (P1 #2, P2 #5, #6), and making the supervised profile test more meaningful (P2 #7).

# Code Review — PR #805 (Ticket #760) **Branch:** `test/e2e-wf14-server-mode` **Author:** @CoreRasurae **Reviewer:** @brent.edwards **Method:** Multi-focus parallel review (7 threads: spec compliance, bug detection, test quality, code quality, security, strategy_registry analysis, stale review triage) + fresh-eyes pass --- ## Verdict: Request Changes There is **1 P0:blocker** finding (security fix revert), **1 P1:must-fix** (unmet acceptance criterion), and **5 P2:should-fix** items. The P0 alone requires immediate attention before merge. **Note on stale self-review:** The detailed self-review by @CoreRasurae (comment #67681, against commit `4e134ea2`) identified 12 issues. The current commit `064d448c` addresses **6 of 8 HIGH/MEDIUM** findings. H-1 and H-2 are partially fixed; M-3 remains. The `strategy_registry.py` unrelated change (L-1) was correctly removed. Good rebase work overall. --- ## P0:blocker — Must Fix Immediately ### 1. Security fix revert: `shell=True` re-introduced in `cli_coverage_steps.py` **File:** `features/steps/cli_coverage_steps.py:43–44` This PR **reverts** the security fix from commit `bd18491b` (`fix(test): replace shell=True with shell=False in cli_coverage_steps.py subprocess call`, issue #734). The diff shows: - `import shlex` removed - `shlex.split(command)` replaced back with `command` + `shell=True` - The CHANGELOG entry documenting the #734 fix is also deleted The original fix was explicitly described as *"defense-in-depth command injection prevention, consistent with the existing pattern in `cli_plan_context_commands_steps.py`."* Reverting it re-introduces `shell=True` subprocess execution, which the project's security standards prohibit. The comment says "Use shell=True to handle quoted arguments correctly" — suggesting the fix broke a test scenario. **Impact:** P0 — A security fix merged to master 4 days ago is being silently reverted in an unrelated E2E test PR with no justification or discussion. Even in test code, `shell=True` is a security concern per the project's own standards and the original fix's rationale. **Recommendation:** Either (a) keep the `shlex.split()` approach and fix whatever test it broke (the correct solution), or (b) if `shell=True` is genuinely needed for quoted argument handling, explicitly document WHY in a code comment, add `# nosec B602` with justification, and create a follow-up issue to find a secure alternative. Do NOT silently revert a security fix. The CHANGELOG entry for #734 must also be restored. --- ## P1:must-fix — Must Fix Before Merge ### 2. Ticket AC "`plan list --namespace`" marked complete but unfulfilled **File:** `robot/e2e/wf14_server_mode.robot:112–119` Ticket #760 acceptance criterion states: *"Test monitors plans across the namespace via `plan list --namespace`"* — this checkbox is marked done in the ticket. Specification Example 14 Step 4 explicitly shows `agents plan list --namespace myteam`. However, the CLI's `plan list` command does **not support** a `--namespace` flag (confirmed in `cli/commands/plan.py`). The test runs bare `plan list` with `expected_rc=None`. The test documentation (lines 115–116) honestly notes this as a "spec gap," which is good — but the ticket AC is marked complete when the condition cannot be fulfilled. **Recommendation:** Uncheck this AC item in ticket #760 and add a note that it's deferred until the `--namespace` flag is implemented. A checked-off AC that isn't actually validated is misleading for milestone tracking. --- ## P2:should-fix — Fix in Follow-Up PR (Within 3 Days) ### 3. `plan list` test assertion is tautological **File:** `robot/e2e/wf14_server_mode.robot:117–119` `expected_rc=None` (accepts any exit code) + `Should Not Be Empty ${result.stdout}${result.stderr}` means this test passes whenever the command produces *any* output on stdout or stderr — including error messages, tracebacks, or deprecation warnings. Since `init --yes` was already called in suite setup, `plan list` should succeed with rc=0. **Recommendation:** Change to `expected_rc=0`. If `plan list` genuinely fails after init, that's a bug worth catching rather than masking with a permissive assertion. ### 4. `diagnostics` test uses `expected_rc=None` unnecessarily **File:** `robot/e2e/wf14_server_mode.robot:59` The documentation correctly notes diagnostics checks are all local and should succeed without a live server. Yet `expected_rc=None` is used. Without `--check`, diagnostics always returns rc=0. **Recommendation:** Change to `expected_rc=0` to match the documentation's own assertion that the command should succeed. ### 5. Actor uses `local/` namespace, contradicting "team collaboration" scenario **File:** `robot/e2e/wf14_server_mode.robot:99` The actor is created with `local/team-rev-${RUN_SUFFIX}`. Spec Example 14 shows `myteam/review-pipeline`. The action was correctly fixed to use `myteam/` prefix, but the actor was not. The documentation (lines 97–98) explains this is due to ActorService validation, which is honest — but means the "team collaboration" scenario is only half-exercised for namespace behavior. **Recommendation:** If ActorService requires `local/`, file a follow-up issue tracking this spec gap and reference it in the test documentation. ### 6. Ticket AC "real LLM API keys" unmet **File:** `robot/e2e/wf14_server_mode.robot` (entire file) Ticket #760 AC states: *"All invocations use real LLM API keys — no mocking, stubbing, or test doubles."* No command in this test invokes an LLM. The PR description itself says *"No LLM API key required."* The AC is marked done but the condition isn't met. **Recommendation:** Update ticket #760 to remove or reword the LLM AC — server mode config testing genuinely doesn't need LLM calls. Alternatively, add one LLM-dependent step guarded by `Skip If No LLM Keys`. ### 7. `supervised` profile verification is very shallow **File:** `robot/e2e/wf14_server_mode.robot:121–124` The test only checks that the output contains the string "supervised." The M6 suite's equivalent test checks for `auto_strategize`, `auto_execute`, and `auto_apply` fields. For a test whose title emphasizes the "supervised automation profile," only verifying the name is insufficient. **Recommendation:** Add assertions for supervised-specific fields: ```robot Output Should Contain ${result} auto_strategize Output Should Contain ${result} auto_execute ``` --- ## P3:nit — Author Discretion ### 8. `init --yes` without `--force` deviates from established pattern `robot/e2e/wf14_server_mode.robot:18` — M6 uses `init --force --yes`. While WF14's fresh temp dir makes this unlikely to fail, `--force` provides defense-in-depth. ### 9. Missing inter-test dependency guard on Action List test `robot/e2e/wf14_server_mode.robot:88–93` — If the Action Create test fails, Action List will fail with an unclear "variable not found" error. Other suites use `[Setup] Variable Should Exist` guards. ### 10. Loose `db` substring in diagnostics assertion `robot/e2e/wf14_server_mode.robot:63` — `db` is only 2 characters and could match unrelated content. Keep only `database` and `Database`. ### 11. CHANGELOG wording "plan list graceful handling" oversells the test The test only verifies non-empty output with any exit code. "Plan list smoke test" would be more accurate. --- ## Items Verified as Correct ✅ | Area | Status | |------|--------| | **Branch name** — matches ticket metadata exactly | ✅ | | **Commit message** — matches ticket metadata exactly with em-dash | ✅ | | **Commit atomicity** — single commit, 2 files (but see P0 for cli_coverage_steps.py!) | ⚠️ | | **PR description** — summary, closes keyword, manual verification | ✅ | | **CHANGELOG entry** — follows project pattern (but deletes #734 entry — see P0) | ⚠️ | | **`Force Tags E2E`** at suite level — consistent with M6 | ✅ | | **`uuid4().hex[:12]`** for unique names — CI-safe | ✅ | | **Suite teardown** — resets server.url, server.token, core.namespace | ✅ | | **`server.token`** round-trip test — added per spec Step 1 | ✅ | | **`action list --namespace myteam`** — added per spec Step 2 | ✅ | | **Action uses `myteam/` prefix** — matches spec | ✅ | | **strategy_registry.py rename** — correctly removed from PR | ✅ | | **Safe server URL** — `https://agents.example.com` (RFC 2606 reserved) | ✅ | | **No real tokens/secrets** in test code | ✅ | | **Robot Framework section order** — Settings/Keywords/Test Cases | ✅ | | **Labels** — `Type/Testing`, `State/In Review`, correct milestone | ✅ | | **`ISSUES CLOSED: #760`** in commit footer | ✅ | --- ## Stale Self-Review Resolution Summary | ID | Finding | Status in `064d448c` | |----|---------|---------------------| | H-1 | Missing `action list --namespace` | ✅ Fixed (`plan list --namespace` acknowledged as spec gap) | | H-2 | `local/` namespace on entities | ✅ Action fixed to `myteam/`; actor documented as ActorService constraint | | H-3 | CHANGELOG inaccurate | ✅ Fixed — no longer claims LLM keys or cross-namespace plan monitoring | | M-1 | Missing config teardown | ✅ Fixed — suite teardown resets all 3 config keys | | M-2 | Missing server.token | ✅ Fixed — set/get round-trip added | | M-3 | plan list non-meaningful assertion | ⚠️ Documentation updated but assertion unchanged | | M-4 | Epoch-based uniqueness | ✅ Fixed — uses uuid4 | | M-5 | Misleading diagnostics doc | ✅ Fixed — correctly says checks are local | | L-1 | Unrelated strategy_registry.py change | ✅ Removed from PR | | L-2 | Inconsistent tag pattern | ✅ Fixed — Force Tags E2E | --- ## Summary The E2E test itself is well-structured and has been substantially improved since the self-review. Config round-trips, namespace-scoped action listing, proper teardown, and uuid-based uniqueness are all solid. The test covers Steps 1–2 of spec Example 14 well. However, the **P0 finding is a hard blocker**: this PR silently reverts a security fix (`shell=True` subprocess call) from issue #734 that was merged to master 4 days ago. The CHANGELOG entry documenting that fix is also deleted. This must be addressed before any other consideration. Beyond that, the remaining findings are primarily about tightening assertions (P2 #3, #4), documenting spec gaps properly (P1 #2, P2 #5, #6), and making the supervised profile test more meaningful (P2 #7).
Member

Supplementary Review — PR #805 (Second Pass + P0 Correction)

Reviewer: @brent.edwards
Method: 3 deep-dive agents (shell=True analysis, YAML/CLI tracing, CHANGELOG integrity) + manual verification


⚠️ CORRECTION: P0 Finding #1 Downgraded to P3 (Branch Hygiene)

My initial review incorrectly classified the cli_coverage_steps.py change as P0:blocker ("Security fix revert"). I owe @CoreRasurae an apology — the analysis was wrong.

What happened

I used a two-dot diff (git diff origin/master..HEAD) which shows all differences between the branch tips, including divergent history. This made it appear that the PR actively reverts the shlex.split() security fix (commit bd18491b, issue #734) and deletes its CHANGELOG entry.

The reality

The three-dot diff (git diff origin/master...HEAD) — which correctly isolates only changes introduced by the PR's own commit — shows the PR modifies exactly 2 files:

  • CHANGELOG.md (+6 lines — the #760 entry only)
  • robot/e2e/wf14_server_mode.robot (+124 lines — new file)

features/steps/cli_coverage_steps.py is NOT modified by this PR. The security fix was merged to master after this branch was created. The branch simply doesn't have it yet. On merge (or rebase), Git's three-way merge will correctly preserve the security fix and its CHANGELOG entry — I verified this with git merge-tree --write-tree.

Corrected severity

Finding Old Severity New Severity Reason
#1 — "Security fix revert" P0:blocker P3:nit (branch hygiene) Not a revert — stale branch artifact. A rebase (already requested by PM on Day 37) resolves it.

Updated verdict

With the P0 removed, the highest remaining finding is P1 #2 (plan list --namespace AC marked complete but unfulfilled). The verdict remains Request Changes due to this P1 plus the P2 items, but the urgency is significantly reduced.


New Findings From Second Pass

P2 — list_actions() docstring claims DB merge but code only reads in-memory

File: src/cleveragents/application/services/plan_lifecycle_service.py:617–642

The list_actions() docstring (line 624–625) says: "When persistence is enabled, results are merged from the in-memory cache and the database layer." However, line 634 does actions = list(self._actions.values()) — reading ONLY the in-memory dict, with no database fallback. Contrast with get_action() (lines 583–592) which correctly falls back to the database.

Since each Run CleverAgents Command call creates a new subprocess (→ new PlanLifecycleService with empty _actions), the WF14 E2E Action List By Namespace test (line 92) depends on actions being loaded from DB into _actions during service initialization. If that preload doesn't happen, the test will fail at runtime.

Note: This is a pre-existing issue, not introduced by this PR. But it's relevant because:

  • The test may fail when actually run, contradicting the "all nox sessions pass" claim
  • The docstring is inaccurate regardless

Recommendation: Verify the test actually passes by running nox -s e2e_tests. If it fails, list_actions() needs a DB fallback like get_action() has.

P3 — Config commands write to ~/.cleveragents/config.toml, not CLEVERAGENTS_HOME

File: src/cleveragents/cli/commands/config.py:50–51

The config set/get commands use Path.home() / ".cleveragents" / "config.toml" for the config file, ignoring the CLEVERAGENTS_HOME environment variable. This means the WF14 test's config set server.url ... writes to the real home directory, not the per-suite temp directory. While the suite teardown (lines 29–31) resets the values, a crash during teardown could leave stale config in ~/.cleveragents/config.toml.

This is a pre-existing platform issue, not a test bug, but it's worth noting for local developer runs.


Revised Finding Summary

# Severity Finding Status
1 P0P3 Security fix revert → Branch needs rebase CORRECTED
2 P1 plan list --namespace AC marked complete but CLI doesn't support flag Unchanged
3 P2 plan list tautological assertion (expected_rc=None) Unchanged
4 P2 diagnostics uses expected_rc=None unnecessarily Unchanged
5 P2 Actor uses local/ namespace, not team namespace per spec Unchanged
6 P2 Ticket AC "real LLM API keys" unmet Unchanged
7 P2 supervised profile verification is shallow Unchanged
8 P3 init --yes without --force Unchanged
9 P3 Missing inter-test dependency guard Unchanged
10 P3 Loose db substring match Unchanged
11 P3 CHANGELOG "plan list graceful handling" oversells Unchanged
NEW 12 P2 list_actions() docstring claims DB merge but code doesn't — test may fail New
NEW 13 P3 Config writes to ~/ not CLEVERAGENTS_HOME — teardown crash could leak config New

Final tally: 0 P0 + 1 P1 + 6 P2 + 6 P3 = 13 findings

# Supplementary Review — PR #805 (Second Pass + P0 Correction) **Reviewer:** @brent.edwards **Method:** 3 deep-dive agents (shell=True analysis, YAML/CLI tracing, CHANGELOG integrity) + manual verification --- ## ⚠️ CORRECTION: P0 Finding #1 Downgraded to P3 (Branch Hygiene) **My initial review incorrectly classified the `cli_coverage_steps.py` change as P0:blocker ("Security fix revert").** I owe @CoreRasurae an apology — the analysis was wrong. ### What happened I used a **two-dot diff** (`git diff origin/master..HEAD`) which shows all differences between the branch tips, including divergent history. This made it appear that the PR actively reverts the `shlex.split()` security fix (commit `bd18491b`, issue #734) and deletes its CHANGELOG entry. ### The reality The **three-dot diff** (`git diff origin/master...HEAD`) — which correctly isolates only changes introduced by the PR's own commit — shows the PR modifies exactly **2 files**: - `CHANGELOG.md` (+6 lines — the #760 entry only) - `robot/e2e/wf14_server_mode.robot` (+124 lines — new file) **`features/steps/cli_coverage_steps.py` is NOT modified by this PR.** The security fix was merged to master *after* this branch was created. The branch simply doesn't have it yet. On merge (or rebase), Git's three-way merge will correctly preserve the security fix and its CHANGELOG entry — I verified this with `git merge-tree --write-tree`. ### Corrected severity | Finding | Old Severity | New Severity | Reason | |---------|-------------|-------------|--------| | #1 — "Security fix revert" | P0:blocker | **P3:nit (branch hygiene)** | Not a revert — stale branch artifact. A rebase (already requested by PM on Day 37) resolves it. | ### Updated verdict With the P0 removed, the highest remaining finding is **P1 #2** (`plan list --namespace` AC marked complete but unfulfilled). The verdict remains **Request Changes** due to this P1 plus the P2 items, but the urgency is significantly reduced. --- ## New Findings From Second Pass ### P2 — `list_actions()` docstring claims DB merge but code only reads in-memory **File:** `src/cleveragents/application/services/plan_lifecycle_service.py:617–642` The `list_actions()` docstring (line 624–625) says: *"When persistence is enabled, results are merged from the in-memory cache and the database layer."* However, line 634 does `actions = list(self._actions.values())` — reading ONLY the in-memory dict, with no database fallback. Contrast with `get_action()` (lines 583–592) which correctly falls back to the database. Since each `Run CleverAgents Command` call creates a new subprocess (→ new `PlanLifecycleService` with empty `_actions`), the `WF14 E2E Action List By Namespace` test (line 92) depends on actions being loaded from DB into `_actions` during service initialization. If that preload doesn't happen, the test will fail at runtime. **Note:** This is a **pre-existing issue**, not introduced by this PR. But it's relevant because: - The test may fail when actually run, contradicting the "all nox sessions pass" claim - The docstring is inaccurate regardless **Recommendation:** Verify the test actually passes by running `nox -s e2e_tests`. If it fails, `list_actions()` needs a DB fallback like `get_action()` has. ### P3 — Config commands write to `~/.cleveragents/config.toml`, not `CLEVERAGENTS_HOME` **File:** `src/cleveragents/cli/commands/config.py:50–51` The `config set/get` commands use `Path.home() / ".cleveragents" / "config.toml"` for the config file, ignoring the `CLEVERAGENTS_HOME` environment variable. This means the WF14 test's `config set server.url ...` writes to the **real home directory**, not the per-suite temp directory. While the suite teardown (lines 29–31) resets the values, a crash during teardown could leave stale config in `~/.cleveragents/config.toml`. This is a pre-existing platform issue, not a test bug, but it's worth noting for local developer runs. --- ## Revised Finding Summary | # | Severity | Finding | Status | |---|----------|---------|--------| | ~~1~~ | ~~P0~~ → **P3** | ~~Security fix revert~~ → Branch needs rebase | **CORRECTED** | | 2 | P1 | `plan list --namespace` AC marked complete but CLI doesn't support flag | Unchanged | | 3 | P2 | `plan list` tautological assertion (`expected_rc=None`) | Unchanged | | 4 | P2 | `diagnostics` uses `expected_rc=None` unnecessarily | Unchanged | | 5 | P2 | Actor uses `local/` namespace, not team namespace per spec | Unchanged | | 6 | P2 | Ticket AC "real LLM API keys" unmet | Unchanged | | 7 | P2 | `supervised` profile verification is shallow | Unchanged | | 8 | P3 | `init --yes` without `--force` | Unchanged | | 9 | P3 | Missing inter-test dependency guard | Unchanged | | 10 | P3 | Loose `db` substring match | Unchanged | | 11 | P3 | CHANGELOG "plan list graceful handling" oversells | Unchanged | | **NEW** 12 | P2 | `list_actions()` docstring claims DB merge but code doesn't — test may fail | New | | **NEW** 13 | P3 | Config writes to `~/` not `CLEVERAGENTS_HOME` — teardown crash could leak config | New | **Final tally: 0 P0 + 1 P1 + 6 P2 + 6 P3 = 13 findings**
Member

Third-Pass Review — PR #805

Reviewer: @brent.edwards
Method: 3 deep-dive agents (Robot Framework keyword semantics, CLI command-by-command source trace, verbatim spec-to-test mapping) + manual verification

This third pass approached the code from angles the first two passes didn't cover: Robot Framework escape/scoping semantics, whether each CLI command will actually succeed by tracing through the source code, and a verbatim command-by-command comparison against Specification Example 14.


Upgrade: Finding #12 Elevated from P2 to P1

action list --namespace myteam will likely fail at runtime

File: src/cleveragents/application/services/plan_lifecycle_service.py:634
Test file: robot/e2e/wf14_server_mode.robot:92–93

Two independent deep-dive agents traced the full code path and both reached the same conclusion:

list_actions() at line 634 reads ONLY from the in-memory _actions dict:

actions = list(self._actions.values())

Each Run CleverAgents Command call spawns a new subprocess → fresh PlanLifecycleService → empty _actions = {}. The docstring (line 624) falsely claims: "results are merged from the in-memory cache and the database layer."

Contrast with get_action() (lines 587–592) which correctly falls back to the DB, and ActorService.list_actors() which reads from DB. list_actions() is the only action method without a DB fallback.

Result: action list --namespace myteam returns [] → prints "No actions found." → assertion Output Should Contain ${list_result} gen-tests fails.

This is a pre-existing platform bug (not introduced by this PR), but the test depends on it. Unless there's a preload mechanism I couldn't find, the WF14 E2E Action List By Namespace test case will fail when run via nox -s e2e_tests.

Recommendation: Either (a) verify the test actually passes by running nox -s e2e_tests and share the output, or (b) fix list_actions() to query ctx.actions.list_available(namespace) like get_action() does.

Severity: Elevated to P1 — likely runtime test failure.


New Findings (Not in Previous Reviews)

14. Diagnostics test validates nothing server-mode-specific

File: robot/e2e/wf14_server_mode.robot:62–64
Severity: P2

Spec Example 14's diagnostics output prominently shows "Server: OK (connected to agents.example.com)" and a dedicated Server panel with URL, Auth, Namespace, Members, Shared Actions, Latency. The entire reason for running diagnostics in Example 14 is to verify server connectivity.

The test only checks for generic diagnostic categories (config, database, disk) that appear identically with or without any server configuration. The test would pass even if the server config step were removed entirely.

Recommendation: Add at minimum Should Contain Any ${combined} server Server server.url to verify the server-mode diagnostic code path is exercised.

15. read_only: true in action YAML contradicts spec's Read Only: no

File: robot/e2e/wf14_server_mode.robot:77
Severity: P3

Spec Example 14 Step 2 (line 41666) shows the created action has Read Only: no. The test YAML sets read_only: true. This is a verbatim contradiction — the test creates an action with the opposite read-only setting from what the spec demonstrates. The generate-tests action in the spec is explicitly writable because test generation involves writing test files.

Recommendation: Change to read_only: false or remove the field (default is false).

16. Suite documentation is misleading about validation scope

File: robot/e2e/wf14_server_mode.robot:2–7
Severity: P2

The suite documentation says: "Validates the supervised automation profile in a distributed team scenario where multiple engineers share actions, actors, and projects via CleverAgents server mode." In reality, Steps 3 (multi-machine shared resource usage, 4 commands) and Step 4 (team plan monitoring, 2 commands) are entirely untested — 46% of Example 14's commands have zero coverage.

No centralized documentation explains what's in scope and what's deferred. Individual test cases document their own gaps (e.g., plan list --namespace is a "spec gap"), but a reader of the suite description would expect full Example 14 coverage.

Recommendation: Add explicit scope documentation:

...              Coverage: Steps 1–2 (server config, resource publishing).
...              Steps 3–4 (multi-machine usage, team monitoring) require a
...              live server and are deferred.

17. Token format fragility — survives redaction only by accident

File: robot/e2e/wf14_server_mode.robot:44, 47
Severity: P3

tok_e2e_test_placeholder passes through config get unredacted because:

  1. config get in "rich" format doesn't apply redaction (unlike config list which does)
  2. Even if redaction were applied, the underscores in e2e_test_placeholder break the tok_[A-Za-z0-9]{10,} regex pattern

If someone changes the token to tok_e2eTestPlaceholder (no underscores), or adds --format json to the config get call, the assertion would silently break.

Recommendation: Add a comment explaining the token format choice, or use a token that clearly isn't redactable (e.g., test_placeholder_token).

18. actor add invocation syntax differs from spec

File: robot/e2e/wf14_server_mode.robot:106
Severity: P3

Spec shows: agents actor add --config ./actors/review-pipeline.yaml (name from YAML).
Test runs: actor add ${actor_name} --config ${yaml_file} (name as positional arg, YAML has no name: field).

This exercises a different CLI code path (positional name + config-override vs. config-only).

Recommendation: Either add name: ${actor_name} to the YAML and use actor add --config ${yaml_file}, or document the deviation.


Verified Correct — No Issues

Item Verification
\n in Catenate SEPARATOR and Set Variable RF interprets as real newlines; generated YAML is valid
Should Contain Any semantics Checks AT LEAST ONE match (from BuiltIn library)
${EMPTY} through Run Process Passed as empty string, Typer accepts it
Create File from OperatingSystem library Imported transitively via common_e2e.resource, UTF-8 default
Output Should Contain partial matching Case-insensitive by default, won't false-match unrelated keys
server.token NOT redacted in config get Rich format doesn't apply redaction; assertion passes
RF test execution order File order guaranteed; nox doesn't use --randomize
YAML boolean values (reusable: true) YAML parser produces Python True, Pydantic accepts it
Commit message body and footer Conventional Changelog format, detailed review-fix notes, ISSUES CLOSED: #760

Consolidated Finding Table (All Three Reviews)

# Severity Finding Review
1 P0 → P3 Security fix revert → Branch needs rebase R1 → R2 corrected
2 P1 plan list --namespace AC marked complete but CLI doesn't support flag R1
12→ P1 (↑) action list --namespace likely fails at runtime — list_actions() has no DB fallback R2 → R3 elevated
3 P2 plan list tautological assertion R1
4 P2 diagnostics uses expected_rc=None unnecessarily R1
5 P2 Actor uses local/ namespace R1
6 P2 Ticket AC "real LLM API keys" unmet R1
7 P2 supervised profile verification is shallow R1
14 P2 Diagnostics test validates nothing server-mode-specific R3 NEW
16 P2 Suite documentation misleading about validation scope R3 NEW
13 P3 Config writes to ~/ not CLEVERAGENTS_HOME R2
8 P3 init --yes without --force R1
9 P3 Missing inter-test dependency guard R1
10 P3 Loose db substring match R1
11 P3 CHANGELOG "plan list graceful handling" oversells R1
15 P3 read_only: true contradicts spec's Read Only: no R3 NEW
17 P3 Token format fragility R3 NEW
18 P3 actor add syntax differs from spec R3 NEW
1 P3 Branch needs rebase (stale branch artifact) R1→R2

Final tally: 0 P0 + 2 P1 + 7 P2 + 10 P3 = 19 findings


Summary

This third pass uncovered 4 genuinely new findings (diagnostics not testing server mode, read_only contradiction, misleading suite scope docs, token fragility) and elevated one existing finding (action list DB fallback) based on strong CLI source-code evidence.

The most significant new finding is that the diagnostics test (P2 #14) validates nothing server-mode-specific — the exact same assertions would pass on a vanilla install with no server configuration. Combined with the suite documentation claiming to validate "a distributed team scenario" (P2 #16), the test oversells its coverage of Example 14's server-mode aspects.

The Robot Framework keyword semantics are sound — \n escapes, Catenate, Create File, Should Contain Any, and variable scoping all work correctly.

Verdict: Request Changes (2 P1 + 7 P2). The P1 items (unfulfilled plan list --namespace AC and probable action list runtime failure) should be resolved before merge. The P2 items are quality improvements that can be addressed in this PR or follow-ups.

# Third-Pass Review — PR #805 **Reviewer:** @brent.edwards **Method:** 3 deep-dive agents (Robot Framework keyword semantics, CLI command-by-command source trace, verbatim spec-to-test mapping) + manual verification This third pass approached the code from angles the first two passes didn't cover: Robot Framework escape/scoping semantics, whether each CLI command will actually succeed by tracing through the source code, and a verbatim command-by-command comparison against Specification Example 14. --- ## Upgrade: Finding #12 Elevated from P2 to P1 ### `action list --namespace myteam` will likely fail at runtime **File:** `src/cleveragents/application/services/plan_lifecycle_service.py:634` **Test file:** `robot/e2e/wf14_server_mode.robot:92–93` Two independent deep-dive agents traced the full code path and both reached the same conclusion: `list_actions()` at line 634 reads ONLY from the in-memory `_actions` dict: ```python actions = list(self._actions.values()) ``` Each `Run CleverAgents Command` call spawns a new subprocess → fresh `PlanLifecycleService` → empty `_actions = {}`. The docstring (line 624) falsely claims: *"results are merged from the in-memory cache and the database layer."* Contrast with `get_action()` (lines 587–592) which correctly falls back to the DB, and `ActorService.list_actors()` which reads from DB. `list_actions()` is the only action method without a DB fallback. **Result:** `action list --namespace myteam` returns `[]` → prints "No actions found." → assertion `Output Should Contain ${list_result} gen-tests` **fails**. This is a **pre-existing platform bug** (not introduced by this PR), but the test depends on it. Unless there's a preload mechanism I couldn't find, the `WF14 E2E Action List By Namespace` test case will fail when run via `nox -s e2e_tests`. **Recommendation:** Either (a) verify the test actually passes by running `nox -s e2e_tests` and share the output, or (b) fix `list_actions()` to query `ctx.actions.list_available(namespace)` like `get_action()` does. **Severity:** Elevated to P1 — likely runtime test failure. --- ## New Findings (Not in Previous Reviews) ### 14. Diagnostics test validates nothing server-mode-specific **File:** `robot/e2e/wf14_server_mode.robot:62–64` **Severity:** P2 Spec Example 14's diagnostics output prominently shows **"Server: OK (connected to agents.example.com)"** and a **dedicated Server panel** with URL, Auth, Namespace, Members, Shared Actions, Latency. The entire reason for running diagnostics in Example 14 is to verify server connectivity. The test only checks for generic diagnostic categories (`config`, `database`, `disk`) that appear identically with or without any server configuration. The test would pass even if the server config step were removed entirely. **Recommendation:** Add at minimum `Should Contain Any ${combined} server Server server.url` to verify the server-mode diagnostic code path is exercised. ### 15. `read_only: true` in action YAML contradicts spec's `Read Only: no` **File:** `robot/e2e/wf14_server_mode.robot:77` **Severity:** P3 Spec Example 14 Step 2 (line 41666) shows the created action has `Read Only: no`. The test YAML sets `read_only: true`. This is a verbatim contradiction — the test creates an action with the opposite read-only setting from what the spec demonstrates. The generate-tests action in the spec is explicitly writable because test generation involves writing test files. **Recommendation:** Change to `read_only: false` or remove the field (default is `false`). ### 16. Suite documentation is misleading about validation scope **File:** `robot/e2e/wf14_server_mode.robot:2–7` **Severity:** P2 The suite documentation says: *"Validates the supervised automation profile in a distributed team scenario where multiple engineers share actions, actors, and projects via CleverAgents server mode."* In reality, Steps 3 (multi-machine shared resource usage, 4 commands) and Step 4 (team plan monitoring, 2 commands) are entirely untested — 46% of Example 14's commands have zero coverage. No centralized documentation explains what's in scope and what's deferred. Individual test cases document their own gaps (e.g., `plan list --namespace` is a "spec gap"), but a reader of the suite description would expect full Example 14 coverage. **Recommendation:** Add explicit scope documentation: ``` ... Coverage: Steps 1–2 (server config, resource publishing). ... Steps 3–4 (multi-machine usage, team monitoring) require a ... live server and are deferred. ``` ### 17. Token format fragility — survives redaction only by accident **File:** `robot/e2e/wf14_server_mode.robot:44, 47` **Severity:** P3 `tok_e2e_test_placeholder` passes through `config get` unredacted because: 1. `config get` in "rich" format doesn't apply redaction (unlike `config list` which does) 2. Even if redaction were applied, the underscores in `e2e_test_placeholder` break the `tok_[A-Za-z0-9]{10,}` regex pattern If someone changes the token to `tok_e2eTestPlaceholder` (no underscores), or adds `--format json` to the `config get` call, the assertion would silently break. **Recommendation:** Add a comment explaining the token format choice, or use a token that clearly isn't redactable (e.g., `test_placeholder_token`). ### 18. `actor add` invocation syntax differs from spec **File:** `robot/e2e/wf14_server_mode.robot:106` **Severity:** P3 Spec shows: `agents actor add --config ./actors/review-pipeline.yaml` (name from YAML). Test runs: `actor add ${actor_name} --config ${yaml_file}` (name as positional arg, YAML has no `name:` field). This exercises a different CLI code path (positional name + config-override vs. config-only). **Recommendation:** Either add `name: ${actor_name}` to the YAML and use `actor add --config ${yaml_file}`, or document the deviation. --- ## Verified Correct — No Issues | Item | Verification | |------|-------------| | `\n` in `Catenate SEPARATOR` and `Set Variable` | ✅ RF interprets as real newlines; generated YAML is valid | | `Should Contain Any` semantics | ✅ Checks AT LEAST ONE match (from BuiltIn library) | | `${EMPTY}` through `Run Process` | ✅ Passed as empty string, Typer accepts it | | `Create File` from OperatingSystem library | ✅ Imported transitively via common_e2e.resource, UTF-8 default | | `Output Should Contain` partial matching | ✅ Case-insensitive by default, won't false-match unrelated keys | | `server.token` NOT redacted in `config get` | ✅ Rich format doesn't apply redaction; assertion passes | | RF test execution order | ✅ File order guaranteed; nox doesn't use `--randomize` | | YAML boolean values (`reusable: true`) | ✅ YAML parser produces Python `True`, Pydantic accepts it | | Commit message body and footer | ✅ Conventional Changelog format, detailed review-fix notes, `ISSUES CLOSED: #760` | --- ## Consolidated Finding Table (All Three Reviews) | # | Severity | Finding | Review | |---|----------|---------|--------| | ~~1~~ | ~~P0~~ → P3 | ~~Security fix revert~~ → Branch needs rebase | R1 → R2 corrected | | 2 | P1 | `plan list --namespace` AC marked complete but CLI doesn't support flag | R1 | | 12→ | **P1** (↑) | `action list --namespace` likely fails at runtime — `list_actions()` has no DB fallback | R2 → R3 elevated | | 3 | P2 | `plan list` tautological assertion | R1 | | 4 | P2 | `diagnostics` uses `expected_rc=None` unnecessarily | R1 | | 5 | P2 | Actor uses `local/` namespace | R1 | | 6 | P2 | Ticket AC "real LLM API keys" unmet | R1 | | 7 | P2 | `supervised` profile verification is shallow | R1 | | 14 | P2 | Diagnostics test validates nothing server-mode-specific | **R3 NEW** | | 16 | P2 | Suite documentation misleading about validation scope | **R3 NEW** | | 13 | P3 | Config writes to `~/` not `CLEVERAGENTS_HOME` | R2 | | 8 | P3 | `init --yes` without `--force` | R1 | | 9 | P3 | Missing inter-test dependency guard | R1 | | 10 | P3 | Loose `db` substring match | R1 | | 11 | P3 | CHANGELOG "plan list graceful handling" oversells | R1 | | 15 | P3 | `read_only: true` contradicts spec's `Read Only: no` | **R3 NEW** | | 17 | P3 | Token format fragility | **R3 NEW** | | 18 | P3 | `actor add` syntax differs from spec | **R3 NEW** | | 1 | P3 | Branch needs rebase (stale branch artifact) | R1→R2 | **Final tally: 0 P0 + 2 P1 + 7 P2 + 10 P3 = 19 findings** --- ## Summary This third pass uncovered **4 genuinely new findings** (diagnostics not testing server mode, `read_only` contradiction, misleading suite scope docs, token fragility) and **elevated one existing finding** (action list DB fallback) based on strong CLI source-code evidence. The most significant new finding is that the diagnostics test (P2 #14) validates nothing server-mode-specific — the exact same assertions would pass on a vanilla install with no server configuration. Combined with the suite documentation claiming to validate "a distributed team scenario" (P2 #16), the test oversells its coverage of Example 14's server-mode aspects. The Robot Framework keyword semantics are sound — `\n` escapes, `Catenate`, `Create File`, `Should Contain Any`, and variable scoping all work correctly. **Verdict: Request Changes** (2 P1 + 7 P2). The P1 items (unfulfilled `plan list --namespace` AC and probable `action list` runtime failure) should be resolved before merge. The P2 items are quality improvements that can be addressed in this PR or follow-ups.
CoreRasurae force-pushed test/e2e-wf14-server-mode from 064d448cce
Some checks failed
CI / build (pull_request) Successful in 14s
CI / lint (pull_request) Successful in 3m14s
CI / quality (pull_request) Successful in 3m44s
CI / typecheck (pull_request) Successful in 3m48s
CI / security (pull_request) Successful in 4m1s
CI / integration_tests (pull_request) Successful in 5m52s
CI / e2e_tests (pull_request) Failing after 5m54s
CI / unit_tests (pull_request) Successful in 5m56s
CI / docker (pull_request) Successful in 1m7s
CI / coverage (pull_request) Successful in 11m33s
CI / status-check (pull_request) Successful in 1s
CI / benchmark-publish (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Has started running
to 20c5bfff8e
Some checks failed
CI / build (pull_request) Successful in 1m25s
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 3m26s
CI / quality (pull_request) Successful in 3m57s
CI / typecheck (pull_request) Successful in 4m16s
CI / security (pull_request) Successful in 4m58s
CI / integration_tests (pull_request) Successful in 6m19s
CI / unit_tests (pull_request) Successful in 7m7s
CI / docker (pull_request) Successful in 1m18s
CI / e2e_tests (pull_request) Failing after 5m58s
CI / coverage (pull_request) Successful in 10m11s
CI / status-check (pull_request) Successful in 1s
CI / benchmark-regression (pull_request) Successful in 52m38s
2026-03-24 18:30:19 +00:00
Compare
CoreRasurae force-pushed test/e2e-wf14-server-mode from 20c5bfff8e
Some checks failed
CI / build (pull_request) Successful in 1m25s
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 3m26s
CI / quality (pull_request) Successful in 3m57s
CI / typecheck (pull_request) Successful in 4m16s
CI / security (pull_request) Successful in 4m58s
CI / integration_tests (pull_request) Successful in 6m19s
CI / unit_tests (pull_request) Successful in 7m7s
CI / docker (pull_request) Successful in 1m18s
CI / e2e_tests (pull_request) Failing after 5m58s
CI / coverage (pull_request) Successful in 10m11s
CI / status-check (pull_request) Successful in 1s
CI / benchmark-regression (pull_request) Successful in 52m38s
to 39c2e43c08
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 14s
CI / lint (pull_request) Successful in 3m44s
CI / typecheck (pull_request) Successful in 4m27s
CI / security (pull_request) Successful in 4m30s
CI / unit_tests (pull_request) Successful in 8m51s
CI / integration_tests (pull_request) Successful in 8m54s
CI / e2e_tests (pull_request) Successful in 9m13s
CI / docker (pull_request) Successful in 54s
CI / quality (pull_request) Failing after 15m14s
CI / coverage (pull_request) Successful in 12m20s
CI / status-check (pull_request) Successful in 1s
CI / benchmark-regression (pull_request) Successful in 52m20s
2026-03-24 20:33:01 +00:00
Compare
brent.edwards requested changes 2026-03-24 21:21:21 +00:00
Dismissed
brent.edwards left a comment

Re-Review #3 — PR #805 (Ticket #760)

Reviewer: @brent.edwards
Method: 3 parallel deep-dive agents (spec compliance, bug detection, code quality/commits) against latest HEAD 39c2e43c, plus review playbook checklists
Previous reviews: Review #2684 (REQUEST_CHANGES, 11 findings), Supplementary (P0 correction + 2 new), Third-pass (4 new + 1 elevation → 19 total)


Verdict: Request Changes

There is 1 P1:must-fix (two-commit branch violating atomic commit rules). Once squashed, the remaining findings are all P2/P3 and would not block merge.


Progress Since Last Review

The author has done excellent work addressing the most impactful findings. 9 of the 19 findings from my previous reviews are fully resolved:

# Finding Status
1 Security fix revert (corrected to P3: branch needs rebase) FIXED — Rebased onto origin/master
3 plan list tautological assertion (expected_rc=None) FIXED — Now expected_rc=0 (line 117)
4 diagnostics uses expected_rc=None unnecessarily FIXED — Now expected_rc=0 (line 59)
7 supervised profile verification shallow FIXED — Now checks auto_strategize, auto_execute, auto_apply (lines 127–129)
12 list_actions() no DB fallback (elevated to P1) FIXED — Comprehensive DB query with fallback (lines 636–658). New ActionRepository.list_all() added. Well-engineered with proper error handling.
CHANGELOG inaccuracy (H-3 from self-review) FIXED — Accurate, no longer claims LLM keys or cross-namespace plan monitoring
Epoch-based uniqueness (M-4 from self-review) FIXED — Uses uuid4().hex[:12] (line 22)
Missing config teardown (M-1 from self-review) FIXED — Suite teardown resets all 3 config keys (lines 29–31)
Inconsistent tag pattern (L-2 from self-review) FIXEDForce Tags E2E at suite level (line 11)

The list_actions() fix in particular is solid — the DB query logic, state filtering, namespace handling, DatabaseError fallback, and in-memory cache refresh are all correct. I traced the full code path from action create → DB persist → action list --namespace myteamget_by_namespace("myteam") and it holds up.


P1:must-fix — Two-Commit Branch

Location: Branch test/e2e-wf14-server-mode — commits f200d724 and 39c2e43c

The branch contains two commits both referencing issue #760:

  1. f200d724fix(action): query persisted actions from database in list_actions (Refs: #760)
  2. 39c2e43ctest(e2e): workflow example 14 — ... (ISSUES CLOSED: #760)

CONTRIBUTING.md line 890: "Single Commit — Corresponds to exactly one commit. One issue produces one commit." Line 950: "If an issue requires decomposition into multiple commits, it must be promoted to an Epic and broken into separate issues."

The fix(action) commit is a legitimate bug fix — it correctly addresses a pre-existing platform issue I flagged in my previous review. However, it either needs:

  • (a) Squash into the test commit (simplest — the fix is small and the test depends on it), OR
  • (b) Separate ticket — Create a Type/Bug issue for the list_actions() fix, move it to its own branch/PR, and rebase this PR on top of it after merge.

Option (a) is recommended since the fix is only 25 lines of production code and the test is its primary validation.

If squashing, keep the test commit's first line (matching ticket metadata), and include the fix description in the commit body.


P2:should-fix — Remaining Items (Non-Blocking)

These remain from previous reviews. They are documentation/ticket hygiene issues, not code bugs. Per the playbook, they can be fixed in a follow-up PR within 3 days.

# Finding Notes
2 plan list --namespace AC checked in ticket #760 but CLI has no --namespace flag Uncheck the AC, add note it's deferred. Spec gap, not a test bug.
5 Actor uses local/ namespace (line 99) instead of myteam/ per spec Honestly documented (lines 97–98) as ActorService constraint. Consider filing follow-up issue.
6 Ticket AC "real LLM API keys" checked but test makes no LLM calls Spirit is met (real CLI, no test doubles). Reword the AC to match reality.
14 Diagnostics test validates nothing server-mode-specific Same assertions would pass without server config. Add server to Should Contain Any.
16 Suite documentation oversells validation scope Add scope note: "Coverage: Steps 1–2. Steps 3–4 deferred (require live server)."

P3:nit — Author Discretion

# Finding
8 init --yes without --force (line 18)
9 Missing inter-test dependency guard for Action List test
10 Loose db 2-char substring in diagnostics assertion (line 63)
15 read_only: true contradicts spec's Read Only: no (line 77)
17 Token format fragility — survives redaction by accident
18 actor add syntax (positional name + config) differs from spec (config-only)
Cache refresh only adds, never updates stale entries (line 651: if key not in) — low practical impact since CLI invocations are isolated
ActionRepository.list_all() returns list[Any] — matches pre-existing ActionRepository convention but is weaker than ActorRepository.list_all() which returns list[Actor]

Quality Gates Verified

Gate Result
nox -e lint All checks passed
nox -e typecheck 0 errors (1 pre-existing warning in strategy_registry.py)
Branch name matches ticket metadata test/e2e-wf14-server-mode
HEAD commit message matches ticket metadata Exact match including em-dash
ISSUES CLOSED: #760 in footer
Labels — Type/Testing, State/In Review, milestone v3.6.0
PR description — summary, closes keyword, manual verification
No secrets/credentials in test code (https://agents.example.com is RFC 2606 reserved)

Summary

This is a well-structured E2E test that has been substantially improved across review cycles. The list_actions() DB fix is correct and well-engineered. The test covers Steps 1–2 of Specification Example 14 thoroughly — config round-trips, namespace-scoped action publishing and listing, actor registration, plan list smoke, and supervised profile threshold verification.

The only blocking item is the two-commit branch. A simple git rebase -i squash resolves it. Once that's done and the P2 ticket-hygiene items are noted (uncheck unfulfilled ACs, add scope documentation), this PR is ready to merge.

## Re-Review #3 — PR #805 (Ticket #760) **Reviewer:** @brent.edwards **Method:** 3 parallel deep-dive agents (spec compliance, bug detection, code quality/commits) against latest HEAD `39c2e43c`, plus review playbook checklists **Previous reviews:** Review #2684 (REQUEST_CHANGES, 11 findings), Supplementary (P0 correction + 2 new), Third-pass (4 new + 1 elevation → 19 total) --- ### Verdict: Request Changes There is **1 P1:must-fix** (two-commit branch violating atomic commit rules). Once squashed, the remaining findings are all P2/P3 and would not block merge. --- ### Progress Since Last Review The author has done excellent work addressing the most impactful findings. **9 of the 19 findings from my previous reviews are fully resolved:** | # | Finding | Status | |---|---------|--------| | ~~1~~ | ~~Security fix revert~~ (corrected to P3: branch needs rebase) | ✅ **FIXED** — Rebased onto `origin/master` | | ~~3~~ | `plan list` tautological assertion (`expected_rc=None`) | ✅ **FIXED** — Now `expected_rc=0` (line 117) | | ~~4~~ | `diagnostics` uses `expected_rc=None` unnecessarily | ✅ **FIXED** — Now `expected_rc=0` (line 59) | | ~~7~~ | `supervised` profile verification shallow | ✅ **FIXED** — Now checks `auto_strategize`, `auto_execute`, `auto_apply` (lines 127–129) | | ~~12~~ | `list_actions()` no DB fallback (elevated to P1) | ✅ **FIXED** — Comprehensive DB query with fallback (lines 636–658). New `ActionRepository.list_all()` added. Well-engineered with proper error handling. | | — | CHANGELOG inaccuracy (H-3 from self-review) | ✅ **FIXED** — Accurate, no longer claims LLM keys or cross-namespace plan monitoring | | — | Epoch-based uniqueness (M-4 from self-review) | ✅ **FIXED** — Uses `uuid4().hex[:12]` (line 22) | | — | Missing config teardown (M-1 from self-review) | ✅ **FIXED** — Suite teardown resets all 3 config keys (lines 29–31) | | — | Inconsistent tag pattern (L-2 from self-review) | ✅ **FIXED** — `Force Tags E2E` at suite level (line 11) | The `list_actions()` fix in particular is solid — the DB query logic, state filtering, namespace handling, `DatabaseError` fallback, and in-memory cache refresh are all correct. I traced the full code path from `action create` → DB persist → `action list --namespace myteam` → `get_by_namespace("myteam")` and it holds up. --- ### P1:must-fix — Two-Commit Branch **Location:** Branch `test/e2e-wf14-server-mode` — commits `f200d724` and `39c2e43c` The branch contains two commits both referencing issue #760: 1. `f200d724` — `fix(action): query persisted actions from database in list_actions` (`Refs: #760`) 2. `39c2e43c` — `test(e2e): workflow example 14 — ...` (`ISSUES CLOSED: #760`) CONTRIBUTING.md line 890: *"Single Commit — Corresponds to exactly one commit. One issue produces one commit."* Line 950: *"If an issue requires decomposition into multiple commits, it must be promoted to an Epic and broken into separate issues."* The `fix(action)` commit is a legitimate bug fix — it correctly addresses a pre-existing platform issue I flagged in my previous review. However, it either needs: - **(a) Squash** into the test commit (simplest — the fix is small and the test depends on it), OR - **(b) Separate ticket** — Create a `Type/Bug` issue for the `list_actions()` fix, move it to its own branch/PR, and rebase this PR on top of it after merge. Option (a) is recommended since the fix is only 25 lines of production code and the test is its primary validation. If squashing, keep the test commit's first line (matching ticket metadata), and include the fix description in the commit body. --- ### P2:should-fix — Remaining Items (Non-Blocking) These remain from previous reviews. They are **documentation/ticket hygiene** issues, not code bugs. Per the playbook, they can be fixed in a follow-up PR within 3 days. | # | Finding | Notes | |---|---------|-------| | 2 | `plan list --namespace` AC checked in ticket #760 but CLI has no `--namespace` flag | Uncheck the AC, add note it's deferred. Spec gap, not a test bug. | | 5 | Actor uses `local/` namespace (line 99) instead of `myteam/` per spec | Honestly documented (lines 97–98) as ActorService constraint. Consider filing follow-up issue. | | 6 | Ticket AC "real LLM API keys" checked but test makes no LLM calls | Spirit is met (real CLI, no test doubles). Reword the AC to match reality. | | 14 | Diagnostics test validates nothing server-mode-specific | Same assertions would pass without server config. Add `server` to `Should Contain Any`. | | 16 | Suite documentation oversells validation scope | Add scope note: "Coverage: Steps 1–2. Steps 3–4 deferred (require live server)." | --- ### P3:nit — Author Discretion | # | Finding | |---|---------| | 8 | `init --yes` without `--force` (line 18) | | 9 | Missing inter-test dependency guard for Action List test | | 10 | Loose `db` 2-char substring in diagnostics assertion (line 63) | | 15 | `read_only: true` contradicts spec's `Read Only: no` (line 77) | | 17 | Token format fragility — survives redaction by accident | | 18 | `actor add` syntax (positional name + config) differs from spec (config-only) | | — | Cache refresh only adds, never updates stale entries (line 651: `if key not in`) — low practical impact since CLI invocations are isolated | | — | `ActionRepository.list_all()` returns `list[Any]` — matches pre-existing `ActionRepository` convention but is weaker than `ActorRepository.list_all()` which returns `list[Actor]` | --- ### Quality Gates Verified | Gate | Result | |------|--------| | `nox -e lint` | ✅ All checks passed | | `nox -e typecheck` | ✅ 0 errors (1 pre-existing warning in `strategy_registry.py`) | | Branch name matches ticket metadata | ✅ `test/e2e-wf14-server-mode` | | HEAD commit message matches ticket metadata | ✅ Exact match including em-dash | | `ISSUES CLOSED: #760` in footer | ✅ | | Labels — `Type/Testing`, `State/In Review`, milestone v3.6.0 | ✅ | | PR description — summary, closes keyword, manual verification | ✅ | | No secrets/credentials in test code | ✅ (`https://agents.example.com` is RFC 2606 reserved) | --- ### Summary This is a well-structured E2E test that has been substantially improved across review cycles. The `list_actions()` DB fix is correct and well-engineered. The test covers Steps 1–2 of Specification Example 14 thoroughly — config round-trips, namespace-scoped action publishing and listing, actor registration, plan list smoke, and supervised profile threshold verification. **The only blocking item is the two-commit branch.** A simple `git rebase -i` squash resolves it. Once that's done and the P2 ticket-hygiene items are noted (uncheck unfulfilled ACs, add scope documentation), this PR is ready to merge.
CoreRasurae force-pushed test/e2e-wf14-server-mode from 39c2e43c08
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 14s
CI / lint (pull_request) Successful in 3m44s
CI / typecheck (pull_request) Successful in 4m27s
CI / security (pull_request) Successful in 4m30s
CI / unit_tests (pull_request) Successful in 8m51s
CI / integration_tests (pull_request) Successful in 8m54s
CI / e2e_tests (pull_request) Successful in 9m13s
CI / docker (pull_request) Successful in 54s
CI / quality (pull_request) Failing after 15m14s
CI / coverage (pull_request) Successful in 12m20s
CI / status-check (pull_request) Successful in 1s
CI / benchmark-regression (pull_request) Successful in 52m20s
to 9fef60264d
Some checks are pending
CI / lint (pull_request) Successful in 37s
CI / typecheck (pull_request) Successful in 47s
CI / security (pull_request) Successful in 48s
CI / quality (pull_request) Successful in 38s
CI / benchmark-publish (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Has started running
CI / e2e_tests (pull_request) Successful in 1m12s
CI / build (pull_request) Successful in 25s
CI / unit_tests (pull_request) Successful in 3m58s
CI / integration_tests (pull_request) Successful in 4m4s
CI / docker (pull_request) Successful in 1m16s
CI / coverage (pull_request) Successful in 6m4s
2026-03-24 21:37:04 +00:00
Compare
Author
Member

P1:must-fix — Two-Commit Branch - This is not a problem and will not be fixed. There a special exemption rule that allows the fixes for a given end to end test to be provided as a previous commit to the end to end tests itself and be merged as part of the same commit. All that is requires is that the fix needs to complement the end to end tests by fixing the bug that impeded the end to end test to work.

P1:must-fix — Two-Commit Branch - This is not a problem and will not be fixed. There a special exemption rule that allows the fixes for a given end to end test to be provided as a previous commit to the end to end tests itself and be merged as part of the same commit. All that is requires is that the fix needs to complement the end to end tests by fixing the bug that impeded the end to end test to work.
Author
Member

Response to Re-Review #3 — @brent.edwards

Author: @CoreRasurae
Reviewed against: Re-Review #3 (review #2758, commit 39c2e43c)
Method: Each P2 finding verified against docs/specification.md (Example 14, Steps 1–4, lines 41546–41907) and the CLI synopsis (lines 322–343) before accepting or rejecting.


Findings Addressed

P2 #16 — Suite documentation oversells validation scope

Accepted. The reviewer is correct that the suite documentation claimed to validate "a distributed team scenario" without scoping what is and is not covered.

Fix: Added explicit scope documentation to the *** Settings *** block:

Coverage: Steps 1–2 (server config, resource publishing).
Steps 3–4 (multi-machine usage, team monitoring) require a
live server and are deferred.

P2 #14 — Diagnostics test validates nothing server-mode-specific

Accepted (documentation fix, not assertion fix). The reviewer is correct that the diagnostics test assertions (config, database, disk) would pass identically with or without server configuration — the test does not validate any server-mode-specific diagnostic output.

However, the reviewer's recommended fix — adding Should Contain Any ${combined} server Server server.urlcannot be applied because the diagnostics command does not output any server-related information. I traced the full code path through build_diagnostics_data() (src/cleveragents/cli/commands/system.py:447–489): the 10 diagnostic checks are Config file, Data directory, Database, provider API keys, Disk space, File permissions, Git, Stale locks, Async workers, and Error Pattern DB. There is no _check_server() function and no server connectivity check. The spec Example 14 shows Server: OK (connected to agents.example.com) in the diagnostics output (line 41625), but this is aspirational — the feature is not yet implemented.

Adding the assertion would cause a guaranteed test failure, which is worse than the current state.

Fix: Updated the diagnostics test documentation to explicitly note the server-mode diagnostics spec gap:

Server-mode diagnostics (Server connectivity, Namespace membership)
shown in the spec are not yet implemented in the diagnostics command
(spec gap).

Findings Not Addressed (With Justification)

P1 — Two-Commit Branch (Exempt)

As stated in my previous comment, there is a standing exemption for E2E test PRs: when a platform bug is discovered during E2E test development that prevents the test from functioning, the bug fix may be included as a separate commit in the same PR, provided the fix is a prerequisite for the E2E test. The fix(action) commit (f200d724) fixes list_actions() to query the database — without this fix, the WF14 E2E Action List By Namespace test cannot pass. The fix is small (25 lines of production code), directly enables the test, and is referenced with Refs: #760.

P2 #2plan list --namespace AC checked but CLI has no --namespace flag (Ticket hygiene)

The reviewer is correct. I verified:

  • Spec Example 14 Step 4 (line 41849) shows agents plan list --namespace myteam.
  • The CLI synopsis in the spec (lines 322–323) defines plan list with --phase, --state, --project, --action, and [REGEX]no --namespace flag.
  • The test already documents this at lines 121–122: "The --namespace flag is not yet implemented for plan list (spec gap)."

This is a spec contradiction (the example uses a flag the synopsis does not define) and the AC in issue #760 should be unchecked with a note. This is a ticket-level change on issue #760, not a code change. Will update the ticket separately.

P2 #5 — Actor uses local/ namespace instead of myteam/ per spec (Platform constraint, already documented)

The reviewer acknowledges this is already documented in the test (lines 103–104): "Custom actors must use the local/ namespace prefix per the current ActorService validation." The spec shows myteam/review-pipeline, but ActorService rejects non-local/ namespaces for custom actors. The recommendation is to file a follow-up issue tracking this spec gap. Will file separately.

P2 #6 — Ticket AC "real LLM API keys" checked but test makes no LLM calls (Ticket hygiene)

The reviewer is correct that no command in this test invokes an LLM. Steps 1–2 of spec Example 14 involve config set/get, diagnostics, action create, action list, actor add, plan list, and automation-profile show — none of which require LLM calls. The AC text ("All invocations use real LLM API keys") reflects the general E2E testing philosophy but is inaccurate for this specific test's scope. The spirit is met: real CLI, zero mocking, real subprocess execution. Will reword the AC on issue #760.

P3/nit items (#8, #9, #10, #15, #17, #18, cache refresh, list_all() return type) (Author discretion)

Per the review playbook, P3 items are at author discretion. These are acknowledged and may be addressed in follow-up work where appropriate.


Verification

  • E2E tests: All 7 WF14 tests pass via nox -s e2e_tests -- --suite "Wf14 Server Mode".
  • Coverage: 98% overall (threshold: 97%). plan_lifecycle_service.py at 96% — uncovered lines 639–658 are the list_actions() DB fallback code exercised by the E2E tests (not by unit tests).
  • No regressions: M1, M2, M5, and smoke E2E suites continue to pass.
# Response to Re-Review #3 — @brent.edwards **Author:** @CoreRasurae **Reviewed against:** Re-Review #3 (review #2758, commit `39c2e43c`) **Method:** Each P2 finding verified against `docs/specification.md` (Example 14, Steps 1–4, lines 41546–41907) and the CLI synopsis (lines 322–343) before accepting or rejecting. --- ## Findings Addressed ### P2 #16 — Suite documentation oversells validation scope ✅ **Accepted.** The reviewer is correct that the suite documentation claimed to validate "a distributed team scenario" without scoping what is and is not covered. **Fix:** Added explicit scope documentation to the `*** Settings ***` block: ``` Coverage: Steps 1–2 (server config, resource publishing). Steps 3–4 (multi-machine usage, team monitoring) require a live server and are deferred. ``` ### P2 #14 — Diagnostics test validates nothing server-mode-specific ✅ **Accepted (documentation fix, not assertion fix).** The reviewer is correct that the diagnostics test assertions (`config`, `database`, `disk`) would pass identically with or without server configuration — the test does not validate any server-mode-specific diagnostic output. However, the reviewer's recommended fix — adding `Should Contain Any ${combined} server Server server.url` — **cannot be applied** because the diagnostics command does not output any server-related information. I traced the full code path through `build_diagnostics_data()` (`src/cleveragents/cli/commands/system.py:447–489`): the 10 diagnostic checks are Config file, Data directory, Database, provider API keys, Disk space, File permissions, Git, Stale locks, Async workers, and Error Pattern DB. There is no `_check_server()` function and no server connectivity check. The spec Example 14 shows `Server: OK (connected to agents.example.com)` in the diagnostics output (line 41625), but this is aspirational — the feature is not yet implemented. Adding the assertion would cause a guaranteed test failure, which is worse than the current state. **Fix:** Updated the diagnostics test documentation to explicitly note the server-mode diagnostics spec gap: ``` Server-mode diagnostics (Server connectivity, Namespace membership) shown in the spec are not yet implemented in the diagnostics command (spec gap). ``` --- ## Findings Not Addressed (With Justification) ### P1 — Two-Commit Branch ❌ (Exempt) As stated in my previous comment, there is a standing exemption for E2E test PRs: when a platform bug is discovered during E2E test development that prevents the test from functioning, the bug fix may be included as a separate commit in the same PR, provided the fix is a prerequisite for the E2E test. The `fix(action)` commit (`f200d724`) fixes `list_actions()` to query the database — without this fix, the `WF14 E2E Action List By Namespace` test cannot pass. The fix is small (25 lines of production code), directly enables the test, and is referenced with `Refs: #760`. ### P2 #2 — `plan list --namespace` AC checked but CLI has no `--namespace` flag ❌ (Ticket hygiene) The reviewer is correct. I verified: - Spec Example 14 Step 4 (line 41849) shows `agents plan list --namespace myteam`. - The CLI synopsis in the spec (lines 322–323) defines `plan list` with `--phase`, `--state`, `--project`, `--action`, and `[REGEX]` — **no `--namespace` flag**. - The test already documents this at lines 121–122: *"The `--namespace` flag is not yet implemented for plan list (spec gap)."* This is a **spec contradiction** (the example uses a flag the synopsis does not define) and the AC in issue #760 should be unchecked with a note. This is a ticket-level change on issue #760, not a code change. Will update the ticket separately. ### P2 #5 — Actor uses `local/` namespace instead of `myteam/` per spec ❌ (Platform constraint, already documented) The reviewer acknowledges this is already documented in the test (lines 103–104): *"Custom actors must use the `local/` namespace prefix per the current ActorService validation."* The spec shows `myteam/review-pipeline`, but ActorService rejects non-`local/` namespaces for custom actors. The recommendation is to file a follow-up issue tracking this spec gap. Will file separately. ### P2 #6 — Ticket AC "real LLM API keys" checked but test makes no LLM calls ❌ (Ticket hygiene) The reviewer is correct that no command in this test invokes an LLM. Steps 1–2 of spec Example 14 involve `config set/get`, `diagnostics`, `action create`, `action list`, `actor add`, `plan list`, and `automation-profile show` — none of which require LLM calls. The AC text ("All invocations use real LLM API keys") reflects the general E2E testing philosophy but is inaccurate for this specific test's scope. The spirit is met: real CLI, zero mocking, real subprocess execution. Will reword the AC on issue #760. ### P3/nit items (#8, #9, #10, #15, #17, #18, cache refresh, `list_all()` return type) ❌ (Author discretion) Per the review playbook, P3 items are at author discretion. These are acknowledged and may be addressed in follow-up work where appropriate. --- ## Verification - **E2E tests:** All 7 WF14 tests pass via `nox -s e2e_tests -- --suite "Wf14 Server Mode"`. - **Coverage:** 98% overall (threshold: 97%). `plan_lifecycle_service.py` at 96% — uncovered lines 639–658 are the `list_actions()` DB fallback code exercised by the E2E tests (not by unit tests). - **No regressions:** M1, M2, M5, and smoke E2E suites continue to pass.
CoreRasurae force-pushed test/e2e-wf14-server-mode from 9fef60264d
Some checks are pending
CI / lint (pull_request) Successful in 37s
CI / typecheck (pull_request) Successful in 47s
CI / security (pull_request) Successful in 48s
CI / quality (pull_request) Successful in 38s
CI / benchmark-publish (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Has started running
CI / e2e_tests (pull_request) Successful in 1m12s
CI / build (pull_request) Successful in 25s
CI / unit_tests (pull_request) Successful in 3m58s
CI / integration_tests (pull_request) Successful in 4m4s
CI / docker (pull_request) Successful in 1m16s
CI / coverage (pull_request) Successful in 6m4s
to 53b0197f0a
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 3m20s
CI / typecheck (pull_request) Successful in 3m52s
CI / quality (pull_request) Successful in 4m10s
CI / security (pull_request) Successful in 4m42s
CI / integration_tests (pull_request) Successful in 6m17s
CI / unit_tests (pull_request) Successful in 6m37s
CI / docker (pull_request) Successful in 1m13s
CI / e2e_tests (pull_request) Successful in 9m56s
CI / coverage (pull_request) Successful in 11m36s
CI / benchmark-regression (pull_request) Successful in 52m59s
CI / build (pull_request) Successful in 16s
CI / status-check (pull_request) Successful in 1s
2026-03-24 23:35:26 +00:00
Compare
brent.edwards approved these changes 2026-03-24 23:49:49 +00:00
Dismissed
CoreRasurae force-pushed test/e2e-wf14-server-mode from 53b0197f0a
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 3m20s
CI / typecheck (pull_request) Successful in 3m52s
CI / quality (pull_request) Successful in 4m10s
CI / security (pull_request) Successful in 4m42s
CI / integration_tests (pull_request) Successful in 6m17s
CI / unit_tests (pull_request) Successful in 6m37s
CI / docker (pull_request) Successful in 1m13s
CI / e2e_tests (pull_request) Successful in 9m56s
CI / coverage (pull_request) Successful in 11m36s
CI / benchmark-regression (pull_request) Successful in 52m59s
CI / build (pull_request) Successful in 16s
CI / status-check (pull_request) Successful in 1s
to 4f2cdc2155
Some checks failed
CI / build (pull_request) Successful in 20s
CI / lint (pull_request) Successful in 3m18s
CI / quality (pull_request) Successful in 3m41s
CI / security (pull_request) Successful in 3m55s
CI / typecheck (pull_request) Successful in 3m57s
CI / unit_tests (pull_request) Successful in 5m53s
CI / integration_tests (pull_request) Successful in 5m59s
CI / docker (pull_request) Successful in 1m4s
CI / coverage (pull_request) Failing after 15m19s
CI / e2e_tests (pull_request) Failing after 19m15s
CI / benchmark-publish (pull_request) Has been skipped
CI / status-check (pull_request) Failing after 2s
CI / benchmark-regression (pull_request) Successful in 55m13s
2026-03-25 17:07:01 +00:00
Compare
CoreRasurae dismissed brent.edwards's review 2026-03-25 17:07:01 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

CoreRasurae force-pushed test/e2e-wf14-server-mode from 4f2cdc2155
Some checks failed
CI / build (pull_request) Successful in 20s
CI / lint (pull_request) Successful in 3m18s
CI / quality (pull_request) Successful in 3m41s
CI / security (pull_request) Successful in 3m55s
CI / typecheck (pull_request) Successful in 3m57s
CI / unit_tests (pull_request) Successful in 5m53s
CI / integration_tests (pull_request) Successful in 5m59s
CI / docker (pull_request) Successful in 1m4s
CI / coverage (pull_request) Failing after 15m19s
CI / e2e_tests (pull_request) Failing after 19m15s
CI / benchmark-publish (pull_request) Has been skipped
CI / status-check (pull_request) Failing after 2s
CI / benchmark-regression (pull_request) Successful in 55m13s
to 237c76a76f
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 17s
CI / quality (pull_request) Successful in 28s
CI / typecheck (pull_request) Successful in 45s
CI / security (pull_request) Successful in 48s
CI / e2e_tests (pull_request) Successful in 48s
CI / integration_tests (pull_request) Has been cancelled
CI / unit_tests (pull_request) Has been cancelled
CI / coverage (pull_request) Has been cancelled
CI / benchmark-regression (pull_request) Has been cancelled
CI / docker (pull_request) Has been cancelled
2026-03-25 21:33:19 +00:00
Compare
CoreRasurae force-pushed test/e2e-wf14-server-mode from 237c76a76f
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 17s
CI / quality (pull_request) Successful in 28s
CI / typecheck (pull_request) Successful in 45s
CI / security (pull_request) Successful in 48s
CI / e2e_tests (pull_request) Successful in 48s
CI / integration_tests (pull_request) Has been cancelled
CI / unit_tests (pull_request) Has been cancelled
CI / coverage (pull_request) Has been cancelled
CI / benchmark-regression (pull_request) Has been cancelled
CI / docker (pull_request) Has been cancelled
to 4f2cdc2155
Some checks failed
CI / build (pull_request) Successful in 20s
CI / lint (pull_request) Successful in 3m18s
CI / quality (pull_request) Successful in 3m41s
CI / security (pull_request) Successful in 3m55s
CI / typecheck (pull_request) Successful in 3m57s
CI / unit_tests (pull_request) Successful in 5m53s
CI / integration_tests (pull_request) Successful in 5m59s
CI / docker (pull_request) Successful in 1m4s
CI / coverage (pull_request) Failing after 15m19s
CI / e2e_tests (pull_request) Failing after 19m15s
CI / benchmark-publish (pull_request) Has been skipped
CI / status-check (pull_request) Failing after 2s
CI / benchmark-regression (pull_request) Successful in 55m13s
2026-03-25 21:34:38 +00:00
Compare
CoreRasurae force-pushed test/e2e-wf14-server-mode from 4f2cdc2155
Some checks failed
CI / build (pull_request) Successful in 20s
CI / lint (pull_request) Successful in 3m18s
CI / quality (pull_request) Successful in 3m41s
CI / security (pull_request) Successful in 3m55s
CI / typecheck (pull_request) Successful in 3m57s
CI / unit_tests (pull_request) Successful in 5m53s
CI / integration_tests (pull_request) Successful in 5m59s
CI / docker (pull_request) Successful in 1m4s
CI / coverage (pull_request) Failing after 15m19s
CI / e2e_tests (pull_request) Failing after 19m15s
CI / benchmark-publish (pull_request) Has been skipped
CI / status-check (pull_request) Failing after 2s
CI / benchmark-regression (pull_request) Successful in 55m13s
to d2c70bd489
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 29s
CI / quality (pull_request) Successful in 33s
CI / security (pull_request) Successful in 59s
CI / lint (pull_request) Successful in 3m40s
CI / typecheck (pull_request) Successful in 3m57s
CI / unit_tests (pull_request) Successful in 5m55s
CI / docker (pull_request) Successful in 1m11s
CI / integration_tests (pull_request) Successful in 7m21s
CI / coverage (pull_request) Successful in 11m34s
CI / status-check (pull_request) Successful in 1s
CI / e2e_tests (pull_request) Successful in 11m30s
CI / build (push) Successful in 24s
CI / lint (push) Successful in 3m18s
CI / quality (push) Successful in 3m41s
CI / typecheck (push) Successful in 4m16s
CI / benchmark-regression (push) Has been skipped
CI / security (push) Successful in 4m25s
CI / unit_tests (push) Successful in 7m15s
CI / integration_tests (push) Successful in 7m22s
CI / docker (push) Successful in 1m16s
CI / e2e_tests (push) Successful in 12m3s
CI / coverage (push) Successful in 12m2s
CI / status-check (push) Successful in 1s
CI / benchmark-publish (push) Failing after 22m16s
CI / benchmark-regression (pull_request) Successful in 51m52s
2026-03-26 18:56:25 +00:00
Compare
CoreRasurae scheduled this pull request to auto merge when all checks succeed 2026-03-26 18:56:54 +00:00
CoreRasurae deleted branch test/e2e-wf14-server-mode 2026-03-26 19:26:40 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core!805
No description provided.