test(e2e): workflow example 14 — server mode team collaboration (supervised profile) #805

2026-03-13T01:43:33Z

CoreRasurae commented

2026-03-13 01:43:33 +00:00

Summary

E2E test for Workflow Example 14 — server mode team collaboration using the supervised profile. Tests CLI configuration for server settings, action/actor registration, and profile inspection (no live server required).

Closes #760

ISSUES CLOSED: #760

Manual Verification

Prerequisites

No LLM API key required (this test does not make LLM calls)

Commands

WORKDIR=$(mktemp -d) && cd "$WORKDIR"
python -m cleveragents init --yes --force

# 1. Server configuration (WF14-specific)
python -m cleveragents config set server.url http://localhost:8080
python -m cleveragents config get server.url
# → Look for: "http://localhost:8080"

python -m cleveragents config set core.namespace team-workspace
python -m cleveragents config get core.namespace
# → Look for: "team-workspace"

# 2. Diagnostics
python -m cleveragents diagnostics --format plain
# → Look for: system diagnostics output (versions, paths, config)

# 3. Action and actor registration
python -m cleveragents action create --config action.yaml
python -m cleveragents action show action-name
# → Look for: action details

python -m cleveragents actor add --config actor.yaml
python -m cleveragents actor list --format plain
# → Look for: registered actors listed

# 4. Profile inspection
python -m cleveragents automation-profile show supervised
# → Look for: supervised profile details with guard settings

# 5. Plan listing
python -m cleveragents plan list
# → Look for: empty list or existing plans

What to Look For

config set/get round-trips server.url and core.namespace
diagnostics outputs system information without error
Action and actor registration succeeds
automation-profile show displays profile details
No Traceback in any command's stderr

## Summary E2E test for Workflow Example 14 — server mode team collaboration using the supervised profile. Tests CLI configuration for server settings, action/actor registration, and profile inspection (no live server required). Closes #760 ISSUES CLOSED: #760 ## Manual Verification ### Prerequisites - No LLM API key required (this test does not make LLM calls) ### Commands ```bash WORKDIR=$(mktemp -d) && cd "$WORKDIR" python -m cleveragents init --yes --force # 1. Server configuration (WF14-specific) python -m cleveragents config set server.url http://localhost:8080 python -m cleveragents config get server.url # → Look for: "http://localhost:8080" python -m cleveragents config set core.namespace team-workspace python -m cleveragents config get core.namespace # → Look for: "team-workspace" # 2. Diagnostics python -m cleveragents diagnostics --format plain # → Look for: system diagnostics output (versions, paths, config) # 3. Action and actor registration python -m cleveragents action create --config action.yaml python -m cleveragents action show action-name # → Look for: action details python -m cleveragents actor add --config actor.yaml python -m cleveragents actor list --format plain # → Look for: registered actors listed # 4. Profile inspection python -m cleveragents automation-profile show supervised # → Look for: supervised profile details with guard settings # 5. Plan listing python -m cleveragents plan list # → Look for: empty list or existing plans ``` ### What to Look For - `config set/get` round-trips server.url and core.namespace - `diagnostics` outputs system information without error - Action and actor registration succeeds - `automation-profile show` displays profile details - No `Traceback` in any command's stderr

CoreRasurae added this to the v3.6.0 milestone 2026-03-13 01:43:51 +00:00

CoreRasurae added the

Type

Testing

label 2026-03-13 01:43:54 +00:00

CoreRasurae added a new dependency 2026-03-13 01:46:03 +00:00

#760 test(e2e): workflow example 14 — server mode team collaboration (supervised profile)

freemo force-pushed test/e2e-wf14-server-mode from 8ecaa82262 to 92aa055ac5

2026-03-13 18:46:49 +00:00

Compare

freemo added the

State

In Review

label 2026-03-13 21:16:38 +00:00

freemo force-pushed test/e2e-wf14-server-mode from 92aa055ac5 to 3d240a890e

2026-03-13 23:19:48 +00:00

Compare

freemo added the

Priority

Medium

label 2026-03-14 04:10:15 +00:00

freemo commented

2026-03-14 04:45:20 +00:00

PM Review — Day 34

Status: Mergeable, 0 reviews, M7 (v3.6.0)
Author: @CoreRasurae

E2E test for WF14 (server mode team collaboration, supervised profile).

Action Items

Who	Action	Deadline
@hamza.khyari	Peer review	Day 38

## PM Review — Day 34 **Status**: Mergeable, 0 reviews, M7 (v3.6.0) **Author**: @CoreRasurae E2E test for WF14 (server mode team collaboration, supervised profile). ### Action Items | Who | Action | Deadline | |-----|--------|----------| | @hamza.khyari | **Peer review** | Day 38 |

freemo added a new dependency 2026-03-16 02:42:17 +00:00

#627 Implement @tdd_expected_fail tag handling in Behave environment

freemo added a new dependency 2026-03-16 02:42:17 +00:00

#628 Implement @tdd_expected_fail tag handling in Robot Framework

freemo added a new dependency 2026-03-16 02:42:17 +00:00

#965 refactor(testing): rename tdd_bug/tdd_bug_N tags to tdd_issue/tdd_issue_N across Behave and Robot Framework

freemo commented

2026-03-16 09:34:30 +00:00

PM Status — Day 36 (2026-03-16)

Day 34 review assignment deadline check. 0 reviewer activity after 2 days.

Assigned reviewer: Please acknowledge and provide an ETA for review. Prioritize M3 PRs first, then M4+ in milestone order.

## PM Status — Day 36 (2026-03-16) Day 34 review assignment deadline check. 0 reviewer activity after 2 days. **Assigned reviewer**: Please acknowledge and provide an ETA for review. Prioritize M3 PRs first, then M4+ in milestone order.

freemo reviewed 2026-03-16 16:26:28 +00:00

freemo left a comment

PM Day 36: M7 E2E test. Merge conflict. @CoreRasurae rebase needed.

freemo commented

2026-03-17 18:23:17 +00:00

PM Status — Day 37 — Rebase Required

This PR has merge conflicts and cannot be merged in its current state. 42% of all open PRs (21 of 50) have conflicts — this is a project-wide issue that must be resolved.

@CoreRasurae — Please rebase this PR onto master by Day 39 EOD (2026-03-19). If you cannot rebase by then, please post a comment explaining the blocker.

PM rebase request — Day 37

## PM Status — Day 37 — Rebase Required This PR has **merge conflicts** and cannot be merged in its current state. 42% of all open PRs (21 of 50) have conflicts — this is a project-wide issue that must be resolved. @CoreRasurae — Please rebase this PR onto `master` by **Day 39 EOD (2026-03-19)**. If you cannot rebase by then, please post a comment explaining the blocker. --- *PM rebase request — Day 37*

CoreRasurae force-pushed test/e2e-wf14-server-mode from 3d240a890e to 8ecaa82262

2026-03-18 21:11:40 +00:00

Compare

CoreRasurae force-pushed test/e2e-wf14-server-mode from 8ecaa82262 to 91ccbf4aab

2026-03-18 21:20:24 +00:00

Compare

CoreRasurae force-pushed test/e2e-wf14-server-mode from 91ccbf4aab to 4e134ea2ba

2026-03-18 22:08:52 +00:00

Compare

CoreRasurae commented

2026-03-18 22:26:24 +00:00

Code Review Report — PR #805 (`test/e2e-wf14-server-mode`)

Reviewer: Automated deep review (3 global cycles across all categories)
Scope: All changes on branch test/e2e-wf14-server-mode relative to master, cross-referenced with issue #760 acceptance criteria and docs/specification.md Example 14.
Commit: 4e134ea2 by Luis Mendes — test(e2e): workflow example 14 — server mode team collaboration (supervised profile)

Files changed: robot/e2e/wf14_server_mode.robot (new, 89 lines), CHANGELOG.md (+6 lines), src/cleveragents/infrastructure/sandbox/strategy_registry.py (4 lines renamed)

Summary

The PR adds an E2E Robot Framework test for Specification Workflow Example 14 (Server Mode — Team Collaboration) and includes a minor parameter rename in strategy_registry.py. The test structure follows established patterns from common_e2e.resource, but multiple gaps exist between the test coverage and the issue acceptance criteria / specification requirements. Several test-engineering concerns also need attention.

Findings: 12 total — 3 High, 5 Medium, 4 Low

HIGH Severity

H-1. [Test Coverage] Issue AC not fulfilled — missing `action list --namespace` and `plan list --namespace` tests

File: robot/e2e/wf14_server_mode.robot

Issue #760 AC states:

"Test lists shared actions from the namespace"
"Test monitors plans across the namespace via plan list --namespace"

Specification Example 14 Steps 2 and 4 show:

agents action list --namespace myteam
agents plan list --namespace myteam

Neither command is present in the test. The test runs action show (line 58) but never action list --namespace, and plan list (line 81) without --namespace.

Moreover, investigation confirms plan list (and plan lifecycle-list) do not support a --namespace flag in the current CLI implementation. The action list command does support --namespace.

Impact: Two acceptance criteria are unmet. The plan list --namespace gap also reveals an unimplemented CLI feature the spec assumes exists.

Recommendation: Add action list --namespace myteam assertion. For plan list, either (a) document that the --namespace flag is not yet implemented and defer that AC, or (b) implement the flag first.

H-2. [Test Coverage] Actions and actors created in `local/` namespace instead of team namespace

File: robot/e2e/wf14_server_mode.robot, lines 44, 66

The test claims to validate "team collaboration" with a shared namespace. The spec shows actions published as myteam/generate-tests and actors as myteam/review-pipeline. However, the test hardcodes local/ prefixes:

Line 44: local/gen-tests-${epoch}
Line 66: local/team-rev-${epoch}

The local/ prefix explicitly overrides the configured core.namespace = myteam, so the test never actually exercises namespace-scoped publishing.

Impact: The core "team collaboration via shared namespace" scenario from the spec is not validated.

Recommendation: Create entities without the local/ prefix (e.g., gen-tests-${epoch}) or use the myteam/ prefix so core.namespace is exercised.

H-3. [Test Coverage] CHANGELOG claims do not match actual test coverage

File: CHANGELOG.md, lines 145-149

The CHANGELOG entry claims: "shared action listing, and plan monitoring across namespaces using the supervised automation profile via real CLI with real LLM API keys."

Three inaccuracies:

"shared action listing" — no action list command is tested
"plan monitoring across namespaces" — plan list is tested without --namespace
"real LLM API keys" — the test does not use or require LLM API keys (no Skip If No LLM Keys guard, no LLM-dependent commands)

Impact: Misleading documentation for future maintainers and reviewers.

Recommendation: Align the CHANGELOG with what the test actually exercises, or expand the test to cover the claimed behavior.

MEDIUM Severity

M-1. [Test Flaw] Missing config cleanup/teardown for persistent state

File: robot/e2e/wf14_server_mode.robot, lines 17-25

The WF14 E2E Server Config Setup test sets server.url and core.namespace in the config but has no [Teardown] to restore them. By comparison, M6 tests consistently use [Teardown] blocks (e.g., config set core.automation-profile manual).

Config changes persist in the config.toml file under cwd. If other test suites share the same data directory or run sequentially in the same workspace, these leaked settings (server.url=https://agents.example.com, core.namespace=myteam) could cause interference.

Recommendation: Add [Teardown] to the config test case, or add a suite-level [Suite Teardown] step that resets server.url and core.namespace.

M-2. [Test Flaw] Missing `server.token` configuration per specification

File: robot/e2e/wf14_server_mode.robot

Specification Example 14 Step 1 includes three config steps: server.url, server.token, and core.namespace. The test covers server.url (line 17) and core.namespace (line 22) but skips server.token. Since server.token is the authentication mechanism for server mode, testing its configuration round-trip would strengthen coverage of the server mode setup flow.

Recommendation: Add a config set / config get round-trip for server.token.

M-3. [Test Flaw] `plan list` test verifies non-crash only, not meaningful behavior

File: robot/e2e/wf14_server_mode.robot, lines 78-83

The test uses expected_rc=None (accepts any exit code) and only asserts Should Not Be Empty ${result.stdout}${result.stderr}. Since plan list without initialization prints an error message ("No project found. Run 'agents init' first.") and aborts, the assertion passes on the error output. The test documentation says "Verify that the plan list command executes successfully" but does not actually verify success.

Recommendation: Either (a) call agents init --yes --force before this test and check for expected_rc=0, or (b) adjust the documentation to state it verifies graceful failure rather than success.

M-4. [Test Flaw] Epoch-based name uniqueness may fail in parallel CI

File: robot/e2e/wf14_server_mode.robot, lines 43, 65

Uses int(__import__('time').time()) (second-level granularity) for unique name suffixes. In parallel CI runners executing within the same second, this produces identical names and risks UNIQUE constraint violations. The M6 acceptance test uses uuid4().hex[:12] which is collision-resistant even under parallelism.

Recommendation: Replace epoch with uuid4().hex[:12] for consistency with established patterns and CI safety.

M-5. [Test Flaw] Misleading test documentation in diagnostics test

File: robot/e2e/wf14_server_mode.robot, lines 28-30

The documentation states: "Server connectivity will fail (no real server) so the return code is not checked." However, investigation of system.py:build_diagnostics_data() confirms there is no server connectivity check in the diagnostics command. The 10 diagnostic checks are all local (config file, data directory, database, API keys, disk space, file permissions, git, stale locks, async workers, error patterns). The expected_rc=None is unnecessary since diagnostics should succeed (rc=0) even without a server.

Recommendation: Update the documentation to reflect the actual diagnostic checks, and consider setting expected_rc=0 since no server check exists.

LOW Severity

L-1. [Code Quality] Unrelated `cls` to `klass` rename bundled in E2E test commit

File: src/cleveragents/infrastructure/sandbox/strategy_registry.py, lines 262-297

The rename of the cls parameter to klass in the _validate_protocol staticmethod is a valid improvement (avoids confusion with the classmethod cls convention) but is unrelated to issue #760 or the E2E test work. Bundling unrelated changes in a test commit makes git history less navigable.

Recommendation: Separate this into its own commit or PR for clean history.

L-2. [Code Quality] Inconsistent tag pattern across E2E suites

File: robot/e2e/wf14_server_mode.robot

The WF14 test applies [Tags] E2E per test case (lines 15, 31, 41, 63, 80, 87), while m6_acceptance.robot uses Force Tags E2E at suite level. Both are valid, but Force Tags is cleaner when all tests in a suite share the same tag.

Recommendation: Consider using Force Tags E2E at suite level for consistency with M6.

L-3. [Code Quality] Minimal actor YAML may not exercise full registration flow

File: robot/e2e/wf14_server_mode.robot, lines 67-69

The actor YAML contains only provider: openai and model: gpt-4. While this may be sufficient for actor add, the spec example shows a richer actor configuration with type, capabilities, tools, and graph nodes. A minimal YAML exercises only the happy path minimum.

Recommendation: Consider adding at least description and type fields to better exercise the actor registration flow.

L-4. [Security] No issues found — good practices observed

The test correctly uses https://agents.example.com (RFC 2606 reserved domain) for the server URL. No real tokens, API keys, or sensitive data appear in the test code.

Information / Positive Notes

The strategy_registry.py rename is functionally correct and all 4 references (parameter, docstring, getattr, f-string) are consistently updated.
YAML files created during tests are properly cleaned up by E2E Suite Teardown (recursive removal of ${SUITE_HOME}).
The Create File keyword correctly resolves from the OperatingSystem library imported transitively via common_e2e.resource.
The Should Contain Any assertions in the diagnostics test correctly match the actual diagnostic category names ("Config file", "Database", "Disk space").
The automation-profile show supervised test (line 88) correctly targets a built-in profile that doesn't require database initialization.

Verdict

REQUEST CHANGES — The test has structural correctness but significant coverage gaps relative to issue #760 acceptance criteria and the specification. The high-severity items (H-1, H-2, H-3) should be addressed before merge. Medium-severity items (M-1 through M-5) represent test-engineering improvements that strengthen reliability and maintainability.

# Code Review Report — PR #805 (`test/e2e-wf14-server-mode`) **Reviewer**: Automated deep review (3 global cycles across all categories) **Scope**: All changes on branch `test/e2e-wf14-server-mode` relative to `master`, cross-referenced with issue #760 acceptance criteria and `docs/specification.md` Example 14. **Commit**: `4e134ea2` by Luis Mendes — `test(e2e): workflow example 14 — server mode team collaboration (supervised profile)` **Files changed**: `robot/e2e/wf14_server_mode.robot` (new, 89 lines), `CHANGELOG.md` (+6 lines), `src/cleveragents/infrastructure/sandbox/strategy_registry.py` (4 lines renamed) --- ## Summary The PR adds an E2E Robot Framework test for Specification Workflow Example 14 (Server Mode — Team Collaboration) and includes a minor parameter rename in `strategy_registry.py`. The test structure follows established patterns from `common_e2e.resource`, but multiple gaps exist between the test coverage and the issue acceptance criteria / specification requirements. Several test-engineering concerns also need attention. **Findings**: 12 total — 3 High, 5 Medium, 4 Low --- ## HIGH Severity ### H-1. [Test Coverage] Issue AC not fulfilled — missing `action list --namespace` and `plan list --namespace` tests **File**: `robot/e2e/wf14_server_mode.robot` Issue #760 AC states: - "Test lists shared actions from the namespace" - "Test monitors plans across the namespace via `plan list --namespace`" Specification Example 14 Steps 2 and 4 show: - `agents action list --namespace myteam` - `agents plan list --namespace myteam` Neither command is present in the test. The test runs `action show` (line 58) but never `action list --namespace`, and `plan list` (line 81) without `--namespace`. Moreover, investigation confirms `plan list` (and `plan lifecycle-list`) do **not** support a `--namespace` flag in the current CLI implementation. The `action list` command does support `--namespace`. **Impact**: Two acceptance criteria are unmet. The `plan list --namespace` gap also reveals an unimplemented CLI feature the spec assumes exists. **Recommendation**: Add `action list --namespace myteam` assertion. For `plan list`, either (a) document that the `--namespace` flag is not yet implemented and defer that AC, or (b) implement the flag first. --- ### H-2. [Test Coverage] Actions and actors created in `local/` namespace instead of team namespace **File**: `robot/e2e/wf14_server_mode.robot`, lines 44, 66 The test claims to validate "team collaboration" with a shared namespace. The spec shows actions published as `myteam/generate-tests` and actors as `myteam/review-pipeline`. However, the test hardcodes `local/` prefixes: - Line 44: `local/gen-tests-${epoch}` - Line 66: `local/team-rev-${epoch}` The `local/` prefix explicitly overrides the configured `core.namespace = myteam`, so the test never actually exercises namespace-scoped publishing. **Impact**: The core "team collaboration via shared namespace" scenario from the spec is not validated. **Recommendation**: Create entities without the `local/` prefix (e.g., `gen-tests-${epoch}`) or use the `myteam/` prefix so `core.namespace` is exercised. --- ### H-3. [Test Coverage] CHANGELOG claims do not match actual test coverage **File**: `CHANGELOG.md`, lines 145-149 The CHANGELOG entry claims: "shared action listing, and plan monitoring across namespaces using the `supervised` automation profile via real CLI with real LLM API keys." Three inaccuracies: 1. **"shared action listing"** — no `action list` command is tested 2. **"plan monitoring across namespaces"** — `plan list` is tested without `--namespace` 3. **"real LLM API keys"** — the test does not use or require LLM API keys (no `Skip If No LLM Keys` guard, no LLM-dependent commands) **Impact**: Misleading documentation for future maintainers and reviewers. **Recommendation**: Align the CHANGELOG with what the test actually exercises, or expand the test to cover the claimed behavior. --- ## MEDIUM Severity ### M-1. [Test Flaw] Missing config cleanup/teardown for persistent state **File**: `robot/e2e/wf14_server_mode.robot`, lines 17-25 The `WF14 E2E Server Config Setup` test sets `server.url` and `core.namespace` in the config but has no `[Teardown]` to restore them. By comparison, M6 tests consistently use `[Teardown]` blocks (e.g., `config set core.automation-profile manual`). Config changes persist in the `config.toml` file under `cwd`. If other test suites share the same data directory or run sequentially in the same workspace, these leaked settings (`server.url=https://agents.example.com`, `core.namespace=myteam`) could cause interference. **Recommendation**: Add `[Teardown]` to the config test case, or add a suite-level `[Suite Teardown]` step that resets `server.url` and `core.namespace`. --- ### M-2. [Test Flaw] Missing `server.token` configuration per specification **File**: `robot/e2e/wf14_server_mode.robot` Specification Example 14 Step 1 includes three config steps: `server.url`, `server.token`, and `core.namespace`. The test covers `server.url` (line 17) and `core.namespace` (line 22) but skips `server.token`. Since `server.token` is the authentication mechanism for server mode, testing its configuration round-trip would strengthen coverage of the server mode setup flow. **Recommendation**: Add a `config set` / `config get` round-trip for `server.token`. --- ### M-3. [Test Flaw] `plan list` test verifies non-crash only, not meaningful behavior **File**: `robot/e2e/wf14_server_mode.robot`, lines 78-83 The test uses `expected_rc=None` (accepts any exit code) and only asserts `Should Not Be Empty ${result.stdout}${result.stderr}`. Since `plan list` without initialization prints an error message ("No project found. Run 'agents init' first.") and aborts, the assertion passes on the error output. The test documentation says "Verify that the plan list command executes successfully" but does not actually verify success. **Recommendation**: Either (a) call `agents init --yes --force` before this test and check for `expected_rc=0`, or (b) adjust the documentation to state it verifies graceful failure rather than success. --- ### M-4. [Test Flaw] Epoch-based name uniqueness may fail in parallel CI **File**: `robot/e2e/wf14_server_mode.robot`, lines 43, 65 Uses `int(__import__('time').time())` (second-level granularity) for unique name suffixes. In parallel CI runners executing within the same second, this produces identical names and risks UNIQUE constraint violations. The M6 acceptance test uses `uuid4().hex[:12]` which is collision-resistant even under parallelism. **Recommendation**: Replace epoch with `uuid4().hex[:12]` for consistency with established patterns and CI safety. --- ### M-5. [Test Flaw] Misleading test documentation in diagnostics test **File**: `robot/e2e/wf14_server_mode.robot`, lines 28-30 The documentation states: "Server connectivity will fail (no real server) so the return code is not checked." However, investigation of `system.py:build_diagnostics_data()` confirms there is **no server connectivity check** in the diagnostics command. The 10 diagnostic checks are all local (config file, data directory, database, API keys, disk space, file permissions, git, stale locks, async workers, error patterns). The `expected_rc=None` is unnecessary since diagnostics should succeed (rc=0) even without a server. **Recommendation**: Update the documentation to reflect the actual diagnostic checks, and consider setting `expected_rc=0` since no server check exists. --- ## LOW Severity ### L-1. [Code Quality] Unrelated `cls` to `klass` rename bundled in E2E test commit **File**: `src/cleveragents/infrastructure/sandbox/strategy_registry.py`, lines 262-297 The rename of the `cls` parameter to `klass` in the `_validate_protocol` staticmethod is a valid improvement (avoids confusion with the classmethod `cls` convention) but is unrelated to issue #760 or the E2E test work. Bundling unrelated changes in a test commit makes git history less navigable. **Recommendation**: Separate this into its own commit or PR for clean history. --- ### L-2. [Code Quality] Inconsistent tag pattern across E2E suites **File**: `robot/e2e/wf14_server_mode.robot` The WF14 test applies `[Tags] E2E` per test case (lines 15, 31, 41, 63, 80, 87), while `m6_acceptance.robot` uses `Force Tags E2E` at suite level. Both are valid, but `Force Tags` is cleaner when all tests in a suite share the same tag. **Recommendation**: Consider using `Force Tags E2E` at suite level for consistency with M6. --- ### L-3. [Code Quality] Minimal actor YAML may not exercise full registration flow **File**: `robot/e2e/wf14_server_mode.robot`, lines 67-69 The actor YAML contains only `provider: openai` and `model: gpt-4`. While this may be sufficient for `actor add`, the spec example shows a richer actor configuration with type, capabilities, tools, and graph nodes. A minimal YAML exercises only the happy path minimum. **Recommendation**: Consider adding at least `description` and `type` fields to better exercise the actor registration flow. --- ### L-4. [Security] No issues found — good practices observed The test correctly uses `https://agents.example.com` (RFC 2606 reserved domain) for the server URL. No real tokens, API keys, or sensitive data appear in the test code. --- ## Information / Positive Notes - The `strategy_registry.py` rename is functionally correct and all 4 references (parameter, docstring, getattr, f-string) are consistently updated. - YAML files created during tests are properly cleaned up by `E2E Suite Teardown` (recursive removal of `${SUITE_HOME}`). - The `Create File` keyword correctly resolves from the `OperatingSystem` library imported transitively via `common_e2e.resource`. - The `Should Contain Any` assertions in the diagnostics test correctly match the actual diagnostic category names ("Config file", "Database", "Disk space"). - The `automation-profile show supervised` test (line 88) correctly targets a built-in profile that doesn't require database initialization. --- ## Verdict **REQUEST CHANGES** — The test has structural correctness but significant coverage gaps relative to issue #760 acceptance criteria and the specification. The high-severity items (H-1, H-2, H-3) should be addressed before merge. Medium-severity items (M-1 through M-5) represent test-engineering improvements that strengthen reliability and maintainability.

CoreRasurae force-pushed test/e2e-wf14-server-mode from 4e134ea2ba to 8f3c15aa08

2026-03-18 22:55:28 +00:00

Compare

freemo requested review from brent.edwards 2026-03-22 16:40:05 +00:00

CoreRasurae force-pushed test/e2e-wf14-server-mode from 8f3c15aa08 to 064d448cce

2026-03-23 23:05:12 +00:00

Compare

brent.edwards requested changes 2026-03-23 23:25:47 +00:00

Dismissed

brent.edwards left a comment

Code Review — PR #805 (Ticket #760)

Branch: test/e2e-wf14-server-mode
Author: @CoreRasurae
Reviewer: @brent.edwards
Method: Multi-focus parallel review (7 threads: spec compliance, bug detection, test quality, code quality, security, strategy_registry analysis, stale review triage) + fresh-eyes pass

Verdict: Request Changes

There is 1 P0:blocker finding (security fix revert), 1 P1:must-fix (unmet acceptance criterion), and 5 P2:should-fix items. The P0 alone requires immediate attention before merge.

Note on stale self-review: The detailed self-review by @CoreRasurae (comment #67681, against commit 4e134ea2) identified 12 issues. The current commit 064d448c addresses 6 of 8 HIGH/MEDIUM findings. H-1 and H-2 are partially fixed; M-3 remains. The strategy_registry.py unrelated change (L-1) was correctly removed. Good rebase work overall.

P0:blocker — Must Fix Immediately

1. Security fix revert: `shell=True` re-introduced in `cli_coverage_steps.py`

File: features/steps/cli_coverage_steps.py:43–44

This PR reverts the security fix from commit bd18491b (fix(test): replace shell=True with shell=False in cli_coverage_steps.py subprocess call, issue #734). The diff shows:

import shlex removed
shlex.split(command) replaced back with command + shell=True
The CHANGELOG entry documenting the #734 fix is also deleted

The original fix was explicitly described as "defense-in-depth command injection prevention, consistent with the existing pattern in cli_plan_context_commands_steps.py." Reverting it re-introduces shell=True subprocess execution, which the project's security standards prohibit. The comment says "Use shell=True to handle quoted arguments correctly" — suggesting the fix broke a test scenario.

Impact: P0 — A security fix merged to master 4 days ago is being silently reverted in an unrelated E2E test PR with no justification or discussion. Even in test code, shell=True is a security concern per the project's own standards and the original fix's rationale.

Recommendation: Either (a) keep the shlex.split() approach and fix whatever test it broke (the correct solution), or (b) if shell=True is genuinely needed for quoted argument handling, explicitly document WHY in a code comment, add # nosec B602 with justification, and create a follow-up issue to find a secure alternative. Do NOT silently revert a security fix. The CHANGELOG entry for #734 must also be restored.

P1:must-fix — Must Fix Before Merge

2. Ticket AC "`plan list --namespace`" marked complete but unfulfilled

File: robot/e2e/wf14_server_mode.robot:112–119

Ticket #760 acceptance criterion states: "Test monitors plans across the namespace via plan list --namespace" — this checkbox is marked done in the ticket. Specification Example 14 Step 4 explicitly shows agents plan list --namespace myteam. However, the CLI's plan list command does not support a --namespace flag (confirmed in cli/commands/plan.py). The test runs bare plan list with expected_rc=None.

The test documentation (lines 115–116) honestly notes this as a "spec gap," which is good — but the ticket AC is marked complete when the condition cannot be fulfilled.

Recommendation: Uncheck this AC item in ticket #760 and add a note that it's deferred until the --namespace flag is implemented. A checked-off AC that isn't actually validated is misleading for milestone tracking.

P2:should-fix — Fix in Follow-Up PR (Within 3 Days)

3. `plan list` test assertion is tautological

File: robot/e2e/wf14_server_mode.robot:117–119

expected_rc=None (accepts any exit code) + Should Not Be Empty ${result.stdout}${result.stderr} means this test passes whenever the command produces any output on stdout or stderr — including error messages, tracebacks, or deprecation warnings. Since init --yes was already called in suite setup, plan list should succeed with rc=0.

Recommendation: Change to expected_rc=0. If plan list genuinely fails after init, that's a bug worth catching rather than masking with a permissive assertion.

4. `diagnostics` test uses `expected_rc=None` unnecessarily

File: robot/e2e/wf14_server_mode.robot:59

The documentation correctly notes diagnostics checks are all local and should succeed without a live server. Yet expected_rc=None is used. Without --check, diagnostics always returns rc=0.

Recommendation: Change to expected_rc=0 to match the documentation's own assertion that the command should succeed.

5. Actor uses `local/` namespace, contradicting "team collaboration" scenario

File: robot/e2e/wf14_server_mode.robot:99

The actor is created with local/team-rev-${RUN_SUFFIX}. Spec Example 14 shows myteam/review-pipeline. The action was correctly fixed to use myteam/ prefix, but the actor was not. The documentation (lines 97–98) explains this is due to ActorService validation, which is honest — but means the "team collaboration" scenario is only half-exercised for namespace behavior.

Recommendation: If ActorService requires local/, file a follow-up issue tracking this spec gap and reference it in the test documentation.

6. Ticket AC "real LLM API keys" unmet

File: robot/e2e/wf14_server_mode.robot (entire file)

Ticket #760 AC states: "All invocations use real LLM API keys — no mocking, stubbing, or test doubles." No command in this test invokes an LLM. The PR description itself says "No LLM API key required." The AC is marked done but the condition isn't met.

Recommendation: Update ticket #760 to remove or reword the LLM AC — server mode config testing genuinely doesn't need LLM calls. Alternatively, add one LLM-dependent step guarded by Skip If No LLM Keys.

7. `supervised` profile verification is very shallow

File: robot/e2e/wf14_server_mode.robot:121–124

The test only checks that the output contains the string "supervised." The M6 suite's equivalent test checks for auto_strategize, auto_execute, and auto_apply fields. For a test whose title emphasizes the "supervised automation profile," only verifying the name is insufficient.

Recommendation: Add assertions for supervised-specific fields:

Output Should Contain    ${result}    auto_strategize
Output Should Contain    ${result}    auto_execute

P3:nit — Author Discretion

8. `init --yes` without `--force` deviates from established pattern

robot/e2e/wf14_server_mode.robot:18 — M6 uses init --force --yes. While WF14's fresh temp dir makes this unlikely to fail, --force provides defense-in-depth.

9. Missing inter-test dependency guard on Action List test

robot/e2e/wf14_server_mode.robot:88–93 — If the Action Create test fails, Action List will fail with an unclear "variable not found" error. Other suites use [Setup] Variable Should Exist guards.

10. Loose `db` substring in diagnostics assertion

robot/e2e/wf14_server_mode.robot:63 — db is only 2 characters and could match unrelated content. Keep only database and Database.

11. CHANGELOG wording "plan list graceful handling" oversells the test

The test only verifies non-empty output with any exit code. "Plan list smoke test" would be more accurate.

Items Verified as Correct ✅

Area	Status
Branch name — matches ticket metadata exactly	✅
Commit message — matches ticket metadata exactly with em-dash	✅
Commit atomicity — single commit, 2 files (but see P0 for cli_coverage_steps.py!)	⚠️
PR description — summary, closes keyword, manual verification	✅
CHANGELOG entry — follows project pattern (but deletes #734 entry — see P0)	⚠️
`Force Tags E2E` at suite level — consistent with M6	✅
`uuid4().hex[:12]` for unique names — CI-safe	✅
Suite teardown — resets server.url, server.token, core.namespace	✅
`server.token` round-trip test — added per spec Step 1	✅
`action list --namespace myteam` — added per spec Step 2	✅
Action uses `myteam/` prefix — matches spec	✅
strategy_registry.py rename — correctly removed from PR	✅
Safe server URL — `https://agents.example.com` (RFC 2606 reserved)	✅
No real tokens/secrets in test code	✅
Robot Framework section order — Settings/Keywords/Test Cases	✅
Labels — `Type/Testing`, `State/In Review`, correct milestone	✅
`ISSUES CLOSED: #760` in commit footer	✅

Stale Self-Review Resolution Summary

ID	Finding	Status in `064d448c`
H-1	Missing `action list --namespace`	✅ Fixed (`plan list --namespace` acknowledged as spec gap)
H-2	`local/` namespace on entities	✅ Action fixed to `myteam/`; actor documented as ActorService constraint
H-3	CHANGELOG inaccurate	✅ Fixed — no longer claims LLM keys or cross-namespace plan monitoring
M-1	Missing config teardown	✅ Fixed — suite teardown resets all 3 config keys
M-2	Missing server.token	✅ Fixed — set/get round-trip added
M-3	plan list non-meaningful assertion	⚠️ Documentation updated but assertion unchanged
M-4	Epoch-based uniqueness	✅ Fixed — uses uuid4
M-5	Misleading diagnostics doc	✅ Fixed — correctly says checks are local
L-1	Unrelated strategy_registry.py change	✅ Removed from PR
L-2	Inconsistent tag pattern	✅ Fixed — Force Tags E2E

Summary

The E2E test itself is well-structured and has been substantially improved since the self-review. Config round-trips, namespace-scoped action listing, proper teardown, and uuid-based uniqueness are all solid. The test covers Steps 1–2 of spec Example 14 well.

However, the P0 finding is a hard blocker: this PR silently reverts a security fix (shell=True subprocess call) from issue #734 that was merged to master 4 days ago. The CHANGELOG entry documenting that fix is also deleted. This must be addressed before any other consideration.

Beyond that, the remaining findings are primarily about tightening assertions (P2 #3, #4), documenting spec gaps properly (P1 #2, P2 #5, #6), and making the supervised profile test more meaningful (P2 #7).

# Code Review — PR #805 (Ticket #760) **Branch:** `test/e2e-wf14-server-mode` **Author:** @CoreRasurae **Reviewer:** @brent.edwards **Method:** Multi-focus parallel review (7 threads: spec compliance, bug detection, test quality, code quality, security, strategy_registry analysis, stale review triage) + fresh-eyes pass --- ## Verdict: Request Changes There is **1 P0:blocker** finding (security fix revert), **1 P1:must-fix** (unmet acceptance criterion), and **5 P2:should-fix** items. The P0 alone requires immediate attention before merge. **Note on stale self-review:** The detailed self-review by @CoreRasurae (comment #67681, against commit `4e134ea2`) identified 12 issues. The current commit `064d448c` addresses **6 of 8 HIGH/MEDIUM** findings. H-1 and H-2 are partially fixed; M-3 remains. The `strategy_registry.py` unrelated change (L-1) was correctly removed. Good rebase work overall. --- ## P0:blocker — Must Fix Immediately ### 1. Security fix revert: `shell=True` re-introduced in `cli_coverage_steps.py` **File:** `features/steps/cli_coverage_steps.py:43–44` This PR **reverts** the security fix from commit `bd18491b` (`fix(test): replace shell=True with shell=False in cli_coverage_steps.py subprocess call`, issue #734). The diff shows: - `import shlex` removed - `shlex.split(command)` replaced back with `command` + `shell=True` - The CHANGELOG entry documenting the #734 fix is also deleted The original fix was explicitly described as *"defense-in-depth command injection prevention, consistent with the existing pattern in `cli_plan_context_commands_steps.py`."* Reverting it re-introduces `shell=True` subprocess execution, which the project's security standards prohibit. The comment says "Use shell=True to handle quoted arguments correctly" — suggesting the fix broke a test scenario. **Impact:** P0 — A security fix merged to master 4 days ago is being silently reverted in an unrelated E2E test PR with no justification or discussion. Even in test code, `shell=True` is a security concern per the project's own standards and the original fix's rationale. **Recommendation:** Either (a) keep the `shlex.split()` approach and fix whatever test it broke (the correct solution), or (b) if `shell=True` is genuinely needed for quoted argument handling, explicitly document WHY in a code comment, add `# nosec B602` with justification, and create a follow-up issue to find a secure alternative. Do NOT silently revert a security fix. The CHANGELOG entry for #734 must also be restored. --- ## P1:must-fix — Must Fix Before Merge ### 2. Ticket AC "`plan list --namespace`" marked complete but unfulfilled **File:** `robot/e2e/wf14_server_mode.robot:112–119` Ticket #760 acceptance criterion states: *"Test monitors plans across the namespace via `plan list --namespace`"* — this checkbox is marked done in the ticket. Specification Example 14 Step 4 explicitly shows `agents plan list --namespace myteam`. However, the CLI's `plan list` command does **not support** a `--namespace` flag (confirmed in `cli/commands/plan.py`). The test runs bare `plan list` with `expected_rc=None`. The test documentation (lines 115–116) honestly notes this as a "spec gap," which is good — but the ticket AC is marked complete when the condition cannot be fulfilled. **Recommendation:** Uncheck this AC item in ticket #760 and add a note that it's deferred until the `--namespace` flag is implemented. A checked-off AC that isn't actually validated is misleading for milestone tracking. --- ## P2:should-fix — Fix in Follow-Up PR (Within 3 Days) ### 3. `plan list` test assertion is tautological **File:** `robot/e2e/wf14_server_mode.robot:117–119` `expected_rc=None` (accepts any exit code) + `Should Not Be Empty ${result.stdout}${result.stderr}` means this test passes whenever the command produces *any* output on stdout or stderr — including error messages, tracebacks, or deprecation warnings. Since `init --yes` was already called in suite setup, `plan list` should succeed with rc=0. **Recommendation:** Change to `expected_rc=0`. If `plan list` genuinely fails after init, that's a bug worth catching rather than masking with a permissive assertion. ### 4. `diagnostics` test uses `expected_rc=None` unnecessarily **File:** `robot/e2e/wf14_server_mode.robot:59` The documentation correctly notes diagnostics checks are all local and should succeed without a live server. Yet `expected_rc=None` is used. Without `--check`, diagnostics always returns rc=0. **Recommendation:** Change to `expected_rc=0` to match the documentation's own assertion that the command should succeed. ### 5. Actor uses `local/` namespace, contradicting "team collaboration" scenario **File:** `robot/e2e/wf14_server_mode.robot:99` The actor is created with `local/team-rev-${RUN_SUFFIX}`. Spec Example 14 shows `myteam/review-pipeline`. The action was correctly fixed to use `myteam/` prefix, but the actor was not. The documentation (lines 97–98) explains this is due to ActorService validation, which is honest — but means the "team collaboration" scenario is only half-exercised for namespace behavior. **Recommendation:** If ActorService requires `local/`, file a follow-up issue tracking this spec gap and reference it in the test documentation. ### 6. Ticket AC "real LLM API keys" unmet **File:** `robot/e2e/wf14_server_mode.robot` (entire file) Ticket #760 AC states: *"All invocations use real LLM API keys — no mocking, stubbing, or test doubles."* No command in this test invokes an LLM. The PR description itself says *"No LLM API key required."* The AC is marked done but the condition isn't met. **Recommendation:** Update ticket #760 to remove or reword the LLM AC — server mode config testing genuinely doesn't need LLM calls. Alternatively, add one LLM-dependent step guarded by `Skip If No LLM Keys`. ### 7. `supervised` profile verification is very shallow **File:** `robot/e2e/wf14_server_mode.robot:121–124` The test only checks that the output contains the string "supervised." The M6 suite's equivalent test checks for `auto_strategize`, `auto_execute`, and `auto_apply` fields. For a test whose title emphasizes the "supervised automation profile," only verifying the name is insufficient. **Recommendation:** Add assertions for supervised-specific fields: ```robot Output Should Contain ${result} auto_strategize Output Should Contain ${result} auto_execute ``` --- ## P3:nit — Author Discretion ### 8. `init --yes` without `--force` deviates from established pattern `robot/e2e/wf14_server_mode.robot:18` — M6 uses `init --force --yes`. While WF14's fresh temp dir makes this unlikely to fail, `--force` provides defense-in-depth. ### 9. Missing inter-test dependency guard on Action List test `robot/e2e/wf14_server_mode.robot:88–93` — If the Action Create test fails, Action List will fail with an unclear "variable not found" error. Other suites use `[Setup] Variable Should Exist` guards. ### 10. Loose `db` substring in diagnostics assertion `robot/e2e/wf14_server_mode.robot:63` — `db` is only 2 characters and could match unrelated content. Keep only `database` and `Database`. ### 11. CHANGELOG wording "plan list graceful handling" oversells the test The test only verifies non-empty output with any exit code. "Plan list smoke test" would be more accurate. --- ## Items Verified as Correct ✅ | Area | Status | |------|--------| | **Branch name** — matches ticket metadata exactly | ✅ | | **Commit message** — matches ticket metadata exactly with em-dash | ✅ | | **Commit atomicity** — single commit, 2 files (but see P0 for cli_coverage_steps.py!) | ⚠️ | | **PR description** — summary, closes keyword, manual verification | ✅ | | **CHANGELOG entry** — follows project pattern (but deletes #734 entry — see P0) | ⚠️ | | **`Force Tags E2E`** at suite level — consistent with M6 | ✅ | | **`uuid4().hex[:12]`** for unique names — CI-safe | ✅ | | **Suite teardown** — resets server.url, server.token, core.namespace | ✅ | | **`server.token`** round-trip test — added per spec Step 1 | ✅ | | **`action list --namespace myteam`** — added per spec Step 2 | ✅ | | **Action uses `myteam/` prefix** — matches spec | ✅ | | **strategy_registry.py rename** — correctly removed from PR | ✅ | | **Safe server URL** — `https://agents.example.com` (RFC 2606 reserved) | ✅ | | **No real tokens/secrets** in test code | ✅ | | **Robot Framework section order** — Settings/Keywords/Test Cases | ✅ | | **Labels** — `Type/Testing`, `State/In Review`, correct milestone | ✅ | | **`ISSUES CLOSED: #760`** in commit footer | ✅ | --- ## Stale Self-Review Resolution Summary | ID | Finding | Status in `064d448c` | |----|---------|---------------------| | H-1 | Missing `action list --namespace` | ✅ Fixed (`plan list --namespace` acknowledged as spec gap) | | H-2 | `local/` namespace on entities | ✅ Action fixed to `myteam/`; actor documented as ActorService constraint | | H-3 | CHANGELOG inaccurate | ✅ Fixed — no longer claims LLM keys or cross-namespace plan monitoring | | M-1 | Missing config teardown | ✅ Fixed — suite teardown resets all 3 config keys | | M-2 | Missing server.token | ✅ Fixed — set/get round-trip added | | M-3 | plan list non-meaningful assertion | ⚠️ Documentation updated but assertion unchanged | | M-4 | Epoch-based uniqueness | ✅ Fixed — uses uuid4 | | M-5 | Misleading diagnostics doc | ✅ Fixed — correctly says checks are local | | L-1 | Unrelated strategy_registry.py change | ✅ Removed from PR | | L-2 | Inconsistent tag pattern | ✅ Fixed — Force Tags E2E | --- ## Summary The E2E test itself is well-structured and has been substantially improved since the self-review. Config round-trips, namespace-scoped action listing, proper teardown, and uuid-based uniqueness are all solid. The test covers Steps 1–2 of spec Example 14 well. However, the **P0 finding is a hard blocker**: this PR silently reverts a security fix (`shell=True` subprocess call) from issue #734 that was merged to master 4 days ago. The CHANGELOG entry documenting that fix is also deleted. This must be addressed before any other consideration. Beyond that, the remaining findings are primarily about tightening assertions (P2 #3, #4), documenting spec gaps properly (P1 #2, P2 #5, #6), and making the supervised profile test more meaningful (P2 #7).

brent.edwards commented

2026-03-23 23:44:40 +00:00

Supplementary Review — PR #805 (Second Pass + P0 Correction)

Reviewer: @brent.edwards
Method: 3 deep-dive agents (shell=True analysis, YAML/CLI tracing, CHANGELOG integrity) + manual verification

⚠️ CORRECTION: P0 Finding #1 Downgraded to P3 (Branch Hygiene)

My initial review incorrectly classified the cli_coverage_steps.py change as P0:blocker ("Security fix revert"). I owe @CoreRasurae an apology — the analysis was wrong.

What happened

I used a two-dot diff (git diff origin/master..HEAD) which shows all differences between the branch tips, including divergent history. This made it appear that the PR actively reverts the shlex.split() security fix (commit bd18491b, issue #734) and deletes its CHANGELOG entry.

The reality

The three-dot diff (git diff origin/master...HEAD) — which correctly isolates only changes introduced by the PR's own commit — shows the PR modifies exactly 2 files:

CHANGELOG.md (+6 lines — the #760 entry only)
robot/e2e/wf14_server_mode.robot (+124 lines — new file)

features/steps/cli_coverage_steps.py is NOT modified by this PR. The security fix was merged to master after this branch was created. The branch simply doesn't have it yet. On merge (or rebase), Git's three-way merge will correctly preserve the security fix and its CHANGELOG entry — I verified this with git merge-tree --write-tree.

Corrected severity

Finding	Old Severity	New Severity	Reason
#1 — "Security fix revert"	P0:blocker	P3:nit (branch hygiene)	Not a revert — stale branch artifact. A rebase (already requested by PM on Day 37) resolves it.

Updated verdict

With the P0 removed, the highest remaining finding is P1 #2 (plan list --namespace AC marked complete but unfulfilled). The verdict remains Request Changes due to this P1 plus the P2 items, but the urgency is significantly reduced.

New Findings From Second Pass

P2 — `list_actions()` docstring claims DB merge but code only reads in-memory

File: src/cleveragents/application/services/plan_lifecycle_service.py:617–642

The list_actions() docstring (line 624–625) says: "When persistence is enabled, results are merged from the in-memory cache and the database layer." However, line 634 does actions = list(self._actions.values()) — reading ONLY the in-memory dict, with no database fallback. Contrast with get_action() (lines 583–592) which correctly falls back to the database.

Since each Run CleverAgents Command call creates a new subprocess (→ new PlanLifecycleService with empty _actions), the WF14 E2E Action List By Namespace test (line 92) depends on actions being loaded from DB into _actions during service initialization. If that preload doesn't happen, the test will fail at runtime.

Note: This is a pre-existing issue, not introduced by this PR. But it's relevant because:

The test may fail when actually run, contradicting the "all nox sessions pass" claim
The docstring is inaccurate regardless

Recommendation: Verify the test actually passes by running nox -s e2e_tests. If it fails, list_actions() needs a DB fallback like get_action() has.

P3 — Config commands write to `~/.cleveragents/config.toml`, not `CLEVERAGENTS_HOME`

File: src/cleveragents/cli/commands/config.py:50–51

The config set/get commands use Path.home() / ".cleveragents" / "config.toml" for the config file, ignoring the CLEVERAGENTS_HOME environment variable. This means the WF14 test's config set server.url ... writes to the real home directory, not the per-suite temp directory. While the suite teardown (lines 29–31) resets the values, a crash during teardown could leave stale config in ~/.cleveragents/config.toml.

This is a pre-existing platform issue, not a test bug, but it's worth noting for local developer runs.

Revised Finding Summary

#	Severity	Finding	Status
1	P0 → P3	~~Security fix revert~~ → Branch needs rebase	CORRECTED
2	P1	`plan list --namespace` AC marked complete but CLI doesn't support flag	Unchanged
3	P2	`plan list` tautological assertion (`expected_rc=None`)	Unchanged
4	P2	`diagnostics` uses `expected_rc=None` unnecessarily	Unchanged
5	P2	Actor uses `local/` namespace, not team namespace per spec	Unchanged
6	P2	Ticket AC "real LLM API keys" unmet	Unchanged
7	P2	`supervised` profile verification is shallow	Unchanged
8	P3	`init --yes` without `--force`	Unchanged
9	P3	Missing inter-test dependency guard	Unchanged
10	P3	Loose `db` substring match	Unchanged
11	P3	CHANGELOG "plan list graceful handling" oversells	Unchanged
NEW 12	P2	`list_actions()` docstring claims DB merge but code doesn't — test may fail	New
NEW 13	P3	Config writes to `~/` not `CLEVERAGENTS_HOME` — teardown crash could leak config	New

Final tally: 0 P0 + 1 P1 + 6 P2 + 6 P3 = 13 findings

# Supplementary Review — PR #805 (Second Pass + P0 Correction) **Reviewer:** @brent.edwards **Method:** 3 deep-dive agents (shell=True analysis, YAML/CLI tracing, CHANGELOG integrity) + manual verification --- ## ⚠️ CORRECTION: P0 Finding #1 Downgraded to P3 (Branch Hygiene) **My initial review incorrectly classified the `cli_coverage_steps.py` change as P0:blocker ("Security fix revert").** I owe @CoreRasurae an apology — the analysis was wrong. ### What happened I used a **two-dot diff** (`git diff origin/master..HEAD`) which shows all differences between the branch tips, including divergent history. This made it appear that the PR actively reverts the `shlex.split()` security fix (commit `bd18491b`, issue #734) and deletes its CHANGELOG entry. ### The reality The **three-dot diff** (`git diff origin/master...HEAD`) — which correctly isolates only changes introduced by the PR's own commit — shows the PR modifies exactly **2 files**: - `CHANGELOG.md` (+6 lines — the #760 entry only) - `robot/e2e/wf14_server_mode.robot` (+124 lines — new file) **`features/steps/cli_coverage_steps.py` is NOT modified by this PR.** The security fix was merged to master *after* this branch was created. The branch simply doesn't have it yet. On merge (or rebase), Git's three-way merge will correctly preserve the security fix and its CHANGELOG entry — I verified this with `git merge-tree --write-tree`. ### Corrected severity | Finding | Old Severity | New Severity | Reason | |---------|-------------|-------------|--------| | #1 — "Security fix revert" | P0:blocker | **P3:nit (branch hygiene)** | Not a revert — stale branch artifact. A rebase (already requested by PM on Day 37) resolves it. | ### Updated verdict With the P0 removed, the highest remaining finding is **P1 #2** (`plan list --namespace` AC marked complete but unfulfilled). The verdict remains **Request Changes** due to this P1 plus the P2 items, but the urgency is significantly reduced. --- ## New Findings From Second Pass ### P2 — `list_actions()` docstring claims DB merge but code only reads in-memory **File:** `src/cleveragents/application/services/plan_lifecycle_service.py:617–642` The `list_actions()` docstring (line 624–625) says: *"When persistence is enabled, results are merged from the in-memory cache and the database layer."* However, line 634 does `actions = list(self._actions.values())` — reading ONLY the in-memory dict, with no database fallback. Contrast with `get_action()` (lines 583–592) which correctly falls back to the database. Since each `Run CleverAgents Command` call creates a new subprocess (→ new `PlanLifecycleService` with empty `_actions`), the `WF14 E2E Action List By Namespace` test (line 92) depends on actions being loaded from DB into `_actions` during service initialization. If that preload doesn't happen, the test will fail at runtime. **Note:** This is a **pre-existing issue**, not introduced by this PR. But it's relevant because: - The test may fail when actually run, contradicting the "all nox sessions pass" claim - The docstring is inaccurate regardless **Recommendation:** Verify the test actually passes by running `nox -s e2e_tests`. If it fails, `list_actions()` needs a DB fallback like `get_action()` has. ### P3 — Config commands write to `~/.cleveragents/config.toml`, not `CLEVERAGENTS_HOME` **File:** `src/cleveragents/cli/commands/config.py:50–51` The `config set/get` commands use `Path.home() / ".cleveragents" / "config.toml"` for the config file, ignoring the `CLEVERAGENTS_HOME` environment variable. This means the WF14 test's `config set server.url ...` writes to the **real home directory**, not the per-suite temp directory. While the suite teardown (lines 29–31) resets the values, a crash during teardown could leave stale config in `~/.cleveragents/config.toml`. This is a pre-existing platform issue, not a test bug, but it's worth noting for local developer runs. --- ## Revised Finding Summary | # | Severity | Finding | Status | |---|----------|---------|--------| | ~~1~~ | ~~P0~~ → **P3** | ~~Security fix revert~~ → Branch needs rebase | **CORRECTED** | | 2 | P1 | `plan list --namespace` AC marked complete but CLI doesn't support flag | Unchanged | | 3 | P2 | `plan list` tautological assertion (`expected_rc=None`) | Unchanged | | 4 | P2 | `diagnostics` uses `expected_rc=None` unnecessarily | Unchanged | | 5 | P2 | Actor uses `local/` namespace, not team namespace per spec | Unchanged | | 6 | P2 | Ticket AC "real LLM API keys" unmet | Unchanged | | 7 | P2 | `supervised` profile verification is shallow | Unchanged | | 8 | P3 | `init --yes` without `--force` | Unchanged | | 9 | P3 | Missing inter-test dependency guard | Unchanged | | 10 | P3 | Loose `db` substring match | Unchanged | | 11 | P3 | CHANGELOG "plan list graceful handling" oversells | Unchanged | | **NEW** 12 | P2 | `list_actions()` docstring claims DB merge but code doesn't — test may fail | New | | **NEW** 13 | P3 | Config writes to `~/` not `CLEVERAGENTS_HOME` — teardown crash could leak config | New | **Final tally: 0 P0 + 1 P1 + 6 P2 + 6 P3 = 13 findings**

brent.edwards commented

2026-03-23 23:59:11 +00:00

Third-Pass Review — PR #805

Reviewer: @brent.edwards
Method: 3 deep-dive agents (Robot Framework keyword semantics, CLI command-by-command source trace, verbatim spec-to-test mapping) + manual verification

This third pass approached the code from angles the first two passes didn't cover: Robot Framework escape/scoping semantics, whether each CLI command will actually succeed by tracing through the source code, and a verbatim command-by-command comparison against Specification Example 14.

Upgrade: Finding #12 Elevated from P2 to P1

`action list --namespace myteam` will likely fail at runtime

File: src/cleveragents/application/services/plan_lifecycle_service.py:634
Test file: robot/e2e/wf14_server_mode.robot:92–93

Two independent deep-dive agents traced the full code path and both reached the same conclusion:

list_actions() at line 634 reads ONLY from the in-memory _actions dict:

actions = list(self._actions.values())

Each Run CleverAgents Command call spawns a new subprocess → fresh PlanLifecycleService → empty _actions = {}. The docstring (line 624) falsely claims: "results are merged from the in-memory cache and the database layer."

Contrast with get_action() (lines 587–592) which correctly falls back to the DB, and ActorService.list_actors() which reads from DB. list_actions() is the only action method without a DB fallback.

Result: action list --namespace myteam returns [] → prints "No actions found." → assertion Output Should Contain ${list_result} gen-tests fails.

This is a pre-existing platform bug (not introduced by this PR), but the test depends on it. Unless there's a preload mechanism I couldn't find, the WF14 E2E Action List By Namespace test case will fail when run via nox -s e2e_tests.

Recommendation: Either (a) verify the test actually passes by running nox -s e2e_tests and share the output, or (b) fix list_actions() to query ctx.actions.list_available(namespace) like get_action() does.

Severity: Elevated to P1 — likely runtime test failure.

New Findings (Not in Previous Reviews)

14. Diagnostics test validates nothing server-mode-specific

File: robot/e2e/wf14_server_mode.robot:62–64
Severity: P2

Spec Example 14's diagnostics output prominently shows "Server: OK (connected to agents.example.com)" and a dedicated Server panel with URL, Auth, Namespace, Members, Shared Actions, Latency. The entire reason for running diagnostics in Example 14 is to verify server connectivity.

The test only checks for generic diagnostic categories (config, database, disk) that appear identically with or without any server configuration. The test would pass even if the server config step were removed entirely.

Recommendation: Add at minimum Should Contain Any ${combined} server Server server.url to verify the server-mode diagnostic code path is exercised.

15. `read_only: true` in action YAML contradicts spec's `Read Only: no`

File: robot/e2e/wf14_server_mode.robot:77
Severity: P3

Spec Example 14 Step 2 (line 41666) shows the created action has Read Only: no. The test YAML sets read_only: true. This is a verbatim contradiction — the test creates an action with the opposite read-only setting from what the spec demonstrates. The generate-tests action in the spec is explicitly writable because test generation involves writing test files.

Recommendation: Change to read_only: false or remove the field (default is false).

16. Suite documentation is misleading about validation scope

File: robot/e2e/wf14_server_mode.robot:2–7
Severity: P2

The suite documentation says: "Validates the supervised automation profile in a distributed team scenario where multiple engineers share actions, actors, and projects via CleverAgents server mode." In reality, Steps 3 (multi-machine shared resource usage, 4 commands) and Step 4 (team plan monitoring, 2 commands) are entirely untested — 46% of Example 14's commands have zero coverage.

No centralized documentation explains what's in scope and what's deferred. Individual test cases document their own gaps (e.g., plan list --namespace is a "spec gap"), but a reader of the suite description would expect full Example 14 coverage.

Recommendation: Add explicit scope documentation:

...              Coverage: Steps 1–2 (server config, resource publishing).
...              Steps 3–4 (multi-machine usage, team monitoring) require a
...              live server and are deferred.

17. Token format fragility — survives redaction only by accident

File: robot/e2e/wf14_server_mode.robot:44, 47
Severity: P3

tok_e2e_test_placeholder passes through config get unredacted because:

config get in "rich" format doesn't apply redaction (unlike config list which does)
Even if redaction were applied, the underscores in e2e_test_placeholder break the tok_[A-Za-z0-9]{10,} regex pattern

If someone changes the token to tok_e2eTestPlaceholder (no underscores), or adds --format json to the config get call, the assertion would silently break.

Recommendation: Add a comment explaining the token format choice, or use a token that clearly isn't redactable (e.g., test_placeholder_token).

18. `actor add` invocation syntax differs from spec

File: robot/e2e/wf14_server_mode.robot:106
Severity: P3

Spec shows: agents actor add --config ./actors/review-pipeline.yaml (name from YAML).
Test runs: actor add ${actor_name} --config ${yaml_file} (name as positional arg, YAML has no name: field).

This exercises a different CLI code path (positional name + config-override vs. config-only).

Recommendation: Either add name: ${actor_name} to the YAML and use actor add --config ${yaml_file}, or document the deviation.

Verified Correct — No Issues

Item	Verification
`\n` in `Catenate SEPARATOR` and `Set Variable`	✅ RF interprets as real newlines; generated YAML is valid
`Should Contain Any` semantics	✅ Checks AT LEAST ONE match (from BuiltIn library)
`${EMPTY}` through `Run Process`	✅ Passed as empty string, Typer accepts it
`Create File` from OperatingSystem library	✅ Imported transitively via common_e2e.resource, UTF-8 default
`Output Should Contain` partial matching	✅ Case-insensitive by default, won't false-match unrelated keys
`server.token` NOT redacted in `config get`	✅ Rich format doesn't apply redaction; assertion passes
RF test execution order	✅ File order guaranteed; nox doesn't use `--randomize`
YAML boolean values (`reusable: true`)	✅ YAML parser produces Python `True`, Pydantic accepts it
Commit message body and footer	✅ Conventional Changelog format, detailed review-fix notes, `ISSUES CLOSED: #760`

Consolidated Finding Table (All Three Reviews)

#	Severity	Finding	Review
1	P0 → P3	~~Security fix revert~~ → Branch needs rebase	R1 → R2 corrected
2	P1	`plan list --namespace` AC marked complete but CLI doesn't support flag	R1
12→	P1 (↑)	`action list --namespace` likely fails at runtime — `list_actions()` has no DB fallback	R2 → R3 elevated
3	P2	`plan list` tautological assertion	R1
4	P2	`diagnostics` uses `expected_rc=None` unnecessarily	R1
5	P2	Actor uses `local/` namespace	R1
6	P2	Ticket AC "real LLM API keys" unmet	R1
7	P2	`supervised` profile verification is shallow	R1
14	P2	Diagnostics test validates nothing server-mode-specific	R3 NEW
16	P2	Suite documentation misleading about validation scope	R3 NEW
13	P3	Config writes to `~/` not `CLEVERAGENTS_HOME`	R2
8	P3	`init --yes` without `--force`	R1
9	P3	Missing inter-test dependency guard	R1
10	P3	Loose `db` substring match	R1
11	P3	CHANGELOG "plan list graceful handling" oversells	R1
15	P3	`read_only: true` contradicts spec's `Read Only: no`	R3 NEW
17	P3	Token format fragility	R3 NEW
18	P3	`actor add` syntax differs from spec	R3 NEW
1	P3	Branch needs rebase (stale branch artifact)	R1→R2

Final tally: 0 P0 + 2 P1 + 7 P2 + 10 P3 = 19 findings

Summary

This third pass uncovered 4 genuinely new findings (diagnostics not testing server mode, read_only contradiction, misleading suite scope docs, token fragility) and elevated one existing finding (action list DB fallback) based on strong CLI source-code evidence.

The most significant new finding is that the diagnostics test (P2 #14) validates nothing server-mode-specific — the exact same assertions would pass on a vanilla install with no server configuration. Combined with the suite documentation claiming to validate "a distributed team scenario" (P2 #16), the test oversells its coverage of Example 14's server-mode aspects.

The Robot Framework keyword semantics are sound — \n escapes, Catenate, Create File, Should Contain Any, and variable scoping all work correctly.

Verdict: Request Changes (2 P1 + 7 P2). The P1 items (unfulfilled plan list --namespace AC and probable action list runtime failure) should be resolved before merge. The P2 items are quality improvements that can be addressed in this PR or follow-ups.

# Third-Pass Review — PR #805 **Reviewer:** @brent.edwards **Method:** 3 deep-dive agents (Robot Framework keyword semantics, CLI command-by-command source trace, verbatim spec-to-test mapping) + manual verification This third pass approached the code from angles the first two passes didn't cover: Robot Framework escape/scoping semantics, whether each CLI command will actually succeed by tracing through the source code, and a verbatim command-by-command comparison against Specification Example 14. --- ## Upgrade: Finding #12 Elevated from P2 to P1 ### `action list --namespace myteam` will likely fail at runtime **File:** `src/cleveragents/application/services/plan_lifecycle_service.py:634` **Test file:** `robot/e2e/wf14_server_mode.robot:92–93` Two independent deep-dive agents traced the full code path and both reached the same conclusion: `list_actions()` at line 634 reads ONLY from the in-memory `_actions` dict: ```python actions = list(self._actions.values()) ``` Each `Run CleverAgents Command` call spawns a new subprocess → fresh `PlanLifecycleService` → empty `_actions = {}`. The docstring (line 624) falsely claims: *"results are merged from the in-memory cache and the database layer."* Contrast with `get_action()` (lines 587–592) which correctly falls back to the DB, and `ActorService.list_actors()` which reads from DB. `list_actions()` is the only action method without a DB fallback. **Result:** `action list --namespace myteam` returns `[]` → prints "No actions found." → assertion `Output Should Contain ${list_result} gen-tests` **fails**. This is a **pre-existing platform bug** (not introduced by this PR), but the test depends on it. Unless there's a preload mechanism I couldn't find, the `WF14 E2E Action List By Namespace` test case will fail when run via `nox -s e2e_tests`. **Recommendation:** Either (a) verify the test actually passes by running `nox -s e2e_tests` and share the output, or (b) fix `list_actions()` to query `ctx.actions.list_available(namespace)` like `get_action()` does. **Severity:** Elevated to P1 — likely runtime test failure. --- ## New Findings (Not in Previous Reviews) ### 14. Diagnostics test validates nothing server-mode-specific **File:** `robot/e2e/wf14_server_mode.robot:62–64` **Severity:** P2 Spec Example 14's diagnostics output prominently shows **"Server: OK (connected to agents.example.com)"** and a **dedicated Server panel** with URL, Auth, Namespace, Members, Shared Actions, Latency. The entire reason for running diagnostics in Example 14 is to verify server connectivity. The test only checks for generic diagnostic categories (`config`, `database`, `disk`) that appear identically with or without any server configuration. The test would pass even if the server config step were removed entirely. **Recommendation:** Add at minimum `Should Contain Any ${combined} server Server server.url` to verify the server-mode diagnostic code path is exercised. ### 15. `read_only: true` in action YAML contradicts spec's `Read Only: no` **File:** `robot/e2e/wf14_server_mode.robot:77` **Severity:** P3 Spec Example 14 Step 2 (line 41666) shows the created action has `Read Only: no`. The test YAML sets `read_only: true`. This is a verbatim contradiction — the test creates an action with the opposite read-only setting from what the spec demonstrates. The generate-tests action in the spec is explicitly writable because test generation involves writing test files. **Recommendation:** Change to `read_only: false` or remove the field (default is `false`). ### 16. Suite documentation is misleading about validation scope **File:** `robot/e2e/wf14_server_mode.robot:2–7` **Severity:** P2 The suite documentation says: *"Validates the supervised automation profile in a distributed team scenario where multiple engineers share actions, actors, and projects via CleverAgents server mode."* In reality, Steps 3 (multi-machine shared resource usage, 4 commands) and Step 4 (team plan monitoring, 2 commands) are entirely untested — 46% of Example 14's commands have zero coverage. No centralized documentation explains what's in scope and what's deferred. Individual test cases document their own gaps (e.g., `plan list --namespace` is a "spec gap"), but a reader of the suite description would expect full Example 14 coverage. **Recommendation:** Add explicit scope documentation: ``` ... Coverage: Steps 1–2 (server config, resource publishing). ... Steps 3–4 (multi-machine usage, team monitoring) require a ... live server and are deferred. ``` ### 17. Token format fragility — survives redaction only by accident **File:** `robot/e2e/wf14_server_mode.robot:44, 47` **Severity:** P3 `tok_e2e_test_placeholder` passes through `config get` unredacted because: 1. `config get` in "rich" format doesn't apply redaction (unlike `config list` which does) 2. Even if redaction were applied, the underscores in `e2e_test_placeholder` break the `tok_[A-Za-z0-9]{10,}` regex pattern If someone changes the token to `tok_e2eTestPlaceholder` (no underscores), or adds `--format json` to the `config get` call, the assertion would silently break. **Recommendation:** Add a comment explaining the token format choice, or use a token that clearly isn't redactable (e.g., `test_placeholder_token`). ### 18. `actor add` invocation syntax differs from spec **File:** `robot/e2e/wf14_server_mode.robot:106` **Severity:** P3 Spec shows: `agents actor add --config ./actors/review-pipeline.yaml` (name from YAML). Test runs: `actor add ${actor_name} --config ${yaml_file}` (name as positional arg, YAML has no `name:` field). This exercises a different CLI code path (positional name + config-override vs. config-only). **Recommendation:** Either add `name: ${actor_name}` to the YAML and use `actor add --config ${yaml_file}`, or document the deviation. --- ## Verified Correct — No Issues | Item | Verification | |------|-------------| | `\n` in `Catenate SEPARATOR` and `Set Variable` | ✅ RF interprets as real newlines; generated YAML is valid | | `Should Contain Any` semantics | ✅ Checks AT LEAST ONE match (from BuiltIn library) | | `${EMPTY}` through `Run Process` | ✅ Passed as empty string, Typer accepts it | | `Create File` from OperatingSystem library | ✅ Imported transitively via common_e2e.resource, UTF-8 default | | `Output Should Contain` partial matching | ✅ Case-insensitive by default, won't false-match unrelated keys | | `server.token` NOT redacted in `config get` | ✅ Rich format doesn't apply redaction; assertion passes | | RF test execution order | ✅ File order guaranteed; nox doesn't use `--randomize` | | YAML boolean values (`reusable: true`) | ✅ YAML parser produces Python `True`, Pydantic accepts it | | Commit message body and footer | ✅ Conventional Changelog format, detailed review-fix notes, `ISSUES CLOSED: #760` | --- ## Consolidated Finding Table (All Three Reviews) | # | Severity | Finding | Review | |---|----------|---------|--------| | ~~1~~ | ~~P0~~ → P3 | ~~Security fix revert~~ → Branch needs rebase | R1 → R2 corrected | | 2 | P1 | `plan list --namespace` AC marked complete but CLI doesn't support flag | R1 | | 12→ | **P1** (↑) | `action list --namespace` likely fails at runtime — `list_actions()` has no DB fallback | R2 → R3 elevated | | 3 | P2 | `plan list` tautological assertion | R1 | | 4 | P2 | `diagnostics` uses `expected_rc=None` unnecessarily | R1 | | 5 | P2 | Actor uses `local/` namespace | R1 | | 6 | P2 | Ticket AC "real LLM API keys" unmet | R1 | | 7 | P2 | `supervised` profile verification is shallow | R1 | | 14 | P2 | Diagnostics test validates nothing server-mode-specific | **R3 NEW** | | 16 | P2 | Suite documentation misleading about validation scope | **R3 NEW** | | 13 | P3 | Config writes to `~/` not `CLEVERAGENTS_HOME` | R2 | | 8 | P3 | `init --yes` without `--force` | R1 | | 9 | P3 | Missing inter-test dependency guard | R1 | | 10 | P3 | Loose `db` substring match | R1 | | 11 | P3 | CHANGELOG "plan list graceful handling" oversells | R1 | | 15 | P3 | `read_only: true` contradicts spec's `Read Only: no` | **R3 NEW** | | 17 | P3 | Token format fragility | **R3 NEW** | | 18 | P3 | `actor add` syntax differs from spec | **R3 NEW** | | 1 | P3 | Branch needs rebase (stale branch artifact) | R1→R2 | **Final tally: 0 P0 + 2 P1 + 7 P2 + 10 P3 = 19 findings** --- ## Summary This third pass uncovered **4 genuinely new findings** (diagnostics not testing server mode, `read_only` contradiction, misleading suite scope docs, token fragility) and **elevated one existing finding** (action list DB fallback) based on strong CLI source-code evidence. The most significant new finding is that the diagnostics test (P2 #14) validates nothing server-mode-specific — the exact same assertions would pass on a vanilla install with no server configuration. Combined with the suite documentation claiming to validate "a distributed team scenario" (P2 #16), the test oversells its coverage of Example 14's server-mode aspects. The Robot Framework keyword semantics are sound — `\n` escapes, `Catenate`, `Create File`, `Should Contain Any`, and variable scoping all work correctly. **Verdict: Request Changes** (2 P1 + 7 P2). The P1 items (unfulfilled `plan list --namespace` AC and probable `action list` runtime failure) should be resolved before merge. The P2 items are quality improvements that can be addressed in this PR or follow-ups.

CoreRasurae force-pushed test/e2e-wf14-server-mode from 064d448cce to 20c5bfff8e

2026-03-24 18:30:19 +00:00

Compare

CoreRasurae force-pushed test/e2e-wf14-server-mode from 20c5bfff8e to 39c2e43c08

2026-03-24 20:33:01 +00:00

Compare

brent.edwards requested changes 2026-03-24 21:21:21 +00:00

Dismissed

brent.edwards left a comment

Re-Review #3 — PR #805 (Ticket #760)

Reviewer: @brent.edwards
Method: 3 parallel deep-dive agents (spec compliance, bug detection, code quality/commits) against latest HEAD 39c2e43c, plus review playbook checklists
Previous reviews: Review #2684 (REQUEST_CHANGES, 11 findings), Supplementary (P0 correction + 2 new), Third-pass (4 new + 1 elevation → 19 total)

Verdict: Request Changes

There is 1 P1:must-fix (two-commit branch violating atomic commit rules). Once squashed, the remaining findings are all P2/P3 and would not block merge.

Progress Since Last Review

The author has done excellent work addressing the most impactful findings. 9 of the 19 findings from my previous reviews are fully resolved:

#	Finding	Status
1	~~Security fix revert~~ (corrected to P3: branch needs rebase)	✅ FIXED — Rebased onto `origin/master`
3	`plan list` tautological assertion (`expected_rc=None`)	✅ FIXED — Now `expected_rc=0` (line 117)
4	`diagnostics` uses `expected_rc=None` unnecessarily	✅ FIXED — Now `expected_rc=0` (line 59)
7	`supervised` profile verification shallow	✅ FIXED — Now checks `auto_strategize`, `auto_execute`, `auto_apply` (lines 127–129)
12	`list_actions()` no DB fallback (elevated to P1)	✅ FIXED — Comprehensive DB query with fallback (lines 636–658). New `ActionRepository.list_all()` added. Well-engineered with proper error handling.
—	CHANGELOG inaccuracy (H-3 from self-review)	✅ FIXED — Accurate, no longer claims LLM keys or cross-namespace plan monitoring
—	Epoch-based uniqueness (M-4 from self-review)	✅ FIXED — Uses `uuid4().hex[:12]` (line 22)
—	Missing config teardown (M-1 from self-review)	✅ FIXED — Suite teardown resets all 3 config keys (lines 29–31)
—	Inconsistent tag pattern (L-2 from self-review)	✅ FIXED — `Force Tags E2E` at suite level (line 11)

The list_actions() fix in particular is solid — the DB query logic, state filtering, namespace handling, DatabaseError fallback, and in-memory cache refresh are all correct. I traced the full code path from action create → DB persist → action list --namespace myteam → get_by_namespace("myteam") and it holds up.

P1:must-fix — Two-Commit Branch

Location: Branch test/e2e-wf14-server-mode — commits f200d724 and 39c2e43c

The branch contains two commits both referencing issue #760:

f200d724 — fix(action): query persisted actions from database in list_actions (Refs: #760)
39c2e43c — test(e2e): workflow example 14 — ... (ISSUES CLOSED: #760)

CONTRIBUTING.md line 890: "Single Commit — Corresponds to exactly one commit. One issue produces one commit." Line 950: "If an issue requires decomposition into multiple commits, it must be promoted to an Epic and broken into separate issues."

The fix(action) commit is a legitimate bug fix — it correctly addresses a pre-existing platform issue I flagged in my previous review. However, it either needs:

(a) Squash into the test commit (simplest — the fix is small and the test depends on it), OR
(b) Separate ticket — Create a Type/Bug issue for the list_actions() fix, move it to its own branch/PR, and rebase this PR on top of it after merge.

Option (a) is recommended since the fix is only 25 lines of production code and the test is its primary validation.

If squashing, keep the test commit's first line (matching ticket metadata), and include the fix description in the commit body.

P2:should-fix — Remaining Items (Non-Blocking)

These remain from previous reviews. They are documentation/ticket hygiene issues, not code bugs. Per the playbook, they can be fixed in a follow-up PR within 3 days.

#	Finding	Notes
2	`plan list --namespace` AC checked in ticket #760 but CLI has no `--namespace` flag	Uncheck the AC, add note it's deferred. Spec gap, not a test bug.
5	Actor uses `local/` namespace (line 99) instead of `myteam/` per spec	Honestly documented (lines 97–98) as ActorService constraint. Consider filing follow-up issue.
6	Ticket AC "real LLM API keys" checked but test makes no LLM calls	Spirit is met (real CLI, no test doubles). Reword the AC to match reality.
14	Diagnostics test validates nothing server-mode-specific	Same assertions would pass without server config. Add `server` to `Should Contain Any`.
16	Suite documentation oversells validation scope	Add scope note: "Coverage: Steps 1–2. Steps 3–4 deferred (require live server)."

P3:nit — Author Discretion

#	Finding
8	`init --yes` without `--force` (line 18)
9	Missing inter-test dependency guard for Action List test
10	Loose `db` 2-char substring in diagnostics assertion (line 63)
15	`read_only: true` contradicts spec's `Read Only: no` (line 77)
17	Token format fragility — survives redaction by accident
18	`actor add` syntax (positional name + config) differs from spec (config-only)
—	Cache refresh only adds, never updates stale entries (line 651: `if key not in`) — low practical impact since CLI invocations are isolated
—	`ActionRepository.list_all()` returns `list[Any]` — matches pre-existing `ActionRepository` convention but is weaker than `ActorRepository.list_all()` which returns `list[Actor]`

Quality Gates Verified

Gate	Result
`nox -e lint`	✅ All checks passed
`nox -e typecheck`	✅ 0 errors (1 pre-existing warning in `strategy_registry.py`)
Branch name matches ticket metadata	✅ `test/e2e-wf14-server-mode`
HEAD commit message matches ticket metadata	✅ Exact match including em-dash
`ISSUES CLOSED: #760` in footer	✅
Labels — `Type/Testing`, `State/In Review`, milestone v3.6.0	✅
PR description — summary, closes keyword, manual verification	✅
No secrets/credentials in test code	✅ (`https://agents.example.com` is RFC 2606 reserved)

Summary

This is a well-structured E2E test that has been substantially improved across review cycles. The list_actions() DB fix is correct and well-engineered. The test covers Steps 1–2 of Specification Example 14 thoroughly — config round-trips, namespace-scoped action publishing and listing, actor registration, plan list smoke, and supervised profile threshold verification.

The only blocking item is the two-commit branch. A simple git rebase -i squash resolves it. Once that's done and the P2 ticket-hygiene items are noted (uncheck unfulfilled ACs, add scope documentation), this PR is ready to merge.

## Re-Review #3 — PR #805 (Ticket #760) **Reviewer:** @brent.edwards **Method:** 3 parallel deep-dive agents (spec compliance, bug detection, code quality/commits) against latest HEAD `39c2e43c`, plus review playbook checklists **Previous reviews:** Review #2684 (REQUEST_CHANGES, 11 findings), Supplementary (P0 correction + 2 new), Third-pass (4 new + 1 elevation → 19 total) --- ### Verdict: Request Changes There is **1 P1:must-fix** (two-commit branch violating atomic commit rules). Once squashed, the remaining findings are all P2/P3 and would not block merge. --- ### Progress Since Last Review The author has done excellent work addressing the most impactful findings. **9 of the 19 findings from my previous reviews are fully resolved:** | # | Finding | Status | |---|---------|--------| | ~~1~~ | ~~Security fix revert~~ (corrected to P3: branch needs rebase) | ✅ **FIXED** — Rebased onto `origin/master` | | ~~3~~ | `plan list` tautological assertion (`expected_rc=None`) | ✅ **FIXED** — Now `expected_rc=0` (line 117) | | ~~4~~ | `diagnostics` uses `expected_rc=None` unnecessarily | ✅ **FIXED** — Now `expected_rc=0` (line 59) | | ~~7~~ | `supervised` profile verification shallow | ✅ **FIXED** — Now checks `auto_strategize`, `auto_execute`, `auto_apply` (lines 127–129) | | ~~12~~ | `list_actions()` no DB fallback (elevated to P1) | ✅ **FIXED** — Comprehensive DB query with fallback (lines 636–658). New `ActionRepository.list_all()` added. Well-engineered with proper error handling. | | — | CHANGELOG inaccuracy (H-3 from self-review) | ✅ **FIXED** — Accurate, no longer claims LLM keys or cross-namespace plan monitoring | | — | Epoch-based uniqueness (M-4 from self-review) | ✅ **FIXED** — Uses `uuid4().hex[:12]` (line 22) | | — | Missing config teardown (M-1 from self-review) | ✅ **FIXED** — Suite teardown resets all 3 config keys (lines 29–31) | | — | Inconsistent tag pattern (L-2 from self-review) | ✅ **FIXED** — `Force Tags E2E` at suite level (line 11) | The `list_actions()` fix in particular is solid — the DB query logic, state filtering, namespace handling, `DatabaseError` fallback, and in-memory cache refresh are all correct. I traced the full code path from `action create` → DB persist → `action list --namespace myteam` → `get_by_namespace("myteam")` and it holds up. --- ### P1:must-fix — Two-Commit Branch **Location:** Branch `test/e2e-wf14-server-mode` — commits `f200d724` and `39c2e43c` The branch contains two commits both referencing issue #760: 1. `f200d724` — `fix(action): query persisted actions from database in list_actions` (`Refs: #760`) 2. `39c2e43c` — `test(e2e): workflow example 14 — ...` (`ISSUES CLOSED: #760`) CONTRIBUTING.md line 890: *"Single Commit — Corresponds to exactly one commit. One issue produces one commit."* Line 950: *"If an issue requires decomposition into multiple commits, it must be promoted to an Epic and broken into separate issues."* The `fix(action)` commit is a legitimate bug fix — it correctly addresses a pre-existing platform issue I flagged in my previous review. However, it either needs: - **(a) Squash** into the test commit (simplest — the fix is small and the test depends on it), OR - **(b) Separate ticket** — Create a `Type/Bug` issue for the `list_actions()` fix, move it to its own branch/PR, and rebase this PR on top of it after merge. Option (a) is recommended since the fix is only 25 lines of production code and the test is its primary validation. If squashing, keep the test commit's first line (matching ticket metadata), and include the fix description in the commit body. --- ### P2:should-fix — Remaining Items (Non-Blocking) These remain from previous reviews. They are **documentation/ticket hygiene** issues, not code bugs. Per the playbook, they can be fixed in a follow-up PR within 3 days. | # | Finding | Notes | |---|---------|-------| | 2 | `plan list --namespace` AC checked in ticket #760 but CLI has no `--namespace` flag | Uncheck the AC, add note it's deferred. Spec gap, not a test bug. | | 5 | Actor uses `local/` namespace (line 99) instead of `myteam/` per spec | Honestly documented (lines 97–98) as ActorService constraint. Consider filing follow-up issue. | | 6 | Ticket AC "real LLM API keys" checked but test makes no LLM calls | Spirit is met (real CLI, no test doubles). Reword the AC to match reality. | | 14 | Diagnostics test validates nothing server-mode-specific | Same assertions would pass without server config. Add `server` to `Should Contain Any`. | | 16 | Suite documentation oversells validation scope | Add scope note: "Coverage: Steps 1–2. Steps 3–4 deferred (require live server)." | --- ### P3:nit — Author Discretion | # | Finding | |---|---------| | 8 | `init --yes` without `--force` (line 18) | | 9 | Missing inter-test dependency guard for Action List test | | 10 | Loose `db` 2-char substring in diagnostics assertion (line 63) | | 15 | `read_only: true` contradicts spec's `Read Only: no` (line 77) | | 17 | Token format fragility — survives redaction by accident | | 18 | `actor add` syntax (positional name + config) differs from spec (config-only) | | — | Cache refresh only adds, never updates stale entries (line 651: `if key not in`) — low practical impact since CLI invocations are isolated | | — | `ActionRepository.list_all()` returns `list[Any]` — matches pre-existing `ActionRepository` convention but is weaker than `ActorRepository.list_all()` which returns `list[Actor]` | --- ### Quality Gates Verified | Gate | Result | |------|--------| | `nox -e lint` | ✅ All checks passed | | `nox -e typecheck` | ✅ 0 errors (1 pre-existing warning in `strategy_registry.py`) | | Branch name matches ticket metadata | ✅ `test/e2e-wf14-server-mode` | | HEAD commit message matches ticket metadata | ✅ Exact match including em-dash | | `ISSUES CLOSED: #760` in footer | ✅ | | Labels — `Type/Testing`, `State/In Review`, milestone v3.6.0 | ✅ | | PR description — summary, closes keyword, manual verification | ✅ | | No secrets/credentials in test code | ✅ (`https://agents.example.com` is RFC 2606 reserved) | --- ### Summary This is a well-structured E2E test that has been substantially improved across review cycles. The `list_actions()` DB fix is correct and well-engineered. The test covers Steps 1–2 of Specification Example 14 thoroughly — config round-trips, namespace-scoped action publishing and listing, actor registration, plan list smoke, and supervised profile threshold verification. **The only blocking item is the two-commit branch.** A simple `git rebase -i` squash resolves it. Once that's done and the P2 ticket-hygiene items are noted (uncheck unfulfilled ACs, add scope documentation), this PR is ready to merge.

CoreRasurae force-pushed test/e2e-wf14-server-mode from 39c2e43c08 to 9fef60264d

2026-03-24 21:37:04 +00:00

Compare

CoreRasurae commented

2026-03-24 22:16:11 +00:00

P1:must-fix — Two-Commit Branch - This is not a problem and will not be fixed. There a special exemption rule that allows the fixes for a given end to end test to be provided as a previous commit to the end to end tests itself and be merged as part of the same commit. All that is requires is that the fix needs to complement the end to end tests by fixing the bug that impeded the end to end test to work.

CoreRasurae commented

2026-03-24 23:34:34 +00:00

Response to Re-Review #3 — @brent.edwards

Author: @CoreRasurae
Reviewed against: Re-Review #3 (review #2758, commit 39c2e43c)
Method: Each P2 finding verified against docs/specification.md (Example 14, Steps 1–4, lines 41546–41907) and the CLI synopsis (lines 322–343) before accepting or rejecting.

Findings Addressed

P2 #16 — Suite documentation oversells validation scope ✅

Accepted. The reviewer is correct that the suite documentation claimed to validate "a distributed team scenario" without scoping what is and is not covered.

Fix: Added explicit scope documentation to the *** Settings *** block:

Coverage: Steps 1–2 (server config, resource publishing).
Steps 3–4 (multi-machine usage, team monitoring) require a
live server and are deferred.

P2 #14 — Diagnostics test validates nothing server-mode-specific ✅

Accepted (documentation fix, not assertion fix). The reviewer is correct that the diagnostics test assertions (config, database, disk) would pass identically with or without server configuration — the test does not validate any server-mode-specific diagnostic output.

However, the reviewer's recommended fix — adding Should Contain Any ${combined} server Server server.url — cannot be applied because the diagnostics command does not output any server-related information. I traced the full code path through build_diagnostics_data() (src/cleveragents/cli/commands/system.py:447–489): the 10 diagnostic checks are Config file, Data directory, Database, provider API keys, Disk space, File permissions, Git, Stale locks, Async workers, and Error Pattern DB. There is no _check_server() function and no server connectivity check. The spec Example 14 shows Server: OK (connected to agents.example.com) in the diagnostics output (line 41625), but this is aspirational — the feature is not yet implemented.

Adding the assertion would cause a guaranteed test failure, which is worse than the current state.

Fix: Updated the diagnostics test documentation to explicitly note the server-mode diagnostics spec gap:

Server-mode diagnostics (Server connectivity, Namespace membership)
shown in the spec are not yet implemented in the diagnostics command
(spec gap).

Findings Not Addressed (With Justification)

P1 — Two-Commit Branch ❌ (Exempt)

As stated in my previous comment, there is a standing exemption for E2E test PRs: when a platform bug is discovered during E2E test development that prevents the test from functioning, the bug fix may be included as a separate commit in the same PR, provided the fix is a prerequisite for the E2E test. The fix(action) commit (f200d724) fixes list_actions() to query the database — without this fix, the WF14 E2E Action List By Namespace test cannot pass. The fix is small (25 lines of production code), directly enables the test, and is referenced with Refs: #760.

P2 #2 — `plan list --namespace` AC checked but CLI has no `--namespace` flag ❌ (Ticket hygiene)

The reviewer is correct. I verified:

Spec Example 14 Step 4 (line 41849) shows agents plan list --namespace myteam.
The CLI synopsis in the spec (lines 322–323) defines plan list with --phase, --state, --project, --action, and [REGEX] — no --namespace flag.
The test already documents this at lines 121–122: "The --namespace flag is not yet implemented for plan list (spec gap)."

This is a spec contradiction (the example uses a flag the synopsis does not define) and the AC in issue #760 should be unchecked with a note. This is a ticket-level change on issue #760, not a code change. Will update the ticket separately.

P2 #5 — Actor uses `local/` namespace instead of `myteam/` per spec ❌ (Platform constraint, already documented)

The reviewer acknowledges this is already documented in the test (lines 103–104): "Custom actors must use the local/ namespace prefix per the current ActorService validation." The spec shows myteam/review-pipeline, but ActorService rejects non-local/ namespaces for custom actors. The recommendation is to file a follow-up issue tracking this spec gap. Will file separately.

P2 #6 — Ticket AC "real LLM API keys" checked but test makes no LLM calls ❌ (Ticket hygiene)

The reviewer is correct that no command in this test invokes an LLM. Steps 1–2 of spec Example 14 involve config set/get, diagnostics, action create, action list, actor add, plan list, and automation-profile show — none of which require LLM calls. The AC text ("All invocations use real LLM API keys") reflects the general E2E testing philosophy but is inaccurate for this specific test's scope. The spirit is met: real CLI, zero mocking, real subprocess execution. Will reword the AC on issue #760.

P3/nit items (#8, #9, #10, #15, #17, #18, cache refresh, `list_all()` return type) ❌ (Author discretion)

Per the review playbook, P3 items are at author discretion. These are acknowledged and may be addressed in follow-up work where appropriate.

Verification

E2E tests: All 7 WF14 tests pass via nox -s e2e_tests -- --suite "Wf14 Server Mode".
Coverage: 98% overall (threshold: 97%). plan_lifecycle_service.py at 96% — uncovered lines 639–658 are the list_actions() DB fallback code exercised by the E2E tests (not by unit tests).
No regressions: M1, M2, M5, and smoke E2E suites continue to pass.

# Response to Re-Review #3 — @brent.edwards **Author:** @CoreRasurae **Reviewed against:** Re-Review #3 (review #2758, commit `39c2e43c`) **Method:** Each P2 finding verified against `docs/specification.md` (Example 14, Steps 1–4, lines 41546–41907) and the CLI synopsis (lines 322–343) before accepting or rejecting. --- ## Findings Addressed ### P2 #16 — Suite documentation oversells validation scope ✅ **Accepted.** The reviewer is correct that the suite documentation claimed to validate "a distributed team scenario" without scoping what is and is not covered. **Fix:** Added explicit scope documentation to the `*** Settings ***` block: ``` Coverage: Steps 1–2 (server config, resource publishing). Steps 3–4 (multi-machine usage, team monitoring) require a live server and are deferred. ``` ### P2 #14 — Diagnostics test validates nothing server-mode-specific ✅ **Accepted (documentation fix, not assertion fix).** The reviewer is correct that the diagnostics test assertions (`config`, `database`, `disk`) would pass identically with or without server configuration — the test does not validate any server-mode-specific diagnostic output. However, the reviewer's recommended fix — adding `Should Contain Any ${combined} server Server server.url` — **cannot be applied** because the diagnostics command does not output any server-related information. I traced the full code path through `build_diagnostics_data()` (`src/cleveragents/cli/commands/system.py:447–489`): the 10 diagnostic checks are Config file, Data directory, Database, provider API keys, Disk space, File permissions, Git, Stale locks, Async workers, and Error Pattern DB. There is no `_check_server()` function and no server connectivity check. The spec Example 14 shows `Server: OK (connected to agents.example.com)` in the diagnostics output (line 41625), but this is aspirational — the feature is not yet implemented. Adding the assertion would cause a guaranteed test failure, which is worse than the current state. **Fix:** Updated the diagnostics test documentation to explicitly note the server-mode diagnostics spec gap: ``` Server-mode diagnostics (Server connectivity, Namespace membership) shown in the spec are not yet implemented in the diagnostics command (spec gap). ``` --- ## Findings Not Addressed (With Justification) ### P1 — Two-Commit Branch ❌ (Exempt) As stated in my previous comment, there is a standing exemption for E2E test PRs: when a platform bug is discovered during E2E test development that prevents the test from functioning, the bug fix may be included as a separate commit in the same PR, provided the fix is a prerequisite for the E2E test. The `fix(action)` commit (`f200d724`) fixes `list_actions()` to query the database — without this fix, the `WF14 E2E Action List By Namespace` test cannot pass. The fix is small (25 lines of production code), directly enables the test, and is referenced with `Refs: #760`. ### P2 #2 — `plan list --namespace` AC checked but CLI has no `--namespace` flag ❌ (Ticket hygiene) The reviewer is correct. I verified: - Spec Example 14 Step 4 (line 41849) shows `agents plan list --namespace myteam`. - The CLI synopsis in the spec (lines 322–323) defines `plan list` with `--phase`, `--state`, `--project`, `--action`, and `[REGEX]` — **no `--namespace` flag**. - The test already documents this at lines 121–122: *"The `--namespace` flag is not yet implemented for plan list (spec gap)."* This is a **spec contradiction** (the example uses a flag the synopsis does not define) and the AC in issue #760 should be unchecked with a note. This is a ticket-level change on issue #760, not a code change. Will update the ticket separately. ### P2 #5 — Actor uses `local/` namespace instead of `myteam/` per spec ❌ (Platform constraint, already documented) The reviewer acknowledges this is already documented in the test (lines 103–104): *"Custom actors must use the `local/` namespace prefix per the current ActorService validation."* The spec shows `myteam/review-pipeline`, but ActorService rejects non-`local/` namespaces for custom actors. The recommendation is to file a follow-up issue tracking this spec gap. Will file separately. ### P2 #6 — Ticket AC "real LLM API keys" checked but test makes no LLM calls ❌ (Ticket hygiene) The reviewer is correct that no command in this test invokes an LLM. Steps 1–2 of spec Example 14 involve `config set/get`, `diagnostics`, `action create`, `action list`, `actor add`, `plan list`, and `automation-profile show` — none of which require LLM calls. The AC text ("All invocations use real LLM API keys") reflects the general E2E testing philosophy but is inaccurate for this specific test's scope. The spirit is met: real CLI, zero mocking, real subprocess execution. Will reword the AC on issue #760. ### P3/nit items (#8, #9, #10, #15, #17, #18, cache refresh, `list_all()` return type) ❌ (Author discretion) Per the review playbook, P3 items are at author discretion. These are acknowledged and may be addressed in follow-up work where appropriate. --- ## Verification - **E2E tests:** All 7 WF14 tests pass via `nox -s e2e_tests -- --suite "Wf14 Server Mode"`. - **Coverage:** 98% overall (threshold: 97%). `plan_lifecycle_service.py` at 96% — uncovered lines 639–658 are the `list_actions()` DB fallback code exercised by the E2E tests (not by unit tests). - **No regressions:** M1, M2, M5, and smoke E2E suites continue to pass.

CoreRasurae force-pushed test/e2e-wf14-server-mode from 9fef60264d to 53b0197f0a

2026-03-24 23:35:26 +00:00

Compare

brent.edwards approved these changes 2026-03-24 23:49:49 +00:00

Dismissed

CoreRasurae force-pushed test/e2e-wf14-server-mode from 53b0197f0a to 4f2cdc2155

2026-03-25 17:07:01 +00:00

Compare

CoreRasurae dismissed brent.edwards's review 2026-03-25 17:07:01 +00:00

Reason:

New commits pushed, approval review dismissed automatically according to repository settings

CoreRasurae force-pushed test/e2e-wf14-server-mode from 4f2cdc2155 to 237c76a76f

2026-03-25 21:33:19 +00:00

Compare

CoreRasurae force-pushed test/e2e-wf14-server-mode from 237c76a76f to 4f2cdc2155

2026-03-25 21:34:38 +00:00

Compare

freemo removed a dependency 2026-03-26 15:14:39 +00:00

#965 refactor(testing): rename tdd_bug/tdd_bug_N tags to tdd_issue/tdd_issue_N across Behave and Robot Framework

freemo added a new dependency 2026-03-26 15:14:42 +00:00

#965 refactor(testing): rename tdd_bug/tdd_bug_N tags to tdd_issue/tdd_issue_N across Behave and Robot Framework

freemo removed a dependency 2026-03-26 18:27:58 +00:00

#965 refactor(testing): rename tdd_bug/tdd_bug_N tags to tdd_issue/tdd_issue_N across Behave and Robot Framework

CoreRasurae force-pushed test/e2e-wf14-server-mode from 4f2cdc2155 to d2c70bd489

2026-03-26 18:56:25 +00:00

Compare

CoreRasurae scheduled this pull request to auto merge when all checks succeed 2026-03-26 18:56:54 +00:00

CoreRasurae merged commit d2c70bd489 into master

2026-03-26 19:26:39 +00:00

CoreRasurae deleted branch test/e2e-wf14-server-mode

2026-03-26 19:26:40 +00:00

Sign in to join this conversation.

3 Participants

Notifications

Due Date

No due date set.

Blocks

#760 test(e2e): workflow example 14 — server mode team collaboration (supervised profile)

cleveragents/cleveragents-core

Depends on

#627 Implement @tdd_expected_fail tag handling in Behave environment

cleveragents/cleveragents-core

#628 Implement @tdd_expected_fail tag handling in Robot Framework

cleveragents/cleveragents-core

Reference: cleveragents/cleveragents-core#805

test(e2e): workflow example 14 — server mode team collaboration (supervised profile) #805

Summary

Manual Verification

Prerequisites

Commands

What to Look For

PM Review — Day 34

Action Items

PM Status — Day 36 (2026-03-16)

PM Status — Day 37 — Rebase Required

Code Review Report — PR #805 (test/e2e-wf14-server-mode)

Summary

HIGH Severity

H-1. [Test Coverage] Issue AC not fulfilled — missing action list --namespace and plan list --namespace tests

H-2. [Test Coverage] Actions and actors created in local/ namespace instead of team namespace

H-3. [Test Coverage] CHANGELOG claims do not match actual test coverage

MEDIUM Severity

M-1. [Test Flaw] Missing config cleanup/teardown for persistent state

M-2. [Test Flaw] Missing server.token configuration per specification

M-3. [Test Flaw] plan list test verifies non-crash only, not meaningful behavior

M-4. [Test Flaw] Epoch-based name uniqueness may fail in parallel CI

M-5. [Test Flaw] Misleading test documentation in diagnostics test

LOW Severity

L-1. [Code Quality] Unrelated cls to klass rename bundled in E2E test commit

L-2. [Code Quality] Inconsistent tag pattern across E2E suites

L-3. [Code Quality] Minimal actor YAML may not exercise full registration flow

L-4. [Security] No issues found — good practices observed

Information / Positive Notes

Verdict

Code Review — PR #805 (Ticket #760)

Verdict: Request Changes

P0:blocker — Must Fix Immediately

1. Security fix revert: shell=True re-introduced in cli_coverage_steps.py

P1:must-fix — Must Fix Before Merge

2. Ticket AC "plan list --namespace" marked complete but unfulfilled

P2:should-fix — Fix in Follow-Up PR (Within 3 Days)

3. plan list test assertion is tautological

4. diagnostics test uses expected_rc=None unnecessarily

5. Actor uses local/ namespace, contradicting "team collaboration" scenario

6. Ticket AC "real LLM API keys" unmet

7. supervised profile verification is very shallow

P3:nit — Author Discretion

8. init --yes without --force deviates from established pattern

9. Missing inter-test dependency guard on Action List test

10. Loose db substring in diagnostics assertion

11. CHANGELOG wording "plan list graceful handling" oversells the test

Items Verified as Correct ✅

Stale Self-Review Resolution Summary

Summary

Supplementary Review — PR #805 (Second Pass + P0 Correction)

⚠️ CORRECTION: P0 Finding #1 Downgraded to P3 (Branch Hygiene)

What happened

The reality

Corrected severity

Updated verdict

New Findings From Second Pass

P2 — list_actions() docstring claims DB merge but code only reads in-memory

P3 — Config commands write to ~/.cleveragents/config.toml, not CLEVERAGENTS_HOME

Revised Finding Summary

Third-Pass Review — PR #805

Upgrade: Finding #12 Elevated from P2 to P1

action list --namespace myteam will likely fail at runtime

New Findings (Not in Previous Reviews)

14. Diagnostics test validates nothing server-mode-specific

15. read_only: true in action YAML contradicts spec's Read Only: no

16. Suite documentation is misleading about validation scope

17. Token format fragility — survives redaction only by accident

18. actor add invocation syntax differs from spec

Verified Correct — No Issues

Consolidated Finding Table (All Three Reviews)

Summary

Re-Review #3 — PR #805 (Ticket #760)

Verdict: Request Changes

Progress Since Last Review

P1:must-fix — Two-Commit Branch

P2:should-fix — Remaining Items (Non-Blocking)

P3:nit — Author Discretion

Quality Gates Verified

Summary

Response to Re-Review #3 — @brent.edwards

Code Review Report — PR #805 (`test/e2e-wf14-server-mode`)

H-1. [Test Coverage] Issue AC not fulfilled — missing `action list --namespace` and `plan list --namespace` tests

H-2. [Test Coverage] Actions and actors created in `local/` namespace instead of team namespace

M-2. [Test Flaw] Missing `server.token` configuration per specification

M-3. [Test Flaw] `plan list` test verifies non-crash only, not meaningful behavior

L-1. [Code Quality] Unrelated `cls` to `klass` rename bundled in E2E test commit

1. Security fix revert: `shell=True` re-introduced in `cli_coverage_steps.py`

2. Ticket AC "`plan list --namespace`" marked complete but unfulfilled

3. `plan list` test assertion is tautological

4. `diagnostics` test uses `expected_rc=None` unnecessarily

5. Actor uses `local/` namespace, contradicting "team collaboration" scenario

7. `supervised` profile verification is very shallow

8. `init --yes` without `--force` deviates from established pattern

10. Loose `db` substring in diagnostics assertion

P2 — `list_actions()` docstring claims DB merge but code only reads in-memory

P3 — Config commands write to `~/.cleveragents/config.toml`, not `CLEVERAGENTS_HOME`

`action list --namespace myteam` will likely fail at runtime

15. `read_only: true` in action YAML contradicts spec's `Read Only: no`

18. `actor add` invocation syntax differs from spec

P2 #2 — `plan list --namespace` AC checked but CLI has no `--namespace` flag ❌ (Ticket hygiene)

P2 #5 — Actor uses `local/` namespace instead of `myteam/` per spec ❌ (Platform constraint, already documented)

P3/nit items (#8, #9, #10, #15, #17, #18, cache refresh, `list_all()` return type) ❌ (Author discretion)