test(e2e): workflow example 14 — server mode team collaboration (supervised profile) #805
No reviewers
Labels
No labels
auto/needs-reevaluation
controller-managed
auto/blocked-by-deps
auto/ci-timeout
auto/claimed-implementer
auto/claimed-merge
auto/claimed-reviewer
auto/driver-down
auto/invariant-violation
auto/last-attempt-tier-0
auto/last-attempt-tier-1
auto/last-attempt-tier-2
auto/last-attempt-tier-min
Automation Tracking
auto/needs-conflict-resolution
auto/needs-implementer
auto/postmortem
auto/ready-to-merge
auto/restart-throttled
auto/revert
auto/sentinel
auto/stale-inactivity
auto/unstable
Blocked
Bounty
$100
Bounty
$1000
Bounty
$10000
Bounty
$20
Bounty
$2000
Bounty
$250
Bounty
$50
Bounty
$500
Bounty
$5000
Bounty
$750
MoSCoW
Could have
MoSCoW
Must have
MoSCoW
Should have
Needs Feedback
Points
1
Points
13
Points
2
Points
21
Points
3
Points
34
Points
5
Points
55
Points
8
Points
88
Priority
Backlog
Priority
CI Blocker
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Signed-off: Owner
Signed-off: Scrum Master
Signed-off: Tech Lead
Spike
State
Completed
State
Duplicate
State
In Progress
State
In Review
State
Paused
State
Unverified
State
Verified
State
Wont Do
Type
Automation
Type
Bug
Type
Discussion
Type
Documentation
Type
Epic
Type
Feature
Type
Legendary
Type
Refactor
Type
Support
Type
Task
Type
Testing
No project
No assignees
3 participants
Notifications
Due date
No due date set.
Blocks
Depends on
#760 test(e2e): workflow example 14 — server mode team collaboration (supervised profile)
cleveragents/cleveragents-core
#627 Implement @tdd_expected_fail tag handling in Behave environment
cleveragents/cleveragents-core
#628 Implement @tdd_expected_fail tag handling in Robot Framework
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core!805
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "test/e2e-wf14-server-mode"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
E2E test for Workflow Example 14 — server mode team collaboration using the supervised profile. Tests CLI configuration for server settings, action/actor registration, and profile inspection (no live server required).
Closes #760
ISSUES CLOSED: #760
Manual Verification
Prerequisites
Commands
What to Look For
config set/getround-trips server.url and core.namespacediagnosticsoutputs system information without errorautomation-profile showdisplays profile detailsTracebackin any command's stderr8ecaa8226292aa055ac592aa055ac53d240a890ePM Review — Day 34
Status: Mergeable, 0 reviews, M7 (v3.6.0)
Author: @CoreRasurae
E2E test for WF14 (server mode team collaboration, supervised profile).
Action Items
PM Status — Day 36 (2026-03-16)
Day 34 review assignment deadline check. 0 reviewer activity after 2 days.
Assigned reviewer: Please acknowledge and provide an ETA for review. Prioritize M3 PRs first, then M4+ in milestone order.
PM Day 36: M7 E2E test. Merge conflict. @CoreRasurae rebase needed.
PM Status — Day 37 — Rebase Required
This PR has merge conflicts and cannot be merged in its current state. 42% of all open PRs (21 of 50) have conflicts — this is a project-wide issue that must be resolved.
@CoreRasurae — Please rebase this PR onto
masterby Day 39 EOD (2026-03-19). If you cannot rebase by then, please post a comment explaining the blocker.PM rebase request — Day 37
3d240a890e8ecaa822628ecaa8226291ccbf4aab91ccbf4aab4e134ea2baCode Review Report — PR #805 (
test/e2e-wf14-server-mode)Reviewer: Automated deep review (3 global cycles across all categories)
Scope: All changes on branch
test/e2e-wf14-server-moderelative tomaster, cross-referenced with issue #760 acceptance criteria anddocs/specification.mdExample 14.Commit:
4e134ea2by Luis Mendes —test(e2e): workflow example 14 — server mode team collaboration (supervised profile)Files changed:
robot/e2e/wf14_server_mode.robot(new, 89 lines),CHANGELOG.md(+6 lines),src/cleveragents/infrastructure/sandbox/strategy_registry.py(4 lines renamed)Summary
The PR adds an E2E Robot Framework test for Specification Workflow Example 14 (Server Mode — Team Collaboration) and includes a minor parameter rename in
strategy_registry.py. The test structure follows established patterns fromcommon_e2e.resource, but multiple gaps exist between the test coverage and the issue acceptance criteria / specification requirements. Several test-engineering concerns also need attention.Findings: 12 total — 3 High, 5 Medium, 4 Low
HIGH Severity
H-1. [Test Coverage] Issue AC not fulfilled — missing
action list --namespaceandplan list --namespacetestsFile:
robot/e2e/wf14_server_mode.robotIssue #760 AC states:
plan list --namespace"Specification Example 14 Steps 2 and 4 show:
agents action list --namespace myteamagents plan list --namespace myteamNeither command is present in the test. The test runs
action show(line 58) but neveraction list --namespace, andplan list(line 81) without--namespace.Moreover, investigation confirms
plan list(andplan lifecycle-list) do not support a--namespaceflag in the current CLI implementation. Theaction listcommand does support--namespace.Impact: Two acceptance criteria are unmet. The
plan list --namespacegap also reveals an unimplemented CLI feature the spec assumes exists.Recommendation: Add
action list --namespace myteamassertion. Forplan list, either (a) document that the--namespaceflag is not yet implemented and defer that AC, or (b) implement the flag first.H-2. [Test Coverage] Actions and actors created in
local/namespace instead of team namespaceFile:
robot/e2e/wf14_server_mode.robot, lines 44, 66The test claims to validate "team collaboration" with a shared namespace. The spec shows actions published as
myteam/generate-testsand actors asmyteam/review-pipeline. However, the test hardcodeslocal/prefixes:local/gen-tests-${epoch}local/team-rev-${epoch}The
local/prefix explicitly overrides the configuredcore.namespace = myteam, so the test never actually exercises namespace-scoped publishing.Impact: The core "team collaboration via shared namespace" scenario from the spec is not validated.
Recommendation: Create entities without the
local/prefix (e.g.,gen-tests-${epoch}) or use themyteam/prefix socore.namespaceis exercised.H-3. [Test Coverage] CHANGELOG claims do not match actual test coverage
File:
CHANGELOG.md, lines 145-149The CHANGELOG entry claims: "shared action listing, and plan monitoring across namespaces using the
supervisedautomation profile via real CLI with real LLM API keys."Three inaccuracies:
action listcommand is testedplan listis tested without--namespaceSkip If No LLM Keysguard, no LLM-dependent commands)Impact: Misleading documentation for future maintainers and reviewers.
Recommendation: Align the CHANGELOG with what the test actually exercises, or expand the test to cover the claimed behavior.
MEDIUM Severity
M-1. [Test Flaw] Missing config cleanup/teardown for persistent state
File:
robot/e2e/wf14_server_mode.robot, lines 17-25The
WF14 E2E Server Config Setuptest setsserver.urlandcore.namespacein the config but has no[Teardown]to restore them. By comparison, M6 tests consistently use[Teardown]blocks (e.g.,config set core.automation-profile manual).Config changes persist in the
config.tomlfile undercwd. If other test suites share the same data directory or run sequentially in the same workspace, these leaked settings (server.url=https://agents.example.com,core.namespace=myteam) could cause interference.Recommendation: Add
[Teardown]to the config test case, or add a suite-level[Suite Teardown]step that resetsserver.urlandcore.namespace.M-2. [Test Flaw] Missing
server.tokenconfiguration per specificationFile:
robot/e2e/wf14_server_mode.robotSpecification Example 14 Step 1 includes three config steps:
server.url,server.token, andcore.namespace. The test coversserver.url(line 17) andcore.namespace(line 22) but skipsserver.token. Sinceserver.tokenis the authentication mechanism for server mode, testing its configuration round-trip would strengthen coverage of the server mode setup flow.Recommendation: Add a
config set/config getround-trip forserver.token.M-3. [Test Flaw]
plan listtest verifies non-crash only, not meaningful behaviorFile:
robot/e2e/wf14_server_mode.robot, lines 78-83The test uses
expected_rc=None(accepts any exit code) and only assertsShould Not Be Empty ${result.stdout}${result.stderr}. Sinceplan listwithout initialization prints an error message ("No project found. Run 'agents init' first.") and aborts, the assertion passes on the error output. The test documentation says "Verify that the plan list command executes successfully" but does not actually verify success.Recommendation: Either (a) call
agents init --yes --forcebefore this test and check forexpected_rc=0, or (b) adjust the documentation to state it verifies graceful failure rather than success.M-4. [Test Flaw] Epoch-based name uniqueness may fail in parallel CI
File:
robot/e2e/wf14_server_mode.robot, lines 43, 65Uses
int(__import__('time').time())(second-level granularity) for unique name suffixes. In parallel CI runners executing within the same second, this produces identical names and risks UNIQUE constraint violations. The M6 acceptance test usesuuid4().hex[:12]which is collision-resistant even under parallelism.Recommendation: Replace epoch with
uuid4().hex[:12]for consistency with established patterns and CI safety.M-5. [Test Flaw] Misleading test documentation in diagnostics test
File:
robot/e2e/wf14_server_mode.robot, lines 28-30The documentation states: "Server connectivity will fail (no real server) so the return code is not checked." However, investigation of
system.py:build_diagnostics_data()confirms there is no server connectivity check in the diagnostics command. The 10 diagnostic checks are all local (config file, data directory, database, API keys, disk space, file permissions, git, stale locks, async workers, error patterns). Theexpected_rc=Noneis unnecessary since diagnostics should succeed (rc=0) even without a server.Recommendation: Update the documentation to reflect the actual diagnostic checks, and consider setting
expected_rc=0since no server check exists.LOW Severity
L-1. [Code Quality] Unrelated
clstoklassrename bundled in E2E test commitFile:
src/cleveragents/infrastructure/sandbox/strategy_registry.py, lines 262-297The rename of the
clsparameter toklassin the_validate_protocolstaticmethod is a valid improvement (avoids confusion with the classmethodclsconvention) but is unrelated to issue #760 or the E2E test work. Bundling unrelated changes in a test commit makes git history less navigable.Recommendation: Separate this into its own commit or PR for clean history.
L-2. [Code Quality] Inconsistent tag pattern across E2E suites
File:
robot/e2e/wf14_server_mode.robotThe WF14 test applies
[Tags] E2Eper test case (lines 15, 31, 41, 63, 80, 87), whilem6_acceptance.robotusesForce Tags E2Eat suite level. Both are valid, butForce Tagsis cleaner when all tests in a suite share the same tag.Recommendation: Consider using
Force Tags E2Eat suite level for consistency with M6.L-3. [Code Quality] Minimal actor YAML may not exercise full registration flow
File:
robot/e2e/wf14_server_mode.robot, lines 67-69The actor YAML contains only
provider: openaiandmodel: gpt-4. While this may be sufficient foractor add, the spec example shows a richer actor configuration with type, capabilities, tools, and graph nodes. A minimal YAML exercises only the happy path minimum.Recommendation: Consider adding at least
descriptionandtypefields to better exercise the actor registration flow.L-4. [Security] No issues found — good practices observed
The test correctly uses
https://agents.example.com(RFC 2606 reserved domain) for the server URL. No real tokens, API keys, or sensitive data appear in the test code.Information / Positive Notes
strategy_registry.pyrename is functionally correct and all 4 references (parameter, docstring, getattr, f-string) are consistently updated.E2E Suite Teardown(recursive removal of${SUITE_HOME}).Create Filekeyword correctly resolves from theOperatingSystemlibrary imported transitively viacommon_e2e.resource.Should Contain Anyassertions in the diagnostics test correctly match the actual diagnostic category names ("Config file", "Database", "Disk space").automation-profile show supervisedtest (line 88) correctly targets a built-in profile that doesn't require database initialization.Verdict
REQUEST CHANGES — The test has structural correctness but significant coverage gaps relative to issue #760 acceptance criteria and the specification. The high-severity items (H-1, H-2, H-3) should be addressed before merge. Medium-severity items (M-1 through M-5) represent test-engineering improvements that strengthen reliability and maintainability.
4e134ea2ba8f3c15aa088f3c15aa08064d448cceCode Review — PR #805 (Ticket #760)
Branch:
test/e2e-wf14-server-modeAuthor: @CoreRasurae
Reviewer: @brent.edwards
Method: Multi-focus parallel review (7 threads: spec compliance, bug detection, test quality, code quality, security, strategy_registry analysis, stale review triage) + fresh-eyes pass
Verdict: Request Changes
There is 1 P0:blocker finding (security fix revert), 1 P1:must-fix (unmet acceptance criterion), and 5 P2:should-fix items. The P0 alone requires immediate attention before merge.
Note on stale self-review: The detailed self-review by @CoreRasurae (comment #67681, against commit
4e134ea2) identified 12 issues. The current commit064d448caddresses 6 of 8 HIGH/MEDIUM findings. H-1 and H-2 are partially fixed; M-3 remains. Thestrategy_registry.pyunrelated change (L-1) was correctly removed. Good rebase work overall.P0:blocker — Must Fix Immediately
1. Security fix revert:
shell=Truere-introduced incli_coverage_steps.pyFile:
features/steps/cli_coverage_steps.py:43–44This PR reverts the security fix from commit
bd18491b(fix(test): replace shell=True with shell=False in cli_coverage_steps.py subprocess call, issue #734). The diff shows:import shlexremovedshlex.split(command)replaced back withcommand+shell=TrueThe original fix was explicitly described as "defense-in-depth command injection prevention, consistent with the existing pattern in
cli_plan_context_commands_steps.py." Reverting it re-introducesshell=Truesubprocess execution, which the project's security standards prohibit. The comment says "Use shell=True to handle quoted arguments correctly" — suggesting the fix broke a test scenario.Impact: P0 — A security fix merged to master 4 days ago is being silently reverted in an unrelated E2E test PR with no justification or discussion. Even in test code,
shell=Trueis a security concern per the project's own standards and the original fix's rationale.Recommendation: Either (a) keep the
shlex.split()approach and fix whatever test it broke (the correct solution), or (b) ifshell=Trueis genuinely needed for quoted argument handling, explicitly document WHY in a code comment, add# nosec B602with justification, and create a follow-up issue to find a secure alternative. Do NOT silently revert a security fix. The CHANGELOG entry for #734 must also be restored.P1:must-fix — Must Fix Before Merge
2. Ticket AC "
plan list --namespace" marked complete but unfulfilledFile:
robot/e2e/wf14_server_mode.robot:112–119Ticket #760 acceptance criterion states: "Test monitors plans across the namespace via
plan list --namespace" — this checkbox is marked done in the ticket. Specification Example 14 Step 4 explicitly showsagents plan list --namespace myteam. However, the CLI'splan listcommand does not support a--namespaceflag (confirmed incli/commands/plan.py). The test runs bareplan listwithexpected_rc=None.The test documentation (lines 115–116) honestly notes this as a "spec gap," which is good — but the ticket AC is marked complete when the condition cannot be fulfilled.
Recommendation: Uncheck this AC item in ticket #760 and add a note that it's deferred until the
--namespaceflag is implemented. A checked-off AC that isn't actually validated is misleading for milestone tracking.P2:should-fix — Fix in Follow-Up PR (Within 3 Days)
3.
plan listtest assertion is tautologicalFile:
robot/e2e/wf14_server_mode.robot:117–119expected_rc=None(accepts any exit code) +Should Not Be Empty ${result.stdout}${result.stderr}means this test passes whenever the command produces any output on stdout or stderr — including error messages, tracebacks, or deprecation warnings. Sinceinit --yeswas already called in suite setup,plan listshould succeed with rc=0.Recommendation: Change to
expected_rc=0. Ifplan listgenuinely fails after init, that's a bug worth catching rather than masking with a permissive assertion.4.
diagnosticstest usesexpected_rc=NoneunnecessarilyFile:
robot/e2e/wf14_server_mode.robot:59The documentation correctly notes diagnostics checks are all local and should succeed without a live server. Yet
expected_rc=Noneis used. Without--check, diagnostics always returns rc=0.Recommendation: Change to
expected_rc=0to match the documentation's own assertion that the command should succeed.5. Actor uses
local/namespace, contradicting "team collaboration" scenarioFile:
robot/e2e/wf14_server_mode.robot:99The actor is created with
local/team-rev-${RUN_SUFFIX}. Spec Example 14 showsmyteam/review-pipeline. The action was correctly fixed to usemyteam/prefix, but the actor was not. The documentation (lines 97–98) explains this is due to ActorService validation, which is honest — but means the "team collaboration" scenario is only half-exercised for namespace behavior.Recommendation: If ActorService requires
local/, file a follow-up issue tracking this spec gap and reference it in the test documentation.6. Ticket AC "real LLM API keys" unmet
File:
robot/e2e/wf14_server_mode.robot(entire file)Ticket #760 AC states: "All invocations use real LLM API keys — no mocking, stubbing, or test doubles." No command in this test invokes an LLM. The PR description itself says "No LLM API key required." The AC is marked done but the condition isn't met.
Recommendation: Update ticket #760 to remove or reword the LLM AC — server mode config testing genuinely doesn't need LLM calls. Alternatively, add one LLM-dependent step guarded by
Skip If No LLM Keys.7.
supervisedprofile verification is very shallowFile:
robot/e2e/wf14_server_mode.robot:121–124The test only checks that the output contains the string "supervised." The M6 suite's equivalent test checks for
auto_strategize,auto_execute, andauto_applyfields. For a test whose title emphasizes the "supervised automation profile," only verifying the name is insufficient.Recommendation: Add assertions for supervised-specific fields:
P3:nit — Author Discretion
8.
init --yeswithout--forcedeviates from established patternrobot/e2e/wf14_server_mode.robot:18— M6 usesinit --force --yes. While WF14's fresh temp dir makes this unlikely to fail,--forceprovides defense-in-depth.9. Missing inter-test dependency guard on Action List test
robot/e2e/wf14_server_mode.robot:88–93— If the Action Create test fails, Action List will fail with an unclear "variable not found" error. Other suites use[Setup] Variable Should Existguards.10. Loose
dbsubstring in diagnostics assertionrobot/e2e/wf14_server_mode.robot:63—dbis only 2 characters and could match unrelated content. Keep onlydatabaseandDatabase.11. CHANGELOG wording "plan list graceful handling" oversells the test
The test only verifies non-empty output with any exit code. "Plan list smoke test" would be more accurate.
Items Verified as Correct ✅
Force Tags E2Eat suite level — consistent with M6uuid4().hex[:12]for unique names — CI-safeserver.tokenround-trip test — added per spec Step 1action list --namespace myteam— added per spec Step 2myteam/prefix — matches spechttps://agents.example.com(RFC 2606 reserved)Type/Testing,State/In Review, correct milestoneISSUES CLOSED: #760in commit footerStale Self-Review Resolution Summary
064d448caction list --namespaceplan list --namespaceacknowledged as spec gap)local/namespace on entitiesmyteam/; actor documented as ActorService constraintSummary
The E2E test itself is well-structured and has been substantially improved since the self-review. Config round-trips, namespace-scoped action listing, proper teardown, and uuid-based uniqueness are all solid. The test covers Steps 1–2 of spec Example 14 well.
However, the P0 finding is a hard blocker: this PR silently reverts a security fix (
shell=Truesubprocess call) from issue #734 that was merged to master 4 days ago. The CHANGELOG entry documenting that fix is also deleted. This must be addressed before any other consideration.Beyond that, the remaining findings are primarily about tightening assertions (P2 #3, #4), documenting spec gaps properly (P1 #2, P2 #5, #6), and making the supervised profile test more meaningful (P2 #7).
Supplementary Review — PR #805 (Second Pass + P0 Correction)
Reviewer: @brent.edwards
Method: 3 deep-dive agents (shell=True analysis, YAML/CLI tracing, CHANGELOG integrity) + manual verification
⚠️ CORRECTION: P0 Finding #1 Downgraded to P3 (Branch Hygiene)
My initial review incorrectly classified the
cli_coverage_steps.pychange as P0:blocker ("Security fix revert"). I owe @CoreRasurae an apology — the analysis was wrong.What happened
I used a two-dot diff (
git diff origin/master..HEAD) which shows all differences between the branch tips, including divergent history. This made it appear that the PR actively reverts theshlex.split()security fix (commitbd18491b, issue #734) and deletes its CHANGELOG entry.The reality
The three-dot diff (
git diff origin/master...HEAD) — which correctly isolates only changes introduced by the PR's own commit — shows the PR modifies exactly 2 files:CHANGELOG.md(+6 lines — the #760 entry only)robot/e2e/wf14_server_mode.robot(+124 lines — new file)features/steps/cli_coverage_steps.pyis NOT modified by this PR. The security fix was merged to master after this branch was created. The branch simply doesn't have it yet. On merge (or rebase), Git's three-way merge will correctly preserve the security fix and its CHANGELOG entry — I verified this withgit merge-tree --write-tree.Corrected severity
Updated verdict
With the P0 removed, the highest remaining finding is P1 #2 (
plan list --namespaceAC marked complete but unfulfilled). The verdict remains Request Changes due to this P1 plus the P2 items, but the urgency is significantly reduced.New Findings From Second Pass
P2 —
list_actions()docstring claims DB merge but code only reads in-memoryFile:
src/cleveragents/application/services/plan_lifecycle_service.py:617–642The
list_actions()docstring (line 624–625) says: "When persistence is enabled, results are merged from the in-memory cache and the database layer." However, line 634 doesactions = list(self._actions.values())— reading ONLY the in-memory dict, with no database fallback. Contrast withget_action()(lines 583–592) which correctly falls back to the database.Since each
Run CleverAgents Commandcall creates a new subprocess (→ newPlanLifecycleServicewith empty_actions), theWF14 E2E Action List By Namespacetest (line 92) depends on actions being loaded from DB into_actionsduring service initialization. If that preload doesn't happen, the test will fail at runtime.Note: This is a pre-existing issue, not introduced by this PR. But it's relevant because:
Recommendation: Verify the test actually passes by running
nox -s e2e_tests. If it fails,list_actions()needs a DB fallback likeget_action()has.P3 — Config commands write to
~/.cleveragents/config.toml, notCLEVERAGENTS_HOMEFile:
src/cleveragents/cli/commands/config.py:50–51The
config set/getcommands usePath.home() / ".cleveragents" / "config.toml"for the config file, ignoring theCLEVERAGENTS_HOMEenvironment variable. This means the WF14 test'sconfig set server.url ...writes to the real home directory, not the per-suite temp directory. While the suite teardown (lines 29–31) resets the values, a crash during teardown could leave stale config in~/.cleveragents/config.toml.This is a pre-existing platform issue, not a test bug, but it's worth noting for local developer runs.
Revised Finding Summary
1P0→ P3Security fix revert→ Branch needs rebaseplan list --namespaceAC marked complete but CLI doesn't support flagplan listtautological assertion (expected_rc=None)diagnosticsusesexpected_rc=Noneunnecessarilylocal/namespace, not team namespace per specsupervisedprofile verification is shallowinit --yeswithout--forcedbsubstring matchlist_actions()docstring claims DB merge but code doesn't — test may fail~/notCLEVERAGENTS_HOME— teardown crash could leak configFinal tally: 0 P0 + 1 P1 + 6 P2 + 6 P3 = 13 findings
Third-Pass Review — PR #805
Reviewer: @brent.edwards
Method: 3 deep-dive agents (Robot Framework keyword semantics, CLI command-by-command source trace, verbatim spec-to-test mapping) + manual verification
This third pass approached the code from angles the first two passes didn't cover: Robot Framework escape/scoping semantics, whether each CLI command will actually succeed by tracing through the source code, and a verbatim command-by-command comparison against Specification Example 14.
Upgrade: Finding #12 Elevated from P2 to P1
action list --namespace myteamwill likely fail at runtimeFile:
src/cleveragents/application/services/plan_lifecycle_service.py:634Test file:
robot/e2e/wf14_server_mode.robot:92–93Two independent deep-dive agents traced the full code path and both reached the same conclusion:
list_actions()at line 634 reads ONLY from the in-memory_actionsdict:Each
Run CleverAgents Commandcall spawns a new subprocess → freshPlanLifecycleService→ empty_actions = {}. The docstring (line 624) falsely claims: "results are merged from the in-memory cache and the database layer."Contrast with
get_action()(lines 587–592) which correctly falls back to the DB, andActorService.list_actors()which reads from DB.list_actions()is the only action method without a DB fallback.Result:
action list --namespace myteamreturns[]→ prints "No actions found." → assertionOutput Should Contain ${list_result} gen-testsfails.This is a pre-existing platform bug (not introduced by this PR), but the test depends on it. Unless there's a preload mechanism I couldn't find, the
WF14 E2E Action List By Namespacetest case will fail when run vianox -s e2e_tests.Recommendation: Either (a) verify the test actually passes by running
nox -s e2e_testsand share the output, or (b) fixlist_actions()to queryctx.actions.list_available(namespace)likeget_action()does.Severity: Elevated to P1 — likely runtime test failure.
New Findings (Not in Previous Reviews)
14. Diagnostics test validates nothing server-mode-specific
File:
robot/e2e/wf14_server_mode.robot:62–64Severity: P2
Spec Example 14's diagnostics output prominently shows "Server: OK (connected to agents.example.com)" and a dedicated Server panel with URL, Auth, Namespace, Members, Shared Actions, Latency. The entire reason for running diagnostics in Example 14 is to verify server connectivity.
The test only checks for generic diagnostic categories (
config,database,disk) that appear identically with or without any server configuration. The test would pass even if the server config step were removed entirely.Recommendation: Add at minimum
Should Contain Any ${combined} server Server server.urlto verify the server-mode diagnostic code path is exercised.15.
read_only: truein action YAML contradicts spec'sRead Only: noFile:
robot/e2e/wf14_server_mode.robot:77Severity: P3
Spec Example 14 Step 2 (line 41666) shows the created action has
Read Only: no. The test YAML setsread_only: true. This is a verbatim contradiction — the test creates an action with the opposite read-only setting from what the spec demonstrates. The generate-tests action in the spec is explicitly writable because test generation involves writing test files.Recommendation: Change to
read_only: falseor remove the field (default isfalse).16. Suite documentation is misleading about validation scope
File:
robot/e2e/wf14_server_mode.robot:2–7Severity: P2
The suite documentation says: "Validates the supervised automation profile in a distributed team scenario where multiple engineers share actions, actors, and projects via CleverAgents server mode." In reality, Steps 3 (multi-machine shared resource usage, 4 commands) and Step 4 (team plan monitoring, 2 commands) are entirely untested — 46% of Example 14's commands have zero coverage.
No centralized documentation explains what's in scope and what's deferred. Individual test cases document their own gaps (e.g.,
plan list --namespaceis a "spec gap"), but a reader of the suite description would expect full Example 14 coverage.Recommendation: Add explicit scope documentation:
17. Token format fragility — survives redaction only by accident
File:
robot/e2e/wf14_server_mode.robot:44, 47Severity: P3
tok_e2e_test_placeholderpasses throughconfig getunredacted because:config getin "rich" format doesn't apply redaction (unlikeconfig listwhich does)e2e_test_placeholderbreak thetok_[A-Za-z0-9]{10,}regex patternIf someone changes the token to
tok_e2eTestPlaceholder(no underscores), or adds--format jsonto theconfig getcall, the assertion would silently break.Recommendation: Add a comment explaining the token format choice, or use a token that clearly isn't redactable (e.g.,
test_placeholder_token).18.
actor addinvocation syntax differs from specFile:
robot/e2e/wf14_server_mode.robot:106Severity: P3
Spec shows:
agents actor add --config ./actors/review-pipeline.yaml(name from YAML).Test runs:
actor add ${actor_name} --config ${yaml_file}(name as positional arg, YAML has noname:field).This exercises a different CLI code path (positional name + config-override vs. config-only).
Recommendation: Either add
name: ${actor_name}to the YAML and useactor add --config ${yaml_file}, or document the deviation.Verified Correct — No Issues
\ninCatenate SEPARATORandSet VariableShould Contain Anysemantics${EMPTY}throughRun ProcessCreate Filefrom OperatingSystem libraryOutput Should Containpartial matchingserver.tokenNOT redacted inconfig get--randomizereusable: true)True, Pydantic accepts itISSUES CLOSED: #760Consolidated Finding Table (All Three Reviews)
1P0→ P3Security fix revert→ Branch needs rebaseplan list --namespaceAC marked complete but CLI doesn't support flagaction list --namespacelikely fails at runtime —list_actions()has no DB fallbackplan listtautological assertiondiagnosticsusesexpected_rc=Noneunnecessarilylocal/namespacesupervisedprofile verification is shallow~/notCLEVERAGENTS_HOMEinit --yeswithout--forcedbsubstring matchread_only: truecontradicts spec'sRead Only: noactor addsyntax differs from specFinal tally: 0 P0 + 2 P1 + 7 P2 + 10 P3 = 19 findings
Summary
This third pass uncovered 4 genuinely new findings (diagnostics not testing server mode,
read_onlycontradiction, misleading suite scope docs, token fragility) and elevated one existing finding (action list DB fallback) based on strong CLI source-code evidence.The most significant new finding is that the diagnostics test (P2 #14) validates nothing server-mode-specific — the exact same assertions would pass on a vanilla install with no server configuration. Combined with the suite documentation claiming to validate "a distributed team scenario" (P2 #16), the test oversells its coverage of Example 14's server-mode aspects.
The Robot Framework keyword semantics are sound —
\nescapes,Catenate,Create File,Should Contain Any, and variable scoping all work correctly.Verdict: Request Changes (2 P1 + 7 P2). The P1 items (unfulfilled
plan list --namespaceAC and probableaction listruntime failure) should be resolved before merge. The P2 items are quality improvements that can be addressed in this PR or follow-ups.064d448cce20c5bfff8e20c5bfff8e39c2e43c08Re-Review #3 — PR #805 (Ticket #760)
Reviewer: @brent.edwards
Method: 3 parallel deep-dive agents (spec compliance, bug detection, code quality/commits) against latest HEAD
39c2e43c, plus review playbook checklistsPrevious reviews: Review #2684 (REQUEST_CHANGES, 11 findings), Supplementary (P0 correction + 2 new), Third-pass (4 new + 1 elevation → 19 total)
Verdict: Request Changes
There is 1 P1:must-fix (two-commit branch violating atomic commit rules). Once squashed, the remaining findings are all P2/P3 and would not block merge.
Progress Since Last Review
The author has done excellent work addressing the most impactful findings. 9 of the 19 findings from my previous reviews are fully resolved:
1Security fix revert(corrected to P3: branch needs rebase)origin/master3plan listtautological assertion (expected_rc=None)expected_rc=0(line 117)4diagnosticsusesexpected_rc=Noneunnecessarilyexpected_rc=0(line 59)7supervisedprofile verification shallowauto_strategize,auto_execute,auto_apply(lines 127–129)12list_actions()no DB fallback (elevated to P1)ActionRepository.list_all()added. Well-engineered with proper error handling.uuid4().hex[:12](line 22)Force Tags E2Eat suite level (line 11)The
list_actions()fix in particular is solid — the DB query logic, state filtering, namespace handling,DatabaseErrorfallback, and in-memory cache refresh are all correct. I traced the full code path fromaction create→ DB persist →action list --namespace myteam→get_by_namespace("myteam")and it holds up.P1:must-fix — Two-Commit Branch
Location: Branch
test/e2e-wf14-server-mode— commitsf200d724and39c2e43cThe branch contains two commits both referencing issue #760:
f200d724—fix(action): query persisted actions from database in list_actions(Refs: #760)39c2e43c—test(e2e): workflow example 14 — ...(ISSUES CLOSED: #760)CONTRIBUTING.md line 890: "Single Commit — Corresponds to exactly one commit. One issue produces one commit." Line 950: "If an issue requires decomposition into multiple commits, it must be promoted to an Epic and broken into separate issues."
The
fix(action)commit is a legitimate bug fix — it correctly addresses a pre-existing platform issue I flagged in my previous review. However, it either needs:Type/Bugissue for thelist_actions()fix, move it to its own branch/PR, and rebase this PR on top of it after merge.Option (a) is recommended since the fix is only 25 lines of production code and the test is its primary validation.
If squashing, keep the test commit's first line (matching ticket metadata), and include the fix description in the commit body.
P2:should-fix — Remaining Items (Non-Blocking)
These remain from previous reviews. They are documentation/ticket hygiene issues, not code bugs. Per the playbook, they can be fixed in a follow-up PR within 3 days.
plan list --namespaceAC checked in ticket #760 but CLI has no--namespaceflaglocal/namespace (line 99) instead ofmyteam/per specservertoShould Contain Any.P3:nit — Author Discretion
init --yeswithout--force(line 18)db2-char substring in diagnostics assertion (line 63)read_only: truecontradicts spec'sRead Only: no(line 77)actor addsyntax (positional name + config) differs from spec (config-only)if key not in) — low practical impact since CLI invocations are isolatedActionRepository.list_all()returnslist[Any]— matches pre-existingActionRepositoryconvention but is weaker thanActorRepository.list_all()which returnslist[Actor]Quality Gates Verified
nox -e lintnox -e typecheckstrategy_registry.py)test/e2e-wf14-server-modeISSUES CLOSED: #760in footerType/Testing,State/In Review, milestone v3.6.0https://agents.example.comis RFC 2606 reserved)Summary
This is a well-structured E2E test that has been substantially improved across review cycles. The
list_actions()DB fix is correct and well-engineered. The test covers Steps 1–2 of Specification Example 14 thoroughly — config round-trips, namespace-scoped action publishing and listing, actor registration, plan list smoke, and supervised profile threshold verification.The only blocking item is the two-commit branch. A simple
git rebase -isquash resolves it. Once that's done and the P2 ticket-hygiene items are noted (uncheck unfulfilled ACs, add scope documentation), this PR is ready to merge.39c2e43c089fef60264dP1:must-fix — Two-Commit Branch - This is not a problem and will not be fixed. There a special exemption rule that allows the fixes for a given end to end test to be provided as a previous commit to the end to end tests itself and be merged as part of the same commit. All that is requires is that the fix needs to complement the end to end tests by fixing the bug that impeded the end to end test to work.
Response to Re-Review #3 — @brent.edwards
Author: @CoreRasurae
Reviewed against: Re-Review #3 (review #2758, commit
39c2e43c)Method: Each P2 finding verified against
docs/specification.md(Example 14, Steps 1–4, lines 41546–41907) and the CLI synopsis (lines 322–343) before accepting or rejecting.Findings Addressed
P2 #16 — Suite documentation oversells validation scope ✅
Accepted. The reviewer is correct that the suite documentation claimed to validate "a distributed team scenario" without scoping what is and is not covered.
Fix: Added explicit scope documentation to the
*** Settings ***block:P2 #14 — Diagnostics test validates nothing server-mode-specific ✅
Accepted (documentation fix, not assertion fix). The reviewer is correct that the diagnostics test assertions (
config,database,disk) would pass identically with or without server configuration — the test does not validate any server-mode-specific diagnostic output.However, the reviewer's recommended fix — adding
Should Contain Any ${combined} server Server server.url— cannot be applied because the diagnostics command does not output any server-related information. I traced the full code path throughbuild_diagnostics_data()(src/cleveragents/cli/commands/system.py:447–489): the 10 diagnostic checks are Config file, Data directory, Database, provider API keys, Disk space, File permissions, Git, Stale locks, Async workers, and Error Pattern DB. There is no_check_server()function and no server connectivity check. The spec Example 14 showsServer: OK (connected to agents.example.com)in the diagnostics output (line 41625), but this is aspirational — the feature is not yet implemented.Adding the assertion would cause a guaranteed test failure, which is worse than the current state.
Fix: Updated the diagnostics test documentation to explicitly note the server-mode diagnostics spec gap:
Findings Not Addressed (With Justification)
P1 — Two-Commit Branch ❌ (Exempt)
As stated in my previous comment, there is a standing exemption for E2E test PRs: when a platform bug is discovered during E2E test development that prevents the test from functioning, the bug fix may be included as a separate commit in the same PR, provided the fix is a prerequisite for the E2E test. The
fix(action)commit (f200d724) fixeslist_actions()to query the database — without this fix, theWF14 E2E Action List By Namespacetest cannot pass. The fix is small (25 lines of production code), directly enables the test, and is referenced withRefs: #760.P2 #2 —
plan list --namespaceAC checked but CLI has no--namespaceflag ❌ (Ticket hygiene)The reviewer is correct. I verified:
agents plan list --namespace myteam.plan listwith--phase,--state,--project,--action, and[REGEX]— no--namespaceflag.--namespaceflag is not yet implemented for plan list (spec gap)."This is a spec contradiction (the example uses a flag the synopsis does not define) and the AC in issue #760 should be unchecked with a note. This is a ticket-level change on issue #760, not a code change. Will update the ticket separately.
P2 #5 — Actor uses
local/namespace instead ofmyteam/per spec ❌ (Platform constraint, already documented)The reviewer acknowledges this is already documented in the test (lines 103–104): "Custom actors must use the
local/namespace prefix per the current ActorService validation." The spec showsmyteam/review-pipeline, but ActorService rejects non-local/namespaces for custom actors. The recommendation is to file a follow-up issue tracking this spec gap. Will file separately.P2 #6 — Ticket AC "real LLM API keys" checked but test makes no LLM calls ❌ (Ticket hygiene)
The reviewer is correct that no command in this test invokes an LLM. Steps 1–2 of spec Example 14 involve
config set/get,diagnostics,action create,action list,actor add,plan list, andautomation-profile show— none of which require LLM calls. The AC text ("All invocations use real LLM API keys") reflects the general E2E testing philosophy but is inaccurate for this specific test's scope. The spirit is met: real CLI, zero mocking, real subprocess execution. Will reword the AC on issue #760.P3/nit items (#8, #9, #10, #15, #17, #18, cache refresh,
list_all()return type) ❌ (Author discretion)Per the review playbook, P3 items are at author discretion. These are acknowledged and may be addressed in follow-up work where appropriate.
Verification
nox -s e2e_tests -- --suite "Wf14 Server Mode".plan_lifecycle_service.pyat 96% — uncovered lines 639–658 are thelist_actions()DB fallback code exercised by the E2E tests (not by unit tests).9fef60264d53b0197f0a53b0197f0a4f2cdc2155New commits pushed, approval review dismissed automatically according to repository settings
4f2cdc2155237c76a76f237c76a76f4f2cdc21554f2cdc2155d2c70bd489