test(e2e): E2E acceptance criteria for M6 (v3.5.0) — autonomy hardening #803
No reviewers
Labels
No labels
auto/needs-reevaluation
controller-managed
auto/blocked-by-deps
auto/ci-timeout
auto/claimed-implementer
auto/claimed-merge
auto/claimed-reviewer
auto/driver-down
auto/invariant-violation
auto/last-attempt-tier-0
auto/last-attempt-tier-1
auto/last-attempt-tier-2
auto/last-attempt-tier-min
Automation Tracking
auto/needs-conflict-resolution
auto/needs-implementer
auto/postmortem
auto/ready-to-merge
auto/restart-throttled
auto/revert
auto/sentinel
auto/stale-inactivity
auto/unstable
Blocked
Bounty
$100
Bounty
$1000
Bounty
$10000
Bounty
$20
Bounty
$2000
Bounty
$250
Bounty
$50
Bounty
$500
Bounty
$5000
Bounty
$750
MoSCoW
Could have
MoSCoW
Must have
MoSCoW
Should have
Needs Feedback
Points
1
Points
13
Points
2
Points
21
Points
3
Points
34
Points
5
Points
55
Points
8
Points
88
Priority
Backlog
Priority
CI Blocker
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Signed-off: Owner
Signed-off: Scrum Master
Signed-off: Tech Lead
Spike
State
Completed
State
Duplicate
State
In Progress
State
In Review
State
Paused
State
Unverified
State
Verified
State
Wont Do
Type
Automation
Type
Bug
Type
Discussion
Type
Documentation
Type
Epic
Type
Feature
Type
Legendary
Type
Refactor
Type
Support
Type
Task
Type
Testing
No project
No assignees
3 participants
Notifications
Due date
No due date set.
Blocks
#746 test(e2e): E2E acceptance criteria for M6 (v3.5.0) — autonomy hardening
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core!803
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "test/e2e-m6-acceptance"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
E2E acceptance test for M6 (v3.5.0) — autonomy hardening. Comprehensive test covering session management, automation profile resolution, guard enforcement, custom profile creation, and full plan lifecycle across multiple automation profiles (ci, manual, trusted, full-auto).
Closes #746
ISSUES CLOSED: #746
Manual Verification
Prerequisites
OPENAI_API_KEYorGEMINI_API_KEYenvironment variable setCommands
What to Look For
automation-profile listshows all built-in profilesconfig set/getround-trips the automation profile settingciprofileTracebackin any command's stderrCode Review Report — PR #803 (
test/e2e-m6-acceptance)E2E acceptance criteria for M6 (v3.5.0) — autonomy hardening
Reviewed commit
275d5ac7by Luis Mendes against issue #746 acceptance criteria, milestone v3.5.0 spec (docs/specification.md), and CHANGELOG entry. Three full review cycles performed across all categories until no new issues were found.Verdict: REQUEST CHANGES — 5 critical/high issues must be addressed before merge.
Summary
CRITICAL (2)
C1.
str.index('{')crashes withValueErrorif stdout has no JSONFiles:
m6_acceptance.robot:31, 66, 192str.index()raisesValueErrorif stdout contains no{. A zero return code does not guarantee JSON on stdout — the CLI could emit plain text. The test crashes with an opaque Python traceback instead of a meaningful assertion failure.Fix: Guard with
str.find():C2. CHANGELOG misrepresents test scope
File:
CHANGELOG.md(commit13bbea90)The entry claims the test "Exercises A2A facade session/plan lifecycle, event queue pub/sub, guard enforcement (denylist, budget caps, tool call limits), automation profile resolution precedence, and full autonomy acceptance flow." None of the bolded features are actually tested. This misrepresents coverage to reviewers and stakeholders.
Fix: Rewrite to accurately describe what is tested (session CRUD, profile list/show/config, resource+project setup, plan lifecycle skeleton, and basic guard output check).
HIGH (5)
H1. Tests silently pass when features are broken
Files:
m6_acceptance.robot:153, 173, 190, 197, 201, 204Multiple tests use
Run Keyword If ${rc} == 0 <assertion>with noELSEbranch. If the CLI command fails (rc != 0), zero assertions run and the test passes silently. The three most critical LLM-dependent tests (Plan Lifecycle,Guard Enforcement,Full Autonomy) can never produce a FAIL result — they either pass or silently skip.Fix: Add
ELSEbranches thatFailorSkipwith a diagnostic message.H2. Python code injection via triple-quoted string interpolation
File:
m6_acceptance.robot:47${combined}is CLI output interpolated into a Python expression inside triple-quoted strings. If output contains""", backslashes, or Python-significant characters, this producesSyntaxErroror evaluates to wrong result.Fix: Use
$combinedobject-reference syntax (no braces, no interpolation).H3. Misleading test name: "Guard Enforcement Via Profile"
File:
m6_acceptance.robot:155-173, keyword at 41-48The test claims to verify guard enforcement but only checks if the words "manual" or "automation" appear anywhere in stdout/stderr. This proves nothing about guard enforcement — any help text or error message containing those common words satisfies it.
H4. Dead code:
Full Flow Apply Stepkeyword has zero assertionsFile:
m6_acceptance.robot:50-55This keyword runs
plan lifecycle-apply, logs output, and returns. It contains no assertions and cannot cause a test failure, giving false impression that apply is validated.H5. Missing acceptance criteria coverage (5 of 10 criteria unmet)
Per issue #746 and milestone spec, the following are not covered at all:
code-reviewaction)MEDIUM (11)
M1. JSON parsing vulnerable to trailing non-JSON output
Files:
m6_acceptance.robot:32, 67, 193json_stris sliced from first{to end of stdout. Trailing text after JSON causesJSONDecodeError. Line 67 uses unsafe['session_id']subscript (raisesKeyError) while lines 32/193 use safe.get().M2. Config test has no
[Teardown]— leaves state dirty on failureFile:
m6_acceptance.robot:107-116If the test fails between setting profile to
ci(line 107) and resetting tomanual(line 115), the profile remainscifor subsequent tests.M3. Case-insensitive substring matching too weak for JSON field validation
Files:
m6_acceptance.robot:29, 44, 64, etc.Output Should Containchecks if text likeplan_idappears anywhere in stdout+stderr (case-insensitive). An error message containing "plan_id" would satisfy the assertion.M4. Deprecated
Run Keyword Ifused throughout (RF 7.x)Files:
m6_acceptance.robot:33, 39, 153, 173, 190, 197, 201, 204; common_e2e.resource:69, 80Run Keyword Ifhas been deprecated since RF 5.0 in favor ofIF/ELSE IF/ELSE. RF 7.x emits deprecation warnings. All 10 call-sites should be migrated.M5. Hardcoded
local/code-reviewaction with no precondition checkFile:
m6_acceptance.robot:151, 171, 189All three LLM-dependent tests hardcode
local/code-review. No setup verifies this action exists. If missing, tests silently skip via conditional gates — producing vacuous results.M6.
Run CleverAgents Commandomitscwd— working directory non-deterministicFile:
common_e2e.resource:62Run Processdoes not setcwd. Commands that resolve relative paths behave differently depending on where Robot was launched.M7. Second-precision suffix causes collision risk in parallel runs
File:
m6_acceptance.robot:23strftime('%Y%m%d%H%M%S')has only second granularity. Parallel CI jobs within the same second get identical suffixes.M8. No cleanup of database entities on mid-test failure
File:
m6_acceptance.robot:58-78Session created at line 62, deleted at line 76. If any assertion fails between, the session is orphaned.
M9. No
[Teardown]on any test case — test order dependencyFile:
m6_acceptance.robot(all test cases)No test uses
[Teardown]. Tests mutate shared state (database, config, git repos) and rely on execution order. Running with--randomize testswould fail.M10. Redundant
config setin Guard test (line 168) is dead codeFile:
m6_acceptance.robot:168Sets
core.automation-profile=manualglobally, then line 171 passes--automation-profile manualvia CLI flag (which takes precedence). The config set is misleading dead code.M11.
Skip If No LLM Keysvulnerable to special characters in API key valuesFile:
common_e2e.resource:53API key values are interpolated into single-quoted Python expression. A key containing a single quote causes
SyntaxError.LOW (8)
L1.
Create Temp Git Repoignores git command return codesFile:
common_e2e.resource:92-98Five
Run Processcalls for git operations never checkrc. A missinggitbinary causes confusing downstream failures.L2.
Collectionslibrary imported but never usedFile:
common_e2e.resource:10Dead import.
L3.
[Tags] E2Emanually repeated instead ofForce TagsFile:
m6_acceptance.robot— all 8 test casesEvery test manually specifies
[Tags] E2E. Should useForce Tags E2Ein Settings. A newly added test omitting the tag would silently drop out of CI tag-filtered runs.L4.
expected_rc=Nonestring comparison fragilityFiles:
common_e2e.resource:69; m6_acceptance.robot:38, 53, etc.'${expected_rc}' != 'None'compares against the string literal"None". Any refactor to use RF-native${NONE}silently breaks the comparison.L5. Environment variables not saved/restored
File:
common_e2e.resource:30-34, 43-44Setup clobbers env vars without saving originals. Teardown removes them but never restores previous values.
L6.
Run Keyword And Ignore Errorsilently swallows directory removal failuresFile:
common_e2e.resource:28If removing a previous run's directory fails for a real reason (locked file, permissions), the error is swallowed and the suite proceeds with a partially-cleaned directory.
L7. No version assertion for v3.5.0
File:
m6_acceptance.robot:2Documentation claims "M6 (v3.5.0)" but no test asserts the actual CLI version.
L8. Worst-case single test duration is 13 minutes
File:
m6_acceptance.robot:175-204Full Autonomy Acceptance Flowchains timeouts: 180s + 120s + 180s + 180s + 120s = 780s worst case.Recommendations
Must fix before merge (C1, C2, H1-H5): The critical and high issues undermine the test's value as an acceptance gate. In particular, H1 means the LLM-dependent tests can never report a failure, and C2 misrepresents coverage to stakeholders.
Should fix (M1-M11): These affect test reliability and maintainability but won't cause incorrect pass/fail by themselves.
Nice to have (L1-L8): Code quality improvements for long-term maintenance.
Consider adding test cases for: Event queue pub/sub, profile resolution precedence, denylist enforcement, budget cap enforcement, tool call limit enforcement, and hierarchical decomposition validation — all of which are listed as acceptance criteria in #746 but not exercised.
13bbea90858bf1812dd98bf1812dd9b79113228eSelf-Review Report — PR #803 / Issue #746
Commit:
8bf1812d—test(e2e): E2E acceptance criteria for M6 (v3.5.0) — autonomy hardeningReviewed against: Issue #746 acceptance criteria, M6 (v3.5.0) milestone criteria,
docs/specification.mdReview method: Multi-cycle global review across all categories (bugs, test flaws, test coverage, performance, security)
Summary
The commit introduces a well-structured Robot Framework E2E test suite for M6, with solid improvements to
common_e2e.resource(safe JSON parsing, git return-code checks,IF/ELSEmigration,Force Tags). The timeout hardening inm3_e2e_verification.robotandm4_e2e_verification.robotis also a welcome change.However, the review identifies 4 high-severity acceptance-criteria coverage gaps, 2 medium-severity bugs in the JSON parser, and several test-flaw and security concerns across the suite. The high-severity items relate directly to missing issue #746 acceptance criteria and M6 milestone requirements that are not exercised by the E2E suite.
Findings by Severity
HIGH — Acceptance Criteria Coverage Gaps
m6_acceptance.robot. Existing coverage inm6_autonomy_acceptance.robotanda2a_facade.robotuses Python helpers, not real CLI invocations — so it does not satisfy the zero-mocking E2E requirement.M6 E2E Guard Enforcement Via Profiletest (line 178) only checks if the words'manual'or'automation'appear in the output. It does not exercise or assert denylist blocking, budget cap enforcement, or tool-call-limit triggering via the CLI.m6_acceptance.robot:178-197M6 E2E Full Autonomy Acceptance Flowtest exercises a flat plan lifecycle (use -> execute -> apply) but does not verify subplan spawning, hierarchical decomposition depth, or parallel execution.MEDIUM — Bugs
Safe Parse Json Fieldcrashes on JSON with trailing non-whitespace content. The keyword takes$stdout[pos:]from the first{to end-of-string and passes it tojson.loads(). Python'sjson.loads()raisesJSONDecodeErroron trailing non-whitespace content (e.g.,'{"plan_id":"abc"}extra text'). The keyword's docstring promises safe parsing but does not guard against this.m6_acceptance.robot:39-40Safe Parse Json Fieldhas notry/exceptforjson.JSONDecodeError. If the substring from the first{onward is malformed JSON,json.loads()will throw an unhandled exception, crashing the keyword rather than returning${EMPTY}as documented.m6_acceptance.robot:40MEDIUM — Test Flaws
Plan Lifecycle Assertionssilently passes whenplan_idis empty. IfSafe Parse Json Fieldreturns empty string, theIFblock (line 49) simply does not executeVerify Plan In List, producing no assertion failure. MissingELSEbranch to warn or fail.m6_acceptance.robot:49-51Verify Plan In Listsilently passes whenlifecycle-listcommand fails. When${list_result.rc} != 0, no warning or diagnostic is logged — the keyword exits silently, hiding infrastructure failures.m6_acceptance.robot:57-59Guard Enforcement Assertionsuses overly broad string matching. Checking for'manual'or'automation'as substrings (line 66) is prone to false positives. Any output containing these common words would pass. This does not verify guard behavior (denylist blocking, budget enforcement, etc.).m6_acceptance.robot:66Full Flow Apply Stephas noELSEbranch for failure. Whenapply.rc != 0, no WARN is logged, inconsistent with the error-handling pattern used at lines 225, 233, 240.m6_acceptance.robot:75-77Skip. Tests at lines 172-176, 193-197, and 214-216 skip whenplan usefails. Iflocal/code-reviewaction is missing or broken, all three critical tests produce silent skips instead of failures. No mechanism ensures these actually execute in CI.m6_acceptance.robot:172-176Skip(lines 175, 196, 215), some useLog ... WARN(lines 225, 233, 240), and some have no failure handling at all (lines 57-59, 75-77). This inconsistency makes test intent unclear and complicates failure diagnosis.m6_acceptance.robot(suite-wide)Output Should Containassertion forciis fragile. The 2-character string"ci"used at lines 109, 129, 133 is checked via case-insensitive substring match. While current CLI output may not contain false-positive matches, this is inherently brittle — any future output change adding a word containing "ci" would pass erroneously. Consider asserting the full profile name or using JSON field extraction.m6_acceptance.robot:109,129,133local/code-reviewaction exists without validation. Three tests depend on this action (lines 170, 191, 213) but none verify its existence beforehand. A missing action causes all LLM tests to skip silently.m6_acceptance.robot:170,191,213M6 E2E Full Autonomy Acceptance Flowusesfull-autoprofile without verifying full-auto-specific behavior. The test passes--automation-profile full-auto(line 213) but never asserts any full-auto characteristic (e.g., auto_strategize=0.0, operations proceeding without human approval). It only exercises the same generic lifecycle as other tests.m6_acceptance.robot:213MEDIUM — Security
Get Environment Variableat lines 53-54 ofcommon_e2e.resourcestores raw API keys in${anthropic}and${openai}variables. Robot Framework may log variable assignments at DEBUG level. If CI captures RF debug logs, API keys would be exposed in plain text. Consider performing the check entirely withinEvaluateto avoid storing keys in named variables:${has_keys}= Evaluate bool(__import__('os').environ.get('ANTHROPIC_API_KEY', '')) or bool(__import__('os').environ.get('OPENAI_API_KEY', ''))common_e2e.resource:53-55LOW — Code Quality / Minor
Safe Parse Json Fieldis local tom6_acceptance.robotinstead of sharedcommon_e2e.resource. This keyword is generic and useful for other E2E tests. Moving it to the shared resource file would improve reuse and maintainability.m6_acceptance.robot:28-41${session_id}. If the test fails beforeSet Test Variableat line 89,${session_id}is undefined.Run Keyword And Ignore Errorhandles this, but it produces confusing error messages in logs. Consider addingSet Test Variable ${session_id} ${EMPTY}as the first line of the test body.m6_acceptance.robot:82m6_acceptance.robot:80-100m6_acceptance.robot:102-136M6 E2E Init And Project Setuphas no per-test[Teardown]. Unlike LLM-dependent tests which have teardowns, this test (line 138) creates resources and projects without cleanup. While the suite teardown removes the directory, the cleanup pattern is inconsistent.m6_acceptance.robot:138session listuse the same 120s default as complexplan execute. Consider shorter timeouts for fast commands to fail faster on hangs.m6_acceptance.robot(suite-wide)randint(1000, 9999)provides only 9,000 values. Under high-parallelism CI (birthday paradox), collision probability is non-trivial. Consideruuid.uuid4().hex[:12]for vastly more possibilities.m6_acceptance.robot:25__import__()calls inEvaluateexpressions. Lines 25, 39-40, and 66 use__import__('json'),__import__('time'),__import__('random')in Evaluate calls. While functional, these bypass normal Robot Framework library import mechanisms. Consider usingEvaluatewith proper module imports or dedicated Library imports.m6_acceptance.robot:25,40,66Positive Aspects
common_e2e.resourcehardening: Git return-code checks, directory removal warnings,IF/ELSEmigration from deprecatedRun Keyword If,cwdparameter addition, andbool()API-key detection are all meaningful quality improvements.Force Tags E2E: Cleaner tagging at suite level vs per-test[Tags].Collectionslibrary removal: Verified unused — clean dead-import removal.Recommendations
Safe Parse Json Fieldkeyword to handle trailing content and malformed JSON (M1, M2). Consider wrapping the JSON extraction in atry/exceptviaEvaluate, or using a helper that finds the matching}.ELSEbranches to conditional assertions that currently pass silently (M3, M4, M6).local/code-reviewaction existence before LLM-dependent tests, converting the skip into an explicit prerequisite check (M10).Safe Parse Json Fieldtocommon_e2e.resourcefor reuse by future E2E suites (L1).Self-review performed by automated analysis covering: bug detection, test flaws, test coverage gaps against issue #746 AC and M6 milestone criteria, performance, and security. Cross-referenced with
docs/specification.mdCLI command reference and source implementation insrc/cleveragents/cli/.Code Review Report -- PR #803 / Issue #746
Commit:
c1260bd2--test(e2e): E2E acceptance criteria for M6 (v3.5.0) -- autonomy hardeningBranch:
test/e2e-m6-acceptanceReviewed against: Issue #746 acceptance criteria, M6 milestone criteria,
docs/specification.mdReview methodology: 3 full review cycles across all categories (test coverage, test flaws, bugs, performance, security)
Executive Summary
The commit delivers solid E2E test infrastructure improvements (IF/ELSE migration, safe JSON parsing, API key protection, git return-code assertions, per-test teardowns) and a well-structured M6 acceptance test suite. The non-LLM tests (session CRUD, profile listing/showing, config set/get, init + project setup) are well-written and provide genuine E2E coverage.
However, several issue acceptance criteria marked as done
[x]are not actually implemented in the E2E test file, and the LLM-dependent tests have structural issues that may cause them to pass vacuously. The integration tests inrobot/m6_autonomy_acceptance.robotandrobot/m6_e2e_verification.robotcover the missing areas with mocking, but the issue explicitly requires zero-mocking E2E coverage for these criteria.Findings: 18 total -- 3 Critical, 5 High, 6 Medium, 4 Low
CRITICAL -- Acceptance Criteria Gaps
C1. Missing event queue pub/sub E2E test
Category: Test Coverage | File:
robot/e2e/m6_acceptance.robotIssue #746 acceptance criterion "Test exercises event queue publish/subscribe via real CLI" is marked
[x]but no such test exists in the E2E suite. The M6 milestone also lists "Event queue publish/subscribe operational" as an acceptance criterion.Mitigation: Covered by integration test
robot/m6_autonomy_acceptance.robot("M6 A2A Event Queue Publish Subscribe") with mocking, but the issue explicitly requires E2E coverage.C2. Missing automation profile precedence resolution E2E test
Category: Test Coverage | File:
robot/e2e/m6_acceptance.robotIssue #746 acceptance criterion "Test verifies automation profile resolution precedence (plan > action > global)" is marked
[x]but no such test exists. The E2E tests only verify listing, showing, and setting individual profiles -- not that plan-level overrides action-level overrides global.Mitigation: Covered by integration test
robot/m6_autonomy_acceptance.robot("M6 Profile Resolution Precedence").C3. No hierarchical decomposition validation in Full Autonomy test
Category: Test Coverage | File:
robot/e2e/m6_acceptance.robot, lines 204-246Issue #746 acceptance criterion "Test exercises a full autonomy acceptance flow with hierarchical decomposition" is marked
[x]but the Full Autonomy test creates only a single plan and attempts execution. There is no verification of hierarchical subplan creation or 4+ levels of decomposition as required by the M6 milestone.Mitigation: Covered by integration test
robot/m6_e2e_verification.robot("Hierarchical Decomposition Creates Four Plus Levels").HIGH -- Significant Quality Issues
H1. Guard enforcement test only checks keyword presence, not actual enforcement
Category: Test Flaw | File:
robot/e2e/m6_acceptance.robot, lines 50-63The "Guard Enforcement Via Profile" test (line 183) only verifies that output contains automation-profile-related keywords (
automation_profile,require_sandbox,auto_strategize, etc.). It does NOT verify that guards actually constrain behavior. The issue AC says "Test verifies guard enforcement (denylist, budget caps, tool call limits)" but none of these are actually tested:Mitigation: Covered by
robot/m6_autonomy_acceptance.robot("M6 Guard Denylist Enforcement", "M6 Guard Budget Enforcement").H2. LLM-dependent tests pass vacuously when
plan usefailsCategory: Test Flaw | File:
robot/e2e/m6_acceptance.robot, lines 176-181, 198-202, 219-221Tests "Plan Lifecycle Via CLI", "Guard Enforcement Via Profile", and "Full Autonomy Acceptance Flow" all
Skipwhenplan usereturns non-zero (e.g., if thelocal/code-reviewaction isn't registered). In CI environments without this action, all 3 most critical M6 E2E tests provide zero coverage while the suite reports "passed (with skips)."Suggestion: Consider (a) registering/creating the
local/code-reviewaction in suite setup, or (b) using a guaranteed-to-exist built-in action, or (c) at minimum adding aWARNlog or dedicated test that explicitly verifies the action exists before the dependent tests run.H3. No parallel execution scaling E2E test
Category: Test Coverage | File:
robot/e2e/m6_acceptance.robotM6 milestone criterion "Parallel execution scales to 10+ concurrent subplans" has no E2E coverage.
H4. No decision correction E2E test
Category: Test Coverage | File:
robot/e2e/m6_acceptance.robotM6 milestone criterion "Decision correction with selective subtree recomputation" has no E2E coverage.
H5. No validation-gated apply E2E test
Category: Test Coverage | File:
robot/e2e/m6_acceptance.robotThe Full Flow test attempts
lifecycle-applybut does not verify that validations were checked before apply proceeded. The M6 milestone lists validation-gated apply as a criterion.MEDIUM -- Quality Improvements
M1. Automation Profile List verifies only 4 of 8 built-in profiles
Category: Test Flaw | File:
robot/e2e/m6_acceptance.robot, lines 102-112The test checks for
manual,supervised,ci, andfull-autobut missesreview,cautious,trusted, andauto. Per the specification, all 8 profiles are built-in.M2. Session delete verification doesn't confirm actual deletion
Category: Test Flaw | File:
robot/e2e/m6_acceptance.robot, lines 98-100After deleting a session, the test checks for "deleted" in output but doesn't re-list sessions to confirm the session actually disappeared. Adding a
session listcall after delete and asserting thesession_idis not present would strengthen the lifecycle verification.M3. Full Flow Apply Step doesn't verify plan state transition
Category: Test Flaw | File:
robot/e2e/m6_acceptance.robot, lines 65-75The
Full Flow Apply Stepkeyword only checks thatplan_idappears in the apply output. It should verify the plan actually transitioned toappliedstate (e.g., parse JSON and check thestateorphasefield).M4. No assertions on plan execute output content
Category: Test Flaw | File:
robot/e2e/m6_acceptance.robot, line 233In the Full Autonomy Flow test, the execute result is only checked for
rc==0. The output content is never verified (e.g., checking for phase transition indicators or expected decision types).M5. Safe Parse Json Field fragile with multi-JSON stdout
Category: Test Flaw | File:
robot/e2e/common_e2e.resource, lines 92-115The
find('{')+rfind('}')approach assumes a single JSON object in stdout. If stdout contains multiple JSON objects on separate lines (e.g., progress messages + final result), the parser captures text spanning both objects, producing invalid JSON. The keyword gracefully returns empty string + WARN on parse failure, but this could cause hard-to-diagnose downstream test failures. Consider parsing from the last complete{...}block, or using a regex like\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}to find the outermost JSON object.M6. Specification.md CLI commands section is stale
Category: Documentation | File:
docs/specification.md, lines 322-346The specification documents
plan listandplan applyas the CLI commands, but the actual implementation has deprecated these in favor oflifecycle-listandlifecycle-apply(the v3 replacements). The tests correctly use the new commands; the specification should be updated.LOW -- Minor / Defensive Improvements
L1. Redundant automation profile reset in Config test
Category: Performance | File:
robot/e2e/m6_acceptance.robot, lines 127, 138-140The
[Teardown](line 127) and the end of the test body (lines 138-140) both reset the profile tomanual. The teardown alone is sufficient; the body reset is redundant (though harmless).L2. CLI stdout/stderr logging may expose secrets in debug logs
Category: Security | File:
robot/e2e/common_e2e.resource, lines 72-73Full STDOUT/STDERR are logged via
Log STDOUT: ${result.stdout}andLog STDERR: ${result.stderr}. If CLI output ever includes API keys, tokens, or sensitive data (e.g., in error messages), they would appear in test logs. Consider masking or truncating output for sensitive commands.L3. Case-sensitive "ci" assertion is a weak substring match
Category: Test Flaw | File:
robot/e2e/m6_acceptance.robot, line 111Case-sensitive match for
ciavoids matching "speCIfication" but could still match substrings like "deficit" or "recipient" in JSON output. Consider matching"ci"(with JSON quotes) or"name": "ci"for a stronger assertion.L4. No
safe.directorygit config in Create Temp Git RepoCategory: Security / Portability | File:
robot/e2e/common_e2e.resource, lines 117-136In containerized CI environments where git runs as a different user than the directory owner, git may reject operations with "dubious ownership" errors. Consider adding
git config --global --add safe.directory ${repo_dir}as a defensive measure.Positive Observations
The following improvements in this commit are well-executed:
os.environ.get()instead of RF variables avoids accidental secret logging at DEBUG level.Run Keyword Ifis thorough and consistent.Create Temp Git Repocatch silent failures early.randint.[Tags].Review performed by automated code review agent. 3 global review cycles completed across all categories: test coverage, test flaws, bugs, performance, and security.
b79113228e09f26c68bb09f26c68bbd3dd9281a6d3dd9281a61d110c55221d110c5522c814f4e61ec814f4e61e038b818fd2038b818fd2465667cac7PM Status Update — Day 34
Luis identified a significant gap: only 5 of 10 M6 acceptance criteria are covered by this E2E suite. The missing criteria are:
Decision: These are not deferrable — they are the headline M6 acceptance criteria. Without them, the E2E suite cannot serve as an M6 acceptance gate. However, several of these features may not be fully implemented yet (e.g., parallel execution, hierarchical decomposition).
Action items:
@tdd_expected_failif needed.Priority: Medium (M6 acceptance gate)
PM Status — Day 34
@CoreRasurae — M6 E2E acceptance criteria (#497). Mergeable with 4 comments.
Status: In Review. Missing labels: needs MoSCoW and Points per CONTRIBUTING.md.
Priority: This is the M6 acceptance gate — High priority. M6 closure depends on this passing. Please ensure all review findings are addressed.
PM status — Day 34
PM Status — Day 36 (2026-03-16)
Coverage gap: Luis identified that only 5 of 10 M6 acceptance criteria are covered. Missing:
@CoreRasurae — These 5 missing criteria need to be added before this PR can be considered complete. Please update the test suite to cover all 10 criteria, or create follow-up issues for the missing ones with clear justification for deferral.
Missing labels: Needs MoSCoW and Points per CONTRIBUTING.md.
PM Day 36 Triage: MERGE CONFLICT. @CoreRasurae rebase onto master needed before this can proceed. This is the M6 E2E acceptance gate targeting v3.5.0 — important for milestone sign-off but blocked until conflict is resolved.
465667cac7a1582c05b6Code Review — PR #803
test(e2e): E2E acceptance criteria for M6 (v3.5.0) — autonomy hardeningReviewer: @brent.edwards | Size: XL (4,447 lines, 83 files) | Focus: Full review — scope, production code, E2E quality, security
P0:blocker (5)
1. PR bundles 3 unrelated issues into a single "test" PR — must be split
The 8 commits address issues #746 (E2E tests), #581 (AuditEventSubscriber — a new production feature), and #658 (subprocess CLI conversion). 23 production files and +793 production lines are hidden under a
test(e2e)title with aType/Testinglabel. CONTRIBUTING.md requires one issue scope per PR. A reviewer scanning the title would have no indication this introduces a new production audit pipeline, a database migration, or modifies core service wiring. This must be split into at minimum 3 PRs: audit pipeline, production bug fixes, and E2E tests.2. Six LLM-dependent tests can never produce a FAIL — only SKIP or WARN
All six critical M6 acceptance tests (
Plan Lifecycle,Guard Enforcement,Profile Precedence,Event Queue,Hierarchical Decomposition,Full Autonomy) follow the same pattern: ifplan usereturns rc≠0, the testSkips. If downstream commands fail, they logWARNinstead of failing. In CI without thelocal/code-reviewaction, the suite reports 4 PASS / 6 SKIP / 0 FAIL — looking green while validating nothing. This defeats the purpose of an acceptance gate.Fix: When API keys are present but
plan usefails, that's a test failure, not a skip. Only skip when API keys are absent (legitimate environmental constraint).3. Event Queue test (AC-3) has zero hard assertions
Both the happy path and error path in
M6 E2E Event Queue Via Plan Lifecycle TransitionsareLogorWARNstatements — neverShould Be TrueorFail. The test can pass with zero verified state transitions.4. Hierarchical Decomposition test (AC-6) never asserts decomposition occurred
Zero decision nodes →
WARN(not failure). One decision node → pass. The milestone requires "4+ levels of subplans" but the test doesn't assert minimum depth or nesting.5.
CONFIG_CHANGEDevent persists non-OpenAI API keys in plaintext to audit DBconfig_service.set_value()emitsold_valueandnew_valueas raw strings under generic key names. The audit subscriber'sredact_dictonly checks if the dict key contains "api_key"/"password"/"secret" or if the value matchessk-*/sk-ant-*patterns. Config keys likeprovider.google.api-keyhave values (AIzaSy...) that match neither pattern. Google, Azure, HuggingFace, OpenRouter, and Neo4j credentials are persisted in plaintext. The BDD test masks this because it uses"api_key"as a dict key (triggers key-name redaction) — unlike the production payload.Fix: Check
is_sensitive_key(key)on the config key name at the emit site, before the event reaches the subscriber.P1:must-fix (10)
6.
CorrectionServicemissingcheckpoint_servicewiring in containercontainer.py:594:CorrectionServiceis registered withoutcheckpoint_service, despite the container having one. Revert-mode corrections silently skip checkpoint rollback. The CLI also creates an ad-hocCorrectionServicebypassing the container entirely.7.
AuditService._ensure_session()TOCTOU race on Singletonaudit_service.py:118-137: Lazy session creation with no lock. Two threads hitting_ensure_sessionsimultaneously create two engines — one leaks.8.
get_container()swallows audit subscriber failure with no recoverycontainer.py:637-646: If initialization fails, the Singleton caches the failure. No retry. Security events silently go unlogged with only a debug-level warning.9.
AutomationProfileRepositorysession leak — noclose()in auto_commitupsert()anddelete()never callsession.close()unlikeSessionRepositorywhich usesfinally: if self._auto_commit: db_session.close().10.
ReactiveEventBus.emit()swallows exception messagereactive.py:79-87: The exception handler logs onlytype(exc).__name__— nostr(exc), noexc_info=True. Production debugging is impossible.11.
_to_domain/_from_domaincrash on corrupt JSONrepositories.py:~4413: Barejson.loads()+SafetyProfile(**dict)with no try/except. Corrupt JSON in DB produces an unhandledJSONDecodeErrorthat propagates as an opaque 500.12.
server_connect: three non-atomicset_value()calls without error handlingserver.py:125-127: Replaced a singlewrite_config()with three separate writes. If the second fails, config is left partially updated. No try/except.13.
automation_profile._get_service()bypasses DI containerCreates
create_engine/sessionmaker/AutomationProfileRepositorydirectly instead of resolving through the container. This is the only CLI command that manually constructs infrastructure — every other uses the container.14.
@tdd_expected_failremoved from Bug #658 feature — verify test passesIf the underlying bug is not actually fixed in this PR's changes, removing the tag will cause CI failures.
15.
format_outputwrites tosys.stdoutdirectly, then callers print""formatting.py:197-211:format_output()now writes to stdout and returns"". But 26 call sites still doconsole.print(format_output(...))ortyper.echo(format_output(...)), producing an extra blank line on every machine-readable command. While JSON parsers tolerate it, this is a behavioral regression in every--format jsoninvocation.P2:should-fix (9)
16.
LifecyclePlanRepository.list_all()— unbounded query, no pagination. With hierarchical decomposition this table grows quickly.17. Event emitted after delete, outside transaction scope — audit gap if app crashes between commit and emit.
18.
list_plans()falls back to stale in-memory cache on DB error atdebuglevel — callers don't know result is incomplete.19. 4 of 6 event-emit exception handlers log no error details (no
exc_info, noerror_type).20.
_create_in_memory_profile_service()copy-pasted across 3 step files.21.
_capture_format_output()duplicated ~7 times across step files.22.
m6_acceptance.robotis 508 lines (limit: 500). Extract repeated setup boilerplate.23. E2E AC-5 only tests plan>global precedence, not action>global.
24. Guard Enforcement assertion checks keyword presence, not actual guard behavior.
P3:nit (5)
25.
hasattr(row, "safety_json")on ORM model — always True; useis not None.26. Dead
if profile.safetyguard —safetyis never None per type contract.27.
import json as _jsonduplicated in 3 method bodies.28.
SECURITY_EVENT_MAPexported as public API — implementation detail.29. Audit subscriber eagerly imports
AuditServiceinstead of usingTYPE_CHECKING.Positive Observations
common_e2e.resourcehardening (safe JSON parsing, git RC checks, IF/ELSE migration) is solidfeatures/mocks/Summary
Verdict: REQUEST_CHANGES — The P0 scope/misrepresentation issues (P0-1) and the vacuous acceptance tests (P0-2/3/4) must be resolved. The API key leak to audit DB (P0-5) is a security vulnerability. Splitting this into 3 focused PRs would make each individually reviewable and unblock the E2E test portion sooner.
my PR makes fixes to the production code and also provides a tests in separate commits, but they are all related and do not need to be submitted as independent PRs, since the fixes made to production code are validate by this same tests. Additioanlly the review points many other issues in the production code, that i am not going to fix, because this is an end to end integration test and i am only fixing the problems related to test failures.
format_ouptut() writes directly to stdout in some situations because typer Pretty printer in output_formatting would split text inserting line breaks and thus making certain CleverAgent keyword detections fail, which in particular would cause the present End to End (E2E) test fail and yes, the tests were updated to deal with with the stdout writes.
a1582c05b6bc178c5362Review Reclassification — PR #803
@CoreRasurae — After re-reading the review playbook and your comments, I'm reclassifying several findings. The playbook defines P0 as "Security vulnerability, data loss risk, broken migration, credential leak" and my original review applied P0 too broadly. I also flagged pre-existing production bugs as blockers on your PR, which isn't fair — those should be tracked separately.
Severity reclassifications
Findings moved to separate issues (no longer blocking this PR)
These are pre-existing production bugs that this PR did not introduce. They should be tracked independently:
CorrectionServicemissingcheckpoint_servicewiringAutomationProfileRepositorysession leakReactiveEventBus.emit()swallows exception message_to_domain/_from_domaincrash on corrupt JSONautomation_profile._get_service()bypasses DII will file issues for these so they aren't lost.
Findings that remain blocking (in-scope for this PR)
P0 (1):
CONFIG_CHANGEDevent persists non-OpenAI API keys in plaintext to audit DB. The audit subscriber is new production code in this PR — the redaction logic must cover Google/Azure/HuggingFace key patterns before merge.P1 (7):
plan usefails, that should be a test failure, not a skip.AuditService._ensure_session()TOCTOU race — new code in this PR.get_container()caches audit subscriber failure with no retry — new code in this PR.server_connectthree non-atomicset_value()calls — changed by this PR.format_outputbehavioral regression (extra blank line on every--format jsoninvocation). I understand the technical reason (Typer Pretty printer splitting text), but 26 call sites now print an empty string after stdout is already written. This is a user-visible regression in every--format jsoncommand.P1-14 (
@tdd_expected_failremoval): Keeping as P1 — please confirm the underlying bug is actually fixed by this PR's changes so CI doesn't break.Regarding
format_output(P1-15)I read your explanation — the Typer Pretty printer was splitting text and breaking keyword detection in the E2E tests. That's a legitimate technical constraint. The concern is that
format_output()now writes to stdout and returns"", but 26 callers still doconsole.print(format_output(...))ortyper.echo(format_output(...)), which emits an extra blank line. If those call sites were updated to not print the return value, I'd drop this to P3.Updated summary
The blocking list is now 8 findings (down from 15), all directly related to code this PR introduces or modifies. The pre-existing production bugs will be tracked as separate issues.
I apologize for the over-classification in the original review. The findings were real but several severity levels didn't match the playbook definitions, and flagging pre-existing bugs as blockers on your PR wasn't appropriate.
1c7076b7b6af6e709d1dRe-review — PR #803 (commit
af6e709d)Reviewer: @brent.edwards | Review against: reclassification comment #65685 (8 blocking findings) + new commits
Part 1 — Status of 8 previously blocking findings
Should Be Equal As Integersonplan_use.rcafterSkip If No LLM Keys. ELSE/Fail branches present throughout.Should Be True,Should Not Be Empty,Output Should Containthroughout. State transition assertion on line 366-367 proves event bus delivery.decision_count >= 1viaShould Be True. For an LLM-dependent E2E test, asserting exact tree depth is inherently flaky; ≥1 decision node is a reasonable minimum._ensure_session(). Low practical risk in CLI (single-process), but the code exists in a reusable service.set_value()calls, no try/except or rollback.tdd_expected_failtags remain inrobot/e2e/or bug #658 files.""should be cleaned up, but it doesn't block merge.Score: 4 resolved, 1 reclassified to P2, 3 not fixed, P0-5 still open.
Part 2 — P0-5 deep-dive: API key leak is still present
The
redact_dict/is_sensitive_key/redact_valueinfrastructure inshared/redaction.pyhas been improved (more_SENSITIVE_SUBSTRINGS, thread-safe_patterns_lock), but the structural problem inCONFIG_CHANGEDevents remains.When
config_service.set_value("provider.google.api-key", "AIzaSy...")is called, the event details dict is:is_sensitive_key("old_value")→ False (the dict key is"old_value", not a sensitive name)redact_value("AIzaSyOld...")→ No match (no_SECRET_PATTERNSregex covers GoogleAIzaSy...format)Same for Azure, HuggingFace (
hf_...), OpenRouter, and Gemini credentials. Only OpenAI (sk-) and Anthropic (sk-ant-) are caught by pattern matching.Fix (any one of):
set_value(): checkis_sensitive_key(key)on the config key name and redactold_value/new_valuebefore emission_SECRET_PATTERNSfor Google (AIzaSy[0-9A-Za-z_-]{30,}), HuggingFace (hf_[A-Za-z0-9]{10,}), etc.is_sensitive_keyto match it)Option 1 is the simplest and most robust — one
if is_sensitive_key(key):guard at the emit site.Part 3 — New P1 issues in new commits
The two new production commits (
0dca2c1bpersist plan overrides,9cee406blist_plans DB fallback) introduce issues not covered by the previous review.NEW-1 (P1):
run_strategizeexceptions not caught inexecute_planCLI — raw traceback to userplan.py:1617-1623: Whenexecutor.run_strategize(plan_id)fails (e.g.,ConnectionError,RuntimeErrorfrom actor), the exception propagates unhandled past theexceptblocks which only catchInvalidPhaseTransitionError,PlanNotReadyError, andCleverAgentsError. The user sees a raw Python traceback instead of a friendly error message. The plan state is correctly set to ERRORED (by PlanExecutor), but the CLI UX is broken.Fix: Add
except Exception as e:fallback that prints a user-friendly error.NEW-2 (P1):
execution_environmentoverride inexecute_plannever persisted to DBplan.py:1591-1593: The override is set on the in-memory Plan object but_commit_planis never called — unlikeuse_action(line 1490) which explicitly persists overrides. Ifget_planre-fetches from DB later in the same function (line 1611), the override may be lost.Fix: Call
service._commit_plan(pre)after setting the override, matching theuse_actionpattern.NEW-3 (P1): Bare
except Exceptioninlist_planssilently swallows all DB errors at debug levelplan_lifecycle_service.py:812-814: A schema migration failure, corrupt row, or serialization bug is caught byexcept Exception, logged atdebuglevel (invisible in production), and the user silently gets an empty plan list. The repository already wraps DB errors intoDatabaseError— catch that specifically, or at minimum log atWARNINGwithexc_info=True.Part 4 — New P2 issues (non-blocking)
PlanExecutorinexecute_planis constructed with onlylifecycle_service— no guardrails, no metrics, uses stub actor. Acceptable for MVP but should be documented.list_all()has no LIMIT/pagination — unbounded memory on large datasets. The repo already has a paginatedlist_plans(limit=100)method.to_domain()inlist_all()— other models in the same file uselazy="joined"to prevent this.service._commit_plan()— encapsulation violation; should have a publicpersist_overrides()method.Updated summary
Verdict: REQUEST_CHANGES — P0-5 (credential leak) remains the primary blocker. The 3 new P1s from the latest commits should also be addressed. The 3 original unfixed P1s (TOCTOU, container cache, server_connect) are lower practical risk in CLI context — I'd accept them being filed as follow-up issues if the P0 and new P1s are resolved.
Positive observations
Significant progress since last review:
common_e2e.resourcehardening is solid (safe JSON parsing, cwd parameter, API key protection)format_outputchange has a legitimate technical justificationplan executeunblocks a real workflow gaplist_plansDB fallback fixes a genuine per-process isolation bugReview Response — @brent.edwards review + reclassification
Addressing the findings from the REQUEST_CHANGES review and the subsequent reclassification comment.
All changes applied in two commits on
test/e2e-m6-acceptance:af6e709d—test(e2e): E2E acceptance criteria for M6 (v3.5.0) — autonomy hardening(E2E tests, already contained prior-round fixes)d76e249b—fix(cli): load persisted actions in start_strategize and run execute phase inline(production fix + BDD test updates)Applied (E2E test fixes — already in
af6e709dfrom prior rounds)Plan Lifecycle,Guard Enforcement Via Profile,Profile Precedence,Event Queue,Hierarchical Decomposition,Full Autonomy) now useShould Be Equal As Integers ${plan_use.rc} 0with descriptive failure messages when API keys are present butplan usefails. Downstream execute/apply failures also useFailinstead ofLog WARN. OnlySkip If No LLM Keysremains as a skip (legitimate — no keys available).M6 E2E Event Queue Via Plan Lifecycle Transitionsnow has 7+ hard assertions:Should Be Equal As Integerson plan_use rc,Should Not Be Emptyon plan_id,Verify Plan In List,Should Be Equal As Integerson both status checks,Should Be True ${state_populated}verifying post-execute state differs from initial, andFailon execute error path.M6 E2E Hierarchical Decomposition Via Plan Treenow has hard assertions:Should Be Equal As Integers ${tree.rc} 0,Should Not Be Empty ${tree.stdout}, andShould Be True ${decision_count} >= 1(was WARN, now hard fail). The test also checks for"children"key as a decomposition infrastructure indicator. A hard assertion on 4+ nesting levels was not added because LLM output is non-deterministic in E2E — the issue AC (#746) says "exercises … hierarchical decomposition", not "asserts 4+ levels". The milestone's 4+ level criterion describes system capability, not a per-test-run invariant.Setup Plan Test Resourceskeyword to eliminate repeated boilerplate. File is now 446 lines.Guard Enforcement Assertionskeyword now parses theautomation_profileJSON field and assertsShould Be Equal As Strings ${resolved_profile} ${expected_profile}. A dedicatedM6 E2E Guard Enforcement Denylist Budget Limitstest creates a custom profile with explicit denylist, budget cap (max_total_cost), and tool-call limits (max_tool_calls_per_step), then verifies all guard fields viaautomation-profile show. Actual runtime enforcement (e.g. triggering a denied tool during LLM execution) is not testable in E2E without controlling LLM output.Applied (production fix — commit
d76e249b)During E2E test execution, 3 LLM-dependent tests failed with
Error [500] INTERNAL: An unexpected error occurredat theplan executestep. Root cause analysis:start_strategize()action registry empty in fresh CLI processesstart_strategize()builtaction_registryfrom the in-memory_actionsdict only. In a separateplan executeCLI invocation (different process fromplan use),_actionswas empty. Added aget_action()call (with DB fallback) before building the preflight registry. (plan_lifecycle_service.py:870-877)PreflightRejectionnot caught by CLI error handlerPreflightRejectionextends bareException, notCleverAgentsError, so it escaped the CLI'sexcept CleverAgentsErrorblock, producing an opaque 500 error. Added explicitexcept PreflightRejectioninexecute_planCLI. (plan.py:1646-1648)plan executeleft plan inexecute/queuedwithout running the execute phaseexecute/queued.lifecycle-applyrequiresexecute/complete. Added inline execute viaPlanExecutor.run_execute()after phase transition, mirroring the existing inline strategize pattern. (plan.py:1640-1649)lifecycle-applydidn't handle auto-progressed planscomplete_execute()callsauto_progress()which can advance the plan directly toapply/queued.lifecycle-applyonly looked forexecute/completeplans, missing auto-progressed ones. Updated auto-selection to also considerapply/queuedplans, and skip theapply_plan()call when the plan is already in Apply phase. (plan.py:1710-1747)BDD test updates in
features/steps/plan_lifecycle_commands_coverage_steps.py:get_plan.side_effectextended from 3 to 5 values for strategize-queued and auto-progress scenarios (accounting for new inline execute state checks)list_plansmocks for lifecycle-apply scenarios changed fromreturn_valuetoside_effectto return separate results for the twolist_planscalls (EXECUTE phase + APPLY phase)Skipped — with justification
fix(cli)commits fix bugs discovered during E2E test development. Splitting would create circular dependencies between PRs (E2E tests depend on the fixes; fixes are validated by the E2E tests). The reviewer acknowledged this reclassification to P2 in the follow-up comment.CONFIG_CHANGEDevent persists non-OpenAI API keys in plaintext to audit DBAuditService._ensure_session()TOCTOU raceget_container()caches audit subscriber failure with no retryserver_connectthree non-atomicset_value()calls@tdd_expected_failremoval — verify test passesformat_outputbehavioral regression (extra blank line)format_output()writes directly to stdout because Typer's Pretty printer inserted line breaks that broke keyword detection in E2E tests. The return value is""by design.LifecyclePlanRepository.list_all()unbounded querylist_plans()falls back to stale in-memory cachedebug-level logging as a resilience measure._create_in_memory_profile_service()copy-pasted across 3 step files_capture_format_output()duplicated ~7 times across step filesaction>globalprecedence path requiresPlanLifecycleService.use_actionto propagate the action'sautomation_profileto the Plan during creation, which is not yet wired. The test documents this limitation explicitly: "Note: action > global precedence requires the action's automation_profile to propagate to the Plan during plan-use, which is not yet wired in PlanLifecycleService.use_action." The plan>global path IS tested and verified.hasattr(row, "safety_json")on ORM model — always Trueif profile.safetyguardimport json as _jsonduplicated in 3 method bodiesSECURITY_EVENT_MAPexported as public APIAuditServiceTest results after all changes
nox -s unit_testsnox -s integration_testsnox -s e2e_testsaf6e709d1d1df7c628751df7c62875a831803589I cannot at the moment access Claude Opus 4.6. But all of the requested changes have been made, so I approve.
Re-review — PR #803 (commit
a8318035)Reviewer: @brent.edwards | Review against: review #2317 (4 remaining blockers)
Blocking findings — status
set_value()now callsis_sensitive_key(key)and replacesold_value/new_valuewithREDACTEDbefore emitting."api-key"(hyphenated) added to_SENSITIVE_SUBSTRINGS. The CONFIG_CHANGED event vector is closed.run_strategizeexceptions not caughtPreflightRejectionexplicitly caught +except Exceptioncatch-all fallback added. No more raw tracebacks.execution_environmentoverride not persistedservice._commit_plan(pre)now called after setting the override, matching theuse_actionpattern.except Exceptioninlist_plansexcept DatabaseError. Programming errors (TypeError,AttributeError) now propagate correctly instead of being swallowed.Score: 4/4 resolved. All P0 and P1 blockers are closed.
Remaining P2/P3 notes (non-blocking)
_SECRET_PATTERNSstill lacks regexes for Google (AIzaSy...), HuggingFace (hf_...), Azure formats. Not needed for CONFIG_CHANGED (protected at the emit site), but those credential values would pass throughredact_value()unredacted in other contexts (e.g., log messages). P2.logger.debugwas not raised tologger.warning. DB fallback events are still invisible at production log levels. P3.list_all(), N+1 lazy loading, private_commit_planusage): Unchanged, tracked for follow-up.Verdict: APPROVED
All P0 and P1 findings are resolved. The E2E test suite is substantially robust, the production code fixes are sound, and the remaining items are P2/P3 tracked for follow-up. Good work addressing the feedback across multiple review rounds, @CoreRasurae.
a831803589f0bdc3c651New commits pushed, approval review dismissed automatically according to repository settings