feat(service): add decision recording and snapshot store #433

2026-02-25T13:12:31Z

hamza.khyari commented

2026-02-25 13:12:31 +00:00

Summary

Implements DecisionService with record/list/tree helpers and SnapshotStore for Forgejo issue #172.

What

DecisionService — dual-mode (in-memory / DB-backed) service with:
- record_decision() with auto-sequencing, auto-snapshot capture, string-to-enum coercion
- get_decision(), list_decisions(), list_by_type() retrieval
- get_tree() (BFS), get_path_to_root() tree navigation
- mark_superseded(), get_superseded() supersession support
- delete_decision() with snapshot cleanup
- count_decisions(), get_next_sequence() statistics
SnapshotStore — hash-based deduplication index for ContextSnapshot objects
Custom exceptions — DuplicateDecisionError, DecisionNotFoundError, SequenceConflictError

Tests

Behave: 37 scenarios / 168 steps in features/decision_recording.feature (all passing)
Robot: 7 smoke tests in robot/decision_recording.robot with helper script
ASV: 6 benchmarks for record throughput, list, tree, and snapshot operations

Docs

docs/reference/decision_service.md covering all public API, dual-mode persistence, and exceptions

Note on cherry-picked commits

This branch includes 2 cherry-picked commits from PR #131 (decision persistence) to provide DecisionRepository for the UoW code paths. When #131 merges to master, a rebase will make these cherry-picks disappear cleanly.

Closes #172

Dependencies

Blocks issue #172 (this PR must be merged before the issue can be closed)

## Summary Implements `DecisionService` with record/list/tree helpers and `SnapshotStore` for Forgejo issue #172. ### What - **`DecisionService`** — dual-mode (in-memory / DB-backed) service with: - `record_decision()` with auto-sequencing, auto-snapshot capture, string-to-enum coercion - `get_decision()`, `list_decisions()`, `list_by_type()` retrieval - `get_tree()` (BFS), `get_path_to_root()` tree navigation - `mark_superseded()`, `get_superseded()` supersession support - `delete_decision()` with snapshot cleanup - `count_decisions()`, `get_next_sequence()` statistics - **`SnapshotStore`** — hash-based deduplication index for `ContextSnapshot` objects - **Custom exceptions** — `DuplicateDecisionError`, `DecisionNotFoundError`, `SequenceConflictError` ### Tests - **Behave**: 37 scenarios / 168 steps in `features/decision_recording.feature` (all passing) - **Robot**: 7 smoke tests in `robot/decision_recording.robot` with helper script - **ASV**: 6 benchmarks for record throughput, list, tree, and snapshot operations ### Docs - `docs/reference/decision_service.md` covering all public API, dual-mode persistence, and exceptions ### Note on cherry-picked commits This branch includes 2 cherry-picked commits from PR #131 (decision persistence) to provide `DecisionRepository` for the UoW code paths. When #131 merges to master, a rebase will make these cherry-picks disappear cleanly. Closes #172 ## Dependencies - Blocks issue #172 (this PR must be merged before the issue can be closed)

hamza.khyari self-assigned this 2026-02-25 14:56:36 +00:00

freemo added a new dependency 2026-02-25 18:13:15 +00:00

#172 feat(service): add decision recording and snapshot store

freemo added the

Type

Feature

label 2026-02-25 18:13:16 +00:00

freemo added this to the v3.2.0 milestone 2026-02-25 18:13:28 +00:00

hamza.khyari force-pushed feature/m4-decision-service from cd5ad39642 to 0f4b8a967f

2026-02-27 12:35:41 +00:00

Compare

CoreRasurae commented

2026-02-27 17:36:23 +00:00

Code Review — PR #433 / Issue #172: `feat(service): add decision recording and snapshot store`

Branch: feature/m4-decision-service
Reviewed commit: cd641a8 (last by Hamza, 7-commit chain from 5539fd0..cd641a8)
Scope: 15 files changed, ~2600 insertions, ~160 deletions
Review focus: bugs, security, performance, test coverage/quality, spec conformance

1. Issue #172 Acceptance Criteria Compliance

Subtask	Status	Notes
Implement `DecisionService` with record/list/tree helpers + snapshot storage	Done	Full API in `decision_service.py`
Add `docs/reference/decision_service.md`	Done	148-line doc file
Behave: `features/decision_recording.feature`	Done	37 scenarios
Robot: `robot/decision_recording.robot`	Done	7 test cases
ASV: `benchmarks/decision_recording_bench.py`	Done	5 benchmark classes
Coverage >= 97% via `nox -s coverage_report`	Not verified	No evidence the coverage gate was run/checked in the commit chain
Run `nox` (all sessions), fix any errors	Not verified	No evidence a full nox pass was performed in the commit chain

2. Bugs

BUG-1 [HIGH] — Sequence numbers not rehydrated from DB on restart

File: decision_service.py:203, 622-626

_plan_sequence is an in-memory dict initialized to {}. In persisted mode, if the service restarts against an existing database with decisions already recorded, _next_sequence() starts from 0 again, causing sequence_number collisions. The spec (line 18398, 18470) requires sequence_number to be monotonically increasing per plan.

BUG-2 [HIGH] — `SequenceConflictError` defined but never raised

File: decision_service.py:74-82

This exception is defined, exported in __all__, and whitelisted in vulture — but never actually raised anywhere. If BUG-1 causes a collision, there is no guard to catch it. The error exists as dead code suggesting uniqueness enforcement was intended but never implemented.

BUG-3 [MEDIUM] — `delete_decision` inconsistent between modes

File: decision_service.py:537-560

In persisted mode, if the underlying repository's delete() silently succeeds for a non-existent ID (does not raise), the method returns True without error. In in-memory mode, it correctly raises DecisionNotFoundError. The behavior diverges depending on the backend.

BUG-4 [MEDIUM] — `mark_superseded` does not validate `new_decision_id` exists

File: decision_service.py:483-523

The spec (line 18196) states "the original and all its descendants are marked superseded_by = <new_decision_id>." The service only marks the single targeted decision but never verifies the replacement decision actually exists — this can create dangling references.

BUG-5 [MEDIUM] — `get_tree` silently drops orphaned subtrees

File: decision_service.py:383-425

If a decision's parent_decision_id references a deleted parent, BFS from roots never reaches it. The if not roots fallback only triggers when there are zero roots. Orphaned subtrees are silently omitted from the returned tree, violating the spec's "Reachability" invariant (spec line 18203).

BUG-6 [LOW] — `SnapshotStore` is purely in-memory, lost on restart

File: decision_service.py:90-164

Snapshots stored in SnapshotStore (the hash-indexed secondary store) are never persisted to the database. After a service restart, get_snapshot() and get_snapshots_for_plan() return None/empty even though the decisions (with embedded snapshots) still exist in the DB. The spec (line 18421-18425) describes the context snapshot as a replay-critical artifact.

BUG-7 [LOW] — Dead code in `PlanLifecycleService`

File: plan_lifecycle_service.py:177-188

_decision_seq dict and _next_seq() method are never called anywhere in the service. _try_record_decision delegates sequence numbering to DecisionService.record_decision(). These are dead code, masked by vulture whitelist entries at vulture_whitelist.py:339-340.

3. Security Issues

SEC-1 [MEDIUM] — No size limits on `actor_reasoning` field

File: decision_service.py:249

Raw LLM reasoning traces can be arbitrarily large. No bounds are enforced before storage or logging. This creates risk for log injection and storage exhaustion attacks.

SEC-2 [LOW] — Truncated SHA-256 hash with misleading prefix

File: decision_service.py:672

_auto_capture_snapshot truncates the SHA-256 to 16 hex chars (64 bits) but prefixes it with sha256:, implying a full cryptographic hash. If any downstream code relies on this for integrity verification, the collision resistance is far lower than expected.

SEC-3 [LOW] — `SnapshotStore._hash_index` grows without bound

File: decision_service.py:99

No eviction policy. A caller generating many unique snapshots causes unbounded memory growth.

SEC-4 [LOW] — `tempfile.mktemp` usage in test step file

File: features/steps/decision_recording_steps.py:966

tempfile.mktemp is deprecated due to a TOCTOU race condition. This is test-only but is a pattern repeated across 17 places in the test suite.

4. Performance Issues

PERF-1 [MEDIUM] — `count_decisions` loads all objects just to count them

File: decision_service.py:593-602

count_decisions(plan_id) calls list_decisions(plan_id) which deserializes every decision from the DB, builds a Python list, then takes len(). Should delegate to COUNT(*) in persisted mode.

PERF-2 [MEDIUM] — No pagination on any list method

Files: decision_service.py — list_decisions, list_by_type, get_tree, get_superseded, get_snapshots_for_plan

None of these support limit/offset. For plans with hundreds or thousands of decisions, this returns unbounded result sets.

PERF-3 [LOW] — `get_tree` and `get_superseded` load full decision set

File: decision_service.py:395, 476

Both call list_decisions() to load the entire plan's decisions into memory, then perform in-memory filtering/traversal.

PERF-4 [LOW] — `SnapshotStore._hash_index` uses `list` instead of `set`

File: decision_service.py:99, 145

list.remove(decision_id) is O(n) for each removal. Using a set would be O(1).

5. Test Coverage and Quality Issues

TEST-1 [HIGH] — BFS order not actually verified in tree tests

Files: features/decision_recording.feature:103, robot/helper_decision_recording.py:99

The "Get tree returns BFS order from root" scenario asserts count and root-first, but never verifies actual BFS ordering. The level-order property (all depth-1 children before depth-2 grandchildren) is not checked. A broken BFS that returns DFS order would pass.

TEST-2 [HIGH] — No test for sequence rehydration after restart

Files: All test files

No test creates decisions, destroys the service, recreates it against the same DB, and verifies that sequence numbers continue from where they left off. This is exactly the scenario where BUG-1 manifests.

TEST-3 [HIGH] — `MagicMock()` without `spec=` hides bugs

Files: decision_di_wiring_steps.py:111, helper_decision_di.py:87, decision_di_bench.py:153

All use MagicMock() for settings without spec=Settings. This silently creates auto-attributes on access, meaning code that reads settings.nonexistent_property gets a silent mock object instead of failing. Should use create_autospec(Settings).

TEST-4 [MEDIUM] — `UnitOfWork.new()` anti-pattern in 4 places

Files: decision_di_wiring_steps.py:82, helper_decision_di.py:37, decision_di_bench.py:67, decision_recording_bench.py

All bypass UnitOfWork.__init__() to directly set private attributes. If __init__ changes, all 4 break silently. Should be a shared fixture factory.

TEST-5 [MEDIUM] — DI wiring assertions are too weak (`>= 1`)

Files: decision_di_wiring_steps.py:145, 194

len(strategy_decisions) >= 1 passes even if 100 spurious decisions are recorded. Should assert the exact expected count.

TEST-6 [MEDIUM] — No test for invalid `decision_type` string

No scenario passes an invalid string like "garbage" to record_decision(decision_type=...) and asserts a ValueError is raised.

TEST-7 [MEDIUM] — `confidence_score` boundary values never tested

Tests only use 0.85. Never exercises 0.0, 1.0, negative, or >1.0. The domain model has ge=0.0, le=1.0 validators — untested at boundaries.

TEST-8 [LOW] — `get_decision` return not validated beyond `decision_id`

File: decision_recording_steps.py:652

step_match_recorded only compares decision_id, not question, chosen_option, decision_type, or any other field. A buggy get_decision returning a corrupted object would pass.

TEST-9 [LOW] — Robot tests are strict subset of Behave tests

File: robot/decision_recording.robot

7 Robot test cases duplicate existing Behave scenarios with no unique coverage. No error-path, persisted-mode, or negative tests in Robot.

TEST-10 [LOW] — No benchmarks for persisted (DB-backed) mode

Files: benchmarks/decision_recording_bench.py, benchmarks/decision_di_bench.py

All benchmarks use in-memory DecisionService(). The DB-backed path (with SQLAlchemy overhead, transactions, and serialization) has fundamentally different performance characteristics and is not benchmarked.

6. Specification Conformance Issues

SPEC-1 [MEDIUM] — Missing uniqueness enforcement per spec schema

The spec (line 18470) defines sequence_number INTEGER NOT NULL in the decisions table. The existence of SequenceConflictError in the codebase implies uniqueness enforcement was planned. It is not implemented, and BUG-1 makes collisions possible.

SPEC-2 [LOW] — `init.py` missing `SequenceConflictError` export

File: src/cleveragents/application/services/__init__.py

SequenceConflictError is in decision_service.__all__ but not re-exported from the services package __init__.py. Also, PlanLifecycleService and its exceptions are absent from __init__.py despite being a peer service.

SPEC-3 [LOW] — Doc claim that retrieval methods raise `DecisionNotFoundError`

File: docs/reference/decision_service.md:75

States "All raise DecisionNotFoundError when the target decision does not exist (where applicable)." However, list_decisions and list_by_type return empty lists, not exceptions.

SPEC-4 [LOW] — `artifacts_produced` typed as `list[Any]` vs spec's `list[ArtifactRef]`

File: decision_service.py:230

The record_decision method accepts list[Any] for artifacts_produced, but the spec (line 18442) and domain model both define it as list[ArtifactRef]. The loose typing defeats compile-time checks.

7. Last Commit Review (`cd641a8`)

The final commit ("fix(service): align test and doc refs with DecisionService API") is a clean-up commit that fixes stale references to old method names (get_decisions_for_plan -> list_decisions, get_decision_tree -> get_tree) across 4 files. The diff is correct and no stale references remain.

One note: the commit message contains ISSUES CLOSED: #172, which will auto-close the issue on merge, but the issue's acceptance criteria (coverage >= 97%, full nox pass) are not confirmed in the commit chain.

Summary — Priority Ranking

#	Severity	Category	Finding
BUG-1	HIGH	Bug	`_plan_sequence` not rehydrated from DB — sequence collisions after restart
BUG-2	HIGH	Bug	`SequenceConflictError` dead code — no uniqueness guard exists
TEST-1	HIGH	Test Flaw	BFS order not actually verified in tree tests
TEST-2	HIGH	Test Gap	No test for sequence rehydration after service restart
TEST-3	HIGH	Test Flaw	`MagicMock()` without `spec=` silently hides bugs in 3 files
BUG-3	MEDIUM	Bug	`delete_decision` behavior diverges between persisted/in-memory modes
BUG-4	MEDIUM	Bug	`mark_superseded` doesn't validate replacement decision exists
BUG-5	MEDIUM	Bug	`get_tree` silently drops orphaned decisions
SEC-1	MEDIUM	Security	No size limit on `actor_reasoning` (log injection / storage exhaustion)
PERF-1	MEDIUM	Performance	`count_decisions` loads all objects just to count
PERF-2	MEDIUM	Performance	No pagination on any list method
TEST-4	MEDIUM	Test Flaw	`UnitOfWork.__new__()` anti-pattern in 4 places
TEST-5	MEDIUM	Test Flaw	DI assertions use `>= 1` instead of exact counts
TEST-6	MEDIUM	Test Gap	Invalid `decision_type` string never tested
TEST-7	MEDIUM	Test Gap	`confidence_score` boundaries never tested
SPEC-1	MEDIUM	Spec Gap	Sequence uniqueness enforcement missing per spec schema
BUG-6	LOW	Bug	`SnapshotStore` lost on restart (not persisted)
BUG-7	LOW	Bug	Dead code `_decision_seq` / `_next_seq` in `PlanLifecycleService`
SEC-2-4	LOW	Security	Truncated hash, unbounded hash index, `tempfile.mktemp`
SPEC-2-4	LOW	Spec	Missing exports, doc inaccuracies, loose typing
TEST-8-10	LOW	Test	Weak assertions, Robot subset, no DB-mode benchmarks

Recommendation: The HIGH-severity items (BUG-1, BUG-2, TEST-1, TEST-2, TEST-3) should be addressed before merge. BUG-1 in particular is a data integrity risk in any deployment that restarts the service, and the lack of a test for it (TEST-2) means it would go undetected until production.

# Code Review — PR #433 / Issue #172: `feat(service): add decision recording and snapshot store` **Branch:** `feature/m4-decision-service` **Reviewed commit:** `cd641a8` (last by Hamza, 7-commit chain from `5539fd0..cd641a8`) **Scope:** 15 files changed, ~2600 insertions, ~160 deletions **Review focus:** bugs, security, performance, test coverage/quality, spec conformance --- ## 1. Issue #172 Acceptance Criteria Compliance | Subtask | Status | Notes | |---|---|---| | Implement `DecisionService` with record/list/tree helpers + snapshot storage | Done | Full API in `decision_service.py` | | Add `docs/reference/decision_service.md` | Done | 148-line doc file | | Behave: `features/decision_recording.feature` | Done | 37 scenarios | | Robot: `robot/decision_recording.robot` | Done | 7 test cases | | ASV: `benchmarks/decision_recording_bench.py` | Done | 5 benchmark classes | | Coverage >= 97% via `nox -s coverage_report` | **Not verified** | No evidence the coverage gate was run/checked in the commit chain | | Run `nox` (all sessions), fix any errors | **Not verified** | No evidence a full nox pass was performed in the commit chain | --- ## 2. Bugs ### BUG-1 [HIGH] — Sequence numbers not rehydrated from DB on restart **File:** `decision_service.py:203, 622-626` `_plan_sequence` is an in-memory `dict` initialized to `{}`. In persisted mode, if the service restarts against an existing database with decisions already recorded, `_next_sequence()` starts from 0 again, causing `sequence_number` collisions. The spec (line 18398, 18470) requires `sequence_number` to be monotonically increasing per plan. ### BUG-2 [HIGH] — `SequenceConflictError` defined but never raised **File:** `decision_service.py:74-82` This exception is defined, exported in `__all__`, and whitelisted in vulture — but never actually raised anywhere. If BUG-1 causes a collision, there is no guard to catch it. The error exists as dead code suggesting uniqueness enforcement was intended but never implemented. ### BUG-3 [MEDIUM] — `delete_decision` inconsistent between modes **File:** `decision_service.py:537-560` In persisted mode, if the underlying repository's `delete()` silently succeeds for a non-existent ID (does not raise), the method returns `True` without error. In in-memory mode, it correctly raises `DecisionNotFoundError`. The behavior diverges depending on the backend. ### BUG-4 [MEDIUM] — `mark_superseded` does not validate `new_decision_id` exists **File:** `decision_service.py:483-523` The spec (line 18196) states "the original and all its descendants are marked `superseded_by = <new_decision_id>`." The service only marks the single targeted decision but never verifies the replacement decision actually exists — this can create dangling references. ### BUG-5 [MEDIUM] — `get_tree` silently drops orphaned subtrees **File:** `decision_service.py:383-425` If a decision's `parent_decision_id` references a deleted parent, BFS from roots never reaches it. The `if not roots` fallback only triggers when there are zero roots. Orphaned subtrees are silently omitted from the returned tree, violating the spec's "Reachability" invariant (spec line 18203). ### BUG-6 [LOW] — `SnapshotStore` is purely in-memory, lost on restart **File:** `decision_service.py:90-164` Snapshots stored in `SnapshotStore` (the hash-indexed secondary store) are never persisted to the database. After a service restart, `get_snapshot()` and `get_snapshots_for_plan()` return `None`/empty even though the decisions (with embedded snapshots) still exist in the DB. The spec (line 18421-18425) describes the context snapshot as a replay-critical artifact. ### BUG-7 [LOW] — Dead code in `PlanLifecycleService` **File:** `plan_lifecycle_service.py:177-188` `_decision_seq` dict and `_next_seq()` method are never called anywhere in the service. `_try_record_decision` delegates sequence numbering to `DecisionService.record_decision()`. These are dead code, masked by vulture whitelist entries at `vulture_whitelist.py:339-340`. --- ## 3. Security Issues ### SEC-1 [MEDIUM] — No size limits on `actor_reasoning` field **File:** `decision_service.py:249` Raw LLM reasoning traces can be arbitrarily large. No bounds are enforced before storage or logging. This creates risk for log injection and storage exhaustion attacks. ### SEC-2 [LOW] — Truncated SHA-256 hash with misleading prefix **File:** `decision_service.py:672` `_auto_capture_snapshot` truncates the SHA-256 to 16 hex chars (64 bits) but prefixes it with `sha256:`, implying a full cryptographic hash. If any downstream code relies on this for integrity verification, the collision resistance is far lower than expected. ### SEC-3 [LOW] — `SnapshotStore._hash_index` grows without bound **File:** `decision_service.py:99` No eviction policy. A caller generating many unique snapshots causes unbounded memory growth. ### SEC-4 [LOW] — `tempfile.mktemp` usage in test step file **File:** `features/steps/decision_recording_steps.py:966` `tempfile.mktemp` is deprecated due to a TOCTOU race condition. This is test-only but is a pattern repeated across 17 places in the test suite. --- ## 4. Performance Issues ### PERF-1 [MEDIUM] — `count_decisions` loads all objects just to count them **File:** `decision_service.py:593-602` `count_decisions(plan_id)` calls `list_decisions(plan_id)` which deserializes every decision from the DB, builds a Python list, then takes `len()`. Should delegate to `COUNT(*)` in persisted mode. ### PERF-2 [MEDIUM] — No pagination on any list method **Files:** `decision_service.py` — `list_decisions`, `list_by_type`, `get_tree`, `get_superseded`, `get_snapshots_for_plan` None of these support `limit`/`offset`. For plans with hundreds or thousands of decisions, this returns unbounded result sets. ### PERF-3 [LOW] — `get_tree` and `get_superseded` load full decision set **File:** `decision_service.py:395, 476` Both call `list_decisions()` to load the entire plan's decisions into memory, then perform in-memory filtering/traversal. ### PERF-4 [LOW] — `SnapshotStore._hash_index` uses `list` instead of `set` **File:** `decision_service.py:99, 145` `list.remove(decision_id)` is O(n) for each removal. Using a `set` would be O(1). --- ## 5. Test Coverage and Quality Issues ### TEST-1 [HIGH] — BFS order not actually verified in tree tests **Files:** `features/decision_recording.feature:103`, `robot/helper_decision_recording.py:99` The "Get tree returns BFS order from root" scenario asserts count and root-first, but never verifies actual BFS ordering. The level-order property (all depth-1 children before depth-2 grandchildren) is not checked. A broken BFS that returns DFS order would pass. ### TEST-2 [HIGH] — No test for sequence rehydration after restart **Files:** All test files No test creates decisions, destroys the service, recreates it against the same DB, and verifies that sequence numbers continue from where they left off. This is exactly the scenario where BUG-1 manifests. ### TEST-3 [HIGH] — `MagicMock()` without `spec=` hides bugs **Files:** `decision_di_wiring_steps.py:111`, `helper_decision_di.py:87`, `decision_di_bench.py:153` All use `MagicMock()` for `settings` without `spec=Settings`. This silently creates auto-attributes on access, meaning code that reads `settings.nonexistent_property` gets a silent mock object instead of failing. Should use `create_autospec(Settings)`. ### TEST-4 [MEDIUM] — `UnitOfWork.__new__()` anti-pattern in 4 places **Files:** `decision_di_wiring_steps.py:82`, `helper_decision_di.py:37`, `decision_di_bench.py:67`, `decision_recording_bench.py` All bypass `UnitOfWork.__init__()` to directly set private attributes. If `__init__` changes, all 4 break silently. Should be a shared fixture factory. ### TEST-5 [MEDIUM] — DI wiring assertions are too weak (`>= 1`) **Files:** `decision_di_wiring_steps.py:145, 194` `len(strategy_decisions) >= 1` passes even if 100 spurious decisions are recorded. Should assert the exact expected count. ### TEST-6 [MEDIUM] — No test for invalid `decision_type` string No scenario passes an invalid string like `"garbage"` to `record_decision(decision_type=...)` and asserts a `ValueError` is raised. ### TEST-7 [MEDIUM] — `confidence_score` boundary values never tested Tests only use `0.85`. Never exercises `0.0`, `1.0`, negative, or `>1.0`. The domain model has `ge=0.0, le=1.0` validators — untested at boundaries. ### TEST-8 [LOW] — `get_decision` return not validated beyond `decision_id` **File:** `decision_recording_steps.py:652` `step_match_recorded` only compares `decision_id`, not `question`, `chosen_option`, `decision_type`, or any other field. A buggy `get_decision` returning a corrupted object would pass. ### TEST-9 [LOW] — Robot tests are strict subset of Behave tests **File:** `robot/decision_recording.robot` 7 Robot test cases duplicate existing Behave scenarios with no unique coverage. No error-path, persisted-mode, or negative tests in Robot. ### TEST-10 [LOW] — No benchmarks for persisted (DB-backed) mode **Files:** `benchmarks/decision_recording_bench.py`, `benchmarks/decision_di_bench.py` All benchmarks use in-memory `DecisionService()`. The DB-backed path (with SQLAlchemy overhead, transactions, and serialization) has fundamentally different performance characteristics and is not benchmarked. --- ## 6. Specification Conformance Issues ### SPEC-1 [MEDIUM] — Missing uniqueness enforcement per spec schema The spec (line 18470) defines `sequence_number INTEGER NOT NULL` in the decisions table. The existence of `SequenceConflictError` in the codebase implies uniqueness enforcement was planned. It is not implemented, and BUG-1 makes collisions possible. ### SPEC-2 [LOW] — `__init__.py` missing `SequenceConflictError` export **File:** `src/cleveragents/application/services/__init__.py` `SequenceConflictError` is in `decision_service.__all__` but not re-exported from the services package `__init__.py`. Also, `PlanLifecycleService` and its exceptions are absent from `__init__.py` despite being a peer service. ### SPEC-3 [LOW] — Doc claim that retrieval methods raise `DecisionNotFoundError` **File:** `docs/reference/decision_service.md:75` States "All raise `DecisionNotFoundError` when the target decision does not exist (where applicable)." However, `list_decisions` and `list_by_type` return empty lists, not exceptions. ### SPEC-4 [LOW] — `artifacts_produced` typed as `list[Any]` vs spec's `list[ArtifactRef]` **File:** `decision_service.py:230` The `record_decision` method accepts `list[Any]` for `artifacts_produced`, but the spec (line 18442) and domain model both define it as `list[ArtifactRef]`. The loose typing defeats compile-time checks. --- ## 7. Last Commit Review (`cd641a8`) The final commit ("fix(service): align test and doc refs with DecisionService API") is a clean-up commit that fixes stale references to old method names (`get_decisions_for_plan` -> `list_decisions`, `get_decision_tree` -> `get_tree`) across 4 files. The diff is correct and no stale references remain. One note: the commit message contains `ISSUES CLOSED: #172`, which will auto-close the issue on merge, but the issue's acceptance criteria (coverage >= 97%, full `nox` pass) are not confirmed in the commit chain. --- ## Summary — Priority Ranking | # | Severity | Category | Finding | |---|---|---|---| | BUG-1 | **HIGH** | Bug | `_plan_sequence` not rehydrated from DB — sequence collisions after restart | | BUG-2 | **HIGH** | Bug | `SequenceConflictError` dead code — no uniqueness guard exists | | TEST-1 | **HIGH** | Test Flaw | BFS order not actually verified in tree tests | | TEST-2 | **HIGH** | Test Gap | No test for sequence rehydration after service restart | | TEST-3 | **HIGH** | Test Flaw | `MagicMock()` without `spec=` silently hides bugs in 3 files | | BUG-3 | **MEDIUM** | Bug | `delete_decision` behavior diverges between persisted/in-memory modes | | BUG-4 | **MEDIUM** | Bug | `mark_superseded` doesn't validate replacement decision exists | | BUG-5 | **MEDIUM** | Bug | `get_tree` silently drops orphaned decisions | | SEC-1 | **MEDIUM** | Security | No size limit on `actor_reasoning` (log injection / storage exhaustion) | | PERF-1 | **MEDIUM** | Performance | `count_decisions` loads all objects just to count | | PERF-2 | **MEDIUM** | Performance | No pagination on any list method | | TEST-4 | **MEDIUM** | Test Flaw | `UnitOfWork.__new__()` anti-pattern in 4 places | | TEST-5 | **MEDIUM** | Test Flaw | DI assertions use `>= 1` instead of exact counts | | TEST-6 | **MEDIUM** | Test Gap | Invalid `decision_type` string never tested | | TEST-7 | **MEDIUM** | Test Gap | `confidence_score` boundaries never tested | | SPEC-1 | **MEDIUM** | Spec Gap | Sequence uniqueness enforcement missing per spec schema | | BUG-6 | LOW | Bug | `SnapshotStore` lost on restart (not persisted) | | BUG-7 | LOW | Bug | Dead code `_decision_seq` / `_next_seq` in `PlanLifecycleService` | | SEC-2-4 | LOW | Security | Truncated hash, unbounded hash index, `tempfile.mktemp` | | SPEC-2-4 | LOW | Spec | Missing exports, doc inaccuracies, loose typing | | TEST-8-10 | LOW | Test | Weak assertions, Robot subset, no DB-mode benchmarks | --- **Recommendation:** The HIGH-severity items (BUG-1, BUG-2, TEST-1, TEST-2, TEST-3) should be addressed before merge. BUG-1 in particular is a data integrity risk in any deployment that restarts the service, and the lack of a test for it (TEST-2) means it would go undetected until production.

freemo added the

labels 2026-03-01 04:36:13 +00:00

hamza.khyari requested review from aditya 2026-03-03 03:14:42 +00:00

hamza.khyari requested review from CoreRasurae 2026-03-03 03:14:43 +00:00

hamza.khyari force-pushed feature/m4-decision-service from cd641a8931 to 2a9606cb72

2026-03-03 03:17:39 +00:00

Compare

aditya approved these changes 2026-03-03 08:41:32 +00:00

Dismissed

hurui200320 referenced this pull request

2026-03-03 08:43:52 +00:00

test(e2e): validate M3 acceptance criteria for v3.2.0 milestone closure #494

hamza.khyari force-pushed feature/m4-decision-service from 2a9606cb72 to c2f8bc0eaf

2026-03-03 12:20:26 +00:00

Compare

hamza.khyari dismissed aditya's review 2026-03-03 12:20:26 +00:00

Reason:

New commits pushed, approval review dismissed automatically according to repository settings

hamza.khyari scheduled this pull request to auto merge when all checks succeed 2026-03-03 12:21:15 +00:00

hamza.khyari force-pushed feature/m4-decision-service from c2f8bc0eaf to 33a9d1e7d8

2026-03-03 12:25:07 +00:00

Compare

hamza.khyari force-pushed feature/m4-decision-service from 33a9d1e7d8 to 0e36755db9

2026-03-03 12:53:31 +00:00

Compare

hamza.khyari merged commit 8249c73aed into master

2026-03-03 12:58:17 +00:00

hamza.khyari referenced this issue from a commit

2026-03-03 12:58:19 +00:00

Merge pull request 'feat(service): add decision recording and snapshot store' (#433) from feature/m4-decision-service into master

Sign in to join this conversation.

4 Participants

Notifications

Due Date

No due date set.

Blocks

#172 feat(service): add decision recording and snapshot store

cleveragents/cleveragents-core

Reference: cleveragents/cleveragents-core#433

feat(service): add decision recording and snapshot store #433

Summary

What

Tests

Docs

Note on cherry-picked commits

Dependencies

Code Review — PR #433 / Issue #172: feat(service): add decision recording and snapshot store

1. Issue #172 Acceptance Criteria Compliance

2. Bugs

BUG-1 [HIGH] — Sequence numbers not rehydrated from DB on restart

BUG-2 [HIGH] — SequenceConflictError defined but never raised

BUG-3 [MEDIUM] — delete_decision inconsistent between modes

BUG-4 [MEDIUM] — mark_superseded does not validate new_decision_id exists

BUG-5 [MEDIUM] — get_tree silently drops orphaned subtrees

BUG-6 [LOW] — SnapshotStore is purely in-memory, lost on restart

BUG-7 [LOW] — Dead code in PlanLifecycleService

3. Security Issues

SEC-1 [MEDIUM] — No size limits on actor_reasoning field

SEC-2 [LOW] — Truncated SHA-256 hash with misleading prefix

SEC-3 [LOW] — SnapshotStore._hash_index grows without bound

SEC-4 [LOW] — tempfile.mktemp usage in test step file

4. Performance Issues

PERF-1 [MEDIUM] — count_decisions loads all objects just to count them

PERF-2 [MEDIUM] — No pagination on any list method

PERF-3 [LOW] — get_tree and get_superseded load full decision set

PERF-4 [LOW] — SnapshotStore._hash_index uses list instead of set

5. Test Coverage and Quality Issues

TEST-1 [HIGH] — BFS order not actually verified in tree tests

TEST-2 [HIGH] — No test for sequence rehydration after restart

TEST-3 [HIGH] — MagicMock() without spec= hides bugs

TEST-4 [MEDIUM] — UnitOfWork.__new__() anti-pattern in 4 places

TEST-5 [MEDIUM] — DI wiring assertions are too weak (>= 1)

TEST-6 [MEDIUM] — No test for invalid decision_type string

TEST-7 [MEDIUM] — confidence_score boundary values never tested

TEST-8 [LOW] — get_decision return not validated beyond decision_id

TEST-9 [LOW] — Robot tests are strict subset of Behave tests

TEST-10 [LOW] — No benchmarks for persisted (DB-backed) mode

6. Specification Conformance Issues

SPEC-1 [MEDIUM] — Missing uniqueness enforcement per spec schema

SPEC-2 [LOW] — __init__.py missing SequenceConflictError export

SPEC-3 [LOW] — Doc claim that retrieval methods raise DecisionNotFoundError

SPEC-4 [LOW] — artifacts_produced typed as list[Any] vs spec's list[ArtifactRef]

7. Last Commit Review (cd641a8)

Summary — Priority Ranking

Code Review — PR #433 / Issue #172: `feat(service): add decision recording and snapshot store`

BUG-2 [HIGH] — `SequenceConflictError` defined but never raised

BUG-3 [MEDIUM] — `delete_decision` inconsistent between modes

BUG-4 [MEDIUM] — `mark_superseded` does not validate `new_decision_id` exists

BUG-5 [MEDIUM] — `get_tree` silently drops orphaned subtrees

BUG-6 [LOW] — `SnapshotStore` is purely in-memory, lost on restart

BUG-7 [LOW] — Dead code in `PlanLifecycleService`

SEC-1 [MEDIUM] — No size limits on `actor_reasoning` field

SEC-3 [LOW] — `SnapshotStore._hash_index` grows without bound

SEC-4 [LOW] — `tempfile.mktemp` usage in test step file

PERF-1 [MEDIUM] — `count_decisions` loads all objects just to count them

PERF-3 [LOW] — `get_tree` and `get_superseded` load full decision set

PERF-4 [LOW] — `SnapshotStore._hash_index` uses `list` instead of `set`

TEST-3 [HIGH] — `MagicMock()` without `spec=` hides bugs

TEST-4 [MEDIUM] — `UnitOfWork.new()` anti-pattern in 4 places

TEST-5 [MEDIUM] — DI wiring assertions are too weak (`>= 1`)

TEST-6 [MEDIUM] — No test for invalid `decision_type` string

TEST-7 [MEDIUM] — `confidence_score` boundary values never tested

TEST-8 [LOW] — `get_decision` return not validated beyond `decision_id`

SPEC-2 [LOW] — `init.py` missing `SequenceConflictError` export

SPEC-3 [LOW] — Doc claim that retrieval methods raise `DecisionNotFoundError`

SPEC-4 [LOW] — `artifacts_produced` typed as `list[Any]` vs spec's `list[ArtifactRef]`

7. Last Commit Review (`cd641a8`)