feat(service): add decision recording and snapshot store #433

Merged
hamza.khyari merged 9 commits from feature/m4-decision-service into master 2026-03-03 12:58:17 +00:00
Member

Summary

Implements DecisionService with record/list/tree helpers and SnapshotStore for Forgejo issue #172.

What

  • DecisionService — dual-mode (in-memory / DB-backed) service with:
    • record_decision() with auto-sequencing, auto-snapshot capture, string-to-enum coercion
    • get_decision(), list_decisions(), list_by_type() retrieval
    • get_tree() (BFS), get_path_to_root() tree navigation
    • mark_superseded(), get_superseded() supersession support
    • delete_decision() with snapshot cleanup
    • count_decisions(), get_next_sequence() statistics
  • SnapshotStore — hash-based deduplication index for ContextSnapshot objects
  • Custom exceptionsDuplicateDecisionError, DecisionNotFoundError, SequenceConflictError

Tests

  • Behave: 37 scenarios / 168 steps in features/decision_recording.feature (all passing)
  • Robot: 7 smoke tests in robot/decision_recording.robot with helper script
  • ASV: 6 benchmarks for record throughput, list, tree, and snapshot operations

Docs

  • docs/reference/decision_service.md covering all public API, dual-mode persistence, and exceptions

Note on cherry-picked commits

This branch includes 2 cherry-picked commits from PR #131 (decision persistence) to provide DecisionRepository for the UoW code paths. When #131 merges to master, a rebase will make these cherry-picks disappear cleanly.

Closes #172

Dependencies

  • Blocks issue #172 (this PR must be merged before the issue can be closed)
## Summary Implements `DecisionService` with record/list/tree helpers and `SnapshotStore` for Forgejo issue #172. ### What - **`DecisionService`** — dual-mode (in-memory / DB-backed) service with: - `record_decision()` with auto-sequencing, auto-snapshot capture, string-to-enum coercion - `get_decision()`, `list_decisions()`, `list_by_type()` retrieval - `get_tree()` (BFS), `get_path_to_root()` tree navigation - `mark_superseded()`, `get_superseded()` supersession support - `delete_decision()` with snapshot cleanup - `count_decisions()`, `get_next_sequence()` statistics - **`SnapshotStore`** — hash-based deduplication index for `ContextSnapshot` objects - **Custom exceptions** — `DuplicateDecisionError`, `DecisionNotFoundError`, `SequenceConflictError` ### Tests - **Behave**: 37 scenarios / 168 steps in `features/decision_recording.feature` (all passing) - **Robot**: 7 smoke tests in `robot/decision_recording.robot` with helper script - **ASV**: 6 benchmarks for record throughput, list, tree, and snapshot operations ### Docs - `docs/reference/decision_service.md` covering all public API, dual-mode persistence, and exceptions ### Note on cherry-picked commits This branch includes 2 cherry-picked commits from PR #131 (decision persistence) to provide `DecisionRepository` for the UoW code paths. When #131 merges to master, a rebase will make these cherry-picks disappear cleanly. Closes #172 ## Dependencies - Blocks issue #172 (this PR must be merged before the issue can be closed)
freemo added this to the v3.2.0 milestone 2026-02-25 18:13:28 +00:00
hamza.khyari force-pushed feature/m4-decision-service from cd5ad39642
All checks were successful
CI / lint (pull_request) Successful in 23s
CI / typecheck (pull_request) Successful in 52s
CI / security (pull_request) Successful in 50s
CI / quality (pull_request) Successful in 27s
CI / integration_tests (pull_request) Successful in 5m9s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 23s
CI / unit_tests (pull_request) Successful in 24m9s
CI / docker (pull_request) Successful in 1m8s
CI / benchmark-regression (pull_request) Successful in 22m20s
CI / coverage (pull_request) Successful in 1h30m28s
to 0f4b8a967f
Some checks failed
CI / lint (pull_request) Successful in 20s
CI / typecheck (pull_request) Failing after 34s
CI / coverage (pull_request) Has been skipped
CI / security (pull_request) Successful in 29s
CI / quality (pull_request) Successful in 29s
CI / benchmark-publish (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Has been skipped
CI / build (pull_request) Successful in 15s
CI / integration_tests (pull_request) Failing after 2m44s
CI / unit_tests (pull_request) Has been cancelled
CI / docker (pull_request) Has been cancelled
2026-02-27 12:35:41 +00:00
Compare
Member

Code Review — PR #433 / Issue #172: feat(service): add decision recording and snapshot store

Branch: feature/m4-decision-service
Reviewed commit: cd641a8 (last by Hamza, 7-commit chain from 5539fd0..cd641a8)
Scope: 15 files changed, ~2600 insertions, ~160 deletions
Review focus: bugs, security, performance, test coverage/quality, spec conformance


1. Issue #172 Acceptance Criteria Compliance

Subtask Status Notes
Implement DecisionService with record/list/tree helpers + snapshot storage Done Full API in decision_service.py
Add docs/reference/decision_service.md Done 148-line doc file
Behave: features/decision_recording.feature Done 37 scenarios
Robot: robot/decision_recording.robot Done 7 test cases
ASV: benchmarks/decision_recording_bench.py Done 5 benchmark classes
Coverage >= 97% via nox -s coverage_report Not verified No evidence the coverage gate was run/checked in the commit chain
Run nox (all sessions), fix any errors Not verified No evidence a full nox pass was performed in the commit chain

2. Bugs

BUG-1 [HIGH] — Sequence numbers not rehydrated from DB on restart

File: decision_service.py:203, 622-626

_plan_sequence is an in-memory dict initialized to {}. In persisted mode, if the service restarts against an existing database with decisions already recorded, _next_sequence() starts from 0 again, causing sequence_number collisions. The spec (line 18398, 18470) requires sequence_number to be monotonically increasing per plan.

BUG-2 [HIGH] — SequenceConflictError defined but never raised

File: decision_service.py:74-82

This exception is defined, exported in __all__, and whitelisted in vulture — but never actually raised anywhere. If BUG-1 causes a collision, there is no guard to catch it. The error exists as dead code suggesting uniqueness enforcement was intended but never implemented.

BUG-3 [MEDIUM] — delete_decision inconsistent between modes

File: decision_service.py:537-560

In persisted mode, if the underlying repository's delete() silently succeeds for a non-existent ID (does not raise), the method returns True without error. In in-memory mode, it correctly raises DecisionNotFoundError. The behavior diverges depending on the backend.

BUG-4 [MEDIUM] — mark_superseded does not validate new_decision_id exists

File: decision_service.py:483-523

The spec (line 18196) states "the original and all its descendants are marked superseded_by = <new_decision_id>." The service only marks the single targeted decision but never verifies the replacement decision actually exists — this can create dangling references.

BUG-5 [MEDIUM] — get_tree silently drops orphaned subtrees

File: decision_service.py:383-425

If a decision's parent_decision_id references a deleted parent, BFS from roots never reaches it. The if not roots fallback only triggers when there are zero roots. Orphaned subtrees are silently omitted from the returned tree, violating the spec's "Reachability" invariant (spec line 18203).

BUG-6 [LOW] — SnapshotStore is purely in-memory, lost on restart

File: decision_service.py:90-164

Snapshots stored in SnapshotStore (the hash-indexed secondary store) are never persisted to the database. After a service restart, get_snapshot() and get_snapshots_for_plan() return None/empty even though the decisions (with embedded snapshots) still exist in the DB. The spec (line 18421-18425) describes the context snapshot as a replay-critical artifact.

BUG-7 [LOW] — Dead code in PlanLifecycleService

File: plan_lifecycle_service.py:177-188

_decision_seq dict and _next_seq() method are never called anywhere in the service. _try_record_decision delegates sequence numbering to DecisionService.record_decision(). These are dead code, masked by vulture whitelist entries at vulture_whitelist.py:339-340.


3. Security Issues

SEC-1 [MEDIUM] — No size limits on actor_reasoning field

File: decision_service.py:249

Raw LLM reasoning traces can be arbitrarily large. No bounds are enforced before storage or logging. This creates risk for log injection and storage exhaustion attacks.

SEC-2 [LOW] — Truncated SHA-256 hash with misleading prefix

File: decision_service.py:672

_auto_capture_snapshot truncates the SHA-256 to 16 hex chars (64 bits) but prefixes it with sha256:, implying a full cryptographic hash. If any downstream code relies on this for integrity verification, the collision resistance is far lower than expected.

SEC-3 [LOW] — SnapshotStore._hash_index grows without bound

File: decision_service.py:99

No eviction policy. A caller generating many unique snapshots causes unbounded memory growth.

SEC-4 [LOW] — tempfile.mktemp usage in test step file

File: features/steps/decision_recording_steps.py:966

tempfile.mktemp is deprecated due to a TOCTOU race condition. This is test-only but is a pattern repeated across 17 places in the test suite.


4. Performance Issues

PERF-1 [MEDIUM] — count_decisions loads all objects just to count them

File: decision_service.py:593-602

count_decisions(plan_id) calls list_decisions(plan_id) which deserializes every decision from the DB, builds a Python list, then takes len(). Should delegate to COUNT(*) in persisted mode.

PERF-2 [MEDIUM] — No pagination on any list method

Files: decision_service.pylist_decisions, list_by_type, get_tree, get_superseded, get_snapshots_for_plan

None of these support limit/offset. For plans with hundreds or thousands of decisions, this returns unbounded result sets.

PERF-3 [LOW] — get_tree and get_superseded load full decision set

File: decision_service.py:395, 476

Both call list_decisions() to load the entire plan's decisions into memory, then perform in-memory filtering/traversal.

PERF-4 [LOW] — SnapshotStore._hash_index uses list instead of set

File: decision_service.py:99, 145

list.remove(decision_id) is O(n) for each removal. Using a set would be O(1).


5. Test Coverage and Quality Issues

TEST-1 [HIGH] — BFS order not actually verified in tree tests

Files: features/decision_recording.feature:103, robot/helper_decision_recording.py:99

The "Get tree returns BFS order from root" scenario asserts count and root-first, but never verifies actual BFS ordering. The level-order property (all depth-1 children before depth-2 grandchildren) is not checked. A broken BFS that returns DFS order would pass.

TEST-2 [HIGH] — No test for sequence rehydration after restart

Files: All test files

No test creates decisions, destroys the service, recreates it against the same DB, and verifies that sequence numbers continue from where they left off. This is exactly the scenario where BUG-1 manifests.

TEST-3 [HIGH] — MagicMock() without spec= hides bugs

Files: decision_di_wiring_steps.py:111, helper_decision_di.py:87, decision_di_bench.py:153

All use MagicMock() for settings without spec=Settings. This silently creates auto-attributes on access, meaning code that reads settings.nonexistent_property gets a silent mock object instead of failing. Should use create_autospec(Settings).

TEST-4 [MEDIUM] — UnitOfWork.__new__() anti-pattern in 4 places

Files: decision_di_wiring_steps.py:82, helper_decision_di.py:37, decision_di_bench.py:67, decision_recording_bench.py

All bypass UnitOfWork.__init__() to directly set private attributes. If __init__ changes, all 4 break silently. Should be a shared fixture factory.

TEST-5 [MEDIUM] — DI wiring assertions are too weak (>= 1)

Files: decision_di_wiring_steps.py:145, 194

len(strategy_decisions) >= 1 passes even if 100 spurious decisions are recorded. Should assert the exact expected count.

TEST-6 [MEDIUM] — No test for invalid decision_type string

No scenario passes an invalid string like "garbage" to record_decision(decision_type=...) and asserts a ValueError is raised.

TEST-7 [MEDIUM] — confidence_score boundary values never tested

Tests only use 0.85. Never exercises 0.0, 1.0, negative, or >1.0. The domain model has ge=0.0, le=1.0 validators — untested at boundaries.

TEST-8 [LOW] — get_decision return not validated beyond decision_id

File: decision_recording_steps.py:652

step_match_recorded only compares decision_id, not question, chosen_option, decision_type, or any other field. A buggy get_decision returning a corrupted object would pass.

TEST-9 [LOW] — Robot tests are strict subset of Behave tests

File: robot/decision_recording.robot

7 Robot test cases duplicate existing Behave scenarios with no unique coverage. No error-path, persisted-mode, or negative tests in Robot.

TEST-10 [LOW] — No benchmarks for persisted (DB-backed) mode

Files: benchmarks/decision_recording_bench.py, benchmarks/decision_di_bench.py

All benchmarks use in-memory DecisionService(). The DB-backed path (with SQLAlchemy overhead, transactions, and serialization) has fundamentally different performance characteristics and is not benchmarked.


6. Specification Conformance Issues

SPEC-1 [MEDIUM] — Missing uniqueness enforcement per spec schema

The spec (line 18470) defines sequence_number INTEGER NOT NULL in the decisions table. The existence of SequenceConflictError in the codebase implies uniqueness enforcement was planned. It is not implemented, and BUG-1 makes collisions possible.

SPEC-2 [LOW] — __init__.py missing SequenceConflictError export

File: src/cleveragents/application/services/__init__.py

SequenceConflictError is in decision_service.__all__ but not re-exported from the services package __init__.py. Also, PlanLifecycleService and its exceptions are absent from __init__.py despite being a peer service.

SPEC-3 [LOW] — Doc claim that retrieval methods raise DecisionNotFoundError

File: docs/reference/decision_service.md:75

States "All raise DecisionNotFoundError when the target decision does not exist (where applicable)." However, list_decisions and list_by_type return empty lists, not exceptions.

SPEC-4 [LOW] — artifacts_produced typed as list[Any] vs spec's list[ArtifactRef]

File: decision_service.py:230

The record_decision method accepts list[Any] for artifacts_produced, but the spec (line 18442) and domain model both define it as list[ArtifactRef]. The loose typing defeats compile-time checks.


7. Last Commit Review (cd641a8)

The final commit ("fix(service): align test and doc refs with DecisionService API") is a clean-up commit that fixes stale references to old method names (get_decisions_for_plan -> list_decisions, get_decision_tree -> get_tree) across 4 files. The diff is correct and no stale references remain.

One note: the commit message contains ISSUES CLOSED: #172, which will auto-close the issue on merge, but the issue's acceptance criteria (coverage >= 97%, full nox pass) are not confirmed in the commit chain.


Summary — Priority Ranking

# Severity Category Finding
BUG-1 HIGH Bug _plan_sequence not rehydrated from DB — sequence collisions after restart
BUG-2 HIGH Bug SequenceConflictError dead code — no uniqueness guard exists
TEST-1 HIGH Test Flaw BFS order not actually verified in tree tests
TEST-2 HIGH Test Gap No test for sequence rehydration after service restart
TEST-3 HIGH Test Flaw MagicMock() without spec= silently hides bugs in 3 files
BUG-3 MEDIUM Bug delete_decision behavior diverges between persisted/in-memory modes
BUG-4 MEDIUM Bug mark_superseded doesn't validate replacement decision exists
BUG-5 MEDIUM Bug get_tree silently drops orphaned decisions
SEC-1 MEDIUM Security No size limit on actor_reasoning (log injection / storage exhaustion)
PERF-1 MEDIUM Performance count_decisions loads all objects just to count
PERF-2 MEDIUM Performance No pagination on any list method
TEST-4 MEDIUM Test Flaw UnitOfWork.__new__() anti-pattern in 4 places
TEST-5 MEDIUM Test Flaw DI assertions use >= 1 instead of exact counts
TEST-6 MEDIUM Test Gap Invalid decision_type string never tested
TEST-7 MEDIUM Test Gap confidence_score boundaries never tested
SPEC-1 MEDIUM Spec Gap Sequence uniqueness enforcement missing per spec schema
BUG-6 LOW Bug SnapshotStore lost on restart (not persisted)
BUG-7 LOW Bug Dead code _decision_seq / _next_seq in PlanLifecycleService
SEC-2-4 LOW Security Truncated hash, unbounded hash index, tempfile.mktemp
SPEC-2-4 LOW Spec Missing exports, doc inaccuracies, loose typing
TEST-8-10 LOW Test Weak assertions, Robot subset, no DB-mode benchmarks

Recommendation: The HIGH-severity items (BUG-1, BUG-2, TEST-1, TEST-2, TEST-3) should be addressed before merge. BUG-1 in particular is a data integrity risk in any deployment that restarts the service, and the lack of a test for it (TEST-2) means it would go undetected until production.

# Code Review — PR #433 / Issue #172: `feat(service): add decision recording and snapshot store` **Branch:** `feature/m4-decision-service` **Reviewed commit:** `cd641a8` (last by Hamza, 7-commit chain from `5539fd0..cd641a8`) **Scope:** 15 files changed, ~2600 insertions, ~160 deletions **Review focus:** bugs, security, performance, test coverage/quality, spec conformance --- ## 1. Issue #172 Acceptance Criteria Compliance | Subtask | Status | Notes | |---|---|---| | Implement `DecisionService` with record/list/tree helpers + snapshot storage | Done | Full API in `decision_service.py` | | Add `docs/reference/decision_service.md` | Done | 148-line doc file | | Behave: `features/decision_recording.feature` | Done | 37 scenarios | | Robot: `robot/decision_recording.robot` | Done | 7 test cases | | ASV: `benchmarks/decision_recording_bench.py` | Done | 5 benchmark classes | | Coverage >= 97% via `nox -s coverage_report` | **Not verified** | No evidence the coverage gate was run/checked in the commit chain | | Run `nox` (all sessions), fix any errors | **Not verified** | No evidence a full nox pass was performed in the commit chain | --- ## 2. Bugs ### BUG-1 [HIGH] — Sequence numbers not rehydrated from DB on restart **File:** `decision_service.py:203, 622-626` `_plan_sequence` is an in-memory `dict` initialized to `{}`. In persisted mode, if the service restarts against an existing database with decisions already recorded, `_next_sequence()` starts from 0 again, causing `sequence_number` collisions. The spec (line 18398, 18470) requires `sequence_number` to be monotonically increasing per plan. ### BUG-2 [HIGH] — `SequenceConflictError` defined but never raised **File:** `decision_service.py:74-82` This exception is defined, exported in `__all__`, and whitelisted in vulture — but never actually raised anywhere. If BUG-1 causes a collision, there is no guard to catch it. The error exists as dead code suggesting uniqueness enforcement was intended but never implemented. ### BUG-3 [MEDIUM] — `delete_decision` inconsistent between modes **File:** `decision_service.py:537-560` In persisted mode, if the underlying repository's `delete()` silently succeeds for a non-existent ID (does not raise), the method returns `True` without error. In in-memory mode, it correctly raises `DecisionNotFoundError`. The behavior diverges depending on the backend. ### BUG-4 [MEDIUM] — `mark_superseded` does not validate `new_decision_id` exists **File:** `decision_service.py:483-523` The spec (line 18196) states "the original and all its descendants are marked `superseded_by = <new_decision_id>`." The service only marks the single targeted decision but never verifies the replacement decision actually exists — this can create dangling references. ### BUG-5 [MEDIUM] — `get_tree` silently drops orphaned subtrees **File:** `decision_service.py:383-425` If a decision's `parent_decision_id` references a deleted parent, BFS from roots never reaches it. The `if not roots` fallback only triggers when there are zero roots. Orphaned subtrees are silently omitted from the returned tree, violating the spec's "Reachability" invariant (spec line 18203). ### BUG-6 [LOW] — `SnapshotStore` is purely in-memory, lost on restart **File:** `decision_service.py:90-164` Snapshots stored in `SnapshotStore` (the hash-indexed secondary store) are never persisted to the database. After a service restart, `get_snapshot()` and `get_snapshots_for_plan()` return `None`/empty even though the decisions (with embedded snapshots) still exist in the DB. The spec (line 18421-18425) describes the context snapshot as a replay-critical artifact. ### BUG-7 [LOW] — Dead code in `PlanLifecycleService` **File:** `plan_lifecycle_service.py:177-188` `_decision_seq` dict and `_next_seq()` method are never called anywhere in the service. `_try_record_decision` delegates sequence numbering to `DecisionService.record_decision()`. These are dead code, masked by vulture whitelist entries at `vulture_whitelist.py:339-340`. --- ## 3. Security Issues ### SEC-1 [MEDIUM] — No size limits on `actor_reasoning` field **File:** `decision_service.py:249` Raw LLM reasoning traces can be arbitrarily large. No bounds are enforced before storage or logging. This creates risk for log injection and storage exhaustion attacks. ### SEC-2 [LOW] — Truncated SHA-256 hash with misleading prefix **File:** `decision_service.py:672` `_auto_capture_snapshot` truncates the SHA-256 to 16 hex chars (64 bits) but prefixes it with `sha256:`, implying a full cryptographic hash. If any downstream code relies on this for integrity verification, the collision resistance is far lower than expected. ### SEC-3 [LOW] — `SnapshotStore._hash_index` grows without bound **File:** `decision_service.py:99` No eviction policy. A caller generating many unique snapshots causes unbounded memory growth. ### SEC-4 [LOW] — `tempfile.mktemp` usage in test step file **File:** `features/steps/decision_recording_steps.py:966` `tempfile.mktemp` is deprecated due to a TOCTOU race condition. This is test-only but is a pattern repeated across 17 places in the test suite. --- ## 4. Performance Issues ### PERF-1 [MEDIUM] — `count_decisions` loads all objects just to count them **File:** `decision_service.py:593-602` `count_decisions(plan_id)` calls `list_decisions(plan_id)` which deserializes every decision from the DB, builds a Python list, then takes `len()`. Should delegate to `COUNT(*)` in persisted mode. ### PERF-2 [MEDIUM] — No pagination on any list method **Files:** `decision_service.py` — `list_decisions`, `list_by_type`, `get_tree`, `get_superseded`, `get_snapshots_for_plan` None of these support `limit`/`offset`. For plans with hundreds or thousands of decisions, this returns unbounded result sets. ### PERF-3 [LOW] — `get_tree` and `get_superseded` load full decision set **File:** `decision_service.py:395, 476` Both call `list_decisions()` to load the entire plan's decisions into memory, then perform in-memory filtering/traversal. ### PERF-4 [LOW] — `SnapshotStore._hash_index` uses `list` instead of `set` **File:** `decision_service.py:99, 145` `list.remove(decision_id)` is O(n) for each removal. Using a `set` would be O(1). --- ## 5. Test Coverage and Quality Issues ### TEST-1 [HIGH] — BFS order not actually verified in tree tests **Files:** `features/decision_recording.feature:103`, `robot/helper_decision_recording.py:99` The "Get tree returns BFS order from root" scenario asserts count and root-first, but never verifies actual BFS ordering. The level-order property (all depth-1 children before depth-2 grandchildren) is not checked. A broken BFS that returns DFS order would pass. ### TEST-2 [HIGH] — No test for sequence rehydration after restart **Files:** All test files No test creates decisions, destroys the service, recreates it against the same DB, and verifies that sequence numbers continue from where they left off. This is exactly the scenario where BUG-1 manifests. ### TEST-3 [HIGH] — `MagicMock()` without `spec=` hides bugs **Files:** `decision_di_wiring_steps.py:111`, `helper_decision_di.py:87`, `decision_di_bench.py:153` All use `MagicMock()` for `settings` without `spec=Settings`. This silently creates auto-attributes on access, meaning code that reads `settings.nonexistent_property` gets a silent mock object instead of failing. Should use `create_autospec(Settings)`. ### TEST-4 [MEDIUM] — `UnitOfWork.__new__()` anti-pattern in 4 places **Files:** `decision_di_wiring_steps.py:82`, `helper_decision_di.py:37`, `decision_di_bench.py:67`, `decision_recording_bench.py` All bypass `UnitOfWork.__init__()` to directly set private attributes. If `__init__` changes, all 4 break silently. Should be a shared fixture factory. ### TEST-5 [MEDIUM] — DI wiring assertions are too weak (`>= 1`) **Files:** `decision_di_wiring_steps.py:145, 194` `len(strategy_decisions) >= 1` passes even if 100 spurious decisions are recorded. Should assert the exact expected count. ### TEST-6 [MEDIUM] — No test for invalid `decision_type` string No scenario passes an invalid string like `"garbage"` to `record_decision(decision_type=...)` and asserts a `ValueError` is raised. ### TEST-7 [MEDIUM] — `confidence_score` boundary values never tested Tests only use `0.85`. Never exercises `0.0`, `1.0`, negative, or `>1.0`. The domain model has `ge=0.0, le=1.0` validators — untested at boundaries. ### TEST-8 [LOW] — `get_decision` return not validated beyond `decision_id` **File:** `decision_recording_steps.py:652` `step_match_recorded` only compares `decision_id`, not `question`, `chosen_option`, `decision_type`, or any other field. A buggy `get_decision` returning a corrupted object would pass. ### TEST-9 [LOW] — Robot tests are strict subset of Behave tests **File:** `robot/decision_recording.robot` 7 Robot test cases duplicate existing Behave scenarios with no unique coverage. No error-path, persisted-mode, or negative tests in Robot. ### TEST-10 [LOW] — No benchmarks for persisted (DB-backed) mode **Files:** `benchmarks/decision_recording_bench.py`, `benchmarks/decision_di_bench.py` All benchmarks use in-memory `DecisionService()`. The DB-backed path (with SQLAlchemy overhead, transactions, and serialization) has fundamentally different performance characteristics and is not benchmarked. --- ## 6. Specification Conformance Issues ### SPEC-1 [MEDIUM] — Missing uniqueness enforcement per spec schema The spec (line 18470) defines `sequence_number INTEGER NOT NULL` in the decisions table. The existence of `SequenceConflictError` in the codebase implies uniqueness enforcement was planned. It is not implemented, and BUG-1 makes collisions possible. ### SPEC-2 [LOW] — `__init__.py` missing `SequenceConflictError` export **File:** `src/cleveragents/application/services/__init__.py` `SequenceConflictError` is in `decision_service.__all__` but not re-exported from the services package `__init__.py`. Also, `PlanLifecycleService` and its exceptions are absent from `__init__.py` despite being a peer service. ### SPEC-3 [LOW] — Doc claim that retrieval methods raise `DecisionNotFoundError` **File:** `docs/reference/decision_service.md:75` States "All raise `DecisionNotFoundError` when the target decision does not exist (where applicable)." However, `list_decisions` and `list_by_type` return empty lists, not exceptions. ### SPEC-4 [LOW] — `artifacts_produced` typed as `list[Any]` vs spec's `list[ArtifactRef]` **File:** `decision_service.py:230` The `record_decision` method accepts `list[Any]` for `artifacts_produced`, but the spec (line 18442) and domain model both define it as `list[ArtifactRef]`. The loose typing defeats compile-time checks. --- ## 7. Last Commit Review (`cd641a8`) The final commit ("fix(service): align test and doc refs with DecisionService API") is a clean-up commit that fixes stale references to old method names (`get_decisions_for_plan` -> `list_decisions`, `get_decision_tree` -> `get_tree`) across 4 files. The diff is correct and no stale references remain. One note: the commit message contains `ISSUES CLOSED: #172`, which will auto-close the issue on merge, but the issue's acceptance criteria (coverage >= 97%, full `nox` pass) are not confirmed in the commit chain. --- ## Summary — Priority Ranking | # | Severity | Category | Finding | |---|---|---|---| | BUG-1 | **HIGH** | Bug | `_plan_sequence` not rehydrated from DB — sequence collisions after restart | | BUG-2 | **HIGH** | Bug | `SequenceConflictError` dead code — no uniqueness guard exists | | TEST-1 | **HIGH** | Test Flaw | BFS order not actually verified in tree tests | | TEST-2 | **HIGH** | Test Gap | No test for sequence rehydration after service restart | | TEST-3 | **HIGH** | Test Flaw | `MagicMock()` without `spec=` silently hides bugs in 3 files | | BUG-3 | **MEDIUM** | Bug | `delete_decision` behavior diverges between persisted/in-memory modes | | BUG-4 | **MEDIUM** | Bug | `mark_superseded` doesn't validate replacement decision exists | | BUG-5 | **MEDIUM** | Bug | `get_tree` silently drops orphaned decisions | | SEC-1 | **MEDIUM** | Security | No size limit on `actor_reasoning` (log injection / storage exhaustion) | | PERF-1 | **MEDIUM** | Performance | `count_decisions` loads all objects just to count | | PERF-2 | **MEDIUM** | Performance | No pagination on any list method | | TEST-4 | **MEDIUM** | Test Flaw | `UnitOfWork.__new__()` anti-pattern in 4 places | | TEST-5 | **MEDIUM** | Test Flaw | DI assertions use `>= 1` instead of exact counts | | TEST-6 | **MEDIUM** | Test Gap | Invalid `decision_type` string never tested | | TEST-7 | **MEDIUM** | Test Gap | `confidence_score` boundaries never tested | | SPEC-1 | **MEDIUM** | Spec Gap | Sequence uniqueness enforcement missing per spec schema | | BUG-6 | LOW | Bug | `SnapshotStore` lost on restart (not persisted) | | BUG-7 | LOW | Bug | Dead code `_decision_seq` / `_next_seq` in `PlanLifecycleService` | | SEC-2-4 | LOW | Security | Truncated hash, unbounded hash index, `tempfile.mktemp` | | SPEC-2-4 | LOW | Spec | Missing exports, doc inaccuracies, loose typing | | TEST-8-10 | LOW | Test | Weak assertions, Robot subset, no DB-mode benchmarks | --- **Recommendation:** The HIGH-severity items (BUG-1, BUG-2, TEST-1, TEST-2, TEST-3) should be addressed before merge. BUG-1 in particular is a data integrity risk in any deployment that restarts the service, and the lack of a test for it (TEST-2) means it would go undetected until production.
hamza.khyari force-pushed feature/m4-decision-service from cd641a8931
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 17s
CI / quality (pull_request) Successful in 18s
CI / build (pull_request) Successful in 28s
CI / security (pull_request) Successful in 37s
CI / typecheck (pull_request) Successful in 59s
CI / integration_tests (pull_request) Successful in 3m43s
CI / unit_tests (pull_request) Successful in 10m18s
CI / docker (pull_request) Successful in 1m0s
CI / benchmark-regression (pull_request) Successful in 21m22s
CI / coverage (pull_request) Successful in 1h22m59s
to 2a9606cb72
All checks were successful
CI / security (pull_request) Successful in 36s
CI / typecheck (pull_request) Successful in 37s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 20s
CI / lint (pull_request) Successful in 1m20s
CI / quality (pull_request) Successful in 1m22s
CI / unit_tests (pull_request) Successful in 2m37s
CI / integration_tests (pull_request) Successful in 2m58s
CI / docker (pull_request) Successful in 40s
CI / coverage (pull_request) Successful in 4m19s
CI / benchmark-regression (pull_request) Successful in 24m9s
2026-03-03 03:17:39 +00:00
Compare
aditya approved these changes 2026-03-03 08:41:32 +00:00
Dismissed
hamza.khyari force-pushed feature/m4-decision-service from 2a9606cb72
All checks were successful
CI / security (pull_request) Successful in 36s
CI / typecheck (pull_request) Successful in 37s
CI / benchmark-publish (pull_request) Has been skipped
CI / build (pull_request) Successful in 20s
CI / lint (pull_request) Successful in 1m20s
CI / quality (pull_request) Successful in 1m22s
CI / unit_tests (pull_request) Successful in 2m37s
CI / integration_tests (pull_request) Successful in 2m58s
CI / docker (pull_request) Successful in 40s
CI / coverage (pull_request) Successful in 4m19s
CI / benchmark-regression (pull_request) Successful in 24m9s
to c2f8bc0eaf
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 18s
CI / security (pull_request) Successful in 32s
CI / typecheck (pull_request) Failing after 34s
CI / coverage (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Has been skipped
CI / unit_tests (pull_request) Successful in 2m16s
CI / docker (pull_request) Has been skipped
CI / integration_tests (pull_request) Successful in 2m51s
2026-03-03 12:20:26 +00:00
Compare
hamza.khyari dismissed aditya's review 2026-03-03 12:20:26 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

hamza.khyari scheduled this pull request to auto merge when all checks succeed 2026-03-03 12:21:15 +00:00
hamza.khyari force-pushed feature/m4-decision-service from c2f8bc0eaf
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 15s
CI / build (pull_request) Successful in 16s
CI / quality (pull_request) Successful in 18s
CI / security (pull_request) Successful in 32s
CI / typecheck (pull_request) Failing after 34s
CI / coverage (pull_request) Has been skipped
CI / benchmark-regression (pull_request) Has been skipped
CI / unit_tests (pull_request) Successful in 2m16s
CI / docker (pull_request) Has been skipped
CI / integration_tests (pull_request) Successful in 2m51s
to 33a9d1e7d8
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 14s
CI / build (pull_request) Successful in 14s
CI / quality (pull_request) Successful in 27s
CI / security (pull_request) Successful in 29s
CI / typecheck (pull_request) Successful in 32s
CI / docker (pull_request) Has been skipped
CI / integration_tests (pull_request) Successful in 4m0s
CI / coverage (pull_request) Successful in 4m6s
CI / unit_tests (pull_request) Failing after 2m8s
CI / benchmark-regression (pull_request) Successful in 24m16s
2026-03-03 12:25:07 +00:00
Compare
hamza.khyari force-pushed feature/m4-decision-service from 33a9d1e7d8
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 14s
CI / build (pull_request) Successful in 14s
CI / quality (pull_request) Successful in 27s
CI / security (pull_request) Successful in 29s
CI / typecheck (pull_request) Successful in 32s
CI / docker (pull_request) Has been skipped
CI / integration_tests (pull_request) Successful in 4m0s
CI / coverage (pull_request) Successful in 4m6s
CI / unit_tests (pull_request) Failing after 2m8s
CI / benchmark-regression (pull_request) Successful in 24m16s
to 0e36755db9
All checks were successful
CI / benchmark-publish (pull_request) Has been skipped
CI / lint (pull_request) Successful in 14s
CI / build (pull_request) Successful in 14s
CI / quality (pull_request) Successful in 16s
CI / security (pull_request) Successful in 31s
CI / typecheck (pull_request) Successful in 43s
CI / unit_tests (pull_request) Successful in 2m18s
CI / integration_tests (pull_request) Successful in 2m53s
CI / docker (pull_request) Successful in 39s
CI / coverage (pull_request) Successful in 3m58s
CI / benchmark-regression (pull_request) Successful in 24m44s
2026-03-03 12:53:31 +00:00
Compare
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
4 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core!433
No description provided.