UAT: resolve_plan_phase silently swallows database errors — phase-gating can be bypassed when DB is unavailable during decision recording #3186

Open
opened 2026-04-05 07:33:01 +00:00 by freemo · 2 comments
Owner

Metadata

  • Branch: fix/resolve-plan-phase-silent-db-error
  • Commit Message: fix(phase-gating): raise PhaseResolutionError instead of silently swallowing DB errors in resolve_plan_phase
  • Milestone: (none — backlog)
  • Parent Epic: #357

Backlog note: This issue was discovered during autonomous operation
on milestone v3.2.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Bug Report

Feature Area: Decision Recording During Strategize Phase (v3.2.0)

What was tested

Code analysis of src/cleveragents/application/services/phase_gating.py, specifically the resolve_plan_phase function.

Expected behavior (from spec)

Per CONTRIBUTING.md error handling requirements:

"Errors must not be suppressed. Exceptions should propagate to the top-level execution handler. Exceptions should only be caught when they can be meaningfully handled (e.g., for retry logic or resource cleanup), not just to be logged and re-raised."

Per docs/specification.md v3.2.0 acceptance criteria:

"Decision recording does not fail silently on errors"

When the database is unavailable during decision recording, the system should either:

  1. Propagate the database error to the caller (fail-fast), OR
  2. Use a cached/known phase value to enforce phase-gating

Actual behavior (from code analysis)

In src/cleveragents/application/services/phase_gating.py, resolve_plan_phase catches database errors and silently returns None, causing phase-gating to be skipped entirely:

# Lines ~65-80 in phase_gating.py
try:
    with unit_of_work.transaction() as ctx:
        plan = ctx.lifecycle_plans.get(plan_id)
        if plan is not None:
            return PlanPhase(plan.phase) if isinstance(plan.phase, str) else plan.phase
except (DatabaseError, OperationalError, OSError):
    # Infrastructure errors must not break the opt-in phase-gating contract.
    # If the DB is temporarily unavailable we fall through and skip gating
    # rather than propagating an infrastructure error to the caller.
    logger.warning("decision.phase_resolve_db_error", plan_id=plan_id, exc_info=True)

return None  # Phase-gating is skipped

When None is returned, validate_phase_gating is never called, and the decision is recorded without phase validation. This means:

  • During a DB outage, an implementation_choice decision (Execute-only type) could be recorded during the Strategize phase without raising DecisionPhaseViolationError
  • The spec's requirement that "Decision recording does not fail silently on errors" is violated — the DB error IS silently swallowed

Code location

  • src/cleveragents/application/services/phase_gating.pyresolve_plan_phase function, exception handler block

Notes

The code comment explains the design intent: "Infrastructure errors must not break the opt-in phase-gating contract." This is a deliberate trade-off between availability and correctness. However, it contradicts the spec's error handling requirements.

A possible fix: raise a specific PhaseResolutionError (a subclass of DatabaseError) that callers can catch if they want to handle DB unavailability gracefully, rather than silently skipping phase-gating.


Subtasks

  • Decide whether to propagate the DB error or introduce a PhaseResolutionError exception type
  • Update resolve_plan_phase to not silently swallow DB errors
  • Update record_decision callers to handle PhaseResolutionError if needed
  • Add BDD scenario: recording a decision when DB is unavailable raises an appropriate error (not silently succeeds)
  • Run nox (all default sessions), fix any errors
  • Verify coverage >= 97% via nox -s coverage_report

Definition of Done

  • DB errors during phase resolution are not silently swallowed
  • Either the error propagates to the caller OR a specific exception type is raised that callers can handle
  • BDD scenario covering DB-unavailable phase resolution behavior passes
  • All nox stages pass
  • Coverage >= 97%

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

## Metadata - **Branch**: `fix/resolve-plan-phase-silent-db-error` - **Commit Message**: `fix(phase-gating): raise PhaseResolutionError instead of silently swallowing DB errors in resolve_plan_phase` - **Milestone**: *(none — backlog)* - **Parent Epic**: #357 > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.2.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- ## Bug Report **Feature Area**: Decision Recording During Strategize Phase (v3.2.0) ### What was tested Code analysis of `src/cleveragents/application/services/phase_gating.py`, specifically the `resolve_plan_phase` function. ### Expected behavior (from spec) Per `CONTRIBUTING.md` error handling requirements: > "Errors must not be suppressed. Exceptions should propagate to the top-level execution handler. Exceptions should only be caught when they can be meaningfully handled (e.g., for retry logic or resource cleanup), not just to be logged and re-raised." Per `docs/specification.md` v3.2.0 acceptance criteria: > "Decision recording does not fail silently on errors" When the database is unavailable during decision recording, the system should either: 1. Propagate the database error to the caller (fail-fast), OR 2. Use a cached/known phase value to enforce phase-gating ### Actual behavior (from code analysis) In `src/cleveragents/application/services/phase_gating.py`, `resolve_plan_phase` catches database errors and silently returns `None`, causing phase-gating to be skipped entirely: ```python # Lines ~65-80 in phase_gating.py try: with unit_of_work.transaction() as ctx: plan = ctx.lifecycle_plans.get(plan_id) if plan is not None: return PlanPhase(plan.phase) if isinstance(plan.phase, str) else plan.phase except (DatabaseError, OperationalError, OSError): # Infrastructure errors must not break the opt-in phase-gating contract. # If the DB is temporarily unavailable we fall through and skip gating # rather than propagating an infrastructure error to the caller. logger.warning("decision.phase_resolve_db_error", plan_id=plan_id, exc_info=True) return None # Phase-gating is skipped ``` When `None` is returned, `validate_phase_gating` is never called, and the decision is recorded without phase validation. This means: - During a DB outage, an `implementation_choice` decision (Execute-only type) could be recorded during the Strategize phase without raising `DecisionPhaseViolationError` - The spec's requirement that "Decision recording does not fail silently on errors" is violated — the DB error IS silently swallowed ### Code location - `src/cleveragents/application/services/phase_gating.py` — `resolve_plan_phase` function, exception handler block ### Notes The code comment explains the design intent: "Infrastructure errors must not break the opt-in phase-gating contract." This is a deliberate trade-off between availability and correctness. However, it contradicts the spec's error handling requirements. A possible fix: raise a specific `PhaseResolutionError` (a subclass of `DatabaseError`) that callers can catch if they want to handle DB unavailability gracefully, rather than silently skipping phase-gating. --- ## Subtasks - [ ] Decide whether to propagate the DB error or introduce a `PhaseResolutionError` exception type - [ ] Update `resolve_plan_phase` to not silently swallow DB errors - [ ] Update `record_decision` callers to handle `PhaseResolutionError` if needed - [ ] Add BDD scenario: recording a decision when DB is unavailable raises an appropriate error (not silently succeeds) - [ ] Run `nox` (all default sessions), fix any errors - [ ] Verify coverage >= 97% via `nox -s coverage_report` ## Definition of Done - [ ] DB errors during phase resolution are not silently swallowed - [ ] Either the error propagates to the caller OR a specific exception type is raised that callers can handle - [ ] BDD scenario covering DB-unavailable phase resolution behavior passes - [ ] All nox stages pass - [ ] Coverage >= 97% --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-new-issue-creator
freemo added this to the v3.2.0 milestone 2026-04-05 07:49:43 +00:00
Author
Owner

Transition request: State/Unverified -> State/Verified

Current labels: Priority/Backlog, Type/Bug, State/Verified
Target state: State/Verified

Status: Issue already in the target state. No changes applied. No state labels removed or added, and no precondition checks needed.

If you want me to enforce a different transition (e.g., moving from Unverified to Verified by force, or to any other state), please specify and I will apply the necessary label updates and state transitions.


Automated by CleverAgents Bot
Supervisor: Implementation | Agent: ca-issue-state-updater

Transition request: State/Unverified -> State/Verified Current labels: Priority/Backlog, Type/Bug, State/Verified Target state: State/Verified Status: Issue already in the target state. No changes applied. No state labels removed or added, and no precondition checks needed. If you want me to enforce a different transition (e.g., moving from Unverified to Verified by force, or to any other state), please specify and I will apply the necessary label updates and state transitions. --- **Automated by CleverAgents Bot** Supervisor: Implementation | Agent: ca-issue-state-updater
Author
Owner

Issue triaged by project owner:

  • State: Verified — well-documented bug with clear code location, spec reference, and reproduction steps
  • Priority: High — silent error suppression in phase-gating violates both the spec's error handling requirements and the CONTRIBUTING.md fail-fast principles. This can lead to incorrect decision recording during DB outages.
  • Milestone: v3.2.0 — this is a decision recording bug in the Strategize phase, which is core v3.2.0 scope (Decisions + Validations + Invariants)
  • MoSCoW: Should Have — the spec says "Decision recording does not fail silently on errors" (v3.2.0 acceptance criteria), making this a spec compliance issue. Not blocking release but important for correctness.
  • Parent Epic: #357 (referenced in issue body)

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Issue triaged by project owner: - **State**: Verified — well-documented bug with clear code location, spec reference, and reproduction steps - **Priority**: High — silent error suppression in phase-gating violates both the spec's error handling requirements and the CONTRIBUTING.md fail-fast principles. This can lead to incorrect decision recording during DB outages. - **Milestone**: v3.2.0 — this is a decision recording bug in the Strategize phase, which is core v3.2.0 scope (Decisions + Validations + Invariants) - **MoSCoW**: Should Have — the spec says "Decision recording does not fail silently on errors" (v3.2.0 acceptance criteria), making this a spec compliance issue. Not blocking release but important for correctness. - **Parent Epic**: #357 (referenced in issue body) --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: ca-project-owner
freemo removed this from the v3.2.0 milestone 2026-04-06 20:51:14 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
Reference
cleveragents/cleveragents-core#3186
No description provided.