UAT: CheckpointService.selective_rollback always re-raises original exception — should raise BusinessRuleViolation when recovery also fails #2466

Open
opened 2026-04-03 18:32:43 +00:00 by freemo · 1 comment
Owner

Metadata

  • Branch: fix/checkpoint-selective-rollback-business-rule-violation
  • Commit Message: fix(checkpoint): raise BusinessRuleViolation when selective_rollback recovery also fails
  • Milestone: v3.3.0
  • Parent Epic: #362

Background and Context

The spec (docs/specification.md line 19340) and the method's own docstring both state that CheckpointService.selective_rollback "Raises BusinessRuleViolation if recovery also fails." The intent is to give callers a reliable signal distinguishing two failure modes:

  1. Rollback failed, but sandbox is intact — recovery succeeded, so the sandbox is back at its pre-rollback HEAD. The original rollback exception is re-raised.
  2. Rollback failed AND sandbox is in an unknown state — recovery also failed, so the sandbox may be partially modified. BusinessRuleViolation is raised to signal this more severe condition.

The current implementation collapses both cases into a single bare raise, making it impossible for callers to distinguish them.

Current Behavior

In src/cleveragents/application/services/checkpoint_service.py lines 516–539, the selective_rollback method catches the rollback exception, attempts a best-effort recovery (git reset --hard + git clean), but then unconditionally re-raises the original exception regardless of whether recovery succeeded or failed:

try:
    result = self.rollback_to_checkpoint(plan_id, checkpoint_id)
    return result
except Exception:
    # Atomic rollback: attempt recovery to original HEAD
    if current_head is not None:
        try:
            self._git_reset_hard(sandbox_path, current_head)
            self._git_clean(sandbox_path)
            # Recovery succeeded — but we still fall through to `raise`
        except Exception:
            logger.error("checkpoint.selective_rollback_recovery_failed", ...)
            # Recovery failed — but we still do `raise` (original exception)
    raise  # Always re-raises original exception, regardless of recovery outcome

When recovery fails, the code should raise BusinessRuleViolation (as documented), but instead re-raises the original exception (which could be any exception type). Callers cannot distinguish "rollback failed but sandbox is intact" from "rollback failed AND sandbox is in an unknown state."

Expected Behavior

Per docs/specification.md line 19340 and the method's own docstring:

  • If rollback fails AND recovery also fails → raise BusinessRuleViolation
  • If rollback fails AND recovery succeeds → re-raise the original exception (rollback still failed, but sandbox is intact)

The fix should be:

except Exception as original_exc:
    recovery_failed = False
    if current_head is not None:
        try:
            self._git_reset_hard(sandbox_path, current_head)
            self._git_clean(sandbox_path)
        except Exception:
            recovery_failed = True
            logger.error("checkpoint.selective_rollback_recovery_failed", ...)

    if recovery_failed:
        raise BusinessRuleViolation(
            f"Rollback failed and recovery also failed for plan {plan_id}: {original_exc}"
        ) from original_exc
    raise  # Recovery succeeded; re-raise original rollback failure

Acceptance Criteria

  • When rollback_to_checkpoint raises and the subsequent _git_reset_hard / _git_clean recovery also raises, selective_rollback raises BusinessRuleViolation (not the original exception).
  • When rollback_to_checkpoint raises but recovery succeeds, selective_rollback re-raises the original exception (existing behaviour preserved).
  • The BusinessRuleViolation message includes the plan ID and the original exception for traceability.
  • The BusinessRuleViolation is chained from the original exception (raise ... from original_exc).
  • Unit tests cover both failure branches (recovery succeeds vs. recovery fails).

Steps to Reproduce

  1. Set up a scenario where rollback_to_checkpoint fails (e.g., pass an invalid git ref or a non-existent checkpoint ID that bypasses ResourceNotFoundError).
  2. Also make the recovery _git_reset_hard fail (e.g., mock it to raise).
  3. Call selective_rollback(plan_id, checkpoint_id).
  4. Observe that the raised exception is the original exception rather than BusinessRuleViolation.

Supporting Information

  • Code location: src/cleveragents/application/services/checkpoint_service.py lines 516–539 (selective_rollback)
  • Spec reference: docs/specification.md line 19340
  • Related bugs: #2461 (checkpoint metadata not stored), #2455 (rollback CLI output missing fields)

Subtasks

  • Locate the selective_rollback exception handler in checkpoint_service.py (lines 516–539)
  • Refactor the except clause to capture original_exc and track recovery_failed
  • Raise BusinessRuleViolation (chained from original_exc) when recovery_failed is True
  • Re-raise original exception when recovery succeeds (existing behaviour)
  • Tests (pytest/Behave): Add unit test — recovery fails → BusinessRuleViolation raised
  • Tests (pytest/Behave): Add unit test — recovery succeeds → original exception re-raised
  • Verify coverage >= 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass
  • Coverage >= 97%

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-uat-tester

## Metadata - **Branch**: `fix/checkpoint-selective-rollback-business-rule-violation` - **Commit Message**: `fix(checkpoint): raise BusinessRuleViolation when selective_rollback recovery also fails` - **Milestone**: v3.3.0 - **Parent Epic**: #362 ## Background and Context The spec (`docs/specification.md` line 19340) and the method's own docstring both state that `CheckpointService.selective_rollback` "Raises `BusinessRuleViolation` if recovery also fails." The intent is to give callers a reliable signal distinguishing two failure modes: 1. **Rollback failed, but sandbox is intact** — recovery succeeded, so the sandbox is back at its pre-rollback HEAD. The original rollback exception is re-raised. 2. **Rollback failed AND sandbox is in an unknown state** — recovery also failed, so the sandbox may be partially modified. `BusinessRuleViolation` is raised to signal this more severe condition. The current implementation collapses both cases into a single bare `raise`, making it impossible for callers to distinguish them. ## Current Behavior In `src/cleveragents/application/services/checkpoint_service.py` lines 516–539, the `selective_rollback` method catches the rollback exception, attempts a best-effort recovery (`git reset --hard` + `git clean`), but then unconditionally re-raises the original exception regardless of whether recovery succeeded or failed: ```python try: result = self.rollback_to_checkpoint(plan_id, checkpoint_id) return result except Exception: # Atomic rollback: attempt recovery to original HEAD if current_head is not None: try: self._git_reset_hard(sandbox_path, current_head) self._git_clean(sandbox_path) # Recovery succeeded — but we still fall through to `raise` except Exception: logger.error("checkpoint.selective_rollback_recovery_failed", ...) # Recovery failed — but we still do `raise` (original exception) raise # Always re-raises original exception, regardless of recovery outcome ``` When recovery fails, the code should raise `BusinessRuleViolation` (as documented), but instead re-raises the original exception (which could be any exception type). Callers cannot distinguish "rollback failed but sandbox is intact" from "rollback failed AND sandbox is in an unknown state." ## Expected Behavior Per `docs/specification.md` line 19340 and the method's own docstring: - If rollback fails **AND** recovery also fails → raise `BusinessRuleViolation` - If rollback fails **AND** recovery succeeds → re-raise the original exception (rollback still failed, but sandbox is intact) The fix should be: ```python except Exception as original_exc: recovery_failed = False if current_head is not None: try: self._git_reset_hard(sandbox_path, current_head) self._git_clean(sandbox_path) except Exception: recovery_failed = True logger.error("checkpoint.selective_rollback_recovery_failed", ...) if recovery_failed: raise BusinessRuleViolation( f"Rollback failed and recovery also failed for plan {plan_id}: {original_exc}" ) from original_exc raise # Recovery succeeded; re-raise original rollback failure ``` ## Acceptance Criteria - [ ] When `rollback_to_checkpoint` raises and the subsequent `_git_reset_hard` / `_git_clean` recovery also raises, `selective_rollback` raises `BusinessRuleViolation` (not the original exception). - [ ] When `rollback_to_checkpoint` raises but recovery succeeds, `selective_rollback` re-raises the original exception (existing behaviour preserved). - [ ] The `BusinessRuleViolation` message includes the plan ID and the original exception for traceability. - [ ] The `BusinessRuleViolation` is chained from the original exception (`raise ... from original_exc`). - [ ] Unit tests cover both failure branches (recovery succeeds vs. recovery fails). ## Steps to Reproduce 1. Set up a scenario where `rollback_to_checkpoint` fails (e.g., pass an invalid git ref or a non-existent checkpoint ID that bypasses `ResourceNotFoundError`). 2. Also make the recovery `_git_reset_hard` fail (e.g., mock it to raise). 3. Call `selective_rollback(plan_id, checkpoint_id)`. 4. Observe that the raised exception is the original exception rather than `BusinessRuleViolation`. ## Supporting Information - **Code location**: `src/cleveragents/application/services/checkpoint_service.py` lines 516–539 (`selective_rollback`) - **Spec reference**: `docs/specification.md` line 19340 - **Related bugs**: #2461 (checkpoint metadata not stored), #2455 (rollback CLI output missing fields) ## Subtasks - [ ] Locate the `selective_rollback` exception handler in `checkpoint_service.py` (lines 516–539) - [ ] Refactor the `except` clause to capture `original_exc` and track `recovery_failed` - [ ] Raise `BusinessRuleViolation` (chained from `original_exc`) when `recovery_failed` is `True` - [ ] Re-raise original exception when recovery succeeds (existing behaviour) - [ ] Tests (pytest/Behave): Add unit test — recovery fails → `BusinessRuleViolation` raised - [ ] Tests (pytest/Behave): Add unit test — recovery succeeds → original exception re-raised - [ ] Verify coverage >= 97% via `nox -s coverage_report` - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done This issue is complete when: - [ ] All subtasks above are completed and checked off. - [ ] A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - [ ] The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - [ ] The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. - [ ] All nox stages pass - [ ] Coverage >= 97% --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-uat-tester
freemo added this to the v3.3.0 milestone 2026-04-03 18:32:49 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • MoSCoW: Should Have — Important spec requirement or quality improvement. Should be included in the milestone if possible.

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Issue triaged by project owner: - **State**: Verified - **MoSCoW**: Should Have — Important spec requirement or quality improvement. Should be included in the milestone if possible. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: ca-project-owner
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Blocks
#362 Epic: Security & Safety Hardening
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#2466
No description provided.