BUG-HUNT: [error-handling] Silent failures in PlanExecutor checkpoint operations hide critical errors #7169

Open
opened 2026-04-10 08:27:06 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Branch: fix/plan-executor-checkpoint-silent-failures
  • Commit Message: fix(error-handling): surface critical checkpoint failures in PlanExecutor instead of silently swallowing them
  • Milestone: (none — backlog, see note below)
  • Parent Epic: #7158

Background and Context

PlanExecutor in src/cleveragents/application/services/plan_executor.py contains two
guardrail-adjacent helper methods — _try_create_checkpoint() and
_try_rollback_to_last_checkpoint() — that are called at every major phase boundary during
plan execution (pre-execute, post-execute, error recovery). Both methods use a broad
except Exception: clause and log failures exclusively at DEBUG level, meaning that
any exception — including PermissionError, OSError, filesystem corruption, or
unexpected RuntimeError — is silently swallowed and never surfaced to operators or
administrators.

This violates the project's Fail-Fast / No Silent Failures principle (CONTRIBUTING.md §
Error and Exception Handling): "Avoid returning null or default values when an error
condition exists — raise exceptions or return explicit error types"
and "Never catch
exceptions just to log them — let them bubble up for centralized handling."

Current Behavior

Both methods catch all exceptions and log only at DEBUG:

# _try_create_checkpoint (~line 615)
except Exception:
    self._logger.debug(
        "Checkpoint creation failed (non-fatal)",
        plan_id=plan_id,
        phase=phase,
        exc_info=True,
    )
    return None

# _try_rollback_to_last_checkpoint (~line 650)
except Exception:
    self._logger.debug(
        "Checkpoint rollback failed (non-fatal)",
        plan_id=plan_id,
        checkpoint_id=last_cp.checkpoint_id,
        exc_info=True,
    )
    return False

A PermissionError writing a checkpoint to a read-only filesystem, or an OSError during
rollback due to disk exhaustion, is indistinguishable from a normal "no checkpoint manager
configured" path. Administrators monitoring at WARNING or above will never see these
failures.

Expected Behavior

Critical, non-recoverable errors (e.g. PermissionError, OSError, filesystem corruption)
should be logged at WARNING or ERROR level so they are visible in production log
aggregation. Recoverable / expected failures (e.g. FileNotFoundError when no prior
checkpoint exists) may remain at DEBUG. The distinction should be explicit and documented.

Acceptance Criteria

  • _try_create_checkpoint() distinguishes between expected/recoverable exceptions and
    critical infrastructure failures; critical failures are logged at WARNING or ERROR.
  • _try_rollback_to_last_checkpoint() applies the same distinction.
  • No critical filesystem or permission errors are silently swallowed at DEBUG only.
  • Existing behaviour for "no checkpoint manager configured" (returns None/False
    silently) is preserved — that path is intentional and non-erroneous.
  • BDD scenarios added covering: (a) recoverable failure stays debug, (b) critical failure
    is surfaced at warning/error level.
  • All nox sessions pass; coverage ≥ 97%.

Supporting Information

  • File: src/cleveragents/application/services/plan_executor.py
  • Methods: PlanExecutor._try_create_checkpoint(), PlanExecutor._try_rollback_to_last_checkpoint()
  • Approximate lines: 579–657
  • Relevant CONTRIBUTING.md sections: § Error and Exception Handling → Exception Propagation, Fail-Fast Principles
  • Related issues: #7082 (SQLite Connection Leak in Checkpoint Management — related checkpoint subsystem)

Subtasks

  • Audit _try_create_checkpoint() and _try_rollback_to_last_checkpoint() for all
    exception types that can be raised by CheckpointManager
  • Classify exceptions into recoverable (expected, DEBUG) vs. critical (unexpected,
    WARNING/ERROR)
  • Refactor both methods to log critical exceptions at the appropriate level
  • Tests (Behave): Add BDD scenarios for recoverable vs. critical checkpoint failure paths
  • Tests (Robot): Add integration test verifying log level emitted on permission/OS errors
  • Verify coverage ≥ 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit
    Message in Metadata exactly, followed by a blank line, then additional lines providing
    relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata
    exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before
    this issue is marked done.
  • All nox stages pass.
  • Coverage ≥ 97%.

Backlog note: This issue was discovered during autonomous operation
on milestone v3.2.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: Acting on behalf of: Bug Hunt | Agent: new-issue-creator

## Metadata - **Branch**: `fix/plan-executor-checkpoint-silent-failures` - **Commit Message**: `fix(error-handling): surface critical checkpoint failures in PlanExecutor instead of silently swallowing them` - **Milestone**: *(none — backlog, see note below)* - **Parent Epic**: #7158 ## Background and Context `PlanExecutor` in `src/cleveragents/application/services/plan_executor.py` contains two guardrail-adjacent helper methods — `_try_create_checkpoint()` and `_try_rollback_to_last_checkpoint()` — that are called at every major phase boundary during plan execution (pre-execute, post-execute, error recovery). Both methods use a broad `except Exception:` clause and log failures exclusively at `DEBUG` level, meaning that **any** exception — including `PermissionError`, `OSError`, filesystem corruption, or unexpected `RuntimeError` — is silently swallowed and never surfaced to operators or administrators. This violates the project's **Fail-Fast / No Silent Failures** principle (CONTRIBUTING.md § Error and Exception Handling): *"Avoid returning null or default values when an error condition exists — raise exceptions or return explicit error types"* and *"Never catch exceptions just to log them — let them bubble up for centralized handling."* ## Current Behavior Both methods catch all exceptions and log only at `DEBUG`: ```python # _try_create_checkpoint (~line 615) except Exception: self._logger.debug( "Checkpoint creation failed (non-fatal)", plan_id=plan_id, phase=phase, exc_info=True, ) return None # _try_rollback_to_last_checkpoint (~line 650) except Exception: self._logger.debug( "Checkpoint rollback failed (non-fatal)", plan_id=plan_id, checkpoint_id=last_cp.checkpoint_id, exc_info=True, ) return False ``` A `PermissionError` writing a checkpoint to a read-only filesystem, or an `OSError` during rollback due to disk exhaustion, is indistinguishable from a normal "no checkpoint manager configured" path. Administrators monitoring at `WARNING` or above will never see these failures. ## Expected Behavior Critical, non-recoverable errors (e.g. `PermissionError`, `OSError`, filesystem corruption) should be logged at `WARNING` or `ERROR` level so they are visible in production log aggregation. Recoverable / expected failures (e.g. `FileNotFoundError` when no prior checkpoint exists) may remain at `DEBUG`. The distinction should be explicit and documented. ## Acceptance Criteria - [ ] `_try_create_checkpoint()` distinguishes between expected/recoverable exceptions and critical infrastructure failures; critical failures are logged at `WARNING` or `ERROR`. - [ ] `_try_rollback_to_last_checkpoint()` applies the same distinction. - [ ] No critical filesystem or permission errors are silently swallowed at `DEBUG` only. - [ ] Existing behaviour for "no checkpoint manager configured" (returns `None`/`False` silently) is preserved — that path is intentional and non-erroneous. - [ ] BDD scenarios added covering: (a) recoverable failure stays debug, (b) critical failure is surfaced at warning/error level. - [ ] All nox sessions pass; coverage ≥ 97%. ## Supporting Information - **File**: `src/cleveragents/application/services/plan_executor.py` - **Methods**: `PlanExecutor._try_create_checkpoint()`, `PlanExecutor._try_rollback_to_last_checkpoint()` - **Approximate lines**: 579–657 - **Relevant CONTRIBUTING.md sections**: § Error and Exception Handling → Exception Propagation, Fail-Fast Principles - **Related issues**: #7082 (SQLite Connection Leak in Checkpoint Management — related checkpoint subsystem) ## Subtasks - [ ] Audit `_try_create_checkpoint()` and `_try_rollback_to_last_checkpoint()` for all exception types that can be raised by `CheckpointManager` - [ ] Classify exceptions into recoverable (expected, `DEBUG`) vs. critical (unexpected, `WARNING`/`ERROR`) - [ ] Refactor both methods to log critical exceptions at the appropriate level - [ ] Tests (Behave): Add BDD scenarios for recoverable vs. critical checkpoint failure paths - [ ] Tests (Robot): Add integration test verifying log level emitted on permission/OS errors - [ ] Verify coverage ≥ 97% via `nox -s coverage_report` - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done This issue is complete when: - All subtasks above are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. - All nox stages pass. - Coverage ≥ 97%. > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.2.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- **Automated by CleverAgents Bot** Supervisor: Acting on behalf of: Bug Hunt | Agent: new-issue-creator
Author
Owner

Verified — Bug: silent failures in PlanExecutor checkpoint operations. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Bug: silent failures in PlanExecutor checkpoint operations. MoSCoW: Should-have. Priority: Medium. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7169
No description provided.