[AUTO-SPEC-1] Spec Proposal: Document fail_fast CANCELLED behavior in parallel subplan execution #8947

Open
opened 2026-04-14 04:05:44 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Commit: 3cfa248, fix(concurrency): fix SubplanExecutionService._execute_parallel() #7582
  • Branch: docs/spec-update-auto-spec-1-fail-fast-cancelled-behavior

Background and Context

Commit 3cfa248 (fix #7582) improved SubplanExecutionService._execute_parallel() to correctly cancel in-flight futures when fail_fast=true triggers, and to report those cancelled futures with CANCELLED status rather than ERRORED. This is a meaningful behavioral distinction that users need to understand — but the current specification does not document it.

The spec (docs/specification.md) mentions fail_fast as a config option but is silent on:

  1. What happens when fail_fast=true triggers during parallel subplan execution
  2. That in-flight futures are cancelled (not just stopped or left to finish)
  3. That cancelled futures are reported with CANCELLED status (not ERRORED)

This is a spec gap: the implementation is better-specified than the spec. The fix introduced a docstring explicitly stating the CANCELLED vs ERRORED distinction, but this has not been reflected in the user-facing specification.

Expected Behavior

The specification accurately documents fail_fast behavior in parallel subplan execution, including:

  • That fail_fast=true triggers immediate cancellation of all remaining in-flight parallel subplans when one fails
  • That cancelled subplans are reported with CANCELLED processing state (not ERRORED)
  • The semantic distinction between ERRORED (execution failure) and CANCELLED (cancelled due to fail_fast cascade)
  • An update to the failure semantics table noting the fail_fast exception to "others continue"

Acceptance Criteria

  • The failure semantics table in docs/specification.md (around lines 18522–18526) is updated to note that when fail_fast=true, other parallel subplans are cancelled rather than continuing
  • A new paragraph or note is added after the failure semantics table documenting fail_fast behavior: cancellation mechanism, CANCELLED vs ERRORED status distinction, and user-facing implications
  • The fail_fast entry in the SubplanConfig options list (around line 18509) is updated to reference the new fail_fast behavior section
  • The spec change is accurate and consistent with the implementation in src/cleveragents/application/services/subplan_execution_service.py post-fix #7582
  • A PR is created targeting main with the spec-only change
  • The PR is reviewed and approved by a maintainer

Subtasks

  • Locate the exact lines in docs/specification.md for the parallel failure semantics table and SubplanConfig options list
  • Draft the fail_fast behavior paragraph (see the suggested text in the Proposed Spec Change section below)
  • Update the failure semantics table row for parallel execution to note the fail_fast exception
  • Update the fail_fast option description to cross-reference the new behavior note
  • Open a PR with the spec changes for review
  • Address any review feedback
  • Merge once approved

Definition of Done

This issue should be closed when:

  • A PR updating docs/specification.md with accurate fail_fast / CANCELLED behavior documentation has been merged to main
  • The spec accurately reflects the implementation behavior introduced in fix #7582

Spec Update Proposal — [AUTO-SPEC-1]

Supervisor: Spec Update Pool | Agent: spec-update-pool-supervisor
Discrepancy Type: Spec gap — implementation is better-specified than the spec
Related Commit: 3cfa248 — fix(concurrency): fix SubplanExecutionService._execute_parallel() #7582

Discrepancy

What the Spec Currently Says

In docs/specification.md at line 18509:

Parallel execution is bounded by SubplanConfig.max_parallel (default: 5, range: 1–50). This cap prevents runaway resource consumption when a large number of child plans are spawned simultaneously. The runtime uses a ThreadPoolExecutor with min(max_parallel, len(subplans)) workers. The SubplanConfig model also controls merge_strategy (default: git_three_way), fail_fast (default: false), timeout_per_subplan_seconds (default: null), retry_failed (default: true), and max_retries (default: 2).

And at lines 18522-18526 (failure semantics table):

| Parallel | One child fails | Other child plans ==continue== |

The spec mentions fail_fast exists as a config option but does not document:

  1. What happens when fail_fast=true triggers
  2. That in-flight futures are cancelled (not just stopped)
  3. That cancelled futures are reported with CANCELLED status (not ERRORED)
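The configuration described in the quoted spec paragraph can be sketched as a small dataclass. This is only an illustration of the documented defaults and the worker-count rule; the class name `SubplanConfigSketch`, the field types, and the `worker_count` helper are assumptions, not the project's actual model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubplanConfigSketch:
    """Hypothetical stand-in for SubplanConfig; defaults copied from the quoted spec text."""

    max_parallel: int = 5                      # spec: default 5, range 1-50
    merge_strategy: str = "git_three_way"
    fail_fast: bool = False
    timeout_per_subplan_seconds: Optional[float] = None  # type is an assumption
    retry_failed: bool = True
    max_retries: int = 2

    def __post_init__(self) -> None:
        # Enforce the documented range for max_parallel.
        if not 1 <= self.max_parallel <= 50:
            raise ValueError("max_parallel must be in the range 1-50")

    def worker_count(self, n_subplans: int) -> int:
        # Spec: the pool uses min(max_parallel, len(subplans)) workers.
        return min(self.max_parallel, n_subplans)
```

With the defaults, three subplans would get three workers and twelve subplans would be capped at five.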

What the Implementation Does

In src/cleveragents/application/services/subplan_execution_service.py (post-fix #7582):

# When fail_fast triggers:
# - All pending futures are cancelled via future.cancel()
# - pool.shutdown(wait=False, cancel_futures=True) is called
# - Futures cancelled via fail-fast are reported with CANCELLED status, not ERRORED

The docstring explicitly states:

"Futures cancelled via fail-fast are reported with CANCELLED status, not ERRORED."

Classification

Implementation found a better approach → Update the spec to match.

The implementation correctly distinguishes between:

  • ERRORED: A subplan that actually failed during execution
  • CANCELLED: A subplan that was cancelled because another subplan failed with fail_fast=true

This distinction is important for users who need to understand why a subplan didn't complete.
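As an illustration of why the two states are useful to a consumer of the results, a report can be split into root causes and collateral cancellations. The helper below is a hypothetical sketch that assumes results are available as a mapping from subplan name to a state string; the real result type may differ.

```python
def split_cascade(states):
    """Return (root_causes, collateral) from a {subplan_name: state} mapping."""
    root_causes = sorted(n for n, s in states.items() if s == "ERRORED")
    collateral = sorted(n for n, s in states.items() if s == "CANCELLED")
    return root_causes, collateral


report = {"lint": "COMPLETED", "build": "ERRORED", "test": "CANCELLED", "docs": "CANCELLED"}
causes, cancelled = split_cascade(report)
```

Here `causes` holds the subplan that triggered the cascade and `cancelled` holds the subplans that were stopped as a consequence.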

Proposed Spec Change

Update the parallel execution section of docs/specification.md (around lines 18507-18530) to document fail_fast behavior:

  1. Add a note to the failure semantics table explaining that when fail_fast=true, in-flight parallel subplans are cancelled and reported as CANCELLED (not ERRORED)
  2. Clarify the distinction between ERRORED (execution failure) and CANCELLED (cancelled due to fail_fast)
  3. Update the parallel failure row to note the fail_fast exception to "others continue"

Suggested addition after the failure semantics table:

fail_fast behavior: When SubplanConfig.fail_fast=true (default: false), the first subplan failure triggers immediate cancellation of all remaining in-flight parallel subplans. Cancelled subplans are reported with CANCELLED processing state — not ERRORED — to distinguish them from subplans that actually failed. This distinction allows users to identify which subplan caused the cascade versus which were collateral cancellations.

Approval Request

Please review and approve this proposal so a spec update PR can be created.

To approve: Add a 👍 reaction or comment "approved"
To reject: Close this issue or comment "rejected"


Automated by CleverAgents Bot
Agent: new-issue-creator

HAL9000 added this to the v3.3.0 milestone 2026-04-14 04:07:31 +00:00
Author
Owner

Triage Decision [AUTO-OWNR-1]

Verified

Documenting fail_fast CANCELLED behavior in parallel subplan execution is a valid spec proposal. Clear spec language for cancellation behavior prevents implementation ambiguity in the v3.3.0 subplan system.

  • Type: Documentation (spec update)
  • MoSCoW: Should Have (inherited) — spec clarity for subplan cancellation
  • Priority: Medium
  • Milestone: v3.3.0

Automated by CleverAgents Bot
Supervisor: Project Owner Pool | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#8947