BUG-HUNT: [boundary] SubplanExecutionService division-by-zero on max_parallel=0 #7119

Open
opened 2026-04-10 07:56:09 +00:00 by HAL9000 · 1 comment
Owner

Background

SubplanExecutionService._execute_parallel() (lines 299 and 616 of
src/cleveragents/application/services/subplan_execution_service.py) computes
max_workers for ThreadPoolExecutor as:

max_workers = min(self._config.max_parallel, len(statuses))

ThreadPoolExecutor raises ValueError("max_workers must be greater than 0")
when max_workers=0. This can occur if max_parallel is 0, which is
possible when:

  • A SubplanConfig is constructed outside Pydantic (e.g. via
    object.__setattr__, direct dataclass construction, or a deserialization
    path that bypasses model validators).
  • A test fixture or migration script creates a config with max_parallel=0.
  • A future refactor removes or weakens the ge=1 constraint on the domain
    model without updating the execution layer.

The domain model (plan.py line 524) does declare ge=1 on the Pydantic
Field, so well-formed configs are protected at construction time. However,
the execution service itself has no defensive guard — it trusts the config
unconditionally. Per the project's fail-fast / argument-validation-first
policy (CONTRIBUTING.md §Error and Exception Handling), every public/protected
method must validate its inputs before use. The missing max(1, ...) or
explicit >= 1 assertion at the execution layer is a boundary-condition gap.

Expected Behaviour

_execute_parallel() should either:

  1. Clamp max_workers to a minimum of 1:
    max_workers = max(1, min(self._config.max_parallel, len(statuses)))
    
  2. Or raise a clear, descriptive ValueError early (before
    ThreadPoolExecutor is constructed) if max_parallel < 1.

Either way, the failure mode must be explicit and diagnosable, not a raw
ValueError from the standard library with no context about the config field
that caused it.

Actual Behaviour

If max_parallel=0 reaches the execution layer, ThreadPoolExecutor(max_workers=0)
raises:

ValueError: max_workers must be greater than 0

with no indication that SubplanConfig.max_parallel is the root cause.

Affected Locations

File Line Context
src/cleveragents/application/services/subplan_execution_service.py 299 _execute_parallel() — primary parallel path
src/cleveragents/application/services/subplan_execution_service.py 616 _execute_wave() — dependency-ordered wave path

Suggested Fix

Apply max(1, ...) at both sites, or add an explicit guard at the top of
_execute_parallel() / _execute_wave():

if self._config.max_parallel < 1:
    raise ValueError(
        f"SubplanConfig.max_parallel must be >= 1, got {self._config.max_parallel}"
    )

TDD Note

Per the project's Bug Fix Workflow (CONTRIBUTING.md §Bug Fix Workflow), this
issue must be resolved via the TDD path:

  1. A TDD issue (Type/Testing) must be created first, containing a
    Behave scenario tagged @tdd_issue, @tdd_issue_<N> (where <N> is this
    issue's number), and @tdd_expected_fail. The scenario must assert the
    correct boundary behaviour (either clamping or a descriptive ValueError)
    and fail while the bug is present (i.e. it passes CI via the
    @tdd_expected_fail inversion).
  2. A bugfix branch (bugfix/mN-subplan-exec-max-parallel-zero) then
    implements the fix, removes @tdd_expected_fail, and closes this issue.

Do not implement the fix without the TDD issue and its failing scenario in
place first.

Metadata

  • Branch: bugfix/subplan-exec-max-parallel-zero
  • Commit Message: fix(subplans): guard against max_parallel=0 in SubplanExecutionService
  • Milestone: (none — backlog, see note below)
  • Parent Epic: #4972

Subtasks

  • Create TDD issue with @tdd_expected_fail Behave scenario reproducing max_workers=0 crash
  • Add @tdd_issue_<N> scenario that constructs a SubplanConfig with max_parallel=0 (bypassing Pydantic) and asserts a descriptive ValueError or clamped behaviour
  • Implement fix at line 299: max_workers = max(1, min(self._config.max_parallel, len(statuses)))
  • Implement fix at line 616: same pattern in _execute_wave()
  • Remove @tdd_expected_fail from TDD scenario
  • Verify nox -s unit_tests passes
  • Verify nox -s typecheck passes (no # type: ignore introduced)
  • Verify nox -s coverage_report remains ≥ 97%

Definition of Done

  • Both occurrences of min(self._config.max_parallel, len(statuses)) are guarded against max_parallel < 1
  • A Behave scenario tagged @tdd_issue + @tdd_issue_<N> exists and passes (without @tdd_expected_fail)
  • The fix raises a descriptive ValueError or clamps to 1 — no raw ThreadPoolExecutor error leaks
  • No # type: ignore annotations introduced
  • All nox stages pass
  • Coverage ≥ 97%

Backlog note: This issue was discovered during autonomous operation
on milestone v3.3.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: boundary | Agent: new-issue-creator

## Background `SubplanExecutionService._execute_parallel()` (lines 299 and 616 of `src/cleveragents/application/services/subplan_execution_service.py`) computes `max_workers` for `ThreadPoolExecutor` as: ```python max_workers = min(self._config.max_parallel, len(statuses)) ``` `ThreadPoolExecutor` raises `ValueError("max_workers must be greater than 0")` when `max_workers=0`. This can occur if `max_parallel` is 0, which is possible when: - A `SubplanConfig` is constructed outside Pydantic (e.g. via `object.__setattr__`, direct `dataclass` construction, or a deserialization path that bypasses model validators). - A test fixture or migration script creates a config with `max_parallel=0`. - A future refactor removes or weakens the `ge=1` constraint on the domain model without updating the execution layer. The domain model (`plan.py` line 524) does declare `ge=1` on the Pydantic `Field`, so well-formed configs are protected at construction time. However, the execution service itself has **no defensive guard** — it trusts the config unconditionally. Per the project's fail-fast / argument-validation-first policy (CONTRIBUTING.md §Error and Exception Handling), every public/protected method must validate its inputs before use. The missing `max(1, ...)` or explicit `>= 1` assertion at the execution layer is a boundary-condition gap. ## Expected Behaviour `_execute_parallel()` should either: 1. Clamp `max_workers` to a minimum of 1: ```python max_workers = max(1, min(self._config.max_parallel, len(statuses))) ``` 2. Or raise a clear, descriptive `ValueError` early (before `ThreadPoolExecutor` is constructed) if `max_parallel < 1`. Either way, the failure mode must be explicit and diagnosable, not a raw `ValueError` from the standard library with no context about the config field that caused it. ## Actual Behaviour If `max_parallel=0` reaches the execution layer, `ThreadPoolExecutor(max_workers=0)` raises: ``` ValueError: max_workers must be greater than 0 ``` with no indication that `SubplanConfig.max_parallel` is the root cause. ## Affected Locations | File | Line | Context | |---|---|---| | `src/cleveragents/application/services/subplan_execution_service.py` | 299 | `_execute_parallel()` — primary parallel path | | `src/cleveragents/application/services/subplan_execution_service.py` | 616 | `_execute_wave()` — dependency-ordered wave path | ## Suggested Fix Apply `max(1, ...)` at both sites, or add an explicit guard at the top of `_execute_parallel()` / `_execute_wave()`: ```python if self._config.max_parallel < 1: raise ValueError( f"SubplanConfig.max_parallel must be >= 1, got {self._config.max_parallel}" ) ``` ## TDD Note Per the project's Bug Fix Workflow (CONTRIBUTING.md §Bug Fix Workflow), this issue **must** be resolved via the TDD path: 1. A **TDD issue** (`Type/Testing`) must be created first, containing a Behave scenario tagged `@tdd_issue`, `@tdd_issue_<N>` (where `<N>` is this issue's number), and `@tdd_expected_fail`. The scenario must assert the correct boundary behaviour (either clamping or a descriptive `ValueError`) and **fail** while the bug is present (i.e. it passes CI via the `@tdd_expected_fail` inversion). 2. A **bugfix branch** (`bugfix/mN-subplan-exec-max-parallel-zero`) then implements the fix, removes `@tdd_expected_fail`, and closes this issue. Do **not** implement the fix without the TDD issue and its failing scenario in place first. ## Metadata - **Branch**: `bugfix/subplan-exec-max-parallel-zero` - **Commit Message**: `fix(subplans): guard against max_parallel=0 in SubplanExecutionService` - **Milestone**: *(none — backlog, see note below)* - **Parent Epic**: #4972 ## Subtasks - [ ] Create TDD issue with `@tdd_expected_fail` Behave scenario reproducing `max_workers=0` crash - [ ] Add `@tdd_issue_<N>` scenario that constructs a `SubplanConfig` with `max_parallel=0` (bypassing Pydantic) and asserts a descriptive `ValueError` or clamped behaviour - [ ] Implement fix at line 299: `max_workers = max(1, min(self._config.max_parallel, len(statuses)))` - [ ] Implement fix at line 616: same pattern in `_execute_wave()` - [ ] Remove `@tdd_expected_fail` from TDD scenario - [ ] Verify `nox -s unit_tests` passes - [ ] Verify `nox -s typecheck` passes (no `# type: ignore` introduced) - [ ] Verify `nox -s coverage_report` remains ≥ 97% ## Definition of Done - [ ] Both occurrences of `min(self._config.max_parallel, len(statuses))` are guarded against `max_parallel < 1` - [ ] A Behave scenario tagged `@tdd_issue` + `@tdd_issue_<N>` exists and passes (without `@tdd_expected_fail`) - [ ] The fix raises a descriptive `ValueError` or clamps to 1 — no raw `ThreadPoolExecutor` error leaks - [ ] No `# type: ignore` annotations introduced - [ ] All nox stages pass - [ ] Coverage ≥ 97% > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.3.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- **Automated by CleverAgents Bot** Supervisor: boundary | Agent: new-issue-creator
Author
Owner

Verified — Bug: SubplanExecutionService division-by-zero on max_parallel=0. MoSCoW: Must-have. Priority: High — runtime crash.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Bug: SubplanExecutionService division-by-zero on max_parallel=0. MoSCoW: Must-have. Priority: High — runtime crash. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core#7119
No description provided.