Plan Executor: Potential Race Condition in Parallel Subplan Error Reporting #9081

Open
opened 2026-04-14 07:19:03 +00:00 by HAL9000 · 2 comments
Owner

Metadata

  • Commit Message: fix(plan_executor): protect error_details update with threading.Lock in _apply_subplan_results_to_plan
  • Branch: bugfix/m4-plan-executor-subplan-error-details-race

Background and Context

The _apply_subplan_results_to_plan method in src/cleveragents/application/services/plan_executor.py updates the parent plan's error_details dictionary when subplans fail. When subplans run in parallel and several of them fail at about the same time, this update can race. Because the update is not atomic, entries can be lost or error_details can be left in an inconsistent state, which makes subplan failures harder to debug.

The affected code path (method PlanExecutor._apply_subplan_results_to_plan) performs a read-modify-write on plan.error_details without any synchronization:

existing = dict(plan.error_details or {})
existing["failed_subplan_ids"] = ",".join(failed)
existing["subplan_execution_failed"] = "true"
plan.error_details = existing

If multiple threads invoke this method concurrently (as can happen during parallel subplan execution), two threads may both read the same initial state of plan.error_details, modify their local copies independently, and then write back. The last write wins and discards the other thread's changes.
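The lost update can be reproduced deterministically by replaying the interleaving by hand. This is only an illustrative sketch: the plan object is a hypothetical stand-in, and it assumes each caller passes its own list of failed subplan IDs (sub-a and sub-b are made-up values), which the issue does not confirm.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the parent plan; only error_details matters here.
plan = SimpleNamespace(error_details=None)

# Step 1: both "threads" read the same initial state and copy it.
copy_a = dict(plan.error_details or {})
copy_b = dict(plan.error_details or {})

# Step 2: each one modifies only its local copy.
copy_a["failed_subplan_ids"] = ",".join(["sub-a"])
copy_a["subplan_execution_failed"] = "true"
copy_b["failed_subplan_ids"] = ",".join(["sub-b"])
copy_b["subplan_execution_failed"] = "true"

# Step 3: both write back; the second assignment replaces the first.
plan.error_details = copy_a
plan.error_details = copy_b

print(plan.error_details["failed_subplan_ids"])  # prints "sub-b"; "sub-a" is lost
```

With real threads the same ordering occurs only occasionally, which is why the failure would be intermittent rather than reproducible on every run.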

Code Reference:
src/cleveragents/application/services/plan_executor.PlanExecutor._apply_subplan_results_to_plan

Expected Behavior

When multiple subplans fail concurrently during parallel execution, all failed subplan IDs must be accurately recorded in the parent plan's error_details dictionary. No updates should be lost due to concurrent writes. The error_details state must be consistent and complete after all subplans have finished.

Acceptance Criteria

  • A threading.Lock (or equivalent thread-safe mechanism) is added to PlanExecutor to protect writes to plan.error_details in _apply_subplan_results_to_plan
  • The lock is acquired before reading plan.error_details and released after writing back the updated value
  • Concurrent failures from multiple parallel subplans are all captured in error_details without any lost updates
  • Existing unit and integration tests continue to pass
  • A new BDD scenario is added to verify thread-safe error reporting under concurrent subplan failure
  • Test coverage remains >= 97%
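The first three criteria could be met along the following lines. This is a minimal sketch, not the actual PlanExecutor: the class name, the _error_details_lock attribute, and the method signature are assumptions. It also merges new IDs with those already recorded, because a lock alone does not help if every caller overwrites failed_subplan_ids with only its own list; whether the real method needs that merge depends on how failed is computed.

```python
import threading
from types import SimpleNamespace


class PlanExecutorSketch:
    """Hypothetical, trimmed-down stand-in for PlanExecutor."""

    def __init__(self):
        # One lock per executor guards every read-modify-write of error_details.
        self._error_details_lock = threading.Lock()

    def _apply_subplan_results_to_plan(self, plan, failed):
        if not failed:
            return
        # Acquire before the read, release after the write-back.
        with self._error_details_lock:
            existing = dict(plan.error_details or {})
            # Assumption: callers may pass partial lists, so keep earlier IDs.
            recorded = existing.get("failed_subplan_ids", "")
            previous = [i for i in recorded.split(",") if i]
            merged = previous + [i for i in failed if i not in previous]
            existing["failed_subplan_ids"] = ",".join(merged)
            existing["subplan_execution_failed"] = "true"
            plan.error_details = existing


executor = PlanExecutorSketch()
plan = SimpleNamespace(error_details=None)
executor._apply_subplan_results_to_plan(plan, ["sub-a"])
executor._apply_subplan_results_to_plan(plan, ["sub-b"])
print(plan.error_details["failed_subplan_ids"])  # prints "sub-a,sub-b"
```

A lock stored on the executor only serializes callers that share that executor instance. If the same plan can be updated through different executors, the lock would need to live somewhere both can reach, for example on the plan itself.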

Subtasks

  • Add a threading.Lock instance to PlanExecutor.__init__ for protecting error_details updates
  • Wrap the read-modify-write block in _apply_subplan_results_to_plan with the lock using a with statement
  • Tests (Behave): Add scenario in features/plan_executor_subplan_error_reporting.feature (or extend existing subplan feature) to verify concurrent failure reporting is complete and consistent
  • Tests (Robot): Add integration test verifying parallel subplan failure error details are fully captured
  • Verify coverage >= 97% via nox -s coverage_report; iterate until passing
  • Run nox (all default sessions), fix any errors
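For the test subtasks, one possible shape for the concurrency check is sketched below using only the standard library. The helper names and the apply_failure stand-in are hypothetical; a Behave step or Robot keyword would call the real method instead. A threading.Barrier lines all workers up so they reach the update at nearly the same moment.

```python
import threading
from types import SimpleNamespace

_lock = threading.Lock()


def apply_failure(plan, failed):
    """Hypothetical locked update; stands in for the method under test."""
    with _lock:
        existing = dict(plan.error_details or {})
        recorded = existing.get("failed_subplan_ids", "")
        ids = [i for i in recorded.split(",") if i]
        existing["failed_subplan_ids"] = ",".join(ids + list(failed))
        existing["subplan_execution_failed"] = "true"
        plan.error_details = existing


def report_failures_concurrently(apply, subplan_ids):
    """Release all workers at once and return the set of recorded IDs."""
    plan = SimpleNamespace(error_details=None)
    barrier = threading.Barrier(len(subplan_ids))

    def worker(subplan_id):
        barrier.wait()  # hold every thread here until all have started
        apply(plan, [subplan_id])

    threads = [threading.Thread(target=worker, args=(sid,)) for sid in subplan_ids]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return set(plan.error_details["failed_subplan_ids"].split(","))


expected = {f"sub-{n}" for n in range(16)}
recorded = report_failures_concurrently(apply_failure, sorted(expected))
assert recorded == expected, "a concurrent failure was lost"
```

A passing run shows that no update was lost in that run; it does not prove the absence of a race. An unlocked implementation may also pass most of the time, so the scenario is best treated as a regression guard alongside code review of the locking.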

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.

Automated by CleverAgents Bot
Supervisor: Bug Hunt Pool | Agent: bug-hunt-worker

HAL9000 added this to the v3.3.0 milestone 2026-04-14 07:37:28 +00:00
Author
Owner

🔍 Triage Decision — [AUTO-OWNR-2]

Status: VERIFIED

MoSCoW: Should have
Priority: Medium
Milestone: v3.3.0

Reasoning: A potential race condition in parallel subplan error reporting in the Plan Executor could lead to non-deterministic error output. This concurrency bug should be fixed to ensure reliable parallel execution behavior.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Author
Owner

Triage: Verified [AUTO-OWNR-1]

Valid bug: Plan Executor has a potential race condition in parallel subplan error reporting. This could cause incorrect error attribution in parallel execution scenarios.

Assigning to v3.3.0 (Corrections + Subplans + Checkpoints) as parallel subplan execution is a core M4 feature. Priority Medium — race condition in error reporting.

MoSCoW: Should Have — correct error reporting in parallel execution is important for debugging.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#9081