[Bug Hunt][Cycle 2][Resource] Race Condition in AsyncWorker Future Capacity Calculation #7097

Open
opened 2026-04-10 07:41:00 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Branch: bugfix/async-worker-future-capacity-race-condition
  • Commit Message: fix(application): resolve race condition in AsyncWorker future capacity calculation
  • Milestone: backlog
  • Parent Epic: #7023

Bug Report: Resource Management — Race Condition in AsyncWorker Future Capacity Calculation

Severity Assessment

  • Impact: Incorrect worker capacity calculations leading to job queue starvation or oversubscription
  • Likelihood: Medium in high-throughput async job scenarios
  • Priority: High

Location

  • File: src/cleveragents/application/services/async_worker.py
  • Class: AsyncWorker
  • Lines: 680-690 (_poll_loop), 715-725 (_dispatch_job), 425-450 (pickup_and_execute cleanup)

Description

The AsyncWorker has a race condition between capacity calculation in _poll_loop() and future cleanup in pickup_and_execute(). The capacity calculation counts futures that may have completed but not yet been cleaned up from the _futures dictionary, leading to incorrect available capacity.

Evidence

In _poll_loop():

def _poll_loop(self) -> None:
    # ...
    with self._futures_lock:
        active = sum(1 for f in self._futures.values() if not f.done())  # Race here
    capacity = self._config.max_workers - active

In pickup_and_execute() cleanup:

finally:
    with self._tokens_lock:
        self._cancellation_tokens.pop(job.job_id, None)
    with self._futures_lock:
        self._futures.pop(job.job_id, None)  # Cleanup happens after done() becomes True

Race Window: Between when a future completes (f.done() returns True) and when it's removed from _futures, it's counted as inactive in capacity calculation but still occupies a slot in the dictionary.

Expected Behavior

Worker capacity should accurately reflect the number of available worker thread slots:

  • Completed futures should not count toward active capacity
  • Capacity calculation should be atomic with respect to future lifecycle
  • Job dispatch should respect true thread pool availability

Actual Behavior

  • Capacity miscalculation can lead to under-utilization (thinking pool is full when slots are available)
  • Potential queue starvation where jobs wait unnecessarily
  • Inconsistent behavior under high concurrency

Suggested Fix

  1. Clean up completed futures before capacity calculation:
with self._futures_lock:
    # Remove completed futures first
    completed = [job_id for job_id, f in self._futures.items() if f.done()]
    for job_id in completed:
        self._futures.pop(job_id, None)
    # Then calculate true active count
    active = len(self._futures)
  1. Or use ThreadPoolExecutor's internal queue size tracking instead of manual future tracking

  2. Add defensive capacity bounds checking

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.

Subtasks

  • Reproduce the race condition with a targeted concurrency test scenario
  • Audit _poll_loop() capacity calculation logic for atomicity guarantees
  • Implement eager cleanup of completed futures before capacity calculation within _futures_lock
  • Alternatively, evaluate using ThreadPoolExecutor internal tracking instead of manual future dict
  • Add defensive capacity bounds checking (clamp to [0, max_workers])
  • Tests (Behave): Add scenarios for AsyncWorker capacity calculation under concurrent job completion
  • Tests (Robot): Add integration test for high-throughput async job queue behaviour
  • Verify coverage >=97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

  • All subtasks above are completed and checked off
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly (fix(application): resolve race condition in AsyncWorker future capacity calculation), followed by a blank line, then additional lines providing relevant details about the implementation
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly (bugfix/async-worker-future-capacity-race-condition)
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done
  • All nox stages pass
  • Coverage >= 97%

Backlog note: This issue was discovered during autonomous operation
on milestone v3.2.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: Bug Hunt Automation | Agent: new-issue-creator

## Metadata - **Branch**: `bugfix/async-worker-future-capacity-race-condition` - **Commit Message**: `fix(application): resolve race condition in AsyncWorker future capacity calculation` - **Milestone**: backlog - **Parent Epic**: #7023 ## Bug Report: Resource Management — Race Condition in AsyncWorker Future Capacity Calculation ### Severity Assessment - **Impact**: Incorrect worker capacity calculations leading to job queue starvation or oversubscription - **Likelihood**: Medium in high-throughput async job scenarios - **Priority**: High ### Location - **File**: `src/cleveragents/application/services/async_worker.py` - **Class**: `AsyncWorker` - **Lines**: 680-690 (_poll_loop), 715-725 (_dispatch_job), 425-450 (pickup_and_execute cleanup) ### Description The AsyncWorker has a race condition between capacity calculation in `_poll_loop()` and future cleanup in `pickup_and_execute()`. The capacity calculation counts futures that may have completed but not yet been cleaned up from the `_futures` dictionary, leading to incorrect available capacity. ### Evidence In `_poll_loop()`: ```python def _poll_loop(self) -> None: # ... with self._futures_lock: active = sum(1 for f in self._futures.values() if not f.done()) # Race here capacity = self._config.max_workers - active ``` In `pickup_and_execute()` cleanup: ```python finally: with self._tokens_lock: self._cancellation_tokens.pop(job.job_id, None) with self._futures_lock: self._futures.pop(job.job_id, None) # Cleanup happens after done() becomes True ``` **Race Window**: Between when a future completes (`f.done()` returns True) and when it's removed from `_futures`, it's counted as inactive in capacity calculation but still occupies a slot in the dictionary. ### Expected Behavior Worker capacity should accurately reflect the number of available worker thread slots: - Completed futures should not count toward active capacity - Capacity calculation should be atomic with respect to future lifecycle - Job dispatch should respect true thread pool availability ### Actual Behavior - Capacity miscalculation can lead to under-utilization (thinking pool is full when slots are available) - Potential queue starvation where jobs wait unnecessarily - Inconsistent behavior under high concurrency ### Suggested Fix 1. Clean up completed futures before capacity calculation: ```python with self._futures_lock: # Remove completed futures first completed = [job_id for job_id, f in self._futures.items() if f.done()] for job_id in completed: self._futures.pop(job_id, None) # Then calculate true active count active = len(self._futures) ``` 2. Or use ThreadPoolExecutor's internal queue size tracking instead of manual future tracking 3. Add defensive capacity bounds checking ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. ## Subtasks - [ ] Reproduce the race condition with a targeted concurrency test scenario - [ ] Audit `_poll_loop()` capacity calculation logic for atomicity guarantees - [ ] Implement eager cleanup of completed futures before capacity calculation within `_futures_lock` - [ ] Alternatively, evaluate using `ThreadPoolExecutor` internal tracking instead of manual future dict - [ ] Add defensive capacity bounds checking (clamp to `[0, max_workers]`) - [ ] Tests (Behave): Add scenarios for AsyncWorker capacity calculation under concurrent job completion - [ ] Tests (Robot): Add integration test for high-throughput async job queue behaviour - [ ] Verify coverage >=97% via `nox -s coverage_report` - [ ] Run `nox` (all default sessions), fix any errors ## Definition of Done - [ ] All subtasks above are completed and checked off - [ ] A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly (`fix(application): resolve race condition in AsyncWorker future capacity calculation`), followed by a blank line, then additional lines providing relevant details about the implementation - [ ] The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly (`bugfix/async-worker-future-capacity-race-condition`) - [ ] The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done - [ ] All nox stages pass - [ ] Coverage >= 97% > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.2.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunt Automation | Agent: new-issue-creator
Author
Owner

Verified — Concurrency bug: race condition in AsyncWorker future capacity calculation. MoSCoW: Should-have. Priority: High.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Concurrency bug: race condition in AsyncWorker future capacity calculation. MoSCoW: Should-have. Priority: High. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7097
No description provided.