BUG-HUNT: [concurrency] async_worker.py detect_stuck_jobs races with running jobs — can mark succeeded jobs as failed #7494

Open
opened 2026-04-10 20:49:32 +00:00 by HAL9000 · 1 comment
Owner

Bug Report: Concurrency — detect_stuck_jobs Races with Active Worker Threads

Severity Assessment

  • Impact: Jobs that completed successfully can be marked as FAILED — incorrect job status causes cascading failures in dependent workflows
  • Likelihood: Medium — race window is small but present under load
  • Priority: High

Location

  • File: src/cleveragents/application/services/async_worker.py
  • Function: detect_stuck_jobs
  • Lines: 519–544
  • Category: concurrency

Description

detect_stuck_jobs calls list_by_status(RUNNING) and then calls job.mark_failed(...) on each returned job, but the two steps happen under separate lock acquisitions. In the gap between them, the job's owning worker thread may already have called mark_succeeded(), in which case detect_stuck_jobs calls mark_failed on a job that is already terminal.

More critically, the job object returned by list_by_status is the same in-memory object that the worker thread is modifying (the store's dict hands out references, not copies), so job.mark_failed can race with job.mark_succeeded on that object with no synchronization.

Evidence

for job in running_jobs:           # job is a live reference to dict entry
    ...
    job.mark_failed(...)           # concurrent worker may call mark_succeeded on SAME object
    self._job_store.update(job)    # second conflicting update

Race scenario:

  1. detect_stuck_jobs gets reference to Job X (status=RUNNING)
  2. Worker thread calls Job X.mark_succeeded() — status=SUCCEEDED
  3. detect_stuck_jobs calls Job X.mark_failed() — status=FAILED (WRONG!)
  4. Job X is now incorrectly marked as FAILED in the store
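
The four steps above can be replayed deterministically in a single thread, which may be useful as a starting point for the expected-fail test. The sketch below is not the project's code: `Job`, `JobStatus`, and `SharedRefStore` are hypothetical stand-ins, and `mark_failed` is assumed to have no guard against an already-terminal status, as the report describes.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    """Hypothetical stand-in for the real Job model."""
    job_id: str
    status: JobStatus = JobStatus.RUNNING

    def mark_succeeded(self) -> None:
        self.status = JobStatus.SUCCEEDED

    def mark_failed(self) -> None:
        # Assumption: no check for an already-terminal status.
        self.status = JobStatus.FAILED


class SharedRefStore:
    """Hypothetical store that hands out live references, not copies."""

    def __init__(self) -> None:
        self._jobs = {}

    def add(self, job: Job) -> None:
        self._jobs[job.job_id] = job

    def get(self, job_id: str) -> Job:
        return self._jobs[job_id]

    def list_by_status(self, status: JobStatus) -> list:
        return [j for j in self._jobs.values() if j.status == status]


def replay_race() -> JobStatus:
    store = SharedRefStore()
    store.add(Job("x"))
    # Step 1: the stuck-job check snapshots the RUNNING jobs.
    running = store.list_by_status(JobStatus.RUNNING)
    # Step 2: the worker finishes the job in the meantime.
    store.get("x").mark_succeeded()
    # Step 3: the stuck-job check acts on its stale list.
    for job in running:
        job.mark_failed()
    # Step 4: the status held by the store has been overwritten.
    return store.get("x").status
```

Calling `replay_race()` returns `JobStatus.FAILED` even though the worker recorded success first, because both callers mutate the same object.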

Expected Behavior

detect_stuck_jobs should only mark jobs as failed if they are still genuinely running after the timeout, with no race against concurrent completion.

Actual Behavior

Jobs can be incorrectly marked as FAILED even if they succeeded milliseconds before the stuck-job check runs.

Suggested Fix

InMemoryJobStore should return copies of jobs on get/list so each caller works on an independent snapshot:

import copy

def list_by_status(self, status: JobStatus) -> list[Job]:
    with self._lock:
        # copy.copy is shallow: nested mutable attributes are still shared
        return [copy.copy(job) for job in self._jobs.values() if job.status == status]

The store's update method should then be the single authoritative writer, with version checking to detect stale updates.
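
One way the version check could work is optimistic concurrency: every stored job carries a counter, and `update` rejects a snapshot whose counter no longer matches the stored one. The following is a minimal sketch under stated assumptions, not the actual `InMemoryJobStore`; the `version` field, `StaleJobError`, and `VersionedJobStore` are hypothetical names introduced for illustration.

```python
import copy
import threading
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    """Hypothetical job model with an added version counter."""
    job_id: str
    status: JobStatus = JobStatus.RUNNING
    version: int = 0


class StaleJobError(Exception):
    """Raised when an update is based on an outdated snapshot."""


class VersionedJobStore:
    """Hypothetical store: readers get copies, update() is the only writer."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs = {}

    def add(self, job: Job) -> None:
        with self._lock:
            self._jobs[job.job_id] = copy.copy(job)

    def get(self, job_id: str) -> Job:
        with self._lock:
            return copy.copy(self._jobs[job_id])

    def list_by_status(self, status: JobStatus) -> list:
        with self._lock:
            return [copy.copy(j) for j in self._jobs.values() if j.status == status]

    def update(self, job: Job) -> None:
        with self._lock:
            current = self._jobs[job.job_id]
            if current.version != job.version:
                raise StaleJobError(job.job_id)
            stored = copy.copy(job)
            stored.version += 1  # the next writer must have seen this update
            self._jobs[job.job_id] = stored


def replay_with_versions() -> JobStatus:
    store = VersionedJobStore()
    store.add(Job("x"))
    snapshots = store.list_by_status(JobStatus.RUNNING)  # stuck-job check reads
    worker_copy = store.get("x")                         # worker reads
    worker_copy.status = JobStatus.SUCCEEDED
    store.update(worker_copy)                            # worker wins
    for snap in snapshots:
        snap.status = JobStatus.FAILED
        try:
            store.update(snap)                           # stale, so rejected
        except StaleJobError:
            pass
    return store.get("x").status
```

With this design the same interleaving leaves the job as `JobStatus.SUCCEEDED`: the stuck-job check only mutates its private copy, and its write is refused because the worker's update already bumped the version. The comparison and the write happen under one lock acquisition, which is what closes the window described in this report.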

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor

HAL9000 added this to the v3.5.0 milestone 2026-04-10 21:39:19 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: High — Concurrency/data integrity bug in autonomy hardening components that impacts M6 milestone functionality
  • Milestone: v3.5.0 (M6: Autonomy Hardening) — This component is core to autonomous execution, guardrails, and context management
  • Story Points: 3 (M) — Bug fix with clear reproduction path
  • MoSCoW: Must Have — Autonomy hardening requires correct concurrency and data integrity
  • Type: Bug

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#7494