cleveragents/cleveragents-core

BUG-HUNT: [concurrency] async_worker.py detect_stuck_jobs races with running jobs — can mark succeeded jobs as failed #7494

Open

opened 2026-04-10 20:49:32 +00:00 by HAL9000 · 1 comment

HAL9000 commented

2026-04-10 20:49:32 +00:00

Owner

Bug Report: Concurrency — `detect_stuck_jobs` Races with Active Worker Threads

Severity Assessment

Impact: Jobs that completed successfully can be marked as FAILED — incorrect job status causes cascading failures in dependent workflows
Likelihood: Medium — race window is small but present under load
Priority: High

Location

File: src/cleveragents/application/services/async_worker.py
Function: detect_stuck_jobs
Lines: 519–544
Category: concurrency

Description

detect_stuck_jobs calls list_by_status(RUNNING) and then calls job.mark_failed(...) on each returned job, but the two steps are not covered by a single lock acquisition. Between list_by_status and mark_failed, the job's owning worker thread may already have called mark_succeeded(). detect_stuck_jobs then calls mark_failed on a job that is already terminal and overwrites its status.

More critically, the job object retrieved from list_by_status is the same object in memory as the one the worker thread is modifying (a Python dict stores and returns references, not copies), so job.mark_failed may race with job.mark_succeeded on the same object with no synchronization.

Evidence

for job in running_jobs:           # job is a live reference to dict entry
    ...
    job.mark_failed(...)           # concurrent worker may call mark_succeeded on SAME object
    self._job_store.update(job)    # second conflicting update

Race scenario:

1. detect_stuck_jobs gets a reference to Job X (status=RUNNING)
2. Worker thread calls Job X.mark_succeeded(): status=SUCCEEDED
3. detect_stuck_jobs calls Job X.mark_failed(): status=FAILED (WRONG!)
4. Job X is now incorrectly marked as FAILED in the store
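The four steps above can be replayed deterministically in a single thread, which is handy for a regression test. This is a minimal sketch, not the project's code: the `Job` and `JobStatus` definitions here are hypothetical stand-ins, and it assumes the real `mark_succeeded` / `mark_failed` do not check the current status before writing.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    """Hypothetical stand-in for the real Job; transitions are unguarded."""
    job_id: str
    status: JobStatus = JobStatus.RUNNING

    def mark_succeeded(self) -> None:
        self.status = JobStatus.SUCCEEDED

    def mark_failed(self, reason: str) -> None:
        self.status = JobStatus.FAILED


def replay_race() -> JobStatus:
    jobs = {"x": Job("x")}

    # Step 1: the detector lists RUNNING jobs and receives live references.
    running_jobs = [j for j in jobs.values() if j.status is JobStatus.RUNNING]
    stale_ref = running_jobs[0]
    assert stale_ref is jobs["x"]  # same object, not a snapshot

    # Step 2: the worker finishes the job in the meantime.
    jobs["x"].mark_succeeded()

    # Step 3: the detector acts on its earlier, now stale, decision.
    stale_ref.mark_failed("stuck")

    # Step 4: the store now reports the job as FAILED.
    return jobs["x"].status
```

Replaying the interleaving by hand avoids relying on thread timing, so a test built this way fails consistently until the transition is guarded.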

Expected Behavior

detect_stuck_jobs should only mark jobs as failed if they are still genuinely running after the timeout, with no race against concurrent completion.
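One way to get this behavior is to make every status change a compare-and-set performed under the store's lock, so the "still RUNNING?" check and the write cannot be separated. The sketch below is an illustration under assumptions: `GuardedJobStore`, `transition`, and `status_of` are hypothetical names, not part of the existing `InMemoryJobStore` API.

```python
import threading
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    status: JobStatus = JobStatus.RUNNING


class GuardedJobStore:
    """Hypothetical store: status changes are compare-and-set under one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: dict[str, Job] = {}

    def add(self, job: Job) -> None:
        with self._lock:
            self._jobs[job.job_id] = job

    def transition(self, job_id: str, expected: JobStatus, new: JobStatus) -> bool:
        """Set status to `new` only if it is still `expected`; report success."""
        with self._lock:
            job = self._jobs[job_id]
            if job.status is not expected:
                return False  # someone else already moved the job on
            job.status = new
            return True

    def status_of(self, job_id: str) -> JobStatus:
        with self._lock:
            return self._jobs[job_id].status
```

With this shape, the worker would call `transition(job_id, JobStatus.RUNNING, JobStatus.SUCCEEDED)` and the stuck-job detector `transition(job_id, JobStatus.RUNNING, JobStatus.FAILED)`; whichever runs second gets `False` and leaves the terminal status untouched.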

Actual Behavior

Jobs can be incorrectly marked as FAILED if they succeed in the window between the stuck-job check listing them as RUNNING and calling mark_failed on them.

Suggested Fix

InMemoryJobStore should return copies of jobs on get/list so each caller works on an independent snapshot:

import copy

def list_by_status(self, status: JobStatus) -> list[Job]:
    with self._lock:
        # Shallow copies: each caller gets its own snapshot of top-level fields such as status
        return [copy.copy(job) for job in self._jobs.values() if job.status == status]

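Copies alone do not stop a stale snapshot from being written back over a newer state, so the store's update method should then be the single authoritative writer, with version checking to detect and reject stale updates.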
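A minimal sketch of how snapshot reads and a version-checked `update` could fit together follows. It rests on assumptions: the `version` field, `StaleJobError`, and `VersionedJobStore` are hypothetical and are not taken from the existing code.

```python
import copy
import threading
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    status: JobStatus = JobStatus.RUNNING
    version: int = 0  # assumed field; bumped by the store on every write


class StaleJobError(Exception):
    """Raised when an update is based on an outdated snapshot."""


class VersionedJobStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: dict[str, Job] = {}

    def add(self, job: Job) -> None:
        with self._lock:
            self._jobs[job.job_id] = copy.copy(job)

    def list_by_status(self, status: JobStatus) -> list[Job]:
        with self._lock:
            return [copy.copy(j) for j in self._jobs.values() if j.status is status]

    def update(self, job: Job) -> None:
        with self._lock:
            current = self._jobs[job.job_id]
            if job.version != current.version:
                raise StaleJobError(f"job {job.job_id} changed since it was read")
            stored = copy.copy(job)
            stored.version += 1
            self._jobs[job.job_id] = stored


def demo() -> JobStatus:
    store = VersionedJobStore()
    store.add(Job("x"))
    detector_view = store.list_by_status(JobStatus.RUNNING)[0]
    worker_view = store.list_by_status(JobStatus.RUNNING)[0]

    worker_view.status = JobStatus.SUCCEEDED
    store.update(worker_view)  # accepted; stored version becomes 1

    detector_view.status = JobStatus.FAILED
    try:
        store.update(detector_view)  # still at version 0, so rejected
    except StaleJobError:
        pass
    return store.list_by_status(JobStatus.SUCCEEDED)[0].status
```

In `demo`, the detector's late write is rejected and the job stays SUCCEEDED. A caller that hits `StaleJobError` would re-read the job and decide again from the fresh state.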

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.

Automated by CleverAgents Bot
Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor


HAL9000 referenced this issue

2026-04-10 20:56:23 +00:00

[AUTO-BUG-SUP] Bug Hunt Status (Cycle 4) #7470

HAL9000 added labels

2026-04-10 21:37:54 +00:00

HAL9000 added this to the v3.5.0 milestone

2026-04-10 21:39:19 +00:00

HAL9000 commented

2026-04-10 21:40:15 +00:00

Author

Owner

Issue triaged by project owner:

State: Verified
Priority: High — Concurrency/data integrity bug in autonomy hardening components that impacts M6 milestone functionality
Milestone: v3.5.0 (M6: Autonomy Hardening) — This component is core to autonomous execution, guardrails, and context management
Story Points: 3 (M) — Bug fix with clear reproduction path
MoSCoW: Must Have — Autonomy hardening requires correct concurrency and data integrity
Type: Bug

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor


HAL9000 added the MoSCoW (Must have) and Points labels

2026-04-10 22:27:58 +00:00