BUG-HUNT: [concurrency] AsyncJob heartbeat recording has race condition vulnerability #7036

Open
opened 2026-04-10 07:21:17 +00:00 by HAL9000 · 2 comments
Owner

Bug Report: [Concurrency] — AsyncJob heartbeat race condition

Severity Assessment

  • Impact: Worker heartbeats can be recorded after a job transitions to a terminal state, leaving the job record inconsistent
  • Likelihood: High in production with concurrent workers and job lifecycle management
  • Priority: High

Location

  • File: src/cleveragents/domain/models/core/async_job.py
  • Function/Class: AsyncJob.record_heartbeat()
  • Lines: ~250-260

Description

The record_heartbeat() method checks that the job status is RUNNING and then writes the heartbeat timestamp as a separate step. Nothing prevents the job from transitioning to a terminal state (SUCCEEDED, FAILED, CANCELLED) between the status check and the timestamp update, which makes this a check-then-act race.

Evidence

def record_heartbeat(self) -> None:
    """Record a heartbeat from the worker."""
    if self.status != AsyncJobStatus.RUNNING:
        raise ValueError(
            f"Cannot record heartbeat for job in {self.status.value} state"
        )
    self.last_heartbeat = datetime.now(UTC)  # Race window: status may have changed since the check above

Expected Behavior

Heartbeat recording should be atomic with respect to status transitions, or should gracefully handle the case where a job transitions to a terminal state during heartbeat recording.

Actual Behavior

A worker can record a heartbeat after another thread/process has marked the job as completed, leading to an inconsistent state in which a terminal job has a recent heartbeat timestamp.
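
The interleaving described above can be forced deterministically, without real threads, by letting the clock call stand in for "another thread ran here". This is only a sketch: `SketchJob`, its injectable `_clock`, `mark_succeeded()`, and the enum values are hypothetical stand-ins, not the real `AsyncJob` API. Only the body of `record_heartbeat()` mirrors the Evidence snippet.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Callable, Optional

UTC = timezone.utc  # same object as datetime.UTC on Python 3.11+


class AsyncJobStatus(Enum):
    # Enum values are assumed for illustration.
    RUNNING = "running"
    SUCCEEDED = "succeeded"


class SketchJob:
    """Hypothetical stand-in for AsyncJob with an injectable clock."""

    def __init__(self) -> None:
        self.status = AsyncJobStatus.RUNNING
        self.last_heartbeat: Optional[datetime] = None
        self._clock: Callable[[], datetime] = lambda: datetime.now(UTC)

    def record_heartbeat(self) -> None:
        # Same check-then-act shape as the method quoted under Evidence.
        if self.status != AsyncJobStatus.RUNNING:
            raise ValueError(
                f"Cannot record heartbeat for job in {self.status.value} state"
            )
        # The clock call sits inside the race window.
        self.last_heartbeat = self._clock()

    def mark_succeeded(self) -> None:
        # Hypothetical terminal transition.
        self.status = AsyncJobStatus.SUCCEEDED


def reproduce_stale_heartbeat() -> SketchJob:
    job = SketchJob()

    def clock_with_interleaved_transition() -> datetime:
        # Simulates another thread completing the job after the status
        # check has passed but before the timestamp is written.
        job.mark_succeeded()
        return datetime.now(UTC)

    job._clock = clock_with_interleaved_transition
    job.record_heartbeat()  # passes the check, then writes after the transition
    return job
```

After `reproduce_stale_heartbeat()` returns, the job is terminal yet carries a heartbeat timestamp, which is the inconsistent state this report describes. A real reproduction against `AsyncJob` would need threads or a similar hook, since the actual method may not expose a clock.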

Suggested Fix

Add synchronization (for example, a lock shared by record_heartbeat() and the status transitions) or implement optimistic locking, so that heartbeat updates are applied only to jobs that are still in the RUNNING state at the time of the update.
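
One way the synchronization option might look for threads inside a single process is sketched below. It assumes every status change goes through one method that takes the same lock as the heartbeat. `LockedJob`, `transition_to()`, and the enum values are hypothetical names, not the existing `AsyncJob` interface.

```python
import threading
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

UTC = timezone.utc  # same object as datetime.UTC on Python 3.11+


class AsyncJobStatus(Enum):
    # Enum values are assumed for illustration.
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


class LockedJob:
    """Hypothetical job model whose status and heartbeat share one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.status = AsyncJobStatus.RUNNING
        self.last_heartbeat: Optional[datetime] = None

    def record_heartbeat(self) -> None:
        # Check and write happen under the lock, so no transition can
        # slip in between them.
        with self._lock:
            if self.status != AsyncJobStatus.RUNNING:
                raise ValueError(
                    f"Cannot record heartbeat for job in {self.status.value} state"
                )
            self.last_heartbeat = datetime.now(UTC)

    def transition_to(self, new_status: AsyncJobStatus) -> None:
        # Transitions take the same lock; this is what closes the window.
        with self._lock:
            self.status = new_status
```

A lock like this only protects threads that share the same in-memory object. If workers run in separate processes, the check has to move to wherever the job state is persisted.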

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Metadata

  • Branch: bugfix/concurrency-async-job-heartbeat-race-condition
  • Commit Message: fix(async_job): prevent race condition in AsyncJob.record_heartbeat() status check
  • Milestone: (none — backlog)
  • Parent Epic: (orphan — see comment below)

Subtasks

  • Reproduce the race condition with a concurrent test scenario
  • Implement atomic status check + heartbeat update (e.g., optimistic locking or synchronisation primitive)
  • Tests (Behave): Add BDD scenario for concurrent heartbeat + terminal-state transition
  • Tests (Robot): Add integration test for concurrent worker heartbeat handling
  • Verify coverage ≥ 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors
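
For the optimistic-locking direction named in the subtasks, a conditional update at the storage layer is one possible shape. The sketch below uses a compare-and-set on the status column rather than a full version-column scheme, and it uses SQLite purely for illustration. The `jobs` table, its columns, and the function names are assumptions; this issue does not say how `AsyncJob` is persisted.

```python
import sqlite3
from datetime import datetime, timezone


def create_jobs_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE jobs ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, last_heartbeat TEXT)"
    )


def record_heartbeat(conn: sqlite3.Connection, job_id: str) -> bool:
    """Write a heartbeat only if the job is still RUNNING.

    Returns True when the row was updated and False when the job was
    missing or had already left the RUNNING state.
    """
    now = datetime.now(timezone.utc).isoformat()
    # The status check and the write are a single statement, so the
    # database applies them atomically.
    cursor = conn.execute(
        "UPDATE jobs SET last_heartbeat = ? WHERE id = ? AND status = 'RUNNING'",
        (now, job_id),
    )
    conn.commit()
    return cursor.rowcount == 1
```

The caller decides what a `False` result means, for example raising the existing `ValueError` or quietly dropping the late heartbeat. Either choice would satisfy the "gracefully handle" wording under Expected Behavior.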

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged.
  • All nox stages pass.
  • Coverage >= 97%.

Backlog note: This issue was discovered during autonomous operation
on milestone v3.8.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

Author
Owner

⚠️ Orphan Issue — Needs Parent Epic Linking

This issue was created during autonomous bug hunting and no parent Epic was found for the AsyncJob / concurrency domain.

A human reviewer should:

  1. Identify or create an appropriate parent Epic for concurrency/async-job bugs
  2. Link this issue as a child that blocks the parent Epic using Forgejo's dependency system:
    # This issue (#7036) BLOCKS the parent Epic
    POST /api/v1/repos/cleveragents/cleveragents-core/issues/7036/blocks
    body: {"owner": "cleveragents", "repo": "cleveragents-core", "index": <PARENT_EPIC_NUMBER>}
    

Candidate parent Epics to consider:

  • Any open Epic covering concurrency bug fixes in the domain model layer
  • Issue #3901 (Refactor: Decompose async_job.py) may be related but is a refactor, not an Epic

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

Author
Owner

Verified — Concurrency bug: race condition in AsyncJob heartbeat recording. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#7036