[Bug Hunt][Cycle 2][Reactive] Race Condition in CancellationToken Reset #7166

Open
opened 2026-04-10 08:22:49 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Branch: bugfix/reactive-cancellation-token-reset-race-condition
  • Commit Message: fix(reactive): add lock synchronization to CancellationToken.reset() to prevent race condition
  • Milestone: (none — see backlog note below)
  • Parent Epic: #7052

Background and Context

CancellationToken.reset() in src/cleveragents/application/services/async_worker.py (lines 72–74) calls threading.Event.clear() without any synchronization guard. When another thread is concurrently processing a cancellation signal (e.g., checking is_cancelled, acting on wait(), or calling cancel()), the unsynchronized clear() can interleave with those operations and leave the token in an inconsistent state.

This is a classic TOCTOU / check-then-act race: a consumer thread may read is_cancelled == True, begin its cancellation teardown, and then have the state silently reset beneath it by a concurrent reset() call — causing the teardown to proceed against a token that now reports False.

Evidence

# src/cleveragents/application/services/async_worker.py, lines 72-74
# Reset without synchronization
def reset(self) -> None:
    """Reset the cancellation token."""
    self._cancelled.clear()

Issues identified:

  • threading.Event.clear() is not atomic with respect to compound read-modify-check sequences performed by callers.
  • No lock or memory barrier guards the transition from "cancelled" back to "not cancelled".
  • A thread blocked in wait() may be woken by set(), then observe clear() before it can act, effectively losing the cancellation signal.
  • Jobs that rely on is_cancelled checks in tight loops may skip cancellation entirely if reset() races with the check.

Impact

  • Inconsistent cancellation state: jobs may not be properly cancelled when reset() races with active cancellation processing.
  • Reliability degradation: async job cancellation becomes non-deterministic under concurrent load.
  • Silent failures: no exception is raised; the race manifests as incorrect runtime behaviour that is difficult to reproduce and diagnose.

Expected Behaviour

reset() must be safe to call from any thread at any time. The transition from cancelled → not-cancelled must be atomic with respect to all other CancellationToken operations. A standard approach is to protect the clear() call (and optionally all mutating operations) with a threading.Lock, or to redesign the token as an immutable value object that is replaced rather than mutated.

Acceptance Criteria

  • CancellationToken.reset() is protected by a lock (or equivalent synchronization primitive) so that it cannot interleave with concurrent cancel(), is_cancelled, or wait() calls.
  • All CancellationToken mutating operations (cancel(), reset()) acquire the same lock before modifying _cancelled.
  • Existing unit and integration tests continue to pass.
  • A new BDD scenario covers the concurrent reset/cancel race (using threading.Barrier or similar to force the interleaving).

Subtasks

  • Add a threading.Lock (or threading.RLock) to CancellationToken.__init__
  • Wrap reset() body in the lock's context manager
  • Wrap cancel() body in the lock's context manager for consistency
  • Review is_cancelled and wait() — determine if they also need lock protection or if threading.Event atomicity is sufficient for reads
  • Write a BDD Gherkin scenario in features/ that exercises the concurrent reset/cancel race condition
  • Implement the step definitions for the new scenario
  • Run nox full suite and confirm all stages pass
  • Confirm coverage remains ≥ 97%

Definition of Done

  • CancellationToken.reset() acquires a lock before calling _cancelled.clear()
  • CancellationToken.cancel() acquires the same lock before calling _cancelled.set()
  • New BDD scenario exists and passes, demonstrating thread-safe reset behaviour
  • No # type: ignore or suppression annotations introduced
  • All nox stages pass (nox with no arguments)
  • Coverage ≥ 97%
  • Documentation updated if public API contract changes

Backlog note: This issue was discovered during autonomous operation
on milestone v3.2.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: Acting on behalf of: Bug Hunt | Agent: new-issue-creator

## Metadata - **Branch**: `bugfix/reactive-cancellation-token-reset-race-condition` - **Commit Message**: `fix(reactive): add lock synchronization to CancellationToken.reset() to prevent race condition` - **Milestone**: *(none — see backlog note below)* - **Parent Epic**: #7052 ## Background and Context `CancellationToken.reset()` in `src/cleveragents/application/services/async_worker.py` (lines 72–74) calls `threading.Event.clear()` without any synchronization guard. When another thread is concurrently processing a cancellation signal (e.g., checking `is_cancelled`, acting on `wait()`, or calling `cancel()`), the unsynchronized `clear()` can interleave with those operations and leave the token in an inconsistent state. This is a classic TOCTOU / check-then-act race: a consumer thread may read `is_cancelled == True`, begin its cancellation teardown, and then have the state silently reset beneath it by a concurrent `reset()` call — causing the teardown to proceed against a token that now reports `False`. ## Evidence ```python # src/cleveragents/application/services/async_worker.py, lines 72-74 # Reset without synchronization def reset(self) -> None: """Reset the cancellation token.""" self._cancelled.clear() ``` **Issues identified:** - `threading.Event.clear()` is not atomic with respect to compound read-modify-check sequences performed by callers. - No lock or memory barrier guards the transition from "cancelled" back to "not cancelled". - A thread blocked in `wait()` may be woken by `set()`, then observe `clear()` before it can act, effectively losing the cancellation signal. - Jobs that rely on `is_cancelled` checks in tight loops may skip cancellation entirely if `reset()` races with the check. ## Impact - **Inconsistent cancellation state**: jobs may not be properly cancelled when `reset()` races with active cancellation processing. - **Reliability degradation**: async job cancellation becomes non-deterministic under concurrent load. - **Silent failures**: no exception is raised; the race manifests as incorrect runtime behaviour that is difficult to reproduce and diagnose. ## Expected Behaviour `reset()` must be safe to call from any thread at any time. The transition from cancelled → not-cancelled must be atomic with respect to all other `CancellationToken` operations. A standard approach is to protect the `clear()` call (and optionally all mutating operations) with a `threading.Lock`, or to redesign the token as an immutable value object that is replaced rather than mutated. ## Acceptance Criteria - `CancellationToken.reset()` is protected by a lock (or equivalent synchronization primitive) so that it cannot interleave with concurrent `cancel()`, `is_cancelled`, or `wait()` calls. - All `CancellationToken` mutating operations (`cancel()`, `reset()`) acquire the same lock before modifying `_cancelled`. - Existing unit and integration tests continue to pass. - A new BDD scenario covers the concurrent reset/cancel race (using `threading.Barrier` or similar to force the interleaving). ## Subtasks - [ ] Add a `threading.Lock` (or `threading.RLock`) to `CancellationToken.__init__` - [ ] Wrap `reset()` body in the lock's context manager - [ ] Wrap `cancel()` body in the lock's context manager for consistency - [ ] Review `is_cancelled` and `wait()` — determine if they also need lock protection or if `threading.Event` atomicity is sufficient for reads - [ ] Write a BDD Gherkin scenario in `features/` that exercises the concurrent reset/cancel race condition - [ ] Implement the step definitions for the new scenario - [ ] Run `nox` full suite and confirm all stages pass - [ ] Confirm coverage remains ≥ 97% ## Definition of Done - [ ] `CancellationToken.reset()` acquires a lock before calling `_cancelled.clear()` - [ ] `CancellationToken.cancel()` acquires the same lock before calling `_cancelled.set()` - [ ] New BDD scenario exists and passes, demonstrating thread-safe reset behaviour - [ ] No `# type: ignore` or suppression annotations introduced - [ ] All nox stages pass (`nox` with no arguments) - [ ] Coverage ≥ 97% - [ ] Documentation updated if public API contract changes > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.2.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- **Automated by CleverAgents Bot** Supervisor: Acting on behalf of: Bug Hunt | Agent: new-issue-creator
Author
Owner

Verified — Concurrency bug: race condition in CancellationToken Reset. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Concurrency bug: race condition in CancellationToken Reset. MoSCoW: Should-have. Priority: Medium. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core#7166
No description provided.