BUG-HUNT: [concurrency] LockService.acquire() has TOCTOU race condition — expired lock replacement not atomic #7406

Open
opened 2026-04-10 19:04:33 +00:00 by HAL9000 · 0 comments
Owner

Bug Report: Concurrency — TOCTOU Race in LockService.acquire()

Severity Assessment

  • Impact: Two concurrent callers can both believe they hold the same lock, leading to simultaneous execution of plan phases on the same resource — data corruption risk
  • Likelihood: Medium — requires two threads to call acquire() at exactly the same time on a resource whose lock has just expired
  • Priority: High

Location

  • File: src/cleveragents/application/services/lock_service.py
  • Function: LockService.acquire()
  • Lines: 126–184

Description

The acquire() method performs a check-then-act sequence without any database-level locking. The flow is:

  1. SELECT existing lock row
  2. Check if it's expired
  3. UPDATE/INSERT new lock row

Between steps 2 and 3, another concurrent caller can execute the same SELECT, also see the expired lock, and also proceed to UPDATE it. Both callers receive True (lock acquired), but only one physically wrote last — both callers think they hold the lock exclusively.

The root cause is string-based ISO timestamp comparison (existing_expires >= now_iso) combined with no SELECT FOR UPDATE or INSERT ON CONFLICT atomic upsert.

Evidence

# src/cleveragents/application/services/lock_service.py, lines 126–184
session = self._session_factory()
try:
    stmt = select(LockModel).where(
        LockModel.resource_type == resource_type,
        LockModel.resource_id == resource_id,
    )
    existing: LockModel | None = session.execute(stmt).scalar_one_or_none()

    if existing is not None:
        # ... check expiry
        if existing_expires >= now_iso:           # <-- TOCTOU: two threads both see expired
            raise LockConflictError(...)
        # Expired: replace (both threads reach here simultaneously)
        existing.owner_id = owner_id              # <-- both overwrite, both return True
        session.commit()
        return True

    # No existing lock — create new
    lock = LockModel(...)
    session.add(lock)
    session.commit()                              # <-- both threads reach here, both succeed
    return True

Expected Behavior

Exactly one of two concurrent callers racing on an expired lock should succeed; the other should either retry or receive LockConflictError.

Actual Behavior

Both callers can receive True simultaneously, believing they both hold the exclusive lock.

Suggested Fix

Use a database-level atomic upsert:

  • SQLite: INSERT OR REPLACE INTO locks ... WHERE expires_at < :now with row-level checking
  • Or use SELECT FOR UPDATE (PostgreSQL) / SQLAlchemy with_for_update() to serialize the read-modify-write
  • Alternatively wrap in a unique constraint on (resource_type, resource_id) and catch IntegrityError to detect conflicts

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor

## Bug Report: Concurrency — TOCTOU Race in LockService.acquire() ### Severity Assessment - **Impact**: Two concurrent callers can both believe they hold the same lock, leading to simultaneous execution of plan phases on the same resource — data corruption risk - **Likelihood**: Medium — requires two threads to call `acquire()` at exactly the same time on a resource whose lock has just expired - **Priority**: High ### Location - **File**: `src/cleveragents/application/services/lock_service.py` - **Function**: `LockService.acquire()` - **Lines**: 126–184 ### Description The `acquire()` method performs a **check-then-act** sequence without any database-level locking. The flow is: 1. SELECT existing lock row 2. Check if it's expired 3. UPDATE/INSERT new lock row Between steps 2 and 3, another concurrent caller can execute the same SELECT, also see the expired lock, and also proceed to UPDATE it. Both callers receive `True` (lock acquired), but only one physically wrote last — both callers think they hold the lock exclusively. The root cause is string-based ISO timestamp comparison (`existing_expires >= now_iso`) combined with no SELECT FOR UPDATE or INSERT ON CONFLICT atomic upsert. ### Evidence ```python # src/cleveragents/application/services/lock_service.py, lines 126–184 session = self._session_factory() try: stmt = select(LockModel).where( LockModel.resource_type == resource_type, LockModel.resource_id == resource_id, ) existing: LockModel | None = session.execute(stmt).scalar_one_or_none() if existing is not None: # ... check expiry if existing_expires >= now_iso: # <-- TOCTOU: two threads both see expired raise LockConflictError(...) # Expired: replace (both threads reach here simultaneously) existing.owner_id = owner_id # <-- both overwrite, both return True session.commit() return True # No existing lock — create new lock = LockModel(...) session.add(lock) session.commit() # <-- both threads reach here, both succeed return True ``` ### Expected Behavior Exactly one of two concurrent callers racing on an expired lock should succeed; the other should either retry or receive `LockConflictError`. ### Actual Behavior Both callers can receive `True` simultaneously, believing they both hold the exclusive lock. ### Suggested Fix Use a database-level atomic upsert: - SQLite: `INSERT OR REPLACE INTO locks ... WHERE expires_at < :now` with row-level checking - Or use `SELECT FOR UPDATE` (PostgreSQL) / SQLAlchemy `with_for_update()` to serialize the read-modify-write - Alternatively wrap in a unique constraint on `(resource_type, resource_id)` and catch `IntegrityError` to detect conflicts ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor
HAL9000 added this to the v3.2.0 milestone 2026-04-10 19:36:15 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7406
No description provided.