BUG-HUNT: [concurrency] resolve_handler() double-checked locking races on first call — stateful FsDirectoryHandler instances can be duplicated, losing checkpoint data #6533

Open
opened 2026-04-09 21:15:41 +00:00 by HAL9000 · 0 comments
Owner

Bug Report: [concurrency] — Double-Checked Locking Race in resolve_handler()

Severity Assessment

  • Impact: Two concurrent callers can each instantiate a separate FsDirectoryHandler (or other stateful handler). Checkpoints created via the first instance are invisible to the second instance, which becomes the canonical cached one. rollback_to() calls on the cached instance will silently report "checkpoint not found" for checkpoints that were actually created.
  • Likelihood: Low — requires concurrent first-time resolution of the same handler reference, but possible in multi-threaded plan execution
  • Priority: Medium

Location

  • File: src/cleveragents/resource/handlers/resolver.py
  • Function: resolve_handler
  • Lines: 47–121

Description

resolve_handler() uses a pattern that is not double-checked locking — it acquires the lock to read the cache, releases it, then does the potentially expensive work (import + instantiate), and re-acquires the lock to write:

# Read phase (lock held):
with _cache_lock:
    if handler_ref in _handler_cache:
        return _handler_cache[handler_ref]

# ← lock released here; another thread can also pass the cache miss check

# Expensive work (no lock):
module = importlib.import_module(module_path)
handler_cls = getattr(module, class_name, None)
instance = handler_cls()  # FsDirectoryHandler.__init__ runs here

# Write phase (lock held):
with _cache_lock:
    _handler_cache[handler_ref] = instance

If two threads both pass the cache-miss check concurrently, they both instantiate handler_cls(). The second instance overwrites the first in the cache. The first instance is returned to thread A, but thread B's write goes to the cache, so any subsequent call returns B's instance.

For stateless handlers (e.g., GitCheckoutHandler) this is harmless — both instances behave identically.

For FsDirectoryHandler, which maintains self._checkpoints: dict[str, str] as instance-level mutable state:

  • Thread A creates a checkpoint via handler_A.create_checkpoint(...), stored in handler_A._checkpoints.
  • Thread B's instance (handler_B) is now the cached instance.
  • A subsequent call to resolve_handler(...) returns handler_B.
  • handler_B.rollback_to(..., checkpoint_id=...) finds self._checkpoints empty and returns RollbackResult(success=False, message="Checkpoint … not found or expired").

Evidence

# resolver.py lines 65–68 — cache READ (lock held)
with _cache_lock:
    if handler_ref in _handler_cache:
        return _handler_cache[handler_ref]
# ← lock released; TOCTOU window opens here

# resolver.py lines 86–108 — expensive work (NO lock)
module = importlib.import_module(module_path)
handler_cls = getattr(module, class_name, None)
instance = handler_cls()   # ← FsDirectoryHandler.__init__ creates new empty _checkpoints

# resolver.py lines 117–120 — cache WRITE (lock held)
with _cache_lock:
    _handler_cache[handler_ref] = instance  # ← second writer wins, orphaning first instance
# fs_directory.py line 272 — FsDirectoryHandler has instance-level state
def __init__(self) -> None:
    self._checkpoints: dict[str, str] = {}  # ← per-instance; not shared between duplicates

Expected Behavior

The handler cache must be populated exactly once per handler_ref. All callers must receive the same instance for a given reference string.

Actual Behavior

Under concurrent first-call conditions, multiple instances can be created; only the last one is cached, orphaning earlier instances that may hold checkpoint state.

Suggested Fix

Use a true double-checked locking pattern: hold the lock for the entire check-and-insert operation, or use a per-key creation lock to serialise creation while allowing concurrent reads for different keys:

def resolve_handler(handler_ref: str) -> ResourceHandler:
    ...
    with _cache_lock:
        if handler_ref in _handler_cache:
            return _handler_cache[handler_ref]
        # Still holding the lock — perform import and instantiation here
        # to prevent concurrent creation of the same handler.
        module = importlib.import_module(module_path)
        handler_cls = getattr(module, class_name, None)
        ...
        instance = handler_cls()
        ...
        _handler_cache[handler_ref] = instance
        return instance

Note: holding the lock during importlib.import_module is acceptable since module import is itself idempotent and Python's import system uses its own lock (importlib._bootstrap._ModuleLock). The combined lock duration is short (module is typically already in sys.modules on the second call).

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: [concurrency] — Double-Checked Locking Race in `resolve_handler()` ### Severity Assessment - **Impact**: Two concurrent callers can each instantiate a separate `FsDirectoryHandler` (or other stateful handler). Checkpoints created via the first instance are invisible to the second instance, which becomes the canonical cached one. `rollback_to()` calls on the cached instance will silently report "checkpoint not found" for checkpoints that were actually created. - **Likelihood**: Low — requires concurrent first-time resolution of the same handler reference, but possible in multi-threaded plan execution - **Priority**: Medium ### Location - **File**: `src/cleveragents/resource/handlers/resolver.py` - **Function**: `resolve_handler` - **Lines**: 47–121 ### Description `resolve_handler()` uses a pattern that is **not** double-checked locking — it acquires the lock to read the cache, releases it, then does the potentially expensive work (import + instantiate), and re-acquires the lock to write: ```python # Read phase (lock held): with _cache_lock: if handler_ref in _handler_cache: return _handler_cache[handler_ref] # ← lock released here; another thread can also pass the cache miss check # Expensive work (no lock): module = importlib.import_module(module_path) handler_cls = getattr(module, class_name, None) instance = handler_cls() # FsDirectoryHandler.__init__ runs here # Write phase (lock held): with _cache_lock: _handler_cache[handler_ref] = instance ``` If two threads both pass the cache-miss check concurrently, they both instantiate `handler_cls()`. The second `instance` overwrites the first in the cache. The first instance is returned to thread A, but thread B's write goes to the cache, so any subsequent call returns B's instance. For **stateless handlers** (e.g., `GitCheckoutHandler`) this is harmless — both instances behave identically. For **`FsDirectoryHandler`**, which maintains `self._checkpoints: dict[str, str]` as instance-level mutable state: - Thread A creates a checkpoint via `handler_A.create_checkpoint(...)`, stored in `handler_A._checkpoints`. - Thread B's instance (`handler_B`) is now the cached instance. - A subsequent call to `resolve_handler(...)` returns `handler_B`. - `handler_B.rollback_to(..., checkpoint_id=...)` finds `self._checkpoints` empty and returns `RollbackResult(success=False, message="Checkpoint … not found or expired")`. ### Evidence ```python # resolver.py lines 65–68 — cache READ (lock held) with _cache_lock: if handler_ref in _handler_cache: return _handler_cache[handler_ref] # ← lock released; TOCTOU window opens here # resolver.py lines 86–108 — expensive work (NO lock) module = importlib.import_module(module_path) handler_cls = getattr(module, class_name, None) instance = handler_cls() # ← FsDirectoryHandler.__init__ creates new empty _checkpoints # resolver.py lines 117–120 — cache WRITE (lock held) with _cache_lock: _handler_cache[handler_ref] = instance # ← second writer wins, orphaning first instance ``` ```python # fs_directory.py line 272 — FsDirectoryHandler has instance-level state def __init__(self) -> None: self._checkpoints: dict[str, str] = {} # ← per-instance; not shared between duplicates ``` ### Expected Behavior The handler cache must be populated exactly once per `handler_ref`. All callers must receive the same instance for a given reference string. ### Actual Behavior Under concurrent first-call conditions, multiple instances can be created; only the last one is cached, orphaning earlier instances that may hold checkpoint state. ### Suggested Fix Use a true double-checked locking pattern: hold the lock for the entire check-and-insert operation, or use a per-key creation lock to serialise creation while allowing concurrent reads for different keys: ```python def resolve_handler(handler_ref: str) -> ResourceHandler: ... with _cache_lock: if handler_ref in _handler_cache: return _handler_cache[handler_ref] # Still holding the lock — perform import and instantiation here # to prevent concurrent creation of the same handler. module = importlib.import_module(module_path) handler_cls = getattr(module, class_name, None) ... instance = handler_cls() ... _handler_cache[handler_ref] = instance return instance ``` Note: holding the lock during `importlib.import_module` is acceptable since module import is itself idempotent and Python's import system uses its own lock (`importlib._bootstrap._ModuleLock`). The combined lock duration is short (module is typically already in `sys.modules` on the second call). ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter
HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:27:52 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#6533
No description provided.