BUG-HUNT: [concurrency] ErrorRecoveryService._get_or_create_history() is not thread-safe causing error history loss #7333

Open
opened 2026-04-10 17:28:03 +00:00 by HAL9000 · 1 comment
Owner

Bug Report: Concurrency — Unsynchronized Error History Creation

Severity Assessment

  • Impact: Two concurrent record_error() calls for the same plan_id can both see the history as absent, both create a new ErrorHistory, and one's error record will be stored in a history that gets immediately overwritten — that error is permanently lost from the in-memory history
  • Likelihood: Low-Medium — requires concurrent error recording for the same plan, possible with async workers
  • Priority: Medium

Location

  • File: src/cleveragents/application/services/error_recovery_service.py
  • Function/Class: ErrorRecoveryService._get_or_create_history()
  • Lines: 291–295

Description

The _get_or_create_history() method is a classic TOCTOU pattern without locking:

def _get_or_create_history(self, plan_id: str) -> ErrorHistory:
    """Get or create an error history for a plan."""
    if plan_id not in self._histories:     # ← Thread A and B both check this
        self._histories[plan_id] = ErrorHistory(plan_id=plan_id)  # ← both create
    return self._histories[plan_id]

The _histories dict is not protected by any lock. Two concurrent calls with the same plan_id:

  1. Thread A: plan_id not in self._histories → True, creates history_A = ErrorHistory(plan_id=plan_id)
  2. Thread B: plan_id not in self._histories → True (A hasn't assigned yet), creates history_B = ErrorHistory(plan_id=plan_id)
  3. Thread A: self._histories[plan_id] = history_A — error added to history_A at line 154
  4. Thread B: self._histories[plan_id] = history_B — overwrites! Error A is now lost!

Evidence

# error_recovery_service.py lines 291-295
def _get_or_create_history(self, plan_id: str) -> ErrorHistory:
    """Get or create an error history for a plan."""
    if plan_id not in self._histories:           # ← no lock!
        self._histories[plan_id] = ErrorHistory(plan_id=plan_id)
    return self._histories[plan_id]

Called from record_error() at line 130, where errors are recorded and added to the history:

history = self._get_or_create_history(plan_id)
# ... then ...
history.add(record)   # ← if history was overwritten, this record is lost!

The service is registered as a DI singleton and shared across multiple plan execution contexts.

Expected Behavior

History creation and error recording should be atomic with respect to concurrent access.

Actual Behavior

Concurrent history creation for the same plan can cause one thread's error records to be silently discarded.

Suggested Fix

Use dict.setdefault() which is atomic in CPython:

def _get_or_create_history(self, plan_id: str) -> ErrorHistory:
    """Get or create an error history for a plan (thread-safe via setdefault)."""
    return self._histories.setdefault(plan_id, ErrorHistory(plan_id=plan_id))

Note: This creates an ErrorHistory object unconditionally, then discards it if one already existed. A more efficient approach uses a threading lock. For CPython, dict.setdefault() is sufficient due to the GIL.

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor

## Bug Report: Concurrency — Unsynchronized Error History Creation ### Severity Assessment - **Impact**: Two concurrent `record_error()` calls for the same `plan_id` can both see the history as absent, both create a new `ErrorHistory`, and one's error record will be stored in a history that gets immediately overwritten — that error is permanently lost from the in-memory history - **Likelihood**: Low-Medium — requires concurrent error recording for the same plan, possible with async workers - **Priority**: Medium ### Location - **File**: `src/cleveragents/application/services/error_recovery_service.py` - **Function/Class**: `ErrorRecoveryService._get_or_create_history()` - **Lines**: 291–295 ### Description The `_get_or_create_history()` method is a classic TOCTOU pattern without locking: ```python def _get_or_create_history(self, plan_id: str) -> ErrorHistory: """Get or create an error history for a plan.""" if plan_id not in self._histories: # ← Thread A and B both check this self._histories[plan_id] = ErrorHistory(plan_id=plan_id) # ← both create return self._histories[plan_id] ``` The `_histories` dict is not protected by any lock. Two concurrent calls with the same `plan_id`: 1. Thread A: `plan_id not in self._histories` → True, creates `history_A = ErrorHistory(plan_id=plan_id)` 2. Thread B: `plan_id not in self._histories` → True (A hasn't assigned yet), creates `history_B = ErrorHistory(plan_id=plan_id)` 3. Thread A: `self._histories[plan_id] = history_A` — error added to `history_A` at line 154 4. Thread B: `self._histories[plan_id] = history_B` — overwrites! Error A is now lost! ### Evidence ```python # error_recovery_service.py lines 291-295 def _get_or_create_history(self, plan_id: str) -> ErrorHistory: """Get or create an error history for a plan.""" if plan_id not in self._histories: # ← no lock! self._histories[plan_id] = ErrorHistory(plan_id=plan_id) return self._histories[plan_id] ``` Called from `record_error()` at line 130, where errors are recorded and added to the history: ```python history = self._get_or_create_history(plan_id) # ... then ... history.add(record) # ← if history was overwritten, this record is lost! ``` The service is registered as a DI singleton and shared across multiple plan execution contexts. ### Expected Behavior History creation and error recording should be atomic with respect to concurrent access. ### Actual Behavior Concurrent history creation for the same plan can cause one thread's error records to be silently discarded. ### Suggested Fix Use `dict.setdefault()` which is atomic in CPython: ```python def _get_or_create_history(self, plan_id: str) -> ErrorHistory: """Get or create an error history for a plan (thread-safe via setdefault).""" return self._histories.setdefault(plan_id, ErrorHistory(plan_id=plan_id)) ``` Note: This creates an `ErrorHistory` object unconditionally, then discards it if one already existed. A more efficient approach uses a threading lock. For CPython, `dict.setdefault()` is sufficient due to the GIL. ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor
Author
Owner

Label Compliance Fix Needed

This issue is missing a State/ label*. Per CONTRIBUTING.md, every issue must have exactly one State/* label.

Current labels: Priority/Backlog, Type/Bug — missing State/*

Recommended fix: Add State/Unverified (id:846) as the default state.


Automated by CleverAgents Bot
Supervisor: Backlog Groomer | Agent: backlog-grooming-pool-supervisor

## Label Compliance Fix Needed This issue is missing a **State/* label**. Per CONTRIBUTING.md, every issue must have exactly one State/* label. Current labels: `Priority/Backlog`, `Type/Bug` — missing `State/*` **Recommended fix**: Add `State/Unverified` (id:846) as the default state. --- **Automated by CleverAgents Bot** Supervisor: Backlog Groomer | Agent: backlog-grooming-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7333
No description provided.