BUG-HUNT: [concurrency] Race condition in ActorRegistry.ensure_built_in_actors() #7095

Open
opened 2026-04-10 07:40:52 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Branch: bugfix/actor-registry-ensure-built-in-actors-race-condition
  • Commit Message: fix(actor): add synchronization to ActorRegistry.ensure_built_in_actors() to prevent concurrent duplicate upserts
  • Milestone: (none — backlog, see note below)
  • Parent Epic: #396

Backlog note: This issue was discovered during autonomous operation
on milestone v3.4.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.

Background and Context

The ActorRegistry.ensure_built_in_actors() method in src/cleveragents/actor/registry.py (lines ~70–110) is called from multiple public methods — get(), get_actor(), remove(), remove_actor(), set_default_actor(), and get_default_actor() — but lacks explicit synchronization. This creates a race condition where multiple threads can simultaneously attempt to create and upsert the same built-in actors.

In CleverAgents server mode, concurrent sessions and plan executions routinely call these public registry methods from different threads, making this race condition reproducible under normal production load.

Current Behavior

Multiple threads can enter ensure_built_in_actors() simultaneously. Each thread independently iterates through all configured providers and calls self._actor_service.upsert_actor(...) for each one. This results in:

  • Duplicate database operations: the same built-in actor is upserted multiple times concurrently
  • Potential constraint violations: depending on the upsert implementation, concurrent writes to the same actor record can cause integrity errors
  • Inconsistent state during actor creation: a thread reading the actor list mid-creation may observe a partially populated registry
  • Wasted CPU cycles: redundant work performed by every concurrent caller

def ensure_built_in_actors(self) -> list[Actor]:
    """Generate built-in actors from configured providers if missing."""

    configured: list[ProviderInfo] = (
        self._provider_registry.get_configured_providers()
    )
    if not configured:
        return []

    actors: list[Actor] = []
    # No explicit locking here: multiple threads reach this loop simultaneously.
    for info in configured:
        # Every concurrent caller upserts the same built-in actor.
        actor = self._actor_service.upsert_actor(...)
        actors.append(actor)
    return actors

Race Condition Scenario:

  1. Thread A calls registry.get("some_actor") → calls ensure_built_in_actors()
  2. Thread B calls registry.remove("other_actor") → calls ensure_built_in_actors() concurrently
  3. Both threads iterate through configured providers simultaneously
  4. Both attempt to upsert the same built-in actors, causing duplicate operations and potential constraint violations

Current Callers (all lacking coordination):

  • get() → ensure_built_in_actors()
  • get_actor() → ensure_built_in_actors()
  • remove() → ensure_built_in_actors()
  • remove_actor() → ensure_built_in_actors()
  • set_default_actor() → ensure_built_in_actors()
  • get_default_actor() → ensure_built_in_actors()
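
The scenario above can be reproduced in isolation with a small harness. The sketch below is hypothetical: `CountingActorService`, `UnsafeRegistry`, `run_concurrently`, and the provider names are stand-ins invented for illustration, not the real classes in registry.py. A `threading.Barrier` lines the threads up so that every caller runs the full upsert loop.

```python
import threading
from collections import Counter


class CountingActorService:
    """Hypothetical stand-in for the actor service; counts upserts per name."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # protects the counter only
        self.calls = Counter()

    def upsert_actor(self, name: str) -> str:
        with self._guard:
            self.calls[name] += 1
        return name


class UnsafeRegistry:
    """Mimics the unsynchronized loop described in this issue."""

    def __init__(self, service: CountingActorService, providers: list[str]) -> None:
        self._service = service
        self._providers = providers

    def ensure_built_in_actors(self) -> list[str]:
        # No lock: every caller upserts every provider.
        return [self._service.upsert_actor(p) for p in self._providers]


def run_concurrently(registry: UnsafeRegistry, n_threads: int) -> None:
    barrier = threading.Barrier(n_threads)

    def worker() -> None:
        barrier.wait()  # release all threads at the same moment
        registry.ensure_built_in_actors()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


service = CountingActorService()
run_concurrently(UnsafeRegistry(service, ["provider_a", "provider_b"]), n_threads=8)
print(dict(service.calls))
```

With 8 threads, each provider is upserted 8 times instead of once. The harness only counts calls; whether the real upsert raises an integrity error under the same load depends on the storage layer, which this sketch does not model.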

Expected Behavior

ensure_built_in_actors() should be safe to call concurrently from multiple threads. At most one thread should perform the upsert operations at a time, and subsequent callers should either wait for the first to complete or short-circuit if initialization has already occurred.
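
One way to get the "wait or short-circuit" behavior described above is a lock combined with a completion flag. This is a minimal sketch under stated assumptions, not the project's implementation: `LockedRegistry`, `_init_lock`, `_built_in_ready`, and the `upsert` callable are hypothetical names, and the callable stands in for `self._actor_service.upsert_actor(...)`.

```python
import threading


class LockedRegistry:
    """Hypothetical sketch: a lock plus a 'ready' flag around the upsert loop."""

    def __init__(self, upsert, providers) -> None:
        self._upsert = upsert                # stand-in for upsert_actor
        self._providers = list(providers)
        self._init_lock = threading.Lock()   # assumed name, not from the codebase
        self._built_in_ready = False
        self._built_in = []

    def ensure_built_in_actors(self) -> list:
        if self._built_in_ready:             # fast path: already initialized
            return list(self._built_in)
        with self._init_lock:                # late callers wait here
            if not self._built_in_ready:     # re-check once the lock is held
                self._built_in = [self._upsert(p) for p in self._providers]
                self._built_in_ready = True
            return list(self._built_in)


calls = []
registry = LockedRegistry(
    lambda name: calls.append(name) or name,  # record the call, return the name
    ["provider_a", "provider_b"],
)
threads = [threading.Thread(target=registry.ensure_built_in_actors) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(calls)
```

Only the first thread to take the lock runs the loop, so `calls` holds one entry per provider. The flag changes semantics compared with the current method, which re-reads the configured providers on every call. If providers can change at runtime, keeping the lock and dropping the flag may be the safer variant.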

Acceptance Criteria

  • ensure_built_in_actors() is protected by a threading.Lock (or equivalent) so that concurrent callers do not perform duplicate upserts
  • OR the method is made idempotent with proper duplicate-handling that is safe under concurrent access
  • OR built-in actor initialization is moved to registry construction time (called once, not on-demand)
  • No duplicate database operations occur under concurrent load
  • No constraint violations are raised from concurrent calls to ensure_built_in_actors()
  • All existing tests continue to pass
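
The idempotent-upsert option can be illustrated at the database level. The issue does not say which storage backend the actor service uses, so the sketch below assumes SQLite purely for illustration. The `actors` table, its columns, and this standalone `upsert_actor` helper are hypothetical. The `ON CONFLICT ... DO UPDATE` clause requires SQLite 3.24 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actors (name TEXT PRIMARY KEY, provider TEXT NOT NULL)"
)


def upsert_actor(conn: sqlite3.Connection, name: str, provider: str) -> None:
    """Insert the actor, or update it in place if the name already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO actors (name, provider) VALUES (?, ?) "
            "ON CONFLICT(name) DO UPDATE SET provider = excluded.provider",
            (name, provider),
        )


# Repeating the same upsert neither raises nor creates a second row.
upsert_actor(conn, "built_in_a", "provider_a")
upsert_actor(conn, "built_in_a", "provider_a")
row_count = conn.execute("SELECT COUNT(*) FROM actors").fetchone()[0]
print(row_count)
```

A single-statement upsert removes the integrity errors but not the redundant work, so it addresses the constraint-violation criterion more directly than the duplicate-operations one.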

Subtasks

  • Reproduce the race condition with a concurrent stress test (BDD scenario with @tdd_issue tags per bug fix workflow)
  • Choose and implement the appropriate synchronization strategy (method-level lock, idempotent upsert, or eager initialization)
  • Add threading.Lock to ActorRegistry if method-level locking is chosen
  • Verify no deadlock is introduced (callers of ensure_built_in_actors() must not hold the same lock)
  • Update unit tests (Behave BDD) to cover concurrent access scenarios
  • Update integration tests (Robot Framework) if applicable
  • Run nox -s coverage_report and confirm coverage ≥ 97%
  • Run full nox suite and confirm all stages pass
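
The deadlock subtask matters because `threading.Lock` is not reentrant. If a caller such as `get()` already held the lock that `ensure_built_in_actors()` acquires, the second acquisition on the same thread would block forever. The sketch below shows the difference from `threading.RLock`, using non-blocking acquires so that the demonstration returns instead of hanging. Whether the registry's callers hold any lock today is not stated in this issue.

```python
import threading

plain = threading.Lock()
reentrant = threading.RLock()

with plain:
    # Same thread tries to take the lock again. A blocking acquire would hang
    # here, so blocking=False is used to observe the failure safely.
    plain_reacquired = plain.acquire(blocking=False)

with reentrant:
    # An RLock can be re-acquired by the thread that already owns it.
    reentrant_reacquired = reentrant.acquire(blocking=False)
    if reentrant_reacquired:
        reentrant.release()  # balance the nested acquire

print(plain_reacquired, reentrant_reacquired)  # prints "False True"
```

If method-level locking is chosen, either the callers must not hold the same lock when they call `ensure_built_in_actors()`, or a reentrant lock is needed.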

Definition of Done

  • All subtasks above are complete
  • Commit message first line matches exactly: fix(actor): add synchronization to ActorRegistry.ensure_built_in_actors() to prevent concurrent duplicate upserts
  • Changes pushed to branch bugfix/actor-registry-ensure-built-in-actors-race-condition
  • PR submitted, reviewed, and merged to master
  • @tdd_expected_fail tag removed from TDD test after fix is verified
  • All nox stages pass
  • Coverage ≥ 97%

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

Author
Owner

Verified — Concurrency bug: race condition in ActorRegistry.ensure_built_in_actors(). MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

No milestone
No project
No assignees
1 participant
Blocks
#396 Epic: ACMS Context Pipeline
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#7095