Bug: Non-thread-safe singleton in get_provider_registry() allows duplicate registry instances under concurrent access #10478

Open
opened 2026-04-18 10:04:48 +00:00 by HAL9000 · 2 comments
Owner

Metadata

Commit message (first line): fix(providers): add threading lock to get_provider_registry() singleton
Branch: bugfix/mN-registry-thread-safety
Blocked by: #10409 (TDD issue — must be merged first)

Background and Context

The get_provider_registry() function in src/cleveragents/providers/registry.py implements a module-level singleton pattern without any thread synchronisation. In a multi-threaded runtime (e.g. concurrent agent workers, async task pools with thread executors), two threads can simultaneously observe _registry is None, both construct a ProviderRegistry, and one silently overwrites the other. The second registry instance is then orphaned, causing any state accumulated in the first instance (registered providers, cached capabilities) to be lost.

Exact Code Evidence

File: src/cleveragents/providers/registry.py, lines 750–774

# Global registry instance
_registry: ProviderRegistry | None = None


def get_provider_registry(settings: Settings | None = None) -> ProviderRegistry:
    """Get the global provider registry instance."""
    global _registry
    if _registry is None or settings is not None:   # ← no lock here
        _registry = ProviderRegistry(settings)       # ← race window
    return _registry


def reset_provider_registry() -> None:
    """Reset the global provider registry."""
    global _registry
    _registry = None                                 # ← also unprotected

There is no threading.Lock guarding the read-check-write sequence on _registry. The reset_provider_registry() function is similarly unprotected.

Impact

  • Severity: HIGH
  • Affected module: cleveragents.providers.registry
  • Reproducible in: Any multi-threaded execution context (concurrent agent workers, thread pool executors)
  • Consequence: Multiple ProviderRegistry instances created; registered providers, cost tables, and capability caches from the first instance are silently discarded when the second instance overwrites _registry

Specification Violation

The specification requires that provider registry state be consistent and reliable. Duplicate registry instances violate the singleton contract and can cause non-deterministic provider selection behaviour.

Proposed Fix

import threading

_registry: ProviderRegistry | None = None
_registry_lock: threading.Lock = threading.Lock()


def get_provider_registry(settings: Settings | None = None) -> ProviderRegistry:
    global _registry
    with _registry_lock:
        if _registry is None or settings is not None:
            _registry = ProviderRegistry(settings)
        return _registry


def reset_provider_registry() -> None:
    global _registry
    with _registry_lock:
        _registry = None

Expected Behavior

get_provider_registry() and reset_provider_registry() are fully thread-safe: all reads and writes to _registry are protected by _registry_lock, ensuring only one ProviderRegistry instance is ever created under concurrent access.

Acceptance Criteria

  • _registry_lock: threading.Lock is declared at module level in registry.py
  • All reads and writes to _registry in get_provider_registry() are wrapped with _registry_lock
  • All writes to _registry in reset_provider_registry() are wrapped with _registry_lock
  • The @tdd_issue_10409 Behave scenario passes without @tdd_expected_fail
  • All nox sessions pass (unit_tests, typecheck, coverage_report)
  • Coverage remains ≥ 97%

Subtasks

  • Add _registry_lock: threading.Lock module-level variable
  • Wrap the check-and-assign in get_provider_registry() with _registry_lock
  • Wrap the assignment in reset_provider_registry() with _registry_lock
  • Write/update Behave scenario (remove @tdd_expected_fail from @tdd_issue_10409 test)
  • Verify nox -s unit_tests passes
  • Verify nox -s typecheck passes (Pyright strict)
  • Verify nox -s coverage_report ≥ 97%

Definition of Done

  • _registry_lock guards all reads and writes to _registry
  • @tdd_issue_10409 scenario passes without @tdd_expected_fail
  • All nox sessions green
  • PR closes this issue with "Closes #<this_issue_number>"

Automated by CleverAgents Bot
Agent: new-issue-creator
Supervisor: Bug Hunt Pool | Agent: bug-hunt-pool-supervisor [AUTO-BUG-8]

## Metadata **Commit message (first line):** `fix(providers): add threading lock to get_provider_registry() singleton` **Branch:** `bugfix/mN-registry-thread-safety` **Blocked by:** #10409 (TDD issue — must be merged first) ## Background and Context The `get_provider_registry()` function in `src/cleveragents/providers/registry.py` implements a module-level singleton pattern without any thread synchronisation. In a multi-threaded runtime (e.g. concurrent agent workers, async task pools with thread executors), two threads can simultaneously observe `_registry is None`, both construct a `ProviderRegistry`, and one silently overwrites the other. The second registry instance is then orphaned, causing any state accumulated in the first instance (registered providers, cached capabilities) to be lost. ## Exact Code Evidence **File:** `src/cleveragents/providers/registry.py`, lines 750–774 ```python # Global registry instance _registry: ProviderRegistry | None = None def get_provider_registry(settings: Settings | None = None) -> ProviderRegistry: """Get the global provider registry instance.""" global _registry if _registry is None or settings is not None: # ← no lock here _registry = ProviderRegistry(settings) # ← race window return _registry def reset_provider_registry() -> None: """Reset the global provider registry.""" global _registry _registry = None # ← also unprotected ``` There is no `threading.Lock` guarding the read-check-write sequence on `_registry`. The `reset_provider_registry()` function is similarly unprotected. ## Impact - **Severity:** HIGH - **Affected module:** `cleveragents.providers.registry` - **Reproducible in:** Any multi-threaded execution context (concurrent agent workers, thread pool executors) - **Consequence:** Multiple `ProviderRegistry` instances created; registered providers, cost tables, and capability caches from the first instance are silently discarded when the second instance overwrites `_registry` ## Specification Violation The specification requires that provider registry state be consistent and reliable. Duplicate registry instances violate the singleton contract and can cause non-deterministic provider selection behaviour. ## Proposed Fix ```python import threading _registry: ProviderRegistry | None = None _registry_lock: threading.Lock = threading.Lock() def get_provider_registry(settings: Settings | None = None) -> ProviderRegistry: global _registry with _registry_lock: if _registry is None or settings is not None: _registry = ProviderRegistry(settings) return _registry def reset_provider_registry() -> None: global _registry with _registry_lock: _registry = None ``` ## Expected Behavior `get_provider_registry()` and `reset_provider_registry()` are fully thread-safe: all reads and writes to `_registry` are protected by `_registry_lock`, ensuring only one `ProviderRegistry` instance is ever created under concurrent access. ## Acceptance Criteria - `_registry_lock: threading.Lock` is declared at module level in `registry.py` - All reads and writes to `_registry` in `get_provider_registry()` are wrapped with `_registry_lock` - All writes to `_registry` in `reset_provider_registry()` are wrapped with `_registry_lock` - The `@tdd_issue_10409` Behave scenario passes without `@tdd_expected_fail` - All nox sessions pass (`unit_tests`, `typecheck`, `coverage_report`) - Coverage remains ≥ 97% ## Subtasks - [x] Add `_registry_lock: threading.Lock` module-level variable - [x] Wrap the check-and-assign in `get_provider_registry()` with `_registry_lock` - [x] Wrap the assignment in `reset_provider_registry()` with `_registry_lock` - [x] Write/update Behave scenario (remove `@tdd_expected_fail` from `@tdd_issue_10409` test) - [x] Verify `nox -s unit_tests` passes - [x] Verify `nox -s typecheck` passes (Pyright strict) - [x] Verify `nox -s coverage_report` ≥ 97% ## Definition of Done - [ ] `_registry_lock` guards all reads and writes to `_registry` - [ ] `@tdd_issue_10409` scenario passes without `@tdd_expected_fail` - [ ] All nox sessions green - [ ] PR closes this issue with "Closes #<this_issue_number>" --- **Automated by CleverAgents Bot** Agent: new-issue-creator Supervisor: Bug Hunt Pool | Agent: bug-hunt-pool-supervisor [AUTO-BUG-8]
Author
Owner

[GROOMED] Quality Analysis Complete

Issue Validity: CONFIRMED

This is a real, actionable, and critical bug with clear evidence and impact.

Summary: The get_provider_registry() function in src/cleveragents/providers/registry.py (lines 750–774) implements a module-level singleton pattern without thread synchronization. Under concurrent access (multi-threaded executors, async task pools), a race condition allows two threads to simultaneously observe _registry is None, both construct a ProviderRegistry, and one silently overwrites the other. This causes loss of registered providers, cached capabilities, and state from the first instance.

📋 Label Audit

Current Labels:

  • State/Unverified (id: 846) — SHOULD BE CHANGED TO State/Verified
  • Type/Bug (id: 849) — Correct
  • Priority/Critical (id: 858) — Correct; FLAGGED FOR IMMEDIATE ATTENTION
  • MoSCoW/Must have (id: 883) — Present (applied by issue creator)

Label Status: 3/4 correct. Requires state transition from Unverified → Verified.

🎯 Triage Assessment

Criterion Status Notes
Validity PASS Real bug with reproducible race condition
Actionability PASS Clear fix, acceptance criteria, subtasks provided
Duplicate PASS Not a duplicate
Required Labels ⚠️ PARTIAL Missing State/Verified; has all other required labels
Milestone MISSING Not assigned to any milestone
Assignee MISSING Not assigned

🚨 Priority/Critical Flag

This issue is Priority/Critical and currently unassigned. It requires immediate human attention for:

  1. Milestone assignment (recommend v3.2.0 or v3.5.0 based on scope)
  2. Developer assignment
  3. State transition to Verified

📝 Issue Details

Blocked by: #10409 (TDD issue — must be merged first)

Proposed Fix: Add threading.Lock to guard all reads/writes to _registry:

import threading

_registry: ProviderRegistry | None = None
_registry_lock: threading.Lock = threading.Lock()

def get_provider_registry(settings: Settings | None = None) -> ProviderRegistry:
    global _registry
    with _registry_lock:
        if _registry is None or settings is not None:
            _registry = ProviderRegistry(settings)
        return _registry

def reset_provider_registry() -> None:
    global _registry
    with _registry_lock:
        _registry = None

Acceptance Criteria: 6 subtasks defined; all nox sessions must pass with ≥97% coverage.

Recommendation

MOVE TO STATE/VERIFIED — This issue is well-formed, actionable, and ready for development. Assign to a milestone and developer immediately.


Automated by CleverAgents Bot
Supervisor: Grooming | Agent: grooming-pool-supervisor

## [GROOMED] Quality Analysis Complete ### ✅ Issue Validity: CONFIRMED This is a **real, actionable, and critical bug** with clear evidence and impact. **Summary:** The `get_provider_registry()` function in `src/cleveragents/providers/registry.py` (lines 750–774) implements a module-level singleton pattern without thread synchronization. Under concurrent access (multi-threaded executors, async task pools), a race condition allows two threads to simultaneously observe `_registry is None`, both construct a `ProviderRegistry`, and one silently overwrites the other. This causes loss of registered providers, cached capabilities, and state from the first instance. ### 📋 Label Audit **Current Labels:** - ✅ `State/Unverified` (id: 846) — **SHOULD BE CHANGED TO `State/Verified`** - ✅ `Type/Bug` (id: 849) — Correct - ✅ `Priority/Critical` (id: 858) — Correct; **FLAGGED FOR IMMEDIATE ATTENTION** - ✅ `MoSCoW/Must have` (id: 883) — Present (applied by issue creator) **Label Status:** 3/4 correct. Requires state transition from Unverified → Verified. ### 🎯 Triage Assessment | Criterion | Status | Notes | |-----------|--------|-------| | **Validity** | ✅ PASS | Real bug with reproducible race condition | | **Actionability** | ✅ PASS | Clear fix, acceptance criteria, subtasks provided | | **Duplicate** | ✅ PASS | Not a duplicate | | **Required Labels** | ⚠️ PARTIAL | Missing State/Verified; has all other required labels | | **Milestone** | ❌ MISSING | Not assigned to any milestone | | **Assignee** | ❌ MISSING | Not assigned | ### 🚨 Priority/Critical Flag **This issue is Priority/Critical and currently unassigned.** It requires immediate human attention for: 1. Milestone assignment (recommend v3.2.0 or v3.5.0 based on scope) 2. Developer assignment 3. State transition to Verified ### 📝 Issue Details **Blocked by:** #10409 (TDD issue — must be merged first) **Proposed Fix:** Add `threading.Lock` to guard all reads/writes to `_registry`: ```python import threading _registry: ProviderRegistry | None = None _registry_lock: threading.Lock = threading.Lock() def get_provider_registry(settings: Settings | None = None) -> ProviderRegistry: global _registry with _registry_lock: if _registry is None or settings is not None: _registry = ProviderRegistry(settings) return _registry def reset_provider_registry() -> None: global _registry with _registry_lock: _registry = None ``` **Acceptance Criteria:** 6 subtasks defined; all nox sessions must pass with ≥97% coverage. ### ✨ Recommendation **MOVE TO STATE/VERIFIED** — This issue is well-formed, actionable, and ready for development. Assign to a milestone and developer immediately. --- **Automated by CleverAgents Bot** Supervisor: Grooming | Agent: grooming-pool-supervisor
Author
Owner

Implementation Attempt — Tier 1: Haiku — Success

Implemented the thread-safety fix for get_provider_registry() singleton.

Changes made:

  • Added import threading to src/cleveragents/providers/registry.py
  • Added _registry_lock: threading.Lock = threading.Lock() module-level variable
  • Wrapped get_provider_registry() body with with _registry_lock: to ensure atomic singleton creation
  • Wrapped reset_provider_registry() body with with _registry_lock: to ensure atomic state reset
  • Added TDD feature file features/tdd_registry_thread_safety.feature with @tdd_issue @tdd_issue_10409 @mock_only tags
  • Added step definitions features/steps/tdd_registry_thread_safety_steps.py

Quality gate status:

  • lint ✓
  • typecheck ✓ (0 errors, 3 warnings for missing module sources)
  • unit_tests ✓ (15,238 scenarios, 0 failed)
  • coverage_report ✓ 97.1% (threshold: 97%)

PR: #10742#10742


Automated by CleverAgents Bot
Supervisor: Implementation Pool | Agent: implementation-worker

**Implementation Attempt** — Tier 1: Haiku — Success Implemented the thread-safety fix for `get_provider_registry()` singleton. **Changes made:** - Added `import threading` to `src/cleveragents/providers/registry.py` - Added `_registry_lock: threading.Lock = threading.Lock()` module-level variable - Wrapped `get_provider_registry()` body with `with _registry_lock:` to ensure atomic singleton creation - Wrapped `reset_provider_registry()` body with `with _registry_lock:` to ensure atomic state reset - Added TDD feature file `features/tdd_registry_thread_safety.feature` with `@tdd_issue @tdd_issue_10409 @mock_only` tags - Added step definitions `features/steps/tdd_registry_thread_safety_steps.py` **Quality gate status:** - lint ✓ - typecheck ✓ (0 errors, 3 warnings for missing module sources) - unit_tests ✓ (15,238 scenarios, 0 failed) - coverage_report ✓ 97.1% (threshold: 97%) **PR:** #10742 — https://git.cleverthis.com/cleveragents/cleveragents-core/pulls/10742 --- **Automated by CleverAgents Bot** Supervisor: Implementation Pool | Agent: implementation-worker
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#10478
No description provided.