BUG-HUNT: [concurrency] Settings.get_settings() has TOCTOU race — concurrent callers can construct two Settings instances, leaving some threads with stale/inconsistent config #6566

Open
opened 2026-04-09 21:27:37 +00:00 by HAL9000 · 0 comments
Owner

Bug Report: [concurrency] — Settings.get_settings() TOCTOU Race Condition

Severity Assessment

  • Impact: Two threads concurrently calling get_settings() for the first time can each construct a separate Settings instance. The last to finish overwrites _instance, but any thread holding a reference to the first instance will use a different config object than the rest of the application — potentially with different provider API keys, log levels, or feature flags.
  • Likelihood: Low in normal single-process startup, but real in async/threaded server contexts (e.g., FastAPI startup with multiple workers calling get_container() which calls get_settings() indirectly) or unit tests that call Settings.reset() followed by concurrent re-initialization.
  • Priority: Medium

Location

  • File: src/cleveragents/config/settings.py
  • Function/Class: Settings.get_settings()
  • Lines: 716–720

Description

The singleton accessor uses an unprotected double-check-then-set pattern without any lock:

@classmethod
def get_settings(cls: type[Settings]) -> Settings:
    """Return the cached singleton instance (creating it if needed)."""
    if cls._instance is None:        # ← Thread A and Thread B can both pass here
        cls._instance = cls()        # ← Both construct, last one wins
    return cls._instance             # ← Thread A may now return stale first instance

Because cls() (Pydantic BaseSettings construction + model_post_init) is not atomic, two threads can both see _instance is None and both enter the constructor. The second thread to finish assignment to cls._instance wins. The first thread has already captured a reference to its instance and returns it — while all subsequent calls return the second instance. Both live instances apply _apply_external_env_overrides() and _synchronize_langsmith_environment() independently, potentially in different environment states if a .env file is loaded concurrently.

A similar bug in the provider registry was already filed (#[concurrency] get_provider_registry() global singleton has TOCTOU race condition) — this is the same pattern in Settings.

Evidence

# src/cleveragents/config/settings.py lines 716-720
@classmethod
def get_settings(cls: type[Settings]) -> Settings:
    """Return the cached singleton instance (creating it if needed)."""
    if cls._instance is None:
        cls._instance = cls()
    return cls._instance

There is no threading.Lock guarding the check-and-set. Settings.reset() (line 733) also sets cls._instance = None without a lock, meaning a concurrent get_settings() call that just captured _instance may use a now-invalidated instance.

Expected Behavior

get_settings() must be idempotent and safe to call from concurrent threads. Only one Settings instance should ever be created per process lifecycle, regardless of concurrency. All callers must receive the same instance.

Actual Behavior

Under concurrent first-call conditions, two Settings instances can be constructed. Whichever thread finishes last sets _instance, while any thread finishing first returns a different (now unreferenced from the class) instance. This instance diverges on any mutable state set during model_post_init (e.g., _langsmith_validation_errors).

Suggested Fix

Protect get_settings() and reset() with a class-level threading.Lock:

_instance_lock: ClassVar[threading.Lock] = threading.Lock()

@classmethod
def get_settings(cls) -> Settings:
    if cls._instance is None:
        with cls._instance_lock:
            if cls._instance is None:     # double-checked locking
                cls._instance = cls()
    return cls._instance

@classmethod
def reset(cls) -> None:
    with cls._instance_lock:
        cls._instance = None

Category

concurrency

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: [concurrency] — `Settings.get_settings()` TOCTOU Race Condition ### Severity Assessment - **Impact**: Two threads concurrently calling `get_settings()` for the first time can each construct a separate `Settings` instance. The last to finish overwrites `_instance`, but any thread holding a reference to the *first* instance will use a different config object than the rest of the application — potentially with different provider API keys, log levels, or feature flags. - **Likelihood**: Low in normal single-process startup, but real in async/threaded server contexts (e.g., FastAPI startup with multiple workers calling `get_container()` which calls `get_settings()` indirectly) or unit tests that call `Settings.reset()` followed by concurrent re-initialization. - **Priority**: Medium ### Location - **File**: `src/cleveragents/config/settings.py` - **Function/Class**: `Settings.get_settings()` - **Lines**: 716–720 ### Description The singleton accessor uses an unprotected double-check-then-set pattern without any lock: ```python @classmethod def get_settings(cls: type[Settings]) -> Settings: """Return the cached singleton instance (creating it if needed).""" if cls._instance is None: # ← Thread A and Thread B can both pass here cls._instance = cls() # ← Both construct, last one wins return cls._instance # ← Thread A may now return stale first instance ``` Because `cls()` (Pydantic `BaseSettings` construction + `model_post_init`) is **not** atomic, two threads can both see `_instance is None` and both enter the constructor. The second thread to finish assignment to `cls._instance` wins. The first thread has already captured a reference to *its* instance and returns it — while all subsequent calls return the second instance. Both live instances apply `_apply_external_env_overrides()` and `_synchronize_langsmith_environment()` independently, potentially in different environment states if a `.env` file is loaded concurrently. A similar bug in the provider registry was already filed (#[concurrency] `get_provider_registry()` global singleton has TOCTOU race condition) — this is the same pattern in `Settings`. ### Evidence ```python # src/cleveragents/config/settings.py lines 716-720 @classmethod def get_settings(cls: type[Settings]) -> Settings: """Return the cached singleton instance (creating it if needed).""" if cls._instance is None: cls._instance = cls() return cls._instance ``` There is no `threading.Lock` guarding the check-and-set. `Settings.reset()` (line 733) also sets `cls._instance = None` without a lock, meaning a concurrent `get_settings()` call that just captured `_instance` may use a now-invalidated instance. ### Expected Behavior `get_settings()` must be idempotent and safe to call from concurrent threads. Only one `Settings` instance should ever be created per process lifecycle, regardless of concurrency. All callers must receive the same instance. ### Actual Behavior Under concurrent first-call conditions, two `Settings` instances can be constructed. Whichever thread finishes last sets `_instance`, while any thread finishing first returns a *different* (now unreferenced from the class) instance. This instance diverges on any mutable state set during `model_post_init` (e.g., `_langsmith_validation_errors`). ### Suggested Fix Protect `get_settings()` and `reset()` with a class-level `threading.Lock`: ```python _instance_lock: ClassVar[threading.Lock] = threading.Lock() @classmethod def get_settings(cls) -> Settings: if cls._instance is None: with cls._instance_lock: if cls._instance is None: # double-checked locking cls._instance = cls() return cls._instance @classmethod def reset(cls) -> None: with cls._instance_lock: cls._instance = None ``` ### Category concurrency ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter
HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:31:43 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#6566
No description provided.