fix(di): get_container() permanently caches failed audit subscriber in Singleton provider #1223

2026-03-31T08:37:25Z

brent.edwards commented

2026-03-31 08:37:25 +00:00

Summary

Fixes bug #992: get_container() permanently caches a failed audit subscriber initialization, preventing retry in long-lived server processes.

Introduced a module-level _audit_subscriber_initialized flag that tracks whether audit_event_subscriber() has succeeded
get_container() now retries audit subscriber initialization on every call until it succeeds, instead of permanently caching the failure
reset_container() resets the flag to ensure clean state for tests
Updated the log message from "will be registered on first AuditEventSubscriber access" to "will be retried on next get_container() call" to accurately reflect the new retry behavior

Motivation

In the CLI, each invocation creates a new process so the cached failure is discarded naturally. In a long-lived server process, however, the audit subscriber would remain uninitialized for the process lifetime because _container is None is False after the first call, skipping all retry logic.

Approach

The simplest and most backward-compatible approach: separate container creation (still happens only once) from audit subscriber initialization (retried until successful). A boolean flag _audit_subscriber_initialized guards the retry — when False, each get_container() call reattempts the initialization.

Alternatives considered:

Lazy callback when DB becomes available — overly complex for this fix
Explicit reinitialize_audit() method — shifts responsibility to callers

TDD Test

Includes the TDD regression test from issue #1096 (branch tdd/m6-di-audit-cache-failure) with the @tdd_expected_fail tag removed per the Bug Fix Workflow in CONTRIBUTING.md. The test verifies that get_container() retries audit subscriber initialization after an initial failure.

Test Results

All 11 nox sessions pass:

lint: ✅
format: ✅
typecheck: ✅
security_scan: ✅
dead_code: ✅
unit_tests: ✅
integration_tests: ✅
docs: ✅
build: ✅
benchmark: ✅
coverage_report: ✅ (97.0% — meets ≥97% threshold)

Files Changed

src/cleveragents/application/container.py — Bug fix: retry audit subscriber init
features/tdd_di_audit_cache_failure.feature — TDD regression test (from #1096, @tdd_expected_fail removed)
features/steps/tdd_di_audit_cache_failure_steps.py — Step definitions for the regression test
benchmarks/bench_audit_subscriber_retry.py — ASV benchmark for retry-path performance

Closes #992

## Summary Fixes bug #992: `get_container()` permanently caches a failed audit subscriber initialization, preventing retry in long-lived server processes. - Introduced a module-level `_audit_subscriber_initialized` flag that tracks whether `audit_event_subscriber()` has succeeded - `get_container()` now retries audit subscriber initialization on every call until it succeeds, instead of permanently caching the failure - `reset_container()` resets the flag to ensure clean state for tests - Updated the log message from "will be registered on first AuditEventSubscriber access" to "will be retried on next get_container() call" to accurately reflect the new retry behavior ## Motivation In the CLI, each invocation creates a new process so the cached failure is discarded naturally. In a long-lived server process, however, the audit subscriber would remain uninitialized for the process lifetime because `_container is None` is `False` after the first call, skipping all retry logic. ## Approach The simplest and most backward-compatible approach: separate container creation (still happens only once) from audit subscriber initialization (retried until successful). A boolean flag `_audit_subscriber_initialized` guards the retry — when `False`, each `get_container()` call reattempts the initialization. Alternatives considered: - Lazy callback when DB becomes available — overly complex for this fix - Explicit `reinitialize_audit()` method — shifts responsibility to callers ## TDD Test Includes the TDD regression test from issue #1096 (branch `tdd/m6-di-audit-cache-failure`) with the `@tdd_expected_fail` tag removed per the Bug Fix Workflow in `CONTRIBUTING.md`. The test verifies that `get_container()` retries audit subscriber initialization after an initial failure. ## Test Results All 11 nox sessions pass: - `lint`: ✅ - `format`: ✅ - `typecheck`: ✅ - `security_scan`: ✅ - `dead_code`: ✅ - `unit_tests`: ✅ - `integration_tests`: ✅ - `docs`: ✅ - `build`: ✅ - `benchmark`: ✅ - `coverage_report`: ✅ (97.0% — meets ≥97% threshold) ## Files Changed - `src/cleveragents/application/container.py` — Bug fix: retry audit subscriber init - `features/tdd_di_audit_cache_failure.feature` — TDD regression test (from #1096, `@tdd_expected_fail` removed) - `features/steps/tdd_di_audit_cache_failure_steps.py` — Step definitions for the regression test - `benchmarks/bench_audit_subscriber_retry.py` — ASV benchmark for retry-path performance Closes #992

brent.edwards added this to the v3.6.0 milestone 2026-03-31 08:37:31 +00:00

brent.edwards added the

Type

Bug

label 2026-03-31 08:37:32 +00:00

brent.edwards referenced this pull request

2026-03-31 08:38:00 +00:00

bug(di): get_container() permanently caches failed audit subscriber initialization #992

freemo approved these changes 2026-04-02 04:21:08 +00:00

Dismissed

freemo left a comment

Review: APPROVED ✅

PR #1223 — fix(di): get_container() permanently caches failed audit subscriber in Singleton provider

What was reviewed

src/cleveragents/application/container.py — Added _audit_subscriber_initialized flag with retry logic
features/tdd_di_audit_cache_failure.feature — TDD regression test (from #1096)
features/steps/tdd_di_audit_cache_failure_steps.py — Step definitions with FakeContainer mock
benchmarks/bench_audit_subscriber_retry.py — ASV benchmark for retry-path performance

Assessment

Spec alignment: ✅ Fixes long-lived server process behavior per the DI container design
Code quality: ✅
- Simple boolean flag approach — minimal and clean
- global _container, _audit_subscriber_initialized — correct usage
- Retry only when not _audit_subscriber_initialized — avoids unnecessary work on hot path
- reset_container() properly resets both globals
- Log message updated to reflect retry behavior ("will be retried on next get_container() call")
- No # type: ignore suppressions
Test coverage: ✅
- FakeContainer with audit_init_attempts counter — good mock design
- Verifies singleton behavior AND retry count
- ASV benchmark included for performance regression tracking
Design note: The _audit_subscriber_initialized flag is not thread-safe, but this is acceptable because get_container() is called during startup (single-threaded), and PR #1224 addresses thread safety for the audit service itself.
Quality gates: All 11 nox sessions pass (97% coverage)

Clean, well-scoped fix with proper separation of concerns from the thread-safety fix in PR #1224.

## Review: APPROVED ✅ **PR #1223 — fix(di): get_container() permanently caches failed audit subscriber in Singleton provider** ### What was reviewed - `src/cleveragents/application/container.py` — Added `_audit_subscriber_initialized` flag with retry logic - `features/tdd_di_audit_cache_failure.feature` — TDD regression test (from #1096) - `features/steps/tdd_di_audit_cache_failure_steps.py` — Step definitions with FakeContainer mock - `benchmarks/bench_audit_subscriber_retry.py` — ASV benchmark for retry-path performance ### Assessment - **Spec alignment**: ✅ Fixes long-lived server process behavior per the DI container design - **Code quality**: ✅ - Simple boolean flag approach — minimal and clean - `global _container, _audit_subscriber_initialized` — correct usage - Retry only when `not _audit_subscriber_initialized` — avoids unnecessary work on hot path - `reset_container()` properly resets both globals - Log message updated to reflect retry behavior ("will be retried on next get_container() call") - No `# type: ignore` suppressions - **Test coverage**: ✅ - FakeContainer with `audit_init_attempts` counter — good mock design - Verifies singleton behavior AND retry count - ASV benchmark included for performance regression tracking - **Design note**: The `_audit_subscriber_initialized` flag is not thread-safe, but this is acceptable because `get_container()` is called during startup (single-threaded), and PR #1224 addresses thread safety for the audit service itself. - **Quality gates**: All 11 nox sessions pass (97% coverage) Clean, well-scoped fix with proper separation of concerns from the thread-safety fix in PR #1224.

freemo referenced this pull request

2026-04-02 04:22:05 +00:00

bug(di): get_container() permanently caches failed audit subscriber initialization #992

freemo referenced this pull request

2026-04-02 04:25:14 +00:00

[Automated] Product Build Session State #1259

freemo self-assigned this 2026-04-02 06:15:13 +00:00

freemo force-pushed bugfix/m7-di-audit-cache-failure from 14e882e855 to 734ad2a162

2026-04-02 07:09:20 +00:00

Compare

freemo commented

2026-04-02 08:03:13 +00:00

🔒 Claimed by pr-reviewer-3. Starting independent code review.

freemo approved these changes 2026-04-02 08:06:06 +00:00

Dismissed

freemo left a comment

Independent Code Review: APPROVED ✅

Reviewer: pr-reviewer-3 (independent perspective)

Changes Reviewed

src/cleveragents/application/container.py — Core bug fix: retry audit subscriber initialization
features/tdd_di_audit_cache_failure.feature — BDD regression test (from TDD issue #1096)
features/steps/tdd_di_audit_cache_failure_steps.py — Step definitions with FakeContainer mock
benchmarks/bench_audit_subscriber_retry.py — ASV benchmark for retry-path performance

Specification Alignment ✅

The fix correctly addresses bug #992 as described in the issue. The approach — separating container creation (happens once) from audit subscriber initialization (retried until successful) — is the simplest and most backward-compatible of the three options listed in the issue's Expected Behavior section. The DI container's singleton semantics are preserved: _container is still created exactly once, while _audit_subscriber_initialized guards the retry loop independently.

Correctness Analysis ✅

Before: audit_event_subscriber() was called inside if _container is None:, so a failure was permanently cached — subsequent calls returned the container without retrying
After: The retry logic is outside the container-creation guard, controlled by _audit_subscriber_initialized. On success, the flag is set True (no further attempts). On failure, it remains False (next call retries)
reset_container() properly resets both _container and _audit_subscriber_initialized
The global statement correctly declares both variables
Log message updated from misleading "will be registered on first AuditEventSubscriber access" to accurate "will be retried on next get_container() call"
No thread-safety concern: get_container() is called during startup (single-threaded), and PR #1224 addresses thread safety separately

Test Quality ✅

BDD scenario directly tests the bug's behavior: first call fails, second call retries and succeeds
FakeContainer with audit_init_attempts counter is a clean, minimal mock
Assertions verify both singleton identity (first is second) AND retry count (attempts == 2)
Proper cleanup via context.add_cleanup() restores original Container class
ASV benchmark provides performance regression tracking for the retry path

Code Quality ✅

Minimal change footprint — only the necessary lines modified
Clean boolean flag approach — no over-engineering
Docstring updated to document retry behavior and reference bug #992
Comments explain the "why" clearly

Commit Standards ✅

Single atomic commit with Conventional Changelog format
ISSUES CLOSED: #992 footer present
Clean commit message explaining both the fix and the TDD test inclusion

PR Metadata ✅

Type/Bug label ✓
v3.6.0 milestone ✓
Closes #992 in body ✓
No needs feedback label ✓

Minor Note (non-blocking)

The step definitions and benchmark use # type: ignore[assignment] for monkey-patching the Container class. This is a pragmatic necessity for this type of testing (assigning a FakeContainer to a typed module attribute) and is consistent with existing patterns in the codebase (e.g., get_ai_provider() in the same file already uses # type: ignore for mock loading).

Verdict: Clean, well-scoped bug fix with proper TDD regression test. Approved for merge.

## Independent Code Review: APPROVED ✅ **Reviewer**: pr-reviewer-3 (independent perspective) ### Changes Reviewed 1. **`src/cleveragents/application/container.py`** — Core bug fix: retry audit subscriber initialization 2. **`features/tdd_di_audit_cache_failure.feature`** — BDD regression test (from TDD issue #1096) 3. **`features/steps/tdd_di_audit_cache_failure_steps.py`** — Step definitions with FakeContainer mock 4. **`benchmarks/bench_audit_subscriber_retry.py`** — ASV benchmark for retry-path performance ### Specification Alignment ✅ The fix correctly addresses bug #992 as described in the issue. The approach — separating container creation (happens once) from audit subscriber initialization (retried until successful) — is the simplest and most backward-compatible of the three options listed in the issue's Expected Behavior section. The DI container's singleton semantics are preserved: `_container` is still created exactly once, while `_audit_subscriber_initialized` guards the retry loop independently. ### Correctness Analysis ✅ - **Before**: `audit_event_subscriber()` was called inside `if _container is None:`, so a failure was permanently cached — subsequent calls returned the container without retrying - **After**: The retry logic is outside the container-creation guard, controlled by `_audit_subscriber_initialized`. On success, the flag is set `True` (no further attempts). On failure, it remains `False` (next call retries) - `reset_container()` properly resets both `_container` and `_audit_subscriber_initialized` - The `global` statement correctly declares both variables - Log message updated from misleading "will be registered on first AuditEventSubscriber access" to accurate "will be retried on next get_container() call" - No thread-safety concern: `get_container()` is called during startup (single-threaded), and PR #1224 addresses thread safety separately ### Test Quality ✅ - BDD scenario directly tests the bug's behavior: first call fails, second call retries and succeeds - `FakeContainer` with `audit_init_attempts` counter is a clean, minimal mock - Assertions verify both singleton identity (`first is second`) AND retry count (`attempts == 2`) - Proper cleanup via `context.add_cleanup()` restores original Container class - ASV benchmark provides performance regression tracking for the retry path ### Code Quality ✅ - Minimal change footprint — only the necessary lines modified - Clean boolean flag approach — no over-engineering - Docstring updated to document retry behavior and reference bug #992 - Comments explain the "why" clearly ### Commit Standards ✅ - Single atomic commit with Conventional Changelog format - `ISSUES CLOSED: #992` footer present - Clean commit message explaining both the fix and the TDD test inclusion ### PR Metadata ✅ - `Type/Bug` label ✓ - v3.6.0 milestone ✓ - `Closes #992` in body ✓ - No `needs feedback` label ✓ ### Minor Note (non-blocking) The step definitions and benchmark use `# type: ignore[assignment]` for monkey-patching the Container class. This is a pragmatic necessity for this type of testing (assigning a FakeContainer to a typed module attribute) and is consistent with existing patterns in the codebase (e.g., `get_ai_provider()` in the same file already uses `# type: ignore` for mock loading). **Verdict**: Clean, well-scoped bug fix with proper TDD regression test. Approved for merge.

freemo referenced this pull request

2026-04-02 08:06:21 +00:00

bug(di): get_container() permanently caches failed audit subscriber initialization #992

freemo commented

2026-04-02 16:47:58 +00:00

Review claimed by reviewer pool instance reviewer-pool-1. Dispatching independent code review.

freemo approved these changes 2026-04-02 16:51:11 +00:00

freemo left a comment

Independent Code Review: APPROVED ✅

Reviewer: reviewer-pool-1 (independent perspective)

Files Reviewed

src/cleveragents/application/container.py — Core bug fix: retry audit subscriber initialization via _audit_subscriber_initialized flag
features/tdd_di_audit_cache_failure.feature — BDD regression test (from TDD issue #1096, @tdd_expected_fail removed)
features/steps/tdd_di_audit_cache_failure_steps.py — Step definitions with FakeContainer mock
benchmarks/bench_audit_subscriber_retry.py — ASV benchmark for retry-path performance

Specification Alignment ✅

The fix correctly addresses bug #992. The approach — separating container creation (happens once via if _container is None:) from audit subscriber initialization (retried until successful via if not _audit_subscriber_initialized:) — is the simplest and most backward-compatible option from the issue's Expected Behavior. The DI container's singleton semantics are fully preserved.

Correctness Analysis ✅

Before (master): audit_event_subscriber() was called inside if _container is None:, so a failure was permanently cached — subsequent calls returned the container without retrying.
After (this PR): The retry logic is outside the container-creation guard, controlled by _audit_subscriber_initialized. On success, the flag is set True (no further attempts). On failure, it remains False (next call retries).
reset_container() properly resets both _container and _audit_subscriber_initialized — correct for test isolation.
The global statement correctly declares both variables.
Log message updated from misleading "will be registered on first AuditEventSubscriber access" to accurate "will be retried on next get_container() call".
Docstring updated to document retry behavior and reference bug #992.

Test Quality ✅

BDD scenario directly tests the bug's core behavior: first call fails, second call retries and succeeds.
FakeContainer with audit_init_attempts counter is a clean, minimal mock.
Assertions verify both singleton identity (first is second) AND retry count (attempts == 2).
Proper cleanup via context.add_cleanup() restores original Container class.
ASV benchmark provides performance regression tracking for the retry path.

Code Quality ✅

Minimal change footprint — only the necessary lines modified in production code.
Clean boolean flag approach — no over-engineering.
Comments explain the "why" clearly.

Commit Standards ✅

Single atomic commit with Conventional Changelog format: fix(di): ...
ISSUES CLOSED: #992 footer present.
Clean commit message explaining both the fix and the TDD test inclusion.

PR Metadata ✅

Type/Bug label ✓
v3.6.0 milestone ✓
Closes #992 in body ✓
No needs feedback label ✓

Minor Note (non-blocking)

The step definitions and benchmark use # type: ignore[assignment] for monkey-patching the Container class. This is a pragmatic necessity for test doubles (assigning a FakeContainer to a typed module attribute) and is consistent with existing patterns in the same file (get_ai_provider() on master already uses # type: ignore for mock loading). Not blocking since this is test/benchmark code only.

Thread Safety Note (non-blocking)

The _audit_subscriber_initialized flag is not thread-safe, but this is acceptable because get_container() is called during startup (single-threaded), and PR #1224 addresses thread safety for the audit service separately.

Verdict: Clean, well-scoped bug fix with proper TDD regression test. Approved for merge.

## Independent Code Review: APPROVED ✅ **Reviewer**: reviewer-pool-1 (independent perspective) ### Files Reviewed 1. **`src/cleveragents/application/container.py`** — Core bug fix: retry audit subscriber initialization via `_audit_subscriber_initialized` flag 2. **`features/tdd_di_audit_cache_failure.feature`** — BDD regression test (from TDD issue #1096, `@tdd_expected_fail` removed) 3. **`features/steps/tdd_di_audit_cache_failure_steps.py`** — Step definitions with FakeContainer mock 4. **`benchmarks/bench_audit_subscriber_retry.py`** — ASV benchmark for retry-path performance ### Specification Alignment ✅ The fix correctly addresses bug #992. The approach — separating container creation (happens once via `if _container is None:`) from audit subscriber initialization (retried until successful via `if not _audit_subscriber_initialized:`) — is the simplest and most backward-compatible option from the issue's Expected Behavior. The DI container's singleton semantics are fully preserved. ### Correctness Analysis ✅ - **Before (master)**: `audit_event_subscriber()` was called inside `if _container is None:`, so a failure was permanently cached — subsequent calls returned the container without retrying. - **After (this PR)**: The retry logic is outside the container-creation guard, controlled by `_audit_subscriber_initialized`. On success, the flag is set `True` (no further attempts). On failure, it remains `False` (next call retries). - `reset_container()` properly resets both `_container` and `_audit_subscriber_initialized` — correct for test isolation. - The `global` statement correctly declares both variables. - Log message updated from misleading "will be registered on first AuditEventSubscriber access" to accurate "will be retried on next get_container() call". - Docstring updated to document retry behavior and reference bug #992. ### Test Quality ✅ - BDD scenario directly tests the bug's core behavior: first call fails, second call retries and succeeds. - `FakeContainer` with `audit_init_attempts` counter is a clean, minimal mock. - Assertions verify both singleton identity (`first is second`) AND retry count (`attempts == 2`). - Proper cleanup via `context.add_cleanup()` restores original Container class. - ASV benchmark provides performance regression tracking for the retry path. ### Code Quality ✅ - Minimal change footprint — only the necessary lines modified in production code. - Clean boolean flag approach — no over-engineering. - Comments explain the "why" clearly. ### Commit Standards ✅ - Single atomic commit with Conventional Changelog format: `fix(di): ...` - `ISSUES CLOSED: #992` footer present. - Clean commit message explaining both the fix and the TDD test inclusion. ### PR Metadata ✅ - `Type/Bug` label ✓ - v3.6.0 milestone ✓ - `Closes #992` in body ✓ - No `needs feedback` label ✓ ### Minor Note (non-blocking) The step definitions and benchmark use `# type: ignore[assignment]` for monkey-patching the Container class. This is a pragmatic necessity for test doubles (assigning a FakeContainer to a typed module attribute) and is consistent with existing patterns in the same file (`get_ai_provider()` on master already uses `# type: ignore` for mock loading). Not blocking since this is test/benchmark code only. ### Thread Safety Note (non-blocking) The `_audit_subscriber_initialized` flag is not thread-safe, but this is acceptable because `get_container()` is called during startup (single-threaded), and PR #1224 addresses thread safety for the audit service separately. **Verdict**: Clean, well-scoped bug fix with proper TDD regression test. Approved for merge.

freemo merged commit a0c6ecd3ad into master

2026-04-02 16:51:29 +00:00

freemo deleted branch bugfix/m7-di-audit-cache-failure

2026-04-02 16:51:30 +00:00

freemo referenced this pull request

2026-04-02 16:51:38 +00:00

bug(di): get_container() permanently caches failed audit subscriber initialization #992

freemo referenced this pull request

2026-04-02 16:52:37 +00:00

bug(di): get_container() permanently caches failed audit subscriber initialization #992

freemo referenced this pull request

2026-04-02 17:11:30 +00:00