bug(di): get_container() permanently caches failed audit subscriber initialization #992

Open
opened 2026-03-17 00:43:11 +00:00 by brent.edwards · 10 comments
Member

Background

Found during review of PR #803 (finding P1-8). This is in code introduced by that PR (commit d05935db, AuditEventSubscriber wiring) but accepted as a follow-up since the per-process CLI lifecycle mitigates it in practice.

Description

get_container() in container.py (~line 621-647) attempts _container.audit_event_subscriber() during initialization. If it fails (e.g., database not yet initialized), the exception is caught and logged, but _container is permanently assigned. On every subsequent call, _container is None is False, so the container is returned without ever re-attempting audit subscriber initialization.

The log message says "will be registered on first AuditEventSubscriber access" but there is no code that automatically retries the initialization later. In the CLI, each invocation creates a new process (and therefore a new container), so this is mitigated. In a long-lived server process, the audit subscriber would remain uninitialized for the lifetime of the process.

Expected Behavior

Either:

  • Retry audit_event_subscriber() on subsequent get_container() calls (e.g., via a flag), or
  • Register a lazy initialization callback that runs when the database becomes available, or
  • Document that this is CLI-only and add an explicit reinitialize_audit() method for server use

Acceptance Criteria

  • Failed audit subscriber initialization is retried on subsequent access
  • Unit test verifies retry behavior after initial failure
  • nox passes with coverage >= 97%

References

  • Discovered in: PR #803 review #2317 (finding P1-8)
  • File: src/cleveragents/application/container.py (~line 621-647)
## Background Found during review of PR #803 (finding P1-8). This is in code introduced by that PR (commit `d05935db`, AuditEventSubscriber wiring) but accepted as a follow-up since the per-process CLI lifecycle mitigates it in practice. ## Description `get_container()` in `container.py` (~line 621-647) attempts `_container.audit_event_subscriber()` during initialization. If it fails (e.g., database not yet initialized), the exception is caught and logged, but `_container` is permanently assigned. On every subsequent call, `_container is None` is False, so the container is returned without ever re-attempting audit subscriber initialization. The log message says "will be registered on first AuditEventSubscriber access" but there is no code that automatically retries the initialization later. In the CLI, each invocation creates a new process (and therefore a new container), so this is mitigated. In a long-lived server process, the audit subscriber would remain uninitialized for the lifetime of the process. ## Expected Behavior Either: - Retry `audit_event_subscriber()` on subsequent `get_container()` calls (e.g., via a flag), or - Register a lazy initialization callback that runs when the database becomes available, or - Document that this is CLI-only and add an explicit `reinitialize_audit()` method for server use ## Acceptance Criteria - [x] Failed audit subscriber initialization is retried on subsequent access - [x] Unit test verifies retry behavior after initial failure - [x] `nox` passes with coverage >= 97% ## References - Discovered in: PR #803 review #2317 (finding P1-8) - File: `src/cleveragents/application/container.py` (~line 621-647)
freemo added this to the v3.6.0 milestone 2026-03-17 18:17:37 +00:00
Owner

PM Triage — Day 37

Triaged and classified:

  • Milestone: v3.6.0 (M7 — DI container robustness)
  • Priority: Medium (mitigated by per-process CLI lifecycle)
  • Assignee: @freemo (DI container author/architect)
  • Points: 3 (need retry mechanism or lazy initialization callback)

Note: Current CLI lifecycle (new process per invocation) mitigates this bug. Relevant for server mode where the container is long-lived.


PM triage — Day 37

**PM Triage — Day 37** Triaged and classified: - **Milestone**: v3.6.0 (M7 — DI container robustness) - **Priority**: Medium (mitigated by per-process CLI lifecycle) - **Assignee**: @freemo (DI container author/architect) - **Points**: 3 (need retry mechanism or lazy initialization callback) Note: Current CLI lifecycle (new process per invocation) mitigates this bug. Relevant for server mode where the container is long-lived. --- *PM triage — Day 37*
Owner

Assigned to @CoreRasurae for bug fix based on developer expertise (DI container — Luis is the DI expert). This bug is top priority per project policy — bugs always take precedence over feature work.

Assigned to @CoreRasurae for bug fix based on developer expertise (DI container — Luis is the DI expert). This bug is top priority per project policy — bugs always take precedence over feature work.
Owner

TDD workflow initiated for this bug:

  • Created TDD issue #1096 to write a tagged test that captures this bug.
  • Dependency link added: this issue is blocked by #1096.
  • Once #1096 is merged to master, the bug fix assignee should create the bugfix branch and implement the fix.
  • The fix must remove the @tdd_expected_fail tag from the test so it runs normally.
TDD workflow initiated for this bug: - Created TDD issue #1096 to write a tagged test that captures this bug. - Dependency link added: this issue is blocked by #1096. - Once #1096 is merged to `master`, the bug fix assignee should create the bugfix branch and implement the fix. - The fix must remove the `@tdd_expected_fail` tag from the test so it runs normally.
Owner

Planning Agent — Discussion Review

TDD workflow confirmed with #1096. Assignment to @CoreRasurae is correct — Luis is the DI container expert.

Technical note: The Day 37 triage correctly identifies the mitigation: current CLI lifecycle (new process per invocation) means the cached failure is discarded on next run. This is only a problem in server mode where the container is long-lived. The v3.6.0 milestone placement for server preparation is appropriate.

Assignment change noted: Originally assigned to @freemo (DI container author/architect), reassigned to @CoreRasurae. Both have relevant expertise. No objection to the reassignment.

The 3-point estimate for implementing a retry mechanism or lazy initialization callback is appropriate. The fix requires careful consideration of thread safety in the retry/reinit path.

No disputes.

## Planning Agent — Discussion Review TDD workflow confirmed with #1096. Assignment to @CoreRasurae is correct — Luis is the DI container expert. **Technical note:** The Day 37 triage correctly identifies the mitigation: current CLI lifecycle (new process per invocation) means the cached failure is discarded on next run. This is only a problem in server mode where the container is long-lived. The v3.6.0 milestone placement for server preparation is appropriate. **Assignment change noted:** Originally assigned to @freemo (DI container author/architect), reassigned to @CoreRasurae. Both have relevant expertise. No objection to the reassignment. The 3-point estimate for implementing a retry mechanism or lazy initialization callback is appropriate. The fix requires careful consideration of thread safety in the retry/reinit path. No disputes.
Author
Member

Implementation Notes — Bug Fix for #992

Branch

bugfix/m7-di-audit-cache-failure

Design Decision: Retry via Module-Level Flag

The fix introduces a module-level boolean flag _audit_subscriber_initialized alongside the existing _container global in container.py. The approach:

  1. get_container() now separates container creation from audit subscriber initialization:

    • Container creation still happens only once (if _container is None)
    • Audit subscriber initialization is attempted on every get_container() call until it succeeds (guarded by _audit_subscriber_initialized)
    • On success, the flag is set to True and no further attempts are made
    • On failure, the warning is logged (with updated message: "will be retried on next get_container() call") and the flag remains False
  2. reset_container() now also resets _audit_subscriber_initialized = False so tests get clean state.

Why This Approach

Considered alternatives from the issue's Expected Behavior section:

  • Lazy callback when DB becomes available: Would require an event/notification system for DB readiness — overly complex for this fix.
  • Explicit reinitialize_audit() for server use: Would shift responsibility to callers and not fix the root cause.
  • Retry on subsequent get_container() calls via a flag (chosen): Simplest, most robust, and backward-compatible. The retry is transparent to callers. In CLI mode, behavior is unchanged (one process = one call). In server mode, eventual consistency is achieved as the audit subscriber will be registered once the DB is ready.

TDD Test Integration

The TDD test from issue #1096 (branch tdd/m6-di-audit-cache-failure) has not been merged to master yet. I've brought the test into this bugfix branch and removed the @tdd_expected_fail tag per the Bug Fix Workflow in CONTRIBUTING.md. The test now runs as a normal regression guard:

  • Feature: features/tdd_di_audit_cache_failure.feature
  • Steps: features/steps/tdd_di_audit_cache_failure_steps.py
  • Tags: @tdd_bug @tdd_bug_992 (permanent regression markers)

Benchmark

Added benchmarks/bench_audit_subscriber_retry.py measuring the performance of get_container() in both the success and retry-after-failure paths.

## Implementation Notes — Bug Fix for #992 ### Branch `bugfix/m7-di-audit-cache-failure` ### Design Decision: Retry via Module-Level Flag The fix introduces a module-level boolean flag `_audit_subscriber_initialized` alongside the existing `_container` global in `container.py`. The approach: 1. **`get_container()`** now separates container creation from audit subscriber initialization: - Container creation still happens only once (`if _container is None`) - Audit subscriber initialization is attempted on **every** `get_container()` call until it succeeds (guarded by `_audit_subscriber_initialized`) - On success, the flag is set to `True` and no further attempts are made - On failure, the warning is logged (with updated message: "will be retried on next get_container() call") and the flag remains `False` 2. **`reset_container()`** now also resets `_audit_subscriber_initialized = False` so tests get clean state. ### Why This Approach Considered alternatives from the issue's Expected Behavior section: - **Lazy callback when DB becomes available**: Would require an event/notification system for DB readiness — overly complex for this fix. - **Explicit `reinitialize_audit()` for server use**: Would shift responsibility to callers and not fix the root cause. - **Retry on subsequent `get_container()` calls via a flag** (chosen): Simplest, most robust, and backward-compatible. The retry is transparent to callers. In CLI mode, behavior is unchanged (one process = one call). In server mode, eventual consistency is achieved as the audit subscriber will be registered once the DB is ready. ### TDD Test Integration The TDD test from issue #1096 (branch `tdd/m6-di-audit-cache-failure`) has not been merged to master yet. I've brought the test into this bugfix branch and **removed the `@tdd_expected_fail` tag** per the Bug Fix Workflow in `CONTRIBUTING.md`. The test now runs as a normal regression guard: - Feature: `features/tdd_di_audit_cache_failure.feature` - Steps: `features/steps/tdd_di_audit_cache_failure_steps.py` - Tags: `@tdd_bug @tdd_bug_992` (permanent regression markers) ### Benchmark Added `benchmarks/bench_audit_subscriber_retry.py` measuring the performance of `get_container()` in both the success and retry-after-failure paths.
Author
Member

Quality Gate Results

All nox sessions passed:

Session Result Duration
lint PASS ~5s
format PASS ~5s
typecheck PASS 17s
security_scan PASS 15s
dead_code PASS 2s
unit_tests PASS 4m
integration_tests PASS 7m
docs PASS 2m
build PASS 2s
benchmark PASS 24m
coverage_report PASS 7m

Coverage: 97.0% (threshold: 97%)

PR Created

PR #1223: fix(di): get_container() permanently caches failed audit subscriber in Singleton provider

  • Branch: bugfix/m7-di-audit-cache-failure
  • Milestone: v3.6.0
  • Label: Type/Bug
  • Closes: #992

Note on TDD Issue #1096

The TDD test from issue #1096 (branch tdd/m6-di-audit-cache-failure) was not yet merged to master. I brought it into the bugfix branch and removed @tdd_expected_fail per the Bug Fix Workflow. When this PR is merged, the TDD test will exist on master as a permanent regression guard with @tdd_issue @tdd_issue_992 tags.

## Quality Gate Results All nox sessions passed: | Session | Result | Duration | |---------|--------|----------| | `lint` | PASS | ~5s | | `format` | PASS | ~5s | | `typecheck` | PASS | 17s | | `security_scan` | PASS | 15s | | `dead_code` | PASS | 2s | | `unit_tests` | PASS | 4m | | `integration_tests` | PASS | 7m | | `docs` | PASS | 2m | | `build` | PASS | 2s | | `benchmark` | PASS | 24m | | `coverage_report` | PASS | 7m | **Coverage: 97.0%** (threshold: 97%) ### PR Created PR #1223: `fix(di): get_container() permanently caches failed audit subscriber in Singleton provider` - Branch: `bugfix/m7-di-audit-cache-failure` - Milestone: v3.6.0 - Label: `Type/Bug` - Closes: #992 ### Note on TDD Issue #1096 The TDD test from issue #1096 (branch `tdd/m6-di-audit-cache-failure`) was not yet merged to master. I brought it into the bugfix branch and removed `@tdd_expected_fail` per the Bug Fix Workflow. When this PR is merged, the TDD test will exist on master as a permanent regression guard with `@tdd_issue @tdd_issue_992` tags.
Owner

PR #1223 reviewed, approved, and merged.

PR #1223 reviewed, approved, and merged.
freemo self-assigned this 2026-04-02 06:13:59 +00:00
Owner

PR #1223 reviewed, approved, and merged.

Independent code review summary:

  • Fix introduces _audit_subscriber_initialized flag to separate container creation (once) from audit subscriber initialization (retried until successful)
  • BDD regression test verifies retry behavior after initial failure
  • ASV benchmark tracks retry-path performance
  • All acceptance criteria met: retry on failure , unit test , nox passes with ≥97% coverage

The fix ensures long-lived server processes will eventually register audit subscriptions once prerequisites (e.g., database) become available, rather than permanently caching the failure from the first get_container() call.

PR #1223 reviewed, approved, and merged. **Independent code review summary:** - Fix introduces `_audit_subscriber_initialized` flag to separate container creation (once) from audit subscriber initialization (retried until successful) - BDD regression test verifies retry behavior after initial failure - ASV benchmark tracks retry-path performance - All acceptance criteria met: retry on failure ✅, unit test ✅, nox passes with ≥97% coverage ✅ The fix ensures long-lived server processes will eventually register audit subscriptions once prerequisites (e.g., database) become available, rather than permanently caching the failure from the first `get_container()` call.
Owner

PR #1223 reviewed, approved, and merged.

The fix introduces a module-level _audit_subscriber_initialized flag that separates container creation (happens once) from audit subscriber initialization (retried until successful). This ensures long-lived server processes eventually register audit subscriptions once prerequisites become available.

All acceptance criteria met:

  • Failed audit subscriber initialization is retried on subsequent access
  • Unit test verifies retry behavior after initial failure
  • nox passes with coverage ≥ 97%
PR #1223 reviewed, approved, and merged. The fix introduces a module-level `_audit_subscriber_initialized` flag that separates container creation (happens once) from audit subscriber initialization (retried until successful). This ensures long-lived server processes eventually register audit subscriptions once prerequisites become available. All acceptance criteria met: - ✅ Failed audit subscriber initialization is retried on subsequent access - ✅ Unit test verifies retry behavior after initial failure - ✅ `nox` passes with coverage ≥ 97%
Owner

Label transition note: PR #1223 has been merged. This issue should be transitioned from State/In Review to State/Completed. The issue could not be auto-closed due to open dependencies (likely the TDD issue #1096 dependency link). A human should update the label and close the issue once dependencies are resolved.

**Label transition note**: PR #1223 has been merged. This issue should be transitioned from `State/In Review` to `State/Completed`. The issue could not be auto-closed due to open dependencies (likely the TDD issue #1096 dependency link). A human should update the label and close the issue once dependencies are resolved.
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core#992
No description provided.