BUG-HUNT: [resource] AuditService.flush() permanently kills the background writer thread but does not set _closed=True — subsequent record() calls silently drop audit entries #6757

Open
opened 2026-04-10 02:00:27 +00:00 by HAL9000 · 2 comments
Owner

Bug Report: [resource] — Audit entries silently dropped after flush() in async mode

Severity Assessment

  • Impact: Security-relevant audit events (plan_applied, auth_success, auth_failure, correction_applied, etc.) silently dropped into the void when record() is called after flush(), creating a gap in the audit trail that may not be detectable
  • Likelihood: Medium — triggered any time a long-running server process calls flush() for synchronisation (e.g., in tests, during signal handling, or if close() is called mid-session then the service is reused)
  • Priority: High

Location

  • File: src/cleveragents/application/services/audit_service.py
  • Function/Class: AuditService.flush(), AuditService.record()
  • Lines: ~291–310 (flush), ~331–390 (record)

Description

AuditService.flush() sends _STOP_SENTINEL to the background writer thread queue and then joins the thread — permanently killing the writer thread. However, flush() does NOT set self._closed = True.

After flush() returns:

  1. self._async_mode is still True
  2. self._queue is still a non-None queue.Queue
  3. self._writer_thread is now a dead thread (joined/stopped)
  4. self._closed is still False

Any subsequent record() call enters the async path (if self._async_mode and self._queue is not None:) and enqueues the payload. But the background writer thread is dead and will never drain the queue. The audit entries are permanently lost with no error, warning, or exception.

The docstring on flush() says "After flush() the service is closed — no further record() calls should be made." But this is merely a documentation warning, not enforced in code. The AuditService is a Singleton in the DI container — it lives for the process lifetime and is shared across all callers. The AuditEventSubscriber can call audit_service.record() at any time through the event bus, including after a test calls flush().

Evidence

# src/cleveragents/application/services/audit_service.py

def flush(self) -> None:
    """..."""
    if (
        self._async_mode
        and self._queue is not None
        and self._writer_thread is not None
        and self._writer_thread.is_alive()
    ):
        self._queue.put(_STOP_SENTINEL)  # kills the writer thread
        self._writer_thread.join()
        # BUG: self._closed is NOT set to True here
        # BUG: self._writer_thread is not set to None here

def record(self, *, event_type: str, ...) -> AuditLogEntry:
    """..."""
    # ...
    if self._async_mode and self._queue is not None:
        payload = _AuditPayload(...)
        self._queue.put(payload)   # enqueues to dead queue — never drained!
        return AuditLogEntry(id=-1, ...)  # returns placeholder, no error

close() does set self._closed = True and then calls flush(), but:

  • flush() can be called independently without going through close()
  • record() does not check self._closed before enqueuing

Expected Behavior

Either:

  1. flush() should set self._closed = True (and optionally self._writer_thread = None) to make subsequent record() calls fail fast with a clear error, OR
  2. record() should check self._closed before proceeding: if self._closed: raise RuntimeError("AuditService is closed")

Actual Behavior

After flush(), record() silently enqueues audit events to a queue that will never be drained. Events are lost without any indication of failure. The audit trail has gaps that may violate security compliance requirements.

Suggested Fix

Option A — make flush() mark the service as closed:

def flush(self) -> None:
    if (
        self._async_mode
        and self._queue is not None
        and self._writer_thread is not None
        and self._writer_thread.is_alive()
    ):
        self._queue.put(_STOP_SENTINEL)
        self._writer_thread.join()
        self._writer_thread = None  # prevent is_alive() confusion
        self._closed = True  # enforce no further records

Option B — guard record() against post-close calls:

def record(self, *, event_type: str, ...) -> AuditLogEntry:
    if self._closed:
        raise RuntimeError("Cannot record audit entry: AuditService is closed after flush()/close()")
    ...

Category

resource

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: [resource] — Audit entries silently dropped after `flush()` in async mode ### Severity Assessment - **Impact**: Security-relevant audit events (`plan_applied`, `auth_success`, `auth_failure`, `correction_applied`, etc.) silently dropped into the void when `record()` is called after `flush()`, creating a gap in the audit trail that may not be detectable - **Likelihood**: Medium — triggered any time a long-running server process calls `flush()` for synchronisation (e.g., in tests, during signal handling, or if `close()` is called mid-session then the service is reused) - **Priority**: High ### Location - **File**: `src/cleveragents/application/services/audit_service.py` - **Function/Class**: `AuditService.flush()`, `AuditService.record()` - **Lines**: ~291–310 (flush), ~331–390 (record) ### Description `AuditService.flush()` sends `_STOP_SENTINEL` to the background writer thread queue and then joins the thread — permanently killing the writer thread. However, `flush()` does NOT set `self._closed = True`. After `flush()` returns: 1. `self._async_mode` is still `True` 2. `self._queue` is still a non-None `queue.Queue` 3. `self._writer_thread` is now a dead thread (joined/stopped) 4. `self._closed` is still `False` Any subsequent `record()` call enters the async path (`if self._async_mode and self._queue is not None:`) and enqueues the payload. But the background writer thread is dead and will never drain the queue. **The audit entries are permanently lost with no error, warning, or exception.** The docstring on `flush()` says "After `flush()` the service is **closed** — no further `record()` calls should be made." But this is merely a documentation warning, not enforced in code. The `AuditService` is a `Singleton` in the DI container — it lives for the process lifetime and is shared across all callers. The `AuditEventSubscriber` can call `audit_service.record()` at any time through the event bus, including after a test calls `flush()`. ### Evidence ```python # src/cleveragents/application/services/audit_service.py def flush(self) -> None: """...""" if ( self._async_mode and self._queue is not None and self._writer_thread is not None and self._writer_thread.is_alive() ): self._queue.put(_STOP_SENTINEL) # kills the writer thread self._writer_thread.join() # BUG: self._closed is NOT set to True here # BUG: self._writer_thread is not set to None here def record(self, *, event_type: str, ...) -> AuditLogEntry: """...""" # ... if self._async_mode and self._queue is not None: payload = _AuditPayload(...) self._queue.put(payload) # enqueues to dead queue — never drained! return AuditLogEntry(id=-1, ...) # returns placeholder, no error ``` `close()` does set `self._closed = True` and then calls `flush()`, but: - `flush()` can be called independently without going through `close()` - `record()` does not check `self._closed` before enqueuing ### Expected Behavior Either: 1. `flush()` should set `self._closed = True` (and optionally `self._writer_thread = None`) to make subsequent `record()` calls fail fast with a clear error, OR 2. `record()` should check `self._closed` before proceeding: `if self._closed: raise RuntimeError("AuditService is closed")` ### Actual Behavior After `flush()`, `record()` silently enqueues audit events to a queue that will never be drained. Events are lost without any indication of failure. The audit trail has gaps that may violate security compliance requirements. ### Suggested Fix Option A — make `flush()` mark the service as closed: ```python def flush(self) -> None: if ( self._async_mode and self._queue is not None and self._writer_thread is not None and self._writer_thread.is_alive() ): self._queue.put(_STOP_SENTINEL) self._writer_thread.join() self._writer_thread = None # prevent is_alive() confusion self._closed = True # enforce no further records ``` Option B — guard `record()` against post-close calls: ```python def record(self, *, event_type: str, ...) -> AuditLogEntry: if self._closed: raise RuntimeError("Cannot record audit entry: AuditService is closed after flush()/close()") ... ``` ### Category resource ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter
Author
Owner

Label compliance fix applied: Added missing State/Unverified label per CONTRIBUTING.md requirements.


Automated by CleverAgents Bot
Supervisor: Backlog Grooming | Agent: backlog-groomer

Label compliance fix applied: Added missing `State/Unverified` label per CONTRIBUTING.md requirements. --- **Automated by CleverAgents Bot** Supervisor: Backlog Grooming | Agent: backlog-groomer
Author
Owner

Verified — Resource bug: AuditService.flush() permanently kills background writer thread — subsequent audit entries silently dropped. MoSCoW: Must-have. Priority: High.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Resource bug: AuditService.flush() permanently kills background writer thread — subsequent audit entries silently dropped. MoSCoW: Must-have. Priority: High. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#6757
No description provided.