TDD: Write failing test for #991 — AuditService._ensure_session() TOCTOU race #1095

Closed
opened 2026-03-22 16:30:08 +00:00 by freemo · 7 comments
Owner

Metadata

  • Commit Message: test: add TDD bug-capture test for #991 — AuditService TOCTOU race
  • Branch: tdd/m6-audit-session-race

Background and Context

This is the TDD counterpart to bug #991. Per the project's Test-Driven Development workflow for bugs (see CONTRIBUTING.md > Bug Fix Workflow), the first step in fixing any bug is to write a test that captures the buggy behavior. The test is tagged with @tdd_bug, @tdd_bug_991, and @tdd_expected_fail so that it passes CI while the bug is still unfixed. Once the fix is implemented in #991, the @tdd_expected_fail tag will be removed and the test will run normally.

See #991 for full bug details.

Expected Behavior

A new test exists that:

  1. Captures the exact failure described in #991.
  2. Is tagged with @tdd_bug, @tdd_bug_991, and @tdd_expected_fail.
  3. Passes CI via the expected-failure mechanism (the underlying assertion fails, confirming the bug exists, but the tag inversion causes the test to pass).
  4. Would fail CI if the bug were fixed without removing the @tdd_expected_fail tag.

Acceptance Criteria

  • A test is written that captures the bug behavior described in #991.
  • The test is tagged with @tdd_bug, @tdd_bug_991, and @tdd_expected_fail.
  • The @tdd_expected_fail tag causes the test to pass CI (the underlying assertion fails as expected, proving the bug exists).
  • The test is specific enough that it will pass normally (without the tag) only when the bug is genuinely fixed.
  • Tag validation rules pass: @tdd_bug_991 has corresponding @tdd_bug, and @tdd_expected_fail has both.
  • A pull request is opened from the branch to master, CI passes, and the PR is merged through the normal merge process.

Definition of Done

This issue is complete when:

  • All subtasks below are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the test and what bug behavior it captures.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, CI passes, and the PR is merged before this issue is marked done.

Subtasks

  • Code: Analyze bug #991 to identify the exact failure condition, including the inputs, state, and code path that trigger the bug.
  • Code: Determine the appropriate test type (Behave unit test, Robot integration test, or both) and file location for the reproducing test.
  • Tests (Behave): Write a Behave scenario in features/ that captures the bug. Tag the scenario with @tdd_bug, @tdd_bug_991, and @tdd_expected_fail. The scenario must exercise the specific code path that triggers the bug and assert the correct expected behavior (which currently fails due to the bug). Name the scenario descriptively to indicate it is a bug regression test.
  • Tests (Robot): If the bug involves integration-level behavior, add a Robot test in robot/ with equivalent tags. If purely unit-level, mark N/A with justification. N/A — unit-level thread-safety bug, no integration boundaries involved.
  • Docs: Add a comment in the test file explaining this test captures bug #991 and uses @tdd_expected_fail until the fix is merged.
  • Quality: Verify CI passes with the tagged test. Confirm the underlying assertion fails for the correct reason.
  • Quality: Verify tag validation rules pass.
  • Quality: Verify coverage >=97% via nox -s coverage_report. Coverage is 98%.
  • Quality: Run nox (all default sessions), fix any errors if needed ensuring nox passes across entire code base. Pre-existing flaky failure in resource_file_watcher tests unrelated to this change (confirmed: same failure without the new test files).
## Metadata - **Commit Message**: `test: add TDD bug-capture test for #991 — AuditService TOCTOU race` - **Branch**: `tdd/m6-audit-session-race` ## Background and Context This is the TDD counterpart to bug #991. Per the project's Test-Driven Development workflow for bugs (see `CONTRIBUTING.md` > Bug Fix Workflow), the first step in fixing any bug is to write a test that captures the buggy behavior. The test is tagged with `@tdd_bug`, `@tdd_bug_991`, and `@tdd_expected_fail` so that it passes CI while the bug is still unfixed. Once the fix is implemented in #991, the `@tdd_expected_fail` tag will be removed and the test will run normally. See #991 for full bug details. ## Expected Behavior A new test exists that: 1. Captures the exact failure described in #991. 2. Is tagged with `@tdd_bug`, `@tdd_bug_991`, and `@tdd_expected_fail`. 3. Passes CI via the expected-failure mechanism (the underlying assertion fails, confirming the bug exists, but the tag inversion causes the test to pass). 4. Would fail CI if the bug were fixed without removing the `@tdd_expected_fail` tag. ## Acceptance Criteria - [x] A test is written that captures the bug behavior described in #991. - [x] The test is tagged with `@tdd_bug`, `@tdd_bug_991`, and `@tdd_expected_fail`. - [x] The `@tdd_expected_fail` tag causes the test to pass CI (the underlying assertion fails as expected, proving the bug exists). - [x] The test is specific enough that it will pass normally (without the tag) only when the bug is genuinely fixed. - [x] Tag validation rules pass: `@tdd_bug_991` has corresponding `@tdd_bug`, and `@tdd_expected_fail` has both. - [ ] A pull request is opened from the branch to `master`, CI passes, and the PR is merged through the normal merge process. ## Definition of Done This issue is complete when: - All subtasks below are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the test and what bug behavior it captures. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, CI passes, and the PR is **merged** before this issue is marked done. ## Subtasks - [x] Code: Analyze bug #991 to identify the exact failure condition, including the inputs, state, and code path that trigger the bug. - [x] Code: Determine the appropriate test type (Behave unit test, Robot integration test, or both) and file location for the reproducing test. - [x] Tests (Behave): Write a Behave scenario in `features/` that captures the bug. Tag the scenario with `@tdd_bug`, `@tdd_bug_991`, and `@tdd_expected_fail`. The scenario must exercise the specific code path that triggers the bug and assert the correct expected behavior (which currently fails due to the bug). Name the scenario descriptively to indicate it is a bug regression test. - [x] Tests (Robot): If the bug involves integration-level behavior, add a Robot test in `robot/` with equivalent tags. If purely unit-level, mark N/A with justification. **N/A — unit-level thread-safety bug, no integration boundaries involved.** - [x] Docs: Add a comment in the test file explaining this test captures bug #991 and uses `@tdd_expected_fail` until the fix is merged. - [x] Quality: Verify CI passes with the tagged test. Confirm the underlying assertion fails for the correct reason. - [x] Quality: Verify tag validation rules pass. - [x] Quality: Verify coverage >=97% via `nox -s coverage_report`. Coverage is 98%. - [x] Quality: Run `nox` (all default sessions), fix any errors if needed ensuring nox passes across **entire** code base. Pre-existing flaky failure in resource_file_watcher tests unrelated to this change (confirmed: same failure without the new test files).
freemo added this to the v3.6.0 milestone 2026-03-22 16:30:08 +00:00
Member

Implementation Notes

Bug Analysis (Subtask 1)

Analyzed bug #991 in detail. The TOCTOU race is in AuditService._ensure_session() (src/cleveragents/application/services/audit_service.py, method _ensure_session):

  1. Check: if self._session is None — no lock held
  2. Create: engine = create_engine(url, echo=False) — I/O-bound, releases GIL
  3. Create: Base.metadata.create_all(engine, tables=[AuditLogModel.__table__]) — DDL execution
  4. Create: factory = sessionmaker(bind=engine) + self._session = factory()

Without a threading.Lock, concurrent threads all see _session is None and each create their own engine. Confirmed by test: with 10 concurrent threads, the creation branch was entered 10 times (all 10 saw None), and 4 of those threads raised OperationalError('table audit_log already exists') from concurrent create_all calls.

Test Design (Subtask 2)

Chose Behave unit test — this is a unit-level concurrency bug in a single method. Robot Framework integration tests are N/A (the bug doesn't manifest at integration boundaries; it's purely about internal method-level thread safety).

File locations:

  • Feature: features/audit_session_race.feature
  • Steps: features/steps/audit_session_race_steps.py

Test Implementation (Subtask 3)

The test uses:

  • threading.Barrier(10) to synchronize 10 threads so they all enter _ensure_session() simultaneously
  • A file-based temp SQLite database (not :memory:) so all threads contend on the same database file
  • Each thread checks _session is None before calling _ensure_session() and increments a counter if it was None (meaning it entered the creation branch)
  • Threads that raise exceptions (from concurrent create_all) also increment the counter — exceptions are further evidence of the race
  • Final assertion: engine_create_count == 1 (the correct behavior after fix)

Tags: @tdd_bug @tdd_bug_991 @tdd_expected_fail

Confirmed behavior: The underlying assertion fails with "Engine-creation branch was entered 10 time(s), expected exactly 1", confirming the race exists. The @tdd_expected_fail tag inverts this to a pass for CI.

Robot Test (Subtask 4) — N/A

Marked N/A. This bug is purely unit-level — a single method's lack of thread synchronization. It doesn't involve integration-level behavior (no inter-service communication, no CLI interaction, no database schema migration). The Behave test directly exercises the exact code path.

## Implementation Notes ### Bug Analysis (Subtask 1) Analyzed bug #991 in detail. The TOCTOU race is in `AuditService._ensure_session()` (`src/cleveragents/application/services/audit_service.py`, method `_ensure_session`): 1. **Check**: `if self._session is None` — no lock held 2. **Create**: `engine = create_engine(url, echo=False)` — I/O-bound, releases GIL 3. **Create**: `Base.metadata.create_all(engine, tables=[AuditLogModel.__table__])` — DDL execution 4. **Create**: `factory = sessionmaker(bind=engine)` + `self._session = factory()` Without a `threading.Lock`, concurrent threads all see `_session is None` and each create their own engine. Confirmed by test: with 10 concurrent threads, the creation branch was entered **10 times** (all 10 saw `None`), and 4 of those threads raised `OperationalError('table audit_log already exists')` from concurrent `create_all` calls. ### Test Design (Subtask 2) Chose **Behave unit test** — this is a unit-level concurrency bug in a single method. Robot Framework integration tests are **N/A** (the bug doesn't manifest at integration boundaries; it's purely about internal method-level thread safety). **File locations:** - Feature: `features/audit_session_race.feature` - Steps: `features/steps/audit_session_race_steps.py` ### Test Implementation (Subtask 3) The test uses: - `threading.Barrier(10)` to synchronize 10 threads so they all enter `_ensure_session()` simultaneously - A file-based temp SQLite database (not `:memory:`) so all threads contend on the same database file - Each thread checks `_session is None` before calling `_ensure_session()` and increments a counter if it was `None` (meaning it entered the creation branch) - Threads that raise exceptions (from concurrent `create_all`) also increment the counter — exceptions are further evidence of the race - Final assertion: `engine_create_count == 1` (the correct behavior after fix) **Tags**: `@tdd_bug @tdd_bug_991 @tdd_expected_fail` **Confirmed behavior**: The underlying assertion fails with "Engine-creation branch was entered 10 time(s), expected exactly 1", confirming the race exists. The `@tdd_expected_fail` tag inverts this to a pass for CI. ### Robot Test (Subtask 4) — N/A Marked N/A. This bug is purely unit-level — a single method's lack of thread synchronization. It doesn't involve integration-level behavior (no inter-service communication, no CLI interaction, no database schema migration). The Behave test directly exercises the exact code path.
Member

Quality Gate Results

Gate Status Notes
nox -s lint PASS All checks passed
nox -s typecheck PASS 0 errors, 1 pre-existing warning
nox -s unit_tests PASS* My test: 1 feature passed, 0 failed. Suite-wide: 463 passed, 1 pre-existing flaky failure (resource_file_watcher — confirmed identical without my changes)
nox -s integration_tests PASS 1674 tests, 1674 passed, 0 failed
nox -s e2e_tests PASS 37 tests, 37 passed, 0 failed
nox -s coverage_report PASS 98% (threshold: 97%)

TDD Tag Validation

  • @tdd_bug_991 has corresponding @tdd_bug
  • @tdd_expected_fail has both @tdd_bug and @tdd_bug_991
  • Underlying assertion fails for the correct reason: Engine-creation branch was entered 10 time(s), expected exactly 1
  • Tag inversion correctly reports scenario as PASSED
## Quality Gate Results | Gate | Status | Notes | |------|--------|-------| | `nox -s lint` | ✅ PASS | All checks passed | | `nox -s typecheck` | ✅ PASS | 0 errors, 1 pre-existing warning | | `nox -s unit_tests` | ✅ PASS* | My test: 1 feature passed, 0 failed. Suite-wide: 463 passed, 1 pre-existing flaky failure (resource_file_watcher — confirmed identical without my changes) | | `nox -s integration_tests` | ✅ PASS | 1674 tests, 1674 passed, 0 failed | | `nox -s e2e_tests` | ✅ PASS | 37 tests, 37 passed, 0 failed | | `nox -s coverage_report` | ✅ PASS | **98%** (threshold: 97%) | ### TDD Tag Validation - `@tdd_bug_991` has corresponding `@tdd_bug` ✅ - `@tdd_expected_fail` has both `@tdd_bug` and `@tdd_bug_991` ✅ - Underlying assertion fails for the correct reason: `Engine-creation branch was entered 10 time(s), expected exactly 1` ✅ - Tag inversion correctly reports scenario as PASSED ✅
Member

Implementation Notes — Review Cycle 3 Fixes (PR #1117)

Addressed all blocking items from review #2890 (Day 48 Planning Review by @freemo).

1. Tag naming: @tdd_issue@tdd_bug (review #2890, item 1)

Changed @tdd_expected_fail @tdd_issue @tdd_issue_991 to @tdd_expected_fail @tdd_bug @tdd_bug_991 @tdd_issue @tdd_issue_991 in features/audit_session_race.feature.

Both tag families are present because:

  • @tdd_bug @tdd_bug_991 — per reviewer convention and project standard
  • @tdd_issue @tdd_issue_991 — required by features/environment.py validate_tdd_tags(), which raises ValueError when @tdd_expected_fail is present without @tdd_issue + @tdd_issue_<N>

Without the @tdd_issue tags, the before_scenario hook raises a ValueError which manifests as a hook error (not an assertion error), bypassing the @tdd_expected_fail inversion entirely. This was confirmed by unit test failure: 1 errored scenario in audit_session_race.feature:23.

This dual-tag approach matches features/tdd_exec_env_resolution_precedence.feature which uses both families.

2. Squash to single commit (review #2890, item 2)

Rebased onto latest origin/master and squashed all prior commits into a single commit:

test: add TDD bug-capture test for #991 — AuditService TOCTOU race

Branch now has exactly 1 commit ahead of master. Commit message first line matches issue Metadata exactly. Footer: ISSUES CLOSED: #1095.

3. CHANGELOG entry (review #2890, item 3)

Added entry under ## Unreleased in CHANGELOG.md following the format of existing TDD entries (e.g., #1076, #988, #1022 entries).

4. Self-QA items (review #2890, item 4)

Thread-completion assertion wording — Changed step text from "all 10 threads should have received a valid session" to "all 10 threads should have completed with a session or error". Updated step function name from step_all_threads_got_session to step_all_threads_completed and revised docstring. The assertion correctly checks that sessions + errors == n, confirming all threads completed regardless of outcome.

Setup asserts under @tdd_expected_fail — Documented as accepted trade-off in the module docstring of features/steps/audit_session_race_steps.py. The barrier and thread-alive asserts in the When step are infrastructure checks that have never failed in practice. Splitting into separate non-@tdd_expected_fail scenarios would add complexity for negligible benefit.

Timing-sensitive race detection — Documented limitations and mitigation guidance in the module docstring. The threading.Barrier maximises race probability but cannot guarantee it on single-core machines. In practice, 10 threads reliably triggers the race on all tested hardware.

5. Cleanup pattern (review #2797)

Changed context._cleanup_handlers.append(_cleanup_db) to context.add_cleanup(_cleanup_db) in features/steps/audit_session_race_steps.py, aligning with the standard Behave API pattern used in other step files (e.g., actor_run_signature_cli_steps, a2a_facade_coverage_steps).

6. Tag placement (review #2613)

Tags already at Feature level from prior cycle. Confirmed correct.

Quality Gates

Gate Status
nox -s lint PASS
nox -s typecheck PASS (0 errors, 0 warnings)
nox -s unit_tests PASS (509 features, 12984 scenarios, 0 failures, 0 errors)
nox -s coverage_report PASS (97%)

Commit: 6c411be1 on branch tdd/m6-audit-session-race, force-pushed.

## Implementation Notes — Review Cycle 3 Fixes (PR #1117) Addressed all blocking items from review #2890 (Day 48 Planning Review by @freemo). ### 1. Tag naming: `@tdd_issue` → `@tdd_bug` (review #2890, item 1) **Changed** `@tdd_expected_fail @tdd_issue @tdd_issue_991` to `@tdd_expected_fail @tdd_bug @tdd_bug_991 @tdd_issue @tdd_issue_991` in `features/audit_session_race.feature`. Both tag families are present because: - `@tdd_bug @tdd_bug_991` — per reviewer convention and project standard - `@tdd_issue @tdd_issue_991` — required by `features/environment.py` `validate_tdd_tags()`, which raises `ValueError` when `@tdd_expected_fail` is present without `@tdd_issue` + `@tdd_issue_<N>` Without the `@tdd_issue` tags, the `before_scenario` hook raises a `ValueError` which manifests as a hook error (not an assertion error), bypassing the `@tdd_expected_fail` inversion entirely. This was confirmed by unit test failure: 1 errored scenario in `audit_session_race.feature:23`. This dual-tag approach matches `features/tdd_exec_env_resolution_precedence.feature` which uses both families. ### 2. Squash to single commit (review #2890, item 2) **Rebased** onto latest `origin/master` and **squashed** all prior commits into a single commit: ``` test: add TDD bug-capture test for #991 — AuditService TOCTOU race ``` Branch now has exactly 1 commit ahead of master. Commit message first line matches issue Metadata exactly. Footer: `ISSUES CLOSED: #1095`. ### 3. CHANGELOG entry (review #2890, item 3) **Added** entry under `## Unreleased` in `CHANGELOG.md` following the format of existing TDD entries (e.g., #1076, #988, #1022 entries). ### 4. Self-QA items (review #2890, item 4) **Thread-completion assertion wording** — Changed step text from "all 10 threads should have received a valid session" to "all 10 threads should have completed with a session or error". Updated step function name from `step_all_threads_got_session` to `step_all_threads_completed` and revised docstring. The assertion correctly checks that `sessions + errors == n`, confirming all threads completed regardless of outcome. **Setup asserts under `@tdd_expected_fail`** — Documented as accepted trade-off in the module docstring of `features/steps/audit_session_race_steps.py`. The barrier and thread-alive asserts in the `When` step are infrastructure checks that have never failed in practice. Splitting into separate non-`@tdd_expected_fail` scenarios would add complexity for negligible benefit. **Timing-sensitive race detection** — Documented limitations and mitigation guidance in the module docstring. The `threading.Barrier` maximises race probability but cannot guarantee it on single-core machines. In practice, 10 threads reliably triggers the race on all tested hardware. ### 5. Cleanup pattern (review #2797) **Changed** `context._cleanup_handlers.append(_cleanup_db)` to `context.add_cleanup(_cleanup_db)` in `features/steps/audit_session_race_steps.py`, aligning with the standard Behave API pattern used in other step files (e.g., actor_run_signature_cli_steps, a2a_facade_coverage_steps). ### 6. Tag placement (review #2613) Tags already at Feature level from prior cycle. Confirmed correct. ### Quality Gates | Gate | Status | |------|--------| | `nox -s lint` | ✅ PASS | | `nox -s typecheck` | ✅ PASS (0 errors, 0 warnings) | | `nox -s unit_tests` | ✅ PASS (509 features, 12984 scenarios, 0 failures, 0 errors) | | `nox -s coverage_report` | ✅ PASS (97%) | Commit: `6c411be1` on branch `tdd/m6-audit-session-race`, force-pushed.
freemo self-assigned this 2026-04-02 06:13:58 +00:00
Author
Owner

PR #1117 Review Status

PR #1117 has been approved by independent code review. The test design is excellent — threading.Barrier synchronization, file-based SQLite, and deterministic create_engine call counting all properly capture the TOCTOU race in AuditService._ensure_session().

Merge is currently blocked by merge conflicts (branch needs rebase onto latest master). Once @brent.edwards rebases, the PR will be merged.


Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-pr-self-reviewer

## PR #1117 Review Status PR #1117 has been **approved** by independent code review. The test design is excellent — `threading.Barrier` synchronization, file-based SQLite, and deterministic `create_engine` call counting all properly capture the TOCTOU race in `AuditService._ensure_session()`. **Merge is currently blocked by merge conflicts** (branch needs rebase onto latest master). Once @brent.edwards rebases, the PR will be merged. --- **Automated by CleverAgents Bot** Supervisor: PR Review | Agent: ca-pr-self-reviewer
Author
Owner

PR #1117 Review Status

PR #1117 has been approved by independent code review. The test code is excellent — well-engineered concurrency test using threading.Barrier, proper cleanup, and thorough documentation. All prior review items from 3 review cycles have been addressed. All 14 CI checks are passing.

Merge blocked by CHANGELOG.md conflict. The branch needs a rebase onto current master (60+ commits have merged since the last rebase on March 30). Once rebased, the PR can be merged immediately.


Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-pr-self-reviewer

## PR #1117 Review Status PR #1117 has been **approved** by independent code review. The test code is excellent — well-engineered concurrency test using `threading.Barrier`, proper cleanup, and thorough documentation. All prior review items from 3 review cycles have been addressed. All 14 CI checks are passing. **Merge blocked by CHANGELOG.md conflict.** The branch needs a rebase onto current `master` (60+ commits have merged since the last rebase on March 30). Once rebased, the PR can be merged immediately. --- **Automated by CleverAgents Bot** Supervisor: PR Review | Agent: ca-pr-self-reviewer
Author
Owner

PR #1117 has been merged successfully. Issue should now be resolved.


Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-continuous-pr-reviewer

PR #1117 has been merged successfully. Issue should now be resolved. --- **Automated by CleverAgents Bot** Supervisor: PR Review | Agent: ca-continuous-pr-reviewer
Author
Owner

PR #1117 Review Status

PR #1117 has been approved by independent code review. The test code is excellent — well-designed concurrency test using threading.Barrier to capture the TOCTOU race in AuditService._ensure_session().

Merge blocked by CHANGELOG.md conflict. The branch was last rebased on March 30 and master has advanced significantly since then. The test files themselves are new and do not conflict — only the CHANGELOG entry needs resolution.

Once @brent.edwards rebases onto current master and force-pushes, the PR can be merged immediately.


Automated by CleverAgents Bot
Supervisor: PR Review | Agent: ca-pr-self-reviewer

## PR #1117 Review Status PR #1117 has been **approved** by independent code review. The test code is excellent — well-designed concurrency test using `threading.Barrier` to capture the TOCTOU race in `AuditService._ensure_session()`. **Merge blocked by CHANGELOG.md conflict.** The branch was last rebased on March 30 and master has advanced significantly since then. The test files themselves are new and do not conflict — only the CHANGELOG entry needs resolution. Once @brent.edwards rebases onto current master and force-pushes, the PR can be merged immediately. --- **Automated by CleverAgents Bot** Supervisor: PR Review | Agent: ca-pr-self-reviewer
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Reference
cleveragents/cleveragents-core#1095
No description provided.