fix(test): resolve performance regression in BDD test database initialization #726

Closed
opened 2026-03-12 12:24:07 +00:00 by CoreRasurae · 0 comments
Member

Summary

Fix a performance regression in the BDD test infrastructure that causes redundant Alembic migration checks on every CLI command invocation within a scenario, leading to intermittent hangs, SQLite lock contention, and CI failures in parallel test runs.

Background and Context

The BDD test suite uses behave-parallel with multiprocessing.Pool (fork start method) to split ~280 feature files across 16 worker processes. A pre-migrated SQLite template database (build/.template-migrated.db) is created by scripts/create_template_db.py with all tables and an Alembic HEAD stamp. The _install_template_db_patch() function in features/environment.py monkey-patches MigrationRunner.init_or_upgrade to copy this template instead of running full Alembic migrations for each scenario, drastically reducing per-scenario setup time.

Current Behavior (Bug)

The _fast_init_or_upgrade function in features/environment.py has a critical flaw at line 216-217. When a test database file already exists and is non-empty (created by a prior step such as I have initialized a project), the patch falls through to the original _original_init_or_upgrade():

if db_path.exists() and db_path.stat().st_size > 0:
    return _original_init_or_upgrade(self, **kwargs)  # BUG: redundant migration check

This causes:

  1. Redundant Alembic migration checks — Every subsequent CLI command in the same scenario (e.g., cleveragents project status after cleveragents init) triggers a full Alembic migration version check, even though the database is already fully migrated. Confirmed by 7+ pairs of INFO [alembic.runtime.migration] log lines per scenario.
  2. SQLite file contention — Multiple SQLAlchemy engines open connections to the same SQLite file: one from the Given step's init_command and another from the CLI's _run_cleveragents_inprocess. Under parallel execution, this creates lock contention.
  3. Cumulative overhead — Across 280+ feature files with thousands of scenarios, the redundant engine creation and Alembic checks add up significantly.
  4. Intermittent CI failures — SQLite lock contention under parallel execution causes DatabaseError, triggering @database_retry decorators (with wait_fixed(0.5) × 3 attempts on ~100 repository methods), leading to hangs and timeouts.

Affected Scenarios

All 11 failing scenarios in features/core_cli_commands.feature are ones that invoke a CLI command after a prior step has already created and populated a database:

  • Line 40: Force reinitialize a project
  • Line 47: Project status command works
  • Line 62: Context add command works with files
  • Line 72: Context list shows added files
  • Line 83: Context clear removes all files
  • Line 93: Plan tell creates a new plan
  • Line 101: Plan build generates changes
  • Line 111: Plan apply writes changes
  • Line 120: Plan list shows all plans
  • Line 128: Plan current shows active plan
  • Line 137: Shortcuts work for common commands

The first 3 scenarios in the feature pass because they either don't require a pre-existing project or expect failure (no redundant migration checks).

Expected Behavior

When a test database already exists and is non-empty, _fast_init_or_upgrade should skip the Alembic migration check entirely, since the database was either template-copied (already at HEAD) or created by Base.metadata.create_all() in a prior step.

Fix Applied

In features/environment.py, the existing-non-empty-database branch now returns immediately instead of delegating to the original migration runner:

# BEFORE (causes redundant Alembic checks):
if db_path.exists() and db_path.stat().st_size > 0:
    return _original_init_or_upgrade(self, **kwargs)

# AFTER (skip check for existing test databases):
if db_path.exists() and db_path.stat().st_size > 0:
    return  # Already migrated — skip redundant Alembic check

This is safe because:

  • Template-copied databases are already at Alembic HEAD
  • Databases created by init_command in a prior step already ran the full migration
  • The guard _SCENARIO_DB_PREFIXES check above ensures this only applies to scenario test databases, not migration-runner unit tests

Verification

  • 15/15 scenarios pass with --processes 1 (sequential)
  • 15/15 scenarios pass with --processes 16 (parallel, default)
  • Execution time: ~1.7s for all scenarios (previously subject to hangs/timeouts)

Acceptance Criteria

  • All 15 scenarios in features/core_cli_commands.feature pass consistently
  • Tests pass under both sequential and parallel execution
  • No redundant Alembic migration checks for existing test databases
  • Non-scenario databases (migration-runner unit tests) are unaffected

Supporting Information

Key files involved

  • features/environment.py — contains the fix in _fast_init_or_upgrade()
  • features/core_cli_commands.feature — the affected feature file
  • features/steps/cli_plan_context_commands_steps.py — contains _run_cleveragents_inprocess()
  • src/cleveragents/infrastructure/database/migration_runner.py — the original MigrationRunner.init_or_upgrade
  • src/cleveragents/core/retry_patterns.py@database_retry with wait_fixed(0.5), 3 attempts

Metadata

  • Commit Message: fix(test): skip redundant Alembic migration check for existing test databases
  • Branch: fix/test-db-performance-regression

Subtasks

  • Identify root cause of intermittent BDD test failures
  • Implement fix in features/environment.py
  • Verify fix with sequential execution (--processes 1)
  • Verify fix with parallel execution (--processes 16)

Definition of Done

  • All scenarios in features/core_cli_commands.feature pass consistently
  • Fix does not affect non-test database initialization paths
  • Change is a single, minimal edit to features/environment.py
## Summary Fix a performance regression in the BDD test infrastructure that causes redundant Alembic migration checks on every CLI command invocation within a scenario, leading to intermittent hangs, SQLite lock contention, and CI failures in parallel test runs. ## Background and Context The BDD test suite uses `behave-parallel` with `multiprocessing.Pool` (fork start method) to split ~280 feature files across 16 worker processes. A pre-migrated SQLite template database (`build/.template-migrated.db`) is created by `scripts/create_template_db.py` with all tables and an Alembic HEAD stamp. The `_install_template_db_patch()` function in `features/environment.py` monkey-patches `MigrationRunner.init_or_upgrade` to copy this template instead of running full Alembic migrations for each scenario, drastically reducing per-scenario setup time. ## Current Behavior (Bug) The `_fast_init_or_upgrade` function in `features/environment.py` has a critical flaw at line 216-217. When a test database file **already exists and is non-empty** (created by a prior step such as `I have initialized a project`), the patch falls through to the original `_original_init_or_upgrade()`: ```python if db_path.exists() and db_path.stat().st_size > 0: return _original_init_or_upgrade(self, **kwargs) # BUG: redundant migration check ``` This causes: 1. **Redundant Alembic migration checks** — Every subsequent CLI command in the same scenario (e.g., `cleveragents project status` after `cleveragents init`) triggers a full Alembic migration version check, even though the database is already fully migrated. Confirmed by 7+ pairs of `INFO [alembic.runtime.migration]` log lines per scenario. 2. **SQLite file contention** — Multiple SQLAlchemy engines open connections to the same SQLite file: one from the Given step's `init_command` and another from the CLI's `_run_cleveragents_inprocess`. Under parallel execution, this creates lock contention. 3. **Cumulative overhead** — Across 280+ feature files with thousands of scenarios, the redundant engine creation and Alembic checks add up significantly. 4. **Intermittent CI failures** — SQLite lock contention under parallel execution causes `DatabaseError`, triggering `@database_retry` decorators (with `wait_fixed(0.5)` × 3 attempts on ~100 repository methods), leading to hangs and timeouts. ### Affected Scenarios All 11 failing scenarios in `features/core_cli_commands.feature` are ones that invoke a CLI command **after** a prior step has already created and populated a database: - Line 40: Force reinitialize a project - Line 47: Project status command works - Line 62: Context add command works with files - Line 72: Context list shows added files - Line 83: Context clear removes all files - Line 93: Plan tell creates a new plan - Line 101: Plan build generates changes - Line 111: Plan apply writes changes - Line 120: Plan list shows all plans - Line 128: Plan current shows active plan - Line 137: Shortcuts work for common commands The first 3 scenarios in the feature pass because they either don't require a pre-existing project or expect failure (no redundant migration checks). ## Expected Behavior When a test database already exists and is non-empty, `_fast_init_or_upgrade` should skip the Alembic migration check entirely, since the database was either template-copied (already at HEAD) or created by `Base.metadata.create_all()` in a prior step. ## Fix Applied In `features/environment.py`, the existing-non-empty-database branch now returns immediately instead of delegating to the original migration runner: ```python # BEFORE (causes redundant Alembic checks): if db_path.exists() and db_path.stat().st_size > 0: return _original_init_or_upgrade(self, **kwargs) # AFTER (skip check for existing test databases): if db_path.exists() and db_path.stat().st_size > 0: return # Already migrated — skip redundant Alembic check ``` This is safe because: - Template-copied databases are already at Alembic HEAD - Databases created by `init_command` in a prior step already ran the full migration - The guard `_SCENARIO_DB_PREFIXES` check above ensures this only applies to scenario test databases, not migration-runner unit tests ## Verification - **15/15 scenarios pass** with `--processes 1` (sequential) - **15/15 scenarios pass** with `--processes 16` (parallel, default) - Execution time: ~1.7s for all scenarios (previously subject to hangs/timeouts) ## Acceptance Criteria - [x] All 15 scenarios in `features/core_cli_commands.feature` pass consistently - [x] Tests pass under both sequential and parallel execution - [x] No redundant Alembic migration checks for existing test databases - [x] Non-scenario databases (migration-runner unit tests) are unaffected ## Supporting Information ### Key files involved - `features/environment.py` — contains the fix in `_fast_init_or_upgrade()` - `features/core_cli_commands.feature` — the affected feature file - `features/steps/cli_plan_context_commands_steps.py` — contains `_run_cleveragents_inprocess()` - `src/cleveragents/infrastructure/database/migration_runner.py` — the original `MigrationRunner.init_or_upgrade` - `src/cleveragents/core/retry_patterns.py` — `@database_retry` with `wait_fixed(0.5)`, 3 attempts ## Metadata - **Commit Message**: `fix(test): skip redundant Alembic migration check for existing test databases` - **Branch**: `fix/test-db-performance-regression` ## Subtasks - [x] Identify root cause of intermittent BDD test failures - [x] Implement fix in `features/environment.py` - [x] Verify fix with sequential execution (`--processes 1`) - [x] Verify fix with parallel execution (`--processes 16`) ## Definition of Done - [x] All scenarios in `features/core_cli_commands.feature` pass consistently - [x] Fix does not affect non-test database initialization paths - [x] Change is a single, minimal edit to `features/environment.py`
CoreRasurae added this to the v3.2.0 milestone 2026-03-12 12:36:09 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#726
No description provided.