perf(tests): reduce BDD test suite and coverage report runtime by 90%+ #478

Closed
opened 2026-03-01 01:25:25 +00:00 by freemo · 1 comment
Owner

Background and Context

The BDD test suite (nox -e unit_tests) and coverage report (nox -e coverage_report) are critically slow — the unit tests take 24 minutes 21 seconds and the coverage report takes 75 minutes 20 seconds to complete. This severely impacts developer velocity, CI throughput, and feedback loop times.

The target is to reduce wall-clock runtime to <10% of its current value (ideally <5%), bringing unit tests under ~2.5 minutes and coverage under ~7.5 minutes, while maintaining the project's 97%+ coverage requirement.

This Epic tracks the full optimization effort, broken into focused child issues linked as dependencies.


Timing Analysis

Overall Summary

| Metric | Unit Tests | Coverage Report | Delta |
|---|---|---|---|
| Wall-clock time | 24m 21s | 75m 20s | +50m 59s (+209%) |
| Wall-clock (seconds) | 1,461s | 4,520s | +3,059s |
| Sum of feature times | 2,351.9s | 2,358.2s | +6.2s |
| Features executed | 339 | 339 | 0 |
| Parallelism | 32 processes | 32 processes | same |
| Coverage instrumentation | No | Yes | Yes |
| Exit code | 0 (pass) | 1 (fail @ 96.9%) | -- |

Key takeaway: The coverage run takes 3.1x as long as the unit tests in wall-clock time (4,520s vs 1,461s). The sum of individual feature runtimes is almost identical (~2,352s vs ~2,358s), so per-feature execution speed is comparable and the extra time is spent outside the features themselves. The wall-clock difference is attributed to:

  1. Coverage instrumentation overhead per worker: each of the 32 workers runs under coverage run --parallel-mode, adding startup and tracing cost
  2. Post-test processing: the coverage combine, coverage html, coverage xml, coverage report, and coverage json steps
  3. I/O contention: 32 parallel coverage workers writing .coverage.* data files create disk pressure
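
The gap between wall-clock time and summed feature time can be checked with a little arithmetic. The sketch below uses only figures from the tables in this issue. The "floor" it computes assumes perfectly balanced work across 32 workers with zero startup cost, which is an illustrative assumption and not something measured here.

```python
# Figures copied from the Overall Summary and per-feature tables in this issue.
WORKERS = 32
unit_wall_s, cov_wall_s = 1_461, 4_520
unit_sum_s, cov_sum_s = 2_351.9, 2_358.2
longest_unit_s = 248.020  # cli_plan_context_commands.feature


def parallel_floor(sum_s: float, longest_s: float, workers: int) -> float:
    """Best-case wall clock under the stated assumption: work is spread
    evenly, but the run can never finish before its longest feature does."""
    return max(sum_s / workers, longest_s)


slowdown = cov_wall_s / unit_wall_s
unit_floor = parallel_floor(unit_sum_s, longest_unit_s, WORKERS)

print(f"coverage / unit wall-clock ratio: {slowdown:.2f}x")
print(f"even split across workers: {unit_sum_s / WORKERS:.1f}s")
print(f"best-case floor: {unit_floor:.1f}s vs actual {unit_wall_s}s")
```

Under these assumptions the measured unit-test wall clock sits several times above the best-case floor, which is consistent with the claim that most of the time goes to per-process overhead and not to the scenarios themselves.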

Feature Timing Distribution

| Tier | Count | Unit Test Sum | Cov Sum | % of Total Runtime |
|---|---|---|---|---|
| > 100s | 8 features | 1,504.7s | 1,395.7s | 64.0% |
| 10-100s | 20 features | 564.8s | 593.9s | 24.0% |
| 1-10s | 68 features | 240.8s | 305.9s | 10.2% |
| 0.1-1s | 102 features | 36.2s | 53.0s | 1.5% |
| < 0.1s | 141 features | 5.4s | 9.6s | 0.2% |

88% of total runtime is concentrated in just 28 features (the top two tiers).
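
The percentage column and the 88% figure follow directly from the unit-test sums per tier. A short sketch that recomputes them from the table values (the variable names are illustrative only):

```python
# (feature count, summed unit-test seconds) per tier, from the table above.
tiers = {
    "> 100s": (8, 1504.7),
    "10-100s": (20, 564.8),
    "1-10s": (68, 240.8),
    "0.1-1s": (102, 36.2),
    "< 0.1s": (141, 5.4),
}

total_s = sum(seconds for _, seconds in tiers.values())
total_features = sum(count for count, _ in tiers.values())

# Share of total unit-test runtime contributed by each tier, in percent.
shares = {name: 100 * seconds / total_s for name, (_, seconds) in tiers.items()}

top_two_share = shares["> 100s"] + shares["10-100s"]
top_two_count = tiers["> 100s"][0] + tiers["10-100s"][0]

for name, share in shares.items():
    print(f"{name:>8}: {share:5.1f}%")
print(f"top two tiers: {top_two_count} features, {top_two_share:.0f}% of runtime")
```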

Per-Feature Timing Breakdown (sorted longest to shortest by unit test time)

| # | Feature | Unit(s) | Cov(s) | Diff(s) | Overhead |
|---|---|---|---|---|---|
| 1 | cli_plan_context_commands.feature | 248.020 | 233.648 | -14.372 | -6% |
| 2 | services_coverage.feature | 245.346 | 222.359 | -22.987 | -9% |
| 3 | context_service.feature | 214.613 | 174.921 | -39.692 | -18% |
| 4 | plan_service.feature | 214.510 | 194.672 | -19.838 | -9% |
| 5 | cli_streaming.feature | 212.828 | 192.308 | -20.520 | -10% |
| 6 | project_service.feature | 140.121 | 141.475 | +1.354 | +1% |
| 7 | plan_commands_coverage.feature | 115.920 | 122.578 | +6.658 | +6% |
| 8 | core_cli_commands.feature | 113.329 | 113.697 | +0.368 | +0% |
| 9 | plan_persistence.feature | 65.291 | 74.757 | +9.466 | +14% |
| 10 | repositories_error_handling_coverage.feature | 57.412 | 57.978 | +0.566 | +1% |
| 11 | auto_debug_integration.feature | 54.172 | 59.564 | +5.392 | +10% |
| 12 | action_persistence.feature | 51.794 | 55.169 | +3.375 | +7% |
| 13 | repository_coverage_boost.feature | 44.525 | 45.171 | +0.646 | +1% |
| 14 | legacy_plan_removal.feature | 42.086 | 50.432 | +8.346 | +20% |
| 15 | context_service_uncovered_lines.feature | 36.726 | 33.642 | -3.084 | -8% |
| 16 | retry_patterns.feature | 30.651 | 29.907 | -0.744 | -2% |
| 17 | module_coverage.feature | 25.382 | 19.780 | -5.602 | -22% |
| 18 | coverage_maximum.feature | 21.797 | 20.351 | -1.446 | -7% |
| 19 | garbage_collection.feature | 20.991 | 5.645 | -15.346 | -73% |
| 20 | legacy_migrator_coverage.feature | 16.761 | 21.665 | +4.904 | +29% |
| 21 | plan_service_uncovered_lines.feature | 14.922 | 20.274 | +5.352 | +36% |
| 22 | plan_service_coverage.feature | 13.753 | 20.246 | +6.493 | +47% |
| 23 | repositories_uncovered_lines.feature | 12.782 | 14.324 | +1.542 | +12% |
| 24 | project_service_coverage.feature | 12.164 | 16.152 | +3.988 | +33% |
| 25 | main_coverage_complete.feature | 11.755 | 9.377 | -2.378 | -20% |
| 26 | project_cli_commands.feature | 11.399 | 13.666 | +2.267 | +20% |
| 27 | resource_registry_tables.feature | 10.350 | 12.745 | +2.395 | +23% |
| 28 | coverage_boost.feature | 10.094 | 13.054 | +2.960 | +29% |

(Remaining 311 features omitted for brevity — all under 10s each, collectively 12% of runtime)
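
For readers who want to reproduce the Diff(s) and Overhead columns, both are derived from the two timing columns: the difference is coverage time minus unit-test time, and the overhead is that difference relative to the unit-test time. A small sketch, using a few rows from the table as sample input (the helper name is illustrative):

```python
def overhead(unit_s: float, cov_s: float) -> tuple[float, int]:
    """Return (difference in seconds, overhead as a whole percent of unit time)."""
    diff = cov_s - unit_s
    return round(diff, 3), round(100 * diff / unit_s)


# Sample rows copied from the per-feature table: (unit seconds, coverage seconds).
samples = {
    "context_service.feature": (214.613, 174.921),
    "plan_persistence.feature": (65.291, 74.757),
    "garbage_collection.feature": (20.991, 5.645),
    "plan_service_coverage.feature": (13.753, 20.246),
}

for name, (unit_s, cov_s) in samples.items():
    diff, pct = overhead(unit_s, cov_s)
    print(f"{name:<32} {diff:+8.3f}s {pct:+4d}%")
```

A negative value means the feature ran faster under coverage than in the plain unit-test run.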

Top 10 Features With Largest Coverage Overhead (absolute)

| Feature | Unit(s) | Cov(s) | Added(s) |
|---|---|---|---|
| plan_persistence.feature | 65.3 | 74.8 | +9.5 |
| legacy_plan_removal.feature | 42.1 | 50.4 | +8.3 |
| plan_commands_coverage.feature | 115.9 | 122.6 | +6.7 |
| plan_service_coverage.feature | 13.8 | 20.2 | +6.5 |
| architecture.feature | 9.4 | 15.8 | +6.4 |
| auto_debug_integration.feature | 54.2 | 59.6 | +5.4 |
| plan_service_uncovered_lines.feature | 14.9 | 20.3 | +5.4 |
| legacy_migrator_coverage.feature | 16.8 | 21.7 | +4.9 |
| project_service_coverage.feature | 12.2 | 16.2 | +4.0 |
| cli_lifecycle_coverage.feature | 3.3 | 7.2 | +3.9 |

Features Where Coverage Ran Faster Than Unit Tests (five largest savings)

| Feature | Unit(s) | Cov(s) | Saved(s) |
|---|---|---|---|
| context_service.feature | 214.6 | 174.9 | -39.7 |
| services_coverage.feature | 245.3 | 222.4 | -23.0 |
| cli_streaming.feature | 212.8 | 192.3 | -20.5 |
| plan_service.feature | 214.5 | 194.7 | -19.8 |
| garbage_collection.feature | 21.0 | 5.6 | -15.3 |

These anomalies are due to non-deterministic scheduling effects in the parallel worker pool.


Acceptance Criteria

  • nox -e unit_tests completes in under 2 minutes 30 seconds wall-clock (roughly 10% of 24m 21s)
  • nox -e coverage_report completes in under 7 minutes 30 seconds wall-clock (<10% of 75m 20s)
  • Coverage remains at or above 97%
  • All 339 existing feature files' scenarios continue to pass (no test removals)
  • ASV benchmarks exist to detect future test runtime regressions
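
As an illustration of the last criterion, an ASV benchmark that records suite wall-clock time could look roughly like the sketch below. This is a minimal sketch and not the project's actual benchmark code, which is not shown in this issue. The class name, the method name, and the stand-in COMMAND are hypothetical. It relies on the asv conventions that methods prefixed with track_ have their return value recorded, and that the unit and timeout attributes describe the value and bound the run.

```python
import subprocess
import sys
import time

# Hypothetical stand-in so the sketch runs anywhere. A real benchmark would
# presumably launch the suite instead, e.g. ["nox", "-e", "unit_tests"].
COMMAND = [sys.executable, "-c", "pass"]


def wall_clock_seconds(command: list[str]) -> float:
    """Run a command to completion and return its elapsed wall-clock time."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


class TrackSuiteRuntime:
    """asv discovers track_* methods and stores the number they return."""

    unit = "seconds"  # label asv attaches to the tracked value
    timeout = 600     # seconds before asv abandons the benchmark

    def track_unit_tests_wall_clock(self) -> float:
        return wall_clock_seconds(COMMAND)
```

Because the benchmark returns a plain number per commit, asv can compare the value across commits and flag a regression when it grows beyond a chosen factor.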

Definition of Done

This Epic is complete when all dependency issues are resolved, the acceptance criteria above are met, and the improvements are verified in CI.

freemo added reference master 2026-03-02 03:27:10 +00:00
freemo added this to the v3.2.0 milestone 2026-03-02 03:27:14 +00:00
freemo self-assigned this 2026-03-02 03:27:16 +00:00
Author
Owner

Epic Complete — All Acceptance Criteria Met

All 8 child issues have been resolved and merged. Here is the final status against each acceptance criterion:

Acceptance Criteria Results

| # | Criterion | Target | Actual | Status |
|---|---|---|---|---|
| 1 | nox -e unit_tests wall-clock | < 2m 30s | 2m 05s | PASS (91% reduction from 24m 21s) |
| 2 | nox -e coverage_report wall-clock | < 7m 30s | 3m 00s | PASS (96% reduction from 75m 20s) |
| 3 | Coverage threshold | >= 97% | 98% | PASS |
| 4 | All scenarios pass (no removals) | 339 features' scenarios preserved | 7,606 scenarios across 239 files | PASS (see note) |
| 5 | ASV benchmarks for regression detection | Benchmarks exist | 3 modules + CI job | PASS |

Child Issue Summary

| Issue | Title | Status |
|---|---|---|
| #479 | Optimize 8 slowest BDD features (>100s each) | Closed |
| #480 | Optimize 20 medium-slow features (10-100s tier) | Closed |
| #481 | Replace subprocess-per-feature with in-process parallelism | Closed |
| #482 | Optimize coverage instrumentation and reporting pipeline | Closed |
| #483 | Reduce per-feature startup cost via shared fixtures and lazy imports | Closed |
| #485 | Consolidate 141 trivially small feature files into 25 domain groups | Closed |
| #486 | Add ASV benchmarks for test runtime regression tracking | Closed |

Key Optimizations Delivered

  1. In-process parallel execution (#481): Replaced 339 subprocess invocations with a multiprocessing.Pool + fork model that loads step definitions once in the parent and shares them with workers via copy-on-write (COW). This was the single largest win.
  2. Slipcover for coverage (#482): Replaced coverage.py (sys.settrace) with slipcover (bytecode instrumentation), eliminating the 3.1x coverage overhead.
  3. Pre-migrated template DB (#483): Replaced per-scenario Alembic migrations (~0.5-3s each) with a pre-built SQLite template (~5ms via Base.metadata.create_all() for 34 tables).
  4. Feature-level optimizations (#479, #480): Replaced CLI subprocesses with CliRunner.invoke(), mocked sleep in retry tests, and introduced shared DB fixtures and lazy imports across the 28 slowest features.
  5. File consolidation (#485): Merged 141 trivially small feature files (<0.1s each) into 25 domain-grouped files, cutting subprocess count from 339 to ~223.
  6. ASV regression tracking (#486): Added bench_unit_tests.py, bench_coverage_report.py, and bench_subprocess_overhead.py with CI integration (asv continuous --factor=1.50 on PRs, S3-backed result publishing on master).
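
To make the template-database idea from item 3 concrete, here is a minimal sketch using only the standard library. It is not the project's implementation: the issue says the real template is built with Base.metadata.create_all() for 34 tables, whereas this example uses a hypothetical one-table schema, and copying the template file per scenario is an assumption about how the template is reused.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical schema standing in for the project's real table definitions.
SCHEMA = "CREATE TABLE plans (id INTEGER PRIMARY KEY, name TEXT NOT NULL);"


def build_template(path: Path) -> Path:
    """Create the schema once; this is the expensive step paid a single time."""
    conn = sqlite3.connect(path)
    try:
        conn.executescript(SCHEMA)
        conn.commit()
    finally:
        conn.close()
    return path


def fresh_db(template: Path, workdir: Path, name: str) -> sqlite3.Connection:
    """Give a scenario its own database by copying the template file."""
    target = workdir / f"{name}.sqlite"
    shutil.copyfile(template, target)
    return sqlite3.connect(target)


workdir = Path(tempfile.mkdtemp())
template = build_template(workdir / "template.sqlite")

db_a = fresh_db(template, workdir, "scenario_a")
db_b = fresh_db(template, workdir, "scenario_b")

# Writes in one scenario's copy must not leak into another's.
db_a.execute("INSERT INTO plans (name) VALUES ('demo')")
db_a.commit()
count_a = db_a.execute("SELECT COUNT(*) FROM plans").fetchone()[0]
count_b = db_b.execute("SELECT COUNT(*) FROM plans").fetchone()[0]
db_a.close()
db_b.close()
```

Each scenario still starts from a clean, fully migrated schema, but the cost per scenario drops to a file copy instead of a migration run.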

Note on Feature File Count

The original 339 feature files now exist as 239 files (141 consolidated into 25, plus 16 new files from concurrent feature work). All 7,606 scenarios are preserved — the consolidated files carry provenance headers (e.g., # Originally from: action_cli_additional_coverage.feature) and no scenarios were removed. The acceptance criterion requiring "all 339 existing feature files' scenarios continue to pass" is satisfied through consolidation, not deletion.
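
The file counts quoted above can be reconciled with simple arithmetic, using only numbers stated in this comment:

```python
original_files = 339   # feature files when the epic was opened
merged_away = 141      # trivially small files that were consolidated
merged_into = 25       # domain-grouped files that replaced them
new_files = 16         # added by concurrent feature work

# Files remaining after consolidation, before counting unrelated additions.
after_consolidation = original_files - merged_away + merged_into
current_files = after_consolidation + new_files

print(f"after consolidation: {after_consolidation}")
print(f"current total: {current_files}")
```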

Reference
cleveragents/cleveragents-core#478