Integrate Benchmark Regression Testing into CI Pipeline #1991

Open
opened 2026-04-03 00:31:57 +00:00 by freemo · 1 comment
Owner

Metadata

  • Branch: task/ci-benchmark-regression
  • Commit Message: feat(ci): integrate benchmark regression testing into CI pipeline
  • Milestone: v3.8.0
  • Parent Epic: #1678

Background and Context

The project mandates multi-level testing for every coding task, explicitly including performance benchmarks as a non-optional part of the definition of done (see CONTRIBUTING.md: "Every coding task must include or update tests at multiple levels: unit tests, integration tests, and performance benchmarks"). Unit and integration tests are currently enforced in CI, but there is no automated mechanism to detect performance regressions introduced by new commits or pull requests.

Without benchmark regression gating in CI, performance degradations can silently accumulate across the plan lifecycle, actor runtime, tool execution, and resource registry subsystems — all of which are on the critical path for CleverAgents' autonomous execution model. This is especially important as the project scales toward the ACMS, decision tree, and multi-agent workstreams.

Expected Behavior

The CI pipeline should automatically run the project's performance benchmark suite on every pull request and compare results against a stored baseline. If any benchmark regresses beyond a configurable threshold, the CI check should fail and block the PR from merging, surfacing the regression to the author and reviewers.
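
The comparison described above could look roughly like the following. This is a minimal sketch, not the project's implementation. It assumes benchmark results are available as a mapping from benchmark name to mean wall-clock seconds, and it assumes only slowdowns count as regressions. The names `Regression` and `find_regressions` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Regression(NamedTuple):
    name: str
    baseline_s: float
    current_s: float
    change: float  # fractional change, e.g. 0.25 means 25% slower


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.10,
) -> List[Regression]:
    """Return benchmarks whose time grew by more than `threshold`.

    Assumption: both mappings are {benchmark name: mean seconds}.
    """
    regressions = []
    for name, current_s in sorted(current.items()):
        baseline_s = baseline.get(name)
        if baseline_s is None or baseline_s <= 0:
            # New benchmark or unusable baseline: nothing to compare against.
            continue
        change = (current_s - baseline_s) / baseline_s
        if change > threshold:
            regressions.append(Regression(name, baseline_s, current_s, change))
    return regressions


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    baseline = {"plan_lifecycle": 1.00, "resource_registry": 0.50}
    current = {"plan_lifecycle": 1.25, "resource_registry": 0.52}
    for r in find_regressions(baseline, current):
        print(f"{r.name}: {r.change:+.1%}")
```

A CI step would fail the check whenever the returned list is non-empty. Benchmarks that are missing from the baseline are skipped here rather than failed, which is one possible policy among several.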

Acceptance Criteria

  • A nox session (e.g., nox -s benchmarks) exists that runs the full benchmark suite via the project's designated task runner.
  • CI workflow runs the benchmark session on every pull request targeting master.
  • Benchmark results are stored as CI artifacts and compared against a persisted baseline.
  • A configurable regression threshold (e.g., ±10% wall-clock time) triggers a CI failure when exceeded.
  • Baseline results are updated automatically when a PR is merged to master (or via a manual trigger).
  • Benchmark results are reported in a human-readable summary posted to the PR (e.g., as a CI job summary or comment).
  • All existing nox sessions continue to pass; no regressions introduced.
  • Coverage remains ≥ 97%.
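
For the human-readable summary criterion, one option is to render the per-benchmark deltas as a Markdown table that a workflow step can post as a job summary or comment. The sketch below assumes the same `{benchmark name: mean seconds}` result shape as an illustration. The function name `render_summary` and the column layout are hypothetical, and how the text reaches the PR depends on the CI system.

```python
from typing import Dict


def render_summary(
    baseline: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.10,
) -> str:
    """Build a Markdown table of per-benchmark deltas against the baseline."""
    lines = [
        "| Benchmark | Baseline (s) | Current (s) | Change | Status |",
        "| --- | --- | --- | --- | --- |",
    ]
    for name in sorted(current):
        current_s = current[name]
        baseline_s = baseline.get(name)
        if baseline_s is None or baseline_s <= 0:
            # No usable baseline yet, so report the benchmark as new.
            lines.append(f"| {name} | n/a | {current_s:.4f} | n/a | new |")
            continue
        change = (current_s - baseline_s) / baseline_s
        status = "REGRESSION" if change > threshold else "ok"
        lines.append(
            f"| {name} | {baseline_s:.4f} | {current_s:.4f} "
            f"| {change:+.1%} | {status} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(render_summary({"plan_lifecycle": 1.0}, {"plan_lifecycle": 1.25}))
```

Printing to standard output keeps the sketch independent of any particular CI feature; the workflow can redirect the text wherever the summary should appear.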

Supporting Information

  • Related epic: #1678 (CI Execution Time Optimization — Timeouts, Concurrency, and Coverage Artifact Sharing)
  • CONTRIBUTING.md mandates performance benchmarks as part of every task's definition of done.
  • The project uses nox as its task runner; all test and quality sessions must be invoked through it.
  • Benchmark scenarios should cover the core subsystems most sensitive to performance: plan lifecycle state machine, actor tool-calling runtime, resource registry DAG queries, and decision recording.

Subtasks

  • Audit existing benchmark code/fixtures and identify gaps in coverage for key subsystems (plan lifecycle, actor runtime, resource registry, decision recording)
  • Implement or extend BDD-style benchmark scenarios (Gherkin feature files + step definitions) for each identified subsystem
  • Add nox -s benchmarks session to noxfile.py that executes the benchmark suite and outputs structured results (e.g., JSON)
  • Add CI workflow step (.forgejo/workflows/ or equivalent) to run nox -s benchmarks on every PR
  • Implement baseline storage mechanism (e.g., artifact upload/download, committed JSON baseline file, or CI cache)
  • Implement regression comparison logic with configurable threshold (default ±10%)
  • Configure CI to fail the benchmark check when regression threshold is exceeded
  • Add automatic baseline update step triggered on merge to master
  • Add PR summary/comment reporting benchmark delta results
  • Tests (Behave): Add BDD scenarios for the benchmark runner and regression detection logic itself
  • Tests (Robot): Add integration test verifying the nox benchmark session executes end-to-end
  • Update documentation (README, CONTRIBUTING, or relevant docs) to describe the benchmark CI workflow
  • Verify coverage ≥ 97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors
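
The baseline storage, threshold configuration, and CI failure subtasks could be tied together by a small gate script. The sketch below assumes the committed-JSON-file option for the baseline and a hypothetical environment variable `BENCH_REGRESSION_THRESHOLD` for the threshold. The function names and file layout are illustrative, not taken from the repository.

```python
import json
import os
import sys
from pathlib import Path
from typing import Dict


def load_results(path: Path) -> Dict[str, float]:
    """Read a {benchmark name: seconds} mapping; a missing file means no data."""
    if not path.exists():
        return {}
    raw = json.loads(path.read_text())
    return {name: float(value) for name, value in raw.items()}


def save_baseline(path: Path, results: Dict[str, float]) -> None:
    """Persist results as the new baseline (for example after a merge)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2, sort_keys=True) + "\n")


def threshold_from_env(default: float = 0.10) -> float:
    # BENCH_REGRESSION_THRESHOLD is a hypothetical variable name.
    return float(os.environ.get("BENCH_REGRESSION_THRESHOLD", default))


def gate(baseline_path: Path, results_path: Path,
         update_baseline: bool = False) -> int:
    """Return a process exit code: 0 to pass the check, 1 to fail it."""
    current = load_results(results_path)
    if update_baseline:
        save_baseline(baseline_path, current)
        return 0
    baseline = load_results(baseline_path)
    limit = 1 + threshold_from_env()
    failed = sorted(
        name
        for name, seconds in current.items()
        if baseline.get(name, 0) > 0 and seconds > baseline[name] * limit
    )
    for name in failed:
        print(f"regression: {name}", file=sys.stderr)
    return 1 if failed else 0
```

On pull requests the workflow would call `sys.exit(gate(baseline_path, results_path))`, and on merge to master it would call `gate(..., update_baseline=True)`. A missing baseline passes the check in this sketch, so the first run on a fresh repository does not block merging.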

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly (feat(ci): integrate benchmark regression testing into CI pipeline), followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly (task/ci-benchmark-regression).
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
  • All nox stages pass.
  • Coverage ≥ 97%.

Automated by CleverAgents Bot
Supervisor: Unknown | Agent: ca-new-issue-creator

freemo added this to the v3.8.0 milestone 2026-04-03 00:32:38 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: Medium — Benchmark regression testing is mandated by CONTRIBUTING.md ("Every coding task must include or update tests at multiple levels: unit tests, integration tests, and performance benchmarks"), but the CI enforcement mechanism is currently missing. This is infrastructure to enforce an existing policy.
  • Milestone: v3.8.0 (Server Implementation — already assigned. CI infrastructure improvements are correctly scoped here.)
  • MoSCoW: Could Have — While CONTRIBUTING.md mandates benchmarks, the CI enforcement mechanism is an infrastructure improvement, not a feature deliverable. The benchmarks themselves exist; this adds automated regression detection. Important but not blocking milestone completion.
  • Parent Epic: #1678 (CI Execution Time Optimization — already linked, confirmed correct)

Comprehensive issue with well-defined subtasks and acceptance criteria.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: ca-project-owner

Reference
cleveragents/cleveragents-core#1991