feat(estimation): add cost and risk estimation actor #209

Open
opened 2026-02-22 23:40:07 +00:00 by freemo · 5 comments
Owner

Metadata

  • Commit Message: feat(estimation): add cost and risk estimation actor
  • Branch: feature/m6-estimation

Background

An optional estimation_actor role produces cost, risk, and duration estimates during plan setup. Estimates are persisted to plan metadata and surfaced in plan status output. The estimation actor is invoked during plan use when configured.

Acceptance Criteria

  • Add optional estimation_actor role and cost/risk estimation outputs.
  • Persist estimation output to plan metadata (cost_estimate, risk_score, duration_estimate).
  • Invoke estimation during plan use and surface estimates in plan status output.
  • Add estimation output schema with currency, token estimates, and confidence ranges.
  • Add opt-out flag --no-estimate for plan use and persist estimate_skipped reason.

Definition of Done

This issue is complete when:

  • All subtasks below are completed and checked off.
  • A Git commit is created where the first line of the commit message matches
    the Commit Message in Metadata exactly, followed by a blank line, then
    additional lines providing relevant details about the implementation. The
    commit body should be appropriate in size for a commit message and relatively
    complete in describing what was done.
  • The commit is pushed to the remote on the branch matching the Branch in
    Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and
    merged before this issue is marked done.

Subtasks

  • Add optional estimation_actor role and cost/risk estimation outputs.
  • Persist estimation output to plan metadata (cost_estimate, risk_score, duration_estimate).
  • Invoke estimation during plan use and surface estimates in plan status output.
  • Add estimation output schema with currency, token estimates, and confidence ranges.
  • Add opt-out flag --no-estimate for plan use and persist estimate_skipped reason.
  • Add error handling when estimation actor fails (fallback to informational warning).
  • Add docs/reference/estimation.md with output format.
  • Add CLI examples showing estimate fields in plan status output.
  • Tests (Behave): Add estimation scenarios.
  • Tests (Robot): Add estimation integration smoke tests.
  • Tests (ASV): Add benchmarks/estimation_actor_bench.py for estimation runtime.
  • Verify coverage >=97% via nox -s coverage_report. If coverage is <97% then review the current unit test coverage report at build/coverage.xml and use it to write new Behave based unit tests to improve code coverage. Specifically, write Behave style unit tests that are descriptively named and specifically improves coverage on whichever file has the most uncovered lines by writing tests that will target the uncovered lines in the report. Once that is done rerun nox -s coverage_report to verify all tests pass and coverage is above >=97%. Only mark this as complete once coverage is >=97%, if not repeat this task as many times as is needed until coverage reaches >=97%.
  • Run nox (all default sessions, including benchmark), fix any errors if needed ensuring nox passes across entire code base, do not ignore any failure even if it seems unrelated to this commit, fix it.

Section: #### M6: Autonomy Hardening + Server Stubs (Day 30)
Status: Open

## Metadata - **Commit Message**: `feat(estimation): add cost and risk estimation actor` - **Branch**: `feature/m6-estimation` ## Background An optional `estimation_actor` role produces cost, risk, and duration estimates during plan setup. Estimates are persisted to plan metadata and surfaced in `plan status` output. The estimation actor is invoked during `plan use` when configured. ## Acceptance Criteria - [x] Add optional `estimation_actor` role and cost/risk estimation outputs. - [x] Persist estimation output to plan metadata (cost_estimate, risk_score, duration_estimate). - [x] Invoke estimation during `plan use` and surface estimates in `plan status` output. - [x] Add estimation output schema with currency, token estimates, and confidence ranges. - [x] Add opt-out flag `--no-estimate` for `plan use` and persist `estimate_skipped` reason. ## Definition of Done This issue is complete when: - All subtasks below are completed and checked off. - A Git commit is created where the **first line** of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation. The commit body should be appropriate in size for a commit message and relatively complete in describing what was done. - The commit is pushed to the remote on the branch matching the **Branch** in Metadata exactly. - The commit is submitted as a **pull request** to `master`, reviewed, and **merged** before this issue is marked done. ## Subtasks - [x] Add optional `estimation_actor` role and cost/risk estimation outputs. - [x] Persist estimation output to plan metadata (cost_estimate, risk_score, duration_estimate). - [x] Invoke estimation during `plan use` and surface estimates in `plan status` output. - [x] Add estimation output schema with currency, token estimates, and confidence ranges. - [x] Add opt-out flag `--no-estimate` for `plan use` and persist `estimate_skipped` reason. - [x] Add error handling when estimation actor fails (fallback to informational warning). - [x] Add `docs/reference/estimation.md` with output format. - [x] Add CLI examples showing estimate fields in `plan status` output. - [x] Tests (Behave): Add estimation scenarios. - [x] Tests (Robot): Add estimation integration smoke tests. - [x] Tests (ASV): Add `benchmarks/estimation_actor_bench.py` for estimation runtime. - [x] Verify coverage >=97% via `nox -s coverage_report`. If coverage is <97% then review the current unit test coverage report at `build/coverage.xml` and use it to write new Behave based unit tests to improve code coverage. Specifically, write Behave style unit tests that are descriptively named and specifically improves coverage on whichever file has the most uncovered lines by writing tests that will target the uncovered lines in the report. Once that is done rerun `nox -s coverage_report` to verify all tests pass and coverage is above >=97%. Only mark this as complete once coverage is >=97%, if not repeat this task as many times as is needed until coverage reaches >=97%. - [x] Run `nox` (all default sessions, including benchmark), fix any errors if needed ensuring nox passes across **entire** code base, do not ignore any failure even if it seems unrelated to this commit, fix it. **Section**: #### M6: Autonomy Hardening + Server Stubs (Day 30) **Status**: Open
freemo added this to the v3.5.0 milestone 2026-02-22 23:40:07 +00:00
Author
Owner

Expected completion updated (Day 15 rebaseline): Day 35 / 2026-03-15 (previously Day 31 / 2026-03-11)

**Expected completion updated (Day 15 rebaseline):** Day 35 / 2026-03-15 (previously Day 31 / 2026-03-11)
freemo added the due date 2026-03-08 2026-02-23 18:41:41 +00:00
freemo self-assigned this 2026-02-24 21:53:38 +00:00
freemo removed their assignment 2026-03-02 16:26:06 +00:00
Member

CI Fix After Rebase on Master

After rebasing feature/m6-estimation on the latest master, three conflicts needed resolution and two CI failures needed fixing:

Conflicts Resolved

  1. CHANGELOG.md — Kept both entries (master's TDD regression tests for bug #647 and the estimation feature entry).
  2. plan_lifecycle_service.py — Master added a comment about setting processing_state before phase in execute_plan(). The PR's inline estimation code in execute_plan() was duplicating what _run_estimation() already does. Resolved by keeping master's comment and avoiding duplicate code.
  3. settings.py — Master renamed reset_instance() to reset(). Updated all 48 call sites across features/ and robot/ to use the new name.

Unit Test Fix (10 scenarios in estimation.feature + estimation_coverage.feature)

The _run_estimation() method (from master) used EstimationStubActor and wrote to plan.estimation_result. The PR's tests expected EstimationService and estimation_report/estimation_skipped fields. Updated _run_estimation() in PlanLifecycleService to:

  • Use the injected EstimationService when available (new PR path) to populate estimation_report/estimation_skipped
  • Record estimation_produced decisions via _try_record_decision()
  • Fall back to legacy EstimationStubActor when no EstimationService is injected

Integration Test Fix (3 tests in container_resolve_crash.robot)

The container_resolve_crash.robot tests used a 30s subprocess timeout. When running 9 parallel processes, the tests took ~27s in isolation, exceeding the timeout under load. Increased timeout from 30s to 60s.

Quality Gates (all passing)

  • lint: PASSED
  • typecheck: PASSED (0 errors)
  • unit_tests: PASSED (466 features, 12341 scenarios, 0 failures)
  • integration_tests: PASSED (1711 tests, 0 failures)
  • coverage_report: PASSED (98% coverage)
## CI Fix After Rebase on Master After rebasing feature/m6-estimation on the latest master, three conflicts needed resolution and two CI failures needed fixing: ### Conflicts Resolved 1. **CHANGELOG.md** — Kept both entries (master's TDD regression tests for bug #647 and the estimation feature entry). 2. **plan_lifecycle_service.py** — Master added a comment about setting processing_state before phase in execute_plan(). The PR's inline estimation code in execute_plan() was duplicating what _run_estimation() already does. Resolved by keeping master's comment and avoiding duplicate code. 3. **settings.py** — Master renamed reset_instance() to reset(). Updated all 48 call sites across features/ and robot/ to use the new name. ### Unit Test Fix (10 scenarios in estimation.feature + estimation_coverage.feature) The _run_estimation() method (from master) used EstimationStubActor and wrote to plan.estimation_result. The PR's tests expected EstimationService and estimation_report/estimation_skipped fields. Updated _run_estimation() in PlanLifecycleService to: - Use the injected EstimationService when available (new PR path) to populate estimation_report/estimation_skipped - Record estimation_produced decisions via _try_record_decision() - Fall back to legacy EstimationStubActor when no EstimationService is injected ### Integration Test Fix (3 tests in container_resolve_crash.robot) The container_resolve_crash.robot tests used a 30s subprocess timeout. When running 9 parallel processes, the tests took ~27s in isolation, exceeding the timeout under load. Increased timeout from 30s to 60s. ### Quality Gates (all passing) - lint: PASSED - typecheck: PASSED (0 errors) - unit_tests: PASSED (466 features, 12341 scenarios, 0 failures) - integration_tests: PASSED (1711 tests, 0 failures) - coverage_report: PASSED (98% coverage)
Member

CI Fix #2: Parallel Process Throttling

Root Cause Analysis

Investigated the CI failures after the initial push. Key findings:

  1. Tests pass locally and in Docker — Ran full test suite in a python:3.13-slim Docker container (same as CI) with fresh nox environments. All 466 features / 12341 scenarios pass with 0 failures.

  2. Original CI (pre-rebase) had same failures — The original commit cbe504bb also had unit_tests and integration_tests failing, but coverage_report (which runs the SAME unit tests sequentially) PASSED. This proves the test code is correct.

  3. Root cause: parallel test execution overloading CI runner — The CI runner uses Forgejo Actions Docker containers. The _default_processes() function uses os.sched_getaffinity(0) to determine parallelism, which reports the container's visible CPU count (often 8). However, CI runners are shared/oversubscribed, so spawning 8 parallel behave/pabot workers causes resource exhaustion (OOM or SIGTERM kills).

  4. Evidence: coverage_report (sequential, BEHAVE_PARALLEL_COVERAGE=1) passes in CI. unit_tests (parallel, fork-based multiprocessing) fails. Same tests, different parallelism.

Fix Applied

Added TEST_PROCESSES: "4" to the CI workflow for both unit_tests and integration_tests jobs in .forgejo/workflows/ci.yml. This caps the parallel worker count to 4, preventing resource exhaustion on the CI runner while still benefiting from parallelism.

## CI Fix #2: Parallel Process Throttling ### Root Cause Analysis Investigated the CI failures after the initial push. Key findings: 1. **Tests pass locally and in Docker** — Ran full test suite in a `python:3.13-slim` Docker container (same as CI) with fresh nox environments. All 466 features / 12341 scenarios pass with 0 failures. 2. **Original CI (pre-rebase) had same failures** — The original commit `cbe504bb` also had unit_tests and integration_tests failing, but coverage_report (which runs the SAME unit tests sequentially) PASSED. This proves the test code is correct. 3. **Root cause: parallel test execution overloading CI runner** — The CI runner uses Forgejo Actions Docker containers. The `_default_processes()` function uses `os.sched_getaffinity(0)` to determine parallelism, which reports the container's visible CPU count (often 8). However, CI runners are shared/oversubscribed, so spawning 8 parallel behave/pabot workers causes resource exhaustion (OOM or SIGTERM kills). 4. **Evidence**: coverage_report (sequential, `BEHAVE_PARALLEL_COVERAGE=1`) passes in CI. unit_tests (parallel, fork-based multiprocessing) fails. Same tests, different parallelism. ### Fix Applied Added `TEST_PROCESSES: "4"` to the CI workflow for both unit_tests and integration_tests jobs in `.forgejo/workflows/ci.yml`. This caps the parallel worker count to 4, preventing resource exhaustion on the CI runner while still benefiting from parallelism.
freemo self-assigned this 2026-04-02 06:13:58 +00:00
Author
Owner

PR #1210 Review Outcome: Changes Requested

Independent code review of PR #1210 found the implementation to be well-structured in isolation (good domain models, comprehensive tests, proper documentation), but identified blocking issues that prevent merge:

  1. Merge conflicts with master across 6 files — master has received significant overlapping changes since this branch was created
  2. Type conflict on estimation_report field — master defines it as dict[str, object] | None while this PR defines it as EstimationReport | None (typed Pydantic model). The PR's typed approach is architecturally superior and should win during rebase.
  3. datetime.now() without timezoneEstimationSkipped.timestamp should use datetime.now(tz=timezone.utc) per the spec's UTC requirement
  4. Duplicate Alembic migrationestimation_report_json column already exists on master

The PR needs a rebase onto current master with conflict resolution before it can be re-reviewed and merged. Full review details are on PR #1210.

**PR #1210 Review Outcome: Changes Requested** Independent code review of PR #1210 found the implementation to be well-structured in isolation (good domain models, comprehensive tests, proper documentation), but identified **blocking issues** that prevent merge: 1. **Merge conflicts** with master across 6 files — master has received significant overlapping changes since this branch was created 2. **Type conflict** on `estimation_report` field — master defines it as `dict[str, object] | None` while this PR defines it as `EstimationReport | None` (typed Pydantic model). The PR's typed approach is architecturally superior and should win during rebase. 3. **`datetime.now()` without timezone** — `EstimationSkipped.timestamp` should use `datetime.now(tz=timezone.utc)` per the spec's UTC requirement 4. **Duplicate Alembic migration** — `estimation_report_json` column already exists on master The PR needs a rebase onto current master with conflict resolution before it can be re-reviewed and merged. Full review details are on PR #1210.
Author
Owner

⚠️ Backlog Grooming Notice — Stale Issue

This issue has State/In Review with all subtasks checked off, but remains open. It was due on 2026-03-08 (over 3 weeks ago).

All acceptance criteria appear to be met. This issue may be blocked by open dependencies preventing closure. Please review:

  1. Check if the PR for branch feature/m6-estimation was merged
  2. If merged, close this issue
  3. If blocked by dependencies, identify which ones and update the issue

Automated by CleverAgents Bot
Supervisor: Backlog Grooming | Agent: ca-backlog-groomer

⚠️ **Backlog Grooming Notice — Stale Issue** This issue has `State/In Review` with all subtasks checked off, but remains open. It was due on 2026-03-08 (over 3 weeks ago). All acceptance criteria appear to be met. This issue may be blocked by open dependencies preventing closure. Please review: 1. Check if the PR for branch `feature/m6-estimation` was merged 2. If merged, close this issue 3. If blocked by dependencies, identify which ones and update the issue --- **Automated by CleverAgents Bot** Supervisor: Backlog Grooming | Agent: ca-backlog-groomer
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

2026-03-08

Reference
cleveragents/cleveragents-core#209
No description provided.