UAT: TraceService missing on_plan_end hook — PLAN_DURATION_MS and PLAN_TOTAL_COST_USD metrics never emitted #3766

Open
opened 2026-04-05 22:34:09 +00:00 by freemo · 0 comments
Owner

Bug Report

What Was Tested

Code-level analysis of src/cleveragents/application/services/trace_service.py — specifically the plan lifecycle hooks and their coverage of the 14 operational metric keys defined in docs/specification.md §Observability → Metrics Collection.

Expected Behavior (from spec)

The spec (line ~46107) defines these plan-level metrics:

Metric Type Description
plan.duration_seconds Histogram Total wall-clock time per plan, labeled by phase
plan.cost_usd Counter Cumulative API cost per plan
plan.decisions_count Counter Number of decisions per plan
plan.child_plans_count Counter Number of child plans spawned

The OperationalMetricKey enum in metrics.py also defines PLAN_DURATION_MS and PLAN_TOTAL_COST_USD.

TraceService has an on_plan_start hook. The spec implies a corresponding on_plan_end hook that records plan duration and total cost when a plan completes.

Actual Behavior

TraceService only has on_plan_start — there is no on_plan_end hook:

# trace_service.py — only on_plan_start exists
def on_plan_start(self, plan_id: str) -> MetricEntry | None:
    """Hook: called when a plan begins execution."""
    return MetricCollector.record(
        OperationalMetricKey.PLAN_DECISION_COUNT,
        0.0,
        plan_id,
    )

The on_plan_start hook only initializes PLAN_DECISION_COUNT to 0. There is no hook to:

  1. Record PLAN_DURATION_MS when a plan completes (requires start time from on_plan_start)
  2. Record PLAN_TOTAL_COST_USD when a plan completes
  3. Record SUBPLAN_COUNT when child plans are spawned

Additionally, on_plan_start does not record a start timestamp, so even if on_plan_end were added, there would be no start time to compute duration from.

Code Location

  • File: src/cleveragents/application/services/trace_service.py
  • Lines 171–189 (the on_plan_start method and surrounding lifecycle hooks section)

Impact

  • PLAN_DURATION_MS metric is never emitted — plan execution time is invisible
  • PLAN_TOTAL_COST_USD metric is never emitted from the lifecycle hook (only from compute_metrics() which must be called explicitly)
  • The diagnostic dashboard's "Performance summary" (spec line ~46132) cannot show "Average plan duration" without this metric

Fix Required

Add an on_plan_end hook that accepts the plan's start time and total cost:

def on_plan_end(
    self,
    plan_id: str,
    *,
    duration_ms: float,
    total_cost_usd: float,
    subplan_count: int = 0,
) -> list[MetricEntry]:
    """Hook: called when a plan completes execution.

    Args:
        plan_id: ULID of the plan.
        duration_ms: Total wall-clock duration in milliseconds.
        total_cost_usd: Total USD cost for the plan.
        subplan_count: Number of child plans spawned.

    Returns:
        List of MetricEntry objects for plan completion metrics.
    """
    entries = [
        MetricCollector.record(
            OperationalMetricKey.PLAN_DURATION_MS,
            duration_ms,
            plan_id,
        ),
        MetricCollector.record(
            OperationalMetricKey.PLAN_TOTAL_COST_USD,
            total_cost_usd,
            plan_id,
        ),
    ]
    if subplan_count > 0:
        entries.append(
            MetricCollector.record(
                OperationalMetricKey.SUBPLAN_COUNT,
                float(subplan_count),
                plan_id,
            )
        )
    return entries

Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-uat-tester

## Bug Report ### What Was Tested Code-level analysis of `src/cleveragents/application/services/trace_service.py` — specifically the plan lifecycle hooks and their coverage of the 14 operational metric keys defined in `docs/specification.md` §Observability → Metrics Collection. ### Expected Behavior (from spec) The spec (line ~46107) defines these plan-level metrics: | Metric | Type | Description | |---|---|---| | `plan.duration_seconds` | Histogram | Total wall-clock time per plan, labeled by phase | | `plan.cost_usd` | Counter | Cumulative API cost per plan | | `plan.decisions_count` | Counter | Number of decisions per plan | | `plan.child_plans_count` | Counter | Number of child plans spawned | The `OperationalMetricKey` enum in `metrics.py` also defines `PLAN_DURATION_MS` and `PLAN_TOTAL_COST_USD`. `TraceService` has an `on_plan_start` hook. The spec implies a corresponding `on_plan_end` hook that records plan duration and total cost when a plan completes. ### Actual Behavior `TraceService` only has `on_plan_start` — there is **no `on_plan_end` hook**: ```python # trace_service.py — only on_plan_start exists def on_plan_start(self, plan_id: str) -> MetricEntry | None: """Hook: called when a plan begins execution.""" return MetricCollector.record( OperationalMetricKey.PLAN_DECISION_COUNT, 0.0, plan_id, ) ``` The `on_plan_start` hook only initializes `PLAN_DECISION_COUNT` to 0. There is no hook to: 1. Record `PLAN_DURATION_MS` when a plan completes (requires start time from `on_plan_start`) 2. Record `PLAN_TOTAL_COST_USD` when a plan completes 3. Record `SUBPLAN_COUNT` when child plans are spawned Additionally, `on_plan_start` does not record a start timestamp, so even if `on_plan_end` were added, there would be no start time to compute duration from. ### Code Location - **File**: `src/cleveragents/application/services/trace_service.py` - **Lines 171–189** (the `on_plan_start` method and surrounding lifecycle hooks section) ### Impact - `PLAN_DURATION_MS` metric is never emitted — plan execution time is invisible - `PLAN_TOTAL_COST_USD` metric is never emitted from the lifecycle hook (only from `compute_metrics()` which must be called explicitly) - The diagnostic dashboard's "Performance summary" (spec line ~46132) cannot show "Average plan duration" without this metric ### Fix Required Add an `on_plan_end` hook that accepts the plan's start time and total cost: ```python def on_plan_end( self, plan_id: str, *, duration_ms: float, total_cost_usd: float, subplan_count: int = 0, ) -> list[MetricEntry]: """Hook: called when a plan completes execution. Args: plan_id: ULID of the plan. duration_ms: Total wall-clock duration in milliseconds. total_cost_usd: Total USD cost for the plan. subplan_count: Number of child plans spawned. Returns: List of MetricEntry objects for plan completion metrics. """ entries = [ MetricCollector.record( OperationalMetricKey.PLAN_DURATION_MS, duration_ms, plan_id, ), MetricCollector.record( OperationalMetricKey.PLAN_TOTAL_COST_USD, total_cost_usd, plan_id, ), ] if subplan_count > 0: entries.append( MetricCollector.record( OperationalMetricKey.SUBPLAN_COUNT, float(subplan_count), plan_id, ) ) return entries ``` --- **Automated by CleverAgents Bot** Supervisor: UAT Testing | Agent: ca-uat-tester
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#3766
No description provided.