feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph #15


2026-06-03T06:00:01Z

hurui200320 commented

2026-06-03 06:00:01 +00:00

Background

PureLangGraph has a hardcoded depth heuristic (max(2000, len(self.nodes) * 50)) and no concept of model-call budgets, tool-call budgets, request timeouts, or cost limits. ExecutionError is a bare subclass of CleverAgentsException with no structured fields, so the router cannot programmatically determine why an execution failed.

The CleverThis platform needs to enforce per-plan resource quotas and return the correct HTTP status code (429 for budget exhaustion, 500 for platform configuration errors like a missing pricing entry).

Spec references: ADR-2029 (Execution Limits, Error Mapping)

Context (post-bot): create_executor() (Wave 4 / #13) is now implemented, and Executor already stores the limits and pricing passed by the caller. PureLangGraph.execute() already returns per-node token-usage tuples (Wave 5 / #14, also implemented). What remains falls entirely within this ticket: adding kind/reason to ExecutionError, passing limits/pricing from the dispatch layer into PureLangGraph, and enforcing all five limits there.

What Is Currently Missing

ExecutionError has no kind or reason fields.
PureLangGraph enforces only a hardcoded depth heuristic (max(2000, len(self.nodes) * 50)); no other limits. Critically, a depth breach silently returns the current message rather than raising ExecutionError at all (see _execute_from_node, line ~447).
No model-call counting, tool-call counting, timeout wrapping, or cost accumulation.
Executor already stores limits and pricing, but the dispatch functions in runtime_dispatch._execute_graph() never pass them into PureLangGraph. ExecutionError is also missing from cleveractors.__all__.

Acceptance Criteria

ExecutionError update (cleveractors/core/exceptions.py):

class ExecutionError(CleverAgentsException):
    def __init__(self, message: str, kind: str = "", reason: str = ""):
        super().__init__(message)
        self.kind = kind    # 'depth'|'model_calls'|'tool_calls'|'timeout'|'cost'|''
        self.reason = reason  # 'budget_exhausted'|'missing_pricing_entry'|''

All existing raise ExecutionError(msg) call sites continue to work (both fields default to "").
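As a rough illustration of how a caller might consume these fields, the sketch below re-declares the class with a stand-in base class and maps an error to an HTTP status. `http_status_for` is a hypothetical helper, not part of cleveractors. Only the 429 and 500 cases named in the Background come from this issue; the fallback for every other kind/reason pair is an assumption.

```python
class CleverAgentsException(Exception):
    """Stand-in for the real base class so the sketch is self-contained."""


class ExecutionError(CleverAgentsException):
    def __init__(self, message: str, kind: str = "", reason: str = ""):
        super().__init__(message)
        self.kind = kind
        self.reason = reason


def http_status_for(err: ExecutionError) -> int:
    """Hypothetical router-side mapping (an assumption, not the webapp's code)."""
    if err.reason == "budget_exhausted":
        return 429  # budget exhaustion, per the Background section
    if err.reason == "missing_pricing_entry":
        return 500  # platform configuration error, per the Background section
    # Assumption: anything else is treated as a generic server error here.
    return 500


# Existing call sites that pass only a message keep working.
legacy = ExecutionError("boom")
over_budget = ExecutionError("cost limit hit", kind="cost", reason="budget_exhausted")
```

The real mapping lives in the dependent webapp ticket (cleveragents-webapp#273) and may treat the other kinds differently.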

PureLangGraph limit enforcement (using limits and pricing already stored on Executor):

max_depth: Replace heuristic with limits["max_depth"]. Breach → ExecutionError(..., kind="depth").
max_model_calls: Counter per LLM node invocation. Breach → ExecutionError(..., kind="model_calls").
max_tool_calls: Counter per tool node invocation. Breach → ExecutionError(..., kind="tool_calls").
timeout_ms: asyncio.wait_for(coro, timeout=limits["timeout_ms"]/1000). asyncio.TimeoutError → ExecutionError(..., kind="timeout").
max_cost_usd: After each LLM node, compute cost from pricing[provider][model] × token counts. Cumulative breach → ExecutionError(..., kind="cost", reason="budget_exhausted"). Missing pricing entry → ExecutionError(..., kind="cost", reason="missing_pricing_entry") — never proceed with assumed zero cost.
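The timeout_ms item above can be sketched in isolation as follows. `run_with_timeout` and `slow_node` are hypothetical names, and `ExecutionError` is a simplified stand-in; the real wrapping happens inside PureLangGraph.execute().

```python
import asyncio


class ExecutionError(Exception):
    """Simplified stand-in for cleveractors' ExecutionError."""

    def __init__(self, message: str, kind: str = "", reason: str = ""):
        super().__init__(message)
        self.kind = kind
        self.reason = reason


async def run_with_timeout(coro, limits: dict):
    """Hypothetical helper: apply timeout_ms only when the key is present."""
    if "timeout_ms" not in limits:
        return await coro
    try:
        # asyncio.wait_for takes seconds, so convert from milliseconds.
        return await asyncio.wait_for(coro, timeout=limits["timeout_ms"] / 1000)
    except asyncio.TimeoutError as exc:
        raise ExecutionError("execution timed out", kind="timeout") from exc


async def slow_node() -> str:
    await asyncio.sleep(1.0)
    return "done"


def demo() -> str:
    """Run a 1 s node under a 10 ms budget and report the error kind."""
    try:
        asyncio.run(run_with_timeout(slow_node(), {"timeout_ms": 10}))
    except ExecutionError as err:
        return err.kind
    return "no error"
```

Translating the exception at the wait_for call site means the caller only ever sees the structured error, never a bare asyncio.TimeoutError.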

Subtasks

[x] Add kind and reason fields to ExecutionError with default "" values
[x] Pass limits and pricing from runtime_dispatch._execute_graph() into PureLangGraph (Executor already stores both; the missing wire is from the dispatch call site to PureLangGraph.__init__)
[x] Replace hardcoded depth heuristic with limits["max_depth"]
[x] Add max_model_calls counter and enforcement
[x] Add max_tool_calls counter and enforcement
[x] Wrap execution in asyncio.wait_for for timeout_ms
[x] Add cost accumulation after each LLM node using token data from NodeUsage
[x] Enforce max_cost_usd with missing_pricing_entry detection
[x] Export updated ExecutionError from cleveractors/__init__.py and __all__
[x] Write tests for each of the 5 limit types being exceeded
[x] Write test for missing_pricing_entry scenario (never zero-cost fallback)
[x] Verify all existing tests still pass

Definition of Done

All subtasks checked off.
Each of the 5 limit types raises ExecutionError with the correct kind (and reason where applicable).
from cleveractors import ExecutionError exposes the class with kind and reason attributes.
All tests pass. Coverage at or above project threshold.


hurui200320 added the

labels 2026-06-03 06:00:57 +00:00

hurui200320 added a new dependency 2026-06-03 06:09:36 +00:00

cleveragents/cleveragents-webapp#273 - feat(actor-limits): map ExecutionError kind/reason to structured HTTP error responses

hurui200320 added a new dependency 2026-06-03 06:41:20 +00:00

#14 feat(ActorResult): implement ActorResult and NodeUsage types; capture per-node token counts from LangChain responses

hurui200320 referenced this issue

2026-06-03 06:43:14 +00:00

feat(public-api): expose all router-facing APIs at cleveractors package level; update README #17

hurui200320 added a new dependency 2026-06-03 06:44:01 +00:00

#17 feat(public-api): expose all router-facing APIs at cleveractors package level; update README

CoreRasurae referenced this issue

2026-06-08 23:13:51 +00:00

feat(registry): extend TemplateType and integrate PackageReference into template system #35

CoreRasurae referenced this issue

2026-06-09 20:19:55 +00:00

feat(registry): extend TemplateType and integrate PackageReference into template system #35

hurui200320 added

and removed

labels 2026-06-11 03:34:35 +00:00

hurui200320 referenced this issue

2026-06-11 03:35:50 +00:00

feat(streaming): add Executor.execute_stream() returning AsyncIterator[str] for token-by-token delivery #16

hurui200320 self-assigned this 2026-06-11 03:39:19 +00:00

hurui200320 added

and removed

labels 2026-06-11 03:39:43 +00:00

hurui200320 commented

2026-06-11 03:42:16 +00:00

Implementation Plan — feat/execution-limits

Background from ADR-2029

ADR-2029 ("Actor Execution Limits and Budget Enforcement via cleveractors-core") mandates that:

The router passes all five limit keys (max_depth, max_model_calls, max_tool_calls, timeout_ms, max_cost_usd) and a pricing table to create_executor() at request time.
PureLangGraph enforces these limits internally during graph traversal.
Pricing rates are denominated per million tokens (e.g., {"openai": {"gpt-4.1-mini": {"prompt": 0.15, "completion": 0.60}}}).
Missing pricing entries must raise ExecutionError(kind='cost', reason='missing_pricing_entry') — never proceed with assumed zero cost.
The library's internal ceiling when no limit key is supplied is 2**31 - 1 (very large; the router's values are the binding constraint per ADR-2029).

Branch

feat/execution-limits (from master at commit 2664ebf)

Commit Message

feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph

Files Changed

cleveractors/core/exceptions.py — Add kind: str = "" and reason: str = "" fields to ExecutionError.__init__. All existing raise ExecutionError(msg) call sites are backward-compatible (both fields default to "").
cleveractors/langgraph/pure_graph.py — PureLangGraph changes:
- __init__ gains limits: dict[str, Any] and pricing: dict[str, Any] parameters (both default to {}).
- execute() wraps _execute_from_node() in asyncio.wait_for() when timeout_ms is in limits.
- execute() resets _model_call_count, _tool_call_count, _accumulated_cost on each invocation.
- _execute_from_node(): depth check now uses limits.get("max_depth", 2**31 - 1) and raises ExecutionError(kind="depth") instead of silently returning.
- _execute_from_node(): AGENT nodes check max_model_calls before execution.
- _execute_from_node(): TOOL nodes check max_tool_calls before execution.
- _execute_from_node(): After each LLM node result is collected, compute cost (rates per million tokens), check max_cost_usd, and enforce missing pricing entry guard.
cleveractors/runtime_dispatch.py — _execute_graph() passes limits=executor.limits, pricing=executor.pricing to PureLangGraph().
cleveractors/__init__.py — Add ExecutionError to imports and __all__.
features/execution_limits.feature + features/steps/execution_limits_steps.py — BDD scenarios covering all 5 limit types + missing_pricing_entry.
CHANGELOG.md — Document changes under [Unreleased].
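The AGENT and TOOL checks listed for pure_graph.py above amount to a check-then-increment counter that runs before the node does. A minimal sketch follows, assuming a hypothetical `CallBudget` class; the plan itself keeps the counts as private attributes on PureLangGraph, and `ExecutionError` here is a simplified stand-in.

```python
from typing import Optional


class ExecutionError(Exception):
    """Simplified stand-in for cleveractors' ExecutionError."""

    def __init__(self, message: str, kind: str = "", reason: str = ""):
        super().__init__(message)
        self.kind = kind
        self.reason = reason


class CallBudget:
    """Hypothetical per-execution counter for one limit kind."""

    def __init__(self, kind: str, limit: Optional[int] = None):
        self.kind = kind
        self.limit = limit  # None means the key was absent: nothing is enforced
        self.count = 0

    def charge(self) -> None:
        """Call immediately before running a node of this kind."""
        if self.limit is not None and self.count >= self.limit:
            raise ExecutionError(
                f"{self.kind} limit of {self.limit} reached", kind=self.kind
            )
        self.count += 1


budget = CallBudget("model_calls", limit=2)
budget.charge()  # first call allowed
budget.charge()  # second call allowed; a third would raise
```

Checking before incrementing means a limit of N permits exactly N invocations and rejects the (N+1)th without running it.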

Design Decisions

Pricing unit: Per million tokens (matching ADR-2029 example values: $0.15/1M for gpt-4.1-mini prompt). Cost formula: cost = (prompt_tokens / 1_000_000 * prompt_rate) + (completion_tokens / 1_000_000 * completion_rate).
Empty pricing dict: If pricing={} (router did not supply a pricing table), skip all cost calculations. If pricing is non-empty but a provider/model entry is missing, raise missing_pricing_entry.
Empty limits: Use limits.get(key, fallback) throughout — no limit is enforced when the key is absent from the dict. Default fallbacks: max_depth=2**31-1, all counters/cost unchecked.
Counter placement: Check then increment BEFORE node execution for max_model_calls (AGENT nodes) and max_tool_calls (TOOL nodes), so the limit is enforced before the N+1 invocation.
Timeout: Wraps only the _execute_from_node() coroutine in asyncio.wait_for(). The setup/teardown code in execute() is outside the timeout scope (as intended by ADR-2029).
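The pricing-unit and empty-pricing decisions above can be sketched as a single lookup-and-compute step. `node_cost_usd` is a hypothetical function name and `ExecutionError` is a simplified stand-in; the pricing table is the example from ADR-2029 quoted earlier.

```python
from typing import Optional


class ExecutionError(Exception):
    """Simplified stand-in for cleveractors' ExecutionError."""

    def __init__(self, message: str, kind: str = "", reason: str = ""):
        super().__init__(message)
        self.kind = kind
        self.reason = reason


def node_cost_usd(
    pricing: dict,
    provider: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Optional[float]:
    """Return the USD cost of one LLM node, or None when cost tracking is off."""
    if not pricing:
        return None  # router supplied no pricing table: skip cost calculation
    try:
        rates = pricing[provider][model]
    except KeyError as exc:
        # A non-empty table without this entry must never be treated as free.
        raise ExecutionError(
            f"no pricing entry for {provider}/{model}",
            kind="cost",
            reason="missing_pricing_entry",
        ) from exc
    # Rates are USD per million tokens.
    return (prompt_tokens / 1_000_000 * rates["prompt"]) + (
        completion_tokens / 1_000_000 * rates["completion"]
    )


PRICING = {"openai": {"gpt-4.1-mini": {"prompt": 0.15, "completion": 0.60}}}
cost = node_cost_usd(PRICING, "openai", "gpt-4.1-mini", 1_000_000, 1_000_000)
```

With one million prompt tokens and one million completion tokens, the example works out to 0.15 + 0.60 = 0.75 USD. The caller would add this to a running total and compare it against max_cost_usd.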


hurui200320 referenced this issue from a commit

2026-06-11 04:55:14 +00:00

feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph

hurui200320 referenced a pull request that will close this issue

2026-06-11 04:55:33 +00:00

feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph #44

hurui200320 added

and removed

labels 2026-06-11 04:55:43 +00:00

hurui200320 commented

2026-06-11 04:56:03 +00:00

Implementation Notes — Commit `55c82ba`

Branch

feat/execution-limits → PR #44

Key Design Decisions

1. Depth limit backward compatibility

When limits["max_depth"] IS provided: strictly enforce with ExecutionError(kind="depth") (new behavior per ADR-2029).

When limits["max_depth"] is NOT provided (e.g., limits={}): use the legacy heuristic max(2000, len(nodes)*50) with a silent cap (warning log only). This preserves backward compatibility for existing callers of PureLangGraph that don't pass limits — including several unit tests that create PureLangGraph(config) directly and call _execute_from_node with high depth values.

Motivation: Changing the default to 2**31-1 (the ADR's "very large internal ceiling") caused RecursionError in existing cycle-detection tests that use auto_finish_active=True to bypass loop detection — those tests rely on the old heuristic as a safety net. The two-path design keeps the tests green without weakening the new enforcement for actual router callers.

2. Cost calculation unit

Rates are USD per million tokens, matching ADR-2029 examples (gpt-4.1-mini: prompt=$0.15/1M, completion=$0.60/1M). Formula: cost = (prompt_tokens / 1_000_000 * prompt_rate) + (completion_tokens / 1_000_000 * completion_rate).

3. Empty pricing dict

If pricing={} is passed (or create_executor() is called without pricing), all cost calculations are skipped entirely. This handles legacy callers that provide max_cost_usd in limits but no pricing table. The missing_pricing_entry error only fires when pricing is non-empty and a provider/model key is missing.

4. ExecutionError re-raise

Added except ExecutionError: raise BEFORE the broad except Exception as e in _execute_from_node(). Without this, cost enforcement errors raised inside the post-execution dict-processing block would be silently swallowed and the node would "succeed" with an empty output.
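The handler ordering described in decision 4 looks roughly like this. `run_node` is a hypothetical, heavily simplified stand-in for the try block in _execute_from_node(); returning an empty string from the broad handler is an assumption based on the "empty output" behaviour mentioned above.

```python
class ExecutionError(Exception):
    """Simplified stand-in for cleveractors' ExecutionError."""

    def __init__(self, message: str, kind: str = "", reason: str = ""):
        super().__init__(message)
        self.kind = kind
        self.reason = reason


def run_node(node_fn, message: str) -> str:
    """Run one node, letting structured limit errors escape the broad handler."""
    try:
        return node_fn(message)
    except ExecutionError:
        # Must come first: ExecutionError is an Exception subclass, so the
        # broad handler below would otherwise catch and hide it.
        raise
    except Exception:
        # Assumption: other failures degrade to an empty output.
        return ""


def over_budget(_message: str) -> str:
    raise ExecutionError("cost limit hit", kind="cost", reason="budget_exhausted")


def flaky(_message: str) -> str:
    raise ValueError("unrelated failure")
```

Python tries except clauses in order, so the narrow clause only has an effect when it appears before the broad one.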

File Locations

cleveractors.core.exceptions.ExecutionError — kind/reason fields
cleveractors.langgraph.pure_graph.PureLangGraph.__init__ — new limits/pricing params, counter init
cleveractors.langgraph.pure_graph.PureLangGraph.execute — counter reset, timeout wrapping
cleveractors.langgraph.pure_graph.PureLangGraph._execute_from_node — depth, model_calls, tool_calls, cost enforcement
cleveractors.runtime_dispatch._execute_graph — limits/pricing wiring to PureLangGraph
cleveractors.__init__ — ExecutionError export

Quality Gates

lint: ✅ pass
typecheck (Pyright): ✅ 0 errors, 18 pre-existing import warnings (rx, langchain — not our code)
unit_tests: ✅ 2317/2317 pass
integration_tests: ✅ pass
coverage: ✅ 97.17% (threshold 96.5%)


hurui200320 referenced this issue from a commit

2026-06-11 04:58:17 +00:00

feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph

hurui200320 added a new dependency 2026-06-11 05:04:09 +00:00

#44 feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph

hurui200320 referenced this issue from a commit

2026-06-11 05:30:49 +00:00

feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph

hurui200320 referenced this issue from a commit

2026-06-11 08:01:38 +00:00

feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph

hurui200320 referenced this issue from a commit

2026-06-11 08:15:21 +00:00

feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph

hurui200320 referenced this issue from a commit

2026-06-11 08:43:49 +00:00

feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph

hurui200320 referenced this issue from a commit

2026-06-11 09:37:43 +00:00

feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph

hurui200320 commented

2026-06-11 10:20:09 +00:00

Self-QA Implementation Notes (Cycles 1–3)

Self-QA loop completed in 3 cycles with final verdict: Approved.

Cycle 1

Review findings (2C / 4M / 7m / 6n):

C-1 (Critical): float(_max_cost) inside the broad except Exception block silently swallowed malformed max_cost_usd values, disabling cost enforcement with no error raised. Empirically confirmed: with limits={"max_cost_usd": "not-a-number"} and 2M prompt tokens, the graph returned the input string silently.
C-2 (Critical): The end terminal node was subject to the depth check (depth check ran before the end short-circuit). Any graph with max_depth ≤ N (number of real nodes) could never complete. The ADR-2029 MVP value of MAX_DEPTH=5 could not complete a 5-node graph. BDD tests masked this by using max_depth=10 for the success scenario.
M-1 (Major): asyncio.gather() did not cancel sibling branches when one raised ExecutionError — parallel branches continued spending budget after a limit breach.
M-2 (Major): Default depth ceiling used legacy heuristic max(2000, len(self.nodes) * 50) instead of 2**31-1 per ADR-2029 spec. Docstring claimed 2**31-1 but code used the heuristic.
M-3 (Major): _limits and _pricing stored by reference — caller mutation could silently change enforcement behavior mid-flight.
M-4 (Major): Malformed max_model_calls, max_tool_calls, max_depth produced unstructured ValueError that the router could not map to a specific limit kind.
Minor/nit issues: non-dict token usage silently skipped, double _safe_token_int calls, tautological propagation test, missing multi-node cost accumulation test, missing completion tokens test, happy-path tests only asserting absence of error, timeout_ms=0 causing immediate timeout with no validation.

Fixes applied:

C-1: Wrapped float(max_cost_usd) in a narrow try/except (TypeError, ValueError) that raises ExecutionError(kind="cost") immediately, outside the broad handler's reach.
C-2: Moved the if node_name in ("end", "END"): return message short-circuit to the very top of _execute_from_node(), before the depth check.
M-1: Replaced asyncio.gather(*tasks) with a try/except that cancels all non-done tasks when one raises, then awaits them with return_exceptions=True. Also replaced asyncio.ensure_future with asyncio.create_task.
M-2: Replaced max(2000, len(self.nodes) * 50) with 2**31 - 1 as the default. Updated cycle-detection tests to pass explicit max_depth=50 and handle ExecutionError(kind="depth") as expected termination.
M-3: Applied defensive copies: self._limits = dict(limits) and self._pricing = copy.deepcopy(pricing).
M-4: Added narrow try/except (TypeError, ValueError) with bool guards for max_depth, max_model_calls, and max_tool_calls, each raising structured ExecutionError with the correct kind.
All minor/nit issues addressed: warning logging for non-dict token usage, deduplicated _safe_token_int calls, strengthened propagation test, added multi-node cost accumulation scenario, added completion tokens scenario, strengthened happy-path assertions, added timeout_ms > 0 validation.
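The M-1 fix above (cancel sibling branches when one fails) can be sketched as follows. `run_branches` is a hypothetical helper name and `ExecutionError` is a simplified stand-in; the actual change lives inside PureLangGraph's parallel execution path.

```python
import asyncio


class ExecutionError(Exception):
    """Simplified stand-in for cleveractors' ExecutionError."""

    def __init__(self, message: str, kind: str = "", reason: str = ""):
        super().__init__(message)
        self.kind = kind
        self.reason = reason


async def run_branches(coros):
    """Run branches concurrently; cancel the rest as soon as one raises."""
    tasks = [asyncio.create_task(c) for c in coros]
    try:
        return await asyncio.gather(*tasks)
    except BaseException:
        for task in tasks:
            if not task.done():
                task.cancel()
        # Wait for the cancellations to finish without raising again.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise


async def _demo():
    cancelled = [False]

    async def failing():
        await asyncio.sleep(0.01)
        raise ExecutionError("model call limit reached", kind="model_calls")

    async def slow():
        try:
            await asyncio.sleep(5)
        except asyncio.CancelledError:
            cancelled[0] = True  # record that the sibling was actually cancelled
            raise

    try:
        await run_branches([failing(), slow()])
    except ExecutionError as err:
        return err.kind, cancelled[0]
    return None, cancelled[0]


def demo():
    return asyncio.run(_demo())
```

A plain asyncio.gather() propagates the first exception but leaves the other tasks running, which is why the explicit cancel loop is needed to stop further budget spend.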

Quality gates after Cycle 1: ✅ lint, typecheck, 2331/2331 scenarios, 97.1% coverage.

Cycle 2

Review findings (0C / 1M / 8m / 5n):

M-1 (Major): Wrong reason field for malformed max_cost_usd — code raised ExecutionError(kind="cost", reason="missing_pricing_entry") but per ADR-2029, missing_pricing_entry means the pricing table lacks an entry for the provider/model, not that the limit value itself is malformed. This would produce misleading router logs and diagnostics.
m-2 (Minor): Parallel cancellation test did not verify cancellation actually occurred — only checked that ExecutionError was raised, not that sibling tasks were cancelled.
m-3 (Minor): Non-dict _node_token_usage test did not verify the warning was logged — only checked no error was raised.
m-4/5 (Minor): timeout_ms and max_cost_usd missing bool guards (inconsistency with other limits).
m-6 (Minor): Updated cycle-detection tests had weaker assertions — pure_graph_coverage_steps.py lacked visit_count[0] >= 2 assertion.
m-7 (Minor): Missing BDD scenarios for max_depth input validation (non-numeric and bool).
m-8/9 (Minor): Misleading scenario text in feature file.

Fixes applied:

M-1: Changed reason="missing_pricing_entry" to reason="" for malformed max_cost_usd. Added comment explaining the semantic distinction per ADR-2029.
m-2: Added branch_b_cancelled = [False] flag set inside except asyncio.CancelledError in _slow_execute. Fixed graph topology — original test used start → branch_a/branch_b but start node only routes to next_nodes[0], so parallel execution was never triggered. Fixed by adding a splitter node that fans out to both branches. Added Then step asserting branch_b_cancelled[0] is True.
m-3: Added Then step that patches graph.logger and re-runs the graph to assert logger.warning was called with a message about non-dict token usage.
m-4/5: Added bool guards for timeout_ms and max_cost_usd before float() conversion. Added BDD scenarios for both.
m-6: Added visit_count[0] >= 2 assertions to pure_graph_coverage_steps.py auto-finish and ping-pong bypass tests.
m-7: Added two new scenarios: "Non-numeric max_depth raises ExecutionError with kind 'depth'" and "Bool max_depth raises ExecutionError with kind 'depth'".
m-8/9: Fixed misleading scenario text in feature file.
Nits: Added except BaseException comment, removed in-function import sys, converted print(..., file=sys.stderr) to self.logger.warning(...), moved in-function imports to top-level in step file.
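The bool guards and narrow try/except from M-4 and m-4/5 follow one pattern, sketched below for integer limits. `coerce_int_limit` is a hypothetical helper name, and accepting numeric strings via int() is an assumption about the real validation rules.

```python
from typing import Optional


class ExecutionError(Exception):
    """Simplified stand-in for cleveractors' ExecutionError."""

    def __init__(self, message: str, kind: str = "", reason: str = ""):
        super().__init__(message)
        self.kind = kind
        self.reason = reason


def coerce_int_limit(limits: dict, key: str, kind: str) -> Optional[int]:
    """Return the limit as an int, or None when the key is absent."""
    if key not in limits:
        return None
    raw = limits[key]
    # bool is a subclass of int, so int(True) == 1 would pass silently
    # without this explicit guard.
    if isinstance(raw, bool):
        raise ExecutionError(f"{key} must be a number, got bool", kind=kind)
    try:
        return int(raw)
    except (TypeError, ValueError) as exc:
        raise ExecutionError(f"{key} is not numeric: {raw!r}", kind=kind) from exc


def error_kind(limits: dict, key: str, kind: str) -> Optional[str]:
    """Small demo helper: return the error kind raised, or None if valid."""
    try:
        coerce_int_limit(limits, key, kind)
    except ExecutionError as err:
        return err.kind
    return None
```

Raising ExecutionError with the matching kind, rather than letting ValueError escape, is what lets the router attribute a malformed value to a specific limit.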

Quality gates after Cycle 2: ✅ lint, typecheck, 2335/2335 scenarios, 97.2% coverage.

Cycle 3

Review findings: No critical or major issues. Minor code quality suggestions (DRY violation in limit validation pattern, magic numbers without named constants, create_pure_langgraph() factory silently omitting limits/pricing, weak warning assertion, dead mock_logger variable, mixed Optional[X] vs X | None style). All nits.

Verdict: Approve — The implementation is functionally correct and complete. All 5 execution limits are properly enforced per ADR-2029, the ExecutionError structured fields are correctly implemented, the wiring from _execute_graph() to PureLangGraph is correct, and the 38 BDD scenarios provide solid coverage.

Remaining Issues

The Cycle 3 minor/nit findings (DRY violation in limit validation, magic number constants, create_pure_langgraph() factory documentation, warning assertion tightening) are code quality improvements with no correctness impact. They can be addressed in a follow-up PR or deferred to a future refactor cycle.

Final quality gate results:

Gate	Result
`nox -e lint`	✅ pass
`nox -e typecheck`	✅ 0 errors
`nox -e unit_tests`	✅ 2335/2335 scenarios
`nox -e integration_tests`	✅ 156/156 tests
`nox -e coverage_report`	✅ 97.2% (threshold 96.5%)

## Self-QA Implementation Notes (Cycles 1–3) Self-QA loop completed in **3 cycles** with final verdict: **Approved**. --- ### Cycle 1 **Review findings (2C / 4M / 7m / 6n):** - **C-1 (Critical):** `float(_max_cost)` inside the broad `except Exception` block silently swallowed malformed `max_cost_usd` values, disabling cost enforcement with no error raised. Empirically confirmed: with `limits={"max_cost_usd": "not-a-number"}` and 2M prompt tokens, the graph returned the input string silently. - **C-2 (Critical):** The `end` terminal node was subject to the depth check (depth check ran before the `end` short-circuit). Any graph with `max_depth ≤ N` (number of real nodes) could never complete. The ADR-2029 MVP value of `MAX_DEPTH=5` could not complete a 5-node graph. BDD tests masked this by using `max_depth=10` for the success scenario. - **M-1 (Major):** `asyncio.gather()` did not cancel sibling branches when one raised `ExecutionError` — parallel branches continued spending budget after a limit breach. - **M-2 (Major):** Default depth ceiling used legacy heuristic `max(2000, len(self.nodes) * 50)` instead of `2**31-1` per ADR-2029 spec. Docstring claimed `2**31-1` but code used the heuristic. - **M-3 (Major):** `_limits` and `_pricing` stored by reference — caller mutation could silently change enforcement behavior mid-flight. - **M-4 (Major):** Malformed `max_model_calls`, `max_tool_calls`, `max_depth` produced unstructured `ValueError` that the router could not map to a specific limit kind. - Minor/nit issues: non-dict token usage silently skipped, double `_safe_token_int` calls, tautological propagation test, missing multi-node cost accumulation test, missing completion tokens test, happy-path tests only asserting absence of error, `timeout_ms=0` causing immediate timeout with no validation. 
**Fixes applied:**

- **C-1:** Wrapped `float(max_cost_usd)` in a narrow `try/except (TypeError, ValueError)` that raises `ExecutionError(kind="cost")` immediately, outside the broad handler's reach.
- **C-2:** Moved the `if node_name in ("end", "END"): return message` short-circuit to the very top of `_execute_from_node()`, before the depth check.
- **M-1:** Replaced `asyncio.gather(*tasks)` with a try/except that cancels all non-done tasks when one raises, then awaits them with `return_exceptions=True`. Also replaced `asyncio.ensure_future` with `asyncio.create_task`.
- **M-2:** Replaced `max(2000, len(self.nodes) * 50)` with `2**31 - 1` as the default. Updated cycle-detection tests to pass explicit `max_depth=50` and handle `ExecutionError(kind="depth")` as expected termination.
- **M-3:** Applied defensive copies: `self._limits = dict(limits)` and `self._pricing = copy.deepcopy(pricing)`.
- **M-4:** Added narrow `try/except (TypeError, ValueError)` with bool guards for `max_depth`, `max_model_calls`, and `max_tool_calls`, each raising structured `ExecutionError` with the correct `kind`.
- All minor/nit issues addressed: warning logging for non-dict token usage, deduplicated `_safe_token_int` calls, strengthened propagation test, added multi-node cost accumulation scenario, added completion tokens scenario, strengthened happy-path assertions, added `timeout_ms > 0` validation.

**Quality gates after Cycle 1:** ✅ lint, typecheck, 2331/2331 scenarios, 97.1% coverage.

---

### Cycle 2

**Review findings (0C / 1M / 8m / 5n):**

- **M-1 (Major):** Wrong `reason` field for malformed `max_cost_usd`: the code raised `ExecutionError(kind="cost", reason="missing_pricing_entry")`, but per ADR-2029 `missing_pricing_entry` means the pricing table lacks an entry for the provider/model, not that the limit value itself is malformed. This would produce misleading router logs and diagnostics.
- **m-2 (Minor):** Parallel cancellation test did not verify that cancellation actually occurred; it only checked that `ExecutionError` was raised, not that sibling tasks were cancelled.
- **m-3 (Minor):** Non-dict `_node_token_usage` test did not verify that the warning was logged; it only checked that no error was raised.
- **m-4/5 (Minor):** `timeout_ms` and `max_cost_usd` were missing bool guards (inconsistent with the other limits).
- **m-6 (Minor):** Updated cycle-detection tests had weaker assertions: `pure_graph_coverage_steps.py` lacked a `visit_count[0] >= 2` assertion.
- **m-7 (Minor):** Missing BDD scenarios for `max_depth` input validation (non-numeric and bool).
- **m-8/9 (Minor):** Misleading scenario text in feature file.

**Fixes applied:**

- **M-1:** Changed `reason="missing_pricing_entry"` to `reason=""` for malformed `max_cost_usd`. Added a comment explaining the semantic distinction per ADR-2029.
- **m-2:** Added a `branch_b_cancelled = [False]` flag set inside `except asyncio.CancelledError` in `_slow_execute`. Fixed the graph topology: the original test used `start → branch_a/branch_b`, but the `start` node only routes to `next_nodes[0]`, so parallel execution was never triggered. Fixed by adding a `splitter` node that fans out to both branches. Added a `Then` step asserting `branch_b_cancelled[0] is True`.
- **m-3:** Added a `Then` step that patches `graph.logger` and re-runs the graph to assert that `logger.warning` was called with a message about non-dict token usage.
- **m-4/5:** Added bool guards for `timeout_ms` and `max_cost_usd` before `float()` conversion. Added BDD scenarios for both.
- **m-6:** Added `visit_count[0] >= 2` assertions to the `pure_graph_coverage_steps.py` auto-finish and ping-pong bypass tests.
- **m-7:** Added two new scenarios: "Non-numeric max_depth raises ExecutionError with kind 'depth'" and "Bool max_depth raises ExecutionError with kind 'depth'".
- **m-8/9:** Fixed misleading scenario text in feature file.
- **Nits:** Added `except BaseException` comment, removed in-function `import sys`, converted `print(..., file=sys.stderr)` to `self.logger.warning(...)`, moved in-function imports to top-level in step file.

**Quality gates after Cycle 2:** ✅ lint, typecheck, 2335/2335 scenarios, 97.2% coverage.

---

### Cycle 3

**Review findings:** No critical or major issues. Minor code quality suggestions (DRY violation in limit validation pattern, magic numbers without named constants, `create_pure_langgraph()` factory silently omitting limits/pricing, weak warning assertion, dead `mock_logger` variable, mixed `Optional[X]` vs `X | None` style). All nits.

**Verdict: Approve.** The implementation is functionally correct and complete. All 5 execution limits are properly enforced per ADR-2029, the `ExecutionError` structured fields are correctly implemented, the wiring from `_execute_graph()` to `PureLangGraph` is correct, and the 38 BDD scenarios provide solid coverage.

---

### Remaining Issues

The Cycle 3 minor/nit findings (DRY violation in limit validation, magic number constants, `create_pure_langgraph()` factory documentation, warning assertion tightening) are code quality improvements with no correctness impact. They can be addressed in a follow-up PR or deferred to a future refactor cycle.

---

**Final quality gate results:**

| Gate | Result |
|------|--------|
| `nox -e lint` | ✅ pass |
| `nox -e typecheck` | ✅ 0 errors |
| `nox -e unit_tests` | ✅ 2335/2335 scenarios |
| `nox -e integration_tests` | ✅ 156/156 tests |
| `nox -e coverage_report` | ✅ 97.2% (threshold 96.5%) |
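
For the deferred DRY finding on limit validation, one possible shape for a shared helper is sketched below. It assumes a hypothetical `parse_int_limit` function and a stand-in `ExecutionError`; treating an absent key as "no limit" and using an empty `reason` are illustrative assumptions, not confirmed behavior of the merged code.

```python
from typing import Any, Mapping, Optional


class ExecutionError(Exception):
    """Stand-in for the project's ExecutionError (assumed kind/reason fields)."""

    def __init__(self, message: str, kind: str = "", reason: str = ""):
        super().__init__(message)
        self.kind = kind
        self.reason = reason


def parse_int_limit(limits: Mapping[str, Any], key: str, kind: str) -> Optional[int]:
    """Hypothetical helper: validate one integer limit, or return None if absent."""
    raw = limits.get(key)
    if raw is None:
        return None  # assumption: a missing key means the limit is not enforced
    # bool is a subclass of int, so int(True) == 1 would otherwise pass silently.
    if isinstance(raw, bool):
        raise ExecutionError(f"{key} must be a number, got bool", kind=kind)
    try:
        return int(raw)
    except (TypeError, ValueError, OverflowError) as exc:
        # Re-raise as a structured error so the caller can map it by kind.
        raise ExecutionError(f"{key} is not a valid integer: {raw!r}", kind=kind) from exc


# Usage: each limit shares the same guard and conversion logic.
max_depth = parse_int_limit({"max_depth": "50"}, "max_depth", "depth")
```

A helper of this kind would keep the bool guard and the narrow `try/except` in one place, so a new limit could not accidentally omit either of them.
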

hurui200320 referenced this issue

2026-06-11 10:33:18 +00:00

feat(execution-limits): add structured ExecutionError kind/reason fields; enforce all 5 execution limits in PureLangGraph #44

hurui200320 closed this issue

2026-06-11 11:05:30 +00:00

hurui200320 added and removed labels

2026-06-11 11:05:36 +00:00