[AUTO-INF-7] Add missing ASV benchmark test level for budget, context, resources, providers, and tools modules #10224

Open
opened 2026-04-17 08:36:18 +00:00 by HAL9000 · 0 comments
Owner

Summary

Five recently-added modules — budget/guards, context, resources, providers, and tools — have Behave (unit) and/or Robot Framework (integration) test issues already filed, but none have ASV benchmark coverage. This leaves performance regressions in these modules undetectable until they surface in production or in the broader CI suite.

Findings

| Module | Behave (Unit) | Robot (Integration) | ASV (Benchmark) | Notes |
|---|:---:|:---:|:---:|---|
| budget/guards | ✅ #8934 (BDD) | ⚠️ Partial | ❌ None | CostTracker, budget enforcement, and safety profile enforcement lack latency/throughput baselines |
| context | ⚠️ Partial | ✅ #8916 | ❌ None | Context strategies (sliding window, semantic chunking, priority, recency-weighted, fusion) lack performance baselines |
| resources | ⚠️ Partial | ✅ #8926 | ❌ None | Cloud, database, and virtual resource type operations lack throughput benchmarks |
| providers | ✅ #8910 | ⚠️ Partial | ❌ None | OllamaProvider, MistralProvider, GeminiProvider initialization and request dispatch lack latency benchmarks |
| tools (ContainerToolRunner) | ⚠️ Partial | ✅ #8919 | ❌ None | Container tool execution engine lacks throughput and startup latency benchmarks |

Recommended ASV Benchmarks

  • budget/guards: benchmarks/budget_cost_tracker_bench.py — measure CostTracker.record_usage() throughput and SafetyProfileEnforcer.check_tool_access() latency under load
  • context: benchmarks/context_strategy_bench.py — measure SlidingWindowStrategy, SemanticChunkingStrategy, PriorityContextStrategy, and ContextFusion assembly time for 100/1,000/10,000-fragment corpora
  • resources: benchmarks/resources_type_bench.py — measure resource type instantiation, connection string parsing, and capability detection throughput
  • providers: benchmarks/providers_registry_bench.py — measure ProviderRegistry.get() lookup latency and provider initialization time for each new provider type (Ollama, Mistral, Gemini)
  • tools: benchmarks/tools_container_runner_bench.py — measure ContainerToolRunner startup latency and execution throughput for lightweight container commands
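
As a rough illustration of the shape these files could take, here is a minimal sketch of two ASV suites. The `params`, `param_names`, `setup`, and `time_` prefix conventions come from ASV itself. Everything else is an assumption made for illustration: `SlidingWindowStandIn` and `CostTrackerStandIn` are hypothetical stand-ins, and the `assemble()` method, the `window_size` argument, and the `tokens` argument are placeholders, since this issue does not show the real signatures of `SlidingWindowStrategy` or `CostTracker.record_usage()`.

```python
"""Sketch of an ASV benchmark module; all stand-in classes are hypothetical."""
from collections import deque


class SlidingWindowStandIn:
    """Hypothetical stand-in for SlidingWindowStrategy (real API not shown here)."""

    def __init__(self, window_size):
        self.window_size = window_size

    def assemble(self, fragments):
        # Keep only the most recent `window_size` fragments.
        return list(deque(fragments, maxlen=self.window_size))


class CostTrackerStandIn:
    """Hypothetical stand-in for CostTracker (real API not shown here)."""

    def __init__(self):
        self.total_tokens = 0
        self.calls = 0

    def record_usage(self, tokens):
        self.total_tokens += tokens
        self.calls += 1


class ContextAssemblySuite:
    # ASV runs each time_* method once per value listed in `params`.
    params = [100, 1_000, 10_000]
    param_names = ["n_fragments"]

    def setup(self, n_fragments):
        # The corpus is built in setup so that only assembly is timed.
        self.fragments = [f"fragment-{i}" for i in range(n_fragments)]
        self.strategy = SlidingWindowStandIn(window_size=64)

    def time_sliding_window_assemble(self, n_fragments):
        self.strategy.assemble(self.fragments)


class CostTrackerSuite:
    def setup(self):
        self.tracker = CostTrackerStandIn()

    def time_record_usage_1000_calls(self):
        # A fixed batch of calls approximates throughput under load.
        for _ in range(1_000):
            self.tracker.record_usage(tokens=128)
```

In the real benchmark files the stand-ins would be replaced by imports of the actual classes, and the other strategies (semantic chunking, priority, fusion) would each get their own `time_*` method on the same parameter grid.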

Impact

Without ASV benchmarks for these modules, performance regressions introduced by future changes to context strategy selection, budget enforcement, or provider dispatch will not be caught until they affect end-to-end plan execution times. Context assembly and budget enforcement are on the hot path of every plan execution — even small regressions can compound significantly at scale.

Duplicate Check

  1. Keyword search (open issues): Searched all open issues (pages 1–24, 50 per page) for "benchmark" combined with "budget", "context", "resources", "providers", and "tools" — no existing ASV benchmark issues found for any of these modules.

  2. Cross-area search — existing AUTO-INF-7 issues:

    • #9143 covers Application, Reactive, Domain, Shared — does not mention budget/context/resources/providers/tools
    • #9886 covers Platform, Core, Observability, Plugins, Templates, A2A, LSP, MCP — does not mention budget/context/resources/providers/tools
    • #8577 covers a2a, acms, action unit tests — does not mention benchmarks for the above modules
  3. Cross-area search — other AUTO-INF workers: Checked AUTO-INF-8 issue #9781 (integration and benchmark for core graph and TUI) — does not mention budget/context/resources/providers/tools modules.

  4. Closed issues search: Searched closed issues (pages 1–22, 50 per page) for benchmark issues covering these modules — none found.

  5. Uncertainty check: The existing test issues #8916 (context), #8919 (tools), #8926 (resources), and #8934 (guards/budget) confirm these modules exist and already have unit or integration coverage filed. No benchmark issue for these modules was found across 24 pages of open issues and 22 pages of closed issues. This is not a duplicate.


Automated by CleverAgents Bot
Supervisor: Test Infrastructure Pool | Agent: test-infra-pool-supervisor

Reference
cleveragents/cleveragents-core#10224