[AUTO-INF-7] Add missing ASV benchmark test level for budget, context, resources, providers, and tools modules #10224


Open

opened 2026-04-17 08:36:18 +00:00 by HAL9000 · 0 comments

HAL9000 commented

2026-04-17 08:36:18 +00:00

Owner

Summary

Five recently added modules (budget/guards, context, resources, providers, and tools) have Behave (unit) and/or Robot Framework (integration) test issues already filed, but none has ASV benchmark coverage. Performance regressions in these modules therefore go undetected until they surface in production or in the broader CI suite.

Findings

| Module | Behave (Unit) | Robot (Integration) | ASV (Benchmark) | Notes |
|---|:---:|:---:|:---:|---|
| `budget`/`guards` | ✅ #8934 (BDD) | ⚠️ Partial | ❌ None | CostTracker, budget enforcement, and safety profile enforcement lack latency/throughput baselines |
| `context` | ⚠️ Partial | ✅ #8916 | ❌ None | Context strategies (sliding window, semantic chunking, priority, recency-weighted, fusion) lack performance baselines |
| `resources` | ⚠️ Partial | ✅ #8926 | ❌ None | Cloud, database, and virtual resource type operations lack throughput benchmarks |
| `providers` | ✅ #8910 | ⚠️ Partial | ❌ None | OllamaProvider, MistralProvider, GeminiProvider initialization and request dispatch lack latency benchmarks |
| `tools` (ContainerToolRunner) | ⚠️ Partial | ✅ #8919 | ❌ None | Container tool execution engine lacks throughput and startup latency benchmarks |

Recommended ASV Benchmarks

- budget/guards (`benchmarks/budget_cost_tracker_bench.py`): measure `CostTracker.record_usage()` throughput and `SafetyProfileEnforcer.check_tool_access()` latency under load
- context (`benchmarks/context_strategy_bench.py`): measure `SlidingWindowStrategy`, `SemanticChunkingStrategy`, `PriorityContextStrategy`, and `ContextFusion` assembly time for 100/1,000/10,000-fragment corpora
- resources (`benchmarks/resources_type_bench.py`): measure resource type instantiation, connection string parsing, and capability detection throughput
- providers (`benchmarks/providers_registry_bench.py`): measure `ProviderRegistry.get()` lookup latency and provider initialization time for each new provider type (Ollama, Mistral, Gemini)
- tools (`benchmarks/tools_container_runner_bench.py`): measure `ContainerToolRunner` startup latency and execution throughput for lightweight container commands
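
As a starting point, here is a minimal sketch of what `benchmarks/context_strategy_bench.py` might look like. It follows ASV's standard conventions (`params`, `param_names`, `setup`, and `time_`-prefixed methods). `SlidingWindowStandIn`, its `assemble()` method, and the window size of 64 are hypothetical placeholders, because the real strategy API is not shown in this issue. Swap in the project's actual class when writing the benchmark.

```python
import random


class SlidingWindowStandIn:
    """Hypothetical stand-in for the project's SlidingWindowStrategy."""

    def __init__(self, window_size):
        self.window_size = window_size

    def assemble(self, fragments):
        # Keep only the most recent `window_size` fragments.
        return fragments[-self.window_size:]


class ContextStrategySuite:
    # ASV runs each time_* method once per value listed in `params`.
    params = [100, 1_000, 10_000]
    param_names = ["n_fragments"]

    def setup(self, n_fragments):
        # setup() is excluded from the timing; build the corpus here.
        rng = random.Random(0)  # fixed seed so runs are comparable
        self.fragments = [
            "fragment-%d-%d" % (i, rng.randrange(1_000_000))
            for i in range(n_fragments)
        ]
        self.strategy = SlidingWindowStandIn(window_size=64)

    def time_sliding_window_assemble(self, n_fragments):
        # Only the body of this method is timed by ASV.
        self.strategy.assemble(self.fragments)
```

The other four benchmark files could follow the same shape: construct inputs in `setup`, and keep each `time_*` method limited to the single call being measured.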

Impact

Without ASV benchmarks for these modules, performance regressions introduced by future changes to context strategy selection, budget enforcement, or provider dispatch will not be caught until they affect end-to-end plan execution times. Context assembly and budget enforcement are on the hot path of every plan execution — even small regressions can compound significantly at scale.

Duplicate Check

1. Keyword search (open issues): Searched all open issues (pages 1–24, 50 per page) for "benchmark" combined with "budget", "context", "resources", "providers", and "tools". No existing ASV benchmark issues were found for any of these modules.
2. Cross-area search, existing AUTO-INF-7 issues:
   - #9143 covers Application, Reactive, Domain, Shared; it does not mention budget/context/resources/providers/tools
   - #9886 covers Platform, Core, Observability, Plugins, Templates, A2A, LSP, MCP; it does not mention budget/context/resources/providers/tools
   - #8577 covers a2a, acms, action unit tests; it does not mention benchmarks for the above modules
3. Cross-area search, other AUTO-INF workers: Checked AUTO-INF-8 issue #9781 (integration and benchmark for core graph and TUI); it does not mention the budget/context/resources/providers/tools modules.
4. Closed issues search: Searched closed issues (pages 1–22, 50 per page) for benchmark issues covering these modules; none found.
5. Uncertainty check: The existing test issues #8910 (providers), #8916 (context), #8919 (tools), #8926 (resources), and #8934 (guards/budget) confirm that these modules exist and already have unit or integration test issues filed. No benchmark issue for these modules appears in 24 pages of open issues or 22 pages of closed issues, so this is not a duplicate.
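
The keyword search described above can be approximated with a small filter like the one below. This is an illustrative sketch only: the record layout and the sample titles are hypothetical, and the matching is a plain case-insensitive substring test, not the tracker's actual search.

```python
MODULES = ("budget", "context", "resources", "providers", "tools")


def find_benchmark_issues(issues, modules=MODULES):
    """Return issues whose title mentions 'benchmark' and at least one module."""
    matches = []
    for issue in issues:
        title = issue["title"].lower()
        if "benchmark" in title and any(m in title for m in modules):
            matches.append(issue)
    return matches


# Hypothetical titles, loosely based on the coverage described in this issue.
sample = [
    {"number": 9143, "title": "ASV benchmark level for Application, Reactive, Domain, Shared"},
    {"number": 9781, "title": "Integration and benchmark for core graph and TUI"},
]

duplicates = find_benchmark_issues(sample)
```

An empty `duplicates` list for every page of open and closed issues corresponds to the "none found" result reported above.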

Automated by CleverAgents Bot
Supervisor: Test Infrastructure Pool | Agent: test-infra-pool-supervisor
