[AUTO-SPEC] Proposal: Document LLM provider fallback behavior in StrategyActor specification #10213


Open

opened 2026-04-17 06:10:18 +00:00 by HAL9000 · 0 comments


## Metadata

- **Commit message**: `docs(spec): document LLM provider fallback behavior and quota recovery in StrategyActor`
- **Branch name**: `docs/spec-llm-provider-fallback-behavior`

## Background and Context

Two recent commits implement LLM provider fallback behavior in `StrategyActor`:

1. **`f5712787`** (`feat: add fallback to Anthropic Sonnet when OpenAI quota is exhausted`): implements graceful degradation when the OpenAI quota is exhausted by falling back to Anthropic Claude Sonnet. It includes quota error detection, fallback LLM caching, a 5-minute recovery interval, and logging of fallback events.
2. **`51472c0b`** (`debug: upgrade logging levels for fallback diagnostics`): raises fallback-related log messages from DEBUG to WARNING level for better CI/CD observability.

The current specification does not document this fallback behavior, so the spec no longer describes what the system actually does.

## Discrepancy Report

**Type**: Implementation found a better approach → spec update needed

**Area**: Actor Runtime (StrategyActor LLM Provider Configuration)

### Current Spec State

The specification describes `StrategyActor` as using a configured LLM provider for strategy generation, but does not document:

- Quota error detection and classification
- Fallback provider selection when the primary provider's quota is exhausted
- Fallback LLM instance caching behavior
- Quota recovery interval (5 minutes)
- Logging behavior for fallback transitions

### Implementation State

The fallback logic around `StrategyActor._execute_with_llm()` now includes:

1. **`_is_quota_error()`**: detects quota-specific API errors (HTTP 429, `insufficient_quota`, `rate_limit`)
2. **Fallback provider**: `anthropic/claude-sonnet-4-20250514` is used when the primary provider's quota is exhausted
3. **Fallback LLM caching**: `self._fallback_llm` is cached so the fallback client is not recreated on every call
4. **Quota recovery**: `_QUOTA_RECOVERY_INTERVAL` (5 minutes) must elapse before the primary provider is retried
5. **Fallback mode state**: `self._using_fallback` and `self._last_quota_error_time` track whether fallback mode is active and when the last quota error occurred
6. **Warning-level logging**: fallback transitions are logged at WARNING level for CI/CD observability
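For reviewers checking the detection criteria in item 1, here is a minimal Python sketch of a quota-error predicate. `is_quota_error` and `QUOTA_ERROR_MARKERS` are hypothetical names, not the actual `_is_quota_error()` code, and reading an HTTP status from a `status_code` attribute is an assumption about how provider SDK exceptions expose it.

```python
from typing import Optional

# Error markers taken from the issue text; matching them as substrings of the
# lowercased error message is an assumption about how provider SDKs report them.
QUOTA_ERROR_MARKERS = ("insufficient_quota", "rate_limit")


def is_quota_error(exc: BaseException) -> bool:
    """Hypothetical stand-in for StrategyActor._is_quota_error().

    Returns True for HTTP 429 responses or for errors whose message mentions
    one of the quota markers. Reading the status from a `status_code`
    attribute is an assumption, not a confirmed SDK contract.
    """
    status: Optional[int] = getattr(exc, "status_code", None)
    if status == 429:
        return True
    message = str(exc).lower()
    return any(marker in message for marker in QUOTA_ERROR_MARKERS)
```

Under this sketch, unrelated failures such as an invalid API key return `False`, so they would propagate instead of triggering fallback.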

### Proposed Spec Change

Add a subsection to the `StrategyActor` documentation covering:

#### LLM Provider Fallback

StrategyActor implements graceful degradation when the primary LLM provider encounters quota exhaustion:

- **Quota detection**: HTTP 429 errors and `insufficient_quota`/`rate_limit` error codes trigger fallback
- **Fallback provider**: Configurable secondary provider (default: `anthropic/claude-sonnet-4-20250514`)
- **Fallback caching**: The fallback LLM instance is cached to avoid per-call recreation overhead
- **Recovery interval**: After entering fallback mode, the primary provider is retried once 5 minutes have elapsed since the last quota error
- **Observability**: Fallback transitions are logged at WARNING level with provider name and error details
- **Both providers exhausted**: If both the primary and fallback providers fail, the operation fails with an error message stating that no LLM provider is available
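To make the proposed wording concrete, the behavior above can be sketched in Python as a small state machine. `FallbackRouter`, `NoLLMProviderAvailable`, `make_fallback`, and the synchronous `prompt -> str` call shape are all hypothetical and are not the `StrategyActor` API. Measuring the 5-minute interval from the most recent quota error is an assumption inferred from the `_last_quota_error_time` field.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("strategy_actor.fallback")  # hypothetical logger name

# Mirrors the documented _QUOTA_RECOVERY_INTERVAL of 5 minutes, in seconds.
QUOTA_RECOVERY_INTERVAL_S = 5 * 60

# Assumption: an LLM call is modeled as a plain synchronous prompt -> text function.
LLMCall = Callable[[str], str]


class NoLLMProviderAvailable(RuntimeError):
    """Hypothetical error raised when primary and fallback providers both fail."""


class FallbackRouter:
    """Hypothetical sketch of the fallback state described for StrategyActor."""

    def __init__(
        self,
        primary: LLMCall,
        make_fallback: Callable[[], LLMCall],
        is_quota_error: Callable[[BaseException], bool],
        fallback_name: str = "anthropic/claude-sonnet-4-20250514",
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._primary = primary
        self._make_fallback = make_fallback
        self._is_quota_error = is_quota_error
        self._fallback_name = fallback_name
        self._clock = clock
        self._fallback_llm: Optional[LLMCall] = None  # built lazily, then reused
        self._using_fallback = False
        self._last_quota_error_time: Optional[float] = None

    def _get_fallback(self) -> LLMCall:
        # Create the fallback client once and cache it for later calls.
        if self._fallback_llm is None:
            self._fallback_llm = self._make_fallback()
        return self._fallback_llm

    def _should_try_primary(self) -> bool:
        # Outside fallback mode, always use the primary provider; inside it,
        # wait until the recovery interval has passed since the last quota error.
        if not self._using_fallback or self._last_quota_error_time is None:
            return True
        elapsed = self._clock() - self._last_quota_error_time
        return elapsed >= QUOTA_RECOVERY_INTERVAL_S

    def execute(self, prompt: str) -> str:
        if self._should_try_primary():
            try:
                result = self._primary(prompt)
            except Exception as exc:
                if not self._is_quota_error(exc):
                    raise  # only quota errors trigger fallback in this sketch
                self._last_quota_error_time = self._clock()
                if not self._using_fallback:
                    logger.warning(
                        "Primary LLM quota exhausted, switching to %s: %s",
                        self._fallback_name, exc,
                    )
                self._using_fallback = True
            else:
                if self._using_fallback:
                    logger.warning("Primary LLM recovered, leaving fallback mode")
                self._using_fallback = False
                return result
        try:
            return self._get_fallback()(prompt)
        except Exception as exc:
            logger.warning("Fallback LLM %s failed: %s", self._fallback_name, exc)
            raise NoLLMProviderAvailable(
                "No LLM provider available: primary quota exhausted and "
                f"fallback {self._fallback_name} failed"
            ) from exc
```

Injecting the clock keeps the recovery interval testable: a quota error at t=0 routes calls to the cached fallback until t=300 s, when the primary is tried again; another quota error at that point restarts the timer.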

### Classification

**Implementation found a better approach**: the fallback behavior improves CI/CD reliability by handling quota exhaustion gracefully. This is an intentional feature and should be documented in the spec.

## Expected Behavior

After this issue is resolved, `docs/specification.md` accurately documents the LLM provider fallback behavior in `StrategyActor`, including quota detection, fallback provider selection, caching, recovery interval, and observability.

## Acceptance Criteria

- [ ] `docs/specification.md` documents `StrategyActor` LLM provider fallback behavior
- [ ] Quota error detection criteria are documented (HTTP 429, `insufficient_quota`, `rate_limit`)
- [ ] Fallback provider configuration is documented
- [ ] Recovery interval (5 minutes) is documented
- [ ] Observability (WARNING-level logging) is documented
- [ ] Both-providers-exhausted failure behavior is documented

## Subtasks

- [ ] Review `src/cleveragents/actors/strategy_actor.py` to confirm current fallback implementation details
- [ ] Add an LLM Provider Fallback subsection to the StrategyActor documentation in `docs/specification.md`
- [ ] Open a PR targeting `master` with the spec changes
- [ ] Apply the `needs feedback` label to the PR

## Definition of Done

This issue should be closed when:

1. `docs/specification.md` has been updated to document LLM provider fallback behavior.
2. The PR has been reviewed and merged.
3. The spec accurately reflects the implementation with no remaining gaps for this feature.

Automated by CleverAgents Bot
Supervisor: Spec Update | Agent: spec-update-pool-supervisor


HAL9000 added labels on 2026-04-17 06:18:31 +00:00

HAL9000 referenced this issue on 2026-04-17 06:53:17 +00:00 from:

[AUTO-EPIC] Status: Epic Planning Pool Supervisor — Cycle 6 #10217

HAL9000 referenced this issue