BUG-HUNT: [resource] No HTTP/socket timeout configured on any LLM provider — network stalls block indefinitely with no recovery path #6694

Open
opened 2026-04-09 23:39:31 +00:00 by HAL9000 · 1 comment
Owner

Bug Report: [resource] — No HTTP/socket timeout on any LLM provider

Severity Assessment

  • Impact: Any network stall, slow response, or provider-side hang causes the entire generate_changes / stream_changes call to block indefinitely. No timeout means no recovery, no fallback, and no progress callback advancement. Long-running processes can be permanently stuck awaiting a response that never arrives.
  • Likelihood: Medium — transient network issues, provider slow-downs during high load, and half-open TCP connections all occur in production environments.
  • Priority: High

Location

  • Files: src/cleveragents/providers/llm/openai_provider.py, anthropic_provider.py, google_provider.py, openrouter_provider.py
  • Function/Class: factory() closures inside each provider's __init__
  • Lines: ~20–35 in each provider file

Description

None of the four concrete LLM provider adapters pass a timeout (or request_timeout) parameter to the underlying LangChain LLM constructor. LangChain's default timeout for ChatOpenAI, ChatAnthropic, and ChatGoogleGenerativeAI is None — which means no timeout is applied at the HTTP layer.

Without a timeout:

  • A slow or unresponsive provider API causes the Python thread to block indefinitely in socket I/O.
  • The progress_callback does not advance past the initial 5% until the response returns, which may be never.
  • No circuit-breaker or fallback is triggered.
  • In the streaming path, the for event in graph.stream(...) loop hangs at the first stuck event.
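
The blocking behaviour in the first bullet can be reproduced without any provider SDK. The following is a minimal standard-library sketch, not code from the repository: a local server accepts a connection and then never replies, standing in for a stalled provider. The names `start_silent_server` and `read_with_timeout` and the 0.2-second value are illustrative assumptions.

```python
from __future__ import annotations

import socket
import threading


def start_silent_server() -> tuple[int, threading.Event]:
    """Listen on an ephemeral local port, accept one client, never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    stop = threading.Event()

    def accept_and_stall() -> None:
        with server:
            conn, _ = server.accept()
            with conn:
                stop.wait()  # hold the connection open and send nothing

    threading.Thread(target=accept_and_stall, daemon=True).start()
    return port, stop


def read_with_timeout(port: int, timeout: float | None) -> str:
    """Try to read one byte; with timeout=None this call would never return."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as client:
        try:
            client.recv(1)
            return "data"
        except socket.timeout:  # alias of TimeoutError on Python 3.10+
            return "timed out"


if __name__ == "__main__":
    port, stop = start_silent_server()
    print(read_with_timeout(port, timeout=0.2))
    stop.set()  # release the stalled server thread
```

With a finite timeout the read fails fast and control returns to the caller. With `timeout=None` the same `recv` call blocks for as long as the peer stays silent, which is the situation this issue describes.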

Evidence

openai_provider.py factory closure:

def factory(resolved_model: str) -> ChatOpenAI:
    kwargs: dict[str, Any] = {"api_key": api_key, "model": resolved_model}
    if organization:
        kwargs["organization"] = organization
    if llm_kwargs:
        kwargs.update(llm_kwargs)
    return ChatOpenAI(**kwargs)  # <-- no timeout parameter

anthropic_provider.py factory closure:

def factory(resolved_model: str) -> ChatAnthropic:
    kwargs: dict[str, Any] = {"model": resolved_model, "api_key": api_key}
    if llm_kwargs:
        kwargs.update(llm_kwargs)
    return ChatAnthropic(**kwargs)  # <-- no timeout parameter

google_provider.py factory closure:

def factory(resolved_model: str) -> ChatGoogleGenerativeAI:
    kwargs: dict[str, Any] = {"api_key": api_key, "model": resolved_model}
    if llm_kwargs:
        kwargs.update(llm_kwargs)
    return ChatGoogleGenerativeAI(**kwargs)  # <-- no timeout parameter

openrouter_provider.py factory closure:

kwargs: dict[str, Any] = {
    "model": resolved_model,
    "openai_api_base": self._BASE_URL,
    "openai_api_key": api_key,
}
# ...
return ChatOpenAI(**kwargs)  # <-- no timeout parameter

LangChain supports timeout (in seconds) or request_timeout for all of these classes. The parameter is currently never passed.

Expected Behavior

Provider calls should have a configurable timeout with a sensible default (e.g., 120 seconds). When the timeout expires, a TimeoutError or provider-specific exception should propagate so callers can handle it (retry, fallback to another provider, report an error to the user).
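
As a rough illustration of the caller-side handling described above, the sketch below tries providers in order and moves on when one times out. It is hypothetical: `generate_with_fallback` and the `(name, callable)` shape are assumptions made for this example, not existing CleverAgents APIs. Provider SDKs may also raise their own timeout exception types rather than the built-in `TimeoutError`, so a real version would need to catch or translate those.

```python
from __future__ import annotations

from collections.abc import Callable, Sequence


def generate_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, str]:
    """Return (provider_name, response) from the first provider that answers in time."""
    failures: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except TimeoutError as exc:
            # Record the failure and fall through to the next provider.
            failures.append(f"{name}: {exc}")
    raise TimeoutError("all providers timed out: " + "; ".join(failures))


if __name__ == "__main__":
    def stalled(prompt: str) -> str:
        raise TimeoutError("no response within the configured timeout")

    def healthy(prompt: str) -> str:
        return f"echo: {prompt}"

    print(generate_with_fallback([("primary", stalled), ("backup", healthy)], "hello"))
```

None of this is possible today, because without a timeout the first stalled call never raises and the loop never reaches the second provider.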

Actual Behavior

All four providers issue HTTP requests with no timeout. A hung connection blocks indefinitely. The only recovery path is an out-of-band process kill.

Suggested Fix

  1. Add a timeout: float | None = 120.0 parameter to each provider __init__.
  2. Include it in the LLM factory kwargs:
    kwargs["timeout"] = timeout  # ChatOpenAI, ChatAnthropic support this
    
  3. Consider exposing timeout on ProviderRegistry.create_ai_provider() so callers can set it based on SLA requirements.
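
One detail the steps above leave open is precedence between the new default and anything the caller already passes through llm_kwargs. The sketch below shows one possible ordering, with the default applied last so an explicit caller value wins. `build_llm_kwargs` and `DEFAULT_TIMEOUT_SECONDS` are hypothetical names introduced for this example; the factory closures in the repository would need the equivalent logic inline, and which spelling each LangChain class accepts should be checked against the installed version.

```python
from __future__ import annotations

from typing import Any

DEFAULT_TIMEOUT_SECONDS = 120.0  # the default proposed in this issue


def build_llm_kwargs(
    base: dict[str, Any],
    llm_kwargs: dict[str, Any] | None = None,
    timeout: float | None = DEFAULT_TIMEOUT_SECONDS,
) -> dict[str, Any]:
    """Merge constructor kwargs and add a timeout unless the caller set one."""
    kwargs = dict(base)  # copy so the caller's dict is not mutated
    if llm_kwargs:
        kwargs.update(llm_kwargs)
    # Respect an explicit override under either spelling named in this issue,
    # and avoid passing both spellings to the same constructor.
    if "timeout" not in kwargs and "request_timeout" not in kwargs:
        kwargs["timeout"] = timeout
    return kwargs


if __name__ == "__main__":
    print(build_llm_kwargs({"model": "example-model"}))
    print(build_llm_kwargs({"model": "example-model"}, {"timeout": 30}))
```

Passing `timeout=None` still disables the timeout, which keeps the current behaviour available as an explicit opt-out rather than the silent default.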

Category

resource / timeout

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use the tags @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before it is fixed.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

Author
Owner

Verified — Resource bug: no HTTP/socket timeout on LLM providers — network stalls block indefinitely. MoSCoW: Must-have. Priority: High.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#6694