cleveragents/cleveragents-core

BUG-HUNT: [boundary] `_estimate_token_usage` truncates context to 2000 chars per entry but full context is sent to the API — token estimate is systematically understated, causing silent budget overruns #6700

Open

opened 2026-04-09 23:40:51 +00:00 by HAL9000 · 1 comment

HAL9000 commented

2026-04-09 23:40:51 +00:00

Owner

Bug Report: [boundary] — Token estimation truncation causes silent budget overruns

Severity Assessment

Impact: Token estimates used for budget enforcement are computed from at most the first 2,000 characters of each context entry, while the actual API call sends the full context (potentially hundreds of KB). The budget check can report UNDER_BUDGET when the real cost is several times higher, so budget limits are silently exceeded.
Likelihood: High. Any plan that passes file content or other large entries in contexts triggers this (the common case).
Priority: High

Location

File: src/cleveragents/providers/llm/langchain_chat_provider.py
Function/Class: LangChainChatProvider._estimate_token_usage()
Lines: ~295–325

Description

_estimate_token_usage is the fallback token counter used when:

- The provider is not an OpenAI LLM (so no LangChain callback tracker), OR
- The callback tracker returns no total_tokens

The method builds a prompt_text for estimation by truncating each context entry:

for ctx in contexts:
    content = getattr(ctx, "content", None) or ""
    if content:
        prompt_text += f"\n{content[:2000]}"   # <-- truncated to 2000 chars

However, the actual call to graph.invoke() or graph.stream() passes the full, untruncated contexts list:

state = graph.invoke(
    project,
    plan,
    contexts,            # <-- full content
    thread_id=thread_id,
    ...
)

For a context with 50,000 characters of content, the estimate counts tokens for 2,000 characters (~500 tokens) while the actual API call bills for 50,000 characters (~12,500 tokens). The error factor can be 25× or more.
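The arithmetic above can be reproduced with a rough character-count heuristic. This is a minimal sketch that assumes about 4 characters per token, which is a common rule of thumb and not what any particular tokenizer reports. `approx_tokens`, `CHARS_PER_TOKEN`, and `TRUNCATION_LIMIT` are illustrative names, not identifiers from the codebase.

```python
CHARS_PER_TOKEN = 4        # assumption: rough rule of thumb, varies by tokenizer
TRUNCATION_LIMIT = 2000    # mirrors the content[:2000] slice in the estimator


def approx_tokens(text: str) -> int:
    """Approximate a token count from the character count."""
    return len(text) // CHARS_PER_TOKEN


content = "x" * 50_000                                   # one large context entry
estimated = approx_tokens(content[:TRUNCATION_LIMIT])    # what the estimator sees
actual = approx_tokens(content)                          # what the API call sends
ratio = actual / estimated

print(f"estimated={estimated} actual={actual} ratio={ratio:.0f}x")
```

The ratio grows linearly with the size of the entry, so larger files produce a proportionally larger undercount.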

This path is hit for ALL non-OpenAI providers (Anthropic, Google, OpenRouter, Groq, etc.) and for OpenAI when the LangChain callback is not available.

Evidence

langchain_chat_provider.py — _estimate_token_usage() (~lines 295–320):

def _estimate_token_usage(
    self,
    llm: BaseLanguageModel,
    plan: Plan,
    contexts: list[Context],
) -> int:
    get_tokens = getattr(llm, "get_num_tokens", None)
    if not callable(get_tokens):
        return 0

    estimator = cast(TokenEstimator, get_tokens)

    prompt_text = plan.prompt or ""
    for ctx in contexts:
        content = getattr(ctx, "content", None) or ""
        if content:
            prompt_text += f"\n{content[:2000]}"  # <-- truncates here

    try:
        tokens = estimator(prompt_text)
    except Exception:
        return 0
    ...

generate_changes() (~lines 97–105):

state = graph.invoke(
    project,
    plan,
    contexts,   # <-- full untruncated contexts
    thread_id=thread_id,
    actor_context=actor_context,
)

record_usage() in cost_tracker.py:

cost = self.estimate_cost(provider, model, input_tokens, output_tokens)

The input_tokens value fed to record_usage comes from _resolve_token_count, which calls _estimate_token_usage, so the undercount propagates directly into budget enforcement.
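To illustrate how the undercount can flip a budget decision, here is a toy sketch of the propagation. It does not reproduce the real cost_tracker.py: the `BudgetStatus` enum below is a stand-in containing only the two members named in this report, and `check_budget`, the price, and the budget are hypothetical values chosen for the example.

```python
from enum import Enum


class BudgetStatus(Enum):
    # Stand-in for the project's enum; only the members named in this report.
    UNDER_BUDGET = "under_budget"
    EXCEEDED = "exceeded"


PRICE_PER_1K_INPUT_TOKENS = 0.003   # hypothetical price in USD
BUDGET_USD = 0.01                   # hypothetical budget for one call


def check_budget(input_tokens: int) -> BudgetStatus:
    """Hypothetical budget check: cost derived purely from input tokens."""
    cost = input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    return BudgetStatus.EXCEEDED if cost > BUDGET_USD else BudgetStatus.UNDER_BUDGET


reported = check_budget(500)       # truncated estimate for a 50,000-char context
real = check_budget(12_500)        # approximate tokens actually billed

print(reported, real)
```

With these assumed numbers, the same request is recorded as under budget while the billed usage would have exceeded it.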

Expected Behavior

The token estimate used for budget enforcement should reflect the tokens that will be (or were) sent to the API. The estimator should do one of the following:

1. Use the full content for estimation, OR
2. If tokenizing the full content is too expensive, return a conservatively high estimate (e.g., len(content) // 4 per context), OR
3. Clearly document that the estimate is approximate and choose a truncation limit that matches real usage patterns.
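One possible shape for option 2 is to use a character-count figure as a floor under whatever the tokenizer reports. This is a sketch under stated assumptions, not the project's implementation: `estimate_prompt_tokens` and `CHARS_PER_TOKEN` are hypothetical names, the 4-characters-per-token ratio is a rule of thumb, and `count_tokens` stands in for a callable such as the `get_num_tokens` method looked up in the existing code.

```python
from typing import Callable, Iterable, Optional

CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb, varies by tokenizer


def estimate_prompt_tokens(
    prompt: str,
    contents: Iterable[str],
    count_tokens: Optional[Callable[[str], int]] = None,
) -> int:
    """Estimate tokens over the full, untruncated prompt and context text."""
    full_text = "\n".join([prompt, *(c for c in contents if c)])
    # Ceiling division so short, non-empty text never rounds down to zero.
    heuristic = -(-len(full_text) // CHARS_PER_TOKEN)
    if count_tokens is None:
        return heuristic
    try:
        counted = count_tokens(full_text)
    except Exception:
        # Fall back to the heuristic instead of reporting zero tokens.
        return heuristic
    # Take the larger figure so the estimate errs on the high side.
    return max(counted, heuristic)


print(estimate_prompt_tokens("plan", ["a" * 50_000]))
```

Falling back to the heuristic on tokenizer failure is a design choice of this sketch. The existing method returns 0 in that case, which also understates usage.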

Actual Behavior

Token estimates are based on at most 2,000 characters per context entry. With large contexts (code files, documents), actual API token costs can be 10–50× the estimate, causing budget limits to be silently exceeded without triggering BudgetStatus.EXCEEDED.

Suggested Fix

Remove the truncation limit from _estimate_token_usage, or use character-based division as a conservative fast-path:

for ctx in contexts:
    content = getattr(ctx, "content", None) or ""
    if content:
        prompt_text += f"\n{content}"  # Use full content for accurate estimation

If full estimation is too slow, document the limitation and add a prominent warning log when the truncated estimate deviates significantly from a character-count heuristic.
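The warning described above could look roughly like the following. This is a hedged sketch: `warn_if_understated`, `CHARS_PER_TOKEN`, and `DEVIATION_FACTOR` are hypothetical names, and the 2× threshold is an arbitrary value chosen for illustration.

```python
import logging

logger = logging.getLogger(__name__)

CHARS_PER_TOKEN = 4      # assumption: rough rule of thumb
DEVIATION_FACTOR = 2.0   # hypothetical threshold for "deviates significantly"


def warn_if_understated(estimated_tokens: int, full_char_count: int) -> bool:
    """Log a warning when the estimate is far below a character-count heuristic.

    Returns True if a warning was emitted.
    """
    heuristic = full_char_count // CHARS_PER_TOKEN
    understated = heuristic > estimated_tokens * DEVIATION_FACTOR
    if understated:
        logger.warning(
            "Token estimate %d is far below the character-count heuristic %d; "
            "budget figures may be understated",
            estimated_tokens,
            heuristic,
        )
    return understated


print(warn_if_understated(500, 50_000))
print(warn_if_understated(500, 2_000))
```

Returning a boolean keeps the check easy to assert on in tests without inspecting log output.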

Category

boundary / cost-tracking

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use the tags @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

HAL9000 referenced this issue

2026-04-10 02:11:41 +00:00

[AUTO-BUG-POOL] Bug Detection Report (Cycle 2) #6753

HAL9000 commented

2026-04-14 07:20:06 +00:00

Author

Owner

✅ Verified — Bug: token usage estimate systematically understated — silent budget overruns. MoSCoW: Should-have. Priority: Medium.

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor
