BUG-HUNT: [boundary] ContextAnalysisAgent._chunk_documents produces O(N) chunks with step=1 when chunk_overlap >= chunk_size — memory exhaustion for large files #6552

Open
opened 2026-04-09 21:18:05 +00:00 by HAL9000 · 1 comment
Owner

Bug Report: [boundary] — _chunk_documents degenerates into character-by-character chunking when chunk_overlap >= chunk_size

Severity Assessment

  • Impact: If chunk_overlap >= chunk_size, the sliding-window step (chunk_size - chunk_overlap) is 0 or negative and is clamped to 1 by max(1, ...). A file of N characters then produces N chunks, each up to chunk_size characters long. For a 500 KB file with chunk_size=2000, this generates ~500,000 chunks holding about 1 billion characters of chunk text in total (500,000 × 2,000), which is on the order of 1 GB or more and causes extreme slowness or OOM. There is no validation to prevent this misconfiguration.
  • Likelihood: Medium — can be triggered by accidental misconfiguration (e.g., chunk_size=500, chunk_overlap=500 from a config file)
  • Priority: High

Location

  • File: src/cleveragents/agents/graphs/context_analysis.py
  • Class: ContextAnalysisAgent
  • Method: _chunk_documents, __init__
  • Lines: 97 (__init__), 237–248 (_chunk_documents)

Description

In _chunk_documents:

# context_analysis.py lines 237-248
step = max(1, self.chunk_size - self.chunk_overlap)
for index, start in enumerate(range(0, len(content), step)):
    chunk_content = content[start : start + self.chunk_size]
    chunk_metadata: dict[str, Any] = {**metadata, "chunk_index": index}
    chunks.append(Document(page_content=chunk_content, metadata=chunk_metadata))

When chunk_overlap >= chunk_size:

  • chunk_size - chunk_overlap → <= 0
  • max(1, ...) → 1
  • range(0, len(content), 1) → N iterations for a file of N characters

A 100 KB file (len = 102,400) with chunk_size=1000 and chunk_overlap >= 1000 produces 102,400 Document objects, almost all of them 1000 chars long, for roughly 100 million characters of chunk text. Together with per-Document overhead, this can exhaust memory.
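
The arithmetic above can be checked without allocating any chunks. The following is a minimal sketch, not code from the repository: `chunk_stats` is a hypothetical helper that only mirrors the step computation and slicing bounds of the loop quoted above.

```python
def chunk_stats(content_length: int, chunk_size: int, chunk_overlap: int) -> tuple[int, int]:
    """Return (number of chunks, total characters copied into chunks).

    Hypothetical helper: it reproduces the step and slice bounds of the
    quoted loop, but counts characters instead of building Document objects.
    """
    step = max(1, chunk_size - chunk_overlap)
    starts = range(0, content_length, step)
    # content[start : start + chunk_size] is truncated at the end of the content.
    total_chars = sum(min(chunk_size, content_length - start) for start in starts)
    return len(starts), total_chars


if __name__ == "__main__":
    # Default parameters from the quoted constructor: step is 1800.
    print(chunk_stats(102_400, 2000, 200))
    # Misconfiguration from the report: step collapses to 1.
    print(chunk_stats(102_400, 1000, 1000))
```

With the default parameters the chunk count stays in the tens, while the misconfigured call yields one chunk per input character.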

The __init__ constructor has no validation:

# context_analysis.py line 97
def __init__(self, ..., chunk_size: int = 2000, chunk_overlap: int = 200, ...):
    self.chunk_size = chunk_size
    self.chunk_overlap = chunk_overlap
    # No validation: chunk_overlap < chunk_size is not enforced!

Expected Behavior

__init__ should raise ValueError if chunk_overlap >= chunk_size:

if chunk_overlap >= chunk_size:
    raise ValueError(
        f"chunk_overlap ({chunk_overlap}) must be less than chunk_size ({chunk_size})"
    )
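
As a self-contained illustration of that check, the sketch below wraps it in a standalone function. The name `validate_chunk_params` is hypothetical, and the two extra guards (positive chunk_size, non-negative chunk_overlap) are assumptions that go beyond what this report asks for.

```python
def validate_chunk_params(chunk_size: int, chunk_overlap: int) -> None:
    """Raise ValueError for chunking parameters that cannot make progress.

    Hypothetical helper; in the real class the check would presumably
    live in __init__ before the attributes are assigned.
    """
    # Assumption (not in the report): a non-positive chunk size is also invalid.
    if chunk_size <= 0:
        raise ValueError(f"chunk_size ({chunk_size}) must be positive")
    # Assumption (not in the report): a negative overlap is also invalid.
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap ({chunk_overlap}) must not be negative")
    # The check proposed in this report.
    if chunk_overlap >= chunk_size:
        raise ValueError(
            f"chunk_overlap ({chunk_overlap}) must be less than chunk_size ({chunk_size})"
        )


if __name__ == "__main__":
    validate_chunk_params(2000, 200)  # default values: accepted
    try:
        validate_chunk_params(500, 500)  # misconfiguration from the report
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Failing at construction time surfaces the misconfiguration immediately, before any file is chunked.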

Actual Behavior

chunk_overlap >= chunk_size silently degenerates to step=1, creating roughly one chunk per input character and causing memory exhaustion on large files.

Category

boundary / resource

TDD Note

After this bug is verified, a Type/Testing issue will be created with a TDD test tagged @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.
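
A rough sketch of what such an expected-fail test could look like is given below. It is an assumption-heavy illustration: it uses the standard library's `unittest.expectedFailure` as a stand-in for the project's `@tdd_expected_fail` tag, and `ChunkerStandIn` is a hypothetical class that reproduces only the unvalidated constructor quoted above, not the real ContextAnalysisAgent.

```python
import unittest


class ChunkerStandIn:
    """Hypothetical stand-in with the same unvalidated constructor fields."""

    def __init__(self, chunk_size: int = 2000, chunk_overlap: int = 200) -> None:
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # No validation here, matching the behavior described in this report.


class TestChunkOverlapValidation(unittest.TestCase):
    @unittest.expectedFailure  # the bug still exists, so this test should fail
    def test_overlap_equal_to_size_is_rejected(self) -> None:
        with self.assertRaises(ValueError):
            ChunkerStandIn(chunk_size=500, chunk_overlap=500)


def run_suite() -> unittest.TestResult:
    """Run the test case and return the result object for inspection."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestChunkOverlapValidation)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    print(len(result.expectedFailures))
```

Once validation is added, the test would start passing and be reported as an unexpected success, which is the signal to drop the expected-failure marker.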


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:28:12 +00:00
Author
Owner

Verified — Valid boundary bug. When chunk_overlap >= chunk_size, step=1 produces O(N) chunks causing memory exhaustion for large files. MoSCoW: Should Have — can cause OOM in production with large codebases.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#6552