BUG-HUNT: [resource] Memory inefficient file operations in ADR inventory scanning lack proper resource management #7305

Open
opened 2026-04-10 15:47:09 +00:00 by HAL9000 · 4 comments
Owner

Metadata

  • Branch: bugfix/resource-adr-inventory-memory-inefficient-file-ops
  • Commit Message: fix(hooks): read only YAML front-matter in _collect_adr_inventory to prevent memory exhaustion
  • Milestone: (none — backlog routing; see note below)
  • Parent Epic: (orphan — see note below)

Backlog note: This issue was discovered during autonomous operation
on milestone v3.2.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.

Related issue: #7282 covers a closely related error-suppression issue in the same function (_collect_adr_inventory). Both issues affect the same code site and may be addressed together in a single fix.


Background and Context

The _collect_adr_inventory function in hooks/adr_hooks.py reads entire ADR file contents into memory using Path.read_text() without any resource management or size constraints. The function only needs the YAML front-matter section (the block delimited by --- lines at the top of each file), but unconditionally loads the full file content for every ADR in the collection.

For large ADR collections or files with large embedded content (diagrams, code blocks, long rationale sections), this pattern can cause memory exhaustion during documentation builds. It also scales poorly as the documentation corpus grows.

This violates the project specification's resource management principles: "proper cleanup of temporary objects and efficient operations for large inventories."

Current Behavior

File: hooks/adr_hooks.py
Function: _collect_adr_inventory
Lines: 381–382

# Reads entire file content into memory, no size limit
try:
    content = Path(f.abs_src_path).read_text(encoding="utf-8")  # No size limits
except OSError:
    continue

# Only needs YAML front-matter but reads entire file
fm_match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)  # Only uses front-matter

The function:

  • Loads the full content of every ADR file into memory unconditionally
  • Only uses the YAML front-matter block (--- … ---) from the loaded content
  • Applies no size constraints or streaming reads
  • Provides no resource cleanup or memory management
  • Scales poorly as the ADR collection grows

Expected Behavior

Per the project specification for resource management:

  • File reads should be scoped to the minimum data required (front-matter only)
  • Operations over large collections should be memory-efficient
  • Temporary objects should be properly cleaned up
  • Streaming or partial reads should be used where full content is not needed

The function should read only the YAML front-matter section rather than loading entire file contents, using either:

  1. A streaming/line-by-line read that stops after the closing --- delimiter, or
  2. A bounded read with a reasonable size cap for the front-matter region
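The first option could be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the helper name read_front_matter and the constant MAX_FRONT_MATTER_CHARS are hypothetical, and the cap is counted in characters as an approximation of a byte limit. Unlike the existing regex, the sketch only accepts a closing delimiter line that is exactly ---.

```python
from pathlib import Path
from typing import Optional

# Hypothetical cap (in characters), loosely based on the 64 KB figure
# suggested in the subtasks of this issue.
MAX_FRONT_MATTER_CHARS = 64 * 1024


def read_front_matter(
    path: Path, max_chars: int = MAX_FRONT_MATTER_CHARS
) -> Optional[str]:
    """Return the YAML front-matter text of ``path``, or None if absent.

    Reads line by line inside a context manager and stops at the closing
    ``---`` line, or gives up once ``max_chars`` characters have been
    consumed, so the body of the file is never loaded.
    """
    with path.open("r", encoding="utf-8") as fh:
        first = fh.readline(max_chars)
        if first.rstrip("\r\n") != "---":
            return None  # file does not start with front-matter
        consumed = len(first)
        lines = []
        while consumed < max_chars:
            budget = max_chars - consumed
            line = fh.readline(budget)
            if not line:
                return None  # end of file before the closing delimiter
            consumed += len(line)
            # A line cut short by the budget must not be mistaken for "---".
            truncated = len(line) == budget and not line.endswith("\n")
            if not truncated and line.rstrip("\r\n") == "---":
                return "".join(lines)
            lines.append(line)
    return None  # cap reached: treat as malformed or missing front-matter
```

The caller would catch OSError around read_front_matter(Path(f.abs_src_path)) in the same way it does today and pass the returned text to the YAML parser. How a None result should be reported is left open here, since that overlaps with the error-suppression concern in #7282.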

Impact

  • Memory exhaustion with large ADR files or collections
  • Poor performance on documentation builds as the ADR corpus grows
  • Inefficient resource usage — entire file content allocated for minimal data extraction
  • Scalability issues as documentation grows over time

Acceptance Criteria

  • _collect_adr_inventory no longer reads entire file contents into memory
  • Only the YAML front-matter section is read from each ADR file
  • File reading uses a streaming or bounded approach with appropriate size limits
  • Memory usage scales with front-matter size, not total file size
  • Context managers are used for all file operations
  • All nox stages pass
  • Coverage >= 97%

Subtasks

  • Implement streaming front-matter reader that stops at closing --- delimiter
  • Add a reasonable size cap for front-matter reads (e.g., 64 KB) to guard against malformed files
  • Replace Path.read_text() call with the new bounded reader
  • Use context managers (with statement) for all file I/O in the function
  • Add or update BDD scenarios covering memory-efficient front-matter extraction
  • Verify no regression in ADR inventory collection behaviour
  • Update docstring to document the bounded-read contract

Definition of Done

  • _collect_adr_inventory reads only YAML front-matter, not full file content
  • Streaming or bounded read implemented with documented size limit
  • Context managers used for all file operations in the function
  • BDD unit scenarios added for the new front-matter reader
  • Integration test confirms correct metadata extraction on representative ADR files
  • No memory regression introduced (ASV benchmark or manual verification)
  • All nox stages pass
  • Coverage >= 97%
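For the "manual verification" route mentioned above, one possible approach is to compare peak allocations with the standard-library tracemalloc module. The sketch below is an assumption about how such a check might look, not an existing project benchmark: the helpers peak_bytes, read_head, and compare_peaks and the file name 0001-example.md are hypothetical, and read_head stands in for the bounded-read option with a 64 K character cap.

```python
import tempfile
import tracemalloc
from pathlib import Path


def peak_bytes(func, *args):
    """Run func(*args) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def read_head(path: Path, limit: int = 64 * 1024) -> str:
    """Bounded read: return at most ``limit`` characters from the file."""
    with path.open("r", encoding="utf-8") as fh:
        return fh.read(limit)


def compare_peaks(body_chars: int = 5_000_000):
    """Return (full_read_peak, bounded_read_peak) for a synthetic ADR."""
    with tempfile.TemporaryDirectory() as tmp:
        adr = Path(tmp) / "0001-example.md"  # hypothetical file name
        adr.write_text(
            "---\nstatus: accepted\n---\n" + "x" * body_chars,
            encoding="utf-8",
        )
        full = peak_bytes(lambda: adr.read_text(encoding="utf-8"))
        bounded = peak_bytes(read_head, adr)
    return full, bounded


if __name__ == "__main__":
    full, bounded = compare_peaks()
    print(f"full read peak: {full} bytes, bounded read peak: {bounded} bytes")
```

The full-read peak should grow with the size of the synthetic body, while the bounded-read peak should stay roughly constant. Exact numbers depend on the interpreter, so a check like this is better suited to a relative comparison than to a fixed threshold.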

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

Author
Owner

⚠️ Orphan Issue — Needs Manual Parent Epic Linking

This issue was created by an automated bug-hunting agent and has no parent Epic linked. Per CONTRIBUTING.md, orphan issues are not permitted — every issue must be linked to a parent Epic via Forgejo's dependency system (child blocks parent).

Action required (project owner):

  1. Identify the appropriate parent Epic for ADR hook resource management work
  2. Create the dependency: this issue (#7305) blocks the parent Epic
  3. Remove this comment once the link is established

Suggested parent candidates:

  • Any open Epic covering hooks/adr_hooks.py improvements or ADR documentation pipeline work
  • A general "resource management / memory efficiency" Epic if one exists
  • Related issue #7282 (error suppression in same function) may share a parent Epic

Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

Author
Owner

Verified — Performance bug: memory inefficient file operations in ADR scanning. MoSCoW: Could-have. Priority: Low.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Author
Owner

Verified — Performance bug: memory inefficient file operations in ADR scanning. MoSCoW: Could-have. Priority: Low.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Author
Owner

Verified — Performance bug: memory inefficient file operations in ADR scanning. MoSCoW: Could-have. Priority: Low.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor


Reference
cleveragents/cleveragents-core#7305