BUG-HUNT: [resource] ActorLoader.discover() reads file content twice — once for hash check and once for duplicate detection, wasting I/O #7381

Open
opened 2026-04-10 18:37:44 +00:00 by HAL9000 · 3 comments
Owner

Bug Report: [resource] ActorLoader.discover() reads file content twice per file, doubling disk I/O

Severity Assessment

  • Impact: Each discovered YAML file is read from disk twice per discover() call: once in the discovery pass, where it is hashed for the cache comparison and parsed, and again when the content hash is recomputed for the new cache entry. In environments with hundreds of actor files, this can roughly double the read I/O of actor discovery.
  • Likelihood: High — affects every discover() call on any deployment with actor files
  • Priority: Medium (performance, not correctness)

Location

  • File: src/cleveragents/actor/loader.py
  • Function/Class: ActorLoader.discover()
  • Lines: ~90-155

Description

In ActorLoader.discover(), there's a double-read pattern:

First read: During the initial pass, the file is read and hashed so the hash can be compared against the cache:

content = resolved.read_bytes()  # First read
content_hash = _compute_hash(content)
# ... then processes the file

Second read: When committing the new cache entry:

for name, entries in pending.items():
    resolved_path, config = entries[0]
    content_hash = _compute_hash(resolved_path.read_bytes())  # Second read!
    # Creates cache entry with this hash

Neither content nor content_hash from the first read is stored in the pending dict, so the second loop re-reads the file from disk only to recompute a hash that was already available.

Evidence

# First loop: reads the file for parsing
for path in found_files:
    resolved = path.resolve()
    content = resolved.read_bytes()  # READ #1
    content_hash = _compute_hash(content)
    
    # ... checks existing cache, parses YAML, etc.
    pending.setdefault(name, []).append((resolved, config))
    # ^ Note: 'content' and 'content_hash' are NOT stored in pending!

# Second loop: re-reads the file for the cache entry
new_actors: dict[str, _CacheEntry] = {}
for name, entries in pending.items():
    resolved_path, config = entries[0]
    content_hash = _compute_hash(resolved_path.read_bytes())  # READ #2 - wasted!

Expected Behavior

The content_hash computed during the first read should be stored in the pending dict and reused in the second loop, so the file is never read a second time:

# Store content_hash alongside config in pending:
pending.setdefault(name, []).append((resolved, config, content_hash))  # Store hash!

# In second loop:
for name, entries in pending.items():
    resolved_path, config, content_hash = entries[0]  # Use stored hash!
    # No re-read needed!
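
The proposed change can be tried in isolation with a minimal, self-contained sketch. It is not the actual loader: `discover_once` is a hypothetical function, `_compute_hash` is assumed to be a SHA-256 digest (the real helper may differ), and the actor name is taken from the file stem instead of the parsed YAML config.

```python
import hashlib
from pathlib import Path


def _compute_hash(content: bytes) -> str:
    # Assumption: stand-in for the project's helper; the real digest may differ.
    return hashlib.sha256(content).hexdigest()


def discover_once(found_files: list[Path]) -> dict[str, tuple[Path, str]]:
    """Read each file once and carry its hash into the cache-building pass."""
    pending: dict[str, list[tuple[Path, str]]] = {}
    for path in found_files:
        resolved = path.resolve()
        content = resolved.read_bytes()  # the only read for this file
        content_hash = _compute_hash(content)
        # Placeholder: the real loader derives the name from the parsed YAML.
        name = resolved.stem
        pending.setdefault(name, []).append((resolved, content_hash))

    new_actors: dict[str, tuple[Path, str]] = {}
    for name, entries in pending.items():
        resolved_path, content_hash = entries[0]  # reuse the stored hash
        new_actors[name] = (resolved_path, content_hash)
    return new_actors
```

Storing only the hash (not the raw bytes) keeps the pending dict small, which matters if many large files are discovered in one call.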

Actual Behavior

Every file that needs to be loaded (or updated) is read from disk twice: once during the discovery pass (to parse YAML), and once during the cache-building pass (to compute the final content hash).

Category

resource

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.
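
One way such a test could prove the double read is to count `Path.read_bytes` calls per file. The sketch below is framework-agnostic and makes several assumptions: `count_reads` is a hypothetical helper, and `discover_double_read` is a stand-in that only mimics the pattern described in this report, not the real `ActorLoader.discover()`.

```python
import hashlib
from collections import Counter
from pathlib import Path
from unittest import mock


def _compute_hash(content: bytes) -> str:
    # Assumption: stand-in for the project's helper.
    return hashlib.sha256(content).hexdigest()


def discover_double_read(found_files):
    """Hypothetical stand-in mimicking the double-read pattern from the report."""
    pending = {}
    for path in found_files:
        resolved = path.resolve()
        content = resolved.read_bytes()  # read #1
        _compute_hash(content)
        pending.setdefault(resolved.stem, []).append(resolved)
    return {
        name: _compute_hash(entries[0].read_bytes())  # read #2
        for name, entries in pending.items()
    }


def count_reads(func, *args):
    """Run func(*args) and return (result, Counter of read_bytes calls per path)."""
    reads = Counter()
    original = Path.read_bytes

    def counting_read_bytes(self):
        reads[str(self)] += 1
        return original(self)

    # Temporarily replace Path.read_bytes with the counting wrapper.
    with mock.patch.object(Path, "read_bytes", counting_read_bytes):
        result = func(*args)
    return result, reads
```

A test built on this helper would assert that each path is read exactly once; that assertion fails against the double-read pattern and passes once the stored hash is reused.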


Automated by CleverAgents Bot
Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor

Author
Owner

Verified — Performance bug: ActorLoader.discover() reads file content twice. MoSCoW: Could-have. Priority: Low — minor optimization.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#7381