Data Integrity: Silent data corruption when reading files with invalid UTF-8 sequences #8125

Open
opened 2026-04-13 03:37:54 +00:00 by HAL9000 · 1 comment
Owner

Metadata

  • Module: src/cleveragents/application/services/context_service.py
  • Line: 233
  • Analysis Pass: Type Safety / Data Flow
  • Commit Message: fix: handle UnicodeDecodeError in ContextService file reading to prevent silent data corruption
  • Branch Name: fix/context-service-utf8-silent-corruption

Background and Context

The ContextService reads files from the user's workspace to provide context to the AI. The integrity of this file content is paramount, as the AI's output is directly influenced by it. Corrupted or incomplete context can lead to flawed code generation.

When adding a file to the context, the _add_file_to_context method reads its content using file_path.read_text(encoding="utf-8", errors="ignore"). The errors="ignore" argument causes Python to silently drop any bytes that cannot be decoded as UTF-8.

If a file contains even a single invalid byte sequence (because it uses a different encoding or is corrupted), it is added to the context with incomplete content. No error, warning, or notification is raised. The result is silent data corruption, which can be very difficult to debug.
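
The difference between the two error modes can be seen on a small byte string. This is a minimal sketch, not code from the repository; the sample bytes are an assumption chosen for illustration (Latin-1 text, which is not valid UTF-8). Path.read_text forwards encoding and errors to open(), so decoding a file behaves the same way as decoding the bytes directly.

```python
# Latin-1 encoded text: 0xE9 is "é" in Latin-1 but is not valid UTF-8 here.
raw = b"caf\xe9 au lait"

# errors="ignore" drops the undecodable byte without any signal.
lossy = raw.decode("utf-8", errors="ignore")
print(repr(lossy))  # 'caf au lait'

# The default, errors="strict", raises instead of returning altered text.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(f"strict decode failed at byte offset {exc.start}")
```

The lossy result is a plausible-looking string, which is why the corruption goes unnoticed downstream.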

Expected Behavior

The system should not silently accept corrupted file data. When a file that is not valid UTF-8 is encountered, the system should:

  1. Read the file with the default errors="strict", so that a UnicodeDecodeError is raised.
  2. Catch this specific exception, log a clear warning that the file could not be read due to an encoding issue, and skip the file.

This ensures that failures are observable and that the context is never silently populated with corrupted data.
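
One way the expected behavior could look is sketched below. The helper name read_text_or_skip, its return convention (None for a skipped file), and the log message are assumptions made for illustration; the actual body of _add_file_to_context is not shown in this issue and may need a different shape.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def read_text_or_skip(file_path: Path) -> Optional[str]:
    """Return the file's text, or None if it is not valid UTF-8.

    Hypothetical helper for illustration; not part of ContextService.
    """
    try:
        # errors defaults to "strict", so invalid bytes raise UnicodeDecodeError.
        return file_path.read_text(encoding="utf-8")
    except UnicodeDecodeError as exc:
        # Make the failure observable, then signal the caller to skip the file.
        logger.warning("Skipping %s: not valid UTF-8 (%s)", file_path, exc)
        return None
```

A caller would add the file to the context only when the return value is not None, so partially decoded content never reaches the context.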

Acceptance Criteria

  • The errors="ignore" argument is removed from the read_text call on line 233.
  • A try...except UnicodeDecodeError block is added to handle file encoding errors gracefully.
  • A warning is logged when a file is skipped due to an encoding error.
  • The context is not populated with partially-read, corrupted data.

Subtasks

  • Modify _add_file_to_context to use errors="strict" (or remove the errors argument entirely).
  • Implement a try...except UnicodeDecodeError block to catch and handle the error.
  • Add logging to report the file path and error when a file is skipped due to an encoding issue.
  • Create a test case with a file containing invalid UTF-8 bytes to verify the new behavior.
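
The last subtask could be covered by a test along the following lines. This is a sketch under stated assumptions: it exercises a hypothetical stand-in function rather than the real ContextService, whose constructor and test fixtures are not described in this issue, and it uses the standard-library unittest module, which may differ from the project's own test framework.

```python
import logging
import tempfile
import unittest
from pathlib import Path
from typing import Optional

logger = logging.getLogger("context_service_sketch")


def read_text_or_skip(file_path: Path) -> Optional[str]:
    # Hypothetical stand-in for the reading step in _add_file_to_context.
    try:
        return file_path.read_text(encoding="utf-8")
    except UnicodeDecodeError as exc:
        logger.warning("Skipping %s: not valid UTF-8 (%s)", file_path, exc)
        return None


class InvalidUtf8Test(unittest.TestCase):
    def test_invalid_utf8_file_is_skipped_with_warning(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            bad = Path(tmp) / "bad.txt"
            # 0xFF can never appear in well-formed UTF-8.
            bad.write_bytes(b"valid prefix \xff valid suffix")

            # assertLogs fails the test if no WARNING is emitted.
            with self.assertLogs(logger, level="WARNING") as captured:
                result = read_text_or_skip(bad)

        # The file is skipped entirely rather than partially decoded.
        self.assertIsNone(result)
        self.assertEqual(len(captured.records), 1)
        self.assertIn("bad.txt", captured.output[0])
```

Saved as a test module, this can be run with python -m unittest. The assertion on the log output checks that the warning names the skipped file, matching the logging subtask.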

Definition of Done

  • All subtasks are complete.
  • Code is reviewed and merged.
  • Unit tests pass.

Automated by CleverAgents Bot
Supervisor: Bug Hunt Pool | Agent: bug-hunt-pool-supervisor


Automated by CleverAgents Bot
Agent: new-issue-creator

HAL9000 added this to the v3.2.0 milestone 2026-04-13 03:37:59 +00:00
Author
Owner

Verified. Silent data corruption is a serious issue, but the Backlog priority suggests it was treated as an edge case (invalid UTF-8 files). Upgrading to Should Have: this should be fixed before production but is not blocking core functionality.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#8125