UAT: _CONTROL_CHAR_RE in plain renderer strips valid 8-bit characters (U+0080–U+009F) from non-ASCII user content #4090

Open
opened 2026-04-06 10:14:44 +00:00 by freemo · 0 comments
Owner

Metadata

  • Branch: fix/cli-control-char-regex-non-ascii
  • Commit Message: fix(cli): narrow _CONTROL_CHAR_RE to only strip actual C1 control codes, not valid Unicode
  • Milestone: (none — backlog)
  • Parent Epic: #399 (Post-MVP Server & Clients — nearest applicable CLI epic)

Bug Report

What was tested: Whether the CLI's terminal escape sanitizer correctly handles non-ASCII user content in plain/color/table output formats.

Expected behavior:

The strip_terminal_escapes() function in src/cleveragents/cli/output/_renderers.py should only strip:

  1. ANSI/VT terminal escape sequences (to prevent terminal injection)
  2. Actual C0 control characters (U+0000–U+001F, excluding tab/newline/CR)
  3. DEL (U+007F)

It should not strip valid Unicode characters that happen to fall in the U+0080–U+009F range.

Actual behavior (from code analysis):

The _CONTROL_CHAR_RE regex at line 56 of src/cleveragents/cli/output/_renderers.py is:

_CONTROL_CHAR_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")

The range \x7f-\x9f includes:

  • U+007F (DEL) — correct to strip
  • U+0080–U+009F — these are the "C1 control codes" in the Latin-1 supplement

The problem: U+0080–U+009F are the C1 control codes. Unicode still assigns them General Category Cc, so they remain control characters in Python 3 strings (which is what the CLI uses), but they are also valid Unicode scalar values that can legitimately appear in user-supplied text. Stripping them silently, with no documentation, alters user content without warning.

Two points about this range:

  • U+0080–U+009F are C1 control codes: rarely used in modern text, but valid Unicode.
  • In a Python re pattern applied to a str, \x7f-\x9f matches Unicode code points U+007F through U+009F, not raw bytes.

More critically, the regex comment says "8-bit C1 control characters (0x80-0x9F)", which describes single bytes in an 8-bit stream such as ISO 8859-1. In Python 3, all strings are Unicode, so the pattern operates on code points rather than bytes, and removing these code points from user-supplied text without documenting it is incorrect.
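The matching behavior described above can be checked directly. The following is a minimal sketch that uses only the character class quoted in this report; the name `CONTROL_CHAR_RE` is a local copy for illustration, not an import from the project.

```python
import re
import unicodedata

# Local copy of the character class quoted in this report.
CONTROL_CHAR_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")

# Collect every code point below U+0100 that the class matches.
matched = [cp for cp in range(0x100) if CONTROL_CHAR_RE.fullmatch(chr(cp))]

# Tab, LF and CR are outside the class.
assert 0x09 not in matched and 0x0A not in matched and 0x0D not in matched

# Every code point from U+007F through U+009F is matched.
assert all(cp in matched for cp in range(0x7F, 0xA0))

# U+00A0 (NO-BREAK SPACE) is the first code point past the range.
assert 0xA0 not in matched

# General_Category values that Unicode assigns to U+0080-U+009F.
c1_categories = {unicodedata.category(chr(cp)) for cp in range(0x80, 0xA0)}
print(c1_categories)
```

On a `str` pattern the escapes `\x7f-\x9f` denote code points, so the class covers exactly U+007F–U+009F in addition to the C0 range.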

Example of corruption:

# U+0080 is a valid Unicode character (PAD - Padding Character)
# but more importantly, some legacy encodings map useful characters here
# The real risk is with characters like:
# U+0085 (NEL - Next Line) - used in some text formats
# U+00A0 (NO-BREAK SPACE) - NOT in range but shows the boundary concern

# The regex strips these from user content:
text = "Hello\x85World"  # U+0085 NEL
strip_terminal_escapes(text)  # Returns "HelloWorld" — character silently removed
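The legacy-encoding concern mentioned in the comments can be made concrete. This is a sketch under stated assumptions: `strip_control_chars` is a hypothetical stand-in that applies only the quoted character class, and the real `strip_terminal_escapes()` may do more. It shows what happens when Windows-1252 bytes are decoded as Latin-1 before reaching the sanitizer.

```python
import re

# Local copy of the character class quoted in this report.
CONTROL_CHAR_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def strip_control_chars(text: str) -> str:
    """Hypothetical stand-in: remove every match of the quoted class."""
    return CONTROL_CHAR_RE.sub("", text)


# Bytes 0x93 and 0x94 are curly double quotes in Windows-1252.
raw = b"\x93quoted\x94"
as_cp1252 = raw.decode("cp1252")

# Decoding the same bytes as Latin-1 yields U+0093 and U+0094 instead.
as_latin1 = raw.decode("latin-1")

# Before stripping, the mis-decoded text can still be repaired.
repaired = as_latin1.encode("latin-1").decode("cp1252")

print(repr(strip_control_chars(as_cp1252)))
print(repr(strip_control_chars(as_latin1)))
print(repr(strip_control_chars("Hello\x85World")))
```

Once the C1 code points are removed, the round trip through Latin-1 can no longer recover the original quotation marks.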

Code location:

# src/cleveragents/cli/output/_renderers.py, line 56
_CONTROL_CHAR_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")

Recommended fix:
The regex should be narrowed to only strip actual problematic control characters. The \x7f-\x9f range should be reconsidered:

  • Keep \x7f (DEL) — genuinely a control character
  • For \x80-\x9f: These are C1 control codes that are almost never intentionally present in user text, but stripping them silently is surprising behavior. The docstring should at minimum document this behavior explicitly, and ideally the range should be limited to only the most dangerous C1 codes (e.g., CSI at U+009B which is already handled by the ANSI escape regex).

i18n impact: While U+0080–U+009F are rarely used in practice, the silent stripping behavior is undocumented and could affect users working with legacy text data or certain specialized Unicode content. The behavior should be explicit and documented.
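The two directions in the recommended fix could look roughly like the sketch below. Both alternative patterns are hypothetical and are not taken from the project. The "selective" option assumes that the C1 codes worth stripping are the ECMA-48 string and sequence delimiters (DCS, SOS, CSI, ST, OSC, PM, APC); that selection is an illustrative choice and not a decision recorded in this issue.

```python
import re

# Character class as quoted in this report (current behavior).
CURRENT_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")

# Hypothetical option 1: C0 controls and DEL only; all of U+0080-U+009F kept.
C0_DEL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Hypothetical option 2: also strip selected C1 codes.
# U+0090 DCS, U+0098 SOS, U+009B CSI, U+009C ST, U+009D OSC, U+009E PM, U+009F APC.
SELECTIVE_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\x90\x98\x9b-\x9f]")

# Sample containing U+0085 (NEL), U+009B (CSI) and U+007F (DEL).
sample = "caf\u00e9\x85 ok\x9b\x7f"

for name, pattern in [
    ("current", CURRENT_RE),
    ("c0+del", C0_DEL_RE),
    ("selective", SELECTIVE_RE),
]:
    print(f"{name:10} {pattern.sub('', sample)!r}")
```

Option 1 is the simplest to document. Option 2 keeps a defense against 8-bit sequence introducers while leaving codes such as NEL in place.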

Subtasks

  • Review whether \x80-\x9f stripping is intentional and document the rationale in the code
  • If intentional, add a docstring note explaining that C1 control codes are stripped
  • If not intentional, narrow the regex to \x7f only. The alternative \x7f\x80-\x9b\x9d-\x9f omits only U+009C as written; U+009B still falls inside \x80-\x9b, so the class needs adjusting if both U+009B and U+009C are meant to be left to the ANSI regex
  • Add a BDD test scenario verifying that user content with characters in U+0080–U+009F is handled predictably (either preserved or stripped with documentation)
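A predictability check for the last subtask might be sketched as follows. This is plain Python, not tied to any BDD framework, and every name in it (`c1_policy`, `strip_current`) is hypothetical. In a real scenario the function under test would be the project's `strip_terminal_escapes()`; a local stand-in built from the quoted regex is used here so the sketch runs on its own.

```python
import re
from typing import Callable

# All code points in the range under discussion.
C1_RANGE = [chr(cp) for cp in range(0x80, 0xA0)]


def c1_policy(strip: Callable[[str], str]) -> str:
    """Classify how a sanitizer treats U+0080-U+009F as a whole."""
    kept = sum(1 for ch in C1_RANGE if strip(f"a{ch}b") == f"a{ch}b")
    dropped = sum(1 for ch in C1_RANGE if strip(f"a{ch}b") == "ab")
    if kept == len(C1_RANGE):
        return "preserved"
    if dropped == len(C1_RANGE):
        return "stripped"
    return "mixed"


# Stand-in sanitizer built from the character class quoted in this report.
_CURRENT_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def strip_current(text: str) -> str:
    return _CURRENT_RE.sub("", text)


print(c1_policy(strip_current))
```

A test can then assert whichever policy the team documents, and a "mixed" result flags a partially narrowed regex that needs its own explanation.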

Definition of Done

  • The behavior of strip_terminal_escapes() for U+0080–U+009F characters is explicitly documented
  • Either the characters are preserved (preferred for i18n correctness) or their stripping is clearly documented as intentional
  • Existing tests pass

Backlog note: This issue was discovered during autonomous operation
on milestone (active). It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: UAT Testing | Agent: ca-new-issue-creator

HAL9000 added this to the v3.5.0 milestone 2026-04-09 03:11:06 +00:00
Blocks
#399 Epic: Post-MVP Server & Clients
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#4090