Fix flaky database_integration robot suite on CI #387

Closed
opened 2026-02-23 16:58:12 +00:00 by freemo · 1 comment
Owner

Parent

Epic: #362 (Security & Safety Hardening)

Problem

The Database Integration robot suite intermittently fails on the CI server with an empty error message:

Database Can Be Initialized :: Verify database initialization works   | FAIL |
Python script execution failed:

Under heavy parallel CI load (N pabot workers, each spawning Python subprocesses via Run Python Script), a subprocess can be killed under memory pressure (OOM) or CPU starvation before its output buffer is flushed. The subprocess then exits with a non-zero return code and empty stdout, so the failure message carries no diagnostic information.

The test passes on re-run without code changes, confirming the non-deterministic nature.
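
The failure signature described above can be reproduced in isolation. The following is a minimal sketch for POSIX systems, not code from the repository: the child script and its self-inflicted SIGKILL are hypothetical stand-ins for a process killed by the OOM killer. Because the child's stdout is a pipe, its output is block-buffered and is lost when the process is killed.

```python
import os
import signal
import subprocess
import sys

# Hypothetical child script: writes to stdout without flushing, then is
# killed by SIGKILL. The self-kill stands in for the OOM killer.
CHILD = (
    "import os, signal\n"
    "print('database initialized')\n"
    "os.kill(os.getpid(), signal.SIGKILL)\n"
)

# Drop PYTHONUNBUFFERED so the child uses its default (block) buffering
# when stdout is a pipe.
env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}

result = subprocess.run(
    [sys.executable, "-c", CHILD],
    capture_output=True,  # stdout becomes a pipe, so print() is block-buffered
    text=True,
    timeout=60,
    env=env,
)

# On POSIX, a negative return code means "terminated by that signal number".
print(f"rc={result.returncode} stdout={result.stdout!r}")
```

A negative return code paired with empty stdout is the combination that the original failure message hid, since it printed neither.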

Root Cause

The Run Python Script keyword in robot/database_integration.robot had several issues:

  1. NamedTemporaryFile fd leak — file object not explicitly closed before Create File writes to it
  2. 30s timeout too tight — insufficient headroom under parallel CI resource contention
  3. No retry mechanism — transient signal-killed processes cause immediate test failure
  4. Poor error reporting — return code not included in failure message

Fix

Applied in PR #386:

  • Replace NamedTemporaryFile with mkstemp + explicit os.close(fd)
  • Increase subprocess timeout from 30s to 60s
  • Retry once when rc != 0 and stdout is empty (characteristic of OOM/SIGKILL)
  • Include return code in failure message for future diagnosis
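
The four changes above can be sketched together in plain Python. This is an illustration under stated assumptions, not the Robot Framework keyword from PR #386: the function name `run_python_script`, its parameters, and the exact message format are hypothetical, and the real keyword may treat stderr differently.

```python
import os
import subprocess
import sys
import tempfile


def run_python_script(code, timeout=60, retries=1):
    """Run `code` in a fresh interpreter; retry once if it dies without output."""
    # mkstemp returns an already-open descriptor. Close it right away so no
    # descriptor is left open while another writer fills the file.
    fd, path = tempfile.mkstemp(suffix=".py")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(code)

        for attempt in range(retries + 1):
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout,  # 60s default, up from 30s
            )
            if result.returncode == 0:
                return result.stdout
            # Non-zero rc with empty stdout is the signature of a process
            # killed before it could flush; only that case is retried.
            died_silently = result.stdout == ""
            if not (died_silently and attempt < retries):
                break

        # Always report the return code so a signal kill is identifiable.
        detail = result.stderr.strip() or result.stdout.strip()
        raise RuntimeError(
            f"Python script execution failed (rc={result.returncode}): {detail}"
        )
    finally:
        os.remove(path)
```

Note that the retry condition as stated (`rc != 0` and empty stdout) also matches a script that fails with only a traceback on stderr, so such a script runs twice before the error is raised. The included return code distinguishes the two cases: a signal kill shows up as a negative value on POSIX.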

Verification

  • nox -e lint — PASS
  • nox -e typecheck — PASS
  • nox -e integration_tests — Database Integration suite passes (both instances, 218-223s)
freemo added this to the v3.3.0 milestone 2026-02-23 16:59:35 +00:00
Author
Owner

Fixed by PR #386 (ci-int-fix branch).

Changes in robot/database_integration.robot:

  • mkstemp instead of NamedTemporaryFile (fd leak fix)
  • Timeout 30s → 60s
  • Retry once on empty-output failure (signal-killed process)
  • Include rc in error messages
Blocks
#362 Epic: Security & Safety Hardening
cleveragents/cleveragents-core
Reference
cleveragents/cleveragents-core#387