Proposal: fix bug-hunter pool supervisor — add explicit null-session completion detection to break zombie monitoring loop #5636

Open
opened 2026-04-09 08:00:39 +00:00 by HAL9000 · 0 comments
Owner

Agent Improvement Proposal

Pattern Detected

Type: workflow_fix
Affected Agent: bug-hunter
Evidence: Issue #5602 (system-watchdog alert, cycle 189) and issue #5576/#5623 (bug detection pool status cycles 180 and 210) confirm the pool supervisor has been stuck in a zombie monitoring loop for 35+ minutes with 0 modules scanned and 0 findings filed.

Detailed Evidence

From issue #5602 (watchdog alert):

"The bug-hunter pool supervisor is stuck in a zombie monitoring loop:

  • Current cycle: 189 (running for ~30+ minutes)
  • Pool reports: 8 workers 'In Progress' scanning modules
  • Actual worker status: ALL workers have null status (completed/errored)
  • Pool behavior: Checking worker sessions every 10 seconds, always seeing 'busy'"

From issue #5623 (cycle 210 status):

All 8 workers still show "In Progress" with 35min duration, 0 findings, 0 modules completed.

The root cause is in the pool supervision loop (lines 283-306 of bug-hunter.md):

# ── Step 4: Monitor workers, collect results, refill slots ───
while active:
    bash("sleep 10", timeout=30000)
    STATUS = bash("curl -s ${SERVER}/session/status", timeout=30000)

    for module, session_id in list(active.items()):
        if session is completed or errored:   # ← AMBIGUOUS PSEUDOCODE

The pseudocode if session is completed or errored: does not explicitly tell the agent what to do when a session returns null from the status API. When a session completes and is deleted, the OpenCode Server API returns null for that session ID. The agent interprets null as "session not found = still busy" instead of "session not found = completed/deleted."

The watchdog confirms: worker sessions ses_28f31e250ffe..., ses_28f31848cffe..., etc. all return null but the pool keeps treating them as "In Progress."

Proposed Change

Update the pool supervision loop in bug-hunter.md to explicitly handle the null session status as a completion signal. The pseudocode should be clarified to:

# ── Step 4: Monitor workers, collect results, refill slots ───
remaining_unscanned = unscanned[N:]
while active:
    bash("sleep 10", timeout=30000)
    
    for module, session_id in list(active.items()):
        # Query individual session status
        session_data = bash("curl -s ${SERVER}/session/${session_id}", timeout=30000)
        
        # CRITICAL: null response means session completed/deleted — treat as DONE
        # A session that no longer exists has finished its work.
        if session_data is null or session_data == "null" or session_data == "":
            # Session completed (deleted after finishing)
            scanned_modules.add(module)
            del active[module]
            # Refill slot
            if remaining_unscanned:
                next_module = remaining_unscanned.pop(0)
                NEW_SID = create session + prompt_async for next_module
                active[next_module] = NEW_SID
            continue
        
        # Check explicit status field
        status = parse status from session_data
        if status in ("completed", "error", "cancelled"):
            # Session finished with explicit status
            scanned_modules.add(module)
            del active[module]
            if remaining_unscanned:
                next_module = remaining_unscanned.pop(0)
                NEW_SID = create session + prompt_async for next_module
                active[next_module] = NEW_SID

Additionally, add a maximum worker timeout (e.g., 45 minutes) after which a worker is considered stuck and forcibly removed from the active set, preventing infinite zombie loops even if the null-detection logic fails.

Expected Impact

  • Breaks the zombie loop: Pool supervisor will correctly detect completed sessions and move on to scan new modules
  • Restores bug detection: Workers will be dispatched to new modules after previous ones complete
  • Prevents recurrence: Explicit null-check + timeout guard prevents future zombie loops
  • Unblocks findings: Bug hunter will resume filing issues for discovered bugs

Risk Assessment

  • Low risk: This is a pure logic fix — adding explicit null-check and timeout. No behavioral change when sessions are genuinely running.
  • Potential concern: If a worker session is legitimately slow (>45 min), it may be prematurely marked as done. Mitigation: 45 minutes is generous for a single-module scan; the worker will still complete its work and file issues even if the supervisor moves on.
  • Potential concern: The supervisor may dispatch duplicate workers to the same module if a slow worker is timed out. Mitigation: The scanned_modules set prevents re-scanning, and duplicate issue detection in the worker prevents duplicate bug reports.

This is a proposal from the agent evolver. A human must approve this issue before the change will be implemented. To approve: remove the needs feedback label, add State/Verified, or comment with approval.


Automated by CleverAgents Bot
Supervisor: Agent Evolver | Agent: agent-evolver

## Agent Improvement Proposal ### Pattern Detected **Type**: workflow_fix **Affected Agent**: `bug-hunter` **Evidence**: Issue #5602 (system-watchdog alert, cycle 189) and issue #5576/#5623 (bug detection pool status cycles 180 and 210) confirm the pool supervisor has been stuck in a zombie monitoring loop for 35+ minutes with 0 modules scanned and 0 findings filed. ### Detailed Evidence From issue #5602 (watchdog alert): > "The bug-hunter pool supervisor is stuck in a zombie monitoring loop: > - Current cycle: 189 (running for ~30+ minutes) > - Pool reports: 8 workers 'In Progress' scanning modules > - Actual worker status: ALL workers have null status (completed/errored) > - Pool behavior: Checking worker sessions every 10 seconds, always seeing 'busy'" From issue #5623 (cycle 210 status): > All 8 workers still show "In Progress" with 35min duration, 0 findings, 0 modules completed. The root cause is in the pool supervision loop (lines 283-306 of bug-hunter.md): ```python # ── Step 4: Monitor workers, collect results, refill slots ─── while active: bash("sleep 10", timeout=30000) STATUS = bash("curl -s ${SERVER}/session/status", timeout=30000) for module, session_id in list(active.items()): if session is completed or errored: # ← AMBIGUOUS PSEUDOCODE ``` The pseudocode `if session is completed or errored:` does not explicitly tell the agent what to do when a session returns `null` from the status API. When a session completes and is deleted, the OpenCode Server API returns `null` for that session ID. The agent interprets `null` as "session not found = still busy" instead of "session not found = completed/deleted." The watchdog confirms: worker sessions `ses_28f31e250ffe...`, `ses_28f31848cffe...`, etc. all return `null` but the pool keeps treating them as "In Progress." ### Proposed Change Update the pool supervision loop in `bug-hunter.md` to explicitly handle the `null` session status as a completion signal. The pseudocode should be clarified to: ```python # ── Step 4: Monitor workers, collect results, refill slots ─── remaining_unscanned = unscanned[N:] while active: bash("sleep 10", timeout=30000) for module, session_id in list(active.items()): # Query individual session status session_data = bash("curl -s ${SERVER}/session/${session_id}", timeout=30000) # CRITICAL: null response means session completed/deleted — treat as DONE # A session that no longer exists has finished its work. if session_data is null or session_data == "null" or session_data == "": # Session completed (deleted after finishing) scanned_modules.add(module) del active[module] # Refill slot if remaining_unscanned: next_module = remaining_unscanned.pop(0) NEW_SID = create session + prompt_async for next_module active[next_module] = NEW_SID continue # Check explicit status field status = parse status from session_data if status in ("completed", "error", "cancelled"): # Session finished with explicit status scanned_modules.add(module) del active[module] if remaining_unscanned: next_module = remaining_unscanned.pop(0) NEW_SID = create session + prompt_async for next_module active[next_module] = NEW_SID ``` Additionally, add a **maximum worker timeout** (e.g., 45 minutes) after which a worker is considered stuck and forcibly removed from the active set, preventing infinite zombie loops even if the null-detection logic fails. ### Expected Impact - **Breaks the zombie loop**: Pool supervisor will correctly detect completed sessions and move on to scan new modules - **Restores bug detection**: Workers will be dispatched to new modules after previous ones complete - **Prevents recurrence**: Explicit null-check + timeout guard prevents future zombie loops - **Unblocks findings**: Bug hunter will resume filing issues for discovered bugs ### Risk Assessment - **Low risk**: This is a pure logic fix — adding explicit null-check and timeout. No behavioral change when sessions are genuinely running. - **Potential concern**: If a worker session is legitimately slow (>45 min), it may be prematurely marked as done. Mitigation: 45 minutes is generous for a single-module scan; the worker will still complete its work and file issues even if the supervisor moves on. - **Potential concern**: The supervisor may dispatch duplicate workers to the same module if a slow worker is timed out. Mitigation: The `scanned_modules` set prevents re-scanning, and duplicate issue detection in the worker prevents duplicate bug reports. --- *This is a proposal from the agent evolver. A human must approve this issue before the change will be implemented. To approve: remove the `needs feedback` label, add `State/Verified`, or comment with approval.* --- **Automated by CleverAgents Bot** Supervisor: Agent Evolver | Agent: agent-evolver
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#5636
No description provided.