BUG-HUNT: [resource] AsyncWorker._install_signal_handlers() clobbers previous handlers when multiple workers start — on stop, a stale signal handler from another worker instance is permanently restored #6832

Open
opened 2026-04-10 02:27:46 +00:00 by HAL9000 · 2 comments
Owner

Bug Report: [resource] — Multiple AsyncWorker instances corrupt signal handler chain

Severity Assessment

  • Impact: When multiple AsyncWorker instances are started, stopping one leaves a stale signal handler from another worker permanently active; SIGTERM/SIGINT may not trigger graceful shutdown, or may trigger shutdown of the wrong worker, leading to orphaned background threads and resource leaks
  • Likelihood: Medium — Multi-worker deployments (e.g., A2A server mode with async.max_workers > 1 and separate worker pools) can create multiple AsyncWorker instances; also occurs during testing when workers are constructed/destroyed in rapid succession
  • Priority: High

Location

  • File: src/cleveragents/application/services/async_worker.py
  • Function/Class: AsyncWorker._install_signal_handlers(), AsyncWorker._restore_signal_handlers()
  • Lines: ~635–655

Description

AsyncWorker._install_signal_handlers() captures the current SIGTERM/SIGINT handlers and installs its own handlers. _restore_signal_handlers() reinstates the captured originals. However, if multiple AsyncWorker instances are started sequentially from the main thread, each one's "captured original" is actually the PREVIOUS worker's installed handler, not the process's original handler.

Scenario with two workers:

1. Original handler: process-level SIGTERM = default_handler
2. Worker A starts: captures original_sigterm = default_handler
   Installs: SIGTERM = worker_A_handler
3. Worker B starts: captures original_sigterm = worker_A_handler  ← NOT the real original!
   Installs: SIGTERM = worker_B_handler
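4. Worker A stops: restores SIGTERM = default_handler  ← the real original, but Worker B is still running and loses its handler
5. Worker B stops: restores SIGTERM = worker_A_handler  ← stale handler from dead Worker A!

Between steps 4 and 5, Worker B is running with no handler of its own installed, so SIGTERM falls through to default_handler and skips B's graceful shutdown. After step 5, SIGTERM is handled by worker_A_handler, which references the now-stopped Worker A's state. Sending SIGTERM to the process then:

  • Calls _shutdown_event.set() on Worker A's already-stopped event, which has no effect
  • Never shuts down a worker that is still running (with three or more workers, a stale handler can be restored while others are still active)
  • Never triggers the actual system default SIGTERM behavior

This can leave a running worker's background threads alive indefinitely after the process receives SIGTERM.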
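
The scenario above can be reproduced without the real class. The following is a minimal sketch, assuming a hypothetical `FakeWorker` stand-in that copies only the capture/restore pattern from the Evidence section; it is not the actual `AsyncWorker`, and it must run in the main thread because `signal.signal` requires that.

```python
import signal


class FakeWorker:
    """Hypothetical stand-in that mimics the per-instance capture/restore pattern."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._original_sigterm = None

    def _signal_handler(self, signum, frame) -> None:
        pass  # the real handler would set a shutdown event

    def start(self) -> None:
        # Captures whatever handler is current, which may belong to another worker
        self._original_sigterm = signal.getsignal(signal.SIGTERM)
        signal.signal(signal.SIGTERM, self._signal_handler)

    def stop(self) -> None:
        if self._original_sigterm is not None:
            signal.signal(signal.SIGTERM, self._original_sigterm)


def reproduce() -> dict:
    """Run the five-step scenario and report which handler is active afterwards."""
    process_original = signal.getsignal(signal.SIGTERM)
    a, b = FakeWorker("A"), FakeWorker("B")
    try:
        a.start()
        b.start()
        a.stop()
        after_a_stops = signal.getsignal(signal.SIGTERM)
        b.stop()
        after_b_stops = signal.getsignal(signal.SIGTERM)
    finally:
        # Put the process back the way it was so the sketch has no side effects
        signal.signal(
            signal.SIGTERM,
            process_original if process_original is not None else signal.SIG_DFL,
        )
    return {
        # Step 4: B is still running, but the process default is active again
        "default_while_b_running": after_a_stops == process_original,
        # Step 5: the handler left behind belongs to the stopped worker A
        "stale_a_handler_at_end": after_b_stops == a._signal_handler,
    }
```

Both entries of the returned dict are expected to be true when the process starts with an ordinary Python-level SIGTERM disposition.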

Evidence

# src/cleveragents/application/services/async_worker.py, lines 635-641
def _install_signal_handlers(self) -> None:
    """Install signal handlers for graceful shutdown."""
    try:
        self._original_sigint = signal.getsignal(signal.SIGINT)   # captures CURRENT handler
        self._original_sigterm = signal.getsignal(signal.SIGTERM)  # which may be another worker's!
        signal.signal(signal.SIGINT, self._signal_handler)
        signal.signal(signal.SIGTERM, self._signal_handler)
    except (OSError, ValueError):
        pass

def _restore_signal_handlers(self) -> None:
    """Restore original signal handlers."""
    try:
        if self._original_sigint is not None:
            signal.signal(signal.SIGINT, self._original_sigint)    # restores stale handler!
        if self._original_sigterm is not None:
            signal.signal(signal.SIGTERM, self._original_sigterm)  # restores stale handler!
    except (OSError, ValueError):
        pass

Expected Behavior

Signal handlers should be managed globally at the process level, not per-worker instance. When the last worker stops, the process's original signal handler should be restored. Intermediate worker stop/start cycles should not pollute the signal handler chain.

Actual Behavior

Each worker captures whatever signal handler is current at the time of start(), which may be another worker's handler. When a worker stops, it "restores" that captured handler even if it is stale. After all workers stop, the active handler is whatever the last worker to stop captured at start, which in the worst case is a dead worker's handler rather than the process's original.

Suggested Fix

Use a module-level reference counter and original handler cache:

import signal
import threading
from typing import Any

# signal.getsignal() may return a callable, SIG_DFL/SIG_IGN, or None
_SIGTERM_ORIGINAL: Any = None
_SIGINT_ORIGINAL: Any = None
_SIGNAL_REFCOUNT: int = 0
_SIGNAL_LOCK = threading.Lock()

def _install_signal_handlers(self) -> None:
    global _SIGTERM_ORIGINAL, _SIGINT_ORIGINAL, _SIGNAL_REFCOUNT
    with _SIGNAL_LOCK:
        if _SIGNAL_REFCOUNT == 0:
            # First worker: capture the real original handlers
            _SIGTERM_ORIGINAL = signal.getsignal(signal.SIGTERM)
            _SIGINT_ORIGINAL = signal.getsignal(signal.SIGINT)
        _SIGNAL_REFCOUNT += 1
        # Always install our handler (the most recently started worker wins)
        try:
            signal.signal(signal.SIGTERM, self._signal_handler)
            signal.signal(signal.SIGINT, self._signal_handler)
        except (OSError, ValueError):
            pass

def _restore_signal_handlers(self) -> None:
    global _SIGTERM_ORIGINAL, _SIGINT_ORIGINAL, _SIGNAL_REFCOUNT
    with _SIGNAL_LOCK:
        if _SIGNAL_REFCOUNT == 0:
            return  # nothing was installed; avoid a negative count
        _SIGNAL_REFCOUNT -= 1
        if _SIGNAL_REFCOUNT == 0:
            try:
                # None means the handler was not installed from Python; fall back to SIG_DFL
                signal.signal(signal.SIGTERM, _SIGTERM_ORIGINAL or signal.SIG_DFL)
                signal.signal(signal.SIGINT, _SIGINT_ORIGINAL or signal.SIG_DFL)
            except (OSError, ValueError):
                pass

Category

resource

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use the tags @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.
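
As a rough illustration of what such an expected-fail test might assert, here is a sketch using the standard library's `unittest.expectedFailure` in place of the project's `@tdd_expected_fail` tag, whose mechanics are not described here. `StandInWorker` is a hypothetical stand-in for the buggy capture/restore pattern, not the real `AsyncWorker`.

```python
import signal
import unittest


class StandInWorker:
    """Hypothetical stand-in reproducing the per-instance capture/restore pattern."""

    def _signal_handler(self, signum, frame) -> None:
        pass

    def start(self) -> None:
        self._original_sigterm = signal.getsignal(signal.SIGTERM)
        signal.signal(signal.SIGTERM, self._signal_handler)

    def stop(self) -> None:
        if self._original_sigterm is not None:
            signal.signal(signal.SIGTERM, self._original_sigterm)


class SignalHandlerChainTest(unittest.TestCase):
    def setUp(self) -> None:
        self.process_original = signal.getsignal(signal.SIGTERM)
        restore_to = (
            self.process_original
            if self.process_original is not None
            else signal.SIG_DFL
        )
        # Always put the real handler back, whatever the test leaves behind
        self.addCleanup(signal.signal, signal.SIGTERM, restore_to)

    @unittest.expectedFailure  # should start passing once the bug is fixed
    def test_original_handler_restored_after_all_workers_stop(self) -> None:
        a, b = StandInWorker(), StandInWorker()
        a.start()
        b.start()
        a.stop()
        b.stop()
        self.assertEqual(signal.getsignal(signal.SIGTERM), self.process_original)


def run_suite() -> unittest.TestResult:
    """Load and run the single test case, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SignalHandlerChainTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

While the bug exists, the run records one expected failure; after a fix, the same test would show up as an unexpected success, which is the cue to remove the marker.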


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

Author
Owner

Label compliance fix applied:

  • Added missing label(s): State/Unverified
  • Reason: Issue was missing required State/* and/or Type/* labels per CONTRIBUTING.md

Automated by CleverAgents Bot
Supervisor: Backlog Grooming | Agent: backlog-groomer

Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: High — Signal handler corruption in AsyncWorker can cause orphaned threads and resource leaks in multi-worker deployments. This is a real concurrency bug with production impact.
  • Milestone: v3.2.0 — Core infrastructure reliability must be fixed in the earliest milestone
  • Story Points: 3 — M — Requires careful signal handler chain management fix
  • MoSCoW: MoSCoW/Must have — Signal handling correctness is required for reliable operation
  • Assignee: HAL9000

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner

HAL9000 self-assigned this 2026-04-10 06:11:49 +00:00
HAL9000 added this to the v3.2.0 milestone 2026-04-10 06:11:49 +00:00
Reference
cleveragents/cleveragents-core#6832