BUG-HUNT: [data-integrity] skill_registry_service.py refresh_agent_skills removes tools before re-scan — registry empty on failure #7487

Open
opened 2026-04-10 20:47:32 +00:00 by HAL9000 · 1 comment
Owner

Bug Report: Data Integrity — refresh_agent_skills Destroys Tool Registry on Re-scan Failure

Severity Assessment

  • Impact: All agent skill tools permanently removed from the registry if re-discovery raises any exception — complete tool outage with no recovery without restart
  • Likelihood: Medium — any filesystem error, malformed skill definition, or transient failure during discovery
  • Priority: High

Location

  • File: src/cleveragents/application/services/skill_registry_service.py
  • Function: SkillRegistryService.refresh_agent_skills
  • Lines: 135–150
  • Category: data-integrity

Description

refresh_agent_skills eagerly removes all previously registered agent skill tools from the ToolRegistry before calling discover_and_register. If discover_and_register raises any exception, the previously working tools have already been unregistered and self._registered_agent_tools has been cleared, so the registry holds no agent skill tools and the old state cannot be recovered without restarting the service.

Evidence

def refresh_agent_skills(self, raw_paths, *, on_conflict="skip"):
    if self._tool_registry is not None:
        for spec in self._registered_agent_tools:
            self._tool_registry.remove(spec.name)   # ← tools removed HERE
    self._registered_agent_tools = []               # ← state cleared HERE
    self._discovery_conflicts = []

    return self.discover_and_register(raw_paths, on_conflict=on_conflict)
    # ↑ if this raises, registry is already empty — no recovery

Any error in discover_and_register (malformed YAML, permission error, missing file) leaves the system with zero registered agent tools.
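
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the project's code: ToySpec, ToyToolRegistry, and ToySkillRegistryService are hypothetical stand-ins, and the toy discover_and_register is assumed to fail as a whole when any path is malformed. Only refresh_agent_skills mirrors the order of operations shown above.

```python
class ToySpec:
    """Hypothetical tool spec; only the name attribute matters here."""
    def __init__(self, name):
        self.name = name


class ToyToolRegistry:
    """Hypothetical stand-in for ToolRegistry (name -> spec mapping)."""
    def __init__(self):
        self._tools = {}

    def register(self, name, spec):
        self._tools[name] = spec

    def remove(self, name):
        self._tools.pop(name, None)

    def names(self):
        return sorted(self._tools)


class ToySkillRegistryService:
    def __init__(self, tool_registry):
        self._tool_registry = tool_registry
        self._registered_agent_tools = []
        self._discovery_conflicts = []

    def discover_and_register(self, raw_paths, *, on_conflict="skip"):
        # Assumed behavior: validate every path first, then register.
        for path in raw_paths:
            if path.startswith("bad"):
                raise ValueError(f"malformed skill definition: {path}")
        specs = [ToySpec(path) for path in raw_paths]
        for spec in specs:
            self._tool_registry.register(spec.name, spec)
            self._registered_agent_tools.append(spec)
        return specs

    def refresh_agent_skills(self, raw_paths, *, on_conflict="skip"):
        # Same order of operations as the reported code: remove, clear, re-scan.
        if self._tool_registry is not None:
            for spec in self._registered_agent_tools:
                self._tool_registry.remove(spec.name)
        self._registered_agent_tools = []
        self._discovery_conflicts = []
        return self.discover_and_register(raw_paths, on_conflict=on_conflict)


registry = ToyToolRegistry()
service = ToySkillRegistryService(registry)
service.discover_and_register(["search", "summarize"])
before = registry.names()

try:
    service.refresh_agent_skills(["search", "bad_skill"])
except ValueError:
    pass  # the caller sees the error, but the old tools are already gone

after = registry.names()
print(before)  # ['search', 'summarize']
print(after)   # []
```

In this toy setup a single malformed entry is enough to drop both previously working tools.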

Expected Behavior

If re-scan fails, the previously registered tools should remain available. "Refresh" should be atomic: swap in new tools only after successful discovery.

Actual Behavior

Tools are removed before the new scan succeeds. Any failure leaves the registry empty.

Suggested Fix

Capture old tools, attempt discovery first, then remove old tools only after success:

def refresh_agent_skills(self, raw_paths, *, on_conflict="skip"):
    old_tools = list(self._registered_agent_tools)
    old_conflicts = list(self._discovery_conflicts)
    try:
        result = self.discover_and_register(raw_paths, on_conflict=on_conflict)
    except Exception:
        # Discovery failed — keep old tools, restore state
        self._registered_agent_tools = old_tools
        self._discovery_conflicts = old_conflicts
        raise
    # Success — now remove old tools
    if self._tool_registry is not None:
        for spec in old_tools:
            try:
                self._tool_registry.remove(spec.name)
            except Exception:
                pass
    return result
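
One caveat with the fix above depends on behavior this issue does not show: if discover_and_register registers the new tools under the same names as the old ones, removing old_tools by name after a successful scan would also delete the fresh registrations. A snapshot-and-rollback variant avoids that question. The sketch below is only an illustration under stated assumptions: ToySpec and ToyToolRegistry are hypothetical, the registry is assumed to expose register(name, spec), and the toy discover_and_register registers incrementally so that a mid-scan failure leaves partial state behind.

```python
class ToySpec:
    """Hypothetical tool spec; only the name attribute matters here."""
    def __init__(self, name):
        self.name = name


class ToyToolRegistry:
    """Hypothetical stand-in for ToolRegistry (name -> spec mapping)."""
    def __init__(self):
        self._tools = {}

    def register(self, name, spec):
        self._tools[name] = spec

    def remove(self, name):
        self._tools.pop(name, None)

    def names(self):
        return sorted(self._tools)


class ToySkillRegistryService:
    def __init__(self, tool_registry):
        self._tool_registry = tool_registry
        self._registered_agent_tools = []
        self._discovery_conflicts = []

    def discover_and_register(self, raw_paths, *, on_conflict="skip"):
        # Assumed behavior: register one path at a time, so a failure
        # can leave some new tools registered and others not.
        for path in raw_paths:
            if path.startswith("bad"):
                raise ValueError(f"malformed skill definition: {path}")
            spec = ToySpec(path)
            self._tool_registry.register(spec.name, spec)
            self._registered_agent_tools.append(spec)
        return list(self._registered_agent_tools)

    def refresh_agent_skills(self, raw_paths, *, on_conflict="skip"):
        # Snapshot the state that the refresh is about to discard.
        old_tools = list(self._registered_agent_tools)
        old_conflicts = list(self._discovery_conflicts)
        if self._tool_registry is not None:
            for spec in old_tools:
                self._tool_registry.remove(spec.name)
        self._registered_agent_tools = []
        self._discovery_conflicts = []
        try:
            return self.discover_and_register(raw_paths, on_conflict=on_conflict)
        except Exception:
            # Roll back: drop partial registrations, then restore the snapshot.
            if self._tool_registry is not None:
                for spec in self._registered_agent_tools:
                    self._tool_registry.remove(spec.name)
                for spec in old_tools:
                    self._tool_registry.register(spec.name, spec)
            self._registered_agent_tools = old_tools
            self._discovery_conflicts = old_conflicts
            raise


registry = ToyToolRegistry()
service = ToySkillRegistryService(registry)
service.discover_and_register(["search", "summarize"])

try:
    service.refresh_agent_skills(["search", "bad_skill"])
except ValueError:
    pass  # the error still propagates to the caller

restored = registry.names()
print(restored)  # ['search', 'summarize']
```

This variant still has a window during which the old tools are absent from the registry, so concurrent readers would need a lock or a staged registry swap; whether that matters depends on how the real service is called.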

Category

data-integrity

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.
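
An expected-fail test of this kind might look roughly like the sketch below. The project's tagging mechanism is not shown in this issue, so unittest.expectedFailure is used here purely as a stand-in for @tdd_expected_fail. ToyToolRegistry and BuggyService are hypothetical, with discover_and_register stubbed to always fail.

```python
import unittest


class ToyToolRegistry:
    """Hypothetical stand-in for ToolRegistry (name -> spec mapping)."""
    def __init__(self):
        self._tools = {}

    def register(self, name, spec):
        self._tools[name] = spec

    def remove(self, name):
        self._tools.pop(name, None)

    def names(self):
        return sorted(self._tools)


class ToySpec:
    def __init__(self, name):
        self.name = name


class BuggyService:
    """Mirrors the remove-then-rescan order described in this issue."""
    def __init__(self, tool_registry):
        self._tool_registry = tool_registry
        self._registered_agent_tools = []
        self._discovery_conflicts = []

    def seed(self, name):
        # Test helper (hypothetical): register one tool directly.
        spec = ToySpec(name)
        self._tool_registry.register(name, spec)
        self._registered_agent_tools.append(spec)

    def discover_and_register(self, raw_paths, *, on_conflict="skip"):
        raise OSError("simulated discovery failure")

    def refresh_agent_skills(self, raw_paths, *, on_conflict="skip"):
        if self._tool_registry is not None:
            for spec in self._registered_agent_tools:
                self._tool_registry.remove(spec.name)
        self._registered_agent_tools = []
        self._discovery_conflicts = []
        return self.discover_and_register(raw_paths, on_conflict=on_conflict)


class RefreshKeepsToolsOnFailure(unittest.TestCase):
    @unittest.expectedFailure  # stand-in for the @tdd_expected_fail tag
    def test_tools_survive_failed_refresh(self):
        registry = ToyToolRegistry()
        service = BuggyService(registry)
        service.seed("search")
        with self.assertRaises(OSError):
            service.refresh_agent_skills(["ignored"])
        # Desired behavior; fails against the buggy ordering.
        self.assertEqual(registry.names(), ["search"])


def run_suite():
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(RefreshKeepsToolsOnFailure)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

Once the fix lands, the decorator would be removed so the same assertion has to pass outright.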


Automated by CleverAgents Bot
Supervisor: Bug Detection Pool | Agent: bug-hunt-pool-supervisor

HAL9000 added this to the v3.5.0 milestone 2026-04-10 21:39:20 +00:00
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: High — Concurrency/data integrity bug in autonomy hardening components that impacts M6 milestone functionality
  • Milestone: v3.5.0 (M6: Autonomy Hardening) — This component is core to autonomous execution, guardrails, and context management
  • Story Points: 3 (M) — Bug fix with clear reproduction path
  • MoSCoW: Must Have — Autonomy hardening requires correct concurrency and data integrity
  • Type: Bug

Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Reference
cleveragents/cleveragents-core#7487