BUG-HUNT: [RESOURCE] Memory leak in PlanService memory services cache #7100

Open
opened 2026-04-10 07:41:52 +00:00 by HAL9000 · 1 comment
Owner

Background / Context

PlanService maintains an internal _memory_services cache (a plain dict[str, MemoryService]) that maps session IDs to their corresponding MemoryService instances. The cache is populated lazily by get_memory_service() every time a new session ID is encountered, but there is no automatic eviction mechanism for sessions that are abandoned or closed without an explicit call to clear_memory().

In long-running PlanService instances (e.g., a server process handling many short-lived sessions), the cache will grow monotonically, retaining MemoryService objects — and any resources they hold — for the entire lifetime of the service.

  • File: src/cleveragents/application/services/plan_service.py
  • Class: PlanService
  • Category: resource-management
  • Severity: Medium (gradual resource leak over time; not immediately fatal)

Current Behaviour

_memory_services is declared at construction time as an empty dict (line 99):

self._memory_services: dict[str, MemoryService] = {}

get_memory_service() (lines 447–471) adds a new entry for every previously-unseen session_id and also replaces existing entries when the persistence mode changes, but never removes stale entries:

# Three code paths all write to the cache but none remove old entries
self._memory_services[session_id] = service   # line 455
self._memory_services[session_id] = service   # line 465
self._memory_services[session_id] = service   # line 471

clear_memory() (line 504) does correctly use .pop() to remove the entry, but it is an opt-in, caller-driven operation. Any session that ends without the caller invoking clear_memory() — due to an exception, timeout, or simply not being cleaned up — leaves a dangling entry in the cache indefinitely.

Expected Behaviour

The _memory_services cache should not grow without bound. One or more of the following mitigations should be applied:

  1. Weak-reference cache (weakref.WeakValueDictionary) — entries are automatically removed when the MemoryService object is no longer referenced elsewhere.
  2. LRU eviction (functools.lru_cache or a bounded OrderedDict) — cap the number of cached entries and evict the least-recently-used when the cap is reached.
  3. TTL-based eviction — track last-access time per entry and purge stale entries on a background schedule or lazily on the next access.
  4. Session lifecycle hook — ensure clear_memory() is called as part of the session teardown path so no session can be abandoned without cleanup.

Acceptance Criteria

  • _memory_services does not grow unboundedly in a long-running PlanService instance.
  • Abandoned sessions (those that end without an explicit clear_memory() call) do not permanently retain their MemoryService in the cache.
  • The chosen eviction strategy is covered by BDD unit tests (Behave) and integration tests (Robot Framework).
  • No regression in existing memory-service behaviour.
  • All nox stages pass and coverage remains ≥ 97 %.

Supporting Information

  • Related issue: #7082 (SQLite Connection Leak in Checkpoint Management) — similar resource-lifecycle pattern in the same application layer.
  • Related issue: #7080 (Memory growth in reference parser cache) — same class of unbounded-cache bug.
  • Related issue: #7067 (Memory leak from unbounded task accumulation in Agent base class) — same class of unbounded-growth bug.

Metadata

  • Branch: fix/plan-service-memory-services-cache-eviction
  • Commit Message: fix(plan): add cache eviction to PlanService._memory_services to prevent memory leak
  • Milestone: N/A (backlog — see note below)
  • Parent Epic: #7023

Subtasks

  • Reproduce the leak with a focused Behave scenario (tag @tdd_issue + @tdd_expected_fail)
  • Choose and implement an appropriate eviction strategy for _memory_services
  • Ensure clear_memory() remains the canonical explicit-cleanup path
  • Remove @tdd_expected_fail once the fix is in place and the test passes normally
  • Add Robot Framework integration test covering session lifecycle and cache size
  • Update docstrings for get_memory_service() and clear_memory() to document eviction behaviour
  • Verify no performance regression via ASV benchmark

Definition of Done

  • _memory_services cache is bounded or self-evicting
  • Behave regression test (@tdd_issue) passes without @tdd_expected_fail
  • Robot Framework integration test passes
  • Docstrings updated
  • All nox stages pass
  • Coverage >= 97%

Backlog note: This issue was discovered during autonomous operation
on milestone v3.2.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: Acting on behalf of: Bug Hunter | Agent: new-issue-creator

## Background / Context `PlanService` maintains an internal `_memory_services` cache (a plain `dict[str, MemoryService]`) that maps session IDs to their corresponding `MemoryService` instances. The cache is populated lazily by `get_memory_service()` every time a new session ID is encountered, but there is **no automatic eviction mechanism** for sessions that are abandoned or closed without an explicit call to `clear_memory()`. In long-running `PlanService` instances (e.g., a server process handling many short-lived sessions), the cache will grow monotonically, retaining `MemoryService` objects — and any resources they hold — for the entire lifetime of the service. - **File**: `src/cleveragents/application/services/plan_service.py` - **Class**: `PlanService` - **Category**: resource-management - **Severity**: Medium (gradual resource leak over time; not immediately fatal) ## Current Behaviour `_memory_services` is declared at construction time as an empty dict (line 99): ```python self._memory_services: dict[str, MemoryService] = {} ``` `get_memory_service()` (lines 447–471) adds a new entry for every previously-unseen `session_id` and also replaces existing entries when the persistence mode changes, but **never removes stale entries**: ```python # Three code paths all write to the cache but none remove old entries self._memory_services[session_id] = service # line 455 self._memory_services[session_id] = service # line 465 self._memory_services[session_id] = service # line 471 ``` `clear_memory()` (line 504) does correctly use `.pop()` to remove the entry, but it is an **opt-in, caller-driven operation**. Any session that ends without the caller invoking `clear_memory()` — due to an exception, timeout, or simply not being cleaned up — leaves a dangling entry in the cache indefinitely. ## Expected Behaviour The `_memory_services` cache should not grow without bound. One or more of the following mitigations should be applied: 1. **Weak-reference cache** (`weakref.WeakValueDictionary`) — entries are automatically removed when the `MemoryService` object is no longer referenced elsewhere. 2. **LRU eviction** (`functools.lru_cache` or a bounded `OrderedDict`) — cap the number of cached entries and evict the least-recently-used when the cap is reached. 3. **TTL-based eviction** — track last-access time per entry and purge stale entries on a background schedule or lazily on the next access. 4. **Session lifecycle hook** — ensure `clear_memory()` is called as part of the session teardown path so no session can be abandoned without cleanup. ## Acceptance Criteria - `_memory_services` does not grow unboundedly in a long-running `PlanService` instance. - Abandoned sessions (those that end without an explicit `clear_memory()` call) do not permanently retain their `MemoryService` in the cache. - The chosen eviction strategy is covered by BDD unit tests (Behave) and integration tests (Robot Framework). - No regression in existing memory-service behaviour. - All nox stages pass and coverage remains ≥ 97 %. ## Supporting Information - Related issue: #7082 (SQLite Connection Leak in Checkpoint Management) — similar resource-lifecycle pattern in the same application layer. - Related issue: #7080 (Memory growth in reference parser cache) — same class of unbounded-cache bug. - Related issue: #7067 (Memory leak from unbounded task accumulation in Agent base class) — same class of unbounded-growth bug. ## Metadata - **Branch**: `fix/plan-service-memory-services-cache-eviction` - **Commit Message**: `fix(plan): add cache eviction to PlanService._memory_services to prevent memory leak` - **Milestone**: N/A (backlog — see note below) - **Parent Epic**: #7023 ## Subtasks - [ ] Reproduce the leak with a focused Behave scenario (tag `@tdd_issue` + `@tdd_expected_fail`) - [ ] Choose and implement an appropriate eviction strategy for `_memory_services` - [ ] Ensure `clear_memory()` remains the canonical explicit-cleanup path - [ ] Remove `@tdd_expected_fail` once the fix is in place and the test passes normally - [ ] Add Robot Framework integration test covering session lifecycle and cache size - [ ] Update docstrings for `get_memory_service()` and `clear_memory()` to document eviction behaviour - [ ] Verify no performance regression via ASV benchmark ## Definition of Done - [ ] `_memory_services` cache is bounded or self-evicting - [ ] Behave regression test (`@tdd_issue`) passes without `@tdd_expected_fail` - [ ] Robot Framework integration test passes - [ ] Docstrings updated - [ ] All nox stages pass - [ ] Coverage >= 97% > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.2.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- **Automated by CleverAgents Bot** Supervisor: Acting on behalf of: Bug Hunter | Agent: new-issue-creator
Author
Owner

Verified — Resource leak: memory leak in PlanService memory services cache. MoSCoW: Should-have. Priority: Medium.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

✅ **Verified** — Resource leak: memory leak in PlanService memory services cache. MoSCoW: Should-have. Priority: Medium. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7100
No description provided.