BUG-HUNT: [resource] LockService creates SQLAlchemy sessions without closing them — connection pool leak across all 7 methods #6555

Open
opened 2026-04-09 21:18:54 +00:00 by HAL9000 · 0 comments
Owner

Bug Report: [resource] — LockService Creates Sessions Without session.close()

Severity Assessment

  • Impact: Database connection pool exhaustion under sustained load; stale connections with SQLite/PostgreSQL backends; eventual OperationalError: too many connections in production
  • Likelihood: High — every acquire, release, renew, release_all_for_owner, cleanup_expired call leaks a connection. count_stale_locks and is_locked are already tracked in UAT #3997 but the root cause (no session.close()) affects all methods.
  • Priority: High

Location

  • File: src/cleveragents/application/services/lock_service.py
  • Functions: acquire (~line 163), release (~line 258), renew (~line 313), release_all_for_owner (~line 385), cleanup_expired (~line 408)
  • Lines: 163–237, 258–310, 313–381, 385–407, 408–430

Description

All seven methods in LockService call self._session_factory() to obtain a SQLAlchemy session but never call session.close(), even in the finally: block (which does not exist). The methods that do have try/except blocks call session.rollback() on error, but do NOT close the session afterwards, meaning the underlying database connection is never returned to the connection pool.

# Pattern repeated in ALL 7 methods:
session = self._session_factory()
try:
    # ... database work ...
    session.commit()
    return True
except LockConflictError:
    session.rollback()
    raise
except Exception:
    session.rollback()
    raise
# ← NO finally: session.close() !

SQLAlchemy sessions obtained from a sessionmaker bind the underlying DBAPI connection for their lifetime. Without session.close(), the connection is only returned to the pool when the session object is garbage-collected — which is non-deterministic in CPython and never happens in PyPy or when reference cycles exist.

Under load (e.g. the AsyncWorker poll loop calling is_locked() every second), the connection pool quickly saturates, causing OperationalError: SQLITE_BUSY or TimeoutError depending on the backend.

Evidence

# lock_service.py ~line 163 (acquire method)
session = self._session_factory()
try:
    stmt = select(LockModel).where(...)
    existing = session.execute(stmt).scalar_one_or_none()
    ...
    session.commit()
    return True
except LockConflictError:
    session.rollback()
    raise
except Exception:
    session.rollback()
    raise
# No finally block. session.close() is never called.
# lock_service.py ~line 408 (count_stale_locks)
session = self._session_factory()
stmt = (
    select(func.count())
    .select_from(LockModel)
    .where(LockModel.expires_at < now_iso)
)
result: int = session.execute(stmt).scalar_one()
return result
# No try/except, no session.close().

Expected Behavior

Every session obtained from self._session_factory() must be explicitly closed in a finally: block so the underlying connection is returned to the pool:

session = self._session_factory()
try:
    # ... work ...
    session.commit()
    return result
except Exception:
    session.rollback()
    raise
finally:
    session.close()   # ← returns connection to pool

Or equivalently, use the session as a context manager.

Actual Behavior

Sessions are leaked. In SQLite (WAL mode), leaked reader sessions hold shared locks that prevent checkpoint operations. In production Postgres, leaked sessions exhaust the pool.

Suggested Fix

Add finally: session.close() to all 7 methods in LockService. Compare with ResourceRegistryService or CheckpointRepository in the same codebase which correctly close sessions. Note that UAT #3997 addresses missing try/except in two methods but does not address the missing session.close() in any method.

Category

resource

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

## Bug Report: [resource] — LockService Creates Sessions Without `session.close()` ### Severity Assessment - **Impact**: Database connection pool exhaustion under sustained load; stale connections with SQLite/PostgreSQL backends; eventual `OperationalError: too many connections` in production - **Likelihood**: High — every `acquire`, `release`, `renew`, `release_all_for_owner`, `cleanup_expired` call leaks a connection. `count_stale_locks` and `is_locked` are already tracked in UAT #3997 but the root cause (no `session.close()`) affects all methods. - **Priority**: High ### Location - **File**: `src/cleveragents/application/services/lock_service.py` - **Functions**: `acquire` (~line 163), `release` (~line 258), `renew` (~line 313), `release_all_for_owner` (~line 385), `cleanup_expired` (~line 408) - **Lines**: 163–237, 258–310, 313–381, 385–407, 408–430 ### Description All seven methods in `LockService` call `self._session_factory()` to obtain a SQLAlchemy session but **never call `session.close()`**, even in the `finally:` block (which does not exist). The methods that do have `try/except` blocks call `session.rollback()` on error, but do NOT close the session afterwards, meaning the underlying database connection is never returned to the connection pool. ```python # Pattern repeated in ALL 7 methods: session = self._session_factory() try: # ... database work ... session.commit() return True except LockConflictError: session.rollback() raise except Exception: session.rollback() raise # ← NO finally: session.close() ! ``` SQLAlchemy sessions obtained from a `sessionmaker` bind the underlying DBAPI connection for their lifetime. Without `session.close()`, the connection is only returned to the pool when the session object is garbage-collected — which is non-deterministic in CPython and never happens in PyPy or when reference cycles exist. Under load (e.g. the `AsyncWorker` poll loop calling `is_locked()` every second), the connection pool quickly saturates, causing `OperationalError: SQLITE_BUSY` or `TimeoutError` depending on the backend. ### Evidence ```python # lock_service.py ~line 163 (acquire method) session = self._session_factory() try: stmt = select(LockModel).where(...) existing = session.execute(stmt).scalar_one_or_none() ... session.commit() return True except LockConflictError: session.rollback() raise except Exception: session.rollback() raise # No finally block. session.close() is never called. ``` ```python # lock_service.py ~line 408 (count_stale_locks) session = self._session_factory() stmt = ( select(func.count()) .select_from(LockModel) .where(LockModel.expires_at < now_iso) ) result: int = session.execute(stmt).scalar_one() return result # No try/except, no session.close(). ``` ### Expected Behavior Every session obtained from `self._session_factory()` must be explicitly closed in a `finally:` block so the underlying connection is returned to the pool: ```python session = self._session_factory() try: # ... work ... session.commit() return result except Exception: session.rollback() raise finally: session.close() # ← returns connection to pool ``` Or equivalently, use the session as a context manager. ### Actual Behavior Sessions are leaked. In SQLite (WAL mode), leaked reader sessions hold shared locks that prevent checkpoint operations. In production Postgres, leaked sessions exhaust the pool. ### Suggested Fix Add `finally: session.close()` to all 7 methods in `LockService`. Compare with `ResourceRegistryService` or `CheckpointRepository` in the same codebase which correctly close sessions. Note that UAT #3997 addresses missing `try/except` in two methods but does **not** address the missing `session.close()` in any method. ### Category resource ### TDD Note After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: bug-hunter
HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:27:52 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#6555
No description provided.