LockService.acquire raises IntegrityError under contention instead of LockConflictError #9638

Open
opened 2026-04-15 00:58:50 +00:00 by HAL9000 · 3 comments
Owner

Metadata

  • Commit message: fix(lock-service): trap IntegrityError in acquire and raise LockConflictError
  • Branch name: fix/lock-service-integrity-error-on-contention

Background and Context

LockService.acquire inserts a new row into the locks table without guarding against concurrent inserts. The LockModel table has a UNIQUE constraint on (resource_type, resource_id). Under contention — two workers or threads racing to acquire the same lock — the second session.commit() fires the UNIQUE constraint and SQLAlchemy raises sqlalchemy.exc.IntegrityError.

The current implementation only catches LockConflictError and re-raises IntegrityError unhandled, so callers receive a raw database exception instead of the documented LockConflictError. This breaks mutual exclusion because consumers treat IntegrityError as an infrastructure failure rather than a handled lock conflict.

Affected files:

  • src/cleveragents/application/services/lock_service.pyacquire method
  • src/cleveragents/infrastructure/database/models.pyLockModel unique constraint

Expected Behavior

When two callers race to acquire the same lock, exactly one succeeds. The losing caller receives LockConflictError (populated with the existing owner) so contention can be handled gracefully by the application layer.


Acceptance Criteria

  • LockService.acquire wraps session.commit() in a try/except that catches sqlalchemy.exc.IntegrityError.
  • On IntegrityError, the session is rolled back, the existing lock row is reloaded, and LockConflictError is raised with the existing owner's details.
  • No IntegrityError escapes LockService.acquire when the failure is caused by a UNIQUE constraint violation on (resource_type, resource_id).
  • Existing LockConflictError handling paths are unaffected.
  • Unit tests cover the concurrent-insert scenario (mock session.commit() to raise IntegrityError on the second call, assert LockConflictError is raised).
  • Integration/concurrency test with two threads sharing the same session factory confirms one acquires and one raises LockConflictError.
  • All existing lock-service tests continue to pass.

Subtasks

  • Reproduce the bug with a failing test (mock or integration).
  • Wrap session.commit() in lock_service.py::acquire with except IntegrityError.
  • On catch: rollback session, query for the existing LockModel row, raise LockConflictError(existing_owner=...).
  • Verify the fix handles the edge case where the row is gone by the time we reload (retry or surface a generic error).
  • Add/update unit tests in the lock-service test suite.
  • Add a concurrency integration test (two threads, shared session factory).
  • Update docstring on acquire to document the IntegrityErrorLockConflictError translation.

Definition of Done

This issue is closed when:

  1. LockService.acquire never surfaces IntegrityError to callers for UNIQUE constraint violations.
  2. All acceptance criteria above are met and verified by CI.
  3. The fix is merged to the main branch with passing nox / test suite (coverage ≥ 97%).

Steps to Reproduce

  1. Start two worker processes (or threads with independent SQLAlchemy sessions) that share the same database and instantiate LockService with the same session_factory.
  2. Ensure the locks table has no row for resource_type="plan" / resource_id="plan-123".
  3. In parallel, call:
    • LockService.acquire(owner_id="owner-a", resource_type="plan", resource_id="plan-123")
    • LockService.acquire(owner_id="owner-b", resource_type="plan", resource_id="plan-123")
  4. One call succeeds; the other hits the UNIQUE constraint on LockModel and session.commit() raises sqlalchemy.exc.IntegrityError.

Expected: Second caller receives LockConflictError.
Actual: IntegrityError bubbles out of LockService.acquire, causing the caller to fail with a raw database exception.


Automated by CleverAgents Bot
Agent: new-issue-creator

## Metadata - **Commit message:** `fix(lock-service): trap IntegrityError in acquire and raise LockConflictError` - **Branch name:** `fix/lock-service-integrity-error-on-contention` --- ## Background and Context `LockService.acquire` inserts a new row into the locks table without guarding against concurrent inserts. The `LockModel` table has a UNIQUE constraint on `(resource_type, resource_id)`. Under contention — two workers or threads racing to acquire the same lock — the second `session.commit()` fires the UNIQUE constraint and SQLAlchemy raises `sqlalchemy.exc.IntegrityError`. The current implementation only catches `LockConflictError` and re-raises `IntegrityError` unhandled, so callers receive a raw database exception instead of the documented `LockConflictError`. This breaks mutual exclusion because consumers treat `IntegrityError` as an infrastructure failure rather than a handled lock conflict. Affected files: - `src/cleveragents/application/services/lock_service.py` — `acquire` method - `src/cleveragents/infrastructure/database/models.py` — `LockModel` unique constraint --- ## Expected Behavior When two callers race to acquire the same lock, exactly one succeeds. The losing caller receives `LockConflictError` (populated with the existing owner) so contention can be handled gracefully by the application layer. --- ## Acceptance Criteria - [ ] `LockService.acquire` wraps `session.commit()` in a try/except that catches `sqlalchemy.exc.IntegrityError`. - [ ] On `IntegrityError`, the session is rolled back, the existing lock row is reloaded, and `LockConflictError` is raised with the existing owner's details. - [ ] No `IntegrityError` escapes `LockService.acquire` when the failure is caused by a UNIQUE constraint violation on `(resource_type, resource_id)`. - [ ] Existing `LockConflictError` handling paths are unaffected. - [ ] Unit tests cover the concurrent-insert scenario (mock `session.commit()` to raise `IntegrityError` on the second call, assert `LockConflictError` is raised). - [ ] Integration/concurrency test with two threads sharing the same session factory confirms one acquires and one raises `LockConflictError`. - [ ] All existing lock-service tests continue to pass. --- ## Subtasks - [ ] Reproduce the bug with a failing test (mock or integration). - [ ] Wrap `session.commit()` in `lock_service.py::acquire` with `except IntegrityError`. - [ ] On catch: rollback session, query for the existing `LockModel` row, raise `LockConflictError(existing_owner=...)`. - [ ] Verify the fix handles the edge case where the row is gone by the time we reload (retry or surface a generic error). - [ ] Add/update unit tests in the lock-service test suite. - [ ] Add a concurrency integration test (two threads, shared session factory). - [ ] Update docstring on `acquire` to document the `IntegrityError` → `LockConflictError` translation. --- ## Definition of Done This issue is closed when: 1. `LockService.acquire` never surfaces `IntegrityError` to callers for UNIQUE constraint violations. 2. All acceptance criteria above are met and verified by CI. 3. The fix is merged to the main branch with passing `nox` / test suite (coverage ≥ 97%). --- ## Steps to Reproduce 1. Start two worker processes (or threads with independent SQLAlchemy sessions) that share the same database and instantiate `LockService` with the same `session_factory`. 2. Ensure the locks table has no row for `resource_type="plan"` / `resource_id="plan-123"`. 3. In parallel, call: - `LockService.acquire(owner_id="owner-a", resource_type="plan", resource_id="plan-123")` - `LockService.acquire(owner_id="owner-b", resource_type="plan", resource_id="plan-123")` 4. One call succeeds; the other hits the UNIQUE constraint on `LockModel` and `session.commit()` raises `sqlalchemy.exc.IntegrityError`. **Expected:** Second caller receives `LockConflictError`. **Actual:** `IntegrityError` bubbles out of `LockService.acquire`, causing the caller to fail with a raw database exception. --- **Automated by CleverAgents Bot** Agent: new-issue-creator
Author
Owner

🏷️ Triage Decision — [AUTO-OWNR-1]\n\nStatus: Verified\n\nIssue Type: Bug \nMoSCoW: Must Have — Database integrity errors are blocking \nPriority: Critical\n\nRationale: LockService raising IntegrityError under contention instead of LockConflictError is a critical bug that can cause unexpected crashes in concurrent scenarios. This is a Must Have fix as it affects system stability during parallel subplan execution (v3.3.0 core feature).\n\nLabels to apply: State/Verified, MoSCoW/Must have, Priority/Critical, Type/Bug\n\n---\nAutomated by CleverAgents Bot\nSupervisor: Project Owner | Agent: project-owner-pool-supervisor\n\n---\nAutomated by CleverAgents Bot\nAgent: automation-tracking-manager

## 🏷️ Triage Decision — [AUTO-OWNR-1]\n\n**Status:** ✅ Verified\n\n**Issue Type:** Bug \n**MoSCoW:** Must Have — Database integrity errors are blocking \n**Priority:** Critical\n\n**Rationale:** LockService raising IntegrityError under contention instead of LockConflictError is a critical bug that can cause unexpected crashes in concurrent scenarios. This is a Must Have fix as it affects system stability during parallel subplan execution (v3.3.0 core feature).\n\n**Labels to apply:** State/Verified, MoSCoW/Must have, Priority/Critical, Type/Bug\n\n---\n**Automated by CleverAgents Bot**\nSupervisor: Project Owner | Agent: project-owner-pool-supervisor\n\n---\n**Automated by CleverAgents Bot**\nAgent: automation-tracking-manager
Author
Owner
[AUTO-OWNR-1] Triage complete.

**Verified** ✅ — Valid concurrency bug. `LockService.acquire` surfaces raw `IntegrityError` to callers instead of the documented `LockConflictError` under contention. This breaks mutual exclusion semantics.

- **Type**: Bug
- **Priority**: High — breaks lock semantics, affects parallel plan execution
- **MoSCoW**: Must Have — correct locking is required for parallel subplan execution (v3.3.0 acceptance criteria)
- **Milestone**: v3.3.0 — parallel subplan execution requires correct locking

---
**Automated by CleverAgents Bot**
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Automated by CleverAgents Bot
Agent: automation-tracking-manager

``` [AUTO-OWNR-1] Triage complete. **Verified** ✅ — Valid concurrency bug. `LockService.acquire` surfaces raw `IntegrityError` to callers instead of the documented `LockConflictError` under contention. This breaks mutual exclusion semantics. - **Type**: Bug - **Priority**: High — breaks lock semantics, affects parallel plan execution - **MoSCoW**: Must Have — correct locking is required for parallel subplan execution (v3.3.0 acceptance criteria) - **Milestone**: v3.3.0 — parallel subplan execution requires correct locking --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor ``` --- **Automated by CleverAgents Bot** Agent: automation-tracking-manager
Author
Owner

🔍 Triage Decision — Verified

Type: Bug / Application Services
Priority: High
MoSCoW: Must Have
Milestone: v3.2.0

This issue is verified and assigned to v3.2.0. LockService.acquire does not guard against concurrent inserts on the (resource_type, resource_id) UNIQUE constraint, causing raw sqlalchemy.exc.IntegrityError to escape to callers instead of the documented LockConflictError. This breaks the mutual exclusion contract that the lock service is designed to provide.

Rationale:

  • Correct lock semantics are foundational to the Decisions + Validations + Invariants milestone (v3.2.0). If two workers racing to acquire the same plan lock receive an unhandled database exception, the entire concurrency model is broken.
  • The fix is well-scoped: wrap session.commit() in a try/except for IntegrityError, rollback, reload the existing lock row, and raise LockConflictError.
  • Classified Must Have for v3.2.0 — lock correctness is a hard requirement for the milestone's acceptance criteria.

Next steps: Implement the IntegrityErrorLockConflictError translation in lock_service.py::acquire, add unit and concurrency integration tests, and update the docstring.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

## 🔍 Triage Decision — Verified ✅ **Type:** Bug / Application Services **Priority:** High **MoSCoW:** Must Have **Milestone:** v3.2.0 This issue is **verified** and assigned to **v3.2.0**. `LockService.acquire` does not guard against concurrent inserts on the `(resource_type, resource_id)` UNIQUE constraint, causing raw `sqlalchemy.exc.IntegrityError` to escape to callers instead of the documented `LockConflictError`. This breaks the mutual exclusion contract that the lock service is designed to provide. **Rationale:** - Correct lock semantics are foundational to the Decisions + Validations + Invariants milestone (v3.2.0). If two workers racing to acquire the same plan lock receive an unhandled database exception, the entire concurrency model is broken. - The fix is well-scoped: wrap `session.commit()` in a try/except for `IntegrityError`, rollback, reload the existing lock row, and raise `LockConflictError`. - Classified **Must Have** for v3.2.0 — lock correctness is a hard requirement for the milestone's acceptance criteria. **Next steps:** Implement the `IntegrityError` → `LockConflictError` translation in `lock_service.py::acquire`, add unit and concurrency integration tests, and update the docstring. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
HAL9000 added this to the v3.2.0 milestone 2026-04-16 08:31:24 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#9638
No description provided.