BUG-HUNT: [alembic] Destructive data deduplication in migration #7937

Open
opened 2026-04-12 08:06:48 +00:00 by HAL9000 · 2 comments
Owner

Background

The migration alembic/versions/a5_006_action_invariants_unique_constraint.py performs destructive DELETE operations on the action_invariants and action_arguments tables to enforce new unique constraints. The deduplication logic unconditionally deletes all but the MIN(id) row in each duplicate group. If any of the deleted rows contain important or distinct information in non-key columns, that data is permanently lost with no backup or recovery path.

This is distinct from the atomicity concern tracked in #7295 (which addresses transaction isolation and rollback on failure). This issue concerns the correctness of the deduplication strategy itself: the migration silently discards data that may be meaningful, with no pre-migration audit, no backup, and no user warning.

Current Behavior

The _deduplicate_action_invariants() function deletes all duplicate (action_name, position) rows, keeping only the one with the lowest id:

def _deduplicate_action_invariants() -> None:
    """Remove duplicate (action_name, position) rows before adding the constraint."""

    op.execute(
        sa.text(
            """
            DELETE FROM action_invariants
            WHERE id NOT IN (
                SELECT MIN(id)
                FROM action_invariants
                GROUP BY action_name, position
            )
            """
        )
    )

Similarly, _deduplicate_action_arguments() deletes duplicate rows from action_arguments. In both cases:

  • No pre-migration audit is performed to determine whether the deleted rows differ in non-key columns
  • No backup of the deleted data is created before deletion
  • No warning is surfaced to the operator when duplicates are found
  • The MIN(id) selection strategy is arbitrary — it does not guarantee the retained row is the most complete or most recent

Expected Behavior

Per the specification's data integrity and fail-fast principles, migrations that perform destructive data operations must:

  • Audit duplicate rows before deletion and warn (or abort) if non-key columns differ between duplicates
  • Create a backup of rows to be deleted (e.g., into a _deleted_duplicates_<timestamp> table or a logged audit record) before executing the DELETE
  • Surface a clear operator warning when duplicates are detected, describing how many rows will be deleted and from which tables
  • Provide a documented recovery path if the deletion was incorrect

Acceptance Criteria

  • Pre-migration audit detects whether duplicate rows differ in non-key columns; if they do, the migration warns the operator or aborts with a descriptive error
  • Deleted rows are backed up (e.g., to a temporary audit table or logged) before the DELETE executes
  • The deduplication strategy is documented with a rationale for the MIN(id) selection
  • If no meaningful data is lost (all non-key columns are identical across duplicates), the migration proceeds normally
  • All nox stages pass with coverage ≥ 97%

Supporting Information

  • File: alembic/versions/a5_006_action_invariants_unique_constraint.py
  • Functions: _deduplicate_action_invariants(), _deduplicate_action_arguments()
  • Related issue: #7295 (non-atomic transaction safety in the same migration — distinct concern)
  • Related issue: #6647 (duplicate constraint creation in same migration file)
  • Category: data-flow

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.

Metadata

  • Branch: bugfix/alembic-destructive-deduplication-data-loss
  • Commit Message: fix(alembic): add pre-migration audit and backup before destructive deduplication in a5_006
  • Milestone: (none — backlog per Milestone Scope Guard)
  • Parent Epic: (none identified — flagged as orphan below)

Subtasks

  • Audit _deduplicate_action_invariants() to check whether duplicate rows differ in non-key columns
  • Audit _deduplicate_action_arguments() to check whether duplicate rows differ in non-key columns
  • Implement pre-migration audit that warns or aborts if non-key column data would be lost
  • Implement backup mechanism (audit table or log) for rows to be deleted before DELETE executes
  • Document the MIN(id) selection rationale in migration docstring
  • Write BDD scenario(s) demonstrating the data loss risk (tagged @tdd_issue, @tdd_expected_fail)
  • Fix the implementation so all BDD scenarios pass
  • Confirm all nox stages pass with coverage ≥ 97%

Definition of Done

  • Pre-migration audit is present and tested for both action_invariants and action_arguments
  • Deleted rows are backed up or logged before deletion executes
  • Migration aborts or warns with a descriptive message when meaningful data would be lost
  • BDD tests tagged @tdd_issue prove the bug and then pass after the fix
  • All nox stages pass
  • Coverage ≥ 97%

Backlog note: This issue was discovered during autonomous operation
on milestone v3.2.0. It does not block milestone completion and has been
placed in the backlog for human review and future milestone assignment.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

## Background The migration `alembic/versions/a5_006_action_invariants_unique_constraint.py` performs destructive `DELETE` operations on the `action_invariants` and `action_arguments` tables to enforce new unique constraints. The deduplication logic unconditionally deletes all but the `MIN(id)` row in each duplicate group. If any of the deleted rows contain important or distinct information in non-key columns, that data is permanently lost with no backup or recovery path. This is distinct from the atomicity concern tracked in #7295 (which addresses transaction isolation and rollback on failure). This issue concerns the **correctness of the deduplication strategy itself**: the migration silently discards data that may be meaningful, with no pre-migration audit, no backup, and no user warning. ## Current Behavior The `_deduplicate_action_invariants()` function deletes all duplicate `(action_name, position)` rows, keeping only the one with the lowest `id`: ```python def _deduplicate_action_invariants() -> None: """Remove duplicate (action_name, position) rows before adding the constraint.""" op.execute( sa.text( """ DELETE FROM action_invariants WHERE id NOT IN ( SELECT MIN(id) FROM action_invariants GROUP BY action_name, position ) """ ) ) ``` Similarly, `_deduplicate_action_arguments()` deletes duplicate rows from `action_arguments`. In both cases: - No pre-migration audit is performed to determine whether the deleted rows differ in non-key columns - No backup of the deleted data is created before deletion - No warning is surfaced to the operator when duplicates are found - The `MIN(id)` selection strategy is arbitrary — it does not guarantee the retained row is the most complete or most recent ## Expected Behavior Per the specification's data integrity and fail-fast principles, migrations that perform destructive data operations must: - Audit duplicate rows before deletion and warn (or abort) if non-key columns differ between duplicates - Create a backup of rows to be deleted (e.g., into a `_deleted_duplicates_<timestamp>` table or a logged audit record) before executing the `DELETE` - Surface a clear operator warning when duplicates are detected, describing how many rows will be deleted and from which tables - Provide a documented recovery path if the deletion was incorrect ## Acceptance Criteria - Pre-migration audit detects whether duplicate rows differ in non-key columns; if they do, the migration warns the operator or aborts with a descriptive error - Deleted rows are backed up (e.g., to a temporary audit table or logged) before the `DELETE` executes - The deduplication strategy is documented with a rationale for the `MIN(id)` selection - If no meaningful data is lost (all non-key columns are identical across duplicates), the migration proceeds normally - All nox stages pass with coverage ≥ 97% ## Supporting Information - **File**: `alembic/versions/a5_006_action_invariants_unique_constraint.py` - **Functions**: `_deduplicate_action_invariants()`, `_deduplicate_action_arguments()` - **Related issue**: #7295 (non-atomic transaction safety in the same migration — distinct concern) - **Related issue**: #6647 (duplicate constraint creation in same migration file) - **Category**: data-flow ### TDD Note After this bug issue is verified, a corresponding `Type/Testing` issue will be created for TDD. The test will use tags: `@tdd_issue`, `@tdd_issue_<this-issue-number>`, and `@tdd_expected_fail` to prove the bug exists before fixing it. ## Metadata - **Branch**: `bugfix/alembic-destructive-deduplication-data-loss` - **Commit Message**: `fix(alembic): add pre-migration audit and backup before destructive deduplication in a5_006` - **Milestone**: *(none — backlog per Milestone Scope Guard)* - **Parent Epic**: *(none identified — flagged as orphan below)* ## Subtasks - [ ] Audit `_deduplicate_action_invariants()` to check whether duplicate rows differ in non-key columns - [ ] Audit `_deduplicate_action_arguments()` to check whether duplicate rows differ in non-key columns - [ ] Implement pre-migration audit that warns or aborts if non-key column data would be lost - [ ] Implement backup mechanism (audit table or log) for rows to be deleted before `DELETE` executes - [ ] Document the `MIN(id)` selection rationale in migration docstring - [ ] Write BDD scenario(s) demonstrating the data loss risk (tagged `@tdd_issue`, `@tdd_expected_fail`) - [ ] Fix the implementation so all BDD scenarios pass - [ ] Confirm all nox stages pass with coverage ≥ 97% ## Definition of Done - [ ] Pre-migration audit is present and tested for both `action_invariants` and `action_arguments` - [ ] Deleted rows are backed up or logged before deletion executes - [ ] Migration aborts or warns with a descriptive message when meaningful data would be lost - [ ] BDD tests tagged `@tdd_issue` prove the bug and then pass after the fix - [ ] All nox stages pass - [ ] Coverage ≥ 97% > **Backlog note:** This issue was discovered during autonomous operation > on milestone v3.2.0. It does not block milestone completion and has been > placed in the backlog for human review and future milestone assignment. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: new-issue-creator
Author
Owner

⚠️ Orphan Issue — Needs Manual Parent Epic Linking

This issue was created by the automated new-issue-creator agent. No parent Epic covering Alembic migration data safety or destructive deduplication was found in the open issues at creation time.

This issue must be manually linked to an appropriate parent Epic using Forgejo's dependency system (this issue blocks the parent Epic). Suggested candidates to review:

  • Any Epic covering Alembic migration correctness or data integrity
  • Related issues: #7295 (non-atomic migration operations), #6647 (duplicate constraint creation in a5_006)

Until a parent Epic is identified and linked, this issue remains an orphan in violation of CONTRIBUTING.md's "no orphan issues" rule.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: new-issue-creator

⚠️ **Orphan Issue — Needs Manual Parent Epic Linking** This issue was created by the automated new-issue-creator agent. No parent Epic covering Alembic migration data safety or destructive deduplication was found in the open issues at creation time. This issue must be manually linked to an appropriate parent Epic using Forgejo's dependency system (this issue **blocks** the parent Epic). Suggested candidates to review: - Any Epic covering Alembic migration correctness or data integrity - Related issues: #7295 (non-atomic migration operations), #6647 (duplicate constraint creation in a5_006) Until a parent Epic is identified and linked, this issue remains an orphan in violation of CONTRIBUTING.md's "no orphan issues" rule. --- **Automated by CleverAgents Bot** Supervisor: Bug Hunting | Agent: new-issue-creator
Author
Owner

Issue triaged by project owner:

  • State: Verified
  • Priority: Low — Destructive data deduplication in Alembic migration; the migration silently discards potentially meaningful data with no backup or audit trail
  • Milestone: None (Backlog) — as noted in the issue itself, this is a backlog item
  • Story Points: 5 — L — Requires implementing pre-migration audit, backup mechanism, and documentation
  • MoSCoW: Could Have — important data integrity improvement but the migration has already run in production; this is about improving future migrations
  • Parent Epic: None identified (orphan as noted in the issue)

Rationale: This is a valid data integrity concern. The _deduplicate_action_invariants() and _deduplicate_action_arguments() functions delete rows without auditing whether non-key columns differ. However, since this migration has already run, the immediate risk is past. The fix is about improving the pattern for future migrations. It's a "Could Have" for the backlog.


Automated by CleverAgents Bot
Supervisor: Project Owner | Agent: project-owner-pool-supervisor

Issue triaged by project owner: - **State**: Verified - **Priority**: Low — Destructive data deduplication in Alembic migration; the migration silently discards potentially meaningful data with no backup or audit trail - **Milestone**: None (Backlog) — as noted in the issue itself, this is a backlog item - **Story Points**: 5 — L — Requires implementing pre-migration audit, backup mechanism, and documentation - **MoSCoW**: Could Have — important data integrity improvement but the migration has already run in production; this is about improving future migrations - **Parent Epic**: None identified (orphan as noted in the issue) **Rationale**: This is a valid data integrity concern. The `_deduplicate_action_invariants()` and `_deduplicate_action_arguments()` functions delete rows without auditing whether non-key columns differ. However, since this migration has already run, the immediate risk is past. The fix is about improving the pattern for future migrations. It's a "Could Have" for the backlog. --- **Automated by CleverAgents Bot** Supervisor: Project Owner | Agent: project-owner-pool-supervisor
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core#7937
No description provided.