feat(acms): add PostgreSQL and Docker Compose domain analyzers #611
No reviewers
Labels
No labels
auto/needs-reevaluation
controller-managed
auto/blocked-by-deps
auto/ci-timeout
auto/claimed-implementer
auto/claimed-merge
auto/claimed-reviewer
auto/driver-down
auto/invariant-violation
auto/last-attempt-tier-0
auto/last-attempt-tier-1
auto/last-attempt-tier-2
auto/last-attempt-tier-min
Automation Tracking
auto/needs-conflict-resolution
auto/needs-implementer
auto/postmortem
auto/ready-to-merge
auto/restart-throttled
auto/revert
auto/sentinel
auto/stale-inactivity
auto/unstable
Blocked
Bounty
$100
Bounty
$1000
Bounty
$10000
Bounty
$20
Bounty
$2000
Bounty
$250
Bounty
$50
Bounty
$500
Bounty
$5000
Bounty
$750
MoSCoW
Could have
MoSCoW
Must have
MoSCoW
Should have
Needs Feedback
Points
1
Points
13
Points
2
Points
21
Points
3
Points
34
Points
5
Points
55
Points
8
Points
88
Priority
Backlog
Priority
CI Blocker
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Signed-off: Owner
Signed-off: Scrum Master
Signed-off: Tech Lead
Spike
State
Completed
State
Duplicate
State
In Progress
State
In Review
State
Paused
State
Unverified
State
Verified
State
Wont Do
Type
Automation
Type
Bug
Type
Discussion
Type
Documentation
Type
Epic
Type
Feature
Type
Legendary
Type
Refactor
Type
Support
Type
Task
Type
Testing
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
cleveragents/cleveragents-core!611
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "feature/m6-acms-domain-specific-analyzers"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Implements Phase 2 of issue #588:
PostgreSQLAnalyzerandDockerComposeAnalyzerdomain-specific analyzers for the ACMS UKO indexing pipeline.Closes #588 (Phase 2 scope — PostgreSQL + Docker Compose).
Changes
New Source Files
postgresql_analyzer.py(~594 lines) — Regex-based DDL parser extractinguko-data:Table,uko-data:Column,uko-data:ForeignKey,uko-data:View, anduko-data:Schematriples. Handles column metadata (data type, nullability, primary key),CREATE TABLE/VIEWwith optional schema qualification, and inline/out-of-lineFOREIGN KEYconstraints. RaisesValueErroron empty content; returns empty list on unparsable DDL.docker_compose_analyzer.py(~393 lines) — YAML-based parser usingyaml.safe_loadextractinguko-infra:DeploymentUnit,uko-infra:Service,uko-infra:Port,uko-infra:EnvironmentVariable, anduko-infra:connectsTotriples. Validates Docker Compose format viaservices/versionkey detection. Returns empty list on empty, unparsable, or non-Compose YAML.Test Files
features/domain_analyzers.feature— 34 BDD scenarios across 7 groups: protocol conformance (4), registry operations (6), PythonAnalyzer (5), MarkdownAnalyzer (5), PostgreSQLAnalyzer (6), DockerComposeAnalyzer (6), cross-analyzer (2).features/steps/domain_analyzer_steps.py— Step definitions usinguse_step_matcher("re")for triple-assertion steps to avoidparse-format ambiguity with existinguko_analyzers_steps.py.robot/domain_analyzers.robot+robot/helper_domain_analyzers.py— 6 Robot Framework integration smoke tests with realistic multi-object sample content.Modified Files
acms/__init__.py— AddedDockerComposeAnalyzerandPostgreSQLAnalyzerimports and__all__entries.vulture_whitelist.py— Added 2 entries for new analyzer classes.CHANGELOG.md— Added unreleased entry.Spec Compliance
Per
docs/specification.md§44210-44261 (Domain-Specific Analyzers), §42151-42272 (uko-datavocabulary), §42285-42331 (uko-infravocabulary):AnalyzerProtocol(runtime-checkable)AnalyzerRegistryby file extensionUKOTripleinstances withuko://URI scheme and confidence=1.0UKOIndexer._add_provenance())Verification
c0f22397b64d2aad1b254d2aad1b25d3fef49050Review — PR #611
feature/m6-acms-domain-specific-analyzersReviewer: brent.edwards | Round: 1 | Commit:
4d2aad1bQuality Gates
nox -s lintnox -s typecheckFindings Summary
postgresql_analyzer.pydomain_analyzers.featurepostgresql_analyzer.py:335-347, 422-443_extract_bodyand_split_entriescount parentheses without respecting SQL string literals — silent data truncationdomain_analyzer_steps.py:124# type: ignore[arg-type]suppression — CONTRIBUTING prohibits inline suppressionsdocker_compose_analyzer.py:138yaml.safe_loadis safe against arbitrary code exec but still vulnerable to billion-laughs alias expansion (quadratic memory). Consider rejecting alias nodes or adding a size guard.postgresql_analyzer.py:388-393"primary" INTEGERis stripped of quotes then matched against_NON_COLUMN_KEYWORDS, so a column literally namedprimarywould be skippedrepositories_uncovered_branches_steps.py(scoped_session removal),helper_m4_e2e_verification.pyandm4_e2e_verification.robot(3 M4 CLI E2E tests removed). These changes also appear in PR #610 and should be in their own commit or excluded from this PR.Verdict
REQUEST_CHANGES — 3 P1 findings require resolution. The two 500+ line files (F1, F2) need splitting or restructuring. The parenthesis-in-string-literal bug (F3) silently produces wrong results for real-world DDL containing
DEFAULT,CHECK, or string constants with parentheses. The out-of-scope changes (F7) should be removed from this PR.What Went Well
@ -0,0 +520,4 @@@docker_compose_analyzerScenario: Analyze dict-form volume mappingGiven a DockerComposeAnalyzer analyzerWhen I analyze the Docker Compose content:F2 [P1 · Policy] — This feature file is 523 lines, exceeding the 500-line limit. Consider splitting into separate feature files per analyzer domain (e.g.,
postgresql_analyzer.feature,docker_compose_analyzer.feature).@ -0,0 +121,4 @@def analyze(self, content: str, resource_uri: str) -> list[UKOTriple]:return [] # pragma: no covercontext.registry.register(_DuplicatePyAnalyzer()) # type: ignore[arg-type]F4 [P2 · Policy] — CONTRIBUTING.md prohibits
# type: ignoreinline suppressions. Please fix the type mismatch at the call site instead of suppressing it. If theregister()method accepts aProtocoltype, ensure_DuplicatePyAnalyzerfully satisfies it.F7 [P2 · Process] — Removing 3 M4 CLI E2E tests (
cli-plan-use,cli-plan-execute,cli-plan-tree) is out of scope for issue #588 (domain-specific analyzers). These same changes also appear in PR #610. Please remove this file from the PR or move these changes to a separate commit/PR that references the appropriate issue.@ -0,0 +135,4 @@"""if not content or not content.strip():raise ValueError("content must not be empty.")if not resource_uri or not resource_uri.strip():F5 [P2 · Security] — While
yaml.safe_loadblocks arbitrary code execution, it is still vulnerable to billion-laughs-style alias expansion that can cause quadratic memory usage with crafted YAML input (e.g., nested anchors/aliases). Since this processes user-provided Docker Compose files, consider either:if len(content) > MAX_COMPOSE_SIZE), oryaml.safe_loadthrough a wrapper that rejects alias nodes@ -0,0 +337,4 @@subject_uri=t_uri,predicate="rdf:type",object_uri="uko-data:Table",)F3 [P1 · Bug] —
_extract_bodycounts raw(and)characters without respecting SQL string literals. ADEFAULTvalue like'func(x)'or aCHECKconstraint containing'a)'will cause the depth counter to hit 0 prematurely, silently truncating the table body. The same bug affects_split_entries(line 422-443).Example that breaks:
The
'hello(world)'contains an unmatched(inside a string literal, which increments depth. The)afterworlddecrements it. Then the actual closing)on the last line sets depth to 0 too early, producing a truncated body that missescol2.Fix: track whether you're inside a single-quoted string (
'...') and skip paren counting while inside it. PostgreSQL uses''for escaping quotes inside strings.@ -0,0 +387,4 @@return content[paren_start + 1 : i]return ""def _parse_table_body(F6 [P2 · Bug] — The
first_wordextraction strips double quotes then checks against_NON_COLUMN_KEYWORDS. This means a column legitimately named"primary"or"check"(which PostgreSQL allows via quoting) will be silently skipped.Fix: only check unquoted first words against
_NON_COLUMN_KEYWORDS. If the original first token starts with", it's a quoted identifier and should never be treated as a keyword.@ -0,0 +615,4 @@subject_uri=source_col_uri,predicate="uko-data:foreignKeyTo",object_uri=target_col_uri,)F1 [P1 · Policy] — This file is 618 lines, exceeding the 500-line limit per CONTRIBUTING.md. Consider extracting helper methods (e.g.,
_extract_body,_split_entries,_parse_table_body, foreign-key extraction) into a separate_postgresql_helpers.pymodule, or splitting table-level vs. function/view parsing into separate classes.d3fef4905013618fb8d5Supplementary Findings — PR #611 (Second Pass)
Following review #2018, a thorough second pass uncovered one additional P1 finding:
F8 [P1 · Process] — CHANGELOG deletes pre-existing #495 entry
The CHANGELOG diff removes the
(#495)entry (M4 acceptance criteria validation) from the## Unreleasedsection and replaces it with the new(#588)entry. The #588 entry correctly prepends to the top of the section, but the deletion of #495 is wrong — that entry belongs to a different merged PR and must be preserved.This is the same pattern found in PR #610 (F5). The correct behavior is to prepend the new entry above the existing ones, not replace the first entry.
File:
CHANGELOG.md(diff lines 5-11 show the deletion)Updated totals: 3 P1 + 5 P2 = 8 findings (F1-F8).
F8 addressed in commit
fb96d791.The
(#495)entry was never actually deleted — it was present in the branch at line 30, identical to master. Thegit diff origin/master..HEAD -- CHANGELOG.mdshowed a pure 12-line addition with zero deletions of existing entries.However, the entry ordering was wrong:
(#588)was appended after(#203)instead of prepended to the top of## Unreleased. Fixed by moving the(#588)entry to the first position under## Unreleased, per convention that newest entries go first.All pre-existing entries preserved and unchanged:
(#473),(#495),(#203),(#494).Quality gates: lint ✓, typecheck ✓, security_scan ✓, behave 57/57 scenarios ✓.
Verification Review — Round 2 (Post-Fix)
Reviewed at HEAD:
fb96d791(3 commits:77db78d7feat,13618fb8fix F1-F7,fb96d791fix F8)All 8 findings from Review #2018 and Comment #55331 are verified fixed. No new bugs found. Quality gates pass clean.
Finding Disposition
postgresql_analyzer.pywas 618 lines (500 limit)_postgresql_helpers.py(440 lines). Both under limit.domain_analyzers.featurewas 523 lines (500 limit)domain_analyzers.feature(217),postgresql_analyzer.feature(201),docker_compose_analyzer.feature(199). All under limit.extract_body()extract_body()andsplit_entries()in_postgresql_helpers.pynow trackin_stringstate, handle''escape sequences. Two regression BDD scenarios added: "DEFAULT with parentheses in string literal" and "CHECK constraint with string containing parentheses". Both pass.# type: ignore[arg-type]indomain_analyzer_steps.py:124_dup: type[AnalyzerProtocol] = _DuplicatePyAnalyzer; del _dup. Notype: ignoreanywhere in the file.yaml.safe_load_MAX_COMPOSE_BYTES = 1_048_576(1 MiB) guard added indocker_compose_analyzer.pybeforeyaml.safe_load()call. Returns[]with warning log if content exceeds limit."primary") incorrectly filteredparse_table_body()now checksnot raw_first.startswith('"')before keyword matching, so quoted identifiers bypass the filter. Regression BDD scenario "Quoted keyword column names are not silently dropped" tests"primary","check", andnormal_col. Passes.23803f14). PR now contains exactly 12 files, all in-scope: 5 new source/helper files, 3 feature files, 1 step file, 2 robot files, and expected modifications to__init__.py,vulture_whitelist.py,CHANGELOG.md.#588entry placed at top of Unreleased section.#495entry and all other pre-existing entries intact.Quality Gates
nox -s lintnox -s typecheckNew Bug Scan
Reviewed all code changes introduced by fix commits
13618fb8andfb96d791:extract_body()/split_entries()string tracking: Forward-progress guaranteed byi += 1at loop end;''escape handled withi += 2+continue; unterminated strings gracefully return empty. Clean.parse_table_body()keyword filter:startswith('"')check correctly short-circuits for quoted identifiers before keyword comparison. No false positives or negatives in edge cases.len(content)measures characters (not bytes), but the naming_MAX_COMPOSE_BYTESis slightly imprecise. Functionally correct since byte count ≥ char count for UTF-8, making the guard conservative. Non-blocking observation._dup: type[AnalyzerProtocol] = _DuplicatePyAnalyzer; del _dupis idiomatic and has zero runtime cost. Clean.No new bugs found.
Verdict
All P1 and P2 findings from Round 1 are resolved. Code is clean, well-structured, and fully tested. Approved for merge.
Approved for merge!