[AUTO-INF-7] Improve test data realism and fixture validation #9789

Open
opened 2026-04-15 15:48:11 +00:00 by HAL9000 · 0 comments
Owner

Summary

  • Correction fixtures under features/fixtures/m4/ ship 27-character plan/decision IDs, which are invalid ULIDs and let regressions slip through validation.
  • Stage context fixtures for the paper-writer workflow rely on duplicated, mostly-null snapshots, so behaviour in intermediate stages is untested and stage-order drift would go unnoticed.
  • Several large-suite tests (Robot + CLI) still depend on the live OpenAI stack, and higher-level policies reference directories that no longer exist, which explains frequent CI flakes tied to external services and stale paths.

Findings

1. Correction flows use invalid ULID samples

  • features/fixtures/m4/correction_flows.json (lines ~5-74) hard-codes plan_id and target_decision_id strings such as 01M4CORR0000000000000000001.
  • Running a jq length check over the fixture shows 27-character identifiers; the production validator (PlanIdentity -> python-ulid) rejects anything that is not exactly 26 characters.
  • Tests built on these fixtures keep passing as validation hardens, which suggests they never reach the validator; CI reports green even though real requests carrying such IDs now raise.
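
The length problem can be reproduced without the production validator. The sketch below is a stdlib-only approximation of the ULID text format (26 characters, Crockford base32 alphabet, first character no higher than 7); it is not PlanIdentity or python-ulid, and the function name ulid_problems is hypothetical. Note that the sample ID also contains the letter O, which the canonical Crockford alphabet omits.

```python
# Stdlib-only approximation of the ULID text format; NOT the production validator.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no I, L, O, U
ULID_LENGTH = 26


def ulid_problems(value: str) -> list:
    """Return human-readable reasons why `value` is not a well-formed ULID string."""
    problems = []
    if len(value) != ULID_LENGTH:
        problems.append(f"length {len(value)} != {ULID_LENGTH}")
    bad = sorted({ch for ch in value.upper() if ch not in CROCKFORD_ALPHABET})
    if bad:
        problems.append(f"characters outside Crockford base32: {''.join(bad)}")
    if value and value[0] > "7":
        problems.append("first character above '7' overflows the 48-bit timestamp")
    return problems


# Sample ID quoted in the finding above.
sample = "01M4CORR0000000000000000001"
for problem in ulid_problems(sample):
    print(problem)
```

A check of this shape returns an empty list for a well-formed ID, so it can be dropped into a Pytest assertion or a Behave hook as-is.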

2. Paper-writer contexts do not reflect real state transitions

  • features/fixtures/v2/contexts/discovery_stage_context.json and features/fixtures/v2/paper_contexts/03_brainstorming.json duplicate the same stage_order array and leave most keys null or empty lists.
  • Robot suites (robot/scientific_paper_e2e_test.robot) depend on these snapshots; because they do not capture partially-complete stages (for example vetted sources or section indices), defects in transition code only appear during manual smoke tests.
  • The duplication also means a stage addition or removal requires editing multiple JSON files by hand, increasing drift risk.
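
One way to remove the duplication is to build every snapshot from a single stage list. The sketch below is illustrative only: build_stage_context, the dictionary keys other than stage_order, and the stage names other than "discovery" and "brainstorming" (taken from the fixture paths) are assumptions, not the real schema.

```python
import json
from typing import Optional

# Hypothetical stage names; only "discovery" and "brainstorming" appear in the fixture paths.
STAGE_ORDER = ("discovery", "brainstorming", "drafting", "review")


def build_stage_context(current_stage: str, stage_data: Optional[dict] = None) -> dict:
    """Build a context snapshot whose stage_order always comes from STAGE_ORDER."""
    if current_stage not in STAGE_ORDER:
        raise ValueError(f"unknown stage: {current_stage!r}")
    index = STAGE_ORDER.index(current_stage)
    return {
        "stage_order": list(STAGE_ORDER),
        "current_stage": current_stage,
        # Every stage before the current one counts as completed.
        "completed_stages": list(STAGE_ORDER[:index]),
        # Partially complete state, e.g. the vetted sources gathered so far.
        "stage_data": dict(stage_data or {}),
    }


snapshot = build_stage_context("brainstorming", {"vetted_sources": ["src-1"]})
print(json.dumps(snapshot, indent=2))
```

With a builder like this, adding or removing a stage is a one-line change to STAGE_ORDER, and the JSON fixtures can be regenerated rather than edited by hand.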

3. ACMS policy fixture references dead paths

  • features/fixtures/m5/acms_context_policy.json still whitelists tests/**/*.py, but the repo reorganised into features/ and robot/ (no tests/ tree in the manifest).
  • As a result, ACMS behaviour around real-world asset discovery is never exercised in CI: the policy whitelists a directory that no longer exists, so the richly structured doc and Robot fixtures are never picked up.

4. Robot e2e suites invoke live LLM providers

  • robot/scientific_paper_e2e_test.robot shells out to python -m cleveragents with --unsafe, and the referenced config examples/scientific_paper_writer.yaml still specifies provider: openai with gpt-3.5-turbo and gpt-4-turbo actors.
  • In CI this produces sporadic throttling failures and non-deterministic outputs; locally, engineers toggle environment variables to point at mocks, but the suite ships without a dedicated stub config.
  • The fixtures under features/mocks/mock_ai_provider.py are unused by Robot suites, so we get neither determinism nor coverage of the mock provider code.

5. No property-based or factory-driven data generation

  • pyproject.toml lists Behave, Pytest, and RobotFramework but no faker, factory-boy, or hypothesis-style libraries.
  • Consequently, every validation edge case is baked into static JSON or YAML. The ULID fixtures above show how easy it is for stale constants to bypass future schema changes, and we never randomise locale, encoding, or timezone inputs to flush out latent bugs.

Recommendations

Quick wins (<= 2 sprints)

  • Add a lightweight validation script (Pytest or Behave hook) that walks features/fixtures/**/*.json and features/fixtures/**/*.yaml, verifying ULID length, that referenced paths exist, and that required keys are non-empty; wire it into CI so invalid fixtures fail fast.
  • Replace the paper-writer discovery/brainstorming fixtures with snapshots exported from an actual run, or generate them via a shared helper so stage_order lives in a single module.
  • Update acms_context_policy.json to mirror the current repo layout (for example switch to features/**/*.feature, robot/**/*.robot) and exercise the policy against real directories in CI.
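
The first quick win could start as small as the sketch below, which walks a fixture directory and reports every ID of the wrong length. It covers JSON only and checks length only; the function names iter_ids and check_fixtures are hypothetical, and the key names are the two mentioned in the findings.

```python
import json
from pathlib import Path

ID_KEYS = {"plan_id", "target_decision_id"}  # keys named in the findings
ULID_LENGTH = 26


def iter_ids(node, trail=""):
    """Yield (json_path, value) for every ID key found anywhere in a JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{trail}.{key}" if trail else key
            if key in ID_KEYS and isinstance(value, str):
                yield here, value
            yield from iter_ids(value, here)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from iter_ids(item, f"{trail}[{i}]")


def check_fixtures(root: Path) -> list:
    """Return one error string per wrong-length ID under root/**/*.json."""
    errors = []
    for path in sorted(root.rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        for where, value in iter_ids(data):
            if len(value) != ULID_LENGTH:
                errors.append(f"{path}: {where} has length {len(value)}")
    return errors


# Example CI usage (assumed layout): fail the job when any error is reported.
# errors = check_fixtures(Path("features/fixtures"))
# raise SystemExit("\n".join(errors) if errors else 0)
```

Returning a list of errors instead of raising on the first one means a single CI run surfaces every broken fixture at once.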

Mid-term initiatives (2-4 sprints)

  • Introduce a test-specific config (for example examples/scientific_paper_writer_test.yaml) that wires actors to MockAIProvider; switch Robot suites to that config and gate the live-provider variant behind an opt-in stress run.
  • Stand up a fixtures/factories.py module using Faker (seeded via environment variable) to emit plan/decision identifiers, timestamps, and resource paths so Behave and Robot suites can request realistic but deterministic test subjects.
  • Add hypothesis-based property tests around ULID / namespaced-name validators and schema coercion in PlanService, so future fixture tweaks do not regress coverage.
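
As a rough illustration of the seeded-factory idea, the sketch below emits well-formed 26-character ULID strings deterministically. It deliberately uses the standard library's random.Random in place of Faker so that it runs without extra dependencies; IdFactory, encode_ulid, and the TEST_DATA_SEED variable are hypothetical names, not existing project code.

```python
import os
import random

CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms: int, randomness: int) -> str:
    """Encode a 48-bit timestamp and an 80-bit random value as a 26-character ULID."""
    value = (timestamp_ms << 80) | randomness
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD_ALPHABET[value & 0x1F])  # lowest 5 bits first
        value >>= 5
    return "".join(reversed(chars))


class IdFactory:
    """Deterministic ID source: the same seed always yields the same sequence."""

    def __init__(self, seed: int, start_ms: int = 1_700_000_000_000):
        self._rng = random.Random(seed)
        self._now_ms = start_ms  # fake clock, advanced on every call

    def ulid(self) -> str:
        self._now_ms += self._rng.randint(1, 1000)
        return encode_ulid(self._now_ms, self._rng.getrandbits(80))


# The seed comes from a hypothetical environment variable, defaulting to 0.
factory = IdFactory(seed=int(os.environ.get("TEST_DATA_SEED", "0")))
print(factory.ulid(), factory.ulid())
```

Because the clock is simulated rather than read from the system, a failing run can be replayed exactly by reusing its seed.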

Data generation strategy

  • Adopt a seeded Faker profile for IDs, file paths, and human-readable metadata shared across Behave and Robot helpers (for example extend features/mocks/test_uow_factory.build_test_uow).
  • Provide helper APIs that emit whole fixture payloads (plan correction flows, automation profiles), returning both the structured object and JSON or YAML serialisation so specs stay DRY.

Maintenance considerations

  • Document fixture generation and add a pre-commit hook or CI step that re-generates stage context snapshots from canonical Python builders to prevent manual drift.
  • Add a check that compares include_paths / exclude_paths entries against the repo tree, failing when directories disappear.
  • Track ownership by introducing CODEOWNERS entries for features/fixtures/** so schema changes trigger review from the test-infra team.
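
The path check from the second bullet can be approximated with pathlib globbing. In the sketch below, stale_patterns is a hypothetical helper, and the commented usage assumes the policy file stores a top-level include_paths list of glob patterns, which may not match the real policy schema.

```python
from pathlib import Path


def stale_patterns(repo_root: Path, patterns: list) -> list:
    """Return the glob patterns that match nothing under repo_root."""
    stale = []
    for pattern in patterns:
        # next() stops at the first match, so large trees are not fully walked.
        if next(repo_root.glob(pattern), None) is None:
            stale.append(pattern)
    return stale


# Example usage (assumed policy layout):
# import json
# policy = json.loads(Path("features/fixtures/m5/acms_context_policy.json").read_text())
# dead = stale_patterns(Path("."), policy["include_paths"])
# assert not dead, f"policy references paths that no longer exist: {dead}"
```

Run against the current policy, a check of this kind would be expected to flag tests/**/*.py, since the findings report that the tests/ tree no longer exists.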

Duplicate Check

  • GET /api/v1/repos/cleveragents/cleveragents-core/issues?q=AUTO-INF-7&state=open&limit=50&page=1 returns #9143 and #8577, which focus on missing test levels (application/unit coverage) rather than fixture realism; no open issue targets test data quality.
  • No closed issues contain the tag together with the keywords fixture or test data.
HAL9000 changed title from [AUTO-INF-7] temp to [AUTO-INF-7] Improve test data realism and fixture validation 2026-04-15 15:48:52 +00:00