feat(testing): Add @mocked/@llm integration test tagging system #684

Closed
opened 2026-03-11 18:10:37 +00:00 by freemo · 1 comment
Owner

Background and Context

Currently, all Behave BDD tests and most Robot Framework tests unconditionally use MockAIProvider (injected globally in features/environment.py before_all() and re-applied in every before_scenario()). There is no mechanism to selectively run tests against real LLM endpoints.

This creates a critical gap: integration tests that should exercise real LLM behavior are indistinguishable from unit-style tests that legitimately use mocks. We cannot verify that our LLM integration actually works end-to-end.

39% of step definition files (215/549) and 20% of robot helper files (37/182) use unittest.mock. The MockAIProvider is applied to every test unconditionally via DI override.
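The counts above could be reproduced with a standalone audit script like the sketch below. It parses each `.py` file with Python's `ast` module and flags the three common ways of importing `unittest.mock`. The issue does not say how the original figures were measured, so this method is an assumption and may differ slightly. For example, it misses files that only `import unittest` and then call `unittest.mock.patch`.

```python
"""Count Python files that import unittest.mock (audit sketch)."""
import ast
import sys
from pathlib import Path


def uses_unittest_mock(source: str) -> bool:
    """True if the module imports unittest.mock in one of its common forms."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import unittest.mock  /  import unittest.mock as m
            if any(alias.name == "unittest.mock" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # from unittest.mock import patch
            if node.module == "unittest.mock":
                return True
            # from unittest import mock
            if node.module == "unittest" and any(
                alias.name == "mock" for alias in node.names
            ):
                return True
    return False


def mock_usage(root: Path) -> tuple[int, int]:
    """Return (files importing unittest.mock, total .py files) under root."""
    files = sorted(root.rglob("*.py"))
    using = sum(uses_unittest_mock(p.read_text(encoding="utf-8")) for p in files)
    return using, len(files)


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        using, total = mock_usage(Path(arg))
        pct = 100 * using / total if total else 0.0
        print(f"{arg}: {using}/{total} files ({pct:.0f}%) import unittest.mock")
```

For reference, 215/549 ≈ 39.2% and 37/182 ≈ 20.3%, which the script's `:.0f` formatting prints as 39% and 20%.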

Expected Behavior

A tag-based system should allow tests to declare their mock/LLM requirements:

| Tag | Meaning | Behavior |
|---|---|---|
| @mocked | Test uses any mock (not just the LLM) | Informational only; no runtime effect |
| @mocked_llm | Test specifically mocks the LLM provider | MockAIProvider injected (current default behavior) |
| @llm | Test requires a real LLM endpoint | MockAIProvider is NOT injected; the real provider is used |
| @llm_anthropic | Requires ANTHROPIC_API_KEY in the environment | Test skipped if the key is not present |
| @llm_openai | Requires OPENAI_API_KEY in the environment | Test skipped if the key is not present |

Runtime behavior changes needed in features/environment.py:

  1. before_all() should not unconditionally install MockAIProvider
  2. before_scenario() should check scenario tags:
    • If @llm tag present: skip mock injection, verify required API keys
    • If @mocked_llm tag present (or no @llm tag): inject MockAIProvider as today
    • If @llm_anthropic / @llm_openai: skip test if corresponding env var is missing
  3. Similar changes needed in Robot Framework via a new listener or library
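The Behave side of the runtime changes above could look roughly like the sketch below. It relies on Behave's `scenario.effective_tags`, which holds scenario plus feature tags without the leading `@`, and on `scenario.skip()`. `install_mock_ai_provider()` and `clear_ai_provider_override()` are hypothetical placeholders for the project's existing DI override. The sketch also assumes that `@llm` scenarios carry a provider tag (`@llm_anthropic` or `@llm_openai`), so the key check has something to verify.

```python
"""Tag-driven AI provider selection for features/environment.py (sketch).

install_mock_ai_provider() and clear_ai_provider_override() are hypothetical
placeholders for the project's real DI override of MockAIProvider.
"""
import os

# Provider-specific tag -> environment variable it requires.
REQUIRED_KEYS = {
    "llm_anthropic": "ANTHROPIC_API_KEY",
    "llm_openai": "OPENAI_API_KEY",
}


def install_mock_ai_provider(context):
    # Hypothetical: apply the existing MockAIProvider DI override here.
    context.ai_provider = "MockAIProvider"


def clear_ai_provider_override(context):
    # Hypothetical: remove any override so the real provider is resolved.
    context.ai_provider = "real"


def before_all(context):
    # MockAIProvider is no longer installed unconditionally; see before_scenario.
    pass


def before_scenario(context, scenario):
    # effective_tags covers scenario and feature tags, stored without "@".
    tags = set(scenario.effective_tags)

    # @llm_anthropic / @llm_openai: skip gracefully when the key is absent.
    missing = [
        var for tag, var in REQUIRED_KEYS.items()
        if tag in tags and not os.environ.get(var)
    ]
    if missing:
        scenario.skip(f"missing API key(s): {', '.join(missing)}")
        return

    if "llm" in tags:
        # Real endpoint requested: leave the provider un-mocked.
        clear_ai_provider_override(context)
    else:
        # @mocked_llm and untagged scenarios keep today's default.
        install_mock_ai_provider(context)
```

The skip check runs before any provider is touched, so a skipped scenario never has a mock or a real provider configured.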

Nox session changes needed in noxfile.py:

  • unit_tests session should exclude @llm tests by default
  • New llm_integration_tests session that includes only @llm tests
  • The integration_tests session should support an --include llm flag
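The nox changes above mostly come down to which tag filters each session passes to the test runners. A minimal sketch of helpers a noxfile.py could call is shown below. It assumes a behave release that understands Cucumber-style tag expressions (`not @llm`); behave 1.2.6 needs the older `~@llm` form instead. The `robot/` suite directory is a hypothetical placeholder.

```python
"""Command lines for the unit, LLM, and integration nox sessions (sketch)."""

ROBOT_SUITES = "robot/"  # hypothetical location of the Robot Framework suites


def unit_test_commands():
    """Default run: every test except those tagged llm."""
    return [
        ["behave", "--tags=not @llm", "features/"],
        ["robot", "--exclude", "llm", ROBOT_SUITES],
    ]


def llm_integration_commands():
    """Only the tests tagged llm."""
    return [
        ["behave", "--tags=@llm", "features/"],
        ["robot", "--include", "llm", ROBOT_SUITES],
    ]


def integration_commands(posargs):
    """Forward extra Robot options such as ["--include", "llm"] unchanged."""
    return [["robot", *posargs, ROBOT_SUITES]]
```

Inside a session, each list would be passed to `session.run(*cmd)`. Running `nox -s integration_tests -- --include llm` makes `session.posargs` equal `["--include", "llm"]`, which is what `integration_commands()` expects.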

Acceptance Criteria

  • features/environment.py conditionally injects MockAIProvider based on scenario tags
  • @llm-tagged tests use real LLM providers when API keys are available
  • @llm-tagged tests are skipped gracefully when API keys are absent
  • Robot Framework equivalent listener/library handles the same tag logic
  • noxfile.py has a new llm_integration_tests session
  • At least 1 sample @llm test exists to prove the system works
  • At least 1 sample @mocked_llm test exists (most existing tests qualify)
  • All existing tests continue to pass (backward compatible — defaulting to mock)
  • ruff check and ruff format pass
  • CONTRIBUTING.md updated with tag documentation (separate issue)
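For the Robot Framework criterion above, one possible shape is a listener (API version 3) that applies the same key check before any test in a suite runs. This sketch assumes Robot tests reuse the tag names without `@` (`llm`, `llm_openai`, ...). It also assumes Robot Framework 5.0 or newer, where the reserved `robot:skip` tag marks a test as skipped. Mock injection for the Robot helpers is not shown, and the class name is hypothetical.

```python
"""Robot Framework listener (API v3) mirroring the Behave tag rules (sketch)."""
import os


class LlmTagListener:
    ROBOT_LISTENER_API_VERSION = 3

    # Provider-specific tag -> environment variable it requires.
    REQUIRED_KEYS = {
        "llm_anthropic": "ANTHROPIC_API_KEY",
        "llm_openai": "OPENAI_API_KEY",
    }

    def start_suite(self, data, result):
        # Tests in this suite have not started yet, so their tags can still change.
        for test in data.tests:
            missing = [
                var
                for tag, var in self.REQUIRED_KEYS.items()
                if tag in test.tags and not os.environ.get(var)
            ]
            if missing:
                # Reserved tag: Robot reports the test as skipped instead of running it.
                test.tags.add("robot:skip")
```

It would be registered with something like `robot --listener llm_tag_listener.LlmTagListener robot/`, where the module name is also hypothetical.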

Metadata

  • Commit message: feat(testing): add @mocked/@llm integration test tagging system
  • Branch name: feature/m3-test-llm-tagging

Subtasks

  • Modify features/environment.py before_scenario() to check tags before mock injection
  • Add tag-based API key checking and graceful skip logic
  • Create Robot Framework listener or library for equivalent tag handling
  • Add llm_integration_tests session to noxfile.py
  • Create sample @llm test to validate the system
  • Ensure backward compatibility (untagged tests default to mock)

Definition of Done

  • Tag system is implemented and functional in both Behave and Robot Framework
  • Existing test suite passes with no changes (backward compatible)
  • Sample real-LLM test demonstrates the tag system works
  • CI pipeline updated to support @llm test sessions
freemo added this to the v3.2.0 milestone 2026-03-11 18:13:19 +00:00
Author
Owner

Closing as superseded. The approach has changed from tagging mocked vs real tests to prohibiting all mocking in integration tests outright. Integration tests must exercise real services, real endpoints, and real dependencies — mocking of any kind is strictly prohibited. Replaced by new issues covering: (1) removal of mock LLM providers from integration tests, (2) removal of all unittest.mock from integration tests, (3) removal of integration test imports from the unit test mock library, and (4) CI configuration for real LLM API keys.

freemo 2026-03-11 19:00:25 +00:00
Reference
cleveragents/cleveragents-core#684