feat(acms): projects with 10,000+ files index without timeout #851

Closed
opened 2026-03-13 21:59:35 +00:00 by freemo · 2 comments
Owner

Metadata

  • Commit Message: feat(acms): projects with 10,000+ files index without timeout
  • Branch: feature/m5-large-project-indexing

Background

M5 (v3.4.0) acceptance criterion: projects with 10,000+ files must be indexable without timeout. The ACMS indexing pipeline must handle large-scale projects efficiently, completing resource registration, linking, and indexing within reasonable time bounds.

Per the E2E suite (m5_e2e_verification.robot lines 16, 26, 36), the tests verify creating a project with 10,000+ files, registering and linking resources, and confirming indexing completes with resource details surviving re-fetch.

Expected Behavior

  1. A project with 10,000+ files can be created and indexed.
  2. Resource registration and linking scale to large file counts.
  3. Indexing completes without timeout (within reasonable time bounds).
  4. Resource details survive a database round-trip after indexing.
  5. Location and type information is preserved for all indexed resources.
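
Items 4 and 5 above can be illustrated with a minimal round-trip check. This is only a sketch under stated assumptions: the `resource` table, its columns, and the helpers `write_resources()` and `fetch_resources()` are hypothetical and do not reflect the actual ACMS schema or persistence layer. It uses the standard-library `sqlite3` module purely as a stand-in.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import List, Tuple

# Hypothetical row shape: (id, location, type). Not the real ACMS schema.
Resource = Tuple[int, str, str]


def write_resources(db_path: Path, resources: List[Resource]) -> None:
    """Persist resources with one connection, then close it."""
    conn = sqlite3.connect(str(db_path))
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE resource ("
                "id INTEGER PRIMARY KEY, location TEXT NOT NULL, type TEXT NOT NULL)"
            )
            conn.executemany(
                "INSERT INTO resource (id, location, type) VALUES (?, ?, ?)",
                resources,
            )
    finally:
        conn.close()


def fetch_resources(db_path: Path) -> List[Resource]:
    """Re-fetch resources through a fresh connection."""
    conn = sqlite3.connect(str(db_path))
    try:
        return conn.execute(
            "SELECT id, location, type FROM resource ORDER BY id"
        ).fetchall()
    finally:
        conn.close()


def round_trip(count: int) -> bool:
    """Return True if every id, location, and type survives the re-fetch."""
    resources = [(i, f"src/module_{i:05d}.py", "file") for i in range(count)]
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "resources.db"
        write_resources(db_path, resources)
        return fetch_resources(db_path) == resources
```

Calling `round_trip(10_000)` exercises the scale named in the criterion. Writing and reading through separate connections is deliberate, so the comparison reflects what was actually persisted rather than in-memory state.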

Acceptance Criteria

  • Project with 10,000+ files indexes successfully
  • Resource registration and linking work at scale
  • No timeout during indexing (configurable bounds)
  • Resource details survive re-fetch after indexing
  • Robot E2E tests Large Project Creation With Ten Thousand Files, Resource Registration And Linking To Project, and Indexing Completes Without Timeout pass
  • Performance test verifies indexing completes within acceptable time

Supporting Information

  • E2E tests: robot/m5_e2e_verification.robot lines 16, 26, 36
  • Related: ACMS tier management (hot/warm/cold storage)

Subtasks

  • Implement scalable indexing pipeline for large projects
  • Optimize resource registration for 10K+ files
  • Add configurable timeout bounds for indexing
  • Ensure database persistence survives round-trip at scale
  • Tests (Behave): Add scenarios for large project edge cases
  • Tests (Robot): Verify E2E acceptance tests pass
  • Performance benchmark: measure indexing time for 10K files
  • Verify coverage >=97% via nox -s coverage_report
  • Run nox (all default sessions), fix any errors

Definition of Done

This issue is complete when:

  • All subtasks above are completed and checked off.
  • A Git commit is created where the first line of the commit message matches the Commit Message in Metadata exactly, followed by a blank line, then additional lines providing relevant details about the implementation.
  • The commit is pushed to the remote on the branch matching the Branch in Metadata exactly.
  • The commit is submitted as a pull request to master, reviewed, and merged before this issue is marked done.
freemo added this to the v3.4.0 milestone 2026-03-13 21:59:50 +00:00
Member

Started implementation on feature/m5-large-project-indexing.

Implemented changes so far:

  1. Configurable indexing timeout bounds (service + CLI)

    • Added timeout_seconds support to RepoIndexingService.index_resource() and RepoIndexingService.refresh_index().
    • Added timeout enforcement in walk_and_index() with TimeoutError when elapsed runtime exceeds configured bound.
    • Exposed timeout from CLI via agents repo index --timeout-seconds.
  2. 10K-file Behave coverage

    • Added new scenario to features/repo_indexing.feature:
      • Indexing 10,000 files completes within timeout bound
    • Added step implementations in features/steps/repo_indexing_steps.py to:
      • generate 10,000 files,
      • run indexing with timeout,
      • assert status/file count/runtime bound.
  3. Robot M5 acceptance helper now does real indexing for the indexing criterion

    • Updated robot/helper_m5_e2e_verification.py indexing_complete() to:
      • generate a 10,000-file repository,
      • register/link resource,
      • run real RepoIndexingService.index_resource(..., timeout_seconds=120.0),
      • verify indexed file count, status, elapsed runtime bound, and DB status round-trip.
  4. Performance benchmark coverage for 10K indexing

    • Added TimeRepoIndexingLarge benchmark class in benchmarks/context_indexing_bench.py with time_full_index_10000_files().
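
The timeout enforcement described in item 1 can be sketched roughly as follows. This is not the actual `walk_and_index()` implementation: the function name `walk_and_index_sketch`, its signature, and the injectable `clock` parameter are assumptions made for illustration, and "indexing" is reduced to counting files.

```python
import os
import time
from pathlib import Path
from typing import Callable, Optional


def walk_and_index_sketch(
    root: Path,
    timeout_seconds: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Walk `root` and count files, enforcing an optional runtime bound.

    Hypothetical stand-in for the real walk_and_index(). Raises TimeoutError
    once the elapsed runtime exceeds `timeout_seconds`.
    """
    started = clock()
    indexed = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        for _name in filenames:
            # Check the bound per file so one huge directory cannot overrun it.
            if timeout_seconds is not None:
                elapsed = clock() - started
                if elapsed > timeout_seconds:
                    raise TimeoutError(
                        f"indexing exceeded {timeout_seconds}s "
                        f"after {indexed} files"
                    )
            indexed += 1  # placeholder for the real per-file indexing work
    return indexed
```

A monotonic clock is used here so that wall-clock adjustments cannot shorten or extend the bound, and passing `timeout_seconds=None` leaves the walk unbounded. The injectable `clock` exists only to make the timeout path testable without sleeping.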

Targeted rerun logs (saved under build/test-logs/):

  • issue-851-behave-targeted-rerun.log ✅
  • issue-851-robot-targeted-rerun.log ✅

Earlier full-session gate logs were also generated in build/test-logs/ for lint/typecheck/unit/integration/coverage during this issue work.
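
For readers who want to reproduce the shape of the 10,000-file checks described above (generate files, run with a bound, assert status, file count, and runtime), a rough sketch follows. The helpers `generate_files()`, `count_files()`, and `check_large_project()` are hypothetical. `count_files()` merely stands in for the real `RepoIndexingService.index_resource(...)` call, whose return value is not shown in this thread, and the status strings are placeholders.

```python
import tempfile
import time
from pathlib import Path
from typing import Dict, Union


def generate_files(root: Path, count: int, per_dir: int = 500) -> int:
    """Create `count` small files under `root`, spread across subdirectories."""
    for i in range(count):
        subdir = root / f"pkg_{i // per_dir:03d}"
        subdir.mkdir(exist_ok=True)
        (subdir / f"file_{i:05d}.txt").write_text(f"content {i}\n")
    return count


def count_files(root: Path) -> int:
    """Placeholder for the real indexing call; it only counts files."""
    return sum(1 for p in root.rglob("*") if p.is_file())


def check_large_project(
    count: int = 10_000, timeout_seconds: float = 120.0
) -> Dict[str, Union[str, int, float]]:
    """Generate a fixture, run the placeholder indexer, and check the bound."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        generate_files(root, count)
        started = time.monotonic()
        indexed = count_files(root)
        elapsed = time.monotonic() - started
    # Placeholder status values, not the service's real ones.
    status = "indexed" if elapsed <= timeout_seconds else "timeout"
    assert status == "indexed"
    assert indexed == count
    return {"status": status, "files": indexed, "elapsed": elapsed}
```

The 120-second default mirrors the `timeout_seconds=120.0` value used in the Robot helper. Spreading files across subdirectories is an arbitrary choice for this sketch, not something the thread specifies.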

Member

Review follow-up for PR #1158 is complete.

  • Rebasing: rebased feature/m5-large-project-indexing onto current master and resolved the CHANGELOG.md conflict.
  • File-size fixes: split the Behave repo-index steps into focused setup/action and assertion/cleanup modules, and split the Robot M5 helper into focused support and context modules so the touched files comply with the 500-line rule.
  • Service cleanup: reduced RepoIndexingService back under the file-size limit without changing the timeout propagation behavior.
  • Scope correction: src/cleveragents/tool/wrapping.py is restored to match master exactly and is no longer modified by this PR.

The branch was amended and force-pushed for re-review.

Reference
cleveragents/cleveragents-core#851