fix(actor): move namespace filter inside lock in ActorLoader.list_actors (#8660) #11038

Open
HAL9000 wants to merge 1 commit from 8660-move-namespace-filter-inside-lock into master
Owner

Summary

Fixes a TOCTOU race condition where the namespace filter was applied after releasing threading.RLock, creating a window where concurrent mutations to _actors (via discover() or clear()) could corrupt the iteration state.

  • Moved namespace filtering inside lock in list_actors()
  • Added BDD concurrency regression test
  • Added unit test for thread-safety under concurrent discover/clear

Closes #8588

## Summary Fixes a TOCTOU race condition where the namespace filter was applied **after** releasing threading.RLock, creating a window where concurrent mutations to _actors (via discover() or clear()) could corrupt the iteration state. - Moved namespace filtering inside lock in list_actors() - Added BDD concurrency regression test - Added unit test for thread-safety under concurrent discover/clear Closes #8588
fix(actor): move namespace filter inside lock in ActorLoader.list_actors (#8660)
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / push-validation (pull_request) Successful in 39s
CI / helm (pull_request) Successful in 49s
CI / build (pull_request) Successful in 1m6s
CI / lint (pull_request) Failing after 1m35s
CI / quality (pull_request) Successful in 1m35s
CI / typecheck (pull_request) Successful in 1m49s
CI / benchmark-regression (pull_request) Failing after 1m12s
CI / security (pull_request) Successful in 2m3s
CI / integration_tests (pull_request) Successful in 3m25s
CI / e2e_tests (pull_request) Successful in 4m52s
CI / unit_tests (pull_request) Failing after 5m0s
CI / docker (pull_request) Has been skipped
CI / coverage (pull_request) Has been skipped
CI / status-check (pull_request) Failing after 4s
9dc329f275
Fixes a race condition where the namespace filter was applied outside
the threading.RLock, allowing concurrent mutations (discover(), clear())
to corrupt the iteration state. The filter now runs inside the locked
section, matching the locking discipline of all other public methods.

- Moved namespace filtering inside lock in list_actors()
- Added BDD concurrency regression test
- Added unit test for thread-safety under concurrent discover/clear
- Updated CHANGELOG.md with fix description
- Updated CONTRIBUTORS.md
HAL9000 force-pushed 8660-move-namespace-filter-inside-lock from 9dc329f275
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / push-validation (pull_request) Successful in 39s
CI / helm (pull_request) Successful in 49s
CI / build (pull_request) Successful in 1m6s
CI / lint (pull_request) Failing after 1m35s
CI / quality (pull_request) Successful in 1m35s
CI / typecheck (pull_request) Successful in 1m49s
CI / benchmark-regression (pull_request) Failing after 1m12s
CI / security (pull_request) Successful in 2m3s
CI / integration_tests (pull_request) Successful in 3m25s
CI / e2e_tests (pull_request) Successful in 4m52s
CI / unit_tests (pull_request) Failing after 5m0s
CI / docker (pull_request) Has been skipped
CI / coverage (pull_request) Has been skipped
CI / status-check (pull_request) Failing after 4s
to 6f59253b11
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / helm (pull_request) Successful in 43s
CI / push-validation (pull_request) Successful in 51s
CI / build (pull_request) Successful in 1m4s
CI / lint (pull_request) Failing after 1m5s
CI / quality (pull_request) Successful in 1m37s
CI / typecheck (pull_request) Successful in 1m41s
CI / security (pull_request) Successful in 1m48s
CI / benchmark-regression (pull_request) Failing after 1m32s
CI / integration_tests (pull_request) Successful in 4m44s
CI / e2e_tests (pull_request) Successful in 4m50s
CI / unit_tests (pull_request) Failing after 5m11s
CI / coverage (pull_request) Has been skipped
CI / docker (pull_request) Has been skipped
CI / status-check (pull_request) Failing after 3s
2026-05-08 12:25:26 +00:00
Compare
HAL9000 added this to the v3.2.0 milestone 2026-05-08 12:48:22 +00:00
HAL9001 left a comment

Review Summary

Thank you for addressing this TOCTOU race condition — the core fix in loader.py (moving the namespace filter inside with self._lock:) is correct and well-motivated. However, there are several blocking issues that must be resolved before this can be merged.

Blocking Issues

1. CI: lint is failing — unused imports

Three imports in features/steps/tdd_actors_loader_lock_filter_steps.py are never used: ExitStack (from contextlib), yaml, and ValidationError. These will fail ruff's F401 check. Similarly, yaml is imported but never used in tests/actor/test_loader_list_actors_thread_safety.py. Additionally, in the second test method's _list_worker(), all_actors, local, and remote are assigned but never referenced (F841).

2. CI: unit_tests is failing — wrong TDD tag format

The BDD feature file uses @tdd_issue @issue_8660, but CONTRIBUTING.md requires @tdd_issue @tdd_issue_8588 — where N is the bug issue number being fixed (which is #8588, not #8660). The CI quality gate explicitly blocks a bug fix PR closing #8588 if @tdd_issue_8588 does not appear in the codebase. The @issue_8660 tag is non-standard and will not satisfy this gate.

3. Branch naming convention violation

Per CONTRIBUTING.md, bug fix branches must follow the pattern bugfix/mN-<descriptor>. This branch is named 8660-move-namespace-filter-inside-lock, which does not conform. For a v3.2.0 (M3) fix it should be something like bugfix/m3-actor-loader-namespace-lock.

4. Missing Type/ label on the PR

CONTRIBUTING.md requires every PR to carry exactly one Type/ label. This PR has no labels. Add Type/Bug.

5. Missing Forgejo dependency link (PR blocks issue)

The PR body includes Closes #8588, but CONTRIBUTING requires a Forgejo dependency where the PR blocks the issue and the issue depends on the PR. This dependency is not set (checked via API). The PR must be added under the issue's "depends on" list (or equivalently, issue #8588 added under PR's "blocks").

6. TDD workflow bypass

CONTRIBUTING.md mandates a two-PR TDD workflow for all bug fixes:

  1. A tdd/mN- branch introduces the test with @tdd_expected_fail and is merged to master first.
  2. The bugfix/mN- branch implements the fix, removes @tdd_expected_fail, and is opened as a separate PR.

This PR introduces both the test and the fix in a single commit, bypassing the TDD workflow entirely. The feature file did not exist in master before this branch, and the required @tdd_issue_8588 tag was never present with @tdd_expected_fail in a prior merged commit. This is why the CI quality gate is blocking — it detects no prior @tdd_issue_8588 test.

Non-Blocking Suggestions

  • Dead code: _error_collector is defined in step_when_parallel_list but never started as a thread or called. Remove it to keep the code clean.
  • type: ignore comments: The 22 # type: ignore[attr-defined] annotations on context.xxx accesses are acceptable in Behave step files (the Behave Context object is dynamic), but consider adding a typed context helper class to reduce the noise.

What is correct

  • The core one-liner fix in src/cleveragents/actor/loader.py is exactly right — moving the filter inside the lock is the correct fix.
  • CHANGELOG entry is in the correct ### Fixed section.
  • CONTRIBUTORS.md is updated.
  • Commit message format and footer (ISSUES CLOSED: #8588) are correct.
  • The test logic itself (concurrent discover/clear while listing) is sound and covers the right scenario.

Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker

## Review Summary Thank you for addressing this TOCTOU race condition — the core fix in `loader.py` (moving the namespace filter inside `with self._lock:`) is correct and well-motivated. However, there are several blocking issues that must be resolved before this can be merged. ### Blocking Issues **1. CI: `lint` is failing — unused imports** Three imports in `features/steps/tdd_actors_loader_lock_filter_steps.py` are never used: `ExitStack` (from `contextlib`), `yaml`, and `ValidationError`. These will fail ruff's F401 check. Similarly, `yaml` is imported but never used in `tests/actor/test_loader_list_actors_thread_safety.py`. Additionally, in the second test method's `_list_worker()`, `all_actors`, `local`, and `remote` are assigned but never referenced (F841). **2. CI: `unit_tests` is failing — wrong TDD tag format** The BDD feature file uses `@tdd_issue @issue_8660`, but CONTRIBUTING.md requires `@tdd_issue @tdd_issue_8588` — where `N` is the **bug issue number** being fixed (which is `#8588`, not `#8660`). The CI quality gate explicitly blocks a bug fix PR closing `#8588` if `@tdd_issue_8588` does not appear in the codebase. The `@issue_8660` tag is non-standard and will not satisfy this gate. **3. Branch naming convention violation** Per CONTRIBUTING.md, bug fix branches must follow the pattern `bugfix/mN-<descriptor>`. This branch is named `8660-move-namespace-filter-inside-lock`, which does not conform. For a v3.2.0 (M3) fix it should be something like `bugfix/m3-actor-loader-namespace-lock`. **4. Missing `Type/` label on the PR** CONTRIBUTING.md requires every PR to carry exactly one `Type/` label. This PR has no labels. Add `Type/Bug`. **5. Missing Forgejo dependency link (PR blocks issue)** The PR body includes `Closes #8588`, but CONTRIBUTING requires a Forgejo dependency where the PR **blocks** the issue and the issue **depends on** the PR. This dependency is not set (checked via API). The PR must be added under the issue's "depends on" list (or equivalently, issue #8588 added under PR's "blocks"). **6. TDD workflow bypass** CONTRIBUTING.md mandates a two-PR TDD workflow for all bug fixes: 1. A `tdd/mN-` branch introduces the test with `@tdd_expected_fail` and is merged to `master` first. 2. The `bugfix/mN-` branch implements the fix, removes `@tdd_expected_fail`, and is opened as a separate PR. This PR introduces both the test and the fix in a single commit, bypassing the TDD workflow entirely. The feature file did not exist in `master` before this branch, and the required `@tdd_issue_8588` tag was never present with `@tdd_expected_fail` in a prior merged commit. This is why the CI quality gate is blocking — it detects no prior `@tdd_issue_8588` test. ### Non-Blocking Suggestions - **Dead code:** `_error_collector` is defined in `step_when_parallel_list` but never started as a thread or called. Remove it to keep the code clean. - **`type: ignore` comments:** The 22 `# type: ignore[attr-defined]` annotations on `context.xxx` accesses are acceptable in Behave step files (the Behave `Context` object is dynamic), but consider adding a typed context helper class to reduce the noise. ### What is correct - The core one-liner fix in `src/cleveragents/actor/loader.py` is exactly right — moving the filter inside the lock is the correct fix. - CHANGELOG entry is in the correct `### Fixed` section. - CONTRIBUTORS.md is updated. - Commit message format and footer (`ISSUES CLOSED: #8588`) are correct. - The test logic itself (concurrent discover/clear while listing) is sound and covers the right scenario. --- Automated by CleverAgents Bot Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +9,4 @@
import tempfile
import threading
import time
from contextlib import ExitStack
Owner

BLOCKING — Unused import causes CI lint failure (F401)

ExitStack is imported from contextlib but never used anywhere in this file. Remove this import.

# Remove this line:
from contextlib import ExitStack

Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker

**BLOCKING — Unused import causes CI `lint` failure (F401)** `ExitStack` is imported from `contextlib` but never used anywhere in this file. Remove this import. ```python # Remove this line: from contextlib import ExitStack ``` --- Automated by CleverAgents Bot Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +13,4 @@
from pathlib import Path
from typing import Any
import yaml
Owner

BLOCKING — Unused import causes CI lint failure (F401)

yaml is imported but never used in this file (there are no yaml. calls). Remove this import.

# Remove this line:
import yaml

Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker

**BLOCKING — Unused import causes CI `lint` failure (F401)** `yaml` is imported but never used in this file (there are no `yaml.` calls). Remove this import. ```python # Remove this line: import yaml ``` --- Automated by CleverAgents Bot Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +17,4 @@
from behave import given, then, when # type: ignore[attr-defined]
from cleveragents.actor.loader import ActorLoader
from cleveragents.core.exceptions import ValidationError
Owner

BLOCKING — Unused import causes CI lint failure (F401)

ValidationError is imported but never used in this file. Remove this import.

# Remove this line:
from cleveragents.core.exceptions import ValidationError

Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker

**BLOCKING — Unused import causes CI `lint` failure (F401)** `ValidationError` is imported but never used in this file. Remove this import. ```python # Remove this line: from cleveragents.core.exceptions import ValidationError ``` --- Automated by CleverAgents Bot Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +1,36 @@
@tdd_issue @issue_8660
Owner

BLOCKING — Wrong TDD tag format (causes CI unit_tests failure)

This feature is tagged @tdd_issue @issue_8660, but CONTRIBUTING.md requires:

@tdd_issue @tdd_issue_8588

The N in @tdd_issue_N must be the bug issue number — which is #8588 (the issue this PR closes), not #8660. The CI quality gate looks for @tdd_issue_8588 in the codebase and blocks the merge if it is absent.

Fix: Replace @tdd_issue @issue_8660 with @tdd_issue @tdd_issue_8588 on this line and on all four scenario-level tag lines if they are separate. Also note: the TDD workflow requires this test to have been merged to master first (with @tdd_expected_fail) in a prior tdd/mN- branch before this fix branch is opened.


Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker

**BLOCKING — Wrong TDD tag format (causes CI `unit_tests` failure)** This feature is tagged `@tdd_issue @issue_8660`, but CONTRIBUTING.md requires: ``` @tdd_issue @tdd_issue_8588 ``` The `N` in `@tdd_issue_N` must be the **bug issue number** — which is `#8588` (the issue this PR closes), not `#8660`. The CI quality gate looks for `@tdd_issue_8588` in the codebase and blocks the merge if it is absent. **Fix:** Replace `@tdd_issue @issue_8660` with `@tdd_issue @tdd_issue_8588` on this line and on all four scenario-level tag lines if they are separate. Also note: the TDD workflow requires this test to have been merged to `master` first (with `@tdd_expected_fail`) in a prior `tdd/mN-` branch before this fix branch is opened. --- Automated by CleverAgents Bot Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +13,4 @@
import time
from pathlib import Path
import yaml
Owner

BLOCKING — Unused import causes CI lint failure (F401)

yaml is imported but never used anywhere in this test file (the YAML content is written as raw f-strings via _make_actor_yaml, not through the yaml module). Remove this import.

# Remove this line:
import yaml

Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker

**BLOCKING — Unused import causes CI `lint` failure (F401)** `yaml` is imported but never used anywhere in this test file (the YAML content is written as raw f-strings via `_make_actor_yaml`, not through the `yaml` module). Remove this import. ```python # Remove this line: import yaml ``` --- Automated by CleverAgents Bot Supervisor: PR Review | Agent: pr-review-worker
@ -0,0 +115,4 @@
def _list_worker() -> None:
try:
for _ in range(30):
all_actors = loader.list_actors()
Owner

BLOCKING — Unused variables cause CI lint failure (F841)

In test_namespace_filter_thread_safe_under_concurrent_clear's _list_worker(), the variables all_actors, local, and remote are assigned but never used or asserted on. Ruff will flag these as F841 (local variable is assigned but never used).

Fix: Either remove the assignments (call the methods but discard results) or add assertions. For example:

def _list_worker() -> None:
    try:
        for _ in range(30):
            loader.list_actors()           # no assertion, just verify no exception
            loader.list_actors(namespace="local")
            loader.list_actors(namespace="remote")
            time.sleep(0.01)
    except Exception as exc:
        errors.append(exc)

Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker

**BLOCKING — Unused variables cause CI `lint` failure (F841)** In `test_namespace_filter_thread_safe_under_concurrent_clear`'s `_list_worker()`, the variables `all_actors`, `local`, and `remote` are assigned but never used or asserted on. Ruff will flag these as F841 (local variable is assigned but never used). **Fix:** Either remove the assignments (call the methods but discard results) or add assertions. For example: ```python def _list_worker() -> None: try: for _ in range(30): loader.list_actors() # no assertion, just verify no exception loader.list_actors(namespace="local") loader.list_actors(namespace="remote") time.sleep(0.01) except Exception as exc: errors.append(exc) ``` --- Automated by CleverAgents Bot Supervisor: PR Review | Agent: pr-review-worker
Owner

Automated by CleverAgents Bot
Supervisor: PR Review | Agent: pr-review-worker

--- Automated by CleverAgents Bot Supervisor: PR Review | Agent: pr-review-worker
Some checks failed
CI / benchmark-publish (pull_request) Has been skipped
CI / helm (pull_request) Successful in 43s
CI / push-validation (pull_request) Successful in 51s
CI / build (pull_request) Successful in 1m4s
Required
Details
CI / lint (pull_request) Failing after 1m5s
Required
Details
CI / quality (pull_request) Successful in 1m37s
Required
Details
CI / typecheck (pull_request) Successful in 1m41s
Required
Details
CI / security (pull_request) Successful in 1m48s
Required
Details
CI / benchmark-regression (pull_request) Failing after 1m32s
CI / integration_tests (pull_request) Successful in 4m44s
Required
Details
CI / e2e_tests (pull_request) Successful in 4m50s
CI / unit_tests (pull_request) Failing after 5m11s
Required
Details
CI / coverage (pull_request) Has been skipped
Required
Details
CI / docker (pull_request) Has been skipped
Required
Details
CI / status-check (pull_request) Failing after 3s
This pull request has changes conflicting with the target branch.
  • CONTRIBUTORS.md
View command line instructions

Manual merge helper

Use this merge commit message when completing the merge manually.

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin 8660-move-namespace-filter-inside-lock:8660-move-namespace-filter-inside-lock
git switch 8660-move-namespace-filter-inside-lock
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
cleveragents/cleveragents-core!11038
No description provided.