BUG-HUNT: [resource] StateManager._save_checkpoint non-atomic write — checkpoint file is corrupt if process is killed mid-write; same-second saves silently overwrite each other #6531

Open
opened 2026-04-09 21:15:29 +00:00 by HAL9000 · 0 comments

Bug Report: [resource] — Non-Atomic Checkpoint Write in StateManager._save_checkpoint

Severity Assessment

  • Impact: (1) If the process is killed or crashes while write_text() is executing, the checkpoint file on disk is left partially written, and nothing on disk marks it as incomplete. On next startup, load_checkpoint() will raise a json.JSONDecodeError (or load truncated/garbage state) with no graceful recovery path. (2) If two checkpoint saves land within the same wall-clock second, the later write silently overwrites the earlier one, losing a checkpoint without any warning.
  • Likelihood: Medium. File corruption can happen whenever a crash interrupts a write; timestamp collisions occur under high update rates (for example, checkpointing every 10 updates with fast nodes).
  • Priority: Medium

Location

  • File: src/cleveragents/langgraph/state.py
  • Class: StateManager
  • Method: _save_checkpoint
  • Lines: ~117–127

Description

_save_checkpoint() writes JSON directly to the target checkpoint filename in a single write_text() call. This is not atomic: if the process is interrupted mid-write (crash, SIGKILL, OOM), the resulting file is partially written. The load_checkpoint() method has no validation or error handling for this case — it calls json.loads() directly on the file content, which will raise json.JSONDecodeError on a truncated file.

Additionally, the timestamp is formatted with only second precision (%Y%m%d_%H%M%S). If update_count % checkpoint_interval == 0 is triggered twice within the same second (feasible when nodes execute rapidly), the second write uses the same filename and silently overwrites the first checkpoint.
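
The filename collision can be reproduced without touching the filesystem. The sketch below is illustrative only: checkpoint_name is a hypothetical helper that mirrors the f-string in _save_checkpoint, and the two datetimes are made-up values 300 ms apart within the same second.

```python
from datetime import datetime

SECOND_FMT = "%Y%m%d_%H%M%S"     # format used by the current code
MICRO_FMT = "%Y%m%d_%H%M%S_%f"   # format proposed in this report


def checkpoint_name(moment: datetime, fmt: str) -> str:
    """Hypothetical helper: build a filename the way _save_checkpoint does."""
    return f"checkpoint_{moment.strftime(fmt)}.json"


# Two saves 300 ms apart, inside the same wall-clock second (made-up times).
first = datetime(2026, 4, 9, 21, 15, 29, 100_000)
second = datetime(2026, 4, 9, 21, 15, 29, 400_000)

same_second_collides = (
    checkpoint_name(first, SECOND_FMT) == checkpoint_name(second, SECOND_FMT)
)
micro_collides = (
    checkpoint_name(first, MICRO_FMT) == checkpoint_name(second, MICRO_FMT)
)
print(same_second_collides, micro_collides)  # prints: True False
```

With second precision both saves map to checkpoint_20260409_211529.json, so the second write_text() call replaces the first file.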

Evidence

def _save_checkpoint(self) -> None:
    if not self.checkpoint_dir:
        return
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")   # second precision only!
    checkpoint_file = self.checkpoint_dir / f"checkpoint_{timestamp}.json"
    checkpoint_data = {
        "state": self.state.to_dict(),
        "timestamp": timestamp,
        "update_count": self.update_count,
    }
    checkpoint_file.write_text(                              # <-- not atomic!
        json.dumps(checkpoint_data, indent=2), encoding="utf-8"
    )

And load_checkpoint() has no corruption handling:

def load_checkpoint(self, checkpoint_file: Path) -> None:
    checkpoint_data = json.loads(checkpoint_file.read_text(encoding="utf-8"))  # crashes on corrupt file
    self.state = GraphState.from_dict(checkpoint_data["state"])

Expected Behavior

  1. Checkpoint writes should be atomic: write to a temp file in the same directory, then rename it to the target path (rename is atomic on POSIX when both paths are on the same filesystem).
  2. load_checkpoint() should catch json.JSONDecodeError and either raise a descriptive error or skip the corrupt file and try the next-latest checkpoint.
  3. Timestamp should include microseconds or a counter suffix so that saves within the same second get distinct filenames.
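
As a rough illustration of item 1, the write-then-rename pattern could be factored into a standalone helper. This is a sketch under assumptions, not the project's code: atomic_write_json is a hypothetical name. It uses os.replace rather than rename so that an existing target is also overwritten on Windows, and it adds an fsync so the data reaches disk before the new name becomes visible.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def atomic_write_json(target: Path, payload: dict[str, Any]) -> None:
    """Hypothetical helper: readers see the old file or the complete new one."""
    # The temp file lives next to the target because os.replace is only
    # atomic when source and destination are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # flush to disk before publishing the name
        os.replace(tmp_name, target)   # atomic swap into place
    except BaseException:
        # On any failure, remove the temp file so no partial data is left behind.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A crash before os.replace leaves at most a stray .tmp file, which the loader can ignore because it does not match checkpoint_*.json. Note that mkstemp creates the file readable and writable only by the current user, which may differ from the permissions write_text() would produce.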

Actual Behavior

  1. Process crash during write_text() leaves a partial (corrupt) checkpoint file.
  2. load_checkpoint() of a corrupt file raises an unhandled json.JSONDecodeError, crashing the caller.
  3. When two checkpoints are saved within the same second, the second silently overwrites the first, losing state history.
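
Item 2 is easy to confirm in isolation. The snippet below simulates an interrupted write by cutting a serialized payload in half; the payload contents are made up for illustration. Any strict prefix of a serialized JSON object fails to parse, because the closing brace is the last character written.

```python
import json

# Made-up checkpoint payload, serialized the same way _save_checkpoint does.
payload = json.dumps({"state": {"step": 3}, "update_count": 10}, indent=2)
truncated = payload[: len(payload) // 2]  # simulate a write cut off halfway

try:
    json.loads(truncated)
    outcome = "loaded"
except json.JSONDecodeError:
    outcome = "JSONDecodeError"

print(outcome)  # prints: JSONDecodeError
```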

Suggested Fix

def _save_checkpoint(self) -> None:
    if not self.checkpoint_dir:
        return
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")  # add microseconds
    checkpoint_file = self.checkpoint_dir / f"checkpoint_{timestamp}.json"
    tmp_file = checkpoint_file.with_suffix(".tmp")
    checkpoint_data = {
        "state": self.state.to_dict(),
        "timestamp": timestamp,
        "update_count": self.update_count,
    }
    tmp_file.write_text(json.dumps(checkpoint_data, indent=2), encoding="utf-8")
    tmp_file.rename(checkpoint_file)  # atomic on POSIX
    self.logger.debug("Saved checkpoint: %s", checkpoint_file)

And add error handling to load_checkpoint():

def load_checkpoint(self, checkpoint_file: Path) -> None:
    try:
        checkpoint_data = json.loads(checkpoint_file.read_text(encoding="utf-8"))
        # Look up "state" inside the try; otherwise the KeyError handler can never fire.
        state_data = checkpoint_data["state"]
    except (json.JSONDecodeError, KeyError) as exc:
        raise ValueError(f"Corrupt checkpoint file {checkpoint_file}: {exc}") from exc
    ...
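
The other option named under Expected Behavior, skipping a corrupt file and trying the next-latest checkpoint, might look roughly like the sketch below. It is a sketch under assumptions: load_latest_valid_checkpoint is a hypothetical function, not part of StateManager. It assumes checkpoint filenames follow the checkpoint_<timestamp>.json pattern shown above, so sorting the names also sorts them chronologically.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_latest_valid_checkpoint(checkpoint_dir: Path) -> Optional[dict[str, Any]]:
    """Hypothetical helper: newest checkpoint that parses and has "state"."""
    # Timestamped names sort chronologically, so reverse order is newest first.
    for candidate in sorted(checkpoint_dir.glob("checkpoint_*.json"), reverse=True):
        try:
            data = json.loads(candidate.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # ValueError covers json.JSONDecodeError and UnicodeDecodeError.
            continue
        if isinstance(data, dict) and "state" in data:
            return data
    return None
```

Returning None when no usable checkpoint exists lets the caller decide whether to start from a fresh state or fail loudly. Logging each skipped file would make silent data loss easier to spot.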

Category

resource

TDD Note

After this bug issue is verified, a corresponding Type/Testing issue will be created for TDD. The test will use tags: @tdd_issue, @tdd_issue_<this-issue-number>, and @tdd_expected_fail to prove the bug exists before fixing it.


Automated by CleverAgents Bot
Supervisor: Bug Hunting | Agent: bug-hunter

HAL9000 added this to the v3.2.0 milestone 2026-04-09 21:27:53 +00:00
Reference
cleveragents/cleveragents-core#6531