scanstate 0.2.0

Generic scan checkpoint, journal, and progress primitives for pause/resume workflows
# TOKIO-LEVEL Deep Audit: scanstate v0.1.1

**Audit Date:** 2026-03-26  
**Scope:** Single-crate analysis of checkpoint persistence, journal recovery, TOML serialization, thread safety, and corruption handling.  
**Auditor:** Automated analysis + manual code review

---

## Executive Summary

| Component | Status | Risk Level |
|-----------|--------|------------|
| Checkpoint Atomicity | ✅ SOUND | Low |
| Journal Recovery | ⚠️ PARTIAL | Medium |
| TOML Serialization | ⚠️ LIMITED | Low |
| Thread Safety | ❌ UNSOUND | **High** |
| Corruption Handling | ⚠️ PARTIAL | Medium |

**Critical Finding:** The crate is **NOT Tokio-safe** for concurrent checkpoint writes. Multiple async tasks writing checkpoints will cause data races and potential corruption.

---

## 1. Checkpoint Write Atomicity

### Question: Is checkpoint write atomic? What happens on crash mid-save?

### Analysis

The `ScanCheckpoint::save()` implementation (lines 139-174 in `src/checkpoint.rs`) uses a **write-to-temp + atomic rename** pattern:

```rust
pub fn save(&self, path: impl AsRef<Path>) -> Result<(), ScanStateError> {
    // 1. Create parent directories
    // 2. Serialize to JSON
    // 3. Generate unique temp path with atomic counter + PID
    let tmp_path = tmp_path_for(path);
    
    // 4. Write to temp file with cleanup guard
    let _guard = TmpGuard(&tmp_path);
    let mut file = fs::File::create(&tmp_path)?;
    file.write_all(&json)?;
    file.sync_data()?;  // <-- fsync before rename
    std::mem::forget(_guard);  // Prevent cleanup on success
    
    // 5. Atomic rename
    fs::rename(&tmp_path, path)?;
    Ok(())
}
```

### Crash Scenarios

| Crash Timing | Outcome | Data Integrity |
|--------------|---------|----------------|
| Before `write_all()` | Temp file incomplete or absent | Original checkpoint intact |
| During `write_all()` | Temp file partially written | Original checkpoint intact |
| After `write_all()`, before `sync_data()` | Temp file complete but unflushed | Original checkpoint intact (journal may recover) |
| After `sync_data()`, before `rename()` | Temp file complete and flushed | Original checkpoint intact |
| During `rename()` | **Depends on filesystem** | Atomic rename guarantees file consistency |
| After `rename()` | Checkpoint fully saved | New state persisted |

### Verdict: ✅ **SOUND**

The implementation correctly implements atomic writes:
- **TmpGuard** ensures cleanup on panic/unwind (RAII pattern)
- **`sync_data()`** ensures data reaches stable storage before rename
- **`fs::rename()`** is atomic on POSIX (same filesystem) and Windows
- **Unique temp paths** prevent collisions between processes/threads

### Limitations

```rust
// The temp counter uses Relaxed ordering - sufficient for uniqueness
// but doesn't provide happens-before relationships
static CHECKPOINT_TMP_COUNTER: AtomicU64 = AtomicU64::new(0);
```

`Ordering::Relaxed` is acceptable here since uniqueness only requires monotonic increments within a process, not cross-thread synchronization.
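The scheme can be sketched in isolation. This is a hypothetical re-creation of the counter-plus-PID naming (not the crate's exact code): each `fetch_add` returns a distinct value regardless of memory ordering, so `Relaxed` suffices for uniqueness.

```rust
use std::path::{Path, PathBuf};
use std::process;
use std::sync::atomic::{AtomicU64, Ordering};

// Process-local counter; Relaxed is enough because fetch_add hands out
// distinct values even without any happens-before relationship.
static TMP_COUNTER: AtomicU64 = AtomicU64::new(0);

fn tmp_path_for(path: &Path) -> PathBuf {
    let n = TMP_COUNTER.fetch_add(1, Ordering::Relaxed);
    let mut name = path
        .file_name()
        .map(|s| s.to_os_string())
        .unwrap_or_default();
    // PID separates concurrent processes; the counter separates threads.
    name.push(format!(".{}.{}.tmp", process::id(), n));
    path.with_file_name(name)
}

fn main() {
    let a = tmp_path_for(Path::new("/tmp/checkpoint.json"));
    let b = tmp_path_for(Path::new("/tmp/checkpoint.json"));
    assert_ne!(a, b); // distinct temp names within one process
}
```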

---

## 2. Journal Recovery After Interrupted Scans

### Question: Does journal recovery work after interrupted scans?

### Analysis

The `WriteAheadJournal` (lines 23-148 in `src/journal.rs`) implements newline-delimited JSON with two replay modes:

### Strict Mode (`replay()`)
```rust
pub fn replay(&self) -> Result<Vec<Entry>, ScanStateError> {
    // Fails fast on first corrupt entry
    entries.push(serde_json::from_slice(&buf)?);
}
```

### Lenient Mode (`replay_lenient()`)
```rust
pub fn replay_lenient(&self) -> Result<(Vec<Entry>, usize), ScanStateError> {
    // Skips corrupt entries, counts them
    match serde_json::from_slice::<Entry>(&buf) {
        Ok(entry) => entries.push(entry),
        Err(_) => corrupt_count += 1,
    }
}
```

### Recovery Scenarios

| Scenario | `replay()` | `replay_lenient()` |
|----------|------------|---------------------|
| Clean shutdown | ✅ All entries | ✅ All entries |
| Crash mid-append (incomplete line) | ❌ Fails | ✅ Skips partial line |
| Corrupt entry in middle | ❌ Fails | ✅ Recovers before/after |
| Truncated file (mid-line) | ❌ Fails | ⚠️ Counts as corrupt |
| Empty file | ✅ Empty vec | ✅ Empty vec |
| Missing file | ✅ Empty vec | ✅ Empty vec |
| Whitespace lines | ✅ Skipped | ✅ Skipped |
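The skip-and-count behavior behind `replay_lenient()` can be shown dependency-free. This sketch substitutes a trivial one-integer-per-line record format for JSON (so it needs no serde), but the control flow is the same: blank lines are skipped, unparseable lines are counted rather than fatal.

```rust
// Lenient replay over newline-delimited records: skip whitespace,
// count corrupt lines, keep everything that parses.
fn replay_lenient(data: &str) -> (Vec<u64>, usize) {
    let mut entries = Vec::new();
    let mut corrupt = 0;
    for line in data.lines() {
        let line = line.trim();
        if line.is_empty() {
            continue; // whitespace-only lines are ignored, not corrupt
        }
        match line.parse::<u64>() {
            Ok(v) => entries.push(v),
            Err(_) => corrupt += 1, // e.g. a line truncated mid-append
        }
    }
    (entries, corrupt)
}

fn main() {
    // "42x" stands in for a record truncated by a crash mid-append.
    let (entries, corrupt) = replay_lenient("1\n2\n42x\n3\n");
    assert_eq!(entries, vec![1, 2, 3]);
    assert_eq!(corrupt, 1);
}
```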

### Critical Gaps

```rust
// Line 52-55: Each append fsyncs - GOOD for durability
pub fn append(&self, entry: &Entry) -> Result<(), ScanStateError> {
    file.write_all(&buf)?;
    file.sync_all()?;  // Full fsync every entry
    Ok(())
}
```

**Performance Issue:** `sync_all()` on every append creates significant I/O overhead. For high-throughput scanners, this is a bottleneck.

**No Checkpoint Integration:** The journal and checkpoint are separate systems. There's no built-in coordination for:
- Checkpoint rotation after journal truncation
- Determining which journal entries are already reflected in checkpoint
- Replay-and-checkpoint workflow

### Verdict: ⚠️ **PARTIAL**

Recovery works for individual crash scenarios, but:
1. No built-in integration between checkpoint and journal
2. `replay()` fails fast without recovery options
3. Users must manually coordinate truncation with checkpointing

### Recommended Pattern

```rust
// NOT PROVIDED BY CRATE - users must implement:
fn recover_from_crash(journal_path: &Path, checkpoint_path: &Path) -> Result<ScanCheckpoint> {
    let journal = WriteAheadJournal::new(journal_path);
    let (entries, corrupt) = journal.replay_lenient()?;
    
    let mut checkpoint = ScanCheckpoint::load(checkpoint_path)?;
    for entry in entries {
        if entry.status == "completed" {
            checkpoint.mark_complete(&entry.target_id);
        }
    }
    checkpoint.save(checkpoint_path)?;
    journal.truncate()?;
    Ok(checkpoint)
}
```

---

## 3. TOML Serialization Correctness

### Question: Is the TOML serialization correct for all types?

### Analysis

TOML support is **limited to `CheckpointSettings`** only:

```rust
// src/checkpoint.rs lines 40-85
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct CheckpointSettings {
    pub scan_id: String,
    pub checkpoint_path: String,
    pub journal_path: Option<String>,
    pub total_targets: usize,
    pub sync_checkpoint: bool,
    pub flush_interval_secs: u64,
}
```

**Key Finding:** `ScanCheckpoint` (the main state) does **NOT** support TOML - only JSON.

### Type Compatibility Matrix

| Type | TOML Serialize | TOML Deserialize | Notes |
|------|---------------|------------------|-------|
| `CheckpointSettings` | ✅ | ✅ | All standard types |
| `ScanCheckpoint` | ❌ N/A | ❌ N/A | JSON only |
| `Entry` | ❌ N/A | ❌ N/A | JSON only |
| `ScanProgress` | ❌ N/A | ❌ N/A | JSON only |

### TOML Test Coverage

```rust
// Lines 388-425: Settings round-trip tests
#[test]
fn settings_round_trip() { /* ... */ }

#[test]
fn settings_from_toml_partial() { /* ... */ }
```

### Verdict: ⚠️ **LIMITED**

TOML serialization is **correct but narrow**:
- Only `CheckpointSettings` supports TOML
- No datetime types (common TOML pitfall)
- No heterogeneous arrays
- All types used are TOML-compatible (strings, numbers, bools)

### Gap: SystemTime Serialization

```rust
// src/progress.rs line 18
pub struct ScanProgress {
    pub start_time: SystemTime,  // Uses serde's SystemTime serialization
}
```

`SystemTime` serializes as a `(secs, nanos)` tuple in JSON. If TOML support were added for `ScanProgress`, this would need special handling since TOML datetime format differs.
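The wire shape can be demonstrated with the standard library alone. This sketch converts a `SystemTime` to and from the `(secs, nanos)` pair that the JSON representation uses; a TOML representation would instead want an RFC 3339 datetime, hence the need for a custom serializer if `ScanProgress` ever gained TOML support.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// SystemTime as a (seconds-since-epoch, subsecond-nanos) pair,
// mirroring the JSON wire shape.
fn to_wire(t: SystemTime) -> (u64, u32) {
    let d = t.duration_since(UNIX_EPOCH).unwrap_or(Duration::ZERO);
    (d.as_secs(), d.subsec_nanos())
}

fn from_wire(secs: u64, nanos: u32) -> SystemTime {
    UNIX_EPOCH + Duration::new(secs, nanos)
}

fn main() {
    let t = UNIX_EPOCH + Duration::new(1_700_000_000, 42);
    let (s, n) = to_wire(t);
    assert_eq!((s, n), (1_700_000_000, 42));
    assert_eq!(from_wire(s, n), t); // lossless round-trip
}
```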

---

## 4. Thread Safety Under Concurrent Checkpoint Writes

### Question: Is the crate thread-safe for concurrent checkpoint writes?

### Analysis

### Type Thread-Safety

| Type | `Send` | `Sync` | Internal Sync |
|------|--------|--------|---------------|
| `ScanCheckpoint` | ✅ | ✅ | **None** |
| `WriteAheadJournal` | ✅ | ✅ | **None** |
| `ScanProgress` | ✅ | ✅ | **None** |
| `CheckpointSettings` | ✅ | ✅ | N/A (immutable) |

### Critical Finding: NO INTERNAL SYNCHRONIZATION

```rust
// src/checkpoint.rs lines 100-105
pub struct ScanCheckpoint {
    pub scan_id: String,
    completed_targets: HashSet<String>,  // Standard HashSet - NOT thread-safe
}

// mark_complete takes &mut self - but no locking
pub fn mark_complete(&mut self, target_id: impl Into<String>) {
    self.completed_targets.insert(target_id.into());  // Race condition!
}
```

### Concurrent Write Scenarios

```rust
// DANGER: This pattern will CORRUPT data
let checkpoint = Arc::new(ScanCheckpoint::new("scan-1"));

// Spawn 10 tokio tasks, all writing
for i in 0..10 {
    let cp = checkpoint.clone();
    tokio::spawn(async move {
        // RACE CONDITION: Multiple concurrent writes
        cp.save(format!("checkpoint-{}.json", i)).unwrap();
    });
}
```

**Problems:**
1. `ScanCheckpoint` has no interior mutability - requires `&mut self` for updates
2. `save()` reads from `&self` but filesystem operations race
3. Temp file counter could collide under extreme contention
4. `WriteAheadJournal::append()` opens/closes file on each call - no file locking

### What The Adversarial Tests Show

```rust
// src/adversarial_tests.rs lines 32-58
fn adversarial_concurrent_mark_complete() {
    let checkpoint = Arc::new(std::sync::Mutex::new(ScanCheckpoint::new(...)));
    // ...
}
```

The adversarial tests **externally synchronize** with `std::sync::Mutex`. This proves the authors knew the types aren't thread-safe.

### Journal Concurrent Append

```rust
// src/adversarial_tests.rs lines 193-236
fn adversarial_concurrent_journal_append() {
    // Creates separate journal handles per thread
    let journal = WriteAheadJournal::new(path);
    // All threads append simultaneously - relies on OS file locking
}
```

**Result:** Test passes on Linux (OS-level append atomicity), but this is **NOT** guaranteed by the crate and may fail on other platforms or network filesystems.

### Verdict: ❌ **UNSOUND**

The crate is **NOT safe for concurrent use** without external synchronization:

| Scenario | Safe? | Required Mitigation |
|----------|-------|---------------------|
| Multiple threads, one checkpoint | ❌ No | External `Mutex<ScanCheckpoint>` |
| Multiple async tasks, one journal | ⚠️ Risky | External `Mutex<WriteAheadJournal>` |
| One writer, multiple readers | ✅ Yes | Atomic rename provides visibility |
| Concurrent checkpoint + journal | ⚠️ Risky | Coordinated locking required |

### Tokio-Specific Concerns

```rust
// HYPOTHETICAL - save() is not async today; this shows why an async
// version would be hazardous without internal synchronization
async fn update_checkpoint(cp: &ScanCheckpoint) {
    // Tokio may move the task between threads across await points,
    // and a shared &ScanCheckpoint gives no exclusive access here
    cp.save("checkpoint.json").await;  // if save() were async
}
```

Currently all I/O is **blocking** (`std::fs`), so in Tokio you must use `spawn_blocking`:

```rust
// Correct Tokio usage
let cp = Arc::new(Mutex::new(checkpoint));
tokio::task::spawn_blocking(move || {
    cp.lock().unwrap().save("checkpoint.json")
}).await??;
```

### Recommendations

1. **Document** that types are `Send` but not internally synchronized
2. **Provide** `tokio::sync::RwLock` wrapper example in docs
3. **Consider** adding `SyncCheckpoint` wrapper type with internal locking
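Recommendation 3 can be sketched with a toy checkpoint struct standing in for the real `ScanCheckpoint` (the `SyncCheckpoint` name and API below are hypothetical, not crate-provided). All mutation goes through an internal `Mutex`, so the wrapper is `Send + Sync` and safe to share across threads or Tokio tasks:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

// Toy stand-in for ScanCheckpoint's mutable state.
struct Checkpoint {
    completed: HashSet<String>,
}

// Hypothetical wrapper: interior mutability via Mutex, so callers
// need only &self and no external locking discipline.
struct SyncCheckpoint {
    inner: Mutex<Checkpoint>,
}

impl SyncCheckpoint {
    fn new() -> Self {
        Self { inner: Mutex::new(Checkpoint { completed: HashSet::new() }) }
    }

    fn mark_complete(&self, target: &str) {
        self.inner.lock().unwrap().completed.insert(target.to_string());
    }

    fn completed_count(&self) -> usize {
        self.inner.lock().unwrap().completed.len()
    }
}

fn main() {
    let cp = Arc::new(SyncCheckpoint::new());
    let handles: Vec<_> = (0..8)
        .map(|i| {
            let cp = cp.clone();
            // Concurrent mark_complete calls are serialized by the Mutex.
            std::thread::spawn(move || cp.mark_complete(&format!("target-{i}")))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(cp.completed_count(), 8);
}
```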

---

## 5. Corrupt/Truncated Checkpoint Files

### Question: What happens with corrupt/truncated checkpoint files?

### Analysis

### `ScanCheckpoint::load()` Behavior

```rust
// src/checkpoint.rs lines 181-189
pub fn load(path: impl AsRef<Path>) -> Result<Self, ScanStateError> {
    let bytes = fs::read(p)?;
    let wire: CheckpointWire = serde_json::from_slice(&bytes)?;  // Fails on corrupt
    Ok(Self { /* ... */ })
}
```

### Corruption Scenarios

| Scenario | `load()` Result | Recoverable? |
|----------|-----------------|--------------|
| Valid JSON | ✅ Ok(checkpoint) | N/A |
| Invalid JSON syntax | ❌ `ScanStateError::Serde` | No built-in recovery |
| Truncated JSON | ❌ `ScanStateError::Serde` | No built-in recovery |
| Wrong JSON structure | ❌ `ScanStateError::Serde` | No built-in recovery |
| Empty file | ❌ `ScanStateError::Serde` | No |
| File permissions | ❌ `ScanStateError::Io` | Fix permissions |
| Missing file | ❌ `ScanStateError::Io` | Use `load_or_new()` |

### Comparison: Checkpoint vs Journal

| Feature | Checkpoint | Journal |
|---------|-----------|---------|
| Strict load | `load()` | `replay()` |
| Lenient load | **Not implemented** | `replay_lenient()` |
| Partial recovery | ❌ None | ✅ Skips corrupt lines |
| Corruption detection | ❌ JSON parse fails | ✅ Counts corrupt entries |

### Gap: No Lenient Checkpoint Loading

Unlike the journal, checkpoint has **no recovery mode**:

```rust
// Journal has this - Checkpoint does NOT
pub fn replay_lenient(&self) -> Result<(Vec<Entry>, usize), ScanStateError>;

// Would be useful for checkpoints:
pub fn load_lenient(&self) -> Result<(ScanCheckpoint, Vec<CorruptTarget>), ScanStateError>;
```

### Adversarial Test Coverage

```rust
// src/adversarial_tests.rs lines 339-353
fn adversarial_corrupted_checkpoint_recovery() {
    fs::write(&path, r#"{"scan_id": "test", "completed_targets": ["a", "b", "#);
    let result = ScanCheckpoint::load(&path);
    assert!(result.is_err());  // Just confirms it fails - no recovery
}
```

### Verdict: ⚠️ **PARTIAL**

- **Journal:** Robust corruption handling with lenient mode
- **Checkpoint:** All-or-nothing - single corrupt byte loses entire checkpoint

### Workaround for Users

```rust
// Users must implement checkpoint backup/rotation
fn load_checkpoint_with_backup(path: &Path) -> Result<ScanCheckpoint> {
    match ScanCheckpoint::load(path) {
        Ok(cp) => Ok(cp),
        Err(_) => {
            // Try backup
            ScanCheckpoint::load(path.with_extension("json.bak"))
        }
    }
}
```

---

## Summary Table

| Concern | Implementation | Risk |
|---------|---------------|------|
| Atomic write | Write-temp + fsync + rename | ✅ Low |
| Crash recovery | Temp guard + atomic rename | ✅ Low |
| Journal durability | Per-entry fsync | ✅ Correct but slow |
| Journal recovery | Strict + lenient modes | ⚠️ Partial |
| TOML support | Settings only | ⚠️ Limited |
| Thread safety | None (external lock required) | ❌ **High** |
| Corrupt checkpoint | Total failure | ⚠️ Medium |
| Corrupt journal | Lenient recovery | ✅ Good |

---

## Recommendations

### Immediate (Before Production Use)

1. **Wrap all checkpoint operations in `tokio::sync::RwLock` or `std::sync::Mutex`**
2. **Keep checkpoint backups** - no built-in recovery for corrupt checkpoints
3. **Use `replay_lenient()`** for journal recovery, not strict `replay()`

### Crate Improvements

1. Add `tokio` feature with async I/O and internal synchronization
2. Implement `load_lenient()` for checkpoint partial recovery
3. Add checkpoint/journal integration helpers
4. Document thread-safety requirements clearly
5. Consider batch journal writes with periodic fsync for performance

---

## Appendix: Code Paths

| Function | File | Lines | Key Behavior |
|----------|------|-------|--------------|
| `ScanCheckpoint::save()` | `checkpoint.rs` | 139-174 | Atomic write via temp file |
| `ScanCheckpoint::load()` | `checkpoint.rs` | 181-189 | Strict JSON parse |
| `WriteAheadJournal::append()` | `journal.rs` | 40-57 | Per-entry fsync |
| `WriteAheadJournal::replay()` | `journal.rs` | 67-87 | Strict parse |
| `WriteAheadJournal::replay_lenient()` | `journal.rs` | 97-121 | Skip corrupt entries |
| `tmp_path_for()` | `checkpoint.rs` | 204-209 | Atomic counter + PID |

---

*End of Audit Report*