//! Per-index content-hash cache for incremental reindex skip.
//!
//! Why: re-embedding an unchanged file is expensive (ONNX model inference).
//! Caching content hashes in memory (and mirroring them to redb via
//! `hash_cache::persist_batch`) lets subsequent reindexes skip every file
//! whose content did not change since the last run.
//!
//! What: `file_hashes()` is the process-global DashMap of per-index hash caches.
//! `hash_content` produces a stable SHA-256 fingerprint. `shrink_hashes_if_needed`
//! keeps the cache bounded.
//!
//! Test: covered indirectly by `reindex_walks_directory_and_emits_events` (the
//! second reindex run on an unchanged workspace must skip all files).
use crate::IndexId; // assumes `IndexId` is reachable from the crate root
use dashmap::DashMap;
use sha2::{Digest, Sha256};
use std::path::PathBuf;
use std::sync::{Arc, OnceLock};

/// Per-index ceiling on the content-hash cache (issue #75). Each entry holds
/// a `PathBuf` + 64-char hex SHA-256 string, so 200k entries ≈ 30–60 MB.
/// When exceeded we drain ~10% of the entries (DashMap has no ordering, so
/// the eviction set is arbitrary — those files are simply re-hashed on the
/// next reindex, which is the safe, correct fallback).
pub const MAX_FILE_HASHES_PER_INDEX: usize = 200_000;
/// Per-index, per-process content-hash cache. Used to skip reindexing files
/// whose content hasn't changed since the last reindex in this daemon's
/// lifetime.
///
/// Why: survives across `POST /indexes/:id/reindex` calls but not daemon
/// restarts (acceptable: cold start re-embeds everything anyway, and on warm
/// daemons the user expects "skip unchanged" behaviour).
/// What: `DashMap<IndexId, Arc<DashMap<PathBuf, String>>>` — one inner map per
/// index, accessed by `hashes_for`.
/// Test: the second pass of `reindex_walks_directory_and_emits_events` must
/// emit only `skip` events (all files hash-matched).
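pub fn file_hashes() -> &'static DashMap<IndexId, Arc<DashMap<PathBuf, String>>> {
    // Lazily initialised on first call. `OnceLock` is an assumed choice of
    // primitive; any once-only initialiser gives the same behaviour.
    static FILE_HASHES: OnceLock<DashMap<IndexId, Arc<DashMap<PathBuf, String>>>> =
        OnceLock::new();
    FILE_HASHES.get_or_init(DashMap::new)
}
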
/// Return the per-index hash cache, creating an empty one on first access.
///
/// Why: each index independently tracks its own file hashes so a forced
/// reindex of index A does not clear the cache for index B.
/// What: `entry(id).or_insert_with(...)` on the global `DashMap`.
/// Test: covered indirectly by all reindex tests.
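pub fn hashes_for(id: IndexId) -> Arc<DashMap<PathBuf, String>> {
    // Taking `id` by value is an assumption that mirrors `entry(id)` above.
    // The shard guard returned by `entry` is dropped at the end of the
    // statement, so callers only ever hold the cloned `Arc`.
    file_hashes()
        .entry(id)
        .or_insert_with(|| Arc::new(DashMap::new()))
        .value()
        .clone()
}
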
/// Drop ~10% of entries from `map` when above `MAX_FILE_HASHES_PER_INDEX`.
///
/// Why: prevents unbounded growth of the per-daemon content-hash cache
/// as a project keeps growing or files are renamed many times. The
/// hash cache is a pure speed optimisation (skip re-embed for unchanged
/// files), so evicting entries is always safe — affected files just get
/// re-hashed and re-embedded on the next reindex.
/// What: collects an arbitrary subset of keys and removes them. DashMap has
/// no insertion-order metadata so we can't do "true" LRU; arbitrary eviction
/// is acceptable for a cache whose miss penalty is just extra work.
/// Test: covered indirectly by the reindex test (oversizing not exercised).
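pub fn shrink_hashes_if_needed(map: &DashMap<PathBuf, String>) {
    let len = map.len();
    if len <= MAX_FILE_HASHES_PER_INDEX {
        return;
    }
    // Assumed sizing: ~10% of the current length, at least one entry.
    let evict = (len / 10).max(1);
    // Collect the keys first. Removing while an iterator still holds a shard
    // lock can deadlock.
    let victims: Vec<PathBuf> = map
        .iter()
        .take(evict)
        .map(|entry| entry.key().clone())
        .collect();
    for key in victims {
        map.remove(&key);
    }
}
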
/// Stable content fingerprint for the "skip unchanged file" optimization.
///
/// Why: SHA-256 is collision-resistant and stable across processes, builds,
/// and Rust versions. `DefaultHasher` (currently SipHash) has an unspecified
/// algorithm that may change between Rust releases, and its 64-bit output
/// has weaker collision properties. That is fine for `HashMap` keys but
/// unsafe for content fingerprinting, where a collision silently skips a
/// real edit.
/// What: SHA-256 of the file's UTF-8 bytes, hex-encoded.
/// Test: see `reindex_walks_directory_and_emits_events` — a re-run of the
/// reindex with unchanged files must mark them as skipped (proves the hash
/// is stable across two invocations within the same process).
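pub fn hash_content(content: &str) -> String {
    let digest = Sha256::digest(content.as_bytes());
    // Lowercase hex, two characters per byte: 64 characters for SHA-256.
    // Formatting by hand is an assumption; a hex-encoding crate would work too.
    digest.iter().map(|byte| format!("{byte:02x}")).collect()
}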