//! # scankit — walk + watch + filter directory trees.
//!
//! `scankit` is the shared scanner that Tauri / Iced / native
//! desktop apps reach for when they need to enumerate user files.
//! Its job is small but easy to get wrong:
//!
//! 1. Walk a directory tree (`walkdir` under the hood).
//! 2. Skip what the user said to skip: `.DS_Store`, `node_modules`,
//!    `.git`, `*.log`, anything matching the configured glob set.
//! 3. Drop oversized files before you ever read them; a rogue
//!    50 GB SQLite database shouldn't take your indexer offline.
//! 4. (Behind the `watch` feature) keep watching the tree and
//!    emit change events as files are added / modified / removed.
//!
//! What `scankit` deliberately does NOT do:
//!
//! - Parse files. Use [`mdkit`](https://crates.io/crates/mdkit) or
//!   bring your own. `scankit` hands you `ScanEntry`s and gets out
//!   of the way.
//! - Schema extraction, search indexing, embedding generation.
//!   Those are the layers that consume `scankit`'s output.
//! - PII redaction, secrets scanning. Privacy policy is the
//!   embedding application's concern.
//!
//! ## Quick start
//!
//! ```no_run
//! use scankit::{Scanner, ScanConfig};
//! use std::path::Path;
//!
//! let scanner = Scanner::new(
//!     ScanConfig::default()
//!         .max_file_size_bytes(50 * 1024 * 1024) // 50 MB cap
//!         .add_exclude("**/.git/**")?
//!         .add_exclude("**/node_modules/**")?
//!         .add_exclude("**/.DS_Store")?,
//! )?;
//!
//! for result in scanner.walk(Path::new("/Users/me/Documents")) {
//!     match result {
//!         Ok(entry) => println!("{}: {} bytes", entry.path.display(), entry.size_bytes),
//!         Err(e) => eprintln!("scan error: {e}"),
//!     }
//! }
//! # Ok::<(), scankit::Error>(())
//! ```
//!
//! ## Why a separate crate
//!
//! Every "index files on the user's machine" project rebuilds the
//! same five hundred lines of walkdir-with-excludes-and-size-cap
//! glue, and every project gets it slightly wrong. `scankit` ships
//! it once, with the edge cases (symlink loops, permission denials,
//! mid-walk concurrent deletes) handled in one place.
//!
//! ## Stability commitment (v0.3+)
//!
//! v0.3 marks the **API stability candidate** for 1.0. The
//! following surface is committed to and will only change with a
//! major version bump:
//!
//! - [`Scanner`] construction + dispatch: `new`, `walk`, `scan`
//!   (under the `watch` feature), `config`. Future trait methods
//!   land with default impls so existing callers don't break.
//! - [`ScanConfig`] field set + the builder methods
//!   (`max_file_size_bytes`, `follow_symlinks`, `add_exclude`).
//!   Marked `#[non_exhaustive]` so we can add fields without
//!   major bumps.
//! - [`ScanEntry`], [`ScanEvent`], [`Error`] structs + enums.
//!   All `#[non_exhaustive]` for forward-compat; pattern-matchers
//!   must include a wildcard arm.
//! - The lazy `Iterator<Item = Result<ScanEntry>>` shape returned
//!   by [`Scanner::walk`].
//! - The `Iterator<Item = ScanEvent>` shape returned by
//!   [`Scanner::scan`] under the `watch` feature, including the
//!   `Initial` → `InitialComplete` → live-events lifecycle.
//! - Feature flag names: `walk`, `watch`.
//!
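//! As a rough sketch of what the wildcard-arm rule and the event
//! lifecycle look like to a consumer, the example below matches on a
//! local stand-in enum instead of the real [`ScanEvent`]. The `Event`
//! type, its payloads, the `Modified` variant, and the `tally` helper
//! are all assumptions made for illustration; they are not part of
//! the committed API. `#[non_exhaustive]` only forces the wildcard arm
//! on downstream crates, so it is written out by hand here.
//!
//! ```rust
//! use std::path::PathBuf;
//!
//! // Hypothetical stand-in for a `#[non_exhaustive]` event enum.
//! #[non_exhaustive]
//! enum Event {
//!     Initial(PathBuf),
//!     InitialComplete,
//!     Modified(PathBuf),
//! }
//!
//! // Returns (files seen in the initial walk, live changes after it).
//! fn tally(events: &[Event]) -> (usize, usize) {
//!     let mut initial = 0;
//!     let mut live = 0;
//!     let mut seen_complete = false;
//!     for event in events {
//!         match event {
//!             Event::Initial(_) => initial += 1,
//!             Event::InitialComplete => seen_complete = true,
//!             Event::Modified(_) if seen_complete => live += 1,
//!             // Downstream crates need this arm: variants added in a
//!             // later minor version land here instead of failing to compile.
//!             _ => {}
//!         }
//!     }
//!     (initial, live)
//! }
//! ```
//!
//! Ignoring unknown variants is only one possible policy; a caller
//! could just as well log them.
//!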
//! The following are **implementation details** and may change in
//! minor versions:
//!
//! - The internal layout of [`Scanner`] / [`ScanWalkIter`] /
//!   [`ScanStream`] (private fields, helper methods).
//! - The exact threading model of [`Scanner::scan`] (currently one
//!   short-lived initial-walk thread + the `notify` watcher's own
//!   threads; could change as `notify` evolves).
//! - The exact set of filesystem-event types translated to
//!   [`ScanEvent`] variants (`notify` itself is platform-specific
//!   and we follow upstream).
//!
//! 1.0 will be cut once the API is exercised by at least one
//! downstream production user. [Sery Link](https://sery.ai) is
//! the canonical integration target.
use std::path::PathBuf;
use std::time::SystemTime;
pub use ;
pub use ;
pub use ;
// ---------------------------------------------------------------------------
// ScanEntry — the unit of output
// ---------------------------------------------------------------------------
/// One file produced by a successful walk. Directories are not
/// surfaced — `Scanner` recurses into them silently. Symlinks are
/// dereferenced when [`ScanConfig::follow_symlinks`] is true and
/// emitted as the target file; otherwise they're skipped.
///
/// `#[non_exhaustive]` so we can grow the struct (e.g. add inode /
/// content hash) in minor versions: external crates cannot build it
/// with a struct literal, so new fields never break their code.
// ---------------------------------------------------------------------------
// ScanConfig — the policy
// ---------------------------------------------------------------------------
/// Configuration for a [`Scanner`]. Construct via [`ScanConfig::default`]
/// then layer on options with the builder methods
/// (`max_file_size_bytes`, `follow_symlinks`, `add_exclude`), or build
/// from a struct literal within this crate.
///
/// `#[non_exhaustive]` — same forward-compat reasoning as
/// [`ScanEntry`].