Text canonicalization for consistent embedding input.
Delegates to frankensearch::DefaultCanonicalizer for the full preprocessing
pipeline (NFC normalization, markdown stripping, code block collapsing,
whitespace normalization, low-signal filtering, and truncation).
This module adds content hashing on top of the shared canonicalization logic.
§Example
use crate::search::canonicalize::{canonicalize_for_embedding, content_hash};
let raw = "**Hello** world!\n\n```rust\nfn main() {}\n```";
let canonical = canonicalize_for_embedding(raw);
let hash = content_hash(&canonical);
Constants§
- CODE_HEAD_LINES - Maximum lines to keep from the beginning of a code block.
- CODE_TAIL_LINES - Maximum lines to keep from the end of a code block.
- MAX_EMBED_CHARS - Maximum characters to keep after canonicalization.
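The sketch below illustrates one way the three constants above could be applied: keep the head and tail of a code block, then cap the final character count. It is not the frankensearch implementation. The constant values, the helper names `collapse_code_block` and `truncate_chars`, and the omission marker format are all assumptions made for illustration.

```rust
// Hypothetical values for illustration; the real constants may differ.
const CODE_HEAD_LINES: usize = 3;
const CODE_TAIL_LINES: usize = 2;
const MAX_EMBED_CHARS: usize = 200;

/// Keep the first CODE_HEAD_LINES and last CODE_TAIL_LINES lines of a code
/// block body and replace the middle with a marker line (marker text is an
/// assumption, not the library's format).
fn collapse_code_block(body: &str) -> String {
    let lines: Vec<&str> = body.lines().collect();
    // Short blocks are kept whole.
    if lines.len() <= CODE_HEAD_LINES + CODE_TAIL_LINES {
        return lines.join("\n");
    }
    let omitted = lines.len() - CODE_HEAD_LINES - CODE_TAIL_LINES;
    let mut kept: Vec<String> = lines[..CODE_HEAD_LINES]
        .iter()
        .map(|l| l.to_string())
        .collect();
    kept.push(format!("[{} lines omitted]", omitted));
    kept.extend(
        lines[lines.len() - CODE_TAIL_LINES..]
            .iter()
            .map(|l| l.to_string()),
    );
    kept.join("\n")
}

/// Truncate to at most MAX_EMBED_CHARS characters. Iterating over chars
/// rather than bytes avoids cutting a multi-byte UTF-8 sequence in half.
fn truncate_chars(text: &str) -> String {
    text.chars().take(MAX_EMBED_CHARS).collect()
}
```

With these assumed values, an eight-line block keeps its first three and last two lines, and any canonicalized text longer than 200 characters is cut at a character boundary.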
Functions§
- canonicalize_for_embedding - Canonicalize text for embedding.
- content_hash - Compute SHA256 content hash of text.
- content_hash_hex - Compute SHA256 content hash as hex string.
- is_hard_message_noise - Noise we can safely skip during indexing.
- is_search_noise_text - Noise we should suppress from search results.
- is_system_prompt_text - Return true when content looks like an injected prompt/instructions block.
- is_tool_acknowledgement - Return true when text is a low-value acknowledgement/tool confirmation.
- query_requests_system_prompt - Return true when a query explicitly asks for prompt/instructions content.
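As a rough illustration of the relationship between content_hash and content_hash_hex, the sketch below hex-encodes a digest given as raw bytes. It assumes the digest is available as a byte slice (a SHA256 digest is 32 bytes); the helper name `to_hex` is hypothetical, and the actual function signatures in this module may differ.

```rust
/// Encode digest bytes as a lowercase hex string, two characters per byte.
/// `to_hex` is a hypothetical helper, not part of this module's API.
fn to_hex(digest: &[u8]) -> String {
    let mut out = String::with_capacity(digest.len() * 2);
    for byte in digest {
        // {:02x} pads single-digit values with a leading zero.
        out.push_str(&format!("{:02x}", byte));
    }
    out
}
```

A 32-byte digest yields a 64-character string under this encoding.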