Skip to main content

Module canonicalize

Module canonicalize 

Source
Expand description

Text canonicalization for consistent embedding input.

Delegates to frankensearch::DefaultCanonicalizer for the full preprocessing pipeline (NFC normalization, markdown stripping, code block collapsing, whitespace normalization, low-signal filtering, and truncation).

This module adds content hashing on top of the shared canonicalization logic.

§Example

use crate::search::canonicalize::{canonicalize_for_embedding, content_hash};

let raw = "**Hello** world!\n\n```rust\nfn main() {}\n```";
let canonical = canonicalize_for_embedding(raw);
let hash = content_hash(&canonical);

Constants§

CODE_HEAD_LINES
Maximum lines to keep from the beginning of a code block.
CODE_TAIL_LINES
Maximum lines to keep from the end of a code block.
MAX_EMBED_CHARS
Maximum characters to keep after canonicalization.

Functions§

canonicalize_for_embedding
Canonicalize text for embedding.
content_hash
Compute SHA256 content hash of text.
content_hash_hex
Compute SHA256 content hash as hex string.
is_hard_message_noise
Noise we can safely skip during indexing.
is_search_noise_text
Noise we should suppress from search results.
is_system_prompt_text
Return true when content looks like an injected prompt/instructions block.
is_tool_acknowledgement
Return true when text is a low-value acknowledgement/tool confirmation.
query_requests_system_prompt
Return true when a query explicitly asks for prompt/instructions content.