Module canonicalize

Expand description

Text canonicalization for consistent embedding input.

Delegates to frankensearch::DefaultCanonicalizer for the full preprocessing pipeline (NFC normalization, markdown stripping, code block collapsing, whitespace normalization, low-signal filtering, and truncation).

This module adds content hashing on top of the shared canonicalization logic.

§Example

use crate::search::canonicalize::{canonicalize_for_embedding, content_hash};

let raw = "**Hello** world!\n\n```rust\nfn main() {}\n```";
let canonical = canonicalize_for_embedding(raw);
let hash = content_hash(&canonical);

Constants§

CODE_HEAD_LINES: Maximum lines to keep from the beginning of a code block.
CODE_TAIL_LINES: Maximum lines to keep from the end of a code block.
MAX_EMBED_CHARS: Maximum characters to keep after canonicalization.

Functions§

canonicalize_for_embedding: Canonicalize text for embedding.
content_hash: Compute SHA256 content hash of text.
content_hash_hex: Compute SHA256 content hash as hex string.
is_hard_message_noise: Noise we can safely skip during indexing.
is_search_noise_text: Noise we should suppress from search results.
is_system_prompt_text: Return true when content looks like an injected prompt/instructions block.
is_tool_acknowledgement: Return true when text is a low-value acknowledgement/tool confirmation.
query_requests_system_prompt: Return true when a query explicitly asks for prompt/instructions content.

Module canonicalize

Module canonicalize Copy item path

§Example

Constants§

Functions§

Module canonicalize