Crate text_core

§text-core

Shared text documents, tokenization, spans, and statistics for moritzbrantner-video-analysis.

Default builds are deterministic, local-first, and do not download models or invoke native inference/runtime tools.

§Feature flags

This crate currently defines no optional feature flags.

§Stable contract

TextDocument, OwnedTextDocument, TextSpan, token/sentence/paragraph records, text processing options, and the portable document/segment contracts are the stable data boundary for other crates. The portable contracts also carry optional rich metadata: source references, media timing, provenance, and annotation spans.

§Quality and limits

Segmentation is deterministic and Unicode-aware, but it is not a statistical NLP tokenizer. Higher-level linguistic quality belongs in text-linguistics.
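To make the distinction concrete, here is a minimal sketch of what rule-based, deterministic segmentation can look like. It is not the crate's implementation: `simple_word_spans` is a hypothetical helper written against the standard library only, and it assumes that a "word" is a maximal run of Unicode alphanumeric characters. The same input always produces the same ranges, and no model or language statistics are involved.

```rust
/// Hypothetical helper, NOT part of text-core.
/// Returns half-open byte ranges `(start, end)` for every maximal run of
/// Unicode alphanumeric characters in `text`.
fn simple_word_spans(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    // Byte offset where the current run began, if we are inside one.
    let mut run_start: Option<usize> = None;

    for (idx, ch) in text.char_indices() {
        if ch.is_alphanumeric() {
            // Open a new run only if none is in progress.
            if run_start.is_none() {
                run_start = Some(idx);
            }
        } else if let Some(start) = run_start.take() {
            // A non-alphanumeric character closes the current run.
            spans.push((start, idx));
        }
    }

    // Close a run that extends to the end of the text.
    if let Some(start) = run_start {
        spans.push((start, text.len()));
    }
    spans
}
```

Under these assumptions, the input "Rust-first multimodal analysis." yields four ranges covering Rust, first, multimodal, and analysis; the hyphen and the final period fall between ranges. A purely rule-based splitter like this has no notion of compounds, abbreviations, or language-specific word boundaries, which is the kind of gap a linguistically informed layer is meant to fill.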

§Example

use text_core::{build_annotation_graph, TextDocument, TextProcessingOptions};

let document = TextDocument::new("doc-1", "Rust-first multimodal analysis.");
let graph = build_annotation_graph(document.text, &TextProcessingOptions::default());

assert!(!graph.tokens.is_empty());
assert_eq!(graph.sentences.len(), 1);

§Package surface

Primary workflow: text.tokenize returns tokens, spans, script profile, and text statistics.
Workflow operations: text.statistics, text.normalize, text.tokenize, and text.boundaries.
Debug operations: describe inspects package metadata and operation support.
Runtime support: pure Rust, available through library, CLI, server, and WASM wrappers.
Sample output includes title, message, summary, result, and operation-specific fields such as tokens, stats, text, or words.
This crate does not download models, run native inference, or scan files.

§Related crates

text-lexical
text-linguistics
video-analysis-core

Re-exports§

pub use contracts::AsTextSegmentContract;
pub use contracts::IntoTextDocumentContract;
pub use contracts::TextAnnotationSpan;
pub use contracts::TextDocumentContract;
pub use contracts::TextProvenance;
pub use contracts::TextSegmentContract;
pub use contracts::TextSourceRef;
pub use contracts::TimebaseContract;
pub use contracts::TimestampContract;

Modules§

contracts
operations
surface: Library-owned runtime surface for text-core.

Structs§

AnnotatedParagraph: Data type for annotated paragraph.
AnnotatedSentence: Data type for annotated sentence.
AnnotationConfidence: Confidence score normalized into the inclusive range 0.0..=1.0.
AnnotationId: Data type for annotation identifier.
CanonicalToken: Data type for canonical token.
DetailedTextStats: Data type for detailed text stats.
GraphemeSpan: Data type for grapheme span.
OwnedTextDocument: Owned text document for storage, serialization, and cross-thread workflows.
Paragraph: Data type for paragraph.
ScriptProfile: Data type for script profile.
Sentence: Data type for sentence.
TextAnnotationGraph: Data type for text annotation graph.
TextBoundaryOptions: Data type for text boundary options.
TextDocument: Borrowed text document used as the lightweight boundary between text crates.
TextProcessingOptions: Data type for text processing options.
TextSpan: Half-open byte and character span into a UTF-8 text buffer.
TextSpanRef: Data type for text span ref.
TextStats: Data type for text stats.
Token: Data type for token.
WordSegment: Data type for word segment.
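The TextSpan entry above describes a half-open byte and character span into a UTF-8 buffer. The sketch below illustrates that idea with the standard library only. `SpanSketch` and its methods are hypothetical names chosen for illustration; the field layout and constructor of the real TextSpan are not shown on this page and may differ.

```rust
/// Hypothetical illustration, NOT the real `TextSpan`.
/// Both ranges are half-open: `start` is included, `end` is excluded.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SpanSketch {
    byte_start: usize,
    byte_end: usize,
    char_start: usize,
    char_end: usize,
}

impl SpanSketch {
    /// Builds a span from a byte range, deriving the character offsets.
    /// Returns `None` if the range is reversed, out of bounds, or does not
    /// fall on UTF-8 character boundaries.
    fn from_byte_range(text: &str, byte_start: usize, byte_end: usize) -> Option<Self> {
        // `str::get` performs all validity checks without panicking.
        let slice = text.get(byte_start..byte_end)?;
        // Safe: `get` succeeded, so `byte_start` is a valid boundary.
        let char_start = text[..byte_start].chars().count();
        Some(Self {
            byte_start,
            byte_end,
            char_start,
            char_end: char_start + slice.chars().count(),
        })
    }

    /// Returns the covered text, or `None` if the span does not fit `text`.
    fn slice<'a>(&self, text: &'a str) -> Option<&'a str> {
        text.get(self.byte_start..self.byte_end)
    }
}
```

Keeping both offset kinds side by side is useful because byte offsets allow direct slicing of a Rust `&str`, while character offsets stay meaningful for consumers that count in Unicode scalar values. With half-open ranges, an empty span has equal start and end, and adjacent spans share a boundary without overlapping.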

Enums§

AnnotationProvenance: Variants describing annotation provenance.
TokenKind: Variants describing token kind.

Functions§

build_annotation_graph: Builds an annotation graph.
build_annotation_graph_from_parts: Builds an annotation graph from parts.
detailed_text_stats: Returns detailed text statistics.
detect_script_profile: Detects the script profile of a text.
normalize_text: Returns normalized text.
normalize_whitespace: Returns text with normalized whitespace.
segment_document_id: Returns the segment document identifier.
segment_graphemes: Segments text into graphemes.
segment_words: Segments text into words.
split_paragraphs: Splits text into paragraphs.
split_sentence_spans: Splits text into sentence spans.
split_sentences: Splits text into sentences.
text_stats: Returns text statistics.
tokenize: Tokenizes text.
tokenize_words: Tokenizes text into words.
word_counts: Returns word counts.
