
Crate text_core


§text-core

Shared text documents, tokenization, spans, and statistics for moritzbrantner-video-analysis.

Default builds are deterministic, local-first, and do not download models or invoke native inference/runtime tools.

§Feature flags

  • No optional feature flags today.

§Stable contract

TextDocument, OwnedTextDocument, TextSpan, token/sentence/paragraph records, text processing options, and the portable document/segment contracts are the stable data boundary for other crates. The portable contracts also carry optional rich metadata: source references, media timing, provenance, and annotation spans.

§Quality and limits

Segmentation is deterministic and Unicode-aware, but it is not a statistical NLP tokenizer. Higher-level linguistic quality belongs in text-linguistics.
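To make the distinction concrete, here is a minimal rule-based sentence splitter that uses only the standard library. It is a sketch of what "deterministic, not statistical" can look like, not text-core's implementation: the function name `split_sentences_sketch` and the boundary rule (a terminator followed by whitespace or end of input) are assumptions made for illustration.

```rust
/// Hypothetical helper, not part of text-core.
/// Splits at '.', '!' or '?' when followed by whitespace or end of input.
fn split_sentences_sketch(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0; // byte offset where the current sentence begins
    let mut chars = text.char_indices().peekable();

    while let Some((idx, ch)) = chars.next() {
        if matches!(ch, '.' | '!' | '?') {
            // Only treat the terminator as a boundary if nothing or
            // whitespace follows, so "3.14" stays in one piece.
            let at_boundary = match chars.peek() {
                None => true,
                Some(&(_, next)) => next.is_whitespace(),
            };
            if at_boundary {
                let end = idx + ch.len_utf8();
                let sentence = text[start..end].trim();
                if !sentence.is_empty() {
                    sentences.push(sentence);
                }
                start = end;
            }
        }
    }

    // Keep any trailing text that lacks a terminator.
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}

fn main() {
    for sentence in split_sentences_sketch("Dr. Smith arrived. It was late!") {
        println!("{sentence}");
    }
}
```

The same input always yields the same output, and slicing on `char_indices` offsets keeps multi-byte characters intact. The fixed rule also shows the kind of limit noted above: it splits after "Dr." because it has no notion of abbreviations.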

§Example

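use text_core::{build_annotation_graph, TextDocument, TextProcessingOptions};

// Wrap the raw text in a borrowed document with an identifier.
let document = TextDocument::new("doc-1", "Rust-first multimodal analysis.");
// Build the annotation graph using the default processing options.
let graph = build_annotation_graph(document.text, &TextProcessingOptions::default());

// The input is a single sentence, so exactly one sentence is expected.
assert!(!graph.tokens.is_empty());
assert_eq!(graph.sentences.len(), 1);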

§Package surface

  • Primary workflow: text.tokenize returns tokens, spans, script profile, and text statistics.
  • Workflow operations: text.statistics, text.normalize, text.tokenize, and text.boundaries.
  • Debug operations: describe inspects package metadata and operation support.
  • Runtime support: pure Rust, available through library, CLI, server, and WASM wrappers.
  • Sample output includes title, message, summary, result, and operation-specific fields such as tokens, stats, text, or words.
  • This crate does not download models, run native inference, or scan files.
  • Related packages: text-lexical, text-linguistics, and video-analysis-core.

Re-exports§

pub use contracts::AsTextSegmentContract;
pub use contracts::IntoTextDocumentContract;
pub use contracts::TextAnnotationSpan;
pub use contracts::TextDocumentContract;
pub use contracts::TextProvenance;
pub use contracts::TextSegmentContract;
pub use contracts::TextSourceRef;
pub use contracts::TimebaseContract;
pub use contracts::TimestampContract;

Modules§

contracts
operations
surface
Library-owned runtime surface for text-core.

Structs§

AnnotatedParagraph
Data type for annotated paragraph.
AnnotatedSentence
Data type for annotated sentence.
AnnotationConfidence
Confidence score normalized into the inclusive range 0.0..=1.0.
AnnotationId
Data type for annotation identifier.
CanonicalToken
Data type for canonical token.
DetailedTextStats
Data type for detailed text stats.
GraphemeSpan
Data type for grapheme span.
OwnedTextDocument
Owned text document for storage, serialization, and cross-thread workflows.
Paragraph
Data type for paragraph.
ScriptProfile
Data type for script profile.
Sentence
Data type for sentence.
TextAnnotationGraph
Data type for text annotation graph.
TextBoundaryOptions
Data type for text boundary options.
TextDocument
Borrowed text document used as the lightweight boundary between text crates.
TextProcessingOptions
Data type for text processing options.
TextSpan
Half-open byte and character span into a UTF-8 text buffer.
TextSpanRef
Data type for text span ref.
TextStats
Data type for text stats.
Token
Data type for token.
WordSegment
Data type for word segment.
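
Two of the descriptions above are specific enough to sketch: the half-open byte and character span (TextSpan) and the confidence score kept within 0.0..=1.0 (AnnotationConfidence). The types below are hypothetical stand-ins, not the crate's definitions. The field names, the `find` and `slice` helpers, and the choice to clamp out-of-range values (and to map NaN to 0.0) are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for a half-open span; field names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SpanSketch {
    byte_start: usize,
    byte_end: usize, // exclusive
    char_start: usize,
    char_end: usize, // exclusive
}

impl SpanSketch {
    /// Locates the first occurrence of `needle` and records both offset kinds.
    fn find(text: &str, needle: &str) -> Option<Self> {
        let byte_start = text.find(needle)?;
        let byte_end = byte_start + needle.len();
        // Character offsets count Unicode scalar values, not bytes.
        let char_start = text[..byte_start].chars().count();
        let char_end = char_start + needle.chars().count();
        Some(Self { byte_start, byte_end, char_start, char_end })
    }

    /// Byte offsets slice the UTF-8 buffer directly.
    fn slice<'a>(&self, text: &'a str) -> &'a str {
        &text[self.byte_start..self.byte_end]
    }
}

/// Assumed normalization: clamp into 0.0..=1.0 and treat NaN as 0.0.
fn clamp_confidence(raw: f32) -> f32 {
    if raw.is_nan() {
        0.0
    } else {
        raw.clamp(0.0, 1.0)
    }
}

fn main() {
    // "na\u{ef}ve caf\u{e9}" is "naive cafe" with two accented letters,
    // each of which occupies two bytes in UTF-8.
    let text = "na\u{ef}ve caf\u{e9}";
    let span = SpanSketch::find(text, "caf\u{e9}").expect("needle is present");
    println!("{span:?} -> {}", span.slice(text));
    println!("{}", clamp_confidence(1.7));
}
```

Byte and character offsets diverge as soon as the text contains a multi-byte character, which is a plausible reason to carry both. With half-open ranges, `end - start` gives the length directly and adjacent spans share a boundary without overlapping.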

Enums§

AnnotationProvenance
Variants describing annotation provenance.
TokenKind
Variants describing token kind.

Functions§

build_annotation_graph
Builds an annotation graph.
build_annotation_graph_from_parts
Builds an annotation graph from parts.
detailed_text_stats
Returns detailed text statistics.
detect_script_profile
Detects the script profile of a text.
normalize_text
Returns normalized text.
normalize_whitespace
Returns text with normalized whitespace.
segment_document_id
Returns the segment document identifier.
segment_graphemes
Segments text into graphemes.
segment_words
Segments text into words.
split_paragraphs
Splits text into paragraphs.
split_sentence_spans
Splits text into sentence spans.
split_sentences
Splits text into sentences.
text_stats
Returns text statistics.
tokenize
Tokenizes text.
tokenize_words
Tokenizes text into words.
word_counts
Returns word counts.