§text-core
Shared text documents, tokenization, spans, and statistics for moritzbrantner-video-analysis.
Default builds are deterministic, local-first, and do not download models or invoke native inference/runtime tools.
§Feature flags
- No optional feature flags today.
§Stable contract
TextDocument, OwnedTextDocument, TextSpan, token/sentence/paragraph
records, text processing options, and the portable document/segment contracts
are the stable data boundary for other crates. The portable contracts also
carry optional rich metadata: source references, media timing, provenance, and
annotation spans.
§Quality and limits
Segmentation is deterministic and Unicode-aware, but it is not a statistical NLP
tokenizer. Higher-level linguistic quality belongs in text-linguistics.
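To make the distinction concrete, here is a minimal sketch of rule-based, deterministic segmentation using only the standard library. It is not text-core's implementation: `simple_word_spans` is a hypothetical helper, and it is Unicode-aware only in the sense that `char::is_alphanumeric` follows Unicode character properties. It does not apply the full Unicode word-boundary rules.

```rust
/// Hypothetical helper (not part of text-core): returns half-open byte ranges
/// `(start, end)` of maximal alphanumeric runs in `text`.
/// The same input always yields the same spans; no model or lookup is involved.
pub fn simple_word_spans(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    // Byte offset where the current word started, if we are inside a word.
    let mut start: Option<usize> = None;

    for (offset, ch) in text.char_indices() {
        if ch.is_alphanumeric() {
            if start.is_none() {
                start = Some(offset);
            }
        } else if let Some(begin) = start.take() {
            // A non-alphanumeric character closes the current word.
            spans.push((begin, offset));
        }
    }

    // Close a word that runs to the end of the buffer.
    if let Some(begin) = start {
        spans.push((begin, text.len()));
    }
    spans
}
```

Because the offsets come from `char_indices`, every span falls on a UTF-8 character boundary, so `&text[start..end]` is safe to slice. A rule like this splits "Rust-first" into two words and knows nothing about morphology or context, which is the kind of limit the paragraph above describes.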
§Example
use text_core::{build_annotation_graph, TextDocument, TextProcessingOptions};
let document = TextDocument::new("doc-1", "Rust-first multimodal analysis.");
let graph = build_annotation_graph(document.text, &TextProcessingOptions::default());
assert!(!graph.tokens.is_empty());
assert_eq!(graph.sentences.len(), 1);
§Package surface
- Primary workflow: text.tokenize returns tokens, spans, script profile, and text statistics.
- Workflow operations: text.statistics, text.normalize, text.tokenize, and text.boundaries.
- Debug operations: describe inspects package metadata and operation support.
- Runtime support: pure Rust, available through library, CLI, server, and WASM wrappers.
- Sample output includes title, message, summary, result, and the operation-specific fields such as tokens, stats, text, or words.
- This crate does not download models, run native inference, or scan files.
§Related crates
- text-lexical
- text-linguistics
- video-analysis-core
Re-exports§
pub use contracts::AsTextSegmentContract;
pub use contracts::IntoTextDocumentContract;
pub use contracts::TextAnnotationSpan;
pub use contracts::TextDocumentContract;
pub use contracts::TextProvenance;
pub use contracts::TextSegmentContract;
pub use contracts::TextSourceRef;
pub use contracts::TimebaseContract;
pub use contracts::TimestampContract;
Modules§
- contracts
- operations
- surface: Library-owned runtime surface for text-core.
Structs§
- AnnotatedParagraph: Data type for annotated paragraph.
- AnnotatedSentence: Data type for annotated sentence.
- AnnotationConfidence: Confidence score normalized into the inclusive range 0.0..=1.0.
- AnnotationId: Data type for annotation identifier.
- CanonicalToken: Data type for canonical token.
- DetailedTextStats: Data type for detailed text stats.
- GraphemeSpan: Data type for grapheme span.
- OwnedTextDocument: Owned text document for storage, serialization, and cross-thread workflows.
- Paragraph: Data type for paragraph.
- ScriptProfile: Data type for script profile.
- Sentence: Data type for sentence.
- TextAnnotationGraph: Data type for text annotation graph.
- TextBoundaryOptions: Data type for text boundary options.
- TextDocument: Borrowed text document used as the lightweight boundary between text crates.
- TextProcessingOptions: Data type for text processing options.
- TextSpan: Half-open byte and character span into a UTF-8 text buffer.
- TextSpanRef: Data type for text span ref.
- TextStats: Data type for text stats.
- Token: Data type for token.
- WordSegment: Data type for word segment.
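Two entries in this list state concrete invariants: TextSpan is a half-open byte and character span, and AnnotationConfidence lies in 0.0..=1.0. The sketch below illustrates both ideas with standard-library code only. `SketchSpan`, `SketchConfidence`, their fields, and their methods are hypothetical names chosen for illustration; the real types may differ in layout and behavior (for example, the crate might reject out-of-range scores rather than clamp them).

```rust
/// Hypothetical stand-in for a half-open span; field names are assumptions.
/// Both ranges are `start..end` with `end` exclusive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SketchSpan {
    pub byte_start: usize,
    pub byte_end: usize,
    pub char_start: usize,
    pub char_end: usize,
}

impl SketchSpan {
    /// Builds a span from a byte range and derives the character offsets by
    /// counting `char`s. Returns `None` if the range is reversed, out of
    /// bounds, or does not fall on UTF-8 character boundaries.
    pub fn from_byte_range(text: &str, byte_start: usize, byte_end: usize) -> Option<Self> {
        if byte_start > byte_end || byte_end > text.len() {
            return None;
        }
        if !text.is_char_boundary(byte_start) || !text.is_char_boundary(byte_end) {
            return None;
        }
        let char_start = text[..byte_start].chars().count();
        let char_end = char_start + text[byte_start..byte_end].chars().count();
        Some(Self { byte_start, byte_end, char_start, char_end })
    }

    /// Length in bytes; half-open ranges make this a plain subtraction.
    pub fn byte_len(&self) -> usize {
        self.byte_end - self.byte_start
    }

    /// Length in Unicode scalar values.
    pub fn char_len(&self) -> usize {
        self.char_end - self.char_start
    }

    /// Returns the covered text, or `None` if the span does not fit `text`.
    pub fn slice<'a>(&self, text: &'a str) -> Option<&'a str> {
        text.get(self.byte_start..self.byte_end)
    }
}

/// Hypothetical confidence wrapper that keeps its value in `0.0..=1.0`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SketchConfidence(f32);

impl SketchConfidence {
    /// Clamps into range; NaN is mapped to 0.0 (an assumption of this sketch).
    pub fn new(value: f32) -> Self {
        if value.is_nan() {
            Self(0.0)
        } else {
            Self(value.clamp(0.0, 1.0))
        }
    }

    pub fn get(self) -> f32 {
        self.0
    }
}
```

Keeping byte and character offsets side by side matters for non-ASCII text: in "naïve café" the word "café" occupies bytes 7..12 but characters 6..10, so its byte length is 5 while its character length is 4.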
Enums§
- AnnotationProvenance: Variants describing annotation provenance.
- TokenKind: Variants describing token kind.
Functions§
- build_annotation_graph: Builds an annotation graph.
- build_annotation_graph_from_parts: Builds an annotation graph from parts.
- detailed_text_stats: Returns detailed text stats.
- detect_script_profile: Returns the detected script profile.
- normalize_text: Returns the normalized text.
- normalize_whitespace: Returns the text with whitespace normalized.
- segment_document_id: Returns the segment document identifier.
- segment_graphemes: Returns the segmented graphemes.
- segment_words: Returns the segmented words.
- split_paragraphs: Returns the split paragraphs.
- split_sentence_spans: Returns the split sentence spans.
- split_sentences: Returns the split sentences.
- text_stats: Returns text stats.
- tokenize: Returns the tokenized text.
- tokenize_words: Returns the tokenized words.
- word_counts: Returns word counts.
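The one-line summaries above do not show signatures, so the following is only a rough sketch of what whitespace normalization and word counting can look like with the standard library. `collapse_whitespace` and `count_words` are hypothetical names, not the crate's functions, and the choices made here (collapsing runs to a single space, lowercasing before counting, using a `BTreeMap` for stable ordering) are assumptions rather than documented behavior.

```rust
use std::collections::BTreeMap;

/// Hypothetical sketch: trims the text and collapses every run of Unicode
/// whitespace into a single ASCII space.
pub fn collapse_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical sketch: counts lowercased alphanumeric runs.
/// A `BTreeMap` iterates in key order, so the output order is deterministic.
pub fn count_words(text: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    let words = text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty());
    for word in words {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}
```

An ordered map is used instead of a `HashMap` so that repeated runs over the same input serialize identically, which matches the deterministic, local-first goal stated at the top of this page.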