//! Text processing and NLP utilities.
//!
//! This module provides text preprocessing tools for Natural Language Processing:
//! - Tokenization (word-level, character-level)
//! - BPE tokenization (Byte Pair Encoding for LLMs/speech models)
//! - Stop words filtering
//! - Stemming (Porter stemmer)
//! - Vectorization (Bag of Words, TF-IDF)
//! - Sentiment analysis (lexicon-based)
//! - Topic modeling (LDA)
//! - Document similarity (cosine, Jaccard, edit distance)
//! - Entity extraction (emails, URLs, mentions, hashtags)
//! - Text summarization (`TextRank`, TF-IDF extractive)
//!
//! # Design Principles
//!
//! Following the Toyota Way and aprender's quality standards:
//! - Zero `unwrap()` calls (Cloudflare-class safety)
//! - Result-based error handling with `AprenderError`
//! - Comprehensive test coverage (≥95%)
//! - Property-based testing with proptest
//! - Pure Rust implementation (no external NLP dependencies)
//!
//! # Quick Start
//!
//! ```
//! use aprender::text::tokenize::WhitespaceTokenizer;
//! use aprender::text::Tokenizer;
//!
//! let tokenizer = WhitespaceTokenizer::new();
//! let tokens = tokenizer.tokenize("Hello, world! This is aprender.").expect("tokenize should succeed");
//! assert_eq!(tokens, vec!["Hello,", "world!", "This", "is", "aprender."]);
//! ```
//!
//! # References
//!
//! Based on the comprehensive NLP specification:
//! `docs/specifications/nlp-models-techniques-spec.md`
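To give a concrete taste of the similarity metrics listed in the module doc above, here is a minimal, self-contained sketch of Jaccard similarity over whitespace-token sets. This is plain Rust for illustration only; `jaccard` is a hypothetical helper, not part of the aprender API.

```rust
use std::collections::HashSet;

/// Jaccard similarity: |A ∩ B| / |A ∪ B| over whitespace-token sets.
fn jaccard(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    if sa.is_empty() && sb.is_empty() {
        return 1.0; // two empty documents are identical by convention
    }
    let inter = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    inter / union
}

fn main() {
    // "the" and "cat" are shared; the union has 4 tokens -> 2/4 = 0.5
    let s = jaccard("the cat sat", "the cat ran");
    assert!((s - 0.5).abs() < 1e-12);
    println!("jaccard = {s}");
}
```

Tokenizing by whitespace keeps the sketch simple; a production implementation would plug in one of the tokenizers above instead.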
// Re-export key chat_template types for convenience
pub use ;
use crate::AprenderError;
/// Trait for text tokenization.
///
/// Tokenizers split text into smaller units (tokens) such as words or characters.
/// All tokenizers must handle edge cases gracefully and return a `Result` for error handling.
///
/// # Examples
///
/// ```
/// use aprender::text::{Tokenizer, tokenize::WhitespaceTokenizer};
///
/// let tokenizer = WhitespaceTokenizer::new();
/// let tokens = tokenizer.tokenize("Hello world").expect("tokenize should succeed");
/// assert_eq!(tokens, vec!["Hello", "world"]);
/// ```
pub trait Tokenizer {
    /// Split `text` into tokens.
    fn tokenize(&self, text: &str) -> Result<Vec<String>, AprenderError>;
}
// ============================================================================
// trueno-rag integration (GH-125)
// ============================================================================
/// Re-export trueno-rag types when the `rag` feature is enabled.
///
/// Provides document chunking, retrieval, and RAG pipeline capabilities
/// for document-based ML workflows.
///
/// # Example
///
/// ```ignore
/// use aprender::text::rag::{Chunker, ChunkingStrategy};
///
/// let chunker = Chunker::new(ChunkingStrategy::Recursive {
/// chunk_size: 512,
/// overlap: 64,
/// });
/// let chunks = chunker.chunk(&document)?;
/// ```
// Text preprocessing contract falsification (FALSIFY-PP-001..006)
// Refs: NLP spec §2.1.1, PMAT-346
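To make the `chunk_size`/`overlap` parameters in the rag example above concrete, here is a character-level sketch of overlapping chunking. This is plain Rust for illustration only: `chunk_with_overlap` is a hypothetical helper, and the real `Recursive` strategy additionally splits on semantic separators before falling back to fixed windows.

```rust
/// Fixed-size chunking with overlap: each window advances by
/// `chunk_size - overlap` characters, so consecutive chunks share
/// `overlap` characters of context.
fn chunk_with_overlap(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // last window reached the end of the text
        }
        start += step;
    }
    chunks
}

fn main() {
    // windows of 4 chars advancing by 3: "abcd", "defg", "ghij"
    let chunks = chunk_with_overlap("abcdefghij", 4, 1);
    assert_eq!(chunks, vec!["abcd", "defg", "ghij"]);
}
```

Collecting into `Vec<char>` keeps the slicing correct for multi-byte UTF-8 text, at the cost of an extra allocation.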