//! Text processing and NLP utilities.
//!
//! This module provides text preprocessing tools for Natural Language Processing:
//! - Tokenization (word-level, character-level)
//! - BPE tokenization (Byte Pair Encoding for LLMs/speech models)
//! - Stop words filtering
//! - Stemming (Porter stemmer)
//! - Vectorization (Bag of Words, TF-IDF)
//! - Sentiment analysis (lexicon-based)
//! - Topic modeling (LDA)
//! - Document similarity (cosine, Jaccard, edit distance)
//! - Entity extraction (emails, URLs, mentions, hashtags)
//! - Text summarization (`TextRank`, TF-IDF extractive)
//!
//! # Design Principles
//!
//! Following the Toyota Way and aprender's quality standards:
//! - Zero `unwrap()` calls (Cloudflare-class safety)
//! - Result-based error handling with `AprenderError`
//! - Comprehensive test coverage (≥95%)
//! - Property-based testing with proptest
//! - Pure Rust implementation (no external NLP dependencies)
//!
//! # Quick Start
//!
//! ```
//! use aprender::text::tokenize::WhitespaceTokenizer;
//! use aprender::text::Tokenizer;
//!
//! let tokenizer = WhitespaceTokenizer::new();
//! let tokens = tokenizer.tokenize("Hello, world! This is aprender.").expect("tokenize should succeed");
//! assert_eq!(tokens, vec!["Hello,", "world!", "This", "is", "aprender."]);
//! ```
//!
//! # References
//!
//! Based on the comprehensive NLP specification:
//! `docs/specifications/nlp-models-techniques-spec.md`
// Re-export key chat_template types for convenience
pub use ;
use crate::AprenderError;
/// Trait for text tokenization.
///
/// Tokenizers split text into smaller units (tokens) such as words or characters.
/// All tokenizers must handle edge cases (such as empty input) gracefully and
/// return a `Result` so that failures surface as errors rather than panics.
///
/// # Examples
///
/// ```
/// use aprender::text::{Tokenizer, tokenize::WhitespaceTokenizer};
///
/// let tokenizer = WhitespaceTokenizer::new();
/// let tokens = tokenizer.tokenize("Hello world").expect("tokenize should succeed");
/// assert_eq!(tokens, vec!["Hello", "world"]);
/// ```
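///
/// As a rough, self-contained illustration of that contract, the sketch below
/// shows a character-level tokenizer that returns a `Result` and treats empty
/// input as a valid case. The names `SketchTokenizer` and `CharSplitter` are
/// hypothetical, and a plain `String` stands in for `AprenderError`; this is
/// not the crate's actual definition, only one way such a tokenizer might look.
///
/// ```rust
/// // Hypothetical stand-in for the tokenizer contract; names are illustrative only.
/// trait SketchTokenizer {
///     fn tokenize(&self, text: &str) -> Result<Vec<String>, String>;
/// }
///
/// // Emits one token per Unicode scalar value, skipping whitespace.
/// struct CharSplitter;
///
/// impl SketchTokenizer for CharSplitter {
///     fn tokenize(&self, text: &str) -> Result<Vec<String>, String> {
///         Ok(text
///             .chars()
///             .filter(|c| !c.is_whitespace())
///             .map(|c| c.to_string())
///             .collect())
///     }
/// }
///
/// fn main() {
///     let tokens = CharSplitter.tokenize("ab c").expect("sketch never fails");
///     assert_eq!(tokens, vec!["a", "b", "c"]);
///
///     // Edge case: empty input yields an empty token list, not an error.
///     let empty = CharSplitter.tokenize("").expect("sketch never fails");
///     assert!(empty.is_empty());
/// }
/// ```
///
/// Returning `Result` even when the sketch cannot fail keeps the signature
/// uniform across tokenizers that can, for example ones that load a vocabulary.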
// ============================================================================
// trueno-rag integration (GH-125) — REMOVED for APR-MONO self-containment.
// ============================================================================
//
// `aprender::text::rag` was a `pub use trueno_rag::*` re-export. Since aprender-rag
// (=trueno-rag) depends on aprender-core (and aprender-serve), this re-export closed a
// core→rag→serve→core cycle — a layer inversion (the RAG crate builds ON core). Consume
// the RAG pipeline directly from the `aprender-rag` crate (`use aprender_rag::...`), which
// already depends on core. No in-tree code used `aprender::text::rag`.
// Text preprocessing contract falsification (FALSIFY-PP-001..006)
// Refs: NLP spec §2.1.1, PMAT-346