use std::collections::HashMap;

/// Counts words in a string using rules that closely match Microsoft Word.
///
/// Word does *not* simply split on spaces. Instead, it applies
/// human-friendly heuristics, which this function approximates.
///
/// ### Rules implemented:
/// - Tokens are separated by **whitespace**.
/// - Leading/trailing punctuation is ignored (e.g., `"hello,"` → `hello`).
/// - Hyphenated words count as **one** (e.g., `"state-of-the-art"` → 1).
/// - Contractions count as **one** (e.g., `"don't"` → 1).
/// - URLs count as **one** word.
/// - Emojis count as **one** word.
/// - CJK (Chinese/Japanese/Korean) characters count as **individual words**,
///   matching Word’s behavior (e.g., `"你好世界"` → 4).
/// - Runs of spaces, tabs, and newlines act as a single separator.
///
/// ### What this function does *not* handle:
/// - Word document structural features (fields, comments, footnotes, etc.).
///   Those do not apply to plain text.
///
/// ### Examples
/// ```
/// use bt_string_utils::analyzer::word_count;
/// assert_eq!(word_count("Hello, world!"), 2);
/// assert_eq!(word_count("state-of-the-art"), 1);
/// assert_eq!(word_count("I'm here"), 2);
/// ```
///
/// # Arguments
/// * `text` – The input string to analyze.
///
/// # Returns
/// The number of words.
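// A minimal sketch of how the rules above could be implemented. It is not the
// crate's actual `word_count`: the name `word_count_sketch`, the nested helpers,
// and the Unicode ranges used for CJK and punctuation are assumptions made for
// illustration only.
pub fn word_count_sketch(text: &str) -> usize {
    // Assumed CJK ranges; the real implementation may cover other blocks.
    fn cjk_like(c: char) -> bool {
        matches!(
            c,
            '\u{3040}'..='\u{30FF}'       // Hiragana and Katakana
                | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
                | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
                | '\u{AC00}'..='\u{D7AF}' // Hangul Syllables
                | '\u{F900}'..='\u{FAFF}' // CJK Compatibility Ideographs
        )
    }

    // Assumed punctuation test: ASCII punctuation plus the General Punctuation
    // and CJK Symbols and Punctuation blocks.
    fn punct_like(c: char) -> bool {
        c.is_ascii_punctuation()
            || matches!(c, '\u{2000}'..='\u{206F}' | '\u{3000}'..='\u{303F}')
    }

    let mut count = 0;
    // `split_whitespace` collapses runs of spaces, tabs, and newlines.
    for token in text.split_whitespace() {
        // True once the current non-CJK run holds a non-punctuation character.
        let mut run_has_content = false;
        for c in token.chars() {
            if cjk_like(c) {
                // A CJK character closes any open run and counts on its own.
                if run_has_content {
                    count += 1;
                    run_has_content = false;
                }
                count += 1;
            } else if !punct_like(c) {
                // Letters, digits, emoji, and similar make the run a word.
                // Inner hyphens and apostrophes never split a run, so
                // hyphenated words, contractions, and URLs stay one word.
                run_has_content = true;
            }
        }
        if run_has_content {
            count += 1;
        }
    }
    count
}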

/// Returns `true` if the character belongs to a CJK (Chinese/Japanese/Korean)
/// Unicode block.
///
/// ### Examples
/// ```
/// use bt_string_utils::analyzer::is_cjk;
/// assert!(is_cjk('你'));
/// assert!(is_cjk('界'));
/// assert!(!is_cjk('a'));
/// assert!(!is_cjk('🙂'));
/// ```
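// A minimal sketch of such a check, assuming a fixed set of Unicode blocks.
// The name `is_cjk_sketch` and the exact block list are assumptions; the
// crate's actual `is_cjk` may include or exclude other ranges.
pub fn is_cjk_sketch(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x309F       // Hiragana
            | 0x30A0..=0x30FF // Katakana
            | 0x3400..=0x4DBF // CJK Unified Ideographs Extension A
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xAC00..=0xD7AF // Hangul Syllables
            | 0xF900..=0xFAFF // CJK Compatibility Ideographs
            | 0x20000..=0x2A6DF // CJK Unified Ideographs Extension B
    )
}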

/// Finds the words that differ between two similar string vectors and
/// reports the difference in their word counts.
///
/// Useful for verifying minimal changes between two almost identical documents.
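// A minimal sketch of one way to compute such a difference. The function
// name `word_diff_sketch`, its signature, and its return shape are all
// hypothetical; the crate's real function is not shown in this file.
// It returns (words with surplus occurrences on the left, words with surplus
// occurrences on the right, left length minus right length).
pub fn word_diff_sketch<S: AsRef<str>>(
    left: &[S],
    right: &[S],
) -> (Vec<String>, Vec<String>, i64) {
    // Imported locally so the sketch stays self-contained.
    use std::collections::HashMap;

    // Net occurrences per word: +1 for each use on the left, -1 on the right.
    let mut balance: HashMap<&str, i64> = HashMap::new();
    for word in left {
        *balance.entry(word.as_ref()).or_insert(0) += 1;
    }
    for word in right {
        *balance.entry(word.as_ref()).or_insert(0) -= 1;
    }

    let mut only_left = Vec::new();
    let mut only_right = Vec::new();
    for (word, net) in &balance {
        if *net > 0 {
            only_left.push((*word).to_string());
        } else if *net < 0 {
            only_right.push((*word).to_string());
        }
    }
    // HashMap iteration order is unspecified, so sort for stable output.
    only_left.sort();
    only_right.sort();

    let count_diff = left.len() as i64 - right.len() as i64;
    (only_left, only_right, count_diff)
}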

/// Counts paragraphs in a string using rules that match Microsoft Word.
///
/// Word ends a paragraph at each paragraph mark. In plain text, this corresponds to:
/// - `\r\n` (Windows newline)
/// - `\n` (Unix newline)
/// - `\r` (old Mac newline)
///
/// ### Rules implemented:
/// - Each newline sequence ends a paragraph.
/// - Consecutive newlines create **empty paragraphs**, and Word counts them.
/// - A document with no newline at all counts as **one paragraph**.
/// - An empty document counts as **zero paragraphs**.
///
/// ### Examples
/// ```
/// use bt_string_utils::analyzer::count_paragraphs;
/// assert_eq!(count_paragraphs("Hello"), 1);
/// assert_eq!(count_paragraphs("Hello\nWorld"), 2);
/// assert_eq!(count_paragraphs("Line1\n\nLine3"), 3); // empty paragraph in the middle
/// assert_eq!(count_paragraphs(""), 0);
/// ```
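// A minimal sketch of the rules above, not the crate's actual
// `count_paragraphs`. The name `count_paragraphs_sketch` is hypothetical, and
// the handling of a trailing newline (it ends the last paragraph without
// opening a new one) is an assumption the rules do not spell out.
pub fn count_paragraphs_sketch(text: &str) -> usize {
    let mut count = 0;
    // True while characters have been seen since the last paragraph break.
    let mut pending = false;
    let mut chars = text.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Treat "\r\n" as a single newline sequence.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                count += 1;
                pending = false;
            }
            '\n' => {
                count += 1;
                pending = false;
            }
            _ => pending = true,
        }
    }

    // Text after the last newline (or a document with no newline) is one
    // more paragraph; an empty document therefore yields zero.
    if pending {
        count += 1;
    }
    count
}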