/// Splits the given string at the first occurrence of the specified separator.
///
/// # Arguments
///
/// * `s` - A string slice to be split.
/// * `separator` - The substring used as a separator.
///
/// # Returns
///
/// A tuple containing two strings:
/// - The first part of the string before the separator.
/// - The second part of the string after the separator.
///
/// If the separator is not found, returns the original string and an empty string.
///
/// # Examples
///
/// ```
/// use bt_string_utils::splitter::get_first_of_split;
/// let (part1, part2) = get_first_of_split("hello=world", "=");
/// assert_eq!(part1, "hello");
/// assert_eq!(part2, "world");
///
/// let (part1, part2) = get_first_of_split("key:value", ":");
/// assert_eq!(part1, "key");
/// assert_eq!(part2, "value");
///
/// let (part1, part2) = get_first_of_split("no=separator", " ");
/// assert_eq!(part1, "no=separator");
/// assert_eq!(part2, "");
/// ```
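/**
The first-occurrence and not-found rules described above can be sketched with the
standard library's `str::split_once`. This is only an illustration under stated
assumptions: `first_split_sketch` is a hypothetical stand-in, not this crate's
implementation, and it assumes both parts are returned as owned `String`s.

```rust
// Hypothetical sketch; not the crate's actual `get_first_of_split`.
fn first_split_sketch(s: &str, separator: &str) -> (String, String) {
    match s.split_once(separator) {
        // Split only at the first occurrence; the tail keeps any later separators.
        Some((head, tail)) => (head.to_string(), tail.to_string()),
        // Separator absent: the whole input plus an empty second part.
        None => (s.to_string(), String::new()),
    }
}
```

With this sketch, `first_split_sketch("a=b=c", "=")` yields `("a", "b=c")`: only the
first `=` splits the input.
*/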
/// Splits a string into at most `n` substrings, grouped by whole words.
///
/// This function performs **word‑based splitting**, never character‑based.
/// It guarantees:
///
/// - Words are never broken apart.
/// - The number of returned substrings is **min(n, word_count)**.
/// - Unicode and emoji are handled safely (because splitting happens on
///   whitespace boundaries, which are always valid UTF‑8 boundaries).
/// - The original string is never copied; all substrings are `&str` slices.
///
/// # Arguments
///
/// * `s` - The input string to split.
/// * `n` - The maximum number of substrings to return. The result never has more substrings than `s` has words.
///
/// # Returns
///
/// A `Vec<&str>` containing up to `n` substrings, each containing one or more
/// whole words from the original string.
///
/// # Examples
///
/// Splitting into fewer groups than words:
///
/// ```
/// use bt_string_utils::splitter::split_upto_n_by_word;
/// let s = "Hello 🙂 World from Rust";
/// let parts = split_upto_n_by_word(s, 3);
/// assert_eq!(parts, vec!["Hello", " 🙂 World", " from Rust"]);
/// ```
///
/// Requesting more groups than words:
///
/// ```
/// use bt_string_utils::splitter::split_upto_n_by_word;
/// let s = "Hello 🙂 World";
/// let parts = split_upto_n_by_word(s, 10);
/// assert_eq!(parts, vec!["Hello", " 🙂", " World"]);
/// ```
///
/// Single group:
///
/// ```
/// use bt_string_utils::splitter::split_upto_n_by_word;
/// let s = "Hello world";
/// let parts = split_upto_n_by_word(s, 1);
/// assert_eq!(parts, vec!["Hello world"]);
/// ```
///
/// Empty input:
///
/// ```
/// use bt_string_utils::splitter::split_upto_n_by_word;
/// let parts = split_upto_n_by_word("", 5);
/// assert!(parts.is_empty());
/// ```
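/**
The grouping shown in the examples above can be reproduced with a short sketch.
It is an illustration only: `group_words_sketch` is a hypothetical name, and the
distribution rule (the trailing groups absorb the leftover words) is an assumption
chosen because it matches the documented outputs, not a statement about how the
crate decides group sizes. The sketch also assumes that `n == 0` yields an empty
vector, that the first group starts at byte 0, and that whitespace after the last
word is dropped.

```rust
// Hypothetical sketch; not the crate's actual `split_upto_n_by_word`.
fn group_words_sketch(s: &str, n: usize) -> Vec<&str> {
    // Record the byte offset just past the end of every word.
    let mut word_ends = Vec::new();
    let mut in_word = false;
    for (i, c) in s.char_indices() {
        if c.is_whitespace() {
            if in_word {
                word_ends.push(i);
            }
            in_word = false;
        } else {
            in_word = true;
        }
    }
    if in_word {
        word_ends.push(s.len());
    }

    // Never produce more groups than there are words.
    let groups = n.min(word_ends.len());
    if groups == 0 {
        return Vec::new();
    }
    let base = word_ends.len() / groups;
    let extra = word_ends.len() % groups;

    let mut parts = Vec::with_capacity(groups);
    let mut start = 0; // byte offset where the next group begins
    let mut used = 0; // number of words consumed so far
    for g in 0..groups {
        // Assumption: the last `extra` groups take one additional word each.
        let size = base + usize::from(g >= groups - extra);
        used += size;
        let end = word_ends[used - 1];
        // Every offset in `word_ends` is a char boundary, so slicing is safe.
        parts.push(&s[start..end]);
        start = end;
    }
    parts
}
```

Each slice after the first keeps its leading whitespace, which is why the groups
in the examples begin with a space.
*/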
/// Splits a given string into multiple chunks of safe size while ensuring that UTF-8 multi-byte characters are not split.
///
/// This function takes a string and divides it into smaller chunks of `chunk_size_bytes` bytes or less, ensuring that each chunk ends
/// at a valid UTF-8 character boundary. This helps avoid issues with splitting multi-byte characters (such as emojis or non-Latin
/// characters), which can lead to invalid UTF-8 sequences. The chunks are returned as a `Vec<String>`, which contains the substrings
/// of the original content.
///
/// # Parameters
///
/// - `content`: The document or text to split into chunks. As a `&str`, it is always valid UTF-8.
/// - `chunk_size_bytes`: The maximum size of each chunk, in bytes.
///
/// # Returns
///
/// - `Vec<String>`: The chunks of the original `content`, in order. No chunk is split in the middle of a multi-byte UTF-8 character.
///
/// # Behavior
///
/// The function walks the input by byte offset and ends each chunk at a character boundary at or before the
/// `chunk_size_bytes` limit. Chunks are added to the result vector in order, and each chunk is a valid UTF-8 sequence.
///
/// # Example
///
/// ```rust
/// use bt_string_utils::splitter::split_into_chunks;
/// let document: &str = "Your 70k+ character document..."; // some long document content
/// let chunks = split_into_chunks(document, 5);
/// for chunk in chunks {
///     println!("{}", chunk);
/// }
/// ```
///
/// # Limitations
///
/// - The function steps backwards within the byte array when a split point would fall in the middle of a multi-byte character, so a chunk can be shorter than `chunk_size_bytes`.
/// - It is designed for **UTF-8** encoded data.
/// - If the input string fits within `chunk_size_bytes`, only a single chunk will be returned.
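/**
The step-backwards behaviour noted above can be sketched with the standard
library's `str::is_char_boundary`. This is a minimal illustration, not the crate's
implementation: `chunk_utf8_sketch` is a hypothetical name, and emitting a character
whole when it is larger than `chunk_size_bytes` is an assumption made here to keep
the loop from stalling.

```rust
// Hypothetical sketch; not the crate's actual `split_into_chunks`.
fn chunk_utf8_sketch(content: &str, chunk_size_bytes: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < content.len() {
        // Tentative end: `chunk_size_bytes` further on, capped at the input length.
        let mut end = start.saturating_add(chunk_size_bytes).min(content.len());
        // Step backwards until `end` sits on a character boundary.
        while end > start && !content.is_char_boundary(end) {
            end -= 1;
        }
        if end == start {
            // Assumption: a character wider than the limit is emitted whole.
            end = start + 1;
            while !content.is_char_boundary(end) {
                end += 1;
            }
        }
        chunks.push(content[start..end].to_string());
        start = end;
    }
    chunks
}
```

Concatenating the returned chunks gives back the original input, because every
byte belongs to exactly one chunk.
*/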