string_manipulation_utf8 0.3.0

String manipulation functions using character indexing (UTF-8) instead of bytes.
# rust-string-manipulation-utf8


**A Rust library with string manipulation functions using character indexing (UTF-8)**

Library name: [string_manipulation_utf8](https://crates.io/crates/string_manipulation_utf8)

An implementation of string manipulation functions using character indexing instead of bytes. It uses UTF-8 encoded strings as implemented in Rust.

This library also has common string functions like indexof, substr and substring that exist in other programming languages.

The functions are available both as free functions and as methods on the 'str' type (string slice) and the 'String' type.

Library functions:

- indexof : get the character position of one string within another
- substr : get a substring of a string using start index and length (signed values)
- substru : get a substring of a string using start index and length (unsigned values)
- substr_end : get a substring from start index till the end of the string
- substring : get a substring of a string using start and end index (not included)
- str_remove : Remove a substring from a string
- str_concat! : macro to concatenate multiple strings (all strings are borrowed)

Standard Rust functions:

Functions independent of character and byte indexing in Rust.

- replace : replaces all matches of a pattern with another string
- replacen : replaces first N matches of a pattern with another string
- strip_prefix : returns a string slice with the prefix removed
- strip_suffix : returns a string slice with the suffix removed

- contains : check if a string contains another string
- starts_with : check if a string starts with another string
- ends_with : check if a string ends with another string
- is_empty : check if a String has a length of zero

> The Rust standard library doesn't support Unicode grapheme clusters (with combining diacritical marks) where multiple code points are required to form one character.  
> Example:  
> e + combining acute = e + ´ = \u{0065}\u{0301} = é (two code points with 3 bytes, hex. 65 CC 81)  
> Versus the character é = \u{00E9} with one code point for 2 bytes, hex. C3 A9  
> This library uses the Rust standard library and hence will count such combined characters as multiple characters.

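The difference can be checked with the standard library alone. The sketch below uses a hypothetical helper, `char_and_byte_len` (an assumption for illustration, not part of this library), to count code points and bytes for both spellings of é:

```rust
/// Hypothetical helper (not part of this library):
/// returns (number of code points, number of bytes) of a string.
fn char_and_byte_len(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT
    let decomposed = "e\u{0301}";
    // Precomposed U+00E9
    let precomposed = "\u{00E9}";

    assert_eq!(char_and_byte_len(decomposed), (2, 3));
    assert_eq!(char_and_byte_len(precomposed), (1, 2));

    // Both display as é, but the standard library treats them as different strings.
    assert_ne!(decomposed, precomposed);
    println!("{decomposed} {precomposed}");
}
```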
See section 'Using byte positioning' for examples with native byte indexing.

> Simple benchmarking code was used to find the faster algorithms. [GitHub rust-string-manip-benchmark](https://github.com/guntherwillems/rust-string-manip-benchmark)

To compile and run the example code in examples/main.rs:  
`cargo run --example main`

To compile and run the tests in tests/tests.rs:  
`cargo test`

Install:  
Run the following Cargo command in your project directory:  
`cargo add string_manipulation_utf8 `  
Or add the following line to your Cargo.toml:  
`string_manipulation_utf8 = "0.3.0"`


## Using character positioning



### indexof


Get the character position of one string within another. The search starts at character index 'start_index'. Returns None if the search string is not found. Index of the first character is 0.


Syntax:

- `str.indexof(searchstring: &str, start_index: usize) -> Option<usize>`
- `string.indexof(searchstring: &str, start_index: usize) -> Option<usize>`
- `indexof(s: &str, searchstring: &str, start_index: usize) -> Option<usize>`

Example:

Return the character index of "test" in the given string. Start searching at the beginning of the string. Result position is 0 because "test" starts at the beginning of the string.

~~~rust
use string_manipulation_utf8::CharString; // String and str methods
use string_manipulation_utf8::indexof; // str function

fn main() {
    let s1: &str = "test éèçà 123 test";
    let s2: String = s1.to_owned();

    match s1.indexof("test", 0) { // Result: Some(0)
        Some(pos) => println!("Found at position: {}", pos),
        None => println!("Not found"),
    }

    match s2.indexof("test", 0) { // Result: Some(0)
        Some(pos) => println!("Found at position: {}", pos),
        None => println!("Not found"),
    }

    match indexof(s1, "test", 0) { // Result: Some(0)
        Some(pos) => println!("Found at position: {}", pos),
        None => println!("Not found"),
    }

    match indexof(&s2, "test", 0) { // Result: Some(0)
        Some(pos) => println!("Found at position: {}", pos),
        None => println!("Not found"),
    }
}
~~~


Return the character index of "test" in the given string. Start searching from character index 6. The result is position 14.

~~~rust
use string_manipulation_utf8::indexof;
use string_manipulation_utf8::CharString; // String and str methods.

fn main() {
    let s1: &str = "test éèçà 123 test";
    let s2: String = s1.to_owned();

    match s1.indexof("test", 6) { // Result: Some(14)
        Some(pos) => println!("Found at position: {}", pos),
        None => println!("Not found"),
    }

    match s2.indexof("test", 6) { // Result: Some(14)
        Some(pos) => println!("Found at position: {}", pos),
        None => println!("Not found"),
    }

    match indexof(s1, "test", 6) { // Result: Some(14)
        Some(pos) => println!("Found at position: {}", pos),
        None => println!("Not found"),
    }

    match indexof(&s2, "test", 6) { // Result: Some(14)
        Some(pos) => println!("Found at position: {}", pos),
        None => println!("Not found"),
    }
}
~~~
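
For readers curious how a character-based search can be built on the byte-based `str::find`, here is a minimal sketch. `indexof_sketch` is a hypothetical name, and the code is an assumption about one possible approach, not the library's actual implementation. In this sketch a 'start_index' at or past the end of the string returns None.

```rust
/// Hypothetical sketch (not the library's code): character-based search
/// built on the byte-based `str::find`.
fn indexof_sketch(s: &str, search: &str, start_index: usize) -> Option<usize> {
    // Byte offset of the character at `start_index`; None if out of range.
    let (start_byte, _) = s.char_indices().nth(start_index)?;
    // Byte offset of the match, relative to the whole string.
    let found_byte = s[start_byte..].find(search)? + start_byte;
    // Convert the byte offset back to a character index.
    Some(s[..found_byte].chars().count())
}

fn main() {
    let s = "test éèçà 123 test";
    assert_eq!(indexof_sketch(s, "test", 0), Some(0));
    assert_eq!(indexof_sketch(s, "test", 6), Some(14));
    assert_eq!(indexof_sketch(s, "xyz", 0), None);
    println!("{:?}", indexof_sketch(s, "123", 0));
}
```

Converting back with `chars().count()` walks the prefix a second time, which keeps the sketch short at the cost of some speed.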


### substr


Get a substring of a string, beginning at character index 'start_index' and take 'length' characters.  
Negative numbers count backwards:  
  'start_index' from the end of the string.  
  'length' from 'start_index'.  
If 'start_index' exceeds the string boundary limits, an empty string is returned. (Similar to C++ std::string::substr() and C# String.Substring.)  
'length' can be isize::MAX or isize::MIN to get the substring until the positive or negative string boundary without the need to calculate the length. (Alternatively, see substr_end in this library.)  
Index of the first character is 0.

If 'start_index' and 'length' are both non-negative, substru is a little faster. It wraps the expression string.chars().skip(start_index).take(length).collect(). See substru and section 'Standard Rust methods' for examples.

Syntax:

- `str.substr(start_index: isize, length: isize) -> String`
- `string.substr(start_index: isize, length: isize) -> String`
- `substr(s: &str, start_index: isize, length: isize) -> String`

Example:

~~~rust
use string_manipulation_utf8::CharString; // String and str methods

fn main() {
    assert_eq!("0123456789".substr(2, 3), "234");
    assert_eq!("0123456789".substr(-5, 3), "567");
    assert_eq!("0123456789".substr(-5, -3), "345"); // Negative length counts backwards
    assert_eq!("0123456789".substr(5, -3), "345"); // Negative length counts backwards
    assert_eq!("0123456789".substr(2, 0), ""); // Take nothing
    assert_eq!("0123456789".substr(0, 0), ""); // Take nothing
    assert_eq!("0123456789".substr(-4, 0), ""); // Take nothing

    assert_eq!("0123456789".substr(5, isize::MAX), "56789");
    assert_eq!("0123456789".substr(isize::MAX, 1), ""); // Out of bounds
    assert_eq!("0123456789".substr(isize::MAX, isize::MIN), ""); // Out of bounds
    assert_eq!("0123456789".substr(isize::MIN, isize::MAX), ""); // Out of bounds
}
~~~

Example:

~~~rust
use string_manipulation_utf8::substr;
use string_manipulation_utf8::CharString; // String and str methods

fn main() {
    let s1: &str = "test éèçà 123 test";
    let s2: String = s1.to_owned();

    println!("substr str: {}", s1.substr(10, 3)); // Result: "123"
    println!("substr String: {}", s2.substr(10, 3)); // Result: "123"
    println!("substr function: {}", substr(s1, 10, 3)); // Result: "123"
}
~~~

Remark:

> To get a substring from 'start_index' until the end of the string:  
> substr(string, start_index, isize::MAX)  
> substr_end(string, start_index)  
> substr(string, start_index, string.chars().count() as isize - start_index)


### substru


Same as substr, but only accepts unsigned values for 'start_index' and 'length'.  
For non-negative values this is faster than using substr.  
It wraps the expression: s.chars().skip(start_index).take(length).collect::<String>()

Syntax:

- `str.substru(start_index: usize, length: usize) -> String`
- `string.substru(start_index: usize, length: usize) -> String`
- `substru(s: &str, start_index: usize, length: usize) -> String`

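Example (a sketch): since substru is described in terms of the skip/take expression, its behaviour can be illustrated with the standard library alone. `substru_sketch` is a hypothetical stand-in written for this example, not the library function itself, and the out-of-range results shown are those of the skip/take expression.

```rust
/// Hypothetical stand-in for substru, using only the standard library.
fn substru_sketch(s: &str, start_index: usize, length: usize) -> String {
    s.chars().skip(start_index).take(length).collect()
}

fn main() {
    assert_eq!(substru_sketch("0123456789", 2, 3), "234");
    assert_eq!(substru_sketch("test éèçà 123 test", 5, 4), "éèçà");
    // skip() and take() stop quietly at the end of the string.
    assert_eq!(substru_sketch("0123456789", 8, 10), "89");
    assert_eq!(substru_sketch("0123456789", 20, 3), "");
    println!("{}", substru_sketch("test éèçà 123 test", 10, 3));
}
```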

### substring


Get a substring of a string beginning at character index 'start_index' up to and *excluding* the character index 'end_index'.

Equivalent of JavaScript substring with 2 parameters.  
If 'start_index' is equal to 'end_index', substring() returns an empty string.  
If 'start_index' is greater than 'end_index', swap 'start_index' and 'end_index'.  
Any argument value that is less than 0 is treated as if it were 0.  
Any argument value that is greater than string length is treated as if it were string length.  
Index of the first character is 0.

Syntax:

- `str.substring(start_index: isize, end_index: isize) -> String`
- `string.substring(start_index: isize, end_index: isize) -> String`
- `substring(s: &str, start_index: isize, end_index: isize) -> String`

Example:

~~~rust
use string_manipulation_utf8::CharString; // String and str methods
use string_manipulation_utf8::substring; // str function

fn main() {
    println!("{}", substring("0123456789", 2, 3)); // Result: 2
    println!("{}", substring("0123456789", 2, 9)); // Result: 2345678
    println!("{}", substring("0123456789", 2, 10)); // Result: 23456789
    println!("{}", substring("0123456789", 2, 11)); // Result: 23456789
    println!("{}", substring("0123456789", -2, 3)); // Result: 012
    println!("{}", substring("0123456789", -2, 50)); // Result: 0123456789
    println!("{}", substring("0123456789", 9, 2)); // Result: 2345678
    
    let str: &str = "test éèçà 123 test";
    let string: String = str.to_owned();

    println!("{}", str.substring(10, 14)); // Result: "123 " (the space at index 13 is included)
    println!("{}", string.substring(10, 14)); // Result: "123 "
    println!("{}", substring(str, 10, 14)); // Result: "123 "
}
~~~
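
The clamping and swapping rules listed above can be sketched with the standard library. `substring_sketch` is a hypothetical function written for this example. It follows the rules as documented here and is not taken from the library's source.

```rust
/// Hypothetical sketch of the documented substring rules (not the library's code).
fn substring_sketch(s: &str, start_index: isize, end_index: isize) -> String {
    let len = s.chars().count();
    // Values below 0 become 0, values above the length become the length.
    let clamp = |i: isize| -> usize {
        if i < 0 { 0 } else { (i as usize).min(len) }
    };
    let (mut start, mut end) = (clamp(start_index), clamp(end_index));
    // A start greater than the end swaps the two.
    if start > end {
        std::mem::swap(&mut start, &mut end);
    }
    s.chars().skip(start).take(end - start).collect()
}

fn main() {
    assert_eq!(substring_sketch("0123456789", 2, 3), "2");
    assert_eq!(substring_sketch("0123456789", 2, 11), "23456789");
    assert_eq!(substring_sketch("0123456789", -2, 3), "012");
    assert_eq!(substring_sketch("0123456789", 9, 2), "2345678");
    assert_eq!(substring_sketch("0123456789", 4, 4), "");
    println!("{}", substring_sketch("test éèçà 123 test", 10, 13));
}
```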


### substr_end


Get a substring from character index 'start_index' to the end of the string.  
'start_index' can be negative to count backwards from the end of the string.  
If 'start_index' exceeds the string boundary limits, an empty string is returned.  
(Similar to C++ std::string::substr() and C# String.Substring.)  
Index of the first character is 0.

> Because Rust has no default values for function parameters, substr_end()  
> takes the place of a two-argument substr(string, start_index) or string.substr(start_index).  
> The same result is obtained with: substr(string, start_index, isize::MAX)

Syntax:

- `substr_end(s: &str, start_index: isize) -> String`
- `string.substr_end(start_index: isize) -> String`
- `str.substr_end(start_index: isize) -> String`

~~~rust
use string_manipulation_utf8::CharString; // String and str methods

fn main() {
    assert_eq!("0123456789".substr_end(2), "23456789");
    assert_eq!("0123456789".substr_end(0), "0123456789");
    assert_eq!("0123456789".substr_end(9), "9");
    assert_eq!("0123456789".substr_end(10), "");
    assert_eq!("0123456789".substr_end(-3), "789");
}
~~~

Example:

~~~rust
use string_manipulation_utf8::substr_end;
use string_manipulation_utf8::CharString; // String and str methods

fn main() {
    let s1: &str = "test éèçà 123 test";
    let s2: String = s1.to_owned();

    println!("substr_end str: {}", s1.substr_end(10)); // Result: "123 test"
    println!("substr_end String: {}", s2.substr_end(10)); // Result: "123 test"
    println!("substr_end function: {}", substr_end(s1, 10)); // Result: "123 test"
    println!("substr_end function: {}", substr_end(&s2, 10)); // Result: "123 test"
}
~~~
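
The handling of a negative 'start_index' can be sketched as follows. `substr_end_sketch` is a hypothetical function that only mirrors the rules described in this section. How the library treats a 'start_index' of exactly minus the string length is an assumption here (the sketch returns the whole string).

```rust
/// Hypothetical sketch of the substr_end rules (not the library's code).
fn substr_end_sketch(s: &str, start_index: isize) -> String {
    let len = s.chars().count() as isize;
    // A negative index counts backwards from the end of the string.
    let start = if start_index < 0 { len + start_index } else { start_index };
    // Outside the string boundaries: return an empty string.
    if start < 0 || start > len {
        return String::new();
    }
    s.chars().skip(start as usize).collect()
}

fn main() {
    assert_eq!(substr_end_sketch("0123456789", 2), "23456789");
    assert_eq!(substr_end_sketch("0123456789", 10), "");
    assert_eq!(substr_end_sketch("0123456789", -3), "789");
    assert_eq!(substr_end_sketch("0123456789", isize::MIN), "");
    println!("{}", substr_end_sketch("test éèçà 123 test", 10));
}
```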


### str_remove


Remove a substring from a string, beginning at character index 'start_index' and removing 'length' characters.  
Index of the first character is 0.

Syntax:

- `str.str_remove(start_index: usize, length: usize) -> String`
- `string.str_remove(start_index: usize, length: usize) -> String`
- `str_remove(s: &str, start_index: usize, length: usize) -> String`


Examples:

~~~rust
use string_manipulation_utf8::str_remove;
use string_manipulation_utf8::CharString; // String and str methods

fn main() {
    let s1: &str = "test éèçà 123 test";
    let s2: String = s1.to_owned();

    println!("str_remove str: {}", s1.str_remove(10, 4)); // Result: "test éèçà test"
    println!("str_remove String: {}", s2.str_remove(10, 4)); // Result: "test éèçà test"
    println!("str_remove function: {}", str_remove(s1, 10, 4)); // Result: "test éèçà test"
}
~~~
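
A character-based removal can be sketched with two passes over the chars iterator: keep everything before 'start_index', then everything after the removed range. `str_remove_sketch` is a hypothetical name, and this is an assumed approach rather than the library's implementation. In this sketch an out-of-range 'start_index' leaves the string unchanged.

```rust
/// Hypothetical sketch of character-based removal (not the library's code).
fn str_remove_sketch(s: &str, start_index: usize, length: usize) -> String {
    // The result can never be longer than the input, so reserve that much.
    let mut result = String::with_capacity(s.len());
    // Characters before the removed range.
    result.extend(s.chars().take(start_index));
    // Characters after the removed range; saturating_add avoids overflow.
    result.extend(s.chars().skip(start_index.saturating_add(length)));
    result
}

fn main() {
    assert_eq!(str_remove_sketch("test éèçà 123 test", 10, 4), "test éèçà test");
    assert_eq!(str_remove_sketch("0123456789", 0, 3), "3456789");
    assert_eq!(str_remove_sketch("0123456789", 8, 100), "01234567");
    println!("{}", str_remove_sketch("test éèçà 123 test", 5, 5));
}
```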


### str_concat


Macro to concatenate multiple strings.  
All strings are borrowed.  
First allocates the needed capacity, then adds the strings.

Syntax:

`str_concat!(&str1, &str2, ...)`

Examples:

~~~rust
use string_manipulation_utf8::str_concat;

fn main() {
    println!(
        "{}",
        str_concat!("test", " ", "123 ", "éèçà ", "123 ", "test home")
    ); // Result: "test 123 éèçà 123 test home"

    let s1: String = "string1".to_owned();
    let s2: String = "string2".to_owned();
    let s3: String = "string3".to_owned();
    let result: String = str_concat!(&s1, &s2, &s3);
    println!("{result}"); // Result: "string1string2string3"

    let s2: &str = "string2"; // Adding a string slice
    let result: String = str_concat!(&s1, s2, &s3);
    println!("{result}"); // Result: "string1string2string3"
}
~~~
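
The "allocate first, then add" idea can be sketched as a plain function. `concat_sketch` is a hypothetical name, and the macro's real expansion may differ. The sketch only shows the technique of summing the byte lengths and reserving that capacity once.

```rust
/// Hypothetical sketch of pre-allocated concatenation (not the macro's expansion).
fn concat_sketch(parts: &[&str]) -> String {
    // Total number of bytes needed for all parts.
    let total: usize = parts.iter().map(|p| p.len()).sum();
    // One allocation up front, so push_str does not need to reallocate.
    let mut out = String::with_capacity(total);
    for p in parts {
        out.push_str(p);
    }
    out
}

fn main() {
    let s1 = String::from("string1");
    let s2 = "string2";
    let result = concat_sketch(&[&s1, s2, "string3"]);
    assert_eq!(result, "string1string2string3");
    // s1 was only borrowed and is still usable.
    println!("{s1} -> {result}");
}
```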

Alternatives with Rust statements.

The Rust 'std::concat!' macro only works with literals. Ex. `concat!("test", 10, 'b', true)`

Using the std::format macro.  
`format!("{}{}{}", s1, s2, s3)`

When adding strings with the + operator, the first string is moved (ownership is transferred) and every following string is borrowed.  
`s1.clone() + &s2 + &s3`  
`s1.to_owned() + &s2 + &s3`

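The ownership difference between these alternatives can be seen in a short standard-library example. The helper names `join_with_format` and `join_with_plus` are hypothetical and exist only for this illustration.

```rust
// Hypothetical helpers for illustration; they are not part of this library.
fn join_with_format(s1: &str, s2: &str, s3: &str) -> String {
    format!("{}{}{}", s1, s2, s3)
}

fn join_with_plus(s1: &str, s2: &str, s3: &str) -> String {
    // to_owned() creates the String that the + operator takes ownership of.
    s1.to_owned() + s2 + s3
}

fn main() {
    let s1 = String::from("string1");
    let s2 = String::from("string2");
    let s3 = String::from("string3");

    // Cloning first keeps s1 usable afterwards.
    let cloned = s1.clone() + &s2 + &s3;
    assert_eq!(cloned, join_with_format(&s1, &s2, &s3));

    // Without the clone, s1 is moved into the result and can no longer be used.
    let moved = s1 + &s2 + &s3;
    assert_eq!(moved, join_with_plus("string1", &s2, &s3));
    println!("{moved}");
}
```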

### Standard Rust methods


Standard Rust methods independent of character or byte indexing.

- replace : Replaces all matches of a pattern with another string.
- replacen : Replaces first N matches of a pattern with another string.
- strip_prefix : Returns a string slice with the prefix removed if the search string is found at the beginning of the string.
- strip_suffix : Returns a string slice with the suffix removed if the search string is found at the end of the string.

- contains : Check if the given pattern matches a sub-slice of this string slice.
- starts_with : Check if the given pattern matches a prefix of this string slice.
- ends_with : Check if the given pattern matches a suffix of this string slice.
- is_empty : Check if this String has a length of zero.
- chars() : Getting a substring with the chars iterator.

Examples:

~~~rust
fn main() {
    let s1: &str = "test éèçà 123 test";
    let s2: String = s1.to_owned();

    println!("{}", s1.replace("test", "new")); // Result: "new éèçà 123 new"
    println!("{}", s2.replacen("test", "new", 1)); // Result: "new éèçà 123 test"

    match s1.strip_prefix("test ") {
        // Result: Some("éèçà 123 test")
        Some(s) => println!("Found: {}", s),
        None => println!("Not found"),
    };

    let result = match s2.strip_prefix("test ") {
        // Result: Some("éèçà 123 test")
        Some(s) => s,
        None => &s2,
    };
    println!("{result}");

    assert_eq!(s1.contains("123"), true);
    assert_eq!(s2.contains("123"), true);
    assert_eq!(s1.starts_with("test"), true);
    assert_eq!(s2.ends_with("test"), true);
    assert_eq!(s1.is_empty(), false);
    assert_eq!(s2.is_empty(), false);
}
~~~

Getting a substring with the Rust chars() method, which returns an iterator over the characters of the string. skip(), take() and count() consume the chars iterator.

~~~rust
fn main() {
    let str: &str = "test éèçà 123 test";
    let string: String = str.to_owned();

    let start_index: usize = 5;
    let length: usize = 4;

    // All 4 results return éèçà

    // With type annotation
    let _s1: String = string.chars().skip(start_index).take(length).collect();
    let _s2: String = str.chars().skip(start_index).take(length).collect();

    // Without type annotation
    let _s3 = string.chars().skip(start_index).take(length).collect::<String>();
    let _s4 = str.chars().skip(start_index).take(length).collect::<String>();
    
    let _total1 = str.chars().count(); // Length in characters (=18). Consumes the chars iterator.
    let _total2 = string.chars().count(); // Length in characters (=18). Consumes the chars iterator.
}
~~~


## Using byte positioning


Get a substring using byte positions with standard Rust methods.

Using a string slice:

~~~rust
use string_manipulation_utf8::str_concat;

fn main() {
    let s1: &str = "test éèçà 123 test";
    let s2: &str = "éèçà ";

    // find() returns the byte position of s2 in s1, or None if it is absent
    if let Some(s2_pos) = s1.find(s2) {
        let s2_len: usize = s2.len(); // Length in bytes
        let s2_pos_end: usize = s2_pos + s2_len; // Byte position just past the end of s2 in s1

        // Remove s2 from s1. Result: test 123 test
        println!("{}", s1[..s2_pos].to_owned() + &s1[s2_pos_end..]);

        // Same using the macro str_concat! from this library. Result: test 123 test
        println!("{}", str_concat!(&s1[..s2_pos], &s1[s2_pos_end..]));

        // Get characters from s1 after s2
        println!("{}", &s1[s2_pos_end..]); // Result: 123 test
    }
}
~~~

Using a string:

~~~rust
use string_manipulation_utf8::str_concat;

fn main() {
    let s1: String = "test éèçà 123 test".to_owned();
    let s2: &str = "éèçà ";

    // find() returns the byte position of s2 in s1, or None if it is absent
    if let Some(s2_pos) = s1.find(s2) {
        let s2_len: usize = s2.len(); // Length in bytes
        let s2_pos_end: usize = s2_pos + s2_len; // Byte position just past the end of s2 in s1

        // Remove s2 from s1. Result: test 123 test
        println!("{}", s1[..s2_pos].to_owned() + &s1[s2_pos_end..]);

        // Same using the macro str_concat! from this library. Result: test 123 test
        println!("{}", str_concat!(&s1[..s2_pos], &s1[s2_pos_end..]));

        // Get characters from s1 after s2 inside s1
        println!("{}", &s1[s2_pos_end..]); // Result: 123 test
    }
}
~~~

Shorter version:

~~~rust
use string_manipulation_utf8::str_concat;

fn main() {
    // let s1: String = "test éèçà 123 test".to_owned(); // Also works with a string
    let s1: &str = "test éèçà 123 test";
    let s2: &str = "éèçà ";

    if let Some(pos) = s1.find(s2) {
        println!("{}", str_concat!(&s1[..pos], &s1[pos + s2.len()..]));
        // Result: test 123 test
    }
}
~~~
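
When byte positions from `find` have to be combined with character positions, the two can be converted into each other with the standard library. The helpers below, `byte_to_char_index` and `char_to_byte_index`, are hypothetical names used for this sketch and are not part of this library.

```rust
/// Hypothetical helper: character index of a byte position,
/// or None if the position is not on a character boundary.
fn byte_to_char_index(s: &str, byte_pos: usize) -> Option<usize> {
    if s.is_char_boundary(byte_pos) {
        Some(s[..byte_pos].chars().count())
    } else {
        None
    }
}

/// Hypothetical helper: byte position of a character index.
/// The index equal to the character count maps to s.len().
fn char_to_byte_index(s: &str, char_pos: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_pos, _)| byte_pos)
        .chain(std::iter::once(s.len()))
        .nth(char_pos)
}

fn main() {
    let s = "test éèçà 123 test";
    let byte_pos = s.find("123").unwrap(); // Position in bytes
    assert_eq!(byte_pos, 14);
    assert_eq!(byte_to_char_index(s, byte_pos), Some(10)); // Position in characters
    assert_eq!(char_to_byte_index(s, 10), Some(14));
    assert_eq!(byte_to_char_index(s, 6), None); // Inside the two bytes of é
    println!("byte {} = char {:?}", byte_pos, byte_to_char_index(s, byte_pos));
}
```

Slicing with a byte position that is not on a character boundary panics, which is why the first helper checks `is_char_boundary` before slicing.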