unicode_reverse/lib.rs
#![no_std]

//! The [`reverse_grapheme_clusters_in_place`][0] function reverses a string slice in-place without
//! allocating any memory on the heap. It correctly handles multi-byte UTF-8 sequences and
//! grapheme clusters, including combining marks and astral characters such as Emoji.
//!
//! ## Example
//!
//! ```rust
//! use unicode_reverse::reverse_grapheme_clusters_in_place;
//!
//! let mut x = "man\u{0303}ana".to_string();
//! println!("{}", x); // prints "mañana"
//!
//! reverse_grapheme_clusters_in_place(&mut x);
//! println!("{}", x); // prints "anañam"
//! ```
//!
//! ## Background
//!
//! As described in [this article by Mathias Bynens][1], naively reversing a Unicode string can go
//! wrong in several ways. For example, merely reversing the `chars` (Unicode Scalar Values) in a
//! string can cause combining marks to become attached to the wrong characters:
//!
//! ```rust
//! let x = "man\u{0303}ana";
//! println!("{}", x); // prints "mañana"
//!
//! let y: String = x.chars().rev().collect();
//! println!("{}", y); // prints "anãnam": Oops! The '~' is now applied to the 'a'.
//! ```
//!
//! Reversing the [grapheme clusters][2] of the string fixes this problem:
//!
//! ```rust
//! extern crate unicode_segmentation;
//! use unicode_segmentation::UnicodeSegmentation;
//!
//! # fn main() {
//! let x = "man\u{0303}ana";
//! let y: String = x.graphemes(true).rev().collect();
//! println!("{}", y); // prints "anañam"
//! # }
//! ```
//!
//! The `reverse_grapheme_clusters_in_place` function from this crate performs this same operation,
//! but performs the reversal in-place rather than allocating a new string.
//!
//! Note: Even grapheme-level reversal may produce unexpected output if the input string contains
//! certain non-printable control codes, such as directional formatting characters. Handling such
//! characters is outside the scope of this crate.
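//!
//! As a rough illustration (not part of this crate's API), the sketch below can be applied to a
//! string that starts with U+202E RIGHT-TO-LEFT OVERRIDE. The `reverse_scalars` helper is
//! hypothetical and reverses `char`s rather than grapheme clusters, which should give the same
//! result for such an input, assuming the override character forms a cluster on its own:
//!
//! ```rust
//! /// Hypothetical helper for illustration only: reverses the Unicode Scalar Values of `s`.
//! /// It stands in for grapheme-level reversal on inputs where every `char` is its own cluster.
//! fn reverse_scalars(s: &str) -> String {
//!     s.chars().rev().collect()
//! }
//! ```
//!
//! Under these assumptions, `reverse_scalars("\u{202E}abc")` returns `"cba\u{202E}"`: the override
//! now follows the letters instead of preceding them, so it no longer applies to them.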
//!
//! ## Algorithm
//!
//! The implementation is very simple. It makes two passes over the string's contents:
//!
//! 1. For each grapheme cluster, reverse the bytes within the grapheme cluster in-place.
//! 2. Reverse the bytes of the entire string in-place.
//!
//! After the second pass, each grapheme cluster has been reversed twice, so its bytes are now back
//! in their original order, but the clusters are now in the opposite order within the string.
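//!
//! The two passes can be sketched on a plain byte slice. This is a simplified illustration rather
//! than the crate's implementation: the `reverse_clusters` helper and its `cluster_lens` parameter
//! are hypothetical, and the byte length of each cluster is supplied by the caller instead of
//! being computed by a grapheme segmenter.
//!
//! ```rust
//! /// Hypothetical helper: reverses the order of the clusters stored in `bytes`.
//! /// `cluster_lens` lists the byte length of each cluster, in order, and must sum to
//! /// `bytes.len()`.
//! fn reverse_clusters(bytes: &mut [u8], cluster_lens: &[usize]) {
//!     assert_eq!(cluster_lens.iter().sum::<usize>(), bytes.len());
//!
//!     // Pass 1: reverse the bytes inside each cluster.
//!     let mut start = 0;
//!     for &len in cluster_lens {
//!         bytes[start..start + len].reverse();
//!         start += len;
//!     }
//!
//!     // Pass 2: reverse the whole buffer. Each cluster's bytes return to their original
//!     // order, while the clusters themselves end up in the opposite order.
//!     bytes.reverse();
//! }
//! ```
//!
//! For the bytes of `"man\u{0303}ana"`, whose clusters are 1, 1, 3, 1, 1 and 1 bytes long (the `n`
//! and its combining tilde share one 3-byte cluster), this leaves the bytes of `"anan\u{0303}am"`.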
//!
//! ## no_std
//!
//! This crate does not depend on libstd, so it can be used in [`no_std` projects][3].
//!
//! [0]: fn.reverse_grapheme_clusters_in_place.html
//! [1]: https://mathiasbynens.be/notes/javascript-unicode
//! [2]: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
//! [3]: https://doc.rust-lang.org/book/no-stdlib.html

#[cfg(test)]
mod tests;

use core::str;
use unicode_segmentation::UnicodeSegmentation;

/// Reverse a Unicode string in-place without allocating.
///
/// This function reverses a string slice in-place without allocating any memory on the heap. It
/// correctly handles multi-byte UTF-8 sequences and grapheme clusters, including combining marks
/// and astral characters such as Emoji.
///
/// See the [crate-level documentation](index.html) for more details.
///
/// ## Example
///
/// ```rust
/// extern crate unicode_reverse;
/// use unicode_reverse::reverse_grapheme_clusters_in_place;
///
/// fn main() {
///     let mut x = "man\u{0303}ana".to_string();
///     println!("{}", x); // prints "mañana"
///
///     reverse_grapheme_clusters_in_place(&mut x);
///     println!("{}", x); // prints "anañam"
/// }
/// ```
pub fn reverse_grapheme_clusters_in_place(s: &mut str) {
    unsafe {
        let v = s.as_bytes_mut();

        // Part 1: Reverse the bytes within each grapheme cluster.
        // This does not preserve UTF-8 validity.
        {
            // Invariant: `tail` points to data we have not modified yet, so it is always valid UTF-8.
            let mut tail = &mut v[..];
            while let Some(len) = str::from_utf8_unchecked(tail)
                .graphemes(true)
                .next()
                .map(str::len)
            {
                let (grapheme, new_tail) = tail.split_at_mut(len);
                grapheme.reverse();
                tail = new_tail;
            }
        }

        // Part 2: Reverse all bytes. This restores multi-byte sequences to their original order.
        v.reverse();

        // The string is now valid UTF-8 again.
        debug_assert!(str::from_utf8(v).is_ok());
    }
}