unicode_reverse/
lib.rs

#![no_std]

//! The [`reverse_grapheme_clusters_in_place`][0] function reverses a string slice in-place without
//! allocating any memory on the heap. It correctly handles multi-byte UTF-8 sequences and
//! grapheme clusters, including combining marks and astral characters such as emoji.
//!
//! ## Example
//!
//! ```rust
//! use unicode_reverse::reverse_grapheme_clusters_in_place;
//!
//! let mut x = "man\u{0303}ana".to_string();
//! println!("{}", x); // prints "mañana"
//!
//! reverse_grapheme_clusters_in_place(&mut x);
//! println!("{}", x); // prints "anañam"
//! ```
//!
//! ## Background
//!
//! As described in [this article by Mathias Bynens][1], naively reversing a Unicode string can go
//! wrong in several ways. For example, merely reversing the `chars` (Unicode Scalar Values) in a
//! string can cause combining marks to become attached to the wrong characters:
//!
//! ```rust
//! let x = "man\u{0303}ana";
//! println!("{}", x); // prints "mañana"
//!
//! let y: String = x.chars().rev().collect();
//! println!("{}", y); // prints "anãnam": Oops! The '~' is now applied to the 'a'.
//! ```
//!
//! Reversing the [grapheme clusters][2] of the string fixes this problem:
//!
//! ```rust
//! extern crate unicode_segmentation;
//! use unicode_segmentation::UnicodeSegmentation;
//!
//! # fn main() {
//! let x = "man\u{0303}ana";
//! let y: String = x.graphemes(true).rev().collect();
//! println!("{}", y); // prints "anañam"
//! # }
//! ```
//!
//! The `reverse_grapheme_clusters_in_place` function from this crate produces the same result,
//! but reverses the string in-place rather than allocating a new one.
//!
//! Note: Even grapheme-level reversal may produce unexpected output if the input string contains
//! certain non-printable control codes, such as directional formatting characters. Handling such
//! characters is outside the scope of this crate.
//!
//! ## Algorithm
//!
//! The implementation makes two passes over the string's contents:
//!
//! 1. For each grapheme cluster, reverse the bytes within the grapheme cluster in-place.
//! 2. Reverse the bytes of the entire string in-place.
//!
//! After the second pass, each grapheme cluster has been reversed twice, so its bytes are back
//! in their original order, but the clusters are in the opposite order within the string.
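//!
//! The two passes above can be sketched using only the standard library. This is an
//! illustration under simplifying assumptions, not this crate's implementation: the hypothetical
//! `reverse_scalars_in_place` treats each Unicode scalar value as its own unit instead of
//! segmenting grapheme clusters, and the hypothetical helper `utf8_len` reads the length of a
//! UTF-8 sequence from its leading byte. The sketch therefore still misplaces combining marks,
//! but it uses the same double reversal.
//!
//! ```rust
//! /// Length in bytes of the UTF-8 sequence that starts with `lead`.
//! /// Assumes `lead` is the first byte of a sequence in valid UTF-8.
//! fn utf8_len(lead: u8) -> usize {
//!     match lead {
//!         0x00..=0x7F => 1,
//!         0xC0..=0xDF => 2,
//!         0xE0..=0xEF => 3,
//!         _ => 4,
//!     }
//! }
//!
//! /// Reverses the scalar values of `s`, reusing its existing buffer.
//! fn reverse_scalars_in_place(s: &mut String) {
//!     // Take ownership of the buffer so it can be edited as raw bytes without `unsafe`.
//!     let mut bytes = std::mem::take(s).into_bytes();
//!
//!     // Pass 1: reverse the bytes within each unit (here, one scalar value).
//!     let mut start = 0;
//!     while start < bytes.len() {
//!         let end = start + utf8_len(bytes[start]);
//!         bytes[start..end].reverse();
//!         start = end;
//!     }
//!
//!     // Pass 2: reverse the whole buffer, which restores the byte order inside each unit.
//!     bytes.reverse();
//!
//!     // The buffer is valid UTF-8 again, so the conversion reuses it without copying.
//!     *s = String::from_utf8(bytes).expect("double reversal restores valid UTF-8");
//! }
//! ```
//!
//! Calling `reverse_scalars_in_place` on a `String` holding `"a\u{1F600}b"` leaves it holding
//! `"b\u{1F600}a"`: the four bytes of the emoji are reversed once in each pass and so end up in
//! their original order.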
//!
//! ## no_std
//!
//! This crate does not depend on libstd, so it can be used in [`no_std` projects][3].
//!
//! [0]: fn.reverse_grapheme_clusters_in_place.html
//! [1]: https://mathiasbynens.be/notes/javascript-unicode
//! [2]: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
//! [3]: https://doc.rust-lang.org/book/no-stdlib.html

#[cfg(test)]
mod tests;

use core::str;
use unicode_segmentation::UnicodeSegmentation;

/// Reverse a Unicode string in-place without allocating.
///
/// This function reverses a string slice in-place without allocating any memory on the heap. It
/// correctly handles multi-byte UTF-8 sequences and grapheme clusters, including combining marks
/// and astral characters such as emoji.
///
/// See the [crate-level documentation](index.html) for more details.
///
/// ## Example
///
/// ```rust
/// extern crate unicode_reverse;
/// use unicode_reverse::reverse_grapheme_clusters_in_place;
///
/// fn main() {
///     let mut x = "man\u{0303}ana".to_string();
///     println!("{}", x); // prints "mañana"
///
///     reverse_grapheme_clusters_in_place(&mut x);
///     println!("{}", x); // prints "anañam"
/// }
/// ```
pub fn reverse_grapheme_clusters_in_place(s: &mut str) {
    // SAFETY: Part 1 temporarily leaves the buffer in a state that is not valid UTF-8, and
    // Part 2 restores validity before the function returns.
    unsafe {
        let v = s.as_bytes_mut();

        // Part 1: Reverse the bytes within each grapheme cluster.
        // This does not preserve UTF-8 validity.
        // For example, "n\u{0303}a" = [6E CC 83 | 61] becomes [83 CC 6E | 61].
        {
            // Invariant: `tail` points to data we have not modified yet, so it is always valid UTF-8.
            let mut tail = &mut v[..];
            while let Some(len) = str::from_utf8_unchecked(tail)
                .graphemes(true)
                .next()
                .map(str::len)
            {
                let (grapheme, new_tail) = tail.split_at_mut(len);
                grapheme.reverse();
                tail = new_tail;
            }
        }

        // Part 2: Reverse all bytes. This restores the bytes of each grapheme cluster to their
        // original order while reversing the order of the clusters.
        // For example, [83 CC 6E | 61] becomes [61 | 6E CC 83], which is "an\u{0303}".
        v.reverse();

        // The string is now valid UTF-8 again.
        debug_assert!(str::from_utf8(v).is_ok());
    }
}