pub struct SortedString<'a> { /* private fields */ }Expand description
Main type to perform binary search through.
This type upholds the important invariants of SortedString::binary_search():
- a sorted string is required,
- as well as some
AsciiCharseparator to split by.
Access to binary search is gated behind this type. For this to not be too painful, the type is designed to be cheap, as it doesnβt own the potentially large haystack to perform binary search on.
The terms haystack and needle are used here as they are in
std::str::pattern.
Β§Instance Creation
There are two avenues:
SortedString::new_checked(): recommended, ensuring soundnessSortedString::new_unchecked(): unsafe, faster
Β§Note on ASCII requirement
On instance creation, the separator has to be within the ASCII
range. In other
words, multi-byte (in the sense of UTF-8) separators are not supported. This
leaves common separators like \n, \t, -, , and many more
available, while arbitrary chars are not allowed:
let sep = 'π¦';
let haystack = "a,b,c";
// Not allowed:
let _ = b4s::SortedString::new_checked(haystack, sep);UTF-8 being a variable-length encoding, allowing any separator char would
require an initial linear scan of the input
string:
- either at instance creation time, then saving the results,
- or before each search.
The former takes up a lot of space (usize pointers, which on 64-bit are much
larger than the separator itself), the latter time. Both are undesirable. Even more
undesirable would be handling UTF-8 parsing and validation manually. The ASCII
requirement keeps things simple and
is unlikely to be an impediment in real-world use.
As UTF-8 is fully ASCII-compatible, allowing only that means the string can be efficiently scanned as bytes. The needle and haystack can still be any UTF-8 string. No multi-byte code point (with its continuation bytes) contains bytes in the ASCII range. Scanning for such an ASCII separator is therefore sound:
- it wonβt be found at non-char boundaries,
- and it itself will be a boundary.
These properties simplify scanning and allow it to be fast at both instance creation and search time. The downside is hopefully negligible and in the common case unnoticeable. Please contact the author if your use case requires multi-byte support.
The ASCII requirement is manifested in the type system, enforcing correct usage at compile time.
ImplementationsΒ§
SourceΒ§impl<'a> SortedString<'a>
impl<'a> SortedString<'a>
Sourcepub fn new_checked(
haystack: &'a str,
sep: AsciiChar,
) -> Result<Self, SortedStringCreationError>
pub fn new_checked( haystack: &'a str, sep: AsciiChar, ) -> Result<Self, SortedStringCreationError>
Creates a new instance of SortedString, performing sanity checks.
See SortedString::new_unchecked() for a version without checks.
The performed checks and failure modes are outlined in the error examples below.
Β§Example
let sep = b4s::AsciiChar::Comma;
let haystack = "a,b,c";
let sorted_string = b4s::SortedString::new_checked(haystack, sep);
assert!(sorted_string.is_ok());Β§Errors
This method returns a SortedStringCreationError in the cases outlined below.
Β§Example: Creation Fails (Haystack Not Sorted)
Binary search requires prior sorting, so creation has to fail.
let sep = b4s::AsciiChar::Comma;
let unsorted_haystack = "a,c,b";
let sorted_string = b4s::SortedString::new_checked(unsorted_haystack, sep);
assert_eq!(sorted_string, Err(b4s::SortedStringCreationError::NotSorted));This error might also creep in when passing a newline-separated file containing a trailing newline. The empty string will be in last position, contradicting sorting.
let sep = b4s::AsciiChar::LineFeed;
let haystack = "Aachen\nAmpel\nAngel\nApfel\n";
let ss = b4s::SortedString::new_checked(haystack, sep);
assert_eq!(ss, Err(b4s::SortedStringCreationError::NotSorted));Β§Example: Creation Fails (Haystack Empty)
Supplying an empty haystack str is likely an error and therefore fails.
let sep = b4s::AsciiChar::Comma;
let haystack = "";
let sorted_string = b4s::SortedString::new_checked(haystack, sep);
assert_eq!(sorted_string, Err(b4s::SortedStringCreationError::EmptyHaystack));Sourcepub fn binary_search<U>(&self, needle: U) -> SearchResult
pub fn binary_search<U>(&self, needle: U) -> SearchResult
Searches for a needle inside this SortedString.
The return type aims to imitate binary_search of the standard
library,
returning a Result. The success case is a std::ops::Range, not a single
usize, as the haystack is variable-length. For the error case, see below.
Β§Examples
For common examples, see the crate-level documentation. The following are more specific examples.
Β§Multi-byte needle and haystack
The needle and haystack can be any UTF-8 string. However, note that the
returned Range is returned in raw byte positions, similar to
std::str::len:
This length is in bytes, not chars or graphemes. In other words, it might not be what a human considers the length of the string.
let sep = b4s::AsciiChar::VerticalBar;
let haystack = "Apfel|BΓ€ume|ZΓ€une|zΓ€unen|Γpfel|Γfen|Ζoo|Ω
Ψ±ΨΨ¨ΩΨ§|δ½ ε₯½|π";
let ss = b4s::SortedString::new_checked(haystack, sep)?;
let needle = "BΓ€ume";
let result = ss.binary_search(needle);
assert_eq!(result, Ok(std::ops::Range { start: 6, end: 12 }));Β§Needle contains separator
This situation not handled specially. Such needles will be impossible to find:
use std::{error::Error, ops::Range};
fn main() -> Result<(), Box<dyn Error>> {
let sep = b4s::AsciiChar::A;
let haystack = "Aachen\nAmpel\nAngel\nApfel\n";
let ss = b4s::SortedString::new_checked(haystack, sep)?;
let needle = "Angel";
let result = ss.binary_search(needle);
assert_eq!(result, Err(b4s::SearchError(Range { start: 0, end: 0 })));
let needle = "ngel\n";
let range = ss.binary_search(needle)?; // Works (note ? operator also works)
assert_eq!(range, Range { start: 14, end: 19 });
Ok(())
}Β§Errors
Refer to SearchError for more info.
Β§Panics
This method panics when slicing into the haystack fails. Occurrences of such
panics are logic
bugs
and should never occur (see the below assumptions). Please report them if they
do.
The panic allows the API to be simple. No panic implies converting the failure
to some error value, like an enum. That enum would then have a variant such
as EncodingError. However, the user would have no sensible way of handling
that, either, and might as well panic anyway. Keeping the panic inside the
function allows the return type to be very simple (no enum required).
See also this thread, where people who know what theyβre talking about in terms of Rust (matklad, BurntSushi) argue in a similar fashion. The author took their advice to gauge idiomaticity.
Β§Background
Fundamental assumptions for panic-freedom are:
-
The haystack is
&str, aka valid UTF-8. Users cannot pass malformed UTF-8. -
The separator is ASCII, as ensured by signatures working off
AsciiChar, providing type-level guarantees. Users can only pass single-byte UTF-8 values. -
The separator being ASCII (highest bit 0), it cannot be part of any code point encoded as multiple bytes using UTF-8 (highest bit 1), as validated in this sanity check:
for code_point in 0..=0x10FFFF { if let Some(c) = std::char::from_u32(code_point) { if c.is_ascii() { continue; } for byte in c.to_string().bytes() { assert!(!byte.is_ascii()); } } } -
Single-byte values with highest bit 1 are outside the ASCII range and invalid UTF-8:
for byte in 0x80u8..=0xFFu8 { assert!(!byte.is_ascii()); assert!(std::str::from_utf8(&[byte]).is_err()); }
Β§Fuzz Testing
To further strengthen confidence in panic-freedom, fuzz testing was conducted
using afl.
If you have access to the crateβs source code repository, run fuzz testing yourself from the root using
cargo install just && just fuzzThe author let a fuzz test run for over 5 billion iterations, finding no panics:
american fuzzy lop ++4.06c {default} (target/debug/afl-target) [fast]
ββ process timing βββββββββββββββββββββββββββββββββββββ¬β overall results βββββ
β run time : 0 days, 19 hrs, 7 min, 6 sec β cycles done : 192k β
β last new find : 0 days, 18 hrs, 56 min, 54 sec β corpus count : 39 β
βlast saved crash : none seen yet βsaved crashes : 0 β
β last saved hang : none seen yet β saved hangs : 0 β
ββ cycle progress ββββββββββββββββββββββ¬β map coverageβ΄βββββββββββββββββββββββ€
β now processing : 31.444679 (79.5%) β map density : 3.31% / 6.22% β
β runs timed out : 0 (0.00%) β count coverage : 2.77 bits/tuple β
ββ stage progress ββββββββββββββββββββββΌβ findings in depth ββββββββββββββββββ€
β now trying : splice 8 β favored items : 12 (30.77%) β
β stage execs : 27/28 (96.43%) β new edges on : 14 (35.90%) β
β total execs : 5.15G β total crashes : 0 (0 saved) β
β exec speed : 73.5k/sec β total tmouts : 2 (0 saved) β
ββ fuzzing strategy yields βββββββββββββ΄ββββββββββββββ¬β item geometry ββββββββ€
β bit flips : disabled (default, enable with -D) β levels : 6 β
β byte flips : disabled (default, enable with -D) β pending : 0 β
β arithmetics : disabled (default, enable with -D) β pend fav : 0 β
β known ints : disabled (default, enable with -D) β own finds : 33 β
β dictionary : n/a β imported : 0 β
βhavoc/splice : 31/1.98G, 2/3.17G β stability : 100.00% β
βpy/custom/rq : unused, unused, 0/1126, 0/19.4k βββββββββββββββββββββββββ
β trim/eff : 1.58%/6355, disabled β [cpu003: 12%]
ββββββββββββββββββββββββββββββββββββββββββββββββββββββNote: at the time of writing,
cargo-fuzz and
cargo-afl were available. The former
currently only wraps libFuzzer, the
latter works with afl. Both were already
deprecated at the time of writing, but continue to work well. afl was chosen
as it didnβt require a nightly toolchain.
Sourcepub const fn new_unchecked(string: &'a str, sep: AsciiChar) -> Self
pub const fn new_unchecked(string: &'a str, sep: AsciiChar) -> Self
Creates an instance of SortedString without performing sanity
checks.
This is essentially what conventionally would be a simple
new(),
but specifically named to alert users to the dangers.
Β§Example: Simple Use
let sep = b4s::AsciiChar::Comma;
let haystack = "a,b,c";
let sorted_string = b4s::SortedString::new_unchecked(haystack, sep);Β§Example: Incorrect Use
use std::ops::Range;
let sep = b4s::AsciiChar::Comma;
let unsorted_haystack = "a,c,b";
let sorted_string = b4s::SortedString::new_unchecked(unsorted_haystack, sep);
// Unable to find element in unsorted haystack
assert_eq!(
sorted_string.binary_search("b"),
Err(b4s::SearchError(Range { start: 0, end: 1 }))
);Sourcepub fn sort(string: &str, sep: AsciiChar) -> String
pub fn sort(string: &str, sep: AsciiChar) -> String
Convenience method to sort a str by a given separator, returning an owned
version.
As SortedString is designed to be thin and doesnβt own its data (expect for
the sep), this convenience method helps creating a sorted String in the
required format.
Β§Example
let sep = b4s::AsciiChar::LineFeed;
// Perhaps this was read directly from a file, where sorting is unreliable/unknown.
let unsorted_haystack = "c\nb\na";
let sorted_haystack = b4s::SortedString::sort(unsorted_haystack, sep);
// Passes fine
let sorted_string = b4s::SortedString::new_checked(&sorted_haystack, sep)?;
assert_eq!(
sorted_string.binary_search("c"),
Ok(std::ops::Range { start: 4, end: 5 })
);Trait ImplementationsΒ§
SourceΒ§impl<'a> Clone for SortedString<'a>
impl<'a> Clone for SortedString<'a>
SourceΒ§fn clone(&self) -> SortedString<'a>
fn clone(&self) -> SortedString<'a>
1.0.0 Β· SourceΒ§fn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
source. Read moreSourceΒ§impl<'a> Debug for SortedString<'a>
impl<'a> Debug for SortedString<'a>
SourceΒ§impl Display for SortedString<'_>
impl Display for SortedString<'_>
SourceΒ§impl<'a> Hash for SortedString<'a>
impl<'a> Hash for SortedString<'a>
SourceΒ§impl<'a> Ord for SortedString<'a>
impl<'a> Ord for SortedString<'a>
SourceΒ§fn cmp(&self, other: &SortedString<'a>) -> Ordering
fn cmp(&self, other: &SortedString<'a>) -> Ordering
1.21.0 Β· SourceΒ§fn max(self, other: Self) -> Selfwhere
Self: Sized,
fn max(self, other: Self) -> Selfwhere
Self: Sized,
SourceΒ§impl<'a> PartialEq for SortedString<'a>
impl<'a> PartialEq for SortedString<'a>
SourceΒ§impl<'a> PartialOrd for SortedString<'a>
impl<'a> PartialOrd for SortedString<'a>
impl<'a> Eq for SortedString<'a>
impl<'a> StructuralPartialEq for SortedString<'a>
Auto Trait ImplementationsΒ§
impl<'a> Freeze for SortedString<'a>
impl<'a> RefUnwindSafe for SortedString<'a>
impl<'a> Send for SortedString<'a>
impl<'a> Sync for SortedString<'a>
impl<'a> Unpin for SortedString<'a>
impl<'a> UnwindSafe for SortedString<'a>
Blanket ImplementationsΒ§
SourceΒ§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
SourceΒ§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
SourceΒ§impl<T> CloneToUninit for Twhere
T: Clone,
impl<T> CloneToUninit for Twhere
T: Clone,
SourceΒ§impl<T> IntoEither for T
impl<T> IntoEither for T
SourceΒ§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read moreSourceΒ§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read more