Struct b4s::SortedString
source · pub struct SortedString<'a> { /* private fields */ }Expand description
Main type to perform binary search through.
This type upholds the important invariants of SortedString::binary_search():
- a sorted string is required,
- as well as some
AsciiCharseparator to split by.
Access to binary search is gated behind this type. For this to not be too painful, the type is designed to be cheap, as it doesn’t own the potentially large haystack to perform binary search on.
The terms haystack and needle are used here as they are in
std::str::pattern.
Instance Creation
There are two avenues:
SortedString::new_checked(): recommended, ensuring soundnessSortedString::new_unchecked(): unsafe, faster
Note on ASCII requirement
On instance creation, the separator has to be within the ASCII
range. In other
words, multi-byte (in the sense of UTF-8) separators are not supported. This
leaves common separators like \n, \t, -, , and many more
available, while arbitrary chars are not allowed:
let sep = '🦀';
let haystack = "a,b,c";
// Not allowed:
let _ = b4s::SortedString::new_checked(haystack, sep);UTF-8 being a variable-length encoding, allowing any separator char would
require an initial linear scan of the input
string:
- either at instance creation time, then saving the results,
- or before each search.
The former takes up a lot of space (usize pointers, which on 64-bit are much
larger than the separator itself), the latter time. Both are undesirable.
As UTF-8 is fully ASCII-compatible, allowing only that means the string can be efficiently scanned as bytes. The needle and haystack can still be any UTF-8 string. No multi-byte code point (with its continuation bytes) contains bytes in the ASCII range. Scanning for such an ASCII separator is therefore sound:
- it won’t be found at non-char boundaries,
- and it itself will be a boundary.
These properties simplify scanning and allow it to be fast at both instance creation and search time. The downside is hopefully negligible and in the common case unnoticeable. Please contact the author if your use case requires multi-byte support.
The ASCII requirement is manifested in the type system, enforcing correct usage at compile time.
Implementations§
source§impl<'a> SortedString<'a>
impl<'a> SortedString<'a>
sourcepub fn new_checked(
haystack: &'a str,
sep: AsciiChar
) -> Result<Self, SortedStringCreationError>
pub fn new_checked( haystack: &'a str, sep: AsciiChar ) -> Result<Self, SortedStringCreationError>
Creates a new instance of SortedString, performing sanity checks.
See SortedString::new_unchecked() for a version without checks.
The performed checks and failure modes are outlined in the error examples below.
Example
let sep = b4s::AsciiChar::Comma;
let haystack = "a,b,c";
let sorted_string = b4s::SortedString::new_checked(haystack, sep);
assert!(sorted_string.is_ok());Errors
This method returns a SortedStringCreationError in the cases outlined below.
Example: Creation Fails (Haystack Not Sorted)
Binary search requires prior sorting, so creation has to fail.
let sep = b4s::AsciiChar::Comma;
let unsorted_haystack = "a,c,b";
let sorted_string = b4s::SortedString::new_checked(unsorted_haystack, sep);
assert_eq!(sorted_string, Err(b4s::SortedStringCreationError::NotSorted));This error might also creep in when passing a newline-separated file containing a trailing newline. The empty string will be in last position, contradicting sorting.
let sep = b4s::AsciiChar::LineFeed;
let haystack = "Aachen\nAmpel\nAngel\nApfel\n";
let ss = b4s::SortedString::new_checked(haystack, sep);
assert_eq!(ss, Err(b4s::SortedStringCreationError::NotSorted));Example: Creation Fails (Haystack Empty)
Supplying an empty haystack str is likely an error and therefore fails.
let sep = b4s::AsciiChar::Comma;
let haystack = "";
let sorted_string = b4s::SortedString::new_checked(haystack, sep);
assert_eq!(sorted_string, Err(b4s::SortedStringCreationError::EmptyHaystack));sourcepub fn binary_search<U>(&self, needle: U) -> SearchResultwhere
U: AsRef<str>,
pub fn binary_search<U>(&self, needle: U) -> SearchResultwhere U: AsRef<str>,
Searches for a needle inside this SortedString.
The return type aims to imitate binary_search of the standard
library,
returning a Result. The success case is a std::ops::Range, not a single
usize, as the haystack is variable-length. For the error case, see below.
Examples
For common examples, see the crate-level documentation. The following are more specific examples.
Multi-byte needle and haystack
The needle and haystack can be any UTF-8 string. However, note that the
returned Range is returned in raw byte positions, similar to
std::str::len:
This length is in bytes, not chars or graphemes. In other words, it might not be what a human considers the length of the string.
let sep = b4s::AsciiChar::VerticalBar;
let haystack = "Apfel|Bäume|Zäune|zäunen|Äpfel|Öfen|ƒoo|مرحبًا|你好|😂";
let ss = b4s::SortedString::new_checked(haystack, sep).unwrap();
let needle = "Bäume";
let result = ss.binary_search(needle);
assert_eq!(result, Ok(std::ops::Range { start: 6, end: 12 }));Needle contains separator
This situation not handled specially. Such needles will be impossible to find:
use std::ops::Range;
let sep = b4s::AsciiChar::A;
let haystack = "Aachen\nAmpel\nAngel\nApfel\n";
let ss = b4s::SortedString::new_checked(haystack, sep).unwrap();
let needle = "Angel";
let result = ss.binary_search(needle);
assert_eq!(result, Err(b4s::SearchError(Range { start: 0, end: 0 })));
let needle = "ngel\n";
let result = ss.binary_search(needle);
assert_eq!(result, Ok(Range { start: 14, end: 19 }));Errors
Refer to SearchError for more info.
Panics
This method panics when slicing into the haystack fails. Occurrences of such
panics are logic
bugs
and should never occur (see the below assumptions). Please report them if they
do.
The panic allows the API to be simple. No panic implies converting the failure
to some error value, like an enum. That enum would then have a variant such
as EncodingError. However, the user would have no sensible way of handling
that, either, and might as well panic anyway. Keeping the panic inside the
function allows the return type to be very simple (no enum required).
See also this thread, where people who know what they’re talking about in terms of Rust (matklad, BurntSushi) argue in a similar fashion. The author took their advice to gauge idiomaticity.
Background
Fundamental assumptions for panic-freedom are:
-
The haystack is
&str, aka valid UTF-8. Users cannot pass malformed UTF-8. -
The separator is ASCII, as ensured by signatures working off
AsciiChar, providing type-level guarantees. Users can only pass single-byte UTF-8 values. -
The separator being ASCII (highest bit 0), it cannot be part of any code point encoded as multiple bytes using UTF-8 (highest bit 1), as validated in this sanity check:
for code_point in 0..=0x10FFFF { if let Some(c) = std::char::from_u32(code_point) { if c.is_ascii() { continue; } for byte in c.to_string().bytes() { assert!(!byte.is_ascii()); } } } -
Single-byte values with highest bit 1 are outside the ASCII range and invalid UTF-8:
for byte in 0x80u8..=0xFFu8 { assert!(!byte.is_ascii()); assert!(std::str::from_utf8(&[byte]).is_err()); }
Fuzz Testing
To further strengthen confidence in panic-freedom, fuzz testing was conducted
using afl.
If you have access to the crate’s source code repository, run fuzz testing yourself from the root using
cargo install just && just fuzz
The author let a fuzz test run for over 5 billion iterations, finding no panics:
american fuzzy lop ++4.06c {default} (target/debug/afl-target) [fast]
┌─ process timing ────────────────────────────────────┬─ overall results ────┐
│ run time : 0 days, 19 hrs, 7 min, 6 sec │ cycles done : 192k │
│ last new find : 0 days, 18 hrs, 56 min, 54 sec │ corpus count : 39 │
│last saved crash : none seen yet │saved crashes : 0 │
│ last saved hang : none seen yet │ saved hangs : 0 │
├─ cycle progress ─────────────────────┬─ map coverage┴──────────────────────┤
│ now processing : 31.444679 (79.5%) │ map density : 3.31% / 6.22% │
│ runs timed out : 0 (0.00%) │ count coverage : 2.77 bits/tuple │
├─ stage progress ─────────────────────┼─ findings in depth ─────────────────┤
│ now trying : splice 8 │ favored items : 12 (30.77%) │
│ stage execs : 27/28 (96.43%) │ new edges on : 14 (35.90%) │
│ total execs : 5.15G │ total crashes : 0 (0 saved) │
│ exec speed : 73.5k/sec │ total tmouts : 2 (0 saved) │
├─ fuzzing strategy yields ────────────┴─────────────┬─ item geometry ───────┤
│ bit flips : disabled (default, enable with -D) │ levels : 6 │
│ byte flips : disabled (default, enable with -D) │ pending : 0 │
│ arithmetics : disabled (default, enable with -D) │ pend fav : 0 │
│ known ints : disabled (default, enable with -D) │ own finds : 33 │
│ dictionary : n/a │ imported : 0 │
│havoc/splice : 31/1.98G, 2/3.17G │ stability : 100.00% │
│py/custom/rq : unused, unused, 0/1126, 0/19.4k ├───────────────────────┘
│ trim/eff : 1.58%/6355, disabled │ [cpu003: 12%]
└────────────────────────────────────────────────────┘
Note: at the time of writing,
cargo-fuzz and
cargo-afl were available. The former
currently only wraps libFuzzer, the
latter works with afl. Both were already
deprecated at the time of writing, but continue to work well. afl was chosen
as it didn’t require a nightly toolchain.
sourcepub const fn new_unchecked(string: &'a str, sep: AsciiChar) -> Self
pub const fn new_unchecked(string: &'a str, sep: AsciiChar) -> Self
Creates an instance of SortedString without performing sanity checks.
This is essentially what conventionally would be a simple
new(),
but specifically named to alert users to the dangers.
Example: Simple Use
let sep = b4s::AsciiChar::Comma;
let haystack = "a,b,c";
let sorted_string = b4s::SortedString::new_unchecked(haystack, sep);Example: Incorrect Use
use std::ops::Range;
let sep = b4s::AsciiChar::Comma;
let unsorted_haystack = "a,c,b";
let sorted_string = b4s::SortedString::new_unchecked(unsorted_haystack, sep);
// Unable to find element in unsorted haystack
assert_eq!(
sorted_string.binary_search("b"),
Err(b4s::SearchError(Range { start: 0, end: 1 }))
);sourcepub fn sort(string: &str, sep: AsciiChar) -> String
pub fn sort(string: &str, sep: AsciiChar) -> String
Convenience method to sort a str by a given separator, returning an owned
version.
As SortedString is designed to be thin and doesn’t own its data (expect for
the sep), this convenience method helps creating a sorted String in the
required format.
Example
let sep = b4s::AsciiChar::LineFeed;
// Perhaps this was read directly from a file, where sorting is unreliable/unknown.
let unsorted_haystack = "c\nb\na";
let sorted_haystack = b4s::SortedString::sort(unsorted_haystack, sep);
// Passes fine
let sorted_string = b4s::SortedString::new_checked(&sorted_haystack, sep).unwrap();
assert_eq!(
sorted_string.binary_search("c"),
Ok(std::ops::Range { start: 4, end: 5 })
);Trait Implementations§
source§impl<'a> Clone for SortedString<'a>
impl<'a> Clone for SortedString<'a>
source§fn clone(&self) -> SortedString<'a>
fn clone(&self) -> SortedString<'a>
1.0.0 · source§fn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
source. Read moresource§impl<'a> Debug for SortedString<'a>
impl<'a> Debug for SortedString<'a>
source§impl Display for SortedString<'_>
impl Display for SortedString<'_>
source§impl<'a> Hash for SortedString<'a>
impl<'a> Hash for SortedString<'a>
source§impl<'a> Ord for SortedString<'a>
impl<'a> Ord for SortedString<'a>
source§fn cmp(&self, other: &SortedString<'a>) -> Ordering
fn cmp(&self, other: &SortedString<'a>) -> Ordering
1.21.0 · source§fn max(self, other: Self) -> Selfwhere
Self: Sized,
fn max(self, other: Self) -> Selfwhere Self: Sized,
source§impl<'a> PartialEq<SortedString<'a>> for SortedString<'a>
impl<'a> PartialEq<SortedString<'a>> for SortedString<'a>
source§fn eq(&self, other: &SortedString<'a>) -> bool
fn eq(&self, other: &SortedString<'a>) -> bool
self and other values to be equal, and is used
by ==.source§impl<'a> PartialOrd<SortedString<'a>> for SortedString<'a>
impl<'a> PartialOrd<SortedString<'a>> for SortedString<'a>
source§fn partial_cmp(&self, other: &SortedString<'a>) -> Option<Ordering>
fn partial_cmp(&self, other: &SortedString<'a>) -> Option<Ordering>
1.0.0 · source§fn le(&self, other: &Rhs) -> bool
fn le(&self, other: &Rhs) -> bool
self and other) and is used by the <=
operator. Read more