SortedString

Struct SortedString 

Source
pub struct SortedString<'a> { /* private fields */ }
Expand description

Main type to perform binary search through.

This type upholds the important invariants of SortedString::binary_search():

  • a sorted string is required,
  • as well as some AsciiChar separator to split by.

Access to binary search is gated behind this type. For this to not be too painful, the type is designed to be cheap, as it doesn’t own the potentially large haystack to perform binary search on.

The terms haystack and needle are used here as they are in std::str::pattern.

Β§Instance Creation

There are two avenues:

  1. SortedString::new_checked(): recommended, ensuring soundness
  2. SortedString::new_unchecked(): unsafe, faster

Β§Note on ASCII requirement

On instance creation, the separator has to be within the ASCII range. In other words, multi-byte (in the sense of UTF-8) separators are not supported. This leaves common separators like \n, \t, -, , and many more available, while arbitrary chars are not allowed:

β“˜
let sep = 'πŸ¦€';
let haystack = "a,b,c";
// Not allowed:
let _ = b4s::SortedString::new_checked(haystack, sep);

UTF-8 being a variable-length encoding, allowing any separator char would require an initial linear scan of the input string:

  • either at instance creation time, then saving the results,
  • or before each search.

The former takes up a lot of space (usize pointers, which on 64-bit are much larger than the separator itself), the latter time. Both are undesirable. Even more undesirable would be handling UTF-8 parsing and validation manually. The ASCII requirement keeps things simple and is unlikely to be an impediment in real-world use.

As UTF-8 is fully ASCII-compatible, allowing only that means the string can be efficiently scanned as bytes. The needle and haystack can still be any UTF-8 string. No multi-byte code point (with its continuation bytes) contains bytes in the ASCII range. Scanning for such an ASCII separator is therefore sound:

These properties simplify scanning and allow it to be fast at both instance creation and search time. The downside is hopefully negligible and in the common case unnoticeable. Please contact the author if your use case requires multi-byte support.

The ASCII requirement is manifested in the type system, enforcing correct usage at compile time.

ImplementationsΒ§

SourceΒ§

impl<'a> SortedString<'a>

Source

pub fn new_checked( haystack: &'a str, sep: AsciiChar, ) -> Result<Self, SortedStringCreationError>

Creates a new instance of SortedString, performing sanity checks.

See SortedString::new_unchecked() for a version without checks.

The performed checks and failure modes are outlined in the error examples below.

Β§Example
let sep = b4s::AsciiChar::Comma;
let haystack = "a,b,c";
let sorted_string = b4s::SortedString::new_checked(haystack, sep);

assert!(sorted_string.is_ok());
Β§Errors

This method returns a SortedStringCreationError in the cases outlined below.

Β§Example: Creation Fails (Haystack Not Sorted)

Binary search requires prior sorting, so creation has to fail.

let sep = b4s::AsciiChar::Comma;
let unsorted_haystack = "a,c,b";
let sorted_string = b4s::SortedString::new_checked(unsorted_haystack, sep);

assert_eq!(sorted_string, Err(b4s::SortedStringCreationError::NotSorted));

This error might also creep in when passing a newline-separated file containing a trailing newline. The empty string will be in last position, contradicting sorting.

let sep = b4s::AsciiChar::LineFeed;
let haystack = "Aachen\nAmpel\nAngel\nApfel\n";
let ss = b4s::SortedString::new_checked(haystack, sep);

assert_eq!(ss, Err(b4s::SortedStringCreationError::NotSorted));
Β§Example: Creation Fails (Haystack Empty)

Supplying an empty haystack str is likely an error and therefore fails.

let sep = b4s::AsciiChar::Comma;
let haystack = "";
let sorted_string = b4s::SortedString::new_checked(haystack, sep);

assert_eq!(sorted_string, Err(b4s::SortedStringCreationError::EmptyHaystack));

Searches for a needle inside this SortedString.

The return type aims to imitate binary_search of the standard library, returning a Result. The success case is a std::ops::Range, not a single usize, as the haystack is variable-length. For the error case, see below.

Β§Examples

For common examples, see the crate-level documentation. The following are more specific examples.

Β§Multi-byte needle and haystack

The needle and haystack can be any UTF-8 string. However, note that the returned Range is returned in raw byte positions, similar to std::str::len:

This length is in bytes, not chars or graphemes. In other words, it might not be what a human considers the length of the string.

let sep = b4s::AsciiChar::VerticalBar;
let haystack = "Apfel|BΓ€ume|ZΓ€une|zΓ€unen|Γ„pfel|Γ–fen|Ζ’oo|Ω…Ψ±Ψ­Ψ¨Ω‹Ψ§|δ½ ε₯½|πŸ˜‚";
let ss = b4s::SortedString::new_checked(haystack, sep)?;

let needle = "BΓ€ume";
let result = ss.binary_search(needle);

assert_eq!(result, Ok(std::ops::Range { start: 6, end: 12 }));
Β§Needle contains separator

This situation not handled specially. Such needles will be impossible to find:

use std::{error::Error, ops::Range};

fn main() -> Result<(), Box<dyn Error>> {
    let sep = b4s::AsciiChar::A;
    let haystack = "Aachen\nAmpel\nAngel\nApfel\n";
    let ss = b4s::SortedString::new_checked(haystack, sep)?;

    let needle = "Angel";
    let result = ss.binary_search(needle);
    assert_eq!(result, Err(b4s::SearchError(Range { start: 0, end: 0 })));

    let needle = "ngel\n";
    let range = ss.binary_search(needle)?; // Works (note ? operator also works)
    assert_eq!(range, Range { start: 14, end: 19 });

    Ok(())
}
Β§Errors

Refer to SearchError for more info.

Β§Panics

This method panics when slicing into the haystack fails. Occurrences of such panics are logic bugs and should never occur (see the below assumptions). Please report them if they do.

The panic allows the API to be simple. No panic implies converting the failure to some error value, like an enum. That enum would then have a variant such as EncodingError. However, the user would have no sensible way of handling that, either, and might as well panic anyway. Keeping the panic inside the function allows the return type to be very simple (no enum required).

See also this thread, where people who know what they’re talking about in terms of Rust (matklad, BurntSushi) argue in a similar fashion. The author took their advice to gauge idiomaticity.

Β§Background

Fundamental assumptions for panic-freedom are:

  • The haystack is &str, aka valid UTF-8. Users cannot pass malformed UTF-8.

  • The separator is ASCII, as ensured by signatures working off AsciiChar, providing type-level guarantees. Users can only pass single-byte UTF-8 values.

  • The separator being ASCII (highest bit 0), it cannot be part of any code point encoded as multiple bytes using UTF-8 (highest bit 1), as validated in this sanity check:

    for code_point in 0..=0x10FFFF {
        if let Some(c) = std::char::from_u32(code_point) {
            if c.is_ascii() {
                continue;
            }
            for byte in c.to_string().bytes() {
                assert!(!byte.is_ascii());
            }
        }
    }
  • Single-byte values with highest bit 1 are outside the ASCII range and invalid UTF-8:

    for byte in 0x80u8..=0xFFu8 {
        assert!(!byte.is_ascii());
        assert!(std::str::from_utf8(&[byte]).is_err());
    }
Β§Fuzz Testing

To further strengthen confidence in panic-freedom, fuzz testing was conducted using afl.

If you have access to the crate’s source code repository, run fuzz testing yourself from the root using

cargo install just && just fuzz

The author let a fuzz test run for over 5 billion iterations, finding no panics:

     american fuzzy lop ++4.06c {default} (target/debug/afl-target) [fast]
β”Œβ”€ process timing ────────────────────────────────────┬─ overall results ────┐
β”‚        run time : 0 days, 19 hrs, 7 min, 6 sec      β”‚  cycles done : 192k  β”‚
β”‚   last new find : 0 days, 18 hrs, 56 min, 54 sec    β”‚ corpus count : 39    β”‚
β”‚last saved crash : none seen yet                     β”‚saved crashes : 0     β”‚
β”‚ last saved hang : none seen yet                     β”‚  saved hangs : 0     β”‚
β”œβ”€ cycle progress ─────────────────────┬─ map coverage┴───────────────────────
β”‚  now processing : 31.444679 (79.5%)  β”‚    map density : 3.31% / 6.22%      β”‚
β”‚  runs timed out : 0 (0.00%)          β”‚ count coverage : 2.77 bits/tuple    β”‚
β”œβ”€ stage progress ─────────────────────┼─ findings in depth ──────────────────
β”‚  now trying : splice 8               β”‚ favored items : 12 (30.77%)         β”‚
β”‚ stage execs : 27/28 (96.43%)         β”‚  new edges on : 14 (35.90%)         β”‚
β”‚ total execs : 5.15G                  β”‚ total crashes : 0 (0 saved)         β”‚
β”‚  exec speed : 73.5k/sec              β”‚  total tmouts : 2 (0 saved)         β”‚
β”œβ”€ fuzzing strategy yields ────────────┴─────────────┬─ item geometry ────────
β”‚   bit flips : disabled (default, enable with -D)   β”‚    levels : 6         β”‚
β”‚  byte flips : disabled (default, enable with -D)   β”‚   pending : 0         β”‚
β”‚ arithmetics : disabled (default, enable with -D)   β”‚  pend fav : 0         β”‚
β”‚  known ints : disabled (default, enable with -D)   β”‚ own finds : 33        β”‚
β”‚  dictionary : n/a                                  β”‚  imported : 0         β”‚
β”‚havoc/splice : 31/1.98G, 2/3.17G                    β”‚ stability : 100.00%   β”‚
β”‚py/custom/rq : unused, unused, 0/1126, 0/19.4k      β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚    trim/eff : 1.58%/6355, disabled                 β”‚          [cpu003: 12%]
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Note: at the time of writing, cargo-fuzz and cargo-afl were available. The former currently only wraps libFuzzer, the latter works with afl. Both were already deprecated at the time of writing, but continue to work well. afl was chosen as it didn’t require a nightly toolchain.

Source

pub const fn new_unchecked(string: &'a str, sep: AsciiChar) -> Self

Creates an instance of SortedString without performing sanity checks.

This is essentially what conventionally would be a simple new(), but specifically named to alert users to the dangers.

Β§Example: Simple Use
let sep = b4s::AsciiChar::Comma;
let haystack = "a,b,c";
let sorted_string = b4s::SortedString::new_unchecked(haystack, sep);
Β§Example: Incorrect Use
use std::ops::Range;

let sep = b4s::AsciiChar::Comma;
let unsorted_haystack = "a,c,b";
let sorted_string = b4s::SortedString::new_unchecked(unsorted_haystack, sep);

// Unable to find element in unsorted haystack
assert_eq!(
    sorted_string.binary_search("b"),
    Err(b4s::SearchError(Range { start: 0, end: 1 }))
);
Source

pub fn sort(string: &str, sep: AsciiChar) -> String

Convenience method to sort a str by a given separator, returning an owned version.

As SortedString is designed to be thin and doesn’t own its data (expect for the sep), this convenience method helps creating a sorted String in the required format.

Β§Example
let sep = b4s::AsciiChar::LineFeed;
// Perhaps this was read directly from a file, where sorting is unreliable/unknown.
let unsorted_haystack = "c\nb\na";
let sorted_haystack = b4s::SortedString::sort(unsorted_haystack, sep);

// Passes fine
let sorted_string = b4s::SortedString::new_checked(&sorted_haystack, sep)?;

assert_eq!(
    sorted_string.binary_search("c"),
    Ok(std::ops::Range { start: 4, end: 5 })
);

Trait ImplementationsΒ§

SourceΒ§

impl<'a> Clone for SortedString<'a>

SourceΒ§

fn clone(&self) -> SortedString<'a>

Returns a duplicate of the value. Read more
1.0.0 Β· SourceΒ§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
SourceΒ§

impl<'a> Debug for SortedString<'a>

SourceΒ§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
SourceΒ§

impl Display for SortedString<'_>

SourceΒ§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
SourceΒ§

impl<'a> Hash for SortedString<'a>

SourceΒ§

fn hash<__H: Hasher>(&self, state: &mut __H)

Feeds this value into the given Hasher. Read more
1.3.0 Β· SourceΒ§

fn hash_slice<H>(data: &[Self], state: &mut H)
where H: Hasher, Self: Sized,

Feeds a slice of this type into the given Hasher. Read more
SourceΒ§

impl<'a> Ord for SortedString<'a>

SourceΒ§

fn cmp(&self, other: &SortedString<'a>) -> Ordering

This method returns an Ordering between self and other. Read more
1.21.0 Β· SourceΒ§

fn max(self, other: Self) -> Self
where Self: Sized,

Compares and returns the maximum of two values. Read more
1.21.0 Β· SourceΒ§

fn min(self, other: Self) -> Self
where Self: Sized,

Compares and returns the minimum of two values. Read more
1.50.0 Β· SourceΒ§

fn clamp(self, min: Self, max: Self) -> Self
where Self: Sized,

Restrict a value to a certain interval. Read more
SourceΒ§

impl<'a> PartialEq for SortedString<'a>

SourceΒ§

fn eq(&self, other: &SortedString<'a>) -> bool

Tests for self and other values to be equal, and is used by ==.
1.0.0 Β· SourceΒ§

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.
SourceΒ§

impl<'a> PartialOrd for SortedString<'a>

SourceΒ§

fn partial_cmp(&self, other: &SortedString<'a>) -> Option<Ordering>

This method returns an ordering between self and other values if one exists. Read more
1.0.0 Β· SourceΒ§

fn lt(&self, other: &Rhs) -> bool

Tests less than (for self and other) and is used by the < operator. Read more
1.0.0 Β· SourceΒ§

fn le(&self, other: &Rhs) -> bool

Tests less than or equal to (for self and other) and is used by the <= operator. Read more
1.0.0 Β· SourceΒ§

fn gt(&self, other: &Rhs) -> bool

Tests greater than (for self and other) and is used by the > operator. Read more
1.0.0 Β· SourceΒ§

fn ge(&self, other: &Rhs) -> bool

Tests greater than or equal to (for self and other) and is used by the >= operator. Read more
SourceΒ§

impl<'a> Eq for SortedString<'a>

SourceΒ§

impl<'a> StructuralPartialEq for SortedString<'a>

Auto Trait ImplementationsΒ§

Β§

impl<'a> Freeze for SortedString<'a>

Β§

impl<'a> RefUnwindSafe for SortedString<'a>

Β§

impl<'a> Send for SortedString<'a>

Β§

impl<'a> Sync for SortedString<'a>

Β§

impl<'a> Unpin for SortedString<'a>

Β§

impl<'a> UnwindSafe for SortedString<'a>

Blanket ImplementationsΒ§

SourceΒ§

impl<T> Any for T
where T: 'static + ?Sized,

SourceΒ§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
SourceΒ§

impl<T> Borrow<T> for T
where T: ?Sized,

SourceΒ§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
SourceΒ§

impl<T> BorrowMut<T> for T
where T: ?Sized,

SourceΒ§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
SourceΒ§

impl<T> CloneToUninit for T
where T: Clone,

SourceΒ§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

πŸ”¬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
SourceΒ§

impl<T> From<T> for T

SourceΒ§

fn from(t: T) -> T

Returns the argument unchanged.

SourceΒ§

impl<T, U> Into<U> for T
where U: From<T>,

SourceΒ§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

SourceΒ§

impl<T> IntoEither for T

SourceΒ§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
SourceΒ§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
SourceΒ§

impl<T> ToOwned for T
where T: Clone,

SourceΒ§

type Owned = T

The resulting type after obtaining ownership.
SourceΒ§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
SourceΒ§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
SourceΒ§

impl<T> ToString for T
where T: Display + ?Sized,

SourceΒ§

fn to_string(&self) -> String

Converts the given value to a String. Read more
SourceΒ§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

SourceΒ§

type Error = Infallible

The type returned in the event of a conversion error.
SourceΒ§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
SourceΒ§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

SourceΒ§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
SourceΒ§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.