Struct b4s::SortedString

source ·
pub struct SortedString<'a> { /* private fields */ }
Expand description

Main type to perform binary search through.

This type upholds the important invariants of SortedString::binary_search():

  • a sorted string is required,
  • as well as some AsciiChar separator to split by.

Access to binary search is gated behind this type. For this to not be too painful, the type is designed to be cheap, as it doesn’t own the potentially large haystack to perform binary search on.

The terms haystack and needle are used here as they are in std::str::pattern.

Instance Creation

There are two avenues:

  1. SortedString::new_checked(): recommended, ensuring soundness
  2. SortedString::new_unchecked(): unsafe, faster

Note on ASCII requirement

On instance creation, the separator has to be within the ASCII range. In other words, multi-byte (in the sense of UTF-8) separators are not supported. This leaves common separators like \n, \t, -, , and many more available, while arbitrary chars are not allowed:

let sep = '🦀';
let haystack = "a,b,c";
// Not allowed:
let _ = b4s::SortedString::new_checked(haystack, sep);

UTF-8 being a variable-length encoding, allowing any separator char would require an initial linear scan of the input string:

  • either at instance creation time, then saving the results,
  • or before each search.

The former takes up a lot of space (usize pointers, which on 64-bit are much larger than the separator itself), the latter time. Both are undesirable.

As UTF-8 is fully ASCII-compatible, allowing only that means the string can be efficiently scanned as bytes. The needle and haystack can still be any UTF-8 string. No multi-byte code point (with its continuation bytes) contains bytes in the ASCII range. Scanning for such an ASCII separator is therefore sound:

These properties simplify scanning and allow it to be fast at both instance creation and search time. The downside is hopefully negligible and in the common case unnoticeable. Please contact the author if your use case requires multi-byte support.

The ASCII requirement is manifested in the type system, enforcing correct usage at compile time.

Implementations§

source§

impl<'a> SortedString<'a>

source

pub fn new_checked( haystack: &'a str, sep: AsciiChar ) -> Result<Self, SortedStringCreationError>

Creates a new instance of SortedString, performing sanity checks.

See SortedString::new_unchecked() for a version without checks.

The performed checks and failure modes are outlined in the error examples below.

Example
let sep = b4s::AsciiChar::Comma;
let haystack = "a,b,c";
let sorted_string = b4s::SortedString::new_checked(haystack, sep);

assert!(sorted_string.is_ok());
Errors

This method returns a SortedStringCreationError in the cases outlined below.

Example: Creation Fails (Haystack Not Sorted)

Binary search requires prior sorting, so creation has to fail.

let sep = b4s::AsciiChar::Comma;
let unsorted_haystack = "a,c,b";
let sorted_string = b4s::SortedString::new_checked(unsorted_haystack, sep);

assert_eq!(sorted_string, Err(b4s::SortedStringCreationError::NotSorted));

This error might also creep in when passing a newline-separated file containing a trailing newline. The empty string will be in last position, contradicting sorting.

let sep = b4s::AsciiChar::LineFeed;
let haystack = "Aachen\nAmpel\nAngel\nApfel\n";
let ss = b4s::SortedString::new_checked(haystack, sep);

assert_eq!(ss, Err(b4s::SortedStringCreationError::NotSorted));
Example: Creation Fails (Haystack Empty)

Supplying an empty haystack str is likely an error and therefore fails.

let sep = b4s::AsciiChar::Comma;
let haystack = "";
let sorted_string = b4s::SortedString::new_checked(haystack, sep);

assert_eq!(sorted_string, Err(b4s::SortedStringCreationError::EmptyHaystack));

Searches for a needle inside this SortedString.

The return type aims to imitate binary_search of the standard library, returning a Result. The success case is a std::ops::Range, not a single usize, as the haystack is variable-length. For the error case, see below.

Examples

For common examples, see the crate-level documentation. The following are more specific examples.

Multi-byte needle and haystack

The needle and haystack can be any UTF-8 string. However, note that the returned Range is returned in raw byte positions, similar to std::str::len:

This length is in bytes, not chars or graphemes. In other words, it might not be what a human considers the length of the string.

let sep = b4s::AsciiChar::VerticalBar;
let haystack = "Apfel|Bäume|Zäune|zäunen|Äpfel|Öfen|ƒoo|مرحبًا|你好|😂";
let ss = b4s::SortedString::new_checked(haystack, sep).unwrap();

let needle = "Bäume";
let result = ss.binary_search(needle);

assert_eq!(result, Ok(std::ops::Range { start: 6, end: 12 }));
Needle contains separator

This situation not handled specially. Such needles will be impossible to find:

use std::ops::Range;

let sep = b4s::AsciiChar::A;
let haystack = "Aachen\nAmpel\nAngel\nApfel\n";
let ss = b4s::SortedString::new_checked(haystack, sep).unwrap();

let needle = "Angel";
let result = ss.binary_search(needle);
assert_eq!(result, Err(b4s::SearchError(Range { start: 0, end: 0 })));

let needle = "ngel\n";
let result = ss.binary_search(needle);
assert_eq!(result, Ok(Range { start: 14, end: 19 }));
Errors

Refer to SearchError for more info.

Panics

This method panics when slicing into the haystack fails. Occurrences of such panics are logic bugs and should never occur (see the below assumptions). Please report them if they do.

The panic allows the API to be simple. No panic implies converting the failure to some error value, like an enum. That enum would then have a variant such as EncodingError. However, the user would have no sensible way of handling that, either, and might as well panic anyway. Keeping the panic inside the function allows the return type to be very simple (no enum required).

See also this thread, where people who know what they’re talking about in terms of Rust (matklad, BurntSushi) argue in a similar fashion. The author took their advice to gauge idiomaticity.

Background

Fundamental assumptions for panic-freedom are:

  • The haystack is &str, aka valid UTF-8. Users cannot pass malformed UTF-8.

  • The separator is ASCII, as ensured by signatures working off AsciiChar, providing type-level guarantees. Users can only pass single-byte UTF-8 values.

  • The separator being ASCII (highest bit 0), it cannot be part of any code point encoded as multiple bytes using UTF-8 (highest bit 1), as validated in this sanity check:

    for code_point in 0..=0x10FFFF {
        if let Some(c) = std::char::from_u32(code_point) {
            if c.is_ascii() {
                continue;
            }
            for byte in c.to_string().bytes() {
                assert!(!byte.is_ascii());
            }
        }
    }
  • Single-byte values with highest bit 1 are outside the ASCII range and invalid UTF-8:

    for byte in 0x80u8..=0xFFu8 {
        assert!(!byte.is_ascii());
        assert!(std::str::from_utf8(&[byte]).is_err());
    }
Fuzz Testing

To further strengthen confidence in panic-freedom, fuzz testing was conducted using afl.

If you have access to the crate’s source code repository, run fuzz testing yourself from the root using

cargo install just && just fuzz

The author let a fuzz test run for over 5 billion iterations, finding no panics:

     american fuzzy lop ++4.06c {default} (target/debug/afl-target) [fast]
┌─ process timing ────────────────────────────────────┬─ overall results ────┐
│        run time : 0 days, 19 hrs, 7 min, 6 sec      │  cycles done : 192k  │
│   last new find : 0 days, 18 hrs, 56 min, 54 sec    │ corpus count : 39    │
│last saved crash : none seen yet                     │saved crashes : 0     │
│ last saved hang : none seen yet                     │  saved hangs : 0     │
├─ cycle progress ─────────────────────┬─ map coverage┴──────────────────────┤
│  now processing : 31.444679 (79.5%)  │    map density : 3.31% / 6.22%      │
│  runs timed out : 0 (0.00%)          │ count coverage : 2.77 bits/tuple    │
├─ stage progress ─────────────────────┼─ findings in depth ─────────────────┤
│  now trying : splice 8               │ favored items : 12 (30.77%)         │
│ stage execs : 27/28 (96.43%)         │  new edges on : 14 (35.90%)         │
│ total execs : 5.15G                  │ total crashes : 0 (0 saved)         │
│  exec speed : 73.5k/sec              │  total tmouts : 2 (0 saved)         │
├─ fuzzing strategy yields ────────────┴─────────────┬─ item geometry ───────┤
│   bit flips : disabled (default, enable with -D)   │    levels : 6         │
│  byte flips : disabled (default, enable with -D)   │   pending : 0         │
│ arithmetics : disabled (default, enable with -D)   │  pend fav : 0         │
│  known ints : disabled (default, enable with -D)   │ own finds : 33        │
│  dictionary : n/a                                  │  imported : 0         │
│havoc/splice : 31/1.98G, 2/3.17G                    │ stability : 100.00%   │
│py/custom/rq : unused, unused, 0/1126, 0/19.4k      ├───────────────────────┘
│    trim/eff : 1.58%/6355, disabled                 │          [cpu003: 12%]
└────────────────────────────────────────────────────┘

Note: at the time of writing, cargo-fuzz and cargo-afl were available. The former currently only wraps libFuzzer, the latter works with afl. Both were already deprecated at the time of writing, but continue to work well. afl was chosen as it didn’t require a nightly toolchain.

source

pub const fn new_unchecked(string: &'a str, sep: AsciiChar) -> Self

Creates an instance of SortedString without performing sanity checks.

This is essentially what conventionally would be a simple new(), but specifically named to alert users to the dangers.

Example: Simple Use
let sep = b4s::AsciiChar::Comma;
let haystack = "a,b,c";
let sorted_string = b4s::SortedString::new_unchecked(haystack, sep);
Example: Incorrect Use
use std::ops::Range;

let sep = b4s::AsciiChar::Comma;
let unsorted_haystack = "a,c,b";
let sorted_string = b4s::SortedString::new_unchecked(unsorted_haystack, sep);

// Unable to find element in unsorted haystack
assert_eq!(
    sorted_string.binary_search("b"),
    Err(b4s::SearchError(Range { start: 0, end: 1 }))
);
source

pub fn sort(string: &str, sep: AsciiChar) -> String

Convenience method to sort a str by a given separator, returning an owned version.

As SortedString is designed to be thin and doesn’t own its data (expect for the sep), this convenience method helps creating a sorted String in the required format.

Example
let sep = b4s::AsciiChar::LineFeed;
// Perhaps this was read directly from a file, where sorting is unreliable/unknown.
let unsorted_haystack = "c\nb\na";
let sorted_haystack = b4s::SortedString::sort(unsorted_haystack, sep);

// Passes fine
let sorted_string = b4s::SortedString::new_checked(&sorted_haystack, sep).unwrap();

assert_eq!(
    sorted_string.binary_search("c"),
    Ok(std::ops::Range { start: 4, end: 5 })
);

Trait Implementations§

source§

impl<'a> Clone for SortedString<'a>

source§

fn clone(&self) -> SortedString<'a>

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl<'a> Debug for SortedString<'a>

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Display for SortedString<'_>

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl<'a> Hash for SortedString<'a>

source§

fn hash<__H: Hasher>(&self, state: &mut __H)

Feeds this value into the given Hasher. Read more
1.3.0 · source§

fn hash_slice<H>(data: &[Self], state: &mut H)where H: Hasher, Self: Sized,

Feeds a slice of this type into the given Hasher. Read more
source§

impl<'a> Ord for SortedString<'a>

source§

fn cmp(&self, other: &SortedString<'a>) -> Ordering

This method returns an Ordering between self and other. Read more
1.21.0 · source§

fn max(self, other: Self) -> Selfwhere Self: Sized,

Compares and returns the maximum of two values. Read more
1.21.0 · source§

fn min(self, other: Self) -> Selfwhere Self: Sized,

Compares and returns the minimum of two values. Read more
1.50.0 · source§

fn clamp(self, min: Self, max: Self) -> Selfwhere Self: Sized + PartialOrd<Self>,

Restrict a value to a certain interval. Read more
source§

impl<'a> PartialEq<SortedString<'a>> for SortedString<'a>

source§

fn eq(&self, other: &SortedString<'a>) -> bool

This method tests for self and other values to be equal, and is used by ==.
1.0.0 · source§

fn ne(&self, other: &Rhs) -> bool

This method tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.
source§

impl<'a> PartialOrd<SortedString<'a>> for SortedString<'a>

source§

fn partial_cmp(&self, other: &SortedString<'a>) -> Option<Ordering>

This method returns an ordering between self and other values if one exists. Read more
1.0.0 · source§

fn lt(&self, other: &Rhs) -> bool

This method tests less than (for self and other) and is used by the < operator. Read more
1.0.0 · source§

fn le(&self, other: &Rhs) -> bool

This method tests less than or equal to (for self and other) and is used by the <= operator. Read more
1.0.0 · source§

fn gt(&self, other: &Rhs) -> bool

This method tests greater than (for self and other) and is used by the > operator. Read more
1.0.0 · source§

fn ge(&self, other: &Rhs) -> bool

This method tests greater than or equal to (for self and other) and is used by the >= operator. Read more
source§

impl<'a> Eq for SortedString<'a>

source§

impl<'a> StructuralEq for SortedString<'a>

source§

impl<'a> StructuralPartialEq for SortedString<'a>

Auto Trait Implementations§

§

impl<'a> RefUnwindSafe for SortedString<'a>

§

impl<'a> Send for SortedString<'a>

§

impl<'a> Sync for SortedString<'a>

§

impl<'a> Unpin for SortedString<'a>

§

impl<'a> UnwindSafe for SortedString<'a>

Blanket Implementations§

source§

impl<T> Any for Twhere T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for Twhere T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for Twhere T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for Twhere U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for Twhere T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T> ToString for Twhere T: Display + ?Sized,

source§

default fn to_string(&self) -> String

Converts the given value to a String. Read more
source§

impl<T, U> TryFrom<U> for Twhere U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for Twhere U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.