b4s 0.3.0

Binary Search Single Sorted String: Perform binary search on a single, delimited string slice of sorted but unevenly sized substrings.
Documentation

b4s

Binary Search Single Sorted String: Perform binary search on a single, delimited string slice of sorted but unevenly sized substrings.

View the docs for more information.

Usage

There are generally two ways to setup this crate: at compile-time, or at runtime. The main (only...) method of interest is [SortedString::binary_search()]. View its documentation for detailed context.

Runtime

use b4s::{AsciiChar, SortedString};

fn main() {
    match SortedString::new_checked("abc,def,ghi,jkl,mno,pqr,stu,vwx,yz", AsciiChar::Comma) {
        Ok(ss) => {
            match ss.binary_search("ghi") {
                Ok(r) => println!("Found at range: {:?}", r),
                Err(r) => println!("Not found, last looked at range: {:?}", r),
            }
        }
        Err(e) => println!("Error: {:?}", e),
    }
}

Compile-time

For convenience, there's also a const fn, usable statically. As a tradeoff, it's potentially unsound.

use b4s::{AsciiChar, SortedString};

static SS: SortedString =
    SortedString::new_unchecked("abc,def,ghi,jkl,mno,pqr,stu,vwx,yz", AsciiChar::Comma);

fn main() {
    match SS.binary_search("ghi") {
        Ok(r) => println!("Found at range: {:?}", r),
        Err(r) => println!("Not found, last looked at range: {:?}", r),
    }
}

The source for the input string can be anything, for example a file prepared at compile time:

static SS: SortedString =
    SortedString::new_unchecked(include_str!("path/to/file"), AsciiChar::LineFeed);

Motivation

The itch to be scratched is the following:

  • there's an array of strings to do lookup in, for example a word list
  • the lookup is a simple containment check, with no modification
  • the word list is available and prepared (sorted) at compile-time (e.g. in build.rs)
  • the word list is large (potentially much larger than the code itself)
  • the list is to be distributed as part of the binary

A couple possible approaches come to mind. The summary table, where n is the number of words, is:

Approach Pre-compile preprocessing Compile time Runtime lookup Binary size
b4s Sort, O(n log n) Single ref: O(1) Bin. search: O(log n) O(n)
array Sort, O(n log n) Many refs: O(n) Bin. search: O(log n) ~ O(3n)
phf None Many refs: O(n) Hash: O(1) ~ O(3n)
padded &str Sort + Pad, ~ O(n log n) Single ref: O(1) Bin. search: O(log n) ~ O(n)

For more context, see the individual sections below.

Note that the pre-compile preprocessing is ordinarily performed only a single time, unless the word list itself changes. This column might therefore be moot, and considered essentially zero-cost.

Therefore, this crate is an attempt to provide a solution with:

  • good, not perfect runtime performance.

  • very little, compile-time preprocessing needed (just sorting).

  • essentially no additional startup cost (unlike, say, constructing a HashSet at runtime).

    The program this crate was initially designed for is sensitive to startup-time, as the program's main processing is rapid. Even just 50ms of startup time would be very noticeable, slowing down a program run by a factor of about 10.

  • binary sizes as small as possible.

  • compile times as fast as possible.

It was found that the approaches using arrays and hash sets (via phf) absolutely tanked developer experience, with compile times north of 20 minutes (!) for 30 MB word lists, large binaries, and clippy imploding, taking the IDE with it. This crate was born as a solution. Its main downside is suboptimal runtime performance. If that is your primary goal, opt for phf or similar crates.

Alternative approaches

Array

A simple array is an obvious choice, and can be generated in a build script.

static WORDS: &[&str] = &["abc", "def", "ghi", "jkl"];

fn main() {
    match WORDS.binary_search(&"ghi") {
        Ok(i) => println!("Found at index: {:?}", i),
        Err(i) => println!("Not found, could be inserted at: {:?}", i),
    }
}

There are two large pains in this approach:

  1. compile times become very slow (in the rough ballpark of 1 minute per 100.000 words, YMMV considerably)

  2. binary size becomes large.

    The words are much shorter than the &str they are contained in. On 64-bit hardware, a &str is 16 bytes, with a usize address pointer and a usize length. For large word lists, this leads to incredible bloat for the resulting binary.

Hash Set

Regular HashSets are not available at compile time. Crates like phf change that:

use phf::{phf_set, Set};

static WORDS: Set<&'static str> = phf_set! {
    "abc",
    "def",
    "ghi",
    "jkl"
};

fn main() {
    if WORDS.contains(&"ghi") {
        println!("Found!");
    } else {
        println!("Not found!");
    }
}

Similar downsides as for the array case apply: very long compile times, and considerable binary bloat from smart pointers. A hash set ultimately is an array with computed indices, so this is expected.

Single, sorted and padded string

Another approach could be to use a single string (saving pointer bloat), but pad all words to the longest occurring length, facilitating easy binary search (and increasing bloat to some extent):

static WORDS: &str = "abc␣␣def␣␣ghi␣␣jklmn";

// Perform binary search...

The binary search implementation is then straightforward, as the elements are of known, fixed lengths (in this case, 5). This approach was found to not perform well.