sbitmap 0.1.0

Fast and scalable bitmap implementation based on Linux kernel's sbitmap
Documentation
  • Coverage
  • 88.89%
    8 out of 9 items documented0 out of 8 items with examples
  • Size
  • Source code size: 74.28 kB This is the summed size of all the files inside the crates.io package for this release.
  • Documentation size: 1.58 MB This is the summed size of all files generated by rustdoc for all configured targets
  • Ø build duration
  • this release: 14s Average build duration of successful builds.
  • all releases: 13s Average build duration of successful builds in releases after 2024-10-23.
  • Links
  • ublk-org/sbitmap
    2 0 0
  • crates.io
  • Dependencies
  • Versions
  • Owners
  • ming1
sbitmap-0.1.0 has been yanked.

sbitmap - Scalable Bitmap Allocator

A fast and scalable lock-free bitmap implementation based on the Linux kernel's sbitmap.

Overview

sbitmap provides a high-performance, cache-line optimized bitmap for concurrent bit allocation across multiple threads. It's designed for scenarios where many threads need to allocate and free bits from a shared pool efficiently.

Features

  • Lock-free: All operations use atomic instructions without locks
  • Cache-line aligned: Each bitmap word is on its own cache line to prevent false sharing
  • Lightweight hints: Callers pass allocation hints by reference - no thread-local overhead
  • Scalable: Tested with high concurrency workloads
  • Memory efficient: Bit-level granularity with minimal overhead

Design

This implementation is based on the Linux kernel's sbitmap (from lib/sbitmap.c), specifically designed for:

  • High-concurrency scenarios (multiple queues, multiple threads)
  • Efficient resource allocation (journal entries, tags, etc.)
  • Low-latency allocation and deallocation

Key Optimizations

  1. Cache-line separation: Each SbitmapWord is aligned to 64 bytes
  2. Per-task allocation hints: Caller-provided hints reduce contention without thread-local overhead
  3. Atomic operations: Acquire/Release semantics for correctness
  4. No deferred clearing: Direct atomic bit clearing for simplicity

Usage

Add to your Cargo.toml:

[dependencies]
sbitmap = "0.1"

Basic Example

use sbitmap::Sbitmap;

// Create a bitmap with 1024 bits (non-round-robin mode)
let sb = Sbitmap::new(1024, None, false);

// Each caller maintains its own allocation hint
let mut hint = 0;

// Allocate a bit
if let Some(bit) = sb.get(&mut hint) {
    // Use the allocated bit
    println!("Allocated bit: {}", bit);

    // Free it when done
    sb.put(bit, &mut hint);
}

Concurrent Usage

use sbitmap::Sbitmap;
use std::sync::Arc;
use std::thread;

let sb = Arc::new(Sbitmap::new(1024, None, false));
let mut handles = vec![];

for _ in 0..8 {
    let sb = Arc::clone(&sb);
    handles.push(thread::spawn(move || {
        // Each thread maintains its own hint in local context
        let mut hint = 0;

        // Each thread can safely allocate/free bits
        if let Some(bit) = sb.get(&mut hint) {
            // Do work...
            sb.put(bit, &mut hint);
        }
    }));
}

for h in handles {
    h.join().unwrap();
}

API

Sbitmap::new(depth: usize, shift: Option<u32>, round_robin: bool) -> Self

Create a new sbitmap with depth bits. The shift parameter controls how many bits per word (2^shift bits per word) and is critical for performance - it determines how bits are spread across multiple cache-line aligned words. When None, the shift is auto-calculated for optimal cache usage. The round_robin parameter enables strict round-robin allocation order (usually false for better performance).

Understanding the shift parameter:

  • The shift value spreads bits among multiple words, which is key to sbitmap performance
  • Each word is on a separate cache line (64 bytes), reducing contention between CPUs
  • Smaller shift = more words = better spreading = less contention (but more memory overhead)
  • Larger shift = fewer words = more contention (but better memory efficiency)

get(&self, hint: &mut usize) -> Option<usize>

Allocate a free bit. The hint parameter is a mutable reference to the caller's allocation hint, which helps reduce contention by spreading allocations across different parts of the bitmap. Returns Some(bit_number) on success or None if no free bits are available.

put(&self, bitnr: usize, hint: &mut usize)

Free a previously allocated bit. The hint parameter is updated to improve cache locality for subsequent allocations.

test_bit(&self, bitnr: usize) -> bool

Check if a bit is currently allocated.

weight(&self) -> usize

Count the number of currently allocated bits.

depth(&self) -> usize

Get the total number of bits in the bitmap.

Use Cases

  • Tag allocation: I/O tag allocation for block devices
  • Resource pools: Any scenario requiring efficient concurrent resource allocation
  • Lock-free data structures: Building block for concurrent algorithms
  • NUMA machine: improvement on NUMA machines is obvious

Performance Characteristics

  • Allocation: O(n) worst case, O(1) average with hints
  • Deallocation: O(1)
  • Memory overhead: ~56 bytes per word (64 bits) due to cache-line alignment
  • Thread safety: Lock-free with atomic operations
  • Scalability: Linear scaling with number of CPUs up to bitmap depth

Performance Tuning

The shift parameter is crucial for tuning sbitmap performance based on your workload:

When to use a smaller shift:

  • High contention: When many threads are competing heavily for bit allocation and release, use a smaller shift to spread bits across more words and reduce contention on individual cache lines
  • NUMA systems: Machines with multiple NUMA nodes benefit significantly from smaller shift values, as this distributes memory accesses across more cache lines and reduces cross-node traffic
  • Many concurrent allocators: Systems with a high CPU count see better scalability with smaller shift values

Examples:

// High contention scenario (32-core NUMA system)
let sb = Sbitmap::new(1024, Some(4), false);  // 2^4 = 16 bits per word, 64 words

// Low contention scenario (4-core system)
let sb = Sbitmap::new(1024, Some(6), false);  // 2^6 = 64 bits per word, 16 words

// Let sbitmap decide (recommended starting point)
let sb = Sbitmap::new(1024, None, false);     // Auto-calculated based on depth

Trade-offs:

  • Smaller shift improves performance under contention but uses more memory (each word needs 64 bytes for cache-line alignment)
  • Larger shift reduces memory overhead but increases contention when many threads compete
  • The auto-calculated shift (when None) provides a balanced default suitable for most workloads

Memory Ordering

  • get(): Acquire semantics - ensures allocated bit is visible before use
  • put(): Release semantics - ensures all writes complete before bit is freed

Comparison with Alternatives

Feature sbitmap Mutex + BitVec AtomicBitSet
Lock-free
Cache-optimized
Per-thread hints
Kernel-proven design

Benchmarks

To compare sbitmap performance against a simple lockless bitmap:

# Run with defaults (32 bits, auto shift, 10 seconds, N-1 tasks)
cargo run --bin bench_compare --release

# Specify bitmap depth and duration
cargo run --bin bench_compare --release -- --depth 1024 --time 5

# Specify bitmap depth, shift, and duration
cargo run --bin bench_compare --release -- --depth 512 --shift 5 --time 10

# Show help
cargo run --bin bench_compare --release -- --help

This benchmark:

  • Auto-detects available CPUs and spawns N-1 concurrent tasks
  • Measures operations per second (get + put pairs)
  • Compares sbitmap vs a baseline lockless implementation
  • Defaults: 32 bits, auto-calculated shift, 10 seconds, N-1 tasks (where N is total CPU count)

Options:

  • --depth DEPTH - Bitmap depth in bits (default: 32)
  • --shift SHIFT - log2(bits per word), auto-calculated if not specified
  • --time TIME - Benchmark duration in seconds (default: 10)
  • --tasks TASKS - Number of concurrent tasks (default: NUM_CPUS - 1)
  • --round-robin - Enable round-robin allocation mode (default: disabled)

See benches/README.md for more details.

Example output on a 32-CPU system:

System: 32 CPUs detected, 2 NUMA nodes, using 31 tasks for benchmark
Bitmap depth: 32 bits
Shift: auto-calculated (bits per word: 8)
Duration: 10 seconds


=== Sbitmap (Optimized) Benchmark ===
Configuration:
  - Duration: 10s
  - Tasks: 31
  - Bitmap depth: 32 bits

Results:
  Task 0: 3101117 ops, 310111 ops/sec (0.3101 Mops/sec)
  ...
  Task 30: 3169582 ops, 316958 ops/sec (0.3170 Mops/sec)
  Total: 93604448 ops, 9360444 ops/sec (9.3604 Mops/sec)

=== SimpleBitmap (Baseline) Benchmark ===
Configuration:
  - Duration: 10s
  - Tasks: 31
  - Bitmap depth: 32 bits

Results:
  Task 0: 1998241 ops, 199824 ops/sec (0.1998 Mops/sec)
  ...
  Task 30: 1835360 ops, 183536 ops/sec (0.1835 Mops/sec)
  Total: 62530560 ops, 6253056 ops/sec (6.2531 Mops/sec)

License

Licensed under either of:

at your option.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Credits

Based on the Linux kernel's sbitmap implementation by Jens Axboe and Facebook contributors.