# sbitmap - Scalable Bitmap Allocator

A fast and scalable lock-free bitmap implementation based on the Linux kernel's sbitmap.

## Overview

`sbitmap` provides a high-performance, cache-line optimized bitmap for concurrent bit allocation across multiple threads. It's designed for scenarios where many threads need to allocate and free bits from a shared pool efficiently.

## Features

- **Lock-free**: All operations use atomic instructions without locks
- **Cache-line aligned**: Each bitmap word is on its own cache line to prevent false sharing
- **Lightweight hints**: Callers pass allocation hints by reference - no thread-local overhead
- **Scalable**: Tested with high concurrency workloads
- **Memory efficient**: Bit-level granularity with minimal overhead

## Design

This implementation is based on the Linux kernel's sbitmap (from `lib/sbitmap.c`), specifically designed for:

- High-concurrency scenarios (multiple queues, multiple threads)
- Efficient resource allocation (journal entries, tags, etc.)
- Low-latency allocation and deallocation

### Key Optimizations

1. **Cache-line separation**: Each `SbitmapWord` is aligned to 64 bytes
2. **Per-task allocation hints**: Caller-provided hints reduce contention without thread-local overhead
3. **Atomic operations**: Acquire/Release semantics for correctness
4. **No deferred clearing**: Direct atomic bit clearing for simplicity

## Usage

Add to your `Cargo.toml`:

```toml
[dependencies]
sbitmap = "0.1"
```

### Basic Example

```rust
use sbitmap::Sbitmap;

// Create a bitmap with 1024 bits (non-round-robin mode)
let sb = Sbitmap::new(1024, None, false);

// Each caller maintains its own allocation hint
let mut hint = 0;

// Allocate a bit
if let Some(bit) = sb.get(&mut hint) {
    // Use the allocated bit
    println!("Allocated bit: {}", bit);

    // Free it when done
    sb.put(bit, &mut hint);
}
```

### Concurrent Usage

```rust
use sbitmap::Sbitmap;
use std::sync::Arc;
use std::thread;

let sb = Arc::new(Sbitmap::new(1024, None, false));
let mut handles = vec![];

for _ in 0..8 {
    let sb = Arc::clone(&sb);
    handles.push(thread::spawn(move || {
        // Each thread maintains its own hint in local context
        let mut hint = 0;

        // Each thread can safely allocate/free bits
        if let Some(bit) = sb.get(&mut hint) {
            // Do work...
            sb.put(bit, &mut hint);
        }
    }));
}

for h in handles {
    h.join().unwrap();
}
```

### Batch Allocation

```rust
use sbitmap::Sbitmap;

// Create a bitmap with 1024 bits
let sb = Sbitmap::new(1024, None, false);
let mut hint = 0;

// Allocate 4 consecutive bits atomically
if let Some(start_bit) = sb.get_batch(4, &mut hint) {
    // Use bits: start_bit, start_bit+1, start_bit+2, start_bit+3
    println!("Allocated bits {}-{}", start_bit, start_bit + 3);

    // Process consecutive resources...
    for i in 0..4 {
        println!("Using bit {}", start_bit + i);
    }

    // Free all 4 bits atomically when done
    sb.put_batch(start_bit, 4, &mut hint);
}
```

**Note:** Batch operations require `nr_bits <= bits_per_word()`. All consecutive bits are guaranteed to be within the same word (no spanning across word boundaries).

## API

### `Sbitmap::new(depth: usize, shift: Option<u32>, round_robin: bool) -> Self`

Create a new sbitmap with `depth` bits. The `shift` parameter controls how many bits per word (2^shift bits per word) and is critical for performance - it determines how bits are spread across multiple cache-line aligned words. When `None`, the shift is auto-calculated for optimal cache usage. The `round_robin` parameter enables strict round-robin allocation order (usually `false` for better performance).

**Understanding the shift parameter:**
- The shift value spreads bits among multiple words, which is key to sbitmap performance
- Each word is on a separate cache line (64 bytes), reducing contention between CPUs
- Smaller shift = more words = better spreading = less contention (but more memory overhead)
- Larger shift = fewer words = more contention (but better memory efficiency)

### `get(&self, hint: &mut usize) -> Option<usize>`

Allocate a free bit. The `hint` parameter is a mutable reference to the caller's allocation hint, which helps reduce contention by spreading allocations across different parts of the bitmap. Returns `Some(bit_number)` on success or `None` if no free bits are available.

### `put(&self, bitnr: usize, hint: &mut usize)`

Free a previously allocated bit. The `hint` parameter is updated to improve cache locality for subsequent allocations.

### `get_batch(&self, nr_bits: usize, hint: &mut usize) -> Option<usize>`

Allocate `nr_bits` consecutive free bits from the bitmap atomically. This operation provides acquire barrier semantics on success. Only supports `nr_bits <= bits_per_word()` to ensure all bits are within the same word (no spanning across word boundaries).

Returns `Some(start_bit)` where `start_bit` is the first bit of the allocated consecutive range, or `None` if no consecutive `nr_bits` are available or `nr_bits > bits_per_word()`.

**Use cases:**
- Allocating contiguous resource ranges (e.g., multiple consecutive I/O tags)
- Batch resource allocation for improved efficiency
- DMA buffer allocation requiring consecutive indices

### `put_batch(&self, bitnr: usize, nr_bits: usize, hint: &mut usize)`

Free `nr_bits` consecutive previously allocated bits starting from `bitnr`. This operation provides release barrier semantics, ensuring that all writes to data associated with these bits are visible before the bits are freed. Only supports `nr_bits <= bits_per_word()` to ensure all bits are within the same word.

The `hint` parameter is updated for better cache locality in subsequent allocations.

### `test_bit(&self, bitnr: usize) -> bool`

Check if a bit is currently allocated.

### `weight(&self) -> usize`

Count the number of currently allocated bits.

### `depth(&self) -> usize`

Get the total number of bits in the bitmap.

## Use Cases

- **Tag allocation**: I/O tag allocation for block devices
- **Resource pools**: Any scenario requiring efficient concurrent resource allocation
- **Lock-free data structures**: Building block for concurrent algorithms
- **Batch resource allocation**: Allocating multiple consecutive I/O tags, DMA buffers, or contiguous resource ranges
- **NUMA systems**: cache-line separation significantly reduces cross-node traffic on multi-node NUMA machines

## Performance Characteristics

- **Allocation**: O(n) worst case, O(1) average with hints
- **Deallocation**: O(1)
- **Batch allocation**: O(n * nr_bits) worst case, finds consecutive bits within single word
- **Batch deallocation**: O(1), atomic clear of consecutive bits
- **Memory overhead**: ~56 bytes per word (64 bits) due to cache-line alignment
- **Thread safety**: Lock-free with atomic operations
- **Scalability**: Linear scaling with number of CPUs up to bitmap depth

## Performance Tuning

The `shift` parameter is crucial for tuning sbitmap performance based on your workload:

**When to use a smaller shift:**
- **High contention**: When many threads are competing heavily for bit allocation and release, use a smaller shift to spread bits across more words and reduce contention on individual cache lines
- **NUMA systems**: Machines with multiple NUMA nodes benefit significantly from smaller shift values, as this distributes memory accesses across more cache lines and reduces cross-node traffic
- **Many concurrent allocators**: Systems with a high CPU count see better scalability with smaller shift values

**Examples:**
```rust
// High contention scenario (32-core NUMA system)
let sb = Sbitmap::new(1024, Some(4), false);  // 2^4 = 16 bits per word, 64 words

// Low contention scenario (4-core system)
let sb = Sbitmap::new(1024, Some(6), false);  // 2^6 = 64 bits per word, 16 words

// Let sbitmap decide (recommended starting point)
let sb = Sbitmap::new(1024, None, false);     // Auto-calculated based on depth
```

**Trade-offs:**
- Smaller shift improves performance under contention but uses more memory (each word needs 64 bytes for cache-line alignment)
- Larger shift reduces memory overhead but increases contention when many threads compete
- The auto-calculated shift (when `None`) provides a balanced default suitable for most workloads

## Memory Ordering

- `get()`: Acquire semantics - ensures allocated bit is visible before use
- `put()`: Release semantics - ensures all writes complete before bit is freed
- `get_batch()`: Acquire semantics - ensures all allocated bits are visible before use
- `put_batch()`: Release semantics - ensures all writes complete before bits are freed

## Comparison with Alternatives

| Feature | sbitmap | Mutex + BitVec | AtomicBitSet |
|---------|---------|----------------|--------------|
| Lock-free | ✅ | ❌ | ✅ |
| Cache-optimized | ✅ | ❌ | ❌ |
| Per-thread hints | ✅ | ❌ | ❌ |
| Kernel-proven design | ✅ | ❌ | ❌ |

## Benchmarks

To compare sbitmap performance against a simple lockless bitmap:

```bash
# Run with defaults (32 bits, auto shift, 10 seconds, N-1 tasks)
cargo run --bin bench_compare --release

# Specify bitmap depth and duration
cargo run --bin bench_compare --release -- --depth 1024 --time 5

# Specify bitmap depth, shift, and duration
cargo run --bin bench_compare --release -- --depth 512 --shift 5 --time 10

# Benchmark batch operations (allocating 4 consecutive bits)
cargo run --bin bench_compare --release -- --depth 128 --batch 4 --time 5

# Show help
cargo run --bin bench_compare --release -- --help
```

This benchmark:
- Auto-detects available CPUs and spawns N-1 concurrent tasks
- Measures operations per second (get + put pairs for single-bit mode, get_batch + put_batch pairs for batch mode)
- Compares sbitmap vs a baseline lockless implementation (single-bit mode only)
- Defaults: 32 bits, auto-calculated shift, 10 seconds, N-1 tasks (where N is total CPU count)

Options:
- `--depth DEPTH` - Bitmap depth in bits (default: 32)
- `--shift SHIFT` - log2(bits per word), auto-calculated if not specified
- `--time TIME` - Benchmark duration in seconds (default: 10)
- `--tasks TASKS` - Number of concurrent tasks (default: NUM_CPUS - 1)
- `--batch NR_BITS` - Use get_batch/put_batch with NR_BITS (default: 1, single bit mode)
- `--round-robin` - Enable round-robin allocation mode (default: disabled)

See [benches/README.md](benches/README.md) for more details.

Example output on a 32-CPU system:

```
System: 32 CPUs detected, 2 NUMA nodes, using 31 tasks for benchmark
Bitmap depth: 32 bits
Shift: auto-calculated (bits per word: 8)
Duration: 10 seconds
=== Sbitmap (Optimized) Benchmark ===
Configuration:
  - Duration: 10s
  - Tasks: 31
  - Bitmap depth: 32 bits

Results:
  Task 0: 3101117 ops, 310111 ops/sec (0.3101 Mops/sec)
  ...
  Task 30: 3169582 ops, 316958 ops/sec (0.3170 Mops/sec)
  Total: 93604448 ops, 9360444 ops/sec (9.3604 Mops/sec)

=== SimpleBitmap (Baseline) Benchmark ===
Configuration:
  - Duration: 10s
  - Tasks: 31
  - Bitmap depth: 32 bits

Results:
  Task 0: 1998241 ops, 199824 ops/sec (0.1998 Mops/sec)
  ...
  Task 30: 1835360 ops, 183536 ops/sec (0.1835 Mops/sec)
  Total: 62530560 ops, 6253056 ops/sec (6.2531 Mops/sec)
```

## License

Licensed under either of:

- MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)

at your option.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Credits

Based on the Linux kernel's sbitmap implementation by Jens Axboe and other contributors.