packsimd

Note: This release includes the Scalar and SSE4.1 backends only. AVX2 and AVX-512 implementations are planned for a future release. On x86_64 CPUs with SSE4.1, the SSE4.1 backend is used automatically.

High-performance BP128 compression for u32 integer arrays with SIMD acceleration, zero-allocation APIs, and deterministic encoding.

Overview

packsimd compresses integer arrays by packing each block of 128 values using the minimum bit width required. It automatically detects and uses the best available SIMD backend at runtime.
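
As a rough illustration of the per-block bit-width idea, the sketch below computes the smallest width that fits every value in a block and the resulting packed payload size. It is standalone code, not packsimd's implementation: `block_bit_width` and `packed_payload_bytes` are hypothetical names, and header or metadata bytes are left out because the on-disk format is not described here.

```rust
/// Number of values per block in BP128.
const BLOCK_LEN: usize = 128;

/// Smallest bit width able to represent every value in `block`.
/// Returns 0 for an all-zero (or empty) block.
/// Hypothetical helper for illustration, not part of the packsimd API.
fn block_bit_width(block: &[u32]) -> u32 {
    // OR-ing all values keeps the highest set bit of the largest value.
    let combined = block.iter().fold(0u32, |acc, &v| acc | v);
    32 - combined.leading_zeros()
}

/// Packed payload size in bytes for one full block at `width` bits per value.
/// Assumption: header and metadata bytes are not counted.
fn packed_payload_bytes(width: u32) -> usize {
    BLOCK_LEN * width as usize / 8
}

fn main() {
    // One full block holding the values 0..=127.
    let block: Vec<u32> = (0..BLOCK_LEN as u32).collect();
    let width = block_bit_width(&block);
    println!(
        "width = {} bits, payload = {} bytes (raw = {} bytes)",
        width,
        packed_payload_bytes(width),
        BLOCK_LEN * 4
    );
}
```

A block whose largest value is 127 needs 7 bits per value, so its 128 values fit in 112 payload bytes instead of 512 raw bytes.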

Use Case	Example	Typical Ratio (compressed / original)
Database Indexing	Posting lists, doc IDs	20-40%
Search Systems	Inverted indices	20-40%
Time Series	Timestamp deltas	30-50%
Network Protocols	Integer data transfer	Varies
Columnar Storage	Integer columns	20-40%

Features

BP128 Algorithm — Variable bit-width packing, 128 values per block
SIMD Acceleration — SSE4.1 on x86_64 with automatic runtime detection
Scalar Fallback — Reference implementation for non-SIMD targets
Zero-Allocation API — compress_into / decompress_into with pre-allocated buffers
Fast Header Inspection — decompressed_len reads size without decompressing
Deterministic Output — Same input always produces identical compressed bytes
No Dependencies — Zero runtime dependencies
No Panics — All error conditions return Result
Extensively Tested — Property-based testing (proptest), fuzz targets, 128+ tests

Installation

[dependencies]
packsimd = "0.1"

Quick Start

use packsimd::{compress, decompress};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data: Vec<u32> = (0..256).map(|i| i % 1000).collect();
    let compressed = compress(&data)?;
    let decompressed = decompress(&compressed)?;
    assert_eq!(data, decompressed);
    Ok(())
}

Zero-Allocation Path

use packsimd::{compress_into, decompress_into, max_compressed_size, decompressed_len};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data: Vec<u32> = (0..256).map(|i| i % 1000).collect();

    // Compress
    let mut cbuf = vec![0u8; max_compressed_size(data.len())];
    let cbytes = compress_into(&data, &mut cbuf)?;

    // Decompress
    let dlen = decompressed_len(&cbuf[..cbytes])?;
    let mut dbuf = vec![0u32; dlen];
    decompress_into(&cbuf[..cbytes], &mut dbuf)?;

    Ok(())
}

Documentation

For complete API reference and usage examples, see USAGE.md.

Architecture

┌─────────────────────────────────────────────────────┐
│                    Public API                        │
│  compress / compress_into / max_compressed_size     │
│  decompress / decompress_into / decompressed_len    │
└──────────────────────┬──────────────────────────────┘
                       │
          ┌────────────┴────────────┐
          │       Dispatch          │
          │  Runtime CPU detection  │
          │  OnceLock caching       │
          └────────────┬────────────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
  ┌────┴────┐    ┌─────┴─────┐   ┌────┴────┐
  │ Scalar  │    │   SSE4.1  │   │  AVX2   │
  │Backend  │    │  Backend  │   │(planned)│
  │         │    │           │   │         │
  │Reference│    │  128-bit  │   │ 256-bit │
  │  impl   │    │   SIMD    │   │  SIMD   │
  └─────────┘    └───────────┘   └─────────┘

Component	Responsibility
compress	Bit width calculation, header writing, block packing
decompress	Header parsing, validation, block unpacking
bitwidth	`required_bit_width`, block size calculations
dispatch	Runtime SIMD backend selection and caching
simd/scalar	Reference scalar implementation (all bit widths)
simd/sse	SSE4.1-accelerated kernels (x86_64 only)
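
The dispatch layer described above (runtime CPU detection cached in a `OnceLock`) can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the crate's actual code: the `Backend` enum and the `detect_backend` and `backend` functions are hypothetical names.

```rust
use std::sync::OnceLock;

/// Hypothetical backend tag; the crate's internal type may differ.
#[allow(dead_code)] // `Sse41` is never constructed on non-x86_64 targets.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend {
    Scalar,
    Sse41,
}

/// Probe the CPU once and pick the widest supported backend.
fn detect_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("sse4.1") {
            return Backend::Sse41;
        }
    }
    Backend::Scalar
}

/// Return the cached backend; detection runs only on the first call.
fn backend() -> Backend {
    static BACKEND: OnceLock<Backend> = OnceLock::new();
    *BACKEND.get_or_init(detect_backend)
}

fn main() {
    println!("selected backend: {:?}", backend());
}
```

Caching the result means the feature probe is paid once per process, and every later call reduces to a cheap read of the stored value.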

Performance

Benchmarked on x86_64 (SSE4.1) with LTO and opt-level=3.

Compression Ratios

Data Pattern	Ratio (compressed / original)	Compress	Decompress
Sequential (0-999)	23.65%	5.4 GiB/s	1.4 GiB/s
Constant (all same)	18.97%	2.8 GiB/s	649 MiB/s
Random (full entropy)	100.22%	23.5 GiB/s	24.0 GiB/s

Throughput at Scale (1M values)

Bit Width	Compress	Decompress
1-bit	13.7 GiB/s	11.1 GiB/s
8-bit	8.9 GiB/s	16.0 GiB/s
16-bit	7.1–7.5 GiB/s	13.8–14.2 GiB/s
32-bit	7.9–8.3 GiB/s	9.3–9.8 GiB/s

Compared with the scalar backend, SSE4.1 unpacks 1.7× to 12.6× faster across all bit widths. For packing, the scalar backend remains competitive at most widths.

Run benchmarks:

cargo bench

Security

No Panics — All error conditions return Result
Input Validation — Header, bit widths, and buffer sizes verified before use
OOM Protection — Maximum 1 billion decompressed values
No Undefined Behavior — Unsafe blocks documented with invariants, covered by fuzz testing
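
The OOM protection above amounts to checking an untrusted length against a cap before any buffer is sized from it. The sketch below shows that pattern in isolation. It is an assumption-laden illustration: `MAX_DECOMPRESSED_VALUES`, `LenError`, and `checked_value_count` are hypothetical names, and the crate's real error type and header layout are not shown in this README.

```rust
use std::convert::TryFrom;

/// Hypothetical cap mirroring the "1 billion decompressed values" limit.
const MAX_DECOMPRESSED_VALUES: u64 = 1_000_000_000;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq, Eq)]
enum LenError {
    TooLarge { declared: u64 },
}

/// Validate a length read from untrusted input before allocating from it.
fn checked_value_count(declared: u64) -> Result<usize, LenError> {
    if declared > MAX_DECOMPRESSED_VALUES {
        return Err(LenError::TooLarge { declared });
    }
    // Also reject values that do not fit in `usize` on narrow targets.
    usize::try_from(declared).map_err(|_| LenError::TooLarge { declared })
}

fn main() {
    assert_eq!(checked_value_count(256), Ok(256));
    assert!(checked_value_count(u64::MAX).is_err());
    println!("length checks passed");
}
```

Returning an error instead of allocating keeps a corrupted or hostile header from triggering a multi-gigabyte allocation.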

Examples

See the examples/ directory:

cargo run --package packsimd-examples

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Run tests: cargo test --all-targets
Run clippy: cargo clippy --all-targets -- -D warnings
Run benchmarks: cargo bench
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Setup

git clone https://github.com/themankindproject/packsimd
cd packsimd

# Run tests
cargo test --all-targets

# Run doc tests
cargo test --doc

# Generate documentation
cargo doc --no-deps --open

Roadmap

Feature	Status
Scalar implementation	Done
SSE4.1 backend	Done
AVX2 backend	Planned
AVX-512 backend	Planned

License

MIT License - See LICENSE file for details.

packsimd 0.1.1