oxiblas-core

Core traits, SIMD abstractions, and scalar types for the OxiBLAS library

Overview

oxiblas-core is the foundational crate for OxiBLAS, providing the core abstractions and building blocks used throughout the library. The design is platform-agnostic, with architecture-specific optimizations for x86_64 (AVX2/AVX-512) and AArch64 (NEON).

Features

Core Traits

  • Scalar - Fundamental trait for numeric types supported by BLAS/LAPACK
    • Implemented for: f32, f64, Complex<f32>, Complex<f64>
    • Optional support for f16 (half precision) and f128 (quad precision)
    • Provides type-safe operations and conversions (see the sketch below)
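
The exact definition lives in oxiblas_core::scalar; as a rough illustration of the shape such a trait takes (ScalarLike is a hypothetical stand-in, and the crate's actual bounds and method set may differ):

use core::ops::{Add, Div, Mul, Sub};

// ScalarLike illustrates the kind of bounds a BLAS scalar trait bundles;
// it is not oxiblas_core::scalar::Scalar itself.
pub trait ScalarLike:
    Copy + PartialEq + Add<Output = Self> + Sub<Output = Self>
    + Mul<Output = Self> + Div<Output = Self>
{
    fn zero() -> Self;
    fn one() -> Self;
}

impl ScalarLike for f64 {
    fn zero() -> Self { 0.0 }
    fn one() -> Self { 1.0 }
}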

SIMD Abstractions

Architecture-specific vectorization with automatic fallback (runtime dispatch sketched after the lists below):

x86_64:

  • AVX-512 (512-bit): 8×f64 or 16×f32 per instruction
  • AVX2/FMA (256-bit): 4×f64 or 8×f32 per instruction
  • SSE4.1/SSE4.2 (128-bit): 2×f64 or 4×f32 per instruction

AArch64:

  • NEON (128-bit): 2×f64 or 4×f32 per instruction
  • Advanced 4×6 micro-kernels optimized for Apple Silicon

Fallback:

  • Scalar operations for platforms without SIMD support
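
A sketch of how this runtime dispatch with scalar fallback is typically structured on x86_64 (the kernel names here are hypothetical, not the crate's internal API):

fn axpy_scalar(alpha: f64, x: &[f64], y: &mut [f64]) {
    // Portable fallback: y[i] += alpha * x[i]
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi += alpha * *xi;
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn axpy_avx2(alpha: f64, x: &[f64], y: &mut [f64]) {
    // A real kernel would use core::arch::x86_64 intrinsics here; a scalar
    // loop keeps the sketch self-contained (the compiler may auto-vectorize
    // it under the enabled AVX2 target feature).
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi = alpha.mul_add(*xi, *yi);
    }
}

fn axpy(alpha: f64, x: &[f64], y: &mut [f64]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 availability was verified at runtime just above.
            return unsafe { axpy_avx2(alpha, x, y) };
        }
    }
    axpy_scalar(alpha, x, y)
}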

Extended Precision Types

  • f16 (half precision) - 16-bit floating point (with f16 feature)
    • Useful for memory-constrained applications
    • Hardware acceleration on ARM and modern x86_64
  • f128 (quad precision) - ~31 decimal digits of precision (with f128 feature)
    • Based on double-double arithmetic (sketched below)
    • Essential for high-accuracy numerical computations
    • Kahan and pairwise summation algorithms
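
The double-double idea, illustrated with the textbook construction (not necessarily the crate's internal code): an f128 value is an unevaluated pair hi + lo of f64s, and arithmetic is built from error-free transformations such as Knuth's TwoSum:

/// Knuth's TwoSum: s = fl(a + b) and e is the exact rounding error,
/// so a + b == s + e holds exactly in f64 arithmetic.
fn two_sum(a: f64, b: f64) -> (f64, f64) {
    let s = a + b;
    let bb = s - a;
    let e = (a - (s - bb)) + (b - bb);
    (s, e)
}

/// Simplified double-double addition of (hi, lo) pairs.
fn dd_add(x: (f64, f64), y: (f64, f64)) -> (f64, f64) {
    let (s, e) = two_sum(x.0, y.0); // exact sum of the high parts
    let e = e + x.1 + y.1;          // fold in both low parts
    two_sum(s, e)                   // renormalize into a (hi, lo) pair
}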

Memory Management

  • Cache-aware allocation - Platform-specific cache line alignment (see the sketch below)
  • Memory alignment - SIMD-friendly memory layout (16/32/64-byte alignment)
  • Workspace management - Efficient temporary buffer reuse for LAPACK algorithms
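
A minimal sketch of SIMD/cache-line-aligned allocation using the standard std::alloc API (illustrative only; oxiblas-core wraps this behind its own memory module):

use std::alloc::{alloc, dealloc, Layout};
use std::mem::size_of;

/// Allocate `len` f64s at the requested alignment (e.g. 64 bytes for a
/// cache line or an AVX-512 vector). Free with `dealloc` and the same layout.
fn alloc_aligned_f64(len: usize, align: usize) -> (*mut f64, Layout) {
    let layout = Layout::from_size_align(len * size_of::<f64>(), align)
        .expect("invalid size/alignment");
    let ptr = unsafe { alloc(layout) }.cast::<f64>();
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

let (ptr, layout) = alloc_aligned_f64(1024, 64);
assert_eq!(ptr as usize % 64, 0); // pointer is 64-byte aligned
unsafe { dealloc(ptr.cast::<u8>(), layout) };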

Blocking & Tuning

  • Automatic blocking parameters - Cache-aware tile sizes for GEMM and other operations (heuristic sketched after this list)
  • Platform detection - Runtime detection of cache sizes (L1/L2/L3)
    • Linux: sysfs (/sys/devices/system/cpu/)
    • macOS: sysctl
    • x86_64: CPUID instruction
  • Optimized for:
    • Intel Xeon (256KB-512KB L2): KC=192, MC=128
    • Apple Silicon (16MB L2): KC=448, MC=256
    • AMD Zen (512KB L2): KC=192, MC=Variable
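
These defaults follow the standard BLIS-style derivation: size KC so the micro-kernel's slivers of A and B stay in L1, then size MC so the packed MC×KC block of A occupies roughly half of L2. A simplified sketch (the constants and ratios are illustrative, not the crate's tuned policy):

fn gemm_blocking_f64(l1_bytes: usize, l2_bytes: usize) -> (usize, usize) {
    const MR: usize = 4; // micro-kernel rows (matches the 4×6 kernel above)
    const NR: usize = 6; // micro-kernel cols
    let elem = std::mem::size_of::<f64>();
    // KC: keep an MR×KC sliver of A plus a KC×NR sliver of B in half of L1.
    let kc = (l1_bytes / 2) / ((MR + NR) * elem);
    // MC: keep the packed MC×KC block of A in half of L2.
    let mc = (l2_bytes / 2) / (kc * elem);
    (mc / MR * MR, kc) // round MC down to a multiple of the kernel height
}

let (mc, kc) = gemm_blocking_f64(32 * 1024, 512 * 1024);
// ≈ (160, 204) for a 32 KB L1 / 512 KB L2 core — in the same ballpark as
// the tuned values listed above.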

Parallel Operations

  • Rayon integration (with parallel feature; see the sketch after this list)
  • Multi-threaded BLAS Level 3 - Automatic parallelization for large matrices
  • Load balancing - Efficient work distribution across cores
  • Cache-aware parallel blocking - Minimizes false sharing
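
Under the hood this is plain Rayon; a self-contained sketch of the work-distribution flavor, over row blocks (not the crate's actual kernel):

use rayon::prelude::*;

/// Scale each row of a row-major matrix (rows of `cols` elements) in
/// parallel. Chunking by whole rows gives each thread a disjoint,
/// contiguous region, which helps avoid false sharing.
fn scale_rows(a: &mut [f64], cols: usize, alpha: f64) {
    a.par_chunks_mut(cols)
        .for_each(|row| row.iter_mut().for_each(|x| *x *= alpha));
}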

Installation

Add this to your Cargo.toml:

[dependencies]
oxiblas-core = "0.1"

# With extended precision
oxiblas-core = { version = "0.1", features = ["f16", "f128"] }

# With parallelization
oxiblas-core = { version = "0.1", features = ["parallel"] }

# All features
oxiblas-core = { version = "0.1", features = ["f16", "f128", "parallel"] }

Usage

Basic Scalar Operations

use oxiblas_core::scalar::Scalar;

fn dot_product<T: Scalar>(x: &[T], y: &[T]) -> T {
    x.iter()
        .zip(y.iter())
        .map(|(a, b)| *a * *b)
        .fold(T::zero(), |acc, v| acc + v)
}

// Works with f32, f64, Complex<f32>, Complex<f64>
let x = vec![1.0f64, 2.0, 3.0];
let y = vec![4.0f64, 5.0, 6.0];
let result = dot_product(&x, &y); // 32.0

SIMD Operations

use oxiblas_core::simd::{SimdType, SimdOps};

// Automatic SIMD selection based on platform
let x: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0];
let y: Vec<f64> = vec![5.0, 6.0, 7.0, 8.0];
let mut result = vec![0.0; 4];

// Uses AVX2/NEON automatically if available
unsafe {
    let simd = <f64 as SimdType>::simd();
    simd.fma(&x, &y, &mut result);
    // result = x * y + result
}

Extended Precision

#[cfg(feature = "f128")]
{
    // The import lives inside the cfg block so the example still compiles
    // when the f128 feature is disabled.
    use oxiblas_core::scalar::QuadFloat;

    // Quad precision (f128) - ~31 decimal digits
    let x = QuadFloat::from(2.0);
    let sqrt_x = x.sqrt();
    println!("√2 = {}", sqrt_x); // printed to very high precision
}

Kahan Summation

use oxiblas_core::scalar::kahan_sum;

let values: Vec<f64> = vec![1.0, 1e-16, -1.0]; // naive left-to-right summation returns 0.0
let result = kahan_sum(&values); // compensated summation preserves the small term (≈1e-16)
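
For reference, compensated summation keeps a running correction term that recaptures the low-order bits lost at each step; the crate's kahan_sum may differ in detail (e.g. vectorization), but the textbook algorithm is:

fn kahan_sum_ref(values: &[f64]) -> f64 {
    let (mut sum, mut c) = (0.0_f64, 0.0_f64);
    for &v in values {
        let y = v - c;     // re-apply the bits lost on the previous step
        let t = sum + y;   // big + small: low-order bits of y may be lost
        c = (t - sum) - y; // recover exactly what was just lost
        sum = t;
    }
    sum
}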

Cache Detection

use oxiblas_core::tuning::detect_cache_sizes;

let cache = detect_cache_sizes();
println!("L1D: {} KB", cache.l1d / 1024);
println!("L2:  {} KB", cache.l2 / 1024);
println!("L3:  {} KB", cache.l3 / 1024);

Blocking Parameters

use oxiblas_core::blocking::BlockParams;

// Get optimal blocking parameters for GEMM
let params = BlockParams::for_gemm::<f64>();
println!("MC={}, KC={}, NC={}", params.mc, params.kc, params.nc);
// Automatically tuned for your system's cache hierarchy
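
These three parameters index the classic five-loop blocked GEMM. A naive, unpacked sketch showing where MC, KC, and NC apply (dimensions assumed divisible by the block sizes; the real implementation packs panels and runs a SIMD micro-kernel in place of the inner scalar loops):

/// C += A * B for row-major matrices, blocked with (mc, kc, nc).
fn gemm_blocked(
    m: usize, n: usize, k: usize,
    a: &[f64], b: &[f64], c: &mut [f64],
    mc: usize, kc: usize, nc: usize,
) {
    for jc in (0..n).step_by(nc) {         // NC-wide panel of B/C (L3-sized)
        for pc in (0..k).step_by(kc) {     // KC-deep slice shared by A and B
            for ic in (0..m).step_by(mc) { // MC×KC block of A (L2-resident)
                for i in ic..ic + mc {
                    for j in jc..jc + nc {
                        let mut acc = c[i * n + j];
                        for p in pc..pc + kc {
                            acc += a[i * k + p] * b[p * n + j];
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
}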

Feature Flags

Feature        Description                                     Default
default        Core functionality (f32, f64, complex)         Yes
parallel       Rayon-based parallelization                     No
f16            Half-precision (16-bit) floating point          No
f128           Quad-precision (~31 digits) via double-double   No
nightly        Nightly-only optimizations                      No
force-scalar   Disable SIMD, use scalar only (debug)           No
max-simd-128   Limit to 128-bit SIMD (SSE/NEON)                No
max-simd-256   Limit to 256-bit SIMD (AVX2)                    No

SIMD Support Matrix

Platform            128-bit   256-bit   512-bit
x86_64 (SSE4.1)     Yes       —         —
x86_64 (AVX2)       Yes       Yes       —
x86_64 (AVX-512)    Yes       Yes       Yes
AArch64 (NEON)      Yes       —         —
AArch64 (SVE)       Planned   Planned   Planned
Fallback (scalar)   —         —         —

Performance

SIMD Performance (Apple M3, NEON)

Operation   Size    Scalar    NEON (128-bit)   Speedup
f64 Add     4,096   15.2 µs   7.98 µs          1.9×
f64 FMA     4,096   22.1 µs   11.29 µs         2.0×
f32 Add     4,096   8.1 µs    3.2 µs           2.5×
f32 FMA     4,096   11.5 µs   4.8 µs           2.4×

SIMD Performance (Linux x86_64, AVX2)

Operation   Size    Scalar    AVX2 (256-bit)   Speedup
f64 Add     4,096   18.4 µs   7.98 µs          2.3×
f64 FMA     4,096   26.7 µs   11.29 µs         2.4×
f32 Add     4,096   9.8 µs    2.1 µs           4.7×
f32 FMA     4,096   14.2 µs   3.2 µs           4.4×

Architecture

oxiblas-core/
├── scalar.rs          # Scalar trait, f16, f128, extended precision
├── simd.rs            # SIMD abstraction layer
├── simd/
│   ├── avx2.rs        # AVX2/FMA kernels (x86_64)
│   ├── avx512.rs      # AVX-512 kernels (x86_64)
│   ├── neon.rs        # NEON kernels (AArch64)
│   └── scalar.rs      # Fallback scalar implementation
├── memory/
│   ├── align.rs       # Aligned allocation
│   ├── workspace.rs   # Temporary buffer management
│   └── cache.rs       # Cache-aware utilities
├── blocking.rs        # Blocking parameter calculation
├── tuning.rs          # Platform detection and auto-tuning
└── parallel.rs        # Parallel operations with Rayon

Supported Platforms

Tier 1 (Fully Tested)

  • x86_64: Linux, macOS, Windows
  • AArch64: macOS (Apple Silicon), Linux

Tier 2 (Best Effort)

  • x86: Linux, Windows
  • AArch64: Android, iOS
  • RISC-V: Linux (scalar only)

Requirements

  • Rust: 1.85+ (Edition 2024)
  • No external C dependencies
  • Optional: Rayon for parallelization (enabled via the parallel feature)

Examples

See the examples directory in the main repository:

  • basic_simd.rs - SIMD operations
  • extended_precision.rs - f16 and f128 usage
  • cache_tuning.rs - Platform-specific optimization

Benchmarks

Run benchmarks:

# SIMD benchmarks
cargo bench --package oxiblas-core --bench simd

# Blocking parameter benchmarks
cargo bench --package oxiblas-core --bench blocking

Safety

  • All SIMD operations are properly marked unsafe where required
  • Memory alignment is enforced at compile-time where possible
  • Extensive testing across platforms ensures correctness
  • No undefined behavior in safe APIs

Contributing

Contributions are welcome! Areas of interest:

  1. ARM SVE support - Scalable Vector Extension for newer ARM cores
  2. RISC-V vector - support for the RISC-V Vector extension (RVV)
  3. Additional extended precision - Alternative quad-float implementations
  4. Auto-tuning improvements - Better platform detection

License

Licensed under either of:

  • Apache License, Version 2.0
  • MIT license

at your option.
