oxiblas-core

Core traits, SIMD abstractions, and scalar types for the OxiBLAS library

Overview

oxiblas-core is the foundational crate for OxiBLAS, providing the core abstractions and building blocks used throughout the library. The design is platform-agnostic, with architecture-specific optimizations for x86_64 (AVX2/AVX-512) and AArch64 (NEON).

Features

Core Traits

  • Scalar - Fundamental trait for numeric types supported by BLAS/LAPACK
    • Implemented for: f32, f64, Complex<f32>, Complex<f64>
    • Optional support for f16 (half precision) and f128 (quad precision)
    • Provides type-safe operations and conversions (see the sketch below)
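
The exact definition lives in oxiblas_core::scalar; as a rough illustration of the shape such a trait takes (ScalarLike is a hypothetical stand-in, and the crate's actual bounds and method set may differ):

use core::ops::{Add, Div, Mul, Sub};

// ScalarLike illustrates the kind of bounds a BLAS scalar trait bundles;
// it is not oxiblas_core::scalar::Scalar itself.
pub trait ScalarLike:
    Copy + PartialEq + Add<Output = Self> + Sub<Output = Self>
    + Mul<Output = Self> + Div<Output = Self>
{
    fn zero() -> Self;
    fn one() -> Self;
}

impl ScalarLike for f64 {
    fn zero() -> Self { 0.0 }
    fn one() -> Self { 1.0 }
}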

SIMD Abstractions

Architecture-specific vectorization with automatic fallback (runtime dispatch sketched after the lists below):

x86_64:

  • AVX-512 (512-bit): 8×f64 or 16×f32 per instruction
  • AVX2/FMA (256-bit): 4×f64 or 8×f32 per instruction
  • SSE4.1/SSE4.2 (128-bit): 2×f64 or 4×f32 per instruction

AArch64:

  • NEON (128-bit): 2×f64 or 4×f32 per instruction
  • Advanced 4×6 micro-kernels optimized for Apple Silicon

Fallback:

  • Scalar operations for platforms without SIMD support
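
A sketch of how this runtime dispatch with scalar fallback is typically structured on x86_64 (the kernel names here are hypothetical, not the crate's internal API):

fn axpy_scalar(alpha: f64, x: &[f64], y: &mut [f64]) {
    // Portable fallback: y[i] += alpha * x[i]
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi += alpha * *xi;
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn axpy_avx2(alpha: f64, x: &[f64], y: &mut [f64]) {
    // A real kernel would use core::arch::x86_64 intrinsics here; a scalar
    // loop keeps the sketch self-contained (the compiler may auto-vectorize
    // it under the enabled AVX2 target feature).
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi = alpha.mul_add(*xi, *yi);
    }
}

fn axpy(alpha: f64, x: &[f64], y: &mut [f64]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 availability was verified at runtime just above.
            return unsafe { axpy_avx2(alpha, x, y) };
        }
    }
    axpy_scalar(alpha, x, y)
}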

Extended Precision Types

  • f16 (half precision) - 16-bit floating point (with f16 feature)
    • Useful for memory-constrained applications
    • Hardware acceleration on ARM and modern x86_64
  • f128 (quad precision) - ~31 decimal digits of precision (with f128 feature)
    • Based on double-double arithmetic (sketched below)
    • Essential for high-accuracy numerical computations
    • Kahan and pairwise summation algorithms
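
The double-double idea, illustrated with the textbook construction (not necessarily the crate's internal code): an f128 value is an unevaluated pair hi + lo of f64s, and arithmetic is built from error-free transformations such as Knuth's TwoSum:

/// Knuth's TwoSum: s = fl(a + b) and e is the exact rounding error,
/// so a + b == s + e holds exactly in f64 arithmetic.
fn two_sum(a: f64, b: f64) -> (f64, f64) {
    let s = a + b;
    let bb = s - a;
    let e = (a - (s - bb)) + (b - bb);
    (s, e)
}

/// Simplified double-double addition of (hi, lo) pairs.
fn dd_add(x: (f64, f64), y: (f64, f64)) -> (f64, f64) {
    let (s, e) = two_sum(x.0, y.0); // exact sum of the high parts
    let e = e + x.1 + y.1;          // fold in both low parts
    two_sum(s, e)                   // renormalize into a (hi, lo) pair
}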

Memory Management

  • Cache-aware allocation - Platform-specific cache line alignment (see the sketch below)
  • Memory alignment - SIMD-friendly memory layout (16/32/64-byte alignment)
  • Workspace management - Efficient temporary buffer reuse for LAPACK algorithms
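
A minimal sketch of SIMD/cache-line-aligned allocation using the standard std::alloc API (illustrative only; oxiblas-core wraps this behind its own memory module):

use std::alloc::{alloc, dealloc, Layout};
use std::mem::size_of;

/// Allocate `len` f64s at the requested alignment (e.g. 64 bytes for a
/// cache line or an AVX-512 vector). Free with `dealloc` and the same layout.
fn alloc_aligned_f64(len: usize, align: usize) -> (*mut f64, Layout) {
    let layout = Layout::from_size_align(len * size_of::<f64>(), align)
        .expect("invalid size/alignment");
    let ptr = unsafe { alloc(layout) }.cast::<f64>();
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

let (ptr, layout) = alloc_aligned_f64(1024, 64);
assert_eq!(ptr as usize % 64, 0); // pointer is 64-byte aligned
unsafe { dealloc(ptr.cast::<u8>(), layout) };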

Blocking & Tuning

  • Automatic blocking parameters - Cache-aware tile sizes for GEMM and other operations (heuristic sketched after this list)
  • Platform detection - Runtime detection of cache sizes (L1/L2/L3)
    • Linux: sysfs (/sys/devices/system/cpu/)
    • macOS: sysctl
    • x86_64: CPUID instruction
  • Optimized for:
    • Intel Xeon (256KB-512KB L2): KC=192, MC=128
    • Apple Silicon (16MB L2): KC=448, MC=256
    • AMD Zen (512KB L2): KC=192, MC=Variable
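
These defaults follow the standard BLIS-style derivation: size KC so the micro-kernel's slivers of A and B stay in L1, then size MC so the packed MC×KC block of A occupies roughly half of L2. A simplified sketch (the constants and ratios are illustrative, not the crate's tuned policy):

fn gemm_blocking_f64(l1_bytes: usize, l2_bytes: usize) -> (usize, usize) {
    const MR: usize = 4; // micro-kernel rows (matches the 4×6 kernel above)
    const NR: usize = 6; // micro-kernel cols
    let elem = std::mem::size_of::<f64>();
    // KC: keep an MR×KC sliver of A plus a KC×NR sliver of B in half of L1.
    let kc = (l1_bytes / 2) / ((MR + NR) * elem);
    // MC: keep the packed MC×KC block of A in half of L2.
    let mc = (l2_bytes / 2) / (kc * elem);
    (mc / MR * MR, kc) // round MC down to a multiple of the kernel height
}

let (mc, kc) = gemm_blocking_f64(32 * 1024, 512 * 1024);
// ≈ (160, 204) for a 32 KB L1 / 512 KB L2 core — in the same ballpark as
// the tuned values listed above.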

Parallel Operations

  • Rayon integration (with parallel feature; see the sketch after this list)
  • Multi-threaded BLAS Level 3 - Automatic parallelization for large matrices
  • Load balancing - Efficient work distribution across cores
  • Cache-aware parallel blocking - Minimizes false sharing
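
Under the hood this is plain Rayon; a self-contained sketch of the work-distribution flavor, over row blocks (not the crate's actual kernel):

use rayon::prelude::*;

/// Scale each row of a row-major matrix (rows of `cols` elements) in
/// parallel. Chunking by whole rows gives each thread a disjoint,
/// contiguous region, which helps avoid false sharing.
fn scale_rows(a: &mut [f64], cols: usize, alpha: f64) {
    a.par_chunks_mut(cols)
        .for_each(|row| row.iter_mut().for_each(|x| *x *= alpha));
}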

Installation

Add this to your Cargo.toml:

[dependencies]
oxiblas-core = "0.1"

# With extended precision
oxiblas-core = { version = "0.1", features = ["f16", "f128"] }

# With parallelization
oxiblas-core = { version = "0.1", features = ["parallel"] }

# All features
oxiblas-core = { version = "0.1", features = ["f16", "f128", "parallel"] }

Usage

Basic Scalar Operations

use oxiblas_core::scalar::Scalar;

fn dot_product<T: Scalar>(x: &[T], y: &[T]) -> T {
    x.iter()
        .zip(y.iter())
        .map(|(a, b)| *a * *b)
        .fold(T::zero(), |acc, v| acc + v)
}

// Works with f32, f64, Complex<f32>, Complex<f64>
let x = vec![1.0f64, 2.0, 3.0];
let y = vec![4.0f64, 5.0, 6.0];
let result = dot_product(&x, &y); // 32.0

SIMD Operations

use oxiblas_core::simd::{SimdType, SimdOps};

// Automatic SIMD selection based on platform
let x: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0];
let y: Vec<f64> = vec![5.0, 6.0, 7.0, 8.0];
let mut result = vec![0.0; 4];

// Uses AVX2/NEON automatically if available
unsafe {
    let simd = <f64 as SimdType>::simd();
    simd.fma(&x, &y, &mut result);
    // result = x * y + result
}

Extended Precision

#[cfg(feature = "f128")]
{
    // The import lives inside the cfg block so the example still compiles
    // when the f128 feature is disabled.
    use oxiblas_core::scalar::QuadFloat;

    // Quad precision (f128) - ~31 decimal digits
    let x = QuadFloat::from(2.0);
    let sqrt_x = x.sqrt();
    println!("√2 = {}", sqrt_x); // printed to very high precision
}

Kahan Summation

use oxiblas_core::scalar::kahan_sum;

let values: Vec<f64> = vec![1.0, 1e-16, -1.0]; // naive left-to-right summation returns 0.0
let result = kahan_sum(&values); // compensated summation preserves the small term (≈1e-16)
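
For reference, compensated summation keeps a running correction term that recaptures the low-order bits lost at each step; the crate's kahan_sum may differ in detail (e.g. vectorization), but the textbook algorithm is:

fn kahan_sum_ref(values: &[f64]) -> f64 {
    let (mut sum, mut c) = (0.0_f64, 0.0_f64);
    for &v in values {
        let y = v - c;     // re-apply the bits lost on the previous step
        let t = sum + y;   // big + small: low-order bits of y may be lost
        c = (t - sum) - y; // recover exactly what was just lost
        sum = t;
    }
    sum
}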

Cache Detection

use oxiblas_core::tuning::detect_cache_sizes;

let cache = detect_cache_sizes();
println!("L1D: {} KB", cache.l1d / 1024);
println!("L2:  {} KB", cache.l2 / 1024);
println!("L3:  {} KB", cache.l3 / 1024);

Blocking Parameters

use oxiblas_core::blocking::BlockParams;

// Get optimal blocking parameters for GEMM
let params = BlockParams::for_gemm::<f64>();
println!("MC={}, KC={}, NC={}", params.mc, params.kc, params.nc);
// Automatically tuned for your system's cache hierarchy
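
These three parameters index the classic five-loop blocked GEMM. A naive, unpacked sketch showing where MC, KC, and NC apply (dimensions assumed divisible by the block sizes; the real implementation packs panels and runs a SIMD micro-kernel in place of the inner scalar loops):

/// C += A * B for row-major matrices, blocked with (mc, kc, nc).
fn gemm_blocked(
    m: usize, n: usize, k: usize,
    a: &[f64], b: &[f64], c: &mut [f64],
    mc: usize, kc: usize, nc: usize,
) {
    for jc in (0..n).step_by(nc) {         // NC-wide panel of B/C (L3-sized)
        for pc in (0..k).step_by(kc) {     // KC-deep slice shared by A and B
            for ic in (0..m).step_by(mc) { // MC×KC block of A (L2-resident)
                for i in ic..ic + mc {
                    for j in jc..jc + nc {
                        let mut acc = c[i * n + j];
                        for p in pc..pc + kc {
                            acc += a[i * k + p] * b[p * n + j];
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
}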

Feature Flags

Feature        Description                                     Default
default        Core functionality (f32, f64, complex)         Yes
parallel       Rayon-based parallelization                     No
f16            Half-precision (16-bit) floating point          No
f128           Quad-precision (~31 digits) via double-double   No
nightly        Nightly-only optimizations                      No
force-scalar   Disable SIMD, use scalar only (debug)           No
max-simd-128   Limit to 128-bit SIMD (SSE/NEON)                No
max-simd-256   Limit to 256-bit SIMD (AVX2)                    No

SIMD Support Matrix

Platform            128-bit   256-bit   512-bit
x86_64 (SSE4.1)     Yes       —         —
x86_64 (AVX2)       Yes       Yes       —
x86_64 (AVX-512)    Yes       Yes       Yes
AArch64 (NEON)      Yes       —         —
AArch64 (SVE)       Planned   Planned   Planned
Fallback (scalar)   —         —         —

Performance

SIMD Performance (Apple M3, NEON)

Operation   Size    Scalar    NEON (128-bit)   Speedup
f64 Add     4,096   15.2 µs   7.98 µs          1.9×
f64 FMA     4,096   22.1 µs   11.29 µs         2.0×
f32 Add     4,096   8.1 µs    3.2 µs           2.5×
f32 FMA     4,096   11.5 µs   4.8 µs           2.4×

SIMD Performance (Linux x86_64, AVX2)

Operation   Size    Scalar    AVX2 (256-bit)   Speedup
f64 Add     4,096   18.4 µs   7.98 µs          2.3×
f64 FMA     4,096   26.7 µs   11.29 µs         2.4×
f32 Add     4,096   9.8 µs    2.1 µs           4.7×
f32 FMA     4,096   14.2 µs   3.2 µs           4.4×

Architecture

oxiblas-core/
├── scalar.rs          # Scalar trait, f16, f128, extended precision
├── simd.rs            # SIMD abstraction layer
├── simd/
│   ├── avx2.rs        # AVX2/FMA kernels (x86_64)
│   ├── avx512.rs      # AVX-512 kernels (x86_64)
│   ├── neon.rs        # NEON kernels (AArch64)
│   └── scalar.rs      # Fallback scalar implementation
├── memory/
│   ├── align.rs       # Aligned allocation
│   ├── workspace.rs   # Temporary buffer management
│   └── cache.rs       # Cache-aware utilities
├── blocking.rs        # Blocking parameter calculation
├── tuning.rs          # Platform detection and auto-tuning
└── parallel.rs        # Parallel operations with Rayon

Supported Platforms

Tier 1 (Fully Tested)

  • x86_64: Linux, macOS, Windows
  • AArch64: macOS (Apple Silicon), Linux

Tier 2 (Best Effort)

  • x86: Linux, Windows
  • AArch64: Android, iOS
  • RISC-V: Linux (scalar only)

Requirements

  • Rust: 1.85+ (Edition 2024)
  • No external C dependencies
  • Optional: Rayon for parallelization (enabled via the parallel feature)

Examples

See the examples directory in the main repository:

  • basic_simd.rs - SIMD operations
  • extended_precision.rs - f16 and f128 usage
  • cache_tuning.rs - Platform-specific optimization

Benchmarks

Run benchmarks:

# SIMD benchmarks
cargo bench --package oxiblas-core --bench simd

# Blocking parameter benchmarks
cargo bench --package oxiblas-core --bench blocking

Safety

  • All SIMD operations are properly marked unsafe where required
  • Memory alignment is enforced at compile-time where possible
  • Extensive testing across platforms ensures correctness
  • No undefined behavior in safe APIs

Contributing

Contributions are welcome! Areas of interest:

  1. ARM SVE support - Scalable Vector Extension for newer ARM cores
  2. RISC-V vector - support for the RISC-V Vector extension (RVV)
  3. Additional extended precision - Alternative quad-float implementations
  4. Auto-tuning improvements - Better platform detection

License

Licensed under either of:

  • Apache License, Version 2.0
  • MIT license

at your option.
