oxiblas-core
Core traits, SIMD abstractions, and scalar types for the OxiBLAS library
Overview
oxiblas-core is the foundational crate for OxiBLAS, providing the core abstractions and building blocks used throughout the library. It is designed to be platform-agnostic with architecture-specific optimizations for x86_64 (AVX2/AVX-512) and AArch64 (NEON).
Features
Core Traits
Scalar- Fundamental trait for numeric types supported by BLAS/LAPACK- Implemented for:
f32,f64,Complex<f32>,Complex<f64> - Optional support for
f16(half precision) andf128(quad precision) - Provides type-safe operations and conversions
- Implemented for:
SIMD Abstractions
Architecture-specific vectorization with automatic fallback:
x86_64:
- AVX-512 (512-bit): 8×f64 or 16×f32 per instruction
- AVX2/FMA (256-bit): 4×f64 or 8×f32 per instruction
- SSE4.1/SSE4.2 (128-bit): 2×f64 or 4×f32 per instruction
AArch64:
- NEON (128-bit): 2×f64 or 4×f32 per instruction
- Advanced 4×6 micro-kernels optimized for Apple Silicon
Fallback:
- Scalar operations for platforms without SIMD support
Extended Precision Types
-
f16(half precision) - 16-bit floating point (withf16feature)- Useful for memory-constrained applications
- Hardware acceleration on ARM and modern x86_64
-
f128(quad precision) - ~31 decimal digits precision (withf128feature)- Based on double-double arithmetic
- Essential for high-accuracy numerical computations
- Kahan and pairwise summation algorithms
Memory Management
- Cache-aware allocation - Platform-specific cache line alignment
- Memory alignment - SIMD-friendly memory layout (16/32/64-byte alignment)
- Workspace management - Efficient temporary buffer reuse for LAPACK algorithms
Blocking & Tuning
- Automatic blocking parameters - Cache-aware tile sizes for GEMM and other operations
- Platform detection - Runtime detection of cache sizes (L1/L2/L3)
- Linux: sysfs (
/sys/devices/system/cpu/) - macOS: sysctl
- x86_64: CPUID instruction
- Linux: sysfs (
- Optimized for:
- Intel Xeon (256KB-512KB L2): KC=192, MC=128
- Apple Silicon (16MB L2): KC=448, MC=256
- AMD Zen (512KB L2): KC=192, MC=Variable
Parallel Operations
- Rayon integration (with
parallelfeature) - Multi-threaded BLAS Level 3 - Automatic parallelization for large matrices
- Load balancing - Efficient work distribution across cores
- Cache-aware parallel blocking - Minimizes false sharing
Installation
Add this to your Cargo.toml:
[]
= "0.1"
# With extended precision
= { = "0.1", = ["f16", "f128"] }
# With parallelization
= { = "0.1", = ["parallel"] }
# All features
= { = "0.1", = ["f16", "f128", "parallel"] }
Usage
Basic Scalar Operations
use Scalar;
// Works with f32, f64, Complex<f32>, Complex<f64>
let x = vec!;
let y = vec!;
let result = dot_product; // 32.0
SIMD Operations
use ;
// Automatic SIMD selection based on platform
let x: = vec!;
let y: = vec!;
let mut result = vec!;
// Uses AVX2/NEON automatically if available
unsafe
Extended Precision
use QuadFloat;
Kahan Summation
use kahan_sum;
let values: = vec!; // Difficult for naive sum
let result = kahan_sum; // Accurate result using compensated summation
Cache Detection
use detect_cache_sizes;
let cache = detect_cache_sizes;
println!;
println!;
println!;
Blocking Parameters
use BlockParams;
// Get optimal blocking parameters for GEMM
let params = ;
println!;
// Automatically tuned for your system's cache hierarchy
Feature Flags
| Feature | Description | Default |
|---|---|---|
default |
Core functionality (f32, f64, complex) | ✓ |
parallel |
Rayon-based parallelization | |
f16 |
Half-precision (16-bit) floating point | |
f128 |
Quad-precision (~31 digits) via double-double | |
nightly |
Nightly-only optimizations | |
force-scalar |
Disable SIMD, use scalar only (debug) | |
max-simd-128 |
Limit to 128-bit SIMD (SSE/NEON) | |
max-simd-256 |
Limit to 256-bit SIMD (AVX2) |
SIMD Support Matrix
| Platform | 128-bit | 256-bit | 512-bit |
|---|---|---|---|
| x86_64 (SSE4.1) | ✓ | ||
| x86_64 (AVX2) | ✓ | ✓ | |
| x86_64 (AVX-512) | ✓ | ✓ | ✓ |
| AArch64 (NEON) | ✓ | ||
| AArch64 (SVE) | ✓ | Planned | |
| Fallback (scalar) | ✓ |
Performance
SIMD Performance (Apple M3, NEON)
| Operation | Size | Scalar | NEON (128-bit) | Speedup |
|---|---|---|---|---|
| f64 Add | 4,096 | 15.2 µs | 7.98 µs | 1.9× |
| f64 FMA | 4,096 | 22.1 µs | 11.29 µs | 2.0× |
| f32 Add | 4,096 | 8.1 µs | 3.2 µs | 2.5× |
| f32 FMA | 4,096 | 11.5 µs | 4.8 µs | 2.4× |
SIMD Performance (Linux x86_64, AVX2)
| Operation | Size | Scalar | AVX2 (256-bit) | Speedup |
|---|---|---|---|---|
| f64 Add | 4,096 | 18.4 µs | 7.98 µs | 2.3× |
| f64 FMA | 4,096 | 26.7 µs | 11.29 µs | 2.4× |
| f32 Add | 4,096 | 9.8 µs | 2.1 µs | 4.7× |
| f32 FMA | 4,096 | 14.2 µs | 3.2 µs | 4.4× |
Architecture
oxiblas-core/
├── scalar.rs # Scalar trait, f16, f128, extended precision
├── simd.rs # SIMD abstraction layer
├── simd/
│ ├── avx2.rs # AVX2/FMA kernels (x86_64)
│ ├── avx512.rs # AVX-512 kernels (x86_64)
│ ├── neon.rs # NEON kernels (AArch64)
│ └── scalar.rs # Fallback scalar implementation
├── memory/
│ ├── align.rs # Aligned allocation
│ ├── workspace.rs # Temporary buffer management
│ └── cache.rs # Cache-aware utilities
├── blocking.rs # Blocking parameter calculation
├── tuning.rs # Platform detection and auto-tuning
└── parallel.rs # Parallel operations with Rayon
Supported Platforms
Tier 1 (Fully Tested)
- x86_64: Linux, macOS, Windows
- AArch64: macOS (Apple Silicon), Linux
Tier 2 (Best Effort)
- x86: Linux, Windows
- AArch64: Android, iOS
- RISC-V: Linux (scalar only)
Requirements
- Rust: 1.85+ (Edition 2024)
- No external C dependencies
- Optional: OpenMP or Rayon for parallelization
Examples
See the examples directory in the main repository:
basic_simd.rs- SIMD operationsextended_precision.rs- f16 and f128 usagecache_tuning.rs- Platform-specific optimization
Benchmarks
Run benchmarks:
# SIMD benchmarks
# Blocking parameter benchmarks
Safety
- All SIMD operations are properly marked
unsafewhere required - Memory alignment is enforced at compile-time where possible
- Extensive testing across platforms ensures correctness
- No undefined behavior in safe APIs
Contributing
Contributions are welcome! Areas of interest:
- ARM SVE support - Scalable Vector Extension for future ARM
- RISC-V vector - Vector extension support
- Additional extended precision - Alternative quad-float implementations
- Auto-tuning improvements - Better platform detection
Related Crates
oxiblas-matrix- Matrix types built on oxiblas-coreoxiblas-blas- BLAS operations using oxiblas-coreoxiblas-lapack- LAPACK decompositionsoxiblas- Meta-crate with unified API
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.
References
- BLIS Design - Blocking and micro-kernel design inspiration
- Intel Intrinsics Guide - x86_64 SIMD reference
- ARM NEON Intrinsics - AArch64 SIMD reference
- Kahan Summation - Compensated summation algorithm