Caravela provides a comprehensive suite of linear algebra operations, from basic vector dot products to complex matrix multiplications, with high performance through runtime CPU feature detection and hand-optimized SIMD kernels.
## Key Features
- Zero Dependencies: Pure Rust implementation
- Complete BLAS Coverage: Level 1 (vector-vector), Level 2 (matrix-vector), and Level 3 (matrix-matrix) operations
- Two-Level API:
  - Simple high-level functions for casual users
  - Full BLAS-style interface for advanced control
- SIMD Acceleration:
  - x86_64: AVX2 + FMA with optimized microkernel implementations
  - AArch64: NEON optimized for ARM processors (Apple Silicon, AWS Graviton, etc.)
- Cache-Optimized GEMM: State-of-the-art BLIS algorithm implementation with multi-level cache blocking
- Runtime Feature Detection: Automatically selects the best implementation for your CPU
- Generic Design: Seamless operation with both `f32` and `f64` types
## Installation

Add Caravela to your `Cargo.toml`:

```toml
[dependencies]
caravela = "0.1.0"
```
## Usage
Caravela provides two API levels to suit different needs:
### High-Level API
Simple, easy-to-use functions for everyday linear algebra operations.
```rust
use caravela::{dot, l2sq, matmul, matvec, normalize, scale};

// Note: dimension arguments and exact argument order below are illustrative;
// the data is chosen to match the annotated results.

// Vector dot product
let a = vec![1.0, 2.0, 3.0];
let b = vec![4.0, 5.0, 6.0];
let result = dot(&a, &b); // 32.0

// Squared Euclidean distance (more efficient for comparisons)
let dist_sq = l2sq(&a, &b); // 27.0 (use sqrt if you need actual distance)

// Normalize a vector in-place (returns original norm)
let mut v = vec![3.0, 4.0];
let norm = normalize(&mut v); // norm = 5.0, v = [0.6, 0.8]

// Scale a vector in-place
let mut v = vec![1.0, 2.0, 3.0];
scale(&mut v, 2.0); // v = [2.0, 4.0, 6.0]

// Matrix-vector multiplication: y = Ax (row-major 2x3 matrix)
let matrix = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
let vector = vec![1.0, 1.0, 1.0];
let result = matvec(&matrix, &vector, 2, 3); // [6.0, 15.0]

// Matrix-matrix multiplication: C = AB (row-major 2x2 matrices)
let a = vec![1.0, 2.0, 3.0, 4.0];
let b = vec![5.0, 6.0, 7.0, 8.0];
let c = matmul(&a, &b, 2, 2, 2); // [19.0, 22.0, 43.0, 50.0]
```
### Low-Level API
BLAS-style interface providing full control over all parameters and operations.
```rust
use caravela::{dot, gemm, gemm_nt, gemm_tn, gemm_tt, gemv, gemv_t, l2sq};

// Note: argument order and dimension parameters below are illustrative;
// the data is chosen to match the annotated results.

// Vector operations (same as high-level, included for completeness)
let a = vec![1.0, 2.0, 3.0];
let b = vec![4.0, 5.0, 6.0];
let dot_product = dot(&a, &b);
let dist_sq = l2sq(&a, &b);

// General matrix-vector multiply: y = α·A·x + β·y (A is 2x3, row-major)
let matrix = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
let x = vec![1.0, 1.0, 1.0];
let mut y = vec![10.0, 20.0];
gemv(2.0, &matrix, &x, 0.5, &mut y, 2, 3);
// y = 2.0 * [6.0, 15.0] + 0.5 * [10.0, 20.0] = [17.0, 40.0]

// Transposed matrix-vector multiply: y = α·A^T·x + β·y
let x2 = vec![1.0, 2.0]; // Note: x has m elements for A^T
let mut y2 = vec![5.0, 5.0, 5.0]; // y has n elements
gemv_t(1.0, &matrix, &x2, 1.0, &mut y2, 2, 3);
// Computes: y = A^T * x + y
// A^T * [1, 2] = [9, 12, 15], so y = [14, 17, 20]

// General matrix-matrix multiply: C = α·A·B + β·C (2x2 matrices)
let a = vec![1.0, 2.0, 3.0, 4.0];
let b = vec![5.0, 6.0, 7.0, 8.0];
let mut c = vec![1.0, 1.0, 1.0, 1.0]; // 2x2 matrix
gemm(2.0, &a, &b, 3.0, &mut c, 2, 2, 2);
// c = 2.0 * A * B + 3.0 * C
// c = 2.0 * [19, 22, 43, 50] + 3.0 * [1, 1, 1, 1] = [41, 47, 89, 103]

// Transposed A: C = α·A^T·B + β·C
let a_t = vec![1.0, 3.0, 2.0, 4.0]; // A stored transposed
gemm_tn(2.0, &a_t, &b, 3.0, &mut c, 2, 2, 2);

// Transposed B: C = α·A·B^T + β·C
let b_t = vec![5.0, 7.0, 6.0, 8.0]; // B stored transposed
gemm_nt(2.0, &a, &b_t, 3.0, &mut c, 2, 2, 2);

// Both transposed: C = α·A^T·B^T + β·C
gemm_tt(2.0, &a_t, &b_t, 3.0, &mut c, 2, 2, 2);
```
## Performance
Caravela implements state-of-the-art algorithms for maximum performance:
### GEMM (Matrix Multiplication)

- BLIS Algorithm: 5-level nested loops with cache blocking (see the sketch after this list)
- Optimized Microkernels: Hand-tuned SIMD kernels for AVX2 and NEON
- Cache-Aware Design: Multi-level blocking (L1/L2/L3) for optimal data reuse
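The five loops partition C = A·B so that a column panel of B stays resident in L3, a packed block of A in L2, and the microkernel's tile in registers. Below is a minimal, runnable sketch of that loop nest; the block sizes (`NC`, `KC`, `MC`, `NR`, `MR`) and the scalar inner kernel are illustrative stand-ins for Caravela's packed SIMD microkernels, not the crate's actual internals:

```rust
// Illustrative BLIS-style five-loop GEMM: C += A * B, all row-major.
// Block sizes and the scalar microkernel are placeholders for the real,
// packed SIMD kernels.
fn gemm_blocked(m: usize, n: usize, k: usize, a: &[f64], b: &[f64], c: &mut [f64]) {
    const NC: usize = 256; // loop 5: B/C column panel sized for L3
    const KC: usize = 128; // loop 4: depth panel sized for L2 (B packed here)
    const MC: usize = 64;  // loop 3: A row block sized for L2 (A packed here)
    const NR: usize = 4;   // loop 2: microkernel tile width
    const MR: usize = 4;   // loop 1: microkernel tile height

    for jc in (0..n).step_by(NC) {
        let nc = NC.min(n - jc);
        for pc in (0..k).step_by(KC) {
            let kc = KC.min(k - pc);
            for ic in (0..m).step_by(MC) {
                let mc = MC.min(m - ic);
                for jr in (0..nc).step_by(NR) {
                    for ir in (0..mc).step_by(MR) {
                        // Microkernel: accumulate an MR x NR tile of C.
                        for i in ir..(ir + MR).min(mc) {
                            for j in jr..(jr + NR).min(nc) {
                                let mut acc = 0.0;
                                for p in 0..kc {
                                    acc += a[(ic + i) * k + pc + p] * b[(pc + p) * n + jc + j];
                                }
                                c[(ic + i) * n + jc + j] += acc;
                            }
                        }
                    }
                }
            }
        }
    }
}
```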
### GEMV (Matrix-Vector)

- Blocked Algorithm: Cache-friendly tiling for both standard and transposed operations (sketched below)
- SIMD Acceleration: Vectorized dot products for each row/column
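As a rough illustration of the tiling for the non-transposed case, the sketch below keeps a slice of `x` hot in cache while sweeping all rows; the block size is arbitrary and the scalar inner loop stands in for the vectorized dot products:

```rust
// Illustrative cache-blocked y = A * x for a row-major m x n matrix.
fn gemv_blocked(a: &[f64], x: &[f64], y: &mut [f64], m: usize, n: usize) {
    const BLOCK: usize = 256; // columns of A (and of x) reused across a row sweep

    for yi in y.iter_mut() {
        *yi = 0.0;
    }
    for j0 in (0..n).step_by(BLOCK) {
        let j1 = (j0 + BLOCK).min(n);
        // The x[j0..j1] slice stays hot in L1 while we visit all m rows.
        for i in 0..m {
            let row = &a[i * n + j0..i * n + j1];
            let mut acc = 0.0;
            for (aij, xj) in row.iter().zip(&x[j0..j1]) {
                acc += aij * xj; // SIMD dot product in the real kernel
            }
            y[i] += acc;
        }
    }
}
```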
### Vector Operations

- Unrolled Loops: 8-way unrolling with multiple accumulators (see the sketch below)
- SIMD Utilization: Full-width vectors (256-bit AVX2, 128-bit NEON)
- Performance: Near memory-bandwidth limits for large vectors
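The core idea behind the unrolling is to keep several independent accumulators in flight so that FMA latency is hidden. Here is a minimal scalar sketch with four accumulators (the actual kernels use 8-way unrolling and SIMD registers):

```rust
// Illustrative multi-accumulator dot product: four independent dependency
// chains let the CPU overlap multiply-adds instead of serializing them.
fn dot_unrolled(a: &[f64], b: &[f64]) -> f64 {
    let n = a.len().min(b.len());
    let chunks = n / 4 * 4;
    let (mut s0, mut s1, mut s2, mut s3) = (0.0, 0.0, 0.0, 0.0);
    let mut i = 0;
    while i < chunks {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
        i += 4;
    }
    let mut tail = 0.0;
    for j in chunks..n {
        tail += a[j] * b[j]; // scalar remainder
    }
    (s0 + s1) + (s2 + s3) + tail
}
```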
## Architecture Support
- x86_64: Requires AVX2 + FMA (Intel Haswell/AMD Excavator or newer)
- AArch64: Requires NEON (all ARMv8+ processors)
- Fallback: Optimized scalar implementation for other architectures
The library automatically detects and uses the best available instruction set at runtime.
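On x86_64 this is typically done with `std::arch` feature detection. A minimal sketch of the dispatch pattern (the kernel names here are hypothetical, and the AVX2+FMA body is elided):

```rust
// Portable scalar fallback.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Hypothetical AVX2+FMA kernel; a real one would use core::arch intrinsics.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    dot_scalar(a, b) // placeholder body
}

fn dot_dispatch(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: only reached when the CPU reports AVX2 and FMA support.
            return unsafe { dot_avx2_fma(a, b) };
        }
    }
    dot_scalar(a, b)
}
```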
## Future Directions
Caravela is a project that was born from my own needs: I was growing tired of crappy dynamic links to BLAS libraries. It is under constant development as I learn more about low-level programming. Future areas of development:
- GPU acceleration backends