/// Scalar implementation of the add operation.
///
/// WHY: This is the fallback of last resort for any CPU that lacks SIMD
/// support, or for which no optimized path has been written yet.
/// It uses a simple loop, which modern compilers can often auto-vectorize.