Struct InnerProduct 

pub struct InnerProduct;
Compute the inner-product between vector-like types.

Trait Implementations§

Source§

impl Clone for InnerProduct

Source§

fn clone(&self) -> InnerProduct

Returns a duplicate of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl Debug for InnerProduct

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl<A, B, To> DistanceFunction<A, B, To> for InnerProduct

Source§

fn evaluate_similarity(&self, a: A, b: B) -> To

Perform a distance computation between the left-hand and right-hand arguments.
Source§

impl PureDistanceFunction<&[f32], BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<f32>, UnequalLengths>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

Source§

impl PureDistanceFunction<&[f32], BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<f32>, UnequalLengths>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

Source§

impl PureDistanceFunction<&[f32], BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<f32>, UnequalLengths>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

Source§

impl PureDistanceFunction<&[f32], BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<f32>, UnequalLengths>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

Source§

impl PureDistanceFunction<&[f32], BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<f32>, UnequalLengths>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

Source§

impl PureDistanceFunction<&[f32], BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<f32>, UnequalLengths>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

Source§

impl PureDistanceFunction<&[f32], BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<f32>, UnequalLengths>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

Source§

impl PureDistanceFunction<&[f32], BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<f32>, UnequalLengths>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

Source§

impl PureDistanceFunction<BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<4, Unsigned, SlicePtr<'_, u8>, BitTranspose>, BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl PureDistanceFunction<BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, Result<MathematicalValue<u32>, UnequalLengths>> for InnerProduct

Source§

impl<A> Target2<A, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>> for InnerProduct
where A: Architecture,

Compute the inner product between bitvectors x and y.

Returns an error if the arguments have different lengths.

Source§

fn run( self, _: A, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl<A> Target2<A, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>, BitTranspose>, BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>> for InnerProduct
where A: Architecture,

The strategy is to compute the inner product <x, y> by decomposing the problem into groups of 64-dimensions.

For each group, we load the 64 bits of y into a word, bits, and the four 64-bit words of the group in x into b0, b1, b2, and b3.

Note that bit i in b0 is bit 0 of the i-th value in this group. Likewise, bit i in b1 is bit 1 of the same value.

This means that we can compute the partial inner product for this group as

(bits & b0).count_ones()                // Contribution of bit 0
    + 2 * (bits & b1).count_ones()      // Contribution of bit 1
    + 4 * (bits & b2).count_ones()      // Contribution of bit 2
    + 8 * (bits & b3).count_ones()      // Contribution of bit 3

We process as many full groups as we can.

To handle the remainder, we need to be careful about accessing y because BitSlice only guarantees the validity of reads at the byte level. That is, we cannot assume that a full 64-bit read is valid.

The bit-transposed x, on the other hand, guarantees allocations in blocks of 4 × 64 bits, so full 64-bit reads of x are always valid and need no special handling.
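The group computation and the remainder handling described above can be sketched in portable Rust. This is a minimal illustration, not the crate's implementation: the function names, the little-endian bit order, and the helper that builds the bit planes are all assumptions made for the example.

```rust
/// Partial inner product for one group of 64 dimensions.
///
/// `x_planes[k]` holds bit `k` of each of the 64 four-bit values of `x`
/// (bit `i` of the word belongs to the `i`-th value). `y_bits` holds the
/// 64 one-bit values of `y`.
fn group_inner_product(x_planes: [u64; 4], y_bits: u64) -> u32 {
    x_planes
        .iter()
        .enumerate()
        // Plane k contributes with weight 2^k.
        .map(|(k, plane)| (*plane & y_bits).count_ones() << k)
        .sum()
}

/// Hypothetical helper: bit-transpose 64 four-bit values into four planes.
fn transpose_group(values: &[u8; 64]) -> [u64; 4] {
    let mut planes = [0u64; 4];
    for (i, v) in values.iter().enumerate() {
        for (k, plane) in planes.iter_mut().enumerate() {
            *plane |= u64::from((*v >> k) & 1) << i;
        }
    }
    planes
}

/// Load up to 8 trailing bytes of `y` into a zero-padded word, so the
/// remainder never needs a full 64-bit read from the slice itself.
fn load_tail(bytes: &[u8]) -> u64 {
    assert!(bytes.len() <= 8);
    let mut buf = [0u8; 8];
    buf[..bytes.len()].copy_from_slice(bytes);
    u64::from_le_bytes(buf)
}
```

Zero padding in load_tail is harmless because a zero bit in y contributes nothing to any of the four popcounts.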

Source§

fn run( self, _: A, x: BitSlice<'_, N, Unsigned, BitTranspose>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl<A> Target2<A, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>> for InnerProduct
where A: Architecture, InnerProduct: for<'a> Target2<A, MathematicalValue<f32>, &'a [u8], &'a [u8]>,

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Implementation Notes

This can directly invoke the methods implemented in vector because BitSlice<'_, 8, Unsigned> is isomorphic to &[u8].

Source§

fn run( self, arch: A, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl<const N: usize> Target2<Scalar, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<N, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, _: Scalar, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

A fallback implementation that uses scalar indexing to retrieve values from the corresponding BitSlice.
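As a rough picture of what such a scalar fallback does, the sketch below reads each packed value by index and accumulates in f32. It works on a raw byte buffer rather than on BitSlice. The names get_packed and scalar_inner_product, the little-endian bit layout, and the Option return type (standing in for the UnequalLengths error) are assumptions for illustration only.

```rust
/// Read the `i`-th `nbits`-wide unsigned value from a densely packed buffer.
/// Assumes values are packed LSB-first, crossing byte boundaries as needed.
fn get_packed(data: &[u8], nbits: usize, i: usize) -> u32 {
    let mut value = 0u32;
    for b in 0..nbits {
        let pos = i * nbits + b;
        let bit = (data[pos / 8] >> (pos % 8)) & 1;
        value |= u32::from(bit) << b;
    }
    value
}

/// Inner product of `x` with the packed values in `y`, one element at a time.
/// Returns `None` when `y` is too short to hold `x.len()` values.
fn scalar_inner_product(x: &[f32], y: &[u8], nbits: usize) -> Option<f32> {
    if y.len() * 8 < x.len() * nbits {
        return None;
    }
    let sum = x
        .iter()
        .enumerate()
        .map(|(i, xi)| xi * get_packed(y, nbits, i) as f32)
        .sum::<f32>();
    Some(sum)
}
```

For example, the three bytes 0xD1, 0x58, 0x1F hold the 3-bit values 1, 2, 3, 4, 5, 6, 7, 0 under this layout, so pairing them with eight 1.0 values gives 28.0.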

Source§

impl Target2<Scalar, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Performance

This function uses a generic implementation and therefore is not very fast.

Source§

fn run( self, _: Scalar, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<Scalar, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Performance

This function uses a generic implementation and therefore is not very fast.

Source§

fn run( self, _: Scalar, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<Scalar, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Performance

This function uses a generic implementation and therefore is not very fast.

Source§

fn run( self, _: Scalar, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<Scalar, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Performance

This function uses a generic implementation and therefore is not very fast.

Source§

fn run( self, _: Scalar, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<Scalar, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Performance

This function uses a generic implementation and therefore is not very fast.

Source§

fn run( self, _: Scalar, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<Scalar, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Performance

This function uses a generic implementation and therefore is not very fast.

Source§

fn run( self, _: Scalar, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<Scalar, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Performance

This function uses a generic implementation and therefore is not very fast.

Source§

fn run( self, _: Scalar, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<Scalar, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Performance

This function uses a generic implementation and therefore is not very fast.

Source§

fn run( self, _: Scalar, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<Scalar, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Performance

This function uses a generic implementation and therefore is not very fast.

Source§

fn run( self, _: Scalar, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Available on x86-64 only.

The main trick here is avoiding explicit conversion from 1-bit integers to 32-bit floating-point numbers by using _mm256_permutevar_ps, which performs a shuffle on two independent 128-bit lanes of f32 values in a register A using the lower 2 bits of each 32-bit integer in a register B.

Importantly, this instruction only takes a single cycle and we can avoid any kind of masking. Going the route of conversion would require an AND operation to isolate the bottom bits and a somewhat lengthy 32-bit integer to f32 conversion instruction.

The overall strategy broadcasts a 32-bit integer (consisting of 32 1-bit values) across 8 lanes into a register A.

Each lane is then shifted by a different amount so:

  • Lane 0 has value 0 as its least significant bit (LSB)
  • Lane 1 has value 1 as its LSB.
  • Lane 2 has value 2 as its LSB.
  • etc.

These LSBs drive the shuffle, which converts them to f32 values (either 0.0 or 1.0), and we can FMA as needed.

To process the next group of 8 values, we shift all lanes in A by 8-bits so lane 0 has value 8 as its LSB, lane 1 has value 9 etc.

A total of three shifts are applied to extract all 32 1-bit values as f32 in order.
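The lane and shift schedule above can be modelled in scalar Rust, which makes it easier to check which bit each lane sees at each step. This is only a model of the data flow, not the AVX2 kernel: the function name is hypothetical, and the four-entry table [0.0, 1.0, 0.0, 1.0] is an assumption about how a 2-bit shuffle index can be made to depend on the LSB alone.

```rust
/// Scalar model of the 1-bit schedule for one block of 32 values.
fn inner_product_1bit_block(x: &[f32; 32], y_bits: u32) -> f32 {
    // The shuffle reads the low 2 bits of each lane. With this table the
    // upper of those two bits has no effect, so no masking is required.
    const TABLE: [f32; 4] = [0.0, 1.0, 0.0, 1.0];

    // Broadcast the word to 8 lanes, then shift lane j right by j so that
    // lane j has value j as its least significant bit.
    let mut lanes = [0u32; 8];
    for (j, lane) in lanes.iter_mut().enumerate() {
        *lane = y_bits >> j;
    }

    let mut acc = [0.0f32; 8];
    for group in 0..4 {
        for j in 0..8 {
            let y = TABLE[(lanes[j] & 3) as usize];
            // Fused multiply-add: acc[j] += x * y.
            acc[j] = x[8 * group + j].mul_add(y, acc[j]);
        }
        // Three shifts in total move values 8..16, 16..24, 24..32 into place.
        if group < 3 {
            for lane in lanes.iter_mut() {
                *lane >>= 8;
            }
        }
    }
    acc.iter().sum()
}
```

The `& 3` in the model stands for what the shuffle instruction does on its own when it reads the index, so it is not an extra masking step in the real kernel.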

Source§

fn run( self, arch: V3, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Available on x86-64 only.

The strategy used here is almost identical to that used for 1-bit distances. The main difference is that now we use the full 2-bit shuffle capabilities of _mm256_permutevar_ps and the relative sizes of the shifts are slightly different.

Source§

fn run( self, arch: V3, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V3, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Available on x86-64 only.

The strategy here is similar to the 1-bit and 2-bit strategies. However, instead of using _mm256_permutevar_ps, we now convert directly from 32-bit integers to 32-bit floating point.

This is because the shuffle intrinsic only supports 2-bit shuffles.

Source§

fn run( self, arch: V3, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V3, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V3, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V3, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V3, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Available on x86-64 only.

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Implementation Notes

This implementation is optimized around x86 with the AVX2 vector extension. Specifically, we target Wide::<i32, 8> as SIMDDotProduct<Wide<i16, 8>> so that the computation lowers to the _mm256_madd_epi16 intrinsic.

Also note that AVX2 does not have 16-bit integer bit-shift instructions. Instead, we have to use 32-bit integer shifts and then bit-cast to 16-bit intrinsics. This works because we need to apply the same shift to all lanes.
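To make the shift-then-reinterpret idea concrete, here is a scalar model of a single 32-bit lane. It assumes sixteen 2-bit values per 32-bit word in little-endian order; all function names are hypothetical, and the real kernel's lane layout may differ.

```rust
/// Scalar model of one i32 output lane of `_mm256_madd_epi16`:
/// two adjacent i16 products are summed into a single i32.
fn madd_lane(a: [i16; 2], b: [i16; 2]) -> i32 {
    i32::from(a[0]) * i32::from(b[0]) + i32::from(a[1]) * i32::from(b[1])
}

/// Apply one 32-bit shift, mask, then view the word as two 16-bit lanes.
/// The same shift amount serves both halves, which is why a 32-bit shift
/// can stand in for a 16-bit one here.
fn crumb_pair(word: u32, shift: u32) -> [i16; 2] {
    let w = (word >> shift) & 0x0003_0003;
    [(w & 0xFFFF) as i16, (w >> 16) as i16]
}

/// Inner product of two words that each pack sixteen 2-bit values.
fn inner_product_2bit_madd(x: u32, y: u32) -> i32 {
    (0..8u32)
        .map(|k| madd_lane(crumb_pair(x, 2 * k), crumb_pair(y, 2 * k)))
        .sum()
}
```

Bits that cross from the upper half into the lower half during the 32-bit shift are removed by the mask, so each 16-bit lane ends up holding exactly one 2-bit value.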

Source§

fn run( self, arch: V3, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V3, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Available on x86-64 only.

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Implementation Notes

This implementation is optimized around x86 with the AVX2 vector extension. Specifically, we target Wide::<i32, 8> as SIMDDotProduct<Wide<i16, 8>> so that the computation lowers to the _mm256_madd_epi16 intrinsic.

Also note that AVX2 does not have 16-bit integer bit-shift instructions. Instead, we have to use 32-bit integer shifts and then bit-cast to 16-bit intrinsics. This works because we need to apply the same shift to all lanes.

Source§

fn run( self, arch: V3, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V3, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V3, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V3, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V3, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Available on x86-64 only.
Source§

fn run( self, arch: V3, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Computes the inner product of 8-bit unsigned × 1-bit unsigned vectors using V3 intrinsics.

For each 32-element block we load 32 bytes from x and 4 bytes (32 bits) from y. The 4 bytes of y are expanded into a byte mask, and ANDing x with that mask zeroes the unselected lanes. Finally, _mm256_sad_epu8 horizontally sums the masked bytes in groups of 8.

The main loop is 4× unrolled, processing 128 elements per iteration.

§Overflow

Each sad output lane holds at most 8 × 255 = 2_040. Accumulated across d/32 blocks, the per-lane max is (d/32) × 2_040. At dim = 3072: 96 × 2_040 = 195_840, well within i32 range.
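The mask-then-sum step and the overflow bound can be checked with a small scalar model. The names sad_block and max_lane_sum are hypothetical, and the assumption that bit i of the mask word selects element i is made for illustration.

```rust
/// Scalar model of one 32-element block: expand each mask bit to a byte,
/// AND it with the data, then sum the masked bytes in four groups of 8
/// (one group per 64-bit output lane of the SAD instruction).
fn sad_block(x: &[u8; 32], y_bits: u32) -> [u32; 4] {
    let mut lanes = [0u32; 4];
    for (i, &xi) in x.iter().enumerate() {
        let mask: u8 = if (y_bits >> i) & 1 == 1 { 0xFF } else { 0x00 };
        lanes[i / 8] += u32::from(xi & mask);
    }
    lanes
}

/// Worst-case per-lane accumulator value after `dim / 32` blocks.
const fn max_lane_sum(dim: usize) -> usize {
    (dim / 32) * 8 * 255
}
```

With every byte equal to 255 and every mask bit set, each lane of one block holds 2_040, and max_lane_sum(3072) evaluates to 195_840, matching the bound quoted above.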

Source§

impl Target2<V3, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Available on x86-64 only.
Source§

fn run( self, arch: V3, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Computes the inner product of 8-bit unsigned × 2-bit unsigned vectors using AVX2.

§Strategy

Unpack each 16-byte chunk of y into 64 crumb values via a two-level cascade: first [unpack_half_bytes] splits bytes into nibbles, then a second pass splits nibbles into crumbs (masked with 0x03). Each unpacked half is paired with 32 bytes of x and multiplied via _mm256_maddubs_epi16.

The main loop is 4× unrolled: eight i16 products (4 blocks × 2 halves) are summed in i16 before a single _mm256_madd_epi16(…, 1) widens to i32. This is safe because 8 × (255 × 3 × 2) = 12_240 < i16::MAX.
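A scalar sketch of the two-level unpack, together with the pairwise i16 sums, is given below. It processes one packed byte (four crumbs) at a time rather than 16-byte chunks, and the element order, function names, and Option return type are assumptions for the example rather than the crate's API.

```rust
/// Two-level unpack of one byte: first into nibbles, then into crumbs.
/// Element 0 is taken from the lowest two bits.
fn unpack_crumbs(byte: u8) -> [u8; 4] {
    let lo = byte & 0x0F;
    let hi = byte >> 4;
    [lo & 0x03, lo >> 2, hi & 0x03, hi >> 2]
}

/// Inner product of 8-bit `x` with 2-bit values packed four per byte in `y`.
/// Returns `None` if the element counts differ.
fn inner_product_u8_x_2bit(x: &[u8], y: &[u8]) -> Option<u32> {
    if x.len() != y.len() * 4 {
        return None;
    }
    let mut total = 0u32;
    for (xs, &packed) in x.chunks_exact(4).zip(y) {
        let c = unpack_crumbs(packed);
        // Each pairwise sum is at most 255 * 3 * 2 = 1_530, so it fits in i16.
        let lo = i16::from(xs[0]) * i16::from(c[0]) + i16::from(xs[1]) * i16::from(c[1]);
        let hi = i16::from(xs[2]) * i16::from(c[2]) + i16::from(xs[3]) * i16::from(c[3]);
        // Widen before accumulating across the whole vector.
        total += (i32::from(lo) + i32::from(hi)) as u32;
    }
    Some(total)
}
```

Eight such pairwise sums add up to at most 8 × 1_530 = 12_240, which is the headroom argument used for summing in i16 before widening.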

Source§

impl Target2<V3, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Available on x86-64 only.
Source§

fn run( self, arch: V3, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Computes the inner product of 8-bit unsigned × 4-bit unsigned vectors using V3 intrinsics.

§Strategy

Unpack each 16-byte chunk of y into 32 nibble values via [unpack_half_bytes], then multiply with the corresponding 32 bytes of x using _mm256_maddubs_epi16 (u8 × i8 → i16, pairwise horizontal add with saturation). The nibble values are at most 15, so they are valid as the signed operand.

The main loop is 4× unrolled: four i16 products are summed in i16 before a single _mm256_madd_epi16(…, 1) widens to i32. This is safe because 4 × (255 × 15 × 2) = 30_600 < i16::MAX.
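The i16 headroom argument can be exercised with a scalar model of one output lane. Here four adjacent pair products stand in for the four unrolled blocks that feed the same lane in the kernel. The names are hypothetical and the nibble order (low nibble first) is an assumption.

```rust
/// Scalar model of one i16 output lane of `_mm256_maddubs_epi16`:
/// unsigned bytes `a` times signed bytes `b`, with the adjacent pair
/// summed and saturated to the i16 range.
fn maddubs_lane(a: [u8; 2], b: [i8; 2]) -> i16 {
    let sum = i32::from(a[0]) * i32::from(b[0]) + i32::from(a[1]) * i32::from(b[1]);
    sum.clamp(i32::from(i16::MIN), i32::from(i16::MAX)) as i16
}

/// Split one byte into two nibbles. Values are at most 15, so the cast to
/// i8 never changes them.
fn unpack_nibbles(byte: u8) -> [i8; 2] {
    [(byte & 0x0F) as i8, (byte >> 4) as i8]
}

/// Sum four pair products in i16 before any widening.
fn lane_sum_4bit(x: &[u8; 8], y: &[u8; 4]) -> i16 {
    let mut acc = 0i16;
    for (xs, &packed) in x.chunks_exact(2).zip(y) {
        acc += maddubs_lane([xs[0], xs[1]], unpack_nibbles(packed));
    }
    acc
}
```

In the worst case (all x bytes 255, all nibbles 15) the lane sum is 30_600, just under i16::MAX, so neither the saturation in maddubs_lane nor the i16 accumulator is ever stressed.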

Source§

impl Target2<V4, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<f32>, UnequalLengths>, &[f32], BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: &[f32], y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<f32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Available on x86-64 only.

Compute the inner product between x and y.

Returns an error if the arguments have different lengths.

§Implementation Notes

This is optimized around the _mm512_dpbusd_epi32 VNNI instruction, which computes the pairwise products between vectors of 8-bit integers and accumulates each group of 4 products into a lane of an i32 accumulation vector.

One quirk of this instruction is that one argument must be unsigned and the other must be signed. Since this kernel works on 2-bit integers, whose values fit in either interpretation, this is not a limitation. It is just something to be aware of.

Since AVX512 does not have an 8-bit shift instruction, we generally load data as u32x16 (which has a native shift) and bit-cast it to u8x64 as needed.
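Both points, the mixed-sign dot product and the shift on 32-bit words followed by a byte reinterpretation, can be modelled for a single 32-bit lane. This is a scalar sketch with hypothetical names, assuming sixteen 2-bit values per word in little-endian order. It is not the AVX-512 kernel.

```rust
/// Scalar model of one i32 lane of the VNNI dot product: four unsigned
/// bytes times four signed bytes, summed and added to the accumulator.
fn dpbusd_lane(acc: i32, a: [u8; 4], b: [i8; 4]) -> i32 {
    let dot: i32 = a
        .iter()
        .zip(b.iter())
        .map(|(&u, &s)| i32::from(u) * i32::from(s))
        .sum();
    acc.wrapping_add(dot)
}

/// Shift as a 32-bit word, mask, then reinterpret as four bytes.
/// Bits that spill across byte boundaries are cleared by the mask.
fn crumbs_at(word: u32, shift: u32) -> [u8; 4] {
    ((word >> shift) & 0x0303_0303).to_le_bytes()
}

/// Inner product of two words that each pack sixteen 2-bit values.
fn inner_product_2bit_dpbusd(x: u32, y: u32) -> i32 {
    let mut acc = 0;
    for shift in [0, 2, 4, 6] {
        let a = crumbs_at(x, shift);
        // Values are at most 3, so viewing them as signed changes nothing.
        let b = crumbs_at(y, shift).map(|v| v as i8);
        acc = dpbusd_lane(acc, a, b);
    }
    acc
}
```

Because every operand is in 0..=3, the signed and unsigned interpretations agree, which is why the mixed-sign requirement costs nothing here.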

Source§

fn run( self, arch: V4, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<3, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<5, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<6, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<7, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<1, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<2, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Target2<V4, Result<MathematicalValue<u32>, UnequalLengths>, BitSliceBase<8, Unsigned, SlicePtr<'_, u8>>, BitSliceBase<4, Unsigned, SlicePtr<'_, u8>>> for InnerProduct

Source§

fn run( self, arch: V4, x: BitSlice<'_, N, Unsigned, Dense>, y: BitSlice<'_, N, Unsigned, Dense>, ) -> MathematicalResult<u32>

Run the operation with the provided Architecture.
Source§

impl Copy for InnerProduct

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> ByRef<T> for T

Source§

fn by_ref(&self) -> &T

Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> Generator<T> for T
where T: Clone,

Source§

fn generate(&mut self) -> T

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer. Read more
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> AsyncFriendly for T
where T: Send + Sync + 'static,