| version with prefetching
|
| TODO(msmelyan)
|
| The crux of the computation is computing
| a / (sqrt(b) + epsilon), where a and b are
| vectors and epsilon is very small (e.g., 10^-5) and
| does not change.
|
| Today it is computed using two SIMD instructions:
| vector sqrt and vector divide.
|
| This is slow. We can take advantage of the existing
| fast vector instruction VRSQRTPS, which computes
| approximate reciprocals of the square roots of the
| vector elements.
|
| It is 6x faster than the vsqrt and vdiv combination.
|
| Since the addition of epsilon is done just to
| avoid division by zero, we approximate
| a / (sqrt(b) + epsilon) by a / sqrt(b + sqrt(epsilon)).
|
| If we do that, we can use VRSQRTPS instead.
|
| VRSQRTPS is not very accurate. Specifically, in
| a test on random numbers between 0.1 and 1, the
| absolute error was about 10^-3 compared to the
| slower but more accurate combination of vsqrt and
| vdiv.
|
| Extend Marat's function with more Newton-Raphson (NR)
| iterations to get more accuracy for training.
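|
| A minimal sketch of the idea, assuming AVX is available (this is an
| illustration, not the actual kernel, and ApproxDivSqrt is a hypothetical
| helper name): substitute a / sqrt(b + sqrt(epsilon)) for
| a / (sqrt(b) + epsilon), compute the reciprocal square root with
| VRSQRTPS (_mm256_rsqrt_ps), and refine it with a Newton-Raphson step
| y' = y * (1.5 - 0.5 * x * y * y):
|
| #include <immintrin.h>
| #include <cmath>
|
| // Sketch only: approximate a / (sqrt(b) + epsilon) as a * rsqrt(b + sqrt(epsilon)).
| static inline __m256 ApproxDivSqrt(__m256 a, __m256 b, float epsilon) {
|   const __m256 x = _mm256_add_ps(b, _mm256_set1_ps(std::sqrt(epsilon)));
|   __m256 y = _mm256_rsqrt_ps(x); // VRSQRTPS: ~12-bit accurate estimate
|   // One Newton-Raphson iteration roughly doubles the accurate bits;
|   // add further iterations if training needs more accuracy.
|   const __m256 half_x = _mm256_mul_ps(_mm256_set1_ps(0.5f), x);
|   y = _mm256_mul_ps(
|       y,
|       _mm256_sub_ps(_mm256_set1_ps(1.5f),
|                     _mm256_mul_ps(half_x, _mm256_mul_ps(y, y))));
|   return _mm256_mul_ps(a, y);
| }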
|
| TODO(msmelyan)
|
| Explore streaming stores, but this requires
| unique indices (deduplication).
|
| const float* w_n, // prefetch ptr
| const float* h_n, // prefetch ptr
| float* nw_n, // prefetch ptr
| float* nh_n, // prefetch ptr
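|
| The four pointers above are meant to reference rows that will be needed a few
| iterations ahead. A hedged sketch of how they might be used (PrefetchRow is a
| hypothetical helper built on the standard _mm_prefetch intrinsic):
|
| #include <xmmintrin.h>
|
| // Sketch only: issue software prefetches for a row that will be touched a few
| // iterations from now, so its cache lines arrive before they are needed.
| inline void PrefetchRow(const float* p, int block_size) {
|   for (int k = 0; k < block_size; k += 16) { // 16 floats per 64-byte cache line
|     _mm_prefetch(reinterpret_cast<const char*>(p + k), _MM_HINT_T0);
|   }
| }
|
| // Inside the per-row loop one might then call:
| //   PrefetchRow(w_n, block_size);
| //   PrefetchRow(h_n, block_size);
| //   PrefetchRow(nw_n, block_size);
| //   PrefetchRow(nh_n, block_size);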
| Embedding lookup with reduction.
|
| input of size data_size * block_size
| indices of size index_size
| lengths of size output_size
| weights nullptr or array of size index_size
| out of size output_size * block_size
| sum(lengths[i]) == index_size
|
| Behavior is roughly equivalent to pseudocode:
|
| pos = 0
| for (i = 0..output_size-1)
|   for (k = 0..block_size-1)
|     out[i*block_size + k] = 0
|   for (j = 0..lengths[i]-1)
|     for (k = 0..block_size-1)
|       out[i*block_size + k] += input[indices[pos] * block_size + k] *
|         (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
|     pos += 1
|   if (normalize_weights && lengths[i] > 0)
|     for (k = 0..block_size-1)
|       out[i*block_size + k] /= lengths[i]
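|
| A minimal scalar C++ reference of the pseudocode above, for illustration only
| (the dispatching implementation is vectorized; the function name and the
| integer types here are assumptions):
|
| #include <cstdint>
|
| // Sketch: plain (non-fused) embedding lookup with sum reduction and optional
| // per-index weights; returns false on an out-of-bound index.
| bool EmbeddingLookupRef(
|     int64_t block_size, int64_t output_size, int64_t index_size,
|     int64_t data_size, const float* input, const int64_t* indices,
|     const int* lengths, const float* weights, bool normalize_weights,
|     float* out, bool is_weight_positional = false) {
|   int64_t pos = 0;
|   for (int64_t i = 0; i < output_size; ++i) {
|     float* const o = out + i * block_size;
|     for (int64_t k = 0; k < block_size; ++k) o[k] = 0.f;
|     for (int j = 0; j < lengths[i]; ++j, ++pos) {
|       const int64_t idx = indices[pos];
|       if (idx < 0 || idx >= data_size) return false; // out-of-bound error
|       const float w = weights ? weights[is_weight_positional ? j : pos] : 1.f;
|       const float* const row = input + idx * block_size;
|       for (int64_t k = 0; k < block_size; ++k) o[k] += w * row[k];
|     }
|     if (normalize_weights && lengths[i] > 0) {
|       for (int64_t k = 0; k < block_size; ++k) o[k] /= lengths[i];
|     }
|   }
|   return pos == index_size;
| }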
|
| Base implementation does runtime dispatch for
| each segment of reduction
|
| @return false if there is an out-of-bound error
|
| bool IS_WEIGHT_POSITIONAL = false>
| const float* weights, // optional, can be null for sum reducer
| const float* scale_bias, // optional scale & bias params for uint8 input
| Embedding lookup with reduction.
|
| input of size data_size * (block_size + 8B)
| indices of size index_size
| lengths of size output_size
| weights nullptr or array of size index_size
| out of size output_size * block_size
| sum(lengths[i]) == index_size
|
| Note that block_size should be the number of
| quantized values per row in the data,
| i.e. excluding the scale and bias. The total
| (fused) block size is assumed to be this
| block_size, plus 4 bytes for scale and 4 bytes
| for bias.
|
| Behavior is roughly equivalent to pseudocode:
|
| pos = 0
| fused_block_size = block_size + 8B // quantized values and scale and bias
| for (i = 0..output_size-1)
|   for (k = 0..block_size-1)
|     out[i*block_size + k] = 0
|   for (j = 0..lengths[i]-1)
|     for (k = 0..block_size-1)
|       out[i*block_size + k] += input[indices[pos] * fused_block_size + k] *
|         (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
|     pos += 1
|   if (normalize_weights && lengths[i] > 0)
|     for (k = 0..block_size-1)
|       out[i*block_size + k] /= lengths[i]
|
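| A sketch of how one fused row can be read (DequantizeFusedRow is a
| hypothetical helper, and the placement of scale and bias at the end of the
| row is an assumption for illustration; the actual layout is defined by the
| op that produced the quantized data):
|
| #include <cstdint>
| #include <cstring>
|
| // Sketch only: dequantize one fused row of block_size uint8 values plus a
| // 4-byte float scale and a 4-byte float bias (assumed here to trail the row).
| void DequantizeFusedRow(const uint8_t* row, int64_t block_size, float* dst) {
|   float scale, bias;
|   std::memcpy(&scale, row + block_size, sizeof(float));
|   std::memcpy(&bias, row + block_size + sizeof(float), sizeof(float));
|   for (int64_t k = 0; k < block_size; ++k) {
|     dst[k] = row[k] * scale + bias; // dequantized value
|   }
| }
|
| // Rows are addressed with the fused stride, e.g.:
| //   fused_block_size = block_size + 8;  // bytes
| //   const uint8_t* row = input + indices[pos] * fused_block_size;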
|
| bool IS_WEIGHT_POSITIONAL = false>
| const float* weights, // optional, can be null for non-weighted sum
| Base implementation does runtime dispatch for
| each segment of reduction
|
| @return false if there is an out-of-bound error
|
| bool IS_WEIGHT_POSITIONAL = false>
| const float* weights, // optional, can be null for sum reducer
| Embedding lookup with reduction.
|
| input of size data_size * (block_size + 8B)
| indices of size index_size
| offsets of size output_size
| weights nullptr or array of size index_size
| out of size output_size * block_size
|
| Note that block_size should be the number of
| quantized values per row in the data,
| i.e. excluding the scale and bias. The total
| (fused) block size is assumed to be this
| block_size, plus 4 bytes for scale and 4 bytes
| for bias.
|
| Behavior is roughly equivalent to pseudocode:
|
| pos = 0
| fused_block_size = block_size + 8B // quantized values and scale and bias
| for (i = 0..output_size-1)
|   for (k = 0..block_size-1)
|     out[i*block_size + k] = 0
|   start_offset = offsets[i]
|   end_offset = (i == output_size-1) ? index_size : offsets[i+1]
|   length = end_offset - start_offset
|   for (j = start_offset..end_offset-1)
|     for (k = 0..block_size-1)
|       out[i*block_size + k] += input[indices[pos] * fused_block_size + k] *
|         (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
|     pos += 1
|   if (normalize_weights && length > 0)
|     for (k = 0..block_size-1)
|       out[i*block_size + k] /= length
|
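| For illustration, a small hypothetical helper showing how the per-segment
| lengths used above follow from an offsets array of size output_size, with the
| last segment ending at index_size:
|
| #include <cstdint>
| #include <vector>
|
| // Sketch: segment i covers indices [offsets[i], end_i), where end_i is
| // offsets[i+1] for all but the last segment, which ends at index_size.
| std::vector<int64_t> OffsetsToLengths(
|     const int64_t* offsets, int64_t output_size, int64_t index_size) {
|   std::vector<int64_t> lengths(output_size);
|   for (int64_t i = 0; i < output_size; ++i) {
|     const int64_t end = (i == output_size - 1) ? index_size : offsets[i + 1];
|     lengths[i] = end - offsets[i];
|   }
|   return lengths;
| }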
|
| bool IS_WEIGHT_POSITIONAL = false>
| const float* weights, // optional, can be null for non-weighted sum
Similar to Axpy, which calculates y = a * x + y, but allows x and y to be
of different data types.
It also provides a performance optimization hint (use_a) indicating whether
a is going to be 1 or not.
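
A minimal sketch of the idea (the signature and the meaning of use_a are
assumptions for illustration, not the library's declaration): the input and
output element types differ, and the use_a hint lets the implementation skip
the multiply when a is known to be 1.

#include <cstddef>

// Sketch: y = a * x + y with distinct input/output types. When use_a is false,
// the caller promises a == 1, so the multiplication can be skipped.
template <typename TIn, typename TOut>
void TypedAxpy(std::size_t n, float a, const TIn* x, TOut* y, bool use_a) {
  if (use_a) {
    for (std::size_t i = 0; i < n; ++i) {
      y[i] += static_cast<TOut>(a * static_cast<float>(x[i]));
    }
  } else {
    for (std::size_t i = 0; i < n; ++i) {
      y[i] += static_cast<TOut>(static_cast<float>(x[i]));
    }
  }
}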