| Version with prefetching for embeddings and
| momentum using fp16
|
| const Half* w_n, // prefetch ptr
| const Half* h_n, // prefetch ptr
| Half* nw_n, // prefetch ptr
| Half* nh_n, // prefetch ptr
| Compute sparse Adagrad; assumes the embedding and
| momentum are at::Half
|
| {w_n, h_n, nw_n, nh_n} prefetch ptr
| const Half* /* w_n */, // prefetch ptr
| const Half* /* h_n */, // prefetch ptr
| Half* /* nw_n */, // prefetch ptr
| Half* /* nh_n */, // prefetch ptr
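|
| As a rough illustration of what one fused step looks like,
| here is a scalar sketch of a sparse Adagrad update on a single
| at::Half embedding/momentum row, prefetching the next row via
| the w_n/h_n/nw_n/nh_n pointers above. The standard update
| w += lr * g / (sqrt(h) + epsilon) is assumed; the function
| name and exact signature are illustrative, not the kernel's
| real API (requires F16C for the scalar fp16 conversions).
|
| #include <cmath>
| #include <immintrin.h> // _mm_prefetch, _cvtsh_ss, _cvtss_sh (F16C)
|
| using HalfBits = unsigned short; // raw fp16 bits, standing in for at::Half
|
| inline void adagrad_fp16_row_sketch(
|     int block_size,
|     const HalfBits* w, const HalfBits* h, // current row: weights, momentum
|     const float* g,                       // gradient row
|     HalfBits* nw, HalfBits* nh,           // updated row: weights, momentum
|     const HalfBits* w_n, const HalfBits* h_n, // prefetch ptrs (next row)
|     HalfBits* nw_n, HalfBits* nh_n,            // prefetch ptrs (next row)
|     float epsilon, float lr) {
|   _mm_prefetch(reinterpret_cast<const char*>(w_n), _MM_HINT_T0);
|   _mm_prefetch(reinterpret_cast<const char*>(h_n), _MM_HINT_T0);
|   _mm_prefetch(reinterpret_cast<const char*>(nw_n), _MM_HINT_T0);
|   _mm_prefetch(reinterpret_cast<const char*>(nh_n), _MM_HINT_T0);
|   for (int k = 0; k < block_size; ++k) {
|     float gk = g[k];
|     float hk = _cvtsh_ss(h[k]) + gk * gk;            // accumulate g^2
|     float wk = _cvtsh_ss(w[k]) + lr * gk / (std::sqrt(hk) + epsilon);
|     nh[k] = _cvtss_sh(hk, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
|     nw[k] = _cvtss_sh(wk, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
|   }
| }
|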
| version without prefetching
|
| version without prefetching
| The following functions inside the internal
| namespace are inlined because they
| are performance critical.
|
| version without prefetching
|
| version with prefetching
|
| TODO(msmelyan)
|
| The crux of the computation is computing
| a / (sqrt(b) + epsilon), where a and b are
| vectors and epsilon is very small (e.g., 10^-5) and
| does not change.
|
| Today it is computed using two vector sqrt and
| vector divide SIMD instructions.
|
| This is slow. We can take advantage of the existing
| fast vector VRSQRTPS instruction, which computes
| approximate reciprocals of the square roots of the
| vector elements.
|
| It is 6x faster than the vsqrt and vdiv combination.
|
| Since epsilon is added only to avoid division by
| zero, we approximate a / (sqrt(b) + epsilon) by
| a / sqrt(b + sqrt(epsilon)).
|
| With that rewrite, we can use VRSQRTPS directly.
|
| VRSQRTPS is not very accurate. Specifically, in a
| test on random numbers between 0.1 and 1, the
| absolute error was about 10^-3 compared to the
| slower but more accurate combination of vsqrt and
| vdiv.
|
| Extend Marat's function with more Newton-Raphson (NR)
| iterations to get more accuracy for training.
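|
| As a concrete sketch of the trick described above (not the
| exact kernel code), a / (sqrt(b) + epsilon) can be computed as
| a * rsqrt(b + epsilon') with _mm256_rsqrt_ps (VRSQRTPS) plus
| one Newton-Raphson refinement; adding further NR iterations,
| as suggested, trades speed for accuracy:
|
| #include <immintrin.h>
|
| // Approximate a / (sqrt(b) + epsilon) as a * rsqrt(b + eps_prime),
| // where eps_prime (e.g. sqrt(epsilon)) only guards against division
| // by zero. VRSQRTPS gives a rough ~12-bit estimate; one Newton-Raphson
| // step r' = r * (1.5 - 0.5 * x * r * r) sharpens it considerably.
| static inline __m256 approx_div_sqrt(__m256 a, __m256 b, float eps_prime) {
|   const __m256 x = _mm256_add_ps(b, _mm256_set1_ps(eps_prime));
|   __m256 r = _mm256_rsqrt_ps(x);                  // rough 1/sqrt(x)
|   const __m256 half_x = _mm256_mul_ps(_mm256_set1_ps(0.5f), x);
|   r = _mm256_mul_ps(
|       r,
|       _mm256_sub_ps(_mm256_set1_ps(1.5f),
|                     _mm256_mul_ps(half_x, _mm256_mul_ps(r, r))));
|   return _mm256_mul_ps(a, r);                     // ~ a / sqrt(b + eps_prime)
| }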
|
| TODO(msmelyan)
|
| explore streaming stores, but we need to have
| unique indices (deduplication)
|
| const float* w_n, // prefetch ptr
| const float* h_n, // prefetch ptr
| float* nw_n, // prefetch ptr
| float* nh_n, // prefetch ptr
| const float* /* w_n */, // prefetch ptr
| const float* /* h_n */, // prefetch ptr
| float* /* nw_n */, // prefetch ptr
| float* /* nh_n */, // prefetch ptr
| version with prefetching
|
| TODO(msmelyan)
|
| The crux of the computation is computing
| a / (sqrt(b) + epsilon), where a and b are
| vectors and epsilon is very small (e.g., 10^-5) and
| does not change. Today it is computed using two
| vector sqrt and vector divide SIMD
| instructions. This is slow. We can take advantage of
| the existing fast vector VRSQRTPS instruction, which
| computes approximate reciprocals of the square roots
| of the vector elements.
|
| It is 6x faster than the vsqrt and vdiv combination.
|
| Since epsilon is added only to avoid division by
| zero, we approximate a / (sqrt(b) + epsilon) by
| a / sqrt(b + sqrt(epsilon)).
|
| With that rewrite, we can use VRSQRTPS directly.
|
| VRSQRTPS is not very accurate. Specifically, in a
| test on random numbers between 0.1 and 1, the
| absolute error was about 10^-3 compared to the
| slower but more accurate combination of vsqrt and
| vdiv.
|
| Extend Marat's function with more Newton-Raphson (NR)
| iterations to get more accuracy for training.
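|
| The accuracy figure quoted above can be checked with a small
| harness along these lines (a sketch of the experiment, not the
| original benchmark): draw random values in [0.1, 1), compute
| a / (sqrt(b) + epsilon) exactly and via VRSQRTPS, and track the
| largest absolute difference.
|
| #include <algorithm>
| #include <cmath>
| #include <cstdio>
| #include <random>
| #include <immintrin.h>
|
| int main() {
|   const float epsilon = 1e-5f;
|   std::mt19937 rng(0);
|   std::uniform_real_distribution<float> dist(0.1f, 1.0f);
|   float max_abs_err = 0.0f;
|   alignas(32) float b[8], approx[8];
|   for (int iter = 0; iter < 1000; ++iter) {
|     for (int k = 0; k < 8; ++k) b[k] = dist(rng);
|     const float a = 1.0f; // numerator held fixed for the comparison
|     __m256 vr = _mm256_rsqrt_ps(
|         _mm256_add_ps(_mm256_load_ps(b), _mm256_set1_ps(std::sqrt(epsilon))));
|     _mm256_store_ps(approx, _mm256_mul_ps(_mm256_set1_ps(a), vr));
|     for (int k = 0; k < 8; ++k) {
|       float exact = a / (std::sqrt(b[k]) + epsilon);
|       max_abs_err = std::max(max_abs_err, std::fabs(exact - approx[k]));
|     }
|   }
|   std::printf("max abs error: %g\n", max_abs_err); // roughly 1e-3 expected
|   return 0;
| }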
|
| TODO(msmelyan)
|
| explore streaming stores, but need to have
| unique indices (deduplication)
|
| const float* w_n, // prefetch ptr
| const float* h_n, // prefetch ptr
| float* nw_n, // prefetch ptr
| float* nh_n, // prefetch ptr
| It seems that Microsoft MSVC does not
| have a _cvtsh_ss implementation, so
| we will add a dummy version of it.
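|
| A plausible dummy, built on the packed F16C conversion, might
| look like the sketch below (requires F16C hardware support;
| this is an assumption about the workaround, not a quote of the
| actual code):
|
| #include <immintrin.h>
|
| #ifdef _MSC_VER
| // MSVC lacks the scalar _cvtsh_ss intrinsic; emulate it with the
| // packed half-to-float conversion.
| static inline float _cvtsh_ss(unsigned short x) {
|   __m128i v = _mm_cvtsi32_si128(x); // half value in the low 16 bits
|   __m128 v_conv = _mm_cvtph_ps(v);  // packed fp16 -> fp32
|   return _mm_cvtss_f32(v_conv);     // extract the low float
| }
| #endif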
|
| Embedding lookup with reduction.
|
| input of size data_size * block_size
| indices of size index_size
| lengths of size output_size
| weights nullptr or array of size index_size
| out of size output_size * block_size
| sum(lengths[i]) == index_size
|
| Behavior is roughly equivalent to pseudocode:
|
| pos = 0
| for (i = 0..output_size-1)
|   for (k = 0..block_size-1)
|     out[i * block_size + k] = 0
|   for (j = 0..lengths[i]-1)
|     for (k = 0..block_size-1)
|       out[i * block_size + k] += input[indices[pos] * block_size + k] *
|         (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
|     pos += 1
|   if (normalize_weights && lengths[i] > 0)
|     for (k = 0..block_size-1)
|       out[i * block_size + k] /= lengths[i]
| Base implementation does runtime dispatch
| for each segment of reduction
|
| @return false if there is an out-of-bound error
|
| bool IS_WEIGHT_POSITIONAL = false>
|
| const float* weights, // optional, can be null for sum reducer
| const float* scale_bias, // optional scale & bias params for uint8 input
|
| Base implementation does runtime dispatch for
| each segment of reduction
|
| @return false if there is an out-of-bound error
|
| bool IS_WEIGHT_POSITIONAL = false>
| const float* weights, // optional, can be null for sum reducer
| const float* scale_bias, // optional scale & bias params for uint8 input
| Embedding lookup with reduction.
|
| input of size data_size * block_size
| indices of size index_size
| offsets of size output_size
| weights nullptr or array of size index_size
| out of size output_size * block_size
|
| Behavior is roughly equivalent to pseudocode:
|
| pos = 0
| for (i = 0..output_size-1)
|   for (k = 0..block_size-1)
|     out[i * block_size + k] = 0
|   start_offset = offsets[i]
|   end_offset = offsets[i+1]
|   length = end_offset - start_offset
|   for (j = start_offset..end_offset-1)
|     for (k = 0..block_size-1)
|       out[i * block_size + k] += input[indices[pos] * block_size + k] *
|         (weights ? weights[IS_WEIGHT_POSITIONAL ? j - start_offset : pos] : 1.0)
|     pos += 1
|   if (normalize_weights && length > 0)
|     for (k = 0..block_size-1)
|       out[i * block_size + k] /= length
|
| TODO: make this API also take “offsets” rather
| than “lengths” to match the API for PyTorch’s
| EmbeddingBag
|
| bool IS_WEIGHT_POSITIONAL = false>
| const float* weights, // optional, can be null for non-weighted sum
| const float* scale_bias, // optional scale & bias params for uint8 input
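|
| For reference, the pseudocode above corresponds to a plain
| scalar routine roughly like the following sketch (semantics
| only, not one of the dispatched kernels; names and the exact
| signature are illustrative):
|
| #include <cstdint>
|
| template <bool IS_WEIGHT_POSITIONAL = false>
| bool EmbeddingLookupIdxRef(
|     std::int64_t block_size,
|     std::int64_t output_size,
|     std::int64_t index_size,
|     std::int64_t data_size,
|     const float* input,          // data_size * block_size
|     const std::int64_t* indices, // index_size
|     const std::int64_t* offsets, // output_size
|     const float* weights,        // optional, nullptr for plain sum
|     bool normalize_weights,
|     float* out) {                // output_size * block_size
|   std::int64_t pos = 0;
|   for (std::int64_t i = 0; i < output_size; ++i) {
|     float* out_row = out + i * block_size;
|     for (std::int64_t k = 0; k < block_size; ++k) out_row[k] = 0.0f;
|     const std::int64_t start_offset = offsets[i];
|     // Last segment ends at index_size (assuming no terminal offset is stored).
|     const std::int64_t end_offset =
|         i == output_size - 1 ? index_size : offsets[i + 1];
|     const std::int64_t length = end_offset - start_offset;
|     for (std::int64_t j = start_offset; j < end_offset; ++j) {
|       const std::int64_t idx = indices[pos];
|       if (idx < 0 || idx >= data_size) return false; // out-of-bound error
|       const float w = weights
|           ? weights[IS_WEIGHT_POSITIONAL ? j - start_offset : pos]
|           : 1.0f;
|       const float* in_row = input + idx * block_size;
|       for (std::int64_t k = 0; k < block_size; ++k) out_row[k] += w * in_row[k];
|       ++pos;
|     }
|     if (normalize_weights && length > 0) {
|       for (std::int64_t k = 0; k < block_size; ++k) out_row[k] /= length;
|     }
|   }
|   return true;
| }
|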
| Row-wise quantization with fp16 scale
| and bias
|
| @param bit_rate can be 2, 4, or 8
|
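| As a loose illustration only (the real kernels' exact rounding,
| packing order, and row layout may differ), a min/max affine
| quantization of one row with an fp16 scale and bias could look
| like this sketch:
|
| #include <algorithm>
| #include <cmath>
| #include <cstdint>
| #include <vector>
| #include <immintrin.h> // _cvtss_sh (F16C) for the fp16 scale/bias
|
| // Quantize one float row to bit_rate-bit values (bit_rate = 2, 4, or 8),
| // packing several values per byte, and append the fp16 scale and bias.
| std::vector<std::uint8_t> QuantizeRowSketch(const float* row, int n, int bit_rate) {
|   const int num_levels = (1 << bit_rate) - 1;
|   const int elems_per_byte = 8 / bit_rate;
|   float lo = row[0], hi = row[0];
|   for (int i = 1; i < n; ++i) { lo = std::min(lo, row[i]); hi = std::max(hi, row[i]); }
|   const float scale = hi == lo ? 1.0f : (hi - lo) / num_levels;
|   const float bias = lo;
|   std::vector<std::uint8_t> out((n + elems_per_byte - 1) / elems_per_byte + 4, 0);
|   for (int i = 0; i < n; ++i) {
|     int q = static_cast<int>(std::lround((row[i] - bias) / scale));
|     q = std::max(0, std::min(num_levels, q));
|     out[i / elems_per_byte] |=
|         static_cast<std::uint8_t>(q) << ((i % elems_per_byte) * bit_rate);
|   }
|   // fp16 scale and bias appended after the packed payload (little-endian).
|   const std::uint16_t s = _cvtss_sh(scale, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
|   const std::uint16_t b = _cvtss_sh(bias, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
|   std::uint8_t* tail = out.data() + out.size() - 4;
|   tail[0] = s & 0xFF; tail[1] = s >> 8;
|   tail[2] = b & 0xFF; tail[3] = b >> 8;
|   return out;
| }
|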
| Embedding lookup with reduction.
|
| input of size data_size * (block_size + 8B)
| indices of size index_size
| lengths of size output_size
| weights nullptr or array of size index_size
| out of size output_size * block_size
| sum(lengths[i]) == index_size
|
| Note that block_size should be the number of
| quantized values per row in the data,
| i.e. excluding the scale and bias. The total
| (fused) block size is assumed to be this
| block_size, plus 4 bytes for scale and 4 bytes
| for bias.
|
| Behavior is roughly equivalent to pseudocode:
|
| pos = 0
| fused_block_size = block_size + 8B // quantized values, scale, and bias
| for (i = 0..output_size-1)
|   for (k = 0..block_size-1)
|     out[i * block_size + k] = 0
|   for (j = 0..lengths[i]-1)
|     for (k = 0..block_size-1)
|       out[i * block_size + k] += input[indices[pos] * fused_block_size + k] *
|         (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
|     pos += 1
|   if (normalize_weights && lengths[i] > 0)
|     for (k = 0..block_size-1)
|       out[i * block_size + k] /= lengths[i]
|
|
| bool IS_WEIGHT_POSITIONAL = false>
| const float* weights, // optional, can be null for non-weighted sum
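|
| To make the fused layout concrete, here is a hedged sketch of
| decoding one fused row: block_size uint8 quantized values
| followed by a float scale and a float bias (placing scale/bias
| after the payload is an assumption consistent with the
| pseudocode above, which indexes the quantized values from
| offset 0):
|
| #include <cstdint>
| #include <cstring>
|
| // Dequantize one fused 8-bit row-wise quantized row into `out`.
| // Layout assumed: [block_size x uint8][float scale][float bias],
| // i.e. fused_block_size = block_size + 8 bytes.
| inline void DequantizeFusedRowSketch(
|     const std::uint8_t* row, std::int64_t block_size, float* out) {
|   float scale, bias;
|   std::memcpy(&scale, row + block_size, sizeof(float));
|   std::memcpy(&bias, row + block_size + sizeof(float), sizeof(float));
|   for (std::int64_t k = 0; k < block_size; ++k) {
|     out[k] = row[k] * scale + bias; // affine dequantization
|   }
| }
|
| The reduction then accumulates (weight * dequantized value) into
| the output row exactly as in the pseudocode.
|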
| Base implementation does runtime dispatch for
| each segment of reduction
|
| @return false if there is an out-of-bound error
|
| bool IS_WEIGHT_POSITIONAL = false>
| const float* weights, // optional, can be null for sum reducer
| Base implementation does runtime dispatch for
| each segment of reduction
|
| @return false if there is an out-of-bound error
|
| const float* weights, // optional, can be null for sum reducer
| Embedding lookup with reduction.
|
| input of size data_size * (block_size + 8B)
| indices of size index_size
| offsets of size output_size
| weights nullptr or array of size index_size
| out of size output_size * block_size
|
| Note that block_size should be the number of
| quantized values per row in the data,
| i.e. excluding the scale and bias. The total
| (fused) block size is assumed to be this
| block_size, plus 4 bytes for scale and 4 bytes
| for bias.
|
| Behavior is roughly equivalent to pseudocode:
|
| pos = 0
| fused_block_size = block_size + 8B // quantized values, scale, and bias
| for (i = 0..output_size-1)
|   for (k = 0..block_size-1)
|     out[i * block_size + k] = 0
|   start_offset = offsets[i]
|   end_offset = i == output_size-1 ? index_size : offsets[i+1]
|   length = end_offset - start_offset
|   for (j = start_offset..end_offset-1)
|     for (k = 0..block_size-1)
|       out[i * block_size + k] += input[indices[pos] * fused_block_size + k] *
|         (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
|     pos += 1
|   if (normalize_weights && length > 0)
|     for (k = 0..block_size-1)
|       out[i * block_size + k] /= length
|
|
| bool IS_WEIGHT_POSITIONAL = false>
| const float* weights, // optional, can be null for non-weighted sum
| Explicit initialize for float
| Explicit initialize for float and AVX2
| vectorization
|
| Similar to Axpy, which calculates y = a * x + y, but
| allows x and y to be of different data types.
| It also provides a performance optimization hint (use_a)
| indicating whether a is going to be 1 or not.
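|
| A minimal sketch of that idea follows; the template parameters,
| the name, and the exact meaning of use_a (here: "a may differ
| from 1, so the multiply cannot be skipped") are assumptions,
| not the library's actual signature:
|
| #include <cstdint>
|
| // y = a * x + y with possibly different element types for x and y.
| template <typename TX, typename TY>
| void AxpySketch(std::int64_t n, float a, bool use_a, const TX* x, TY* y) {
|   if (!use_a) {
|     // Fast path: a is known to be 1, so skip the multiplication.
|     for (std::int64_t i = 0; i < n; ++i) {
|       y[i] = static_cast<TY>(static_cast<float>(y[i]) + static_cast<float>(x[i]));
|     }
|   } else {
|     for (std::int64_t i = 0; i < n; ++i) {
|       y[i] = static_cast<TY>(static_cast<float>(y[i]) + a * static_cast<float>(x[i]));
|     }
|   }
| }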