Crate caffe2_perfkernels


Macros§

avx2_do
avx2_fma_do
avx512_do
avx_do
avx_f16c_do
base_do
Common utilities for writing performance kernels and easy dispatching of different backends.
embedding_idx_specialization
Proxy back to generic implementation
embedding_specialization
Proxy back to generic implementation
fused_8bit_rowwise_embedding_idx_specialization
Proxy back to generic implementation

Constants§

QEPSILON

Functions§

adagrad_fp16_update_prefetch
Version with prefetching for embeddings and momentum using fp16.

const Half* w_n, // prefetch ptr
const Half* h_n, // prefetch ptr
Half* nw_n,      // prefetch ptr
Half* nh_n,      // prefetch ptr
adagrad_fp16_update_prefetch__avx2_fma
Compute sparse Adagrad, assuming the embedding and momentum are at::Half; {w_n, h_n, nw_n, nh_n} are prefetch pointers.
adagrad_fp16_update_prefetch_base
const Half* /* w_n */, // prefetch ptr
const Half* /* h_n */, // prefetch ptr
Half* /* nw_n */,      // prefetch ptr
Half* /* nh_n */,      // prefetch ptr
adagrad_update
Version without prefetching.
adagrad_update__avx2_fma
Version without prefetching.
adagrad_update_base
adagrad_update_base_inlined
The following functions, inside the internal namespace, are inlined because they are performance critical.
adagrad_update_no_prefetch
Version without prefetching.
adagrad_update_prefetch
Version with prefetching.

TODO(msmelyan)

The crux of the computation is computing a / (sqrt(b) + epsilon), where a and b are vectors and epsilon is very small (e.g., 10^-5) and does not change.

Today it is computed using two vector sqrt and vector divide SIMD instructions. That is slow. We can take advantage of the existing fast vector VRSQRTPS instruction, which computes approximate reciprocals of square roots of the vector; it is 6x faster than the vsqrt and vdiv combination.

Since the addition of epsilon is only done to avoid division by zero, we approximate a / (sqrt(b) + epsilon) by a / (sqrt(b + sqrt(epsilon))). With that change, VRSQRTPS can be used instead.

VRSQRTPS is not very accurate: for a test on random numbers between 0.1 and 1, the absolute error was about 10^-3 compared to the slower but more accurate combination of vsqrt and vdiv.

Extend Marat's function with more NR iterations to get more accuracy for training.

TODO(msmelyan): explore streaming stores, but this requires unique indices (deduplication).

const float* w_n, // prefetch ptr
const float* h_n, // prefetch ptr
float* nw_n,      // prefetch ptr
float* nh_n,      // prefetch ptr
adagrad_update_prefetch__avx2_fma
adagrad_update_prefetch_base
const float* /* w_n */, // prefetch ptr
const float* /* h_n */, // prefetch ptr
float* /* nw_n */,      // prefetch ptr
float* /* nh_n */,      // prefetch ptr
adagrad_update_prefetch_inlined
Version with prefetching.

TODO(msmelyan)

The crux of the computation is computing a / (sqrt(b) + epsilon), where a and b are vectors and epsilon is very small (e.g., 10^-5) and does not change. Today it is computed using two vector sqrt and vector divide SIMD instructions. That is slow. We can take advantage of the existing fast vector VRSQRTPS instruction, which computes approximate reciprocals of square roots of the vector; it is 6x faster than the vsqrt and vdiv combination.

Since the addition of epsilon is only done to avoid division by zero, we approximate a / (sqrt(b) + epsilon) by a / (sqrt(b + sqrt(epsilon))). With that change, VRSQRTPS can be used instead.

VRSQRTPS is not very accurate: for a test on random numbers between 0.1 and 1, the absolute error was about 10^-3 compared to the slower but more accurate combination of vsqrt and vdiv.

Extend Marat's function with more NR iterations to get more accuracy for training.

TODO(msmelyan): explore streaming stores, but this requires unique indices (deduplication).

const float* w_n, // prefetch ptr
const float* h_n, // prefetch ptr
float* nw_n,      // prefetch ptr
float* nh_n,      // prefetch ptr
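For reference, here is a minimal scalar sketch (not the crate's SIMD code) of the exact Adagrad step and of the rsqrt-friendly approximation described above; the function names and scalar form are illustrative only.

    // Exact step: a / (sqrt(b) + epsilon), computed with sqrt and divide.
    fn adagrad_step_exact(w: f32, h: f32, g: f32, lr: f32, epsilon: f32) -> (f32, f32) {
        let nh = h + g * g;                          // accumulate squared gradient
        let nw = w + lr * g / (nh.sqrt() + epsilon); // new weight
        (nw, nh)
    }

    // Approximate step: fold epsilon into the radicand, mirroring the documented
    // approximation, so that a single reciprocal square root (VRSQRTPS plus a
    // Newton-Raphson refinement in the vectorized kernels) can replace sqrt + divide.
    fn adagrad_step_approx(w: f32, h: f32, g: f32, lr: f32, epsilon: f32) -> (f32, f32) {
        let nh = h + g * g;
        let r = 1.0 / (nh + epsilon.sqrt()).sqrt(); // stand-in for the rsqrt path
        let nw = w + lr * g * r;
        (nw, nh)
    }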
cvtsh_ss
It seems that Microsoft MSVC does not have a _cvtsh_ss implementation, so we add a dummy version of it.
cvtss_sh
decompress_and_dequantize
decompress_and_dequantize__avx2
decompress_and_dequantize__base
embedding_lookup
Embedding lookup with reduction.

input of size data_size * block_size
indices of size index_size
lengths of size output_size
weights nullptr or array of size index_size
out of size output_size * block_size
sum(lengths[i]) == index_size

Behavior is roughly equivalent to pseudocode:

pos = 0
for (i = 0..output_size-1)
  for (k = 0..block_size-1)
    out[i*block_size + k] = 0
  for (j = 0..lengths[i]-1)
    for (k = 0..block_size-1)
      out[i*block_size + k] += input[indices[pos]*block_size + k] *
        (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
    pos += 1
  if (normalize_weights && lengths[i] > 0)
    for (k = 0..block_size-1)
      out[i*block_size + k] /= lengths[i]
embedding_lookup_generic_slow
Base implementation does runtime dispatch for each segment of reduction.

@return false if there is an out-of-bound error

bool IS_WEIGHT_POSITIONAL = false>
const float* weights,    // optional, can be null for sum reducer
const float* scale_bias, // optional scale & bias params for uint8 input
embedding_lookup_generic_slow_idx
Base implementation does runtime dispatch for each segment of reduction.

@return false if there is an out-of-bound error

bool IS_WEIGHT_POSITIONAL = false>
const float* weights,    // optional, can be null for sum reducer
const float* scale_bias, // optional scale & bias params for uint8 input
embedding_lookup_idx
Embedding lookup with reduction.

input of size data_size * block_size
indices of size index_size
offsets of size output_size
weights nullptr or array of size index_size
out of size output_size * block_size

Behavior is roughly equivalent to pseudocode:

pos = 0
for (i = 0..output_size-1)
  for (k = 0..block_size-1)
    out[i*block_size + k] = 0
  start_offset = offsets[i]
  end_offset = offsets[i+1]
  length = end_offset - start_offset
  for (j = start_offset..end_offset-1)
    for (k = 0..block_size-1)
      out[i*block_size + k] += input[indices[pos]*block_size + k] *
        (weights ? weights[IS_WEIGHT_POSITIONAL ? j - start_offset : pos] : 1.0)
    pos += 1
  if (normalize_weights && length > 0)
    for (k = 0..block_size-1)
      out[i*block_size + k] /= length

TODO: make this API also take "offsets" rather than "lengths" to match the API for PyTorch's EmbeddingBag

bool IS_WEIGHT_POSITIONAL = false>
const float* weights,    // optional, can be null for non-weighted sum
const float* scale_bias, // optional scale & bias params for uint8 input
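As a rough reference for the semantics above, here is an unvectorized Rust sketch; the slice-based signature, the assumption that offsets carries a trailing index_size sentinel, and the omission of IS_WEIGHT_POSITIONAL handling are simplifications, not the crate's actual API.

    // Offsets-based embedding lookup with optional per-index weights and
    // optional mean normalization.
    fn embedding_lookup_idx_ref(
        block_size: usize,
        input: &[f32],     // data_size * block_size
        indices: &[usize], // index_size
        offsets: &[usize], // output_size + 1, last entry == index_size (assumed)
        weights: Option<&[f32]>,
        normalize_by_lengths: bool,
        out: &mut [f32],   // output_size * block_size
    ) {
        let output_size = offsets.len() - 1;
        for i in 0..output_size {
            let (start, end) = (offsets[i], offsets[i + 1]);
            let row_out = &mut out[i * block_size..(i + 1) * block_size];
            row_out.iter_mut().for_each(|v| *v = 0.0);
            for pos in start..end {
                let w = weights.map_or(1.0, |ws| ws[pos]);
                let row_in = &input[indices[pos] * block_size..(indices[pos] + 1) * block_size];
                for k in 0..block_size {
                    row_out[k] += w * row_in[k];
                }
            }
            if normalize_by_lengths && end > start {
                let len = (end - start) as f32;
                row_out.iter_mut().for_each(|v| *v /= len);
            }
        }
    }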
embedding_lookup_idx_int32_t_float_float_avx2_fma
embedding_lookup_idx_int32_t_float_float_false_avx2_fma
embedding_lookup_idx_int32_t_float_float_true_avx2_fma
embedding_lookup_idx_int32_t_half_float_avx2_fma
embedding_lookup_idx_int32_t_half_float_false_avx2_fma
embedding_lookup_idx_int32_t_half_float_true_avx2_fma
embedding_lookup_idx_int32_t_uint8_t_float_avx2_fma
embedding_lookup_idx_int32_t_uint8_t_float_false_avx2_fma
embedding_lookup_idx_int32_t_uint8_t_float_true_avx2_fma
embedding_lookup_idx_int64_t_float_float_avx2_fma
embedding_lookup_idx_int64_t_float_float_false_avx2_fma
embedding_lookup_idx_int64_t_float_float_true_avx2_fma
embedding_lookup_idx_int64_t_half_float_avx2_fma
embedding_lookup_idx_int64_t_half_float_false_avx2_fma
embedding_lookup_idx_int64_t_half_float_true_avx2_fma
embedding_lookup_idx_int64_t_uint8_t_float_avx2_fma
embedding_lookup_idx_int64_t_uint8_t_float_false_avx2_fma
embedding_lookup_idx_int64_t_uint8_t_float_true_avx2_fma
embedding_lookup_int32_t_float_float_avx2_fma
embedding_lookup_int32_t_float_float_false_avx2_fma
embedding_lookup_int32_t_float_float_true_avx2_fma
embedding_lookup_int32_t_half_float_avx2_fma
embedding_lookup_int32_t_half_float_false_avx2_fma
embedding_lookup_int32_t_half_float_true_avx2_fma
embedding_lookup_int32_t_uint8_t_float_avx2_fma
embedding_lookup_int32_t_uint8_t_float_false_avx2_fma
embedding_lookup_int32_t_uint8_t_float_true_avx2_fma
embedding_lookup_int64_t_float_float_avx2_fma
embedding_lookup_int64_t_float_float_false_avx2_fma
embedding_lookup_int64_t_float_float_true_avx2_fma
embedding_lookup_int64_t_half_float_avx2_fma
embedding_lookup_int64_t_half_float_false_avx2_fma
embedding_lookup_int64_t_half_float_true_avx2_fma
embedding_lookup_int64_t_uint8_t_float_avx2_fma
embedding_lookup_int64_t_uint8_t_float_false_avx2_fma
embedding_lookup_int64_t_uint8_t_float_true_avx2_fma
float_to_fused_8bit_rowwise_quantized
float_to_fused_8bit_rowwise_quantized_base
float_to_fusedn_bit_rowwise_quantized_sb_half
Row-wise quantization with fp16 scale and bias.

@param bit_rate can be 2, 4, or 8
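For intuition, a hypothetical sketch of per-row affine quantization for the bit_rate = 8 case follows; in the fused "sb_half" format the scale and bias are stored as fp16 at the end of each row, while this sketch keeps them as f32 for brevity.

    // Quantize one row to u8 with a per-row scale and bias (bias = row minimum).
    fn quantize_row_8bit(row: &[f32]) -> (Vec<u8>, f32, f32) {
        let min = row.iter().cloned().fold(f32::INFINITY, f32::min);
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let range = max - min;
        let scale = if range > 0.0 { range / 255.0 } else { 1.0 }; // guard constant rows
        let q: Vec<u8> = row
            .iter()
            .map(|&x| ((x - min) / scale).round().clamp(0.0, 255.0) as u8)
            .collect();
        (q, scale, min)
    }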
float_to_fusedn_bit_rowwise_quantized_sb_half_base
fused_8bit_rowwise_embedding_lookup
Embedding lookup with reduction.

input of size data_size * (block_size + 8B)
indices of size index_size
lengths of size output_size
weights nullptr or array of size index_size
out of size output_size * block_size
sum(lengths[i]) == index_size

Note that block_size should be the number of quantized values per row in the data, i.e. excluding the scale and bias. The total (fused) block size is assumed to be this block_size, plus 4 bytes for scale and 4 bytes for bias.

Behavior is roughly equivalent to pseudocode:

pos = 0
fused_block_size = block_size + 8B // quantized values and scale and bias
for (i = 0..output_size-1)
  for (k = 0..block_size-1)
    out[i*block_size + k] = 0
  for (j = 0..lengths[i]-1)
    for (k = 0..block_size-1)
      out[i*block_size + k] += input[indices[pos]*(fused_block_size) + k] *
        (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
    pos += 1
  if (normalize_weights && lengths[i] > 0)
    for (k = 0..block_size-1)
      out[i*block_size + k] /= lengths[i]

bool IS_WEIGHT_POSITIONAL = false>
const float* weights, // optional, can be null for non-weighted sum
fused_8bit_rowwise_embedding_lookup_generic_slow
Base implementation does runtime dispatch for each segment of reduction.

@return false if there is an out-of-bound error

bool IS_WEIGHT_POSITIONAL = false>
const float* weights, // optional, can be null for sum reducer
fused_8bit_rowwise_embedding_lookup_generic_slow_idx
Base implementation does runtime dispatch for each segment of reduction.

@return false if there is an out-of-bound error

const float* weights, // optional, can be null for sum reducer
fused_8bit_rowwise_embedding_lookup_idx
Embedding lookup with reduction.

input of size data_size * (block_size + 8B)
indices of size index_size
offsets of size output_size
weights nullptr or array of size index_size
out of size output_size * block_size

Note that block_size should be the number of quantized values per row in the data, i.e. excluding the scale and bias. The total (fused) block size is assumed to be this block_size, plus 4 bytes for scale and 4 bytes for bias.

Behavior is roughly equivalent to pseudocode:

pos = 0
fused_block_size = block_size + 8B // quantized values and scale and bias
for (i = 0..output_size-1)
  for (k = 0..block_size-1)
    out[i*block_size + k] = 0
  start_offset = offsets[i]
  end_offset = i == output_size-1 ? index_size : offsets[i+1] - 1
  length = end_offset - start_offset
  for (j = start_offset..end_offset)
    for (k = 0..block_size-1)
      out[i*block_size + k] += input[indices[pos]*(fused_block_size) + k] *
        (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
    pos += 1
  if (normalize_weights && length > 0)
    for (k = 0..block_size-1)
      out[i*block_size + k] /= length

bool IS_WEIGHT_POSITIONAL = false>
const float* weights, // optional, can be null for non-weighted sum
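An unvectorized sketch of the fused 8-bit variant's inner loop is shown below; it assumes each fused row holds block_size u8 values followed by a little-endian f32 scale and f32 bias (the 8 extra bytes noted above), and uses a lengths-based, slice-based signature for illustration only.

    // Dequantize on the fly (q * scale + bias) while accumulating each segment.
    fn fused_8bit_rowwise_lookup_ref(
        block_size: usize,
        input: &[u8],      // data_size * (block_size + 8)
        indices: &[usize], // index_size
        lengths: &[usize], // output_size
        out: &mut [f32],   // output_size * block_size
    ) {
        let fused = block_size + 8;
        let mut pos = 0;
        for (i, &len) in lengths.iter().enumerate() {
            let row_out = &mut out[i * block_size..(i + 1) * block_size];
            row_out.iter_mut().for_each(|v| *v = 0.0);
            for _ in 0..len {
                let base = indices[pos] * fused;
                let sb = &input[base + block_size..base + fused];
                let scale = f32::from_le_bytes([sb[0], sb[1], sb[2], sb[3]]);
                let bias = f32::from_le_bytes([sb[4], sb[5], sb[6], sb[7]]);
                for k in 0..block_size {
                    row_out[k] += input[base + k] as f32 * scale + bias;
                }
                pos += 1;
            }
        }
    }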
fused_8bit_rowwise_embedding_lookup_idx_int32_t_float_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int32_t_float_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int32_t_float_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int32_t_half_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int32_t_half_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int32_t_half_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int32_t_uint8_t_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int32_t_uint8_t_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int32_t_uint8_t_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int64_t_float_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int64_t_float_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int64_t_float_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int64_t_half_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int64_t_half_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int64_t_half_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int64_t_uint8_t_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int64_t_uint8_t_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_idx_int64_t_uint8_t_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_int32_t_float_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_int32_t_float_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_int32_t_float_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_int32_t_half_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_int32_t_half_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_int32_t_half_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_int32_t_uint8_t_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_int32_t_uint8_t_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_int32_t_uint8_t_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_int64_t_float_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_int64_t_float_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_int64_t_float_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_int64_t_half_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_int64_t_half_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_int64_t_half_float_true_avx2_fma
fused_8bit_rowwise_embedding_lookup_int64_t_uint8_t_float_avx2_fma
fused_8bit_rowwise_embedding_lookup_int64_t_uint8_t_float_false_avx2_fma
fused_8bit_rowwise_embedding_lookup_int64_t_uint8_t_float_true_avx2_fma
fused_8bit_rowwise_quantized_to_float
fused_8bit_rowwise_quantized_to_float_base
fusedn_bit_rowwise_quantized_sb_half_to_float
fusedn_bit_rowwise_quantized_sb_half_to_float_base
lstm_unit_cpu
Define templated implementation for LSTM kernels on CPU.
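For orientation, a minimal per-element sketch of the standard LSTM cell update follows, assuming the usual gate layout (input, forget, output, candidate) and a forget-gate bias; the crate's kernels operate over whole batches and hidden dimensions and have AVX2/FMA variants.

    fn sigmoid(x: f32) -> f32 {
        1.0 / (1.0 + (-x).exp())
    }

    // One element of an LSTM step: returns (new cell state, new hidden state).
    fn lstm_unit_element(
        c_prev: f32,
        x_i: f32, x_f: f32, x_o: f32, x_g: f32, // pre-activation gate inputs
        forget_bias: f32,
    ) -> (f32, f32) {
        let i = sigmoid(x_i);
        let f = sigmoid(x_f + forget_bias);
        let o = sigmoid(x_o);
        let g = x_g.tanh();
        let c = f * c_prev + i * g; // cell state
        let h = o * c.tanh();       // hidden state
        (c, h)
    }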
lstm_unit_cpuf32
Explicit instantiation for float.
lstm_unit_gradient_cpu
lstm_unit_gradient_cpuf32
lstm_unit_gradient_impl_avx2_fma
lstm_unit_gradient_impl_avx2_fmaf32
lstm_unit_gradient_implf32
lstm_unit_impl_avx2_fma
Define templated implementation for LSTM kernels on CPU supporting AVX2.
lstm_unit_impl_avx2_fmaf32
Explicit instantiation for float.
lstm_unit_implf32
Explicit instantiation for float with AVX2 vectorization.
quantize_and_compress
quantize_and_compress__avx2
quantize_and_compress__base
typed_axpy
Similar to axpy, which calculates y = a * x + y, but allows x and y to be of different data types. It also provides a performance-optimization hint (use_a) indicating whether a is going to be 1 or not.
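A generic reference sketch of these semantics is given below; the trait-based signature is illustrative, whereas the crate's variants (f16->f32, u8->f32, f32->f32) are concrete, pointer-based, and dispatched to AVX/F16C/AVX2 implementations.

    // y = a * x + y with x and y of different element types.
    fn typed_axpy_ref<X>(a: f32, x: &[X], y: &mut [f32])
    where
        X: Copy + Into<f32>,
    {
        // The `use_a` hint would let a kernel skip the multiply when a == 1.0;
        // this reference version always multiplies.
        for (yi, &xi) in y.iter_mut().zip(x.iter()) {
            let xf: f32 = xi.into();
            *yi += a * xf;
        }
    }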
typed_axpy__avx2_fma
typed_axpy__avx_f_16c
typed_axpy__base
typed_axpy_halffloat__avx2_fma
typed_axpy_halffloat__avx_f_16c
typed_axpy_halffloat__base
typed_axpy_uint8_float__avx2_fma
typed_axpy_uint8_float__base
typed_axpyf16f32
typed_axpyf32f32
typed_axpyu8f32