Structs

  • Applies 8-bit row-wise quantization by determining the range (maximum -
    minimum) and offset (minimum value) of each row in the input matrix, and
    then scaling each element to an 8-bit number between 0 and 255. To later
    de-quantize values, the scale (range / 255) and offset (bias) are stored
    alongside the data.

    More precisely, each row contains an 8-bit integer for each quantized
    element, and the last 8 bytes of each row in the output matrix are a
    float storing the scale followed by another float containing the bias.

    For an N-dimensional input tensor, the first N-1 dimensions are
    interpreted as rows and the last dimension is interpreted as columns.
    For example, an input tensor with dimensions 5x2x4 is interpreted as 10
    rows and 4 columns. A compact sketch of this layout appears after this
    list.
  • Fake 2/4-bit quantization.

    Creates a 2/4-bit row-wise quantized blob with scales and biases in
    fp16. The storage format is 8-bit row-wise with scales and biases in
    fp32 (see the second sketch below for one reading of this scheme).
  • Applies row-wise stochastic/random quantization by determining the range
    of each row in the input matrix, and then quantizing each element to one
    of the two closest discrete levels by drawing from a Bernoulli
    distribution. The method is extended from TernGrad [1], which randomly
    quantizes gradients to three levels to reduce communication in
    distributed training.

    The format of each row (x) in the output matrix is
    [bitwidth][tail][min][max][data]:

    bitwidth[1 Byte]: bitwidth per data [1, 2, 4 or 8];

    tail[1 Byte]: the number of unused buckets [1-8] (one byte is split into
    8/bitwidth buckets, and each bucket stores one low-precision value in
    bitwidth bits);

    min[4 Bytes]: the minimum floating-point value min(x);

    max[4 Bytes]: the maximum floating-point value max(x);

    data: the quantized data.

    The quantization is uniform with levels
    q = min + (max - min)/(2^bitwidth - 1) * [0 : 1 : 2^bitwidth - 1].

    During stochastic/random quantization x' = Quantize(x), for
    q_j < x_i <= q_{j+1}, we draw the quantized value x'_i from a Bernoulli
    distribution with

    P(x'_i = q_{j+1}) = (x_i - q_j) / (q_{j+1} - q_j), and

    P(x'_i = q_j) = (q_{j+1} - x_i) / (q_{j+1} - q_j),

    where x'_i is the quantized value of x_i. [1] proved E{x'_i} = x_i,
    i.e. the quantization is an unbiased approximation. More details are in
    the paper.

    For example, suppose the targeted bitwidth = 2 and
    x = [0.3, -1.4, -0.6, 0.9, 1.0]; then tail = 3, min = -1.4, max = 1.0
    and q = [-1.4, -0.6, 0.2, 1.0]. x_1 = 0.3 will be quantized to
    x'_1 = 0.2 with probability 7/8 and to x'_1 = 1.0 with probability 1/8.

    The storage format of the quantized data is:
    [x'_1|x'_3|x'_5|xxx]-[x'_2|x'_4|xxx|xxx]. In general, an input row is
    split into multiple segments. One segment is a continuous subarray of
    the row, and its length is the number of bytes storing quantized data in
    the output matrix. The b-th bucket of the i-th byte stores the i-th data
    of the b-th segment of the input row. See the third sketch below for a
    compact implementation of this segment layout.

    [1] Wen, Wei, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen,
    and Hai Li. "Terngrad: Ternary gradients to reduce communication in
    distributed deep learning." In Advances in Neural Information Processing
    Systems, pp. 1508-1518. 2017.
  • De-quantizes the result of the FloatToFusedRandRowwiseQuantized
    operator. Refer to the FloatToFusedRandRowwiseQuantized operator for
    details; a matching de-quantization sketch appears below.
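
The fused 8-bit layout described in the first entry is compact enough to
sketch directly. Below is a minimal illustration in Rust, assuming
little-endian byte order and f32 input; `quantize_row_8bit` and
`dequantize_8bit` are illustrative names, not this crate's API.

    // Minimal sketch of the fused 8-bit row-wise layout (illustrative, not
    // the operator's actual implementation). Assumes little-endian storage.
    fn quantize_row_8bit(row: &[f32]) -> Vec<u8> {
        let min = row.iter().cloned().fold(f32::INFINITY, f32::min);
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let bias = min;                  // offset stored for de-quantization
        let scale = (max - min) / 255.0; // maps the range onto [0, 255]
        let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };

        let mut out = Vec::with_capacity(row.len() + 8);
        for &x in row {
            // Scale each element to an 8-bit value in [0, 255].
            out.push(((x - bias) * inv).round().clamp(0.0, 255.0) as u8);
        }
        // Last 8 bytes of the row: the scale followed by the bias.
        out.extend_from_slice(&scale.to_le_bytes());
        out.extend_from_slice(&bias.to_le_bytes());
        out
    }

    // De-quantization recovers an approximation of the original value.
    fn dequantize_8bit(q: u8, scale: f32, bias: f32) -> f32 {
        q as f32 * scale + bias
    }

Under this layout, the 5x2x4 example above becomes 10 rows of 4 elements,
each stored in 4 + 8 = 12 bytes.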
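Under one plausible reading of the "fake" 2/4-bit scheme in the second
entry, values are snapped to one of 2^bitwidth levels but kept in the
ordinary 8-bit row-wise layout; the fp16 rounding of scale and bias is
omitted here for brevity. `fake_quantize_row` is a hypothetical helper, not
this crate's API.

    // One plausible reading of the "fake" 2/4-bit scheme: values are snapped
    // to one of 2^bitwidth levels, but the result keeps the plain 8-bit
    // row-wise layout. The fp16 rounding of scale/bias is omitted here.
    fn fake_quantize_row(row: &[f32], bitwidth: u32) -> Vec<u8> {
        let levels = (1u32 << bitwidth) - 1; // number of gaps between levels
        let min = row.iter().cloned().fold(f32::INFINITY, f32::min);
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let scale = (max - min) / levels as f32;
        let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };

        let mut out: Vec<u8> = row
            .iter()
            // Snap to a 2^bitwidth-level code, but keep one byte per element.
            .map(|&x| ((x - min) * inv).round().clamp(0.0, levels as f32) as u8)
            .collect();
        // Scale and bias are appended in fp32, as in the 8-bit storage format.
        out.extend_from_slice(&scale.to_le_bytes());
        out.extend_from_slice(&min.to_le_bytes());
        out
    }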
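The third entry's [bitwidth][tail][min][max][data] layout and Bernoulli
rounding can be sketched as follows, assuming the external `rand` crate
(0.8) for randomness; `rand_quantize_row` is an illustrative name, and the
bit order within each byte is this sketch's own choice, not necessarily the
operator's.

    // Sketch of stochastic row-wise quantization into the
    // [bitwidth][tail][min][max][data] layout described above.
    use rand::Rng;

    fn rand_quantize_row(row: &[f32], bitwidth: u8) -> Vec<u8> {
        assert!(matches!(bitwidth, 1 | 2 | 4 | 8) && !row.is_empty());
        let per_byte = (8 / bitwidth) as usize;               // buckets per byte
        let data_bytes = (row.len() + per_byte - 1) / per_byte;
        let tail = (data_bytes * per_byte - row.len()) as u8; // unused buckets

        let min = row.iter().cloned().fold(f32::INFINITY, f32::min);
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let levels = (1u32 << bitwidth) - 1;                  // 2^bitwidth - 1
        let gap = (max - min) / levels as f32;

        let mut rng = rand::thread_rng();
        let mut out = vec![0u8; 2 + 4 + 4 + data_bytes];
        out[0] = bitwidth;
        out[1] = tail;
        out[2..6].copy_from_slice(&min.to_le_bytes());
        out[6..10].copy_from_slice(&max.to_le_bytes());

        for (k, &x) in row.iter().enumerate() {
            // Position of x between its two closest levels q_j and q_{j+1}.
            let t = if gap > 0.0 { (x - min) / gap } else { 0.0 };
            let j = t.floor().min((levels - 1) as f32);
            // Round up with probability (x_i - q_j) / (q_{j+1} - q_j).
            let up = rng.gen::<f32>() < t - j;
            let code = (j as u8) + up as u8;
            // Element k is the (k % data_bytes)-th value of segment
            // k / data_bytes, so it lands in that bucket of that byte.
            let (byte, bucket) = (k % data_bytes, k / data_bytes);
            out[10 + byte] |= code << (bucket as u8 * bitwidth);
        }
        out
    }

For the example above (x = [0.3, -1.4, -0.6, 0.9, 1.0] at bitwidth = 2)
this yields a 12-byte row: one byte each for bitwidth and tail, 4 bytes
each for min and max, and two data bytes packed as
[x'_1|x'_3|x'_5|xxx]-[x'_2|x'_4|xxx|xxx].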
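And a matching de-quantization sketch inverting the layout produced by
`rand_quantize_row` above, corresponding to the fourth entry;
`rand_dequantize_row` is a hypothetical helper, not this crate's API.

    // Inverts the [bitwidth][tail][min][max][data] row layout sketched above.
    fn rand_dequantize_row(row: &[u8]) -> Vec<f32> {
        let (bitwidth, tail) = (row[0], row[1] as usize);
        let min = f32::from_le_bytes(row[2..6].try_into().unwrap());
        let max = f32::from_le_bytes(row[6..10].try_into().unwrap());
        let data = &row[10..];

        let per_byte = (8 / bitwidth) as usize;
        let n = data.len() * per_byte - tail; // original element count
        let levels = (1u32 << bitwidth) - 1;
        let gap = (max - min) / levels as f32;
        let mask = levels as u8;              // low `bitwidth` bits

        (0..n)
            .map(|k| {
                // Element k lives in bucket k / data.len() of byte
                // k % data.len(), mirroring the quantizer's segment layout.
                let (byte, bucket) = (k % data.len(), k / data.len());
                let code = (data[byte] >> (bucket as u8 * bitwidth)) & mask;
                min + code as f32 * gap       // q = min + code * gap
            })
            .collect()
    }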

Functions

Type Definitions