Crate caffe2_sgd

Structs

  • Computes the AdaDelta update (https://arxiv.org/abs/1212.5701) for an input gradient and an accumulated history of squared gradients. Concretely, given inputs (param, moment, moment_delta, grad, learning_rate), computes:
    new_moment = moment * decay + square(grad) * (1 - decay)
    new_grad = sqrt(moment_delta + epsilon) / sqrt(new_moment + epsilon) * grad
    new_param = param + learning_rate * new_grad
    new_moment_delta = moment_delta * decay + square(new_grad) * (1 - decay)
    and returns (new_param, new_moment, new_moment_delta).
  • Computes the AdaGrad update for an input gradient and accumulated history. Concretely, given inputs (param, grad, moment, learning_rate), computes:
    new_moment = moment + square(grad)
    effective_lr = learning_rate / (sqrt(new_moment) + epsilon)
    update = learning_rate * grad / (sqrt(new_moment) + epsilon)
    new_param = param + update
    and returns (new_param, new_moment). Optionally returns effective_lr and update as well.
  • Computes the Adam update (https://arxiv.org/abs/1412.6980) for an input gradient and momentum parameters. Concretely, given inputs (param, m1, m2, grad, lr, iters), computes:
    t = iters + 1
    correction_multiplier = sqrt(1 - power(beta2, t)) / (1 - power(beta1, t))
    m1_o = (beta1 * m1) + (1 - beta1) * grad
    m2_o = (beta2 * m2) + (1 - beta2) * np.square(grad)
    grad_o = correction_multiplier * m1_o / (sqrt(m2_o) + epsilon)
    param_o = param + lr * grad_o
    and returns (param_o, m1_o, m2_o, grad_o), in which grad_o is an optional output. (A NumPy sketch of this update follows this group of items.)
  • Alter: alternate the learning rate between active_period and inactive_period. If active_first, update for a duration of active_period and then stop for a duration of inactive_period; otherwise the reverse.
  • Similar to Iter, but takes a mutex as the first input to make sure that updates are carried out atomically. This can be used in, e.g., Hogwild SGD algorithms.
  • Clips the input tensor by scaling based on the input value and the threshold. The value is usually the (pre-computed) norm of the tensor. If the value is larger than the threshold, the tensor is scaled down proportionally so that it no longer exceeds the threshold.
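As a concrete illustration of the Adam equations listed above, here is a minimal NumPy sketch of one dense update step. The function name adam_update, the demo tensors, and the default beta/epsilon values are illustrative assumptions, not part of this crate's API.

```python
import numpy as np

def adam_update(param, m1, m2, grad, lr, iters,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One dense Adam step following the equations in the item above."""
    t = iters + 1
    # Bias-correction multiplier for the raw moment estimates.
    correction = np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    m1_o = beta1 * m1 + (1.0 - beta1) * grad
    m2_o = beta2 * m2 + (1.0 - beta2) * np.square(grad)
    grad_o = correction * m1_o / (np.sqrt(m2_o) + epsilon)
    # lr is added as in the operator description; Caffe2 conventionally
    # passes a negative base learning rate, so this is a descent step.
    param_o = param + lr * grad_o
    return param_o, m1_o, m2_o, grad_o

# Tiny demo with made-up values.
param, m1, m2 = np.zeros(3), np.zeros(3), np.zeros(3)
grad = np.array([0.1, -0.2, 0.3])
param, m1, m2, _ = adam_update(param, m1, m2, grad, lr=-0.01, iters=0)
```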


  • composite: the learning rate policy changes according to the current iteration number.
  • constantThenLinearWarmup: first use a constant multiplier and then ramp up to the global lr.
  • ConstantWarmup: return scale when iter < num_iter, and 1 otherwise.
  • Cosine: return a learning rate with a cosine schedule, lower bound min_lr, upper bound max_lr. See https://arxiv.org/pdf/1608.03983.pdf (a Python sketch of a few of these policies follows this group).
  • Cyclical: return a learning rate with period 2 * stepsize, lower bound base_lr, upper bound max_lr. See https://arxiv.org/pdf/1506.01186.pdf
  • Exp: return gamma ^ iter.
  • Fixed: do not change the learning rate at all.
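The policies above all map an iteration count to a learning-rate value or multiplier. The helpers below are a minimal plain-Python sketch of three of them (Exp, ConstantWarmup, and a cosine schedule in the spirit of https://arxiv.org/pdf/1608.03983.pdf); the exact parameterization used by the crate may differ, so treat these as illustrative rather than as the crate's API.

```python
import math

def exp_lr(iter_num, gamma):
    # Exp: gamma ^ iter
    return gamma ** iter_num

def constant_warmup(iter_num, num_iter, scale):
    # ConstantWarmup: return scale when iter < num_iter, and 1 otherwise.
    return scale if iter_num < num_iter else 1.0

def cosine_lr(iter_num, min_lr, max_lr, period):
    # Cosine-style schedule bounded by [min_lr, max_lr] with the given
    # period (assumed form of the schedule).
    t = (iter_num % period) / period
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))

print(exp_lr(100, 0.99))               # decays geometrically with iter
print(constant_warmup(50, 1000, 0.1))  # 0.1 during warmup, then 1.0
print(cosine_lr(25, 0.01, 0.05, period=50))
```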


  • Gate: return multiplier_1 if before num_iter, else multiplier_2.
  • hill: the learning rate changes according to the following 3 stages: 1) linear warmup (increasing) for the first num_iter steps, starting from start_multiplier; 2) inverse shrink (decreasing) afterwards (gamma, power); 3) lower bounded by end_multiplier.
  • Inv: return (1 + gamma * iter) ^ (-power).
  • Stores a single integer that gets incremented on each call to Run(). Useful for tracking the iteration count during SGD, for example. IterOp runs an iteration counter. I cannot think of a case where we would need to access the iter variable on device, so this will always produce a tensor on the CPU side. If the blob already exists and is a tensor<int64_t> object, we will simply increment it (this emulates the case when we want to resume training). Otherwise the iter will start at 0.
  • Implements Layer-wise Adaptive Rate Scaling (LARS) with clipping. Before adding weight decay, given a parameter tensor X and its gradient dX, the local learning rate for X will be
    local_lr = trust * norm(X) / (norm(dX) + wd * norm(X) + offset * norm(X))
             = trust / (norm(dX) / norm(X) + wd + offset),
    where offset is a preset hyper-parameter to avoid numerical issues, and trust indicates how much we trust the layer to change its parameters during one update. In this implementation, we use the L2 norm, and the computed local learning rate is clipped to the upper bound lr_max and the lower bound lr_min:
    local_lr = min(local_lr, lr_max) and local_lr = max(local_lr, lr_min)
  • Learning Rate Adaption is an operation that performs one iteration of gradient descent on the learning rate:
    lr(k) = lr(k-1) - lr_alpha * df(k-1)/dlr,
    where df(k-1)/dlr is the gradient of the objective function f with respect to lr, and lr_alpha is a learning rate hyperparameter. It can be proven that df(k-1)/dlr equals INNERPRODUCT(grad(k-1), -grad(k-2)), where grad(k-1) is the gradient of f(k-1) with respect to the parameters. When the argument "normalized_lr_adaption" is false, we simply perform the following update:
    lr(k) = lr(k-1) - lr_alpha * INNERPRODUCT(grad(k-1), grad(k-2)).
    If we set "normalized_lr_adaption" to true, we do not directly apply INNERPRODUCT(grad(k-1), -grad(k-2)) as the gradient. Instead, we perform the following update:
    lr(k) = lr(k-1) + lr_alpha * cosineSimilarity(grad(k-1), grad(k-2)).
  • Learning rate is a decreasing function of time. With low learning rates the improvements will be linear; with high learning rates they will start to look more exponential. The learning rate is controlled by the following arguments:
    Required:
      iterations
      base_lr: base learning rate
      policy: this controls how the learning rate is applied; options are:
        fixed
        step: uses stepsize, gamma
        exp: uses gamma
        gate: uses multiplier_1, multiplier_2, num_iter
        inv: uses gamma, power
        linearWarmup: uses start_multiplier, num_iter
        constantWarmup: uses multiplier, num_iter
        alter: uses active_first, active_period, inactive_period
        hill: uses those in both linearWarmup and inv, plus end_multiplier
        composite: uses sub_policy_num_iters and additional args with format sub_policy_{sub_policy_index}_{sub_policy_arg}, for example:
          sub_policy_0_policy: "exp", sub_policy_0_gamma: 0.99, sub_policy_0_lr_scale: 1.2
          sub_policy_0_policy: "fixed", sub_policy_0_lr_scale: 1.0
          sub_policy_num_iters: [1000, 1000]
        cyclic: uses max_lr, stepsize
        cosine: uses min_lr, max_lr, period, t_mult, lr_shrink
        constantThenLinearWarmup: uses start_warmup_multiplier, constant_warmup_num_iter, linear_warmup_num_iter
        compositeCyclical: uses start_warmup_multiplier, constant_warmup_num_iter, linear_warmup_num_iter, cyclical_max_lr, cyclical_step_size, cyclical_decay
        compositeCosine: uses start_warmup_multiplier, constant_warmup_num_iter, linear_warmup_num_iter, cosine_max_lr, cosine_period, cosine_t_mult, cosine_lr_shrink
    Optional:
      stepsize: defaults to 0
      max_lr: defaults to 0.005
      gamma: defaults to 0
      power: defaults to 0
      num_iter: defaults to 0
      start_multiplier: defaults to 0
      multiplier: defaults to 0.5
      multiplier_1: defaults to 1
      multiplier_2: defaults to 1
      m1: defaults to 0.5, the first piece lr of piece warmup
      n1: defaults to 0, iter threshold of the first piece lr
      m2: defaults to 0.5, the second piece lr of piece warmup
      n2: defaults to 0, iter threshold of the second piece lr
      m3: defaults to 0.5, the third piece lr of piece warmup
      start_warmup_multiplier: defaults to 0.1, part of constantThenLinearWarmup
      constant_warmup_num_iter: defaults to 10000000, part of constantThenLinearWarmup, CompositeCyclicalLRPolicy, CompositeCosineLRPolicy
      linear_warmup_num_iter: defaults to 10000000, part of constantThenLinearWarmup, CompositeCyclicalLRPolicy, CompositeCosineLRPolicy
      cyclical_max_lr: defaults to 0.05, part of CompositeCyclicalLRPolicy
      cyclical_step_size: defaults to 1000000, part of CompositeCyclicalLRPolicy
      cyclical_decay: defaults to 1.0, part of CompositeCyclicalLRPolicy
      cosine_min_lr: defaults to 0.01, part of CompositeCosineLRPolicy
      cosine_max_lr: defaults to 0.05, part of CompositeCosineLRPolicy
      cosine_period: defaults to 50, part of CompositeCosineLRPolicy
      cosine_t_mult: defaults to 1.0, part of CompositeCosineLRPolicy
      cosine_lr_shrink: defaults to 0.99, part of CompositeCosineLRPolicy
    Usage:
      train_net.LearningRate(iterations, "label", base_lr=float, policy="policy_name", stepsize=int, gamma=float)
    Example usage:
      train_net.LearningRate(200, "LR", base_lr=-0.1, policy="step", stepsize=20, gamma=0.9)
  • LinearWarmup: ramp the multiplier linearly up to 1 over the first num_iter iterations (roughly min(iter/num_iter, 1)).
  • Computes a momentum SGD update for an input gradient and momentum parameters. Concretely, given inputs (grad, m, lr) and parameters (momentum, nesterov), computes:
    if not nesterov:
        adjusted_gradient = lr * grad + momentum * m
        return (adjusted_gradient, adjusted_gradient)
    else:
        m_new = momentum * m + lr * grad
        return ((1 + momentum) * m_new - momentum * m, m_new)
    Output is (grad, momentum). Note the difference to MomentumSGDUpdate, which actually performs the parameter update (and is thus faster).
  • Performs a momentum SGD update for an input gradient and momentum parameters. Concretely, given inputs (grad, m, lr, param) and arguments (momentum, nesterov), computes:
    if not nesterov:
        adjusted_gradient = lr * grad + momentum * m
        param = param - adjusted_gradient
        return (adjusted_gradient, adjusted_gradient, param)
    else:
        m_new = momentum * m + lr * grad
        param = param - ((1 + momentum) * m_new - momentum * m)
        return ((1 + momentum) * m_new - momentum * m, m_new, param)
    Output is (grad, momentum, parameter). Note the difference to MomentumSGD, which returns a new gradient but does not perform the parameter update. (A NumPy sketch of this update follows this group of items.)
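A minimal NumPy sketch of the MomentumSGDUpdate equations above, covering both the plain and the Nesterov branch. The function name and demo values are illustrative, not the crate's API.

```python
import numpy as np

def momentum_sgd_update(grad, m, lr, param, momentum=0.9, nesterov=False):
    """One momentum-SGD step following the equations in the items above.
    Returns (adjusted_gradient, new_momentum, new_param)."""
    if not nesterov:
        adjusted = lr * grad + momentum * m
        return adjusted, adjusted, param - adjusted
    m_new = momentum * m + lr * grad
    adjusted = (1.0 + momentum) * m_new - momentum * m
    return adjusted, m_new, param - adjusted

# Tiny demo with made-up values.
param, m = np.ones(4), np.zeros(4)
grad = np.array([0.5, -0.5, 0.25, 0.0])
adjusted, m, param = momentum_sgd_update(grad, m, lr=0.1, param=param,
                                         nesterov=True)
```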


  • ConstantWarmup: return scale when iter < num_iter, and 1 otherwise.
  • Poly: return (1 - iter/max_iter) ^ power.
  • Computes the RMSProp update (http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). Concretely, given inputs (grad, mean_squares, mom, lr), computes:
    mean_squares_o = mean_squares + (1 - decay) * (square(grad) - mean_squares)
    mom_o = momentum * mom + lr * grad / sqrt(epsilon + mean_squares_o)
    grad_o = mom_o
    and returns (grad_o, mean_squares_o, mom_o).
  • Counts the number of recent updates on rows. Exponential decay is applied to the counter with decay rate r such that r^{counter_halflife} = 0.5. If counter_halflife is nonpositive, this operator is turned off.
  • Fused operator of SparseLengthsIndicesInGradientSumGradient (gradient of SparseLengthsSum) + RowWiseSparseAdagrad. Given inputs (param, moment, indices, grad, lr), runs the row-wise sparse AdaGrad update on (param, grad, moment[indices], lr), and returns (new_param, new_moment) as in the dense case. The additional input (lengths) is for the fused SparseLengthsSumGradient operator.
    BW saving analysis for numSegments B, L_avg = avg(lengths), block_size D, assuming T = float and SIndex = int64_t: before fusion, SparseLengthsIndicesInGradientSumGradient reads B*D*4 bytes and writes B*L_avg*D*4, while RowWiseSparseAdagrad reads B*2*L_avg*D*4 and writes B*L_avg*D*4, so the total memory traffic is B*(1+4*L_avg)*D*4. After fusion, we read B*(1+L_avg)*D*4 and write B*L_avg*D*4, for total memory traffic of B*(1+2*L_avg)*D*4. Assuming L_avg >> 1, the memory BW saving is about 2x. See https://fb.quip.com/ldG7A55Ur5wM for more details on the BW saving analysis and evaluation results.
    Template parameters: Tdata (embedding types), T (everything else).
  • Template parameters: Tdata (embedding types), T (everything else).
  • Template parameters: Tdata (embedding types), T (everything else).
  • Given inputs (param, moment, indices, grad, lr), runs a modified sparse AdaGrad update on (param, grad, moment[indices], lr), and returns (new_param, new_moment), where moment is a 1D tensor with length equal to the number of rows in param: shape(moment) == shape(param)[0]. Each element of moment is applied to an entire row of param, and the new moment is calculated by adding the average squared sum of gradients across each row. Note: indices must also be a 1D tensor indexing into the rows of param. (A NumPy sketch of this row-wise update follows this group of items.)
  • Computes a modified Adam update for the sparse case. Given inputs (param, moment1, moment2, indices, grad, lr, iter), runs the Adam update on (param, moment1[indices], moment2[indices], lr, iter) and returns (new_param, new_moment1, new_moment2), where moment2 is a 1D tensor with length equal to the number of rows in param: shape(moment2) == shape(param)[0]. Each element of moment2 is applied to an entire row of param, and the new moment2 values are calculated by averaging across the row.
  • Given inputs (param, moment, moment_delta, indices, grad, lr), runs the dense AdaDelta update on (param, grad, moment[indices], moment_delta[indices], lr), and returns (new_param, new_moment, new_moment_delta) as in the dense case.
  • Fused operator of SparseLengthsIndicesInGradientSumGradient (gradient of SparseLengthsSum) + SparseAdagrad. Given inputs (param, moment, indices, grad, lr), runs the sparse AdaGrad update on (param, grad, moment[indices], lr), and returns (new_param, new_moment) as in the dense case. The additional input (lengths) is for the fused SparseLengthsIndicesInGradientSumGradient operator.
    Template parameters: Tdata (embedding and momentum types), T (everything else), is_mean (default false).
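To make the row-wise sparse AdaGrad items above concrete, here is a minimal NumPy sketch of the non-fused update: one accumulator scalar per embedding row, and only the rows named by indices are touched. The function name, the epsilon placement, and the demo shapes are illustrative assumptions, not the crate's API.

```python
import numpy as np

def rowwise_sparse_adagrad(param, moment, indices, grad, lr, epsilon=1e-5):
    """Row-wise sparse AdaGrad: moment holds one scalar per row of param."""
    for k, row in enumerate(indices):
        # The new moment adds the average squared gradient across the row.
        moment[row] += np.mean(np.square(grad[k]))
        # The whole row shares one adaptive scaling factor.
        param[row] += lr * grad[k] / (np.sqrt(moment[row]) + epsilon)
    return param, moment

# Tiny demo: 5 embedding rows of width 4, two of which receive gradients.
param = np.zeros((5, 4))
moment = np.zeros(5)                 # one accumulator per row
indices = np.array([1, 3])           # rows touched by this minibatch
grad = np.full((2, 4), 0.1)          # gradient slice aligned with indices
param, moment = rowwise_sparse_adagrad(param, moment, indices, grad, lr=-0.1)
```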


  • Given inputs (param, moment, indices, grad, lr), runs the dense AdaGrad update on (param, grad, moment[indices], lr), and returns (new_param, new_moment) as in the dense case.
  • Computes the Adam update for the sparse case. Given inputs (param, moment1, moment2, indices, grad, lr, iter), runs the dense Adam update on (param, moment1[indices], moment2[indices], lr, iter) and returns (new_param, new_moment1, new_moment2) as in the dense case. Adam can be customized as Rectified Adam (RAdam) by setting enableRAdam = true. (A NumPy sketch of this indexed update follows.)
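The sparse Adam item above applies the dense Adam step (sketched earlier) only to the rows selected by indices, leaving all other rows and their moments untouched. Below is a minimal NumPy sketch under that reading; the names and default hyperparameters are illustrative, and the RAdam variant is not shown.

```python
import numpy as np

def sparse_adam(param, m1, m2, indices, grad, lr, iter_num,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Dense Adam applied only to the rows picked out by `indices`."""
    t = iter_num + 1
    correction = np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    for k, row in enumerate(indices):
        m1[row] = beta1 * m1[row] + (1.0 - beta1) * grad[k]
        m2[row] = beta2 * m2[row] + (1.0 - beta2) * np.square(grad[k])
        param[row] += lr * correction * m1[row] / (np.sqrt(m2[row]) + epsilon)
    return param, m1, m2

# Tiny demo: only rows 0 and 2 are updated; rows 1 and 3 stay untouched.
param = np.zeros((4, 3))
m1, m2 = np.zeros((4, 3)), np.zeros((4, 3))
indices = np.array([0, 2])
grad = np.full((2, 3), 0.2)
param, m1, m2 = sparse_adam(param, m1, m2, indices, grad, lr=-0.01, iter_num=0)
```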

  • Performs a momentum SGD update analogous to MomentumSGDUpdate, but using a GradientSlice and indices into the full param and momentum tables. Both param and momentum should be updated in-place (the corresponding inputs and outputs should be the same blobs).
  • This operator implements the STORM (https://arxiv.org/abs/1905.10018) optimization algorithm. Given inputs (param, moment, grad_sq_sum, grad, indices, lr), computes the dense STORM update on (param, moment[indices], grad_sq_sum, grad, lr), and returns (new_param, new_moment, new_grad_sq_sum) as in the dense case.
  • This operator implements the optimization algorithm in https://arxiv.org/abs/1803.02865 by Wu, Ward and Bottou. Given inputs (param, seq_b, indices, grad, lr), runs the dense WnGrad update on (param, grad, seq_b, lr), and returns (new_param, new_seq_b) as in the dense case.
  • Step: return gamma ^ floor(iter / step).
  • Computes the STORM (https://arxiv.org/abs/1905.10018) update for an input gradient and accumulated history of gradients. Concretely, given inputs (param, moment, grad_sq_sum, grad, lr), computes:
    new_grad_sq_sum = grad_sq_sum + norm(grad)^2
    effective_lr = lr / (beta + new_grad_sq_sum)^(1/3)
    alpha = momentum * square(effective_lr)
    new_moment = grad + (1 - alpha) * (moment - grad)
    new_param = param + effective_lr * new_moment
    and returns (new_param, new_moment, new_grad_sq_sum). Note that, due to a Caffe2 limitation, it is difficult to re-calculate the gradient from the previous iteration using the current example, so we simplified the calculation of new_moment by using the gradient from the current iteration. (A NumPy sketch of this update follows this group of items.)
  • Every stepsize iterations, multiply the weights by a constant scale: nw = w * scale
  • Computes the WnGrad update for an input gradient and accumulated history. This operator implements the optimization algorithm in https://arxiv.org/abs/1803.02865 by Wu, Ward and Bottou. Concretely, given inputs (param, grad, seq_b, learning_rate), computes:
    new_seq_b = seq_b + 1 / seq_b * norm(grad)^2
    effective_lr = learning_rate / (new_seq_b + epsilon)
    update = learning_rate * grad / (new_seq_b + epsilon)
    new_param = param + update
    and returns (new_param, new_seq_b). Optionally returns effective_lr and update as well.
  • YellowFin: an automatic tuner for momentum SGD (https://arxiv.org/abs/1706.03471). The YellowFinOp tunes the learning rate and momentum and performs momentum SGD steps; the learning rate and momentum are kept separate for each matrix of parameters.
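A minimal NumPy sketch of the dense STORM equations listed above. The default momentum and beta values are placeholders, and the function name is illustrative, not the crate's API.

```python
import numpy as np

def storm_update(param, moment, grad_sq_sum, grad, lr,
                 momentum=10.0, beta=0.1):
    """One dense STORM step (https://arxiv.org/abs/1905.10018) following
    the equations in the item above."""
    new_grad_sq_sum = grad_sq_sum + np.linalg.norm(grad) ** 2
    effective_lr = lr / (beta + new_grad_sq_sum) ** (1.0 / 3.0)
    alpha = momentum * effective_lr ** 2
    # The moment uses the current gradient in place of the previous one,
    # matching the simplification noted in the operator description.
    new_moment = grad + (1.0 - alpha) * (moment - grad)
    new_param = param + effective_lr * new_moment
    return new_param, new_moment, new_grad_sq_sum

# Tiny demo with made-up values.
param, moment = np.zeros(3), np.zeros(3)
grad = np.array([0.1, 0.2, -0.1])
param, moment, gss = storm_update(param, moment, 0.0, grad, lr=-0.1)
```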

Constants

Traits

  • LearningRateFunctor is a functor that, when fed an iteration number, produces the learning rate for the corresponding iteration.

Functions

Type Definitions