# Crate caffe2_sgd

## Structs

• | Computes the AdaDelta update | (https://arxiv.org/abs/1212.5701) for an input | gradient and accumulated history of squared | gradients. Concretely, given inputs (param, | moment, moment_delta, grad, learning_rate), | computes: | | new_moment = moment * decay + square(grad) * (1 - decay) | new_grad = sqrt(moment_delta + epsilon) / sqrt(new_moment + epsilon) * grad | new_param = param + learning_rate * new_grad | new_moment_delta = moment_delta * decay + square(new_grad) * (1 - decay) | | and returns (new_param, new_moment, | new_moment_delta).
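The four AdaDelta formulas above can be sketched as an in-place, element-wise update. This is a minimal illustration only: the function name `adadelta_update`, the slice-based signature and the `f32` element type are assumptions made for the example, not the crate's API.

```rust
/// Hypothetical helper: one dense AdaDelta step over equally sized slices.
/// `param`, `moment` and `moment_delta` are updated in place.
pub fn adadelta_update(
    param: &mut [f32],
    moment: &mut [f32],
    moment_delta: &mut [f32],
    grad: &[f32],
    learning_rate: f32,
    decay: f32,
    epsilon: f32,
) {
    for i in 0..param.len() {
        let g = grad[i];
        // new_moment = moment * decay + square(grad) * (1 - decay)
        let new_moment = moment[i] * decay + g * g * (1.0 - decay);
        // new_grad = sqrt(moment_delta + epsilon) / sqrt(new_moment + epsilon) * grad
        let new_grad = (moment_delta[i] + epsilon).sqrt() / (new_moment + epsilon).sqrt() * g;
        // new_param = param + learning_rate * new_grad
        param[i] += learning_rate * new_grad;
        // new_moment_delta = moment_delta * decay + square(new_grad) * (1 - decay)
        moment_delta[i] = moment_delta[i] * decay + new_grad * new_grad * (1.0 - decay);
        moment[i] = new_moment;
    }
}
```

Because the update is added to the parameter, a descent step needs a negative `learning_rate`, which matches the `base_lr=-0.1` used in the LearningRate example on this page.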
• | Computes the AdaGrad update for an input gradient | and accumulated history. Concretely, given inputs | (param, grad, moment, learning_rate), computes | | new_moment = moment + square(grad) | effective_lr = learning_rate / (sqrt(new_moment) + epsilon) | update = learning_rate * grad / (sqrt(new_moment) + epsilon) | new_param = param + update | and returns (new_param, new_moment). | | Optionally returns effective_lr and update as | well.
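As a rough sketch of the AdaGrad formulas above, the update can be written element-wise as follows. The name `adagrad_update` and the choice to return `effective_lr` as a `Vec` are assumptions for illustration.

```rust
/// Hypothetical helper: one dense AdaGrad step. Updates `param` and `moment`
/// in place and returns the per-element effective learning rate.
pub fn adagrad_update(
    param: &mut [f32],
    moment: &mut [f32],
    grad: &[f32],
    learning_rate: f32,
    epsilon: f32,
) -> Vec<f32> {
    let mut effective_lr = Vec::with_capacity(param.len());
    for i in 0..param.len() {
        let g = grad[i];
        // new_moment = moment + square(grad)
        moment[i] += g * g;
        // effective_lr = learning_rate / (sqrt(new_moment) + epsilon)
        let lr_i = learning_rate / (moment[i].sqrt() + epsilon);
        // update = effective_lr * grad; new_param = param + update
        param[i] += lr_i * g;
        effective_lr.push(lr_i);
    }
    effective_lr
}
```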
• | Computes the Adam update | (https://arxiv.org/abs/1412.6980) for an input | gradient and momentum parameters. Concretely, | given inputs (param, m1, m2, grad, lr, iters), | | t = iters + 1 | correction_multiplier = sqrt(1 - power(beta2, t)) / (1 - power(beta1, t)) | m1_o = (beta1 * m1) + (1 - beta1) * grad | m2_o = (beta2 * m2) + (1 - beta2) * np.square(grad) | grad_o = correction_multiplier * m1_o / (sqrt(m2_o) + epsilon) | param_o = param + lr * grad_o | | and returns (param_o, m1_o, m2_o, grad_o), in | which grad_o is an optional output.
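The Adam formulas above might be sketched like this. The function name `adam_update` and its flat argument list are assumptions for the example; `grad_o` is computed per element but not returned here.

```rust
/// Hypothetical helper: one dense Adam step. `param`, `m1` and `m2` are
/// updated in place; `iters` is the zero-based iteration counter.
#[allow(clippy::too_many_arguments)]
pub fn adam_update(
    param: &mut [f32],
    m1: &mut [f32],
    m2: &mut [f32],
    grad: &[f32],
    lr: f32,
    iters: i32,
    beta1: f32,
    beta2: f32,
    epsilon: f32,
) {
    let t = iters + 1;
    // correction_multiplier = sqrt(1 - beta2^t) / (1 - beta1^t)
    let correction = (1.0 - beta2.powi(t)).sqrt() / (1.0 - beta1.powi(t));
    for i in 0..param.len() {
        let g = grad[i];
        m1[i] = beta1 * m1[i] + (1.0 - beta1) * g;
        m2[i] = beta2 * m2[i] + (1.0 - beta2) * g * g;
        let grad_o = correction * m1[i] / (m2[i].sqrt() + epsilon);
        param[i] += lr * grad_o;
    }
}
```

On the first iteration with zero-initialised moments and `epsilon = 0`, the bias correction makes `grad_o` equal to the sign of the gradient, so the parameter moves by roughly `lr`.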
• | Alter: alternate the learning rate between | active_period and inactive_period. | Update for a duration of active_period | and then stop for a duration of inactive_period | if active_first, and vice versa otherwise. |
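One way to read the Alter policy is as a multiplier that is 1 while updates are active and 0 while they are paused. The sketch below assumes exactly that; the function name `alter_multiplier` and the 1/0 values are illustrative assumptions, not taken from this page.

```rust
/// Hypothetical helper: returns 1.0 during an active phase and 0.0 during an
/// inactive phase. Panics if both periods are zero (modulo by zero).
pub fn alter_multiplier(
    iter: u64,
    active_first: bool,
    active_period: u64,
    inactive_period: u64,
) -> f64 {
    // Position inside the current active + inactive cycle.
    let pos = iter % (active_period + inactive_period);
    let active = if active_first {
        pos < active_period
    } else {
        pos >= inactive_period
    };
    if active { 1.0 } else { 0.0 }
}
```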
• | Similar to Iter, but takes a mutex as | the first input to make sure that updates | are carried out atomically. This can | be used in e.g. Hogwild sgd algorithms. |
• Clips the input tensor by scaling based on the input value and the threshold. The value is usually the (pre-computed) norm of the tensor. If the value is larger than the threshold, scaling would be performed in this way:

• | composite: the learning policy changes | according to current iteration # |
• | constantThenLinearWarmup: first | use a constant multiplier and then ramp | up to the global lr |
• | ConstantWarmup: return scale when | iter < num_iter, and 1 otherwise |
• | Cosine: return a learning rate with | a cosine schedule, lower bound min_lr and | upper bound max_lr. | | See https://arxiv.org/pdf/1608.03983.pdf |
• | Cyclical: return a learning rate with | period 2 * stepsize and lower bound base_lr, | upper bound max_lr. | | See https://arxiv.org/pdf/1506.01186.pdf |
• | Exp: return gamma ^ iter |
• | Fixed: not changing the learning rate | at all. |

• | Gate: return multiplier_1 if before | num_iter, else multiplier_2 |
• | hill: the learning rate changes according | to following 3 stages | | 1) linear warmup (increasing) at first | num_iter steps from start_multiplier | | 2) inverse shrink (decreasing) afterwards | (gamma, power) | | 3) lower bounded by end_multiplier |
• | Inv: return (1 + gamma * iter) ^ (-power) |
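The closed-form policies listed above (Exp, Gate and Inv) translate almost directly into code. The function names below are hypothetical, and each function returns only the multiplier described on this page.

```rust
/// Exp: gamma ^ iter
pub fn exp_multiplier(gamma: f64, iter: u32) -> f64 {
    gamma.powi(iter as i32)
}

/// Gate: multiplier_1 before num_iter, multiplier_2 from num_iter onwards.
pub fn gate_multiplier(multiplier_1: f64, multiplier_2: f64, num_iter: u32, iter: u32) -> f64 {
    if iter < num_iter { multiplier_1 } else { multiplier_2 }
}

/// Inv: (1 + gamma * iter) ^ (-power)
pub fn inv_multiplier(gamma: f64, power: f64, iter: u32) -> f64 {
    (1.0 + gamma * iter as f64).powf(-power)
}
```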
• | Stores a single integer that gets incremented | on each call to Run(). | | Useful for tracking the iteration count | during SGD, for example. | | IterOp runs an iteration counter. I | cannot think of a case where we would | need to access the iter variable on device, | so this will always produce a tensor | on the CPU side. If the blob already exists | and is a tensor<int64_t> object, we | will simply increment it (this emulates | the case when we want to resume training). | Otherwise we will have the iter starting | with 0. |
• | Implements Layer-wise Adaptive Rate | Scaling (LARS) with clipping. Before | adding weight decay, given a parameter | tensor X and its gradient dX, the local | learning rate for X will be | | local_lr = trust * norm(X) / ( norm(dX) | + wd * norm(X) + offset * norm(X) ) | | = trust / ( norm(dX) / norm(X) + wd + offset), | | where offset is a preset hyper-parameter | to avoid numerical issues and trust indicates | how much we trust the layer to change | its parameters during one update. | | In this implementation, we use the l2 norm | and the computed local learning rate | is clipped based on the upper bound lr_max | and the lower bound lr_min: | | local_lr = min(local_lr, lr_max) and | local_lr = max(local_lr, lr_min) |
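The LARS local learning rate and its clipping can be sketched as below. The names `l2_norm` and `lars_local_lr` are hypothetical, and the sketch does not special-case zero norms.

```rust
/// l2 norm of a flat tensor.
fn l2_norm(x: &[f32]) -> f32 {
    x.iter().map(|v| v * v).sum::<f32>().sqrt()
}

/// Hypothetical helper: clipped LARS local learning rate for parameter `x`
/// with gradient `dx`. Zero norms are not handled specially in this sketch.
pub fn lars_local_lr(
    x: &[f32],
    dx: &[f32],
    wd: f32,
    trust: f32,
    offset: f32,
    lr_min: f32,
    lr_max: f32,
) -> f32 {
    let x_norm = l2_norm(x);
    let dx_norm = l2_norm(dx);
    // local_lr = trust * norm(X) / (norm(dX) + wd * norm(X) + offset * norm(X))
    let local_lr = trust * x_norm / (dx_norm + wd * x_norm + offset * x_norm);
    // Clip to [lr_min, lr_max].
    local_lr.min(lr_max).max(lr_min)
}
```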
• Learning rate is a decreasing function of time. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Learning rate is controlled by the following arguments:

  Required:
  - `iterations`
  - `base_lr`: base learning rate
  - `policy`: controls how the learning rate is applied. Options are:
    - `fixed`
    - `step`: uses `stepsize`, `gamma`
    - `exp`: uses `gamma`
    - `gate`: uses `multiplier_1`, `multiplier_2`, `num_iter`
    - `inv`: uses `gamma`, `power`
    - `linearWarmup`: uses `start_multiplier`, `num_iter`
    - `constantWarmup`: uses `multiplier`, `num_iter`
    - `alter`: uses `active_first`, `active_period`, `inactive_period`
    - `hill`: uses those in both `linearWarmup` and `inv`, plus `end_multiplier`
    - `composite`: uses `sub_policy_num_iters` and additional args with the format `sub_policy_{sub_policy_index}_{sub_policy_arg}`, for example: `sub_policy_0_policy: "exp"`, `sub_policy_0_gamma: 0.99`, `sub_policy_0_lr_scale: 1.2`, `sub_policy_1_policy: "fixed"`, `sub_policy_1_lr_scale: 1.0`, `sub_policy_num_iters: [1000, 1000]`
    - `cyclic`: uses `max_lr`, `stepsize`
    - `cosine`: uses `min_lr`, `max_lr`, `period`, `t_mult`, `lr_shrink`
    - `constantThenLinearWarmup`: uses `start_warmup_multiplier`, `constant_warmup_num_iter`, `linear_warmup_num_iter`
    - `compositeCyclical`: uses `start_warmup_multiplier`, `constant_warmup_num_iter`, `linear_warmup_num_iter`, `cyclical_max_lr`, `cyclical_step_size`, `cyclical_decay`
    - `compositeCosine`: uses `start_warmup_multiplier`, `constant_warmup_num_iter`, `linear_warmup_num_iter`, `cosine_max_lr`, `cosine_period`, `cosine_t_mult`, `cosine_lr_shrink`

  Optional:
  - `stepsize`: defaults to 0
  - `max_lr`: defaults to 0.005
  - `gamma`: defaults to 0
  - `power`: defaults to 0
  - `num_iter`: defaults to 0
  - `start_multiplier`: defaults to 0
  - `multiplier`: defaults to 0.5
  - `multiplier_1`: defaults to 1
  - `multiplier_2`: defaults to 1
  - `m1`: defaults to 0.5, the first piece lr of piece warmup
  - `n1`: defaults to 0, iter threshold of the first piece lr
  - `m2`: defaults to 0.5, the second piece lr of piece warmup
  - `n2`: defaults to 0, iter threshold of the second piece lr
  - `m3`: defaults to 0.5, the third piece lr of piece warmup
  - `start_warmup_multiplier`: defaults to 0.1, part of constantThenLinearWarmup
  - `constant_warmup_num_iter`: defaults to 10000000, part of constantThenLinearWarmup, CompositeCyclicalLRPolicy, CompositeCosineLRPolicy
  - `linear_warmup_num_iter`: defaults to 10000000, part of constantThenLinearWarmup, CompositeCyclicalLRPolicy, CompositeCosineLRPolicy
  - `cyclical_max_lr`: defaults to 0.05, part of CompositeCyclicalLRPolicy
  - `cyclical_step_size`: defaults to 1000000, part of CompositeCyclicalLRPolicy
  - `cyclical_decay`: defaults to 1.0, part of CompositeCyclicalLRPolicy
  - `cosine_min_lr`: defaults to 0.01, part of CompositeCosineLRPolicy
  - `cosine_max_lr`: defaults to 0.05, part of CompositeCosineLRPolicy
  - `cosine_period`: defaults to 50, part of CompositeCosineLRPolicy
  - `cosine_t_mult`: defaults to 1.0, part of CompositeCosineLRPolicy
  - `cosine_lr_shrink`: defaults to 0.99, part of CompositeCosineLRPolicy

  Usage: `train_net.LearningRate(iterations, "label", base_lr=float, policy="policy_name", stepsize=int, gamma=float)`

  Example usage: `train_net.LearningRate(200, "LR", base_lr=-0.1, policy="step", stepsize=20, gamma=0.9)`
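To make the example usage concrete, the sketch below evaluates the `step` policy for `base_lr=-0.1`, `stepsize=20`, `gamma=0.9`. It assumes the operator's output is `base_lr` multiplied by the policy's multiplier; that combination and the function name `step_learning_rate` are assumptions for illustration.

```rust
/// Hypothetical helper: base_lr * gamma ^ floor(iter / stepsize).
/// `stepsize` must be positive, otherwise the integer division panics.
pub fn step_learning_rate(base_lr: f64, gamma: f64, stepsize: u32, iter: u32) -> f64 {
    // Integer division implements floor(iter / stepsize).
    let steps = iter / stepsize;
    base_lr * gamma.powi(steps as i32)
}
```

With these arguments the rate stays at -0.1 for iterations 0 to 19, drops in magnitude to -0.09 at iteration 20, and shrinks by a further factor of 0.9 every 20 iterations after that.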
• | LinearWarmup: return min(iter/num_iter, | 1) |
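A linear warmup that ramps up to 1 over `num_iter` iterations and then stays there could be sketched as follows. Starting the ramp at `start_multiplier` (which the LearningRate argument list associates with `linearWarmup`) is an assumption about how that argument is used; with `start_multiplier = 0` the result reduces to `iter / num_iter` capped at 1.

```rust
/// Hypothetical helper: linear ramp from `start_multiplier` to 1.0 over
/// `num_iter` iterations, then constant 1.0.
pub fn linear_warmup_multiplier(start_multiplier: f64, num_iter: u32, iter: u32) -> f64 {
    if iter < num_iter {
        let progress = iter as f64 / num_iter as f64;
        start_multiplier + (1.0 - start_multiplier) * progress
    } else {
        1.0
    }
}
```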

• | ConstantWarmup: return scale when | iter < num_iter, and 1 otherwise |
• | Poly: return (1 - iter/max_iter) ^ (power) |
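The Poly multiplier is a one-liner once the division is done in floating point. The function name is hypothetical.

```rust
/// Poly: (1 - iter / max_iter) ^ power, with the ratio taken in floating point.
pub fn poly_multiplier(power: f64, max_iter: u32, iter: u32) -> f64 {
    (1.0 - iter as f64 / max_iter as f64).powf(power)
}
```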
• | Computes the RMSProp update | | (http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). | | Concretely, given inputs (grad, mean_squares, mom, | lr), computes: | | mean_squares_o = mean_squares + (1 - decay) * (square(grad) - mean_squares) | mom_o = momentum * mom + lr * grad / sqrt(epsilon + mean_squares_o) | grad_o = mom_o | | Returns (grad_o, mean_squares_o, mom_o).
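The RMSProp formulas above can be sketched as an in-place update in which the gradient slice is overwritten with `grad_o`. The function name `rmsprop_update` and the slice-based signature are assumptions for the example.

```rust
/// Hypothetical helper: one RMSProp step. `grad` is overwritten with grad_o,
/// `mean_squares` and `mom` are updated in place.
pub fn rmsprop_update(
    grad: &mut [f32],
    mean_squares: &mut [f32],
    mom: &mut [f32],
    lr: f32,
    decay: f32,
    momentum: f32,
    epsilon: f32,
) {
    for i in 0..grad.len() {
        let g = grad[i];
        // mean_squares_o = mean_squares + (1 - decay) * (square(grad) - mean_squares)
        mean_squares[i] += (1.0 - decay) * (g * g - mean_squares[i]);
        // mom_o = momentum * mom + lr * grad / sqrt(epsilon + mean_squares_o)
        mom[i] = momentum * mom[i] + lr * g / (epsilon + mean_squares[i]).sqrt();
        // grad_o = mom_o
        grad[i] = mom[i];
    }
}
```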
• | Counts the number of recent updates on rows. | | Exponential decay is applied on the | counter with decay rate r, such that | r^{counter_halflife} = 0.5; | | If counter_halflife is nonpositive, | this operator is turned off. |
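Solving r^{counter_halflife} = 0.5 for r gives r = 0.5^(1 / counter_halflife). A small sketch of that calculation, with `None` standing in for the "turned off" case (the function name and the `Option` return type are assumptions):

```rust
/// Hypothetical helper: decay rate r with r^counter_halflife = 0.5.
/// Returns None when counter_halflife is nonpositive (operator turned off).
pub fn counter_decay_rate(counter_halflife: i64) -> Option<f64> {
    if counter_halflife <= 0 {
        return None;
    }
    Some(0.5f64.powf(1.0 / counter_halflife as f64))
}
```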
• | typename Tdata, // embedding types | typename T, // everything else |
• | typename Tdata, // embedding types | typename T, // everything else |
• | Given inputs (param, moment, indices, | grad, lr), runs a modified sparse Adagrad | update on (param, grad, moment[indices], | lr), and returns (new_param, new_moment), | where moment is a 1D tensor with length | equal to the number of rows in param: | shape(moment) == shape(param)[0]. | | Each element of moment is applied to | an entire row of param, and the new moment | is calculated by adding the average | squared sum of gradients across each | row. | | Note: indices must also be a 1D tensor indexing | into the rows of param. |
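A row-wise sparse Adagrad step along the lines described above might look like the sketch below. The row-major flat layout, the explicit `row_len` argument, the function name and the placement of `epsilon` (outside the square root, as in the dense AdaGrad formula on this page) are all assumptions for illustration.

```rust
/// Hypothetical helper: row-wise sparse Adagrad. `param` is a row-major
/// matrix with rows of length `row_len`, `moment` has one entry per row, and
/// `grad` holds one gradient row per entry of `indices`.
pub fn rowwise_sparse_adagrad(
    param: &mut [f32],
    moment: &mut [f32],
    indices: &[usize],
    grad: &[f32],
    lr: f32,
    epsilon: f32,
    row_len: usize,
) {
    for (k, &row) in indices.iter().enumerate() {
        let g = &grad[k * row_len..(k + 1) * row_len];
        // Add the average of the squared gradients across the row.
        let mean_sq = g.iter().map(|v| v * v).sum::<f32>() / row_len as f32;
        moment[row] += mean_sq;
        // A single step size is shared by the whole row.
        let step = lr / (moment[row].sqrt() + epsilon);
        let p = &mut param[row * row_len..(row + 1) * row_len];
        for (pv, gv) in p.iter_mut().zip(g) {
            *pv += step * gv;
        }
    }
}
```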
• | Computes a modified Adam Update for | the sparse case. | | Given inputs (param, moment1, moment2, | indices, grad, lr, iter), runs the Adam | update on (param, moment1[indices], | moment2[indices], lr, iter) and returns | (new_param, new_moment1, new_moment2), | where moment2 is a 1D tensor with length | equal to the number of rows in param: | | shape(moment2) == shape(param)[0]. | Each element of moment2 is applied to | an entire row of param, and the new moment2 | values are calculated by averaging | across the row. |
• | Given inputs (param, moment, moment_delta, | indices, grad, lr), runs the dense AdaDelta | update on (param, grad, moment[indices], | moment_delta[indices], lr), and returns | (new_param, new_moment, new_moment_delta) | as in the dense case. |

• | Given inputs (param, moment, indices, | grad, lr), runs the dense AdaGrad update | on (param, grad, moment[indices], | lr), and returns (new_param, new_moment) | as in the dense case. |
• | Computes the Adam Update for the sparse | case. | | Given inputs (param, moment1, moment2, | indices, grad, lr, iter), runs the dense | Adam on (param, moment1[indices], | moment2[indices], lr, iter) and returns | (new_param, new_moment1, new_moment2) | as in the dense case. | | Adam can be customized as Rectified | Adam (RAdam) by setting enableRAdam | = true. |

• | Performs a momentum SGD update analogous | to MomentumSGDUpdate, but using a GradientSlice | and indices into the full param and momentum | tables. Both param and momentum should | be in-place (corresponding inputs | and outputs should be the same blobs). |
• | This operator implements the STORM (https://arxiv.org/abs/1905.10018) | optimization algorithm. | | Given inputs (param, moment, grad_sq_sum, | grad, indices, lr), computes the dense | STORM update on (param, moment[indices], | grad_sq_sum, grad, lr), and returns | (new_param, new_moment, new_grad_sq_sum) | as in the dense case. |
• | This operator implements the optimization | algorithm in https://arxiv.org/abs/1803.02865 | by Wu, Ward and Bottou. | | Given inputs (param, seq_b, indices, | grad, lr), runs the dense WnGrad update | on (param, grad, seq_b, lr), and returns | (new_param, new_seq_b) as in the dense | case. |
• | Step: return gamma ^ (floor(iter / step)) |
• Every `stepsize` iterations, multiply the weights by a constant `scale`: nw = w * scale
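The periodic weight scaling described above can be sketched as follows. Treating iteration 0 as a scaling iteration and skipping the operation when `stepsize` is 0 are assumptions of this sketch, as is the function name.

```rust
/// Hypothetical helper: multiplies every weight by `scale` on iterations that
/// are a multiple of `stepsize`, and leaves the weights untouched otherwise.
pub fn weight_scale(w: &mut [f32], iter: u64, stepsize: u64, scale: f32) {
    if stepsize > 0 && iter % stepsize == 0 {
        for v in w.iter_mut() {
            // nw = w * scale
            *v *= scale;
        }
    }
}
```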