| Computes the AdaDelta update
| (https://arxiv.org/abs/1212.5701) for an input
| gradient and accumulated history of squared
| gradients. Concretely, given inputs (param,
| moment, moment_delta, grad, learning_rate),
| computes:
|
| new_moment = moment * decay + square(grad) * (1 - decay)
| new_grad = sqrt(moment_delta + epsilon) / sqrt(new_moment + epsilon) * grad
| new_param = param + learning_rate * new_grad
| new_moment_delta = moment_delta * decay + square(new_grad) * (1 - decay)
|
| and returns (new_param, new_moment,
| new_moment_delta).
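|
| A minimal NumPy sketch of the update above (illustrative only; the decay
| and epsilon defaults are assumptions, not necessarily the operator's):
|
|   import numpy as np
|
|   def adadelta_update(param, moment, moment_delta, grad, lr,
|                       decay=0.95, epsilon=1e-6):  # assumed defaults
|       # Running average of squared gradients.
|       new_moment = moment * decay + np.square(grad) * (1 - decay)
|       # Rescale the gradient by RMS(updates) / RMS(gradients).
|       new_grad = np.sqrt(moment_delta + epsilon) / np.sqrt(new_moment + epsilon) * grad
|       new_param = param + lr * new_grad
|       # Running average of squared updates.
|       new_moment_delta = moment_delta * decay + np.square(new_grad) * (1 - decay)
|       return new_param, new_moment, new_moment_delta
|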
| Computes the AdaGrad update for an input gradient
| and accumulated history. Concretely, given inputs
| (param, grad, moment, learning_rate), computes
|
| new_moment = moment + square(grad)
| effective_lr = learning_rate / (sqrt(new_moment) + epsilon)
| update = learning_rate * grad / (sqrt(new_moment) + epsilon)
| new_param = param + update
| and returns (new_param, new_moment).
|
| Optionally returns effective_lr and update as
| well.
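|
| A NumPy sketch of the same computation (the epsilon default is illustrative):
|
|   import numpy as np
|
|   def adagrad_update(param, grad, moment, lr, epsilon=1e-5):  # assumed epsilon
|       # Accumulate squared gradients.
|       new_moment = moment + np.square(grad)
|       effective_lr = lr / (np.sqrt(new_moment) + epsilon)
|       update = lr * grad / (np.sqrt(new_moment) + epsilon)
|       new_param = param + update
|       return new_param, new_moment, effective_lr, update
|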
| Computes the Adam update
| (https://arxiv.org/abs/1412.6980) for an input
| gradient and momentum parameters. Concretely,
| given inputs (param, m1, m2, grad, lr, iters),
|
| t = iters + 1
| correction_multiplier = sqrt(1 - power(beta2, t)) / (1 - power(beta1, t))
| m1_o = (beta1 * m1) + (1 - beta1) * grad
| m2_o = (beta2 * m2) + (1 - beta2) * np.square(grad)
| grad_o = correction_multiplier * m1_o / (sqrt(m2_o) + epsilon)
| param_o = param + lr * grad_o
|
| and returns (param_o, m1_o, m2_o, grad_o), in
| which grad_o is an optional output.
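|
| A NumPy sketch of the same update (beta1, beta2 and epsilon defaults are
| illustrative, not necessarily the operator's):
|
|   import numpy as np
|
|   def adam_update(param, m1, m2, grad, lr, iters,
|                   beta1=0.9, beta2=0.999, epsilon=1e-8):  # assumed defaults
|       t = iters + 1
|       # Bias correction folded into a single multiplier.
|       correction = np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
|       m1_o = beta1 * m1 + (1 - beta1) * grad
|       m2_o = beta2 * m2 + (1 - beta2) * np.square(grad)
|       grad_o = correction * m1_o / (np.sqrt(m2_o) + epsilon)
|       param_o = param + lr * grad_o
|       return param_o, m1_o, m2_o, grad_o
|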
| Alter: alternate the learning rate using
| active_period and inactive_period:
| update for a duration of active_period
| and then stop for a duration of inactive_period
| if active_first, and vice versa.
|
| Similar to Iter, but takes a mutex as
| the first input to make sure that updates
| are carried out atomically. This can
| be used in e.g. Hogwild SGD algorithms.
|
| Clips the input tensor by scaling, based on an input value and a threshold.
| The value is usually the (pre-computed) norm of the tensor. If the value is
| larger than the threshold, the tensor is scaled down accordingly.
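|
| The exact scaling formula is not spelled out above; a common form of this
| kind of norm-based clipping (assumed here for illustration) scales by
| threshold / value:
|
|   def clip_by_scaling(tensor, value, threshold):
|       # Assumed semantics: shrink the tensor so that its (pre-computed)
|       # norm-like `value` does not exceed `threshold`; otherwise pass through.
|       if value > threshold:
|           return tensor * (threshold / value)
|       return tensor
|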
| composite: the learning policy changes
| according to current iteration #
|
| constantThenLinearWarmup: first
| use a constant multiplier and then ramp
| up to the global lr
|
| ConstantWarmup: return scale when
| iter < num_iter, and 1 otherwise
|
| Cosine: return a learning rate with
| a cosine schedule lower bound min_lr,
| upper bound max_lr.
|
| See https://arxiv.org/pdf/1608.03983.pdf
|
| Cyclical: return a learning rate with
| period 2 * stepsize and lower bound base_lr,
| upper bound max_lr.
|
| See https://arxiv.org/pdf/1506.01186.pdf
|
| Exp: return gamma ^ iter
|
| Fixed: not changing the learning rate
| at all.
|
| Gate: return multiplier_1 if before
| num_iter, else multiplier_2
|
| hill: the learning rate changes according
| to the following 3 stages:
|
| 1) linear warmup (increasing) at first
| num_iter steps from start_multiplier
|
| 2) inverse shrink (decreasing) afterwards
| (gamma, power)
|
| 3) lower bounded by end_multiplier
|
| Inv: return (1 + gamma * iter) ^ (-power)
|
| Stores a single integer that gets incremented
| on each call to Run().
|
| Useful for tracking the iteration count
| during SGD, for example.
|
| IterOp runs an iteration counter. I
| cannot think of a case where we would
| need to access the iter variable on device,
| so this will always produce a tensor
| on the CPU side. If the blob already exists
| and is a tensor<int64_t> object, we
| will simply increment it (this emulates
| the case when we want to resume training).
| Otherwise we will have the iter starting
| with 0.
|
| Implement Layer-wise Adaptive Rate
| Scaling (LARS) with clipping. Before
| adding weight decay, given a parameter
| tensor X and its gradient dX, the local
| learning rate for X will be
|
| local_lr = trust * norm(X) / ( norm(dX) + wd * norm(X) + offset * norm(X) )
|
|          = trust / ( norm(dX) / norm(X) + wd + offset ),
|
| where offset is a preset hyper-parameter
| to avoid numerical issue and trust indicates
| how much we trust the layer to change
| its parameters during one update.
|
| In this implementation, we use the L2 norm,
| and the computed local learning rate
| is clipped based on the upper bound lr_max
| and the lower bound lr_min:
|
| local_lr = min(local_lr, lr_max) and
| local_lr = max(local_lr, lr_min)
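|
| A sketch of the local learning rate computation described above, using
| NumPy's L2 norm:
|
|   import numpy as np
|
|   def lars_local_lr(X, dX, trust, wd, offset, lr_min, lr_max):
|       norm_x = np.linalg.norm(X)
|       norm_dx = np.linalg.norm(dX)
|       local_lr = trust * norm_x / (norm_dx + wd * norm_x + offset * norm_x)
|       # Clip to [lr_min, lr_max].
|       return min(max(local_lr, lr_min), lr_max)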
|
| Learning Rate Adaption is an operation
| that performs one iteration of gradient
| descent based on learning rate:
|
| lr(k) = lr(k-1) - lr_alpha * df(k-1)/dlr,
|
| where df(k-1)/dlr is the gradient of
| objective function f on lr, and lr_alpha
| is a learning rate hyperparameter.
|
| It can be proven that df(k-1)/dlr equals
|
| INNERPRODUCT(grad(k-1), -grad(k-2)),
| where grad(k-1) is the grad of f(k-1)
| on parameters.
|
| When the argument "normalized_lr_adaption"
| is false, we simply perform the following
| update:
|
| lr(k) = lr(k-1) - lr_alpha * INNERPRODUCT(grad(k-1), grad(k-2)).
|
| If we set "normalized_lr_adaption"
| to be true, we do not directly apply
| INNERPRODUCT(grad(k-1), -grad(k-2)) as the grad.
|
| Instead, we perform the following update:
|
| lr(k) = lr(k-1) + lr_alpha * cosineSimilarity(grad(k-1), grad(k-2)).
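|
| A sketch of both variants on flattened gradients (illustrative only):
|
|   import numpy as np
|
|   def lr_adaption(lr, grad_prev, grad_prev2, lr_alpha, normalized_lr_adaption):
|       g1, g2 = grad_prev.ravel(), grad_prev2.ravel()
|       if normalized_lr_adaption:
|           cos_sim = np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))
|           return lr + lr_alpha * cos_sim
|       return lr - lr_alpha * np.dot(g1, g2)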
|
| Learning rate is a decreasing function of
| time. With low learning rates the improvements
| will be linear. With high learning rates they will
| start to look more exponential. Learning rate is
| controlled by the following arguments:
|
|
| Required:
|   iterations
|   base_lr: base learning rate
|   policy: this controls how the learning rate is applied, options are:
|     fixed
|     step: uses stepsize, gamma
|     exp: uses gamma
|     gate: uses multiplier_1, multiplier_2, num_iter
|     inv: uses gamma, power
|     linearWarmup: uses start_multiplier, num_iter
|     constantWarmup: uses multiplier, num_iter
|     alter: uses active_first, active_period, inactive_period
|     hill: uses those in both linearWarmup and inv, plus end_multiplier
|     composite: uses sub_policy_num_iters and additional args with the
|       format shown below
|     cyclic: uses max_lr, stepsize
|     cosine: uses min_lr, max_lr, period, t_mult, lr_shrink
|     constantThenLinearWarmup: uses start_warmup_multiplier,
|       constant_warmup_num_iter, linear_warmup_num_iter
|     compositeCyclical: uses start_warmup_multiplier,
|       constant_warmup_num_iter, linear_warmup_num_iter,
|       cyclical_max_lr, cyclical_step_size, cyclical_decay
|     compositeCosine: uses start_warmup_multiplier,
|       constant_warmup_num_iter, linear_warmup_num_iter,
|       cosine_max_lr, cosine_period, cosine_t_mult, cosine_lr_shrink
|
|   sub_policy_{sub_policy_index}_{sub_policy_arg}, for example:
|     sub_policy_0_policy: "exp", sub_policy_0_gamma: 0.99,
|     sub_policy_0_lr_scale: 1.2
|     sub_policy_0_policy: "fixed", sub_policy_0_lr_scale: 1.0
|     sub_policy_num_iters: [1000, 1000]
|
| Optional:
|   stepsize: defaults to 0
|   max_lr: defaults to 0.005
|   gamma: defaults to 0
|   power: defaults to 0
|   num_iter: defaults to 0
|   start_multiplier: defaults to 0
|   multiplier: defaults to 0.5
|   multiplier_1: defaults to 1
|   multiplier_2: defaults to 1
|   m1: defaults to 0.5, the first piece lr of piece warmup
|   n1: defaults to 0, iter threshold of the first piece lr
|   m2: defaults to 0.5, the second piece lr of piece warmup
|   n2: defaults to 0, iter threshold of the second piece lr
|   m3: defaults to 0.5, the third piece lr of piece warmup
|   start_warmup_multiplier: defaults to 0.1, part of constantThenLinearWarmup
|   constant_warmup_num_iter: defaults to 10000000, part of constantThenLinearWarmup
|   linear_warmup_num_iter: defaults to 10000000, part of constantThenLinearWarmup,
|     CompositeCyclicalLRPolicy, CompositeCosineLRPolicy
|   cyclical_max_lr: defaults to 0.05, part of CompositeCyclicalLRPolicy
|   cyclical_step_size: defaults to 1000000, part of CompositeCyclicalLRPolicy
|   cyclical_decay: defaults to 1.0, part of CompositeCyclicalLRPolicy
|   cosine_min_lr: defaults to 0.01, part of CompositeCosineLRPolicy
|   cosine_max_lr: defaults to 0.05, part of CompositeCosineLRPolicy
|   cosine_period: defaults to 50, part of CompositeCosineLRPolicy
|   cosine_t_mult: defaults to 1.0, part of CompositeCosineLRPolicy
|   cosine_lr_shrink: defaults to 0.99, part of CompositeCosineLRPolicy
|
| Usage:
| train_net.LearningRate(iterations, "label", base_lr=float,
|                        policy="policy_name", stepsize=int, gamma=float)
|
|
| Example usage:
| train_net.LearningRate(200, "LR", base_lr=-0.1,
|                        policy="step", stepsize=20, gamma=0.9)
|
| LinearWarmup: return max(iter/num_iter, 1)
|
| Computes a momentum SGD update for an input
| gradient and momentum parameters. Concretely,
| given inputs (grad, m, lr) and parameters
| (momentum, nesterov), computes:
|
| if not nesterov:
| adjusted_gradient = lr * grad + momentum * m
| return (adjusted_gradient, adjusted_gradient)
| else:
| m_new = momentum * m + lr * grad
| return ((1 + momentum) * m_new - momentum * m, m_new)
|
| Output is (grad, momentum)
|
| Note the difference to MomentumSGDUpdate, which
| actually performs the parameter update (and is
| thus faster).
| Performs a momentum SGD update for an input
| gradient and momentum parameters. Concretely,
| given inputs (grad, m, lr, param) and arguments
| (momentum, nesterov), computes:
|
| if not nesterov:
| adjusted_gradient = lr * grad + momentum * m
| param = param - adjusted_gradient
| return (adjusted_gradient, adjusted_gradient, param)
| else:
| m_new = momentum * m + lr * grad
| param = param - ((1 + momentum) * m_new - momentum * m),
| return ((1 + momentum) * m_new - momentum * m, m_new, param)
|
| Output is (grad, momentum, parameter).
|
| Note the difference to MomentumSGD, which returns
| a new gradient but does not perform the parameter
| update.
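|
| A NumPy sketch covering both variants (the momentum default is illustrative);
| dropping the returned param gives the MomentumSGD behavior:
|
|   def momentum_sgd_update(grad, m, lr, param, momentum=0.9, nesterov=False):
|       if not nesterov:
|           adjusted_gradient = lr * grad + momentum * m
|           return adjusted_gradient, adjusted_gradient, param - adjusted_gradient
|       m_new = momentum * m + lr * grad
|       step = (1 + momentum) * m_new - momentum * m
|       return step, m_new, param - step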
|
| Poly: return (1 - iter/max_iter) ^ (power)
|
| Computes the RMSProp update
|
| (http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf).
|
| Concretely, given inputs (grad, mean_squares, mom,
| lr), computes:
|
| mean_squares_o = mean_squares + (1 - decay) * (square(grad) - mean_squares)
| mom_o = momentum * mom + lr * grad / sqrt(epsilon + mean_squares_o)
| grad_o = mom_o
|
| Returns (grad_o, mean_squares_o, mom_o).
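|
| A NumPy sketch of the same update (decay, momentum and epsilon defaults are
| illustrative):
|
|   import numpy as np
|
|   def rmsprop_update(grad, mean_squares, mom, lr,
|                      decay=0.9, momentum=0.9, epsilon=1e-5):  # assumed defaults
|       mean_squares_o = mean_squares + (1 - decay) * (np.square(grad) - mean_squares)
|       mom_o = momentum * mom + lr * grad / np.sqrt(epsilon + mean_squares_o)
|       grad_o = mom_o
|       return grad_o, mean_squares_o, mom_o
|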
| Count the number of recent updates on rows.
|
| Exponential decay is applied on the
| counter with decay rate r, such that
| r^{counter_halflife} = 0.5;
|
| If counter_halflife is nonpositive,
| this operator is turned off.
|
| Fused operator of
|
| SparseLengthsIndicesInGradientSumGradient
| (gradient of SparseLengthsSum) + RowWiseSparseAdagrad.
|
| BW saving analysis for numSegments B,
| L_avg = avg(lengths), block_size D,
| assuming T = float and SIndex = int64_t:
|
| Before fusion,
| SparseLengthsIndicesInGradientSumGradient
| reads B*D*4 and writes B*L_avg*D*4.
| RowWiseSparseAdagrad reads B*2*L_avg*D*4
| and writes B*L_avg*D*4.
| So, the total memory traffic is B*(1+4*L_avg)*D*4.
|
| After fusion, we read B*(1+L_avg)*D*4
| and write B*L_avg*D*4, with total memory
| traffic B*(1+2*L_avg)*D*4.
|
| Assuming L_avg >> 1, the memory BW
| saving is about 2x.
|
| See https://fb.quip.com/ldG7A55Ur5wM
| for more details on BW saving analysis
| and evaluation results.
|
| typename Tdata, // embedding types
| typename T, // everything else
|
| Fused operator of
|
| SparseLengthsIndicesInGradientSumGradient
| (gradient of SparseLengthsSum) + RowWiseSparseAdagrad.
|
| Given inputs (param, moment, indices,
| grad, lr), runs the row-wise sparse
| AdaGrad update on (param, grad, moment[indices],
| lr), and returns (new_param, new_moment)
| as in the dense case. Additional input
| (lengths) is for the fused
| SparseLengthsSumGradient operator.
|
| typename Tdata, // embedding types
| typename T, // everything else
|
| Given inputs (param, moment, indices,
| grad, lr), runs a modified sparse Adagrad
| update on (param, grad, moment[indices],
| lr), and returns (new_param, new_moment),
| where moment is a 1D tensor with length
| equal to the number of rows in param:
| shape(moment) == shape(param)[0].
|
| Each element of moment is applied to
| an entire row of param, and the new moment
| is calculated by adding the average
| squared sum of gradients across each
| row.
|
| Note:
|
| indices must also be a 1D tensor indexing
| into the rows of param.
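|
| A sketch of the row-wise update (assumes grad has one row per entry of
| indices; the epsilon default is illustrative):
|
|   import numpy as np
|
|   def rowwise_sparse_adagrad(param, moment, indices, grad, lr, epsilon=1e-5):
|       for k, i in enumerate(indices):
|           # One moment value per row, grown by the mean squared gradient of that row.
|           moment[i] += np.mean(np.square(grad[k]))
|           param[i] += lr * grad[k] / (np.sqrt(moment[i]) + epsilon)
|       return param, moment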
|
| Computes a modified Adam Update for
| the sparse case.
|
| Given inputs (param, moment1, moment2,
| indices, grad, lr, iter), runs the Adam
| update on (param, moment1[indices],
| moment2[indices], lr, iter) and returns
| (new_param, new_moment1, new_moment2),
| where moment2 is a 1D tensor with length
| equal to the number of rows in param:
|
| shape(moment2) == shape(param)[0].
| Each element of moment2 is applied to
| an entire row of param, and the new moment2
| values are calculated by averaging
| across the row.
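|
| A sketch of this row-wise variant, reusing the dense Adam formulas above
| (beta1, beta2 and epsilon defaults are illustrative):
|
|   import numpy as np
|
|   def rowwise_sparse_adam(param, m1, m2, indices, grad, lr, iters,
|                           beta1=0.9, beta2=0.999, epsilon=1e-8):
|       t = iters + 1
|       correction = np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
|       for k, i in enumerate(indices):
|           m1[i] = beta1 * m1[i] + (1 - beta1) * grad[k]
|           # moment2 keeps a single value per row: the running average of the
|           # mean squared gradient across that row.
|           m2[i] = beta2 * m2[i] + (1 - beta2) * np.mean(np.square(grad[k]))
|           param[i] += lr * correction * m1[i] / (np.sqrt(m2[i]) + epsilon)
|       return param, m1, m2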
|
| Given inputs (param, moment, moment_delta,
| indices, grad, lr), runs the dense AdaDelta
| update on (param, grad, moment[indices],
| moment_delta[indices], lr), and returns
| (new_param, new_moment, new_moment_delta)
| as in the dense case.
|
| Fused operator of
|
| SparseLengthsIndicesInGradientSumGradient
| (gradient of SparseLengthsSum) + SparseAdagrad.
|
| Given inputs (param, moment, indices,
| grad, lr), runs the sparse AdaGrad update
| on (param, grad, moment[indices],
| lr), and returns (new_param, new_moment)
| as in the dense case. Additional input
| (lengths) is for the fused
| SparseLengthsIndicesInGradientSumGradient
| operator.
|
| typename Tdata, // embedding and momentum types
| typename T,     // everything else
| bool is_mean = false>
|
| Given inputs (param, moment, indices,
| grad, lr), runs the dense AdaGrad update
| on (param, grad, moment[indices],
| lr), and returns (new_param, new_moment)
| as in the dense case.
|
| Computes the Adam Update for the sparse
| case.
|
| Given inputs (param, moment1, moment2,
| indices, grad, lr, iter), runs the dense
| Adam on (param, moment1[indices],
| moment2[indices], lr, iter) and returns
| (new_param, new_moment1, new_moment2)
| as in the dense case.
|
| Adam can be customized as Rectified
| Adam (RAdam) by setting enableRAdam
| = true.
|
| Performs a momentum SGD update analogous
| to MomentumSGDUpdate, but using a GradientSlice
| and indices into the full param and momentum
| tables. Both param and momentum should
| be in-place (corresponding inputs
| and outputs should be the same blobs).
|
| This operator implements the STORM (https://arxiv.org/abs/1905.10018)
| optimization algorithm.
|
| Given inputs (param, moment, grad_sq_sum,
| grad, indices, lr), computes the dense
| STORM update on (param, moment[indices],
| grad_sq_sum, grad, lr), and returns
| (new_param, new_moment, new_grad_sq_sum)
| as in the dense case.
|
| This operator implements the optimization
| algorithm in https://arxiv.org/abs/1803.02865
| by Wu, Ward and Bottou.
|
| Given inputs (param, seq_b, indices,
| grad, lr), runs the dense WnGrad update
| on (param, grad, seq_b, lr), and returns
| (new_param, new_seq_b) as in the dense
| case.
|
| Step: return gamma ^ (floor(iter / step))
|
| Computes the STORM
| (https://arxiv.org/abs/1905.10018) update for an
| input gradient and accumulated history of
| gradients. Concretely, given inputs (param,
| moment, grad_sq_sum, grad, lr), computes:
|
| new_grad_sq_sum = grad_sq_sum + norm(grad)^2
| effective_lr = lr / (beta + new_grad_sq_sum)^(1/3)
| alpha = momentum * square(effective_lr)
| new_moment = grad + (1 - alpha) * (moment - grad)
| new_param = param + effective_lr * new_moment
|
| and returns (new_param, new_moment,
| new_grad_sq_sum).
|
| Note that due to a caffe2 limitation, it is
| difficult to re-calculate the gradient of the
| previous iteration using the current example.
| We simplified the calculation of new_moment by
| using the gradient from the current iteration.
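|
| A NumPy sketch of this simplified update (momentum and beta defaults are
| illustrative):
|
|   import numpy as np
|
|   def storm_update(param, moment, grad_sq_sum, grad, lr,
|                    momentum=10.0, beta=0.1):  # assumed defaults
|       new_grad_sq_sum = grad_sq_sum + np.linalg.norm(grad) ** 2
|       effective_lr = lr / (beta + new_grad_sq_sum) ** (1.0 / 3.0)
|       alpha = momentum * effective_lr ** 2
|       new_moment = grad + (1 - alpha) * (moment - grad)
|       new_param = param + effective_lr * new_moment
|       return new_param, new_moment, new_grad_sq_sum
|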
| Every stepsize iterations, multiply the weights
| by a constant scale:
|
| nw = w * scale
| Computes the WnGrad update for an input gradient
| and accumulated history. This operator implement
| the optimization algorithm in
| https://arxiv.org/abs/1803.02865 by Wu, Ward and
| Bottou.
|
| Concretely, given inputs (param, grad, seq_b,
| learning_rate), computes
|
| new_seq_b = seq_b + 1 / seq_b * norm(grad)^2
| effective_lr = learning_rate / (new_seq_b + epsilon)
| update = learning_rate * grad / (new_seq_b + epsilon)
| new_param = param + update
| and returns (new_param, new_seq_b).
|
| Optionally returns effective_lr and update as well.
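|
| A NumPy sketch of the same update (the epsilon default is illustrative):
|
|   import numpy as np
|
|   def wngrad_update(param, grad, seq_b, lr, epsilon=1e-5):  # assumed epsilon
|       new_seq_b = seq_b + np.linalg.norm(grad) ** 2 / seq_b
|       effective_lr = lr / (new_seq_b + epsilon)
|       update = lr * grad / (new_seq_b + epsilon)
|       new_param = param + update
|       return new_param, new_seq_b, effective_lr, update
|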
| YellowFin: An automatic tuner for momentum SGD
| (https://arxiv.org/abs/1706.03471)
|
| The YellowFinOp tunes learning rate and
| momentum and performs momentum SGD steps.
|
| The learning rate and momentum are separate
| for each matrix of parameters.