
(Automatic) Differentiation helpers

Automatic Differentiation

This module provides structs for performing Forward and Reverse Automatic Differentiation

Automatic Differentiation is not Numerical Differentiation

You were probably introduced to differentiation as numeric differentiation: if you have a function 3x^2 then you can estimate its gradient at some value x by computing 3x^2 and 3(x+h)^2 where h is a very small value. The line through these two points gives you an approximation of the gradient when you calculate (f(x+h) - f(x)) / h. Unfortunately floating point numbers in computers have limited precision, so this method is only approximate and can result in floating point errors. 1 + 1 might equal 2, but 1 + 10^-i starts to look rather like 1 once i goes into double digits.
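As an illustrative sketch in plain Rust (not using this module), here is how the finite difference estimate degrades as h shrinks:

// estimate the gradient of f(x) = 3x^2 at x = 1; the true gradient is 6
fn f(x: f64) -> f64 {
    3.0 * x * x
}
let x = 1.0;
for &i in [4, 8, 12, 16].iter() {
    let h = 10.0_f64.powi(-i);
    let estimate = (f(x + h) - f(x)) / h;
    println!("h = 1e-{}: estimate = {}", i, estimate);
}
// by h = 1e-16, x + h rounds back to exactly x, so the estimate collapses to 0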

Automatic Differentiation is not Symbolic Differentiation

If you were taught calculus you have probably done plenty of symbolic differentiation by hand. A function 3x^2 can be symbolically differentiated into 6x by applying simple rules to manipulate the algebra. Unfortunately the rules aren’t so simple for more complex expressions such as exponents, logs or trigonometry. Symbolic differentiation can give you expressions which are just as complicated as the original, or more so, and doing it by hand can be error prone. Symbolic Differentiation is also tricky to relate to algorithmic computations that use control structures.
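As a brief sketch (using the Trace type introduced in the Usage section below), automatic differentiation has no such difficulty with control structures, because it differentiates the operations that actually execute:

use easy_ml::differentiation::Trace;
// t^3 computed with a loop rather than as a closed form expression
fn cubed(t: Trace<f32>) -> Trace<f32> {
    let mut product = Trace::constant(1.0);
    for _ in 0..3 {
        product = product * t;
    }
    product
}
let result = cubed(Trace::variable(2.0));
assert_eq!(result.number, 8.0);
// d(t^3)/dt = 3t^2, which is 12 at t = 2
assert_eq!(result.derivative, 12.0);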

Automatic Differentiation

Automatic Differentiation computes the derivative of a function without rewriting the function as symbolic differentiation does and without the precision issues of numerical differentiation by splitting the derivative into lots of tiny steps of basic operations like addition and multiplication. These are combined using the chain rule. The downside is more memory is used than in symbolic or numerical differentiation, as derivatives have to be tracked through the computational graph.
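A hand-worked sketch of this idea (plain Rust, not this module's types) tracks a (value, derivative) pair through each elementary operation of 3x^2 at x = 2, applying the product and constant multiple rules step by step:

// each pair is (value, derivative with respect to x)
let x = (2.0, 1.0);                                 // dx/dx = 1
let x_squared = (x.0 * x.0, x.1 * x.0 + x.0 * x.1); // product rule
let f = (3.0 * x_squared.0, 3.0 * x_squared.1);     // constant multiple rule
assert_eq!(f.0, 12.0); // 3 * 2^2
assert_eq!(f.1, 12.0); // 6x at x = 2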

Forward Differentiation

Forward Differentiation computes all the gradients in a computational graph with respect to an input. For example, if you have a function f(x, y) = 5x^3 - 4x^2 + 10x - y, then for some actual value of x and y you can compute f(x,y) and δf(x,y)/δx together in one forward pass using forward differentiation. You can also make another pass and compute f(x,y) and δf(x,y)/δy for some actual value of x and y. It is possible to avoid redundantly calculating f(x,y) multiple times, but I am waiting on const generics to implement this. Regardless, forward differentiation requires making at least N+1 passes of the function to compute the derivatives of the output with respect to N inputs - and the current implementation will make 2N. However, you do get the gradients for every output in a single pass. This is poorly suited to neural nets as they often have a single output (loss) to differentiate many, many inputs with respect to.
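For a minimal sketch of one such forward pass (using the Trace type introduced in the Usage section below, treating x as the variable and y as a constant, at x = 2 and y = 3):

use easy_ml::differentiation::Trace;
// f(x, y) = 5x^3 - 4x^2 + 10x - y
let x = Trace::variable(2.0_f32);
let y = Trace::constant(3.0);
let f = Trace::constant(5.0) * x * x * x
    - Trace::constant(4.0) * x * x
    + Trace::constant(10.0) * x
    - y;
assert_eq!(f.number, 41.0);     // 40 - 16 + 20 - 3
// δf(x,y)/δx = 15x^2 - 8x + 10, which is 54 at x = 2
assert_eq!(f.derivative, 54.0);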

Reverse Mode Differentiation

Reverse Mode Differentiation computes all the gradients in a computational graph for the same output. For example, if you have a function f(x, y) = 5x^3 - 4x^2 + 10x - y, then for some actual value of x and y you can compute f(x,y) and store all the intermediate results. You can then run a backward pass on the output of f(x, y) and obtain δf(x,y)/δx and δf(x,y)/δy for the actual values of x and y in a single pass. The catch is that reverse mode must store as many intermediate values as there are steps in the function, which can use much more memory than forward mode. Reverse mode also requires making N backward passes to get the gradients for N different outputs. This is well suited to neural nets because we often have a single output (loss) to differentiate many inputs with respect to. However, reverse mode will be slower than forward mode if the number of inputs is small or there are many outputs.
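A matching sketch of a single reverse pass over the same function follows, using the Record and WengertList types introduced in the Usage section below; Record::constant is assumed here to construct a constant that records no history:

use easy_ml::differentiation::{Record, WengertList};
// f(x, y) = 5x^3 - 4x^2 + 10x - y, evaluated at x = 2 and y = 3
let list = WengertList::new();
let x = Record::variable(2.0_f32, &list);
let y = Record::variable(3.0, &list);
// Record::constant (assumed) creates untracked constants for the coefficients
let f = Record::constant(5.0) * x * x * x
    - Record::constant(4.0) * x * x
    + Record::constant(10.0) * x
    - y;
// a single backward pass yields every input's derivative at once
let derivatives = f.derivatives();
assert_eq!(derivatives[&x], 54.0); // δf/δx = 15x^2 - 8x + 10
assert_eq!(derivatives[&y], -1.0); // δf/δy = -1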

Usage

Trace and Record, for forward and reverse automatic differentiation respectively, both implement Numeric and can generally be treated as normal numbers, just like f32 and f64.

Trace is literally implemented as a dual number, and is more or less a one to one substitution. Record requires dynamically building a computational graph of the values and dependencies of each operation performed on them. This means performing operations on Records has side effects: they add entries onto a WengertList. However, when using Record the side effects are abstracted away; just create a WengertList before you start creating Records.

Given some function from N inputs to M outputs you can pass it Traces or Records and retrieve the first derivative from the outputs for all combinations of N and M. If N >> M then you should use Record as reverse mode automatic differentiation is much cheaper. If N << M then you should use Trace as it will be much cheaper. If you have large N and M, or small N and M, you might have to benchmark to find which method works best. However, most problems are N > M.

For this example we use a function which takes two inputs, r and a, and returns two outputs, x and y.

Using Trace

use easy_ml::differentiation::Trace;
use easy_ml::numeric::extra::{Cos, Sin};
fn cartesian(r: Trace<f32>, angle: Trace<f32>) -> (Trace<f32>, Trace<f32>) {
    let x = r * angle.cos();
    let y = r * angle.sin();
    (x, y)
}
// first find dx/dr and dy/dr
let (x, y) = cartesian(Trace::variable(1.0), Trace::constant(2.0));
let dx_dr = x.derivative;
let dy_dr = y.derivative;
// now find dx/da and dy/da
let (x, y) = cartesian(Trace::constant(1.0), Trace::variable(2.0));
let dx_da = x.derivative;
let dy_da = y.derivative;

Using Record

use easy_ml::differentiation::{Record, WengertList};
use easy_ml::numeric::extra::{Cos, Sin};
// the lifetimes tell the rust compiler that our inputs and outputs
// can all live as long as the WengertList
fn cartesian<'a>(r: Record<'a, f32>, angle: Record<'a, f32>)
-> (Record<'a, f32>, Record<'a, f32>) {
    let x = r * angle.cos();
    let y = r * angle.sin();
    (x, y)
}
// first we must construct a WengertList to create records from
let list = WengertList::new();
let r = Record::variable(1.0, &list);
let a = Record::variable(2.0, &list);
let (x, y) = cartesian(r, a);
// first find dx/dr and dx/da
let x_derivatives = x.derivatives();
let dx_dr = x_derivatives[&r];
let dx_da = x_derivatives[&a];
// now find dy/dr and dy/da
let y_derivatives = y.derivatives();
let dy_dr = y_derivatives[&r];
let dy_da = y_derivatives[&a];

Differences

Notice how in the above examples all the same 4 derivatives are found, but in forward mode we rerun the function with a different input as the sole variable (and the rest as constants), whereas in reverse mode we rerun the derivatives() function on a different output variable. With reverse mode we would only pass constants into the cartesian function if we didn’t want their derivatives (avoiding wasting memory on something we don’t need).
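For example, a sketch of the reverse mode version where only r’s derivatives are wanted, reusing the cartesian function from the Using Record example above and assuming Record::constant constructs a Record with no recorded history:

use easy_ml::differentiation::{Record, WengertList};
// only r is tracked on the WengertList; the constant a still contributes
// to the values of x and y but adds no entries to the list
let list = WengertList::new();
let r = Record::variable(1.0, &list);
let a = Record::constant(2.0); // assumed constructor for an untracked constant
let (x, y) = cartesian(r, a);
let dx_dr = x.derivatives()[&r];
let dy_dr = y.derivatives()[&r];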

Substitution

There is no need to rewrite the input functions, as you can use the Numeric and Real traits to write a function that will take floating point numbers, Traces and Records.

use easy_ml::differentiation::{Trace, Record, WengertList};
use easy_ml::numeric::Numeric;
use easy_ml::numeric::extra::Real;
fn cartesian<T: Numeric + Real + Copy>(r: T, angle: T) -> (T, T) {
    let x = r * angle.cos();
    let y = r * angle.sin();
    (x, y)
}
let list = WengertList::new();
let r_record = Record::variable(1.0, &list);
let a_record = Record::variable(2.0, &list);
let (x_record, y_record) = cartesian(r_record, a_record);
// find dx/dr using reverse mode automatic differentiation
let x_derivatives = x_record.derivatives();
let dx_dr_reverse = x_derivatives[&r_record];
let (x_trace, y_trace) = cartesian(Trace::variable(1.0), Trace::constant(2.0));
// now find dx/dr with forward automatic differentiation
let dx_dr_forward = x_trace.derivative;
assert_eq!(dx_dr_reverse, dx_dr_forward);
let (x, y) = cartesian(1.0, 2.0);
assert_eq!(x, x_record.number); assert_eq!(x, x_trace.number);
assert_eq!(y, y_record.number); assert_eq!(y, y_trace.number);

Equivalence

Although in this example the derivatives found are identical, in practice, because forward and reverse mode compute things differently and floating point numbers have limited precision, you should not expect the derivatives to be exactly equal.
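A more defensive comparison than the assert_eq! above allows a small tolerance, for example:

// compare the two modes within a tolerance rather than for exact equality
let difference = (dx_dr_reverse - dx_dr_forward).abs();
assert!(difference <= 1e-5);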

Further information

Modules

Other operator-related implementations for the differentiation module.

Operator implementations for Records.

Operator implementations for Traces.

Structs

Derivatives: Computed derivatives of a computational graph for some output Record variable.

Record: A wrapper around a real number which records it going through the computational graph. This is used to perform Reverse Mode Automatic Differentiation.

Trace: A dual number which traces a real number and keeps track of its derivative. This is used to perform Forward Automatic Differentiation.

WengertList: A list of operations performed in a forward pass of a dynamic computational graph, used for Reverse Mode Automatic Differentiation.

Traits

Primitive: A trait with no methods which is implemented for all primitive types.