l2 1.0.2

L2 is a Pytorch-style Tensor+Autograd library written in Rust
Documentation
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

-   [Notes]#notes
    -   [Tensor]#tensor
        -   [Todo]#todo
        -   [Done]#done
        -   [wont do]#wont-do
        -   [Notes]#notes-1
    -   [Resources]#resources
        -   [backprop]#backprop
        -   [Autodiff]#autodiff
        -   [Rust]#rust
        -   [SIMD]#simd

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

# Notes

-   no ints or primitive types in generic parameters
    -   there is an rfc, but it was opened in 2017
-   https://www.lpalmieri.com/

## Tensor

### Todo

### Done

-   use vectors to store data
-   separate the view and data parts of tensor to not have to copy everything?
    -   use pointers?
    -   cant use (mut) reference because you need to initialize data somehow, it must belong to the tensor that created it
    -   tensor must be able to mutate data to do in-place operations
    -   option to do in-place or copy
        -   autograd needs copy
            -   https://discuss.pytorch.org/t/how-in-place-operation-affect-autograd/31173
        -   but some ops like `x[:10] = 0` should be able to be done inplace
    -   so don't need to separate data and tensor, but keep in mind for future if you want to do more (or multiple) inplace ops
    -   make two, `slice` and `slice_mut`
    -   just make copies immutable to make it simple
-   right now, `calc_shape_from_slice` will slice a tensor of shape `[2, 2, 2]` to `[1, 1, 1]` if slicing it on `[[0, 1], [0, 1], [0, 1]]`
    -   this means that strides for 3+ d tensors can be wrong
-   vector's overhead shouldn't be much
-   maybe set vec to not allocate extra memory
-   vec is contigous in memory
-   naive slicing is too slow
-   numpy sometimes makes a copy when slicing to keep stuff contigous
-   naive approach slows down with size
-   it looks like numpy will make a copy when slicing
-   dont' use `[start:stop]` syntax for rust, use `.slice(!vec[start:stop])`
-   you should be able to do: **will make copies to make it simple**
    -   `x[1:2, 3:4] = y`
        -   this has to be inplace
            -   return a tensor containing a vector of mut references to values in original tensor?
            -   return vec of mut references to original data
        -   bad idea, instead:
            -   use strides and slice start and stop points to allow access to the right parts of the data vec
            -   return a mut reference to the tensor
    -   `y = x[1:2, 3:4]`
        -   this can be a copy or a view
        -   should be able to do either easily
    -   how ndarray does it https://stackoverflow.com/questions/50400966/is-there-a-rust-ndarray-equivalent-for-numpy-arithmetic-on-a-slice
-   numpy notes
    -   http://scipy-lectures.org/advanced/advanced_numpy/#indexing-scheme-strides
    -   https://ipython-books.github.io/45-understanding-the-internals-of-numpy-to-avoid-unnecessary-array-copying/
-   ndarray heap (https://users.rust-lang.org/t/ndarray-stack-and-heap-memory-and-overhead/25254)
-   pre-allocate size of new vectors
-   add checks for slicing and tensor creation
-   validation should happen at initialization
-   fix error handling and structure
-   panic!("Invalid slice")
-   should panicing be done in main?
-   use smarter indexing
    -   -1 for all values
    -   not needing to say either first or last vals
    -   slicing with an empty slice will return a copy of the orignal tensor
-   reduce number of &[..]
-   reshape
-   benchmarks for previous stuff
-   broadcasting
-   ops
    -   should be like `let c = &a + &b;` and `let c = l2::add(a, b);`
    -   elementwise ops are comparable to numpy until ~ 4096x4096
    -   other should be reference
    -   ops won't return Result<Tensor, TensorError>
        -   that would mean that you would need to run `let x = (a + b).unwrap();`
        -   will panic instead
    -   all ops are tensor-tensor ops
    -   element-wise ops
-   self-ops
    -   pow
    -   sqrt
    -   exp
    -   log
        -   e
        -   10
    -   abs
    -   sin
    -   cos
    -   tan
-   other ops
    -   _dim ops are slower than numpy (30us vs 300us), numpy seems to cache stuff_
    -   _argmax and argmin return f32 tensors_
    -   over dim or all dims
    -   sum
    -   mean
    -   max
    -   min
    -   argmax
    -   argmin
-   use enum for ops?
-   matmul
    -   _about 100x slower than numpy_
    -   _wont implement broadcasting on matmul_
    -   batch matmul
-   check errors
-   concat
-   transpose
-   clone
-   normal
-   uniform
-   autodiff
    -   accumulate gradients
    -   hold reference to parents
        -   lhs and rhs
    -   hold gradient
    -   know grad functions for each op
    -   account for creators
    -   do tensor-tensor op without grad tracking for backend
    -   derivative is `RefCell<Option<Tensor>>`
    -   `borrow_mut()` on derivative and assign to it
    -   use `Rc::new(Tensor::new(Rc::clone(&t))),` so references can be to more than one tensor?
    -   dont need to use option?
        -   no
    -   use a wrapper for grad tracking tensors?
        -   save memory on normal tensors
    -   mark nodes as evaluated?
        -   prevent having to recurse through shared graph multiple times
        -   topological sort
            -   done
        -   in backward mutate lhs and rhs parent's grad not own
            -   works

```
   a
  / \
 *   *
/     \
b      c
\     /
 +   +
  \ /
   d

da =  dd/db + dd/dc
but this will recompute the backwards pass for all the graph above a
topological sorting accumulates gradients for a before going further up the computation graph
```

-   printing of tensors and graph
-   combine 1,2,3,4d ops into one function
-   figure out macros and crates
-   ops
    -   div
    -   pow
    -   sqrt
    -   exp
    -   log10
    -   log
    -   abs
    -   sin
    -   cos
    -   tan
    -   slice
        -   allocates a small (1 element) tensor to satisfy match arms
    -   transpose
        -   allocates a small (1 element) tensor to satisfy match arms
    -   view
    -   concat
    -   sum
    -   mean
-   figure out if you want to use tensor ops in backward or vec ops
    -   decide what to do with new_with_parents
    -   used tensor ops
-   clear gradient
-   blas
-   autograd
    -   benchmarks for each operator
        -   use correct shapes
        -   compare to pytorch and ndarray
    -   only 100s of ns used for topo sort, rest is calling backwards
-   cargo.rs

### wont do

-   change col-major to row-major
-   impl iterator
-   replace indices slice with enum
-   tensor::new shouldn't return result
-   use enum to store diff between two types of indices
    -   [start, end)
    -   -1
-   prevent having to reallocate memory on each backwards pass
    -   clear unneeded memory as soon as you can?
-   impl == on tensors
-   derivative vec for broadcasted mul to a one-element tensor is 6 elements long
    -   problem with derivatives when broadcasting
-   ops
    -   matmul
        -   3 and 4d backwards don't work yet because transpose should only work on two dims
    -   max
    -   min
    -   argmax
    -   argmin
-   fix transpose
-   redo error handling
-   const generics for compile time errors
-   fix blas ci

### Notes

-   L2 is competitive with numpy for small tensors
-   L2 copies slices since thats whats needed for autograd
    -   by default, numpy returns a view
-   takes about 6s to slice all elements from a 64x64x64x64 tensor
-   speed of slicing/allocating cannot be optimized. Numpy takes about 2x the time that l2 does because l2 will always copy a slice. Numpy's native slices are views, but copies are needed for autograd.

## Resources

-   https://dev.to/erikinapeartree/rust-for-machine-learning-simd-blas-and-lapack-4dcg
-   https://docs.rs/rayon/1.3.0/rayon/
-   https://www.google.com/search?q=rust+ndarray+simd&oq=rust+ndarray+simd&aqs=chrome..69i57.3773j0j7&sourceid=chrome&ie=UTF-8
-   https://stackoverflow.com/questions/39477684/should-i-avoid-unwrap-in-production-application/39478185#39478185
-   https://medium.com/@GolDDranks/things-rust-doesnt-let-you-do-draft-f596a3c740a5

### backprop

-   https://datascience.stackexchange.com/questions/20139/gradients-for-bias-terms-in-backpropagation
-   https://cs231n.github.io/optimization-2/
-   https://cs231n.github.io/neural-networks-case-study/#grad
-   https://stackoverflow.com/questions/38082835/backpropagation-in-gradient-descent-for-neural-networks-vs-linear-regression
-   https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
-   https://stackoverflow.com/questions/38082835/backpropagation-in-gradient-descent-for-neural-networks-vs-linear-regression
-   https://github.com/bkkaggle/L2/tree/master#acknowledgements

### Autodiff

-   https://github.com/karpathy/micrograd
-   https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation
    -   https://github.com/ibab/rust-ad
    -   https://github.com/Rufflewind/revad/blob/eb3978b3ccdfa8189f3ff59d1ecee71f51c33fd7/revad.py
    -   https://github.com/srirambandi/ai
-   https://discuss.pytorch.org/t/is-pytorch-autograd-tape-based/13992/3
-   https://www.reddit.com/r/MachineLearning/comments/8ep130/d_how_does_autograd_work/
-   https://github.com/mattjj/autodidact
-   https://github.com/karpathy/recurrentjs
-   https://github.com/karpathy/randomfun
-   https://medium.com/@ralphmao95/simple-autograd-implementation-understand-automatic-differentiation-hand-by-hand-9e86f6d703ab
-   https://evcu.github.io/ml/autograd/
-   https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/
-   https://github.com/maciejkula/wyrm
-   https://medium.com/@maciejkula/building-an-autodifferentiation-library-9ccf32c7a658
-   https://github.com/evcu/numpy_autograd/blob/master/my_autograd.py#L147
-   https://github.com/evcu/numpy_autograd/blob/master/Autograd.ipynb
-   https://cs231n.github.io/optimization-2/

### Rust

-   https://nora.codes/post/what-is-rusts-unsafe/

### SIMD

-   https://opensourceweekly.org/issues/7/