A data embedding tool and related data analysis or clustering
The crate will provide:
- Some variations on data embedding tools, from t-SNE (2008) to UMAP (2018). Our implementation is a mix of the various embedding algorithms recently published and mentioned in the References.
  - The graph is initialized by the Hnsw nearest neighbour algorithm. This provides, for free, a sub-sampling of the data to embed by considering only the less densely occupied layers (the upper layers). This generally corresponds to a sub-sampling of 2%-4%, but comes with a guarantee, as the distance between a point left out of the sampling and its nearest sampled neighbour is known. The hnsw structure thus also enables an iterative/hierarchical initialization of the embedding by taking into account an increasing number of layers.
  - The preliminary graph built for the embedding uses an exponential function of the distances to neighbour nodes (as in UMAP). The initial edge weights can be modulated by:
    - considering a power of the distance function to neighbours (see the documentation of the EmbedderParams module);
    - increasing or decreasing the impact of the local density of points around each node.
    There is no symmetrization of the graph, except when the embedding is initialized with diffusion maps; in that case symmetrization is done as in t-SNE or LargeVis. We use the diffusion maps algorithm of Lafon-Keller-Coifman. A sketch of this edge-weight construction is given after this list.
  - We also use a cross entropy optimization of this initial layout, but take into account the initial local density estimate of each point when computing the Cauchy weight of an embedded edge (also illustrated in the sketch after this list).
- An implementation of the Mapper algorithm, using the C++ Ripser module from U. Bauer.
- Some by-products:
  - an implementation of range approximation and approximated SVD for dense and/or row-compressed matrices, as explained in the svdapprox module and in the Halko-Tropp paper (cf. Tsvd and the sketch in the Randomized SVD section below).
  - a Diffusion Maps implementation (a sketch of the classical algorithm is given after this list).
  - a single-linkage hierarchical clustering function.
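As announced in the embedding items above (the exponential edge weights of the initial graph and the Cauchy weight of an embedded edge), here is a minimal illustrative sketch. It is not the crate's API: the function names and the exact parametrisation (`rho`, `scale`, `p`, `local_scale`) are assumptions made for the example; see the EmbedderParams documentation for the real parameters.

```rust
// Illustrative sketch, not the crate's API: an initial edge weight taken as an
// exponential of a (possibly powered) distance, and a Cauchy weight for an embedded
// edge that keeps a per-point scale coming from the initial local density estimate.

/// Weight of the edge (i, j) in the initial graph.
/// `dist`  : distance from node i to its neighbour j
/// `rho`   : distance from i to its nearest neighbour (local shift, as in UMAP)
/// `scale` : local scale around node i, reflecting the local density of points
/// `p`     : power applied to the rescaled distance (p = 1.0 gives the plain exponential)
fn initial_edge_weight(dist: f32, rho: f32, scale: f32, p: f32) -> f32 {
    let d = ((dist - rho).max(0.0) / scale).powf(p);
    (-d).exp()
}

/// Cauchy weight of an embedded edge between low-dimensional points `x` and `y`.
/// `local_scale` lets the initial local density estimate of the point modulate the
/// kernel instead of using a single global bandwidth.
fn embedded_edge_weight(x: &[f32], y: &[f32], local_scale: f32) -> f32 {
    let d2: f32 = x.iter().zip(y.iter()).map(|(a, b)| (a - b) * (a - b)).sum();
    1.0 / (1.0 + d2 / (local_scale * local_scale))
}

fn main() {
    // Toy usage: one neighbour at distance 0.5 in the original space, 2D embedded points.
    let w_init = initial_edge_weight(0.5, 0.1, 0.3, 2.0);
    let w_emb = embedded_edge_weight(&[0.0, 0.0], &[1.0, 0.5], 1.0);
    println!("initial weight = {w_init:.3}, embedded (Cauchy) weight = {w_emb:.3}");
}
```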
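Since diffusion maps appear both as an initialization option for the embedding and as a by-product, here is a minimal sketch of the classical Lafon-Keller-Coifman algorithm on a dense distance matrix. It is illustrative only and independent of the crate's implementation; it assumes ndarray and ndarray-linalg as dependencies, and the bandwidth `eps` and kept dimension `dim` are free parameters of the example.

```rust
// Minimal diffusion maps sketch (illustrative, not the crate's implementation):
// Gaussian kernel, density normalization, then the spectrum of the Markov matrix.
use ndarray::{Array1, Array2, Axis};
use ndarray_linalg::{Eigh, UPLO};

/// `dist2`: n x n matrix of squared distances, `eps`: kernel bandwidth,
/// `dim`: number of diffusion coordinates kept (needs dim + 2 <= n).
fn diffusion_maps(dist2: &Array2<f64>, eps: f64, dim: usize) -> Array2<f64> {
    let n = dist2.nrows();
    // Gaussian affinity matrix.
    let mut w = dist2.mapv(|d2| (-d2 / eps).exp());
    // Density normalization (alpha = 1) to reduce the influence of the sampling density.
    let q = w.sum_axis(Axis(1));
    for i in 0..n {
        for j in 0..n {
            w[[i, j]] /= q[i] * q[j];
        }
    }
    // Symmetric normalization S = D^{-1/2} W D^{-1/2}; after rescaling by D^{-1/2},
    // its eigenvectors are those of the Markov matrix D^{-1} W.
    let d: Array1<f64> = w.sum_axis(Axis(1)).mapv(f64::sqrt);
    for i in 0..n {
        for j in 0..n {
            w[[i, j]] /= d[i] * d[j];
        }
    }
    let (evals, evecs) = w.eigh(UPLO::Upper).unwrap(); // eigenvalues in ascending order
    // Keep the `dim` largest non-trivial eigenpairs (the last one is the trivial,
    // constant direction) and scale each coordinate by its eigenvalue.
    let mut coords = Array2::<f64>::zeros((n, dim));
    for k in 0..dim {
        let col = n - 2 - k;
        for i in 0..n {
            coords[[i, k]] = evals[col] * evecs[[i, col]] / d[i];
        }
    }
    coords
}
```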
The crate is in a preliminary state. Currently only the approximated SVD and a first version of the embedding (with a possible hierarchical initialization) are implemented, but the mnist examples show how to run the embedding even in this preliminary state.
Building
The crate provides 2 features to choose between openblas-static and intel-mkl-static.
So either --features "openblas-static" or --features "intel-mkl-static" must be passed to cargo to compile. Alternatively, define the default feature in Cargo.toml.
Results
These are preliminary results. Timings are given for an 8-core i7 @ 2.3 GHz laptop.
Embedder examples
- MNIST digits database (cf. mnist-digits). It consists of 70000 images of handwritten digits of 784 pixels.
  - initialized by an approximated svd: it took 20s to run, of which 9s were spent in the ann construction.

  - hierarchical initialization: it took 22s, of which 9s were spent in the ann construction.
- MNIST fashion database (cf. mnist-fashion). It consists of 70000 images of clothes.
  - initialized by an approximated svd: time 29s.
  - hierarchical initialization: time 34s.
Randomized SVD
The randomized SVD is based on the paper of Halko-Tropp
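As a pointer to what the method does, here is a minimal sketch of the Halko-Tropp randomized range finder followed by an exact SVD of the small projected matrix. It is illustrative only and does not go through the crate's svdapprox module; it assumes ndarray, ndarray-rand and ndarray-linalg as dependencies, and `rank`/`oversample` are free parameters of the example.

```rust
// A minimal sketch of the Halko-Tropp randomized SVD: sample the range of A with a
// Gaussian test matrix, orthonormalize, project, and take a small exact SVD.
use ndarray::{s, Array1, Array2};
use ndarray_linalg::{QR, SVD};
use ndarray_rand::{rand_distr::StandardNormal, RandomExt};

/// Approximate the `rank` leading singular triplets of the dense matrix `a`,
/// using `oversample` extra random directions (rank + oversample <= min(m, n)).
fn randomized_svd(a: &Array2<f64>, rank: usize, oversample: usize)
    -> (Array2<f64>, Array1<f64>, Array2<f64>)
{
    let l = rank + oversample;
    // 1. Sample the range of `a` with a Gaussian test matrix Omega (n x l).
    let omega = Array2::<f64>::random((a.ncols(), l), StandardNormal);
    let y = a.dot(&omega);                       // m x l sample of the range
    // 2. Orthonormalize the sample: Q is an approximate basis of the range of `a`.
    let (q, _) = y.qr().unwrap();                // m x l
    // 3. Project `a` onto that basis and take an exact SVD of the small matrix B.
    let b = q.t().dot(a);                        // l x n
    let (u_b, sigma, vt) = b.svd(true, true).unwrap();
    let u = q.dot(&u_b.unwrap());                // lift U back to dimension m
    // 4. Truncate to the requested rank.
    (
        u.slice(s![.., ..rank]).to_owned(),
        sigma.slice(s![..rank]).to_owned(),
        vt.unwrap().slice(s![..rank, ..]).to_owned(),
    )
}
```

The range finder is the core of the method; when the singular spectrum decays slowly, a few power iterations on the sample (multiplying again by A and Aᵀ before the QR step) improve the approximation, as discussed in the Halko-Tropp paper.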
Mapper
Installation
Compile with:
- cargo build --release --features "openblas-static" to link with openblas
- cargo build --release --features "intel-mkl-static" to link with intel's mkl library (intel mkl will be automatically downloaded; see the README.md of the crate ndarray-linalg)
References
- Visualizing Data using t-SNE. Van der Maaten and Hinton, 2008.
- Visualizing Large-Scale and High-Dimensional Data. Tang, Liu et al., WWW 2016 (LargeVis).
- PHATE: Visualizing Structure and Transitions for Biological Data Exploration. K.R. Moon et al., 2017.
- UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. L. McInnes, J. Healy and J. Melville, 2018.
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
This software was written on my own while working at CEA