This is an lock-free efficient implementation of a covertree for data science. The traditional application of this is for KNN, but it can be applied and used in lots of other applications.
The structure is controlled by 3 parameters, the most important of which is the scale_base. This should be between 1.2 and 2ish. A higher value will create more outliers. Outliers are not loaded into ram at startup, but a high value slows down creation of a tree significantly. Theoretically, this value doesn’t matter to the big O time, but I wouldn’t go above 2.
The cutoff value controls how many points a leaf is allowed to cover. A smaller value gives faster bottom level queries, but at the cost of higher memory useage. Do not expect a value of 100 will give 1/100 the memory useage of a value of 1. It’d be closer to 1/10 or 1/5th. This is because the distribution of the number of children of a node. A high cutoff will increase the compute of the true-knn by a little bit.
The resolution is the minimum scale index, this again reduces memory footprint and increases the query time for true KNN. Once a node’s resolution dips below this value we stop and covert the remaining coverage into a leaf. This is mainly to stop before floating point errors become an issue. Try to choose it to result in a cutoff of about 2^-9.
See the git readme for a description of the algo.
The errors that can occor when a cover tree is loading, working or saving.
Most errors are floated up from
Interfacees that simplify bulk queries
Tools and data structures for assisting cover tree queries.
Utility functions for i/o
A construction object for a covertree. See
Container for the parameters governing the construction of the covertree
Cover Tree Reader Head
When 2 spheres overlap under a node, and there is a point in the overlap we have to decide to which sphere it belongs. As we create the nodes in a particular sequence, we can assign them to the first to be created or we can assign it to the nearest.
Like with a node address, the clusters are segmented by layer so we also reference by layer. The ClusterID is not meaningful, it’s just a uint.
Helper struct for iterating thru the reader’s of the the layers.
The data structure explicitly seperates the covertree by layer, and the addressing schema for nodes is a pair for the layer index and the center point index of that node.