[][src]Crate rust_lapper

(Docs from nim-lapper, by Brent Pendersen) This module provides a simple data-structure for fast interval searches.

      0  1  2  3  4  5  6  7  8  9  10 11
(0,10]X  X  X  X  X  X  X  X  X  X
(2,5]       X  X  X
(3,8]          X  X  X  X  X
(3,8]          X  X  X  X  X
(3,8]          X  X  X  X  X
(3,8]          X  X  X  X  X
(5,9]                X  X  X  X
(8,11]                        X  X  X

Query: (8, 11]
Answer: ((0,10], (5,9], (8,11])

The main methods are find and seek where the latter uses a cursor and is very fast for cases when the queries are sorted. This is another innovation in this library that allows an additional ~50% speed improvement when consecutive queries are known to be in sort order.

The overlap function for this assumes a zero based genomic coordinate system. So [start, stop) is not inclusive of the stop position for neither the queries, nor the Intervals.

Lapper does not use an interval tree, instead, it operates on the assumtion that most intervals are of similar length; or, more exactly, that the longest interval in the set is not long compred to the average distance between intervals.

For cases where this holds true (as it often does with genomic data), we can sort by start and use binary search on the starts, accounting for the length of the longest interval. The advantage of this approach is simplicity of implementation and speed. In realistic tests queries returning the overlapping intervals are 1000 times faster than brute force and queries that merely check for the overlaps are > 5000 times faster.

On any dataset where that is not the case, It is suggested that you set the {find,seek}_skip methods. This allows the lapper to redo its binary search if no intervals have been found in the last N intervals seen, as would happen in a scenario with very long intervals. An example of this can be seen in the diagram above. Without setting max_misses, every interval between the first and the last one would need to be tested. By seeting max_misses to 2, after (2, 5] and the first (3, 8] miss, the sublist that hasn't been tested yet would be binar searched to find a better offset, and skip the other 3 (3, 8], and head strait to (5, 9]. You should benchmark this for your data. It can cause a best case dataset to be slower, as just iterating over misses is still often faster than redoing a binary search.

A better alternative, if it's possible in your scenario, is to use merge_overlaps first, and then use the normal find and seek methods.

Examples

   use rust_lapper::{Interval, Lapper};
   use std::cmp;
   type Iv = Interval<u32>;

   // create some fake data
   let data: Vec<Iv> = (0..20).step_by(5).map(|x| Iv{start: x, stop: x + 2, val: 0}).collect();
   println!("{:#?}", data);

   // make lapper structure
   let laps = Lapper::new(data);

   assert_eq!(laps.find(6, 11).next(), Some(&Iv{start: 5, stop: 7, val: 0}));
    
   // Demonstration of seek function. By passing in the &mut cursor, seek can have thread local
   // cursors going
   let mut sim: usize = 0;
   let mut cursor = 0;
   // Calculate the overlap between the query and the found intervals, sum total overlap
   for i in (0..10).step_by(3) {
       sim += laps
           .seek(i, i + 2, &mut cursor)
           .map(|iv| cmp::min(i + 2, iv.stop) - cmp::max(i, iv.start))
           .sum::<usize>();
   }
   assert_eq!(sim, 4);

Structs

Interval

Represent a range from [start, stop) Inclusive start, exclusive of stop

IterFind

Find Iterator

IterFindSkip
IterLapper

Lapper Iterator

Lapper

Primary object of the library. The public intervals holds all the intervals and can be used for iterating / pulling values out of the tree.