sview-fmindex
Data is in single blob, and fm-index is slice view.
sview-fmindex is a Rust library for building FM-indexes into pre-allocated blobs and using them with minimal copying through slice views.
-fmindex: FM-index is a compressed text index that provides two main operations:- Count the number of occurrences in text
- Locate the positions in text
sview-: In this library, data is stored in one contiguous blob, and the FM-index structure is created as a slice view into this blob.
Architecture
builder
┌────────┐
│ header │
└────────┘
│
|-(build with text)
blob ▼
┌────────┬──────────────────────┐
│ header │ body (LARGE) │
└────────┴──────────────────────┘
│
│-(load index)
fm-index ▼
┌────────┬────────┐
│ header │ view │
└────────┴────────┘
Usage
Basic Example
use ;
use Block2; // Block2 can index 3 types of symbols
// (1) Prepare Data
let text = b"CTCCGTACACCTGTTTCGTATCGGAXXYYZZ".to_vec;
let symbols: & = &;
// (2) Build index
let builder = init.unwrap;
// You have to prepare a blob to build the index.
let blob_size = builder.blob_size;
let mut blob = vec!;
// Build the fm-index to the blob.
builder.build.unwrap;
// Load the fm-index from the blob.
let fm_index = load.unwrap;
// (3) Match with pattern
let pattern = b"TA";
// - count
let count = fm_index.count_pattern;
assert_eq!;
// - locate
let mut locations = fm_index.locate_pattern;
locations.sort; // The locations may not be in order.
assert_eq!;
// All unindexed characters are treated as the same character.
// In the text, X, Y, and Z can match any other unindexed character
let mut locations = fm_index.locate_pattern;
locations.sort;
// Using the b"XXXXX", b"YYYYY", or b"!@#$%" gives the same result.
assert_eq!;
Benchmarks
1 Gbp nucleotide text · 20 bp pattern
| Load strategy | Avg RSS | Peak RSS | Blob load (ms) | Locate per pattern (ns) |
|---|---|---|---|---|
| Full in‑memory | 2.82 GiB | 4.50 GiB | 2,388 | 1,369 ns |
mmap (no Advice) |
0.48 GiB | 0.52 GiB | 0.04 | 1,365 ns |
- Data
- Text: 1 Gbp random nucleotide
- Patterns: 1 000 000 × 20 bp
- Index: Position:
u32, Block:Block2<u64>, Uncompressed
- Hardware
- CPU Intel Xeon E5‑2680 v4 @ 2.40 GHz
- Memory 256 GiB
- OS (Kernel) Ubuntu 20.04.2 LTS (5.4.0‑171‑generic)
- Page size 4 KiB