gtars-tokenizers 0.5.1

Genomic region tokenizers for machine learning in Rust

Genomic data tokenizers and pre-processors to prepare interval data for machine learning pipelines.

The tokenizers module is the most comprehensive module in gtars. It houses all tokenizers that map genomic data onto a known vocabulary. This is especially useful for genomic machine learning models built on NLP architectures such as transformers.
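To make the idea concrete, here is a minimal, self-contained sketch of what "tokenizing into a known vocabulary" means for interval data: each query region is mapped to the ID of a vocabulary interval it overlaps, or to an unknown-token ID otherwise. This is an illustrative simplification, not the gtars implementation; the `Region`, `overlaps`, and `tokenize` names here are hypothetical.

```rust
// Illustrative sketch (NOT the gtars API): map genomic regions to token IDs
// by overlap against a fixed vocabulary of intervals.

#[derive(Debug, Clone, PartialEq)]
struct Region {
    chr: String,
    start: u32,
    end: u32,
}

// Two half-open intervals [start, end) overlap if they share a chromosome
// and their ranges intersect.
fn overlaps(a: &Region, b: &Region) -> bool {
    a.chr == b.chr && a.start < b.end && b.start < a.end
}

// Map each query to the index of the first overlapping vocabulary interval,
// falling back to `unk_id` when nothing overlaps.
fn tokenize(vocab: &[Region], queries: &[Region], unk_id: usize) -> Vec<usize> {
    queries
        .iter()
        .map(|q| vocab.iter().position(|v| overlaps(v, q)).unwrap_or(unk_id))
        .collect()
}

fn main() {
    let vocab = vec![
        Region { chr: "chr1".to_string(), start: 50, end: 150 },
        Region { chr: "chr1".to_string(), start: 300, end: 400 },
    ];
    let queries = vec![
        Region { chr: "chr1".to_string(), start: 100, end: 200 }, // overlaps vocab[0]
        Region { chr: "chr2".to_string(), start: 100, end: 200 }, // no overlap -> unk
    ];
    let unk_id = vocab.len(); // reserve the ID after the vocabulary for "unknown"
    let tokens = tokenize(&vocab, &queries, unk_id);
    assert_eq!(tokens, vec![0, 2]);
    println!("{:?}", tokens);
}
```

A real tokenizer would use an interval tree or sorted sweep rather than the linear scan above, but the input/output shape is the same: regions in, vocabulary IDs out.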

Example

```rust
use std::path::Path;

use gtars_tokenizers::Tokenizer;
use gtars_core::models::Region;

// Build a tokenizer whose vocabulary is the set of intervals in a BED file.
let tokenizer = Tokenizer::from_bed(Path::new("../tests/data/tokenizers/peaks.bed")).unwrap();

let regions = vec![Region {
    chr: "chr1".to_string(),
    start: 100,
    end: 200,
    rest: None,
}];
let tokens = tokenizer.tokenize(&regions);
```