gtars_tokenizers/lib.rs
//! # gtars-tokenizers
//!
//! Wrapper around gtars-overlaprs for producing tokens for machine learning models.
//!
//! ## Purpose
//!
//! This module wraps the core overlap infrastructure from gtars-overlaprs to convert
//! genomic regions into vocabulary tokens for machine learning pipelines. It is
//! specifically designed for ML applications that need to represent genomic intervals
//! as discrete tokens.
//!
//! ## Design Philosophy
//!
//! All overlap computation is delegated to gtars-overlaprs. This module focuses on:
//! - Token vocabulary management
//! - Encoding/decoding strategies
//! - Integration with ML frameworks (HuggingFace, etc.)
//!
//! ## Use Cases
//!
//! - **Transformer Models**: Convert genomic regions to token sequences
//! - **Feature Extraction**: Represent intervals as discrete features for ML
//! - **Language Model Input**: Prepare genomic data for NLP-based models
//!
//! ## Main Components
//!
//! - **`Tokenizer`**: Maps regions to vocabulary tokens using overlap detection
//! - **`Universe`**: Vocabulary of genomic regions (peaks/intervals)
//!
//! ## Example
//!
//! ```rust
//! use std::path::Path;
//! use gtars_tokenizers::Tokenizer;
//! use gtars_core::models::Region;
//!
//! let tokenizer = Tokenizer::from_bed(Path::new("../tests/data/tokenizers/peaks.bed")).unwrap();
//!
//! let regions = vec![Region {
//!     chr: "chr1".to_string(),
//!     start: 100,
//!     end: 200,
//!     rest: None,
//! }];
//! let tokens = tokenizer.tokenize(&regions);
//! ```
//!
pub mod config;
pub mod encoding;
pub mod error;
pub mod tokenizer;
pub mod universe;
pub mod utils;

// Re-export the public API at the crate root.
pub use encoding::*;
pub use error::*;
pub use tokenizer::*;
pub use universe::*;
pub use utils::*;

// Constants shared with the CLI.
pub mod consts {
    pub const TOKENIZERS_CMD: &str = "tokenize";
}
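
The module docs above describe the `Tokenizer` as mapping regions to vocabulary tokens via overlap detection, with the actual overlap computation delegated to gtars-overlaprs. The following standalone sketch illustrates that mapping idea using only std types; the `VocabInterval` struct and `tokenize_sketch` function are illustrative stand-ins invented here, not part of this crate's API.

```rust
/// Illustrative only: a vocabulary interval on one chromosome,
/// paired with the token id it maps to.
struct VocabInterval {
    start: u32,
    end: u32,
    id: u32,
}

/// Half-open intervals [a.0, a.1) and [b.0, b.1) overlap iff each
/// starts before the other ends.
fn overlaps(a: (u32, u32), b: (u32, u32)) -> bool {
    a.0 < b.1 && b.0 < a.1
}

/// Map a query interval to the ids of every vocabulary interval it
/// overlaps. (The real crate delegates this step to gtars-overlaprs,
/// which uses indexed structures instead of a linear scan.)
fn tokenize_sketch(query: (u32, u32), vocab: &[VocabInterval]) -> Vec<u32> {
    vocab
        .iter()
        .filter(|v| overlaps(query, (v.start, v.end)))
        .map(|v| v.id)
        .collect()
}
```

A query region such as `(100, 200)` against a vocabulary of `[0, 150)`, `[150, 300)`, and `[500, 600)` would yield the ids of the first two intervals only, which is the discrete-token representation the module docs refer to.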