// seq_geom_parser/lib.rs
//! # seq_geom_parser
//!
//! A parser and extractor for sequencing read geometry descriptions.
//!
//! Geometry strings describe the layout of technical and biological sequences
//! within sequencing reads. For example, `1{b[16]u[12]x:}2{r:}` means:
//! - Read 1: 16bp cell barcode, 12bp UMI, discard rest
//! - Read 2: biological read (full length)
//!
//! ## Supported tags
//!
//! | Tag | Meaning | Example |
//! |-----|---------|---------|
//! | `b[N]` | Cell barcode | `b[16]` |
//! | `bN[L]` | Numbered barcode at level N | `b0[8]` |
//! | `s[N]` | Sample barcode (sugar for b0) | `s[8]` |
//! | `u[N]` | UMI | `u[12]` |
//! | `r[N]` / `r:` | Biological read (fixed/unbounded) | `r[50]`, `r:` |
//! | `f[SEQ]` | Fixed anchor sequence | `f[TTGCTAGGACCG]` |
//! | `x[N]` / `x:` | Discard (fixed/unbounded) | `x[18]`, `x:` |
//! | `x[N-M]` | Variable-length discard | `x[0-3]` |
//!
//! ## Distance functions
//!
//! Fixed anchors can be wrapped in distance functions for approximate matching:
//! - `hamming(f[SEQ], N)`: match `SEQ` with at most `N` mismatches (Hamming distance `N`)
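//!
//! As a rough illustration of what matching within a Hamming distance means,
//! the sketch below scans a read for the leftmost window that differs from the
//! anchor in at most `max_dist` positions. The names `hamming` and
//! `find_anchor` are hypothetical, and the linear scan is an assumption made
//! for clarity; it is not this crate's matcher.
//!
//! ```rust
//! // Number of mismatching positions, or `None` if the lengths differ.
//! fn hamming(a: &[u8], b: &[u8]) -> Option<usize> {
//!     if a.len() != b.len() {
//!         return None;
//!     }
//!     Some(a.iter().zip(b).filter(|(x, y)| x != y).count())
//! }
//!
//! // Leftmost offset at which `anchor` matches `read` with at most
//! // `max_dist` mismatches (hypothetical helper, not part of this crate).
//! fn find_anchor(read: &[u8], anchor: &[u8], max_dist: usize) -> Option<usize> {
//!     if anchor.is_empty() || anchor.len() > read.len() {
//!         return None;
//!     }
//!     read.windows(anchor.len())
//!         .position(|w| hamming(w, anchor).map_or(false, |d| d <= max_dist))
//! }
//!
//! fn main() {
//!     // The read carries the anchor at offset 4 with a single substitution.
//!     let read = b"AACCTTGCTAGGTCCGAA";
//!     let anchor = b"TTGCTAGGACCG";
//!     assert_eq!(find_anchor(read, anchor, 1), Some(4));
//!     // With no mismatches allowed, the anchor is not found.
//!     assert_eq!(find_anchor(read, anchor, 0), None);
//! }
//! ```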
//!
//! ## Variable-length normalization
//!
//! Variable-length tags are supported when a downstream fixed anchor makes
//! their boundaries inferable, such as `b[9-10]u[12]f[SEQ]` or
//! `x[0-3]f[SEQ]s[10]`. Normalization helpers are exposed in [`normalize`] so
//! callers can pad extracted variable-length barcode/UMI sequences to their
//! declared maximum width when needed.
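//!
//! The idea of padding to the declared maximum width can be sketched as
//! follows. `pad_to_max` is a hypothetical helper and the `N` pad byte is an
//! assumption chosen for illustration; the actual helpers and padding scheme
//! in the `normalize` module may differ.
//!
//! ```rust
//! // Right-pad `seq` with `pad` until it is `max_len` bytes long.
//! // Sequences already at (or beyond) `max_len` are returned unchanged.
//! fn pad_to_max(seq: &[u8], max_len: usize, pad: u8) -> Vec<u8> {
//!     let mut out = seq.to_vec();
//!     if out.len() < max_len {
//!         out.resize(max_len, pad);
//!     }
//!     out
//! }
//!
//! fn main() {
//!     // For `b[9-10]`, a 9 bp barcode is padded to 10 bp ...
//!     assert_eq!(pad_to_max(b"ACGTACGTA", 10, b'N'), b"ACGTACGTAN".to_vec());
//!     // ... and a 10 bp barcode is left as-is.
//!     assert_eq!(pad_to_max(b"ACGTACGTAC", 10, b'N'), b"ACGTACGTAC".to_vec());
//! }
//! ```
//!
//! Padding with a byte that cannot occur in a real barcode keeps sequences of
//! different observed lengths distinguishable; padding with a regular base
//! would not guarantee that.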
//!
//! ## Complexity Tiers
//!
//! The public API distinguishes three extraction tiers:
//! - [`GeometryComplexity::FixedOffsets`]: every extracted field has a static
//!   offset. Example: `1{b[16]u[12]x:}2{r:}`.
//! - [`GeometryComplexity::InferableVariable`]: one variable-width region per
//!   read, inferred from a fixed right boundary. Example:
//!   `1{b[9-10]f[ACGT]u[12]}2{r:}`.
//! - [`GeometryComplexity::BoundaryResolved`]: the read must first be split by
//!   resolved boundaries such as anchors and read ends. Example:
//!   `1{r:f[ACAGT]b[9-11]}2{u[12]x:}`.
//!
//! ## Boundary Resolution
//!
//! For [`GeometryComplexity::BoundaryResolved`] geometries, extraction proceeds
//! in two phases:
//! 1. Resolve anchor positions in read order.
//! 2. Assign the spans between those resolved boundaries to fields.
//!
//! If multiple anchor placements satisfy the geometry, the solver chooses the
//! monotone placement chain with the minimum total distance score. Ties are
//! broken in favor of the chain whose anchor positions are lexicographically
//! smallest, that is, the leftmost placement at the first anchor where the tied
//! chains differ.
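//!
//! The selection rule above can be sketched with a brute-force search. This is
//! an illustration only: `best_chain`, the `(position, distance)` candidate
//! tuples, and the requirement that each anchor start at or after the end of
//! the previous one are assumptions made for the example, not the solver's
//! actual data structures or algorithm.
//!
//! ```rust
//! // One candidate placement of an anchor: (start offset, distance score).
//! type Candidate = (usize, usize);
//!
//! // Pick one candidate per anchor, in read order, and keep the chain with
//! // the smallest (total distance, positions) key. Comparing the key as a
//! // tuple gives the lexicographic tie-break on positions for free.
//! fn best_chain(cands: &[Vec<Candidate>], lens: &[usize]) -> Option<(usize, Vec<usize>)> {
//!     fn go(
//!         i: usize,
//!         min_start: usize,
//!         dist: usize,
//!         chain: &mut Vec<usize>,
//!         cands: &[Vec<Candidate>],
//!         lens: &[usize],
//!         best: &mut Option<(usize, Vec<usize>)>,
//!     ) {
//!         if i == cands.len() {
//!             let key = (dist, chain.clone());
//!             if best.as_ref().map_or(true, |b| key < *b) {
//!                 *best = Some(key);
//!             }
//!             return;
//!         }
//!         for &(pos, d) in &cands[i] {
//!             // Monotone: anchor `i` may not start before the previous one ends.
//!             if pos >= min_start {
//!                 chain.push(pos);
//!                 go(i + 1, pos + lens[i], dist + d, chain, cands, lens, best);
//!                 chain.pop();
//!             }
//!         }
//!     }
//!     let mut best = None;
//!     go(0, 0, 0, &mut Vec::new(), cands, lens, &mut best);
//!     best
//! }
//!
//! fn main() {
//!     // Two anchors of length 4, each with two candidate placements.
//!     let cands = vec![vec![(2, 1), (5, 0)], vec![(6, 0), (10, 1)]];
//!     let lens = [4, 4];
//!     // Chains [2, 6] and [5, 10] both score 1; [2, 6] is lexicographically
//!     // smaller, so it wins the tie.
//!     assert_eq!(best_chain(&cands, &lens), Some((1, vec![2, 6])));
//! }
//! ```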
//!
//! ## Public Model vs Compiled Executor
//!
//! The boundary-oriented types exposed from [`types`] describe the public
//! semantic model of a geometry: boundaries, anchors, and segments between
//! resolved boundaries.
//!
//! They are not the same as the extractor's internal compiled representation.
//! [`CompiledGeom`] compiles parsed geometries into private extraction plans in
//! [`extract`] that are optimized for the hot path. This split is intentional:
//! the public types document the model and complexity hierarchy, while the
//! executor keeps a separate IR that can evolve for performance without
//! changing the public API.
//!
//! ## Examples By Tier
//!
//! ```rust
//! use seq_geom_parser::{geometry_complexity, parse_geometry, GeometryComplexity};
//!
//! let simple = parse_geometry("1{b[16]u[12]x:}2{r:}").unwrap();
//! assert_eq!(geometry_complexity(&simple), GeometryComplexity::FixedOffsets);
//!
//! let inferable = parse_geometry("1{b[9-10]f[ACGT]u[12]}2{r:}").unwrap();
//! assert_eq!(
//!     geometry_complexity(&inferable),
//!     GeometryComplexity::InferableVariable
//! );
//!
//! let boundary = parse_geometry("1{r:f[ACAGT]b[9-11]}2{u[12]x:}").unwrap();
//! assert_eq!(
//!     geometry_complexity(&boundary),
//!     GeometryComplexity::BoundaryResolved
//! );
//! ```

pub mod extract;
pub mod normalize;
pub mod parse;
pub mod types;

// Re-export key types at crate root
pub use extract::{
    BoundaryResolvedExtractor, CompiledGeom, ExtractedSeqs, GeomMeta, InferableExtractor,
    NormalizationMeta, SimpleExtractor,
};
pub use parse::{format_errors, geometry_complexity, parse_geometry, validate_geometry};
pub use types::*;