Crate seq_geom_parser

§seq_geom_parser

A parser and extractor for sequencing read geometry descriptions.

Geometry strings describe the layout of technical and biological sequences within sequencing reads. For example, 1{b[16]u[12]x:}2{r:} means:

  • Read 1: 16bp cell barcode, 12bp UMI, discard rest
  • Read 2: biological read (full length)
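
To make the layout concrete, the sketch below slices read 1 of that geometry by hand using fixed offsets. It is a standalone illustration only: `split_read1` is a hypothetical helper, not part of this crate's API, and the crate's own extractors may behave differently (for example on short reads).

```rust
/// Hypothetical helper (not part of seq_geom_parser): split read 1 of the
/// geometry `1{b[16]u[12]x:}` into (barcode, UMI) using fixed offsets.
fn split_read1(read: &[u8]) -> Option<(&[u8], &[u8])> {
    const BARCODE_LEN: usize = 16; // b[16]
    const UMI_LEN: usize = 12; // u[12]
    if read.len() < BARCODE_LEN + UMI_LEN {
        return None; // read too short to hold both fields
    }
    let barcode = &read[..BARCODE_LEN];
    let umi = &read[BARCODE_LEN..BARCODE_LEN + UMI_LEN];
    // Everything after the UMI corresponds to `x:` and is discarded.
    Some((barcode, umi))
}

fn main() {
    let read1 = b"AAAACCCCGGGGTTTTACGTACGTACGTNNNNNN";
    if let Some((bc, umi)) = split_read1(read1) {
        println!(
            "barcode={} umi={}",
            String::from_utf8_lossy(bc),
            String::from_utf8_lossy(umi)
        );
    }
}
```

Read 2 (`r:`) needs no slicing at all in this picture: the whole read is the biological sequence.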

§Supported tags

Tag        | Meaning                            | Example
b[N]       | Cell barcode                       | b[16]
bN[L]      | Numbered barcode at level N        | b0[8]
s[N]       | Sample barcode (sugar for b0)      | s[8]
u[N]       | UMI                                | u[12]
r[N] / r:  | Biological read (fixed/unbounded)  | r[50], r:
f[SEQ]     | Fixed anchor sequence              | f[TTGCTAGGACCG]
x[N] / x:  | Discard (fixed/unbounded)          | x[18], x:
x[N-M]     | Variable-length discard            | x[0-3]

§Distance functions

Fixed anchors can be wrapped in distance functions for approximate matching:

  • hamming(f[SEQ], N) — match within Hamming distance N
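
As a rough illustration of what matching within Hamming distance N means, the sketch below counts mismatches and scans a read for the first window that is close enough to an anchor. Both functions are hypothetical stand-ins written for this example. They are not the crate's implementation, and returning the leftmost hit is a simplification made here.

```rust
/// Number of mismatching positions between two equal-length sequences.
/// Returns None when the lengths differ, since Hamming distance is only
/// defined for sequences of the same length.
fn hamming(a: &[u8], b: &[u8]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).filter(|(x, y)| x != y).count())
}

/// Hypothetical helper: leftmost position where `anchor` matches a window of
/// `read` with at most `max_dist` mismatches, plus the observed distance.
fn find_anchor(read: &[u8], anchor: &[u8], max_dist: usize) -> Option<(usize, usize)> {
    if anchor.is_empty() || anchor.len() > read.len() {
        return None;
    }
    read.windows(anchor.len())
        .enumerate()
        .filter_map(|(pos, w)| hamming(w, anchor).map(|d| (pos, d)))
        .find(|&(_, d)| d <= max_dist)
}

fn main() {
    // "ACCT" at offset 2 differs from the anchor "ACGT" in one position.
    println!("{:?}", find_anchor(b"TTACCTGG", b"ACGT", 1));
    // With no mismatches allowed, nothing in this read matches.
    println!("{:?}", find_anchor(b"TTACCTGG", b"ACGT", 0));
}
```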

§Variable-length normalization

Variable-length tags are supported when a downstream fixed anchor makes their boundaries inferable, such as b[9-10]u[12]f[SEQ] or x[0-3]f[SEQ]s[10]. Normalization helpers are exposed in normalize so callers can pad extracted variable-length barcode/UMI sequences to their declared maximum width when needed.
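
The idea of padding to the declared maximum width can be sketched as follows. `pad_to_max` and the choice of pad base are assumptions made for this example. The helpers in normalize are described as collision-free, which this naive version is not (a 9-mer padded with `A` is indistinguishable from a genuine 10-mer ending in `A`), so the crate's actual scheme presumably differs.

```rust
/// Hypothetical helper (not the crate's API): pad `seq` on the right with
/// `pad` until it is `max_len` bytes long. Returns None if `seq` is already
/// longer than the declared maximum.
fn pad_to_max(seq: &[u8], max_len: usize, pad: u8) -> Option<Vec<u8>> {
    if seq.len() > max_len {
        return None;
    }
    let mut out = seq.to_vec();
    out.resize(max_len, pad); // no-op when seq is already max_len long
    Some(out)
}

fn main() {
    // For b[9-10], a 9 bp barcode is padded to the maximum width of 10.
    let short = pad_to_max(b"ACGTACGTA", 10, b'A').unwrap();
    let full = pad_to_max(b"ACGTACGTAC", 10, b'A').unwrap();
    println!("{}", String::from_utf8_lossy(&short));
    println!("{}", String::from_utf8_lossy(&full));
}
```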

§Complexity Tiers

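The public API distinguishes three extraction tiers, reported by geometry_complexity as GeometryComplexity::FixedOffsets, GeometryComplexity::InferableVariable, and GeometryComplexity::BoundaryResolved.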

§Boundary Resolution

For GeometryComplexity::BoundaryResolved geometries, extraction proceeds in two phases:

  1. Resolve anchor positions in read order.
  2. Assign the spans between those resolved boundaries to fields.

If multiple anchor placements satisfy the geometry, the solver chooses the monotone placement chain with the minimum total distance score. Ties are broken by choosing the lexicographically leftmost anchor positions.
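
The selection rule can be sketched with an exhaustive search. Everything below is an assumption made for illustration: the `Anchor` struct, the use of summed Hamming mismatches as the distance score, and the requirement that anchors do not overlap. The crate's solver is not shown in this documentation and is likely implemented differently. Placements are enumerated in lexicographic order and replaced only by a strictly better score, so ties fall to the leftmost chain.

```rust
/// Mismatch count between two equal-length byte slices.
fn mismatches(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Hypothetical anchor description: a sequence and its allowed distance.
struct Anchor<'a> {
    seq: &'a [u8],
    max_dist: usize,
}

/// Depth-first enumeration of monotone, non-overlapping anchor placements.
/// Chains are visited in lexicographic order of their positions.
fn search(
    read: &[u8],
    anchors: &[Anchor<'_>],
    start: usize,
    score: usize,
    chain: &mut Vec<usize>,
    best: &mut Option<(usize, Vec<usize>)>,
) {
    let Some((first, rest)) = anchors.split_first() else {
        // All anchors placed. Replace the incumbent only on a strictly lower
        // score, so the earliest (leftmost) chain wins ties.
        if best.as_ref().map_or(true, |(s, _)| score < *s) {
            *best = Some((score, chain.clone()));
        }
        return;
    };
    let len = first.seq.len();
    if len == 0 || start + len > read.len() {
        return; // no room left for this anchor
    }
    for pos in start..=read.len() - len {
        let d = mismatches(&read[pos..pos + len], first.seq);
        if d <= first.max_dist {
            chain.push(pos);
            search(read, rest, pos + len, score + d, chain, best);
            chain.pop();
        }
    }
}

/// Returns (total score, anchor start positions) for the chosen chain.
fn resolve(read: &[u8], anchors: &[Anchor<'_>]) -> Option<(usize, Vec<usize>)> {
    let mut best = None;
    search(read, anchors, 0, 0, &mut Vec::new(), &mut best);
    best
}

fn main() {
    let anchors = [Anchor { seq: b"ACGT", max_dist: 1 }];
    // An exact match at offset 6 beats the one-mismatch match at offset 0.
    println!("{:?}", resolve(b"ACGAGGACGT", &anchors));
    // Two equally good matches: the leftmost one (offset 0) is chosen.
    println!("{:?}", resolve(b"ACGAACGA", &anchors));
}
```

Once the anchor positions are fixed, the second phase reduces to slicing the read between consecutive boundaries. The exhaustive search here trades speed for clarity.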

§Public Model vs Compiled Executor

The boundary-oriented types exposed from types describe the public semantic model of a geometry: boundaries, anchors, and segments between resolved boundaries.

They are not the same as the extractor’s internal compiled representation. CompiledGeom compiles parsed geometries into private extraction plans in extract that are optimized for the hot path. This split is intentional: the public types document the model and complexity hierarchy, while the executor keeps a separate IR that can evolve for performance without changing the public API.

§Examples By Tier

use seq_geom_parser::{geometry_complexity, parse_geometry, GeometryComplexity};

let simple = parse_geometry("1{b[16]u[12]x:}2{r:}").unwrap();
assert_eq!(geometry_complexity(&simple), GeometryComplexity::FixedOffsets);

let inferable = parse_geometry("1{b[9-10]f[ACGT]u[12]}2{r:}").unwrap();
assert_eq!(
    geometry_complexity(&inferable),
    GeometryComplexity::InferableVariable
);

let boundary = parse_geometry("1{r:f[ACAGT]b[9-11]}2{u[12]x:}").unwrap();
assert_eq!(
    geometry_complexity(&boundary),
    GeometryComplexity::BoundaryResolved
);

§Re-exports

pub use extract::BoundaryResolvedExtractor;
pub use extract::CompiledGeom;
pub use extract::ExtractedSeqs;
pub use extract::GeomMeta;
pub use extract::InferableExtractor;
pub use extract::NormalizationMeta;
pub use extract::SimpleExtractor;
pub use parse::format_errors;
pub use parse::geometry_complexity;
pub use parse::parse_geometry;
pub use parse::validate_geometry;
pub use types::*;

§Modules

extract    | Sequence extraction from reads using a compiled geometry.
normalize  | Variable-length barcode/UMI normalization via collision-free padding.
parse      | Chumsky-based parser for sequencing read geometry descriptions.
types      | Types for describing sequencing read geometry.