scdata
A high-performance Rust library for constructing sparse single-cell UMI count data, with deterministic export to Matrix Market (10x-style) format.
scdata is built around a strict separation of concerns:
- data accumulation (UMI-aware)
- export (deterministic, index-driven)
It intentionally does not implement matrix operations.
Overview
scdata is designed for building production-grade single-cell pipelines in Rust.
It provides:
- Incremental UMI-aware insertion
- Duplicate detection
- Efficient merging
- Deterministic export using a canonical feature index
- Matrix Market compatibility
Core Concepts
Scdata
Sparse container for single-cell data.
- Stores cells in 256 buckets (hash-partitioned)
- Tracks unique (feature_id, UMI) pairs
- Aggregates counts per feature
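The bucket layout can be pictured with a small sketch. This is not the crate's implementation: the type name BucketedCells, the generic payload V, and the rule "low 8 bits of the cell ID pick the bucket" are all assumptions made for illustration. The source only states that cells are hash-partitioned into 256 buckets.

```rust
use std::collections::HashMap;

/// Number of buckets, as stated above.
const N_BUCKETS: usize = 256;

/// Hypothetical container: one map per bucket, keyed by the u64 cell ID.
struct BucketedCells<V> {
    buckets: Vec<HashMap<u64, V>>,
}

impl<V: Default> BucketedCells<V> {
    fn new() -> Self {
        Self {
            buckets: (0..N_BUCKETS).map(|_| HashMap::new()).collect(),
        }
    }

    /// Assumed partitioning rule: the low 8 bits of the cell ID select the bucket.
    fn bucket_of(cell_id: u64) -> usize {
        (cell_id & 0xFF) as usize
    }

    /// Returns the per-cell payload, creating an empty one on first access.
    fn cell_mut(&mut self, cell_id: u64) -> &mut V {
        self.buckets[Self::bucket_of(cell_id)]
            .entry(cell_id)
            .or_default()
    }

    /// Total number of distinct cells across all buckets.
    fn len(&self) -> usize {
        self.buckets.iter().map(|b| b.len()).sum()
    }
}
```

One plausible benefit of such a layout is that two containers can be merged bucket by bucket, so each step only touches a small map.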
CellData
Per-cell storage:
- seen: unique (feature_id, UMI) pairs
- total_reads: aggregated counts per feature
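A minimal sketch of how the two fields could interact during insertion and merging. The field names seen and total_reads come from the description above. The struct name CellSketch, the field types, and the merge strategy are assumptions and may differ from the real CellData.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical stand-in for CellData.
#[derive(Default)]
struct CellSketch {
    /// Unique (feature_id, UMI) pairs observed for this cell.
    seen: HashSet<(u64, u64)>,
    /// Per-feature counts; only the first sighting of a pair increments them.
    total_reads: BTreeMap<u64, u32>,
}

impl CellSketch {
    /// Returns true if the pair was new, false if it was a duplicate and ignored.
    fn try_insert(&mut self, feature_id: u64, umi: u64) -> bool {
        if !self.seen.insert((feature_id, umi)) {
            return false;
        }
        *self.total_reads.entry(feature_id).or_insert(0) += 1;
        true
    }

    /// Assumed merge rule: replay the other cell's unique pairs, so a pair
    /// present in both cells is counted once.
    fn merge(&mut self, other: &CellSketch) {
        for &(feature_id, umi) in &other.seen {
            self.try_insert(feature_id, umi);
        }
    }
}
```

Replaying pairs rather than adding counts keeps the merged result UMI-aware: a molecule seen in both inputs does not inflate the count.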
Cell Identifiers
scdata stores cell identifiers internally as u64 values.
If you want exported barcodes.tsv.gz entries to appear as DNA barcode strings, these u64 values must encode the barcode sequence in the 2-bit format used by int_to_str.
During accumulation, scdata treats cell IDs simply as numeric identifiers. During export, those numeric values are rendered back to DNA strings using int_to_str.
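To make the idea of a 2-bit barcode encoding concrete, here is a self-contained sketch. It does not reproduce the bit layout of int_to_str: the mapping A=0, C=1, G=2, T=3 and the choice to store the first base in the lowest bits are assumptions, and encode_2bit and decode_2bit are hypothetical helper names. For real exports, use int_to_str itself so that encoding and rendering agree.

```rust
/// Packs up to 32 bases into a u64, two bits per base.
/// Returns None for sequences that are too long or contain non-ACGT characters.
fn encode_2bit(seq: &str) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut value = 0u64;
    for (i, base) in seq.bytes().enumerate() {
        let code: u64 = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        // Assumed layout: base i occupies bits 2*i and 2*i + 1.
        value |= code << (2 * i);
    }
    Some(value)
}

/// Renders the lowest `len` bases (capped at 32) of `value` as a DNA string.
fn decode_2bit(value: u64, len: usize) -> String {
    (0..len.min(32))
        .map(|i| match (value >> (2 * i)) & 3 {
            0 => 'A',
            1 => 'C',
            2 => 'G',
            _ => 'T',
        })
        .collect()
}
```

Under this layout every u64 decodes to some DNA string (for example, 0 renders as a run of A), which is why arbitrary numeric IDs still produce barcode-like output on export.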
Encoding DNA barcodes for use with scdata
[dependencies]
int_to_str = "0.1"
use IntToStr;
let barcode = "ATGACTCTCAGCATGG";
let cell_id: u64 = new.into_u64;
You can then use cell_id as the cell identifier in Scdata:
scdata.try_insert;
Important note
scdata does not validate whether a given u64 really represents a valid DNA barcode.
If you pass arbitrary numeric IDs, they will still be stored correctly, but barcode export will interpret them as 2-bit encoded DNA and render the corresponding sequence.
FeatureIndex (critical)
Defines the canonical feature space and export order.
👉 This ensures stable row order across exports.
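The sketch below shows why an explicit index gives a stable row order. It is an illustration only: FeatureIndexSketch, row_of, export_rows, and names are hypothetical and do not reflect the actual FeatureIndex API. The assumption is that the index is an ordered list of (feature_id, name) pairs and that rows are 1-based, as in Matrix Market.

```rust
use std::collections::HashMap;

/// Hypothetical index: position in `ordered` defines the export row.
struct FeatureIndexSketch {
    ordered: Vec<(u64, String)>,
}

impl FeatureIndexSketch {
    /// 1-based row of a feature, or None if the feature is not in the index.
    fn row_of(&self, feature_id: u64) -> Option<usize> {
        self.ordered
            .iter()
            .position(|(id, _)| *id == feature_id)
            .map(|i| i + 1)
    }

    /// Emits (row, count) pairs in index order, independent of how `counts`
    /// was filled. Features missing from the index are dropped.
    fn export_rows(&self, counts: &HashMap<u64, u32>) -> Vec<(usize, u32)> {
        self.ordered
            .iter()
            .enumerate()
            .filter_map(|(i, (id, _))| counts.get(id).map(|&c| (i + 1, c)))
            .collect()
    }

    /// Feature names in export order.
    fn names(&self) -> Vec<&str> {
        self.ordered.iter().map(|(_, name)| name.as_str()).collect()
    }
}
```

Because the row number depends only on the index, two datasets exported against the same index line up row for row, even if they saw different features.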
MatrixValueType
Controls Matrix Market output:
- Integer (recommended)
- Real (limited support)
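In Matrix Market terms, the value type mainly affects the banner line and how each entry is printed. The following sketch illustrates that difference. ValueTypeSketch and its methods are hypothetical names, not the crate's MatrixValueType API, and rounding in integer mode is an assumption.

```rust
/// Hypothetical mirror of the two value types listed above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ValueTypeSketch {
    Integer,
    Real,
}

impl ValueTypeSketch {
    /// Banner line for a sparse (coordinate) general matrix.
    fn banner(self) -> &'static str {
        match self {
            Self::Integer => "%%MatrixMarket matrix coordinate integer general",
            Self::Real => "%%MatrixMarket matrix coordinate real general",
        }
    }

    /// Formats one "row col value" entry line (1-based indices).
    fn format_entry(self, row: usize, col: usize, value: f64) -> String {
        match self {
            // Assumption: integer mode rounds to the nearest whole count.
            Self::Integer => format!("{} {} {}", row, col, value.round() as i64),
            Self::Real => format!("{} {} {}", row, col, value),
        }
    }
}
```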
Workflow
1. Create dataset
let mut data = new;
2. Insert data
data.try_insert;
3. Finalize for export
data.finalize_for_export;
4. Export
data.write_sparse?;
Matrix Market Output
Produces standard 10x-style files:
- matrix.mtx.gz
- features.tsv.gz
- barcodes.tsv.gz
Feature order is taken from the FeatureIndex, not from the inserted data.
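For orientation, this is roughly what the uncompressed content of matrix.mtx looks like and how it could be written with the standard library alone. The function write_mtx is a hypothetical helper, not part of scdata. Sorting entries by column and then row is an assumed way of making the output deterministic, and gzip compression is left out because it needs an external crate.

```rust
use std::io::{self, Write};

/// Writes a sparse integer matrix in Matrix Market coordinate format.
/// Rows are features, columns are cells, all indices are 1-based.
fn write_mtx<W: Write>(
    out: &mut W,
    n_features: usize,
    n_cells: usize,
    entries: &[(usize, usize, u32)], // (row, col, count)
) -> io::Result<()> {
    // Assumed ordering: by cell (column), then by feature (row).
    let mut sorted = entries.to_vec();
    sorted.sort_by_key(|&(row, col, _)| (col, row));

    writeln!(out, "%%MatrixMarket matrix coordinate integer general")?;
    // Size line: number of rows, number of columns, number of stored entries.
    writeln!(out, "{} {} {}", n_features, n_cells, sorted.len())?;
    for (row, col, count) in sorted {
        writeln!(out, "{} {} {}", row, col, count)?;
    }
    Ok(())
}
```

Writing into any `Write` target means the same function could feed a file, an in-memory buffer, or a gzip encoder wrapped around either.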
Design Philosophy
- UMI-centric (not raw count matrix first)
- Deterministic export
- Explicit feature indexing
- Separation of:
  - data ingestion
  - finalization
  - export
- No hidden magic
When to use scdata
✔ Rust-native single-cell pipelines
✔ High-performance UMI counting
✔ Deterministic reproducible outputs
✔ Pipelines without Python/R dependencies
Limitations
- Real-valued matrices are not a primary target
- Requires external feature index for export
- Matrix Market import reconstructs integer counts only
License
MIT or Apache-2.0
Author
Stefan Lang