scdata 0.1.0

A Rust library for efficient capture of single-cell sequencing data, providing sparse matrix handling and gene/cell indexing. Data access is not prioritized here.

A lightweight Rust library for handling sparse single-cell expression data, with support for incremental insertion, merging, and Matrix Market import/export.

This crate is designed for high-performance pipelines where you want:

  • Fine-grained control over sparse data
  • Thread-aware data insertion
  • Explicit gene indexing
  • Matrix Market compatibility
  • Clean Rust-native APIs


Overview

scdata provides the following main components:

Scdata

Core sparse container for single-cell count data.

IndexedGenes

Gene name ↔ index mapping.

CellData

Internal per-cell sparse representation.

AmbientRnaDetect

Utilities for detecting ambient RNA contamination (experimental).

MatrixValueType

Specifies the Matrix Market value type:

  • Integer
  • Real
  • Complex
  • Pattern
  • Unknown(String)
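As an illustration of how these variants relate to a Matrix Market file, the sketch below maps the value-type field of a banner line onto an enum with the same variant names. The `ValueType` enum and `value_type_from_banner` function are hypothetical stand-ins written for this example and are not scdata's own code.

```rust
// Hypothetical stand-in mirroring the variant names listed above;
// the real MatrixValueType in scdata may be defined differently.
#[derive(Debug, Clone, PartialEq)]
enum ValueType {
    Integer,
    Real,
    Complex,
    Pattern,
    Unknown(String),
}

/// Parse the value-type field of a Matrix Market banner line such as
/// `%%MatrixMarket matrix coordinate integer general`.
/// Returns None if the line is not a banner or is too short.
fn value_type_from_banner(banner: &str) -> Option<ValueType> {
    let mut fields = banner.split_whitespace();
    // The first field must be the %%MatrixMarket tag.
    if !fields.next()?.eq_ignore_ascii_case("%%MatrixMarket") {
        return None;
    }
    // Skip the object ("matrix") and format ("coordinate") fields;
    // the next field names the value type.
    let field = fields.nth(2)?;
    Some(match field.to_ascii_lowercase().as_str() {
        "integer" => ValueType::Integer,
        "real" => ValueType::Real,
        "complex" => ValueType::Complex,
        "pattern" => ValueType::Pattern,
        other => ValueType::Unknown(other.to_string()),
    })
}
```

An unrecognized field ends up in `Unknown(String)` instead of causing an error, which is one plausible reason for having that variant.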


Design Philosophy

  • Sparse-first architecture
  • Incremental insertion (try_insert)
  • Explicit gene indexing
  • Controlled merging
  • Minimal dependencies
  • Matrix Market interoperability

Installation

Add to your Cargo.toml:

scdata = { git = "https://github.com/stela2502/scdata" }

Basic Usage

1️⃣ Create Gene Index

use scdata::IndexedGenes;

let genes = IndexedGenes::from_names(vec![
    "GeneA",
    "GeneB",
    "GeneC"
]);

This creates a stable mapping between gene names and internal indices.
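To make the idea of a stable name ↔ index mapping concrete, here is a minimal std-only sketch. `GeneIndex` and its methods are hypothetical names chosen for this example; they show the general technique (a `Vec` for index → name plus a `HashMap` for name → index) and are not the implementation of `IndexedGenes`.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a gene index; not scdata's IndexedGenes.
struct GeneIndex {
    names: Vec<String>,          // index -> name
    ids: HashMap<String, usize>, // name -> index
}

impl GeneIndex {
    /// Build an index; each distinct name gets the next free id in input order.
    fn from_names(names: &[&str]) -> Self {
        let mut index = GeneIndex { names: Vec::new(), ids: HashMap::new() };
        for name in names {
            index.get_or_insert(name);
        }
        index
    }

    /// Return the id of `name`, assigning a new id if it is unknown.
    /// Existing ids never change, which keeps the mapping stable.
    fn get_or_insert(&mut self, name: &str) -> usize {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len();
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), id);
        id
    }

    fn id(&self, name: &str) -> Option<usize> {
        self.ids.get(name).copied()
    }

    fn name(&self, id: usize) -> Option<&str> {
        self.names.get(id).map(String::as_str)
    }
}
```

Because ids are assigned in insertion order and never reused, a count stored under a gene id keeps pointing at the same gene as more genes are added.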


2️⃣ Create a Sparse Dataset

use scdata::{Scdata, MatrixValueType};

let mut data = Scdata::new(
    4,                         // number of threads
    MatrixValueType::Integer   // matrix type
);

3️⃣ Insert Counts

use scdata::Scdata;
use mapping_info::MappingInfo; // example external report struct

let mut report = MappingInfo::default();

let cell_id: u64 = 123456;
let gene_hash: GeneUmiHash = ...; // your gene/UMI hash structure

data.try_insert(
    &cell_id,
    gene_hash,
    1.0,
    &mut report
);

try_insert:

  • Inserts a value into the sparse structure
  • Returns true if insertion succeeded
  • Updates the external MappingInfo
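The sketch below shows one way an insert-or-reject operation with an external report can work. It assumes that an insertion fails when the same (gene, UMI) pair was already counted for that cell; this README does not state the actual failure condition, so treat that rule as an assumption. `Counts`, `Cell` and `Report` are hypothetical types, with `Report` standing in for `MappingInfo`.

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

// Hypothetical report struct standing in for MappingInfo.
#[derive(Default)]
struct Report {
    inserted: usize,
    duplicates: usize,
}

// Hypothetical per-cell sparse record.
#[derive(Default)]
struct Cell {
    seen: HashSet<(usize, u64)>, // (gene_id, umi) pairs already counted
    counts: HashMap<usize, u32>, // gene_id -> count
}

#[derive(Default)]
struct Counts {
    cells: BTreeMap<u64, Cell>, // cell_id -> sparse counts
}

impl Counts {
    /// Count one (gene, UMI) observation for a cell.
    /// Assumed rule: a repeated (gene, UMI) pair is rejected as a duplicate.
    fn try_insert(&mut self, cell_id: u64, gene_id: usize, umi: u64, report: &mut Report) -> bool {
        let cell = self.cells.entry(cell_id).or_default();
        if !cell.seen.insert((gene_id, umi)) {
            report.duplicates += 1;
            return false;
        }
        *cell.counts.entry(gene_id).or_insert(0) += 1;
        report.inserted += 1;
        true
    }
}
```

Returning a `bool` and updating the report in the same call lets the caller keep running statistics without a second lookup.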


4️⃣ Merge Datasets

data.merge(&other_dataset);

This merges sparse cell structures efficiently.
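As a rough picture of what merging two sparse datasets involves, the following std-only sketch sums counts for cells and genes present in both inputs and copies over everything else. Summation is an assumption made for this example; the type aliases and `merge_into` are hypothetical and do not reflect the internals of `Scdata::merge`.

```rust
use std::collections::BTreeMap;

type CellCounts = BTreeMap<usize, u32>;   // gene_id -> count
type Dataset = BTreeMap<u64, CellCounts>; // cell_id -> sparse counts

/// Merge `other` into `target`.
/// Assumed rule: counts for the same (cell, gene) pair are added together.
fn merge_into(target: &mut Dataset, other: &Dataset) {
    for (cell_id, genes) in other {
        // Create the cell in the target if it does not exist yet.
        let cell = target.entry(*cell_id).or_default();
        for (gene_id, count) in genes {
            *cell.entry(*gene_id).or_insert(0) += count;
        }
    }
}
```

Only entries that actually exist in `other` are visited, so the cost scales with the number of non-zero values and not with the full cells × genes matrix.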


5️⃣ Export Data

Write filtered matrix

use std::path::PathBuf;

let output = PathBuf::from("matrix.mtx");

data.write(
    &output,
    &genes,
    10    // min_count threshold
).unwrap();

Write sparse format explicitly

data.write_sparse(
    &output,
    &genes,
    10
).unwrap();
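For readers unfamiliar with the output format, this sketch writes a small dataset in Matrix Market coordinate format using only the standard library. Two things are assumptions here: that the min_count threshold applies to the total count per cell, and that genes are rows and cells are columns (the layout commonly used by 10x-style files). `write_mtx` and `Dataset` are hypothetical names, not scdata's API.

```rust
use std::collections::BTreeMap;
use std::io::{self, Write};

type Dataset = BTreeMap<u64, BTreeMap<usize, u32>>; // cell_id -> (gene_id -> count)

/// Write cells whose total count is at least `min_count` (assumed meaning
/// of the threshold). Returns the number of cells written.
fn write_mtx<W: Write>(
    out: &mut W,
    data: &Dataset,
    n_genes: usize,
    min_count: u32,
) -> io::Result<usize> {
    // Select the cells that pass the filter.
    let kept: Vec<&BTreeMap<usize, u32>> = data
        .values()
        .filter(|genes| genes.values().sum::<u32>() >= min_count)
        .collect();
    let nnz: usize = kept.iter().map(|genes| genes.len()).sum();

    // Banner line, then the size line: rows, columns, non-zero entries.
    writeln!(out, "%%MatrixMarket matrix coordinate integer general")?;
    writeln!(out, "{} {} {}", n_genes, kept.len(), nnz)?;

    // One line per non-zero entry; Matrix Market indices are 1-based.
    for (col, genes) in kept.iter().enumerate() {
        for (gene_id, count) in genes.iter() {
            writeln!(out, "{} {} {}", gene_id + 1, col + 1, count)?;
        }
    }
    Ok(kept.len())
}
```

Writing to any `Write` implementor keeps the sketch testable with an in-memory `Vec<u8>` and usable with a buffered file.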

6️⃣ Read Matrix Market

let (data, genes) =
    Scdata::read_matrix_market("matrix.mtx").unwrap();

This loads both:

  • The sparse matrix
  • The gene index
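The sketch below parses the matrix part of a coordinate-format Matrix Market text into zero-based triples, to show what such a file contains. It is a simplified, hypothetical `parse_mtx` function: it expects a numeric value on every entry line (so it does not handle pattern files) and it does not read gene names. How `read_matrix_market` obtains the gene index is not covered here.

```rust
/// Parse coordinate-format Matrix Market text.
/// Returns (rows, cols, entries) with zero-based (row, col, value) triples.
fn parse_mtx(text: &str) -> Result<(usize, usize, Vec<(usize, usize, f64)>), String> {
    let mut lines = text.lines();
    let banner = lines.next().ok_or("empty input")?;
    if !banner.starts_with("%%MatrixMarket") {
        return Err("missing %%MatrixMarket banner".to_string());
    }

    // Skip comment lines (starting with '%') and blank lines.
    let mut body = lines.filter(|l| !l.starts_with('%') && !l.trim().is_empty());

    // Size line: rows cols nnz.
    let size: Vec<usize> = body
        .next()
        .ok_or("missing size line")?
        .split_whitespace()
        .map(|f| f.parse::<usize>().map_err(|e| e.to_string()))
        .collect::<Result<_, _>>()?;
    if size.len() != 3 {
        return Err("size line must contain rows, cols and nnz".to_string());
    }

    let mut entries = Vec::with_capacity(size[2]);
    for line in body {
        let f: Vec<&str> = line.split_whitespace().collect();
        if f.len() < 3 {
            return Err(format!("bad entry line: {}", line));
        }
        let row: usize = f[0].parse().map_err(|_| format!("bad row in: {}", line))?;
        let col: usize = f[1].parse().map_err(|_| format!("bad column in: {}", line))?;
        let value: f64 = f[2].parse().map_err(|_| format!("bad value in: {}", line))?;
        if row == 0 || col == 0 || row > size[0] || col > size[1] {
            return Err(format!("index out of range in: {}", line));
        }
        // Matrix Market indices are 1-based; convert to 0-based.
        entries.push((row - 1, col - 1, value));
    }
    if entries.len() != size[2] {
        return Err("entry count does not match size line".to_string());
    }
    Ok((size[0], size[1], entries))
}
```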


Internal Structure

Scdata
 ├── data: [BTreeMap<u64, CellData>; 255]
 ├── genes_with_data: HashSet<usize>
 ├── num_threads
 └── value_type

Key design features:

  • Per-thread storage buckets
  • BTreeMap for deterministic ordering
  • Gene-level tracking via genes_with_data
  • Lazy validation (checked flag)
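The bucketed layout above can be pictured with the following std-only sketch: a fixed array of 255 `BTreeMap`s, with each cell routed to one bucket. Routing by `cell_id % 255` is an assumption for illustration; this README does not say how scdata picks a bucket. `Bucketed` and its methods are hypothetical, and the per-cell value is reduced to a single count to keep the example short.

```rust
use std::collections::BTreeMap;

const BUCKETS: usize = 255;

// Hypothetical miniature of the layout shown above.
struct Bucketed {
    data: [BTreeMap<u64, u32>; BUCKETS], // per bucket: cell_id -> total count
}

impl Bucketed {
    fn new() -> Self {
        // BTreeMap is not Copy, so build the array element by element.
        Bucketed { data: std::array::from_fn(|_| BTreeMap::new()) }
    }

    /// Assumed routing rule: bucket = cell_id modulo the bucket count.
    fn bucket_of(cell_id: u64) -> usize {
        (cell_id % BUCKETS as u64) as usize
    }

    fn add(&mut self, cell_id: u64, count: u32) {
        *self.data[Self::bucket_of(cell_id)].entry(cell_id).or_insert(0) += count;
    }

    /// Each bucket iterates in key order, but a global order across buckets
    /// needs an explicit sort.
    fn sorted_cell_ids(&self) -> Vec<u64> {
        let mut ids: Vec<u64> = self.data.iter().flat_map(|b| b.keys().copied()).collect();
        ids.sort_unstable();
        ids
    }
}
```

With a fixed routing rule, a given cell always lands in the same bucket, so independent workers that own disjoint buckets never touch the same map.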

Matrix Market Compatibility

Supports standard .mtx sparse format.

Typical 10x-style layout:

matrix.mtx
genes.tsv
barcodes.tsv

You can integrate this library into pipelines that:

  • Process CellRanger outputs
  • Produce Scanpy-compatible matrices
  • Export to downstream R/Python workflows

Threading Model

  • num_threads controls internal parallelization
  • Designed for deterministic merge behavior
  • No implicit global locks
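One common way to get lock-free counting with a deterministic result is to let each thread fill its own local map and merge the partial maps afterwards in a fixed order. The sketch below shows that pattern with `std::thread::scope`; it is an illustration under that assumption and does not describe how scdata uses num_threads internally. `count_parallel` and `Dataset` are hypothetical names.

```rust
use std::collections::BTreeMap;
use std::thread;

type Dataset = BTreeMap<u64, u32>; // cell_id -> number of observations

/// Count observations per cell using `num_threads` workers.
/// Each worker owns a private map, so no locks are needed while counting.
fn count_parallel(reads: &[u64], num_threads: usize) -> Dataset {
    let n = num_threads.max(1);
    // Chunk size rounded up; at least 1 because chunks(0) would panic.
    let chunk = ((reads.len() + n - 1) / n).max(1);

    let partials: Vec<Dataset> = thread::scope(|s| {
        // Spawn all workers first, then join them in spawn order.
        let handles: Vec<_> = reads
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    let mut local = Dataset::new();
                    for cell_id in part {
                        *local.entry(*cell_id).or_insert(0) += 1;
                    }
                    local
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Merge in a fixed order; addition makes the result independent
    // of how the input was split across threads.
    let mut merged = Dataset::new();
    for part in partials {
        for (cell_id, n) in part {
            *merged.entry(cell_id).or_insert(0) += n;
        }
    }
    merged
}
```

Running the same input with one thread or several yields the same `BTreeMap`, which is the kind of determinism the list above refers to.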

When To Use scdata

✔ Building custom RNA-seq pipelines in Rust
✔ Integrating single-cell logic into Rust tools
✔ Avoiding Python/R for production pipelines
✔ High-performance sparse counting
✔ Custom filtering logic before export


Roadmap Ideas

  • Whitelist-based sample detection
  • Improved ambient RNA modeling
  • HDF5 export
  • 10x-compatible directory export
  • Compression support

License

MIT or Apache-2.0 (depending on repository settings)


Author

Stefan Lang
Bioinformatics & Single-Cell Systems