seqx 0.1.2

A command-line tool for processing and analyzing biological sequences.
Documentation
# seqx Development Guide

This guide is for contributors who want to extend `seqx` or understand the current implementation.

## Architecture Overview

```text
seqx/
├── .github/
│   └── workflows/
│       └── release.yml
├── scripts/
│   ├── bench_packed_io.sh
│   └── gen_random_fasta.py
├── src/
│   ├── main.rs
│   ├── lib.rs
│   ├── cmd/
│   │   ├── mod.rs
│   │   ├── compress.rs
│   │   ├── convert.rs
│   │   ├── dedup.rs
│   │   ├── guide.rs
│   │   ├── extract.rs
│   │   ├── filter.rs
│   │   ├── merge.rs
│   │   ├── modify.rs
│   │   ├── sample.rs
│   │   ├── search.rs
│   │   ├── sort.rs
│   │   ├── split.rs
│   │   └── stats.rs
│   └── common/
│       ├── mod.rs
│       ├── parser.rs
│       ├── packed_seq_io.rs
│       ├── record.rs
│       ├── writer.rs
│       └── README.md
├── Cargo.toml
├── Cargo.lock
├── README.md
├── QUICKREF.md
├── DEVELOPMENT.md
├── SKILL.md
├── rustfmt.toml
└── target/                # build artifacts (generated)
```

## Build and Test

```bash
# Debug build
cargo build

# Release build
cargo build --release

# Run tests
cargo test

# Run formatter
cargo fmt
```

## Command Registration Flow

Every command follows the same integration pattern.

1. Implement a command module in `src/cmd/<name>.rs`.
2. Export and register it in `src/cmd/mod.rs` (`pub mod ...` and `Commands` enum).
3. Add the dispatch arm in `src/main.rs`.

### Minimal Command Template

```rust
use clap::Parser;
use crate::common::{detect_format, RecordReader, SeqWriter};

#[derive(Parser)]
#[command(about = "Brief description")]
pub struct Args {
    #[arg(short, long)]
    pub input: Option<String>,

    #[arg(short, long)]
    pub output: Option<String>,
}

pub fn run(args: Args) -> anyhow::Result<()> {
    let format = detect_format(args.input.as_deref(), None);
    let mut reader = RecordReader::new(args.input.as_deref(), format)?;
    let mut writer = SeqWriter::from_path(args.output.as_deref(), format, 80)?;

    while let Some(record) = reader.next_record()? {
        writer.write_record(&record)?;
    }

    Ok(())
}
```

## I/O and Parsing Model

`seqx` uses streaming readers and writers by default.

- `RecordReader` yields one `Record` at a time.
- `SeqWriter` writes output records immediately.
- Gzip input is handled transparently by `open_file`.
- Format detection uses extension-based heuristics (`.fa/.fasta/.fq/.fastq`, optional `.gz`).

### Format Support in Current Parser

`Format` currently supports:

- `fasta` / `fa`
- `fastq` / `fq`
- `auto`

`RecordReader` is backed by `noodles`:

- FASTA: `noodles::fasta::io::Reader`
- FASTQ: `noodles::fastq::io::Reader`

## Common Module Responsibilities

- `record.rs`: normalized sequence record model and sequence operations.
- `parser.rs`: format detection, gzip opening, range parsing, streaming record iteration.
- `writer.rs`: output abstraction for FASTA/FASTQ writing.
- `packed_seq_io.rs`: binary temp-record encoding used by large-file sort/dedup pipelines.

## Large-File Strategies

The current codebase avoids full in-memory loading for heavy operations:

- `sort`: external chunk sort + mmap-based multiway merge.
- `dedup`: disk bucket partitioning + per-bucket dedup + stable global merge by input order.
- `split --parts`: two-pass flow (materializes stdin to temp file if needed).
- `extract`: single-pass streaming extraction for ID/range/BED paths.
- `search`: producer/worker pipeline with ordered output merge.

## Nucleotide-Only Guards

Some features are intentionally restricted to nucleotide sequences:

- `filter --gc-min/--gc-max`
- `modify --reverse-complement`
- reverse-complement branch in `search`

When non-nucleotide records are encountered in these paths, the command returns a clear error (or skips reverse-complement search when not applicable).

## CLI Consistency Notes

- Most commands use `-i/--input`, `-o/--output`, `-f/--format`.
- `convert` uses `-F/--from` and `-T/--to`.
- `merge` takes positional input files: `seqx merge <INPUTS>...`.
- Thread count is available in `search`, `sort`, `dedup`, and `compress` via `-t/--threads`.

## Suggested Contribution Checklist

- Keep changes command-scoped and minimal.
- Prefer shared helpers in `common` over duplicated logic.
- Keep command behavior stream-friendly when possible.
- Add/adjust tests near the changed module.
- Verify `cargo fmt`, `cargo build`, and `cargo test` before submission.