seqx 0.1.2

A command-line tool for processing and analyzing biological sequences.
Documentation
# seqx Quick Reference

## Project Layout

```text
seqx/
├── .github/
│   └── workflows/
│       └── release.yml
├── scripts/
│   ├── bench_packed_io.sh
│   └── gen_random_fasta.py
├── src/
│   ├── main.rs
│   ├── lib.rs
│   ├── cmd/
│   │   ├── mod.rs
│   │   ├── compress.rs
│   │   ├── convert.rs
│   │   ├── guide.rs
│   │   ├── dedup.rs
│   │   ├── extract.rs
│   │   ├── filter.rs
│   │   ├── merge.rs
│   │   ├── modify.rs
│   │   ├── sample.rs
│   │   ├── search.rs
│   │   ├── sort.rs
│   │   ├── split.rs
│   │   └── stats.rs
│   └── common/
│       ├── mod.rs
│       ├── parser.rs
│       ├── packed_seq_io.rs
│       ├── record.rs
│       ├── writer.rs
│       └── README.md
├── Cargo.toml
├── Cargo.lock
├── README.md
├── QUICKREF.md
├── DEVELOPMENT.md
├── SKILL.md
├── rustfmt.toml
└── target/              # build artifacts (generated)
```

## Add a New Command (3 Steps)

```rust
// 1) Create src/cmd/newcmd.rs
use crate::common::{detect_format, RecordReader, SeqWriter};
use clap::Parser;

#[derive(Parser)]
#[command(about = "Brief description")]
pub struct Args {
    #[arg(short, long)]
    pub input: Option<String>,

    #[arg(short, long)]
    pub output: Option<String>,
}

pub fn run(args: Args) -> anyhow::Result<()> {
    let format = detect_format(args.input.as_deref(), None);
    let mut reader = RecordReader::new(args.input.as_deref(), format)?;
    let mut writer = SeqWriter::from_path(args.output.as_deref(), format, 80)?;

    while let Some(record) = reader.next_record()? {
        writer.write_record(&record)?;
    }

    Ok(())
}

// 2) Register it in src/cmd/mod.rs
pub mod newcmd;
// ...
Newcmd(newcmd::Args),

// 3) Dispatch it in src/main.rs
Commands::Newcmd(args) => seqx::cmd::newcmd::run(args),
```

## Core Shared API

### `Record`

```rust
record.len()
record.is_nucleotide_sequence()
Record::is_nucleotide_text(text)
record.gc_content()
record.avg_quality()
record.reverse_complement()
record.to_upper()
record.to_lower()
record.remove_gaps()
record.slice(start, end)
record.write_fasta(writer, width)
record.write_fastq(writer)
record.dedup_key(by_id, prefix, ignore_case)
```

### `parser`

```rust
detect_format(path, format_arg) -> Format
RecordReader::new(path, format)?
reader.next_record()? -> Option<Record>
parse_records(path, format)? -> Vec<Record>
parse_range("1:100")? -> (0, 100) // 0-based, half-open
```

### `packed_seq_io`

```rust
write_record_binary(writer, &record)?
read_record_binary(reader)? -> Option<Record>
```

Note: A/C/G/T data is stored in packed 2-bit form when possible; other alphabets (for example protein sequences) fall back to plain text encoding.

### `writer`

```rust
SeqWriter::from_path(path, format, width)?
writer.write_record(&record)?
create_writer(path)?
```

## Real CLI Option Patterns

- Input/output pattern (most commands): `-i/--input`, `-o/--output`
- Common format option: `-f/--format` (except `convert`, which uses `-F/--from` and `-T/--to`)
- FASTA line width: `-w/--line-width`
- Thread options where implemented: `-t/--threads` (`search`, `sort`, `dedup`, `compress`)

## Streaming Skeleton

```rust
let format = detect_format(args.input.as_deref(), args.format.as_deref());
let mut reader = RecordReader::new(args.input.as_deref(), format)?;
let mut writer = SeqWriter::from_path(args.output.as_deref(), format, args.line_width)?;

while let Some(record) = reader.next_record()? {
    writer.write_record(&record)?;
}
```

## Smoke Test Commands

```bash
cat > test.fa << 'EOF'
>seq1
ATGCGATCGATCG
>seq2
CGTACGTACGTACGTACGTA
EOF

cargo run -- stats -i test.fa --gc
cargo run -- convert -i test.fa -T fastq
cargo run -- filter -i test.fa --min-len 15
```

## Developer Checklist

- [ ] `cargo fmt`
- [ ] `cargo build`
- [ ] New command registered in `src/cmd/mod.rs` and `src/main.rs`
- [ ] Reused `common` helpers instead of duplicating parser/writer logic
- [ ] Logs and progress messages go to `stderr` (`eprintln!`)
- [ ] Nucleotide-only operations are guarded for non-nucleotide records

## Current Implementation Summary

- Parsing: streaming `RecordReader` backed by `noodles` for FASTA/FASTQ.
- Large-file strategy: external sort, bucketed dedup, streaming split/extract/search.
- Temp storage: binary packed format in `packed_seq_io` for lower I/O overhead.
- Parallel paths: `search`, `sort`, `dedup`, and `compress` expose `--threads` (default `1` for search/sort/dedup, auto-detect for compress).
- Compression: `compress` uses `pigz` if available, otherwise `gzp` for parallel gzip.