bqtools 0.5.3

A command-line tool for interacting with BINSEQ file formats.

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

bqtools is a Rust CLI for working with BINSEQ files — a binary format family for high-performance DNA sequence processing. It encodes, decodes, greps, concatenates, samples, and pipes BINSEQ files (`.bq`, `.vbq`, `.cbq`). CBQ is the recommended format for most applications.

## Build & Test Commands

```bash
cargo build                    # Debug build
cargo build --release          # Optimized build (LTO enabled, so compilation is slow)
cargo install --path .         # Install binary locally

cargo test --verbose           # Run all tests
cargo test --verbose -F fuzzy  # Run tests with the fuzzy feature enabled (see Feature Flags for the required RUSTFLAGS)
cargo test <test_name>         # Run a single test by name

cargo fmt --check              # Check formatting
cargo clippy --verbose         # Lint (pedantic clippy enabled)
```

Logging is controlled via the `BQTOOLS_LOG` environment variable (uses `env_logger`).

## Feature Flags

- `htslib` (default): SAM/BAM/CRAM support via rust-htslib
- `gcs` (default): Google Cloud Storage file reading
- `fuzzy` (optional): Fuzzy matching via `sassy` — requires `RUSTFLAGS="-C target-cpu=native"`

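To build without the default features and opt into specific ones: `cargo build --no-default-features -F fuzzy,gcs` (set the `RUSTFLAGS` noted above whenever `fuzzy` is enabled).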

## Architecture

### Module Layout

- **`src/cli/`** — Clap derive-based argument definitions. `cli.rs` has the top-level `Commands` enum. `input.rs` and `output.rs` handle complex input/output argument parsing (file formats, compression, paired-end, spans).
- **`src/commands/`** — Command implementations, each in its own subdirectory. `utils.rs` has shared compression helpers.
- **`src/types.rs`** — Type aliases (`BoxedReader`, `BoxedWriter`).
- **`src/main.rs`** — CLI dispatch and SIGPIPE handling.

### Key Patterns

**Parallel processing**: Commands use the `paraseq` crate's `ParallelProcessor` trait for embarrassingly parallel batch processing. Each command has a `processor.rs` implementing this trait with thread-local buffers and `Arc<Mutex<T>>` for shared global state.
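
The thread-local buffer plus `Arc<Mutex<T>>` pattern can be sketched with the standard library alone. This is an illustration, not the `paraseq` API: `GcCounter`, `count_gc`, and the method names below are hypothetical, and the hand-rolled work queue stands in for whatever scheduling `paraseq` actually does.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical processor: counts G/C bases across batches of sequences.
#[derive(Clone)]
struct GcCounter {
    /// Thread-local tally; every clone owns its own copy.
    local_gc: u64,
    /// Shared global tally, locked once per batch rather than once per record.
    global_gc: Arc<Mutex<u64>>,
}

impl GcCounter {
    fn new() -> Self {
        Self { local_gc: 0, global_gc: Arc::new(Mutex::new(0)) }
    }

    /// Per-record work touches only thread-local state, so it needs no lock.
    fn process_record(&mut self, seq: &[u8]) {
        self.local_gc += seq.iter().filter(|&&b| b == b'G' || b == b'C').count() as u64;
    }

    /// Per-batch work flushes the local tally into the shared state.
    fn on_batch_complete(&mut self) {
        let mut global = self.global_gc.lock().unwrap();
        *global += self.local_gc;
        self.local_gc = 0;
    }
}

/// Distributes batches over `n_threads` workers and returns the global G/C count.
fn count_gc(batches: Vec<Vec<Vec<u8>>>, n_threads: usize) -> u64 {
    let processor = GcCounter::new();
    let queue = Arc::new(Mutex::new(batches));
    let handles: Vec<_> = (0..n_threads.max(1))
        .map(|_| {
            // Cloning copies the local tally and shares the global one.
            let mut worker = processor.clone();
            let queue = Arc::clone(&queue);
            thread::spawn(move || loop {
                // The queue lock is released at the end of this statement.
                let batch = queue.lock().unwrap().pop();
                match batch {
                    Some(records) => {
                        for seq in &records {
                            worker.process_record(seq);
                        }
                        worker.on_batch_complete();
                    }
                    None => break,
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *processor.global_gc.lock().unwrap();
    total
}
```

Calling `count_gc` with a few batches of byte sequences and a thread count returns the combined tally. Keeping the lock out of the per-record path is the point of the split between local and global state.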

**Grep backends**: The grep command uses a `PatternMatcher` enum dispatching to three backends — `regex`, `aho-corasick` (fixed-string, multi-pattern), and `sassy` (fuzzy, feature-gated). The same pattern applies to `PatternCounter` for the `-P` pattern-count mode.
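
As a rough sketch of enum-based backend dispatch, the example below uses only the standard library. The `Matcher` type and its variants are hypothetical stand-ins: plain substring search replaces `regex` and `aho-corasick`, and a Hamming-distance scan replaces `sassy`. None of those crates' APIs appear here, and the real `PatternMatcher` may be shaped differently.

```rust
/// Hypothetical stand-in for a matcher enum; every variant uses only std.
enum Matcher {
    /// One fixed pattern (stand-in for a regex backend).
    Single(Vec<u8>),
    /// Any of several fixed patterns (stand-in for a multi-pattern backend).
    Multi(Vec<Vec<u8>>),
    /// One pattern allowing up to `k` mismatches (stand-in for a fuzzy backend).
    Fuzzy { pattern: Vec<u8>, k: usize },
}

/// Exact substring search; an empty pattern matches everything.
fn contains(seq: &[u8], pat: &[u8]) -> bool {
    pat.is_empty() || seq.windows(pat.len()).any(|w| w == pat)
}

/// Substring search tolerating up to `k` substitutions (no insertions or deletions).
fn contains_fuzzy(seq: &[u8], pat: &[u8], k: usize) -> bool {
    pat.is_empty()
        || seq
            .windows(pat.len())
            .any(|w| w.iter().zip(pat).filter(|(a, b)| a != b).count() <= k)
}

impl Matcher {
    /// Single entry point: callers never need to know which backend is active.
    fn is_match(&self, seq: &[u8]) -> bool {
        match self {
            Matcher::Single(p) => contains(seq, p),
            Matcher::Multi(ps) => ps.iter().any(|p| contains(seq, p)),
            Matcher::Fuzzy { pattern, k } => contains_fuzzy(seq, pattern, *k),
        }
    }
}
```

In a feature-gated setup, the fuzzy variant and its match arm would presumably sit behind `#[cfg(feature = "fuzzy")]`, so the enum shrinks when the feature is off.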

**Encode modes**: Encoding dispatches across atomic (single/paired files), recursive (directory walk via `walkdir`), manifest (file list), and batch (multi-file thread distribution) modes.

**Writer abstraction**: `SplitWriter` supports interleaved (single file) and split (separate R1/R2) output modes with polymorphic writers (file, stdout, compressed, chunked).
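
The interleaved and split modes can be illustrated with a small enum over boxed writers. This is a sketch under assumptions: `PairedWriter`, `DynWriter`, and `write_pair` are hypothetical names, and the actual `SplitWriter` and `BoxedWriter` definitions may differ.

```rust
use std::io::{self, Write};

/// Hypothetical boxed-writer alias; any `Write` implementor fits
/// (file, stdout, compressor, in-memory buffer).
type DynWriter<'a> = Box<dyn Write + 'a>;

/// Hypothetical sketch of a paired-end writer with two output modes.
enum PairedWriter<'a> {
    /// R1 and R2 records alternate in a single output.
    Interleaved(DynWriter<'a>),
    /// R1 and R2 records go to separate outputs.
    Split { r1: DynWriter<'a>, r2: DynWriter<'a> },
}

impl<'a> PairedWriter<'a> {
    /// Writes one read pair; the caller does not care which mode is active.
    fn write_pair(&mut self, r1_rec: &[u8], r2_rec: &[u8]) -> io::Result<()> {
        match self {
            PairedWriter::Interleaved(w) => {
                w.write_all(r1_rec)?;
                w.write_all(r2_rec)
            }
            PairedWriter::Split { r1, r2 } => {
                r1.write_all(r1_rec)?;
                r2.write_all(r2_rec)
            }
        }
    }
}
```

Because each variant holds trait objects, the same `write_pair` call works whether the destination is a file, stdout, or a compression stream wrapped around either.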

### Core Dependencies

| Crate     | Role                             |
| --------- | -------------------------------- |
| `binseq`  | BINSEQ format read/write         |
| `bitnuc`  | 2-bit/4-bit nucleotide encoding  |
| `paraseq` | Parallel FASTX/BINSEQ processing |
| `clap`    | CLI argument parsing (derive)    |
| `anyhow`  | Error handling throughout        |

### Testing

Integration tests live in `tests/`. `tests/common.rs` provides a builder (`write_fastx()`) for generating random FASTQ/FASTA test data with configurable compression (none, gzip, zstd). Tests use cartesian products over format/compression/mode combinations. Dev dependencies: `bon` (builder macro), `nucgen` (random sequences), `tempfile`, `itertools`.
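
A cartesian product over test dimensions can be sketched with nested loops from the standard library. The enums and `test_cases` below are hypothetical, including the choice of what "mode" means, and the repository's tests may build the product with `itertools` instead.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Format {
    Fasta,
    Fastq,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Compression {
    None,
    Gzip,
    Zstd,
}

/// Hypothetical third axis; here it is the BINSEQ output flavor.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    Bq,
    Vbq,
    Cbq,
}

/// Enumerates every (format, compression, mode) combination exactly once.
fn test_cases() -> Vec<(Format, Compression, Mode)> {
    let formats = [Format::Fasta, Format::Fastq];
    let compressions = [Compression::None, Compression::Gzip, Compression::Zstd];
    let modes = [Mode::Bq, Mode::Vbq, Mode::Cbq];

    let mut cases = Vec::new();
    for format in formats {
        for compression in compressions {
            for mode in modes {
                cases.push((format, compression, mode));
            }
        }
    }
    cases
}
```

A test would then loop over `test_cases()` and run the same round-trip assertion for each combination, so adding a new format or compression variant extends coverage without new test functions.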

### Generating Test Data

Random FASTQ/FASTA test data can be generated from the command line with `nucgen` (install it with `cargo install nucgen` if needed).

```bash
# Generate 10,000 reads of length 150
nucgen -n 10000 -l 150 some.fq
# Generate 30,000 read pairs with read lengths 50 and 200
nucgen -n 30000 -l 50 -L 200 some_R1.fq some_R2.fq
```

The generated files can then be encoded with `bqtools encode`:

```bash
# Single-end input
bqtools encode some.fq -o some.cbq
# Paired-end input (R1 and R2)
bqtools encode some_R1.fq some_R2.fq -o some.cbq
```

### Benchmarking Changes

Use `hyperfine` (install it with `cargo install hyperfine` if needed) to measure how changes affect the performance of the built binary.

```bash
# Measure decoding performance (3 warmup runs, then 10 timed runs)
hyperfine --warmup 3 --runs 10 "bqtools decode some.cbq > /dev/null"
```

## Contribution Guide

When making changes, keep the following documentation in sync:

1. **CLAUDE.md** — Update this file when adding new commands, changing architecture, or modifying build/test workflows.
2. **README.md** — Update usage examples and feature descriptions when adding or changing user-facing functionality (new commands, flags, behavior changes).
3. **Clap doc comments** — All CLI arguments, flags, and subcommands use clap derive macros with `/// doc comments` and `#[clap(long_about)]` attributes. When adding or modifying flags, write clear help text directly on the struct fields in `src/cli/`. These doc comments are the `--help` output users see.
4. **New feature flags** — If adding a Cargo feature flag, document it in both `CLAUDE.md` (Feature Flags section) and `README.md` (Feature Flags / Installation section).