nanalogue 0.1.11

BAM/Mod BAM parsing and analysis tool with a single-molecule focus
# `nanalogue`

Nanalogue = *N*ucleic Acid *Analogue*

Nanalogue is a tool to parse or analyse BAM/Mod BAM files with a single-molecule focus.

[![Cargo Build & Test](https://github.com/DNAReplicationLab/nanalogue/actions/workflows/ci.yml/badge.svg)](https://github.com/DNAReplicationLab/nanalogue/actions/workflows/ci.yml)
[![Code test coverage > 92\%](https://github.com/DNAReplicationLab/nanalogue/actions/workflows/cargo-llvm-cov.yml/badge.svg)](https://github.com/DNAReplicationLab/nanalogue/actions/workflows/cargo-llvm-cov.yml)
[![crates.io](https://img.shields.io/crates/v/nanalogue.svg)](https://crates.io/crates/nanalogue)
[![Documentation](https://docs.rs/nanalogue/badge.svg)](https://docs.rs/nanalogue)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A common pain point in genomics analyses is that BAM files are information-dense,
which makes it difficult to gain insight from them. Nanalogue aims to make it easy
to extract and process this information, with a particular focus on single-molecule
aspects and DNA/RNA modifications. Despite this focus, some of nanalogue's commands are
quite general and can be applied to almost any BAM file.

We can process many types of DNA/RNA modifications occurring in any pattern (single/multiple mods,
spatially-isolated/non-isolated etc.). All we require is that the data is stored
in a BAM file in the mod BAM format (i.e. using MM/ML tags as laid down in the
[specifications](https://samtools.github.io/hts-specs/SAMtags.pdf)). We currently support standard `MM/ML` and fallback `Mm/Ml`
tag variants; other mixed-case or lowercase variants are not recognized.
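
As a rough illustration of what the `MM`/`ML` tags encode, the Python sketch below decodes a single `MM` entry following the SAM tags specification linked above. It is not nanalogue's implementation: the function name `decode_mm_entry` and the toy sequence are made up for this example, and it assumes a forward-strand read and a concrete canonical base (not `N`).

```python
from typing import List, Tuple


def decode_mm_entry(sequence: str, mm_entry: str, ml_values: List[int]) -> List[Tuple[int, int]]:
    """Decode one MM entry such as "T+T,1,0" for a forward-strand read.

    Returns (0-based read position, raw ML byte in 0..255) pairs.
    Assumption: reverse-strand reads and the "N" base are not handled here.
    """
    fields = mm_entry.rstrip(";").split(",")
    header = fields[0]                      # e.g. "T+T": base, strand, mod code
    skips = [int(x) for x in fields[1:]]    # delta-encoded skip counts
    base = header[0]

    # Read positions at which the canonical base occurs.
    candidates = [i for i, b in enumerate(sequence) if b == base]

    calls = []
    cursor = 0  # index into `candidates`
    for skip, ml in zip(skips, ml_values):
        cursor += skip                      # skip this many occurrences of the base
        calls.append((candidates[cursor], ml))
        cursor += 1                         # move past the base just reported
    return calls


# Toy data (hypothetical): T occurs at read positions 1, 2, 4 and 7.
calls = decode_mm_entry("ATTGTCAT", "T+T,1,0", [221, 3])
print(calls)
```

Here the skip counts `1,0` mean "skip one T, report the next T, then report the T immediately after it", which is why the calls land on read positions 2 and 4.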

## Table of Contents

- [Usage and documentation](#usage-and-documentation)
  - [Simulate BAM files](#simulate-bam-files)
- [Installation and Updates](#installation-and-updates)
  - [Pre-built Binaries](#pre-built-binaries)
    - [Quick Install Script](#quick-install-script)
    - [GitHub Releases](#github-releases)
    - [GitHub Actions Artifacts](#github-actions-artifacts)
  - [Using Cargo](#using-cargo)
    - [Cargo locked](#using-cargo-locked)
  - [Using Docker](#using-docker)
- [Commands](#commands)
  - [`nanalogue read-info`](#nanalogue-read-info)
  - [`nanalogue read-table-hide-mods`](#nanalogue-read-table-hide-mods)
  - [`nanalogue read-table-show-mods`](#nanalogue-read-table-show-mods)
  - [`nanalogue read-stats`](#nanalogue-read-stats)
  - [`nanalogue find-modified-reads`](#nanalogue-find-modified-reads)
  - [`nanalogue window-dens`](#nanalogue-window-dens)
  - [`nanalogue window-grad`](#nanalogue-window-grad)
  - [`nanalogue peek`](#nanalogue-peek)
- [Contributing](#contributing)
- [Security](#security)
- [Third-Party Notices](#third-party-notices)
- [Changelog](#changelog)
- [Acknowledgments](#acknowledgments)

# Usage and documentation

There are several ways to use nanalogue with a BAM/CRAM resource:
- as a tool on the command line
- as a Rust [library](https://crates.io/crates/nanalogue) with documentation [here](https://docs.rs/nanalogue)
- using the Python wrapper: [pynanalogue](https://github.com/DNAReplicationLab/pynanalogue)
- using the TypeScript binding:
  [nanalogue-node](https://github.com/sathish-t/nanalogue-node)
- as a GUI with AI chat features:
  [nanalogue-gui](https://github.com/sathish-t/nanalogue-gui)

The Rust library and command-line tools are the most mature among these.

In addition to these resources, we are developing a
companion cookbook [here](https://www.nanalogue.com).

## Simulate BAM files

For developers: if you are looking to make a custom BAM file containing synthetic, simulated
DNA/RNA modification data to develop/test your tool, you may be interested in `nanalogue_sim_bam`.
This is an executable that ships with nanalogue that can create a BAM file according to your
specifications. Please run `nanalogue_sim_bam --help` for details. If you are a Rust developer looking
to use this functionality in your library, please look at the documentation of the module
`nanalogue_core::simulate_mod_bam` in the docs.rs link [above](#usage-and-documentation).

# Installation and Updates

## Pre-built Binaries

Pre-built binaries for macOS and Linux are available:

### Quick Install Script

The easiest way to install or update pre-built binaries is using the install script.
It installs both `nanalogue` and `nanalogue_sim_bam` when available.
For the available binary artifacts, see [GitHub Actions Artifacts](#github-actions-artifacts):

```bash
curl -fsSL https://raw.githubusercontent.com/DNAReplicationLab/nanalogue/main/install.sh | sh
```

The script will prompt you for an install directory (default: `/usr/local/bin`).
To skip the prompt and use the default directory:

```bash
curl -fsSL https://raw.githubusercontent.com/DNAReplicationLab/nanalogue/main/install.sh | sh -s -- -y
```

To install to a custom directory, set the `NANALOGUE_INSTALL_DIR` environment variable.
Note that you may need to add the custom directory to your `PATH` for global access.
In the example below, nanalogue is installed to `$HOME/.local/bin`:

```bash
export NANALOGUE_INSTALL_DIR=$HOME/.local/bin && curl -fsSL https://raw.githubusercontent.com/DNAReplicationLab/nanalogue/main/install.sh | sh
```
**Dependencies:** The install script requires:
- `curl` or `wget`
- `unzip`
- `jq`
- `sha256sum` or `shasum`

On Debian/Ubuntu, install dependencies with:
```bash
sudo apt-get install curl unzip jq coreutils
```

On macOS, these are typically pre-installed or available via Homebrew:
```bash
brew install jq
```

#### If you are updating

If you already installed nanalogue using the curl command above and want to update it,
set `NANALOGUE_INSTALL_DIR` to the existing install directory before rerunning the script,
or choose the same directory where you did your previous install at the prompt.

### GitHub Releases

Official release binaries can be downloaded from the [Releases page](https://github.com/DNAReplicationLab/nanalogue/releases)
on the GitHub repository. Each release includes binaries for multiple platforms.

### GitHub Actions Artifacts

Binaries built from the latest code are available as artifacts from the
[Build Release Binaries workflow](https://github.com/DNAReplicationLab/nanalogue/actions/workflows/build-binaries.yml)
in the GitHub repository. To download:

1. Navigate to the workflow runs
2. Click on a successful workflow run
3. Scroll to the "Artifacts" section at the bottom
4. Download the binary artifact for your platform. Linux artifact names include architecture suffixes such as `x86_64`, `aarch64`, `arm`, `riscv64gc`, and `powerpc64le`:
   - `binaries-macos-latest` - macOS Apple Silicon binaries
   - `binaries-macos-15-intel` - macOS Intel binaries
   - `binaries-musllinux_1_2_x86_64` - x86_64 Alpine/musl (static binaries)
   - `binaries-musllinux_1_2_aarch64` - aarch64 Alpine/musl (static binaries)
   - `binaries-musllinux_1_2_arm` - 32-bit ARM / some Raspberry Pi models (static binaries)
   - `binaries-musllinux_1_2_powerpc64le` - PowerPC64LE Alpine/musl (static binaries)
   - `binaries-manylinux_2_34_<arch>` - Newer Linux distributions (glibc 2.34+)
   - `binaries-manylinux_2_29_<arch>` - RISC-V / PowerPC64LE Linux (glibc 2.29+)
   - `binaries-manylinux_2_28_<arch>` - Modern Linux distributions (glibc 2.28+)
   - `binaries-manylinux_2_17_arm` - Other Linux distributions (glibc 2.17+, maximum compatibility)

## Using Cargo

Run the following command to install or update `nanalogue` for usage on the command line.

```bash
cargo install nanalogue
```

`cargo` is the Rust package manager. If you do not have `cargo`,
then follow these [instructions](https://doc.rust-lang.org/cargo/getting-started/installation.html)
to get it. On Linux and macOS systems, the install command is as simple as
`curl https://sh.rustup.rs -sSf | sh`.

### Using cargo locked

If the `cargo install` command fails, please also try

```bash
cargo install nanalogue --locked
```

This uses the exact versions of dependencies specified in the package's `Cargo.lock` file,
and can fix install problems caused by newer versions of those dependencies.

If you are building from a Git checkout instead of an installed crate, make sure
Git submodules are initialized so the vendored HTSlib source is available:

```bash
git submodule update --init --recursive
```

## Using Docker

You can also use `nanalogue` via Docker. On some systems you may need to use `sudo docker`.
Run the following command to pull the latest image, or run it again to update to a newer version:

```bash
docker pull dockerofsat/nanalogue:latest
```

The following command runs `read-stats` on `some_file.bam` in the current working directory:

```bash
docker run --rm -v $(pwd):$(pwd) -w $(pwd) dockerofsat/nanalogue:latest nanalogue read-stats some_file.bam
```

You can mount other directories using the `-v` option as needed.

# Commands

All the commands below have options you can specify on the command line.
Please run a command with `--help` to learn what these are. Among other operations,
the options allow you to subsample the BAM file (`-s`),
restrict read and/or modification data to a specific genomic region (`--region` or `--mod-region`),
restrict to one or several read ids (`--read-id` or `--read-id-list`),
restrict to a specific mapping type (`--read-filter`), and filter modification data
(`--mod-prob-filter`).

## `nanalogue read-info`
Prints information about reads in JSON, including BAM mapping quality (`mapq`). A sample output snippet follows.

```json
[
{
        "read_id": "a4f36092-b4d5-47a9-813e-c22c3b477a0c",
        "sequence_length": 48,
        "mapq": 255,
        "contig": "dummyIII",
        "reference_start": 23,
        "reference_end": 71,
        "alignment_length": 48,
        "alignment_type": "primary_forward",
        "mod_count": "T+T:3;(probabilities >= 0.5020, PHRED base qual >= 0)"
}
]
```
Please note that a `mapq` of 255 means that the mapping quality is unavailable,
and a `mod_count` of "NA" means modifications are not present.
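
A minimal Python sketch of how a downstream script might apply these two conventions to the sample output above. The helper names `usable_mapq` and `has_mods` are hypothetical and not part of nanalogue.

```python
import json
from typing import Optional

# The sample `read-info` output shown above, embedded for a self-contained example.
SAMPLE = """
[
{
        "read_id": "a4f36092-b4d5-47a9-813e-c22c3b477a0c",
        "sequence_length": 48,
        "mapq": 255,
        "contig": "dummyIII",
        "reference_start": 23,
        "reference_end": 71,
        "alignment_length": 48,
        "alignment_type": "primary_forward",
        "mod_count": "T+T:3;(probabilities >= 0.5020, PHRED base qual >= 0)"
}
]
"""


def usable_mapq(record: dict) -> Optional[int]:
    """Return the mapping quality, or None when it is 255 (unavailable)."""
    mapq = record["mapq"]
    return None if mapq == 255 else mapq


def has_mods(record: dict) -> bool:
    """Return False when `mod_count` is the "NA" placeholder."""
    return record["mod_count"] != "NA"


records = json.loads(SAMPLE)
for rec in records:
    print(rec["read_id"], usable_mapq(rec), has_mods(rec))
```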

With options like `--detailed` and `--detailed-pretty`, the modification information in the BAM file
is converted to a more-usable JSON format; this detailed output also includes `mapq`. A sample output snippet follows.

```json
[
{
  "alignment_type": "primary_forward",
  "alignment": {
    "start": 23,
    "end": 71,
    "contig": "dummyIII",
    "contig_id": 2
  },
  "mod_table": [
    {
      "base": "T",
      "is_strand_plus": true,
      "mod_code": "T",
      "data": [
        [
          3,
          26,
          221
        ],
        [
          8,
          31,
          242
        ],
        [
          27,
          50,
          3
        ],
        [
          39,
          62,
          47
        ],
        [
          47,
          70,
          239
        ]
      ]
    }
  ],
  "read_id": "a4f36092-b4d5-47a9-813e-c22c3b477a0c",
  "mapq": 255,
  "seq_len": 48
}
]
```
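
The sketch below shows one way to consume this detailed output in Python. It rests on an assumption that is not stated above: each entry in `data` is read as `[read position, reference position, probability on a 0-255 scale]`. This reading agrees with the sample (each second value minus the first equals the alignment start of 23), but please check the library documentation before relying on it. The function name `count_modified` and the threshold of 128 are choices made for this example.

```python
import json

# The sample `--detailed` output from above, written compactly.
DETAILED = json.loads("""
[{"alignment_type": "primary_forward",
  "alignment": {"start": 23, "end": 71, "contig": "dummyIII", "contig_id": 2},
  "mod_table": [{"base": "T", "is_strand_plus": true, "mod_code": "T",
                 "data": [[3, 26, 221], [8, 31, 242], [27, 50, 3],
                          [39, 62, 47], [47, 70, 239]]}],
  "read_id": "a4f36092-b4d5-47a9-813e-c22c3b477a0c",
  "mapq": 255, "seq_len": 48}]
""")


def count_modified(record: dict, threshold: int = 128) -> dict:
    """Count calls per modification whose third value is >= threshold.

    Assumption: the third value of each triplet is a 0-255 probability.
    """
    counts = {}
    for entry in record["mod_table"]:
        strand = "+" if entry["is_strand_plus"] else "-"
        key = f'{entry["base"]}{strand}{entry["mod_code"]}'
        counts[key] = sum(1 for _, _, prob in entry["data"] if prob >= threshold)
    return counts


print(count_modified(DETAILED[0]))
```

Under that assumption, three of the five calls pass the threshold, which is consistent with the `T+T:3` summary in the shorter sample output.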

## `nanalogue read-table-hide-mods`

Prints basecalled length, alignment length, read id,
and optionally other information such as sequence per molecule.
This command does not expect modification data at all.
A sample output snippet follows.

```text
read_id align_length    sequence_length_template        alignment_type
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48, 0   48      primary_forward, unmapped
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    8       8       primary_forward
fffffff1-10d2-49cb-8ca3-e8d48979001b    33      33      primary_reverse
```

## `nanalogue read-table-show-mods`

Prints basecalled length, alignment length, read id, modification counts,
and optionally other information such as sequence per molecule.
If modification data is not available, any modification-related columns
have outputs like "NA". A sample output snippet follows.

```text
# mod-unmod threshold is 0.5
read_id align_length    sequence_length_template        alignment_type  mod_count
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48, 0   48      primary_forward, unmapped       T:3, T:3;7200:0
fffffff1-10d2-49cb-8ca3-e8d48979001b    33      33      primary_reverse T:1
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    8       8       primary_forward T:0
```
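
A short Python sketch for loading such a table follows. It assumes the columns are tab-separated and that lines beginning with `#` are comments, as the sample suggests; `read_table` is a hypothetical helper name.

```python
import csv

# The sample table from above, with tabs written explicitly.
TABLE = (
    "# mod-unmod threshold is 0.5\n"
    "read_id\talign_length\tsequence_length_template\talignment_type\tmod_count\n"
    "a4f36092-b4d5-47a9-813e-c22c3b477a0c\t48, 0\t48\tprimary_forward, unmapped\tT:3, T:3;7200:0\n"
    "fffffff1-10d2-49cb-8ca3-e8d48979001b\t33\t33\tprimary_reverse\tT:1\n"
    "5d10eb9a-aae1-4db8-8ec6-7ebb34d32575\t8\t8\tprimary_forward\tT:0\n"
)


def read_table(text: str) -> list:
    """Parse the table into a list of dicts, skipping '#' comment lines."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


rows = read_table(TABLE)
for row in rows:
    # A read id with several alignments has comma-separated values in one cell.
    align_lengths = [int(x) for x in row["align_length"].split(",")]
    print(row["read_id"], align_lengths, row["mod_count"])
```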

## `nanalogue read-stats`
Calculates various summary statistics on all reads. A sample output follows.

```text
key     value
n_primary_alignments    3
n_secondary_alignments  0
n_supplementary_alignments      0
n_unmapped_reads        1
n_reversed_reads        1
align_len_mean  29
align_len_max   48
align_len_min   8
align_len_median        8
align_len_n50   48
seq_len_mean    34
seq_len_max     48
seq_len_min     8
seq_len_median  33
seq_len_n50     48
```
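
For readers unfamiliar with the `n50` rows, the Python sketch below uses the common definition of N50: the length L such that reads of length L or longer account for at least half of the total bases. Whether nanalogue handles ties and rounding in exactly this way is an assumption, and the input lengths are made up. The `parse_stats` helper is also hypothetical; it assumes tab-separated integer values as in the sample.

```python
def n50(lengths: list) -> int:
    """Common N50 definition; not necessarily nanalogue's exact implementation."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 of an empty list is undefined")


def parse_stats(text: str) -> dict:
    """Turn the key/value table into a dict, skipping the header row."""
    rows = [ln.split("\t") for ln in text.splitlines() if ln]
    return {key: int(value) for key, value in rows[1:]}


# Made-up lengths: total is 100, and 40 + 30 is the first running sum >= 50.
print(n50([10, 20, 30, 40]))

# An excerpt of the sample output above, with tabs written explicitly.
stats = parse_stats("key\tvalue\nn_primary_alignments\t3\nn_unmapped_reads\t1\n")
print(stats["n_primary_alignments"])
```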

## `nanalogue find-modified-reads`
Finds names of modified reads using criteria specified by subcommands,
e.g. at least one window with a modification density above
some value (`any-dens-above`). Please run
`nanalogue find-modified-reads --help` to learn more.
The output is a list of read ids that satisfy the specified criterion, e.g.

```text
a4f36092-b4d5-47a9-813e-c22c3b477a0f
5d10eb9a-aae1-4db8-8ec6-7ebb34d32576
fffffff1-10d2-49cb-8ca3-e8d48979001a
```

## `nanalogue window-dens`
Outputs windowed modification densities of reads. Sample output follows.

```text
#contig ref_win_start   ref_win_end     read_id win_val strand  base    mod_strand      mod_type        win_start       win_end basecall_qual
dummyI  9       13      read1   0       +       T       +       T       0       3       2
dummyI  12      14      read1   0       +       T       +       T       2       4       3
dummyI  13      17      read1   0       +       T       +       T       3       7       32
```
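
The Python sketch below loads this table and averages `win_val` per read. It assumes tab-separated columns and a header line prefixed with `#`, as in the sample; the helper names are hypothetical.

```python
from collections import defaultdict

# The sample output from above, with tabs written explicitly.
WINDOWS = (
    "#contig\tref_win_start\tref_win_end\tread_id\twin_val\tstrand\tbase\t"
    "mod_strand\tmod_type\twin_start\twin_end\tbasecall_qual\n"
    "dummyI\t9\t13\tread1\t0\t+\tT\t+\tT\t0\t3\t2\n"
    "dummyI\t12\t14\tread1\t0\t+\tT\t+\tT\t2\t4\t3\n"
    "dummyI\t13\t17\tread1\t0\t+\tT\t+\tT\t3\t7\t32\n"
)


def parse_windows(text: str) -> list:
    """Parse the window table into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln]
    header = lines[0].lstrip("#").split("\t")
    return [dict(zip(header, ln.split("\t"))) for ln in lines[1:]]


def mean_value_per_read(rows: list) -> dict:
    """Average the `win_val` column over all windows of each read."""
    values = defaultdict(list)
    for row in rows:
        values[row["read_id"]].append(float(row["win_val"]))
    return {read_id: sum(v) / len(v) for read_id, v in values.items()}


windows = parse_windows(WINDOWS)
print(mean_value_per_read(windows))
```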

## `nanalogue window-grad`
Outputs windowed gradients of reads.
Sample output is similar to the one above but with gradients reported instead of
mean modification densities in the `win_val` column.

## `nanalogue peek`
Reads the BAM file header and the first 100 records, and outputs contig information
and modification tag information. Sample output follows.

```text
contigs_and_lengths:
dummyI  22
dummyII 48
dummyIII        76

modifications:
G-7200
T+T
```

If no modifications are found in the first 100 records, you will see

```text
contigs_and_lengths:
dummyI  22
dummyII 48
dummyIII        76

modifications:
None
```
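
A Python sketch for turning the `peek` output into structured data follows. It assumes the two-section layout shown above and that each modification line has the form base, strand sign, modification code (as in `T+T` and `G-7200`); `parse_peek` is a hypothetical helper name.

```python
import re

# The first sample output from above, with tabs written explicitly.
PEEK = (
    "contigs_and_lengths:\n"
    "dummyI\t22\n"
    "dummyII\t48\n"
    "dummyIII\t76\n"
    "\n"
    "modifications:\n"
    "G-7200\n"
    "T+T\n"
)


def parse_peek(text: str):
    """Return ({contig: length}, [(base, strand, mod_code), ...])."""
    contigs, mods = {}, []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            section = line[:-1]             # section header, e.g. "modifications"
        elif section == "contigs_and_lengths":
            name, length = line.split()
            contigs[name] = int(length)
        elif section == "modifications" and line != "None":
            match = re.fullmatch(r"([A-Za-z])([+-])(\w+)", line)
            if match is None:
                raise ValueError(f"unexpected modification line: {line!r}")
            mods.append(match.groups())
    return contigs, mods


contigs, mods = parse_peek(PEEK)
print(contigs)
print(mods)
```

When the modifications section contains only `None`, the returned list of modifications is empty.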

# Contributing

Contributions are welcome! Please see [CONTRIBUTIONS.md](CONTRIBUTIONS.md)
for guidelines on how to contribute to this project.

# Security

For security concerns and vulnerability reporting, please see [SECURITY.md](SECURITY.md).

# Third-Party Notices

This repository vendors a small number of third-party crates to keep the build
reproducible and to work around toolchain-specific issues in the HTSlib Rust
bindings. See [THIRD_PARTY_NOTICES.md](THIRD_PARTY_NOTICES.md) for the full
list of vendored crates, their local paths, their license files, and the patch
files that show the exact changes from upstream.

# Changelog

Changelog of the project is at [CHANGELOG.md](CHANGELOG.md).

# Acknowledgments

This software was developed at the Earlham Institute in the UK.
This work was supported by the Biotechnology and Biological Sciences
Research Council (BBSRC), part of UK Research and Innovation,
through the Core Capability Grant BB/CCG2220/1 at the Earlham Institute
and the Earlham Institute Strategic Programme Grant Cellular Genomics
BBX011070/1 and its constituent work packages BBS/E/ER/230001B
(CellGen WP2 Consequences of somatic genome variation on traits).
The work was also supported by the following response-mode project grants:
BB/W006014/1 (Single molecule detection of DNA replication errors) and
BB/Y00549X/1 (Single molecule analysis of Human DNA replication).
This research was supported in part by NBI Research Computing
through use of the High-Performance Computing system and Isilon storage.