seqtui 0.1.1

Fast TUI toolkit for viewing, translating, and manipulating biological sequences.
================================================================================
SeqTUI - Developer Reference Document
================================================================================

Last updated: December 2025

This document provides context for continuing development on SeqTUI, a terminal-
based sequence viewer and toolkit (FASTA, PHYLIP, NEXUS) written in Rust. It 
captures key design decisions, architecture choices, and lessons learned.

================================================================================
PROJECT GOAL
================================================================================

SeqTUI aims to be a fast, memory-efficient terminal viewer AND command-line 
toolkit for sequences (aligned or not). Key goals:

1. Handle very large files (500MB+, millions of nucleotides per sequence)
2. Vim-style navigation for bioinformaticians comfortable with CLI
3. Support NT→AA translation with all 33 NCBI genetic codes
4. Color-coded display matching Seaview conventions
5. Minimal dependencies, easy deployment on HPC clusters
6. CLI mode for batch processing (convert, translate, concatenate)
7. Supermatrix building for phylogenetics (multi-gene concatenation)
8. VCF export for isolated biallelic SNPs with flanking distance filter
9. File browser for interactive file selection (:e command, or launch without args)

================================================================================
ARCHITECTURE
================================================================================

The codebase follows an event-driven MVC pattern:

src/
├── main.rs         - Entry point, CLI args, jemalloc allocator setup
├── lib.rs          - Module exports
├── model.rs        - Core data structures and application state
├── fasta.rs        - FASTA parsing with memory optimization
├── formats/        - Multi-format support module
│   ├── mod.rs      - Format detection (extension + content) and unified API
│   ├── nexus.rs    - NEXUS parser (token-based per spec)
│   └── phylip.rs   - PHYLIP parser (sequential + interleaved)
├── event.rs        - Keyboard input handling (Action enum, apply_action)
├── ui.rs           - TUI rendering with ratatui
├── controller.rs   - Main loop, background loading, channel-based messaging
└── genetic_code.rs - 33 NCBI genetic codes and translation logic

Key design pattern: Events -> Actions -> State mutations -> Render
Async pattern: Background thread -> Channel -> Main loop polls -> State update

================================================================================
KEY DATA STRUCTURES (model.rs)
================================================================================

Sequence {
    name: String,           // Sequence identifier
    data: Vec<u8>,          // Raw sequence bytes (NOT String - memory optimization)
}

Alignment {
    sequences: Vec<Sequence>,
    warning: Option<String>,  // Unequal length warning
}

AppState {
    alignment: Alignment,
    translated_alignment: Option<Alignment>,  // Cached AA translation
    cached_translation_code_id: Option<u8>,   // Code used for cached translation
    cached_translation_frame: Option<usize>,  // Frame used for cached translation
    view_mode: ViewMode,      // Nucleotide or AminoAcid
    loading_state: LoadingState,  // Ready, LoadingFile, Translating
    spinner_frame: usize,     // Animation frame (0-3)
    viewport: Viewport,       // What's visible on screen
    cursor: Cursor,           // Current position
    mode: AppMode,            // Normal, Command, Search, TranslationSettings
    help_tab: HelpTab,        // Current help tab (5 tabs)
    pending_g: bool,          // For g-prefix commands
    pending_z: bool,          // For z-prefix commands (zH, zL)
    ...
}

LoadingState {
    Ready,                    // No loading in progress
    LoadingFile { path, message, sequences_loaded },
    Translating { message, sequences_done, total },
}

================================================================================
PERFORMANCE OPTIMIZATIONS
================================================================================

These were critical for handling 500MB+ files (47 sequences × 11M nucleotides):

1. SEQUENCE STORAGE: Vec<u8> instead of String
   - Eliminates UTF-8 validation overhead
   - Direct byte indexing, with no UTF-8 char-boundary handling per character
   - ~30% memory reduction

2. FASTA PARSING: Bulk read for large files
   - Files >1MB: read entire file to Vec<u8>, then parse
   - Avoids per-line allocation overhead
   - Pre-allocated capacity based on file size

3. TRANSLATION: Array-based codon lookup
   - [u8; 64] array instead of HashMap for codon table
   - Inline function base_to_index() for 2-bit encoding
   - No string allocation per codon (works directly in bytes)
   - translate_sequence(&[u8]) -> Vec<u8> (no String intermediates)

4. MEMORY ALLOCATOR: jemalloc (tikv-jemallocator)
   - In theory, helps return freed memory to OS
   - In practice, the difference is minimal on modern systems
   - Kept in case it helps on some HPC/cluster environments

5. REMOVED: Parallel translation with rayon
   - Initially added for speed, but added the overhead of 15 extra threads
   - Single-threaded is fast enough for interactive use
   - Simpler code, lower memory footprint
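
The array-based lookup from item 3 can be sketched as follows. This is a
minimal illustration, not the code in genetic_code.rs: the T/C/A/G index order,
the extra `table` and `frame` parameters, and the choice to emit 'X' for any
codon containing a non-ACGTU byte are assumptions made for the example. Only
the standard code (NCBI table 1) is shown.

```rust
/// Standard genetic code (NCBI table 1), codons ordered T, C, A, G.
/// The index of a codon is 16 * first + 4 * second + third.
const STANDARD_CODE: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Map a nucleotide byte to its 2-bit index (T/U=0, C=1, A=2, G=3).
/// Returns None for gaps, ambiguity codes, and anything else.
#[inline]
fn base_to_index(b: u8) -> Option<usize> {
    match b.to_ascii_uppercase() {
        b'T' | b'U' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None,
    }
}

/// Translate raw bytes to amino-acid bytes, starting at `frame` (0, 1, or 2).
/// No String is allocated; a trailing partial codon is dropped.
fn translate_sequence(seq: &[u8], table: &[u8; 64], frame: usize) -> Vec<u8> {
    let start = frame.min(seq.len());
    let mut out = Vec::with_capacity((seq.len() - start) / 3);
    for codon in seq[start..].chunks_exact(3) {
        let indices = (
            base_to_index(codon[0]),
            base_to_index(codon[1]),
            base_to_index(codon[2]),
        );
        let aa = match indices {
            (Some(a), Some(b), Some(c)) => table[16 * a + 4 * b + c],
            // Assumption for this sketch: unresolved codons become 'X'.
            _ => b'X',
        };
        out.push(aa);
    }
    out
}
```

For example, translate_sequence(b"ATGGCCTAA", STANDARD_CODE, 0) yields the
bytes "MA*". Swapping genetic codes only means passing a different 64-byte
table.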

================================================================================
ASYNC LOADING ARCHITECTURE
================================================================================

The TUI opens IMMEDIATELY when the user runs seqtui. File parsing happens in
a background thread while a loading spinner is displayed.

Components:

1. LoadingState enum (model.rs)
   - Ready: No loading, alignment is displayed
   - LoadingFile: Parsing in progress, shows spinner
   - Translating: Translation in progress (future use)

2. LoadMessage enum (controller.rs)
   - Complete(Alignment): Parsing succeeded
   - Error(String): Parsing failed
   - Progress { sequences_loaded }: Streaming updates (future use)

3. Background loading flow:
   a. main.rs calls run_app_with_loading(path, format)
   b. Controller creates AppState in LoadingFile state
   c. Controller spawns std::thread for parsing
   d. TUI renders immediately with spinner overlay
   e. Main loop polls channel (non-blocking try_recv)
   f. On LoadMessage::Complete, state.set_alignment() is called
   g. Spinner disappears, alignment is shown

4. Spinner animation:
   - Braille characters: ⠋ ⠙ ⠹ ⠸ (4 frames)
   - tick_spinner() advances frame each render loop (~50ms)
   - spinner_char() returns current frame character

5. Error handling:
   - If parsing fails, LoadMessage::Error is sent
   - state.set_loading_error() displays error in status bar
   - User can quit with :q

Future: Could add streaming parser that sends Progress messages as sequences
are parsed, updating the count in the loading overlay.
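
The background loading flow above can be sketched with std::thread and
std::sync::mpsc alone. This is a simplified illustration under stated
assumptions: the Alignment here only keeps sequence names, the "parser" reads
FASTA headers from in-memory text rather than a file path, and spawn_loader()
and wait_for_load() are hypothetical names, not functions from controller.rs.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread;
use std::time::Duration;

/// Simplified stand-in for the real Alignment type (names only).
struct Alignment {
    names: Vec<String>,
}

/// Messages sent from the loader thread to the main loop.
enum LoadMessage {
    Complete(Alignment),
    Error(String),
}

/// The four Braille spinner frames.
const SPINNER: [char; 4] = ['⠋', '⠙', '⠹', '⠸'];

/// Start parsing on a background thread and return the receiving end.
/// The toy parser only collects FASTA header names from `text`.
fn spawn_loader(text: String) -> Receiver<LoadMessage> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let names: Vec<String> = text
            .lines()
            .filter_map(|line| line.strip_prefix('>'))
            .map(|name| name.trim().to_string())
            .collect();
        let msg = if names.is_empty() {
            LoadMessage::Error("no sequences found".to_string())
        } else {
            LoadMessage::Complete(Alignment { names })
        };
        // The receiver may already be gone if the user quit; ignore that.
        let _ = tx.send(msg);
    });
    rx
}

/// Poll the channel without blocking until a message arrives, advancing the
/// spinner once per iteration. Returns the message and the last frame index.
fn wait_for_load(rx: &Receiver<LoadMessage>) -> (LoadMessage, usize) {
    let mut spinner_frame = 0;
    loop {
        match rx.try_recv() {
            Ok(msg) => return (msg, spinner_frame),
            Err(TryRecvError::Empty) => {
                // A real loop would draw SPINNER[spinner_frame] and read keys.
                spinner_frame = (spinner_frame + 1) % SPINNER.len();
                thread::sleep(Duration::from_millis(50));
            }
            Err(TryRecvError::Disconnected) => {
                let msg = LoadMessage::Error("loader thread exited".to_string());
                return (msg, spinner_frame);
            }
        }
    }
}
```

Because try_recv() never blocks, the same loop iteration that checks for a
message can also redraw the screen and process keyboard input, which is what
keeps the TUI responsive while a large file is parsed.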

================================================================================
VIM NAVIGATION DESIGN
================================================================================

Navigation is designed for Vim users but also works with arrows:

ARROW-CENTRIC:
  ←↑↓→           Move one position
  Shift+←→       Half page left/right  
  Shift+↑↓       Page up/down

VIM-CENTRIC:
  h/j/k/l        Move (left/down/up/right)
  Ctrl+U/D       Half page up/down
  zH/zL          Half page left/right (Vim's horizontal scroll)
  0/$            First/last column
  g0/gm/g$       First/middle/last VISIBLE column
  <num>|         Go to column N

NOTE: Ctrl+Arrow doesn't work on macOS (captured by system for Spaces/Mission
Control), so Shift+Arrow is used instead. Both Ctrl and Shift are supported
in code for cross-platform compatibility.

================================================================================
PENDING STATE PATTERN
================================================================================

For multi-key commands (g0, gm, g$, zH, zL), we use a "pending" state:

1. User presses 'g' → set pending_g = true
2. Next key (0, m, or $) triggers the actual command
3. Command calls clear_pending() to reset

IMPORTANT: Any action triggered by a pending state must call clear_pending()
or the app becomes unresponsive (all subsequent keys go to the pending handler
which returns Action::None for unknown keys).

See: set_pending_g(), set_pending_z(), clear_pending() in model.rs
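
A minimal sketch of the pending-state pattern, assuming a simplified key type
(char) and a reduced Action enum. The names PendingKeys and handle_key() are
hypothetical. Unlike the description above, this sketch clears the pending
flag as soon as the second key arrives, whether or not it is recognized, which
is one way to make the "forgot to call clear_pending()" bug impossible.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Action {
    None,
    MoveLeft,
    MoveRight,
    FirstVisibleColumn,
    MiddleVisibleColumn,
    LastVisibleColumn,
    HalfPageLeft,
    HalfPageRight,
}

/// Only the two pending flags from AppState are modelled here.
#[derive(Default)]
struct PendingKeys {
    pending_g: bool,
    pending_z: bool,
}

impl PendingKeys {
    fn clear_pending(&mut self) {
        self.pending_g = false;
        self.pending_z = false;
    }
}

/// Translate one key press into an Action, honouring g- and z-prefixes.
fn handle_key(state: &mut PendingKeys, key: char) -> Action {
    if state.pending_g {
        // Reset first, so an unknown second key cannot leave the app stuck.
        state.clear_pending();
        return match key {
            '0' => Action::FirstVisibleColumn,
            'm' => Action::MiddleVisibleColumn,
            '$' => Action::LastVisibleColumn,
            _ => Action::None,
        };
    }
    if state.pending_z {
        state.clear_pending();
        return match key {
            'H' => Action::HalfPageLeft,
            'L' => Action::HalfPageRight,
            _ => Action::None,
        };
    }
    match key {
        'g' => {
            state.pending_g = true;
            Action::None
        }
        'z' => {
            state.pending_z = true;
            Action::None
        }
        'h' => Action::MoveLeft,
        'l' => Action::MoveRight,
        _ => Action::None,
    }
}
```

With this shape, pressing g then m returns MiddleVisibleColumn, and pressing
z then an unrelated key returns Action::None while still leaving the next key
press free to be handled normally.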

================================================================================
HELP SYSTEM
================================================================================

Tabbed help overlay with 5 sections:
- Basics: Getting started, :q, :h, :<number>
- Arrow Nav: Arrow key navigation
- Vim Nav: Vim-style navigation  
- Search: /, ?, n, N
- Translation: :asAA, :asNT, :setcode

Navigate tabs with ←/→, h/l, or Tab. Any other key closes help.

State: help_tab: HelpTab in AppState
Actions: HelpNextTab, HelpPrevTab, DismissHelp

================================================================================
TRANSLATION SYSTEM
================================================================================

Nucleotide to amino acid translation:
- 33 NCBI genetic codes supported (Standard, Vertebrate Mito, etc.)
- 3 reading frames (+1, +2, +3)
- Translation settings UI with j/k for code, h/l for frame

TRANSLATION CACHING:
The translated alignment is cached in `translated_alignment` along with 
metadata tracking which settings were used:
- cached_translation_code_id: Option<u8> - Genetic code ID used
- cached_translation_frame: Option<usize> - Frame used (0, 1, or 2)

When user types :asNT, we switch view_mode back to Nucleotide but KEEP the
cached translation. When typing :asAA again:
1. has_valid_cached_translation() checks if cached settings match current
2. If match: switch_to_cached_aa_view() - instant, no recomputation
3. If no match: start_background_translation() - recompute in background

This enables rapid NT↔AA toggling without recomputation, which is important
for large alignments where translation can take several seconds.

CACHE INVALIDATION:
Cache is invalidated when genetic_code_id or frame changes:
- User opens :setcode dialog and changes settings
- Cache metadata won't match, triggering recomputation

Memory note: Dropping translated_alignment with jemalloc properly returns
memory to OS. Without jemalloc, memory stays allocated.
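
The cache check described above might look roughly like this. It is a sketch
under assumptions: the cache is pulled out into its own TranslationCache
struct (in the real code these fields live in AppState), the translated data
is modelled as Vec<Vec<u8>>, and store() and plan_aa_view() are hypothetical
helpers.

```rust
/// The three cache-related fields, isolated from the rest of AppState.
#[derive(Default)]
struct TranslationCache {
    translated_alignment: Option<Vec<Vec<u8>>>,
    cached_translation_code_id: Option<u8>,
    cached_translation_frame: Option<usize>,
}

/// What :asAA should do given the current settings.
#[derive(Debug, PartialEq)]
enum AaViewPlan {
    UseCache,
    Recompute,
}

impl TranslationCache {
    /// True only if a translation exists AND was made with these settings.
    fn has_valid_cached_translation(&self, code_id: u8, frame: usize) -> bool {
        self.translated_alignment.is_some()
            && self.cached_translation_code_id == Some(code_id)
            && self.cached_translation_frame == Some(frame)
    }

    /// Record a finished translation together with the settings used.
    fn store(&mut self, translated: Vec<Vec<u8>>, code_id: u8, frame: usize) {
        self.translated_alignment = Some(translated);
        self.cached_translation_code_id = Some(code_id);
        self.cached_translation_frame = Some(frame);
    }

    /// Decide between the instant path and a background recomputation.
    fn plan_aa_view(&self, code_id: u8, frame: usize) -> AaViewPlan {
        if self.has_valid_cached_translation(code_id, frame) {
            AaViewPlan::UseCache
        } else {
            AaViewPlan::Recompute
        }
    }
}
```

Storing the settings next to the data means invalidation needs no explicit
step: changing the code or frame in :setcode simply makes the comparison fail
on the next :asAA.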

================================================================================
COLOR SCHEME
================================================================================

Nucleotides (DNA/RNA):
  A: Red background
  C: Green background
  G: Yellow background
  T/U: Blue background

Amino acids (Seaview-style, grouped by chemical property):
  Hydrophobic (AFILMVW): Yellow
  Polar (NQST): Green
  Charged+ (KRH): Magenta/Red
  Charged- (DE): Blue
  Special (CGP): Orange/Cyan/Gray
  Stop (*): Red on white

================================================================================
FILE HANDLING
================================================================================

Multi-format support with auto-detection:

FORMAT DETECTION STRATEGY (in parse_file_with_options):
-------------------------------------------------------

The detection follows a cascading fallback strategy:

1. EXPLICIT FORMAT (-f/--format option)
   If user specifies -f nexus, we use NEXUS parser directly.
   If parsing fails, we return the error (no fallback).

2. FILE EXTENSION (if no -f option)
   We try the parser matching the extension (.fasta → FASTA parser).
   IMPORTANT: If extension-based parsing FAILS, we SILENTLY fall through
   to content detection. No warning is displayed.
   
   Example: seq.fasta containing NEXUS data
   - Try FASTA parser (fails because #NEXUS is not a valid FASTA header)
   - Fall through to step 3

3. CONTENT-BASED DETECTION
   Examine first non-empty line:
   - Starts with "#NEXUS" (case-insensitive) → NEXUS
   - Starts with ">" → FASTA  
   - Two integers (ntax nchar) → PHYLIP
   
   If detected, parse with that format.
   If parsing fails, return the error (no further fallback).

4. TRY ALL PARSERS (last resort)
   If content detection returns None, try each parser in order:
   FASTA → NEXUS → PHYLIP
   Return first success, or UnknownFormat error if all fail.

CURRENT BEHAVIOR SUMMARY:
- Extension is a HINT, not authoritative
- Extension mismatch produces NO warning (silent fallback)
- Content signature is trusted when found
- User can always override with -f/--format

POTENTIAL IMPROVEMENT:
Could add warning when extension doesn't match detected/successful format:
  "Warning: file.fasta was parsed as NEXUS (extension suggests FASTA)"
Currently NOT implemented - parsing succeeds silently.

Format detection priority:
1. Explicit -f/--format CLI option (fasta, phylip, nexus, auto)
2. File extension (.fasta, .phy, .nex, etc.) - with silent fallback on failure
3. Content detection (looks for format signatures)
4. Try all parsers as fallback
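
The content-based step (step 3) can be sketched as a small function over the
first non-empty line. The Format enum and detect_by_content() are hypothetical
names for this example, and treating "two integers" as "the first two
whitespace-separated fields parse as unsigned integers" is an assumption about
how the PHYLIP header is recognized.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Format {
    Fasta,
    Nexus,
    Phylip,
}

/// Guess the format from the first non-empty line, or None if no signature
/// is recognized (the caller would then try every parser in turn).
fn detect_by_content(text: &str) -> Option<Format> {
    let line = text.lines().map(str::trim).find(|l| !l.is_empty())?;

    // "#NEXUS" in any letter case. get(..6) avoids slicing inside a
    // multi-byte character on non-ASCII input.
    let is_nexus = line
        .get(..6)
        .map_or(false, |prefix| prefix.eq_ignore_ascii_case("#NEXUS"));
    if is_nexus {
        return Some(Format::Nexus);
    }
    if line.starts_with('>') {
        return Some(Format::Fasta);
    }

    // PHYLIP header: "ntax nchar", e.g. "5 200".
    let mut fields = line.split_whitespace();
    match (fields.next(), fields.next()) {
        (Some(ntax), Some(nchar))
            if ntax.parse::<usize>().is_ok() && nchar.parse::<usize>().is_ok() =>
        {
            Some(Format::Phylip)
        }
        _ => None,
    }
}
```

In the seq.fasta example above, the FASTA parser would fail first, and this
function would then return Format::Nexus for the "#NEXUS" first line.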

FASTA parsing handles:
- Standard multi-line FASTA
- Lines starting with > are headers
- Sequences can span multiple lines
- Automatic uppercase conversion
- Warning if sequences have different lengths (invalid alignment)

PHYLIP parsing handles:
- Sequential format (all of sequence on consecutive lines)
- Interleaved format (detected by line count vs NCHAR)
- Relaxed names (any length, not just 10 chars)
- Strict 10-char names for legacy files

NEXUS parsing handles:
- Token-based parsing per NEXUS specification
- DATA and CHARACTERS blocks
- Sequential and INTERLEAVE formats
- MATCHCHAR substitution (e.g., '.' = same as reference sequence)
- Multi-line FORMAT commands
- Inline comments like [1], [annotation]
- Case-insensitive commands
- Quoted sequence names

Edge cases:
- Empty files: Error
- Files without headers: Creates "Unknown" sequence (FASTA)
- Very long lines: Handled (bulk read approach)
- Unknown format: Helpful error with format hints

================================================================================
TESTING
================================================================================

82 unit tests covering:
- Event handling and key mappings
- FASTA parsing edge cases
- PHYLIP parsing (sequential, interleaved, relaxed names)
- NEXUS parsing (simple, interleaved, quoted names, MATCHCHAR)
- Format detection (extension, content, fallback)
- Real-world file tests (LOC_01790.nex with 27 sequences)
- Genetic code translation
- Model state transitions
- Color assignments
- VCF export (biallelic SNPs, flanking distance, missing genotypes)

Run: cargo test

================================================================================
DEPENDENCIES
================================================================================

Cargo.toml key dependencies:
- ratatui: TUI framework
- crossterm: Terminal backend (cross-platform)
- anyhow/thiserror: Error handling
- clap: CLI argument parsing
- tikv-jemallocator: Alternative memory allocator (kept for potential benefits)

================================================================================
KNOWN ISSUES / FUTURE IMPROVEMENTS
================================================================================

Potential enhancements:
1. Selection/copy to clipboard
2. Sequence statistics (GC content, length)
3. Consensus sequence display
4. Export selected region
5. Mouse support for clicking
6. Reverse complement view (frames 4-6 for translation)
7. Multiple file comparison
8. Memory-efficient supermatrix (streaming write instead of full in-memory)
9. Delimiter field selection (-f1,2 like Unix cut); would need a flag other
   than -f, which is already used for --format

Performance notes:
- Initial load of 500MB file: ~2-3 seconds
- Translation of 500MB: ~1 second
- Memory usage: stable during interactive use

================================================================================
CLI MODE & CONCATENATION (main.rs)
================================================================================

SeqTUI has two modes:

1. TUI MODE (default)
   - Interactive viewer with Vim-style navigation
   - Triggered when no -o/--output is specified
   - Can launch with or without file arguments
   - Without args: opens file browser to select files
   - With args: opens specified files directly
   - :e command opens file browser from within viewer

2. CLI MODE (with -o)
   - Batch processing: convert, translate, concatenate
   - Single-line FASTA output (pipe-friendly)
   - Triggered by -o/--output

CLI OPTIONS:
  -o, --output        Output file (or "-" for stdout)
  -t, --translate     Translate NT to AA
  -g, --genetic-code  Genetic code (1-33, default: 1)
  -r, --reading-frame Reading frame (1-3, default: 1)
  -d, --delimiter     ID matching delimiter (uses first field)
  -s, --supermatrix   Fill missing sequences (default '-', or custom char)
  -p, --partitions    Write partition file
  -v, --vcf           Extract biallelic SNPs to VCF (value = min flanking dist)
  --force             Bypass safety checks (orphan IDs, non-NT files)

SINGLE FILE CLI:
  run_cli_mode() - parse, optionally translate, write FASTA

MULTI-FILE CONCATENATION:
  run_concatenation_mode() - merge sequences by ID matching

CONCATENATION ALGORITHM:
  Pass 1: Collect all unique sequence IDs across files
          Track how many files each ID appears in (for orphan detection)
          Validate alignments if -s (supermatrix mode)
          Record alignment length per file (for gap filling)
          Check orphan ratio: if >30% IDs appear in only 1 file, abort
            - Writes <output>_<random>.log with all IDs (orphans marked with *)
            - Suggests -d delimiter or --force to proceed
  
  Pass 2: For each file:
          - Parse and optionally translate
          - For each known ID: append sequence or gaps (if -s and missing)
          - Track partition boundaries
  
  Output: Write concatenated sequences + optional partition file
          Always writes log file with per-file stats and warnings

LOG FILE NAMING CONVENTION:
  All log files use the pattern: <prefix>_<6_random_chars>.log
  - If output file specified: prefix = output file stem
    Example: -o results/supermatrix.fasta → results/supermatrix_a7f3k2.log
  - If output is stdout (-o -) or none: prefix = "seqtui"
    Example: -o - → seqtui_b2x9m1.log (in current directory)
  - Random suffix prevents overwrites in HPC parallel jobs
  - Files grouped with output (ls supermatrix* shows all related files)

ID MATCHING:
  - Default: full sequence ID
  - With -d "_": extract first field before delimiter
  - Example: "Human_ENS001" with -d "_" matches "Human_LOC789" on "Human"
  - extract_key(id, delimiter) function handles this
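
  A minimal sketch of what extract_key() could look like. The Option<&str>
  delimiter parameter and the "empty delimiter means full ID" rule are
  assumptions for this example; only the function name and the first-field
  behaviour come from the notes above.

```rust
/// Return the matching key for a sequence ID: the full ID by default, or the
/// text before the first occurrence of `delimiter` when one is given.
fn extract_key<'a>(id: &'a str, delimiter: Option<&str>) -> &'a str {
    match delimiter {
        // split() always yields at least one item, so unwrap_or is a fallback
        // that is never expected to trigger.
        Some(d) if !d.is_empty() => id.split(d).next().unwrap_or(id),
        _ => id,
    }
}
```

  With -d "_", both "Human_ENS001" and "Human_LOC789" map to the key "Human";
  without a delimiter they stay distinct.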

ORPHAN ID DETECTION:
  - Orphan = ID that appears in only one input file
  - If orphan_count / total_output_ids > 0.30, likely delimiter problem
  - Error message suggests -d and writes IDs to <output>_ids_<random>.log
  - --force bypasses this check
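
  The orphan check can be sketched as below, assuming the keys have already
  been extracted per file. find_orphans() and orphan_check_fails() are
  hypothetical names; the 0.30 threshold and the --force bypass follow the
  description above.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Abort threshold: more than 30% orphan IDs suggests a delimiter problem.
const ORPHAN_THRESHOLD: f64 = 0.30;

/// Given the IDs found in each input file, return the orphan IDs (present in
/// exactly one file) and the ratio orphans / total unique IDs.
fn find_orphans(files: &[Vec<&str>]) -> (Vec<String>, f64) {
    let mut file_counts: BTreeMap<&str, usize> = BTreeMap::new();
    for ids in files {
        // Deduplicate within a file so repeated IDs count once per file.
        let unique: BTreeSet<&str> = ids.iter().copied().collect();
        for id in unique {
            *file_counts.entry(id).or_insert(0) += 1;
        }
    }
    let orphans: Vec<String> = file_counts
        .iter()
        .filter(|(_, n)| **n == 1)
        .map(|(id, _)| id.to_string())
        .collect();
    let ratio = if file_counts.is_empty() {
        0.0
    } else {
        orphans.len() as f64 / file_counts.len() as f64
    };
    (orphans, ratio)
}

/// True when the run should abort; `force` mirrors the --force option.
fn orphan_check_fails(ratio: f64, force: bool) -> bool {
    !force && ratio > ORPHAN_THRESHOLD
}
```

  Two files whose IDs share no key at all (for example "Human_ENS001" and
  "Human_LOC789" without -d "_") give a ratio of 1.0 and trip the check.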

NUCLEOTIDE VALIDATION:
  - Translation and VCF modes require nucleotide sequences
  - Files with <50% ACGT characters (excluding gaps/N/?) are flagged
  - Error suggests the file may be amino acids
  - Details written to <output>_nt_check_<random>.log
  - --force bypasses this check
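
  One way to compute the nucleotide check is sketched here. acgt_fraction()
  and looks_like_nucleotide() are hypothetical names, and two details are
  assumptions: the fraction is taken over all characters that are not a gap,
  N, or ?, and a file with no countable characters is not flagged.

```rust
/// Fraction of A/C/G/T among the characters that are not '-', 'N', or '?'.
/// Returns None when nothing is left to count.
fn acgt_fraction(seqs: &[&[u8]]) -> Option<f64> {
    let mut acgt = 0usize;
    let mut counted = 0usize;
    for seq in seqs {
        for &b in seq.iter() {
            match b.to_ascii_uppercase() {
                // Excluded from both numerator and denominator.
                b'-' | b'N' | b'?' => {}
                b'A' | b'C' | b'G' | b'T' => {
                    acgt += 1;
                    counted += 1;
                }
                _ => counted += 1,
            }
        }
    }
    if counted == 0 {
        None
    } else {
        Some(acgt as f64 / counted as f64)
    }
}

/// False when fewer than 50% of the counted characters are A/C/G/T.
fn looks_like_nucleotide(seqs: &[&[u8]]) -> bool {
    acgt_fraction(seqs).map_or(true, |fraction| fraction >= 0.5)
}
```

  A protein sequence such as "MKVLAAGIW" has only 3 of 9 characters in ACGT,
  so it falls below the 50% cutoff and would be flagged.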

================================================================================
VCF MODE (main.rs)
================================================================================

Extract biallelic SNPs from alignments with flanking distance filter:
  seqtui alignment.fasta -v 300 -o snps.vcf

VCF MODE ALGORITHM:
  Pass 1: Collect all sequence IDs across files
          - Reference = first sequence of first file
          - Samples sorted alphabetically (reference first)
          - Validate alignment and nucleotide content
  
  Pass 2: For each file:
          - Single pass through sequences with bit flags:
            real_nt_only[pos]: true if site has only ACGT/N/?
            seen_nt[pos]: bit flags (A=1, C=2, G=4, T=8)
          - Derive polymorphic sites: !real_nt_only || popcount(seen_nt) > 1
          - Compute distLeft[i], distRight[i] using reset vector
          - Select biallelic sites: real_nt_only && popcount==2 && dist>=min
  
  Output: VCF with DL/DR in INFO field for filtering

VCF OUTPUT FORMAT:
  - Reference sequence from first file (sample column included)
  - Haploid genotypes: 0 (ref), 1 (alt), . (missing)
  - Missing genotype: sequence absent from file OR has N/? at position
  - Site excluded: any present sequence has gap (-) at that position
  - INFO: DL=distance_left;DR=distance_right
  - Each input file becomes a separate CHROM (basename without extension)

VCF LOG FILES:
  - Per-file SNP counts written to <output>_vcf_<random>.log (when >100 input files)
  - NT validation errors written to <output>_nt_check_<random>.log
  - Log files use same naming convention as concatenation mode

BIT FLAG OPTIMIZATION:
  - Alleles tracked with bit flags: A=1, C=2, G=4, T=8
  - Biallelic check: popcount(seen_nt[pos]) == 2
  - Single pass per file (no HashSet allocations)
  - O(n) distance computation using reset vector
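
  The bit-flag scan and the reset-counter distance pass can be sketched for a
  single file as follows. SiteScan, scan_sites(), and select_biallelic() are
  hypothetical names. How distances are counted when there is no polymorphic
  site on one side is not specified above; this sketch assumes the counter
  starts just outside the alignment edge.

```rust
/// Bit flag for an unambiguous base (A=1, C=2, G=4, T=8); 0 for anything else.
fn base_flag(b: u8) -> u8 {
    match b {
        b'A' => 1,
        b'C' => 2,
        b'G' => 4,
        b'T' => 8,
        _ => 0,
    }
}

struct SiteScan {
    real_nt_only: Vec<bool>, // true if the column holds only ACGT/N/?
    seen_nt: Vec<u8>,        // OR of the base flags seen in the column
}

/// Single pass over all sequences of one alignment.
fn scan_sites(seqs: &[&[u8]]) -> SiteScan {
    let len = seqs.first().map_or(0, |s| s.len());
    let mut scan = SiteScan { real_nt_only: vec![true; len], seen_nt: vec![0; len] };
    for seq in seqs {
        for (pos, &b) in seq.iter().enumerate().take(len) {
            match b.to_ascii_uppercase() {
                b'N' | b'?' => {} // missing data: neither allele nor violation
                upper => match base_flag(upper) {
                    0 => scan.real_nt_only[pos] = false, // gap or other symbol
                    flag => scan.seen_nt[pos] |= flag,
                },
            }
        }
    }
    scan
}

/// Return (0-based position, DL, DR) for each isolated biallelic SNP.
fn select_biallelic(scan: &SiteScan, min_dist: usize) -> Vec<(usize, usize, usize)> {
    let len = scan.seen_nt.len();
    let polymorphic: Vec<bool> = (0..len)
        .map(|i| !scan.real_nt_only[i] || scan.seen_nt[i].count_ones() > 1)
        .collect();

    // Left-to-right pass: the counter resets after every polymorphic site.
    let mut dist_left = vec![0usize; len];
    let mut counter = 0;
    for i in 0..len {
        counter += 1;
        dist_left[i] = counter;
        if polymorphic[i] {
            counter = 0;
        }
    }
    // Same idea, right to left.
    let mut dist_right = vec![0usize; len];
    counter = 0;
    for i in (0..len).rev() {
        counter += 1;
        dist_right[i] = counter;
        if polymorphic[i] {
            counter = 0;
        }
    }

    (0..len)
        .filter(|&i| scan.real_nt_only[i] && scan.seen_nt[i].count_ones() == 2)
        .filter(|&i| dist_left[i] >= min_dist && dist_right[i] >= min_dist)
        .map(|i| (i, dist_left[i], dist_right[i]))
        .collect()
}
```

  Note that a column containing a gap counts as polymorphic for the distance
  computation but can never be selected itself, which matches the "site
  excluded" rule in the output format section.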

VALIDATION:
  - -v requires -o (output file)
  - -v incompatible with -t, -s, -p
  - Files must be valid alignments (same length)
  - Files must be nucleotide (checked via NT validation)

VALIDATION (CONCATENATION OPTIONS):
  - -s and -p require multiple input files
  - -s requires aligned sequences (same length within each file)
  - -s accepts optional fill character (default '-', or '?', '.', etc.)
  - Clear error messages for invalid combinations

================================================================================
TRANSLATION IMPROVEMENTS
================================================================================

AMBIGUITY CODE HANDLING (genetic_code.rs):
Translation now handles common nucleotide ambiguity codes:
  - R = A or G (purine)
  - Y = T or C (pyrimidine)  
  - N or ? = any base

Rules:
  - Only 1 ambiguous position per codon (practical case)
  - All possible translations must yield the same AA
  - Otherwise returns 'X'

Examples:
  CTR → L (CTA=Leu, CTG=Leu, both Leu)
  GGN → G (all 4 codons = Gly)
  ATN → X (ATT/ATC/ATA=Ile, ATG=Met, mixed result)

Implementation:
  ambiguity_expansions(b) returns possible base indices
  translate_codon() expands and checks all combinations
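
A sketch of the expansion approach, using only the standard code (NCBI
table 1). The function names follow the notes above, but the signatures, the
T/C/A/G index order, and the decision to return 'X' for gaps or other unknown
bytes are assumptions made for this example.

```rust
/// Standard genetic code (NCBI table 1), codons ordered T, C, A, G.
const STANDARD_CODE: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Possible base indices (T=0, C=1, A=2, G=3) for a nucleotide byte.
/// Empty for bytes this sketch does not understand (gaps, other IUPAC codes).
fn ambiguity_expansions(b: u8) -> &'static [usize] {
    match b.to_ascii_uppercase() {
        b'T' | b'U' => &[0],
        b'C' => &[1],
        b'A' => &[2],
        b'G' => &[3],
        b'R' => &[2, 3],             // A or G (purine)
        b'Y' => &[0, 1],             // T or C (pyrimidine)
        b'N' | b'?' => &[0, 1, 2, 3], // any base
        _ => &[],
    }
}

/// Translate one codon, resolving a single ambiguous position when every
/// expansion gives the same amino acid; otherwise return b'X'.
fn translate_codon(codon: [u8; 3], table: &[u8; 64]) -> u8 {
    let exp = [
        ambiguity_expansions(codon[0]),
        ambiguity_expansions(codon[1]),
        ambiguity_expansions(codon[2]),
    ];
    let unknown = exp.iter().any(|e| e.is_empty());
    let ambiguous = exp.iter().filter(|e| e.len() > 1).count();
    if unknown || ambiguous > 1 {
        return b'X';
    }
    let mut result: Option<u8> = None;
    for &a in exp[0] {
        for &b in exp[1] {
            for &c in exp[2] {
                let aa = table[16 * a + 4 * b + c];
                match result {
                    None => result = Some(aa),
                    Some(prev) if prev != aa => return b'X', // mixed result
                    Some(_) => {}
                }
            }
        }
    }
    result.unwrap_or(b'X')
}
```

This reproduces the three examples above: translate_codon(*b"CTR", ...) gives
L, GGN gives G, and ATN gives X because ATG (Met) disagrees with the three
Ile codons.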

================================================================================
DEVELOPMENT WORKFLOW
================================================================================

# Build and run
cargo run -- test_data/alignment.fasta

# Run tests
cargo test

# Build release (for HPC)
cargo build --release

# Check for issues
cargo clippy

TEST STRUCTURE:
  src/lib.rs tests (74 tests):
    - formats/fasta.rs: FASTA parsing edge cases
    - formats/nexus.rs: NEXUS parsing (interleaved, matchchar, etc.)
    - formats/phylip.rs: PHYLIP parsing (sequential, interleaved)
    - formats/mod.rs: Format detection tests
    - genetic_code.rs: Translation and ambiguity codes
    - model.rs: State management, search, cursor movement
    - event.rs: Keyboard input and action handling
    - ui.rs: Color rendering
    - controller.rs: App state creation
  
  src/main.rs tests (12 tests):
    - VCF mode: SNP detection, flanking distances, exclusion rules
    - Log file generation: path patterns, uniqueness, directory handling
    - Concatenation: log file creation and content
  
  Test data:
    - test_data/alignment.fasta: 5 sequences, 200 sites
    - test_data/unaligned.fasta: Unaligned sequences (for error testing)
    - test_data/LOC_01790.nex: NEXUS file with 27 sequences
    - test_data/vcf_tests/*.fa: VCF mode test cases

================================================================================
CONTACT / CONTEXT
================================================================================

This project was developed with AI assistance (Claude/Copilot). When resuming
development, provide this file as context to quickly get back up to speed on
architecture decisions and design patterns used.

Key files to review when resuming:
1. This file (readme_dev.txt)
2. src/model.rs - Core state and data structures (incl. LoadingState)
3. src/controller.rs - Main loop, background loading, LoadMessage channel
4. src/event.rs - Action definitions and key handling
5. src/genetic_code.rs - Translation logic
6. src/formats/mod.rs - Format detection and unified parsing API
7. src/formats/nexus.rs - Token-based NEXUS parser

================================================================================