================================================================================
SeqTUI - Developer Reference Document
================================================================================
Last updated: December 2025
This document provides context for continuing development on SeqTUI, a terminal-
based sequence viewer and toolkit (FASTA, PHYLIP, NEXUS) written in Rust. It
captures key design decisions, architecture choices, and lessons learned.
================================================================================
PROJECT GOAL
================================================================================
SeqTUI aims to be a fast, memory-efficient terminal viewer AND command-line
toolkit for sequences (aligned or not). Key goals:
1. Handle very large files (500MB+, millions of nucleotides per sequence)
2. Vim-style navigation for bioinformaticians comfortable with CLI
3. Support NT→AA translation with all 33 NCBI genetic codes
4. Color-coded display matching Seaview conventions
5. Minimal dependencies, easy deployment on HPC clusters
6. CLI mode for batch processing (convert, translate, concatenate)
7. Supermatrix building for phylogenetics (multi-gene concatenation)
8. VCF export for isolated biallelic SNPs with flanking distance filter
9. File browser for interactive file selection (:e command, or launch without args)
================================================================================
ARCHITECTURE
================================================================================
The codebase follows an event-driven MVC pattern:
src/
├── main.rs          - Entry point, CLI args, jemalloc allocator setup
├── lib.rs           - Module exports
├── model.rs         - Core data structures and application state
├── fasta.rs         - FASTA parsing with memory optimization
├── formats/         - Multi-format support module
│   ├── mod.rs       - Format detection (extension + content) and unified API
│   ├── nexus.rs     - NEXUS parser (token-based per spec)
│   └── phylip.rs    - PHYLIP parser (sequential + interleaved)
├── event.rs         - Keyboard input handling (Action enum, apply_action)
├── ui.rs            - TUI rendering with ratatui
├── controller.rs    - Main loop, background loading, channel-based messaging
└── genetic_code.rs  - 33 NCBI genetic codes and translation logic
Key design pattern: Events -> Actions -> State mutations -> Render
Async pattern: Background thread -> Channel -> Main loop polls -> State update
================================================================================
KEY DATA STRUCTURES (model.rs)
================================================================================
Sequence {
    name: String,   // Sequence identifier
    data: Vec<u8>,  // Raw sequence bytes (NOT String - memory optimization)
}
Alignment {
    sequences: Vec<Sequence>,
    warning: Option<String>,  // Unequal length warning
}
AppState {
    alignment: Alignment,
    translated_alignment: Option<Alignment>,   // Cached AA translation
    cached_translation_code_id: Option<u8>,    // Code used for cached translation
    cached_translation_frame: Option<usize>,   // Frame used for cached translation
    view_mode: ViewMode,                       // Nucleotide or AminoAcid
    loading_state: LoadingState,               // Ready, LoadingFile, Translating
    spinner_frame: usize,                      // Animation frame (0-3)
    viewport: Viewport,                        // What's visible on screen
    cursor: Cursor,                            // Current position
    mode: AppMode,                             // Normal, Command, Search, TranslationSettings
    help_tab: HelpTab,                         // Current help tab (5 tabs)
    pending_g: bool,                           // For g-prefix commands
    pending_z: bool,                           // For z-prefix commands (zH, zL)
    ...
}
LoadingState {
    Ready,  // No loading in progress
    LoadingFile { path, message, sequences_loaded },
    Translating { message, sequences_done, total },
}
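As a minimal sketch of how Sequence and Alignment fit together, the block below
fills in the `warning` field when sequence lengths differ. The constructor name
from_sequences() and the warning text are hypothetical (not taken from
model.rs); only the field names and types come from the listing above.

```rust
/// One sequence: an identifier plus raw bytes (Vec<u8>, not String).
#[derive(Debug, Clone, PartialEq)]
pub struct Sequence {
    pub name: String,
    pub data: Vec<u8>,
}

/// A set of sequences plus an optional unequal-length warning.
#[derive(Debug, Clone, PartialEq)]
pub struct Alignment {
    pub sequences: Vec<Sequence>,
    pub warning: Option<String>,
}

impl Alignment {
    /// Hypothetical constructor: sets `warning` when lengths differ.
    pub fn from_sequences(sequences: Vec<Sequence>) -> Self {
        let warning = match sequences.first() {
            Some(first) => {
                let len = first.data.len();
                if sequences.iter().any(|s| s.data.len() != len) {
                    Some("sequences have different lengths".to_string())
                } else {
                    None
                }
            }
            None => None,
        };
        Alignment { sequences, warning }
    }
}
```

An empty input yields no warning in this sketch; the real code may treat that
case as an error instead.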
================================================================================
PERFORMANCE OPTIMIZATIONS
================================================================================
These were critical for handling 500MB+ files (47 sequences × 11M nucleotides):
1. SEQUENCE STORAGE: Vec<u8> instead of String
- Eliminates UTF-8 validation overhead
- Direct byte indexing, with no char-boundary handling or UTF-8 decoding per character
- ~30% memory reduction
2. FASTA PARSING: Bulk read for large files
- Files >1MB: read entire file to Vec<u8>, then parse
- Avoids per-line allocation overhead
- Pre-allocated capacity based on file size
3. TRANSLATION: Array-based codon lookup
- [u8; 64] array instead of HashMap for codon table
- Inline function base_to_index() for 2-bit encoding
- No string allocation per codon (works directly in bytes)
- translate_sequence(&[u8]) -> Vec<u8> (no String intermediates)
4. MEMORY ALLOCATOR: jemalloc (tikv-jemallocator)
- In theory, helps return freed memory to OS
- In practice, the difference is minimal on modern systems
- Kept in case it helps on some HPC/cluster environments
5. REMOVED: Parallel translation with rayon
- Initially added for speed, but it added the overhead of 15 extra threads
- Single-threaded is fast enough for interactive use
- Simpler code, lower memory footprint
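The array-based lookup in item 3 can be sketched as follows. This is an
illustration, not the code in genetic_code.rs: the T=0, C=1, A=2, G=3 index
order (the NCBI table layout), the constant name STANDARD_CODE, and the choice
to emit 'X' for any codon containing a non-ACGT byte are all assumptions. Only
the standard code (table 1) is shown.

```rust
/// Standard genetic code (NCBI table 1), indexed by (b1 << 4) | (b2 << 2) | b3
/// with bases encoded as T=0, C=1, A=2, G=3 (assumed order).
const STANDARD_CODE: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// 2-bit encoding of a nucleotide; None for anything that is not A/C/G/T/U.
#[inline]
fn base_to_index(b: u8) -> Option<usize> {
    match b.to_ascii_uppercase() {
        b'T' | b'U' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None,
    }
}

/// Translates complete codons only; works on bytes, no String intermediates.
pub fn translate_sequence(seq: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(seq.len() / 3);
    for codon in seq.chunks_exact(3) {
        let idx = (
            base_to_index(codon[0]),
            base_to_index(codon[1]),
            base_to_index(codon[2]),
        );
        let aa = match idx {
            (Some(a), Some(b), Some(c)) => STANDARD_CODE[(a << 4) | (b << 2) | c],
            _ => b'X', // assumption: unknown bytes (including gaps) give X
        };
        out.push(aa);
    }
    out
}
```

The lookup is a single array index per codon, so there is no hashing and no
per-codon allocation.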
================================================================================
ASYNC LOADING ARCHITECTURE
================================================================================
The TUI opens IMMEDIATELY when the user runs seqtui. File parsing happens in
a background thread while a loading spinner is displayed.
Components:
1. LoadingState enum (model.rs)
- Ready: No loading, alignment is displayed
- LoadingFile: Parsing in progress, shows spinner
- Translating: Translation in progress (future use)
2. LoadMessage enum (controller.rs)
- Complete(Alignment): Parsing succeeded
- Error(String): Parsing failed
- Progress { sequences_loaded }: Streaming updates (future use)
3. Background loading flow:
a. main.rs calls run_app_with_loading(path, format)
b. Controller creates AppState in LoadingFile state
c. Controller spawns std::thread for parsing
d. TUI renders immediately with spinner overlay
e. Main loop polls channel (non-blocking try_recv)
f. On LoadMessage::Complete, state.set_alignment() is called
g. Spinner disappears, alignment is shown
4. Spinner animation:
- Braille characters: ⠋ ⠙ ⠹ ⠸ (4 frames)
- tick_spinner() advances frame each render loop (~50ms)
- spinner_char() returns current frame character
5. Error handling:
- If parsing fails, LoadMessage::Error is sent
- state.set_loading_error() displays error in status bar
- User can quit with :q
Future: Could add streaming parser that sends Progress messages as sequences
are parsed, updating the count in the loading overlay.
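A minimal sketch of the background-thread and channel flow described above,
using only std::thread and std::sync::mpsc. The helper names spawn_loader()
and poll_loader() are hypothetical, and Vec<String> stands in for Alignment to
keep the example self-contained; the real LoadMessage carries an Alignment.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread;

/// Messages sent from the parser thread to the main loop.
/// Vec<String> is a stand-in for Alignment in this sketch.
#[derive(Debug)]
pub enum LoadMessage {
    Complete(Vec<String>),
    Error(String),
}

/// Spawns the parser on a background thread and returns the receiving end.
pub fn spawn_loader<F>(parse: F) -> Receiver<LoadMessage>
where
    F: FnOnce() -> Result<Vec<String>, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let msg = match parse() {
            Ok(alignment) => LoadMessage::Complete(alignment),
            Err(e) => LoadMessage::Error(e),
        };
        // The receiver may already be gone if the user quit; ignore that.
        let _ = tx.send(msg);
    });
    rx
}

/// One non-blocking poll, as done once per render loop iteration.
/// Returns None while the parser is still running.
pub fn poll_loader(rx: &Receiver<LoadMessage>) -> Option<LoadMessage> {
    match rx.try_recv() {
        Ok(msg) => Some(msg),
        Err(TryRecvError::Empty) => None,
        Err(TryRecvError::Disconnected) => Some(LoadMessage::Error(
            "loader thread exited without a result".to_string(),
        )),
    }
}
```

The main loop would call poll_loader() between draws and keep ticking the
spinner while it returns None.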
================================================================================
VIM NAVIGATION DESIGN
================================================================================
Navigation is designed for Vim users but also works with arrows:
ARROW-CENTRIC:
←↑↓→ Move one position
Shift+←→ Half page left/right
Shift+↑↓ Page up/down
VIM-CENTRIC:
h/j/k/l Move (left/down/up/right)
Ctrl+U/D Half page up/down
zH/zL Half page left/right (Vim's horizontal scroll)
0/$ First/last column
g0/gm/g$ First/middle/last VISIBLE column
<num>| Go to column N
NOTE: Ctrl+Arrow doesn't work on macOS (captured by system for Spaces/Mission
Control), so Shift+Arrow is used instead. Both Ctrl and Shift are supported
in code for cross-platform compatibility.
================================================================================
PENDING STATE PATTERN
================================================================================
For multi-key commands (g0, gm, g$, zH, zL), we use a "pending" state:
1. User presses 'g' → set pending_g = true
2. Next key (0, m, or $) triggers the actual command
3. Command calls clear_pending() to reset
IMPORTANT: Any action triggered by a pending state must call clear_pending()
or the app becomes unresponsive (all subsequent keys go to the pending handler
which returns Action::None for unknown keys).
See: set_pending_g(), set_pending_z(), clear_pending() in model.rs
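The pattern can be sketched as below. The Pending struct, the handle_key()
function, and the Action variant names are hypothetical simplifications of
AppState and event.rs. Note one deliberate difference: this sketch clears the
pending flag before dispatching the second key, so an unknown key can never
leave the handler stuck.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Action {
    None,
    FirstVisibleColumn,  // g0
    MiddleVisibleColumn, // gm
    LastVisibleColumn,   // g$
    HalfPageLeft,        // zH
    HalfPageRight,       // zL
}

/// Stand-in for the pending_g / pending_z fields of AppState.
#[derive(Default)]
pub struct Pending {
    pub pending_g: bool,
    pub pending_z: bool,
}

impl Pending {
    pub fn clear_pending(&mut self) {
        self.pending_g = false;
        self.pending_z = false;
    }
}

pub fn handle_key(state: &mut Pending, key: char) -> Action {
    if state.pending_g {
        state.clear_pending(); // always reset, even for unknown keys
        return match key {
            '0' => Action::FirstVisibleColumn,
            'm' => Action::MiddleVisibleColumn,
            '$' => Action::LastVisibleColumn,
            _ => Action::None,
        };
    }
    if state.pending_z {
        state.clear_pending();
        return match key {
            'H' => Action::HalfPageLeft,
            'L' => Action::HalfPageRight,
            _ => Action::None,
        };
    }
    match key {
        'g' => {
            state.pending_g = true;
            Action::None
        }
        'z' => {
            state.pending_z = true;
            Action::None
        }
        _ => Action::None,
    }
}
```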
================================================================================
HELP SYSTEM
================================================================================
Tabbed help overlay with 5 sections:
- Basics: Getting started, :q, :h, :<number>
- Arrow Nav: Arrow key navigation
- Vim Nav: Vim-style navigation
- Search: /, ?, n, N
- Translation: :asAA, :asNT, :setcode
Navigate tabs with ←/→, h/l, or Tab. Any other key closes help.
State: help_tab: HelpTab in AppState
Actions: HelpNextTab, HelpPrevTab, DismissHelp
================================================================================
TRANSLATION SYSTEM
================================================================================
Nucleotide to amino acid translation:
- 33 NCBI genetic codes supported (Standard, Vertebrate Mito, etc.)
- 3 reading frames (+1, +2, +3)
- Translation settings UI with j/k for code, h/l for frame
TRANSLATION CACHING:
The translated alignment is cached in `translated_alignment` along with
metadata tracking which settings were used:
- cached_translation_code_id: Option<u8> - Genetic code ID used
- cached_translation_frame: Option<usize> - Frame used (0, 1, or 2)
When user types :asNT, we switch view_mode back to Nucleotide but KEEP the
cached translation. When typing :asAA again:
1. has_valid_cached_translation() checks if cached settings match current
2. If match: switch_to_cached_aa_view() - instant, no recomputation
3. If no match: start_background_translation() - recompute in background
This enables rapid NT↔AA toggling without recomputation, which is important
for large alignments where translation can take several seconds.
CACHE INVALIDATION:
Cache is invalidated when genetic_code_id or frame changes:
- User opens :setcode dialog and changes settings
- Cache metadata won't match, triggering recomputation
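Memory note: Dropping translated_alignment frees the cached translation. With
jemalloc this memory is meant to be returned to the OS; without it, freed
memory may stay with the process. As noted under PERFORMANCE OPTIMIZATIONS,
the difference observed in practice is small on modern systems.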
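The cache check can be sketched as a comparison of the stored metadata against
the current settings. The TranslationCache struct is a hypothetical stand-in
for the relevant AppState fields, and the method signature is an assumption;
only the field and method names come from this document.

```rust
/// Stand-in for the translation-related fields of AppState.
pub struct TranslationCache {
    /// Cached amino-acid sequences (stand-in for Option<Alignment>).
    pub translated_alignment: Option<Vec<Vec<u8>>>,
    pub cached_translation_code_id: Option<u8>,
    pub cached_translation_frame: Option<usize>,
}

impl TranslationCache {
    /// True only if a cached translation exists AND was computed with the
    /// same genetic code and frame as the current settings.
    pub fn has_valid_cached_translation(&self, code_id: u8, frame: usize) -> bool {
        self.translated_alignment.is_some()
            && self.cached_translation_code_id == Some(code_id)
            && self.cached_translation_frame == Some(frame)
    }
}
```

With this shape, changing the code or frame in :setcode needs no explicit
invalidation step: the next :asAA simply fails the comparison and recomputes.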
================================================================================
COLOR SCHEME
================================================================================
Nucleotides (DNA/RNA):
A: Red background
C: Green background
G: Yellow background
T/U: Blue background
Amino acids (Seaview-style, grouped by chemical property):
Hydrophobic (AFILMVW): Yellow
Polar (NQST): Green
Charged+ (KRH): Magenta/Red
Charged- (DE): Blue
Special (CGP): Orange/Cyan/Gray
Stop (*): Red on white
================================================================================
FILE HANDLING
================================================================================
Multi-format support with auto-detection:
FORMAT DETECTION STRATEGY (in parse_file_with_options):
-------------------------------------------------------
The detection follows a cascading fallback strategy:
1. EXPLICIT FORMAT (-f/--format option)
If user specifies -f nexus, we use NEXUS parser directly.
If parsing fails, we return the error (no fallback).
2. FILE EXTENSION (if no -f option)
We try the parser matching the extension (.fasta → FASTA parser).
IMPORTANT: If extension-based parsing FAILS, we SILENTLY fall through
to content detection. No warning is displayed.
Example: seq.fasta containing NEXUS data
- Try FASTA parser (fails because #NEXUS is not a valid FASTA header)
- Fall through to step 3
3. CONTENT-BASED DETECTION
Examine first non-empty line:
- Starts with "#NEXUS" (case-insensitive) → NEXUS
- Starts with ">" → FASTA
- Two integers (ntax nchar) → PHYLIP
If detected, parse with that format.
If parsing fails, return the error (no further fallback).
4. TRY ALL PARSERS (last resort)
If content detection returns None, try each parser in order:
FASTA → NEXUS → PHYLIP
Return first success, or UnknownFormat error if all fail.
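Step 3 (content-based detection) can be sketched as below. The Format enum and
the function name detect_format_from_content() are hypothetical; the three
signature rules are the ones listed above. For PHYLIP, this sketch only checks
that the first two whitespace-separated tokens parse as integers.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Format {
    Fasta,
    Nexus,
    Phylip,
}

/// Inspects the first non-empty line and returns a format if a signature
/// is recognized, or None so the caller can fall back to trying all parsers.
pub fn detect_format_from_content(content: &str) -> Option<Format> {
    let line = content.lines().map(str::trim).find(|l| !l.is_empty())?;

    // "#NEXUS", case-insensitive
    let bytes = line.as_bytes();
    if bytes.len() >= 6 && bytes[..6].eq_ignore_ascii_case(b"#NEXUS") {
        return Some(Format::Nexus);
    }
    // FASTA header
    if line.starts_with('>') {
        return Some(Format::Fasta);
    }
    // PHYLIP header: ntax nchar
    let mut fields = line.split_whitespace();
    match (fields.next(), fields.next()) {
        (Some(a), Some(b)) if a.parse::<usize>().is_ok() && b.parse::<usize>().is_ok() => {
            Some(Format::Phylip)
        }
        _ => None,
    }
}
```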
CURRENT BEHAVIOR SUMMARY:
- Extension is a HINT, not authoritative
- Extension mismatch produces NO warning (silent fallback)
- Content signature is trusted when found
- User can always override with -f/--format
POTENTIAL IMPROVEMENT:
Could add warning when extension doesn't match detected/successful format:
"Warning: file.fasta was parsed as NEXUS (extension suggests FASTA)"
Currently NOT implemented - parsing succeeds silently.
Format detection priority:
1. Explicit -f/--format CLI option (fasta, phylip, nexus, auto)
2. File extension (.fasta, .phy, .nex, etc.) - with silent fallback on failure
3. Content detection (looks for format signatures)
4. Try all parsers as fallback
FASTA parsing handles:
- Standard multi-line FASTA
- Lines starting with > are headers
- Sequences can span multiple lines
- Automatic uppercase conversion
- Warning if sequences have different lengths (invalid alignment)
PHYLIP parsing handles:
- Sequential format (each sequence on consecutive lines)
- Interleaved format (detected by line count vs NCHAR)
- Relaxed names (any length, not just 10 chars)
- Strict 10-char names for legacy files
NEXUS parsing handles:
- Token-based parsing per NEXUS specification
- DATA and CHARACTERS blocks
- Sequential and INTERLEAVE formats
- MATCHCHAR substitution (e.g., '.' = same as reference sequence)
- Multi-line FORMAT commands
- Inline comments like [1], [annotation]
- Case-insensitive commands
- Quoted sequence names
Edge cases:
- Empty files: Error
- Files without headers: Creates "Unknown" sequence (FASTA)
- Very long lines: Handled (bulk read approach)
- Unknown format: Helpful error with format hints
================================================================================
TESTING
================================================================================
82 unit tests covering:
- Event handling and key mappings
- FASTA parsing edge cases
- PHYLIP parsing (sequential, interleaved, relaxed names)
- NEXUS parsing (simple, interleaved, quoted names, MATCHCHAR)
- Format detection (extension, content, fallback)
- Real-world file tests (LOC_01790.nex with 27 sequences)
- Genetic code translation
- Model state transitions
- Color assignments
- VCF export (biallelic SNPs, flanking distance, missing genotypes)
Run: cargo test
================================================================================
DEPENDENCIES
================================================================================
Cargo.toml key dependencies:
- ratatui: TUI framework
- crossterm: Terminal backend (cross-platform)
- anyhow/thiserror: Error handling
- clap: CLI argument parsing
- tikv-jemallocator: Alternative memory allocator (kept for potential benefits)
================================================================================
KNOWN ISSUES / FUTURE IMPROVEMENTS
================================================================================
Potential enhancements:
1. Selection/copy to clipboard
2. Sequence statistics (GC content, length)
3. Consensus sequence display
4. Export selected region
5. Mouse support for clicking
6. Reverse complement view (frames 4-6 for translation)
7. Multiple file comparison
8. Memory-efficient supermatrix (streaming write instead of full in-memory)
9. -f option for delimiter field selection (-f1,2 like Unix cut); note that -f
is already taken by --format, so this would need a different flag name
Performance notes:
- Initial load of 500MB file: ~2-3 seconds
- Translation of 500MB: ~1 second
- Memory usage: stable during interactive use
================================================================================
CLI MODE & CONCATENATION (main.rs)
================================================================================
SeqTUI has two modes:
1. TUI MODE (default)
- Interactive viewer with Vim-style navigation
- Triggered when no -o/--output is specified
- Can launch with or without file arguments
- Without args: opens file browser to select files
- With args: opens specified files directly
- :e command opens file browser from within viewer
2. CLI MODE (with -o)
- Batch processing: convert, translate, concatenate
- Single-line FASTA output (pipe-friendly)
- Triggered by -o/--output
CLI OPTIONS:
  -o, --output          Output file (or "-" for stdout)
  -f, --format          Input format (fasta, phylip, nexus, auto)
  -t, --translate       Translate NT to AA
  -g, --genetic-code    Genetic code (1-33, default: 1)
  -r, --reading-frame   Reading frame (1-3, default: 1)
  -d, --delimiter       ID matching delimiter (uses first field)
  -s, --supermatrix     Fill missing sequences (default '-', or custom char)
  -p, --partitions      Write partition file
  -v, --vcf             Extract biallelic SNPs to VCF (value = min flanking dist)
      --force           Bypass safety checks (orphan IDs, non-NT files)
SINGLE FILE CLI:
run_cli_mode() - parse, optionally translate, write FASTA
MULTI-FILE CONCATENATION:
run_concatenation_mode() - merge sequences by ID matching
CONCATENATION ALGORITHM:
Pass 1: Collect all unique sequence IDs across files
Track how many files each ID appears in (for orphan detection)
Validate alignments if -s (supermatrix mode)
Record alignment length per file (for gap filling)
Check orphan ratio: if >30% of IDs appear in only 1 file, abort
- Writes <output>_<random>.log with all IDs (orphans marked with *)
- Suggests -d delimiter or --force to proceed
Pass 2: For each file:
- Parse and optionally translate
- For each known ID: append sequence or gaps (if -s and missing)
- Track partition boundaries
Output: Write concatenated sequences + optional partition file
Always writes log file with per-file stats and warnings
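The two passes can be sketched as below, leaving out translation, orphan
checks, and logging. Several details are assumptions made for the sketch and
not taken from main.rs: the function name concatenate(), alphabetical output
order, 1-based inclusive partition ranges, and taking each file's alignment
length from its first sequence.

```rust
use std::collections::BTreeSet;

/// Parsed content of one input file: (sequence ID, sequence bytes).
pub type FileSeqs = Vec<(String, Vec<u8>)>;

/// Returns the concatenated sequences and one (start, end) range per file.
pub fn concatenate(
    files: &[FileSeqs],
    fill: u8,
) -> (Vec<(String, Vec<u8>)>, Vec<(usize, usize)>) {
    // Pass 1: collect all unique IDs across files (sorted, by assumption).
    let ids: BTreeSet<&str> = files
        .iter()
        .flatten()
        .map(|(id, _)| id.as_str())
        .collect();
    let mut out: Vec<(String, Vec<u8>)> =
        ids.iter().map(|id| (id.to_string(), Vec::new())).collect();

    // Pass 2: append each file's sequence, or fill characters if missing.
    let mut partitions = Vec::new();
    let mut start = 1usize;
    for file in files {
        let len = file.first().map_or(0, |(_, data)| data.len());
        for (id, data) in out.iter_mut() {
            match file.iter().find(|(fid, _)| fid.as_str() == id.as_str()) {
                Some((_, seq)) => data.extend_from_slice(seq),
                None => data.extend(std::iter::repeat(fill).take(len)),
            }
        }
        partitions.push((start, start + len - 1));
        start += len;
    }
    (out, partitions)
}
```

The linear find() per ID is fine for a sketch; a per-file HashMap lookup would
be the obvious replacement for files with many sequences.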
LOG FILE NAMING CONVENTION:
All log files use the pattern: <prefix>_<6_random_chars>.log
- If output file specified: prefix = output file stem
Example: -o results/supermatrix.fasta → results/supermatrix_a7f3k2.log
- If output is stdout (-o -) or none: prefix = "seqtui"
Example: -o - → seqtui_b2x9m1.log (in current directory)
- Random suffix prevents overwrites in HPC parallel jobs
- Files grouped with output (ls supermatrix* shows all related files)
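A sketch of the path construction, with the random suffix passed in as a
parameter so the example needs no extra crate. The function name log_path() is
hypothetical; how SeqTUI actually generates the six random characters is not
described here.

```rust
use std::path::{Path, PathBuf};

/// Builds <prefix>_<suffix>.log next to the output file, or
/// seqtui_<suffix>.log in the current directory for stdout / no output.
pub fn log_path(output: Option<&Path>, suffix: &str) -> PathBuf {
    match output {
        Some(p) if p != Path::new("-") => {
            // Prefix is the output file stem; fall back to "seqtui" if the
            // stem is missing or not valid UTF-8 (an assumption).
            let stem = p
                .file_stem()
                .and_then(|s| s.to_str())
                .unwrap_or("seqtui");
            p.with_file_name(format!("{stem}_{suffix}.log"))
        }
        _ => PathBuf::from(format!("seqtui_{suffix}.log")),
    }
}
```

Called with Some(Path::new("results/supermatrix.fasta")) and "a7f3k2", this
reproduces the first example above.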
ID MATCHING:
- Default: full sequence ID
- With -d "_": extract first field before delimiter
- Example: "Human_ENS001" with -d "_" matches "Human_LOC789" on "Human"
- extract_key(id, delimiter) function handles this
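A possible shape for extract_key(), as a sketch: the parameter types
(Option<&str> for the delimiter, borrowed &str result) are assumptions, since
only the name and argument order are given above.

```rust
/// Returns the part of `id` used for matching across files:
/// the whole ID by default, or the first field before `delimiter`.
pub fn extract_key<'a>(id: &'a str, delimiter: Option<&str>) -> &'a str {
    match delimiter {
        // split() always yields at least one item, so unwrap_or is a safeguard
        Some(d) if !d.is_empty() => id.split(d).next().unwrap_or(id),
        _ => id,
    }
}
```

With -d "_", both "Human_ENS001" and "Human_LOC789" reduce to "Human" and are
therefore concatenated as the same sequence.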
ORPHAN ID DETECTION:
- Orphan = ID that appears in only one input file
- If orphan_count / total_output_ids > 0.30, likely delimiter problem
- Error message suggests -d and writes IDs to <output>_ids_<random>.log
- --force bypasses this check
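The orphan check can be sketched as below. The function names and the input
shape (one list of already-extracted keys per file) are hypothetical; the
0.30 threshold and the --force bypass are from the rules above.

```rust
use std::collections::{HashMap, HashSet};

/// Fraction of IDs that appear in exactly one input file.
pub fn orphan_ratio(files: &[Vec<&str>]) -> f64 {
    let mut file_counts: HashMap<&str, usize> = HashMap::new();
    for ids in files {
        // Count each ID at most once per file.
        let mut seen = HashSet::new();
        for &id in ids {
            if seen.insert(id) {
                *file_counts.entry(id).or_insert(0) += 1;
            }
        }
    }
    if file_counts.is_empty() {
        return 0.0;
    }
    let orphans = file_counts.values().filter(|&&n| n == 1).count();
    orphans as f64 / file_counts.len() as f64
}

/// True if concatenation should abort (unless --force was given).
pub fn too_many_orphans(files: &[Vec<&str>], force: bool) -> bool {
    !force && orphan_ratio(files) > 0.30
}
```

For example, files with IDs [A, B, C] and [A, B, D] give 2 orphans out of 4
IDs (ratio 0.5), which exceeds the threshold and suggests trying -d.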
NUCLEOTIDE VALIDATION:
- Translation and VCF modes require nucleotide sequences
- Files with <50% ACGT characters (excluding gaps/N/?) are flagged
- Error suggests the file may be amino acids
- Details written to <output>_nt_check_<random>.log
- --force bypasses this check
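A sketch of the 50% rule follows. The function names are hypothetical, and two
details are assumptions: the check is case-insensitive, and input with no
countable characters (only gaps/N/?) is not flagged.

```rust
/// Fraction of A/C/G/T among counted characters; gaps, N and ? are excluded
/// from the denominator. Returns None if nothing was counted.
pub fn acgt_fraction(seq: &[u8]) -> Option<f64> {
    let mut acgt = 0usize;
    let mut counted = 0usize;
    for &b in seq {
        match b.to_ascii_uppercase() {
            b'-' | b'N' | b'?' => {} // excluded
            b'A' | b'C' | b'G' | b'T' => {
                acgt += 1;
                counted += 1;
            }
            _ => counted += 1,
        }
    }
    if counted == 0 {
        None
    } else {
        Some(acgt as f64 / counted as f64)
    }
}

/// False when fewer than 50% of counted characters are A/C/G/T,
/// which suggests the data may be amino acids.
pub fn looks_like_nucleotides(seq: &[u8]) -> bool {
    acgt_fraction(seq).map_or(true, |f| f >= 0.5)
}
```

A protein such as MKVLAAGIVGLLLAQ contains only 5 A/C/G/T letters out of 15,
so it is flagged.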
================================================================================
VCF MODE (main.rs)
================================================================================
Extract biallelic SNPs from alignments with flanking distance filter:
seqtui alignment.fasta -v 300 -o snps.vcf
VCF MODE ALGORITHM:
Pass 1: Collect all sequence IDs across files
- Reference = first sequence of first file
- Samples sorted alphabetically (reference first)
- Validate alignment and nucleotide content
Pass 2: For each file:
- Single pass through sequences with bit flags:
real_nt_only[pos]: true if site has only ACGT/N/?
seen_nt[pos]: bit flags (A=1, C=2, G=4, T=8)
- Derive polymorphic sites: !real_nt_only || popcount(seen_nt) > 1
- Compute distLeft[i], distRight[i] using reset vector
- Select biallelic sites: real_nt_only && popcount==2 && dist>=min
Output: VCF with DL/DR in INFO field for filtering
VCF OUTPUT FORMAT:
- Reference sequence from first file (sample column included)
- Haploid genotypes: 0 (ref), 1 (alt), . (missing)
- Missing genotype: sequence absent from file OR has N/? at position
- Site excluded: any present sequence has gap (-) at that position
- INFO: DL=distance_left;DR=distance_right
- Each input file becomes a separate CHROM (basename without extension)
VCF LOG FILES:
- Per-file SNP counts written to <output>_vcf_<random>.log (>100 files)
- NT validation errors written to <output>_nt_check_<random>.log
- Log files use same naming convention as concatenation mode
BIT FLAG OPTIMIZATION:
- Alleles tracked with bit flags: A=1, C=2, G=4, T=8
- Biallelic check: popcount(seen_nt[pos]) == 2
- Single pass per file (no HashSet allocations)
- O(n) distance computation using reset vector
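The bit-flag scan and distance filter can be sketched as below for one file.
Function names are hypothetical. One assumption is worth flagging: distances
are measured to the nearest polymorphic site on each side, and a site with no
neighbour on a side gets usize::MAX; the real handling of alignment edges may
differ.

```rust
/// Bit flag per nucleotide: A=1, C=2, G=4, T=8; 0 for anything else.
fn nt_flag(b: u8) -> u8 {
    match b.to_ascii_uppercase() {
        b'A' => 1,
        b'C' => 2,
        b'G' => 4,
        b'T' => 8,
        _ => 0,
    }
}

/// Returns 0-based positions of biallelic SNPs whose nearest polymorphic
/// neighbours are at least `min_dist` away on both sides.
pub fn isolated_biallelic_sites(seqs: &[&[u8]], min_dist: usize) -> Vec<usize> {
    let n = seqs.first().map_or(0, |s| s.len());
    let mut seen_nt = vec![0u8; n];
    let mut real_nt_only = vec![true; n];

    // Single pass over all sequences.
    for seq in seqs {
        for (pos, &b) in seq.iter().enumerate().take(n) {
            let flag = nt_flag(b);
            if flag != 0 {
                seen_nt[pos] |= flag;
            } else if !matches!(b.to_ascii_uppercase(), b'N' | b'?') {
                real_nt_only[pos] = false; // gap or other symbol
            }
        }
    }

    let poly: Vec<bool> = (0..n)
        .map(|i| !real_nt_only[i] || seen_nt[i].count_ones() > 1)
        .collect();

    // Left-to-right and right-to-left sweeps, resetting at each polymorphic site.
    let mut dist_left = vec![usize::MAX; n];
    let mut last: Option<usize> = None;
    for i in 0..n {
        if let Some(p) = last {
            dist_left[i] = i - p;
        }
        if poly[i] {
            last = Some(i);
        }
    }
    let mut dist_right = vec![usize::MAX; n];
    let mut next: Option<usize> = None;
    for i in (0..n).rev() {
        if let Some(p) = next {
            dist_right[i] = p - i;
        }
        if poly[i] {
            next = Some(i);
        }
    }

    (0..n)
        .filter(|&i| {
            real_nt_only[i]
                && seen_nt[i].count_ones() == 2
                && dist_left[i] >= min_dist
                && dist_right[i] >= min_dist
        })
        .collect()
}
```

Sites containing a gap count as polymorphic for the distance computation but
are never selected, which matches the exclusion rule in the output format.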
VALIDATION:
- -v requires -o (output file)
- -v incompatible with -t, -s, -p
- Files must be valid alignments (same length)
- Files must be nucleotide (checked via NT validation)
VALIDATION (concatenation options -s/-p):
- -s and -p require multiple input files
- -s requires aligned sequences (same length within each file)
- -s accepts optional fill character (default '-', or '?', '.', etc.)
- Clear error messages for invalid combinations
================================================================================
TRANSLATION IMPROVEMENTS
================================================================================
AMBIGUITY CODE HANDLING (genetic_code.rs):
Translation now handles common nucleotide ambiguity codes:
- R = A or G (purine)
- Y = T or C (pyrimidine)
- N or ? = any base
Rules:
- Only 1 ambiguous position per codon (practical case)
- All possible translations must yield the same AA
- Otherwise returns 'X'
Examples:
CTR → L (CTA=Leu, CTG=Leu, both Leu)
GGN → G (all 4 codons = Gly)
ATN → X (ATT/ATC/ATA=Ile, ATG=Met, mixed result)
Implementation:
ambiguity_expansions(b) returns possible base indices
translate_codon() expands and checks all combinations
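The expansion approach can be sketched as below for the standard code only.
The function names match those mentioned above, but the signatures, the
T=0, C=1, A=2, G=3 index order, and the STANDARD_CODE constant are assumptions
made for this example.

```rust
/// Standard genetic code (NCBI table 1), indexed by (b1 << 4) | (b2 << 2) | b3
/// with T=0, C=1, A=2, G=3 (assumed order).
const STANDARD_CODE: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Possible base indices for a symbol; empty slice for unsupported symbols.
fn ambiguity_expansions(b: u8) -> &'static [usize] {
    match b.to_ascii_uppercase() {
        b'T' | b'U' => &[0],
        b'C' => &[1],
        b'A' => &[2],
        b'G' => &[3],
        b'R' => &[2, 3],              // A or G
        b'Y' => &[0, 1],              // T or C
        b'N' | b'?' => &[0, 1, 2, 3], // any base
        _ => &[],
    }
}

/// Translates one codon; returns b'X' unless every expansion agrees.
pub fn translate_codon(codon: [u8; 3]) -> u8 {
    let exp = [
        ambiguity_expansions(codon[0]),
        ambiguity_expansions(codon[1]),
        ambiguity_expansions(codon[2]),
    ];
    // Unsupported symbol, or more than one ambiguous position: give up.
    let ambiguous = exp.iter().filter(|e| e.len() > 1).count();
    if exp.iter().any(|e| e.is_empty()) || ambiguous > 1 {
        return b'X';
    }
    let mut result: Option<u8> = None;
    for &a in exp[0] {
        for &b in exp[1] {
            for &c in exp[2] {
                let aa = STANDARD_CODE[(a << 4) | (b << 2) | c];
                match result {
                    None => result = Some(aa),
                    Some(prev) if prev != aa => return b'X', // mixed result
                    _ => {}
                }
            }
        }
    }
    result.unwrap_or(b'X')
}
```

Because at most one position is ambiguous, the nested loops visit at most four
codons, so the cost per codon stays constant.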
================================================================================
DEVELOPMENT WORKFLOW
================================================================================
# Build and run
cargo run -- test_data/alignment.fasta
# Run tests
cargo test
# Build release (for HPC)
cargo build --release
# Check for issues
cargo clippy
TEST STRUCTURE:
src/lib.rs tests (74 tests):
- formats/fasta.rs: FASTA parsing edge cases
- formats/nexus.rs: NEXUS parsing (interleaved, matchchar, etc.)
- formats/phylip.rs: PHYLIP parsing (sequential, interleaved)
- formats/mod.rs: Format detection tests
- genetic_code.rs: Translation and ambiguity codes
- model.rs: State management, search, cursor movement
- event.rs: Keyboard input and action handling
- ui.rs: Color rendering
- controller.rs: App state creation
src/main.rs tests (12 tests):
- VCF mode: SNP detection, flanking distances, exclusion rules
- Log file generation: path patterns, uniqueness, directory handling
- Concatenation: log file creation and content
Test data:
- test_data/alignment.fasta: 5 sequences, 200 sites
- test_data/unaligned.fasta: Unaligned sequences (for error testing)
- test_data/LOC_01790.nex: NEXUS file with 27 sequences
- test_data/vcf_tests/*.fa: VCF mode test cases
================================================================================
CONTACT / CONTEXT
================================================================================
This project was developed with AI assistance (Claude/Copilot). When resuming
development, provide this file as context to quickly get back up to speed on
architecture decisions and design patterns used.
Key files to review when resuming:
1. This file (readme_dev.txt)
2. src/model.rs - Core state and data structures (incl. LoadingState)
3. src/controller.rs - Main loop, background loading, LoadMessage channel
4. src/event.rs - Action definitions and key handling
5. src/genetic_code.rs - Translation logic
6. src/formats/mod.rs - Format detection and unified parsing API
7. src/formats/nexus.rs - Token-based NEXUS parser
================================================================================