ustar-parser 0.1.3

STAR format parser for CIF, mmCIF, NMR-STAR, CIF dictionaries, NEF files, and other scientific data formats
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

USTAR is a general STAR (Self-defining Text Archive and Retrieval) format parser written in Rust. STAR is a data format commonly used in scientific computing, particularly for crystallographic data (CIF files), NMR-STAR files (BMRB), and mmCIF (Protein Data Bank) files.

## Build and Development Commands

### Build
```bash
cargo build                    # Debug build
cargo build --release          # Release build
cargo build --all-targets      # Build all targets including tests and benchmarks
```

### Testing
```bash
cargo test                     # Run all tests
cargo test --test parser_tests # Run only the tests in tests/parser_tests.rs
cargo test --test parse_bmrb_stars    # Run integration test for BMRB files
cargo test --test parse_cod_cifs      # Run integration test for COD CIF files
cargo test --test parse_pdb_mmcifs    # Run integration test for PDB mmCIF files
```

### Binaries
The project includes several command-line tools:
```bash
cargo run --bin ustar-dumper           # Parse and dump STAR files with visualization
cargo run --bin ustar-benchmark        # Performance benchmarking
cargo run --bin ustar-parse-debugger   # Debug parser behavior
```

## Architecture and Key Components

### Multi-Encoding Parser System
The parser supports three encoding modes, each backed by its own grammar generated at build time:
- **ASCII**: Standard ASCII character set
- **ExtendedAscii**: Extended ASCII including characters up to 0xFF
- **Unicode**: Full Unicode support with comprehensive whitespace handling

Grammar files are generated at build time by `build.rs` from a template (`src/star.pest_template`) using placeholder substitution.

### Core Components

**Parser Module (`src/parsers.rs`)**
- Three separate parser modules (ascii, extended, unicode), so that the `Rule` enum generated for each grammar does not conflict with the others
- Each module uses a Pest grammar file generated at build time
- The three `Rule` enums are distinct types but share the same structure
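
As a rough illustration of the module split: `pest_derive` generates an enum named `Rule` alongside each derived parser, so two parsers in the same module would collide. The hand-written enums below stand in for the generated ones. The variant names and helper functions are hypothetical and not taken from the actual grammar.

```rust
// Hand-written stand-ins for the `Rule` enums that pest_derive would
// generate. The variant names are hypothetical.
mod ascii {
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Rule {
        DataBlock,
        DataName,
        Value,
    }
}

mod unicode {
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Rule {
        DataBlock,
        DataName,
        Value,
    }
}

// `ascii::Rule` and `unicode::Rule` are different types, so code that must
// work across encodings needs a common representation. Here that is simply
// the variant name as a string.
fn ascii_rule_name(rule: ascii::Rule) -> String {
    format!("{:?}", rule)
}

fn unicode_rule_name(rule: unicode::Rule) -> String {
    format!("{:?}", rule)
}
```

Because the variants match one for one, converting both enums to a shared representation loses no information, which is one plausible reason to keep the rule structures identical.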

**Configuration System (`src/config.rs`)**
- `ParserConfig` type for runtime configuration
- Supports encoding mode selection, string decomposition options, and BOM detection
- Default configurations available via `default_config()`
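
A minimal sketch of what such a configuration could look like. `SketchConfig`, its field names, `sketch_default_config()`, and the chosen defaults are all assumptions for illustration; the real `ParserConfig` and `default_config()` may differ. The `accepts` helper interprets "characters up to 0xFF" as code points, which is also an assumption.

```rust
/// Hypothetical mirror of the three encoding modes listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EncodingMode {
    Ascii,
    ExtendedAscii,
    Unicode,
}

/// Hypothetical configuration struct; not the crate's actual `ParserConfig`.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SketchConfig {
    encoding: EncodingMode,
    decompose_strings: bool,
    detect_bom: bool,
}

/// Stand-in for `default_config()`. The defaults here are assumptions.
fn sketch_default_config() -> SketchConfig {
    SketchConfig {
        encoding: EncodingMode::Ascii,
        decompose_strings: false,
        detect_bom: false,
    }
}

/// Returns true if every character of `input` is allowed by `mode`.
fn accepts(mode: EncodingMode, input: &str) -> bool {
    input.chars().all(|c| match mode {
        EncodingMode::Ascii => c.is_ascii(),
        EncodingMode::ExtendedAscii => (c as u32) <= 0xFF,
        EncodingMode::Unicode => true,
    })
}
```

With a struct like this, a caller can override a single option using struct update syntax, for example `SketchConfig { encoding: EncodingMode::Unicode, ..sketch_default_config() }`.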

**Mutable Parse Tree (`src/mutable_pair.rs`)**
- `MutablePair` provides a mutable alternative to Pest's immutable `Pair` type
- Enables post-parsing transformations like string decomposition
- Converts from Pest pairs via `MutablePair::from_pest_pair()`
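
The idea of an owned, mutable tree can be sketched as follows. `Node` and its fields are hypothetical and do not reflect the actual layout of `MutablePair`; the point is only that an owned tree can be rewritten in place after parsing, which Pest's borrowed `Pair` does not allow.

```rust
/// Hypothetical owned parse-tree node; the real `MutablePair` may store
/// different fields (for example span positions).
#[derive(Debug, Clone, PartialEq)]
struct Node {
    rule: String,
    text: String,
    children: Vec<Node>,
}

impl Node {
    /// Build a node without children.
    fn leaf(rule: &str, text: &str) -> Self {
        Node {
            rule: rule.to_string(),
            text: text.to_string(),
            children: Vec::new(),
        }
    }

    /// Visit every node depth first, parent before children, letting `f`
    /// rewrite each node in place.
    fn transform(&mut self, f: &mut dyn FnMut(&mut Node)) {
        f(self);
        for child in &mut self.children {
            child.transform(f);
        }
    }
}
```

A post-parsing pass then becomes a closure passed to `transform`, for example one that rewrites only nodes whose `rule` is `"string"`.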

**String Processing**
- `src/string_decomposer.rs`: Transforms string tokens into delimiter + content + delimiter
- Optional feature controlled by `DecomposedStrings` configuration
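
A minimal sketch of the decomposition step, assuming only single- and double-quoted tokens. The function name `decompose` is hypothetical, and other STAR string forms (such as semicolon-delimited text fields) are deliberately not handled here.

```rust
/// Split a quoted token into (opening delimiter, content, closing delimiter).
/// Returns `None` if the token is not wrapped in matching quotes.
fn decompose(token: &str) -> Option<(&str, &str, &str)> {
    let first = token.chars().next()?;
    if first != '\'' && first != '"' {
        return None;
    }
    // A lone quote character is not a complete string.
    if token.len() < 2 || !token.ends_with(first) {
        return None;
    }
    // Both delimiters are single ASCII bytes, so byte slicing cannot split
    // a multi-byte character in the content.
    let end = token.len() - 1;
    Some((&token[..1], &token[1..end], &token[end..]))
}
```

Returning borrowed slices keeps the sketch allocation-free; a tree of owned nodes would copy the three parts into child nodes instead.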

**Buffered Processing (`src/sas_buffered.rs`, `src/sas_buffered_walker.rs`)**
- Handler traits for event-based output through SAS, a SAX-like API
- Walker pattern for traversing parse trees efficiently
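
The handler-plus-walker pattern can be sketched as below. The trait name, its callbacks, and the `Tree` type are hypothetical; the actual traits in `src/sas_buffered.rs` may define different events.

```rust
/// Hypothetical SAX-style event handler.
trait Handler {
    fn start_element(&mut self, rule: &str);
    fn text(&mut self, rule: &str, text: &str);
    fn end_element(&mut self, rule: &str);
}

/// Hypothetical parse-tree node used only for this sketch.
struct Tree {
    rule: &'static str,
    text: &'static str,
    children: Vec<Tree>,
}

/// Depth-first walk: leaves emit a text event, inner nodes emit
/// start/end events around their children.
fn walk(node: &Tree, handler: &mut dyn Handler) {
    if node.children.is_empty() {
        handler.text(node.rule, node.text);
        return;
    }
    handler.start_element(node.rule);
    for child in &node.children {
        walk(child, handler);
    }
    handler.end_element(node.rule);
}

/// Example handler that records each event as a string.
#[derive(Default)]
struct Recorder {
    events: Vec<String>,
}

impl Handler for Recorder {
    fn start_element(&mut self, rule: &str) {
        self.events.push(format!("start {}", rule));
    }
    fn text(&mut self, rule: &str, text: &str) {
        self.events.push(format!("text {}={}", rule, text));
    }
    fn end_element(&mut self, rule: &str) {
        self.events.push(format!("end {}", rule));
    }
}
```

Separating the walker from the handler means consumers only implement the callbacks they care about and never touch the tree structure directly.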

### Test Data and Integration Tests
Extensive test suite includes:
- Unit tests in `tests/parser_tests.rs` and `tests/encoding_tests.rs`
- Integration tests with real-world data:
  - BMRB NMR-STAR files (`tests/parse_bmrb_stars.rs`)
  - Crystallography Open Database CIF files (`tests/parse_cod_cifs.rs`)
  - Protein Data Bank mmCIF files (`tests/parse_pdb_mmcifs.rs`)
- Test data stored in `tests/test_data/` with samples from real databases

### Grammar Template System
The `build.rs` script generates three grammar variants from `src/star.pest_template`:
- Placeholder system allows encoding-specific character class definitions
- Unicode whitespace handling includes comprehensive character ranges
- Generated files: `star_ascii.pest`, `star_extended.pest`, `star_unicode.pest`
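
A minimal sketch of placeholder substitution as a build script might perform it. The `@NAME@` placeholder syntax, the placeholder name, and the character range used in the example are assumptions; the actual template format in `src/star.pest_template` is not shown here.

```rust
/// Render one grammar variant by replacing every `@NAME@` placeholder in
/// `template` with its encoding-specific definition.
/// The `@...@` syntax is hypothetical.
fn render(template: &str, substitutions: &[(&str, &str)]) -> String {
    let mut out = template.to_string();
    for (name, value) in substitutions {
        out = out.replace(&format!("@{}@", name), value);
    }
    out
}
```

For example, rendering the template line `non_blank_char = { @NON_BLANK@ }` with `NON_BLANK` mapped to `'\u{21}'..'\u{7E}'` yields a rule restricted to printable ASCII. A build script would call such a function once per encoding and write each result to its own file, typically with `std::fs::write`.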

## Development Notes

- The parser handles STAR format variants including CIF, NMR-STAR, mmCIF, and NEF
- BOM detection is automatic when enabled in configuration
- String decomposition is optional and controlled via configuration
- All parsers share identical rule structures but differ in character class definitions
- The project includes extensive real-world test data for validation
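
BOM detection, as mentioned in the notes above, usually amounts to inspecting the first few bytes of the input. The sketch below checks the standard UTF-8 and UTF-16 byte order marks. The `Bom` enum and `detect_bom` function are hypothetical, and which encodings the crate itself recognizes is not stated here.

```rust
/// Byte order marks recognized by this sketch.
#[derive(Debug, PartialEq, Eq)]
enum Bom {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the detected BOM and its length in bytes, if the input starts
/// with one. UTF-32 marks are not handled in this sketch.
fn detect_bom(bytes: &[u8]) -> Option<(Bom, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((Bom::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some((Bom::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((Bom::Utf16Be, 2))
    } else {
        None
    }
}
```

The returned length lets the caller skip the mark before handing the remaining bytes to the parser.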