onecode 0.1.0

# onecode-rs

Rust bindings for [ONEcode](https://github.com/thegenemyers/ONEcode), a simple and efficient data representation format for genomic data.

## Overview

ONEcode is a data representation framework designed primarily for genomic data, providing both human-readable ASCII and compressed binary file versions with strongly typed data.

This library provides safe, idiomatic Rust bindings to the ONEcode C library.

## Features

- ✅ Read and write ONE files in both ASCII and binary formats
- ✅ Schema validation and creation
- ✅ Provenance and reference tracking
- ✅ Type-safe access to fields (integers, reals, characters, strings, lists)
- ✅ File navigation and statistics
- ✅ Sequence name extraction from embedded GDB in alignment files
- ✅ RAII-based resource management
- ✅ **Fully thread-safe** - concurrent operations supported

## Requirements

### System Dependencies

This library uses `bindgen` to generate Rust bindings from C headers, which requires clang/libclang:

**Ubuntu/Debian:**
```bash
sudo apt-get install llvm-dev libclang-dev clang
```

**Fedora/RHEL:**
```bash
sudo dnf install clang-devel llvm-devel
```

**macOS:**
```bash
xcode-select --install  # Usually already installed
```

**Arch Linux:**
```bash
sudo pacman -S clang
```

For more details, see the [bindgen requirements documentation](https://rust-lang.github.io/rust-bindgen/requirements.html).

## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
onecode = { git = "https://github.com/pangenome/onecode-rs" }
```

## Usage

### Reading a ONE file

```rust
use onecode::OneFile;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut file = OneFile::open_read("data.1seq", None, None, 1)?;

    // Read through the file
    loop {
        let line_type = file.read_line();
        if line_type == '\0' {
            break; // End of file
        }

        match line_type {
            'S' => {
                // Access DNA sequence data
                println!("Sequence line");
            },
            'I' => {
                // Access the line's first integer field
                println!("ID: {}", file.int(0));
            },
            _ => {}
        }
    }

    Ok(())
}
```

### Writing a ONE file

```rust
use onecode::{OneFile, OneSchema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a schema
    let schema_text = "P 3 tst\nO T 1 3 INT\n";
    let schema = OneSchema::from_text(schema_text)?;

    // Open file for writing
    let mut writer = OneFile::open_write_new(
        "output.1tst",
        &schema,
        "tst",
        false,  // ASCII format
        1       // single-threaded
    )?;

    // Add provenance
    writer.add_provenance("myprogram", "1.0", "example command")?;

    // Write data
    writer.set_int(0, 42);
    writer.write_line('T', 0, None);

    // File is automatically closed on drop
    Ok(())
}
```

### Creating schemas from text

```rust
use onecode::OneSchema;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define schema inline
    let schema_text = r#"
P 3 seq
O S 1 3 DNA
D I 1 3 INT
    "#;

    let schema = OneSchema::from_text(schema_text)?;
    // Use schema for file operations
    Ok(())
}
```
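
In the schema strings used in these examples, each name appears to be preceded by its length (`3 seq`, `3 DNA`, `3 INT`), and `O`/`D` records carry a line-type character and a field count. The sketch below parses that layout in plain Rust to make the structure easier to see. It is inferred only from the examples in this README, not from the ONEcode specification, and `SchemaLine`, `read_counted`, `parse_line`, and `parse_schema` are hypothetical names that are not part of this crate.

```rust
#[derive(Debug, PartialEq)]
enum SchemaLine {
    // `P <len> <name>`: the primary file type.
    Primary(String),
    // `O` or `D`, then `<char> <nfields>` and one `<len> <type>` pair per field.
    LineDef { kind: char, line_type: char, fields: Vec<String> },
}

// Read one length-prefixed name such as `3 seq` and check the length.
fn read_counted<'a>(tokens: &mut impl Iterator<Item = &'a str>) -> Result<String, String> {
    let len = tokens
        .next()
        .ok_or("missing length")?
        .parse::<usize>()
        .map_err(|e| format!("bad length: {e}"))?;
    let text = tokens.next().ok_or("missing text")?;
    if text.len() != len {
        return Err(format!("length {len} does not match {text:?}"));
    }
    Ok(text.to_string())
}

// Parse a single non-empty schema line.
fn parse_line(line: &str) -> Result<SchemaLine, String> {
    let mut tokens = line.split_whitespace();
    let kind = tokens.next().ok_or("empty line")?;
    match kind {
        "P" => Ok(SchemaLine::Primary(read_counted(&mut tokens)?)),
        "O" | "D" => {
            let line_type = tokens
                .next()
                .and_then(|t| t.chars().next())
                .ok_or("missing line type")?;
            let n = tokens
                .next()
                .ok_or("missing field count")?
                .parse::<usize>()
                .map_err(|e| format!("bad field count: {e}"))?;
            let fields = (0..n)
                .map(|_| read_counted(&mut tokens))
                .collect::<Result<Vec<_>, _>>()?;
            let kind = if kind == "O" { 'O' } else { 'D' };
            Ok(SchemaLine::LineDef { kind, line_type, fields })
        }
        other => Err(format!("unsupported record kind {other:?}")),
    }
}

// Parse every non-blank line of a schema text.
fn parse_schema(text: &str) -> Result<Vec<SchemaLine>, String> {
    text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .map(parse_line)
        .collect()
}

fn main() {
    let text = "P 3 seq\nO S 1 3 DNA\nD I 1 3 INT\n";
    for line in parse_schema(text).expect("valid schema") {
        println!("{line:?}");
    }
}
```

For real work, `OneSchema::from_text` remains the way to validate a schema; this sketch only covers the three record kinds shown here.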

### Getting file statistics

```rust
use onecode::OneFile;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = OneFile::open_read("data.1seq", None, None, 1)?;

    // Get statistics for a line type
    let (count, max_length, total_length) = file.stats('S')?;
    println!("Sequences: {}, Max length: {}, Total: {}",
             count, max_length, total_length);

    Ok(())
}
```
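
The three values returned by `stats` are enough to derive simple summaries such as a mean line length. The helper below is a small sketch; `mean_length` is a hypothetical name, and the `i64` types are an assumption, since this README does not state the exact integer type that `stats` returns.

```rust
// Mean length per line, or None when there are no lines of this type.
// The i64 parameters are an assumption about what `stats` returns.
fn mean_length(count: i64, total_length: i64) -> Option<f64> {
    if count <= 0 {
        None
    } else {
        Some(total_length as f64 / count as f64)
    }
}

fn main() {
    // Illustrative numbers, not taken from a real file.
    let (count, max_length, total_length) = (4_i64, 150_i64, 400_i64);
    match mean_length(count, total_length) {
        Some(mean) => println!("mean {mean:.1}, max {max_length}"),
        None => println!("no lines of this type"),
    }
}
```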

### Working with alignment files (.1aln) and sequence names

Alignment files can contain embedded genome database (GDB) information, mapping sequence IDs to names:

```rust
use onecode::OneFile;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut file = OneFile::open_read("alignments.1aln", None, None, 1)?;

    // Get all sequence names (efficient for multiple lookups)
    let seq_names = file.get_all_sequence_names();
    println!("Found {} sequences", seq_names.len());

    // Read alignments and resolve sequence names
    loop {
        let line_type = file.read_line();
        if line_type == '\0' { break; }

        if line_type == 'A' {
            let query_id = file.int(0);
            let target_id = file.int(3);

            if let (Some(query_name), Some(target_name)) =
                (seq_names.get(&query_id), seq_names.get(&target_id)) {
                println!("Alignment: {} vs {}", query_name, target_name);
            }
        }
    }

    Ok(())
}
```

Or look up individual names on-demand:

```rust
use onecode::OneFile;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut file = OneFile::open_read("alignments.1aln", None, None, 1)?;

    // Get a specific sequence name by ID
    if let Some(name) = file.get_sequence_name(5) {
        println!("Sequence 5: {}", name);
    }

    Ok(())
}
```

## API Documentation

Full API documentation is available via `cargo doc`:

```bash
cargo doc --open
```

Key types:
- `OneFile` - Main file handle for reading/writing ONE files
- `OneSchema` - Schema definition and validation
- `OneError` - Error types
- `OneType` - Field type enumeration

## Building

The library uses `bindgen` to automatically generate bindings from the C headers and `cc` to compile the C library.

```bash
cargo build --release
```

## Testing

All tests pass with full concurrent execution:

```bash
cargo test
```

Test suite includes:
- 9 basic functionality tests
- 3 sequence name extraction tests
- 4 thread-safety stress tests (10-50 concurrent threads)
- 2 doc tests

## Thread Safety

✅ **Fully thread-safe!** The library supports concurrent operations without any restrictions.

The upstream ONEcode C library has been updated with thread-local storage for all global state, making it safe for concurrent use from multiple threads. All operations including schema creation, file reading, and error handling work correctly under concurrent load.
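
To illustrate why thread-local storage removes the need for locking, here is a small Rust analogy. It does not reproduce the C library's internals: `LAST_ERROR`, `set_error`, `take_error`, and `run_workers` are hypothetical names, and the only point is that each thread sees its own copy of the state.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    // Each thread gets its own independent error buffer.
    static LAST_ERROR: RefCell<String> = RefCell::new(String::new());
}

// Record a message in the calling thread's buffer.
fn set_error(msg: &str) {
    LAST_ERROR.with(|e| *e.borrow_mut() = msg.to_string());
}

// Return and clear the calling thread's buffer.
fn take_error() -> String {
    LAST_ERROR.with(|e| std::mem::take(&mut *e.borrow_mut()))
}

// Spawn `n` threads that each record and read back their own message.
fn run_workers(n: usize) -> Vec<String> {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            thread::spawn(move || {
                set_error(&format!("error from worker {i}"));
                // No other thread can overwrite this thread's buffer.
                take_error()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    for msg in run_workers(4) {
        println!("{msg}");
    }
}
```

Because no state is shared between threads, no mutex is involved, and one worker's message can never leak into another's.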

## Architecture

The library is organized into several modules:

- `ffi` - Raw FFI bindings generated by bindgen
- `error` - Rust error types and Result wrapper
- `types` - Rust-friendly type definitions
- `file` - Safe `OneFile` wrapper with RAII resource management
- `schema` - `OneSchema` management and validation
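
The split between `ffi` and `file` follows a common pattern for wrapping C libraries: a Rust struct owns the raw handle, and its `Drop` implementation releases it. The sketch below is not this crate's actual code. `RawHandle`, `raw_open`, `raw_close`, and `SafeFile` are hypothetical stand-ins, simulated in pure Rust so the example runs on its own.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts live handles so the example can show that Drop releases them.
static OPEN_HANDLES: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for an opaque C struct exposed by an `ffi` module.
struct RawHandle {
    path: String,
}

// Hypothetical stand-in for a C "open" function returning an owning pointer.
fn raw_open(path: &str) -> *mut RawHandle {
    OPEN_HANDLES.fetch_add(1, Ordering::SeqCst);
    Box::into_raw(Box::new(RawHandle { path: path.to_string() }))
}

// Hypothetical stand-in for a C "close" function that frees the handle.
// Safety: `ptr` must come from `raw_open` and must not be used afterwards.
unsafe fn raw_close(ptr: *mut RawHandle) {
    unsafe { drop(Box::from_raw(ptr)) };
    OPEN_HANDLES.fetch_sub(1, Ordering::SeqCst);
}

// Safe wrapper: owns the raw pointer and closes it exactly once.
struct SafeFile {
    ptr: *mut RawHandle,
}

impl SafeFile {
    fn open(path: &str) -> Self {
        SafeFile { ptr: raw_open(path) }
    }

    fn path(&self) -> &str {
        // Safety: `ptr` stays valid for as long as `self` is alive.
        let handle: &RawHandle = unsafe { &*self.ptr };
        &handle.path
    }
}

impl Drop for SafeFile {
    fn drop(&mut self) {
        // Safety: `ptr` came from `raw_open` and is closed only here.
        unsafe { raw_close(self.ptr) }
    }
}

// Number of handles currently open.
fn open_handles() -> usize {
    OPEN_HANDLES.load(Ordering::SeqCst)
}

fn main() {
    {
        let file = SafeFile::open("data.1seq");
        println!("opened {} ({} live)", file.path(), open_handles());
    } // `file` goes out of scope here and the handle is released.
    println!("{} live after scope", open_handles());
}
```

Tying cleanup to `Drop` means callers cannot forget to close a file, including on early returns through `?`.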

## Integration with ONEcode

The C library is included as a git subtree in the `ONEcode/` directory and compiled automatically during the build process.

To update the ONEcode subtree:

```bash
git subtree pull --prefix ONEcode https://github.com/thegenemyers/ONEcode.git main --squash
```

## Performance

- Zero-copy access to data where possible
- Supports parallel reading/writing with configurable thread count
- Binary format provides efficient compression
- Thread-safe without synchronization overhead

## License

This Rust wrapper is licensed under MIT OR Apache-2.0.

The ONEcode C library has its own license - see `ONEcode/` for details.

## Contributing

Contributions are welcome! Please ensure tests pass before submitting PRs:

```bash
cargo test
cargo clippy
cargo fmt
```

## Acknowledgments

ONEcode was developed by Gene Myers and Richard Durbin. This Rust wrapper builds on their excellent work to provide safe, idiomatic Rust bindings.