emstar
emstar is a high-performance Rust library for fast, memory-efficient reading and writing of STAR files, the format commonly used in cryo-EM workflows.
Features
- High Performance: Written in Rust with zero-copy parsing where possible
- Type-Safe: Strongly typed API with comprehensive error handling
- Flexible Data: Support for both simple blocks (key-value) and loop blocks (tabular data)
- CRUD Operations: Create, Read, Update, Delete APIs for easy data manipulation
- Statistics API: Analyze STAR file contents with detailed statistics
- Well-Tested: Comprehensive test suite with integration tests
- Benchmarks: Performance benchmarks for large-scale data processing
- Easy to Use: Simple, intuitive API similar to the Python starfile package
Quick Start
Installation
Add to your Cargo.toml:
[dependencies]
emstar = "0.1"
Basic Usage
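use emstar::{read, write, DataBlock};
use std::collections::HashMap;
// Read a STAR file ("particles.star" is a placeholder path)
let data_blocks: HashMap<String, DataBlock> = read("particles.star")?;
// Access a data block ("particles" is a placeholder block name)
if let Some(DataBlock::Loop(block)) = data_blocks.get("particles") {
    println!("{} rows", block.row_count());
}
// Write modified data ("output.star" is a placeholder path)
write(&data_blocks, "output.star")?;
# Ok::<(), emstar::Error>(())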
Creating STAR Files
use ;
use std::collections::HashMap;
// Create a simple block using array initialization
let simple: SimpleBlock = .into;
// Create a loop block using the builder pattern
let particles = builder
.columns
.rows
.build?;
// Combine into data blocks
let mut data = new;
data.insert;
data.insert;
// Write to file
write?;
# Ok::
File Operations
emstar provides simple read/write operations. For file management, use the Rust standard library:
use ;
use std::path::Path;
use std::fs;
// Read a file
let data = read?;
// Write a file (creates new or overwrites existing)
write?;
// Check if file exists
if new.exists
// Delete a file
remove_file?;
// Get file statistics
let file_stats = stats?;
println!;
println!;
Read/Write Options
emstar provides optional configuration for reading and writing:
ReadOptions
Filter and customize reading behavior:
use ;
// Read only specific blocks
let options = ReadOptions ;
let data = read?;
// Skip loop blocks (metadata-only mode)
let options = ReadOptions ;
let metadata = read?;
WriteOptions
Customize output format:
use ;
let options = WriteOptions ;
write?;
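As an illustration of what a float_precision setting typically controls, here is a minimal standard-library sketch. The function format_float is hypothetical and is not part of emstar; the assumption is that the option sets a fixed number of decimal places, with no option meaning Rust's default float formatting.

```rust
/// Format a float for output.
/// `precision` mirrors the idea of a float_precision option (an assumption
/// about its meaning): Some(p) fixes p decimal places, None uses the
/// default shortest representation.
pub fn format_float(value: f64, precision: Option<usize>) -> String {
    match precision {
        Some(p) => format!("{:.*}", p, value),
        None => value.to_string(),
    }
}
```

Fixed precision keeps columns visually aligned and files smaller, at the cost of rounding; leaving it unset preserves the value as Rust prints it.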
Utility Functions
Merge with Existing File
use ;
use std::collections::HashMap;
let mut new_blocks = new;
let particles = builder
.columns
.rows
.build?;
new_blocks.insert;
merge_with_file?;
Validate STAR File
Quick validation without loading all data:
use validate;
match validate
Merge Multiple Files
use merge;
merge?;
Streaming Statistics
Calculate statistics without loading all data:
use stats_streaming;
let stats = stats_streaming?;
println!;
println!;
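The idea behind computing statistics without loading all data can be sketched with the standard library alone. This is not emstar's implementation; the Counts struct and count_streaming function are hypothetical names, and the sketch assumes a simplified STAR layout (no quoted or multi-line values). It reads one line at a time and keeps only counters in memory.

```rust
use std::io::{self, BufRead};

/// Counters gathered in a single pass (hypothetical type, for illustration).
#[derive(Debug, Default, PartialEq)]
pub struct Counts {
    pub blocks: usize,
    pub loop_blocks: usize,
    pub loop_rows: usize,
}

/// Count blocks and loop rows line by line, never holding the table in memory.
pub fn count_streaming<R: BufRead>(reader: R) -> io::Result<Counts> {
    let mut counts = Counts::default();
    let mut in_loop = false;
    for line in reader.lines() {
        let line = line?;
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue; // skip blank lines and comments
        }
        if line.starts_with("data_") {
            counts.blocks += 1; // a header starts a new block
            in_loop = false;
        } else if line == "loop_" {
            counts.loop_blocks += 1;
            in_loop = true;
        } else if in_loop && !line.starts_with('_') {
            counts.loop_rows += 1; // data row inside a loop block
        }
    }
    Ok(counts)
}
```

Any BufRead source works, for example a std::io::BufReader wrapped around a std::fs::File, so memory use stays constant regardless of file size.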
SimpleBlock CRUD
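use emstar::{DataValue, SimpleBlock};
let mut block = SimpleBlock::new();
// Create (keys and values here are placeholders)
block.set("rlnVoltage", DataValue::Float(300.0));
block.set("rlnImageSize", DataValue::Integer(256));
// Read (assumes DataValue implements Debug)
if let Some(value) = block.get("rlnVoltage") {
    println!("{:?}", value);
}
// Update
block.set("rlnVoltage", DataValue::Float(200.0));
// Delete
block.remove("rlnImageSize");
block.clear(); // Remove all
// Utilities
let has_key = block.contains_key("rlnVoltage");
let count = block.len();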
LoopBlock CRUD
use ;
let mut block = new;
// Create - add columns and rows
block.add_column;
block.add_column;
block.add_row?;
// Read
let value = block.get; // row 0, col 0
let value = block.get_by_name;
let column = block.get_column;
// Update
block.set_by_name?;
// Delete
block.remove_row?;
block.remove_column?;
block.clear_rows; // Keep columns, remove all rows
block.clear; // Remove everything
// Utilities
let has_col = block.columns.contains;
let n_rows = block.row_count;
let n_cols = block.column_count;
// Iterate
for row in block.iter_rows
LoopBlock Builder Pattern
For more ergonomic LoopBlock creation, use the builder pattern:
use ;
// Create a LoopBlock using the builder
let particles = builder
.columns
.rows
.build?;
assert_eq!;
assert_eq!;
Builder methods:
- .columns(&["col1", "col2"]) - Set all column names at once
- .rows(vec![vec![...], vec![...]]) - Add multiple rows at once
- .build() - Construct the LoopBlock
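To show why build() returns a Result, here is a minimal standard-library sketch of the same builder shape. Table and TableBuilder are hypothetical stand-ins, not emstar types, and the row-length check is an assumption about what a builder of this kind would validate.

```rust
/// Hypothetical table type standing in for a loop block.
#[derive(Debug, PartialEq)]
pub struct Table {
    pub columns: Vec<String>,
    pub rows: Vec<Vec<f64>>,
}

/// Fluent builder: each method consumes and returns the builder.
#[derive(Default)]
pub struct TableBuilder {
    columns: Vec<String>,
    rows: Vec<Vec<f64>>,
}

impl TableBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Set all column names at once.
    pub fn columns(mut self, names: &[&str]) -> Self {
        self.columns = names.iter().map(|n| n.to_string()).collect();
        self
    }

    /// Add multiple rows at once.
    pub fn rows(mut self, rows: Vec<Vec<f64>>) -> Self {
        self.rows.extend(rows);
        self
    }

    /// Fail if any row's length differs from the number of columns.
    pub fn build(self) -> Result<Table, String> {
        for (i, row) in self.rows.iter().enumerate() {
            if row.len() != self.columns.len() {
                return Err(format!(
                    "row {} has {} values, expected {}",
                    i,
                    row.len(),
                    self.columns.len()
                ));
            }
        }
        Ok(Table { columns: self.columns, rows: self.rows })
    }
}
```

Deferring validation to build() lets columns and rows be supplied in any order while still guaranteeing that a constructed table is rectangular.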
DataBlock Convenience Methods
Access blocks without verbose pattern matching:
use ;
let data_blocks = read?;
// Using expect_simple/expect_loop (panics with message if wrong type)
if let Some = data_blocks.get
if let Some = data_blocks.get
// Using as_simple/as_loop (returns Option)
if let Some = data_blocks.get.as_simple
// Check block type
if data_blocks.get.is_loop
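The convenience methods named above follow a common Rust pattern for enums. The sketch below shows that pattern with hypothetical stand-in types (Block, SimpleData, LoopData); it is not emstar's source, only an illustration of how as_*, is_* and expect_* accessors usually relate to each other.

```rust
/// Hypothetical payload types, kept minimal for illustration.
pub struct SimpleData {
    pub entries: Vec<(String, String)>,
}
pub struct LoopData {
    pub columns: Vec<String>,
    pub rows: Vec<Vec<String>>,
}

/// Stand-in for a two-variant block enum.
pub enum Block {
    Simple(SimpleData),
    Loop(LoopData),
}

impl Block {
    /// Borrow the simple payload, or None if this is a loop block.
    pub fn as_simple(&self) -> Option<&SimpleData> {
        match self {
            Block::Simple(s) => Some(s),
            _ => None,
        }
    }

    /// Borrow the loop payload, or None if this is a simple block.
    pub fn as_loop(&self) -> Option<&LoopData> {
        match self {
            Block::Loop(l) => Some(l),
            _ => None,
        }
    }

    pub fn is_loop(&self) -> bool {
        matches!(self, Block::Loop(_))
    }

    /// Panics with `msg` when the block is not a loop block.
    pub fn expect_loop(&self, msg: &str) -> &LoopData {
        self.as_loop().expect(msg)
    }
}
```

The Option-returning accessors suit code that must handle both kinds; the expect_* form is appropriate only where a wrong block type is a programming error.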
SimpleBlock Array Initialization
Create a SimpleBlock from an array of key-value pairs:
use ;
let block: SimpleBlock = .into;
assert_eq!;
Statistics API
Analyze STAR file contents:
use ;
use std::collections::HashMap;
// Get statistics from file (loads entire file into memory)
let file_stats = stats?;
println!;
println!;
println!;
println!;
// Get specific block stats
if let Some = file_stats.get_block_stats
// Get stats from in-memory data
let blocks: = read?;
let mem_stats = from_blocks;
Data Types
emstar provides strongly typed data representations:
DataValue
Represents a single value in a STAR file:
- DataValue::String(String) - String values
- DataValue::Integer(i64) - Integer values
- DataValue::Float(f64) - Floating-point values
- DataValue::Null - Null/NA values
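A value enum with these four variants is usually filled by classifying each text token. The following standard-library sketch shows one way to do that. Value and parse_token are hypothetical names, not emstar's API, and treating empty, "?" and "." tokens as null is an assumption borrowed from the CIF/STAR convention.

```rust
/// Stand-in for the value type described above; not emstar's definition.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    String(String),
    Integer(i64),
    Float(f64),
    Null,
}

/// Classify one whitespace-delimited token.
/// Order matters: integers are tried before floats so "42" stays an integer.
pub fn parse_token(token: &str) -> Value {
    match token {
        // Assumption: these tokens mean "no value".
        "" | "?" | "." => Value::Null,
        t => {
            if let Ok(i) = t.parse::<i64>() {
                Value::Integer(i)
            } else if let Ok(f) = t.parse::<f64>() {
                Value::Float(f)
            } else {
                Value::String(t.to_string())
            }
        }
    }
}

impl Value {
    /// Numeric view: integers widen to f64, other variants give None.
    pub fn as_float(&self) -> Option<f64> {
        match self {
            Value::Float(f) => Some(*f),
            Value::Integer(i) => Some(*i as f64),
            _ => None,
        }
    }
}
```

Note that Rust's f64 parser also accepts spellings such as "nan" and "inf", so a stricter parser may want to reject or special-case those tokens.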
DataBlock
Represents a data block in a STAR file:
- DataBlock::Simple(SimpleBlock) - Key-value pairs
- DataBlock::Loop(LoopBlock) - Tabular data with columns and rows
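To make the distinction between the two block kinds concrete, here is a deliberately simplified parser sketch using only the standard library. Block, Pending and parse_star are hypothetical names, values are kept as raw strings, and quoted or multi-line values are not handled; emstar's real parser is not shown here.

```rust
use std::collections::HashMap;

/// Stand-in for the two block kinds; values stay as raw strings for brevity.
#[derive(Debug, PartialEq)]
pub enum Block {
    Simple(Vec<(String, String)>),
    Loop { columns: Vec<String>, rows: Vec<Vec<String>> },
}

/// Accumulates the block currently being read.
#[derive(Default)]
struct Pending {
    name: Option<String>,
    is_loop: bool,
    pairs: Vec<(String, String)>,
    columns: Vec<String>,
    rows: Vec<Vec<String>>,
}

impl Pending {
    /// Store the finished block under its name (no-op before the first header).
    fn finish(self, out: &mut HashMap<String, Block>) {
        if let Some(name) = self.name {
            let block = if self.is_loop {
                Block::Loop { columns: self.columns, rows: self.rows }
            } else {
                Block::Simple(self.pairs)
            };
            out.insert(name, block);
        }
    }
}

/// Parse STAR text into named blocks (simplified grammar).
pub fn parse_star(text: &str) -> HashMap<String, Block> {
    let mut out = HashMap::new();
    let mut cur = Pending::default();
    for line in text.lines().map(str::trim) {
        if line.is_empty() || line.starts_with('#') {
            continue; // blank line or comment
        }
        if let Some(name) = line.strip_prefix("data_") {
            // A new header closes the previous block.
            std::mem::take(&mut cur).finish(&mut out);
            cur.name = Some(name.to_string());
        } else if line == "loop_" {
            cur.is_loop = true;
        } else if let Some(rest) = line.strip_prefix('_') {
            let mut parts = rest.split_whitespace();
            let key = parts.next().unwrap_or("").to_string();
            if cur.is_loop {
                cur.columns.push(key); // "_name #1" declares a column
            } else {
                cur.pairs.push((key, parts.next().unwrap_or("").to_string()));
            }
        } else if cur.is_loop {
            cur.rows.push(line.split_whitespace().map(String::from).collect());
        }
    }
    cur.finish(&mut out);
    out
}
```

A block without loop_ becomes key-value pairs, while a block with loop_ becomes column names followed by whitespace-separated rows, which is the same split the two DataBlock variants express.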
SimpleBlock
Key-value pairs for simple data blocks:
let mut block = new;
block.set;
let value = block.get;
// Statistics
let stats = block.stats;
println!;
LoopBlock
Tabular data for loop blocks:
// Using the builder pattern (recommended)
let block = builder
.columns
.rows
.build?;
// Or using traditional methods
let mut block = new;
block.add_column;
block.add_column;
block.add_row?;
let n_rows = block.row_count;
let n_cols = block.column_count;
let value = block.get; // Get value at row 0, col 0
let value = block.get_by_name; // Get value by column name (row first!)
// Statistics
let stats = block.stats;
println!;
API Reference
I/O Functions
| Function | Description |
|---|---|
| read(path) | Read a STAR file from disk |
| write(&data, path) | Write data to a STAR file (creates or overwrites) |
| to_string(&data) | Convert data to STAR format string |
| list_blocks(&blocks) | List all blocks with their names and types |
| merge_with_file(&blocks, path) | Merge blocks with an existing file |
| validate(path) | Validate a STAR file without loading |
| merge(paths, output_path) | Merge multiple STAR files |
| stats_streaming(path) | Calculate statistics without loading |
For file management (delete, exists), use std::fs and std::path::Path.
Statistics Functions
| Function | Description |
|---|---|
| stats(path) | Calculate statistics for a STAR file |
SimpleBlock Methods
| Method | Description |
|---|---|
| new() | Create a new empty SimpleBlock |
| get(key) | Read value by key |
| set(key, value) | Create/Update value |
| remove(key) | Delete key-value pair |
| contains_key(key) | Check if key exists |
| contains_value(value) | Check if value exists |
| keys() | Get all keys |
| values() | Get iterator over all values |
| iter() | Iterate over key-value pairs |
| len() | Get number of entries |
| is_empty() | Check if block is empty |
| clear() | Remove all entries |
| drain() | Remove and return all entries |
| retain(f) | Retain entries matching predicate |
| first_value() | Get the first value |
| get_or_insert(key) | Get value or insert Null |
| stats() | Get block statistics |
LoopBlock Methods
| Method | Description |
|---|---|
| new() | Create a new empty LoopBlock |
| with_columns(names) | Create with specified columns |
| get(row, col) | Read value at position |
| get_by_name(row, col_name) | Read value by column name |
| get_f64(row, col_name) | Get value as f64 (auto-converts) |
| get_i64(row, col_name) | Get value as i64 (auto-converts) |
| get_string(row, col_name) | Get value as string |
| get_column(name) | Get entire column as Vec<DataValue> |
| iter_rows() | Iterate over all rows |
| column_iter_f64(name) | Iterate column as Option<f64> |
| column_iter_i64(name) | Iterate column as Option<i64> |
| column_iter_str(name) | Iterate column as Option<&str> |
| column_metadata(name) | Get column metadata (type, len, nulls) |
| add_column(name) | Add a column |
| add_row(values) | Add a row |
| set(row, col, value) | Update cell by indices |
| set_by_name(row, col_name, value) | Update cell by name |
| remove_row(idx) | Delete a row |
| remove_column(name) | Delete a column |
| clear_rows() | Remove all rows |
| clear() | Remove all data |
| row_count() | Get number of rows |
| column_count() | Get number of columns |
| stats() | Get block statistics |
| builder() | Create a LoopBlockBuilder (fluent API) |
| as_dataframe() | Access underlying Polars DataFrame |
| from_dataframe(df) | Create LoopBlock from Polars DataFrame |
LoopBlockBuilder Methods
| Method | Description |
|---|---|
| new() | Create a new builder |
| columns(names) | Set column names |
| rows(data) | Set all rows |
| validate() | Validate builder state |
| build() | Build the LoopBlock |
| build_validated() | Build with validation |
DataBlock Methods
| Method | Description |
|---|---|
| block_type() | Get type name ("SimpleBlock" or "LoopBlock") |
| is_simple() | Check if block is a SimpleBlock |
| is_loop() | Check if block is a LoopBlock |
| as_simple() | Get SimpleBlock reference (returns Option) |
| as_loop() | Get LoopBlock reference (returns Option) |
| as_simple_mut() | Get mutable SimpleBlock reference |
| as_loop_mut() | Get mutable LoopBlock reference |
| expect_simple(msg) | Get SimpleBlock or panic with message |
| expect_loop(msg) | Get LoopBlock or panic with message |
| count() | Get entry count (Simple) or row count (Loop) |
| stats() | Get block statistics |
Statistics Types
| Type | Description |
|---|---|
| StarStats | Comprehensive STAR file statistics |
| StarStats::from_blocks(&blocks) | Create stats from in-memory blocks |
| StarStats::get_block_stats(name) | Get stats for specific block |
| StarStats::has_loop_blocks() | Check if file has any LoopBlocks |
| StarStats::has_simple_blocks() | Check if file has any SimpleBlocks |
| DataBlockStats | Block-level statistics (enum) |
| LoopBlockStats | LoopBlock statistics (rows, cols, cells) |
| SimpleBlockStats | SimpleBlock statistics (entries) |
Configuration Types
| Type | Fields | Description |
|---|---|---|
| ReadOptions | blocks_to_read, skip_loop_blocks, skip_simple_blocks | Options for reading STAR files |
| WriteOptions | float_precision, header_comment, exclude_blocks | Options for writing STAR files |
| ValidationDetails | n_blocks, estimated_size_bytes, block_names, is_empty | File validation results |
| ColumnMetadata | name, dtype, len, null_count | Column metadata for inspection |
DataValue Methods
| Method | Description |
|---|---|
| as_integer() | Convert to i64 (returns Option) |
| as_float() | Convert to f64 (returns Option) |
| as_string() | Convert to &str (returns Option) |
| as_bool() | Convert to bool (returns Option) |
| is_null() | Check if value is Null |
| is_nan() | Check if Float is NaN |
| is_infinite() | Check if Float is infinite |
| type_name() | Get type name as string |
Examples
See the examples/ directory for comprehensive examples:
- basic_usage.rs - Basic read/write operations with Polars integration
- crud_operations.rs - Complete CRUD operations demonstration
- statistics.rs - Statistics API usage
Run an example with Cargo, for instance: cargo run --example basic_usage
Performance
emstar is designed for high performance:
- Zero-copy parsing where possible using SmartString
- Efficient numeric parsing using lexical
- Optimized memory layout for loop blocks using Polars DataFrames
- Streaming I/O for large files
Performance Recommendations
For best performance with emstar:
Fast Column Access
// Slow - allocates a Vec<DataValue>
let col = block.get_column.unwrap;
// Fast - zero-copy iteration
for val in block.column_iter_f64.unwrap
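The difference between copying a column and iterating over it in place can be illustrated with a small standard-library sketch. Column, iter_f64, to_vec and mean are hypothetical names modeling a column-oriented layout; they are not emstar's types, and the real library stores loop data in Polars DataFrames.

```rust
/// Hypothetical column of optional floats (None marks a null cell).
pub struct Column {
    pub name: String,
    pub values: Vec<Option<f64>>,
}

impl Column {
    /// Borrowing iterator: yields each cell without copying the column.
    pub fn iter_f64(&self) -> impl Iterator<Item = Option<f64>> + '_ {
        self.values.iter().copied()
    }

    /// Allocating accessor, for contrast: clones the whole column.
    pub fn to_vec(&self) -> Vec<Option<f64>> {
        self.values.clone()
    }
}

/// Mean of the non-null cells, computed in one pass with no allocation.
pub fn mean(col: &Column) -> Option<f64> {
    let (sum, n) = col
        .iter_f64()
        .flatten() // drop null cells
        .fold((0.0, 0usize), |(s, n), v| (s + v, n + 1));
    if n == 0 {
        None
    } else {
        Some(sum / n as f64)
    }
}
```

For a reduction such as a mean, the iterator version touches each value once, whereas the allocating version first duplicates the column and then walks the copy.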
Batch Updates
// Slow - O(n) per call, each call recreates the column
for in updates
// Fast - rebuild once using the builder pattern
let mut rows = Vecnew;
for val in updates
let new_block = builder
.columns
.rows
.build?;
Row Iteration
// Slow - allocates Vec for each row
for row in block.iter_rows
// Fast - use column iterators when possible
if let =
Advanced: Direct DataFrame Access
For complex operations, access the underlying Polars DataFrame:
use ;
use polars::prelude::*;
let data_blocks = read?;
if let Some = data_blocks.get
Key Points:
- as_dataframe() returns &DataFrame for read-only access
- Use .clone() if you need ownership for modifications
- Convert back with LoopBlock::from_dataframe(df)
Benchmarks
Run benchmarks with Cargo: cargo bench
Benchmark results (on typical hardware):
- Parse 10,000 rows: ~5-10ms
- Write 10,000 rows: ~2-5ms
- Parse 100,000 rows: ~50-100ms
- Write 100,000 rows: ~20-50ms
Testing
Run tests with Cargo: cargo test
Run tests with coverage:
Error Handling
emstar uses thiserror for comprehensive error types:
use Error;
match read
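The general shape of a typed error that callers can match on is sketched below with the standard library only. StarError and first_line are hypothetical and the variants are assumptions; emstar's actual Error type and its variants are not shown in this document, and thiserror would normally generate the Display and From implementations written out by hand here.

```rust
use std::fmt;
use std::io;

/// Hypothetical error type with one variant per failure class.
#[derive(Debug)]
pub enum StarError {
    Io(io::Error),
    Parse { line: usize, message: String },
}

impl fmt::Display for StarError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StarError::Io(e) => write!(f, "I/O error: {}", e),
            StarError::Parse { line, message } => write!(f, "line {}: {}", line, message),
        }
    }
}

impl std::error::Error for StarError {}

/// Lets `?` convert io::Error into StarError automatically.
impl From<io::Error> for StarError {
    fn from(e: io::Error) -> Self {
        StarError::Io(e)
    }
}

/// Return the first line of a file, distinguishing I/O and content errors.
pub fn first_line(path: &str) -> Result<String, StarError> {
    let text = std::fs::read_to_string(path)?; // io::Error converts via From
    text.lines()
        .next()
        .map(str::to_string)
        .ok_or(StarError::Parse { line: 1, message: "empty file".to_string() })
}
```

Matching on the variant lets a caller retry or report I/O failures separately from malformed input, which a single string error would not allow.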
Features
- default - Core functionality
- serde - Optional serde support for (de)serialization (planned)
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.