taco_format 0.1.3

TACO (Trajectory and Compressed Observables) Format for molecular dynamics data
# TACO (Trajectory and Compressed Observables) Format

TACO is a high-performance binary format for molecular dynamics (MD) trajectory data, designed for efficient storage and processing of large simulation trajectories.

## Features

- **Delta Encoding**: Stores differences between consecutive frames to leverage temporal coherence
- **Hybrid Compression**: Configurable lossy (half-precision) or lossless compression for positions, velocities, and forces
- **Random Access**: Fast direct access to arbitrary frames without scanning the entire file
- **Metadata Support**: Rich metadata for both simulation parameters and atom properties
- **Efficient Processing**: Optimized algorithms for reading and writing frames
- **Multi-Language Support**: Native APIs for Python, C, C++, and Fortran
- **Cross-Platform**: Works on Linux, macOS, and Windows
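
To make the delta-encoding idea concrete, here is a minimal pure-Python sketch. It is not TACO's actual implementation: it assumes a full frame is stored every `full_frame_interval` frames and that every other frame is stored as the difference from its predecessor. The function names `encode_deltas` and `decode_frame` are hypothetical and exist only for this illustration.

```python
def encode_deltas(frames, full_frame_interval=50):
    """Encode frames (flat lists of numbers) as ("full", values) or ("delta", diffs)."""
    records = []
    for i, frame in enumerate(frames):
        if i % full_frame_interval == 0:
            # Periodic full frame: a restart point for decoding
            records.append(("full", list(frame)))
        else:
            prev = frames[i - 1]
            records.append(("delta", [c - p for c, p in zip(frame, prev)]))
    return records


def decode_frame(records, index, full_frame_interval=50):
    """Rebuild one frame, starting from the nearest preceding full frame."""
    start = (index // full_frame_interval) * full_frame_interval
    kind, values = records[start]
    assert kind == "full"
    frame = list(values)
    # Apply every delta between the full frame and the requested frame
    for kind, delta in records[start + 1 : index + 1]:
        frame = [v + d for v, d in zip(frame, delta)]
    return frame


# Three atoms drifting along x; integer coordinates keep the round trip exact.
frames = [[step, 0, 0, step + 10, 0, 0, step + 20, 0, 0] for step in range(8)]
records = encode_deltas(frames, full_frame_interval=4)
restored = decode_frame(records, 6, full_frame_interval=4)
```

In this sketch the deltas are small, mostly repeated values, which is what a general-purpose compressor handles well. A larger interval means fewer full frames but more deltas to apply when jumping to an arbitrary frame.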

## Performance

TACO is designed for compact storage and fast I/O compared to traditional formats:
- **Storage Efficiency**: Typically 3-5x smaller than ASE trajectory files
- **Fast Reading**: Efficient batch loading of frames for analysis tasks
- **Fast Writing**: Streamlined frame processing and compression
- **Memory Efficient**: Processes frames in batches to minimize memory usage

## Python Interface

TACO provides a Python interface for easy integration with analysis tools.

### Installation

```bash
pip install taco-format
```

### Basic Usage

```python
import taco_format
import numpy as np
from ase import Atoms

# Create some example Atoms objects
atoms_list = [Atoms('H2O') for _ in range(100)]

# Writing
taco_format.write('traj.taco', atoms_list)

# Reading
atoms_list = taco_format.read('traj.taco')
```

### Appending to Existing Trajectories

TACO supports efficiently appending frames to existing trajectory files without rewriting the entire file:

```python
import taco_format
from ase import Atoms

# Create initial trajectory
initial_frames = [Atoms('H2O') for _ in range(100)]
taco_format.write('traj.taco', initial_frames)

# Later, append more frames to the same file
additional_frames = [Atoms('H2O') for _ in range(50)]
taco_format.append('traj.taco', additional_frames)

# The file now contains 150 frames total
all_frames = taco_format.read('traj.taco')
print(len(all_frames))  # Output: 150
```

**Benefits of Append:**
- **Efficient**: Only writes new frame data, doesn't rewrite existing frames
- **Maintains Compression**: Delta encoding chain is preserved across appends
- **Preserves Metadata**: All simulation and atom metadata remains intact
- **Multiple Appends**: Can append to the same file multiple times
- **Random Access**: Full random access to all frames after appending

**Rust API:**

```rust
use taco_format::{Writer, Frame, FrameData};
use ndarray::Array2;

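// Create initial trajectory (num_atoms, time_step, the metadata values, and
// compression_settings are assumed to be defined as in "Writing Trajectories")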
let mut writer = Writer::create(
    "trajectory.taco",
    num_atoms,
    time_step,
    sim_metadata,
    atom_metadata,
    compression_settings,
)?;

// Write initial frames
for i in 0..100 {
    let frame = create_frame(i);  // Your frame creation logic
    writer.write_frame(frame)?;
}
writer.finish()?;

// Later, append more frames
let mut writer = Writer::append("trajectory.taco")?;
for i in 100..150 {
    let frame = create_frame(i);
    writer.write_frame(frame)?;
}
writer.finish()?;
```

### Advanced Usage

```python
# Write with custom settings
taco_format.write('traj.taco', atoms_list,
                  time_step=0.002,             # in picoseconds
                  full_frame_interval=50,      # store full frame every 50 frames
                  compression_level=5,         # zstd compression level (1-22)
                  lossless=False)              # use lossy compression

# Read specific frames
frames = taco_format.read('traj.taco', 
                          frame_indices=[0, 10, 20, 30, 40])

# Read a range of frames
frames = taco_format.read('traj.taco', 
                          start_frame=100,
                          end_frame=200)  # Reads frames 100-199

# Efficient writing for large trajectories
taco_format.write('big_traj.taco', big_atoms_list,
                  compression_level=3)  # Use moderate compression
```
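
To get a feel for what the lossy setting costs in precision, the sketch below rounds a few values to IEEE 754 half precision using only the standard library. It assumes that lossy mode stores values as binary16. The sample values are made up, and whether TACO applies the rounding to absolute coordinates or to deltas is not specified here.

```python
import struct


def to_half_and_back(values):
    """Round each value to half precision (binary16); return values and byte count."""
    packed = struct.pack(f"<{len(values)}e", *values)
    return list(struct.unpack(f"<{len(values)}e", packed)), len(packed)


deltas = [0.0125, -0.0031, 0.25, 1.5]  # hypothetical per-frame displacements
restored, half_bytes = to_half_and_back(deltas)

# The same values stored as 32-bit floats take twice the space
single_bytes = len(struct.pack(f"<{len(deltas)}f", *deltas))

# Largest relative rounding error introduced by the half-precision round trip
max_rel_error = max(abs(r - d) / abs(d) for r, d in zip(restored, deltas))
```

Half precision has an 11-bit significand, so the relative rounding error for normal values is bounded by 2^-11 (about 0.05%). That is usually tolerable for small displacements but can be too coarse for large absolute coordinates.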

### Tensor Operations

TACO provides built-in tensor operations for common trajectory analyses:

```python
import taco_format
import numpy as np

# Calculate center of mass
positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], dtype=np.float32)
masses = np.array([[1.0], [12.0], [16.0]], dtype=np.float32)
com = taco_format.center_of_mass(positions, masses)

# Extract subset of atoms
indices = [0, 2]  # Atoms to extract (must be valid row indices into `positions`)
subset = taco_format.extract_subset(positions, indices)

# Calculate RMSD between two coordinate sets
coords1 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], dtype=np.float32)
coords2 = np.array([[0.1, 0.1, 0.1], [1.1, 0.1, 0.1]], dtype=np.float32)
rmsd = taco_format.calc_rmsd(coords1, coords2)
```
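
For cross-checking results, the following pure-Python reference versions may be useful. They assume that the center of mass is the mass-weighted mean position and that the RMSD is computed directly on the given coordinates, without any superposition or alignment step. Both are assumptions about the library's definitions rather than documented behavior.

```python
import math


def center_of_mass(positions, masses):
    """Mass-weighted mean position for a list of [x, y, z] rows."""
    total = sum(masses)
    return [
        sum(m * p[axis] for m, p in zip(masses, positions)) / total
        for axis in range(3)
    ]


def rmsd(coords1, coords2):
    """Root-mean-square deviation between two coordinate sets (no alignment)."""
    if len(coords1) != len(coords2):
        raise ValueError("coordinate sets must have the same number of atoms")
    squared = sum(
        (a - b) ** 2 for p, q in zip(coords1, coords2) for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(coords1))


positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
masses = [1.0, 12.0, 16.0]
com = center_of_mass(positions, masses)  # [12/29, 16/29, 0]

coords1 = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
coords2 = [[0.1, 0.1, 0.1], [1.1, 0.1, 0.1]]
value = rmsd(coords1, coords2)  # sqrt(0.03), roughly 0.1732
```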

### Utility Functions

The Python interface also includes utility functions for working with TACO files:

```python
# Check if file is a TACO file
is_taco = taco_format.is_taco_file("trajectory.taco")

# Get file information
info = taco_format.get_file_info("trajectory.taco")
print(info)

# Copy frames from one file to another
taco_format.copy_frames("source.taco", "subset.taco", 
                        start_frame=10, num_frames=50)

# Extract specific atoms
atom_indices = [0, 1, 2, 10, 15, 20]  # Atoms to extract
taco_format.extract_atoms("full.taco", "subset.taco", atom_indices)
```

## C, C++, and Fortran Interfaces

TACO provides native interfaces for C, C++, and Fortran, enabling integration with existing molecular dynamics codes and high-performance computing applications.

All interfaces are located in the `c_api/` directory with a single, canonical implementation.

### C API

The C API provides a low-level interface suitable for integration with C programs and as a foundation for other language bindings.

```c
#include "taco_format_c.h"

// Setup metadata
const char* names[] = {"O", "H", "H"};
const char* types[] = {"O", "H", "H"};
float masses[] = {15.999, 1.008, 1.008};
taco_atom_metadata_t atom_metadata = {
    .masses = masses, .names = names, .types = types, .num_atoms = 3
};

taco_simulation_metadata_t sim_metadata = {
    .name = NULL, .description = NULL, .ensemble = NULL,
    .temperature = 300.0, .pressure = 1.0, .software = NULL, .timestep_fs = 1.0
};

taco_compression_settings_t compression = {.precision = 0, .zstd_level = 3};

// Create writer
CTacoWriter* writer = taco_writer_create("output.taco", 3, 0.001,
                                        &sim_metadata, &atom_metadata, 
                                        compression);

// Write frame
float positions[] = {0.0, 0.0, 0.0, 0.1, 0.08, 0.0, 0.1, -0.08, 0.0};
taco_frame_t frame = {
    .frame_number = 0, .time = 0.0, .positions = positions, 
    .num_atoms = 3, /* other fields... */
};
taco_writer_write_frame(writer, &frame);
taco_writer_finish(writer);
```

### C++ API

The C++ API provides a modern interface with RAII, STL containers, and exception handling.

```cpp
#include "taco_format.hpp"
#include <string>
#include <vector>

// Setup metadata
std::vector<float> masses = {15.999f, 1.008f, 1.008f};
std::vector<std::string> names = {"O", "H", "H"};
taco::AtomMetadata atom_metadata(masses, names);

taco::SimulationMetadata sim_metadata;
sim_metadata.ensemble = "NVT";
sim_metadata.temperature = 300.0;

{
    // Create writer (RAII: closed automatically at the end of this scope)
    taco::Writer writer("output.taco", 3, 0.001, sim_metadata, atom_metadata);

    // Create and write frame
    taco::Frame frame;
    frame.positions = {{0.0f, 0.0f, 0.0f}, {0.1f, 0.08f, 0.0f}, {0.1f, -0.08f, 0.0f}};
    writer.write_frame(frame);
}  // writer goes out of scope here, so the file is closed before it is read

// Read all frames
taco::Reader reader("output.taco");
auto all_frames = reader.read_all_frames();
```

### Fortran API

The Fortran API provides a modern Fortran 2008 interface with ISO C binding.

```fortran
program taco_example
    use iso_c_binding
    use iso_fortran_env, only: real32, real64, int64
    use taco_format
    implicit none
    
    type(c_ptr) :: writer, reader
    type(taco_frame_t) :: frame
    type(taco_compression_settings_t) :: compression
    type(taco_atom_metadata_t) :: atom_meta
    type(taco_simulation_metadata_t) :: sim_meta
    
    real(real32), target :: masses(3) = [15.999, 1.008, 1.008]
    real(real32), target :: positions(9) = [0.0, 0.0, 0.0, 0.1, 0.08, 0.0, 0.1, -0.08, 0.0]
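    ! Strings handed to C through c_loc must be null-terminated
    character(kind=c_char, len=8), target :: names(3) = &
        ['O'//c_null_char, 'H'//c_null_char, 'H'//c_null_char]
    character(kind=c_char, len=8), target :: types(3) = &
        ['O'//c_null_char, 'H'//c_null_char, 'H'//c_null_char]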
    type(c_ptr), target :: name_ptrs(3), type_ptrs(3)
    integer :: error_code, i
    
    ! Setup pointers for strings
    do i = 1, 3
        name_ptrs(i) = c_loc(names(i))
        type_ptrs(i) = c_loc(types(i))
    end do
    
    ! Setup metadata
    sim_meta%temperature = 300.0_c_double
    sim_meta%pressure = 1.0_c_double
    sim_meta%timestep_fs = 1.0_c_double
    ! Set other fields to null
    
    atom_meta%masses = c_loc(masses)
    atom_meta%names = c_loc(name_ptrs)
    atom_meta%types = c_loc(type_ptrs)
    atom_meta%num_atoms = 3
    
    compression%precision = 0  ! lossless
    compression%zstd_level = 3
    
    ! Create writer
    writer = taco_writer_create('output.taco', 3, 0.001_real64, &
                               sim_meta, atom_meta, compression)
    
    ! Setup frame
    frame%frame_number = 0
    frame%time = 0.0_c_double
    frame%positions = c_loc(positions)
    frame%num_atoms = 3
    ! Set other fields...
    
    ! Write frame
    error_code = taco_writer_write_frame(writer, frame)
    error_code = taco_writer_finish(writer)
end program
```

### Building with C/C++/Fortran Support

```bash
# Build the C API library
cd c_api
cargo build --release

# Build and test C examples
make test_c_api_static
./test_c_api_static

# Build and run Fortran examples and tests
cd fortran
make all          # Build interface, examples, and tests
make run-examples # Run all examples
make run-tests    # Run all tests
```

The C API is located in `c_api/` with:
- `taco_format_c.h` - C header file
- `src/lib.rs` - Rust implementation with C FFI
- `test_c_api.c` - C example/test

The Fortran API is located in `c_api/fortran/` with:
- `taco_format.f90` - Fortran interface module
- `examples/` - Comprehensive Fortran examples
- `tests/` - Fortran unit and integration tests

See [C/C++/Fortran API Documentation](docs/c_cpp_fortran_api.md) for complete details.

## Usage in Rust

### Writing Trajectories

```rust
use taco_format::{Writer, Frame, FrameData, SimulationMetadata, AtomMetadata, CompressionSettings};
use ndarray::Array2;

// Create metadata
let sim_metadata = SimulationMetadata::default();
let atom_metadata = AtomMetadata::default();

// Create a writer
let mut writer = Writer::create(
    "trajectory.taco", // File path
    1000,              // Number of atoms
    0.001,             // Time step (ps)
    sim_metadata,
    atom_metadata,
    CompressionSettings::default(),
)?;

// Write frames
let positions = Array2::<f32>::zeros((1000, 3));
let frame_data = FrameData::new(positions);
let frame = Frame::new(0, 0.0, frame_data);
writer.write_frame(frame)?;

// Write multiple frames sequentially
// (frame1, frame2, frame3 are built the same way as `frame` above)
let frames = vec![frame1, frame2, frame3];
writer.write_frames(frames)?;

// Finish writing
writer.finish()?;
```

### Reading Trajectories

```rust
use taco_format::Reader;

// Open a reader
let mut reader = Reader::open("trajectory.taco")?;

// Get header information
println!("Num atoms: {}", reader.num_atoms());
println!("Num frames: {}", reader.num_frames());

// Read a specific frame
let frame = reader.read_frame(42)?;
let positions = frame.data.positions.unwrap();

// Read a range of frames
let frames = reader.read_frame_range(100, 200)?; // Frames 100-199

// Read specific frames
let frame_indices = vec![10, 20, 30, 40, 50];
let selected_frames = reader.read_frames(&frame_indices)?;

// Iterate through all frames
for frame_result in reader.iter_frames() {
    let frame = frame_result?;
    // Process frame...
}
```

### Tensor Operations

```rust
use taco_format::tensor;
use ndarray::{Array1, Array2};

// Calculate center of mass
let positions = Array2::<f32>::zeros((100, 3));
let masses = Array1::<f32>::ones(100);
let com = tensor::center_of_mass(&positions, &Some(masses));

// Extract subset of atoms
let atom_indices = vec![0, 1, 5, 10];
let subset = tensor::extract_subset(&positions, &atom_indices);

// Calculate RMSD between two coordinate sets
let coords1 = Array2::<f32>::zeros((100, 3));
let coords2 = Array2::<f32>::zeros((100, 3));
let rmsd = tensor::calc_rmsd(&coords1, &coords2)?;
```

## File Structure

```
[Header]
- Format version
- Simulation parameters (time step, temperature, etc.)
- Atom metadata (masses, names, etc.)
- Compression settings

[Frame Index Table]
- Byte offsets to each frame for random access

[Data Blocks]
- Full and delta frames:
  - Position tensors (Nx3)
  - Velocity tensors (Nx3)
  - Force tensors (Nx3)
  - Box dimensions & energies
```
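
The role of the frame index table can be illustrated with a small stand-alone sketch. The layout used here (a frame count, one 64-bit offset per frame, then uncompressed float32 blocks) is invented for this example and is not TACO's on-disk layout. The names `write_indexed` and `read_frame` are likewise hypothetical.

```python
import io
import struct


def write_indexed(frames):
    """Serialize frames as: u64 frame count, u64 offset table, float32 data blocks."""
    header_size = 8 + 8 * len(frames)
    blocks = [struct.pack(f"<{len(f)}f", *f) for f in frames]
    offsets, position = [], header_size
    for block in blocks:
        offsets.append(position)  # byte offset where this frame's block starts
        position += len(block)
    out = io.BytesIO()
    out.write(struct.pack("<Q", len(frames)))
    out.write(struct.pack(f"<{len(offsets)}Q", *offsets))
    for block in blocks:
        out.write(block)
    return out.getvalue()


def read_frame(stream, index, values_per_frame):
    """Read one frame by looking up its offset instead of scanning the file."""
    stream.seek(0)
    (count,) = struct.unpack("<Q", stream.read(8))
    if not 0 <= index < count:
        raise IndexError(index)
    stream.seek(8 + 8 * index)  # jump to this frame's entry in the offset table
    (offset,) = struct.unpack("<Q", stream.read(8))
    stream.seek(offset)
    raw = stream.read(4 * values_per_frame)
    return list(struct.unpack(f"<{values_per_frame}f", raw))


frames = [[float(i)] * 9 for i in range(5)]  # 5 frames, 3 atoms x 3 coordinates
data = write_indexed(frames)
frame3 = read_frame(io.BytesIO(data), 3, values_per_frame=9)
```

Reading a frame costs two seeks regardless of its position in the file. With delta-encoded data, a reader would additionally need to start from the nearest full frame, as described under Features.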

## Building from Source

```bash
git clone https://github.com/username/taco-format.git
cd taco-format
cargo build --release
```

For Python bindings:

```bash
pip install maturin
maturin develop --release
```

## License

MIT