taco_format 0.1.4

TACO (Trajectory and Compressed Observables) Format for molecular dynamics data
---
title: 'TACO: A Trajectory and Compressed Observables Format for Molecular Dynamics Data'
tags:
  - Rust
  - Python
  - molecular dynamics
  - trajectory
  - compression
  - scientific data
authors:
  - name: Nils Holle
    orcid: 0000-0001-9830-6881
    affiliation: 1
  - name: Riccardo Mazzarello
    orcid: 0000-0003-2319-375X
    affiliation: 2
  - name: Martin Salinga
    orcid: 0000-0002-2228-6244
    affiliation: 1
affiliations:
  - name: University of Münster, Institute of Materials Physics, Germany
    index: 1
    ror: 00pd74e08
  - name: Sapienza Università di Roma
    index: 2
    ror: 02be6w209
date: 20 June 2025
bibliography: paper.bib
---

# TACO: Trajectory and Compressed Observables Format for Molecular Dynamics

## Summary

Molecular dynamics (MD) simulations generate large volumes of trajectory data, which pose significant challenges for data storage, transfer, and analysis. The TACO (Trajectory and Compressed Observables) format addresses these challenges with a file format designed specifically for MD trajectories. TACO exploits temporal coherence through delta encoding, offers configurable lossy and lossless compression, and supports efficient random access to individual frames. The format stores rich metadata and provides both Rust and Python interfaces that integrate with the Atomic Simulation Environment (ASE). TACO also includes a command-line interface for common operations such as frame extraction, format conversion, and file validation, available both as a standalone binary and through the Python installation. Implemented in Rust with Python bindings, TACO is intended as a flexible and extensible tool for researchers managing trajectory data from large-scale molecular simulations.

## Statement of need

Modern molecular dynamics simulations routinely generate trajectories containing millions of atoms over millions of time steps, resulting in datasets that can reach terabyte scales. These large datasets present multiple challenges for researchers and computing infrastructure. Storage limitations on high-performance computing systems often restrict the amount of trajectory data that can be retained. Data transfer between computing centers and analysis workstations becomes prohibitively slow with conventional formats. Traditional file formats typically lack efficient random access capabilities, forcing sequential reads through potentially enormous files. Additionally, many existing formats provide limited support for rich metadata that can be crucial for analysis and reproducibility.

Existing formats like DCD [xphillips2020scalable], XTC [xabraham2015gromacs], and NetCDF [Amber reference] either do not integrate well with widely used analysis tools like ASE or lack features for handling full simulation outputs that include positions, velocities, forces, stress tensors, and other quantities.

TACO addresses these issues in several ways. Like the TRR format, for example, it employs delta encoding to store differences between frames, exploiting the temporal coherence inherent in MD trajectories. It implements hybrid compression with configurable precision modes, offering both lossy and lossless options that can be tuned to the required accuracy. The format provides fast random access through an index structure that enables direct frame lookup without scanning through the file. TACO supports rich metadata for both simulation parameters and atom properties. Finally, it offers native Python bindings for integration with common analysis workflows, with full compatibility with the Atomic Simulation Environment (ASE).

## Key Features and Design

### Delta Encoding and Compression

TACO employs delta encoding to store differences between consecutive frames rather than absolute coordinates. This strategy exploits the temporal coherence in MD trajectories, where atomic movements between adjacent time steps are typically small compared to absolute positions.
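The idea can be illustrated with a minimal sketch in plain Python. It is not TACO's implementation: the function names `delta_encode` and `delta_decode` and the flattened list-of-floats frame layout are assumptions chosen for illustration only.

```python
from typing import List

# Hypothetical layout: one frame is a flat list [x0, y0, z0, x1, ...].
Frame = List[float]


def delta_encode(frames: List[Frame]) -> List[Frame]:
    """Keep the first frame as-is, then store frame-to-frame differences."""
    if not frames:
        return []
    encoded = [list(frames[0])]
    for prev, curr in zip(frames, frames[1:]):
        encoded.append([c - p for c, p in zip(curr, prev)])
    return encoded


def delta_decode(encoded: List[Frame]) -> List[Frame]:
    """Rebuild absolute frames by accumulating the stored differences."""
    if not encoded:
        return []
    frames = [list(encoded[0])]
    for delta in encoded[1:]:
        frames.append([p + d for p, d in zip(frames[-1], delta)])
    return frames


# One atom moving slightly between three consecutive frames.
trajectory = [
    [10.0, 20.0, 30.0],
    [10.5, 20.0, 29.5],
    [11.0, 20.25, 29.0],
]
encoded = delta_encode(trajectory)
decoded = delta_decode(encoded)
```

The stored differences are much smaller in magnitude than the absolute coordinates, which is what makes them suitable for a reduced-precision representation and for subsequent lossless compression.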

The compression strategy uses configurable precision modes that can be adapted to different accuracy requirements. By default, delta frames are encoded in half precision (16-bit) to maximize compression efficiency, while key frames, stored at regular intervals, are automatically written in full precision (32-bit) to prevent error accumulation over long trajectories. Users can override these defaults and configure both the precision mode and the key frame interval according to their accuracy and storage requirements. Lossless Zstandard compression is then applied to all data at a configurable compression level (1 to 22), allowing users to trade compression speed against file size.
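A rough sketch of how key frames bound the error of half-precision deltas is shown below. It is written under several assumptions and does not reproduce TACO's on-disk encoding: `encode`, `decode`, and the `(is_key, payload)` block layout are hypothetical, and `zlib` from the Python standard library stands in for Zstandard so that the example has no external dependencies.

```python
import struct
import zlib
from typing import List, Tuple

Frame = List[float]
Block = Tuple[bool, bytes]  # (is_key_frame, compressed payload); hypothetical layout


def encode(frames: List[Frame], key_interval: int, level: int = 6) -> List[Block]:
    """Key frames as float32, all other frames as float16 deltas to the previous frame."""
    blocks: List[Block] = []
    for i, frame in enumerate(frames):
        is_key = i % key_interval == 0
        if is_key:
            payload = struct.pack(f"<{len(frame)}f", *frame)
        else:
            deltas = [c - p for c, p in zip(frame, frames[i - 1])]
            payload = struct.pack(f"<{len(frame)}e", *deltas)  # 'e' = IEEE half precision
        # zlib is only a stand-in here; its levels run 0-9, not Zstandard's 1-22.
        blocks.append((is_key, zlib.compress(payload, level)))
    return blocks


def decode(blocks: List[Block]) -> List[Frame]:
    """Restart from each key frame and accumulate deltas in between."""
    frames: List[Frame] = []
    for is_key, blob in blocks:
        payload = zlib.decompress(blob)
        if is_key:
            frames.append(list(struct.unpack(f"<{len(payload) // 4}f", payload)))
        else:
            deltas = struct.unpack(f"<{len(payload) // 2}e", payload)
            frames.append([p + d for p, d in zip(frames[-1], deltas)])
    return frames


# Two coordinates drifting slowly over nine frames; key frame every four frames.
frames = [[100.0 + 0.001 * i, 50.0 - 0.002 * i] for i in range(9)]
blocks = encode(frames, key_interval=4)
decoded = decode(blocks)
max_error = max(
    abs(a - b) for fa, fb in zip(frames, decoded) for a, b in zip(fa, fb)
)
```

In this sketch the rounding error of the 16-bit deltas can only accumulate until the next key frame, so a shorter key frame interval gives a tighter error bound at the cost of more full-precision data.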

### File Structure and Random Access

TACO implements an efficient file structure with a header containing format version information, simulation metadata including temperature and ensemble details, atom metadata such as masses and element types, and compression settings. This is followed by a frame index table that stores byte offsets to each frame, and finally the data blocks containing full and delta frames. The frame index table enables $\mathcal{O}(1)$ random access to any frame without scanning through the file, accelerating analysis workflows that require non-sequential frame access.
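The header, index, and data layout described above can be sketched as follows. This is an illustrative toy container, not the TACO binary format: the magic bytes `DEMO`, the field widths, and the functions `write_file` and `read_frame` are all assumptions made for the example.

```python
import io
import struct
from typing import BinaryIO, List

MAGIC = b"DEMO"  # hypothetical magic bytes, not those used by TACO
HEADER_SIZE = 8  # 4 bytes magic + 4 bytes frame count


def write_file(out: BinaryIO, frames: List[bytes]) -> None:
    """Write header, then a table of byte offsets, then length-prefixed frame blocks."""
    offset = HEADER_SIZE + 8 * len(frames)  # first block starts after the index table
    offsets = []
    for blob in frames:
        offsets.append(offset)
        offset += 4 + len(blob)  # 4-byte length prefix + payload
    out.write(MAGIC + struct.pack("<I", len(frames)))
    out.write(struct.pack(f"<{len(frames)}Q", *offsets))
    for blob in frames:
        out.write(struct.pack("<I", len(blob)))
        out.write(blob)


def read_frame(f: BinaryIO, i: int) -> bytes:
    """Fetch frame i with a fixed number of seeks, independent of the frame count."""
    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("not a demo container")
    (count,) = struct.unpack("<I", f.read(4))
    if not 0 <= i < count:
        raise IndexError(f"frame {i} out of range")
    f.seek(HEADER_SIZE + 8 * i)  # jump straight to the i-th index entry
    (offset,) = struct.unpack("<Q", f.read(8))
    f.seek(offset)
    (length,) = struct.unpack("<I", f.read(4))
    return f.read(length)


# Frames of different sizes, as produced by variable-length compression.
buffer = io.BytesIO()
write_file(buffer, [b"frame-0", b"frame-1-longer", b"f2"])
last = read_frame(buffer, 2)
```

Because every index entry has a fixed width, the position of the entry for frame `i` follows from arithmetic alone, even though the compressed frame blocks themselves vary in size.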

## Performance and Implementation

TACO is implemented in Rust for performance and memory safety, with Python bindings provided via PyO3 for seamless integration with scientific Python workflows. The implementation requires Python 3.8 or newer and provides full compatibility with the Atomic Simulation Environment (ASE).

## Command-Line Interface

TACO includes a command-line interface providing convenient access to common trajectory operations. The CLI is available both as a Rust binary and as a Python entry point when installed via pip.

The CLI supports five main operations: **info** (display file metadata), **extract** (subset frame ranges), **convert** (change compression settings), **check** (validate integrity), and **stats** (analyze trajectory data). For example:

```bash
taco extract input.taco subset.taco --start 100 --end 200
taco convert subset.taco final.taco --lossless
```

## Conclusion

The TACO format addresses the growing data management challenges in molecular dynamics research by providing a modern, efficient solution for trajectory storage and analysis. By combining domain-specific optimizations such as delta encoding and configurable compression strategies with fast random access capabilities, TACO enables more efficient storage and analysis of trajectory data compared to traditional formats. The implementation in Rust ensures high performance and memory safety, while the Python bindings and ASE compatibility allow researchers to easily incorporate TACO into existing workflows.

The format's flexible design allows users to balance storage efficiency with accuracy requirements through configurable precision modes and compression settings. Future development will focus on optimizing read and write performance further.

## Acknowledgments

...

## References