extxyz-ng 0.0.2

extended XYZ parser (extxyz)

extxyz

Extended XYZ specification and parsers. Implemented in Rust: fast, memory safe, and with Python bindings.

Performance is a central focus, especially for incremental read/write for large-scale trajectories. Frames are processed in a streaming fashion using buffered I/O, enabling a minimal memory footprint.

Compared to the legacy C implementation, this approach is up to 4× faster and uses about half the memory for large files. See the benchmark sections.

Usage

This crate provides a low-level parser for the format, designed to be easily converted into structured data compatible with the ccmat library. For high-level tasks such as analyzing crystal or molecular structures, we recommend using ccmat directly, which offers both Rust and Python APIs.

Rust

To use this in your Rust project, run cargo add extxyz-ng or add it to your Cargo.toml:

[dependencies]
extxyz = { package = "extxyz-ng", version = "0" }

To read a single structure frame from a file and write it back out:

use std::fs::File;
use std::io::{BufReader, BufWriter};

use extxyz::{read_frame, write_frame};

fn main() {
    let path = "./structure.xyz";
    let file = File::open(path).unwrap();
    let mut rd = BufReader::new(file);

    let frame = read_frame(&mut rd).unwrap();
    // `{:?}` assumes these fields implement `Debug`.
    println!("{:?}", frame.natoms);
    println!("{:?}", frame.info);
    println!("{:?}", frame.arrs);

    let file = File::create("output.xyz").unwrap();
    let mut w = BufWriter::new(file);
    write_frame(&mut w, &frame).unwrap();
}

Or, to read and write multiple frames, for example from LAMMPS trajectory output:

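use std::fs::File;
use std::io::{BufReader, BufWriter};

use extxyz::{read_frames, write_frames};

fn main() {
    let path = "./trajectories.xyz";
    let file = File::open(path).expect("Failed to open file");
    let mut rd = BufReader::new(file);

    // `read_frames` returns an iterator, so frames are parsed one at a time.
    let frames = read_frames(&mut rd);
    for frame in frames {
        // `{:?}` assumes these fields implement `Debug`.
        println!("{:?}", frame.natoms);
        println!("{:?}", frame.info);
        println!("{:?}", frame.arrs);
    }

    // The loop above consumed the iterator, so reopen the input before writing.
    let file = File::open(path).expect("Failed to open file");
    let mut rd = BufReader::new(file);
    let frames = read_frames(&mut rd);

    let file = File::create("output_trajs.xyz").expect("Failed to create file");
    let mut w = BufWriter::new(file);
    write_frames(&mut w, frames).unwrap();
}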

Python

Install the package:

pip install extxyz-ng

The module to import is extxyz.

from pathlib import Path

from extxyz import (
    read_frame,
    read_frame_from_file,
    read_frames,
    read_frames_from_file,
    write_frame,
    write_frames,
)

# read a frame from an .xyz file
p = Path(__file__).parent / "mgb.xyz"
frame = read_frame_from_file(p)

# or from an open file handle
# (only needed if you already hold a file handle; `read_frame_from_file` should cover most use cases)
with open(p, "rb") as fh:
    frame = read_frame(fh)

# read multiple frames from an .xyz file; this returns an iterator
p = Path(__file__).parent / "mgb_multi_frames.xyz"
frames = read_frames_from_file(p)

# or from an open file handle; this also returns an iterator
# NOTE: the frames must be consumed inside the open context, see the read-write round-trip example below
with open(p, "rb") as fh:
    frames = read_frames(fh)
    nframes = sum(1 for _ in frames)

# write a frame to a file (note the "wb", not "w": bytes are handled directly for performance)
fpath = Path(__file__).parent / "foo.xyz"
with open(fpath, "wb") as fh:
    write_frame(fh, frame)

# write multiple frames into a file
with open(fpath, "wb") as fh_write:
    write_frames(fh_write, read_frames_from_file(p))

A round-trip example that reads frames and writes them back. Frames read from a file handle must be used inside the open context, because the frames must not outlive the file handle they are read from.

from pathlib import Path

from extxyz import read_frames, write_frames

inp = Path(__file__).parent / "mgb_multi_frames.xyz"
out = Path(__file__).parent / "foo.xyz"

with open(inp, "rb") as fh_read, open(out, "wb") as fh_write:
    # do not close the read handle while frames are still being streamed from it
    frames = read_frames(fh_read)
    write_frames(fh_write, frames)
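
To see why a lazy iterator has to be consumed inside the `with` block, here is a minimal sketch that uses only the Python standard library. `lazy_lines` is a hypothetical stand-in for a streaming reader such as `read_frames`; it is not part of extxyz, and the error extxyz raises in this situation may differ.

```python
import io


def lazy_lines(fh):
    """Hypothetical stand-in for a streaming reader: yields lines on demand."""
    for line in fh:
        yield line


# Creating the iterator reads nothing; data is pulled only when it is consumed.
with io.BytesIO(b"frame-1\nframe-2\n") as fh:
    lines = lazy_lines(fh)
    inside = list(lines)  # consumed while the handle is still open: works

with io.BytesIO(b"frame-1\nframe-2\n") as fh:
    lines = lazy_lines(fh)  # created, but not consumed yet

try:
    outside = list(lines)  # consumed after the handle was closed
except ValueError as err:
    outside = None
    error_message = str(err)
```

The second iterator fails because its first read happens only after the handle has been closed.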

When to use (or not use) the legacy C implementation, libAtoms/extxyz

You should use libAtoms/extxyz if you want to

  • use the Julia binding (one could be added to extxyz, but there is no time to work on it at the moment).
  • use the Fortran binding.

You should use extxyz/extxyz if you want

  • robust parsing that won't end in a segmentation fault when your input is slightly misaligned (e.g. leading spaces).
  • clear errors that show exactly where the input could not be parsed.
  • nicely formatted output; the writer takes care of accurate alignment for every format.
  • to read flexible input that is compatible with legacy libAtoms/extxyz, without segfaults on input that is expected to be readable.
  • support for the latest Python versions.
  • to use it in WebAssembly.
  • to use it as a Rust dependency.
  • streaming reads and writes that don't blow up your RAM for trajectories of large structures.

Performance benchmark

Compared with the legacy C implementation in libAtoms/extxyz, the Rust implementation is nearly 4 times faster. The benchmark parses a structure with more than 20k atoms.

Figure: benchmark vs. legacy libAtoms

Memory benchmark

No memory leaks in the Rust implementation

The Rust implementation is memory safe; valgrind reports no memory leaks.

By contrast, libAtoms/extxyz leaks memory, as the following valgrind run shows:

valgrind --leak-check=full ./target/release/read_frame_legacy_c

==1485786== 215,052 (24 direct, 215,028 indirect) bytes in 1 blocks are definitely lost in loss record 203 of 203
==1485786==    at 0x48AB7A8: malloc (vg_replace_malloc.c:446)
==1485786==    by 0x401D458: cleri_grammar (in /home/jyu/rust/extxyz/target/release/read_frame_legacy_c)
==1485786==    by 0x4017DDB: read_frame_legacy_c::main (in /home/jyu/rust/extxyz/target/release/read_frame_legacy_c)
==1485786==    by 0x4018F02: std::sys::backtrace::__rust_begin_short_backtrace (in /home/jyu/rust/extxyz/target/release/read_frame_legacy_c)
==1485786==    by 0x4018EF8: std::rt::lang_start::{{closure}} (in /home/jyu/rust/extxyz/target/release/read_frame_legacy_c)
==1485786==    by 0x402BCA5: std::rt::lang_start_internal (in /home/jyu/rust/extxyz/target/release/read_frame_legacy_c)
==1485786==    by 0x4018EE4: main (in /home/jyu/rust/extxyz/target/release/read_frame_legacy_c)

Low memory footprint when reading frames

The Rust implementation reads frames through a buffer, so memory usage does not accumulate as more frames are read. An iterator is returned, which has the same lifetime as the file handle.

Here is the memory footprint recorded using valgrind --tool=massif:

    KB
10.48^                                                                  :     
     |#: :    : ::::::::: : :  @    :    :: :: ::@@ ::::   ::   :::  :  :@  ::
     |#  :: :::@: :: :: : : :  @:: ::    : ::  ::@  : : :  ::   :::  :: :@  ::
     |# ::::: :@: :: :: :::::::@: ::::::@: ::  ::@ :: : :@@::@  :::: ::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@:::::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
     |# ::::: :@: :: :: :::::::@: :::: :@: :: :::@ :: : :@ ::@: :::::::::@::::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   1.344
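
The flat profile above is what frame-by-frame streaming gives you. As a rough illustration of the idea (not the extxyz implementation), the sketch below streams plain XYZ frames from a text handle with a generator. `iter_xyz_frames` is a hypothetical helper: it does no error handling and does not parse the extended info line.

```python
import io


def iter_xyz_frames(fh):
    """Yield (natoms, comment, atom_lines) one frame at a time from a text handle.

    Only the current frame is held in memory; earlier frames can be garbage
    collected as soon as the caller drops them.
    """
    while True:
        header = fh.readline()
        if not header.strip():
            return  # end of input (a blank line also stops the sketch)
        natoms = int(header)
        comment = fh.readline().rstrip("\n")
        atom_lines = [fh.readline().rstrip("\n") for _ in range(natoms)]
        yield natoms, comment, atom_lines


text = "2\nfirst\nH 0 0 0\nH 0 0 0.74\n1\nsecond\nHe 0 0 0\n"
frames = iter_xyz_frames(io.StringIO(text))
counts = [natoms for natoms, _comment, _atoms in frames]
```

Because each frame is produced only when requested, peak memory is bounded by the largest single frame rather than by the length of the trajectory.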

Low memory footprint when reading a large frame (20,000 lines)

The file itself is ~758 KB. Because reading is buffered, the file is never loaded as a whole; the memory usage comes entirely from the final constructed structure. The total memory usage (2.53 MB) is about half that of the libAtoms C implementation, which requires 4.74 MB to parse the same file.

    MB
2.528^                                                                      # 
     |                                                                @@@@:@# 
     |                                                          @@@:@@@@@@:@#:
     |                                                     :::::@@@:@@@@@@:@#:
     |                                              ::::::::::: @@@:@@@@@@:@#:
     |                                       ::::::::::::: :::: @@@:@@@@@@:@#:
     |                                  ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     |                            ::::::::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     |                     @@@@:::: ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     |              ::@@@@@@@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     |        ::::@:::@ @@ @@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     |   ::@:@::: @:::@ @@ @@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     | ::: @:@::: @:::@ @@ @@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     | ::: @:@::: @:::@ @@ @@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     | ::: @:@::: @:::@ @@ @@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     | ::: @:@::: @:::@ @@ @@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     | ::: @:@::: @:::@ @@ @@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     | ::: @:@::: @:::@ @@ @@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     | ::: @:@::: @:::@ @@ @@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
     | ::: @:@::: @:::@ @@ @@ @:: : ::: ::@@:::::::::::::: :::: @@@:@@@@@@:@#:
   0 +----------------------------------------------------------------------->Mi
     0                                                                   96.84

Writer formatting

  • keys and values in the info line keep their original format

Round-trip tests

To stay fully backward compatible with the legacy libAtoms/extxyz, which has been used in the community for a long time, I run the following round-trip test to ensure the behavior aligns with the old specification.

  • .xyz --extxyz/read--> inner --extxyz/write--> .xyz-01 --cextxyz/read--> inner --> .xyz-02
  • check that xyz-01 is exactly the same as xyz-02 in content.

TODO: some ambiguous inputs that need to be rechecked for both the legacy and the new parser

  • what if Properties has the same key more than once but with different shapes? Will an undefined override happen?
  • array rows padded with spaces cause a segfault

Specification

Types to be parsed in the info line (the second line)

  • Float
  • Int
  • Boolean
  • bare string
  • string

Type promotion

In an array, Int is promoted to Float. There are no other promotion rules. This differs from libAtoms/extxyz, because blind promotion causes ambiguity. For example, if a bool could be promoted to a string, would it become "True", "T" or "true"? If the user writes an Int or a Float, they mean it to be a number. If a string is parsed from the same array, it usually indicates an invalid element in the input file.
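
The rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the parser's actual code; `promote_array` is a hypothetical helper that takes values which have already been parsed into Python `int`, `float`, `bool` or `str`.

```python
def promote_array(values):
    """Return values with a single element type, or raise ValueError.

    Hypothetical helper: only the Int -> Float promotion is allowed.
    """
    if not values:
        raise ValueError("empty array is invalid")
    kinds = {type(v) for v in values}  # exact types, so bool is not treated as int
    if len(kinds) == 1:
        return list(values)
    if kinds == {int, float}:
        return [float(v) for v in values]
    names = sorted(k.__name__ for k in kinds)
    raise ValueError(f"cannot mix element types in one array: {names}")


promoted = promote_array([1, 2.5, 3])  # Int mixed with Float: promoted to Float

try:
    promote_array([1, "T", 2.0])  # a string among numbers: rejected
except ValueError as err:
    mixed_error = str(err)

try:
    promote_array([True, 1])  # bool is never promoted, not even to Int
except ValueError as err:
    bool_error = str(err)
```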

"Properties" shape

  • when writing, the shape in the "Properties" field is deduced from the raw data after the info line. Internally, "Properties" is guaranteed to match the real shape; this is verified when the frame is created. A frame cannot be created by hand (no constructors are provided for the struct); it is always created from raw text.
  • when reading, "Properties" is read and stored, and the deduced shape is computed and validated against the claimed "Properties" shape. If the two do not agree, parsing fails.
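
As a rough illustration of the read-side check, the sketch below compares the column count claimed by a "Properties" value (the usual `name:type:ncols` triples of the extended XYZ format) against the atom rows. `parse_properties` and `check_rows` are hypothetical helpers; they check only column counts, not element types, and are not the library's implementation.

```python
def parse_properties(spec):
    """Split 'name:type:ncols:...' into a list of (name, type, ncols) triples."""
    parts = spec.split(":")
    if len(parts) % 3 != 0:
        raise ValueError(f"malformed Properties value: {spec!r}")
    return [
        (parts[i], parts[i + 1], int(parts[i + 2]))
        for i in range(0, len(parts), 3)
    ]


def check_rows(spec, rows):
    """Raise ValueError if any atom row does not match the claimed column count."""
    expected = sum(ncols for _name, _type, ncols in parse_properties(spec))
    for lineno, row in enumerate(rows, start=1):
        found = len(row.split())
        if found != expected:
            raise ValueError(
                f"atom row {lineno}: expected {expected} columns, found {found}"
            )
    return expected


ncols = check_rows("species:S:1:pos:R:3", ["Mg 0.0 0.0 0.0", "B 1.0 0.5 0.5"])

try:
    check_rows("species:S:1:pos:R:3", ["Mg 0.0 0.0"])  # one coordinate missing
except ValueError as err:
    shape_error = str(err)
```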

Output format: backward read compatibility with libAtoms/extxyz

The output format is constrained by the following rules, so that the output can be read by the legacy libAtoms/extxyz.

  • No leading spaces on any line.
  • Properties is printed as the last key-value pair in the info line.
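
A minimal sketch of an info-line writer that follows these two rules. `format_info_line` is a hypothetical helper, and it assumes the values are already formatted (and quoted where needed) as strings; the real writer's value formatting is not shown here.

```python
def format_info_line(info):
    """Join key=value pairs with no leading space and with Properties last.

    `info` maps keys to already-formatted value strings (hypothetical input shape).
    """
    pairs = [(k, v) for k, v in info.items() if k != "Properties"]
    if "Properties" in info:
        pairs.append(("Properties", info["Properties"]))
    return " ".join(f"{k}={v}" for k, v in pairs)


line = format_info_line(
    {
        "Properties": "species:S:1:pos:R:3",
        "energy": "-1.5",
        "pbc": '"T T T"',
    }
)
```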

Extended input format support

  • leading spaces are accepted on every line.
  • an info line that is not made of key-value pairs is accepted, so that plain (non-extended) XYZ files can be parsed (with a default Properties shape).

Extended matrix format

  • a key named "Lattice" (or "lattice"/"lAttice"/...; matching is case-insensitive) is treated differently.
  • the lattice is stored column-wise per vector.
  • other matrices are formatted as an array of rows, each row itself an array, for example "[[e1, e2], [e3, e4], [e5, e6]]".
  • an empty array is invalid and causes a parsing error, e.g. [] is invalid.
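
The sketch below illustrates these rules in Python, under two assumptions that the text above does not spell out. First, the lattice value is taken to be nine whitespace-separated numbers in which each consecutive group of three is one lattice vector, as is common in extended XYZ files. Second, other matrices are limited to numeric literals so that `ast.literal_eval` can read them. All three helpers are hypothetical and do not reflect the parser's internals.

```python
import ast


def is_lattice_key(key):
    """The lattice key is matched case-insensitively."""
    return key.lower() == "lattice"


def parse_lattice(value):
    """Parse nine whitespace-separated numbers into three lattice vectors."""
    numbers = [float(tok) for tok in value.split()]
    if len(numbers) != 9:
        raise ValueError(f"expected 9 numbers for a lattice, found {len(numbers)}")
    return [numbers[0:3], numbers[3:6], numbers[6:9]]


def parse_matrix(value):
    """Parse a nested '[[...], [...]]' numeric matrix; an empty array is invalid."""
    rows = ast.literal_eval(value)
    if not isinstance(rows, list) or not rows:
        raise ValueError("empty or non-array value is invalid")
    return rows


lattice = parse_lattice("1 0 0 0 2 0 0 0 3")
matrix = parse_matrix("[[1, 2], [3, 4], [5, 6]]")

try:
    parse_matrix("[]")
except ValueError as err:
    empty_error = str(err)
```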

dev

Since all functionality has been ported to the Rust implementation, the legacy C code is kept only for testing and benchmarking, and will eventually be removed once the Rust implementation is widely used.

To test against the legacy C implementation binding, clone this repository with its submodules, which pull in the libAtoms/extxyz source code (and its own submodule libcleri, used for language parsing).

git clone --recurse-submodules https://github.com/extxyz/extxyz.git
cd extxyz

Making a new release

For a PyPI release:

  • update the version number in python/Cargo.toml. The version does not need to match the Rust crate version.
  • manually trigger the pypi-publish CI workflow.

The binary release and the crates.io release share the same version number.

# commit and push to main (can be done with a PR)
git commit -am "release: version 0.1.0"
git push

# actually push the tag up (this triggers dist's CI)
git tag v0.1.0
git push --tags

The crates.io build can be triggered manually through the crate-publish CI workflow, but a manual run does not perform the final crates.io upload.

Roadmap

  • Julia binding
  • Python binding
  • benchmark on speed when parsing large files.
  • read multiple frames.
  • benchmark the memory usage when parsing
  • Fortran binding (not planned)
  • ccmat integration through features tag
  • on python side, interop with ase atoms data structure.
  • v0 marks the point where all functionality is available with Python bindings. The goal is to get it used in the community so that I get more API feedback.
  • v1 will make the APIs stable; after that the project should be regarded as fully complete.

License

All contributions must retain this attribution.