This is a high-performance Rust library for handling the FoLiA XML format, a rich format for linguistic annotation.
This library is currently in alpha stage. It can already be used to read FoLiA documents and to create documents from scratch. Note that this library does not yet implement validation! You will have to ensure your FoLiA documents are valid by running another FoLiA validator, as this library does not yet guarantee producing valid FoLiA.
For a comparison of FoLiA libraries and a list of implemented features, see FoLiA Implementations.
Installation
Add folia to your project's Cargo.toml.
Usage
Reading from file and querying all words:
extern crate folia;
use folia;
//load document from file
let doc = from_file.expect;
//Build a query, here you can match on any attribute
let query = select.element;
//Turn the query into a specific selector
let selector = from_query;
//Run the selector
for word in doc.select
A common pattern is to query in two stages: methods like get_annotation() and get_annotations() provide shortcut alternatives to select(). Let's output Part-of-Speech tags:
//Run the selector
for word in doc.select
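The two-stage pattern can be illustrated without the library itself. The sketch below is a self-contained toy model, not the folia API: the `Element` struct, the `ElementType` enum, and the `select` and `get_annotation` functions are hypothetical stand-ins that only borrow the names used above. Stage one collects all words; stage two asks each word for its Part-of-Speech annotation.

```rust
// Toy model of two-stage querying. Hypothetical, NOT the folia crate's API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ElementType {
    Sentence,
    Word,
    PartOfSpeech,
}

struct Element {
    kind: ElementType,
    // For a word this holds its text; for an annotation, its class.
    value: String,
    children: Vec<Element>,
}

// Stage 1: recursively collect every descendant of the given type.
fn select<'a>(root: &'a Element, kind: ElementType, out: &mut Vec<&'a Element>) {
    for child in &root.children {
        if child.kind == kind {
            out.push(child);
        }
        select(child, kind, out);
    }
}

// Stage 2: shortcut returning the first direct child of the given type.
fn get_annotation(element: &Element, kind: ElementType) -> Option<&Element> {
    element.children.iter().find(|c| c.kind == kind)
}

// Helper that builds a word carrying a single Part-of-Speech annotation.
fn word(text: &str, pos: &str) -> Element {
    Element {
        kind: ElementType::Word,
        value: text.to_string(),
        children: vec![Element {
            kind: ElementType::PartOfSpeech,
            value: pos.to_string(),
            children: Vec::new(),
        }],
    }
}

fn sample_sentence() -> Element {
    Element {
        kind: ElementType::Sentence,
        value: String::new(),
        children: vec![word("hello", "intj"), word("world", "n")],
    }
}

fn main() {
    let sentence = sample_sentence();
    let mut words = Vec::new();
    select(&sentence, ElementType::Word, &mut words);
    for w in words {
        if let Some(pos) = get_annotation(w, ElementType::PartOfSpeech) {
            println!("{}\t{}", w.value, pos.value);
        }
    }
}
```

In the real library the arguments and return types of these methods will differ; consult its API documentation for the actual signatures.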
We can create a document from scratch:
let doc = new.expect;
let root: ElementKey = 0; //root element always has key 0
//add a sentence, returns its key
let sentence = doc.add_element_to.expect;
doc.add_element_to.expect;
doc.add_element_to.expect;
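The calls above address elements by key, with the root fixed at key 0. One common way to get that behaviour is an arena: all elements live in a single vector and a key is simply an index into it. The sketch below is an assumption about how such a store could work, not a description of this library's internals; the `Store` and `Node` types, the `usize` key type, and the signature of `add_element_to` are all hypothetical.

```rust
// Arena-style element store. Hypothetical sketch, NOT folia-rust's internals.
type ElementKey = usize; // assumption: the real key type may differ

struct Node {
    name: String,
    parent: Option<ElementKey>,
    children: Vec<ElementKey>,
}

struct Store {
    nodes: Vec<Node>,
}

impl Store {
    // The root is pushed first, so it always receives key 0.
    fn new(root_name: &str) -> Store {
        let root = Node { name: root_name.to_string(), parent: None, children: Vec::new() };
        Store { nodes: vec![root] }
    }

    // Append a new element under `parent` and return the new element's key.
    fn add_element_to(&mut self, parent: ElementKey, name: &str) -> Result<ElementKey, String> {
        if parent >= self.nodes.len() {
            return Err(format!("no element with key {}", parent));
        }
        let key = self.nodes.len();
        self.nodes.push(Node { name: name.to_string(), parent: Some(parent), children: Vec::new() });
        self.nodes[parent].children.push(key);
        Ok(key)
    }
}

fn main() {
    let mut store = Store::new("text");
    let root: ElementKey = 0; // root element always has key 0
    // Add a sentence, then two words inside it.
    let sentence = store.add_element_to(root, "s").expect("adding sentence");
    store.add_element_to(sentence, "w").expect("adding word");
    store.add_element_to(sentence, "w").expect("adding word");
    let s = &store.nodes[sentence];
    println!("{} has {} children, parent {:?}", s.name, s.children.len(), s.parent);
}
```

Returning a key rather than a reference keeps the store free to grow while the caller holds on to the handle, which is why the sentence's key can be reused for both later insertions.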
If you have an element's key (a numerical internal identifier), you can easily obtain a FoliaElement instance:
if let Some = doc.get_element
If you have its official ID, you can do:
if let Some = doc.get_element_by_id
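The difference between the two lookups can be sketched as follows: a key lookup is a direct index, while an ID lookup first resolves the ID to a key through an index. This is a minimal, self-contained illustration under that assumption and not the library's implementation; the `Document` and `Element` types here, the `add` method, and the example ID are hypothetical, and only the method names `get_element` and `get_element_by_id` are taken from the text above.

```rust
use std::collections::HashMap;

// Hypothetical sketch of key and ID lookup; NOT the folia crate's implementation.
type ElementKey = usize;

struct Element {
    id: Option<String>,
    text: String,
}

struct Document {
    elements: Vec<Element>,                // key = position in this vector
    id_index: HashMap<String, ElementKey>, // official ID -> key
}

impl Document {
    fn new() -> Document {
        Document { elements: Vec::new(), id_index: HashMap::new() }
    }

    // Store an element, index its ID if it has one, and return its key.
    fn add(&mut self, id: Option<&str>, text: &str) -> ElementKey {
        let key = self.elements.len();
        if let Some(id) = id {
            self.id_index.insert(id.to_string(), key);
        }
        self.elements.push(Element { id: id.map(str::to_string), text: text.to_string() });
        key
    }

    // Lookup by key is a bounds-checked index.
    fn get_element(&self, key: ElementKey) -> Option<&Element> {
        self.elements.get(key)
    }

    // Lookup by ID resolves the ID to a key first.
    fn get_element_by_id(&self, id: &str) -> Option<&Element> {
        let key = *self.id_index.get(id)?;
        self.get_element(key)
    }
}

fn main() {
    let mut doc = Document::new();
    let key = doc.add(Some("example.s.1.w.1"), "hello");
    if let Some(element) = doc.get_element(key) {
        println!("by key: {}", element.text);
    }
    if let Some(element) = doc.get_element_by_id("example.s.1.w.1") {
        println!("by ID: {:?} -> {}", element.id, element.text);
    }
}
```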
Benchmarks
As the primary goal of this library is high performance, we ran some limited benchmarks against the other, more mature and more feature-complete FoLiA libraries: FoliaPy, written in Python, and libfolia, written in C++.
Tested on an Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz, running Linux 5.3.
Note: The folia-rust implementation does only minimal validation, whereas the others do a complete shallow validation on parsing, including a text consistency validation.
Benchmarks on a FoLiA document of roughly 100 MB (bosb002gide03_01.nederlab.folia.xml)
Parse from file into a full memory representation (DOM)
Implementation | CPU | Memory | Peak Memory |
---|---|---|---|
foliapy v2.2.1 | 60.9 s | 2083 MB | - |
libfolia v2.3 | 14.7 s | 2656 MB | 2681 MB |
folia-rust v0.0.1 | 2.6 s | 531 MB | 622 MB |
Selecting and iterating over all words
Implementation | CPU | Memory | Peak Memory |
---|---|---|---|
foliapy v2.2.1 | 1.46 s | - | - |
libfolia v2.3 | 0.84 s | - | - |
folia-rust v0.0.1 | 0.122 s | - | - |
Serialisation (without disk writing)
Implementation | CPU | Memory | Peak Memory |
---|---|---|---|
foliapy v2.2.1 | 77.7 s | - | - |
libfolia v2.3 | 5.06 s | - | - |
folia-rust v0.0.1 | 1.14 s | - | - |
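To read the three tables in relative terms, each CPU time can be divided by the folia-rust figure for the same task. The snippet below is plain arithmetic on the numbers reported above; the `speedup` helper is a name introduced here for illustration only.

```rust
// Ratio of another implementation's CPU time to folia-rust's CPU time.
fn speedup(other_secs: f64, rust_secs: f64) -> f64 {
    other_secs / rust_secs
}

fn main() {
    // (benchmark, foliapy s, libfolia s, folia-rust s), copied from the tables above.
    let rows = [
        ("parse", 60.9, 14.7, 2.6),
        ("select words", 1.46, 0.84, 0.122),
        ("serialise", 77.7, 5.06, 1.14),
    ];
    for &(name, foliapy, libfolia, rust) in rows.iter() {
        println!(
            "{}: {:.1}x vs foliapy, {:.1}x vs libfolia",
            name,
            speedup(foliapy, rust),
            speedup(libfolia, rust)
        );
    }
}
```

For parsing, this works out to roughly 23x faster than foliapy and a little under 6x faster than libfolia. These ratios should be read alongside the note above: folia-rust performs less validation than the other two implementations, so the comparison is not strictly like for like.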