agldt 0.1.2

Tools for handling data conforming to the standards of the Ancient Greek and Latin Dependency Treebank
# agldt

**author:** Caio Geraldes <caio.geraldes@usp.br>

Tools for parsing treebanks from AGLDT

## Basic usage

```rust
use serde_xml_rs::from_str;
use std::fs::read_to_string;
use agldt::parser::*;

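fn main() {
  // Read the raw treebank XML from disk.
  let src = read_to_string("/path/to/agldt/tlg0007.tlg004.perseus-grc1.tb.xml").unwrap();
  // Normalize the XML with `preprocess`, then deserialize it into a `Treebank`.
  let doc = from_str::<Treebank>(&preprocess(&src)).unwrap();

  // Word and token counts for this particular file.
  assert_eq!(doc.count_words(), 9451);
  assert_eq!(doc.count_tokens(), 10709);
}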
```

## Description of parsing stages

### Preprocessing

Pre-processes the source `.xml` code so that the treebank can be deserialized.

There are some oddities in the schema used in AGLDT's XML header and body
that would otherwise make deserializing it into a `struct` quite messy.
This is kind of a bodge, but it should do the trick.

#### Oddities

The main oddity in AGLDT's use of XML occurs inside the tag `<respStmt>`, where the
tag `<persName>` may contain either a single string value or a series of tags:

```xml
<respStmt>
  <persName>Bridget Almas</persName>
  <resp>responsible for the annotation environment and cts:urn technology</resp>
  <address>Tufts University</address>
</respStmt>
<respStmt>
  <persName>
    <short>Vanessa Gorman</short>
    <name>Vanessa Gorman</name>
    <address>vbgorman@gmail.com</address>
    <uri>http://data.perseus.org/sosol/users/Vanessa%20Gorman</uri>
  </persName>
  <resp>annotator of the text</resp>
</respStmt>
```

To resolve this, we apply two regex replacements that move the
`<name>` and `<address>` tags inside `<persName>`.

A handful of other oddities concern the use of the tags `<primary>`,
`<secondary>` and `<annotator>` inside the tag `<sentence>`.
These are also removed by regex in the current version.
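
The removal step described above can be sketched roughly as follows. This is only an illustration, not the crate's code: the crate uses regexes, while this sketch uses plain `std` string searching to stay dependency-free. The function names `strip_elements` and `strip_sentence_extras` are hypothetical, and the sketch assumes that the tags carry no attributes, are not nested, and that the whole element (tags and content) is dropped.

```rust
/// Removes every `<tag>...</tag>` element from `src`.
///
/// Assumptions (not taken from the crate): the opening tag has no
/// attributes and elements with the same name are never nested.
fn strip_elements(src: &str, tag: &str) -> String {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let mut out = String::with_capacity(src.len());
    let mut rest = src;

    while let Some(start) = rest.find(&open) {
        // Keep everything that precedes the opening tag.
        out.push_str(&rest[..start]);
        match rest[start..].find(&close) {
            // Skip the element, including its closing tag.
            Some(end) => rest = &rest[start + end + close.len()..],
            // Unclosed element: leave the remainder untouched.
            None => {
                rest = &rest[start..];
                break;
            }
        }
    }
    out.push_str(rest);
    out
}

/// Hypothetical helper: strips the three extra elements found in `<sentence>`.
fn strip_sentence_extras(src: &str) -> String {
    ["primary", "secondary", "annotator"]
        .iter()
        .fold(src.to_string(), |acc, tag| strip_elements(&acc, tag))
}
```

Calling `strip_sentence_extras` on a sentence leaves its `<word>` children in place and drops only the three listed elements.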

Finally, the `head` value is sometimes an empty string, which I have not yet
managed to deserialize directly. As `0` is not used anywhere else, I replace
empty strings with `"0"`.
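
A minimal sketch of the empty-`head` replacement, for illustration only. The function name `fill_empty_heads` is hypothetical, and the sketch uses a plain string replacement instead of the crate's regex. It assumes the attribute is written with double quotes, no spaces around `=`, and a single space before it.

```rust
/// Rewrites every empty `head` attribute as `head="0"`.
///
/// Assumption (not taken from the crate): the attribute appears exactly as
/// ` head=""`, preceded by a single space. The leading space keeps other
/// attributes whose names merely end in `head` from being touched.
fn fill_empty_heads(src: &str) -> String {
    src.replace(r#" head="""#, r#" head="0""#)
}
```

Words that already have a numeric `head` pass through unchanged, so the function can be applied to the whole file.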

### Serialization

Uses `serde` to deserialize the data. I did my best to keep the metadata
accessible, but some fields are still missing and will be added later.