agldt
author: Caio Geraldes caio.geraldes@usp.br
Tools for parsing treebanks from AGLDT
Basic usage
use from_str;
use read_to_string;
use *;
Description of parsing stages
Preprocessing
Pre-processes the source XML so that the treebank can be deserialized.
The schema used in AGLDT's XML header and body has some oddities
that would otherwise make deserializing it into a struct quite messy.
This is kind of a bodge, but should do the trick.
Oddities
The main oddity in AGLDT's use of XML occurs inside the tag <respStmt>, where the
tag <persName> may contain either a single string value or a series of tags:
Bridget Almas
responsible for the annotation environment and cts:urn technology
Tufts University
Vanessa Gorman
Vanessa Gorman
vbgorman@gmail.com
http://data.perseus.org/sosol/users/Vanessa%20Gorman
annotator of the text
To solve this oddity, we apply two regex replacements so as to move the
<name> and <address> tags inside <persName>.
A handful of other oddities concern the use of the tags <primary>,
<secondary> and <annotator> inside the tag <sentence>.
Those are also removed by the regex in the current version.
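The removal step can be sketched roughly as follows. This is only an illustration: it uses plain string search from the standard library instead of the regex the crate applies, it assumes that the whole element (tags and content) is dropped, and it assumes that the opening tags carry no attributes and are not nested. The function name strip_elements and the sample sentence are hypothetical.

```rust
/// Remove every `<tag>...</tag>` element, content included, from `xml`.
/// Assumes opening tags have no attributes and elements are not nested.
fn strip_elements(xml: &str, tag: &str) -> String {
    let open = format!("<{}>", tag);
    let close = format!("</{}>", tag);
    let mut out = String::with_capacity(xml.len());
    let mut rest = xml;
    while let Some(start) = rest.find(&open) {
        // Search for the closing tag only after the opening tag.
        match rest[start..].find(&close) {
            Some(rel_end) => {
                // Keep the text before the element, skip the element itself.
                out.push_str(&rest[..start]);
                rest = &rest[start + rel_end + close.len()..];
            }
            // Unbalanced tag: stop and keep the remainder untouched.
            None => break,
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    // Hypothetical sentence fragment, not copied from an AGLDT file.
    let sentence =
        "<sentence id=\"1\"><primary>a</primary><annotator>b</annotator><word id=\"1\"/></sentence>";
    let mut cleaned = sentence.to_string();
    for tag in ["primary", "secondary", "annotator"] {
        cleaned = strip_elements(&cleaned, tag);
    }
    println!("{}", cleaned);
}
```

A regex is more compact for this job; the string-search version is used here only so the sketch needs no external crates.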
Finally, the head value is sometimes an empty string, which I have not yet
managed to deserialize directly. As 0 is not used anywhere else, I replace
empty strings with "0".
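A minimal sketch of the empty-head replacement, assuming the attribute is written exactly as head="" and is preceded by a single space. The function name fill_empty_heads and the sample word are hypothetical, and the crate itself may perform this step differently (for example with a regex).

```rust
/// Replace empty `head` attributes with "0" before parsing.
/// The leading space keeps attributes that merely end in "head" from matching.
fn fill_empty_heads(xml: &str) -> String {
    xml.replace(" head=\"\"", " head=\"0\"")
}

fn main() {
    // Hypothetical word element, not copied from an AGLDT file.
    let word = r#"<word id="1" form="μῆνιν" head=""/>"#;
    println!("{}", fill_empty_heads(word));
}
```

Words that already have a head value are left unchanged, so the step is safe to run over the whole file.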
Serialization
Uses serde to deserialize the data. I did my best to keep the metadata
accessible, but some fields are still missing and will be included later.