1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
/// Tools for parsing treebanks from AGLDT
///
/// # Stages
///
/// ## Preprocessing
///
/// Preprocesses the source `.xml` code to allow for serialization of the treebank.
///
/// There are some oddities in the scheme used in AGLDT's `xml` header and body, that otherwise make
/// serializing it to a `struct` quite messy.
/// This is kind of a bodge, but should do the trick.
///
/// ### Oddities
///
/// The main oddity on AGLDT use of `xml` occurs inside the tag `<respStmt>`, where the
/// tag `<persName>` might contain either a single string value or a series of tags:
///
/// ```xml
/// <respStmt>
/// <persName>Bridget Almas</persName>
/// <resp>responsible for the annotation environment and cts:urn technology</resp>
/// <address>Tufts University</address>
/// </respStmt>
/// <respStmt>
/// <persName>
/// <short>Vanessa Gorman</short>
/// <name>Vanessa Gorman</name>
/// <address>vbgorman@gmail.com</address>
/// <uri>http://data.perseus.org/sosol/users/Vanessa%20Gorman</uri>
/// </persName>
/// <resp>annotator of the text</resp>
/// </respStmt>
/// ```
/// To solve this oddity, we apply two regex replacements so as to move the
/// `<name>` and `<address>` tags inside `<persName>`.
///
/// A handful of other oddities concearn the use of the tags `<primary>`, `<secondary>` and
/// `<annotator>` inside the tag `<sentence>`. Those are also removed by the regex in the current
/// version.
///
///
/// Finally, the `head` value is sometimes an empty string, which is still an issue for me to
/// serialize. As `0` is not used anywhere else, I replace empty strings for `"0"`.
///
/// ## Serialization
///
/// Uses [`serde`] for serializing the data. I did my best to keep the metadata acessible, but
/// there are still some missing fields that will later be included.
///
/// Definitions for parsing and building AGLDT postag data.