treebender/
lib.rs

1/*!
2A symbolic natural language parsing library for Rust, inspired by
3[HDPSG](https://en.wikipedia.org/wiki/Head-driven_phrase_structure_grammar).
4
5# What is this?
6This is a library for parsing natural or constructed languages into syntax trees
7and feature structures. There's no machine learning or probabilistic models,
8everything is hand-crafted and deterministic.
9
10You can find out more about the motivations of this project in
11[this blog post](https://vgel.me/posts/symbolic-linguistics-part1/).
12
13## But what are you using it for?
14I'm using this to parse a constructed language for my upcoming xenolinguistics
15game, [Themengi](https://vgel.me/themengi/).
16
17# Motivation
18Using a simple 80-line grammar, introduced in the tutorial below, we can parse
19a simple subset of English, checking reflexive pronoun binding, case, and
20number agreement.
21
22```text
23$ cargo run --bin cli examples/reflexives.fgr
24> she likes himself
25Parsed 0 trees
26
27> her likes herself
28Parsed 0 trees
29
30> she like herself
31Parsed 0 trees
32
33> she likes herself
34Parsed 1 tree
35(0..3: S
36  (0..1: N (0..1: she))
37  (1..2: TV (1..2: likes))
38  (2..3: N (2..3: herself)))
39[
40  child-2: [
41    case: acc
42    pron: ref
43    needs_pron: #0 she
44    num: sg
45    child-0: [ word: herself ]
46  ]
47  child-1: [
48    tense: nonpast
49    child-0: [ word: likes ]
50    num: #1 sg
51  ]
52  child-0: [
53    child-0: [ word: she ]
54    case: nom
55    pron: #0
56    num: #1
57  ]
58]
59```
60
61Low resource language? Low problem! No need to train on gigabytes of text, just
62write a grammar using your brain. Let's hypothesize that in
63American Sign Language, topicalized nouns (expressed with raised eyebrows)
64must appear first in the sentence. We can write a small grammar (18 lines),
65and plug in some sentences:
66
67```text
68$ cargo run --bin cli examples/asl-wordorder.fgr -n
69> boy sit
70Parsed 1 tree
71(0..2: S
72  (0..1: NP ((0..1: N (0..1: boy))))
73  (1..2: IV (1..2: sit)))
74
75> boy throw ball
76Parsed 1 tree
77(0..3: S
78  (0..1: NP ((0..1: N (0..1: boy))))
79  (1..2: TV (1..2: throw))
80  (2..3: NP ((2..3: N (2..3: ball)))))
81
82> ball nm-raised-eyebrows boy throw
83Parsed 1 tree
84(0..4: S
85  (0..2: NP
86    (0..1: N (0..1: ball))
87    (1..2: Topic (1..2: nm-raised-eyebrows)))
88  (2..3: NP ((2..3: N (2..3: boy))))
89  (3..4: TV (3..4: throw)))
90
91> boy throw ball nm-raised-eyebrows
92Parsed 0 trees
93```
94
95# Tutorial
96As an example, let's say we want to build a parser for English reflexive
97pronouns (himself, herself, themselves, themself, itself). We'll also support
98number ("He likes X" v.s. "They like X") and simple embedded clauses
99("He said that they like X").
100
101Grammar files are written in a custom language, similar to BNF, called
102Feature GRammar (.fgr). There's a VSCode syntax highlighting extension for these
103files available as [`fgr-syntax`](https://marketplace.visualstudio.com/items?itemName=vgel.fgr-syntax).
104
105We'll start by defining our lexicon. The lexicon is the set of terminal symbols
106(symbols in the actual input) that the grammar will match. Terminal symbols must
107start with a lowercase letter, and non-terminal symbols must start with an
108uppercase letter.
109
110```fgr
111// pronouns
112N -> he
113N -> him
114N -> himself
115N -> she
116N -> her
117N -> herself
118N -> they
119N -> them
120N -> themselves
121N -> themself
122
123// names, lowercase as they are terminals
124N -> mary
125N -> sue
126N -> takeshi
127N -> robert
128
129// complementizer
130Comp -> that
131
132// verbs -- intransitive, transitive, and clausal
133IV -> falls
134IV -> fall
135IV -> fell
136
137TV -> likes
138TV -> like
139TV -> liked
140
141CV -> says
142CV -> say
143CV -> said
144```
145
146Next, we can add our sentence rules (they must be added at the top, as the first
147rule in the file is assumed to be the top-level rule):
148
149```fgr
150// sentence rules
151S -> N IV
152S -> N TV N
153S -> N CV Comp S
154
155// ... previous lexicon ...
156```
157
158Assuming this file is saved as `examples/no-features.fgr` (which it is :wink:),
159we can test this file with the built-in CLI:
160
161```text
162$ cargo run --bin cli examples/no-features.fgr
163> he falls
164Parsed 1 tree
165(0..2: S
166  (0..1: N (0..1: he))
167  (1..2: IV (1..2: falls)))
168[
169  child-1: [ child-0: [ word: falls ] ]
170  child-0: [ child-0: [ word: he ] ]
171]
172
173> he falls her
174Parsed 0 trees
175
176> he likes her
177Parsed 1 tree
178(0..3: S
179  (0..1: N (0..1: he))
180  (1..2: TV (1..2: likes))
181  (2..3: N (2..3: her)))
182[
183  child-2: [ child-0: [ word: her ] ]
184  child-1: [ child-0: [ word: likes ] ]
185  child-0: [ child-0: [ word: he ] ]
186]
187
188> he likes
189Parsed 0 trees
190
191> he said that he likes her
192Parsed 1 tree
193(0..6: S
194  (0..1: N (0..1: he))
195  (1..2: CV (1..2: said))
196  (2..3: Comp (2..3: that))
197  (3..6: S
198    (3..4: N (3..4: he))
199    (4..5: TV (4..5: likes))
200    (5..6: N (5..6: her))))
201[
202  child-0: [ child-0: [ word: he ] ]
203  child-2: [ child-0: [ word: that ] ]
204  child-1: [ child-0: [ word: said ] ]
205  child-3: [
206    child-2: [ child-0: [ word: her ] ]
207    child-1: [ child-0: [ word: likes ] ]
208    child-0: [ child-0: [ word: he ] ]
209  ]
210]
211
212> he said that he
213Parsed 0 trees
214```
215
216This grammar already parses some correct sentences, and blocks some trivially
217incorrect ones. However, it doesn't care about number, case, or reflexives
218right now:
219
220```text
221> she likes himself  // unbound reflexive pronoun
222Parsed 1 tree
223(0..3: S
224  (0..1: N (0..1: she))
225  (1..2: TV (1..2: likes))
226  (2..3: N (2..3: himself)))
227[
228  child-0: [ child-0: [ word: she ] ]
229  child-2: [ child-0: [ word: himself ] ]
230  child-1: [ child-0: [ word: likes ] ]
231]
232
233> him like her  // incorrect case on the subject pronoun, should be nominative
234                // (he) instead of accusative (him)
235Parsed 1 tree
236(0..3: S
237  (0..1: N (0..1: him))
238  (1..2: TV (1..2: like))
239  (2..3: N (2..3: her)))
240[
241  child-0: [ child-0: [ word: him ] ]
242  child-1: [ child-0: [ word: like ] ]
243  child-2: [ child-0: [ word: her ] ]
244]
245
246> he like her  // incorrect verb number agreement
247Parsed 1 tree
248(0..3: S
249  (0..1: N (0..1: he))
250  (1..2: TV (1..2: like))
251  (2..3: N (2..3: her)))
252[
253  child-2: [ child-0: [ word: her ] ]
254  child-1: [ child-0: [ word: like ] ]
255  child-0: [ child-0: [ word: he ] ]
256]
257```
258
259To fix this, we need to add *features* to our lexicon, and restrict the sentence
260rules based on features.
261
262Features are added with square brackets, and are key: value pairs separated by
263commas. `**top**` is a special feature value, which basically means
264"unspecified" -- we'll come back to it later. Features that are unspecified are
265also assumed to have a `**top**` value, but sometimes explicitly stating top is
266more clear.
267
268```fgr
269/// Pronouns
270// The added features are:
271// * num: sg or pl, whether this noun wants a singular verb (likes) or
272//   a plural verb (like). note this is grammatical number, so for example
273//   singular they takes plural agreement ("they like X", not *"they likes X")
274// * case: nom or acc, whether this noun is nominative or accusative case.
275//   nominative case goes in the subject, and accusative in the object.
276//   e.g., "he fell" and "she likes him", not *"him fell" and *"her likes he"
277// * pron: he, she, they, or ref -- what type of pronoun this is
278// * needs_pron: whether this is a reflexive that needs to bind to another
279//   pronoun.
280N[ num: sg, case: nom, pron: he ]                    -> he
281N[ num: sg, case: acc, pron: he ]                    -> him
282N[ num: sg, case: acc, pron: ref, needs_pron: he ]   -> himself
283N[ num: sg, case: nom, pron: she ]                   -> she
284N[ num: sg, case: acc, pron: she ]                   -> her
285N[ num: sg, case: acc, pron: ref, needs_pron: she]   -> herself
286N[ num: pl, case: nom, pron: they ]                  -> they
287N[ num: pl, case: acc, pron: they ]                  -> them
288N[ num: pl, case: acc, pron: ref, needs_pron: they ] -> themselves
289N[ num: sg, case: acc, pron: ref, needs_pron: they ] -> themself
290
291// Names
292// The added features are:
293// * num: sg, as people are singular ("mary likes her" / *"mary like her")
294// * case: **top**, as names can be both subjects and objects
295//   ("mary likes her" / "she likes mary")
296// * pron: whichever pronoun the person uses for reflexive agreement
297//   mary    pron: she  => mary likes herself
298//   sue     pron: they => sue likes themself
299//   takeshi pron: he   => takeshi likes himself
300N[ num: sg, case: **top**, pron: she ]  -> mary
301N[ num: sg, case: **top**, pron: they ] -> sue
302N[ num: sg, case: **top**, pron: he ]   -> takeshi
303N[ num: sg, case: **top**, pron: he ]   -> robert
304
305// Complementizer doesn't need features
306Comp -> that
307
308// Verbs -- intransitive, transitive, and clausal
309// The added features are:
310// * num: sg, pl, or **top** -- to match the noun numbers.
311//   **top** will match either sg or pl, as past-tense verbs in English
312//   don't agree in number: "he fell" and "they fell" are both fine
313// * tense: past or nonpast -- this won't be used for agreement, but will be
314//   copied into the final feature structure, and the client code could do
315//   something with it
316IV[ num:      sg, tense: nonpast ] -> falls
317IV[ num:      pl, tense: nonpast ] -> fall
318IV[ num: **top**, tense: past ]    -> fell
319
320TV[ num:      sg, tense: nonpast ] -> likes
321TV[ num:      pl, tense: nonpast ] -> like
322TV[ num: **top**, tense: past ]    -> liked
323
324CV[ num:      sg, tense: nonpast ] -> says
325CV[ num:      pl, tense: nonpast ] -> say
326CV[ num: **top**, tense: past ]    -> said
327```
328
329Now that our lexicon is updated with features, we can update our sentence rules
330to constrain parsing based on those features. This uses two new features,
331tags and unification. Tags allow features to be associated between nodes in a
332rule, and unification controls how those features are compatible. The rules for
333unification are:
334
3351. A string feature can unify with a string feature with the same value
3362. A **top** feature can unify with anything, and the nodes are merged
3373. A complex feature ([ ... ] structure) is recursively unified with another
338   complex feature.
339
340If unification fails anywhere, the parse is aborted and the tree is discarded.
341This allows the programmer to discard trees if features don't match.
342
343```fgr
344// Sentence rules
345// Intransitive verb:
346// * Subject must be nominative case
347// * Subject and verb must agree in number (copied through #1)
348S -> N[ case: nom, num: #1 ] IV[ num: #1 ]
349// Transitive verb:
350// * Subject must be nominative case
351// * Subject and verb must agree in number (copied through #2)
352// * If there's a reflexive in the object position, make sure its `needs_pron`
353//   feature matches the subject's `pron` feature. If the object isn't a
354//   reflexive, then its `needs_pron` feature will implicitly be `**top**`, so
355//   will unify with anything.
356S -> N[ case: nom, pron: #1, num: #2 ] TV[ num: #2 ] N[ case: acc, needs_pron: #1 ]
357// Clausal verb:
358// * Subject must be nominative case
359// * Subject and verb must agree in number (copied through #1)
360// * Reflexives can't cross clause boundaries (*"He said that she likes himself"),
361//   so we can ignore reflexives and delegate to inner clause rule
362S -> N[ case: nom, num: #1 ] CV[ num: #1 ] Comp S
363```
364
365Now that we have this augmented grammar (available as `examples/reflexives.fgr`),
366we can try it out and see that it rejects illicit sentences that were previously
367accepted, while still accepting valid ones:
368
369```text
370> he fell
371Parsed 1 tree
372(0..2: S
373  (0..1: N (0..1: he))
374  (1..2: IV (1..2: fell)))
375[
376  child-1: [
377    child-0: [ word: fell ]
378    num: #0 sg
379    tense: past
380  ]
381  child-0: [
382    pron: he
383    case: nom
384    num: #0
385    child-0: [ word: he ]
386  ]
387]
388
389> he like him
390Parsed 0 trees
391
392> he likes himself
393Parsed 1 tree
394(0..3: S
395  (0..1: N (0..1: he))
396  (1..2: TV (1..2: likes))
397  (2..3: N (2..3: himself)))
398[
399  child-1: [
400    num: #0 sg
401    child-0: [ word: likes ]
402    tense: nonpast
403  ]
404  child-2: [
405    needs_pron: #1 he
406    num: sg
407    child-0: [ word: himself ]
408    pron: ref
409    case: acc
410  ]
411  child-0: [
412    child-0: [ word: he ]
413    pron: #1
414    num: #0
415    case: nom
416  ]
417]
418
419> he likes herself
420Parsed 0 trees
421
422> mary likes herself
423Parsed 1 tree
424(0..3: S
425  (0..1: N (0..1: mary))
426  (1..2: TV (1..2: likes))
427  (2..3: N (2..3: herself)))
428[
429  child-0: [
430    pron: #0 she
431    num: #1 sg
432    case: nom
433    child-0: [ word: mary ]
434  ]
435  child-1: [
436    tense: nonpast
437    child-0: [ word: likes ]
438    num: #1
439  ]
440  child-2: [
441    child-0: [ word: herself ]
442    num: sg
443    pron: ref
444    case: acc
445    needs_pron: #0
446  ]
447]
448
449> mary likes themself
450Parsed 0 trees
451
452> sue likes themself
453Parsed 1 tree
454(0..3: S
455  (0..1: N (0..1: sue))
456  (1..2: TV (1..2: likes))
457  (2..3: N (2..3: themself)))
458[
459  child-0: [
460    pron: #0 they
461    child-0: [ word: sue ]
462    case: nom
463    num: #1 sg
464  ]
465  child-1: [
466    tense: nonpast
467    num: #1
468    child-0: [ word: likes ]
469  ]
470  child-2: [
471    needs_pron: #0
472    case: acc
473    pron: ref
474    child-0: [ word: themself ]
475    num: sg
476  ]
477]
478
479> sue likes himself
480Parsed 0 trees
481```
482
483If this is interesting to you and you want to learn more, you can check out
484[my blog series](https://vgel.me/posts/symbolic-linguistics-part1/),
485the excellent textbook [Syntactic Theory: A Formal Introduction (2nd ed.)](https://web.stanford.edu/group/cslipublications/cslipublications/site/1575864002.shtml),
486and the [DELPH-IN project](http://www.delph-in.net/wiki/index.php/Home), whose
487work on the LKB inspired this simplified version.
488
489# Using from code
490I need to write this section in more detail, but if you're comfortable with Rust,
491I suggest looking through the codebase. It's not perfect, it started as one of
492my first Rust projects (after migrating through F# -> TypeScript -> C in search
493of the right performance/ergonomics tradeoff), and it could use more tests,
494but overall it's not too bad.
495
496Basically, the processing pipeline is:
497
4981. Make a `Grammar` struct
499  * `Grammar` is defined in `rules.rs`.
500  * The easiest way to make a `Grammar` is `Grammar::parse_from_file`, which is
501    mostly a hand-written recusive descent parser in `parse_grammar.rs`. Yes,
502    I recognize the irony here.
5032. It takes input (in `Grammar::parse`, which does everything for you, or
504   `Grammar::parse_chart`, which just does the chart)
5053. The input is first chart-parsed in `earley.rs`
5064. Then, a forest is built from the chart, in `forest.rs`, using an algorithm
507    I found in a very useful blog series I forget the URL for, because the
508    algorithms in the academic literature for this are... weird.
5095. Finally, the feature unification is used to prune the forest down to only
510   valid trees. It would be more efficient to do this during parsing, but meh.
511
512The most interesting thing you can do via code and not via the CLI is probably
513getting at the raw feature DAG, as that would let you do things like pronoun
514coreference. The DAG code is in `featurestructure.rs`, and should be fairly
515approachable -- there's a lot of Rust ceremony around `Rc<RefCell<...>>`
516because using an arena allocation crate seemed ~~too har~~like overkill, but
517that is somewhat mitigated by the `NodeRef` type alias. Hit me up at
518https://vgel.me/contact if you need help with anything here!
519*/
520
521#[macro_use]
522extern crate lazy_static;
523
524pub mod earley;
525pub mod featurestructure;
526pub mod forest;
527pub mod parse_grammar;
528pub mod rules;
529pub mod syntree;
530pub mod utils;
531
532use std::fs;
533use std::path;
534use std::rc::Rc;
535
536pub use crate::earley::{parse_chart, Chart};
537pub use crate::featurestructure::NodeRef;
538pub use crate::forest::Forest;
539pub use crate::rules::{Grammar, Rule};
540pub use crate::syntree::{Constituent, SynTree};
541pub use crate::utils::Err;
542
543impl Grammar {
544  pub fn parse_chart(&self, input: &[&str]) -> Chart {
545    parse_chart(&self, input)
546  }
547
548  pub fn parse_forest(&self, input: &[&str]) -> Forest {
549    Forest::from(self.parse_chart(input))
550  }
551
552  pub fn unify_tree(
553    tree: SynTree<Rc<Rule>, String>,
554  ) -> Result<(SynTree<String, String>, NodeRef), Err> {
555    match tree {
556      SynTree::Leaf(w) => Ok((SynTree::Leaf(w), NodeRef::new_top())),
557      SynTree::Branch(cons, children) => {
558        let features = cons.value.features.deep_clone();
559
560        let mut bare_children = Vec::with_capacity(children.len());
561        for (idx, child) in children.into_iter().enumerate() {
562          let (child_tree, child_features) = Self::unify_tree(child)?;
563          bare_children.push(child_tree);
564
565          let to_unify = NodeRef::new_with_edges(vec![(format!("child-{}", idx), child_features)])?;
566          NodeRef::unify(features.clone(), to_unify)?;
567        }
568
569        let bare_self = SynTree::Branch(
570          Constituent {
571            span: cons.span,
572            value: cons.value.symbol.clone(),
573          },
574          bare_children,
575        );
576
577        Ok((bare_self, features))
578      }
579    }
580  }
581
582  pub fn parse(&self, input: &[&str]) -> Vec<(SynTree<String, String>, NodeRef)> {
583    let forest = self.parse_forest(input);
584    let trees = forest.trees(&self);
585    trees
586      .into_iter()
587      .filter_map(|t| Self::unify_tree(t).map(Some).unwrap_or(None))
588      .collect::<Vec<_>>()
589  }
590
591  pub fn read_from_file<P: AsRef<path::Path>>(path: P) -> Result<Self, Err> {
592    fs::read_to_string(path)?.parse()
593  }
594}
595
596#[test]
597fn test_unification_blocking() {
598  let g: Grammar = r#"
599    S -> N[ case: nom, pron: #1 ] TV N[ case: acc, needs_pron: #1 ]
600    TV -> likes
601    N[ case: nom, pron: she ] -> she
602    N[ case: nom, pron: he ] -> he
603    N[ case: acc, pron: he ] -> him
604    N[ case: acc, pron: ref, needs_pron: he ] -> himself
605  "#
606  .parse()
607  .unwrap();
608
609  assert_eq!(g.parse(&["he", "likes", "himself"]).len(), 1);
610  assert_eq!(g.parse(&["he", "likes", "him"]).len(), 1);
611  assert_eq!(g.parse(&["she", "likes", "him"]).len(), 1);
612
613  assert_eq!(g.parse(&["himself", "likes", "himself"]).len(), 0);
614  assert_eq!(g.parse(&["she", "likes", "himself"]).len(), 0);
615  assert_eq!(g.parse(&["himself", "likes", "him"]).len(), 0);
616}
treebender/lib.rs

treebender/
lib.rs