treebender/lib.rs
1/*!
2A symbolic natural language parsing library for Rust, inspired by
3[HDPSG](https://en.wikipedia.org/wiki/Head-driven_phrase_structure_grammar).
4
5# What is this?
6This is a library for parsing natural or constructed languages into syntax trees
7and feature structures. There's no machine learning or probabilistic models,
8everything is hand-crafted and deterministic.
9
10You can find out more about the motivations of this project in
11[this blog post](https://vgel.me/posts/symbolic-linguistics-part1/).
12
13## But what are you using it for?
14I'm using this to parse a constructed language for my upcoming xenolinguistics
15game, [Themengi](https://vgel.me/themengi/).
16
17# Motivation
18Using a simple 80-line grammar, introduced in the tutorial below, we can parse
19a simple subset of English, checking reflexive pronoun binding, case, and
20number agreement.
21
22```text
23$ cargo run --bin cli examples/reflexives.fgr
24> she likes himself
25Parsed 0 trees
26
27> her likes herself
28Parsed 0 trees
29
30> she like herself
31Parsed 0 trees
32
33> she likes herself
34Parsed 1 tree
35(0..3: S
36 (0..1: N (0..1: she))
37 (1..2: TV (1..2: likes))
38 (2..3: N (2..3: herself)))
39[
40 child-2: [
41 case: acc
42 pron: ref
43 needs_pron: #0 she
44 num: sg
45 child-0: [ word: herself ]
46 ]
47 child-1: [
48 tense: nonpast
49 child-0: [ word: likes ]
50 num: #1 sg
51 ]
52 child-0: [
53 child-0: [ word: she ]
54 case: nom
55 pron: #0
56 num: #1
57 ]
58]
59```
60
61Low resource language? Low problem! No need to train on gigabytes of text, just
62write a grammar using your brain. Let's hypothesize that in
63American Sign Language, topicalized nouns (expressed with raised eyebrows)
64must appear first in the sentence. We can write a small grammar (18 lines),
65and plug in some sentences:
66
67```text
68$ cargo run --bin cli examples/asl-wordorder.fgr -n
69> boy sit
70Parsed 1 tree
71(0..2: S
72 (0..1: NP ((0..1: N (0..1: boy))))
73 (1..2: IV (1..2: sit)))
74
75> boy throw ball
76Parsed 1 tree
77(0..3: S
78 (0..1: NP ((0..1: N (0..1: boy))))
79 (1..2: TV (1..2: throw))
80 (2..3: NP ((2..3: N (2..3: ball)))))
81
82> ball nm-raised-eyebrows boy throw
83Parsed 1 tree
84(0..4: S
85 (0..2: NP
86 (0..1: N (0..1: ball))
87 (1..2: Topic (1..2: nm-raised-eyebrows)))
88 (2..3: NP ((2..3: N (2..3: boy))))
89 (3..4: TV (3..4: throw)))
90
91> boy throw ball nm-raised-eyebrows
92Parsed 0 trees
93```
94
95# Tutorial
96As an example, let's say we want to build a parser for English reflexive
97pronouns (himself, herself, themselves, themself, itself). We'll also support
98number ("He likes X" v.s. "They like X") and simple embedded clauses
99("He said that they like X").
100
101Grammar files are written in a custom language, similar to BNF, called
102Feature GRammar (.fgr). There's a VSCode syntax highlighting extension for these
103files available as [`fgr-syntax`](https://marketplace.visualstudio.com/items?itemName=vgel.fgr-syntax).
104
105We'll start by defining our lexicon. The lexicon is the set of terminal symbols
106(symbols in the actual input) that the grammar will match. Terminal symbols must
107start with a lowercase letter, and non-terminal symbols must start with an
108uppercase letter.
109
110```fgr
111// pronouns
112N -> he
113N -> him
114N -> himself
115N -> she
116N -> her
117N -> herself
118N -> they
119N -> them
120N -> themselves
121N -> themself
122
123// names, lowercase as they are terminals
124N -> mary
125N -> sue
126N -> takeshi
127N -> robert
128
129// complementizer
130Comp -> that
131
132// verbs -- intransitive, transitive, and clausal
133IV -> falls
134IV -> fall
135IV -> fell
136
137TV -> likes
138TV -> like
139TV -> liked
140
141CV -> says
142CV -> say
143CV -> said
144```
145
146Next, we can add our sentence rules (they must be added at the top, as the first
147rule in the file is assumed to be the top-level rule):
148
149```fgr
150// sentence rules
151S -> N IV
152S -> N TV N
153S -> N CV Comp S
154
155// ... previous lexicon ...
156```
157
158Assuming this file is saved as `examples/no-features.fgr` (which it is :wink:),
159we can test this file with the built-in CLI:
160
161```text
162$ cargo run --bin cli examples/no-features.fgr
163> he falls
164Parsed 1 tree
165(0..2: S
166 (0..1: N (0..1: he))
167 (1..2: IV (1..2: falls)))
168[
169 child-1: [ child-0: [ word: falls ] ]
170 child-0: [ child-0: [ word: he ] ]
171]
172
173> he falls her
174Parsed 0 trees
175
176> he likes her
177Parsed 1 tree
178(0..3: S
179 (0..1: N (0..1: he))
180 (1..2: TV (1..2: likes))
181 (2..3: N (2..3: her)))
182[
183 child-2: [ child-0: [ word: her ] ]
184 child-1: [ child-0: [ word: likes ] ]
185 child-0: [ child-0: [ word: he ] ]
186]
187
188> he likes
189Parsed 0 trees
190
191> he said that he likes her
192Parsed 1 tree
193(0..6: S
194 (0..1: N (0..1: he))
195 (1..2: CV (1..2: said))
196 (2..3: Comp (2..3: that))
197 (3..6: S
198 (3..4: N (3..4: he))
199 (4..5: TV (4..5: likes))
200 (5..6: N (5..6: her))))
201[
202 child-0: [ child-0: [ word: he ] ]
203 child-2: [ child-0: [ word: that ] ]
204 child-1: [ child-0: [ word: said ] ]
205 child-3: [
206 child-2: [ child-0: [ word: her ] ]
207 child-1: [ child-0: [ word: likes ] ]
208 child-0: [ child-0: [ word: he ] ]
209 ]
210]
211
212> he said that he
213Parsed 0 trees
214```
215
216This grammar already parses some correct sentences, and blocks some trivially
217incorrect ones. However, it doesn't care about number, case, or reflexives
218right now:
219
220```text
221> she likes himself // unbound reflexive pronoun
222Parsed 1 tree
223(0..3: S
224 (0..1: N (0..1: she))
225 (1..2: TV (1..2: likes))
226 (2..3: N (2..3: himself)))
227[
228 child-0: [ child-0: [ word: she ] ]
229 child-2: [ child-0: [ word: himself ] ]
230 child-1: [ child-0: [ word: likes ] ]
231]
232
233> him like her // incorrect case on the subject pronoun, should be nominative
234 // (he) instead of accusative (him)
235Parsed 1 tree
236(0..3: S
237 (0..1: N (0..1: him))
238 (1..2: TV (1..2: like))
239 (2..3: N (2..3: her)))
240[
241 child-0: [ child-0: [ word: him ] ]
242 child-1: [ child-0: [ word: like ] ]
243 child-2: [ child-0: [ word: her ] ]
244]
245
246> he like her // incorrect verb number agreement
247Parsed 1 tree
248(0..3: S
249 (0..1: N (0..1: he))
250 (1..2: TV (1..2: like))
251 (2..3: N (2..3: her)))
252[
253 child-2: [ child-0: [ word: her ] ]
254 child-1: [ child-0: [ word: like ] ]
255 child-0: [ child-0: [ word: he ] ]
256]
257```
258
259To fix this, we need to add *features* to our lexicon, and restrict the sentence
260rules based on features.
261
262Features are added with square brackets, and are key: value pairs separated by
263commas. `**top**` is a special feature value, which basically means
264"unspecified" -- we'll come back to it later. Features that are unspecified are
265also assumed to have a `**top**` value, but sometimes explicitly stating top is
266more clear.
267
268```fgr
269/// Pronouns
270// The added features are:
271// * num: sg or pl, whether this noun wants a singular verb (likes) or
272// a plural verb (like). note this is grammatical number, so for example
273// singular they takes plural agreement ("they like X", not *"they likes X")
274// * case: nom or acc, whether this noun is nominative or accusative case.
275// nominative case goes in the subject, and accusative in the object.
276// e.g., "he fell" and "she likes him", not *"him fell" and *"her likes he"
277// * pron: he, she, they, or ref -- what type of pronoun this is
278// * needs_pron: whether this is a reflexive that needs to bind to another
279// pronoun.
280N[ num: sg, case: nom, pron: he ] -> he
281N[ num: sg, case: acc, pron: he ] -> him
282N[ num: sg, case: acc, pron: ref, needs_pron: he ] -> himself
283N[ num: sg, case: nom, pron: she ] -> she
284N[ num: sg, case: acc, pron: she ] -> her
285N[ num: sg, case: acc, pron: ref, needs_pron: she] -> herself
286N[ num: pl, case: nom, pron: they ] -> they
287N[ num: pl, case: acc, pron: they ] -> them
288N[ num: pl, case: acc, pron: ref, needs_pron: they ] -> themselves
289N[ num: sg, case: acc, pron: ref, needs_pron: they ] -> themself
290
291// Names
292// The added features are:
293// * num: sg, as people are singular ("mary likes her" / *"mary like her")
294// * case: **top**, as names can be both subjects and objects
295// ("mary likes her" / "she likes mary")
296// * pron: whichever pronoun the person uses for reflexive agreement
297// mary pron: she => mary likes herself
298// sue pron: they => sue likes themself
299// takeshi pron: he => takeshi likes himself
300N[ num: sg, case: **top**, pron: she ] -> mary
301N[ num: sg, case: **top**, pron: they ] -> sue
302N[ num: sg, case: **top**, pron: he ] -> takeshi
303N[ num: sg, case: **top**, pron: he ] -> robert
304
305// Complementizer doesn't need features
306Comp -> that
307
308// Verbs -- intransitive, transitive, and clausal
309// The added features are:
310// * num: sg, pl, or **top** -- to match the noun numbers.
311// **top** will match either sg or pl, as past-tense verbs in English
312// don't agree in number: "he fell" and "they fell" are both fine
313// * tense: past or nonpast -- this won't be used for agreement, but will be
314// copied into the final feature structure, and the client code could do
315// something with it
316IV[ num: sg, tense: nonpast ] -> falls
317IV[ num: pl, tense: nonpast ] -> fall
318IV[ num: **top**, tense: past ] -> fell
319
320TV[ num: sg, tense: nonpast ] -> likes
321TV[ num: pl, tense: nonpast ] -> like
322TV[ num: **top**, tense: past ] -> liked
323
324CV[ num: sg, tense: nonpast ] -> says
325CV[ num: pl, tense: nonpast ] -> say
326CV[ num: **top**, tense: past ] -> said
327```
328
329Now that our lexicon is updated with features, we can update our sentence rules
330to constrain parsing based on those features. This uses two new features,
331tags and unification. Tags allow features to be associated between nodes in a
332rule, and unification controls how those features are compatible. The rules for
333unification are:
334
3351. A string feature can unify with a string feature with the same value
3362. A **top** feature can unify with anything, and the nodes are merged
3373. A complex feature ([ ... ] structure) is recursively unified with another
338 complex feature.
339
340If unification fails anywhere, the parse is aborted and the tree is discarded.
341This allows the programmer to discard trees if features don't match.
342
343```fgr
344// Sentence rules
345// Intransitive verb:
346// * Subject must be nominative case
347// * Subject and verb must agree in number (copied through #1)
348S -> N[ case: nom, num: #1 ] IV[ num: #1 ]
349// Transitive verb:
350// * Subject must be nominative case
351// * Subject and verb must agree in number (copied through #2)
352// * If there's a reflexive in the object position, make sure its `needs_pron`
353// feature matches the subject's `pron` feature. If the object isn't a
354// reflexive, then its `needs_pron` feature will implicitly be `**top**`, so
355// will unify with anything.
356S -> N[ case: nom, pron: #1, num: #2 ] TV[ num: #2 ] N[ case: acc, needs_pron: #1 ]
357// Clausal verb:
358// * Subject must be nominative case
359// * Subject and verb must agree in number (copied through #1)
360// * Reflexives can't cross clause boundaries (*"He said that she likes himself"),
361// so we can ignore reflexives and delegate to inner clause rule
362S -> N[ case: nom, num: #1 ] CV[ num: #1 ] Comp S
363```
364
365Now that we have this augmented grammar (available as `examples/reflexives.fgr`),
366we can try it out and see that it rejects illicit sentences that were previously
367accepted, while still accepting valid ones:
368
369```text
370> he fell
371Parsed 1 tree
372(0..2: S
373 (0..1: N (0..1: he))
374 (1..2: IV (1..2: fell)))
375[
376 child-1: [
377 child-0: [ word: fell ]
378 num: #0 sg
379 tense: past
380 ]
381 child-0: [
382 pron: he
383 case: nom
384 num: #0
385 child-0: [ word: he ]
386 ]
387]
388
389> he like him
390Parsed 0 trees
391
392> he likes himself
393Parsed 1 tree
394(0..3: S
395 (0..1: N (0..1: he))
396 (1..2: TV (1..2: likes))
397 (2..3: N (2..3: himself)))
398[
399 child-1: [
400 num: #0 sg
401 child-0: [ word: likes ]
402 tense: nonpast
403 ]
404 child-2: [
405 needs_pron: #1 he
406 num: sg
407 child-0: [ word: himself ]
408 pron: ref
409 case: acc
410 ]
411 child-0: [
412 child-0: [ word: he ]
413 pron: #1
414 num: #0
415 case: nom
416 ]
417]
418
419> he likes herself
420Parsed 0 trees
421
422> mary likes herself
423Parsed 1 tree
424(0..3: S
425 (0..1: N (0..1: mary))
426 (1..2: TV (1..2: likes))
427 (2..3: N (2..3: herself)))
428[
429 child-0: [
430 pron: #0 she
431 num: #1 sg
432 case: nom
433 child-0: [ word: mary ]
434 ]
435 child-1: [
436 tense: nonpast
437 child-0: [ word: likes ]
438 num: #1
439 ]
440 child-2: [
441 child-0: [ word: herself ]
442 num: sg
443 pron: ref
444 case: acc
445 needs_pron: #0
446 ]
447]
448
449> mary likes themself
450Parsed 0 trees
451
452> sue likes themself
453Parsed 1 tree
454(0..3: S
455 (0..1: N (0..1: sue))
456 (1..2: TV (1..2: likes))
457 (2..3: N (2..3: themself)))
458[
459 child-0: [
460 pron: #0 they
461 child-0: [ word: sue ]
462 case: nom
463 num: #1 sg
464 ]
465 child-1: [
466 tense: nonpast
467 num: #1
468 child-0: [ word: likes ]
469 ]
470 child-2: [
471 needs_pron: #0
472 case: acc
473 pron: ref
474 child-0: [ word: themself ]
475 num: sg
476 ]
477]
478
479> sue likes himself
480Parsed 0 trees
481```
482
483If this is interesting to you and you want to learn more, you can check out
484[my blog series](https://vgel.me/posts/symbolic-linguistics-part1/),
485the excellent textbook [Syntactic Theory: A Formal Introduction (2nd ed.)](https://web.stanford.edu/group/cslipublications/cslipublications/site/1575864002.shtml),
486and the [DELPH-IN project](http://www.delph-in.net/wiki/index.php/Home), whose
487work on the LKB inspired this simplified version.
488
489# Using from code
490I need to write this section in more detail, but if you're comfortable with Rust,
491I suggest looking through the codebase. It's not perfect, it started as one of
492my first Rust projects (after migrating through F# -> TypeScript -> C in search
493of the right performance/ergonomics tradeoff), and it could use more tests,
494but overall it's not too bad.
495
496Basically, the processing pipeline is:
497
4981. Make a `Grammar` struct
499 * `Grammar` is defined in `rules.rs`.
500 * The easiest way to make a `Grammar` is `Grammar::parse_from_file`, which is
501 mostly a hand-written recusive descent parser in `parse_grammar.rs`. Yes,
502 I recognize the irony here.
5032. It takes input (in `Grammar::parse`, which does everything for you, or
504 `Grammar::parse_chart`, which just does the chart)
5053. The input is first chart-parsed in `earley.rs`
5064. Then, a forest is built from the chart, in `forest.rs`, using an algorithm
507 I found in a very useful blog series I forget the URL for, because the
508 algorithms in the academic literature for this are... weird.
5095. Finally, the feature unification is used to prune the forest down to only
510 valid trees. It would be more efficient to do this during parsing, but meh.
511
512The most interesting thing you can do via code and not via the CLI is probably
513getting at the raw feature DAG, as that would let you do things like pronoun
514coreference. The DAG code is in `featurestructure.rs`, and should be fairly
515approachable -- there's a lot of Rust ceremony around `Rc<RefCell<...>>`
516because using an arena allocation crate seemed ~~too har~~like overkill, but
517that is somewhat mitigated by the `NodeRef` type alias. Hit me up at
518https://vgel.me/contact if you need help with anything here!
519*/
520
521#[macro_use]
522extern crate lazy_static;
523
524pub mod earley;
525pub mod featurestructure;
526pub mod forest;
527pub mod parse_grammar;
528pub mod rules;
529pub mod syntree;
530pub mod utils;
531
532use std::fs;
533use std::path;
534use std::rc::Rc;
535
536pub use crate::earley::{parse_chart, Chart};
537pub use crate::featurestructure::NodeRef;
538pub use crate::forest::Forest;
539pub use crate::rules::{Grammar, Rule};
540pub use crate::syntree::{Constituent, SynTree};
541pub use crate::utils::Err;
542
543impl Grammar {
544 pub fn parse_chart(&self, input: &[&str]) -> Chart {
545 parse_chart(&self, input)
546 }
547
548 pub fn parse_forest(&self, input: &[&str]) -> Forest {
549 Forest::from(self.parse_chart(input))
550 }
551
552 pub fn unify_tree(
553 tree: SynTree<Rc<Rule>, String>,
554 ) -> Result<(SynTree<String, String>, NodeRef), Err> {
555 match tree {
556 SynTree::Leaf(w) => Ok((SynTree::Leaf(w), NodeRef::new_top())),
557 SynTree::Branch(cons, children) => {
558 let features = cons.value.features.deep_clone();
559
560 let mut bare_children = Vec::with_capacity(children.len());
561 for (idx, child) in children.into_iter().enumerate() {
562 let (child_tree, child_features) = Self::unify_tree(child)?;
563 bare_children.push(child_tree);
564
565 let to_unify = NodeRef::new_with_edges(vec![(format!("child-{}", idx), child_features)])?;
566 NodeRef::unify(features.clone(), to_unify)?;
567 }
568
569 let bare_self = SynTree::Branch(
570 Constituent {
571 span: cons.span,
572 value: cons.value.symbol.clone(),
573 },
574 bare_children,
575 );
576
577 Ok((bare_self, features))
578 }
579 }
580 }
581
582 pub fn parse(&self, input: &[&str]) -> Vec<(SynTree<String, String>, NodeRef)> {
583 let forest = self.parse_forest(input);
584 let trees = forest.trees(&self);
585 trees
586 .into_iter()
587 .filter_map(|t| Self::unify_tree(t).map(Some).unwrap_or(None))
588 .collect::<Vec<_>>()
589 }
590
591 pub fn read_from_file<P: AsRef<path::Path>>(path: P) -> Result<Self, Err> {
592 fs::read_to_string(path)?.parse()
593 }
594}
595
596#[test]
597fn test_unification_blocking() {
598 let g: Grammar = r#"
599 S -> N[ case: nom, pron: #1 ] TV N[ case: acc, needs_pron: #1 ]
600 TV -> likes
601 N[ case: nom, pron: she ] -> she
602 N[ case: nom, pron: he ] -> he
603 N[ case: acc, pron: he ] -> him
604 N[ case: acc, pron: ref, needs_pron: he ] -> himself
605 "#
606 .parse()
607 .unwrap();
608
609 assert_eq!(g.parse(&["he", "likes", "himself"]).len(), 1);
610 assert_eq!(g.parse(&["he", "likes", "him"]).len(), 1);
611 assert_eq!(g.parse(&["she", "likes", "him"]).len(), 1);
612
613 assert_eq!(g.parse(&["himself", "likes", "himself"]).len(), 0);
614 assert_eq!(g.parse(&["she", "likes", "himself"]).len(), 0);
615 assert_eq!(g.parse(&["himself", "likes", "him"]).len(), 0);
616}