Opinionated parser for Congress.gov's legislation text search query syntax.
Congress.gov's query parser is fairly permissive and allows some queries
whose semantics are unclear. This library is opinionated in that it rules
out some queries that will both parse and run on Congress.gov. For example,
double negatives, nested MUST and SHOULD queries, and MUST/SHOULD groups
inside proximity queries will "work" on Congress.gov, but it's not clear
what such queries are supposed to mean. Additionally, NOT queries inside
MUST/SHOULD groups will parse and run, but the Congress.gov parser appears
to ignore or remove the ! in those cases, so this library only allows
negating terms at the top level.
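The restrictions above can be sketched as a validation pass over a parsed query. This is only an illustration, not the crate's actual API: the `Node`, `Rejection`, and `Context` types and the `validate` function are hypothetical names invented for this example, and the sketch assumes negation applies to single terms only.

```rust
// Hypothetical AST for illustration; the crate's real types may differ.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Term(String),
    Not(Box<Node>),
    Must(Vec<Node>),
    Should(Vec<Node>),
    Prox(Vec<Node>),
}

// Reasons a query that Congress.gov accepts would be rejected here.
#[derive(Debug, PartialEq)]
enum Rejection {
    NestedNot,      // ! inside a MUST/SHOULD group or a proximity query
    DoubleNegative, // !!a
    NegatedNonTerm, // ! applied to something other than a term
    NestedBoolean,  // MUST/SHOULD inside MUST/SHOULD
    BooleanInProx,  // MUST/SHOULD inside a proximity query
}

// Where in the query the node being checked sits.
#[derive(Clone, Copy)]
enum Context {
    Top,
    Bool,
    Prox,
}

fn check(node: &Node, ctx: Context) -> Result<(), Rejection> {
    match node {
        Node::Term(_) => Ok(()),
        Node::Not(inner) => match (ctx, &**inner) {
            // The only accepted negation: a term negated at the top level.
            (Context::Top, Node::Term(_)) => Ok(()),
            (_, Node::Not(_)) => Err(Rejection::DoubleNegative),
            (Context::Top, _) => Err(Rejection::NegatedNonTerm),
            _ => Err(Rejection::NestedNot),
        },
        Node::Must(args) | Node::Should(args) => match ctx {
            Context::Bool => Err(Rejection::NestedBoolean),
            Context::Prox => Err(Rejection::BooleanInProx),
            Context::Top => args.iter().try_for_each(|a| check(a, Context::Bool)),
        },
        Node::Prox(args) => args.iter().try_for_each(|a| check(a, Context::Prox)),
    }
}

// Validate a whole query, given as its sequence of top-level nodes.
fn validate(query: &[Node]) -> Result<(), Rejection> {
    query.iter().try_for_each(|n| check(n, Context::Top))
}
```

Under these assumptions, a query like a !b passes, while ~(!a) is rejected as a nested NOT and +(~a) as a nested boolean group.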
This library also has built-in functionality for simplifying queries:
removing redundant terms, extraneous parentheses, and the MUST operator.
The MUST operator is always unnecessary because the default connective for
Congress.gov search is AND. Simplification also groups consecutive SHOULD
terms, like ~a ~b, into ~(a b).
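As a rough sketch of what such a simplification pass might look like, the example below unwraps MUST groups, merges consecutive SHOULD groups, and drops exact duplicates. The `Node` type and the `simplify`, `absorb`, and `push_unique` functions are hypothetical names for this illustration, not the crate's API. Treating "redundant terms" as exact duplicates at the same level is an assumption.

```rust
// Hypothetical AST for illustration; the crate's real types may differ.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Term(String),
    Must(Vec<Node>),
    Should(Vec<Node>),
}

// Append `node` unless an identical node is already present
// (assumption: "redundant" means an exact duplicate at the same level).
fn push_unique(out: &mut Vec<Node>, node: Node) {
    if !out.contains(&node) {
        out.push(node);
    }
}

// Fold one node into the simplified output built so far.
fn absorb(out: &mut Vec<Node>, node: Node) {
    match node {
        // AND is the default connective, so +a and +(a b) reduce to
        // their arguments.
        Node::Must(args) => {
            for arg in args {
                absorb(out, arg);
            }
        }
        Node::Should(args) => {
            let args = simplify(args);
            match out.last_mut() {
                // The previous node is also a SHOULD group: merge into
                // it, so ~a ~b becomes ~(a b).
                Some(Node::Should(prev)) => {
                    for arg in args {
                        push_unique(prev, arg);
                    }
                }
                _ => out.push(Node::Should(args)),
            }
        }
        term => push_unique(out, term),
    }
}

// Simplify a query given as its sequence of top-level nodes.
fn simplify(query: Vec<Node>) -> Vec<Node> {
    let mut out = Vec::new();
    for node in query {
        absorb(&mut out, node);
    }
    out
}
```

With this sketch, the node sequence for ~a ~b simplifies to a single SHOULD group over a and b, and +a +(b c) a simplifies to the three plain terms a, b, c. Removing extraneous parentheses is not modeled, since this toy AST has no parenthesis node.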
aqp stands for "Advanced Query Parser". Background on that is available
from this Solr Jira ticket.
The 3 in the crate name is because this is my third attempt at putting
together this crate.
Below is a grammar for the query syntax as implemented by this crate. The implementation may have drifted from the grammar, so treat the implementation as the normative version of the syntax. Paste the grammar into the Ohm Editor to experiment with it and test example queries.
Query {
  Exp = ( ParenExp | Prox | Boolean | term | not )+
  ParenExp = "(" Exp ")"
  Prox = ( "n" | "N" | "w" | "W" ) "/" digit+ ParenProxArgs
  ParenProxArgs = "(" ( ProxArgs | ParenProxArgs ) ")"
  ProxArgs = literal+ | ( Prox | Boolean | nonliteral )+
  Boolean = ("+" | "~") ( BoolArgs | ParenBoolArgs )
  ParenBoolArgs = "(" ( BoolArgs+ | ParenBoolArgs+ ) ")"
  BoolArgs = Prox | Boolean | term

  // tokens
  term = nonliteral | literal
  nonliteral = wildcard | phrase | bare
  phrase = "\"" ( bare | space )+ "\""
  literal = "'" bare "'"
  wildcard = bare "*"
  not = "!" term
  // may need to add more punctuation
  bare = ( alnum | space | "," | "." | "%" | "$" )+ ~"/"
}
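To make one rule of the grammar concrete, here is a small hand-written sketch of the operator prefix of the Prox rule: one of n, N, w, W, then a slash, then one or more digits. It is not taken from the crate. The `ProxOp` struct and the `parse_prox_op` function are hypothetical names, and lowercasing the operator letter is an assumption made for the example. Unlike Ohm's syntactic rules, this sketch does not skip whitespace between tokens.

```rust
// Hypothetical representation of a proximity operator such as n/5.
#[derive(Debug, PartialEq)]
struct ProxOp {
    kind: char,    // 'n' or 'w', lowercased (an assumption of this sketch)
    distance: u32, // the number after the slash
}

// Try to read a proximity operator from the start of `input`.
// On success, return the operator and the unconsumed rest of the input.
fn parse_prox_op(input: &str) -> Option<(ProxOp, &str)> {
    let mut chars = input.chars();
    let kind = match chars.next()? {
        c @ ('n' | 'N' | 'w' | 'W') => c.to_ascii_lowercase(),
        _ => return None,
    };
    let rest = chars.as_str().strip_prefix('/')?;
    // digit+ : at least one ASCII digit is required.
    let digits_len = rest.chars().take_while(|c| c.is_ascii_digit()).count();
    if digits_len == 0 {
        return None;
    }
    // ASCII digits are one byte each, so the count is a valid byte offset.
    let (digits, rest) = rest.split_at(digits_len);
    // Distances too large for u32 are treated as a parse failure.
    let distance = digits.parse().ok()?;
    Some((ProxOp { kind, distance }, rest))
}
```

For example, calling `parse_prox_op` on n/5(a b) yields kind 'n', distance 5, and leaves (a b) for the argument parser, while n/(a) fails because the distance is missing.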