Module serde_bibtex::syntax
§Description of the bibliography syntax
The goal of this module is to give an explicit description of the grammar accepted by this crate. For other grammars, see for example the btparse documentation.
For an informal description of the .bib grammar, visit the documentation for the de module.
Generally speaking, we attempt to align with the grammar accepted by biber, extended to handle ASCII-compatible non-UTF-8 input where sensible. However, biber has certain idiosyncrasies that we intentionally do not support. Jump to the comparisons section for an informal discussion of the differences from other bibtex-compatible programs.
§Structure of a bibliography
§Whitespace, comments, and junk characters
- Whitespace is defined as any ASCII character accepted by the is_ascii_whitespace method.
ⓘ
ws = _{ (" " | "\t" | "\n" | "\r" | "\x0C" )+ }
- A TeX comment is started by a % symbol and terminated by a newline \n.
ⓘ
tex_comment = _{ "%" ~ (!"\n" ~ ANY)* ~ "\n" }
- Whitespace and TeX comments can be combined to match ignored characters, which is any
sequence of whitespace and TeX comments.
ⓘ
ign = _{ (tex_comment | ws)* }
- Junk characters are any characters which are either commented or are not @.
ⓘ
junk = _{ (tex_comment | !("@" | "%") ~ ANY)* }
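The ign and junk rules can be read as two small scanning loops. The following is a minimal hand-written sketch in plain Rust, not the crate's pest-derived parser; the function names skip_ign and skip_junk are hypothetical. It follows the rules above in that a % with no terminating newline is not a complete TeX comment, so scanning stops there.

```rust
/// Returns the byte offset just past a leading run of ignored characters
/// (ASCII whitespace and complete `%` comments), mirroring the `ign` rule.
fn skip_ign(input: &str) -> usize {
    let bytes = input.as_bytes();
    let mut pos = 0;
    while pos < bytes.len() {
        if bytes[pos].is_ascii_whitespace() {
            pos += 1;
        } else if bytes[pos] == b'%' {
            // A TeX comment must be closed by a newline; otherwise stop here.
            match bytes[pos..].iter().position(|&b| b == b'\n') {
                Some(offset) => pos += offset + 1,
                None => break,
            }
        } else {
            break;
        }
    }
    pos
}

/// Returns the byte offset at which junk stops: at the next `@` outside a
/// comment, at an unterminated `%`, or at the end of the input.
fn skip_junk(input: &str) -> usize {
    let bytes = input.as_bytes();
    let mut pos = 0;
    while pos < bytes.len() {
        match bytes[pos] {
            b'@' => break,
            b'%' => match bytes[pos..].iter().position(|&b| b == b'\n') {
                Some(offset) => pos += offset + 1,
                None => break,
            },
            _ => pos += 1,
        }
    }
    pos
}
```

Both functions return byte offsets. Since they only ever stop on an ASCII byte or at the end of the input, the offset is always a valid slicing position even when the junk contains non-ASCII text.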
§Identifiers
- An identifier is a non-empty sequence of characters, each of which is either a non-ASCII UTF-8 character or a printable ASCII character other than the literal characters {}(),=\#%".
ⓘ
identifier = _{ (!('\x00'..'\x20' | "{" | "}" | "(" | ")" | "," | "=" | "\\" | "#" | "%" | "\"" | "\x7f") ~ ANY)+ }
- A variable is an identifier which can be used in macro expansions. The syntax is the same as
an identifier, except additionally it cannot begin with an ASCII digit.
ⓘ
variable = @{ !ASCII_DIGIT ~ identifier }
- An entry type, entry key, and field key are all parsed as identifiers.
ⓘ
entry_type = @{ identifier }
entry_key = @{ identifier }
field_key = @{ identifier }
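The character classes above translate directly into predicates. This is a sketch under the assumption that the input is already a Rust string slice; the names is_identifier_char, is_identifier, and is_variable are hypothetical and are not part of this crate's API.

```rust
/// True if `c` may appear in an identifier: any non-ASCII character, or a
/// printable ASCII character other than the reserved ones.
fn is_identifier_char(c: char) -> bool {
    if !c.is_ascii() {
        return true;
    }
    // `is_ascii_graphic` excludes control characters, space, and DEL (0x7f).
    c.is_ascii_graphic() && !"{}(),=\\#%\"".contains(c)
}

/// True if `s` is a non-empty sequence of identifier characters.
fn is_identifier(s: &str) -> bool {
    !s.is_empty() && s.chars().all(is_identifier_char)
}

/// True if `s` is an identifier that does not begin with an ASCII digit.
fn is_variable(s: &str) -> bool {
    is_identifier(s) && !s.starts_with(|c: char| c.is_ascii_digit())
}
```

With these definitions, 2020 is a valid identifier (and so a valid entry key or field key) but not a valid variable, which is what keeps variables distinguishable from numeric tokens.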
§Field tokens and values
- A numeric token is a non-empty sequence of ASCII digits.
ⓘ
token_number = @{ ASCII_DIGIT+ }
- A balanced token is a sequence of characters such that the brackets {} are balanced.
ⓘ
balanced = _{ "{" ~ balanced* ~ "}" | (!("{" | "}") ~ ANY) }
token_curly = @{ balanced* }
- A quoted token is a sequence of characters delimited by " such that the brackets {} are balanced. The closing " must not be captured within any brackets {}.
ⓘ
quoted = _{ "{" ~ balanced* ~ "}" | (!("{" | "}" | "\"") ~ ANY) }
token_quoted = @{ quoted* }
- A token can be any of the above, or also a variable.
ⓘ
token = _{ token_number | "{" ~ token_curly ~ "}" | "\"" ~ token_quoted ~ "\"" | variable }
- A value is a non-empty sequence of tokens separated by #, where each # may be surrounded by ignored characters.
ⓘ
value = { token ~ (ign ~ "#" ~ ign ~ token)* }
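The balancing conditions for curly and quoted tokens amount to tracking brace depth. The sketch below is hand-written for illustration, with hypothetical function names, and is not how the pest-derived parser is implemented. It checks whether a string is brace-balanced, and locates the closing " of a quoted token given the text that follows the opening ".

```rust
/// True if every `{` in `s` has a matching `}` and no `}` closes early.
fn is_balanced(s: &str) -> bool {
    let mut depth = 0usize;
    for b in s.bytes() {
        match b {
            b'{' => depth += 1,
            b'}' => match depth.checked_sub(1) {
                Some(d) => depth = d,
                None => return false,
            },
            _ => {}
        }
    }
    depth == 0
}

/// Given the text following an opening `"`, returns the byte offset of the
/// closing `"`: the first one that is not nested inside `{}`.
/// Returns `None` if a `}` appears at depth zero or no closing quote exists.
fn closing_quote(rest: &str) -> Option<usize> {
    let mut depth = 0usize;
    for (i, b) in rest.bytes().enumerate() {
        match b {
            b'{' => depth += 1,
            b'}' => depth = depth.checked_sub(1)?,
            b'"' if depth == 0 => return Some(i),
            _ => {}
        }
    }
    None
}
```

For the input a {"} b" the quote inside the braces is skipped and the final quote is reported as the terminator, which is the behaviour the quoted rule describes.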
§Comment entry
- A comment entry is essentially parsed as a text token, except in place of quotes we allow delimitation by round brackets. Similarly to the quoted text token, it is terminated by a closing ) which is not enclosed by curly brackets. It is identified by a case-insensitive comment entry type.
ⓘ
round = _{ "{" ~ balanced* ~ "}" | (!("{" | "}" | ")") ~ ANY) }
token_round = @{ round* }
comment_entry_type = _{ ^"comment" ~ ign }
entry_comment = { comment_entry_type ~ ( "{" ~ token_curly ~ "}" | "(" ~ token_round ~ ")" ) }
§Preamble entry
- A preamble entry contains only a value and is identified by a case-insensitive preamble entry type.
ⓘ
preamble_contents = _{ ign ~ value ~ ign }
preamble_entry_type = _{ ^"preamble" ~ ign }
entry_preamble = { preamble_entry_type ~ ( "{" ~ preamble_contents ~ "}" | "(" ~ preamble_contents ~ ")" ) }
§Macro entry
- A macro entry consists of a variable and a value, separated by a = character. Note that a macro can optionally have empty contents, and if it is not empty, it can optionally have a trailing comma.
ⓘ
macro_contents = _{ (ign ~ variable ~ ign ~ "=" ~ ign ~ value ~ ign ~ ","?)? ~ ign }
macro_entry_type = _{ ^"string" ~ ign }
entry_macro = { macro_entry_type ~ ("{" ~ macro_contents ~ "}" | "(" ~ macro_contents ~ ")") }
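This section specifies only the syntax of macros, but it may help to see what a value made of several tokens is conventionally taken to mean. The sketch below assumes the usual BibTeX semantics, in which tokens joined by # are concatenated and a variable is replaced by the value of the macro it names. The Token type, the expand function, the exact-match name lookup, and the choice to return None for an undefined variable are all assumptions for illustration, not this crate's API or behaviour.

```rust
use std::collections::HashMap;

/// The kinds of token a value can contain (hypothetical type).
enum Token<'a> {
    Number(&'a str),
    /// Contents of a `{...}` or `"..."` token, without the delimiters.
    Text(&'a str),
    /// A variable naming a previously defined macro.
    Variable(&'a str),
}

/// Concatenates the tokens of a value, substituting each variable with its
/// definition. Returns `None` if a variable is undefined (an assumed policy).
fn expand(tokens: &[Token<'_>], macros: &HashMap<String, String>) -> Option<String> {
    let mut out = String::new();
    for token in tokens {
        match token {
            Token::Number(s) | Token::Text(s) => out.push_str(s),
            Token::Variable(name) => out.push_str(macros.get(*name)?),
        }
    }
    Some(out)
}
```

Under these assumptions, after @string{jan = "January"} the value jan # " " # 2020 would expand to the text January 2020.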
§Regular entry
- The basic component of a regular entry is the field. A field consists of a field key and a
value, separated by an “=”. Note the similarity to the macro entry: however, the field key
is permitted to start with an ASCII digit.
ⓘ
field = _{ ign ~ "," ~ ign ~ field_key ~ ign ~ "=" ~ ign ~ value }
- The bracketed component of a regular entry consists of an entry key, followed by a list of
fields (possibly none), followed by an optional comma.
ⓘ
regular_entry_contents = _{ ign ~ entry_key ~ field* ~ ign ~ ","? ~ ign }
- A regular entry then consists of the entry type along with the contents of the entry,
delimited by brackets.
ⓘ
entry_regular = { entry_type ~ ign ~ ("{" ~ regular_entry_contents ~ "}" | "(" ~ regular_entry_contents ~ ")") }
§Bibliography
- An entry is any one of the above cases (comment, preamble, macro, or regular) preceded by an @ symbol.
ⓘ
entry = { "@" ~ ign ~ (entry_comment | entry_preamble | entry_macro | entry_regular) }
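The ordered choice in the entry rule effectively dispatches on the entry type, compared without regard to ASCII case. The sketch below is a simplification that assumes the entry type has already been read as a complete identifier; the EntryKind enum and the classify function are hypothetical names, and the real grammar works on the raw input with backtracking.

```rust
/// The four kinds of entry distinguished by the grammar (hypothetical type).
#[derive(Debug, PartialEq)]
enum EntryKind {
    Comment,
    Preamble,
    Macro,
    Regular,
}

/// Classifies an entry type, ignoring ASCII case for the reserved names.
fn classify(entry_type: &str) -> EntryKind {
    if entry_type.eq_ignore_ascii_case("comment") {
        EntryKind::Comment
    } else if entry_type.eq_ignore_ascii_case("preamble") {
        EntryKind::Preamble
    } else if entry_type.eq_ignore_ascii_case("string") {
        EntryKind::Macro
    } else {
        // Anything else, including longer names such as "commentary".
        EntryKind::Regular
    }
}
```

Comparing the whole identifier means that a type which merely begins with a reserved name falls through to the regular case.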
- A bibliography is a possibly empty list of entries, separated by junk characters.
ⓘ
bib = _{ SOI ~ junk ~ (entry ~ junk)* ~ EOI }
§Grammar comparisons
§Differences from biber
- A field key is permitted to start with an ASCII digit. The only place we do not permit a leading digit is at the beginning of a variable, so that a variable can be unambiguously distinguished from an unquoted number.
- We do not skip characters following \ and '. When biber encounters one of these characters, it consumes the following character and counts it as whitespace. For instance, biber considers @ '%article to be equivalent to @article, because the % character follows ' and is therefore skipped rather than beginning a comment.
- We treat comment entries delimited by () in the same way as quoted text fields. This is more flexible than biber, which considers a closing ) to terminate the comment field, regardless of the current depth of {} brackets.
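The difference in how a round-bracketed comment is terminated can be made concrete with two small scanners. This is an illustrative sketch only. The first function follows the rule described in this module, and the second models the depth-agnostic rule attributed to biber above. Neither is taken from either program's source, and both names are hypothetical.

```rust
/// Closing `)` under the rule described here: the first `)` at brace depth 0.
/// Returns `None` if a `}` appears at depth zero or no such `)` exists.
fn closing_paren_balanced(rest: &str) -> Option<usize> {
    let mut depth = 0usize;
    for (i, b) in rest.bytes().enumerate() {
        match b {
            b'{' => depth += 1,
            b'}' => depth = depth.checked_sub(1)?,
            b')' if depth == 0 => return Some(i),
            _ => {}
        }
    }
    None
}

/// Closing `)` under the depth-agnostic rule: simply the first `)`.
fn closing_paren_first(rest: &str) -> Option<usize> {
    rest.find(')')
}
```

For the comment body see {a) b} end) tail, the first function reports the ) after end, while the second stops at the ) inside the braces.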
§Differences from bibtex
- Bibtex does not support %-style comments.
- Bibtex does not capture @comment strings: instead, upon reading an @comment entry, it immediately resets and applies ‘junk’ parsing. For example, @comment{@article} will result in a parse error, since the @comment is discarded, then { is discarded as a junk character, then @article is parsed to begin a new entry, and } then results in an error.
- Bibtex does not support unicode.
- The only disallowed printable ASCII character in an entry key is ,.
§More flexible syntax?
The syntax could intentionally be made more flexible while still accepting all files satisfying the current grammar. However, we do not want to promote proliferation of .bib files that are incompatible with other more well-established tools.
Structs§
- A simple automatically derived pest parser.