# bibtex-parser

[![CI](https://github.com/b-vitamins/bibtex-parser/actions/workflows/ci.yml/badge.svg)](https://github.com/b-vitamins/bibtex-parser/actions/workflows/ci.yml)
[![Crates.io](https://img.shields.io/crates/v/bibtex-parser.svg)](https://crates.io/crates/bibtex-parser)
[![docs.rs](https://img.shields.io/docsrs/bibtex-parser.svg)](https://docs.rs/bibtex-parser)
[![PyPI](https://img.shields.io/pypi/v/citerra.svg)](https://pypi.org/project/citerra/)
[![License](https://img.shields.io/crates/l/bibtex-parser.svg)](LICENSE)

BibTeX parsing for Rust and Python applications.

`bibtex-parser` provides a Rust parser and library API, plus a Python package
built with PyO3 and maturin. It supports strict parsing, explicit tolerant
recovery, source locations, raw-text preservation, semantic helpers, editing
primitives, streaming, multi-file corpus parsing, and configurable
serialization.

## Install

Rust:

```toml
[dependencies]
bibtex-parser = "0.2"
```

Python:

```sh
pip install citerra
```

The Python distribution and import name are both `citerra`:

```python
import citerra
```

## Core Concepts

- `Library` is the compact Rust bibliography collection API for application
  code that wants entries, fields, strings, comments, preambles, validation,
  transforms, and writing.
- `ParsedDocument` is the Rust model for tooling that needs diagnostics,
  source-order blocks, raw source text, partial entries, failed blocks, and
  source locations.
- `citerra.Document` is the Python document model. It exposes source-order
  blocks, diagnostics, raw text, editing operations, and writing through Python
  classes and functions.
- `Parser` configures strict versus tolerant parsing, source capture, raw-text
  preservation, expanded values, streaming, and multi-source parsing.

Strict parsing is the default. Tolerant parsing is explicit.

## Rust Quick Start

```rust
use bibtex_parser::{Library, Result};

fn main() -> Result<()> {
    let input = r#"
        @string{venue = "VLDB"}
        @article{paper,
            author = "Jane Doe and John Smith",
            title = "Example Paper",
            journal = venue,
            year = 2026
        }
    "#;

    let library = Library::parse(input)?;
    let entry = library.find_by_key("paper").unwrap();

    assert_eq!(entry.get("journal"), Some("VLDB"));
    assert_eq!(entry.year(), Some("2026".to_string()));
    assert_eq!(entry.authors().len(), 2);
    Ok(())
}
```

## Rust Tolerant Parsing And Diagnostics

Use tolerant mode when a corpus may contain malformed entries but you still
want the valid entries before and after those regions.

```rust
use bibtex_parser::{Block, Library, Result};

fn main() -> Result<()> {
    let library = Library::parser()
        .tolerant()
        .capture_source()
        .parse(r#"
            @article{ok, title = "Good"}
            @article{bad, title = "Missing close"
            @book{recovered, title = "Recovered"}
        "#)?;

    assert_eq!(library.entries().len(), 2);
    assert_eq!(library.failed_blocks().len(), 1);

    for block in library.blocks() {
        if let Block::Failed(failed) = block {
            eprintln!("bad block at {:?}", failed.source);
        }
    }

    Ok(())
}
```

For diagnostics and source-preserving output, parse into `ParsedDocument`:

```rust
use bibtex_parser::{ParseStatus, Parser, Result};

fn main() -> Result<()> {
    let document = Parser::new()
        .tolerant()
        .capture_source()
        .preserve_raw()
        .expand_values()
        .parse_source("refs/main.bib", r#"
            @article{paper, title = "Example Paper"}
            @article{broken, title = "Missing close"
        "#)?;

    assert_eq!(document.status(), ParseStatus::Partial);
    assert_eq!(document.entries()[0].key(), "paper");

    for diagnostic in document.diagnostics() {
        eprintln!("{}: {}", diagnostic.code, diagnostic.message);
    }

    Ok(())
}
```

## Rust Query, Edit, And Write

```rust
use bibtex_parser::{Library, Result, Writer, WriterConfig};

fn main() -> Result<()> {
    let mut library = Library::parse(r#"
        @article{paper,
            title = "Example Paper",
            doi = "https://doi.org/10.1000/XYZ.",
            keywords = "rust; parsing, bibtex"
        }
    "#)?;

    let entry = &mut library.entries_mut()[0];
    entry.set_literal("note", "accepted");
    entry.rename_field("keywords", "tags");

    library.normalize_doi_fields();

    let config = WriterConfig {
        indent: "  ".to_string(),
        align_values: true,
        sort_fields: true,
        ..Default::default()
    };

    let mut output = Vec::new();
    Writer::with_config(&mut output, config).write_library(&library)?;

    let entry = &library.entries()[0];
    assert_eq!(entry.doi(), Some("10.1000/xyz".to_string()));
    assert_eq!(entry.get("note"), Some("accepted"));
    assert_eq!(entry.get("tags"), Some("rust; parsing, bibtex"));
    Ok(())
}
```

For simple serialization:

```rust
let bibtex = library.to_bibtex()?;
library.write_file("references.bib")?;
```

## Rust Semantic Helpers

```rust
use bibtex_parser::{Library, ResourceKind, Result};

fn main() -> Result<()> {
    let library = Library::parse(r#"
        @article{paper,
            author = "Jane Doe and {Research Group}",
            date = "2026-05-13",
            doi = "https://doi.org/10.1000/XYZ."
        }
    "#)?;

    let entry = &library.entries()[0];
    assert_eq!(entry.authors()[1].literal.as_deref(), Some("Research Group"));
    assert_eq!(entry.date_parts().unwrap().unwrap().month, Some(5));
    assert_eq!(entry.resource_fields()[0].kind, ResourceKind::Doi);
    assert_eq!(entry.doi(), Some("10.1000/xyz".to_string()));
    Ok(())
}
```

Helpers include name parsing, date extraction, DOI normalization, field
normalization, resource classification, duplicate-key detection, and validation.
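
To build intuition for the DOI behavior shown in the examples above (`"https://doi.org/10.1000/XYZ."` becoming `"10.1000/xyz"`), here is a minimal pure-Python sketch of that normalization: strip a resolver prefix, trim trailing punctuation, and lowercase. It is an approximation for illustration, not the crate's actual algorithm.

```python
def normalize_doi_sketch(value: str) -> str:
    """Approximate the DOI normalization shown in the examples above.

    Strips common resolver prefixes, trims a trailing dot, and
    lowercases. Illustrative sketch only, not the library's code.
    """
    doi = value.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.rstrip(".").lower()

print(normalize_doi_sketch("https://doi.org/10.1000/XYZ."))  # 10.1000/xyz
```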

## Rust Streaming And Multi-File Parsing

Streaming lets callers process source-order events without building a full
intermediate collection:

```rust
use bibtex_parser::{ParseEvent, ParseFlow, Parser, Result};

fn main() -> Result<()> {
    let mut entries = 0;
    let summary = Parser::new().parse_events("@article{paper, title = \"A\"}", |event| {
        if let ParseEvent::Entry(entry) = event {
            assert_eq!(entry.key(), "paper");
            entries += 1;
        }
        Ok(ParseFlow::Continue)
    })?;

    assert_eq!(entries, 1);
    assert_eq!(summary.entries, 1);
    Ok(())
}
```

Multi-source parsing keeps source identity and can detect duplicate keys across
files:

```rust
use bibtex_parser::{CorpusSource, Parser, Result, SourceId};

fn main() -> Result<()> {
    let sources = [
        CorpusSource::new("refs/a.bib", "@article{paper, title = \"A\"}"),
        CorpusSource::new("refs/b.bib", "@article{paper, title = \"B\"}"),
    ];

    let corpus = Parser::new().parse_sources(&sources)?;
    let duplicates = corpus.duplicate_keys();

    assert_eq!(corpus.entries().count(), 2);
    assert_eq!(duplicates[0].key, "paper");
    assert!(duplicates[0].cross_source);
    assert_eq!(duplicates[0].occurrences[1].source, SourceId::new(1));
    Ok(())
}
```

Enable the `parallel` feature for Rayon-backed parsing of multiple files from
disk with `Parser::parse_files`.

## Python Quick Start

```python
import citerra

document = citerra.parse(
    '@article{paper, author = "Jane Doe", title = "Example Paper", year = 2026}',
    expand_values=True,
)

entry = document.entry("paper")
assert entry is not None
assert entry.entry_type == "article"
assert entry.get("title") == "Example Paper"
assert entry.date_parts().year == 2026
```

`parse_path`, `load`, `loads`, `dump`, and `dumps` cover file and string
workflows:

```python
from pathlib import Path
import citerra

document = citerra.parse_path("references.bib", tolerant=True)
Path("normalized.bib").write_text(citerra.dumps(document), encoding="utf-8")
```

## Python Diagnostics, Source, And Raw Text

```python
from pathlib import Path

import citerra

# Read the source text so diagnostics can point at "refs/main.bib".
text = Path("refs/main.bib").read_text(encoding="utf-8")

document = citerra.parse(
    text,
    tolerant=True,
    capture_source=True,
    preserve_raw=True,
    source="refs/main.bib",
)

if document.status != "ok":
    for diagnostic in document.diagnostics:
        location = diagnostic.source
        if location is not None:
            print(diagnostic.code, location.line, location.column, diagnostic.message)
        else:
            print(diagnostic.code, diagnostic.message)

entry = document.entry("paper")
if entry is not None:
    print(entry.raw)
    print(entry.field("title").raw_value)
```

## Python Mutation And Writing

```python
import citerra

document = citerra.parse_path("references.bib", tolerant=True)

document.rename_key("paper", "paper-v2")
document.set_field("paper-v2", "note", "accepted")
document.remove_export_fields(["abstract", "keywords"])

config = citerra.WriterConfig(
    preserve_raw=True,
    trailing_comma=True,
)

output = document.write(config)
```

Use `preserve_raw=True` for low-churn source-preserving writes. Use
`preserve_raw=False` when normalized formatting is desired.

## Python Plain Records

Some application code wants ordinary dictionaries for filtering, indexing, or
bulk transforms. The Python package provides explicit helpers for that shape
without changing the native document model:

```python
import citerra

document = citerra.parse_path("references.bib")
records = citerra.document_to_dicts(document)

selected = [record for record in records if record.get("year") == "2026"]
text = citerra.write_entries(
    selected,
    field_order=["author", "title", "journal", "year", "doi"],
    sort_by=["ID"],
    trailing_comma=True,
)
```

## Python Helpers

```python
import citerra

assert citerra.normalize_doi("https://doi.org/10.1000/XYZ.") == "10.1000/xyz"
assert citerra.latex_to_unicode("Jos\\'e") == "José"

names = citerra.parse_names("Jane Doe and {Research Group}")
assert names[1].literal == "Research Group"

date = citerra.parse_date("2026-05-13")
assert (date.year, date.month, date.day) == (2026, 5, 13)
```
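
The `parse_names` behavior above, where `{Research Group}` stays a single literal name, follows the usual BibTeX rule of splitting on `and` only at brace depth zero. A self-contained sketch of that rule (an illustration of the convention, not the library's parser):

```python
def split_names_sketch(field: str) -> list[str]:
    """Split a BibTeX name list on ' and ' at brace depth zero.

    Braced groups such as {Research Group} are kept as single
    literal names. Illustrative sketch, not the library's parser.
    """
    parts, current, depth, i = [], [], 0, 0
    while i < len(field):
        ch = field[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if depth == 0 and field[i:i + 5] == " and ":
            parts.append("".join(current).strip())
            current = []
            i += 5
            continue
        current.append(ch)
        i += 1
    parts.append("".join(current).strip())
    return parts

print(split_names_sketch("Jane Doe and {Research Group}"))
# ['Jane Doe', '{Research Group}']
```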

## Feature Flags

```toml
[dependencies]
bibtex-parser = { version = "0.2", features = ["parallel", "latex_to_unicode"] }
```

- `parallel`: enables Rayon-backed `Parser::parse_files` for multi-file workloads.
- `latex_to_unicode`: enables LaTeX accent-to-Unicode conversion helpers.
- `python-extension`: builds the PyO3 extension module used by the Python package.

## Semantics

- `Library::parse` and `Parser::parse` are strict by default.
- `Parser::tolerant()` recovers valid blocks after malformed input and records
  failures and diagnostics.
- `Library` expands string definitions and month constants for field access.
- `ParsedDocument` can retain raw entry, field, value, comment, preamble, string,
  and failed-block text when `preserve_raw()` is enabled.
- Source columns are 1-based Unicode scalar columns. Byte spans are also exposed
  for exact source slicing.
- Writer defaults preserve source order. Sorting, alignment, trailing commas,
  and normalized output are explicit choices.
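
The distinction between Unicode scalar columns and byte spans matters for non-ASCII input. A small library-independent Python sketch shows why the two indexing schemes diverge:

```python
# "é" is a single Unicode scalar value but two UTF-8 bytes, so scalar
# positions and byte offsets diverge after it.
line = 'author = "José", year = 2026'

scalar_index = line.index("year")   # 17: 0-based index in scalar values
column = scalar_index + 1           # 18: the 1-based column a parser reports
byte_offset = len(line[:scalar_index].encode("utf-8"))  # 18: "é" took 2 bytes

# A byte span slices the raw UTF-8 source exactly:
raw = line.encode("utf-8")
assert raw[byte_offset:byte_offset + len("year")].decode("utf-8") == "year"
```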

## Local Development

This repository uses a Guix manifest for local development:

```sh
guix shell -m manifest.scm --
```

Common gates:

```sh
cargo fmt --all -- --check
guix shell -m manifest.scm -- actionlint
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
cargo bench --all-features --no-run
guix shell -m manifest.scm -- maturin build --release --out target/wheels
```

Python wheel smoke test without installing into the user environment:

```sh
rm -rf target/python-test
python3 - <<'PY'
from pathlib import Path
from zipfile import ZipFile

wheel = sorted(Path("target/wheels").glob("citerra-*.whl"))[-1]
target = Path("target/python-test")
target.mkdir(parents=True, exist_ok=True)
with ZipFile(wheel) as archive:
    archive.extractall(target)
PY
guix shell -m manifest.scm -- env PYTHONPATH=target/python-test python3 -m pytest tests/python
```

## Release Process

The release workflow builds the Rust crate, Python source distribution, and
ABI3 wheels for Linux, macOS, and Windows. Tagging `vX.Y.Z` runs release
validation, creates a GitHub release, publishes to crates.io, and publishes to
PyPI when the required repository environments and secrets are configured.

See [RELEASE_CHECKLIST.md](RELEASE_CHECKLIST.md) for the exact release gates
and setup requirements.

## Benchmarks And Comparisons

The repository includes Criterion benchmarks for parser throughput, tolerant
parsing, source-preserving parsing, streaming, writing, corpus parsing, common
library operations, and memory-oriented workloads. The tables below reflect a
single run on `tests/fixtures/tugboat.bib`:

- 2,701,551 bytes, 73,993 lines, 3,644 entries.
- AMD Ryzen 5 5600G, 6 cores / 12 threads.
- Rust `1.93.0`, Python `3.11.14`.
- Measured on 2026-05-13. Throughput is input-size normalized.
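
The throughput column is derived arithmetic rather than a separate measurement: input bytes divided by median time, reported in binary MiB/s. Checking one example row (the strict `Library` parse) in Python:

```python
# Throughput = input bytes / median time, reported in binary MiB/s.
fixture_bytes = 2_701_551      # tests/fixtures/tugboat.bib
median_seconds = 5.234e-3      # strict Library parse row, for example

mib_per_second = fixture_bytes / median_seconds / (1024 * 1024)
# mib_per_second is approximately 492, matching the strict-parse row.
```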

Reproduce the Rust parser table:

```sh
cargo bench --bench performance --all-features -- --noplot throughput
```

| Rust parser / mode | Version | Median time | Throughput | Notes |
| --- | ---: | ---: | ---: | --- |
| `serde_bibtex` ignore | 0.7.1 | 2.411 ms | 1.04 GiB/s | Parses and discards data |
| `serde_bibtex` selected struct | 0.7.1 | 4.143 ms | 621.9 MiB/s | Deserializes selected fields |
| `bibtex-parser` strict `Library` | 0.2.1 | 5.234 ms | 492.3 MiB/s | Entries, fields, strings, comments, preambles |
| `serde_bibtex` borrowed entries | 0.7.1 | 6.872 ms | 374.9 MiB/s | Borrowed parsed entries |
| `bibtex-parser` tolerant `Library` | 0.2.1 | 7.015 ms | 367.3 MiB/s | Recovery and failed-block tracking |
| `biblatex` raw bibliography | 0.11.0 | 10.839 ms | 237.7 MiB/s | Raw BibLaTeX bibliography |
| `serde_bibtex` owned entries | 0.7.1 | 13.469 ms | 191.3 MiB/s | Owned entries with month macros |
| `bibtex-parser` streaming events | 0.2.1 | 21.943 ms | 117.4 MiB/s | Source-order callback events |
| `bibtex-parser` source-preserving document | 0.2.1 | 31.862 ms | 80.9 MiB/s | Raw text, source locations, diagnostics model |
| `nom-bibtex` | 0.6.0 | 34.334 ms | 75.0 MiB/s | Parsed bibliography |

Reproduce the Rust writing table:

```sh
cargo bench --bench performance --all-features -- --noplot writing
```

| Rust writer mode | Version | Median time | Throughput |
| --- | ---: | ---: | ---: |
| Raw-preserving document write | 0.2.1 | 1.816 ms | 1.39 GiB/s |
| Normalized `Library` write | 0.2.1 | 5.325 ms | 483.8 MiB/s |

The Python comparison used the local `citerra` wheel plus `bibtexparser` 1.4.4,
`bibtexparser` 2.0.0b9, and `pybtex` 0.26.1. The comparison script uses
whichever optional packages are installed in the active environment:

```sh
python python/benchmarks/compare_parsers.py tests/fixtures/tugboat.bib
python python/benchmarks/compare_parsers.py tests/fixtures/tugboat.bib --write
```

| Python parser / mode | Version | Median parse time | Throughput | Relative time |
| --- | ---: | ---: | ---: | ---: |
| `citerra` structured parse | 0.2.1 | 0.058 s | 44.3 MiB/s | 1.0x |
| `citerra` source-preserving parse | 0.2.1 | 0.065 s | 39.9 MiB/s | 1.1x |
| `bibtexparser` parse | 2.0.0b9 | 0.372 s | 6.9 MiB/s | 6.4x |
| `pybtex` parse | 0.26.1 | 0.859 s | 3.0 MiB/s | 14.8x |
| `bibtexparser` parse | 1.4.4 | 10.483 s | 0.2 MiB/s | 180.1x |

| Python writer / mode | Version | Median write time | Throughput | Relative time |
| --- | ---: | ---: | ---: | ---: |
| `citerra` raw-preserving write | 0.2.1 | 0.003 s | 953.2 MiB/s | 1.0x |
| `citerra` normalized write | 0.2.1 | 0.014 s | 181.3 MiB/s | 5.3x |
| `bibtexparser` write | 1.4.4 | 0.106 s | 24.3 MiB/s | 39.2x |
| `bibtexparser` write | 2.0.0b9 | 0.493 s | 5.2 MiB/s | 182.2x |
| `pybtex` write | 0.26.1 | 3.942 s | 0.7 MiB/s | 1458.5x |

## License

Licensed under either of Apache-2.0 or MIT, at your option.