biblib

biblib is a Rust library for parsing citation exports into a shared data model and deduplicating the resulting records.

It is built for import pipelines, evidence synthesis tooling, registry ingestion, and any workflow that needs to turn citation files from multiple sources into one normalized Citation shape.

What It Supports

biblib currently ships parsers for:

Source format	Feature	Parser
RIS	`ris`	`RisParser`
PubMed / MEDLINE (`.nbib`)	`pubmed`	`PubMedParser`
EndNote XML	`xml`	`EndNoteXmlParser`
Generic CSV / delimited data	`csv`	`csv::CsvParser`
ICTRP registry CSV exports	`csv`	`IctrpCsvParser`

All parser outputs converge on the same Citation struct, including normalized fields such as title, authors, date, doi, accession_number, pmid, pmc_id, urls, and extra_fields.

Installation

[dependencies]
biblib = "0.5"

For a smaller build:

[dependencies]
biblib = { version = "0.5", default-features = false, features = ["ris"] }

Quick Start

Parse RIS

use biblib::{CitationParser, RisParser};

let input = r#"TY  - JOUR
TI  - Machine Learning in Healthcare
AU  - Smith, John
AU  - Doe, Jane
PY  - 2023
DO  - 10.1000/example
ER  -"#;

let citations = RisParser::new().parse(input).unwrap();

assert_eq!(citations.len(), 1);
assert_eq!(citations[0].title, "Machine Learning in Healthcare");
assert_eq!(citations[0].doi.as_deref(), Some("10.1000/example"));

Parse PubMed / MEDLINE

use biblib::{CitationParser, PubMedParser};

let input = r#"PMID- 12345678
TI  - Immunotherapy in Oncology
FAU - Smith, John
JT  - Journal of Clinical Research
DP  - 2024 Jun 15
AB  - Example abstract."#;

let citations = PubMedParser::new().parse(input).unwrap();

assert_eq!(citations.len(), 1);
assert_eq!(citations[0].pmid.as_deref(), Some("12345678"));
assert_eq!(citations[0].title, "Immunotherapy in Oncology");
assert_eq!(citations[0].journal.as_deref(), Some("Journal of Clinical Research"));

Auto-detect Supported Formats

detect_and_parse() currently auto-detects RIS, PubMed, EndNote XML, and ICTRP CSV. Generic CSV should still be parsed explicitly with CsvParser.

use biblib::detect_and_parse;

let input = "TY  - JOUR\nTI  - Example\nER  -";
let (citations, format) = detect_and_parse(input).unwrap();

assert_eq!(format.as_str(), "RIS");
assert_eq!(citations[0].title, "Example");

Parse ICTRP CSV

use biblib::{CitationParser, IctrpCsvParser};

let input = concat!(
    "TrialID,Public title,Scientific title,Date registration,Date registration3,Study type,Source Register\n",
    "NCT00000001,Public title,Scientific title,01/05/2026,20260501,Interventional,ClinicalTrials.gov\n"
);

let citations = IctrpCsvParser::new().parse(input).unwrap();
let citation = &citations[0];

assert_eq!(citation.accession_number.as_deref(), Some("NCT00000001"));
assert_eq!(citation.title, "Scientific title");
assert_eq!(citation.citation_type, vec!["Clinical Trial", "Interventional"]);

Parse Generic CSV with Custom Headers

use biblib::csv::{CsvConfig, CsvParser};
use biblib::CitationParser;

let mut config = CsvConfig::new();
config
    .set_delimiter(b';')
    .set_header_mapping("title", vec!["Article Name".to_string()])
    .set_header_mapping("authors", vec!["Writers".to_string()])
    .set_header_mapping("year", vec!["Published".to_string()]);

let input = "Article Name;Writers;Published\nExample Paper;Smith, John;2023";
let citations = CsvParser::with_config(config).parse(input).unwrap();

assert_eq!(citations[0].title, "Example Paper");
assert_eq!(citations[0].date.as_ref().unwrap().year, 2023);

Deduplicate Parsed Records

use biblib::dedupe::{Deduplicator, DeduplicatorConfig};
use biblib::{Citation, Date};

let citations = vec![
    Citation {
        title: "Example Title".to_string(),
        doi: Some("10.1000/example".to_string()),
        date: Some(Date { year: 2023, month: None, day: None }),
        journal: Some("Example Journal".to_string()),
        ..Default::default()
    },
    Citation {
        title: "Example Title".to_string(),
        doi: Some("10.1000/example".to_string()),
        date: Some(Date { year: 2023, month: None, day: None }),
        journal: Some("Example Journal".to_string()),
        ..Default::default()
    },
];

let config = DeduplicatorConfig {
    group_by_year: true,
    run_in_parallel: true,
    source_preferences: vec!["PubMed".to_string()],
};

let groups = Deduplicator::new()
    .with_config(config)
    .find_duplicates(&citations)
    .unwrap();

let duplicate_group = groups
    .iter()
    .find(|group| group.unique.doi.as_deref() == Some("10.1000/example"))
    .unwrap();

assert_eq!(duplicate_group.duplicates.len(), 1);

Data Model

The core output type is Citation.

Important fields include:

Field	Type	Purpose
`citation_type`	`Vec<String>`	Source and work-type labels
`title`	`String`	Main normalized title
`authors`	`Vec<Author>`	Parsed people with name parts and affiliations
`journal`	`Option<String>`	Full journal or source title
`journal_abbr`	`Option<String>`	Journal abbreviation
`date`	`Option<Date>`	Year with optional month/day
`volume`	`Option<String>`	Volume string
`issue`	`Option<String>`	Issue or number string
`pages`	`Option<String>`	Normalized page range
`issn`	`Vec<String>`	One or more ISSNs/serial identifiers
`doi`	`Option<String>`	Normalized DOI
`accession_number`	`Option<String>`	Registry or source accession identifier
`pmid`	`Option<String>`	PubMed identifier
`pmc_id`	`Option<String>`	PubMed Central identifier
`abstract_text`	`Option<String>`	Abstract text
`keywords`	`Vec<String>`	Parsed keywords
`urls`	`Vec<String>`	Collected links
`language`	`Option<String>`	Language code or label
`mesh_terms`	`Vec<String>`	PubMed MeSH terms
`publisher`	`Option<String>`	Publisher or sponsor
`extra_fields`	`HashMap<String, Vec<String>>`	Source-specific leftovers preserved raw

This makes it easy to normalize aggressively where the library has clear semantics, while still keeping source-specific information available.

Feature Flags

Feature	Enables
`ris`	RIS parser
`pubmed`	PubMed / MEDLINE parser
`xml`	EndNote XML parser
`csv`	Generic CSV parser and ICTRP CSV parser
`dedupe`	Deduplication engine
`diagnostics`	Pretty parse diagnostics via `ariadne`

Default features: csv, pubmed, xml, ris, dedupe

Since v0.5, biblib no longer uses the regex crate or exposes regex-backend feature flags. It uses regex-lite internally, and regex backend selection is no longer part of the public API surface.

Errors and Diagnostics

All parsers return ParseError on malformed input. Errors carry:

The source format
A 1-based line number when available
A byte span when available
A structured ValueError

Example:

use biblib::{CitationParser, RisParser, ValueError};

let input = "TY  - JOUR\nAU  - Smith, John\nER  -\n";

match RisParser::new().parse(input) {
    Ok(_) => unreachable!("expected a parse error"),
    Err(err) => {
        assert_eq!(err.line, Some(1));
        assert!(matches!(err.error, ValueError::MissingValue { key: "TI", .. }));
    }
}

For human-friendly diagnostics, enable diagnostics:

[dependencies]
biblib = { version = "0.5", features = ["diagnostics"] }

Then use parse_with_diagnostics():

use biblib::{RisParser, parse_with_diagnostics};

let input = "TY  - JOUR\nAU  - Smith, John\nER  -\n";
let rendered = parse_with_diagnostics(&RisParser::new(), input, "refs.ris");

assert!(rendered.is_err());

Guides

PARSING_GUIDE.md - format-specific mapping and normalization rules
DEDUPLICATION_GUIDE.md - duplicate matching behavior and configuration
docs.rs/biblib - API reference

License

Licensed under either MIT or Apache-2.0, at your option.

biblib 0.5.0