biblib 0.4.4

Parse, manage, and deduplicate academic citations
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
# Changelog


All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.4.4] - 2026-06-23


### Fixed


- **RIS repeated abstract tags**: The RIS parser now merges repeated `AB` tags into a single `abstract_text` value instead of keeping only the first one. Repeated `N2` tags are merged the same way when `AB` is absent, and `AB` still takes priority over `N2`.

## [0.4.3] - 2026-06-13


### Fixed


- **RIS multi-line field values**: The RIS parser now correctly handles continuation lines that have no leading indentation. Previously, a line like `At present, there are no...` following an `N2` or `AB` tag was silently dropped because its first two characters (`At`) were mistaken for a tag. Continuation lines are now detected by checking for the absence of a proper RIS separator (` -`, `- `, `-`) after the two-character prefix, and their text is appended to the previous tag's value.

- **RIS `M3` (Type of Work) field now included in `citation_type`**: The `M3` tag was previously unmapped and fell through to `extra_fields`. It is now recognised as `RisTag::WorkType` and its values are appended to `citation_type` alongside the `TY` value (duplicates suppressed).

## [0.4.2] - 2026-02-21


### Added


- **RIS Citation Type Mapping**: The RIS parser now automatically maps abbreviated citation types (like `JOUR`, `BOOK`) directly to their full, human-readable equivalents (like `Journal Article`, `Book`) during parsing, resulting in cleaner and standardized `citation_type` values downstream. Unrecognized abbreviations fall back to preserving their original strings.

## [0.4.1] - 2026-02-20


### Fixed


- **PubMed book title fallback**: When a PubMed record has no `TI` (Title) field, the parser now falls back to the `BTI` (Book Title) field. This fixes a `MissingValue` error for book citations that use `BTI` instead of `TI`.

## [0.4.0] - 2026-02-20


### Added


- **Line numbers in all parser errors**: Every `ParseError` now carries the 1-based line number (`ParseError::line`) of the citation record that triggered the error. Previously, most conversion errors (missing title, bad date, etc.) had `line: None`. Now all parsers — RIS, PubMed, CSV, and EndNote XML — populate this field consistently.

- **`SourceSpan` type**: New public struct `SourceSpan { start: usize, end: usize }` representing an inclusive-start, exclusive-end byte-offset range into the original source text. Available via `use biblib::SourceSpan`.

- **`ParseError::span` field** (`Option<SourceSpan>`): Every `ParseError` optionally carries a byte-offset span covering the full citation record where the error occurred. Populated by the RIS, PubMed, and CSV parsers for all conversion errors.

- **`ParseError::with_span(span)` builder method**: Attaches a `SourceSpan` to an existing `ParseError`, enabling builder-style construction.

- **`diagnostics` feature** (optional dependency on [`ariadne`]https://crates.io/crates/ariadne): Enable with `features = ["diagnostics"]` to unlock:
  - `ParseError::to_diagnostic(filename: &str, source: &str) -> String` — renders a human-readable diagnostic with ANSI colour highlighting, source context lines, and underlined error spans using the ariadne crate.
  - `parse_with_diagnostics(parser, input, filename)` — convenience free function that calls any `CitationParser` and on failure returns `Err(String)` containing the rendered diagnostic.

### Changed


- **`ParseError` struct layout** (BREAKING for struct-literal construction): A new `span: Option<SourceSpan>` field has been inserted between `column` and `format`. Any code constructing `ParseError` with struct literal syntax (i.e. `ParseError { line: …, column: …, format: …, error: … }`) must add `span: None`. Code using the provided constructors (`at_line`, `at_position`, `without_position`, `new`) is unaffected.

### Migration Guide


#### `ParseError` struct literal construction


If you construct `ParseError` with a struct literal, add the new `span` field:

```rust
// Before (0.3.x):
ParseError { line: Some(1), column: None, format: CitationFormat::Ris, error: ValueError::Syntax("…".into()) }

// After (0.4.x):
ParseError { line: Some(1), column: None, span: None, format: CitationFormat::Ris, error: ValueError::Syntax("…".into()) }
```

Using the constructors instead avoids this:

```rust
// No change needed:
ParseError::at_line(1, CitationFormat::Ris, ValueError::Syntax("…".into()))
```

#### Opt-in rich diagnostics


```toml
[dependencies]
biblib = { version = "0.4", features = ["diagnostics"] }
```

```rust
use biblib::{CitationParser, RisParser, parse_with_diagnostics};

let source = std::fs::read_to_string("citations.ris")?;
match parse_with_diagnostics(&RisParser::new(), &source, "citations.ris") {
    Ok(citations) => println!("Parsed {} citations", citations.len()),
    Err(diagnostic) => eprintln!("{}", diagnostic), // pretty, coloured output
}
```

## [0.3.2] - 2025-12-30


### Added


- **PubMed DOI fallback to AID field**: The PubMed parser now checks the `AID` (Article Identifier) field for DOI when not found in the `LID` (Location ID) field. This improves DOI extraction for citations where the DOI is only present in the AID field.

## [0.3.1] - 2025-12-24


### Added


- **Multi-author parsing for RIS**: The RIS parser now handles multiple authors on a single AU line, splitting on semicolons (`;`), ampersands (`&`), and the word `and`
- **PARSING_GUIDE.md**: Comprehensive documentation for all format parsers (RIS, PubMed, EndNote XML, CSV) including tag mappings, date formats, and data transformations
- **DEDUPLICATION_GUIDE.md**: Detailed documentation of the deduplication algorithm, similarity thresholds, normalization rules, and configuration options

### Changed


- **README.md**: Complete rewrite with cleaner structure and accurate API examples

## [0.3.0] - 2025-08-17


### Added


- **Enhanced deduplication API**: New `find_duplicates_with_sources()` method for source-aware deduplication when sources are managed externally
- **Enhanced CSV parser functionality**: Added configuration options for quote character, trimming, flexible parsing, memory optimization, additional header aliases, validation, and automatic format detection

### Changed


- **Error handling system**: Complete restructure of error types for better debugging and programmatic error handling
  - `CitationParser::parse` now returns `Result<Vec<Citation>, ParseError>` instead of `Result<Vec<Citation>, CitationError>`
  - `CitationError` restructured with `UnknownFormat` and `Parse(ParseError)` variants
  - Enhanced error reporting with line/column tracking and semantic error types (`Syntax`, `MissingValue`, `BadValue`, `MultipleValues`)
  - Empty input now returns `Ok(Vec::new())` instead of errors across all parsers
- **`detect_and_parse()` function signature**: Now takes only one parameter (`content`) instead of two (`content`, `source`)
- **Author struct (BREAKING)**: The `Author` schema has changed to support richer name handling and multiple affiliations
  - Before: `Author { family_name: String, given_name: String, affiliation: Option<String> }`
  - After: `Author { name: String, given_name: Option<String>, middle_name: Option<String>, affiliations: Vec<String> }`
  - Name parsing is standardized via a shared utility; mononyms and middle names are handled consistently across all parsers

### Fixed


- **CSV extra fields functionality**: Fixed critical issue where extra fields were completely ignored and not populated in Citation structs

### Removed


- **Citation ID field**: The `id` field has been completely removed from the `Citation` struct as it is not part of the actual bibliographic data parsed from citation formats
- **Source tracking in Citation struct**: The `source` field has been completely removed from the `Citation` struct and all parser implementations
- **Parser `with_source()` methods**: All parsers no longer have the `with_source()` method for setting citation source
- **Citation year field (BREAKING)**: The deprecated `year: Option<i32>` field has been removed from `Citation`; use `date.year` instead

### Migration Guide


#### 1. Citation ID Management


If you were relying on the auto-generated `id` field in citations, you'll need to manage IDs at the application level:

**Before (v0.2.x):**

```rust
let parser = RisParser::new();
let citations = parser.parse(input).unwrap();
let citation_id = citations[0].id.clone(); // Auto-generated nanoid
```

**After (v0.3.x):**

```rust
let parser = RisParser::new();
let citations = parser.parse(input).unwrap();
// Generate IDs in your application if needed:
let citation_id = nanoid::nanoid!(); // or your preferred ID system
```

#### 2. Source Tracking


Source tracking must now be handled at the application level instead of within the Citation struct:

**Before (v0.2.x):**

```rust
let parser = RisParser::new().with_source("PubMed");
let citations = parser.parse(input).unwrap();
let source = citations[0].source.clone(); // "PubMed"
```

**After (v0.3.x):**

```rust
let parser = RisParser::new();
let citations = parser.parse(input).unwrap();
// Handle source tracking in your application:
let source = "PubMed"; // manage this in your app
```

#### 3. Error Handling


The error system has been completely restructured for better debugging and programmatic error handling:

**Before (v0.2.x):**

```rust
use biblib::CitationError;

match parser.parse(input) {
    Ok(citations) => println!("Parsed {} citations", citations.len()),
    Err(CitationError::ParseError(msg)) => eprintln!("Parse error: {}", msg),
    Err(CitationError::IoError(e)) => eprintln!("IO error: {}", e),
}
```

**After (v0.3.x):**

```rust
use biblib::{CitationError, ParseError, ValueError};

match parser.parse(input) {
    Ok(citations) => println!("Parsed {} citations", citations.len()),
    Err(parse_err) => {
        // Much more detailed error information
        eprintln!("Parse error at line {}: {}",
            parse_err.line.unwrap_or(0), parse_err);

        // Handle specific error types
        match &parse_err.error {
            ValueError::Syntax(msg) => eprintln!("Syntax error: {}", msg),
            ValueError::MissingValue { field, key } => {
                eprintln!("Missing required field {}: {}", field, key)
            },
            ValueError::BadValue { field, key, value, reason } => {
                eprintln!("Invalid value in field {} ({}): {} ({})", field, key, value, reason)
            },
            ValueError::MultipleValues { field, key, .. } => {
                eprintln!("Multiple values for field {} ({})", field, key)
            },
        }
    }
}

// For top-level API (detect_and_parse), handle CitationError:
match detect_and_parse(content) {
    Ok((citations, format)) => println!("Detected {} format, parsed {} citations", format, citations.len()),
    Err(CitationError::UnknownFormat) => eprintln!("Could not detect citation format"),
    Err(CitationError::Parse(parse_err)) => eprintln!("Parse error: {}", parse_err),
}
```

Note: Empty input now returns `Ok(Vec::new())` instead of an error, improving API usability.

#### 4. Format Detection API


The `detect_and_parse()` function no longer accepts a source parameter:

**Before (v0.2.x):**

```rust
let (citations, format) = detect_and_parse(content, "PubMed").unwrap();
```

**After (v0.3.x):**

```rust
let (citations, format) = detect_and_parse(content).unwrap();
// Track source separately in your application
```

#### 5. Enhanced CSV Parser API


The CSV parser has been significantly enhanced. Existing code will continue to work, but new features are available:

```rust
use biblib::csv::{CsvParser, CsvConfig};

// Auto-detection (new)
let parser = CsvParser::with_auto_detection();

// Enhanced configuration (new options)
let mut config = CsvConfig::new();
config.set_quote(b'\'').set_flexible(true).add_header_aliases("title", vec!["paper_title".to_string()]);

let parser = CsvParser::with_config(config);
```

#### 6. Deduplication with Sources


Use the new `find_duplicates_with_sources()` method when you need source-aware deduplication:

**Before (v0.2.x):**

```rust
let citations = vec![/* citations with source field */];
let deduplicator = Deduplicator::new().with_config(config);
let groups = deduplicator.find_duplicates(&citations).unwrap();
```

**After (v0.3.x):**

```rust
let citations = vec![/* citations without source field */];
let sources = vec!["PubMed", "CrossRef"]; // source for each citation
let deduplicator = Deduplicator::new().with_config(config);
let groups = deduplicator.find_duplicates_with_sources(&citations, &sources).unwrap();
```

#### 7. Author Struct Changes (BREAKING)


The `Author` struct now supports richer name handling and multiple affiliations. Update your code as follows:

**Before (v0.2.x):**

```rust
let a = biblib::Author {
    family_name: "Smith".into(),
    given_name: "John".into(),
    affiliation: Some("University of Nowhere".into()),
};
```

**After (v0.3.x):**

```rust
let a = biblib::Author {
    name: "Smith".into(),
    given_name: Some("John".into()),
    middle_name: None,
    affiliations: vec!["University of Nowhere".into()],
};
```

Notes:

- `name` contains the full name; `given_name`/`middle_name` are optional.
- Multiple affiliations are stored in the `affiliations` vector.
- Parsers now handle name splitting consistently.

#### 8. Citation Year Field Removed (BREAKING)


The `Citation.year` field has been removed. Use `date.year` instead:

**Before (v0.2.x):**

```rust
let year = citation.year;
```

**After (v0.3.x):**

```rust
let year = citation.date.as_ref().map(|d| d.year);
```

Notes:

- Parsers now populate `citation.date` only.
- CSV mappings for a `year` header still populate `citation.date`.

## [0.2.4] - 2025-06-11


### Fixed


- Fixed the line continuation for `TI` and `AB` tags in PubMed parser to handle cases where these tags are split across multiple lines

## [0.2.3] - 2025-06-09


### Fixed


- Improved modularity of CSV and XML feature. Fixes compilation errors when default features are not in use

### Changed


- License changed to MIT or Apache-2.0

## [0.2.2] - 2025-01-31


### Fixed


- Fixed RIS parser line to handle tags like T1, A2, etc.

## [0.2.1] - 2025-01-28


### Added


- New `detect_and_parse` function for automatic format detection and parsing
- Support for automatic detection of RIS, PubMed, and EndNote XML formats

## [0.2.0] - 2025-01-28


### Added


- New `source` field in `Citation` struct to track citation origin
- `.with_source()` method on all parsers (RIS, PubMed, EndNote XML, CSV) to specify citation source
- `source_preferences` option in `DeduplicatorConfig` for controlling unique citation selection
- Cargo features for optional components:
  - `csv` - Enable CSV format support
  - `pubmed` - Enable PubMed/MEDLINE format support
  - `xml` - Enable EndNote XML support
  - `ris` - Enable RIS format support
  - `dedupe` - Enable citation deduplication
  - All features enabled by default

### Changed


- Enhanced unique citation selection logic in deduplicator:
  1. Prefers citations from sources listed in `source_preferences`
  2. Falls back to citations with abstracts if no source preference matches
  3. Prefers citations with DOIs if abstracts exist in both citations
  4. Uses first citation as fallback if all above criteria are equal

## [0.1.0] - 2025-01-25


### Added


- Initial release with core functionality
- Support for multiple citation formats (RIS, PubMed, EndNote XML, CSV)
- Citation deduplication engine
- Comprehensive metadata handling