//! # RuSt Research Paper Parser (rsrpp)
//!
//! The `rsrpp` library provides a set of tools for parsing research papers.
//!
//! ## Features
//!
//! - Extract structured text from PDF papers (sections, paragraphs)
//! - Robust section detection with fallback for non-standard formats (Nature, etc.)
//! - Detect and separate figure/table captions
//! - Math expression detection and LaTeX-formatted markup (heuristic + LLM with trigram alignment)
//! - **Structured reference extraction** (LLM-based, requires `OPENAI_API_KEY`)
//!
//! ## Quick Start
//!
//! ### Prerequisites
//! - Poppler: `sudo apt install poppler-utils`
//! - OpenCV: `sudo apt install libopencv-dev clang libclang-dev`
//! - (Optional) `OPENAI_API_KEY` environment variable for LLM features
//!
//! ### Installation
//! To start using the `rsrpp` library, add it to your project's dependencies (`cargo add` updates `Cargo.toml` for you):
//!
//! ```bash
//! cargo add rsrpp
//! ```
//!
//! Then, import the necessary modules in your code:
//!
//! ```rust
//! use rsrpp::parser;
//! ```
//!
//! ## Examples
//!
//! ### Basic Usage
//!
//! ```rust
//! # use rsrpp::config::ParserConfig;
//! # use rsrpp::models::Section;
//! # use rsrpp::parser::parse;
//! # async fn try_main() -> Result<(), String> {
//! let mut config = ParserConfig::new();
//! let verbose = true;
//! let url = "https://arxiv.org/pdf/1706.03762";
//! let pages = parse(url, &mut config, verbose).await.unwrap(); // Vec<Page>
//!
//! // Basic conversion (captions separated, no math markup)
//! let sections = Section::from_pages(&pages); // Vec<Section>
//!
//! // With math markup (math expressions wrapped in <math>...</math> tags, LaTeX format)
//! let sections_with_math = Section::from_pages_with_math(&pages, &config.math_texts);
//!
//! let json = serde_json::to_string(&sections_with_math).unwrap(); // String
//! # Ok(())
//! # }
//! # #[tokio::main]
//! # async fn main() {
//! # try_main().await.unwrap();
//! # }
//! ```
//!
//! ### With Reference Extraction (requires OPENAI_API_KEY)
//!
//! ```rust,ignore
//! use rsrpp::config::ParserConfig;
//! use rsrpp::parser::{parse, pages2paper_output};
//!
//! let mut config = ParserConfig::new();
//! config.extract_references = true; // Enable reference extraction
//!
//! // `.await?` must run inside an async function that returns a `Result`.
//! let pages = parse("paper.pdf", &mut config, false).await?;
//! let output = pages2paper_output(&pages, &config); // PaperOutput
//!
//! // output.sections - Vec<Section>
//! // output.references - Vec<Reference> with authors, title, year, venue, etc.
//! ```
//!
//! ## Tests
//!
//! The library includes a test suite. To run it, use the following command:
//!
//! ```sh
//! cargo test
//! ```