§RuSt Research Paper Parser (rsrpp)
The rsrpp library provides a set of tools for parsing research papers.
§Features
- Extract structured text from PDF papers (sections, paragraphs)
- Robust section detection with fallback for non-standard formats (Nature, etc.)
- Detect and separate figure/table captions
- Math expression detection and LaTeX-formatted markup (heuristic + LLM with trigram alignment)
- Structured reference extraction (LLM-based, requires OPENAI_API_KEY)
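For readers who want to post-process the math markup, the sketch below pulls out the contents of the `<math>...</math>` tags that the Basic Usage section describes. It is a minimal illustration built on the standard library only; `extract_math` is a hypothetical helper and not part of rsrpp, and it assumes the tags are not nested.

```rust
/// Hypothetical helper (not part of rsrpp): collects the contents of every
/// `<math>...</math>` span in `text`, in order of appearance.
/// Assumes tags are not nested; an unclosed `<math>` tag ends the scan.
fn extract_math(text: &str) -> Vec<&str> {
    const OPEN: &str = "<math>";
    const CLOSE: &str = "</math>";

    let mut found = Vec::new();
    let mut rest = text;

    while let Some(start) = rest.find(OPEN) {
        // Skip past the opening tag.
        let after_open = &rest[start + OPEN.len()..];
        match after_open.find(CLOSE) {
            Some(end) => {
                found.push(&after_open[..end]);
                // Continue scanning after the closing tag.
                rest = &after_open[end + CLOSE.len()..];
            }
            // No closing tag: stop rather than guess where the span ends.
            None => break,
        }
    }
    found
}
```

Calling `extract_math` on a section's text yields the LaTeX fragments as borrowed slices, so no text is copied.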
§Quick Start
§Pre-requirements
- Poppler: sudo apt install poppler-utils
- OpenCV: sudo apt install libopencv-dev clang libclang-dev
- OPENAI_API_KEY environment variable for LLM features (enabled by default; auto-disabled if not set)
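Because the LLM features are disabled automatically when OPENAI_API_KEY is not set, it can be useful to check for the key up front and report which mode a run will use. The sketch below is one way to do that with the standard library. The function names are hypothetical, and treating an empty or whitespace-only value as "not set" is an assumption, not documented rsrpp behavior.

```rust
use std::env;

/// Hypothetical helper: true when `key` holds a non-blank value.
/// Treating a blank value as missing is an assumption of this sketch.
fn has_api_key(key: Option<&str>) -> bool {
    key.map_or(false, |k| !k.trim().is_empty())
}

/// Hypothetical helper: reads OPENAI_API_KEY from the process environment
/// and reports whether LLM-backed features could be used.
fn llm_features_available() -> bool {
    has_api_key(env::var("OPENAI_API_KEY").ok().as_deref())
}
```

Splitting the check into a pure function (`has_api_key`) and a thin environment wrapper keeps the logic testable without mutating the process environment.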
§Installation
To start using the rsrpp library, add it to your project's dependencies with Cargo, which updates the Cargo.toml file:
cargo add rsrpp
Then, import the necessary modules in your code:
extern crate rsrpp;
use rsrpp::parser;
§Examples
§Basic Usage
let mut config = ParserConfig::new(); // LLM enabled by default
let verbose = true;
let url = "https://arxiv.org/pdf/1706.03762";
let pages = parse(url, &mut config, verbose).await.unwrap(); // Vec<Page>
// Basic conversion (captions separated, no math markup)
let sections = Section::from_pages(&pages); // Vec<Section>
// With math markup (math expressions wrapped in <math>...</math> tags, LaTeX format)
let sections_with_math = Section::from_pages_with_math(&pages, &config.math_texts);
let json = serde_json::to_string(&sections_with_math).unwrap(); // String
§With Reference Extraction (requires OPENAI_API_KEY)
use rsrpp::config::ParserConfig;
use rsrpp::parser::{parse, pages2paper_output};
let mut config = ParserConfig::new(); // LLM enabled by default
config.extract_references = true; // Enable reference extraction
let pages = parse("paper.pdf", &mut config, false).await?;
let output = pages2paper_output(&pages, &config); // PaperOutput
// output.sections - Vec<Section>
// output.references - Vec<Reference> with authors, title, year, venue, etc.
§Tests
The library includes a test suite. To run it, use the following command:
cargo test