Crate rsrpp

§RuSt Research Paper Parser (rsrpp)

The rsrpp library provides tools for parsing research papers in PDF format into structured text.

§Features

  • Extract structured text from PDF papers (sections, paragraphs)
  • Robust section detection with fallback for non-standard formats (Nature, etc.)
  • Detect and separate figure/table captions
  • Math expression detection and LaTeX-formatted markup (heuristic + LLM with trigram alignment)
  • Structured reference extraction (LLM-based, requires OPENAI_API_KEY)

§Quick Start

§Prerequisites

  • Poppler: sudo apt install poppler-utils
  • OpenCV: sudo apt install libopencv-dev clang libclang-dev
  • OPENAI_API_KEY environment variable for LLM features (enabled by default; auto-disabled if not set)
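Because LLM features are disabled automatically when OPENAI_API_KEY is missing, it can help to check for the variable before parsing. The following is a minimal sketch using only the standard library; `key_is_usable` is a hypothetical helper for illustration and is not part of rsrpp, and treating a blank value as unusable is an assumption of this sketch rather than documented library behavior.

```rust
use std::env;

/// Hypothetical helper (not part of rsrpp): returns true when the given
/// value looks like a usable key, i.e. it is present and not blank.
fn key_is_usable(value: Option<&str>) -> bool {
    match value {
        Some(v) => !v.trim().is_empty(),
        None => false,
    }
}

fn main() {
    // Read the variable named in the prerequisites list above.
    let key = env::var("OPENAI_API_KEY").ok();

    if key_is_usable(key.as_deref()) {
        println!("OPENAI_API_KEY is set: LLM features can run.");
    } else {
        println!("OPENAI_API_KEY is not set: expect LLM features to be disabled.");
    }
}
```

Running a check like this before parsing makes it explicit whether the output will include LLM-based results such as reference extraction.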

§Installation

To start using the rsrpp library, add it to your project’s dependencies. The following command updates Cargo.toml for you:

cargo add rsrpp

Then, import the necessary modules in your code:

// `extern crate rsrpp;` is not needed in the Rust 2018 edition or later.
use rsrpp::parser;

§Examples

§Basic Usage

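use rsrpp::config::ParserConfig;
use rsrpp::parser::parse;
// `Section` must also be in scope; its module path is not shown in this example.

// Run inside an async context, since `parse` is awaited.
let mut config = ParserConfig::new(); // LLM enabled by default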
let verbose = true;
let url = "https://arxiv.org/pdf/1706.03762";
let pages = parse(url, &mut config, verbose).await.unwrap(); // Vec<Page>

// Basic conversion (captions separated, no math markup)
let sections = Section::from_pages(&pages); // Vec<Section>

// With math markup (math expressions wrapped in <math>...</math> tags, LaTeX format)
let sections_with_math = Section::from_pages_with_math(&pages, &config.math_texts);

let json = serde_json::to_string(&sections_with_math).unwrap(); // String
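Since math expressions are wrapped in `<math>...</math>` tags, downstream code may want to pull those spans out of the section text. The sketch below uses only the standard library and works on a plain string; `extract_math_spans` is a hypothetical helper, not an rsrpp function, and the sample sentence is invented for illustration. It assumes tags are not nested and stops at the first unterminated tag.

```rust
/// Hypothetical helper (not part of rsrpp): collects the contents of every
/// `<math>...</math>` span in `text`, in order of appearance.
fn extract_math_spans(text: &str) -> Vec<&str> {
    const OPEN: &str = "<math>";
    const CLOSE: &str = "</math>";

    let mut spans = Vec::new();
    let mut rest = text;

    while let Some(start) = rest.find(OPEN) {
        // Skip past the opening tag, then look for the matching close.
        let after_open = &rest[start + OPEN.len()..];
        match after_open.find(CLOSE) {
            Some(end) => {
                spans.push(&after_open[..end]);
                rest = &after_open[end + CLOSE.len()..];
            }
            // Unterminated tag: stop rather than guess where it ends.
            None => break,
        }
    }
    spans
}

fn main() {
    // Invented sample text standing in for a parsed paragraph.
    let text = "Scaling uses <math>\\sqrt{d_k}</math> where <math>d_k = 64</math>.";
    for span in extract_math_spans(text) {
        println!("{span}");
    }
}
```

A plain substring scan is enough here because the tags are fixed literals; a full XML parser would only be needed if the markup could nest or carry attributes.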

§With Reference Extraction (requires OPENAI_API_KEY)

use rsrpp::config::ParserConfig;
use rsrpp::parser::{parse, pages2paper_output};

let mut config = ParserConfig::new(); // LLM enabled by default
config.extract_references = true; // Enable reference extraction

let pages = parse("paper.pdf", &mut config, false).await?;
let output = pages2paper_output(&pages, &config); // PaperOutput

// output.sections - Vec<Section>
// output.references - Vec<Reference> with authors, title, year, venue, etc.

§Tests

The library includes a test suite that verifies its functionality. To run it, use the following command:

cargo test

§Modules

cleaner
Text cleaning and block classification module.
config
converter
extracter
llm
models
parser
test_utils