Rust Research Paper Parser (rsrpp)
The rsrpp library provides a set of tools for parsing research papers.
Quick Start
Prerequisites
- Poppler:
  sudo apt install poppler-utils
- OpenCV:
  sudo apt install libopencv-dev clang libclang-dev
Installation
To start using the rsrpp library, add it to the [dependencies] section of your project's Cargo.toml file, for example by running cargo add rsrpp.
Then, import the necessary modules in your code:
extern crate rsrpp;
use rsrpp::parser;
Examples
Here is a simple example of how to use the parser module:
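// The argument lists and the serde_json call below are assumptions;
// check the crate documentation for the exact signatures.
let mut config = ParserConfig::new();
let url = "https://arxiv.org/pdf/1706.03762";
let pages = parse(url, &mut config).await.unwrap(); // Vec<Page>
let sections = Section::from_pages(&pages); // Vec<Section>
let json = serde_json::to_string(&sections).unwrap(); // String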
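The release notes below say that ParserConfig::new() sets use_llm: true and that LLM processing is disabled at runtime when OPENAI_API_KEY is not set. The following is a minimal, self-contained sketch of that decision, not the crate's implementation: SketchConfig and llm_enabled are hypothetical names, and treating an empty key as unset is an assumption.

```rust
/// Hypothetical stand-in for the crate's ParserConfig; only the `use_llm`
/// field name is taken from the release notes.
struct SketchConfig {
    use_llm: bool,
}

impl SketchConfig {
    /// Mirrors the documented default: LLM processing starts enabled.
    fn new() -> Self {
        SketchConfig { use_llm: true }
    }
}

/// Decide whether LLM processing should actually run.
/// `api_key` is the value of OPENAI_API_KEY, if any. It is passed in as an
/// argument so the function does not depend on the process environment.
/// Assumption: an empty key counts as "not set".
fn llm_enabled(config: &SketchConfig, api_key: Option<&str>) -> bool {
    config.use_llm && api_key.map_or(false, |key| !key.is_empty())
}

fn main() {
    let mut config = SketchConfig::new();
    let key = std::env::var("OPENAI_API_KEY").ok();
    println!("LLM enabled: {}", llm_enabled(&config, key.as_deref()));

    // Explicit opt-out, as described in the release notes.
    config.use_llm = false;
    println!(
        "LLM enabled after opt-out: {}",
        llm_enabled(&config, key.as_deref())
    );
}
```

Passing the key as a parameter rather than reading the environment inside the function keeps the rule easy to test in isolation.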
Tests
The library includes a set of tests to ensure its functionality. To run the tests, use Cargo's standard test command: cargo test
License: MIT
Releases
- LLM-enhanced processing is now enabled by default (ParserConfig::new() sets use_llm: true)
  - If OPENAI_API_KEY is not set, LLM is automatically disabled at runtime
  - Use config.use_llm = false to explicitly disable
- Fixed LLM section validation discarding sections from pages the LLM hadn't examined
  - merge_sections() now uses page-range-aware logic
- Fixed body text loss in Nature-format and non-standard papers:
  - Added section detection fallback for papers without "Abstract" heading using anchor-word matching
  - Added text area degenerate detection to prevent filtering out all blocks
  - Capped table detection regions at 50% of page area to reject false positives from chart gridlines
  - Exempted known section titles from table-region filtering
- Improved math extraction accuracy:
  - Fixed critical bug where LLM-extracted math text was discarded; added trigram-based block alignment
  - Reduced false positives: dates, statistics, section/figure references
  - Added detection for multi-char math functions, ASCII exponents/subscripts, letter fractions, norm notation
  - Unified math output to LaTeX format inside <math> tags
  - Added context-based validation for structure-only pattern matches
- Fixed panic-causing unwrap() calls with proper error handling.
- Fixed Poppler 25.12.0 compatibility on macOS.
- Refactored fix_suffix_hyphens to support 31 compound word suffixes: -based, -driven, -oriented, -aware, -agnostic, -independent, -dependent, -first, -native, -centric, -intensive, -bound, -safe, -free, -proof, -efficient, -optimized, -enabled, -powered, -ready, -capable, -compatible, -compliant, -level, -scale, -wide, -specific, -friendly, -facing, -like, -style
- Added unit tests for suffix hyphenation functionality.
- updated how to extract section titles from PDF.
- restructured rsrpp.parser.
- updated how to extract section titles from PDF.
- updated tests.
- removed init_logger from rsrpp.
- fixed typo.
- introduced tracing logger.
- Updated rsrpp version for rsrpp-cli.
- Updated dependencies.
- removed build.sh because it requires sudo when installing the crate.
- Fixed a bug: removed unused println!.
- Fixed a bug in the XML loop so that it finishes when the end of the file is reached.
- Added verbose mode.
- Fixed a bug in the page-number extraction process.
- Updated: implemented new errors to handle invalid URLs.
- Updated: The max retry time for saving PDF files has been increased.
- Fixed bugs: After converting to PDF, the program now waits until processing is complete.
- Fixed bugs in get_pdf_info.
- Made minor improvements.
- Added cli -> rsrpp-cli.
- Updated the Section module. content: String was replaced by content: Vec<TextBlock>.
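
One of the fixes listed above caps table detection regions at 50% of page area so that chart gridlines are not mistaken for tables. A minimal sketch of such an area filter follows. It is an illustration under stated assumptions, not the crate's code: Rect, MAX_TABLE_AREA_RATIO, and filter_table_regions are hypothetical names, and whether a region of exactly 50% is kept is a guess.

```rust
/// Size of an axis-aligned region in page coordinates (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct Rect {
    width: f64,
    height: f64,
}

impl Rect {
    fn area(&self) -> f64 {
        self.width * self.height
    }
}

/// Largest share of the page a table candidate may cover.
/// The 50% figure comes from the release notes.
const MAX_TABLE_AREA_RATIO: f64 = 0.5;

/// Keep only candidates whose area is at most half of the page area.
/// Assumption: the boundary is inclusive.
fn filter_table_regions(page: Rect, candidates: &[Rect]) -> Vec<Rect> {
    let limit = page.area() * MAX_TABLE_AREA_RATIO;
    candidates
        .iter()
        .copied()
        .filter(|region| region.area() <= limit)
        .collect()
}

fn main() {
    let page = Rect { width: 600.0, height: 800.0 };
    let candidates = [
        Rect { width: 400.0, height: 200.0 }, // small region: kept
        Rect { width: 580.0, height: 700.0 }, // covers most of the page: rejected
    ];
    let kept = filter_table_regions(page, &candidates);
    println!("kept {} of {} candidates", kept.len(), candidates.len());
}
```

Comparing areas against a fixed ratio is cheap and independent of page size, which is why it works as a first-pass sanity check before any finer table analysis.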