Rust Research Paper Parser (rsrpp)

The rsrpp library provides a set of tools for parsing research papers.

Quick Start

Prerequisites

Poppler: sudo apt install poppler-utils
OpenCV: sudo apt install libopencv-dev clang libclang-dev

Installation

To start using the rsrpp library, add it to your project's dependencies. The following command updates Cargo.toml for you:

cargo add rsrpp

Then, import the necessary modules in your code:

extern crate rsrpp;
use rsrpp::parser;

Examples

Here is a simple example of how to use the parser module. Because parse is awaited, the snippet must run inside an async context:

let mut config = ParserConfig::new();
let url = "https://arxiv.org/pdf/1706.03762";
let pages = parse(url, &mut config).await.unwrap(); // Vec<Page>
let sections = Section::from_pages(&pages); // Vec<Section>
let json = serde_json::to_string(&sections).unwrap(); // String

Tests

The library includes a set of tests to ensure its functionality. To run the tests, use the following command:

cargo test

License: MIT

Releases

Fixed panic-causing unwrap() calls with proper error handling.

Fixed Poppler 25.12.0 compatibility on macOS.

Refactored fix_suffix_hyphens to support 31 compound word suffixes:
- -based, -driven, -oriented, -aware, -agnostic, -independent, -dependent, -first, -native, -centric, -intensive, -bound, -safe, -free, -proof, -efficient, -optimized, -enabled, -powered, -ready, -capable, -compatible, -compliant, -level, -scale, -wide, -specific, -friendly, -facing, -like, -style
Added unit tests for suffix hyphenation functionality.
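The release note above does not say how fix_suffix_hyphens works internally. As a rough illustration of the kind of repair a suffix list enables, here is a minimal sketch that assumes PDF extraction leaves a stray space after the hyphen ("attention- based"). The function name rejoin_suffix_hyphens, the shortened suffix list, and the matching rules are all hypothetical and are not taken from the rsrpp source.

```rust
/// A small subset of the suffixes listed above, without the leading hyphen.
/// (Illustrative only; the release note lists 31.)
const SUFFIXES: [&str; 5] = ["based", "driven", "oriented", "aware", "level"];

/// Hypothetical helper, not the rsrpp implementation.
/// Rewrites "word- suffix" as "word-suffix" when `suffix` is in SUFFIXES.
/// Note: runs of whitespace are collapsed to single spaces as a side effect.
fn rejoin_suffix_hyphens(text: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    for token in text.split_whitespace() {
        // Ignore trailing punctuation when comparing against the suffix list.
        let core = token.trim_end_matches(|c: char| !c.is_alphanumeric());
        let is_suffix = SUFFIXES.iter().any(|s| *s == core);
        // The previous token must be a word ending in '-', not a bare hyphen.
        let prev_is_open = out
            .last()
            .map_or(false, |prev| prev.ends_with('-') && prev.len() > 1);
        if is_suffix && prev_is_open {
            // Glue the suffix (and its punctuation) onto the previous token.
            out.last_mut().unwrap().push_str(token);
        } else {
            out.push(token.to_string());
        }
    }
    out.join(" ")
}
```

Under these assumptions, rejoin_suffix_hyphens("an attention- based model") yields "an attention-based model", while a pair such as "well- known" is left alone because "known" is not in the list.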

Updated how section titles are extracted from PDFs.

Restructured rsrpp.parser.
Updated how section titles are extracted from PDFs.
Updated tests.

Removed init_logger from rsrpp.

Fixed a typo.
Introduced the tracing logger.

Updated rsrpp version for rsrpp-cli.

Updated dependencies.
Removed build.sh because it required sudo when installing the crate.

Fixed a bug: removed an unused println!.

Fixed a bug in the XML loop so that it finishes when the end of the file is reached.

Added verbose mode.
Fixed a bug in page number extraction.

Updated: implemented new errors to handle invalid URLs.

Updated: the maximum retry time for saving PDF files has been increased.

Fixed bugs: after converting to PDF, the program now waits until processing is complete.

Fixed bugs in get_pdf_info.
Made minor improvements.

Added a CLI: rsrpp-cli.

Updated the Section module. content: String was replaced by content: Vec<TextBlock>.
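Code that previously read content as a single String may need to join the blocks itself after this change. The sketch below uses hypothetical stand-in types: only the field content: Vec<TextBlock> comes from the release note, while the text field on TextBlock and the helper section_text are assumptions made for illustration. Check the actual rsrpp definitions before relying on it.

```rust
/// Hypothetical stand-in for rsrpp's TextBlock; the `text` field is assumed.
struct TextBlock {
    text: String,
}

/// Hypothetical stand-in for rsrpp's Section; only `content` is shown.
struct Section {
    content: Vec<TextBlock>,
}

/// Joins the text of every block in a section, one block per line.
fn section_text(section: &Section) -> String {
    section
        .content
        .iter()
        .map(|block| block.text.as_str())
        .collect::<Vec<_>>()
        .join("\n")
}
```

Keeping the blocks separate and joining only at the point of use preserves the per-block structure for callers that want it.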

rsrpp 1.0.21