Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
LiteParse
Rust library and CLI for fast, lightweight PDF and document parsing with spatial text extraction. Runs entirely locally with zero cloud dependencies.
LiteParse is also available for Node.js/TypeScript, Python, and the browser (WASM). See the project README for all options.
Installation
Add to your Cargo.toml:
[]
= "2"
Or install the CLI:
Quick Start
use ;
async
Configuration
use ;
let config = LiteParseConfig ;
let parser = new;
Markdown Output
LiteParse can render documents directly to Markdown, including headings, tables, lists,
images, and links reconstructed from the spatial layout. Set
output_format: OutputFormat::Markdown; the rendered Markdown is returned on
result.text. Two related knobs control Markdown rendering:
image_mode(ImageMode::Placeholderdefault |Off|Embed) — how raster images are surfaced in the output.extract_links(defaulttrue) — render hyperlink annotations as[text](url); setfalsefor plain anchor text.
use ;
let config = LiteParseConfig ;
let result = new.parse.await?;
println!; // rendered Markdown
Reconstruction quality varies with document complexity.
Parsing from Bytes
use PdfInput;
let pdf_bytes: = read?;
let result = parser.parse_input.await?;
println!;
Document Complexity
Before committing to a full parse, check whether a document needs OCR or heavier
processing. is_complex is a cheap, text-layer-only pass that returns a
PageComplexityStats per page with a needs_ocr verdict and the signals behind it —
useful for routing documents to different pipelines, rejecting ones you can't handle, or
estimating cost.
use PdfInput;
let parser = new;
let pages = parser.is_complex.await?;
if pages.iter.any
reasons is a Vec<ComplexityReason> (Scanned, NoText, SparseText,
EmbeddedImages, Garbled, VectorText); new variants may be added over time, so match
leniently.
Custom OCR Engine
Implement the OcrEngine trait to plug in your own OCR backend:
use OcrEngine;
use Arc;
let parser = new
.with_ocr_engine;
Features
tesseract(default) — Built-in Tesseract OCR viatesseract-rs. Disable withdefault-features = falseif you don't need OCR or want to use an HTTP OCR server instead.
Supported Formats
- PDF (
.pdf) - Microsoft Office (
.docx,.xlsx,.pptx, etc.) — requires LibreOffice - OpenDocument (
.odt,.ods,.odp) — requires LibreOffice - Images (
.png,.jpg,.tiff, etc.) — requires ImageMagick
CLI
The crate also builds the lit CLI binary:
See lit --help for all options.
License
Apache-2.0