Crate undoc

§undoc

High-performance Microsoft Office document extraction to Markdown.

This library provides tools for parsing DOCX, XLSX, and PPTX files and converting them to Markdown, plain text, or structured JSON.

§Quick Start

use undoc::{extract_text, parse_file, to_markdown};

// Simple text extraction
let text = extract_text("document.docx")?;
println!("{}", text);

// Convert to Markdown
let markdown = to_markdown("document.docx")?;
std::fs::write("output.md", markdown)?;

// Full parsing with access to structure
let doc = parse_file("document.docx")?;
println!("Sections: {}", doc.sections.len());
println!("Resources: {}", doc.resources.len());

§Format-Specific APIs

use undoc::docx::DocxParser;
use undoc::xlsx::XlsxParser;
use undoc::pptx::PptxParser;

// Word documents
let doc = DocxParser::open("report.docx")?.parse()?;

// Excel spreadsheets
let workbook = XlsxParser::open("data.xlsx")?.parse()?;

// PowerPoint presentations
let presentation = PptxParser::open("slides.pptx")?.parse()?;
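
Choosing among the format-specific parsers requires knowing which kind of file is at hand. The sketch below illustrates one way such a check could work using only the standard library. It is not undoc's implementation of `detect_format_from_bytes`: `SketchFormat` and `sniff_format` are hypothetical names, and the heuristic assumes the conventional main part names (`word/document.xml`, `xl/workbook.xml`, `ppt/presentation.xml`), which most producers use but the packaging format does not strictly require.

```rust
/// Hypothetical stand-in for a format enum; this is NOT undoc's `FormatType`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SketchFormat {
    Docx,
    Xlsx,
    Pptx,
}

/// Signature of a ZIP local file header, which starts every OOXML package.
const ZIP_MAGIC: &[u8] = b"PK\x03\x04";

/// Returns true if `needle` occurs anywhere in `haystack`.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Heuristic sniffing: ZIP stores entry names uncompressed, so the
/// conventional main part name can be found by scanning the raw bytes.
fn sniff_format(bytes: &[u8]) -> Option<SketchFormat> {
    if !bytes.starts_with(ZIP_MAGIC) {
        return None; // not a ZIP container, so not an OOXML package
    }
    if contains(bytes, b"word/document.xml") {
        Some(SketchFormat::Docx)
    } else if contains(bytes, b"xl/workbook.xml") {
        Some(SketchFormat::Xlsx)
    } else if contains(bytes, b"ppt/presentation.xml") {
        Some(SketchFormat::Pptx)
    } else {
        None
    }
}

fn main() {
    // A fake buffer that merely mimics the relevant byte patterns.
    let mut fake = ZIP_MAGIC.to_vec();
    fake.extend_from_slice(b"....word/document.xml....");
    println!("{:?}", sniff_format(&fake));
    println!("{:?}", sniff_format(b"%PDF-1.7"));
}
```

A byte scan like this can be fooled by packages that embed another Office document or that use non-standard part names; a robust check would read the package's content types and relationships instead.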

§Features

  • docx (default): Word document support
  • xlsx (default): Excel spreadsheet support
  • pptx (default): PowerPoint presentation support
  • async: Async I/O support with Tokio
  • ffi: C-ABI bindings for foreign language integration

Re-exports§

pub use container::OoxmlContainer;
pub use container::Relationship;
pub use container::Relationships;
pub use detect::detect_format_from_bytes;
pub use detect::detect_format_from_path;
pub use detect::FormatType;
pub use error::Error;
pub use error::Result;
pub use model::Block;
pub use model::Cell;
pub use model::CellAlignment;
pub use model::Document;
pub use model::HeadingLevel;
pub use model::ListInfo;
pub use model::ListType;
pub use model::Metadata;
pub use model::Paragraph;
pub use model::Resource;
pub use model::ResourceType;
pub use model::Row;
pub use model::Section;
pub use model::Table;
pub use model::TextAlignment;
pub use model::TextRun;
pub use model::TextStyle;

Modules§

container
ZIP container abstraction for OOXML documents.
detect
Format detection for Office Open XML documents.
docx
DOCX (Word) document parser.
error
Error types for the undoc library.
model
Intermediate document model for Office documents.
pptx
PPTX (PowerPoint) presentation parser.
render
Output rendering for documents.
xlsx
XLSX (Excel) spreadsheet parser.

Functions§

extract_text
Extract plain text from a document.
parse_bytes
Parse a document from bytes.
parse_file
Parse a document file and return a Document model.
to_json
Convert a document to JSON.
to_markdown
Convert a document to Markdown.
to_markdown_with_options
Convert a document to Markdown with options.
to_text
Convert a document to plain text with render options.
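
The items above suggest a two-stage design: parse into an intermediate model, then render that model to an output format. The following is a minimal sketch of the rendering stage under stated assumptions. `SketchBlock` and `render_markdown` are hypothetical, heavily simplified stand-ins; they do not reproduce the actual definitions of `Block`, `Paragraph`, or `Table` in `undoc::model`, nor the behavior of `to_markdown`.

```rust
/// Hypothetical, simplified block type; undoc's real `Block` may differ.
enum SketchBlock {
    Heading { level: u8, text: String },
    Paragraph(String),
    /// Rows of cells; the first row is treated as the header.
    Table(Vec<Vec<String>>),
}

/// Renders a slice of blocks to Markdown, one blank line between blocks.
fn render_markdown(blocks: &[SketchBlock]) -> String {
    let mut out = String::new();
    for block in blocks {
        match block {
            SketchBlock::Heading { level, text } => {
                // Markdown supports heading levels 1 through 6.
                let hashes = "#".repeat((*level).clamp(1, 6) as usize);
                out.push_str(&format!("{} {}\n\n", hashes, text));
            }
            SketchBlock::Paragraph(text) => {
                out.push_str(text);
                out.push_str("\n\n");
            }
            SketchBlock::Table(rows) => {
                for (i, row) in rows.iter().enumerate() {
                    out.push_str(&format!("| {} |\n", row.join(" | ")));
                    if i == 0 {
                        // Separator line required after the header row.
                        let sep = vec!["---"; row.len()].join(" | ");
                        out.push_str(&format!("| {} |\n", sep));
                    }
                }
                out.push('\n');
            }
        }
    }
    out
}

fn main() {
    let blocks = vec![
        SketchBlock::Heading { level: 1, text: "Report".to_string() },
        SketchBlock::Paragraph("Hello.".to_string()),
        SketchBlock::Table(vec![
            vec!["a".to_string(), "b".to_string()],
            vec!["1".to_string(), "2".to_string()],
        ]),
    ];
    print!("{}", render_markdown(&blocks));
}
```

The sketch does not escape `|` characters inside cells and ignores text styling, lists, and resources, all of which a real renderer would need to handle.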