§undoc
High-performance Microsoft Office document extraction to Markdown.
This library provides tools for parsing DOCX, XLSX, and PPTX files and converting them to Markdown, plain text, or structured JSON.
§Quick Start
use undoc::{parse_file, to_markdown};
// Simple text extraction
let text = undoc::extract_text("document.docx")?;
println!("{}", text);
// Convert to Markdown
let markdown = to_markdown("document.docx")?;
std::fs::write("output.md", markdown)?;
// Full parsing with access to structure
let doc = parse_file("document.docx")?;
println!("Sections: {}", doc.sections.len());
println!("Resources: {}", doc.resources.len());
§Format-Specific APIs
use undoc::docx::DocxParser;
use undoc::xlsx::XlsxParser;
use undoc::pptx::PptxParser;
// Word documents
let doc = DocxParser::open("report.docx")?.parse()?;
// Excel spreadsheets
let workbook = XlsxParser::open("data.xlsx")?.parse()?;
// PowerPoint presentations
let presentation = PptxParser::open("slides.pptx")?.parse()?;
§Features
- docx (default): Word document support
- xlsx (default): Excel spreadsheet support
- pptx (default): PowerPoint presentation support
- async: Async I/O support with Tokio
- ffi: C-ABI bindings for foreign language integration
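The feature flags above are selected in the consuming crate's Cargo.toml. The fragment below is a minimal sketch: the feature names come from the list above, but the `"*"` version requirement is only a placeholder (this page does not state a version), so replace it with the release you actually depend on.

```toml
[dependencies]
# "*" is a placeholder version requirement, not a recommendation.
# Keeps the default format features (docx, xlsx, pptx) and adds `async`.
undoc = { version = "*", features = ["async"] }

# Alternative (commented out): drop the defaults and keep only Word support.
# undoc = { version = "*", default-features = false, features = ["docx"] }
```

Disabling default features and re-enabling only the formats you need is the usual Cargo way to trim unused parsers from a build.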
Re-exports§
pub use container::OoxmlContainer;
pub use container::Relationship;
pub use container::Relationships;
pub use detect::detect_format_from_bytes;
pub use detect::detect_format_from_path;
pub use detect::FormatType;
pub use error::Error;
pub use error::Result;
pub use model::Block;
pub use model::Cell;
pub use model::CellAlignment;
pub use model::Document;
pub use model::HeadingLevel;
pub use model::ListInfo;
pub use model::ListType;
pub use model::Metadata;
pub use model::Paragraph;
pub use model::Resource;
pub use model::ResourceType;
pub use model::Row;
pub use model::Section;
pub use model::Table;
pub use model::TextAlignment;
pub use model::TextRun;
pub use model::TextStyle;
Modules§
- container
- ZIP container abstraction for OOXML documents.
- detect
- Format detection for Office Open XML documents.
- docx
- DOCX (Word) document parser.
- error
- Error types for the undoc library.
- model
- Intermediate document model for Office documents.
- pptx
- PPTX (PowerPoint) presentation parser.
- render
- Output rendering for documents.
- xlsx
- XLSX (Excel) spreadsheet parser.
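Because OOXML documents are ZIP packages, byte-based format detection typically starts by checking for the ZIP local file header signature. The sketch below shows only that first step using the standard library. `looks_like_zip` is a hypothetical helper for illustration; it is not part of undoc and makes no claim about how the `detect` module is implemented.

```rust
/// First four bytes of a ZIP local file header: "PK\x03\x04".
const ZIP_LOCAL_HEADER: [u8; 4] = [0x50, 0x4B, 0x03, 0x04];

/// Returns true when `bytes` begins with a ZIP local file header.
/// Hypothetical helper for illustration; not part of undoc.
pub fn looks_like_zip(bytes: &[u8]) -> bool {
    bytes.starts_with(&ZIP_LOCAL_HEADER)
}
```

A positive result only says the data is a ZIP archive. Telling DOCX, XLSX, and PPTX apart requires looking inside the package, for example at `[Content_Types].xml`.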
Functions§
- extract_text
- Extract plain text from a document.
- parse_bytes
- Parse a document from bytes.
- parse_file
- Parse a document file and return a Document model.
- to_json
- Convert a document to JSON.
- to_markdown
- Convert a document to Markdown.
- to_markdown_with_options
- Convert a document to Markdown with options.
- to_text
- Convert a document to plain text with render options.
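When converting several files with the functions above, two small path helpers are often useful: one to filter inputs by extension and one to derive the output file name. The sketch below uses only the standard library. Both `is_supported_extension` and `markdown_output_path` are hypothetical names introduced here, not part of undoc.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of undoc): true when the path ends in
/// one of the extensions this crate documents (docx, xlsx, pptx),
/// compared case-insensitively.
pub fn is_supported_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| {
            let ext = ext.to_ascii_lowercase();
            matches!(ext.as_str(), "docx" | "xlsx" | "pptx")
        })
        .unwrap_or(false)
}

/// Hypothetical helper (not part of undoc): maps an input document path
/// to a sibling path with the `.md` extension.
pub fn markdown_output_path(input: &Path) -> PathBuf {
    input.with_extension("md")
}
```

In a batch loop, one might skip paths that fail `is_supported_extension`, call `to_markdown` as in the Quick Start, and write the result to `markdown_output_path(..)`. The Quick Start passes a string literal to `to_markdown`; whether other path types are accepted is not stated on this page.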