undoc
A high-performance Rust library for extracting content from Microsoft Office documents (DOCX, XLSX, PPTX) to Markdown, plain text, and JSON.
Features
- Multi-format support: DOCX (Word), XLSX (Excel), PPTX (PowerPoint)
- Multiple output formats: Markdown, Plain Text, JSON (with full metadata)
- Structure preservation: Headings, lists, tables, inline formatting
- Smart heading detection: Style-based heading recognition (English/Korean)
- Table cell alignment: Proper left/center/right alignment in Markdown
- PPTX table extraction: Full table parsing from PowerPoint slides
- CJK text support: Smart spacing for Korean, Chinese, Japanese content
- Asset extraction: Images, charts, and embedded media with resolved paths (XLSX drawings included)
- Rich content: Footnotes/endnotes, headers/footers, text boxes, cell comments, hyperlinks
- Section markers: <!-- slide N: Name --> / <!-- sheet N: Name --> boundary markers for PPTX/XLSX
- Text cleanup: Multiple presets for LLM training data preparation
- Self-update: Built-in update mechanism via GitHub releases
- C-ABI FFI: Native library for C#, Python, and other languages
- Parallel processing: Uses Rayon for multi-section documents
- Streaming pipeline (0.3.0+): parse_file_streaming yields sections as they parse; peak memory bounded regardless of document size
- WebAssembly (0.4.0+): @iyulab/undoc npm package for parsing Office documents in the browser, no server required
Table of Contents
- Installation
- WASM / Browser
- CLI Usage
- Rust Library Usage
- C# / .NET Integration
- Output Formats
- Feature Flags
- License
WASM / Browser
Parse Office documents in the browser — no server, no upload:
import init from '@iyulab/undoc';
await ;
const doc = ;
console.log; // "docx" | "xlsx" | "pptx"
console.log;
Install: npm install @iyulab/undoc
Installation
Pre-built Binaries (Recommended)
Download the latest release from GitHub Releases.
Windows (x64)
# Download and extract (VERSION is read from the latest release tag, e.g. v0.3.1)
$VERSION = (Invoke-RestMethod "https://api.github.com/repos/iyulab/undoc/releases/latest").tag_name
Invoke-WebRequest -Uri "https://github.com/iyulab/undoc/releases/latest/download/undoc-windows-x86_64-${VERSION}.zip" -OutFile "undoc.zip"
Expand-Archive -Path "undoc.zip" -DestinationPath "."
# Move to a directory in PATH (optional)
Move-Item -Path "undoc.exe" -Destination "$env:LOCALAPPDATA\Microsoft\WindowsApps\"
# Verify installation
undoc version
Linux (x64)
# Download and extract (replace VERSION with the actual version, e.g. v0.3.1)
VERSION=
# Install to /usr/local/bin (requires sudo)
# Or install to user directory
# Verify installation
macOS
# Intel Mac (replace VERSION with the actual version, e.g. v0.3.1)
VERSION=
# Apple Silicon (M1/M2/M3/M4)
# Install
# Verify
Available Binaries
| Platform | Architecture | File |
|---|---|---|
| Windows | x64 | undoc-windows-x86_64-{version}.zip |
| Linux | x64 | undoc-linux-x86_64-{version}.tar.gz |
| macOS | Intel | undoc-macos-x86_64-{version}.tar.gz |
| macOS | Apple Silicon | undoc-macos-aarch64-{version}.tar.gz |
Updating
undoc includes a built-in self-update mechanism:
# Check for updates
# Update to latest version
# Force reinstall (even if on latest)
Install via Cargo
If you have Rust installed:
# Install CLI
# Add library to your project
CLI Usage
Quick Start
# Convert to Markdown + extract media (default)
# Specify output directory
# With text cleanup for LLM training
Output Structure
document_output/
├── extract.md      # Markdown output
└── media/          # Extracted images and media
    └── image1.jpeg
Use undoc convert <file> --all to produce all three formats at once:
document_output/
├── extract.md # Markdown output
├── extract.txt # Plain text output
├── content.json # Full structured JSON
└── media/
Commands
Convert (multi-format streaming pipeline)
# Markdown only (default)
# All formats + media
# Specific formats
# Skip media extraction
# Lossless mode (page breaks + headers/footers)
# Insert section boundary markers for PPTX/XLSX
Convert Options
| Option | Description | Default |
|---|---|---|
| -o, --output | Output directory | <stem>_output/ |
| --formats | Comma-separated formats: md,txt,json | md |
| --all | Output all formats (MD + TXT + JSON) | false |
| --no-images | Skip media extraction | false |
| --cleanup | Text cleanup: minimal, standard, aggressive | none |
| --section-markers | Insert <!-- slide/sheet N: Name --> markers (PPTX/XLSX) | false |
| --emit-page-breaks | Emit --- for hard page breaks | false |
| --include-headers-footers | Include section headers/footers as blockquotes | false |
| --lossless | Enable both --emit-page-breaks and --include-headers-footers | false |
| -q, --quiet | Suppress progress output | false |
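The default output directory follows the pattern <stem>_output/. A minimal sketch of how such a name can be derived from an input path, using only the standard library. The helper name default_output_dir is hypothetical, and placing the directory alongside the input file is an assumption the table does not state; this is not undoc's own code.

```rust
use std::path::{Path, PathBuf};

/// Derive `<stem>_output` for an input file.
/// Hypothetical helper: mirrors the documented default name only.
/// Assumption: the directory sits next to the input file.
fn default_output_dir(input: &Path) -> PathBuf {
    // file_stem() drops the final extension: "report.docx" -> "report".
    let stem = input
        .file_stem()
        .map(|s| s.to_string_lossy().into_owned())
        .unwrap_or_else(|| "document".to_string());
    // with_file_name() keeps the parent directory and swaps the last component.
    input.with_file_name(format!("{stem}_output"))
}
```

Calling default_output_dir(Path::new("report.docx")) gives the relative path report_output, which matches the document_output/ layout shown under Output Structure.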
Convert to Markdown
# Basic conversion (output to stdout)
# Save to file
# With YAML frontmatter
# With text cleanup for LLM training
# Table rendering options
# Limit heading depth
# Insert section boundary markers for PPTX/XLSX
# Lossless mode (preserve page breaks and headers/footers)
Markdown Options
| Option | Description | Default |
|---|---|---|
| -o, --output | Output file path | stdout |
| -f, --frontmatter | Include YAML frontmatter | false |
| --table-mode | Table rendering: markdown, html, ascii | markdown |
| --cleanup | Text cleanup: minimal, standard, aggressive | none |
| --max-heading | Maximum heading level (1-6) | 4 |
| --section-markers | Insert <!-- slide/sheet N: Name --> markers (PPTX/XLSX) | false |
| --emit-page-breaks | Emit --- for hard page breaks | false |
| --include-headers-footers | Include section headers/footers as blockquotes | false |
| --lossless | Enable both --emit-page-breaks and --include-headers-footers | false |
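Section markers make it possible to split rendered Markdown back into slides or sheets downstream. The sketch below parses one marker line of the documented shape <!-- slide N: Name --> or <!-- sheet N: Name -->. The names MarkerKind and parse_section_marker are hypothetical, and the exact whitespace undoc emits is an assumption, so the parser trims generously.

```rust
/// Which kind of section a marker introduces.
#[derive(Debug, PartialEq)]
enum MarkerKind {
    Slide,
    Sheet,
}

/// Parse a line such as `<!-- slide 3: Agenda -->` into (kind, number, name).
/// Returns None for any line that is not a section marker.
fn parse_section_marker(line: &str) -> Option<(MarkerKind, usize, String)> {
    // Strip the HTML comment delimiters and surrounding whitespace.
    let inner = line
        .trim()
        .strip_prefix("<!--")?
        .strip_suffix("-->")?
        .trim();
    let (kind, rest) = if let Some(r) = inner.strip_prefix("slide ") {
        (MarkerKind::Slide, r)
    } else if let Some(r) = inner.strip_prefix("sheet ") {
        (MarkerKind::Sheet, r)
    } else {
        return None;
    };
    // Split "3: Agenda" at the first colon; names may contain further colons.
    let (num, name) = rest.split_once(':')?;
    let number = num.trim().parse::<usize>().ok()?;
    Some((kind, number, name.trim().to_string()))
}
```

Ordinary Markdown lines return None, so a consumer can scan the output line by line and start a new chunk whenever a marker is recognised.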
Convert to Plain Text
# Basic extraction
# With cleanup
Convert to JSON
# Pretty-printed JSON
# Compact JSON
Show Document Information
Output:
Document Information
────────────────────────────────────────
File: document.docx
Format: Docx
Sections: 5
Resources: 3
Title: My Document
Author: John Doe
Pages/Slides/Sheets: 10
Created: 2025-01-15T10:30:00Z
Modified: 2025-01-20T14:45:00Z
Content Statistics
────────────────────────────────────────
Words: 2500
Characters: 15000
Extract Resources
# Extract to current directory
# Extract to specific directory
Self-Update
# Check for updates
# Update to latest version
# Force reinstall
Examples
# Convert Word document to Markdown with frontmatter
# Convert Excel to Markdown tables
# Convert PowerPoint to Markdown with slide markers
# Extract all images from a document
# Get document metadata
# Convert with aggressive cleanup for AI training
# Batch conversion (shell)
for; do ; done
# Batch conversion (PowerShell)
|
Rust Library Usage
Quick Start
use ;
Convenience Functions
// One-shot conversions without building a Document first
let text = extract_text?;
let markdown = to_markdown?;
let json = to_json?;
// Parse from bytes
let doc = parse_bytes?;
Render Options
use ;
use SectionMarkerStyle;
let options = new
.with_frontmatter
.with_table_fallback
.with_cleanup_preset
.with_max_heading
.with_section_markers; // <!-- slide N: Name -->
let markdown = to_markdown?;
// Lossless mode: preserve page breaks and headers/footers
let lossless = lossless;
let markdown = to_markdown?;
Working with Document Structure
use parse_file;
let doc = parse_file?;
// Access metadata
println!;
println!;
println!;
// Iterate sections
for section in &doc.sections
// Extract resources
for in &doc.resources
Streaming Pipeline
Supported for PPTX (per slide) and XLSX (per sheet). DOCX is not yet supported.
use ControlFlow;
use ;
parse_file_streaming?;
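The import of ControlFlow suggests that the streaming callback decides after each section whether parsing continues. The sketch below shows that pattern with the standard library only. stream_sections is a hypothetical stand-in that iterates over an in-memory slice; it is not parse_file_streaming, whose real signature is not shown here.

```rust
use std::ops::ControlFlow;

/// Hypothetical stand-in for a streaming parser: hands each section to
/// `on_section` as soon as it is available and stops when the callback
/// returns `Break`. Returns how many sections were delivered.
fn stream_sections<F>(sections: &[&str], mut on_section: F) -> usize
where
    F: FnMut(usize, &str) -> ControlFlow<()>,
{
    let mut delivered = 0;
    for (index, section) in sections.iter().copied().enumerate() {
        delivered += 1;
        if let ControlFlow::Break(()) = on_section(index, section) {
            break;
        }
    }
    delivered
}

/// Keep section text until a byte budget is reached, then stop early.
fn collect_until(sections: &[&str], budget: usize) -> (Vec<String>, usize) {
    let mut kept = Vec::new();
    let mut used = 0;
    let delivered = stream_sections(sections, |_, text| {
        used += text.len();
        kept.push(text.to_string());
        if used >= budget {
            ControlFlow::Break(())
        } else {
            ControlFlow::Continue(())
        }
    });
    (kept, delivered)
}
```

Because the callback can break out, a consumer that only needs the first few slides or sheets never pays for the remainder of the file, which is what keeps peak memory bounded.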
Format Detection
use ;
// From file path
let format = detect_format_from_path?;
assert_eq!;
// From bytes
let data = read?;
let format = detect_format_from_bytes?;
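Detection from bytes is possible because DOCX, XLSX, and PPTX files are ZIP packages. The heuristic below checks the ZIP signature and then looks for the conventional main part name of each format. It is a sketch under stated assumptions: sniff_office_format is a hypothetical name, it is not how detect_format_from_bytes is implemented, and a package may legally store its main part under a different name.

```rust
/// Return true if `needle` occurs anywhere in `haystack`.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Guess the Office format from raw bytes (heuristic sketch only).
fn sniff_office_format(data: &[u8]) -> Option<&'static str> {
    // A non-empty ZIP archive starts with the local file header "PK\x03\x04".
    if !data.starts_with(b"PK\x03\x04") {
        return None;
    }
    // ZIP stores entry names uncompressed, so part names are visible as plain bytes.
    if contains(data, b"word/document.xml") {
        Some("docx")
    } else if contains(data, b"xl/workbook.xml") {
        Some("xlsx")
    } else if contains(data, b"ppt/presentation.xml") {
        Some("pptx")
    } else {
        None
    }
}
```

A robust detector would read the package's content types and relationships instead of scanning for names; the byte scan only illustrates why a file extension is not required.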
C# / .NET Integration
undoc provides C-ABI compatible bindings for integration with C# and .NET applications.
Getting the Native Library
Download from GitHub Releases:
| Platform | Library File |
|---|---|
| Windows x64 | undoc.dll |
| Linux x64 | libundoc.so |
| macOS | libundoc.dylib |
Or build from source:
C# Wrapper Usage
using Iyulab.Undoc;
// Parse and convert to Markdown
using var doc = UndocDocument.FromFile("document.docx");
string markdown = doc.ToMarkdown(MarkdownFlags.Frontmatter);
Console.WriteLine(markdown);
// Get document metadata
Console.WriteLine($"Title: {doc.Title}");
Console.WriteLine($"Author: {doc.Author}");
Console.WriteLine($"Sections: {doc.SectionCount}");
Console.WriteLine($"Resources: {doc.ResourceCount}");
// Convert to other formats
string text = doc.ToText();
string json = doc.ToJson(JsonFormat.Pretty);
// Work with resources
string resourceIds = doc.GetResourceIds(); // JSON array: ["rId1", "rId2"]
string info = doc.GetResourceInfo("rId1"); // JSON metadata
byte[] imageData = doc.GetResourceData("rId1"); // Binary data
See bindings/csharp/Undoc.cs for the complete wrapper implementation.
Output Formats
Markdown
Structured Markdown with preserved formatting:
- Headings: Document headings → #, ##, ###
- Lists: Ordered and unordered with nesting
- Tables: Markdown tables (with HTML/ASCII fallback for complex layouts)
- Inline styles: Bold (**), italic (*), underline (<u>), strikethrough, superscript/subscript
- Hyperlinks: Preserved as Markdown links (DOCX, XLSX, PPTX)
- Images: Linked image references from document drawings
- Footnotes: Markdown reference-style ([^N] / [^N]: text)
- Headers/Footers: Rendered as blockquotes (opt-in via --include-headers-footers)
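Cell alignment in Markdown pipe tables is carried by the delimiter row under the header, where colons mark the aligned side. A minimal sketch of building that row follows. Align and delimiter_row are hypothetical names used for illustration, and the exact dash count undoc emits is not specified here.

```rust
/// Horizontal alignment of one table column.
#[derive(Clone, Copy)]
enum Align {
    Left,
    Center,
    Right,
}

/// Build the delimiter row that follows a Markdown table header.
fn delimiter_row(columns: &[Align]) -> String {
    let cells: Vec<&str> = columns
        .iter()
        .map(|align| match align {
            Align::Left => ":---",    // colon on the left
            Align::Center => ":---:", // colons on both sides
            Align::Right => "---:",   // colon on the right
        })
        .collect();
    format!("| {} |", cells.join(" | "))
}
```

For a three-column table aligned left, center, and right, the function returns | :--- | :---: | ---: |.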
Plain Text
Pure text content without formatting markers.
JSON
Complete document structure with metadata:
Supported Formats
| Format | Extension | Status |
|---|---|---|
| Word | .docx | Supported |
| Excel | .xlsx | Supported |
| PowerPoint | .pptx | Supported |
Feature Flags
| Feature | Description | Default |
|---|---|---|
| ffi | C-ABI foreign function interface | No |
# Cargo.toml - enable FFI
[dependencies]
undoc = { version = "0.3", features = ["ffi"] }
Performance
- Parallel section/sheet/slide processing with Rayon
- Efficient XML parsing with quick-xml
- Memory-efficient handling of large documents
- Streaming pipeline for bounded peak memory on large files
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.