undoc

A high-performance Rust library and CLI that extracts content from Microsoft Office documents (DOCX, XLSX, PPTX) and converts it to Markdown, plain text, or JSON.

Features

Multi-format support: DOCX (Word), XLSX (Excel), PPTX (PowerPoint)
Multiple output formats: Markdown, Plain Text, JSON (with full metadata)
Structure preservation: Headings, lists, tables, inline formatting
Smart heading detection: Style-based heading recognition (English/Korean)
Table cell alignment: Proper left/center/right alignment in Markdown
PPTX table extraction: Full table parsing from PowerPoint slides
CJK text support: Smart spacing for Korean, Chinese, Japanese content
Asset extraction: Images, charts, and embedded media with resolved paths
Text cleanup: Multiple presets for LLM training data preparation
Self-update: Built-in update mechanism via GitHub releases
C-ABI FFI: Native library for C#, Python, and other languages
Parallel processing: Uses Rayon for multi-section documents

Table of Contents

Installation
CLI Usage
Rust Library Usage
C# / .NET Integration
Output Formats
Feature Flags
License

Installation

Pre-built Binaries (Recommended)

Download the latest release from GitHub Releases.

Windows (x64)

# Download and extract
Invoke-WebRequest -Uri "https://github.com/iyulab/undoc/releases/latest/download/undoc-cli-x86_64-pc-windows-msvc.zip" -OutFile "undoc.zip"
Expand-Archive -Path "undoc.zip" -DestinationPath "."

# Move to a directory in PATH (optional)
Move-Item -Path "undoc.exe" -Destination "$env:LOCALAPPDATA\Microsoft\WindowsApps\"

# Verify installation
undoc version

Linux (x64)

# Download and extract
curl -LO https://github.com/iyulab/undoc/releases/latest/download/undoc-cli-x86_64-unknown-linux-gnu.tar.gz
tar -xzf undoc-cli-x86_64-unknown-linux-gnu.tar.gz

# Install to /usr/local/bin (requires sudo)
sudo mv undoc /usr/local/bin/

# Or install to user directory
mkdir -p ~/.local/bin
mv undoc ~/.local/bin/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Verify installation
undoc version

macOS

# Intel Mac
curl -LO https://github.com/iyulab/undoc/releases/latest/download/undoc-cli-x86_64-apple-darwin.tar.gz
tar -xzf undoc-cli-x86_64-apple-darwin.tar.gz

# Apple Silicon (M1/M2/M3)
curl -LO https://github.com/iyulab/undoc/releases/latest/download/undoc-cli-aarch64-apple-darwin.tar.gz
tar -xzf undoc-cli-aarch64-apple-darwin.tar.gz

# Install
sudo mv undoc /usr/local/bin/

# Verify
undoc version

Available Binaries

Platform	Architecture	File
Windows	x64	`undoc-cli-x86_64-pc-windows-msvc.zip`
Linux	x64	`undoc-cli-x86_64-unknown-linux-gnu.tar.gz`
macOS	Intel	`undoc-cli-x86_64-apple-darwin.tar.gz`
macOS	Apple Silicon	`undoc-cli-aarch64-apple-darwin.tar.gz`

Updating

undoc includes a built-in self-update mechanism:

# Check for updates
undoc update --check

# Update to latest version
undoc update

# Force reinstall (even if on latest)
undoc update --force

Install via Cargo

If you have Rust installed:

# Install CLI
cargo install undoc-cli

# Add library to your project
cargo add undoc

CLI Usage

Quick Start

# Extract all formats (Markdown, text, JSON) + media to output directory
undoc document.docx

# Specify output directory
undoc document.docx ./output

# With text cleanup for LLM training
undoc document.docx --cleanup aggressive

Output Structure

document_output/
├── extract.md      # Markdown output with frontmatter
├── extract.txt     # Plain text output
├── content.json    # Full structured JSON
└── media/          # Extracted images and media
    └── image1.jpeg
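
For scripts that consume these files, the layout above can be captured in a small helper. This is only a sketch: it assumes the output directory is already known, and `OutputLayout`, `output_layout`, and `missing_files` are illustrative names rather than part of the undoc API.

```rust
use std::path::{Path, PathBuf};

/// Illustrative only: the files the default extraction writes, per the tree above.
#[derive(Debug)]
struct OutputLayout {
    markdown: PathBuf,  // extract.md
    text: PathBuf,      // extract.txt
    json: PathBuf,      // content.json
    media_dir: PathBuf, // media/
}

/// Maps an output directory to the expected file locations.
fn output_layout(output_dir: &Path) -> OutputLayout {
    OutputLayout {
        markdown: output_dir.join("extract.md"),
        text: output_dir.join("extract.txt"),
        json: output_dir.join("content.json"),
        media_dir: output_dir.join("media"),
    }
}

impl OutputLayout {
    /// Returns the expected files that are not present on disk.
    fn missing_files(&self) -> Vec<&Path> {
        let mut missing = Vec::new();
        for path in [&self.markdown, &self.text, &self.json] {
            if !path.exists() {
                missing.push(path.as_path());
            }
        }
        missing
    }
}
```

Calling `output_layout(Path::new("document_output")).missing_files()` after a run gives a quick way to check that all three text outputs were written.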

Commands

undoc <file> [output]              # Extract all formats (default)
undoc convert <file> [OPTIONS]     # Same as above, explicit command
undoc markdown <file> [OPTIONS]    # Convert to Markdown only (alias: md)
undoc text <file> [OPTIONS]        # Convert to plain text only
undoc json <file> [OPTIONS]        # Convert to JSON only
undoc info <file>                  # Show document information
undoc extract <file> [OPTIONS]     # Extract resources only
undoc update [OPTIONS]             # Self-update to latest version
undoc version                      # Show version information

Convert to Markdown

# Basic conversion (output to stdout)
undoc markdown document.docx

# Save to file
undoc markdown document.docx -o output.md

# With YAML frontmatter
undoc markdown document.docx --frontmatter -o output.md

# With text cleanup for LLM training
undoc markdown document.docx --cleanup standard -o cleaned.md

# Table rendering options
undoc markdown spreadsheet.xlsx --table-mode html -o output.md

# Limit heading depth
undoc markdown document.docx --max-heading 3 -o output.md

Markdown Options

Option	Description	Default
`-o, --output`	Output file path	stdout
`-f, --frontmatter`	Include YAML frontmatter	false
`--table-mode`	Table rendering: `markdown`, `html`, `ascii`	markdown
`--cleanup`	Text cleanup: `minimal`, `standard`, `aggressive`	none
`--max-heading`	Maximum heading level (1-6)	6
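
When the CLI is driven from another Rust program, the options in the table map directly onto an argument list. The following is a minimal sketch, not part of undoc itself: `MarkdownArgs`, `to_args`, and `markdown_command` are hypothetical names, only flags from the table are used, and `undoc` is assumed to be on `PATH`.

```rust
use std::process::Command;

/// Hypothetical helper (not part of undoc): holds the `undoc markdown`
/// options listed in the table above.
#[derive(Default)]
struct MarkdownArgs {
    output: Option<String>,     // -o, --output
    frontmatter: bool,          // -f, --frontmatter
    table_mode: Option<String>, // --table-mode markdown|html|ascii
    cleanup: Option<String>,    // --cleanup minimal|standard|aggressive
    max_heading: Option<u8>,    // --max-heading 1-6
}

impl MarkdownArgs {
    /// Builds the argument list that follows the `undoc` program name.
    fn to_args(&self, input: &str) -> Vec<String> {
        let mut args = vec!["markdown".to_string(), input.to_string()];
        if self.frontmatter {
            args.push("--frontmatter".to_string());
        }
        if let Some(mode) = &self.table_mode {
            args.extend(["--table-mode".to_string(), mode.clone()]);
        }
        if let Some(preset) = &self.cleanup {
            args.extend(["--cleanup".to_string(), preset.clone()]);
        }
        if let Some(level) = self.max_heading {
            args.extend(["--max-heading".to_string(), level.to_string()]);
        }
        if let Some(path) = &self.output {
            args.extend(["-o".to_string(), path.clone()]);
        }
        args
    }
}

/// Prepares (but does not run) the command; assumes `undoc` is on PATH.
fn markdown_command(input: &str, opts: &MarkdownArgs) -> Command {
    let mut cmd = Command::new("undoc");
    cmd.args(opts.to_args(input));
    cmd
}
```

Calling `.status()` or `.output()` on the returned `Command` runs the conversion. Options left unset are simply omitted, so the CLI defaults from the table apply.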

Convert to Plain Text

# Basic extraction
undoc text document.docx

# With cleanup
undoc text document.docx --cleanup standard -o output.txt

Convert to JSON

# Pretty-printed JSON
undoc json document.docx -o output.json

# Compact JSON
undoc json document.docx --compact -o output.json

Show Document Information

undoc info document.docx

Output:

Document Information
────────────────────────────────────────
File: document.docx
Format: Docx
Sections: 5
Resources: 3
Title: My Document
Author: John Doe
Pages/Slides/Sheets: 10
Created: 2025-01-15T10:30:00Z
Modified: 2025-01-20T14:45:00Z

Content Statistics
────────────────────────────────────────
Words: 2500
Characters: 15000
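
If this report needs to be read by a program, its `Key: Value` lines can be collected into a map. The sketch below assumes the layout shown in the sample output, which may differ between versions; `parse_info` is a hypothetical helper, not something undoc provides.

```rust
use std::collections::HashMap;

/// Parses `Key: Value` lines from `undoc info` output into a map.
/// Section titles and rule lines contain no ": " separator and are skipped.
fn parse_info(output: &str) -> HashMap<String, String> {
    output
        .lines()
        // Split at the first ": " only, so timestamps such as
        // `2025-01-15T10:30:00Z` stay intact in the value.
        .filter_map(|line| line.split_once(": "))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}
```

For anything beyond a quick check, `undoc json` is likely the sturdier interface, since it emits structured metadata instead of a human-readable report.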

Extract Resources

# Extract to current directory
undoc extract presentation.pptx

# Extract to specific directory
undoc extract presentation.pptx -o ./media

Self-Update

# Check for updates
undoc update --check

# Update to latest version
undoc update

# Force reinstall
undoc update --force

Examples

# Convert Word document to Markdown with frontmatter
undoc md report.docx --frontmatter -o report.md

# Convert Excel to Markdown tables
undoc md data.xlsx -o tables.md

# Convert PowerPoint to Markdown
undoc md presentation.pptx -o slides.md

# Extract all images from a document
undoc extract report.docx -o ./images

# Get document metadata
undoc info document.docx

# Convert with aggressive cleanup for AI training
undoc md document.docx --cleanup aggressive -o cleaned.md

# Batch conversion (shell)
for f in *.docx; do undoc md "$f" -o "${f%.docx}.md"; done

# Batch conversion (PowerShell)
Get-ChildItem *.docx | ForEach-Object { undoc md $_.FullName -o "$($_.BaseName).md" }
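
The same batch plan can be prepared from Rust before invoking the CLI. This is a sketch using only the standard library; `md_output_path` and `plan_batch` are hypothetical names. Unlike the shell glob above, the extension check here is case-insensitive.

```rust
use std::path::{Path, PathBuf};

/// Returns the Markdown output path for a `.docx` input, or `None` for other files.
fn md_output_path(input: &Path) -> Option<PathBuf> {
    let is_docx = input
        .extension()
        .and_then(|ext| ext.to_str())
        .map_or(false, |ext| ext.eq_ignore_ascii_case("docx"));
    if is_docx {
        Some(input.with_extension("md"))
    } else {
        None
    }
}

/// Lists (input, output) pairs for every `.docx` file directly inside `dir`.
fn plan_batch(dir: &Path) -> std::io::Result<Vec<(PathBuf, PathBuf)>> {
    let mut jobs = Vec::new();
    for entry in std::fs::read_dir(dir)? {
        let path = entry?.path();
        if let Some(out) = md_output_path(&path) {
            jobs.push((path, out));
        }
    }
    // Sort for a stable processing order; read_dir order is unspecified.
    jobs.sort();
    Ok(jobs)
}
```

Each pair can then be passed to `undoc md <input> -o <output>` through `std::process::Command`.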

Rust Library Usage

Quick Start

use undoc::{parse_file, render};

fn main() -> undoc::Result<()> {
    // Parse document
    let doc = parse_file("document.docx")?;

    // Convert to Markdown
    let options = render::RenderOptions::default();
    let markdown = render::to_markdown(&doc, &options)?;
    println!("{}", markdown);

    // Get plain text
    let text = render::to_text(&doc, &options)?;

    // Get JSON
    let json = render::to_json(&doc, render::JsonFormat::Pretty)?;

    Ok(())
}

Render Options

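use undoc::render::{self, CleanupPreset, RenderOptions, TableFallback};

// Builder-style configuration: each `with_*` call sets one option.
let options = RenderOptions::new()
    .with_frontmatter(true)
    .with_table_fallback(TableFallback::Html)
    .with_cleanup_preset(CleanupPreset::Aggressive)
    .with_max_heading(3);

// `doc` is a document returned by `parse_file`, as in the Quick Start.
let markdown = render::to_markdown(&doc, &options)?;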

Working with Document Structure

use undoc::parse_file;

let doc = parse_file("document.docx")?;

// Access metadata
println!("Title: {:?}", doc.metadata.title);
println!("Author: {:?}", doc.metadata.author);
println!("Created: {:?}", doc.metadata.created);

// Iterate sections
for section in &doc.sections {
    println!("Section: {:?}", section.name);
    for element in &section.elements {
        // Process paragraphs, tables, etc.
    }
}

// Extract resources
for (id, resource) in &doc.resources {
    let filename = resource.suggested_filename(id);
    std::fs::write(&filename, &resource.data)?;
}

Format Detection

use undoc::{detect_format_from_path, detect_format_from_bytes, FormatType};

// From file path
let format = detect_format_from_path("document.docx")?;
assert_eq!(format, FormatType::Docx);

// From bytes
let data = std::fs::read("document.docx")?;
let format = detect_format_from_bytes(&data)?;
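
To illustrate what path-based and byte-based detection can rely on, here is a standard-library sketch. It is not undoc's implementation: `OfficeKind`, `kind_from_extension`, and `looks_like_zip` are hypothetical names introduced for this example. The one fact it depends on is that DOCX, XLSX, and PPTX files are ZIP containers.

```rust
use std::path::Path;

/// Hypothetical stand-in for illustration; undoc's own type is `FormatType`.
#[derive(Debug, PartialEq, Clone, Copy)]
enum OfficeKind {
    Docx,
    Xlsx,
    Pptx,
}

/// Guesses the format from the file extension (case-insensitive).
fn kind_from_extension(path: &Path) -> Option<OfficeKind> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "docx" => Some(OfficeKind::Docx),
        "xlsx" => Some(OfficeKind::Xlsx),
        "pptx" => Some(OfficeKind::Pptx),
        _ => None,
    }
}

/// Checks for the ZIP local-file-header signature `PK\x03\x04`,
/// which every non-empty OOXML package starts with.
fn looks_like_zip(bytes: &[u8]) -> bool {
    bytes.starts_with(b"PK\x03\x04")
}
```

The signature alone cannot tell the three formats apart. Doing so from bytes requires looking inside the archive, for example at whether it contains `word/`, `xl/`, or `ppt/` parts.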

C# / .NET Integration

undoc provides C-ABI compatible bindings for integration with C# and .NET applications.

Getting the Native Library

Download from GitHub Releases:

Platform	Library File
Windows x64	`undoc.dll`
Linux x64	`libundoc.so`
macOS	`libundoc.dylib`

Or build from source:

cargo build --release --features ffi

C# Wrapper Usage

using Iyulab.Undoc;

// Parse and convert to Markdown
using var doc = UndocDocument.FromFile("document.docx");
string markdown = doc.ToMarkdown(MarkdownFlags.Frontmatter);
Console.WriteLine(markdown);

// Get document metadata
Console.WriteLine($"Title: {doc.Title}");
Console.WriteLine($"Author: {doc.Author}");
Console.WriteLine($"Sections: {doc.SectionCount}");
Console.WriteLine($"Resources: {doc.ResourceCount}");

// Convert to other formats
string text = doc.ToText();
string json = doc.ToJson(JsonFormat.Pretty);

// Work with resources
string resourceIds = doc.GetResourceIds(); // JSON array: ["rId1", "rId2"]
string info = doc.GetResourceInfo("rId1"); // JSON metadata
byte[] imageData = doc.GetResourceData("rId1"); // Binary data

See bindings/csharp/Undoc.cs for the complete wrapper implementation.
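
For readers unfamiliar with C-ABI bindings, the sketch below shows the usual ownership convention on the Rust side: a function hands out a heap-allocated string and a companion function frees it. The names `demo_to_upper` and `demo_free_string` are invented for this example and are not undoc's exported symbols. The wrapper's actual contract is defined in bindings/csharp/Undoc.cs.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical export: returns a newly allocated, NUL-terminated upper-case
/// copy of `input`, or null if `input` is null or not valid UTF-8.
/// In a real `cdylib` the function would also carry a `no_mangle` attribute
/// so that it is exported under this exact name.
///
/// # Safety
/// `input` must be null or point to a valid NUL-terminated string.
pub unsafe extern "C" fn demo_to_upper(input: *const c_char) -> *mut c_char {
    if input.is_null() {
        return std::ptr::null_mut();
    }
    let text = match unsafe { CStr::from_ptr(input) }.to_str() {
        Ok(t) => t,
        Err(_) => return std::ptr::null_mut(),
    };
    match CString::new(text.to_uppercase()) {
        // Ownership passes to the caller, who must call `demo_free_string`.
        Ok(s) => s.into_raw(),
        Err(_) => std::ptr::null_mut(),
    }
}

/// Frees a string previously returned by `demo_to_upper`.
///
/// # Safety
/// `ptr` must be null or a pointer obtained from `demo_to_upper`,
/// and must not be used again after this call.
pub unsafe extern "C" fn demo_free_string(ptr: *mut c_char) {
    if !ptr.is_null() {
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

A managed wrapper typically hides this pairing behind a disposable object, which is consistent with the `using var doc` pattern in the C# example above.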

Output Formats

Markdown

Structured Markdown with preserved formatting:

Headings: Document headings → #, ##, ###
Lists: Ordered and unordered with nesting
Tables: Markdown tables (with HTML/ASCII fallback for complex layouts)
Inline styles: Bold (**), italic (*), underline, strikethrough
Hyperlinks: Preserved as Markdown links
Images: Reference-style image links
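
Cell alignment in Markdown tables is carried by the delimiter row under the header. The sketch below shows that syntax with hypothetical helpers (`Align`, `delimiter_row`, `render_table`). It illustrates the output format only and is not how undoc renders tables internally. Escaping of `|` inside cells is left out.

```rust
/// Column alignment as it appears in a Markdown table's delimiter row.
#[derive(Clone, Copy)]
enum Align {
    Left,
    Center,
    Right,
}

/// Builds the delimiter row, e.g. `| :--- | :---: | ---: |`.
fn delimiter_row(aligns: &[Align]) -> String {
    let cells: Vec<&str> = aligns
        .iter()
        .map(|a| match a {
            Align::Left => ":---",
            Align::Center => ":---:",
            Align::Right => "---:",
        })
        .collect();
    format!("| {} |", cells.join(" | "))
}

/// Renders a header, the delimiter row, and body rows as a Markdown table.
fn render_table(header: &[&str], aligns: &[Align], rows: &[Vec<&str>]) -> String {
    let mut lines = vec![
        format!("| {} |", header.join(" | ")),
        delimiter_row(aligns),
    ];
    for row in rows {
        lines.push(format!("| {} |", row.join(" | ")));
    }
    lines.join("\n")
}
```

A colon on the left of the dashes means left-aligned, on the right means right-aligned, and on both sides means centered.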

Plain Text

Pure text content without formatting markers.

JSON

Complete document structure with metadata:

{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "created": "2025-01-15T10:30:00Z",
    "modified": "2025-01-20T14:45:00Z"
  },
  "sections": [...],
  "resources": [...]
}

Supported Formats

Format	Extension	Status
Word	.docx	Supported
Excel	.xlsx	Supported
PowerPoint	.pptx	Supported

Feature Flags

Feature	Description	Default
`ffi`	C-ABI foreign function interface	No

# Cargo.toml - enable FFI
[dependencies]
undoc = { version = "0.1", features = ["ffi"] }

Performance

Parallel section/sheet/slide processing with Rayon
Efficient XML parsing with quick-xml
Memory-efficient handling of large documents

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Related Projects

unhwp - Korean HWP document extraction

undoc 0.1.13
