unhwp
A high-performance Rust library for extracting HWP/HWPX Korean word processor documents into structured Markdown with assets.
Features
- Multi-format support: HWP 5.0 (OLE) and HWPX (XML/ZIP)
- Multiple output formats: Markdown, Plain Text, JSON (with full metadata)
- Structure preservation: Headings, lists, tables, inline formatting
- Asset extraction: Images and binary resources
- Streaming API: Memory-efficient section-by-section processing for large documents
- Section markers: Optional <!-- section N --> boundary markers in output
- Self-update: Built-in update mechanism via GitHub releases
- C-ABI FFI: Native library for C#, Python, and other languages
- Parallel processing: HWPX sections parsed in parallel via Rayon
- Async support: Optional Tokio integration
Table of Contents
- Installation
- CLI Usage
- Rust Library Usage
- C# / .NET Integration
- Output Formats
- Feature Flags
- License
Installation
Pre-built Binaries (Recommended)
Download the latest release from GitHub Releases.
Windows (x64)
# Auto-detect latest version and download
$VERSION = (Invoke-RestMethod "https://api.github.com/repos/iyulab/unhwp/releases/latest").tag_name
Invoke-WebRequest -Uri "https://github.com/iyulab/unhwp/releases/latest/download/unhwp-windows-x86_64-${VERSION}.zip" -OutFile "unhwp.zip"
Expand-Archive -Path "unhwp.zip" -DestinationPath "."
# Move to a directory in PATH (optional)
Move-Item -Path "unhwp.exe" -Destination "$env:LOCALAPPDATA\Microsoft\WindowsApps\"
# Verify installation
unhwp --version
Linux (x64)
# Auto-detect latest version and download
VERSION=
# Install to /usr/local/bin (requires sudo)
# Or install to user directory
# Verify installation
macOS
VERSION=
# Intel Mac
# Apple Silicon (M1/M2/M3/M4)
# Install
# Verify
Available Binaries
| Platform | Architecture | File |
|---|---|---|
| Windows | x64 | unhwp-windows-x86_64-{version}.zip |
| Linux | x64 | unhwp-linux-x86_64-{version}.tar.gz |
| macOS | Intel | unhwp-macos-x86_64-{version}.tar.gz |
| macOS | Apple Silicon | unhwp-macos-aarch64-{version}.tar.gz |
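If you script the download yourself, the table above maps directly onto a small lookup. The following is a minimal sketch, not part of unhwp: `asset_name` and `download_url` are hypothetical helper names, and the URL pattern is assumed from the Windows snippet above. The `os` and `arch` arguments are expected to match the values of `std::env::consts::OS` and `std::env::consts::ARCH`.

```rust
/// Builds the release archive name for a platform, following the
/// "Available Binaries" table. Returns None for combinations the
/// table does not list. (Hypothetical helper, not an unhwp API.)
fn asset_name(os: &str, arch: &str, version: &str) -> Option<String> {
    let (platform, ext) = match (os, arch) {
        ("windows", "x86_64") => ("windows-x86_64", "zip"),
        ("linux", "x86_64") => ("linux-x86_64", "tar.gz"),
        ("macos", "x86_64") => ("macos-x86_64", "tar.gz"),
        ("macos", "aarch64") => ("macos-aarch64", "tar.gz"),
        _ => return None,
    };
    Some(format!("unhwp-{platform}-{version}.{ext}"))
}

/// Builds the download URL for an archive name. The pattern is assumed
/// from the Windows install snippet (releases/latest/download/<file>).
fn download_url(asset: &str) -> String {
    format!("https://github.com/iyulab/unhwp/releases/latest/download/{asset}")
}
```

Pass the release tag exactly as GitHub reports it in `tag_name`; the sketch does not add or strip any prefix.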
Updating
unhwp includes a built-in self-update mechanism:
# Check for updates
# Update to latest version
# Force reinstall (even if on latest)
The update command automatically:
- Detects your platform
- Downloads the appropriate binary from GitHub Releases
- Replaces the current executable
- Preserves your settings
Install via Cargo
If you have Rust installed:
# Install CLI
# Add library to your project
CLI Usage
Basic Conversion
# Convert HWP/HWPX to Markdown (creates <filename>_output/ directory)
# Specify output directory
# Using subcommand
Output Structure
By default, only Markdown is produced. Use --formats or --all to add more formats:
# Default: Markdown only
# → document_output/extract.md
# → document_output/images/
# All formats
# → document_output/extract.md
# → document_output/extract.txt
# → document_output/content.json
# → document_output/images/
# Specific formats
Cleanup Options (for LLM Training Data)
# Standard cleanup (default when --cleanup is specified without a value)
# Preset selection
Section Markers
Insert <!-- section N --> boundaries to identify document sections in output:
Output:
Lorem ipsum...
Commands
Examples
# Basic conversion
# All formats + cleanup for AI training
# Section-aware Markdown for downstream parsing
# Skip image extraction (faster)
# Quiet batch conversion (shell) — -q is on the `convert` subcommand
for; do ; done
# Quiet batch conversion (PowerShell)
|
Rust Library Usage
Quick Start
use ;
Streaming API
For large documents, process section-by-section without loading the entire document into memory:
use std::ops::ControlFlow;
use ;
Event order is always: DocumentStart → (SectionParsed | SectionFailed)* → DocumentEnd → ResourceExtracted*
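The ordering guarantee can be read as a small state machine. The sketch below checks a recorded event sequence against it, which may be handy in tests of your own event handler. The `Event` enum here is a hypothetical, payload-free stand-in that only borrows the event names from the line above; unhwp's real event type is not shown in this README and will differ.

```rust
/// Hypothetical stand-in for the streaming events (names only, no payloads).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    DocumentStart,
    SectionParsed,
    SectionFailed,
    DocumentEnd,
    ResourceExtracted,
}

/// Position within the documented sequence.
#[derive(Clone, Copy, PartialEq)]
enum Phase {
    BeforeStart,
    InSections,
    AfterEnd,
}

/// Returns true if `events` is a complete run of the form
/// DocumentStart, (SectionParsed | SectionFailed)*, DocumentEnd, ResourceExtracted*.
fn is_valid_order(events: &[Event]) -> bool {
    let mut phase = Phase::BeforeStart;
    for &event in events {
        phase = match (phase, event) {
            (Phase::BeforeStart, Event::DocumentStart) => Phase::InSections,
            (Phase::InSections, Event::SectionParsed | Event::SectionFailed) => Phase::InSections,
            (Phase::InSections, Event::DocumentEnd) => Phase::AfterEnd,
            (Phase::AfterEnd, Event::ResourceExtracted) => Phase::AfterEnd,
            // Any other transition violates the documented order.
            _ => return false,
        };
    }
    // A complete run must have reached DocumentEnd.
    phase == Phase::AfterEnd
}
```

The check assumes an uninterrupted run; a sequence cut short before DocumentEnd is reported as invalid.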
Section Markers
use ;
let options = default
.with_section_markers;
let markdown = to_markdown_with_options?;
// Each section preceded by <!-- section N -->
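Downstream code can use the markers to split the rendered Markdown back into sections. Here is a minimal sketch using only the standard library. It assumes each marker sits on its own line in exactly the form `<!-- section N -->` with a decimal N; `split_sections` and `parse_marker` are hypothetical helpers, not part of unhwp.

```rust
/// Splits Markdown into (section number, body) pairs at marker lines.
/// Text before the first marker is dropped.
fn split_sections(markdown: &str) -> Vec<(u32, String)> {
    let mut sections: Vec<(u32, String)> = Vec::new();
    for line in markdown.lines() {
        match parse_marker(line) {
            // A marker line opens a new, initially empty section.
            Some(n) => sections.push((n, String::new())),
            // Any other line belongs to the most recent section, if any.
            None => {
                if let Some((_, body)) = sections.last_mut() {
                    body.push_str(line);
                    body.push('\n');
                }
            }
        }
    }
    sections
}

/// Parses the section number from a marker line, or returns None
/// if the line is not a well-formed marker.
fn parse_marker(line: &str) -> Option<u32> {
    line.trim()
        .strip_prefix("<!-- section ")?
        .strip_suffix(" -->")?
        .parse()
        .ok()
}
```

The section number is taken from the marker as written, so the sketch makes no assumption about whether numbering starts at 0 or 1.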
Output Formats
unhwp provides four complementary output formats:
| Format | Method | Description |
|---|---|---|
| RawContent | doc.raw_content() | JSON with full metadata, styles, structure |
| RawText | doc.plain_text() | Pure text without formatting |
| Markdown | to_markdown() | Structured Markdown |
| Images | doc.resources | Extracted binary assets |
RawContent (JSON)
Get the complete document structure with all metadata:
let doc = parse_file?;
let json = doc.raw_content;
// JSON includes:
// - metadata: title, author, created, modified
// - sections: paragraphs, tables
// - styles: bold, italic, underline, font, color
// - tables: rows, cells, colspan, rowspan
// - images, equations, links
Builder API
use ;
let markdown = new
.with_images
.with_image_dir
.with_table_fallback
.with_frontmatter
.lenient // Continue past invalid sections
.parse?
.to_markdown?;
RenderOptions
use ;
let options = default
.with_frontmatter
.with_table_fallback
.with_max_heading_level
.with_image_dir
.with_image_prefix
.with_cleanup // Standard cleanup
.with_section_markers; // <!-- section N -->
let markdown = to_markdown_with_options?;
C# / .NET Integration
unhwp provides C-ABI compatible bindings for seamless integration with C# and .NET applications.
Installation via NuGet
Or via Package Manager Console:
Install-Package Unhwp
Getting the Native Library (Manual)
Build from source or download from GitHub Releases:
| Platform | Library File |
|---|---|
| Windows x64 | unhwp.dll |
| Linux x64 | libunhwp.so |
| macOS | libunhwp.dylib |
# Build native library from source (requires ffi feature)
Quick Start
using Unhwp;
// Parse document — always use `using` to release native memory
using var doc = UnhwpDocument.ParseFile("document.hwp");
// Get Markdown
string markdown = doc.ToMarkdown();
File.WriteAllText("output.md", markdown);
// Get plain text
string text = doc.ToText();
// Get full structured JSON
string json = doc.ToJson();
// With frontmatter
var opts = new MarkdownOptions { IncludeFrontmatter = true };
string mdWithFm = doc.ToMarkdown(opts);
// Document statistics
Console.WriteLine($"Sections: {doc.SectionCount}");
Console.WriteLine($"Resources: {doc.ResourceCount}");
Console.WriteLine($"Title: {doc.Title}");
Console.WriteLine($"Author: {doc.Author}");
Parse from Bytes
byte[] data = File.ReadAllBytes("document.hwp");
using var doc = UnhwpDocument.ParseBytes(data);
string markdown = doc.ToMarkdown();
Extract Resources
using var doc = UnhwpDocument.ParseFile("document.hwp");
Directory.CreateDirectory("./images");
foreach (var resourceId in doc.GetResourceIds())
{
var bytes = doc.GetResourceData(resourceId);
if (bytes != null)
File.WriteAllBytes($"./images/{resourceId}", bytes);
}
ASP.NET Core Example
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Unhwp;

[ApiController]
[Route("api/[controller]")]
public class DocumentController : ControllerBase
{
[HttpPost("convert")]
public async Task<IActionResult> ConvertHwp(IFormFile file)
{
if (file == null) return BadRequest("No file");
using var ms = new MemoryStream();
await file.CopyToAsync(ms);
try
{
using var doc = UnhwpDocument.ParseBytes(ms.ToArray());
var opts = new MarkdownOptions { IncludeFrontmatter = true };
return Ok(new { markdown = doc.ToMarkdown(opts) });
}
catch (UnhwpException ex)
{
return BadRequest(new { error = ex.Message });
}
}
}
Supported Formats
| Format | Container | Status |
|---|---|---|
| HWP 5.0+ | OLE/CFB | ✅ Supported |
| HWPX | ZIP/XML | ✅ Supported |
| HWP 3.x | Binary | ✅ Supported (feature: hwp3) |
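The two main containers in the table can be told apart by their leading bytes, which is useful when file extensions are missing or wrong. The sketch below relies on the standard CFB and ZIP signatures; the `Container` enum and `sniff_container` function are hypothetical names for illustration, and nothing here describes how unhwp itself detects formats. HWP 3.x is left out.

```rust
/// Container kinds that can be recognised from the first bytes of a file.
#[derive(Debug, PartialEq)]
enum Container {
    /// OLE/CFB compound file, the container used by HWP 5.0+.
    Cfb,
    /// ZIP archive, the container used by HWPX.
    Zip,
    /// Anything else, including HWP 3.x, which this sketch does not cover.
    Unknown,
}

/// Classifies a byte buffer by its signature.
fn sniff_container(bytes: &[u8]) -> Container {
    // Signature of a Compound File Binary (OLE2) header.
    const CFB_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];
    // Signature of a ZIP local file header ("PK\x03\x04").
    const ZIP_MAGIC: [u8; 4] = [0x50, 0x4B, 0x03, 0x04];

    if bytes.starts_with(&CFB_MAGIC) {
        Container::Cfb
    } else if bytes.starts_with(&ZIP_MAGIC) {
        Container::Zip
    } else {
        Container::Unknown
    }
}
```

A match only identifies the container: other formats also use CFB and ZIP, so a full parse is still needed to confirm the file is an HWP or HWPX document.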
Structure Preservation
unhwp maintains document structure during conversion:
- Headings: Outline levels → #, ##, ###
- Lists: Ordered and unordered with nesting
- Tables: Cell spans, alignment, HTML fallback for complex tables
- Images: Extracted with Markdown references
- Inline styles: Bold (**), italic (*), underline (<u>), strikethrough (~~)
- Equations: LaTeX or script format
Feature Flags
| Feature | Description | Default |
|---|---|---|
| hwp5 | HWP 5.0 binary format support | ✅ |
| hwpx | HWPX XML format support | ✅ |
| hwp3 | Legacy HWP 3.x support (EUC-KR) | ❌ |
| ffi | C-ABI foreign function interface (required for C# / native bindings) | ❌ |
| async | Async I/O with Tokio | ❌ |
Performance
- Parallel section processing with Rayon
- Zero-copy parsing where possible
- Memory-efficient streaming for large documents
Run benchmarks:
License
MIT License - see LICENSE for details.