Office Oxide — The Fastest Native Office Document Library
A fast, memory-safe library for text extraction from Office documents. Rust core with first-class bindings for Python, Go, C#/.NET, Node.js (native and WASM), and a stable C FFI. Handles DOCX, XLSX, PPTX, DOC, XLS, and PPT. Up to 100× faster than python-docx, openpyxl, python-pptx, and xlrd. Roughly 2.8× faster than python-calamine on XLSX by mean latency (5.0ms vs 13.9ms). 100% pass rate on valid Office files: zero failures on legitimate Word/Excel/PowerPoint documents. MIT/Apache-2.0 dual-licensed.
Scope of "fastest". Benchmarks compare Office Oxide against other native / embeddable libraries (no JVM runtime required): python-docx, openpyxl, python-pptx, python-calamine, xlrd, markitdown, catdoc, antiword, xls2csv, calamine (Rust), dotext, docx-rs. Apache POI and Apache Tika are out of scope for this comparison because they require a JVM and are targeted at a different deployment shape. POI/Tika numbers may be added in a future release.
Available bindings
| Language | Package | Directory | Docs |
|---|---|---|---|
| Rust | office_oxide on crates.io | src/ | lib.rs |
| Python | office-oxide on PyPI | python/ | python/ |
| Go | github.com/yfedoseev/office_oxide/go | go/ | go/README.md |
| C# / .NET | OfficeOxide on NuGet | csharp/ | csharp/OfficeOxide/README.md |
| Node.js (native) | office-oxide on npm | js/ | js/README.md |
| Node.js / browser (WASM) | office-oxide-wasm on npm | wasm-pkg/ | wasm-pkg/README.md |
| C / other | header-only via FFI | include/office_oxide_c/ | office_oxide.h |
| CLI | office-oxide binary | crates/office_oxide_cli/ | |
| MCP server | office-oxide-mcp binary | crates/office_oxide_mcp/ | |
Ready-to-run demos (extract, replace, read_xlsx) exist for every binding under examples/. Deeper language-specific guides live in docs/: Rust · Python · Go · C# · JavaScript (native) · WASM · C FFI.
Quick Start
Python
# One-liner text extraction
=
=
# Context-managed document; accepts str or pathlib.Path
# "pptx"
Rust
use office_oxide::Document;
let doc = Document::open("report.docx")?;
let text = doc.plain_text();
let markdown = doc.to_markdown();
let ir = doc.to_ir(); // Format-agnostic intermediate representation
[dependencies]
office_oxide = "0.1.1"
JavaScript / WASM
Browser + bundlers:
import from "office-oxide-wasm";
const doc = ;
console.log;
console.log;
doc.;
Node.js native (koffi + C FFI, no node-gyp):
import from "office-oxide";
using doc = ;
console.log; // "docx"
console.log;
console.log;
console.log;
Go
import oo "github.com/yfedoseev/office_oxide/go"
doc, _ := oo.Open("report.docx")
defer doc.Close()
text, _ := doc.PlainText()
md, _ := doc.ToMarkdown()
C# / .NET
using OfficeOxide;
using var doc = Document.Open("report.docx");
Console.WriteLine(doc.Format); // "docx"
Console.WriteLine(doc.PlainText());
Console.WriteLine(doc.ToMarkdown());
C (raw FFI)
Include include/office_oxide_c/office_oxide.h and link against liboffice_oxide. See examples/c/extract.c for a working sample.
Why office_oxide?
- Fast — 8-100× faster than python-docx, openpyxl, python-pptx, xlrd; beats calamine on XLSX
- Reliable — 100% pass rate on valid Office files, tested against 6,062 real-world documents. Zero failures on legitimate Word 97+ / Excel 97+ / PowerPoint 97+ files
- Complete — 6 formats: DOCX, XLSX, PPTX + legacy DOC, XLS, PPT
- Multi-platform — Rust, Python, Go, JavaScript/TypeScript, C#/.NET, WASM, CLI, and MCP server — one library, all platforms
- Permissive — MIT / Apache-2.0, no AGPL or GPL restrictions
Performance
Benchmarked on 6,062 files from 11 independent public test suites. Single-thread, release build with LTO, warm disk cache (steady-state), median of three runs on an idle system. Full methodology in BENCHMARKS.md.
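To make the reported statistics concrete, here is a minimal sketch of how a per-library mean and p99 could be derived from repeated timing runs. It is not the project's benchmark harness (see BENCHMARKS.md for that): the function names are hypothetical, the timings are made up, and both the nearest-rank percentile and taking the median per file across runs are assumptions.

```python
import math
import statistics


def median_of_runs(runs):
    """Per-file median across repeated runs.

    `runs` is a list of equal-length lists of timings in milliseconds,
    one inner list per run, in the same file order.
    """
    return [statistics.median(per_file) for per_file in zip(*runs)]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(timings_ms):
    """Mean and p99 of a list of per-file timings."""
    return {"mean": statistics.fmean(timings_ms), "p99": percentile(timings_ms, 99)}


# Three runs over the same four files (illustrative timings, in ms).
runs = [
    [0.9, 1.2, 0.7, 4.0],
    [0.8, 1.1, 0.7, 3.9],
    [1.0, 1.3, 0.6, 4.1],
]
per_file = median_of_runs(runs)
stats = summarize(per_file)
print(stats)
```

With only a handful of files the p99 is simply the slowest file; on a corpus of thousands it isolates the tail that the mean hides.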
DOCX — 2,538 files
| Library | Language | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|---|
| office_oxide | Rust | 0.8ms | 3.9ms | 98.9% | MIT |
| python-docx | Python | 11.8ms | 98ms | 95.1% | MIT |
XLSX — 1,802 files
| Library | Language | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|---|
| office_oxide | Rust | 5.0ms | 40ms | 97.8% | MIT |
| python-calamine | Rust/Python | 13.9ms | 183ms | 96.6% | MIT |
| openpyxl | Python | 94.5ms | 698ms | 96.2% | MIT |
PPTX — 806 files
| Library | Language | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|---|
| office_oxide | Rust | 0.7ms | 3.9ms | 98.4% | MIT |
| python-pptx | Python | 32.5ms | 174ms | 86.7% | MIT |
Legacy Formats — 916 files
| Library | Format | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|---|
| office_oxide | DOC (246) | 0.3ms | 3.4ms | 94.7% | MIT |
| catdoc | DOC | 4.3ms | 41ms | 90.2% | GPL-2.0 |
| antiword | DOC | 4.5ms | 66ms | 76.8% | GPL-2.0 |
| office_oxide | XLS (494) | 2.8ms | 75ms | 99.2% | MIT |
| xls2csv (catdoc) | XLS | 6.9ms | 58ms | 84.0% | GPL-2.0 |
| python-calamine | XLS | 9.0ms | 96ms | 90.7% | MIT |
| xlrd | XLS | 36.6ms | 503ms | 93.1% | BSD-3 |
| office_oxide | PPT (176) | 0.7ms | 6.6ms | 100% | MIT |
| catppt (catdoc) | PPT | 2.8ms | 8ms | 77.8% | GPL-2.0 |
On .xls, xls2csv has a tighter p99 (58ms vs 75ms) because it emits truncated/lossy output on complex sheets. office_oxide is 2.4× faster on the mean and passes 15pp more of the corpus. No other Rust or Python library supports .doc, .xls, and .ppt text extraction without a JVM (Apache Tika) or external binaries.
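The comparison in the paragraph above is plain arithmetic on the table rows. A small sketch, using the published .xls figures for office_oxide and xls2csv; because the table values are already rounded, the derived ratio is approximate.

```python
# Mean latency (ms) and pass rate (%) for .xls, copied from the table above.
xls = {
    "office_oxide": {"mean_ms": 2.8, "pass_pct": 99.2},
    "xls2csv": {"mean_ms": 6.9, "pass_pct": 84.0},
}

# Speedup on the mean: how many times longer the other tool takes per file.
speedup = xls["xls2csv"]["mean_ms"] / xls["office_oxide"]["mean_ms"]

# Pass-rate gap in percentage points (pp), not a relative percentage.
pass_gap_pp = xls["office_oxide"]["pass_pct"] - xls["xls2csv"]["pass_pct"]

print(f"speedup on mean: {speedup:.2f}x, pass-rate gap: {pass_gap_pp:.1f}pp")
```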
Corpus
| Source | Files | License |
|---|---|---|
| LibreOffice Core | 2,185 | MPL-2.0 |
| Apache POI | 1,298 | Apache-2.0 |
| Open XML SDK | 707 | MIT |
| ClosedXML | 371 | MIT |
| Pandoc | 224 | GPL-2.0 |
| python-docx + python-pptx | 111 | MIT |
| Apache Tika | 108 | Apache-2.0 |
| calamine | 28 | MIT |
| openpreserve | 20 | CC0 |
| oletools | 17 | BSD-2 |
| LibreOffice (legacy) | 12 | MPL-2.0 |
| Total | 6,062 |
Pass Rate — 100% on valid files
100% pass rate on all valid Office files — the 97 non-passing files in the corpus are all invalid inputs:
| Category | Count | Notes |
|---|---|---|
| Truncated / corrupted ZIP | 43 | Missing EOCD, invalid Central Directory, fuzzer-corrupted archives |
| Encrypted / password-protected | 19 | CFBF-encrypted containers — no password supplied |
| XML bomb / security fixture | 17 | Billion-laughs entities, CVE exploit inputs, fuzzer corpus |
| Wrong format / mislabeled | 16 | WordPerfect, IBM DisplayWrite, pre-OLE2 Excel 3/4, ODT stored as .docx, XLSB stored as .xls |
| Truncated binary | 2 | File ends mid-CFB-sector |
Zero failures on legitimate Word 97+ / Excel 97+ / PowerPoint 97+ files. Zero panics, zero timeouts, zero false negatives on valid documents. Full breakdown in BENCHMARKS.md.
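Several of the invalid-input categories above can be recognized from container signatures alone. The sketch below is an illustration using only the Python standard library, not office_oxide's own detection logic; `triage` and its return labels are hypothetical. It relies on two well-known facts: OOXML files are ZIP archives that end with an End of Central Directory (EOCD) record, and legacy .doc/.xls/.ppt files are OLE2 compound files (CFB) with a fixed 8-byte header.

```python
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header signature
EOCD_MAGIC = b"PK\x05\x06"                       # ZIP End of Central Directory signature
CFB_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 / Compound File Binary header

# The EOCD record is 22 bytes plus an optional comment of at most 65,535 bytes,
# so it must appear within this many bytes of the end of the file.
EOCD_SEARCH_WINDOW = 22 + 65535


def triage(data: bytes) -> str:
    """Coarse classification of raw bytes before handing them to a parser."""
    if data.startswith(CFB_MAGIC):
        return "cfb"
    if data.startswith(ZIP_MAGIC):
        if EOCD_MAGIC not in data[-EOCD_SEARCH_WINDOW:]:
            return "zip-truncated"  # starts like a ZIP but the EOCD record is missing
        return "zip"
    return "unknown"  # e.g. a mislabeled file in some other format
```

A check like this only separates containers; telling a password-protected file from a readable one, or a DOCX from an ODT, still requires opening the container.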
Supported Formats
| Format | Extension | Read | Write | Edit | Convert | Text | Markdown | HTML | IR |
|---|---|---|---|---|---|---|---|---|---|
| Word (OOXML) | .docx | Yes | Yes | Yes | — | Yes | Yes | Yes | Yes |
| Excel (OOXML) | .xlsx | Yes | Yes | Yes | — | Yes | Yes | Yes | Yes |
| PowerPoint (OOXML) | .pptx | Yes | Yes | Yes | — | Yes | Yes | Yes | Yes |
| Word (Legacy) | .doc | Yes | — | — | → .docx | Yes | Yes | Yes | Yes |
| Excel (Legacy) | .xls | Yes | — | — | → .xlsx | Yes | Yes | Yes | Yes |
| PowerPoint (Legacy) | .ppt | Yes | — | — | → .pptx | Yes | Yes | Yes | Yes |
Legacy formats can be converted to modern OOXML with save_as():
=
# Converts DOC → DOCX
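When converting a batch of legacy files, the output name follows directly from the Convert column of the table. A minimal helper for deriving it, assuming you want the converted file next to the original; `converted_path` is a hypothetical name and is not part of the office_oxide API.

```python
from pathlib import Path

# Legacy extension -> modern OOXML extension, per the Convert column above.
LEGACY_TO_OOXML = {".doc": ".docx", ".xls": ".xlsx", ".ppt": ".pptx"}


def converted_path(src):
    """Return the output path for a legacy file, keeping its directory and stem."""
    path = Path(src)
    target = LEGACY_TO_OOXML.get(path.suffix.lower())
    if target is None:
        raise ValueError(f"not a legacy Office extension: {path.suffix!r}")
    return path.with_suffix(target)
```

The returned path can then be passed to whatever save call your binding provides.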
Python API
# Quick extraction
=
=
=
# Document object
=
=
=
=
= # Structured JSON intermediate representation
= # "pptx"
# From bytes
=
All 6 formats supported. Works with str, pathlib.Path, or raw bytes.
Rust API
use ;
// Open from path (format auto-detected from extension)
let doc = open?;
// Open from reader with explicit format
let file = open?;
let doc = from_reader?;
// Extract content
let text = doc.plain_text;
let markdown = doc.to_markdown;
let html = doc.to_html;
let ir = doc.to_ir; // Format-agnostic DocumentIR
// Access format-specific types
if let Some = doc.as_docx
// Create documents from IR
use create_from_ir;
create_from_ir?;
Sub-modules
Each format is available as a sub-module for direct access:
use DocxDocument;
use XlsxDocument;
use PptxDocument;
use DocDocument;
use XlsDocument;
use PptDocument;
Installation
Python
Wheels available for Linux, macOS, and Windows. Python 3.8–3.14.
Rust
[dependencies]
office_oxide = "0.1.1"
JavaScript/WASM
Go
See go/README.md for setup details.
C# / .NET
See csharp/OfficeOxide/README.md for setup details.
CLI
MCP Server
CLI
office-oxide provides fast Office document processing from your terminal:
All six formats supported (docx, xlsx, pptx, doc, xls, ppt). Use --help for all options.
MCP Server
office-oxide-mcp lets AI assistants (Claude, Cursor, etc.) read Office documents locally via the Model Context Protocol.
Add to your MCP client configuration:
The server exposes extract and info tools. All processing runs locally — no files leave your machine.
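As an illustration of what a client entry can look like, the sketch below prints a JSON fragment in the `mcpServers` layout used by several MCP clients. This is an assumption, not the project's documented configuration: the server label `office-oxide`, the empty `args` list, and the helper name `mcp_client_entry` are hypothetical, and your client may expect a different file or schema.

```python
import json
import shutil


def mcp_client_entry(binary="office-oxide-mcp"):
    """Build a client config entry that launches the MCP server binary."""
    # Use the absolute path when the binary is on PATH; otherwise keep the bare name.
    command = shutil.which(binary) or binary
    return {"mcpServers": {"office-oxide": {"command": command, "args": []}}}


print(json.dumps(mcp_client_entry(), indent=2))
```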
Building from Source
# Python bindings
# Shared library for Go, JS/TS (koffi), and C# bindings
# Output: target/release/liboffice_oxide.{so,dylib} or office_oxide.dll
Documentation
- Documentation Site — Guides and examples
- Getting Started (Rust) — Rust guide
- Getting Started (Python) — Python guide
- Getting Started (Go) — Go guide
- Getting Started (JavaScript / TypeScript) — Node.js native guide
- Getting Started (WASM) — Browser and Node.js WASM guide
- Getting Started (C# / .NET) — .NET guide
- Getting Started (C FFI) — C FFI guide
- API Docs (Rust) — Full Rust API reference
- Architecture — System design and module structure
Use Cases
- RAG / LLM pipelines — Extract clean text or Markdown from Office documents for retrieval-augmented generation
- Document processing at scale — Parse thousands of documents in seconds
- Data extraction — Pull structured data from spreadsheets, tables, and presentations
- Format conversion — Convert between formats via the intermediate representation
- python-docx / openpyxl alternative — Up to 100× faster, supports all 6 formats in one library
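For the "at scale" case, a batch run usually needs two things: finding the supported files and isolating per-file failures so one bad input does not stop the run. The sketch below shows that shape with the standard library only. The `extract` argument is a placeholder for whatever extraction call you use (it is deliberately not an office_oxide function), and whether threads give real parallelism depends on the binding, which is an open assumption here.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# The six extensions listed under Supported Formats.
SUPPORTED = {".docx", ".xlsx", ".pptx", ".doc", ".xls", ".ppt"}


def find_documents(root):
    """Recursively collect files with a supported extension, in sorted order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def extract_all(paths, extract, workers=4):
    """Run extract(path) -> str over all paths, collecting failures per file."""
    texts, errors = {}, {}

    def run(path):
        try:
            return path, extract(path), None
        except Exception as exc:  # keep going; record the failure for this file
            return path, None, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for path, text, exc in pool.map(run, paths):
            if exc is None:
                texts[path] = text
            else:
                errors[path] = exc
    return texts, errors
```

Returning errors alongside results makes it easy to report a pass rate for your own corpus.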
Why I built this
I needed a library that could read all six Office formats at once, not six separate packages, and I needed it without pulling in a JVM, a Python runtime, or a GPL-licensed dependency. Nothing existed that combined speed, correctness, and a permissive license across the full DOCX / XLSX / PPTX / DOC / XLS / PPT surface, so I wrote it in Rust and wrapped it for every language I use day-to-day. The same Rust core powers Python via PyO3, Node.js via koffi, Go via cgo, C# via P/Invoke, and the browser via WASM, so one fix lands everywhere.
If it saves you a dependency, a license audit, or a weekend, consider leaving a star. If something's broken or missing, open an issue — I read all of them.
— Yury
License
Dual-licensed under MIT or Apache-2.0 at your option. No AGPL, no GPL, no copyleft restrictions. Use freely in commercial and open-source projects.
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Citation
Rust + Python + Go + JS/TS + C# + WASM + CLI + MCP | MIT/Apache-2.0 | 100% pass rate on valid Office files (6,062-file corpus) | Up to 100× faster than alternatives | 6 formats