docx-review-core 0.1.0

Review-oriented DOCX extraction library for tracked changes, comments, and structural blocks
Documentation

docx-review

docx-review is a review-oriented DOCX extraction toolkit for Rust.

It reads .docx files directly from OOXML and produces structured JSON for:

  • document blocks
  • tracked changes
  • comments and comment threads
  • text anchors
  • headers, footers, footnotes, and endnotes

The project is designed as headless infrastructure for automation, review workflows, AI pipelines, and downstream tools that need more than plain text.

What It Extracts

docx-review currently supports:

  • paragraphs
  • table cells as flat blocks
  • tracked changes:
    • insert
    • delete
    • replacement
    • move
    • format change
  • comments
  • comment anchors and anchored text
  • comment threading and resolved state from commentsExtended.xml
  • footnotes and endnotes
  • headers and footers
  • list item detection by nesting level
  • text spans with tracked-change markers

Workspace

  • crates/core
    • docx-review-core
    • extraction library and normalized data model
  • crates/cli
    • docx-review
    • command-line interface

Installation

Install the CLI from crates.io:

cargo install docx-review-cli

This installs the docx-review command.

CLI

Basic extraction:

docx-review extract document.docx

Pretty JSON:

docx-review extract document.docx --pretty

Tracked changes only:

docx-review extract document.docx --only track-changes --pretty

Comments only:

docx-review extract document.docx --only comments --pretty

Read from stdin:

cat document.docx | docx-review extract -

JSONL output:

docx-review extract document.docx --format jsonl

Notes:

  • --format jsonl with no --only emits one block JSON object per line.
  • --only comments --format jsonl emits one comment per line.
  • --only track-changes --format jsonl emits one tracked change per line.

Track changes modes:

docx-review extract document.docx --mode paired
docx-review extract document.docx --mode raw
docx-review extract document.docx --mode both

Useful flags:

  • --no-comments
  • --no-text-spans
  • --include-raw-ids
  • -v, -vv

Rust API

Add the crate:

[dependencies]
docx-review-core = "0.1"
use docx_review_core::extract_from_path;

fn main() -> Result<(), docx_review_core::Error> {
    let document = extract_from_path("document.docx")?;

    println!("blocks: {}", document.blocks.len());
    println!("tracked changes: {}", document.tracked_changes.len());
    println!("comments: {}", document.comments.len());

    Ok(())
}

With options:

use docx_review_core::{extract_from_path_with_opts, ExtractOptions, TrackChangesMode};

fn main() -> Result<(), docx_review_core::Error> {
    let mut opts = ExtractOptions::default();
    opts.track_changes_mode = TrackChangesMode::Both;
    opts.include_raw_ids = true;

    let document = extract_from_path_with_opts("document.docx", opts)?;
    println!("raw changes: {}", document.raw_changes.len());

    Ok(())
}

Output Model

At a high level, extraction returns:

  • Document.blocks
    • the normalized textual structure of the document
  • Document.tracked_changes
    • review-oriented change records
  • Document.comments
    • comments, anchors, replies, and resolved state
  • Document.raw_changes
    • optional raw tracked changes when TrackChangesMode::Both is used

blocks are the main content surface. Comments and tracked changes are linked back to blocks by id.

Scope

The current implementation is focused on review semantics and structural extraction.

Designed for:

  • review metadata extraction from real Word documents
  • tracked changes and comment workflows
  • structural stories outside document.xml

Not the current focus:

  • editing
  • image extraction
  • full numbering style reconstruction

Development

Run the CLI locally:

cargo run -p docx-review-cli -- extract path/to/document.docx --pretty