undoc 0.1.19

# undoc

High-performance Microsoft Office document extraction to Markdown.

## Installation

```bash
pip install undoc
```

## Usage

### Basic Usage

```python
from undoc import parse_file

# Parse a document
doc = parse_file("document.docx")

# Convert to Markdown
markdown = doc.to_markdown()
print(markdown)

# Convert to plain text
text = doc.to_text()

# Convert to JSON
json_data = doc.to_json()
```

### With Context Manager

```python
from undoc import parse_file

with parse_file("document.xlsx") as doc:
    print(doc.to_markdown(frontmatter=True))
    print(f"Sections: {doc.section_count}")
    print(f"Resources: {doc.resource_count}")
```
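The context-manager form lends itself to a small "convert and save" helper. The sketch below is not part of undoc: `convert_to_markdown_file` is a hypothetical name, and the parser is passed in as `parse` so the snippet does not import undoc itself. It assumes `to_markdown()` returns a `str`, as the `print` calls above suggest.

```python
from pathlib import Path


def convert_to_markdown_file(src, parse, frontmatter=False):
    """Parse `src` with `parse` and write the Markdown next to it as <name>.md.

    `parse` is assumed to behave like undoc.parse_file: it returns a
    context manager whose value has a to_markdown(frontmatter=...) method.
    """
    src = Path(src)
    dest = src.with_suffix(".md")
    # The document is only needed while rendering, so keep the block short.
    with parse(str(src)) as doc:
        markdown = doc.to_markdown(frontmatter=frontmatter)
    dest.write_text(markdown, encoding="utf-8")
    return dest
```

With undoc installed, a call might look like `convert_to_markdown_file("document.xlsx", parse_file, frontmatter=True)`.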

### Parse from Bytes

```python
from undoc import parse_bytes

with open("document.pptx", "rb") as f:
    data = f.read()

doc = parse_bytes(data)
markdown = doc.to_markdown()
```

### Extract Resources (Images)

```python
from undoc import parse_file

doc = parse_file("document.docx")

# Get all resource IDs
resource_ids = doc.get_resource_ids()

for rid in resource_ids:
    # Get resource metadata
    info = doc.get_resource_info(rid)
    print(f"Resource: {info['filename']} ({info['mime_type']})")

    # Get resource binary data
    data = doc.get_resource_data(rid)

    # Save to file
    with open(info['filename'], 'wb') as f:
        f.write(data)
```
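The loop above writes each resource into the current directory under the name stored in the document. A slightly more defensive variant is sketched here. `save_resources` is a hypothetical helper, not an undoc function, and it assumes only the three resource methods shown above. Whether undoc ever returns filenames containing directory components is not documented here, so the sketch strips them as a precaution.

```python
from pathlib import Path


def save_resources(doc, out_dir):
    """Write every resource of `doc` into `out_dir` and return the paths written.

    `doc` is any object exposing get_resource_ids(), get_resource_info(rid)
    and get_resource_data(rid).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for rid in doc.get_resource_ids():
        info = doc.get_resource_info(rid)
        # Keep only the final path component so a name such as "../x.png"
        # cannot land outside out_dir; fall back to the ID if nothing is left.
        name = Path(info["filename"]).name or str(rid)
        target = out / name
        target.write_bytes(doc.get_resource_data(rid))
        written.append(target)
    return written
```

Typical use would be `save_resources(doc, "assets")` after `doc = parse_file("document.docx")`.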

### Document Metadata

```python
from undoc import parse_file

doc = parse_file("document.docx")

print(f"Title: {doc.title}")
print(f"Author: {doc.author}")
print(f"Sections: {doc.section_count}")
print(f"Resources: {doc.resource_count}")
```
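For logging or indexing, it can be handy to gather these properties into one dictionary. The helper below is a minimal sketch with a hypothetical name (`describe`); it reads only the four properties shown above and makes no assumption about their types.

```python
def describe(doc):
    """Collect the documented metadata properties into a plain dict."""
    fields = ("title", "author", "section_count", "resource_count")
    return {name: getattr(doc, name) for name in fields}
```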

## Supported Formats

- **DOCX** - Microsoft Word documents
- **XLSX** - Microsoft Excel spreadsheets
- **PPTX** - Microsoft PowerPoint presentations
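
When processing a mixed folder, it may help to filter by extension before calling the parser. This sketch checks file names only; `SUPPORTED_EXTENSIONS`, `is_supported` and `supported_files` are illustrative names, and how undoc itself detects or rejects formats is not covered here.

```python
from pathlib import Path

# Extensions taken from the list of supported formats above.
SUPPORTED_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def is_supported(path):
    """Return True if the file extension matches one of the listed formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def supported_files(paths):
    """Keep only the paths with a supported extension, preserving order."""
    return [p for p in paths if is_supported(p)]
```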

## Features

- **RAG-Ready Output**: Structured Markdown optimized for RAG/LLM applications
- **High Performance**: Native Rust implementation via FFI
- **Asset Extraction**: Images and embedded resources
- **Metadata Preservation**: Document properties, styles, formatting
- **Cross-Platform**: Windows, Linux, macOS (Intel & ARM)
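
As an illustration of the RAG use case, Markdown output can be split into heading-delimited chunks before embedding. The function below is a generic sketch, not part of undoc: it assumes the Markdown uses ATX (`#`) headings and backtick code fences, which may or may not match everything undoc emits.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_by_heading(markdown):
    """Split Markdown into (heading, body) pairs at ATX headings.

    Text before the first heading is returned with an empty heading.
    Lines inside fenced code blocks are never treated as headings.
    """
    chunks = []
    heading, body = "", []
    in_fence = False

    def flush():
        # Emit the current chunk unless it is an empty preamble.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each pair can then be embedded or indexed separately, for example `split_by_heading(doc.to_markdown())`.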

## API Reference

### Functions

- `parse_file(path)` - Parse a document from a file path
- `parse_bytes(data)` - Parse a document from in-memory bytes
- `version()` - Get library version

### Undoc Class

#### Conversion Methods

- `to_markdown(frontmatter=False, escape_special=False, paragraph_spacing=False)` - Convert to Markdown
- `to_text()` - Convert to plain text
- `to_json(compact=False)` - Convert to JSON
- `plain_text()` - Get plain text (fast extraction)

#### Properties

- `title` - Document title
- `author` - Document author
- `section_count` - Number of sections
- `resource_count` - Number of resources

#### Resource Methods

- `get_resource_ids()` - List of resource IDs
- `get_resource_info(id)` - Resource metadata, including the `filename` and `mime_type` keys used in the example above
- `get_resource_data(id)` - Resource binary data

## License

MIT License - see [LICENSE](../../LICENSE) for details.