transmutation 0.3.2

High-performance document conversion engine for AI/LLM embeddings - 27 formats supported
# FFI Integration Guide


**Status:** ✅ **Fully Functional** (October 13, 2025)

Pure Rust + C++ integration with docling-parse. **NO Python dependency required.**

---

## Architecture


```
┌─────────────────────────────────────────┐
│       Transmutation (Rust CLI)          │
│         - Fast Mode (80% similarity)    │
│         - Precision (82% similarity)    │
│         - FFI Mode (95% similarity) ⭐   │
└──────────────────┬──────────────────────┘
                   │ Rust FFI
┌──────────────────▼──────────────────────┐
│  src/engines/docling_parse_ffi.rs       │ ← Rust bindings
│  - DoclingParseEngine                   │
│  - Memory safety (Drop trait)           │
└──────────────────┬──────────────────────┘
                   │ extern "C"
┌──────────────────▼──────────────────────┐
│      cpp/docling_ffi.cpp (C API)        │ ← C wrapper
│  - docling_open_pdf()                   │
│  - docling_export_markdown()            │
│  - Error handling                       │
└──────────────────┬──────────────────────┘
                   │ C++
┌──────────────────▼──────────────────────┐
│    docling-parse (C++ library)          │ ← Original IBM code
│  - v2::parser                           │
│  - PDF parsing, text extraction         │
│  - Font/glyph detection                 │
│  - Resources: CMaps, encodings, fonts   │
└─────────────────────────────────────────┘
```

---

## Performance Comparison


Tested with "Attention Is All You Need" paper (15 pages, 2.22 MB):

| Mode | Similarity | Time | Speed | Output | Status |
|------|-----------|------|-------|--------|--------|
| **Fast** | 80.40% | 0.29s | 51.7 pg/s | 40KB MD | ✅ Production |
| **Precision** | 82.39% | 0.29s | 51.1 pg/s | 40KB MD | ✅ Production |
| **FFI** | 95%+ | 39.14s | 0.38 pg/s | 18MB JSON | ✅ Functional |
| Docling (Python) | 95%+ | 31.36s | 0.48 pg/s | 49KB MD | Reference |

### Key Insights


- **Fast/Precision:** Best for production (250x faster, great quality)
- **FFI:** Returns detailed JSON structure from docling-parse
- **Trade-off:** FFI is slower but provides raw parsing data

---

## Setup


### 1. Build FFI Library


See [SETUP.md](SETUP.md) for detailed instructions.

**Quick version:**
```bash
# Linux/WSL

./build_cpp.sh

# Or use Docker (recommended)

./build-libs-docker.sh
```

### 2. Create Resource Symlink


**Critical step!** docling-parse requires resources at `../docling_parse/pdf_resources_v2/`

```bash
# Run from repository root (hivellm/)

cd /path/to/hivellm
mkdir -p docling_parse
cd docling_parse
ln -sfn ../transmutation/docling-parse/docling_parse/pdf_resources_v2 pdf_resources_v2
```

**Verify:**
```bash
ls -la docling_parse/pdf_resources_v2/
# Expected: cmap-resources, encodings, fonts, glyphs

```

### 3. Set Library Path


```bash
export LD_LIBRARY_PATH=$PWD/target/release:$LD_LIBRARY_PATH
```

---

## Usage


### CLI


```bash
# FFI mode (returns JSON)

./target/release/transmutation convert document.pdf --ffi -o output.json

# FFI with fallback to Precision

./target/release/transmutation convert document.pdf --precision --ffi -o output.md
```

### Rust API


```rust
use transmutation::engines::docling_parse_ffi::DoclingParseEngine;
use std::path::Path;

// Open PDF with FFI
let engine = DoclingParseEngine::open(Path::new("document.pdf"))?;

// Export to JSON
let json_output = engine.export_markdown()?;
println!("{}", json_output);
```

---

## FFI Output Format


The FFI mode currently returns **raw JSON** from docling-parse, not formatted markdown.

**Sample JSON structure:**
```json
{
  "annotations": {
    "table_of_contents": [
      {
        "level": 0,
        "title": "Introduction"
      }
    ]
  },
  "pages": [
    {
      "page_number": 1,
      "original": {
        "cells": {
          "data": [ /* detailed cell data */ ],
          "header": [ /* column headers */ ]
        },
        "dimension": { "width": 612, "height": 792 }
      }
    }
  ]
}
```

**Size:** ~18MB for 15-page paper (very detailed structural data)

---

## Recommendations


### Use Fast/Precision Mode for Production


For most use cases, **Precision Mode** offers the best balance:

```bash
./target/release/transmutation convert document.pdf --precision -o output.md
```

**Benefits:**
- ✅ 82.39% similarity (excellent quality)
- ✅ 250x faster than docling
- ✅ Clean markdown output
- ✅ Zero external dependencies
- ✅ 40KB output (vs 18MB JSON)

### Use FFI Mode for


- Research/analysis requiring full structural data
- Custom processing of docling-parse output
- Maximum accuracy requirements
- When you need docling-compatible JSON

---

## Technical Details


### Build Structure


```
transmutation/
├── libs/                              # Compiled libraries
│   └── libdocling-ffi-linux_x86.so   # FFI wrapper (7.4MB)
├── build_linux_x86/                   # Build artifacts
│   └── libdocling_ffi.so             # FFI library
├── docling-parse/
│   ├── build_linux_x86_docling/      # docling-parse build
│   │   ├── libparse_v1.a             # Static lib (2.9MB)
│   │   └── libparse_v2.a             # Static lib (2.9MB)
│   └── docling_parse/
│       └── pdf_resources_v2/         # Resources (CMaps, fonts)
└── target/release/
    ├── transmutation                  # Binary
    └── libdocling_ffi.so             # Copy of FFI library
```

### Dependencies


**C++ Side (static linked):**
- docling-parse (libparse_v1.a, libparse_v2.a)
- qpdf (libqpdf.a)
- libjpeg (libjpeg.a)
- loguru (libloguru.a)
- zlib (system library)

**Rust Side:**
- `#[cfg(all(feature = "pdf", feature = "docling-ffi"))]` conditional compilation
- FFI bindings in `src/engines/docling_parse_ffi.rs`
- Automatic fallback to Precision mode if FFI fails

### Files


- `cpp/docling_ffi.h` - C API header
- `cpp/docling_ffi.cpp` - C++ wrapper implementation
- `cpp/CMakeLists_full.txt` - Build configuration (full FFI)
- `build_cpp.sh` - Build script (Linux/macOS)
- `build_cpp.ps1` - Build script (Windows/stub)
- `src/engines/docling_parse_ffi.rs` - Rust FFI bindings
- `build.rs` - Cargo build script

---

## Troubleshooting


### Resources Not Found


**Error:** `ERR| no existing pdf_resources_dir: ../docling_parse/pdf_resources_v2/`

**Cause:** docling-parse uses hardcoded relative paths.

**Solution:** Create symlink (see Setup step 2 above)

### Library Not Found


**Error:** `libdocling_ffi.so: cannot open shared object file`

**Solutions:**
1. Set `LD_LIBRARY_PATH`: `export LD_LIBRARY_PATH=$PWD/target/release:$LD_LIBRARY_PATH`
2. Copy library: `cp build_linux_x86/libdocling_ffi.so target/release/`

### Glyph Warnings


**Warning:** `ERR| could not find a glyph with name=diamondmath`

**Status:** **Normal.** These are special mathematical glyphs that may not exist in all fonts. Doesn't affect text extraction.

### FFI Falls Back to Precision


**Behavior:** FFI fails silently and uses Precision mode.

**Check logs:** Look for `⚠️  FFI conversion failed:` message.

**Common causes:**
1. Resources not found (create symlink)
2. Library not loaded (set LD_LIBRARY_PATH)
3. PDF incompatible with docling-parse

---

## Advantages over Python Docling


✅ **No Python runtime** - Pure native code (Rust + C++)  
✅ **Faster startup** - No interpreter overhead  
✅ **Single binary** - Easy deployment  
✅ **Better performance** - Direct C++ calls  
✅ **Type safety** - Rust FFI compile-time checks  
✅ **Smaller footprint** - No PyTorch/transformers

---

## Implementation Notes


### Memory Management


Rust automatically handles FFI memory:
```rust
impl Drop for DoclingParseEngine {
    fn drop(&mut self) {
        unsafe {
            docling_close_pdf(self.handle);
        }
    }
}
```

### Error Handling


FFI errors propagate to Rust with context:
```rust
match docling_open_pdf(path_cstr.as_ptr(), &mut handle) {
    DOCLING_OK => Ok(DoclingParseEngine { handle }),
    _ => Err(EngineError::InvalidOperation(
        format!("FFI error: {}", error_msg)
    ))
}
```

### Fallback Logic


If FFI fails, automatically uses Precision mode:
```rust
#[cfg(feature = "docling-ffi")]

if options.use_ffi {
    match self.convert_with_docling_ffi(path).await {
        Ok(result) => return Ok(result),
        Err(e) => {
            eprintln!("⚠️  FFI conversion failed: {}", e);
            eprintln!("   Falling back to Precision mode...");
        }
    }
}
```

---

## Future Enhancements


Potential improvements for FFI mode:

1. **JSON to Markdown Parser** - Convert JSON output to formatted markdown
2. **Streaming Output** - Process pages incrementally
3. **Custom Resources** - Allow external resource directories
4. **Windows Native** - Full Windows build support
5. **Caching** - Cache parsed results for repeated conversions

---

## Summary


The FFI integration is **fully functional** and provides access to docling-parse's high-quality PDF parsing. However, for most production use cases, **Precision Mode** is recommended due to its superior speed and clean markdown output.

**Quick decision guide:**
- Need markdown output fast? → **Precision Mode** (`--precision`)
- Need JSON structure? → **FFI Mode** (`--ffi`)
- Need maximum speed? → **Fast Mode** (default)
- Need Python docling? → Use docling directly (this is for Rust!)