# AI Development Rules for Transmutation
**Project**: Transmutation - High-performance document conversion engine for AI/LLM embeddings
**Language**: Rust 1.85+ (edition 2024)
**License**: MIT
**Repository**: https://github.com/hivellm/transmutation
---
## Project Overview
Transmutation is a **pure Rust** document conversion engine designed to transform various file formats into optimized text and image outputs suitable for LLM processing and vector embeddings. This is a **high-performance alternative to Docling**, offering superior speed, lower memory usage, and zero runtime dependencies.
**Core Goals**:
- 100% Pure Rust implementation (no Python dependencies)
- Convert documents to LLM-friendly formats (Markdown, Images, JSON)
- Optimize output for embedding generation (text and multimodal)
- Maintain maximum quality with minimum size
- Faster and more efficient than Docling
- Seamless integration with HiveLLM Vectorizer
---
## Code Style
### Formatting
- Follow **Rust 2024 edition** conventions
- Use `cargo fmt` with project-specific `rustfmt.toml`
- Maximum line length: **100 characters**
- Indentation: **4 spaces** (no tabs)
- Use trailing commas in multi-line lists/structs
- Group imports: std → external → crate → module
### Naming Conventions
- **Crates/Modules**: `snake_case` (e.g., `pdf_parser`, `image_ocr`)
- **Files**: `snake_case` (e.g., `pdf.rs`, `file_detect.rs`)
- **Structs/Enums/Traits**: `PascalCase` (e.g., `Converter`, `OutputFormat`, `DocumentConverter`)
- **Functions/Variables**: `snake_case` (e.g., `convert_to_markdown()`, `file_path`)
- **Constants**: `UPPER_SNAKE_CASE` (e.g., `MAX_CHUNK_SIZE`, `DEFAULT_DPI`)
- **Type Parameters**: Single letter or `PascalCase` (e.g., `T`, `Item`, `Error`)
### Code Organization
- One module per file format converter (e.g., `pdf.rs`, `docx.rs`)
- Traits in `converters/traits.rs`
- Shared utilities in `utils/`
- Output format handlers in `output/`
- Error types in `error.rs`
- Public API in `lib.rs`
---
## Documentation
### Doc Comments
- **All public APIs MUST have doc comments** (`///` or `//!`)
- Use doc sections: `# Arguments`, `# Returns`, `# Errors`, `# Examples`, `# Panics`
- Provide runnable examples in doc tests
- Include links to related types/functions with `[Type]`
**Example**:
```rust
/// Converts a document to the specified output format.
///
/// This function handles the complete conversion workflow including
/// file detection, format validation, conversion, and optimization.
///
/// # Arguments
///
/// * `input_path` - Path to the input document
/// * `output_format` - Desired output format (Markdown, JSON, etc.)
/// * `options` - Conversion options for customization
///
/// # Returns
///
/// A `ConversionResult` containing the converted data and metadata
///
/// # Errors
///
/// * `FileNotFound` - If the input file does not exist
/// * `UnsupportedFormat` - If the file format is not supported
/// * `ConversionFailed` - If conversion fails
///
/// # Examples
///
/// ```rust
/// # use transmutation::{convert_document, OutputFormat, ConversionOptions};
/// # async fn example() -> Result<(), Box<dyn std::error::Error>> {
/// let result = convert_document(
///     "document.pdf",
///     OutputFormat::Markdown,
///     ConversionOptions::default(),
/// ).await?;
/// # Ok(())
/// # }
/// ```
pub async fn convert_document(
    input_path: &str,
    output_format: OutputFormat,
    options: ConversionOptions,
) -> Result<ConversionResult> {
    // Implementation elided in this example
    todo!()
}
```
### Module Documentation
- Add module-level docs (`//!`) at the top of each file
- Explain the purpose and main functionality
- Provide usage examples
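A minimal sketch of module-level docs, assuming a hypothetical `file_detect` module with an illustrative `detect_extension` helper (neither name comes from the project). The module is written inline so the snippet is self-contained; in a real file the `//!` lines would sit at the very top of `file_detect.rs`.

```rust
pub mod file_detect {
    //! File format detection helpers.
    //!
    //! Maps a file name to a lowercase extension so that callers can pick
    //! a converter. Hypothetical module, shown only to illustrate `//!` docs.

    /// Returns the lowercase extension of `file_name`, if it has one.
    pub fn detect_extension(file_name: &str) -> Option<String> {
        std::path::Path::new(file_name)
            .extension()
            .and_then(|ext| ext.to_str())
            .map(|ext| ext.to_ascii_lowercase())
    }
}
```

The `//!` block describes the module as a whole, while `///` documents the individual item that follows it.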
### Project Documentation
- Update `docs/ROADMAP.md` after completing tasks
- Update `docs/CHANGELOG.md` for all user-facing changes
- Update `README.md` for major features
- Never create unnecessary `.md` files - consolidate in existing docs
---
## Testing Standards
### Test Organization
```
tests/
├── unit/ # Inline unit tests (#[cfg(test)] mod tests)
├── integration/ # Integration tests
└── fixtures/ # Test data (sample PDFs, DOCX, images, etc.)
```
### Coverage Requirements
- **Overall**: > 90%
- **Critical paths** (converters, parsers): 100%
- **Unit tests**: > 95%
- **Integration tests**: > 85%
### Test Naming
```rust
#[cfg(test)]
mod tests {
    use super::*;
    #[tokio::test]
    async fn test_pdf_to_markdown_success() {
        // Arrange
        let input = "tests/fixtures/sample.pdf";
        let converter = PdfConverter::new();
        // Act
        let result = converter.to_markdown(input).await;
        // Assert
        assert!(result.is_ok());
        assert!(!result.unwrap().is_empty());
    }
    #[test]
    fn test_unsupported_file_format() {
        let result = detect_file_format("test.unknown");
        assert!(matches!(result, Err(ConversionError::UnsupportedFormat(_))));
    }
}
```
### Run Tests Before Committing
```bash
# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
# Run specific test
cargo test test_pdf_conversion
# Check coverage (Linux)
cargo tarpaulin --out Html --output-dir coverage
# Alternative (all platforms)
cargo llvm-cov --html
```
### Integration Tests
- Test with real document samples in `tests/fixtures/`
- Test error handling and edge cases
- Test performance benchmarks in `benches/`
---
## Error Handling
### Use `thiserror` for Library Errors
```rust
use thiserror::Error;
#[derive(Error, Debug)]
pub enum ConversionError {
    #[error("File not found: {0}")]
    FileNotFound(String),
    #[error("Unsupported format: {0}")]
    UnsupportedFormat(String),
    #[error("Conversion failed: {0}")]
    ConversionFailed(String),
    #[error("IO error")]
    Io(#[from] std::io::Error),
    #[error("PDF parsing error")]
    PdfError(#[from] lopdf::Error),
}
pub type Result<T> = std::result::Result<T, ConversionError>;
```
### Error Handling Best Practices
- **Never use `unwrap()` or `expect()` in library code** (tests are OK)
- Use `?` operator for error propagation
- Provide context with `anyhow::Context` when appropriate
- Log errors with `tracing::error!`
- Return `Result<T>` for recoverable errors
- Document all possible errors in doc comments
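The practices above can be sketched with the standard library alone. This is an illustration, not the project's error type: `SketchError` and `check_pdf_header` are hypothetical names, and the `Display`, `Error`, and `From` impls are written by hand only so the snippet compiles without `thiserror`, which would normally derive them.

```rust
use std::fmt;
use std::io::{self, Read};

/// Hypothetical error type; `thiserror` would derive the impls below.
#[derive(Debug)]
pub enum SketchError {
    UnsupportedFormat(String),
    Io(io::Error),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::UnsupportedFormat(ext) => write!(f, "Unsupported format: {ext}"),
            SketchError::Io(_) => write!(f, "IO error"),
        }
    }
}

impl std::error::Error for SketchError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            SketchError::Io(e) => Some(e),
            SketchError::UnsupportedFormat(_) => None,
        }
    }
}

// Lets `?` convert `io::Error` into `SketchError` automatically.
impl From<io::Error> for SketchError {
    fn from(e: io::Error) -> Self {
        SketchError::Io(e)
    }
}

/// Reads the first four bytes of `reader` and checks for the `%PDF` magic.
pub fn check_pdf_header<R: Read>(mut reader: R) -> Result<(), SketchError> {
    let mut magic = [0u8; 4];
    reader.read_exact(&mut magic)?; // io::Error -> SketchError via From
    if &magic == b"%PDF" {
        Ok(())
    } else {
        Err(SketchError::UnsupportedFormat(
            String::from_utf8_lossy(&magic).into_owned(),
        ))
    }
}
```

No `unwrap()` or `expect()` appears in the function body: a short read surfaces as `SketchError::Io` through `?`, and a wrong magic number is returned as a typed error the caller can match on.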
---
## Performance
### Optimization Principles
- **Pure Rust only** - no Python/C++ dependencies for core functionality
- Use `rayon` for CPU-bound parallel processing
- Use `tokio` for I/O-bound async operations
- Minimize allocations - prefer `&str` over `String` for parameters
- Use `&[T]` instead of `&Vec<T>` for function parameters
- Profile with `cargo bench` before optimizing
- Lazy initialization with `std::sync::LazyLock` (stable since Rust 1.80) or `once_cell::sync::Lazy` for expensive statics
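A short sketch of three of the principles above: borrowed `&str` and `&[T]` parameters, plus a lazily initialized static. It assumes `std::sync::LazyLock` as a standard-library stand-in for a lazy static; the extension list and the function names are illustrative, not taken from the project.

```rust
use std::collections::HashSet;
use std::sync::LazyLock;

/// Built once on first access, then shared by every caller.
/// The extension list is illustrative, not the project's real list.
static SUPPORTED_EXTENSIONS: LazyLock<HashSet<&'static str>> =
    LazyLock::new(|| HashSet::from(["pdf", "docx", "xlsx"]));

/// Takes `&str`, so callers can pass a `String` or a literal without allocating.
pub fn is_supported(extension: &str) -> bool {
    SUPPORTED_EXTENSIONS.contains(extension)
}

/// Takes `&[T]`, so a `Vec`, an array, or a sub-slice all work.
pub fn total_bytes(page_sizes: &[usize]) -> usize {
    page_sizes.iter().sum()
}
```

Calling `total_bytes(&some_vec)` works through deref coercion, so accepting a slice costs callers nothing while keeping the signature more general than `&Vec<usize>`.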
### Memory Management
- Target: <500MB per conversion
- Streaming processing for large files
- Use `SmallVec` for small collections
- Use `Cow` for copy-on-write optimizations
- Profile memory with `heaptrack` or `valgrind --tool=massif`
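As an illustration of the `Cow` point above, the following sketch allocates only when the input actually needs to change. `normalize_tabs` is a hypothetical helper, not part of the project.

```rust
use std::borrow::Cow;

/// Replaces tabs with single spaces, allocating only when a tab is present.
pub fn normalize_tabs(line: &str) -> Cow<'_, str> {
    if line.contains('\t') {
        Cow::Owned(line.replace('\t', " "))
    } else {
        Cow::Borrowed(line)
    }
}
```

For documents where most lines are already clean, the borrowed branch means the common case performs no allocation at all.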
### Performance Targets
- PDF → Markdown: 20+ pages/second
- DOCX → Markdown: 25+ pages/second
- Image OCR: 2+ images/second
- Startup time: <100ms
---
## Security
### Input Validation
- **Validate all file inputs** before processing
- Check file sizes and limits
- Sanitize file paths (prevent path traversal)
- Use `validator` crate for struct validation
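The path and size checks above might look roughly like this. It is a sketch under stated assumptions: `resolve_inside`, `size_allowed`, and the 512 MiB cap are hypothetical, and the path check is purely lexical, so it does not resolve symlinks.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical size cap; the document does not specify one.
pub const MAX_INPUT_BYTES: u64 = 512 * 1024 * 1024;

/// Joins `user_path` onto `base_dir`, rejecting absolute paths and `..` components.
pub fn resolve_inside(base_dir: &Path, user_path: &str) -> Option<PathBuf> {
    let candidate = Path::new(user_path);
    // Only plain names and `.` are allowed; `..`, a root, or a drive prefix fail.
    let all_normal = candidate
        .components()
        .all(|c| matches!(c, Component::Normal(_) | Component::CurDir));
    if all_normal && !user_path.is_empty() {
        Some(base_dir.join(candidate))
    } else {
        None
    }
}

/// Returns true when a file of `len` bytes is within the size cap.
pub fn size_allowed(len: u64) -> bool {
    len <= MAX_INPUT_BYTES
}
```

Rejecting suspicious paths outright, rather than trying to clean them, keeps the check easy to reason about and to test.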
### SQL Injection Prevention
- Use parameterized queries with `sqlx::query!` macro
- Never use string concatenation for SQL
### Secrets Management
- **Never hardcode secrets or API keys**
- Use environment variables with `dotenvy`
- Document required env vars in `.env.example`
- Add `.env` to `.gitignore`
### Dependencies
- Run `cargo audit` regularly
- Keep dependencies updated
- Review security advisories
---
## Rust Best Practices
### Idioms
1. **Use `Result` instead of panicking** for recoverable errors
2. **Prefer `&str` over `String`** for function parameters
3. **Use `#[derive]` macros** (Debug, Clone, PartialEq, Eq, Serialize, Deserialize)
4. **Implement `Display` and `Error`** for custom errors (use `thiserror`)
5. **Use `Option` and `Result`** - avoid sentinel values
6. **Prefer iterators** over loops
7. **Use `Vec<T>` for owned data**, `&[T]` for borrowed
8. **Use `async/await`** for I/O operations
9. **Use `?` operator** for error propagation
10. **Use `clippy`** and fix all warnings
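Two of the idioms above, `Option` instead of sentinel values and iterator chains instead of manual loops, can be sketched as follows. Both function names are hypothetical and chosen only for illustration.

```rust
/// Returns the index of the first non-empty line, or `None` instead of a sentinel like -1.
pub fn first_content_line(lines: &[&str]) -> Option<usize> {
    lines.iter().position(|line| !line.trim().is_empty())
}

/// Iterator chain: trim each line, drop blanks, and count words per remaining line.
pub fn words_per_line(text: &str) -> Vec<usize> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| line.split_whitespace().count())
        .collect()
}
```

Returning `Option<usize>` forces callers to handle the empty case explicitly, which a magic index value would not.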
### Anti-Patterns to Avoid
- ❌ Cloning everything unnecessarily
- ❌ Using `unwrap()` in production code
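- ❌ Propagating errors by hand instead of using the `?` operator
- ❌ Taking `String` parameters where `&str` would do
- ❌ Custom error types that do not implement the `Error` trait
- ❌ Hiding unhandled cases behind a catch-all `_` match arm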
### Common Patterns
- **Repository pattern** with traits
- **Builder pattern** for complex configurations
- **Newtype pattern** for type safety
- **From/Into traits** for conversions
- **Iterator chains** for data transformation
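A compact sketch combining three of the patterns above: a newtype, a `From` conversion, and a builder. All names (`Dpi`, `RenderOptions`, `RenderOptionsBuilder`) and the default of 150 DPI are assumptions made for the example, not the project's API.

```rust
/// Newtype: a DPI value cannot be confused with any other `u32`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Dpi(pub u32);

impl From<u32> for Dpi {
    fn from(value: u32) -> Self {
        Dpi(value)
    }
}

/// Hypothetical options struct; field names are illustrative.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RenderOptions {
    pub dpi: Dpi,
    pub extract_images: bool,
}

/// Builder: start from defaults and override only what is needed.
#[derive(Debug)]
pub struct RenderOptionsBuilder {
    dpi: Dpi,
    extract_images: bool,
}

impl Default for RenderOptionsBuilder {
    fn default() -> Self {
        // 150 is an assumed default, not a value taken from the project.
        Self { dpi: Dpi(150), extract_images: false }
    }
}

impl RenderOptionsBuilder {
    /// Accepts anything convertible into `Dpi`, such as a plain `u32`.
    pub fn dpi(mut self, dpi: impl Into<Dpi>) -> Self {
        self.dpi = dpi.into();
        self
    }

    pub fn extract_images(mut self, enabled: bool) -> Self {
        self.extract_images = enabled;
        self
    }

    pub fn build(self) -> RenderOptions {
        RenderOptions { dpi: self.dpi, extract_images: self.extract_images }
    }
}
```

Usage reads as a chain, for example `RenderOptionsBuilder::default().dpi(300u32).extract_images(true).build()`.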
---
## Git Workflow
### Branch Strategy
```
main
├── develop
├── feature/[feature-name]
├── fix/[issue-number]-[description]
├── docs/[description]
└── perf/[description]
```
### Branch Naming
- `feature/pdf-converter` - New features
- `fix/123-memory-leak` - Bug fixes
- `docs/api-reference` - Documentation
- `perf/optimize-pdf-parsing` - Performance improvements
- `test/integration-tests` - Test additions
### Commit Message Format
Follow [Conventional Commits](https://www.conventionalcommits.org/):
```
[type]([optional scope]): [subject]

[optional body]

[optional footer]
```
**Types**:
- `feat` - New feature (e.g., `feat(pdf): add PDF to markdown converter`)
- `fix` - Bug fix (e.g., `fix(docx): handle corrupt files`)
- `docs` - Documentation (e.g., `docs: update API reference`)
- `style` - Code style (e.g., `style: format with rustfmt`)
- `refactor` - Code refactoring (e.g., `refactor(converters): extract common logic`)
- `perf` - Performance improvement (e.g., `perf(pdf): optimize page parsing`)
- `test` - Testing (e.g., `test(pdf): add integration tests`)
- `chore` - Maintenance (e.g., `chore: update dependencies`)
**Examples**:
```
feat(pdf): implement PDF to Markdown conversion

- Add PDF parser using lopdf
- Extract text and images from pages
- Generate structured Markdown output
- Add unit tests

Closes #12
```
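The header format above is regular enough to check mechanically. The following is a minimal sketch of such a check, assuming only the eight types listed in this document; `CommitHeader` and `parse_commit_header` are hypothetical names, and the parser deliberately ignores parts of Conventional Commits not mentioned here, such as the `!` breaking-change marker.

```rust
/// Commit types listed in this document.
const COMMIT_TYPES: [&str; 8] = [
    "feat", "fix", "docs", "style", "refactor", "perf", "test", "chore",
];

/// Parsed form of a `type(scope): subject` header line.
#[derive(Debug, PartialEq, Eq)]
pub struct CommitHeader<'a> {
    pub kind: &'a str,
    pub scope: Option<&'a str>,
    pub subject: &'a str,
}

/// Parses the first line of a commit message; returns `None` if it does not match.
pub fn parse_commit_header(line: &str) -> Option<CommitHeader<'_>> {
    // Split at the first ": " into the `type(scope)` prefix and the subject.
    let (prefix, subject) = line.split_once(": ")?;
    let (kind, scope) = match prefix.split_once('(') {
        // A scope must be closed by `)`; otherwise the header is rejected.
        Some((kind, rest)) => (kind, Some(rest.strip_suffix(')')?)),
        None => (prefix, None),
    };
    let known_type = COMMIT_TYPES.iter().any(|t| *t == kind);
    if !known_type || scope == Some("") || subject.trim().is_empty() {
        return None;
    }
    Some(CommitHeader { kind, scope, subject })
}
```

A check like this could run in a commit hook or CI step, although the document does not say whether the project does so.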
### Pre-Commit Checklist
- [ ] All tests pass (`cargo test`)
- [ ] No clippy warnings (`cargo clippy -- -D warnings`)
- [ ] Code formatted (`cargo fmt`)
- [ ] Documentation updated
- [ ] `docs/CHANGELOG.md` updated (if user-facing)
- [ ] No debug code or `println!` statements
- [ ] No secrets or credentials
- [ ] Coverage > 90%
### Commit Workflow
```bash
# Run tests
cargo test
# Run clippy
cargo clippy -- -D warnings
# Format code
cargo fmt
# Stage changes
git add .
# Commit with message
git commit -m "feat(pdf): implement PDF converter"
# Push to remote
git push origin feature/pdf-converter
```
---
## Task Queue Integration
### Task States
1. `PENDING` - Task created, not started
2. `IN_PROGRESS` - Currently working
3. `REVIEW` - Awaiting peer review
4. `REVISION` - Needs changes
5. `COMPLETED` - Finished and approved
6. `BLOCKED` - Cannot proceed
### Update Protocol
Update Task Queue at these points:
1. Task start: `PENDING` → `IN_PROGRESS`
2. Code complete: `IN_PROGRESS` → `REVIEW`
3. Review feedback: `REVIEW` → `REVISION`
4. Re-submission: `REVISION` → `REVIEW`
5. Approval: `REVIEW` → `COMPLETED`
Include task ID in commit messages: `feat(pdf): implement converter [TASK-123]`
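The update protocol above amounts to a small state machine. A sketch, assuming a hypothetical `TaskState` enum: it encodes only the five transitions listed, and because the document does not say how tasks enter or leave `BLOCKED`, no transition involving that state is allowed here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Pending,
    InProgress,
    Review,
    Revision,
    Completed,
    Blocked,
}

impl TaskState {
    /// True only for the five transitions listed in the update protocol.
    pub fn can_transition_to(self, next: TaskState) -> bool {
        use TaskState::*;
        matches!(
            (self, next),
            (Pending, InProgress)
                | (InProgress, Review)
                | (Review, Revision)
                | (Revision, Review)
                | (Review, Completed)
        )
    }
}

/// Appends the task ID to a commit subject, e.g. `... [TASK-123]`.
pub fn with_task_id(subject: &str, task_id: u32) -> String {
    format!("{subject} [TASK-{task_id}]")
}
```

Encoding the allowed pairs in one `matches!` keeps the protocol readable and makes an unlisted jump, such as `PENDING` straight to `COMPLETED`, fail the check.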
---
## Vectorizer Integration
### Search-First Protocol
Before implementing features:
1. **Search Vectorizer** for existing documentation
2. Query: `vectorizer search --collection transmutation-docs --query "[question]"`
3. Review results and existing implementations
4. Only implement if no solution exists
### Upload Protocol
Upload documentation after:
1. **After Implementation**: Code docs and examples
2. **After Review**: Review reports
3. **After Approval**: User guides
### Collections
- `transmutation-docs` - All project documentation
- `transmutation-code` - Indexed source code
- `chat-history` - Chat history (auto-save at >90% context)
---
## Review Process
### Peer Review Requirements
- **2+ specialist agents** must review each feature
- Focus: code quality, tests, performance, security
- Timeline: 24-48 hours
### Review Checklist
- [ ] Code follows Rust best practices
- [ ] All public APIs have doc comments
- [ ] Tests pass with >90% coverage
- [ ] No clippy warnings
- [ ] Error handling is comprehensive
- [ ] Performance meets targets
- [ ] Security considerations addressed
- [ ] Documentation updated
### Requesting Review
```bash
# Push feature branch
git push origin feature/pdf-converter
# Update Task Queue to REVIEW status
# Notify reviewers with:
# - Link to feature specification (docs/specs/)
# - Link to tests
# - Summary of changes
```
---
## Project-Specific Rules
### Converter Development
1. **Implement the converter trait** defined in `converters/traits.rs`
2. **Create module** in `converters/[format].rs`
3. **Add tests** with sample files in `tests/fixtures/`
4. **Update registry** in `converters/mod.rs`
5. **Document** in `docs/ROADMAP.md`
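The trait, module, and registry steps above could fit together roughly as follows. This is a simplified sketch, not the project's actual API: the trait signature, `TxtConverter`, `Registry`, and `default_registry` are assumptions, the methods are synchronous, and errors are plain `String`s to keep the snippet dependency-free.

```rust
use std::collections::HashMap;

/// Hypothetical trait; the real one lives in `converters/traits.rs`.
pub trait DocumentConverter {
    /// File extension this converter handles, lowercase, without the dot.
    fn extension(&self) -> &'static str;
    /// Converts raw bytes to Markdown.
    fn to_markdown(&self, input: &[u8]) -> Result<String, String>;
}

/// Toy converter for plain text, standing in for a real format module.
pub struct TxtConverter;

impl DocumentConverter for TxtConverter {
    fn extension(&self) -> &'static str {
        "txt"
    }

    fn to_markdown(&self, input: &[u8]) -> Result<String, String> {
        String::from_utf8(input.to_vec()).map_err(|e| e.to_string())
    }
}

/// Registry keyed by extension, as a stand-in for `converters/mod.rs`.
#[derive(Default)]
pub struct Registry {
    converters: HashMap<&'static str, Box<dyn DocumentConverter>>,
}

impl Registry {
    pub fn register(&mut self, converter: Box<dyn DocumentConverter>) {
        self.converters.insert(converter.extension(), converter);
    }

    /// Looks up the converter for `extension` and runs it on `input`.
    pub fn convert(&self, extension: &str, input: &[u8]) -> Result<String, String> {
        let converter = self
            .converters
            .get(extension)
            .ok_or_else(|| format!("Unsupported format: {extension}"))?;
        converter.to_markdown(input)
    }
}

/// Builds a registry with every converter known to this sketch.
pub fn default_registry() -> Registry {
    let mut registry = Registry::default();
    registry.register(Box::new(TxtConverter));
    registry
}
```

With this shape, adding a format means writing one new type that implements the trait and adding one `register` call.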
### Output Format Handling
1. **Create handler** in `output/[format].rs`
2. **Implement serialization** with `serde`
3. **Optimize output** for LLM processing
4. **Add tests** for format correctness
### Pure Rust Requirement
- **No Python dependencies** in core functionality
- **No C/C++ dependencies** unless optional (features)
- OCR (Tesseract) and FFmpeg are **optional features**
- Core converters must be **100% Rust**
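One way to keep non-Rust backends optional is conditional compilation behind a Cargo feature. The sketch below assumes a hypothetical feature named `ocr`; the function names and error messages are illustrative, and the feature-enabled branch is a stub rather than a real Tesseract binding.

```rust
/// Compiled only when the (hypothetical) `ocr` Cargo feature is enabled.
#[cfg(feature = "ocr")]
pub fn extract_text_from_image(_image: &[u8]) -> Result<String, String> {
    // A real build would call into the optional OCR backend here.
    Err("OCR backend not wired up in this sketch".to_string())
}

/// Pure-Rust fallback used when the `ocr` feature is disabled.
#[cfg(not(feature = "ocr"))]
pub fn extract_text_from_image(_image: &[u8]) -> Result<String, String> {
    Err("OCR support requires the `ocr` feature".to_string())
}

/// Reports whether OCR support was compiled in.
pub fn ocr_enabled() -> bool {
    cfg!(feature = "ocr")
}
```

Because exactly one of the two definitions is compiled, callers see a single function either way, and a default build links no C/C++ code.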
### Performance Testing
```bash
# Run benchmarks
cargo bench
# Profile with flamegraph (Linux)
cargo flamegraph --bin transmutation
# Memory profiling
heaptrack ./target/release/transmutation
```
---
## Development Workflow
### 1. Feature Development Cycle
```bash
# 1. Create branch
git checkout -b feature/xlsx-converter
# 2. Read specification
# docs/specs/xlsx-converter.md
# 3. Update ROADMAP (mark [~])
# docs/ROADMAP.md
# 4. Implement feature
# src/converters/xlsx.rs
# 5. Write tests
# tests/integration/test_xlsx.rs
# 6. Run tests
cargo test
# 7. Run clippy
cargo clippy -- -D warnings
# 8. Format code
cargo fmt
# 9. Update CHANGELOG
# docs/CHANGELOG.md
# 10. Commit
git add .
git commit -m "feat(xlsx): implement XLSX to Markdown converter"
# 11. Push and request review
git push origin feature/xlsx-converter
```
### 2. Bug Fix Cycle
```bash
# 1. Create branch
git checkout -b fix/123-pdf-memory-leak
# 2. Write failing test
# tests/integration/test_pdf.rs
# 3. Fix bug
# src/converters/pdf.rs
# 4. Verify test passes
cargo test
# 5. Update CHANGELOG
# docs/CHANGELOG.md
# 6. Stage, commit, and push
git add .
git commit -m "fix(pdf): resolve memory leak in page parsing [TASK-123]"
git push origin fix/123-pdf-memory-leak
```
---
## CLI Development
### CLI Tool (optional feature)
- Use `clap` with derive macros
- Provide progress bars with `indicatif`
- Use `colored` for terminal output
- Handle signals gracefully (Ctrl+C)
**Example**:
```rust
use clap::Parser;
#[derive(Parser)]
#[command(name = "transmutation")]
#[command(about = "High-performance document conversion engine")]
struct Cli {
    /// Input file path
    #[arg(short, long)]
    input: String,
    /// Output format (markdown, json, images)
    #[arg(short, long, default_value = "markdown")]
    format: String,
    /// Output directory
    #[arg(short, long, default_value = "output")]
    output: String,
}
```
---
## Logging and Tracing
### Use `tracing` for Structured Logging
```rust
use tracing::{debug, error, info};
// At function entry: `instrument` opens a span; `skip(data)` keeps the raw bytes out of it
#[tracing::instrument(skip(data))]
async fn convert_document(path: &str, data: Vec<u8>) -> Result<String> {
    info!(path = %path, size = data.len(), "Starting conversion");
    // During processing
    debug!("Parsing document structure");
    // On errors: log, then propagate; on success keep the parsed output
    let result = match parse_document(&data) {
        Ok(parsed) => parsed,
        Err(e) => {
            error!(error = %e, "Failed to parse document");
            return Err(e);
        }
    };
    // On completion
    info!("Conversion completed successfully");
    Ok(result)
}
```
### Log Levels
- `trace` - Very detailed debugging
- `debug` - Debugging information
- `info` - General information
- `warn` - Warnings (recoverable issues)
- `error` - Errors (failed operations)
### Configure with Environment
```bash
RUST_LOG=transmutation=debug cargo run
```
---
## Deployment and Release
### Building
```bash
# Debug build
cargo build
# Release build
cargo build --release
# With all features
cargo build --release --all-features
# Cross-compilation
rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl
```
### Publishing to crates.io
```bash
# Login
cargo login [api-token]
# Dry run
cargo publish --dry-run
# Publish
cargo publish
```
### Pre-Release Checklist
- [ ] All tests passing
- [ ] Documentation complete
- [ ] Examples provided
- [ ] `docs/CHANGELOG.md` updated
- [ ] Version bumped in `Cargo.toml`
- [ ] Git tag created
- [ ] No private dependencies
- [ ] Benchmarks run
- [ ] Security audit passed (`cargo audit`)
---
## Continuous Integration
### GitHub Actions
- Run tests on push/PR
- Run clippy with `-D warnings`
- Check formatting with `cargo fmt --check`
- Generate coverage reports
- Run benchmarks on main branch
- Security audit with `cargo audit`
---
## References
### Official Documentation
- [The Rust Programming Language](https://doc.rust-lang.org/book/)
- [Rust API Guidelines](https://rust-lang.github.io/api-guidelines/)
- [Tokio Documentation](https://tokio.rs/)
### Project Documentation
- `docs/ROADMAP.md` - Development roadmap
- `docs/CHANGELOG.md` - Change history
- `gov/manuals/AI_INTEGRATION_MANUAL_TEMPLATE.md` - General AI integration guide
- `gov/manuals/rust/AI_INTEGRATION_MANUAL_RUST.md` - Rust-specific guide
- `gov/manuals/rust/BEST_PRACTICES.md` - Rust best practices
### HiveLLM Ecosystem
- Task Queue: `http://localhost:8080`
- Vectorizer: `http://localhost:15002`
---
## Quick Commands Reference
```bash
# Development
cargo check # Fast compile check
cargo build # Debug build
cargo run # Run binary
cargo test # Run tests
cargo bench # Run benchmarks
# Code Quality
cargo fmt # Format code
cargo clippy -- -D warnings # Lint with strict warnings
cargo audit # Security audit
# Documentation
cargo doc --open # Generate and open docs
cargo doc --no-deps # Docs without dependencies
# Release
cargo build --release # Optimized build
cargo publish --dry-run # Test publish
cargo publish # Publish to crates.io
# Coverage (Linux)
cargo tarpaulin --out Html # Generate coverage report
# Coverage (all platforms)
cargo llvm-cov --html # Alternative coverage tool
```
---
## Context Management
### At >90% Context
1. Save chat history to Vectorizer:
   - Collection: `chat-history`
   - Include full transcript
2. Create summary in `chat-summary`
3. Continue work in new context
---
## Special Instructions
### When Implementing Features
1. Read specification in `docs/specs/[feature].md`
2. Check existing code in Vectorizer first
3. Implement following Rust best practices
4. Write comprehensive tests (>90% coverage)
5. Document all public APIs
6. Update `docs/ROADMAP.md` status
7. Request peer review (2+ agents)
### When Fixing Bugs
1. Write failing test first
2. Fix the bug
3. Verify test passes
4. Add regression test
5. Update `docs/CHANGELOG.md`
### When Asked Questions
1. Search Vectorizer first
2. Check existing documentation
3. Refer to Rust API Guidelines
4. Provide working code examples
---
**Remember**: This is a **pure Rust** project building a **high-performance alternative to Docling**. Focus on speed, efficiency, and zero runtime dependencies. Every feature must be faster and lighter than Python equivalents.
**NO PYTHON. NO C++. PURE RUST ONLY** (except optional FFI for Tesseract/FFmpeg features).
---
**Version**: 1.0.0
**Last Updated**: 2025-10-12
**Maintained by**: HiveLLM Team