# llm-transpile

Token-optimized document transpiler for LLM pipelines.
Raw documents (Markdown, HTML, plain text) → structured bridge format `<D>?<H><B>` — with adaptive compression that keeps you under a token budget.
```text
<H>
t: Software License Agreement
s: Annual license terms between licensor and licensee
k: [license, contract, software]
</H>
<B>
# Contracting Parties
This agreement is made between Licensor and Licensee.
...
</B>
```
## Table of Contents
- Why
- Installation
- CLI Usage
- Library Usage
- Output Format
- Fidelity Levels
- Adaptive Compression
- Input Formats
- Error Handling
- Performance
- Contributing
- License
## Why

LLMs perform better when context is clean and dense. This library handles the mechanical work:

- Structural parsing — Markdown/HTML/plain text → typed IR nodes (headings, paragraphs, tables, lists, code blocks)
- Adaptive compression — automatically escalates through 4 stages as the token budget fills up
- Symbol substitution — repeated domain terms → Unicode PUA characters, decoded by a `<D>` dictionary header
- Table linearization — Markdown tables → compact `Key:Val` sequences (≤5 rows) or pipe-separated rows (`h1|h2\nv1|v2`) for larger tables
- Streaming output — Tokio stream delivers the first chunk immediately, minimizing TTFT
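The table-linearization rule can be sketched in a few lines. This is a hypothetical illustration of the two output shapes (`Key:Val` for small two-column tables, pipe-separated rows otherwise), not the crate's actual implementation:

```rust
/// Linearize a table: small two-column tables (≤5 rows) become compact
/// Key:Val sequences; larger tables become pipe-separated header + rows.
/// Naive sketch — the real renderer's selection rules may differ.
fn linearize(headers: &[&str], rows: &[Vec<&str>]) -> String {
    if rows.len() <= 5 && headers.len() == 2 {
        rows.iter()
            .map(|r| format!("{}:{}", r[0], r[1]))
            .collect::<Vec<_>>()
            .join(" ")
    } else {
        let mut lines = vec![headers.join("|")];
        lines.extend(rows.iter().map(|r| r.join("|")));
        lines.join("\n")
    }
}
```

Both shapes drop the Markdown separator row and per-cell padding, which is where most of the token savings come from.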
## Installation

### Library (Rust crate)

```toml
[dependencies]
llm-transpile = "0.1"
```

Requires Rust 1.75+.
### CLI binary + tool integration

```sh
# Homebrew (macOS)
# Pre-built binary (faster, no compile)
# From crates.io
```
Then configure tool integrations:

`transpile install` launches an interactive wizard that detects and configures whichever tools are installed:

| Tool | Integration method | What it does |
|---|---|---|
| Claude Code | PostToolUse hook | Auto-compresses `.md`/`.html`/`.txt` files on Read |
| Gemini CLI | SKILL.md | LLM auto-invokes transpile on document file extensions |
| Codex CLI | SKILL.md | LLM auto-invokes transpile on document file extensions |
| Cursor | `.mdc` rule (alwaysApply) | Triggers transpile before reading document files |
| OpenCode | SKILL.md | LLM auto-invokes transpile on document file extensions |

All non-Claude tools use a skill file that teaches the LLM to run `transpile --input <file>` automatically — no size check needed; the extension alone triggers it.
### Selective install / uninstall
### Claude Code plugin

```text
/plugin marketplace add epicsagas/claude-plugins
/plugin install transpile@epicsagas
```
Or from source:
## CLI Usage

```text
transpile [OPTIONS]

Options:
  -i, --input <FILE>       Input file path (reads from stdin if omitted)
  -f, --format <FORMAT>    Input format: markdown | html | plaintext [default: markdown]
                           Auto-detected from file extension when --input is used
  -l, --fidelity <LEVEL>   Compression level: lossless | semantic | compressed [default: semantic]
  -b, --budget <N>         Token budget upper limit (unlimited if omitted)
  -c, --count              Print only the input token count, then exit
  -j, --json               Output as JSON {input_tok, output_tok, reduction_pct, content}
  -q, --quiet              Suppress the stats line on stderr
      --stats              Print stats line to stdout after content (single-stream capture)
  -h, --help               Print help
  -V, --version            Print version
```
### Examples

```sh
# Convert a Markdown file (format auto-detected from .md extension)
transpile --input doc.md

# Read from stdin — clean stdout, stats on stderr
cat doc.md | transpile

# Pipe cleanly — suppress stats entirely
cat doc.md | transpile --quiet

# Check token count without converting
transpile --input doc.md --count

# JSON output for scripts and pipelines
cat doc.md | transpile --json

# Capture content + stats in one stream (stdout)
transpile --input doc.md --stats

# Lossless — no compression, full content preserved (legal/audit docs)
transpile --input doc.md --fidelity lossless

# Aggressive compression into a 512-token budget
transpile --input doc.md --fidelity compressed --budget 512
```
Stats (`[273 → 150 tok 45.1% reduction]`) are written to stderr by default, so stdout stays clean for piping. Use `--quiet` to suppress them, or `--stats` to redirect them to stdout.
## Library Usage

### Synchronous

```rust
use llm_transpile::{transpile, FidelityLevel, InputFormat};

let md = r#"
# Software License Agreement
This agreement is made between Licensor and Licensee.

| Item     | Cost  |
|----------|-------|
| Base fee | $800  |
| Support  | $200  |
"#;

// Call shape shown here is illustrative — check the crate docs for the exact signature.
let output = transpile(md, InputFormat::Markdown, FidelityLevel::Semantic, None)?;
println!("{output}");
```
### Streaming (Tokio)

```rust
use llm_transpile::transpile_stream;
use futures::StreamExt;

// Argument shape is illustrative — check the crate docs for the exact signature.
let mut stream = transpile_stream(md, InputFormat::Markdown, FidelityLevel::Semantic, None).await;
while let Some(chunk) = stream.next().await {
    print!("{}", chunk?);
}
```
### Token count estimate

```rust
let n = llm_transpile::token_count(md);
```
## Output Format

```text
<D>            ← Symbol dictionary (omitted when no substitutions occur)
{sym}=repeated-term
</D>
<H>            ← YAML-like metadata header
t: document title
s: one-line summary
k: [keyword1, keyword2]
</H>
<B>            ← Document body (compressed + substituted)
...content...
</B>
```

The `<D>` block uses Unicode Private Use Area characters (U+E000–U+F8FF) as compact symbol handles, avoiding collision with visible text patterns. The dictionary supports up to 6,400 unique terms per document.
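A minimal sketch of the substitution idea, assuming naive exact-match replacement (the crate's actual term selection and scoring are not shown here):

```rust
/// Replace repeated terms with Private Use Area symbols and record the
/// mapping — the same mapping a `<D>` header would expose for decoding.
/// Naive sketch: exact string matching, no term scoring.
fn substitute(text: &str, terms: &[&str]) -> (String, Vec<(char, String)>) {
    let mut out = text.to_string();
    let mut dict: Vec<(char, String)> = Vec::new();
    for &term in terms {
        // Only substitute terms that actually repeat
        if out.matches(term).count() < 2 {
            continue;
        }
        // PUA starts at U+E000; 0xE000..=0xF8FF yields the 6,400 handles
        let sym = char::from_u32(0xE000 + dict.len() as u32).expect("within PUA range");
        out = out.replace(term, &sym.to_string());
        dict.push((sym, term.to_string()));
    }
    (out, dict)
}
```

Each multi-character term collapses to a single codepoint in the body, and the dictionary entry (`{sym}=repeated-term`) lets the consuming LLM decode it.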
## Fidelity Levels

| Level | Typical use case | Compression applied |
|---|---|---|
| Lossless | Legal / audit documents | None — original content guaranteed |
| Semantic | General RAG pipelines | Stopword removal + low-importance pruning |
| Compressed | Summarization, tight budgets | Maximum compression, first-sentence extraction |
## Adaptive Compression

The compressor monitors budget usage in real time and escalates automatically:

| Budget usage | Stage | What happens |
|---|---|---|
| 0–60% | StopwordOnly | English/Korean stopwords stripped |
| 60–80% | PruneLowImportance | Bottom 20% of paragraphs by importance score removed |
| 80–95% | DeduplicateAndLinearize | Duplicate sentences removed; tables linearized |
| 95%+ | MaxCompression | Each paragraph truncated to first sentence |

`Lossless` mode bypasses all compression stages unconditionally.

During streaming, when budget usage crosses 80%, remaining nodes are automatically switched to Compressed mode.
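The thresholds in the table map directly onto a stage selector. A minimal sketch, using the stage names from the table (the selection logic itself is assumed, not taken from the crate):

```rust
#[derive(Debug, PartialEq)]
enum Stage {
    StopwordOnly,
    PruneLowImportance,
    DeduplicateAndLinearize,
    MaxCompression,
}

/// Pick a compression stage from current budget usage (0.0–1.0),
/// following the escalation thresholds in the table above.
fn stage_for(usage: f64) -> Stage {
    match usage {
        u if u < 0.60 => Stage::StopwordOnly,
        u if u < 0.80 => Stage::PruneLowImportance,
        u if u < 0.95 => Stage::DeduplicateAndLinearize,
        _ => Stage::MaxCompression,
    }
}
```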
## Input Formats

| InputFormat | Parser |
|---|---|
| Markdown | pulldown-cmark — CommonMark + GFM tables |
| Html | ammonia sanitization → tag stripping → plain-text pipeline |
| PlainText | Blank-line paragraph splitting |
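Blank-line paragraph splitting for PlainText can be sketched in a few lines (illustrative only; the crate's parser may also normalize `\r\n` and runs of blank lines):

```rust
/// Split plain text into paragraphs on blank lines, dropping empty chunks.
fn split_paragraphs(text: &str) -> Vec<&str> {
    text.split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect()
}
```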
## Error Handling

```rust
use llm_transpile::TranspileError;

// `err` is a TranspileError; variants are crate-specific, the match shape is illustrative.
match transpile(md, InputFormat::Markdown, FidelityLevel::Semantic, None) {
    Ok(output) => println!("{output}"),
    Err(err) => eprintln!("transpile failed: {err}"),
}
```
## Performance
Measured on release build (cargo build --release), Apple M-series, 48 documents across Markdown / HTML / PlainText:
| Metric | Measured | Notes |
|---|---|---|
| Throughput | 10,975 tok/ms | ≈75× faster than Python parsing baseline |
| Semantic reduction | 33.9% (Markdown) | 15–30% target met |
| Compressed reduction | 39.7% (Markdown) | Budget-adaptive, guaranteed ≥ PruneLowImportance |
| Lossless word coverage | 98.8% avg | Across all formats and languages |
| HTML reduction | 97.6% | Reflects markup overhead removal (nav/scripts/styles) |
| Multilingual support | 15 languages tested | AR/DE/ES/FR/HI/IT/JA/KO/NL/PL/PT/RU/SV/TR/ZH — 99.4% avg word coverage |
Run the evaluation suite yourself:
## Contributing
Contributions are welcome — bug reports, feature requests, and pull requests.
```sh
# Clone and build
cargo build

# Run tests
cargo test

# Run benchmarks (HTML report → target/criterion/)
cargo bench

# Lint and format
cargo clippy -- -D warnings
cargo fmt
```
### Guidelines

- Keep MSRV at Rust 1.75 — avoid features introduced after that.
- New compression behavior must not affect `Lossless` mode.
- Each PR should include tests for any new logic in the relevant module (`ir`, `compressor`, `symbol`, `renderer`).
- Run `cargo clippy -- -D warnings` and `cargo fmt` before submitting.
## License
Apache-2.0 — see LICENSE.