# llm-utl
[crates.io](https://crates.io/crates/llm-utl) · [docs.rs](https://docs.rs/llm-utl) · [MIT License](LICENSE)
Transform code repositories into LLM-friendly prompts with intelligent chunking and filtering: llm-utl converts your codebase into optimally sized, formatted prompt files ready for Large Language Models such as Claude, GPT-4, or other AI assistants.
## Features
- **Zero-config** - Works out of the box with sensible defaults
- **Type-safe API** - Fluent, compile-time checked interface with presets
- **Smart Chunking** - Automatically splits large codebases into optimal token-sized chunks with overlap
- **Presets** - Optimized configurations for common tasks (code review, documentation, security audit)
- **Code Filtering** - Removes tests, comments, debug prints, and other noise from code
- **Multiple Formats** - Output to Markdown, XML, or JSON
- **Fast** - Parallel file scanning with multi-threaded processing (~1000 files/second)
- **Gitignore Support** - Respects `.gitignore` files automatically
- **Multi-Language** - Built-in filters for Rust, Python, JavaScript/TypeScript, Go, Java, C/C++
- **Robust** - Comprehensive error handling with atomic file writes
## Installation
### As a CLI Tool
```bash
cargo install llm-utl
```
### As a Library
Add to your `Cargo.toml`:
```toml
[dependencies]
llm-utl = "0.1.0"
```
## Quick Start
### Command Line Usage
Basic usage:
```bash
# Convert current directory to prompts
llm-utl
# Specify input and output directories
llm-utl --dir ./src --out ./prompts
# Configure token limits and format
llm-utl --max-tokens 50000 --format xml
# Dry run to preview what would be generated
llm-utl --dry-run
```
All options:
```bash
llm-utl [OPTIONS]
Options:
  -d, --dir <DIR>              Root directory to scan [default: .]
  -o, --out <OUT>              Output directory [default: out]
      --pattern <PATTERN>      Output filename pattern [default: prompt_{index:03}.{ext}]
  -f, --format <FORMAT>        Output format [default: markdown] [possible values: markdown, xml, json]
      --max-tokens <TOKENS>    Max tokens per chunk [default: 100000]
      --overlap <TOKENS>       Overlap tokens between chunks [default: 1000]
      --tokenizer <TOKENIZER>  Tokenizer to use [default: enhanced] [possible values: simple, enhanced]
      --dry-run                Dry run (don't write files)
  -v, --verbose                Verbose output (use -vv for trace level)
  -h, --help                   Print help
  -V, --version                Print version
```
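For example, here is one way to combine these flags; the `{index:03}` and `{ext}` placeholders are the ones from the default pattern shown above:
```bash
# Write XML chunks named chunk_001.xml, chunk_002.xml, ... into ./prompts
llm-utl --dir ./src --out ./prompts \
  --format xml \
  --pattern "chunk_{index:03}.{ext}" \
  --max-tokens 50000 --overlap 500 \
  --tokenizer simple
```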
### Library Usage
#### Simple API (Recommended)
The `Scan` API provides a fluent, type-safe interface:
```rust
use llm_utl::{Format, Scan};
// Simplest usage - scan current directory
llm_utl::scan()?;
// Scan specific directory
Scan::dir("./src").run()?;
// Use a preset for common tasks
Scan::dir("./src")
.code_review()
.run()?;
// Custom configuration
Scan::dir("./project")
.output("./prompts")
.max_tokens(200_000)
.format(Format::Json)
.keep_tests()
.run()?;
```
#### Using Presets
Presets provide optimized configurations for specific tasks:
```rust
use llm_utl::Scan;
// Code review - removes tests, comments, debug prints
Scan::dir("./src")
.code_review()
.run()?;
// Documentation - keeps all comments and docs
Scan::dir("./project")
.documentation()
.run()?;
// Security audit - includes everything
Scan::dir("./src")
.security_audit()
.run()?;
// Bug analysis - focuses on logic
Scan::dir("./src")
.bug_analysis()
.run()?;
```
#### Advanced API
For complex scenarios, use the full `Pipeline` API:
```rust
use llm_utl::{Config, Pipeline, OutputFormat};
fn main() -> anyhow::Result<()> {
let config = Config::builder()
.root_dir("./src")
.output_dir("./prompts")
.format(OutputFormat::Markdown)
.max_tokens(100_000)
.overlap_tokens(1_000)
.build()?;
let stats = Pipeline::new(config)?.run()?;
println!("Processed {} files into {} chunks",
stats.total_files,
stats.total_chunks
);
Ok(())
}
```
## Advanced Configuration
### Code Filtering
Control what gets removed from your code:
```rust
use llm_utl::{Config, FilterConfig};
let config = Config::builder()
.root_dir(".")
.filter_config(FilterConfig {
remove_tests: true,
remove_doc_comments: false, // Keep documentation
remove_comments: true,
remove_blank_lines: true,
preserve_headers: true,
remove_debug_prints: true, // Remove println!, dbg!, etc.
})
.build()?;
```
Or use presets:
```rust
use llm_utl::FilterConfig;
// Minimal - remove everything except code
let minimal = FilterConfig::minimal();
// Preserve docs - keep documentation comments
let with_docs = FilterConfig::preserve_docs();
// Production - ready for production review
let production = FilterConfig::production();
```
### File Filtering
Include or exclude specific files and directories:
```rust
use llm_utl::{Config, FileFilterConfig};
let config = Config::builder()
.root_dir(".")
.file_filter_config(
FileFilterConfig::default()
.exclude_directories(vec![
"**/target".to_string(),
"**/node_modules".to_string(),
"**/.git".to_string(),
])
.exclude_files(vec!["*.lock".to_string()])
// Or whitelist specific files (use glob patterns with **/):
// .allow_only(vec!["**/*.rs".to_string(), "**/*.toml".to_string()])
)
.build()?;
```
**Important**: When using `.allow_only()`, use glob patterns like `**/*.rs` instead of `*.rs` to match files in all subdirectories. The pattern `*.rs` only matches files in the root directory.
### Custom Tokenizers
Choose between simple and enhanced tokenization:
```rust
use llm_utl::{Config, TokenizerKind};
let config = Config::builder()
.root_dir(".")
.tokenizer(TokenizerKind::Enhanced) // More accurate
// .tokenizer(TokenizerKind::Simple) // Faster, ~4 chars per token
.build()?;
```
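As a rough mental model for the simple tokenizer's "~4 characters per token" heuristic, you can estimate chunk budgets by hand. This is a back-of-the-envelope sketch, not the library's actual tokenizer:
```rust
// Back-of-the-envelope estimate only: assumes ~4 characters per token,
// so roughly 400 kB of source fills the default 100,000-token chunk budget.
fn approx_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}

fn main() {
    let src = std::fs::read_to_string("src/main.rs").unwrap_or_default();
    println!("~{} tokens", approx_tokens(&src));
}
```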
## Working with Statistics
The `PipelineStats` struct provides detailed information about the scanning process:
```rust
use llm_utl::Scan;

let stats = Scan::dir("./src").run()?;
// File counts
println!("Total files: {}", stats.total_files);
println!("Text files: {}", stats.text_files);
println!("Binary files: {}", stats.binary_files);
// Chunks
println!("Total chunks: {}", stats.total_chunks);
println!("Avg chunk size: {} tokens", stats.avg_tokens_per_chunk);
println!("Max chunk size: {} tokens", stats.max_chunk_tokens);
// Performance
println!("Duration: {:.2}s", stats.duration.as_secs_f64());
println!("Throughput: {:.0} tokens/sec",
stats.throughput_tokens_per_sec()
);
// Output
println!("Output directory: {}", stats.output_directory);
println!("Files written: {}", stats.files_written);
```
## Design Philosophy
### Progressive Disclosure
Start simple, add complexity only when needed:
1. **Level 1**: `llm_utl::scan()` - Zero config, works immediately
2. **Level 2**: `Scan::dir("path").code_review()` - Use presets for common tasks
3. **Level 3**: `Scan::dir("./src").keep_tests().exclude([...])` - Fine-grained control
4. **Level 4**: Full `Config` API - Maximum flexibility
### Type Safety
All options are compile-time checked:
```rust
// This won't compile - caught at compile time
Scan::dir("./src")
.format("invalid"); // Error: expected Format enum
// Correct usage
Scan::dir("./src")
.format(Format::Json);
```
### Sensible Defaults
Works well without configuration:
- Excludes common directories (`node_modules`, `target`, `.git`, etc.)
- Removes noise (tests, comments, debug prints)
- Uses efficient token limits (100,000 per chunk)
- Provides clear, actionable error messages
### Fluent Interface
Natural, readable API:
```rust
Scan::dir("./src")
.code_review()
.output("./review")
.max_tokens(200_000)
.keep_tests()
.run()?;
```
## Output Formats
### Markdown (Default)
````markdown
# Chunk 1/3 (45,234 tokens)
## File: src/main.rs (1,234 tokens)
```rust
fn main() {
println!("Hello, world!");
}
```
````
### XML
```xml
<?xml version="1.0" encoding="UTF-8"?>
<chunk index="1" total="3">
<file path="src/main.rs" tokens="1234">
<![CDATA[
fn main() {
println!("Hello, world!");
}
]]>
</file>
</chunk>
```
### JSON
```json
{
"chunk_index": 1,
"total_chunks": 3,
"total_tokens": 45234,
"files": [
{
"path": "src/main.rs",
"tokens": 1234,
"content": "fn main() {\n println!(\"Hello, world!\");\n}"
}
]
}
```
## Custom Templates
llm-utl supports custom Tera templates for maximum flexibility in output formatting.
### Using Custom Templates
#### Override Built-in Templates
Replace default templates with your own:
```rust
use llm_utl::api::*;
Scan::dir("./src")
.format(Format::Markdown)
.template("./my-markdown.tera")
.run()?;
```
CLI usage:
```bash
llm-utl --dir ./src --format markdown --template ./my-markdown.tera
```
#### Create Custom Formats
Define completely custom output formats:
```rust
use llm_utl::api::*;
use serde_json::json;
Scan::dir("./src")
.custom_format("my_format", "txt")
.template("./custom.tera")
.template_data("version", json!("1.0.0"))
.template_data("project", json!("My Project"))
.template_data("author", json!("John Doe"))
.run()?;
```
CLI usage:
```bash
llm-utl --dir ./src \
--format custom \
--format-name my_format \
--ext txt \
--template ./custom.tera \
--template-data version=1.0.0 \
--template-data project="My Project" \
--template-data author="John Doe"
```
### Template Variables
Your templates have access to the following context:
```tera
{# Chunk information #}
{{ ctx.chunk_index }} {# Current chunk number (1-based) #}
{{ ctx.total_chunks }} {# Total number of chunks #}
{{ ctx.chunk_files }} {# Files in this chunk #}
{{ ctx.total_tokens }} {# Token count for chunk #}
{# Files array #}
{% for file in ctx.files %}
{{ file.path }} {# Absolute path #}
{{ file.relative_path }} {# Relative path #}
{{ file.content }} {# File contents (None for binary) #}
{{ file.is_binary }} {# Boolean flag #}
{{ file.token_count }} {# Estimated tokens #}
{{ file.lines }} {# Line count (None for binary) #}
{% endfor %}
{# Metadata #}
{{ ctx.metadata.generated_at }} {# Timestamp #}
{{ ctx.metadata.format }} {# Output format #}
{# Custom data (if provided) #}
{{ ctx.custom.version }}
{{ ctx.custom.project }}
{{ ctx.custom.author }}
{# Preset info (if using a preset) #}
{{ ctx.preset.name }}
{{ ctx.preset.description }}
```
### Custom Filters
Built-in Tera filters available in templates:
```tera
{# XML escaping #}
{{ content | xml_escape }}
{# JSON encoding #}
{{ data | json_encode }}
{{ data | json_encode(pretty=true) }}
{# Truncate output #}
{{ content | truncate_lines(max=100) }}
{# Detect language from extension #}
{{ file.path | detect_language }}
```
### Example Custom Template
````tera
# {{ ctx.custom.project }} - Code Review
Version: {{ ctx.custom.version }}
Author: {{ ctx.custom.author }}
## Chunk {{ ctx.chunk_index }} of {{ ctx.total_chunks }}
{% for file in ctx.files %}
### File: {{ file.relative_path }}
Lines: {{ file.lines }}, Tokens: {{ file.token_count }}
```{% set ext = file.relative_path | split(pat=".") | last %}{{ ext }}
{{ file.content }}
```
{% endfor %}
---
Generated at: {{ ctx.metadata.generated_at }}
````
### Template Validation
Templates are validated automatically:
- File existence and readability
- Tera syntax correctness
- Required variables (chunk_index, total_chunks, files)
Invalid templates will produce clear error messages with suggested fixes.
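In practice this means a missing file or a syntax error in a custom template surfaces as an ordinary error from `run()`, so you can report it and fall back to the built-in template. A sketch:
```rust
use llm_utl::api::*;

// If the custom template fails validation, report the error and fall back
// to the built-in Markdown template.
if let Err(err) = Scan::dir("./src")
    .format(Format::Markdown)
    .template("./my-markdown.tera")
    .run()
{
    eprintln!("custom template rejected: {err}");
    Scan::dir("./src").format(Format::Markdown).run()?;
}
```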
### Advanced API Usage
For programmatic template configuration:
```rust
use llm_utl::{Config, OutputFormat, Pipeline};
use std::collections::HashMap;
use serde_json::Value;
let mut custom_data = HashMap::new();
custom_data.insert("version".to_string(), Value::String("1.0.0".to_string()));
custom_data.insert("project".to_string(), Value::String("My Project".to_string()));
let config = Config::builder()
.root_dir("./src")
.template_path("./my-template.tera")
.format(OutputFormat::Custom)
.custom_format_name("my_format")
.custom_extension("txt")
.custom_data(custom_data)
.build()?;
Pipeline::new(config)?.run()?;
```
## Use Cases
- **Code Review with AI** - Feed your codebase to Claude or GPT-4 for comprehensive reviews
- **Learning** - Generate study materials from large codebases
- **Documentation** - Create AI-friendly documentation sources
- **Analysis** - Prepare code for AI-powered analysis and insights
- **Training Data** - Generate datasets for fine-tuning models
## How It Works
The tool follows a 4-stage pipeline (a configuration sketch follows the list):
1. **Scanner** - Discovers files in parallel, respecting `.gitignore`
2. **Filter** - Removes noise (tests, comments, debug statements) using language-specific filters
3. **Splitter** - Intelligently chunks content based on token limits with overlap for context
4. **Writer** - Renders chunks using Tera templates with atomic file operations
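A sketch of how those stages map onto the builder options documented earlier in this README (only calls shown above are used):
```rust
use llm_utl::{Config, FileFilterConfig, FilterConfig, OutputFormat, Pipeline};

fn main() -> anyhow::Result<()> {
    let config = Config::builder()
        // 1. Scanner: where to walk (gitignore-aware, parallel)
        .root_dir("./src")
        .file_filter_config(
            FileFilterConfig::default()
                .exclude_directories(vec!["**/target".to_string()]),
        )
        // 2. Filter: strip tests, comments, and debug prints
        .filter_config(FilterConfig::minimal())
        // 3. Splitter: token budget per chunk, plus overlap for context
        .max_tokens(100_000)
        .overlap_tokens(1_000)
        // 4. Writer: render chunks through the chosen format's template
        .format(OutputFormat::Markdown)
        .output_dir("./prompts")
        .build()?;

    Pipeline::new(config)?.run()?;
    Ok(())
}
```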
## Performance
- Parallel file scanning using all CPU cores
- Streaming mode for large files (>10MB)
- Zero-copy operations where possible
- Optimized for minimal allocations
Typical performance: **~1000 files/second** on modern hardware.
## Supported Languages
Built-in filtering support for:
- Rust
- Python
- JavaScript/TypeScript (including JSX/TSX)
- Go
- Java/Kotlin
- C/C++
Other languages are processed as plain text.
## Real-World Examples
### Pre-commit Review
```rust
use llm_utl::Scan;
fn pre_commit_hook() -> llm_utl::Result<()> {
println!("Analyzing changes...");
let stats = Scan::dir("./src")
.code_review()
.output("./review")
.run()?;
println!("Review ready in {}", stats.output_directory);
Ok(())
}
```
### CI/CD Security Scan
```rust
use llm_utl::Scan;
fn ci_security_check() -> llm_utl::Result<()> {
let stats = Scan::dir("./src")
.security_audit()
.output("./security-reports")
.max_tokens(120_000)
.run()?;
if stats.total_files == 0 {
eprintln!("No files to scan");
std::process::exit(1);
}
println!("Scanned {} files", stats.total_files);
Ok(())
}
```
### Documentation Generation
```rust
use llm_utl::Scan;
fn generate_docs() -> llm_utl::Result<()> {
Scan::dir(".")
.documentation()
.output("./docs/ai-generated")
.run()?;
Ok(())
}
```
### Batch Processing
```rust
use llm_utl::Scan;
fn process_multiple_projects() -> llm_utl::Result<()> {
for project in ["./frontend", "./backend", "./mobile"] {
println!("Processing {project}...");
match Scan::dir(project).run() {
Ok(stats) => println!("  ok: {} files", stats.total_files),
Err(e) => eprintln!("  error: {e}"),
}
}
Ok(())
}
```
## More Examples
See the [examples](https://github.com/maxBogovick/llm-util/tree/master/examples) directory for more usage examples.
## Development
```bash
# Clone the repository
git clone https://github.com/maxBogovick/llm-util.git
cd llm-util
# Build
cargo build --release
# Run tests
cargo test
# Run with verbose logging
RUST_LOG=llm_utl=debug cargo run -- --dir ./src
# Format code
cargo fmt
# Lint
cargo clippy
```
## Troubleshooting
### "No processable files found" Error
If you see this error:
```
Error: No processable files found in '.'.
```
**Common causes:**
1. **Wrong directory**: The tool is running in an empty directory or a directory without source files.
```bash
# Wrong - running in home directory
cd ~
llm-utl
# Correct - specify your project directory
llm-utl --dir ./my-project
```
2. **All files are gitignored**: Your `.gitignore` excludes all files in the directory.
```bash
# Check what files would be scanned
llm-utl --dir ./my-project --dry-run -v
```
3. **No source files**: The directory contains only non-source files (images, binaries, etc.).
```bash
# Make sure directory contains code files
ls ./my-project/*.rs # or *.py, *.js, etc.
```
**Quick fix:**
```bash
# Always specify the directory containing your source code
llm-utl --dir ./path/to/your/project --out ./prompts
```
### Permission Issues
If you encounter permission errors:
```bash
# Ensure you have read access to source directory
# and write access to output directory
chmod +r ./src
chmod +w ./out
```
### Large Files
If processing is slow with very large files:
```bash
# Increase token limit for large codebases
llm-utl --max-tokens 200000
# Or use simple tokenizer for better performance
llm-utl --tokenizer simple
```
## FAQ
### How do I scan only specific file types?
Use the `Scan` API with exclusion patterns or the full `Config` API with custom file filters:
```rust
use llm_utl::{Config, FileFilterConfig, Pipeline};

let config = Config::builder()
    .root_dir("./src")
    .file_filter_config(
        FileFilterConfig::default()
            .allow_only(vec!["**/*.rs".to_string(), "**/*.toml".to_string()])
    )
    .build()?;

Pipeline::new(config)?.run()?;
```
### How do I handle very large codebases?
Increase token limits and adjust overlap:
```rust
Scan::dir("./large-project")
.max_tokens(500_000)
.overlap(5_000)
.run()?;
```
### Can I process multiple directories?
Yes, scan each separately or use a common parent:
```rust
for dir in ["./src", "./lib", "./bin"] {
Scan::dir(dir)
.output(&format!("./out/{}", dir.trim_start_matches("./")))
.run()?;
}
```
### How do I preserve everything for analysis?
Use the security audit preset or configure manually:
```rust
// Using preset
Scan::dir("./src")
.security_audit()
.run()?;
// Manual configuration
Scan::dir("./src")
.keep_tests()
.keep_comments()
.keep_doc_comments()
.keep_debug_prints()
.run()?;
```
### What are the available presets?
The library provides these presets (see the runtime-selection sketch after this list):
- **code_review** - Removes tests, comments, debug prints for clean code review
- **documentation** - Preserves all documentation and comments
- **security_audit** - Includes everything for comprehensive security analysis
- **bug_analysis** - Focuses on logic by removing noise
- **refactoring** - Optimized for refactoring tasks
- **test_generation** - Configured for generating tests
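If you want to select a preset at runtime (for example, from your own CLI flag), a small dispatcher over the builder methods shown earlier works. This helper is not part of the library and only covers the preset methods demonstrated above:
```rust
use llm_utl::Scan;

// Hypothetical helper: maps a preset name to the matching builder method.
fn run_preset(dir: &str, preset: &str) -> llm_utl::Result<()> {
    let scan = Scan::dir(dir);
    let scan = match preset {
        "documentation" => scan.documentation(),
        "security_audit" => scan.security_audit(),
        "bug_analysis" => scan.bug_analysis(),
        _ => scan.code_review(), // default to code review
    };
    let stats = scan.run()?;
    println!("{} files -> {} chunks", stats.total_files, stats.total_chunks);
    Ok(())
}
```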
## Platform Support
- Linux
- macOS
- Windows
All major platforms are supported and tested.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
Built with these excellent crates:
- [ignore](https://github.com/BurntSushi/ripgrep/tree/master/crates/ignore) - Fast gitignore-aware file walking
- [tera](https://github.com/Keats/tera) - Powerful template engine
- [clap](https://github.com/clap-rs/clap) - CLI argument parsing
- [tracing](https://github.com/tokio-rs/tracing) - Structured logging
## See Also
- [API Documentation](https://docs.rs/llm-utl)
- [Changelog](CHANGELOG.md)
- [Contributing Guidelines](CONTRIBUTING.md)