llm-utl
Transform code repositories into LLM-friendly prompts with intelligent chunking and filtering. Convert your codebase into optimally-chunked, formatted prompts ready for use with Large Language Models like Claude, GPT-4, or other AI assistants.
Features
- Zero-config - Works out of the box with sensible defaults
- Type-safe API - Fluent, compile-time checked interface with presets
- Smart Chunking - Automatically splits large codebases into optimally token-sized chunks with overlap
- Presets - Optimized configurations for common tasks (code review, documentation, security audit)
- Code Filtering - Removes tests, comments, debug prints, and other noise from code
- Multiple Formats - Output to Markdown, XML, or JSON
- Fast - Parallel file scanning with multi-threaded processing (~1000 files/second)
- Gitignore Support - Respects `.gitignore` files automatically
- Multi-Language - Built-in filters for Rust, Python, JavaScript/TypeScript, Go, Java, C/C++
- Robust - Comprehensive error handling with atomic file writes
Installation
As a CLI Tool
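If the crate is published to crates.io under the name used in this README, it can be installed with:

```bash
# assumes the published crate name matches this README
cargo install llm-utl
```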
As a Library
Add to your Cargo.toml:
```toml
[dependencies]
llm-utl = "0.1.0"
```
Quick Start
Command Line Usage
Basic usage covers converting the current directory to prompts, specifying input and output directories, configuring token limits and the output format, and doing a dry run to preview what would be generated. Run the CLI with `--help` for the full list of options.
Library Usage
Simple API (Recommended)
The Scan API provides a fluent, type-safe interface:
```rust
use llm_utl::{Format, Scan};

// Simplest usage - scan current directory
llm_utl::scan()?;

// Scan specific directory
Scan::dir("./src").run()?;

// Use a preset for common tasks
Scan::dir("./src")
    .code_review()
    .run()?;

// Custom configuration (the path, limit, and format values are illustrative)
Scan::dir("./src")
    .output("prompts/")
    .max_tokens(100_000)
    .format(Format::Markdown)
    .keep_tests()
    .run()?;
```
Using Presets
Presets provide optimized configurations for specific tasks:
```rust
use llm_utl::Scan;

// Code review - removes tests, comments, debug prints
Scan::dir("./src")
    .code_review()
    .run()?;

// Documentation - keeps all comments and docs
Scan::dir("./src")
    .documentation()
    .run()?;

// Security audit - includes everything
Scan::dir("./src")
    .security_audit()
    .run()?;

// Bug analysis - focuses on logic
Scan::dir("./src")
    .bug_analysis()
    .run()?;
```
Advanced API
For complex scenarios, use the full Pipeline API:
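A minimal sketch using the `Config` builder and `Pipeline`, mirroring the advanced template example later in this README:

```rust
use llm_utl::{Config, Pipeline};

// Build a Config explicitly, then run the pipeline with it.
let config = Config::builder()
    .root_dir("./src")
    .build()?;

Pipeline::new(config)?.run()?;
```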
Advanced Configuration
Code Filtering
Control what gets removed from your code:
```rust
use llm_utl::{Config, FilterConfig};

let config = Config::builder()
    .root_dir("./src")
    .filter_config(FilterConfig::production()) // any FilterConfig preset, or a custom one
    .build()?;
```
Or use presets:
Or use presets:

```rust
use llm_utl::FilterConfig;

// Minimal - remove everything except code
let minimal = FilterConfig::minimal();

// Preserve docs - keep documentation comments
let with_docs = FilterConfig::preserve_docs();

// Production - ready for production review
let production = FilterConfig::production();
```
File Filtering
Include or exclude specific files and directories:
```rust
use llm_utl::{Config, FileFilterConfig};

// The FileFilterConfig constructor here is illustrative; `.allow_only`
// takes glob patterns (see the note below).
let config = Config::builder()
    .root_dir("./src")
    .file_filter_config(FileFilterConfig::default().allow_only(["**/*.rs"]))
    .build()?;
```
Important: When using `.allow_only()`, use glob patterns like `**/*.rs` instead of `*.rs` to match files in all subdirectories. The pattern `*.rs` only matches files in the root directory.
Custom Tokenizers
Choose between simple and enhanced tokenization:
```rust
use llm_utl::{Config, TokenizerKind};

// The Enhanced variant name is an assumption; Simple appears in the text above.
let config = Config::builder()
    .root_dir("./src")
    .tokenizer(TokenizerKind::Enhanced) // More accurate
    // .tokenizer(TokenizerKind::Simple) // Faster, ~4 chars per token
    .build()?;
```
Working with Statistics
The `PipelineStats` struct provides detailed information about the scanning process:
```rust
let stats = Scan::dir("./src").run()?;

// Field names below are illustrative; see the PipelineStats docs for the exact API.
// File counts
println!("Files scanned: {}", stats.files_scanned);
println!("Files included: {}", stats.files_included);
println!("Files skipped: {}", stats.files_skipped);

// Chunks
println!("Chunks written: {}", stats.chunk_count);
println!("Total tokens: {}", stats.total_tokens);
println!("Average tokens per chunk: {}", stats.total_tokens / stats.chunk_count.max(1));

// Performance
println!("Elapsed: {:?}", stats.elapsed);
println!("Files per second: {:.0}", stats.files_scanned as f64 / stats.elapsed.as_secs_f64());

// Output
println!("Output directory: {}", stats.output_dir);
println!("Output files: {:?}", stats.output_files);
```
Design Philosophy
Progressive Disclosure
Start simple, add complexity only when needed:
- Level 1: `llm_utl::scan()` - Zero config, works immediately
- Level 2: `Scan::dir("path").code_review()` - Use presets for common tasks
- Level 3: `Scan::dir("path").keep_tests().exclude([...])` - Fine-grained control
- Level 4: Full `Config` API - Maximum flexibility
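Taken together, the four levels look like this (a sketch; the `exclude` pattern is illustrative):

```rust
use llm_utl::{Config, Pipeline, Scan};

// Level 1: zero config
llm_utl::scan()?;

// Level 2: presets
Scan::dir("./src").code_review().run()?;

// Level 3: fine-grained control
Scan::dir("./src")
    .keep_tests()
    .exclude(["**/fixtures/**"])
    .run()?;

// Level 4: full Config API
let config = Config::builder().root_dir("./src").build()?;
Pipeline::new(config)?.run()?;
```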
Type Safety
All options are compile-time checked:
```rust
// This won't compile - caught at compile time
Scan::dir("./src")
    .format("markdown"); // Error: expected Format enum

// Correct usage
Scan::dir("./src")
    .format(Format::Markdown);
```
Sensible Defaults
Works well without configuration:
- Excludes common directories (`node_modules`, `target`, `.git`, etc.)
- Removes noise (tests, comments, debug prints)
- Uses efficient token limits (100,000 per chunk)
- Provides clear, actionable error messages
Fluent Interface
Natural, readable API:
```rust
Scan::dir("./src")
    .code_review()
    .output("prompts/")  // illustrative path
    .max_tokens(50_000)  // illustrative limit
    .keep_tests()
    .run()?;
```
Output Formats
Markdown (Default)
```rust
fn main() {
}
```

XML

JSON
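To pick a format from the Scan API (a sketch; the exact `Format` variant names are assumptions based on the formats listed above):

```rust
use llm_utl::{Format, Scan};

Scan::dir("./src").format(Format::Markdown).run()?; // the default
Scan::dir("./src").format(Format::Xml).run()?;
Scan::dir("./src").format(Format::Json).run()?;
```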
Custom Templates
llm-utl supports custom Tera templates for maximum flexibility in output formatting.
Using Custom Templates
Override Built-in Templates
Replace default templates with your own:
```rust
use llm_utl::*;

Scan::dir("./src")
    .format(Format::Markdown)
    .template("./my-template.tera") // path is illustrative
    .run()?;
```
CLI usage:
Create Custom Formats
Define completely custom output formats:
```rust
use llm_utl::*;
use serde_json::json;

// The format name, extension, and data values are illustrative.
Scan::dir("./src")
    .custom_format("my_format", "txt")
    .template("./my-template.tera")
    .template_data("version", json!("1.0.0"))
    .template_data("project", json!("My Project"))
    .template_data("author", json!("Jane Doe"))
    .run()?;
```
CLI usage:
Template Variables
Your templates have access to the following context:
```
{# Chunk information #}
{{ ctx.chunk_index }}    {# Current chunk number (1-based) #}
{{ ctx.total_chunks }}   {# Total number of chunks #}
{{ ctx.chunk_files }}    {# Files in this chunk #}
{{ ctx.total_tokens }}   {# Token count for chunk #}

{# Files array #}
{% for file in ctx.files %}
{{ file.path }}          {# Absolute path #}
{{ file.relative_path }} {# Relative path #}
{{ file.content }}       {# File contents (None for binary) #}
{{ file.is_binary }}     {# Boolean flag #}
{{ file.token_count }}   {# Estimated tokens #}
{{ file.lines }}         {# Line count (None for binary) #}
{% endfor %}

{# Metadata #}
{{ ctx.metadata.generated_at }} {# Timestamp #}
{{ ctx.metadata.format }}       {# Output format #}

{# Custom data (if provided) #}
{{ ctx.custom.version }}
{{ ctx.custom.project }}
{{ ctx.custom.author }}

{# Preset info (if using a preset) #}
{{ ctx.preset.name }}
{{ ctx.preset.description }}
```
Custom Filters
Built-in Tera filters available in templates:
```
{# XML escaping #}
{{ content | xml_escape }}

{# JSON encoding #}
{{ data | json_encode }}
{{ data | json_encode(pretty=true) }}

{# Truncate output #}
{{ content | truncate_lines(max=100) }}

{# Detect language from extension #}
{{ file.path | detect_language }}
```
Example Custom Template
````
# {{ ctx.custom.project }} - Code Review

Version: {{ ctx.custom.version }}
Author: {{ ctx.custom.author }}

## Chunk {{ ctx.chunk_index }} of {{ ctx.total_chunks }}

{% for file in ctx.files %}
### File: {{ file.relative_path }}

Lines: {{ file.lines }}, Tokens: {{ file.token_count }}

```{% set ext = file.relative_path | split(pat=".") | last %}{{ ext }}
{{ file.content }}
```
{% endfor %}

Generated at: {{ ctx.metadata.generated_at }}
````
Template Validation
Templates are validated automatically:
- File existence and readability
- Tera syntax correctness
- Required variables (`chunk_index`, `total_chunks`, `files`)
Invalid templates will produce clear error messages with suggested fixes.
Advanced API Usage
For programmatic template configuration:
```rust
use llm_utl::{Config, OutputFormat};
use std::collections::HashMap;
use serde_json::Value;
let mut custom_data = HashMap::new();
custom_data.insert("version".to_string(), Value::String("1.0.0".to_string()));
custom_data.insert("project".to_string(), Value::String("My Project".to_string()));
let config = Config::builder()
.root_dir("./src")
.template_path("./my-template.tera")
.format(OutputFormat::Custom)
.custom_format_name("my_format")
.custom_extension("txt")
.custom_data(custom_data)
.build()?;
Pipeline::new(config)?.run()?;
```
Use Cases
- Code Review with AI - Feed your codebase to Claude or GPT-4 for comprehensive reviews
- Learning - Generate study materials from large codebases
- Documentation - Create AI-friendly documentation sources
- Analysis - Prepare code for AI-powered analysis and insights
- Training Data - Generate datasets for fine-tuning models
How It Works
The tool follows a 4-stage pipeline:
- Scanner - Discovers files in parallel, respecting `.gitignore`
- Filter - Removes noise (tests, comments, debug statements) using language-specific filters
- Splitter - Intelligently chunks content based on token limits, with overlap for context
- Writer - Renders chunks using Tera templates with atomic file operations
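For intuition, here is a self-contained sketch of how these four stages compose. The types and logic are illustrative only, not the crate's internals:

```rust
// Conceptual sketch of the 4-stage pipeline; not the crate's internal API.
struct File { path: String, content: String }
struct Chunk { files: Vec<File>, tokens: usize }

// 1. Scanner: discover files (the real crate walks in parallel, honoring .gitignore)
fn scan(paths: &[(&str, &str)]) -> Vec<File> {
    paths.iter()
        .map(|(p, c)| File { path: p.to_string(), content: c.to_string() })
        .collect()
}

// 2. Filter: drop noise (the real crate applies language-specific filters)
fn filter(files: Vec<File>) -> Vec<File> {
    files.into_iter().filter(|f| !f.path.contains("test")).collect()
}

// 3. Splitter: greedily pack files into token-limited chunks
fn split(files: Vec<File>, max_tokens: usize) -> Vec<Chunk> {
    let mut chunks = vec![Chunk { files: vec![], tokens: 0 }];
    for f in files {
        let t = f.content.len() / 4; // ~4 chars per token, like the simple tokenizer
        let needs_new = {
            let last = chunks.last().unwrap();
            last.tokens + t > max_tokens && !last.files.is_empty()
        };
        if needs_new {
            chunks.push(Chunk { files: vec![], tokens: 0 });
        }
        let last = chunks.last_mut().unwrap();
        last.tokens += t;
        last.files.push(f);
    }
    chunks
}

fn main() {
    let files = scan(&[("src/main.rs", "fn main() {}"), ("src/lib_test.rs", "#[test] fn t() {}")]);
    let kept = filter(files);
    let chunks = split(kept, 100_000);
    // 4. Writer: the real crate renders each chunk through a Tera template
    println!("{} chunk(s)", chunks.len());
}
```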
Performance
- Parallel file scanning using all CPU cores
- Streaming mode for large files (>10MB)
- Zero-copy operations where possible
- Optimized for minimal allocations
Typical performance: ~1000 files/second on modern hardware.
Supported Languages
Built-in filtering support for:
- Rust
- Python
- JavaScript/TypeScript (including JSX/TSX)
- Go
- Java/Kotlin
- C/C++
Other languages are processed as plain text.
Real-World Examples
Pre-commit Review
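A sketch using the Scan API shown above (paths are illustrative):

```rust
use llm_utl::Scan;

// Generate a review-ready prompt before committing.
Scan::dir("./src")
    .code_review()
    .output("review-prompts/")
    .run()?;
```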
CI/CD Security Scan
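A sketch for CI (paths are illustrative):

```rust
use llm_utl::Scan;

// Produce a full-context audit bundle as part of a CI job.
Scan::dir(".")
    .security_audit()
    .output("audit-prompts/")
    .run()?;
```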
Documentation Generation
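A sketch that keeps comments and docs (paths are illustrative):

```rust
use llm_utl::Scan;

// Preserve documentation when generating doc sources.
Scan::dir("./src")
    .documentation()
    .output("doc-prompts/")
    .run()?;
```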
Batch Processing
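A sketch that processes several project directories in one go (directory names are illustrative):

```rust
use llm_utl::Scan;

for dir in ["./service-a", "./service-b"] {
    Scan::dir(dir)
        .code_review()
        .output(format!("prompts/{dir}"))
        .run()?;
}
```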
More Examples
See the [examples](https://github.com/maxBogovick/llm-util/tree/master/examples) directory for more usage examples.
Development
```bash
# Clone the repository
git clone https://github.com/maxBogovick/llm-util
cd llm-util

# Build
cargo build

# Run tests
cargo test

# Run with verbose logging
RUST_LOG=llm_utl=debug cargo run

# Format code
cargo fmt

# Lint
cargo clippy
```
Troubleshooting
"No processable files found" Error
If you see this error:

```
Error: No processable files found in '.'.
```

Common causes:

- Wrong directory: the tool is running in an empty directory (for example, your home directory) instead of your project directory.
- All files are gitignored: your `.gitignore` excludes everything in the directory. Use a dry run to check which files would be scanned.
- No source files: the directory contains only non-source files (images, binaries, etc.).

Quick fix: always specify the directory containing your source code.
Permission Issues
If you encounter permission errors, make sure you have read access to the source directory and write access to the output directory.
Large Files
If processing is slow on very large files, increase the token limit for large codebases, or switch to the simple tokenizer for better performance.
FAQ
How do I scan only specific file types?
Use the Scan API with exclusion patterns or the full Config API with custom file filters:
```rust
use llm_utl::{Config, FileFilterConfig, Pipeline};

// The FileFilterConfig constructor is illustrative; `.allow_only` takes glob patterns.
let config = Config::builder()
    .root_dir("./src")
    .file_filter_config(FileFilterConfig::default().allow_only(["**/*.rs"]))
    .build()?;

Pipeline::new(config)?.run()?;
```
How do I handle very large codebases?
Increase token limits and adjust overlap:
```rust
Scan::dir("./src")
    .max_tokens(200_000) // illustrative: raise the per-chunk limit
    .overlap(1_000)      // illustrative: tokens of overlap between chunks
    .run()?;
```
Can I process multiple directories?
Yes, scan each separately or use a common parent:
```rust
for dir in ["./crate-a", "./crate-b"] { // directory names are illustrative
    Scan::dir(dir).run()?;
}
```
How do I preserve everything for analysis?
Use the security audit preset or configure manually:
```rust
// Using a preset
Scan::dir("./src")
    .security_audit()
    .run()?;

// Manual configuration
Scan::dir("./src")
    .keep_tests()
    .keep_comments()
    .keep_doc_comments()
    .keep_debug_prints()
    .run()?;
```
What are the available presets?
The library provides these presets:
- `code_review` - Removes tests, comments, and debug prints for clean code review
- `documentation` - Preserves all documentation and comments
- `security_audit` - Includes everything for comprehensive security analysis
- `bug_analysis` - Focuses on logic by removing noise
- `refactoring` - Optimized for refactoring tasks
- `test_generation` - Configured for generating tests
Platform Support
- Linux
- macOS
- Windows
All major platforms are supported and tested.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Built with these excellent crates:
- ignore - Fast gitignore-aware file walking
- tera - Powerful template engine
- clap - CLI argument parsing
- tracing - Structured logging