llm-utl
Transform code repositories into LLM-friendly prompts with intelligent chunking and filtering. Convert your codebase into optimally-chunked, formatted prompts ready for use with Large Language Models like Claude, GPT-4, or other AI assistants.
Features
- Zero-config - Works out of the box with sensible defaults
- Type-safe API - Fluent, compile-time checked interface with presets
- Smart Chunking - Automatically splits large codebases into optimally token-sized chunks with overlap
- Presets - Optimized configurations for common tasks (code review, documentation, security audit)
- Code Filtering - Removes tests, comments, debug prints, and other noise from code
- Multiple Formats - Output to Markdown, XML, or JSON
- Fast - Parallel file scanning with multi-threaded processing (~1000 files/second)
- Gitignore Support - Respects `.gitignore` files automatically
- Multi-Language - Built-in filters for Rust, Python, JavaScript/TypeScript, Go, Java, C/C++
- Robust - Comprehensive error handling with atomic file writes
Installation
As a CLI Tool
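If the crate is published to crates.io under the name used in this README, it can be installed with:

```bash
# assumes the published crate name matches this README
cargo install llm-utl
```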
As a Library
Add to your Cargo.toml:
```toml
[dependencies]
llm-utl = "0.1.0"
```
Quick Start
Command Line Usage
Basic usage covers converting the current directory to prompts, specifying input and output directories, configuring token limits and the output format, and doing a dry run to preview what would be generated. Run the CLI with `--help` for the full list of options.
Library Usage
Simple API (Recommended)
The Scan API provides a fluent, type-safe interface:
```rust
use llm_utl::{Format, Scan};

// Simplest usage - scan current directory
llm_utl::scan()?;

// Scan specific directory
Scan::dir("./src").run()?;

// Use a preset for common tasks
Scan::dir("./src")
    .code_review()
    .run()?;

// Custom configuration (the path, limit, and format values are illustrative)
Scan::dir("./src")
    .output("prompts/")
    .max_tokens(100_000)
    .format(Format::Markdown)
    .keep_tests()
    .run()?;
```
Using Presets
Presets provide optimized configurations for specific tasks:
```rust
use llm_utl::Scan;

// Code review - removes tests, comments, debug prints
Scan::dir("./src")
    .code_review()
    .run()?;

// Documentation - keeps all comments and docs
Scan::dir("./src")
    .documentation()
    .run()?;

// Security audit - includes everything
Scan::dir("./src")
    .security_audit()
    .run()?;

// Bug analysis - focuses on logic
Scan::dir("./src")
    .bug_analysis()
    .run()?;
```
Advanced API
For complex scenarios, use the full Pipeline API:
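A minimal sketch using the `Config` builder and `Pipeline`, mirroring the advanced template example later in this README:

```rust
use llm_utl::{Config, Pipeline};

// Build a Config explicitly, then run the pipeline with it.
let config = Config::builder()
    .root_dir("./src")
    .build()?;

Pipeline::new(config)?.run()?;
```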
Advanced Configuration
Code Filtering
Control what gets removed from your code:
```rust
use llm_utl::{Config, FilterConfig};

let config = Config::builder()
    .root_dir("./src")
    .filter_config(FilterConfig::production()) // any FilterConfig preset, or a custom one
    .build()?;
```
Or use presets:
Or use presets:

```rust
use llm_utl::FilterConfig;

// Minimal - remove everything except code
let minimal = FilterConfig::minimal();

// Preserve docs - keep documentation comments
let with_docs = FilterConfig::preserve_docs();

// Production - ready for production review
let production = FilterConfig::production();
```
File Filtering
Include or exclude specific files and directories:
```rust
use llm_utl::{Config, FileFilterConfig};

// The FileFilterConfig constructor here is illustrative; `.allow_only`
// takes glob patterns (see the note below).
let config = Config::builder()
    .root_dir("./src")
    .file_filter_config(FileFilterConfig::default().allow_only(["**/*.rs"]))
    .build()?;
```
Important: When using `.allow_only()`, use glob patterns like `**/*.rs` instead of `*.rs` to match files in all subdirectories. The pattern `*.rs` only matches files in the root directory.
Custom Tokenizers
Choose between simple and enhanced tokenization:
```rust
use llm_utl::{Config, TokenizerKind};

// The Enhanced variant name is an assumption; Simple appears in the text above.
let config = Config::builder()
    .root_dir("./src")
    .tokenizer(TokenizerKind::Enhanced) // More accurate
    // .tokenizer(TokenizerKind::Simple) // Faster, ~4 chars per token
    .build()?;
```
Working with Statistics
The `PipelineStats` struct provides detailed information about the scanning process:
```rust
let stats = Scan::dir("./src").run()?;

// Field names below are illustrative; see the PipelineStats docs for the exact API.
// File counts
println!("Files scanned: {}", stats.files_scanned);
println!("Files included: {}", stats.files_included);
println!("Files skipped: {}", stats.files_skipped);

// Chunks
println!("Chunks written: {}", stats.chunk_count);
println!("Total tokens: {}", stats.total_tokens);
println!("Average tokens per chunk: {}", stats.total_tokens / stats.chunk_count.max(1));

// Performance
println!("Elapsed: {:?}", stats.elapsed);
println!("Files per second: {:.0}", stats.files_scanned as f64 / stats.elapsed.as_secs_f64());

// Output
println!("Output directory: {}", stats.output_dir);
println!("Output files: {:?}", stats.output_files);
```
Design Philosophy
Progressive Disclosure
Start simple, add complexity only when needed:
- Level 1: `llm_utl::scan()` - Zero config, works immediately
- Level 2: `Scan::dir("path").code_review()` - Use presets for common tasks
- Level 3: `Scan::dir("path").keep_tests().exclude([...])` - Fine-grained control
- Level 4: Full `Config` API - Maximum flexibility
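Taken together, the four levels look like this (a sketch; the `exclude` pattern is illustrative):

```rust
use llm_utl::{Config, Pipeline, Scan};

// Level 1: zero config
llm_utl::scan()?;

// Level 2: presets
Scan::dir("./src").code_review().run()?;

// Level 3: fine-grained control
Scan::dir("./src")
    .keep_tests()
    .exclude(["**/fixtures/**"])
    .run()?;

// Level 4: full Config API
let config = Config::builder().root_dir("./src").build()?;
Pipeline::new(config)?.run()?;
```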
Type Safety
All options are compile-time checked:
```rust
// This won't compile - caught at compile time
Scan::dir("./src")
    .format("markdown"); // Error: expected Format enum

// Correct usage
Scan::dir("./src")
    .format(Format::Markdown);
```
Sensible Defaults
Works well without configuration:
- Excludes common directories (`node_modules`, `target`, `.git`, etc.)
- Removes noise (tests, comments, debug prints)
- Uses efficient token limits (100,000 per chunk)
- Provides clear, actionable error messages
Fluent Interface
Natural, readable API:
```rust
Scan::dir("./src")
    .code_review()
    .output("prompts/")  // illustrative path
    .max_tokens(50_000)  // illustrative limit
    .keep_tests()
    .run()?;
```
Output Formats
Markdown (Default)
```rust
fn main() {
}
```

XML

JSON
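To pick a format from the Scan API (a sketch; the exact `Format` variant names are assumptions based on the formats listed above):

```rust
use llm_utl::{Format, Scan};

Scan::dir("./src").format(Format::Markdown).run()?; // the default
Scan::dir("./src").format(Format::Xml).run()?;
Scan::dir("./src").format(Format::Json).run()?;
```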
Custom Templates
llm-utl supports custom Tera templates for maximum flexibility in output formatting.
Using Custom Templates
Override Built-in Templates
Replace default templates with your own:
```rust
use llm_utl::*;

Scan::dir("./src")
    .format(Format::Markdown)
    .template("./my-template.tera") // path is illustrative
    .run()?;
```
CLI usage:
Create Custom Formats
Define completely custom output formats:
```rust
use llm_utl::*;
use serde_json::json;

// The format name, extension, and data values are illustrative.
Scan::dir("./src")
    .custom_format("my_format", "txt")
    .template("./my-template.tera")
    .template_data("version", json!("1.0.0"))
    .template_data("project", json!("My Project"))
    .template_data("author", json!("Jane Doe"))
    .run()?;
```
CLI usage:
Template Variables
Your templates have access to the following context:
```
{# Chunk information #}
{{ ctx.chunk_index }}    {# Current chunk number (1-based) #}
{{ ctx.total_chunks }}   {# Total number of chunks #}
{{ ctx.chunk_files }}    {# Files in this chunk #}
{{ ctx.total_tokens }}   {# Token count for chunk #}

{# Files array #}
{% for file in ctx.files %}
{{ file.path }}          {# Absolute path #}
{{ file.relative_path }} {# Relative path #}
{{ file.content }}       {# File contents (None for binary) #}
{{ file.is_binary }}     {# Boolean flag #}
{{ file.token_count }}   {# Estimated tokens #}
{{ file.lines }}         {# Line count (None for binary) #}
{% endfor %}

{# Metadata #}
{{ ctx.metadata.generated_at }} {# Timestamp #}
{{ ctx.metadata.format }}       {# Output format #}

{# Custom data (if provided) #}
{{ ctx.custom.version }}
{{ ctx.custom.project }}
{{ ctx.custom.author }}

{# Preset info (if using a preset) #}
{{ ctx.preset.name }}
{{ ctx.preset.description }}
```
Custom Filters
Built-in Tera filters available in templates:
```
{# XML escaping #}
{{ content | xml_escape }}

{# JSON encoding #}
{{ data | json_encode }}
{{ data | json_encode(pretty=true) }}

{# Truncate output #}
{{ content | truncate_lines(max=100) }}

{# Detect language from extension #}
{{ file.path | detect_language }}
```
Example Custom Template
````
# {{ ctx.custom.project }} - Code Review

Version: {{ ctx.custom.version }}
Author: {{ ctx.custom.author }}

## Chunk {{ ctx.chunk_index }} of {{ ctx.total_chunks }}

{% for file in ctx.files %}
### File: {{ file.relative_path }}

Lines: {{ file.lines }}, Tokens: {{ file.token_count }}

```{% set ext = file.relative_path | split(pat=".") | last %}{{ ext }}
{{ file.content }}
```
{% endfor %}

Generated at: {{ ctx.metadata.generated_at }}
````
Template Validation
Templates are validated automatically:
- File existence and readability
- Tera syntax correctness
- Required variables (`chunk_index`, `total_chunks`, `files`)
Invalid templates will produce clear error messages with suggested fixes.
Advanced API Usage
For programmatic template configuration:
```rust
use llm_utl::{Config, OutputFormat};
use std::collections::HashMap;
use serde_json::Value;
let mut custom_data = HashMap::new();
custom_data.insert("version".to_string(), Value::String("1.0.0".to_string()));
custom_data.insert("project".to_string(), Value::String("My Project".to_string()));
let config = Config::builder()
.root_dir("./src")
.template_path("./my-template.tera")
.format(OutputFormat::Custom)
.custom_format_name("my_format")
.custom_extension("txt")
.custom_data(custom_data)
.build()?;
Pipeline::new(config)?.run()?;
```
Use Cases
- Code Review with AI - Feed your codebase to Claude or GPT-4 for comprehensive reviews
- Learning - Generate study materials from large codebases
- Documentation - Create AI-friendly documentation sources
- Analysis - Prepare code for AI-powered analysis and insights
- Training Data - Generate datasets for fine-tuning models
How It Works
The tool follows a 4-stage pipeline:
- Scanner - Discovers files in parallel, respecting `.gitignore`
- Filter - Removes noise (tests, comments, debug statements) using language-specific filters
- Splitter - Intelligently chunks content based on token limits, with overlap for context
- Writer - Renders chunks using Tera templates with atomic file operations
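For intuition, here is a self-contained sketch of how these four stages compose. The types and logic are illustrative only, not the crate's internals:

```rust
// Conceptual sketch of the 4-stage pipeline; not the crate's internal API.
struct File { path: String, content: String }
struct Chunk { files: Vec<File>, tokens: usize }

// 1. Scanner: discover files (the real crate walks in parallel, honoring .gitignore)
fn scan(paths: &[(&str, &str)]) -> Vec<File> {
    paths.iter()
        .map(|(p, c)| File { path: p.to_string(), content: c.to_string() })
        .collect()
}

// 2. Filter: drop noise (the real crate applies language-specific filters)
fn filter(files: Vec<File>) -> Vec<File> {
    files.into_iter().filter(|f| !f.path.contains("test")).collect()
}

// 3. Splitter: greedily pack files into token-limited chunks
fn split(files: Vec<File>, max_tokens: usize) -> Vec<Chunk> {
    let mut chunks = vec![Chunk { files: vec![], tokens: 0 }];
    for f in files {
        let t = f.content.len() / 4; // ~4 chars per token, like the simple tokenizer
        let needs_new = {
            let last = chunks.last().unwrap();
            last.tokens + t > max_tokens && !last.files.is_empty()
        };
        if needs_new {
            chunks.push(Chunk { files: vec![], tokens: 0 });
        }
        let last = chunks.last_mut().unwrap();
        last.tokens += t;
        last.files.push(f);
    }
    chunks
}

fn main() {
    let files = scan(&[("src/main.rs", "fn main() {}"), ("src/lib_test.rs", "#[test] fn t() {}")]);
    let kept = filter(files);
    let chunks = split(kept, 100_000);
    // 4. Writer: the real crate renders each chunk through a Tera template
    println!("{} chunk(s)", chunks.len());
}
```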
Performance
- Parallel file scanning using all CPU cores
- Streaming mode for large files (>10MB)
- Zero-copy operations where possible
- Optimized for minimal allocations
Typical performance: ~1000 files/second on modern hardware.
Supported Languages
Built-in filtering support for:
- Rust
- Python
- JavaScript/TypeScript (including JSX/TSX)
- Go
- Java/Kotlin
- C/C++
Other languages are processed as plain text.
Real-World Examples
Pre-commit Review
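A sketch using the Scan API shown above (paths are illustrative):

```rust
use llm_utl::Scan;

// Generate a review-ready prompt before committing.
Scan::dir("./src")
    .code_review()
    .output("review-prompts/")
    .run()?;
```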
CI/CD Security Scan
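A sketch for CI (paths are illustrative):

```rust
use llm_utl::Scan;

// Produce a full-context audit bundle as part of a CI job.
Scan::dir(".")
    .security_audit()
    .output("audit-prompts/")
    .run()?;
```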
Documentation Generation
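A sketch that keeps comments and docs (paths are illustrative):

```rust
use llm_utl::Scan;

// Preserve documentation when generating doc sources.
Scan::dir("./src")
    .documentation()
    .output("doc-prompts/")
    .run()?;
```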
Batch Processing
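A sketch that processes several project directories in one go (directory names are illustrative):

```rust
use llm_utl::Scan;

for dir in ["./service-a", "./service-b"] {
    Scan::dir(dir)
        .code_review()
        .output(format!("prompts/{dir}"))
        .run()?;
}
```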
More Examples
See the [examples](https://github.com/maxBogovick/llm-util/tree/master/examples) directory for more usage examples.
Development
```bash
# Clone the repository
git clone https://github.com/maxBogovick/llm-util
cd llm-util

# Build
cargo build

# Run tests
cargo test

# Run with verbose logging
RUST_LOG=llm_utl=debug cargo run

# Format code
cargo fmt

# Lint
cargo clippy
```
Troubleshooting
"No processable files found" Error
If you see this error:

```
Error: No processable files found in '.'.
```

Common causes:

- Wrong directory: the tool is running in an empty directory (for example, your home directory) instead of your project directory.
- All files are gitignored: your `.gitignore` excludes everything in the directory. Use a dry run to check which files would be scanned.
- No source files: the directory contains only non-source files (images, binaries, etc.).

Quick fix: always specify the directory containing your source code.
Permission Issues
If you encounter permission errors, make sure you have read access to the source directory and write access to the output directory.
Large Files
If processing is slow on very large files, increase the token limit for large codebases, or switch to the simple tokenizer for better performance.
FAQ
How do I scan only specific file types?
Use the Scan API with exclusion patterns or the full Config API with custom file filters:
```rust
use llm_utl::{Config, FileFilterConfig, Pipeline};

// The FileFilterConfig constructor is illustrative; `.allow_only` takes glob patterns.
let config = Config::builder()
    .root_dir("./src")
    .file_filter_config(FileFilterConfig::default().allow_only(["**/*.rs"]))
    .build()?;

Pipeline::new(config)?.run()?;
```
How do I handle very large codebases?
Increase token limits and adjust overlap:
```rust
Scan::dir("./src")
    .max_tokens(200_000) // illustrative: raise the per-chunk limit
    .overlap(1_000)      // illustrative: tokens of overlap between chunks
    .run()?;
```
Can I process multiple directories?
Yes, scan each separately or use a common parent:
```rust
for dir in ["./crate-a", "./crate-b"] { // directory names are illustrative
    Scan::dir(dir).run()?;
}
```
How do I preserve everything for analysis?
Use the security audit preset or configure manually:
```rust
// Using a preset
Scan::dir("./src")
    .security_audit()
    .run()?;

// Manual configuration
Scan::dir("./src")
    .keep_tests()
    .keep_comments()
    .keep_doc_comments()
    .keep_debug_prints()
    .run()?;
```
What are the available presets?
The library provides these presets:
- `code_review` - Removes tests, comments, and debug prints for clean code review
- `documentation` - Preserves all documentation and comments
- `security_audit` - Includes everything for comprehensive security analysis
- `bug_analysis` - Focuses on logic by removing noise
- `refactoring` - Optimized for refactoring tasks
- `test_generation` - Configured for generating tests
Platform Support
- Linux
- macOS
- Windows
All major platforms are supported and tested.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Built with these excellent crates:
- ignore - Fast gitignore-aware file walking
- tera - Powerful template engine
- clap - CLI argument parsing
- tracing - Structured logging