# Dumpfs

TLDR: A tool for dumping codebase information for LLMs efficiently and effectively.

It analyzes a codebase and generates a structured representation that can be fed to large language models (LLMs). It supports local directories, individual files, and remote Git repositories, including specific subdirectories within a repository.

Status: Ready for daily use; I've been using it regularly for a while now.
## Installation

### Cargo (Rust)

```sh
# Assumes the crate is published under the tool's name
cargo install dumpfs
```

### NPM (Node.js/JavaScript)

```sh
npm install @kkharji/dumpfs
```
## Usage

### CLI Examples
```sh
# NOTE: flag names below are illustrative; run `dumpfs --help` for the exact options.

# Basic usage (current directory, output to stdout)
dumpfs

# Scan a specific directory with output to a file
dumpfs ./src --output dump.xml

# Scan with a specific output format
dumpfs --format markdown

# Copy the generated content to the clipboard
dumpfs --clipboard

# Filter files using ignore patterns
dumpfs --ignore "*.log" --ignore "target/*"

# Include only specific files
dumpfs --include "*.rs"

# Show additional metadata in output
dumpfs --metadata

# Skip file contents, show only structure
dumpfs --no-content

# Scan a remote Git repository
dumpfs https://github.com/kkharji/dumpfs

# Generate shell completions (supports bash)
dumpfs --completion bash
```
### Node.js/JavaScript Library Examples

The snippets below illustrate the intended usage; function and option names are illustrative — consult the package's bundled TypeScript definitions for the exact API.

```js
import { scan } from '@kkharji/dumpfs';

// Basic usage - scan current directory
const result = await scan('.');
const llmText = await result.format('xml');
console.log(llmText);

// With options - scan with custom settings
const filtered = await scan('./src', {
  ignorePatterns: ['*.log'],
  includeMetadata: true,
});

// Generate different output formats
const markdownOutput = await filtered.format('markdown');

// Customize output options
const customOutput = await filtered.format('xml', { includeContents: false });
```
## Recent Changes & New Features

### v0.1.0+ Updates

#### 🚀 Node.js Bindings (NAPI)
- Full JavaScript/TypeScript library with async support
- Cross-platform native modules for optimal performance
- Type definitions included for better development experience
- Available on npm as `@kkharji/dumpfs`
#### 🧠 Token Counting & LLM Integration
- Built-in token counting for popular LLM models (GPT-4, Claude Sonnet, Llama, Mistral)
- Model-aware content analysis and optimization
- Caching system for efficient repeated tokenization
- Support for content-based token estimation
#### ⚡ Enhanced CLI
- Output to stdout for better shell integration
- Clipboard support for seamless workflow
- Improved progress reporting and error handling
- Better filtering and ignore patterns
#### 🔧 Performance Improvements
- Optimized parallel processing with configurable thread counts
- Enhanced file type detection and text classification
- Better memory management for large codebases
- Improved handling of symlinks and permissions
## Key Features

The architecture supports several important features:
- Parallel Processing: Uses worker threads for efficient filesystem traversal and processing
- Flexible Input: Handles both local and remote code sources uniformly
- Smart Filtering: Provides multiple ways to filter content:
  - File size limits
  - Modified dates
  - Permissions
  - Gitignore patterns
  - Custom include/exclude patterns
- Token Counting & LLM Integration:
  - Built-in tokenization for major LLM models (GPT-4, Claude, Llama, Mistral)
  - Implements caching for efficient tokenization
  - Model-aware content analysis and optimization
- Performance Optimization:
  - Uses efficient buffered I/O
  - Provides progress tracking
  - Supports cancellation
- Extensibility:
  - Modular design for adding new tokenizers
  - Support for multiple output formats
  - Pluggable formatter system
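As a rough illustration of how include/exclude filtering can combine, here is a minimal sketch (a simplified stand-in for dumpfs's real pattern matching, which also honors gitignore rules, size, date, and permissions):

```rust
// Illustrative include/exclude filter; the real matching logic is richer.
fn matches(pattern: &str, path: &str) -> bool {
    if let Some(suffix) = pattern.strip_prefix('*') {
        // "*.rs" style patterns match by suffix.
        path.ends_with(suffix)
    } else {
        // Plain patterns match the path itself or any path under it.
        path == pattern || path.starts_with(&format!("{pattern}/"))
    }
}

fn keep(path: &str, includes: &[&str], excludes: &[&str]) -> bool {
    // Empty include list means "include everything"; excludes always win.
    let included = includes.is_empty() || includes.iter().any(|p| matches(p, path));
    let excluded = excludes.iter().any(|p| matches(p, path));
    included && !excluded
}

fn main() {
    assert!(keep("src/main.rs", &["*.rs"], &["target"]));
    assert!(!keep("target/debug/app", &[], &["target"]));
    assert!(!keep("README.md", &["*.rs"], &[]));
}
```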
## Data Flow

1. User input (path/URL) → Subject
2. Subject initializes the appropriate source (local/remote)
3. Scanner traverses files with parallel workers
4. Files are processed according to type and options
5. Results are collected into a tree structure
6. Formatter converts the tree to the desired output format
7. Results are saved or displayed
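The later stages of this flow can be sketched as follows; the types here are illustrative, not dumpfs's actual internals. A tree of nodes is built by the scanner, then handed to a formatter:

```rust
// Step 5: results collected into a tree structure (illustrative type).
#[derive(Debug)]
enum Node {
    File { name: String, content: String },
    Dir { name: String, children: Vec<Node> },
}

// Step 6: a formatter converts the tree to the desired output format.
fn format_markdown(node: &Node, depth: usize) -> String {
    let indent = "  ".repeat(depth);
    match node {
        Node::File { name, .. } => format!("{indent}- {name}\n"),
        Node::Dir { name, children } => {
            let mut out = format!("{indent}- {name}/\n");
            for child in children {
                out.push_str(&format_markdown(child, depth + 1));
            }
            out
        }
    }
}

fn main() {
    let tree = Node::Dir {
        name: "src".into(),
        children: vec![Node::File { name: "main.rs".into(), content: String::new() }],
    };
    // Step 7: results are displayed.
    print!("{}", format_markdown(&tree, 0));
}
```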
## Architecture Overview

dumpfs is organized into several key modules that work together to analyze and format codebase content for LLMs:

### Core Modules
#### 1. Subject (`src/subject.rs`)

- Acts as the central coordinator for processing input sources
- Handles both local directories and remote Git repositories
- Provides a high-level API for scanning and formatting operations
#### 2. Filesystem Scanner (`src/fs/`)

- Handles recursive directory traversal and file analysis
- Implements parallel processing via worker threads for performance
- Detects file types and extracts content & metadata
- Manages filtering based on various criteria (size, date, permissions)
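The worker-thread approach can be sketched with the standard library alone (this is an illustration of the pattern, not the scanner's actual code): a fixed pool of threads drains a shared queue of paths.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Drain a shared work queue of paths with a fixed pool of worker threads.
fn process_parallel(paths: Vec<String>, workers: usize) -> Vec<String> {
    let queue = Arc::new(Mutex::new(paths));
    let results = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for _ in 0..workers {
        let queue = Arc::clone(&queue);
        let results = Arc::clone(&results);
        handles.push(thread::spawn(move || loop {
            // Pop one path; exit when the queue is empty.
            let path = match queue.lock().unwrap().pop() {
                Some(p) => p,
                None => break,
            };
            // Placeholder for the real per-file work (read, classify, tokenize).
            results.lock().unwrap().push(format!("processed {path}"));
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    // Sort so the result is deterministic regardless of thread scheduling.
    let mut out = Arc::try_unwrap(results).unwrap().into_inner().unwrap();
    out.sort();
    out
}

fn main() {
    let out = process_parallel(vec!["a.rs".into(), "b.rs".into()], 2);
    println!("{out:?}");
}
```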
#### 3. Git Integration (`src/git/`)

- Parses and validates remote repository URLs
- Extracts repository metadata (owner, name, branch)
- Manages cloning and updating of remote repositories
- Handles authentication and credentials
- Provides access to repository contents for scanning
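URL parsing and metadata extraction might look roughly like this minimal sketch (GitHub-only, with an illustrative function name; the real parser handles more hosts and URL shapes):

```rust
// Illustrative parser extracting (owner, name) from a GitHub URL.
fn parse_repo_url(url: &str) -> Option<(String, String)> {
    let rest = url
        .strip_prefix("https://github.com/")
        .or_else(|| url.strip_prefix("git@github.com:"))?;
    // A trailing ".git" is optional in clone URLs.
    let rest = rest.strip_suffix(".git").unwrap_or(rest);
    let mut parts = rest.splitn(2, '/');
    let owner = parts.next()?.to_string();
    let name = parts.next()?.to_string();
    if owner.is_empty() || name.is_empty() {
        return None;
    }
    Some((owner, name))
}

fn main() {
    assert_eq!(
        parse_repo_url("https://github.com/kkharji/dumpfs.git"),
        Some(("kkharji".into(), "dumpfs".into()))
    );
    assert_eq!(parse_repo_url("not a url"), None);
}
```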
#### 4. Token Counter (`src/tk/`)

- Implements token counting for various LLM models
- Supports multiple providers (OpenAI, Anthropic, HuggingFace)
- Includes caching to avoid redundant tokenization
- Tracks statistics for optimization
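The caching idea reduces to memoizing counts per content. A minimal sketch (the whitespace split stands in for a real tokenizer, and the actual cache is persistent and model-aware):

```rust
use std::collections::HashMap;

// Illustrative in-memory token-count cache keyed by content.
struct TokenCache {
    counts: HashMap<String, usize>,
    hits: usize, // statistic tracked for optimization
}

impl TokenCache {
    fn new() -> Self {
        TokenCache { counts: HashMap::new(), hits: 0 }
    }

    // Crude whitespace-based estimate standing in for a real tokenizer.
    fn count(&mut self, text: &str) -> usize {
        if let Some(&n) = self.counts.get(text) {
            self.hits += 1;
            return n;
        }
        let n = text.split_whitespace().count();
        self.counts.insert(text.to_string(), n);
        n
    }
}

fn main() {
    let mut cache = TokenCache::new();
    assert_eq!(cache.count("fn main() {}"), 3);
    assert_eq!(cache.count("fn main() {}"), 3); // second call served from cache
    assert_eq!(cache.hits, 1);
}
```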
#### 5. Formatters (`src/fs/fmt/`)

- Converts scanned filesystem data into LLM-friendly formats
- Supports multiple output formats (Markdown, XML, JSON)
- Handles metadata inclusion and content organization
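A pluggable formatter system typically hangs off a shared trait. Here is a minimal sketch with illustrative names (dumpfs's actual trait and formatters may differ):

```rust
// One trait, many output formats: each formatter renders (path, content) pairs.
trait Formatter {
    fn format(&self, files: &[(&str, &str)]) -> String;
}

struct MarkdownFormatter;

impl Formatter for MarkdownFormatter {
    fn format(&self, files: &[(&str, &str)]) -> String {
        let mut out = String::new();
        for (path, content) in files {
            out.push_str(&format!("## {path}\n```\n{content}\n```\n"));
        }
        out
    }
}

struct XmlFormatter;

impl Formatter for XmlFormatter {
    fn format(&self, files: &[(&str, &str)]) -> String {
        files
            .iter()
            .map(|(path, content)| format!("<file path=\"{path}\">{content}</file>\n"))
            .collect()
    }
}

fn main() {
    let files = [("src/main.rs", "fn main() {}")];
    // Formats are interchangeable behind the trait object.
    let fmts: Vec<Box<dyn Formatter>> = vec![Box::new(MarkdownFormatter), Box::new(XmlFormatter)];
    for f in &fmts {
        print!("{}", f.format(&files));
    }
}
```

Because callers only see `dyn Formatter`, adding a new output format is a matter of implementing one trait, which is what makes the system pluggable.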
### Supporting Modules

#### Error Handling (`src/error.rs`)
- Provides a centralized error type system
- Implements custom error conversion and propagation
- Ensures consistent error handling across modules
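A centralized error type usually means one enum plus `From` conversions so `?` propagates errors uniformly. A minimal sketch with illustrative variants (not the actual contents of `src/error.rs`):

```rust
use std::fmt;
use std::io;

// Illustrative centralized error type for the whole crate.
#[derive(Debug)]
enum DumpfsError {
    Io(io::Error),
    InvalidUrl(String),
}

impl fmt::Display for DumpfsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DumpfsError::Io(e) => write!(f, "io error: {e}"),
            DumpfsError::InvalidUrl(u) => write!(f, "invalid repository url: {u}"),
        }
    }
}

// With this conversion, `?` turns io::Error into DumpfsError automatically.
impl From<io::Error> for DumpfsError {
    fn from(e: io::Error) -> Self {
        DumpfsError::Io(e)
    }
}

fn read_config(path: &str) -> Result<String, DumpfsError> {
    Ok(std::fs::read_to_string(path)?)
}

fn main() {
    let err = DumpfsError::InvalidUrl("ftp://nope".into());
    assert_eq!(err.to_string(), "invalid repository url: ftp://nope");
    assert!(read_config("/definitely/missing/path").is_err());
}
```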
#### Cache Management (`src/cache.rs`)
- Manages persistent caching of tokenization results
- Provides cache location and naming utilities
#### CLI Interface (`src/cli/`)
- Implements command-line interface using clap
- Processes user options and coordinates operations
- Provides progress feedback and reporting
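The option-processing shape can be illustrated with plain `std` (the real CLI uses clap's derive API; the flags and struct below are hypothetical):

```rust
use std::env;

// Hypothetical options struct; clap generates something similar from derives.
#[derive(Debug, Default, PartialEq)]
struct Options {
    format: Option<String>,
    no_content: bool,
    paths: Vec<String>,
}

// Hand-rolled stand-in for clap's parsing, for illustration only.
fn parse(args: &[String]) -> Options {
    let mut opts = Options::default();
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        match arg.as_str() {
            "--format" => opts.format = it.next().cloned(),
            "--no-content" => opts.no_content = true,
            other => opts.paths.push(other.to_string()),
        }
    }
    opts
}

fn main() {
    let args: Vec<String> = env::args().skip(1).collect();
    println!("{:?}", parse(&args));
}
```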