# Dumpfs
TL;DR: A tool for dumping codebase information for LLMs, efficiently and effectively.
It analyzes a codebase and generates a structured representation that can be fed to large language models (LLMs). It supports local directories, individual files, and *remote Git repositories*, including specific subdirectories within a repository.
Status: It's safe to say it's ready for daily use; I've been using it myself for a while now.
## Installation
### Cargo (Rust)
```bash
cargo install dumpfs
```
### NPM (Node.js/JavaScript)
```bash
npm install @kkharji/dumpfs
```
## Usage
### CLI Examples
```bash
# Basic usage (current directory, output to stdout)
dumpfs gen
# Scan a specific directory with output to a file
dumpfs gen /path/to/project -o project_dump.md
# Scan with specific output format
dumpfs gen . -o output.xml -f xml
# Copy the generated content to the clipboard
dumpfs gen . --clip
# Filter files using ignore patterns
dumpfs gen . -i "*.log,*.tmp,node_modules/*"
# Include only specific files
dumpfs gen . -I "*.rs,*.toml"
# Show additional metadata in output
dumpfs gen . -smp
# Skip file contents, show only structure
dumpfs gen . --skip-content
# Scan a remote Git repository
dumpfs gen https://github.com/username/repo -o repo_dump.md
# Generate shell completions (e.g. for zsh)
dumpfs completion zsh ~/.config/zsh/completions/_dumpfs
```
### Node.js/JavaScript Library Examples
```javascript
import { scan } from '@kkharji/dumpfs';
// Basic usage - scan the current directory
const result = await scan('.');
const llmText = await result.llmText();
console.log(llmText);

// With options - scan with custom settings
const detailed = await scan('/path/to/project', {
  maxDepth: 3,
  ignorePatterns: ['node_modules/**', '*.log'],
  includePatterns: ['*.js', '*.ts', '*.json'],
  skipContent: false,
  model: 'gpt4' // Enable token counting
});

// Generate different output formats
const markdownOutput = await detailed.llmText();

// Customize output options
const customOutput = await detailed.llmText({
  showPermissions: true,
  showSize: true,
  showModified: true,
  includeTreeOutline: true,
  omitFileContents: false
});
```
## Recent Changes & New Features
### v0.1.0+ Updates
**🚀 Node.js Bindings (NAPI)**
- Full JavaScript/TypeScript library with async support
- Cross-platform native modules for optimal performance
- Type definitions included for better development experience
- Available on npm as `@kkharji/dumpfs`

**🧠 Token Counting & LLM Integration**
- Built-in token counting for popular LLM models (GPT-4, Claude Sonnet, Llama, Mistral)
- Model-aware content analysis and optimization
- Caching system for efficient repeated tokenization
- Support for content-based token estimation

**⚡ Enhanced CLI**
- Output to stdout for better shell integration
- Clipboard support for seamless workflow
- Improved progress reporting and error handling
- Better filtering and ignore patterns

**🔧 Performance Improvements**
- Optimized parallel processing with configurable thread counts
- Enhanced file type detection and text classification
- Better memory management for large codebases
- Improved handling of symlinks and permissions
## Key Features
The architecture supports several important features:
1. **Parallel Processing**: Uses worker threads for efficient filesystem traversal and processing
2. **Flexible Input**: Handles both local and remote code sources uniformly
3. **Smart Filtering**: Provides multiple ways to filter content:
- File size limits
- Modified dates
- Permissions
- Gitignore patterns
- Custom include/exclude patterns
4. **Token Counting & LLM Integration**:
- Built-in tokenization for major LLM models (GPT-4, Claude, Llama, Mistral)
- Implements caching for efficient tokenization
- Model-aware content analysis and optimization
5. **Performance Optimization**:
- Uses efficient buffered I/O
- Provides progress tracking
- Supports cancelation
6. **Extensibility**:
- Modular design for adding new tokenizers
- Support for multiple output formats
- Pluggable formatter system
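The include/exclude filtering described above can be sketched roughly as follows. This is a hypothetical `shouldKeep` helper mirroring the spirit of the CLI's `-I`/`-i` flags, not the actual dumpfs implementation:

```javascript
// Convert a simple glob ("*.rs", "node_modules/**") to a RegExp.
// Handles only `*` (within one path segment) and `**` (across segments).
function globToRegex(glob) {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\*\*/g, '\u0000')           // placeholder for **
    .replace(/\*/g, '[^/]*')              // * matches within one segment
    .replace(/\u0000/g, '.*');            // ** matches across segments
  return new RegExp(`^${escaped}$`);
}

// A path is kept if it matches some include pattern (when any are given)
// and matches no ignore pattern.
function shouldKeep(path, includes, ignores) {
  const included =
    includes.length === 0 || includes.some((g) => globToRegex(g).test(path));
  const ignored = ignores.some((g) => globToRegex(g).test(path));
  return included && !ignored;
}
```

The real scanner also layers in gitignore patterns and metadata-based filters (size, date, permissions) on top of this kind of pattern matching.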
## Data Flow
1. User input (path/URL) → Subject
2. Subject initializes appropriate source (local/remote)
3. Scanner traverses files with parallel workers
4. Files are processed according to type and options
5. Results are collected into a tree structure
6. Formatter converts tree to desired output format
7. Results are saved or displayed
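To make the flow concrete, here is a toy version of steps 5–6, using an in-memory tree in place of a real filesystem scan (the node shape and `toMarkdown` function are illustrative, not dumpfs APIs):

```javascript
// A tiny stand-in for the tree produced by the scanner (step 5).
const tree = {
  name: 'project',
  children: [
    { name: 'src', children: [{ name: 'main.rs', content: 'fn main() {}' }] },
    { name: 'README.md', content: '# project' },
  ],
};

// Step 6: a formatter walks the collected tree and emits Markdown,
// one heading per file followed by its content.
function toMarkdown(node, path = '') {
  const full = path ? `${path}/${node.name}` : node.name;
  if (node.children) {
    return node.children.map((c) => toMarkdown(c, full)).join('');
  }
  return `## ${full}\n\n${node.content}\n\n`;
}

const output = toMarkdown(tree);
```

In the real tool the tree is built in parallel by worker threads, and the formatter stage is pluggable (Markdown, XML, JSON).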
## Architecture Overview
dumpfs is organized into several key modules that work together to analyze and format codebase content for LLMs:
### Core Modules
#### 1. Subject (`src/subject.rs`)
- Acts as the central coordinator for processing input sources
- Handles both local directories and remote Git repositories
- Provides high-level API for scanning and formatting operations
#### 2. Filesystem Scanner (`src/fs/`)
- Handles recursive directory traversal and file analysis
- Implements parallel processing via worker threads for performance
- Detects file types and extracts content & metadata
- Manages filtering based on various criteria (size, date, permissions)
#### 3. Git Integration (`src/git/`)
- Parses and validates remote repository URLs
- Extracts repository metadata (owner, name, branch)
- Manages cloning and updating of remote repositories
- Handles authentication and credentials
- Provides access to repository contents for scanning
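The URL parsing step can be sketched like this; `parseRepoUrl` is a hypothetical helper illustrating the owner/name extraction described above, not the actual `src/git/` code:

```javascript
// Extract host, owner, and repository name from an HTTPS repo URL.
function parseRepoUrl(url) {
  const u = new URL(url);
  const [owner, repo] = u.pathname.replace(/^\//, '').split('/');
  if (!owner || !repo) throw new Error(`not a repository URL: ${url}`);
  return { host: u.host, owner, name: repo.replace(/\.git$/, '') };
}
```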
#### 4. Token Counter (`src/tk/`)
- Implements token counting for various LLM models
- Supports multiple providers (OpenAI, Anthropic, HuggingFace)
- Includes caching to avoid redundant tokenization
- Tracks statistics for optimization
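The caching idea can be sketched as a memoized counter. The chars/4 heuristic below is a common rough estimate for English text, standing in for a real model-specific tokenizer:

```javascript
// Cache of text -> token count, so repeated content is tokenized once.
// A production cache would likely key by a content hash instead.
const tokenCache = new Map();

function estimateTokens(text) {
  return Math.ceil(text.length / 4); // rough approximation, not a real tokenizer
}

function countTokensCached(text) {
  if (tokenCache.has(text)) return tokenCache.get(text);
  const n = estimateTokens(text);
  tokenCache.set(text, n);
  return n;
}
```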
#### 5. Formatters (`src/fs/fmt/`)
- Converts scanned filesystem data into LLM-friendly formats
- Supports multiple output formats (Markdown, XML, JSON)
- Handles metadata inclusion and content organization
### Supporting Modules
#### Error Handling (`src/error.rs`)
- Provides a centralized error type system
- Implements custom error conversion and propagation
- Ensures consistent error handling across modules
#### Cache Management (`src/cache.rs`)
- Manages persistent caching of tokenization results
- Provides cache location and naming utilities
#### CLI Interface (`src/cli/`)
- Implements command-line interface using clap
- Processes user options and coordinates operations
- Provides progress feedback and reporting