# Dumpfs

TLDR: A tool for dumping codebase information for LLMs efficiently and effectively.

It analyzes a codebase and generates a structured representation that can be fed to large language models (LLMs). It supports local directories, individual files, and remote Git repositories, including specific subdirectories within a repository.

Status: Ready for daily use; I've been using it regularly for a while now.
## Installation

### Cargo (Rust)

```sh
# Assumes the crate is published under the tool's name
cargo install dumpfs
```

### NPM (Node.js/JavaScript)

```sh
npm install @kkharji/dumpfs
```
## Usage

### CLI Examples
```sh
# NOTE: flag names below are illustrative; run `dumpfs --help` for the exact options.

# Basic usage (current directory, output to stdout)
dumpfs

# Scan a specific directory with output to a file
dumpfs ./src --output dump.xml

# Scan with a specific output format
dumpfs --format markdown

# Copy the generated content to the clipboard
dumpfs --clipboard

# Filter files using ignore patterns
dumpfs --ignore "*.log" --ignore "target/*"

# Include only specific files
dumpfs --include "*.rs"

# Show additional metadata in output
dumpfs --metadata

# Skip file contents, show only structure
dumpfs --no-content

# Scan a remote Git repository
dumpfs https://github.com/kkharji/dumpfs

# Generate shell completions (supports bash)
dumpfs --completion bash
```
### Node.js/JavaScript Library Examples

The snippets below illustrate the intended usage; function and option names are illustrative — consult the package's bundled TypeScript definitions for the exact API.

```js
import { scan } from '@kkharji/dumpfs';

// Basic usage - scan current directory
const result = await scan('.');
const llmText = await result.format('xml');
console.log(llmText);

// With options - scan with custom settings
const filtered = await scan('./src', {
  ignorePatterns: ['*.log'],
  includeMetadata: true,
});

// Generate different output formats
const markdownOutput = await filtered.format('markdown');

// Customize output options
const customOutput = await filtered.format('xml', { includeContents: false });
```
## Recent Changes & New Features

### v0.1.0+ Updates

#### 🚀 Node.js Bindings (NAPI)
- Full JavaScript/TypeScript library with async support
- Cross-platform native modules for optimal performance
- Type definitions included for better development experience
- Available on npm as `@kkharji/dumpfs`
#### 🧠 Token Counting & LLM Integration
- Built-in token counting for popular LLM models (GPT-4, Claude Sonnet, Llama, Mistral)
- Model-aware content analysis and optimization
- Caching system for efficient repeated tokenization
- Support for content-based token estimation
#### ⚡ Enhanced CLI
- Output to stdout for better shell integration
- Clipboard support for seamless workflow
- Improved progress reporting and error handling
- Better filtering and ignore patterns
#### 🔧 Performance Improvements
- Optimized parallel processing with configurable thread counts
- Enhanced file type detection and text classification
- Better memory management for large codebases
- Improved handling of symlinks and permissions
## Key Features

The architecture supports several important features:
- Parallel Processing: Uses worker threads for efficient filesystem traversal and processing
- Flexible Input: Handles both local and remote code sources uniformly
- Smart Filtering: Provides multiple ways to filter content:
  - File size limits
  - Modified dates
  - Permissions
  - Gitignore patterns
  - Custom include/exclude patterns
- Token Counting & LLM Integration:
  - Built-in tokenization for major LLM models (GPT-4, Claude, Llama, Mistral)
  - Implements caching for efficient tokenization
  - Model-aware content analysis and optimization
- Performance Optimization:
  - Uses efficient buffered I/O
  - Provides progress tracking
  - Supports cancellation
- Extensibility:
  - Modular design for adding new tokenizers
  - Support for multiple output formats
  - Pluggable formatter system
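As a rough illustration of how include/exclude filtering can combine, here is a minimal sketch (a simplified stand-in for dumpfs's real pattern matching, which also honors gitignore rules, size, date, and permissions):

```rust
// Illustrative include/exclude filter; the real matching logic is richer.
fn matches(pattern: &str, path: &str) -> bool {
    if let Some(suffix) = pattern.strip_prefix('*') {
        // "*.rs" style patterns match by suffix.
        path.ends_with(suffix)
    } else {
        // Plain patterns match the path itself or any path under it.
        path == pattern || path.starts_with(&format!("{pattern}/"))
    }
}

fn keep(path: &str, includes: &[&str], excludes: &[&str]) -> bool {
    // Empty include list means "include everything"; excludes always win.
    let included = includes.is_empty() || includes.iter().any(|p| matches(p, path));
    let excluded = excludes.iter().any(|p| matches(p, path));
    included && !excluded
}

fn main() {
    assert!(keep("src/main.rs", &["*.rs"], &["target"]));
    assert!(!keep("target/debug/app", &[], &["target"]));
    assert!(!keep("README.md", &["*.rs"], &[]));
}
```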
## Data Flow

1. User input (path/URL) → Subject
2. Subject initializes the appropriate source (local/remote)
3. Scanner traverses files with parallel workers
4. Files are processed according to type and options
5. Results are collected into a tree structure
6. Formatter converts the tree to the desired output format
7. Results are saved or displayed
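The later stages of this flow can be sketched as follows; the types here are illustrative, not dumpfs's actual internals. A tree of nodes is built by the scanner, then handed to a formatter:

```rust
// Step 5: results collected into a tree structure (illustrative type).
#[derive(Debug)]
enum Node {
    File { name: String, content: String },
    Dir { name: String, children: Vec<Node> },
}

// Step 6: a formatter converts the tree to the desired output format.
fn format_markdown(node: &Node, depth: usize) -> String {
    let indent = "  ".repeat(depth);
    match node {
        Node::File { name, .. } => format!("{indent}- {name}\n"),
        Node::Dir { name, children } => {
            let mut out = format!("{indent}- {name}/\n");
            for child in children {
                out.push_str(&format_markdown(child, depth + 1));
            }
            out
        }
    }
}

fn main() {
    let tree = Node::Dir {
        name: "src".into(),
        children: vec![Node::File { name: "main.rs".into(), content: String::new() }],
    };
    // Step 7: results are displayed.
    print!("{}", format_markdown(&tree, 0));
}
```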
## Architecture Overview

dumpfs is organized into several key modules that work together to analyze and format codebase content for LLMs:

### Core Modules
#### 1. Subject (`src/subject.rs`)

- Acts as the central coordinator for processing input sources
- Handles both local directories and remote Git repositories
- Provides a high-level API for scanning and formatting operations
#### 2. Filesystem Scanner (`src/fs/`)

- Handles recursive directory traversal and file analysis
- Implements parallel processing via worker threads for performance
- Detects file types and extracts content & metadata
- Manages filtering based on various criteria (size, date, permissions)
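The worker-thread approach can be sketched with the standard library alone (this is an illustration of the pattern, not the scanner's actual code): a fixed pool of threads drains a shared queue of paths.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Drain a shared work queue of paths with a fixed pool of worker threads.
fn process_parallel(paths: Vec<String>, workers: usize) -> Vec<String> {
    let queue = Arc::new(Mutex::new(paths));
    let results = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for _ in 0..workers {
        let queue = Arc::clone(&queue);
        let results = Arc::clone(&results);
        handles.push(thread::spawn(move || loop {
            // Pop one path; exit when the queue is empty.
            let path = match queue.lock().unwrap().pop() {
                Some(p) => p,
                None => break,
            };
            // Placeholder for the real per-file work (read, classify, tokenize).
            results.lock().unwrap().push(format!("processed {path}"));
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    // Sort so the result is deterministic regardless of thread scheduling.
    let mut out = Arc::try_unwrap(results).unwrap().into_inner().unwrap();
    out.sort();
    out
}

fn main() {
    let out = process_parallel(vec!["a.rs".into(), "b.rs".into()], 2);
    println!("{out:?}");
}
```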
#### 3. Git Integration (`src/git/`)

- Parses and validates remote repository URLs
- Extracts repository metadata (owner, name, branch)
- Manages cloning and updating of remote repositories
- Handles authentication and credentials
- Provides access to repository contents for scanning
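URL parsing and metadata extraction might look roughly like this minimal sketch (GitHub-only, with an illustrative function name; the real parser handles more hosts and URL shapes):

```rust
// Illustrative parser extracting (owner, name) from a GitHub URL.
fn parse_repo_url(url: &str) -> Option<(String, String)> {
    let rest = url
        .strip_prefix("https://github.com/")
        .or_else(|| url.strip_prefix("git@github.com:"))?;
    // A trailing ".git" is optional in clone URLs.
    let rest = rest.strip_suffix(".git").unwrap_or(rest);
    let mut parts = rest.splitn(2, '/');
    let owner = parts.next()?.to_string();
    let name = parts.next()?.to_string();
    if owner.is_empty() || name.is_empty() {
        return None;
    }
    Some((owner, name))
}

fn main() {
    assert_eq!(
        parse_repo_url("https://github.com/kkharji/dumpfs.git"),
        Some(("kkharji".into(), "dumpfs".into()))
    );
    assert_eq!(parse_repo_url("not a url"), None);
}
```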
#### 4. Token Counter (`src/tk/`)

- Implements token counting for various LLM models
- Supports multiple providers (OpenAI, Anthropic, HuggingFace)
- Includes caching to avoid redundant tokenization
- Tracks statistics for optimization
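The caching idea reduces to memoizing counts per content. A minimal sketch (the whitespace split stands in for a real tokenizer, and the actual cache is persistent and model-aware):

```rust
use std::collections::HashMap;

// Illustrative in-memory token-count cache keyed by content.
struct TokenCache {
    counts: HashMap<String, usize>,
    hits: usize, // statistic tracked for optimization
}

impl TokenCache {
    fn new() -> Self {
        TokenCache { counts: HashMap::new(), hits: 0 }
    }

    // Crude whitespace-based estimate standing in for a real tokenizer.
    fn count(&mut self, text: &str) -> usize {
        if let Some(&n) = self.counts.get(text) {
            self.hits += 1;
            return n;
        }
        let n = text.split_whitespace().count();
        self.counts.insert(text.to_string(), n);
        n
    }
}

fn main() {
    let mut cache = TokenCache::new();
    assert_eq!(cache.count("fn main() {}"), 3);
    assert_eq!(cache.count("fn main() {}"), 3); // second call served from cache
    assert_eq!(cache.hits, 1);
}
```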
#### 5. Formatters (`src/fs/fmt/`)

- Converts scanned filesystem data into LLM-friendly formats
- Supports multiple output formats (Markdown, XML, JSON)
- Handles metadata inclusion and content organization
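A pluggable formatter system typically hangs off a shared trait. Here is a minimal sketch with illustrative names (dumpfs's actual trait and formatters may differ):

```rust
// One trait, many output formats: each formatter renders (path, content) pairs.
trait Formatter {
    fn format(&self, files: &[(&str, &str)]) -> String;
}

struct MarkdownFormatter;

impl Formatter for MarkdownFormatter {
    fn format(&self, files: &[(&str, &str)]) -> String {
        let mut out = String::new();
        for (path, content) in files {
            out.push_str(&format!("## {path}\n```\n{content}\n```\n"));
        }
        out
    }
}

struct XmlFormatter;

impl Formatter for XmlFormatter {
    fn format(&self, files: &[(&str, &str)]) -> String {
        files
            .iter()
            .map(|(path, content)| format!("<file path=\"{path}\">{content}</file>\n"))
            .collect()
    }
}

fn main() {
    let files = [("src/main.rs", "fn main() {}")];
    // Formats are interchangeable behind the trait object.
    let fmts: Vec<Box<dyn Formatter>> = vec![Box::new(MarkdownFormatter), Box::new(XmlFormatter)];
    for f in &fmts {
        print!("{}", f.format(&files));
    }
}
```

Because callers only see `dyn Formatter`, adding a new output format is a matter of implementing one trait, which is what makes the system pluggable.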
### Supporting Modules

#### Error Handling (`src/error.rs`)
- Provides a centralized error type system
- Implements custom error conversion and propagation
- Ensures consistent error handling across modules
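A centralized error type usually means one enum plus `From` conversions so `?` propagates errors uniformly. A minimal sketch with illustrative variants (not the actual contents of `src/error.rs`):

```rust
use std::fmt;
use std::io;

// Illustrative centralized error type for the whole crate.
#[derive(Debug)]
enum DumpfsError {
    Io(io::Error),
    InvalidUrl(String),
}

impl fmt::Display for DumpfsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DumpfsError::Io(e) => write!(f, "io error: {e}"),
            DumpfsError::InvalidUrl(u) => write!(f, "invalid repository url: {u}"),
        }
    }
}

// With this conversion, `?` turns io::Error into DumpfsError automatically.
impl From<io::Error> for DumpfsError {
    fn from(e: io::Error) -> Self {
        DumpfsError::Io(e)
    }
}

fn read_config(path: &str) -> Result<String, DumpfsError> {
    Ok(std::fs::read_to_string(path)?)
}

fn main() {
    let err = DumpfsError::InvalidUrl("ftp://nope".into());
    assert_eq!(err.to_string(), "invalid repository url: ftp://nope");
    assert!(read_config("/definitely/missing/path").is_err());
}
```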
#### Cache Management (`src/cache.rs`)
- Manages persistent caching of tokenization results
- Provides cache location and naming utilities
#### CLI Interface (`src/cli/`)
- Implements command-line interface using clap
- Processes user options and coordinates operations
- Provides progress feedback and reporting
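The option-processing shape can be illustrated with plain `std` (the real CLI uses clap's derive API; the flags and struct below are hypothetical):

```rust
use std::env;

// Hypothetical options struct; clap generates something similar from derives.
#[derive(Debug, Default, PartialEq)]
struct Options {
    format: Option<String>,
    no_content: bool,
    paths: Vec<String>,
}

// Hand-rolled stand-in for clap's parsing, for illustration only.
fn parse(args: &[String]) -> Options {
    let mut opts = Options::default();
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        match arg.as_str() {
            "--format" => opts.format = it.next().cloned(),
            "--no-content" => opts.no_content = true,
            other => opts.paths.push(other.to_string()),
        }
    }
    opts
}

fn main() {
    let args: Vec<String> = env::args().skip(1).collect();
    println!("{:?}", parse(&args));
}
```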