# Dumpfs

TLDR: A tool for dumping codebase information for LLMs efficiently and effectively.

It analyzes a codebase and generates a structured representation that can be fed to large language models (LLMs). It supports local directories, individual files, and *remote Git repositories*, including scanning a specific subdirectory within a repository.

Status: It's safe to say it's ready for daily use, as I've been using it for a while now.

## Installation

### Cargo (Rust)
```bash
cargo install dumpfs
```

### NPM (Node.js/JavaScript)
```bash
npm install @kkharji/dumpfs
```

## Usage

### CLI Examples

```bash
# Basic usage (current directory, output to stdout)
dumpfs gen

# Scan a specific directory with output to a file
dumpfs gen /path/to/project -o project_dump.md

# Scan with specific output format
dumpfs gen . -o output.xml -f xml

# Copy the generated content to the clipboard
dumpfs gen . --clip

# Filter files using ignore patterns
dumpfs gen . -i "*.log,*.tmp,node_modules/*"

# Include only specific files
dumpfs gen . -I "*.rs,*.toml"

# Show additional metadata in output
dumpfs gen . -smp

# Skip file contents, show only structure
dumpfs gen . --skip-content

# Scan a remote Git repository
dumpfs gen https://github.com/username/repo -o repo_dump.md

# Generate shell completions (e.g. zsh)
dumpfs completion zsh ~/.config/zsh/completions/_dumpfs
```

### Node.js/JavaScript Library Examples

```javascript
import { scan } from '@kkharji/dumpfs';

// Basic usage - scan current directory
const result = await scan('.');
const llmText = await result.llmText();
console.log(llmText);

// With options - scan with custom settings
const customResult = await scan('/path/to/project', {
  maxDepth: 3,
  ignorePatterns: ['node_modules/**', '*.log'],
  includePatterns: ['*.js', '*.ts', '*.json'],
  skipContent: false,
  model: 'gpt4' // Enable token counting
});

// Generate different output formats
const markdownOutput = await customResult.llmText();

// Customize output options
const customOutput = await customResult.llmText({
  showPermissions: true,
  showSize: true,
  showModified: true,
  includeTreeOutline: true,
  omitFileContents: false
});
```

## Recent Changes & New Features

### v0.1.0+ Updates

**🚀 Node.js Bindings (NAPI)**
- Full JavaScript/TypeScript library with async support
- Cross-platform native modules for optimal performance
- Type definitions included for better development experience
- Available on npm as `@kkharji/dumpfs`

**🧠 Token Counting & LLM Integration**
- Built-in token counting for popular LLM models (GPT-4, Claude Sonnet, Llama, Mistral)
- Model-aware content analysis and optimization
- Caching system for efficient repeated tokenization
- Support for content-based token estimation

**⚡ Enhanced CLI**
- Output to stdout for better shell integration
- Clipboard support for seamless workflow
- Improved progress reporting and error handling
- Better filtering and ignore patterns

**🔧 Performance Improvements**
- Optimized parallel processing with configurable thread counts
- Enhanced file type detection and text classification
- Better memory management for large codebases
- Improved handling of symlinks and permissions

## Key Features

The architecture supports several important features:

1. **Parallel Processing**: Uses worker threads for efficient filesystem traversal and processing
2. **Flexible Input**: Handles both local and remote code sources uniformly
3. **Smart Filtering**: Provides multiple ways to filter content (see the sketch after this list):
   - File size limits
   - Modified dates
   - Permissions
   - Gitignore patterns
   - Custom include/exclude patterns
4. **Token Counting & LLM Integration**:
   - Built-in tokenization for major LLM models (GPT-4, Claude, Llama, Mistral)
   - Implements caching for efficient tokenization
   - Model-aware content analysis and optimization
5. **Performance Optimization**:
   - Uses efficient buffered I/O
   - Provides progress tracking
   - Supports cancellation
6. **Extensibility**:
   - Modular design for adding new tokenizers
   - Support for multiple output formats
   - Pluggable formatter system
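
To make the include/ignore pattern filtering above concrete, here is a minimal sketch using the `globset` crate. The `PatternFilter` type and its methods are invented for this illustration and are not dumpfs's actual implementation; they only mirror the spirit of the `-I` and `-i` CLI options.

```rust
use globset::{Glob, GlobSet, GlobSetBuilder};
use std::path::Path;

/// Hypothetical filter combining include and ignore glob patterns.
struct PatternFilter {
    include: Option<GlobSet>,
    ignore: GlobSet,
}

impl PatternFilter {
    fn new(include: &[&str], ignore: &[&str]) -> Result<Self, globset::Error> {
        let build = |patterns: &[&str]| -> Result<GlobSet, globset::Error> {
            let mut builder = GlobSetBuilder::new();
            for pattern in patterns {
                builder.add(Glob::new(pattern)?);
            }
            builder.build()
        };
        Ok(Self {
            include: if include.is_empty() { None } else { Some(build(include)?) },
            ignore: build(ignore)?,
        })
    }

    /// Keep a path when it matches no ignore pattern and, if include
    /// patterns were given, matches at least one of them.
    fn keep(&self, path: &Path) -> bool {
        if self.ignore.is_match(path) {
            return false;
        }
        match &self.include {
            Some(set) => set.is_match(path),
            None => true,
        }
    }
}

fn main() -> Result<(), globset::Error> {
    let filter = PatternFilter::new(&["*.rs", "*.toml"], &["*.log", "node_modules/*"])?;
    assert!(filter.keep(Path::new("src/main.rs")));
    assert!(!filter.keep(Path::new("debug.log")));
    Ok(())
}
```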

## Data Flow

1. User input (path/URL) → Subject
2. Subject initializes appropriate source (local/remote)
3. Scanner traverses files with parallel workers
4. Files are processed according to type and options
5. Results are collected into a tree structure
6. Formatter converts tree to desired output format
7. Results are saved or displayed
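
The sketch below walks through these seven steps in simplified Rust; every type and function name is a placeholder invented for illustration, not an item from the dumpfs codebase.

```rust
use std::path::PathBuf;

// Placeholder types for this sketch only; they are not dumpfs's real internals.
enum Source { Local(PathBuf), Remote(String) }
struct FileEntry { path: PathBuf }
struct Tree { files: Vec<FileEntry> }

// Steps 1-2: resolve user input into a local or remote source.
fn resolve(input: &str) -> Source {
    if input.starts_with("http") {
        Source::Remote(input.to_string())
    } else {
        Source::Local(PathBuf::from(input))
    }
}

// Steps 3-5: traverse the source and collect results into a tree
// (the real scanner does this with parallel workers and rich metadata).
fn scan(source: &Source) -> Tree {
    let root = match source {
        Source::Local(path) => path.clone(),
        Source::Remote(_url) => PathBuf::from("checkout"), // a cloned copy would be scanned
    };
    Tree { files: vec![FileEntry { path: root }] }
}

// Step 6: a formatter converts the tree into the requested output format.
fn to_markdown(tree: &Tree) -> String {
    tree.files.iter().map(|f| format!("## {}\n", f.path.display())).collect()
}

fn main() {
    let source = resolve(".");
    let tree = scan(&source);
    let output = to_markdown(&tree);
    println!("{output}"); // step 7: written to a file, clipboard, or stdout
}
```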


## Architecture Overview

dumpfs is organized into several key modules that work together to analyze and format codebase content for LLMs:

### Core Modules

#### 1. Subject (`src/subject.rs`)
- Acts as the central coordinator for processing input sources
- Handles both local directories and remote Git repositories
- Provides high-level API for scanning and formatting operations
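
As a rough illustration of that coordinator role, here is one hypothetical shape for the type; the actual definitions in `src/subject.rs` are likely different.

```rust
use std::path::PathBuf;

// Hypothetical sketch only; these names do not come from src/subject.rs.
enum Subject {
    LocalPath(PathBuf),
    RemoteRepo { url: String, checkout: PathBuf },
}

impl Subject {
    /// The directory the scanner should traverse: the path itself for local
    /// input, or the cloned checkout for a remote repository.
    fn scan_root(&self) -> &PathBuf {
        match self {
            Subject::LocalPath(path) => path,
            Subject::RemoteRepo { checkout, .. } => checkout,
        }
    }
}

fn main() {
    let subject = Subject::LocalPath(PathBuf::from("."));
    println!("scanning {}", subject.scan_root().display());
}
```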

#### 2. Filesystem Scanner (`src/fs/`)
- Handles recursive directory traversal and file analysis
- Implements parallel processing via worker threads for performance
- Detects file types and extracts content & metadata
- Manages filtering based on various criteria (size, date, permissions)
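
One common way to build such a scanner is the `ignore` crate's parallel, `.gitignore`-aware walker; whether dumpfs uses it internally is an assumption, and this sketch only shows the general pattern of fanning file entries out to worker threads and collecting them over a channel.

```rust
use ignore::{WalkBuilder, WalkState};
use std::sync::mpsc;

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // Parallel, .gitignore-aware traversal of the current directory.
    WalkBuilder::new(".").build_parallel().run(|| {
        let tx = tx.clone();
        Box::new(move |entry| {
            if let Ok(entry) = entry {
                if entry.file_type().map_or(false, |t| t.is_file()) {
                    // A real scanner would read content and metadata here.
                    let _ = tx.send(entry.path().display().to_string());
                }
            }
            WalkState::Continue
        })
    });
    drop(tx); // close the channel so the receiver loop below terminates

    for path in rx {
        println!("{path}");
    }
}
```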

#### 3. Git Integration (`src/git/`)
- Parses and validates remote repository URLs
- Extracts repository metadata (owner, name, branch)
- Manages cloning and updating of remote repositories
- Handles authentication and credentials
- Provides access to repository contents for scanning
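
A rough sketch of extracting owner and repository name from an HTTPS repository URL with the `url` crate follows; the helper name is hypothetical and the real parsing in `src/git/` may handle more URL shapes (SSH remotes, branches, subdirectories).

```rust
use url::Url;

/// Hypothetical helper: pull (owner, repo) out of a GitHub-style HTTPS URL.
fn parse_repo(input: &str) -> Option<(String, String)> {
    let url = Url::parse(input).ok()?;
    let mut segments = url.path_segments()?;
    let owner = segments.next()?.to_string();
    let repo = segments.next()?.trim_end_matches(".git").to_string();
    if owner.is_empty() || repo.is_empty() {
        return None;
    }
    Some((owner, repo))
}

fn main() {
    let parsed = parse_repo("https://github.com/username/repo");
    assert_eq!(parsed, Some(("username".into(), "repo".into())));
    println!("{parsed:?}");
}
```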

#### 4. Token Counter (`src/tk/`)
- Implements token counting for various LLM models
- Supports multiple providers (OpenAI, Anthropic, HuggingFace)
- Includes caching to avoid redundant tokenization
- Tracks statistics for optimization
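
A content-hash cache wrapped around a tokenizer is one way to avoid redundant tokenization. The trait, the crude character-based estimator, and the in-memory map below are all assumptions for illustration, not dumpfs's real types.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical abstraction over model-specific tokenizers (GPT-4, Claude, ...).
trait TokenCounter {
    fn count_tokens(&self, text: &str) -> usize;
}

/// Crude fallback estimator: roughly four characters per token.
struct HeuristicCounter;

impl TokenCounter for HeuristicCounter {
    fn count_tokens(&self, text: &str) -> usize {
        (text.chars().count() + 3) / 4
    }
}

/// Wraps any counter and memoizes results keyed by a hash of the content.
struct CachedCounter<C: TokenCounter> {
    inner: C,
    cache: HashMap<u64, usize>,
}

impl<C: TokenCounter> CachedCounter<C> {
    fn new(inner: C) -> Self {
        Self { inner, cache: HashMap::new() }
    }

    fn count(&mut self, text: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        text.hash(&mut hasher);
        let key = hasher.finish();
        if let Some(&cached) = self.cache.get(&key) {
            return cached;
        }
        let computed = self.inner.count_tokens(text);
        self.cache.insert(key, computed);
        computed
    }
}

fn main() {
    let mut counter = CachedCounter::new(HeuristicCounter);
    let first = counter.count("fn main() {}");
    let second = counter.count("fn main() {}"); // served from the cache
    assert_eq!(first, second);
    println!("{first} tokens (estimated)");
}
```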

#### 5. Formatters (`src/fs/fmt/`)
- Converts scanned filesystem data into LLM-friendly formats
- Supports multiple output formats (Markdown, XML, JSON)
- Handles metadata inclusion and content organization
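
A pluggable formatter system is often just a trait with one implementation per output format. The trait, node type, and formatters below are illustrative assumptions rather than the contents of `src/fs/fmt/`.

```rust
use std::path::PathBuf;

/// Minimal stand-in for the scanned tree; not dumpfs's real data model.
struct FileNode {
    path: PathBuf,
    content: String,
}

/// Hypothetical formatter trait: one implementation per output format.
trait Formatter {
    fn format(&self, files: &[FileNode]) -> String;
}

struct MarkdownFormatter;

impl Formatter for MarkdownFormatter {
    fn format(&self, files: &[FileNode]) -> String {
        files
            .iter()
            .map(|f| format!("## {}\n{}\n\n", f.path.display(), f.content))
            .collect()
    }
}

struct XmlFormatter;

impl Formatter for XmlFormatter {
    fn format(&self, files: &[FileNode]) -> String {
        files
            .iter()
            .map(|f| format!("<file path=\"{}\">{}</file>\n", f.path.display(), f.content))
            .collect()
    }
}

fn main() {
    let files = vec![FileNode {
        path: PathBuf::from("src/main.rs"),
        content: "fn main() {}".into(),
    }];
    let formatter: Box<dyn Formatter> = Box::new(MarkdownFormatter);
    println!("{}", formatter.format(&files));
}
```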

### Supporting Modules

#### Error Handling (`src/error.rs`)
- Provides a centralized error type system
- Implements custom error conversion and propagation
- Ensures consistent error handling across modules
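
A centralized error enum of this kind is often built with the `thiserror` crate; whether dumpfs does so is an assumption, and the variants below are examples rather than the actual contents of `src/error.rs`.

```rust
use thiserror::Error;

/// Illustrative central error type; variant names are made up for this sketch.
#[derive(Debug, Error)]
pub enum DumpError {
    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),

    #[error("invalid repository URL: {0}")]
    InvalidUrl(String),

    #[error("unsupported output format: {0}")]
    UnsupportedFormat(String),
}

/// With `#[from]`, the `?` operator converts lower-level errors into the central type.
fn read_config(path: &str) -> Result<String, DumpError> {
    let contents = std::fs::read_to_string(path)?;
    Ok(contents)
}

fn main() {
    match read_config("dumpfs.toml") {
        Ok(cfg) => println!("{cfg}"),
        Err(err) => eprintln!("error: {err}"),
    }
}
```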

#### Cache Management (`src/cache.rs`)
- Manages persistent caching of tokenization results
- Provides cache location and naming utilities
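
A typical way to derive a per-tool cache location is the `dirs` crate's platform-specific cache directory; treat the path layout and file naming below as assumptions, not the behavior of `src/cache.rs`.

```rust
use std::path::PathBuf;

/// Hypothetical helper: a per-model cache file under the user's cache directory,
/// e.g. ~/.cache/dumpfs/tokens-gpt4.bin on Linux.
fn token_cache_path(model: &str) -> Option<PathBuf> {
    let mut path = dirs::cache_dir()?; // platform-specific cache root
    path.push("dumpfs");
    path.push(format!("tokens-{model}.bin"));
    Some(path)
}

fn main() {
    if let Some(path) = token_cache_path("gpt4") {
        println!("cache file would live at {}", path.display());
    }
}
```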

#### CLI Interface (`src/cli/`)
- Implements command-line interface using clap
- Processes user options and coordinates operations
- Provides progress feedback and reporting
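
The CLI shown in the usage section could be declared with clap's derive API roughly as follows. The flag set mirrors the examples above, but the exact definitions in `src/cli/` are assumptions and only a subset of options is sketched.

```rust
use clap::{Parser, Subcommand};

/// Illustrative clap definition modeled on the usage examples; not the real src/cli/ code.
#[derive(Parser)]
#[command(name = "dumpfs", about = "Dump codebase information for LLMs")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Generate a dump of a local path or remote repository.
    Gen {
        /// Directory, file, or Git URL to scan (defaults to the current directory).
        #[arg(default_value = ".")]
        target: String,
        /// Write output to a file instead of stdout.
        #[arg(short, long)]
        output: Option<String>,
        /// Comma-separated ignore patterns.
        #[arg(short = 'i', long)]
        ignore: Option<String>,
        /// Skip file contents and emit only the structure.
        #[arg(long)]
        skip_content: bool,
    },
    /// Generate shell completion scripts.
    Completion { shell: String, path: Option<String> },
}

fn main() {
    let cli = Cli::parse();
    match cli.command {
        Command::Gen { target, output, .. } => {
            println!("scanning {target}, output: {output:?}");
        }
        Command::Completion { shell, .. } => println!("generating completions for {shell}"),
    }
}
```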