cadi-scraper 2.0.0

# cadi-scraper

CADI Scraper/Chunker utility for converting source code repos and file data into reusable CADI chunks.

## Overview

`cadi-scraper` analyzes source code projects and converts them into content-addressed chunks ready for distribution through CADI registries. It handles multiple programming languages and file formats, and it can chunk code semantically at logical boundaries.
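To illustrate what "content-addressed" means here, the sketch below derives a chunk's ID purely from its bytes, so identical content always maps to the same ID. It is a minimal illustration, not the crate's implementation: `fnv1a64` and `chunk_id` are hypothetical names, and FNV-1a is a non-cryptographic stand-in chosen only to avoid dependencies. A real registry would presumably use a cryptographic hash such as SHA-256.

```rust
/// FNV-1a 64-bit hash. A non-cryptographic stand-in used only to keep this
/// sketch dependency-free; it is an assumption, not the hash CADI uses.
fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Hypothetical chunk ID format: the hash rendered as 16 lowercase hex digits.
fn chunk_id(content: &[u8]) -> String {
    format!("{:016x}", fnv1a64(content))
}
```

Because the ID depends only on the content, two scrapes of an unchanged file yield the same ID, which is what lets a registry deduplicate chunks.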

## Features

- **Multi-language support**: Rust, TypeScript, Python, JavaScript, Go, C/C++
- **Format-agnostic**: Source code, Markdown, JSON, YAML, HTML, CSS
- **5 chunking strategies**: By-file, Semantic, Fixed-size, Hierarchical, By-line-count
- **Automatic metadata extraction**: Titles, descriptions, licenses, frameworks, API surfaces
- **Rate-limited fetching**: HTTP and filesystem access with configurable throttling
- **Semantic analysis**: AST-based code understanding via tree-sitter
- **Framework detection**: Identifies 20+ popular frameworks (React, Django, Spring, etc.)
- **License detection**: Recognizes SPDX licenses automatically
- **Async/await**: High-performance concurrent processing

## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
cadi-scraper = "2.0"
```

## Quick Start

### Basic Scraping

```rust
use cadi_scraper::{Scraper, ScraperConfig, ScraperInput, ChunkingStrategy};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = ScraperConfig {
        chunking_strategy: ChunkingStrategy::Semantic,
        max_chunk_size: 50_000,
        ..Default::default()
    };
    
    let scraper = Scraper::new(config);
    
    let input = ScraperInput::LocalPath("./my-project".into());
    let output = scraper.scrape(input).await?;
    
    println!("Created {} chunks", output.chunks.len());
    println!("Total bytes: {}", output.statistics.total_bytes);
    
    Ok(())
}
```

### CLI Usage

```bash
# Install
cargo install cadi

# Scrape a project
cadi scrape ./my-project --strategy semantic --output ./chunks

# Publish to registry
cadi publish --registry https://registry.example.com \
  --auth-token TOKEN \
  --namespace myorg/project

# See all options
cadi scrape --help
```

## Chunking Strategies

### By-File
Creates one chunk per file. Fast, simple, preserves file structure.

```rust
ChunkingStrategy::ByFile
```

### Semantic
Analyzes code structure and chunks at logical boundaries (functions, classes, modules).

```rust
ChunkingStrategy::Semantic
```
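As a rough picture of chunking at logical boundaries, the sketch below starts a new chunk whenever a line opens a top-level Rust item. It is a deliberately naive line-prefix heuristic standing in for the tree-sitter analysis the crate describes; `ITEM_PREFIXES` and `chunk_at_item_boundaries` are hypothetical names, not part of the crate's API.

```rust
/// Keywords that open a new chunk when they start a line at column 0.
/// A hypothetical, Rust-only list chosen for this sketch.
const ITEM_PREFIXES: [&str; 6] = ["fn ", "pub fn ", "struct ", "pub struct ", "impl ", "mod "];

/// Split `source` into chunks that each begin at a top-level item.
/// Lines before the first item (imports, comments) form their own chunk.
fn chunk_at_item_boundaries(source: &str) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    for line in source.lines() {
        let starts_item = ITEM_PREFIXES.iter().any(|p| line.starts_with(*p));
        // Close the chunk in progress, unless it holds only whitespace so far.
        if starts_item && !current.trim().is_empty() {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.trim().is_empty() {
        chunks.push(current);
    }
    chunks
}
```

A prefix match cannot see nesting, attributes, or doc comments attached to an item, which is the kind of gap an AST-based approach is meant to close.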

### Fixed-Size
Creates chunks of a fixed byte size. Useful when downstream processing expects uniformly sized inputs.

```rust
ChunkingStrategy::FixedSize
```
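One detail fixed-size chunking has to handle is text encoding: cutting at an arbitrary byte offset can split a multi-byte UTF-8 character. The sketch below shows one way to deal with that, by backing off to the nearest character boundary. `chunk_fixed_size` is a hypothetical helper written for illustration, and whether the crate behaves this way is an assumption.

```rust
/// Split `text` into chunks of at most `max_bytes` bytes each, backing off to
/// the nearest UTF-8 character boundary so no chunk ends mid-character.
fn chunk_fixed_size(text: &str, max_bytes: usize) -> Vec<&str> {
    // A UTF-8 character is at most 4 bytes, so this guarantees progress.
    assert!(max_bytes >= 4, "max_bytes must fit any UTF-8 character");
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let mut end = (start + max_bytes).min(text.len());
        // Step back until `end` falls on a character boundary.
        while !text.is_char_boundary(end) {
            end -= 1;
        }
        chunks.push(&text[start..end]);
        start = end;
    }
    chunks
}
```

Chunks may therefore be slightly shorter than `max_bytes`, but concatenating them always reproduces the input exactly.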

### Hierarchical
Creates one parent chunk per file, with child chunks for its functions and classes.

```rust
ChunkingStrategy::Hierarchical
```

### By-Line-Count
Starts a new chunk every N lines (default: 100).

```rust
ChunkingStrategy::ByLineCount
```
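The line-count strategy is simple enough to sketch in a few lines. The function below is a hypothetical illustration, not the crate's code: it groups lines into batches of at most `lines_per_chunk` and re-joins each batch with `\n`.

```rust
/// Group the lines of `text` into chunks of at most `lines_per_chunk` lines.
/// `str::lines` drops line terminators, so each chunk is re-joined with `\n`.
fn chunk_by_line_count(text: &str, lines_per_chunk: usize) -> Vec<String> {
    assert!(lines_per_chunk > 0, "lines_per_chunk must be positive");
    let lines: Vec<&str> = text.lines().collect();
    lines
        .chunks(lines_per_chunk)
        .map(|group| group.join("\n"))
        .collect()
}
```

With the default of 100 lines, a 250-line file would yield three chunks under this scheme: two of 100 lines and one of 50.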

## Configuration

### Via Environment Variables

```bash
export CADI_CHUNKING_STRATEGY="semantic"
export CADI_MAX_CHUNK_SIZE="52428800"  # 50 MiB (50 * 1024 * 1024 bytes)
export CADI_INCLUDE_OVERLAP="true"
export CADI_EXTRACT_API_SURFACE="true"
export CADI_DETECT_LICENSES="true"
```
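For readers wiring similar variables into their own tooling, here is a minimal sketch of parsing them with fallbacks. `EnvSettings`, `settings_from`, and `settings_from_env` are hypothetical and do not mirror the crate's `ScraperConfig`; the fallback values are assumptions made for the example, not documented defaults.

```rust
use std::env;

/// Hypothetical settings type for this sketch; not the crate's `ScraperConfig`.
#[derive(Debug, PartialEq)]
struct EnvSettings {
    chunking_strategy: String,
    max_chunk_size: usize,
    include_overlap: bool,
}

/// Build settings from a lookup function so the parsing can be exercised
/// without touching the real process environment.
fn settings_from(lookup: impl Fn(&str) -> Option<String>) -> EnvSettings {
    EnvSettings {
        // Fallback values below are assumptions made for this sketch.
        chunking_strategy: lookup("CADI_CHUNKING_STRATEGY")
            .unwrap_or_else(|| "by-file".to_string()),
        // An unparsable number falls back to the default instead of failing.
        max_chunk_size: lookup("CADI_MAX_CHUNK_SIZE")
            .and_then(|v| v.parse().ok())
            .unwrap_or(52_428_800),
        include_overlap: lookup("CADI_INCLUDE_OVERLAP")
            .map(|v| v.eq_ignore_ascii_case("true"))
            .unwrap_or(false),
    }
}

/// Read the same settings from the process environment.
fn settings_from_env() -> EnvSettings {
    settings_from(|key| env::var(key).ok())
}
```

Taking the lookup as a parameter keeps the parsing logic testable; `settings_from_env` is the thin wrapper a program would actually call.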

### Via Config File

Create `~/.cadi/scraper-config.yaml`:

```yaml
chunking_strategy: semantic
max_chunk_size: 52428800
include_overlap: true
extract_api_surface: true
detect_licenses: true
languages:
  rust:
    enabled: true
    custom_patterns: []
  python:
    enabled: true
    custom_patterns: []
```

### Programmatically

```rust
let config = ScraperConfig {
    chunking_strategy: ChunkingStrategy::Semantic,
    max_chunk_size: 50_000,
    include_overlap: true,
    hierarchy: true,
    extract_api: true,
    detect_licenses: true,
    ..Default::default()
};
```

## Output

Scraping produces a `ScraperOutput`:

```rust
pub struct ScraperOutput {
    pub chunks: Vec<ScrapedChunk>,      // Generated chunks
    pub manifest: Manifest,              // Dependency graph
    pub statistics: ScrapingStatistics,  // Metrics
    pub errors: Vec<String>,             // Non-fatal errors
}
```

## Advanced Usage

### Custom Language Patterns

```rust
let mut config = ScraperConfig::default();
config.languages.insert("rust".to_string(), LanguageConfig {
    enabled: true,
    custom_patterns: vec![
        r"#\[derive\((.*?)\)\]".to_string(),
    ],
});
```

### Publishing Chunks

```rust
use cadi_registry::RegistryClient;

let output = scraper.scrape(input).await?;
let client = RegistryClient::new(registry_url, auth_token);

for chunk in output.chunks {
    client.publish_chunk(&chunk).await?;
}
```

### Batch Processing

```rust
let inputs = vec![
    ScraperInput::LocalPath("./project1".into()),
    ScraperInput::LocalPath("./project2".into()),
    ScraperInput::Url("https://github.com/user/repo".into()),
];

for input in inputs {
    let output = scraper.scrape(input).await?;
    // Process output...
}
```

## Framework Detection

Automatically detects:

- **Frontend**: React, Vue, Angular, Svelte, Next.js
- **Backend**: Express, Fastify, Django, FastAPI, Spring, Rails
- **Async Runtimes**: Tokio, async-std
- **Testing**: Jest, pytest, RSpec
- **Build Tools**: Webpack, Vite, Cargo, Maven
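One plausible way to detect frameworks like those listed above is to match a project's declared dependency names against a lookup table. The sketch below assumes that approach; the README does not say how the crate actually does it. `KNOWN_FRAMEWORKS` and `detect_frameworks` are hypothetical, and the table is a small sample rather than the full list.

```rust
/// Hypothetical mapping from a dependency name to the framework it indicates.
const KNOWN_FRAMEWORKS: [(&str, &str); 6] = [
    ("react", "React"),
    ("vue", "Vue"),
    ("express", "Express"),
    ("django", "Django"),
    ("fastapi", "FastAPI"),
    ("tokio", "Tokio"),
];

/// Return the frameworks implied by a project's declared dependency names.
/// Matching is case-insensitive; results follow the order of the table.
fn detect_frameworks(dependencies: &[&str]) -> Vec<&'static str> {
    KNOWN_FRAMEWORKS
        .iter()
        .filter(|(dep, _)| dependencies.iter().any(|d| d.eq_ignore_ascii_case(dep)))
        .map(|(_, framework)| *framework)
        .collect()
}
```

The dependency names would come from manifests such as `package.json`, `requirements.txt`, or `Cargo.toml`; parsing those files is outside the scope of this sketch.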

## License Detection

Recognizes SPDX identifiers:

- MIT
- Apache-2.0
- GPL-3.0
- BSD-3-Clause
- ISC
- And many more...
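Source files often declare their license with an `SPDX-License-Identifier:` comment tag, which makes one form of detection straightforward. The sketch below extracts the identifier from the first such tag. `spdx_identifier` is a hypothetical helper, and the crate may rely on other signals as well (for example `LICENSE` files or manifest fields).

```rust
/// Return the identifier from the first `SPDX-License-Identifier:` tag in
/// `text`, or `None` if no non-empty tag is present.
fn spdx_identifier(text: &str) -> Option<&str> {
    const TAG: &str = "SPDX-License-Identifier:";
    text.lines().find_map(|line| {
        let (_, rest) = line.split_once(TAG)?;
        // Drop surrounding whitespace and a trailing block-comment closer.
        let id = rest.trim().trim_end_matches("*/").trim();
        if id.is_empty() {
            None
        } else {
            Some(id)
        }
    })
}
```

This returns the raw expression after the tag, so a compound expression such as `MIT OR Apache-2.0` comes back as a single string and would need further parsing.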

## Performance

Typical performance on modern hardware:

- **By-file chunking**: ~100 MB/sec
- **Semantic chunking**: ~50 MB/sec
- **Metadata extraction**: Included in the figures above
- **Rate limiting**: Configurable (default 10 req/sec)
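A limit such as 10 requests per second can be enforced by spacing requests at least 100 ms apart. The sketch below shows that minimum-interval approach. It is an assumption about how such throttling might work, not the crate's implementation, and `RateLimiter` and `reserve` are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Minimum-interval limiter: allows at most `rate_per_sec` operations per
/// second by spacing them evenly.
struct RateLimiter {
    interval: Duration,
    next_allowed: Option<Instant>,
}

impl RateLimiter {
    fn new(rate_per_sec: u32) -> Self {
        assert!(rate_per_sec > 0, "rate must be positive");
        RateLimiter {
            interval: Duration::from_secs(1) / rate_per_sec,
            next_allowed: None,
        }
    }

    /// How long a caller arriving at `now` must wait before proceeding.
    /// Also reserves that slot, so the next caller is scheduled after it.
    fn reserve(&mut self, now: Instant) -> Duration {
        let slot = match self.next_allowed {
            Some(t) if t > now => t,
            _ => now,
        };
        self.next_allowed = Some(slot + self.interval);
        slot - now
    }
}
```

A caller would sleep for the returned duration before issuing its request. In async code that would be the runtime's sleep (for example `tokio::time::sleep`) rather than `std::thread::sleep`.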

## Error Handling

```rust
use cadi_scraper::error::Error;

match scraper.scrape(input).await {
    Ok(output) => {
        if !output.errors.is_empty() {
            eprintln!("Warnings: {:?}", output.errors);
        }
    }
    Err(Error::InvalidInput(msg)) => eprintln!("Invalid input: {}", msg),
    Err(Error::Fetch(msg)) => eprintln!("Fetch failed: {}", msg),
    Err(e) => eprintln!("Error: {}", e),
}
```

## Integration

Part of the CADI ecosystem:

- **cadi-core**: Chunk and manifest types
- **cadi-registry**: Publish scraped chunks
- **cadi**: CLI integration
- **cadi-builder**: Transform scraped chunks

## Documentation

- Full API docs: [docs.rs/cadi-scraper](https://docs.rs/cadi-scraper)
- User guide: see `SCRAPER-GUIDE.md` in the repository
- Examples: see the `examples/` directory in the repository

## License

MIT License