cadi-scraper
A scraper and chunker utility that converts source code repositories and file data into reusable CADI chunks.
Overview
cadi-scraper automatically analyzes source code projects and converts them into optimized, content-addressed chunks ready for distribution through CADI registries. It handles multiple programming languages and diverse file formats, and it chunks code semantically at logical boundaries.
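To illustrate what "content-addressed" means here, the following minimal sketch derives a chunk ID purely from the chunk's bytes, so identical content always maps to the same ID. It is not the crate's implementation: `chunk_id` and the `chunk:fnv1a64:` prefix are hypothetical, and FNV-1a is used only because it fits in a few lines of std-only Rust. A real registry would presumably use a cryptographic hash instead.

```rust
/// FNV-1a 64-bit hash. A small, deterministic stand-in (assumption) for
/// the cryptographic hash a real registry would presumably use.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical helper: the ID depends only on the content, never on the
/// file name or location, which is what makes it content-addressed.
fn chunk_id(content: &[u8]) -> String {
    format!("chunk:fnv1a64:{:016x}", fnv1a64(content))
}
```

Calling `chunk_id(b"fn main() {}")` twice yields the same string, so a registry can deduplicate identical chunks by ID alone.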
Features
- Multi-language support: Rust, TypeScript, Python, JavaScript, Go, C/C++
- Format-agnostic: Source code, Markdown, JSON, YAML, HTML, CSS
- 5 chunking strategies: By-file, Semantic, Fixed-size, Hierarchical, By-line-count
- Automatic metadata extraction: Titles, descriptions, licenses, frameworks, API surfaces
- Rate-limited fetching: HTTP and filesystem access with configurable throttling
- Semantic analysis: AST-based code understanding via tree-sitter
- Framework detection: Identifies 20+ popular frameworks (React, Django, Spring, etc.)
- License detection: Recognizes SPDX licenses automatically
- Async/await: High-performance concurrent processing
Installation
Add this to your Cargo.toml:
[dependencies]
cadi-scraper = "1.0"
Quick Start
Basic Scraping
use ;
async
CLI Usage
# Install
# Scrape a project
# Publish to registry
# See all options
Chunking Strategies
By-File
Creates one chunk per file. Fast, simple, preserves file structure.
ByFile
Semantic
Analyzes code structure and chunks at logical boundaries (functions, classes, modules).
Semantic
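As a rough illustration of chunking at logical boundaries, the sketch below starts a new chunk whenever a line begins a top-level Rust item. This is a deliberately naive, line-based stand-in: the crate itself is described as using AST-based analysis via tree-sitter, and `split_at_items` is a hypothetical name, not part of its API.

```rust
/// Split Rust-like source into chunks, starting a new chunk at each line
/// that begins a top-level item. Naive heuristic (assumption): only
/// unindented lines with one of these prefixes count as boundaries.
fn split_at_items(source: &str) -> Vec<String> {
    const STARTS: [&str; 6] = ["fn ", "pub fn ", "struct ", "enum ", "impl ", "mod "];
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    for line in source.lines() {
        let begins_item = STARTS.iter().any(|s| line.starts_with(s));
        // Close the running chunk before starting a new item.
        if begins_item && !current.trim().is_empty() {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.trim().is_empty() {
        chunks.push(current);
    }
    chunks
}
```

A prefix check like this attaches doc comments and attributes to the preceding chunk and cannot see nested items, which is the kind of problem a real parser avoids.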
Fixed-Size
Creates fixed-byte chunks, useful for uniform processing.
FixedSize
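A minimal sketch of fixed-byte chunking, assuming nothing about the crate's internals (`fixed_size_chunks` is a hypothetical helper built on the standard library's `slice::chunks`):

```rust
/// Split `data` into consecutive chunks of at most `size` bytes.
/// Every chunk except possibly the last has exactly `size` bytes.
fn fixed_size_chunks(data: &[u8], size: usize) -> Vec<&[u8]> {
    assert!(size > 0, "chunk size must be non-zero");
    data.chunks(size).collect()
}
```

For example, `fixed_size_chunks(b"abcdefghij", 4)` yields `abcd`, `efgh`, and `ij`. Because the split is byte-based, a boundary can fall in the middle of a multi-byte UTF-8 character or a token.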
Hierarchical
Creates a parent chunk per file, with child chunks for its functions and classes.
Hierarchical
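The parent/child relationship can be pictured as a small tree. The sketch below is only an illustration of that shape: `ChunkNode`, its fields, and `file_chunk` are hypothetical and are not the types exposed by cadi-scraper or cadi-core.

```rust
/// A node in a hypothetical chunk tree: a file chunk at the root,
/// with function/class chunks as children.
#[derive(Debug)]
struct ChunkNode {
    name: String,
    start_line: usize, // 1-based, inclusive
    end_line: usize,   // 1-based, inclusive
    children: Vec<ChunkNode>,
}

impl ChunkNode {
    /// Total number of chunks in this subtree, including the node itself.
    fn count(&self) -> usize {
        1 + self.children.iter().map(ChunkNode::count).sum::<usize>()
    }

    /// True when every child's line range lies inside its parent's range.
    fn is_well_nested(&self) -> bool {
        self.children.iter().all(|c| {
            c.start_line >= self.start_line
                && c.end_line <= self.end_line
                && c.is_well_nested()
        })
    }
}

/// Build a parent chunk for a file from (name, start, end) item spans.
fn file_chunk(path: &str, total_lines: usize, items: &[(&str, usize, usize)]) -> ChunkNode {
    ChunkNode {
        name: path.to_string(),
        start_line: 1,
        end_line: total_lines,
        children: items
            .iter()
            .map(|&(name, start, end)| ChunkNode {
                name: name.to_string(),
                start_line: start,
                end_line: end,
                children: Vec::new(),
            })
            .collect(),
    }
}
```

A file with two items, such as `file_chunk("src/lib.rs", 40, &[("fn parse", 3, 18), ("struct Config", 20, 31)])`, produces three chunks: one parent and two children.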
By-Line-Count
Creates a new chunk every N lines (default: 100).
ByLineCount
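A minimal sketch of line-count chunking, again with a hypothetical helper (`line_chunks`) rather than the crate's own code:

```rust
/// Group the lines of `text` into chunks of at most `lines_per_chunk`
/// lines each. The last chunk holds whatever lines remain.
fn line_chunks(text: &str, lines_per_chunk: usize) -> Vec<String> {
    assert!(lines_per_chunk > 0, "lines_per_chunk must be non-zero");
    let lines: Vec<&str> = text.lines().collect();
    lines
        .chunks(lines_per_chunk)
        .map(|group| group.join("\n"))
        .collect()
}
```

With the default of 100 lines, a 250-line file yields three chunks of 100, 100, and 50 lines.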
Configuration
Via Environment Variables
# 50MB
Via Config File
Create ~/.cadi/scraper-config.yaml:
chunking_strategy: semantic
max_chunk_size: 52428800
include_overlap: true
extract_api_surface: true
detect_licenses: true
languages:
  rust:
    enabled: true
    custom_patterns:
  python:
    enabled: true
    custom_patterns:
Programmatically
let config = ScraperConfig ;
Output
Scraping produces ScraperOutput with:
Advanced Usage
Custom Language Patterns
let mut config = default;
config.languages.insert;
Publishing Chunks
use RegistryClient;
let output = scraper.scrape.await?;
let client = new;
for chunk in output.chunks
Batch Processing
let inputs = vec!;
for input in inputs
Framework Detection
Automatically detects:
- Frontend: React, Vue, Angular, Svelte, Next.js
- Backend: Express, Fastify, Django, FastAPI, Spring, Rails
- Async Runtimes: Tokio, async-std
- Testing: Jest, pytest, RSpec
- Build Tools: Webpack, Vite, Cargo, Maven
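One plausible way to detect frameworks like those listed above is to match a project's declared dependency names against a table of known packages. The sketch below assumes that approach; the README does not say how cadi-scraper actually does it, and `detect_frameworks` and its table are hypothetical.

```rust
/// Return the known frameworks whose package name appears in
/// `dependencies` (for example, names read from package.json,
/// requirements.txt, or Cargo.toml). Matching is case-insensitive.
fn detect_frameworks(dependencies: &[&str]) -> Vec<&'static str> {
    // (package name, display name). Illustrative subset only.
    const KNOWN: [(&str, &str); 8] = [
        ("react", "React"),
        ("vue", "Vue"),
        ("express", "Express"),
        ("django", "Django"),
        ("fastapi", "FastAPI"),
        ("tokio", "Tokio"),
        ("jest", "Jest"),
        ("pytest", "pytest"),
    ];
    let mut found = Vec::new();
    for (package, framework) in KNOWN.iter() {
        if dependencies.iter().any(|d| d.eq_ignore_ascii_case(package)) {
            found.push(*framework);
        }
    }
    found
}
```

Results come back in table order, so `detect_frameworks(&["serde", "Tokio", "react"])` reports React before Tokio.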
License Detection
Recognizes SPDX identifiers:
- MIT
- Apache-2.0
- GPL-3.0
- BSD-3-Clause
- ISC
- And many more...
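Many projects declare their license with an `SPDX-License-Identifier:` tag in a file header. The sketch below extracts that tag from a piece of text; it is one simple detection technique, not necessarily the one cadi-scraper uses, and `spdx_identifier` is a hypothetical helper.

```rust
/// Return the value of the first `SPDX-License-Identifier:` tag in `text`.
/// The whole license expression after the tag is returned as written.
fn spdx_identifier(text: &str) -> Option<&str> {
    const TAG: &str = "SPDX-License-Identifier:";
    text.lines().find_map(|line| {
        let start = line.find(TAG)? + TAG.len();
        // Drop surrounding whitespace and a trailing block-comment closer.
        let id = line[start..].trim().trim_end_matches("*/").trim();
        if id.is_empty() {
            None
        } else {
            Some(id)
        }
    })
}
```

A header such as `// SPDX-License-Identifier: MIT` yields `Some("MIT")`, and text without a tag yields `None`. Detecting a license from the full text of a LICENSE file requires a different technique, such as comparing against known license templates.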
Performance
Typical performance on modern hardware:
- By-file chunking: ~100 MB/sec
- Semantic chunking: ~50 MB/sec
- Metadata extraction: included in the figures above
- Rate limiting: Configurable (default 10 req/sec)
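To make the default of 10 requests per second concrete, here is a minimal sketch of an interval-based limiter that spaces requests at least 100 ms apart. `RateLimiter`, `per_second`, and `reserve` are hypothetical names, and the crate may well use a different scheme (a token bucket, for instance).

```rust
use std::time::{Duration, Instant};

/// Spaces requests at least `min_interval` apart.
struct RateLimiter {
    min_interval: Duration,
    next_allowed: Option<Instant>,
}

impl RateLimiter {
    /// Allow at most `requests` requests per second, evenly spaced.
    fn per_second(requests: u32) -> Self {
        assert!(requests > 0, "rate must be non-zero");
        RateLimiter {
            min_interval: Duration::from_secs(1) / requests,
            next_allowed: None,
        }
    }

    /// Reserve the next free slot and return how long the caller should
    /// wait, measured from `now`, before sending its request.
    fn reserve(&mut self, now: Instant) -> Duration {
        let slot = match self.next_allowed {
            Some(t) if t > now => t, // earlier reservations push us back
            _ => now,                // idle: send immediately
        };
        self.next_allowed = Some(slot + self.min_interval);
        slot - now
    }
}
```

Passing `now` in explicitly keeps the limiter free of hidden clock reads. An async caller would sleep for the returned duration (for example with its runtime's sleep function) before issuing the request.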
Error Handling
use Error;
match scraper.scrape.await
Integration
Part of the CADI ecosystem:
- cadi-core: Chunk and manifest types
- cadi-registry: Publish scraped chunks
- cadi: CLI integration
- cadi-builder: Transform scraped chunks
Documentation
- Full API docs: docs.rs/cadi-scraper
- User guide: see SCRAPER-GUIDE.md in the repository
- Examples: See repository examples/ directory
License
MIT License