scribe-selection

Intelligent file selection and context extraction for Scribe repository bundles.

Overview

scribe-selection implements sophisticated algorithms for choosing which files to include in a repository bundle. Rather than naively including everything or requiring manual selection, it uses multi-dimensional scoring, graph centrality, and heuristic analysis to automatically identify the most important code for LLM understanding.

Key Features

Multi-Dimensional Scoring

  • Documentation score: Rewards READMEs, design docs, API documentation
  • PageRank centrality: Files that are heavily imported get higher scores
  • Test linkage: Identifies test files and their corresponding source files
  • Git churn: Integrates change frequency and recency signals
  • Path depth: Closer-to-root files prioritized (entry points, configs)
  • Template detection: Demotes boilerplate and auto-generated code
  • Entrypoint detection: Identifies main.rs, __init__.py, index.js
  • Example code scoring: Detects and prioritizes example/demo code

Selection Algorithms

Simple Router (Default)

Transparent rule-based decision tree for file selection:

  • Priority-based routing (mandatory → high priority → budget-based)
  • Deterministic and explainable decisions
  • Fast: O(n log n) for sorting, O(n) for selection
  • Replaced complex bandit algorithm with clearer logic

Covering Set Selection

Surgical selection targeting specific entities:

  • AST-based search for functions, classes, modules
  • Transitive dependency closure computation
  • Configurable depth limits and importance thresholds
  • Minimal file sets for understanding or impact analysis

Progressive Demotion

As the selection approaches the token budget, content is progressively reduced:

  • FULL: Complete file content (default)
  • CHUNK: AST-based semantic sections (important functions/classes preserved)
  • SIGNATURE: Type signatures and interfaces only (minimal information)

Achieves 3-10x compression while preserving critical context.

Configurable Weights

Two preset configurations:

  • V1: Balanced across all dimensions
  • V2: Higher weight on centrality and documentation

Custom weight tuning for specific use cases.

Architecture

FileMetadata + Scores → Selection Algorithm → Budget Enforcement → Demotion Engine → Final Selection

  • FileMetadata + Scores: heuristic scoring and PageRank analysis
  • Selection Algorithm: SimpleRouter decision tree
  • Budget Enforcement: token check and hard limits
  • Demotion Engine: AST chunking and signature extraction
  • Final Selection: selected files with metadata

Core Components

FileScorer

Computes multi-dimensional importance scores:

score = w_doc*doc + w_readme*readme + w_imp*imp_deg + w_path*path_depth^-1 +
        w_test*test_link + w_churn*churn + w_centrality*centrality +
        w_entrypoint*entrypoint + w_examples*examples + priority_boost

Each dimension is normalized to [0, 1] and combined with configurable weights.
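In sketch form, the combination is a plain weighted sum plus an optional priority boost. The function below is illustrative only, not the crate's actual API:

// Illustrative only: dimensions and weights as parallel slices, with each
// dimension already normalized to [0, 1] by the scorer.
fn combined_score(dims: &[f64], weights: &[f64], priority_boost: f64) -> f64 {
    debug_assert_eq!(dims.len(), weights.len());
    dims.iter().zip(weights).map(|(d, w)| d * w).sum::<f64>() + priority_boost
}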

SimpleRouter

Rule-based selection algorithm:

  1. Mandatory files: README, LICENSE, config files (always included)
  2. High priority: Files with score > threshold
  3. Budget-based: Fill remaining budget with highest-scoring files
  4. Test handling: Optional inclusion based on configuration
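The routing order can be sketched roughly as follows; Candidate and the thresholds are hypothetical stand-ins, not the crate's real types:

struct Candidate {
    score: f64,
    tokens: usize,
    is_test: bool,
    is_mandatory: bool,
}

// Rough sketch of the priority-based routing described above.
fn route(
    mut files: Vec<Candidate>,
    score_threshold: f64,
    token_budget: usize,
    include_tests: bool,
) -> Vec<Candidate> {
    // Highest-scoring files first: O(n log n).
    files.sort_by(|a, b| b.score.total_cmp(&a.score));

    let mut selected = Vec::new();
    let mut used_tokens = 0;
    for file in files {
        if file.is_test && !include_tests {
            continue; // 4. optional test handling
        }
        let keep = file.is_mandatory                          // 1. mandatory files
            || file.score > score_threshold                   // 2. high-priority files
            || used_tokens + file.tokens <= token_budget;     // 3. fill remaining budget
        if keep {
            used_tokens += file.tokens;
            selected.push(file);
        }
    }
    selected
}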

CoveringSetSelector

Targets specific code entities:

  • Entity search using tree-sitter AST parsing
  • Graph traversal for dependencies/dependents
  • Importance filtering to exclude noise
  • Reason tracking for transparency

DemotionEngine

Progressive content reduction:

  • Full → Chunk: Extract high-importance functions/classes via AST
  • Chunk → Signature: Keep only type definitions and interfaces
  • Quality scoring: Tracks information preservation ratio
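A hypothetical sketch of that progression (the enum mirrors the three levels above; the quality metric here is a simple token-retention ratio):

#[derive(Clone, Copy, PartialEq, Debug)]
enum Level { Full, Chunk, Signature }

// Step a file down one demotion level at a time.
fn demote(level: Level) -> Level {
    match level {
        Level::Full => Level::Chunk,                          // keep important functions/classes
        Level::Chunk | Level::Signature => Level::Signature,  // keep type signatures only
    }
}

// Information-preservation ratio tracked per file.
fn quality(original_tokens: usize, retained_tokens: usize) -> f64 {
    retained_tokens as f64 / original_tokens.max(1) as f64
}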

Usage

Basic File Selection

use scribe_selection::{Algorithm, SelectionConfig, Selector};

let config = SelectionConfig {
    algorithm: Algorithm::SimpleRouter,
    token_budget: 100_000,
    max_files: Some(200),
    exclude_tests: true,
    ..Default::default()
};

let selector = Selector::new(config);
let result = selector.select(files).await?;

println!("Selected {} files using {} tokens",
    result.selected.len(),
    result.total_tokens
);

Custom Scoring Weights

use scribe_selection::{ScoringWeights, SelectionConfig};

let weights = ScoringWeights {
    documentation: 0.3,
    centrality: 0.4,      // Emphasize graph importance
    test_linkage: 0.1,
    churn: 0.1,
    path_depth: 0.05,
    entrypoint: 0.05,
    examples: 0.0,        // remaining documented weight, de-emphasized for this use case
};

let config = SelectionConfig {
    scoring_weights: weights,
    ..Default::default()
};

let selector = Selector::new(config);

Covering Set for Entity

use scribe_selection::{CoveringSetConfig, EntityType};

let config = CoveringSetConfig {
    entity_name: "authenticate_user".to_string(),
    entity_type: EntityType::Function,
    max_files: 20,
    max_depth: Some(3),
    include_dependents: false,  // For understanding mode
    importance_threshold: 0.01,
};

let result = selector.select_covering_set(files, config).await?;

for (file, reason) in result.selected {
    println!("{}: {:?}", file.path.display(), reason);
}
// Output:
//   src/auth.rs: Target (contains function)
//   src/db.rs: DirectDependency (imported by auth.rs)
//   src/config.rs: TransitiveDependency (imported by db.rs, depth 2)

Progressive Demotion

use scribe_selection::{DemotionLevel, SelectionConfig, Selector};

let mut config = SelectionConfig::default();
config.demotion_enabled = true;
config.demotion_threshold = 0.9; // Start demoting at 90% of budget

let selector = Selector::new(config);
let result = selector.select_with_budget(files, 50_000).await?;

// Check demotion results
for file in &result.selected {
    match file.demotion_level {
        DemotionLevel::Full => println!("{}: full content", file.path.display()),
        DemotionLevel::Chunk => println!("{}: chunked to key sections", file.path.display()),
        DemotionLevel::Signature => println!("{}: signatures only", file.path.display()),
    }
}

println!("Compression ratio: {:.2}x", result.compression_ratio);
println!("Quality score: {:.2}%", result.quality_score * 100.0);

Impact Analysis Mode

use scribe_selection::{CoveringSetConfig, EntityType, InclusionReason};

let config = CoveringSetConfig {
    entity_name: "User".to_string(),
    entity_type: EntityType::Class,
    include_dependents: true,  // Find what depends on this class
    max_depth: Some(2),
    ..Default::default()
};

let result = selector.select_covering_set(files, config).await?;

println!("Changing User class affects {} files:", result.selected.len());
for (file, reason) in result.selected {
    if matches!(reason, InclusionReason::Dependent(_)) {
        println!("  {} will be impacted", file.path.display());
    }
}

Scoring Dimensions

Documentation Score (0-1)

  • README files: 1.0 (highest priority)
  • Design docs (DESIGN.md, ARCHITECTURE.md): 0.9
  • API documentation: 0.8
  • Code comments: Proportional to comment density
  • Docstrings: Bonus for well-documented code

PageRank Centrality (0-1)

  • Normalized PageRank score from dependency graph
  • Files imported by many others get high scores
  • Core utilities, config files typically score high
  • Isolated files score low
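In other words, raw PageRank values are rescaled into [0, 1]; a minimal sketch, assuming simple min-max normalization:

// Hypothetical min-max normalization of raw PageRank values.
fn normalize_centrality(raw: &[f64]) -> Vec<f64> {
    let (min, max) = raw
        .iter()
        .fold((f64::INFINITY, f64::NEG_INFINITY), |(lo, hi), &v| (lo.min(v), hi.max(v)));
    let span = (max - min).max(f64::EPSILON); // guard against a flat graph
    raw.iter().map(|&v| (v - min) / span).collect()
}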

Test Linkage (0-1)

  • Test files get score based on source file importance
  • Source files with tests get bonus score
  • Helps include relevant context for tested code
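A common way to derive that linkage is by file-name convention; the helper below is a hypothetical sketch, not the crate's real pairing logic:

// e.g. source_for_test("tests/auth_test.rs") == Some("src/auth.rs".to_string())
fn source_for_test(test_path: &str) -> Option<String> {
    let file = test_path.rsplit('/').next()?;
    let stem = file.strip_suffix(".rs")?;
    let source = stem
        .strip_suffix("_test")
        .or_else(|| stem.strip_prefix("test_"))?;
    Some(format!("src/{source}.rs"))
}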

Git Churn (0-1)

  • Change frequency: More commits = higher activity
  • Recency: Recent changes weighted higher
  • Combined score: churn = frequency * recency
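As a sketch of that combination (the 90-day half-life is an illustrative assumption, not the crate's actual constant):

// Hypothetical churn score: normalized commit frequency times a recency decay.
fn churn_score(commit_count: usize, days_since_last_change: f64, max_commits: usize) -> f64 {
    let frequency = commit_count as f64 / max_commits.max(1) as f64; // normalize to [0, 1]
    let recency = 0.5_f64.powf(days_since_last_change / 90.0);       // 90-day half-life (assumed)
    frequency * recency
}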

Path Depth (0-1)

  • Inverse of directory depth: 1 / (depth + 1)
  • Root-level files score highest
  • Deeply nested utility files score lower
  • Encourages including entry points and configs
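For example, treating depth as the number of directory separators (an assumption about how depth is counted):

// src/auth.rs → depth 1 → 0.5; a root-level Cargo.toml → depth 0 → 1.0
fn path_depth_score(path: &str) -> f64 {
    let depth = path.matches('/').count();
    1.0 / (depth as f64 + 1.0)
}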

Template Detection (penalty)

  • Auto-generated code detection
  • Boilerplate pattern matching
  • License headers, copyright notices
  • Generated API clients, scaffolds
  • Penalty: Reduces score by 50-80%
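A hypothetical sketch of such detection, using a marker-based heuristic and the strong end of the penalty range:

// Returns a multiplier applied to the file's combined score.
fn template_penalty(content: &str, path: &str) -> f64 {
    let markers = ["@generated", "DO NOT EDIT", "Code generated by"];
    let looks_generated = markers.iter().any(|m| content.contains(m))
        || path.contains("/generated/")
        || path.ends_with(".pb.rs");            // e.g. protobuf codegen output
    if looks_generated { 0.2 } else { 1.0 }     // 80% penalty in this sketch
}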

Performance

Targets

  • Selection time: O(n log n) for sorting, O(n) for selection
  • Small repos (≤1k files): <100ms selection time
  • Medium repos (1k-10k): <500ms selection time
  • Large repos (10k-100k): <2s selection time

Optimizations

  • Lazy scoring: Only compute scores for files that pass initial filters
  • Parallel scoring: Multi-threaded score computation using Rayon
  • Incremental demotion: Progressive content reduction as budget fills
  • Score caching: Cache expensive computations (centrality, churn)

Configuration

SelectionConfig

Field               Type            Default       Description
algorithm           Algorithm       SimpleRouter  Selection algorithm to use
token_budget        usize           100_000       Maximum tokens in bundle
max_files           Option<usize>   None          Maximum number of files
exclude_tests       bool            false         Exclude test files
scoring_weights     ScoringWeights  V2            Weight configuration
demotion_enabled    bool            true          Enable progressive demotion
demotion_threshold  f64             0.85          Budget fraction at which demotion starts
ScoringWeights

Field          Type  V1    V2    Description
documentation  f64   0.20  0.25  Documentation scoring weight
centrality     f64   0.20  0.30  PageRank centrality weight
test_linkage   f64   0.15  0.10  Test-source relationship weight
churn          f64   0.15  0.10  Git activity weight
path_depth     f64   0.15  0.10  Directory depth weight
entrypoint     f64   0.10  0.10  Entry point detection weight
examples       f64   0.05  0.05  Example code weight

Integration

scribe-selection is the core decision-making component used by:

  • scribe-scaling: Applies selection within budget constraints
  • scribe-webservice: Powers interactive file selection UI
  • CLI: Implements --algorithm, --token-budget, --max-files flags
  • scribe-graph: Uses centrality scores from graph analysis

See Also

  • scribe-graph: Provides PageRank centrality scores
  • scribe-scaling: Token budgeting and performance optimization
  • scribe-analysis: AST parsing for demotion and chunking
  • ../../WHY_SCRIBE.md: Context on intelligent selection philosophy