scribe-selection

Intelligent file selection and context extraction for Scribe repository bundles.

Overview

scribe-selection implements sophisticated algorithms for choosing which files to include in a repository bundle. Rather than naively including everything or requiring manual selection, it uses multi-dimensional scoring, graph centrality, and heuristic analysis to automatically identify the most important code for LLM understanding.

Key Features

Multi-Dimensional Scoring

  • Documentation score: Rewards READMEs, design docs, API documentation
  • PageRank centrality: Files that are heavily imported get higher scores
  • Test linkage: Identifies test files and their corresponding source files
  • Git churn: Integrates change frequency and recency signals
  • Path depth: Closer-to-root files prioritized (entry points, configs)
  • Template detection: Demotes boilerplate and auto-generated code
  • Entrypoint detection: Identifies main.rs, __init__.py, index.js
  • Example code scoring: Detects and prioritizes example/demo code

Selection Algorithms

Simple Router (Default)

Transparent rule-based decision tree for file selection:

  • Priority-based routing (mandatory → high priority → budget-based)
  • Deterministic and explainable decisions
  • Fast: O(n log n) for sorting, O(n) for selection
  • Replaced complex bandit algorithm with clearer logic

Covering Set Selection

Surgical selection targeting specific entities:

  • AST-based search for functions, classes, modules
  • Transitive dependency closure computation
  • Configurable depth limits and importance thresholds
  • Minimal file sets for understanding or impact analysis

Progressive Demotion

As the selection approaches the token budget, content is progressively reduced:

  • FULL: Complete file content (default)
  • CHUNK: AST-based semantic sections (important functions/classes preserved)
  • SIGNATURE: Type signatures and interfaces only (minimal information)

Achieves 3-10x compression while preserving critical context.

Configurable Weights

Two preset configurations:

  • V1: Balanced across all dimensions
  • V2: Higher weight on centrality and documentation

Custom weight tuning for specific use cases.

Architecture

FileMetadata + Scores → Selection Algorithm → Budget Enforcement → Demotion Engine → Final Selection

  • FileMetadata + Scores: heuristic scoring and PageRank analysis
  • Selection Algorithm: SimpleRouter decision tree
  • Budget Enforcement: token check and hard limits
  • Demotion Engine: AST chunking and signature extraction
  • Final Selection: selected files with metadata

Core Components

FileScorer

Computes multi-dimensional importance scores:

score = w_doc*doc + w_readme*readme + w_imp*imp_deg + w_path*path_depth^-1 +
        w_test*test_link + w_churn*churn + w_centrality*centrality +
        w_entrypoint*entrypoint + w_examples*examples + priority_boost

Each dimension is normalized to [0, 1] and combined with configurable weights.
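In sketch form, the combination is a plain weighted sum plus an optional priority boost. The function below is illustrative only, not the crate's actual API:

// Illustrative only: dimensions and weights as parallel slices, with each
// dimension already normalized to [0, 1] by the scorer.
fn combined_score(dims: &[f64], weights: &[f64], priority_boost: f64) -> f64 {
    debug_assert_eq!(dims.len(), weights.len());
    dims.iter().zip(weights).map(|(d, w)| d * w).sum::<f64>() + priority_boost
}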

SimpleRouter

Rule-based selection algorithm:

  1. Mandatory files: README, LICENSE, config files (always included)
  2. High priority: Files with score > threshold
  3. Budget-based: Fill remaining budget with highest-scoring files
  4. Test handling: Optional inclusion based on configuration
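The routing order can be sketched roughly as follows; Candidate and the thresholds are hypothetical stand-ins, not the crate's real types:

struct Candidate {
    score: f64,
    tokens: usize,
    is_test: bool,
    is_mandatory: bool,
}

// Rough sketch of the priority-based routing described above.
fn route(
    mut files: Vec<Candidate>,
    score_threshold: f64,
    token_budget: usize,
    include_tests: bool,
) -> Vec<Candidate> {
    // Highest-scoring files first: O(n log n).
    files.sort_by(|a, b| b.score.total_cmp(&a.score));

    let mut selected = Vec::new();
    let mut used_tokens = 0;
    for file in files {
        if file.is_test && !include_tests {
            continue; // 4. optional test handling
        }
        let keep = file.is_mandatory                          // 1. mandatory files
            || file.score > score_threshold                   // 2. high-priority files
            || used_tokens + file.tokens <= token_budget;     // 3. fill remaining budget
        if keep {
            used_tokens += file.tokens;
            selected.push(file);
        }
    }
    selected
}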

CoveringSetSelector

Targets specific code entities:

  • Entity search using tree-sitter AST parsing
  • Graph traversal for dependencies/dependents
  • Importance filtering to exclude noise
  • Reason tracking for transparency

DemotionEngine

Progressive content reduction:

  • Full → Chunk: Extract high-importance functions/classes via AST
  • Chunk → Signature: Keep only type definitions and interfaces
  • Quality scoring: Tracks information preservation ratio
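A hypothetical sketch of that progression (the enum mirrors the three levels above; the quality metric here is a simple token-retention ratio):

#[derive(Clone, Copy, PartialEq, Debug)]
enum Level { Full, Chunk, Signature }

// Step a file down one demotion level at a time.
fn demote(level: Level) -> Level {
    match level {
        Level::Full => Level::Chunk,                          // keep important functions/classes
        Level::Chunk | Level::Signature => Level::Signature,  // keep type signatures only
    }
}

// Information-preservation ratio tracked per file.
fn quality(original_tokens: usize, retained_tokens: usize) -> f64 {
    retained_tokens as f64 / original_tokens.max(1) as f64
}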

Usage

Basic File Selection

use scribe_selection::{Algorithm, SelectionConfig, Selector};

let config = SelectionConfig {
    algorithm: Algorithm::SimpleRouter,
    token_budget: 100_000,
    max_files: Some(200),
    exclude_tests: true,
    ..Default::default()
};

let selector = Selector::new(config);
let result = selector.select(files).await?;

println!("Selected {} files using {} tokens",
    result.selected.len(),
    result.total_tokens
);

Custom Scoring Weights

use scribe_selection::{ScoringWeights, SelectionConfig};

let weights = ScoringWeights {
    documentation: 0.3,
    centrality: 0.4,      // Emphasize graph importance
    test_linkage: 0.1,
    churn: 0.1,
    path_depth: 0.05,
    entrypoint: 0.05,
    examples: 0.0,        // remaining documented weight, de-emphasized for this use case
};

let config = SelectionConfig {
    scoring_weights: weights,
    ..Default::default()
};

let selector = Selector::new(config);

Covering Set for Entity

use scribe_selection::{CoveringSetConfig, EntityType};

let config = CoveringSetConfig {
    entity_name: "authenticate_user".to_string(),
    entity_type: EntityType::Function,
    max_files: 20,
    max_depth: Some(3),
    include_dependents: false,  // For understanding mode
    importance_threshold: 0.01,
};

let result = selector.select_covering_set(files, config).await?;

for (file, reason) in result.selected {
    println!("{}: {:?}", file.path.display(), reason);
}
// Output:
//   src/auth.rs: Target (contains function)
//   src/db.rs: DirectDependency (imported by auth.rs)
//   src/config.rs: TransitiveDependency (imported by db.rs, depth 2)

Progressive Demotion

use scribe_selection::{DemotionLevel, SelectionConfig, Selector};

let mut config = SelectionConfig::default();
config.demotion_enabled = true;
config.demotion_threshold = 0.9; // Start demoting at 90% of budget

let selector = Selector::new(config);
let result = selector.select_with_budget(files, 50_000).await?;

// Check demotion results
for file in &result.selected {
    match file.demotion_level {
        DemotionLevel::Full => println!("{}: full content", file.path.display()),
        DemotionLevel::Chunk => println!("{}: chunked to key sections", file.path.display()),
        DemotionLevel::Signature => println!("{}: signatures only", file.path.display()),
    }
}

println!("Compression ratio: {:.2}x", result.compression_ratio);
println!("Quality score: {:.2}%", result.quality_score * 100.0);

Impact Analysis Mode

use scribe_selection::{CoveringSetConfig, EntityType, InclusionReason};

let config = CoveringSetConfig {
    entity_name: "User".to_string(),
    entity_type: EntityType::Class,
    include_dependents: true,  // Find what depends on this class
    max_depth: Some(2),
    ..Default::default()
};

let result = selector.select_covering_set(files, config).await?;

println!("Changing User class affects {} files:", result.selected.len());
for (file, reason) in result.selected {
    if matches!(reason, InclusionReason::Dependent(_)) {
        println!("  {} will be impacted", file.path.display());
    }
}

Scoring Dimensions

Documentation Score (0-1)

  • README files: 1.0 (highest priority)
  • Design docs (DESIGN.md, ARCHITECTURE.md): 0.9
  • API documentation: 0.8
  • Code comments: Proportional to comment density
  • Docstrings: Bonus for well-documented code

PageRank Centrality (0-1)

  • Normalized PageRank score from dependency graph
  • Files imported by many others get high scores
  • Core utilities, config files typically score high
  • Isolated files score low
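In other words, raw PageRank values are rescaled into [0, 1]; a minimal sketch, assuming simple min-max normalization:

// Hypothetical min-max normalization of raw PageRank values.
fn normalize_centrality(raw: &[f64]) -> Vec<f64> {
    let (min, max) = raw
        .iter()
        .fold((f64::INFINITY, f64::NEG_INFINITY), |(lo, hi), &v| (lo.min(v), hi.max(v)));
    let span = (max - min).max(f64::EPSILON); // guard against a flat graph
    raw.iter().map(|&v| (v - min) / span).collect()
}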

Test Linkage (0-1)

  • Test files get score based on source file importance
  • Source files with tests get bonus score
  • Helps include relevant context for tested code
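A common way to derive that linkage is by file-name convention; the helper below is a hypothetical sketch, not the crate's real pairing logic:

// e.g. source_for_test("tests/auth_test.rs") == Some("src/auth.rs".to_string())
fn source_for_test(test_path: &str) -> Option<String> {
    let file = test_path.rsplit('/').next()?;
    let stem = file.strip_suffix(".rs")?;
    let source = stem
        .strip_suffix("_test")
        .or_else(|| stem.strip_prefix("test_"))?;
    Some(format!("src/{source}.rs"))
}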

Git Churn (0-1)

  • Change frequency: More commits = higher activity
  • Recency: Recent changes weighted higher
  • Combined score: churn = frequency * recency
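As a sketch of that combination (the 90-day half-life is an illustrative assumption, not the crate's actual constant):

// Hypothetical churn score: normalized commit frequency times a recency decay.
fn churn_score(commit_count: usize, days_since_last_change: f64, max_commits: usize) -> f64 {
    let frequency = commit_count as f64 / max_commits.max(1) as f64; // normalize to [0, 1]
    let recency = 0.5_f64.powf(days_since_last_change / 90.0);       // 90-day half-life (assumed)
    frequency * recency
}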

Path Depth (0-1)

  • Inverse of directory depth: 1 / (depth + 1)
  • Root-level files score highest
  • Deeply nested utility files score lower
  • Encourages including entry points and configs
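For example, treating depth as the number of directory separators (an assumption about how depth is counted):

// src/auth.rs → depth 1 → 0.5; a root-level Cargo.toml → depth 0 → 1.0
fn path_depth_score(path: &str) -> f64 {
    let depth = path.matches('/').count();
    1.0 / (depth as f64 + 1.0)
}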

Template Detection (penalty)

  • Auto-generated code detection
  • Boilerplate pattern matching
  • License headers, copyright notices
  • Generated API clients, scaffolds
  • Penalty: Reduces score by 50-80%
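A hypothetical sketch of such detection, using a marker-based heuristic and the strong end of the penalty range:

// Returns a multiplier applied to the file's combined score.
fn template_penalty(content: &str, path: &str) -> f64 {
    let markers = ["@generated", "DO NOT EDIT", "Code generated by"];
    let looks_generated = markers.iter().any(|m| content.contains(m))
        || path.contains("/generated/")
        || path.ends_with(".pb.rs");            // e.g. protobuf codegen output
    if looks_generated { 0.2 } else { 1.0 }     // 80% penalty in this sketch
}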

Performance

Targets

  • Selection time: O(n log n) for sorting, O(n) for selection
  • Small repos (≤1k files): <100ms selection time
  • Medium repos (1k-10k): <500ms selection time
  • Large repos (10k-100k): <2s selection time

Optimizations

  • Lazy scoring: Only compute scores for files that pass initial filters
  • Parallel scoring: Multi-threaded score computation using Rayon
  • Incremental demotion: Progressive content reduction as budget fills
  • Score caching: Cache expensive computations (centrality, churn)

Configuration

SelectionConfig

Field               Type            Default       Description
algorithm           Algorithm       SimpleRouter  Selection algorithm to use
token_budget        usize           100_000       Maximum tokens in bundle
max_files           Option<usize>   None          Maximum number of files
exclude_tests       bool            false         Exclude test files
scoring_weights     ScoringWeights  V2            Weight configuration
demotion_enabled    bool            true          Enable progressive demotion
demotion_threshold  f64             0.85          Budget fraction at which demotion starts
ScoringWeights

Field          Type  V1    V2    Description
documentation  f64   0.20  0.25  Documentation scoring weight
centrality     f64   0.20  0.30  PageRank centrality weight
test_linkage   f64   0.15  0.10  Test-source relationship weight
churn          f64   0.15  0.10  Git activity weight
path_depth     f64   0.15  0.10  Directory depth weight
entrypoint     f64   0.10  0.10  Entry point detection weight
examples       f64   0.05  0.05  Example code weight

Integration

scribe-selection is the core decision-making component used by:

  • scribe-scaling: Applies selection within budget constraints
  • scribe-webservice: Powers interactive file selection UI
  • CLI: Implements --algorithm, --token-budget, --max-files flags
  • scribe-graph: Uses centrality scores from graph analysis

See Also

  • scribe-graph: Provides PageRank centrality scores
  • scribe-scaling: Token budgeting and performance optimization
  • scribe-analysis: AST parsing for demotion and chunking
  • ../../WHY_SCRIBE.md: Context on intelligent selection philosophy