verificar
Synthetic Data Factory for Domain-Specific Code Intelligence
Verificar is a unified combinatorial test generation and synthetic data factory for PAIML transpiler projects (depyler, bashrs, ruchy, decy). It generates verified (source, target, correctness) tuples at scale, creating training data for domain-specific code intelligence models.
Features
- Multi-Language Support: Generate test programs in Python, Bash, C, and Ruchy
- Combinatorial Generation: Exhaustive enumeration of valid programs up to configurable depth
- Mutation Testing: AST-level mutation operators (AOR, ROR, LOR, BSR, etc.)
- Verification Oracle: Sandboxed execution with I/O diffing for correctness verification
- ML Pipeline: Bug prediction models and embeddings for code intelligence
- Parquet Output: Efficient columnar storage for large-scale data processing
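To make the mutation-testing feature concrete, here is a minimal sketch of an AOR (arithmetic operator replacement) mutator over a toy expression tree. The `Expr` and `Op` types and the `aor_mutants` function are hypothetical names invented for this illustration; they are not verificar's actual API, and a real mutator would work on each language's full AST.

```rust
// Hypothetical mini-AST for illustration only; not verificar's real types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Num(i64),
    Bin(Box<Expr>, Op, Box<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Add,
    Sub,
    Mul,
}

const ARITH_OPS: [Op; 3] = [Op::Add, Op::Sub, Op::Mul];

/// Return every mutant that differs from `expr` in exactly one arithmetic operator.
fn aor_mutants(expr: &Expr) -> Vec<Expr> {
    match expr {
        // Literals contain no operator, so they yield no mutants.
        Expr::Num(_) => Vec::new(),
        Expr::Bin(lhs, op, rhs) => {
            let mut out = Vec::new();
            // Swap the operator at this node for each alternative.
            for &alt in ARITH_OPS.iter().filter(|&&o| o != *op) {
                out.push(Expr::Bin(lhs.clone(), alt, rhs.clone()));
            }
            // Keep this node intact and mutate inside the left subtree.
            for m in aor_mutants(lhs) {
                out.push(Expr::Bin(Box::new(m), *op, rhs.clone()));
            }
            // Likewise for the right subtree.
            for m in aor_mutants(rhs) {
                out.push(Expr::Bin(lhs.clone(), *op, Box::new(m)));
            }
            out
        }
    }
}
```

Each mutant changes a single operator, so an oracle that flags a mutant as incorrect can attribute the failure to that one change. ROR, LOR, and the other operators would follow the same pattern with different operator sets.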
Architecture
┌─────────────────────────────────────────────────────────────┐
│ VERIFICAR CORE │
├─────────────────────────────────────────────────────────────┤
│ Grammar → Generator → Mutator → Oracle │
│ Definitions Engine Engine Verification │
└─────────────────────────────────────────────────────────────┘
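The last stage of the pipeline above can be sketched as follows: each generated source program is transpiled, both versions are run, and the oracle compares their observable output to produce a (source, target, correctness) tuple. All names here (`RunOutput`, `VerifiedTuple`, `io_matches`, `verify_all`) are assumptions made for this illustration, and the transpiler and sandboxed runner are passed in as closures rather than implemented.

```rust
/// Observable behaviour of one program run (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct RunOutput {
    stdout: String,
    exit_code: i32,
}

/// One record of the generated dataset (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct VerifiedTuple {
    source: String,
    target: String,
    correct: bool,
}

/// I/O diffing: two runs agree when stdout and exit code are identical.
fn io_matches(a: &RunOutput, b: &RunOutput) -> bool {
    a == b
}

/// Transpile every source, run both versions, and record the verdict.
/// `transpile` and `run` stand in for the real transpiler and sandbox.
fn verify_all<T, R>(sources: &[String], transpile: T, run: R) -> Vec<VerifiedTuple>
where
    T: Fn(&str) -> String,
    R: Fn(&str) -> RunOutput,
{
    sources
        .iter()
        .map(|src| {
            let target = transpile(src.as_str());
            let correct = io_matches(&run(src.as_str()), &run(target.as_str()));
            VerifiedTuple {
                source: src.clone(),
                target,
                correct,
            }
        })
        .collect()
}
```

Passing the runner as a closure keeps the comparison logic testable without a sandbox. In practice the source and target are in different languages, so `run` would need to dispatch on language.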
Installation
Add to your Cargo.toml:
[dependencies]
verificar = "0.3"
Or with optional features:
[dependencies]
verificar = { version = "0.3", features = ["parquet", "ml"] }
Quick Start
Library Usage
use ;
use Language;
// Create a generator for Python
let generator = new;
// Generate test cases using coverage-guided sampling
let strategy = CoverageGuided ;
let test_cases = generator.generate;
CLI Usage
# Generate Python test programs
# Generate with specific sampling strategy
# Generate depyler-specific patterns
Supported Languages
| Language | Generator | Description |
|---|---|---|
| Python | PythonEnumerator | Functions, control flow, type hints |
| Bash | BashEnumerator | 1000+ patterns: variables, pipes, conditionals |
| C | CEnumerator | Functions, pointers, memory operations |
| Ruchy | RuchyEnumerator | Custom DSL programs |
Sampling Strategies
- Exhaustive: Enumerate all programs up to depth N
- CoverageGuided: Prioritize unexplored AST paths (NAUTILUS-style)
- Swarm: Random feature subsets per batch
- Boundary: Edge values emphasized (0, -1, MAX_INT, empty collections)
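As a rough illustration of how the Exhaustive and Boundary strategies can combine, the sketch below enumerates every expression of a toy grammar up to a given depth, using boundary values as the only literals. The grammar, the `enumerate` function, and the `BOUNDARY_LEAVES` constant are hypothetical and much simpler than a real language enumerator; mapping MAX_INT to `i64::MAX` is also an assumption.

```rust
// Toy grammar for illustration: literals, negation, and addition.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Lit(i64),
    Neg(Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
}

/// Boundary-style literals (assumption: MAX_INT is taken to be i64::MAX).
const BOUNDARY_LEAVES: [i64; 3] = [0, -1, i64::MAX];

/// Enumerate every expression whose tree depth is at most `depth`.
/// Depth 1 is a bare literal; depth 0 yields nothing.
fn enumerate(depth: usize, leaves: &[i64]) -> Vec<Expr> {
    if depth == 0 {
        return Vec::new();
    }
    // Literals are valid at every depth >= 1.
    let mut out: Vec<Expr> = leaves.iter().map(|&v| Expr::Lit(v)).collect();
    // Composite nodes are built from all strictly shallower expressions.
    let smaller = enumerate(depth - 1, leaves);
    for a in &smaller {
        out.push(Expr::Neg(Box::new(a.clone())));
        for b in &smaller {
            out.push(Expr::Add(Box::new(a.clone()), Box::new(b.clone())));
        }
    }
    out
}
```

With two leaves this yields 8 expressions at depth 2 and 74 at depth 3. That growth is why exhaustive enumeration is bounded by a configurable depth and why the sampling-based strategies exist for deeper programs.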
Generation Priority
Based on organizational intelligence analysis of 1,296 defect-fix commits:
| Priority | Category | Allocation | Rationale |
|---|---|---|---|
| P0 | ASTTransform | 50% | Universal dominant defect (40-62%) |
| P1 | OwnershipBorrow | 20% | Rust-specific (15-20%) |
| P2 | StdlibMapping | 15% | API translation errors |
| P3 | Language-specific | 15% | bashrs security, decy memory, etc. |
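The allocation column can be read as a budget split. The sketch below divides a total number of generated cases across the four categories using the percentages from the table. The `split_budget` helper is hypothetical, and giving the integer-rounding remainder to the P0 category is a design choice made for this example, not documented behaviour.

```rust
/// Percent allocation per defect category, copied from the table above.
const ALLOCATION: [(&str, u64); 4] = [
    ("ASTTransform", 50),
    ("OwnershipBorrow", 20),
    ("StdlibMapping", 15),
    ("Language-specific", 15),
];

/// Split `total` generated cases across categories (hypothetical helper).
fn split_budget(total: u64) -> Vec<(&'static str, u64)> {
    // Integer division rounds each share down.
    let mut shares: Vec<(&'static str, u64)> = ALLOCATION
        .iter()
        .map(|&(name, pct)| (name, total * pct / 100))
        .collect();
    // Assumption: any rounding remainder goes to the highest-priority category.
    let assigned: u64 = shares.iter().map(|&(_, n)| n).sum();
    shares[0].1 += total - assigned;
    shares
}
```

For a budget of 1000 cases this gives 500, 200, 150, and 150. For a budget of 10 the rounded shares are 5, 2, 1, and 1, and the one leftover case goes to ASTTransform.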
Feature Flags
| Feature | Description |
|---|---|
| parquet | Enable Parquet data output |
| ml | Enable ML pipeline (aprender integration) |
| tree-sitter | Use tree-sitter for grammar parsing |
| pest | Use pest for PEG grammars |
| full | Enable all features |
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! Please read CLAUDE.md for development guidelines.