# CODEX: Multi-Tech Python-to-Rust Training Data Specification
**Version:** 1.0.0
**Last Updated:** 2025-11-26
**Status:** Draft
**Project:** verificar + aprender integration
---
## Executive Summary
**CODEX** (COde Data EXtraction) is a unified pipeline combining **aprender** (AutoML) with **verificar** (synthetic data factory) to generate high-quality training data for Python-to-Rust transpilation. The pipeline uses ML-driven filtering, adaptive generation, and active learning to maximize training data utility while minimizing verification oracle costs.
---
## Problem Statement
Current transpiler training approaches face key challenges:
1. **Verification Bottleneck**: Oracle execution (sandbox Python + Rust) is expensive (~100ms/sample)
2. **Low Signal-to-Noise**: Most generated programs are trivial or redundant
3. **Distribution Mismatch**: Uniform random sampling doesn't match real-world bug distribution
4. **Sparse Feedback**: Binary correctness labels waste gradient information
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               CODEX PIPELINE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │  verificar  │    │  aprender   │    │  verificar  │    │  aprender   │   │
│  │  Generator  │───▶│   Filter    │───▶│   Oracle    │───▶│   Labeler   │   │
│  │             │    │  (Quality)  │    │             │    │   (Rich)    │   │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘   │
│         │                  │                  │                  │          │
│         │                  │                  │                  │          │
│         ▼                  ▼                  ▼                  ▼          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │  Candidate  │    │  High-Value │    │  (source,   │    │  Training   │   │
│  │  Programs   │    │   Subset    │    │   target,   │    │   Dataset   │   │
│  │             │    │             │    │   verdict)  │    │  .parquet   │   │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘   │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                   FEEDBACK LOOP (Active Learning)                    │   │
│  │  aprender::GradientBoosting predicts informative samples → Generator │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Component Integration
### 1. Quality-Gated Generation (Pre-Oracle Filter)
**Goal:** Reduce oracle calls by 10x via ML-based quality prediction.
```rust
use aprender::tree::RandomForestClassifier;
use verificar::generator::{Generator, SamplingStrategy};
use verificar::data::CodeFeatures;

/// Quality gate using aprender RandomForest
pub struct QualityGate {
    model: RandomForestClassifier,
    threshold: f32,
}

impl QualityGate {
    /// Train on historical (features, oracle_passed) pairs
    pub fn train(examples: &[(CodeFeatures, bool)]) -> Self {
        let mut model = RandomForestClassifier::new()
            .with_n_estimators(100)
            .with_max_depth(Some(10));
        let (features, labels) = examples_to_matrix(examples);
        model.fit(&features, &labels).unwrap();
        Self { model, threshold: 0.7 }
    }

    /// Predict if sample is worth verifying
    pub fn should_verify(&self, features: &CodeFeatures) -> bool {
        let x = features_to_row(features);
        self.model.predict_proba(&x)[1] > self.threshold
    }
}
```
**Features for quality prediction:**
- `ast_depth`: Deeper = more interesting
- `num_operators`: Mathematical complexity
- `num_control_flow`: Branching logic
- `cyclomatic_complexity`: Path diversity
- `uses_edge_values`: Boundary conditions (0, -1, empty)
- `type_coercion_count`: Python→Rust type mapping challenges
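The `features_to_row` helper used by `QualityGate` is not defined in this document. A minimal sketch of what it could look like follows, assuming a hypothetical `CodeFeatures` layout whose fields mirror the feature list above; the field types and the column order are illustrative assumptions, not the actual `verificar::data::CodeFeatures` definition.

```rust
/// Hypothetical stand-in for `verificar::data::CodeFeatures`.
/// Field names follow the feature list; the types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct CodeFeatures {
    pub ast_depth: u32,
    pub num_operators: u32,
    pub num_control_flow: u32,
    pub cyclomatic_complexity: u32,
    pub uses_edge_values: bool,
    pub type_coercion_count: u32,
}

/// Number of columns in the flattened feature row.
pub const N_FEATURES: usize = 6;

/// Flatten one `CodeFeatures` value into a fixed-order row of `f32`s.
/// The boolean flag is encoded as 0.0 / 1.0.
pub fn features_to_row(f: &CodeFeatures) -> [f32; N_FEATURES] {
    [
        f.ast_depth as f32,
        f.num_operators as f32,
        f.num_control_flow as f32,
        f.cyclomatic_complexity as f32,
        if f.uses_edge_values { 1.0 } else { 0.0 },
        f.type_coercion_count as f32,
    ]
}
```

Keeping the column order fixed in one place matters because the trained model only sees positions, not names; training and inference must flatten features identically.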
### 2. Bug Prediction Model (Defect Likelihood)
**Goal:** Prioritize samples likely to reveal transpiler bugs.
```rust
use aprender::tree::GradientBoostingClassifier;
use verificar::data::CodeFeatures;
use verificar::ml::CommitFeatures;

/// Bug predictor trained on historical defect-fix commits
pub struct BugPredictor {
    model: GradientBoostingClassifier,
}

impl BugPredictor {
    /// Train on defect-fix commit features from 1,296 PAIML commits
    pub fn train(commits: &[CommitFeatures], labels: &[bool]) -> Self {
        let mut model = GradientBoostingClassifier::new()
            .with_n_estimators(200)
            .with_learning_rate(0.1)
            .with_max_depth(5);
        let features = commits_to_matrix(commits);
        model.fit(&features, &labels).unwrap();
        Self { model }
    }

    /// Predict probability of triggering a bug
    pub fn predict_bug_prob(&self, code: &str, features: &CodeFeatures) -> f32 {
        let x = extract_features(code, features);
        self.model.predict_proba(&x)[1]
    }
}
```
**Defect category allocation (from PAIML intelligence):**
| Priority | Category | Allocation | Weight |
|----------|----------|------------|--------|
| P0 | ASTTransform | 50% | 2.0x |
| P1 | OwnershipBorrow | 20% | 1.5x |
| P2 | StdlibMapping | 15% | 1.2x |
| P3 | LanguageSpecific | 15% | 1.0x |
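One way to apply the table above is sketched here. It reads the percentage column as each category's share of a sample budget and the multiplier column as a factor applied to the predicted bug probability; both readings are assumptions, as are the `DefectCategory` enum and the `allocate` and `weighted_priority` helpers, none of which are defined in this document.

```rust
/// Hypothetical enum for the four defect categories in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DefectCategory {
    AstTransform,
    OwnershipBorrow,
    StdlibMapping,
    LanguageSpecific,
}

impl DefectCategory {
    /// Categories in priority order (P0 first).
    pub const ALL: [DefectCategory; 4] = [
        DefectCategory::AstTransform,
        DefectCategory::OwnershipBorrow,
        DefectCategory::StdlibMapping,
        DefectCategory::LanguageSpecific,
    ];

    /// Share of the budget, in percent (assumed meaning of the "50%" column).
    pub fn allocation_percent(self) -> usize {
        match self {
            DefectCategory::AstTransform => 50,
            DefectCategory::OwnershipBorrow => 20,
            DefectCategory::StdlibMapping => 15,
            DefectCategory::LanguageSpecific => 15,
        }
    }

    /// Priority multiplier (assumed meaning of the "2.0x" column).
    pub fn weight(self) -> f32 {
        match self {
            DefectCategory::AstTransform => 2.0,
            DefectCategory::OwnershipBorrow => 1.5,
            DefectCategory::StdlibMapping => 1.2,
            DefectCategory::LanguageSpecific => 1.0,
        }
    }
}

/// Split `total` samples across categories by percentage.
/// Shares are rounded down; any remainder goes to the P0 category.
pub fn allocate(total: usize) -> [(DefectCategory, usize); 4] {
    let mut out = DefectCategory::ALL.map(|c| (c, total * c.allocation_percent() / 100));
    let assigned: usize = out.iter().map(|(_, n)| *n).sum();
    out[0].1 += total - assigned;
    out
}

/// Scale a predicted bug probability by the category weight.
pub fn weighted_priority(bug_prob: f32, category: DefectCategory) -> f32 {
    bug_prob * category.weight()
}
```

With a budget of 20 samples this yields 10, 4, 3, and 3 samples for P0 through P3. The weighted priority is a ranking score, not a probability, since it can exceed 1.0.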
### 3. Adaptive Generation (Active Learning)
**Goal:** Dynamically adjust sampling strategy based on oracle feedback.
```rust
use aprender::cluster::KMeans;
use verificar::generator::SamplingStrategy;

/// Active learner that prioritizes unexplored regions
pub struct ActiveSampler {
    /// Embedding model for code representation
    embedder: CodeEmbedder,
    /// K-means clusters of verified samples
    clusters: KMeans,
    /// Per-cluster success rates
    cluster_stats: Vec<ClusterStats>,
}

impl ActiveSampler {
    /// UCB (upper confidence bound): favor clusters with high reward or high uncertainty
    pub fn sample_strategy(&self) -> SamplingStrategy {
        let ucb_scores: Vec<f32> = self.cluster_stats
            .iter()
            .map(|s| s.mean_reward + 2.0 * s.uncertainty())
            .collect();
        let target_cluster = argmax(&ucb_scores);
        let centroid = self.clusters.centroids()[target_cluster].clone();
        SamplingStrategy::Targeted {
            feature_bias: centroid_to_features(&centroid),
            exploration_epsilon: 0.1,
        }
    }

    /// Update with oracle feedback
    pub fn update(&mut self, sample: &CodeSample, passed: bool) {
        let embedding = self.embedder.embed(&sample.source);
        let cluster_id = self.clusters.predict(&embedding);
        self.cluster_stats[cluster_id].update(passed);
    }
}
```
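`ClusterStats` and `argmax` are used by `ActiveSampler` but not defined here. The sketch below shows one possible shape for them. Everything in it is an assumption: the reward is taken to be the fraction of oracle runs that failed (that is, revealed a bug), the uncertainty term is `1 / sqrt(trials + 1)`, and `mean_reward` is a method rather than the field used above.

```rust
/// Hypothetical per-cluster statistics for UCB-style selection.
#[derive(Debug, Clone, Default)]
pub struct ClusterStats {
    pub trials: u32,
    pub failures: u32,
}

impl ClusterStats {
    /// Record one oracle verdict for this cluster.
    pub fn update(&mut self, passed: bool) {
        self.trials += 1;
        if !passed {
            self.failures += 1;
        }
    }

    /// Fraction of verified samples that failed (assumed reward definition).
    pub fn mean_reward(&self) -> f32 {
        if self.trials == 0 {
            0.0
        } else {
            self.failures as f32 / self.trials as f32
        }
    }

    /// Shrinks as the cluster is sampled more; 1.0 for an untouched cluster.
    pub fn uncertainty(&self) -> f32 {
        1.0 / (self.trials as f32 + 1.0).sqrt()
    }

    /// Upper-confidence-bound score with exploration coefficient `c`.
    pub fn ucb(&self, c: f32) -> f32 {
        self.mean_reward() + c * self.uncertainty()
    }
}

/// Index of the largest score, or `None` for an empty slice.
/// `total_cmp` gives a total order, so NaN values cannot cause a panic.
pub fn argmax(scores: &[f32]) -> Option<usize> {
    scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

With the coefficient 2.0, an untouched cluster scores 2.0 and outranks any cluster whose score has settled below that, which is what drives exploration of unexplored regions.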
### 4. Rich Label Generation (Beyond Binary)
**Goal:** Extract maximum signal from each oracle invocation.
```rust
use aprender::linear_model::LinearRegression;
use verificar::oracle::ExecutionResult;

/// Multi-task labeler extracting rich supervision
pub struct RichLabeler;

impl RichLabeler {
    /// Generate rich labels from oracle execution
    pub fn label(
        source: &str,
        target: &str,
        source_result: &ExecutionResult,
        target_result: &ExecutionResult,
    ) -> RichLabels {
        RichLabels {
            // Binary correctness
            correct: source_result.output == target_result.output,
            // Output similarity via Jaccard overlap (for soft labels)
            output_similarity: jaccard_similarity(
                &source_result.output,
                &target_result.output,
            ),
            // Performance ratio (for distillation); the 1 ms floor avoids division by zero
            runtime_ratio: target_result.duration.as_secs_f32()
                / source_result.duration.as_secs_f32().max(0.001),
            // Error category (for multi-class)
            error_category: classify_error(source_result, target_result),
            // AST diff features (for localization)
            ast_diff: compute_ast_diff(source, target),
        }
    }
}

#[derive(Debug, Clone)]
pub struct RichLabels {
    pub correct: bool,
    pub output_similarity: f32,
    pub runtime_ratio: f32,
    pub error_category: ErrorCategory,
    pub ast_diff: AstDiff,
}
```
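The `jaccard_similarity` helper called by `RichLabeler::label` is left undefined. A minimal sketch follows, assuming program output is captured as text and compared as sets of whitespace-separated tokens; the tokenization and the handling of two empty outputs are assumptions, not part of the specification.

```rust
use std::collections::HashSet;

/// Jaccard similarity |A ∩ B| / |A ∪ B| over whitespace-separated tokens.
/// Two empty outputs are treated as identical (similarity 1.0).
pub fn jaccard_similarity(a: &str, b: &str) -> f32 {
    let tokens_a: HashSet<&str> = a.split_whitespace().collect();
    let tokens_b: HashSet<&str> = b.split_whitespace().collect();
    if tokens_a.is_empty() && tokens_b.is_empty() {
        return 1.0;
    }
    let intersection = tokens_a.intersection(&tokens_b).count() as f32;
    let union = tokens_a.union(&tokens_b).count() as f32;
    intersection / union
}
```

Because it works on sets, this measure ignores token order and repetition, so outputs `1 2 3` and `3 2 1` score 1.0 even though `correct` would be false. That gap is the point of a soft label, but an order-sensitive metric may be a better fit for some programs.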
### 5. Data Quality Scoring Pipeline
**Goal:** Rank training examples by informativeness.
```rust
use aprender::decomposition::PCA;
use aprender::metrics::silhouette_score;

/// Score training examples by quality
pub struct DataQualityScorer {
    pca: PCA,
    reference_embeddings: Matrix,
}

impl DataQualityScorer {
    /// Score a training example
    pub fn score(&self, example: &TrainingExample) -> QualityScore {
        let embedding = self.embed(example);
        QualityScore {
            // Novelty: distance from existing examples
            novelty: self.min_distance_to_reference(&embedding),
            // Diversity: contribution to overall variance
            diversity: self.variance_contribution(&embedding),
            // Difficulty: model uncertainty
            difficulty: self.predict_difficulty(example),
            // Coverage: AST node types covered
            coverage: self.ast_coverage(example),
        }
    }

    /// Filter to top-k most informative examples
    pub fn select_top_k(&self, examples: &[TrainingExample], k: usize) -> Vec<TrainingExample> {
        let mut scored: Vec<_> = examples
            .iter()
            .map(|e| (e.clone(), self.score(e).composite()))
            .collect();
        scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        scored.into_iter().take(k).map(|(e, _)| e).collect()
    }
}
```
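Two pieces of the scorer are easy to illustrate in isolation: the novelty term and the `composite()` score. The sketch below assumes plain `f32` vectors for embeddings, Euclidean distance for novelty, and an unweighted mean for the composite. The specification does not fix any of these choices, so treat them as placeholders.

```rust
/// Hypothetical score record matching the four components above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct QualityScore {
    pub novelty: f32,
    pub diversity: f32,
    pub difficulty: f32,
    pub coverage: f32,
}

impl QualityScore {
    /// Unweighted mean of the four components (the weighting is an assumption).
    pub fn composite(&self) -> f32 {
        (self.novelty + self.diversity + self.difficulty + self.coverage) / 4.0
    }
}

/// Euclidean distance between two embeddings of equal length.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f32>()
        .sqrt()
}

/// Novelty: distance to the nearest reference embedding.
/// Returns infinity when there are no reference embeddings yet.
pub fn min_distance_to_reference(embedding: &[f32], reference: &[Vec<f32>]) -> f32 {
    reference
        .iter()
        .map(|r| euclidean(embedding, r))
        .fold(f32::INFINITY, f32::min)
}
```

An unweighted mean only makes sense if the four components share a scale, so in practice novelty would likely need normalizing (for example to the range 0 to 1) before it is combined with the others.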
---
## Pipeline Execution
### Full Pipeline (batch mode)
```rust
use verificar::generator::Generator;
use verificar::oracle::SandboxOracle;
use aprender::preprocessing::StandardScaler;

// Returns a `Result` because the body propagates errors with `?` and ends in `Ok(..)`
pub async fn run_codex_pipeline(
    config: CodexConfig,
) -> Result<Dataset, Box<dyn std::error::Error>> {
    // 1. Initialize components
    let generator = Generator::new(Language::Python);
    let quality_gate = QualityGate::load(&config.quality_model_path)?;
    let bug_predictor = BugPredictor::load(&config.bug_model_path)?;
    let oracle = SandboxOracle::new(config.timeout);
    // `update` takes `&mut self`, so the sampler must be mutable
    let mut active_sampler = ActiveSampler::new(config.n_clusters);
    let mut dataset = Vec::new();
    let mut verified_count = 0;

    // 2. Generate with adaptive sampling
    for batch in 0..config.n_batches {
        let strategy = active_sampler.sample_strategy();
        let candidates = generator.generate(strategy, config.batch_size);

        // 3. Quality gate (10x speedup)
        let high_quality: Vec<_> = candidates
            .into_iter()
            .filter(|c| quality_gate.should_verify(&c.features))
            .collect();

        // 4. Bug-priority sorting (highest predicted bug probability first)
        let mut prioritized = high_quality;
        prioritized.sort_by(|a, b| {
            bug_predictor.predict_bug_prob(&b.source, &b.features)
                .partial_cmp(&bug_predictor.predict_bug_prob(&a.source, &a.features))
                .unwrap()
        });

        // 5. Oracle verification (expensive), capped at `oracle_budget` per batch
        for candidate in prioritized.iter().take(config.oracle_budget) {
            let (source_result, target_result) = oracle.execute_pair(
                &candidate.source,
                &candidate.target,
            ).await?;

            // 6. Rich labeling (`label` is an associated function with no `self`)
            let labels = RichLabeler::label(
                &candidate.source,
                &candidate.target,
                &source_result,
                &target_result,
            );

            // 7. Update active learner
            active_sampler.update(&candidate, labels.correct);
            dataset.push(TrainingExample {
                source: candidate.source.clone(),
                target: candidate.target.clone(),
                labels,
            });
            verified_count += 1;
        }

        log::info!(
            "Batch {}: {} verified, {} total examples",
            batch, verified_count, dataset.len()
        );
    }

    // 8. Quality-based selection
    let scorer = DataQualityScorer::train(&dataset);
    let final_dataset = scorer.select_top_k(&dataset, config.final_size);

    // 9. Export to Parquet
    export_parquet(&final_dataset, &config.output_path)?;
    Ok(final_dataset)
}
```
### Configuration
```yaml
# codex-config.yaml
pipeline:
  n_batches: 1000
  batch_size: 100
  oracle_budget: 20 # max oracle calls per batch (20% of batch_size)
  final_size: 50000
generator:
  language: python
  max_depth: 5
  strategies:
    - exhaustive: 0.2
    - coverage_guided: 0.5
    - boundary: 0.3
quality_gate:
  model_path: models/quality_rf.apr
  threshold: 0.7
bug_predictor:
  model_path: models/bug_gb.apr
  defect_weights:
    ast_transform: 2.0
    ownership_borrow: 1.5
    stdlib_mapping: 1.2
    language_specific: 1.0
active_learning:
  n_clusters: 50
  exploration_epsilon: 0.1
  ucb_coefficient: 2.0
labeling:
  rich_labels: true
  error_categories:
    - type_mismatch
    - ownership_violation
    - lifetime_error
    - panic_divergence
    - output_mismatch
output:
  format: parquet
  path: data/codex_python_rust.parquet
  columns:
    - source
    - target
    - correct
    - output_similarity
    - runtime_ratio
    - error_category
    - features
```
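The `pipeline` values interact, and a quick budget check makes that visible. The sketch below uses a hypothetical `PipelineBudget` struct (not part of either crate) and assumes the roughly 100 ms per oracle call quoted in the Problem Statement, with calls executed sequentially as in the batch loop above.

```rust
/// Hypothetical mirror of the `pipeline` section of codex-config.yaml.
pub struct PipelineBudget {
    pub n_batches: u64,
    pub batch_size: u64,
    pub oracle_budget: u64,
    pub final_size: u64,
}

impl PipelineBudget {
    /// Total candidate programs produced by the generator.
    pub fn candidates_generated(&self) -> u64 {
        self.n_batches * self.batch_size
    }

    /// Upper bound on oracle calls, and so on verified examples.
    pub fn max_oracle_calls(&self) -> u64 {
        self.n_batches * self.oracle_budget
    }

    /// Sequential oracle time in whole seconds at `ms_per_call` per call.
    pub fn oracle_seconds(&self, ms_per_call: u64) -> u64 {
        self.max_oracle_calls() * ms_per_call / 1000
    }

    /// True if the verified pool can be large enough to fill `final_size`.
    pub fn final_size_reachable(&self) -> bool {
        self.final_size <= self.max_oracle_calls()
    }
}
```

With the values above, the generator produces 100,000 candidates, the oracle runs at most 20,000 times (about 2,000 seconds at 100 ms per call), and `final_size_reachable()` is false: a `final_size` of 50,000 exceeds the 20,000 examples the oracle can verify, so the top-k selection would simply return everything. Either `final_size` or the batch settings may need adjusting.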
---
## Expected Outcomes
| Metric | Baseline | With CODEX |
|--------|----------|------------|
| Oracle calls per 1K examples | 1000 | 100 |
| Bug-revealing rate | 2% | 15% |
| Dataset diversity (silhouette) | 0.3 | 0.6 |
| Training convergence (epochs) | 50 | 20 |
| Final model accuracy | 85% | 92% |
---
## Implementation Roadmap
### Phase 1: Quality Gate (VER-050)
- [ ] Feature extraction pipeline
- [ ] RandomForest training on historical data
- [ ] Integration with verificar generator
### Phase 2: Bug Predictor (VER-051)
- [ ] Commit feature extraction from PAIML repos
- [ ] GradientBoosting model training
- [ ] Defect category weighting
### Phase 3: Active Learning (VER-052)
- [ ] Code embedding via TF-IDF + SVD
- [ ] K-means clustering
- [ ] UCB-based cluster selection (Thompson Sampling as an alternative)
### Phase 4: Rich Labeling (VER-053)
- [ ] Error categorization taxonomy
- [ ] AST diff computation
- [ ] Soft label generation
### Phase 5: Integration (VER-054)
- [ ] End-to-end pipeline
- [ ] Parquet export with schema
- [ ] Benchmarks and validation
---
## Dependencies
```toml
[dependencies]
aprender = "0.9" # ML algorithms
verificar = "0.1" # Synthetic data generation
trueno = "0.7" # SIMD tensor ops
serde = "1"
serde_yaml = "0.9"
parquet = "54"
tokio = { version = "1", features = ["full"] }
```
---
## References
1. **PAIML Vision Sync**: `/docs/specifications/paiml-sai-vision-sync.md`
2. **Aprender Documentation**: `../aprender/README.md`
3. **Verificar Architecture**: `../verificar/CLAUDE.md`
4. **Defect Analysis**: 1,296 defect-fix commits across PAIML transpilers