Module data


Data pipeline for storing verified test cases

This module stores and retrieves verified (source, target, correctness) tuples in Parquet format.

§Features

  • Large-scale parallel generation with progress tracking
  • Automatic Parquet sharding for large datasets
  • Support for all sampling strategies
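The sharding idea in the list above can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the `Tuple` struct stands in for `VerifiedTuple` (whose real fields are not shown on this page), and `plan_shards` and the `corpus-NNNNN.parquet` naming scheme are hypothetical, not part of this crate's API. The sketch only plans which rows go into which shard file; it does not write Parquet.

```rust
/// Hypothetical stand-in for `VerifiedTuple`; the real type's fields
/// are not documented on this page.
#[derive(Debug, Clone, PartialEq)]
pub struct Tuple {
    pub source: String,
    pub target: String,
    pub correct: bool,
}

/// Split `tuples` into shards of at most `shard_size` rows and pair each
/// shard with an assumed file name such as `corpus-00000.parquet`.
///
/// Panics if `shard_size` is zero.
pub fn plan_shards(tuples: &[Tuple], shard_size: usize) -> Vec<(String, &[Tuple])> {
    assert!(shard_size > 0, "shard_size must be positive");
    tuples
        .chunks(shard_size) // the last chunk may be shorter than shard_size
        .enumerate()
        .map(|(i, rows)| (format!("corpus-{i:05}.parquet"), rows))
        .collect()
}
```

With five tuples and a shard size of two, this plan yields three shards, the last holding a single row. A real pipeline would hand each slice to a Parquet writer rather than keep it in memory.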

Re-exports§

pub use corpus::CorpusFormat;
pub use corpus::CorpusManager;
pub use corpus::CorpusMetadata;
pub use corpus::TrainingCorpus;
pub use pipeline::DataPipeline;
pub use pipeline::PipelineConfig;
pub use pipeline::PipelineStats;
pub use pipeline::PipelineStrategy;

Modules§

corpus
Corpus management for verified tuples
pipeline
Large-scale data generation pipeline

Structs§

CodeFeatures
Features extracted from source code for ML
GenerationMetadata
Metadata about how the test case was generated
TestCase
Test case with full metadata
TestCaseBuilder
Builder for test cases with mutations
VerifiedTuple
Verified transpilation tuple for ML training

Enums§

TestResult
Test result enum