Module train_data

Modules

bm25
checkpoint
diff
git
Git operations for training data extraction.
query

Structs

TrainDataConfig
Configuration for training data generation.
TrainDataStats
Statistics from a training data generation run.
Triplet
A single training triplet: query + positive + hard negatives.
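
To make the triplet shape concrete, here is a minimal sketch of how one record might be written as a single JSONL line. The type name `TripletSketch`, its field names, and the JSON keys are assumptions for illustration; the real `Triplet` struct may use different fields and almost certainly serializes through a library rather than by hand.

```rust
/// Hypothetical shape of a training triplet; the real `Triplet` may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct TripletSketch {
    /// Normalized commit message (assumed field name).
    pub query: String,
    /// Content of the changed function (assumed field name).
    pub positive: String,
    /// Hard negatives selected by BM25 (assumed field name).
    pub negatives: Vec<String>,
}

/// Escape a string as a JSON string literal, surrounding quotes included.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl TripletSketch {
    /// Serialize to one JSONL record (a single line, no trailing newline).
    pub fn to_jsonl_line(&self) -> String {
        let negatives: Vec<String> = self.negatives.iter().map(|n| json_string(n)).collect();
        format!(
            "{{\"query\":{},\"positive\":{},\"negatives\":[{}]}}",
            json_string(&self.query),
            json_string(&self.positive),
            negatives.join(",")
        )
    }
}
```

Because newlines inside function bodies are escaped, each triplet stays on one physical line, which is what lets a JSONL reader split records on line boundaries.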

Enums

TrainDataError

Functions

generate_training_data
Generate training data as JSONL from git history across one or more repos. For each repo, walk the files at HEAD to build a BM25 corpus, then iterate over commits to find changed functions. Each changed function produces one triplet: the normalized commit message as the query, the function content as the positive, and BM25-selected hard negatives.
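
The hard-negative step can be sketched as follows. This is not the `bm25` module's API, which is not shown on this page: `tokenize`, `bm25_scores`, and `hard_negatives` are hypothetical helpers. The tokenizer, the parameters `k1 = 1.2` and `b = 0.75`, and the rule "take the top `k` scores excluding the positive" are all assumptions. The sketch scores every corpus entry against the commit message with Okapi BM25 and keeps the highest-scoring entries that are not the positive.

```rust
use std::collections::HashMap;

/// Naive tokenizer (assumption): lowercase, split on non-alphanumeric chars.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score every document against `query` with Okapi BM25.
fn bm25_scores(query: &str, docs: &[&str]) -> Vec<f64> {
    let (k1, b) = (1.2_f64, 0.75_f64); // common defaults, assumed here
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n = tokenized.len() as f64;
    let total_len: usize = tokenized.iter().map(|d| d.len()).sum();
    let avg_len = if tokenized.is_empty() { 0.0 } else { total_len as f64 / n };

    // Document frequency: how many documents contain each term.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in &tokenized {
        let mut seen: Vec<&str> = doc.iter().map(|s| s.as_str()).collect();
        seen.sort_unstable();
        seen.dedup();
        for term in seen {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }

    let query_terms = tokenize(query);
    tokenized
        .iter()
        .map(|doc| {
            let len = doc.len() as f64;
            query_terms
                .iter()
                .map(|q| {
                    let tf = doc.iter().filter(|t| *t == q).count() as f64;
                    if tf == 0.0 {
                        return 0.0;
                    }
                    let n_q = df.get(q.as_str()).copied().unwrap_or(0.0);
                    let idf = ((n - n_q + 0.5) / (n_q + 0.5) + 1.0).ln();
                    idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}

/// Indices of the `k` highest-scoring documents, excluding the positive.
fn hard_negatives(query: &str, docs: &[&str], positive_idx: usize, k: usize) -> Vec<usize> {
    let scores = bm25_scores(query, docs);
    let mut ranked: Vec<usize> = (0..docs.len()).filter(|&i| i != positive_idx).collect();
    // Stable sort, descending by score; ties keep corpus order.
    ranked.sort_by(|&a, &b| {
        scores[b].partial_cmp(&scores[a]).unwrap_or(std::cmp::Ordering::Equal)
    });
    ranked.truncate(k);
    ranked
}
```

Calling `hard_negatives(commit_message, &corpus, positive_idx, k)` returns corpus indices rather than strings, so the caller can copy the selected function bodies into the triplet. Lexically similar but wrong functions rank highest, which is what makes them "hard" negatives for a retrieval model.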