Generate training data JSONL from git history across one or more repos.
For each repo, walk the files at HEAD to build a BM25 corpus, then iterate
over the commits to find changed functions. Each changed function yields
one triplet: the normalized commit message as the query, the function
content as the positive, and BM25-selected hard negatives.
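
The triplet construction described above can be sketched as follows. This is a minimal illustration and not the script's actual implementation: the helper names (`tokenize`, `bm25_scores`, `build_triplet`), the JSON field names (`query`, `positive`, `negatives`), the tokenizer, and the BM25 parameters `k1` and `b` are all assumptions chosen for the example. It uses a small pure-Python Okapi BM25 scorer in place of whatever BM25 library the script relies on.

```python
import json
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Score every document against the query with the Okapi BM25 formula.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avgdl = (sum(len(t) for t in tokenized) / n) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_triplet(query, positive, corpus, num_negatives=2):
    # Hard negatives: the highest-scoring corpus entries that are not the positive.
    scores = bm25_scores(query, corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    negatives = [corpus[i] for i in ranked if corpus[i] != positive][:num_negatives]
    return {"query": query, "positive": positive, "negatives": negatives}


# Hypothetical corpus of function bodies taken from HEAD.
corpus = [
    "def parse_config(path): return json.load(open(path))",
    "def load_config(path): return parse_config(path)",
    "def add(a, b): return a + b",
]
# Hypothetical normalized commit message and the function that commit changed.
triplet = build_triplet("fix config path handling", corpus[0], corpus, num_negatives=1)
print(json.dumps(triplet))  # one JSONL line per changed function
```

Ranking negatives by BM25 score against the commit message favors functions that share vocabulary with the query but are not the changed function, which is what makes them "hard" rather than random negatives.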