Module meta_model

Meta-learning across corpora: predict a PipelineConfig for a new corpus by consulting past tuner runs on similar corpora.

This is Level 2 of SphereQL’s self-optimization hierarchy (per the metalearning-direction memory):

L1 (tuner::auto_tune): per-corpus search. Produces a best config.
L2 (this module): cross-corpus generalization. Takes the (corpus features, best config) pairs produced by L1 and learns a function CorpusFeatures → PipelineConfig so new corpora can skip search or warm-start it.
L3: online adaptation from query feedback. Deferred.

Today’s meta-model is a simple z-score-normalized nearest-neighbor lookup over CorpusFeatures::to_vec, with two model-space adjustments. First, scale-type features (item/category/dim counts) are ln(1+x)-compressed before normalization so a single 500k corpus can’t dominate the statistics. Second, training sets that mix multiple metric_names are stratified to the dominant metric at fit time, because scores under different objectives are not comparable.

The model works with any N ≥ 1 training records, is deterministic, and has no free hyperparameters. Once you’ve accumulated ≥ 10 diverse corpora you can swap in something fancier (gradient-boosted trees, a small MLP) against the same MetaModel trait; the storage format (MetaTrainingRecord) stays stable.
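The lookup described above can be sketched in a few lines of plain Rust. This is an illustration under stated assumptions, not the crate's implementation: the feature layout (three scale-type counts first, then one bounded feature), the names to_model_space_sketch, fit_stats, zscore, and nearest, and the fallback of treating a zero standard deviation as 1.0 (so a single training record still works) are all hypothetical.

```rust
/// Assumed layout for this sketch: indices 0..3 are scale-type counts
/// (items, categories, dims); anything after that is left untouched.
const SCALE_IDX: [usize; 3] = [0, 1, 2];

/// Compress scale-type features with ln(1 + x).
fn to_model_space_sketch(raw: &[f64]) -> Vec<f64> {
    raw.iter()
        .enumerate()
        .map(|(i, &x)| if SCALE_IDX.contains(&i) { x.ln_1p() } else { x })
        .collect()
}

/// Per-feature mean and (population) standard deviation over the rows.
/// A zero deviation is replaced by 1.0 so N = 1 does not divide by zero
/// (an assumption of this sketch).
fn fit_stats(rows: &[Vec<f64>]) -> (Vec<f64>, Vec<f64>) {
    let n = rows.len() as f64;
    let dim = rows[0].len();
    let mut mean = vec![0.0; dim];
    for r in rows {
        for (m, x) in mean.iter_mut().zip(r) {
            *m += x / n;
        }
    }
    let mut std = vec![0.0; dim];
    for r in rows {
        for ((s, x), m) in std.iter_mut().zip(r).zip(&mean) {
            *s += (x - m).powi(2) / n;
        }
    }
    for s in std.iter_mut() {
        *s = s.sqrt();
        if *s < 1e-12 {
            *s = 1.0;
        }
    }
    (mean, std)
}

/// Z-score a model-space vector with the fitted statistics.
fn zscore(v: &[f64], mean: &[f64], std: &[f64]) -> Vec<f64> {
    v.iter()
        .zip(mean)
        .zip(std)
        .map(|((x, m), s)| (x - m) / s)
        .collect()
}

/// Index of the training row closest to the query in normalized
/// Euclidean distance. Ties resolve to the earliest row, so the result
/// is deterministic. Panics on an empty training set.
fn nearest(train_raw: &[Vec<f64>], query_raw: &[f64]) -> usize {
    let rows: Vec<Vec<f64>> = train_raw
        .iter()
        .map(|r| to_model_space_sketch(r))
        .collect();
    let (mean, std) = fit_stats(&rows);
    let q = zscore(&to_model_space_sketch(query_raw), &mean, &std);
    let mut best = (0usize, f64::INFINITY);
    for (i, r) in rows.iter().enumerate() {
        let z = zscore(r, &mean, &std);
        let d2: f64 = z.iter().zip(&q).map(|(a, b)| (a - b).powi(2)).sum();
        if d2 < best.1 {
            best = (i, d2);
        }
    }
    best.0
}
```

In the real module the returned index would select a record's best_config. The sketch compares squared distances, which preserves the ordering of Euclidean distances and skips the square root.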

§Storage

Records are serialized as a flat JSON array:

[
  { "corpus_id": "built_in_775", "features": {...}, "best_config": {...}, ... },
  ...
]

MetaTrainingRecord::save_list and MetaTrainingRecord::load_list are convenience wrappers; the format is plain enough to edit by hand or process with jq.

Structs§

DistanceWeightedMetaModel: Picks the training record that maximizes evidence × w(distance), where w(d) = 1 / (d + epsilon) over z-score-normalized Euclidean distance and evidence is MetaTrainingRecord::score_lift when present, falling back to best_score for legacy records.
MetaTrainingRecord: One observation for the meta-learner: “on this corpus profile, this config was found to be best under this metric.”
NearestNeighborMetaModel: The simplest useful meta-model: given a new corpus, return the best_config of the training record whose corpus-feature vector is closest in z-score-normalized Euclidean distance (scale features are log-compressed first; see to_model_space).
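The selection rule given for DistanceWeightedMetaModel, evidence × w(distance) with w(d) = 1 / (d + epsilon), can be illustrated as follows. This is a sketch, not the crate's code: the trimmed-down RecordSketch struct, the precomputed distance field, the EPSILON value, and the function names are assumptions made for the example.

```rust
/// Hypothetical, trimmed-down view of a training record. `distance` is
/// assumed to be the already computed normalized distance to the query.
struct RecordSketch {
    distance: f64,
    score_lift: Option<f64>, // absent on legacy records
    best_score: f64,
}

/// Assumed smoothing constant; the real value is not stated in the docs.
const EPSILON: f64 = 1e-6;

/// Evidence is score_lift when present, otherwise best_score.
fn evidence(r: &RecordSketch) -> f64 {
    r.score_lift.unwrap_or(r.best_score)
}

/// Index of the record maximizing evidence / (distance + EPSILON),
/// or None when there are no records.
fn pick(records: &[RecordSketch]) -> Option<usize> {
    records
        .iter()
        .enumerate()
        .map(|(i, r)| (i, evidence(r) / (r.distance + EPSILON)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

Unlike the plain nearest-neighbor rule, a more distant record can win here when its evidence is strong enough to outweigh the distance penalty.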

Traits§

MetaModel: Predicts a PipelineConfig from a CorpusFeatures profile.
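To show why coding against a trait makes the model swappable, here is a hedged sketch with stand-in types. Everything in it is hypothetical: FeaturesSketch and ConfigSketch are placeholders for CorpusFeatures and PipelineConfig, and the single predict method on MetaModelSketch is an assumed shape, not the actual signature of MetaModel.

```rust
/// Hypothetical stand-in for CorpusFeatures.
struct FeaturesSketch {
    n_items: f64,
}

/// Hypothetical stand-in for PipelineConfig.
#[derive(Clone, Debug, PartialEq)]
struct ConfigSketch {
    label: &'static str,
}

/// Assumed trait shape: features in, config out.
trait MetaModelSketch {
    fn predict(&self, features: &FeaturesSketch) -> ConfigSketch;
}

/// Baseline model: always returns the same config.
struct Constant(ConfigSketch);

impl MetaModelSketch for Constant {
    fn predict(&self, _features: &FeaturesSketch) -> ConfigSketch {
        self.0.clone()
    }
}

/// A slightly richer replacement: a one-split decision stump on size.
struct Stump {
    cutoff: f64,
    small: ConfigSketch,
    large: ConfigSketch,
}

impl MetaModelSketch for Stump {
    fn predict(&self, features: &FeaturesSketch) -> ConfigSketch {
        if features.n_items < self.cutoff {
            self.small.clone()
        } else {
            self.large.clone()
        }
    }
}

/// Caller code depends only on the trait, so either model can be used.
fn config_for(model: &dyn MetaModelSketch, f: &FeaturesSketch) -> ConfigSketch {
    model.predict(f)
}
```

Because config_for takes a trait object, replacing Constant with Stump (or any later model) requires no change at the call site.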