Meta-learning across corpora: predict a PipelineConfig for a new
corpus by consulting past tuner runs on similar corpora.
This is Level 2 of SphereQL’s self-optimization hierarchy (per the metalearning-direction memory):
- L1 (tuner::auto_tune): per-corpus search. Produces a best config.
- L2 (this module): cross-corpus generalization. Takes the (corpus
features, best config) pairs produced by L1 and learns a function
CorpusFeatures → PipelineConfig so new corpora can skip search or warm-start it.
- L3: online adaptation from query feedback. Deferred.
Today’s meta-model is a simple z-score-normalized nearest neighbor
over CorpusFeatures::to_vec, with two model-space adjustments:
scale-type features (item/category/dim counts) are ln(1+x)
compressed before normalization so a single 500k corpus can’t
dominate the statistics, and training sets mixing multiple
metric_names are stratified to the dominant metric at fit time
(scores under different objectives are not comparable). It works
with any N ≥ 1 training records, is deterministic, and has no
free hyperparameters. When you’ve accumulated ≥ 10 diverse corpora
you can swap in something fancier (gradient-boosted trees, small
MLP) against the same MetaModel trait — the storage format
(MetaTrainingRecord) stays stable.
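The lookup described above can be sketched in a few lines of plain Rust. This is a minimal illustration under stated assumptions, not the crate's implementation: `Features`, `fit_stats`, `zscore`, `euclid`, `nearest`, and `dominant_metric` are hypothetical names, the `scale_idx` parameter (which positions hold counts) is invented for the sketch, and the tie-breaking rules are arbitrary choices. Only the name `to_model_space` comes from the module, and its real signature may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the output of `CorpusFeatures::to_vec`.
type Features = Vec<f64>;

/// Compress scale-type features (counts) with ln(1+x); leave the rest unchanged.
/// `scale_idx` marks which positions are counts (an assumption of this sketch).
fn to_model_space(raw: &[f64], scale_idx: &[usize]) -> Features {
    raw.iter()
        .enumerate()
        .map(|(i, &x)| if scale_idx.contains(&i) { x.ln_1p() } else { x })
        .collect()
}

/// Per-feature mean and (population) standard deviation over the training rows.
fn fit_stats(rows: &[Features]) -> (Vec<f64>, Vec<f64>) {
    let n = rows.len() as f64;
    let dim = rows[0].len();
    let mut mean = vec![0.0; dim];
    for r in rows {
        for (m, x) in mean.iter_mut().zip(r) {
            *m += x / n;
        }
    }
    let mut sd = vec![0.0; dim];
    for r in rows {
        for ((s, x), m) in sd.iter_mut().zip(r).zip(&mean) {
            *s += (x - m).powi(2) / n;
        }
    }
    for s in sd.iter_mut() {
        *s = s.sqrt();
        // Constant features (always the case when N = 1) get unit scale
        // so the division in `zscore` stays finite.
        if *s < 1e-12 {
            *s = 1.0;
        }
    }
    (mean, sd)
}

fn zscore(v: &[f64], mean: &[f64], sd: &[f64]) -> Features {
    v.iter().zip(mean).zip(sd).map(|((x, m), s)| (x - m) / s).collect()
}

fn euclid(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Index of the training row nearest to `query_raw`, or `None` if there are no rows.
/// Ties go to the lowest index, which keeps the result deterministic.
fn nearest(train_raw: &[Features], query_raw: &[f64], scale_idx: &[usize]) -> Option<usize> {
    if train_raw.is_empty() {
        return None;
    }
    let train: Vec<Features> = train_raw.iter().map(|r| to_model_space(r, scale_idx)).collect();
    let (mean, sd) = fit_stats(&train);
    let q = zscore(&to_model_space(query_raw, scale_idx), &mean, &sd);
    let mut best = 0;
    let mut best_d = f64::INFINITY;
    for (i, row) in train.iter().enumerate() {
        let d = euclid(&zscore(row, &mean, &sd), &q);
        if d < best_d {
            best = i;
            best_d = d;
        }
    }
    Some(best)
}

/// Most frequent metric name, used to stratify a mixed training set.
/// On a tie, the lexicographically last name wins (arbitrary in this sketch).
fn dominant_metric<'a>(names: &[&'a str]) -> Option<&'a str> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for &n in names {
        *counts.entry(n).or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|&(_, c)| c).map(|(name, _)| name)
}
```

With three training rows whose first feature is an item count (100, 500000, 120), a query with count 110 lands next to the 100-item row: after ln(1+x) compression the 500k corpus is an outlier in its own column but no longer inflates the statistics enough to blur the difference between the small corpora.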
Storage
Records are serialized as a flat JSON array:
[
{ "corpus_id": "built_in_775", "features": {...}, "best_config": {...}, ... },
...
]

MetaTrainingRecord::save_list and MetaTrainingRecord::load_list
are convenience wrappers; the format is plain enough to edit by hand
or process with jq.
Structs
- DistanceWeightedMetaModel - Picks the training record that maximizes
evidence × w(distance), where w(d) = 1 / (d + epsilon) over
z-score-normalized Euclidean distance and evidence is
MetaTrainingRecord::score_lift when present, falling back to
best_score for legacy records.
- MetaTrainingRecord - One observation for the meta-learner: “on this corpus profile, this config was found to be best under this metric.”
- NearestNeighborMetaModel - The simplest useful meta-model: given a new corpus, return the
best_config of the training record whose corpus-feature vector is
closest in z-score-normalized Euclidean distance (scale features
log-compressed first; see to_model_space).
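The selection rule of the distance-weighted model, evidence × w(distance), can be illustrated as follows. This is a sketch under assumptions rather than the crate's code: `Candidate` is a hypothetical, trimmed-down record carrying a precomputed distance, the `EPSILON` value is made up, and `weight`, `evidence`, and `pick` are illustrative names. Only the field names `score_lift` and `best_score` come from the descriptions above.

```rust
/// Hypothetical view of one training record relative to a query.
struct Candidate {
    /// z-score-normalized Euclidean distance to the query (precomputed).
    distance: f64,
    /// Absent on legacy records.
    score_lift: Option<f64>,
    best_score: f64,
}

/// Assumed smoothing constant; the real value is not given in the docs.
const EPSILON: f64 = 1e-6;

/// w(d) = 1 / (d + epsilon)
fn weight(d: f64) -> f64 {
    1.0 / (d + EPSILON)
}

/// Prefer `score_lift`; fall back to `best_score` when it is missing.
fn evidence(c: &Candidate) -> f64 {
    c.score_lift.unwrap_or(c.best_score)
}

/// Index of the candidate maximizing evidence × w(distance).
/// Ties keep the earliest candidate.
fn pick(cands: &[Candidate]) -> Option<usize> {
    let mut best: Option<(usize, f64)> = None;
    for (i, c) in cands.iter().enumerate() {
        let s = evidence(c) * weight(c.distance);
        match best {
            Some((_, bs)) if s <= bs => {}
            _ => best = Some((i, s)),
        }
    }
    best.map(|(i, _)| i)
}
```

Unlike plain nearest neighbor, the closest record does not automatically win here: a record at distance 1.0 with a lift of 0.40 scores about 0.4, which beats a record at distance 0.5 with a lift of 0.10 (about 0.2).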
Traits
- MetaModel - Predicts a PipelineConfig from a CorpusFeatures profile.