Three-level hierarchical caption generation pipeline.
SensorLM’s key insight is that paired (sensor, text) training data can be generated automatically from unlabelled wearable recordings, eliminating the need for human annotation at scale.
§Caption levels
| Level | Module | Description | Token budget |
|---|---|---|---|
| 1 – Statistical | statistical | Mean/max/min/std per channel | 512 |
| 2 – Structural | structural | Trends & anomaly events | 512 |
| 3 – Semantic | semantic | Activities, sleep, mood | 256–1024 |
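To make the level-1 row concrete, here is a minimal sketch of how per-channel statistics could be computed and rendered as a caption line. The names `ChannelStats`, `channel_stats` and `statistical_line`, the use of population standard deviation, and the output wording are all assumptions for illustration; they are not taken from the `statistical` or `templates` modules.

```rust
/// Summary statistics for one sensor channel.
/// Hypothetical type for illustration; not part of the documented crate API.
#[derive(Debug, Clone, PartialEq)]
pub struct ChannelStats {
    pub mean: f64,
    pub max: f64,
    pub min: f64,
    pub std: f64,
}

/// Compute mean, max, min and population standard deviation of `samples`.
/// Returns `None` for an empty slice, where the statistics are undefined.
pub fn channel_stats(samples: &[f64]) -> Option<ChannelStats> {
    if samples.is_empty() {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let max = samples.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let min = samples.iter().copied().fold(f64::INFINITY, f64::min);
    // Population variance (divide by n); a real pipeline might use n - 1 instead.
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some(ChannelStats { mean, max, min, std: var.sqrt() })
}

/// Render one line of a level-1 caption. The wording is an assumed template.
pub fn statistical_line(channel: &str, s: &ChannelStats) -> String {
    format!(
        "{}: mean {:.2}, max {:.2}, min {:.2}, std {:.2}",
        channel, s.mean, s.max, s.min, s.std
    )
}
```

Calling `channel_stats` once per channel and joining the resulting lines would give a caption body that can then be truncated to the level's token budget.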
§Combination keys
The training pipeline selects one of eight caption variants for each batch:
low_level_caption → level 1 only
middle_level_caption → level 2 only
high_level_summary_caption → level 3 only (short)
high_level_all_caption → level 3 (full)
middle_low_level_caption → levels 2 + 1
high_low_level_caption → levels 3 + 1
high_middle_level_caption → levels 3 + 2
high_middle_low_level_caption → levels 3 + 2 + 1
Modules§
- semantic
- Level-3 (semantic) caption generation.
- statistical
- Level-1 (statistical) caption generation.
- structural
- Level-2 (structural) caption generation.
- templates
- Text templates for all three captioning levels.
Structs§
- CaptionContext
- All contextual information needed to produce a full multi-level caption.
Functions§
- generate_caption
- Generate the caption text for the requested CaptionKey.
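
As a rough illustration of how the eight combination keys listed under Combination keys could map onto caption levels, the sketch below defines a stand-in `CaptionKey` enum. The variant names, the `from_name`, `levels` and `assemble` helpers, and the newline separator are assumptions; the crate's actual `CaptionKey` and `generate_caption` signatures are not shown on this page and may differ.

```rust
/// The three caption levels from the table in Caption levels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Statistical = 1,
    Structural = 2,
    Semantic = 3,
}

/// Hypothetical mirror of the eight combination keys.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaptionKey {
    LowLevel,
    MiddleLevel,
    HighLevelSummary,
    HighLevelAll,
    MiddleLow,
    HighLow,
    HighMiddle,
    HighMiddleLow,
}

impl CaptionKey {
    /// Parse the string form of a key as listed in the documentation.
    pub fn from_name(name: &str) -> Option<Self> {
        Some(match name {
            "low_level_caption" => Self::LowLevel,
            "middle_level_caption" => Self::MiddleLevel,
            "high_level_summary_caption" => Self::HighLevelSummary,
            "high_level_all_caption" => Self::HighLevelAll,
            "middle_low_level_caption" => Self::MiddleLow,
            "high_low_level_caption" => Self::HighLow,
            "high_middle_level_caption" => Self::HighMiddle,
            "high_middle_low_level_caption" => Self::HighMiddleLow,
            _ => return None,
        })
    }

    /// Levels included by this key, in the documented order (highest first).
    /// The summary and full variants both use level 3; they differ only in length.
    pub fn levels(self) -> &'static [Level] {
        match self {
            Self::LowLevel => &[Level::Statistical],
            Self::MiddleLevel => &[Level::Structural],
            Self::HighLevelSummary | Self::HighLevelAll => &[Level::Semantic],
            Self::MiddleLow => &[Level::Structural, Level::Statistical],
            Self::HighLow => &[Level::Semantic, Level::Statistical],
            Self::HighMiddle => &[Level::Semantic, Level::Structural],
            Self::HighMiddleLow => {
                &[Level::Semantic, Level::Structural, Level::Statistical]
            }
        }
    }
}

/// Join pre-rendered per-level texts for `key`, one level per line (assumed layout).
pub fn assemble(key: CaptionKey, statistical: &str, structural: &str, semantic: &str) -> String {
    key.levels()
        .iter()
        .map(|level| match level {
            Level::Statistical => statistical,
            Level::Structural => structural,
            Level::Semantic => semantic,
        })
        .collect::<Vec<&str>>()
        .join("\n")
}
```

Keeping the key-to-level mapping in a single `match` means a ninth variant would fail to compile until its levels are specified, which is one reason to prefer an enum over raw strings here.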