Meaningful Moments: VLM-Identified Temporal Importance Labels for Video Action Recognition
Not every moment in a video is equally useful for recognizing an action. A few decisive moments often carry much of the evidence: the instant an object moves, the peak of a dive, or the placement of one object on another. Yet existing video datasets typically provide clip-level captions, action labels, or task-specific temporal annotations rather than large-scale, action-conditioned labels identifying which short segments carry recognition evidence. This work asks whether vision-language models can identify meaningful moments at scale and whether the resulting labels improve video action recognition.
To investigate this question, I introduce Meaningful Moments (MM), a corpus of about 4.58 million VLM-generated per-segment importance scores for roughly 500,000 class-labeled videos drawn from Something-Something v2 (SSv2), Kinetics-400 (K400), and Diving-48, covering 622 action classes in total. Each segment receives both a continuous importance score and a binary pseudo-label.
To evaluate these labels, I freeze the video action classifier and vary only the temporal sampling strategy, using an alpha-parameterized framework that compares importance-led selection against an importance-inverted negative control. This separates gains attributable to the VLM importance signal from gains caused by variable-density sampling alone.
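As a rough illustration of what an alpha-parameterized comparison between importance-led selection and its inverted control could look like, the Python sketch below blends uniform per-segment weights with importance-proportional weights and then splits a fixed frame budget across segments. The linear blend, the use of 1 minus the score for the inverted control, and the names `segment_weights` and `allocate_frames` are all assumptions made for illustration; the definition of alpha used in this work is not stated here and may differ.

```python
import numpy as np


def segment_weights(importance, alpha, inverted=False):
    """Blend uniform weights with importance-proportional weights.

    ASSUMPTION: alpha=0 means uniform sampling and alpha=1 means weights
    fully proportional to the (possibly inverted) importance scores.
    This linear blend is illustrative, not the thesis's definition.
    """
    scores = np.asarray(importance, dtype=float)
    if inverted:
        # Negative control: favor the segments the VLM scored lowest.
        scores = 1.0 - scores
    uniform = np.full(scores.shape, 1.0 / scores.size)
    total = scores.sum()
    led = scores / total if total > 0 else uniform
    return (1.0 - alpha) * uniform + alpha * led


def allocate_frames(weights, num_frames):
    """Split num_frames across segments with largest-remainder rounding."""
    raw = np.asarray(weights, dtype=float) * num_frames
    counts = np.floor(raw).astype(int)
    remainder = int(num_frames - counts.sum())
    # Hand the leftover frames to the segments with the largest fractions.
    order = np.argsort(-(raw - counts), kind="stable")
    counts[order[:remainder]] += 1
    return counts


if __name__ == "__main__":
    importance = [0.1, 0.9, 0.8, 0.2]  # hypothetical per-segment scores
    for alpha in (0.0, 0.5, 1.0):
        led = allocate_frames(segment_weights(importance, alpha), 16)
        inv = allocate_frames(segment_weights(importance, alpha, inverted=True), 16)
        print(alpha, led.tolist(), inv.tolist())
```

Because both arms use the same frame budget and the same non-uniform allocation mechanism, any difference in recognition accuracy between them would reflect which segments receive the density rather than the mere fact that density varies.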
Across all three datasets, VLM-derived importance is informative: at the hard-cut endpoint, keeping VLM-important segments outperforms keeping VLM-designated filler. Its downstream usefulness, however, depends strongly on the dataset's class structure. On Diving-48, where discriminative evidence is concentrated in the dive itself, importance-led fast-forwarding preserves full-video recognition while the inverted control collapses by roughly 30 percentage points. On K400, hard-cut selection beats uniform sampling by 1.45 percentage points. On SSv2, VLM importance labels carry signal but do not recover full-video performance.
Selecting segments by continuous-score thresholds or budgets further outperforms the oracle-designated binary kept set. In this annotation pipeline, direct scoring matches greedy removal to within the observed noise floor while using roughly 25x fewer oracle calls. A four-VLM cross-oracle study on 369 joint-precheck-pass videos shows moderate segment-level agreement but broadly consistent downstream recognition trends across oracles.
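To make the distinction between threshold-based and budget-based selection concrete, here is a minimal Python sketch over per-segment continuous scores. The function names `keep_by_threshold` and `keep_by_budget`, the inclusive comparison at the threshold, and the example scores are hypothetical choices for illustration and are not taken from the MM pipeline.

```python
import numpy as np


def keep_by_threshold(scores, tau):
    """Keep every segment whose score is at least tau (kept count varies)."""
    scores = np.asarray(scores, dtype=float)
    return np.flatnonzero(scores >= tau)


def keep_by_budget(scores, k):
    """Keep the k highest-scoring segments, returned in temporal order."""
    scores = np.asarray(scores, dtype=float)
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.8, 0.2, 0.5]  # hypothetical importance scores
    print(keep_by_threshold(scores, 0.5).tolist())
    print(keep_by_budget(scores, 2).tolist())
```

A threshold lets the number of kept segments vary per video, whereas a budget fixes it; a binary kept set corresponds to a single fixed operating point and cannot be tuned after annotation.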
Together, these results show that temporal importance is useful but conditional on dataset, recognizer, and sampling protocol.
Completed as an undergraduate thesis at Dartmouth College.