Watch Before You Answer

Learning from Visually Grounded Post-Training

Anonymous Authors
Anonymous Institution

Abstract

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even more limited than previously assumed: 30-50% of the questions in commonly reported long-video understanding benchmarks can be answered using text cues alone. We further find that the same issue pervades widely used post-training datasets, potentially undercutting the ability of post-training to improve VLMs' video understanding performance.

Guided by this observation, we introduce VidFilter, a simple yet effective solution: filtering the post-training datasets to include only video-dependent questions. When used in tandem with RL-based post-training algorithms, this simple filtering technique improves performance by up to 3% relative to using the full dataset, while using only 69% of the original post-training data.

Moreover, we show that data filtering with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is the major bottleneck for improving video understanding in VLMs.

Introduction

Video understanding is central to real-world applications such as autonomous driving, online tutorial development, assistive robotics, and movie analysis. Despite recent advances in vision-language models (VLMs), performance has lagged behind text-based reasoning, especially for tasks involving long-context video understanding.

Even for leading vision-language models (Qwen-2.5-VL), the majority of performance on video understanding benchmarks comes from language priors (pink) rather than visual comprehension. Overall gains as model size increases are exclusively driven by improvements in text-only reasoning, with visual grounding sometimes worsening in the larger model variant.

In this work, we show that the community’s progress in improving video understanding in VLMs is even worse than initially thought, with a majority of the gains coming from models’ abilities to answer questions without access to the video.

This phenomenon, known as "linguistic shortcutting," is a well-established problem in Visual Question Answering (VQA). We find that most reported video gains come from models answering a larger portion of each benchmark without access to the video, making these benchmarks unreliable for measuring improvements in genuine video understanding.

Key Contributions

  • Identifying Linguistic Shortcutting: We identify the pervasive presence of linguistic shortcutting in both video understanding benchmarks (30-50% text-only answerable) and post-training datasets
  • VidFilter: We introduce VidFilter, an exceedingly simple method for improving VLM post-training: simply filtering out text-only answerable questions
  • Data Efficiency: Although this strategy filters out approximately 30% of the post-training data, it leads to improvements of 2.5-3% in video understanding performance
  • Outperforming Complex Methods: This simple approach outperforms several more advanced post-training strategies, such as stronger fusion modules, debiasing objectives, and more elaborate RL-based post-training pipelines

Method

Guided by our analyses, we introduce VidFilter, a simple technique for improving video understanding in VLMs through post-training. VidFilter combines reinforcement learning techniques for post-training with a simple data filter.

RL for Video Understanding Post-Training

We use reinforcement learning (RL) for post-training based on recent evidence that RL improves underlying visual recognition capabilities while exhibiting less catastrophic forgetting than supervised fine-tuning (SFT).

We adopt Group Relative Policy Optimization (GRPO) augmented with techniques from DAPO and temporal-aware rewards from Video-R1. Specifically, we employ token-level policy gradient loss with asymmetric clipping to make the training more efficient and stable.
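The token-level objective described above can be sketched as follows. This is a minimal NumPy sketch under standard GRPO/DAPO formulations, not the paper's implementation: the clipping bounds `eps_low`/`eps_high` and the group-normalized advantage are illustrative defaults, and the temporal-aware reward term from Video-R1 is omitted.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO advantage: normalize each rollout's reward within its group
    of G sampled responses (no learned value function needed)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_token_loss(logp_new, logp_old, advantages, mask,
                    eps_low=0.2, eps_high=0.28):
    """Token-level clipped policy-gradient loss with asymmetric
    (DAPO-style clip-higher) bounds, averaged over all valid tokens.

    logp_new, logp_old: (G, T) per-token log-probs under current / old policy
    advantages:         (G,) group-relative advantages
    mask:               (G, T) 1 for response tokens, 0 for padding
    """
    ratio = np.exp(logp_new - logp_old)          # per-token importance ratio
    adv = advantages[:, None]                    # broadcast advantage to tokens
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high) * adv
    per_token = -np.minimum(unclipped, clipped)  # pessimistic PPO-style objective
    # Token-level normalization: average over all valid tokens in the group,
    # rather than averaging per sequence first (the DAPO formulation).
    return (per_token * mask).sum() / mask.sum()
```

The asymmetric upper bound (`eps_high > eps_low`) leaves more headroom for up-weighting low-probability tokens, which is what makes the update more stable for long reasoning rollouts.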

Post-Training Data Filter

We construct our post-training data based on Video-R1-260k, which comprises 116,248 Video QA and 146,823 Image QA instances spanning diverse video and image understanding scenarios.

We partition Video-R1-260k into three variants based on text answerability:

Variant           Samples    TA Ratio   Description
Full              263,071    30.9%      Common post-training practice without filtering
TA                 81,361    100%       Only text-only answerable questions
NTA (VidFilter)   181,710    0%         Only video-dependent questions
Table 1: Post-training data variants. VidFilter uses the NTA variant containing only questions requiring genuine visual understanding.
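The partitioning step can be sketched as follows. This is a hypothetical sketch: the `answer_text_only` interface on the grading model and the majority-vote protocol over `n_trials` samples are assumptions for illustration, since the exact text-answerability test is not specified here.

```python
def is_text_answerable(model, question, options, answer, n_trials=4):
    """Label a QA pair text-only answerable (TA) if the model answers
    correctly in a majority of trials WITHOUT seeing the video.
    `model.answer_text_only` is a hypothetical interface."""
    correct = sum(
        model.answer_text_only(question, options) == answer
        for _ in range(n_trials)
    )
    return correct > n_trials // 2

def vidfilter(dataset, model):
    """Return the NTA (video-dependent) subset used for post-training:
    keep only questions the model fails without visual input."""
    return [ex for ex in dataset
            if not is_text_answerable(model, ex["question"],
                                      ex["options"], ex["answer"])]
```

Applied to Video-R1-260k, this kind of filter yields the NTA split in Table 1; sampling multiple trials guards against labeling a question TA on a single lucky guess.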

Experiments

We evaluate on four video understanding benchmarks:

  • VideoMME: A comprehensive, general-purpose benchmark spanning perception and reasoning
  • VideoMMMU: Focused on expert-level, multi-disciplinary video reasoning
  • MMVU: Emphasizing college-level, knowledge-intensive video comprehension
  • VideoTT: Assessing understanding of visual narratives

Main Results

Method            VideoMME   VideoMMMU   MMVU   VideoTT   Avg
Qwen2.5-VL-7B     48.5       36.2        47.0   36.4      42.0
Video-R1          48.9       36.4        43.8   38.0      41.8
LongVILA-R1-7B    50.9       30.9        46.9   41.7      42.6
LLaVA-Critic-R1   50.2       37.1        49.9   34.8      43.0
VidFilter (Ours)  51.8       39.5        51.4   37.9      45.2
Table 2: Comparison with state-of-the-art methods at 32 frames on NTA (video-dependent) evaluation splits. VidFilter achieves the best performance across all datasets.

Key Findings

Less is More: VidFilter, post-trained on 182K NTA samples, consistently outperforms models trained on the full 263K dataset, using only 69.1% of the post-training data.

Consistent Frame Scaling: Models trained on NTA data show steady improvement as the number of frames increases, while models trained on full data exhibit inconsistent scaling and minimal gains.

Qualitative Results

Qualitative comparison of reasoning paths. VidFilter demonstrates stronger visual grounding by explicitly referencing frame-level observations, while Video-R1 relies on general knowledge or linguistic patterns.

Conclusion

We identify the pervasive presence of linguistic shortcutting in both video understanding benchmarks and post-training datasets. In some of the most popular video understanding benchmarks, up to 50% of the questions can be answered from the question text alone.

We developed VidFilter, an exceedingly simple post-training strategy for improving video understanding: filtering out text-only answerable questions from the post-training dataset. This strategy outperforms seven state-of-the-art approaches while providing notable benefits in training data efficiency.

Our findings highlight the importance of curating post-training data that truly requires visual reasoning, offering a simple yet powerful direction for building more robust and visually grounded VLMs.


Additional Materials

Code and Data

Code and datasets will be released upon publication.

Citation

@inproceedings{vidfilter2026,
  title={Watch Before You Answer: Learning from Visually Grounded Post-Training},
  author={Anonymous},
  booktitle={Anonymous},
  year={2026}
}

Acknowledgments

To be added upon publication.