Recurrent Video Masked Autoencoders

Google DeepMind, University of Oxford

RVM leverages recurrent computation and asymmetric masking to yield a highly efficient generalist encoder that achieves competitive performance across semantic and geometric video tasks with linear computational cost.

Abstract

We present Recurrent Video Masked Autoencoders (RVM): a novel approach to video representation learning that leverages recurrent computation to model the temporal structure of video data. RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action classification, and point and object tracking, while matching or exceeding the performance of image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30× greater parameter efficiency than competing video masked autoencoders. Finally, we demonstrate that RVM’s recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based video models. Ablation studies further highlight the factors driving the model’s success, with qualitative results showing that RVM learns rich representations of scene semantics, structure, and motion.
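For intuition, the training step described above can be sketched roughly as follows. This is a minimal, hypothetical illustration rather than the released RVM architecture: the module sizes, the 90% masking ratio, the state-update rule, and all names are assumptions made purely for clarity.

```python
# Toy sketch of recurrent masked autoencoding with asymmetric masking.
# NOT the released RVM model; sizes, masking ratio, and state update are assumed.
import torch
import torch.nn as nn


class RecurrentMaskedAutoencoder(nn.Module):
    def __init__(self, dim=256, patch_pixels=16 * 16 * 3, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_pixels, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, patch_pixels)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, frames, state=None):
        """frames: (B, T, N, patch_pixels) patchified video; state: recurrent tokens."""
        losses = []
        for t in range(frames.shape[1]):
            tokens = self.patch_embed(frames[:, t])            # (B, N, dim)
            B, N, D = tokens.shape
            keep = int(N * (1 - self.mask_ratio))              # asymmetric masking:
            idx = torch.randperm(N)[:keep]                     # encode only a few patches,
            visible = tokens[:, idx]                           # reconstruct all of them
            # Process the recurrent state jointly with the visible tokens, so
            # information is aggregated over time with a constant-size state
            # (hence linear cost in the number of frames).
            inp = visible if state is None else torch.cat([state, visible], dim=1)
            out = self.encoder(inp)
            state = out[:, :keep]                              # summary carried to next frame
            # Decode every patch; masked positions start from a learned mask token.
            dec_in = self.mask_token.expand(B, N, D).clone()
            dec_in[:, idx] = out[:, -keep:]
            pred = self.decoder(dec_in)
            losses.append(((pred - frames[:, t]) ** 2).mean()) # pixel reconstruction loss
        return torch.stack(losses).mean(), state
```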

DAVIS Video Segmentation

We present qualitative results on 7 randomly selected videos from the DAVIS-2017 dataset for the video object segmentation task (first-frame ground-truth provided). The task is to propagate the ground-truth object segmentation from the first frame to all subsequent frames.
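The propagation mechanism is not detailed on this page; a common recipe for evaluating frozen features on this task (and on the JHMDB and VIP propagation tasks below) is nearest-neighbour label propagation in feature space. The sketch below assumes that recipe; the single-reference-frame context, top-k, and temperature values are illustrative, not RVM's settings.

```python
# Illustrative nearest-neighbour label propagation with frozen features.
# Assumes the common semi-supervised protocol; RVM's exact settings may differ.
import numpy as np


def propagate_labels(feats, first_frame_labels, topk=5, temperature=0.1):
    """feats: (T, N, D) per-frame patch features; first_frame_labels: (N, C) one-hot."""
    labels = [first_frame_labels]
    for t in range(1, feats.shape[0]):
        ref = feats[t - 1] / np.linalg.norm(feats[t - 1], axis=-1, keepdims=True)
        cur = feats[t] / np.linalg.norm(feats[t], axis=-1, keepdims=True)
        sim = cur @ ref.T                                  # (N, N) cosine similarity
        # Keep only the top-k most similar reference patches per query patch.
        idx = np.argsort(-sim, axis=-1)[:, :topk]
        weights = np.exp(np.take_along_axis(sim, idx, axis=-1) / temperature)
        weights /= weights.sum(axis=-1, keepdims=True)
        # Weighted vote over the previous frame's (soft) labels.
        prev = labels[-1]
        labels.append((weights[..., None] * prev[idx]).sum(axis=1))
    return np.stack(labels)                                # (T, N, C) soft labels
```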

JHMDB Pose Tracking

We present qualitative results on 5 randomly selected videos from the JHMDB dataset for the human pose tracking task (first-frame ground-truth provided). The task is to propagate the ground-truth human keypoints from the first frame to all subsequent frames.

KMeans Visualization

We present qualitative results on 5 randomly selected videos from the DAVIS-2017 dataset using KMeans clustering to illustrate how each model decomposes visual structure in a video. KMeans is applied directly to the raw feature maps without any additional processing, using K = 5 clusters.
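Concretely, the visualization amounts to something like the sketch below. K = 5 and clustering the raw feature maps follow the description above; the feature-map shape and the choice to cluster all frames jointly are assumptions.

```python
# Sketch of the KMeans visualization: cluster raw per-patch features with K = 5
# and paint each patch with its cluster id.
import numpy as np
from sklearn.cluster import KMeans


def kmeans_maps(features, k=5):
    """features: (T, H, W, D) per-frame feature maps -> (T, H, W) cluster ids."""
    T, H, W, D = features.shape
    flat = features.reshape(-1, D)         # cluster all frames jointly so cluster
    ids = KMeans(n_clusters=k, n_init=10).fit_predict(flat)  # colours stay consistent over time
    return ids.reshape(T, H, W)
```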

Noise Video Comparison

We present qualitative results on a random noise video using PCA and KMeans clustering to evaluate whether each model’s representations can capture motion independent of semantic content.

PCA Visualization

We present qualitative results on 5 randomly selected videos from the DAVIS-2017 dataset using principal component analysis (PCA) to illustrate what each model primarily captures in a video. We extract the first three principal components and visualize them as RGB images.
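A rough recipe for this visualization: project the per-patch features onto their first three principal components and rescale to [0, 1] for display. The per-component min-max rescaling here is an assumption, as the exact normalization is not specified above.

```python
# Sketch of the PCA visualization: first three principal components of the
# per-patch features, min-max rescaled to [0, 1] and shown as RGB.
import numpy as np
from sklearn.decomposition import PCA


def pca_rgb(features):
    """features: (T, H, W, D) feature maps -> (T, H, W, 3) RGB in [0, 1]."""
    T, H, W, D = features.shape
    comps = PCA(n_components=3).fit_transform(features.reshape(-1, D))
    comps = (comps - comps.min(0)) / (comps.max(0) - comps.min(0) + 1e-8)
    return comps.reshape(T, H, W, 3)
```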

VIP Part Propagation

We present qualitative results on 5 randomly selected videos from the VIP dataset for the video part segmentation task (first-frame ground-truth provided). The task is to propagate the ground-truth human part segmentation from the first frame to all subsequent frames.

Libero Benchmark Results (Frozen Backbone)

We evaluate frozen visual representations on the Libero robotics benchmark using a Diffusion Policy with a transformer-based predictor and DDIM sampling. Each frozen backbone encodes observations from two camera views, with task descriptions encoded via a frozen SigLIP text encoder. We train for 100K steps across all 10 Libero-10 tasks. Results are averaged over five seeds.
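For reference, the setup above can be summarized as an illustrative configuration. Only the values stated in the text are filled in; the key names themselves are hypothetical.

```python
# Illustrative summary of the Libero evaluation setup described above.
# Key names are hypothetical; values mirror the text.
libero_eval_config = {
    "policy": "Diffusion Policy",
    "predictor": "transformer",
    "sampler": "DDIM",
    "visual_backbone": "frozen",            # RVM, DINOv2, V-JEPA, ... swapped in here
    "camera_views": 2,
    "text_encoder": "frozen SigLIP",
    "train_steps": 100_000,
    "tasks": "all 10 Libero-10 tasks",
    "seeds": 5,
}
```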

RVM achieves the highest and most stable success rates. 4DS and VideoMAEv2 perform comparably, while CroCo, DINOv2, V-JEPA, and VideoMAE show lower performance.

Related Links

DINO, DINOv2, DINOv3: Self-supervised vision transformers that learn robust visual features and scale to universal vision models.

VideoMAE, VideoMAEv2: Masked autoencoders for data-efficient video pre-training that scale to billion-parameter models.

V-JEPA, V-JEPA 2: Video Joint Embedding Predictive Architectures for feature prediction and planning without reconstruction.

Scaling 4D Representations: 4DS scales masked autoencoding to learn rich spatio-temporal features.