Overview

TL;DR: 4RC (pronounced "ARC") performs unified, complete 4D reconstruction from monocular videos in a single feed-forward pass, via conditional querying of geometry and motion for any frame at any timestamp.

Demo Video

Abstract: We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing methods that typically decouple motion from geometry or produce only limited 4D attributes, such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere-anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form, decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
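Concretely, the minimally factorized representation can be read as follows; this is a sketch in our own notation, not necessarily the paper's symbols:

\[
X(q, t) \;=\; G(q) \;+\; \Delta(q, t),
\]

where \(G(q)\) denotes the time-independent base geometry of query frame \(q\) and \(\Delta(q, t)\) the time-dependent relative motion that transports it to target timestamp \(t\). The video is encoded once; each \((q, t)\) pair is then answered by a lightweight conditional decode.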

Framework

Overall architecture of 4RC: Video frames are patchified and augmented with camera and time tokens, then jointly encoded by a single transformer into a compact 4D latent representation \(F\). From this representation, a conditional decoder with disentangled geometry and motion heads enables flexible querying of 3D geometry and motion for arbitrary source views at arbitrary target timestamps.

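As a rough illustration of this data flow, here is a minimal PyTorch sketch under our own simplifying assumptions; the class, method, and tensor names (FourRCSketch, encode, query, and so on) are hypothetical, and the real model's token layout, depth, and heads will differ:

```python
import torch
import torch.nn as nn


class FourRCSketch(nn.Module):
    """Toy stand-in for the pipeline above: encode a whole clip once,
    then answer (query frame, target time) requests from the latent.
    Hypothetical sketch, not the 4RC release."""

    def __init__(self, dim=256, patch=16, depth=4, heads=8):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cam_embed = nn.Linear(16, dim)   # flattened 4x4 pose -> camera token
        self.time_embed = nn.Linear(1, dim)   # scalar timestamp -> time token
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, depth)
        dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, 1)
        self.geom_head = nn.Linear(dim, 3)    # base 3D point per patch
        self.motion_head = nn.Linear(dim, 3)  # relative motion per patch at time t

    def encode(self, frames, poses, times):
        """Run once per video: (B, T, 3, H, W) frames -> compact latent F."""
        B, T = frames.shape[:2]
        x = self.patchify(frames.flatten(0, 1))            # (B*T, dim, h, w)
        x = x.flatten(2).transpose(1, 2)                   # (B*T, N, dim)
        cam = self.cam_embed(poses.reshape(B * T, 16))[:, None]
        tim = self.time_embed(times.reshape(B * T, 1))[:, None]
        x = self.encoder(torch.cat([cam, tim, x], dim=1))  # (B*T, N+2, dim)
        return x.reshape(B, -1, x.shape[-1])               # compact latent F

    def query(self, F, q_tokens, t_target):
        """Cheap per-query decode: condition the query frame's tokens on a
        target timestamp and read out the two disentangled heads."""
        t_tok = self.time_embed(t_target.reshape(-1, 1))[:, None]
        h = self.decoder(torch.cat([t_tok, q_tokens], dim=1), F)[:, 1:]
        return self.geom_head(h), self.motion_head(h)      # geometry, motion


model = FourRCSketch()
frames = torch.randn(1, 8, 3, 64, 64)                     # 8-frame monocular clip
poses = torch.eye(4).reshape(1, 1, 16).expand(1, 8, 16)   # dummy camera poses
times = torch.linspace(0, 1, 8).reshape(1, 8, 1)
F = model.encode(frames, poses, times)                    # encode once
n = F.shape[1] // 8                                       # tokens per frame
geom, motion = model.query(F, F[:, :n], torch.tensor([0.5]))  # frame 0 at t=0.5
```

The point of the interface split is that encode runs once per clip, while each query call is a single lightweight decoder pass, which is what makes dense anytime/anywhere querying cheap at inference.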

Comparisons

Qualitative comparison of dynamic tracking.


🚧 Under Construction

This project page is still a work in progress. More interactive demos and results are coming soon!

BibTeX

@article{luo20264rc,
  title={4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere},
  author={Yihang Luo and Shangchen Zhou and Yushi Lan and Xingang Pan and Chen Change Loy},
  journal={arXiv preprint arXiv:TODO},
  year={2026}
}