We introduce MVTracker, the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike monocular trackers, which struggle with depth ambiguities and occlusion, or previous multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified 3D point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Notably, on DexYCB, our method surpasses the strongest monocular tracker by 63.6% and a triplane-based multi-view baseline by 53.5%. MVTracker also generalizes better to diverse camera setups of 1–8 cameras with varying vantage points and video lengths of 24–150 frames. By releasing our pre-trained tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for a wide range of real-world applications.
Given multi-view RGB videos and camera parameters, our method first extracts per-view feature maps using a CNN encoder. We then construct a fused 3D point cloud from estimated or sensor-provided depth, associating each point with learned features. Directed kNN-based correlation links points across space and time, capturing spatiotemporal relationships across views. A transformer iteratively refines point trajectories using attention over multi-view correlations. The model processes sequences in overlapping sliding windows, producing temporally consistent 3D point trajectories with occlusion-aware visibility predictions.
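The kNN-based correlation step above can be sketched as follows. This is a hypothetical illustration in NumPy, not the authors' implementation: the function name, tensor shapes, and the choice of dot-product similarity are assumptions made for clarity.

```python
import numpy as np

def knn_correlation(query_xyz, query_feat, cloud_xyz, cloud_feat, k=4):
    """Illustrative sketch: for each tracked query point, find its k nearest
    neighbors in the fused multi-view point cloud and return feature
    correlations plus relative 3D offsets (shapes are assumptions).

    query_xyz:  (Q, 3) 3D positions of tracked points
    query_feat: (Q, C) features of tracked points
    cloud_xyz:  (N, 3) fused multi-view point cloud
    cloud_feat: (N, C) per-point features from the CNN encoder
    """
    # Pairwise Euclidean distances between queries and cloud points: (Q, N)
    d = np.linalg.norm(query_xyz[:, None, :] - cloud_xyz[None, :, :], axis=-1)
    # Indices of the k nearest cloud points per query: (Q, k)
    idx = np.argsort(d, axis=1)[:, :k]
    # Dot-product feature correlation with each neighbor: (Q, k)
    corr = np.einsum('qc,qkc->qk', query_feat, cloud_feat[idx])
    # Relative offsets encode local geometry around each query: (Q, k, 3)
    offsets = cloud_xyz[idx] - query_xyz[:, None, :]
    return corr, offsets

# Toy usage: 2 query points taken from a 100-point cloud with 8-D features
rng = np.random.default_rng(0)
cloud_xyz = rng.normal(size=(100, 3))
cloud_feat = rng.normal(size=(100, 8))
corr, offsets = knn_correlation(cloud_xyz[:2], cloud_feat[:2],
                                cloud_xyz, cloud_feat, k=4)
print(corr.shape, offsets.shape)  # (2, 4) (2, 4, 3)
```

In the actual model, such correlation features (together with the offsets) would be the input that the transformer attends over when iteratively refining trajectories; a real implementation would also use an accelerated neighbor search rather than a dense distance matrix.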
Use the interactive viewer below to explore our qualitative results. Navigate with the WASD/QE keys, drag to rotate, scroll to zoom, and use the space bar or arrow keys to control playback. You can toggle visual elements in the left panel. As rendering runs in the browser, performance may vary; we recommend opening only one viewer at a time. To reduce memory usage (~1 GB), we omit RGB/depth previews, feature maps, and kNN neighborhoods; limit the number of tracks; and may crop the scene. DUSt3R results show confidence-filtered (>5) point clouds. GT tracks, if available, are shown in white. Occlusion is indicated with gray (GT tracks) or black (predictions). For fullscreen, open a viewer from the gallery below or use rerun locally (pip install rerun-sdk==0.21.0; rerun).
Explore evaluation results of the compared methods in the interactive viewer and gallery below. Compared to monocular 3D point trackers such as SpatialTracker and TAPIP3D, MVTracker leverages multi-view information and produces view-consistent trajectories with less jitter. Compared to triplane- and optimization-based baselines, our learned prior yields more precise tracks and fewer failures. Red lines indicate distance to ground truth; dark-colored points are predicted as not visible in any view. To save memory, we batch tracks into one object, visualize only the first 50 (of 512) evaluation tracks, and for Panoptic Studio show only the last frame.
@inproceedings{rajic2025mvtracker,
title = {Multi-View 3D Point Tracking},
author = {Raji{\v{c}}, Frano and Xu, Haofei and Mihajlovic, Marko and Li, Siyuan and Demir, Irem and G{\"u}ndo{\u{g}}du, Emircan and Ke, Lei and Prokudin, Sergey and Pollefeys, Marc and Tang, Siyu},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025}
}
We are grateful for the insightful discussions with Yiming Wang, Zador Pataki, Luigi Piccinelli, Paul-Edouard Sarlin, and Philipp Lindenberger. We thank Ignacio Rocco and Skanda Koppula for their helpful clarifications regarding TAPVid-3D. This work was conducted within the Proficiency flagship project funded by the Swiss Innovation Agency Innosuisse. It was also supported as part of the Swiss AI initiative by a grant from the Swiss National Supercomputing Centre under project IDs A03 and A136 on Alps.