We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene taken at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that differing viewpoints can make unchanged objects falsely appear changed. We introduce the SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to predict robustly across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multiview and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
Given a pair of input sequences captured before and after the change, our pipeline co-registers both sequences into a shared 3D space, selects paired views to compute region-level change scores, and detects changed objects (Added, Removed, or Moved) in both sequences.
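The final step above, deciding Added / Removed / Moved from region-level comparisons, can be sketched as follows. This is a minimal toy illustration, not the actual SceneDiff implementation: it assumes each object region has already been reduced to a semantic feature vector (e.g., from an image encoder) and a 3D centroid in the shared co-registered space, and the function name, interface, and thresholds are all hypothetical.

```python
import numpy as np

def classify_changes(before, after, sim_thresh=0.8, move_thresh=0.10):
    """Toy region-level change classifier (hypothetical interface).

    `before` / `after`: dict mapping region id -> (feature_vector, centroid),
    where centroids are assumed to live in a shared, co-registered 3D space.
    Returns a dict mapping region id -> change label.
    """
    labels = {}
    matched_after = set()
    for rid, (feat_b, pos_b) in before.items():
        # Greedily match each "before" region to the most semantically
        # similar unmatched "after" region (cosine similarity of features).
        best, best_sim = None, -1.0
        for aid, (feat_a, _) in after.items():
            if aid in matched_after:
                continue
            sim = float(np.dot(feat_b, feat_a) /
                        (np.linalg.norm(feat_b) * np.linalg.norm(feat_a)))
            if sim > best_sim:
                best, best_sim = aid, sim
        if best is None or best_sim < sim_thresh:
            labels[rid] = "Removed"   # no semantic match found in "after"
        else:
            matched_after.add(best)
            # A matched region counts as Moved if its 3D centroid shifted.
            dist = float(np.linalg.norm(np.asarray(pos_b, dtype=float) -
                                        np.asarray(after[best][1], dtype=float)))
            labels[rid] = "Moved" if dist > move_thresh else "Unchanged"
    for aid in after:
        if aid not in matched_after:
            labels[aid] = "Added"     # appears only in the "after" capture
    return labels
```

A real system would replace the greedy matching with a proper assignment over many candidate regions and fuse scores across multiple paired views, but the Added / Removed / Moved decision structure is the same.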
We demonstrate a robot restoring a messy table to its initial clean state by applying SceneDiff to detect moved and added objects. See the input videos here.
Example video pairs (Before Change / After Change).
We built a SAM2-based annotation tool that lets annotators provide sparse point prompts and object attributes. The system propagates masks across video pairs offline and provides a review interface, substantially reducing annotation time.
This work is supported in part by NSF IIS grant 2312102. S.W. is supported by NSF 2331878 and 2340254, and research grants from Intel, Amazon, and IBM. This research used the Delta advanced computing resource, a joint effort of UIUC and NCSA supported by NSF (award OAC 2005572) and the State of Illinois.
Special thanks to Prachi Garg and Yunze Man for helpful discussion during project development, Bowei Chen, Zhen Zhu, Ansel Blume, Chang Liu, and Hongchi Xia for general advice and feedback on the paper, and Haoqing Wang and Vladimir Yesayan for data collection and annotation.