TY - CONF
T1 - Self-Supervised Learning of Object Segmentation from Unlabeled RGB-D Videos
AU - Lu, Shiyang
AU - Deng, Yunfu
AU - Boularias, Abdeslam
AU - Bekris, Kostas
N1 - Funding Information: Shiyang Lu, Abdeslam Boularias, and Kostas Bekris are affiliated with the Department of Computer Science at Rutgers University, New Brunswick, NJ, 08901, USA. Email: {shiyang.lu, abdeslam.boularias, kostas.bekris}@rutgers.edu. Yunfu Deng is affiliated with the Department of Electrical and Computer Engineering at Rutgers University, New Brunswick, NJ, 08901, USA. This work is supported by NSF awards 1734492, 1846043 and 2132972. Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - This work proposes a self-supervised learning system for segmenting rigid objects in RGB images. The pipeline is trained on unlabeled RGB-D videos of static objects, which can be captured with a camera carried by a mobile robot. A key feature of the self-supervised training process is a graph-matching algorithm that operates on the over-segmentation output of the point cloud reconstructed from each video. Graph matching, combined with point cloud registration, finds recurring object patterns across videos and combines them into 3D object pseudo-labels, even under occlusions or different viewing angles. Projected 2D object masks from the 3D pseudo-labels are used to train a pixel-wise feature extractor through contrastive learning. During online inference, a clustering method uses the learned features to group foreground pixels into object segments. Experiments highlight the method's effectiveness on both real and synthetic video datasets, which include cluttered scenes of tabletop objects. The proposed method outperforms existing unsupervised object segmentation methods by a large margin.
UR - http://www.scopus.com/inward/record.url?scp=85168696584&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85168696584&partnerID=8YFLogxK
DO - 10.1109/ICRA48891.2023.10160786
M3 - Conference contribution
T3 - Proceedings - IEEE International Conference on Robotics and Automation
SP - 7017
EP - 7023
BT - Proceedings - ICRA 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Robotics and Automation, ICRA 2023
Y2 - 29 May 2023 through 2 June 2023
ER -