RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes, CVPR 2022

RM-Depth

This repository (https://github.com/twhui/RM-Depth) is the official project page for my paper RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes, published at CVPR 2022. The up-to-date version of the paper is available on arXiv. The supplementary material is available here.

Overview

A new unsupervised CNN is proposed to predict a single-image depth map and the complete 3D motion (the motion of moving objects and of the camera itself) in dynamic scenes, without requiring scene rigidity or semantic labels. Optical flow and moving-object segmentations are also recovered.

Major contributions: (1) Recurrent modulation units (RMU) are proposed to adaptively and iteratively combine encoder and decoder features. (2) Residual upsampling is proposed for fast and efficient resizing of feature maps while still producing sharp depth maps. (3) A warping-based network is proposed to estimate a motion field of moving objects without using semantic priors. The motion field is further regularized by an outlier-aware training loss.

Although the depth model uses only a single image at test time and has only 2.97M parameters, it achieves state-of-the-art results on the KITTI and Cityscapes benchmarks (AbsRel = 0.107 and 0.090, respectively). It also runs at 40 FPS (image size: 640 x 192) on an NVIDIA 1080 GPU.
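For readers unfamiliar with the AbsRel metric quoted above, the following is a minimal, dependency-free sketch of the mean absolute relative error. The function name `abs_rel`, the median-scaling step, and the 80 m depth cap are assumptions borrowed from common KITTI evaluation practice for unsupervised methods; this README does not state the exact evaluation protocol used for RM-Depth.

```python
from statistics import median


def abs_rel(pred, gt, median_scale=True, min_depth=1e-3, max_depth=80.0):
    """Mean of |pred - gt| / gt over pixels with valid ground truth.

    pred, gt: flat sequences of depths in metres (hypothetical layout).
    median_scale: align the prediction to the ground truth by the ratio of
    medians, since unsupervised monocular depth is only known up to scale.
    """
    # Keep only pixels whose ground-truth depth lies in the valid range.
    pairs = [(p, g) for p, g in zip(pred, gt) if min_depth < g < max_depth]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")

    scale = 1.0
    if median_scale:
        scale = median(g for _, g in pairs) / median(p for p, _ in pairs)

    return sum(abs(scale * p - g) / g for p, g in pairs) / len(pairs)
```

A prediction that is correct up to a global scale, such as `abs_rel([1.0, 2.0, 4.0], [2.0, 4.0, 8.0])`, scores zero with median scaling enabled, while the same call with `median_scale=False` reports the raw relative error.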

Recurrent Modulation Unit (RMU)

Fusion of feature maps across the encoder and decoder is common in depth estimation. In RM-Depth, the depth decoder consists of RMUs. The fusion is iteratively refined by adaptively modulating the encoder features using the hidden state of the RMU, which in turn improves single-image depth inference.
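The idea of modulating encoder features with a recurrent hidden state can be illustrated with a toy example on a single pixel's feature vector. This is only a sketch: the sigmoid gate, the tanh update, and the weight matrices `W_mod`, `W_h` and `W_e` are hypothetical choices made for illustration and are not the layer layout of the RMU in the paper.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def rmu_step(h, enc, W_mod, W_h, W_e):
    """One refinement step (hypothetical form, not the paper's exact unit)."""
    # The hidden state decides how strongly each encoder channel passes through.
    gate = [sigmoid(v) for v in matvec(W_mod, h)]
    modulated = [g * e for g, e in zip(gate, enc)]
    # Fuse the previous hidden state with the modulated encoder features.
    pre = [a + b for a, b in zip(matvec(W_h, h), matvec(W_e, modulated))]
    return [math.tanh(v) for v in pre]


def rmu_refine(enc, W_mod, W_h, W_e, steps=3):
    """Start from a zero hidden state and refine the fusion for `steps` iterations."""
    h = [0.0] * len(W_h)
    for _ in range(steps):
        h = rmu_step(h, enc, W_mod, W_h, W_e)
    return h
```

Because the gate depends on the hidden state, each iteration sees differently weighted encoder features, which is the sense in which the fusion is refined iteratively. In a real network the matrix products would be convolutions over full feature maps.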

Residual upsampling

Conventionally, feature maps are upsampled using a single set of filters. In this work, multiple sets of filters are proposed, with each set trained specifically to upsample some of the spectral components. This effectively improves upsampling along edges.
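A one-dimensional toy example may help to picture upsampling with several filter sets. In this sketch the signal is upsampled by zero insertion, filtered by each kernel in turn, and the responses are summed. The function names, the fixed (untrained) kernels, and the plain sum used to combine branches are all assumptions for illustration; the paper's actual residual upsampling layer is not reproduced here.

```python
def upsample_zero_insert(x, factor=2):
    """Insert factor-1 zeros after every sample."""
    out = [0.0] * (len(x) * factor)
    out[::factor] = x
    return out


def conv_same(x, kernel):
    """Filter with an odd-length kernel, zero padding, same output length."""
    r = len(kernel) // 2
    n = len(x)
    return [
        sum(kernel[k] * x[i + k - r]
            for k in range(len(kernel)) if 0 <= i + k - r < n)
        for i in range(n)
    ]


def residual_upsample(x, kernels, factor=2):
    """Sum the responses of several filter sets applied to the upsampled signal."""
    up = upsample_zero_insert(x, factor)
    out = [0.0] * len(up)
    for kernel in kernels:
        for i, v in enumerate(conv_same(up, kernel)):
            out[i] += v
    return out
```

With the single low-pass kernel `[0.5, 1.0, 0.5]` this reduces to linear interpolation. Adding a second kernel such as `[-0.25, 0.5, -0.25]` contributes a high-frequency residual on top of it, which is one way to read "each set handles some of the spectral components".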

Motion Network

Besides camera motion, a 3D motion field of moving objects is recovered in a coarse-to-fine framework through a warping approach. The unsupervised learning of the motion field is further improved by introducing an outlier-aware regularization loss.
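To make the role of depth, camera motion, and the object motion field in warping more concrete, the sketch below reprojects a single pixel from one view into another with a pinhole camera model. The function name `reproject`, the intrinsics tuple `(fx, fy, cx, cy)`, and the choice to add the object motion as a per-pixel 3D translation after the rigid camera transform are assumptions; the README does not specify how the motion network parameterizes or applies its output.

```python
def reproject(u, v, depth, K, R, t, obj_motion=(0.0, 0.0, 0.0)):
    """Return the pixel location of (u, v) in the other view.

    K: (fx, fy, cx, cy) pinhole intrinsics (hypothetical layout).
    R, t: 3x3 rotation (list of rows) and 3-vector camera translation.
    obj_motion: extra 3D translation of the scene point at this pixel,
    zero for static background.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point using its depth.
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Apply the rigid camera motion, then the per-pixel object translation.
    Y = [sum(R[i][j] * X[j] for j in range(3)) + t[i] + obj_motion[i]
         for i in range(3)]
    if Y[2] <= 0.0:
        raise ValueError("point is behind the camera after motion")
    # Project back onto the image plane.
    return fx * Y[0] / Y[2] + cx, fy * Y[1] / Y[2] + cy
```

Sampling the other image at the returned coordinates for every pixel yields a warped image that can be compared photometrically with the original. For a static pixel only `R` and `t` matter, whereas a pixel on a moving object needs the additional `obj_motion` term to be warped correctly.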

Depth Prediction Results

| Method | Semantic Prior | KITTI Testing Set (Eigen split) | Cityscapes Testing Set | Model Size (M) |
|---|---|---|---|---|
| Monodepth2 (ICCV19) | | 0.115 | - | 14.84 |
| PackNet (CVPR20) | | 0.111 | - | 128.29 |
| Lee et al. (ICCV21) | | 0.114 | 0.116 | 22.77 |
| Lee et al. (AAAI21) | | 0.112 | 0.111 | 14.84 |
| RM-Depth (CVPR22), updated results | | 0.107 (trained on K) (predictions), 0.105 (trained on CS+K) (predictions) | 0.090 (predictions) | 2.97 |
| RM-Depth (CVPR22), 1024 x 320 | | 0.106 (predictions) | 0.088 (predictions) | 2.97 |

Code Package

Please contact Dr. T.-W. Hui (e-mail address given on the first page of the paper) regarding academic research or commercial collaborations.

License and Citation

This software and associated documentation files (the "Software"), and the research paper (RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes) including but not limited to the figures, and tables (the "Paper") are provided for academic research purposes only and without any warranty. Any commercial use requires my consent. When using any parts of the Software or the Paper in your work, please cite the following paper:

@InProceedings{hui22rmdepth,
 author = {Tak-Wai Hui},
 title = {RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes},
 booktitle = {Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 pages = {1675--1684},
 year = {2022}
}
