Digging Into Self-Supervised Monocular Depth Estimation

Clément Godard, Oisin Mac Aodha, Michael Firman, Gabriel Brostow

Ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods.

We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.

Introduction

We seek to automatically infer a dense depth image from a single color input image.

Estimating absolute, or even relative depth, seems ill-posed without a second input image to enable triangulation. Yet, humans learn from navigating and interacting in the real world, enabling us to hypothesize plausible depth estimates for novel scenes [18].

Generating high quality depth-from-color is attractive because it could inexpensively complement LIDAR sensors used in self-driving cars, and enable new single-photo applications such as image-editing and AR-compositing. Solving for depth is also a powerful way to use large unlabeled image datasets for the pretraining of deep networks for downstream discriminative tasks [23].

However, collecting large and varied training datasets with accurate ground truth depth for supervised learning [55, 9] is itself a formidable challenge. As an alternative, several recent self-supervised approaches have shown that it is instead possible to train monocular depth estimation models using only synchronized stereo pairs [12, 15] or monocular video [76].

Figure: Depth from a single image. A single self-supervised model, Monodepth2, produces sharp, high quality depth maps, whether trained with monocular (M), stereo (S), or joint (MS) supervision. Panels: Input; Monodepth2 (M); Monodepth2 (S); Monodepth2 (MS); Zhou et al. [76] (M); Monodepth [15] (S); Zhan et al. [73] (MS); DDVO [62] (M); Ranjan et al. [51] (M); EPC++ [38] (MS).

Among the two self-supervised approaches, monocular video is an attractive alternative to stereo-based supervision, but it introduces its own set of challenges. In addition to estimating depth, the model also needs to estimate the ego-motion between temporal image pairs during training. This typically involves training a pose estimation network that takes a finite sequence of frames as input, and outputs the corresponding camera transformations. Conversely, using stereo data for training makes the camera-pose estimation a one-time offline calibration, but can cause issues related to occlusion and texture-copy artifacts [15].
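To make "outputs the corresponding camera transformations" concrete, the following is a minimal sketch of how a 6-DoF network output could be turned into a rigid transformation. The axis-angle parameterization and the function names `pose_vec_to_matrix` and `transform_point` are assumptions made for illustration; the text above does not specify how the pose network encodes its output.

```python
import math

def pose_vec_to_matrix(axis_angle, translation):
    """Build a 4x4 rigid transform from an axis-angle rotation and a translation.

    Assumption: the pose network emits 3 rotation and 3 translation values.
    The rotation is converted with Rodrigues' formula.
    """
    rx, ry, rz = axis_angle
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)
    if theta < 1e-8:
        # Negligible rotation: fall back to the identity matrix.
        R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    else:
        x, y, z = rx / theta, ry / theta, rz / theta
        c, s = math.cos(theta), math.sin(theta)
        C = 1.0 - c
        R = [
            [c + x * x * C, x * y * C - z * s, x * z * C + y * s],
            [y * x * C + z * s, c + y * y * C, y * z * C - x * s],
            [z * x * C - y * s, z * y * C + x * s, c + z * z * C],
        ]
    tx, ty, tz = translation
    return [R[0] + [tx], R[1] + [ty], R[2] + [tz], [0.0, 0.0, 0.0, 1.0]]

def transform_point(T, p):
    """Apply the 4x4 transform T to a 3D point p."""
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))

# A quarter turn about the z axis followed by a unit shift along z.
T = pose_vec_to_matrix((0.0, 0.0, math.pi / 2), (0.0, 0.0, 1.0))
p = transform_point(T, (1.0, 0.0, 0.0))  # approximately (0.0, 1.0, 1.0)
```

In the stereo case described above, this transformation would not be predicted at all: it is fixed by the one-time offline calibration of the camera rig.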

We propose three architectural and loss innovations that, combined, lead to large improvements in monocular depth estimation when training with monocular video, stereo pairs, or both: (1) A novel appearance matching loss to address the problem of occluded pixels that occur when using monocular supervision. (2) A novel and simple auto-masking approach to ignore pixels where no relative camera motion is observed in monocular training. (3) A multi-scale appearance matching loss that performs all image sampling at the input resolution, leading to a reduction in depth artifacts. Together, these contributions yield state-of-the-art monocular and stereo self-supervised depth estimation results on the KITTI dataset [13], and simplify many components found in the existing top performing models.

Figure: Monocular methods can fail to predict depth for objects that were often observed to be in motion during training, such as moving cars, including methods which explicitly model motion [71, 38, 51]. Our method succeeds here where others, and our baseline with our contributions turned off, fail. Panels: Input; Geonet [71] (M); Ranjan [51] (M); EPC++ [38] (MS); Baseline (M); Monodepth2 (M).

Related Work

We review models that, at test time, take a single color image as input and predict the depth of each pixel as output.

Supervised Depth Estimation

Estimating depth from a single image is an inherently ill-posed problem as the same input image can project to multiple plausible depths.

To address this, learning based methods have shown themselves capable of fitting predictive models that exploit the relationship between color images and their corresponding depth. Various approaches, such as combining local predictions [19, 55], non-parametric scene sampling [24], through to end-to-end supervised learning [9, 31, 10] have been explored. Learning based algorithms are also among some of the best performing for stereo estimation [72, 42, 60, 25] and optical flow [20, 63].

Many of the above methods are fully supervised, requiring ground truth depth during training.

However, this is challenging to acquire in varied real-world settings. As a result, there is a growing body of work that exploits weakly supervised training data, in the form of known object sizes [66], sparse ordinal depths [77, 6], supervised appearance matching terms [72, 73], or unpaired synthetic depth data [45, 2, 16, 78], all while still requiring the collection of additional depth or other annotations. Synthetic training data is an alternative [41], but it is not trivial to generate large amounts of synthetic data containing varied real-world appearance and motion.

Recent work has shown that conventional structure-from-motion (SfM) pipelines can generate sparse training signal for both camera pose and depth [35, 28, 68], where SfM is typically run as a pre-processing step decoupled from learning. Recently, [65] built upon our model by incorporating noisy depth hints from traditional stereo algorithms, improving depth predictions.

Self-Supervised Depth Estimation

In the absence of ground truth depth, one alternative is to train depth estimation models using image reconstruction as the supervisory signal.
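As a rough illustration of image reconstruction as a supervisory signal, the sketch below combines two ideas named among the contributions: a per-pixel minimum over reprojection errors and an auto-mask. It rests on several assumptions that go beyond the text here: the source frames are assumed to be already warped into the target view using predicted depth and pose, the photometric error is a plain absolute difference, and a pixel is masked when the unwarped source frame already matches the target at least as well as the warped one. All function names are hypothetical.

```python
def abs_error(img_a, img_b):
    """Per-pixel absolute difference between two equally sized grayscale images."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]

def pixelwise_min(error_maps):
    """Per-pixel minimum over a list of equally sized error maps."""
    return [[min(vals) for vals in zip(*rows)] for rows in zip(*error_maps)]

def min_reprojection_loss(target, warped_sources, raw_sources):
    """Return (loss, mask) for one target frame.

    warped_sources: source frames resampled into the target view (assumed given).
    raw_sources: the same source frames without any warping.
    """
    # Taking the minimum rather than the mean lets a pixel that is occluded
    # in one source frame be explained by another frame where it is visible.
    reproj = pixelwise_min([abs_error(target, w) for w in warped_sources])
    # Error obtained by assuming no camera or object motion at all.
    identity = pixelwise_min([abs_error(target, s) for s in raw_sources])
    # Keep a pixel only if warping explains it better than doing nothing.
    mask = [[r < i for r, i in zip(r_row, i_row)]
            for r_row, i_row in zip(reproj, identity)]
    kept = [r for r_row, m_row in zip(reproj, mask)
            for r, m in zip(r_row, m_row) if m]
    loss = sum(kept) / len(kept) if kept else 0.0
    return loss, mask

# One-row toy images: pixel 0 is well explained only by the previous frame,
# pixel 1 only by the next frame, and pixel 2 looks static in the raw frames.
target = [[0.5, 0.5, 0.5]]
warped = [[[0.5, 0.9, 0.1]], [[0.9, 0.5, 0.2]]]
raw = [[[0.8, 0.8, 0.5]], [[0.8, 0.8, 0.5]]]
loss, mask = min_reprojection_loss(target, warped, raw)
```

In this toy case the mask keeps the first two pixels and drops the third, so the apparently static pixel does not contribute to the loss. Averaging the two warped errors instead of taking their minimum would penalize the first two pixels even though each is matched exactly by one source frame.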

