Depth Map Prediction from a Single Image using a Multi-Scale Deep Network - New York University




David Eigen, Christian Puhrsch, Rob Fergus
Dept. of Computer Science, Courant Institute, New York University

Abstract

Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally.

We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.

1 Introduction

Estimating depth is an important component of understanding geometric relations within a scene. In turn, such relations help provide richer representations of objects and their environment, often leading to improvements in existing recognition tasks [18], as well as enabling many further applications such as 3D modeling [16, 6], physics and support models [18], robotics [4, 14], and potentially reasoning about occlusions. While there is much prior work on estimating depth based on stereo images or motion [17], there has been relatively little on estimating depth from a single image.
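A scale-invariant error of the kind described compares predicted and ground-truth depths in log space after cancelling any global offset between them. The sketch below is a minimal illustration of that general form (the paper's exact loss weighting may differ), using made-up depth values:

```python
import numpy as np

def scale_invariant_error(pred, gt):
    """Scale-invariant mean squared error in log space.

    Subtracting the mean log-difference removes any global scale factor
    between prediction and ground truth, so only relative depth
    relations between pixels are penalized.
    """
    d = np.log(pred) - np.log(gt)             # per-pixel log difference
    return np.mean(d ** 2) - np.mean(d) ** 2  # = variance of d

pred = np.array([1.0, 2.0, 4.0])
gt = np.array([1.0, 2.0, 3.0])

e1 = scale_invariant_error(pred, gt)
e2 = scale_invariant_error(10.0 * pred, gt)   # globally rescaled prediction
# e1 and e2 are equal: a constant factor shifts every d_i by log(10),
# which the mean subtraction cancels out.
```

Because the error is the variance of the log-differences, multiplying the prediction by any constant leaves it unchanged, which is exactly the invariance motivated in the text.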

Yet the monocular case often arises in practice: potential applications include better understandings of the many images distributed on the web and social media outlets, real estate listings, and shopping sites. These include many examples of both indoor and outdoor scenes. There are likely several reasons why the monocular case has not yet been tackled to the same degree as the stereo one. Provided accurate image correspondences, depth can be recovered deterministically in the stereo case [5]. Thus, stereo depth estimation can be reduced to developing robust image point correspondences, which can often be found using local appearance features. By contrast, estimating depth from a single image requires the use of monocular depth cues such as line angles and perspective, object sizes, image position, and atmospheric effects.

Furthermore, a global view of the scene may be needed to relate these effectively, whereas local disparity is sufficient for stereo. In addition, the task is inherently ambiguous, and a technically ill-posed problem: given an image, an infinite number of possible world scenes may have produced it. Of course, most of these are physically implausible for real-world spaces, and thus the depth may still be predicted with considerable accuracy. At least one major ambiguity remains, though: the global scale. Although extreme cases (such as a normal room versus a dollhouse) do not exist in the data, moderate variations in room and furniture sizes are present. We address this using a scale-invariant error in addition to more common scale-dependent errors.

This focuses attention on the spatial relations within a scene rather than general scale, and is particularly apt for applications such as 3D modeling, where the model is often rescaled during postprocessing. In this paper we present a new approach for estimating depth from a single image. We directly regress on the depth using a neural network with two components: one that first estimates the global structure of the scene, then a second that refines it using local information. The network is trained using a loss that explicitly accounts for depth relations between pixel locations, in addition to pointwise error. Our system achieves state-of-the-art estimation rates on NYU Depth and KITTI, as well as improved qualitative outputs.

2 Related Work

Directly related to our work are several approaches that estimate depth from a single image.

Saxena et al. [15] predict depth from a set of image features using linear regression and an MRF, and later extend their work into the Make3D [16] system for 3D model generation. However, the system relies on horizontal alignment of images, and suffers in less controlled settings. Hoiem et al. [6] do not predict depth explicitly, but instead categorize image regions into geometric structures (ground, sky, vertical), which they use to compose a simple 3D model of the scene. More recently, Ladicky et al. [10] show how to integrate semantic object labels with monocular depth features to improve performance; however, they rely on handcrafted features and use superpixels to segment the image. Karsch et al. [7] use a kNN transfer mechanism based on SIFT Flow [12] to estimate depths of static backgrounds from single images, which they augment with motion information to better estimate moving foreground subjects in videos.

This can achieve better alignment, but requires the entire dataset to be available at runtime and performs expensive alignment procedures. By contrast, our method learns an easier-to-store set of network parameters, and can be applied to images in real-time. More broadly, stereo depth estimation has been extensively investigated. Scharstein et al. [17] provide a survey and evaluation of many methods for 2-frame stereo correspondence, organized by matching, aggregation and optimization techniques. In a creative application of multiview stereo, Snavely et al. [20] match across views of many uncalibrated consumer photographs of the same scene to create accurate 3D reconstructions of common landmarks. Machine learning techniques have also been applied in the stereo case, often obtaining better results while relaxing the need for careful camera alignment [8, 13, 21, 19].

Most relevant to this work is Konda et al. [8], who train a factored autoencoder on image patches to predict depth from stereo sequences; however, this relies on the local displacements provided by stereo. There are also several hardware-based solutions for single-image depth estimation. Levin et al. [11] perform depth from defocus using a modified camera aperture, while the Kinect and Kinect v2 use active stereo and time-of-flight to capture depth. Our method makes indirect use of such sensors to provide ground truth depth targets during training; however, at test time our system is purely software-based, predicting depth from RGB images.

3 Model Architecture

Our network is made of two component stacks, shown in Fig. 1. A coarse-scale network first predicts the depth of the scene at a global level. This is then refined within local regions by a fine-scale network. Both stacks are applied to the original input, but in addition, the coarse network's output is passed to the fine network as additional first-layer image features. In this way, the local network can edit the global prediction to incorporate finer-scale details.

3.1 Global Coarse-Scale Network

The task of the coarse-scale network is to predict the overall depth map structure using a global view of the scene. The upper layers of this network are fully connected, and thus contain the entire image in their field of view. Similarly, the lower and middle layers are designed to combine information from different parts of the image through max-pooling operations to a small spatial dimension.
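To see how the pooling stages reach a small spatial dimension before the fully connected layers, one can trace a single feature-map axis through a stack of valid convolutions and poolings. The kernel and stride values below follow the coarse-stack labels in the figure; the input width and layer ordering are illustrative assumptions, not necessarily the paper's exact configuration:

```python
def out_size(n, kernel, stride=1, pad=0):
    """Output spatial size of a conv/pool layer (floor convention)."""
    return (n + 2 * pad - kernel) // stride + 1

# Trace one spatial axis through a coarse-stack-like pipeline.
n = 304                        # hypothetical input width
n = out_size(n, 11, stride=4)  # 11x11 conv, stride 4 -> 74
n = out_size(n, 2, stride=2)   # 2x2 max pool         -> 37
n = out_size(n, 5)             # 5x5 conv             -> 33
n = out_size(n, 2, stride=2)   # 2x2 max pool         -> 16
```

Each stage shrinks the map, so the fully connected layers at the top operate on a compact summary of the entire image.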

In so doing, the network is able to integrate a global understanding of the full scene to predict the depth. Such an understanding is needed in the single-image case to make effective use of cues such as vanishing points, object locations, and room alignment.

[Figure 1: Model architecture. Coarse stack (Coarse 1-7): 11x11 conv with stride 4 and 2x2 pool (96 maps), 5x5 conv with 2x2 pool (256 maps), 3x3 conv layers (384, 384, 256 maps), then fully connected layers (4096 units) producing the coarse depth output. Fine stack (Fine 1-4): 9x9 conv with stride 2 and 2x2 pool (63 maps), concatenation with the coarse output (64 maps), and 5x5 conv layers producing the refined depth map.]
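Passing the coarse network's output to the fine network as additional first-layer image features amounts to a channel-wise concatenation. A minimal sketch with hypothetical channel counts and spatial sizes (the 63 + 1 split mirrors the figure's "Concatenate" step):

```python
import numpy as np

# Hypothetical shapes: (channels, height, width).
fine_features = np.random.rand(63, 55, 74)  # fine stack's first-layer feature maps
coarse_pred = np.random.rand(1, 55, 74)     # coarse network's predicted depth map

# The coarse prediction enters the fine stack as one extra feature channel,
# letting the local network edit the global prediction.
combined = np.concatenate([fine_features, coarse_pred], axis=0)
print(combined.shape)  # (64, 55, 74)
```

Because the coarse prediction is just another input channel, the fine network needs no special machinery to condition on it; ordinary convolutions mix it with the image features.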

