4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks

Christopher Choy, JunYoung Gwak, Silvio Savarese

Abstract: In many robotics and VR/AR applications, 3D-videos are readily-available input sources (a sequence of depth images, or LIDAR scans). However, in many cases, the 3D-videos are processed frame-by-frame either through 2D convnets or 3D perception algorithms. In this work, we propose 4-dimensional convolutional neural networks for spatio-temporal perception that can directly process such 3D-videos using high-dimensional convolutions. For this, we adopt sparse tensors [8, 9] and propose generalized sparse convolutions that encompass all discrete convolutions. To implement the generalized sparse convolution, we create an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks. We create 4D spatio-temporal convolutional neural networks using the library and validate them on various 3D semantic segmentation benchmarks and proposed 4D datasets for 3D-video perception.


To overcome challenges in the 4D space, we propose the hybrid kernel, a special case of the generalized sparse convolution, and trilateral-stationary conditional random fields that enforce spatio-temporal consistency in the 7D space-time-chroma space. Experimentally, we show that a convolutional neural network with only generalized 3D sparse convolutions can outperform 2D or 2D-3D hybrid methods by a large margin. Also, we show that, on 3D-videos, 4D spatio-temporal convolutional neural networks are robust to noise and outperform the 3D convolutional neural network.

1. Introduction

In this work, we are interested in 3D-video perception. A 3D-video is a temporal sequence of 3D scans, such as a video from a depth camera, a sequence of LIDAR scans, or multiple MRI scans of the same object or body part (Fig. 1). As LIDAR scanners and depth cameras become more affordable and widely used for robotics applications, 3D-videos have become readily-available sources of input for robotics systems or AR/VR applications. (Footnote 1: At the time of submission, we achieved the best performance on ScanNet [5].)

Figure 1: An example of 3D video: 3D scenes at different time steps.

Figure 2: 2D projections of hypercubes in various dimensions (1D: line, 2D: square, 3D: cube, 4D: tesseract).

However, there are many technical challenges in using 3D-videos for high-level perception tasks. First, 3D data requires a heterogeneous representation and processing that either alienates users or makes it difficult to integrate into larger systems. Second, the performance of 3D convolutional neural networks is worse than or on par with that of 2D convolutional neural networks. Third, the limited number of open-source libraries for fast large-scale 3D data processing is another hurdle.

To resolve most, if not all, of the challenges in high-dimensional perception, we adopt a sparse tensor [8, 9] for our problem, propose generalized sparse convolutions for sparse tensors, and publish an open-source auto-differentiation library for sparse tensors with comprehensive standard neural network functions. We adopt the sparse representation for a few reasons.

Currently, there are various concurrent works for 3D perception: a dense 3D convolution [5], PointNet variants [23, 24], continuous convolutions [12, 16], surface convolutions [21, 30], and an octree convolution [25]. Out of these representations, we chose a sparse tensor due to its expressiveness and generalizability for high-dimensional spaces. Also, it allows homogeneous data representation within traditional neural network libraries, since most of them support sparse tensors. Second, the sparse convolution closely resembles the standard convolution, which is proven to be successful in 2D perception as well as 3D reconstruction [4], feature learning [34], and semantic segmentation [5]. As the generalized sparse convolution is a direct high-dimensional extension of the standard 2D convolution, we can re-purpose all architectural innovations, such as residual connections, batch normalization, and many others, with little to no modification for high-dimensional networks.
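For concreteness, the generalized sparse convolution can be sketched as follows; this formula is our reconstruction, and the symbols (the input/output coordinate sets C^in, C^out and the offset set N^D) are our own labels rather than notation from this excerpt:

    x^{out}_{u} = \sum_{i \in N^D(u, C^{in})} W_i \, x^{in}_{u+i}  \quad \text{for } u \in C^{out},

where N^D(u, C^in) = { i | u + i \in C^in } keeps only the kernel offsets whose neighbors actually exist in the input. When C^out is a dense grid and the offsets form a full hypercube, this reduces to the ordinary discrete convolution, which is the sense in which it encompasses all discrete convolutions.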

Third, the sparse convolution is efficient and fast. It only computes outputs for predefined coordinates and saves them into a compact sparse tensor. It saves both memory and computation, especially for 3D scans or high-dimensional data where most of the space is empty. With the proposed library, we create the first large-scale 3D/4D networks (footnote 3) and name them Minkowski networks after the space-time continuum, Minkowski space, in physics.

However, even with the efficient representation, merely scaling the 3D convolution to high-dimensional spaces results in significant computational overhead and memory consumption due to the curse of dimensionality. A 2D convolution with kernel size 5 requires 5^2 = 25 weights, which increases exponentially to 5^3 = 125 in 3D and 5^4 = 625 in 4D (Fig. 2). This exponential increase, however, does not necessarily translate to better performance and slows down the network significantly. To overcome this challenge, we propose custom kernels with non-(hyper)-cubic shapes.
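The weight-count arithmetic is easy to verify; a minimal Python sketch (the function name is ours):

    # Number of weights in a hypercubic convolution kernel of size k in D dimensions.
    def hypercubic_kernel_weights(k: int, dim: int) -> int:
        return k ** dim

    for dim in (2, 3, 4):
        print(dim, hypercubic_kernel_weights(5, dim))  # 25, 125, 625: exponential in dim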

Finally, the 4D spatio-temporal predictions are not necessarily consistent throughout space and time. To enforce consistency, we propose conditional random fields defined in a 7D trilateral space (space-time-color) with a stationary consistency function. We use variational inference to convert the conditional random field to differentiable recurrent layers, which can be implemented as a 7D Minkowski network, and train both the 4D and 7D networks end-to-end.

Experimentally, we use various 3D benchmarks that cover both indoor [5, 2] and outdoor spaces [28, 26]. First, we show that a pure 3D method without a 2D convolutional neural network can outperform 2D or hybrid deep-learning algorithms by a large margin. Also, we create 4D datasets from Synthia [28] and Varcity [26] and report ablation studies of temporal components.

2. Related Work

4D spatio-temporal perception fundamentally requires 3D perception, as a slice of 4D along the temporal dimension is a 3D scan. However, as there are no previous works on 4D perception using neural networks, we will primarily cover 3D perception, specifically 3D segmentation using neural networks.

We categorize all previous works in 3D as either (a) 3D convolutional neural networks or (b) neural networks without 3D convolutions. Finally, we cover early 4D perception methods. Although 2D videos are spatio-temporal data, we will not cover them in this paper, as 3D perception requires radically different data processing, implementation, and neural network architectures. (Footnote 3: At the time of submission, our proposed method was the first very deep 3D convolutional neural network with more than 20 layers. It outperformed all algorithms on the ScanNet benchmark, including the best peer-reviewed work [6], by 19% mIoU at the time of submission.)

3D convolutional neural networks: The first branch of 3D convolutional neural networks uses a rectangular grid and a dense representation [31, 5], where the empty space is represented either as 0 or as the signed distance function. This straightforward representation is intuitive and is supported by all major public neural network libraries. However, as most of the space in 3D scans is empty, it suffers from high memory consumption and slow computation.
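To get a sense of the scale of this waste, here is a rough back-of-the-envelope sketch; the grid size, channel count, and 1% occupancy figure are our own illustrative assumptions, not numbers from the paper:

    # A dense 3D grid stores every cell, occupied or not.
    side = 512                    # voxels per side (illustrative room-scale scan)
    channels = 32                 # feature channels
    bytes_per_float = 4

    dense_bytes = side ** 3 * channels * bytes_per_float
    print(f"dense grid:    {dense_bytes / 1e9:5.1f} GB")   # ~17.2 GB

    # If only ~1% of voxels lie on observed surfaces, a sparse tensor stores
    # just those coordinates (3 x int32) and their features.
    occupied = int(0.01 * side ** 3)
    sparse_bytes = occupied * (3 * 4 + channels * bytes_per_float)
    print(f"sparse tensor: {sparse_bytes / 1e9:5.2f} GB")  # ~0.19 GB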

To resolve this, OctNet [25] proposed to use the octree structure to represent 3D space and convolutions on it.

The second branch is sparse 3D convolutional neural networks [29, 9]. There are two quantization methods used for high dimensions: a rectangular grid and a permutohedral lattice [1]. [29] used a permutohedral lattice, whereas [9] used a rectangular grid for 3D classification and semantic segmentation.

The last branch is 3D pseudo-continuous convolutional neural networks [12, 16]. Unlike the previous works, they define convolutions using continuous kernels in a continuous space. However, finding neighbors in a continuous space is expensive, as it requires a KD-tree search rather than a hash table, and such methods are susceptible to an uneven distribution of point clouds.
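The hash-table point is worth making concrete: once coordinates are quantized to a grid, a neighborhood query is a handful of O(1) hash lookups rather than a tree traversal. A minimal illustration of the idea (our own sketch, not the library's API):

    import itertools

    # Quantized 3D coordinates of occupied voxels (toy data).
    coords = {(0, 0, 0), (1, 0, 0), (0, 2, 1)}   # a hash set: O(1) membership tests

    def grid_neighbors(u, kernel_size=3):
        """Yield the occupied neighbors of voxel u inside a hypercubic kernel."""
        r = kernel_size // 2
        for offset in itertools.product(range(-r, r + 1), repeat=len(u)):
            v = tuple(a + b for a, b in zip(u, offset))
            if v in coords:                       # constant-time lookup
                yield offset, v

    print(list(grid_neighbors((0, 0, 0))))        # finds (0,0,0) and (1,0,0)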

Neural networks without 3D convolutions: Recently, we saw a tremendous increase in neural networks without 3D convolutions for 3D perception. Since 3D scans consist of thin observable surfaces, [21, 30] proposed to use 2D convolutions on the surface for semantic segmentation.

Another direction is PointNet-based methods [23, 24]. PointNets use a set of input coordinates as features for a multi-layer perceptron. However, this approach processes a limited number of points, and thus a sliding window for cropping out a section from an input was used for large spaces, making the receptive field size rather limited. [15] tried to resolve such shortcomings with a recurrent network on top of multiple PointNets, and [16] proposed a variant of 3D continuous convolution for the lower layers of a PointNet and got a significant performance boost.
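As background, the core PointNet idea is a multi-layer perceptron shared across points followed by an order-invariant pooling; a minimal PyTorch-style sketch of that idea (ours, not the original architecture):

    import torch
    import torch.nn as nn

    class TinyPointNet(nn.Module):
        """Shared per-point MLP + global max-pool: the core PointNet idea."""
        def __init__(self, num_classes: int = 10):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                     nn.Linear(64, 128), nn.ReLU())
            self.head = nn.Linear(128, num_classes)

        def forward(self, points):                # points: (batch, num_points, 3)
            feats = self.mlp(points)              # same weights for every point
            pooled = feats.max(dim=1).values      # order-invariant global feature
            return self.head(pooled)

    logits = TinyPointNet()(torch.rand(2, 1024, 3))  # any number of points works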

4D perception: The first 4D perception algorithm [19] proposed a dynamic deformable balloon model for 4D cardiac image analysis. Later, [17] used 4D Markov random fields for cardiac segmentation. Recently, [35] combined a 3D-UNet for spatial data with a 1D-AutoEncoder for temporal data and applied the model to auto-encoding brain images.

In this paper, we propose the first convolutional neural networks for high-dimensional spaces, including the 4D spatio-temporal data, or 3D videos. Compared with other approaches that combine temporal data with a recurrent neural network or a shallow model, our networks use a homogeneous representation, convolutions, and other neural network operations consistently throughout the network. Specifically, convolutions are proven to be effective in numerous 2D/3D spatial perception tasks as well as in temporal or sequence modeling [3].

3. Sparse Tensor and Convolution

In traditional speech, text, or image data, features are extracted densely. However, for 3-dimensional scans, such a dense representation is inefficient, since most of the space is empty. Instead, we can save only the non-empty space as its coordinates and the associated features. This representation is an N-dimensional extension of a sparse matrix; in particular, we follow the COO format [32], as it is efficient for neighborhood queries. The last axis is reserved for the batch indices to dissociate points at the same location in different batches [9].
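A minimal sketch of this batched COO layout (our own illustration; the library's actual API may differ): an (N, D+1) integer array of coordinates whose last column is the batch index, paired with an (N, C) feature array.

    import numpy as np

    # Batched COO sparse tensor: N = 3 points, D = 3 spatial dims, C = 2 channels.
    # The last coordinate column is the batch index, which dissociates points
    # at the same spatial location in different batch entries.
    coordinates = np.array([[0, 0, 0, 0],
                            [1, 2, 0, 0],
                            [0, 0, 0, 1]], dtype=np.int32)  # same (0,0,0), batch 1
    features = np.array([[0.5, 1.0],
                         [0.2, 0.3],
                         [0.9, 0.1]], dtype=np.float32)

    # A hash map from coordinate tuples to feature rows enables the fast
    # neighborhood queries the COO format is chosen for.
    index = {tuple(c): i for i, c in enumerate(coordinates)}
    print(features[index[(0, 0, 0, 1)]])          # -> [0.9 0.1]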

