Transcription of Point Transformer
Point Transformer

Hengshuang Zhao (1,2), Li Jiang (3), Jiaya Jia (3), Philip Torr (1), Vladlen Koltun (4)
1 University of Oxford; 2 The University of Hong Kong; 3 The Chinese University of Hong Kong; 4 Intel Labs

Abstract

Self-attention networks have revolutionized natural language processing and are making impressive strides in image analysis tasks such as image classification and object detection. Inspired by this success, we investigate the application of self-attention networks to 3D point cloud processing. We design self-attention layers for point clouds and use these to construct self-attention networks for tasks such as semantic scene segmentation, object part segmentation, and object classification. Our Point Transformer design improves upon prior work across domains and tasks.
For example, on the challenging S3DIS dataset for large-scale semantic scene segmentation, the Point Transformer attains an mIoU of on Area 5, outperforming the strongest prior model by absolute percentage points and crossing the 70% mIoU threshold for the first time.

Introduction

3D data arises in many application areas such as autonomous driving, augmented reality, and robotics. Unlike images, which are arranged on regular pixel grids, 3D point clouds are sets embedded in continuous space. This makes 3D point clouds structurally different from images and precludes immediate application of deep network designs that have become standard in computer vision, such as networks based on the discrete convolution operator.

A variety of approaches to deep learning on 3D point clouds have arisen in response to this challenge.
Some voxelize the 3D space to enable the application of 3D discrete convolutions [23, 32]. This induces massive computational and memory costs and underutilizes the sparsity of point sets in 3D. Sparse convolutional networks relieve these limitations by operating only on voxels that are not empty [9, 3]. Other designs operate directly on points and propagate information via pooling operators [25, 27] or continuous convolutions [42, 37]. Another family of approaches connects the point set into a graph for message passing [44, 19].

In this work, we develop an approach to deep learning on point clouds that is inspired by the success of transformers in natural language processing [39, 45, 5, 4, 51] and image analysis [10, 28, 54]. The transformer family of models is particularly appropriate for point cloud processing because the self-attention operator, which is at the core of transformer networks, is in essence a set operator: it is invariant to permutation and cardinality of the input elements. The application of self-attention to 3D point clouds is therefore quite natural, since point clouds are essentially sets embedded in 3D space.

Figure 1. The Point Transformer can serve as the backbone for various 3D point cloud understanding tasks such as object classification, object part segmentation, and semantic scene segmentation. [Panel labels: semantic segmentation, part segmentation, classification (airplane, lamp, bed).]

We flesh out this intuition and develop a self-attention layer for 3D point cloud processing.
Based on this layer, we construct Point Transformer networks for a variety of 3D understanding tasks. We investigate the form of the self-attention operator, the application of self-attention to local neighborhoods around each point, and the encoding of positional information in the network. The resulting networks are based purely on self-attention and pointwise operations.

We show that Point Transformers are remarkably effective in 3D deep learning tasks, both at the level of detailed object analysis and large-scale parsing of massive scenes. In particular, Point Transformers set the new state of the art on large-scale semantic segmentation on the S3DIS dataset ( mIoU on Area 5), shape classification on ModelNet40 ( overall accuracy), and object part segmentation on ShapeNetPart ( instance mIoU).
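
The observation above that self-attention is in essence a set operator can be checked numerically. The following is a minimal sketch in plain Python, assuming standard scaled dot-product self-attention as a stand-in; it is not the Point Transformer layer itself. The helper names `matmul`, `softmax`, `self_attention`, and `close` are hypothetical and chosen only for this illustration. The sketch suggests that permuting the input points permutes the outputs in the same way, that a symmetric pooling of the outputs is therefore unchanged, and that the same weights apply to sets of any size.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a set of n feature vectors (n x d).

    Assumption: this generic form stands in for the layer discussed in the text.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

def close(a, b, tol=1e-9):
    """Compare two matrices elementwise up to floating-point tolerance."""
    return all(
        math.isclose(x, y, abs_tol=tol)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )

rng = random.Random(0)
d = 4
w_q, w_k, w_v = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)] for _ in range(3))
x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(6)]

# Build a random reordering of the same six points.
perm = list(range(len(x)))
rng.shuffle(perm)
x_perm = [x[i] for i in perm]

out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x_perm, w_q, w_k, w_v)

# Equivariance: permuting the inputs permutes the outputs identically.
equivariant = close([out[i] for i in perm], out_perm)

# Invariance: a symmetric pooling (here a per-channel max) ignores the order.
pooled = [max(col) for col in zip(*out)]
pooled_perm = [max(col) for col in zip(*out_perm)]
invariant = close([pooled], [pooled_perm])

# Cardinality: the same weights handle a set of a different size.
x_large = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(20)]
out_large = self_attention(x_large, w_q, w_k, w_v)

print(equivariant, invariant, len(out_large))
```

Strictly speaking, the per-point outputs are permutation-equivariant; invariance appears once a symmetric reduction such as max or mean pooling is applied on top.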
Our full implementation and trained models will be released upon acceptance. In summary, our main contributions include the following.

- We design a highly expressive Point Transformer layer for point cloud processing. The layer is invariant to permutation and cardinality and is thus inherently suited to point cloud processing.
- Based on the Point Transformer layer, we construct high-performing Point Transformer networks for classification and dense prediction on point clouds. These networks can serve as general backbones for 3D scene understanding.
- We report extensive experiments over multiple domains and datasets. We conduct controlled studies to examine specific choices in the Point Transformer design and set the new state of the art on multiple highly competitive benchmarks, outperforming long lines of prior work.

Related Work

For 2D image understanding, pixels are placed in regular grids and can be processed with classical convolution. In contrast, 3D point clouds are unordered and scattered in 3D space: they are essentially sets.
Learning-based approaches to processing 3D point clouds can be classified into the following types: projection-based, voxel-based, and point-based networks.

Projection-based networks. For processing irregular inputs like point clouds, an intuitive way is to transform irregular representations to regular ones. Considering the success of 2D CNNs, some approaches [34, 18, 2, 14, 16] adopt multi-view projection, where 3D point clouds are projected into various image planes. Then 2D CNNs are used to extract feature representations in these image planes, followed by multi-view feature fusion to form the final output representations. In a related approach, TangentConv [35] projects local surface geometry onto a tangent plane at every point, forming tangent images that can be processed by 2D convolution.
However, this approach heavily relies on tangent estimation. In projection-based frameworks, the geometric information inside point clouds is collapsed during the projection stage. These approaches may also underutilize the sparsity of point clouds when forming dense pixel grids on projection planes. The choice of projection planes may heavily influence recognition performance and occlusion in 3D may impede accuracy.

Voxel-based networks. An alternative approach to transforming irregular point clouds to regular representations is 3D voxelization [23, 32], followed by convolutions in 3D. When applied naively, this strategy can incur massive computation and memory costs due to the cubic growth in the number of voxels as a function of resolution. The solution is to take advantage of sparsity, as most voxels are usually unoccupied.
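
To make the cubic growth and the sparsity argument concrete, the following is a minimal sketch in plain Python. It assumes a synthetic point cloud sampled on the unit sphere as a stand-in for a scanned surface; the function names `sample_sphere_surface` and `occupied_voxels`, the point count, and the chosen resolutions are all hypothetical. It counts how many cells a dense grid needs at each resolution and how many of them actually contain a point.

```python
import math
import random

def sample_sphere_surface(n, rng):
    """Sample n points on the unit sphere (assumed stand-in for a surface scan)."""
    points = []
    for _ in range(n):
        x, y, z = (rng.gauss(0, 1) for _ in range(3))
        norm = math.sqrt(x * x + y * y + z * z)
        points.append((x / norm, y / norm, z / norm))
    return points

def occupied_voxels(points, resolution):
    """Map points in [-1, 1]^3 to the set of voxel indices they fall into."""
    occupied = set()
    for point in points:
        # Scale each coordinate to [0, resolution] and clamp the upper edge.
        index = tuple(
            min(int((c + 1.0) / 2.0 * resolution), resolution - 1) for c in point
        )
        occupied.add(index)
    return occupied

rng = random.Random(0)
points = sample_sphere_surface(2000, rng)

# For each resolution: (dense voxel count, occupied voxel count, occupied fraction).
stats = {}
for resolution in (16, 32, 64):
    total = resolution ** 3
    occupied = len(occupied_voxels(points, resolution))
    stats[resolution] = (total, occupied, occupied / total)
    print(f"resolution={resolution:3d}  dense voxels={total:7d}  "
          f"occupied={occupied:5d}  fraction={occupied / total:.4f}")
```

Doubling the resolution multiplies the dense voxel count by eight, while the number of occupied voxels can never exceed the number of points. That gap is what sparsity-aware methods exploit.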
For example, OctNet [29] uses unbalanced octrees with hierarchical partitions. Approaches based on sparse convolutions, where the convolution kernel is only evaluated at occupied voxels, can further reduce computation and memory requirements [9, 3]. These methods have demonstrated good accuracy but may still lose geometric detail due to quantization onto the voxel grid.

Point-based networks. Rather than projecting or quantizing irregular point clouds onto regular grids in 2D or 3D, researchers have designed deep network structures that ingest point clouds directly, as sets embedded in continuous space. PointNet [25] utilizes permutation-invariant operators such as pointwise MLPs and pooling layers to aggregate features across a set. PointNet++ [27] applies these ideas within a hierarchical spatial structure to increase sensitivity to local geometric layout.
Such models can benefit from efficient sampling of the point set, and a variety of sampling strategies have been developed [27, 7, 46, 50, 11].

A number of approaches connect the point set into a graph and conduct message passing on this graph. The network in [44] performs graph convolutions on kNN graphs. The approach in [55] densely connects local neighbors. The model in [31] uses dynamic edge-conditioned filters where convolution kernels are generated based on edges inside point clouds. SPG [15] operates on a superpoint graph that represents contextual relationships. KCNet [30] utilizes kernel correlation and graph pooling. Wang et al. [40] investigate the local spectral graph convolution. GACNet [41] employs graph attention convolution and HPEIN [13] builds a hierarchical point-edge interaction architecture.
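
The graph-based methods above share a common first step: connecting each point to nearby points and then passing information along those edges. The following is a minimal sketch in plain Python, assuming a brute-force k-nearest-neighbor search and a generic per-channel max aggregation. It does not reproduce any of the cited architectures, and the names `knn_graph` and `aggregate_max`, as well as the toy points and features, are hypothetical.

```python
import math

def knn_graph(points, k):
    """For each point, return the indices of its k nearest other points.

    Brute force, O(n^2 log n); real systems would typically use a spatial index.
    """
    neighbors = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        ranked = sorted(others, key=lambda j: math.dist(p, points[j]))
        neighbors.append(ranked[:k])
    return neighbors

def aggregate_max(features, neighbors):
    """One illustrative message-passing step.

    Each point takes the per-channel max over its own feature vector and
    those of its graph neighbors.
    """
    aggregated = []
    for i, nbrs in enumerate(neighbors):
        group = [features[i]] + [features[j] for j in nbrs]
        aggregated.append([max(channel) for channel in zip(*group)])
    return aggregated

# Toy point cloud: three points near the origin and one far away.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
features = [[1.0, 8.0], [2.0, 7.0], [3.0, 6.0], [4.0, 5.0]]

neighbors = knn_graph(points, k=2)
aggregated = aggregate_max(features, neighbors)

print(neighbors)
print(aggregated)
```

Because the max is symmetric in its arguments, the aggregated feature of a point does not depend on the order in which its neighbors are listed, which is the property a set-based operator needs.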