
PCT: Point Cloud Transformer - arXiv



Meng-Hao Guo (Tsinghua), Cai (Tsinghua), Liu (Tsinghua), Mu (Tsinghua), R. Martin (Cardiff), Hu (Tsinghua)

The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing. This paper presents a novel framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is based on Transformer, which achieves huge success in natural language processing and displays great potential in image processing. It is inherently permutation invariant for processing a sequence of points, making it well-suited for point cloud learning. To better capture local context within the point cloud, we enhance input embedding with the support of farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that the PCT achieves the state-of-the-art performance on shape classification, part segmentation, semantic segmentation and normal estimation tasks.

Introduction

Extracting semantics directly from a point cloud is an urgent requirement in some applications such as robotics, autonomous driving, augmented reality, etc.

Unlike 2D images, point clouds are disordered and unstructured, making it challenging to design neural networks to process them. Qi et al. [21] pioneered PointNet for feature learning on point clouds by using multi-layer perceptrons (MLPs), max-pooling and rigid transformations to ensure invariance under permutations and rotation. Inspired by strong progress made by convolutional neural networks (CNNs) in the field of image processing, many recent works [24, 17, 1, 31] have considered how to define convolution operators that can aggregate local features for point clouds. These methods either reorder the input point sequence or voxelize the point cloud to obtain a canonical domain for convolutions.

Recently, Transformer [26], the dominant framework in natural language processing, has been applied to image vision tasks, giving better performance than popular convolutional neural networks [7, 30].

Figure 1. Attention map and part segmentation generated by PCT. First three columns: point-wise attention map for different query points (indicated by a marker), yellow to blue indicating increasing attention weight. Last column: part segmentation results.

Transformer is an encoder-decoder structure that contains three main modules for input (word) embedding, positional (order) encoding, and self-attention. The self-attention module is the core component, generating a refined attention feature for its input feature based on global context. First, self-attention takes the sum of input embedding and positional encoding as input, and computes three vectors for each word: query, key and value, through trained linear layers. Then, the attention weight between any two words can be obtained by matching (taking the dot product of) their query and key vectors. Finally, the attention feature is defined as the weighted sum of all value vectors with the attention weights. The output attention feature of each word is therefore related to all input features, making it capable of learning the global context. All operations of Transformer are parallelizable and order-independent. In theory, it can replace the convolution operation in a convolutional neural network and has better versatility.
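The three steps described above (linear projections to query, key and value; dot-product matching; weighted sum of values) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names, toy sizes and random weights are hypothetical, the 1/sqrt(d_k) scaling of the scores follows the usual Transformer formulation [26] and is an assumption here, and positional encoding is left out so that the order-independence of the remaining operations is visible.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """x holds one feature row per word (or point).

    Returns (attention features, attention weights).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))  # assumed 1/sqrt(d_k) scaling
    # Match every query with every key by dot product.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    # Each output row is the weighted sum of all value vectors.
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for trained linear-layer weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_in, d_k, d_v = 5, 6, 4, 6
x = random_matrix(n, d_in, rng)
w_q = random_matrix(d_in, d_k, rng)
w_k = random_matrix(d_in, d_k, rng)
w_v = random_matrix(d_in, d_v, rng)

out, weights = self_attention(x, w_q, w_k, w_v)

# Shuffle the input rows and compare with the shuffled original output.
perm = list(range(n))
rng.shuffle(perm)
out_perm, _ = self_attention([x[i] for i in perm], w_q, w_k, w_v)
max_diff = max(
    abs(a - b)
    for row_p, i in zip(out_perm, perm)
    for a, b in zip(row_p, out[i])
)
print(max_diff < 1e-9)
```

Shuffling the input rows shuffles the output rows in the same way (up to floating-point rounding), which is the order-independence the text refers to.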

A more detailed introduction to self-attention is given later in the paper.

arXiv preprint, 7 Jun 2021.

Inspired by the Transformer's success in vision and NLP tasks, we propose a novel framework, PCT, for point cloud learning based on the principles of the traditional Transformer. The key idea of PCT is to use the inherent order invariance of Transformer to avoid the need to define the order of point cloud data, and to conduct feature learning through the attention mechanism. As shown in Figure 1, the distribution of attention weights is highly related to part semantics, and it does not seriously attenuate with spatial distance.

Point clouds and natural language are rather different kinds of data, so our PCT framework must make several adjustments for this. These include:

• Coordinate-based input embedding module. In Transformer, a positional encoding module is applied to represent the word order in natural language. This can distinguish the same word in different positions and reflect the positional relationships between words. However, point clouds do not have a fixed order. In our PCT framework, we merge the raw positional encoding and the input embedding into a coordinate-based input embedding module. It can generate distinguishable features, since each point has unique coordinates which represent its spatial position.

• Optimized offset-attention module. The offset-attention module we propose is an effective upgrade over the original self-attention. It works by replacing the attention feature with the offset between the input of the self-attention module and the attention feature. This has two advantages. Firstly, the absolute coordinates of the same object can be completely different under rigid transformations. Therefore, relative coordinates are generally more robust. Secondly, the Laplacian matrix (the offset between the degree matrix and the adjacency matrix) has been proven to be very effective in graph convolution learning [3]. From this perspective, we regard the point cloud as a graph with the "float" adjacency matrix as the attention map. Also, the attention map in our work is scaled so that each row sums to 1, so the degree matrix can be understood as the identity matrix. Therefore, the offset-attention optimization process can be approximately understood as a Laplace process, which is discussed in detail later in the paper. In addition, we have done sufficient comparative experiments, introduced in Section 4, on offset-attention and self-attention to prove its effectiveness.

• Neighbor embedding module. Every word in a sentence contains basic semantic information. However, the independent input coordinates of the points are only weakly related to the semantic content. The attention mechanism is effective in capturing global features, but it may ignore local geometric information, which is also essential for point cloud learning. To address this problem, we use a neighbor embedding strategy to improve upon point embedding. It also assists the attention module by considering attention between local groups of points containing semantic information instead of individual points.

With the above adjustments, the PCT becomes more suitable for point cloud feature learning and achieves the state-of-the-art performance on shape classification, part segmentation and normal estimation tasks.

The main contributions of this paper are summarized as follows:

1. We propose a novel Transformer-based framework named PCT for point cloud learning, which is well suited to unstructured, disordered point cloud data with an irregular domain.
2. We propose offset-attention with an implicit Laplace operator and normalization refinement, which is inherently permutation-invariant and more suitable for point cloud learning compared to the original self-attention module in Transformer.
3. Extensive experiments demonstrate that the PCT with explicit local context enhancement achieves state-of-the-art performance on shape classification, part segmentation and normal estimation tasks.

Related Work

Transformer in NLP

Bahdanau et al. [2] proposed a neural machine translation method with an attention mechanism, in which the attention weight is computed through the hidden state of an RNN. Self-attention was proposed by Lin et al. [18] to visualize and interpret sentence embeddings. Building on these, Vaswani et al. [26] proposed Transformer for machine translation; it is based solely on self-attention, without any recurrence or convolution operators.

Devlin et al. [6] proposed the bidirectional transformers (BERT) approach, which is one of the most powerful models in the NLP field. More recently, language learning networks such as XLNet [36], Transformer-XL [5] and BioBERT [15] have further extended the Transformer framework.

However, in natural language processing, the input is in order, and each word has basic semantics, whereas point clouds are unordered, and individual points have no semantic meaning in general.

Transformer for vision

Many frameworks have introduced attention into vision tasks. Wang et al. [27] proposed a residual attention approach with stacked attention modules for image classification. Hu et al. [10] presented a novel spatial encoding unit, the SE block, whose idea was derived from the attention mechanism. Zhang et al. [38] designed SAGAN, which uses self-attention for image generation. There has also been an increasing trend to employ Transformer as a module to optimize neural networks. Wu et al. [30] proposed visual transformers that apply Transformer to token-based images from feature maps for vision tasks. Recently, Dosovitskiy et al. [7] proposed an image recognition network, ViT, based on patch encoding and Transformer, showing that with sufficient training data, Transformer provides better performance than a traditional convolutional neural network. Carion et al. [4] presented an end-to-end detection transformer that takes CNN features as input and generates bounding boxes with a Transformer encoder-decoder.

Inspired by the local patch structures used in ViT and the basic semantic information in language words, we present a neighbor embedding module that aggregates features from a point's local neighborhood, which can capture the local information and obtain semantic information.

Point-based deep learning

PointNet [21] pioneered point cloud learning. Subsequently, Qi et al. proposed PointNet++ [22], which uses query ball grouping and hierarchical PointNet to capture local structures.
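Both the query ball grouping mentioned for PointNet++ and the nearest neighbor search used for neighbor embedding come down to collecting the points around a chosen center and then aggregating over that group. The sketch below is one hedged way to picture this in plain Python. All function names, the toy coordinates, the radius and the group sizes are hypothetical, and raw coordinate offsets stand in for the learned per-point features a real network would pool.

```python
import math

def ball_query(points, center, radius, max_samples):
    """Indices of up to max_samples points within radius of center."""
    hits = [i for i, p in enumerate(points) if math.dist(p, center) <= radius]
    return hits[:max_samples]

def k_nearest(points, center, k):
    """Indices of the k points closest to center, nearest first."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    return order[:k]

def max_pool(features):
    """Channel-wise maximum over a group; independent of point order."""
    return [max(channel) for channel in zip(*features)]

# Two small clusters of toy 3D points.
points = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.0, 0.2, 0.0),
    (1.0, 1.0, 1.0),
    (1.1, 1.0, 1.0),
]
center = points[0]

ball = ball_query(points, center, radius=0.25, max_samples=8)
knn = k_nearest(points, center, k=3)

# Describe each neighbor by its offset from the center, then pool the group.
offsets = [[a - b for a, b in zip(points[i], center)] for i in knn]
local_feature = max_pool(offsets)
print(ball, knn, local_feature)
```

A ball query fixes the spatial extent of the group while a k-nearest search fixes its size; pooling with a maximum keeps the result independent of the order in which the neighbors are listed.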

Several subsequent works considered how to define convolution operations on point clouds. One main approach is to convert a point cloud into a regular voxel array to allow convolution operations. Tchapmi et al. [24] proposed SEGCloud for pointwise segmentation. It maps convolution features of 3D voxels to point clouds using trilinear interpolation and keeps global consistency through fully connected conditional random fields. Atzmon et al. [1] present the PCNN framework with extension and restriction operators to map between point-based representation and voxel-based representation. Volumetric convolution is performed on voxels for point feature extraction. MCCNN by Hermosilla et al. [8] allows non-uniformly sampled point clouds; convolution is treated as a Monte Carlo integration problem. Similarly, in PointConv proposed by Wu et al. [31], 3D convolution is performed through Monte Carlo estimation and importance sampling.

A different approach redefines convolution to operate on irregular point cloud data.
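To make the voxel-based route described earlier in this subsection concrete, the following sketch bins a point cloud into a regular grid, which is the step that gives convolution a canonical domain to work on. It is an illustration under stated assumptions only: the function name, the voxel size and the toy points are hypothetical and are not taken from SEGCloud, PCNN or any other cited method.

```python
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group point indices by the integer voxel cell that contains them."""
    grid = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        # Floor division maps a coordinate to its cell index along each axis.
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid[key].append(idx)
    return dict(grid)

points = [
    (0.1, 0.1, 0.1),
    (0.6, 0.2, 0.3),
    (0.7, 0.4, 0.1),
    (1.9, 1.9, 1.9),
]
grid = voxelize(points, voxel_size=0.5)
for cell, members in sorted(grid.items()):
    print(cell, members)
```

Points 1 and 2 fall into the same cell, so a voxel-level feature has to summarize both of them. That loss of per-point detail is one reason methods map voxel features back to points, for example by interpolation.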

