

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, JUNE 2021

DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation

Ailiang Lin, Bingzhi Chen, Jiayu Xu, Zheng Zhang, Senior Member, IEEE, and Guangming Lu, Member, IEEE

Abstract—Automatic medical image segmentation has made great progress benefiting from the development of deep learning. However, most existing methods are based on convolutional neural networks (CNNs), which fail to build long-range dependencies and global context connections due to the limited receptive field of the convolution operation. Inspired by the success of the Transformer, whose self-attention mechanism can powerfully model long-range contextual information, researchers have expended considerable effort in designing robust Transformer-based variants of U-Net.




Moreover, the patch division used in vision transformers usually ignores the pixel-level intrinsic structural features inside each patch. To alleviate these problems, in this paper, we propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet), which might be the first attempt to concurrently incorporate the advantages of the hierarchical Swin Transformer into both the encoder and the decoder of the standard U-shaped architecture to enhance the semantic segmentation quality of varying medical images. Unlike many prior Transformer-based solutions, the proposed DS-TransUNet first adopts dual-scale encoder subnetworks based on Swin Transformer to extract the coarse- and fine-grained feature representations of different semantic scales.
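The dual-scale patch split behind the two encoder subnetworks can be pictured with a small sketch. The 224x224 input and the patch sizes 4 and 8 below are illustrative assumptions on our part, not values taken from this text:

```python
import numpy as np

def patch_partition(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an (N, patch*patch*C) array with N = (H//patch) * (W//patch).
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly"
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)  # one row per patch

img = np.random.rand(224, 224, 3)
fine = patch_partition(img, 4)    # fine-grained branch: many small tokens
coarse = patch_partition(img, 8)  # coarse-grained branch: fewer, larger tokens
print(fine.shape, coarse.shape)   # (3136, 48) (784, 192)
```

The fine branch sees four times as many tokens, each covering a smaller region, which is what gives the two branches different semantic scales.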

As the core component of our DS-TransUNet, a well-designed Transformer Interactive Fusion (TIF) module is proposed to effectively establish global dependencies between features of different scales through the self-attention mechanism, in order to make full use of the obtained multi-scale feature representations. Furthermore, we also introduce the Swin Transformer block into the decoder to further explore long-range contextual information during the up-sampling process. Extensive experiments across four typical medical image segmentation tasks demonstrate the effectiveness of DS-TransUNet and show that our approach significantly outperforms the state-of-the-art methods.

Index Terms—Medical image segmentation; Long-range contextual information; Hierarchical Swin Transformer; Dual-scale; Transformer Interactive Fusion module
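One simplified reading of such cross-scale fusion via self-attention can be sketched as follows: summarize one branch into a single global token, prepend it to the other branch's token sequence, and run scaled dot-product self-attention over the combined sequence. This is our illustration of the general idea, not the authors' implementation; learned Q/K/V projections and the multi-head structure are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens):
    """Single-head scaled dot-product self-attention (learned Q/K/V
    projections omitted, so tokens attend over their raw embeddings)."""
    d = tokens.shape[-1]
    return softmax(tokens @ tokens.T / np.sqrt(d)) @ tokens

def cross_scale_fuse(fine, coarse):
    """Summarize the coarse branch into one global token (mean pooling),
    prepend it to the fine sequence, and let self-attention mix them."""
    summary = coarse.mean(axis=0, keepdims=True)        # (1, d)
    fused = self_attention(np.vstack([summary, fine]))  # (1 + Nf, d)
    return fused[1:]                                    # updated fine tokens

fine = rng.random((64, 32))    # fine-branch tokens, embed dim 32
coarse = rng.random((16, 32))  # coarse-branch tokens, same embed dim
out = cross_scale_fuse(fine, coarse)
print(out.shape)               # (64, 32)
```

After the attention step, every fine-scale token has attended to a global summary of the other scale, which is the kind of cross-scale dependency the TIF module is designed to establish.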

I. INTRODUCTION

Medical image segmentation is an important yet challenging research problem involving many common tasks in clinical applications, such as polyp segmentation, lesion segmentation, cell segmentation, etc. Moreover, medical image segmentation is a complex and key step in the field of medical image processing and analysis, and plays an important role in computer-aided clinical diagnosis systems. Its purpose is to segment the parts with special significance in medical images and extract relevant features through a semi-automatic or automatic process, so as to provide a reliable basis for clinical diagnosis and pathological research, and assist doctors in making more accurate diagnoses.

With the development of deep learning, convolutional neural networks (CNNs) have become dominant in a series of medical image segmentation tasks. Among various CNN variants, the typical encoder-decoder based network U-Net [1] has demonstrated excellent segmentation potential: the encoder extracts features through continuous down-sampling, and the decoder then progressively leverages the encoder's output features through skip connections during up-sampling, so that the network can obtain features of different granularity for better segmentation.

(Ailiang Lin, Bingzhi Chen, Jiayu Xu and Guangming Lu are with the Shenzhen Medical Biometrics Perception and Analysis Engineering Laboratory, Harbin Institute of Technology, Shenzhen 518055, China. Zheng Zhang is with the Bio-Computing Research Center, Harbin Institute of Technology, Shenzhen 518055, China, and also with the Shenzhen Key Laboratory of Visual Object Detection and Recognition, Shenzhen 518055, China.)
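The encoder-decoder wiring with skip connections can be sketched at the level of tensor shapes. This is a minimal illustration of the U-Net pattern, not the authors' network: convolutions are omitted, leaving only max pooling, up-sampling, and the channel-wise skip concatenation:

```python
import numpy as np

def down(x):  # 2x2 max pooling: halve spatial resolution
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def up(x):    # nearest-neighbour up-sampling: double spatial resolution
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.rand(64, 64, 16)

# Encoder: remember each resolution's features for the skip connections.
skips = []
for _ in range(3):
    skips.append(x)
    x = down(x)          # spatial size: 64 -> 32 -> 16 -> 8

# Decoder: up-sample and concatenate the matching encoder features,
# reinjecting fine-grained detail lost during down-sampling.
for skip in reversed(skips):
    x = np.concatenate([up(x), skip], axis=-1)

print(x.shape)           # (64, 64, 64)
```

The final channel count grows because each decoder stage appends the corresponding encoder features; a real U-Net would apply convolutions at every stage to process and compress these channels.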

Following the popularity of U-Net, many novel models have been proposed, such as UNet++ [2], Res-UNet [3], Attention U-Net [4], DenseUNet [5], R2U-Net [6], KiU-Net [7] and UNet 3+ [8], which are specially designed for medical image segmentation and achieve impressive performance. Although CNNs have achieved great success in the field of medical imaging, it is difficult for them to make further progress. Due to their inherent inductive biases, each convolutional kernel can only focus on a sub-region of the whole image, which makes it lose global context and fail to build long-range dependencies. Stacking convolution layers and down-sampling helps expand the receptive field and brings better local interaction, but this is a sub-optimal choice because it makes the model more complicated and easier to overfit.
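The receptive-field limitation can be made concrete with the standard receptive-field recurrence (this calculation is our illustration, not part of the paper): each layer adds (kernel - 1) scaled by the cumulative stride of the layers beneath it.

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride) layers,
    via the recurrence rf += (kernel - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Ten stacked 3x3 convolutions (stride 1): the field grows only linearly.
print(receptive_field([(3, 1)] * 10))         # 21

# Interleaving 2x2 stride-2 down-sampling makes it grow much faster.
print(receptive_field([(3, 1), (2, 2)] * 5))  # 94
```

This is why deep CNNs rely on repeated down-sampling to see large contexts, at the cost of extra depth and lost spatial resolution, whereas self-attention connects all positions in a single layer.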

There exist some works trying to model long-range dependencies for convolutions, such as attention mechanisms [9] [10] [11]. However, since these methods are not aimed at the field of medical image segmentation, they still have great limitations in global context modeling, which means there is great potential for improvement. Recently, the novel Transformer architecture [12], which was originally designed for sequence-to-sequence modeling in natural language processing (NLP) tasks, has sparked tremendous discussion in the computer vision (CV) community. Transformer can revolutionize most NLP tasks such as machine translation, named-entity recognition and question answering, mainly because the multi-head self-attention (MSA) mechanism can effectively build global connections between the tokens of a sequence.
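A minimal sketch of MSA shows why it is global: every token's attention matrix covers all other tokens, regardless of distance. The learned projection matrices are omitted here for brevity, so each head simply attends over its slice of the embedding:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, num_heads):
    """Simplified MSA over an (n, d) token matrix; real MSA adds
    learned Q/K/V and output projections per head."""
    n, d = x.shape
    assert d % num_heads == 0
    dh = d // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = x[:, h * dh:(h + 1) * dh]    # (n, dh) slice per head
        attn = softmax(q @ k.T / np.sqrt(dh))    # (n, n): all pairs of tokens
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1)        # (n, d)

tokens = np.random.rand(16, 64)                  # 16 tokens, embed dim 64
out = multi_head_self_attention(tokens, num_heads=8)
print(out.shape)                                 # (16, 64)
```

The (n, n) attention matrix is the global connection the text refers to: unlike a convolution kernel, it links every token pair in one step, at quadratic cost in the sequence length.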

The ability to model long-range dependencies is also suitable for pixel-based CV tasks. Specifically, DEtection TRansformer (DETR) [13] utilizes an elegant Transformer-based design to build the first fully end-to-end object detection model. Vision Transformer (ViT) [14], the first recognition model purely based on Transformer, is proposed and achieves comparable performance with other state-of-the-art (SOTA) convolution-based methods. To reduce the computational complexity, the hierarchical Swin Transformer [15] is proposed with Window-based MSA (W-MSA) and Shifted-Window-based MSA (SW-MSA), as illustrated in (b), and surpasses the previous SOTA methods in image classification and in dense prediction tasks such as object detection and semantic segmentation.
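The window mechanism can be sketched briefly. W-MSA partitions the feature map into MxM windows and runs self-attention inside each window independently; SW-MSA shifts the map by M//2 before partitioning so that the next block's windows straddle the previous borders, letting information cross window boundaries. The 56x56 map, 96 channels, and window size 7 below are illustrative choices on our part:

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into (num_windows, M*M, C) groups;
    W-MSA then attends within each group of M*M tokens independently."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

x = np.random.rand(56, 56, 96)
windows = window_partition(x, M=7)
print(windows.shape)                    # (64, 49, 96)

# SW-MSA: cyclically shift by M//2 = 3 before partitioning, so the new
# windows span the borders of the previous (unshifted) windows.
shifted = np.roll(x, shift=(-3, -3), axis=(0, 1))
shifted_windows = window_partition(shifted, M=7)

# Attention-score count: global MSA is quadratic in the token count,
# windowed MSA is linear (fixed 49x49 cost per window).
n = 56 * 56
print(n * n)         # global MSA: 9,834,496 pairwise scores
print(64 * 49 * 49)  # W-MSA: 153,664 scores
```

Alternating W-MSA and SW-MSA blocks thus approximates global interaction at a fraction of the quadratic cost, which is what makes the hierarchical design practical for dense prediction.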

SEgmentation TRansformer (SETR) [16] shows that Transformer can achieve SOTA performance in segmentation tasks as an encoder. However, Transformer-based models have not attracted enough attention in medical image segmentation. TransUNet [17] utilizes CNNs to extract features and then feeds them into a Transformer for long-range dependency modeling. TransFuse [18], based on ViT, tries to fuse the features extracted by Transformer and CNNs, while MedT [19], based on Axial-Attention [20], explores the feasibility of applying Transformer without large-scale datasets. The success of these models shows the great potential of Transformer in medical image segmentation, but they all apply Transformer only in the encoder, which means the potential of Transformer in the decoder for segmentation remains to be explored. Meanwhile, multi-scale feature representations have been proved to play an important role in vision transformers.

Cross-Attention Multi-Scale Vision Transformer (CrossViT) [21] proposes a novel dual-branch Transformer architecture to extract multi-scale features for image classification. Multiscale Vision Transformers (MViT) [22] is presented for video and image recognition by connecting multi-scale feature hierarchies with transformer models. Multi-modal Multi-scale TRansformer (M2TR) [23] uses a multi-scale transformer to detect local inconsistencies at different scales. In general, multi-scale feature representations can bring more powerful performance to vision transformers, but they are rarely used in the field of medical image segmentation. To alleviate the inherent inductive biases of CNNs, this paper proposes a novel encoder-decoder Transformer-based framework that mainly combines the advantages of Swin Transformer and multi-scale vision transformers to effectively optimize the structure of the standard U-shaped architecture for automatic medical image segmentation.

