Transcription of Fast Fourier Convolution - NIPS
Fast Fourier Convolution

Lu Chi1, Borui Jiang2, Yadong Mu1
1 Wangxuan Institute of Computer Technology, 2 Center for Data Science, Peking University

Abstract

Convolutions in modern deep networks are known to operate locally and at a fixed scale (e.g., the widely adopted 3×3 kernels in image-oriented tasks). This causes low efficacy in connecting two distant locations in the network. In this work, we propose a novel convolutional operator dubbed fast Fourier convolution (FFC), whose main hallmarks are non-local receptive fields and cross-scale fusion within the convolutional unit. According to the spectral convolution theorem in Fourier theory, a point-wise update in the spectral domain globally affects all input features involved in the Fourier transform, which sheds light on neural architectural design with non-local receptive fields.
Our proposed FFC is designed to encapsulate three different kinds of computation in a single operation unit: a local branch that conducts ordinary small-kernel convolution, a semi-global branch that processes spectrally stacked image patches, and a global branch that manipulates the image-level spectrum. The branches address different scales in a complementary way. A multi-branch aggregation step is included in FFC for cross-scale fusion. FFC is a generic operator that can directly replace vanilla convolutions in a large body of existing networks, without any adjustments and with comparable complexity metrics (e.g., FLOPs). We experimentally evaluate FFC on three major vision benchmarks (ImageNet for image recognition, Kinetics for video action recognition, MSCOCO for human keypoint detection).
It consistently elevates accuracy in all of the above tasks by significant margins.

Introduction

Deep neural networks have been the prominent driving force for recent dramatic progress in several research domains. The goal of this paper is the exposition of a novel convolutional unit codenamed fast Fourier convolution (FFC). Motivating our design of FFC, we consider two desiderata. First, one of the core concepts in deep convolutional neural networks (CNNs) is the receptive field, which is deeply rooted in the architecture of the visual cortex. In convolutional networks, the receptive field refers to the image part that is accessible by one filter. A majority of modern networks have adopted the architecture of deeply stacking many convolutions with small receptive fields (3×3 in ResNet [11] for images or 3×3×3 in C3D [27] for videos).
This still ensures that all image parts are visible to high layers, since stacking convolutional layers can increase the receptive field either linearly or exponentially (e.g., using atrous convolutions [2]). However, for context-sensitive tasks such as human pose estimation, a large receptive field in convolutions is highly desired. Recent endeavors on enlarging the receptive field include deformable convolution [9] and non-local neural networks [31].

Secondly, CNNs typically admit a chain-like topology. Neural layers provide different levels of feature abstraction. The idea of cross-scale fusion has proven successful in various settings. For example, one can tailor and send high-level semantics to shallower layers for guiding more accurate spatial detection, as shown in the seminal work of FPN [18].
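The linear versus exponential growth of the receptive field mentioned above can be checked with a short calculation. The following is a minimal sketch, not taken from the paper: the helper `receptive_field` and the choice of doubling dilation rates (1, 2, 4, ...) are illustrative assumptions, and only one spatial axis is considered.

```python
def receptive_field(kernel_sizes, dilations=None, strides=None):
    """Receptive field along one axis of a stack of convolutions.

    Hypothetical helper for illustration; uses the standard recurrence
    rf += (k - 1) * dilation * jump, where jump is the product of strides so far.
    """
    n = len(kernel_sizes)
    dilations = dilations or [1] * n
    strides = strides or [1] * n
    rf, jump = 1, 1
    for k, d, s in zip(kernel_sizes, dilations, strides):
        rf += (k - 1) * d * jump
        jump *= s
    return rf

# Plain stacks of 3-tap convolutions: the receptive field grows linearly with depth.
plain = [receptive_field([3] * n) for n in range(1, 6)]

# Assumed atrous schedule with dilation doubling per layer: exponential growth.
atrous = [receptive_field([3] * n, dilations=[2 ** i for i in range(n)])
          for n in range(1, 6)]

print(plain)   # [3, 5, 7, 9, 11]
print(atrous)  # [3, 7, 15, 31, 63]
```

Even under the exponential schedule, several layers are needed before a single output sees the whole image, which is the limitation that motivates non-local operators.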
Recent studies have considered reinforcing cross-scale fusion in more complex patterns, as exemplified by HRNet [29] and Auto-DeepLab [19]. Our work is also partly inspired by GoogLeNet [26], which is among the early explorations of capturing and fusing multi-scale information within an operation unit, rather than among distant neural layers.

Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver.

We thus seek a novel convolution operator that efficiently implements non-local receptive fields and fuses multi-scale information. The key tool for our development is the spectral transform. In particular, we choose the Fourier transform as the instantiation, leaving further exploration of the many other choices (e.g., wavelets) as future work. According to the spectral convolution theorem [15] in Fourier theory, updating a single value in the spectral domain globally affects all original data, which sheds light on designing efficient neural architectures with non-local receptive fields (e.g., [34, 7]).
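Both statements can be verified numerically on a toy feature map. The sketch below is an illustration under stated assumptions rather than part of the proposed method: the 8×8 random input, the perturbed frequency index (1, 2), and the use of the full complex FFT (without enforcing conjugate symmetry) are arbitrary choices made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))

# (a) Perturb a single spectral coefficient and transform back.
X = np.fft.fft2(x)
X_mod = X.copy()
X_mod[1, 2] += 1.0  # arbitrary frequency index, chosen for illustration
delta = np.fft.ifft2(X_mod) - np.fft.ifft2(X)

# The change is a complex exponential of magnitude 1 / (H * W) at every position.
changed = np.abs(delta) > 1e-12
print(changed.all())

# (b) Convolution theorem: a point-wise product of spectra equals circular convolution.
k = rng.standard_normal((8, 8))
spectral = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)))

direct = np.zeros_like(x)
for i in range(8):
    for j in range(8):
        # np.roll(k, (i, j))[m, n] == k[(m - i) % 8, (n - j) % 8]
        direct += x[i, j] * np.roll(k, shift=(i, j), axis=(0, 1))

print(np.allclose(spectral, direct))
```

Part (a) shows that one spectral update touches all 64 spatial positions; part (b) shows that a point-wise spectral product acts as a convolution whose kernel spans the entire map.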
Specifically, we design a collection of operations with varying receptive fields, among which the non-local ones are accomplished via the Fourier transform. These operations are applied to disjoint subsets of feature channels. Updated feature maps across scales are eventually aggregated as the output.

To our best knowledge, FFC is the first work that explores an efficient ensemble of local and non-local receptive fields in a single unit. It can be used in a plug-and-play fashion, replacing vanilla convolutions in mainstream CNNs without any additional effort. In contrast, existing non-local operators can only be sparsely inserted into the network pipeline due to their expensive computational cost. FFC consumes GFLOPs and parameters comparable to those of vanilla convolutions, yet conveys richer information.
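The idea of applying local and Fourier-based operations to disjoint channel subsets can be sketched in a few lines. This is a deliberately simplified illustration and not the FFC unit itself: the function names, the fixed 3×3 averaging kernel, the real-valued per-frequency weights, the 50/50 channel split, and aggregation by plain concatenation are all assumptions made for the example.

```python
import numpy as np

def local_op(x, kernel):
    """3x3 'same' filtering of every channel with zero padding (local receptive field)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[:, di:di + h, dj:dj + w]
    return out

def global_op(x, weight):
    """Point-wise scaling in the 2-D Fourier domain (image-wide receptive field)."""
    spec = np.fft.rfft2(x, axes=(-2, -1))
    return np.fft.irfft2(spec * weight, s=x.shape[-2:], axes=(-2, -1))

def two_branch_unit(x, ratio_global, kernel, weight):
    """Split channels into disjoint local/global subsets, process, and concatenate."""
    c_global = int(round(x.shape[0] * ratio_global))
    c_local = x.shape[0] - c_global
    y_local = local_op(x[:c_local], kernel)
    y_global = global_op(x[c_local:], weight)
    return np.concatenate([y_local, y_global], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))       # (channels, height, width)
kernel = np.full((3, 3), 1.0 / 9.0)        # assumed fixed averaging kernel
weight = rng.standard_normal((4, 16, 9))   # assumed real weights on the half spectrum
y = two_branch_unit(x, 0.5, kernel, weight)

def affected(channel):
    """Count output entries that change when one input pixel is perturbed."""
    x2 = x.copy()
    x2[channel, 8, 8] += 1.0
    diff = two_branch_unit(x2, 0.5, kernel, weight) - y
    return int((np.abs(diff) > 1e-9).sum())

n_local = affected(0)    # perturb a channel handled by the local branch
n_global = affected(4)   # perturb a channel handled by the spectral branch
print(y.shape, n_local, n_global)
```

Perturbing one pixel in a local-branch channel changes only its 3×3 neighbourhood in the output, whereas the same perturbation in a spectral-branch channel spreads over essentially the whole feature map.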
In the experiments, we apply FFC to a variety of computer vision tasks, including image recognition on ImageNet, video action recognition on the Kinetics dataset, and human keypoint detection on Microsoft COCO. The reported performances consistently outstrip previous models by significant margins. We strongly believe that FFC can make inroads into domains of neural network design where uniform, local receptive fields had previously dominated.

Related Work

Non-local neural networks. The theory of the effective receptive field [21] revealed that convolutions tend to contract to the central regions. This questions the necessity of large convolutional kernels. Besides, small-kernel convolutions are also favored in CNNs for mitigating the risk of over-fitting. However, researchers gradually realized that linking two arbitrary distant neurons in a layer is crucial for many context-sensitive tasks, such as classifying the action type in a spatio-temporal video tube or jointly inferring the precise locations of human keypoints.
This is addressed by recent research on non-local networks. Early methods as in [31] rely on expensive self-convolutions, which has prompted a series of follow-up works that seek acceleration (e.g., [14]). Nonetheless, the current paradigm for using non-local operators is to insert them sparsely into network pipelines. How they can be densely knitted into a network remains an unexplored research problem.

Cross-scale fusion. In CNNs, it is widely acknowledged that features extracted from different locations in a network are highly complementary, providing low-level (edges, blobs, etc.), mid-level (meaningful shapes) or high-level semantic abstraction. Cross-scale feature fusion has proven effective in numerous ways. For example, FCN [20] directly concatenated feature maps of different scales, generating more accurate image segments.
The visual object detection task requires both accurate localization and prediction of object categories. To this end, FPN [18] propagated features in a top-down manner, seamlessly bridging the high spatial resolution in lower layers and the semantic discriminative ability in higher layers. The recently proposed HRNet [29] conducted cross-scale fusion among multiple network branches that maintain different spatial resolutions.

Spectral neural networks. Recent years have witnessed increasing research enthusiasm for spectral neural networks. The spectral domain, previously harnessed only for accelerating convolutions, also provides a powerful building block for constructing deep networks. For example, [23] proposed spectral pooling, which performs dimensionality reduction by truncating the representation in the frequency domain.
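The truncation idea behind spectral pooling can be illustrated as follows. This is a simplified sketch of the general principle and not the exact procedure of [23]: the function name `spectral_pool`, the centred crop, the mean-preserving rescaling, and discarding the small imaginary residue (instead of handling conjugate symmetry explicitly) are assumptions of this example.

```python
import numpy as np

def spectral_pool(x, out_h, out_w):
    """Downsample a 2-D map by keeping only its lowest frequencies."""
    h, w = x.shape
    # Move the zero frequency to the centre so that low frequencies form a block.
    spec = np.fft.fftshift(np.fft.fft2(x))
    # Crop so that the zero frequency lands where ifftshift expects it.
    top = h // 2 - out_h // 2
    left = w // 2 - out_w // 2
    cropped = spec[top:top + out_h, left:left + out_w]
    y = np.fft.ifft2(np.fft.ifftshift(cropped))
    # Rescale so the mean intensity is preserved; drop the imaginary residue.
    return np.real(y) * (out_h * out_w) / (h * w)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
y = spectral_pool(x, 4, 4)
print(y.shape)                         # (4, 4)
print(np.isclose(y.mean(), x.mean()))  # True

flat = spectral_pool(np.full((8, 8), 3.0), 4, 4)
print(np.allclose(flat, 3.0))          # True
```

Unlike max or average pooling, the output size here is not tied to an integer stride, since any number of low-frequency coefficients can be retained.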
[34] utilized a wavelet-based representation for restoring high-resolution images. [7] proposed paired spatial-spectral transforms and devised a number of new layers in the spectral domain. Our work advances the above-mentioned research front by designing an operation unit that simultaneously uses spatial and spectral information for achieving mixed receptive fields.

[Figure 1 panel labels: Local Branch, Semi-global / Global Branches, channel reduction, Fourier Unit, Local Fourier Unit, channel promotion.]

Figure 1: Left: Architecture design of fast Fourier convolution (FFC). ⊕ denotes element-wise sum. Right: Design of the spectral transform f_g. See main text for more details.

Fast Fourier Convolution (FFC)

Architectural Design

The architecture of our proposed FFC is shown in Figure 1. Conceptually, FFC comprises two inter-connected paths: a spatial (or local) path that conducts ordinary convolutions on a part of the input feature channels, and a spectral (or global) path that operates in the spectral domain.