Transcription of Generative Adversarial Transformers
Generative Adversarial Transformers
Drew A. Hudson, Department of Computer Science, Stanford
Lawrence Zitnick, Facebook AI Research, Facebook

We introduce the GANformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linear efficiency, that can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other, and encourage the emergence of compositional representations for objects and scenes.
In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a multi-latent generalization of the successful StyleGAN network. We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it attains state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency. Further qualitative and quantitative experiments offer an insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrate the benefits and efficacy of our approach.
An implementation of the model is available.

Introduction

The cognitive science literature speaks of two reciprocal mechanisms that underlie human perception: the bottom-up processing, proceeding from the retina up to the cortex, as local elements and salient stimuli hierarchically group together to form the whole [27], and the top-down processing, where surrounding global context, selective attention and prior knowledge inform the interpretation of the particular [32]. While their respective roles and dynamics are being studied, researchers agree that it is the interplay between these two complementary processes that enables the formation of our rich internal representations, allowing us to perceive the world around us in its fullest and create vivid imageries in our mind's eye [13, 17, 39, 52].

Figure: images generated by the GANformer, along with a visualization of the model attention.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).
Nevertheless, the very mainstay and foundation of computer vision over the last decade, the Convolutional Neural Network, surprisingly does not reflect this bidirectional nature that so characterizes the human visual system, and rather displays a one-way feed-forward progression from raw sensory signals to higher representations. Unfortunately, the local receptive field and rigid computation of CNNs reduce their ability to model long-range dependencies or develop holistic understanding of global shapes and structures that goes beyond the brittle reliance on texture [26], and in the generative domain especially, they are linked to considerable optimization and stability issues [70] due to their fundamental difficulty in coordinating between fine details across the generated scene.
These concerns, along with the inevitable comparison to cognitive visual processes, raise the question of whether convolution alone provides a complete solution, or some key ingredients are still missing.

I wish to thank Christopher D. Manning for the fruitful discussions and constructive feedback in developing the bipartite transformer, especially when explored within the language representation area, as well as for the kind financial support that allowed this work to [ ]

29 Mar 2022

Figure: We introduce the GANformer network, which leverages a bipartite structure to support long-range interactions while evading the quadratic complexity standard Transformers suffer from.
We present two novel attention operations over the bipartite graph: simplex and duplex. The former permits communication in one direction, in the generative context from the latents to the image features, while the latter enables both top-down and bottom-up connections between these two dual representations.

The NLP community has witnessed a major revolution with the advent of the Transformer network [65], a highly-adaptive architecture centered around relational attention and dynamic interaction. In response, several attempts have been made to integrate the transformer into computer vision models, but so far they have met only limited success due to scalability limitations stemming from its quadratic mode of operation.

To address these shortcomings and unlock the full potential of this promising network for the field of computer vision, we introduce the Generative Adversarial Transformer, or GANformer for short, a simple yet effective generalization of the vanilla transformer, explored here for the task of visual synthesis.
The model utilizes a bipartite structure for computing soft attention, that iteratively aggregates and disseminates information between the generated image features and a compact set of latent variables that functions as a bottleneck, to enable bidirectional interaction between these dual representations. This design achieves a favorable balance, being capable of flexibly modeling global phenomena and long-range interactions on the one hand, while featuring an efficient setup that still scales linearly with the input size on the other. As such, the GANformer can sidestep the computational costs and applicability constraints incurred by prior works, caused by the dense and potentially excessive pairwise connectivity of the standard transformer [5,70], and successfully advance the generative modeling of compositional images.

We study the model's quantitative and qualitative behavior through a series of experiments, where it achieves state-of-the-art performance for a wide selection of datasets, of both simulated and real-world kinds, obtaining particularly impressive gains in generating highly-structured multi-object scenes.
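The aggregate-then-disseminate idea described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the query/key/value projections are replaced by the identity, and the update is a plain additive residual rather than the multiplicative integration the paper describes. It only aims to show why attention between n image features and m latents costs on the order of n × m dot products per round instead of n × n.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Soft attention from each query to every key/value vector.

    Identity projections are used for brevity (an assumption of this sketch).
    Returns the attended vectors and the number of query-key dot products.
    """
    dim = len(queries[0])
    outputs, dot_products = [], 0
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        dot_products += len(keys_values)
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)]
        )
    return outputs, dot_products


def bipartite_round(image_feats, latents):
    """One aggregate-then-disseminate round between n features and m latents."""
    # Bottom-up: each latent gathers information from all image features.
    gathered, cost_up = attend(latents, image_feats)
    latents = [[z + g for z, g in zip(lat, gat)] for lat, gat in zip(latents, gathered)]
    # Top-down: each image feature reads from the updated latents.
    read, cost_down = attend(image_feats, latents)
    image_feats = [[x + r for x, r in zip(f, rd)] for f, rd in zip(image_feats, read)]
    return image_feats, latents, cost_up + cost_down


# Toy sizes (illustrative only): n = 16 image positions, m = 2 latents, dim = 4.
feats = [[0.1 * (i + j) for j in range(4)] for i in range(16)]
lats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
new_feats, new_lats, cost = bipartite_round(feats, lats)
print(cost, 16 * 16)  # bipartite dot products vs. full self-attention pairs
```

With these toy sizes the round uses 64 dot products (32 in each direction), against 256 for full self-attention over the same 16 positions. Dropping the bottom-up step leaves only the latents-to-image direction, which corresponds loosely to the one-directional (simplex) case.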
As indicated by our analysis, the GANformer requires fewer training steps and fewer samples than competing approaches to successfully synthesize images of high quality and diversity. Further evaluation provides robust evidence for the network's enhanced transparency and compositionality, while ablation studies empirically validate the value and effectiveness of our approach. We then present visualizations of the model's produced attention maps, to shed more light upon its internal representations and synthesis process. All in all, as we will see through the rest of the paper, by bringing the renowned GANs and Transformer architectures together under one roof, we can integrate their complementary strengths, to create a strong, compositional and efficient network for visual generative modeling.

Related Work

Generative Adversarial Networks (GANs) [28], originally introduced in 2014, have made remarkable progress over the past years, with significant advances in training stability and dramatic improvements in image quality and diversity.
This has made them one of the leading paradigms in visual synthesis today [5,44,58]. In turn, GANs have been widely adopted for a rich variety of tasks, including image-to-image translation [40,72], super-resolution [47], style transfer [12], and representation learning [18], to name a few. But while generated images for faces, single objects or natural scenery have reached astonishing fidelity, becoming nearly indistinguishable from real samples, the unconditional synthesis of more structured or compositional scenes is still lagging behind, suffering from inferior coherence, reduced geometric consistency and, at times, a lack of global coordination [9,43,70].
As of now, faithful generation of structured scenes is thus yet to be achieved.

The last years saw impressive progress in the field of NLP, driven by the innovative architecture called Transformer [65], which has attained substantial gains within the language domain and consequently sparked considerable interest across the deep learning community [16,65]. In response, several attempts have been made to incorporate self-attention constructions into vision models, most commonly for image recognition, but also in segmentation [25], detection [8], and synthesis [70]. From a structural perspective, these can be roughly divided into two streams: those that apply local attention operations, failing to capture global interactions [14,37,56,57,71], and others that borrow the original transformer structure as-is and perform attention globally across the entire image, resulting in prohibitive computation due to the quadratic complexity, which fundamentally restricts its applicability to low-resolution layers only [3,5,19,24,41,66,70].
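To make the trade-off between these two streams concrete, the following back-of-the-envelope sketch counts query-key pairs per attention layer for a local window, for global self-attention, and for attention against a small set of latents. The function name, the window size of 7 and the latent count of 16 are illustrative assumptions rather than values taken from the paper, and the counts ignore constant factors such as heads and channel width.

```python
def attention_costs(height, width, num_latents=16, window=7):
    """Count query-key pairs per layer for three attention layouts.

    num_latents and window are illustrative values, not taken from the paper.
    """
    n = height * width  # number of image positions
    return {
        "local": n * window * window,   # each position sees a window x window patch
        "global": n * n,                # every position attends to every other one
        "bipartite": n * num_latents,   # every position attends to the latents only
    }


for side in (32, 64, 128, 256):
    costs = attention_costs(side, side)
    print(side, costs["local"], costs["global"], costs["bipartite"])
```

At a 256 × 256 feature map, global self-attention needs 65536² = 4,294,967,296 pairs, whereas attending to 16 latents needs 65536 × 16 = 1,048,576. The local variant also grows linearly with the number of positions, but no single layer connects distant positions, which is the limitation noted above.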