Grounded Language-Image Pre-training

Liunian Harold Li 1, Pengchuan Zhang 2, Haotian Zhang 3, Jianwei Yang 2, Chunyuan Li 2, Yiwu Zhong 4, Lijuan Wang 5, Lu Yuan 5, Lei Zhang 6, Jenq-Neng Hwang 3, Kai-Wei Chang 1, Jianfeng Gao 2
1 UCLA, 2 Microsoft Research, 3 University of Washington, 4 University of Wisconsin-Madison, 5 Microsoft Cloud and AI, 6 International Digital Economy Academy
(The three authors contributed equally. Work done when interning at Microsoft.)

This paper presents a Grounded Language-Image Pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich.

In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head.

Code will be released.

Introduction

Visual recognition models are typically trained to predict a fixed set of pre-determined object categories, which limits their usability in real-world applications, since additional labeled data are needed to generalize to new visual concepts and domains. CLIP [42] shows that image-level visual representations can be learned effectively on large amounts of raw image-text pairs.

[Figure: GLIP vs. baselines on COCO object detection: Faster R-CNN w/ ResNet50 or ResNet101, and DyHead w/ Swin-Tiny.]

Because the paired texts contain a broader set of visual concepts than any pre-defined concept pool, the pre-trained CLIP model is so semantically rich that it can be easily transferred to downstream image classification and text-image retrieval tasks in a zero-shot manner. However, to gain a fine-grained understanding of images, as required by many tasks such as object detection [33, 46], segmentation [7, 37], human pose estimation [51, 58], scene understanding [15, 27, 59], action recognition [19], and vision-language understanding [8, 29-32, 38, 50, 52, 65, 67], object-level visual representations are highly desired. In this paper, we show that phrase grounding, the task of identifying the fine-grained correspondence between phrases in a sentence and objects (or regions) in an image, is an effective and scalable pre-training task for learning an object-level, language-aware, and semantic-rich visual representation, and we propose Grounded Language-Image Pre-training (GLIP).
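To make the grounding task concrete, a single annotation can be viewed as a caption plus phrase-to-box links. The schema and coordinate values below are purely illustrative and not the format of any particular grounding dataset:

```python
# Hypothetical illustration of a phrase grounding annotation: a caption, the
# phrases to be grounded (as character spans), and the image regions
# (boxes in [x1, y1, x2, y2] pixel coordinates) they refer to.
grounding_example = {
    "image": "000001.jpg",
    "caption": "A woman holds a blow dryer, wearing protective goggles.",
    "phrases": [
        {"span": (2, 7),   "text": "woman",              "boxes": [[120, 40, 360, 620]]},
        {"span": (16, 26), "text": "blow dryer",         "boxes": [[300, 210, 420, 330]]},
        {"span": (36, 54), "text": "protective goggles", "boxes": [[180, 70, 260, 110]]},
    ],
}
```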

Our approach unifies the phrase grounding and object detection tasks in that object detection can be cast as context-free phrase grounding, while phrase grounding can be viewed as contextualized object detection. We highlight our key contributions as follows.

Unifying detection and grounding by reformulating object detection as phrase grounding. The reformulation changes the input of a detection model: it takes as input not only an image but also a text prompt that describes all the candidate categories in the detection task. For example, the text prompt for COCO object detection [34] is a text string that consists of 80 phrases, i.e., the 80 COCO object class names, joined by ". ", as shown in Figure 1 (Left). (Different from typical phrase grounding tasks, phrases in the text prompt for an object detection task may not be present in the image.)
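As a rough sketch of this reformulation (the helper below is hypothetical, not GLIP's actual preprocessing), a detection prompt can be built by joining the class names with ". " while recording which character span of the prompt belongs to each class:

```python
# Minimal sketch: turn a detection label space into a grounding-style text prompt.
# Class names joined by ". " form the prompt; each class maps to a character span
# so its tokens can later be pooled into one "phrase" feature.
def build_detection_prompt(class_names):
    spans, cursor, pieces = [], 0, []
    for name in class_names:
        pieces.append(name)
        spans.append((cursor, cursor + len(name)))  # char span of this class in the prompt
        cursor += len(name) + len(". ")
    return ". ".join(pieces) + ".", spans

coco_classes = ["person", "bicycle", "car", "motorcycle"]  # ... 80 classes in total for COCO
prompt, class_spans = build_detection_prompt(coco_classes)
# prompt      == "person. bicycle. car. motorcycle."
# class_spans == [(0, 6), (8, 15), (17, 20), (22, 32)]
```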

Any object detection model can be converted into a grounding model by replacing the object classification logits in its box classifier with word-region alignment scores, i.e., the dot product of the region (or box) visual features and the token (or phrase) language features, as shown in Figure 1 (Right). The language features are computed using a language model, which gives the new detection (or grounding) model a dual-encoder structure.
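A minimal PyTorch-style sketch of replacing classification logits with word-region alignment scores; the random features, dimensions, and span pooling are stand-ins rather than GLIP's actual encoders:

```python
import torch

# Sketch: score every candidate region against every token of the text prompt.
# O: region/box visual features, shape (num_regions, d)
# P: token language features,    shape (num_tokens, d)
num_regions, num_tokens, d = 100, 32, 256
O = torch.randn(num_regions, d)   # stand-in for the visual encoder's region features
P = torch.randn(num_tokens, d)    # stand-in for the language encoder's token features

# Word-region alignment logits replace the usual fixed-class logits:
# S[i, j] is the dot product between region i and token j.
S = O @ P.t()                      # shape (num_regions, num_tokens)

# For detection-as-grounding, a class score is read off the tokens belonging to
# that class phrase in the prompt (here pooled by a simple mean over the span).
def class_logits_from_alignment(S, token_spans):
    # token_spans: hypothetical (start, end) token indices of each class phrase
    return torch.stack([S[:, s:e].mean(dim=-1) for s, e in token_spans], dim=-1)

logits = class_logits_from_alignment(S, [(0, 2), (3, 5), (6, 7)])  # (num_regions, 3)
```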

Figure 1. A unified framework for detection and grounding. Unlike a classical object detection model, which predicts a categorical class for each detected object, we reformulate detection as a grounding task by aligning each region/box to phrases in a text prompt. GLIP jointly trains an image encoder and a language encoder to predict the correct pairings of regions and words. We further add cross-modality deep fusion to early fuse information from the two modalities and to learn a language-aware visual representation.
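The "correct pairings of regions and words" in the caption can be illustrated with a simple alignment loss over the score matrix; the binary targets and plain binary cross-entropy below are simplifying assumptions, not GLIP's exact training objective:

```python
import torch
import torch.nn.functional as F

# Sketch of an alignment loss over the region-word score matrix S.
# T is a binary target matrix: T[i, j] = 1 if region i is assigned to a ground-truth
# box whose phrase contains token j (the box-to-region assignment is omitted here).
num_regions, num_tokens = 100, 32
S = torch.randn(num_regions, num_tokens, requires_grad=True)
T = torch.zeros(num_regions, num_tokens)
T[0, 3:5] = 1.0   # e.g. region 0 matches the phrase spanning tokens 3-4

alignment_loss = F.binary_cross_entropy_with_logits(S, T)
alignment_loss.backward()   # in training, gradients flow into both encoders via S
```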

Figure 2. Grounding predictions from GLIP. GLIP can locate rare entities, phrases with attributes, and even abstract words.

Different from CLIP, which fuses vision and language only at the last dot-product layer [42], we show that the deep cross-modality fusion applied by GLIP, as shown in Figure 1 (Middle), is crucial to learn high-quality language-aware visual representations and to achieve superior transfer learning performance.
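A minimal sketch of what deep fusion means here, in contrast to CLIP-style late fusion: visual and language features exchange information through cross-attention inside the encoder stack, before the final dot product. The block below is a generic illustration, not GLIP's exact cross-modality module:

```python
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """Illustrative early-fusion block: vision attends to language and vice versa."""
    def __init__(self, d_model=256, num_heads=8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, vis, txt):
        # vis: (batch, num_regions, d), txt: (batch, num_tokens, d)
        vis_upd, _ = self.v2l(query=vis, key=txt, value=txt)   # vision queries language
        txt_upd, _ = self.l2v(query=txt, key=vis, value=vis)   # language queries vision
        return vis + vis_upd, txt + txt_upd                    # residual updates to both streams

# Stacking several such layers between the backbone and the detection/grounding head
# yields language-aware visual features before the final word-region dot product.
fusion = CrossModalFusionLayer()
vis, txt = torch.randn(2, 100, 256), torch.randn(2, 32, 256)
vis, txt = fusion(vis, txt)
```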

The unification of detection and grounding also allows us to pre-train on both types of data and benefits both tasks. On the detection side, the pool of visual concepts is significantly enriched thanks to the grounding data. On the grounding side, detection data introduce more bounding box annotations and help train a new SoTA phrase grounding model.

Scaling up visual concepts with massive image-text data. Given a good grounding model (teacher), we can augment GLIP pre-training data by automatically generating grounding boxes for massive image-text-paired data, in which noun phrases are detected by an NLP parser [2].

Thus, we can pre-train our (student) GLIP-Large model (GLIP-L) on 27M grounding data, including 3M human-annotated fine-grained data and 24M web-crawled image-text pairs. For the 24M image-text pairs, there are 78.1M high-confidence (> 0.5) phrase-box pseudo annotations, with 58.4M unique noun phrases. We showcase two real examples of the generated boxes in Figure 2. The teacher model can accurately localize some arguably hard concepts, such as syringes, vaccine, and beautiful caribbean sea turquoise, and even abstract words (the view). Training on such semantic-rich data delivers a semantic-rich student model.
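The pseudo-labeling step can be sketched as follows; the `teacher.predict` interface is hypothetical, spaCy noun chunks stand in for the NLP parser used in the paper, and only the 0.5 confidence threshold comes from the text above:

```python
import spacy

# Rough sketch of generating grounding pseudo-labels for a web image-text pair
# with a trained teacher grounding model. `teacher.predict` is a hypothetical
# interface returning (boxes, scores) for a queried phrase; spaCy noun chunks
# stand in for the NLP parser used to detect noun phrases in the caption.
nlp = spacy.load("en_core_web_sm")

def pseudo_label(image, caption, teacher, score_thresh=0.5):
    phrases = [chunk.text for chunk in nlp(caption).noun_chunks]
    annotations = []
    for phrase in phrases:
        boxes, scores = teacher.predict(image, prompt=caption, phrase=phrase)
        for box, score in zip(boxes, scores):
            if score > score_thresh:          # keep only high-confidence (> 0.5) boxes
                annotations.append({"phrase": phrase, "box": box, "score": float(score)})
    return annotations  # used as grounding supervision for the student model
```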

