
Show and Tell: A Neural Image Caption Generator

Abstract

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions.

Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.

1. Introduction

Being able to automatically describe the content of an image using properly formed English sentences is a very challenging task, but it could have great impact, for instance by helping visually impaired people better understand the content of images on the web.
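For readers unfamiliar with the metric: BLEU-1 is, roughly, the unigram precision of a candidate sentence against one or more reference descriptions, with a penalty for overly short candidates. The sketch below is our simplification for illustration; published numbers come from the standard multi-reference BLEU implementation.

from collections import Counter
import math

def bleu1(candidate, references):
    """Modified unigram precision with a brevity penalty, scaled to 0-100."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    # Clip each candidate unigram count by its maximum count in any reference.
    max_ref = Counter()
    for ref in refs:
        for w, n in Counter(ref).items():
            max_ref[w] = max(max_ref[w], n)
    clipped = sum(min(n, max_ref[w]) for w, n in Counter(cand).items())
    precision = clipped / max(len(cand), 1)
    # Brevity penalty: penalize candidates shorter than the closest reference.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return 100.0 * bp * precision

print(bleu1("a group of people shopping at a market",
            ["a group of people shopping at an outdoor market"]))  # ~77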

This task is significantly harder, for example, than the well-studied image classification or object recognition tasks, which have been a main focus in the computer vision community [27]. Indeed, a description must capture not only the objects contained in an image, but it also must express how these objects relate to each other, as well as their attributes and the activities they are involved in. Moreover, the above semantic knowledge has to be expressed in a natural language like English, which means that a language model is needed in addition to visual understanding.

[Figure 1. NIC, our model, is based end-to-end on a neural network consisting of a vision CNN followed by a language generating RNN. It generates complete sentences in natural language from an input image, as shown on the example above. Example outputs: "A group of people shopping at an outdoor market." / "There are many vegetables at the fruit stand."]

Most previous attempts have proposed to stitch together existing solutions of the above sub-problems, in order to go from an image to its description [6, 16]. In contrast, we would like to present in this work a single joint model that takes an image I as input, and is trained to maximize the likelihood p(S|I) of producing a target sequence of words S = {S1, S2, ...}, where each word St comes from a given dictionary, that describes the image adequately.

The main inspiration of our work comes from recent advances in machine translation, where the task is to transform a sentence S written in a source language into its translation T in the target language, by maximizing p(T|S).
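In equation form, the captioning objective just described is the standard chain-rule factorization of the sentence likelihood; the symbols θ (model parameters) and N (caption length) are our notation, not the excerpt's:

\[
\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I;\, \theta),
\qquad
\log p(S \mid I;\, \theta) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \ldots, S_{t-1};\, \theta).
\]

The translation objective p(T|S) has exactly the same form, with the source sentence S playing the role that the image I plays here.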

For many years, machine translation was also achieved by a series of separate tasks (translating words individually, aligning words, reordering, etc.), but recent work has shown that translation can be done in a much simpler way using Recurrent Neural Networks (RNNs) [3, 2, 30] and still reach state-of-the-art performance. An encoder RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a decoder RNN that generates the target sentence. Here, we propose to follow this elegant recipe, replacing the encoder RNN by a deep convolutional neural network (CNN).
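As a concrete picture of this recipe, here is a minimal PyTorch sketch; it is our illustration under simplifying assumptions (single-layer GRUs, no attention), not the architecture of the cited systems [3, 2, 30]. The encoder's final hidden state is the fixed-length vector that seeds the decoder.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder RNN compresses the source sentence into a fixed-length
    vector, which is used as the initial hidden state of the decoder RNN."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt):
        _, h = self.encoder(self.src_emb(src))            # h: fixed-length summary of src
        dec_out, _ = self.decoder(self.tgt_emb(tgt), h)   # h seeds the decoder state
        return self.out(dec_out)                          # per-step vocabulary logits

model = Seq2Seq(src_vocab=10000, tgt_vocab=10000)
src = torch.randint(0, 10000, (4, 12))   # toy batch: 4 source sentences, length 12
tgt = torch.randint(0, 10000, (4, 15))   # shifted target tokens for teacher forcing
logits = model(src, tgt)                 # shape (4, 15, 10000)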

Over the last few years it has been convincingly shown that CNNs can produce a rich representation of the input image by embedding it to a fixed-length vector, such that this representation can be used for a variety of vision tasks [28]. Hence, it is natural to use a CNN as an image "encoder", by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences (see Fig. 1). We call this model the Neural Image Caption, or NIC.

Our contributions are as follows. First, we present an end-to-end system for the problem. It is a neural net which is fully trainable using stochastic gradient descent.
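The captioning variant can be sketched the same way; again this is our illustration, not the paper's exact architecture (the paper used a GoogLeNet-era classification CNN with an LSTM decoder; resnet18 below is a stand-in encoder). The pre-trained CNN's last hidden layer is projected into the word-embedding space and fed to the decoder as a special first input.

import torch
import torch.nn as nn
from torchvision import models

class NICSketch(nn.Module):
    """CNN encoder followed by an LSTM decoder, in the spirit of Fig. 1.
    The CNN's last hidden layer is projected into the word-embedding
    space and prepended to the caption as the decoder's first input."""
    def __init__(self, vocab, emb=512, hid=512):
        super().__init__()
        cnn = models.resnet18(weights=None)   # load pre-trained weights in practice
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        self.img_proj = nn.Linear(512, emb)   # CNN features -> word-embedding space
        self.word_emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, images, captions):
        feats = self.cnn(images).flatten(1)               # (B, 512) image features
        x0 = self.img_proj(feats).unsqueeze(1)            # image as decoder input at t = -1
        x = torch.cat([x0, self.word_emb(captions)], 1)   # prepend image to caption tokens
        h, _ = self.lstm(x)
        return self.out(h)                                # next-word logits at every step

model = NICSketch(vocab=10000)
imgs = torch.randn(2, 3, 224, 224)        # toy batch of 2 images
caps = torch.randint(0, 10000, (2, 16))   # toy caption token ids
logits = model(imgs, caps)                # shape (2, 17, 10000)

Training then maximizes the likelihood of each ground-truth word given the image and the preceding words, exactly the objective stated in the introduction.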

Second, our model combines state-of-the-art sub-networks for vision and language models. These can be pre-trained on larger corpora and thus can take advantage of additional data. Finally, it yields significantly better performance compared to state-of-the-art approaches; for instance, on the Pascal dataset, NIC yielded a BLEU score of 59, to be compared to the current state-of-the-art of 25, while human performance reaches 69. On Flickr30k, we improve from 56 to 66, and on SBU, from 19 to 28.

2. Related Work

The problem of generating natural language descriptions from visual data has long been studied in computer vision, but mainly for video [7, 32].

This has led to complex systems composed of visual primitive recognizers combined with a structured formal language, e.g. And-Or Graphs or logic systems, which are further converted to natural language via rule-based systems. Such systems are heavily hand-designed, relatively brittle, and have been demonstrated only on limited domains, e.g. traffic scenes or sports. The problem of still image description with natural text has gained interest more recently. Leveraging recent advances in recognition of objects, their attributes and locations allows us to drive natural language generation systems, though these are limited in their expressivity.

Farhadi et al. [6] use detections to infer a triplet of scene elements which is converted to text using templates. Similarly, Li et al. [19] start off with detections and piece together a final description using phrases containing detected objects and relationships. A more complex graph of detections beyond triplets is used by Kulkarni et al. [16], but with template-based text generation. More powerful language models based on language parsing have been used as well [23, 1, 17, 18, 5]. The above approaches have been able to describe images "in the wild", but they are heavily hand-designed and rigid when it comes to text generation.

A large body of work has addressed the problem of ranking descriptions for a given image [11, 8, 24].

Such approaches are based on the idea of co-embedding of images and text in the same vector space. For an image query, descriptions are retrieved which lie close to the image in the embedding space. Most closely, neural networks are used to co-embed images and sentences together [29] or even image crops and subsentences [13], but do not attempt to generate novel descriptions. In general, the above approaches cannot describe previously unseen compositions of objects, even though the individual objects might have been observed in the training data. Moreover, they avoid addressing the problem of evaluating how good a generated description is.

In this work we combine deep convolutional nets for image classification [12] with recurrent networks for sequence modeling [10], to create a single network that generates descriptions of images.
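For intuition, retrieval in such a co-embedding approach reduces to nearest-neighbor search in the shared space. In the toy sketch below, the random vectors stand in for the outputs of learned image and sentence encoders:

import numpy as np

rng = np.random.default_rng(0)
d = 64                            # shared embedding dimension
img = rng.normal(size=d)          # stand-in for a learned image embedding
sents = rng.normal(size=(5, d))   # stand-ins for candidate sentence embeddings

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank candidate descriptions by proximity to the image in the shared space.
scores = [cosine(img, s) for s in sents]
best = int(np.argmax(scores))
print(f"retrieved description #{best} (cosine similarity {scores[best]:.3f})")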

