SPICE: Semantic Propositional Image Caption Evaluation


Peter Anderson1, Basura Fernando1, Mark Johnson2, Stephen Gould1
1 The Australian National University, Canberra
2 Macquarie University, Sydney

Abstract. There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs, coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., higher system-level correlation with human judgments on the MS COCO dataset than either CIDEr or METEOR).

Furthermore, SPICE can answer questions such as "which caption-generator best understands colors?" and "can caption-generators count?"

1 Introduction

Recently there has been considerable interest in joint visual and linguistic problems, such as the task of automatically generating image captions [1, 2]. Interest has been driven in part by the development of new and larger benchmark datasets such as Flickr 8K [3], Flickr 30K [4] and MS COCO [5]. However, while new datasets often spur considerable innovation, as has been the case with the MS COCO Captioning Challenge [6], benchmark datasets also require fast, accurate and inexpensive evaluation metrics to encourage rapid progress. Unfortunately, existing metrics have proven to be inadequate substitutes for human judgment in the task of evaluating image captions [7, 3, 8].

As such, there is an urgent need to develop new automated evaluation metrics for this task [8, 9]. In this paper, we present a novel automatic image caption evaluation metric that measures the quality of generated captions by analyzing their semantic content. Our method closely resembles human judgment while offering the additional advantage that the performance of any model can be analyzed in greater detail than with other automated metrics.

Fig. 1. An illustration of our method's main principle, which uses semantic propositional content to assess the quality of image captions. Reference and candidate captions are mapped through dependency parse trees (top) to semantic scene graphs (right), encoding the objects (red), attributes (green), and relations (blue) present. Caption quality is determined using an F-score calculated over tuples in the candidate and reference scene graphs.

One of the problems with using metrics such as Bleu [10], ROUGE [11], CIDEr [12] or METEOR [13] to evaluate captions is that these metrics are primarily sensitive to n-gram overlap. However, n-gram overlap is neither necessary nor sufficient for two sentences to convey the same meaning [14].

To illustrate the limitations of n-gram comparisons, consider the following two captions (a, b) from the MS COCO dataset:

(a) A young girl standing on top of a tennis court.
(b) A giraffe standing on top of a green field.

These captions describe two very different images. However, comparing these captions using any of the previously mentioned n-gram metrics produces a high similarity score due to the presence of the long 5-gram phrase 'standing on top of a' in both captions. Now consider the captions (c, d) obtained from the same image:

(c) A shiny metal pot filled with some diced veggies.
(d) The pan on the stove has chopped vegetables in it.

These captions convey almost the same meaning, but exhibit low n-gram similarity as they have no words in common.
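To make the failure mode concrete, here is a minimal Python sketch (not part of the original paper) that counts how many candidate n-grams reappear in a second caption; it reproduces the contrast above, with captions (a) and (b) sharing many n-grams despite describing different images, while the paraphrases (c) and (d) share none.

```python
from collections import Counter

def ngrams(caption, n):
    """Multiset of word n-grams for a lowercased, unpunctuated caption."""
    tokens = caption.lower().replace(".", "").split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(cand, ref, n):
    """Fraction of candidate n-grams that also occur in the reference."""
    cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
    if not cand_ngrams:
        return 0.0
    return sum((cand_ngrams & ref_ngrams).values()) / sum(cand_ngrams.values())

a = "A young girl standing on top of a tennis court."
b = "A giraffe standing on top of a green field."
c = "A shiny metal pot filled with some diced veggies."
d = "The pan on the stove has chopped vegetables in it."

print(overlap(a, b, 1), overlap(a, b, 4))  # substantial overlap, different meanings
print(overlap(c, d, 1), overlap(c, d, 4))  # zero overlap, same meaning
```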

To overcome the limitations of existing n-gram based automatic evaluation metrics, in this work we hypothesize that semantic propositional content is an important component of human caption evaluation. That is, given an image with the caption 'A young girl standing on top of a tennis court', we expect that a human evaluator might consider the truth value of each of the semantic propositions contained therein, such as (1) there is a girl, (2) girl is young, (3) girl is standing, (4) there is a court, (5) court is tennis, and (6) girl is on top of court. If each of these propositions is clearly and obviously supported by the image, we would expect the caption to be considered acceptable, and scored accordingly. Taking this main idea as motivation, we estimate caption quality by transforming both candidate and reference captions into a graph-based semantic representation called a scene graph.
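As a hand-written illustration of this idea (one plausible encoding, not output from the paper's parser), the six propositions above can be written as object, attribute and relation tuples:

```python
# Propositions from "A young girl standing on top of a tennis court",
# written as 1-tuples (objects), 2-tuples (attributes) and 3-tuples (relations).
candidate_tuples = {
    ("girl",),                       # (1) there is a girl
    ("girl", "young"),               # (2) girl is young
    ("girl", "standing"),            # (3) girl is standing
    ("court",),                      # (4) there is a court
    ("court", "tennis"),             # (5) court is tennis
    ("girl", "on-top-of", "court"),  # (6) girl is on top of court
}
print(len(candidate_tuples))  # 6 semantic propositions
```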

The scene graph explicitly encodes the objects, attributes and relationships found in image captions, abstracting away most of the lexical and syntactic idiosyncrasies of natural language in the process. Recent work has demonstrated scene graphs to be a highly effective representation for performing complex image retrieval queries [15, 16], and we demonstrate similar advantages when using this representation for caption evaluation.

To parse an image caption into a scene graph, we use a two-stage approach similar to previous works [16-18]. In the first stage, syntactic dependencies between words in the caption are established using a dependency parser [19] pre-trained on a large dataset. An example of the resulting dependency syntax tree, using Universal Dependency relations [20], is shown in Figure 1 (top). In the second stage, we map from dependency trees to scene graphs using a rule-based system [16].
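As a rough sketch of this two-stage pipeline, the snippet below uses spaCy's pre-trained dependency parser as a stand-in for the parser of [19] (an assumption; it is not the toolchain used in the paper) and applies just two toy mapping rules, whereas the rule-based system of [16] is far more extensive.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def caption_to_tuples(caption):
    """Toy dependency-tree-to-tuple mapping: nouns become objects, adjectival
    and compound modifiers become attributes, and clausal modifiers of a noun
    (e.g. 'girl standing ...') become verb attributes. Relation tuples such as
    ('girl', 'on-top-of', 'court') are omitted here for brevity."""
    tuples = set()
    for tok in nlp(caption):
        if tok.pos_ in ("NOUN", "PROPN"):
            tuples.add((tok.lemma_,))                  # object
        if tok.dep_ in ("amod", "compound") and tok.head.pos_ == "NOUN":
            tuples.add((tok.head.lemma_, tok.lemma_))  # attribute
        if tok.dep_ in ("acl", "relcl") and tok.pos_ == "VERB":
            tuples.add((tok.head.lemma_, tok.lemma_))  # e.g. ('girl', 'stand')
    return tuples

print(caption_to_tuples("A young girl standing on top of a tennis court."))
```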

Given candidate and reference scene graphs, our metric computes an F-score defined over the conjunction of logical tuples representing semantic propositions in the scene graph (e.g., Figure 1 right). We dub this approach SPICE, for Semantic Propositional Image Caption Evaluation.
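A minimal sketch of this scoring step, assuming candidate and reference captions have already been converted to tuple sets as above; only exact tuple matches are counted in this simplified version.

```python
def tuple_f_score(candidate_tuples, reference_tuples):
    """F-score over the tuples shared by the candidate and reference scene
    graphs (simplified: only exact tuple matches are counted)."""
    if not candidate_tuples or not reference_tuples:
        return 0.0
    matched = candidate_tuples & reference_tuples
    precision = len(matched) / len(candidate_tuples)
    recall = len(matched) / len(reference_tuples)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference = {("girl",), ("girl", "young"), ("girl", "standing"),
             ("court",), ("court", "tennis"), ("girl", "on-top-of", "court")}
candidate = {("girl",), ("court",), ("girl", "on-top-of", "court")}

print(tuple_f_score(candidate, reference))  # 0.667: every candidate tuple is
                                            # supported, but half the reference
                                            # propositions are missing
```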

Using a range of datasets and human evaluations, we show that SPICE outperforms existing n-gram metrics in terms of agreement with human evaluations of model-generated captions, while offering scope for further improvements to the extent that semantic parsing techniques continue to improve. We make code available from the project page. Our main contributions are:

1. We propose SPICE, a principled metric for automatic image caption evaluation that compares semantic propositional content;
2. We show that SPICE outperforms the metrics Bleu, METEOR, ROUGE-L and CIDEr in terms of agreement with human evaluations; and
3. We demonstrate that SPICE performance can be decomposed to answer questions such as "which caption-generator best understands colors?" and "can caption-generators count?"

2 Background and Related Work

2.1 Caption Evaluation Metrics

There is a considerable amount of work dedicated to the development of metrics that can be used for automatic evaluation of image captions. Typically, these metrics are posed as similarity measures that compare a candidate sentence to a set of reference or ground-truth sentences. Most of the metrics in common use for caption evaluation are based on n-gram matching. Bleu [10] is a modified precision metric with a sentence-brevity penalty, calculated as a weighted geometric mean over different length n-grams. METEOR [13] uses exact, stem, synonym and paraphrase matches between n-grams to align sentences, before computing a weighted F-score with an alignment fragmentation penalty.
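For concreteness, a heavily simplified single-reference Bleu-style score is sketched below (uniform n-gram weights, no smoothing, not the official implementation); it shows the clipped n-gram precisions, geometric mean and brevity penalty that define the Bleu score described above.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(c.values())
        if total == 0:
            return 0.0
        precisions.append(sum((c & r).values()) / total)  # clip against reference
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any empty precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return brevity * geo_mean

print(simple_bleu("a giraffe standing on top of a green field",
                  "a young girl standing on top of a tennis court"))
# nontrivial score despite the two captions meaning different things
```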

ROUGE [11] is a package of measures for automatic evaluation of text summaries using F-measures. CIDEr [12] applies term frequency-inverse document frequency (tf-idf) weights to n-grams in the candidate and reference sentences, which are then compared by summing their cosine similarity across n-grams. With the exception of CIDEr, these methods were originally developed for the evaluation of text summaries or machine translations (MT), and were subsequently adopted for image caption evaluation.
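The sketch below illustrates the tf-idf-weighted n-gram cosine similarity at the heart of this idea; it is only in the spirit of CIDEr (a single n-gram length, no length penalty or count clipping) and assumes a small corpus of reference sets is available for the document-frequency statistics.

```python
import math
from collections import Counter

def ngram_counts(caption, n=1):
    toks = caption.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def tfidf_cosine(candidate, references, corpus, n=1):
    """Average cosine similarity between tf-idf weighted n-gram vectors of the
    candidate and each reference. 'corpus' is a list of reference sets (one per
    image) used only to estimate n-gram document frequencies."""
    df = Counter()
    for refs in corpus:
        seen = set()
        for ref in refs:
            seen.update(ngram_counts(ref, n))
        df.update(seen)

    def vector(caption):
        tf = ngram_counts(caption, n)
        return {g: c * math.log((1.0 + len(corpus)) / (1.0 + df[g]))
                for g, c in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(g, 0.0) for g, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    cand_vec = vector(candidate)
    return sum(cosine(cand_vec, vector(ref)) for ref in references) / len(references)

corpus = [
    ["a young girl standing on top of a tennis court",
     "a girl plays tennis on a court"],
    ["a shiny metal pot filled with some diced veggies",
     "the pan on the stove has chopped vegetables in it"],
]
print(tfidf_cosine("a girl on a tennis court", corpus[0], corpus, n=1))
```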

Several studies have analyzed the performance of n-gram metrics when used for image caption evaluation by measuring their correlation with human judgments of caption quality. On the PASCAL 1K dataset, Bleu-1 was found to exhibit weak or no correlation with human judgments [7]. Using the Flickr 8K dataset [3], METEOR exhibited moderate correlation, outperforming ROUGE SU-4, smoothed Bleu and Bleu-1 [8]. Using the PASCAL-50S and ABSTRACT-50S datasets, CIDEr and METEOR were found to have greater agreement with human consensus than Bleu and ROUGE [12].

Within the context of automatic MT evaluation, a number of papers have proposed the use of shallow-semantic information such as semantic role labels (SRLs) [14]. In the MEANT metric [21], SRLs are used to try to capture the basic event structure of sentences: who did what to whom, when, where and why [22]. Using this approach, sentence similarity is calculated by first matching semantic frames across sentences, starting with the verbs at their head. However, this approach does not easily transfer to image caption evaluation, as verbs are frequently absent from image captions or not meaningful (e.g., 'a very tall building with a train sitting next to it'), and this can derail the matching process.

