
Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer


Tags:

  Span, Attention


Transcription of "Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer"

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3342-3352, July 5-10, 2020. Association for Computational Linguistics.

Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer
Jianfei Yu (1), Jing Jiang (2), Li Yang (3), and Rui Xia (1)
(1) School of Artificial Intelligence, Nanjing University of Science & Technology, China
(2) School of Information Systems, Singapore Management University, Singapore
(3) DBS Bank, Singapore

Abstract: In this paper, we study Multimodal Named Entity Recognition (MNER) for social media posts. Existing approaches for MNER mainly suffer from two drawbacks: (1) despite generating word-aware visual representations, their word representations are insensitive to the visual context; (2) most of them ignore the bias brought by the visual context.

To tackle the first issue, we propose a multimodal interaction module to obtain both image-aware word representations and word-aware visual representations. To alleviate the visual bias, we further propose to leverage purely text-based entity span detection as an auxiliary module, and design a Unified Multimodal Transformer to guide the final predictions with the entity span predictions. Experiments show that our unified approach achieves new state-of-the-art performance on two benchmark datasets.

1 Introduction

Recent years have witnessed the explosive growth of user-generated content on social media platforms such as Twitter. While empowering users with rich information, the flourishing of social media also creates an emerging need to automatically extract important information from this massive volume of unstructured content.

As a crucial component of many information extraction tasks, named entity recognition (NER) aims to discover named entities in free text and classify them into pre-defined types, such as person (PER), location (LOC) and organization (ORG). Given its importance, NER has attracted much attention in the research community (Yadav and Bethard, 2018).

(* Corresponding author.)

[Figure 1: Two examples for Multimodal Named Entity Recognition (MNER). (a) [Kevin Durant PER] enters [Oracle Arena LOC] wearing off White x [Jordan MISC]. (b) Vote for [King of the Jungle MISC], [Kian PER] or [David PER]? Named entities and their entity types are highlighted.]

Although many methods coupled with either discrete shallow features (Zhou and Su, 2002; Finkel et al., 2005; Torisawa et al., 2007) or continuous deep features (Lample et al., 2016; Ma and Hovy, 2016) have shown success in identifying entities in formal newswire text, most of them perform poorly on informal social media text (e.g., tweets) due to its short length and noisiness. To adapt existing NER models to social media, various methods have been proposed to incorporate tweet-specific features (Ritter et al., 2011; Li et al., 2012, 2014; Limsopatham and Collier, 2016). More recently, as social media posts become increasingly multimodal, several studies have proposed to exploit useful visual information to improve the performance of NER (Moon et al., 2018; Zhang et al., 2018; Lu et al., 2018).

In this work, following this recent trend, we focus on Multimodal Named Entity Recognition (MNER) for social media posts, where the goal is to detect named entities and identify their entity types given a {sentence, image} pair.

For example, in Fig. 1(a), it is expected to recognize that Kevin Durant, Oracle Arena, and Jordan belong to the categories of person names (PER), place names (LOC), and other names (MISC), respectively.

Although previous work has shown the value of fusing visual information into NER (Moon et al., 2018; Zhang et al., 2018; Lu et al., 2018), existing approaches still suffer from several limitations: (1) The first obstacle lies in non-contextualized word representations, where each word is represented by the same vector regardless of the context it occurs in. However, the meanings of many polysemous entities in social media posts often rely on their context. Taking Fig. 1(a) as an example, without the context words wearing off, it is hard to figure out whether Jordan refers to a shoe brand or a person.

(2) Although most existing methods focus on modeling inter-modal interactions to obtain word-aware visual representations, the word representations in their final hidden layer are still based only on the textual context and remain insensitive to the visual context. Intuitively, the associated image often provides more context to resolve polysemous entities, and should contribute to the final word representations (e.g., in Fig. 1(b), the image can guide the final word representations of Kian and David to be closer to persons than animals). (3) Most previous approaches largely ignore the bias introduced by incorporating visual information. In most social media posts, the associated image tends to highlight only one or two entities in the sentence, without mentioning the other entities.

In these cases, directly integrating visual information will inevitably lead the model to better recognize the entities highlighted by images, but to fail on the other entities (e.g., Oracle Arena and King of the Jungle in Fig. 1).

To address these limitations, we resort to existing pre-trained contextualized word representations, and propose a unified multimodal architecture based on the Transformer (Vaswani et al., 2017), which can effectively capture inter-modality interactions and alleviate the visual bias. Specifically, we first adopt a recently pre-trained contextualized representation model (Devlin et al., 2018) as our sentence encoder, whose multi-head self-attention mechanism can guide each word to capture the semantic and syntactic dependencies in its context.
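The sentence encoder above is a pre-trained BERT-style model. The paper excerpt does not include code, so the snippet below is only a minimal sketch of the sentence-encoding step; the Hugging Face transformers library, the bert-base-cased checkpoint, and the variable names are our own illustrative assumptions, not the authors' setup.

```python
# Minimal sketch (our assumptions): contextualized word-piece representations from a
# pre-trained BERT encoder, used here as the sentence encoder of the described architecture.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")   # checkpoint is an assumption
encoder = BertModel.from_pretrained("bert-base-cased")

sentence = "Kevin Durant enters Oracle Arena wearing off White x Jordan"
inputs = tokenizer(sentence, return_tensors="pt")              # adds [CLS] and [SEP]

with torch.no_grad():
    outputs = encoder(**inputs)

# One contextualized vector per word piece: (batch, sequence_length, hidden_size).
# "Jordan" here is encoded differently than in a sentence about the person Michael Jordan,
# which is the kind of contextualization the paper relies on.
token_reprs = outputs.last_hidden_state
print(token_reprs.shape)
```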

Second, to better capture the implicit alignments between words and images, we propose a multimodal interaction (MMI) module, which couples a standard Transformer layer with a cross-modal attention mechanism to produce an image-aware word representation and a word-aware visual representation for each input word. Finally, to largely eliminate the bias of the visual context, we propose to leverage text-based entity span detection as an auxiliary task, and design a unified neural architecture based on the Transformer. In particular, a conversion matrix is designed to construct the correspondence between the auxiliary and the main tasks, so that the entity span information can be fully utilized to guide the final MNER predictions.

Experiments show that our Unified Multimodal Transformer (UMT) brings consistent performance gains over several highly competitive unimodal and multimodal methods, and outperforms the state-of-the-art on two benchmark datasets.
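The MMI module is described only at this level of detail in the excerpt, so the PyTorch sketch below is one plausible rendering rather than the authors' exact module. Assumptions on our part: the image is represented by 49 regional feature vectors (v1...v49 in Figure 2) already projected to the text hidden size, each cross-modal direction is a standard multi-head attention followed by Add & Norm and a feed-forward sublayer, and a final word-query attention aligns the visual stream back to one vector per word.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One direction of cross-modal attention: queries from modality A,
    keys/values from modality B, followed by Add & Norm and a feed-forward sublayer."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, queries, context):
        attended, _ = self.attn(queries, context, context)
        x = self.norm1(queries + attended)
        return self.norm2(x + self.ffn(x))

class MultimodalInteractionSketch(nn.Module):
    """Simplified MMI: image-aware word representations and, per input word,
    a word-aware visual representation (the final alignment step is our simplification)."""
    def __init__(self, d_model=768):
        super().__init__()
        self.word_to_image = CrossModalBlock(d_model)   # words attend to image regions
        self.image_to_word = CrossModalBlock(d_model)   # image regions attend to words
        self.align_to_words = CrossModalBlock(d_model)  # words attend to word-aware regions

    def forward(self, word_reprs, visual_reprs):
        image_aware_words = self.word_to_image(word_reprs, visual_reprs)
        word_aware_regions = self.image_to_word(visual_reprs, word_reprs)
        word_aware_visual = self.align_to_words(word_reprs, word_aware_regions)
        return image_aware_words, word_aware_visual

# Toy usage with assumed shapes: 12 word-piece vectors and 49 regional image features.
words = torch.randn(1, 12, 768)    # e.g., BERT output for "[CLS] ... [SEP]"
visual = torch.randn(1, 49, 768)   # e.g., a projected 7x7 ResNet feature map
mmi = MultimodalInteractionSketch()
img_aware_words, word_aware_visual = mmi(words, visual)
print(img_aware_words.shape, word_aware_visual.shape)   # both (1, 12, 768)
```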

The main contributions of this paper can be summarized as follows:

- We propose a multimodal Transformer model for the task of MNER, which empowers the Transformer with a multimodal interaction module to capture the inter-modality dynamics between words and images. To the best of our knowledge, this is the first work to apply the Transformer to MNER.

- Based on the above multimodal Transformer, we further design a unified architecture that incorporates a text-based entity span detection module, aiming to alleviate the bias of the visual context in MNER with the guidance of entity span predictions from this module.

2 Methodology

In this section, we first formulate the MNER task and give an overview of our method. We then describe each component in detail.

Task Formulation: Given a sentence S and its associated image V as input, the goal of MNER is to extract a set of entities from S and classify each extracted entity into one of the pre-defined types.

Following most existing work on MNER, we formulate the task as a sequence labeling problem. Let s = (s1, s2, ..., sn) denote the sequence of input words, and y = (y1, y2, ..., yn) be the corresponding label sequence, where yi ∈ Y and Y is the pre-defined label set under the BIO2 tagging schema (Sang and Veenstra, 1999).

Overall Architecture: Fig. 2(a) illustrates the overall architecture of our Unified Multimodal Transformer, which contains three main components: (1) representation learning for the unimodal inputs; (2) a multimodal Transformer for MNER; and (3) a unified architecture with an auxiliary entity span detection (ESD) module.

[Figure 2(a): architecture diagram. Recoverable elements: a BERT encoder over the textual input ([CLS] ... [SEP]) producing contextualized embeddings; visual features v1...v49 for the image; Transformer layers with self-attention (Q, K, V); a multimodal interaction module with cross-modal attention, Add & Norm, and feed-forward sublayers; a conversion matrix; and an auxiliary entity span detection module alongside the main MNER output (e.g., B-PER).]
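To make the sequence-labeling formulation concrete, the snippet below writes out BIO2 labels for the sentence in Fig. 1(a), and shows one hypothetical way to relate the auxiliary ESD label set {B, I, O} to the full MNER label set Y with a fixed binary conversion matrix. The paper only states that a conversion matrix constructs the correspondence between the two tasks; the particular matrix built here is our assumption, not the authors' construction.

```python
import numpy as np

# BIO2 labels for Fig. 1(a): "[Kevin Durant PER] enters [Oracle Arena LOC]
# wearing off White x [Jordan MISC]".
words  = ["Kevin", "Durant", "enters", "Oracle", "Arena", "wearing", "off", "White", "x", "Jordan"]
labels = ["B-PER", "I-PER",  "O",      "B-LOC",  "I-LOC", "O",       "O",   "O",     "O", "B-MISC"]

# Full MNER label set Y (BIO2 over PER/LOC/ORG/MISC) and the coarser ESD label set,
# which only marks entity spans without types.
mner_labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-MISC", "I-MISC"]
esd_labels = ["O", "B", "I"]

# Hypothetical binary conversion matrix: entry (i, j) is 1 if ESD label i is compatible
# with MNER label j (e.g., ESD "B" is compatible with B-PER, B-LOC, B-ORG, B-MISC).
conversion = np.zeros((len(esd_labels), len(mner_labels)), dtype=int)
for j, y in enumerate(mner_labels):
    coarse = "O" if y == "O" else y.split("-")[0]   # "B-PER" -> "B", "I-LOC" -> "I"
    conversion[esd_labels.index(coarse), j] = 1
print(conversion)
```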

