
A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios




Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2545-2568, June 6-11, 2021. © 2021 Association for Computational Linguistics.

Michael A. Hedderich*1, Lukas Lange*1,2, Heike Adel2, Jannik Strötgen2 & Dietrich Klakow1
1 Saarland University, Saarland Informatics Campus, Germany
2 Bosch Center for Artificial Intelligence
* Equal contribution

Abstract

Deep neural networks and huge language models are becoming omnipresent in natural language applications. As they are known for requiring large amounts of training data, there is a growing body of work to improve the performance in low-resource settings. Motivated by the recent fundamental changes towards neural models and the popular pre-train and fine-tune paradigm, we survey promising approaches for low-resource natural language processing.

After a discussion about the different dimensions of data availability, we give a structured overview of methods that enable learning when training data is sparse. This includes mechanisms to create additional labeled data like data augmentation and distant supervision as well as transfer learning settings that reduce the need for target supervision. A goal of our survey is to explain how these methods differ in their requirements, as understanding them is essential for choosing a technique suited for a specific low-resource setting. Further key aspects of this work are to highlight open issues and to outline promising directions for future research.

Introduction

Most of today's research in natural language processing (NLP) is concerned with the processing of 10 to 20 high-resource languages with a special focus on English, and thus ignores thousands of languages with billions of speakers (Bender, 2019).

The rise of data-hungry deep learning systems increased the performance of NLP for high-resource languages, but the shortage of large-scale data in less-resourced languages makes their processing a challenging problem. Therefore, Ruder (2019) named NLP for low-resource scenarios one of the four biggest open problems in NLP. The umbrella term low-resource covers a spectrum of scenarios with varying resource conditions. It includes work on threatened languages, such as Yongning Na, a Sino-Tibetan language with 40k speakers and only 3k written, unlabeled sentences (Adams et al., 2017). Other languages are widely spoken but seldom addressed by NLP research. More than 310 languages exist with at least one million L1-speakers each (Eberhard et al., 2019). Similarly, Wikipedia exists for 300 languages. Technological developments for low-resource languages can help to increase the participation of the speakers' communities in a digital world.

Note, however, that tackling low-resource settings is crucial even when dealing with popular NLP languages, as low-resource settings do not only concern languages but also non-standard domains and tasks, for which even in English only little training data is available. Thus, the term language in this paper also covers such domains and tasks.

The importance of low-resource scenarios and the significant changes in NLP in the last years have led to active research on resource-lean settings, and a wide variety of techniques have been proposed. They all share the motivation of overcoming the lack of labeled data by leveraging further sources. However, these works differ greatly in the sources they rely on, e.g., unlabeled data, manual heuristics or cross-lingual alignments. Understanding the requirements of these methods is essential for choosing a technique suited for a specific low-resource setting.

Thus, one key goal of this survey is to highlight the underlying assumptions these techniques make regarding the low-resource setting. In this work, we (1) give a broad and structured overview of current efforts on low-resource NLP, (2) analyse the different aspects of low-resource settings, (3) highlight the necessary resources and data assumptions as guidance for practitioners and (4) discuss open issues and promising future directions.

Method | Requirements | Outcome | Low-resource languages | Low-resource domains
Data Augmentation | labeled data, heuristics* | additional labeled data | yes | yes
Distant Supervision | unlabeled data, heuristics* | additional labeled data | yes | yes
Cross-lingual projections | unlabeled data, high-resource labeled data, cross-lingual alignment | additional labeled data | yes | no
Embeddings & Pre-trained LMs | unlabeled data | better language representation | yes | yes
LM domain adaptation | existing LM, unlabeled domain data | domain-specific language representation | no | yes
Multilingual LMs | multilingual unlabeled data | multilingual feature representation | yes | no
Adversarial Discriminator (Section 6) | additional datasets | independent representations | yes | yes
Meta-Learning (Section 6) | multiple auxiliary tasks | better target task performance | yes | yes

Table 1: Overview of low-resource methods surveyed in this paper. The last two columns indicate whether a method is applicable to low-resource languages and to low-resource domains, respectively. * Heuristics are typically gathered manually.

Table 1 gives an overview of the surveyed techniques along with their requirements a practitioner needs to take into account.
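To make the requirement/outcome pattern of Table 1 more tangible, the following minimal sketch (our own illustration, not from the paper; the function, the gazetteer and its entries are hypothetical) shows the idea behind distant supervision: unlabeled target-language text plus a manually gathered heuristic yields additional, noisily labeled data for named entity recognition.

```python
# Illustrative sketch of distant supervision for named entity recognition.
# Assumption: a tiny, manually gathered location gazetteer acts as the heuristic;
# all names and entries below are hypothetical examples.
LOCATION_GAZETTEER = {"lagos", "kano", "abuja"}

def distant_label(tokens):
    """Assign noisy location labels to tokens that match the gazetteer."""
    return [(tok, "LOC" if tok.lower() in LOCATION_GAZETTEER else "O") for tok in tokens]

# Requirement: unlabeled target-language text ...
unlabeled_sentences = [["Musa", "travels", "from", "Kano", "to", "Abuja"]]

# ... Outcome: additional (noisily) labeled data, created without manual annotation.
auto_labeled = [distant_label(sentence) for sentence in unlabeled_sentences]
print(auto_labeled[0])
# [('Musa', 'O'), ('travels', 'O'), ('from', 'O'), ('Kano', 'LOC'), ('to', 'O'), ('Abuja', 'LOC')]
```

Labels created this way are noisy, which is why distant supervision is usually paired with noise-handling techniques; the sketch only illustrates why the method requires heuristics and unlabeled text rather than manually annotated target data.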

Related Surveys

Recent surveys cover low-resource machine translation (Liu et al., 2019) and unsupervised domain adaptation (Ramponi and Plank, 2020). Thus, we do not investigate these topics further in this paper, but focus instead on general methods for low-resource, supervised natural language processing, including data augmentation, distant supervision and transfer learning. This is also in contrast to the task-specific survey by Magueresse et al. (2020), who review highly influential work for several extraction tasks but only provide little overview of recent approaches. In Table 2 in the appendix, we list past surveys that discuss a specific method or low-resource language family for those readers who seek a more specialized overview.

Aspects of Low-Resource Settings

To visualize the variety of resource-lean scenarios, Figure 1 shows exemplarily which NLP tasks were addressed in six different languages, from basic to higher-level tasks.

[Figure 1: Supported NLP tasks in different languages (English, Yoruba, Hausa, Quechuan, Nahuatl, Estonian), grouped from basic to higher-level tasks: text processing, morphological analysis, syntactic analysis, distributional semantics, lexical semantics, relational semantics, discourse, and higher-level NLP applications. Note that the figure does not incorporate data quality or system performance. More details on the selection of tasks and languages are given in the appendix.]

While it is possible to build English NLP systems for many higher-level applications, low-resource languages lack the data foundation for this. Additionally, even if it is possible to create basic systems for tasks such as tokenization and named entity recognition for all tested low-resource languages, the training data is typically of lower quality compared to the English datasets, or very limited in size. The figure also shows that the four American and African languages with up to 60 million speakers have been addressed less than the Estonian language with 1 million speakers. This indicates the unused potential to reach millions of speakers who currently have no access to higher-level NLP applications. Joshi et al. (2020) further study the availability of resources for languages around the world.

Dimensions of Resource Availability

Many techniques presented in the literature depend on certain assumptions about the low-resource scenario. These have to be adequately defined to evaluate their applicability for a specific setting and to avoid confusion when comparing different approaches. We propose to categorize low-resource settings along the following three dimensions:

(i) The availability of task-specific labels in the target language (or target domain) is the most prominent dimension in the context of supervised learning.

Labels are usually created through manual annotation, which can be both time- and cost-intensive. Not having access to adequate experts to perform the annotation can also be an issue for some languages and domains.

(ii) The availability of unlabeled language- or domain-specific text is another factor, especially as most modern NLP approaches are based on some form of input embeddings trained on unlabeled texts.

(iii) Most of the ideas surveyed in the next sections assume the availability of auxiliary data, which can have many forms. Transfer learning might leverage task-specific labels in a different language or domain. Distant supervision utilizes external sources of information, such as knowledge bases or gazetteers. Some approaches require other NLP tools in the target language, like machine translation, to generate training data.

It is essential to consider this, as results from one low-resource scenario might not be transferable to another one if the assumptions on the auxiliary data are different.

How Low is Low-Resource?

On the dimension of task-specific labels, different thresholds are used to define low-resource settings. For part-of-speech (POS) tagging, Garrette and Baldridge (2013) limit the time of the annotators to 2 hours, resulting in up to 1-2k tokens. Kann et al. (2020) study languages that have less than 10k labeled tokens in the Universal Dependencies project (Nivre et al., 2020), and Loubser and Puttkammer (2020) report that most available datasets for South African languages have 40-60k labeled tokens. The threshold is also task-dependent, and more complex tasks might increase the resource requirements. For text generation, Yang et al. (2019) frame their work as low-resource with 350k labeled training instances.
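As a practical illustration of the three dimensions (and of how such thresholds might be used), the following sketch shows one way a practitioner could record a scenario before choosing a method. This is our own illustrative framing, not part of the survey; the dataclass, its field names and the 10k-token cutoff (borrowed from the setting studied by Kann et al. (2020)) are merely examples.

```python
from dataclasses import dataclass, field

@dataclass
class LowResourceScenario:
    """Illustrative profile of a low-resource setting along the three dimensions."""
    labeled_tokens: int     # (i) task-specific labels in the target language/domain
    unlabeled_tokens: int   # (ii) unlabeled language- or domain-specific text
    auxiliary_data: list = field(default_factory=list)  # (iii) e.g. knowledge bases, gazetteers, cross-lingual alignments

    def is_label_scarce(self, threshold: int = 10_000) -> bool:
        # Example cutoff inspired by the <10k labeled tokens studied by Kann et al. (2020);
        # as noted above, such thresholds are task-dependent.
        return self.labeled_tokens < threshold

# Hypothetical POS-tagging scenario for a low-resource language:
scenario = LowResourceScenario(
    labeled_tokens=2_000,
    unlabeled_tokens=500_000,
    auxiliary_data=["gazetteer", "multilingual pre-trained LM"],
)
print(scenario.is_label_scarce())  # True -> methods that create additional labeled data are attractive
```

Recording a scenario this way makes explicit which of the methods in Table 1 are even applicable, since each of them assumes a different combination of these resources.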

