Text Analysis in R - Ken Benoit's website


Text Analysis in R

Kasper Welbers (a), Wouter Van Atteveldt (b), and Kenneth Benoit (c)

(a) Institute for Media Studies, University of Leuven, Leuven, Belgium; (b) Department of Communication Science, VU University Amsterdam, Amsterdam, The Netherlands; (c) Department of Methodology, London School of Economics and Political Science, London, UK

ABSTRACT

Computational text analysis has become an exciting research field with many applications in communication research. It can be a difficult method to apply, however, because it requires knowledge of various techniques, and the software required to perform most of these techniques is not readily available in common statistical software packages.

In this teacher's corner, we address these barriers by providing an overview of general steps and operations in a computational text analysis project, and demonstrate how each step can be performed using the R statistical software. As a popular open-source platform, R has an extensive user community that develops and maintains a wide range of text analysis packages. We show that these packages make it easy to perform advanced text analytics.

With the increasing importance of computational text analysis in communication research (Boumans & Trilling, 2016; Grimmer & Stewart, 2013), many researchers face the challenge of learning how to use advanced software that enables this type of analysis.

Currently, one of the most popular environments for computational methods and the emerging field of data science¹ is the R statistical software (R Core Team, 2017). However, for researchers who are not well-versed in programming, learning how to use R can be a challenge, and performing text analysis in particular can seem daunting. In this teacher's corner, we show that performing text analysis in R is not as hard as some might fear. We provide a step-by-step introduction to the use of common techniques, with the aim of helping researchers get acquainted with computational text analysis in general, as well as getting a start at performing advanced text analysis studies in R.

R is a free, open-source, cross-platform programming environment.

In contrast to most programming languages, R was specifically designed for statistical analysis, which makes it highly suitable for data science applications. Although the learning curve for programming with R can be steep, especially for people without prior programming experience, the tools now available for carrying out text analysis in R make it easy to perform powerful, cutting-edge text analytics using only a few simple commands. One of the keys to R's explosive growth (Fox & Leanage, 2016; TIOBE, 2017) has been its densely populated collection of extension software libraries, known in R terminology as packages, supplied and maintained by R's extensive user community.

Each package extends the functionality of the base R language and core packages, and in addition to functions and data must include documentation and examples, often in the form of vignettes demonstrating the use of the package. The best-known package repository, the Comprehensive R Archive Network (CRAN), currently has over 10,000 packages that are published, and which have gone through an extensive screening for procedural conformity and cross-platform compatibility before being accepted by the archive. R thus features a wide range of inter-compatible packages, maintained and continuously updated by scholars, practitioners, and projects such as RStudio and rOpenSci. Furthermore, these packages may be installed easily and safely from within the R environment using a single command. R thus provides a solid bridge for developers and users of new analysis tools to meet, making it a very suitable programming environment for scientific analysis.

CONTACT: Kasper Welbers, Institute for Media Studies, University of Leuven, Sint-Andriesstraat 2, box 15530, Antwerp 2000. Versions of one or more of the figures in the article can be found online.

¹ The term data science is a popular buzzword related to data-driven research and big data (Provost & Fawcett, 2013).

© 2017 Taylor & Francis Group, LLC. Communication Methods and Measures, 2017, Vol. 11, No. 4, 245–265.

Text analysis in particular has become well established in R.

There is a vast collection of dedicated text processing and text analysis packages, from low-level string operations (Gagolewski, 2017) to advanced text modeling techniques such as fitting Latent Dirichlet Allocation models (Blei, Ng, & Jordan, 2003; Roberts et al., 2014), nearly 50 packages in total at our last count. Furthermore, there is an increasing effort among developers to cooperate and coordinate, such as the rOpenSci special interest group. One of the main advantages of performing text analysis in R is that it is often possible, and relatively easy, to switch between different packages or to combine them.

Recent efforts among the R text analysis developers' community are designed to promote this interoperability to maximize flexibility and choice among users. As a result, learning the basics for text analysis in R provides access to a wide range of advanced text analysis features.

Structure of this Teacher's Corner

This teacher's corner covers the most common steps for performing text analysis in R, from data preparation to analysis, and provides easy to replicate example code to perform each step. The example code is also digitally available in our online appendix, which is updated over time. We focus primarily on bag-of-words text analysis approaches, meaning that only the frequencies of words per text are used and word positions are ignored.
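The bag-of-words assumption can be illustrated with a minimal base R sketch. It is not taken from the article's online appendix; the helper name count_words and the two toy sentences are hypothetical, and splitting on single spaces is a deliberate simplification.

```r
# Hypothetical helper: split a text on single spaces and count each word
count_words <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  table(words)
}

# Two toy sentences that contain the same words in a different order
counts_a <- count_words("dog bites man")
counts_b <- count_words("man bites dog")

# Under a bag-of-words representation the two texts cannot be told apart,
# because only the word frequencies are retained
same_bag <- identical(counts_a, counts_b)
print(same_bag)
```

The two sentences mean different things, yet their frequency tables coincide, which is exactly the information that a bag-of-words approach gives up.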

Although this drastically simplifies text content, research and many real-world applications show that word frequencies alone contain sufficient information for many types of analysis (Grimmer & Stewart, 2013).

Table 1 presents an overview of the text analysis operations that we address, categorized in three sections. In the data preparation section we discuss five steps to prepare texts for analysis. The first step, importing text, covers the functions for reading texts from various types of file formats (e.g., txt, csv, pdf) into a raw text corpus in R. The steps string operations and preprocessing cover techniques for manipulating raw texts and processing them into tokens (i.e., units of text, such as words or word stems).
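As a rough sketch of the importing step, the following base R code reads a txt file and a csv file into a single named character vector that serves as a raw text corpus. It is an illustration under stated assumptions, not the article's own code: the files are created in a temporary directory, the column name text and the document names are hypothetical, and pdf input is left out because it needs an add-on package.

```r
# Hypothetical input files, written to a temporary directory so that the
# sketch is self-contained
txt_path <- tempfile(fileext = ".txt")
writeLines(c("First line of a plain text file.", "Second line."), txt_path)

csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2, text = c("A short text.", "Another text.")),
          csv_path, row.names = FALSE)

# txt: read all lines and join them into one string per file
txt_doc <- paste(readLines(txt_path), collapse = "\n")

# csv: one text per row, stored in the (assumed) column "text"
csv_docs <- read.csv(csv_path, stringsAsFactors = FALSE)

# Raw text corpus: a named character vector with one element per document
corpus <- c(file1 = txt_doc,
            setNames(csv_docs$text, paste0("row", csv_docs$id)))
print(names(corpus))
```

A plain character vector is enough for the string operations and preprocessing steps; dedicated packages typically wrap the same idea in a corpus object with document-level metadata.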

The tokens are then used for creating the document-term matrix (DTM), which is a common format for representing a bag-of-words type corpus, that is used by many R text analysis packages. Other non-bag-of-words formats, such as the tokenlist, are briefly touched upon in the advanced topics section. Finally, it is a common step to filter and weight the terms in the DTM. These steps are generally performed in the presented sequential order (see Figure 1 for conceptual illustration). As we will show, there are R packages that provide convenient functions that manage multiple data preparation steps in a single line of code. Still, we first discuss and demonstrate each step separately to provide a basic understanding of the purpose of each step, the choices that can be made and the pitfalls to watch out for.

The analysis section discusses four text analysis methods that have become popular in communication research (Boumans & Trilling, 2016) and that can be performed with a DTM as input. Rather than being competing approaches, these methods have different advantages and disadvantages, so choosing the best method for a study depends largely on the research question, and

² Other programming environments have similar archives, such as pip for Python.
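The sequence from tokens to a filtered and weighted DTM can be sketched in base R as follows. This is a minimal illustration with a hypothetical three-document corpus; the filter threshold (terms occurring in at least two documents) and the weighting scheme (within-document proportions) are assumptions chosen for simplicity, and dedicated text analysis packages offer more efficient sparse matrix implementations.

```r
# Hypothetical toy corpus: a named character vector of raw texts
texts <- c(d1 = "The cat sat on the mat.",
           d2 = "The dog sat on the log.",
           d3 = "The cat and the dog.")

# Preprocessing: lowercase, remove punctuation, split into word tokens
clean  <- gsub("[[:punct:]]", "", tolower(texts))
tokens <- strsplit(clean, "[[:space:]]+")

# Document-term matrix: one row per document, one column per unique term
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# Filter: keep only terms that occur in at least two documents
doc_freq <- colSums(dtm > 0)
dtm_filtered <- dtm[, doc_freq >= 2, drop = FALSE]

# Weight: convert raw counts to proportions within each document
dtm_weighted <- dtm_filtered / rowSums(dtm_filtered)
print(round(dtm_weighted, 2))
```

Each row of the weighted matrix sums to one, so documents of different lengths become comparable; other weighting schemes can be substituted at the last step without changing the rest of the pipeline.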

