Text Mining for Sentiment Analysis of Twitter Data

Shruti Wakade, Chandra Shekar, Kathy J. Liszka and Chien-Chung Chan
The University of Akron, Department of Computer Science

Abstract

Text messages express the states of mind of a large population on earth. From the perspective of decision makers, this collection of messages provides a precious source of information. In this paper, we present the use of Weka data mining tools to extract useful information for classifying the sentiment of tweets collected from Twitter. The results of tweet mining are represented as decision trees that can be used for judging the sentiment of new tweets. We introduce a new method for preprocessing tweets for decision tree learning. We evaluate the impact of tweets containing emoticons on the classification process. The method is applied to perform sentiment analysis on tweets related to iPhone and Microsoft.

Experimental results show that decision tree classifiers outperformed the naïve Bayes algorithm.

Keywords: geometric tiling, minimal covering sets, wireless sensor networks

1. Introduction

Billions of dollars are spent worldwide each year on market analysis. Data-driven decisions are a powerful and necessary method of conducting business. Imagine how useful it would be for a company to know how its products are viewed in the market, or for a political candidate to leverage their public image in a campaign, without surveying people directly. One way to accomplish this is by collecting public sentiment on Internet microblogging sites such as Twitter, Tumblr, Plurk, Pownce, and Jaiku. These are the top five social networking forums that provide a quick and easy means for people to express themselves while creating a valuable pool of data for those who are interested in those opinions.

Messages that users create are saved in their personal profile and forwarded to others in their circle of friends. The information may be kept private within that list, or made public and unrestricted. Opinion mining, sentiment analysis, and subjectivity analysis are related fields that share the common goal of developing and applying computational techniques to process collections of opinionated texts or reviews. Other research goals are to generate heuristics or tools that can be used to classify, rank, or summarize sentiments toward certain objects, events, or topics. For example, these tools can be used to determine a thumbs-up or thumbs-down vote for specific movies from their reviews, or to predict whether opinion is in favor of or against certain products or events. In this paper, we look specifically at Twitter data, called tweets, to perform clustering and sentiment analysis.

Tweets are limited to 140 characters. Figure 1 shows an actual tweet taken from Twitter. This type of cyber-communication is commonly called microblogging. Sentiment analysis is a field of research that determines whether a text expresses a favorable or unfavorable reaction.

Figure 1. Example tweet.

Our approach is to use the Weka [1] data mining software with a positive and negative word set and compare it to a second word set provided by Twitter. We are interested in the impact of emoticons added to both of these sets. In section two, we discuss previous research in the field of sentiment analysis on text. Section three presents the problem statement and setup. In section four, we describe the preprocessing steps performed on the data and the feature selection used. Section five presents the experiments. Section six contains a discussion of the results, and we conclude in section seven.

2. Previous Work

There is a small but growing body of research specifically on opinion mining from microblogging data. Kim et al. give a compelling case for using Twitter lists as a corpus in sentiment analysis [2]. In this context, lists are groups of people who share a common interest such as music. They show that even though tweets are brief, they contain enough information to express identifiable characteristics, interests and sentiments. The seminal work by Pang et al. shows that machine learning is a viable tool for sentiment analysis, using movie reviews as a corpus [3]. They apply three standard machine learning algorithms: naïve Bayes, maximum entropy (MaxEnt), and support vector machines (SVMs). Their positive and negative word lists were relatively small, from five to eleven words in different experiments, but the results are nonetheless good.

More notably, they bring to light the difficulty of the task compared to topic-based classification. The work of Go et al. is very similar to that of Pang et al. in using the same three classifiers, but it uses microblogging data from Twitter rather than longer movie reviews [4]. The results are remarkably similar, which suggests that these tools for sentiment analysis carry over from longer text blocks to tweets restricted to 140 characters. That work excludes neutral sentiments from its corpora. Only positive and negative tweets are collected, mined through queries in the Twitter search utility using common emoticons. Once collected, the emoticons are removed from the tweets before training with the classifiers. Manually collected test data retains emoticons, if present. Pak and Paroubek [5] collect data from Twitter, filter it, and then classify it as positive or negative by the use of popular emoticons (smiley faces, sad faces, and variations).
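The emoticon-based labeling described above can be illustrated with a minimal Python sketch. The emoticon sets and the function names `label_by_emoticon` and `strip_emoticons` are assumptions made for illustration; they are not taken from the cited papers, whose exact emoticon lists are not reproduced here.

```python
# Hypothetical emoticon sets, chosen only for illustration.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-(", "=("}
ALL_EMOTICONS = POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS


def label_by_emoticon(tweet):
    """Return 'positive' or 'negative', or None if the signal is absent or mixed."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos == has_neg:
        # Either no emoticon at all or both kinds: no usable label.
        return None
    return "positive" if has_pos else "negative"


def strip_emoticons(tweet):
    """Remove emoticon tokens so a classifier cannot simply memorize them."""
    return " ".join(t for t in tweet.split() if t not in ALL_EMOTICONS)


tweet = "love my new phone :)"
print(label_by_emoticon(tweet), "|", strip_emoticons(tweet))
```

Discarding tweets with mixed or missing emoticons is one possible design choice in this sketch; it keeps the automatically assigned labels unambiguous at the cost of a smaller training set.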

Neutral tweets are collected from newspaper accounts to round out the corpora. An analysis indicates that the distribution of word frequencies in the collection is normal. They apply a naïve Bayes classifier to test the posts. Their best results come from the experiments using bigrams. This is contrary to the findings of Pang et al., but may easily be explained by the nature of the differing corpora. Movie reviews may contain more words, and users may take more time to think about their post, whereas tweeters tend to give lightning-quick, brief snapshots of a thought sent from a cell phone or other small device. In fact, one very interesting observation that this paper makes is the amount of slang and the frequent misspellings in tweets. This may have minor effects on any opinion analysis applied to microblogging data.

Read performs sentiment analysis on Usenet group data and movie reviews. He uses the naïve Bayes and SVM classifiers [6]. His corpus is created using emoticons to identify positive and negative texts. No neutral or objective texts are included in either the training or testing data sets. Read also looks at topic, domain, and temporal dependency classifications.

To summarize, research parameters tend to be grouped as follows:

- Classifier used
  - Naïve Bayes
  - Maximum entropy
  - Support vector machine
- Text blocks versus microblogging data
- Positive/negative word list source and size
- Use of neutral/objective data
  - In the training data set
  - In the testing data set
- Use of emoticons
  - In the training data set
  - In the testing data set
- Use of unigrams, bigrams, or both
- Use of word presence versus word frequency

3. Problem Formulation

Sentiment analysis can be viewed as an application of text categorization, which dates back to the work on probabilistic text classification by Maron [7].

The main task of text classification is to label texts with a predefined set of categories. Text categorization has been applied in other areas such as document indexing, document filtering, and word sense disambiguation, as surveyed by Sebastiani [8]. One of the central issues in text classification is how to represent the content of a text so as to facilitate effective classification. In information retrieval research, one of the most popular and successful methods is to represent a text by the collection of terms that appear in it. The similarity between documents is defined using the term frequency-inverse document frequency (tf-idf) measure [9]. In this approach, the terms or features used to represent a text are determined by taking the union of all terms that appear in the collection of texts used to derive the classifier.
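As a rough illustration of the term-based representation described above, the following Python sketch computes tf-idf weights and a cosine similarity between documents. The specific weighting (relative term frequency multiplied by log(N/df)) is one common variant and is an assumption here; the text does not state which formulation is meant. The function names are likewise hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: relative term frequency times log(N / df).
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [["good", "phone"], ["bad", "phone"], ["good", "good", "camera"]]
vecs = tfidf_vectors(docs)
vocabulary = set().union(*docs)  # the feature set is the union of all terms
print(len(vocabulary), cosine(vecs[0], vecs[1]))
```

Because the feature set is the union of all terms in the collection, the vocabulary grows with the corpus, which is why the sparse dictionary representation is used in this sketch.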

This usually results in a large number of features. Therefore, dimensionality reduction is a related issue that needs to be addressed. The problem we consider in this paper is as follows. Given a collection of tweets related to a specific subject, how do we come up with a classifier for labeling the sentiment of new tweets as positive, negative, or neutral? We start by collecting related tweets using a query containing words or phrases denoting the subject of interest. Since tweets may belong to multiple subjects, the membership of a tweet in a specific subject is not necessarily certain. In this work, we do not consider fuzzy membership. In order to apply data mining tools to generate a classifier, we need to determine a list of features to represent tweets and assign a sentiment label to each tweet. Instead of using all terms that appear in the collected tweets, we adopt a list of positive and negative words, together with one feature indicating that a positive emoticon is present and one indicating that a negative emoticon is present, to form the list of features.
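A minimal Python sketch of such a reduced feature representation follows. The word lists, emoticon sets, and the choice of binary presence (rather than frequency) are placeholders assumed for illustration; they are not the lists or encoding used in the experiments.

```python
# Hypothetical placeholder lists; the actual word sets are not given in this section.
POSITIVE_WORDS = ["good", "great", "love"]
NEGATIVE_WORDS = ["bad", "hate", "terrible"]
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-("}

FEATURE_NAMES = POSITIVE_WORDS + NEGATIVE_WORDS + ["pos_emoticon", "neg_emoticon"]


def tweet_to_features(tweet):
    """Map a tweet to a fixed-length binary vector aligned with FEATURE_NAMES."""
    raw_tokens = tweet.split()
    # Lowercase and trim trailing punctuation for word matching only;
    # emoticons are matched on the raw tokens so that ':D' keeps its case.
    words = {t.lower().strip(".,!?") for t in raw_tokens}
    features = [1 if w in words else 0 for w in POSITIVE_WORDS + NEGATIVE_WORDS]
    features.append(1 if any(t in POSITIVE_EMOTICONS for t in raw_tokens) else 0)
    features.append(1 if any(t in NEGATIVE_EMOTICONS for t in raw_tokens) else 0)
    return features


for name, value in zip(FEATURE_NAMES, tweet_to_features("I love my iPhone :D")):
    print(name, value)
```

The vector length is fixed by the word lists plus the two emoticon indicators, regardless of how many distinct terms appear in the collected tweets, which keeps the feature space small compared with the union-of-all-terms approach.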
