
Predicting Flu Trends using Twitter Data


Harshavardhan Achrekar ∗, Avinash Gandhe †, Ross Lazarus ‡, Ssu-Hsin Yu †, Benyuan Liu ∗
∗ Department of Computer Science
† Scientific Systems Company Inc
‡ Department of Population Medicine


Department of Computer Science, University of Massachusetts Lowell, Lowell, MA 01854
Scientific Systems Company Inc, 500 West Cummings Park, Woburn, MA 01801
Department of Population Medicine, Harvard Medical School, Boston, MA 02101

Abstract

Reducing the impact of seasonal influenza epidemics and other pandemics such as H1N1 is of paramount importance for public health authorities. Studies have shown that effective interventions can be taken to contain epidemics if early detection can be made. The traditional approach employed by the Centers for Disease Control and Prevention (CDC) includes collecting influenza-like illness (ILI) activity data from sentinel medical practices.

Typically there is a 1-2 week delay between the time a patient is diagnosed and the moment that data point becomes available in aggregate ILI reports. In this paper we present the Social Network Enabled Flu Trends (SNEFT) framework, which monitors messages posted on Twitter with a mention of flu indicators to track and predict the emergence and spread of an influenza epidemic in a population. Based on the data collected during 2009 and 2010, we find that the volume of flu related tweets is highly correlated with the number of ILI cases reported by CDC. We further devise auto-regression models to predict the ILI activity level in a population. The models predict data collected and published by CDC, as the percentage of visits to sentinel physicians attributable to ILI in successive weeks.

We test models with previous CDC data, with and without measures of Twitter data, showing that Twitter data can substantially improve the models' prediction accuracy. Twitter data provides real-time assessment of ILI activity.

I. INTRODUCTION

Seasonal influenza epidemics result in about three to five million cases of severe illness and about 250,000 to 500,000 deaths worldwide each year [1]. Reducing the impact of seasonal epidemics and pandemics such as the H1N1 influenza is of paramount importance for public health authorities. Studies have shown that preventive measures can be taken to contain epidemics if an early detection is made during the germination of an epidemic [2], [3]. Therefore, it is important to track and predict the emergence and spread of flu in the population.

The Centers for Disease Control and Prevention (CDC) monitors influenza-like illness (ILI) cases by collecting data from sentinel medical practices, collating the reports and publishing them on a weekly basis.

As diagnoses are made and reported by doctors, the system is almost entirely manual, resulting in a 1-2 week delay between the time a patient is diagnosed and the moment that data point becomes available in aggregate ILI reports. Public health authorities need to be forewarned at the earliest time to ensure effective preventive intervention, and this leads to the critical need for more efficient and timely methods of estimating influenza activity.

Innovative surveillance systems have been proposed to capture health-seeking behavior and transform it into influenza activity indicators. These include monitoring call volumes to telephone triage advice lines [4], over-the-counter drug sales [5], and patient visit logs to physicians for flu shots. Understanding that human interaction on the web is a valuable source for sensing health trends, Google Flu Trends utilizes aggregated web search queries pertaining to influenza to build a comprehensive model that can estimate nationwide as well as state-level ILI activity [6].

In this paper we investigate the use of a novel data source, namely messages posted on Twitter, to track and predict the level of ILI activity in a population. Twitter has become a popular platform for people to share news and events in their daily lives, including their mood, health status, travel, entertainment, etc. Data collected from Twitter represents a previously untapped data source for detecting the onset of a flu epidemic and predicting its spread. Our approach treats Twitter users as sensors, and the collective message exchanges with a mention of flu, such as "I got Flu" and "down with swine flu", as early indicators and robust predictors of influenza. Although many of these data are noisy individually, in aggregate they reveal the underlying epidemic pattern in time and space.

ILI activity is known to follow a seasonal pattern, and successive weekly counts tend to be highly correlated.

Using both the information in previous weeks of CDC data and Twitter activity measures, we may be able to take advantage of the additional real-time information about ILI activity present in Twitter data to help predict the underlying ILI activity.

We collected tweets and the location information of Twitter users who mentioned flu descriptors in their tweets, starting from October 18, 2009 until present. Until October 23, 2010 we have collected … million tweets from … million unique users. Since CDC does not provide weekly ILI activity data for the period from May 23, 2010 to October 9, 2010, we have 31 weeks of CDC data for the Twitter dataset. In the analysis, retweets of previous posts and tweets from the same users within a certain period are removed from the datasets, as these tweets do not represent new ILI cases.
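The filtering and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the SNEFT implementation: the `Tweet` record, the keyword pattern, the "RT @" retweet heuristic, and the 7-day window are all assumptions made for the example, since the text only says "flu descriptors" and "a certain period".

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical flu descriptors; the paper's actual keyword list is not given here.
FLU_PATTERN = re.compile(r"\b(flu|influenza|h1n1)\b", re.IGNORECASE)


@dataclass
class Tweet:
    """Hypothetical minimal tweet record used only for this sketch."""
    user_id: str
    text: str
    created_at: datetime
    is_retweet: bool = False


def mentions_flu(text):
    """Return True if the text contains one of the assumed flu descriptors."""
    return FLU_PATTERN.search(text) is not None


def clean_tweets(tweets, window=timedelta(days=7)):
    """Keep flu-related tweets, dropping retweets and repeat posts.

    A tweet is dropped if it is a retweet, if it does not mention flu,
    or if the same user already has a kept tweet less than `window`
    earlier (the window length is an assumption).
    """
    last_kept = {}  # user_id -> timestamp of that user's last kept tweet
    kept = []
    for tweet in sorted(tweets, key=lambda t: t.created_at):
        if tweet.is_retweet or tweet.text.lower().startswith("rt @"):
            continue
        if not mentions_flu(tweet.text):
            continue
        previous = last_kept.get(tweet.user_id)
        if previous is not None and tweet.created_at - previous < window:
            continue
        last_kept[tweet.user_id] = tweet.created_at
        kept.append(tweet)
    return kept
```

Matching on whole words avoids counting unrelated words such as "fluent", at the cost of missing misspellings; a production filter would likely need a richer descriptor list.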

We found that the number of flu related tweets on Twitter is highly correlated with the CDC data, with a Pearson correlation coefficient of …. We consider auto-regression models that predict future health system load, such as the number of ILI cases in a population next week. The models predict data collected and published by CDC, as the percentage of visits to sentinel physicians attributable to ILI in subsequent weeks. We test these models with previous CDC data, with and without measures of Twitter data, showing that Twitter data can substantially improve the model fits. Twitter data provides real-time assessment of flu and can be particularly useful when the CDC data for true ILI activity is not available due to the delay in the CDC data collection.

The rest of this paper is organized as follows. Section II describes the related work that harnesses the collective intelligence of OSN users in an effort to explain, and in some cases predict, real-world outcomes.
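To make the comparison of models "with and without measures of Twitter data" concrete, here is a minimal sketch of a Pearson correlation and a first-order auto-regression fitted by ordinary least squares, with an optional exogenous tweet-volume term. The model order, the use of the same week's tweet volume as the extra input, the function names, and the synthetic numbers are all assumptions for illustration; they are not the paper's model specification or data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_arx(ili, tweets=None):
    """Fit ili[t] ~ c + a*ili[t-1] (+ b*tweets[t]) via the normal equations.

    Returns (coefficients, in-sample RMSE). With tweets=None this is a
    plain AR(1) model; otherwise tweet volume enters as an exogenous input.
    """
    rows = []
    for t in range(1, len(ili)):
        row = [1.0, ili[t - 1]]
        if tweets is not None:
            row.append(tweets[t])
        rows.append(row)
    y = ili[1:]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    resid = [yi - sum(c * v for c, v in zip(coef, r)) for r, yi in zip(rows, y)]
    rmse = math.sqrt(sum(e * e for e in resid) / len(resid))
    return coef, rmse


# Synthetic weekly series, invented purely to exercise the functions.
tweets = [3.0, 5.0, 8.0, 13.0, 9.0, 6.0, 4.0, 2.0, 7.0, 11.0]
ili = [1.0]
for t in range(1, len(tweets)):
    ili.append(0.5 * ili[t - 1] + 0.2 * tweets[t])

_, rmse_ar = fit_arx(ili)            # CDC history only
_, rmse_arx = fit_arx(ili, tweets)   # CDC history plus tweet volume
print("correlation:", pearson(tweets, ili))
print("RMSE without / with tweets:", rmse_ar, rmse_arx)
```

Because the second model nests the first, its in-sample error can never be larger; a fair evaluation on real data would compare out-of-sample predictions instead.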

In Section III, we present our data collection methodology for extracting relevant information from Twitter in the SNEFT architecture. Detailed data analysis is performed in Section IV to establish correlation with CDC data. In Section V we present statistical models to predict ILI activity and evaluate their performance. Finally, we conclude in Section VI and provide acknowledgements in Section VII.

II. RELATED WORK

A number of studies have been conducted on different forms of social networks like Facebook, Flickr, LinkedIn, Wikipedia and YouTube. Sitaram et al. demonstrated how social media content like chatter from Twitter can be used to predict real-world outcomes, such as forecasting box-office revenues for movies [7]. Sakaki et al. used a probabilistic spatio-temporal model to build an autonomous earthquake reporting system in Japan, using Twitter users as sensors and applying Kalman filtering and particle filtering for location estimation [8].

Meme tracking in news cycles, as explained by Leskovec et al., was an attempt to model information diffusion in social media like blogs and to track the handoff from professional news media to social networks [9].

Twitter has been used for real-time notifications such as large-scale fire emergencies, downtime on services provided by content providers [10] and live traffic updates. There have been efforts in utilizing Twitter data for predicting national mood [11], currency tracing and performing market and risk analysis. Tweetminster, a media utility tool designed to make UK politics open and social, analyses political tweets to establish the correlations between buzz on Twitter and election results.

Ginsberg et al., in their paper on estimating flu trends, proposed that the relative frequencies of certain search terms are good indicators of the percentage of physician visits, and established a linear correlation to weekly published ILI percentages between 2003 and 2007 for all nine regions identified by CDC [6].

In June 2010, we introduced the SNEFT architecture as a continuous data collection engine which combines the detection and prediction capability on social networks in discovering real-world flu trends [12].

III. DATA COLLECTION

We describe our data collection methodology by introducing the Social Network Enabled Flu Trends (SNEFT) architecture, providing a description of our datasets, exploring strategies for data cleaning, and applying filtering techniques in order to perform quantitative spatio-temporal analysis on the collected data.

A. SNEFT Architecture

We propose the SNEFT architecture in Figure 1, along with its crawler, predictor and detector components, as our solution to track and predict flu activity with a certain accuracy.

Fig. 1. The system architecture

CDC reports and other influenza related data are downloaded into the ILI data database from its website.

