
Find Me the Right Content! Diversity-Based Sampling of ...


Find Me the Right Content! Diversity-Based Sampling of Social Media Spaces for Topic-Centric Search
Munmun De Choudhury†, Scott Counts‡, Mary Czerwinski‡
†Arizona State University, Tempe, AZ 85281, USA
‡Microsoft Research, Redmond, WA 98052, USA
†munmun@asu.edu, ‡{counts, marycz}@microsoft.com


Transcription of Find Me the Right Content! Diversity-Based Sampling of ...

Abstract

Social media and networking websites, such as Twitter and Facebook, generate large quantities of information and have become mechanisms for real-time content dissipation to users. An important question that arises is: how do we sample such social media information spaces in order to deliver relevant content on a topic to end users? Notice that these large-scale information spaces are inherently 'diverse', featuring a wide array of attributes such as location, recency, degree of diffusion effects in the network and so on. Naturally, for the end user, different levels of diversity in social media content can significantly impact the information consumption experience: low diversity can provide focused content that may be simpler to understand, while high diversity can increase breadth in the exposure to multiple opinions and perspectives. Hence to address our research question, we turn to diversity as a core concept in our proposed sampling methodology. Here we are motivated by ideas in the compressive sensing literature and utilize the notion of sparsity in social media information to represent such large spaces via a small number of basis components. Thereafter we use a greedy iterative clustering technique on this transformed space to construct samples matching a desired level of diversity. Based on Twitter Firehose data, we demonstrate quantitatively that our method is robust, and performs better than other baseline techniques over a variety of trending topics. In a user study, we further show that users find samples generated by our method to be more interesting and subjectively engaging compared to techniques inspired by state-of-the-art systems, with improvements in the range of 15-45%.

Part of this work was performed while the author was an intern at Microsoft Research, Redmond.
Copyright 2011, Association for the Advancement of Artificial Intelligence. All rights reserved.

1 Introduction

The advent of the Web technology has given considerable leeway to the creation of vast quantities of user-generated information content online. Such information often manifests itself in social media spaces, via status updates on Facebook, tweets on Twitter, and news items on Digg. In almost all of these websites, while end users can 'broadcast' information that interests them, they can also 'listen' to their peers by subscribing to their respective content streams. Consequently, these avenues have emerged as means of real-time content dissemination to users for timely happenings like the BP oil spill, the elections in Iran, the earthquake in Haiti, or the release of the Windows Phone.

However, with the content from half a billion Facebook users or with more than 60 million tweets generated every day, the domain of topic-centric search of social media content faces tremendous challenges. How do we identify the right content from these spaces, that can best satisfy an end user in the context of real-time search on a topic?

Our answer to this question lies in devising methods geared towards effective sampling of social media spaces. The problem of sampling information signals has been studied extensively in the information theory literature (Cover and Thomas 1991); a noted method being the celebrated Nyquist-Shannon theorem that provides a technique for sampling bandlimited signals. However, this sampling technique does not apply to social media information spaces, because (1) they do not have a notion of bandwidth, and (2) they inherently feature a wide ensemble of attributes. For example, tweets (on Twitter) can be 'rich' in themes (e.g., political and economic perspectives on the same topic), can be posted by individuals in disparate geographic locations, can be updates from a celebrity, or can be conversational between two or more individuals with conflicting opinions. In essence, social media information spaces are of high dimensionality: a characteristic property we refer to as diversity.

In order to leverage rather than be limited by this diversity when sampling, we first consider the wide variety of ways an end user can best use this diversity property, when searching for topic-centric social media content. To take an example, a user searching for content on Twitter after the release of the Windows Phone in November 2010 might intend to find homogenous samples (low diversity), say, tweets posted by the technical experts. In another situation, if she is interested in learning about the oil spill in the Gulf of Mexico back in 2010, an appropriate sample would be heterogenous in terms of the mixing of its attributes (high diversity). It will therefore span over attributes like author, geography and themes like Politics or Finance.

Generalizing, we contend that generating samples that align to a desired diversity level can have practical utility implications to the end user in a search context (Brehm 1956; Ziegler et al. 2005; Radlinski and Dumais 2006). Content of low diversity, or homogenous samples, can cater to scenarios where the user seeks focused information qualifying certain pre-requisites (knowledge depth). Highly diverse content, being heterogeneous in its representation of various attributes, is likely to benefit the user in terms of information gain along multiple facets (knowledge breadth).

Thus we describe diversity as a core property of social media content, and we quantify it via its measure of entropy in a conceptual structure called the diversity spectrum (discussed in more detail in section 3). Our central goal is therefore to determine social media information samples on a topic that match a desired degree of diversity.

Our Contributions. We have developed a weighted dimensional representation of the information units (e.g., tweets) characterizing large-scale social media spaces. Next we propose a sampling methodology to reduce such large social media spaces. The sampling method borrows ideas from the compressive sensing literature that emphasizes the notion of representing an information stream via a small set of basis functions, assuming the stream is fairly sparse. We thereafter deploy an iterative clustering framework on the reduced space for the purpose of sample generation. The algorithm utilizes a greedy approach-based entropy minimization technique to generate samples of a particular sampling ratio and matching a desired level of diversity.

Main Results. We perform quantitative evaluation of the proposed sampling method over a Twitter dataset (Firehose comprising Billion tweets in June 2010). There are several key insights in our results. (1) We find that the compressive sensing based reduction step can prune the social media space by as much as 50-60%, and still yield robust samples that are very close to a given desired information diversity level, compared to other baseline techniques. (2) Overall, we observe that information diversity appears to be a useful attribute to sample social information spaces consistently over multiple thematic categories (Politics, Sports etc.). (3) Nevertheless, depending on the thematic category of content, the choice of the dimensional type (e.g., tweet features like recency; nodal features like the social graph topology of the tweet creator) can make a notable difference to the samples generated.

We also address the issue of evaluation of sample results sets in the absence of ground truth data. Ultimately it is the end users who decide the goodness of samples in a [...] sections 7 and 8. Section 9 and 10 present a discussion of our results followed by the conclusions.

2 Related Work

Although the burst of informational content on the Web due to the emergence of social media sites is relatively new, there is a rich body of statistical, data mining and social sciences literature that investigates efficient methods for sampling large data spaces (Kellogg 1967; Frank 1978; Das et al. 2008). Sociologists have studied the impact of snowball sampling and random-walk based sampling of nodes in a social network on graph attributes and other network phenomena (Frank 1978). Recently, sampling of large online networks (e.g., the Internet and social networks) has gained much attention (Rusmevichientong et al. 2001; Achlioptas et al. 2005; Leskovec and Faloutsos 2006; Stutzbach 2006; De Choudhury et al. 2010; Maiya and Berger-Wolf 2010) in terms of how different techniques impact the recovery of overall network metrics, like degree of distribution, path lengths, etc., as well as dynamic phenomena over networks such as diffusion and community evolution.

Most of the above mentioned work on social media sampling focused on how the sampling process impacts graph structure and graph dynamics. Thus, the focus of the sampling strategies was to prune the space of nodes or edges. However, this does not provide insights into the various characteristics (e.g., degree of diffusion, topical content, level of diversity, etc.) of social media spaces in general. Moreover, while these works addressed the issue of how to sample relevant entities in dynamic graphs, no principled way to sample or prune large social media spaces has been proposed. These spaces are unique, because of the nature of user generated content, including its high dimensionality and diversity. To the best of our knowledge, in this paper, generalized sampling methods for large social media spaces are being proposed for the first time. Our focus includes how the generated samples can improve the information consumption experience of end users.

3 Problem Definition

We begin by formalizing our problem definition.
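As a rough illustration of the two ideas the introduction relies on, quantifying a sample's diversity by entropy and building a sample greedily toward a desired diversity level, the following Python sketch may help. It is not the paper's algorithm: it skips the compressive sensing reduction and the clustering step entirely, reduces each information unit to a single categorical attribute (a theme label), and uses a simple closest-to-target selection rule. The names entropy, greedy_sample, size, and target are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def greedy_sample(items, size, target):
    """Pick `size` items, one at a time, keeping entropy close to `target`.

    Assumption for this sketch: each item is just a hashable attribute
    value (for example a theme label), not a full multi-attribute tweet.
    """
    remaining = list(items)
    sample = []
    while remaining and len(sample) < size:
        # Choose the candidate whose addition leaves the sample's entropy
        # nearest to the desired diversity level.
        best = min(remaining, key=lambda x: abs(entropy(sample + [x]) - target))
        sample.append(best)
        remaining.remove(best)
    return sample


themes = ["politics", "finance", "politics", "sports", "politics", "finance"]
low = greedy_sample(themes, size=3, target=0.0)             # homogeneous sample
high = greedy_sample(themes, size=3, target=math.log2(3))   # heterogeneous sample
```

With a target of zero the sketch favours a single theme (low diversity); with a target of log2(3) it favours an even mix of the three themes (high diversity), mirroring the two search scenarios described in the introduction.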

