
Ryan Tibshirani Data Mining: 36-462/36-662 January 22 2013



PageRank
Ryan Tibshirani, Data Mining: 36-462/36-662, January 22 2013
Optional reading: ESL 14.10

1. Information retrieval with the web

Last time: information retrieval. We learned how to compute similarity scores (distances) of documents to a given query string.

But what if the documents are webpages, and our collection is the whole web (or a big chunk of it)? Now there are two problems:

- Techniques from the last lectures (normalization, IDF weighting) are computationally infeasible at this scale. There are about 30 billion webpages!
- Some webpages should be assigned more priority than others, for being more important.

Fortunately, there is an underlying structure that we can exploit: links between webpages.

Web search before Google

(From Page et al. (1999), "The PageRank Citation Ranking: Bringing Order to the Web")

PageRank algorithm

The PageRank algorithm was famously invented by Larry Page and Sergey Brin, founders of Google. It assigns a PageRank (a score, or a measure of importance) to each webpage.

Given webpages numbered 1, ..., n, the PageRank of webpage i is based on its linking webpages (the webpages j that link to i). But we don't just count the number of linking webpages, i.e., we don't want to treat all linking webpages equally. Instead, we weight the links from different webpages:

- Webpages that link to i, and have high PageRank scores themselves, should be given more weight
- Webpages that link to i, but link to a lot of other webpages in general, should be given less weight

Note that the first idea is circular! (But that's OK.)

BrokenRank (almost PageRank) definition

Let L_ij = 1 if webpage j links to webpage i (written j -> i), and L_ij = 0 otherwise. Also let m_j = sum_{k=1}^n L_kj, the total number of webpages that j links to.

First we define something that's almost PageRank, but not quite, because it's broken. The BrokenRank p_i of webpage i is

  p_i = sum_{j -> i} p_j / m_j = sum_{j=1}^n (L_ij / m_j) p_j

Does this match our ideas from the last slide? Yes: for j -> i, the weight is p_j / m_j, which increases with p_j but decreases with m_j.

BrokenRank in matrix notation

Written in matrix notation,

  p = (p_1, ..., p_n)^T,

  L = [ L_11 L_12 ... L_1n ]
      [ L_21 L_22 ... L_2n ]
      [  ...  ...  ...  ... ]
      [ L_n1 L_n2 ... L_nn ],

  M = [ m_1  0  ...  0  ]
      [  0  m_2 ...  0  ]
      [ ... ... ... ... ]
      [  0   0  ... m_n ]

Dimensions: p is n x 1, L and M are n x n. Now we can re-express the definition on the previous page: the BrokenRank vector p is defined as

  p = L M^{-1} p

Eigenvalues and eigenvectors

Let A = L M^{-1}; then p = A p. This means that p is an eigenvector of the matrix A with eigenvalue 1. Great! Because we know how to compute the eigenvalues and eigenvectors of A, and there are even methods for doing this quickly when A is large and sparse (why is our A sparse?)

But wait ... do we know that A has an eigenvalue of 1, so that such a vector p exists? And even if it does exist, will it be unique (well-defined)?
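To make the definitions concrete, here is a minimal numpy sketch (the 4-page link matrix is invented for illustration, not from the lecture): it builds A = L M^{-1} by dividing each column of L by m_j, and recovers the BrokenRank vector as an eigenvector of A with eigenvalue 1.

```python
import numpy as np

# Hypothetical 4-page web (for illustration): L[i, j] = 1 if page j links to page i.
L = np.array([[0, 0, 1, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

m = L.sum(axis=0)   # m_j: total number of webpages that page j links to
A = L / m           # A = L M^{-1}: broadcasting divides column j by m_j

# Each column of A sums to 1 (assuming every page links somewhere), so
# A^T is a stochastic matrix and A has an eigenvalue equal to 1.
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmin(np.abs(eigvals - 1))
p = np.real(eigvecs[:, k])
p = p / p.sum()     # normalize to a probability vector (also fixes the sign)

print(np.round(eigvals[k].real, 6))   # 1.0
print(np.round(p, 4))                 # [0.4 0.2 0.3 0.1]
```

Here the eigenvalue 1 happens to be simple because this little graph is strongly connected; the slides below show this is not guaranteed in general.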

For these questions, it helps to interpret BrokenRank in terms of a Markov chain.

BrokenRank as a Markov chain

Think of a Markov chain as a random process that moves between states numbered 1, ..., n (each step of the process is one move). Recall that for a Markov chain with an n x n transition matrix P, this means P(go from i to j) = P_ij.

Suppose p^(0) is an n-dimensional vector giving the initial distribution over states. After one step, p^(1) = P^T p^(0) gives the probabilities of being in each state (why?)

Now consider a Markov chain with the states as webpages, and with transition matrix A^T. Note that (A^T)_ij = A_ji = L_ji / m_i, so we can describe the chain as

  P(go from i to j) = { 1/m_i  if i -> j
                      { 0      otherwise

(Check: does this make sense?) This is like a random surfer, i.e., a person surfing the web by clicking on links uniformly at random.

Stationary distribution

A stationary distribution of our Markov chain is a probability vector p (i.e., its entries are >= 0 and sum to 1) with p = A p, i.e., the distribution after one step of the Markov chain is unchanged. This is exactly what we're looking for: an eigenvector of A corresponding to eigenvalue 1.

If the Markov chain is strongly connected, meaning that any state can be reached from any other state, then the stationary distribution p exists and is unique. Furthermore, we can think of the stationary distribution as the proportions of visits the chain pays to each state after a very long time (the ergodic theorem):

  p_i = lim_{t -> infinity} (# of visits to state i in t steps) / t

Our interpretation: the BrokenRank p_i is the proportion of time our random surfer spends on webpage i if we let him go forever.

Why is BrokenRank broken?

There's a problem here. Our Markov chain, a random surfer on the web graph, is not strongly connected, in three cases (at least):

- Disconnected components
- Dangling links
- Loops

Actually, even for Markov chains that are not strongly connected, a stationary distribution always exists, but it may be nonunique. In other words, the BrokenRank vector p exists but is ambiguously defined.

BrokenRank example

Here

  A = L M^{-1} = [ 0 0 1 0 0 ]
                 [ 1 0 0 0 0 ]
                 [ 0 1 0 0 0 ]
                 [ 0 0 0 0 1 ]
                 [ 0 0 0 1 0 ]

(Check: does this match both definitions?) Here there are two eigenvectors of A with eigenvalue 1:

  p = (1/3, 1/3, 1/3, 0, 0)^T  and  p = (0, 0, 0, 1/2, 1/2)^T

These are totally opposite rankings!
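The ambiguity can be checked numerically; a short sketch using the 5-page matrix from this example:

```python
import numpy as np

# The 5-page example: pages {1,2,3} form one cycle and pages {4,5}
# another, so the web graph is not strongly connected.
A = np.array([[0, 0, 1, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

eigvals = np.linalg.eigvals(A)
mult = np.sum(np.isclose(eigvals, 1))
print(mult)   # 2: eigenvalue 1 is repeated, so BrokenRank is ambiguous

# Both rankings from the slide are fixed points of A:
p1 = np.array([1/3, 1/3, 1/3, 0, 0])
p2 = np.array([0, 0, 0, 1/2, 1/2])
print(np.allclose(A @ p1, p1), np.allclose(A @ p2, p2))   # True True
```

Any convex combination of p1 and p2 is also a stationary distribution, which is exactly what "ambiguously defined" means here.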

PageRank definition

PageRank is given by a small modification of BrokenRank:

  p_i = (1-d)/n + d * sum_{j=1}^n (L_ij / m_j) p_j,

where 0 < d < 1 is a constant (apparently Google uses d = 0.85).

In matrix notation, this is

  p = ( ((1-d)/n) E + d L M^{-1} ) p,

where E is the n x n matrix of 1s, subject to the constraint sum_{i=1}^n p_i = 1.

(Check: are these definitions the same? Show that the second definition gives the first. Hint: if e is the n-vector of all 1s, then E = e e^T, and e^T p = 1.)

PageRank as a Markov chain

Let A = ((1-d)/n) E + d L M^{-1}, and consider as before a Markov chain with transition matrix A^T. Well, (A^T)_ij = A_ji = (1-d)/n + d L_ji / m_i, so the chain can be described as

  P(go from i to j) = { (1-d)/n + d/m_i  if i -> j
                      { (1-d)/n          otherwise

(Check: does this make sense?) The chain moves through a link with probability (1-d)/n + d/m_i, and with probability (1-d)/n it jumps to an unlinked webpage. Hence this is like a random surfer with random jumps. Fortunately, the random jumps get rid of our problems: our Markov chain is now strongly connected. Therefore the stationary distribution (i.e., the PageRank vector) p is unique.

PageRank example

With d = 0.85, and L and M as in the BrokenRank example,

  A = ((1-d)/n) E + d L M^{-1}

    = 0.03 * [ 1 1 1 1 1 ]  +  0.85 * [ 0 0 1 0 0 ]
             [ 1 1 1 1 1 ]            [ 1 0 0 0 0 ]
             [ 1 1 1 1 1 ]            [ 0 1 0 0 0 ]
             [ 1 1 1 1 1 ]            [ 0 0 0 0 1 ]
             [ 1 1 1 1 1 ]            [ 0 0 0 1 0 ]

Now there is only one eigenvector of A with eigenvalue 1: p = (1/5, 1/5, 1/5, 1/5, 1/5)^T.

Computing the PageRank vector

Computing the PageRank vector p via traditional methods, e.g., an eigendecomposition, takes roughly n^3 operations.
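The uniqueness claim can be verified numerically; a sketch using the same 5-page example, taking the commonly quoted damping value d = 0.85:

```python
import numpy as np

d = 0.85
n = 5
# L M^{-1} for the 5-page example (each page links to exactly one page)
LM_inv = np.array([[0, 0, 1, 0, 0],
                   [1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 0, 1, 0]], dtype=float)
A = (1 - d) / n * np.ones((n, n)) + d * LM_inv

eigvals, eigvecs = np.linalg.eig(A)
ones = np.isclose(eigvals, 1)
print(ones.sum())   # 1: the random jumps make the chain strongly
                    # connected, so eigenvalue 1 is now simple

p = np.real(eigvecs[:, np.argmax(ones)])
p = p / p.sum()
print(np.round(p, 4))   # [0.2 0.2 0.2 0.2 0.2]: every page ranks equally
```

The uniform answer makes sense: each page here has exactly one inbound and one outbound link, so A is doubly stochastic and no page is more "important" than another.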

When n = 10^10, n^3 = 10^30. Yikes! (But a bigger concern would be memory ...)

Fortunately, there is a much faster way to compute the eigenvector of A with eigenvalue 1: begin with any initial distribution p^(0), and compute

  p^(1) = A p^(0)
  p^(2) = A p^(1)
  ...
  p^(t) = A p^(t-1)

Then p^(t) converges to p as t grows. In practice, we just repeatedly multiply by A until there isn't much change between iterations. E.g., after 100 iterations, the operation count is 100 n^2 << n^3 for large n.

Computation, continued

There are still important questions remaining about computing the PageRank vector p (with the algorithm presented on the last slide):

1.
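The iteration above can be sketched in numpy (the function name, tolerance, and stopping rule are illustrative choices, not from the lecture). Note that the dense matrix E never needs to be formed: since p sums to 1, E p = e, so the jump term is just the constant (1-d)/n added to every entry.

```python
import numpy as np

def pagerank_power(LM_inv, d=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for the PageRank vector p = A p, where
    A = ((1-d)/n) E + d L M^{-1}. Assumes every page has at least one
    outgoing link, so the iterates keep summing to 1 and E never
    needs to be materialized."""
    n = LM_inv.shape[0]
    p = np.full(n, 1.0 / n)   # any initial distribution works
    for _ in range(max_iter):
        # One multiplication by A: O(n^2) dense, O(# links) if sparse
        p_new = (1 - d) / n + d * (LM_inv @ p)
        if np.abs(p_new - p).sum() < tol:
            return p_new
        p = p_new
    return p

# The 5-page example from the earlier slides:
LM_inv = np.array([[0, 0, 1, 0, 0],
                   [1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 0, 1, 0]], dtype=float)
print(np.round(pagerank_power(LM_inv), 4))   # [0.2 0.2 0.2 0.2 0.2]
```

Convergence is geometric at rate roughly d per iteration (the second-largest eigenvalue of A has modulus at most d), which is why around 100 iterations suffices in practice.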

