%% This is an annotated diary of the derivation of the PageRank algorithm %% as I did it in class on Oct 17, but a little more slowly. %% Unlike my usual diaries, I've included the output in this one too. %% Feel free to try it yourself. load ucsbweb2007 %% This gets the same stuff as "[U,G] = surfer('http://cs.ucsb.edu',50); whos Name Size Bytes Class G 50x50 1759 logical array (sparse) U 50x1 6610 cell array Grand total is 2166 elements using 8369 bytes % G is the adjacency matrix of the 50-page web graph [n,n] = size(G) n = 50 n = 50 % First put ones on the diagonal of G G = G | speye(n); % Next scale the columns of G so that each column adds up to 1. % We do this by creating a diagonal matrix D whose elements are % the inverses of the columns sums of G: c = sum(G); %% column sums of G c = 1 ./ c; %% ./ means divide elementwise, that is, invert every element of c D = diag(sparse(c)); %% Now D is the desired diagonal matrix size(D) ans = 50 50 nnz(D) ans = 50 %% To scale the columns of G, we multiply on the right by D. %% (exercise: convince yourself that multiplying a matrix by %% a diagonal matrix on the right scales the columns, while %% multiplying by a diagonal matrix on the left scales the rows.) GD = G*D; sumgd = sum(GD); max(sumgd) ans = 1.0000 min(sumgd) ans = 1.0000 %% Now GD is the matrix that corresponds to taking a random link %% from a web page: %% If x is 50-by-1 vector of probabilities, adding up to 1, then %% GD * x is a vector that gives the probabilities after one more step: %% The random surfer's algorithm is to take a random link 85% of the time: p = .85 p = 0.8500 %% ... and the other 15% of the time to jump to a random page. 1-p ans = 0.1500 %% Now, what matrix represents jumping to a random page? %% It must be a matrix that when multiplied by any vector x with sum(x)=1, %% gives the vector of equal probabilites, that is, the vector with all %% entries equal to 1/n. That matrix is the n-by-n matrix of all ones, %% divided by n. We could compute that matrix by saying %% E = ones(n,n) / n; %% However, that would create a dense matrix, and we want to keep things %% sparse so our method will work on large webs. The trick is to notice %% that the n-by-n matrix of ones (with n^2 elements) can be represented %% as the product of two vectors, each with only n elements: e = ones(n,1); %% This is a column vector with n ones. %% Just to illustrate, we'll compute e * e' and show that it really is %% the same as the matrix of all ones. But in the algorithm we develop %% for PageRank we won't ever really form that dense matrix; this is just %% to show what's going on: size(e) ans = 50 1 E = e * e'; size(E) ans = 50 50 max(max(E)) %% largest element in all of E ans = 1 min(min(E)) %% smallest element in all of E ans = 1 nnz(E) ans = 2500 %% Now we have all the pieces, so we can put them together. %% The random surfer's algorithm is: %% p of the time, take the step represented by G*D %% (1-p) of the time, take the step represented by (1/n) * e * e' %% %% Thus the random surfer's algorithm takes the vector x into the vector %% (p*G*D + (1-p)*(1/n)*e*e') * x %% We could use A to refer to the matrix (p*G*D + (1-p)*(1/n)*e*e'), %% in which case the random surfer repeatedly does x = A*x; but we don't ever %% actually form the matrix A. %% %% So, how do we find the final "x" which is the probability distribution %% of the limit of the random surfer's trip? %% %% One way is to just iterate x = A*x many, many times (but without forming A). %% That's called "the power method". %% %% Another way, which is a little bit trickier, is presented in the book. %% It uses algebraic rearrangement to solve the equation x = A*x for x, %% as follows. Part of the trick is to use the fact that we are looking %% for an x with sum(x)=1, and since e is the vector of all ones we have e'*x = sum(x) %% (you should convince yourself that this is true). %% %% So, here goes. We write the equation A*x = x, and then substitute in %% the definition of A to get: %% %% (p*G*D + (1-p)*(1/n)*e*e') * x = x %% %% Taking the x on the right inside the parentheses gives %% %% p*G*D*x + (1-p)*(1/n)*e*e'*x = x %% %% Now we use the tricky fact that e'*x = sum(x) = 1 for the x we want, so %% %% p*G*D*x + (1-p)*(1/n)*e = x %% %% Finally, we collect terms so that the variable x is on the left and the %% constant vector is on the right. %% To do that, we'll let I be the n-by-n identity matrix, so that I*x = x. %% %% p*G*D*x - I*x = -(1-p)*(1/n)*e %% %% Or, negating both sides and factoring out the x on the left side, %% %% (I - p*G*D) * x = (1-p)*(1/n)*e %% %% Now, the stuff on the right side is just a constant vector -- in our %% example, it's the vector whose 50 entries are all equal to 0.15/50. %% Let's call the stuff on the right side b: b = (1-p)*(1/n)*e; %% The matrix on the left side, I-p*G*D, is sparse. Let's call it B: I = speye(n); B = (I-p*G*D); size(B) ans = 50 50 nnz(B) ans = 332 nnz(G) ans = 332 %% So, it just remains to solve the system for x: x = B \ b; sum(x) ans = 1.0000 min(x) ans = 0.0042 max(x) ans = 0.1174 %% Now let's sort x from largest to smallest: [xsorted, perm] = sort(x,'descend'); %% The permutation vector "perm" sorts the URLs by PageRank: U(perm) ans = [1x81 char] 'http://atyourservice.ucop.edu' 'http://www.ucsb.edu' 'http://www.policy.ucsb.edu/vcas/isc/WebTermsOfUse/content.html' 'http://www.universityofcalifornia.edu' 'http://www.universityofcalifornia.edu/regents/welcome.html' 'http://www.ia.ucsb.edu/accessibility/index.html' 'http://www.engineering.ucsb.edu/insights2008' 'http://www.universityofcalifornia.edu/admissions/undergradapp' 'http://www.registrar.ucsb.edu' 'http://www.ucsb.edu/policies/terms-of-use.shtml' 'http://hr.ucsb.edu' 'http://www.summer.ucsb.edu' 'http://www.ucsb.edu/index.shtml' 'http://www.admissions.ucsb.edu' 'http://www.oap.ucsb.edu' 'http://www.sa.ucsb.edu/Policies/CleryAct/CleryActCampusSecurityReport.asp' 'http://www.engineering.ucsb.edu' 'http://www.ia.ucsb.edu/dev/index.shtml' 'http://www.ia.ucsb.edu/util/contact.aspx' 'http://www.xlrn.ucsb.edu' 'http://www.ucsbvision2025.com' 'http://www.ucsb.edu/people/index.shtml' 'http://www.finaid.ucsb.edu' 'http://www.education.ucsb.edu/tep' 'http://www.extension.ucsb.edu' 'http://www.ocs.ucsb.edu' 'http://www.ia.ucsb.edu/pa/news.aspx' 'http://www.extension.ucsb.edu/ip' 'http://www.ia.ucsb.edu/campaign/index.shtml' 'http://hr.ucsb.edu/asap' 'http://hr.ucsb.edu/admin' 'http://hr.ucsb.edu/benefits' 'http://www.catalog.ucsb.edu' 'http://www.housing.ucsb.edu' 'http://www.ucsb.edu/policies' 'http://www.ucsb.edu/about-our-site/index.shtml' 'http://www.industry.ucsb.edu' 'http://www.industry.ucsb.edu/faculty' 'http://www.industry.ucsb.edu/technologies' 'http://www.industry.ucsb.edu/recruiting' 'http://www.industry.ucsb.edu/membership/benefits' 'http://www.ucsb.edu/sitemap/index.shtml' 'http://www.ucsb.edu/academics/index.shtml' 'http://www.sa.ucsb.edu/policies/index.asp' 'http://www.ucsb.edu/az/a.shtml' 'http://www.ucsbgauchos.com' 'http://www.ucsb.edu/campus-photos/index.shtml' 'http://www.ucsb.edu/pop/index.shtml' 'http://cs.ucsb.edu' diary off