%% This is an annotated diary of the derivation of the PageRank algorithm
%% as I did it in class on Oct 17, but a little more slowly.

%% Unlike my usual diaries, I've included the output in this one too.
%% Feel free to try it yourself.

load ucsbweb2007 %%  This gets the same stuff as "[U,G] = surfer('http://cs.ucsb.edu',50);
whos
  Name      Size                    Bytes  Class

  G        50x50                     1759  logical array (sparse)
  U        50x1                      6610  cell array

Grand total is 2166 elements using 8369 bytes

% G is the adjacency matrix of the 50-page web graph
[n,n] = size(G)
n =
    50
n =
    50

% First put ones on the diagonal of G
G = G | speye(n);

% Next scale the columns of G so that each column adds up to 1.
% We do this by creating a diagonal matrix D whose elements are
% the inverses of the columns sums of G:

c = sum(G);  %% column sums of G
c = 1 ./ c;  %% ./ means divide elementwise, that is, invert every element of c
D = diag(sparse(c));  %% Now D is the desired diagonal matrix
size(D)
ans =
    50    50
nnz(D)
ans =
    50

%% To scale the columns of G, we multiply on the right by D.
%% (exercise: convince yourself that multiplying a matrix by
%%  a diagonal matrix on the right scales the columns, while
%%  multiplying by a diagonal matrix on the left scales the rows.)
GD = G*D;
sumgd = sum(GD);
max(sumgd)
ans =
    1.0000
min(sumgd)
ans =
    1.0000


%% Now GD is the matrix that corresponds to taking a random link
%% from a web page:  
%% If x is 50-by-1 vector of probabilities, adding up to 1, then
%% GD * x is a vector that gives the probabilities after one more step:

%% The random surfer's algorithm is to take a random link 85% of the time:
p = .85
p =
    0.8500
%% ... and the other 15% of the time to jump to a random page.
1-p
ans =
    0.1500

%% Now, what matrix represents jumping to a random page?
%% It must be a matrix that when multiplied by any vector x with sum(x)=1,
%% gives the vector of equal probabilites, that is, the vector with all
%% entries equal to 1/n.  That matrix is the n-by-n matrix of all ones,
%% divided by n.  We could compute that matrix by saying
%% E = ones(n,n) / n;
%% However, that would create a dense matrix, and we want to keep things
%% sparse so our method will work on large webs.  The trick is to notice
%% that the n-by-n matrix of ones (with n^2 elements) can be represented
%% as the product of two vectors, each with only n elements:

e = ones(n,1);   %% This is a column vector with n ones.

%% Just to illustrate, we'll compute e * e' and show that it really is
%% the same as the matrix of all ones.  But in the algorithm we develop
%% for PageRank we won't ever really form that dense matrix; this is just
%% to show what's going on:

size(e)
ans =
    50     1
E = e * e';
size(E)
ans =
    50    50
max(max(E))  %% largest element in all of E
ans =
     1
min(min(E))  %% smallest element in all of E
ans =
     1
nnz(E)
ans =
        2500

%% Now we have all the pieces, so we can put them together.
%% The random surfer's algorithm is:
%%   p of the time, take the step represented by G*D
%%   (1-p) of the time, take the step represented by (1/n) * e * e'
%%
%% Thus the random surfer's algorithm takes the vector x into the vector 
%%  (p*G*D + (1-p)*(1/n)*e*e') * x
%% We could use A to refer to the matrix (p*G*D + (1-p)*(1/n)*e*e'),
%% in which case the random surfer repeatedly does x = A*x; but we don't ever
%% actually form the matrix A.
%%
%% So, how do we find the final "x" which is the probability distribution
%% of the limit of the random surfer's trip?
%%
%% One way is to just iterate x = A*x many, many times (but without forming A).
%% That's called "the power method".
%%
%% Another way, which is a little bit trickier, is presented in the book.
%% It uses algebraic rearrangement to solve the equation x = A*x for x,
%% as follows.  Part of the trick is to use the fact that we are looking
%% for an x with sum(x)=1, and since e is the vector of all ones we have e'*x = sum(x)
%% (you should convince yourself that this is true).
%%
%% So, here goes.  We write the equation A*x = x, and then substitute in
%% the definition of A to get:
%%
%%   (p*G*D + (1-p)*(1/n)*e*e') * x   =   x
%%
%% Taking the x on the right inside the parentheses gives
%%
%%   p*G*D*x + (1-p)*(1/n)*e*e'*x   =   x
%%
%% Now we use the tricky fact that e'*x = sum(x) = 1 for the x we want, so
%%
%%   p*G*D*x + (1-p)*(1/n)*e   =   x
%%
%% Finally, we collect terms so that the variable x is on the left and the
%% constant vector is on the right.  
%% To do that, we'll let I be the n-by-n identity matrix, so that I*x = x.
%%
%%  p*G*D*x - I*x  =  -(1-p)*(1/n)*e
%%
%% Or, negating both sides and factoring out the x on the left side,
%%
%%  (I - p*G*D) * x = (1-p)*(1/n)*e
%%
%% Now, the stuff on the right side is just a constant vector -- in our
%% example, it's the vector whose 50 entries are all equal to 0.15/50.
%% Let's call the stuff on the right side b:

b = (1-p)*(1/n)*e;

%% The matrix on the left side, I-p*G*D, is sparse.  Let's call it B:

I = speye(n);
B = (I-p*G*D);
size(B)
ans =
    50    50
nnz(B)
ans =
   332
nnz(G)
ans =
   332

%% So, it just remains to solve the system for x:

x = B \ b;

sum(x)
ans =
    1.0000

min(x)
ans =
    0.0042

max(x)
ans =
    0.1174


%% Now let's sort x from largest to smallest:

[xsorted, perm] = sort(x,'descend');

%% The permutation vector "perm" sorts the URLs by PageRank:

U(perm)

ans = 
    [1x81 char]
    'http://atyourservice.ucop.edu'
    'http://www.ucsb.edu'
    'http://www.policy.ucsb.edu/vcas/isc/WebTermsOfUse/content.html'
    'http://www.universityofcalifornia.edu'
    'http://www.universityofcalifornia.edu/regents/welcome.html'
    'http://www.ia.ucsb.edu/accessibility/index.html'
    'http://www.engineering.ucsb.edu/insights2008'
    'http://www.universityofcalifornia.edu/admissions/undergradapp'
    'http://www.registrar.ucsb.edu'
    'http://www.ucsb.edu/policies/terms-of-use.shtml'
    'http://hr.ucsb.edu'
    'http://www.summer.ucsb.edu'
    'http://www.ucsb.edu/index.shtml'
    'http://www.admissions.ucsb.edu'
    'http://www.oap.ucsb.edu'
    'http://www.sa.ucsb.edu/Policies/CleryAct/CleryActCampusSecurityReport.asp'
    'http://www.engineering.ucsb.edu'
    'http://www.ia.ucsb.edu/dev/index.shtml'
    'http://www.ia.ucsb.edu/util/contact.aspx'
    'http://www.xlrn.ucsb.edu'
    'http://www.ucsbvision2025.com'
    'http://www.ucsb.edu/people/index.shtml'
    'http://www.finaid.ucsb.edu'
    'http://www.education.ucsb.edu/tep'
    'http://www.extension.ucsb.edu'
    'http://www.ocs.ucsb.edu'
    'http://www.ia.ucsb.edu/pa/news.aspx'
    'http://www.extension.ucsb.edu/ip'
    'http://www.ia.ucsb.edu/campaign/index.shtml'
    'http://hr.ucsb.edu/asap'
    'http://hr.ucsb.edu/admin'
    'http://hr.ucsb.edu/benefits'
    'http://www.catalog.ucsb.edu'
    'http://www.housing.ucsb.edu'
    'http://www.ucsb.edu/policies'
    'http://www.ucsb.edu/about-our-site/index.shtml'
    'http://www.industry.ucsb.edu'
    'http://www.industry.ucsb.edu/faculty'
    'http://www.industry.ucsb.edu/technologies'
    'http://www.industry.ucsb.edu/recruiting'
    'http://www.industry.ucsb.edu/membership/benefits'
    'http://www.ucsb.edu/sitemap/index.shtml'
    'http://www.ucsb.edu/academics/index.shtml'
    'http://www.sa.ucsb.edu/policies/index.asp'
    'http://www.ucsb.edu/az/a.shtml'
    'http://www.ucsbgauchos.com'
    'http://www.ucsb.edu/campus-photos/index.shtml'
    'http://www.ucsb.edu/pop/index.shtml'
    'http://cs.ucsb.edu'

diary off