The Simple Math of PageRank Sculpting

The concept of PageRank Sculpting, which I originally named Dynamic Linking in 2003, is to control the distribution of PageRank within a site by manipulating which links on the site are followed by Google. My original article on PageRank Sculpting explains the basics of how this works and why you would want to do it. In its original JavaScript incarnation it was very tedious to get right, but with the advent of the rel="nofollow" attribute it is now very easy to employ and has therefore come into wide use.

But Sculpting has recently come under fire because of some remarks at the most recent SMX-Advanced conference in Seattle. I was not there, but my friend and colleague Dan Thies was, so I'll leave it to Dan to report on what was actually said. For my part, I'll focus on the specific example that was published and show why the conclusion drawn from the example is simply incorrect. Say what you want, but you cannot fight the math.

The example in the referenced comments, accurate or not, describes a change in how PageRank is distributed when there are nofollowed links on the page. To show the impact of this purported change, we will need a PageRank computer we can hack to include this change. We can then compare the original algorithm with the hack and see for ourselves whether the conclusion drawn in the report is accurate.

PageRank in 50 Lines

Let's start with the standard PageRank algorithm written in fewer than 50 lines of Perl. Yes, Virginia, it really is that easy. In fact, the PageRank code is really only half of that; the rest is setup and printing the results. This is one of several ways to formulate the algorithm. I chose this one because it is pretty simple to follow, is practical up to about 10,000 pages on a single machine, and is very close to the MapReduce (http://en.wikipedia.org/wiki/MapReduce) implementation I use for analyzing very large sites (> 500,000 pages) for clients.
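
The listing itself is not reproduced in this text. What follows is a minimal sketch of a PageRank computer, not the original code. It reuses the variable names from the statements quoted later in the article ($pr, $newpr, $damp, $n, $r, $ipr). The damping value of 0.85, the 100 iterations, the pagerank() sub name and the handling of pages without links are all assumptions. It runs on the fully interconnected 11-page structure described in the next section.

```perl
use strict;
use warnings;

# Yarn Ball: every one of the 11 pages links to every other page.
my $pages = 11;
my @graph = map { my $r = $_; [ grep { $_ != $r } 0 .. $pages - 1 ] } 0 .. $pages - 1;

# Power iteration over followed links only.
sub pagerank {
    my ( $graph, $damp, $iterations ) = @_;
    my $count = scalar @$graph;
    my @pr    = (1) x $count;    # start every page at 1, so the total equals $count
    for my $iter ( 1 .. $iterations ) {
        my @newpr = (0) x $count;
        for my $r ( 0 .. $count - 1 ) {
            my $n = scalar @{ $graph->[$r] };    # out degree of page $r
            if ($n) {    # assumption: a page with no links passes nothing on
                my $ipr = $damp * ( $pr[$r] / $n );
                $newpr[$_] += $ipr for @{ $graph->[$r] };
            }
            $newpr[$r] += ( 1 - $damp );         # teleport share
        }
        @pr = @newpr;
    }
    return @pr;
}

my @pr    = pagerank( \@graph, 0.85, 100 );      # 0.85 is an assumed damping factor
my $total = 0;
$total += $_ for @pr;

printf "Total PR = %.2f\n", $total;
printf "%d: %.1f%%\n", $_, 100 * $pr[$_] / $total for 0 .. $#pr;
```

In this formulation each page starts with a PageRank of 1, so the total equals the number of pages and the percentages are simply each page's share of that total.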

Let's use this code to run a couple of "standard" examples so we have something to compare our "hacked" version to later.

An unSculpted Example

First, this is what an unSculpted structure looks like.

Reviewing the code briefly, the pages are named by small integers, 0-10, and the linking structure is given in the graph array. Each row in the array defines the outbound links for a single page, so in this first example we see that page 0 links to pages 1 through 10 and page 1 links to page 0 and pages 2 through 10. The other pages are similarly linked, creating a structure that I originally named a "Yarn Ball".

Recall that the PageRank of a page is the probability of a "random surfer" finding that page in the index. A fully interconnected structure like the one shown, with no other pages in the index (a very small index indeed), will therefore result in all pages having precisely the same PageRank, which is the result we obtain from our own implementation. Notice that the total of all PageRank in the index, just 11 pages in this case, must add up to 100%. The formula requires that the random surfer always finds a page, and the only pages to choose from are those in the index, hence the PageRanks of all pages must sum to 100%. If this requirement is not met, the code does not implement PageRank.

my @graph = (
[1,2,3,4,5,6,7,8,9,10],
[0,2,3,4,5,6,7,8,9,10],
[1,0,3,4,5,6,7,8,9,10],
[1,2,0,4,5,6,7,8,9,10],
[1,2,3,0,5,6,7,8,9,10],
[1,2,3,4,0,6,7,8,9,10],
[1,2,3,4,5,0,7,8,9,10],
[1,2,3,4,5,6,0,8,9,10],
[1,2,3,4,5,6,7,0,9,10],
[1,2,3,4,5,6,7,8,0,10],
[1,2,3,4,5,6,7,8,9, 0]
);

Total PR = 11.00
Final PR:
0: 9.1%
1: 9.1%
2: 9.1%
3: 9.1%
4: 9.1%
5: 9.1%
6: 9.1%
7: 9.1%
8: 9.1%
9: 9.1%
10: 9.1%

A (Contrived) Example of Sculpting

Let's now take a look at a Sculpted example recognizing two important points:

  1. Any Sculpting of an 11-page website is contrived to start with, and is not merely a waste of time but might actually cost you money. The details must await another day, but I never recommend sculpting a website until it has at least 1,000 pages, generally more.
  2. This particular example is just dumb, but the math is correct and the structure sets us up for the final case study where we find the faults in the example reported from SMX.

In our graph, we have changed the way the internal pages of our site are interconnected and reduced the number of followed links out of the home page. Our graph only shows followed links, since these are the only links that pass PageRank. If this were a real site, we would likely have nofollowed links to the other pages in addition to the followed links shown.

my @graph = (
[1,3,5,7,9],
[0,2,3,4,5],
[1,0,3,4,5],
[1,2,0,4,5],
[1,2,3,0,5],
[1,2,3,4,0],
[0,7,8,9,10],
[6,0,8,9,10],
[6,7,0,9,10],
[6,7,8,0,10],
[6,7,8,9, 0]
);

Total PR = 11.00
Final PR:
0: 15.7%
1: 10.2%
2: 7.9%
3: 10.2%
4: 7.9%
5: 10.2%
6: 6.7%
7: 9.0%
8: 6.7%
9: 9.0%
10: 6.7%

Again, this is a silly Sculpting, but notice that the math continues to check out, with the total PageRank conserved no matter the (weird) linking pattern employed. As I have said for six years now, it is Pages that create PageRank; Links just move it around. No amount of linking will change the total amount of PageRank, either up or down. Creating more pages is the only way to increase the PageRank total.
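
The conservation claim can be checked directly. The sketch below runs the sculpted graph one pass at a time and records how far the total ever drifts from the page count. It is an illustration under assumptions, not the original listing: the damping factor of 0.85, the step() sub and the $max_drift variable are introduced here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# The sculpted graph from the example above (followed links only).
my @graph = (
    [1,3,5,7,9],
    [0,2,3,4,5], [1,0,3,4,5], [1,2,0,4,5], [1,2,3,0,5], [1,2,3,4,0],
    [0,7,8,9,10], [6,0,8,9,10], [6,7,0,9,10], [6,7,8,0,10], [6,7,8,9,0],
);
my $damp = 0.85;    # assumed damping factor

# One pass of the algorithm: every page hands out its damped PageRank
# in equal parts along its followed links and receives the teleport share.
sub step {
    my ($pr) = @_;
    my @newpr = (0) x scalar @$pr;
    for my $r ( 0 .. $#graph ) {
        my $n   = scalar @{ $graph[$r] };
        my $ipr = $damp * ( $pr->[$r] / $n );
        $newpr[$_] += $ipr for @{ $graph[$r] };
        $newpr[$r] += ( 1 - $damp );
    }
    return @newpr;
}

my @pr        = (1) x scalar @graph;
my $max_drift = 0;
for my $iter ( 1 .. 100 ) {
    @pr = step( \@pr );
    my $drift = abs( sum(@pr) - scalar @pr );    # distance of the total from 11
    $max_drift = $drift if $drift > $max_drift;
}

printf "Total PR = %.2f (largest drift seen: %.1e)\n", sum(@pr), $max_drift;
printf "Page 0 share = %.1f%%\n", 100 * $pr[0] / sum(@pr);
```

Each pass hands out exactly $damp times the old total through links and adds (1 - $damp) per page, so the total cannot move; only its distribution changes.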

Coding The SMX Hack

The discussion reported from SMX provided an example of how nofollow is now treated differently and that Sculpting no longer works because the PageRank "you thought you were saving is now going to waste". A precise example was given:

So today at SMX Advanced, sculpting was being discussed, and then Matt Cutts dropped a bomb shell that it no longer works to help flow more PageRank to the unblocked pages. Again, and being really simplistic here, if you have $10 in authority to spend on those ten links, and you block 5 of them, the other 5 aren't going to get $2 each. They're still getting $1. It's just that the other $5 you thought you were saving is now going to waste.
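
The arithmetic in the quote can be written out as a small sketch. The variable names are hypothetical and the dollar figures are the ones from the quote.

```perl
use strict;
use warnings;

my $authority = 10;    # dollars of "authority" on the page, from the quote
my $links     = 10;    # all links on the page
my $blocked   = 5;     # links carrying rel="nofollow"
my $followed  = $links - $blocked;

# Assumed old behavior: authority is split across followed links only.
my $old_share = $authority / $followed;

# Reported new behavior: authority is split across all links, and the
# part assigned to nofollowed links is not passed on.
my $new_share = $authority / $links;
my $wasted    = $authority - $followed * $new_share;

printf "old: \$%d per followed link\n", $old_share;
printf "new: \$%d per followed link, \$%d wasted\n", $new_share, $wasted;
```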

Let's code that change in our (correct) algorithm and see what happens. Note: PageRank is surprisingly subtle. Before actually implementing the change, I guessed wrong on the precise error it would cause.

In our PageRank implementation this statement:

$ipr = $damp * ( $pr[$r] / $n );

distributes an increment of PageRank from the linking page to the target page. The "damping factor" ($damp) accounts for "surfer teleports", and the value of $n is the "out degree" (the count of outbound links) of the page with the link. The assumption we have all made is that $n counts only the followed links on the page, since to do otherwise intuitively violates what nofollow was invented to do. But frankly, until now I have never modeled this assumption to check its veracity, and could not find that anyone else has either. So here we go!

To hack our PageRank code to simulate what is reported from SMX, we must change $n to include not just followed links but nofollowed links as well. A general purpose solution to this is tedious, but is easy for our one contrived example. Just adding this statement:

if( $r==0 ) { $n = 10 };

before computing the PageRank increment ($ipr) causes the count of links from page 0 to be 10 when in fact there are only five in our example. This matches the example used in the quoted report from SMX. Let's see what it does.

Repeating the example from above and running with our code change we get the PageRank results shown. Take a moment and see if you too can spot the problem here. Notice the total PR? This is broken.

Returning to the random surfer, what has happened is that the probability of taking one of our followed links out of page 0 has been reduced, but the teleport probability for page 0 has not been correspondingly increased. So now there is a probability that the random surfer gets "stuck" on page 0 and never leaves! This will certainly help conversions, right? ;-) And we should expect a change shortly in Google Analytics to account for infinite time on site.
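
The size of the loss can be worked out with a short derivation. The symbols are introduced here for illustration: N is the number of pages, d the damping factor, p_0 the PageRank of page 0, T the total before a pass and T' the total after it. Page 0 divides its PageRank by 10 but hands it out along only 5 links, so half of its damped PageRank disappears on every pass.

```latex
\begin{aligned}
T' &= N(1-d) + d\,(T - p_0) + d\,p_0 \cdot \tfrac{5}{10}
    = N(1-d) + d\,T - \tfrac{d}{2}\,p_0 \\[4pt]
T  &= N - \frac{d\,p_0}{2\,(1-d)}
    \qquad \text{(steady state, } T' = T \text{)}
\end{aligned}
```

With N = 11, an assumed d = 0.85 (the value is not stated in the text) and page 0 holding about 16.2% of the total, this gives T ≈ 11 / 1.459 ≈ 7.5, which is consistent with the total in the results for the hacked code.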

But all humor aside, I think we can assume that they did not actually break a core algorithm named by and for a founder, and go fix our own code to work around this problem.

my @graph = (
[1,3,5,7,9],
[0,2,3,4,5],
[1,0,3,4,5],
[1,2,0,4,5],
[1,2,3,0,5],
[1,2,3,4,0],
[0,7,8,9,10],
[6,0,8,9,10],
[6,7,0,9,10],
[6,7,8,0,10],
[6,7,8,9, 0]
);

Total PR = 7.53
Final PR:
0: 16.2%
1: 9.3%
2: 8.1%
3: 9.3%
4: 8.1%
5: 9.3%
6: 7.5%
7: 8.7%
8: 7.5%
9: 8.7%
10: 7.5%
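
A sketch of the hacked computation, for readers who want to reproduce the leak. It is not the original listing: it assumes a damping factor of 0.85 and 100 iterations, and it applies the one-line change quoted above to a plain power iteration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# The sculpted graph (followed links only).
my @graph = (
    [1,3,5,7,9],
    [0,2,3,4,5], [1,0,3,4,5], [1,2,0,4,5], [1,2,3,0,5], [1,2,3,4,0],
    [0,7,8,9,10], [6,0,8,9,10], [6,7,0,9,10], [6,7,8,0,10], [6,7,8,9,0],
);
my $damp = 0.85;    # assumed damping factor
my @pr   = (1) x scalar @graph;

for my $iter ( 1 .. 100 ) {
    my @newpr = (0) x scalar @graph;
    for my $r ( 0 .. $#graph ) {
        my $n = scalar @{ $graph[$r] };
        if ( $r == 0 ) { $n = 10 }    # the hack: count nofollowed links too
        my $ipr = $damp * ( $pr[$r] / $n );
        $newpr[$_] += $ipr for @{ $graph[$r] };    # still only 5 targets for page 0
        $newpr[$r] += ( 1 - $damp );
    }
    @pr = @newpr;
}

my $total = sum(@pr);
printf "Total PR = %.2f\n", $total;
printf "Page 0 share = %.1f%%\n", 100 * $pr[0] / $total;
```

Under these assumptions the total settles well below 11, because half of page 0's damped PageRank is divided out but never delivered to any page.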

Hacking our Hack

The problem is that "lying" about the outdegree of page 0 also requires that we modify the damping factor used for page 0, because the teleport probability and the sum of the link probabilities have to add to 100% or the random surfer will never leave, a sort of Hotel California effect. Our fix then has to compute the probability of being "stuck" on page 0 and change it into a teleport probability by adding it uniformly to all of the pages in the index. This is the only way to get the probabilities to add up, and PageRank is a probability distribution.

Our hack upon hack changes this line:

$newpr[$r] += (1-$damp);
to this:
$newpr[$r] += (1-$damp) +  ($damp/2)*$pr[0] / ($#newpr+1);
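where the additional term is the amount of additional teleporting required to compensate for the earlier hack. The factor $damp/2 reflects that 5 of the 10 links now counted on page 0 are nofollowed, so half of its damped PageRank would otherwise be lost. Rerunning our analysis, we obtain the results in the table below.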

my @graph = (
[1,3,5,7,9],
[0,2,3,4,5],
[1,0,3,4,5],
[1,2,0,4,5],
[1,2,3,0,5],
[1,2,3,4,0],
[0,7,8,9,10],
[6,0,8,9,10],
[6,7,0,9,10],
[6,7,8,0,10],
[6,7,8,9, 0]
);

Total PR = 11.00
Final PR:
0: 16.2%
1: 9.3%
2: 8.1%
3: 9.3%
4: 8.1%
5: 9.3%
6: 7.5%
7: 8.7%
8: 7.5%
9: 8.7%
10: 7.5%
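
The same computation with both changes applied can be sketched as follows. As before, this is an illustration rather than the original code, with an assumed damping factor of 0.85. The two modified statements are the ones quoted above.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# The sculpted graph (followed links only).
my @graph = (
    [1,3,5,7,9],
    [0,2,3,4,5], [1,0,3,4,5], [1,2,0,4,5], [1,2,3,0,5], [1,2,3,4,0],
    [0,7,8,9,10], [6,0,8,9,10], [6,7,0,9,10], [6,7,8,0,10], [6,7,8,9,0],
);
my $damp = 0.85;    # assumed damping factor
my @pr   = (1) x scalar @graph;

for my $iter ( 1 .. 100 ) {
    my @newpr = (0) x scalar @graph;
    for my $r ( 0 .. $#graph ) {
        my $n = scalar @{ $graph[$r] };
        if ( $r == 0 ) { $n = 10 }    # first hack: count nofollowed links too
        my $ipr = $damp * ( $pr[$r] / $n );
        $newpr[$_] += $ipr for @{ $graph[$r] };

        # Second hack: the half of page 0's damped PageRank that the first
        # hack strands is spread evenly over all pages as extra teleport.
        $newpr[$r] += ( 1 - $damp ) + ( $damp / 2 ) * $pr[0] / ( $#newpr + 1 );
    }
    @pr = @newpr;
}

my $total = sum(@pr);
printf "Total PR = %.2f\n", $total;
printf "Page 0 share = %.1f%%\n", 100 * $pr[0] / $total;
```

The total returns to 11 while every page keeps the same percentage as in the broken run. The extra teleport term adds the same amount to every page, which rescales the result without changing its shape.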

Notice that despite the headline that calls the use of nofollow into question, even with this bizarre change to the algorithm, Sculpting still works! But beware of a problem lurking in this data.

Random Bleeding!

Because our very small index was composed of just this one small site, the picture does not accurately show the effect of the additional teleport that we hacked into the code. What the added term in the latest hack does is take the PageRank increment from every nofollowed link in the entire index and distribute it evenly across the entire index (that's index, not site). Those nofollowed links are actually random bleeds.
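
To see the bleed, the index needs more than one site. The sketch below is a hypothetical two-site index, not data from any real site: site A (pages 0-2) and site B (pages 3-5) are each fully interconnected and do not link to each other, and page 0 carries two nofollowed links. The @nofollow array, the $leak variable and the damping factor of 0.85 are assumptions made for this illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @graph = (
    [1,2], [0,2], [0,1],    # site A: pages 0-2, followed links only
    [4,5], [3,5], [3,4],    # site B: pages 3-5, unrelated to site A
);
my @nofollow = ( 2, 0, 0, 0, 0, 0 );    # hypothetical: page 0 has 2 nofollowed links
my $damp     = 0.85;                    # assumed damping factor
my $count    = scalar @graph;
my @pr       = (1) x $count;

for my $iter ( 1 .. 100 ) {
    my @newpr = (0) x $count;
    my $leak  = 0;
    for my $r ( 0 .. $#graph ) {
        my @out = @{ $graph[$r] };
        my $n   = scalar(@out) + $nofollow[$r];    # all links are counted
        my $ipr = $damp * ( $pr[$r] / $n );
        $newpr[$_] += $ipr for @out;               # followed links pass PageRank
        $leak += $ipr * $nofollow[$r];             # nofollowed shares are stranded
        $newpr[$r] += ( 1 - $damp );
    }
    $_ += $leak / $count for @newpr;    # compensate with index-wide teleport
    @pr = @newpr;
}

my $site_a = sum( @pr[ 0 .. 2 ] );
my $site_b = sum( @pr[ 3 .. 5 ] );
printf "Site A total = %.2f\nSite B total = %.2f\n", $site_a, $site_b;
```

Without the nofollowed links each site would hold exactly 3.00. With them, site A ends below 3 and site B above it, even though site B has no links to or from site A. The total is still conserved, but part of site A's PageRank has bled to an unrelated site.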

Ask yourself. Can this really be the intention? I just don't buy it. And how would we have missed this? Citysearch is a 40 million page website that makes extensive use of nofollow. Shouldn't we have seen a change?

Observations and Concluding Thoughts

First, the entire idea is just completely silly to start with and would have noticeable and really, really bad ramifications that every SEO on the planet would already have noticed.

Second, the purported/headlined purpose of "Depreciating Sculpting" is simply false. The random bleeding notwithstanding, Sculpting still works even with this ridiculous change.

What then would be the purpose or advantage in making such a sweeping change? None that I can see.

And finally, are we to expect that this change was somehow (1) put in place entirely since Matt Cutts last confirmed that Sculpting works, or (2) that it was done entirely without him knowing about it, or (3) that he intentionally, knowingly and publicly misled the entire community? I don't buy any of these stories.

Make up your own mind, but I'm going to keep doing what has been working for now 6 years and counting until I can see evidence that it no longer works.