How I Hijacked Rand Fishkin’s Blog
Search result hijacking is a surprisingly straightforward process. This post covers the theory, the test cases run by the Dejan SEO team, and ways for webmasters to defend against search result theft.
I wish to thank Jim Munro, Rob Maas and Rand Fishkin for allowing me to run my experiment on their pages.
Before I go any further I’d like to make it clear that this is not a bug, hack or an exploit – it’s a feature. Google’s algorithm prevents duplicate content displaying in search results and everything is fine until you find yourself on the wrong end of the duplication scale. From time to time a larger, more authoritative site will overtake smaller websites’ position in the rankings for their own content. Read on to find out how exactly this happens.
When there are two identical documents on the web, Google will pick the one with higher PageRank and use it in results. It will also forward any links from any perceived ‘duplicate’ towards the selected ‘main’ document. This idea first came to my mind while reading a paper called “Large-scale Incremental Processing Using Distributed Transactions and Notifications” by Daniel Peng and Frank Dabek from Google.
Here is the key part:
“Consider the task of building an index of the web that can be used to answer search queries. The indexing system starts by crawling every page on the web and processing them while maintaining a set of invariants on the index. For example, if the same content is crawled under multiple URLs, only the URL with the highest PageRank appears in the index. Each link is also inverted so that the anchor text from each outgoing link is attached to the page the link points to. Link inversion must work across duplicates: links to a duplicate of a page should be forwarded to the highest PageRank duplicate if necessary.”
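To make the quoted invariant concrete, here is a minimal Python sketch of how I read it. It is not Google's code: the function names, the data layout and the use of an exact content hash to detect duplicates are all my own assumptions for illustration, and real duplicate detection is certainly more tolerant than an exact hash.

```python
from collections import defaultdict
from hashlib import sha256


def select_canonicals(pages):
    """Map every URL to the highest-PageRank URL that has the same content.

    `pages` is a hypothetical structure: {url: {"content": str, "pagerank": float}}.
    Duplicates are detected by an exact content hash (an assumption).
    """
    clusters = defaultdict(list)
    for url, page in pages.items():
        digest = sha256(page["content"].encode("utf-8")).hexdigest()
        clusters[digest].append(url)

    canonical_of = {}
    for urls in clusters.values():
        # Only the URL with the highest PageRank "appears in the index".
        winner = max(urls, key=lambda u: pages[u]["pagerank"])
        for url in urls:
            canonical_of[url] = winner
    return canonical_of


def forward_links(links, canonical_of):
    """Invert links and attach them to the winning duplicate.

    `links` is a list of (source_url, target_url, anchor_text) tuples.
    Returns {canonical_target: [(source_url, anchor_text), ...]}.
    """
    inverted = defaultdict(list)
    for source, target, anchor in links:
        # Links to a duplicate are forwarded to the selected main document.
        inverted[canonical_of.get(target, target)].append((source, anchor))
    return dict(inverted)
```

If a copy with a higher PageRank value is fed in next to the original, `select_canonicals` maps both URLs to the copy, and `forward_links` credits the original's inbound links to the copy. That is the behaviour the experiments below set out to test.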
I decided to test the above theory on real pages from Google’s index. The following pages were our selected ‘victims’.
- MarketBizz
- Dumb SEO Questions
- Shop Safe
- Rand Fishkin’s Blog
Case Study #1: MarketBizz
26 October 2012: Rob Maas kindly volunteered for the first stage test and offered one of his English language pages for our first ‘hijack’ attempt. We set up a subdomain called rob.dejanmarketing.com and created a single page, http://rob.dejanmarketing.com/ReferentieEN.htm, by copying the original HTML and images. The newly created page was +1’ed and linked to from our blog. At this stage it was uncertain how similar (or identical) the two documents had to be for our test to work.
30 October 2012: Search result successfully hijacked. Not only did our new subdomain replace Rob’s page in results, but the info: command was now showing the new page even for the original URL, and its original PageRank 1 was replaced by the PageRank “0” of the new page. Note: Do not confuse the toolbar PageRank of zero with real-time PageRank, which was calculated to be 4.
Notice how the info: search for the URL returns our test domain instead?
So all it took was a stronger flow of PageRank to the new page and a few days for it to be indexed.
Search for text from the original page also returned the new document:
One interesting fact is that site:www.marketbizz.nl still returns the original page “www.marketbizz.nl/en/ReferentieEN.htm” and does not omit it from site search results. That URL does not return any results for cache, though, just like the copy we created. Google’s merge seems pretty thorough and complete in this case.
Case Study #2: dumbseoquestions.com
30 October 2012: Jim Munro volunteered his website dumbseoquestions.com in order to test whether authorship helps against result hijacking attempts. We copied his content and replicated it on http://dsq.dejanmarketing.com/ without copying any media across.
1 November 2012: The next day Jim’s page was replaced with our subdomain, rendering Jim’s original as a duplicate in Google’s index. This suggests that authorship did very little or nothing to stop this from happening.
The original website was replaced for both the info: command and search queries.
A search for the exact match brand “Dumb SEO Questions” returns the original site and not the newly created subdomain. This potentially reveals a domain/query match layer of Google’s algorithm in action.
Whether Jim’s authorship helped in this instance is uncertain, but we did discover two conflicting search queries:
- Today we were fortunate to be joined by Richard Hearne from Red Cardinal Ltd. (returns the original site)
- Dumb+SEO+questions+answered+by+some+of+the+world’s+leading+SEO+practitioners (returns a copy)
Case Study #3: Shop Safe
We created the subdomain http://shopsafe.dejanmarketing.com/ to replicate a page which contained rel="canonical". Naturally the tag was stripped from the duplicate page for the purposes of the experiment.
This page managed to overtake the original in search, but never replaced it when tested using the info: command. All +1’s were purposely removed after the hijack to see if the original page would be restored. Several days later the original page overtook the copy; however, it is unclear whether the +1’s had any impact on this.
Possible defense mechanisms:
- Presence of rel="canonical" on the original page
- Authorship markup / link from Google+ profile
Case Study #4: Rand Fishkin’s Blog
Our next test was related to domain authority, so we picked a hard one. Rand Fishkin agreed to a hijack attempt, so we set up a page in a similar way to the previous experiments with a few minor edits (rel/prev, authorship, canonical). Given that a considerable amount of code was changed, I did not expect this particular experiment to succeed fully.
Notice that the top result is our test domain, only a few days old. Same goes for the test blog post which now replaces the original site in Australian search results:
This “geo-locking” could be happening for at least two reasons:
- A .au domain hosts the copy
- Links from .au domains point towards the copied page
Not a Full Hijack
When a duplicate page is created and merged into a main “canonical” document version, it will display the main version’s PageRank, cache, links and info: result and, in Rand’s case, its +1’s as well. Yes, even +1’s. For example, if you +1 a designated duplicate, the selected main version will receive the +1. Similarly, if you +1 the selected main URL, the change in +1’s will immediately be reflected on any recognised copies.
Example: http://rand.dejanmarketing.com/ – URL shows 18 +1’s which really belong to Rand’s main blog.
When a copy receives higher PageRank, however, and the switch takes place, all links and social signals are re-assigned to the “winning” version. So far we have seen two variants of this. In a full hijack, we see no +1’s for the removed version and all +1’s for the winning document, whereas borderline cases seem to show +1’s for both documents. Note that this could also be due to code/authorship markup on the page itself.
We’re currently investigating the cause for this behavior.
Further testing is needed to confirm the most efficient way for webmasters to defend against result/document hijacking by stronger, more authoritative pages.
Most sites that copy you will simply mirror your content or scrape a substantial amount of it. This is typically done at the code level (particularly if automated), so a properly set rel="canonical" (with a full URL) gets copied along with the page and tells Google which document is the canonical version. Google treats rel="canonical" as a hint rather than an absolute directive, so the URL replacement can still happen in search results even if you canonicalise your pages.
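A quick way to audit your own pages for this is to check that each one carries exactly one canonical tag and that it holds a full URL. The following is a minimal sketch using only Python's standard library; the class and function names are hypothetical, and it checks the markup only, not how Google actually interprets it.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated values; attribute values can be None.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])


def check_canonical(html, expected_url):
    """Return True if the page has exactly one canonical tag, the href is a
    full (absolute) URL, and it equals the URL you expect."""
    finder = CanonicalFinder()
    finder.feed(html)
    if len(finder.canonicals) != 1:
        return False
    href = finder.canonicals[0]
    parsed = urlparse(href)
    is_absolute = bool(parsed.scheme and parsed.netloc)
    return is_absolute and href == expected_url
```

A relative href such as `/page.htm` fails this check on purpose: once the markup is mirrored onto another host, a relative canonical would resolve to the copy rather than to your original.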
Non-HTML documents (e.g. PDF) can be protected too, by sending rel="canonical" in an HTTP Link header.
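For a file such as a PDF there is no HTML head to put a tag in, so the canonical URL travels in the HTTP response as a `Link` header with `rel="canonical"`. Below is a minimal sketch as a Python WSGI application. The URL, the placeholder body and the function names are assumptions for illustration; in practice you would more likely set this header in your web server configuration.

```python
# Hypothetical canonical URL for the document being served.
CANONICAL_URL = "http://www.example.com/downloads/white-paper.pdf"


def canonical_link_header(canonical_url):
    """Build the (name, value) pair for a rel="canonical" HTTP Link header."""
    return ("Link", f'<{canonical_url}>; rel="canonical"')


def pdf_app(environ, start_response):
    """Minimal WSGI app that serves a document with a canonical Link header."""
    body = b"%PDF-1.4 placeholder bytes"  # stand-in for the real file contents
    headers = [
        ("Content-Type", "application/pdf"),
        ("Content-Length", str(len(body))),
        canonical_link_header(CANONICAL_URL),
    ]
    start_response("200 OK", headers)
    return [body]
```

The resulting response header reads `Link: <http://www.example.com/downloads/white-paper.pdf>; rel="canonical"`. As with the tag, it is a hint and not a directive.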
I am not entirely convinced that authorship will do much to prevent a search result swap by a more “juiced” URL; however, it could be a contributing factor or a signal, and it doesn’t hurt to have it implemented regardless.
Using full URLs to reference your home page and other pages on your site means that if somebody scrapes your content, they will automatically link to your pages and pass PageRank to them. This of course doesn’t help if they edit the page to point the URL paths at their own domain.
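If your templates currently emit relative links, they can be rewritten to full URLs before publishing. Here is a minimal sketch of that idea in Python. The function name is hypothetical, and the regular expression is a deliberate simplification that only handles quoted `href` attributes, so treat it as an illustration rather than a robust HTML rewriter.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or href='...' (simplified: quoted values only, and it
# would also match attributes that merely end in "href").
HREF_RE = re.compile(r'(href\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def absolutize_links(html, base_url):
    """Rewrite every quoted href in `html` to a full URL relative to `base_url`."""

    def repl(match):
        prefix, quote, url = match.groups()
        # urljoin leaves already-absolute URLs untouched.
        return f"{prefix}{quote}{urljoin(base_url, url)}{quote}"

    return HREF_RE.sub(repl, html)
```

Here `base_url` is the address the page is published at. A scraped copy of the rewritten page then keeps linking back to your domain unless the scraper rewrites the links.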
By using services such as CopyScape or Google Alerts, webmasters can monitor mentions of their brand and copies of their content online as they appear. If you notice that a high authority domain is replicating your pages, acting quickly and requesting either removal or a link or citation back to your site is an option.
NOTE: I contacted John Mueller, Daniel Peng and Frank Dabek for comments and advice regarding this article and am still waiting to hear from them. Also, this was meant to be a draft version (accidentally published) and is missing information about how page hijacking is reflected in Google Webmaster Tools.
The article titled “Mind-Blowing Hack for Competitive Link Research” explains how the behaviour described above allows webmasters to see somebody else’s links in their Google Webmaster Tools.
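Once a monitoring service has pointed you at a suspect URL, it helps to quantify how much of your page was actually copied before you contact anyone. The sketch below compares two texts by word shingles and Jaccard similarity. This is a generic near-duplicate measure that I am using for illustration; it is not how CopyScape, Google Alerts or Google's own duplicate detection work, and the function names and the shingle size of 5 are assumptions.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping `size`-word sequences in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Short texts become a single shingle (or nothing, if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=5):
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A score near 1.0 means the suspect page is close to a verbatim mirror of yours. Where to draw the line for a partial scrape is a judgment call, and the experiments above did not establish how similar two documents must be for Google to merge them.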