Search result hijacking is a surprisingly straightforward process. This post covers the theory, the test cases run by the Dejan SEO team, and ways for webmasters to defend against search result theft.
I wish to thank Jim Munro, Rob Maas and Rand Fishkin for allowing me to run my experiment on their pages.
UPDATE: Google has issued a search quality notification for dejanmarketing.com.
Brief Introduction
Before I go any further I’d like to make it clear that this is not a bug, hack or exploit: it’s a feature. Google’s algorithm prevents duplicate content from displaying in search results, and everything is fine until you find yourself on the wrong end of the duplication scale. From time to time a larger, more authoritative site will overtake a smaller website’s position in the rankings for its own content. Read on to find out exactly how this happens.
Search Theory
When there are two identical documents on the web, Google will pick the one with the higher PageRank and use it in search results. It will also forward any links from any perceived ‘duplicate’ towards the selected ‘main’ document. This idea first came to mind while reading a paper called “Large-scale Incremental Processing Using Distributed Transactions and Notifications” by Daniel Peng and Frank Dabek of Google.
Here is the key part:
“Consider the task of building an index of the web that can be used to answer search queries. The indexing system starts by crawling every page on the web and processing them while maintaining a set of invariants on the index. For example, if the same content is crawled under multiple URLs, only the URL with the highest PageRank [28] appears in the index. Each link is also inverted so that the anchor text from each outgoing link is attached to the page the link points to. Link inversion must work across duplicates: links to a duplicate of a page should be forwarded to the highest PageRank duplicate if necessary.”
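To make that invariant concrete, here is a toy sketch (my own illustration in Python, not Google’s actual code or data structures): group URLs by content, keep the highest-PageRank URL in each group as the ‘main’ document, and credit inbound links and their anchor text to it. The URLs and PageRank values are placeholders loosely based on the first case study below.

```python
# Toy sketch of the indexing invariant quoted above (illustrative only):
# among URLs with identical content, only the highest-PageRank URL is kept,
# and links pointing at any duplicate are credited to that "main" URL.
from collections import defaultdict

def select_canonicals(pages):
    """pages: {url: (content_hash, pagerank)} -> {url: winning_url}."""
    by_content = defaultdict(list)
    for url, (content_hash, pagerank) in pages.items():
        by_content[content_hash].append((pagerank, url))
    canonical_of = {}
    for group in by_content.values():
        winner = max(group)[1]  # highest-PageRank duplicate wins the cluster
        for _, url in group:
            canonical_of[url] = winner
    return canonical_of

def invert_links(links, canonical_of):
    """links: [(source_url, target_url, anchor_text)] -> inbound links per winner."""
    inbound = defaultdict(list)
    for source, target, anchor in links:
        inbound[canonical_of.get(target, target)].append((source, anchor))
    return inbound

# Placeholder data: the copy carries more PageRank than the original.
pages = {
    "http://www.marketbizz.nl/en/ReferentieEN.htm": ("same-content", 1.0),
    "http://rob.dejanmarketing.com/ReferentieEN.htm": ("same-content", 4.0),
}
canonical_of = select_canonicals(pages)
# Both URLs now resolve to the higher-PageRank copy, which also collects the links.
```

Everything in the experiments below follows from these two steps: whichever duplicate accumulates the most PageRank becomes the version Google shows and credits.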
Case Studies
I decided to test the above theory on real pages from Google’s index. The following pages were our selected ‘victims’:
- MarketBizz
- Dumb SEO Questions
- ShopSafe
- Rand Fishkin’s Blog
Case Study #1: MarketBizz
26 October 2012: Rob Maas kindly volunteered for the first stage test and offered one of his English language pages for our first ‘hijack’ attempt. We set up a subdomain called rob.dejanmarketing.com and created a single page http://rob.dejanmarketing.com/ReferentieEN.htm by copying the original HTML and images. The newly created page was +’ed and linked to from our blog. At this stage it was uncertain how similar (or identical) the two documents had to be for our test to work.
30 October 2012: Search result successfully hijacked. Not only did our new subdomain replace Rob’s page in the results, but the info: command now showed the new page even when queried for the original URL, and the original’s PageRank of 1 was replaced by the new page’s PageRank of 0. Note: Do not confuse the toolbar PageRank of zero with real-time PageRank, which was calculated to be 4.
Notice how the info: search for the URL returns our test domain instead?
So all it took was a stronger flow of PageRank to the new page and a few days to allow the new page to be indexed.
Search for text from the original page also returned the new document:
One interesting fact is that site:www.marketbizz.nl still returns the original page “www.marketbizz.nl/en/ReferentieEN.htm” rather than omitting it from site search results. That URL, however, returns no cache result, just like the copy we created. Google’s merge appears thorough and complete in this case.
Case Study #2: dumbseoquestions.com
30 October 2012: Jim Munro volunteers his website dumbseoquestions.com in order to test whether authorship helps against result hijacking attempts. We copied his content and replicated it on http://dsq.dejanmarketing.com/ without copying any media across.
1 November 2012: Jim’s page was replaced with our subdomain, rendering Jim’s original a duplicate in Google’s index. This suggests that authorship did very little or nothing to stop this from happening.
The original website was replaced for both info: command and search queries.
Interesting Discovery
A search for the exact-match brand “Dumb SEO Questions” returns the correct result rather than the newly created subdomain. This potentially reveals a domain/query-match layer of Google’s algorithm in action.
Whether Jim’s authorship helped in this instance is uncertain, but we did discover two conflicting search queries:
- Today we were fortunate to be joined by Richard Hearne from Red Cardinal Ltd. (returns the original site)
- Dumb+SEO+questions+answered+by+some+of+the+world’s+leading+SEO+practitioners (returns a copy)
Case Study #3: Shop Safe
We created the subdomain http://shopsafe.dejanmarketing.com/ to replicate a page which contained rel=”canonical”. Naturally, the tag was stripped from the duplicate page for the purposes of the experiment.
This page managed to overtake the original in search, but it never replaced it when tested with the info: command. All +1’s were purposely removed after the hijack to see if the original page would be restored. Several days later the original page overtook the copy; however, it is unclear whether the +1’s had any impact on this.
Possible defense mechanisms:
- Presence of rel=”canonical” on the original page
- Authorship markup / link from Google+ profile
- +1’s
Case Study #4: Rand Fishkin’s Blog
Our next test was related to domain authority, so we picked a hard one. Rand Fishkin agreed to a hijack attempt, so we set up a page in a similar way to the previous experiments with a few minor edits (rel/prev, authorship, canonical). Given that a considerable amount of code was changed, I did not expect this particular experiment to succeed to its full extent.
We did manage to hijack Rand’s search result for both his name and one of his articles, but only for Australian searches:
Notice that the top result is our test domain, only a few days old. Same goes for the test blog post which now replaces the original site in Australian search results:
This “geo-locking” could be happening for at least two reasons:
- the copy is hosted on a .com.au domain
- .au domain links point towards the copied page
Not a Full Hijack
Interesting Observation
When a duplicate page is created and merged into a main “canonical” document version, it will display its PageRank, cache, links and info: results, and in Rand’s case also its +1’s. Yes, even +1’s. For example, if you +1 a designated duplicate, the selected main version will receive the +1’s. Similarly, if you +1 the selected main URL, the change in +1’s will immediately be reflected on any recognised copies.
Example: http://rand.dejanmarketing.com/ – URL shows 18 +1’s which really belong to Rand’s main blog.
When a copy receives higher PageRank, however, and the switch takes place, all links and social signals are re-assigned to the “winning” version. So far we have seen two variants of this: in the case of a full hijack, there are no +1’s for the removed version and all of the +1’s go to the winning document, while borderline cases seem to show +1’s for both documents. Note that this could also be due to code/authorship markup on the page itself.
We’re currently investigating the cause for this behavior.
Preventative Measures
Further testing is needed to confirm the most effective ways for webmasters to defend against result/document hijacking by stronger, more authoritative pages.
Canonicalisation
Most scraper sites will simply mirror your content or copy a substantial amount of it from your site, typically at the code level (particularly if the process is automated). This means that a properly set rel=”canonical” (with a full URL) helps ensure Google knows which document is the canonical version. Google treats rel=”canonical” as a hint rather than an absolute directive, however, so the URL replacement could still happen in search results even if you canonicalise your pages.
There is also a way to protect documents (e.g. PDFs) through HTTP header canonicalisation:
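As a rough sketch of what that could look like (one possible implementation, assuming a Python/Flask app; the domain, route and file name are placeholders, not from the original post):

```python
# A hedged sketch: sending a rel="canonical" Link header with a PDF response
# so Google knows which URL is the preferred version of the document.
# Flask, "example.com" and the file name are illustrative assumptions.
from flask import Flask, send_file

app = Flask(__name__)

@app.route("/guide.pdf")
def guide_pdf():
    response = send_file("guide.pdf")
    response.headers["Link"] = '<https://www.example.com/guide.pdf>; rel="canonical"'
    return response
```

The same header can be emitted by any web server configuration; the important part is the Link header value itself.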
Authorship
I am not entirely convinced that authorship will do much to prevent a search result swap by a more juiced URL; however, it could be a contributing factor or signal, and it doesn’t hurt to have it implemented regardless.
Internal Links
Using full URLs to reference your home page and other pages on your site means that if somebody scrapes your content, they will automatically link back to your pages and pass PageRank to them. This, of course, doesn’t help if they edit the page and rewrite the URLs to point at their own domain.
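As a hedged illustration (my own sketch, not from the post; BeautifulSoup and the SITE_ROOT value are assumptions), relative internal links can be rewritten to absolute URLs before a page goes live:

```python
# A rough sketch: rewrite relative internal links to absolute URLs before
# publishing, so a verbatim scrape still links back to the original site.
# Requires beautifulsoup4; SITE_ROOT is a placeholder.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

SITE_ROOT = "https://www.example.com/"

def absolutise_links(html):
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        a["href"] = urljoin(SITE_ROOT, a["href"])  # leaves external URLs intact
    return str(soup)

print(absolutise_links('<a href="/contact/">Contact us</a>'))
# -> <a href="https://www.example.com/contact/">Contact us</a>
```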
Content Monitoring
By using services such as Copyscape or Google Alerts, webmasters can monitor mentions of their brand and segments of their content as they appear online. Acting quickly and requesting either removal or a link/citation back to your site is an option if you notice that a high-authority domain is replicating your pages.
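A DIY complement to those services could look something like the sketch below (my own assumption of one approach; the URLs and the 0.8 threshold are placeholders):

```python
# A hedged sketch: fetch a suspected copy and estimate how much of its text
# overlaps with the original page, as a crude scraping alert.
# Requires requests; both URLs and the threshold are illustrative.
import difflib
import re
import requests

def visible_text(url):
    html = requests.get(url, timeout=10).text
    return re.sub(r"<[^>]+>", " ", html)  # crude tag stripping, fine for a sketch

def overlap_ratio(original_url, suspect_url):
    return difflib.SequenceMatcher(
        None, visible_text(original_url), visible_text(suspect_url)
    ).ratio()

if overlap_ratio("https://www.example.com/post/", "https://copy.example.net/post/") > 0.8:
    print("Likely duplicate - consider a removal or attribution request.")
```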
NOTE: I contacted John Mueller, Daniel Peng and Frank Dabek for comments and advice regarding this article and am still waiting to hear from them. Also, this was meant to be a draft (it was accidentally published) and is missing information about how page hijacking is reflected in Google Webmaster Tools.
PART II:
An article titled “Mind-Blowing Hack for Competitive Link Research” explains how the behaviour described above allows webmasters to see somebody else’s links in their own Google Webmaster Tools.
Dan Petrovic, the managing director of DEJAN, is Australia’s best-known name in the field of search engine optimisation. Dan is a web author, innovator and a highly regarded search industry event speaker.
ORCID iD: https://orcid.org/0000-0002-6886-3211
This is a very interesting post; it’s also worth observing that a 302 redirect isn’t needed for the hijack to succeed.
I’ve also recently experienced issues with different domains (e.g. .it, .com) having the same HTML but translated (IT – EN) content. The info, cache and link operators are still showing the wrong page in some cases. Hreflang alternate tags were in place but seemed not to help; I’m trying canonicalization too but the pages haven’t been re-crawled yet…
Thanks for the experiment, anyway, useful as usual.
Dan – interesting test. The takeaway for me is to implement the rel=canonical tag and to dig a bit deeper into the reallocation of links. Thanks.
draft 1 final 0
Thanks for sharing this Dan, love real-life SEO testing.
I have a question regarding the PageRank of the new, duplicate pages that you published on DejanSEO.
“Note: Do not confuse the toolbar PageRank of zero with real-time PageRank which was calculated to be 4.”
How did you pass enough PageRank to these to make them rank higher than the originals? Was it purely internal linking from your very strong domain?
Thanks again for sharing!
Paddy
Darn fine job!
Goes to show that G has little/no automated interest in showing the Originator, only the Popular (yet again Popularity is the influencer :sigh:).
As for prevention of Hijacking – it should be the same as the Anti-Scraping methods.
1) Create page
2) Put the page live
3) Include full name in content (top and bottom)
4) Include Date/Time stamp
5) Include full URL as Text
6) Include full URL as link
7) Include SiteName in content
8) Include SiteName in Title
9) Use the Canonical Link Element
10) Use Authorship markup to your G+ Profile URL
11) Add page URL to Sitemap
12) Ping GWMT Sitemap tool
13) Use GWMT Fetch as GoogleBot tool for new URL
14) Link from your G+ Profile to your Page URL
15) Use services such as Ping.FM and PubSubHub
16) Social-Bookmark/Social-Share the new page/URL
Unfortunately, we have no idea just how influential any of that is, but it “should” help.
Just keep in mind that G is interested in “the best”, which they view as “the most popular”.
Aww… shut up!
Yes it was just an internal link.
Very interesting test, however I believe some of the results are down to localisation of domains and local Google. For instance you are using a .com.au domain in the Australian Google. I highly doubt that the .com.au would show up in the States or UK above the hijacked sites. However that’s still up for testing 🙂
The first example also shows this as it is a .nl result. I believe there is some layer in Google’s algorithm that determines whether a foreign ccTLD is more relevant than a local ccTLD (which is why we hardly see any .com.au’s in the UK), so by using .com.au in Google Australia you may not be fully testing the hijacking issue. Very interesting study though. 🙂
Thanks Dan. Did you point any external links to any of the other duplicate pages or were they just internal links too?
Thanks Lyndon, brilliant stuff. Will have to update the article to include some of this stuff.
I see rand.dejanmarketing.com for Rand Fishkin in Poland. The same goes for Miami US I think. Just see – https://www.google.com/search?hl=en&gl=US&gr=US-FL&gcs=Miami&q=rand+fishkin
Malcom, Dan’s .com.au page is showing in the Netherlands as well!
Hi Guys
Good stuff. It’s interesting to see this kind of analysis and advanced SEO research. I am kinda surprised to see +1’s and social signals being transferred to the popular page.
Have a few questions, if you don’t mind.
1) When you said “Note: Do not confuse the toolbar PageRank of zero with real-time PageRank which was calculated to be 4”, what exactly do you mean by real-time PageRank?
2) You guys have a strong domain and the subdomains on here would naturally rank well. I am interested to see if this sort of duplication will have any impact on a relatively new domain, with a heavy social push on the newly created duplicate pages.
Looking fwd to the next segment.
Regards
Saijo George
Pretty worrying indeed, especially in situations where QDF is in play, i.e. getting your work pinched. Like almost everything else, SERPs mimic the real world. The bigger you are, the bigger you get.
google.pl – [rand fishkin blog] -> TOP 1 is rand.dejanmarketing.com (same results for google.com)
One of the best SEO posts I have read in recent times. Good experimentation and thorough analysis.
I see you are logged in when you are taking the screenshots. Does this make any difference in terms of which sites show up? Normally your personal results will be “skewed” compared to those of non-logged-in users, or even other logged-in users, as they have a different search pattern and so on. Just curious as to whether that makes a big difference in the results or not.
This is fascinating. Dan, what effects on the domain of a site with duplicated content would you expect to see? I mean, if its links are being passed to another site, would the entire domain suffer a loss of PR/authority if duplication was extensive?
I added a link from one external site to Rand’s copy.
That is a million dollar question Paul, and something I intend to ask Google.
No. Incognito mode shows the same results.
Thanks!
Yes. But at least we have ways of defending our pages.
Two pages could both be PR4, but one gained its PageRank after the public TBPR update and doesn’t show it yet. Similarly, both pages could show PR4 while one has lost it in the meantime, with the reduction not showing until the next public update.
I would imagine that hijack attempts on a weak domain would not work well.
Thanks for testing that for me!
Oh interesting, it looks like we managed to replace his blog in the US too.
I can confirm that the .com.au is currently showing up in the UK SERPs in the place of Rand’s blog. Pretty scary experiment – I can already think of some pretty black-hat things that could be done with this.
Awesome research… I used to think authorship would put a blog in the safe zone, but now it seems that I was wrong…
Great Post Dan and love the testing.
And in the uk -> http://www.google.co.uk/search?q=rand+fishkin&aq=0&oq=rand+fishkin&sugexp=chrome,mod=0&sourceid=chrome&ie=UTF-8&pws=0
You can easily test geo search with http://www.impersonal.me
It appears that you were logged into Google in the “rand fishkin” SERP screenshot. If that’s the case, Google’s personalized results could account for all your test results. Can you replicate your findings using an independent rank-tracking tool?
PS: excellent experiment btw, thanks for sharing!
This blows my mind…while simultaneously scaring the crap out of me. I wouldn’t have believed it had I not checked the SERPs myself. What especially perplexed me was the authorship experiment. I cannot, for the life of me, figure out how and why they would switch his URL out for yours when his was verified. Very strange indeed (but great work!).
Awesome results and article, as usual.
It made me think a lot, unusual.
Would all of this have worked if you use a subdirectory or folder instead of a subdomain?
Same for UK too!
EDIT: It would be interesting to see how long this stays in the SERP.
Incredible test with some great takeaways. Thanks for taking the time to do this. Just reiterates the importance that rel=canonical can have on a site.
Wow, interesting test and it’s pretty deep. Might’ve got lost somewhere and had to reread a few times but GOOD JOB! Keep up with the testing
This experiment feels like little kids playing a game of hide & seek and a high-adrenaline thriller at the same time! All hats off! Google’s engineers have their work cut out to come up with better stability.
I have an appreciation for the time devoted to this test but I can’t determine a useful purpose for the efforts. We recently had an issue with a client’s competitor using scraped content, which resulted in their site being banned from Google search results. Under the DMCA (Digital Millennium Copyright Act) and a handy Google tool, sites with scraped content can be reported.
I agree that it is a good idea to document ownership however I do not believe it is necessary to go overboard. Periodic checks in Copyscape will find any scrapers, and a couple of emails can get the offending site wiped out. Just my thoughts . . . .
#Hijacked publishing 🙂
Very interesting test, Dan. The result is quite scary actually, seeing how easy it was for you to hijack them.
Oh right 🙂
Theoretically, if it’s only looking at domain authority (which I would assume it’s not), anyone could do negative SEO by chucking up duplicate content on wordpress.com, which is scary.
I would also be interested to see if the rankings naturally fall back to normal after a while (assuming the freshness is lost on those posts). From what I can see in your study, where social signals are being transferred to the duplicate page, I don’t think that will happen.
Good read and thanks for sharing. Interestingly this is what I get when searching rand fishkin blog
Australia: http://prntscr.com/j4j0r
USA : through proxy using different browser and cleared cookies http://prntscr.com/j4j2y
I’d be interested to see how this worked if it was a subpage instead of a subdomain. Did you do any tests like that?
I wonder what would happen if a stronger competitor copied your entire site on a subdomain of theirs …
This makes absolute URLs and self referential canonical tags that much more important (although it still seems competitors can outrank you regardless).
Do you think someone could build a WP plugin that does all that automatically each time you publish a page?
I would think that maybe there are some already doing stuff with this….
No comment on the logged in vs. not logged in factor?
Well, this is just amazing. I have seen similar results: I also did a duplicate content test with my own blog, and not only was our post indexed but it also outranked the other article.
Hi Dan,
“Note: Do not confuse the toolbar PageRank of zero with real-time PageRank which was calculated to be 4.”
How did you calculate the real time pagerank?
We’ve definitely seen this for a while. Until G+ matures and they can sift through real social profiles vs artificial there’s not going to be much social influence. I suspect they’re able to establish other Google account activity to solidify and expedite a person’s interests etc. You always hope it never happens to you but ya it happens every day.
Very interesting article Dan. Great job, I’ll be interested to read the Webmaster tools piece…
Also in Italy https://www.google.it/#hl=it&output=search&sclient=psy-ab&q=rand+fishkin&oq=rand+fishkin&gs_l=hp.3..0j0i30l3.690.690.0.1328.1.1.0.0.0.0.86.86.1.1.0…0.0…1c.1._GaWPwdCHwY&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.&fp=d47350a3b0a6d4fc&bpcl=37643589&biw=1920&bih=955
Loved this. Until I was trying to find Rand’s blog and couldn’t lol
I’m not sure Incognito mode is that reliable – try adding &pws=0 to the search string. My results when searching on Rand’s name do not include your test page. This blog post is listed though 🙂
Here’s a few tips on PageRank: https://plus.google.com/111588754935244257268/posts/Tt5HD76twgs
Thanks Tommy.
Yet Bing has no problems finding the real owner’s blog. Hooray Bing!
Amazing experiment. When I search for “Rand Fishkin” in US (incognito), I see the rand.dejanmarketing.com result in the 3rd position. moz.com is on the 2nd page.
Really scary results. My question is: why does G choose your page over moz.com’s page? You mentioned that the page with the higher PageRank is going to replace the other in the SERPs, but I don’t think your newly created subdomain has a higher PageRank. What is the factor that makes your page superior?
Looking forward to more!
-Oleg
I’ll confirm that his page ranks for “Rand Fishkin” in USA, incognito browsing ahead of moz.com/rand (the original page)
Thx!
I just ran a couple tests too, in Canada dejanmarketing.com is ranking #1 – in US, ranking #3. This is absolutely shocking to see first hand. Well done sir, well done.
It does have higher pagerank, you just can’t see it because Google hasn’t pushed out fresh TBPR.
This is one of the best things I have read on ‘whitehat’ blogs. Really decent.
These are all interesting case studies. So it can be seen what impact authorship has.
Great article… I would like to see this test with normal domains and not subdomains; I think that you will get very different results. Great article…
That’s a fantastic series of tests with interesting results. Thanks for publishing it. I would’ve thought the oldest (first indexed) would win. Isn’t that what Google’s been saying? Strange that it doesn’t. Great stuff!
So you are saying that the document with higher PR will take over in the results, but how did you build higher PR for those new subdomains you had just created so that they gained so quickly in search? I would think a new subdomain would take more than a day or two to build enough PR to overtake the original. Did you do anything aggressive to gain PR quickly, resulting in the SERPs you saw, or was this completely natural, with the copy overtaking the original page based purely on domain authority?
Once these sites are detected and reported, the big G takes manual action… they have helped plenty; read the comments above for more details… Use Copyscape to detect copies of your site, then take appropriate action through Google Webmaster Tools…
I am sure somebody will. The question is: what is the big G doing about this hole in their system?
I’m relieved that I’m using rel=canonical, have implemented “authorship” with a WordPress plugin, have set my Google+ pages up according to their procedures, have referenced back to the sites I write on, and have moved my sites to CloudFlare to monitor bot threats, set threat levels as further protection against scraping, and block whole countries that show a propensity to crawl my sites and are noted as “high level threats.”
It’s not only about implementing these things but being very consistent about posting back to your social media (particularly Google+) to retain some authorship protection. Google doesn’t commit to saying “authorship” will happen but in the long-term I believe it can’t hurt publishers to follow that practice.
Great post Dan! I quite like what you do in the SEO industry. This was a great experiment and revealed a lot of things I didn’t know. Look forward to you sharing more of these experiment results in the future.
Wow… cool stuff! I never thought that was possible without offpage stuff.
I don’t really understand why this is news? We all knew this?
Can scraped content actually help given it’s an exact copy, and all my posts have a ton of internal linking on FinancialSamurai.com and Yakezie.com?
Thanks, Sam
Hold on… the site that swipes my site’s content also benefits from my links?!
That explains a lot, and this means war.
Well, I noticed this some time ago with a scraper blog.
Would love to see an additional test pinging first https://code.google.com/p/pubsubhubbub/. Thoughts?
No, we purposely sent PageRank to the experiment page by linking to it from a few other posts on our site.
Great idea, but I don’t think I’ll be running another experiment of this type. We have already received a warning from Google search quality team.
Thanks!
Fantastic! That means they now know they have a bug in the algorithm.
Dan. Congratulations on an extremely informative and useful blog for SEOs. I’ve done a lot of work on geo targeting and duplication of content, so I fully respect and appreciate the effort you’ve gone to with this blog.
On the ‘interesting observation’ part, I too noticed that tweets and likes were being assigned to the duplicate copy when they should have been attributed to the original. Again, great insight.
LOL, That is hilarious. Did you tell them what you were doing?
Interesting, and more than a little scary, results. I can see how this would arm blackhat spammers.
You mentioned that the links to duplicate content get transferred to the original webpage. This raises an obvious Penguin-related question: what prevents a malicious person from scraping your site onto a low-PR domain and then spamming the duplicate domain with truckloads of bad links?
I would hope that Google has some fail-safes to prevent abuse like that, but the algo updates in the past year have really shaken my confidence in the big G.
What do you think about this?
… but your site is still ranking high, so what exactly did they complain about? Copying content?
So, I followed up with a test using a couple of Press Releases through PRweb.
When the content is placed on the site first (using PubSubHubbub), the original site gets credit for the content over PRWeb (and other channels). It is interesting, though, that when searching for different exact strings within the article, sometimes it returns PRWeb and sometimes it returns the article on our site.
On another site that has PubSubHubbub and Google authorship, all the exact strings return our site’s article ahead of PRWeb and their channels.
Brilliant! I will note it and use it as reference always! Thank you.
Dan,
I am very impressed at the effort you put into this experiment. How did you calculate the real-time PageRank for the subdomains? At the time of the experiment, did you or the owners of the sites know that you could view their link profiles in GWT?
Really? I will have to look into that – as that would be impressive!
This is interesting: the possibility of hijacking someone else’s reputation with a higher-PR link flow to a fake page. It means that small businesses with smaller PR (PR 3-4) are easier targets for irresponsible people. There should be a way to protect these smaller sites.
I think the internal linking made it possible.
What would Google do after reading this report?
Thanks for the information.. Now I get better page rank from doing this type of job.. Great Article keep it up…
Awesome, and congratulations! Nice, helpful article as always 🙂
Hey Dan,
These are interesting theories and I find your observations very plausible. The mechanics of the game have become more complex and ambiguous with the addition of social signals to the picture. I’m looking forward to learning about the results of your investigation. Thanks.