If there are multiple instances of the same document on the web, the highest authority URL becomes the canonical version. The rest are considered duplicates.
- Inbound links pointing towards duplicates are inverted towards the canonical URL
- This is called “link inversion”
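The mechanism above can be sketched in a few lines of Python. This is a toy model under my own assumptions (a single numeric "authority" score per URL and a flat set of inbound links), not how Google's pipeline actually represents these signals:

```python
# Toy sketch of canonical selection and "link inversion":
# the highest-authority copy wins, and every duplicate's inbound
# links are folded into (credited to) the canonical URL.

def canonicalize(copies):
    """Pick the highest-authority URL and merge all inbound links into it.

    `copies` maps URL -> {"authority": float, "inlinks": set of linking domains}.
    Returns (canonical_url, merged_inlinks).
    """
    canonical = max(copies, key=lambda url: copies[url]["authority"])
    merged_links = set()
    for page in copies.values():
        merged_links |= page["inlinks"]  # duplicates' links count for the canonical
    return canonical, merged_links

# Hypothetical example: your page vs. a low-authority scraper copy.
copies = {
    "https://yoursite.example/post": {"authority": 0.6,
                                      "inlinks": {"a.example", "b.example"}},
    "https://scraper.example/copy": {"authority": 0.1,
                                     "inlinks": {"c.example"}},
}
canonical, links = canonicalize(copies)
```

Here `canonical` is your URL and `links` contains all three linking domains, including the one that originally pointed at the scraper, which mirrors what the example below describes.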
Example

Let’s say you publish a document on your website today, and a few days later some small scrapers copy your page. You’re still the higher authority, so your copy counts as the canonical document. Your URL shows up in Google’s search results, while all the other URLs are considered duplicates and their inbound links are counted towards yours. So far so good. But imagine that a week later Google picks up the exact same document on a website with higher authority than yours. What happens? Now you’re the duplicate, and your inbound links count towards the new canonical URL of that document.
Background

Almost a decade ago, Google realised they no longer met their users’ expectations. Their results were stale and lagged behind what was really happening on the web. So, in 2010, Daniel Peng and Frank Dabek tackled the problem of speed and freshness by retiring the traditional batch-based indexing system powered by MapReduce. They introduced a completely new system which allowed them to transform large datasets progressively, through numerous small, independent mutations. They called it Percolator. Full paper: Large-scale Incremental Processing Using Distributed Transactions and Notifications.
Percolator & Caffeine
- Percolator is an incremental processing system which prepares web pages for inclusion in the live index
- Caffeine is a Percolator-based indexing system
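The incremental idea can be illustrated with a minimal sketch: each new or changed document triggers a small, independent mutation of a live index, rather than a full batch rebuild of the corpus. This is only an illustration of the concept; the real Percolator runs distributed transactions over Bigtable with notification-driven "observers", none of which is modelled here:

```python
# Minimal sketch of incremental indexing in the Percolator spirit:
# process each document as it arrives, touching only the postings
# that document affects, instead of reprocessing the whole corpus.

index = {}  # term -> set of document URLs containing that term

def process_document(url, text):
    """Observer-like step: apply a small mutation per incoming document."""
    for term in set(text.lower().split()):
        index.setdefault(term, set()).add(url)

# Hypothetical documents arriving one at a time:
process_document("https://a.example", "fresh news today")
process_document("https://b.example", "fresh results")
# The posting list for "fresh" now covers both documents, and at no
# point was the existing index thrown away and rebuilt from scratch.
```

The contrast with the retired MapReduce pipeline is the batching: a batch system would collect documents, rebuild the whole index, and only then publish it, which is exactly the staleness the article says Google wanted to eliminate.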
Google officially announced Caffeine in June that year.
“We have built and deployed Percolator and it has been used to produce Google’s websearch index since April, 2010.”
— Page 13, Section 5: Conclusion and Future Work
Ranking Complexity

Of course, things get more complex when dealing with partial content, and a multitude of other signals may influence rankings, including content visibility, personalisation, location, device, timing, search context and intent. That said, Google works at scale, and ultimately their search quality team and engineers care about the end user first. Even if the publisher is somewhat disadvantaged in the process, that’s not as bad as if it were the other way around.
Nothing New

Two years after Caffeine was released, I demonstrated this behaviour in a controlled set of experiments, including Rand’s blog (with the permission of all parties involved). As a reward, Google penalised me.
Scepticism

Whenever I run an experiment, there will always be people who tell me it’s impossible to test Google because there are just too many variables. These are the people who would also have a hard time accepting they’re wet if I poured a bucket of water over their heads. One such bucket is the fact that “link inversion” isn’t some concept SEO people read about in a research paper which Google may or may not use in practice. When triggered, inverted links from other domains actually show up in your Search Console.
Follow-Up Tests

Before publishing this article I ran a few quick tests and successfully took over as the canonical result every time. I used Search Console to submit the new page to the index and take over from the original content publisher. It took me 30 seconds. A week after the test, the links from the other domain showed up in my Search Console as if they were mine.
I have a client that has a PDF on their site. They are not the original business to feature it, many people are distributors for this product line. I noticed in GSC that they are credited with incoming links, b/c the PDF exists on other sites.— John Locke (@Lockedown_) October 11, 2018