Testing the influence of URL citations and term proximity on document indexation and ranking
Our aim was to determine whether phrases surrounding non-link URL citations on the web act as an off-site ranking factor. For the experiment we had 42 citation pages indexed. Each page contained the written URL of our test document with a test phrase in its proximity. We then checked whether Google’s search results would return the cited document for our test phrase query.
In the first stage of the experiment we saw evidence of document discovery, but our test page remained unindexed. In the second stage we passed PageRank to the test page from two independent sources (PageRank 6 and 7), which led to prompt page indexation. Since our search query never returned the test page, we conclude that while Google uses URL citations for document discovery, there doesn’t appear to be a link between newly found documents and the terms surrounding their citations. Our next experiment will test the impact of URL citations on document rankings.
Background
In 2012 we performed an experiment designed to test the impact of URL citations on the ranking of documents on the web. We found none; however, we did confirm that Google will use text-based web addresses (non-hrefs) to discover and potentially index new documents. One of the remaining questions was whether a term’s proximity to a URL citation can cause the URL to be returned in search results for that term, even if the term does not appear in the page content. We decided to test this because of our prior findings regarding term proximity.
Experiment Setup
Domain: Aged with an existing live website.
Subdomain: Newly created. Simple landing page, not linking to our test page.
Test Page: subdomain.domain.com/test-page.html
External URL Citations: 42
Citation Examples:
…keyword: http://url.com/test-page.html
…keyword is here: http://url.com/test-page.html
…in http://url.com/test-page.html (keyword).
…keyword, for example: http://url.com/test-page.html
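To make "a test phrase in the proximity of a written URL" concrete, here is a minimal Python sketch that finds plain-text URLs and collects the words around them. It only illustrates the idea being tested and says nothing about how Google processes citations. The regular expression, the `terms_near_citations` name and the three-word `window` are assumptions made for this example.

```python
import re

# Hypothetical pattern for a written (non-link) URL; real-world detection is messier.
URL_RE = re.compile(r"https?://[^\s()<>]+")

def terms_near_citations(text, window=3):
    """Map each written URL to the words found within `window` tokens of it."""
    tokens = text.split()
    result = {}
    for i, token in enumerate(tokens):
        match = URL_RE.search(token)
        if not match:
            continue
        # Drop sentence punctuation that trails the URL.
        url = match.group(0).rstrip(".,;:")
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        # Normalise neighbours: strip punctuation, lowercase, skip empties.
        words = [re.sub(r"\W+", "", w).lower() for w in neighbours]
        result.setdefault(url, []).extend(w for w in words if w)
    return result

print(terms_near_citations("keyword is here: http://url.com/test-page.html"))
print(terms_near_citations("in http://url.com/test-page.html (keyword)."))
```

Both citation patterns place "keyword" within the window, whether it comes before or after the URL.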
Observations
Event Timeline
Date | Recorded Events
16/12/14 | Page created.
17/12/14 | Increasing the number of referring documents. Some already indexed.
18/12/14 | Total of 42 referring documents created.
19/12/14 | Most referring documents indexed.
20/12/14 | All referring documents indexed.
30/12/14 | First Googlebot visit to the test URL.
02/01/15 | Two PageRank sources added pointing at “/test-page.html”.
03/01/15 | Test page indexed and cached.
10/01/15 | Experiment ends.
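The intervals between timeline events are easy to misread because the dates are written day first (dd/mm/yy). A small Python sketch of the arithmetic follows; the `days_between` helper is a hypothetical name introduced for this example, and the dates are copied from the timeline.

```python
from datetime import datetime

def days_between(start, end, fmt="%d/%m/%y"):
    """Whole days between two timeline dates written as dd/mm/yy."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days

# Page created -> first Googlebot visit to the test URL
created_to_visit = days_between("16/12/14", "30/12/14")
# PageRank sources added -> test page indexed and cached
pagerank_to_index = days_between("02/01/15", "03/01/15")
# Page created -> test page indexed and cached
created_to_index = days_between("16/12/14", "03/01/15")

print(created_to_visit, pagerank_to_index, created_to_index)
```

On these dates the test page was first crawled 14 days after it was created, and it was indexed one day after the PageRank sources were added.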
Home Page Priority
The home page of the site hosting a newly detected page gets indexed as a matter of priority. This suggests that the home page may influence the treatment of the test document through its nature (the root of a website or a thin landing page) and its links (a link to the test page from the domain’s root may be a signal of importance).
Googlebot Timing
Googlebot’s follow-up visits to the home page happened at exactly the same times of day (01:56:00 and 02:05:56) as the original visits three days earlier.
66.249.69.243 - - [17/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.235 - - [17/Dec/2014:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [17/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [20/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
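The repeated visit times can be checked mechanically rather than by eye. The sketch below assumes the server writes the common "combined" access-log format seen in the entries above; `LOG_RE` and `recurring_times` are hypothetical names for this example, and `SAMPLE` holds the six entries from this section. Note that it matches on the user-agent string only, which does not prove a visitor is really Googlebot.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: combined access-log format, as in the entries above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE = "\n".join([
    f'66.249.69.243 - - [17/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "{UA}"',
    f'66.249.69.235 - - [17/Dec/2014:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "{UA}"',
    f'66.249.69.243 - - [17/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 "-" "{UA}"',
    f'66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "{UA}"',
    f'66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "{UA}"',
    f'66.249.69.219 - - [20/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 "-" "{UA}"',
])

def recurring_times(log_text):
    """Return {time of day: sorted dates} for Googlebot hits seen on more than one date."""
    dates_by_time = defaultdict(set)
    for line in log_text.splitlines():
        match = LOG_RE.match(line)
        # User-agent match only; this does not verify the visitor's identity.
        if not match or "Googlebot" not in match.group("agent"):
            continue
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        dates_by_time[ts.strftime("%H:%M:%S")].add(ts.date().isoformat())
    return {t: sorted(d) for t, d in dates_by_time.items() if len(d) > 1}

for time_of_day, dates in sorted(recurring_times(SAMPLE).items()):
    print(time_of_day, dates)
```

On the sample, both 01:56:00 and 02:05:56 recur on 17 and 20 December, which is the pattern described above.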
Googlebot Follows Non-Link URL Citations
Googlebot visited our test page 14 days after the first referring page was indexed; however, the test page itself was not indexed. This is likely due to a lack of PageRank.
66.249.69.219 - - [30/Dec/2014:18:42:23 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [31/Dec/2014:02:02:58 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Appendix: Robot Visits
Googlebot
66.249.69.243 - - [16/Dec/2014:07:23:39 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [17/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.235 - - [17/Dec/2014:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [17/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [19/Dec/2014:00:42:23 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [19/Dec/2014:00:42:23 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [20/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [30/Dec/2014:18:42:23 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [31/Dec/2014:02:02:58 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
54.144.254.10 - - [31/Dec/2014:06:46:08 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.205.17.56 - - [31/Dec/2014:06:46:41 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.163.187.152 - - [31/Dec/2014:07:00:51 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.205.17.56 - - [31/Dec/2014:07:15:31 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.159.213.52 - - [31/Dec/2014:07:30:23 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.147.218.217 - - [31/Dec/2014:07:34:34 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.204.247.27 - - [31/Dec/2014:07:38:14 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
66.249.69.235 - - [31/Dec/2014:09:17:09 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.235 - - [01/Jan/2015:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [01/Jan/2015:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.235 - - [01/Jan/2015:02:06:01 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Yahoo! Slurp
68.180.229.34 - - [30/Dec/2014:21:48:23 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
Baiduspider
180.76.5.27 - - [27/Dec/2014:05:24:33 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.5.155 - - [30/Dec/2014:23:28:41 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
Dan Petrovic, the managing director of DEJAN, is Australia’s best-known name in the field of search engine optimisation. Dan is a web author, innovator and a highly regarded search industry event speaker.
ORCID iD: https://orcid.org/0000-0002-6886-3211