Testing the influence of URL citations and term proximity on document indexation and ranking

Our aim was to determine whether phrases surrounding non-link URL citations on the web act as an off-site ranking factor. For the experiment, we had 42 citation pages indexed. Each page contained the written URL of our test document with a test phrase in its proximity. We then checked whether Google would return the cited document in its search results for our test phrase query.

In the first stage of the experiment we saw evidence of document discovery, but our test page remained unindexed. In the second stage we passed PageRank to the test page from two independent sources (PageRank 6 and 7), which led to prompt indexation. Since our search query never returned the test page, we conclude that while Google uses URL citations for document discovery, there doesn’t appear to be a link between newly found documents and the terms surrounding their citations. Our next experiment will test the impact of URL citations on document rankings.

Background

In 2012 we performed an experiment designed to test the impact of URL citations on the ranking of documents on the web. We found none; however, we did confirm that Google will use text-based web addresses (non-hrefs) to discover and potentially index new documents. One of the remaining questions was whether a term’s proximity to a URL citation can cause the URL to be returned in search results for that term, even if the term does not appear in the page content. We decided to test this because of our prior findings regarding term proximity.

Experiment Setup

Domain: Aged with an existing live website.
Subdomain: Newly created. Simple landing page, not linking to our test page.
Test Page: subdomain.domain.com/test-page.html
External URL Citations: 42

Citation Examples:

…keyword: http://url.com/test-page.html

…keyword is here: http://url.com/test-page.html

…in http://url.com/test-page.html (keyword).

keyword, for example: http://url.com/test-page.html
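The citation patterns above can be checked mechanically when auditing referring documents. The following is a minimal sketch, not the tooling used in the experiment: the function name `terms_near_citation`, the word-window size and the sample sentence are all assumptions made for illustration. It finds a plain-text URL in a passage and lists the words that sit within a few words of it.

```python
import re

# Matches bare http(s) URLs written as plain text (no link markup assumed).
URL_RE = re.compile(r"https?://[^\s()<>\"]+")

def terms_near_citation(text, url, window=5):
    """Return lowercase words within `window` words of each plain-text
    occurrence of `url` in `text` (hypothetical helper for illustration)."""
    tokens = text.split()
    nearby = []
    for i, token in enumerate(tokens):
        match = URL_RE.search(token)
        # Ignore trailing punctuation that is glued to the written URL.
        if match and match.group(0).rstrip(".,;:") == url:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            nearby.extend(re.sub(r"\W+", "", word).lower() for word in context)
    return [word for word in nearby if word]

# Hypothetical sample modelled on the second citation pattern.
sample = "An example keyword is here: http://url.com/test-page.html and more text."
print(terms_near_citation(sample, "http://url.com/test-page.html", window=3))
```

A word window is only one possible definition of proximity; character distance or sentence boundaries would be equally reasonable choices.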

Observations

Event Timeline

Date

Recorded Events

16/12/14

Page created.
Starting to create referring documents.

17/12/14

Increasing the number of referring documents. Some already indexed.
01:56:00: Googlebot (66.249.69.243) hits “/robots.txt”, 404.
01:56:00: Googlebot (66.249.69.235) hits “/”, 200.
02:05:56: Googlebot (66.249.69.243) hits “/”, 200.

18/12/14

Total of 42 referring documents created.

19/12/14

Most referring documents indexed.
00:42:23: Googlebot (66.249.69.219) hits “/robots.txt”, 404.
00:42:23: Googlebot Mobile (66.249.69.243) hits “/”, 200.

20/12/14

All referring documents indexed.
01:56:00: Googlebot (66.249.69.243) hits “/robots.txt”, 404.
01:56:00: Googlebot (66.249.69.243) hits “/”, 200.
02:05:56: Googlebot (66.249.69.219) hits “/”, 200.

30/12/14

First Googlebot visit to the test URL.
18:42:23: Googlebot (66.249.69.219) hits “/test-page.html”, 200.

31/12/14

02:02:58: Googlebot (66.249.69.219) hits “/test-page.html”, 200.

02/01/15

Two PageRank sources added pointing at “/test-page.html”.

03/01/15

Test page indexed and cached.

10/01/15

Experiment Ends.

Home Page Priority

The home page of the site hosting a newly detected page gets indexed as a matter of priority. This suggests that the home page may influence the treatment of the test document through its nature (the root of a website or a thin landing page) and its links (a link to the test page from the domain’s root may be a signal of importance).

Googlebot Timing

Googlebot’s follow-up visits happened at exactly the same times of day as the original visits: 01:56:00 and 02:05:56 on both 17 and 20 December.

66.249.69.243 - - [17/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.69.235 - - [17/Dec/2014:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.69.243 - - [17/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.69.219 - - [20/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
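Recurring visit times like these are easy to miss by eye in a longer log. As a rough sketch (assuming the server writes Apache-style combined log lines like the ones quoted here; `LOG_RE` and `repeated_times_of_day` are names invented for this example), hits can be grouped by time of day to list any time that recurs on more than one date.

```python
import re
from collections import defaultdict
from datetime import datetime

# Leading fields of an Apache-style combined log line (assumed format).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def repeated_times_of_day(lines):
    """Map each HH:MM:SS seen on more than one date to the sorted dates."""
    dates_by_time = defaultdict(set)
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        dates_by_time[ts.strftime("%H:%M:%S")].add(ts.date().isoformat())
    return {t: sorted(d) for t, d in dates_by_time.items() if len(d) > 1}

UA = '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
log_lines = [
    '66.249.69.243 - - [17/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 ' + UA,
    '66.249.69.243 - - [17/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 ' + UA,
    '66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 ' + UA,
    '66.249.69.219 - - [20/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 ' + UA,
    '66.249.69.219 - - [30/Dec/2014:18:42:23 -0800] "GET /test-page.html HTTP/1.1" 200 9013 ' + UA,
]
print(repeated_times_of_day(log_lines))
```

The sample lines are copied from the log excerpts in this article; the grouping key is the server-local time as logged, so the result depends on the log's timezone staying constant.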

Googlebot Follows Non-Link URL Citations

Googlebot visited our test page 14 days after the first referring page was indexed; however, the test page itself was not indexed. This is likely due to a lack of PageRank.

66.249.69.219 - - [30/Dec/2014:18:42:23 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.69.219 - - [31/Dec/2014:02:02:58 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Appendix: Robot Visits

Googlebot

66.249.69.243 - - [16/Dec/2014:07:23:39 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [17/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.235 - - [17/Dec/2014:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [17/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [19/Dec/2014:00:42:23 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [19/Dec/2014:00:42:23 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.243 - - [20/Dec/2014:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [20/Dec/2014:02:05:56 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [30/Dec/2014:18:42:23 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [31/Dec/2014:02:02:58 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
54.144.254.10 - - [31/Dec/2014:06:46:08 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.205.17.56 - - [31/Dec/2014:06:46:41 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.163.187.152 - - [31/Dec/2014:07:00:51 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.205.17.56 - - [31/Dec/2014:07:15:31 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.159.213.52 - - [31/Dec/2014:07:30:23 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.147.218.217 - - [31/Dec/2014:07:34:34 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
54.204.247.27 - - [31/Dec/2014:07:38:14 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Googlebot (gocrawl v0.4)"
66.249.69.235 - - [31/Dec/2014:09:17:09 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.235 - - [01/Jan/2015:01:56:00 -0800] "GET /robots.txt HTTP/1.1" 404 1148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.219 - - [01/Jan/2015:01:56:00 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.235 - - [01/Jan/2015:02:06:01 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
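Not every request above that carries a Googlebot user agent comes from the same address block: the "Googlebot (gocrawl v0.4)" hits arrive from 54.x.x.x addresses rather than 66.249.x.x. The sketch below separates the two groups. It rests on an assumption that should be verified before relying on it: that 66.249.64.0/19 is a Google crawler range. An IP-range check is only a quick heuristic and does not replace a reverse DNS lookup. The function name `split_googlebot_claims` is made up for this example.

```python
import ipaddress

# Assumption: 66.249.64.0/19 is a range used by Google's crawlers.
# Confirm against Google's published information before trusting it.
ASSUMED_GOOGLEBOT_NET = ipaddress.ip_network("66.249.64.0/19")

def split_googlebot_claims(hits):
    """Split (ip, user_agent) pairs that claim to be Googlebot into
    addresses inside and outside the assumed Google range."""
    in_range, out_of_range = [], []
    for ip, user_agent in hits:
        if "googlebot" not in user_agent.lower():
            continue  # other crawlers (Slurp, Baiduspider) are ignored
        inside = ipaddress.ip_address(ip) in ASSUMED_GOOGLEBOT_NET
        (in_range if inside else out_of_range).append(ip)
    return in_range, out_of_range

# (ip, user agent) pairs taken from the appendix logs.
hits = [
    ("66.249.69.219", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
    ("54.144.254.10", "Googlebot (gocrawl v0.4)"),
    ("54.205.17.56", "Googlebot (gocrawl v0.4)"),
    ("68.180.229.34", "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"),
]
inside, outside = split_googlebot_claims(hits)
print("inside assumed range:", inside)
print("outside assumed range:", outside)
```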

Yahoo! Slurp

68.180.229.34 - - [30/Dec/2014:21:48:23 -0800] "GET / HTTP/1.1" 200 313 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

Baiduspider

180.76.5.27 - - [27/Dec/2014:05:24:33 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.5.155 - - [30/Dec/2014:23:28:41 -0800] "GET /test-page.html HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

Dan Petrovic, the managing director of DEJAN, is Australia’s best-known name in the field of search engine optimisation. Dan is a web author, innovator and a highly regarded search industry event speaker.
ORCID iD: https://orcid.org/0000-0002-6886-3211
