SEO Challenge

Dan Petrovic 20/12/2012

Challenge #2

Past SEO challenge hangouts:

Challenge #1

Watch Dan Petrovic & Martin Reed in the Dejan SEO Challenge Hangout. We’ll be testing the SEO knowledge of webmasters from around the world with more than 30 questions in three difficulty levels. The audience votes for their favourite participant in the comments below. Your votes will be taken into account when we decide who will go into our “Hall of Fame”.

Some Interesting Answers:

How much of the title does Google index regardless of how much it shows in results: Google indexes words, not characters. Google only treats the first twelve words as part of the title. Any words after the twelfth are treated as part of the page.

Does Disallow in robots.txt prevent a site being shown in Google: Noindex is the only directive that prevents Google from showing a page in search results.

Ways to check if your site has been hacked using Google Webmaster Tools: Check for malware, content keywords, ranking search queries and, of course, Messages.

Can Authorship appear for media other than webpages, e.g. PDFs: Yes, PDFs can have authorship markup. See Dan’s example in position 7

What happens if a page with the NOINDEX meta tag is blocked by robots.txt: The NOINDEX cannot be seen and will have no impact.

Name an example where Google explicitly accepts user agent based cloaking:

Mobile.
OS-specific instructions, e.g. redirecting users to the appropriate download page for software developed for their operating system.
Possibly A/B testing depending on user agent.
First click free, where Google is able to index the whole article whilst users are taken to a sample of that article.

Which Google spider can ignore robots.txt directives: The AdWords bot, but most bots other than Googlebot also ignore robots.txt.

Is the size of the text on the page a factor taken into consideration by Google: One of the first papers published by Google explaining how the search engine works stated that font size was taken into account. However, this was many years ago.

More recently, very small font is possibly a negative ranking factor.

Before canonicalisation, are the http:// and https:// versions of a site considered to be different resources: Yes, it’s possible to have two completely different websites, one on each version of the URL.

What happens if we disallow /robots.txt in robots.txt: Nothing! Google will cache the file, parse it, not request the file till the cache has expired, then repeat.

SEO Challenge with Dan Petrovic and Martin Reed, 7pm 30th August 2012, Google+ Hangout

Highlights from Google+ Discussion

Enrico Altavilla

I understand that the question about the purpose of a Disallow directive was surely badly worded. Instead of asking “What does a Disallow directive ask to the search engine?” I should have asked “What does a Disallow directive ask to the spider?”, since Disallow directives are addressed only to the “spidering part” of the search engine. Talking about a “search engine” was probably misleading (even if a spider is a part of a SE) and I’m sorry about the confusion.

That said, the answer to the more precise question can be found in the original document written by the person who invented the concept of robots.txt and the Disallow directive. The purpose of this directive has not changed since its creation (many years ago) and all popular search engines just follow the interpretation of “Disallow” written here: http://www.robotstxt.org/orig.html

There are more technical and detailed documents about the Robots Exclusion Standard but the bottom line is that the purpose of Disallow is to ask a spider or any automated software (e.g. Screaming Frog SEO spider, some download managers, some web site copiers, etc.) to not request/download one or more resources.
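As a rough illustration of what "asking a spider not to request a resource" looks like in practice, here is a minimal Python sketch using the standard library's robots.txt parser. The robots.txt content, the example.com URLs and the "ExampleSpider" user agent are made up for the example; a real spider would download the file from the site before checking any URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Parse the rules from text instead of fetching them over the network.
rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_request(url: str, user_agent: str = "ExampleSpider") -> bool:
    """Return True if a polite spider is allowed to request the URL."""
    return rules.can_fetch(user_agent, url)


print(may_request("https://example.com/public/page.html"))
print(may_request("https://example.com/private/page.html"))
```

The check happens before any request is made, which is the whole point of the directive: it governs downloading, and says nothing by itself about what appears in search results.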

The Robots Exclusion Standard is a bad beast and I’ve followed its evolution through time. If you want to discuss it further, you’ll always find in me an interested interlocutor. 🙂

I hope this clarification makes things a little clearer!

Tony McCreath

+Enrico Altavilla points out what a spider should do, and that emphasises that Google may not fully follow the rules. He is also clarifying that robots.txt only asks the spider to not re-crawl a page. Historical data can still be used.

What I find more interesting is the meta noindex concept and its conflict with follow. How can you follow if you don’t index?

Does Google still crawl and establish the link graph regardless of robots.txt directives and meta robots tags?

Why can Google still find links on a page that is blocked by robots.txt but can’t read the robots meta tag on the same page?

Bonus: I think Google recently stated that they will still crawl blocked pages if they contain a Google+ badge on them!

Enrico Altavilla

+Tony McCreath : exactly! That was what I was asking: what the Disallow directive asks for, not whether search engines honor it. 🙂 It was intended just as a technical question, not as a way to talk about the correctness of search engines (I’m more interested in the engineering aspect of what the software does).

I think I can answer your question about noindex and follow from a technical point of view, because I have studied some information retrieval and search engine database design. Please consider what follows just as a (common) way to develop a simple search engine database; I’m not referring to any actual implementation by a specific search engine.

The Main Index

First, some words about the main index. The index is not an archive containing the raw documents collected by the search engine but a special database structure that makes it possible to quickly find which documents are related to some words (usually the words typed by the user during a search).

When a search engine downloads an HTML resource and parses it, the search engine basically splits all the text into single words, then it inserts those single words into the index, which is a list of rows like this:

the word “dog” appears in documents number 56, 356, 676, 803 and 4762.
the word “ball” appears in documents number 94, 257, 356 and 700.
etc…

These rows are one of the most useful structures of the database, but the important thing to understand is that after the indexing phase there is no such thing as the entire original document, because the document has been split into its component words.

Many search engines also keep a copy of the original raw resource (for example the cache copies shown to users) but these resources are not used during a search: only the index is queried.
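The word list described above is usually called an inverted index. A minimal sketch in Python could look like the following; the document numbers and texts are invented for the example, and the tokenisation (lowercase letters and digits only) is a deliberate simplification rather than what any real search engine does.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words (a very naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_index(documents: dict[int, str]) -> dict[str, list[int]]:
    """Map each word to the sorted list of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}


def search(index: dict[str, list[int]], query: str) -> list[int]:
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical documents, keyed by document number.
documents = {
    56: "The dog barks",
    356: "A dog chases a ball",
    700: "The ball is red",
}
word_index = build_word_index(documents)
print(word_index["dog"])
print(search(word_index, "dog ball"))
```

Note that search() only ever touches word_index: once the rows are built, the original texts are not needed to answer a query.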

The Links

What happens when the search engine finds a link on a document? The search engine does something very similar to what it did with the word index: it populates a second list, containing only information about links:

the document 356 links to the document 700 using the text “xyz”
the document 4762 links to the document 56 using the text “online casino”
etc…

Again, the original document doesn’t exist anymore, here. When the search engine needs to know something about what documents link to other documents, it queries this second list.
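The second list can be sketched in the same spirit. The rows below reuse the two example links from the text; the Link type and the helper function names are assumptions made for the illustration.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One row of the link list: who links to whom, and with what text."""
    source: int
    target: int
    anchor_text: str


# The two example rows given above.
link_index = [
    Link(source=356, target=700, anchor_text="xyz"),
    Link(source=4762, target=56, anchor_text="online casino"),
]


def inbound_links(links: list[Link], target: int) -> list[Link]:
    """Return every row describing a link that points at the target."""
    return [link for link in links if link.target == target]


def anchor_texts_for(links: list[Link], target: int) -> list[str]:
    """Return the anchor texts other documents use for the target."""
    return [link.anchor_text for link in inbound_links(links, target)]


print(inbound_links(link_index, 56))
print(anchor_texts_for(link_index, 700))
```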

Two Different Structures

What I roughly described above is what actually happens to text documents when they are indexed. You have two different structures: the first one is a list of relations between words and documents, the second one is a list of the links found.

Let’s call them the “word index” and the “link index” (or “link graph”, if you prefer).

Noindex and Follow

How is it possible to “noindex but follow”? It is possible because text and links are stored in two completely different database structures.

“index” means: “please, add the words in this document to the word index”.
“noindex” means: “please, don’t add the words in this document to the word index”.

“follow” means: “please, add the links in this document to the link index.”
“nofollow” means: “please, don’t add the links in this document to the link index.”
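The four definitions above translate almost directly into code. The sketch below is a simplified illustration in Python, not the behaviour of any particular search engine: the function name, its parameters and the sample document are all hypothetical, and only the four directives listed above are handled.

```python
from collections import defaultdict


def index_document(doc_id, words, links, robots_meta, word_index, link_index):
    """Store a parsed document while honouring its robots meta directives.

    words is a list of words found in the document, links is a list of
    (target_doc_id, anchor_text) pairs, and robots_meta is the content of
    the robots meta tag, for example "noindex, follow".
    """
    directives = {d.strip().lower() for d in robots_meta.split(",")}

    # "noindex": leave the word index untouched.
    if "noindex" not in directives:
        for word in words:
            word_index[word].add(doc_id)

    # "nofollow": leave the link index untouched.
    if "nofollow" not in directives:
        for target, anchor_text in links:
            link_index.append((doc_id, target, anchor_text))


word_index = defaultdict(set)
link_index = []

# A hypothetical document marked "noindex, follow".
index_document(
    doc_id=356,
    words=["dog", "ball"],
    links=[(700, "xyz")],
    robots_meta="noindex, follow",
    word_index=word_index,
    link_index=link_index,
)
print(dict(word_index))
print(link_index)
```

Because the two branches write to two separate structures, each directive can be honoured on its own: here the words are skipped while the link is still recorded.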

Real Noindexing and Cosmetic Noindexing

When faced with a noindex or nofollow directive, some search engines could decide to index words and links but ignore those rows during a search. From a user’s point of view the result is the same, but in this way the noindex and nofollow directives lose their original meaning, because the search engine still holds that information somewhere, even if it is ignored.
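To make the difference concrete, here is a minimal sketch of the "cosmetic" variant in Python. It is purely an assumption about how such a design might look: the words of a noindexed document are stored like any others, and a separate set of hidden documents is applied only at query time. All names and sample data are hypothetical.

```python
from collections import defaultdict

word_index = defaultdict(set)   # word -> document numbers
hidden_docs = set()             # documents that asked for noindex


def index_document(doc_id, words, noindex=False):
    """Store the words of a document, even when it is marked noindex."""
    for word in words:
        word_index[word].add(doc_id)
    if noindex:
        # Cosmetic noindexing: remember the request, keep the data.
        hidden_docs.add(doc_id)


def search(word):
    """Return matching documents, filtering hidden ones at query time."""
    return sorted(word_index.get(word, set()) - hidden_docs)


index_document(56, ["dog"])
index_document(356, ["dog", "ball"], noindex=True)

print(search("dog"))
print(356 in word_index["dog"])
```

The user never sees document 356 in results, yet its words are still sitting in the index, which is exactly the situation described above.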

I hope that the explanation clarifies a bit what actually happens in the search engine software and database and how it is possible to “noindex but follow” when two completely different structures are used to save text and links. 🙂

For more information about indexing, this Wikipedia article is a good read: http://en.wikipedia.org/wiki/Search_engine_indexing

Now, I’ll try to reply to your other questions. You asked about Google but, again, these concepts apply to any search engine.

> Does Google still crawl and establish the link graph regardless of robots.txt directives and meta robots tags?

The robots.txt file talks to the spider while the meta robots tag talks to the indexer. This distinction is important in order to answer your question correctly.

Any search engine which honors the robots.txt file will not ask for a disallowed resource. If a disallowed resource contains links, those links will not be perceived by the search engine. So the robots.txt file is a good way to actually influence the link graph.

As for the robots meta tag and its nofollow directive, the links will be perceived, but there is no way to know whether a nofollowed link is actually left out of the link graph or inserted but ignored. For sure, the search engine needs to save the total count of links on the page somewhere in the database.

> Why can Google still find links on a page that is blocked by robots.txt but can’t read the robots meta tag on the same page?

A resource disallowed in the robots.txt file is not requested by any popular search engine and as a consequence its contents (html code, inside text, links, HTTP headers) will not be perceived.
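A small simulation may help here. The Python sketch below models a spider that honours robots.txt over a made-up two-page site held in memory; the paths, the HTML snippets and the "ExampleSpider" user agent are all hypothetical, and no real search engine's behaviour is implied.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical site: one open page and one page disallowed in robots.txt.
ROBOTS_TXT = "User-agent: *\nDisallow: /blocked.html\n"
PAGES = {
    "/open.html": '<meta name="robots" content="index, follow">'
                  '<a href="/blocked.html">blocked</a>',
    "/blocked.html": '<meta name="robots" content="noindex">'
                     '<a href="/secret.html">secret</a>',
}


class PageParser(HTMLParser):
    """Collect the robots meta tag and the link targets of a page."""

    def __init__(self):
        super().__init__()
        self.robots_meta = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            self.robots_meta = attributes.get("content")
        elif tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])


def crawl(paths, user_agent="ExampleSpider"):
    """Fetch only the allowed pages; return what the spider perceived."""
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())
    perceived = {}
    for path in paths:
        if not rules.can_fetch(user_agent, path):
            continue  # never requested, so nothing inside it is seen
        parser = PageParser()
        parser.feed(PAGES[path])
        perceived[path] = (parser.robots_meta, parser.links)
    return perceived


perceived = crawl(["/open.html", "/blocked.html"])
print(perceived)
```

In this model the spider learns that /blocked.html exists, because /open.html links to it, but it never sees the noindex tag or the link to /secret.html inside the blocked page.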

Can a search engine perceive a link in other ways, without downloading the resource? Theoretically there are some ways to (badly) achieve this goal, but I don’t know if any search engine actually uses them.

> Bonus: I think Google recently stated that they will still crawl blocked pages if they contain a Google+ badge on them!?

Yep!

Dan Petrovic

Dan Petrovic, the managing director of DEJAN, is Australia’s best-known name in the field of search engine optimisation. Dan is a web author, innovator and a highly regarded search industry event speaker.
ORCID iD: https://orcid.org/0000-0002-6886-3211

13 thoughts on “SEO Challenge”

Mateusz says:

30/08/2012 at 7:27 pm

It would be great if you placed a question slide on the screen while you speak, Dan (instead of the participants’ faces). Just to help remember the subject.
Mateusz says:

30/08/2012 at 7:45 pm

Thanks. Really helpful.
Karol Dziedzic says:

30/08/2012 at 8:36 pm

My vote goes for tray 😉 for different point of view
Tomasz Stopka says:

30/08/2012 at 9:17 pm

Vote for Tray 🙂
Dejan SEO says:

30/08/2012 at 9:19 pm

Thanks will remember for next time.
Andrea Pernici says:

30/08/2012 at 11:18 pm

I vote for Enrico Altavilla 🙂
William Rock says:

30/08/2012 at 11:22 pm

Sorry I could not make it… I have had major allergies.. All good info, I really like the discussion about the cloaking…. A/B testing such as Google Analytics Page Experiments would be another one that Google will allow.
William Rock says:

30/08/2012 at 11:59 pm

My vote is for Trey and Jim..
Guest says:

31/08/2012 at 12:11 am

Thanks William but I’m voting for Trey. If it was a secret ballot I’d get two votes, yours and mine.:)
Giorgio Volpe says:

31/08/2012 at 8:14 am

Very nice, thanks everyone.
Gary Steel says:

01/09/2012 at 2:48 am

Fantastic discussion to all. Enrico gets my vote
Dave Fowler says:

03/09/2012 at 6:05 am

I honestly can’t pick a winner, but I did just want to congratulate you guys on a great hangout idea. It’s great to hear a bunch of fellow SEO obsessives talking shop, and also kind of reassuring that there is disagreement on so much of this stuff – it reinforces why personal experimentation is so important. I really enjoyed this, and I’ll look forward to watching the next.
One suggestion to get more input from the participants – make all the questions multiple choice, and before getting any individual to give a detailed reply, have everyone on screen initially holding up A, B or C cards.
Pragya Sharma says:

04/09/2012 at 9:25 pm

Thanks Dan for the hangout. I like all the people.. I was wondering if you could have a post about all the questions included in the video..It’ll be a great help.
Good luck for the next hangout. 🙂