Past SEO challenge hangouts:
Watch Dan Petrovic & Martin Reed in the Dejan SEO Challenge Hangout. We’ll be testing the SEO knowledge of webmasters from around the world with more than 30 questions in three difficulty levels. The audience votes for their favourite participant in the comments below. Your votes will be taken into account when we decide who will go into our “Hall of Fame”.
Some Interesting Answers:
How much of the title does Google index regardless of how much it shows in results?: Google indexes words, not characters. Google only treats the first twelve words as part of the title. Any words after the twelfth are treated as part of the page.
Does Disallow in robots.txt prevent a site being shown in Google: Noindex is the only directive that prevents Google from showing a page in search results.
Ways to check if your site has been hacked using Google Webmaster tools: Check for malware, content keywords, ranking search queries & of course Messages.
Can Authorship Appear for media other than webpages, e.g. PDFs: Yes, PDFs can have authorship markup. See Dan’s example in position 7.
What happens if a page with the NOINDEX meta tag is blocked by robots.txt: The page is never crawled, so the NOINDEX cannot be seen and will have no impact.
Name an example where Google explicitly accepts user agent based cloaking:
- OS specific instructions, e.g. redirecting users to the appropriate download page for software developed for their operating system.
- Possibly A/B testing depending on user agent.
- First click free, where Google is able to index the whole article whilst users are taken to a sample of that article.
Which Google spider can ignore robots.txt directives: The AdWords bot. Most bots other than Googlebot also ignore robots.txt.
Is the size of the text on the page a factor taken into consideration by Google: One of the first papers published by Google explaining how the search engine works stated that font size was taken into account as part of results; however, this was many years ago.
More recently, very small font is possibly a negative ranking factor.
Before canonicalisation are http:// and https:// versions of a site considered to be a different resource: Yes, it’s possible to have two completely different websites on each version of the URL.
What happens if we disallow /robots.txt in robots.txt: Nothing! Google will cache the file, parse it, not request the file till the cache has expired, then repeat.
SEO Challenge with Dan Petrovic and Martin Reed, 7pm 30th August 2012, Google+ Hangout
Highlights from Google+ Discussion
There are more technical and detailed documents about the Robots Exclusion Standard but the bottom line is that the purpose of Disallow is to ask a spider or any automated software (e.g. Screaming Frog SEO spider, some download managers, some web site copiers, etc.) to not request/download one or more resources.
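To make the point concrete, here is a minimal sketch of how a well-behaved spider might consult robots.txt before requesting a resource. It uses Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` user agent, the example.com URLs and the `may_request` helper are all assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_request(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to request the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite spider asks this question before every request.
print(may_request(ROBOTS_TXT, "ExampleBot", "https://example.com/public/page.html"))
print(may_request(ROBOTS_TXT, "ExampleBot", "https://example.com/private/page.html"))
```

Note that the check only decides whether a request is made. It says nothing about what a search engine later shows in its results.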
The Robots Exclusion Standard is a bad beast and I’ve followed its evolution through time. If you want to discuss it further, you’ll always find in me an interested interlocutor. 🙂
I hope that the clarification will make things a little more clear!
Does Google still crawl and establish the link graph regardless of robots.txt directives and meta robots tags?
Why can Google still find links on a page that is blocked by robots.txt but can’t read the robots meta tag on the same page?
Bonus: I think Google recently stated that they will still crawl blocked pages if they contain a Google+ badge on them!
+Tony McCreath : exactly! That was what I was asking for: what the Disallow directive asks, not whether search engines honor it. 🙂 It was intended just as a technical question, not as a way to talk about search engines’ correctness (I’m more interested in the engineering aspect of what a piece of software does).
I think I can answer your question about noindex and follow, from a technical point of view, because I studied some information retrieval and search engine database design. Please consider what follows just as a (common) way to develop a simple search engine database; I’m not referring to any actual implementation by a specific search engine.
The Main Index
First, some words about the main index. The index is not an archive containing the raw documents collected by the search engine but a special database structure that makes it possible to quickly find which documents are related to some words (usually the words typed by the user during a search).
When a search engine downloads an HTML resource and parses it, the search engine basically splits all the text into single words, then it inserts those single words into the index, which is a list of rows like this:
the word “dog” appears in documents number 56, 356, 676, 803 and 4762.
the word “ball” appears in documents number 94, 257, 356 and 700.
These rows are one of the most useful structures of the database, but the important thing to understand is that after the indexing phase, there isn’t such a thing as the entire original document, because the document has been split into its component words.
Many search engines also keep a copy of the original raw resource (for example the cache copies shown to users) but these resources are not used during a search: only the index is queried.
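The word index described above is usually called an inverted index. The following is a rough sketch of one, not the implementation of any real search engine; the document numbers and texts are invented, and `tokenize`, `build_word_index` and `search` are hypothetical helper names.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map each word to the set of document numbers that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return the documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented sample documents, keyed by document number.
documents = {
    56: "The dog sleeps",
    94: "A red ball",
    356: "The dog chases the ball",
    700: "Ball games",
}
word_index = build_word_index(documents)

print(sorted(word_index["dog"]))       # [56, 356]
print(sorted(word_index["ball"]))      # [94, 356, 700]
print(search(word_index, "dog ball"))  # {356}
```

The search only touches `word_index`; the original texts in `documents` are never consulted once the index has been built.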
What happens when the search engine finds a link on a document? The search engine does something very similar to what it did with the word index: it populates a second list, containing only information about links:
the document 356 links to the document 700 using the text “xyz”
the document 4762 links to the document 56 using the text “online casino”
Again, the original document doesn’t exist anymore, here. When the search engine needs to know something about what documents link to other documents, it queries this second list.
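The second list can be sketched in the same spirit. This is only an illustration of the idea, assuming a flat list of rows; the `Link` type and the `inbound_links` helper are hypothetical names, and the rows are the two examples given above.

```python
from typing import NamedTuple


class Link(NamedTuple):
    source: int        # document number that contains the link
    target: int        # document number the link points to
    anchor_text: str   # text used for the link


# The two example rows from the discussion.
link_index = [
    Link(source=356, target=700, anchor_text="xyz"),
    Link(source=4762, target=56, anchor_text="online casino"),
]


def inbound_links(links: list[Link], target: int) -> list[Link]:
    """Return every row whose link points at the given document."""
    return [link for link in links if link.target == target]


print(inbound_links(link_index, 56))
```

A question such as "which documents link to document 56, and with what text?" is answered from this list alone, without going back to any page.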
Two Different Structures
What I roughly described above is what actually happens to text documents when they are indexed. You have two different structures: the first one is a list of relations between words and documents, the second one is a list of the links found.
Let’s call them “word index” and “link index” (or “link graph”, if you prefer).
Noindex and Follow
How is it possible to “noindex but follow”? It is possible because text and links are stored in two completely different database structures.
“index” means: “please, add the words in this document to the word index”.
“noindex” means: “please, don’t add the words in this document to the word index”.
“follow” means: “please, add the links in this document to the link index.”
“nofollow” means: “please, don’t add the links in this document to the link index.”
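The four meanings above can be sketched as an indexer that writes to the two structures independently. This is a simplified model under the same assumptions as the explanation, not a description of any real search engine; `parse_robots_meta` and `process_document` are hypothetical names, and the document data is invented.

```python
from collections import defaultdict


def parse_robots_meta(content: str) -> tuple[bool, bool]:
    """Return (index_words, follow_links) for a robots meta content string."""
    directives = {d.strip().lower() for d in content.split(",")}
    index_words = "noindex" not in directives
    follow_links = "nofollow" not in directives
    return index_words, follow_links


def process_document(doc_id, words, links, robots_meta, word_index, link_index):
    """Add a document's words and links to the two structures, as permitted."""
    index_words, follow_links = parse_robots_meta(robots_meta)
    if index_words:
        for word in words:
            word_index[word].add(doc_id)
    if follow_links:
        for target, anchor_text in links:
            link_index.append((doc_id, target, anchor_text))


word_index = defaultdict(set)
link_index = []

# Document 356 asks for "noindex, follow".
process_document(356, ["dog", "ball"], [(700, "xyz")], "noindex, follow",
                 word_index, link_index)

print(dict(word_index))  # {}  (no words were stored)
print(link_index)        # [(356, 700, 'xyz')]
```

Because the two writes are separate steps, skipping one has no effect on the other.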
Real Noindexing and Cosmetic Noindexing
When faced with a noindex or nofollow directive some search engines could decide to index words and links but to ignore those rows during a search. From a user’s point of view the result is the same, but in this way the noindex and nofollow directives lose their original meaning, because the search engine still holds that information somewhere, even if it is ignored.
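A small sketch of the "cosmetic" variant, assuming the engine keeps a separate set of documents to hide at query time. The names `hidden_docs` and `search_cosmetic` and the sample rows are made up for illustration.

```python
# Word index in which document 356 was indexed despite its noindex directive.
word_index = {
    "dog": {56, 356},
    "ball": {94, 356, 700},
}

# Cosmetic noindexing: remember which documents must not be shown.
hidden_docs = {356}


def search_cosmetic(word: str) -> set[int]:
    """Look up a word, then filter out hidden documents at query time."""
    return word_index.get(word, set()) - hidden_docs


print(search_cosmetic("dog"))    # {56}
print(356 in word_index["dog"])  # True: the data is still stored
```

With real noindexing, document 356 would never have been added to `word_index` in the first place, so there would be nothing to filter.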
I hope that the explanation clarifies a bit what actually happens in the search engine software and database and how it is possible to “noindex but follow” when two completely different structures are used to save text and links. 🙂
For more information about indexing, this Wikipedia article is a good read: http://en.wikipedia.org/wiki/Search_engine_indexing
Now, I’ll try to reply to your other questions. You asked about Google but, again, these concepts apply to any search engine.
> Does Google still crawl and establish the link graph regardless of robots.txt directives and meta robots tags?
The robots.txt file talks to the spider while the meta robots tag talks to the indexer. This distinction is important in order to answer your question correctly.
Any search engine which honors the robots.txt file will not ask for a disallowed resource. If a disallowed resource contains links, those links will not be perceived by the search engine. So the robots.txt file is a good way to actually influence the link graph.
About the robots meta tag and its nofollow directive: the links will be perceived, but from the outside there is no way to tell whether a nofollowed link is actually left out of the link graph or inserted but ignored. For sure, the search engine needs to store the total count of links in the page somewhere in the database.
> Why can Google still find links on a page that is blocked by robots.txt but can’t read the robots meta tag on the same page?
A resource disallowed in the robots.txt file is not requested by any popular search engine and as a consequence its contents (html code, inside text, links, HTTP headers) will not be perceived.
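The order of operations is what matters here, and it can be sketched as follows. This is a toy spider under stated assumptions: the robots.txt content, the example.com pages, the `ExampleBot` user agent and the `fake_fetch` and `read_robots_meta` helpers are all invented, and the network is replaced by a dictionary so the example runs offline.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and pages, used only for illustration.
ROBOTS_TXT = "User-agent: *\nDisallow: /blocked/\n"
PAGES = {
    "https://example.com/open/a.html": '<meta name="robots" content="noindex, follow">',
    "https://example.com/blocked/b.html": '<meta name="robots" content="noindex, follow">',
}

requested = []  # records every URL the spider actually asked for


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request."""
    requested.append(url)
    return PAGES[url]


def read_robots_meta(url: str, user_agent: str = "ExampleBot"):
    """Return the robots meta content, or None if the page may not be requested."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    if not parser.can_fetch(user_agent, url):
        return None  # never requested: HTML, meta tags, links and headers stay unseen
    html = fake_fetch(url)
    match = re.search(r'<meta name="robots" content="([^"]*)"', html)
    return match.group(1) if match else ""


open_meta = read_robots_meta("https://example.com/open/a.html")
blocked_meta = read_robots_meta("https://example.com/blocked/b.html")

print(open_meta)     # noindex, follow
print(blocked_meta)  # None
print(requested)     # ['https://example.com/open/a.html']
```

Both pages carry the same noindex tag, but only the tag on the page that may be requested is ever read.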
Can a search engine perceive a link in other ways, without downloading the resource? Theoretically there are some ways to (badly) achieve this goal, but I don’t know if any search engine actually uses them.
> Bonus: I think Google recently stated that they will still crawl blocked pages if they contain a Google+ badge on them!?