Let’s take a look at some of the confirmed ranking factors and find out the reasons behind Google’s social focus.
Part 2 of the Search Quality Series by Dejan SEO
Our first search quality article discusses the link graph theory, co-citation, social graph, knowledge graph and top-level spam detection techniques. This time we go slightly further into specific search ranking factors and the key events which led to Google’s change of policy towards 3rd party providers and the development of its own social platform.
Everything we discuss and quote in this article is detailed in the following paper: Indexing The World Wide Web: The Journey So Far by Abhishek Das and Ankit Jain, Next Generation Search Engines: Advanced Models for Information Retrieval, 2011. The paper covers much of Google’s fascinating crawling and indexing technology in great detail; in this article, however, we cover only the items of interest to the SEO and internet marketing community.
On-Page Ranking Signals
Let’s see what Google engineers have to say about on-site ranking factors:
“First note that page structures, such as titles and headings, and url depth play a major role. Next we see that most terms occur close to each other in the results, highlighting the need for term positions or phrases during indexing. Also important, though not clear from the figure, is the respective position of terms on pages; users prefer pages that contain terms higher up in the page.
Other than these, search engines also learn from patterns across the web and analyze pages for undesirable properties, such as presence of offensive terms, lots of outgoing links, or even bad sentence or page-structures.
The diversity and size of the web also enables systems to determine statistical features such as the average length of a good sentence, ratio of number of outgoing links to number of words on page, ratio of visible keywords to those not visible (meta tags or alt text), etc.”
Here’s a quick overview of the key ranking factors outlined in the paragraph above.
- Titles and headings
- URL depth
- Term proximity in a phrase
- Term position on a page
- Order/sequence of phrases
- Offensive language
- Many outgoing links (low quality?)
- Bad grammar / spelling / style
- Bad page structure
- Average length of a sentence
- Number of outgoing links VS page word count
- Visible VS meta content ratio
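To make a few of these statistical features concrete, here is a minimal Python sketch of how average sentence length, the outgoing-links-to-words ratio, term position and term proximity could be measured on a plain-text page. The function names (`tokenize`, `page_features`, `min_term_distance`) and the simple regex-based splitting are our own assumptions for illustration; the paper does not describe how any search engine actually computes or weights these values.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def page_features(text, outgoing_links, query_terms):
    """Compute a few illustrative on-page statistics for a plain-text page.

    `outgoing_links` is the number of outgoing links, counted elsewhere.
    """
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = tokenize(text)
    total = len(words)

    # Relative position of each query term's first occurrence
    # (0.0 means the very top of the page, None means not found).
    first_position = {}
    for term in query_terms:
        t = term.lower()
        first_position[term] = words.index(t) / total if t in words else None

    return {
        "avg_sentence_length": total / len(sentences) if sentences else 0.0,
        "links_per_word": outgoing_links / total if total else 0.0,
        "first_position": first_position,
    }


def min_term_distance(words, a, b):
    """Smallest gap, in words, between any occurrence of term a and term b."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


if __name__ == "__main__":
    sample = "Cheap flights to Sydney. Book cheap flights today!"
    print(page_features(sample, outgoing_links=2, query_terms=["sydney"]))
    print(min_term_distance(tokenize(sample), "cheap", "sydney"))
```

Values like these only become meaningful when compared against averages across a very large number of pages, which is the point the paper makes about the diversity and size of the web.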
Off-Page Ranking Signals
Here are the two key sections of the paper which talk about off-page ranking signals:
“…off-page signals have increasingly proved to be the difference between a good search engine and a not-so-good one. They allow search engines to determine what other pages say about a given page (anchor text) and whether the linking page itself is reputable (PageRank or HITS). […]
The final ranking is thus a blend of static a priori ordering that indicates if a page is relevant to queries in general, and a dynamic score which represents the probability that a page is relevant to the current query. […]
HITS, as described earlier, scores pages as both hubs and authorities, where a good hub is one that links to many good authorities, and a good authority is one that is linked from many good hubs. Essentially, hubs are useful information aggregations and provide broad categorization of topics, while authorities provide detailed information on a narrower facet of a topic. […]
However, instead of pre-computing the hub and authority scores at indexing time, each page is assigned a query specific score at serving time.”
Off-page signals such as PageRank, HITS (hubs and authorities) and anchor text act as a layer on top of on-page query matching. Under HITS, Google trusts pages which link only to authoritative sites as hubs, and ranks sites that are linked from those trusted hubs as authorities. HITS-related scores are assigned at query time and for a particular search segment, unlike PageRank, which is pre-calculated and simply applied to enhance the sorting and scoring of results.
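The mutual definition of hubs and authorities in the quote above can be sketched in a few lines of Python. This is the classic iterative form of HITS run on a tiny hand-made link graph, not the query-time variant the paper attributes to search engines; the function name `hits`, the page names and the fixed iteration count are our own assumptions.

```python
import math


def hits(links, iterations=50):
    """Classic HITS on a dict mapping each page to the pages it links to.

    Returns (hub, authority) score dicts, each normalised to unit length.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # A good authority is linked from many good hubs.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                new_auth[t] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # A good hub links to many good authorities.
        new_hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}

    return hub, auth


if __name__ == "__main__":
    # Hypothetical graph: two directory-style pages and one blog
    # linking to two content sites.
    graph = {
        "directory1": ["siteA", "siteB"],
        "directory2": ["siteA", "siteB"],
        "blog": ["siteA"],
    }
    hub, auth = hits(graph)
    print(sorted(auth.items(), key=lambda kv: -kv[1]))
    print(sorted(hub.items(), key=lambda kv: -kv[1]))
```

In a query-time setting the same computation would run only over the subset of pages matched for the current query, which is why the resulting scores are query specific.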
Anecdote: “…webmasters cannot control which sites link to their sites, but they can control which sites they link out to. For this reason, links into a site cannot harm the site…”
Future Direction: Real Time Data & Search
In this section we highlight key areas of the paper which talk about Twitter and Facebook, and which hint at the reasons behind the creation of Google’s own social search layer and social media platform.
“In 140 characters, users can describe where they are, publicize a link to an article they like, or share a fleeting thought. From a search engine perspective, this information is extremely valuable, but as different projects have shown over the last few years, this information has extra importance if it is mined and presented to search engine users in real time.”
The social graph is not a big deal for Google, as it already has the algorithms and infrastructure to deal with linked data. People are, in fact, well-linked and this is something search engines look at. At the time, access to the firehose was taken for granted, and appreciated:
“In order to build a real time system, the first prerequisite is access to the raw micropost data. Luckily for the community, Twitter has been extremely good about providing this at affordable prices to anyone who has requested it.”
Social Ranking Signals
- Create a Social Graph: UserRank, UserTopicRank
- Extract and Index the Links
- Real-Time Related Topic Handling
- Sentiment Analysis (theoretical only)
The social graph on Twitter, for example, is understood through analysis of users’ connections (following, followers) and the topics they are interested in (what they tweet about, what the people they follow tweet about). The number of followers is another useful and measurable metric Google uses to determine UserRank and even to detect topic leaders, users who are influential within a certain niche (UserTopicRank).
Links within tweets are crawled and indexed. Standard link metrics such as anchor text and PageRank are replaced by the link context and UserRank respectively.
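The article does not give a formula for UserRank, so the following is only a sketch under a loud assumption: that UserRank behaves like PageRank run over the follow graph, with each "follows" relationship treated as a vote for the followed user. The names `user_rank` and `score_links`, the damping value and the sample users are hypothetical.

```python
def user_rank(follows, damping=0.85, iterations=50):
    """PageRank-style score over a follow graph (assumed model of UserRank).

    `follows` maps each user to the list of users they follow.
    """
    users = set(follows) | {u for fs in follows.values() for u in fs}
    n = len(users)
    rank = {u: 1.0 / n for u in users}

    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            followees = follows.get(u, [])
            if followees:
                # Each user splits their vote among the users they follow.
                share = damping * rank[u] / len(followees)
                for f in followees:
                    new_rank[f] += share
            else:
                # Users who follow nobody spread their vote evenly.
                share = damping * rank[u] / n
                for v in users:
                    new_rank[v] += share
        rank = new_rank
    return rank


def score_links(tweets, rank):
    """Score each tweeted URL by the summed rank of the users who shared it."""
    scores = {}
    for user, url in tweets:
        scores[url] = scores.get(url, 0.0) + rank.get(user, 0.0)
    return scores


if __name__ == "__main__":
    follows = {"ann": ["dan"], "bob": ["dan"], "cat": ["dan", "ann"], "dan": []}
    rank = user_rank(follows)
    tweets = [("dan", "example.com/a"), ("bob", "example.com/b")]
    print(score_links(tweets, rank))
```

A UserTopicRank could be approximated the same way by running the calculation only over users who tweet about a given topic, though that is again our assumption rather than something the paper spells out here.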
Finally, classic web index domain quality parameters can be added to the social layer to enhance the results. This is a great example of how different index types at Google work well together to produce a seamless search experience.
Thanks to real-time data, Google can define and cluster fresh topics and serve the right results at the right time (e.g. election, football game) and default back to ‘stale’ results when the popularity peak diminishes to usual levels.
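One simple way to picture this switch between fresh and ‘stale’ results is a spike detector over mention counts. The sketch below is purely illustrative: the function name `is_trending`, the hourly counts, the six-interval baseline and the 3x threshold are all our own assumptions, not values from the paper.

```python
from statistics import mean


def is_trending(counts, baseline_window=6, threshold=3.0):
    """Return True if the latest count is a spike over the recent baseline.

    `counts` is a list of mention counts per time interval, oldest first.
    """
    if len(counts) <= baseline_window:
        return False  # not enough history to form a baseline
    baseline = mean(counts[-baseline_window - 1:-1])
    # Floor the baseline at 1 so very quiet topics do not trigger on noise.
    return counts[-1] > threshold * max(baseline, 1.0)


if __name__ == "__main__":
    hourly_mentions = [10, 12, 9, 11, 10, 12, 95]  # hypothetical data
    if is_trending(hourly_mentions):
        print("serve real-time results")
    else:
        print("serve regular results")
```

Once the counts fall back towards the baseline, the function returns False again and the regular index results take over.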
Facebook & Bing Social Layer
The following few paragraphs truly reflect some of the key events which compelled Google into abolishing partnerships and alliances with third party providers and trusting only its own properties:
“Over the last few years, Facebook has become the leader in social networking with over 500M users [Zuckerberg, 2010]. Facebook users post a wealth of information on the network that can be used to define their online personality. Through static information such as book and movie interests, and dynamic information such as user locations (Facebook Places), status updates and wall posts, a system can learn user preferences.
Another feature of significant value is the social circle of a Facebook user, e.g. posts of a user’s friends, and of the friends’ friends. From a search engine’s perspective, learning a user’s social interactions can greatly help in personalizing the results for him or her.
Facebook has done two things that are impacting the world of search. First, in September 2009, they opened up the data to any third party service as long as their user authenticate themselves using Facebook Connect [Zuckerberg, 2008]. Second, as of September 2010, Facebook has started returning web search results based on the recommendations of those friends who are within two degrees of the user […]
In late 2009, Cuil launched a product called Facebook Results [Talbot, 2009], whereby they indexed an authenticated user’s, as well as his or her friends’, wall posts, comments and interests. Noting that the average user only has 300-400 friends with similar preferences and outlooks on the world, one of the first discoveries that Cuil made was the fact that this data was extremely sparse.
This implied that there were very few queries for which they could find useful social results. They overcame this limitation by extracting related query terms, which allowed for additional social results.”
To Google it was obvious that it lacked the platform and technology to collect the wealth of data in the ever-growing social media sphere. A few attempts and negotiations were made in order to capture a slice of the prolific social user content pipeline; Google knew what it needed and what it had to do:
- Facebook opened data via authentication
- Web search results based on user recommendations
- Static Information (Interests)
- Dynamic Information (location, status/wall)
- Learning User Preferences
- Social Circles
- Personalisation of Results
Note: Personalised results drawn from a small social circle are too shallow; extending them with related queries is a useful way to enrich the choice of results.
The Last Straw
“In the future, web search engines can use such a signal to determine authority of social data. In October 2010, Bing and Facebook announced the Bing Social Layer.”
A series of events took place during this time, including strategic acquisitions, a change in management, a shift in policy and attitude towards 3rd party providers, a product trim and a focus on getting what Facebook and Twitter have. Academic research on social networks, user experience expertise and already developed infrastructure were adapted to handle big-data processing and algorithmic treatment of linked data.
Google announces Search plus Your World (SPYW), and Google+ is launched. Top-level management decides Google is now one unified product and utilises its enormous search market share, brand, product range and influence to propel the success of the new social platform. The rapid deployment policy continues and Google’s products, including its social media platform, continue to evolve and change, while authorship becomes more significant and easier to implement. Attempts at public criticism are largely ignored or downplayed.
The search giant has moved on and even though Twitter and Facebook are being looked at and factored in, they will never again be in Google’s circle of trust. Ever?
Das, A., Jain, A. Indexing The World Wide Web: The Journey So Far. In Next Generation Search Engines: Advanced Models for Information Retrieval, 2011.