Robots.txt, Robots Meta Tag, X-Robots Tag: Shaken, Not Stirred

Link-Assistant.Com | Posted in category Uncategorized

We all strive to get our websites' content better exposure on the Internet; otherwise we wouldn't be so keen on SEO, and Link-Assistant.Com wouldn't have so many dedicated clients worldwide.

However, sometimes we'd rather hide certain web pages from being indexed and included in SERPs. For example, you would definitely prefer your clients to read the sales copy first rather than be routed to the shopping cart right away. Or there might be sensitive information meant for internal use only, which is not login-protected for users' convenience. Or you might want to keep duplicate content (such as print versions of HTML pages) out of the index.

The common way to control indexation is to write robots.txt files.

Still, this is not a one-size-fits-all solution. Sometimes robots.txt-protected content, or the robots.txt files themselves, end up public anyway.

Link-Assistant.Com once had a small problem with Google indexing content on our website that wasn't meant to be indexed, which made us turn to more sophisticated methods. :-)

In this post we would like to tell you about the different Robots Exclusion Protocol (REP) directives and the best ways of combining them to hide what you really want to hide.

Robots.txt

Robots.txt files contain instructions for search engine crawlers on accessing certain parts of a site. Most commonly, they are used to keep non-public content from being crawled and shown in search results. An example of a robots.txt file:

User-agent: Googlebot

Disallow: /

This means that the entire site is closed to the robot called “Googlebot”.

By the way, the major robots include Googlebot (Google), Slurp (Yahoo!), msnbot (MSN) and TEOMA (Ask). You can address all robots at once with an asterisk (*).

The next example shows how to prevent all robots from crawling a certain page (my_file.html):

User-agent: *

Disallow: /my_file.html

For more examples, have a deeper look at robots.txt.
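
If you want to double-check how a well-behaved robot would read your rules, Python's standard library ships a robots.txt parser. Below is a minimal sketch that feeds it the rules from the example above and asks whether a given URL may be fetched; the example.com URLs are, of course, placeholders:

# Minimal sketch: testing robots.txt rules with Python's standard library.
# The rules and URLs below are illustrative placeholders.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /my_file.html",
]

parser = RobotFileParser()
parser.parse(rules)

# The wildcard rule blocks /my_file.html for every robot...
print(parser.can_fetch("Googlebot", "http://www.example.com/my_file.html"))  # False
# ...but leaves the rest of the site open to crawling.
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))    # True

In real life you would point the parser at your live file with set_url("http://www.example.com/robots.txt") followed by read() instead of pasting the rules in by hand.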

Uncrawled URLs in SERPs

Many webmasters complain that Google often shows robots.txt-protected content in its search results. As surprising as it may sound, according to Matt Cutts, this doesn't mean Google violates robots.txt or actually crawls these pages.

When Googlebot is stopped by a robots.txt file, it really doesn't go to the page. But even though the robot doesn't crawl or index the page, it still knows the uncrawled URL, the anchor text of links pointing to it, and the websites those links come from. If there are enough backlinks with anchors matching the search query, the page can still find its way into the SERPs. However, as Googlebot hasn't crawled it, the result will usually have no snippet, and its title will be pieced together from the referring anchor texts.

Thus, the main limitation of robots.txt is that although it controls crawling, it doesn't guarantee a web page won't appear in search results.

Robots Meta tag

If you don't want a page to show up in search results, one of the best things to do is add a noindex robots meta tag to the page's <head> section. When a crawler sees a noindex directive, it will drop the page from search results completely.

<meta name="robots" content="noindex">

If you don't want a search engine to follow links from a particular page, include a nofollow tag:

<meta name="robots" content="nofollow">.

Other robots meta tag directives include archive/noarchive, snippet/nosnippet and unavailable_after.

The main limitation of robots meta tags is that, although they control indexing, caching and snippets, they work for HTML files only.
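
Before blaming a search engine for indexing a page you wanted hidden, it is worth verifying that the noindex directive is actually present in the HTML being served. Here is a rough sketch of such a check, again using only Python's standard library; the sample page is made up for illustration:

# Rough sketch: finding a robots meta tag in an HTML page.
# The sample page below is invented for illustration.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    # Collects the directives of any <meta name="robots" ...> tag.
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

html_page = """
<html><head>
<title>Internal page</title>
<meta name="robots" content="noindex, nofollow">
</head><body>Not meant for search results.</body></html>
"""

finder = RobotsMetaFinder()
finder.feed(html_page)
print("noindex" in finder.directives)  # True - the page asks to be dropped from the index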

X-Robots tag

X-Robots-Tags, aka REP header tags, are used to control indexing of files other than HTML: PDFs, Word documents, ZIP archives, etc. If you want to prevent search engines from displaying a document in search results, add an X-Robots-Tag directive to the HTTP header used to serve the file.

For example, a directive

X-Robots-Tag: noindex

means that the document is to be excluded from Google search results.
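
Because the directive lives in the HTTP response rather than in the file itself, the easiest way to verify it is to look at the headers the server sends back. Below is a hedged sketch using Python's standard library; the URL is a placeholder, and the header itself has to be added in your web server's configuration:

# Sketch: checking whether a file is served with an X-Robots-Tag header.
# The URL is a placeholder; replace it with a document on your own site.
from urllib.request import urlopen

with urlopen("https://www.example.com/report.pdf") as response:
    # .get() returns None when the header is absent
    tag = response.headers.get("X-Robots-Tag")

if tag and "noindex" in tag.lower():
    print("The document is excluded from search results by its HTTP header.")
else:
    print("No noindex directive found in the response headers.")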

Keeping track of Robots Exclusion Protocol

The updated Website Auditor, an SEO tool focused on website analysis and content optimization, lets you monitor Robots Exclusion Protocol directives. From now on, Website Auditor's on-page SEO report features a Robots Instructions column that shows whether a certain page of the website you analyze is allowed for indexing.

The bottom line

• Using the Robots Exclusion Protocol doesn't just serve the purpose of hiding confidential information. It can also be used for SEO purposes, namely keeping the semantic core compact and limiting the number of pages submitted for indexation. Every website has SEO-unimportant pages, such as an events calendar or the privacy policy linked in the footer, that a webmaster doesn't want to see in the search results and that distract crawlers from the website's key content. So simply write robots meta tags (for HTML pages) and X-Robots-Tags (for everything else) with a noindex directive to keep non-target pages out of the SERPs.

• In addition to robots.txt, robots meta tags and X-Robots-Tags, you can also request deletion from the index with the help of the Google URL removal tool.

• If you really need to hide confidential data, don't rely on the Robots Exclusion Protocol alone; use proper authorization instead. It appears that half of what Matt Cutts says is Google propaganda, so you never know what they've actually got in their index. :-)


