Indexation of site pages is where the search engine optimization process starts. Letting search engine bots access your content signals that your pages are ready for visitors and that you want them to show up in SERPs, so all-embracing indexation sounds like a huge benefit at first sight.
However, there are cases when you can get more value from keeping certain pages of your site out of the indexes. This post covers the main cases when it's more prudent to hide your content from search engines' attention.
And the first question is:
Which pages of a site shouldn't be indexed by Google?
There are a number of reasons you would want to hide your pages from search engines' crawlers. Among them are:
- Privacy and security concerns

Protecting content from direct search traffic is a must when a page holds personal information, confidential company details, information about alpha products, user profile data, private correspondence, or registration and credential requirements.
- Duplicate content issues
Hiding pages with duplicate content (for example, Adobe PDF or printer-friendly versions of a website's pages) is highly recommended to avoid duplicate content issues. It's also advisable for ecommerce sites to hide pages with identical descriptions of the same product that varies only in color, size, etc.
- Offering little or no value to a website visitor
- Pages under development
Pages that are in the process of development must be kept away from search engine crawlers until they are fully ready for visitors.
* * *
And now the question is: how do you hide all the above-mentioned pages from pesky spiders? Below are a couple of tried-and-true ways to restrict pages from indexing (there are a lot more, but let's focus on the easiest and most popular ones).
Two Simple Ways to Hide a Webpage from a Search Engine's View
1. Via robots.txt files.
Possibly the simplest and most direct way to restrict search engine crawlers from accessing your pages is to create a robots.txt file.
This is how it works:
Robots.txt files let you proactively keep all unwanted content out of the search results. With this file you can restrict access to a single page, a whole directory, or even a single image or file.
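For instance, here's a minimal sketch of a robots.txt covering all three cases (the paths are hypothetical, for illustration only):

```text
User-agent: *
Disallow: /private-page.html
Disallow: /internal/
Disallow: /images/draft-logo.png
```

Each 'Disallow:' line blocks one path: a single page, an entire directory (note the trailing slash), or one specific file.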
Creating a robots.txt file
The procedure is pretty easy. You just create a .txt file that has the following fields:
- 'User-agent:' – in this line you identify the crawler in question;
- 'Disallow:' – one or more lines that instruct the specified crawlers not to access certain parts of a site.
Also note that some crawlers (particularly Google) also support an additional field called 'Allow:'. As the name implies, 'Allow:' lets you explicitly dictate what files/folders can be crawled.
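As a sketch of how 'Allow:' works (the folder and file names here are hypothetical), you could block a whole directory but still let a crawler reach one file inside it:

```text
User-agent: Googlebot
Disallow: /downloads/
Allow: /downloads/public-brochure.pdf
```

For Google, the more specific rule wins, so everything in /downloads/ stays off-limits except the one explicitly allowed file.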
Here are some basic examples of robots.txt files explained.
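The first example (reconstructed here as a sketch) blocks all bots from the whole site:

```text
User-agent: *
Disallow: /
```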
"*" in the 'User-agent' line means that all search engines bots are instructed not to crawl any of your site pages, which is indicated by "/". Most likely, that’s what you'd rather prefer to avoid, but now you get the idea.
This file restricts Google's image bot from crawling your images in the selected directory, while all other crawlers remain unaffected.
You can find more instructions on how to write such files manually here.
But the process of creating robots.txt can be fully automated – there is a wide range of tools capable of creating and uploading such files to your site. For example, Website Auditor can easily compile a robots.txt file and instantly upload it to your site.
If creating robots.txt sounds like a chore to you, you can make it sheer fun! Check this article – it describes funny and interesting cases connected with the use of such files on some sites.
And remember that despite the use of such terms as 'allow' and 'disallow', the protocol is purely advisory. Robots.txt isn't a lock on your site's pages; it's more like a "Private – keep out" sign.
Robots.txt can prevent "law-abiding" bots (e.g. Google, Yahoo!, and Bing bots) from crawling your content. However, malicious bots simply ignore it and go through your content anyway, so there's a risk that your private data may be scraped, compiled, and reused under the guise of fair use. Note also that robots.txt blocks crawling, not indexing: a blocked URL can still appear in search results if other sites link to it. If you want to keep your content 100% safe and secure, you should introduce more secure measures (e.g. introducing registration on the site, hiding content behind a password, etc.).
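As a minimal sketch of one such measure – assuming an Apache server and a password file you've already created with the htpasswd utility (the file path is hypothetical) – HTTP basic authentication can be set up in an .htaccess file:

```apacheconf
# .htaccess placed in the directory you want to protect
AuthType Basic
AuthName "Restricted Area"
# hypothetical path to a password file created with: htpasswd -c /home/user/.htpasswd username
AuthUserFile /home/user/.htpasswd
Require valid-user
```

Unlike robots.txt, this stops every visitor – human or bot – that doesn't have valid credentials.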
2. Via a robots noindex meta tag.
Using a robots noindex meta tag to prevent search engine bots from indexing particular pages is both effective and easy. The process of creating such tags requires only a tiny bit of technical know-how and can be easily done even by a junior SEO.
This is how it works:
When Google's bot fetches a page and sees a noindex meta tag, it doesn't include this page in the web index.
Examples of robots meta tags:
<meta name="robots" content="index, follow">
Adding this meta tag to the HTML source of your page tells a search engine bot to index the page and to follow all the links placed on it.
<meta name="robots" content="index, nofollow">
By changing 'follow' to 'nofollow' you influence the behavior of a search engine bot. This tag instructs a search engine to index the page but not to follow any of the links placed on it.
<meta name="robots" content="noindex, follow">
This meta tag tells a search engine bot to ignore the page it’s placed on, but to follow all links placed on it.
<meta name="robots" content="noindex, nofollow">
This tag placed on a page means that neither the page nor the links it contains will be indexed or followed.
Where to add robots meta tags?
Robots meta tags go in the head section of a page's HTML, and each tag applies only to the page it's placed on. So you need to add the tags to every page you want to hide from indexing – just make sure the relevant meta tags are added to each one.
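As a sketch (the page title and content are placeholders), here is where a noindex tag sits in a page's HTML:

```html
<!DOCTYPE html>
<html>
<head>
  <title>Internal draft page</title>
  <!-- keeps this page out of the index but lets bots follow its links -->
  <meta name="robots" content="noindex, follow">
</head>
<body>
  ...
</body>
</html>
```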
Robots.txt files or noindex meta tags?
A noindex tag is generally considered a more secure way to prevent pages from being indexed. However, it is harder to manage because it's applied on a page-by-page basis.
Using robots.txt files is an easier way to manage all non-indexed pages, as all the information is stored in one file.
Now you know the basics of how to find and hide certain pages of your site from the attention of search engines' bots.
But while pages that contain private info or are designed for your company's internal needs are easy to find, looking for pages with duplicate content can be quite a challenge. Stay tuned for the second part of this article to learn how to deal with duplicate content issues.