SEO is all about making your web pages more accessible to the search engines, but there are times when you want to keep the search engines at bay. Perhaps you don’t want to broadcast a poor company report, or maybe you don’t want Google’s spiders to find duplicate content or pages under construction. This is where robots.txt comes into play.
There are two ways to block robots from indexing designated pages or sections of your site: the robots meta tag and the robots.txt file. Used together, they make it relatively simple to specify complex access policies.
Robots.txt lets you control access from the server at several levels: the entire site, individual directories, pages of a specific type, or even individual pages. The robots meta tag gives more flexible control on a page-by-page basis without requiring access to the server.
It’s worth noting that robots.txt is a request, not a barrier. Well-behaved search engine robots will obey it, but it doesn’t necessarily stop other robots from visiting your pages, and a blocked URL can still appear in search results if other sites link to it.
How to use robots meta tags
The robots meta tag applies to individual HTML web pages and is particularly useful for anyone who has permission to edit files, but not site-wide control. The tag’s content attribute takes two main values, which can be used to:
- Prevent search engines from adding the page to their index (the ‘noindex’ value)
- Prevent search engines from following links on the page (the ‘nofollow’ value)
The robots meta tag should be placed between the <head></head> tags on your page, as follows:
<META name="ROBOTS" content="NOINDEX,NOFOLLOW">
The example above uses both the ‘noindex’ and ‘nofollow’ values. However, each can be used on its own: content="NOFOLLOW" lets bots index the page but not follow its links, while content="NOINDEX" does the reverse.
How to use robots.txt
Effective use of robots.txt can give you a great deal of control over how your site is crawled. A robots.txt file is placed in the root folder of your domain (for example: www.thwebproject.com/robots.txt), which means it is the first thing that any visiting spiders ‘see’. This allows you to give the robots instructions about which files or folders to ignore when indexing.
To create a robots.txt file, use a plain-text editor such as Notepad and save the file as robots.txt (with a .txt extension, not .htm or .html). A robots.txt file consists of two elements: the first specifies which user-agent (bot) you are referring to, and the second lists which files or folders are disallowed. Using an * in the user-agent field refers to all search engine robots.
For example, the following file would allow all bots access to index all pages on your site:
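A minimal sketch of such a file, assuming the standard two-field syntax, pairs a wildcard user-agent with an empty Disallow field:

```text
# Applies to every robot
User-agent: *
# An empty Disallow value blocks nothing
Disallow:
```

Lines beginning with # are comments and are ignored by robots.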
Note that it’s extremely easy to block the entire site by simply adding a forward slash to the disallow field:
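Assuming the same wildcard user-agent, a site-wide block would look something like this:

```text
User-agent: *
# "/" matches every URL path, so the whole site is off limits
Disallow: /
```

The only difference from a fully open file is that single slash, which is why this mistake is so easy to make.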
The following example presents a more complicated, and realistic, scenario. Here the robots.txt file instructs all bots not to index anything stored in your /cgi-bin/ or /category/ folders, or the individual page ‘my-page.htm’:
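One way this scenario might be written is sketched below. It assumes that my-page.htm sits in the site root, which the description does not state, so adjust the path if the page lives elsewhere:

```text
User-agent: *
Disallow: /cgi-bin/
Disallow: /category/
# Assumed location of the page: the site root
Disallow: /my-page.htm
```

Each Disallow line holds a single path, so every folder or page you want to block needs a line of its own.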
Because different search engines have different robots, which work in different ways, you may be in a situation where you want to exclude individual bots from certain pages. This can be done by giving specific instructions for individual bots in your robots.txt.
In the following example the robots.txt file only blocks ‘Googlebot-Image’ (the spider Google uses for indexing images) from indexing the images in the /pictures/ folder:
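A sketch of that file, assuming /pictures/ sits directly under the site root:

```text
# Rules in this group apply only to Google's image spider
User-agent: Googlebot-Image
Disallow: /pictures/
```

Because no other user-agent is listed, all other bots remain unrestricted.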
For a comprehensive list of search engine bots, visit the robots database at www.robotstxt.org.
Be warned that you must be very careful when creating a robots.txt file. If you get it wrong, you won’t be the first webmaster who has unwittingly instructed the spiders to ignore the entire site.