Tuesday, October 18, 2005

The robots.txt File

Check your log files for 404 errors and you may find the robots.txt file at the top of the list. It is one of the most frequently requested files on a web site, so if you don't have one, every request for it generates a 404 (page not found) error.

The robots.txt file is requested so often because search engine spiders check it for the rules that tell them where they can and cannot go on your web site. It's an important file because you can use it to keep spiders out of folders you don't want indexed, such as your images or stats folders.

The robots.txt file is a plain text file that goes in the root folder of your web site.
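Because the file always sits in the root folder, its address can be worked out from any page on the site. Here is a minimal sketch in Python; the function name robots_txt_url and the www.example.com address are made up for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url.

    Hypothetical helper: it keeps the scheme and host of the page
    and replaces the path with /robots.txt in the root folder.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # www.example.com is a placeholder address, not a real site of interest.
    print(robots_txt_url("http://www.example.com/blog/2005/10/post.html?x=1"))
```

Whatever page you start from, the result points at the same file in the root folder of that host.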

If you just want to stop the 404 errors, you can use an empty robots.txt file. There does not need to be any content in the file.
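As a rough check that an empty file places no restrictions on spiders, the sketch below feeds an empty rule set to the robots.txt parser in Python's standard library. The helper name allows_everything, the spider name ExampleBot, and the www.example.com address are assumptions made for illustration, and real spiders may interpret files differently from this parser:

```python
from urllib.robotparser import RobotFileParser


def allows_everything(rules: str, sample_paths) -> bool:
    """Return True if a hypothetical spider may fetch every sample path."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; an empty file gives [].
    parser.parse(rules.splitlines())
    return all(
        parser.can_fetch("ExampleBot", "http://www.example.com" + path)
        for path in sample_paths
    )


if __name__ == "__main__":
    paths = ["/", "/images/logo.gif", "/setup.php"]
    print(allows_everything("", paths))
```

With no rules in the file, this parser treats every path as open to every spider.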

Here is an example of what might be included in a robots.txt file:

User-agent: *
Disallow: /setup.php
Disallow: /cgi-bin/
Disallow: /images/

The "User-agent" line identifies the spider that the rules apply to. In this case, the asterisk means that the rules apply to all spiders.

Each "Disallow" line identifies a file or folder that the User-agent may not visit. In this case, spiders may not index the setup.php file, nor may they index any of the files in the "cgi-bin" and "images" folders.
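To see how a well-behaved spider might apply these rules, here is a minimal sketch using the robots.txt parser in Python's standard library. The helper name is_allowed, the spider name ExampleBot, and the www.example.com address are assumptions made for illustration. This parser matches each Disallow value as a prefix of the requested path, which may differ in detail from what a particular search engine does:

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above.
RULES = """\
User-agent: *
Disallow: /setup.php
Disallow: /cgi-bin/
Disallow: /images/
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    site = "http://www.example.com"  # placeholder address
    for path in ("/index.html", "/setup.php", "/cgi-bin/form.cgi", "/images/logo.gif"):
        print(path, is_allowed(RULES, "ExampleBot", site + path))
```

Only /index.html is reported as allowed; the other three paths fall under one of the Disallow lines.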

Here's another example:

User-agent: googlebot
Disallow: /images/

In the above, the Google spider, GoogleBot, is excluded from the images folder.

These two examples can be combined:

User-agent: googlebot
Disallow:

User-agent: *
Disallow: /setup.php
Disallow: /cgi-bin/
Disallow: /images/

In the above, the "Disallow" line for GoogleBot is empty, so nothing is disallowed and it may index all files and folders on the web site. All other spiders may not index the setup.php file, nor the "cgi-bin" and "images" folders.
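The difference between the two groups of rules can be sketched with the same standard library parser in Python. The function name access_table, the spider name ExampleBot, and the www.example.com address are hypothetical, and the result reflects how this parser reads the file rather than how any particular search engine behaves:

```python
from urllib.robotparser import RobotFileParser

# The combined rules from the example above.
COMBINED_RULES = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /setup.php
Disallow: /cgi-bin/
Disallow: /images/
"""


def access_table(rules, agents, paths, site="http://www.example.com"):
    """Map each (agent, path) pair to True if the agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return {
        (agent, path): parser.can_fetch(agent, site + path)
        for agent in agents
        for path in paths
    }


if __name__ == "__main__":
    table = access_table(
        COMBINED_RULES,
        agents=["Googlebot", "ExampleBot"],
        paths=["/index.html", "/setup.php", "/images/logo.gif"],
    )
    for (agent, path), allowed in table.items():
        print(f"{agent:12} {path:20} {allowed}")
```

Googlebot matches its own group and is allowed everywhere, while ExampleBot falls through to the asterisk group and is kept out of the listed file and folders.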

One of the best places to go for information about the robots.txt file is http://www.robotstxt.org/. It is a comprehensive source for information on the robots.txt file and the Robots Exclusion Standard, and it provides articles about writing well-behaved web spiders. Topics covered include:

The Web Robots FAQs - Frequently Asked Questions about Web Robots, from Web users, Web authors, and Robot implementers.

Robots Exclusion - Find out what you can do to direct robots that visit your Web site.

A List of Robots - A database of currently known robots, with descriptions and contact details.

The Robots Mailing List - An archived mailing list for discussion of technical aspects of designing, building, and operating Web Robots.

Articles and Papers - Background reading for people interested in Web Robots.

Related Sites - Some references to other sites that concern Web Robots.
