
Operational Principles of Robots.txt
by Alexander O. (www.webskyguide.com)

When search engines first appeared, the robots.txt file performed only one function: it allowed or disallowed a search robot (a "spider") to index a website or individual pages. As a rule, webmasters wrote robots.txt for pages used by small groups of people or for pages and websites under reconstruction. Today the robots.txt file is no longer a minor detail of website building. An experienced webmaster can turn it into a useful tool for search engine promotion.

Robots.txt: what is it?
As its name and extension suggest, robots.txt is an ordinary text file that can be created in any text editor within seconds. Simple as it is, the file gives the website owner real leverage in promoting the website: in particular, it influences which parts of the site, and how much of it, end up in a search engine's index.

The file can tell one search engine's robot which pages it may index and leave other pages to another search engine's robot. Dividing the work between two search engines with different databases can help attract users from both. For robots.txt to work, it must be placed in the root directory of the website.

Instructions for searching robots in robots.txt
Before discussing the syntax of robots.txt, it is necessary to know where the file must be located. Almost all "spiders" look for robots.txt only in the root directory of the server. It makes no sense to place it in subdirectories or to scatter several files with different instructions across the site: in both cases, the "spider" will not read those files and will consult only the file in the root directory. The file name should be typed in lowercase letters; otherwise, the "spider" may ignore it.
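The location rule above can be sketched in Python: whatever page a robot is about to visit, the robots.txt address is derived from the scheme and host alone. This is a minimal sketch; the function name robots_txt_url and the example.com address are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the address where a robot would look for robots.txt.

    Only the scheme and host of the page are kept; the path is always
    the lowercase "/robots.txt" in the root directory.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page deep inside the site still maps to the file in the root directory.
print(robots_txt_url("http://www.example.com/dir/page.html?x=1"))
```

A file stored at "/dir/robots.txt" is never produced by this function, which mirrors why such a file would go unread.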

User-agent and Disallow Lines
Instructions for the "spider" in robots.txt are written as groups. Each group begins with a User-agent line, which names the robot that the following Disallow lines apply to. To give different instructions to several robots, write a separate group for each. Each User-agent line names one robot, which will follow the Disallow lines in its group. The only exception is the "User-agent: *" line, which applies the group to the "spiders" of all search engines that have no group of their own.
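As an illustration of groups, the sketch below feeds a small robots.txt to Python's standard urllib.robotparser module and asks what each robot may fetch. The robot names "alphabot", "betabot" and "otherbot" and all the paths are hypothetical; real robots publish their own User-agent names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group per robot, plus a "*" group for the rest.
ROBOTS_TXT = """\
User-agent: alphabot
Disallow: /news/

User-agent: betabot
Disallow: /shop/

User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Each named robot follows only its own group.
print(parser.can_fetch("alphabot", "/news/today.html"))   # False
print(parser.can_fetch("alphabot", "/shop/cart.html"))    # True
print(parser.can_fetch("betabot", "/shop/cart.html"))     # False
# A robot without a group of its own falls back to the "*" group.
print(parser.can_fetch("otherbot", "/private/notes.html"))  # False
```

With a file like this, the news section is left to one engine and the shop section to the other.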

Every User-agent line, including "User-agent: *", should be followed by at least one Disallow line. A webmaster can write as many Disallow lines for one robot as necessary, and all of them will apply to this robot. These are the most common Disallow instructions:
  • Disallow: / - disallow indexing of the whole website. An empty value ("Disallow:" with nothing after it) means that the robot is allowed to index the whole website;
  • Disallow: /dir - disallow indexing all the pages on the server with names beginning with "/dir" (for example, "/dir.html", "/directory.html", "/dir/index.html");
  • Disallow: /dir/ - disallow visiting the "/dir/" directory and everything inside it; a page such as "/dir.html" remains allowed.
The value of a Disallow line is a plain path prefix, not a pattern. The original robots.txt standard supports neither wildcards nor regular expressions, so instructions such as "Disallow: *", "Disallow: *.doc", or "Disallow: /dir/*.doc" are not interpreted as patterns by a "spider" that follows the original standard and in effect block nothing.
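The prefix behaviour of Disallow can be checked with Python's standard urllib.robotparser module. This is a minimal sketch: the helper is_allowed and the robot name "examplebot" are assumptions made for illustration, and each call tests a single Disallow line inside a "User-agent: *" group.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(disallow_line, path, robot="examplebot"):
    """Check one path against a single Disallow line in a "*" group."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", disallow_line])
    return parser.can_fetch(robot, path)


# "Disallow: /" closes the whole site; an empty value opens it.
print(is_allowed("Disallow: /", "/index.html"))       # False
print(is_allowed("Disallow:", "/index.html"))         # True

# "/dir" is a prefix, so it also matches "/dir.html" and "/directory.html".
print(is_allowed("Disallow: /dir", "/directory.html"))   # False
print(is_allowed("Disallow: /dir", "/other.html"))       # True

# "/dir/" matches only the directory and its contents.
print(is_allowed("Disallow: /dir/", "/dir/index.html"))  # False
print(is_allowed("Disallow: /dir/", "/dir.html"))        # True
```

The difference between the last two groups of calls is the trailing slash, which is the most common source of pages being blocked by accident.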

Meaning of comments and empty lines
There should be no empty lines inside a User-agent group. Empty lines are allowed only between groups when one robots.txt contains several of them. A robot takes a Disallow line into account only if it directly follows its User-agent line or another Disallow line of the same group (i.e. there is no empty line between them). The "#" sign starts a comment: everything after it up to the end of the line is a note for people. All search robots ignore comments.
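To make the rules about comments and empty lines concrete, here is a minimal sketch of a parser that applies them. It is not how any particular search engine works: the function parse_groups, the sample file and the robot names are hypothetical, and treating a comment-only line as "not empty" is an assumption of this sketch.

```python
def parse_groups(text):
    """Split robots.txt text into (user_agents, disallow_paths) groups.

    '#' starts a comment, an empty line ends the current group, and a
    Disallow line counts only if it belongs to an open User-agent group.
    """
    groups = []
    agents, paths = [], []

    def close_group():
        nonlocal agents, paths
        if agents and paths:          # keep only complete groups
            groups.append((agents, paths))
        agents, paths = [], []

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop the comment, if any
        if not line:
            if not raw.strip():               # a truly empty line ends the group
                close_group()
            continue                          # comment-only lines are skipped
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if paths:                         # Disallow lines seen: new group starts
                close_group()
            agents.append(value)
        elif field == "disallow" and agents:
            paths.append(value)
    close_group()
    return groups


SAMPLE = """\
# Keep the hypothetical robot "alphabot" out of the drafts
User-agent: alphabot
Disallow: /drafts/   # work in progress

User-agent: betabot

Disallow: /shop/
"""

print(parse_groups(SAMPLE))
```

In the sample, the empty line between "User-agent: betabot" and its Disallow line breaks the group, so only the alphabot group survives.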

"Robots" Meta Tags
For successful promotion of a website, knowing the peculiarities of robots.txt is not enough. Individual pages often need a "Robots" meta tag, which allows or disallows indexing of that particular page. The "Robots" tag serves a similar purpose to robots.txt, but it works per page and uses a different syntax.

The "name" attribute of the tag is set to "robots" or, for search engines that support it, to the name of a specific search engine robot. The second attribute, "content", lists the required instructions:
  • INDEX - allows indexing the document;
  • NOINDEX - disallows indexing the document;
  • FOLLOW - allows going to the links;
  • NOFOLLOW - disallows going to the links in the document;
  • NONE - performs the same functions as NOINDEX, NOFOLLOW;
  • ALL - is equivalent to INDEX, FOLLOW.
The instructions are not case-sensitive. They must not contradict one another and should not be repeated.
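The instructions listed above can be read from a page with Python's standard html.parser module. This is a sketch under stated assumptions: the class RobotsMetaParser and the function robots_meta_policy are hypothetical names, and a page without a "Robots" tag is treated here as INDEX, FOLLOW.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the instructions of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for token in attributes.get("content", "").split(","):
            token = token.strip().upper()     # instructions are case-insensitive
            if token:
                self.directives.add(token)


def robots_meta_policy(html):
    """Return (may_index, may_follow) for a page's HTML source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"NOINDEX", "NONE"} & found)
    may_follow = not ({"NOFOLLOW", "NONE"} & found)
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(robots_meta_policy(page))  # (False, True)
```

The sketch lets a restrictive instruction win over a permissive one; since the instructions should never contradict each other, that choice only matters for malformed pages.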


Copyright © 2005 Webskyguide.Com. All Rights Reserved.