A robots.txt file is a text file at the root of your site that indicates which parts of your site you don’t want accessed by search engine crawlers. The file uses the Robots Exclusion Standard, a protocol with a small set of commands that can be used to control access to your site by section and by specific kinds of web crawlers (such as mobile crawlers vs. desktop crawlers).
The simplest robots.txt file uses two keywords, User-agent and Disallow. User-agents are search engine robots (or web crawler software); most user-agents are listed in the Web Robots Database. Disallow is a command for the user-agent that tells it not to access a particular URL. Conversely, to give Google access to a particular URL that is a child directory inside a disallowed parent directory, you can use a third keyword, Allow.
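For example, the following entry blocks Googlebot from one directory while unblocking a subdirectory inside it (a minimal sketch; both paths are hypothetical placeholders):

User-agent: Googlebot
Disallow: /private/
Allow: /private/reports/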
Google
uses several user-agents, such as Googlebot for Google Search and
Googlebot-Image for Google Image Search. Most Google user-agents follow the rules you set up for Googlebot, but you can override that behavior with specific rules for certain Google user-agents.
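For instance, to keep Google Image Search out of a photos directory while leaving the rest of the site open to other Google crawlers, you could add an entry for Googlebot-Image alone (a sketch only; the /photos/ path is a hypothetical placeholder):

User-agent: Googlebot-Image
Disallow: /photos/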
The syntax for using the keywords is as follows:

User-agent: [the name of the robot the following rule applies to]
Disallow: [the URL path you want to block]
Allow: [the URL path of a subdirectory, within a blocked parent directory, that you want to unblock]
The User-agent and Disallow lines are together considered a single entry in the file, where the Disallow rule applies only to the user-agent(s) specified above it. You can include as many entries as you want, and multiple Disallow lines can apply to multiple user-agents, all in one entry. You can set the User-agent command to apply to all web crawlers by listing an asterisk (*), as in the example below:
User-agent: *
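An entry can also list several user-agents and several Disallow lines at once; in the sketch below (the paths are hypothetical placeholders), both rules apply to both crawlers:

User-agent: Googlebot
User-agent: Googlebot-News
Disallow: /drafts/
Disallow: /archive/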
You must apply the following saving conventions so that
Googlebot and other web crawlers can find and identify your robots.txt file:
- You must save your robots.txt code as a text file,
- You must place the file in the highest-level directory of your site (or
the root of your domain), and
- The robots.txt file must be named robots.txt.
As an example, a robots.txt file saved at the root of example.com, at the URL http://www.example.com/robots.txt, can be discovered by web crawlers, but a robots.txt file at http://www.example.com/not_root/robots.txt cannot be found by any web crawler.