What is a robots.txt file and how can I add it to my website?

What is a robots text file?

A robots.txt file is a plain ASCII text file that tells search engines where they are not allowed to go on a site. The convention is also known as the Robots Exclusion Standard. Compliant search engine spiders will not crawl any files or folders listed in this document, although a blocked URL can still show up in search results if other pages link to it. Having a robots.txt, even a blank one, shows you acknowledge that search engines are allowed on your site and that they may have free access to it. We recommend adding a robots text file to your main domain and all sub-domains on your site.

What program should I use to create /robots.txt?

You can use anything that produces a text file.

On Microsoft Windows, use notepad.exe, or wordpad.exe (Save as Text Document), or even Microsoft Word (Save as Plain Text)

On the Macintosh, use TextEdit (Format->Make Plain Text, then Save as Western)

On Linux, vi or emacs

How do I check the robots.txt file on my website?

You can view it by typing the following into your browser's address bar: "yourwebsiteurl.com/robots.txt"
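As a rough sketch of the same idea in code, the robots.txt address can be derived from any page URL on the site, because the file always sits at the root of the host. The helper name robots_url and the assumption that a bare host name means plain http are illustrative choices, not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(site_url):
    """Return the robots.txt URL for the host that serves site_url."""
    # Assumption: a bare host such as "yourwebsiteurl.com" is reached over http.
    if "://" not in site_url:
        site_url = "http://" + site_url
    parts = urlsplit(site_url)
    # Keep only scheme and host; drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("yourwebsiteurl.com/blog/post.html"))
```

Opening the printed address in a browser shows the file, or an error page if the site has none.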

How to create a robots.txt file

Where to put it

The short answer: in the top-level directory of your web server.

How to add a robots.txt file to your site

A robots text file, or robots.txt file (often mistakenly referred to as a robot.txt file), is a must-have for every website. Adding a robots.txt file to the root folder of your site is a very simple process, and having this file is actually a ‘sign of quality’ to the search engines. Let’s look at the robots.txt options available to your site.

Robots.txt options for formatting

Writing a robots.txt is an easy process. Follow these simple steps:

Open Notepad, Microsoft Word or any text editor and save the file as ‘robots,’ all lowercase, making sure to choose .txt as the file type extension (in Word, choose ‘Plain Text’ ).

Next, add the following two lines of text to your file:

User-agent: *
Disallow:

‘User-agent’ is another word for robots or search engine spiders. The asterisk (*) denotes that this line applies to all of the spiders. Here, there is no file or folder listed in the Disallow line, implying that every directory on your site may be accessed. This is a basic robots text file.

Blocking the search engine spiders from your whole site is also one of the robots.txt options. To do this, add these two lines to the file:

User-agent: *
Disallow: /

If you’d like to block the spiders from certain areas of your site, your robots.txt might look something like this:

User-agent: *
Disallow: /database/
Disallow: /scripts/

The above three lines tell all robots that they are not allowed to access anything in the database and scripts directories or their sub-directories. Keep in mind that only one file or folder can be listed per Disallow line. You may add as many Disallow lines as you need.
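To see how a well-behaved robot might read these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name "ExampleBot", the helper is_allowed and the sample paths are made up for illustration. Real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The three-line example from the text, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /database/
Disallow: /scripts/
"""


def is_allowed(robots_txt, agent, url):
    """Return True if the rules in robots_txt let agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical paths on the example domain used in this article.
for path in ("/index.html", "/database/users.db", "/scripts/tools/run.cgi"):
    url = "http://www.mydomain.com" + path
    print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Only /index.html is reported as allowed. The other two paths start with a disallowed prefix, so sub-directories are covered as well.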

Be sure to add the location of your search engine friendly XML sitemap file to the robots text file. This helps the spiders find your sitemap and discover all of your site’s pages. Use this syntax:

Sitemap: http://www.mydomain.com/sitemap.xml
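As a quick check that the Sitemap line is readable by a parser, the sketch below uses the site_maps() method of Python's standard RobotFileParser (available in Python 3.8 and later). The helper name list_sitemaps is an assumption for illustration, and the sample file simply combines the basic two-line record with the Sitemap line above.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.mydomain.com/sitemap.xml
"""


def list_sitemaps(robots_txt):
    """Return every Sitemap URL declared in robots_txt (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


print(list_sitemaps(ROBOTS_TXT))
```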

Once complete, save and upload your robots.txt file to the root directory of your site. For example, if your domain is www.mydomain.com, you will place the file at www.mydomain.com/robots.txt.
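If you would rather generate the file than type it by hand, the following sketch assembles the text from a list of folders. The function build_robots_txt and its parameters are hypothetical names chosen for this example. It writes one Disallow line per prefix and an optional Sitemap line.

```python
def build_robots_txt(disallowed, sitemap=None, agent="*"):
    """Build robots.txt text with one Disallow line per URL prefix."""
    lines = ["User-agent: " + agent]
    if disallowed:
        lines += ["Disallow: " + prefix for prefix in disallowed]
    else:
        # An empty Disallow value means nothing is blocked.
        lines.append("Disallow:")
    if sitemap:
        # A blank line closes the record before the Sitemap line.
        lines += ["", "Sitemap: " + sitemap]
    return "\n".join(lines) + "\n"


text = build_robots_txt(
    ["/database/", "/scripts/"],
    sitemap="http://www.mydomain.com/sitemap.xml",
)
print(text)
```

Save the printed text as robots.txt and upload it to the root directory as described above.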

Once the file is in place, check the robots.txt file for any errors.

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.
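The two rules in the note above are easy to check mechanically. The sketch below is a deliberately small checker, not a full validator. The function name lint_robots and its messages are invented for this example, and it only looks at Disallow lines.

```python
def lint_robots(robots_txt):
    """Return (line_number, message) pairs for two common mistakes."""
    problems = []
    in_record = False  # True once a User-agent line has opened a record
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        if not line.strip():
            # A blank line ends the current record.
            in_record = False
            continue
        # Drop any trailing comment, then split "Field: value" once.
        content = line.split("#", 1)[0].strip()
        if not content:
            continue  # comment-only line
        field, sep, value = content.partition(":")
        field = field.strip().lower()
        if field == "user-agent":
            in_record = True
        elif field == "disallow" and sep:
            if not in_record:
                problems.append((number, "Disallow line outside a record"))
            if len(value.split()) > 1:
                problems.append((number, "more than one prefix on a Disallow line"))
    return problems


sample = "User-agent: *\nDisallow: /cgi-bin/ /tmp/\n\nDisallow: /junk/\n"
for number, message in lint_robots(sample):
    print(number, message)
```

In the sample, line 2 lists two prefixes and line 4 follows a blank line that has already closed the record, so both are reported.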

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines of the original standard. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot rely on lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif". Some major search engines accept wildcards in Disallow paths as an extension, but not every robot does.

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server

User-agent: *
Disallow: /

To allow all robots complete access

User-agent: *
Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)

To exclude all robots from part of the server

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

To exclude a single robot

User-agent: BadBot
Disallow: /

To allow a single robot

User-agent: Google
Disallow:

User-agent: *
Disallow: /
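A file with two records like this one can be tested before it goes live. The sketch below feeds the "allow a single robot" rules to Python's standard urllib.robotparser and asks what two robots may fetch. The page URL is a made-up example, and the blank line between the two records matters because it is what separates them.

```python
from urllib.robotparser import RobotFileParser

# Two records: one for the robot named Google, one for everyone else.
ROBOTS_TXT = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://www.mydomain.com/page.html"  # hypothetical page
google_ok = parser.can_fetch("Google", page)
badbot_ok = parser.can_fetch("BadBot", page)
print("Google:", google_ok)
print("BadBot:", badbot_ok)
```

With this parser, Google is allowed and BadBot is refused, which matches the intent of the example.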

To exclude all files except one

This is a bit awkward, as the original standard has no "Allow" field (some robots support one as an extension, but you cannot count on it). The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/stuff/

Alternatively you can explicitly disallow all disallowed pages:

User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

You can check your robots.txt file with this tool: http://tool.motoricerca.info/robots-checker.phtml