NOTE: This is a collection of information and links collected over the years that might provide useful information. A Safer Company LLC does not guarantee, endorse or approve any of these links or their scripts. Use these links to other websites at your own risk.
Search Engines and Robots.txt
A robots.txt file tells search engine robots which files and folders on your site should not be crawled and indexed.
Note: Robots can choose to ignore the robots.txt file.
404 Errors
When a robot crawls your site and does not find a robots.txt file, it assumes that it may crawl and index the entire site. Not having a robots.txt file can also create unnecessary 404 errors in your server logs. To stop these unnecessary 404 errors, upload a blank or simple robots.txt file to the root directory of your domain.
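If you are not sure whether your domain already serves a robots.txt file, a short script can check for you. The following is only a rough sketch, not part of the original material; it uses Python's standard urllib, and https://www.example.com is a placeholder for your own domain.

# Minimal sketch: report whether /robots.txt answers with HTTP 200 or is missing (404).
import urllib.error
import urllib.request

def has_robots_txt(domain):
    try:
        with urllib.request.urlopen(domain + "/robots.txt") as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # A missing file raises HTTPError 404, the same error that fills server logs.
        return False

print(has_robots_txt("https://www.example.com"))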
Creating a robots.txt file
Create a text document and save the file as robots.txt in the root directory.
The syntax is
<field>:<optionalspace><value><optionalspace>
- Comments can be included in robots.txt files.
- The # character indicates that the preceding space (if any) and the remainder of the line up to the line termination are discarded. Lines containing only a comment are discarded completely.
- The simplest robots.txt file uses two rules (see the sketch after this list):
- User-agent: <value>
- The value can be the name of the specific robot whose access policy the record describes.
- The value can be * (asterisk); the record then describes the default access policy for any robot that has not matched any of the other records. Only one such record is allowed in the /robots.txt file.
- Disallow: <value>
- The value can be a full path or a partial URL path that you want to block.
- Disallowing a specific file or folder keeps it from being crawled and indexed, so it will not show up in the search engines.
- An empty value indicates that all URLs can be retrieved.
- At least one Disallow field needs to be present in a record.
- An empty /robots.txt file means that all robots will consider themselves welcome.
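As a rough illustration of how a robot applies these two rules, the sketch below feeds a small record to Python's standard urllib.robotparser module; ExampleBot and the URLs are made-up placeholders, not part of the original material.

# Minimal sketch: how a crawler interprets a User-agent / Disallow record.
import urllib.robotparser

record = """
User-agent: *
Disallow: /images/
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(record)

# The * record is the default policy, so it applies to ExampleBot as well.
print(parser.can_fetch("ExampleBot", "https://www.example.com/index.html"))       # True
print(parser.can_fetch("ExampleBot", "https://www.example.com/images/logo.gif"))  # False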
Simple Robots.txt
# This will allow all robots to crawl and index all files.
User-agent: *
Disallow:
Disallowing Files and Folders
# This rule allows all robots to crawl all files except those listed in the Disallow lines; a quick check of these rules follows the example.
User-agent: *
Disallow: /images/ #disallows all files in the folder /images/
Disallow: /example #disallows the file /example.html and everything in the folder /example/, such as /example/index.html
Disallow: /product/ #disallows all files in the folder /product/ but allows the file /product.html
Disallow: /oldindex.html #this file is blocked
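To see how the partial path /example differs from the folder path /product/, the sketch below (an illustration only, not from the original material) runs the rules above through Python's standard urllib.robotparser; the domain is a placeholder.

# Minimal sketch: a partial path blocks matching files and folders,
# while a path ending in / blocks only the folder's contents.
import urllib.robotparser

rules = """
User-agent: *
Disallow: /images/
Disallow: /example
Disallow: /product/
Disallow: /oldindex.html
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

base = "https://www.example.com"
for path in ("/example.html", "/example/index.html", "/product/item.html", "/product.html"):
    verdict = "allowed" if parser.can_fetch("*", base + path) else "blocked"
    print(path, "->", verdict)
# /example.html, /example/index.html and /product/item.html are blocked;
# /product.html is allowed.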
Disallowing a Robot
Disallow specific robots from crawling your site, or limit which files they may access; a quick check of agent-specific records follows these examples.
# This example indicates that no robots should visit this site
User-agent: *
Disallow: /
# This denies Googlebot-Image access to all files in your domain
User-agent: Googlebot-Image
Disallow: /
# This specifically denies Googlebot-Image access to your /images/ folder
User-agent: Googlebot-Image
Disallow: /images/
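One way to confirm that a record like this restricts only the named robot is to test two different user-agent names against it. The sketch below is illustrative only and not part of the original material; SomeOtherBot is a made-up name.

# Minimal sketch: a record naming Googlebot-Image does not affect other robots.
import urllib.robotparser

rules = """
User-agent: Googlebot-Image
Disallow: /images/
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

url = "https://www.example.com/images/photo.jpg"
print(parser.can_fetch("Googlebot-Image", url))  # False: its record blocks /images/
print(parser.can_fetch("SomeOtherBot", url))     # True: no record applies to it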
Allowing Specific Robots
# cybermapper has access to all files and folders
User-agent: cybermapper
Disallow:
Robots.txt Validators
The robots.txt file should be validated once it has been uploaded to the root directory of your domain.
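In addition to online validators, you can spot-check the uploaded file yourself. The sketch below is only an illustration, not part of the original material; it fetches the live robots.txt with Python's standard urllib.robotparser, and the domain and paths are placeholders.

# Minimal sketch: fetch the live robots.txt and test a few paths against it.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # downloads and parses the uploaded file

for path in ("/", "/images/", "/product/"):
    url = "https://www.example.com" + path
    print(path, "allowed for * :", parser.can_fetch("*", url))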