About Robots.txt Files
The robots.txt file implements the Robots Exclusion Standard, a convention for giving directives to web robots (spiders, bots) that crawl websites to build search engine results. When a bot begins to crawl a site, it first checks for a robots.txt file in the web root. If one exists, most bots will follow the directives in the file, such as "Crawl-delay" and any excluded directories. If the bot does not find a robots.txt file, it assumes the webmaster wants the entire site crawled.
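The lookup a compliant bot performs can be sketched with Python's standard urllib.robotparser module; the robots.txt content below is a hypothetical example, not from any real site:

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt, as a bot would fetch it from /robots.txt in the web root.
robots_txt = """\
User-agent: *
Disallow: /private
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Paths outside the excluded area are allowed; anything under /private is not.
print(parser.can_fetch("*", "/public/page.html"))   # True
print(parser.can_fetch("*", "/private/data.html"))  # False
```

A real crawler would first download the file (e.g. with `parser.set_url(...)` and `parser.read()`) rather than parse a hard-coded string.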
A downside of robots.txt files is that some bots will not respect the directives in the file; however, the major search engines (Google, Bing, Yahoo, etc.) do honor them.
The main reason we recommend using a robots.txt file is to control the rate at which your site is crawled, which helps prevent a bot or spider from opening a massive number of database connections at once.
To implement such a crawl delay, insert the following lines in your site's robots.txt file (located in your wwwroot or public_html folder):
User-agent: *
Crawl-delay: 5
You can adjust the delay as desired, but we suggest setting it no lower than 2 seconds. (Note that Googlebot ignores the Crawl-delay directive; its crawl rate is managed through Google Search Console instead.)
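A bot that honors this directive reads the delay and waits that long between requests. With Python's standard urllib.robotparser, the value set above can be read back like so:

```python
from urllib.robotparser import RobotFileParser

# Parse the crawl-delay rule shown above and read it back.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 5"])

delay = parser.crawl_delay("*")
print(delay)  # 5

# A polite crawler would call time.sleep(delay) between page fetches.
```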
If there are certain areas of your site you do not wish to have indexed, such as your site's administrative section or images folder, you can tell bots not to crawl such folders. To do this, add the following code to your robots.txt:
User-agent: *
Disallow: /admin
Disallow: /images
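The effect of these rules on a compliant bot can be verified with Python's standard urllib.robotparser; the file paths here are illustrative:

```python
from urllib.robotparser import RobotFileParser

# The same rules as the robots.txt snippet above.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /admin",
    "Disallow: /images",
])

# Regular pages remain crawlable; the two excluded folders do not.
print(parser.can_fetch("*", "/index.html"))       # True
print(parser.can_fetch("*", "/admin/login.php"))  # False
print(parser.can_fetch("*", "/images/logo.png"))  # False
```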
Excluding Specific Bots
If you wish to tell a specific bot to not crawl your site, you can do so with the following code:
User-agent: Baiduspider
Disallow: /

User-agent: Sosospider
Disallow: /
The above example will prevent the Baiduspider and Sosospider bots from crawling your site. To block other bots, just replace the User-agent name with the actual name of the bot you wish to block.
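How a compliant bot interprets these per-agent records can again be checked with Python's standard urllib.robotparser (Googlebot is used here simply as an example of an agent with no matching record):

```python
from urllib.robotparser import RobotFileParser

# The two per-bot records from the robots.txt snippet above.
parser = RobotFileParser()
parser.parse([
    "User-agent: Baiduspider",
    "Disallow: /",
    "",
    "User-agent: Sosospider",
    "Disallow: /",
])

# The named bots are blocked from the whole site; agents with no
# matching record (and no "User-agent: *" fallback) are unrestricted.
print(parser.can_fetch("Baiduspider", "/page.html"))  # False
print(parser.can_fetch("Sosospider", "/page.html"))   # False
print(parser.can_fetch("Googlebot", "/page.html"))    # True
```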
More Info About Using robots.txt
More info about using a robots.txt file can be found on the Wikipedia Robots Exclusion Standard page.