A robots.txt file gives web robots instructions about a site, using the Robots Exclusion Protocol.
How it works:
When a robot visits a website, it first checks for a robots.txt file in the root directory.
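For example, before crawling http://www.example.com/welcome.html, a robot first fetches http://www.example.com/robots.txt and follows the rules it finds there (example.com is just a placeholder domain).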
Syntax:
User-agent:
Disallow:
Allow:
User-agent: names the search engine robot/crawler/spider the rules that follow apply to.
Disallow: lists the files and directories the robot should not crawl.
Allow: lists files and directories the robot may crawl, but not all robots support this directive (see the example below).
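As a minimal sketch (the /photos/ directory is just a placeholder), robots that honour Allow, such as Googlebot, apply the most specific matching rule, so Allow can re-open one path inside an otherwise disallowed directory:
User-agent: *
Disallow: /photos/
Allow: /photos/public/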
Two important things to be mindful of:
1. Robots, especially malware robots, can ignore your /robots.txt file.
2. The file is publicly available; anyone can see which sections of your website you don't want robots to visit.
Don't try to use /robots.txt to hide information.
To exclude all robots from the entire site (this hides the website from search engines):
User-agent: *
Disallow: /
To allow all robots complete access:
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all!)
To exclude all robots from certain directories on the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
To exclude a single robot:
User-agent: MalwareBot
Disallow: /
To allow a single robot:
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
You can explicitly disallow some pages:
User-agent: *
Disallow: /~andrew/a.html
Disallow: /~andrew/b.html
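Putting these pieces together, a single /robots.txt can hold several groups separated by blank lines; a sketch combining the examples above (directory and page names are placeholders) might look like this:
User-agent: MalwareBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~andrew/a.html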