Re: robots.txt, authors of robots, webmasters ....

Benjamin Franz (snowhare@netimages.com)
Wed, 17 Jan 1996 10:59:43 -0800 (PST)


On Wed, 17 Jan 1996, Reinier Post wrote:

> You (savron@world-net.sct.fr) write:
> >
> >A few thoughts about the robots stuff:
> >
> >-- there should be no need to include a line such as:
> >   /cgi-bin/
> > in robots.txt, because excluding it should be standard behavior
> > for indexer robots
>
> That would be a kludge. It doesn't identify CGI scripts exactly
> (I do not usually include /cgi-bin/ in references to my CGI scripts)
> and it is not necessary to exclude CGI scripts categorically
> (I sometimes serve a set of files through a CGI script). Furthermore,
> better heuristics exist (e.g. don't follow forms/POST requests).

And then you risk falling down rat holes like Usenet archives. I have
*over* 100,000 archived Usenet articles online on the Web via my
Usenet-Web software. The links are all GET requests to make bookmarking
easy. Now, I know enough to have a robots.txt file blocking that tree
from indexing, but many of the people who have downloaded my software
(many hundreds of them) are unlikely to set up robots.txt. Since the
installation instructions will generally lead people to put the script
in /cgi-bin/, a smart indexer will still avoid it, because /cgi-bin/ is
dangerous to index in general.
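A minimal robots.txt along these lines (the path shown is only an
example; it depends on where the archive tree actually lives) looks
like:

    User-agent: *
    Disallow: /cgi-bin/usenet-web/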

It is very wise in general to avoid all links that match any of these
regexes (see the sketch after the list):

\.pl$
\.cgi$
\?.*$
cgi-bin
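
As a rough sketch of that filter (Python, with a hypothetical
looks_like_cgi helper; this is not code from Usenet-Web or from any
particular robot), an indexer could check each candidate URL like so:

import re

# The patterns listed above; a cautious robot skips any URL that
# matches one of them.
UNSAFE_PATTERNS = [re.compile(p) for p in (r'\.pl$', r'\.cgi$', r'\?.*$', r'cgi-bin')]

def looks_like_cgi(url):
    """Return True if the URL looks like it points at a CGI script."""
    return any(p.search(url) for p in UNSAFE_PATTERNS)

# The archive links described above would be skipped:
looks_like_cgi('http://example.com/cgi-bin/usenet-web.pl?article=1234')  # True
looks_like_cgi('http://example.com/docs/index.html')                     # False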

-- 
Benjamin Franz