Robot exclusion for non-'unix file' hierarchy

Hallvard B Furuseth (h.b.furuseth@usit.uio.no)
Tue, 1 Oct 1996 00:21:19 +0200 (MET DST)


I'd like to let X.500 gateways make *part* of the X.500 hierarchy
available to robots. That is, the part which is mastered by the X.500
server at the same site. That typically means URLs like this --
- For an organizational X.500 server:
http://foo.bar.baz:8888/<anything>,o=<organization>,c=<country>
- For a country's master X.500 server:
http://foo.bar.baz:8888/<anything>,c=<country>
*except* the organizations with their own X.500 servers.

Now, X.500 names are case-insensitive, the hierarchy is read from the
end instead of the beginning, '/' is not a delimiter anyway, spaces (that
is, %20's) are not significant in many places, and there are a few other
problems as well. WWW gateways to services other than X.500 will have
similar 'robot' problems.
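
To make the problem concrete, here is a minimal sketch (in Python) of
the normalization and suffix matching a gateway would need before it
can tell whether a URL falls inside its own subtree. The suffix, the
function names and the example entries are made up for illustration;
the point is only that this is a case-folded suffix test on a decoded
name, not a URL-prefix test.

    from urllib.parse import unquote

    # Hypothetical locally mastered subtree (made-up organization/country).
    LOCAL_SUFFIX = "o=foo university, c=no"

    def normalize_dn(path):
        # X.500 names are case-insensitive, '/' is not a delimiter, and
        # spaces (%20's) around ',' and '=' are not significant, so the
        # path is decoded and canonicalized before any comparison.
        dn = unquote(path.lstrip("/"))
        rdns = [rdn.strip().lower() for rdn in dn.split(",")]
        rdns = ["=".join(p.strip() for p in rdn.split("=")) for rdn in rdns]
        return ",".join(rdns)

    def in_local_subtree(path):
        # The hierarchy is read from the end, so this is a *suffix* test.
        dn = normalize_dn(path)
        suffix = normalize_dn(LOCAL_SUFFIX)
        return dn == suffix or dn.endswith("," + suffix)

    # in_local_subtree("/cn=Some%20Person, O=Foo%20University, C=NO") -> True
    # in_local_subtree("/o=Other%20Org, c=se")                        -> False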

So, what should my poor little gateway do?

- If /robots.txt had both `Allow:' and `Disallow:' and handled regular
expressions as well as URL prefixes, I think it should be possible to
handle the X.500 case with a list of very ugly regexps (sketched after
this list). I'm not sure if this will help other gateways, though.

- If all robots sent something like an `X-Robot: ' header, the gateway
could treat robots differently from normal users (this and the next
idea are sketched together after the list).

- I've heard a few suggestions for a 'robot sink' URL inserted at the
beginning of the document, which only robots are expected to follow.
Then the gateway could identify the robot by the fact that it followed
that URL (hopefully before it followed any other URL).

- Only provide search forms, no "list contents" buttons.
I don't want to do this.

- Others?
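
For the first idea: the /robots.txt format as it stands has neither
`Allow:' lines nor regular expressions, so the following is only a
sketch, in Python, of the kind of ugly patterns such hypothetical
Allow/Disallow lines would have to contain. The organization and
country are again made up.

    import re

    # Hypothetical rules: allow only names ending in ", o=foo university,
    # c=no" (any case, with or without insignificant spaces, encoded or
    # not), and disallow everything else under the gateway.
    ALLOW = [
        re.compile(r"(?i)^/(.*,)?"
                   r"(%20|\s)*o(%20|\s)*=(%20|\s)*foo(%20|\s)+university"
                   r"(%20|\s)*,(%20|\s)*c(%20|\s)*=(%20|\s)*no(%20|\s)*$"),
    ]
    DISALLOW = [
        re.compile(r"^/"),
    ]

    def may_fetch(path):
        # One possible precedence rule: a matching Allow wins over Disallow.
        if any(pat.search(path) for pat in ALLOW):
            return True
        if any(pat.search(path) for pat in DISALLOW):
            return False
        return True

    # may_fetch("/cn=Some Person, O=Foo University, C=NO")        -> True
    # may_fetch("/cn=Some%20Person,%20o=foo%20university,%20c=no") -> True
    # may_fetch("/o=Other Org, c=se")                              -> False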
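
And for the `X-Robot:' header and 'robot sink' ideas: a gateway that
sees each request as (client address, headers, path) could combine
them roughly as below. The header name is the hypothetical one from
the list above, and the sink path and function name are made up;
nothing here is an existing convention.

    SINK_PATH = "/robot-sink"   # made-up URL, linked only for robots to find
    known_robots = set()        # client addresses identified as robots so far

    def looks_like_robot(client, headers, path):
        # A cooperative robot could announce itself with an `X-Robot:' header.
        if any(name.lower() == "x-robot" for name in headers):
            known_robots.add(client)
        # Anything that follows the sink URL is assumed to be a robot,
        # hopefully before it has followed any other URL in the document.
        if path == SINK_PATH:
            known_robots.add(client)
        return client in known_robots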

Hallvard B Furuseth
UNINETT Directory Service, Norway email: katalog-hjelp@uninett.no