Re: Robot exclusion for non-'unix file' hierarchy

Martijn Koster (m.koster@webcrawler.com)
Mon, 30 Sep 1996 17:03:43 -0700


At 12:21 AM 10/1/96, Hallvard B Furuseth wrote:
>I'd like to let X.500 gateways make *part* of the X.500 hierarchy
>available to robots. That is, the part which is mastered by the X.500
>server at the same site. That typically means URLs like this --
> - For an organizational X.500 server:
> http://foo.bar.baz:8888/<anything>,o=<organization>,c=<country>
> - For a country's master X.500 server:
> http://foo.bar.baz:8888/<anything>,c=<country>
> *except* the organizations with their own X.500 servers.
>
>Now, X.500 names are case-insensitive, the hierarchy is read from the
>end instead of the beginning, '/' is not delimiter anyway, spaces (that
>is, %20's) are not significant in many places, and there are a few other
>problems as well.

Of course the gateway could have been written to provide a familiar
left-to-right, slash delimited URL structure and translate it into
the internal representation of DNs (be it RFC 1779 or otherwise).

But I guess it's too late for that. :-)

>So, what should my poor little gateway do?
>
>- If /robots.txt had both `Allow:' and `Disallow:' and handled regular
> expressions as well as URL prefixes, I think it should be possible to
> handle the X.500 case with a list of very ugly regexps. I'm not sure
> if this will help other gateways, though.

Right. Both allow and regexps would be handy.
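To make the idea concrete, here is a small sketch of how a gateway might evaluate such rules. Note this is purely hypothetical: neither "Allow:" nor regular expressions are part of the current /robots.txt standard, and the organization/country values are made up.

```python
import re

# Hypothetical extended rules: (directive, regexp) pairs, checked in order.
# "Allow" and regexps are NOT in the current /robots.txt standard; this
# only sketches how a gateway might match X.500-style URLs if they were.
# (?i) makes the match case-insensitive, since X.500 names are.
RULES = [
    ("Allow",    re.compile(r"(?i)^/.*,o=foo,c=no$")),  # our own organization
    ("Disallow", re.compile(r"(?i)^/.*,c=no$")),        # rest of the country
]

def allowed(path):
    """Return True if a robot may fetch `path` under the rules above.
    First matching rule wins; no match means permitted by default."""
    for directive, pattern in RULES:
        if pattern.match(path):
            return directive == "Allow"
    return True
```

The first-match-wins ordering lets the "own organization" exception override the broader country-level exclusion, which is exactly the X.500 case above.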

>- If all robots sent something like an `X-Robot: ' header, the gateway
> could treat robots differently from normal users.

Well, WebCrawler sends the string "robot" in the User-Agent field;
you can use that.

>- I've heard a few suggestions for a 'robot sink' URL inserted at the
> beginning of the document, which is expected only to be followed by
> robots. Then the gateway could identify the robot by the fact that it
> followed that URL (hopefully before it followed any other URL).

That assumes a certain path traversal; someone will make a direct link
somewhere, and a robot can come in that way.

>- Only provide search forms, no "list contents" buttons.
> I don't want to do this.

Well, if you want buttons go ahead and use POST. I don't think any robots
traverse POSTs. But I suspect you want normal <A HREF> style links too.

>- Others?

Not off-hand...

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html