Of course the gateway could have been written to provide a familiar
left-to-right, slash-delimited URL structure and translate it into
the internal representation of DNs (be it RFC 1779 or whatever),
but I guess it's too late for that. :-)
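
Just to sketch what I had in mind (purely illustrative -- the
/c=.../o=.../ou=... path layout and the function name are made up,
not anything your gateway actually does):

    # Sketch: turn a slash-delimited URL path into an RFC 1779-style DN.
    from urllib.parse import unquote

    def path_to_dn(path):
        # "/c=US/o=Example%20Corp/ou=People"
        #   -> "ou=People, o=Example Corp, c=US"
        parts = [unquote(p) for p in path.strip("/").split("/") if p]
        # RFC 1779 writes the most specific RDN first, so reverse the order.
        return ", ".join(reversed(parts))

    print(path_to_dn("/c=US/o=Example%20Corp/ou=People"))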
>So, what should my poor little gateway do?
>
>- If /robots.txt had both `Allow:' and `Disallow:' and handled regular
> expressions as well as URL prefixes, I think it should be possible to
> handle the X.500 case with a list of very ugly regexps. I'm not sure
>  if this will help other gateways, though.
Right. Both allow and regexps would be handy.
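
Just as a sketch of how that could look (hypothetical -- the current
/robots.txt only has prefix-matched `Disallow:' lines, so the regexp
Allow/Disallow records below are an extension that doesn't exist yet):

    import re

    # Hypothetical extended rules: (directive, regexp) pairs, checked
    # in order, first match wins.  Not the current /robots.txt standard.
    RULES = [
        ("Allow",    re.compile(r"^/X500/c=[^/]+/o=[^/]+/?$")),  # org level
        ("Disallow", re.compile(r"^/X500/")),                    # rest of tree
    ]

    def allowed(url_path):
        for directive, pattern in RULES:
            if pattern.search(url_path):
                return directive == "Allow"
        return True  # nothing matched: allowed by default

    print(allowed("/X500/c=US/o=Example"))            # True
    print(allowed("/X500/c=US/o=Example/ou=People"))  # False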
>- If all robots sent something like an `X-Robot: ' header, the gateway
> could treat robots differently from normal users.
Well, WebCrawler sends the string "robot" in the User-Agent field;
you can use that.
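
Something along these lines (a sketch for a CGI-style setup, where the
client's User-Agent shows up in HTTP_USER_AGENT; the substring test is
deliberately crude):

    import os

    def looks_like_robot(user_agent):
        # Crude heuristic: WebCrawler puts the word "robot" in its
        # User-Agent string; hopefully other robots do something similar.
        return "robot" in user_agent.lower()

    ua = os.environ.get("HTTP_USER_AGENT", "")
    if looks_like_robot(ua):
        print("Status: 403 Forbidden")  # or serve a cut-down page instead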
>- I've heard a few suggestions for a 'robot sink' URL inserted at the
> beginning of the document, which is expected only to be followed by
> robots. Then the gateway could identify the robot by the fact that it
> followed that URL (hopefully before it followed any other URL).
That assumes a certain traversal path; someone will put a direct link
somewhere, and a robot can come in that way.
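
For what it's worth, the sink idea would amount to something like this
(a sketch only; /robot-sink is a made-up URL, and as I said it breaks
down as soon as a robot enters through a direct link deeper in the tree):

    # Clients that fetch the sink URL get remembered and treated as
    # robots from then on.
    suspected_robots = set()

    def handle_request(client_addr, url_path):
        if url_path == "/robot-sink":
            suspected_robots.add(client_addr)
            return "403 Forbidden"
        if client_addr in suspected_robots:
            return "403 Forbidden"   # or a stripped-down page
        return "200 OK"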
>- Only provide search forms, no "list contents" buttons.
> I don't want to do this.
Well, if you want buttons, go ahead and use POST. I don't think any
robots traverse POSTs. But I suspect you want normal <A HREF> style
links too.
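
A rough sketch of what I mean (the script and field names are made up;
the point is just that the expensive listing is only handed out on POST,
which link-following robots never issue):

    import os, sys

    # CGI-ish handler: GET only returns the form, POST does the listing.
    method = os.environ.get("REQUEST_METHOD", "GET")

    print("Content-Type: text/html\n")
    if method == "POST":
        length = int(os.environ.get("CONTENT_LENGTH") or 0)
        query = sys.stdin.read(length)
        print("<p>Listing for: %s</p>" % query)  # do the X.500 search here
    else:
        print('<form method="POST" action="/X500/list">'
              '<input type="submit" value="List contents"></form>')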
>- Others?
Not off-hand...
-- Martijn
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html