Or more obviously, keep the session key at the start of the URL, but begin it
with a "magic" sequence such as "__", then exclude URLs which start with "/__".
However, this probably won't work. For example, if I try to retrieve
http://www.pathfinder.com/welcome/ :
: olly@noxious ~ 996$ telnet www.pathfinder.com 80
: Trying 204.71.242.42...
: Connected to pathfinder.com.
: Escape character is '^]'.
: GET /welcome/ HTTP/1.0
:
: HTTP/1.0 302 Found
: Date: Wednesday, 11-Dec-96 10:46:53 GMT
: Server: Open-Market-Secure-WebServer/2.0.5.RC0
: MIME-version: 1.0
: Security-Scheme: S-HTTP/1.1
: Set-Cookie: OpenMarketSI=/@@Hbhm*QcATFe5*r*u; path=/;
: Location: http://pathfinder.com/@@Hbhm*QcATFe5*r*u/welcome/
: Content-type: text/html
:
: <TITLE>Redirection</TITLE><H1>Redirection</H1>
: This document can be found <A
: [snip]
So the robot gets redirected to a URL containing the session key, but if the
robots.txt file tells it to ignore '/@@', the redirection takes it somewhere
it's not allowed to go. As a result it doesn't retrieve the actual page
(or arguably shouldn't).
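To spell out the clash, here's a rough sketch in Python (a hypothetical
checker doing the simple prefix matching robots.txt calls for, not any
particular robot's code):

  from urllib.parse import urlparse

  # Hypothetical exclusion, as it might appear in robots.txt:
  #   User-agent: *
  #   Disallow: /@@
  DISALLOWED_PREFIXES = ["/@@"]

  def allowed(url):
      path = urlparse(url).path
      return not any(path.startswith(p) for p in DISALLOWED_PREFIXES)

  print(allowed("http://pathfinder.com/welcome/"))                     # True
  print(allowed("http://pathfinder.com/@@Hbhm*QcATFe5*r*u/welcome/"))  # False

The URL the robot starts from is allowed, but the Location: it gets
redirected to isn't, so a well-behaved robot has to stop there.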
Martijn's idea of not giving out a session key when the agent name contains
"Robot" seems good, but may not be easy to implement at the server end if
you're using a commercial server you don't have source code for. But if
robot authors standardised on it, hopefully server authors would be
persuaded to implement it; if they don't, their customers will get
over-crawled and hopefully complain.
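For what it's worth, the logic itself is trivial; roughly this at the server
end (a hypothetical handler sketched in Python, not Open Market's actual
code, and treating "robot" in the User-Agent as a convention):

  def handle_request(path, headers, session_key):
      user_agent = headers.get("User-Agent", "")
      if "robot" in user_agent.lower():
          # Self-identified robots get the page at its plain, stable URL:
          # no session key, no redirect, no Set-Cookie.
          return 200, {}, f"<HTML>contents of {path}</HTML>"
      # Everyone else gets the session-keyed redirect, as in the
      # pathfinder.com transcript above.
      location = f"http://pathfinder.com/{session_key}{path}"
      return 302, {"Set-Cookie": f"OpenMarketSI=/{session_key}; path=/",
                   "Location": location}, ""

  # e.g. handle_request("/welcome/", {"User-Agent": "MyRobot/1.0"},
  #                     "@@Hbhm*QcATFe5*r*u")
  # would serve the page directly, with no key in the URL.

The hard part isn't the check, it's getting a hook like this into a server
you can't modify.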
Olly