Re: Crawlers and "dynamic" urls

olly@muscat.co.uk
11 Dec 1996 11:57:11 -0000


Klaus Johannes Rusch <e8726057@student.tuwien.ac.at> writes:
>In <19961210002328.AAA22826@novo2.novomedia.com>, koblas@novomedia.com (David Koblas) writes:
>> The general problem is that we rewrite URLs to contain a session key
>> [...] this ID is only good for an hour, in the process of
>> being crawled by a search engine this ID eventually expires and a
>> new one is issued. Of course this means that to the crawler there
>> is a whole new set of unexplored URLs.
>
>What about putting the session key last, and restricting access for those URLs,
>like
>
> http://yoursite/special/document.html
> http://yoursite/special/document.html/sessionkey1/
> http://yoursite/special/document.html/sessionkey2/
>
>are equivalent so you would add the following to robots.txt:
>
>/document.html/

Or, more simply, keep the session key at the start but begin it with a
"magic" sequence such as "__", then use robots.txt to exclude URLs whose
paths start with "/__".

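As a rough sketch of that exclusion (Python's standard urllib.robotparser;
the robots.txt body, the "ExampleRobot" agent name and the "__sessionkey1"
style keys are all made up for illustration), a robot would see something
like this:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: exclude every path that begins with the
# "magic" prefix used for session keys.
ROBOTS_TXT = """\
User-agent: *
Disallow: /__
"""


def make_parser(text):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


PARSER = make_parser(ROBOTS_TXT)


def allowed(url):
    """True if the (hypothetical) robot may fetch this URL."""
    return PARSER.can_fetch("ExampleRobot", url)


if __name__ == "__main__":
    for url in (
        "http://yoursite/special/document.html",
        "http://yoursite/__sessionkey1/special/document.html",
    ):
        print(url, allowed(url))
```

The plain URL stays fetchable and every "/__..." variant is excluded, so an
expired key no longer looks like a fresh set of pages to crawl.
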
However, this probably won't work. For example, if I try to retrieve
http://www.pathfinder.com/welcome/ :

: olly@noxious ~ 996$ telnet www.pathfinder.com 80
: Trying 204.71.242.42...
: Connected to pathfinder.com.
: Escape character is '^]'.
: GET /welcome/ HTTP/1.0
:
: HTTP/1.0 302 Found
: Date: Wednesday, 11-Dec-96 10:46:53 GMT
: Server: Open-Market-Secure-WebServer/2.0.5.RC0
: MIME-version: 1.0
: Security-Scheme: S-HTTP/1.1
: Set-Cookie: OpenMarketSI=/@@Hbhm*QcATFe5*r*u; path=/;
: Location: http://pathfinder.com/@@Hbhm*QcATFe5*r*u/welcome/
: Content-type: text/html
:
: <TITLE>Redirection</TITLE><H1>Redirection</H1>
: This document can be found <A
: [snip]

So the robot gets redirected to a URL containing the session key, but if the
robots.txt file tells it to ignore '/@@', then the redirection will take it
somewhere it's not allowed to go. So it doesn't retrieve the actual page
(or arguably shouldn't).
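
To make the dead end concrete, here is a small sketch of the robot's
decision, assuming it checks every redirect target against the exclusion
list. The REDIRECTS table stands in for real HTTP traffic and only
reproduces the one redirect from the trace above; the function names and
the simple prefix match are my own assumptions, not any particular robot's
code.

```python
from urllib.parse import urlsplit

# Assumed robots.txt exclusion: paths starting with the session-key prefix.
DISALLOWED_PREFIXES = ["/@@"]

# Stand-in for the server: the single redirect seen in the trace above.
REDIRECTS = {
    "http://www.pathfinder.com/welcome/":
        "http://pathfinder.com/@@Hbhm*QcATFe5*r*u/welcome/",
}


def is_allowed(url):
    """True if the URL's path matches none of the excluded prefixes."""
    path = urlsplit(url).path or "/"
    return not any(path.startswith(prefix) for prefix in DISALLOWED_PREFIXES)


def follow(url, redirects, max_hops=5):
    """Return the URL the robot may finally fetch, or None if it is blocked."""
    for _ in range(max_hops):
        if not is_allowed(url):
            return None          # redirected somewhere it may not go
        target = redirects.get(url)
        if target is None:
            return url           # no further redirect: fetch this page
        url = target
    return None                  # too many redirects, give up


if __name__ == "__main__":
    print(follow("http://www.pathfinder.com/welcome/", REDIRECTS))
```

The starting URL is allowed, but its only redirect target is excluded, so
a well-behaved robot never reaches the page.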

Martijn's idea of not giving a session key if the agent name contains
"Robot" seems good, but may not be easy to implement at the server end if
you're using a commercial server which you don't have source code for. But
if robot authors standardised on it, server authors might be persuaded
to implement it. If not, their customers will get overcrawled and,
hopefully, complain.
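
For what it's worth, the server-side check could be as small as the sketch
below. The function names are hypothetical, the case-insensitive match on
"robot" is my assumption (no convention has been agreed), and the session
key is simply the one from the trace above.

```python
def needs_session_key(user_agent):
    """Issue a session key unless the agent name contains "robot".

    Assumption: the match is case-insensitive, and a missing User-Agent
    header is treated as an ordinary browser.
    """
    return "robot" not in (user_agent or "").lower()


def redirect_target(path, user_agent, session_key):
    """Return the path to redirect to, or None to serve `path` directly."""
    if needs_session_key(user_agent):
        return "/" + session_key + path
    return None


if __name__ == "__main__":
    key = "@@Hbhm*QcATFe5*r*u"
    print(redirect_target("/welcome/", "Mozilla/3.0", key))
    print(redirect_target("/welcome/", "ExampleRobot/1.0", key))
```

A browser is still redirected to a keyed URL, while anything calling itself
a robot gets the plain page and never sees a session key at all.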

Olly
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html