Re: Crawlers and "dynamic" urls

Ian Graham (ianweb@smaug.java.utoronto.ca)
Wed, 11 Dec 1996 10:07:36 -0500 (EST)


I don't think there is a good solution to this problem. URLs are defined
such that a URL uniquely identifies a resource (the word 'idempotent'
keeps popping up in this context). Any application that takes the definition
of a URL at face value must therefore assume that different URLs (i.e., in
your context, URLs with different session keys) are 'different' resources
(perhaps equivalent to one another, given identical checksums). I doubt the
URL spec will change to support the type of application you are trying to
build. Perhaps a more reasonable approach is to use the Host: HTTP request
header, augmented by cookies, to track users, and simply ignore the rest?
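
To make the cookie idea concrete, here is a rough sketch of my own (in
Python, with made-up names like SESSION_COOKIE): hand out the session key
in a Set-Cookie header instead of rewriting the URL, so the URL a crawler
sees never changes.

    import uuid
    from http.cookies import SimpleCookie

    SESSION_COOKIE = "session_id"

    def handle_request(headers):
        # Reuse the session if the visitor sent our cookie back.
        jar = SimpleCookie(headers.get("Cookie", ""))
        if SESSION_COOKIE in jar:
            return jar[SESSION_COOKIE].value, []
        # New visitor (or a crawler ignoring cookies): issue a fresh key,
        # but leave the requested URL untouched.
        key = uuid.uuid4().hex
        return key, ["Set-Cookie: %s=%s; Path=/" % (SESSION_COOKIE, key)]

    key, extra = handle_request({})
    same, _ = handle_request({"Cookie": "%s=%s" % (SESSION_COOKIE, key)})
    print(key == same)   # True: one stable session, no URL rewriting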

As an alternative, perhaps you could put the key in the query string. I
suspect (but do not know for a fact) that most indexing tools will not
follow URLs that have query strings appended. However, you would need to
check with the different indexing services to know this for sure.
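
As a sketch of what that would look like (assuming hypothetical URLs of
the form http://yoursite/special/document.html?session=KEY), stripping the
query string collapses all the session variants back onto one canonical
URL, which is roughly what an indexer that ignores query strings would end
up seeing:

    from urllib.parse import urlsplit, urlunsplit

    def canonical(url):
        # Drop the query string (and fragment), keep everything else.
        scheme, host, path, _query, _frag = urlsplit(url)
        return urlunsplit((scheme, host, path, "", ""))

    urls = ["http://yoursite/special/document.html?session=key1",
            "http://yoursite/special/document.html?session=key2"]
    print(len({canonical(u) for u in urls}))   # 1: same resource either way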

Ian

>
> In <19961210002328.AAA22826@novo2.novomedia.com>, koblas@novomedia.com (David Koblas) writes:
> > The general problem is that we rewrite URLs to contain a session key
> > so that we can do behaviour analysis (we do support cookies too). The
> > problem is that this ID is only good for an hour; in the process of
> > being crawled by a search engine this ID eventually expires and a
> > new one is issued. Of course this means that to the crawler there
> > is a whole new set of unexplored URLs.
> >
> > Any ideas on how to deal with this type of problem?
>
> What about putting the session key last, and restricting access for those URLs,
> like
>
> http://yoursite/special/document.html
> http://yoursite/special/document.html/sessionkey1/
> http://yoursite/special/document.html/sessionkey2/
>
> are equivalent, so you would add the following to robots.txt:
>
> /document.html/
>
> Klaus Johannes Rusch
> --
> e8726057@student.tuwien.ac.at, KlausRusch@atmedia.net
> http://www.atmedia.net/KlausRusch/
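
For what it's worth, Klaus's robots.txt idea above can be checked with a
small sketch using Python's standard robotparser. Note that Disallow lines
are path-prefix matches from the root, so I have spelled the rule with the
full /special/ prefix; that detail is my assumption, not part of his
message.

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(["User-agent: *",
              "Disallow: /special/document.html/"])

    base = "http://yoursite/special/document.html"
    print(rp.can_fetch("*", base))                    # True: plain URL crawlable
    print(rp.can_fetch("*", base + "/sessionkey1/"))  # False: variant blocked
    print(rp.can_fetch("*", base + "/sessionkey2/"))  # False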

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html