Re: Crawlers and "dynamic" urls

olly@muscat.co.uk
12 Dec 1996 15:57:32 -0000

Next message: Rob Hartill: "USER_AGENT spoofing"
Previous message: Santa Claus: "Merry Christmas, HipXmas-SantaSpam!"
Maybe in reply to: David Koblas: "Crawlers and "dynamic" urls"

Ian Graham (ianweb@smaug.java.utoronto.ca) writes:
>I don't think there is a good solution to this problem. URLs are defined such
>that a URL uniquely defines a resource (the word 'idempotent' keeps
>popping up in this context.) Any application that takes the definition
>of a URL at face value must therefore assume that different URLs (i.e., in
>your context, with different session keys) are 'different' resources
>(perhaps equivalent to another, given identical checksums). I doubt the
>URL spec will change to support the type of applications you are trying to
>do.

It is true that such schemes are rather nasty, but that's unlikely to stop
people trying to use them. So robot authors need some way of dealing with
such sites, as recrawling the site repeatedly with different "session keys"
isn't good for the robot or the site.

Ideally the solution should result in the robot traversing such sites only
once. But a solution which effectively ignored such sites would be better
than recrawling them (IMHO anyway).
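As a rough illustration of the "ignore rather than recrawl" fallback, here is
a minimal sketch built on the checksum idea from the quoted text. It assumes
the robot already has each page body in hand; the class name DuplicateGuard,
the max_duplicates threshold, and the choice of SHA-256 are all hypothetical
and not taken from any existing robot.

```python
import hashlib
from urllib.parse import urlsplit


class DuplicateGuard:
    """Hypothetical helper: gives up on a host once too many fetched
    pages turn out to repeat content already seen on that host."""

    def __init__(self, max_duplicates=3):
        self.max_duplicates = max_duplicates  # assumed threshold
        self.seen = {}        # host -> set of body checksums
        self.duplicates = {}  # host -> count of repeated bodies

    def record(self, url, body):
        """Record a fetched page (body as bytes).

        Returns False once the host has produced max_duplicates repeated
        bodies, meaning the robot should stop crawling it.
        """
        host = urlsplit(url).netloc
        digest = hashlib.sha256(body).hexdigest()
        checksums = self.seen.setdefault(host, set())
        if digest in checksums:
            self.duplicates[host] = self.duplicates.get(host, 0) + 1
        else:
            checksums.add(digest)
        return self.duplicates.get(host, 0) < self.max_duplicates
```

One caveat: this only catches byte-identical bodies. If the site embeds the
session key in every link on the page, the bodies will differ between visits
and the checksums will not match, so a real robot would probably need to
normalise the page first.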

>Perhaps a more reasonable approach is to use the Host: HTTP request
>header, augmented by cookies, to track users, and to just ignore the rest?

Cookies are a better solution, except that not all browsers support them.
And Host: isn't really reliable (especially from a large site with lots of
shell accounts). However, they're probably good enough if you're just
gathering information on how people move around the site (which I think the
original questioner was).

>As an alternative, perhaps you could put the key in the query string. I
>suspect (but do not know for a fact) that most indexing tools will not
>follow resources with appended query strings.

*ANY* solution which stops the robot from fetching the rewritten URL is
going to result in the site not being crawled. This is because the rewriting
works by answering a request for the canonical version of a URL with a
redirection to a rewritten version. (I'm assuming the robot makes the same
checks on a URL it is redirected to as it makes on any other URL, which I
think it should.)
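To make the redirect argument concrete, here is a small sketch under stated
assumptions: the dictionary `site` stands in for real HTTP redirect
responses, and the function names, the example.com URLs, and the
"no query strings" policy are all hypothetical. The point is only that a
robot applying its usual URL checks to redirect targets never reaches any
page on such a site.

```python
from urllib.parse import urlsplit


def no_query_strings(url):
    """Example policy (an assumption): refuse any URL with a query string."""
    return urlsplit(url).query == ""


def resolve(url, redirects, allowed, max_hops=5):
    """Follow redirects, applying `allowed` to every URL, redirect
    targets included.

    `redirects` maps a URL to the URL the server redirects it to.
    Returns the final URL, or None if a check fails or the hop limit
    is exceeded.
    """
    for _ in range(max_hops + 1):
        if not allowed(url):
            return None
        if url not in redirects:
            return url
        url = redirects[url]
    return None


# Hypothetical site: the canonical URL redirects to one carrying a key.
site = {"http://example.com/": "http://example.com/?key=abc123"}
result = resolve("http://example.com/", site, no_query_strings)
```

Here `result` is None: the canonical URL passes the check, but its redirect
target does not, so nothing on the site is ever fetched.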

Olly
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html
