Recursing heuristics (Re: Does this..)

Jaakko Hyvatti (Jaakko.Hyvatti@www.fi)
Tue, 9 Jan 1996 11:38:47 +0200 (EET)


> I wouldn't. Then again I need not, because my philosophy on robot
> behavior is "be as non-aggressive as possible," so my robot would
> simply give it up. If I were a web site admin, I would appreciate
> that in a robot.
> Anyway, this trick was actually a 5-minute-hacking solution.
>
> -Budi.

The robots home pages mention some heuristics that should be used in
recursive traversal, but they do not currently include those recently
mentioned here, let alone all that are necessary for a modern robot.
I think it is time to collect a definitive list of minimum
requirements and possible refinements for traversal algorithms.

My robot Hämähäkki indexes the *.fi domain (Finland), currently
207,147 URLs. I can list the following rules it follows
(expressed as something like regular expressions):

- check the recursion depth limit

- check against robots.txt, and check whether the path was already
fetched and has not yet expired, by whatever rules are used for that.

- recurse only into '.*/', '.*\.html?' and paths that appear to be
missing only the ending '/' and usually cause a redirect to an index.
This means something like '.*/~?[a-zA-Z0-9]+' that does not match
'.*bin.*', '.*cgi.*' or '.*\..*' (see the sketch after this list)

As you can see, I do not use a HEAD request to check the type of
every link, as for example the MOMspider does. I might in the future.
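Such a HEAD check could look roughly like the minimal sketch below
(in Python; the function name and the timeout are my own choices, and
this is not what MOMspider actually does):

    import urllib.request

    def looks_like_html(url):
        """HEAD the URL and test the Content-Type header, instead of
        guessing the document type from the path alone."""
        req = urllib.request.Request(url, method='HEAD')
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                ctype = resp.headers.get('Content-Type', '')
        except OSError:
            return False
        return ctype.split(';')[0].strip().lower() == 'text/html'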

- drop paths like '.*/cgi-bin/.*', '.*[?=+].*'

- drop paths like '.*\.html?/.*'

- interpret things like '.*//.*', '.*/\./.*' and '.*/\.\./.*'
correctly.
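To make the rules above concrete, here is a minimal sketch of how
such a filter might look. The patterns mirror the ones listed;
everything else (the function names, the normalization step) is my
own illustration, not Hämähäkki's actual code:

    import posixpath
    import re

    # drop rules from the list above, combined into one pattern
    DROP = re.compile(r'.*/cgi-bin/.*|.*[?=+].*|.*\.html?/.*')
    HTML = re.compile(r'.*\.html?$')
    DIRLIKE = re.compile(r'.*/~?[a-zA-Z0-9]+$')  # missing ending '/'
    DIRLIKE_NOT = re.compile(r'.*bin.*|.*cgi.*|.*\..*')

    def normalize(path):
        """Collapse '//', '/./' and '/../' so one document is not
        fetched under several spellings of its path (last rule)."""
        norm = posixpath.normpath(path)
        if path.endswith('/') and not norm.endswith('/'):
            norm += '/'          # normpath strips the trailing slash
        return norm

    def should_recurse(path):
        path = normalize(path)
        if DROP.match(path):
            return False
        if path.endswith('/') or HTML.match(path):
            return True
        # a path that looks like a directory missing its ending '/'
        return bool(DIRLIKE.match(path)) and not DIRLIKE_NOT.match(path)

    # (robots.txt itself can be honoured with the standard
    # urllib.robotparser module, omitted here to keep this short.)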

These are quite restrictive and might make me miss something, but
that is a minor cost and they serve well. I am about to add recursion
detection with content comparison by CRC shortly, even though it has
not been a problem: only one or two sites out of 755 symlinked
substantial subtrees, and those were easy to pick out by hand.
Otherwise none of the sites hit the recursion limit.
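A content-comparison check of that kind could look roughly like this
sketch; zlib.crc32 stands in for whatever checksum is actually used,
and the seen-set is of course per-crawl state:

    import zlib

    seen_checksums = set()

    def is_duplicate(content):
        """Return True if an identical page body was already fetched,
        e.g. because a symlinked subtree is being traversed twice.
        CRC collisions are possible but rare enough for a heuristic."""
        crc = zlib.crc32(content)      # content is bytes
        if crc in seen_checksums:
            return True
        seen_checksums.add(crc)
        return False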

Am I missing something important here? Let's collect something
useful for first-time robot writers. And others.