avoiding infinite regress for robots

Reinier Post (reinpost@win.tue.nl)
Mon, 8 Jan 1996 20:04:19 +0100 (MET)


Benjamin Franz wrote:

>Many sites use
>symbolic links from lower to upper levels. If you try to suck
>'everything', you will end up in an infinite recursion. You need a depth
>limit (no more than X '/' elements in the URL), and probably a total
>pages limit (no more than Y pages total) to prevent any obscure cases
>from sucking it down an unexpected rat hole.

I'm surprised that no spider seems to use the page content to guess whether or
not two document trees are equal. For example, one heuristic would be to keep
a checksum for every visited page, and to decide that two subtrees are probably
equal if their root nodes and their children have identical checksums.
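
To make this concrete, here is a minimal sketch of that heuristic in Python.
It is not taken from any existing spider: the fetch() function, which is
assumed to return a page's raw bytes, would be supplied by the crawler, and
MD5 is just one possible checksum.

    import hashlib

    # Signatures of every subtree walked so far.
    seen_subtrees = set()

    def checksum(content):
        # Checksum of a page's raw bytes; MD5 is enough for duplicate detection.
        return hashlib.md5(content).hexdigest()

    def subtree_signature(root_url, child_urls, fetch):
        # A subtree's signature: the root page's checksum followed by the
        # sorted checksums of its immediate children.
        sigs = [checksum(fetch(root_url))]
        sigs.extend(sorted(checksum(fetch(u)) for u in child_urls))
        return tuple(sigs)

    def probably_seen(root_url, child_urls, fetch):
        # Decide that a subtree was probably walked before (e.g. reached
        # again through a symbolic link) if its signature matches an
        # earlier one; otherwise remember it.
        sig = subtree_signature(root_url, child_urls, fetch)
        if sig in seen_subtrees:
            return True
        seen_subtrees.add(sig)
        return False

A spider would call probably_seen() before descending into a subtree and
simply skip it when the function returns true.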

Do any spiders use page content to cut off their walks, and if not, is it
because alternative techniques are sufficient? Since my own spiders are rather
simple-minded (and not widely used), I'd be interested in a more informed
opinion on the usefulness of comparing content.

>Benjamin Franz

-- 
Reinier Post						 reinpost@win.tue.nl
a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A>