Re: Does this count as a robot?

YUWONO BUDI (yuwono@uxmail.ust.hk)
Tue, 9 Jan 1996 02:10:06 +0800 (HKT)


>
> On Sun, 7 Jan 1996, Thomas Stets wrote:
> > Here is the basic functionality:
> >
> > - The program starts at a given URL and follows all links that
> > are in the same directory or below. (Starting with http://x/a/b/c/...
> > it would follow /a/b/c/d/... but not /a/b/e/...)
> > (except for IMG graphics)
> > - It will, optionally, follow links to other servers one level deep.
> > - No links with .../cgi-bin/... or ?parameters are followed.
> > - Only http: links are followed.
> > - No Document is requested twice. (To prevent loops)
> > - It will identify itself with User-agent: and From:
> > - It will use HEAD requests when refreshing pages.
>
> From your description, it is vulnerable to looping still. Many sites use
> symbolic links from lower to upper levels. If you try to suck
> 'everything', you will end up in an infinite recursion. You need a depth
> limit (no more than X '/' elements in the URL), and probably a total
> pages limit (no more than Y pages total) to prevent any obscure cases
> from sucking it down an unexpected rat hole.
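The two limits suggested above could be sketched roughly as follows. This is only an illustration, not code from either poster; the names within_limits, MAX_DEPTH, and MAX_PAGES, and the particular limit values, are made up for the example:

```python
from urllib.parse import urlparse

MAX_DEPTH = 15    # no more than X '/' elements in the URL path
MAX_PAGES = 5000  # no more than Y pages fetched in total

def within_limits(url, pages_fetched):
    """Return True if the robot may still fetch this URL under
    both the depth limit and the total-pages limit."""
    depth = urlparse(url).path.count('/')
    return depth <= MAX_DEPTH and pages_fetched < MAX_PAGES
```

With both checks in place, even a symlink loop only wastes a bounded number of requests before the robot gives up.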

One trick that I use to get around symbolic-link loops is to
detect any recurring path segment (a /x/) in a URL. Hopefully,
no web author creates a subdirectory with the same name as
its parent or any other ancestor directory (in which case my robot
would wrongly think there is a loop and stop there). So far, in the
five hundred or so sites we have visited, I haven't seen such a case.
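The recurring-segment heuristic described above might be sketched like this (the function name is illustrative; the actual robot's code isn't shown in the post):

```python
from urllib.parse import urlparse

def looks_like_symlink_loop(url):
    """Flag a URL whose path contains the same directory segment
    twice (e.g. /a/b/a/...), on the assumption that no author nests
    a directory inside an ancestor of the same name."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    dirs = segments[:-1]  # drop the last segment, usually a filename
    return len(dirs) != len(set(dirs))
```

The cost of the heuristic is a false positive on any site that legitimately reuses a directory name along one path, which is the rare case mentioned above.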

-Budi.