Re: crawling FTP sites

Jaakko Hyvatti (Jaakko.Hyvatti@Elma.FI)
Wed, 11 Sep 1996 22:11:56 +0300 (EETDST)


Greg Fenton <gregf@opentext.com>:
> Which brings me to question:
> What would be considered FTP-friendly behaviour for a robot?
> How much information should a robot get at one time?

An FTP site usually generates an index of itself nightly; this is the
index that the archie servers retrieve. The system is very widely
used, and I believe you should not traverse these huge servers but
rather fetch this one file and build your db from it.

Try: /ls-lR.gz
 or: /ls-lR
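
For a robot that means one retrieval instead of a full traversal. A
minimal sketch in Python, assuming the server allows anonymous login;
the host is the one from my example below, the function name is mine:

    from ftplib import FTP

    def fetch_index(host, out="ls-lR.gz"):
        # One RETR of the nightly index instead of walking the tree.
        ftp = FTP(host)
        ftp.login()                      # anonymous login
        try:
            with open(out, "wb") as f:
                ftp.retrbinary("RETR /ls-lR.gz", f.write)
        finally:
            ftp.quit()

    fetch_index("ftp.funet.fi")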

For example, ftp://ftp.funet.fi/ls-lR.gz is 9290475 bytes. If the
gzipped directory listing alone is nearly 10MB, just imagine how long
it would take to crawl the whole tree.

If I remember correctly, when an archie robot does not find /ls-lR.gz
or /ls-lR, it just does a 'dir' in the root directory and stops there.
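
That fallback is easy to mimic. A hedged sketch, again in Python;
the order of attempts is only my recollection as stated above, not
the actual archie code:

    from ftplib import FTP, error_perm

    def index_or_root_listing(host):
        ftp = FTP(host)
        ftp.login()                      # anonymous login
        try:
            for path in ("/ls-lR.gz", "/ls-lR"):
                try:
                    chunks = []
                    ftp.retrbinary("RETR " + path, chunks.append)
                    return b"".join(chunks)   # got the nightly index
                except error_perm:
                    pass                      # not there, try the next
            # Neither index file exists: list the root once, then stop.
            lines = []
            ftp.retrlines("LIST", lines.append)
            return "\n".join(lines).encode()
        finally:
            ftp.quit()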

(I believe you know this; I'm just writing to bring it up in the
discussion.)