Re: robots.txt extensions

Adam Jack (ajack@corp.micrognosis.com)
Wed, 10 Jan 1996 19:22:44 -0500


Martijn Koster wrote:

> > [...] sequential search exhausting our interest in a site in
>
> Interesting, I didn't think people still did that :-)

Martijn -- think lowly, very very lowly ... ;-)

People will always start somewhere -- and, as I mentioned in
point-to-point mail, it is us beginners that *all* ought to be wary
of.

Robots.txt seems to be the first line of defense: a site can make
explicit statements in it, and being explicit is a good argument for
a MinRequestInterval.
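
For example, a robots.txt entry might carry the hint like this (the
directive name comes from this discussion, but the exact syntax below
is only my guess at what the extension could look like):

    User-agent: *
    Disallow: /cgi-bin/
    MinRequestInterval: 120   # seconds between requests, please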

> I think 60 is a sensible default, so lets think about why you would
> change it from that. [...] But who would set it much lower?
>
60 might be sensible for your needs -- but what about others'
search needs? Consider people like me who get libwwwperl, a spare
afternoon, and a goal. Robots, spiders, et al. will get more and more
prolific, and they won't all have long-term aims and/or budgets.

In testing, and in practice, I found myself tempted to hack the
60-second default down to, say, 30 ... then I read that an 'OK' robot on
the active list did once-a-second :-) :-) ...... Soon, rabid thoughts of
60 *micro*seconds came to mind ...

However - if any site ever mentioned a preference for, say,
120 seconds - then I'd be happy to oblige.
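
To show what "obliging" amounts to, here is a minimal sketch in Python
of the fetch loop a small robot could run, assuming the interval has
already been parsed out of robots.txt; the names (polite_fetch,
DEFAULT_INTERVAL) are mine, not from any existing library:

    import time
    import urllib.request

    DEFAULT_INTERVAL = 60        # fall back to the 60-second default
    last_request = {}            # host -> time of the previous fetch

    def polite_fetch(host, path, min_interval=DEFAULT_INTERVAL):
        # Sleep so that successive requests to one host are at least
        # min_interval seconds apart, then fetch the page.
        elapsed = time.time() - last_request.get(host, 0.0)
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
        last_request[host] = time.time()
        return urllib.request.urlopen("http://%s%s" % (host, path)).read()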

I think this information is a good addition. It needn't be of use
to the thundering giants -- it is the WWW site that benefits.

> > DefaultIndex: index.html
> >
> > Stating that XXXX/ and XXXX/index.html are identical.
> >
> > You can argue that this is lamely inadequate - or that it
> > makes a saving. I know the bigger issue is recursion. Here
> > I am merely hoping to save those single-page recursions.
>
> Yes, I do argue that this is lamely inadequate; I too think checksums
> are the way for this, even if it is post-retrieval; pre-retrieval is
> always a guess (even if we could have an If-not-md5 HTTP header)
>

Again - giants versus the lowly. This misses a saving for those
who don't have MD5 capabilities.
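
As a concrete picture of the saving (a Python sketch, with the
default_index value taken from the proposed directive), the whole trick
is a one-line normalisation done before any request is ever made:

    def normalize(url, default_index="index.html"):
        # Treat http://host/dir/ and http://host/dir/index.html as one URL.
        if url.endswith("/" + default_index):
            return url[:-len(default_index)]
        return url

    seen = set()
    for url in ("http://host/docs/", "http://host/docs/index.html"):
        canon = normalize(url)
        if canon in seen:
            continue             # duplicate skipped with no request at all
        seen.add(canon)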

Also, as to whether checksums are the answer - that seems odd:
a robot must either cache a whole site's worth of checksums, or load
the checksum lists whenever one of the site's URLs is individually
accessed (for those non-sequential giants). All this just to see
whether a URL is the same as one already seen? Is this not a huge
processing overhead? Is this mechanism suggested only because existing
HTTP servers and header fields would need no change to support it?
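
For contrast, here is roughly what the post-retrieval checksum scheme
asks of a robot (again a Python sketch; the dictionary of digests is
exactly the cache I am complaining about, and the duplicate is only
discovered after the page has already been fetched):

    import hashlib

    seen_digests = {}            # md5 digest -> first URL seen with that content

    def is_duplicate(url, body):
        # body is the already-retrieved document, as bytes.
        digest = hashlib.md5(body).hexdigest()
        if digest in seen_digests:
            return True          # same content as seen_digests[digest]
        seen_digests[digest] = url
        return False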

Adam

--
+1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html