> > [...] sequential search exhausting our interest in a site in
>
> Interesting, I didn't think people still did that :-)
Martijn -- think lowly, very very lowly ... ;-)
People will always start somewhere -- and, as I mentioned in
point-to-point mail, it is us beginners that *all* ought to be
wary of.
Robots.txt seems the first line of defense. A site can make
explicit statements in it, and that explicitness is precisely
the argument for a MinRequestInterval field.
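To make that concrete, a robots.txt carrying the proposed field
might look something like this (MinRequestInterval is only the
extension under discussion, not a standard field, and the paths
are made up):

    User-agent: *
    Disallow: /cgi-bin/
    # proposed: minimum seconds a robot should leave between requests
    MinRequestInterval: 60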
> I think 60 is a sensible default, so let's think about why you would
> change it from that. [...] But who would set it much lower?
>
60 might be sensible for your needs -- but what about others'
search needs? Consider people like me who get libwww-perl, a spare
afternoon and a goal. Robots, spiders et al. will get more and more
prolific, and they won't all have long-term aims and/or budgets.
In testing, and in practice, I felt myself tempted to hack the
60-second default down to, say, 30 ... then I read that an 'OK' robot
on the active list did once-a-second :-) :-) ... Soon, rabid thoughts
of 60 *micro*seconds came to mind ...
However - if any site ever mentioned a preference for, say,
120 seconds - then I'd be happy to oblige.
I think this information is a good addition. It needn't be of use
to the thundering giants -- it is the WWW site that benefits.
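For what it's worth, the robot-side bookkeeping is tiny. A rough
Python sketch of what I have in mind, assuming the MinRequestInterval
value has already been parsed out of robots.txt (the class, the
parsing and the names are all mine, nothing standard):

    import time

    DEFAULT_INTERVAL = 60  # seconds; used when the site states nothing

    class PoliteFetcher:
        def __init__(self):
            self.last_request = {}   # host -> time of the last request
            self.min_interval = {}   # host -> interval stated in robots.txt

        def wait_for(self, host):
            # Honor the site's stated MinRequestInterval, else the default.
            interval = self.min_interval.get(host, DEFAULT_INTERVAL)
            last = self.last_request.get(host)
            if last is not None:
                remaining = interval - (time.time() - last)
                if remaining > 0:
                    time.sleep(remaining)
            self.last_request[host] = time.time()

A site that asks for 120 simply gets 120; everyone else gets the 60.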
> >DefaultIndex: index.html
> >
> > Stating that XXXX/ and XXXX/index.html are identical.
> >
> > You can argue that this is lamely inadequate - or that it
> > makes a saving. I know the bigger issue is recursion. Here
> > I am merely hoping to save those single page recursions.
>
> Yes, I do argue that this is lamely inadequate; I too think checksums
> are the way for this, even if it is post-retrieval; pre-retrieval is
> always a guess (even if we could have an If-not-md5 HTTP header)
>
Again - giants versus the lowly. This misses a saving for those
who don't have MD5 capabilities.
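All a robot needs to make use of a DefaultIndex hint is a one-line
canonicalization before its 'have I seen this URL?' check. A rough
sketch in the same vein, assuming the value has been read from
robots.txt (the function name is mine):

    def canonicalize(url, default_index="index.html"):
        # Treat XXXX/ and XXXX/index.html as the same URL, per DefaultIndex.
        if url.endswith("/" + default_index):
            return url[:-len(default_index)]
        return url

So http://host/dir/index.html collapses to http://host/dir/ before
the visited-set is consulted, and the second fetch never happens.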
Also, as for whether checksums are the answer - that seems odd to me.
A robot must cache a whole site's worth of checksums, or load the
checksum lists when a site's URLs are individually accessed (for
those non-sequential giants). All this to see if a URL is the
same as one already seen? Is that not a huge processing overhead?
Is this mechanism suggested only because existing HTTP servers
and header fields would need no change to support it?
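Just so we are arguing about the same scheme, here is the
post-retrieval checksum idea as I understand it -- hash the body
after fetching and skip the page if the digest has been seen before.
The names are mine, not anything in libwww-perl or HTTP:

    import hashlib

    seen_digests = set()   # one entry per fetched page -- the cache I worry about

    def is_duplicate(body):
        # Post-retrieval: by the time the MD5 can be computed, the page
        # has already been fetched, so the bandwidth is already spent.
        digest = hashlib.md5(body).hexdigest()
        if digest in seen_digests:
            return True
        seen_digests.add(digest)
        return False

The fetch has already happened before the comparison can be made,
whereas a DefaultIndex hint would let the robot skip the request
entirely.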
Adam
-- +1-203-730-5437 | http://www.micrognosis.com/~ajack/index.html