Re: RFC, draft 1

Darren Hardy (dhardy@netscape.com)
Sat, 16 Nov 1996 14:15:43 -0800


Captain Napalm wrote:
>
> It was thus said that the Great Martijn Koster once stated:
> >
> >
> > Hallvard wrote recently:
> >
> > > We need official WWW standards to refer to the Robot Exclusion Standard
> >
> Maybe a better name would be 'Robot Access Standard'. This isn't just for
> exclusion anymore.
>

I disagree. The Allow/Disallow rules are still geared toward excluding
resources from robots. If we added a 'Suggest: URL' feature, where
the robots.txt actually directed the robot to useful links/content,
then I'd support the rename. Of course, it should be renamed to the
"Robot Filtering Control standard" -- as in "check the RFC on RFC" :-).

> > I finally sat down and wrote a new specification of the Standard for
> > Robot Exclusion (at the expense of reading the list :-)
> > My focus is not on new features (although I did add Allow),
> > but on providing a more solid specification which addresses concerns of
> > ambiguity and completeness.
>
> Any reason for this? It seems that we're fairly close (maybe) to a new
> standard, and if an RFC is going to be submitted, it might be better to
> include the new features that are deemed sorely needed (like Visit-time: and
> Request-rate: plus whatever will be Allow: and Disallow: with regular
> expressions).
>
> Not to say that the work here isn't good. It is (certainly clarifies
> things) but maybe, if you can hold off for a bit, a better, newer standard
> can be made into an RFC.

I'm not sure that Visit-Time is the best way to throttle robots.
Request-Rate also sounds too difficult for robots to implement unless
it is kept very simple: something like 1000 URLs/day is manageable,
but 10 URLs/minute is too fine-grained, and at that point I'd question
its usefulness. Are robots really harder on Web servers than several
powerful Web browsers doing concurrent retrievals? I'm not convinced
they are. Perhaps these issues belong in a more general approach that
all Web clients can use to address quality of service with HTTP. For
example, I'd want to ensure that all my Web browsers get more
bandwidth than robots, and that a few clients receive even more
bandwidth for real-time applications.
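
For reference, the kind of record people seem to have in mind looks
something like the sketch below; the field names come from the
discussion here, but the value formats are only illustrative --
nothing is settled:

    User-agent: *
    # visit only between 02:00 and 05:00 (GMT) -- illustrative window
    Visit-time: 0200-0500
    # at most one request per 60 seconds -- illustrative rate
    Request-rate: 1/60

A coarse rule (say, 1000 URLs/day) only needs a per-host counter that
resets once a day; anything on the order of 10 URLs/minute forces the
robot to schedule individual fetches per host, which is where I think
the implementation cost comes in.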

I believe that the real problem is that too many robots are trying to
retrieve too much information from servers. Technology for focusing
the scope of the robots seems like a better approach. Harvest and RDM,
for example, are geared toward having robots which run on (or very
close to) the Web server itself, then publish the results for other
robots to use. robots.txt is an example of reducing scope too.
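
As a sketch of what "reducing scope" could look like on a server that
runs its own gatherer (the paths here are hypothetical, just to
illustrate the idea):

    User-agent: *
    # Allow listed before the blanket Disallow so it takes precedence
    # under the draft's first-match evaluation: expose only the
    # locally-built summaries (e.g. what a Harvest gatherer on this
    # host exports) ...
    Allow: /summaries/
    # ... and keep remote robots out of the raw document tree.
    Disallow: /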

-Darren
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html