Re: Servers vs Agents

Rob Hartill (robh@imdb.com)
Wed, 27 Nov 1996 19:40:34 +0000 (GMT)


Davis, Ian wrote:

>What I am proposing is that we re-evaluate the reasoning behind
>robots.txt. The proposals I have seen in this list seem to rely on the
>assumption that robots.txt is enforcible when it is quite clearly not.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>These servers are going to have to learn how to
>prioritise different types of clients based on previous access patterns.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If there were a uniform way to identify a robot/agent, then enforcing
robots.txt would be much easier -- see the rough sketch below.
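Something like this is what I mean by the enforcement side. It's only a
sketch, and the user-agent list, file location and handler shape are
placeholders I made up, not anything from a real server:

# Sketch only: apply robots.txt on the server side to clients that
# admit to being robots but can't be trusted to honour the file.
# KNOWN_ROBOTS, the robots.txt path and allow_request() are all
# illustrative names, not part of any real server API.
from urllib.robotparser import RobotFileParser

KNOWN_ROBOTS = {"spYder", "CyberSpyder", "NetJet"}   # untrusted USER-AGENTs

rules = RobotFileParser()
rules.parse(open("robots.txt").read().splitlines())

def allow_request(user_agent: str, path: str) -> bool:
    """Return False (i.e. answer 403) when a known robot asks for a
    path that robots.txt disallows for it."""
    if any(name.lower() in user_agent.lower() for name in KNOWN_ROBOTS):
        return rules.can_fetch(user_agent, path)
    return True   # ordinary browsers are left alone

The point is simply that once a client identifies itself as a robot,
the server can apply the same Disallow rules the robot was supposed to
read for itself.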

:-)

What doesn't help is having to add (yesterday) "spYder" to the list
of USER-AGENTs that can't be trusted to follow robots.txt, even though it
was designed to.

-=-=-=

other rants...

"CyberSpyder" turned out to be a "content suitability" checker that blew
up on HTTP/1.1 responses.

"NetJet" is going to be a big problem in the future if it isn't forced to
behave. If you have a popular server, chances are it's already being spammed
from all over the world by NetJet users who have no idea that there are
a thousnad other users thinking exactly the same "I'll grab everything
while I sleep and refresh all those pages I didn't get around the reading
yesterday too".

There was an outcry when Netscape first released the concurrent download
feature (which is incredibly inefficient for HTTP/1.1 servers supporting
multiple requests per connection). NetJet and other "accelerators" are
far, far worse.
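To make the persistent-connection point concrete: ten documents fetched
over one keep-alive connection cost the server a single accepted socket,
where an "accelerator" would happily open ten at once. A toy example
(the host and paths are placeholders):

# Sketch: reuse one HTTP/1.1 keep-alive connection for several requests
# instead of opening a new (or concurrent) connection per document.
import http.client

paths = [f"/doc{i}.html" for i in range(10)]

conn = http.client.HTTPConnection("www.example.com")
for path in paths:
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()        # drain the body before reusing the socket
    print(path, resp.status, len(body))
conn.close()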

Offline browsing is one thing; giving Joe Average a tool that
lets him download tens of thousands of URLs from a single site while
he sleeps at night or mows the lawn is a recipe for disaster. Each Joe
Average thinks he's doing the net a favour because he believes the lies
he's been fed by the folks who sold him his net-death-accelerator.

rob
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html