Re: the POST myth... a web admin's opinions..

Istvan (simon@mcs.mcs.csuhayward.edu)
Fri, 12 Jul 1996 08:48:25 +0800


Rob Hartill (11 Jul 96) asks:

>
>Should the fundamental use of the web (hyperlinking)
>be sacrificed so that third parties can run their robots without fear
>of hurting anyone?

No.

>Why do the site providers have to bend over backwards
>to accommodate robots?

I can think of a few reasons:

1. Because they (robots) exist, and because it is in the site
providers' self-interest, given the IMPOSSIBILITY of ensuring that
EVERY robot follows robots.txt.


2. Because only they fully understand what happens when
a URL is requested, so only they can configure the server so that, no
matter what sequence of URLs is requested, their site remains safe.

3. Because the Web is a public place which invites and
encourages access by anyone using it.

4. Because many robots do perform a useful service that is impossible
to provide by other means. So robots are here to stay, and both sides need
to learn how to co-exist peacefully. robots.txt is ONE tool to achieve
this goal.

>and moreover why should they bend over backwards
>to accommodate the robots run by irresponsible people when in all likelihood
>the robot owners' use of the collected data will bring little or no
>benefits to the site being 'indexed'.
>

It has been said on this list by a number of knowledgeable
people running robots that DO follow REP that "alas, relatively few sites
bother to put up a robots.txt file." They deplore this, because
it makes the art of robot writing much more difficult, and because
there are sites without a robots.txt file that are veritable minefields
for robots, with infinite URL spaces for example.
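
For illustration, such a minefield site could fence off its dangerous
areas with a few lines in a robots.txt file. This is only a sketch, and
the paths are hypothetical:

    # Hypothetical robots.txt: keep all robots out of the areas
    # that generate infinite URL spaces.
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /calendar/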

Still, the relative scarcity of robots.txt files would seem to
indicate that a large majority of current sites, even large and important
ones, is able to tolerate robots quite nicely, even dumb robots.
For if a site does not put up a robots.txt file, every robot becomes
a somewhat dumb robot.

I sympathize with your situation, running one of the sites
on which dumb robots regularly cause grief. This means,
in view of the above, that your site is atypical on the Web,
and that, unfortunately, places special burdens on you.

I understand your frustration with this, and I even understand
your feeling that dumb-robot writers are the "enemy". And from your
perspective, I understand why it would seem only fair to place the burden
of ensuring the safety of your site on them and not on you,
by requiring all robots to follow REP.

Yet such a requirement is an impossible dream, for all the reasons
already discussed here recently. So, I am sorry for you and sympathize with
your plight, but life is unfair, and I am afraid the reality is that
you will have to continue to live with this problem for the foreseeable
future.

What could help ease your burden?

First and foremost, EDUCATION of robot-writers.
Probably 99% of the ones causing you grief SHOULD
follow REP, so we all must educate them, cajole them, entice them
to change their code, and if they still don't, pressure them by
locking them out of your site, or even many sites.

Second, standard library implementations for enforcing REP should be
developed and published in major programming languages (e.g. Perl,
C, and Java), so that it becomes easier for robot-writers to
implement REP bug-free. Some of these already exist, at least in
rudimentary form.
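
To make the idea concrete, here is a minimal sketch of the kind of
check such a library would provide. It happens to use Python and its
standard urllib.robotparser module; the robot name and URLs are
hypothetical placeholders:

    # Minimal sketch: ask a site's robots.txt whether our robot may
    # fetch a given URL. Robot name and URLs are hypothetical.
    import urllib.robotparser

    ROBOT_NAME = "ExampleBot"

    parser = urllib.robotparser.RobotFileParser()
    parser.set_url("http://www.example.com/robots.txt")
    parser.read()  # fetch and parse the site's robots.txt

    url = "http://www.example.com/cgi-bin/search"
    if parser.can_fetch(ROBOT_NAME, url):
        print("REP allows fetching", url)
    else:
        print("REP forbids fetching", url)

With this much in a library, "following REP" becomes a one-line test
before every retrieval, which leaves robot-writers little excuse.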

Third, standard library implementations of defensive countermeasures
should be developed by sites such as yours, and anyone else that would
be interested in this problem, to help ease your job of securing your
site against the threat of dumb robots.
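
As a sketch of one such countermeasure, a server-side script could
refuse clients that request pages faster than a human plausibly would.
This is only an illustration (again in Python); the threshold and time
window are arbitrary values, not recommendations:

    # Sketch of a defensive countermeasure: flag clients that make
    # more than MAX_REQUESTS requests within WINDOW_SECONDS.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_REQUESTS = 20

    recent = defaultdict(deque)  # client address -> recent hit times

    def allow_request(client_addr):
        now = time.time()
        hits = recent[client_addr]
        # Drop timestamps that have fallen outside the window.
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()
        hits.append(now)
        # False means "treat as a dumb robot", e.g. answer 403 Forbidden.
        return len(hits) <= MAX_REQUESTS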


>I will apologise for any strong language that I've used so far on this
>list out of pure frustration with a constant barrage of robotic attacks
>that I've witnessed over the last 3 years, if the robot owners out there
>are willing to accept that the content of this mail is reasonable

No apologies necessary in my case. And I acknowledge that the content
of your mail message is reasonable.

>and that
>the ONLY current cure for the problems of rogue robots is to have all
>robots respect robots.txt

I don't agree with this part of your statement at all.

>UNLESS THEY HAVE BEEN GIVEN PERMISSION not to.
>

I wrote you a private message in which I argued against this too.
Yet in 99% of the cases I believe that this requirement is more
than reasonable and fair. I still worry about the other 1%.

>For every 'reasonable' reason to ignore robots.txt that people on this
>list can come up with, there's probably several counterexamples that
>would illustrate a potential problem.
>

Let's start discussing specific cases. You may be able to convince me,
or we might end up in a stalemate
[see, I have given up on
the possibility that I will ever convince you :-) ],
but the discussion will bring out good points to think about, and we all
might learn something from it.

>The fundamental problem with ignoring robots.txt is that the robot cannot
>read the results of the request and make reasonable adjustments that a
>human would if the results were unexpected.
>

The fundamental question is: what is a robot? The most popular
definition proposed here recently is the "human-in-the-loop"
requirement for a non-robot.

Your observation is an excellent point in support of this definition.

Unfortunately, in my opinion, this definition is inadequate, and I
don't know how to fix it currently.

I think that all of us agree that Netscape Navigator should not be
considered a robot. Yet if I click on a URL in Netscape Navigator
and then start work in another window, say because it is taking a long
time to get the results (something I do all the time),
I am then not in the loop, and Netscape Navigator becomes a robot.

For what does Netscape
Navigator do when we instruct it to get a page? It connects
to the site, asks for the document, analyses the results for
other things it needs to request in order to render the page,
like GIFs and applets, and retrieves those as well by the same
process. To me that sounds suspiciously like what a robot does in its
main loop.
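
In fact that loop is short enough to sketch. The following Python
fragment is deliberately simplified (it only looks for inline images
and omits all error handling), but it is essentially what a browser,
and equally a robot, does for one page:

    # Rough sketch of what a browser does for one click: fetch the
    # page, find the inline resources it references, fetch those too.
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class InlineImageFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.resources = []

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                for name, value in attrs:
                    if name == "src" and value:
                        self.resources.append(value)

    def render_like_a_browser(page_url):
        page = urllib.request.urlopen(page_url)
        html = page.read().decode("utf-8", "replace")
        finder = InlineImageFinder()
        finder.feed(html)
        for ref in finder.resources:
            # The "main loop": each inline resource is another request.
            urllib.request.urlopen(urljoin(page_url, ref)).read()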

So perhaps we should instead require, for a non-robot, that there be
a finite limit on how long the program may act automatically
before human intervention is required.

But this is not good either,
because on the one hand there are "server-push" pages which make
that time infinite for Netscape Navigator (e.g. if the page
contains one of those animated GIFs), and on the other hand my off-line
retriever example of a program that retrieves up to 50 URLs is then
a non-robot if and only if Netscape Navigator is a non-robot
for the same set of URLs under this definition.

So I hope that these two examples illustrate that this question
has not been answered satisfactorily. And this is a serious
problem, because it means that even if we accept your point of view
that all robots must follow REP, we have no reliable way to decide
what is a robot and what isn't.

>People use robots.txt for various reasons. It's not just a protector
>for scripts, it also protects against indexing information that isn't
>suitable for indexing (will the POST supporters suggest the information
>be displayed as a GIF to protect it?). These are just 2 reasons, there
>are probably a dozen more examples that none of us on this list have
>even thought about.
>
>That's "robots.txt" dealt with. Just as important are the guidelines
>for reasonable robot behaviour. If robot owners choose to ignore the
>guidelines (which are just common sense after all) then they can cause
>as much damage, e.g. a robot that hits a site too fast can cause a denial
>of service.. a robot that doesn't care that it's downloading foo.tar.gz
>might cost someone a few $$s in ISP fees. (BTW, I've seen a
>big search engine offering .PS files as search results.. they've weeded
>out ".ps" but aren't set up to check case-insensitively). A robot
>that downloads the same file every 30s all day (stupid? it happens)
>will skew someone's site statistics.
>

These are all good points in favor of using robots.txt in "indexing"
or general statistics-gathering applications on the Web.

The problem is that there are many other applications, probably many that
none of us can even imagine, where these arguments do not apply.

--Steve Simon