That avoids misunderstanding, but it is also too narrow. There are
automated requests which make sense in fighting the "flood of data" (you
can hardly call it information). This problem has been made much worse by
the WWW, and there is now research and development aimed at dealing with
it and helping people find what is interesting to them.
Current search engines are not enough: they return too many "false hits",
even for a cryptic, complex query. Often you can only use simple boolean
operators, which may not be expressive enough; they bring you too many
irrelevant sites and leave you with the problem of manually browsing the
results to see what you can actually use.
>
> If there really are clearly separate classes of activity worthy of distinct
> exclusion rules, then perhaps the robots.txt file should be tarted
> up to cater to them. In the meantime, I think that a robots.txt file
Perhaps this will become necessary!
I think there ARE (or will increasingly be) several activities which have
to be clearly distinguished from the - possibly recursive - retrieval of
large numbers of documents.
Retrieving one particular document for one particular user may of course
cause problems for a server, if the document is very popular and thousands
or perhaps (later?) millions of users want to know when it has changed
(we know: people want information "at their fingertips").
Still, in general it makes a big difference whether all these users have
"user agents" running that do such a job for them, or whether they all
have "robots" running which recursively traverse the Web-space.
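To make that distinction concrete: in today's terms, such a "user agent"
could be as small as the following rough sketch (Python; the URL and the
once-a-day schedule are just examples, not part of any real system). It
asks the server about exactly one document, using a conditional HEAD
request, and traverses nothing.

import time
import urllib.error
import urllib.request
from email.utils import formatdate

def page_changed_since(url, since_epoch):
    """HEAD request with If-Modified-Since; True only if the page changed."""
    req = urllib.request.Request(url, method="HEAD")
    req.add_header("If-Modified-Since", formatdate(since_epoch, usegmt=True))
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status == 200          # changed: a fresh copy exists
    except urllib.error.HTTPError as err:
        if err.code == 304:                    # Not Modified: nothing to do
            return False
        raise

# One such check per day is the entire load this agent puts on the server.
if page_changed_since("http://example.org/popular-page.html", time.time() - 86400):
    print("Document changed - fetch it once and notify the user.")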
> that says: "Robots not welcome here" should be honoured by all
> non-humans, unless specific permission has been obtained.
As you see, I circumvent this, because I don't think SUCH a user agent is
a "robot" ;-)
>
> It's always easy to assume that what your program does is an
> exception, or not that big a deal, but that's still jumping to a
> conclusion about someone else's motives in excluding robots,
It's a pity that you see it this way; that was really not my intention.
I think there is a significant difference here which should be discussed.
I strongly agree with the "guidelines" and the "exclusion protocol", and I
will always honour the /robots.txt file IF the program we plan to implement
acts as a robot (we really should find a comprehensive definition of that
;-). Otherwise... ? I hope the discussion goes on and we will see where it
leads.
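For the case where the program DOES act as a robot, honouring the
exclusion file is straightforward; a minimal sketch using Python's
standard robotparser module (the agent name and URLs are placeholders
only):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.org/robots.txt")   # example host, not a real target
rp.read()

# Only traverse where the server's /robots.txt allows it.
if rp.can_fetch("ExampleAgent/0.1", "http://example.org/some/page.html"):
    print("Allowed: the robot may retrieve this page and follow its links.")
else:
    print("Disallowed: skip this part of the server entirely.")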
> and indeed what they consider to be a robot. If I had such a
> blanket exclusion file up, and an automatic process ignored it
> without getting my permission first, I'd consider its author to
> have bad manners no matter what the program did.
One way to think about it, yes.
And perhaps this is how it must be seen right now, but people should put
information (or perhaps only data) on the Web so that others can profit
from it "in the BEST way". Considering the information overload, this
leads me to the thesis that the "BEST way" will become one in which user
agents serve their users and, in doing so, automatically retrieve
documents on their behalf.
OF COURSE this should be done in a fashion that wastes no resources and
respects the needs of other users.
If a server has problems even with programs that retrieve just one file
each day, then this could be stated in a possibly expanded /robots.txt
file. THEN I would obey it absolutely and tell my user that he has to
monitor the page manually (thereby probably reducing acceptance... but
people online will have to keep developing ethics (for themselves, not for
robots ;-) to accept such situations!).
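Just to illustrate what I mean by "expanded": the extra fields below are
pure invention on my part, NOT part of the existing exclusion protocol,
only a sketch of the kind of thing a server could say.

# Hypothetical, expanded /robots.txt - the last two fields do not exist:
User-agent: *
Disallow: /cgi-bin/
# imagined field: even one-document monitoring agents should stay away
Monitor-disallow: /very-expensive-report.html
# imagined field: at most one request per agent per day elsewhere
Max-requests-per-day: 1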
Long enough... sorry
Mike