Re: Broadness of Robots.txt (Re: Washington again !!!)

Thaddeus O. Cooper (tcooper@mitre.org)
Wed, 20 Nov 1996 11:44:56 -0500


I have been quietly lurking in the background for a while now, as I am
somewhat new to exactly what is trying to be accomplished here, but....
It seems to me that there are some differences that are quite practical
and worth noting. First, agents index many pages on a Web server, and
certainly this can be a big problem. I have written robots that wreaked
havoc (when tested on my *own* machine, and no one else's), so I
understand the problem. The big "but" is: what are you going to do about
programs such as "Page Watchers" that have some of the attributes of a
browser/human, and may have some of the attributes of a robot? As was
stated earlier by Erik Selberg:

"However, where I may disagree with Rob (and probably others on this
list) is if MetaCrawler should fall under the robots.txt standard.
The
MetaCrawler does not run autonomously and suck up whatever it
finds. It simply verifies that pages are available and contain valid
data when instructed by the user, with the intent that users will
then
be visiting some of those pages in due course. Therefore, it is
unclear to me if robots.txt is appropriate; if a person is able to
get
to a page protected by robots.txt, shouldn't that person be able to
run
an agent to determine if that page has good data BEFORE the agent
shows it to the user?"

I believe that Page Watchers also fall under this category. I am
currently doing research into Page Watchers, and all of their issues (of
which there are many), and joined this list to understand what some of
the related issues are so that I can make responsible choices. The major
problem I see with saying "nonhuman" is that there are *some* programs
that will visit a site periodically (like a robot), and may even
retrieve the data, but are not retrieving many documents. Programs like
these incur very little traffic on a server, and will encourage people
to return to useful sites. Keeping them out of areas that humans have
access to doesn't make much sense if you want people to watch those
areas and return to them. If these areas are not for public consumption
(such as a web site that is not yet "up"), then they should be developed
in an access-controlled area using passwords or some other mechanism.
I think that trying to solve these types of issues with the Robot
Exclusion Standard doesn't really make sense. At least not to me.
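
For what it is worth, a page watcher can still honor robots.txt for the
areas that really are meant to be off-limits while it makes its periodic
checks. A minimal sketch in Python, using the standard library's
robots.txt parser; the user-agent name, the watched URL, and the polling
interval are placeholder assumptions on my part:

    # Sketch of a page watcher that consults robots.txt before each
    # periodic fetch. Names, URLs, and the interval are illustrative only.
    import time
    from urllib.request import urlopen
    from urllib.robotparser import RobotFileParser

    WATCHED_URL = "http://example.com/reports/status.html"
    USER_AGENT = "example-page-watcher/0.1"
    POLL_SECONDS = 6 * 60 * 60  # a few checks per day, not continuous

    robots = RobotFileParser("http://example.com/robots.txt")
    robots.read()  # fetch and parse the site's robots.txt once

    last_copy = None
    while True:
        if robots.can_fetch(USER_AGENT, WATCHED_URL):
            with urlopen(WATCHED_URL) as resp:
                body = resp.read()
            if last_copy is not None and body != last_copy:
                print("page changed:", WATCHED_URL)
            last_copy = body
        time.sleep(POLL_SECONDS)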

--Thaddeus O. Cooper (speaking only for myself)
Senior Staff
The MITRE Corporation
(tcooper@mitre.org)

Brian Clark wrote:
>
> -- [ From: Brian Clark * EMC.Ver #2.5.02 ] --
>
> Erik raises the point I hit all the time (although he comes to the opposite
> conclusion) ... a point that bears stressing again as the push moves on for
> the next standard.
>
> In short, robots.txt has become a poor bandaid for a big wound. Many people
> (from MetaCrawlers to ActiveAgents to whatever Microsoft has up their sleeve)
> will continue to see robots.txt as a protocol for indexing robots ...
> something that even the RFC does nothing to dispel. While the distinction
> between "browser" and "robot" was both obvious and academic, that is no
> longer the case.
>
> Perhaps it shouldn't even be robots.txt anymore ... maybe that sends the
> wrong message. Maybe nonhuman.txt ... or agent.txt. I certainly wouldn't
> want to see some other group fill the vacuum by coming up with an agents.txt
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html