Re: robots.txt, authors of robots, webmasters ....

Nick Arnett (narnett@Verity.COM)
Thu, 18 Jan 1996 17:12:29 -0800


>Where did we get the idea that just because a thing is
>accessible, that that gives us the moral right to access it,
>perhaps against the interests of its owner?

There's a difference between making something accessible with the intention
of sharing it, as is the case when putting it on the Web without security,
and allowing it to be accessible without the intention of sharing it. The
moral argument is less clear when you dig a bit deeper into the publisher's
intentions, which may not include support for automated access that would
consume an untoward amount of resources.

>In our enthusiasm to become the cartographers of this new
>region of the information universe, do we not run the risk of
>violating the privacy of the indigenous peoples we find there?

The privacy argument is a difficult one to reconcile with the other
watchword of the Internet, freedom. We could talk at great length about
that, but a robots list isn't the place, I think.

>In other words, we should not index a web-site if there is no
>robots.txt file to be retrieved that gives explicit permission
>to do so.

We thought about and discussed this approach at some length when we got
close to
releasing the 1.0 version of our spider. Our pre-release version had
basically no restrictions on it except that it wouldn't follow links from
one server to another; it was designed to index just one site at a time.
We even considered a scheme in which we'd look for robots.txt, and if it
wasn't present, generate an e-mail to the webmaster, suggesting that one
should be in place, with pointers to references. After X days, if we still
didn't find a robots.txt, we'd consider silence to be consent to index
anything the robot finds.
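
For concreteness, here is a rough sketch of that abandoned scheme in
modern Python, purely illustrative: the grace period stands in for the
"X days" above, the mail helper is hypothetical, and actually parsing the
robots.txt rules is omitted.

import time
import urllib.error
import urllib.request

GRACE_PERIOD_DAYS = 30        # placeholder for the "X days" above
first_nagged = {}             # host -> when we first mailed the webmaster

def has_robots_txt(host):
    """True if the server answers for /robots.txt at all."""
    try:
        urllib.request.urlopen("http://%s/robots.txt" % host, timeout=10)
        return True
    except urllib.error.HTTPError:
        return False          # e.g. 404: no robots.txt in place
    except urllib.error.URLError:
        return False          # unreachable; treat the same as absent

def may_index(host, send_nag_mail):
    """Decide whether the robot may index `host` under the opt-in scheme."""
    if has_robots_txt(host):
        return True           # a robots.txt exists; defer to its rules
    if host not in first_nagged:
        send_nag_mail(host)   # suggest one be put in place, with pointers
        first_nagged[host] = time.time()
        return False
    days_silent = (time.time() - first_nagged[host]) / 86400
    return days_silent >= GRACE_PERIOD_DAYS   # silence taken as consent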

However, clearer heads prevailed, I think, and we left things as they were.
The fundamental reason that we scrapped the idea was that it was just too
complex: too many things could go wrong, and it added a lot of
administrative overhead.

Let's remember that the marketplace usually solves these problems
eventually. Robot defenses can and will be built. In fact, we discovered
early on that inetd is a pretty good defense, since it limits the number
of connections. Our first design of the robot was based on the typical
limits of inetd.
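
To illustrate what designing around such limits looks like from the
robot's side, a small sketch (the per-host ceiling is an assumed figure,
not inetd's actual default):

import threading
import urllib.request
from urllib.parse import urlparse

MAX_CONNECTIONS_PER_HOST = 2   # assumed self-imposed ceiling per server
_slots = {}                    # host -> semaphore capping open connections
_lock = threading.Lock()

def _slot_for(host):
    with _lock:
        if host not in _slots:
            _slots[host] = threading.Semaphore(MAX_CONNECTIONS_PER_HOST)
        return _slots[host]

def polite_fetch(url):
    """Fetch a URL without ever exceeding the per-host connection cap."""
    host = urlparse(url).netloc
    with _slot_for(host):                 # blocks if the cap is reached
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()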

I suspect that robot designers' time would be better spent on reaching
consensus on distributed systems that will make the whole wretched mess
more efficient by combining pull and push methods of building indexes.
There is going to be a marketplace for the meta-information that robots are
generating. The sooner that robot developers agree on standards along the
lines of Harvest (but simpler, perhaps), the sooner that trade in
meta-information can begin to mature... and the less likely it is that one
big player will set the standards by sheer size. For example, what if
Microsoft announced a robot standard tomorrow...?
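
A sketch of what that pull-plus-push combination might look like. The
summary file name and its one-line-per-document format are invented here
for illustration; Harvest's actual SOIF records are richer.

import urllib.error
import urllib.request

SUMMARY_PATH = "/site-summary.txt"   # hypothetical; not a real standard

def build_index(host, crawl_fallback):
    """Prefer a summary the publisher pushes; otherwise pull page by page."""
    try:
        with urllib.request.urlopen("http://%s%s" % (host, SUMMARY_PATH),
                                    timeout=10) as resp:
            lines = resp.read().decode("utf-8", "replace").splitlines()
    except urllib.error.URLError:
        return crawl_fallback(host)      # no pushed summary: crawl as usual
    # One request yields the site's whole set of meta-information;
    # each line is assumed to be "URL<TAB>title".
    return {url: title
            for url, _, title in (line.partition("\t") for line in lines)
            if url}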

Nick