Re: robots.txt (A *little* off the subject)

Thaddeus O. Cooper (tcooper@mitre.org)
Tue, 26 Nov 1996 09:17:14 -0500


Erik Selberg wrote:
>
> "Thaddeus O. Cooper" <tcooper@mitre.org> writes:
>
> > 5. How do we make people use robots.txt?
> > Stuff deleted...
>
> My own take is that 5 is two-fold:
>
> 5a. How do we make sysadmins AND USERS use / admin robots.txt?
> 5b. How do we enforce robots / whatever to follow robots.txt?
>
> To solve 5a, we need to make a system that's easy for most sysadmins
> to administer, and hopefully make it easy for individual users to
> use. I personally think robots.txt is useless for a lot of systems,
> because the sysadmin doesn't control the content. Take our department
> for example. I may have some stuff I don't want robots to come look
> at, but it's a hassle to try and get that into a global robots.txt
> file. And random "sysadmin runs a find script which incorporates
> things" are solutions I think are somewhat simplistic.

I agree. Adding one more burden to an already busy system
administrator's life is not a good idea.

>
> Next is getting folks to follow it. Providing what robots want is one
> way to get them off your back. However, there may be other robots
> which want something you can't easily provide. For example, a page
> watcher may want notification. But do you notify 5 million people if
> say your ESPN sports page changes? Or your stock quote changes? Then
> there are others whom you may not want to deal with --- Rob mentioned
> an e-mail gatherer to create junk mailing lists. How do you get those
> off your back?

Ummm. A page watcher is what does the notification, not the site
(although there is a group working on a notification protocol). The
current way a page watcher determines whether something has changed is
by issuing a HEAD request to the Web server and reading the headers that
come back (but everyone here knows that -- at least I think they do :)).

What I am proposing is that we put some of that information into
robots.txt itself, so that you don't have to hit every page on the site
(or at least every page you are allowed to see). In this way we have a
one-stop-shopping arrangement that gives a large number of
indexers/watchers/etc. the information they need, and it puts the
burden of indexing on the site. That would tend to cut down on the
traffic of crawlers running around sites, and would let local
administrators and content providers decide the best way to index their
site and tell the rest of the world what they want. Theoretically, the
only robots still crawling would be the ones that can't get the
information they need from the file; and if the format is extensible,
then whenever there is sufficient demand for some new piece of data, it
can simply be added.

Judging from the data held by most of the search engines on the 'net,
if you could hand them the title of each document, its URL, and the
date it was last modified (even if that date were a few days out of
date), you would probably reduce traffic enormously, because that is a
large chunk of what they gather. Additionally, if there were meta-data
attached to a document, it could go into the robots file as well.
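
To make that concrete, here is one way such records *might* sit
alongside the existing exclusion rules. This is only an illustration --
the field names ("Document:", "Title:", "Last-Modified:") are made up
for the example, not a proposed syntax:

    # existing exclusion records
    User-agent: *
    Disallow: /private/

    # hypothetical per-document index records
    Document: /papers/norobots.html
    Title: A Standard for Robot Exclusion
    Last-Modified: Mon, 25 Nov 1996 18:12:04 GMT

    Document: /index.html
    Title: Departmental Home Page
    Last-Modified: Fri, 22 Nov 1996 09:30:00 GMT

A robot that trusts these records can tell what exists on the site and
whether it has changed without ever touching the documents themselves.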

I guess what I'm really saying is that it's ok to use robots.txt to tell
robots where not to go, or where it is ok to go, but we can also give
them the information that they most often request up front, and solve a
couple of problems in one swell foop. :)
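
For what it's worth, here is a rough sketch of what the two approaches
look like from the robot's side. It is written in Python purely for
illustration, and it assumes the made-up "Document:" / "Title:" /
"Last-Modified:" records from the example above:

    import http.client

    def head_last_modified(host, path):
        # Today's approach: one HEAD request per page, reading the
        # Last-Modified header out of each reply.
        conn = http.client.HTTPConnection(host)
        conn.request("HEAD", path)
        last_modified = conn.getresponse().getheader("Last-Modified")
        conn.close()
        return last_modified

    def site_index(host):
        # Proposed approach (sketch): one GET of /robots.txt, collecting
        # the hypothetical "Document:" records into a dictionary mapping
        # each URL to its title and last-modified date.
        conn = http.client.HTTPConnection(host)
        conn.request("GET", "/robots.txt")
        text = conn.getresponse().read().decode("latin-1")
        conn.close()
        index = {}
        current = None
        for line in text.splitlines():
            field, sep, value = line.partition(":")
            if not sep:
                continue
            field, value = field.strip().lower(), value.strip()
            if field == "document":
                current = index.setdefault(value, {})
            elif current is not None and field in ("title", "last-modified"):
                current[field] = value
        return index

The arithmetic is the whole point: the first function costs one request
per document watched, while the second costs one request per site, no
matter how many documents an indexer or watcher cares about.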

As I said before, just my two cents.

--Thaddeus O. Cooper (only speaking for myself)
Senior Staff
The MITRE Corporation
(tcooper@mitre.org)
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html