Re: who/what uses robots.txt

Terry O'Neill (toneill@mariner.com)
Thu, 21 Nov 1996 19:15:19 -0700


> > Surely you must realize the importance of definitions.
>
> Absolutely. The key is making sure that you don't let folks weasel out
> of supporting standards by having an overly strict or vague
> definition.
>
> > When we can't agree on what a "robot" is, how can we agree
> > on what it should do, how it should behave, etc?
>
> I suspect we can agree on forms of behavior. There may be gobs of
> robots / agents out there, but how many of them are doing
> fundamentally different things? You got your indexers, you got your
> proxies, you got your page watchers. A couple of others. But overall,
> most exhibit the same kind of behavior, which I think can be
> categorized (in an agreed upon standard!).
>
> > It's still unclear to me what the difference is, between
> > a person manually browsing pages, and a person manually
> > instructing their "agent" to browse those same pages.
> >
> > When the "agent" is based on a server, then (and in my
> > opinion, ONLY then), can one seriously believe that the
> > program can rapid-fire requests to another server.
> >
>
> How's this:
>
> New PowerBrowser 2000. Will pre-fetch in parallel all references one
> and two links removed from the page you're browsing while you read
> it. A must have for any corporate heavy browser.
>
> Run it from your corporate net with T1 or higher connectivity. Point
> it at Yahoo. See how quickly Filo and folks come hollering. It really
> doesn't take much to rapid-fire a site, and you don't have to be
> coming from big servers with big wires.

A definition of the word robot is superfluous
to this exercise. As a site administrator it
is my hope that any entity that enters my site,
be it human, robot, or something else,
behave in certain ways that will ensure that
other users are not deprived of the full enjoyment
of the site. That's the exclusive function of
the now badly-misnamed *robot*.* standard, and
I would suggest that a name with a lot less sex
appeal would be more appropriate - Terms Of Use
standard, Site Restrictions standard, or perhaps
Denial Of Service standard would describe this
function in a way that makes sense to everyone
and excludes no one.
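To be concrete, the exclusion form of the standard
as it exists today already reads like a terms-of-use
record. A sketch (the paths are invented for
illustration):

```
# robots.txt -- read by anything that honors the standard
User-agent: *          # applies to every client
Disallow: /cgi-bin/    # dynamic pages; each hit costs a process
Disallow: /private/    # not for external consumption
```

Nothing in that record cares whether the thing
reading it is a "robot" by anyone's definition,
which is rather the point.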

A stickier (but vastly different) problem to solve
is the inclusive use of robots.txt, which attempts
to specify the pages that are usefully included
in external indexed databases. I'm not so sure
anybody much cares about this use of the standard
any more, which is ok since the diversity of these
"users" is such that the idea that there can be a standard
specification that guides them gently to just the
stuff they need to see is, well, wrong. I'll take
the time to build my site in ways that help indexes
I know and care about, and pray that everybody else
follows the terms of use laid out to prevent my site
from being clobbered by anyone.

Regardless of where the current definition goes
and what it's called, it seems likely that some
narrowly defined set of terms of use for websites is
inevitable. The recent spate of SYN attacks points
out that Internet Standards as currently implemented
are not robust enough to withstand DOS attacks coming
from the open Internet whether accidental or deliberate;
it's a problem that will only get worse. As a result
I'm looking forward to the day when I can adjust my
Terms of Service (or whatever) and know that these
are translated not into voluntary compliance by
things robot, but into a filter in my stack/firewall/whatever
that flushes *anything* that does not comply with my rules.
Robots etc. will then be forced to read my robots.txt
(or whatever) to learn how to get into my site.
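To sketch what I mean (in Python, with a deliberately
minimal reading of the robots.txt record format -- no
firewall exposes such a hook today, so take it as an
illustration of the idea, not an implementation): the
filter reads the same rules the robots are supposed to
honor, then drops requests that violate them.

```python
# Sketch: server-side enforcement of robots.txt-style rules.
# Hypothetical -- illustrates dropping non-compliant requests
# at the server rather than trusting voluntary compliance.

def parse_rules(robots_txt):
    """Parse a minimal robots.txt into {user-agent: [disallowed prefixes]}.

    Handles only User-agent and Disallow records; comments (#) and
    blank lines are skipped.
    """
    rules = {}
    agent = None
    for line in robots_txt.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        field, _, value = line.partition(':')
        field, value = field.strip().lower(), value.strip()
        if field == 'user-agent':
            agent = value.lower()
            rules.setdefault(agent, [])
        elif field == 'disallow' and agent is not None and value:
            rules[agent].append(value)
    return rules

def allowed(rules, user_agent, path):
    """Return True if this user-agent may fetch this path."""
    ua = user_agent.lower()
    # Fall back to the wildcard record if no specific one matches.
    prefixes = rules.get(ua, rules.get('*', []))
    return not any(path.startswith(p) for p in prefixes)
```

A filter built on those two calls would reject the
request outright instead of hoping the client read
the file first.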

You can be sure that the code that enforces
those terms of service will ignore any plaintive
wails of "I NOT ROBOT". :-)

mariner

----------
> From: Erik Selberg <selberg@cs.washington.edu>
> To: HipCrime <HipCrime@HipCrime.com>
> Cc: Erik Selberg <selberg@cs.washington.edu>; robots@webcrawler.com
> Subject: Re: who/what uses robots.txt
> Date: Thursday, November 21, 1996 4:18 PM
>
> HipCrime <HipCrime@HipCrime.com> writes:
>
> > Hi Erik ...
> >
>
> -Erik
> --
> Erik Selberg
> "I get by with a little help selberg@cs.washington.edu
> from my friends." http://www.cs.washington.edu/homes/selberg

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html