forwarded e-mail

Paul Ginsparg 505-667-7353 (ginsparg@qfwfq.lanl.gov)
Thu, 11 Jul 1996 17:38:23 -0700


Martijn,

As it happens, I have been framed (the message from aol.com was not reforwarded
by me, though the message itself was not forged, and the review is legitimate).
Someone has just forwarded to me a large number of messages concerning
our site, which I've briefly perused. I write this message in haste, so
pardon the likely typos.
Perhaps someone wanted to point out that, in all the back-and-forth,
relatively few were even remotely concerned with the actual content of the
site, or with who exactly might be inconvenienced -- the simple fact that
many robot runners take zero interest in content has always been a prime
contributor to the problem. That sort of review we receive all of the time
from representatives of our real clientele -- if we are not popular on
robot lists, we can live with that (if we're lucky we'll be left alone,
though we did note some understanding of why we're forced into this corner
with respect to the misconfigured robots).

As you know, we've had problems with robots since early '94, and we were
running a web site before most of your mailing list had ever heard of the
web (recall the web started with physicists at CERN, and many of us have
been involved since '91); we greatly appreciated your assistance in
tracking down many of the early problems. We also asked Boutell
to add a robot section to the www faq which points to your guidelines.
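
For reference, the core exclusion mechanism in those guidelines is a plain
/robots.txt file at the server root listing areas robots must not touch. A
minimal sketch (the paths here are hypothetical, for illustration only, not
our actual configuration):

    User-agent: *          # applies to every robot
    Disallow: /cgi-bin/    # scripts that generate output on demand
    Disallow: /ps/         # postscript generated per-request

A well-behaved robot fetches this file first and never requests anything
under a Disallow'ed prefix.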

We have learned after long and bitter experience that there are two types
of robot runners:
a) those amenable to logic and reason, who give us no trouble from the
outset.
b) those not amenable to logic and reason. Ordinarily, potential problem
sites are rapidly spotted and they get a warning (our sad experience has
been that the first few hundred requests are just a prelude to much worse
problems).
When the warning was ignored, we would sit back in horror at what then
proceeded. By mid '94, we confirmed that it was our privilege to post
guidelines and enforce them. We had remarkably few problems from mid '94 to
mid '95 after posting the /RobotsBeware.html page, but then apparently
everyone and his/her brother decided to make a fortune on another wall
street ipo with another "indexing the internet" company.
Many of these robots are morally equivalent to the internet
worm written by morris in '88 -- they bounce from site to site oblivious to
the damage they may be doing. Wasting site admin time was written into
law as a federal offense: it is no different from the time wasted to repair
a broken window after someone has thrown a brick through it,
even if there is no fence.

Of course we start by sending 403 access-denied responses, but those are
simply ignored, and our errorlogs pile up ad infinitum (we also work with
the robots that abide by the guidelines; for example, we sent a large log
of bad requests back to altavista in the early days, which enabled them to
fix bugs in their relative url parsing -- it's not clear we should be
responsible for that kind of monitoring for them, but they were one of the
few in compliance from the outset so we tried to help them stay in line).
Note that we do not email-bomb. As a next resort, we respond to http
requests by e-mail, with a clear message that our http responses are being
ignored so that we have no choice but to respond via e-mail. The responses
cease as soon as the http requests cease. There is never a problem as long as
the robots are monitored properly. And in no case is there ever a problem
if all the guidelines we discussed when you were at Nexor are followed.
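
To make "monitored properly" concrete: a robot that ignores 403s is easy to
spot with a few lines of log analysis. Here is a minimal sketch in python,
assuming the common log format and a hypothetical access_log path
(illustration only, not our actual tooling):

    import re
    from collections import Counter

    # Common log format: host ident authuser [date] "request" status bytes
    LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3})')

    seen_403 = set()       # hosts that have already been served a 403
    persisted = Counter()  # requests a host made *after* its first 403

    with open("access_log") as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue
            host, status = m.groups()
            if host in seen_403:
                persisted[host] += 1   # still hammering after being denied
            if status == "403":
                seen_403.add(host)

    # The top offenders are the candidates for a warning, then harder measures.
    for host, n in persisted.most_common(10):
        print(host, n, "requests after first 403")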

For us this is not a game: we are interested only in wasting as little time
as possible on this nuisance problem.
If a site doesn't respect the guidelines, we just want them to stop access,
that's all. Sometimes it is remarkably difficult to achieve that simple end,
and we have logs to document the horror stories that have forced us
to take an increasingly hard line in order to serve our real users
adequately. It's remarkable how many times we hear that we were hit on the
one day their software happened to be buggy. Ridiculous, of course -- their
software is buggy all the time; we're just the first to detect it.

A few on your list suggested hiding the majority of the database behind a
POST request. Of course we considered that way back when, but recognized it
would inconvenience many tens of thousands of real users at a time when not
all browsers even supported forms; and note that there are now direct links
from all over the physics community to specific url's here -- why would we
want them all to break simultaneously and inconvenience tens of thousands
of people, just to escape at most a few dozen inconsiderate or ignorant
robot-runners? (And if we use password protection on everything, then we
run into the current netscape bug that can result in thousands of 401
unauthorized responses per hour without the poor [human] user knowing what
is causing it...)
Ditto for changing our hostname -- we've been using the hostname since before
the web started (Tim B.-L. when still in Geneva once asked me why we used
xxx rather than www, and I had to remind him of the chronology...). We
could do it of course (and we do use an alias to accommodate the sites
whose imperfect proxies auto-block access to url's containing the string
'xxx'), but again this would result in tens of thousands of broken links
all over the network.

A few were concerned about our bandwidth -- don't worry, we have multiple
t3 lines. Others suggested there must be something wrong with the way our
software is designed -- why don't we have efficient static pages, etc. etc.
We don't expect people to understand why we do things in a particular way,
or why our database is such that it would be impossible to do things in the
way that they, without any information, imagine might be more efficient
-- if you have no understanding of what it is we're doing, why make generic
suggestions? Trust us, and just stay away from the areas we have cordoned off.
It's to protect you as well as us -- you really don't want to download a few
gigabytes of gzipped postscript created on demand with variable-resolution
bitmapped or type1 PS fonts (ditto for pdf), just to discard it.
It wastes your time, it wastes our time, and it wastes precious global
bandwidth.
(As an exercise sometime, compute how many robots running simultaneously
on the web at its current size would strangle the internet; and then examine
how the answer scales as the web grows -- maybe some of you will be depressed
enough at the answer to abandon running redundant robots altogether...)
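
As a rough version of that exercise (every number below is an assumption
for illustration, not a measurement):

    # Back-of-envelope: aggregate load from redundant full-web robots.
    pages = 50e6        # assumed number of pages on the web
    page_bytes = 10e3   # assumed average page size: 10 KB
    days = 30           # assume each robot re-crawls everything monthly
    robots = 100        # redundant full-web robots running at once

    per_robot_bps = pages * page_bytes * 8 / (days * 86400)
    print("per robot: %.1f Mbit/s" % (per_robot_bps / 1e6))            # ~1.5
    print("all robots: %.1f Mbit/s" % (robots * per_robot_bps / 1e6))  # ~154

With these made-up numbers, a hundred redundant monthly crawlers already
consume the equivalent of several t3 lines of aggregate bandwidth, and the
total grows linearly both in the number of robots and in the size of the web.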

Hartill (robh@imdb.com) faces many of the same problems on the internet
movie database that he administers, and we have pooled information;
but not everything he reports concerning the behavior of the cluster here
is accurate.
We have discussed plans to implement a special "bad robot alert" mailing
list so that site admins around the world can organize and cut them off rapidly
(i.e. before they get hold of a few pages and then start pounding) -- so many
are upset at feeling so defenseless because a small number of misguided people
have decided that anything operating on port 80 on the internet is fair game.
This comes up frequently on the apache mailing list, and it is recognized that
a coordinated anti-robot effort could spare much grief for many of the
larger site admins.

So if anyone wants to visit our site and abide by well-posted guidelines,
that's fine.
If someone doesn't want to abide by the guidelines, that's also fine.
Just stay away.
Consider it to be a public building with a sign that says "Public welcome,
just be quiet." Someone who decides to make a lot of noise is removed, forcibly
if necessary. We've been using the internet for our professional communication
for 15 years, and will continue without apology.

We're now in the process of constructing a multiply redundant, globally
mirrored database, with many sites on all continents -- one of the urgent
problems we are discussing is how to protect it from misconfigured robots.
Some of the recent confusion that we see on the robots discussion list
convinces us to take an even harder line than we were planning.
Remember we also have to deal simultaneously with all the ones who've never
heard of the robot guidelines, and can't understand why it might be strange
to pound a site with as many as 30,000 requests a day at a rate of 15/second
(at that rate, 30,000 requests is over half an hour of continuous pounding).

Please don't send me any messages. I am very busy with many other things,
and have already spent too much time on this in the interest of communicating
to as many as possible (feel free to forward to your list).
In the event anyone is interested in the philosophy underlying the
intellectual aspects of what we are doing, see
http://xxx.lanl.gov/blurb/pg96unesco.html

pg