> I just got a nastygram from the web admin at xxx.lanl.gov accusing
> my "robot" of "attacking" him. This "attack" consisted of HEADs on
> 459 URLs, with a mean pause of about 2 minutes. The total data set
> was (all sites) 653k URLs, and yes, I probably should have filtered
> the test set to limit the number of accesses to any one site. Mea culpa.
welcome to the club...
When I was testing my robot, I had it randomly pick some URLs to see if it
would work on different kinds of servers. So xx.gov probably got into the
list, and when I returned the following morning I had 2 messages for each
requested URL (HEAD or GET), one to wiebe@ and one to webmaster@. It turned
out they also sent mail to root@ and to everybody logged in at the time
(which was at night, so one poor student got flamed...).
I think they reacted very aggressively, and they probably sent twice the
amount of data (email) that I was requesting (HTTP).
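For what it's worth, the per-site filtering you mention is only a few lines.
Here's a minimal sketch in Python (the cap, pause, and function names are my
own assumptions, not what your test actually did):

    import time
    import urllib.request
    from urllib.parse import urlparse
    from collections import Counter

    MAX_PER_HOST = 10      # assumed cap per site; pick whatever seems polite
    PAUSE_SECONDS = 120    # roughly the 2-minute mean pause from the post

    def head_status(url):
        # Issue a HEAD request and return the status code (or the error).
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.status
        except Exception as exc:
            return exc

    def run_test(urls):
        # Walk a fixed URL list once, skipping hosts already at the cap.
        hits = Counter()
        for url in urls:
            host = urlparse(url).netloc
            if hits[host] >= MAX_PER_HOST:
                continue    # don't hammer any one site
            hits[host] += 1
            print(url, head_status(url))
            time.sleep(PAUSE_SECONDS)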
> He accused my "robot" of violating the "robot guidelines". He didn't
> enumerate which I violated. I'm guessing he may have been upset that the
> test was ignoring his robots.txt, but since the test wasn't traversing
> the general web space and was in no danger of looping or getting lost,
> there wasn't much point. The test was operating from a fixed list of
> URLs that, once tested, would be discarded.
same here :-)
> He also informed me that "we have no need for you to 'index' our site",
> only to then rebuke me for running "a particularly stupid robot that only
> does pointless HEADs". I didn't point out that the two would naturally be
> mutually exclusive. :)
Why put your site online for millions if you don't want people to come by?
> Anyway, he naturally asked that I "cease and desist", which of course I'm
> happy to do.
hehehe.
In the years I've been online, I've learned a few things: stay away from
.GOV and .MIL...
> Comments? Should such accesses as mine also test robots.txt? Were my
> accesses "burdensome" at that rate? Are the "robot guidelines" no longer
> "guidelines" but "rules" and are these rules applicable to all forms of
> automated access, even if they aren't robots?
I think (now) you should always respect /robots.txt. I wasn't quite sure a
few weeks ago, but now I am :-)
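Checking /robots.txt first is cheap, too. A minimal sketch using Python's
standard urllib.robotparser (the user-agent name is a made-up placeholder):

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    AGENT = "test-robot"   # hypothetical user-agent string
    _parsers = {}          # cache one parser per site

    def allowed(url):
        # Fetch and parse a site's robots.txt once, then consult the cache.
        parts = urlparse(url)
        base = f"{parts.scheme}://{parts.netloc}"
        if base not in _parsers:
            rp = RobotFileParser(base + "/robots.txt")
            rp.read()      # downloads and parses the file
            _parsers[base] = rp
        return _parsers[base].can_fetch(AGENT, url)

Dropping any URL for which allowed() returns False would have avoided the
whole exchange.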
I sent a nice (...) email back to that .gov site, explaining how and why and
that it wouldn't happen again. Never heard from them again. Maybe I'll get
a flame back: Stop sending us mail!
Greets,
wiebe