I'm not sure it is a good example, because, off the
top of my head, I don't think I'd try to find out if
everything in the cache was up-to-date at a given time.
But, anyway, let's assume the following:
1. You are doing a legitimate experiment.
2. It sends no more than one HEAD every
couple of minutes, and no more than 400
HEADs to any given server.
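In code, that rate limit might look something like the
sketch below. Everything here is illustrative: the
user-agent string is made up, and I'm assuming all the
URLs in the list live on the same server.

    import time
    import urllib.request

    USER_AGENT = "cache-freshness-probe/0.1"  # made-up name
    MIN_INTERVAL = 120    # one HEAD every couple of minutes
    MAX_PER_SERVER = 400  # cap from point 2 above

    def head_last_modified(url):
        # Issue a single HEAD and report the Last-Modified header.
        req = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.headers.get("Last-Modified")

    def probe(urls):
        # Assumes every URL in `urls` is on the same server.
        for count, url in enumerate(urls[:MAX_PER_SERVER]):
            if count:
                time.sleep(MIN_INTERVAL)  # pace the requests
            print(url, head_last_modified(url))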
Strictly speaking, I think you are ethically obligated,
within the limits of practicality (and detectability),
to get permission to do even this to somebody's site.
I would not consider somebody who sent a nasty-gram
to be out of line (though I might personally consider
him to be rather uptight and wouldn't go out of my
way to have a beer with him).
Practically speaking, to save myself time and trouble,
I would probably consider a robots.txt file that
allowed access to be tacit permission to run the
experiment *even though the administrator of that
site might in fact not want me to run the experiment
on their site* (maybe they put the robots.txt there
because they want to be indexed, but don't want to
be pinged otherwise). (By no means do I always do
what I strictly speaking consider myself ethically
obligated to do. To compensate for that, I use guilt.)
By the same token, I would probably consider a robots.txt
file that disallowed access to be tacit denial of
permission to run the experiment (even though the
administrator of that system might be perfectly happy
to assist me in my experiment, but just happens to
hate indexing robots).
I guess this is a much-too-long-winded way of saying
that I personally think that people who build
automatic-url-pinging-boxes of pretty much any kind
should take a broad view of the definition of robot
and honor the robots.txt file.
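Checking that file before each probe is cheap, too. A
rough sketch with the standard-library parser (the
user-agent is the same made-up one as above, and failing
closed when robots.txt can't be read is my choice, not
anything the standard requires):

    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    def allowed(url, user_agent="cache-freshness-probe/0.1"):
        # Ask the site's robots.txt whether this agent may fetch url.
        parts = urlsplit(url)
        rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            rp.read()
        except OSError:
            return False  # can't read robots.txt: skip the site
        return rp.can_fetch(user_agent, url)

Then probe() above would simply skip any URL for which
allowed() returns False.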
I'll shut up now.
PF