Quite frankly, I am surprised and extremely dismayed at this
comment.
Previously, I wrongly accused Alta-Vista of indexing pages that
I had no interest in having indexed. It turned out that rather
than poking each TCP port for an HTTP server, Alta-Vista
actually did what every other 'bot does: it followed all the
links it could find. I spent some tube-time sleuthing and
discovered that the pages were indeed referenced from other,
generally accessible pages.
I now believe my indignation at the possibility of this
port-poking behavior was based on two separate considerations:
1. that the poking of ports would impose an unwelcome
burden on my servers, and
2. that there are indeed pages I would not like to publish
broadly that are nonetheless available behind ports I don't
share with others.
Having put the first issue to rest, it is now this second idea
that attracts my attention.
Where did we get the idea that just because a thing is
accessible, we have the moral right to access it, perhaps
against the interests of its owner?
In another message, Reinier states his belief that if a user
makes the mistake of exposing his home directory to the web,
we (as robot owners) can index anything we find there with
impunity; that the error is on the part of the web-master and
not on the part of the robot's designer.
Let me see if I understand Reinier's point and can perhaps
state it another way: If I leave my house unlocked, I have
given my permission for any and all to come in and read my
personal papers. Does this strike anyone else as somewhat
absurd?
In our enthusiasm to become the cartographers of this new
region of the information universe, do we not run the risk of
violating the privacy of the indigenous peoples we find there?
I believe that this "-WE- are the most comprehensive index of
cyberspace" mentality is very dangerous and suggests a kind of
information vigilantism that I find personally distasteful.
Perhaps what is really needed is a reevaluation of the role of
the robots.txt file. If we take the stance, as I believe we
should, that the decision to be indexed belongs in the hands of
the owner of the data, not in the mechanical claws of wild
roving robots, then the robots.txt file should become a source
of permission to index, not of exclusion from indexing. Most
importantly, the default expectation should be one of privacy,
not exposure.
In other words, we should not index a web-site if there is no
robots.txt file to be retrieved that gives explicit permission
to do so.
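To make the proposal concrete, here is a minimal sketch of what such
an opt-in check might look like. It assumes a hypothetical robot name
("ourbot") and treats an explicit "Allow" record as the grant of
permission; neither assumption is part of the existing exclusion
standard, which treats robots.txt purely as a list of exclusions.

    # Sketch of an opt-in policy, not the current exclusion standard:
    # index a site only when a robots.txt can be retrieved AND it
    # grants explicit permission.  The robot name "ourbot" and the
    # use of an "Allow" record as the grant are illustrative
    # assumptions, not established convention.
    from urllib.request import urlopen

    def may_index(site, robot_name="ourbot"):
        try:
            with urlopen(site.rstrip("/") + "/robots.txt", timeout=10) as resp:
                text = resp.read().decode("utf-8", errors="replace")
        except OSError:
            # No robots.txt to retrieve: under the opt-in stance, stay out.
            return False
        applies_to_us = False
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()   # drop comments
            if not line:
                continue
            field, _, value = line.partition(":")
            field, value = field.strip().lower(), value.strip()
            if field == "user-agent":
                applies_to_us = value == "*" or value.lower() == robot_name
            elif field == "allow" and applies_to_us and value:
                return True    # explicit permission found
        return False           # silence means "do not index"

Note that an unreachable or missing robots.txt yields False here,
which is exactly the inversion I am arguing for: today's convention
reads absence as "crawl freely", whereas this reads it as "stay out".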
Do any others feel as I do that control over use of my
information is my responsibility and mine alone? That the
default should be not to index any site that has not explicitly
given permission to be indexed? (I don't expect much agreement
here, to be honest. But I thought I would ask.)
It should be noted that there is a fairly strong case to be
made that a robot threshing through a non-published web site is
an illegal activity under the abuse of computing facilities
statute in U.S. law.
</rr>