>The specific example was that the
>Alta Vista web crawler didn't only index linked documents, but any and all
>documents that it could find at a site!
Did you also get the messages in which the author explained that
this isn't true?
>Is this true, and if so, how is it doing it? How does one keep documents
>private? I sure don't want my personal correspondence sitting out on
>someone's database just because my home directory happens to be readable!
I have a big problem with your phrase 'happens to be'.
There have been more discussions like this, in which people were quite happy
to make a bunch of documents available without restriction, except to indexers.
Their main idea was that it is common practice to keep documents 'out of
sight' without actually indicating access restrictions explicitly. I think
this is plainly wrong. On Unix, if you want to indicate who is allowed access
to your files, you use file permissions. If a certain file of mine is world
readable, the implication is that I, the author, intentionally allow the rest
of the world to read my file. (Here, 'the world' means any user with access
to the file system.) I have, occasionally, browsed other people's directories
and found stuff that wasn't intended for me to read; I always assumed a
mistake on their part, and decided not to read on, as a matter of courtesy.
But the mistake was theirs.
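
To make the permission point concrete, here is a minimal sketch in Python
(the file names are made up for illustration): on Unix, the 'other' read
bit is precisely the declaration that the rest of the world may read a file.

    import os
    import stat

    def is_world_readable(path):
        """True if the 'other' read bit (S_IROTH) is set on path."""
        return bool(os.stat(path).st_mode & stat.S_IROTH)

    # Hypothetical files: one meant to be public, one meant to be private.
    for name in ("public_notes.txt", "private_mail.txt"):
        try:
            status = "world readable" if is_world_readable(name) else "restricted"
            print(name + ": " + status)
        except FileNotFoundError:
            print(name + ": not found")
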
The same principle has always been assumed on the Internet, I guess.
If you serve files off a WWW server without access restrictions,
you intend to make them available to the rest of the world.
There is no way of knowing the purpose of the accesses you get for your
documents: it may be an individual user, a WWW indexer, or a secret program
operated by the FBI/Mossad/KGB/whoever to scan for suspect activities.
It's the access permissions that specify your intentions, not the existence
of explicit references to the files, or the set of users to whom you have
explicitly given your site's URLs, or anything else.
In my opinion, it's a mistake to accuse robots of malicious behaviour
when all they do is find files that have been made available to them.
robots.txt should be regarded as a service to robots, a way of saying:
don't bother to index this, the results won't justify the load it will
place on the network and on my system. To honour this is a matter of
courtesy. If you don't want robots to get access to your documents at
all, then set proper access restrictions on the documents themselves.
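
The 'service to robots' reading can be made concrete. Here is a minimal
sketch, assuming Python's standard urllib.robotparser module and a made-up
site and user agent, of how a courteous robot consults robots.txt before
fetching a document:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.org/robots.txt")    # hypothetical site
    rp.read()                                          # fetch and parse robots.txt

    url = "http://www.example.org/private/notes.html"  # hypothetical document
    if rp.can_fetch("ExampleCrawler", url):            # hypothetical user agent
        print("robots.txt does not object; fetch", url)
    else:
        print("robots.txt asks us to skip", url, "- honour it as a courtesy")

Note that nothing enforces this check: a robot that skips it still gets
whatever the server is willing to serve, which is exactly why real privacy
has to come from access restrictions, not from robots.txt.
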
The only problem I see is that 'the world' is not the same for everybody.
For example, suppose user A wants all of A's files to be readable by all
other users on the system. To user A, 'the world' is all users
on the system, so A makes the files world readable.
Now suppose that user B runs a WWW server, making all world-readable files
on the system available to the whole Internet. (User B will think twice before doing this
on purpose, but it may be a configuration error.) Suddenly, user A's files
have become available to the whole Internet community. Suppose that user C
(a WWW indexer) finds user A's files. It is unreasonable for user A to
blame C, when B is at fault. Obviously, there must be a way for A to correct
the problem, and get the files removed from C's index. This is possible in
most WWW indexers. But if A is indignant at the mere fact that C found the
files, s/he's barking up the wrong tree.
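
For completeness, a sketch of the local fix available to user A, in Python
and assuming a hypothetical home directory: revoke the 'other' read (and,
for directories, search) bits, so that neither B's server nor C's indexer
can reach the files any more.

    import os
    import stat

    def restrict_world_access(top):
        """Clear the 'other' read and execute/search bits under top."""
        for dirpath, dirnames, filenames in os.walk(top):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                mode = os.stat(path).st_mode
                os.chmod(path, mode & ~(stat.S_IROTH | stat.S_IXOTH))

    restrict_world_access(os.path.expanduser("~"))   # user A's (hypothetical) home
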
-- Reinier Post reinpost@win.tue.nl a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A>