Re: The Internet Archive robot

Marilyn R Wulfekuhler (wulfekuh@cps.msu.edu)
Mon, 9 Sep 1996 10:56:30 -0400 (EDT)


The announcement of the Internet Archive robot has raised some concerns
about copyright issues.

My experience may provide some basis for further discussion.

I am doing research on content-based analysis and retrieval, and have
used a modified htmlgobble robot to copy about 3 GB of web pages. The
links in the copies were altered so that they retain the structure of
the originals but point to our local copies instead of to the "real"
URLs. Since I am only interested in analyzing text, I also altered the
documents (in my local copy) so that image and audio references point
to a dummy image or dummy audio file; the link structure of the pages
is preserved, but I don't waste bandwidth and space copying the actual
media. What I have, then, is a local copy of a subset of the web,
modified as described above, which can be traversed by test robots (or
even browsed with a traditional browser) without bothering anyone
except our server.
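
For anyone who wants to picture the kind of rewriting described above,
here is a rough sketch in Python. It is not the modified htmlgobble
code, only an illustration under some assumptions: the names
LOCAL_ROOT, DUMMY_IMAGE, local_path and rewrite_links are hypothetical,
attribute values are assumed to be double-quoted, every src attribute
is treated as media and pointed at a single placeholder file, and
query strings and fragments are dropped when mapping a URL to a local
path.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical locations, chosen only for illustration.
LOCAL_ROOT = "/mirror"
DUMMY_IMAGE = "/mirror/dummy.gif"

# Matches href="..." and src="..." (double-quoted values only).
ATTR_RE = re.compile(r'\b(href|src)\s*=\s*"([^"]*)"', re.IGNORECASE)


def local_path(absolute_url):
    """Map an absolute URL to a path under the local mirror root."""
    parts = urlparse(absolute_url)
    path = parts.path or "/"
    return f"{LOCAL_ROOT}/{parts.netloc}{path}"


def rewrite_links(html, page_url):
    """Point hyperlinks at local copies and media at a dummy file."""

    def replace(match):
        attr, target = match.group(1), match.group(2)
        if attr.lower() == "src":
            # Assumption: any src is media we do not want to copy.
            return f'{attr}="{DUMMY_IMAGE}"'
        # Resolve relative links against the page they appeared on.
        absolute = urljoin(page_url, target)
        if urlparse(absolute).scheme not in ("http", "https"):
            # Leave mailto:, ftp:, and similar links untouched.
            return match.group(0)
        return f'{attr}="{local_path(absolute)}"'

    return ATTR_RE.sub(replace, html)


if __name__ == "__main__":
    page = '<a href="other.html">next</a> <img src="pic.gif">'
    print(rewrite_links(page, "http://example.com/dir/index.html"))
```

The point of resolving each link against the page's own URL first is
that relative and absolute links to the same document end up at the
same local path, which is what keeps the link structure intact.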

Some of you may remember that quite a while ago I offered to make this
"proving grounds" available to the public so that other researchers
could use our server for testing. So what happened to it?

Well, someone objected to our having a copy of his page, which is not a
problem (I deleted it as soon as I got his email), but he was
particularly nasty and threatening, even after we explained our use and
intentions. It was clear that he never read the "cover page" of the
proving grounds which explained what we were doing. BTW, he also had no
robots.txt, and wasn't interested in hearing about it. I'm still not
sure exactly what the basis of his objection was (perhaps it was that
his copyright notice was embedded in a bitmap image, and so wasn't
copied onto our copy, but I think he just objected to our having a copy
of his page, image or not). Anyway, this whole experience showed how
sensitive some people are about copyright, and after several unpleasant
exchanges with this individual, our solution was simply to close the
proving grounds to the public and keep it to ourselves :-(

I apologize to those of you on this list who expressed interest in
using the proving grounds, and never heard what happened. We have
thought about some possible solutions, such as a password which you
have to get via email, or some similar scheme to restrict access, but
frankly that sort of thing takes my time away from real research, and
some people may even object to a "limited public" use (though it would
be harder for them to find out about it).

I predict that someone will object to the Internet Archive robot's
activities, no matter how hard you try to appease them (with clear
instructions, disclaimers, perfectly faithful copies with full credit
given, or whatever), and not only will they object, they will also
threaten you with litigation.

Cynically yours,

Marilyn Wulfekuhler
Intelligent Systems Laboratory
Michigan State University