Re: Looking for a spider

Marilyn R Wulfekuhler (wulfekuh@cps.msu.edu)
Mon, 23 Oct 95 10:50:03 EDT


Alain Desilets writes:

> In order to test our methods we need to acquire a large corpus of
> full HTML files from the Web. We plan to use a spider for that task.
>
and Alvaro Monge writes:

> A colleague of mine and I are also doing research which is AI based
> and are in need of a large corpus for our use. We would like to use
> anything that is already available which keeps the structure of the
> real WWW and does not take anything away. This is in order to create
> realistic experiments of our approaches.
>

We are also doing research on AI-based approaches to processing the
web. To give ourselves a test bed, we have built a text-only copy of a
subset of the web (currently about 650 MB), which we have been calling
"the proving grounds". It is not possible to get a complete snapshot of
the web at any given time, but without images and audio we can at least
have a large, known subset. It is also to our collective advantage if
we all work from the same subset.

It is our intention to make the proving grounds available to the public,
hopefully within the next two weeks.

Our spider is a modified htmlgobble, which takes a URL and follows all
the links, copying every document it finds except image, audio, and
video files. The URLs inside the documents have been rewritten so that
everything points to the local copy, enabling a spider (or human
browser) to traverse the database locally.
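
A minimal sketch of the link-rewriting step described above, in Python.
This is not the actual htmlgobble modification. The names local_path,
rewrite_links, and the "/mirror" root are hypothetical, and the layout of
one directory per host is an assumption. It only handles double-quoted
HREF attributes and drops query strings and fragments.

```python
import re
from urllib.parse import urljoin, urlparse

# Matches href="..." in any letter case; only double-quoted values are handled.
HREF = re.compile(r'(href\s*=\s*")([^"]*)(")', re.IGNORECASE)


def local_path(url):
    """Map an absolute URL to a relative path inside the local copy.

    Assumed layout: <host>/<path>, with index.html for directory URLs.
    """
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return parts.netloc + path


def rewrite_links(html, base_url, mirror_root="/mirror"):
    """Point every http(s) link in `html` at the local copy."""

    def repl(match):
        # Resolve relative links against the URL the page was fetched from.
        absolute = urljoin(base_url, match.group(2))
        if urlparse(absolute).scheme not in ("http", "https"):
            return match.group(0)  # leave mailto:, ftp:, etc. untouched
        new_target = mirror_root + "/" + local_path(absolute)
        return match.group(1) + new_target + match.group(3)

    return HREF.sub(repl, html)


if __name__ == "__main__":
    page = '<A HREF="../b/page.html">next</A>'
    print(rewrite_links(page, "http://example.org/a/c/index.html"))
```

Resolving each link against the page's own URL before mapping it means
relative and absolute links to the same document end up at the same
local file.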

Before we go public, I have a few questions:

(1) We currently don't copy audio, video, or image files. Instead we
create a file of the same name containing a single character that
identifies it as video, image, or audio. Would an empty file suffice? Is
there another identification scheme that would be more useful?

(2) We currently copy PostScript files, but are considering treating
them as we do image files. They take a LOT of space and are of no use
for the kind of analysis that we want to do. Would it be more useful
to keep the PostScript, or to treat it as we do images (which would
then free the space for a larger web subset)?
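
For reference, the placeholder scheme in question (1) could be sketched
roughly as follows. The specific characters ("I", "A", "V"), the
extension table, and the function names are assumptions made for
illustration; the actual scheme may differ.

```python
from pathlib import Path

# Assumed mapping from file extension to a one-character media code.
MEDIA_CODES = {
    ".gif": "I", ".jpg": "I", ".jpeg": "I", ".xbm": "I",   # images
    ".au": "A", ".wav": "A", ".aiff": "A",                 # audio
    ".mpg": "V", ".mpeg": "V", ".mov": "V", ".avi": "V",   # video
}


def media_code(path):
    """Return the one-character code for a media file, or None otherwise."""
    return MEDIA_CODES.get(Path(path).suffix.lower())


def write_placeholder(mirror_root, relative_path):
    """Create a one-character stand-in for a media file in the local copy.

    Returns True if a placeholder was written, False if the path is not
    a recognized media file and should be copied normally.
    """
    code = media_code(relative_path)
    if code is None:
        return False
    target = Path(mirror_root) / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(code)
    return True
```

With a scheme like this, the file name is preserved so links still
resolve locally, and a reader of the local copy can tell the media type
from the file contents alone. An empty file would preserve the link
structure but lose that type information unless the extension is
trusted.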

I would appreciate any feedback, and I'll announce on the list when the
proving grounds are ready for public use.

Marilyn Wulfekuhler
Intelligent Systems Lab, Michigan State University