Re: Looking for a spider

Alvaro Monge (amonge@cs.ucsd.edu)
Wed, 18 Oct 1995 13:13:55 -0700 (PDT)


A colleague and I are also doing AI-based research and need a large
corpus. We would like to use anything already available that preserves
the structure of the real WWW without stripping anything out, so that
we can run realistic experiments with our approaches.

Thanks in advance for any pointers,

--Alvaro
Computer science and engineering department
University of California, San Diego

>
> Dear spider developers.
>
>
> My name is Alain Desilets. I am a researcher in the Interactive
> Information Group of the National Research Council of Canada.
>
> We are a small group (6 people) developing tools for interactive
> access to information. Our technological angle on this problem is
> AI-based approaches, in particular Machine Learning and Agents. You can
> find more about our work at http://ai.iit.nrc.ca/II_public/.
>
> In order to test our methods we need to acquire a large corpus of
> full HTML files from the Web. We plan to use a spider for that task.
>
> We are aware of the controversy surrounding the creation of new
> spiders and therefore do not plan to develop one. That
> would not only be a duplication of effort but would also introduce a
> new, possibly buggy spider in Koster's already vast list of Web
> critters. Instead, we would like to use a publicly available,
> well-behaved and proven spider.
>
> Is there such a spider available for serious research purposes?
>
> Or maybe the corpus we need already exists? Is there a CD-ROM or .zip
> file that would give us the whole of the web in full HTML?
>
>
> Thanks for your help.
>
> Alain Desilets
>
> Institute for Information Technology
> National Research Council of Canada
> Building M-50
> Montreal Road
> Ottawa (Ont)
> K1A 0R6
>
> e-mail: alain@ai.iit.nrc.ca
> Tel: (613) 990-2813
> Fax: (613) 952-7151
>
>