Thanks in advance for any pointers,
--Alvaro
Computer Science and Engineering Department
University of California, San Diego
>
> Dear spider developers,
>
>
> My name is Alain Desilets. I am a researcher in the Interactive
> Information Group of the National Research Council of Canada.
>
> We are a small group (6 people) developing tools for interactive
> access to information. Our technological angle on this problem is
> AI-based approaches, in particular Machine Learning and Agents. You can
> find more about our work at http://ai.iit.nrc.ca/II_public/.
>
> In order to test our methods we need to acquire a large corpus of
> full HTML files from the Web. We plan to use a spider for that task.
>
> We are aware of the controversy surrounding the creation of new
> spiders and therefore do not plan to develop one. That
> would not only be a duplication of effort but would also introduce a
> new, possibly buggy spider in Koster's already vast list of Web
> critters. Instead, we would like to use a publicly available,
> well-behaved, and proven spider.
>
> Is there such a spider available for serious research purposes?
>
> Or maybe the corpus we need already exists? Is there a CD-ROM or .zip
> file that would give us the whole of the web in full HTML?
>
>
> Thanks for your help.
>
> Alain Desilets
>
> Institute for Information Technology
> National Research Council of Canada
> Building M-50
> Montreal Road
> Ottawa (Ont)
> K1A 0R6
>
> e-mail: alain@ai.iit.nrc.ca
> Tel: (613) 990-2813
> Fax: (613) 952-7151
>
>