My name is Alain Desilets. I am a researcher in the Interactive
Information Group of the National Research Council of Canada.
We are a small group (6 people) developing tools for interactive
access to information. Our technological angle on this problem is
AI-based approaches, in particular Machine Learning and Agents. You can
find out more about our work at http://ai.iit.nrc.ca/II_public/.
To test our methods, we need to acquire a large corpus of
full HTML files from the Web. We plan to use a spider for that task.
We are aware of the controversy surrounding the creation of new
spiders and therefore do not plan to develop one. That
would not only be a duplication of effort but would also introduce a
new, possibly buggy spider in Koster's already vast list of Web
critters. Instead, we would like to use a publicly available,
well-behaved, and proven spider.
Is there such a spider available for serious research purposes?
Or maybe the corpus we need already exists? Is there a CD-ROM or .zip
file that would give us the whole of the Web in full HTML?
Thanks for your help.
Alain Desilets
Institute for Information Technology
National Research Council of Canada
Building M-50
Montreal Road
Ottawa (Ont)
K1A 0R6
e-mail: alain@ai.iit.nrc.ca
Tel: (613) 990-2813
Fax: (613) 952-7151