Re: Looking for a spider

Alain Desilets (alain@ai.iit.nrc.ca)
Thu, 19 Oct 95 12:09:49 EDT


In response to Alvaro's message,

> >
> >> The Lycos exploration robot locates new and changed documents and
> >> builds abstracts, which consist of title, headings, subheadings,
> >> 100 most significant words and the first 20 lines of the document.
> >
> >For my research, this is not that useful. I need the entire document,
> >as it appears at the source -- not as saved by some robot, because I
> >want to follow the links within the document.

Reinier Post writes:
>
> Lycos follows the links of documents; that's how robots work.
> The summaries are built for indexing purposes. You can't save
> the full text of all documents because of the disk space requirements
> (perhaps OpenText can?) and because of legal considerations.
>

Like Alvaro, I find that no robot-generated index of the whole web is
sufficient for my purposes. My group is working on developing new tools that
can process the web and "summarise" it in some novel way. For example:

- New and hopefully better keyword extraction algorithms
- Automatic generation of hierarchical indexes a la Yahoo
- Merging of small indexes into bigger ones
- etc...

In order to test these new approaches, we need the full HTML, not an index of
it.
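
To illustrate why the full HTML matters, here is a minimal sketch of pulling the
outgoing links from a page's source, something a title-and-keywords abstract
cannot support. It uses only the Python standard library; the names
LinkExtractor and extract_links are hypothetical and chosen for illustration,
not taken from any of our tools or from Lycos.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL the page was fetched from
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html_text, base_url):
    """Return the links found in html_text, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Given the full source of a document, extract_links(source, url) returns the
URLs a tool would visit next. An abstract that keeps only selected words and
the first lines of the document would lose most of these anchors.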

- Alain