Re: Single tar (Re: Inktomi & large scale spidering)

Simon Wilkinson (sxw@tardis.ed.ac.uk)
Tue, 11 Feb 1997 23:37:57 GMT


> Erik Selberg writes:
>
> | I think the only reasonable solution in a lot of ways is to make a
> | spider which is attached to a server. This spider would then create
> | the tarballs (and you could have one big file, as well as a week's
> | worth of changes in another for incremental updating). The spider
> | could also do other useful things, like make sure you got all your
> | scripting done right, you didn't forget any links, etc. etc. etc.
>
> I think there are two sides to this - one is how you choose to build
> your local index data for exporting to the outside world, and the
> other is the format(s) and protocol(s) you use to do the exporting.
>
> What I'm really interested in is whether robot authors can be
> persuaded to pick up index data in a small number (ideally one?) of
> common formats and via a small number of common protocols (one?!).
> For example: SOIF and the Harvest Gatherer protocol, or RDM over HTTP.

The other issue is how to notify remote spiders that the
information (in whatever format) is available for collection. Is there
a need for something like a "gatherer.txt" file in the web server's root
directory, containing details of the indexing data that is available,
the format it is in, and where to get it from?

For instance, to notify a web robot that a SOIF stream was available
from a Harvest Gatherer, you could use a file containing something along
the lines of:
SOIF Harvest www.tardis.ed.ac.uk:8501
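
To make the idea concrete, here is a minimal sketch of how a robot
might fetch and parse such a file. The "gatherer.txt" name and the
whitespace-separated field layout (format, protocol, host:port) are
just the assumptions from the example above, not any agreed standard:

    # Hypothetical reader for a /gatherer.txt discovery file.
    # Assumes one entry per line: <format> <protocol> <host:port>
    import urllib.request

    def read_gatherer_txt(site):
        """Return a list of (format, protocol, source) tuples for a site."""
        entries = []
        url = "http://%s/gatherer.txt" % site
        with urllib.request.urlopen(url) as resp:
            for line in resp.read().decode("ascii", "replace").splitlines():
                fields = line.split()
                if len(fields) == 3:  # ignore blank or malformed lines
                    entries.append(tuple(fields))
        return entries

    # read_gatherer_txt("www.tardis.ed.ac.uk") would then yield
    # [("SOIF", "Harvest", "www.tardis.ed.ac.uk:8501")]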

Another field you might want to supply is the level of detail of the
indexing information, so that a web crawler could decide whether it wants
the most detailed information the site makes available, or just a brief
overview. Sites could then provide these differing levels themselves,
whilst still keeping complete control over exactly what appears in each
index.
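
For example (purely illustrative again; the fourth field and the second
port number are made up for the sake of the sketch), a site offering two
levels of detail might list:

SOIF Harvest www.tardis.ed.ac.uk:8501 full
SOIF Harvest www.tardis.ed.ac.uk:8502 brief

A robot short on bandwidth could then collect only the "brief" stream,
while a full-text indexer picks up the detailed one.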

The question is whether anyone would actually make any use of this if
it were available...

Comments?

Simon

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html