Re: Info on large scale spidering?

Martin Hamilton (martin@mrrl.lut.ac.uk)
Fri, 24 Jan 1997 16:50:06 +0000


Otis Gospodnetic writes:

| sorry if I'm beating a dead horse, but here are a few more thoughts...

ditto! But here goes anyway... :-)

| A better way would be if servers would either notify search engines about
| pages that have been added or modified, or deleted, or if they (servers)
| would keep info about those pages in some standard place, so that a robot
| can visit a server once (the first time), get URLs of all pages that have
| been added/deleted/modified, and index them.

Just to follow this line of investigation a little:

We've had the ALIWEB/siteindex approach, where indexing info is made
available in a well-known location. We've also had the Harvest
Gatherer approach, where indexes can be incrementally transferred,
optionally with compression, and perhaps using a special protocol.
There have been a few other variations on the theme - like RDM, for
instance.
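
For concreteness, the ALIWEB/siteindex idea boils down to publishing a
flat file of IAFA-style records at a well-known URL (conventionally
/site.idx). A minimal generator might look something like the sketch
below - the field names are from memory and the document root and base
URL are just placeholders, so treat it as illustrative rather than a
reference implementation:

#!/usr/bin/env python
# Rough sketch: emit IAFA-style DOCUMENT records of the kind ALIWEB
# reads from /site.idx. Field names are from memory; the paths and
# URLs below are placeholders, not a real site.
import os

DOC_ROOT = "/var/www/htdocs"         # assumed document root
BASE_URL = "http://www.example.org"  # assumed server URL

def records(root, base_url):
    """Yield one record per HTML file found under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            yield "\n".join([
                "Template-Type: DOCUMENT",
                "Title: %s" % name,   # a real generator would pull <title>
                "URI: %s/%s" % (base_url, rel),
                "Description: (hand-written summary could go here)",
                "Keywords: ",
            ])

if __name__ == "__main__":
    # Records are separated by blank lines, one file for the whole site.
    print("\n\n".join(records(DOC_ROOT, BASE_URL)))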

I'd like to know what the major robot authors would prefer WWW server
admins to make available to them, should those admins decide not to
let their servers be trawled in the present manner. I'm thinking
primarily of the sort of automated (local) index generation that it
would be feasible to implement as, say, an Apache module - though
there's no reason why hand-crafted index info shouldn't feature in
addition to, or instead of, the automatically generated stuff?
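
And on the added/deleted/modified front that Otis raised, the server
side of such a module could be as simple as dumping a "last-modified,
URL" pair per document to a well-known location on each run, with the
robot keeping its previous copy and diffing the two. A rough sketch,
purely illustrative (the /lastmod.txt location and tab-separated
format below are made up, not any agreed standard):

#!/usr/bin/env python
# Sketch only: server side generates a URL -> last-modified map; robot
# side diffs two such maps to find added, modified and deleted pages.
import os

DOC_ROOT = "/var/www/htdocs"         # assumed Apache document root
BASE_URL = "http://www.example.org"  # assumed server URL

def generate(root, base_url):
    """Server side: map each HTML file's URL to its last-modified time."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".html"):
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, root).replace(os.sep, "/")
                index["%s/%s" % (base_url, rel)] = int(os.path.getmtime(path))
    return index

def diff(old, new):
    """Robot side: work out what changed since the last visit."""
    added    = [u for u in new if u not in old]
    deleted  = [u for u in old if u not in new]
    modified = [u for u in new if u in old and new[u] != old[u]]
    return added, modified, deleted

if __name__ == "__main__":
    # The server would write this out to e.g. /lastmod.txt on each run;
    # the robot fetches it, diffs against its saved copy, and only
    # re-requests the URLs that actually changed.
    for url, mtime in sorted(generate(DOC_ROOT, BASE_URL).items()):
        print("%d\t%s" % (mtime, url))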

Comments?

Cheerio,

Martin

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html