Re: Server Indexing -- Helping a Robot Out

Ian Graham (ianweb@smaug.java.utoronto.ca)
Thu, 5 Dec 1996 18:13:31 -0500 (EST)


Yes -- it was a surprise to see the siteinfo.txt letter come out
at the same time as my letter -- but it's nice to know I am not the
only one thinking about this.

The format I propose lies somewhere between siteinfo.txt and ALIWEB
(if I remember correctly, some guy named Koster had something
to do with the latter ;-) ). Siteinfo.txt proposes a simple
text list of files, alongside file sizes and last-modification dates --
a size of 0 implying a deleted file. ALIWEB, on the other hand, stores
information about resources in a much more sophisticated format, along
with additional organizational (author name, organization name, etc.)
and document-related (keywords, description, etc.) information not
usually available for a server-based resource. From my reading of the
FIND documentation, the natural successor to the ideas behind ALIWEB
seems to be the Common Indexing Protocol. This protocol would require
quite sophisticated processing of a server resource to generate the
proper "centroid" (or whatever) describing it. This approach looks
remarkably powerful, but seems aimed more at database servers sharing
information with other database servers than at helping an indexing
robot efficiently index a given HTTP server.

What I am proposing would be used by a web server to indicate, to
a remote agent, deleted resources or recent changes to the server.
This would help reduce the load robots place on a server (by telling
them what they can ignore) and help a robot index a site's content
more efficiently. The data returned to describe a given resource would
simply be the HTTP header fields ordinarily returned following a
request for that resource (with a few fiddles to allow for
content-negotiable resources) -- the returned data stream is then
simply a collection of HTTP-like header blocks. There would also be
some headers at the start of the file describing useful server
properties not covered by the standard HTTP response headers.
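To make this concrete, such a data stream might look something like
the following. This is purely an illustrative sketch -- the field
names (URI, Index-Generated, Status) are my own assumptions, not part
of any specification:

```
Server: Apache/1.1
Index-Generated: Thu, 05 Dec 1996 18:00:00 GMT

URI: /docs/intro.html
Content-Type: text/html
Content-Length: 4211
Last-Modified: Wed, 04 Dec 1996 12:00:00 GMT

URI: /docs/old.html
Status: 410 Gone
```

The first block carries server-wide properties; each later block
describes one resource, with a deleted resource flagged by a status
field rather than a zero size.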

Thus the approach is more complicated than siteinfo.txt, but less so
than ALIWEB. However, since it returns only HTTP-related information,
the process can be entirely automated, and requires no complicated
extraction of centroids, nor user input of descriptions or keywords.
Similarly, as the response is based on HTTP header information, the
returned format could be (I haven't thought this through completely,
but it seems obvious....) slaved to the HTTP protocol specification,
so that, as additional resource-specific fields are incorporated into
HTTP, they could easily be incorporated into a revision of this
proposal.
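As a rough illustration of the "entirely automated" claim, here is a
sketch (in Python, with field names that are my own assumptions) of
generating one index entry purely from information the server already
has on disk -- no human-supplied keywords or descriptions needed:

```python
# Hypothetical sketch: emit one HTTP-header-style index entry for a
# served file, using only filesystem metadata the server already has.
# The field names (URI, Content-Type, etc.) are illustrative assumptions.
import mimetypes
import os
from email.utils import formatdate

def index_entry(docroot, path):
    """Return an index entry for the resource at `path` under `docroot`."""
    full = os.path.join(docroot, path.lstrip("/"))
    st = os.stat(full)
    ctype = mimetypes.guess_type(full)[0] or "application/octet-stream"
    return "\n".join([
        "URI: " + path,
        "Content-Type: " + ctype,
        "Content-Length: " + str(st.st_size),
        # usegmt=True yields the RFC 1123 date form HTTP itself uses
        "Last-Modified: " + formatdate(st.st_mtime, usegmt=True),
    ])
```

A server-side script could walk the document tree calling something
like this for each file, concatenating the blocks into the index
stream described above.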

Ian

> At 7:21 PM 12/4/96, Ian Graham wrote:
>
> >The question is -- how to make this information available to
> >an indexing engine?
>
> just for completeness, you may want to check out InfoSeek's siteinfo.txt (?)
> mentioned in a recent posting. And for historical interest there is always
> ALIWEB :-)
>
>
> -- Martijn
>
> Email: m.koster@webcrawler.com
> WWW: http://info.webcrawler.com/mak/mak.html
>
>
> _________________________________________________
> This message was sent by the robots mailing list. To unsubscribe, send mail
> to robots-request@webcrawler.com with the word "unsubscribe" in the body.
> For more info see http://info.webcrawler.com/mak/projects/robots/robots.html
>
