Server Indexing -- Helping a Robot Out

Ian Graham (ianweb@smaug.java.utoronto.ca)
Wed, 4 Dec 1996 19:21:29 -0500 (EST)


A while back, I proposed adding a lot of stuff to robots.txt. After some
discussion, Martin rightly pointed out that what I wanted to do didn't
belong in robots.txt, since it did not address exclusion. Instead, I
wanted to address other server issues, such as indicating the
preferred domain name for the server.

This got me thinking of the more general question -- what does an
indexing tool want? The answer is, in part, a way of identifying
what has and has not changed on a server, so as to selectively
retrieve and index those resources that are new. Note that the
phrase 'what has changed' has a wide meaning in this context,
ranging from changed files or database content, to changes in the
server's domain name that are relevant to the exterior world.

The question is -- how to make this information available to
an indexing engine? I propose that the server export this
information via a special database gateway located at
a well-defined URL. For example,

http://serv.domain.nam/serverinfo

Serverinfo could be either a document listing the server content
and information about that content, or a queryable database that
returns specific information about the resources on the server.
The information returned would be equivalent to that produced by
the HTTP response headers, with some additions to allow for
descriptions of multiple resources, as well as of variable
resources (URLs that reference multiple variants of the same
object), deleted resources, and moved resources.
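
To make the idea concrete, here is a minimal sketch of how a robot might
use such a listing to decide what to re-index. The listing format is an
assumption made purely for illustration: one block of HTTP-style headers
per resource, separated by blank lines, with hypothetical "URI" and
"Status" fields. It is not the format described in the proposal, and the
sample paths and dates are made up.

```python
from datetime import datetime, timezone
from email.parser import Parser
from email.utils import parsedate_to_datetime

# Hypothetical serverinfo listing. The "URI" and "Status" field names and
# the blank-line record separator are assumptions made for this sketch.
SAMPLE = """\
URI: /index.html
Status: current
Last-Modified: Tue, 03 Dec 1996 14:00:00 GMT
Content-Type: text/html

URI: /old/report.html
Status: deleted

URI: /docs/guide.html
Status: moved
Location: /manual/guide.html

URI: /about.html
Status: current
Last-Modified: Fri, 01 Nov 1996 09:30:00 GMT
Content-Type: text/html
"""


def parse_serverinfo(text):
    """Split the listing into one header mapping per resource."""
    records = []
    for block in text.strip().split("\n\n"):
        msg = Parser().parsestr(block, headersonly=True)
        records.append(dict(msg.items()))
    return records


def plan_crawl(records, last_visit):
    """Return (URIs to fetch, URIs to drop, {old URI: new URI})."""
    fetch, drop, moved = [], [], {}
    for rec in records:
        status = rec.get("Status", "current")
        if status == "deleted":
            drop.append(rec["URI"])
        elif status == "moved":
            moved[rec["URI"]] = rec["Location"]
        else:
            modified = rec.get("Last-Modified")
            # With no date given, fetch anyway rather than risk a stale index.
            if modified is None or parsedate_to_datetime(modified) > last_visit:
                fetch.append(rec["URI"])
    return fetch, drop, moved


last_visit = datetime(1996, 11, 15, tzinfo=timezone.utc)
fetch, drop, moved = plan_crawl(parse_serverinfo(SAMPLE), last_visit)
```

With this sample, only the resource modified after the last visit is
scheduled for retrieval; the deleted and moved entries can be handled in
the index without fetching anything from the server.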

An overview of the proposed database structure I have developed
to date can be found at:

http://www.utoronto.ca/ian/docs/Indexing/server.html

I am also working on a tool for creating these data, as well as an
mSQL interface for storing and accessing a database of this type. I
would very much appreciate some feedback on these ideas.

Ian

--
Ian Graham ........................................ ian.graham@utoronto.ca
Information Commons                                      Tel: 416-978-4548
University of Toronto                                    Fax: 416-978-7705

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send
mail to robots-request@webcrawler.com with the word "unsubscribe" in the
body. For more info see
http://info.webcrawler.com/mak/projects/robots/robots.html