This got me thinking of the more general question -- what does an
indexing tool want? The answer is, in part, a way of identifying
what has and has not changed on a server, so as to selectively
retrieve and index those resources that are new or modified. Note
that the phrase 'what has changed' has a wide meaning in this
context, ranging from changed files or database content to changes
in the server's domain name that are relevant to the outside world.
The question is -- how to make this information available to
an indexing engine? I propose that the server export this
information via a special database gateway located at
a well-defined URL. For example,
http://serv.domain.nam/serverinfo
The serverinfo resource could be either a document listing the
server content and information about that content, or a queryable
database that returns specific information about the resources on
the server. The information returned would be equivalent to that
produced by the HTTP response headers, with some additions to allow
for descriptions of multiple resources, as well as of variant (URLs
that reference multiple variants of the same object), deleted, and
moved resources.
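To make this concrete, here is a rough sketch of what one
record-per-resource listing might look like. The field names are
purely illustrative, borrowed from the corresponding HTTP response
headers; they are not a proposed syntax:

    URL: http://serv.domain.nam/docs/report.html
    Last-Modified: Tue, 14 Nov 1995 08:12:31 GMT
    Content-Type: text/html
    Content-Length: 4321

    URL: http://serv.domain.nam/docs/logo
    Variants: http://serv.domain.nam/docs/logo.gif,
              http://serv.domain.nam/docs/logo.jpg

    URL: http://serv.domain.nam/docs/old-page.html
    Status: moved
    Location: http://serv.domain.nam/docs/new-page.html

    URL: http://serv.domain.nam/docs/obsolete.html
    Status: deleted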
An overview of the proposed database structure I have developed
to date can be found at:
http://www.utoronto.ca/ian/docs/Indexing/server.html
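As a sketch of how an indexing client might consume such a listing,
the fragment below retrieves the document and picks out the records
that have changed since the indexer's last visit. It assumes the
illustrative blank-line-separated 'Field: value' format shown above;
the URL and field names are hypothetical, not part of any actual
format:

    import urllib.request
    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    SERVERINFO_URL = "http://serv.domain.nam/serverinfo"  # hypothetical

    def fetch_records(url):
        # Retrieve the serverinfo document and split it into records:
        # one dictionary of lower-cased field names per resource.
        with urllib.request.urlopen(url) as response:
            text = response.read().decode("latin-1")
        records = []
        for block in text.split("\n\n"):
            fields = {}
            for line in block.splitlines():
                name, sep, value = line.partition(":")
                if sep:
                    fields[name.strip().lower()] = value.strip()
            if fields:
                records.append(fields)
        return records

    def changed_since(record, last_visit):
        # A resource with no Last-Modified field is fetched to be safe.
        stamp = record.get("last-modified")
        if stamp is None:
            return True
        return parsedate_to_datetime(stamp) > last_visit

    last_visit = datetime(1995, 11, 1, tzinfo=timezone.utc)
    for rec in fetch_records(SERVERINFO_URL):
        if rec.get("status") in ("deleted", "moved"):
            print("drop or redirect index entry for", rec["url"])
        elif changed_since(rec, last_visit):
            print("re-fetch and index", rec["url"])

The point of the sketch is that the indexer never has to crawl the
whole server to discover changes: one retrieval of the serverinfo
resource tells it which URLs to fetch, drop, or redirect.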
I am also working on a tool for creating these data, as well as an
mSQL interface for storing and accessing a database of this type. I
would greatly appreciate feedback on these ideas.
Ian
--
Ian Graham ........................................ ian.graham@utoronto.ca
Information Commons                                     Tel: 416-978-4548
University of Toronto                                   Fax: 416-978-7705