Re: Notification protocol?

John D. Pritchard (jdp@cs.columbia.edu)
Tue, 12 Nov 1996 15:38:25 -0500


Nick said...

> For those who are wondering, a notification protocol would be a means
> whereby a search service could be notified that a Web resource (a page,
> typically) is new or changed, presumably prompting the service to re-index
> that resource. Notification could be built into or added onto Web servers
> so that as documents are published and changed, notices would be sent.

request for comments..

i am building a LINK/UNLINK protocol into my wwweb server that will be very
simple and serve to provide an important organization for links. i am
certain that it would be very useful for the net. its syntactic and
semantic simplicity ensure accurate implementation. required semantics are
very limited: an implementing system only has to support the LINK call and
cleanly dispose of any other calls.

[ this "proposal" has changed since it was last posted to "robots".
a fix to a critical flaw has been made. this flaw was a lack of <source
url> in LINK and UNLINK, which implied tabular storage overhead for the
anchor or source side of links. ]

most importantly, this simple, lightweight form doesn't require storage
overhead on robots, crawlers, etc..

the link terminology i use is "source" and "target". the source of a link
is the HTML anchor. the target is the resource pointed to from the anchor.

as derived from the robots.txt context, we're dealing with URL links that
are not to CGI or other client-input-specific content.

if the host serving the target document, ie, its wwweb server, maintains a
table of the LINKs made to that document, it can issue UNLINKs to delete or
revise others' information when the doc's location changes or the doc
disappears. so the cost is linear in the number of links, both in simple
network calls and in table size.

The table for a particular doc.html would store link source info, ie,
reverse links. the UNLINK call is made to the host at the source end of
the link and carries both the source and target URLs, so that host can
handle the request with no storage overhead. the LINK call is made to the
host named in the target URL when the link (the HTML anchor) is created.
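
a minimal sketch of the reverse link table described above, assuming an
in-memory dict keyed by target URL. the class and method names are made up
for illustration and are not part of the proposal; the optional third URL
follows the UNLINK form given below.

```python
from urllib.parse import urlsplit


class ReverseLinkTable:
    """Hypothetical table kept by the server that hosts the target docs."""

    def __init__(self):
        self._sources = {}  # target url -> set of source urls linking to it

    def record_link(self, source, target):
        # Called when a LINK <source url> <target url> arrives.
        self._sources.setdefault(target, set()).add(source)

    def unlinks_for(self, target, replacement=None):
        """Return (host, message) pairs to send when target moves or goes away."""
        calls = []
        # One UNLINK per recorded source, so the cost is linear in the links.
        for source in sorted(self._sources.pop(target, ())):
            message = f"UNLINK {source} {target}"
            if replacement:
                message += f" {replacement}"
            # The call goes to the host at the source end of the link.
            calls.append((urlsplit(source).hostname, message))
        return calls
```

the receiving host gets both URLs in the message, so it needs no table of
its own to act on the call.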

LINK <source url> <target url>

UNLINK <source url> <target url> [ <replacement target url> ]
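
the two forms above could be parsed along these lines. this is only a
sketch: it assumes whitespace-separated fields on a single line, which the
proposal does not spell out, and the Call type and parse_call name are
invented here. anything unrecognized comes back as None, which is one way
to get the "clean disposal" of other calls.

```python
from typing import NamedTuple, Optional


class Call(NamedTuple):
    verb: str
    source: str
    target: str
    replacement: Optional[str] = None  # only used by UNLINK


def parse_call(line):
    """Parse a LINK or UNLINK line; return None for anything else."""
    parts = line.split()
    if len(parts) == 3 and parts[0] in ("LINK", "UNLINK"):
        return Call(*parts)
    if len(parts) == 4 and parts[0] == "UNLINK":
        return Call(*parts)  # fourth field is the replacement target url
    return None  # unknown or malformed call: dropped without error
```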

LINK is accepted by wwweb servers. UNLINK is accepted by wwweb servers and
robots that maintain links to resources.

something like this is best done on UDP.
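
a sketch of what sending one call over UDP might look like, one call per
datagram. the proposal names no port number or encoding, so the port
argument and the UTF-8 encoding here are assumptions. UDP gives no delivery
guarantee, so a lost call is simply lost.

```python
import socket


def send_call(message, host, port):
    """Send one protocol call as a single UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # No connection setup and no reply is awaited.
        sock.sendto(message.encode("utf-8"), (host, port))
```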

a link may also be dropped from the source side, when a page is updated and
a previous link is deleted. for this situation we have

UNLINKR <source url> <target url>

which the source side wwweb server would use to avoid getting unnecessary
UNLINKs. this call does not imply tabular overhead, but it does require
knowing the text of the last and next versions of the source document. if
the last version includes a link that's not in the next version, the call
is made to the host in the dropped URL.

a LINKMOD call could notify robots that a page has been updated. this
would require that LINK be extended with an optional request for LINKMOD
calls.

LINK <source url> <target url> [ LINKMOD ]

LINKMOD <source url> <target url>

LINKMOD would be accepted by robots and crawlers in addition to UNLINK.
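
one possible reading of the LINKMOD extension, sketched from the server's
side: remember which sources asked for LINKMOD, and produce one call per
such source when the target page is modified. the registry class is
hypothetical, and sending LINKMOD to the host in the source URL is an
assumption carried over from how UNLINK is addressed.

```python
from urllib.parse import urlsplit


class LinkmodRegistry:
    """Hypothetical record of sources that requested LINKMOD calls."""

    def __init__(self):
        self._watchers = {}  # target url -> set of source urls

    def handle_link(self, line):
        """Register the source if the LINK call carries the LINKMOD flag."""
        parts = line.split()
        if len(parts) == 4 and parts[0] == "LINK" and parts[3] == "LINKMOD":
            self._watchers.setdefault(parts[2], set()).add(parts[1])
            return True
        return False  # plain LINK or something else: nothing to register

    def calls_for_modified(self, target):
        """Return (host, message) pairs to send after target is updated."""
        return [
            (urlsplit(source).hostname, f"LINKMOD {source} {target}")
            for source in sorted(self._watchers.get(target, ()))
        ]
```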

at least robot-level spamming would be segmented into the LINKMOD domain
until people used UNLINK <target> <target> or the variation based on
replicating pages, ie, UNLINK <target> <copy of target>.

my implementation will allow the wwweb server to replace links (prompted by
UNLINK <target> <repl>) subject to human review.

the wwweb server will make LINK calls when documents are deposited, if the
update results in a new link. if the update results in the source dropping
a link, then it makes an UNLINKR <source url> <target url> call.
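
the deposit-time comparison could be sketched as below: extract the anchor
targets from the last and next versions of a document and diff the two
sets. it uses Python's standard html.parser; the function names are made
up, and resolving relative hrefs against the source URL is an assumption
the proposal does not address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collects absolute href targets of <a> tags in one document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.targets.add(urljoin(self.base_url, value))


def link_targets(source_url, html_text):
    collector = _AnchorCollector(source_url)
    collector.feed(html_text)
    collector.close()
    return collector.targets


def calls_for_update(source_url, old_html, new_html):
    """Return the LINK and UNLINKR calls implied by replacing old with new."""
    old = link_targets(source_url, old_html)
    new = link_targets(source_url, new_html)
    calls = [f"LINK {source_url} {t}" for t in sorted(new - old)]
    calls += [f"UNLINKR {source_url} {t}" for t in sorted(old - new)]
    return calls
```

each resulting call would go to the host in its target URL. links present
in both versions produce no call at all.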

-john

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html