Canonical Names for documents (was Re: Server name in /robots.txt)

Michael De La Rue (mikedlr@indy.unipress.waw.pl)
Thu, 15 Feb 1996 11:40:41 +0100 (MET)


To go back to the original argument which started the META debate,
I would like to suggest tags like

<LINK REV="mirror-from" HREF="original-url">

as a way of building in the logic for canonical names.
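As a rough illustration of how a robot might read such a tag, here is a minimal sketch using Python's standard html.parser module. The class and function names (MirrorLinkParser, canonical_url) are made up for this example, and falling back to the fetched URL when no tag is present is an assumption rather than part of the proposal.

```python
from html.parser import HTMLParser


class MirrorLinkParser(HTMLParser):
    """Collect the HREFs of <LINK REV="mirror-from"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.originals = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag != "link":
            return
        attr = dict(attrs)
        rev = (attr.get("rev") or "").lower()
        if rev == "mirror-from" and attr.get("href"):
            self.originals.append(attr["href"])


def canonical_url(html_text, fetched_url):
    """Return the declared original URL, or the fetched URL if none is declared."""
    parser = MirrorLinkParser()
    parser.feed(html_text)
    # Assumption: the first mirror-from link wins if several are present.
    return parser.originals[0] if parser.originals else fetched_url
```

A page with no such tag is simply treated as its own original, so unmarked documents behave as they do today.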

Then, pressure could be put on people building mirroring packages (which
seem to be becoming more popular) to put these in. This will mess up
MD5 hashing schemes, but those wouldn't work in any case, as most mirror
systems will be changing BASENAME tags.

The idea would be that original authors would be encouraged to put this
in as much as possible (most won't, but probably those who have well
maintained pages will) and otherwise mirror robots MUST do it.

The search engines should then identify these as the same document and
provide one header followed by a list of mirror sites, possibly with last
modified dates.

<A HREF="orig-location">this-is-main-link</A>

<P>this is first 250 words from the text..

<UL class=mirror-menu>
<LH>alternates
<LI>mirror sites WITH VISIBLE URL to allow choice by user
</UL>
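On the search engine side, the grouping could be sketched roughly as follows. This assumes each indexed page has already been reduced to a (fetched URL, declared original or None, last-modified) triple; the function names and the plain-text output format are invented for the example.

```python
def group_by_original(pages):
    """Group mirrors under the URL of the document they were copied from.

    pages: iterable of (fetched_url, original_url_or_None, last_modified).
    Returns {original_url: [(mirror_url, last_modified), ...]}.
    """
    groups = {}
    for fetched_url, original_url, last_modified in pages:
        # A page that declares no original is treated as its own original.
        key = original_url or fetched_url
        mirrors = groups.setdefault(key, [])
        if fetched_url != key:
            mirrors.append((fetched_url, last_modified))
    return groups


def render_entry(original, mirrors):
    """One header line, then each mirror with its URL visible to the user."""
    lines = [original]
    for url, last_modified in mirrors:
        lines.append("  mirror: %s (last modified %s)" % (url, last_modified))
    return "\n".join(lines)
```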

The reason to favour this over automatic selection or something similar
is that it gives the user the choice of where to go (does he need the
original? is one of the mirror sites known to be reliable, or otherwise?).

LINK has the advantages that

it's an agreed standard tag that shouldn't have any browser
interactions

it stays away from the META info controversy

it could be used to implement a 'go to original' button in
browsers

On the subject of the META tag etc:-

I don't agree with the efficiency argument, because I think the
meta scheme can be implemented properly (just keep a meta-data cache
and re-parse if a document gets changed under you); people who don't
do that are their own problem. Even so, I'm largely convinced by the
argument that MK put in against the HTTP-EQUIV tag, provided that some
alternative methods are added to HTTP to lower bandwidth requirements.
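The meta-data cache mentioned above might look something like this minimal sketch. It assumes the server can cheaply obtain a modification time for each document; MetaCache and its method names are hypothetical, and the parse function is supplied by the caller.

```python
class MetaCache:
    """Cache parsed meta-data per document; re-parse only when it changes."""

    def __init__(self, parse):
        self._parse = parse      # function: document text -> meta-data
        self._entries = {}       # path -> (mtime, meta-data)

    def get(self, path, mtime, read_text):
        """Return meta-data for path, calling read_text() only on a cache miss."""
        cached = self._entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]
        # The document is new or was changed under us: parse it again.
        metadata = self._parse(read_text())
        self._entries[path] = (mtime, metadata)
        return metadata
```

The document is only read and parsed when its modification time differs from the cached one, which is where the efficiency comes from.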

Just saying that alternatives could exist isn't enough. I think that
the proposal is better than having no meta-data. I realise that
initially the meta-data is going to be largely ignored, but as people
start to maintain pages it will begin to be useful. Especially where
(like me) you're dealing with a large number of sites who are willing to
cooperate loosely, and many of whom will be quite willing to implement

Finally, getting the tops of documents:-

What's a good way of implementing MK's suggestion of only getting the
first couple of k of a document to allow extracting META info? I would
probably want to stop at the end of the header, but how do you know when
this has ended in a badly written document? Do you stop at the first
non-head element (<P>?), or is there something more robust?
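One possible answer, offered only as a sketch: feed the document to a parser a chunk at a time and stop at whichever comes first, the closing </HEAD>, the first element that does not belong in the head, or a size limit. The set of head elements, the 2048-character limit and the names HeadScanner and scan_head are all assumptions for the example; it also does not treat stray body text outside any tag as the end of the head, so it is not a complete answer for badly written documents.

```python
from html.parser import HTMLParser

# Assumed list of elements allowed before the body starts.
HEAD_ELEMENTS = {"html", "head", "title", "base", "isindex",
                 "link", "meta", "nextid", "script", "style"}


class HeadScanner(HTMLParser):
    """Collect META attributes until the head appears to be over."""

    def __init__(self):
        super().__init__()
        self.meta = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag not in HEAD_ELEMENTS:
            # First non-head element (e.g. <P>): assume the head has ended.
            self.done = True
        elif tag == "meta":
            self.meta.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "head":
            self.done = True


def scan_head(chunks, limit=2048):
    """Feed text chunks until the head ends or `limit` characters were seen."""
    scanner = HeadScanner()
    seen = 0
    for chunk in chunks:
        scanner.feed(chunk)
        seen += len(chunk)
        if scanner.done or seen >= limit:
            break
    return scanner.meta
```

Because the chunks are consumed lazily, a robot reading from a network connection could close it as soon as scan_head returns.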

<http://www.tardis.ed.ac.uk/~mikedlr/biography.html>
Scottish Climbing Archive: <http://www.tardis.ed.ac.uk/~mikedlr/climbing/>
Linux/Unix clone@ftp://src.doc.ic.ac.uk/packages/linux/sunsite.unc-mirror/docs/