Yes, well, a careful reading of the various specs makes it clear that HTML
does *not* currently provide a mechanism for specifying the
"canonical name" of the current page. Having noticed this [several tens
of thousands of times] during the construction of the Open Text Index,
I tried rattling cages over in the HTML Working Group, and discovered a
complete lack of consensus; some people feel that this is an appropriate
use of <BASE>, as did I; others, including people who *really* know HTML,
think <META HTTP-EQUIV="URI" CONTENT="http://xxx.xxx.xxx/xxx.html"> is more
appropriate. I tried to get them to make up their minds, but couldn't
generate sufficient interest. I don't care [nor would any other robot
flogger, I think] which mechanism is used, as long as one is available.
This is a sub-issue of the larger [and unfortunately largely ignored]
issue of WWW metadata.

The canonical-name issue could be settled, I suppose, if we and Infoseek and
Lycos and Excite got together and said "do it THIS way". I think
<META> is probably the way to go, since <BASE> is overloaded for other
functions.
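
For concreteness, here is a rough sketch of how a robot might pick up a declared canonical name under either proposal. Nothing here is standardized: the preference order [<META HTTP-EQUIV="URI"> first, then <BASE>, then the URL actually fetched] is an assumption made for illustration, and the names CanonicalNameParser and canonical_name are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalNameParser(HTMLParser):
    """Collects candidate canonical URLs from <META HTTP-EQUIV="URI"> and <BASE>."""

    def __init__(self):
        super().__init__()
        self.meta_uri = None   # CONTENT of the first <META HTTP-EQUIV="URI">
        self.base_href = None  # HREF of the first <BASE>

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us lowercased tag and attribute names.
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "uri":
            if self.meta_uri is None:
                self.meta_uri = attrs.get("content")
        elif tag == "base" and self.base_href is None:
            self.base_href = attrs.get("href")


def canonical_name(html_text, fetched_url):
    """Return the declared canonical URL, or the fetched URL if none is declared.

    Assumed preference order (not from any spec): META, then BASE, then fetched_url.
    """
    parser = CanonicalNameParser()
    parser.feed(html_text)
    return parser.meta_uri or parser.base_href or fetched_url
```

A robot that reached a page through an alias would then index it under canonical_name(page_text, alias_url) rather than under the alias itself.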

Anyone who's written a serious robot knows that the aliasing available
in the IP/DNS mechanisms and the Unix filesystems, plus the habit of
mirroring good stuff, makes automatic duplicate detection [at the least]
very difficult. Checksums help, but not with volatile pages. Making
things more difficult is the fact that there are lots of people who
*like* having their pages show up multiple times.
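
To illustrate the checksum point, a minimal sketch of checksum-based duplicate grouping follows. The choice of SHA-256, the whitespace normalization, and the function names are all assumptions for the example, not a description of what any particular robot does.

```python
import hashlib


def page_checksum(body: bytes) -> str:
    """Digest of the page body with runs of whitespace collapsed."""
    # Collapsing whitespace is an assumed normalization so that trivial
    # reformatting by a mirror does not change the checksum.
    normalized = b" ".join(body.split())
    return hashlib.sha256(normalized).hexdigest()


def group_duplicates(pages):
    """Group URLs whose bodies share a checksum.

    pages maps URL -> fetched body (bytes). Returns a list of sorted URL
    lists, one per checksum seen under more than one URL.
    """
    by_sum = {}
    for url, body in pages.items():
        by_sum.setdefault(page_checksum(body), []).append(url)
    return [sorted(urls) for urls in by_sum.values() if len(urls) > 1]
```

Two aliases of a static page land in the same group, but a volatile page [say, one with a clock or hit counter in it] yields a different checksum on every fetch, so its aliases never match; that is exactly the case where checksums stop helping.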

Sigh, Tim Bray, Open Text