Re: Server name in /robots.txt

Tim Bray (tbray@opentext.com)
Fri, 19 Jan 96 10:00 PST


>I have found
>the same pages appearing under multiple domain names - the canonical DNS
>name, various CNAME equivalents and the raw IP address *despite* having a
>
> <BASE HREF="http://xxx.xxx.xxx.xxx/xxx.html">
>
>giving a 'preferred URL' in the header. Obviously indexers don't
>(or some indexers don't) recognise this and just build on incorrect,
>but currently working, links from other pages.

Yes, well, reading a variety of specs carefully makes it clear that HTML
does *not* at the current time provide a mechanism for specifying the
"canonical name" of the current page. Having noticed this [several tens
of thousands of times] during the construction of the Open Text Index,
I tried rattling cages over in the HTML Working Group, and discovered a
complete lack of consensus; some people feel that this is an appropriate
use of <BASE>, as do I; others, including people who *really* know HTML,
think <META HTTP-EQUIV="URI" CONTENT="http://xxx.xxx.xxx/xxx.html"> is more
appropriate. I tried to get them to make up their minds, but couldn't
generate sufficient interest. I don't care [nor would any other robot
flogger, I think] which mechanism is used, as long as one is available.
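
For what it's worth, here is a rough sketch [Python, purely illustrative;
the helper name and the regexes are my own, not anything the Open Text
Index actually runs] of how a robot could honour either convention:

    import re

    # Look for a declared "preferred URL", trying the
    # <META HTTP-EQUIV="URI" ...> convention first and falling back to
    # <BASE HREF="...">.
    META_URI = re.compile(r'<META\s+HTTP-EQUIV="URI"\s+CONTENT="([^"]+)"',
                          re.IGNORECASE)
    BASE_HREF = re.compile(r'<BASE\s+HREF="([^"]+)"', re.IGNORECASE)

    def preferred_url(html, fetched_url):
        # Return whatever canonical name the page declares, or the URL we
        # actually fetched it under if it declares nothing.
        match = META_URI.search(html) or BASE_HREF.search(html)
        return match.group(1) if match else fetched_url

With something like that in place, the canonical DNS name, the CNAME
aliases and the raw-IP URL would all collapse to whatever the author
declared, assuming of course that authors declare anything at all.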

This is a sub-issue of the larger [and unfortunately largely ignored]
issue of WWW metadata.

The canonical-name issue could be settled, I suppose, if we and Infoseek and
Lycos and Excite got together and said "do it THIS way". I think
<META> is probably the way to go, since <BASE> is overloaded for other
functions.

Anyone who's written a serious robot knows that the aliasing available
in the IP/DNS mechanisms and the Unix filesystems, plus the habit of
mirroring good stuff, makes automatic duplicate detection [at the least]
very difficult. Checksums help, but not with volatile pages. Making
things harder still is the fact that there are lots of people who
*like* having their pages show up multiple times.
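
To make the checksum point concrete, here is a minimal sketch [again
Python, again hypothetical, not what our robot actually does] of grouping
fetched pages by body checksum:

    import hashlib
    from collections import defaultdict

    def group_duplicates(pages):
        # pages: iterable of (url, body_bytes) pairs the robot has fetched.
        # Bodies that are byte-for-byte identical land in the same bucket,
        # catching DNS/CNAME aliases and exact mirrors. Pages that differ
        # by even one byte (hit counters, dates, volatile pages) slip
        # straight through, which is the weakness mentioned above.
        buckets = defaultdict(list)
        for url, body in pages:
            buckets[hashlib.md5(body).hexdigest()].append(url)
        return [urls for urls in buckets.values() if len(urls) > 1]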

Sigh, Tim Bray, Open Text