Re: Server name in /robots.txt

Martin Kiff (MGK@NEWTON.NPL.CO.UK)
Mon, 22 Jan 1996 09:16 UT


Tim:

> The Canonical-name issue could be settled, I suppose, if we and Infoseek
> and Lycos and Excite got together and said 'do it THIS way', I think
> <META> is probably the way to go, since <BASE> is overloaded for
> other functions.

Sounds fine to me - certainly having sensible 'we do it this way'
information beats the current situation, where there is no such
advice on the indexers' home pages (none that I could find, that is).

I would say that if people have gone to the trouble of setting up
HTTP-EQUIV="URI" in their pages, then they are doing it for a reason
and it should be the first choice (as it can cope with multiple entries,
can it not? [1])
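For illustration, the sort of markup I have in mind - the exact CONTENT
syntax here is my own guess, not something the indexers have agreed on:

```html
<HEAD>
<TITLE>Example page</TITLE>
<!-- Hypothetical: the canonical name, supplied by the page author -->
<META HTTP-EQUIV="URI" CONTENT="http://newton.npl.co.uk/example.html">
</HEAD>
```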

Without an HTTP-EQUIV="URI", could the BASE information be used as a hint?
Presumably neither the HTTP-EQUIV nor the BASE data should be taken
unchecked... The page should be re-read from the URLs supplied
and, if valid and pointing to the same information, those URLs indexed?

Now in the absence of any HTTP-EQUIV or BASE data - i.e. the normal case -
the indexer could fall back on a default server name in the /robots.txt
file - solving a part, but at least a part, of the problem for the
webweaver.
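Something along these lines in /robots.txt, say - the 'Server:' line
is purely my invention as an illustration, not part of the current
robots.txt convention:

```
# /robots.txt
User-agent: *
Disallow: /private/

# Hypothetical extra line: the default (canonical) server name
Server: newton.npl.co.uk
```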

Tim again:

> Also .... I want to solve
> the /a/./c/ and /a/x/../b/y/../c and hard-link and symlink file problems.

It is up to the individual webweavers to decide whether they are going
down the slippery route of links. I would say, though, that it's a
'natural solution' when faced with the need to move a page; it's only
a while afterwards that they notice the chickens coming home to roost.
I fell into that hole myself, and I thought I had done what I could
with <BASE HREF=....>
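To make the aliasing concrete: a robot could at least collapse the '.'
and '..' segments before comparing paths. A quick sketch (in Python,
purely illustrative - and note that path normalisation says nothing
about hard links or symlinks, which only the server's filesystem knows
about):

```python
import posixpath

# Each raw path aliases a simpler canonical form once '.' and '..'
# segments are collapsed; two spellings of the same resource then
# compare equal as strings.
for raw in ("/a/./c/", "/a/x/../b/y/../c"):
    print(raw, "->", posixpath.normpath(raw))
```

Resolving the hard-link and symlink cases, by contrast, needs help from
the server side - no amount of string manipulation by the robot can
detect them.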

Note though that webweavers often have no control of the CNAMEs in
the DNS, and certainly no control of third parties picking up raw IP
addresses (but from where?) and using those in their links. Using
<BASE HREF...> can contain the damage if indexers make use of it,
even in its strict meaning.

While I am typing, a hopefully trivial question: who actually reads the
HTTP-EQUIV? The documentation I've read doesn't say whether it should
be understood by the httpd daemon and handed out in response to a HEAD
request (I haven't noticed my - maybe elderly - copies of the CERN or
NCSA daemons actually doing this), or whether it should be understood
by the browser or robot as it GETs the page.

Regards,
Martin Kiff
mgk@newton.npl.co.uk

[1] Guesswork based on the libwww.pl library discarding 'multiple' URIs