Re: Possible robots.txt addition (fwd)

Ian Graham (ianweb@smaug.java.utoronto.ca)
Wed, 6 Nov 1996 17:57:19 -0500 (EST)


> Ian Graham noticed:
>
> > A common problem (at least within our organization) is the expiry and/or
> > change of domain names: internal departmental or divisional reorganizations
> > lead to changes in domain names (in our case, *.utirc.utoronto.ca to
> > *.hprc.utoronto.ca) and the eventual elimination of 'obsolete' domains,
> > after some period of coexistence.
> >
> > Unfortunately, it is currently impossible to tell robots which
> > domain name should be used for a particular site.
>
> and suggested an addition to robots.txt.
>
> There is a recent addition to the HTTP protocol, proposed for 1.1,
> but trivially implemented by any HTTP client : the Host header, which
> contains (sic) the host name part of the URL, as known by the client.
>
> Since a normal HTTP request includes only the path part of the
> URL, it was previously impossible to host different servers on the same
> machine and port. This is now easier.
> An application of this is to *redirect* (301) requests that refer in Host
> to the old domain. This way, a canonical URL can be given for each actual
> location.
> Of course, this requires robot writers to add the Host: header to their
> requests (trivial), and the webmaster to implement the redirection (I admit
> I haven't checked whether that's easy).
> It may be more work for the webmaster, but it has the advantage of not
> being limited to robots: any client will notice the change.
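
The Host-based redirection described above can be sketched in a few lines
of Python. The hostnames here are hypothetical stand-ins for an old and a
new domain; the server issues a 301 with the canonical URL whenever the
request's Host header names the obsolete domain:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

OLD_HOST = "old.example.org"   # hypothetical obsolete domain
NEW_HOST = "new.example.org"   # hypothetical canonical domain

class RedirectHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        # Strip any :port suffix before comparing hostnames.
        host = self.headers.get("Host", "").split(":")[0]
        if host == OLD_HOST:
            # Redirect to the canonical URL for the same path.
            self.send_response(301)
            self.send_header("Location", f"http://{NEW_HOST}{self.path}")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, fmt, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A robot sending the old Host header gets the canonical URL back.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/docs/index.html", headers={"Host": OLD_HOST})
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))
# → 301 http://new.example.org/docs/index.html
server.shutdown()
```

Note that the path is preserved in the Location header, so each old URL
maps to exactly one canonical URL, which is the point of the scheme.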

This is true, as far as it goes. But a redirect covers only a single
URL, not an entire server's worth of content. With redirects, a robot
must re-index the *entire* server to correct all the URLs, whereas a
single robots.txt file could in principle correct the domain-name
portion of every URL residing at that site. This is because HTTP
response header information is specific to a URL, and not to a server's
entire content.

I suspect that each of these virtual hosts can have its own robots.txt
file, so this scheme should work regardless.
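
Since each virtual host answers for its own /robots.txt, a robot keyed on
the hostname in the URL would simply parse whichever exclusion file that
host serves. A minimal sketch with Python's standard robot-exclusion
parser, using hypothetical rules as the old domain might publish them:

```python
from urllib.robotparser import RobotFileParser

# Rules as they might appear in the old domain's own robots.txt,
# barring all robots from the obsolete hostname entirely.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])
print(rp.can_fetch("*", "http://old.example.org/docs/index.html"))
# → False
```

The new domain would serve a different (presumably permissive) file, so
the robot's behaviour diverges per hostname even on a shared machine.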

Ian