Re: Domains and HTTP_HOST

Ian Graham (ianweb@smaug.java.utoronto.ca)
Wed, 6 Nov 1996 14:41:30 -0500 (EST)


This presents a 'host' of problems (if you'll excuse the pun) ....

Provided the robots.txt files are placed in the right places, I
don't see that use-domain: or obsolete-domain: would cause problems.
For example, suppose the following robots.txt files, at the indicated
URIs and with the indicated content:

At http://host1.domain.ca/robots.txt
(serving host1.domain.ca):

    use-domain: host1.domain.ca

    (Base mapped to: host1.domain.ca/host1/)

At http://host2.domain.ca/robots.txt
(serving host2.domain.ca, and althost2.boob.ca as an obsolete alias):

    use-domain: host2.domain.ca
    obsolete-domain: althost2.boob.ca

    (Base mapped to: host2.domain.ca/host2/)

Now, provided the virtual domaining properly mapped host1.domain.ca to
one location, and host2.domain.ca (as well as althost2.boob.ca) to
another, things would work fine.
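The "proper mapping" above amounts to a table from HTTP_HOST values to
document roots, with some default used when the client sends no host at
all. A minimal sketch (the paths and the default choice are illustrative
assumptions, not from the original setup):

```python
# Server-side virtual-domain mapping, as described above.
# Paths and the fallback root are hypothetical examples.
DOCUMENT_ROOTS = {
    "host1.domain.ca": "/www/host1/",
    "host2.domain.ca": "/www/host2/",
    "althost2.boob.ca": "/www/host2/",  # obsolete alias, same root
}
DEFAULT_ROOT = "/www/host1/"  # used when the client sent no HTTP_HOST


def document_root(http_host):
    """Pick a document root from the HTTP_HOST value (None if unset)."""
    return DOCUMENT_ROOTS.get(http_host, DEFAULT_ROOT)
```

The failure mode under discussion falls out directly: a spider that never
sends a Host header always lands on the default root, whatever host it
thought it was crawling.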

I see no way around the incorrect mapping problem arising from missing
HTTP_HOST information, other than to add server redirects that map things
like http://www.mydomain.com/mydomain/ back to the correct locations....
I've never done this -- is that indeed possible?
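For what it's worth, this kind of mapping looks expressible with an
Apache-style Redirect directive (mod_alias); the fragment below is only a
sketch of the idea, using the hypothetical path from the example above:

```
# Map the wrongly-prefixed path back to the canonical location,
# e.g. /mydomain/page.html -> http://www.mydomain.com/page.html
Redirect permanent /mydomain/ http://www.mydomain.com/
```

Whether this interacts sanely with the virtual-domain setup itself is
exactly the open question.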

Ian

--
Ian Graham ........................................ ian.graham@utoronto.ca
Information Commons                                      Tel: 416-978-4548
University of Toronto                                    Fax: 416-978-0440

> -- [ From: Brian Clark * EMC.Ver #2.5.02 ] --
>
> I'm fascinated by the discussions about host-name issues in the
> robots.txt file (I'll save a long citation) ... but it brings to mind
> other issues that I haven't seen discussed here.
>
> First off, some hosts are utilizing virtual domaining ... where
> different domains stipulate different document roots and the webserver
> decides which document root is appropriate based upon the contents of
> the HTTP_HOST environment variable set by the browser. Of course, many
> browsers (and many spiders) don't set the HTTP_HOST variable, causing
> that same webserver to point the request to the wrong document root.
> You see this effect through the search engine databanks ... most
> typically, they do encounter the correct pages at some point during
> their crawl, turning a URL like:
>
> http://www.mydomain.com/
>
> to something like:
>
> http://www.mydomain.com/mydomain/
>
> Which isn't always a valid link for every browser (since a browser that
> doesn't set the HTTP_HOST variable is now likely to get a 404.) On a
> side note from the current discussion, I'm interested to know how many
> of the robot maintainers out there are utilizing the HTTP_HOST
> environment variable on these issues (and how many aren't.)
>
> More to the current topic, however: if "use this domain" directives are
> placed in robots.txt and the domain is of a "virtual" nature, it's
> quite likely the robot will get another "domain"'s robots.txt anyway
> ... and not the one for the host they think they are exploring (that
> is, if this spider doesn't utilize HTTP_HOST.)
>
> I'm interested in hearing people's thoughts on these issues (as the
> number of virtual domains our spiders have been encountering seems to
> be rising in proportion to the whole.)
>
> Brian