Provided the robots.txt files are placed in the right places, I
don't see that use-domain: or obsolete-domain: would cause problems.
For example, suppose the following robots.txt files, at the indicated
URIs and with the indicated content:
http://host1.domain.ca/robots.txt
  [serves host1.domain.ca]

  use-domain: host1.domain.ca

  (Base mapped to: host1.domain.ca/host1/)

http://host2.domain.ca/robots.txt
  [serves host2.domain.ca and althost2.boob.ca (obsolete)]

  use-domain: host2.domain.ca
  obsolete-domain: althost2.boob.ca

  (Base mapped to: host2.domain.ca/host2/)
Now, provided the virtual domaining properly mapped host1.domain.ca to
one location, and host2.domain.ca (as well as althost2.boob.ca) to
another, things would work fine.
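Of course, that only works if the robot tells the server which host it wants.
As a minimal sketch (hypothetical code, not any particular robot; the host
names are the made-up ones from the example above), a robot could send an
explicit Host header when fetching robots.txt, so a virtually-hosted server
can choose the right document root:

import socket

def fetch_robots(host, port=80):
    # Plain TCP connection to the web server.
    s = socket.create_connection((host, port))
    try:
        # HTTP/1.0 request; the explicit Host header is the important part --
        # without it a virtually-hosted server falls back to a default root.
        request = ("GET /robots.txt HTTP/1.0\r\n"
                   "Host: %s\r\n"
                   "User-Agent: example-robot/0.1\r\n"
                   "\r\n" % host)
        s.sendall(request.encode("ascii"))
        # Read until the server closes the connection.
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode("latin-1")
    finally:
        s.close()

# fetch_robots("host1.domain.ca") and fetch_robots("host2.domain.ca") would
# then return the two different robots.txt files shown above, rather than
# whichever file the server's default document root happens to hold.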
I see no way around the incorrect mapping problem arising from missing
HTTP_HOST information, other than to add server redirects that map things
like http://www.mydomain.com/mydomain/ back to the correct locations....
I've never done this -- is that indeed possible?
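For illustration only, what I have in mind is a simple prefix redirect. Here
is a toy sketch in standalone Python -- in practice this would presumably be
a one-line Redirect or Alias in the web server's own configuration, and the
/mydomain/ prefix and target host are the made-up ones from Brian's example:

from http.server import BaseHTTPRequestHandler, HTTPServer

CANONICAL = "http://www.mydomain.com"   # hypothetical canonical host
PREFIX = "/mydomain"                    # spurious prefix seen in search engines

class PrefixRedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith(PREFIX + "/"):
            # Strip the spurious prefix and send a permanent redirect back
            # to the correct location at the document root.
            self.send_response(301)
            self.send_header("Location", CANONICAL + self.path[len(PREFIX):])
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Serve the redirects on port 8080 for demonstration purposes.
    HTTPServer(("", 8080), PrefixRedirectHandler).serve_forever()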
Ian
--
Ian Graham ........................................ ian.graham@utoronto.ca
Information Commons                                Tel: 416-978-4548
University of Toronto                              Fax: 416-978-0440

> -- [ From: Brian Clark * EMC.Ver #2.5.02 ] --
>
> I'm fascinated by the discussions about host-name issues in the robots.txt
> file (I'll save a long citation) ... but it brings to mind other issues that
> I haven't seen discussed here.
>
> First off, some hosts are utilizing virtual domaining ... where different
> domains stipulate different document roots and the webserver decides which
> document root is appropriate based upon the contents of the HTTP_HOST
> environment variable set by the browser. Of course, many browsers (and many
> spiders) don't set the HTTP_HOST variable, causing that same webserver to
> point the request to the wrong document root. You see this effect through
> the search engine databanks ... most typically, they do encounter the
> correct pages at some point during their crawl, turning a URL like:
>
> http://www.mydomain.com/
>
> to something like:
>
> http://www.mydomain.com/mydomain/
>
> Which isn't always a valid link for every browser (since a browser that
> doesn't set the HTTP_HOST variable is now likely to get a 404.) On a side
> note from the current discussion, I'm interested to know how many of the
> robot maintainers out there are utilizing the HTTP_HOST environment
> variable on these issues (and how many aren't.)
>
> More to the current topic, however: if "use this domain" directives are
> placed in robots.txt and the domain is of a "virtual" nature, it's quite
> likely the robot will get another "domain"'s robots.txt anyway ... and not
> the one for the host they think they are exploring (that is, if this spider
> doesn't utilize HTTP_HOST.)
>
> I'm interested in hearing people's thoughts on these issues (as the number
> of virtual domains our spiders have been encountering seems to be rising in
> proportion to the whole.)
>
> Brian
>