Domains and HTTP_HOST

Brian Clark (bclark@radzone.org)
Wed, 06 Nov 96 09:24:14 -0500


-- [ From: Brian Clark * EMC.Ver #2.5.02 ] --

I'm fascinated by the discussions about host-name issues in the robots.txt
file (I'll save a long citation) ... but it brings to mind other issues that
I haven't seen discussed here.

First off, some hosts are utilizing virtual domaining ... where different
domains map to different document roots, and the webserver decides which
document root is appropriate based upon the HTTP_HOST environment variable
(which the server sets from the Host header sent by the browser). Of course,
many browsers (and many spiders) don't send a Host header at all, causing
that same webserver to point the request at the wrong document root. You can
see this effect in the search engine databanks ... most typically, they do
encounter the correct pages at some point during their crawl, turning a URL
like:

http://www.mydomain.com/

to something like:

http://www.mydomain.com/mydomain/

Which isn't always a valid link for every browser (since a browser that
doesn't send a Host header is now likely to get a 404.) On a side note from
the current discussion, I'm interested to know how many of the robot
maintainers out there are sending a Host header (so HTTP_HOST gets set) with
their requests (and how many aren't.)
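To make the failure mode concrete, here is a minimal sketch of how such a
virtually hosted server might choose a document root. The domain names,
paths, and function name are all hypothetical; the point is only the
fallback: a request with no Host header gets the server-wide default root,
which is how a spider ends up crawling the wrong site's pages.

```python
# Hypothetical virtual-host table; names and paths are illustrative only.
DOC_ROOTS = {
    "www.mydomain.com": "/home/mydomain/htdocs",
    "www.otherdomain.com": "/home/otherdomain/htdocs",
}
DEFAULT_ROOT = "/home/default/htdocs"


def resolve_document_root(host_header):
    """Return the document root for a request's Host header.

    Requests from clients that never sent a Host header fall back to
    the server-wide default root -- the behavior described above.
    """
    if not host_header:
        return DEFAULT_ROOT
    # Strip an optional :port suffix and normalize case before lookup.
    host = host_header.split(":")[0].lower()
    return DOC_ROOTS.get(host, DEFAULT_ROOT)
```

A spider that omits the header would be served out of DEFAULT_ROOT here,
no matter which virtual domain it thought it was visiting.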

More to the current topic, however: if "use this domain" directives are
placed in robots.txt and the domain is of a "virtual" nature, it's quite
likely the robot will get another domain's robots.txt anyway ... and not the
one for the host it thinks it is exploring (that is, if the spider doesn't
send a Host header.)
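The fix on the robot side is simply to include the header in every request,
robots.txt fetches included. A sketch of the raw request a robot might build
(the User-Agent string is a made-up example):

```python
def build_robots_request(host):
    """Build the raw HTTP request a robot could send for /robots.txt,
    with an explicit Host header so a virtually hosted server can pick
    the right document root -- and hence the right robots.txt."""
    return (
        "GET /robots.txt HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "User-Agent: example-robot/0.1\r\n"
        "\r\n"
    )
```

Without that Host line, a virtually hosted server has no way to tell which
of its domains' robots.txt files the robot actually wants.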

I'm interested in hearing people's thoughts on these issues (as the number
of virtual domains our spiders have been encountering seems to be rising in
proportion to the whole.)

Brian