Provided the robots.txt files are placed in the right places, I
don't see that use-domain: or obsolete-domain: would cause problems.
For example, suppose the following robots.txt files, at the indicated
URIs and with the indicated content:
http://host1.domain.ca/robots.txt
  [serves host1.domain.ca]

  use-domain: host1.domain.ca

  (Base mapped to: host1.domain.ca/host1/)

http://host2.domain.ca/robots.txt
  [serves host2.domain.ca and althost2.boob.ca (obsolete)]

  use-domain: host2.domain.ca
  obsolete-domain: althost2.boob.ca

  (Base mapped to: host2.domain.ca/host2/)
Now, provided the virtual domaining properly mapped host1.domain.ca to
one location, and host2.domain.ca (as well as althost2.boob.ca) to
another, things would work fine.
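Of course, that only works if the robot tells the server which host it wants.
As a minimal sketch (hypothetical code, not any particular robot; the host
names are the made-up ones from the example above), a robot could send an
explicit Host header when fetching robots.txt, so a virtually-hosted server
can choose the right document root:

import socket

def fetch_robots(host, port=80):
    # Plain TCP connection to the web server.
    s = socket.create_connection((host, port))
    try:
        # HTTP/1.0 request; the explicit Host header is the important part --
        # without it a virtually-hosted server falls back to a default root.
        request = ("GET /robots.txt HTTP/1.0\r\n"
                   "Host: %s\r\n"
                   "User-Agent: example-robot/0.1\r\n"
                   "\r\n" % host)
        s.sendall(request.encode("ascii"))
        # Read until the server closes the connection.
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode("latin-1")
    finally:
        s.close()

# fetch_robots("host1.domain.ca") and fetch_robots("host2.domain.ca") would
# then return the two different robots.txt files shown above, rather than
# whichever file the server's default document root happens to hold.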
I see no way around the incorrect mapping problem arising from missing
HTTP_HOST information, other than to add server redirects that map things
like http://www.mydomain.com/mydomain/ back to the correct locations....
I've never done this -- is that indeed possible?
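For illustration only, what I have in mind is a simple prefix redirect. Here
is a toy sketch in standalone Python -- in practice this would presumably be
a one-line Redirect or Alias in the web server's own configuration, and the
/mydomain/ prefix and target host are the made-up ones from Brian's example:

from http.server import BaseHTTPRequestHandler, HTTPServer

CANONICAL = "http://www.mydomain.com"   # hypothetical canonical host
PREFIX = "/mydomain"                    # spurious prefix seen in search engines

class PrefixRedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith(PREFIX + "/"):
            # Strip the spurious prefix and send a permanent redirect back
            # to the correct location at the document root.
            self.send_response(301)
            self.send_header("Location", CANONICAL + self.path[len(PREFIX):])
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Serve the redirects on port 8080 for demonstration purposes.
    HTTPServer(("", 8080), PrefixRedirectHandler).serve_forever()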
Ian
--
Ian Graham ........................................ ian.graham@utoronto.ca
Information Commons                                Tel: 416-978-4548
University of Toronto                              Fax: 416-978-0440

> -- [ From: Brian Clark * EMC.Ver #2.5.02 ] --
>
> I'm fascinated by the discussions about host-name issues in the robots.txt
> file (I'll save a long citation) ... but it brings to mind other issues that
> I haven't seen discussed here.
>
> First off, some hosts are utilizing virtual domaining ... where different
> domains stipulate different document roots and the webserver decides which
> document root is appropriate based upon the contents of the HTTP_HOST
> environment variable set by the browser. Of course, many browsers (and many
> spiders) don't set the HTTP_HOST variable, causing that same webserver to
> point the request to the wrong document root. You see this effect through
> the search engine databanks ... most typically, they do encounter the
> correct pages at some point during their crawl, turning a URL like:
>
> http://www.mydomain.com/
>
> to something like:
>
> http://www.mydomain.com/mydomain/
>
> Which isn't always a valid link for every browser (since a browser that
> doesn't set the HTTP_HOST variable is now likely to get a 404.) On a side
> note from the current discussion, I'm interested to know how many of the
> robot maintainers out there are utilizing the HTTP_HOST environment
> variable on these issues (and how many aren't.)
>
> More to the current topic, however: if "use this domain" directives are
> placed in robots.txt and the domain is of a "virtual" nature, it's quite
> likely the robot will get another "domain"'s robots.txt anyway ... and not
> the one for the host they think they are exploring (that is, if this spider
> doesn't utilize HTTP_HOST.)
>
> I'm interested in hearing people's thoughts on these issues (as the number
> of virtual domains our spiders have been encountering seems to be rising in
> proportion to the whole.)
>
> Brian
>