Suggested change: the robot should fetch robots.txt using the same access method as
the document it is trying to retrieve, where applicable (i.e., HTTP, HTTPS,
SHTTP, FTP), falling back to HTTP if that method fails or is not applicable.
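A rough sketch of what I mean (Python; dropping any explicit port for the
plain-HTTP fallback is my own reading, not part of the suggestion):

    from urllib.parse import urlsplit, urlunsplit

    def robots_txt_candidates(document_url):
        # robots.txt URLs to try, in order: same scheme as the document
        # first, then plain HTTP as the fallback.
        scheme, netloc, _, _, _ = urlsplit(document_url)
        if scheme in ('http', 'https', 'shttp', 'ftp'):
            yield urlunsplit((scheme, netloc, '/robots.txt', '', ''))
        if scheme != 'http':
            host = netloc.split(':')[0]   # drop any explicit port for the fallback
            yield urlunsplit(('http', host, '/robots.txt', '', ''))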
> Following redirects really should be required. It would make life much
> easier for servers which serve documents for multiple web hosts. Big
> servers are often the ones most sensitive to robots. For example,
> a server which serves out dozens of vanity domains could more easily
> implement /robots.txt per domain using redirection like so:
>
> http://www.vanity1.com/robots.txt -> redirect -> /robots/vanity1.txt
> http://www.vanity2.com/robots.txt -> redirect -> /robots/vanity2.txt
Hmmm, robots should also be required to send the Host: header defined in HTTP/1.1,
probably even for non-virtual servers.
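Something along these lines, say (Python sketch; the function name, the redirect
limit, and handling only same-host redirects are my own choices):

    import http.client

    def get_robots_txt(host, path='/robots.txt', hops=5):
        # GET robots.txt over HTTP/1.1; http.client adds the Host: header
        # automatically, which is what lets a server tell vanity1 and
        # vanity2 apart.
        conn = http.client.HTTPConnection(host)
        conn.request('GET', path)
        resp = conn.getresponse()
        if resp.status in (301, 302, 303, 307) and hops > 0:
            location = resp.getheader('Location', '')
            conn.close()
            if location.startswith('/'):
                # e.g. redirected to /robots/vanity1.txt on the same host
                return get_robots_txt(host, location, hops - 1)
            return None   # off-site redirect: give up in this sketch
        body = resp.read().decode('latin-1', 'replace')
        conn.close()
        return body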
> It might be very tough for content providers to use the "standard HTTP
> cache-control" mechanisms to specify Expires headers, for example, since
> robots.txt uses the text/plain type, not HTML. Typically, you would use
> HTML to do this:
> <META HTTP-EQUIV="Expires" CONTENT="blah">
> or whatever. Many Web servers have poor support for expiration in HTTP.
>
> So, I'd suggest explicitly adding an Expiration field to the robots.txt
> format, using the HTTP date format, IMHO.
There is no need for HTML to send an Expires header; as the HTTP-EQUIV attribute
already indicates, this is part of the HTTP protocol. Any web server should be
configurable to send an Expires: header (or, in the absence of such a header, the
robot would assume 7 days, which should be okay for most applications anyway).
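On the robot side that boils down to something like this (Python sketch; the names
and the error handling are mine, the 7 days are just the default suggested above):

    from datetime import datetime, timedelta, timezone
    from email.utils import parsedate_to_datetime

    DEFAULT_TTL = timedelta(days=7)   # assumed when no Expires: header is sent

    def robots_txt_expiry(expires_header, fetched_at=None):
        # When should the cached /robots.txt be refetched?  Use the HTTP
        # Expires: header if present and parsable, else fetch time + 7 days.
        fetched_at = fetched_at or datetime.now(timezone.utc)
        if expires_header:
            try:
                return parsedate_to_datetime(expires_header)
            except (TypeError, ValueError):
                pass   # unparsable date, fall back to the default
        return fetched_at + DEFAULT_TTL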
Klaus Johannes Rusch
--
e8726057@student.tuwien.ac.at, KlausRusch@atmedia.net
http://www.atmedia.net/KlausRusch/