Oh, yeah - that's what I meant.
> I agree, as some servers may be configured to return wrong responses.
> Are there any? Some sites masquerade 404 and 403 to same error for
> security reasons, but I have so far seen only 404 as the error they
> return in both cases.
For an example of an often-returned 403 error, visit any site running the OSU
DECthreads-based server (I know of only three such sites; undoubtedly there
are more, but I only feel comfortable mentioning the one I control) and ask
for some top-level document that isn't there; for example,
http://public.carleton.edu/gonzo.html
The error returned will be 403. This is because the default configuration
of the OSU server, unlike many others, does not simply map /* to
$document_root/* - instead, it allows access only to certain top-level
directories; this can be most confusing, and I won't go into it, as I still
can't find a decent way to explain it thoroughly. Anyway, the upshot of
this is that if a webmaster did not explicitly plan for someone asking for
/robots.txt, then a code 403 is returned. Versions since 1.9c all contain
a default /robots.txt (so that something is returned instead of code 403),
but sites running earlier versions, unless the webmaster has explicitly
planned for it, will return error 403 where the error should really be
"not found."
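To illustrate the idea (this is my own sketch, not the OSU server's actual
code or configuration; the directory and file names are invented), the
dispatch behaves roughly like this:

```python
# Hypothetical model of a server that serves only certain top-level
# directories, returning 403 (rather than 404) for anything else.
ALLOWED_TOP_DIRS = {"www", "pub"}        # invented names for illustration
EXISTING_FILES = {"/www/index.html"}     # invented content for illustration

def status_for(path):
    """Return the HTTP status code such a server might send for `path`."""
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] not in ALLOWED_TOP_DIRS:
        # Not under an allowed top-level directory: access forbidden,
        # even though the real problem is that nothing is mapped there.
        return 403
    # Under an allowed directory, a missing file is an ordinary 404.
    return 200 if path in EXISTING_FILES else 404
```

So /robots.txt or /gonzo.html draws a 403, while /www/gonzo.html draws a
plain 404.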
This isn't quite what you asked for, since it is still very possible to get
a 404 error from these sites; just try:
http://public.carleton.edu/www/gonzo.html
Also, how should robots handle other responses from the server? Handling a
redirect seems obvious, but what about other responses? Personally, I think
the best solution is the simplest: treat all unexpected response codes as
"not found".
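That rule is easy to state in code. Here's a minimal sketch (the function
name and category strings are my own) of how a robot might classify the
status code it gets back for /robots.txt, handling redirects separately and
collapsing everything else unexpected into "not found":

```python
def classify(status):
    """Map an HTTP status code to a robot's view of the /robots.txt fetch."""
    if 200 <= status < 300:
        return "ok"            # parse the returned robots.txt
    if 300 <= status < 400:
        return "redirect"      # follow the Location header and retry
    # 403, 404, 500, and anything else unexpected: assume no usable
    # robots.txt exists, i.e. treat the site as unrestricted.
    return "not found"
```

With this, the OSU server's 403 and an ordinary 404 are handled identically,
which is exactly the point.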
-=-
Daniel Martin * your sig here *