Re: robots.txt unavailability

Jaakko Hyvatti (Jaakko.Hyvatti@Elma.FI)
Tue, 9 Jul 1996 09:11:29 +0300 (EET DST)


Daniel Martin:
> Just a short message (sorry)

No need for (sorry); this is a good question!

> I know what the standard is when a site reports (code 404) that the
> /robots.txt URL is not found; what is the standard behavior when other
> responses are received (say if a site reports code 403 - Forbidden)?
>
> I ask because it appears that this is handled inconsistently; opentext and
> lycos apparently treat this response as though it were a "not found"
> response - altavista's scooter, however, treats the response as though it
> were a file consisting of:
> Disallow: *
> That is, it assumes that the site is completely off-limits.

To clear up any confusion, in the robots.txt format that is expressed as a file like:
________________________________________
User-agent: *
Disallow: /

________________________________________

> Based on my own experience, I'm inclined to say that it should be treated
> as a "Not Found" response, as should other server errors.

I agree, as some servers may be configured to return the wrong response.
Are there any? Some sites masquerade 404 and 403 as the same error for
security reasons, but so far I have only seen 404 returned in both
cases.
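
For illustration, here is a minimal sketch of that policy in Python.
The parsing is deliberately simplified (it ignores User-agent
grouping), and the function name is mine, not from any standard:

________________________________________
import urllib.request
import urllib.error

def fetch_disallow_rules(host):
    # Returns the Disallow path prefixes found in /robots.txt,
    # or an empty list meaning "no restrictions".
    url = "http://%s/robots.txt" % host
    try:
        with urllib.request.urlopen(url) as resp:
            text = resp.read().decode("latin-1", "replace")
    except urllib.error.URLError:
        # Any error fetching the file (404 Not Found, and, per the
        # argument above, 403 Forbidden and 5xx responses too) is
        # treated as if no robots.txt exists.
        return []
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            rules.append(line.split(":", 1)[1].strip())
    return rules
________________________________________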

Sometimes wrong permissions on robots.txt might give a 403, and if the
contents of the file had, for example, disallowed everything, the
webmaster might get mad at you for indexing the site. Has this
happened? Is this the reason some robots treat a 403 for robots.txt as
disallowing everything? Wouldn't webmasters, as soon as they discovered
you hitting the site and before accusing you, check their error logs,
see that you had requested the robots.txt file, and discover the error
themselves?

I suppose the webmaster would think 'my fault' and fix the permissions,
and the next time the robot refreshes its robots.txt cache it would
drop the disallowed pages it had fetched earlier.

Hey, that is a good requirement that has not been mentioned in the
robots.txt documentation yet!
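
In code, that requirement might look something like the following
sketch, which builds on fetch_disallow_rules above; the index and
cache dictionaries are hypothetical stand-ins for a robot's real data
structures:

________________________________________
def refresh_robots_cache(host, index, cache):
    # Hypothetical structures: `cache` maps host -> disallow
    # rules; `index` maps URL paths on that host -> indexed
    # documents.
    rules = fetch_disallow_rules(host)
    cache[host] = rules
    # The proposed requirement: when the cached robots.txt is
    # refreshed, drop any already-indexed pages that the new
    # rules now disallow.
    for path in list(index.keys()):
        if any(path.startswith(prefix) for prefix in rules if prefix):
            del index[path]
________________________________________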