> > Based on my own experience, I'm inclined to say that it should be treated
> > as a "Not Found" response, as should other server errors.
>
> I agree, as some servers may be configured to return wrong responses.
> Are there any? Some sites disguise 403 and 404 as the same error for
> security reasons, but so far I have only seen 404 as the error they
> return in both cases.
>
> Sometimes wrong permissions on robots.txt might give a 403, and if the
> contents of the file had, for example, disallowed everything, the
> webmaster might get mad at you for indexing the site. Has this
> happened? Is this the reason some treat a 403 for robots.txt as
> disallowing all? Wouldn't webmasters, before accusing you and as soon
> as they discover you hitting the site, check the error logs to see
> whether you requested the robots.txt file, and discover the error
> themselves?
>
> I suppose webmasters would think 'my fault', fix the permissions, and
> then the robot, the next time it refreshes its robots.txt cache, would
> drop the disallowed pages it had fetched earlier.
>
> Hey, that is a good requirement that has not been mentioned in the
> robots.txt documentation yet!
I retrieved about 5000 robots.txt files during the last week. In Germany
(where 70% of the files come from) robots.txt is not as common as in the
US, I think. The majority of my results are 404, some 403, some 30x
redirections and (hear! hear!) a bunch of index.html or default.html
pages.

The "am I allowed to get this page" routine treats everything except 200
as permission granted, because with no robots.txt in nearly 80% of cases
the alternative would be a very feeble web index, and that only because
of webmasters who have never heard about robots and all this stuff. If
any webmaster doesn't like my robot on their site, they can set up a
robots.txt or mail me to stop the robot from indexing their site. I
think this is the correct behaviour. Although the nicer way would be to
ask before taking, files exposed to the net are there for common use,
and everyone who places something on the net is interested in having it
spread (if nothing else, the information that it is there).
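To make that routine concrete, here is a minimal sketch of it in Python
(not the code my robot actually runs; the user-agent name and the
timeout are made up). Only a clean 200 for /robots.txt is parsed and
obeyed; 403, 404, broken redirects and network errors all count as
permission granted:

    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlsplit, urlunsplit

    def allowed(url, user_agent="ExampleRobot/0.1"):
        # Build the robots.txt URL for the host of the requested page.
        parts = urlsplit(url)
        robots_url = urlunsplit((parts.scheme, parts.netloc,
                                 "/robots.txt", "", ""))
        try:
            response = urllib.request.urlopen(robots_url, timeout=10)
            body = response.read().decode("latin-1")
        except Exception:
            # 403, 404, broken redirects, network errors, ...:
            # everything except a clean 200 means permission granted.
            return True
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(body.splitlines())
        return parser.can_fetch(user_agent, url)

Note that a redirect to an index.html page would also end up here as a
200 with HTML in the body, which the parser simply reads as "no rules",
i.e. permission granted as well.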
To prevent your robot from doing some weird virtual-URL-space traversal,
you should always take a look at the URLs you find. Sometimes it makes
sense to traverse such a space (like the German X.500 whois directory),
sometimes it doesn't. If the webmaster of a host with a virtual URL
space doesn't do the work for you (by setting up a proper robots.txt),
you have to do it yourself.
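A simple heuristic on the URL is usually enough. Here is a minimal
sketch in Python (again, not the robot's real code; the depth and repeat
limits are made up) that rejects suspiciously deep or repeating paths:

    from urllib.parse import urlsplit

    def looks_like_url_space_trap(url, max_depth=8, max_repeats=2):
        # Heuristic only: a path that is very deep, or that repeats the
        # same segment over and over, is usually a virtual URL space
        # (calendars, session ids in the path, recursive links) rather
        # than real pages.
        segments = [s for s in urlsplit(url).path.split("/") if s]
        if len(segments) > max_depth:
            return True
        return any(segments.count(s) > max_repeats for s in segments)

Whether a depth of eight is right depends on the host; something like
the X.500 directory mentioned above would need a higher limit or an
explicit exception.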
phew.
So long!
Mika
--
------------------------------------------------------------------
Michael Göckel                       CyberCon Gesellschaft
Michael@cybercon.technopark.gmd.de   für neue Medien mbH
Tel. 0 22 41 / 93 50 -0              Rathausallee 10
Fax: 0 22 41 / 93 50 -99             53757 St. Augustin
www.cybercon.technopark.gmd.de       Germany
------------------------------------------------------------------