Re: robots.txt (A *little* off the subject)

Erik Selberg (selberg@cs.washington.edu)
24 Nov 1996 14:32:08 -0800


m.koster@webcrawler.com (Martijn Koster) writes:

> >I may have some stuff I don't want robots to come look
> >at, but it's a hassle to try and get that into a global robots.txt
> >file. And random "sysadmin runs a find script which incorporates
> >things" are solutions I think are somewhat simplistic.
>
> Simplistic is bad? Give me a _good_ reason you're not reducing your
> hassle by hacking up a simple script? Hell, I'll even write you one
> if you like :-)
>
> No, it may not work for everyone. But I'm always interested to hear
> what techno-savvy people think...

Perhaps "simplistic" is a poor word choice. What I'd like to have is
the "content admins" be in charge of how their stuff is indexed,
versus the sys admins. There may also be multiple reasons why "find"
or something similar may have problems --- massively interconnected
symlinks leading to infinite loops (if symlinks are followed) or
exclusion of certain areas (if they aren't) to name the most obvious.
And the "realpath" hack doesn't always work. Sigh.

I think the overall problem is that we want the content admins /
creators to be able to determine who can see what portion of their
stuff. The sysadmins shouldn't get involved if that can be avoided. To
do that, you need to put something like a robots.txt, or a stronger
permission mechanism, in local directories. One potentially nasty
problem is that a robot then has to try to download a new file every
time it explores a new directory.
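
To make that cost concrete, a robot honoring per-directory files
might do something like the sketch below (all names my own
invention). A cache keeps it to one extra fetch per directory, but
that's still one extra request for every directory the robot has
never seen before.

# Sketch of the robot's side of per-directory robots.txt files;
# every name here is my own invention, not part of any standard.
# The cache limits the damage to one extra fetch per directory.

from urllib.parse import urlsplit, urlunsplit
import urllib.request

_robots_cache = {}       # directory URL -> robots.txt text, or None

def robots_for_directory(url):
    """Fetch (at most once) the robots.txt of url's directory."""
    parts = urlsplit(url)
    dir_path = parts.path.rsplit("/", 1)[0] + "/"
    dir_url = urlunsplit((parts.scheme, parts.netloc, dir_path, "", ""))
    if dir_url not in _robots_cache:
        try:
            with urllib.request.urlopen(dir_url + "robots.txt") as resp:
                _robots_cache[dir_url] = resp.read().decode("latin-1")
        except OSError:
            _robots_cache[dir_url] = None    # no per-directory file
    return _robots_cache[dir_url]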

Actually, how's this:

Users can put a robots.txt file anywhere in the system. Presumably
this would only be in areas they have access to (like their home
directory), but it could potentially be every single directory.

The root /robots.txt has pointers to other directories that have
robots.txt files. For example:

# sample root robots.txt file
User-Agent: *
Disallow: /tmp
Search: /homes/*/    # tells a robot to search any URL matching
                     # /homes/*/ for a robots.txt file
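
If a robot wanted to honor that, the parsing might look roughly like
this. Search is only a proposal, so the sketch below is guesswork on
my part:

# Guesswork parser for the record format above; "Search" is only
# a proposal, so the interpretation here is a sketch, not a spec.

import fnmatch

def parse_root_robots(text):
    """Return (disallow_prefixes, search_patterns) from a root file."""
    disallow, search = [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "disallow":
            disallow.append(value)
        elif field == "search":
            search.append(value)
    return disallow, search

def should_probe(url_path, search_patterns):
    """True if the robot should try fetching url_path + 'robots.txt'
    before descending into url_path."""
    return any(fnmatch.fnmatch(url_path, pat) for pat in search_patterns)

So a robot seeing /homes/selberg/ for the first time would match
/homes/*/, fetch /homes/selberg/robots.txt, and obey whatever it
finds there.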

An aggressive sysadmin will run a find script and have only one
robots.txt (something like the sketch below). However, in cases where
that isn't done for whatever reason, this scheme lets users control
things at a much finer-grained level.
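
For the aggressive sysadmin, the fold-everything-into-one-file script
could be as simple as this (taking "/usr/local/www" as the document
root purely as an assumption):

# Sketch of the aggressive-sysadmin version: fold every
# per-directory robots.txt under the document root into one
# global file. "/usr/local/www" is an assumed document root.
# os.walk doesn't follow symlinks by default, which sidesteps
# the loop problem mentioned earlier.

import os

DOCROOT = "/usr/local/www"

def build_global_robots(docroot):
    lines = ["User-Agent: *"]
    for dirpath, dirnames, filenames in os.walk(docroot):
        if "robots.txt" in filenames and dirpath != docroot:
            web_path = "/" + os.path.relpath(dirpath, docroot) + "/"
            for raw in open(os.path.join(dirpath, "robots.txt")):
                raw = raw.split("#", 1)[0].strip()
                if raw.lower().startswith("disallow:"):
                    local = raw.partition(":")[2].strip()
                    if local:
                        # rewrite the local rule relative to web root
                        lines.append("Disallow: " + web_path
                                     + local.lstrip("/"))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(build_global_robots(DOCROOT))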

-Erik

-- 
				Erik Selberg
"I get by with a little help	selberg@cs.washington.edu
 from my friends."		http://www.cs.washington.edu/homes/selberg
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html