Re: Broadness of Robots.txt (Re: Washington again !!!)

Erik Selberg (selberg@cs.washington.edu)
20 Nov 1996 13:15:06 -0800


I think the key problem here is that we're using one narrow standard
to patch several different wounds. Currently, we're enabling a
sysadmin to determine only WHAT should be protected, and naturally
we're coming up with a good method of doing that. However, what we
don't have is a method to describe WHY things should be protected. For
example:

# sample robots.txt
User-Agent: *
Disallow: /tmp        # random tmp documents, no one should look at this
Disallow: /Internal   # our Internal stuff, only we can look at this
Disallow: /smut-news  # our own smut news; don't let someone take it;
                      # this is why folks come here

Now, there are reasons to protect all of the above directories from
robotic indexers. However, there are cases where this is too
restrictive. For example, a PageWatcher-type robot run internally
should be able to access /Internal, even though it's protected. A
"NetNanny"-type agent may want to check /smut-news to make sure it
fits with a parent's guidelines. Etc.
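
For reference, here is a minimal sketch (in Python, against the sample
file above) of the all-or-nothing check a robot makes under the
current standard: a path either matches a Disallow prefix or it
doesn't, regardless of the robot's purpose. The paths are made up for
illustration.

# Sketch of today's check: exclusion is purely by path prefix,
# with no notion of why the robot is visiting.

DISALLOW = ["/tmp", "/Internal", "/smut-news"]  # matching User-Agent record

def allowed(path):
    """True if the robot may fetch this path under the current standard."""
    return not any(path.startswith(prefix) for prefix in DISALLOW)

print(allowed("/Internal/memo.html"))   # False, even for an internal PageWatcher
print(allowed("/public/index.html"))    # True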

Perhaps it would be more effective to extend the standard to include
CATEGORIES (rathole alert!) of access purposes, and regulate those.
For example:

# sample robots.txt with categories
User-Agent: *
# field      dir          cat        comment
Disallow:    /tmp         ALL        # no one should look at this
Disallow:    /Internal    EXT-VIEW   # internal use only
Disallow:    /smut-news   REDIST     # can be indexed and scanned,
                                     # but not redistributed

The addition of categories would let sysadmins target classes of
robots and tell them how they should proceed. The obvious problems
crop up in defining those categories, and again there isn't much in
the way of enforcement, but it's a start.
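
To make the semantics concrete, here is a sketch in Python of how a
robot might read such lines and decide whether a given visit is
permitted. The category names (ALL, EXT-VIEW, REDIST) and the notion
of a robot declaring its "purposes" come only from the proposal above,
not from anything in the current standard; the paths are made up.

import re

SAMPLE = """
Disallow: /tmp ALL            # no one should look at this
Disallow: /Internal EXT-VIEW  # internal use only
Disallow: /smut-news REDIST   # may be indexed, not redistributed
"""

def parse_rules(text):
    """Turn 'Disallow: <dir> <category>' lines into (prefix, category) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]                     # drop comments
        m = re.match(r"Disallow:\s*(\S+)\s+(\S+)", line)
        if m:
            rules.append((m.group(1), m.group(2)))
    return rules

def allowed(path, purposes, rules):
    """purposes is the set of categories this visit falls under, e.g.
    {"EXT-VIEW"} for a public indexer, {"REDIST"} for a mirroring
    service, or the empty set for an internal PageWatcher run. A rule
    blocks the path only if its category is ALL or one the robot has
    declared."""
    for prefix, category in rules:
        if path.startswith(prefix):
            if category == "ALL" or category in purposes:
                return False
    return True

rules = parse_rules(SAMPLE)
print(allowed("/Internal/memo.html", {"EXT-VIEW"}, rules))  # False: external viewer
print(allowed("/Internal/memo.html", set(), rules))         # True: internal run
print(allowed("/smut-news/latest", {"EXT-VIEW"}, rules))    # True: may index it
print(allowed("/smut-news/latest", {"REDIST"}, rules))      # False: no redistribution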

Comments?
-Erik

-- 
				Erik Selberg
"I get by with a little help	selberg@cs.washington.edu
 from my friends."		http://www.cs.washington.edu/homes/selberg
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html