# sample robots.txt
User-Agent: *
Disallow: /tmp        # random tmp documents, no one should look at this
Disallow: /Internal   # our Internal stuff, only we can look at this
Disallow: /smut-news  # our own smut news; don't let
                      # someone take it; this is why folks come
                      # here
Now, there are reasons to protect all of the above directories from
robotic indexers. However, there may be some cases where this is too
restrictive. For example, a PageWatcher-type agent run internally
should be able to access /Internal, even though it's protected. A
"NetNanny"-type agent may want to check on /smut-news to ensure that
it fits with a parent's guidelines. Etc.
Perhaps it would be more effective to adjust the standard to include
CATEGORIES (rathole alert!) of access purposes, and regulate those.
For example:
# sample robots.txt with categories
User-Agent: *
# field      dir          cat        comment
Disallow:    /tmp         ALL        # no one should look at this
Disallow:    /Internal    EXT-VIEW   # internal use only
Disallow:    /smut-news   REDIST     # can be indexed and scanned, but
                                     # not redistributed
The addition of categories would let sysadmins target classes of
robots and tell each class how it should proceed. The obvious
problems crop up in defining those categories, and again there isn't
much in the way of enforcement, but it's a start.
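For concreteness, here is a rough Python sketch of how a robot might
parse and honor category-tagged Disallow lines like the ones above.
The category names (ALL, EXT-VIEW, REDIST) come from the example; the
purpose strings the robot passes in (INT-VIEW, INDEX) and the function
names are made up for illustration, not part of any existing standard.

# Sketch only: parse 'Disallow: <path> <category>' lines and decide
# whether a robot with a given purpose may fetch a path.

def parse_rules(robots_txt):
    """Return (path, category) pairs from category-tagged Disallow lines."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split('#', 1)[0].strip()          # drop comments
        if not line.lower().startswith('disallow:'):
            continue
        fields = line.split(':', 1)[1].split()
        if not fields:
            continue
        path = fields[0]
        # No category means the old behavior: off limits to everyone.
        category = fields[1] if len(fields) > 1 else 'ALL'
        rules.append((path, category))
    return rules

def allowed(rules, url_path, purpose):
    """True if a robot fetching url_path for the given purpose may proceed."""
    for path, category in rules:
        if url_path.startswith(path):
            if category == 'ALL' or category == purpose:
                return False
    return True

sample = """
User-Agent: *
Disallow: /tmp ALL
Disallow: /Internal EXT-VIEW
Disallow: /smut-news REDIST
"""

rules = parse_rules(sample)
print(allowed(rules, '/Internal/plans.html', 'EXT-VIEW'))   # False
print(allowed(rules, '/Internal/plans.html', 'INT-VIEW'))   # True (internal PageWatcher)
print(allowed(rules, '/smut-news/index.html', 'INDEX'))     # True (indexing OK)
print(allowed(rules, '/smut-news/index.html', 'REDIST'))    # False (no redistribution)

The point is just that a robot declares what it intends to do, and a
rule only bars it when its purpose falls in a disallowed category;
everything else falls through to "allowed," as with the current
standard.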
Comments?
-Erik
--
Erik Selberg                                  "I get by with a little help
selberg@cs.washington.edu                      from my friends."
http://www.cs.washington.edu/homes/selberg