Re: Broadness of Robots.txt (Re: Washington again !!!)

Captain Napalm (spc@armigeron.com)
Wed, 20 Nov 1996 18:50:43 -0500 (EST)


It was thus said that the Great Erik Selberg once stated:
>
>
> I think the key problem here is that we're using one narrow standard
> to patch several different wounds. Currently, we're enabling a
> sysadmin to determine only WHAT should be protected, and naturally
> we're coming up with a good method for doing that. However, what we
> don't have is a method to describe WHY things should be protected. For
> example:
>
> # sample robots.txt
> User-Agent: *
> Disallow: /tmp # random tmp documents, no one should look at this
> Disallow: /Internal # our Internal stuff, only we can look at this
> Disallow: /smut-news # our own smut news; don't let
> # someone take it; this is why folks come
> # here
>
> Now, there are reasons to protect all of the above directories from
> robotic indexers. However, there may be some cases where
> this is too restrictive. For example, a PageWatcher type run
> internally should be able to access /Internal, even though it's
> protected. A "NetNanny" type agent may want to check on /smut-news to
> ensure that it fits with a parent's guidelines. Etc.
>
But you don't need to add a category. If the internal robot can be given
a unique name, you can add it to robots.txt with more relaxed rules:

User-agent: internalcrawler
Disallow: /tmp
Disallow: /cgi-bin

User-agent: netnanny
Disallow: /tmp
Disallow: /Internal
Disallow: /cgi-bin

User-agent: *
Disallow: /tmp
Disallow: /Internal
Disallow: /smut-news
Disallow: /cgi-bin
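
For illustration, here's roughly how a parser along the lines of Python's
standard urllib.robotparser module resolves the above: the uniquely named
robot matches its own, more relaxed record, and everything else falls
through to "*". (The robot names and the example.com host are made up for
the sketch, and the file is trimmed to two records.)

from urllib import robotparser

ROBOTS_TXT = """\
User-agent: internalcrawler
Disallow: /tmp
Disallow: /cgi-bin

User-agent: *
Disallow: /tmp
Disallow: /Internal
Disallow: /smut-news
Disallow: /cgi-bin
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The uniquely named internal robot may fetch /Internal ...
print(rp.can_fetch("internalcrawler", "http://www.example.com/Internal/plans.html"))  # True

# ... while an unnamed robot falls through to the "*" record and may not.
print(rp.can_fetch("randomrobot", "http://www.example.com/Internal/plans.html"))      # False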

But this does raise an issue: what about areas (or rules) that are common to
all robots? It might (keyword, might) be a good idea to have a set of
"global rules" that all robots follow, in addition to any specific rules.
Then the following could be done:

User-agent: all # all robots have to follow these
Disallow: /tmp
Disallow: /cgi-bin

User-agent: netnanny
Disallow: /Internal

User-agent: * # default
Disallow: /Internal
Disallow: /smut-news

Then again, how many different rule sets does a typical robots.txt file
have? Also, do specific rules for a robot override the "global rules"?
Maybe not ...
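
For what it's worth, here's a rough sketch of one possible answer: treat the
proposed "all" record as a base set that every robot's specific (or "*")
record adds to, rather than overrides. None of this is in the current
standard; the record layout and merge policy below are just one way it
could work.

def merged_disallows(records, agent):
    # records maps a User-agent token to its list of Disallow paths
    rules = list(records.get("all", []))               # global rules everyone follows
    specific = records.get(agent, records.get("*", []))
    rules.extend(p for p in specific if p not in rules)
    return rules

records = {
    "all":      ["/tmp", "/cgi-bin"],
    "netnanny": ["/Internal"],
    "*":        ["/Internal", "/smut-news"],
}

print(merged_disallows(records, "netnanny"))     # ['/tmp', '/cgi-bin', '/Internal']
print(merged_disallows(records, "randomrobot"))  # ['/tmp', '/cgi-bin', '/Internal', '/smut-news']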

-spc (Just an idea ... )

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html