Re: Broadness of Robots.txt (Re: Washington again !!!)

Art Matheny (matheny@usf.edu)
Thu, 21 Nov 1996 11:54:05 -0500 (EST)


On 20 Nov 1996, Erik Selberg wrote:

> so what happens when I get a second internalcrawler? the problem
> inherent w/ robots.txt is the administration --- it needs to be easy!
> aside from the sysadmin only problem, requiring the sysadmin to keep a
> detailed inventory of all the crawlers and all their behaviors is
> unreasonable,

I totally agree with these statements. I would suggest a slightly
different implementation. The standard should include a list of general
behavior classifications and assign a fictitious "User-Agent" name to each
class.

An agent will scan the robots.txt looking for *either* its own specific
agent name or the appropriate fictitious class name. If it sees either of
these, it uses that rule set. There is obviously ambiguity in the case
where both matches are found. Should the burden be placed on the webadmin
to put the specific rule sets first?
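
To make the lookup concrete, here is a rough Python sketch of an agent
scanning robots.txt for either its own name or a class name. It is only an
illustration under assumptions: the class name "class-indexer", the agent
name "examplebot", and both function names are hypothetical and not part of
any standard. The parser handles only User-agent and Disallow lines.

```python
def parse_records(text):
    """Split robots.txt text into (agent_names, disallow_paths) records."""
    records = []
    agents, rules = [], []
    for raw in text.splitlines():
        if not raw.strip():
            # A blank line ends the current record.
            if agents:
                records.append((agents, rules))
            agents, rules = [], []
            continue
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only line, ignore it
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value.lower())
        elif field == "disallow":
            rules.append(value)
    if agents:
        records.append((agents, rules))
    return records


def matching_records(records, own_name, class_names):
    """Return every record naming the agent itself or one of its classes."""
    wanted = {own_name.lower()} | {c.lower() for c in class_names}
    return [rec for rec in records if wanted & set(rec[0])]


# Hypothetical file: "class-indexer" is an assumed class name.
SAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: class-indexer
Disallow: /cgi-bin/

User-agent: examplebot
Disallow: /tmp/
"""

matches = matching_records(parse_records(SAMPLE), "ExampleBot", ["class-indexer"])
```

Here both the class record and the agent-specific record match, which is
exactly the ambiguous case described above.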

There may very well be some overlap in the classifications (behaviors), so
that it is conceivable for an agent to consider itself as belonging to two
or more classifications. Also, some classifications might be subdivided
into more specific classifications, which again leads to a multiple
membership situation. A first-match rule would work if webadmins can be
coerced into following a specific-to-general sequence. The "*" class
would then be just another class (the most general, and therefore
last in robots.txt). I am afraid, however, that since the original
standard did not specify that the "*" rule set must be last, the
compatibility issue will nix the first-match rule. It might be better to
leave it up to the agent to resolve ambiguous matches, which I don't think
will be too difficult in real cases.
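
The two options can be compared with a small Python sketch. It assumes the
records have already been parsed into (agent_names, disallow_paths) pairs
with lowercase names; the function names, "class-indexer", and "examplebot"
are hypothetical. The second function is just one possible way an agent
might resolve ambiguity, preferring its own name, then its classes in the
order it lists them, then "*".

```python
def first_match(records, own_name, class_names):
    """Use the first record in file order that names the agent, a class, or '*'."""
    wanted = {own_name} | set(class_names) | {"*"}
    for agents, rules in records:
        if wanted & set(agents):
            return rules
    return []


def agent_resolved(records, own_name, class_names):
    """Ignore file order: prefer own name, then classes (most specific first), then '*'."""
    for name in [own_name, *class_names, "*"]:
        for agents, rules in records:
            if name in agents:
                return rules
    return []


# A file written under the original standard, with the "*" record first.
legacy = [
    (["*"], ["/"]),
    (["class-indexer"], ["/cgi-bin/"]),
    (["examplebot"], ["/tmp/"]),
]

by_order = first_match(legacy, "examplebot", ["class-indexer"])
by_agent = agent_resolved(legacy, "examplebot", ["class-indexer"])
```

With "*" listed first, the first-match rule picks the general record and
shuts the agent out entirely, while the agent-resolved lookup still finds
the record written specifically for it. That is the compatibility problem
in miniature.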

-- 
 LLLL  LLLLL LLLLLL  Arthur Matheny       LIB 612
LL  LL LL  LL  LL    Academic Computing   University of South Florida
LLLLLL LLLLL   LL    matheny@usf.edu      Tampa, FL 33620
LL  LL LL LL   LL    813-974-1795         FAX: 813-974-1799
LL  LL LL  LL  LL    http://www.acomp.usf.edu/

_________________________________________________ This message was sent by the robots mailing list. To unsubscribe, send mail to robots-request@webcrawler.com with the word "unsubscribe" in the body. For more info see http://info.webcrawler.com/mak/projects/robots/robots.html