Disallow/Allow by Action (Re: robots.txt syntax)

Brian Clark (bclark@radzone.org)
Fri, 18 Oct 96 08:14:46 -0500


-- [ From: Brian Clark * EMC.Ver #2.5.02 ] --

While I greatly respect the quality of the proposal from Captain Napalm, let
me make a bid again for an additional way of organizing robots.txt (if we're
discussing radical revisions, here's one for you...)

I think part of the reason people aren't using robots.txt is its sheer
complexity - a strength for webmasters who want "fine level" control, but
difficult to understand for the host of "HTML in Seven Days" webmasters
building the bulk of new content. In addition, robots.txt has a shortcoming
in assuming that all robots are indexing, which we now know is only part of
what is going on; recent discussions on this list alone (the Internet
Archive Project, the ActiveAgent code, etc.) make that clear.

I'd like to push again for some modification of robots.txt to incorporate
the behavior of the robot as a criterion. If a central database of actions
could be defined in MIME-style (to be forward compatible with new robot
tasks as they are developed), then a much more upwardly-scalable robot
exclusion protocol would result.
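Purely as an illustration (these action names are invented for this message,
not drawn from any existing registry), such a MIME-style database might
enumerate entries like:

    index           - build a searchable index of page contents
    archive         - store full copies of pages for later retrieval
    harvest/email   - collect email addresses from pages
    mirror          - replicate a site's content to another server

New robot tasks would simply register new action names rather than forcing a
revision of the protocol itself.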

For example, "Robots can index this website, but not archive without
permission or harvest email addresses" might be represented as "overall"
directives for which robots to allow or disallow (as long as robots are
willing to look for their real behavior in this file ... deception and
disregard would be the same enemies as in the current system.)
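To make that concrete, here is one hypothetical way the policy above might
be written - none of these field names exist in the current robots.txt
standard, and the syntax is only a sketch:

    # Hypothetical action-based record (not valid current syntax)
    User-agent: *
    Allow-action: index
    Disallow-action: archive
    Disallow-action: harvest/email

A robot would be expected to match its own behavior against these action
names before proceeding, just as it matches its User-agent against the
existing records.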

Creating an "action" oriented robots.txt approach would also prevent some of
the current debates that plague this list - ones of definition (no, I don't
have to adhere to robots.txt because we're not X, we're Y).

I don't believe such a concept excludes or replaces the other syntax
suggestions that have been discussed; rather, it would give them greater
flexibility as the world of "webpage indexing robots" gives way to the host
of robots/agents now being developed.

Brian