Re: agents ignoring robots.txt

John D. Pritchard (jdp@cs.columbia.edu)
Thu, 17 Oct 1996 12:25:43 -0400


sorry, haven't been following for a couple of days... i missed "why
robots.txt isn't scalable"... i guess because getting into robot details
is too time-consuming.

> It would be much more useful if I could identify all robots/crawler/whatever
> with one unique string instead of using a long list of names to be blocked.

this idea relates to another old theme on this list... a federated db for
robot identification and classification. i'll build and host one in a
couple of months if more than just my software will use it... i have an
idea floating around, so i'll meander through it...

for a wwweber it would be a gui-applet client (talking to a db server,
basically a bit bucket) for building a robots.txt-style setup. the user
would manipulate Allow/Disallow paths (or whatever scheme is agreeable)
for all robots or for classes of robots.
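
just to make that concrete, here's a rough sketch in python (purely
illustrative; the class names and paths below are invented) of the kind of
rule data the applet would be editing:

# hypothetical per-class path rules a wwweber might edit in the applet.
# keys are robot classes ("*" meaning all robots); values are the branches
# they may or may not touch.  everything here is invented for the example.
site_rules = {
    # default record: applies to every robot not covered by a class rule
    "*": {"allow": [], "disallow": ["/cgi-bin/"]},
    # a class curated on the federated db (name invented)
    "robots-who-ignore-robot-exclusion": {"allow": [], "disallow": ["/"]},
    # another invented class: keep them out of /private/ only
    "well-behaved-indexers": {"allow": ["/"], "disallow": ["/private/"]},
}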

there would also be an input system, with both an admin GUI (applet) and
an email interface. classes of robots would be defined by wwwebers who see
that a robot behaves like X or Y. classes would be arbitrarily named, like
"Corporal OctaneJelly's Robots Who Ignore Robot Exclusion". ;-) people who
want to add to a list would submit their suggestions to the administrator
of that list. the list itself would just be robot names and who suggested
them. each list owner would implement their own policies for maintaining
the list, and each list would have an email address on the server where
suggestions could be mailed if they weren't input via another applet.
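
the lists themselves would be about as simple as data gets. something like
this (every name and address here is invented):

# one class list on the federated db: a name, an owner, and the robot
# names plus who suggested them.  all entries are invented examples.
class_list = {
    "name": "robots-who-ignore-robot-exclusion",
    "owner": "listadmin@somewhere.example",
    "members": [
        ("RoboGrep",   "someone@site-a.example"),
        ("SlurpZilla", "someone@site-b.example"),
    ],
}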

the lists live on the federated db. via the gui, wwweb hackers can get the
lists transformed into their robot filter instrument, e.g. "robots.txt", by
selecting robot lists and saying how each should be filtered (disallow) or
directed (allow). or they can just get the raw lists.
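
the transformation is just a matter of expanding each class rule into one
record per member robot. a rough sketch (robot names, class names and
paths are all invented; it emits plain User-agent/Disallow records as in
the robot exclusion scheme):

# expand per-class rules into a flat robots.txt that names each robot.
# the sample data is invented for illustration.
class_members = {
    "robots-who-ignore-robot-exclusion": ["RoboGrep", "SlurpZilla"],
    "well-behaved-indexers": ["FriendlyIndexer"],
}
class_rules = {
    "*": ["/cgi-bin/"],                          # default for everybody else
    "robots-who-ignore-robot-exclusion": ["/"],  # shut out entirely
    "well-behaved-indexers": ["/private/"],
}

def robots_txt(class_members, class_rules):
    lines = []
    for cls, paths in class_rules.items():
        # "*" stays a literal default record; a class becomes one record
        # per member robot, which is what makes the filter large.
        agents = ["*"] if cls == "*" else class_members.get(cls, [])
        for agent in agents:
            lines.append("User-agent: %s" % agent)
            lines.extend("Disallow: %s" % p for p in paths)
            lines.append("")
    return "\n".join(lines)

print(robots_txt(class_members, class_rules))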

if it incorporated a federated robots db, it *could be* translated into a
URI strategy where the URI

/robots

served up the whole "robots.txt" filter instrument, and then if you're
"RoboGrep" you might try the URI

/robots/RoboGrep

to see if you can get just the filter applicable to you without fetching
the whole thing.
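
from the robot's side that's just an optimistic fetch plus a fallback.
something like this (the host is invented and this URI layout isn't any
kind of standard, just the idea above):

# a robot called "RoboGrep" asks for its own slice of the filter first,
# then falls back to the whole thing.  host and URI layout are made up.
import urllib.error
import urllib.request

def fetch_filter(host, robot_name):
    for path in ("/robots/%s" % robot_name, "/robots"):
        try:
            with urllib.request.urlopen("http://%s%s" % (host, path)) as r:
                return r.read().decode("utf-8", "replace")
        except urllib.error.URLError:
            continue                 # not there (or unreachable), try next
    return ""                        # no filter served at all

filter_text = fetch_filter("www.example.com", "RoboGrep")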

this is an approach i would like for my wwweb server because of the way
it's written... for CGI-centric wwwebs it's only useful if you have an
"index.cgi" root processor for a branch, and you think running a cgi to
serve a piece of your robot exclusion filter is worthwhile.
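
the "index.cgi" for the /robots branch wouldn't have to be much. a
bare-bones sketch, assuming the per-robot pieces are pregenerated into a
"filters" directory (that layout, and the file named ALL for the whole
filter, are made up):

#!/usr/bin/env python3
# minimal index.cgi for a /robots branch: no PATH_INFO means the whole
# filter, /RoboGrep means just that robot's piece.  the on-disk layout
# (filters/<robot name>, filters/ALL) is invented for the sketch.
import os
import sys

FILTER_DIR = "filters"

name = os.path.basename(os.environ.get("PATH_INFO", "").strip("/"))
path = os.path.join(FILTER_DIR, name or "ALL")

if os.path.isfile(path):
    sys.stdout.write("Content-Type: text/plain\r\n\r\n")
    with open(path) as f:
        sys.stdout.write(f.read())
else:
    sys.stdout.write("Status: 404 Not Found\r\n")
    sys.stdout.write("Content-Type: text/plain\r\n\r\n")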

this idea starts to look practical when the filter instrument could get
large, which is likely where a federated db of robot classification (like
a dns map) is inserting an enumeration of robot names for each robot class
(list) rule.

the user interface would let you write rules against all robots or against
robot classes drawn from the robot exclusion lists. the resulting filter
would enumerate every robot in those classes, and so would be large.

it would enumerate all robots so that robots can stay dumb and purely
self-interested: each one just looks for a default record or a record
naming it specifically, rather than having to understand classes.
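
as an aside, that name-or-default lookup is exactly what a stock parser
like python's urllib.robotparser does (the host and url below are
invented):

# the dumb, self-interested lookup: use the record naming this robot if
# one exists, otherwise fall back to the "*" default.  host/url invented.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

if rp.can_fetch("RoboGrep", "http://www.example.com/private/report.html"):
    print("RoboGrep may fetch it")
else:
    print("excluded; skip it")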

this is a kinda weak and malformed presentation, but i hope the idea is
clear. i'll make a decent presentation when i have some comments... unless
everyone thinks this is a very bad idea.

-john