Brilliant deduction, aside from the fact that MOMspider does exactly that
and was one of the first robots to implement the robot guidelines and
the first to distribute code for others to do the same. /robots.txt must
be read because it can steer the robot (any robot) away from URLs that
have problematic side-effects (as do many old CGI scripts).
The information can be cached (for a reasonable period of time) to reduce
load, but it cannot be safely ignored. Running a program that ignores
the /robots.txt is equivalent to running an unsafe program, and you are
responsible for ANY detrimental effects of that program, including lost
bandwidth and the time/personnel cost of the webmasters who track you down.
Think about that before running your program on someone else's site
without permission.
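The behavior described above can be sketched with Python's standard-library robots.txt parser; the sample robots.txt, robot name, and URLs below are illustrative assumptions, not anything from the original post:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical /robots.txt that steers robots away from old CGI scripts,
# the kind of problematic side-effect-laden URLs described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
# In a real robot, the parsed result would be fetched once per host and
# cached for a reasonable period to reduce load on the server.

# A compliant robot checks every URL before fetching it.
print(rp.can_fetch("MyRobot", "http://example.com/cgi-bin/old-script"))  # False
print(rp.can_fetch("MyRobot", "http://example.com/index.html"))          # True
```

In practice a robot would fetch `http://host/robots.txt` itself (e.g. via `RobotFileParser(url).read()`) rather than parse an inline string; the inline form here just keeps the sketch self-contained.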
...Roy T. Fielding
Department of Information & Computer Science (fielding@ics.uci.edu)
University of California, Irvine, CA 92697-3425 fax:+1(714)824-4056
http://www.ics.uci.edu/~fielding/