Here are some techniques to alert a webmaster to a possible
attack:
(A) Alert webmaster about excessive requests by a suspected
robot.
(B) Alert webmaster about excessive error status codes being
generated in response by a suspected robot.
(C) Alert webmaster about a URL being requested that no human
would request.
(D) Capture a robot in an infinite loop trap.
(E) Trap the robot into retrieving "a gigabyte-size HTML
document generated on-the-fly" (1)
I welcome your frank criticisms and flames about these
techniques.
(A) Alert webmaster about excessive requests by a suspected
robot. A simple script run every few minutes would grep through
the log files, looking for more than 1,000 requests from a
single From, User-Agent, or Referer value, or for an unusually
high number of requests in a short interval from a single host.
The results are printed to a file, and a standard alert message
is mailed to the webmaster's mailbox or printed to the screen.
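A rough sketch of what I have in mind, in Python, counting per
host for simplicity; the log path, threshold, and addresses are
only placeholders, and the log is assumed to be in Common Log
Format:

#!/usr/bin/env python3
# Count requests per client host in a Common Log Format access log
# and mail an alert if any host exceeds a threshold.
import collections
import smtplib
from email.message import EmailMessage

LOG_FILE = "/var/log/httpd/access_log"   # placeholder location
THRESHOLD = 1000                         # requests per scan interval
WEBMASTER = "webmaster@example.com"      # placeholder address

counts = collections.Counter()
with open(LOG_FILE) as log:
    for line in log:
        host = line.split(" ", 1)[0]     # first field is the remote host
        counts[host] += 1

offenders = {h: n for h, n in counts.items() if n > THRESHOLD}
if offenders:
    report = "\n".join("%s: %d requests" % (h, n)
                       for h, n in offenders.items())
    msg = EmailMessage()
    msg["Subject"] = "Robot alert: excessive requests"
    msg["From"] = WEBMASTER
    msg["To"] = WEBMASTER
    msg.set_content(report)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)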
(B) Alert webmaster about excessive error status codes being
generated in response to a suspected robot's requests for
non-existent or otherwise bad URLs. A simple script run every
few minutes would grep through the log files, looking for more
than 10 error status codes being generated. The results are
printed to a file, etc. A list of acceptable and suspicious
error status codes for the peculiar needs of my website is
given in note (2).
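Again a rough Python sketch; the log path, threshold, and the
list of codes (taken from note (2)) are only placeholders:

#!/usr/bin/env python3
# Count suspicious status codes per client host in a Common Log
# Format access log and report any host that generates too many.
import collections

LOG_FILE = "/var/log/httpd/access_log"   # placeholder location
THRESHOLD = 10
SUSPICIOUS = {"204", "300", "301", "302", "303", "304", "400", "401",
              "402", "403", "404", "405", "406", "407", "408", "409",
              "410", "500", "501", "502", "504"}

errors = collections.Counter()
with open(LOG_FILE) as log:
    for line in log:
        fields = line.split()
        if len(fields) < 2:
            continue
        host, status = fields[0], fields[-2]   # status is next-to-last field
        if status in SUSPICIOUS:
            errors[host] += 1

for host, n in errors.items():
    if n > THRESHOLD:
        print("ALERT: %s generated %d suspicious status codes" % (host, n))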
(C) Alert webmaster about a URL being requested that no human
would request. A link not easily visible to a human being is
included on a page. This link leads to a warning that the
requester has stumbled upon a robot defense and should exit
immediately rather than delve further, or risk having the
requesting host banned from the website. In the event that the
robot does delve further, a message is printed to a file, etc.
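One way to do it: hide a one-pixel link on a page and point it
at a small CGI script that logs whoever follows it. A rough
Python sketch; the script name, log path, and wording are only
placeholders:

#!/usr/bin/env python3
# CGI script behind a hidden link such as
#   <a href="/cgi-bin/trap.py"><img src="dot.gif" width=1 height=1 alt=""></a>
# Any request that reaches it is logged and answered with a warning.
import os
import time

with open("/var/log/robot-trap.log", "a") as log:   # placeholder path
    log.write("%s %s %s\n" % (time.ctime(),
                              os.environ.get("REMOTE_ADDR", "-"),
                              os.environ.get("HTTP_USER_AGENT", "-")))

print("Content-Type: text/html")
print()
print("<html><body><h1>Robot defense</h1>"
      "<p>You have reached a trap link.  Leave now, or your host"
      " may be banned from this website.</p></body></html>")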
(D) Capture the robot in an infinite loop trap that does not
consume too many resources. In the event that the infinite loop
trap is tripped, a message is printed to a file, etc. Please
reply with examples of simple infinite loops.
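Something along these lines might work: a CGI script whose
output links back to itself with a different query string each
time, so a robot that follows links never runs out of "new"
pages, while each page costs almost nothing to generate. The
script name and parameter scheme are only placeholders:

#!/usr/bin/env python3
# Spider-trap page: every request returns a tiny page whose links
# point back to this same script with different query strings.
import os

depth = 0
query = os.environ.get("QUERY_STRING", "")
if query.startswith("d="):
    try:
        depth = int(query[2:])
    except ValueError:
        depth = 0

print("Content-Type: text/html")
print()
print("<html><body><p>Index page %d</p>" % depth)
for i in range(5):
    # Five "new" pages that are really this script again.
    print('<a href="/cgi-bin/loop.py?d=%d">more</a>' % (depth * 5 + i + 1))
print("</body></html>")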
(E) Trap the robot into retrieving "a gigabyte-size HTML
document generated on-the-fly" (1). Please reply with
examples of this technique.
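I imagine something like the following, which streams the
document a line at a time so the server never holds it all in
memory while the robot that fetches it has to cope with the full
size; the size and filler text are only placeholders:

#!/usr/bin/env python3
# Stream a gigabyte-size HTML document generated on-the-fly.
import sys

ONE_GIGABYTE = 1024 * 1024 * 1024
line = "<p>" + "robot bait " * 100 + "</p>\n"

sys.stdout.write("Content-Type: text/html\n\n")
sys.stdout.write("<html><body>\n")
written = 0
while written < ONE_GIGABYTE:
    sys.stdout.write(line)
    written += len(line)
sys.stdout.write("</body></html>\n")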
Of course robots.txt will not keep out deliberately
misconfigured robots.
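Well-behaved robots can still be warned away from the trap URLs
with an ordinary robots.txt (the paths below just match the
placeholder scripts above), so only the badly behaved ones ever
reach them:

User-agent: *
Disallow: /cgi-bin/trap.py
Disallow: /cgi-bin/loop.py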
(1) Quote from Internet Agents: Spiders, Wanderers, Brokers,
and Bots by Fah-Chun Cheong, New Riders Publishing, 1996.
(2) Suspicious error status codes: 204 no content, 300
multiple choices, 301 moved permanently, 302 moved
temporarily, 303 see other, 304 not modified, 400 bad request,
401 unauthorized, 402 payment required, 403 forbidden, 404 not
found, 405 method not allowed, 406 not acceptable, 407 proxy
authentication required, 408 request timeout, 409 conflict,
410 gone, 500 internal server error, 501 not implemented, 502
bad gateway, 504 gateway timeout.
Acceptable status codes: 200 OK, 201 created.
Thank you for reading this post, and for your frank criticisms
and flames, and for your examples.