anti-robot regexps

Hallvard B Furuseth (h.b.furuseth@usit.uio.no)
Fri, 8 Nov 1996 20:43:11 +0100 (MET)


If robots.txt allows general regexps, a few strategic ((((.*)*)*)*)*'s
would make URL parsing an O(n^5) operation instead of O(n). Robot haters
would love this.

We need a method to check that a regexp will not take too long to run
-- while at the same time not crippling regexps too much.
Or we could use something other than regexps.
Does anyone know of such a regexp checker, or of another text parser?
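
For what it's worth, here is a minimal sketch (in C) of one possible
check: reject any pattern in which a '*' or '+' applies to a group that
itself contains a '*' or '+', since such nested quantifiers are the
usual trigger for the blow-up above. It assumes plain egrep-style
syntax with no escapes, character classes or {m,n} bounds; the function
name and the nesting limit are just placeholders.

    /* Hypothetical helper, a sketch only: reject a pattern if a '*' or '+'
     * is applied to a parenthesised group which itself contains '*' or '+'.
     * Assumes egrep-style syntax with no escapes, classes or {m,n} bounds. */
    int looks_dangerous(const char *re)
    {
        int depth = 0;
        int has_star[64];       /* does the group at this depth contain * or + ? */
        const char *p;

        has_star[0] = 0;
        for (p = re; *p; p++) {
            switch (*p) {
            case '(':
                if (++depth >= 64)
                    return 1;   /* absurdly deep nesting: call it dangerous */
                has_star[depth] = 0;
                break;
            case ')':
                if (depth > 0) {
                    int inner = has_star[depth--];
                    if (inner && (p[1] == '*' || p[1] == '+'))
                        return 1;   /* nested quantifier, e.g. (a*)* */
                    has_star[depth] |= inner;
                }
                break;
            case '*':
            case '+':
                if (depth > 0)
                    has_star[depth] = 1;
                break;
            }
        }
        return 0;
    }

A robot could simply refuse, or ignore, any robots.txt line whose
pattern fails such a test.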

Or robots can run alarm() before executing a robots.txt file's regexps,
and abort and ignore the site if a regexp takes too long. That's the
nicest alternative for robots.txt authors, but --
Either the robot process parsing that URL must exit and be restarted, or
the robot's system must allow programs to longjmp out of signal
handlers, and the regexp code must have no internal static state
which can be corrupted by such a longjmp. Is this feasible?
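
For the curious, a rough sketch of that alarm() idea, assuming a
POSIX-style regexec() and a system that permits siglongjmp() out of a
SIGALRM handler. Whether the regexp library survives being abandoned
mid-match is exactly the open question above, so treat this as an
illustration rather than a recipe:

    /* Rough sketch only: time-limit a POSIX regexec() call with alarm().
     * Assumes siglongjmp() out of a SIGALRM handler is permitted, and that
     * the regexp library tolerates being abandoned mid-match -- which is
     * precisely the open question. */
    #include <regex.h>
    #include <setjmp.h>
    #include <signal.h>
    #include <unistd.h>

    static sigjmp_buf timeout_env;

    static void on_alarm(int sig)
    {
        (void) sig;
        siglongjmp(timeout_env, 1);     /* abandon the match in progress */
    }

    /* Returns 1 on match, 0 on mismatch, -1 if the regexp took too long. */
    int match_with_timeout(const regex_t *re, const char *url, unsigned seconds)
    {
        int matched;

        if (sigsetjmp(timeout_env, 1))  /* we land here if the alarm fired */
            return -1;                  /* caller ignores the site */
        signal(SIGALRM, on_alarm);
        alarm(seconds);
        matched = (regexec(re, url, 0, (regmatch_t *) 0, 0) == 0);
        alarm(0);                       /* cancel the pending alarm */
        return matched;
    }

Even if the jump itself works, the compiled pattern and any static
buffers inside the library may be left inconsistent afterwards, so the
robot should probably re-create the regex_t -- or restart the process --
after a timeout.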

Regards,

Hallvard