Re: Defenses against bad robots

Benjamin Franz (snowhare@netimages.com)
Sat, 18 May 1996 07:43:01 -0700 (PDT)


On Fri, 17 May 1996 mred@neosoft.com wrote:

> ** Reply to note from Benjamin Franz <snowhare@netimages.com> 05/17/96 5:21pm -0700
>
> > It would be better to use a directory of static web pages with static
> > links. A few hundred chained together with the last pointing to a CGI
> > script to notify you of the trip. That way you don't pay any CGI
> > load penalty unless the trip actually happens. Have the CGI record all the
> > pertinent info and mail it to you. A short script could easily generate
> > a few hundred chained static pages in a matter of seconds. Add that
> > directory to your robots.txt file and the only thing you should see is
> > rogue bots.
>
> Assuming the robot goes a few hundred levels deep. If it's a well-written
> malicious robot, it will be doing breadth-first gathering; in which case, I
> seriously doubt it's going to get that far. If it does, your server will have
> already taken quite a huge hit, since the robot has probably retrieved
> everything you have before finally hitting the trap.

So make the first page a list of about a thousand links to the top of the
trap chain - with *one* link directly to the trip page buried in the middle
of the list. If the robot is doing depth-first searching, it will head down
the chain. If it is doing a breadth-first search, it will hit the direct
link to the trap after only a few hundred hits. You might even be able to
make the link to the trap page invisible to normal browsers by making it a
commented-out A link on the home page:

<!-- A HREF="boguslink.html" -->
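
For what it's worth, here is a rough sketch (in Python - the directory,
file names and trip URL are made up) of the kind of short script that
could generate the chained trap pages plus the decoy page of links:

#!/usr/bin/env python
# Sketch: generate chained static trap pages plus a decoy page full of
# links, with one link straight to the trip CGI buried in the middle.
# Paths and the trip URL are hypothetical.

import os

TRAP_DIR = "trap"          # list this directory in robots.txt
CHAIN_LENGTH = 300         # number of chained static pages
DECOY_LINKS = 1000         # links on the decoy page
TRIP_URL = "/cgi-bin/trip" # CGI that records and mails the hit

os.makedirs(TRAP_DIR, exist_ok=True)

# Chained pages: each links only to the next; the last links to the trip.
for i in range(CHAIN_LENGTH):
    nxt = TRIP_URL if i == CHAIN_LENGTH - 1 else "page%d.html" % (i + 1)
    with open(os.path.join(TRAP_DIR, "page%d.html" % i), "w") as f:
        f.write('<HTML><BODY><A HREF="%s">next</A></BODY></HTML>\n' % nxt)

# Decoy page: many links to the top of the chain, one direct link to the
# trip hidden in the middle.
with open(os.path.join(TRAP_DIR, "index.html"), "w") as f:
    f.write("<HTML><BODY>\n")
    for i in range(DECOY_LINKS):
        target = TRIP_URL if i == DECOY_LINKS // 2 else "page0.html"
        f.write('<A HREF="%s">link</A>\n' % target)
    f.write("</BODY></HTML>\n")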

Many robots are probably not even doing HTML parsing when searching for
links - just scanning for HREF="..." strings. Brute-force heuristics are
much easier to write than good parsers, particularly when so many pages
have HTML that can't be parsed according to actual SGML rules.
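
To see why the commented-out link still gets followed, here is a small
sketch (Python, contrived page text) of the kind of brute-force link
extraction such a robot might do - a bare HREF="..." pattern match pulls
out the bogus link even though a real parser would skip the comment:

#!/usr/bin/env python
# Brute-force link extraction: a plain regular-expression match on
# HREF="..." picks up links an SGML-aware parser would ignore, such as
# one buried inside an HTML comment.

import re

page = '''<HTML><BODY>
<A HREF="real.html">a normal link</A>
<!-- A HREF="boguslink.html" -->
</BODY></HTML>'''

links = re.findall(r'HREF="([^"]*)"', page, re.IGNORECASE)
print(links)   # ['real.html', 'boguslink.html'] - the comment didn't hide it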

There is no way to ensure that the robot won't (at least try to) index the
whole site before tripping your detector - the detector is really just a
'heads up' to let you know about the robot and prevent future problems with
the same one, say by adding its domain/IP to your server's access deny
filter, or to the firewall IP filter if it is really persistent.

--
Benjamin Franz