Re: The Internet Archive robot

Michael Göckel (Michael@cybercon.technopark.gmd.de)
Fri, 20 Sep 1996 20:47:21 +0100


Jeremy Sigmon wrote:

> Maybe you should be able to request a "load level" from the server
> and if it is low enough then grab the pages.

As we discussed before, the response time (lag) of your request is a
good clue to server load. Although it depends on a broad range of
parameters (connection bandwidth, number of hops, etc.), it is the only
way I know of to get information about the server load. If you want to
be perfect, you can do a ping before your actual request to get an
estimate of the "net-delay" on its own.
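
As a minimal sketch in Python (the function name is mine, not from any
actual robot), using the duration of the request itself as the load clue:

    import time
    import urllib.request

    def timed_fetch(url):
        # Time the whole request; a slow response hints at a loaded
        # server (or a slow path to it, which a ping done beforehand
        # would help to separate out).
        start = time.monotonic()
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        return body, time.monotonic() - start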

My robot (http://www.hotlist.de) uses this strategy (without the ping):
I fetch a maximum of one document per server during a period of about
20 x (time for the last fetch) + 300 seconds. So if the last fetch took
2 seconds, the same server is left alone for roughly 340 seconds.
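
In a rough Python sketch (the names are mine), that per-server schedule
would look like this:

    next_allowed = {}   # host -> earliest time for the next fetch

    def schedule_next(host, fetch_seconds, now):
        # Leave the server alone for about 20 x (time of the last
        # fetch) + 300 seconds before touching it again.
        next_allowed[host] = now + 20 * fetch_seconds + 300

    def may_fetch(host, now):
        return now >= next_allowed.get(host, 0)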

That isn't too much.

To limit net load there is no other way than running your robots at
off-peak times (02:00 to 08:00) or slowing them down at other times.

Here in Germany there are not that many robots these days, and my
opinion is: it is better to have one good search engine with extended
features, such as informing customers by email when it finds new hits
for a search term, than for everybody to implement their own (maybe
browser built-in) spider agent.

Another interesting thing would be to collect information about a
user's special interests or browsing behavior in order to offer a
better service, but German "Datenschutz" (data protection) laws
prevent us from doing this.

-- 
------------------------------------------------------------------
Michael Göckel                               CyberCon Gesellschaft
Michael@cybercon.technopark.gmd.de             für neue Medien mbH
Tel. 0 22 41 / 93 50 -0                            Rathausallee 10
Fax: 0 22 41 / 93 50 -99                        53757 St. Augustin
www.cybercon.technopark.gmd.de                             Germany
------------------------------------------------------------------