Cache Filler

Nigel Rantor (wiggly@deepsea.sys.web-uk.net)
Fri, 29 Nov 1996 12:56:05 +0000 (GMT)


Allow me to introduce myself,

I am a programmer at a London (England) based internet service company. We
have been using caching HTTP servers (Apache) for quite some time now, but
thought it would be nice to be able to prime the cache with certain sites:
either because we know that users often visit them, because we know of a
new popular site that has just opened that we would like our users to have
immediate access to, or indeed to prime and take care of a web cache that
is to be used as a proxy for other, subsidiary caches...

So I am currently developing just such a
program/robot/agent/crawler/spider/ant/worm/<favourite term here>.

Why?

Well, because I haven't heard of anything that just goes around filling a
cache machine. Plus it will hopefully cut down on the bandwidth used by
our users if we can cache a lot of stuff from popular sites *once*, which
is the whole reason for this: cut down on the number of hits to outside
servers, and provide customers with quicker access times.

Yes, OK, given enough time the cache will fill up on its own, but as I said
this is to be used mainly for *priming* a cache machine, and adding new
sites that become popular.

In addition, this is to be a Webmaster's tool down here, not for direct
use by subscribers. This means that users won't be able to send it off to
gather porn (we have enough trouble expiring news without caching it all
from the web!) :P

How?

I'm using C++ because I like it, and I can use lots of existing socket
libraries and string manipulation classes. Hey, it makes life easier and I
can worry about reading robots.txt instead...
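
To give a flavour of it, here is a rough sketch of the core operation:
asking the caching proxy for a page, so that the act of fetching it is
what primes the cache. The proxy host and port below are made up, and it
uses plain BSD sockets rather than whichever socket library I finally
settle on, but it shows the idea:

// prime.cpp -- rough sketch: fetch one URL *through* the caching proxy so
// that the fetch itself primes the cache.  Proxy host/port are made up;
// the real thing would read them from a config file.
#include <cstdio>
#include <cstring>
#include <string>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(int argc, char* argv[])
{
    const char* proxy_host = "cache.example.net";  // hypothetical cache box
    const int   proxy_port = 8080;                 // hypothetical proxy port
    const char* url = (argc > 1) ? argv[1] : "http://www.example.com/";

    hostent* he = gethostbyname(proxy_host);
    if (!he) { std::fprintf(stderr, "unknown host %s\n", proxy_host); return 1; }

    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) { std::perror("socket"); return 1; }

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(proxy_port);
    std::memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);

    if (connect(s, (sockaddr*)&addr, sizeof(addr)) < 0) {
        std::perror("connect");
        return 1;
    }

    // Proxy-style request: the full URL goes on the request line and the
    // proxy fetches (and caches) the document on our behalf.
    std::string req = std::string("GET ") + url + " HTTP/1.0\r\n"
                      "User-Agent: Snarf/v0.0-pre-alpha\r\n"
                      "\r\n";
    // Ignoring short writes for brevity.
    if (write(s, req.c_str(), req.size()) < 0) { std::perror("write"); return 1; }

    // Drain the response; the real crawler would parse it for links instead.
    char buf[4096];
    ssize_t n;
    while ((n = read(s, buf, sizeof(buf))) > 0)
        std::fwrite(buf, 1, n, stdout);

    close(s);
    return 0;
}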

Programming isn't the problem though; features are. If there are any
simple features that could be added to a new crawler for use by a wider
community, I would be happy to read proposals from any and all sources.

When?

Soon. I am working on it now; I will test locally, since we have a
plethora of machines to mess around with, and then I will approach friends
who manage sites to see if I can hit theirs for testing.

This means that you shouldn't see it in any access logs until it has been
tested locally and on some cooperating outside systems. However, if you do
see it in your logs and you haven't been approached by me regarding
testing, PLEASE TELL ME! It will be using a User-Agent: field as follows:

User-Agent: Snarf/v0.0-pre-alpha

Well it probably will...

Other things...

Well, I have read a lot of the archived stuff on this group and consumed
Martijn Koster's pages. I expect to conform to robots.txt, deal with
relative links including the '.' and '..' directories, use raw IP
addresses to index previously visited servers to get around aliasing,
possibly limit the depth of searches (although this depends on the site),
and all that other stuff to make it a 'nice' robot...
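
As an example of the link handling, here is a rough sketch of collapsing
'.' and '..' segments out of a path, so that two URLs that spell the same
document differently only get indexed (and fetched) once. The file and
function names are just placeholders, not the real module:

// normalize.cpp -- rough sketch of collapsing '.' and '..' segments in a
// URL path, so that /a/b/../c/./d and /a/c/d index the same document.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::string normalize_path(const std::string& path)
{
    std::vector<std::string> parts;
    std::istringstream in(path);
    std::string seg;

    while (std::getline(in, seg, '/')) {
        if (seg.empty() || seg == ".")
            continue;                  // drop empty and '.' segments
        if (seg == "..") {
            if (!parts.empty())
                parts.pop_back();      // '..' cancels the previous segment
            continue;
        }
        parts.push_back(seg);
    }

    std::string out;
    for (std::vector<std::string>::size_type i = 0; i < parts.size(); ++i)
        out += "/" + parts[i];
    if (out.empty())
        out = "/";
    // Keep a trailing slash on directory-style paths.
    if (!path.empty() && path[path.size() - 1] == '/' && out != "/")
        out += "/";
    return out;
}

int main()
{
    std::cout << normalize_path("/a/b/../c/./d") << "\n";  // /a/c/d
    std::cout << normalize_path("/a/./b/")       << "\n";  // /a/b/
    std::cout << normalize_path("/../x")         << "\n";  // /x
    return 0;
}

The table of previously visited servers would then be keyed on the raw IP
address rather than the host name, so aliased names don't get a site
crawled twice.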

I'll be on this group from now on to catch any other ideas or proposals
for 'bots, and if they apply I guess I'll try to stick to them.

Apart from that I'll accept any suggestions of what [not] to do,

Cheers

Nige

+--------------------------------------------------------------------+
| Nigel A Rantor | WEB Ltd |
| e-Mail: nigel@mail.bogo.co.uk | The Pall Mall Deposit |
| | 124-128 Barlby Road |
| Tel: 0181-960-3050 | London W10 6BL |
+--------------------------------------------------------------------+
| She lies and says shes in love with him, |
| Can't find a better man, |
| Better Man - Pearl Jam |
+--------------------------------------------------------------------+
