Ian
--
Ian Graham ........................................ ian.graham@utoronto.ca
Information Commons                                      Tel: 416-978-4548
University of Toronto                                    Fax: 416-978-7705
> Allow me to introduce myself,
>
> I am a programmer at a London (England) based internet service
> company. We have been using caching HTTP servers (Apache) for quite
> some time now, but thought it would be nice to be able to prime the
> cache with certain sites, either because we know that users often
> visit them, or because we know of a new popular site that has just
> opened that we would like our users to have immediate access to, or
> indeed to prime and take care of a web cache that is to be used as a
> proxy for other, subsidiary caches...
>
> So I am currently developing just such a
> program/robot/agent/crawler/spider/ant/worm/<favourite term here>.
>
> Why?
>
> Well, because I haven't heard of anything that just goes around
> filling a cache machine. Plus it will hopefully cut down on the
> bandwidth being used by our users if we can cache *once* a lot of
> stuff from popular sites, which is the whole reason for this: cut
> down on the number of hits to outside servers, and provide customers
> with quicker access times.
>
> Yes, OK, given enough time the cache will fill up on its own, but as
> I said this is to be used mainly for *priming* a cache machine, and
> for adding new sites that become popular.
>
> In addition, this is to be a Webmaster's tool down here, not for
> direct use by subscribers. This means that users won't be able to
> send it off to gather porn (enough trouble expiring news without
> caching it all from the web!) :P
>
> How?
>
> I'm using C++ because I like it, and I can use lots of existing
> socket libraries and string manipulation classes. Hey, it makes life
> easier and I can worry about reading robots.txt instead...
>
> Programming isn't the problem, though; features are. If there are
> any simple features that could be added to a new crawler for use by
> a wider community, I would be happy to read proposals from any and
> all sources.
>
> When?
>
> Soon. I am working on it now; I will test locally, since we have a
> plethora of machines to mess around with, and then I will approach
> friends who manage sites to see if I can hit theirs for testing.
>
> This means that you shouldn't see it in any access logs until it has
> been tested locally and on some cooperating outside systems.
> However, if you do see it in your logs and you haven't been
> approached by me regarding testing, PLEASE TELL ME! It will be using
> a User-Agent: field as follows:
>
> User-Agent: Snarf/v0.0-pre-alpha
>
> Well, it probably will...
>
> Other things...
>
> Well, I have read a lot of the archived stuff on this group, and
> consumed Martijn Koster's pages. I expect to conform to robots.txt,
> deal with relative links including the '.' and '..' directories, use
> raw IP addresses to index previously visited servers to get around
> aliasing, possibly limit the depth of searches (although this
> depends on the site), and all that other stuff to make it a 'nice'
> robot...
>
> I'll be on this group from now on to catch any other ideas or
> proposals for 'bots, and if they apply I guess I'll try to stick to
> them.
> Apart from that I'll accept any suggestions of what [not] to do.
>
> Cheers
>
> Nige
>
> +--------------------------------------------------------------------+
> | Nigel A Rantor                 | WEB Ltd                           |
> | e-Mail: nigel@mail.bogo.co.uk  | The Pall Mall Deposit             |
> |                                | 124-128 Barlby Road               |
> | Tel: 0181-960-3050             | London W10 6BL                    |
> +--------------------------------------------------------------------+
> | She lies and says shes in love with him,                           |
> | Can't find a better man,                                           |
> |                       Better Man - Pearl Jam                       |
> +--------------------------------------------------------------------+
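
For anyone wondering what "conform to robots.txt" involved at the time: the original exclusion format is just records of User-agent and Disallow lines separated by blank lines, and a URL path is off limits if it begins with any applicable Disallow prefix. A minimal C++ matcher along those lines might look like the sketch below; the function names and the prefix-only, case-sensitive matching are illustrative assumptions, not code from Snarf.

    // Hypothetical sketch of a matcher for the simple 1994-era robots.txt
    // format (User-agent / Disallow records separated by blank lines).
    // Not taken from Snarf.
    #include <cctype>
    #include <sstream>
    #include <string>
    #include <vector>

    // Collect the Disallow prefixes that apply to a given robot name
    // (a record naming "*" applies to everyone).
    std::vector<std::string> disallowed_prefixes(const std::string& robots_txt,
                                                 const std::string& robot_name)
    {
        std::vector<std::string> prefixes;
        std::istringstream in(robots_txt);
        std::string line;
        bool record_applies = false;
        while (std::getline(in, line)) {
            std::string::size_type hash = line.find('#');
            if (hash != std::string::npos)
                line.erase(hash);                 // strip comments
            while (!line.empty() &&
                   std::isspace(static_cast<unsigned char>(line[line.size() - 1])))
                line.erase(line.size() - 1);      // strip trailing CR/space
            if (line.empty()) {                   // a blank line ends a record
                record_applies = false;
                continue;
            }
            std::string::size_type colon = line.find(':');
            if (colon == std::string::npos)
                continue;
            std::string field = line.substr(0, colon);
            std::string value = line.substr(colon + 1);
            while (!value.empty() &&
                   std::isspace(static_cast<unsigned char>(value[0])))
                value.erase(0, 1);                // strip leading space
            if (field == "User-agent") {
                if (value == "*" || value.find(robot_name) != std::string::npos)
                    record_applies = true;
            } else if (field == "Disallow" && record_applies && !value.empty()) {
                prefixes.push_back(value);
            }
        }
        return prefixes;
    }

    // A path may be fetched only if no applicable Disallow prefix matches
    // the start of it.
    bool path_allowed(const std::string& path,
                      const std::vector<std::string>& prefixes)
    {
        for (std::size_t i = 0; i < prefixes.size(); ++i)
            if (path.compare(0, prefixes[i].size(), prefixes[i]) == 0)
                return false;
        return true;
    }

A real implementation would presumably also fold the field names to lower case and cache the parsed rules per server rather than re-fetching /robots.txt for every URL.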
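
Dealing with relative links "including the '.' and '..' directories" largely comes down to collapsing those segments once a link has been joined onto the base document's path, so that two spellings of the same URL are not fetched twice. A rough sketch, again with hypothetical names rather than anything from Snarf:

    // Hypothetical sketch: collapse "." and ".." segments in a URL path.
    #include <sstream>
    #include <string>
    #include <vector>

    std::string normalise_path(const std::string& path)
    {
        std::vector<std::string> kept;
        std::istringstream in(path);
        std::string segment;
        while (std::getline(in, segment, '/')) {
            if (segment.empty() || segment == ".")
                continue;                  // ignore "//" and "." segments
            if (segment == "..") {
                if (!kept.empty())
                    kept.pop_back();       // ".." discards the previous segment
            } else {
                kept.push_back(segment);
            }
        }
        std::string out;
        for (std::size_t i = 0; i < kept.size(); ++i)
            out += "/" + kept[i];
        return out.empty() ? "/" : out;
    }

    // e.g. normalise_path("/docs/../images/./logo.gif") yields "/images/logo.gif"

Note that this version drops a trailing slash ("/dir/" becomes "/dir"), a distinction a crawler may still care about when resolving further relative links.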
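
As for using "raw IP addresses to index previously visited servers to get around aliasing": the idea is to resolve each hostname once and key per-server bookkeeping (robots.txt rules, pages already fetched, politeness delays) on the resulting address, so that two names for the same machine share a single entry. A possible sketch using the classic BSD resolver calls of the period; this is an assumed approach, not Snarf's actual code:

    // Hypothetical sketch: key visited-server records by dotted-quad IP so
    // that aliases such as www.example.com and example.com collapse to one
    // entry.  Uses gethostbyname()/inet_ntoa(), the usual calls at the time.
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <cstring>
    #include <string>

    // Resolve a hostname to its first IPv4 address as text, falling back
    // to the hostname itself if resolution fails.
    std::string server_key(const std::string& host)
    {
        hostent* he = gethostbyname(host.c_str());
        if (he == 0 || he->h_addrtype != AF_INET || he->h_addr_list[0] == 0)
            return host;
        in_addr addr;
        std::memcpy(&addr, he->h_addr_list[0], sizeof(addr));
        return inet_ntoa(addr);
    }

    // Typical use: store robots.txt rules, visited pages and last-access
    // times in a map keyed by server_key(host) instead of by the hostname.

One caveat: name-based virtual hosts share an address, so a crawler taking this route would probably want to key on the IP together with the hostname it sends in the request rather than on the IP alone.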
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html