The Web Robots Pages
Guidelines for Robot Writers
Martijn Koster, 1993
This document contains some suggestions for people who are
thinking about developing Web Wanderers (Robots), programs that
traverse the Web.
Reconsider
Are you sure you really need a robot? They put a strain on network
and processing resources all over the world, so consider if your
purpose is really worth it. Also, the purpose for which you want
to run your robot is probably not as novel as you think; there
are already many other spiders out there.
Perhaps you can make use of the data collected by one of the other
spiders (check the list of robots and
the mailing list).
Finally, are you sure you can cope with the results?
Retrieving the entire Web is not a scalable solution; it is just
too big. If you do decide to do it, don't aim to traverse the
entire Web: only go a few levels deep.
Be Accountable
If you do decide you want to write and/or run one, make sure that
if your actions do cause problems, people can easily contact you
and start a dialog. Specifically:
- Identify your Web Wanderer
HTTP supports a User-agent field to identify a WWW browser. As your
robot is a kind of WWW browser, use this field to name your robot,
e.g. "NottinghamRobot/1.0". This will allow server maintainers to set
your robot apart from human users using interactive browsers. It is
also recommended to run it from a machine registered in the DNS, which
will make it easier to recognise, and will indicate to people where
you are.
- Identify yourself
HTTP supports a From field to identify the user who runs the WWW
browser. Use this to advertise your email address,
e.g. "j.smith@somewhere.edu". This will allow server maintainers to
contact you in case of problems, so that you can start a dialogue on
better terms than if you were hard to track down. (A short sketch of
setting these identification fields follows this list.)
- Announce It
Post a message to comp.infosystems.www.providers before running your
robot. If people know in advance they can keep an eye out. I maintain
a list of active Web Wanderers, so that people who wonder about access
from a certain site can quickly check if it is a known robot -- please
help me keep it up-to-date by informing me of any missing ones.
- Announce it to the target
If you are only targeting a single site, or a few, contact its
administrator and inform him/her.
- Be informative
Server maintainers often wonder why their server is hit. If you use
the HTTP Referer field you can tell them. This costs no effort on your
part, and may be informative. (The sketch after this list sets this
field as well.)
- Be there
Don't set your Web Wanderer going and then go on holiday for a couple
of days. If in your absence it does things that upset people, you are
the only one who can fix it. It is best to remain logged in to the
machine that is running your robot, so people can use "finger" and
"talk" to contact you. Suspend the robot when you're not there for a
number of days (e.g. over the weekend); only run it in your presence.
Yes, it may be better for the performance of the machine if you run it
over night, but that implies you don't think about the performance
overhead of other machines. Yes, it will take longer for the robot to
run, but this is more an indication that robots are not the way to do
things anyway than an argument for running it continually; after all,
what's the rush?
- Notify your authorities
It is advisable to tell your system administrator / network provider
what you are planning to do. You will be asking a lot of the services
they offer, and if something goes wrong they will want to hear it from
you first, not from external people.
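As a rough illustration of the identification fields above -- User-agent, From and Referer -- here is a minimal sketch in present-day Python; the library used and the example values are assumptions of the illustration, not part of the original advice:

    import urllib.request

    # Example values -- substitute your own robot name, contact address,
    # and the page on which the link being followed was found.
    ROBOT_NAME = "NottinghamRobot/1.0"
    OWNER_EMAIL = "j.smith@somewhere.edu"

    def fetch(url, referer=None):
        """Retrieve one document while identifying the robot and its owner."""
        request = urllib.request.Request(url)
        request.add_header("User-Agent", ROBOT_NAME)  # identify your Web Wanderer
        request.add_header("From", OWNER_EMAIL)       # identify yourself
        if referer:
            request.add_header("Referer", referer)    # be informative
        with urllib.request.urlopen(request) as response:
            return response.read()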
Test Locally
Don't run repeated tests on remote servers; instead, run a number of
servers locally and use them to test your robot first. When going
off-site for the first time, stay close to home (e.g. start from a
page with local servers). After doing a small run, analyse your
performance and your results, and estimate how they scale up to
thousands of documents. It may soon become obvious you can't cope. (A
sketch of a throwaway local test server follows.)
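A throwaway local server for such testing, assuming a present-day Python installation (an assumption of this sketch, not of the original setup), can be as simple as:

    from http.server import HTTPServer, SimpleHTTPRequestHandler

    # Serve the current directory on http://localhost:8000/ so the robot
    # can be exercised against local copies of documents first.
    HTTPServer(("localhost", 8000), SimpleHTTPRequestHandler).serve_forever()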
Don't hog resources
Robots consume a lot of resources. To minimise the impact, keep
the following in mind:
- Walk, don't run
Make sure your robot runs slowly: although robots can handle hundreds
of documents per minute, this puts a large strain on a server, and is
guaranteed to infuriate the server maintainer. Instead, put a sleep
in, or if you're clever, rotate queries between different servers in a
round-robin fashion (a throttling sketch follows this list).
Retrieving 1 document per minute is a lot better than one per
second. One per 5 minutes is better still. Yes, your robot will take
longer, but what's the rush, it's only a program.
- Use If-modified-since or HEAD where possible
If your application can use the HTTP If-modified-since header or the
HEAD method for its purposes, that gives less overhead than full
GETs. (The request sketch after this list shows both, together with
the Accept field below.)
- Ask for what you want
HTTP has an Accept field in which a browser (or your robot) can
specify the kinds of data it can handle. Use it: if you only analyse
text, specify so. This will allow clever servers to not bother sending
you data you can't handle and have to throw away anyway. Also, make
use of URL suffixes if they're there.
- Ask only for what you want
You can build in some logic yourself: if a link refers to a ".ps",
".zip", ".Z", ".gif" etc., and you only handle text, then don't ask
for it. Although suffixes are not the modern way to do things (Accept
is), there is an enormous installed base out there that uses them
(especially FTP sites). Also look out for gateways (e.g. URLs starting
with finger), News gateways, WAIS gateways etc. And think about other
protocols ("news:", "wais:") etc. Don't forget the sub-page references
(<A HREF="#abstract">) -- don't retrieve the same page more than
once. It's imperative to make a list of places not to visit before you
start... (A small link filter along these lines is sketched after this
list.)
- Check URL's
Don't assume the HTML documents you are going to get back are
sensible. When scanning for URLs, be wary of things like <A
HREF=" http://somehost.somedom/doc>. A lot of sites don't put the
trailing / on URLs for directories; a naive strategy of concatenating
the names of sub-URLs can result in bad names. (See the URL-resolution
sketch after this list.)
- Check the results
Check what comes back. If a server refuses a number of documents in a
row, check what it is saying. It may be that the server refuses to let
you retrieve these things because you're a robot.
- Don't Loop or Repeat
Remember all the places you have visited, so you can check that you're
not looping. Check to see if the different machine addresses you have
are not in fact the same box (e.g. web.nexor.co.uk is the same machine
as "hercules.nexor.co.uk" and 128.243.219.1) so you don't have to go
through it again. This is imperative. (A host-canonicalisation sketch
follows this list.)
- Run at opportune times
On some systems there are preferred times of access, when the machine
is only lightly loaded. If you plan to do many automatic requests from
one particular site, check with its administrator(s) when the
preferred time of access is.
- Don't run it often
How often people find acceptable differs, but I'd say once every two
months is probably too often. Also, when you re-run it, make use of
your previous data: you know which URLs to avoid. Make a list of
volatile links (like the what's new page, and the meta-index). Use
this to get pointers to other documents, and concentrate on new links
-- this way you will get a high initial yield, and if you stop your
robot for some reason at least it has spent its time well.
- Don't try queries
Some WWW documents are searchable (ISINDEX) or contain forms. Don't
follow these. The Fish Search does this, for example, which may result
in a search for "cars" being sent to databases with computer science
PhDs, people in the X.500 directory, or botanical data. Not sensible.
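For "Walk, don't run", a sketch of simple throttling with round-robin rotation between servers; the one-minute delay and the data structure are illustrative choices only:

    import time
    from itertools import cycle

    DELAY_SECONDS = 60  # one document per minute; one per five minutes is better still

    def crawl_round_robin(queues_by_host, fetch):
        """Rotate between per-host URL lists so no single server is hammered."""
        hosts = cycle(list(queues_by_host))
        while any(queues_by_host.values()):
            queue = queues_by_host[next(hosts)]
            if queue:
                fetch(queue.pop(0))        # take the next URL for this host
                time.sleep(DELAY_SECONDS)  # walk, don't run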
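For the two points about asking the server for less (HEAD, If-modified-since and Accept), a sketch; the header names are plain HTTP, but the text-only policy and the stored date are assumptions of the example:

    import urllib.request
    from urllib.error import HTTPError

    def check_before_get(url, last_seen=None):
        """Use HEAD, Accept and If-Modified-Since to avoid needless full GETs."""
        request = urllib.request.Request(url, method="HEAD")
        request.add_header("Accept", "text/html, text/plain")  # only what we analyse
        if last_seen:
            # last_seen is an HTTP date saved from a previous visit,
            # e.g. "Sat, 29 Jan 1994 10:00:00 GMT"
            request.add_header("If-Modified-Since", last_seen)
        try:
            with urllib.request.urlopen(request) as response:
                return response.status, dict(response.getheaders())
        except HTTPError as err:
            if err.code == 304:
                return 304, {}   # unchanged since our last visit; skip the GET
            raise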
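For "Ask only for what you want", a small link filter; the particular suffix list and schemes are examples only:

    from urllib.parse import urldefrag, urlparse

    SKIP_SUFFIXES = (".ps", ".zip", ".z", ".gif")   # formats a text-only robot can't use

    def wanted(url, already_seen):
        """Decide whether a link is worth retrieving at all."""
        url, _fragment = urldefrag(url)             # "#abstract" is the same page
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):   # skip news:, wais:, gateways etc.
            return False
        if parts.path.lower().endswith(SKIP_SUFFIXES):
            return False
        return url not in already_seen              # never ask for the same page twice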
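For "Check URL's", resolving links against the URL the document was served from avoids the trailing-slash trap that naive string concatenation falls into; the helper name is mine:

    from urllib.parse import urljoin

    def resolve(base_url, href):
        """Resolve a link found in a page against the page's own URL.

        urljoin handles missing trailing slashes and relative paths, e.g.
            urljoin("http://somehost.somedom/dir", "doc.html")
                -> "http://somehost.somedom/doc.html"
            urljoin("http://somehost.somedom/dir/", "doc.html")
                -> "http://somehost.somedom/dir/doc.html"
        """
        return urljoin(base_url, href.strip())   # strip stray whitespace in HREFs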
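For "Don't Loop or Repeat", one way (only an illustration) of spotting that two host names are the same box is to compare their resolved addresses; note that on today's Web one address may serve many distinct virtual hosts, so treat the result as a hint:

    import socket
    from urllib.parse import urlparse

    def canonical_host(url):
        """Map a host name to its IP address so aliases of one machine
        (e.g. web.nexor.co.uk vs. hercules.nexor.co.uk) are not crawled twice."""
        host = urlparse(url).hostname
        try:
            return socket.gethostbyname(host)
        except socket.gaierror:
            return host   # unresolvable; fall back to the name itself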
Stay with it
It is vital that you know what your robot is doing, and that it
remains under control.
- Log
Make sure it provides ample logging, and it wouldn't hurt to keep
certain statistics, such as the number of successes/failures, the
hosts accessed recently, and the average size of recent files, and
keep an eye on it. This ties in with the "Don't Loop" section -- you
need to log where you have been to prevent looping. Again, estimate
the required disk space; you may find you can't cope. (A
logging/checkpoint sketch follows this list.)
- Be interactive
Arrange to be able to guide your robot. Commands that suspend or
cancel the robot, or make it skip the current host, can be very
useful. Checkpoint your robot frequently. This way you don't lose
everything if it falls over.
- Be prepared
Your robot will visit hundreds of sites. It will probably upset a
number of people. Be prepared to respond quickly to their enquiries,
and tell them what you're doing.
- Be understanding
If your robot upsets someone, instruct it not to visit his/her site,
or only the home page. Don't lecture him/her about why your cause is
worth it, because they probably aren't in the least interested. If you
encounter barriers that people put up to stop your access, don't try
to go around them to show that in the Web it is difficult to limit
access. I have actually had this happen to me; and although I'm not
normally violent, I was ready to strangle this person as he was
deliberately wasting my time. I have written a standard practice
proposal for a simple method of excluding servers. Please implement
this practice, and respect the wishes of the server maintainers. (A
sketch of checking for such exclusions follows this list.)
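A sketch of the kind of logging and checkpointing meant under "Log" and "Be interactive"; the file names and the particular statistics kept are just examples:

    import json
    import logging

    logging.basicConfig(filename="robot.log", level=logging.INFO)

    def checkpoint(state, path="robot.checkpoint.json"):
        """Save enough state (visited URLs, simple counters) to resume after a crash."""
        with open(path, "w") as handle:
            json.dump({"visited": sorted(state["visited"]),
                       "successes": state["successes"],
                       "failures": state["failures"]}, handle)
        logging.info("checkpoint: %d visited, %d ok, %d failed",
                     len(state["visited"]), state["successes"], state["failures"])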
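The standard practice proposal referred to under "Be understanding" became what is now known as the /robots.txt exclusion method; assuming that is the mechanism in use, a robot can check it before fetching anything from a host:

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def allowed(url, user_agent="NottinghamRobot/1.0"):
        """Check the server's /robots.txt before retrieving a document from it."""
        parts = urlparse(url)
        parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        parser.read()
        return parser.can_fetch(user_agent, url)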
Share results
OK, so you are using the resources of a lot of people to do this.
Do something back:
- Keep results
This may sound obvious, but think about what you are going to do with
the retrieved documents. Try and keep as much info as you can possibly
store. This will make the results optimally useful.
- Raw Result
Make your raw results available, from FTP, or the Web, or
whatever. This means other people can use them, and don't need to run
their own robots.
- Polished Result
You are running a robot for a reason; probably to create a database,
or gather statistics. If you make these results available on the Web,
people are more likely to think it worth it. And you might get in
touch with people with similar interests.
- Report Errors
Your robot might come across dangling links. You might as well publish
them on the Web somewhere (after checking they really are dangling).
If you are convinced they are in error (as opposed to restricted),
notify the administrator of the server.
Examples
This is not intended to be a public flaming forum or a "Best/Worst
Robot" league-table. But it shows the problems are real, and the
guidelines help alleviate them. Heh, maybe a league table isn't too
bad an idea anyway.
Examples of how not to do it
The robot which retrieved the same sequence of about 100 documents
on three occasions in four days. And the machine couldn't be
fingered. The results were never published. Sigh.
The robot run from phoenix.doc.ic.ac.uk in Jan 94. It provides no
User-agent or From fields, one can't finger the host, and it is not
part of a publicly known project. In addition it has been reported to
retrieve documents it can't handle. Has since improved.
The Fish search capability added to Mosaic. One instance managed to
retrieve 25 documents in under one minute.
Better examples
The RBSE-Spider, run in December 93. It had a User-agent field, and
after a finger to the host it was possible to open a dialogue with the
robot writers. Their web server explained the purpose of it.
Jumpstation: the results are presented in a searchable database,
the author announced it, and is considering making the raw
results available. Unfortunately some people complained about the
high rate with which documents were retrieved.
Why?
Why am I rambling on about this? Because it annoys me to see that
people cause other people unnecessary hassle, and the whole
discussion can be so much gentler. And because I run a server that
is regularly visited by robots, and I am worried they could make
the Web look bad.
This page has been contributed to by Jonathon Fletcher, JumpStation
Robot author, Lee McLoughlin (L.McLoughlin@doc.ic.ac.uk), and others.