Preliminary robot.faq (Please Send Questions or Comments)

Keith Fischer (kfischer@mail.win.org)
Tue, 7 Nov 1995 00:43:47 -0600


Archive-name: robot.faq
Posting-Frequency: variable
Last-modified: Nov. 6, 1995

This article is a description and primer for World Wide Web robots and spiders.

The following topics are addressed:

1) DEFINING ROBOTS AND SPIDERS
1.1) What is a ROBOT?
1.2) What is a SPIDER?
1.3) What is a search engine?
1.4) How many ROBOTS are there?
1.5) What can be achieved by using ROBOTS?
1.6) What harm can a ROBOT do?

2) THE THEORY BEHIND A ROBOT
2.1) Who can write one?
2.2) How is one written?
2.3) What is the Proposed Standard for Robot Exclusion?
2.4) What are the potential problems?
2.5) How do I use proper Etiquette?

3) THE REALITY OF THE WEB
3.1) Can I visit the entire web?

1) DEFINING ROBOTS AND SPIDERS

1.1) What is a ROBOT?

A Robot is a program that traverses the World Wide Web, gathering some
sort of information from each site it visits. This journey is accomplished
by visiting a web page and then recursively visiting all or some of its
linked pages.
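
As a rough sketch of that traversal (not any particular robot's method),
the code below walks a made-up table of pages and links. It uses a queue
rather than literal recursion, which visits the same pages without deep
call stacks. The names TOY_WEB, traverse, and max_pages are assumptions
for illustration; a real robot would fetch each page over the network
where this sketch calls get_links.

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the pages it links to.
TOY_WEB = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/b.html", "/index.html"],  # links back to the start
    "/b.html": ["/c.html"],
    "/c.html": [],
}


def traverse(start, get_links, max_pages=100):
    """Visit pages breadth-first, following links, never revisiting a page."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)  # a real robot would gather information here
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Calling traverse("/index.html", lambda page: TOY_WEB.get(page, [])) visits
each of the four toy pages once, even though /a.html links back to the
start.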

1.2) What is a SPIDER?

Spiders are synonymous with Robots, as are Wanderers. These names,
however, have some misleading implications. For instance, many people think
that a spider or wanderer leaves the home site to work its magic, when in
reality it never leaves. Rather, the Spider just acts as a sophisticated web
browser, automatically retrieving documents and/or images until it is told
to stop. I prefer the term Robot and will continue using it throughout this
document.

1.3) What is a search engine?

A search engine is not a robot. However, some search engines rely heavily
on robots. A search engine is nothing more than a program that searches a
glorified index. The index resides on the host's computer, and the engine
returns matches from it. A common misconception is that a search engine
like Lycos or Yahoo actively searches the web upon request. This is not
true; all activity by the robot is done ahead of time.
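
To make the distinction concrete, here is a minimal sketch of an index
built ahead of time and a search that only consults that index. It is an
assumption-laden toy, not how Lycos or any real engine works: the page
texts, the function names, and the word-splitting rule are all invented
for illustration.

```python
from collections import defaultdict

# Hypothetical page texts, as a robot might have gathered them earlier.
PAGES = {
    "http://example.com/robots.html": "Robots traverse the web.",
    "http://example.com/index.html": "An index of the web, built ahead of time.",
}


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for raw in text.lower().split():
            word = raw.strip(".,;:!?\"'()")
            if word:
                index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing every query word; no network access occurs."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

The point of the design is that search() never touches the web. All the
fetching happened before build_index() was called.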

1.4) How many ROBOTS are there?

There are about 30 in existence. Martijn Koster maintains a list at:

http://info.webcrawler.com/mak/projects/robots/active.html

1.5) What can be achieved by using ROBOTS?

The possibilities are endless. Once you visit a page, you have free run of
its HTML. You can retrieve files or the HTML itself. Most robots retrieve
pieces of the HTML document. These pieces are then used to build an index,
which is later used by a search engine.
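
One plausible choice of "pieces" is the title and the headings. The
sketch below pulls those out with Python's standard html.parser module.
Which pieces a given robot keeps is its own design decision; the class
name PieceExtractor and the choice of title plus h1 to h3 are assumptions
made for this example.

```python
from html.parser import HTMLParser


class PieceExtractor(HTMLParser):
    """Collect the text of <title> and <h1>..<h3> elements."""

    WANTED = {"title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.pieces = []      # list of (tag, text) pairs
        self._current = None  # tag being collected, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace so the text is index-friendly.
            text = " ".join("".join(self._buffer).split())
            self.pieces.append((tag, text))
            self._current = None


def extract_pieces(html):
    parser = PieceExtractor()
    parser.feed(html)
    parser.close()
    return parser.pieces
```

The parser reports tag names in lowercase, so <TITLE> and <title> are
treated alike.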

1.6) What harm can a ROBOT do?

The robot can do no harm per se, but it can anger a lot of people. If your
robot acts irresponsibly, it can fall into a black hole (a link that
dynamically makes new links) or, worse, get stuck in a loop. Both of
these actions are certain to wreak havoc on a server. The goal in web
traversal is to never be on one server for too long.
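
A minimal sketch of some guards against both failures follows. A set of
already-seen URLs stops simple loops, while a depth limit and a per-server
page budget stop black holes, whose links are always new and so never
appear in the seen set. The function name guarded_crawl and the limits
max_depth and max_per_host are assumptions for illustration, and the
default values are arbitrary.

```python
from collections import defaultdict, deque
from urllib.parse import urldefrag, urlsplit


def guarded_crawl(start, get_links, max_depth=3, max_per_host=5):
    """Breadth-first walk that refuses to loop or to camp on one server."""
    start = urldefrag(start).url
    seen = {start}
    per_host = defaultdict(int)  # pages visited so far on each server
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        host = urlsplit(url).netloc
        if per_host[host] >= max_per_host:
            continue  # the budget for this server is spent
        per_host[host] += 1
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links any deeper
        for link in get_links(url):
            link = urldefrag(link).url  # "#fragment" variants are one page
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def black_hole(url):
    """Hypothetical trap: every page links to a brand-new page."""
    return [url + "x"]
```

Run against black_hole, the crawl stops after max_per_host pages instead
of running forever. These limits reduce the damage but do not replace
watching the robot yourself.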

The solution to the problem of bad HTML, or rather of your robot's handling
of bad HTML, is to stay online while it runs. Simply put, never leave your
robot unattended.

2) THE THEORY BEHIND A ROBOT

2.1) Who can write one?

Anyone can write a robot, provided that they have web access. But a word to
the wise: tell your system administrators, because they WILL feel the system
drain and they WILL hear many complaints concerning your activities.

But just because the possibility exists doesn't mean you should take on
this task half-cocked. Before even thinking about coding a robot, do your
research, have an intended goal, and read the following:

The Proposed Standard for Robot Exclusion located at:
http://info.webcrawler.com/mak/projects/robots/norobots.html


The Guidelines for Robot Writers located at:
http://info.webcrawler.com/mak/projects/robots/guidelines.html

Ethical Web Agents located at:
http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html

2.2) How is one written?

A Robot is nothing more than an executable program. It can be in the form
of a script or a binary file. It makes a connection to a web server and
requests a document be sent, much the same way a web browser works. The
difference is in the automation provided by the robot.
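
The exchange can be sketched with nothing but a socket. The code below
builds a plain HTTP/1.0 GET request and, in fetch(), sends it and reads
the reply until the server closes the connection. This is an illustration
under assumptions, not a complete client: the robot name and contact
address are hypothetical, and there is no handling of redirects, errors,
or anything other than plain http URLs.

```python
import socket
from urllib.parse import urlsplit

# Hypothetical robot name and contact address, for illustration only.
USER_AGENT = "ExampleRobot/0.1 (robot-admin@example.com)"


def build_request(url):
    """Return (host, port, request_bytes) for a plain HTTP/1.0 GET."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        f"User-Agent: {USER_AGENT}\r\n"
        "\r\n"
    )
    return host, port, request.encode("ascii")


def fetch(url, timeout=10):
    """Open a TCP connection, send the request, and return the raw reply."""
    host, port, request = build_request(url)
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection: reply complete
                break
            chunks.append(data)
    return b"".join(chunks)
```

Identifying the robot in the User-Agent line lets a server operator see
who is calling and whom to contact.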

2.3) What is the Proposed Standard for Robot Exclusion?

Martijn Koster explains the reason for a robot exclusion standard with the
following: "In 1993 and 1994 there have been occasions where robots have
visited WWW servers where they weren't welcome for various reasons.
Sometimes these reasons were robot specific, e.g. certain robots swamped
servers with rapid-fire requests, or retrieved the same files repeatedly. In
other situations robots traversed parts of WWW servers that weren't
suitable, e.g. very deep virtual trees, duplicated information, temporary
information, or cgi-scripts with side-effects (such as voting)."

The form the robot exclusion standard takes is described in more detail in:

The Proposed Standard for Robot Exclusion located at:
http://info.webcrawler.com/mak/projects/robots/norobots.html
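
In brief, a server publishes a /robots.txt file listing the paths that
robots should stay out of, and a well-behaved robot checks it before
fetching anything else. The sketch below shows such a check using
Python's standard urllib.robotparser module. The file contents, the robot
name, and the helper function allowed() are made up for illustration;
consult the standard itself for the exact rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical /robots.txt contents for an imaginary server.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""


def allowed(robots_txt, user_agent, url):
    """Report whether the given robot may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, a request for http://example.com/cgi-bin/vote is
refused for every robot, while http://example.com/index.html is
permitted.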

2.4) What are the potential problems?

The potential problems can't be listed. The list would be far too big and
unpredictable. The very nature of the World Wide Web is diversity, and this
very diversity makes robot writing both important and increasingly
difficult. There is no one right way to write HTML. Documents can be
written in many ways and in many formats. My suggestion is to get the HTML
specification and practice, practice, practice, making your robot robust.
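
One way to practice is to feed deliberately sloppy markup to your link
extractor and see what comes out. The sketch below uses Python's standard
html.parser module, which tolerates uppercase tags, unquoted attribute
values, and unclosed elements. The class name LinkExtractor, the helper
extract_links(), and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors with a missing or empty href.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value.strip()))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Sloppy sample: uppercase tag, unquoted value, missing </a>, empty href.
SLOPPY = "<A HREF=page2.html>two<a href='/three.html'>three</A><a href=\"\">x</a><a name=top>"
```

Relative links are resolved against the page's own URL, so the robot
always queues absolute addresses.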

2.5) How do I use proper Etiquette?

Etiquette is a very touchy subject. Many people stand in opposition to your
newly written robot. They don't like the idea that their server will be
overrun with seemingly pointless requests. The solution is simple: first,
give them the results. That is, put up the results of your searches for
public consumption. This is the concept of giving back to the community
that provided for you. Not to mention, if a person can use your results,
the robot's requests may seem to have more merit.

Another form of etiquette is slow requests. You've heard the term rapid
fire. This means quick requests (a request every second or so); basically
put, this brings a server to its figurative knees. The solution is to limit
your requests to any given server to one every minute (some say one every
five minutes).
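
A minimal sketch of such a limit keeps a record of when each server was
last contacted and waits out the remainder of the interval before the
next request. The class name PolitenessTimer and its methods are
assumptions for illustration; the 60-second default is the one-per-minute
figure above, and the clock is a parameter so the logic can be tried
without really waiting.

```python
import time
from urllib.parse import urlsplit


class PolitenessTimer:
    """Track when each server was last contacted and how long to wait."""

    def __init__(self, min_delay=60.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self._last_request = {}  # host -> time of the most recent request

    def seconds_to_wait(self, url):
        """Return how long to wait before contacting this URL's server."""
        last = self._last_request.get(urlsplit(url).netloc)
        if last is None:
            return 0.0  # never contacted: no need to wait
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_request(self, url):
        """Note that a request to this URL's server is being made now."""
        self._last_request[urlsplit(url).netloc] = self.clock()

    def wait_turn(self, url):
        """Sleep until the server may be contacted again, then record it."""
        delay = self.seconds_to_wait(url)
        if delay > 0:
            time.sleep(delay)
        self.record_request(url)
```

A robot would call wait_turn(url) just before each fetch. Because the
delay is tracked per server, the robot can still move on to other servers
while one of them is resting.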

More information about etiquette can be found in:

The Guidelines for Robot Writers located at:
http://info.webcrawler.com/mak/projects/robots/guidelines.html

Ethical Web Agents located at:
http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html

3) THE REALITY OF THE WEB

3.1) Can I visit the entire web?

No. So don't try. Keep your goals to a reasonable size.
______________________________________________________________

I disclaim everything. The contents of this article might be totally
inaccurate, inappropriate, misguided, or otherwise perverse - except for my
name (you can probably trust me on that).

Copyright (c) 1995 by Keith D. Fischer, all rights reserved.
This FAQ may be posted to any USENET newsgroup, on-line service, or BBS as
long as it is posted in its entirety and includes this copyright statement.
This FAQ may not be distributed for financial gain.
This FAQ may not be included in commercial collections or compilations
without express permission from the author.
____________________________________________________________
Keith D. Fischer - kfischer@mail.win.org or kfischer@science.smsu.edu

Keith D. Fischer
kfischer@mail.win.org
kdf274s@nic.smsu.edu
