FAQ again.

Martijn Koster (m.koster@webcrawler.com)
Wed, 10 Jan 1996 12:09:45 -0700


Hi all,

I've been getting a lot of robot questions recently, so I decided the time
for a FAQ is now :-) I wrote the material below, and cross-checked it
against Keith Fischer's preliminary FAQ of early November last year; I
think I have addressed most of the questions he proposed.

Pending comments, I'll HTML-ise it and add it to the robot pages this week.

Regards,

______________

WWW Robot Frequently Asked Questions

Last updated: 10 January 1996

Maintained by Martijn Koster <m.koster@webcrawler.com>

Location: http://info.webcrawler.com/mak/projects/robots/faq.html

1) About WWW robots
1.1) What is a WWW robot?
1.2) What is an agent?
1.3) What is a search engine?
1.4) What kinds of robots are there?
1.5) Aren't robots bad for the web?
1.6) Where do I find out more about robots?

2) Indexing robots
2.1) How does a robot decide where to visit?
2.2) How does an indexing robot decide what to index?
2.3) How do I register my page with a robot?

3) For Server Administrators
3.1) How do I know if I've been visited by a robot?
3.2) I've been visited by a robot. Now what?
3.3) A robot is traversing my whole site too fast!
3.4) How do I keep a robot off my server?

4) Robots exclusion standard
4.1) Why do I find entries for /robots.txt in my log files?
4.2) How do I prevent robots scanning my site?
4.3) Where do I find out how /robots.txt files work?
4.4) Will the /robots.txt standard be extended?

5) Availability
5.1) Where can I use a robot?
5.2) Where can I get a robot?
5.3) Where can I get the source code for a robot?
5.4) I'm writing a robot, what do I need to be careful of?
5.5) I've written a robot, how do I list it?

1) About Web Robots
===================

1.1) What is a WWW robot?
-------------------------

A robot is a program that automatically traverses the Web's hypertext
structure by retrieving a document, and recursively retrieving all
documents that are referenced.

Note that "recursive" here doesn't limit the definition to any specific
traversal algorithm; even if a robot applies some heuristic to the
selection and order of documents to visit and spaces out requests
over a long space of time, it is still a robot.

Normal Web browsers are not robots, because they are operated by a human,
and don't automatically retrieve referenced documents (other than
inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers,
or Spiders. These names are a bit misleading as they give the impression
the software itself moves between sites like a virus; this is not the
case: a robot simply visits sites by requesting documents from them.

1.2) What is an agent?
----------------------

The word "agent" is used for lots of meanings in computing these days.
Specifically:

- Autonomous agents are programs that do travel between sites, deciding
themselves when to move and what to do (e.g. General Magic's Telescript).
These can only travel between special servers and are currently not
widespread on the Internet.

- Intelligent agents are programs that help users with things, such as
choosing a product, guiding a user through form filling, or even
helping users find things. These generally have little to do with
networking.

- User-agent is the technical name for a program that performs networking
tasks for a user, such as Web user-agents like Netscape Navigator,
email user-agents like Qualcomm's Eudora, etc.

1.3) What is a search engine?
-----------------------------

A search engine is a program that searches through some dataset. In the
context of the Web, the term "search engine" is most often used for search
forms that search through databases of HTML documents gathered by a robot.

1.4) What kinds of robots are there?
------------------------------------

Robots can be used for a number of purposes:

- Indexing (see section 2)
- HTML validation
- Link validation
- "What's New" monitoring
- Mirroring

See the list of active robots to see what robot does what.
Don't ask me -- all I know is what's on the list...

1.5) Aren't robots bad for the web?
-----------------------------------

There are a few reasons people believe robots are bad for the Web:

- Certain robot implementations can overload networks and servers (and
have done so in the past). This happens especially with people who are
just starting to write a robot; these days there is sufficient
information available on robots to prevent some of these mistakes.

- Robots are operated by humans, who make mistakes in configuration,
or simply don't consider the implications of their actions.
This means people need to be careful, and robot authors need to make
it difficult for people to make mistakes with bad effects.

- Web-wide indexing robots build a central database of documents,
which doesn't scale too well to millions of documents on millions
of sites.

But at the same time the majority of robots are well designed,
professionally operated, cause no problems, and provide a valuable service
in the absence of widely deployed better solutions.

So no, robots aren't inherently bad, nor inherently brilliant,
and need careful attention.

1.6) Where do I find out more about robots?
-------------------------------------------

There is a Web robots home page on:

http://info.webcrawler.com/mak/projects/robots/robots.html

While this is hosted at the site of one of the major robots, it is
an unbiased and reasonably comprehensive collection of information,
maintained by Martijn Koster <m.koster@webcrawler.com>.

Of course the latest version of this FAQ is there.

You'll also find details and an archive of the robots mailing
list, which is intended for technical discussions about robots.

2) Indexing robots
==================

2.1) How does a robot decide where to visit?
--------------------------------------------

This depends on the robot; each one uses different strategies.
In general they start from a historical list of URLs, especially
of documents with many links elsewhere, such as server lists,
"What's New" pages, and the most popular sites on the Web.

Most indexing services also allow you to submit URLs manually,
which will then be queued and visited by the robot.

Sometimes other sources of URLs are used, such as scanners of USENET
postings, published mailing list archives, etc.

Given those starting points, a robot can select URLs to visit
and index, and parse them as a source of new URLs.
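
For illustration only, the core of such a traversal might look like the
Python sketch below. The seed URL, page limit, and helper names are made
up for this example; real robots use more careful strategies, and should
also honour /robots.txt (see section 4) and pace their requests.

  # A minimal breadth-first traversal sketch -- an illustration of
  # "start from known URLs, fetch, extract links, queue the new ones",
  # not how any particular robot works.
  from collections import deque
  from html.parser import HTMLParser
  from urllib.parse import urljoin
  from urllib.request import urlopen

  class LinkParser(HTMLParser):
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              href = dict(attrs).get("href")
              if href:
                  self.links.append(href)

  def crawl(seed_urls, max_pages=10):
      queue = deque(seed_urls)          # the historical list of URLs
      seen = set(seed_urls)
      while queue and max_pages > 0:
          url = queue.popleft()
          max_pages -= 1
          try:
              page = urlopen(url, timeout=10).read()
          except OSError:
              continue                  # skip unreachable documents
          parser = LinkParser()
          parser.feed(page.decode("utf-8", "replace"))
          for link in parser.links:
              absolute = urljoin(url, link)
              if absolute.startswith("http://") and absolute not in seen:
                  seen.add(absolute)    # only queue new HTTP URLs
                  queue.append(absolute)

  # e.g. crawl(["http://example.com/"]) -- a hypothetical seed URL.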

2.2) How does an indexing robot decide what to index?
-----------------------------------------------------

If an indexing robot knows about a document, it may decide to
parse it, and insert it into its database. How this is done
depends on the robot: some robots index the HTML titles, or the first
few paragraphs, or parse the entire HTML and index all words, with
weightings depending on HTML constructs, etc. Some parse the META tag,
or other special hidden tags.
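
As a toy illustration of the "index all words, with weightings" approach,
here is a sketch in Python; the weighting scheme, URL, and names are
invented for the example and don't correspond to any particular robot.

  # Toy word index in which title words carry more weight than body
  # words -- one possible weighting scheme, invented for this example.
  import re
  from collections import defaultdict

  def words(text):
      return re.findall(r"[a-z0-9]+", text.lower())

  def index_document(index, url, title, body, title_weight=5):
      for word in words(title):
          index[word][url] = index[word].get(url, 0) + title_weight
      for word in words(body):
          index[word][url] = index[word].get(url, 0) + 1

  index = defaultdict(dict)             # word -> {url: weight}
  index_document(index, "http://example.com/", "Example Page",
                 "This page is a small example of an indexed page.")
  print(sorted(index["page"].items()))  # the URL, with its weight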

We hope that as the Web evolves, more facilities become available
to efficiently associate meta data such as indexing information
with a document. This is being worked on...

2.3) How do I register my page with a robot?
--------------------------------------------

You guessed it, it depends on the service :-) Most services have
a link to a URL submission form on their search page.

Fortunately you don't have to submit your URL to every service
by hand: Submit-it <URL: http://www.submit-it.com/> will do it for you.

3) For Server Administrators
============================

3.1) How do I know if I've been visited by a robot?
---------------------------------------------------

You can check your server logs for sites that retrieve many
documents, especially in a short time.

If your server supports User-agent logging you can check for
retrievals with unusual User-agent header values.

Finally, if you notice a site repeatedly checking for the file
'/robots.txt' chances are that is a robot too.
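
If you want to automate these checks, a rough sketch follows; it assumes
your server writes an access log in the Common Log Format, and the
filename is just an example.

  # Scan an access log for signs of robots: hosts that fetch
  # /robots.txt, and hosts that make a large number of requests.
  # Filename and log format are assumptions; adjust for your server.
  from collections import Counter

  requests_per_host = Counter()
  robots_txt_hosts = set()

  with open("access_log") as log:
      for line in log:
          fields = line.split()
          if len(fields) < 7:
              continue                  # skip malformed lines
          host, path = fields[0], fields[6]
          requests_per_host[host] += 1
          if path == "/robots.txt":
              robots_txt_hosts.add(host)

  print("Asked for /robots.txt:", sorted(robots_txt_hosts))
  print("Busiest hosts:", requests_per_host.most_common(5))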

3.2) I've been visited by a robot. Now what?
--------------------------------------------

Well, nothing :-) The whole idea is they are automatic; you don't
need to do anything.

If you think you have discovered a new robot (i.e. one that is not
listed on the list of active robots at
<URL: http://info.webcrawler.com/mak/projects/robots/robots.html>),
and it does more than sporadic visits, drop me a line so I can make
a note of it for future reference.
But please don't tell me about every robot that happens to drop by!

3.3) A robot is traversing my whole site too fast!
--------------------------------------------------

This is called "rapid-fire", and people usually notice it if they're
monitoring or analysing an access log file.

First of all, check if it is a problem by checking the load of your
server, monitoring your server's error log, and watching concurrent
connections if you can. If you have a medium or high performance server,
it is quite likely to be able to cope with a load of even several
requests per second, especially if the visits are quick.
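
To put numbers on it, you can count the suspect host's requests per
minute in your access log. A rough sketch, again assuming the Common
Log Format, with a made-up host name and filename:

  # Count requests per minute from one remote host, to see whether a
  # robot really is rapid-firing.
  from collections import Counter

  suspect = "robot.example.com"
  per_minute = Counter()

  with open("access_log") as log:
      for line in log:
          fields = line.split()
          if len(fields) < 4 or fields[0] != suspect:
              continue
          # fields[3] looks like "[10/Jan/1996:12:09:45"; keep
          # everything up to the minute.
          per_minute[fields[3][1:18]] += 1

  for minute, count in sorted(per_minute.items()):
      print(minute, count)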

However you may have problems if you have a low performance site, such as
your own desktop PC or Mac you're working on, or you run low performance
server software, or if you have many long retrievals (such as CGI scripts
or large documents). These problems manifest themselves in refused
connections, a high load, performance slowdowns, or in extreme cases a
system crash.

If this happens, there are a few things you should do. Most importantly,
start logging information: when did you notice, what happened, what do
your logs say, what are you doing in response, etc.; this helps in
investigating the problem later. Secondly, try to find out where the
robot came from, what IP addresses or DNS domains, and see if they
are mentioned in the list of active robots at
<URL: http://info.webcrawler.com/mak/projects/robots/robots.html>.
If you can identify a site this way, you can email the person
responsible, and ask them what's up. If this doesn't help, try their
own site for telephone numbers, or mail postmaster at their domain.

If the robot is not on the list, mail me with all the information you
have collected, including actions on your part. If I can't help, at least
I can make a note of it for others.

3.4) How do I keep a robot off my server?
-----------------------------------------

Read the next section...

4) Robots exclusion standard
============================

4.1) Why do I find entries for /robots.txt in my log files?
-----------------------------------------------------------

They are probably from robots trying to see if you have specified
any rules for them using the Standard for Robot Exclusion;
see question 4.3.

If you don't care about robots and want to prevent the messages
in your error logs, simply create an empty file called robots.txt
in the root level of your server.

Don't put any HTML or English language "Who the hell are you?"
text in it -- it will probably never get read by anyone :-)

4.2) How do I prevent robots scanning my site?
----------------------------------------------

The quick way to prevent robots visiting your site is to put these
two lines into the /robots.txt file on your server:

User-agent: *
Disallow: /

but it's easy to be more selective than that; see question 4.3.

4.3) Where do I find out how /robots.txt files work?
----------------------------------------------------

You can read the whole standard on the Robot Page
<URL: http://info.webcrawler.com/mak/projects/robots/robots.html>
but the basic concept is simple: by writing a structured text
file you can indicate to robots that certain parts of your
server are off-limits to some or all robots. It is best explained
with an example (The vertical bar on the left is not part of the
contents):

| # /robots.txt file for http://webcrawler.com/
| # mail webmaster@webcrawler.com for constructive criticism
|
| User-agent: webcrawler
| Disallow:
|
| User-agent: lycra
| Disallow: /
|
| User-agent: *
| Disallow: /tmp
| Disallow: /logs

The first two lines, starting with '#', specify a comment.

The first paragraph specifies that the robot called 'webcrawler'
has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'lycra'
has all relative URLs starting with '/' disallowed.
Because all relative URLs on a server start with '/',
this means the entire site is closed off.

The third paragraph indicates that all other robots should not
visit URLs starting with /tmp or /logs. Note the '*' is a special
token; it's not a regular expression.

Two common errors:

- Regular expressions are _not_ supported: instead of
'Disallow: /tmp/*' just say 'Disallow: /tmp'.
- You shouldn't put more than one path on a Disallow line (this may
change in a future version of the spec).
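
If you're curious what a robot does with this file on its side, here is
a simplified Python sketch of the prefix matching described above. It is
only an illustration, not a complete implementation of the standard (for
instance, it applies every matching record rather than only the most
specific one).

  # Collect the Disallow prefixes that apply to a given robot name,
  # then test paths against them with simple prefix matching.
  def disallowed_paths(robots_txt, robot_name):
      applies, prefixes = False, []
      for line in robots_txt.splitlines():
          line = line.split("#", 1)[0].strip()
          if not line:
              applies = False           # a blank line ends a record
              continue
          field, _, value = line.partition(":")
          field, value = field.strip().lower(), value.strip()
          if field == "user-agent":
              applies = applies or value in ("*", robot_name)
          elif field == "disallow" and applies and value:
              prefixes.append(value)
      return prefixes

  def allowed(path, prefixes):
      return not any(path.startswith(p) for p in prefixes)

  example = "User-agent: *\nDisallow: /tmp\nDisallow: /logs\n"
  rules = disallowed_paths(example, "anybot")
  print(allowed("/tmp/scratch.html", rules))   # False
  print(allowed("/index.html", rules))         # True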

4.4) Will the /robots.txt standard be extended?
-----------------------------------------------

Probably... there are some ideas floating around. They haven't
made it into a coherent proposal because of time constraints,
and because there is little pressure. Mail suggestions to the
robots mailing list, and check the robots home page for work
in progress.

5) Availability
===============

5.1) Where can I use a robot?
-----------------------------

If you mean a search service, check out the various directory pages
on the Web, such as Netscape's
<URL: http://home.netscape.com/home/internet-directory.html>
or try one of the Meta search services such as
<URL: http://metasearch.com/>

5.2) Where can I get a robot?
-----------------------------

Well, you can have a look at the list of robots; I'm slowly starting
to indicate their public availability.

In the meantime, two indexing robots that you should be able to
get hold of are Harvest (free), and Verity's.

5.3) Where can I get the source code for a robot?
-------------------------------------------------

See 5.2 -- some may be willing to give out source code.

5.4) I'm writing a robot, what do I need to be careful of?
----------------------------------------------------------

Lots. First read through all the stuff on the robot page
http://info.webcrawler.com/mak/projects/robots/robots.html
then read the proceedings of past WWW Conferences, and the
complete HTTP and HTML spec. Yes; it's a lot of work :-)

5.5) I've written a robot, how do I list it?
---------------------------------------------

Simply fill in
http://info.webcrawler.com/mak/projects/robots/form.html
and mail the result to Martijn Koster <m.koster@webcrawler.com>
with a subject of "Addition to the list of robots".

THE END

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html