robot algorithm ?

Otis Gospodnetic (otisg@panther.middlebury.edu)
Sat, 9 Nov 1996 18:32:32 -0500 (EST)


Hi,

I don't know why I even started thinking about this, but my brain got tired of
trying to think of the algorithm Web robots/spiders/crawlers use in order to
visit and index the max. number of pages, avoid visits to the same page more
than once, visit the max. number of pages in the shortest possible period of
time, yet at the same time be careful about visiting the same server too
often, and so on.

This is the only thing I could think of. I'd be curious to hear from people
who have experience in writing robots to see how far off I am from the
real-world robots' algorithms :)

here is the alg.:

give the robot the start URL
URL visited before ?
if no :
index it
add it to database
extract URLs from it
put extracted URLs in the queue for later visits
if yes:
has the page been updated since last visit ?
if no:
go to next URL
if yes:
index it
add it to database
extract URLs from it
put extracted URLs in the queue for
later visits

queue of URLs to visit empty ?
if no:
pull out the first URL
can we request this URL now ? (being polite to servers)
if no:
go to "pull out the first URL" to get
the next URL, just so we don't have to
sit and wait until we can request the
URL in question
do something with previous URL
(put at the end of queue maybe ?)
(or try it next ?)
if yes:
go to "URL visited before ?"
if yes:
no more URLs to visit, unlikely to happen, exit.

Hm, it looks quite simple. So how far off am I from the actual algorithm
robots like Infoseek's, Webcrawler's, etc. robots use ?

Let me ask one more thing.

say there is a page and in it a few URLs: A, B, C, and D.

if the robots visits that page, it detects those URLs. What is its next step
? Does it follow URL A, gets all URLs from that page, follows the first URL,
and so on ?
Or does it visit A, gets all URLs from that page, then visit B, gets all URLs
from there, same for C, and D, and then goes on to visit URLs it got, starting
from URLs extracted from A, then B, C, and D. ?

Uh, I hope I managed to explain this clearly.

If anybody knows what I'm talking about and knows the answer(s), I'm all
ears...eyes, whatever.

Thanks in advance,

Otis