Re: depth first vs breadth first

David Eichmann (eichmann@ricis.cl.uh.edu)
Thu, 29 Aug 1996 23:41:11 -0800


Sorry about the preceeding dummy send...

-->Robert Nicholson wrote:
-->>
-->>Hi, I was having a discussion with a work collegue about robots and
-->>implementing them in a depth or breadth first fashion.
-->>
-->>I seem to remember an early version of the "Guidelines for Robot Writers"
-->>page discouraging depth first and I'm wondering why.
-->>
-->>Would somebody please explain to me why it is "unfriendly" to implement
-->>a depth first traversal. I know this effects the stack locally but I don't
-->>quite appreciate how it effects the remote sites I'm visiting.
-->>
-->>NOTE: I don't intend to write a robot I'm just interested in the area :-)
-->
--> Depth-first traversals hit the same site over and over again. It's
-->considered more-friendly to develop a list of sites you wish to visit
--and then
-->visit them a page-at-a-time in turn. That way, the load from your robot is
-->spread out over time.

Depth-first traversals *have the potential* for hitting the same site
repeatedly. [I knew having a nit-picky advisor was going to come home to
roost some day...] The true impact that a given robot will have on an
arbitrary server will be a function of:

* the connectivity of the pages accessed on that server (if the first hit
on a given server contains only links to pages on other servers, even a
depth-first traversal will immediately shift to a different server);

* the robot's pacing scheme (our engine will only access a given server
once every x minutes, even if all candidate URLs are on that server);

* the robot's adherance to the exclusion standard (thereby avoiding
infinitely recursive URL chains that the server administrator cares to warn
the Web about); and

* the traversal algorithm.

An interesting intermediate approach that we've had some interesting
results with is the "shortest URL for this site that we haven't indexed
yet" scheme.

- Dave

-----------
David Eichmann
RBSE Director of R & D Web: http://ricis.cl.uh.edu/eichmann/
Chair, Software Engineering Program Phone: (713) 283-3875
University of Houston - Clear Lake fax: (713) 283-3869
2700 Bay Area Blvd. Email: eichmann@rbse.jsc.nasa.gov
Houston, TX 77058 or: eichmann@cl.uh.edu
RBSE on the Web: http://rbse.jsc.nasa.gov/rbse/