Most robots (at least the friendly ones) locate URLs by starting with a
seed database and then following the links contained in the HTML - you'd
be surprised at how quickly the URL queue fills up from that alone. The
other way robots get new URLs is when someone registers a URL with the
engine.
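
For anyone who wants to see the shape of it, here is a rough sketch of the seed-and-follow approach in Python. It is an illustration only, not how any particular engine does it. The names Frontier, LinkExtractor and crawl are made up for this example, and fetch is a caller-supplied function (assumed to return the page's HTML as a string, or None on failure) so the sketch does no network access itself. Registration is modeled as simply calling add() on the same queue.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


class Frontier:
    """Hypothetical URL queue that remembers every URL it has seen."""

    def __init__(self, seeds):
        self.queue = deque()
        self.seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Queue a URL once; used for seeds, discovered links and registrations."""
        url, _fragment = urldefrag(url)  # drop "#fragment" so pages are not queued twice
        if url.startswith(("http://", "https://")) and url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)
            return True
        return False


def crawl(frontier, fetch, max_pages=100):
    """Visit queued URLs in FIFO order, queueing the links found on each page."""
    visited = []
    while frontier.queue and len(visited) < max_pages:
        url = frontier.queue.popleft()
        html = fetch(url)  # assumption: returns HTML text, or None on failure
        visited.append(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links against the page they were found on.
            frontier.add(urljoin(url, href))
    return visited
```

A real robot would also check /robots.txt and pause between requests to the same host; both are left out here to keep the sketch short.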
IMO, if a site has no links pointing to it, and it hasn't registered
itself with my engine, then that's enough reason not to index it. I
heard rumors a while back that one of the spiders (Alta Vista?)
port-scanned sequential IP addresses, checking whether anything was
listening on port 80, and then indexed the site if an httpd server was
running. If true, that is VERY impolite.
--
Dave Sifry
http://www.sifry.com
President, Sifry Consulting
(408) 471-0667 (voice)
david@sifry.com
(408) 471-0666 (fax)
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html