Re: Crawling & DNS issues

David L. Sifry (david@sifry.com)
Tue, 21 Jan 1997 23:17:08 -0800


Neil Cotty wrote:
>
> How do web crawlers find new sites? Is it purely from existing
> references in HTML documents? What if a site has no listing anywhere
> else on the Internet, how would a crawler ever find this site if it
> couldn't locate the new domain from a name server?
>
They wouldn't.

Most robots (at least the friendly ones) locate URLs by starting with a
seed database and then following the links contained in the HTML - you'd
be surprised at how quickly the URL queue fills up from that alone. The
other way robots get new URLs is when someone registers a URL with the
engine.
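
To make the seed-and-follow idea concrete, here is a rough sketch in
Python. It is not how any particular engine works; the names (crawl,
extract_links, PAGES) and the example.* hosts are made up for
illustration, and the fetch function is passed in so the sketch runs
without touching the network. A real robot would also need politeness
delays and /robots.txt checks, which are left out here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) URLs linked from html, fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) is a caller-supplied function (an assumption of this
    sketch) that returns the page's HTML, or None if it is unavailable.
    """
    queue = deque()
    seen = set()
    for url in seeds:          # registered or seed URLs go in first
        if url not in seen:
            seen.add(url)
            queue.append(url)

    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:   # only queue URLs not seen before
                seen.add(link)
                queue.append(link)
    return visited


# A toy "web" standing in for real HTTP fetches (hypothetical hosts).
PAGES = {
    "http://a.example/": '<a href="/about.html">About</a> '
                         '<a href="http://b.example/">B</a>',
    "http://a.example/about.html": '<a href="/">Home</a>',
    "http://b.example/": "no links here",
    "http://orphan.example/": '<a href="http://a.example/">A</a>',
}

if __name__ == "__main__":
    print(crawl(["http://a.example/"], PAGES.get))
```

In the toy data, orphan.example links out but nothing links to it, so a
crawl seeded only with a.example never reaches it. Adding it to the
seed list, which is what registering a URL amounts to, is the only way
it gets picked up.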

IMO, if a site has no links pointing to it, and it hasn't registered
itself with my engine, then that's enough reason not to index it. I
heard rumors a while back about one of the spiders (Alta Vista?) that
did port scans on sequential IP addresses, checking if there was anyone
listening on port 80, and then it went and indexed the site if an httpd
server was running, but this is VERY impolite.

-- 
Dave Sifry 				http://www.sifry.com
President, Sifry Consulting		(408) 471-0667 (voice)
david@sifry.com				(408) 471-0666 (fax)
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html