The verdict will probably always be out on the etiquette of hitting 'unlisted' sites
with robots. I use caller ID to screen trash calls on my phone. If I had a 'private'
web site for invited folks only, I would put it on a non-standard port. Indexing a
non-published, non-80 HTTP server can probably be considered rude. How many folks
have such sites, and why? There are always passwords. Plus, how long can a site stay
free of external references? Currently, the idea of an 'unlisted' URL is flimsy.
David is right about indexing, but there is a limit to one's indignation over
the 'public' accessing a 'public' server on a 'public' packet-switched network. There are
just too many ways to put information on a public network for private use.
If robots as a rule used a random method instead of a 'sequential' method of hitting
URLs, ALL of the concerns about bandwidth or compute overload due to robots would go away.
What webmaster would be concerned with a few hundred hits scattered over a 24-hour
period?
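To make the random-order idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `randomized_schedule`, its parameters, and the example URLs are hypothetical, and a real robot would also need per-host rate limits and robots.txt handling.

```python
import random

def randomized_schedule(urls, window_seconds=24 * 60 * 60, seed=None):
    """Return (offset_seconds, url) pairs in random order, spread over a window.

    Hypothetical helper: instead of walking a site's URLs one after another,
    each URL gets a random fetch time somewhere inside the window.
    """
    rng = random.Random(seed)      # a fixed seed makes the plan repeatable
    shuffled = list(urls)
    rng.shuffle(shuffled)          # break up the sequential order
    # One random offset per URL, sorted so the plan can be played back in time order.
    offsets = sorted(rng.uniform(0, window_seconds) for _ in shuffled)
    return list(zip(offsets, shuffled))
```

For example, `randomized_schedule(["http://example.com/a", "http://example.com/b"])` yields two fetch times scattered across 24 hours rather than two back-to-back requests.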
Be careful about searching out mailing lists. A few misdirected emails to
an entire list instead of the list administrator might quickly get you many messages
in your in-box.
The information you are looking for cannot be gleaned from DNS, nor necessarily
by scanning web pages.
Bryan Hackney
bbh@xenodata.com
----- Begin Included Message -----
From: "David L. Sifry" <david@sifry.com>
Subject: Re: Crawling & DNS issues
Neil Cotty wrote:
>
> How do web crawlers find new sites ? Is it purely from existing
> references in HTML documents ? What if a site has no listing anywhere
> else on the Internet, how would a crawler ever find this site if it
> couldn't locate the new domain from a name server ?
>
They wouldn't.
Most robots (at least the friendly ones) locate URLs by starting with a
seed database and then following the links contained in the HTML - you'd
be surprised at how quickly the URL queue fills up from that alone. The
other way robots get new URLs is when someone registers a URL with the
engine.
IMO, if a site has no links pointing to it, and it hasn't registered
itself with my engine, then that's enough reason not to index it. I
heard rumors a while back about one of the spiders (Alta Vista?) that
did port scans on sequential IP addresses, checking if there was anyone
listening on port 80, and then it went and indexed the site if an httpd
server was running, but this is VERY impolite.
--
Dave Sifry
http://www.sifry.com
President, Sifry Consulting
(408) 471-0667 (voice)
david@sifry.com
(408) 471-0666 (fax)
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail to
robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html
----- End Included Message -----
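The seed-and-follow approach described in the included message can be sketched roughly as follows. This is an illustration, not any particular engine's implementation: the names `LinkExtractor` and `crawl`, the `fetch` callable, and the `limit` parameter are all assumptions made for the example. `fetch` is expected to return the HTML for a URL, or None if the page cannot be retrieved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags from one HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, limit=100):
    """Breadth-first walk: start from the seed URLs and follow links only."""
    queue = deque(seeds)
    seen = set(seeds)              # every URL ever queued, to avoid repeats
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:           # page unavailable, nothing to follow
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

A page that no seed links to, directly or indirectly, never enters the queue, which is why an unreferenced and unregistered site stays invisible to a robot built this way.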